I am looking for some input on how to scale out a Windows Service that is currently running at my company. We are using .NET 4.0 (can and will be upgraded to 4.5 at some point in the future) and running this on Windows Server 2012.
About the service
The service's job is to query for new rows in a logging table (We're working with an Oracle database), process the information, create and/or update a bunch of rows in 5 other tables (let's call them Tracking tables), update the logging table and repeat.
The logging table holds large amounts of XML (up to 20 MB per row) which needs to be selected and saved into the other 5 Tracking tables. New rows are added constantly, at a maximum rate of 500,000 rows an hour.
The Tracking tables' traffic is much higher, ranging from 90,000 new rows in the smallest one to potentially millions of rows in the largest table, each hour. Not to mention that there are Update operations on those tables as well.
About the data being processed
I feel this bit is important for finding a solution based on how these objects are grouped and processed. The data structure looks like this:
public class Report
{
public long Id { get; set; }
public DateTime CreateTime { get; set; }
public Guid MessageId { get; set; }
public string XmlData { get; set; }
}
public class Message
{
public Guid Id { get; set; }
}
Report is the logging data I need to select and process
For every Message there are on average 5 Reports. This can vary from 1 to hundreds in some cases.
Message has a bunch of other collections and other relations, but they are irrelevant to the question.
Today the Windows Service we have barely manages the load on a 16-core server (I don't remember the full specs, but it's safe to say this machine is a beast). I have been tasked with finding a way to scale out and add more machines that will process all this data and not interfere with the other instances.
Currently each Message gets its own thread, which handles the relevant reports. We handle reports in batches, grouped by their MessageId, to keep the number of DB queries to a minimum when processing the data.
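The batching described above can be sketched with LINQ's GroupBy, using the Report class from the question; the processBatch delegate is a placeholder for the real tracking-table work, not part of the actual service:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

static class ReportDispatcher
{
    // Batch pending reports by MessageId so each message's reports are
    // processed together with a minimum of DB round-trips.
    public static void Dispatch(IEnumerable<Report> pending, Action<Guid, List<Report>> processBatch)
    {
        var batches = pending
            .GroupBy(r => r.MessageId)
            .Select(g => new { MessageId = g.Key, Reports = g.OrderBy(r => r.CreateTime).ToList() });

        // Tasks instead of one dedicated thread per message: the thread pool
        // caps actual concurrency rather than spawning hundreds of threads.
        Parallel.ForEach(batches, b => processBatch(b.MessageId, b.Reports));
    }
}
```

Parallel.ForEach is one option for replacing thread-per-message; a bounded producer/consumer queue would serve equally well.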
Limitations
At this stage I am allowed to re-write this service from scratch using any architecture I see fit.
Should an instance crash, the other instances need to be able to pick up where the crashed one left off. No data can be lost.
This processing needs to be as close to real-time as possible from the reports being inserted into the database.
I'm looking for any input or advice on how to build such a project. I assume the services will need to be stateless, or is there a way to synchronize the caches of all the instances somehow? How should I coordinate between all the instances and make sure they're not processing the same data? How can I distribute the load equally between them? And of course, how do I handle an instance crashing without completing its work?
For your work items, Windows Workflow is probably your quickest means to refactor your service.
Windows Workflow Foundation # MSDN
The most useful thing you'll get out of WF is workflow persistence, where a properly designed workflow may resume from a Persist point, should anything happen to the workflow from the last point at which it was saved.
Workflow Persistence # MSDN
This includes the ability for a workflow to be recovered from another process should any other process crash while processing the workflow. The resuming process doesn't need to be on the same machine if you use the shared workflow store. Note that all recoverable workflows require the use of the workflow store.
For work distribution, you have a couple options.
A service to produce messages, combined with host-based load balancing, via workflow invocation over WCF endpoints using the WorkflowService class. Note that you'll probably want to use the design-mode editor here to construct entry methods, rather than manually setting up Receive activities and their corresponding SendReply handlers (these map to WCF methods). You would likely call the service for every Message, and perhaps also for every Report. Note that the CanCreateInstance property is important here: every invocation of a Receive that has it set will create a running instance that runs independently.
WorkflowService Class (System.ServiceModel.Activities) # MSDN
Receive Class (System.ServiceModel.Activities) # MSDN
Receive.CanCreateInstance Property (System.ServiceModel.Activities) # MSDN
SendReply Class (System.ServiceModel.Activities) # MSDN
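As a rough, code-based sketch of the option above (this is normally done in the designer; the contract name, operation name, and service name here are illustrative, not part of any real contract):

```csharp
using System.Activities;
using System.Activities.Statements;
using System.ServiceModel.Activities;

static class MessageWorkflowFactory
{
    // A minimal workflow service: a Receive that can create a new workflow
    // instance per call, paired with its SendReply.
    public static WorkflowService BuildService()
    {
        var receive = new Receive
        {
            ServiceContractName = "IMessageProcessor",  // illustrative name
            OperationName = "ProcessMessage",           // illustrative name
            CanCreateInstance = true  // each call spins up an independent instance
        };

        return new WorkflowService
        {
            Name = "MessageProcessorService",
            Body = new Sequence
            {
                Activities =
                {
                    receive,
                    // ... the actual Report-processing activities go here ...
                    new SendReply { Request = receive }
                }
            }
        };
    }
}
```

Hosted (for example via WorkflowServiceHost), each incoming WCF call to ProcessMessage starts its own persistable workflow instance.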
Use a service bus with queue support. At a minimum, you want something that accepts input from any number of clients and whose outputs can be uniquely identified and handled exactly once. A few that come to mind are NServiceBus, MSMQ, RabbitMQ, and ZeroMQ. Of the items mentioned here, NServiceBus is the only one that is .NET-ready out-of-the-box. In a cloud context, your options also include platform-specific offerings such as Azure Service Bus and Amazon SQS.
NServiceBus
MSMQ # MSDN
RabbitMQ
ZeroMQ
Azure Service Bus # MSDN
Amazon SQS # Amazon AWS
Note that the service bus is just the glue between a producer that will initiate Messages and a consumer that can exist on any number of machines to read from the queue. Similarly, you can use this indirection for Report generation. Your consumer will create workflow instances that may then use workflow persistence.
Windows AppFabric may be used to host workflows, allowing you to apply many of the techniques used for IIS load balancing to distribute your work. I don't personally have any experience with it, so there's not much I can say for it other than that it has good monitoring support out-of-the-box.
How to: Host a Workflow Service with Windows App Fabric # MSDN
I solved this by coding all the scalability and redundancy logic myself. I will explain what I did and how I did it, should anyone ever need it.
I created a few processes in each instance to keep track of the others and to know which records the particular instance may process. On startup, the instance registers itself in the database (if it isn't registered already) in a table called Instances. This table has the following columns:
Id Number
MachineName Varchar2
LastActive Timestamp
IsMaster Number(1)
After registering (creating a row in this table if the instance's MachineName wasn't found), the instance starts pinging this table every second on a separate thread, updating its LastActive column. It then selects all rows from the table and makes sure that the Master Instance (more on that later) is still alive, meaning its LastActive time falls within the last 10 seconds. If the master instance has stopped responding, the instance assumes control and sets itself as master. On the next iteration it verifies that there is only one master (in case another instance decided to assume control simultaneously); if not, it yields to the instance with the lowest Id.
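The heartbeat/election loop described above looks roughly like this. The delegates stand in for the actual SQL against the Instances table (e.g. `UPDATE Instances SET LastActive = SYSTIMESTAMP WHERE Id = :id`); InstanceRow mirrors its columns:

```csharp
using System;
using System.Linq;
using System.Threading;

class InstanceRow { public int Id; public DateTime LastActive; public bool IsMaster; }

class InstanceMonitor
{
    readonly int myId;
    readonly Func<InstanceRow[]> loadInstances;  // SELECT * FROM Instances
    readonly Action<int> touchLastActive;        // update my LastActive timestamp
    readonly Action<int, bool> setMaster;        // set/clear my IsMaster flag

    public InstanceMonitor(int myId, Func<InstanceRow[]> load, Action<int> touch, Action<int, bool> setMaster)
    { this.myId = myId; loadInstances = load; touchLastActive = touch; this.setMaster = setMaster; }

    public void Run(CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            touchLastActive(myId);

            // "Alive" = pinged within the last 10 seconds.
            var alive = loadInstances()
                .Where(i => i.LastActive > DateTime.UtcNow.AddSeconds(-10))
                .OrderBy(i => i.Id)
                .ToArray();

            var masters = alive.Where(i => i.IsMaster).ToArray();
            if (masters.Length == 0)
                setMaster(myId, true);   // master died: take over
            else if (masters.Length > 1 && masters.Any(m => m.Id == myId)
                     && myId != masters.Min(m => m.Id))
                setMaster(myId, false);  // simultaneous takeover: yield to lowest Id

            Thread.Sleep(1000);          // ping interval
        }
    }
}
```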
What is the master instance?
The service's job is to scan a logging table and process that data so people can filter and read through it easily. I didn't state this in my question, but it might be relevant here. We have a bunch of ESB servers writing multiple records to the logging table per request, and my service's job is to keep track of them in near real-time. Since the servers write their logs asynchronously, I could potentially see the "finished processing request A" entry in the log before the "started processing request A" entry. So I have some code that sorts those records and makes sure my service processes the data in the correct order. Because I needed to scale out this service, only one instance may perform this logic, to avoid lots of unnecessary DB queries and possibly insane bugs.
This is where the Master Instance comes in. Only it executes this sorting logic, and it temporarily saves the log record Ids in another table called ReportAssignment. This table's job is to keep track of which records were processed and by whom. Once processing is complete, the record is deleted. The table looks like this:
RecordId Number
InstanceId Number Nullable
The master instance sorts the log entries and inserts their Ids here. All my service instances check this table at 1-second intervals for new records that aren't being processed by anyone, or that are being processed by an inactive instance, and for which [record's Id] % [number of instances] == [index of current instance in a sorted array of all the active instances] (acquired during the pinging process). The query looks roughly like this:
SELECT * FROM ReportAssignment
WHERE (InstanceId IS NULL OR InstanceId NOT IN (1, 2, 3)) -- 1, 2, 3 are the active instances
  AND MOD(RecordId, 3) = 0 -- 0 is the index of the current instance in the list of active instances
Why do I need to do this?
The other two instances would query for RecordId % 3 == 1 and RecordId % 3 == 2.
RecordId % [instanceCount] == [indexOfCurrentInstance] ensures that the records are distributed evenly between all instances.
InstanceId NOT IN (1,2,3) allows the instances to take over records that were being processed by an instance that crashed, and not process the records of already active instances when a new instance is added.
Once an instance queries for these records, it executes an update command, setting InstanceId to its own Id, and then queries the logging table for records with those Ids. When processing is complete, it deletes the records from ReportAssignment.
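The claim step can be made safe against two instances racing for the same record by re-checking the ownership condition inside the UPDATE itself, so whoever updates first wins. A sketch using the generic ADO.NET interfaces and Oracle-style `:name` bind parameters (three active instances hard-coded, matching the example above):

```csharp
using System;
using System.Data;

static class ReportClaimer
{
    // 0 rows affected means another instance claimed the record first.
    const string ClaimSql = @"
        UPDATE ReportAssignment
           SET InstanceId = :me
         WHERE RecordId = :recordId
           AND (InstanceId IS NULL OR InstanceId NOT IN (:a1, :a2, :a3))";

    public static bool TryClaim(IDbConnection conn, int me, long recordId, int[] activeIds)
    {
        using (var cmd = conn.CreateCommand())
        {
            cmd.CommandText = ClaimSql;
            AddParam(cmd, "me", me);
            AddParam(cmd, "recordId", recordId);
            for (int i = 0; i < 3; i++)
                AddParam(cmd, "a" + (i + 1), activeIds[i]);
            return cmd.ExecuteNonQuery() == 1;  // claimed only if we updated the row
        }
    }

    static void AddParam(IDbCommand cmd, string name, object value)
    {
        var p = cmd.CreateParameter();
        p.ParameterName = name;
        p.Value = value;
        cmd.Parameters.Add(p);
    }
}
```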
Overall I am very pleased with this. It scales nicely, ensures that no data is lost should the instance go down, and there were nearly no alterations to the existing code we have.
Explanation:
I am developing a simple car business system and I have to implement the following feature:
A very special car model is delivered to a shop. There are a lot of people on the waiting list for exactly this model.
When the car arrives, the first client receives the right to buy it; he or she has 24 hours to use this opportunity.
I have a special state in the DB that determines whether the user is on the waiting list (I store the exact position as well) or can use the opportunity to buy the car. Whenever the car arrives, I run a method that changes the state of the first client on the waiting list. And here comes the problem:
Problem:
The client can use his opportunity, during the 24 hours period. But I have to check at the end, if he/she has bought the car. For this reason, I have to schedule a method to run in 24 hours.
Possible solution:
I am thinking about two things. First is using a job scheduler like Hangfire. The problem is that since I do not have any other jobs in my app, I do not want to include a whole package for such a small thing. Second is making the checking method asynchronous and having the thread sleep for 24 hours before proceeding (I do not feel comfortable working with threads; this is just an idea). I got the idea from this article. Keep in mind that more than one car can arrive at more than one shop. Does that mean I should use many threads, and how is that going to affect the performance of the system?
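The second idea need not block a thread at all: an awaited delay holds no thread while waiting. A minimal sketch (assuming .NET 4.5+; the method and parameter names are made up for illustration):

```csharp
using System;
using System.Threading.Tasks;

static class PurchaseDeadline
{
    // Schedule the check without blocking a thread for 24 hours. Caveat: a
    // pending Task.Delay does not survive an application restart, so the
    // deadline should also be persisted in the DB and re-scheduled on startup.
    public static async Task ScheduleCheckAsync(int clientId, DateTime deadlineUtc,
                                                Func<int, Task> checkPurchase)
    {
        var wait = deadlineUtc - DateTime.UtcNow;
        if (wait > TimeSpan.Zero)
            await Task.Delay(wait);      // no thread is held during the wait
        await checkPurchase(clientId);   // did the client buy the car in time?
    }
}
```

Many cars arriving at many shops then means many pending awaits, not many threads, which keeps the resource cost low.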
Question:
Which of the two solutions is better?
Is there another possibility that you can suggest in this particular case?
I agree. Importing a package for only one job, if you aren't going to use it for many jobs, is a bit of overkill.
If you are running SQL Server, I'd recommend writing a .NET console application and running it on a schedule using the SQL Server Agent. If you have stored procedures that need to run, you also have the option of running them directly from the SQL job if for some reason you don't want to run them from your .NET application.
Since it sounds like you need this to run on a data-driven schedule, you might consider adding a trigger that fires whenever that "special" car is inserted into the database. MSDN SQL Job using Trigger
I've done something similar to this where every morning, an hour prior to business hours starting, I run a .NET executable that checks the latest record in table A and compares it to a value in table B and determines if the record in table A needs to be updated.
I also use SQL Server to run jobs that send emails on a schedule based on data that has been added or modified in a database.
There are advantages to using SQL Server to run your jobs, as there are many options available for notifying you of events, retrying failed jobs, logging, and job history. You can specify any type of schedule, from repeating frequently to running only once a week.
Using SQL Server 2012, ASP.NET 4.6.2, and EF6. I have a table of URLs. Each URL has to go through a number of third-party processes via API calls, with the state reflected in that table. I'm planning to use scheduled background processes of some sort to kick those processes off. I've come up with a structure like:
Id int (PK)
SourceUrl varchar(200)
Process1Status int
Process2Status int
When rows go into the table, Status flags will be 0 for AwaitingProcessing. 1 would mean InProgress, and 2 would be Complete.
To ensure the overall processing is quicker, I want to run these two processes in parallel. In addition, there may be multiple instances of each of these background processors picking up urls from the queue.
I'm new to multithreaded processing, though, so I'm a bit concerned that there will be some conflicting processing going on.
What I want is to ensure that no Process1 runner selects the same row as another Process1 runner, by having each runner take only one item and flag it as currently in progress. I'd also like to ensure that when the separate third-party services call back to the notification URLs, no status update is lost if two processes attempt to update Process1Status and Process2Status at the same time.
I've seen two possible relevant answers: How can I lock a table on read, using Entity Framework?
and: Get "next" row from SQL Server database and flag it in single transaction
But I'm not much clearer about which route I should take for my needs. Could someone point me in the right direction? Am I on the right track?
If by design multiple actors need access to the same row of data, I would split the data to avoid this situation.
My first thought is to suggest building a UrlProcessStatus table with URLId, ProcessId, and Status columns. This way the workers can read/write their data independently.
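Whichever table layout you choose, the "take only one item and flag it" step can be done as a single atomic statement. A sketch assuming the Urls table from the question, using SqlClient and the T-SQL UPDLOCK/READPAST hints:

```csharp
using System;
using System.Data.SqlClient;

static class UrlClaimer
{
    // UPDLOCK takes the row lock inside the UPDATE itself, and READPAST makes
    // competing runners skip rows that are already locked, so two Process1
    // runners can never claim the same row. OUTPUT returns what was claimed.
    const string ClaimSql = @"
        UPDATE TOP (1) Urls WITH (UPDLOCK, READPAST)
           SET Process1Status = 1          -- InProgress
        OUTPUT inserted.Id, inserted.SourceUrl
         WHERE Process1Status = 0;         -- AwaitingProcessing";

    public static void ClaimAndProcessNext(string connectionString, Action<int, string> process)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(ClaimSql, conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                if (reader.Read())
                    process(reader.GetInt32(0), reader.GetString(1));
                // No row: the queue is empty, or every pending row is locked.
            }
        }
    }
}
```

A second runner for Process2 would use the same pattern against Process2Status, so the two pipelines never touch each other's flag column in the same statement.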
I'm attempting to improve query performance for an application and I'm logically stuck.
So the application is proprietary and thus we're unable to alter application-side code. We have, however, received permission to work with the underlying database (surprisingly enough). The application calls a SQL Server database, so the current idea we're running with is to create a view with the same name as the table and rename the underlying table. When the application hits the view, the view calls one of two SQL CLR functions, which both do nothing more than call a web service we've put together. The web service performs all the logic, and contains an API call to an external, proprietary API that performs some additional logic and then returns the result.
This all works; however, we're having serious performance issues when scaling up to large data sets (100,000+ rows). The clear source of this is the fact that we have to work one row at a time with the web service, which includes the API call, and that makes for a lot of latency overhead.
The obvious solution to this is to figure out a way to limit the number of times that the web service has to be hit per query, but this is where I'm stuck. I've read about a few different ways out there for potentially handling scenarios like this, but as a total database novice I'm having difficulty getting a grasp on what would be appropriate in this situation.
If there are any ideas/recommendations out there, I'd be very appreciative.
There are probably a few things to look at here:
Is your SQLCLR TVF streaming the results out (i.e. are you adding to a collection and then returning that collection at the end, or are you releasing each row as it is completed -- either with yield return or building out a full Enumerator)? If not streaming, then you should do this as it allows for the rows to be consumed immediately instead of waiting for the entire process to finish.
Since you are replacing a Table with a View that is sourced by a TVF, you are naturally going to have performance degradation since TVFs:
don't report their actual number of rows. T-SQL Multi-statement TVFs always appear to return 1 row, and SQLCLR TVFs always appear to return 1000 rows.
don't maintain column statistics. When selecting from a Table, SQL Server will automatically create statistics for columns referenced in WHERE and JOIN conditions.
Because of these two things, the Query Optimizer is not going to have an easy time generating an appropriate plan if the actual number of rows is 100k.
How many SELECTs, etc. are hitting this View concurrently? Since the View hits the same URI each time, you are bound by the concurrent connection limit imposed by ServicePointManager (ServicePointManager.DefaultConnectionLimit). And the default limit is a whopping 2! Meaning, all additional requests to that URI, while there are already 2 active/open HttpWebRequests, will wait in line, patiently. You can increase this by setting the ServicePoint.ConnectionLimit property of the HttpWebRequest object.
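Concretely, that limit can be raised either process-wide or per request (the URL below is only a placeholder):

```csharp
using System.Net;

class ConnectionLimitSetup
{
    static void Main()
    {
        // Raise the process-wide per-URI default (2). Set this before any
        // requests are created; existing ServicePoints keep their old limit.
        ServicePointManager.DefaultConnectionLimit = 50;

        // Or per request, via the request's ServicePoint.
        var request = (HttpWebRequest)WebRequest.Create("http://example.com/service");
        request.ServicePoint.ConnectionLimit = 50;
    }
}
```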
How often does the underlying data change? Since you switched to a View, that doesn't take any parameters, so you are always returning everything. This opens the door for doing some caching, and there are two options (at least):
cache the data in the Web Service, and if it hasn't reached a particular time limit, return the cached data, else get fresh data, cache it, and return it.
go back to using a real Table. Create a SQL Server Agent job that will, every few minutes (or maybe longer if the data doesn't change that often): start a transaction, delete the current data, repopulate via the SQLCLR TVF, and commit the transaction. This requires that extra piece of the SQL Agent job, but you are then back to having more accurate statistics!!
For more info on working with SQLCLR in general, please visit: SQLCLR Info
I've run into this a few times recently at work, where we have to develop an application that completes a series of items on a schedule. Sometimes this schedule is configurable by the end user, other times it's set in a config file. Either way, this task is something that should only be executed once, by a single machine. This isn't generally difficult until you introduce the need for SOA/geo-redundancy. In this particular case there are a total of 4 (could be 400) instances of this application running, two in each data center on opposite sides of the US.
I'm investigating successful patterns for this sort of thing. My current solution has each physical location determining whether it should be active or dormant. We do this by checking a Session object that is maintained on another server. If DataCenter A is the live setup, the logic auto-magically prevents the instances in DataCenter B from performing any execution. (We don't want the work to traverse the MPLS between DCs.)
The two remaining instances in DC A will then query the database for any jobs that need to be executed in the next 3 hours and cache them. A separate timer runs every second, checking for jobs that need to be executed.
If it finds one, it first executes a stored procedure that forces a full table lock, queries for the job that needs to be executed, and checks the StartedByInstance column for a value; if it doesn't find one, it marks that record as being executed by InstanceX. Only then does it actually execute the job.
My direct questions are:
Is this a good pattern?
Are there any better patterns?
Are there any libraries/apis that would be of interest?
Thanks!
This is my situation:
I have an account (userid/password) to communicate with an airline central reservation system through their API.
The API provides methods to connect, disconnect, sign in, sign out, sendcommand, and getdatareturn.
These are the steps I perform sequentially to get the data I want:
1. Connect to the host.
2. Sign in to the system.
3. Send a command to get the list of passengers on a flight at a specified date from one city to another (an LD command with parameters such as flight number, flight date, and the origin/destination city pair). In this step the host returns only part of the full list (for example, only 20 passengers, with a # character at the end to signal that there is more). To get the full list, I must send another command (the MD command) to move down, and so on until the end of the list (signalled by the END string). The passenger list contains each passenger's name, class, and a PNR code; based on these PNR codes, I must send further commands to get detailed passenger information such as full name, itinerary, and contact information, and then process it (this takes some time). (From these details, I can send various commands to get even more information.)
4. Sign out of the system.
5. Disconnect from the host.
Can I use multithreading or parallel technology for step #3 to get the data from the server?
Depends on the type of connection. How do you connect, and do you remain connected?
If it's a pair of sockets that keep communicating (i.e. stateful), you could try to create another connection, log in again, and request the data you want. If it's done stateless (over HTTP for example) using some kind of session ID to correlate subsequent requests, you could simply simultaneously issue multiple requests with the same session ID and see if that works.
So through your initial connection you request the list of PNRs, and then use that connection and new connections to request passenger data for multiple passengers at a time, getting all data for all passengers on the list.
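If multiple sessions do turn out to be allowed, fanning the per-PNR detail lookups over a small pool of independent connections might look like this. CrsSession here is a hypothetical wrapper around the vendor's connect/sign-in/sendcommand calls; none of its members come from the real API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical wrapper around the vendor API; only the shape matters here.
class CrsSession : IDisposable
{
    public string GetPassengerDetails(string pnr) { /* sendcommand + getdatareturn */ return ""; }
    public void Dispose() { /* sign out + disconnect */ }
}

static class PnrDetailFetcher
{
    // Each session is owned by exactly one task, so no connection is ever
    // shared between threads.
    public static async Task<string[]> FetchDetailsAsync(
        IReadOnlyList<string> pnrs, Func<CrsSession> openSession, int parallelism = 4)
    {
        var sessions = Enumerable.Range(0, parallelism).Select(_ => openSession()).ToArray();
        try
        {
            var work = sessions.Select((session, i) => Task.Run(() =>
                pnrs.Where((_, idx) => idx % parallelism == i)   // round-robin split
                    .Select(session.GetPassengerDetails)
                    .ToArray()));
            var results = await Task.WhenAll(work);
            return results.SelectMany(r => r).ToArray();
        }
        finally
        {
            foreach (var s in sessions) s.Dispose();
        }
    }
}
```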
If both options to achieve that don't work, and you're stuck to using one connection, I'm afraid there is no other solution. Couldn't you try to contact them to ask if this is possible?
I'm afraid my answer is "It depends". I see no problems with parallel queries from the client side, and some of the information (such as per passenger detail) could probably be done in a separate, parallel query, but getting the full list sounds as if it should be done in a single thread / connection.
Why: I don't know the system you're querying, but it sounds as if it saves the state of your query (what is he asking for, how far down the list is he currently) and so would probably not handle "Give me parts 1, 2, and 3" of the list very well, especially if part 3 didn't exist (and you don't know until you see the "#" at the end of part 2, which depends on part 1...)
Can I use multithread or parallel techonology for #2 to get data from server?
What purpose would it serve?
You cannot log out of the system until the data is returned, and the last two steps certainly are not resource intensive, or dependent upon your user interface.
What you actually want to do, sending the command more than once, is not really a task for multiple threads. You simply want to keep sending the command until you detect the symbol indicating there is no additional data.
If you are not already doing this, that simply means you should.
This is no different than reading user input in a console application.
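That single-threaded paging loop might look like this; SendCommand is a stand-in for the vendor's sendcommand/getdatareturn pair, and the "#"/"END" markers come from the question:

```csharp
using System.Text;

class CrsClient
{
    // Stand-in for the vendor API; the real call goes to the airline system.
    public string SendCommand(string command) { return "END"; }

    // Page through the passenger list in a single thread: keep issuing MD
    // ("move down") until the '#' continuation marker disappears.
    public string ReadFullList(string ldCommand)
    {
        var sb = new StringBuilder();
        string page = SendCommand(ldCommand);   // initial LD request
        sb.Append(page);
        while (page.TrimEnd().EndsWith("#"))    // '#' => more data available
        {
            page = SendCommand("MD");
            sb.Append(page);
        }
        return sb.ToString();                   // last page ends with END
    }
}
```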
Well, you could perhaps use parallel operations to allow users to issue more than one query at the same time, assuming that the CRS allows multiple connections from the same IP. The user might, for example, have more than one 'CRS' form on their screen and so handle more than one query at a time, e.g. for different dates, airports, flights, or passengers.
As noted by other posters, if the user is only processing one query at a time, there is not much point in parallelizing anything (except maybe the UI and client protocol, so that the UI is not locked and a query can be cancelled).
That said, given a requirement like this, I would normally design in such a way that multiple queries are the default behaviour anyway. I would have the CRS query form host everything needed to interact with the CRS so that, if necessary/possible, two instances of the form would allow two concurrent queries if supported by the server. This is more flexible than the alternative of running two processes.