Let's say we are building a public service that grabs a user's setup (which server, user and pwd to use for the call), logs in to that server, and does some processing...
the process takes about 15 seconds to complete
each user has a different setup (server/user/pwd), so the process needs to run against each one
If 1,000 users tell the system to run the method at 1:00 PM, how can I ensure that the method is processed within the next 15 minutes?
What should be the correct approach to this little problem?
I'm thinking that I need to do something asynchronously, and that parallel processing could speed things up, maybe by throttling the processes, say executing 100 calls every 30 seconds?
I've never done something like this and would love to get your feedback on ideas and potential problems, rather than spend 100 hours of work and realize I took the wrong road :(
Thank you.
Added:
The only thing to take into consideration is that this should be a 100% web solution.
If one call to your method does not affect the result of another method call (which seems to be the case here), parallel programming seems to be the way to go.
Consider not processing this in the asp.net application directly, but rather placing such requests on a queue and having another process (windows service may be a good candidate here) pulling items off the queue for processing. The windows service can have multiple threads and can pull as many items off the queue at once as there are processing threads available. With an appropriate queuing mechanism, the windows service can run on separate hardware if needed to reach your performance goals.
You can have the original web page query the result using e.g. Ajax to provide the user feedback if that's a requirement.
UPDATE:
Microsoft has recommended a pattern for long running tasks that can be used in a hosted environment.
Well, 1000 * 15 seconds is 15,000 seconds of work, more than 4 hours, so you can only complete the entire batch within the 15-minute time frame if you parallelize it: 15,000 s / 900 s means at least 17 items need to be in flight at all times.
I would set up a queue and have a sufficient number of threads or processes pull from that queue.
You can define an in-process queue with Queue&lt;T&gt;, or an out-of-process one with either a database table or MSMQ.
If you don't want to write multithreaded code, you can just have a bunch of different processes running on different machines, all pulling from the same queue.
A console application can do this, but a Windows Service is definitely also an alternative.
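For the in-process variant, here is a minimal sketch of the idea. It uses BlockingCollection&lt;T&gt; rather than a bare Queue&lt;T&gt; so multiple worker threads can pull from it safely; UserSetup, LoadPendingSetups and ProcessUser are placeholders for your own types and the 15-second job.

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class BatchRunner
{
    // Hypothetical shape of the per-user settings.
    class UserSetup
    {
        public string Server, User, Password;
    }

    static void Main()
    {
        var queue = new BlockingCollection<UserSetup>();

        // ~20 items in flight covers 1000 x 15 s of work in roughly 12-13 minutes.
        const int workerCount = 20;
        var workers = new Task[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            workers[i] = Task.Run(() =>
            {
                foreach (var setup in queue.GetConsumingEnumerable())
                    ProcessUser(setup);             // log in to the user's server and do the ~15 s of work
            });
        }

        foreach (var setup in LoadPendingSetups())  // e.g. everything requested for 1:00 PM
            queue.Add(setup);
        queue.CompleteAdding();                     // lets the workers drain the queue and exit

        Task.WaitAll(workers);
    }

    static UserSetup[] LoadPendingSetups() { return new UserSetup[0]; }   // placeholder
    static void ProcessUser(UserSetup setup) { /* ~15 seconds of processing */ }
}

The same loop works unchanged if the "queue" becomes a database table or MSMQ; only the pull and add calls change.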
I've created a C# WPF project in which I have to process a CSV file; the number of records may be a few hundred, a few thousand, or even millions. I need to read a line, process the record (which generally takes 5 to 10 seconds), and then update the record with the new value.
The operation consists of a network call to a server through a web service; that server then calls another server to connect to the authority server, and the authority server responds with the requested data in the same loop. The authority takes time because it has a very large database of about one billion records. So the encrypt/decrypt and authenticate operation takes about 5-10 seconds to complete.
I cannot perform the operation on one thread, as processing the whole file could take months, so I want to create hundreds of threads to process the data. The approach I'm considering is to create a thread that spawns up to 100 worker threads and monitors them for availability. When a thread returns data after processing, it writes the result to the file and a new thread is created for the next line.
This approach seems too complex. Should I implement it as described, and how, or how else should I solve the problem?
There are two options that can help you here:
Parallel LINQ
TPL Dataflow
Parallel LINQ is the simpler option, but provides a lot less customization. It would look something like:
var results = File.ReadLines("input.csv")
    .AsParallel()
    .AsOrdered()
    .WithDegreeOfParallelism(100)
    .Select(ProcessLine);
File.WriteAllLines("output.csv", results);
(You need to implement the ProcessLine method, of course.)
Now that will give you a lot of parallelism, but probably via lots of threads which are blocked a lot of the time... whereas a more sophisticated solution would end up using asynchronous IO so that actually you probably hardly need any actual threads.
One thing to be aware of: if you're making web requests over the network, you may need to configure the maximum number of requests you can make in parallel to the host. See ServicePointManager.DefaultConnectionLimit and the <connectionManagement> settings element.
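For comparison, the "more sophisticated" route with TPL Dataflow might look roughly like the sketch below. This is only an illustration: ProcessLineAsync stands in for your web-service call, and the parallelism/capacity numbers are illustrative.

using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

class CsvProcessor
{
    public static async Task ProcessFileAsync()
    {
        var client = new HttpClient();

        // Transform each line by calling the remote service asynchronously.
        // TransformBlock preserves input order by default, like AsOrdered() above.
        var process = new TransformBlock<string, string>(
            line => ProcessLineAsync(client, line),
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = 100,   // up to 100 requests in flight
                BoundedCapacity = 1000          // don't buffer the whole file in memory
            });

        // Single-threaded writer, so no locking is needed around the StreamWriter.
        var writer = new StreamWriter("output.csv");
        var write = new ActionBlock<string>(result => writer.WriteLine(result));

        process.LinkTo(write, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (var line in File.ReadLines("input.csv"))
            await process.SendAsync(line);      // honours BoundedCapacity back-pressure

        process.Complete();
        await write.Completion;
        writer.Dispose();
    }

    // Hypothetical per-line work: call the web service and return the updated record.
    static async Task<string> ProcessLineAsync(HttpClient client, string line)
    {
        await Task.Delay(1);                    // placeholder for the 5-10 second call
        return line;
    }
}

Because the per-line work is awaited rather than blocked on, 100 requests in flight does not mean 100 blocked threads.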
This program takes data that has been imported and exports it line by line to normalized tables. It can sometimes take hours.
How it works:
User presses a button to begin processing.
A call is made from jQuery to a method on the MVC controller.
That controller calls a DLL, which then begins the long-running processing.
Only one person uses this program at a time.
We are worried about it tying up an ASP.NET IIS thread. Would it be more efficient if, instead of the web site running the code, we made a scheduled task that runs an EXE every 30 minutes to check for and process the work?
EDIT 1: After talking to a coworker, the workaround we will use for now is simply to remove the processing button from the web site and instead refactor that processing into a scheduled task that runs every 5 minutes... if there are any comments about this, let me know.
The question is really about the differences between the web site running the code vs. a completely separate EXE... IIS threads vs. processes... does it help any?
If the processing takes hours, it should definitely be in a separate process, not just a separate thread. You sidestep complications around thread locking and management, garbage collection and other things by dropping this into a separate process. For example, if your web server needs to be rebooted, your separate process can continue running without being affected. With a little work, you could even spin this process up on a separate server if you want (of course you would need to change the process start mechanism to do this).
When the task can run for hours having it block an ASP.Net thread is definitely the wrong thing to do. A web call should complete in a reasonable amount of time (seconds ideally, minutes at worst). Tasks which take considerably longer than that can be initiated from a web request but definitely shouldn't block the entire execution.
I think there are a few possible paths forward:
If this is a task that needs to be executed on a semi-regular basis, then factor it out into an EXE and schedule the task to run at the correct interval.
If this task should run on demand, then factor it out into an EXE and have the web request kick off the EXE without waiting for its completion (see the sketch after this list).
Another possibility is to factor it out into a long-running server process and then use remoting or WCF to communicate between ASP.NET and that process.
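For the on-demand option, the "kick off but don't wait" part can be as simple as starting the process and returning straight away. A rough sketch follows; the controller name, EXE path and arguments are made up for illustration.

using System.Diagnostics;
using System.Web.Mvc;

public class ImportController : Controller
{
    [HttpPost]
    public ActionResult StartProcessing()
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = @"C:\Tools\ImportProcessor.exe",   // hypothetical worker EXE
            Arguments = "--run-once",
            UseShellExecute = false,
            CreateNoWindow = true
        };

        Process.Start(startInfo);        // no WaitForExit: the IIS thread is released immediately
        return Json(new { started = true });
    }
}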
We wrote a service that uses ~200 threads.
The 200 threads must do:
1. Download from the internet
2. Parse the raw data (HTML, XML, JSON...)
3. Store the newly created data in the DB
With ~10 threads, the elapsed time for the second operation (parsing) is 50 ms per thread.
With ~50 threads, the elapsed time for the second operation (parsing) is 80-18,000 ms per thread.
So we have an idea!
We can keep downloading documents with multiple threads, but use MSMQ to send the raw data to another process (a consumer), and have that other process implement the second part (parsing) single-threaded.
You might ask: why don't you use the C# Queue class in the same process? Because we could not protect our "precious parsing thread" from thread context switches; if there are 200 threads in the same process, the precious thread becomes a context-switch victim.
Is using MSMQ for this requirement normal?
Yes, this is an excellent example of where MSMQ makes a lot of sense. You can offload your difficult work to a different process to handle without affecting the performance of your current process which clearly doesn't care about the results. Not only that, but if your new worker process goes down, the queue will preserve state and messages (other than maybe the one being worked on when it went down) will not be lost.
Depending on your needs and goals I'd consider offloading the download to the other process as well - passing URLs to work on to the queue for example. Then, scaling up your system is as easy as dialing up the queue receivers, since queue messages are received in a thread safe manner when implemented correctly.
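A minimal System.Messaging sketch of that producer/consumer split; the queue path and message shape are illustrative, and real code would add error handling and probably transactional receives.

using System.Messaging;   // classic .NET Framework MSMQ API

class RawDataQueue
{
    const string QueuePath = @".\Private$\rawdata";   // hypothetical private queue

    // Producer side: the download threads push raw documents onto the queue.
    public static void Enqueue(string rawData)
    {
        if (!MessageQueue.Exists(QueuePath))
            MessageQueue.Create(QueuePath);

        using (var queue = new MessageQueue(QueuePath))
        {
            queue.Formatter = new XmlMessageFormatter(new[] { typeof(string) });
            queue.Send(rawData);
        }
    }

    // Consumer side (separate process): a single thread pulls and parses messages.
    public static void ParseLoop()
    {
        using (var queue = new MessageQueue(QueuePath))
        {
            queue.Formatter = new XmlMessageFormatter(new[] { typeof(string) });
            while (true)
            {
                var message = queue.Receive();   // blocks until a message arrives
                var rawData = (string)message.Body;
                // ... parse the raw data and store the result to the DB ...
            }
        }
    }
}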
Yes, it is normal. And there are frameworks/libraries that help you build these kinds of solutions, providing you with more than just transport.
NServiceBus or MassTransit are examples (both can sit on top of MSMQ)
I need to setup an automated task that runs every minute and sends emails in the queue. I'm using ASP.NET 4.5 and C#. Currently, I use a scheduler class that starts in the global.asax and makes use of caching and cache callback. I've read this leads to several problems.
The reason I did it that way is because this app runs on multiple load balanced servers and this allows me to have the execution in one place and the code will run even if one or more servers are offline.
I'm looking for some direction to make this better. I've read about Quartz.NET but never used it. Does Quartz.NET call methods from the application? or from a windows service? or from a web service?
I've also read about using a Windows service, but as far as I can tell, those are installed directly on the server. The thing is, I need the task to execute regardless of how many servers are online, and I don't want to duplicate it. For example, if I have a scheduled task set up on server 1 and server 2, they would both run at the same time, therefore duplicating the requests. However, if server 1 was offline, I need server 2 to run the task.
Any advice on how to move forward here or is the global.asax method the best way for the multi-server environment? BTW, the web servers are running Win Server 2012 with IIS 8.
EDIT
In response to the request for more information: the queue is stored in a database. I should also mention that the database servers are separate from the web servers. There are two database servers, but only one runs at a time. There is central storage they both read from, so there is only one instance of the database. When one database server goes down, the other comes online.
That being said, would it make more sense to deploy a Windows Service to both database servers? That would ensure only one runs at a time.
Also, what are your thoughts about running Quartz.NET from the application? As millimoose mentions, I don't necessarily need it running on the web front end, however, doing so allows me to not deploy a windows service to multiple machines and I don't think there would be a performance difference going either way. Thoughts?
Thanks everyone for the input so far. If any additional info is needed, please let me know.
I have had to tackle the exact problem you're facing now.
First, you have to realize that you absolutely cannot reliably run a long-running process inside ASP.NET. If you instantiate your scheduler class from global.asax, you have no control over the lifetime of that class.
In other words, IIS may decide to recycle the worker process that hosts your class at any time. At best, this means your class will be destroyed (and there's nothing you can do about it). At worst, your class will be killed in the middle of doing work. Oops.
The appropriate way to run a long-lived process is by installing a Windows Service on the machine. I'd install the service on each web box, not on the database.
The Service instantiates the Quartz scheduler. This way, you know that your scheduler is guaranteed to continue running as long as the machine is up. When it's time for a job to run, Quartz simply calls a method on an IJob class that you specify.
class EmailSender : Quartz.IJob
{
    public void Execute(JobExecutionContext context)
    {
        // send your emails here
    }
}
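For completeness, here is roughly how the Windows Service could create the scheduler and register that job. This sketch targets the Quartz.NET 2.x fluent API (in 2.x the Execute parameter is IJobExecutionContext rather than JobExecutionContext; other versions differ), and the job identity and interval are illustrative.

using Quartz;
using Quartz.Impl;
using System.ServiceProcess;

public class EmailService : ServiceBase
{
    private IScheduler scheduler;

    protected override void OnStart(string[] args)
    {
        scheduler = new StdSchedulerFactory().GetScheduler();
        scheduler.Start();

        IJobDetail job = JobBuilder.Create<EmailSender>()
            .WithIdentity("sendQueuedEmails")
            .Build();

        ITrigger trigger = TriggerBuilder.Create()
            .StartNow()
            .WithSimpleSchedule(x => x.WithIntervalInMinutes(1).RepeatForever())   // poll the queue every minute
            .Build();

        scheduler.ScheduleJob(job, trigger);
    }

    protected override void OnStop()
    {
        scheduler.Shutdown(waitForJobsToComplete: true);
    }
}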
Keep in mind that Quartz calls the Execute method on a separate thread, so you must be careful to be thread-safe.
Of course, you'll now have the same service running on multiple machines. While it sounds like you're concerned about this, you can actually leverage this into a positive thing!
What I did was add a "lock" column to my database. When a send job executes, it grabs a lock on specific emails in the queue by setting the lock column. For example, when the job executes, generate a guid and then:
UPDATE EmailQueue SET Lock=someGuid WHERE Lock IS NULL LIMIT 1;
SELECT * FROM EmailQueue WHERE Lock=someGuid;
In this way, you let the database server deal with the concurrency. The UPDATE query tells the DB to assign one email in the queue (that is currently unassigned) to the current instance. You then SELECT the locked email and send it. Once sent, delete the email from the queue (or however you handle sent email), and repeat the process until the queue is empty.
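A rough C# sketch of that claim-then-send loop; the EmailQueue columns, the SendEmail helper, and the SQL Server TOP (1) rewrite of the LIMIT 1 above are all illustrative.

using System;
using System.Data.SqlClient;

class QueueSender
{
    // Hypothetical sender called by the Quartz job; column names are made up.
    public static void DrainQueue(string connectionString)
    {
        while (true)
        {
            var lockId = Guid.NewGuid();

            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();

                // Claim one unassigned email (T-SQL equivalent of the LIMIT 1 update).
                var claim = new SqlCommand(
                    "UPDATE TOP (1) EmailQueue SET [Lock]=@lock WHERE [Lock] IS NULL", conn);
                claim.Parameters.AddWithValue("@lock", lockId);
                if (claim.ExecuteNonQuery() == 0)
                    return;                                   // queue is empty

                var select = new SqlCommand(
                    "SELECT Recipient, Body FROM EmailQueue WHERE [Lock]=@lock", conn);
                select.Parameters.AddWithValue("@lock", lockId);
                using (var reader = select.ExecuteReader())
                {
                    while (reader.Read())
                        SendEmail(reader.GetString(0), reader.GetString(1));   // hypothetical helper
                }

                // Remove (or mark as sent) what we just handled.
                var delete = new SqlCommand("DELETE FROM EmailQueue WHERE [Lock]=@lock", conn);
                delete.Parameters.AddWithValue("@lock", lockId);
                delete.ExecuteNonQuery();
            }
        }
    }

    static void SendEmail(string recipient, string body) { /* SMTP send here */ }
}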
Now you can scale in two directions:
By running the same job on multiple threads concurrently.
By virtue of the fact this is running on multiple machines, you're effectively load balancing your send work across all your servers.
Because of the locking mechanism, you can guarantee that each email in the queue gets sent only once, even though multiple threads on multiple machines are all running the same code.
In response to comments: There's a few differences in the implementation I ended up with.
First, my ASP application can notify the service that there are new emails in the queue. This means that I don't even have to run on a schedule, I can simply tell the service when to start work. However, this kind of notification mechanism is very difficult to get right in a distributed environment, so simply checking the queue every minute or so should be fine.
The interval you go with really depends on the time sensitivity of your email delivery. If emails need to be delivered ASAP, you might need to trigger every 30 seconds or even less. If it's not so urgent, you can check every 5 minutes. Quartz limits the number of jobs executing at once (configurable), and you can configure what should happen if a trigger is missed, so you don't have to worry about having hundreds of jobs backing up.
Second, I actually grab a lock on 5 emails at a time to reduce query load on the DB server. I deal with high volumes, so this helped efficiency (fewer network roundtrips between the service and the DB). The thing to watch out for here is what happens if a node goes down (for whatever reason, from an exception to the machine itself crashing) in the middle of sending a group of emails. You'll end up with "locked" rows in the DB and nothing servicing them. The larger the group, the bigger this risk. Also, an idle node obviously can't work on anything if all remaining emails are locked.
As far as thread safety, I mean it in the general sense. Quartz maintains a thread pool, so you don't have to worry about actually managing the threads themselves.
You do have to be careful about what the code in your job accesses. As a rule of thumb, local variables should be fine. However, if you access anything outside the scope of your function, thread safety is a real concern. For example:
class EmailSender : IJob {
    static int counter = 0;

    public void Execute(JobExecutionContext context) {
        counter++; // BAD!
    }
}
This code is not thread-safe because multiple threads may try to access counter at the same time.
Thread A            Thread B
Execute()
                    Execute()
Get counter (0)
                    Get counter (0)
Increment (1)
                    Increment (1)
Store value
                    Store value

counter = 1
counter should be 2, but instead we have an extremely hard to debug race condition. Next time this code runs, it might happen this way:
Thread A            Thread B
Execute()
                    Execute()
Get counter (0)
Increment (1)
Store value
                    Get counter (1)
                    Increment (2)
                    Store value

counter = 2
...and you're left scratching your head why it worked this time.
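For what it's worth, if you genuinely needed a shared counter like this, the usual fix is an atomic increment:

class EmailSender : IJob {
    static int counter = 0;

    public void Execute(JobExecutionContext context) {
        System.Threading.Interlocked.Increment(ref counter);   // atomic, safe across threads
    }
}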
In your particular case, as long as you create a new database connection in each invocation of Execute and don't access any global data structures, you should be fine.
You'll have to be more specific about your architecture. Where is the email queue: in memory or in a database? If it's in a database, you could have a flag column named "processing"; when a task grabs an email from the queue, it only grabs emails that are not currently being processed and sets the processing flag to true for the emails it grabs. You then leave the concurrency woes to the database.
I have a question around threading and background workers that I hope you can help with.
I plan on making an FTP application to upload a file to 50 servers. Rather than the user having to wait for each upload to finish before the next one starts, I was looking at threading/background workers. Once an upload finishes, I want to report the status of the upload ("completed"/"failed") back to the UI. From my understanding, I will need to use background workers for this so I know when the task has completed. I know that with threading I can use a producer/consumer queue or a semaphore to run a given number of threads at once, but I am not quite sure how I can achieve this with background workers.
So my question is: what would be a sensible number of background workers to have uploading at once, and what would be the best way to queue the rest?
There is no limit on the size of the upload file so this could be quite small or up to a few MB.
Thanks in advance.
Edit - I tested out one BackgroundWorker for each server, all running simultaneously. The results were faster than with just a single BackgroundWorker, but I can't say I was fully comfortable running 50-plus background workers at once, and since the server count may increase in the future, I decided to stick with just one, which seems to be fast enough. I may in the future look at increasing the worker count to 2 or 3, but currently 1 seems adequate. Thanks for everyone's help.
Thanks
I'd go in a completely different direction with it, tbh. Your app should take the file and store it once, responding to the client that it's got it. The file should then be propagated to the other servers. You can do this many ways, but if you want it controlled by the same application (i.e. not done using a windows service or the like) then a good way would be to use a message queue (either MSMQ or one of the OS ones).
This is much easier than using a semaphore or producer-consumer queue.
Put all your tasks in a queue (doesn't need to be a thread-safe queue, it will only be used from the UI thread).
Loop from 1 to N, taking out a task and starting a BackgroundWorker. (Be sure to handle the empty queue, in case there were fewer than N tasks to begin with.) In the RunWorkerCompleted event, update your UI, dequeue another task, and start another BackgroundWorker.
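A minimal sketch of that pattern, assuming a WinForms/WPF context where RunWorkerCompleted fires back on the UI thread; UploadFile, ReportStatus and the MaxWorkers value are placeholders.

using System.Collections.Generic;
using System.ComponentModel;

class UploadCoordinator
{
    const int MaxWorkers = 3;                    // tune based on measured throughput
    readonly Queue<string> pending;              // servers still waiting for the upload

    public UploadCoordinator(IEnumerable<string> servers)
    {
        pending = new Queue<string>(servers);
    }

    public void StartUploads()
    {
        // Kick off up to MaxWorkers workers; each one starts the next task as it finishes.
        for (int i = 0; i < MaxWorkers && pending.Count > 0; i++)
            StartNextWorker();
    }

    void StartNextWorker()
    {
        var server = pending.Dequeue();
        var worker = new BackgroundWorker();
        worker.DoWork += (s, e) => UploadFile((string)e.Argument);        // hypothetical FTP upload
        worker.RunWorkerCompleted += (s, e) =>
        {
            ReportStatus(server, succeeded: e.Error == null);             // update the UI: completed/failed
            if (pending.Count > 0)
                StartNextWorker();    // runs on the UI thread, so the plain Queue<T> is safe
        };
        worker.RunWorkerAsync(server);
    }

    void UploadFile(string server) { /* FtpWebRequest upload goes here */ }
    void ReportStatus(string server, bool succeeded) { /* update list/grid in the UI */ }
}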
The bottleneck here is going to be your network bandwidth. If your local upstream connection is so fast that you can saturate the incoming connections on two or more remote hosts, then you'll benefit from running multiple uploads in parallel. If not, then it makes very little difference to the total upload time, since it'll be dictated by (file size * number of uploads) / (local bandwidth). In other words - if you do 20 uploads one at a time, it'll take an hour; if you do 20 uploads in parallel, it'll still take an hour. The advantage of the first approach is that if you lose connectivity you'll only need to resume/restart a single upload - whichever one was in progress when the connection was lost.
I'd therefore use a single background thread to sequentially upload the file to each server in turn. If you're using the .NET BackgroundWorker to do this, you can get it to ReportProgress at the end of each file (and you know in advance how many files are to be uploaded so you can calculate progress as a percentage), and attach some custom state to the progress update to inform the user whether the last upload succeeded or not.
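A fragment showing how that single worker might report per-file status; UploadTo, statusList and serverList are hypothetical, and WorkerReportsProgress must be enabled.

var worker = new BackgroundWorker { WorkerReportsProgress = true };

worker.DoWork += (s, e) =>
{
    var servers = (IList<string>)e.Argument;
    for (int i = 0; i < servers.Count; i++)
    {
        bool ok = UploadTo(servers[i]);                               // hypothetical single upload
        int percent = (i + 1) * 100 / servers.Count;
        ((BackgroundWorker)s).ReportProgress(percent, ok ? "completed" : "failed");
    }
};

worker.ProgressChanged += (s, e) =>
    statusList.Items.Add(e.UserState + " (" + e.ProgressPercentage + "%)");   // runs on the UI thread

worker.RunWorkerAsync(serverList);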
The only way to know for sure is to test and measure, but it can be different from machine to machine, mostly depending on uplink speed.
Starting 50 backgroundworkers at the same time is a bit on the high end, but is not incredibly many. A simple approach would be to start 50 all at the same time and measure memory consumption and upload speed.
If each FTP server is much faster than the client uplink speed, the most efficient approach would be to upload to just one (or possibly two) at a time.