Question regarding threading/background workers - C#

I have a question around threading and background workers that I hope you can help with.
I plan on making an FTP application to upload a file to 50 servers. Rather than the user having to wait for each upload to finish before the next one starts, I was looking at threading/background workers. Once an upload finishes I want to report its status ("completed/failed") back to the UI. From my understanding, I will need to use background workers for this so I know when the task has completed. I know that with threading I can use a producer/consumer queue or a semaphore to run a given number of threads at once, but I am not quite sure how I can achieve this with background workers.
So my question is, what would be a sensible number of background workers controlling uploading to run at once and what would be the best way to queue the rest?
There is no limit on the size of the upload file so this could be quite small or up to a few MB.
Thanks in advance.
Edit - I tested one BackgroundWorker per server, all running simultaneously. The results were faster than with a single BackgroundWorker, but I can't say that I was fully comfortable running 50-plus background workers at once, and since the server count may increase in the future, I decided to stick with just one, which seems to be fast enough. I may in future look at increasing the number of workers to 2 or 3, but currently 1 seems adequate. Thanks for everyone's help.
Thanks

I'd go in a completely different direction with it, tbh. Your app should take the file and store it once, responding to the client that it's got it. The file should then be propagated to the other servers. You can do this many ways, but if you want it controlled by the same application (i.e. not done using a windows service or the like) then a good way would be to use a message queue (either MSMQ or one of the OS ones).

This is much easier than using a semaphore or producer-consumer queue.
Put all your tasks in a queue (doesn't need to be a thread-safe queue, it will only be used from the UI thread).
Loop from 1 to N, taking out a task and starting a BackgroundWorker. (Be sure to handle the empty queue, in case there were fewer than N tasks to begin with.) In the RunWorkerCompleted event, update your UI, dequeue another task, and start another BackgroundWorker.
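The steps above could be sketched roughly as follows. This is only a minimal illustration, not a finished implementation: `UploadScheduler`, the `upload` delegate (which would do the actual FTP transfer) and the `report` delegate (which would update the UI) are hypothetical names introduced here. It relies on RunWorkerCompleted being raised on the UI thread, which is what happens in a WinForms/WPF app when the workers are started from the UI thread.

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;

public sealed class UploadScheduler
{
    // Plain Queue<T>: only touched from the UI thread (Start and RunWorkerCompleted).
    private readonly Queue<string> pending;
    private readonly Func<string, bool> upload;    // hypothetical: uploads to one server, true on success
    private readonly Action<string, bool> report;  // hypothetical: updates the UI with the outcome

    public UploadScheduler(IEnumerable<string> servers,
                           Func<string, bool> upload,
                           Action<string, bool> report)
    {
        pending = new Queue<string>(servers);
        this.upload = upload;
        this.report = report;
    }

    // Start up to maxConcurrent workers; the rest stay queued.
    public void Start(int maxConcurrent)
    {
        for (int i = 0; i < maxConcurrent; i++)
            StartNext();
    }

    private void StartNext()
    {
        if (pending.Count == 0)
            return; // fewer tasks than workers, or everything is done

        string server = pending.Dequeue();
        var worker = new BackgroundWorker();

        // Runs on a thread-pool thread.
        worker.DoWork += (sender, e) => e.Result = upload(server);

        // Raised on the UI thread when a UI SynchronizationContext is present.
        worker.RunWorkerCompleted += (sender, e) =>
        {
            bool ok = e.Error == null && (bool)e.Result;
            report(server, ok);
            worker.Dispose();
            StartNext(); // pull the next queued task, if any
        };

        worker.RunWorkerAsync();
    }
}
```

Calling `new UploadScheduler(servers, upload, report).Start(3)` would then keep at most three uploads in flight. In a console app there is no UI thread, so the queue would need a lock if more than one worker is started.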

The bottleneck here is going to be your network bandwidth. If your local upstream connection is so fast that you can saturate the incoming connections on two or more remote hosts, then you'll benefit from running multiple uploads in parallel. If not, then it makes very little difference to the total upload time, since it'll be dictated by (file size * number of uploads) / (local bandwidth). In other words - if you do 20 uploads one at a time, it'll take an hour; if you do 20 uploads in parallel, it'll still take an hour. The advantage of the first approach is that if you lose connectivity you'll only need to resume/restart a single upload - whichever one was in progress when the connection was lost.
I'd therefore use a single background thread to sequentially upload the file to each server in turn. If you're using the .NET BackgroundWorker to do this, you can get it to ReportProgress at the end of each file (and you know in advance how many files are to be uploaded so you can calculate progress as a percentage), and attach some custom state to the progress update to inform the user whether the last upload succeeded or not.
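A minimal sketch of that single-worker approach might look like this. `SequentialUploader`, `UploadResult`, `uploadOne` and `onProgress` are hypothetical names introduced for illustration; `uploadOne` stands in for the real FTP call and is assumed to throw on failure.

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;

// Custom state attached to each progress update.
public sealed class UploadResult
{
    public string Server;
    public bool Succeeded;
}

public static class SequentialUploader
{
    // uploadOne: hypothetical delegate that uploads to one server and throws on failure.
    // onProgress: hypothetical delegate that updates the UI.
    public static BackgroundWorker Create(IList<string> servers,
                                          Action<string> uploadOne,
                                          Action<int, UploadResult> onProgress)
    {
        var worker = new BackgroundWorker { WorkerReportsProgress = true };

        worker.DoWork += (sender, e) =>
        {
            for (int i = 0; i < servers.Count; i++)
            {
                bool ok = true;
                try { uploadOne(servers[i]); }
                catch (Exception) { ok = false; } // one failure does not stop the rest

                // The number of uploads is known up front, so progress is a simple percentage.
                int percent = (i + 1) * 100 / servers.Count;
                worker.ReportProgress(percent,
                    new UploadResult { Server = servers[i], Succeeded = ok });
            }
        };

        // Raised on the UI thread when the worker was started from it.
        worker.ProgressChanged += (sender, e) =>
            onProgress(e.ProgressPercentage, (UploadResult)e.UserState);

        return worker;
    }
}
```

Usage would be along the lines of `SequentialUploader.Create(servers, uploadOne, onProgress).RunWorkerAsync();`.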

The only way to know for sure is to test and measure, but it can be different from machine to machine, mostly depending on uplink speed.
Starting 50 backgroundworkers at the same time is a bit on the high end, but is not incredibly many. A simple approach would be to start 50 all at the same time and measure memory consumption and upload speed.
If the FTP servers are each much faster than the client uplink speed the most efficient would be to just upload one (or possibly two) at a time.

Related

Could MSMQ resolve the performance bottleneck of our multithreaded services?

We wrote a service that uses ~200 threads.
Each of the 200 threads must:
1- Download from the internet
2- Parse the raw data (html, xml, json...)
3- Store the newly created data in the db
With ~10 threads, the elapsed time for the second operation (parsing) is 50 ms (per thread)
With ~50 threads, the elapsed time for the second operation (parsing) is 80-18000 ms (per thread)
So we have an idea!
We can keep downloading documents on multiple threads, but use MSMQ to send the raw data to another process (the consumer). That other process would implement the second part (parsing) on a single thread.
You may ask why we don't use the C# Queue class in the same process. We could not protect our "precious parsing thread" from thread context switches: if there are 200 threads in the same process, the precious thread will be a context-switch victim.
Is using MSMQ for this requirement normal?

Yes, this is an excellent example of where MSMQ makes a lot of sense. You can offload your difficult work to a different process to handle without affecting the performance of your current process which clearly doesn't care about the results. Not only that, but if your new worker process goes down, the queue will preserve state and messages (other than maybe the one being worked on when it went down) will not be lost.
Depending on your needs and goals I'd consider offloading the download to the other process as well - passing URLs to work on to the queue for example. Then, scaling up your system is as easy as dialing up the queue receivers, since queue messages are received in a thread safe manner when implemented correctly.

Yes, it is normal. There are also frameworks/libraries that help you build these kinds of solutions and provide more than just the transport.
NServiceBus and MassTransit are examples (both can sit on top of MSMQ).

Quickest way to process a large number of files with thousands of rows in each file

I need to process data from a large number of files, each containing thousands of rows. Previously I read each file row by row and processed it. That took a lot of time once the number of files increased. Then someone said that threads can be used to perform the task in less time. Can threading make this process faster? I'm using C#.

It certainly can although it depends on the particular job in question. A very common pattern is to have one thread doing the file IO and multiple threads processing the actual lines.
How many processing threads to start will depend on how many processors/cores you have on your system, and how the results of the processing get written out. If the processing time per line is very small however, you probably won't get too much speed improvement having multiple processing threads and a single processing thread would be optimal.

A good way to approach a performance question is to assume that your code is doing something unnecessary and try to find out what it is: measure, review, draw, whatever works for you. I'm not saying that the code you have is slow; it's just a way to look at it.
If you add multithreading to the mix first, you may find the code much harder to analyze.
More concretely for your task: combining multiple similar operations (like reading a record from a file or committing to the DB) may save a significant amount of time (you need to prototype and measure).

I would recommend doing batch inserts into your database.
You can have one thread that reads lines into a concurrent queue while another thread pulls the data from the queue, aggregates it or performs whatever operations you need on it, and then batch-inserts the data into the database. It will save you quite a lot of time.
Inserting one line at a time into the db would be very slow; you have to do batch inserts.
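As a rough sketch of that reader/batcher split, assuming .NET 4 or later for `BlockingCollection<T>`: `BatchImporter` and the `insertBatch` delegate are hypothetical names, and `insertBatch` stands in for whatever bulk-insert mechanism your database offers.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BatchImporter
{
    // insertBatch: hypothetical delegate that writes one batch of rows to the database.
    // Returns the number of batches written.
    public static int Import(IEnumerable<string> lines, int batchSize,
                             Action<IList<string>> insertBatch)
    {
        // Bounded so a fast reader cannot fill memory ahead of the database.
        var queue = new BlockingCollection<string>(boundedCapacity: 10000);

        // Producer: reads lines and hands them to the queue.
        var producer = Task.Run(() =>
        {
            try { foreach (var line in lines) queue.Add(line); }
            finally { queue.CompleteAdding(); } // always let the consumer finish
        });

        // Consumer (this thread): groups lines into batches and inserts each batch.
        int batches = 0;
        var batch = new List<string>(batchSize);
        foreach (var line in queue.GetConsumingEnumerable())
        {
            batch.Add(line);
            if (batch.Count == batchSize)
            {
                insertBatch(batch);
                batches++;
                batch = new List<string>(batchSize);
            }
        }
        if (batch.Count > 0) { insertBatch(batch); batches++; } // final partial batch

        producer.Wait(); // surfaces any exception thrown while reading
        return batches;
    }
}
```

A call such as `BatchImporter.Import(File.ReadLines(path), 1000, rows => BulkInsert(rows))` would read one file lazily, where `BulkInsert` is again a placeholder for your own database code.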

Yes, using threads can speed things up.
Threads are useful when you have time-consuming tasks that can run in the background (say you process 10 files but only need one: you can have a thread process each of them, which will be a lot faster than processing them all on your main thread).
Please note that this can introduce bugs, so you should make sure all threads have finished running before continuing and trying to access their results.
Look up "C#.NET multithreading".
Any thread can run a specified function, and BackgroundWorker is a nice class as well (I prefer pure multithreading though).
Also note that this may backfire and wind up slower, but it's a good idea to try.

Threading is one way (there are others) of letting you overlap the processing with the I/O. That means instead of the total time being the sum of the time to read the data and the time to process the data, you can reduce it to (roughly) whichever of the two is larger (usually the I/O time).
If you mostly want to overlap the I/O time, you might want to look at overlapped I/O and/or I/O completion ports.
Edit: If you're going to do this, you normally want to base the number of I/O threads on the number of separate physical disks you're going to be reading from, and the number of processing threads on the number of processors you have available to do the processing (but only as many as necessary to keep up with the data being supplied by the reader thread). For a typical desktop machine, that will often mean only two threads, one to read and one to process data.

Using multithreading for loop

I'm new to threading and want to do something similar to this question:
Speed up loop using multithreading in C# (Question)
However, I'm not sure if that solution is the best one for me as I want them to keep running and never finish. (I'm also using .net 3.5 rather than 2.0 as for that question.)
I want to do something like this:
    foreach (Agent agent in AgentList)
    {
        // I want to start a new thread for each of these
        agent.DoProcessLoop();
    }
    ---
    public void DoProcessLoop()
    {
        while (true)
        {
            // do the processing
            // this is things like check folder for new files, update database
            // if new files found
        }
    }
Would a ThreadPool be the best solution or is there something that suits this better?
Update: Thanks for all the great answers! I thought I'd explain the use case in more detail. A number of agents can upload files to a folder. Each agent has their own folder which they can upload assets to (csv files, images, pdfs). Our service (it's meant to be a windows service running on the server they upload their assets to, rest assured I'll be coming back with questions about windows services sometime soon :)) will keep checking every agent's folder if any new assets are there, and if there are, the database will be updated and for some of them static html pages created. As it could take a while for them to upload everything and we want them to be able to see their uploaded changes pretty much straight away, we thought a thread per agent would be a good idea as no agent then needs to wait for someone else to finish (and we have multiple processors so wanted to use their full capacity). Hope this explains it!
Thanks,
Annelie

Given the specific usage you describe (watching for files), I'd suggest you use a FileSystemWatcher to determine when there are new files and then fire off a thread with the threadpool to process the files until there are no more to process -- at which point the thread exits.
This should reduce i/o (since you're not constantly polling the disk), reduce CPU usage (since the constant looping of multiple threads polling the disk would use cycles), and reduce the number of threads you have running at any one time (assuming there aren't constant modifications being made to the file system).
You might want to open and read the files only on the main thread and pass the data to the worker threads (if possible), to limit i/o to a single thread.
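A minimal sketch of the FileSystemWatcher idea, simplified to one short thread-pool work item per new file: `FolderWatcher` and the `processFile` callback are hypothetical names. Note that the Created event can fire while the file is still being uploaded, so real code would likely need to retry or wait until the file can be opened.

```csharp
using System;
using System.IO;
using System.Threading;

public static class FolderWatcher
{
    // processFile: hypothetical callback that updates the database, builds pages, etc.
    // Call once per agent folder. Dispose the returned watcher to stop watching.
    public static FileSystemWatcher Watch(string folder, Action<string> processFile)
    {
        var watcher = new FileSystemWatcher(folder);

        // Each new file becomes a short-lived work item instead of a polling loop.
        watcher.Created += (sender, e) =>
            ThreadPool.QueueUserWorkItem(state => processFile(e.FullPath));

        watcher.EnableRaisingEvents = true; // no events are raised until this is set
        return watcher;
    }
}
```

Because each work item ends as soon as its file is handled, the pool threads are not tied up indefinitely the way a `while (true)` loop per agent would tie them up.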

I believe that the Parallel Extensions make this possible:
Parallel.ForEach
http://msdn.microsoft.com/en-us/library/system.threading.tasks.parallel.foreach.aspx
http://blogs.msdn.com/pfxteam/

One issue with the ThreadPool is that if the pool happens to be smaller than the number of Agents you would like to have, the ones you try to start later may never begin to execute, and you could starve everything else in your app domain that uses the thread pool as well. You're probably better off not going down that route.

You definitely don't want to use the ThreadPool for this purpose. ThreadPool threads are not meant to be used for long-running tasks ("infinite" counts as long-running), since that would obviously tie up resources meant to be shared.
For your application, it would probably be better to create one thread (not from the ThreadPool) and in that thread execute your while loop, inside of which you iterate through your Agents collection and perform the processing for each one. In the while loop you should also use a Thread.Sleep call so you don't max out the processor (there are better ways of executing code periodically, but Thread.Sleep will work for your purposes).
Finally, you need to include some way for the while loop to exit when your program terminates.
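A minimal sketch of such a loop, using only types available in .NET 3.5: `AgentPoller` and `agentChecks` are hypothetical names. In place of Thread.Sleep it waits on a ManualResetEvent with a timeout, which is one of the alternatives hinted at above: the pause between passes works the same way, but signalling the event ends the loop promptly.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public sealed class AgentPoller
{
    private readonly ManualResetEvent stop = new ManualResetEvent(false);
    private readonly Thread thread;

    // agentChecks: hypothetical list with one "check this agent's folder" action per agent.
    public AgentPoller(IList<Action> agentChecks, TimeSpan interval)
    {
        thread = new Thread(() =>
        {
            do
            {
                foreach (Action check in agentChecks)
                    check();
            }
            // WaitOne doubles as the sleep between passes; it returns true
            // as soon as Stop() signals the event, which ends the loop.
            while (!stop.WaitOne(interval));
        });
        thread.IsBackground = true; // never keeps the process alive on its own
    }

    public void Start() { thread.Start(); }

    // Signals the loop to exit and waits for the current pass to finish.
    public void Stop()
    {
        stop.Set();
        thread.Join();
    }
}
```

A Windows service could call `Start()` from OnStart and `Stop()` from OnStop.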
Update: Finally finally, multi-threading does not automatically speed up slow-running code. Nine women can't make a baby in one month.

A thread pool is useful when you expect threads to be coming into and out of existence fairly regularly, not for a predefined set number of threads.

Hmm.. as Ragoczy points out, it's better to use FileSystemWatcher to monitor the files. However, since you have additional operations, you may still think in terms of multithreading.
But beware: no matter how many processors you have, there is a limit to their capacity. You may not want to create as many threads as there are concurrent users, for the simple reason that your number of agents can increase.

Until you upgrade to .NET 4, the ThreadPool might be your best option. You may also want to use a Semaphore and an AutoResetEvent to control the number of concurrent threads. If you're talking about long-running work, however, the overhead of starting up and managing your own threads is low and the solution is more elegant. That will allow you to use a WorkerThread.Join() so you can make sure all worker threads are complete before you resume execution.

Run one method 1000 times in a short period of time

Let's say we are building a public service that grabs a user's setup (which server, user and pwd he wants to use for the call), logs in to that server and does some processing...
the process takes about 15 seconds to complete
each user has a different setup (server/user/pwd), so the process needs to run against each one
if 1000 users tell the system to run the method at 1:00PM
How can I ensure that the method is processed for all of them in the next 15 minutes?
What would be the correct approach to this little problem?
I'm thinking that I need to do something asynchronously, and parallel processing could speed things up, maybe with throttling, e.g. executing 100 calls every 30 seconds?
I have never done something like this and would love to get your feedback on ideas and future problems, so that I don't spend 100 hours of work and then realize that I took the wrong road :(
Thank you.
added
The only thing to have in consideration is that this should be a 100% web solution.

If one call to your method does not affect the result of another method call (which seems to be the case here), parallel programming seems to be the way to go.
Consider not processing this in the asp.net application directly, but rather placing such requests on a queue and having another process (windows service may be a good candidate here) pulling items off the queue for processing. The windows service can have multiple threads and can pull as many items off the queue at once as there are processing threads available. With an appropriate queuing mechanism, the windows service can run on separate hardware if needed to reach your performance goals.
You can have the original web page query the result using e.g. Ajax to provide the user feedback if that's a requirement.
UPDATE:
Microsoft has recommended a pattern for long running tasks that can be used in a hosted environment.

Well, 1000 * 15 seconds is more than 4 hours, so you can only complete the entire task within the 15 minute time frame if you parallelize the batch.
I would set up a queue and have a sufficient number of threads or processes pull from that queue.
You can define an in-process queue with Queue<T> or out-of-process either with a database table or MSMQ.
If you don't want to write multithreaded code, you can just have a bunch of different processes running on different machines, all pulling from the same queue.
A console application can do this, but a Windows Service is definitely also an alternative.
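To make the arithmetic concrete: one worker can finish 900 / 15 = 60 calls in 15 minutes, so 1000 calls need at least 17 workers (16 workers would only cover 960). The sketch below computes that figure and runs an in-process Queue<T> with a fixed number of worker threads. `BatchRunner` and the `process` delegate are hypothetical names, and the estimate assumes every call really takes about 15 seconds and that the remote servers tolerate that much concurrency.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class BatchRunner
{
    // Minimum number of workers so that all jobs finish before the deadline.
    public static int WorkersNeeded(int jobs, int secondsPerJob, int deadlineSeconds)
    {
        int jobsPerWorker = deadlineSeconds / secondsPerJob; // 900 / 15 = 60
        if (jobsPerWorker == 0)
            throw new ArgumentException("A single job does not fit in the deadline.");
        return (jobs + jobsPerWorker - 1) / jobsPerWorker;   // ceiling division
    }

    // Runs every job on workerCount threads that pull from one shared queue.
    // process: hypothetical delegate that logs in and does the 15-second call.
    public static void Run<T>(IEnumerable<T> jobs, int workerCount, Action<T> process)
    {
        var queue = new Queue<T>(jobs);
        var threads = new List<Thread>();

        for (int i = 0; i < workerCount; i++)
        {
            var thread = new Thread(() =>
            {
                while (true)
                {
                    T job;
                    lock (queue) // Queue<T> is not thread-safe on its own
                    {
                        if (queue.Count == 0) return; // nothing left: worker exits
                        job = queue.Dequeue();
                    }
                    process(job); // outside the lock so workers run in parallel
                }
            });
            thread.Start();
            threads.Add(thread);
        }

        foreach (Thread thread in threads)
            thread.Join(); // wait until every job has been processed
    }
}
```

`BatchRunner.WorkersNeeded(1000, 15, 900)` returns 17. In practice you would add some headroom, and `process` should catch its own exceptions so that one failed login does not take the whole process down.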

Reading same file from multiple threads in C#

I was googling for some advice about this and found some links. The most obvious was this one, but in the end what I'm wondering is how well my code is implemented.
I have basically two classes. One is the Converter and the other is ConverterThread
I create an instance of the Converter class, which has a ThreadNumber property that tells me how many threads should run at the same time (this is read from the user). The application will be used on multi-CPU systems (physically, like 8 CPUs), so this is supposed to speed up the import.
The Converter instance reads a file that can range from 100 MB to 800 MB. Each line of the file is a tab-delimited record that is imported to another destination, such as a database.
The ConverterThread class simply runs inside the thread (new Thread(ConverterThread.StartThread)) and has event notification, so when its work is done it can notify the Converter class. I can then sum up the progress of all these threads and notify the user (in the GUI, for example) about how many records have been imported and how many bytes have been read.
It seems, however, that I'm having some trouble: I get random errors about the file not being readable, or the summed progress (percentage) goes above 100%, which should not be possible. I think that happens because the threads are not being managed well and the information returned by the event is probably malformed (since it "travels" from one thread to another).
Do you have any advice on better practices for implementing threads so I can accomplish this?
Thanks in advance.

I read very large files in some of my own code and, I have to tell you, I am skeptical of any claim that adding threads to a read operation would actually improve the overall read performance. In fact, adding threads might actually reduce performance by causing head seeks. It is highly likely that any file operations of this type would be I/O bound, not CPU bound.
Given that the author of the post you referenced never actually provided the 'real' code, his claims that multiple threads will speed up I/O remain untestable by others. Any attempt to improve hard disk read/write performance by adding threads would most certainly be I/O bound, unless he is doing some serious number crunching between reads, or has stumbled upon some happy coincidence having to do with the disk cache, in which case the performance improvement might not be reproducible on another machine with different hardware characteristics.
Generally, when files of this size are involved, an additional 20% or 30% improvement in performance is not going to matter much, even if it is possible utilizing threads, because such a task would most certainly be considered a background task (not real-time). I use multiple threads for this kind of work, not because it improves read performance on one file, but because multiple files can be processed simultaneously in the background.
Before using threads to do this, I carefully benchmarked the software to see if threads would actually improve overall throughput. The results of the tests (on my development machine) were that using the same number of threads as the number of processor cores produced the maximum possible throughput. But that was processing ONE file per thread.

Multiple threads reading a file at the same time is asking for trouble. I would set up a producer-consumer model in which the producer reads the lines in the file, perhaps into a buffer, and then hands them out to the consumer threads as they finish processing their current workload. It does mean you have a blocking point where the lines are handed out, but if processing takes much longer than reading then it shouldn't be that big of a deal. If reading is the slow part then you really don't need multiple consumers anyway.

You should try to just have one thread read the file, since multiple threads will likely be bound by the I/O anyway. Then you can feed the lines into a thread-safe queue from which multiple threads can dequeue lines to parse.
You won't be able to tell the progress of any one thread because that thread has no defined amount of work. However, you should be able to track approximate progress by keeping track of how many items (total) have been added to the queue and how many have been taken out. Obviously as your file reader thread puts more lines into the queue your progress will appear to decrease because more lines are available, but presumably you should be able to fill the queue faster than workers can process the lines.
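A rough sketch of that arrangement, assuming .NET 4 or later for `BlockingCollection<T>`: `LineImport` and the `parse` delegate are hypothetical names. The two counters are updated with Interlocked, so a progress display that reads them never sees a torn or double-counted value, which is one plausible source of totals above 100% when several threads add to a shared sum without synchronization.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class LineImport
{
    private long enqueued;  // lines handed to the queue so far
    private long processed; // lines fully parsed so far

    // Safe to read from a UI timer while Run is in progress.
    public long Enqueued { get { return Interlocked.Read(ref enqueued); } }
    public long Processed { get { return Interlocked.Read(ref processed); } }

    // parse: hypothetical delegate that converts and stores one record.
    public void Run(IEnumerable<string> lines, int consumerCount, Action<string> parse)
    {
        var queue = new BlockingCollection<string>(boundedCapacity: 1000);

        // Consumers: several tasks dequeue and parse lines concurrently.
        var consumers = new Task[consumerCount];
        for (int i = 0; i < consumerCount; i++)
        {
            consumers[i] = Task.Run(() =>
            {
                foreach (string line in queue.GetConsumingEnumerable())
                {
                    parse(line);
                    Interlocked.Increment(ref processed);
                }
            });
        }

        // Producer: the calling thread is the only one that touches the file.
        try
        {
            foreach (string line in lines)
            {
                queue.Add(line);
                Interlocked.Increment(ref enqueued);
            }
        }
        finally
        {
            queue.CompleteAdding(); // lets the consumers drain the queue and exit
        }

        Task.WaitAll(consumers); // rethrows any exception raised by parse
    }
}
```

Approximate progress is then `Processed / Enqueued`. If you would rather show a percentage that only moves forward, comparing bytes read so far with the file length is another option, since the file size is known up front while the line count is not.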
