In a thread "A", I want to read a very long file, and as that happens, I want to send each new line read to another thread "B", which would do -something- to them.
Basically, I don't want to wait for the file-loading to finish before I start processing the lines.
(I definitely want 2 threads and communication between them; I've never done this before and I wanna learn)
So, how do I go about doing this?
Should thread A wait for thread B to finish processing the current line before it sends another one? That wouldn't be efficient, so how about a buffer in thread B to catch the lines?
Also, please give an example of what methods I have to use for this cross thread communication since I haven't found/seen any useful examples.
Thank you.
First of all, it's not clear that two threads will necessarily be useful here. A single thread reading one line at a time (which is pretty easy with StreamReader) and processing each line as you go might perform at least as well. File reads are buffered, and the OS can read ahead of your code requesting data, in which case most of your reads will either complete immediately because the next line has already been fetched from disk in advance, or both of your threads will have to wait because the data hasn't arrived from disk yet. (And having two threads sitting waiting for the disk doesn't make things happen any faster than having one thread sitting waiting.) The only possible benefit is that you avoid dead time by getting the next read underway before you finish processing the previous one, but the OS will often do that for you in any case. So the benefits of multithreading will be marginal at best here.
However, since you say you're doing this as a learning exercise, that may not be a problem...
I'd use a BlockingCollection<string> as the mechanism for passing data from one thread to another. (As long as you're using .NET 4 or later. And if not...I suggest you move to .NET 4 - it will simplify this task considerably.) You'll read a line from the file and put it into the collection from one thread:
string nextLine = myFileReader.ReadLine();
myBlockingCollection.Add(nextLine);
And then some other thread can retrieve lines from that:
while (true)
{
    string lineToProcess = myBlockingCollection.Take();
    ProcessLine(lineToProcess);
}
That'll let the reading thread run through the file just as fast as the disk will let it, while the processing thread processes data at whatever rate it can. The Take method simply sits and waits if your processing thread gets ahead of the file reading thread.
One problem with this is that your reading thread might get way ahead if the file is large and your processing is slow - your program might attempt to read gigabytes of data from a file while having only processed the first few kilobytes. There's not much point reading data way ahead of processing it - you really only want to read a little in advance. You can use BlockingCollection<T>'s bounded capacity to throttle things: pass a capacity to the constructor (e.g. new BlockingCollection<string>(100) - the read-only BoundedCapacity property just reports it), and the call to Add will block once the collection already holds that many lines, so your reading thread won't proceed until the processing loop takes its next line.
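Putting those pieces together, here's a minimal sketch of the whole pattern. The temp-file setup and the ProcessLine body are placeholders just so the snippet runs as-is; substitute your own file path and processing:

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// Demo input so the snippet is self-contained.
string path = Path.Combine(Path.GetTempPath(), "lines-demo.txt");
File.WriteAllLines(path, new[] { "first", "second", "third" });

// Bounded to 100 lines so the reader can't race miles ahead of the processor.
var lines = new BlockingCollection<string>(boundedCapacity: 100);

var reader = Task.Run(() =>
{
    foreach (var line in File.ReadLines(path))
        lines.Add(line);        // blocks once 100 lines are queued
    lines.CompleteAdding();     // lets the consumer's loop terminate cleanly
});

var processor = Task.Run(() =>
{
    // Ends when CompleteAdding has been called and the collection is empty,
    // so no infinite while (true) is needed.
    foreach (var line in lines.GetConsumingEnumerable())
        ProcessLine(line);
});

Task.WaitAll(reader, processor);

void ProcessLine(string line) => Console.WriteLine(line); // placeholder work
```

Using GetConsumingEnumerable plus CompleteAdding also solves the shutdown problem that a bare while (true) { Take(); } loop has: the consumer ends on its own once the file is exhausted.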
It would be interesting to compare performance of a program using your two-threaded technique against one that simply reads lines out of a file and processes them in a loop on a single thread. You would be able to see what, if any, benefit you get from a multithreaded approach here.
Incidentally, if your processing is very CPU intensive, you could use a variation on this theme to have multiple processing threads (and still a single file-reading thread), because BlockingCollection<T> is perfectly happy to have numerous consumers all reading out of the collection. Of course, if the order in which you finish processing the lines of the file matters, that won't be an option, because although you'll start processing in the right order, if you have multiple processing threads, it's possible that one thread might overtake another one, causing out-of-order completion.
Related
I am writing two applications that work with each other. One is in c++, the other in C#. As there is streaming data involved, I am using, in multiple places, code such as:
while (true)
{
//copy streamed data to other places
}
I am noticing that these programs use a lot of CPU and become slow to respond. Is it bad programming to use this kind of loop? Should I be adding a:
Sleep(5);
In each one? Will that help with cpu usage?
Thanks!
Generally, Thread.Sleep() blocks the thread it's called on, so if what you're worried about is responsiveness, it won't help; you should consider moving the streaming methods off the main (UI) thread instead.
Also, you mentioned that this is a streaming process, so the best practice wouldn't be something like
while (!stream.EndOfStream)
{
    // streaming
}
but rather using some DataReceived events (if available)
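For illustration, here is a hand-rolled sketch of that event-driven shape. StreamSource and Push are hypothetical names invented for the example; SerialPort.DataReceived in System.IO.Ports is a real API built on the same idea:

```csharp
using System;

var source = new StreamSource();
int totalBytes = 0;

// The consumer never polls; this handler only runs when data actually arrives.
source.DataReceived += chunk => totalBytes += chunk.Length;

// Simulate two chunks of streamed data arriving.
source.Push(new byte[] { 1, 2, 3 });
source.Push(new byte[] { 4, 5 });
Console.WriteLine(totalBytes); // 5

// A hypothetical source that raises an event instead of being polled;
// whatever really produces data (a socket callback, etc.) would call Push.
class StreamSource
{
    public event Action<byte[]>? DataReceived;

    public void Push(byte[] chunk) => DataReceived?.Invoke(chunk);
}
```

The key property is that no CPU is spent between chunks: the handler is invoked by the producer, so there is no loop spinning while waiting.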
You will probably find that the code is more of the format
while (true)
{
    // WAIT for streamed data to arrive
    // READ data from stream
    // COPY streamed data to other places
    // BREAK when no more data / program is ending
}
which is totally normal. The other common solution is
while (boolean_variable_if_program_is_to_keep_running)
{
    // WAIT for streamed data to arrive
    // READ data from stream
    // COPY streamed data to other places
    // when no more data / program is ending,
    //   set boolean_variable_if_program_is_to_keep_running = FALSE
}
What you really need to avoid (for the health of your CPU) is having your program wait for data in a while (true) loop without using a system threading wait function.
Example 1:
while (true) {
    if (thereIsSomeDataToStream) {
        StreamDataFunction();
    }
}
In this example, if thereIsSomeDataToStream is false for a while, the CPU still runs at 100% executing the while loop even though there is no data to stream. This wastes CPU and slows your computer down.
Example 2:
while (true) {
    if (thereIsSomeDataToStream) {
        StreamDataFunction();
    }
    else {
        SystemThreadingWaitFunction();
    }
}
By contrast, in this example, when there is no more data to stream, the thread stops for a time. The operating system uses this free time to execute other threads and, after a while, wakes your thread, which resumes and loops again to stream any new data. Then there is not much wasted CPU and your computer remains responsive.
To perform the thread waiting procedure, you may have several possibilities:
First, you can use, as you suggested, Sleep(t). This may do the job: Sleep functions provided by the runtime ultimately use the operating system API to idle the current thread. But in this case you will have to wait out the whole specified time, even if data arrives in the meantime. So if the waiting time is too short the CPU will overwork, and if it is too long your streaming will lag.
Or you can use the operating system API directly, which I would recommend. If you are using a Microsoft environment, there are lots of waiting methods you can read about here: Microsoft wait functions API. For example, you can create an Event object which will signal incoming data to stream. You can signal this event anywhere in your code or from another program. In the while loop you may then call a function like the WaitForSingleObject API, which will wait for the event to become signaled. Here is documentation on how to do this: Using event objects. I do not know about Linux or other systems, but I am sure you can find the equivalent on the web. I have done it a few times myself; it is not so hard to code. Enjoy! :)
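As a managed-code sketch of that second option: AutoResetEvent is .NET's wrapper over those event objects, and its WaitOne plays the role of WaitForSingleObject. The queue of ints below is a stand-in for real streamed data:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

var dataReady = new AutoResetEvent(false);   // the "system threading wait" object
var queue = new ConcurrentQueue<int>();
var cts = new CancellationTokenSource();
int processed = 0;

var consumer = new Thread(() =>
{
    while (!cts.IsCancellationRequested)
    {
        dataReady.WaitOne();                 // blocks in the OS, using no CPU
        while (queue.TryDequeue(out int item))
            processed += item;               // stand-in for real streaming work
    }
});
consumer.Start();

// Producer side: enqueue, then signal the event.
queue.Enqueue(1);
dataReady.Set();
queue.Enqueue(2);
dataReady.Set();

Thread.Sleep(200);                           // demo only: let the consumer drain
cts.Cancel();
dataReady.Set();                             // wake the consumer so it can exit
consumer.Join();
Console.WriteLine(processed);                // 3
```

Unlike a fixed Sleep(t), WaitOne returns as soon as the producer calls Set, so there is no tuning trade-off between lag and CPU waste.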
This is expanding on my comment above, but for the sake of completeness I am pasting what I wrote in the comment here (and adding to it).
The general solution to this sort of problem is to wait for arrival of data and process it as it arrives (possibly caching newly arrived data as you're processing previously arrived data). As to the mechanics of the implementation - well, that depends on a concrete example. For example, often in a graphics processing application (games, for instance), the "game loop" is essentially what you describe (without sleeps). This is also encountered in GUI app design. For a design where the app waits for an event before processing, look to typical network client-server design.
As for the slowdown of the CPU, try the following code to observe significant slowdown:
while(true);
versus
while(true) sleep(1);
The reason the first slows down is because on each cycle the CPU checks if the condition in the loop evaluates to true (that is, in this case, if true == true). On the other hand, in the second example, the CPU checks if true == true and then sleeps for 1ms, which is significantly longer than the amount of time it takes to check true == true, freeing the CPU to do other things.
Now, as for your example, presumably processing the data inside the loop takes significantly more CPU cycles than checking true == true. So adding a sleep statement will not help, but only worsen the situation.
You should leave it to the kernel to decide when to interrupt your process to dedicate CPU cycles to other things.
Of course, take what I wrote above with a grain of salt (especially the last paragraph), as it paints a very simplistic picture not taking into account your specific situation, which may benefit from one design versus another (I don't know, because you haven't provided any details).
I have a problem with scalability and processing and I want to get the opinion of the stack overflow community.
I basically have XML data coming down a socket and I want to process that data. For each XML line sent, processing can include writing to a text file, opening a socket to another server, and various database queries, all of which take time.
At the moment my solution involves the following threads:
Thread 1
Accepts incoming sockets and generates child threads that handle each socket (there will only be a couple of incoming sockets from clients). When an XML line comes through (via the ReadLine() method on StreamReader), I put this line into a Queue, which is accessible via a static method on a class. This static method contains locking logic to ensure that the program is thread-safe (I could of course use ConcurrentQueue for this instead of manual locking).
Threads 2-5
Constantly take XML lines from the queue and process them one at a time (database queries, file writes, etc.).
This method seems to be working but I was curious if there is a better way of doing things because this seems very crude. If I take the processing that threads 2-5 do into thread 1 this results in extremely slow performance, which I expected, so I created my worker threads (2-5).
I appreciate I could replace threads 2-5 with a thread pool, but the thread pool would still be reading from the same Queue of XML lines, so I wondered if there is a more efficient way of processing these events instead of using the Queue?
A queue[1] is the right approach. But I would certainly move from manual thread control to the thread pool (so I don't need to do thread management) and let it manage the number of threads.[2]
But in the end there is only so much processing a single computer (however expensive) can do. At some point one of memory size, CPU-memory bandwidth, storage IO, network IO, … is going to be saturated. At that point, using an external queuing system (MSMQ, WebSphere MQ, RabbitMQ, …) with each task being a separate message allows many workers on many computers to process the data (the "competing consumers" pattern).
[1] I would move immediately to ConcurrentQueue: getting locking right is hard, and the less of it you need to do yourself the better.
[2] At some point you might find you need more control than the thread pool provides; that is the time to switch to a custom thread pool. But prototype and test: it is quite possible your implementation will actually be worse. See paragraph 2.
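A minimal sketch of that first suggestion, with pool-backed tasks as the competing consumers. The XML lines here are fabricated placeholders for whatever arrives on the socket, and the Interlocked counter stands in for the real per-line work:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

var lines = new BlockingCollection<string>();
int handled = 0;

// Four pool-backed consumers competing for work from the same queue.
var consumers = Enumerable.Range(0, 4)
    .Select(_ => Task.Run(() =>
    {
        foreach (var line in lines.GetConsumingEnumerable())
        {
            // Stand-in for the real work: DB queries, file writes, ...
            Interlocked.Increment(ref handled);
        }
    }))
    .ToArray();

// The socket-reading thread just adds lines as they arrive.
for (int i = 0; i < 100; i++)
    lines.Add($"<item id=\"{i}\"/>");
lines.CompleteAdding();

Task.WaitAll(consumers);
Console.WriteLine(handled); // 100
```

BlockingCollection wraps a ConcurrentQueue by default, so this replaces both the manual locking and the manual thread management in one step.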
I need to process data from a large number of files, each with thousands of rows. Earlier I was reading each file row by row and processing it; when the number of files increased, processing them all took a lot of time. Then someone said that threads could be used to perform the task in less time. Can threading make this process faster? I'm using the C# language.
It certainly can although it depends on the particular job in question. A very common pattern is to have one thread doing the file IO and multiple threads processing the actual lines.
How many processing threads to start will depend on how many processors/cores you have on your system and how the results of the processing get written out. If the processing time per line is very small, however, you probably won't get much speed improvement from multiple processing threads, and a single processing thread would be optimal.
A good thing to do with a performance question is to assume your code is doing something unnecessary and try to find what it is - measure, review, draw - whatever works for you. I'm not saying the code you have is slow; it's just a way to look at it.
With adding multithreading to the mix first you may find it to be much harder to analyze the code.
More concretely for your task: combining multiple similar operations (like reading a record from a file, or committing to the DB) may save a significant amount of time (you need to prototype and measure).
I would recommend you do batch insert to your database.
You can have one thread read lines into a concurrent queue while another thread pulls the data from it, aggregating it if necessary or doing whatever other operation you need, and then batch-inserts the data into the database. It will save you quite a lot of time.
Inserting one line at a time into the database would be very slow; you have to do batch inserts.
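A sketch of the batching idea. InsertBatch is a hypothetical stand-in for a real bulk insert (e.g. SqlBulkCopy.WriteToServer); here it just counts, so the snippet runs without a database:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

const int BatchSize = 500;
var queue = new BlockingCollection<string>();
int batches = 0, rows = 0;

var writer = Task.Run(() =>
{
    var batch = new List<string>(BatchSize);
    foreach (var line in queue.GetConsumingEnumerable())
    {
        batch.Add(line);
        if (batch.Count == BatchSize)
        {
            InsertBatch(batch);   // one round-trip per 500 rows, not per row
            batch.Clear();
        }
    }
    if (batch.Count > 0)
        InsertBatch(batch);       // flush the final partial batch
});

// Reader side: 1234 fake lines stand in for the file contents.
for (int i = 0; i < 1234; i++)
    queue.Add("line " + i);
queue.CompleteAdding();
writer.Wait();

Console.WriteLine($"{rows} rows in {batches} batches"); // 1234 rows in 3 batches

// Hypothetical stand-in for a real bulk insert.
void InsertBatch(List<string> batch)
{
    batches++;
    rows += batch.Count;
}
```

The batch size is a tuning knob: bigger batches mean fewer round-trips but more memory held and a longer flush at the end.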
Yes, using threads can speed things up.
Threads are to be used when you have time-consuming tasks you can run in the background (for example, when you process say 10 files but only need one, you can have a thread process each of them, which will be a lot faster than processing them all on your main thread).
Please note that there may be related bugs, so you should make sure all threads have finished running before continuing and trying to access what they produced.
Look up "C#.NET multithreading"
Any thread can run a specified function, and BackgroundWorker is a nice class as well (I prefer pure multithreading though).
Also note that this may backfire and wind up slower, but it's a good idea to try.
Threading is one way (there are others) of letting you overlap the processing with the I/O. That means instead of the total time being the sum of the time to read the data and the time to processing the data, you can reduce it to (roughly) whichever of the two is larger (usually the I/O time).
If you mostly want to overlap the I/O time, you might want to look at overlapped I/O and/or I/O completion ports.
Edit: If you're going to do this, you normally want to base the number of I/O threads on the number of separate physical disks you're going to be reading from, and the number of processing threads on the number of processors you have available to do the processing (but only as many as necessary to keep up with the data being supplied by the reader thread). For a typical desktop machine, that will often mean only two threads, one to read and one to process data.
I'm new to threading and want to do something similar to this question:
Speed up loop using multithreading in C# (Question)
However, I'm not sure if that solution is the best one for me as I want them to keep running and never finish. (I'm also using .net 3.5 rather than 2.0 as for that question.)
I want to do something like this:
foreach (Agent agent in AgentList)
{
    // I want to start a new thread for each of these
    agent.DoProcessLoop();
}

---

public void DoProcessLoop()
{
    while (true)
    {
        // do the processing
        // this is things like check folder for new files, update database
        // if new files found
    }
}
Would a ThreadPool be the best solution or is there something that suits this better?
Update: Thanks for all the great answers! I thought I'd explain the use case in more detail. A number of agents can upload files to a folder. Each agent has their own folder which they can upload assets to (csv files, images, pdfs). Our service (it's meant to be a Windows service running on the server they upload their assets to - rest assured I'll be coming back with questions about Windows services sometime soon :)) will keep checking every agent's folder to see whether any new assets are there, and if there are, the database will be updated and, for some of them, static html pages created. As it could take a while for them to upload everything, and we want them to be able to see their uploaded changes pretty much straight away, we thought a thread per agent would be a good idea so that no agent has to wait for someone else to finish (and we have multiple processors, so we wanted to use their full capacity). Hope this explains it!
Thanks,
Annelie
Given the specific usage you describe (watching for files), I'd suggest you use a FileSystemWatcher to determine when there are new files, and then fire off a thread with the threadpool to process the files until there are no more to process - at which point the thread exits.
This should reduce i/o (since you're not constantly polling the disk), reduce CPU usage (since the constant looping of multiple threads polling the disk would use cycles), and reduce the number of threads you have running at any one time (assuming there aren't constant modifications being made to the file system).
You might want to open and read the files only on the main thread and pass the data to the worker threads (if possible), to limit i/o to a single thread.
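A sketch of that approach. The folder path and the ProcessAsset body are placeholders, and the file write at the end just simulates an agent uploading so the snippet demonstrates itself:

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Placeholder folder standing in for one agent's upload directory.
string folder = Path.Combine(Path.GetTempPath(), "agent-uploads-" + Guid.NewGuid().ToString("N"));
Directory.CreateDirectory(folder);

var done = new ManualResetEventSlim();

using var watcher = new FileSystemWatcher(folder);
watcher.Created += (s, e) =>
{
    // Hand the new file off to a pool thread; the watcher stays responsive.
    Task.Run(() => ProcessAsset(e.FullPath));
};
watcher.EnableRaisingEvents = true;

// Simulate an agent uploading an asset.
File.WriteAllText(Path.Combine(folder, "asset.csv"), "id,name");
done.Wait(TimeSpan.FromSeconds(5));   // demo only: wait for the event to fire

void ProcessAsset(string path)
{
    Console.WriteLine($"processing {Path.GetFileName(path)}"); // stand-in for DB update etc.
    done.Set();
}
```

One watcher per agent folder (or one on the parent with IncludeSubdirectories) replaces all the polling loops; no thread runs at all while nothing is being uploaded.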
I believe that the Parallel Extensions make this possible:
Parallel.Foreach
http://msdn.microsoft.com/en-us/library/system.threading.tasks.parallel.foreach.aspx
http://blogs.msdn.com/pfxteam/
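A sketch of what that might look like. Agent here is a guess at the asker's class, reworked to do a single pass per call rather than looping forever:

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

var agents = Enumerable.Range(1, 4).Select(i => new Agent(i)).ToArray();
int checks = 0;

// One body invocation per agent, spread across pool threads.
Parallel.ForEach(agents, agent =>
{
    agent.DoProcessPass();
    Interlocked.Increment(ref checks);
});

Console.WriteLine(checks); // 4

// Hypothetical reconstruction of the asker's Agent class.
class Agent
{
    private readonly int id;
    public Agent(int id) => this.id = id;

    public void DoProcessPass()
    {
        // Stand-in for: check folder for new files, update database.
        Console.WriteLine($"agent {id} checked its folder");
    }
}
```

One caveat: Parallel.ForEach blocks until every body returns, so it only fits if each pass finishes (e.g. the outer service loop calls it repeatedly); with genuinely infinite per-agent loops, dedicated threads are a better match.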
One issue with ThreadPool would be that if the pool happens to be smaller than the number of Agents you would like to have, the ones you try to start later may never execute. Some tasks may never begin to execute, and you could starve everything else in your app domain that uses the thread pool as well. You're probably better off not going down that route.
You definitely don't want to use the ThreadPool for this purpose. ThreadPool threads are not meant to be used for long-running tasks ("infinite" counts as long-running), since that would obviously tie up resources meant to be shared.
For your application, it would probably be better to create one thread (not from the ThreadPool) and in that thread execute your while loop, inside of which you iterate through your Agents collection and perform the processing for each one. In the while loop you should also use a Thread.Sleep call so you don't max out the processor (there are better ways of executing code periodically, but Thread.Sleep will work for your purposes).
Finally, you need to include some way for the while loop to exit when your program terminates.
Update: Finally finally, multi-threading does not automatically speed up slow-running code. Nine women can't make a baby in one month.
A thread pool is useful when you expect threads to be coming into and out of existence fairly regularly, not for a predefined set number of threads.
Hmm... as Ragoczy points out, it's better to use FileSystemWatcher to monitor the files. However, since you have additional operations, you may think in terms of multithreading.
But beware: no matter how many processors you have, there is a limit to their capacity. You may not want to create as many threads as there are concurrent users, for the simple reason that your number of agents can increase.
Until you upgrade to .NET 4, the ThreadPool might be your best option. You may also want to use a Semaphore and an AutoResetEvent to control the number of concurrent threads. If you're talking about long-running work, then the overhead of starting up and managing your own threads is low and the solution is more elegant; it also lets you call Join() on each worker thread so you can make sure they have all completed before you resume execution.
I was googling for some advice about this and I found some links. The most obvious was this one, but in the end what I'm wondering is how well my code is implemented.
I have basically two classes. One is the Converter and the other is ConverterThread
I create an instance of this Converter class, which has a ThreadNumber property that tells me how many threads should run at the same time (this is read from user input). Since this application will be used on multi-CPU systems (physically, like 8 CPUs), the idea is that this will speed up the import.
The Converter instance reads a file that can range from 100mb to 800mb, and each line of this file is a tab-delimited record that is imported to a destination such as a database.
The ConverterThread class simply runs inside the thread (new Thread(ConverterThread.StartThread)) and has event notification so when its work is done it can notify the Converter class and then I can sum up the progress for all these threads and notify the user (in the GUI for example) about how many of these records have been imported and how many bytes have been read.
It seems, however, that I'm having some trouble: I get random errors about the file not being readable, or the sum of the progress (percentage) goes above 100%, which is not possible. I think this happens because the threads are not being well managed, and the information returned by the event is probably malformed (since it "travels" from one thread to another).
Do you have any advice on better thread-implementation practices so I can accomplish this?
Thanks in advance.
I read very large files in some of my own code and, I have to tell you, I am skeptical of any claim that adding threads to a read operation would actually improve the overall read performance. In fact, adding threads might actually reduce performance by causing head seeks. It is highly likely that any file operations of this type would be I/O bound, not CPU bound.
Given that the author of the post you referenced never actually provided the 'real' code, his claims that multiple threads will speed up I/O remain untestable by others. Any attempt to improve hard disk read/write performance by adding threads would most certainly be I/O bound, unless he is doing some serious number crunching between reads, or has stumbled upon some happy coincidence having to do with the disk cache, in which case the performance improvement might be unreproducible on another machine with different hardware characteristics.
Generally, when files of this size are involved, an additional 20% or 30% improvement in performance is not going to matter much, even if it is possible utilizing threads, because such a task would most certainly be considered a background task (not real-time). I use multiple threads for this kind of work, not because it improves read performance on one file, but because multiple files can be processed simultaneously in the background.
Before using threads to do this, I carefully benchmarked the software to see if threads would actually improve overall throughput. The results of the tests (on my development machine) were that using the same number of threads as the number of processor cores produced the maximum possible throughput. But that was processing ONE file per thread.
Multiple threads reading a file at a time is asking for trouble. I would set up a producer-consumer model such that the producer reads the lines in the file, perhaps into a buffer, and then hands them out to the consumer threads when they complete their current workload. It does mean you have a blocking point where the lines are handed out, but if processing takes much longer than reading then it shouldn't be that big of a deal. If reading is the slow part, then you really don't need multiple consumers anyway.
You should try to just have one thread read the file, since multiple threads will likely be bound by the I/O anyway. Then you can feed the lines into a thread-safe queue from which multiple threads can dequeue lines to parse.
You won't be able to tell the progress of any one thread because that thread has no defined amount of work. However, you should be able to track approximate progress by keeping track of how many items (total) have been added to the queue and how many have been taken out. Obviously as your file reader thread puts more lines into the queue your progress will appear to decrease because more lines are available, but presumably you should be able to fill the queue faster than workers can process the lines.
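A sketch of that progress-tracking idea. The three tab-delimited strings stand in for the real file, and Interlocked keeps both shared counters exact no matter which thread touches them - which is exactly what prevents the "above 100%" symptom:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

int totalQueued = 0, totalDone = 0;
var queue = new BlockingCollection<string>();

// Several parser tasks; Interlocked keeps the shared counter consistent.
var workers = Enumerable.Range(0, 4).Select(_ => Task.Run(() =>
{
    foreach (var record in queue.GetConsumingEnumerable())
    {
        // ... parse the tab-delimited record, write to the database ...
        Interlocked.Increment(ref totalDone);
    }
})).ToArray();

// A single reader thread is the only place work is added and counted.
foreach (var record in new[] { "a\tb", "c\td", "e\tf" })  // stands in for the big file
{
    queue.Add(record);
    Interlocked.Increment(ref totalQueued);
}
queue.CompleteAdding();
Task.WaitAll(workers);

// An item is only ever counted done after it was counted queued,
// so this ratio can never exceed 100%.
Console.WriteLine($"{100.0 * totalDone / totalQueued:F0}%"); // 100%
```

A GUI would read the two counters on a timer and display totalDone/totalQueued, instead of having each worker push progress events across threads.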