I have a problem with scalability and processing and I want to get the opinion of the stack overflow community.
I basically have XML data coming down a socket and I want to process that data. For each XML line sent, processing can include writing to a text file, opening a socket to another server, and running various database queries, all of which take time.
At the minute my solution involves the following threads:
Thread 1
Accepts incoming sockets and spawns child threads that handle each socket (there will only be a couple of incoming sockets from clients). When an XML line comes through (via the ReadLine() method on StreamReader), I basically put this line into a Queue, which is accessible via a static method on a class. This static method contains locking logic to ensure that the program is thread-safe (I could of course use ConcurrentQueue<T> for this instead of manual locking).
Threads 2-5
Constantly take XML lines from the queue and process them one at a time (database queries, file writes etc).
This approach seems to be working, but I was curious whether there is a better way of doing things, because this seems very crude. If I move the processing that threads 2-5 do into thread 1, performance becomes extremely slow (which I expected), so I created the worker threads (2-5).
I appreciate I could replace threads 2-5 with a thread pool, but the thread pool would still be reading from the same Queue of XML lines, so I wondered whether there is a more efficient way of processing these events instead of using the Queue?
A queue[1] is the right approach. But I would certainly move from manual thread control to the thread pool (and thus you don't need to do thread management) and let it manage the number of threads.[2]
But in the end there is only so much processing a single computer (however expensive) can do. At some point one of memory size, CPU-memory bandwidth, storage IO, network IO, … is going to be saturated. At that point using an external queuing system (MSMQ, WebSphere MQ, RabbitMQ, …) with each task being a separate message allows many workers on many computers to process the data (the "competing consumers" pattern).
[1] I would move immediately to ConcurrentQueue<T>: getting locking right is hard, and the less of it you need to do yourself the better.
[2] At some point you might find you need more control than the thread pool provides; that is the time to switch to a custom thread pool. But prototype and test: it is quite possible your implementation will actually be worse. See the second paragraph of this answer.
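To make that concrete, here is a minimal sketch of the suggested shape: the socket-reading thread enqueues each XML line into a ConcurrentQueue and hands the work to the default pool. The class name and the HandleLine method are hypothetical stand-ins for your processing (database queries, file writes, outbound sockets); treat it as a sketch of the direction, not a drop-in implementation.

using System.Collections.Concurrent;
using System.Threading;

class XmlLinePump
{
    private static readonly ConcurrentQueue<string> Lines = new ConcurrentQueue<string>();

    // Called from the socket-reading thread for every line received;
    // ConcurrentQueue means no manual locking is needed.
    public static void Enqueue(string xmlLine)
    {
        Lines.Enqueue(xmlLine);

        // One pool work item per line; the pool decides how many threads
        // actually run, replacing the hand-managed worker threads 2-5.
        ThreadPool.QueueUserWorkItem(_ =>
        {
            string line;
            if (Lines.TryDequeue(out line))
                HandleLine(line);
        });
    }

    private static void HandleLine(string line)
    {
        // Database queries, file writes, outbound sockets, ...
    }
}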
Related
I've created a C# WPF project in which I have to process a CSV file; the number of records could be anywhere from a few hundred to a few thousand to millions. I need to read a line of the file, process the record (which generally takes 5 to 10 seconds), and then update the record with a new value.
The operation consists of a network call to a server through a web service; that server then calls another server to connect to an authority server, and the authority server responds with data back along the same path. The authority takes time because it has a very large database of about one billion records, so the encrypt/decrypt and authenticate operation takes about 5-10 seconds to complete.
I cannot perform the operation on one thread, as processing the whole file could take months, so I want to create hundreds of threads to process the data. The approach I'm considering is to create a controller thread that spins up to 100 worker threads and monitors them for availability. When a thread returns data after processing, it writes the result to the file, and a new thread is created for the next line.
This approach seems too complex. Should I implement it anyway, and how, or how else should I solve the problem?
There are two options that can help you here:
Parallel LINQ
TPL Dataflow
Parallel LINQ is the simpler option, but provides a lot less customization. It would look something like:
var results = File.ReadLines("input.csv")
                  .AsParallel()
                  .AsOrdered()
                  .WithDegreeOfParallelism(100)
                  .Select(ProcessLine);
File.WriteAllLines("output.csv", results);
(You need to implement the ProcessLine method, of course.)
Now that will give you a lot of parallelism, but probably via lots of threads which are blocked much of the time... whereas a more sophisticated solution would use asynchronous IO, in which case you would probably hardly need any threads at all.
One thing to be aware of: if you're making web requests over the network, you may need to configure the maximum number of requests you can make in parallel to the host. See ServicePointManager.DefaultConnectionLimit and the <connectionManagement> settings element.
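For comparison, a rough TPL Dataflow sketch (it requires the System.Threading.Tasks.Dataflow NuGet package) might look like the following. ProcessLine is the same hypothetical per-record method as above, and the 100 simply mirrors the degree of parallelism in the PLINQ example:

using System.IO;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

class CsvPipeline
{
    public static async Task RunAsync()
    {
        // Process up to 100 records concurrently.
        var processor = new TransformBlock<string, string>(
            line => ProcessLine(line),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 100 });

        using (var writer = new StreamWriter("output.csv"))
        {
            // Results leave a TransformBlock in input order by default.
            var sink = new ActionBlock<string>(result => writer.WriteLine(result));
            processor.LinkTo(sink, new DataflowLinkOptions { PropagateCompletion = true });

            foreach (var line in File.ReadLines("input.csv"))
                processor.Post(line);

            processor.Complete();
            await sink.Completion;
        }
    }

    private static string ProcessLine(string line)
    {
        return line; // stand-in for the real 5-10 second web-service call
    }
}

The payoff over PLINQ is that you can later swap the delegate for an async lambda, at which point in-flight web requests stop holding a blocked thread each.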
We wrote a service that uses ~200 threads.
Each of the 200 threads must:
1. Download a document from the internet
2. Parse the raw data (HTML, XML, JSON, ...)
3. Store the newly created data in the database
With ~10 threads, the elapsed time for the second operation (parsing) is 50 ms per thread.
With ~50 threads, the elapsed time for the second operation (parsing) is 80-18,000 ms per thread.
So we had an idea!
We can keep the downloads multithreaded, but use MSMQ to send the raw data to another process (the consumer), and have that other process implement the second part (parsing) single-threaded.
You might ask why we don't just use the C# Queue class in the same process: we could not protect our "precious parsing thread" from thread context switches. If there are 200 threads in the same process, the parsing thread becomes a context-switch victim.
Is it normal to use MSMQ for this requirement?
Yes, this is an excellent example of where MSMQ makes a lot of sense. You can offload your difficult work to a different process to handle without affecting the performance of your current process which clearly doesn't care about the results. Not only that, but if your new worker process goes down, the queue will preserve state and messages (other than maybe the one being worked on when it went down) will not be lost.
Depending on your needs and goals I'd consider offloading the download to the other process as well - passing URLs to work on to the queue for example. Then, scaling up your system is as easy as dialing up the queue receivers, since queue messages are received in a thread safe manner when implemented correctly.
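To make the hand-off concrete, here is a minimal System.Messaging sketch under some assumptions: a local private queue named rawDataQueue (hypothetical) and a Parse method standing in for the existing parsing code. The producer lives in the 200-thread downloader; the consumer runs as a separate, single-threaded process.

using System.Messaging; // reference System.Messaging.dll

// Producer side: downloader threads hand raw documents off to MSMQ.
class RawDataProducer
{
    private const string QueuePath = @".\private$\rawDataQueue";

    public static void Send(string rawDocument)
    {
        if (!MessageQueue.Exists(QueuePath))
            MessageQueue.Create(QueuePath);

        using (var queue = new MessageQueue(QueuePath))
        {
            queue.Formatter = new XmlMessageFormatter(new[] { typeof(string) });
            queue.Send(rawDocument);
        }
    }
}

// Consumer side: runs in a separate process, so parsing never competes
// with the 200 download threads for context switches.
class RawDataConsumer
{
    private const string QueuePath = @".\private$\rawDataQueue";

    public static void Run()
    {
        using (var queue = new MessageQueue(QueuePath))
        {
            queue.Formatter = new XmlMessageFormatter(new[] { typeof(string) });
            while (true)
            {
                Message message = queue.Receive(); // blocks until data arrives
                Parse((string)message.Body);       // single-threaded parsing
            }
        }
    }

    private static void Parse(string raw)
    {
        // HTML/XML/JSON parsing goes here.
    }
}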
Yes, it is normal. And there are frameworks/libraries that help you build these kinds of solutions, providing you with more than just transport.
NServiceBus and MassTransit are examples (both can sit on top of MSMQ).
In a thread "A", I want to read a very long file, and as that happens, I want to send each new line read to another thread "B", which would do -something- to them.
Basically, I don't want to wait for the file-loading to finish before I start processing the lines.
(I definitely want two threads and communication between them; I've never done this before and I want to learn.)
So, how do I go about doing this?
Thread A should wait for thread B to finish processing the current line before thread A sends another line to thread B. But that won't be efficient, so how about a buffer in thread B (to catch the lines)?
Also, please give an example of which methods I should use for this cross-thread communication, since I haven't found or seen any useful examples.
Thank you.
First of all, it's not clear that two threads will necessarily be useful here. A single thread reading one line at a time (which is pretty easy with StreamReader) and processing each line as you go might perform at least as well. File reads are buffered, and the OS can read ahead of your code requesting data, in which case most of your reads will either complete immediately because the next line has already been read off disk in advance by the OS, or both of your threads will have to wait because the data isn't there on disk. (And having 2 threads sat waiting for the disk doesn't make things happen any faster than having 1 thread sat waiting.) The only possible benefit is that you avoid dead time by getting the next read underway before you finish processing the previous one, but the OS will often do that for you in any case. So the benefits of multithreading will be marginal at best here.
However, since you say you're doing this as a learning exercise, that may not be a problem...
I'd use a BlockingCollection<string> as the mechanism for passing data from one thread to another. (As long as you're using .NET 4 or later. And if not...I suggest you move to .NET 4 - it will simplify this task considerably.) You'll read a line from the file and put it into the collection from one thread:
string nextLine = myFileReader.ReadLine();
myBlockingCollection.Add(nextLine);
And then some other thread can retrieve lines from that:
while (true)
{
    string lineToProcess = myBlockingCollection.Take();
    ProcessLine(lineToProcess);
}
That'll let the reading thread run through the file just as fast as the disk will let it, while the processing thread processes data at whatever rate it can. The Take method simply sits and waits if your processing thread gets ahead of the file reading thread.
One problem with this is that your reading thread might get way ahead if the file is large and your processing is slow - your program might attempt to read gigabytes of data from a file while having only processed the first few kilobytes. There's not much point reading data way ahead of processing it - you really only want to read a little in advance. You can bound the BlockingCollection<T> to throttle things: the capacity is passed to the constructor (and exposed via the read-only BoundedCapacity property). If you set that to some number, the call to Add will block once the collection already holds that number of lines, and your reading thread won't proceed until the processing loop takes its next line.
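For example, to buffer at most 100 unprocessed lines:

// The reader's Add call blocks while 100 lines are already waiting.
var myBlockingCollection = new BlockingCollection<string>(boundedCapacity: 100);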
It would be interesting to compare performance of a program using your two-threaded technique against one that simply reads lines out of a file and processes them in a loop on a single thread. You would be able to see what, if any, benefit you get from a multithreaded approach here.
Incidentally, if your processing is very CPU intensive, you could use a variation on this theme to have multiple processing threads (and still a single file-reading thread), because BlockingCollection<T> is perfectly happy to have numerous consumers all reading out of the collection. Of course, if the order in which you finish processing the lines of the file matters, that won't be an option, because although you'll start processing in the right order, if you have multiple processing threads, it's possible that one thread might overtake another one, causing out-of-order completion.
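Here is a sketch of that multi-consumer variant put together: one reading thread, a bounded collection, and several processing threads. The file name, the worker count of 4, and ProcessLine are all placeholders.

using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading;

class MultiConsumerReader
{
    public static void Run()
    {
        var lines = new BlockingCollection<string>(boundedCapacity: 100);

        var reader = new Thread(() =>
        {
            foreach (var line in File.ReadLines("input.txt"))
                lines.Add(line);        // blocks when 100 lines are buffered
            lines.CompleteAdding();     // lets the consumers' loops finish
        });
        reader.Start();

        var workers = Enumerable.Range(0, 4).Select(_ => new Thread(() =>
        {
            // Blocks while the collection is empty; the loop ends once
            // CompleteAdding has been called and the collection drains.
            foreach (var line in lines.GetConsumingEnumerable())
                ProcessLine(line);
        })).ToList();

        workers.ForEach(w => w.Start());
        reader.Join();
        workers.ForEach(w => w.Join());
    }

    private static void ProcessLine(string line)
    {
        // CPU-intensive per-line work goes here.
    }
}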
I'm creating a server-type application at the moment which will do the usual listening for connections from external clients and, when they connect, handle requests, etc.
At the moment, my implementation creates a pair of threads every time a client connects. One thread simply reads requests from the socket and adds them to a queue, and the second reads the requests from the queue and processes them.
I'm basically looking for opinions on whether or not you think having all of these threads is overkill, and importantly whether this approach is going to cause me problems.
It is important to note that most of the time these threads will be idle - I use wait handles (ManualResetEvent) in both threads. The Reader thread waits until a message is available and if so, reads it and dumps it in a queue for the Process thread. The Process thread waits until the reader signals that a message is in the queue (again, using a wait handle). Unless a particular client is really hammering the server, these threads will be sat waiting. Is this costly?
I've done a bit of testing - I had 1,000 clients connected, continually nagging the server (so 2,000+ threads) - and it seemed to cope quite well.
I think your implementation is flawed. This kind of design doesn't scale because creating threads is expensive and there is a limit on how many threads can be created.
That is the reason most implementations of this type use a thread pool. It makes it easy to put a cap on the maximum number of threads while easily managing new connections and reusing threads when the work is finished.
If all you are doing with your thread is putting items in a queue, then use the ThreadPool.QueueUserWorkItem method to use the default .NET thread pool.
You haven't given enough information in your question to say for definite, but perhaps you now only need one other thread constantly running and clearing down the queue; you can use a wait handle to signal when something has been added.
Just make sure to synchronise access to your queue or things will go horribly wrong.
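As a rough sketch of that single-consumer shape, with queue access synchronised as just described (Request and Process are hypothetical placeholders):

using System.Collections.Generic;
using System.Threading;

class Request { }

class RequestDispatcher
{
    private readonly Queue<Request> _queue = new Queue<Request>();
    private readonly AutoResetEvent _signal = new AutoResetEvent(false);

    // Called by any socket-reader thread.
    public void Post(Request request)
    {
        lock (_queue)
        {
            _queue.Enqueue(request);
        }
        _signal.Set(); // wake the processing thread if it is waiting
    }

    // Runs on the single processing thread.
    public void ProcessLoop()
    {
        while (true)
        {
            Request next = null;
            lock (_queue)
            {
                if (_queue.Count > 0)
                    next = _queue.Dequeue();
            }

            if (next != null)
                Process(next);
            else
                _signal.WaitOne(); // queue empty: sleep until signalled
        }
    }

    private void Process(Request request)
    {
        // Handle the request.
    }
}

Because an AutoResetEvent stays signalled if Set is called while nobody is waiting, a message posted between the empty check and WaitOne is not lost: the next WaitOne returns immediately.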
I'd advise using the following pattern. First you need a thread pool - built-in or custom. Have one thread that checks whether there is something available to read; if there is, it hands the socket to a reader thread. The reader thread puts the data into a queue, and a thread from the pool of processing threads then picks it up. This minimises the number of threads and the time spent in a waiting state.
I was googling for some advice about this and I found some links. The most obvious was this one, but in the end what I'm wondering is how well my code is implemented.
I have basically two classes: one is the Converter and the other is the ConverterThread.
I create an instance of this Converter class, which has a ThreadNumber property that tells me how many threads should run at the same time (this is read from the user). This application will be used on multi-CPU systems (physically, like 8 CPUs), so the assumption is that this will speed up the import.
The Converter instance reads a file that can range from 100 MB to 800 MB, and each line of this file is a tab-delimited value record that is imported into another destination, such as a database.
The ConverterThread class simply runs inside the thread (new Thread(ConverterThread.StartThread)) and has event notification, so when its work is done it can notify the Converter class; I can then sum up the progress for all these threads and tell the user (in the GUI, for example) how many of these records have been imported and how many bytes have been read.
However, it seems that I'm having some trouble: I get random errors about the file not being readable, or the sum of the progress (percentage) goes above 100%, which should be impossible. I think that happens because the threads are not being managed well, and the information returned by the events is probably malformed (since it "travels" from one thread to another).
Do you have any advice on better thread-implementation practices so I can accomplish this?
Thanks in advance.
I read very large files in some of my own code and, I have to tell you, I am skeptical of any claim that adding threads to a read operation would actually improve the overall read performance. In fact, adding threads might actually reduce performance by causing head seeks. It is highly likely that any file operations of this type would be I/O bound, not CPU bound.
Given that the author of the post you referenced never actually provided the 'real' code, his claims that multiple threads will speed up I/O remain untestable by others. Any attempt to improve hard disk read/write performance by adding threads would most certainly be I/O bound, unless he is doing some serious number crunching between reads, or has stumbled upon some happy coincidence having to do with the disk cache, in which case the performance improvement might be unreproducible on another machine with different hardware characteristics.
Generally, when files of this size are involved, an additional 20% or 30% improvement in performance is not going to matter much, even if it is possible utilizing threads, because such a task would most certainly be considered a background task (not real-time). I use multiple threads for this kind of work, not because it improves read performance on one file, but because multiple files can be processed simultaneously in the background.
Before using threads to do this, I carefully benchmarked the software to see if threads would actually improve overall throughput. The results of the tests (on my development machine) were that using the same number of threads as the number of processor cores produced the maximum possible throughput. But that was processing ONE file per thread.
Multiple threads reading a file at a time is asking for trouble. I would set up a producer consumer model such that the producer read the lines in the file, perhaps into a buffer, and then handed them out to the consumer threads when they complete processing their current work load. It does mean you have a blocking point where the lines are handed out but if processing takes much longer than reading then it shouldn't be that big of a deal. If reading is the slow part then you really don't need multiple consumers anyway.
You should try to just have one thread read the file, since multiple threads will likely be bound by the I/O anyway. Then you can feed the lines into a thread-safe queue from which multiple threads can dequeue lines to parse.
You won't be able to tell the progress of any one thread because that thread has no defined amount of work. However, you should be able to track approximate progress by keeping track of how many items (total) have been added to the queue and how many have been taken out. Obviously as your file reader thread puts more lines into the queue your progress will appear to decrease because more lines are available, but presumably you should be able to fill the queue faster than workers can process the lines.
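One way to stop that progress arithmetic from ever exceeding 100% is to update both counters with Interlocked and compute the percentage on demand (from a GUI timer, say) instead of passing progress values between threads in events. A small sketch, with hypothetical member names:

using System.Threading;

class ImportProgress
{
    private long _linesQueued;
    private long _linesProcessed;

    // Called by the file-reading thread after each enqueue.
    public void OnLineQueued() { Interlocked.Increment(ref _linesQueued); }

    // Called by a worker thread after each line is finished.
    public void OnLineProcessed() { Interlocked.Increment(ref _linesProcessed); }

    // Poll this from a GUI timer; processed can never exceed queued,
    // so the result can never exceed 100%.
    public double PercentComplete
    {
        get
        {
            long queued = Interlocked.Read(ref _linesQueued);
            long processed = Interlocked.Read(ref _linesProcessed);
            return queued == 0 ? 0 : 100.0 * processed / queued;
        }
    }
}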