I have a data table full of summary entries, and my software needs to go through each one, reach out to a web service to get details, then record those details back to the database. Looping through the table synchronously, calling the web service and waiting for each response, is too slow (there are thousands of entries), so I'd like to take the entries 10 or so at a time and thread the work out so it performs 10 operations at the same time.
My experience with C# threads is limited to say the least, so what's the best approach? Does .NET have some sort of threadsafe queue system that I can use to make sure that the results get handled properly and in order?
Depending on which version of the .NET Framework you have, there are two pretty good options.
You can use ThreadPool.QueueUserWorkItem in any version.
int pending = table.Rows.Count;
var finished = new ManualResetEvent(false);
foreach (DataRow row in table.Rows)
{
DataRow capture = row; // Required to close over the loop variable correctly.
ThreadPool.QueueUserWorkItem(
(state) =>
{
try
{
ProcessDataRow(capture);
}
finally
{
if (Interlocked.Decrement(ref pending) == 0)
{
finished.Set(); // Signal completion of all work items.
}
}
}, null);
}
finished.WaitOne(); // Wait for all work items to complete.
If you are using .NET Framework 4.0, you can use the Task Parallel Library.
var tasks = new List<Task>();
foreach (DataRow row in table.Rows)
{
DataRow capture = row; // Required to close over the loop variable correctly.
tasks.Add(
Task.Factory.StartNew(
() =>
{
ProcessDataRow(capture);
}));
}
Task.WaitAll(tasks.ToArray()); // Wait for all work items to complete.
There are many other reasonable ways to do this. I highlight the patterns above because they are easy and work well. In the absence of specific details I cannot say for certain that either will be a perfect match for your situation, but they should be a good starting point.
Update:
I had a short period of subpar cerebral activity. If you have the TPL available, you could also use Parallel.ForEach as a simpler alternative to all of that Task hocus-pocus I mentioned above.
Parallel.ForEach(table.Rows.Cast<DataRow>(), // Cast is needed because DataRowCollection only implements the non-generic IEnumerable (add using System.Linq)
    row =>
    {
        ProcessDataRow(row);
    });
Does .NET have some sort of threadsafe queue system that I can use to make sure that the results get handled properly and in order?
This was added in .NET 4. The BlockingCollection<T> class, by default, acts as a thread-safe queue for producer/consumer scenarios.
It makes it fairly easy to have a number of threads that "consume" items from the collection and process them, while one or more other threads add items to the collection.
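For example, here is a minimal producer/consumer sketch (ProcessItem is a placeholder for whatever per-item work you need; requires System.Collections.Concurrent and System.Threading.Tasks):
var queue = new BlockingCollection<string>();

// Producer: add work items, then signal that no more are coming.
Task.Factory.StartNew(() =>
{
    for (int i = 0; i < 100; i++)
        queue.Add("item " + i);
    queue.CompleteAdding();
});

// Consumers: GetConsumingEnumerable blocks until items arrive and
// finishes once CompleteAdding has been called and the queue is drained.
var consumers = new Task[4];
for (int c = 0; c < consumers.Length; c++)
{
    consumers[c] = Task.Factory.StartNew(() =>
    {
        foreach (var item in queue.GetConsumingEnumerable())
            ProcessItem(item); // placeholder for your own work
    });
}

Task.WaitAll(consumers);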
Related
I want to process a list of 5000 items. For each item, the process can be very quick (1 sec) or take much longer (>1 min), but I want to process the whole list as fast as possible.
I can't just put these 5000 items into the .NET ThreadPool, and I also need to know when all the items have been processed, so I was thinking of having a specific number of threads and doing:
foreach(var item in items)
{
// wait for a Thread to be available
// give the item to process to the Thread
}
but what is the easiest way to do that in C#? Should I use Threads, or are there some higher-level classes that I could use?
I would start with Parallel.ForEach and measure your performance. That is a simple, powerful approach and the scheduling does a pretty decent job for a generic scheduler.
Parallel.ForEach(items, i => { /* your code here */ });
I can't just put these 5000 items into the .NET ThreadPool
Nor would you want to. It is relatively expensive to create a thread, and context switches take time. If you had, say, 8 cores processing 5000 threads, a meaningful fraction of your execution time would be spent on context switches.
To do parallel processing, this is the structure to use:
Parallel.ForEach(items, (item) =>
{
    // ...
});
and if you do not want to overload the thread pool, you can use ParallelOptions:
var po = new ParallelOptions
{
    MaxDegreeOfParallelism = 5
};
Parallel.ForEach(items, po, (item) =>
{
    // ...
});
I agree with the answers recommending Parallel.ForEach. Without knowing all of the specifics (like what's going on in the loop) I can't say for certain, but as long as the iterations in the loop aren't doing anything that conflicts with each other (like concurrent operations on some other object that isn't thread-safe), it should be fine.
You mentioned in a comment that it's throwing an exception. That can be a problem because if one iteration throws an exception then the loop will terminate leaving your tasks only partially complete.
To avoid that, handle exceptions within each iteration of the loop. For example,
var exceptions = new ConcurrentQueue<Exception>();
Parallel.ForEach(items, i =>
{
try
{
//Your code to do whatever
}
catch(Exception ex)
{
exceptions.Enqueue(ex);
}
});
By using a ConcurrentQueue any iteration can safely add its own exception. When it's done you have a list of exceptions. Now you can decide what to do with them. You could throw a new exception:
if (exceptions.Count > 0) throw new AggregateException(exceptions);
Or, if there's something that uniquely identifies each item, you could do (for example):
var exceptions = new ConcurrentDictionary<Guid, Exception>();
And then when an exception is thrown,
exceptions.TryAdd(item.Id, ex); //making up the Id property
Now you know specifically which items succeeded and which failed.
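Putting that variant together, a rough sketch might look like this (the item type with an Id property and the ProcessItem call are hypothetical placeholders for your own types and work):
var exceptions = new ConcurrentDictionary<Guid, Exception>();
Parallel.ForEach(items, item =>
{
    try
    {
        ProcessItem(item); // placeholder for your per-item work
    }
    catch (Exception ex)
    {
        exceptions.TryAdd(item.Id, ex); // record which item failed
    }
});
// Any item whose Id is not in 'exceptions' completed successfully.
if (!exceptions.IsEmpty)
    throw new AggregateException(exceptions.Values);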
When initializing the UI in my C# Silverlight application, I make several asynchronous calls to different services. While doing all the loading asynchronously is very nice and speedy, there are still times when I need some step of the loading to happen at the very end.
On some views in the past, I had implemented a "loading list" mechanic to help me keep track of loading, and guarantee the order of whatever actions are picky about when they fire. Here is a very simplified example:
private List<string> _loadingList = new List<string>();
// Called to begin the loading process
public void LoadData(List<long> IDs){
foreach(long id in IDs){
DoSomethingToLoadTheID(id);
_loadingList.Add(id.ToString());
}
}
// Called every time an ID finishes loading
public void LoadTheIDCompleted(object sender, ServiceArgs e){
UseTheLoadedData(e);
_loadingList.Remove(e.Id.ToString()); // assumes the completed ID is exposed on the event args
if(_loadingList.Count == 0)
LoadDataFinally();
}
// Must be called after all other loading is completed
public void LoadDataFinally(){
ImportantFinishingTouches();
}
This thing works for my purposes, and I haven't experienced any problems with it yet. But I am not as confident about my knowledge of thread safety as I'd like to be, so I'd like to ask a couple of questions:
Is there any way this kind of thing can mess up catastrophically?
Is there a better way to accomplish this same functionality?
(I'm using Visual Studio 2013 and .NET Framework 4.5.51209, and Silverlight 5.0)
Is there any way this kind of thing can mess up catastrophically?
Depending on what you are doing, absolutely. First, you are accessing a List<T> from multiple threads, and List<T> is not thread safe. You should be using a thread-safe collection from System.Collections.Concurrent instead (such as ConcurrentQueue<T> or ConcurrentDictionary<TKey, TValue>). Remember that ANY class (including .NET Framework components) being accessed/modified by multiple threads must be thread safe. MSDN indicates which framework components are and are not.
Is there a better way to accomplish this same functionality?
I don't see anywhere in your code where multiple threads are being used, but I assume that what you are trying to do is:
Get a list of IDs
Run some long-running or CPU-bound process for each ID
Remove that ID from the list of IDs
Depending on the nature of what you are doing with each ID, the approach may be different. For instance, if the work is CPU-bound then you could use Tasks/TPL:
// Using TPL
var ids = new List<int>() {1, 2, 3, 4};
Parallel.ForEach(ids, id => DoSomething(id)); // Invokes DoSomething on each ID in the list in parallel
Or if you need fine-grained control of the order that things execute...
var ids = new List<int>() {1, 2, 3, 4};
TaskFactory factory = new TaskFactory();
List<Task> tasks = new List<Task>();
foreach (var id in ids)
{
tasks.Add(factory.StartNew(() => DoSomething(id))); // executes async and keeps track of the task in the list
}
Task.WaitAll(tasks.ToArray()); // waits till everything is done
tasks.Clear();
foreach (var id in ids)
{
tasks.Add(factory.StartNew(() => DoSomethingElse(id))); // executes async and keeps track of the task in the list
}
Task.WaitAll(tasks.ToArray()); // Wait till all DoSomethingElse is done
// etc.
Now if the work you are doing is IO bound (eg. making webservice calls where you have to wait for slow responses), you should look into Asynchronous Programming with async and await (C#).
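As a rough sketch of that async/await approach (GetDetailsAsync and SaveDetailsAsync are hypothetical names standing in for your own service and database calls; requires using System.Linq and System.Threading.Tasks):
public async Task LoadAllAsync(IEnumerable<long> ids)
{
    // Start all IO-bound calls without tying up threads, then await them together.
    var tasks = ids.Select(async id =>
    {
        var details = await GetDetailsAsync(id);  // hypothetical web service call
        await SaveDetailsAsync(id, details);      // hypothetical database write
    });
    await Task.WhenAll(tasks);
}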
There are a lot of ways to handle multi-threading, and sometimes figuring out the right way is part of the challenge (async/await vs. tasks vs. semaphores vs. background workers vs. etc.).
I'm playing around with BlockingCollection to try to understand it better, but I'm struggling to understand why my code hangs after it finishes processing all my items when I use Parallel.For.
I'm just adding a number to it (producer?):
var blockingCollection = new BlockingCollection<long>();
Task.Factory.StartNew(() =>
{
while (count <= 10000)
{
blockingCollection.Add(count);
count++;
}
});
Then I'm trying to process (Consumer?):
Parallel.For(0, 5, x =>
{
foreach (long value in blockingCollection.GetConsumingEnumerable())
{
total[x] += 1;
Console.WriteLine("Worker {0}: {1}", x, value);
}
});
But when it completes processing all the numbers, it just hangs there? What am I doing wrong?
Also, when I set my Parallel.For to 5, does it mean it's processing the data on 5 separate threads?
As its name implies, operations on BlockingCollection<T> block when they can't do anything, and this includes GetConsumingEnumerable().
The reason for this is that the collection can't tell if your producer is already done, or just busy producing the next item.
What you need to do is to notify the collection that you're done adding items to it by calling CompleteAdding(). For example:
while (count <= 10000)
{
blockingCollection.Add(count);
count++;
}
blockingCollection.CompleteAdding();
This is a feature of the GetConsumingEnumerable method.
Enumerating the collection in this way blocks the consumer thread if no items are available or if the collection is empty.
You can read more about it in the documentation for BlockingCollection<T>.GetConsumingEnumerable.
Also, using Parallel.For(0, 5) doesn't guarantee that the data will be processed on 5 separate threads; the actual number of threads used depends on the scheduler and on factors such as Environment.ProcessorCount.
Also, when I set my Parallel.For to 5, does it mean it's processing the data on 5 separate threads?
No, quoting from a previous answer on SO (How many threads Parallel.For(Foreach) will create? Default MaxDegreeOfParallelism?):
The default scheduler for Task Parallel Library and PLINQ uses the .NET Framework ThreadPool to queue and execute work. In the .NET Framework 4, the ThreadPool uses the information that is provided by the System.Threading.Tasks.Task type to efficiently support the fine-grained parallelism (short-lived units of work) that parallel tasks and queries often represent.
Put simply, the TPL creates Tasks, not threads; the framework decides how many threads should handle them.
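A quick way to see this for yourself (a small experiment, not part of the original code) is to log the managed thread ID inside the loop; typically only a handful of distinct IDs show up, regardless of the iteration count:
// Requires using System, System.Threading and System.Threading.Tasks.
Parallel.For(0, 20, i =>
{
    Console.WriteLine("Iteration {0} on thread {1}", i, Thread.CurrentThread.ManagedThreadId);
});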
I have a method that returns XML elements, but that method takes some time to finish and return a value.
What I have now is
foreach (var t in s)
{
    r.Add(method(t));
}
but this only runs each call after the previous one finishes. How can I make them run concurrently?
You should be able to use tasks for this:
//first start a task for each element in s, and add the tasks to the tasks collection
var tasks = new List<Task>();
foreach( var t in s)
{
tasks.Add(Task.Factory.StartNew(method(t)));
}
//then wait for all tasks to complete
Task.WaitAll(tasks);
//then add the result of all the tasks to r in a thread-safe fashion
foreach( var task in tasks)
{
r.Add(task.Result);
}
EDIT
There are some problems with the code above. See the code below for a working version. Here I have also rewritten the loops to use LINQ for readability (and, in the case of the first loop, to avoid problems caused by the closure over t inside the lambda expression).
var tasks = s.Select(t => Task.Factory.StartNew(() => method(t))).ToArray(); // the task's result type is inferred from method's return type
//then wait for all tasks to complete
Task.WaitAll(tasks);
//then add the result of all the tasks to r in a thread-safe fashion
r = tasks.Select(task => task.Result).ToList();
You can use Parallel.ForEach, which will use multiple threads to execute the work in parallel. You have to make sure that all the code being called is thread-safe and can be executed in parallel.
Parallel.ForEach(s, t => r.Add(method(t)));
From what I'm seeing, you are updating a shared collection inside the loop. This means that if you execute the loop in parallel, a data race will occur because multiple threads will try to update a non-synchronized collection (assuming r is a List or something like it) at the same time, leaving it in an inconsistent state.
To execute correctly in parallel, you will need to wrap that section of code inside a lock statement:
object locker = new object();
Parallel.ForEach(s,
    t =>
    {
        lock (locker) r.Add(method(t));
    });
However, this will make the execution actually serial, because each thread needs to acquire the lock and two threads cannot do so at the same time.
The better solution would be to have a local list for each thread, add the partial results to that list, and then merge the lists when all threads have finished (see the sketch below). Probably @Øyvind Knobloch-Bråthen's second solution is the best one, assuming method(t) is the real CPU hog in this case.
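A sketch of that thread-local approach using the Parallel.ForEach overload with localInit/localFinally (this assumes method(t) returns an XElement, which requires System.Xml.Linq; substitute whatever type your method actually returns):
var results = new List<XElement>();
object locker = new object();
Parallel.ForEach(
    s,
    () => new List<XElement>(),      // localInit: one private list per worker
    (t, loopState, localList) =>     // body: no locking needed here
    {
        localList.Add(method(t));
        return localList;
    },
    localList =>                     // localFinally: merge each worker's list once
    {
        lock (locker)
            results.AddRange(localList);
    });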
Modification to the correct answer for this question
change
tasks.Add(Task.Factory.StartNew(method(t)));
to
//wrap the call in a lambda so a delegate, not the result of method(t), is passed to StartNew
tasks.Add(Task.Factory.StartNew(() => { method(t); }));
I need to improve the performance of a foreach loop.
//Pseudocode
foreach (var item in items)
{
//Call service to open DB conn and get data
}
Within this loop I make a call to a service that opens a SQL Server session, gets data from the database, and closes the session, once for each iteration.
What can I do?
Thanks.
Well that does sound like a perfectly good use of Parallel.ForEach - so have you tried it?
Parallel.ForEach(queries, query => {
// Perform query
});
You may well want to specify options around the level of parallelism etc., and make sure your connection pool supports as many connections as you want. And of course, measure the performance before and after to make sure it's actually helping.
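For example, to cap the number of concurrent queries (the value 10 here is purely illustrative; tune it against your connection pool size):
var options = new ParallelOptions { MaxDegreeOfParallelism = 10 }; // illustrative value only
Parallel.ForEach(queries, options, query =>
{
    // Perform query: open the connection, read the data, close the connection.
});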
Perhaps you could start a new thread in each iteration:
foreach (item i in collection)
{
    Thread t = new Thread(functionToCall);
    t.Start();
}

void functionToCall()
{
    database = openSQLSession();
    data databaseData = database.getData();
    dataCollection.Add(databaseData); // dataCollection must be thread-safe if several threads add to it at once
    closeSQLSession();
}
Of course this is a simple example and pretty pseudocode-y, but I hope you get the gist of it?