When initializing the UI in my C# Silverlight application, I make several asynchronous calls to different services. While doing all the loading asynchronously is very nice and speedy, there are still times when I need some step of the loading to happen at the very end.
On some views in the past, I had implemented a "loading list" mechanic to help me keep track of loading, and guarantee the order of whatever actions are picky about when they fire. Here is a very simplified example:
private List<string> _loadingList = new List<string>();

// Called to begin the loading process
public void LoadData(List<long> IDs){
    foreach(long id in IDs){
        DoSomethingToLoadTheID(id);
        _loadingList.Add(id.ToString());
    }
}

// Called every time an ID finishes loading
public void LoadTheIDCompleted(object sender, ServiceArgs e){
    UseTheLoadedData(e);
    // id of the completed item, obtained from e (omitted in this simplified example)
    _loadingList.Remove(id.ToString());
    if(_loadingList.Count == 0)
        LoadDataFinally();
}

// Must be called after all other loading is completed
public void LoadDataFinally(){
    ImportantFinishingTouches();
}
This thing works for my purposes, and I haven't experienced any problems with it yet. But I am not as confident about my knowledge of thread safety as I'd like to be, so I'd like to ask a couple of questions:
Is there any way this kind of thing can mess up catastrophically?
Is there a better way to accomplish this same functionality?
(I'm using Visual Studio 2013 and .NET Framework 4.5.51209, and Silverlight 5.0)
Is there any way this kind of thing can mess up catastrophically?
Depending on what you are doing, absolutely. First, you are accessing a List<T> from multiple threads, and List<T> is not thread safe. There is no ConcurrentList<T> in the framework; either guard the list with a lock or use one of the thread-safe collections in the System.Collections.Concurrent namespace (for example ConcurrentDictionary<TKey, TValue> or ConcurrentBag<T>). Remember that ANY class (including .NET framework components) being accessed/modified by multiple threads must be thread safe. MSDN indicates which framework components are and are not.
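For the countdown pattern in the question specifically, one sketch that avoids a shared list altogether is an atomic counter (assuming the completion callbacks may arrive on worker threads; DoSomethingToLoadTheID, ServiceArgs and the other members are the question's own):
// Requires: using System.Threading;
private int _pendingLoads;

public void LoadData(List<long> IDs){
    // Record how many completions we expect before kicking off the calls.
    _pendingLoads = IDs.Count;
    foreach(long id in IDs)
        DoSomethingToLoadTheID(id);
}

public void LoadTheIDCompleted(object sender, ServiceArgs e){
    UseTheLoadedData(e);
    // Interlocked.Decrement is atomic, so exactly one callback observes zero,
    // no matter which thread each callback arrives on.
    if (Interlocked.Decrement(ref _pendingLoads) == 0)
        LoadDataFinally();
}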
Is there a better way to accomplish this same functionality?
I don't see anywhere in your code where multiple threads are being used, but I assume that what you are trying to do is:
Get a list of IDs
Run some long-running or CPU-bound process for each ID
Remove that ID from the list of IDs
Depending on the nature of what you are doing with each ID, the approach may be different. For instance, if the work is CPU-bound then you could use Tasks/TPL:
// Using TPL
var ids = new List<int>() {1, 2, 3, 4};
Parallel.ForEach(ids, id => DoSomething(id)); // Invokes DoSomething on each ID in the list in parallel
Or if you need fine-grained control of the order that things execute...
var ids = new List<int>() {1, 2, 3, 4};
TaskFactory factory = new TaskFactory();
List<Task> tasks = new List<Task>();
foreach (var id in ids)
{
tasks.Add(factory.StartNew(() => DoSomething(id))); // executes async and keeps track of the task in the list
}
Task.WaitAll(tasks.ToArray()); // waits till everything is done
tasks.Clear();
foreach (var id in ids)
{
tasks.Add(factory.StartNew(() => DoSomethingElse(id))); // executes async and keeps track of the task in the list
}
Task.WaitAll(tasks.ToArray()); // Wait till all DoSomethingElse is done
// etc.
Now if the work you are doing is I/O bound (e.g. making web service calls where you have to wait for slow responses), you should look into Asynchronous Programming with async and await (C#).
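For illustration only (CallServiceAsync and ServiceResult are hypothetical stand-ins for your own service proxy), an I/O-bound version might look like this:
// No thread is blocked while the service responses are outstanding.
public async Task LoadAllAsync(IEnumerable<long> ids){
    // Start every call without awaiting, so they run concurrently.
    List<Task<ServiceResult>> calls = ids.Select(id => CallServiceAsync(id)).ToList();

    // Resumes only after every call has completed.
    ServiceResult[] results = await Task.WhenAll(calls);

    // Safe to do the order-sensitive finishing work here.
    ImportantFinishingTouches();
}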
There are a lot of ways to handle multi-threading, and sometimes figuring out the right way is part of the challenge (async/await vs. tasks vs. semaphores vs. background workers vs. etc.).
Related
I have a C# .NET program that uses an external API to process events for real-time stock market data. I use the API callback feature to populate a ConcurrentDictionary with the data it receives on a stock-by-stock basis.
I have a set of algorithms that each run in a constant loop until a terminal condition is met. They are called like this (but all from separate calling functions elsewhere in the code):
Task.Run(() => ExecutionLoop1());
Task.Run(() => ExecutionLoop2());
...
Task.Run(() => ExecutionLoopN());
Each one of those functions calls SnapTotals():
public void SnapTotals()
{
foreach (KeyValuePair<string, MarketData> kvpMarketData in
new ConcurrentDictionary<string, MarketData>(Handler.MessageEventHandler.Realtime))
{
...
The Handler.MessageEventHandler.Realtime object is the ConcurrentDictionary that is updated in real-time by the external API.
At a certain specific point in the day, there is an instant burst of data that comes in from the API. That is the precise time I want my ExecutionLoop() functions to do some work.
As I've grown the program and added more of those execution loop functions, and grown the number of elements in the ConcurrentDictionary, the performance of the program as a whole has seriously degraded. Specifically, those ExecutionLoop() functions all seem to freeze up and take much longer to meet their terminal condition than they should.
I added some logging to all of the functions above, and to the function that updates the ConcurrentDictionary. From what I can gather, the ExecutionLoop() functions appear to access the ConcurrentDictionary so often that they block the API from updating it with real-time data. The loops are dependent on that data to meet their terminal condition so they cannot complete.
I'm stuck trying to figure out a way to re-architect this. I would like for the thread that updates the ConcurrentDictionary to have a higher priority but the message events are handled from within the external API. I don't know if ConcurrentDictionary was the right type of data structure to use, or what the alternative could be, because obviously a regular Dictionary would not work here. Or is there a way to "pause" my execution loops for a few milliseconds to allow the market data feed to catch up? Or something else?
Your basic approach is sound except for one fatal flaw: the execution loops and the API callback are all hitting the same dictionary at the same time via iterators, sets, and gets. So you must do one thing: in SnapTotals you must iterate over a copy of the concurrent dictionary.
When you iterate over Handler.MessageEventHandler.Realtime or even new ConcurrentDictionary<string, MarketData>(Handler.MessageEventHandler.Realtime) you are using the ConcurrentDictionary<>'s iterator, which, even though it is thread-safe, is going to be using the dictionary for the entire period of iteration (including however long it takes to do the processing for each and every entry in the dictionary). That is most likely where the contention occurs.
Making a copy of the dictionary is much faster, so it should lower contention.
Change SnapTotals to
public void SnapTotals()
{
var copy = Handler.MessageEventHandler.Realtime.ToArray();
foreach (var kvpMarketData in copy)
{
...
Now, each ExecutionLoopX can execute in peace without write-side contention (your API updates) and without read-side contention from the other loops. The write-side can execute without read-side contention as well.
The only "contention" should be for the short duration needed to do each copy.
And by the way, the dictionary copy (an array) is not threadsafe; it's just a plain array, but that is ok because each task is executing in isolation on its own copy.
I think that your main problem is not related to the ConcurrentDictionary, but to the large number of ExecutionLoopX methods. Each of these methods saturates a CPU core, and since there are more of these methods than cores on your machine, the whole CPU is saturated. My assumption is that if you find a way to limit the degree of parallelism of the ExecutionLoopX methods to a number smaller than Environment.ProcessorCount, your program will behave and perform better. Below is my suggestion for implementing this limitation.
The main obstacle is that currently your ExecutionLoopX methods are monolithic: they can't be broken into pieces that can be scheduled independently. My suggestion is to change their return type from void to async Task, and place an await Task.Yield(); inside the outer loop. This way it becomes possible to execute them in steps, with each step being the code from one await to the next.
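A rough sketch of what a reshaped loop might look like (the body and terminal condition are placeholders for the question's algorithm code):
public async Task ExecutionLoop1()
{
    while (!TerminalConditionMet()) // placeholder for the loop's real exit condition
    {
        SnapTotals();
        // ...the rest of the algorithm's work for this pass...

        // Hand control back to the limited-concurrency scheduler so the other
        // loops get a turn; each iteration becomes one schedulable step.
        await Task.Yield();
    }
}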
Then create a TaskScheduler with limited concurrency, and a TaskFactory that uses this scheduler:
int maxDegreeOfParallelism = Environment.ProcessorCount - 1;
TaskScheduler scheduler = new ConcurrentExclusiveSchedulerPair(
TaskScheduler.Default, maxDegreeOfParallelism).ConcurrentScheduler;
TaskFactory taskFactory = new TaskFactory(scheduler);
Now you can parallelize the execution of the methods, by starting the tasks with the taskFactory.StartNew method instead of the Task.Run:
List<Task> tasks = new();
tasks.Add(taskFactory.StartNew(() => ExecutionLoop1(data)).Unwrap());
tasks.Add(taskFactory.StartNew(() => ExecutionLoop2(data)).Unwrap());
tasks.Add(taskFactory.StartNew(() => ExecutionLoop3(data)).Unwrap());
tasks.Add(taskFactory.StartNew(() => ExecutionLoop4(data)).Unwrap());
//...
Task.WaitAll(tasks.ToArray());
The .Unwrap() is needed because the taskFactory.StartNew returns a nested task (Task<Task>). The Task.Run method is also doing this unwrapping internally, when the action is asynchronous.
The Environment.ProcessorCount - 1 configuration means that one CPU core will be available for other work, like the communication with the external API and the updating of the ConcurrentDictionary.
A more cumbersome implementation of the same idea, using iterators and the Parallel.ForEach method instead of async/await, can be found in the first revision of this answer.
If you're not squeamish about mixing operations in a task, you could redesign so that, instead of task A doing A things, task B doing B things, task C doing C things, and so on, you reduce the number of tasks to the number of processors and thus run fewer concurrently, greatly easing contention.
So, for example, say you have just two processors. Make a "general purpose/pluggable" task wrapper that accepts delegates. So, wrapper 1 would accept delegates to do A and B work. Wrapper 2 would accept delegates to do C and D work. Then ask each wrapper to spin up a task that calls the delegates in a loop over the dictionary.
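A hedged sketch of what such a wrapper might look like (DoAWork, DoBWork and friends are hypothetical per-entry delegates; Handler.MessageEventHandler.Realtime and MarketData are from the question):
// Accepts several work delegates and runs them all in a single pass
// over a snapshot of the dictionary, inside one task.
static Task RunWrapper(params Action<KeyValuePair<string, MarketData>>[] workItems)
{
    return Task.Run(() =>
    {
        var snapshot = Handler.MessageEventHandler.Realtime.ToArray();
        foreach (var entry in snapshot)
            foreach (var work in workItems)
                work(entry); // e.g. A-work then B-work on the same entry
    });
}

// Two wrappers instead of four tasks, e.g. on a two-core machine:
var taskAB = RunWrapper(DoAWork, DoBWork);
var taskCD = RunWrapper(DoCWork, DoDWork);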
This would of course need to be measured. What I am proposing is, say, 4 tasks each doing 4 different types of processing. This is 4 units of work per loop over 4 loops. This is not the same as 16 tasks each doing 1 unit of work. In that case you have 16 loops.
16 loops intuitively would cause more contention than 4.
Again, this is a potential solution that should be measured. There is one drawback for sure: you will have to ensure that a piece of work within a task doesn't affect any of the others.
I am calling an external API which is slow. Currently, if I haven't called the API to get some orders for a while, the call can be broken up into pages (pagination).
So fetching orders could mean making multiple calls rather than one. Each call can take around 10 seconds, so this could be about a minute in total, which is far too long.
GetOrdersCall getOrders = new GetOrdersCall();
getOrders.DetailLevelList.Add(DetailLevelCodeType.ReturnSummary);
getOrders.CreateTimeFrom = lastOrderDate;
getOrders.CreateTimeTo = DateTime.Now;

PaginationType paging = new PaginationType();
paging.EntriesPerPage = 20;
paging.PageNumber = 1;
getOrders.Pagination = paging;

getOrders.Execute();
var response = getOrders.ApiResponse;

OrderTypeCollection orders = new OrderTypeCollection();
while (response != null && response.OrderArray.Count > 0)
{
    eBayConverter.ConvertOrders(response.OrderArray, 1);
    orders.AddRange(response.OrderArray);
    if (response.HasMoreOrders)
    {
        getOrders.Pagination.PageNumber++;
        getOrders.Execute();
        response = getOrders.ApiResponse;
    }
    else
    {
        break; // no more pages
    }
}
This is a summary of my code above... The getOrders.Execute() is when the api fires.
After the 1st "getOrders.Execute()" there is a pagination result which tells me how many pages of data there are. My thinking is that I should be able to start an asynchronous call for each page and populate the OrderTypeCollection. When all the calls are made and the collection is fully loaded then I will commit to the database.
I have never done asynchronous calls via C# before, and while I can sort of follow async/await, I think my scenario falls outside the reading I have done so far.
Questions:
I think I can set it up to fire off the multiple calls asynchronously but I'm not sure how to check when all tasks have been completed i.e. ready to commit to db.
I've read somewhere that I want to avoid combining the API call and the db write to avoid locking in SQL server - Is this correct?
If someone can point me in the right direction - It would be greatly appreciated.
I think I can set it up to fire off the multiple calls asynchronously
but I'm not sure how to check when all tasks have been completed i.e.
ready to commit to db.
Yes, you can break this up.
The problem is that eBay doesn't have an async Task Execute method, so you are left with blocking threaded calls and no I/O-optimised async/await pattern. If it did, you could take advantage of a TPL Dataflow pipeline, which is async-aware (and fun for the whole family). You still could anyway, though I propose a vanilla TPL solution...
However, all is not lost: just fall back to Parallel.For and a ConcurrentBag<OrderType>.
Example
var concurrentBag = new ConcurrentBag<OrderType>();

// Make the first call synchronously, add its results to concurrentBag,
// and read the total page count from the pagination result.
int pageCount = ...;

// Parallel.For's upper bound is exclusive, so this runs pages 2..pageCount
// (page 1 was already fetched above).
Parallel.For(2, pageCount + 1,
    page =>
    {
        // Set up a GetOrdersCall for this page (Pagination.PageNumber = page)
        // and Execute() it.
        foreach (var order in getOrders.ApiResponse.OrderArray)
            concurrentBag.Add(order);
    });

// All orders have been downloaded.
// Save to db.
Note: there is a MaxDegreeOfParallelism option you can configure (maybe set it to 50), though it won't really matter how much you give it; the Task Scheduler is not going to aggressively give you threads. It will likely start with 10 or so and grow slowly.
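For reference, the option is passed in via ParallelOptions; a minimal sketch (pageCount as above):
var options = new ParallelOptions { MaxDegreeOfParallelism = 50 };

Parallel.For(2, pageCount + 1, options, page =>
{
    // fetch and collect this page, as in the example above
});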
The other way you can do this is to create your own TaskScheduler, or just spin up your own threads with the old-fashioned Thread class.
I've read somewhere that I want to avoid combining the API call and
the db write to avoid locking in SQL server - Is this correct?
If you mean locking as in a slow DB insert, use the SQL bulk insert and update tools (for example SqlBulkCopy).
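A minimal sketch of the bulk-insert route, assuming the orders have been flattened into a DataTable and you're targeting SQL Server (table and variable names are illustrative):
// Requires: using System.Data.SqlClient;
using (var bulk = new SqlBulkCopy(connectionString))
{
    bulk.DestinationTableName = "dbo.Orders"; // hypothetical destination table
    bulk.WriteToServer(ordersDataTable);      // one round trip for the whole batch
}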
If you mean locking as in the DB deadlock error message, then this is an entirely different thing, and worthy of its own question.
Additional Resources
For(Int32, Int32, ParallelOptions, Action)
Executes a for (For in Visual Basic) loop in which iterations may run
in parallel and loop options can be configured.
ParallelOptions Class
Stores options that configure the operation of methods on the Parallel
class.
MaxDegreeOfParallelism
Gets or sets the maximum number of concurrent tasks enabled by this
ParallelOptions instance.
ConcurrentBag Class
Represents a thread-safe, unordered collection of objects.
Yes, the ConcurrentBag<T> class can be used to serve the purpose of one of your questions, which was: "I think I can set it up to fire off the multiple calls asynchronously but I'm not sure how to check when all tasks have been completed i.e. ready to commit to db."
This generic class can be used to collect the results of every task, and you can wait for all of your tasks to complete before doing further processing. It is thread safe and useful for parallel processing.
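For example, a sketch in which each page is fetched on its own task (FetchPage is a hypothetical wrapper around the blocking per-page API call, and pageCount comes from the first response):
var results = new ConcurrentBag<OrderType>();
var tasks = new List<Task>();

for (int page = 2; page <= pageCount; page++)
{
    int p = page; // capture the loop variable for the closure
    tasks.Add(Task.Run(() =>
    {
        foreach (var order in FetchPage(p))
            results.Add(order);
    }));
}

Task.WaitAll(tasks.ToArray()); // all pages done; results is ready to commit to the db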
I'm currently trying to improve my understanding of Multithreading and the TPL in particular.
A lot of the constructs make complete sense and I can see how they improve scalability / execution speed.
I know that for asynchronous calls that don't tie up a thread (like I/O bound calls), Task.WhenAll would be the perfect fit.
One thing I am wondering about, though, is the best practice for making CPU-bound work that I want to run in parallel asynchronous.
To make code run in parallel the obvious choice would be the Parallel class.
As an example, say I have an array of data I want to perform some number crunching on:
string[] arr = { "SomeData", "SomeMoreData", "SomeOtherData" };
Parallel.ForEach(arr, (s) =>
{
SomeReallyLongRunningMethod(s);
});
This would run in parallel (if the analyser decides that parallel is faster than synchronous), but it would also block the thread.
Now the first thing that came to my mind was simply wrapping it all in Task.Run() ala:
string[] arr = { "SomeData", "SomeMoreData", "SomeOtherData" };
await Task.Run(() => Parallel.ForEach(arr, (s) =>
{
SomeReallyLongRunningMethod(s);
}));
Another option would be to either have a separate Task-returning method or inline it and use Task.WhenAll like so:
static async Task SomeReallyLongRunningMethodAsync(string s)
{
await Task.Run(() =>
{
//work...
});
}
// ...
await Task.WhenAll(arr.Select(s => SomeReallyLongRunningMethodAsync(s)));
The way I understand it is that option 1 creates a whole Task that will, for the life of it, tie up a thread to just sit there and wait until the Parallel.ForEach finishes.
Option 2 uses Task.WhenAll (for which I don't know whether it ties up a thread or not) to await all Tasks, but the Tasks had to be created manually. Some of my resources (especially MS ExamRef 70-483) have explicitly advised against manually creating Tasks for CPU-bound work, as the Parallel class is supposed to be used for it.
Now I'm left wondering about the best performing version / best practice for the problem of wanting parallel execution that can be awaited.
I hope some more experienced programmer can shed some light on this for me!
You really should use Microsoft's Reactive Framework for this. It's the perfect solution. You can do this:
string[] arr = { "SomeData", "SomeMoreData", "SomeOtherData" };
var query =
from s in arr.ToObservable()
from r in Observable.Start(() => SomeReallyLongRunningMethod(s))
select new { s, r };
IDisposable subscription =
query
.Subscribe(x =>
{
/* Do something with each `x.s` and `x.r` */
/* Values arrive as soon as they are computed */
}, () =>
{
/* All Done Now */
});
This assumes that the signature of SomeReallyLongRunningMethod is int SomeReallyLongRunningMethod(string input), but it is easy to cope with something else.
It all runs on multiple threads in parallel.
If you need to marshal back to the UI thread you can do that with an .ObserveOn just prior to the .Subscribe call.
If you want to stop the computation early you can call subscription.Dispose().
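For instance, a sketch of the UI-thread marshalling, assuming a UI framework that installs a SynchronizationContext on the UI thread:
IDisposable subscription =
    query
        .ObserveOn(SynchronizationContext.Current) // marshal each result to the UI thread
        .Subscribe(x =>
        {
            /* safe to touch UI elements here */
        });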
Option 1 is the way to go, as the thread-pool thread used for the task will also get used in the parallel loop. A similar question is answered here.
Recently I've stumbled upon a Parallel.For loop that performs way better than a regular for loop for my purposes.
This is how I use it:
Parallel.For(0, values.Count, i => Products.Add(GetAllProductByID(values[i])));
It made my application work a lot faster, but still not fast enough. My question to you guys is:
Does Parallel.ForEach perform faster than Parallel.For?
Is there some "hybrid" method with which I can combine my Parallel.For loop to perform even faster (i.e. use more CPU power)? If yes, how?
Can someone help me out with this?
If you want to play with parallelism, I suggest using Parallel LINQ (PLINQ) instead of Parallel.For / Parallel.ForEach, e.g.
var Products = Enumerable
.Range(0, values.Count)
.AsParallel()
//.WithDegreeOfParallelism(10) // <- if you want, say 10 threads
.Select(i => GetAllProductByID(values[i]))
.ToList(); // <- this is thread safe now
With the help of the With* methods (e.g. WithDegreeOfParallelism) you can try tuning your implementation.
There are two related concepts: asynchronous programming and multithreading. Basically, to do things "in parallel" or asynchronously, you can either create new threads or work asynchronously on the same thread.
Keep in mind that either way you'll need some mechanism to prevent race conditions. From the Wikipedia article I linked to, a race condition is defined as follows:
A race condition or race hazard is the behavior of an electronic,
software or other system where the output is dependent on the sequence
or timing of other uncontrollable events. It becomes a bug when events
do not happen in the order the programmer intended.
As a few people have mentioned in the comments, you can't rely on the standard List class to be thread-safe - i.e. it might behave in unexpected ways if you're updating it from multiple threads. Microsoft now offers special "built-in" collection classes (in the System.Collections.Concurrent namespace) that'll behave in the expected way if you're updating them asynchronously or from multiple threads.
For well-documented libraries (and Microsoft's generally pretty good about this in their documentation), the documentation will often explicitly state whether the class or method in question is thread-safe. For example, in the documentation for System.Collections.Generic.List, it states the following:
Public static (Shared in Visual Basic) members of this type are thread
safe. Any instance members are not guaranteed to be thread safe.
In terms of asynchronous programming (vs. multithreading), my standard illustration of this is as follows: suppose you go to a restaurant with 10 people. When the waiter comes by, the first person he asks for his order isn't ready; however, the other 9 people are. Thus, the waiter asks the other 9 people for their orders and then comes back to the original guy. (It's definitely not the case that they'll get a second waiter to wait for the original guy to be ready to order, and doing so probably wouldn't save much time anyway.) That's how async/await typically works (the exception being that some of the Task Parallel Library calls, like Task.Run(...), actually are executing on other threads - in our illustration, bringing in a second waiter - so make sure you check the documentation for which is which).
Basically, which you choose (asynchronously on the same thread or creating new threads) depends on whether you're trying to do something that's I/O-bound (i.e. you're just waiting for an operation to complete or for a result) or CPU-bound.
If your main purpose is to wait for a result from eBay, it would probably be better to work asynchronously on the same thread, as you may not get much of a performance benefit from using multithreading. Think back to our analogy: bringing in a second waiter just to wait for the first guy to be ready to order isn't necessarily any better than just having the one waiter come back to him.
I'm not sitting in front of an IDE so forgive me if this syntax isn't perfect, but here's an approximate idea of what you can do:
public async Task GetResults(int[] productIDsToGet) {
var tasks = new List<Task>();
foreach (int productID in productIDsToGet) {
Task task = GetResultFromEbay(productID);
tasks.Add(task);
}
// Wait for all of the tasks to complete
await Task.WhenAll(tasks);
}
private async Task GetResultFromEbay(int productIdToGet) {
// Get result asynchronously from eBay
}
I've got a database entity type Entity, a long list of Thingy and method
private Task<Entity> MakeEntity(Thingy thingy) {
...
}
MakeEntity does lots of stuff, and is CPU bound. I would like to convert all my thingies to entities, and save them in a db.context. Considering that
I do want to finish as fast as possible
The amount of entities is large, and I want to use the database effectively, so I want to start saving changes and waiting for the remote database to do its thing
how can I do this performantly? What I would really like is to keep looping while waiting for the database to do its thing, and offer it all the newly made entities so far, until the database has processed them all. What's the best route there? I've run into SaveChanges throwing if it's called concurrently, so I can't do that. What I'd really like is to have a thread pool of eight threads (or rather, as many threads as I have cores) to do the CPU-bound work, and a single thread doing the SaveChanges().
This is a kind of "asynchronous stream", which is always a bit awkward.
In this case (assuming you really do want to multithread on ASP.NET, which is not recommended in general), I'd say TPL Dataflow is your best option. You can use a TransformBlock with MaxDegreeOfParallelism set to 8 (or unbounded, for that matter), and link it to an ActionBlock that does the SaveChanges.
Remember, use synchronous signatures (not async/await) for CPU-bound code, and asynchronous methods for I/O-bound code (i.e., SaveChangesAsync).
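A minimal sketch of that shape (the block options, the context member names, and the save-per-entity are illustrative, not a drop-in implementation):
// Requires: using System.Threading.Tasks.Dataflow;
var makeEntities = new TransformBlock<Thingy, Entity>(
    thingy => MakeEntity(thingy), // the question's CPU-bound method; a Task<Entity>-returning delegate is accepted
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 8 });

// Default MaxDegreeOfParallelism is 1, so SaveChangesAsync is never called concurrently.
var saveEntities = new ActionBlock<Entity>(async entity =>
{
    context.Entities.Add(entity); // hypothetical DbSet name on the question's db context
    await context.SaveChangesAsync();
});

makeEntities.LinkTo(saveEntities, new DataflowLinkOptions { PropagateCompletion = true });

foreach (var thingy in thingies)
    makeEntities.Post(thingy);

makeEntities.Complete();
await saveEntities.Completion;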
You could set up a pipeline of N CPU workers feeding into a database worker. The database worker could batch items up.
Since MakeEntity is CPU bound there is no need to use async and await there. await does not create tasks or threads (a common misconception).
var thingies = ...;
var entities = thingies.AsParallel().WithDegreeOfParallelism(8).Select(MakeEntity);
var batches = CreateBatches(entities, batchSize: 100);
foreach (var batch in batches) {
Insert(batch);
}
You need to provide a method that creates batches from an IEnumerable. This is available on the web.
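A simple sketch of such a helper (one of many possible shapes):
static IEnumerable<List<T>> CreateBatches<T>(IEnumerable<T> source, int batchSize)
{
    var batch = new List<T>(batchSize);
    foreach (var item in source)
    {
        batch.Add(item);
        if (batch.Count == batchSize)
        {
            yield return batch;
            batch = new List<T>(batchSize);
        }
    }
    if (batch.Count > 0)
        yield return batch; // final partial batch
}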
If you don't need batching for the database part you can delete that code.
For the database part you probably don't need async IO because it seems to be a low-frequency operation.