I have a procedure that needs to be executed on another thread, asynchronously. The procedure processes data in batches (it can receive anywhere from 2 items to 40,000). From local tests, the longest runtime was about 2 minutes (for 40,000 items).
The scenario is the following: the UI calls the back-end, the back-end starts a background thread that runs the procedure, and then immediately returns a boolean indicating whether the request was received (these are the requirements; I do not use await/wait). I am not quite sure what to use here, between:
Task.Run(()=> MyProcedure())
OR
Task.Factory.StartNew(()=> MyProcedure(), TaskCreationOptions.LongRunning)
What would be the best option for this situation?
The procedure processes data in batches
This can mean multiple things. If you are going to use async/await in those batches, such that your MyProcedure is actually asynchronous, then Task.Run should be fine; MyProcedure releases its thread back to the thread pool whenever it is awaiting, so this should work fine, for example:
async Task MyProcedure()
{
    while (ThereIsWorkToDo)
    {
        var batch = ...; // gather some work
        await ProcessBatchAsync(batch).ConfigureAwait(false);
    }
}
However: if MyProcedure() is not asynchronous and you just want to run it in the background, then yes, TaskCreationOptions.LongRunning might be reasonable; but for something that takes 2 minutes, so might a regular dedicated thread.
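For completeness, a minimal sketch of the dedicated-thread option, assuming MyProcedure is synchronous (the StartWork wrapper and the IsBackground choice are illustrative, not from the question):

bool StartWork()
{
    var thread = new Thread(() => MyProcedure())
    {
        IsBackground = true // don't let this work keep the process alive
    };
    thread.Start();
    return true; // "request received", per the stated requirements
}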
Related
I have a construct like this:
Func<object> func = () =>
{
    foreach (var x in y)
    {
        ...
        Task.Delay(100).Wait();
    }
    return null;
};
var t = new Task<object>(func);
t.Start(); // t is never awaited or waited for, the main thread does not care when it's done.
...
So basically I create a function func that calls Task.Delay(100).Wait() quite a few times. I know the use of .Wait() is discouraged in general.
But I want to know if there are concrete performance losses for the displayed case.
The .Wait() calls happen in a separate Task that is completely independent from the main thread, i.e. it is never awaited or waited for. I am curious what happens when I call Wait() this way, and what happens on the processor of my machine. During the 100 ms that are waited, can the processor core execute another thread and return to the Task after the time has passed? Or did I just produce a busy-waiting procedure, where my CPU "actively does nothing" for 100 ms, thus slowing down the rest of my program?
Does this approach have any practical downsides when compared to a solution where I make it an async function and call await Task.Delay(100)? That would be an option for me, but I would rather not go for it if it is reasonably avoidable.
There are concrete efficiency losses because of the inefficient use of threads. Each thread requires at least 1 MB for its stack, so the more threads that are created in order to do nothing, the more memory is allocated for unproductive purposes.
It is also possible for concrete performance losses to appear, in case the demand for threads surpasses the ThreadPool availability. In this case the ThreadPool becomes saturated, and new threads are injected into the pool at a conservative (slow) rate. So the tasks you create and Start will not start immediately; instead they will be entered into an internal queue, waiting for a free thread, either one that completed some previous work or a newly injected one.
Regarding your concerns about creating busy-waiting procedures, no, that's not what's happening. A sleeping thread does not consume CPU resources.
As a side note, creating cold tasks using the Task constructor is an advanced technique that's only used on special occasions. The common way of creating delegate-based tasks is through the convenient Task.Run method, which returns hot (already started) tasks.
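To illustrate, a sketch of both the hot-task version of the construct above and the fully async alternative you mention (y stands for the same collection as in your snippet):

// Hot task: starts immediately, no separate Start() call needed.
var t = Task.Run(() =>
{
    foreach (var x in y)
    {
        // ... work ...
        Task.Delay(100).Wait(); // still blocks a thread-pool thread for 100 ms
    }
});

// Async alternative: the thread is released back to the pool during each delay.
var t2 = Task.Run(async () =>
{
    foreach (var x in y)
    {
        // ... work ...
        await Task.Delay(100); // no thread is blocked while waiting
    }
});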
I am calling an external API which is slow. Currently, if I haven't called the API to get some orders for a while, the call can be broken up into pages (pagination).
Therefore, fetching orders could mean making multiple calls rather than one. Each call can take around 10 seconds, so the whole fetch could take about a minute in total, which is far too long.
GetOrdersCall getOrders = new GetOrdersCall();
getOrders.DetailLevelList.Add(DetailLevelCodeType.ReturnSummary);
getOrders.CreateTimeFrom = lastOrderDate;
getOrders.CreateTimeTo = DateTime.Now;

PaginationType paging = new PaginationType();
paging.EntriesPerPage = 20;
paging.PageNumber = 1;
getOrders.Pagination = paging;

getOrders.Execute();
var response = getOrders.ApiResponse;
OrderTypeCollection orders = new OrderTypeCollection();

while (response != null && response.OrderArray.Count > 0)
{
    eBayConverter.ConvertOrders(response.OrderArray, 1);
    orders.AddRange(response.OrderArray);
    if (!response.HasMoreOrders)
        break; // last page fetched
    getOrders.Pagination.PageNumber++;
    getOrders.Execute();
    response = getOrders.ApiResponse;
}
This is a summary of my code above... getOrders.Execute() is where the API call fires.
After the first getOrders.Execute() there is a pagination result which tells me how many pages of data there are. My thinking is that I should be able to start an asynchronous call for each page and populate the OrderTypeCollection. When all the calls are made and the collection is fully loaded, I will commit to the database.
I have never done asynchronous calls in C# before. I can more or less follow async/await, but I think my scenario falls outside the reading I have done so far.
Questions:
I think I can set it up to fire off the multiple calls asynchronously, but I'm not sure how to check when all tasks have completed, i.e. when it is ready to commit to the db.
I've read somewhere that I want to avoid combining the API call and the db write to avoid locking in SQL Server. Is this correct?
If someone can point me in the right direction - It would be greatly appreciated.
I think I can set it up to fire off the multiple calls asynchronously
but I'm not sure how to check when all tasks have been completed i.e.
ready to commit to db.
Yes, you can break this up.
The problem is that eBay doesn't expose an async Execute method returning a Task, so you are left with blocking threaded calls and no I/O-optimised async/await pattern. If there were one, you could take advantage of a TPL Dataflow pipeline, which is async-aware (and fun for the whole family). You still could anyway, though I propose a vanilla TPL solution here...
However, all is not lost: just fall back to Parallel.For and a ConcurrentBag<OrderType>.
Example
var concurrentBag = new ConcurrentBag<OrderType>();

// make the first call, add its results to concurrentBag,
// and read the total page count from the response
int pageCount = ...;

// pages 2 to pageCount; Parallel.For's upper bound is exclusive
Parallel.For(2, pageCount + 1,
    page =>
    {
        // set up a fresh GetOrdersCall for this iteration
        // (don't share one instance across threads),
        // set its Pagination.PageNumber to page, then Execute()
        foreach (var order in getOrders.ApiResponse)
            concurrentBag.Add(order);
    });

// all orders have been downloaded
// save to db
Note: there is a MaxDegreeOfParallelism option you can configure; maybe set it to 50. Though it won't matter much how high you set it, the task scheduler is not going to aggressively give you threads: maybe 10 or so initially, growing slowly.
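For reference, configuring it looks something like this (50 is just the suggestion above; pageCount is the total page count read from the first call):

var options = new ParallelOptions { MaxDegreeOfParallelism = 50 };
Parallel.For(2, pageCount + 1, options, page =>
{
    // make the call for this page and add its orders to concurrentBag
});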
The other way you can do this is to create your own TaskScheduler, or just spin up your own threads with the old-fashioned Thread class.
I've read somewhere that I want to avoid combining the API call and
the db write to avoid locking in SQL server - Is this correct?
If you mean locking as in slow DB inserts, use SQL bulk insert and update tooling.
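For example, a minimal sketch with SqlBulkCopy (connectionString, the table name, and ordersDataTable are placeholders):

using (var bulkCopy = new SqlBulkCopy(connectionString))
{
    bulkCopy.DestinationTableName = "dbo.Orders";
    bulkCopy.WriteToServer(ordersDataTable); // one round trip for many rows
}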
If you mean locking as in the DB deadlock error message, then this is an entirely different thing, and worthy of its own question.
Additional Resources
Parallel.For(Int32, Int32, ParallelOptions, Action<Int32>)
Executes a for (For in Visual Basic) loop in which iterations may run
in parallel and loop options can be configured.
ParallelOptions Class
Stores options that configure the operation of methods on the Parallel
class.
MaxDegreeOfParallelism
Gets or sets the maximum number of concurrent tasks enabled by this
ParallelOptions instance.
ConcurrentBag Class
Represents a thread-safe, unordered collection of objects.
Yes, the ConcurrentBag<T> class can be used to serve the purpose of one of your questions, which was: "I think I can set it up to fire off the multiple calls asynchronously but I'm not sure how to check when all tasks have been completed i.e. ready to commit to db."
This generic class can be used to collect results from all of your tasks while you wait for them to complete, before doing further processing. It is thread-safe and useful for parallel processing.
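A rough sketch of that pattern, assuming a FetchPage helper that wraps the GetOrdersCall logic from the question (FetchPage and pageCount are placeholders):

var orders = new ConcurrentBag<OrderType>();
var tasks = new List<Task>();

for (int page = 2; page <= pageCount; page++)
{
    int p = page; // capture the loop variable for the closure
    tasks.Add(Task.Run(() =>
    {
        foreach (var order in FetchPage(p)) // blocking API call per page
            orders.Add(order);
    }));
}

Task.WaitAll(tasks.ToArray()); // all pages done; safe to commit to the db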
While the following question is generally applicable to all usage of async/await in C#, it refers to Json.NET. The JsonConvert.DeserializeObjectAsync() method has been marked as obsolete by the development team, as it would be difficult to maintain and not of much use, since most JSON files are small (refer to this).
I have some code following this structure:
public async Task<CarObj> GetCarAsync()
{
    string json = await GetJsonStringFromRestEndpoint();

    // At this point, we should already be on a separate thread since we have awaited a long running task.

    // 1 - Running this relatively long task on this thread should be fine since we're already on a different thread than the caller.
    CarObj obj = JsonConvert.DeserializeObject<CarObj>(json);

    // 2 - Would this be better for some reason?
    CarObj obj2 = await Task.Run(() => JsonConvert.DeserializeObject<CarObj>(json));
}
Would option 1 or 2 in the code above be the better solution here?
Arguably this is primarily opinion-based. But…
Assuming the library authors are correct, your first option is better. But not for the reason you think.
When the await GetJsonStringFromRestEndpoint() completes, then assuming the GetCarAsync() method was called from a thread with a synchronization context, the call to DeserializeObject<CarObj>(json) will happen on that same thread.
The reason calling the method synchronously isn't a problem isn't because you're on a different thread (you're not), but rather because, as the library authors point out, the input data isn't likely to be large enough for there to be any significant performance problem. You can probably parse the entire JSON data and construct your CarObj value in less time than it takes to queue up the thread pool work item, context-switch to that thread, and then context-switch back.
In other words, don't use worker threads to perform computationally inexpensive work.
// At this point, we should already be on a separate thread since we have awaited a long running task.
No, the caller thread is called back (resumed) instead.
But, if this is not your intended behavior, I'd advise adding
.ConfigureAwait(false);
That would save some synchronization work, and afterwards you can reasonably expect to be on a thread pool thread.
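Applied to the code from the question, only the awaited call changes (a sketch):

string json = await GetJsonStringFromRestEndpoint().ConfigureAwait(false);
// The continuation now typically resumes on a thread pool thread instead of
// being marshalled back to the caller's synchronization context.
CarObj obj = JsonConvert.DeserializeObject<CarObj>(json);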
C# offers multiple ways to perform asynchronous execution such as threads, futures, and async.
In what cases is async the best choice?
I have read many articles about the how and what of async, but so far I have not seen any article that discusses the why.
Initially I thought async was a built-in mechanism to create a future. Something like
async int foo(){ return ..complex operation..; }
var x = await foo();
do_something_else();
bar(x);
Where the call to 'await foo' would return immediately, and the use of 'x' would wait on the return value of 'foo'. async does not do this. If you want this behavior you can use the futures library: https://msdn.microsoft.com/en-us/library/Ff963556.aspx
The above example would instead be something like
int foo(){ return ..complex operation..; }
var x = Task.Factory.StartNew<int>(() => foo());
do_something_else();
bar(x.Result);
Which isn't as pretty as I would have hoped, but it works nonetheless.
So if you have a problem where you want to have multiple threads operate on the work then use futures or one of the parallel operations, such as Parallel.For.
async/await, then, is probably not meant for the use case of performing work in parallel to increase throughput.
async solves the problem of scaling an application for a large number of asynchronous events, such as I/O, when creating many threads is expensive.
Imagine a web server where requests are processed immediately as they come in. The processing happens on a single thread where every function call is synchronous. Fully processing a request might take a few seconds, which means that an entire thread is consumed until the processing is complete.
A naive approach to server programming is to spawn a new thread for each request. In this way it does not matter how long each thread takes to complete because no thread will block any other. The problem with this approach is that threads are not cheap. The underlying operating system can only create so many threads before running out of memory, or some other kind of resource. A web server that uses 1 thread per request will probably not be able to scale past a few hundred/thousand requests per second. The c10k challenge asks that modern servers be able to scale to 10,000 simultaneous users. http://www.kegel.com/c10k.html
A better approach is to use a thread pool where the number of threads in existence is more or less fixed (or at least, does not expand past some tolerable maximum). In that scenario only a fixed number of threads are available for processing the incoming requests. If there are more requests than there are threads available for processing then some requests must wait. If a thread is processing a request and has to wait on a long running I/O process then effectively the thread is not being utilized to its fullest extent, and the server throughput will be much less than it otherwise could be.
The question is now, how can we have a fixed number of threads but still use them efficiently? One answer is to 'cut up' the program logic so that when a thread would normally wait on an I/O process, instead it will start the I/O process but immediately become free for any other task that wants to execute. The part of the program that was going to execute after the I/O will be stored in a thing that knows how to keep executing later on.
For example, the original synchronous code might look like
void process(){
    string name = get_user_name();
    string address = look_up_address(name);
    string tax_forms = find_tax_form(address);
    render_tax_form(name, address, tax_forms);
}
Where look_up_address and find_tax_form have to talk to a database and/or make requests to other websites.
The asynchronous version might look like
void process(){
    string name = get_user_name();
    invoke_after(() => look_up_address(name), (address) => {
        invoke_after(() => find_tax_form(address), (tax_forms) => {
            render_tax_form(name, address, tax_forms);
        });
    });
}
This is continuation-passing style, where the next thing to do is passed as the second lambda to a function that will not block the current thread when the blocking operation (in the first lambda) is invoked. This works, but it quickly becomes very ugly and hard to follow the program logic.
What the programmer has manually done in splitting up their program can be automatically done by async/await. Any time there is a call to an I/O function the program can mark that function call with await to inform the caller of the program that it can continue to do other things instead of just waiting.
async void process(){
    string name = get_user_name();
    string address = await look_up_address(name);
    string tax_forms = await find_tax_form(address);
    render_tax_form(name, address, tax_forms);
}
The thread that executes process will break out of the function when it gets to look_up_address and continue to do other work, such as processing other requests. When look_up_address has completed and process is ready to continue, some thread (or the same thread) will pick up where the last thread left off and execute the next line, find_tax_form(address).
Since my current understanding is that async is about managing threads, I don't believe that async makes a lot of sense for UI programming. Generally UIs will not have that many simultaneous events that need to be processed. The use case for async with UIs is preventing the UI thread from being blocked. Even though async can be used with a UI, I would find it dangerous, because omitting an await on some long-running function, due to either accident or forgetfulness, would cause the UI to block.
async void button_callback(){
    await do_something_long();
    ....
}
This code won't block the UI because it awaits the long-running function that it invokes. If later on another function call is added
async void button_callback(){
    do_another_thing();
    await do_something_long();
    ...
}
If it wasn't clear to the programmer who added the call to do_another_thing just how long it would take to execute, the UI will now be blocked. It seems safer to just always execute all processing in a background thread.
void button_callback(){
    new Thread(() => {
        do_another_thing();
        do_something_long();
        ....
    }).Start();
}
Now there is no possibility that the UI thread will be blocked, and the chances that too many threads will be created is very small.
I've got a database entity type Entity, a long list of Thingy, and a method
private Task<Entity> MakeEntity(Thingy thingy) {
    ...
}
MakeEntity does lots of stuff and is CPU-bound. I would like to convert all my thingies to entities and save them in a db context. Considering that:
- I want to finish as fast as possible
- The number of entities is large, and I want to use the database effectively, so I want to start saving changes and wait for the remote database to do its thing while further entities are being made
how can I do this performantly? What I would really like is to loop while waiting for the database to do its thing, offering up all the newly made entities so far, until the database has processed them all. What's the best route there? I've run into SaveChanges throwing if it's called concurrently, so I can't do that. What I'd really like is to have a thread pool of eight threads (or rather, as many threads as I have cores) doing the CPU-bound work, and a single thread doing the SaveChanges().
This is a kind of "asynchronous stream", which is always a bit awkward.
In this case (assuming you really do want to multithread on ASP.NET, which is not recommended in general), I'd say TPL Dataflow is your best option. You can use a TransformBlock with MaxDegreeOfParallelism set to 8 (or unbounded, for that matter), and link it to an ActionBlock that does the SaveChanges.
Remember, use synchronous signatures (not async/await) for CPU-bound code, and asynchronous methods for I/O-bound code (i.e., SaveChangesAsync).
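A sketch of what that pipeline could look like (thingies is the list from the question; the context variable and the per-entity SaveChangesAsync call are illustrative EF Core-style assumptions, and batching the saves would likely be more efficient):

using System.Threading.Tasks.Dataflow;

// CPU-bound stage: up to 8 MakeEntity calls in flight at once.
var makeEntities = new TransformBlock<Thingy, Entity>(
    thingy => MakeEntity(thingy),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 8 });

// I/O-bound stage: MaxDegreeOfParallelism defaults to 1, so SaveChangesAsync
// is never called concurrently.
var saveEntities = new ActionBlock<Entity>(async entity =>
{
    context.Add(entity);
    await context.SaveChangesAsync();
});

makeEntities.LinkTo(saveEntities,
    new DataflowLinkOptions { PropagateCompletion = true });

foreach (var thingy in thingies)
    makeEntities.Post(thingy);

makeEntities.Complete();
await saveEntities.Completion; // everything made and saved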
You could set up a pipeline of N CPU workers feeding into a database worker. The database worker could batch items up.
Since MakeEntity is CPU-bound, there is no need to use async and await there. await does not create tasks or threads (a common misconception).
var thingies = ...;
var entities = thingies.AsParallel().WithDegreeOfParallelism(8).Select(MakeEntity);
var batches = CreateBatches(entities, batchSize: 100);

foreach (var batch in batches) {
    Insert(batch);
}
You need to provide a method that creates batches from an IEnumerable. This is available on the web.
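A minimal sketch of such a helper (a plain iterator that materializes one batch at a time):

static IEnumerable<List<T>> CreateBatches<T>(IEnumerable<T> source, int batchSize)
{
    var batch = new List<T>(batchSize);
    foreach (var item in source)
    {
        batch.Add(item);
        if (batch.Count == batchSize)
        {
            yield return batch;
            batch = new List<T>(batchSize);
        }
    }
    if (batch.Count > 0)
        yield return batch; // final partial batch
}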
If you don't need batching for the database part you can delete that code.
For the database part you probably don't need async I/O, because it seems to be a low-frequency operation.