I am writing a very, very simple query which just gets a document from a collection by its unique Id. After some frustration (I am new to Mongo and the async/await programming model), I figured this out:
IMongoCollection<TModel> collection = // ...
FindOptions<TModel> options = new FindOptions<TModel> { Limit = 1 };
IAsyncCursor<TModel> cursor = await collection.FindAsync(x => x.Id.Equals(id), options);
List<TModel> list = await cursor.ToListAsync();
TModel result = list.FirstOrDefault();
return result;
It works, great! But I keep seeing references to a "Find" method, and I worked this out:
IMongoCollection<TModel> collection = // ...
IFindFluent<TModel, TModel> findFluent = collection.Find(x => x.Id == id);
findFluent = findFluent.Limit(1);
TModel result = await findFluent.FirstOrDefaultAsync();
return result;
As it turns out, this too works, great!
I'm sure that there's some important reason that we have two different ways to achieve these results. What is the difference between these methodologies, and why should I choose one over the other?
The difference is in the syntax.
Find and FindAsync both let you build an asynchronous query with the same performance; the difference is in what you get back.
FindAsync returns a cursor, which doesn't load all the documents at once and gives you an interface for retrieving them one by one from the database cursor. That's helpful when the query result is huge.
Find gives you a simpler syntax through methods like ToListAsync, which internally drains the cursor and returns all the documents at once.
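As an illustration of that difference, here is a minimal sketch (the Person document type is an assumption, not something from the question):
// cursor-based: FindAsync hands back an IAsyncCursor<T> that pulls
// documents from the server in batches, so huge results can be streamed
IAsyncCursor<Person> cursor = await collection.FindAsync(p => p.Age > 30);
while (await cursor.MoveNextAsync())
{
    foreach (Person person in cursor.Current)
    {
        // process each document as it arrives
    }
}

// fluent: Find builds the same query, and ToListAsync drains the
// cursor internally, returning every matching document at once
List<Person> matches = await collection.Find(p => p.Age > 30).ToListAsync();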
Imagine you execute this code in a web request. With a blocking call (for example Find(...).ToList()), the request thread is frozen until the database returns its results: it's a synchronous call. If it's a long database operation that takes seconds to complete, one of the threads available to serve web requests is doing nothing but waiting for the database to return its results, wasting valuable resources (the number of threads in the thread pool is limited).
With an awaited call like FindAsync, the thread of your web request is freed while it waits for the database to return the results, which means that during the database call this thread can serve another web request. When the database returns the result, the code continues executing.
For long operations like file-system reads/writes, database operations, or communicating with other services, it's a good idea to use async calls, because while you are waiting for the results the threads are available to serve other web requests. This is more scalable.
Take a look at this Microsoft article: https://msdn.microsoft.com/en-us/magazine/dn802603.aspx.
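To make that concrete, here is a minimal sketch contrasting the two shapes in an ASP.NET controller (the Person model and _collection field are hypothetical):
// blocking: the request thread sits idle until the database responds
public IActionResult GetPersonBlocking(string id)
{
    Person result = _collection.Find(p => p.Id == id).FirstOrDefault();
    return Ok(result);
}

// asynchronous: the request thread returns to the pool while the driver
// waits on the database, so it can serve other requests in the meantime
public async Task<IActionResult> GetPersonAsync(string id)
{
    Person result = await _collection.Find(p => p.Id == id).FirstOrDefaultAsync();
    return Ok(result);
}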
I'm fairly new to programming (< 3 years exp), so I don't have a great understanding of the subjects in this post. Please bear with me.
My team is developing an integration with a third party system, and one of the third party's endpoints lacks a meaningful way to get a list of entities matching a condition.
We have been fetching these entities by looping over the collection of requests and adding the results of each awaited call to a list. This works just fine, but getting the entities takes a lot longer than it does for other endpoints that let us get a list of entities by providing a list of ids.
.NET 6.0 introduced Parallel.ForEachAsync(), which lets us execute multiple awaitable tasks asynchronously in parallel.
For example:
public async Task<List<TEntity>> GetEntitiesInParallelAsync<TEntity>(List<IRestRequest> requests)
where TEntity : IEntity
{
var entities = new ConcurrentBag<TEntity>();
// Create a function that takes a RestRequest and returns the
// result of the request's execution, for each request
var requestExecutionTasks = requests.Select(i =>
new Func<Task<TEntity>>(() => GetAsync<TEntity>(i)));
// Execute each of the functions asynchronously in parallel,
// and add the results to the aggregate as they come in
await Parallel.ForEachAsync(requestExecutionTasks, new ParallelOptions
{
// This lets us limit the number of threads to use. -1 is unlimited
MaxDegreeOfParallelism = -1
}, async (func, _) => entities.Add(await func()));
return entities.ToList();
}
Using this code rather than a simple foreach loop sped up the time it takes to get the ~30 entities on my test instance by 91% on average. That's awesome. However, we are worried about the rate limiting that is likely to occur when we use it on a client's system with possibly thousands of entities. We have a system in place that detects the "you are rate limited" message from their API and queues the requests for a second or so before trying again, but this is not so much a good solution as a safety measure.
If we were just looping over the requests, we could have throttled the calls by doing something like await Task.Delay(minimumDelay) in each iteration of the loop. Correct me if I'm wrong, but from what I understand this wouldn't actually work when executing the requests with Parallel.ForEachAsync, as it would make all requests wait the same amount of time before executing. Is there a way to make each individual request wait a certain amount of time before execution, only if we are close to being rate limited? If at all possible, I would like to do this without limiting the number of threads to use.
Edit
I wanted to let this question sit a little so more people could answer. Since no new answers or comments have been added, I'm marking the one answer I got as correct. That being said, the answer suggests a different approach than using Parallel.ForEachAsync.
If I understand the current answer correctly, the answer to my original question of whether or not it's possible to throttle Parallel.ForEachAsync, would be: "no, it's not".
My suggestion is to ditch the Parallel.ForEachAsync approach and instead use the new Chunk LINQ operator in combination with the Task.WhenAll method. You can launch 100 asynchronous operations every second like this:
public async Task<List<TEntity>> GetEntitiesInParallelAsync<TEntity>(
List<IRestRequest> requests) where TEntity : IEntity
{
var tasks = new List<Task<TEntity>>();
foreach (var chunk in requests.Chunk(100))
{
tasks.AddRange(chunk.Select(request => GetAsync<TEntity>(request)));
await Task.Delay(TimeSpan.FromSeconds(1.0));
}
return (await Task.WhenAll(tasks)).ToList();
}
It is assumed that the time required to launch an asynchronous operation (to invoke the GetAsync method) is negligible.
This approach has the inherent disadvantage that, in case of an exception, the failure will not be propagated before all operations have completed. For comparison, the Parallel.ForEachAsync method stops invoking the async delegate and completes ASAP after the first failure is detected.
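If the delayed failure matters, one hedged mitigation (a sketch only, reusing the method above) is to stop launching new chunks once any already-launched task has faulted:
foreach (var chunk in requests.Chunk(100))
{
    tasks.AddRange(chunk.Select(request => GetAsync<TEntity>(request)));
    // if any task has already faulted, stop launching new work; the
    // WhenAll below still waits for the in-flight tasks, then rethrows
    // the first recorded exception
    if (tasks.Any(t => t.IsFaulted))
        break;
    await Task.Delay(TimeSpan.FromSeconds(1.0));
}
return (await Task.WhenAll(tasks)).ToList();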
In my ASP.NET Core app, at some points I'm querying a couple of ADs for data. This being AD, the queries take some time to complete, and the DirectoryServices API contains only synchronous calls.
Is it a good practice to try and wrap the AD sync calls as async? I think it's done like this (just an example, not the real query):
private async Task<string[]> GetUserGroupsAsync(string samAccountName)
{
var func = new Func<string, string[]>(sam =>
{
var result = new List<string>();
using (var ctx = new PrincipalContext(ContextType.Domain, "", "", ""))
{
var p = new UserPrincipal(ctx)
{
SamAccountName = sam
};
using (var search_obj = new PrincipalSearcher(p))
{
var query_result = search_obj.FindOne();
if (query_result != null)
{
var usuario = query_result as UserPrincipal;
var directory_entry = usuario.GetUnderlyingObject() as DirectoryEntry;
var grupos = usuario.GetGroups(ctx).OfType<GroupPrincipal>().ToArray();
if (grupos != null)
{
foreach (GroupPrincipal g in grupos)
{
result.Add(g.Name);
}
}
}
}
}
return result.ToArray();
});
var result = await Task.Run(() => func(samAccountName));
return result;
}
Is it a good practice
Usually not.
In a desktop app where you don't want to hold up the UI thread, this can actually be a good idea. Task.Run moves the work to a different thread, and the UI thread can continue responding to user input while you're waiting for a response.
You tagged ASP.NET. The answer there is also "it depends". ASP.NET has a limited number of worker threads that it's allowed to use. The benefit of asynchronous code is that it allows a thread to go and work on some other request while you're waiting for a response. Thus, you can serve more requests with the same number of available threads. It helps the overall performance of your application.
If you're calling await GetUserGroupsAsync(), then there is absolutely no benefit to doing what you're doing. You're freeing up the calling thread, but you've created a new thread that is going to sit blocked until a response is returned. So your net thread savings is zero, and you have the additional CPU overhead of setting up the task.
If you intend on calling GetUserGroupsAsync() and then going out and getting other data while you wait for a response, then this can save time. It won't save threads, but just time. But you should be conscious that you are now taking up two threads for each request instead of just one, which means you can hit the ASP.NET max thread count faster, potentially hurting the overall performance of your application.
But whether you want to save time in ASP.NET or free up the UI thread in a desktop app, I would still argue that you should not use Task.Run inside GetUserGroupsAsync(). If the caller wants to offload that waiting to another thread so it can go and get other data in the meantime, then the caller can use Task.Run itself, like this:
var groupsTask = Task.Run(() => GetUserGroupsAsync());
// make HTTP request or get some other external data while we wait
var groups = await groupsTask;
The decision on whether you should add a method to a class should depend on the answer to this question: if someone thinks of what this class represents, would they expect the class to have this functionality?
Compare this with the string class and string equality. Most people would consider two strings equal if they have exactly the same characters in the same order. However, for a lot of applications it can be handy to compare two strings case-insensitively. Instead of changing the equality method of string, a new class was created: the StringComparer class contains a number of methods for comparing strings using different definitions of equality.
If someone would say: "Okay, I've just created a class that represents several methods to compare two strings for equality". Would you expect that comparing with case insensitivity is one of the methods of this class? Of course you would!
The same goes for your class. I don't know what your class represents, but apparently you thought that someone who has an object of this class would be happy to "Get User Groups": happy that they don't have to know how the method does its work, and that they don't need to know the insides of the class to be able to get the user groups.
This information hiding is an important property of classes. It gives the creator of the class the freedom to change how the class works internally, without having to change the code that uses it.
So if everyone who knows what your class represents would think "of course getting user groups will take a considerable amount of time" and "of course my thread will be waiting idly while getting user groups", then users of your class would expect async/await, to prevent idle waiting.
On the other hand, it might be that users of your class would say: "Well, I know that getting user groups will take some heavy calculations. It will take some time, but my thread will be very busy". In that case, they won't expect an async method.
Assuming that you have a non-async method to get the user groups:
string[] GetUserGroups(string samAccountName) {...}
The async method would be very simple:
Task<string[]> GetUserGroupsAsync(string samAccountName)
{
return Task.Run(() => GetUserGroups(samAccountName));
}
The only thing you would have to decide is: do the users of my class expect this method?
Advantages and Disadvantages
Disadvantage of having a Sync and an Async method:
People who learn about your class have to learn about more methods
Users of your class can't decide how the async method calls the sync one, without creating an extra async method, which will only add to the confusion
You'll have to add an extra unit test
You'll have to maintain the async method forever.
Advantages of having an async method:
If in the future the user groups were fetched from another process, for instance a database, an XML file, or maybe the internet, then you could change the class internally without having to change the many, many users (after all, all your classes are very popular, aren't they? :)
Conclusion
If people look at your class, and they wouldn't even think that fetching user groups would be an async method, then don't create it.
If you think that maybe in the future another process could provide the user groups, then it would be wise to prepare your users for this.
I am calling an external API which is slow. If I haven't called the API to get some orders for a while, the call can be broken up into pages (pagination).
So fetching the orders can mean making multiple calls rather than one. Each call can take around 10 seconds, so the total can be about a minute, which is far too long.
GetOrdersCall getOrders = new GetOrdersCall();
getOrders.DetailLevelList.Add(DetailLevelCodeType.ReturnSummary);
getOrders.CreateTimeFrom = lastOrderDate;
getOrders.CreateTimeTo = DateTime.Now;
PaginationType paging = new PaginationType();
paging.EntriesPerPage = 20;
paging.PageNumber = 1;
getOrders.Pagination = paging;
getOrders.Execute();
var response = getOrders.ApiResponse;
OrderTypeCollection orders = new OrderTypeCollection();
while (response != null && response.OrderArray.Count > 0)
{
    eBayConverter.ConvertOrders(response.OrderArray, 1);
    if (!response.HasMoreOrders)
        break;
    getOrders.Pagination.PageNumber++;
    getOrders.Execute();
    response = getOrders.ApiResponse;
    orders.AddRange(response.OrderArray);
}
This is a summary of my code above... getOrders.Execute() is where the API call fires.
After the first getOrders.Execute() there is a pagination result which tells me how many pages of data there are. My thinking is that I should be able to start an asynchronous call for each page to populate the OrderTypeCollection, and when all the calls are complete and the collection is fully loaded, commit it to the database.
I have never made asynchronous calls in C# before. I can more or less follow async/await, but I think my scenario falls outside the reading I have done so far.
Questions:
I think I can set it up to fire off the multiple calls asynchronously but I'm not sure how to check when all tasks have been completed i.e. ready to commit to db.
I've read somewhere that I want to avoid combining the API call and the db write to avoid locking in SQL server - Is this correct?
If someone can point me in the right direction - It would be greatly appreciated.
I think I can set it up to fire off the multiple calls asynchronously but I'm not sure how to check when all tasks have been completed i.e. ready to commit to db.
Yes, you can break this up.
The problem is that eBay doesn't have an async Task Execute method, so you are left with blocking threaded calls and no IO-optimised async/await pattern. If it did, you could take advantage of a TPL Dataflow pipeline, which is async-aware (and fun for the whole family); you still could anyway, though I propose a vanilla TPL solution here...
However, all is not lost: just fall back to Parallel.For and a ConcurrentBag<OrderType>.
Example
var concurrentBag = new ConcurrentBag<OrderType>();
// make the first call, add its results to concurrentBag,
// and read the total page count from the pagination result
int pageCount = ...;
// Parallel.For's upper bound is exclusive, so pageCount + 1
// covers pages 2 through pageCount (page 1 was fetched above)
Parallel.For(2, pageCount + 1,
    page =>
    {
        // set up a fresh GetOrdersCall for this iteration
        // (don't share one instance across threads),
        // set Pagination.PageNumber = page, then Execute()
        foreach (var order in getOrders.ApiResponse.OrderArray)
            concurrentBag.Add(order);
    });
// all orders have been downloaded
// save to db
Note: there is a MaxDegreeOfParallelism option you can configure; maybe set it to 50. It won't really matter how much you give it, though; the task scheduler is not going to aggressively give you threads. It will start with maybe 10 or so and grow slowly.
The other way you could do this is to create your own task scheduler, or just spin up your own threads with the old-fashioned Thread class.
I've read somewhere that I want to avoid combining the API call and the db write to avoid locking in SQL server - Is this correct?
If you mean locking as in slow DB inserts, use SQL bulk insert and update tools.
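For the bulk-insert case, a minimal SqlBulkCopy sketch (the connection string, table name, and prepared DataTable are hypothetical):
// insert all downloaded orders in one bulk operation instead of
// issuing row-by-row INSERT statements
using (var connection = new SqlConnection(connectionString))
{
    connection.Open();
    using (var bulkCopy = new SqlBulkCopy(connection))
    {
        bulkCopy.DestinationTableName = "dbo.Orders"; // hypothetical table
        bulkCopy.WriteToServer(ordersDataTable);      // a prepared DataTable
    }
}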
If you mean locking as in the DB deadlock error message, then that is an entirely different thing, and worthy of its own question.
Additional Resources
For(Int32, Int32, ParallelOptions, Action<Int32>)
Executes a for (For in Visual Basic) loop in which iterations may run in parallel and loop options can be configured.
ParallelOptions Class
Stores options that configure the operation of methods on the Parallel class.
MaxDegreeOfParallelism
Gets or sets the maximum number of concurrent tasks enabled by this ParallelOptions instance.
ConcurrentBag<T> Class
Represents a thread-safe, unordered collection of objects.
Yes, the ConcurrentBag<T> class can be used to serve the purpose of one of your questions, which was: "I think I can set it up to fire off the multiple calls asynchronously but I'm not sure how to check when all tasks have been completed i.e. ready to commit to db."
Each parallel task can add its results to the bag as it completes; once Parallel.For (or an awaited Task.WhenAll) returns, all of your tasks have completed and the collected results are ready for further processing. The collection is thread-safe and useful for parallel processing.
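As a hedged sketch of that shape (the FetchPageAsync and SaveToDatabase helpers are hypothetical, and pageCount comes from the first call's pagination result):
var orders = new ConcurrentBag<OrderType>();

// one task per page; each adds its results to the thread-safe bag
IEnumerable<Task> downloads = Enumerable.Range(1, pageCount)
    .Select(async page =>
    {
        foreach (var order in await FetchPageAsync(page)) // hypothetical
            orders.Add(order);
    });

// WhenAll completes only when every page task has finished,
// so at this point it is safe to commit the bag's contents
await Task.WhenAll(downloads);
SaveToDatabase(orders.ToList()); // hypothetical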
I'm developing an API where my endpoint accepts an ID and then performs a heavy operation to compose the result (generates a PDF).
It's possible that I will get the same request for the same resource multiple times in a short timeframe, so I'd like some sort of work queue that keeps track of IDs, and the first request would initiate the actual work and the following requests would (first check if it's already in the pipeline then) simply wait until the work is done and return the same result.
First I thought maybe a ConcurrentDictionary would be useful, but I don't know how to handle the wait until the work is done part. I also looked at ObservableCollection, but I'm not sure about its safety (without putting too much effort into making it safe manually).
If you want to be able to scale this, I would recommend an alternative architecture using a message queue (such as RabbitMQ). Your API would simply publish messages to the queue, and a Windows Service would consume the messages and process them. The two applications could of course share a common data store in order to synchronize the information.
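As a rough sketch of the publishing side, assuming the RabbitMQ.Client package's classic synchronous API (the queue name and payload are hypothetical; the Windows Service would consume from the same queue):
// publish a "generate PDF for this id" message; the heavy work is
// done by a separate consumer process reading from the queue
var factory = new ConnectionFactory { HostName = "localhost" };
using (var connection = factory.CreateConnection())
using (var channel = connection.CreateModel())
{
    channel.QueueDeclare(queue: "pdf-requests", durable: true,
                         exclusive: false, autoDelete: false, arguments: null);
    byte[] body = Encoding.UTF8.GetBytes(resourceId); // the requested ID
    channel.BasicPublish(exchange: "", routingKey: "pdf-requests",
                         basicProperties: null, body: body);
}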
I would probably use a normal dictionary with a lock around it and Tasks inside, which complete when the result for the work item has been calculated.
That means each piece of work is represented as a Task<WorkResult>.
If a new request comes in you can do something like:
lock (currentWork)
{
Task<WorkResult> workItem = null;
if (currentWork.TryGetValue(workId, out workItem))
{
// Work is already in progress. Reuse that work item
return workItem; // The caller can await that
}
// Create a new workItem
workItem = FunctionThatStartsWork(); // Could also be a Task.Run(...) thing
// Store it in the map
currentWork[workId] = workItem;
// Return it to the caller so that he can await it
return workItem;
}
Now there's the remaining question of who removes the item from the dictionary. One approach is to attach a continuation to the work task when it's created, which removes it from the map once it's finished.
workItem.ContinueWith(wi => {
lock (currentWork) {
currentWork.Remove(workId);
}
});
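Since the question mentions ConcurrentDictionary: a variant of the same idea that avoids the explicit lock is GetOrAdd with Lazy<Task<T>>, so only one caller ever starts the work. A sketch, with names adapted from the snippet above (FunctionThatStartsWork taking the id is an assumption):
private readonly ConcurrentDictionary<string, Lazy<Task<WorkResult>>> currentWork =
    new ConcurrentDictionary<string, Lazy<Task<WorkResult>>>();

public Task<WorkResult> GetOrStartWork(string workId)
{
    // GetOrAdd may call the factory on two racing threads, but only one
    // Lazy wins the slot in the dictionary, and Lazy guarantees the work
    // delegate itself runs at most once
    Lazy<Task<WorkResult>> lazy = currentWork.GetOrAdd(workId,
        id => new Lazy<Task<WorkResult>>(() => FunctionThatStartsWork(id)));
    return lazy.Value;
}
The same continuation trick as above can then remove the entry once the task completes.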
Using the Azure Storage Client library 2.1, I'm trying to make a query against Table storage async. I created this code:
public async Task<List<TAzureTableEntity>> GetByPartitionKey(string partitionKey)
{
var theQuery = _table.CreateQuery<TAzureTableEntity>()
.Where(tEnt => tEnt.PartitionKey == partitionKey);
TableQuerySegment<TAzureTableEntity> querySegment = null;
var returnList = new List<TAzureTableEntity>();
while(querySegment == null || querySegment.ContinuationToken != null)
{
querySegment = await theQuery.AsTableQuery()
.ExecuteSegmentedAsync(querySegment != null ?
querySegment.ContinuationToken : null);
returnList.AddRange(querySegment);
}
return returnList;
}
Let's assume there is a large set of data coming back so there will be a lot of round trips to Table Storage. The problem I have is that we're awaiting a set of data, adding it to an in-memory list, awaiting more data, adding it to the same list, awaiting yet more data, adding it to the list... and so on and so forth. Why not just wrap a Task.Factory.StartNew() around a regular TableQuery? Like so:
public async Task<List<TAzureTableEntity>> GetByPartitionKey(string partitionKey)
{
var returnList = await Task.Factory.StartNew(() =>
table.CreateQuery<TAzureTableEntity>()
.Where(ent => ent.PartitionKey == partitionKey)
.ToList());
return returnList;
}
Doing it this way seems like we're not bouncing the SynchronizationContext back and forth so much. Or does it really matter?
Edit to Rephrase Question
What's the difference between the two scenarios mentioned above?
The difference between the two is that your second version will block a ThreadPool thread for the whole time the query is executing. This might be acceptable in a GUI application (where all you want is to execute the code somewhere other than the UI thread), but it will negate any scalability advantages of async in a server application.
Also, if you don't want your first version to return to the UI context for each roundtrip (which is a reasonable requirement), then use ConfigureAwait(false) whenever you use await:
querySegment = await theQuery.AsTableQuery()
.ExecuteSegmentedAsync(…)
.ConfigureAwait(false);
This way, all iterations after the first one will (most likely) execute on a ThreadPool thread and not on the UI context.
BTW, in your second version you don't actually need await at all; you could just return the Task directly:
public Task<List<TAzureTableEntity>> GetByPartitionKey(string partitionKey)
{
return Task.Run(() => table.CreateQuery<TAzureTableEntity>()
.Where(ent => ent.PartitionKey == partitionKey)
.ToList());
}
Not sure if this is the answer you're looking for but I still want to mention it :).
As you may already know, the 2nd method (the synchronous ToList() inside the task) handles continuation tokens internally and only returns once all entities have been fetched, whereas the 1st method fetches one set of entities at a time (up to a maximum of 1,000) and then comes back, giving you the result set as well as a continuation token.
If you're interested in fetching all entities from a table, both methods can be used; however, the 1st one gives you the flexibility of breaking out of the loop gracefully at any time, which you don't get with the 2nd. So using the 1st approach you could essentially introduce a pagination concept.
Let's assume you're building a web application which shows data from a table, and further assume that the table contains a large number of entities (say 100,000). Using the 1st method, you can fetch 1,000 entities, return the result to the user, and, if the user wants, fetch the next set of 1,000 and show them. You could continue doing that for as long as the user wants and there's data in the table. With the 2nd method, the user would have to wait until all 100,000 entities have been fetched from the table.
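For illustration, here's a hedged sketch of that pagination idea, reusing the shape of the first method (one segment of up to 1,000 entities per call; the caller keeps the continuation token between calls):
// fetch a single segment and hand back the continuation token so the
// caller can request the next page later
public async Task<TableQuerySegment<TAzureTableEntity>> GetPageByPartitionKey(
    string partitionKey, TableContinuationToken token = null)
{
    var query = _table.CreateQuery<TAzureTableEntity>()
        .Where(ent => ent.PartitionKey == partitionKey)
        .AsTableQuery();
    // a null token means "start from the beginning"; a non-null token
    // resumes exactly where the previous segment left off
    return await query.ExecuteSegmentedAsync(token);
}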