I wish to download around 100,000 files from a web site. The answers to this question have the same issues as what I tried.
I have tried two approaches, both of which use highly erratic amounts of bandwidth:
The first attempts to synchronously download the files:
// Raise the per-host connection cap before starting
ServicePointManager.DefaultConnectionLimit = 10000;

var a = new ParallelOptions { MaxDegreeOfParallelism = 30 };
Parallel.For(start, end, a, i =>
{
    using (var client = new WebClient())
    {
        ... // download and process one file
    }
});
This works, but my throughput is highly erratic (throughput graph omitted).
The second approach uses a semaphore and async to do the parallelism more manually (without the semaphore it would obviously spawn too many work items):
Parallel.For(start, end, a, i =>
{
list.Add(getAndPreprocess(/*get URL from somewhere*/));
});
...
static async Task getAndPreprocess(string url)
{
    using (var client = new HttpClient())
    {
        sem.WaitOne();   // blocking wait for a free download slot
        string content = "";
        try
        {
            content = await client.GetStringAsync(url);
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.InnerException?.Message ?? ex.Message);
            sem.Release();
            return;
        }
        sem.Release();   // free the slot before post-processing
        try
        {
            // try to use results from content
        }
        catch { return; }
    }
}
My throughput is now better, but still uneven (throughput graph omitted).
Is there a nice way to do this, such that it starts downloading another file when the speed falls, and stops adding new ones when the aggregate speed is constant (like what you would expect a download manager to do)?
Additionally, even though the second form gives better results, I dislike that I have to use semaphores, as it is error prone.
What is the standard way to do this?
Note: these are all small files (<50KB)
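For reference, here is a minimal sketch of the pattern I'm considering: a SemaphoreSlim to cap in-flight downloads plus one shared HttpClient (the urls collection and the Process helper are placeholders):
var sem = new SemaphoreSlim(30);   // cap of 30 concurrent downloads, tune as needed
var client = new HttpClient();     // one shared client for all requests
var tasks = urls.Select(async url =>
{
    await sem.WaitAsync();         // asynchronous wait: no blocked threads
    try
    {
        string content = await client.GetStringAsync(url);
        Process(content);          // placeholder post-processing
    }
    finally
    {
        sem.Release();
    }
}).ToList();
await Task.WhenAll(tasks);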
Related
I have the following code that downloads blob files to strings. It works fine, but performance is very poor: it takes about 50 seconds to process 500 files.
try
{
    var sourceClient = new BlobServiceClient(storageConnectionString);
    var foundItems = sourceClient.FindBlobsByTags("Client = 'TEST'").ToList();
    foreach (var blob in foundItems)
    {
        var blobClient = blobContainer.GetBlockBlobClient(blob.BlobName);
        BlobDownloadResult download = await blobClient.DownloadContentAsync();
        string downloadedData = download.Content.ToString();
        myList.Add(downloadedData);
    }
}
catch (Exception ex)
{
    Console.WriteLine($"Exception: {ex.Message}");
}
I tried using multiple threads for the code, but it still takes about 25 seconds to process 500 files.
var semaphore = new SemaphoreSlim(50);
var tasks = new List<Task>();
try
{
var sourceClient = new BlobServiceClient(storageConnectionString);
var foundItems = sourceClient.FindBlobsByTags("Client = 'TEST'").ToList();
foreach (var blob in foundItems)
{
tasks.Add(Task.Run(async () =>
{
try
{
await semaphore.WaitAsync();
var blobClient = blobContainer.GetBlockBlobClient(blob.BlobName);
BlobDownloadResult download = await blobClient.DownloadContentAsync();
string downloadedData = download.Content.ToString();
myList.Add(downloadedData);
}
finally
{
semaphore.Release();
}
}));
}
await Task.WhenAll(tasks);
}
catch (Exception ex)
{
Console.WriteLine($"Exception: {ex.Message}");
}
I'm pretty new to C#. Am I doing anything wrong with multi-threading? What's the fastest way to read files from blob storage?
Note: the following line of code causes the most delay.
BlobDownloadResult download = await blobClient.DownloadContentAsync();
The two biggest performance problems with your code are:
Don't wrap that download task in Task.Run; you're just using thread-pool threads for no reason.
Stop switching contexts for no reason: use .ConfigureAwait(false) on your await calls.
A third problem, minor in comparison:
You're corrupting your memory by pushing to a List<> from multiple threads. Use a proper concurrent container, like ConcurrentBag<>. Edit: in fact, I'm not even convinced you need the list; use the return value of Task.WhenAll to gather the results instead of doing it by hand. A sketch combining all three fixes is below.
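A minimal sketch of those fixes combined, assuming the same blobContainer and connection string as the question (the SemaphoreSlim keeps the 50-at-a-time throttle; Task.WhenAll collects the results, so no shared list is needed):
var semaphore = new SemaphoreSlim(50);
var sourceClient = new BlobServiceClient(storageConnectionString);
var foundItems = sourceClient.FindBlobsByTags("Client = 'TEST'").ToList();
// No Task.Run: the async download is started directly on the current thread
var tasks = foundItems.Select(async blob =>
{
    await semaphore.WaitAsync().ConfigureAwait(false);
    try
    {
        var blobClient = blobContainer.GetBlockBlobClient(blob.BlobName);
        BlobDownloadResult download = await blobClient.DownloadContentAsync().ConfigureAwait(false);
        return download.Content.ToString();
    }
    finally
    {
        semaphore.Release();
    }
});
// WhenAll returns the downloaded strings in order; no shared List<> required
string[] myList = await Task.WhenAll(tasks).ConfigureAwait(false);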
Recently I was working on a .NET Core Web API project that downloads files from an external API.
I found an issue when the number of files is large (say, more than 100): the API downloads at most 50 files and skips the others. The Web API is deployed on AWS Lambda, and the timeout is 15 minutes.
The operation is timing out due to the long download process.
public async Task<bool> DownloadAttachmentsAsync(List<DownloadAttachment> downloadAttachments)
{
    foreach (var downloadAttachment in downloadAttachments)
    {
        try
        {
            bool downloadFlag = await DownloadAttachment(downloadAttachment.id);
            // update the download status in the database
            if (downloadFlag)
            {
                bool updateFlag = await _DocumentService.UpdateDownloadStatus(downloadAttachment.id);
                if (updateFlag)
                {
                    await DeleteAttachment(downloadAttachment.id);
                }
            }
        }
        catch (Exception ex)
        {
            log.Error(ex, "Error in saving attachment {attachmentId}", downloadAttachment.id);
            return false;
        }
    }
    return true;
}
Document service code
public async Task<bool> UpdateAttachmentDownloadStatus(string AttachmentID)
{
return await _documentRepository.UpdateAttachmentDownloadStatus(AttachmentID);
}
And the DB update code:
public async Task<bool> UpdateAttachmentDownloadStatus(string AttachmentID)
{
    using (var db = new SqlConnection(_connectionString.Value))
    {
        var parameters = new DynamicParameters();
        parameters.Add("@pm_AttachmentID", AttachmentID);
        parameters.Add("@pm_Result", 0, System.Data.DbType.Int32, System.Data.ParameterDirection.Output);
        await db.ExecuteAsync("[Loan].[UpdateDownloadStatus]", parameters, commandType: CommandType.StoredProcedure);
        int result = parameters.Get<int>("@pm_Result");
        return result > 0;
    }
}
How can I make this async task run in parallel and get the result? I tried the following code:
var task = Task.Run(() => DownloadAttachment( downloadAttachment.id));
bool result = task.Result;
Is this approach fine? How can I improve the performance? How do I get the result from each parallel task, update the DB, and delete based on the success flag? Or is this error due to the AWS timeout?
Please help.
If you extracted the code that handles individual files to a separate method:
private async Task DownloadSingleAttachment(DownloadAttachment attachment)
{
    try
    {
        var download = await DownloadAttachment(attachment.id);
        if (download)
        {
            var update = await _DocumentService.UpdateDownloadStatus(attachment.id);
            if (update)
            {
                await DeleteAttachment(attachment.id);
            }
        }
    }
    catch(....)
    {
        ....
    }
}
public async Task<bool> DownloadAttachmentsAsync(List<DownloadAttachment> downloadAttachments)
{
try
{
foreach (var attachment in downloadAttachments)
{
await DownloadSingleAttachment(attachment);
}
}
....
}
It would be easy to start all downloads at once, although not very efficient:
public async Task<bool> DownloadAttachmentsAsync(List<DownloadAttachment> downloadAttachments)
{
try
{
//Start all of them
var tasks=downloadAttachments.Select(att=>DownloadSingleAttachment(att));
await Task.WhenAll(tasks);
}
....
}
This isn't very efficient: external services dislike lots of concurrent calls from a single source, the way this code makes them, and almost certainly impose throttling. Databases don't like lots of concurrent calls either, because in all database products concurrent calls lead to blocking one way or another. Even in databases that use multiversioning, this comes with an overhead.
Using Dataflow classes - Single block
One easy way to fix this is to use .NET's Dataflow classes to break the operation into a pipeline of steps, and execute each one with a different number of concurrent tasks.
We could put the entire operation into a single block, but that could cause problems if the update and delete operations aren't thread-safe:
var dlOptions= new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 10,
};
var downloader=new ActionBlock<DownloadAttachment>(async att=>{
await DownloadSingleAttachment(att);
},dlOptions);
foreach (var attachment in downloadAttachments)
{
await downloader.SendAsync(attachment);
}
downloader.Complete();
await downloader.Completion;
Dataflow - Multiple steps
To avoid possible thread issues, the rest of the methods can go to their own blocks. They could both go into one ActionBlock that calls both Update and Delete, or they could go into separate blocks if the methods talk to different services with different concurrency requirements.
The downloader block will execute at most 10 concurrent downloads. By default, each block uses only a single task at a time.
The updater and deleter blocks have their default DOP=1, which means there's no risk of race conditions as long as they don't try to use, e.g., the same connection at the same time.
var downloader = new TransformBlock<string, (string id, bool download)>(
    async id =>
    {
        var download = await DownloadAttachment(id);
        return (id, download);
    }, dlOptions);

var updater = new TransformBlock<(string id, bool download), (string id, bool update)>(
    async item =>
    {
        if (item.download)
        {
            var update = await _DocumentService.UpdateDownloadStatus(item.id);
            return (item.id, update);
        }
        return (item.id, false);
    });

var deleter = new ActionBlock<(string id, bool update)>(
    async item =>
    {
        if (item.update)
        {
            await DeleteAttachment(item.id);
        }
    });
The blocks can now be linked into a pipeline and used. The PropagateCompletion = true setting means that as soon as a block finishes processing, it tells all its connected blocks to finish as well:
var linkOptions=new DataflowLinkOptions { PropagateCompletion = true};
downloader.LinkTo(updater, linkOptions);
updater.LinkTo(deleter,linkOptions);
We can pump data into the head block for as long as we need. When we're done, we call the head block's Complete() method. As each block finishes processing its data, it propagates its completion to the next block in the pipeline. We need to await the last (tail) block's Completion to ensure all the attachments have been processed:
foreach (var attachment in downloadAttachments)
{
await downloader.SendAsync(attachment.id);
}
downloader.Complete();
await deleter.Completion;
Each block has an input and (when necessary) an output buffer, which means the "producer" and "consumers" of the messages don't have to be in sync, or even know of each other. All the "producer" needs to know is where to find the head block in a pipeline.
Throttling and backpressure
One way to throttle is to use a fixed number of tasks through MaxDegreeOfParallelism.
It's also possible to put a limit to the input buffer, thus blocking previous steps or producers if a block can't process messages fast enough. This can be done simply by setting the BoundedCapacity option for a block:
var dlOptions= new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 10,
BoundedCapacity=20,
};
var updaterOptions= new ExecutionDataflowBlockOptions
{
BoundedCapacity=20,
};
...
var downloader=new TransformBlock<...>(...,dlOptions);
var updater=new TransformBlock<...>(...,updaterOptions);
No other changes are necessary.
To run multiple asynchronous operations you could do something like this:
public async Task RunMultipleAsync<T>(IEnumerable<T> myList)
{
    const int myNumberOfConcurrentOperations = 10;
    var mySemaphore = new SemaphoreSlim(myNumberOfConcurrentOperations);
    var tasks = new List<Task>();
    foreach (var myItem in myList)
    {
        // Wait for a free slot before starting the next operation
        await mySemaphore.WaitAsync();
        var task = RunOperation(myItem);
        tasks.Add(task);
        // Release the slot when the operation completes, successfully or not
        task.ContinueWith(t => mySemaphore.Release());
    }
    await Task.WhenAll(tasks);
}
private async Task RunOperation<T>(T myItem)
{
// Do stuff
}
Put your code from DownloadAttachmentsAsync at the // Do stuff comment.
This uses a semaphore to limit the number of concurrent operations, since running too many concurrent operations is often a bad idea due to contention. You will need to experiment to find the optimal number of concurrent operations for your use case. Also note that error handling has been omitted to keep the example short.
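Usage is then a single call. For the attachment scenario above it might look like this (hypothetical wiring, assuming RunOperation holds the per-attachment download/update/delete logic):
await RunMultipleAsync(downloadAttachments);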
I'm trying to make several GET requests to an API, in parallel, but I'm getting an error ("Too many requests") when trying to do large volumes of requests (1600 items).
The following is a snippet of the code.
Call:
var metadataItemList = await GetAssetMetadataBulk(unitHashList_Unique);
Method:
private static async Task<List<MetadataModel>> GetAssetMetadataBulk(List<string> assetHashes)
{
List<MetadataModel> resultsList = new();
int batchSize = 100;
int batches = (int)Math.Ceiling((double)assetHashes.Count() / batchSize);
for (int i = 0; i < batches; i++)
{
var currentAssets = assetHashes.Skip(i * batchSize).Take(batchSize);
var tasks = currentAssets.Select(asset => EndpointProcessor<MetadataModel>.LoadAddress($"assets/{asset}"));
resultsList.AddRange(await Task.WhenAll(tasks));
}
return resultsList;
}
The method runs tasks in parallel in batches of 100, it works fine for small volumes of requests (<~300), but for greater amounts (~1000+), I get the aforementioned "Too many requests" error.
I tried stepping through the code, and to my surprise, it worked when I manually stepped through it. But I need it to work automatically.
Is there any way to slow down requests, or a better way to somehow circumvent the error whilst maintaining relatively good performance?
The request does not return a "Retry-After" header, and I also don't know how I'd implement this in C#. Any input on what code to edit, or direction to a doc is much appreciated!
The following is the Class I'm using to send HTTP requests:
class EndpointProcessor<T>
{
public static async Task<T> LoadAddress(string url)
{
using (HttpResponseMessage response = await ApiHelper.apiClient.GetAsync(url))
{
if (response.IsSuccessStatusCode)
{
T result = await response.Content.ReadAsAsync<T>();
return result;
}
else
{
//Console.WriteLine("Error: {0} ({1})\nTrailingHeaders: {2}\n", response.StatusCode, response.ReasonPhrase, response.TrailingHeaders);
throw new Exception(response.ReasonPhrase);
}
}
}
}
You can use a semaphore as a limiter for the number of concurrently active requests. Add a field of Semaphore type to your API client and initialize it with a maximum count and an initial count of, say, 250, or whatever you determine to be a safe maximum number of running requests. In your ApiHelper.apiClient.GetAsync() path, try to acquire the semaphore before making the real connection, then release it after the download completes or fails. This will let you enforce a maximum number of concurrently running requests.
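A minimal sketch of that idea, applied to the LoadAddress method from the question (a SemaphoreSlim is used here instead of a Semaphore so the wait can be awaited; the limit of 250 is an assumed value you would tune):
class EndpointProcessor<T>
{
    // Assumed safe maximum of in-flight requests; tune for the target API
    private static readonly SemaphoreSlim _limiter = new SemaphoreSlim(250, 250);

    public static async Task<T> LoadAddress(string url)
    {
        await _limiter.WaitAsync(); // waits (asynchronously) while 250 requests are in flight
        try
        {
            using (HttpResponseMessage response = await ApiHelper.apiClient.GetAsync(url))
            {
                if (response.IsSuccessStatusCode)
                {
                    return await response.Content.ReadAsAsync<T>();
                }
                throw new Exception(response.ReasonPhrase);
            }
        }
        finally
        {
            _limiter.Release(); // free the slot whether the request succeeded or failed
        }
    }
}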
I have to send 10000 messages. At the moment, it happens synchronously and takes up to 20 minutes to send them all.
// sending messages in a sync way
foreach (var message in messages)
{
var result = Send(message);
_logger.Info($"Successfully sent {message.Title}.")
}
To shorten the message-sending time, I'd like to use async and await, but my concern is whether the C# runtime can handle 15,000 tasks in the worker process.
var tasks = new List<Task>();
foreach (var message in messages)
{
tasks.Add(Task.Run(() => Send(message)));
}
var t = Task.WhenAll(tasks);
t.Wait();
...
Also, in terms of memory, I'm not sure it's a good idea to create a list of 15,000 tasks.
Since I came home from work, I have played with this a bit, and here is my answer.
First of all, Parallel.ForEach is pretty cool to use, and on my 8-core machine it runs very fast.
I suggest limiting the CPU usage so you do not use 100% of capacity, but that depends on your system; I have made two suggestions for it below.
The other thing is that you need to monitor your sending server and be sure it can absorb all these jobs without getting into trouble.
Here is the implementation:
public void MessMessageSender(List<Message> messages)
{
try
{
var parallelOptions = new ParallelOptions();
_cancelToken = new CancellationTokenSource();
parallelOptions.CancellationToken = _cancelToken.Token;
var maxProc = System.Environment.ProcessorCount;
// this option uses around 75% of the cores
parallelOptions.MaxDegreeOfParallelism = Convert.ToInt32(Math.Ceiling(maxProc * 0.75));
// the following option uses all cores except one
//parallelOptions.MaxDegreeOfParallelism = maxProc - 1;
try
{
Parallel.ForEach(messages, parallelOptions, message =>
{
try
{
Send(message);
//_logger.Info($"Successfully sent {text.Title}.");
}
catch (Exception ex)
{
//_logger.Error($"Something went wrong {ex}.");
}
});
}
catch (OperationCanceledException e)
{
//User has cancelled this request.
}
}
finally
{
//What ever dispose of clients;
}
}
My answer was inspired by this page.
Documentation:
Parallel.ForEach
Environment.ProcessorCount
Attempting to write an HTML crawler using the Async CTP, I have gotten stuck on how to write a recursion-free method to accomplish this.
This is the code I have so far.
private readonly ConcurrentStack<LinkItem> _LinkStack;
private readonly Int32 _MaxStackSize;
private readonly WebClient client = new WebClient();
Func<string, string, Task<List<LinkItem>>> DownloadFromLink = async (BaseURL, uri) =>
{
string html = await client.DownloadStringTaskAsync(uri);
return LinkFinder.Find(html, BaseURL);
};
Action<LinkItem> DownloadAndPush = async (o) =>
{
List<LinkItem> result = await DownloadFromLink(o.BaseURL, o.Href);
if (this._LinkStack.Count() + result.Count <= this._MaxStackSize)
{
this._LinkStack.PushRange(result.ToArray());
o.Processed = true;
}
};
Parallel.ForEach(this._LinkStack, (o) =>
{
DownloadAndPush(o);
});
But obviously this doesn't work as I would hope, because at the time Parallel.ForEach executes the first (and only) iteration, I have only one item. The simplest approach I can think of is to make the ForEach recursive, but I can't (I don't think) do this, as I would quickly run out of stack space.
Could anyone please guide me as to how I can restructure this code, to create what I would describe as a recursive continuation that adds items until either the MaxStackSize is reached or the system runs out of memory?
I think the best way to do something like this using C# 5/.NET 4.5 is to use TPL Dataflow. There is even a walkthrough on how to implement a web crawler using it.
Basically, you create one "block" that takes care of downloading one URL and getting the link from it:
var cts = new CancellationTokenSource();
Func<LinkItem, Task<IEnumerable<LinkItem>>> downloadFromLink =
async link =>
{
// WebClient is not guaranteed to be thread-safe,
// so we shouldn't use one shared instance
var client = new WebClient();
string html = await client.DownloadStringTaskAsync(link.Href);
return LinkFinder.Find(html, link.BaseURL);
};
var linkFinderBlock = new TransformManyBlock<LinkItem, LinkItem>(
downloadFromLink,
new ExecutionDataflowBlockOptions
{ MaxDegreeOfParallelism = 4, CancellationToken = cts.Token });
You can set MaxDegreeOfParallelism to any value you want. It says at most how many URLs can be downloaded concurrently. If you don't want to limit it at all, you can set it to DataflowBlockOptions.Unbounded.
Then you create one block that processes all the downloaded links somehow, like storing them all in a list. It can also decide when to cancel downloading:
var links = new List<LinkItem>();
var storeBlock = new ActionBlock<LinkItem>(
linkItem =>
{
links.Add(linkItem);
if (links.Count == maxSize)
cts.Cancel();
});
Since we didn't set MaxDegreeOfParallelism, it defaults to 1. That means using a collection that is not thread-safe should be okay here.
We create one more block: it will take a link from linkFinderBlock, and pass it both to storeBlock and back to linkFinderBlock.
var broadcastBlock = new BroadcastBlock<LinkItem>(li => li);
The lambda in its constructor is a "cloning function". You can use it to create a clone of the item if you want to, but it shouldn't be necessary here, since we don't modify the LinkItem after creation.
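If you did want a defensive copy, the cloning function might look like this (a sketch; it assumes LinkItem exposes settable BaseURL, Href, and Processed properties):
var broadcastBlock = new BroadcastBlock<LinkItem>(
    li => new LinkItem { BaseURL = li.BaseURL, Href = li.Href, Processed = li.Processed });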
Now we can connect the blocks together:
linkFinderBlock.LinkTo(broadcastBlock);
broadcastBlock.LinkTo(storeBlock);
broadcastBlock.LinkTo(linkFinderBlock);
Then we can start processing by giving the first item to linkFinderBlock (or broadcastBlock, if you want to also send it to storeBlock):
linkFinderBlock.Post(firstItem);
And finally wait until the processing is complete:
try
{
linkFinderBlock.Completion.Wait();
}
catch (AggregateException ex)
{
if (!(ex.InnerException is TaskCanceledException))
throw;
}