I have the following code that downloads blob files into strings. It works, but performance is very poor: it takes about 50 seconds to process 500 files.
try
{
    var sourceClient = new BlobServiceClient(storageConnectionString);
    var foundItems = sourceClient.FindBlobsByTags("Client = 'TEST'").ToList();
    foreach (var blob in foundItems)
    {
        var blobClient = blobContainer.GetBlockBlobClient(blob.BlobName);
        BlobDownloadResult download = await blobClient.DownloadContentAsync();
        string downloadedData = download.Content.ToString();
        myList.Add(downloadedData);
    }
}
catch (Exception ex)
{
    Console.WriteLine($"Exception: {ex.Message}");
}
I tried multi-threading the code, but it still takes about 25 seconds to process 500 files:
var semaphore = new SemaphoreSlim(50);
var tasks = new List<Task>();
try
{
    var sourceClient = new BlobServiceClient(storageConnectionString);
    var foundItems = sourceClient.FindBlobsByTags("Client = 'TEST'").ToList();
    foreach (var blob in foundItems)
    {
        tasks.Add(Task.Run(async () =>
        {
            try
            {
                await semaphore.WaitAsync();
                var blobClient = blobContainer.GetBlockBlobClient(blob.BlobName);
                BlobDownloadResult download = await blobClient.DownloadContentAsync();
                string downloadedData = download.Content.ToString();
                myList.Add(downloadedData);
            }
            finally
            {
                semaphore.Release();
            }
        }));
    }
    await Task.WhenAll(tasks);
}
catch (Exception ex)
{
    Console.WriteLine($"Exception: {ex.Message}");
}
I'm pretty new to C#. Am I doing anything wrong with the multi-threading? What's the fastest way to read files from blob storage?
Note: the following line of code causes the most delay.
BlobDownloadResult download = await blobClient.DownloadContentAsync();
The two biggest performance problems with your code are:
Don't wrap the download in Task.Run; you're just burning thread pool threads for no reason.
Stop switching contexts for no reason: use .ConfigureAwait(false) on your await calls.
A third problem, minor in comparison:
You're corrupting your memory by pushing to a List<> from multiple threads. Use a proper concurrent container, like ConcurrentBag<>. Edit: in fact I'm not even convinced you need the list; use the return value of Task.WhenAll to gather the results instead of collecting them by hand.
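Putting those fixes together, a minimal sketch could look like this (it assumes the same storageConnectionString, blobContainer and Azure.Storage.Blobs types as in your question):

var sourceClient = new BlobServiceClient(storageConnectionString);
var foundItems = sourceClient.FindBlobsByTags("Client = 'TEST'").ToList();

// One plain async task per blob: no Task.Run, no shared list.
var tasks = foundItems.Select(async blob =>
{
    var blobClient = blobContainer.GetBlockBlobClient(blob.BlobName);
    BlobDownloadResult download =
        await blobClient.DownloadContentAsync().ConfigureAwait(false);
    return download.Content.ToString();
});

// WhenAll gathers the results; no ConcurrentBag or hand-rolled collection needed.
string[] myList = await Task.WhenAll(tasks);

You may still want a SemaphoreSlim around the download itself if 500 simultaneous requests run into service-side throttling.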
Related
I have a cron job that queries a database table and gets about half a million records back. I need to loop through all of that data and send API POSTs to a third-party API. In general this works fine, but the processing time is forever (10 hours). I need a way to speed it up. I've been trying to use a list of Task with SemaphoreSlim, but I'm running into issues (it doesn't like that my API call returns a Task). I'm wondering if anyone has a solution that won't destroy the VM's memory?
Current code looks something like:
foreach (var data in dataList)
{
    try
    {
        var response = await _apiService.PostData(data);
        _logger.Trace(response.Message);
    }
    catch { /* ... */ }
}
But I'm trying to do this and getting the syntax wrong:
var tasks = new List<Task<DataObj>>();
var throttler = new SemaphoreSlim(10);
foreach (var data in dataList)
{
    await throttler.WaitAsync();
    tasks.Add(Task.Run(async () =>
    {
        try
        {
            var response = await _apiService.PostData(data);
            _logger.Trace(response.Message);
        }
        finally
        {
            throttler.Release();
        }
    }));
}
Your list is of type Task<DataObj>, but your async lambda doesn't return anything, so its return type is Task. To fix the syntax, just return the value:
var response = await _apiService.PostData(data);
_logger.Trace(response.Message);
return response;
As others have noted in the comments, I also recommend not using Task.Run here. A local async method would work fine:
var tasks = new List<Task<DataObj>>();
var throttler = new SemaphoreSlim(10);
foreach (var data in dataList)
{
    tasks.Add(ThrottledPostData(data));
}
var results = await Task.WhenAll(tasks);

async Task<DataObj> ThrottledPostData(Data data)
{
    await throttler.WaitAsync();
    try
    {
        var response = await _apiService.PostData(data);
        _logger.Trace(response.Message);
        return response;
    }
    finally
    {
        throttler.Release();
    }
}
Hi, recently I was working on a .NET Core Web API project that downloads files from an external API.
I found some issues when the number of files is large, say more than 100: the API downloads at most 50 files and skips the others. The Web API is deployed on AWS Lambda with a 15-minute timeout.
The operation is timing out because the download process takes too long.
public async Task<bool> DownloadAttachmentsAsync(List<DownloadAttachment> downloadAttachments)
{
    try
    {
        bool DownloadFlag = false;
        foreach (DownloadAttachment downloadAttachment in downloadAttachments)
        {
            DownloadFlag = await DownloadAttachment(downloadAttachment.id);
            // update the download status in the database
            if (DownloadFlag)
            {
                bool UpdateFlag = await _DocumentService.UpdateDownloadStatus(downloadAttachment.id);
                if (UpdateFlag)
                {
                    await DeleteAttachment(downloadAttachment.id);
                }
            }
        }
        return true;
    }
    catch (Exception ext)
    {
        log.Error(ext, "Error in saving attachment");
        return false;
    }
}
Document service code
public async Task<bool> UpdateAttachmentDownloadStatus(string AttachmentID)
{
    return await _documentRepository.UpdateAttachmentDownloadStatus(AttachmentID);
}
And the DB update code:
public async Task<bool> UpdateAttachmentDownloadStatus(string AttachmentID)
{
    using (var db = new SqlConnection(_connectionString.Value))
    {
        var Result = 0; bool SuccessFlag = false;
        var parameters = new DynamicParameters();
        parameters.Add("@pm_AttachmentID", AttachmentID);
        parameters.Add("@pm_Result", Result, System.Data.DbType.Int32, System.Data.ParameterDirection.Output);
        var result = await db.ExecuteAsync("[Loan].[UpdateDownloadStatus]", parameters, commandType: CommandType.StoredProcedure);
        Result = parameters.Get<int>("@pm_Result");
        if (Result > 0) { SuccessFlag = true; }
        return SuccessFlag;
    }
}
How can I make this async task run in parallel and get the result? I tried the following code:
var task = Task.Run(() => DownloadAttachment(downloadAttachment.id));
bool result = task.Result;
Is this approach fine? How can I improve performance? How do I get the result from each parallel task, update the DB, and delete based on the success flag? Or is this error due to the AWS timeout?
Please help.
If you extracted the code that handles individual files to a separate method:
private async Task DownloadSingleAttachment(DownloadAttachment attachment)
{
    try
    {
        var download = await DownloadAttachment(attachment.id);
        if (download)
        {
            var update = await _DocumentService.UpdateDownloadStatus(attachment.id);
            if (update)
            {
                await DeleteAttachment(attachment.id);
            }
        }
    }
    catch (....)
    {
        ....
    }
}
public async Task<bool> DownloadAttachmentsAsync(List<DownloadAttachment> downloadAttachments)
{
    try
    {
        foreach (var attachment in downloadAttachments)
        {
            await DownloadSingleAttachment(attachment);
        }
    }
    ....
}
It would be easy to start all downloads at once, although not very efficient:
public async Task<bool> DownloadAttachmentsAsync(List<DownloadAttachment> downloadAttachments)
{
    try
    {
        // Start all of them
        var tasks = downloadAttachments.Select(att => DownloadSingleAttachment(att));
        await Task.WhenAll(tasks);
    }
    ....
}
This isn't very efficient, because external services hate receiving lots of concurrent calls from a single source, and almost certainly impose throttling. Databases don't like lots of concurrent calls either, because in all database products concurrent calls lead to blocking one way or another. Even in databases that use multiversioning, this comes with an overhead.
Using Dataflow classes - Single block
One easy way to fix this is to use .NET's Dataflow classes to break the operation into a pipeline of steps, and execute each one with a different number of concurrent tasks.
We could put the entire operation into a single block, but that could cause problems if the update and delete operations aren't thread-safe:
var dlOptions = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 10,
};
var downloader = new ActionBlock<DownloadAttachment>(async att =>
{
    await DownloadSingleAttachment(att);
}, dlOptions);

foreach (var attachment in downloadAttachments)
{
    await downloader.SendAsync(attachment);
}
downloader.Complete();
await downloader.Completion;
Dataflow - Multiple steps
To avoid possible thread issues, the rest of the methods can go to their own blocks. They could both go into one ActionBlock that calls both Update and Delete, or they could go into separate blocks if the methods talk to different services with different concurrency requirements.
The downloader block will execute at most 10 concurrent downloads; by default, every other block uses only a single task at a time.
The updater and deleter blocks keep their default DOP = 1, which means there's no risk of race conditions as long as they don't try to use e.g. the same connection at the same time.
var downloader = new TransformBlock<string, (string id, bool download)>(
    async id =>
    {
        var download = await DownloadAttachment(id);
        return (id, download);
    }, dlOptions);

var updater = new TransformBlock<(string id, bool download), (string id, bool update)>(
    async item =>
    {
        if (item.download)
        {
            var update = await _DocumentService.UpdateDownloadStatus(item.id);
            return (item.id, update);
        }
        return (item.id, false);
    });

var deleter = new ActionBlock<(string id, bool update)>(
    async item =>
    {
        if (item.update)
        {
            await DeleteAttachment(item.id);
        }
    });
The blocks can be linked into a pipeline now and used. The setting PropagateCompletion = true means that as soon as a block is finished processing, it will tell all its connected blocks to finish as well:
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
downloader.LinkTo(updater, linkOptions);
updater.LinkTo(deleter, linkOptions);
We can pump data into the head block as long as we need. When we're done, we call the head block's Complete() method. As each block finishes processing its data, it will propagate its completion to the next block in the pipeline. We need to await the last (tail) block's completion to ensure all the attachments have been processed:
foreach (var attachment in downloadAttachments)
{
    await downloader.SendAsync(attachment.id);
}
downloader.Complete();
await deleter.Completion;
Each block has an input and (when necessary) an output buffer, which means the "producer" and "consumers" of the messages don't have to be in sync, or even know of each other. All the "producer" needs to know is where to find the head block in a pipeline.
Throttling and backpressure
One way to throttle is to use a fixed number of tasks through MaxDegreeOfParallelism.
It's also possible to put a limit to the input buffer, thus blocking previous steps or producers if a block can't process messages fast enough. This can be done simply by setting the BoundedCapacity option for a block:
var dlOptions = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 10,
    BoundedCapacity = 20,
};
var updaterOptions = new ExecutionDataflowBlockOptions
{
    BoundedCapacity = 20,
};
...
var downloader = new TransformBlock<...>(..., dlOptions);
var updater = new TransformBlock<...>(..., updaterOptions);
No other changes are necessary.
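For illustration, the feeding loop shown earlier works without modification; once BoundedCapacity is set, SendAsync itself provides the backpressure:

foreach (var attachment in downloadAttachments)
{
    // Completes immediately while the downloader's buffer has room,
    // otherwise waits until the pipeline catches up.
    await downloader.SendAsync(attachment.id);
}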
To run multiple asynchronous operations you could do something like this:
public async Task RunMultipleAsync<T>(IEnumerable<T> myList)
{
    const int myNumberOfConcurrentOperations = 10;
    var mySemaphore = new SemaphoreSlim(myNumberOfConcurrentOperations);
    var tasks = new List<Task>();
    foreach (var myItem in myList)
    {
        await mySemaphore.WaitAsync();
        var task = RunOperation(myItem);
        tasks.Add(task);
        task.ContinueWith(t => mySemaphore.Release());
    }
    await Task.WhenAll(tasks);
}

private async Task RunOperation<T>(T myItem)
{
    // Do stuff
}
Put your code from DownloadAttachmentsAsync at the 'Do stuff' comment.
This uses a semaphore to limit the number of concurrent operations, since running too many concurrent operations is often a bad idea due to contention. You would need to experiment to find the optimal number of concurrent operations for your use case. Also note that error handling has been omitted to keep the example short.
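As a variation (a sketch, not part of the answer above), the release can live in a finally block inside a hypothetical wrapper method instead of a ContinueWith continuation; the throttling is identical, but the release is visibly paired with the operation:

private async Task RunOperationThrottled<T>(T myItem, SemaphoreSlim semaphore)
{
    try
    {
        await RunOperation(myItem); // Do stuff, as above
    }
    finally
    {
        semaphore.Release(); // Released even if the operation throws
    }
}

The calling loop still awaits mySemaphore.WaitAsync() before adding each RunOperationThrottled(myItem, mySemaphore) task to the list.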
I have millions of log files generated every day, and I need to read all of them and put them together as a single file in order to do some processing on it in another app.
I'm looking for the fastest way to do this. Currently I'm using threads, tasks and Parallel like this:
Parallel.For(0, files.Length, new ParallelOptions { MaxDegreeOfParallelism = 100 }, i =>
{
    ReadFiles(files[i]);
});

void ReadFiles(string file)
{
    try
    {
        var txt = File.ReadAllText(file);
        filesTxt.Add(txt);
    }
    catch { }
    GlobalCls.ThreadNo--;
}
or
foreach (var file in files)
{
    //Int64 index = i;
    //var file = files[index];
    while (Process.GetCurrentProcess().Threads.Count > 100)
    {
        Thread.Sleep(100);
        Application.DoEvents();
    }
    new Thread(() => ReadFiles(file)).Start();
    GlobalCls.ThreadNo++;
    // Task.Run(() => ReadFiles(file));
}
The problem is that after a few thousand files, the reading gets slower and slower!
Any idea why? And what's the fastest approach to reading millions of small files? Thank you.
It seems that you are loading the contents of all files in memory, before writing them back to the single file. This could explain why the process becomes slower over time.
A way to optimize the process is to separate the reading part from the writing part, and do them in parallel. This is called the producer-consumer pattern. It can be implemented with the Parallel class, or with threads, or with tasks, but I will demonstrate instead an implementation based on the powerful TPL Dataflow library, which is particularly suited for jobs like this.
private static async Task MergeFiles(IEnumerable<string> sourceFilePaths,
    string targetFilePath, CancellationToken cancellationToken = default,
    IProgress<int> progress = null)
{
    var readerBlock = new TransformBlock<string, string>(async filePath =>
    {
        return File.ReadAllText(filePath); // Read the small file
    }, new ExecutionDataflowBlockOptions()
    {
        MaxDegreeOfParallelism = 2, // Reading is parallelizable
        BoundedCapacity = 100, // No more than 100 file-paths buffered
        CancellationToken = cancellationToken, // Cancel at any time
    });

    StreamWriter streamWriter = null;
    int filesProcessed = 0;
    var writerBlock = new ActionBlock<string>(text =>
    {
        streamWriter.Write(text); // Append to the target file
        filesProcessed++;
        if (filesProcessed % 10 == 0) progress?.Report(filesProcessed);
    }, new ExecutionDataflowBlockOptions()
    {
        MaxDegreeOfParallelism = 1, // We can't parallelize the writer
        BoundedCapacity = 100, // No more than 100 file-contents buffered
        CancellationToken = cancellationToken, // Cancel at any time
    });

    readerBlock.LinkTo(writerBlock,
        new DataflowLinkOptions() { PropagateCompletion = true });

    // This is a tricky part. We use BoundedCapacity, so we must propagate manually
    // a possible failure of the writer to the reader, otherwise a deadlock may occur.
    PropagateFailure(writerBlock, readerBlock);

    // Open the output stream
    using (streamWriter = new StreamWriter(targetFilePath))
    {
        // Feed the reader with the file paths
        foreach (var filePath in sourceFilePaths)
        {
            var accepted = await readerBlock.SendAsync(filePath,
                cancellationToken); // Cancel at any time
            if (!accepted) break; // This will happen if the reader fails
        }
        readerBlock.Complete();
        await writerBlock.Completion;
    }

    async void PropagateFailure(IDataflowBlock block1, IDataflowBlock block2)
    {
        try { await block1.Completion.ConfigureAwait(false); }
        catch (Exception ex)
        {
            if (block1.Completion.IsCanceled) return; // On cancellation do nothing
            block2.Fault(ex);
        }
    }
}
Usage example:
var cts = new CancellationTokenSource();
var progress = new Progress<int>(value =>
{
    // Safe to update the UI
    Console.WriteLine($"Files processed: {value:#,0}");
});
var sourceFilePaths = Directory.EnumerateFiles(@"C:\SourceFolder", "*.log",
    SearchOption.AllDirectories); // Include subdirectories
await MergeFiles(sourceFilePaths, @"C:\AllLogs.log", cts.Token, progress);
The BoundedCapacity is used to keep the memory usage under control.
If the disk drive is an SSD, you can try reading with a MaxDegreeOfParallelism larger than 2.
For best performance you could consider writing to a different disk drive than the one containing the source files.
The TPL Dataflow library is available as a package (System.Threading.Tasks.Dataflow) for .NET Framework, and is built into .NET Core.
When it comes to IO operations, CPU parallelism is useless. Your IO device (disk, network, whatever) is your bottleneck. By reading from the device concurrently you risk lowering your performance even further.
Perhaps you can just use PowerShell to concatenate the files, such as in this answer.
Another alternative is to write a program that uses the FileSystemWatcher class to watch for new files and append them as they are created.
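A minimal sketch of that FileSystemWatcher idea (the paths are placeholders, and real code would have to retry files that are still locked by the process writing them):

using System;
using System.IO;

class LogAppender
{
    static void Main()
    {
        var gate = new object(); // Created events arrive on thread-pool threads
        using var output = new StreamWriter(@"C:\AllLogs.log", append: true);
        using var watcher = new FileSystemWatcher(@"C:\Logs", "*.log");

        watcher.Created += (sender, e) =>
        {
            lock (gate) // serialize appends if several files appear at once
            {
                output.Write(File.ReadAllText(e.FullPath));
                output.Flush();
            }
        };

        watcher.EnableRaisingEvents = true;
        Console.WriteLine("Watching... press Enter to stop.");
        Console.ReadLine();
    }
}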
I'm making an app that shows some data collected from the web in a Windows Forms UI. Today I have to wait for all the data to download sequentially before showing it. How can I do the downloads in parallel with a limited queue (a maximum number of concurrent tasks executing), refreshing a DataGridView as results arrive?
What I have today is this method:
internal async Task<string> RequestDataAsync(string uri)
{
    var wb = new System.Net.WebClient();
    var sourceAsync = wb.DownloadStringTaskAsync(uri);
    string data = await sourceAsync;
    return data;
}
I call it in a foreach(), and after it ends I parse the data into a list of custom objects, then convert that to a DataTable and bind the DataGridView to it.
I'm not sure whether the best way is the LimitedConcurrencyLevelTaskScheduler from the example at https://msdn.microsoft.com/library/system.threading.tasks.taskscheduler.aspx (I'm not sure how it could report to the grid each time a resource is downloaded), or whether there is a better way to do this.
I don't want to start all tasks at the same time, because sometimes I may have 100 downloads to request, and I'd like at most, say, 10 tasks executing at once.
I know this is a question about controlling concurrent tasks and reporting progress while downloading, but I'm not sure what the best approach is nowadays.
I don't often recommend my book, but I think it would help you.
Concurrent asynchrony is done via Task.WhenAll (recipe 2.4 in my book):
List<string> uris = ...;
var tasks = uris.Select(uri => RequestDataAsync(uri));
string[] results = await Task.WhenAll(tasks);
To limit concurrency, use a SemaphoreSlim (recipe 11.5 in my book):
List<string> uris = ...;
var semaphore = new SemaphoreSlim(10);
var tasks = uris.Select(async uri =>
{
    await semaphore.WaitAsync();
    try { return await RequestDataAsync(uri); }
    finally { semaphore.Release(); }
});
string[] results = await Task.WhenAll(tasks);
To process data as it arrives, introduce another async method (recipe 2.6 in my book):
List<string> uris = ...;
var semaphore = new SemaphoreSlim(10);
var tasks = uris.Select(async uri =>
{
    await semaphore.WaitAsync();
    try { await RequestAndProcessDataAsync(uri); }
    finally { semaphore.Release(); }
});
await Task.WhenAll(tasks);
async Task RequestAndProcessDataAsync(string uri)
{
    var data = await RequestDataAsync(uri);
    var myObject = Parse(data);
    _listBoundToDataTable.Add(myObject);
}
I wish to download around 100,000 files from a web site. The answers to this question have the same issues as what I tried.
I have tried two approaches, both of which use highly erratic amounts of bandwidth:
The first attempts to synchronously download the files:
ParallelOptions a = new ParallelOptions();
a.MaxDegreeOfParallelism = 30;
ServicePointManager.DefaultConnectionLimit = 10000;
Parallel.For(start, end, a, i =>
{
    using (var client = new WebClient())
    {
        ...
    }
});
This works, but my throughput looks like this:
[throughput graph omitted; usage is highly erratic]
The second way involves using a semaphore and async to do the parallelism more manually (without the semaphore it would obviously spawn too many work items):
Parallel.For(start, end, a, i =>
{
    list.Add(getAndPreprocess(/*get URL from somewhere*/));
});
...
static async Task getAndPreprocess(string url)
{
    var client = new HttpClient();
    sem.WaitOne();
    string content = "";
    try
    {
        var data = client.GetStringAsync(url);
        content = await data;
    }
    catch (Exception ex) { Console.WriteLine(ex.InnerException.Message); sem.Release(); return; }
    sem.Release();
    try
    {
        //try to use results from content
    }
    catch { return; }
}
My throughput now looks like this:
[throughput graph omitted; better, but still erratic]
Is there a nice way to do this, such that it starts downloading another file when the speed falls, and stops adding new ones when the aggregate speed is constant (like what you would expect a download manager to do)?
Additionally, even though the second form gives better results, I dislike having to use semaphores, as they are error-prone.
What's the standard way to do this?
Note: these are all small files (<50KB)