Reading millions of small files with C#


I have millions of log files that are generated every day, and I need to read all of them and combine them into a single file for processing in another app.
I'm looking for the fastest way to do this. Currently I'm using Threads, Tasks and Parallel like this:
Parallel.For(0, files.Length, new ParallelOptions { MaxDegreeOfParallelism = 100 }, i =>
{
ReadFiles(files[i]);
});
void ReadFiles(string file)
{
try
{
var txt = File.ReadAllText(file);
filesTxt.Add(txt);
}
catch { }
GlobalCls.ThreadNo--;
}
or
foreach (var file in files)
{
//Int64 index = i;
//var file = files[index];
while (Process.GetCurrentProcess().Threads.Count > 100)
{
Thread.Sleep(100);
Application.DoEvents();
}
new Thread(() => ReadFiles(file)).Start();
GlobalCls.ThreadNo++;
// Task.Run(() => ReadFiles(file));
}
The problem is that after reading a few thousand files, the reading gets slower and slower.
Any idea why? And what is the fastest approach to reading millions of small files? Thank you.

It seems that you are loading the contents of all files in memory, before writing them back to the single file. This could explain why the process becomes slower over time.
A way to optimize the process is to separate the reading part from the writing part, and do them in parallel. This is called the producer-consumer pattern. It can be implemented with the Parallel class, or with threads, or with tasks, but I will demonstrate instead an implementation based on the powerful TPL Dataflow library, that is particularly suited for jobs like this.
private static async Task MergeFiles(IEnumerable<string> sourceFilePaths,
string targetFilePath, CancellationToken cancellationToken = default,
IProgress<int> progress = null)
{
var readerBlock = new TransformBlock<string, string>(filePath =>
{
return File.ReadAllText(filePath); // Read the small file
}, new ExecutionDataflowBlockOptions()
{
MaxDegreeOfParallelism = 2, // Reading is parallelizable
BoundedCapacity = 100, // No more than 100 file-paths buffered
CancellationToken = cancellationToken, // Cancel at any time
});
StreamWriter streamWriter = null;
int filesProcessed = 0;
var writerBlock = new ActionBlock<string>(text =>
{
streamWriter.Write(text); // Append to the target file
filesProcessed++;
if (filesProcessed % 10 == 0) progress?.Report(filesProcessed);
}, new ExecutionDataflowBlockOptions()
{
MaxDegreeOfParallelism = 1, // We can't parallelize the writer
BoundedCapacity = 100, // No more than 100 file-contents buffered
CancellationToken = cancellationToken, // Cancel at any time
});
readerBlock.LinkTo(writerBlock,
new DataflowLinkOptions() { PropagateCompletion = true });
// This is a tricky part. We use BoundedCapacity, so we must propagate manually
// a possible failure of the writer to the reader, otherwise a deadlock may occur.
PropagateFailure(writerBlock, readerBlock);
// Open the output stream
using (streamWriter = new StreamWriter(targetFilePath))
{
// Feed the reader with the file paths
foreach (var filePath in sourceFilePaths)
{
var accepted = await readerBlock.SendAsync(filePath,
cancellationToken); // Cancel at any time
if (!accepted) break; // This will happen if the reader fails
}
readerBlock.Complete();
await writerBlock.Completion;
}
async void PropagateFailure(IDataflowBlock block1, IDataflowBlock block2)
{
try { await block1.Completion.ConfigureAwait(false); }
catch (Exception ex)
{
if (block1.Completion.IsCanceled) return; // On cancellation do nothing
block2.Fault(ex);
}
}
}
Usage example:
var cts = new CancellationTokenSource();
var progress = new Progress<int>(value =>
{
// Safe to update the UI
Console.WriteLine($"Files processed: {value:#,0}");
});
var sourceFilePaths = Directory.EnumerateFiles(@"C:\SourceFolder", "*.log",
SearchOption.AllDirectories); // Include subdirectories
await MergeFiles(sourceFilePaths, @"C:\AllLogs.log", cts.Token, progress);
The BoundedCapacity is used to keep the memory usage under control.
If the disk drive is an SSD, you can try reading with a MaxDegreeOfParallelism larger than 2.
For best performance, consider writing to a different disk drive than the one containing the source files.
The TPL Dataflow library is available as a package for .NET Framework, and is built-in for .NET Core.

When it comes to I/O operations, CPU parallelism is useless. Your I/O device (disk, network, whatever) is your bottleneck. By reading from the device concurrently, you may even reduce your performance.
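If the device really is the bottleneck, the simplest baseline is a single-threaded merge that streams each file straight into the output. The following is only a sketch of that idea, not a measured recommendation; the class and method names (SequentialMerge, MergeFiles) and the 1 MB buffer size are assumptions chosen for illustration.

```csharp
using System.Collections.Generic;
using System.IO;

public static class SequentialMerge
{
    // Appends every source file to the target, one after another, on a single thread.
    // Returns the number of files copied.
    public static int MergeFiles(IEnumerable<string> sourceFilePaths, string targetFilePath)
    {
        int count = 0;
        // Assumed buffer size of 1 MB, to keep the number of write calls low.
        using (var output = new FileStream(targetFilePath, FileMode.Create,
            FileAccess.Write, FileShare.None, bufferSize: 1 << 20))
        {
            foreach (var path in sourceFilePaths)
            {
                using (var input = File.OpenRead(path))
                {
                    // Copies bytes stream-to-stream; no file is ever held as a string.
                    input.CopyTo(output);
                }
                count++;
            }
        }
        return count;
    }
}
```

Because it copies raw bytes, this sketch performs no encoding conversion. The target file should also live outside the folder being enumerated, otherwise the enumeration may pick it up as a source.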

Perhaps you can just use PowerShell to concatenate the files, such as in this answer.
Another alternative is to write a program that uses the FileSystemWatcher class to watch for new files and append them as they are created.

Related

Running parallel async tasks and return result in .NET Core Web API

Hi. Recently I was working on a .NET Core Web API project that downloads files from an external API.
We recently found an issue in this API when the number of files is large, say more than 100: the API downloads at most 50 files and skips the others. The Web API is deployed on AWS Lambda, and the timeout is 15 minutes.
The operation is actually timing out because of the long download process.
public async Task<bool> DownloadAttachmentsAsync(List<DownloadAttachment> downloadAttachments)
{
try
{
bool DownloadFlag = false;
foreach (DownloadAttachment downloadAttachment in downloadAttachments)
{
DownloadFlag = await DownloadAttachment(downloadAttachment.id);
//update the download status in database
if(DownloadFlag)
{
bool UpdateFlag = await _DocumentService.UpdateDownloadStatus(downloadAttachment.id);
if (UpdateFlag)
{
await DeleteAttachment(downloadAttachment.id);
}
}
}
return true;
}
catch (Exception ext)
{
log.Error(ext, "Error in Saving attachment {attachemntId}",downloadAttachment.id);
return false;
}
}
Document service code
public async Task<bool> UpdateAttachmentDownloadStatus(string AttachmentID)
{
return await _documentRepository.UpdateAttachmentDownloadStatus(AttachmentID);
}
And DB update code
public async Task<bool> UpdateAttachmentDownloadStatus(string AttachmentID)
{
using (var db = new SqlConnection(_connectionString.Value))
{
var Result = 0; bool SuccessFlag = false;
var parameters = new DynamicParameters();
parameters.Add("@pm_AttachmentID", AttachmentID);
parameters.Add("@pm_Result", Result, System.Data.DbType.Int32, System.Data.ParameterDirection.Output);
var result = await db.ExecuteAsync("[Loan].[UpdateDownloadStatus]", parameters, commandType: CommandType.StoredProcedure);
Result = parameters.Get<int>("@pm_Result");
if (Result > 0) { SuccessFlag = true; }
return SuccessFlag;
}
}
How can I make this async task run in parallel and get the result? I tried the following code:
var task = Task.Run(() => DownloadAttachment( downloadAttachment.id));
bool result = task.Result;
Is this approach fine? How can I improve the performance? How do I get the result from each parallel task, update the DB, and delete based on the success flag? Or is this error due to the AWS timeout?
Please help.

Suppose you extract the code that handles an individual file into a separate method:
private async Task DownloadSingleAttachment(DownloadAttachment attachment)
{
try
{
var download = await DownloadAttachment(attachment.id);
if(download)
{
var update = await _DocumentService.UpdateDownloadStatus(attachment.id);
if (update)
{
await DeleteAttachment(attachment.id);
}
}
}
catch(....)
{
....
}
}
public async Task<bool> DownloadAttachmentsAsync(List<DownloadAttachment> downloadAttachments)
{
try
{
foreach (var attachment in downloadAttachments)
{
await DownloadSingleAttachment(attachment);
}
}
....
}
It would then be easy to start all downloads at once, although that is not very efficient:
public async Task<bool> DownloadAttachmentsAsync(List<DownloadAttachment> downloadAttachments)
{
try
{
//Start all of them
var tasks=downloadAttachments.Select(att=>DownloadSingleAttachment(att));
await Task.WhenAll(tasks);
}
....
}
This isn't very efficient because external services dislike many concurrent calls from a single source, and almost certainly impose throttling. The database doesn't like lots of concurrent calls either, because in all database products concurrent calls lead to blocking one way or another. Even in databases that use multiversioning, this comes with an overhead.
Using Dataflow classes - Single block
One easy way to fix this is to use .NET's Dataflow classes to break the operation into a pipeline of steps, and execute each one with a different number of concurrent tasks.
We could put the entire operation into a single block, but that could cause problems if the update and delete operations aren't thread-safe :
var dlOptions= new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 10,
};
var downloader=new ActionBlock<DownloadAttachment>(async att=>{
await DownloadSingleAttachment(att);
},dlOptions);
foreach (var attachment in downloadAttachments)
{
await downloader.SendAsync(attachment);
}
downloader.Complete();
await downloader.Completion;
Dataflow - Multiple steps
To avoid possible thread issues, the rest of the methods can go to their own blocks. They could both go into one ActionBlock that calls both Update and Delete, or they could go into separate blocks if the methods talk to different services with different concurrency requirements.
The downloader block will execute at most 10 concurrent downloads. By default, each block uses only a single task at a time.
The updater and deleter blocks keep their default DOP=1, which means there's no risk of race conditions as long as they don't try to use, e.g., the same connection at the same time.
var downloader=new TransformBlock<string,(string id,bool download)>(
async id=> {
var download=await DownloadAttachment(id);
return (id,download);
},dlOptions);
var updater=new TransformBlock<(string id,bool download),(string id,bool update)>(
async msg=> {
// A lambda cannot deconstruct a tuple in its parameter list, so unpack it here
var (id,download)=msg;
if(download)
{
var update = await _DocumentService.UpdateDownloadStatus(id);
return (id,update);
}
return (id,false);
});
var deleter=new ActionBlock<(string id,bool update)>(
async msg=> {
var (id,update)=msg;
if(update)
{
await DeleteAttachment(id);
}
});
The blocks can now be linked into a pipeline and used. The setting PropagateCompletion = true means that as soon as a block has finished processing, it propagates its completion to the blocks linked to it:
var linkOptions=new DataflowLinkOptions { PropagateCompletion = true};
downloader.LinkTo(updater, linkOptions);
updater.LinkTo(deleter,linkOptions);
We can pump data into the head block for as long as we need. When we're done, we call the head block's Complete() method. As each block finishes processing its data, it propagates its completion to the next block in the pipeline. We need to await the completion of the last (tail) block to ensure all the attachments have been processed:
foreach (var attachment in downloadAttachments)
{
await downloader.SendAsync(attachment.id);
}
downloader.Complete();
await deleter.Completion;
Each block has an input and (when necessary) an output buffer, which means the "producer" and "consumers" of the messages don't have to be in sync, or even know of each other. All the "producer" needs to know is where to find the head block in a pipeline.
Throttling and backpressure
One way to throttle is to use a fixed number of tasks through MaxDegreeOfParallelism.
It's also possible to put a limit to the input buffer, thus blocking previous steps or producers if a block can't process messages fast enough. This can be done simply by setting the BoundedCapacity option for a block:
var dlOptions= new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 10,
BoundedCapacity=20,
};
var updaterOptions= new ExecutionDataflowBlockOptions
{
BoundedCapacity=20,
};
...
var downloader=new TransformBlock<...>(...,dlOptions);
var updater=new TransformBlock<...>(...,updaterOptions);
No other changes are necessary.
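Putting the pieces together, the three-step pipeline with bounded buffers might look like the sketch below. It is self-contained only because the real service calls are replaced by delegates: download, update and delete are hypothetical stand-ins for DownloadAttachment, UpdateDownloadStatus and DeleteAttachment, and the class name AttachmentPipeline is likewise an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class AttachmentPipeline
{
    // Runs ids through download -> update -> delete and returns the ids that were deleted.
    // The three delegates are hypothetical stand-ins for the real service calls.
    public static async Task<List<string>> RunAsync(
        IEnumerable<string> ids,
        Func<string, Task<bool>> download,
        Func<string, Task<bool>> update,
        Func<string, Task> delete)
    {
        var dlOptions = new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = 10,
            BoundedCapacity = 20
        };
        var serialOptions = new ExecutionDataflowBlockOptions { BoundedCapacity = 20 };
        var deleted = new List<string>(); // Only touched by the deleter, which runs with DOP = 1

        var downloader = new TransformBlock<string, (string id, bool ok)>(
            async id => (id, await download(id)), dlOptions);
        var updater = new TransformBlock<(string id, bool ok), (string id, bool ok)>(
            async msg => (msg.id, msg.ok && await update(msg.id)), serialOptions);
        var deleter = new ActionBlock<(string id, bool ok)>(async msg =>
        {
            if (!msg.ok) return;      // Skip attachments whose download or update failed
            await delete(msg.id);
            deleted.Add(msg.id);
        }, serialOptions);

        var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
        downloader.LinkTo(updater, linkOptions);
        updater.LinkTo(deleter, linkOptions);

        foreach (var id in ids)
        {
            await downloader.SendAsync(id); // Waits while the downloader's buffer is full
        }
        downloader.Complete();
        await deleter.Completion;           // The tail block completes last
        return deleted;
    }
}
```

Returning the list of deleted ids is one possible way to report per-item results back to the caller; the real code could just as well return the failed ids or log them inside the blocks.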

To run multiple asynchronous operations you could do something like this:
public async Task RunMultipleAsync<T>(IEnumerable<T> myList)
{
const int myNumberOfConcurrentOperations = 10;
var mySemaphore = new SemaphoreSlim(myNumberOfConcurrentOperations);
var tasks = new List<Task>();
foreach(var myItem in myList)
{
await mySemaphore.WaitAsync();
var task = RunOperation(myItem);
tasks.Add(task);
task.ContinueWith(t => mySemaphore.Release());
}
await Task.WhenAll(tasks);
}
private async Task RunOperation<T>(T myItem)
{
// Do stuff
}
Put your code from DownloadAttachmentsAsync at the 'Do stuff' comment.
This uses a semaphore to limit the number of concurrent operations, since running too many concurrent operations is often a bad idea due to contention. You will need to experiment to find the optimal number of concurrent operations for your use case. Also note that error handling has been omitted to keep the example short.
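Since the question also asks how to get the result of each operation, here is a variant of the same semaphore idea that returns one success flag per item and releases the semaphore in a finally block. It is a sketch under assumptions: the names Throttled and RunAsync are made up for illustration, the operation delegate stands in for the download/update/delete sequence, and mapping an exception to false is just one possible error-handling policy.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class Throttled
{
    // Runs operation for every item with at most maxConcurrency in flight,
    // and returns one (item, success) pair per input, in input order.
    public static async Task<(T Item, bool Success)[]> RunAsync<T>(
        IEnumerable<T> items, int maxConcurrency, Func<T, Task<bool>> operation)
    {
        using (var semaphore = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = items.Select(async item =>
            {
                await semaphore.WaitAsync();
                try
                {
                    return (item, await operation(item));
                }
                catch (Exception)
                {
                    // Assumed policy: a failed operation is reported as false
                    return (item, false);
                }
                finally
                {
                    semaphore.Release(); // Always free the slot, even on failure
                }
            }).ToList(); // ToList starts every task; each one waits for a free slot
            return await Task.WhenAll(tasks);
        }
    }
}
```

Unlike the loop above, this version creates all tasks up front and lets each one wait on the semaphore, which is simpler but keeps one pending task per item in memory.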

TPL Dataflow LinkTo TransformBlock is very slow

I have two TransformBlocks arranged in a loop, each linking its output to the other. TransformBlock 1 is an I/O block that reads data and is limited to a maximum of 50 tasks. It reads the data and some metadata, then passes them to the second block. The second block decides, based on the metadata, whether the message goes back to the first block. So if the metadata matches the criteria, the data should go back to the I/O block after a short wait. The second block's MaxDegreeOfParallelism can be unlimited.
I have noticed that when I send a lot of data to the I/O block, it takes a long time until the messages are passed to the second block. It takes about 10 minutes, and then they are all sent in a bunch, something like 1000 entries in a few seconds.
Normally I would implement it like so:
public void Start()
{
_ioBlock = new TransformBlock<Data,Tuple<Data, MetaData>>(async data =>
{
var metaData = await ReadAsync(data).ConfigureAwait(false);
return new Tuple<Data, MetaData>(data, metaData);
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 50 });
_waitBlock = new TransformBlock<Tuple<Data, MetaData>,Data>(async dataMetaData =>
{
var data = dataMetaData.Item1;
var metaData = dataMetaData.Item2;
if (!metaData.Repost)
{
return null;
}
await Task.Delay(TimeSpan.FromMinutes(1)).ConfigureAwait(false);
return data;
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded });
_ioBlock.LinkTo(_waitBlock);
_waitBlock.LinkTo(_ioBlock, data => data != null);
_waitBlock.LinkTo(DataflowBlock.NullTarget<Data>());
foreach (var data in Enumerable.Range(0, 2000).Select(i => new Data(i)))
{
_ioBlock.Post(data);
}
}
But because of the described problem I have to implement it like so:
public void Start()
{
_ioBlock = new ActionBlock<Data>(async data =>
{
var metaData = await ReadAsync(data).ConfigureAwait(false);
var dataMetaData= new Tuple<Data, MetaData>(data, metaData);
_waitBlock.Post(dataMetaData);
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 50 });
_waitBlock = new ActionBlock<Tuple<Data, MetaData>>(async dataMetaData =>
{
var data = dataMetaData.Item1;
var metaData = dataMetaData.Item2;
if (metaData.Repost)
{
await Task.Delay(TimeSpan.FromMinutes(1)).ConfigureAwait(false);
_ioBlock.Post(data);
}
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded });
foreach (var data in Enumerable.Range(0, 2000).Select(i => new Data(i)))
{
_ioBlock.Post(data);
}
}
When I use the second approach, the data gets linked/posted faster (one by one). But it feels more like a hack to me. Does anybody know how to fix the problem? Some friends recommended that I use TPL Pipeline, but it seems much more complicated to me.

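Problem solved. You need to set
ExecutionDataflowBlockOptions.EnsureOrdered
to false so that each result is forwarded to the next (wait) block as soon as it is ready. With the default of true, a block holds back finished results until all earlier messages have been processed.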
Further information:
Why do blocks run in this order?
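The effect of EnsureOrdered can be observed with a small experiment. The sketch below is not the asker's pipeline: the class name EnsureOrderedDemo and the idea of using the delay itself as the message are assumptions made to keep the example short.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class EnsureOrderedDemo
{
    // Sends the given delays (in ms) through a TransformBlock and returns them
    // in the order in which the block emitted them.
    public static async Task<List<int>> RunAsync(bool ensureOrdered, params int[] delaysMs)
    {
        var block = new TransformBlock<int, int>(async delay =>
        {
            await Task.Delay(delay); // Simulated I/O of varying duration
            return delay;
        }, new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded,
            EnsureOrdered = ensureOrdered
        });

        foreach (var d in delaysMs) block.Post(d);
        block.Complete();

        var emitted = new List<int>();
        while (await block.OutputAvailableAsync())
        {
            emitted.Add(await block.ReceiveAsync());
        }
        return emitted;
    }
}
```

Calling EnsureOrderedDemo.RunAsync(true, 400, 10) should return the values in input order, because the fast result is held back behind the slow one. With false, the 10 ms result should normally come out first.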

extracting zips, parsing files and flattening out to CSV

I'm trying to maximize the performance of the following task:
Enumerate directory of zip files
Extract zips in memory looking for .json files (handling nested zips)
Parse the json files
Write properties from json file into an aggregated .CSV file
The TPL layout I was going for was:
producer -> parser block -> batch block -> csv writer block
With the idea being that a single producer extracts the zips and finds the json files, sends the text to the parser block which is running in parallel (multi consumer). The batch block is grouping into batches of 200, and the writer block is dumping 200 rows to a CSV file each call.
Questions:
The longer the jsonParseBlock TransformBlock takes, the more messages are dropped. How can I prevent this?
How could I better utilize TPL to maximize performance?
class Item
{
public string ID { get; set; }
public string Name { get; set; }
}
class Demo
{
const string OUT_FILE = @"c:\temp\tplflat.csv";
const string DATA_DIR = @"c:\temp\tpldata";
static ExecutionDataflowBlockOptions parseOpts = new ExecutionDataflowBlockOptions() { SingleProducerConstrained=true, MaxDegreeOfParallelism = 8, BoundedCapacity = 100 };
static ExecutionDataflowBlockOptions writeOpts = new ExecutionDataflowBlockOptions() { BoundedCapacity = 100 };
public static void Run()
{
Console.WriteLine($"{Environment.ProcessorCount} processors available");
_InitTest(); // reset csv file, generate test data if needed
// start TPL stuff
var sw = Stopwatch.StartNew();
// transformer
var jsonParseBlock = new TransformBlock<string, Item>(rawstr =>
{
var item = Newtonsoft.Json.JsonConvert.DeserializeObject<Item>(rawstr);
System.Threading.Thread.Sleep(15); // the more sleep here, the more messages lost
return item;
}, parseOpts);
// batch block
var jsonBatchBlock = new BatchBlock<Item>(200);
// writer block
var flatWriterBlock = new ActionBlock<Item[]>(items =>
{
//Console.WriteLine($"writing {items.Length} to csv");
StringBuilder sb = new StringBuilder();
foreach (var item in items)
{
sb.AppendLine($"{item.ID},{item.Name}");
}
File.AppendAllText(OUT_FILE, sb.ToString());
});
jsonParseBlock.LinkTo(jsonBatchBlock, new DataflowLinkOptions { PropagateCompletion = true });
jsonBatchBlock.LinkTo(flatWriterBlock, new DataflowLinkOptions { PropagateCompletion = true });
// start doing the work
var crawlerTask = GetJsons(DATA_DIR, jsonParseBlock);
crawlerTask.Wait();
flatWriterBlock.Completion.Wait();
Console.WriteLine($"ALERT: tplflat.csv row count should match the test data");
Console.WriteLine($"Completed in {sw.ElapsedMilliseconds / 1000.0} secs");
}
static async Task GetJsons(string filepath, ITargetBlock<string> queue)
{
int count = 1;
foreach (var zip in Directory.EnumerateFiles(filepath, "*.zip"))
{
Console.WriteLine($"working on zip #{count++}");
var zipStream = new FileStream(zip, FileMode.Open);
await ExtractJsonsInMemory(zip, zipStream, queue);
}
queue.Complete();
}
static async Task ExtractJsonsInMemory(string filename, Stream stream, ITargetBlock<string> queue)
{
ZipArchive archive = new ZipArchive(stream);
foreach (ZipArchiveEntry entry in archive.Entries)
{
if (entry.Name.EndsWith(".json", StringComparison.OrdinalIgnoreCase))
{
using (TextReader reader = new StreamReader(entry.Open(), Encoding.UTF8))
{
var jsonText = reader.ReadToEnd();
await queue.SendAsync(jsonText);
}
}
else if (entry.Name.EndsWith(".zip", StringComparison.OrdinalIgnoreCase))
{
await ExtractJsonsInMemory(entry.FullName, entry.Open(), queue);
}
}
}
}
Update1
I've added async, but it is not clear to me how to wait for all the dataflow blocks to complete (new to c#, async and tpl). I basically want to say, "keep running until all of the queues/blocks are empty". I've added the following 'wait' code, and appears to be working.
// wait for crawler to finish
crawlerTask.Wait();
// wait for the last block
flatWriterBlock.Completion.Wait();

In short, you're posting and ignoring the return value. You have two options: add an unbounded BufferBlock to hold all your incoming data, or await SendAsync, which prevents any messages from being dropped.
static async Task ExtractJsonsInMemory(string filename, Stream stream, ITargetBlock<string> queue)
{
var archive = new ZipArchive(stream);
foreach (ZipArchiveEntry entry in archive.Entries)
{
if (entry.Name.EndsWith(".json", StringComparison.OrdinalIgnoreCase))
{
using (var reader = new StreamReader(entry.Open(), Encoding.UTF8))
{
var jsonText = reader.ReadToEnd();
await queue.SendAsync(jsonText);
}
}
else if (entry.Name.EndsWith(".zip", StringComparison.OrdinalIgnoreCase))
{
await ExtractJsonsInMemory(entry.FullName, entry.Open(), queue);
}
}
}
You'll need to pull the async all the way back up, but this should get you started.

From MSDN, about the DataflowBlock.Post<TInput> method:
Return Value
Type: System.Boolean
true if the item was accepted by the target block; otherwise, false.
So the problem here is that you're sending your messages without checking whether the pipeline can accept another one. This happens because of the options on your blocks:
new ExecutionDataflowBlockOptions() { BoundedCapacity = 100 }
and this line:
// this line isn't waiting for long operations and simply drops the message as it can't be accepted by the target block
queue.Post(jsonText);
Here you're saying that a block should stop accepting new messages once its input queue holds 100 items. For this case, both MSDN and @StephenCleary, in his Introduction to Dataflow series, suggest a simple solution:
However, it’s possible to throttle a block by limiting its buffer size; in this case, you could use SendAsync to (asynchronously) wait for space to be available and then place the data into the block’s input buffer.
So, as @JSteward already suggested, you can introduce an unbounded buffer between your workers to avoid dropping messages. This is common practice, because checking the result of the Post method and retrying could block the producer thread for a long time.
The second part of the question, about performance, is answered by an async-oriented solution (which fits perfectly with SendAsync), since you use I/O operations all the time. An asynchronous operation is basically a way to tell the program "start doing this and notify me when it's done". And, as there is no thread for such operations, you gain by freeing up the thread pool for the other operations in your pipeline.
PS: @JSteward has provided you with good sample code for this approach.
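The difference between Post and SendAsync on a bounded block can be reproduced in a few lines. This is a sketch with made-up names (PostVsSendAsync, gate) and a capacity of 1 instead of 100, chosen only so that the block fills up after a single message.

```csharp
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class PostVsSendAsync
{
    // Returns (first Post accepted, second Post accepted, SendAsync accepted).
    public static async Task<(bool FirstPost, bool SecondPost, bool Sent)> RunAsync()
    {
        // The consumer stays busy until the gate is opened.
        var gate = new TaskCompletionSource<bool>();
        var block = new ActionBlock<int>(async _ =>
        {
            await gate.Task;
        }, new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });

        bool firstPost = block.Post(1);   // Accepted: the block has room for one item
        bool secondPost = block.Post(2);  // Declined immediately: the block is full
        Task<bool> pendingSend = block.SendAsync(3); // Waits for room instead of giving up

        gate.SetResult(true);             // Let the consumer finish item 1
        bool sent = await pendingSend;    // Room is available again, so item 3 is accepted

        block.Complete();
        await block.Completion;
        return (firstPost, secondPost, sent);
    }
}
```

Item 2 is the "dropped message" from the question: Post returned false and nothing retried it, whereas the awaited SendAsync eventually delivered item 3.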

Using async/await and yield return with TPL Dataflow

I am trying to implement a data processing pipeline using TPL Dataflow. However, I am relatively new to dataflow and not completely sure how to use it properly for the problem I am trying to solve.
Problem:
I am trying to iterate through a list of files and process each file to read some data and then further process that data. Each file is roughly 700MB to 1GB in size and contains JSON data. In order to process these files in parallel and not run out of memory, I am trying to use IEnumerable<> with yield return and then further process the data.
Once I get the list of files, I want to process at most 4-5 files at a time in parallel. My confusion comes from:
How to use IEnumerable<> and yield return with async/await and dataflow. I came across this answer by svick, but I am still not sure how to convert IEnumerable<> to ISourceBlock and then link all blocks together and track completion.
In my case, the producer will be really fast (going through the list of files), but the consumer will be very slow (processing each file: read data, deserialize JSON). In this case, how to track completion.
Should I use the LinkTo feature of dataflow blocks to connect the various blocks, or use methods such as OutputAvailableAsync() and ReceiveAsync() to propagate data from one block to another?
Code:
private const int ProcessingSize= 4;
private BufferBlock<string> _fileBufferBlock;
private ActionBlock<string> _processingBlock;
private BufferBlock<DataType> _messageBufferBlock;
public Task ProduceAsync()
{
PrepareDataflow(token);
var bufferTask = ListFilesAsync(_fileBufferBlock, token);
var tasks = new List<Task> { bufferTask, _processingBlock.Completion };
return Task.WhenAll(tasks);
}
private async Task ListFilesAsync(ITargetBlock<string> targetBlock, CancellationToken token)
{
...
// Get list of file Uris
...
foreach(var fileNameUri in fileNameUris)
await targetBlock.SendAsync(fileNameUri, token);
targetBlock.Complete();
}
private async Task ProcessFileAsync(string fileNameUri, CancellationToken token)
{
var httpClient = new HttpClient();
try
{
using (var stream = await httpClient.GetStreamAsync(fileNameUri))
using (var sr = new StreamReader(stream))
using (var jsonTextReader = new JsonTextReader(sr))
{
while (jsonTextReader.Read())
{
if (jsonTextReader.TokenType == JsonToken.StartObject)
{
try
{
var data = _jsonSerializer.Deserialize<DataType>(jsonTextReader);
await _messageBufferBlock.SendAsync(data, token);
}
catch (Exception ex)
{
_logger.Error(ex, $"JSON deserialization failed - {fileNameUri}");
}
}
}
}
}
catch(Exception ex)
{
// Should throw?
// Or if converted to block then report using Fault() method?
}
finally
{
httpClient.Dispose();
buffer.Complete();
}
}
private void PrepareDataflow(CancellationToken token)
{
_fileBufferBlock = new BufferBlock<string>(new DataflowBlockOptions
{
CancellationToken = token
});
var actionExecuteOptions = new ExecutionDataflowBlockOptions
{
CancellationToken = token,
BoundedCapacity = ProcessingSize,
MaxMessagesPerTask = 1,
MaxDegreeOfParallelism = ProcessingSize
};
_processingBlock = new ActionBlock<string>(async fileName =>
{
try
{
await ProcessFileAsync(fileName, token);
}
catch (Exception ex)
{
_logger.Fatal(ex, $"Failed to process file: {fileName}, Error: {ex.Message}");
// Should fault the block?
}
}, actionExecuteOptions);
_fileBufferBlock.LinkTo(_processingBlock, new DataflowLinkOptions { PropagateCompletion = true });
_messageBufferBlock = new BufferBlock<DataType>(new ExecutionDataflowBlockOptions
{
CancellationToken = token,
BoundedCapacity = 50000
});
_messageBufferBlock.LinkTo(DataflowBlock.NullTarget<DataType>());
}
In the above code, I am not using IEnumerable<DataType> and yield return, as I cannot use them with async/await. So I am linking the input buffer to an ActionBlock<DataType>, which in turn posts to another queue. However, by using ActionBlock<>, I cannot link it to the next block for processing and have to manually Post/SendAsync from the ActionBlock<> to the BufferBlock<>. Also, in this case, I am not sure how to track completion.
This code works, but I am sure there could be a better solution than this, one where I can just link all the blocks (instead of using ActionBlock<DataType> and then sending messages from it to BufferBlock<DataType>).
Another option could be to convert IEnumerable<> to IObservable<> using Rx, but again I am not very familiar with Rx and don't know exactly how to mix TPL Dataflow and Rx.

Question 1
You plug an IEnumerable<T> producer into your TPL Dataflow chain by using Post or SendAsync directly on the consumer block, as follows:
foreach (string fileNameUri in fileNameUris)
{
await _processingBlock.SendAsync(fileNameUri).ConfigureAwait(false);
}
You can also use a BufferBlock<TInput>, but in your case it actually seems rather unnecessary (or even harmful - see the next part).
Question 2
When would you prefer SendAsync instead of Post? If your producer runs faster than the URIs can be processed (and you have indicated this to be the case), and you choose to give your _processingBlock a BoundedCapacity, then when the block's internal buffer reaches the specified capacity, your SendAsync will "hang" until a buffer slot frees up, and your foreach loop will be throttled. This feedback mechanism creates back pressure and ensures that you don't run out of memory.
Question 3
You should definitely use the LinkTo method to link your blocks in most cases. Unfortunately yours is a corner case due to the interplay of IDisposable and very large (potentially) sequences. So your completion will flow automatically between the buffer and processing blocks (due to LinkTo), but after that - you need to propagate it manually. This is tricky, but doable.
I'll illustrate this with a "Hello World" example where the producer iterates over each character and the consumer (which is really slow) outputs each character to the Debug window.
Note: LinkTo is not present.
// REALLY slow consumer.
var consumer = new ActionBlock<char>(async c =>
{
await Task.Delay(100);
Debug.Print(c.ToString());
}, new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });
var producer = new ActionBlock<string>(async s =>
{
foreach (char c in s)
{
await consumer.SendAsync(c);
Debug.Print($"Yielded {c}");
}
});
try
{
producer.Post("Hello world");
producer.Complete();
await producer.Completion;
}
finally
{
consumer.Complete();
}
// Observe combined producer and consumer completion/exceptions/cancellation.
await Task.WhenAll(producer.Completion, consumer.Completion);
This outputs:
Yielded H
H
Yielded e
e
Yielded l
l
Yielded l
l
Yielded o
o
Yielded
Yielded w
w
Yielded o
o
Yielded r
r
Yielded l
l
Yielded d
d
As you can see from the output above, the producer is throttled and the handover buffer between the blocks never grows too large.
EDIT
You might find it cleaner to propagate completion via
producer.Completion.ContinueWith(
_ => consumer.Complete(), TaskContinuationOptions.ExecuteSynchronously
);
... right after producer definition. This allows you to slightly reduce producer/consumer coupling - but at the end you still have to remember to observe Task.WhenAll(producer.Completion, consumer.Completion).

In order to process these files in parallel and not run out of memory, I am trying to use IEnumerable<> with yield return and then further process the data.
I don't believe this step is necessary. What you're actually avoiding here is just a list of filenames. Even if you had millions of files, the list of filenames is just not going to take up a significant amount of memory.
I am linking input buffer to ActionBlock which in turn posts to another queue. However by using ActionBlock<>, I cannot link it to next block for processing and have to manually Post/SendAsync from ActionBlock<> to BufferBlock<>. Also, in this case, not sure, how to track completion.
ActionBlock<TInput> is an "end of the line" block. It only accepts input and does not produce any output. In your case, you don't want ActionBlock<TInput>; you want TransformManyBlock<TInput, TOutput>, which takes input, runs a function on it, and produces output (with any number of output items for each input item).
Another point to keep in mind is that every dataflow block that accepts input has its own input buffer. So the extra BufferBlock is unnecessary.
Finally, if you're already in "dataflow land", it's usually best to end with a dataflow block that actually does something (e.g., ActionBlock instead of BufferBlock). In this case, you could use the BufferBlock as a bounded producer/consumer queue, where some other code is consuming the results. Personally, I would consider that it may be cleaner to rewrite the consuming code as the action of an ActionBlock, but it may also be cleaner to keep the consumer independent of the dataflow. For the code below, I left in the final bounded BufferBlock, but if you use this solution, consider changing that final block to a bounded ActionBlock instead.
private const int ProcessingSize= 4;
private static readonly HttpClient HttpClient = new HttpClient();
private TransformManyBlock<string, DataType> _processingBlock;
private BufferBlock<DataType> _messageBufferBlock;
public Task ProduceAsync()
{
PrepareDataflow(token);
ListFiles(_processingBlock, token);
_processingBlock.Complete();
return _processingBlock.Completion;
}
private void ListFiles(ITargetBlock<string> targetBlock, CancellationToken token)
{
... // Get list of file Uris, occasionally calling token.ThrowIfCancellationRequested()
foreach(var fileNameUri in fileNameUris)
targetBlock.Post(fileNameUri);
}
private async Task<IEnumerable<DataType>> ProcessFileAsync(string fileNameUri, CancellationToken token)
{
    return Process(await HttpClient.GetStreamAsync(fileNameUri), fileNameUri, token);
}
private IEnumerable<DataType> Process(Stream stream, string fileNameUri, CancellationToken token)
{
    using (stream)
    using (var sr = new StreamReader(stream))
    using (var jsonTextReader = new JsonTextReader(sr))
    {
        while (jsonTextReader.Read())
        {
            token.ThrowIfCancellationRequested();
            if (jsonTextReader.TokenType != JsonToken.StartObject)
                continue;
            DataType item;
            try
            {
                item = _jsonSerializer.Deserialize<DataType>(jsonTextReader);
            }
            catch (Exception ex)
            {
                _logger.Error(ex, $"JSON deserialization failed - {fileNameUri}");
                continue;
            }
            // C# does not allow yield return inside a try block that has a catch clause.
            yield return item;
        }
    }
}
private void PrepareDataflow(CancellationToken token)
{
    var executeOptions = new ExecutionDataflowBlockOptions
    {
        CancellationToken = token,
        MaxDegreeOfParallelism = ProcessingSize
    };
    _processingBlock = new TransformManyBlock<string, DataType>(fileName =>
        ProcessFileAsync(fileName, token), executeOptions);
    _messageBufferBlock = new BufferBlock<DataType>(new DataflowBlockOptions
    {
        CancellationToken = token,
        BoundedCapacity = 50000
    });
    // Link the blocks so results reach the bounded buffer and completion propagates.
    _processingBlock.LinkTo(_messageBufferBlock,
        new DataflowLinkOptions { PropagateCompletion = true });
}
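As a rough sketch of the suggestion to end the pipeline with a bounded ActionBlock rather than a BufferBlock: the block below applies backpressure through BoundedCapacity, and SendAsync waits whenever the block is full. The names BoundedConsumerSketch and SumAsync are hypothetical, and summing integers merely stands in for the real consuming code.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class BoundedConsumerSketch
{
    // Feeds items into a bounded ActionBlock and returns their sum.
    public static async Task<int> SumAsync(IEnumerable<int> items, int capacity)
    {
        int sum = 0;

        // Default MaxDegreeOfParallelism is 1, so the delegate runs for one item at a time.
        var consumer = new ActionBlock<int>(
            item => { sum += item; },
            new ExecutionDataflowBlockOptions { BoundedCapacity = capacity });

        foreach (var item in items)
        {
            // SendAsync completes only once the block has room for the item.
            await consumer.SendAsync(item);
        }

        consumer.Complete();
        await consumer.Completion;
        return sum;
    }
}
```

Because the consumer is itself a dataflow block, awaiting its Completion is enough to know that every item has been processed.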
Alternatively, you could use Rx. Learning Rx can be pretty difficult though, especially for mixed asynchronous and parallel dataflow situations, which you have here.
As for your other questions:
How to use IEnumerable<> and yield return with async/await and dataflow.
async and yield are not compatible at all. At least in today's language. In your situation, the JSON readers have to read from the stream synchronously anyway (they don't support asynchronous reading), so the actual stream processing is synchronous and can be used with yield. Doing the initial back-and-forth to get the stream itself can still be asynchronous and can be used with async. This is as good as we can get today, until the JSON readers support asynchronous reading and the language supports async yield. (Rx could do an "async yield" today, but the JSON reader still doesn't support async reading, so it won't help in this particular situation).
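For readers on newer toolchains: C# 8.0 and later added async streams (IAsyncEnumerable<T> and await foreach), which provide the "async yield" described here. The sketch below is only an illustration of that language feature, not part of the original answer; it reads lines from a TextReader instead of JSON, and the names AsyncYieldSketch, ReadLinesAsync and ToListAsync are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public static class AsyncYieldSketch
{
    // An async iterator: awaits each read and yields each line as it arrives.
    public static async IAsyncEnumerable<string> ReadLinesAsync(TextReader reader)
    {
        string line;
        while ((line = await reader.ReadLineAsync()) != null)
        {
            yield return line;
        }
    }

    // Consumes an async stream with await foreach and collects it into a list.
    public static async Task<List<string>> ToListAsync(IAsyncEnumerable<string> source)
    {
        var list = new List<string>();
        await foreach (var item in source)
        {
            list.Add(item);
        }
        return list;
    }
}
```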
In this case, how to track completion.
If the JSON readers did support asynchronous reading, then the solution above would not be the best one. In that case, you would want to use a manual SendAsync call, and would need to link just the completion of these blocks, which can be done as such:
_processingBlock.Completion.ContinueWith(
task =>
{
if (task.IsFaulted)
((IDataflowBlock)_messageBufferBlock).Fault(task.Exception);
else if (!task.IsCanceled)
_messageBufferBlock.Complete();
},
CancellationToken.None,
TaskContinuationOptions.DenyChildAttach | TaskContinuationOptions.ExecuteSynchronously,
TaskScheduler.Default);
Should I use LinkTo feature of datablocks to connect various blocks? or use method such as OutputAvailableAsync() and ReceiveAsync() to propagate data from one block to another.
Use LinkTo whenever you can. It handles all the corner cases for you.
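A minimal sketch of what LinkTo with completion propagation looks like, assuming a two-block pipeline; the class and method names (LinkToSketch, RunAsync) are hypothetical, and squaring integers stands in for real work.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class LinkToSketch
{
    // Squares each input in a TransformBlock and collects the results in an ActionBlock.
    public static async Task<List<int>> RunAsync(IEnumerable<int> inputs)
    {
        var results = new List<int>();

        var square = new TransformBlock<int, int>(n => n * n);
        // Default MaxDegreeOfParallelism is 1, so the list is modified by one item at a time.
        var collect = new ActionBlock<int>(n => results.Add(n));

        // PropagateCompletion forwards completion (and faults) to the linked block.
        square.LinkTo(collect, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (var n in inputs)
        {
            await square.SendAsync(n);
        }

        square.Complete();
        await collect.Completion; // completes after the last item has been handled
        return results;
    }
}
```

With the link in place there is no need for manual ReceiveAsync loops or for a hand-written completion continuation.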
// Should throw?
// Should fault the block?
That's entirely up to you. By default, when any processing of any item fails, the block faults, and if you are propagating completion, the entire chain of blocks would fault.
Faulting blocks are rather drastic; they throw away any work in progress and refuse to continue processing. You have to build a new dataflow mesh if you want to retry.
If you prefer a "softer" error strategy, you can either catch the exceptions and do something like log them (which your code currently does), or you can change the nature of your dataflow block to pass along the exceptions as data items.
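One possible shape for the "exceptions as data items" approach is sketched below. The Result<T> wrapper is a hypothetical type invented for this illustration (it is not part of TPL Dataflow), and int.Parse stands in for the real per-item processing.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical wrapper: holds either a value or the exception that prevented it.
public sealed class Result<T>
{
    public T Value { get; }
    public Exception Error { get; }
    public bool IsSuccess => Error == null;

    private Result(T value, Exception error) { Value = value; Error = error; }

    public static Result<T> Ok(T value) => new Result<T>(value, null);
    public static Result<T> Fail(Exception error) => new Result<T>(default(T), error);
}

public static class SoftErrorSketch
{
    public static async Task<List<Result<int>>> ParseAllAsync(IEnumerable<string> inputs)
    {
        // The delegate never throws, so a bad item cannot fault the block.
        var parse = new TransformBlock<string, Result<int>>(text =>
        {
            try { return Result<int>.Ok(int.Parse(text)); }
            catch (Exception ex) { return Result<int>.Fail(ex); }
        });

        var results = new List<Result<int>>();
        var collect = new ActionBlock<Result<int>>(r => results.Add(r));
        parse.LinkTo(collect, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (var text in inputs)
        {
            parse.Post(text);
        }

        parse.Complete();
        await collect.Completion;
        return results;
    }
}
```

Downstream blocks can then decide per item whether to log, retry or skip, while the mesh itself keeps running.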

It would be worth looking at Rx. Unless I'm missing something your entire code that you need (apart from your existing ProcessFileAsync method) would look like this:
var query =
fileNameUris
.Select(fileNameUri =>
Observable
.FromAsync(ct => ProcessFileAsync(fileNameUri, ct)))
.Merge(maxConcurrent : 4);
var subscription =
query
.Subscribe(
u => { },
() => { Console.WriteLine("Done."); });
Done. It runs asynchronously, it's cancellable by calling subscription.Dispose(), and you can specify the maximum parallelism.

How to aggregate the data from an async producer and write it to a file?

I'm learning about async/await patterns in C#. Currently I'm trying to solve a problem like this:
There is a producer (a hardware device) that generates 1000 packets per second. I need to log this data to a file.
The device only has a ReadAsync() method to report a single packet at a time.
I need to buffer the packets and write them in the order they are generated to the file, only once a second.
Write operation should fail if the write process is not finished in time when the next batch of packets is ready to be written.
So far I have written something like the code below. It works, but I am not sure if this is the best way to solve the problem. Any comments or suggestions? What is the best practice for this kind of Producer/Consumer problem, where the consumer needs to aggregate the data received from the producer?
static async Task TestLogger(Device device, int seconds)
{
const int bufLength = 1000;
bool firstIteration = true;
Task writerTask = null;
using (var writer = new StreamWriter("test.log"))
{
do
{
var buffer = new byte[bufLength][];
for (int i = 0; i < bufLength; i++)
{
buffer[i] = await device.ReadAsync();
}
if (!firstIteration)
{
if (!writerTask.IsCompleted)
throw new Exception("Write Time Out!");
}
writerTask = Task.Run(() =>
{
foreach (var b in buffer)
writer.WriteLine(ToHexString(b));
});
firstIteration = false;
} while (--seconds > 0);
}
}

You could use the following idea, provided the criteria for flush is the number of packets (up to 1000). I did not test it. It makes use of Stephen Cleary's AsyncProducerConsumerQueue<T> featured in this question.
AsyncProducerConsumerQueue<byte[]> _queue;
Stream _stream;
// producer
async Task ReceiveAsync(CancellationToken token)
{
while (true)
{
var list = new List<byte>();
while (true)
{
token.ThrowIfCancellationRequested();
var packet = await _device.ReadAsync(token);
list.Add(packet);
if (list.Count == 1000)
break;
}
// push next batch
await _queue.EnqueueAsync(list.ToArray(), token);
}
}
// consumer
async Task LogAsync(CancellationToken token)
{
Task previousFlush = Task.FromResult(0);
CancellationTokenSource cts = null;
while (true)
{
token.ThrowIfCancellationRequested();
// get next batch
var nextBatch = await _queue.DequeueAsync(token);
if (!previousFlush.IsCompleted)
{
cts.Cancel(); // cancel the previous flush if not ready
throw new Exception("failed to flush on time.");
}
await previousFlush; // it's completed, observe for any errors
// start flushing
cts = CancellationTokenSource.CreateLinkedTokenSource(token);
previousFlush = _stream.WriteAsync(nextBatch, 0, nextBatch.Length, cts.Token);
}
}
If you don't want to fail the logger but rather prefer to cancel the flush and proceed to the next batch, you can do so with a minimal change to this code.
In response to @l3arnon's comment:
1. A packet is not a byte, it's byte[]. 2. You haven't used the OP's ToHexString. 3. AsyncProducerConsumerQueue is much less robust and tested than .Net's TPL Dataflow. 4. You await previousFlush for errors just after you throw an exception which makes that line redundant. etc. In short: I think the possible added value doesn't justify this very complicated solution.
"A packet is not a byte, it's byte[]" - A packet is a byte, this is obvious from the OP's code: buffer[i] = await device.ReadAsync(). Then, a batch of packets is byte[].
"You haven't used the OP's ToHexString." - The goal was to show how to use Stream.WriteAsync which natively accepts a cancellation token, instead of WriteLineAsync which doesn't allow cancellation. It's trivial to use ToHexString with Stream.WriteAsync and still take advantage of cancellation support:
var hexBytes = Encoding.ASCII.GetBytes(ToHexString(nextBatch) +
Environment.NewLine);
_stream.WriteAsync(hexBytes, 0, hexBytes.Length, token);
"AsyncProducerConsumerQueue is much less robust and tested than .Net's TPL Dataflow" - I don't think this is an established fact. However, if the OP is concerned about it, he can use regular BlockingCollection, which doesn't block the producer thread. It's OK to block the consumer thread while waiting for the next batch, because writing is done in parallel. As opposed to this, your TPL Dataflow version carries one redundant CPU and lock intensive operation: moving data from producer pipeline to writer pipeline with logAction.Post(packet), byte by byte. My code doesn't do that.
"You await previousFlush for errors just after you throw an exception which makes that line redundant." - This line is not redundant. Perhaps, you're missing this point: previousFlush.IsCompleted can be true when previousFlush.IsFaulted or previousFlush.IsCanceled is also true. So, await previousFlush is relevant there to observe any errors on the completed tasks (e.g., a write failure), which otherwise will be lost.

A better approach IMHO would be to have 2 "workers", a producer and a consumer. The producer reads from the device and simply fills a list. The consumer "wakes up" every second and writes the batch to a file.
List<byte[]> _data = new List<byte[]>();
async Task Producer(Device device)
{
while (true)
{
_data.Add(await device.ReadAsync());
}
}
async Task Consumer(Device device)
{
using (var writer = new StreamWriter("test.log"))
{
while (true)
{
Stopwatch watch = Stopwatch.StartNew();
var batch = _data;
_data = new List<byte[]>();
foreach (var packet in batch)
{
writer.WriteLine(ToHexString(packet));
if (watch.Elapsed >= TimeSpan.FromSeconds(1))
{
throw new Exception("Write Time Out!");
}
}
await Task.Delay(TimeSpan.FromSeconds(1) - watch.Elapsed);
}
}
}
The while (true) should probably be replaced by a system wide cancellation token.
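A minimal sketch of that change for the producer loop, assuming the token is passed in by the caller. The readAsync delegate is a hypothetical stand-in for device.ReadAsync(), and the class name is likewise invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class CancellableProducerSketch
{
    // Reads packets until cancellation is requested, then returns what was collected.
    // readAsync is a hypothetical stand-in for device.ReadAsync().
    public static async Task<List<byte[]>> ProduceAsync(
        Func<Task<byte[]>> readAsync, CancellationToken token)
    {
        var data = new List<byte[]>();

        // The token check replaces the unconditional while (true).
        while (!token.IsCancellationRequested)
        {
            data.Add(await readAsync());
        }

        return data;
    }
}
```

Checking IsCancellationRequested lets the loop exit normally; calling token.ThrowIfCancellationRequested() instead would surface cancellation to the caller as an OperationCanceledException.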

Assuming you can batch by amount (1000) instead of time (1 second), the simplest solution is probably using TPL Dataflow's BatchBlock which automatically batches a flow of items by size:
async Task TestLogger(Device device, int seconds)
{
var writer = new StreamWriter("test.log");
var batch = new BatchBlock<byte[]>(1000);
var logAction = new ActionBlock<byte[]>(
packet =>
{
return writer.WriteLineAsync(ToHexString(packet));
});
ActionBlock<byte[][]> transferAction = null; // assigned before the lambda below captures it
transferAction = new ActionBlock<byte[][]>(
bytes =>
{
foreach (var packet in bytes)
{
if (transferAction.InputCount > 0)
{
return; // or throw new Exception("Write Time Out!");
}
logAction.Post(packet);
}
}
);
batch.LinkTo(transferAction);
logAction.Completion.ContinueWith(_ => writer.Dispose());
while (true)
{
batch.Post(await device.ReadAsync());
}
}

Reading millions of small files with C# - c#

When it comes to IO operations, CPU parallelism is useless. Your IO device (disk, network, whatever) is your bottleneck. By reading from the device concurrently, you even risk lowering your performance.
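Following that reasoning, a sequential baseline is worth measuring before adding any parallelism. The sketch below streams each source file straight into the target, so only a buffer's worth of data is held in memory at a time. ConcatSketch and Concatenate are hypothetical names, and the 1 MB buffer size is an arbitrary assumption rather than a tuned value.

```csharp
using System.Collections.Generic;
using System.IO;

public static class ConcatSketch
{
    // Appends every source file to the target, one after another, in the given order.
    public static void Concatenate(IEnumerable<string> sourcePaths, string targetPath)
    {
        using (var output = new FileStream(targetPath, FileMode.Create, FileAccess.Write,
                                           FileShare.None, 1 << 20))
        {
            foreach (var path in sourcePaths)
            {
                using (var input = File.OpenRead(path))
                {
                    // Copies bytes directly; no per-file string is allocated.
                    input.CopyTo(output);
                }
            }
        }
    }
}
```

It could be called as ConcatSketch.Concatenate(Directory.EnumerateFiles(logDirectory), mergedPath), where logDirectory and mergedPath are placeholders for your own paths.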

Perhaps you can just use PowerShell to concatenate the files, such as in this answer. Another alternative is to write a program that uses the FileSystemWatcher class to watch for new files and append them as they are created.
