Calling async functions on parallel threads in C#

I have three async functions that I want to call from multiple threads in parallel, at the same time. So far I have tried the following approach:
int numOfThreads = 4;
var taskList = new List<Task>();
using (var fs = new FileStream(inputFilePath, FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.ReadWrite))
{
    for (int i = 1; i <= numOfThreads; i++)
    {
        taskList.Add(Task.Run(async () =>
        {
            byte[] buffer = new byte[length]; // length could be up to a few thousand
            await Function1Async();               // Reads from the file into a byte array
            long result = await Function2Async(); // Does some async operation with that byte array data
            await Function3Async(result);         // Writes the result into the file
        }));
    }
}
Task.WaitAll(taskList.ToArray());
However, not all of the tasks complete before execution reaches the end. I have limited experience with threading in C#. What am I doing wrong in my code, or should I take an alternative approach?
EDIT -
So I made some changes to my approach. I got rid of Function3Async for now:
for (int i = 1; i <= numOfThreads; i++)
{
    using (fs = new FileStream(----))
    {
        taskList.Add(Task.Run(async () =>
        {
            byte[] buffer = new byte[length]; // length could be up to a few thousand
            await Function1Async(buffer); // Reads from the file into a byte array
            Stream data = new MemoryStream(buffer);
            /** Write the Stream into a file and return
             * the offset at which the write operation was done
             */
            long blockStartOffset = await Function2Async(data);
            Console.WriteLine($"Block written at - {blockStartOffset}");
        }));
    }
}
Task.WaitAll(taskList.ToArray());
Now all threads seem to proceed to completion, but Function2Async seems to randomly write some Japanese characters to the output file. I guess it is some threading issue, perhaps?
Here is the implementation of Function2Async:
public async Task<long> Function2Async(Stream data)
{
    long offset = getBlockOffset();
    using (var outputFs = new FileStream(fileName,
        FileMode.OpenOrCreate,
        FileAccess.ReadWrite,
        FileShare.ReadWrite))
    {
        outputFs.Seek(offset, SeekOrigin.Begin);
        await data.CopyToAsync(outputFs);
    }
    return offset;
}

In your example you have passed neither fs nor buffer into Function1Async but your comment says it reads from fs into buffer, so I will assume that is what happens.
You cannot read from a stream in parallel. It does not support that. If you find one that supports it, it will be horribly inefficient, because that is how hard disk storage works. Even worse if it is a network drive.
Read from the stream into your buffers first and in sequence, then let your threads loose and run your logic. In parallel, on the already existing buffers in memory.
Writing by the way would have the same problem if you wrote to the same file. If you write to one file per buffer, that's fine, otherwise, do it sequentially.
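For illustration, here's a rough sketch of that order of operations (signatures guessed from your snippets, untested):
// Read sequentially into in-memory blocks first (using fs.Read directly
// in place of your Function1Async, whose signature I can't see).
var buffers = new List<byte[]>();
using (var fs = new FileStream(inputFilePath, FileMode.Open, FileAccess.Read))
{
    var buffer = new byte[length];
    int bytesRead;
    while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) > 0)
    {
        var block = new byte[bytesRead];
        Array.Copy(buffer, block, bytesRead);
        buffers.Add(block);
    }
}
// Only now let the threads loose, on the in-memory blocks.
var tasks = new List<Task>();
foreach (var block in buffers)
{
    tasks.Add(Task.Run(async () =>
    {
        long result = await Function2Async(new MemoryStream(block));
        await Function3Async(result); // fine only if each call writes to its own file
    }));
}
Task.WaitAll(tasks.ToArray());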

Related

extracting zips, parsing files and flattening out to CSV

I'm trying to maximize the performance of the following task:
Enumerate directory of zip files
Extract zips in memory looking for .json files (handling nested zips)
Parse the json files
Write properties from json file into an aggregated .CSV file
The TPL layout I was going for was:
producer -> parser block -> batch block -> csv writer block
With the idea being that a single producer extracts the zips and finds the json files, sends the text to the parser block which is running in parallel (multi consumer). The batch block is grouping into batches of 200, and the writer block is dumping 200 rows to a CSV file each call.
Questions:
The longer the jsonParseBlock TransformBlock takes, the more messages are dropped. How can I prevent this?
How could I better utilize TPL to maximize performance?
class Item
{
    public string ID { get; set; }
    public string Name { get; set; }
}
class Demo
{
    const string OUT_FILE = @"c:\temp\tplflat.csv";
    const string DATA_DIR = @"c:\temp\tpldata";
    static ExecutionDataflowBlockOptions parseOpts = new ExecutionDataflowBlockOptions() { SingleProducerConstrained = true, MaxDegreeOfParallelism = 8, BoundedCapacity = 100 };
    static ExecutionDataflowBlockOptions writeOpts = new ExecutionDataflowBlockOptions() { BoundedCapacity = 100 };
    public static void Run()
    {
        Console.WriteLine($"{Environment.ProcessorCount} processors available");
        _InitTest(); // reset csv file, generate test data if needed
        // start TPL stuff
        var sw = Stopwatch.StartNew();
        // transformer
        var jsonParseBlock = new TransformBlock<string, Item>(rawstr =>
        {
            var item = Newtonsoft.Json.JsonConvert.DeserializeObject<Item>(rawstr);
            System.Threading.Thread.Sleep(15); // the more sleep here, the more messages lost
            return item;
        }, parseOpts);
        // batch block
        var jsonBatchBlock = new BatchBlock<Item>(200);
        // writer block
        var flatWriterBlock = new ActionBlock<Item[]>(items =>
        {
            //Console.WriteLine($"writing {items.Length} to csv");
            StringBuilder sb = new StringBuilder();
            foreach (var item in items)
            {
                sb.AppendLine($"{item.ID},{item.Name}");
            }
            File.AppendAllText(OUT_FILE, sb.ToString());
        });
        jsonParseBlock.LinkTo(jsonBatchBlock, new DataflowLinkOptions { PropagateCompletion = true });
        jsonBatchBlock.LinkTo(flatWriterBlock, new DataflowLinkOptions { PropagateCompletion = true });
        // start doing the work
        var crawlerTask = GetJsons(DATA_DIR, jsonParseBlock);
        crawlerTask.Wait();
        flatWriterBlock.Completion.Wait();
        Console.WriteLine("ALERT: tplflat.csv row count should match the test data");
        Console.WriteLine($"Completed in {sw.ElapsedMilliseconds / 1000.0} secs");
    }
    static async Task GetJsons(string filepath, ITargetBlock<string> queue)
    {
        int count = 1;
        foreach (var zip in Directory.EnumerateFiles(filepath, "*.zip"))
        {
            Console.WriteLine($"working on zip #{count++}");
            var zipStream = new FileStream(zip, FileMode.Open);
            await ExtractJsonsInMemory(zip, zipStream, queue);
        }
        queue.Complete();
    }
    static async Task ExtractJsonsInMemory(string filename, Stream stream, ITargetBlock<string> queue)
    {
        ZipArchive archive = new ZipArchive(stream);
        foreach (ZipArchiveEntry entry in archive.Entries)
        {
            if (entry.Name.EndsWith(".json", StringComparison.OrdinalIgnoreCase))
            {
                using (TextReader reader = new StreamReader(entry.Open(), Encoding.UTF8))
                {
                    var jsonText = reader.ReadToEnd();
                    await queue.SendAsync(jsonText);
                }
            }
            else if (entry.Name.EndsWith(".zip", StringComparison.OrdinalIgnoreCase))
            {
                await ExtractJsonsInMemory(entry.FullName, entry.Open(), queue);
            }
        }
    }
}
Update1
I've added async, but it is not clear to me how to wait for all the dataflow blocks to complete (I'm new to C#, async, and TPL). I basically want to say, "keep running until all of the queues/blocks are empty". I've added the following 'wait' code, and it appears to be working.
// wait for crawler to finish
crawlerTask.Wait();
// wait for the last block
flatWriterBlock.Completion.Wait();
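For reference, once the calling method itself can be async, the non-blocking equivalent of that wait would be roughly:
await crawlerTask;
await flatWriterBlock.Completion;
// or, equivalently:
await Task.WhenAll(crawlerTask, flatWriterBlock.Completion);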
In short, you're posting and ignoring the return value. You've got two options: add an unbounded BufferBlock to hold all your incoming data, or await SendAsync, which will prevent any messages from being dropped.
static async Task ExtractJsonsInMemory(string filename, Stream stream, ITargetBlock<string> queue)
{
    var archive = new ZipArchive(stream);
    foreach (ZipArchiveEntry entry in archive.Entries)
    {
        if (entry.Name.EndsWith(".json", StringComparison.OrdinalIgnoreCase))
        {
            using (var reader = new StreamReader(entry.Open(), Encoding.UTF8))
            {
                var jsonText = reader.ReadToEnd();
                await queue.SendAsync(jsonText);
            }
        }
        else if (entry.Name.EndsWith(".zip", StringComparison.OrdinalIgnoreCase))
        {
            await ExtractJsonsInMemory(entry.FullName, entry.Open(), queue);
        }
    }
}
You'll need to pull the async all the way back up, but this should get you started.
From MSDN, about the DataflowBlock.Post<TInput> method:
Return Value
Type: System.Boolean
true if the item was accepted by the target block; otherwise, false.
So, the problem here is that you're sending your messages without checking whether the pipeline can accept another one. This is happening because of your options for the blocks:
new ExecutionDataflowBlockOptions() { BoundedCapacity = 100 }
and this line:
// this line isn't waiting for long operations and simply drops the message as it can't be accepted by the target block
queue.Post(jsonText);
Here you're saying that incoming messages should be postponed once the input queue length reaches 100. In this case both MSDN and @StephenCleary in his Introduction to Dataflow series suggest a simple solution:
However, it’s possible to throttle a block by limiting its buffer size; in this case, you could use SendAsync to (asynchronously) wait for space to be available and then place the data into the block’s input buffer.
So, as @JSteward already suggested, you can introduce an unbounded buffer between your workers to avoid dropping messages, and this is a common practice, as retrying Post in a loop until the item is accepted could block the producer thread for a long time.
The second part of the question, about performance, is addressed by an async-oriented solution (which fits perfectly with SendAsync usage), as you use I/O operations all the time. An asynchronous operation is basically a way to tell the program "start doing this and notify me when it's done". As there is no thread for such operations, you gain by freeing up the thread pool for the other operations in your pipeline.
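For illustration, the difference in the producer boils down to this (sketch):
// Post returns immediately; with BoundedCapacity set, a full block simply refuses the item:
bool accepted = queue.Post(jsonText); // returns false when the buffer is full, i.e. the message is lost
// SendAsync asynchronously waits for space instead of dropping:
await queue.SendAsync(jsonText); // completes once the block has accepted the item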
PS: @JSteward has already provided good sample code for this approach.

Task.FromAsync and two threads

I'm using .net 4.0 and have following code:
var stream = new FileStream(filename, FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize,
    FileOptions.Asynchronous | FileOptions.SequentialScan);
var buffer = new byte[bufferSize];
Debug.Assert(stream.IsAsync, "stream.IsAsync");
var ia = stream.BeginRead(buffer, 0, buffer.Length, t =>
{
    var ms = new MemoryStream(buffer);
    using (TextReader rdr = new StreamReader(ms, Encoding.ASCII))
    {
        for (uint iEpoch = 0; iEpoch < FileHeader.NUMBER_OF_ITEMS; iEpoch++)
        {
            dataList.Add(epochData);
        }
    }
}, null);
return Task<int>.Factory.FromAsync(ia, t =>
{
    var st = stream;
    var bytes1 = st.EndRead(t);
    var a = EpochDataList.Count;
    var b = FileHeader.NUMBER_OF_EPOCHS;
    Debug.Assert(a == b);
    st.Dispose();
    return bytes1;
});
And it seems that there is a race condition between the execution of the async callback and the end-method lambda (the assert is firing). But according to MSDN, it is explicitly stated that the end method should execute after the async callback has finished:
Creates a Task that executes an end method function when a specified IAsyncResult completes.
Am I right that I'm confusing the completion of the I/O operation (which triggers the end method) with the completion of the async callback, and so the two can potentially execute at the same time?
Meanwhile this code works great:
return Task<int>.Factory.FromAsync(stream.BeginRead, (ai) =>
{
    var ms = new MemoryStream(buffer);
    using (TextReader rdr = new StreamReader(ms, Encoding.ASCII))
    {
        for (uint iEpoch = 0; iEpoch < FileHeader.NUMBER_OF_ITEMS; iEpoch++)
        {
            dataList.Add(epochData);
        }
    }
    stream.Dispose();
    return stream.EndRead(ai);
}, buffer, 0, buffer.Length, null);
Also I need to mention that the returned task is used within a continuation.
Thanks in advance.
You're doing this so wrong, I'm almost inclined not to answer - you're going to hurt someone with that code. But since this isn't Code Review...
Your most immediate problem is that the callback you provide to BeginRead isn't part of the IAsyncResult at all. Thus, "when a specified IAsyncResult completes" doesn't talk about your callback, it only talks about the underlying asynchronous operation - you get two separate callbacks launched by the same event.
Now, for the other problems:
You need to keep issuing BeginReads over and over again, until EndRead returns 0. Otherwise, you're only ever reading the whole buffer at most - if your file is longer than that, you're not going to read the whole file.
You're combining old-school asynchronous API callbacks with Task-based asynchrony. This is bound to give you trouble. Just learn to use Tasks properly, and you'll find the callbacks are 100% unnecessary.
EndRead is telling you how many bytes were actually read in the preceding BeginRead operation - you're ignoring that information.
Doing this correctly isn't all that easy - if possible, I'd suggest upgrading to .NET 4.5 and taking advantage of the await keyword. If that's not possible, you can install the async targeting pack, which adds await to 4.0 as a simple NuGet package.
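If you're stuck on 4.0 without await, a correct read loop built on FromAsync might look roughly like this (a sketch, untested; it blocks on each read for brevity, where a production version would chain ContinueWith instead):
static byte[] ReadAllBytes(string filename)
{
    using (var stream = new FileStream(filename, FileMode.Open, FileAccess.Read,
        FileShare.Read, 4096, FileOptions.Asynchronous))
    using (var ms = new MemoryStream())
    {
        var buffer = new byte[4096];
        while (true)
        {
            // Wrap each BeginRead/EndRead pair in a task, until EndRead returns 0.
            int read = Task<int>.Factory.FromAsync(
                stream.BeginRead, stream.EndRead,
                buffer, 0, buffer.Length, null).Result;
            if (read == 0)
                break;
            ms.Write(buffer, 0, read); // use only the bytes actually read
        }
        return ms.ToArray();
    }
}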
With await, reading the whole file is as simple as
using (var sr = new StreamReader(fs))
{
    string line;
    while ((line = await sr.ReadLineAsync()) != null)
    {
        // Do whatever
    }
}

file writing using blockingcollection

I have a TCP listener which listens for and writes data from the server. I used a BlockingCollection to store the data. Here I don't know when the file ends, so my FileStream is always open.
Part of my code is:
private static BlockingCollection<string> Buffer = new BlockingCollection<string>();

async Task Process()
{
    var consumer = Task.Factory.StartNew(() => WriteData());
    while (true)
    {
        string request = await reader.ReadLineAsync();
        Buffer.Add(request);
    }
}

void WriteData()
{
    FileStream fStream = new FileStream(filename, FileMode.Append, FileAccess.Write, FileShare.Write, 16392);
    foreach (var val in Buffer.GetConsumingEnumerable(token))
    {
        var bytes = Encoding.UTF8.GetBytes(val);
        fStream.Write(bytes, 0, bytes.Length);
        fStream.Flush();
    }
}
The problem is that I cannot dispose of the FileStream within the loop, otherwise I would have to create a FileStream for each line, and the loop may never end.
This would be much easier in .NET 4.5 if you used a DataFlow ActionBlock. An ActionBlock accepts and buffers incoming messages and processes them asynchronously using one or more Tasks.
You could write something like this:
public static async Task ProcessFile(string sourceFileName, string targetFileName)
{
    //Pass the target stream as part of the message to avoid globals
    var block = new ActionBlock<Tuple<string, FileStream>>(async tuple =>
    {
        var line = tuple.Item1;
        var stream = tuple.Item2;
        var bytes = Encoding.UTF8.GetBytes(line);
        await stream.WriteAsync(bytes, 0, bytes.Length);
    });
    //Post lines to block
    using (var targetStream = new FileStream(targetFileName, FileMode.Append,
        FileAccess.Write, FileShare.Write, 16392))
    {
        using (var sourceStream = File.OpenRead(sourceFileName))
        {
            await PostLines(sourceStream, targetStream, block);
        }
        //Tell the block we are done
        block.Complete();
        //And wait for it to finish
        await block.Completion;
    }
}

private static async Task PostLines(FileStream sourceStream, FileStream targetStream,
    ActionBlock<Tuple<string, FileStream>> block)
{
    using (var reader = new StreamReader(sourceStream))
    {
        while (true)
        {
            var line = await reader.ReadLineAsync();
            if (line == null)
                break;
            var tuple = Tuple.Create(line, targetStream);
            block.Post(tuple);
        }
    }
}
Most of the code deals with reading each line and posting it to the block. By default, an ActionBlock uses only a single Task to process one message at a time, which is fine in this scenario. More tasks can be used if needed to process data in parallel.
Once all lines are read, we notify the block with a call to Complete and wait for it to finish processing with await block.Completion.
Once the block's Completion task finishes we can close the target stream.
The beauty of the DataFlow library is that you can link multiple blocks together, to create a pipeline of processing steps. ActionBlock is typically the final step in such a chain. The library takes care to pass data from one block to the next and propagate completion down the chain.
For example, one step can read files from a log, a second can parse them with a regex to find specific patterns (eg error messages) and pass them on, a third can receive the error messages and write them to another file. Each step will execute on a different thread, with intermediate messages buffered at each step.
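A minimal sketch of such a chain (the file names, the ERROR filter, and errorWriter are made up for illustration; errorWriter is assumed to be an already-open StreamWriter):
var findErrors = new TransformBlock<string, string>(
    line => line.Contains("ERROR") ? line : null);
var writeErrors = new ActionBlock<string>(async line =>
{
    if (line != null)
        await errorWriter.WriteLineAsync(line);
});
findErrors.LinkTo(writeErrors, new DataflowLinkOptions { PropagateCompletion = true });

foreach (var line in File.ReadLines("app.log"))
    findErrors.Post(line);
findErrors.Complete();        // completion propagates down the chain
await writeErrors.Completion; // finishes once every line has been handled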

How to aggregate the data from an async producer and write it to a file?

I'm learning about async/await patterns in C#. Currently I'm trying to solve a problem like this:
There is a producer (a hardware device) that generates 1000 packets per second. I need to log this data to a file.
The device only has a ReadAsync() method to report a single packet at a time.
I need to buffer the packets and write them in the order they are generated to the file, only once a second.
The write operation should fail if it has not finished by the time the next batch of packets is ready to be written.
So far I have written something like below. It works but I am not sure if this is the best way to solve the problem. Any comments or suggestion? What is the best practice to approach this kind of Producer/Consumer problem where the consumer needs to aggregate the data received from the producer?
static async Task TestLogger(Device device, int seconds)
{
    const int bufLength = 1000;
    bool firstIteration = true;
    Task writerTask = null;
    using (var writer = new StreamWriter("test.log"))
    {
        do
        {
            var buffer = new byte[bufLength][];
            for (int i = 0; i < bufLength; i++)
            {
                buffer[i] = await device.ReadAsync();
            }
            if (!firstIteration)
            {
                if (!writerTask.IsCompleted)
                    throw new Exception("Write Time Out!");
            }
            writerTask = Task.Run(() =>
            {
                foreach (var b in buffer)
                    writer.WriteLine(ToHexString(b));
            });
            firstIteration = false;
        } while (--seconds > 0);
    }
}
You could use the following idea, provided the criterion for flushing is the number of packets (up to 1000). I did not test it. It makes use of Stephen Cleary's AsyncProducerConsumerQueue<T> featured in this question.
AsyncProducerConsumerQueue<byte[]> _queue;
Stream _stream;

// producer
async Task ReceiveAsync(CancellationToken token)
{
    while (true)
    {
        var list = new List<byte>();
        while (true)
        {
            token.ThrowIfCancellationRequested();
            var packet = await _device.ReadAsync(token);
            list.Add(packet);
            if (list.Count == 1000)
                break;
        }
        // push next batch
        await _queue.EnqueueAsync(list.ToArray(), token);
    }
}

// consumer
async Task LogAsync(CancellationToken token)
{
    Task previousFlush = Task.FromResult(0);
    CancellationTokenSource cts = null;
    while (true)
    {
        token.ThrowIfCancellationRequested();
        // get next batch
        var nextBatch = await _queue.DequeueAsync(token);
        if (!previousFlush.IsCompleted)
        {
            cts.Cancel(); // cancel the previous flush if not ready
            throw new Exception("failed to flush on time.");
        }
        await previousFlush; // it's completed, observe for any errors
        // start flushing
        cts = CancellationTokenSource.CreateLinkedTokenSource(token);
        previousFlush = _stream.WriteAsync(nextBatch, 0, nextBatch.Length, cts.Token);
    }
}
If you don't want to fail the logger but rather prefer to cancel the flush and proceed to the next batch, you can do so with a minimal change to this code.
In response to @l3arnon's comment:
1. A packet is not a byte, it's byte[]. 2. You haven't used the OP's ToHexString. 3. AsyncProducerConsumerQueue is much less robust and tested than .Net's TPL Dataflow. 4. You await previousFlush for errors just after you throw an exception which makes that line redundant. Etc. In short: I think the possible added value doesn't justify this very complicated solution.
"A packet is not a byte, it's byte[]" - A packet is a byte, this is obvious from the OP's code: buffer[i] = await device.ReadAsync(). Then, a batch of packets is byte[].
"You haven't used the OP's ToHexString." - The goal was to show how to use Stream.WriteAsync which natively accepts a cancellation token, instead of WriteLineAsync which doesn't allow cancellation. It's trivial to use ToHexString with Stream.WriteAsync and still take advantage of cancellation support:
var hexBytes = Encoding.ASCII.GetBytes(ToHexString(nextBatch) + Environment.NewLine);
previousFlush = _stream.WriteAsync(hexBytes, 0, hexBytes.Length, cts.Token);
"AsyncProducerConsumerQueue is much less robust and tested than .Net's TPL Dataflow" - I don't think this is a determined fact. However, if the OP is concerned about it, he can use regular BlockingCollection, which doesn't block the producer thread. It's OK to block the consumer thread while waiting for the next batch, because writing is done in parallel. As opposed to this, your TPL Dataflow version carries one redundant CPU and lock intensive operation: moving data from producer pipeline to writer pipleline with logAction.Post(packet), byte by byte. My code doesn't do that.
"You await previousFlush for errors just after you throw an exception which makes that line redundant." - This line is not redundant. Perhaps, you're missing this point: previousFlush.IsCompleted can be true when previousFlush.IsFaulted or previousFlush.IsCancelled is also true. So, await previousFlush is relevant there to observe any errors on the completed tasks (e.g., a write failure), which otherwise will be lost.
A better approach IMHO would be to have 2 "workers", a producer and a consumer. The producer reads from the device and simply fills a list. The consumer "wakes up" every second and writes the batch to a file.
List<byte[]> _data = new List<byte[]>();

async Task Producer(Device device)
{
    while (true)
    {
        _data.Add(await device.ReadAsync());
    }
}

async Task Consumer(Device device)
{
    using (var writer = new StreamWriter("test.log"))
    {
        while (true)
        {
            Stopwatch watch = Stopwatch.StartNew();
            var batch = _data;
            _data = new List<byte[]>();
            foreach (var packet in batch)
            {
                writer.WriteLine(ToHexString(packet));
                if (watch.Elapsed >= TimeSpan.FromSeconds(1))
                {
                    throw new Exception("Write Time Out!");
                }
            }
            await Task.Delay(TimeSpan.FromSeconds(1) - watch.Elapsed);
        }
    }
}
The while (true) should probably be replaced by a system wide cancellation token.
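For example, a minimal sketch of the producer with cooperative cancellation (the same token would be shared with the consumer):
async Task Producer(Device device, CancellationToken token)
{
    while (!token.IsCancellationRequested)
    {
        _data.Add(await device.ReadAsync());
    }
}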
Assuming you can batch by amount (1000) instead of time (1 second), the simplest solution is probably using TPL Dataflow's BatchBlock which automatically batches a flow of items by size:
async Task TestLogger(Device device, int seconds)
{
    var writer = new StreamWriter("test.log");
    var batch = new BatchBlock<byte[]>(1000);
    var logAction = new ActionBlock<byte[]>(
        packet =>
        {
            return writer.WriteLineAsync(ToHexString(packet));
        });
    ActionBlock<byte[][]> transferAction;
    transferAction = new ActionBlock<byte[][]>(
        bytes =>
        {
            foreach (var packet in bytes)
            {
                if (transferAction.InputCount > 0)
                {
                    return; // or throw new Exception("Write Time Out!");
                }
                logAction.Post(packet);
            }
        }
    );
    batch.LinkTo(transferAction);
    logAction.Completion.ContinueWith(_ => writer.Dispose());
    while (true)
    {
        batch.Post(await device.ReadAsync());
    }
}

How to learn WriteAllBytes progress

Can I use a progress bar to show the progress of
File.WriteAllBytes(file, array)
in C#?
No.
You'll need to write the bytes in chunks using a loop. Something like the following should get you started. Note that this needs to be running in a background thread. If you are using WinForms, you can use a BackgroundWorker.
using (var stream = new FileStream(...))
using (var writer = new BinaryWriter(stream))
{
    var bytesLeft = array.Length; // assuming array is an array of bytes
    var bytesWritten = 0;
    while (bytesLeft > 0)
    {
        var chunkSize = Math.Min(64, bytesLeft);
        writer.Write(array, bytesWritten, chunkSize);
        bytesWritten += chunkSize;
        bytesLeft -= chunkSize;
        // notify progressbar (assuming you're using a background worker)
        backgroundWorker.ReportProgress(bytesWritten * 100 / array.Length);
    }
}
EDIT: as Patashu pointed out below, you can also use tasks and await. I think my method is fairly straightforward and doesn't require any additional threading machinery (besides the one background thread you need to do the operation). It's the traditional way and works well enough.
Since WriteAllBytes is a synchronous method, you can do nothing and know nothing about the operation until it finishes.
What you need to do is have a method like WriteAllBytes, but written to be asynchronous, such as in http://msdn.microsoft.com/en-AU/library/jj155757.aspx . You can have your asynchronous method every so often stop and report its progress to the GUI, as it runs separately.
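For illustration, a rough sketch of such a method (WriteAllBytesAsync is a hypothetical name, not a framework method; assumes .NET 4.5 for Stream.WriteAsync):
static async Task WriteAllBytesAsync(string file, byte[] array, IProgress<int> progress)
{
    const int chunkSize = 4096;
    using (var stream = new FileStream(file, FileMode.Create, FileAccess.Write,
        FileShare.None, chunkSize, useAsync: true))
    {
        int bytesWritten = 0;
        while (bytesWritten < array.Length)
        {
            int count = Math.Min(chunkSize, array.Length - bytesWritten);
            await stream.WriteAsync(array, bytesWritten, count);
            bytesWritten += count;
            progress.Report(bytesWritten * 100 / array.Length); // percent complete
        }
    }
}
If you pass a Progress<int> created on the UI thread, its callback is marshalled back to the UI automatically, so you can update the progress bar directly from it.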
