Scope:
I want to process a large file (1 GB+) by splitting it into smaller, manageable chunks (partitions), persisting them on some storage infrastructure (local disk, blob, network, etc.), and processing them one by one, in memory.
I want to achieve this by leveraging the TPL Dataflow library, and I've created several processing blocks, each performing a specific action on an in-memory file partition.
Further on, I'm using a SemaphoreSlim object to limit the maximum number of in-memory partitions being processed at a given time; a slot is held from the moment a partition is loaded until it is fully processed.
I'm also using the MaxDegreeOfParallelism configuration option at block level to limit the degree of parallelism for each block.
From a technical perspective, the goal is to limit the number of partitions processed in parallel across several consecutive pipeline steps by using a semaphore, thus avoiding overloading the memory.
Issue description: when MaxDegreeOfParallelism is set to a value greater than 1 for all Dataflow blocks except the first one, the process hangs and appears to reach a deadlock. When MaxDegreeOfParallelism is set to 1, everything works as expected. Code sample below...
Do you have any idea/hint/tip why this happens?
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

namespace DemoConsole
{
    class Program
    {
        private static readonly SemaphoreSlim _localSemaphore = new(1);

        static async Task Main(string[] args)
        {
            Console.WriteLine("Configuring pipeline...");
            var dataflowLinkOptions = new DataflowLinkOptions() { PropagateCompletion = true };
            var filter1 = new TransformManyBlock<string, PartitionInfo>(CreatePartitionsAsync, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1 });
            // when MaxDegreeOfParallelism on the line below is set to 1, everything works as expected; any value greater than 1 causes issues
            var blockOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 };
            var filter2 = new TransformBlock<PartitionInfo, PartitionInfo>(ReadPartitionAsync, blockOptions);
            var filter3 = new TransformBlock<PartitionInfo, PartitionInfo>(MapPartitionAsync, blockOptions);
            var filter4 = new TransformBlock<PartitionInfo, PartitionInfo>(ValidatePartitionAsync, blockOptions);
            var actionBlock = new ActionBlock<PartitionInfo>(async (x) => { await Task.CompletedTask; });

            filter1.LinkTo(filter2, dataflowLinkOptions);
            filter2.LinkTo(filter3, dataflowLinkOptions);
            filter3.LinkTo(filter4, dataflowLinkOptions);
            filter4.LinkTo(actionBlock, dataflowLinkOptions);

            await filter1.SendAsync("my-file.csv");
            filter1.Complete();
            await actionBlock.Completion;

            Console.WriteLine("Pipeline completed.");
            Console.ReadKey();
            Console.WriteLine("Done");
        }

        private static async Task<IEnumerable<PartitionInfo>> CreatePartitionsAsync(string input)
        {
            var partitions = new List<PartitionInfo>();
            const int noOfPartitions = 10;
            Log($"Creating {noOfPartitions} partitions from raw file on Thread [{Thread.CurrentThread.ManagedThreadId}] ...");
            for (short i = 1; i <= noOfPartitions; i++)
            {
                partitions.Add(new PartitionInfo { FileName = $"{Path.GetFileNameWithoutExtension(input)}-p{i}-raw.json", Current = i });
            }
            await Task.CompletedTask;
            Log($"Creating {noOfPartitions} partitions from raw file completed on Thread [{Thread.CurrentThread.ManagedThreadId}].");
            return partitions;
        }

        private static async Task<PartitionInfo> ReadPartitionAsync(PartitionInfo input)
        {
            Log($"Semaphore - trying to enter for partition [{input.Current}] - Current count is [{_localSemaphore.CurrentCount}]; client thread [{Thread.CurrentThread.ManagedThreadId}]");
            await _localSemaphore.WaitAsync();
            Log($"Semaphore - entered for partition [{input.Current}] - Current count is [{_localSemaphore.CurrentCount}]; client thread [{Thread.CurrentThread.ManagedThreadId}]");
            Log($"Reading partition [{input.Current}] on Thread [{Thread.CurrentThread.ManagedThreadId}] ...");
            await Task.Delay(1000);
            Log($"Reading partition [{input.Current}] completed on Thread [{Thread.CurrentThread.ManagedThreadId}].");
            return input;
        }

        private static async Task<PartitionInfo> MapPartitionAsync(PartitionInfo input)
        {
            Log($"Mapping partition [{input.Current}] on Thread [{Thread.CurrentThread.ManagedThreadId}] ...");
            await Task.Delay(1000);
            Log($"Mapping partition [{input.Current}] completed on Thread [{Thread.CurrentThread.ManagedThreadId}].");
            return input;
        }

        private static async Task<PartitionInfo> ValidatePartitionAsync(PartitionInfo input)
        {
            Log($"Validating partition [{input.Current}] on Thread [{Thread.CurrentThread.ManagedThreadId}] ...");
            await Task.Delay(1000);
            Log($"Validating partition [{input.Current}] completed on Thread [{Thread.CurrentThread.ManagedThreadId}].");
            Log($"Semaphore - releasing - Current count is [{_localSemaphore.CurrentCount}]; client thread [{Thread.CurrentThread.ManagedThreadId}]");
            _localSemaphore.Release();
            Log($"Semaphore - released - Current count is [{_localSemaphore.CurrentCount}]; client thread [{Thread.CurrentThread.ManagedThreadId}]");
            return input;
        }

        private static void Log(string message) => Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff} : {message}");
    }

    class PartitionInfo
    {
        public string FileName { get; set; }
        public short Current { get; set; }
    }
}
Before implementing this solution, take a look at the comments, because there is a fundamental architecture problem in your code.
However, the issue you've posted is reproducible and can be solved with the following ExecutionDataflowBlockOptions change:
var blockOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5, EnsureOrdered = false };
The EnsureOrdered property defaults to true. When parallelism is greater than 1, there's no guarantee which message will be processed first. If the message processed first was not the first one received by the block, it will wait in a reordering buffer until the first message the block received completes. Because filter1 is a TransformManyBlock, I'm not sure it's even possible to know what order the messages are sent to filter2 in.
If you run your code enough times, you will eventually get lucky and the first message sent to filter2 will also be processed first, in which case it will release the semaphore and the pipeline will progress. But you will hit the same issue on the very next message processed: if it wasn't the second message received, it will wait in the reordering buffer.
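For reference, here is a minimal sketch of how the change slots into the original sample; only the blockOptions line changes, and the downstream blocks may now emit partitions out of order (which is fine here, since the semaphore, not the ordering, is what guards memory):

var blockOptions = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 5,
    EnsureOrdered = false // results may leave each block out of order
};
var filter2 = new TransformBlock<PartitionInfo, PartitionInfo>(ReadPartitionAsync, blockOptions);
var filter3 = new TransformBlock<PartitionInfo, PartitionInfo>(MapPartitionAsync, blockOptions);
var filter4 = new TransformBlock<PartitionInfo, PartitionInfo>(ValidatePartitionAsync, blockOptions);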
Related
I am working on a protocol and trying to use as much async/await as I can to make it scale well. The protocol will have to support hundreds to thousands of simultaneous connections. Below is a little bit of pseudo code to illustrate my problem.
private static async void DoSomeWork()
{
    var protocol = new FooProtocol();
    await protocol.Connect("127.0.0.1", 1234);
    var i = 0;
    while (i != int.MaxValue)
    {
        i++;
        var request = new FooRequest();
        request.Payload = "Request Nr " + i;
        var task = protocol.Send(request);
        _ = task.ContinueWith(async tmp =>
        {
            var resp = await task;
            Console.WriteLine($"Request {resp.SequenceNr} Successful: {(resp.Status == 0)}");
        });
    }
}
And below is a little pseudo code for the protocol.
public class FooProtocol
{
    private int sequenceNr = 0;
    private SemaphoreSlim ss = new SemaphoreSlim(20, 20);

    public Task<FooResponse> Send(FooRequest fooRequest)
    {
        var tcs = new TaskCompletionSource<FooResponse>();
        ss.Wait();
        var tmp = Interlocked.Increment(ref sequenceNr);
        fooRequest.SequenceNr = tmp;
        // Faking some arbitrary delay. This work is done over sockets.
        Task.Run(async () =>
        {
            await Task.Delay(1000);
            tcs.SetResult(new FooResponse() { SequenceNr = tmp });
            ss.Release();
        });
        return tcs.Task;
    }
}
I have a protocol with request/response pairs, implemented with asynchronous socket programming. FooProtocol takes care of matching up requests with responses (sequence numbers) and also enforces the maximum number of pending requests (done in the pseudo code, and in my real code, with a SemaphoreSlim, so I'm not worried about runaway requests). The DoSomeWork method calls Protocol.Send, but I don't want to await the response; I want to spin around and send the next request until I'm blocked by the maximum number of pending requests. When the task does complete, I want to check the response and maybe do some work.
I would like to fix two things:
I would like to avoid using Task.ContinueWith(), because it doesn't seem to fit cleanly with the async/await patterns.
Because I have awaited the connection, I've had to use the async modifier. Now I get warnings from the IDE: "Because this call is not awaited, execution of the current method continues before the call is completed. Consider applying the 'await' operator to the result of the call." I don't want to await it, because as soon as I do, it ruins the protocol's ability to have many requests in flight. The only way I can get rid of the warning is to use a discard, which isn't the worst thing, but I can't help feeling that I'm missing a trick and fighting this too hard.
Side note: I hope your actual code is using SemaphoreSlim.WaitAsync rather than SemaphoreSlim.Wait.
In most socket code, you do end up with a list of connections, and along with each connection is a "processor" of some kind. In the async world, this is naturally represented as a Task.
So you will need to keep a list of Tasks; at the very least, your consuming application will need to know when it is safe to shut down (i.e., all responses have been received).
Don't preemptively worry about using Task.Run; as long as you aren't blocking (e.g., SemaphoreSlim.Wait), you probably will not starve the thread pool. Remember that during the awaits, no thread pool thread is used.
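A minimal sketch of that idea, assuming the FooProtocol/FooRequest/FooResponse types from the question (ProcessAsync is a hypothetical helper, not part of the original code):

// Keep one Task per in-flight request, then await them all before shutdown.
var pending = new List<Task>();
for (var i = 0; i < 1000; i++)
{
    pending.Add(ProcessAsync(protocol, new FooRequest { Payload = "Request Nr " + i }));
}
await Task.WhenAll(pending); // every response has been observed; safe to shut down

async Task ProcessAsync(FooProtocol protocol, FooRequest request)
{
    var resp = await protocol.Send(request);
    Console.WriteLine($"Request {resp.SequenceNr} Successful: {resp.Status == 0}");
}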
I am not sure that it's a good idea to enforce the maximum concurrency at the protocol level. It seems to me that this responsibility belongs to the caller of the protocol. So I would remove the SemaphoreSlim, and let it do the one thing that it knows to do well:
public class FooProtocol
{
    private int sequenceNr = 0;

    public async Task<FooResponse> Send(FooRequest fooRequest)
    {
        var tmp = Interlocked.Increment(ref sequenceNr);
        fooRequest.SequenceNr = tmp;
        await Task.Delay(1000); // Faking some arbitrary delay
        return new FooResponse() { SequenceNr = tmp };
    }
}
Then I would use an ActionBlock from the TPL Dataflow library in order to coordinate the process of sending a massive number of requests through the protocol, by handling the concurrency, the backpressure (BoundedCapacity), the cancellation (if needed), the error handling, and the status of the whole operation (running, completed, failed, etc.). Example:
private static async Task DoSomeWorkAsync()
{
    var protocol = new FooProtocol();
    var actionBlock = new ActionBlock<FooRequest>(async request =>
    {
        var resp = await protocol.Send(request);
        Console.WriteLine($"Request {resp.SequenceNr} Status: {resp.Status}");
    }, new ExecutionDataflowBlockOptions()
    {
        MaxDegreeOfParallelism = 20,
        BoundedCapacity = 100
    });

    await protocol.Connect("127.0.0.1", 1234);
    foreach (var i in Enumerable.Range(0, Int32.MaxValue))
    {
        var request = new FooRequest();
        request.Payload = "Request Nr " + i;
        var accepted = await actionBlock.SendAsync(request);
        if (!accepted) break; // The block has failed irrecoverably
    }
    actionBlock.Complete();
    await actionBlock.Completion; // Propagate any exceptions
}
The BoundedCapacity = 100 configuration means that the ActionBlock will store at most 100 requests in its internal buffer. When this threshold is reached, anyone who wants to send it more requests will have to wait. The awaiting happens in the await actionBlock.SendAsync line.
Questions on a Lambda to delete partitions.
The existing query, which uses parallelization, is failing because it exceeds the number of allowed parallel queries. We want to replace it with sequential queries and an increased timeout for the Lambda.
Can we change the Lambda to run in parallel with a limited number of threads?
Database -> AWS Athena. We get the list of clients from Athena and loop through it.
Right now it works fine with sequential calls as well, but the number of clients is small now; it would pose a problem in the future.
The only issue with limited parallel threads is that we need some code to manage the thread count as well.
Then someone suggested I use this: https://devblogs.microsoft.com/pfxteam/implementing-a-simple-foreachasync-part-2/
https://gist.github.com/0xced/94f6c50d620e582e19913742dbd76eb6
public class AthenaClient
{
    private readonly IAmazonAthena _client;
    private readonly string _databaseName;
    private readonly string _outputLocation;
    private readonly string _tableName;
    const int MaxQueryLength = 262144;
    readonly int _maxclientsToBeProcessed;

    public AthenaClient(string databaseName, string tableName, string outputLocation, int maxclientsToBeProcessed)
    {
        _databaseName = databaseName;
        _tableName = tableName;
        _outputLocation = outputLocation;
        _maxclientsToBeProcessed = maxclientsToBeProcessed == 0 ? 1 : maxclientsToBeProcessed;
        _client = new AmazonAthenaClient();
    }

    public async Task<bool> DeletePartitions()
    {
        var clients = await GetClients();
        for (int i = 0; i < clients.Count; i = i + _maxclientsToBeProcessed)
        {
            var clientItems = clients.Skip(i).Take(_maxclientsToBeProcessed);
            var queryBuilder = new StringBuilder();
            queryBuilder.AppendLine($"ALTER TABLE {_databaseName}.{_tableName} DROP IF EXISTS");
            foreach (var client in clientItems)
            {
                queryBuilder.AppendLine($" PARTITION (client_id = '{client}'), ");
            }
            var query = queryBuilder.ToString().Trim().TrimEnd(',') + ";";
            LambdaLogger.Log(query);
            if (query.Length >= MaxQueryLength)
            {
                throw new Exception("Delete partition query length exceeded.");
            }
            var queryExecutionId = StartQueryExecution(query).Result;
            await CheckQueryExecutionStatus(queryExecutionId);
        }
        return true;
    }
}
It seems that the actual question should be:
How can I change the database partitions for lots of clients in AWS Athena without executing the queries sequentially?
The answer isn't ForEachAsync or the upcoming await foreach in C# 8. An asynchronous loop would still send calls to the service one at a time; it "just" wouldn't block while waiting for an answer.
Concurrent workers
This is a concurrent worker problem that can be handled using e.g. the TPL Dataflow library's ActionBlock class or the new System.Threading.Channels classes.
The Dataflow library is meant for creating event/message processing pipelines similar to a shell script pipeline, by moving data between independent blocks. Each block runs on its own task/thread, which means you can get concurrent execution simply by breaking processing up into blocks.
It's also possible to increase the number of processing tasks per block by specifying the MaxDegreeOfParallelism option when creating the block. This allows us to quickly create "workers" that can work on lots of messages concurrently.
Example
In this case, the "message" is the Client, whatever that is. A single ActionBlock can create the DDL statement and execute it. Each block has an input queue, which means we can just post messages to a block and await for it to execute everything, using the DOP we specified.
We can also specify a limit to the queue so it won't get flooded if the worker tasks can't run fast enough:
var options = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = _maxclientsToBeProcessed,
    BoundedCapacity = _maxclientsToBeProcessed * 3 // Just a guess
};
var block = new ActionBlock<Client>(client => CreateAndRunDDL(client), options);

// Post the client requests
foreach (var client in clients)
{
    await block.SendAsync(client);
}

// Tell the block we're done
block.Complete();

// Await for all queued messages to finish processing
await block.Completion;
The CreateAndRunDDL(Client) method should do what the code inside the question's loop does. A good idea would be to refactor it though, and create separate functions to create and to execute the query, e.g.:
async Task CreateAndRunDDL(Client client)
{
    var query = QueryForClient(...);
    LambdaLogger.Log(query);
    if (query.Length >= MaxQueryLength)
    {
        throw new Exception("Delete partition query length exceeded.");
    }
    var queryExecutionId = await StartQueryExecution(query);
    await CheckQueryExecutionStatus(queryExecutionId);
}
Blocks can be linked, too. If we wanted to batch multiple clients together for processing, we could use a BatchBlock and feed its results to our action block, e.g.:
var batchClients = new BatchBlock<Client>(20);
var linkOptions = new DataflowLinkOptions
{
    PropagateCompletion = true
};
var block = new ActionBlock<Client[]>(clients => CreateAndRunDDL(clients));
batchClients.LinkTo(block, linkOptions);
This time the CreateAndRunDDL method accepts a Client[] array with the number of clients/messages we specified in the batch size.
async Task CreateAndRunDDL(Client[] clients)
{
    var query = QueryForClients(clients);
    ...
}
Messages should now be posted to the batchClients block. Once posting completes, we need to wait for the last block in the pipeline to finish:

foreach (var client in clients)
{
    await batchClients.SendAsync(client);
}

// Tell the *batch block* we're done
batchClients.Complete();

// Await for all queued messages to finish processing
await block.Completion;
I'm trying to build a stable multithreading system (one that uses exactly the number of threads it is set to use).
Here's the code I'm actually using :
public void Start()
{
    List<String> list = new List<String>(File.ReadAllLines("urls.txt"));
    int maxThreads = 100;
    var framework = new Sender();
    ThreadPool.SetMinThreads(maxThreads, maxThreads);
    Parallel.ForEach(list, new ParallelOptions { MaxDegreeOfParallelism = maxThreads }, delegate (string url)
    {
        framework.Send(url, "proxy:port");
    });
    Console.WriteLine("Done.");
}
It is fast and working, but it exceeds the 100-thread limit. That wouldn't be a problem if the proxies I'm using weren't locked to 100 simultaneous connections; as it is, a lot of requests get cancelled by my proxy provider. Any idea how I can keep that speed without exceeding the limit?
Thanks.
Your framework.Send method is returning immediately and processing asynchronously. To validate this, I created the following test method, which works as expected:
public static void Main()
{
    List<String> list = new List<String>(Enumerable.Range(0, 10000).Select(i => i.ToString()));
    int maxThreads = 100;
    ThreadPool.SetMinThreads(maxThreads, maxThreads);
    int currentCount = 0;
    int maxCount = 0;
    object locker = new object();
    Parallel.ForEach(list, new ParallelOptions { MaxDegreeOfParallelism = maxThreads }, delegate (string url)
    {
        lock (locker)
        {
            currentCount++;
            maxCount = Math.Max(currentCount, maxCount);
        }
        Thread.Sleep(10);
        lock (locker)
        {
            maxCount = Math.Max(currentCount, maxCount);
            currentCount--;
        }
    });
    Console.WriteLine("Max Threads: " + maxCount); // Max Threads: 100
    Console.Read();
}
Parallel.For/ForEach are meant for data parallelism - processing large amounts of data that don't need to perform IO. In that case there's no reason to use more threads than there are cores to run them.
This question, though, is about network IO, concurrent connections and throttling. If the proxy provider has a limit, MaxDegreeOfParallelism must be set to a value low enough that the limit isn't exceeded.
A better solution would be to use an ActionBlock with a limited MaxDegreeOfParallelism and a limit on its input buffer, so it doesn't get flooded with urls that await processing.
static async Task Main()
{
    var maxConnections = 20;
    var options = new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = maxConnections,
        BoundedCapacity = maxConnections * 2
    };
    var framework = new Sender();
    var myBlock = new ActionBlock<string>(url =>
    {
        framework.Send(...);
    }, options);

    // ReadLines doesn't load everything; it returns an IEnumerable<string>
    // that loads lines as needed
    var lines = File.ReadLines("urls.txt");
    foreach (var url in lines)
    {
        // Send each line to the block, waiting if the buffer is full
        await myBlock.SendAsync(url);
    }

    // Tell the block we are done
    myBlock.Complete();

    // And wait until it finishes everything
    await myBlock.Completion;
}
Setting BoundedCapacity and MaxDegreeOfParallelism helps with concurrency limits, but not with requests/sec limits. To limit those, one could add a small delay after each request. The block's code would have to change to e.g.:
var delay = 250; // Milliseconds, 4 reqs/sec per connection
var myBlock = new ActionBlock<string>(async url =>
{
    framework.Send(...);
    await Task.Delay(delay);
}, options);
This can be improved further if Sender.Send becomes an asynchronous method. It could use, for example, HttpClient, which only provides asynchronous methods, so it doesn't block while waiting for a response. The changes would be minimal:
var myBlock = new ActionBlock<string>(async url =>
{
    await framework.SendAsync(...);
    await Task.Delay(delay);
}, options);
But the program would use fewer threads and less CPU - each await releases the current thread until a response is received.
Blocking a thread, on the other hand, starts with a spinwait, which means it wastes CPU cycles waiting for a response before putting the thread to sleep.
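As an illustration only, a hypothetical asynchronous Sender built on HttpClient could look like this (the question doesn't show the real Sender API or the proxy plumbing, so treat this as a sketch):

public class Sender
{
    // One shared HttpClient instance, reused across all requests.
    private static readonly HttpClient _http = new HttpClient();

    public async Task SendAsync(string url)
    {
        using (var response = await _http.GetAsync(url))
        {
            response.EnsureSuccessStatusCode(); // throw on non-2xx replies
        }
    }
}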
I have an application where I have 1000+ small parts of one large file.
I have to upload a maximum of 16 parts at a time.
I used the Task Parallel Library of .NET.
I used Parallel.For to divide the file into multiple parts, assigned one method to be executed for each part, and set DegreeOfParallelism to 16.
I need to execute one method with the checksum values generated by the different part uploads, so I have to set up some mechanism where I wait for all part uploads (say, 1000) to complete.
The issue I'm facing with the TPL library is that it randomly executes any of the 1000 parts on the 16 threads.
I want a mechanism by which I can run the first 16 parts initially, and as soon as any of those 16 threads completes its task, the 17th part should be started.
How can I achieve this?
One possible candidate for this is TPL Dataflow. Here is a demonstration which takes in a stream of integers and prints them out to the console. You set MaxDegreeOfParallelism to however many threads you wish to run in parallel:
void Main()
{
    var actionBlock = new ActionBlock<int>(
        i => Console.WriteLine(i),
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 16 });

    foreach (var i in Enumerable.Range(0, 200))
    {
        actionBlock.Post(i);
    }

    // Signal that no more items are coming, then wait for all of them to finish
    actionBlock.Complete();
    actionBlock.Completion.Wait();
}
This can also scale well if you want to have multiple producer/consumers.
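Since the question also needs one final method that runs over all the checksums once every part is uploaded, here is a hedged sketch of that shape; UploadPartAsync and CompleteUpload are hypothetical placeholders for your own upload and finalize methods:

// Collect a checksum per uploaded part, then run the final method once
// all parts are done (uses System.Collections.Concurrent).
var checksums = new ConcurrentBag<string>();
var uploadBlock = new ActionBlock<int>(async partNumber =>
{
    var checksum = await UploadPartAsync(partNumber); // hypothetical per-part upload
    checksums.Add(checksum);
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 16 });

for (var part = 0; part < 1000; part++)
{
    uploadBlock.Post(part);
}
uploadBlock.Complete();
await uploadBlock.Completion;        // all 1000 parts have finished here
CompleteUpload(checksums.ToArray()); // hypothetical method that needs all checksums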
Here is the manual way of doing this.
You need a queue. The queue is a sequence of pending tasks. You dequeue items and put them inside a list of working tasks. Whenever a task is done, remove it from the list of working tasks and take another from the queue. The main thread controls this process. Here is a sample of how to do this.
For the test I used a list of integers, but it should work for other types because it uses generics.
private static void Main()
{
    Random r = new Random();
    var items = Enumerable.Range(0, 100).Select(x => r.Next(100, 200)).ToList();
    ParallelQueue(items, DoWork);
}

private static void ParallelQueue<T>(List<T> items, Action<T> action)
{
    Queue<T> pending = new Queue<T>(items);
    List<Task> working = new List<Task>();

    while (pending.Count + working.Count != 0)
    {
        if (pending.Count != 0 && working.Count < 16) // Maximum tasks
        {
            var item = pending.Dequeue(); // get item from queue
            working.Add(Task.Run(() => action(item))); // run task
        }
        else
        {
            Task.WaitAny(working.ToArray());
            working.RemoveAll(x => x.IsCompleted); // remove finished tasks
        }
    }
}

private static void DoWork(int i) // do your work here.
{
    // this is just an example
    Task.Delay(i).Wait();
    Console.WriteLine(i);
}
Please let me know if you encounter problems implementing DoWork for yourself, because if you change the method signature you may need to make some changes.
Update
You can also do this with async/await, without blocking the main thread.
private static void Main()
{
    Random r = new Random();
    var items = Enumerable.Range(0, 100).Select(x => r.Next(100, 200)).ToList();
    Task t = ParallelQueue(items, DoWork);
    // able to do other things.
    t.Wait();
}

private static async Task ParallelQueue<T>(List<T> items, Func<T, Task> func)
{
    Queue<T> pending = new Queue<T>(items);
    List<Task> working = new List<Task>();

    while (pending.Count + working.Count != 0)
    {
        if (working.Count < 16 && pending.Count != 0)
        {
            var item = pending.Dequeue();
            working.Add(Task.Run(() => func(item)));
        }
        else
        {
            await Task.WhenAny(working);
            working.RemoveAll(x => x.IsCompleted);
        }
    }
}

private static async Task DoWork(int i)
{
    await Task.Delay(i);
}
var workitems = ... /*e.g. Enumerable.Range(0, 1000000)*/;
SingleItemPartitioner.Create(workitems)
    .AsParallel()
    .AsOrdered()
    .WithDegreeOfParallelism(16)
    .WithMergeOptions(ParallelMergeOptions.NotBuffered)
    .ForAll(i => { Thread.Sleep(1000); Console.WriteLine(i); });
This should be all you need. I forgot how the methods are named exactly - look at the documentation.
Test this by printing to the console after sleeping for one second (which this sample code does).
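If I remember correctly, SingleItemPartitioner comes from the ParallelExtensionsExtras samples. If you'd rather avoid that dependency, the built-in Partitioner.Create with NoBuffering should behave similarly; a sketch, not tested:

// Built-in alternative from System.Collections.Concurrent: hand out items
// one at a time instead of in chunks.
var workitems = Enumerable.Range(0, 1000000);
Partitioner.Create(workitems, EnumerablePartitionerOptions.NoBuffering)
    .AsParallel()
    .AsOrdered()
    .WithDegreeOfParallelism(16)
    .WithMergeOptions(ParallelMergeOptions.NotBuffered)
    .ForAll(i => { Thread.Sleep(1000); Console.WriteLine(i); });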
Another option would be to use a BlockingCollection<T> as a queue between your file reader thread and your 16 uploader threads. Each uploader thread would just loop around consuming the blocking collection until it is complete.
And, if you want to limit memory consumption in the queue you can set an upper limit on the blocking collection such that the file-reader thread will pause when the buffer has reached capacity. This is particularly useful in a server environment where you may need to limit memory used per user/API call.
// Create a buffer of 4 chunks between the file reader and the senders
BlockingCollection<Chunk> queue = new BlockingCollection<Chunk>(4);
// Create a cancellation token source so you can stop this gracefully
CancellationTokenSource cts = ...
File reader thread
...
queue.Add(chunk, cts.Token);
...
queue.CompleteAdding();
Sending threads
for (int i = 0; i < 16; i++)
{
    Task.Run(() =>
    {
        foreach (var chunk in queue.GetConsumingEnumerable(cts.Token))
        {
            // .. do the upload
        }
    });
}
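Putting the fragments together, a self-contained version might look like this; Chunk, ReadChunks and Upload are hypothetical stand-ins for the real reader and uploader:

// Sketch only: one reader task fills the bounded queue, 16 uploader tasks
// drain it, and the caller waits for everything to finish.
var queue = new BlockingCollection<Chunk>(4);
using var cts = new CancellationTokenSource();

var readerTask = Task.Run(() =>
{
    foreach (var chunk in ReadChunks("bigfile.dat")) // hypothetical reader
        queue.Add(chunk, cts.Token); // blocks while the buffer is full
    queue.CompleteAdding(); // signal "no more chunks"
});

var uploaders = Enumerable.Range(0, 16)
    .Select(_ => Task.Run(() =>
    {
        foreach (var chunk in queue.GetConsumingEnumerable(cts.Token))
            Upload(chunk); // hypothetical per-chunk upload
    }))
    .ToArray();

await readerTask;
await Task.WhenAll(uploaders); // all uploads are finished here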
I'm learning about async/await patterns in C#. Currently I'm trying to solve a problem like this:
There is a producer (a hardware device) that generates 1000 packets per second. I need to log this data to a file.
The device only has a ReadAsync() method to report a single packet at a time.
I need to buffer the packets and write them to the file in the order they are generated, only once a second.
The write operation should fail if the write process is not finished in time, when the next batch of packets is ready to be written.
So far I have written something like below. It works, but I am not sure if this is the best way to solve the problem. Any comments or suggestions? What is the best practice for approaching this kind of producer/consumer problem where the consumer needs to aggregate the data received from the producer?
static async Task TestLogger(Device device, int seconds)
{
    const int bufLength = 1000;
    bool firstIteration = true;
    Task writerTask = null;

    using (var writer = new StreamWriter("test.log"))
    {
        do
        {
            var buffer = new byte[bufLength][];
            for (int i = 0; i < bufLength; i++)
            {
                buffer[i] = await device.ReadAsync();
            }
            if (!firstIteration)
            {
                if (!writerTask.IsCompleted)
                    throw new Exception("Write Time Out!");
            }
            writerTask = Task.Run(() =>
            {
                foreach (var b in buffer)
                    writer.WriteLine(ToHexString(b));
            });
            firstIteration = false;
        } while (--seconds > 0);
    }
}
You could use the following idea, provided the criterion for flushing is the number of packets (up to 1000). I did not test it. It makes use of Stephen Cleary's AsyncProducerConsumerQueue<T>, featured in this question.
AsyncProducerConsumerQueue<byte[]> _queue;
Stream _stream;

// producer
async Task ReceiveAsync(CancellationToken token)
{
    while (true)
    {
        var list = new List<byte>();
        while (true)
        {
            token.ThrowIfCancellationRequested();
            var packet = await _device.ReadAsync(token);
            list.Add(packet);
            if (list.Count == 1000)
                break;
        }
        // push next batch
        await _queue.EnqueueAsync(list.ToArray(), token);
    }
}

// consumer
async Task LogAsync(CancellationToken token)
{
    Task previousFlush = Task.FromResult(0);
    CancellationTokenSource cts = null;
    while (true)
    {
        token.ThrowIfCancellationRequested();
        // get next batch
        var nextBatch = await _queue.DequeueAsync(token);
        if (!previousFlush.IsCompleted)
        {
            cts.Cancel(); // cancel the previous flush if not ready
            throw new Exception("failed to flush on time.");
        }
        await previousFlush; // it's completed, observe for any errors
        // start flushing
        cts = CancellationTokenSource.CreateLinkedTokenSource(token);
        previousFlush = _stream.WriteAsync(nextBatch, 0, nextBatch.Length, cts.Token);
    }
}
If you don't want to fail the logger but rather prefer to cancel the flush and proceed to the next batch, you can do so with a minimal change to this code.
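For example, one possible minimal change (a sketch, not tested) replaces the throw in LogAsync with something like:

// Cancel the late flush, observe its outcome, and move on to the next batch.
if (!previousFlush.IsCompleted)
{
    cts.Cancel(); // cancel the previous flush
    try { await previousFlush; }
    catch (OperationCanceledException) { /* the flush was late; skip it */ }
}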
In response to @l3arnon's comment:
"1. A packet is not a byte, it's byte[]. 2. You haven't used the OP's ToHexString. 3. AsyncProducerConsumerQueue is much less robust and tested than .NET's TPL Dataflow. 4. You await previousFlush for errors just after you throw an exception, which makes that line redundant. Etc. In short: I think the possible added value doesn't justify this very complicated solution."
"A packet is not a byte, it's byte[]" - A packet is a byte, this is obvious from the OP's code: buffer[i] = await device.ReadAsync(). Then, a batch of packets is byte[].
"You haven't used the OP's ToHexString." - The goal was to show how to use Stream.WriteAsync which natively accepts a cancellation token, instead of WriteLineAsync which doesn't allow cancellation. It's trivial to use ToHexString with Stream.WriteAsync and still take advantage of cancellation support:
var hexBytes = Encoding.ASCII.GetBytes(ToHexString(nextBatch) + Environment.NewLine);
_stream.WriteAsync(hexBytes, 0, hexBytes.Length, token);
"AsyncProducerConsumerQueue is much less robust and tested than .Net's TPL Dataflow" - I don't think this is a determined fact. However, if the OP is concerned about it, he can use regular BlockingCollection, which doesn't block the producer thread. It's OK to block the consumer thread while waiting for the next batch, because writing is done in parallel. As opposed to this, your TPL Dataflow version carries one redundant CPU and lock intensive operation: moving data from producer pipeline to writer pipleline with logAction.Post(packet), byte by byte. My code doesn't do that.
"You await previousFlush for errors just after you throw an exception which makes that line redundant." - This line is not redundant. Perhaps, you're missing this point: previousFlush.IsCompleted can be true when previousFlush.IsFaulted or previousFlush.IsCancelled is also true. So, await previousFlush is relevant there to observe any errors on the completed tasks (e.g., a write failure), which otherwise will be lost.
A better approach IMHO would be to have two "workers", a producer and a consumer. The producer reads from the device and simply fills a list. The consumer "wakes up" every second and writes the batch to a file.
List<byte[]> _data = new List<byte[]>();

async Task Producer(Device device)
{
    while (true)
    {
        _data.Add(await device.ReadAsync());
    }
}

async Task Consumer(Device device)
{
    using (var writer = new StreamWriter("test.log"))
    {
        while (true)
        {
            Stopwatch watch = Stopwatch.StartNew();
            var batch = _data;
            _data = new List<byte[]>();
            foreach (var packet in batch)
            {
                writer.WriteLine(ToHexString(packet));
                if (watch.Elapsed >= TimeSpan.FromSeconds(1))
                {
                    throw new Exception("Write Time Out!");
                }
            }
            await Task.Delay(TimeSpan.FromSeconds(1) - watch.Elapsed);
        }
    }
}
The while (true) should probably be replaced by a system-wide cancellation token.
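As a sketch of what that could look like, assuming the token is passed in by the caller:

// Same consumer as above, but honoring a CancellationToken instead of while (true).
async Task Consumer(Device device, CancellationToken token)
{
    using (var writer = new StreamWriter("test.log"))
    {
        while (!token.IsCancellationRequested)
        {
            Stopwatch watch = Stopwatch.StartNew();
            var batch = _data;
            _data = new List<byte[]>();
            foreach (var packet in batch)
            {
                writer.WriteLine(ToHexString(packet));
                if (watch.Elapsed >= TimeSpan.FromSeconds(1))
                {
                    throw new Exception("Write Time Out!");
                }
            }
            await Task.Delay(TimeSpan.FromSeconds(1) - watch.Elapsed, token);
        }
    }
}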
Assuming you can batch by amount (1000) instead of time (1 second), the simplest solution is probably to use TPL Dataflow's BatchBlock, which automatically batches a flow of items by size:
async Task TestLogger(Device device, int seconds)
{
    var writer = new StreamWriter("test.log");
    var batch = new BatchBlock<byte[]>(1000);
    var logAction = new ActionBlock<byte[]>(
        packet =>
        {
            return writer.WriteLineAsync(ToHexString(packet));
        });

    ActionBlock<byte[][]> transferAction;
    transferAction = new ActionBlock<byte[][]>(
        bytes =>
        {
            foreach (var packet in bytes)
            {
                if (transferAction.InputCount > 0)
                {
                    return; // or throw new Exception("Write Time Out!");
                }
                logAction.Post(packet);
            }
        });

    batch.LinkTo(transferAction);
    logAction.Completion.ContinueWith(_ => writer.Dispose());

    while (true)
    {
        batch.Post(await device.ReadAsync());
    }
}