TPL Dataflow LinkTo TransformBlock is very slow - c#

I have two TransformBlocks arranged in a loop: they link their data to each other. TransformBlock 1 is an I/O block that reads data and is limited to a maximum of 50 concurrent tasks. It reads the data plus some meta data. The pair is then passed to the second block, which decides, based on the meta data, whether the message goes back to the first block. So when the meta data matches the criteria, the data should, after a short wait, go back to the I/O block. The second block's MaxDegreeOfParallelism can be unlimited.
Now I have noticed that when I send a lot of data to the I/O block, it takes a long time until the messages are forwarded to the second block. It takes something like 10 minutes before the data is linked, and then it all arrives in one bunch, like 1000 entries within a few seconds.
Normally I would implement it like so:
public void Start()
{
_ioBlock = new TransformBlock<Data,Tuple<Data, MetaData>>(async data =>
{
var metaData = await ReadAsync(data).ConfigureAwait(false);
return new Tuple<Data, MetaData>(data, metaData);
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 50 });
_waitBlock = new TransformBlock<Tuple<Data, MetaData>,Data>(async dataMetaData =>
{
var data = dataMetaData.Item1;
var metaData = dataMetaData.Item2;
if (!metaData.Repost)
{
return null;
}
await Task.Delay(TimeSpan.FromMinutes(1)).ConfigureAwait(false);
return data;
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded });
_ioBlock.LinkTo(_waitBlock);
_waitBlock.LinkTo(_ioBlock, data => data != null);
_waitBlock.LinkTo(DataflowBlock.NullTarget<Data>());
foreach (var data in Enumerable.Range(0, 2000).Select(i => new Data(i)))
{
_ioBlock.Post(data);
}
}
But because of the problem described above, I have to implement it like this:
public void Start()
{
_ioBlock = new ActionBlock<Data>(async data =>
{
var metaData = await ReadAsync(data).ConfigureAwait(false);
var dataMetaData= new Tuple<Data, MetaData>(data, metaData);
_waitBlock.Post(dataMetaData);
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 50 });
_waitBlock = new ActionBlock<Tuple<Data, MetaData>>(async dataMetaData =>
{
var data = dataMetaData.Item1;
var metaData = dataMetaData.Item2;
if (metaData.Repost)
{
await Task.Delay(TimeSpan.FromMinutes(1)).ConfigureAwait(false);
_ioBlock.Post(data);
}
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded });
foreach (var data in Enumerable.Range(0, 2000).Select(i => new Data(i)))
{
_ioBlock.Post(data);
}
}
When I use the second approach, the data gets linked/posted faster (one by one), but it feels more like a hack to me. Does anybody know how to fix the problem? Some friends recommended that I use TPL Pipeline, but it seems much more complicated to me.

Problem solved. You need to set
ExecutionDataflowBlockOptions.EnsureOrdered
to false to forward the data immediately to the next/wait block. With the default ordered behavior, results are only propagated in input order, so one slow message holds back every later message that has already finished.
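For example, the I/O block from the first approach could be configured like this (a minimal sketch reusing the Data/MetaData types and ReadAsync method from the question; only the EnsureOrdered setting is new):
var ioOptions = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 50,
    EnsureOrdered = false // forward each result as soon as it is ready, not in input order
};
_ioBlock = new TransformBlock<Data, Tuple<Data, MetaData>>(async data =>
{
    var metaData = await ReadAsync(data).ConfigureAwait(false);
    return new Tuple<Data, MetaData>(data, metaData);
}, ioOptions);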
Further information:
Why do blocks run in this order?

Related

Reading millions of small files with C#

I have millions of log files which are generated every day, and I need to read all of them and put them together into a single file in order to do some processing on it in another app.
I'm looking for the fastest way to do this. Currently I'm using threads, tasks and Parallel like this:
Parallel.For(0, files.Length, new ParallelOptions { MaxDegreeOfParallelism = 100 }, i =>
{
ReadFiles(files[i]);
});
void ReadFiles(string file)
{
try
{
var txt = File.ReadAllText(file);
filesTxt.Add(txt);
}
catch { }
GlobalCls.ThreadNo--;
}
or
foreach (var file in files)
{
//Int64 index = i;
//var file = files[index];
while (Process.GetCurrentProcess().Threads.Count > 100)
{
Thread.Sleep(100);
Application.DoEvents();
}
new Thread(() => ReadFiles(file)).Start();
GlobalCls.ThreadNo++;
// Task.Run(() => ReadFiles(file));
}
The problem is that after reading a few thousand files, the reading gets slower and slower!
Any idea why? And what's the fastest approach to reading millions of small files? Thank you.
It seems that you are loading the contents of all files in memory, before writing them back to the single file. This could explain why the process becomes slower over time.
A way to optimize the process is to separate the reading part from the writing part, and do them in parallel. This is called the producer-consumer pattern. It can be implemented with the Parallel class, or with threads, or with tasks, but I will instead demonstrate an implementation based on the powerful TPL Dataflow library, which is particularly suited for jobs like this.
private static async Task MergeFiles(IEnumerable<string> sourceFilePaths,
string targetFilePath, CancellationToken cancellationToken = default,
IProgress<int> progress = null)
{
var readerBlock = new TransformBlock<string, string>(async filePath =>
{
return File.ReadAllText(filePath); // Read the small file
}, new ExecutionDataflowBlockOptions()
{
MaxDegreeOfParallelism = 2, // Reading is parallelizable
BoundedCapacity = 100, // No more than 100 file-paths buffered
CancellationToken = cancellationToken, // Cancel at any time
});
StreamWriter streamWriter = null;
int filesProcessed = 0;
var writerBlock = new ActionBlock<string>(text =>
{
streamWriter.Write(text); // Append to the target file
filesProcessed++;
if (filesProcessed % 10 == 0) progress?.Report(filesProcessed);
}, new ExecutionDataflowBlockOptions()
{
MaxDegreeOfParallelism = 1, // We can't parallelize the writer
BoundedCapacity = 100, // No more than 100 file-contents buffered
CancellationToken = cancellationToken, // Cancel at any time
});
readerBlock.LinkTo(writerBlock,
new DataflowLinkOptions() { PropagateCompletion = true });
// This is a tricky part. We use BoundedCapacity, so we must propagate manually
// a possible failure of the writer to the reader, otherwise a deadlock may occur.
PropagateFailure(writerBlock, readerBlock);
// Open the output stream
using (streamWriter = new StreamWriter(targetFilePath))
{
// Feed the reader with the file paths
foreach (var filePath in sourceFilePaths)
{
var accepted = await readerBlock.SendAsync(filePath,
cancellationToken); // Cancel at any time
if (!accepted) break; // This will happen if the reader fails
}
readerBlock.Complete();
await writerBlock.Completion;
}
async void PropagateFailure(IDataflowBlock block1, IDataflowBlock block2)
{
try { await block1.Completion.ConfigureAwait(false); }
catch (Exception ex)
{
if (block1.Completion.IsCanceled) return; // On cancellation do nothing
block2.Fault(ex);
}
}
}
Usage example:
var cts = new CancellationTokenSource();
var progress = new Progress<int>(value =>
{
// Safe to update the UI
Console.WriteLine($"Files processed: {value:#,0}");
});
var sourceFilePaths = Directory.EnumerateFiles(@"C:\SourceFolder", "*.log",
SearchOption.AllDirectories); // Include subdirectories
await MergeFiles(sourceFilePaths, @"C:\AllLogs.log", cts.Token, progress);
The BoundedCapacity is used to keep the memory usage under control.
If the disk drive is SSD, you can try reading with a MaxDegreeOfParallelism larger than 2.
For best performance you could consider writing to a different disc drive than the drive containing the source files.
The TPL Dataflow library is available as a package for .NET Framework, and is built into .NET Core.
When it comes to IO operations, CPU parallelism is useless. Your IO device (disk, network, whatever) is your bottleneck. By reading from the device concurrently you risk even lowering your performance.
Perhaps you can just use PowerShell to concatenate the files, such as in this answer.
Another alternative is to write a program that uses the FileSystemWatcher class to watch for new files and append them as they are created.
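A minimal sketch of that idea, assuming the logs arrive in one watched folder and that simply appending each new file to the aggregate is acceptable (the paths are placeholders):
using System;
using System.IO;

class LogAppender
{
    static void Main()
    {
        var sourceFolder = @"C:\SourceFolder"; // hypothetical source folder
        var targetFile = @"C:\AllLogs.log";    // hypothetical aggregate file

        using var watcher = new FileSystemWatcher(sourceFolder, "*.log");
        watcher.Created += (sender, e) =>
        {
            // Append the new file to the aggregate. A real implementation would
            // need retry logic, since the file may still be locked by its writer
            // when the Created event fires.
            File.AppendAllText(targetFile, File.ReadAllText(e.FullPath));
        };
        watcher.EnableRaisingEvents = true;

        Console.WriteLine("Watching for new log files. Press Enter to exit.");
        Console.ReadLine();
    }
}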

Tasks with number of child tasks

The scenario is something like this: I have 4 specific URLs in hand, each URL's page contains many links to web pages, and I need to extract some information from those web pages. I'm planning to use nested tasks to do this job, multiple tasks inside one task, something like below.
var t1Actions = new List<Action>();
var t1 = Task.Factory.StartNew(() =>
{
foreach (var action in t1Actions)
{
Task.Factory.StartNew(action, TaskCreationOptions.AttachedToParent);
}
});
var t2Actions = new List<Action>();
var t2 = Task.Factory.StartNew(() =>
{
foreach (var action in t2Actions)
{
Task.Factory.StartNew(action, TaskCreationOptions.AttachedToParent);
}
});
var t3Actions = new List<Action>();
var t3 = Task.Factory.StartNew(() =>
{
foreach (var action in t3Actions)
{
Task.Factory.StartNew(action, TaskCreationOptions.AttachedToParent);
}
});
var t4Actions = new List<Action>();
var t4 = Task.Factory.StartNew(() =>
{
foreach (var action in t4Actions)
{
Task.Factory.StartNew(action, TaskCreationOptions.AttachedToParent);
}
});
Task.WhenAll(t1, t2, t3, t4);
Here are my questions:
Is this a good way to do jobs like the one I mentioned above?
Which is more efficient: replacing the child tasks with Parallel.Invoke(action), or leaving it as it is?
How should I notify (for example, the UI) when a nested task completes? Do I have control over nested tasks?
Any advice will be helpful.
The actual problem isn't how to handle child tasks. It's how to get a list of URLs from some directory pages, retrieve those pages and process them.
This can be done easily using .NET's Dataflow library. Each step can be implemented as a block that reads one URL and produces an output.
The first block can be a TransformManyBlock that accepts one page URL and returns a list of page URLs.
The second block can be a TransformBlock that accepts a single page URL and returns its contents.
The third block can be an ActionBlock that accepts the page and does whatever is needed with it.
For example:
var listBlock = new TransformManyBlock<Uri,Uri>(async uri=>
{
var content=await httpClient.GetStringAsync(uri);
var uris = ProcessThePage(content);
return uris;
});
var downloadBlock = new TransformBlock<Uri,(Uri,string)>(async uri=>
{
var content=await httpClient.GetStringAsync(uri);
return (uri,content);
});
var processingBlock = new ActionBlock<(Uri uri,string content)>(async msg=>
{
//Do something
var pathFromUri = UriToFilePath(msg.uri); // map the Uri to a local file path (helper assumed, not shown)
File.WriteAllText(pathFromUri,msg.content);
});
var linkOptions=new DataflowLinkOptions{PropagateCompletion=true};
listBlock.LinkTo(downloadBlock,linkOptions);
downloadBlock.LinkTo(processingBlock,linkOptions);
Each block runs using its own Task. You can specify that a block may use more than one task, e.g. to download multiple pages concurrently.
Each block has an input and an output buffer. You can specify a limit on the input buffer to avoid flooding a block with too many messages to process. If a block reaches the limit, upstream blocks will pause. This way you can, for example, prevent the downloadBlock from flooding a slow processingBlock with thousands of pages.
Once you have a pipeline, you can post messages to the first block. When you're done, you can tell the block to Complete(). Each block in the pipeline will finish processing messages in its input buffer and propagate the completion call to the next linked block.
You can wait for all messages to finish by awaiting the last block's Completion task.
var directoryPages=new Uri[]{..};
foreach(var uri in directoryPages)
{
listBlock.Post(uri);
}
listBlock.Complete();
await processingBlock.Completion;
The ExecutionDataflowBlockOptions can be used to specify the use of multiple tasks and the input buffer limits, e.g.:
var options=new ExecutionDataflowBlockOptions
{
BoundedCapacity=10,
MaxDegreeOfParallelism=4,
};
var downloadBlock = new TransformBlock<Uri,(Uri,string)>(...,options);
This means that downloadBlock will accept up to 10 URIs before signalling the listBlock to pause, and it will process up to 4 URIs concurrently.
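When the first block itself is bounded like this, it's also worth feeding it with SendAsync rather than Post, so the producer waits asynchronously for space instead of having items rejected. A minimal sketch, reusing the listBlock and processingBlock names from above:
// With a BoundedCapacity set, Post would return false once the buffer is full,
// so await SendAsync to apply backpressure to the producer instead.
foreach (var uri in directoryPages)
{
    await listBlock.SendAsync(uri); // completes only when the block has room
}
listBlock.Complete();
await processingBlock.Completion;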

Throttling events based on previous occurrence

I have this test method
[Fact]
public async Task MethodName(){
var strings = new BufferBlock<string>();
var ints = new BufferBlock<int>();
ConcurrentStack<Tuple<string,int> > concurrentStack=new ConcurrentStack<Tuple<string, int>>();
var transformBlock = new TransformBlock<Tuple<string,int>,string>(async tuple => {
var formattableString = $"{tuple.Item2}-{tuple.Item1}-{Environment.CurrentManagedThreadId}";
if (concurrentStack.TryPeek(out var result) && Equals(result, tuple)){
WriteLine($"Await -> {formattableString}");
await Task.Delay(1000);
}
concurrentStack.Push(tuple);
await DoAsync(tuple);
WriteLine(formattableString);
return formattableString;
});
var joinBlock = new JoinBlock<string,int>();
var dataflowLinkOptions = new DataflowLinkOptions(){PropagateCompletion = true};
joinBlock.LinkTo(transformBlock,dataflowLinkOptions);
strings.LinkTo(joinBlock.Target1,dataflowLinkOptions);
ints.LinkTo(joinBlock.Target2,dataflowLinkOptions);
strings.Post("a");
strings.Post("a");
strings.Post("b");
strings.Post("b");
ints.Post(1);
ints.Post(1);
ints.Post(2);
ints.Post(1);
strings.Complete();
transformBlock.LinkTo(DataflowBlock.NullTarget<string>());
await transformBlock.Completion;
}
which outputs
15:36:53.2369|1-a-28
15:36:53.2369|Await -> 1-a-28
15:36:54.2479|1-a-28
15:36:54.2479|2-b-38
15:36:54.2479|1-b-38
I am looking to throttle this operation so that if the previous item is the same as the current one, it waits for a certain amount of time. I would like to use this flow in a parallel scenario, and my guess is that Rx extensions could be the solution here, as I do not feel my approach is correct, although it seems to produce the correct results.
UPDATE - CONCRETE EXAMPLE
In the comments @Enigmativity requested examples of my sources and transforms, so I think the real implementation is the best fit for this.
_requestHandlerDatas = new BufferBlock<IRequestHandlerData>(executionOption);
_webProxies = new TransformBlock<WebProxy, PingedWebProxy>(async webProxy => {
var isOnline =await requestDataflowData.Pinger.IsOnlineAsync(webProxy.Address).ConfigureAwait(false);
return new PingedWebProxy(webProxy.Address, isOnline);
}, executionOption);
var joinBlock = new JoinBlock<IRequestHandlerData, WebProxy>(new GroupingDataflowBlockOptions { Greedy = false });
_transformBlock = new TransformBlock<Tuple<IRequestHandlerData, WebProxy>, TOut>(async tuple => {
await requestDataflowData.LimitRateAsync(tuple.Item1, tuple.Item2).ConfigureAwait(false);
var @out = await requestDataflowData.GetResponseAsync<TOut>(tuple.Item1, tuple.Item2).ConfigureAwait(false);
_webProxies.Post(tuple.Item2);
_requestHandlerDatas.Post(tuple.Item1);
return @out;
}, executionOption);
var dataflowLinkOptions = new DataflowLinkOptions(){PropagateCompletion = true};
joinBlock.LinkTo(_transformBlock,dataflowLinkOptions,tuple => ((PingedWebProxy)tuple.Item2).IsOnline);
joinBlock.LinkTo(DataflowBlock.NullTarget<Tuple<IRequestHandlerData, WebProxy>>(),dataflowLinkOptions,tuple =>
!((PingedWebProxy) tuple.Item2).IsOnline);
_requestHandlerDatas.LinkTo(joinBlock.Target1,dataflowLinkOptions);
_webProxies.LinkTo(joinBlock.Target2,dataflowLinkOptions);
So, first I post data to _webProxies, which verifies that I have a working proxy; if so, the proxy is joined with one of my _requestHandlerDatas in the _transformBlock, which uses them to call another service. When that call returns, I release both used parameters (proxy, handlerData) back to their pools and output the TOut for further processing by another block. My goal is to delay the webservice calls coming from the same proxy/handlerData combination.
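For reference, a per-combination delay of the kind LimitRateAsync appears to provide could be sketched roughly like this. This is purely illustrative: the real LimitRateAsync is not shown in the question, and the dictionary, the key shape and the 1-minute interval are assumptions (requires System.Collections.Concurrent).
// Illustrative sketch only: remember when each (handlerData, proxy) pair was last
// used and delay the call if the pair is being reused too soon.
private readonly ConcurrentDictionary<(IRequestHandlerData, WebProxy), DateTime> _lastUse =
    new ConcurrentDictionary<(IRequestHandlerData, WebProxy), DateTime>();

private async Task LimitRateAsync(IRequestHandlerData handlerData, WebProxy proxy)
{
    var key = (handlerData, proxy);
    if (_lastUse.TryGetValue(key, out var last))
    {
        var wait = TimeSpan.FromMinutes(1) - (DateTime.UtcNow - last); // assumed interval
        if (wait > TimeSpan.Zero)
            await Task.Delay(wait).ConfigureAwait(false);
    }
    _lastUse[key] = DateTime.UtcNow;
}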

extracting zips, parsing files and flattening out to CSV

I'm trying to maximize the performance of the following task:
Enumerate directory of zip files
Extract zips in memory looking for .json files (handling nested zips)
Parse the json files
Write properties from json file into an aggregated .CSV file
The TPL layout I was going for was:
producer -> parser block -> batch block -> csv writer block
The idea is that a single producer extracts the zips and finds the json files, then sends the text to the parser block, which runs in parallel (multiple consumers). The batch block groups items into batches of 200, and the writer block dumps 200 rows to a CSV file on each call.
Questions:
The longer the jsonParseBlock TransformBlock takes, the more messages are dropped. How can I prevent this?
How could I better utilize TPL to maximize performance?
class Item
{
public string ID { get; set; }
public string Name { get; set; }
}
class Demo
{
const string OUT_FILE = @"c:\temp\tplflat.csv";
const string DATA_DIR = @"c:\temp\tpldata";
static ExecutionDataflowBlockOptions parseOpts = new ExecutionDataflowBlockOptions() { SingleProducerConstrained=true, MaxDegreeOfParallelism = 8, BoundedCapacity = 100 };
static ExecutionDataflowBlockOptions writeOpts = new ExecutionDataflowBlockOptions() { BoundedCapacity = 100 };
public static void Run()
{
Console.WriteLine($"{Environment.ProcessorCount} processors available");
_InitTest(); // reset csv file, generate test data if needed
// start TPL stuff
var sw = Stopwatch.StartNew();
// transformer
var jsonParseBlock = new TransformBlock<string, Item>(rawstr =>
{
var item = Newtonsoft.Json.JsonConvert.DeserializeObject<Item>(rawstr);
System.Threading.Thread.Sleep(15); // the more sleep here, the more messages lost
return item;
}, parseOpts);
// batch block
var jsonBatchBlock = new BatchBlock<Item>(200);
// writer block
var flatWriterBlock = new ActionBlock<Item[]>(items =>
{
//Console.WriteLine($"writing {items.Length} to csv");
StringBuilder sb = new StringBuilder();
foreach (var item in items)
{
sb.AppendLine($"{item.ID},{item.Name}");
}
File.AppendAllText(OUT_FILE, sb.ToString());
});
jsonParseBlock.LinkTo(jsonBatchBlock, new DataflowLinkOptions { PropagateCompletion = true });
jsonBatchBlock.LinkTo(flatWriterBlock, new DataflowLinkOptions { PropagateCompletion = true });
// start doing the work
var crawlerTask = GetJsons(DATA_DIR, jsonParseBlock);
crawlerTask.Wait();
flatWriterBlock.Completion.Wait();
Console.WriteLine($"ALERT: tplflat.csv row count should match the test data");
Console.WriteLine($"Completed in {sw.ElapsedMilliseconds / 1000.0} secs");
}
static async Task GetJsons(string filepath, ITargetBlock<string> queue)
{
int count = 1;
foreach (var zip in Directory.EnumerateFiles(filepath, "*.zip"))
{
Console.WriteLine($"working on zip #{count++}");
var zipStream = new FileStream(zip, FileMode.Open);
await ExtractJsonsInMemory(zip, zipStream, queue);
}
queue.Complete();
}
static async Task ExtractJsonsInMemory(string filename, Stream stream, ITargetBlock<string> queue)
{
ZipArchive archive = new ZipArchive(stream);
foreach (ZipArchiveEntry entry in archive.Entries)
{
if (entry.Name.EndsWith(".json", StringComparison.OrdinalIgnoreCase))
{
using (TextReader reader = new StreamReader(entry.Open(), Encoding.UTF8))
{
var jsonText = reader.ReadToEnd();
await queue.SendAsync(jsonText);
}
}
else if (entry.Name.EndsWith(".zip", StringComparison.OrdinalIgnoreCase))
{
await ExtractJsonsInMemory(entry.FullName, entry.Open(), queue);
}
}
}
}
Update1
I've added async, but it is not clear to me how to wait for all the dataflow blocks to complete (I'm new to C#, async and TPL). I basically want to say, "keep running until all of the queues/blocks are empty". I've added the following 'wait' code, and it appears to be working.
// wait for crawler to finish
crawlerTask.Wait();
// wait for the last block
flatWriterBlock.Completion.Wait();
In short, you're posting and ignoring the return value. You've got two options: add an unbounded BufferBlock to hold all your incoming data, or await SendAsync, which will prevent any messages from being dropped.
static async Task ExtractJsonsInMemory(string filename, Stream stream, ITargetBlock<string> queue)
{
var archive = new ZipArchive(stream);
foreach (ZipArchiveEntry entry in archive.Entries)
{
if (entry.Name.EndsWith(".json", StringComparison.OrdinalIgnoreCase))
{
using (var reader = new StreamReader(entry.Open(), Encoding.UTF8))
{
var jsonText = reader.ReadToEnd();
await queue.SendAsync(jsonText);
}
}
else if (entry.Name.EndsWith(".zip", StringComparison.OrdinalIgnoreCase))
{
await ExtractJsonsInMemory(entry.FullName, entry.Open(), queue);
}
}
}
You'll need to pull the async all the way back up, but this should get you started.
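Pulling the async up could look roughly like this (a sketch that assumes the blocks are built and linked exactly as in the question's Run method, and that GetJsons now awaits SendAsync as shown above):
// Sketch: an async entry point that feeds the pipeline and then awaits the final
// block, instead of blocking with .Wait().
static async Task RunPipelineAsync(string dataDir,
    ITargetBlock<string> jsonParseBlock, IDataflowBlock flatWriterBlock)
{
    await GetJsons(dataDir, jsonParseBlock); // producer finishes and calls Complete()
    await flatWriterBlock.Completion;        // done once the writer has drained
}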
From MSDN, about the DataflowBlock.Post<TInput> method:
Return Value
Type: System.Boolean
true if the item was accepted by the target block; otherwise, false.
So, the problem here is that you're sending your messages without checking whether the pipeline can accept another one or not. This is happening because of your options for the blocks:
new ExecutionDataflowBlockOptions() { BoundedCapacity = 100 }
and this line:
// this line isn't waiting for long operations and simply drops the message as it can't be accepted by the target block
queue.Post(jsonText);
Here you're saying that new items should be postponed once the input queue reaches a length of 100. In this case both MSDN and @StephenCleary, in his Introduction to Dataflow series, suggest a simple solution:
However, it’s possible to throttle a block by limiting its buffer size; in this case, you could use SendAsync to (asynchronously) wait for space to be available and then place the data into the block’s input buffer.
So, as @JSteward already suggested, you can introduce an unbounded buffer between your workers to avoid dropping messages; this is a general practice for such cases, as checking the result of the Post method and retrying could block the producer thread for a long time.
The second part of the question, about performance, is to use an async-oriented solution (which fits perfectly with the use of SendAsync), as you are doing I/O operations all the time. An asynchronous operation is basically a way to tell the program "start doing this and notify me when it's done". As there is no thread blocked during such operations, you gain by freeing up the thread pool for the other operations in your pipeline.
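Applied to the extractor, that mainly means awaiting the stream read as well; a small sketch of the relevant part of ExtractJsonsInMemory:
// Sketch: read each entry asynchronously so no thread-pool thread sits blocked
// on I/O while the json text is being read.
using (var reader = new StreamReader(entry.Open(), Encoding.UTF8))
{
    var jsonText = await reader.ReadToEndAsync();
    await queue.SendAsync(jsonText);
}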
PS: @JSteward has already provided good sample code for this approach.

Execute 4 tasks simultaneously, auto-starting another as and when each completes

I have a list of 100 urls. I need to fetch the html content of those urls. Let's say I don't use the async version of DownloadString and instead do the following.
var task1 = Task.Factory.StartNew(() => new WebClient().DownloadString("url1"));
What I want to achieve is to get the html string for at most 4 urls at a time.
I start 4 tasks for the first four urls. Assume the 2nd url completes; I want to immediately start a 5th task for the 5th url. And so on. This way at most 4 urls will be downloading at any time, and for all practical purposes there will always be 4 urls being downloaded until all 100 are processed.
I can't seem to visualize how I will actually achieve this. There must be an established pattern for doing this. Thoughts?
EDIT:
Following up on @Damien_The_Unbeliever's comment to use Parallel.ForEach, I wrote the following:
var urls = new List<string>();
var results = new Dictionary<string, string>();
var lockObj = new object();
Parallel.ForEach(urls,
new ParallelOptions { MaxDegreeOfParallelism = 4 },
url =>
{
var str = new WebClient().DownloadString(url);
lock (lockObj)
{
results[url] = str;
}
});
I think the above reads better than creating individual tasks and using a semaphore to limit concurrency. That said, having never used or worked with Parallel.ForEach, I am unsure if this correctly does what I need to do.
SemaphoreSlim sem = new SemaphoreSlim(4);
foreach (var url in urls)
{
sem.Wait();
Task.Factory.StartNew(() => new WebClient().DownloadString(url))
.ContinueWith(t => sem.Release());
}
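A variation of the same idea that never blocks the posting thread uses WaitAsync and releases the slot in a finally block, so a failed download cannot leak a permit (a sketch; it assumes it runs inside an async method and that urls is the list from the question):
// Sketch: throttle to 4 concurrent downloads without blocking the caller.
var sem = new SemaphoreSlim(4);
var tasks = urls.Select(async url =>
{
    await sem.WaitAsync();
    try
    {
        return await Task.Run(() => new WebClient().DownloadString(url));
    }
    finally
    {
        sem.Release();
    }
});
string[] results = await Task.WhenAll(tasks);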
Actually, Task.WaitAny is much better for what you're trying to achieve than ContinueWith:
int tasksPerformedCount = 0;
Task[] tasks = // initial 4 tasks
while (tasksPerformedCount < 100)
{
//returns the index of the first task to complete, as soon as it completes
int index = Task.WaitAny(tasks);
tasksPerformedCount++;
//replace it with a new one
tasks[index] = //new task
}
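Filled in for the URL scenario, that loop could look something like this (a sketch; it assumes urls is the list of 100 urls from the question):
// Sketch: keep exactly 4 downloads in flight; as soon as one finishes, start
// a task for the next url.
var tasks = urls.Take(4)
    .Select(u => Task.Factory.StartNew(() => new WebClient().DownloadString(u)))
    .ToArray();
int nextUrl = tasks.Length;
while (nextUrl < urls.Count)
{
    int index = Task.WaitAny(tasks);   // index of the first task to complete
    // tasks[index].Result holds the downloaded html, if you need it here
    var url = urls[nextUrl++];
    tasks[index] = Task.Factory.StartNew(() => new WebClient().DownloadString(url));
}
Task.WaitAll(tasks);                   // wait for the final batch to finish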
Edit:
Another example of Task.WaitAny from http://www.amazon.co.uk/Exam-Ref-70-483-Programming-In/dp/0735676828/ref=sr_1_1?ie=UTF8&qid=1378105711&sr=8-1&keywords=exam+ref+70-483+programming+in+c
namespace Chapter1
{
    public static class Program
    {
        public static void Main()
        {
            Task<int>[] tasks = new Task<int>[3];
            tasks[0] = Task.Run(() => { Thread.Sleep(2000); return 1; });
            tasks[1] = Task.Run(() => { Thread.Sleep(1000); return 2; });
            tasks[2] = Task.Run(() => { Thread.Sleep(3000); return 3; });
            while (tasks.Length > 0)
            {
                int i = Task.WaitAny(tasks);
                Task<int> completedTask = tasks[i];
                Console.WriteLine(completedTask.Result);
                var temp = tasks.ToList();
                temp.RemoveAt(i);
                tasks = temp.ToArray();
            }
        }
    }
}
