TPL Dataflow Speedup? - c#

I wonder whether the following code can be optimized to execute faster. I currently seem to max out at around 1.4 million simple messages per second on a pretty simple dataflow structure. I am aware that this sample passes and transforms messages synchronously; I am testing TPL Dataflow as a possible replacement for my own custom solution based on Tasks and concurrent collections. I know the term "concurrent" already suggests I run things in parallel, but for testing purposes I pushed messages through my own solution synchronously and got about 5.1 million messages per second. What am I missing here? I read that TPL Dataflow is promoted as a high-throughput, low-latency solution, but so far I must be overlooking some performance tweaks. Could anyone point me in the right direction?
class TPLDataFlowExperiments
{
public TPLDataFlowExperiments()
{
var buf1 = new BufferBlock<int>();
var transform = new TransformBlock<int, string>(t =>
{
return "";
});
var action = new ActionBlock<string>(s =>
{
//Thread.Sleep(100);
//Console.WriteLine(s);
});
buf1.LinkTo(transform);
transform.LinkTo(action);
//Propagate all Completions down the flow
buf1.Completion.ContinueWith(t =>
{
transform.Complete();
transform.Completion.ContinueWith(u =>
{
action.Complete();
});
});
Stopwatch watch = new Stopwatch();
watch.Start();
int cap = 10000000;
for (int i = 0; i < cap; i++)
{
buf1.Post(i);
}
//Mark Buffer as Complete
buf1.Complete();
action.Completion.ContinueWith(t =>
{
watch.Stop();
Console.WriteLine("All Blocks finished processing");
Console.WriteLine("Units processed per second: " + cap * 1000L / watch.ElapsedMilliseconds);
});
Console.ReadLine();
}
}

I think this mostly comes down to one thing: your test is pretty much meaningless. All those blocks are supposed to do something, and to use multiple cores and asynchronous operations to do it.
Also, in your test, it's likely that a lot of time is spent on synchronization. With more realistic code, each block will take some time to execute, so there will be less contention, and the actual overhead will be smaller than what you measured.
But to actually answer your question: yes, you're overlooking some performance tweaks. Specifically, SingleProducerConstrained (an option on ExecutionDataflowBlockOptions), which means data structures with less locking can be used. If I use this on both blocks (the BufferBlock is completely useless here, you can safely remove it), the rate rises from about 3–4 million items per second to more than 5 million on my computer.
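A minimal sketch of that suggestion, assuming the System.Threading.Tasks.Dataflow package is referenced. The class and method names (SingleProducerSketch, Run) are made up for illustration, and PropagateCompletion is used here in place of the manual ContinueWith chain from the question:

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class SingleProducerSketch
{
    // Pushes `count` integers through TransformBlock -> ActionBlock and
    // returns how many items reached the final block.
    public static long Run(int count)
    {
        // SingleProducerConstrained is a promise that only one producer at a
        // time offers messages to the block, so it can use cheaper queues.
        var options = new ExecutionDataflowBlockOptions { SingleProducerConstrained = true };

        long processed = 0;
        var transform = new TransformBlock<int, string>(i => "", options);
        // The ActionBlock runs one delegate at a time by default,
        // so the plain increment is safe here.
        var action = new ActionBlock<string>(s => { processed++; }, options);

        // Completion flows down the link automatically.
        transform.LinkTo(action, new DataflowLinkOptions { PropagateCompletion = true });

        // The only producer is this loop, which satisfies the promise above.
        for (int i = 0; i < count; i++)
        {
            transform.Post(i);
        }

        transform.Complete();
        action.Completion.Wait();
        return processed;
    }
}
```

Calling SingleProducerSketch.Run(10000000) inside a Stopwatch would reproduce the shape of the original test without the BufferBlock. The numbers will differ from machine to machine.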

To add to svick's answer, the test uses only a single processing thread for a single action block. This way it tests nothing more than the overhead of using the blocks.
Dataflow works in a manner similar to F# agents, Scala actors and MPI implementations. Each action block executes a single task at a time, listening to input and producing output. Speedup is provided by breaking an algorithm into steps that can be executed independently on multiple cores, passing only messages to each other.
While you can increase the number of concurrent tasks, the most important issue is designing a flow that performs the maximum number of steps independently of the others.

You can also increase the degrees of parallelism for dataflow blocks. This may offer an additional speedup and can also help with load balancing between linear tasks if you find one of your blocks acts as a bottleneck to the rest.
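As a rough illustration, the degree of parallelism is set per block through ExecutionDataflowBlockOptions.MaxDegreeOfParallelism. The sketch below assumes the System.Threading.Tasks.Dataflow package. The names ParallelBlockSketch and SumOfSquares and the squaring workload are placeholders, not anything from the question:

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class ParallelBlockSketch
{
    // Squares 1..count on up to `degree` concurrent workers and returns the sum.
    public static long SumOfSquares(int count, int degree)
    {
        // Only the (hypothetically expensive) transform step is parallelised.
        var transform = new TransformBlock<int, long>(
            i => (long)i * i,
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = degree });

        // The sink keeps the default degree of 1, so no locking is needed.
        long sum = 0;
        var sink = new ActionBlock<long>(v => { sum += v; });

        transform.LinkTo(sink, new DataflowLinkOptions { PropagateCompletion = true });

        for (int i = 1; i <= count; i++)
        {
            transform.Post(i);
        }

        transform.Complete();
        sink.Completion.Wait();
        return sum;
    }
}
```

Raising the degree only on the block that is the bottleneck keeps the cheaper stages simple. With a delegate as trivial as the one above, the extra coordination will likely cost more than it saves.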

If your workload is so granular that you expect to process millions of messages per second, then passing individual messages through the pipeline is no longer viable because of the associated overhead. You'll need to chunkify the workload by batching the messages into arrays or lists. For example:
var transform = new TransformBlock<int[], string[]>(batch =>
{
var results = new string[batch.Length];
for (int i = 0; i < batch.Length; i++)
{
results[i] = ProcessItem(batch[i]);
}
return results;
});
For batching your input you could use a BatchBlock, the "linqy" Buffer extension method from the System.Interactive package, the functionally similar Batch method from the MoreLinq package, or do it manually.
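The BatchBlock option could look roughly like this, assuming the System.Threading.Tasks.Dataflow package. The names BatchingSketch and Run are invented for the example, and the consumer only records batch sizes so that the grouping is visible:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class BatchingSketch
{
    // Posts 0..count-1 into a BatchBlock and returns the size of each batch
    // that reached the consumer, in arrival order.
    public static int[] Run(int count, int batchSize)
    {
        var batcher = new BatchBlock<int>(batchSize);

        // The consumer receives int[] batches instead of single items.
        var sizes = new List<int>();
        var consumer = new ActionBlock<int[]>(batch => sizes.Add(batch.Length));

        batcher.LinkTo(consumer, new DataflowLinkOptions { PropagateCompletion = true });

        for (int i = 0; i < count; i++)
        {
            batcher.Post(i);
        }

        // Complete() flushes any remaining items as a final, smaller batch.
        batcher.Complete();
        consumer.Completion.Wait();
        return sizes.ToArray();
    }
}
```

For example, Run(10, 4) delivers two full batches and one partial batch of two items.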

Related

BlockingCollection bounded capacity performance degradation

I've got this C# process (.net 5.0) that reads from a zip file, deserializes the json to an object, and then transforms the json objects to DataTables for storage into a Sql Server database. After a lot of testing and optimization, I got these three phases to be very nearly identical in processing time (via Stopwatch measurements).
I thought I could improve throughput by having separate threads for each phase, but when I tried running it, the BlockingCollection<T> performance went down the tubes pretty quickly. I bounded the two queues to keep any one phase from getting too far off pace with any other, but after a short while, I got this very gap-toothed performance - spurts of activity with long periods of cpu quiescence.
Did I find some kind of degenerate case? Does BlockingCollection<T> have a lot of overhead relating to the boundedCapacity?
The implementation looked like this:
var readingQueue = new BlockingCollection<string>(1000);
var objectQueue = new BlockingCollection<JsonObject>(1000);
var phases = new Task[3];
phases[0] = Task.Run(() =>
{
for (;;)
{
var l = reader.ReadLine();
readingQueue.Add(l);
if (l == null)
break;
}
});
phases[1] = Task.Run(() =>
{
for (;;)
{
var json = readingQueue.Take();
if (json == null)
{
objectQueue.Add(null);
break;
}
var o = Deserializer.Deserialize<JsonObject>(json);
objectQueue.Add(o);
}
});
phases[2] = Task.Run(() =>
{
for (;;)
{
var o = objectQueue.Take();
if (o == null)
break;
TransformJsonObject(set, o);
}
});
Task.WaitAll(phases);
Dropping the BlockingCollections entirely and just using Task.Run(() => reader.ReadLine()) for the I/O produces a benefit, but parallelizing all three phases with BlockingCollection<T> goes south pretty fast.
EDIT:
I tried dropping to two threads and moving the work around, but whenever there was a BlockingCollection involved it got worse than the single threaded performance and the memory consumption went through the roof.
The version that worked best was
var nextLine = Task.Run(() => reader.ReadLine());
for (;;)
{
var json = nextLine.Result;
nextLine = Task.Run(() => reader.ReadLine());
if (json == null)
break;
var o = Deserializer.Deserialize<JsonObject>(json);
TransformJsonObject(set, o);
}
The timings with that version were
Total time spent Reading: 8388140 ms, Deserializing: 8870633 ms, Transform: 9240809 ms, Writing to db: 10231972 (separate queue)
but the middle 2 were synchronous. I noticed there was a slight weight on the last step, so I tried putting read and deserialize in one thread and transform to DataSet on another, and the performance was still way below the above.
That's over about 22 million lines/objects.
EDIT: to move some of the comment discussion into the main section, I was given this program to maintain. We get daily dumps of largeish zip files. The program starts up a configurable number of threads to process the zip files (currently set at 5). Originally, each thread did the read/deserialize/transform to DataSet/write DataSet to Sql Server steps synchronously.
The first thing I did was to add a "write to db" thread/queue, and that worked well.
Then I started improving the times of the read/deserialize/transform steps... Cleaning up code, swapping one deserializer for another, etc. The timings for each of those phases were getting near identical, so I thought I'd parallelize further to try and improve the speed.
Now each of the zip file threads had one BlockingCollection for each line from the jsonl file, and one for the deserialized objects. Each thread fires up Tasks for the reading and the deserialization. The main file processing thread pulled from the deserialized object collection, did the transforms, and put the result on the db writing queue.
At that level of parallelization, the process ended up taking more than twice as long. I did a minidump of the process, and I found each threads' BlockingCollections completely empty, the db writing queue empty, and almost 5 gig of ram in use somewhere.
The individual phase stats (like the time spent on file I/O and deserializing the objects) were double what they were when the 5 file processing threads simply ran the read/deserialize/transform steps synchronously. That's the part that puzzled me: running these things in parallel takes longer, uses a bunch of phantom RAM, and leaves all the queues empty, compared to doing 3 of the 4 steps synchronously.
I did find Stack Overflow assertions that bounded BlockingCollections would sometimes wedge when they hit their bounds, but not a lot of detail as to why.
BlockingCollection will perform poorly if all the collections are always simultaneously at their bounded capacity, resulting in performance that mimics a single-threaded implementation but with the blocking overhead. The thread pool will also perform sub-optimally if Tasks have a lot of blocking in them. It might be worth exploring the TryTake and TryAdd methods of the collections and allowing the idle tasks to yield.
for (;;)
{
if (!collection.TryTake(out item))
{
//there's nothing to do, so we'll just chill out
await Task.Delay(/*whatever interval makes sense*/);
continue;
}
}
It's also worth noting that if you can tune the processing to work well in your current environment, that won't necessarily translate to your target environment, so you may find yourself constantly tweaking the workload between the tasks to get acceptable performance.
Since this is a pipeline of operations, you'll probably have better luck with BufferBlock and other parts of the TPL. They also have the advantage of being async/await compatible, so there's less blocking in general, and they support the same bounding limits as a BlockingCollection.
Here's a link to a tutorial that demonstrates the basics. BufferBlock allows chaining the blocks together and managing the pipeline as a unit, and it supports cancellation.
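A sketch of what such a bounded, await-friendly pipeline might look like, assuming the System.Threading.Tasks.Dataflow package. The stages are stand-ins: taking a string's length replaces the real deserialization, and summing replaces the real transform/store step. BoundedPipelineSketch and RunAsync are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class BoundedPipelineSketch
{
    // Feeds lines through a two-stage bounded pipeline and returns the
    // total produced by the last stage.
    public static async Task<int> RunAsync(IEnumerable<string> lines)
    {
        // BoundedCapacity plays the role of BlockingCollection's boundedCapacity.
        var bounded = new ExecutionDataflowBlockOptions { BoundedCapacity = 1000 };
        var link = new DataflowLinkOptions { PropagateCompletion = true };

        // Stand-in for the deserialization phase.
        var parse = new TransformBlock<string, int>(line => line.Length, bounded);

        // Stand-in for the transform/store phase (runs one item at a time).
        int total = 0;
        var store = new ActionBlock<int>(n => { total += n; }, bounded);

        parse.LinkTo(store, link);

        foreach (var line in lines)
        {
            // SendAsync waits without blocking a thread while the block is full;
            // Post would simply return false in that situation.
            await parse.SendAsync(line);
        }

        parse.Complete();
        await store.Completion;
        return total;
    }
}
```

The reading phase is the foreach loop here. When a stage is full, the producer waits asynchronously and no thread pool thread sits blocked.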
EDIT: If this is a long running operation, like your stats suggest, then you could benefit from using full Threads.

Correctly parallelising lots of little tasks within a method using C#.NET

I'm implementing image processing algorithms in C# using .NET Framework 4.7.2 and need to decrease the computation time. Overall the code is sequential, but there are quite a few methods with parameters that do not depend on each other. For example, it might be something like this:
public void Algorithm(Object x, Object y) {
x = Filter(x);
x = Morphology(x);
y = Filter(y);
y = Morphology(y);
var z = Add(x,y);
//Similar pattern of separate operation that are then combined.
}
These functions generally take around 100ms to 500ms. They can be parallelised, and my approach has been something like this:
public void Algorithm(Object x, Object y) {
var xTask = Task.Run(() => {
x = Filter(x);
x = Morphology(x);
});
var yTask = Task.Run(() => {
y = Filter(y);
y = Morphology(y);
});
Task.WaitAll(xTask, yTask);
var z = Add(x,y);
}
It seems to work; a similar bit of code runs approximately twice as fast. (Note that the whole thing is wrapped in another Task.Run in the topmost-level function, which is why I'm not awaiting here.)
Question: Is this a valid approach, or is there another method for parallelising lots of little method calls that is more safe or efficient?
Update: This is not for parallelising processing a batch of images. It is about processing a single image as quick as possible.
This is valid enough - if you can process your workload in parallel then you should. You just need to be very aware of WHEN your workload can and should be parallel - and when it needs to be performed in order.
You also need to consider the cost of creating a new task, versus the benefits of doing so (i.e. sometimes avoid very small, very fast tasks).
I would strongly recommend you create additional methods and collections for managing your tasks: tracking when they complete, handling lots of separate sets running in parallel, avoiding locking, managing shared memory/variables, etc. For example, are you only ever processing one image at a time, or can you start processing the next one if you have cores available?
You need to be very careful with Task.WaitAll() - obviously you need to draw all your work together at some point, but be careful not to lock or block other work.
There are lots of articles out there on the various patterns you can use (pipelines sound like a good match here).
Here are a few starters:
https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/tpl-and-traditional-async-programming
https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/potential-pitfalls-in-data-and-task-parallelism
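One variation on the pattern from the question, offered as a sketch and not as the only valid shape: each branch returns its result instead of reassigning captured variables, and the join uses Task.WhenAll so that no thread blocks while waiting. Filter, Morphology and Add below are trivial stand-ins operating on int arrays, purely so the example is self-contained:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class BranchSketch
{
    // Hypothetical stand-ins for the real image operations.
    static int[] Filter(int[] img) => img.Select(p => p + 1).ToArray();
    static int[] Morphology(int[] img) => img.Select(p => p * 2).ToArray();
    static int[] Add(int[] a, int[] b) => a.Zip(b, (p, q) => p + q).ToArray();

    public static async Task<int[]> AlgorithmAsync(int[] x, int[] y)
    {
        // The two branches share no state, so they can run concurrently.
        Task<int[]> xTask = Task.Run(() => Morphology(Filter(x)));
        Task<int[]> yTask = Task.Run(() => Morphology(Filter(y)));

        // Join point: results come back in the order the tasks were passed.
        int[][] results = await Task.WhenAll(xTask, yTask);
        return Add(results[0], results[1]);
    }
}
```

If the caller cannot be made async, blocking on the returned task at the top level gives the same behaviour as the original Task.WaitAll.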

Async slower than Sync

I have been working on async calls and I found that the async version of a method runs much slower than the sync version. Can anyone comment on what I may be missing? Thanks.
Statistics
Sync method time is 00:00:23.5673480
Async method time is 00:01:07.1628415
Total Records/Entries returned per call = 19972
Below is the code that I am running.
-------------------- Test class ----------------------
[TestMethod]
public void TestPeoplePerformanceSyncVsAsync()
{
DateTime start;
DateTime end;
start = DateTime.Now;
for (int i = 0; i < 10; i++)
{
using (IPersonRepository repository = kernel.Get<IPersonRepository>())
{
IList<IPerson> people1 = repository.GetPeople();
IList<IPerson> people2 = repository.GetPeople();
}
}
end = DateTime.Now;
var diff = end - start;
Console.WriteLine(diff);
start = DateTime.Now;
for (int i = 0; i < 10; i++)
{
using (IPersonRepository repository = kernel.Get<IPersonRepository>())
{
Task<IList<IPerson>> people1 = GetPeopleAsync();
Task<IList<IPerson>> people2 = GetPeopleAsync();
Task.WaitAll(new Task[] {people1, people2});
}
}
end = DateTime.Now;
diff = end - start;
Console.WriteLine(diff);
}
private async Task<IList<IPerson>> GetPeopleAsync()
{
using (IPersonRepository repository = kernel.Get<IPersonRepository>())
{
return await repository.GetPeopleAsync();
}
}
-------------------------- Repository ----------------------------
public IList<IPerson> GetPeople()
{
List<IPerson> people = new List<IPerson>();
using (PersonContext context = new PersonContext())
{
people.AddRange(context.People);
}
return people;
}
public async Task<IList<IPerson>> GetPeopleAsync()
{
List<IPerson> people = new List<IPerson>();
using (PersonContext context = new PersonContext())
{
people.AddRange(await context.People.ToListAsync());
}
return people;
}
So we've got a whole bunch of issues here, so I'll just say right off the bat that this isn't going to be an exhaustive list.
First off, the point of asynchrony is not strictly to improve performance. It can be, in certain contexts, used to improve performance, but that's not necessarily its goal. It can also be used to keep a UI responsive, for example. Parallelization is usually used to increase performance, but parallelization and asynchrony aren't equivalent. On top of that, parallelization has an overhead. You're spending time creating threads, scheduling them, synchronizing data between them, etc. The benefit of performing some operations in parallel may or may not surpass this overhead. If it doesn't, a synchronous solution may well be more performant.
Next, your "asynchronous" example isn't asynchronous "all the way up". You're calling WaitAll on the tasks inside the loop. For the example to be properly asynchronous one would like to see it be asynchronous all the way up to a single operation, namely some form of message loop.
Next, the two aren't doing the exact same thing in an asynchronous and a synchronous manner. They are doing different things, which will obviously affect performance:
Your "asynchronous" solution creates 3 repositories. Your synchronous solution creates one. There is going to be some overhead here.
GetPeopleAsync takes a list, then pulls all of the items out of the list and puts them into another list. That's unnecessary overhead.
Then there are problems with your benchmarking:
You're using DateTime.Now, which is not designed for timing how long an operation takes. Its precision isn't particularly high, for example. You should use a Stopwatch to time how long code takes.
You aren't performing all that many iterations. There's plenty of opportunity for the variation to affect the results here.
You aren't accounting for the fact that the first few runs through a section of code will take longer. The JITter needs to "warm up".
Garbage collections can be affecting your timings, namely that the objects created in the first test can end up being cleaned up during the second test.
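The benchmarking points above can be folded into a small helper. This is only a sketch with invented names (BenchmarkSketch, Measure) and arbitrary default counts. A dedicated benchmarking library would be more rigorous:

```csharp
using System;
using System.Diagnostics;

public static class BenchmarkSketch
{
    // Returns the average time per call of `action` over `iterations` runs,
    // after `warmup` untimed runs.
    public static TimeSpan Measure(Action action, int warmup = 3, int iterations = 20)
    {
        // Untimed warm-up runs so JIT compilation is not counted.
        for (int i = 0; i < warmup; i++)
        {
            action();
        }

        // Clean up garbage left by earlier code so it is not collected
        // in the middle of the timed section.
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        // Stopwatch uses a high-resolution timer where one is available.
        var watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        watch.Stop();

        return TimeSpan.FromTicks(watch.Elapsed.Ticks / iterations);
    }
}
```

Both the synchronous and the asynchronous variant would be passed through the same helper, for instance Measure(() => repository.GetPeople()) and Measure(() => repository.GetPeopleAsync().Wait()), so that the two numbers are at least comparable.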
It may depend on your data, or rather the amount of it. You didn't post what test metrics you're using to run your tests but this is my experience:
Usually when you see a slowdown in the performance of parallel algorithms when you're expecting improvement it's that the overhead of loading the extra libraries and spawning threads etc. slows down the parallel algorithm and makes it look like the linear/single-threaded version is performing better.
A greater amount of data should show better performance. Also try running the same test twice when all the libraries are loaded to avoid the load overhead.
If you don't see improvement, something is seriously wrong.
Note: You're getting voted down, I'm guessing, because you posted much more code than context, metrics etc. in the OP. IMO, very few SOers will actually bother to read and grok even that much code without being able to execute it while also being presented with metrics that are not at all useful!
Why I didn't read the code: When I see a code block with scroll bars along with the kind of text that was present in the original OP, my brain says: Don't bother. I think many if not most, probably do this.
Things to try:
Two different timings do not amount to statistically significant data. You should run each algorithm a number of times (5 at least) to see if you're experiencing anomalies. If your results for the same algorithms vary wildly then you may have other issues such as bandwidth restriction, server load etc. and the issue is external.
Try a .NET performance and/or memory profiler to help you track down the issue.
See @servy's great answer for more clues. It seems that he actually took the time to look at your code more closely.

WithDegreeOfParallelism(N>CPU count)

System.Threading.ThreadPool.SetMaxThreads(50, 50);
File.ReadLines().AsParallel().WithDegreeOfParallelism(100).ForAll((s) => {
/*
some code which is waiting external API call
and do not utilize CPU
*/
});
I have never seen the thread count exceed the CPU count on my system.
Can I use PLINQ and get more than one thread per CPU?
If you're calling an external web API, you might be hitting the limit on concurrent connections, which defaults to 2. At the beginning of your application, do the following:
System.Net.ServicePointManager.DefaultConnectionLimit = 4096;
System.Net.ServicePointManager.Expect100Continue = false;
See if that helps. If not, there might be some other bottleneck within the routine you're trying to parallelize.
Also, just as other responders said, the ThreadPool decides how many threads to spin up based on load. In my experience with the TPL, the thread count increases over time: the longer the app runs and the heavier the load gets, the more threads are spun up.
The ThreadPool, which PLINQ and the TPL use, applies a hill-climbing algorithm to determine its optimum size. I think that if you put a lot of I/O in your tasks, seeing more threads than the CPU count is likely.
That said, I've never seen more threads than the cpu count :) . But maybe I never had the right situation.
I tested this with the following code:
var lines = Enumerable.Range(0, 200).ToArray();
int currentThreads = 0;
int maxThreads = 0;
object l = new object();
lines.AsParallel().WithDegreeOfParallelism(100).ForAll(
s =>
{
lock (l)
{
currentThreads++;
if (currentThreads > maxThreads)
{
maxThreads = currentThreads;
Console.WriteLine(maxThreads);
}
}
Thread.Sleep(3000);
lock (l)
{
currentThreads--;
}
});
Console.WriteLine();
Console.WriteLine(maxThreads);
Basically, it records the current number of concurrently executing iterations and then saves the maximum encountered value.
The results vary quite a bit, between 15 and 25, but it's always much more than the number of CPUs my computer has (4). Increasing the sleep time increases the maximum number of concurrent threads. So it looks like the limiting factor here is the ThreadPool: it will create new threads slowly, especially when jobs are being completed relatively quickly.
If you want to increase the number of threads used, you would need to use SetMinThreads() (not SetMaxThreads()). If I set the minimum to 50, the number of threads actually used is around 60.
But having dozens of threads that do nothing but wait is quite inefficient, especially when it comes to memory consumption. You should consider using asynchronous methods instead.
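A sketch of the asynchronous alternative: the waiting is done with await instead of a blocked thread, and a SemaphoreSlim caps how many calls are in flight. Task.Delay stands in for the external API call, and the names ThrottledAsyncSketch and RunAsync are invented for the example:

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledAsyncSketch
{
    // Processes `itemCount` items with at most `maxConcurrent` in flight
    // and returns the highest concurrency level actually observed.
    public static async Task<int> RunAsync(int itemCount, int maxConcurrent)
    {
        var gate = new SemaphoreSlim(maxConcurrent);
        object sync = new object();
        int current = 0;
        int peak = 0;

        async Task ProcessAsync(int item)
        {
            // Waits asynchronously for a free slot; no thread is blocked.
            await gate.WaitAsync();
            try
            {
                lock (sync)
                {
                    current++;
                    if (current > peak) peak = current;
                }

                // Stand-in for the external call (an assumption, not a real API).
                await Task.Delay(50);

                lock (sync)
                {
                    current--;
                }
            }
            finally
            {
                gate.Release();
            }
        }

        await Task.WhenAll(Enumerable.Range(0, itemCount).Select(i => ProcessAsync(i)));
        return peak;
    }
}
```

With this shape, a hundred concurrent waits do not need a hundred threads, so neither SetMinThreads nor WithDegreeOfParallelism has to be tuned.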
PLINQ does not fit in this case.
I found the following article useful:
http://msdn.microsoft.com/en-us/library/hh228609(v=vs.110).aspx
Short answer: nope.
The amount of threading is simply up to the .Net Framework runtime. There is no developer control for controlling the number of threads for TPL (Task Parallel Library) usage.
EDIT
Thanks to some other feedback: it is actually possible--but not recommended--to manually control the number of threads in the ThreadPool, which PLINQ and TPL use.
It's my opinion that any parallelization problem needs to be carefully thought out, and carefully constructed and tested. There's a lot of subtlety in this.

When is Parallel.Invoke useful?

I'm just diving into learning about the Parallel class in the 4.0 Framework and am trying to understand when it would be useful. At first after reviewing some of the documentation I tried to execute two loops, one using Parallel.Invoke and one sequentially like so:
static void Main()
{
DateTime start = DateTime.Now;
Parallel.Invoke(BasicAction, BasicAction2);
DateTime end = DateTime.Now;
var parallel = end.Subtract(start).TotalSeconds;
start = DateTime.Now;
BasicAction();
BasicAction2();
end = DateTime.Now;
var sequential = end.Subtract(start).TotalSeconds;
Console.WriteLine("Parallel:{0}", parallel.ToString());
Console.WriteLine("Sequential:{0}", sequential.ToString());
Console.Read();
}
static void BasicAction()
{
for (int i = 0; i < 10000; i++)
{
Console.WriteLine("Method=BasicAction, Thread={0}, i={1}", Thread.CurrentThread.ManagedThreadId, i.ToString());
}
}
static void BasicAction2()
{
for (int i = 0; i < 10000; i++)
{
Console.WriteLine("Method=BasicAction2, Thread={0}, i={1}", Thread.CurrentThread.ManagedThreadId, i.ToString());
}
}
There is no noticeable difference in time of execution here, or am I missing the point? Is it more useful for asynchronous invocations of web services or...?
EDIT: I replaced DateTime with Stopwatch and replaced the writes to the console with a simple addition operation.
UPDATE - Big Time Difference Now: Thanks for clearing up the problems I had when I involved Console
static void Main()
{
Stopwatch s = new Stopwatch();
s.Start();
Parallel.Invoke(BasicAction, BasicAction2);
s.Stop();
var parallel = s.ElapsedMilliseconds;
s.Reset();
s.Start();
BasicAction();
BasicAction2();
s.Stop();
var sequential = s.ElapsedMilliseconds;
Console.WriteLine("Parallel:{0}", parallel.ToString());
Console.WriteLine("Sequential:{0}", sequential.ToString());
Console.Read();
}
static void BasicAction()
{
Thread.Sleep(100);
}
static void BasicAction2()
{
Thread.Sleep(100);
}
The test you are doing is nonsensical; you are testing to see if something that you can not perform in parallel is faster if you perform it in parallel.
Console.WriteLine handles synchronization for you, so it will always act as though it is running on a single thread.
From here:
...call the SetIn, SetOut, or SetError method, respectively. I/O
operations using these streams are synchronized, which means multiple
threads can read from, or write to, the streams.
Any advantage that the parallel version gains from running on multiple threads is lost through the marshaling done by the console. In fact I wouldn't be surprised to see that all the thread switching actually means that the parallel run would be slower.
Try doing something else in the actions (a simple Thread.Sleep would do) that can be processed by multiple threads concurrently and you should see a large difference in the run times. Large enough that the inaccuracy of using DateTime as your timing mechanism will not matter too much.
It's not a matter of time of execution. The output to the console is determined by how the actions are scheduled to run. To get an accurate time of execution, you should be using Stopwatch. At any rate, you are using Console.WriteLine, so it will appear as though it is in one thread of execution. Anything you have tried to attain by using Parallel.Invoke is lost by the nature of Console.WriteLine.
On something simple like that the run times will be the same. What Parallel.Invoke is doing is running the two methods at the same time.
In the first case you'll have lines spat out to the console in a mixed up order.
Method=BasicAction2, Thread=6, i=9776
Method=BasicAction, Thread=10, i=9985
// <snip>
Method=BasicAction, Thread=10, i=9999
Method=BasicAction2, Thread=6, i=9777
In the second case you'll have all the BasicAction's before the BasicAction2's.
What this shows you is that the two methods are running at the same time.
In the ideal case (if the number of delegates equals the number of parallel threads and there are enough CPU cores), the duration of the operation becomes MAX(AllDurations) instead of SUM(AllDurations), where AllDurations is the list of each delegate's execution time, such as {1 sec, 10 sec, 20 sec, 5 sec}. In a less ideal case it moves in this direction.
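The MAX versus SUM behaviour can be observed with a small experiment. In this sketch Thread.Sleep stands in for real work, and InvokeSketch and Run are made-up names. The measured time depends on the machine and on how quickly the thread pool supplies threads, so no particular figure is promised:

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class InvokeSketch
{
    // Runs one sleeping delegate per entry in `delaysMs` via Parallel.Invoke.
    // Returns the wall-clock time and the number of delegates that finished.
    public static (long elapsedMs, int completed) Run(params int[] delaysMs)
    {
        int completed = 0;

        // One delegate per requested duration.
        Action[] actions = delaysMs
            .Select(d => (Action)(() =>
            {
                Thread.Sleep(d);
                Interlocked.Increment(ref completed);
            }))
            .ToArray();

        var watch = Stopwatch.StartNew();
        // Blocks until every delegate has completed.
        Parallel.Invoke(actions);
        watch.Stop();

        return (watch.ElapsedMilliseconds, completed);
    }
}
```

With enough free cores, Run(100, 200, 300) should take roughly as long as the longest delegate and not the 600 ms that sequential execution needs.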
It's useful when you don't care about the order in which delegates are invoked, but you do care that thread execution blocks until every delegate has completed. So yes, it can be a situation where you need to gather data from various sources before you can proceed (they can be web services or other types of sources).
Parallel.For can be used much more often, I think. In this case it's pretty much required that you have different tasks and that each takes a substantial duration to execute, and I guess Invoke will shine the most if you have no idea of the possible range of execution times (which is true for web services).
Maybe your static constructor needs to build up two independent dictionaries for your type to use; you can invoke the methods that fill them in parallel using Invoke() and cut the time in half if, for example, they both take roughly the same time.
