I'm just now starting to learn multi-threading and I came across this question:
public class Program1
{
int variable;
bool variableValueHasBeenSet = false;
public void Func1()
{
variable = 1;
variableValueHasBeenSet = true;
}
public void Func2()
{
if (variableValueHasBeenSet) Console.WriteLine(variable);
}
}
the questions is: Determine all possible outputs (in console) for the following code snippet if Func1() and Func2() are run in parallel on two separate threads. The answer given is nothing, 1 or 0. the first two options are obvious but the third one surprised me so I wanted to try and get it, this is what I tried:
for (int i = 0; i < 100; i++)
{
var prog1 = new Program1();
List<Task> tasks = new List<Task>();
tasks.Add(new Task(() => prog1.Func2(), TaskCreationOptions.LongRunning));
tasks.Add(new Task(() => prog1.Func1(), TaskCreationOptions.LongRunning));
Parallel.ForEach(tasks, t => t.Start());
}
I couldn't get 0, only nothing and 1, so I was wondering what I'm doing wrong and how can I test this specific problem?
this is the explanation they provided for 0:
0 - This might seem impossible but this is a probable output and an interesting one. .Net runtime, C# and the CPU take the liberty of reordering instructions for optimization. So it is possible that variableValueHasBeenSet is set to true but the value of the variable is still zero. Another reason for such an output is caching. Thread2 might cache the value for the variable as 0 and will not see the updated value when Thread1 updates it in Func1. For a single threaded program this is not an issue as the ordering is guaranteed, but not so in multithreaded code. If the code at both the places is surrounded by locks, this problem can be mitigated. Another advanced way is to use memory barriers.
.Net runtime, C# and the CPU take the liberty of reordering
instructions for optimization.
This bit of information is very important, because there is no guarantee the reordering will happen at all.
The optimizer will often reorder the instructions, but usually this is triggered by code complexity and will typically only occur on a release build (the optimizer will look for dependency-chains and may decide to reorder the code if no dependency is broken AND it will result in faster/more compact code). The code complexity of your test is very low and may not trigger the reordering optimization.
The same thing may happen at the CPU level, if no dependency chains are found between CPU instructions, they may be reordered or at least run in parallel by a superscalar CPU, but other, simpler architectures will run code in-order.
Another reason for such an output is caching. Thread2 might cache the
value for the variable as 0 and will not see the updated value when
Thread1 updates it in Func1
Again, this is is only a possibility. This type of optimization is typically triggered when repeatedly accessing a variable in a loop. The optimizer may decide that it is faster to place the variable on a CPU register instead of accessing it from memory every iteration.
In any case, the amount of control you have over how the C# compiler emits its code is very limited, same goes for how the IL code gets translated to machine code. For these reasons, it would be very difficult for you to produce a reproducible test on every architecture for the case you intend to prove.
What is really important is that you need to be aware that 1) the execution order of the instructions can never be taken for granted and 2) variables may be temporarily stored in registers as a potential optimization. Once aware you should write your code defensively around these possibilities
The general advice is that you should not call GC.Collect from your code, but what are the exceptions to this rule?
I can only think of a few very specific cases where it may make sense to force a garbage collection.
One example that springs to mind is a service, that wakes up at intervals, performs some task, and then sleeps for a long time. In this case, it may be a good idea to force a collect to prevent the soon-to-be-idle process from holding on to more memory than needed.
Are there any other cases where it is acceptable to call GC.Collect?
If you have good reason to believe that a significant set of objects - particularly those you suspect to be in generations 1 and 2 - are now eligible for garbage collection, and that now would be an appropriate time to collect in terms of the small performance hit.
A good example of this is if you've just closed a large form. You know that all the UI controls can now be garbage collected, and a very short pause as the form is closed probably won't be noticeable to the user.
UPDATE 2.7.2018
As of .NET 4.5 - there is GCLatencyMode.LowLatency and GCLatencyMode.SustainedLowLatency. When entering and leaving either of these modes, it is recommended that you force a full GC with GC.Collect(2, GCCollectionMode.Forced).
As of .NET 4.6 - there is the GC.TryStartNoGCRegion method (used to set the read-only value GCLatencyMode.NoGCRegion). This can itself, perform a full blocking garbage collection in an attempt to free enough memory, but given we are disallowing GC for a period, I would argue it is also a good idea to perform full GC before and after.
Source: Microsoft engineer Ben Watson's: Writing High-Performance .NET Code, 2nd Ed. 2018.
See:
https://msdn.microsoft.com/en-us/library/system.runtime.gclatencymode(v=vs.110).aspx
https://msdn.microsoft.com/en-us/library/dn906204(v=vs.110).aspx
I use GC.Collect only when writing crude performance/profiler test rigs; i.e. I have two (or more) blocks of code to test - something like:
GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);
TestA(); // may allocate lots of transient objects
GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);
TestB(); // may allocate lots of transient objects
GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);
...
So that TestA() and TestB() run with as similar state as possible - i.e. TestB() doesn't get hammered just because TestA left it very close to the tipping point.
A classic example would be a simple console exe (a Main method sort-enough to be posted here for example), that shows the difference between looped string concatenation and StringBuilder.
If I need something precise, then this would be two completely independent tests - but often this is enough if we just want to minimize (or normalize) the GC during the tests to get a rough feel for the behaviour.
During production code? I have yet to use it ;-p
The best practise is to not force a garbage collection in most cases. (Every system I have worked on that had forced garbage collections, had underlining problems that if solved would have removed the need to forced the garbage collection, and speeded the system up greatly.)
There are a few cases when you know more about memory usage then the garbage collector does. This is unlikely to be true in a multi user application, or a service that is responding to more then one request at a time.
However in some batch type processing you do know more then the GC. E.g. consider an application that.
Is given a list of file names on the command line
Processes a single file then write the result out to a results file.
While processing the file, creates a lot of interlinked objects that can not be collected until the processing of the file have complete (e.g. a parse tree)
Does not keep match state between the files it has processed.
You may be able to make a case (after careful) testing that you should force a full garbage collection after you have process each file.
Another cases is a service that wakes up every few minutes to process some items, and does not keep any state while it’s asleep. Then forcing a full collection just before going to sleep may be worthwhile.
The only time I would consider forcing
a collection is when I know that a lot
of object had been created recently
and very few objects are currently
referenced.
I would rather have a garbage collection API when I could give it hints about this type of thing without having to force a GC my self.
See also "Rico Mariani's Performance Tidbits"
These days I consider same of the above cases would be better to use a short lived worker process to do each batch of work and let the OS do the resource recovery.
One case is when you are trying to unit test code that uses WeakReference.
In large 24/7 or 24/6 systems -- systems that react to messages, RPC requests or that poll a database or process continuously -- it is useful to have a way to identify memory leaks. For this, I tend to add a mechanism to the application to temporarily suspend any processing and then perform full garbage collection. This puts the system into a quiescent state where the memory remaining is either legitimately long lived memory (caches, configuration, &c.) or else is 'leaked' (objects that are not expected or desired to be rooted but actually are).
Having this mechanism makes it a lot easier to profile memory usage as the reports will not be clouded with noise from active processing.
To be sure you get all of the garbage, you need to perform two collections:
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
As the first collection will cause any objects with finalizers to be finalized (but not actually garbage collect these objects). The second GC will garbage collect these finalized objects.
You can call GC.Collect() when you know something about the nature of the app the garbage collector doesn't.
As the author, it's often tempting to think this is likely or normal. However, the truth is the GC amounts to a pretty well-written and tested expert system, and it's rare you'll know something about the low level code paths it doesn't.
The best example I can think of where you might have some extra information is an app that cycles between idle periods and very busy periods. You want the best performance possible for the busy periods and therefore want to use the idle time to do some clean up.
However, most of the time the GC is smart enough to do this anyway.
One instance where it is almost necessary to call GC.Collect() is when automating Microsoft Office through Interop. COM objects for Office don't like to automatically release and can result in the instances of the Office product taking up very large amounts of memory. I'm not sure if this is an issue or by design. There's lots of posts about this topic around the internet so I won't go into too much detail.
When programming using Interop, every single COM object should be manually released, usually though the use of Marshal.ReleseComObject(). In addition, calling Garbage Collection manually can help "clean up" a bit. Calling the following code when you're done with Interop objects seems to help quite a bit:
GC.Collect()
GC.WaitForPendingFinalizers()
GC.Collect()
In my personal experience, using a combination of ReleaseComObject and manually calling garbage collection greatly reduces the memory usage of Office products, specifically Excel.
As a memory fragmentation solution.
I was getting out of memory exceptions while writing a lot of data into a memory stream (reading from a network stream). The data was written in 8K chunks. After reaching 128M there was exception even though there was a lot of memory available (but it was fragmented). Calling GC.Collect() solved the issue. I was able to handle over 1G after the fix.
Have a look at this article by Rico Mariani. He gives two rules when to call GC.Collect (rule 1 is: "Don't"):
When to call GC.Collect()
I was doing some performance testing on array and list:
private static int count = 100000000;
private static List<int> GetSomeNumbers_List_int()
{
var lstNumbers = new List<int>();
for(var i = 1; i <= count; i++)
{
lstNumbers.Add(i);
}
return lstNumbers;
}
private static int[] GetSomeNumbers_Array()
{
var lstNumbers = new int[count];
for (var i = 1; i <= count; i++)
{
lstNumbers[i-1] = i + 1;
}
return lstNumbers;
}
private static int[] GetSomeNumbers_Enumerable_Range()
{
return Enumerable.Range(1, count).ToArray();
}
static void performance_100_Million()
{
var sw = new Stopwatch();
sw.Start();
var numbers1 = GetSomeNumbers_List_int();
sw.Stop();
//numbers1 = null;
//GC.Collect();
Console.WriteLine(String.Format("\"List<int>\" took {0} milliseconds", sw.ElapsedMilliseconds));
sw.Reset();
sw.Start();
var numbers2 = GetSomeNumbers_Array();
sw.Stop();
//numbers2 = null;
//GC.Collect();
Console.WriteLine(String.Format("\"int[]\" took {0} milliseconds", sw.ElapsedMilliseconds));
sw.Reset();
sw.Start();
//getting System.OutOfMemoryException in GetSomeNumbers_Enumerable_Range method
var numbers3 = GetSomeNumbers_Enumerable_Range();
sw.Stop();
//numbers3 = null;
//GC.Collect();
Console.WriteLine(String.Format("\"int[]\" Enumerable.Range took {0} milliseconds", sw.ElapsedMilliseconds));
}
and I got OutOfMemoryException in GetSomeNumbers_Enumerable_Range method the only workaround is to deallocate the memory by:
numbers = null;
GC.Collect();
You should try to avoid using GC.Collect() since its very expensive. Here is an example:
public void ClearFrame(ulong timeStamp)
{
if (RecordSet.Count <= 0) return;
if (Limit == false)
{
var seconds = (timeStamp - RecordSet[0].TimeStamp)/1000;
if (seconds <= _preFramesTime) return;
Limit = true;
do
{
RecordSet.Remove(RecordSet[0]);
} while (((timeStamp - RecordSet[0].TimeStamp) / 1000) > _preFramesTime);
}
else
{
RecordSet.Remove(RecordSet[0]);
}
GC.Collect(); // AVOID
}
TEST RESULT: CPU USAGE 12%
When you change to this:
public void ClearFrame(ulong timeStamp)
{
if (RecordSet.Count <= 0) return;
if (Limit == false)
{
var seconds = (timeStamp - RecordSet[0].TimeStamp)/1000;
if (seconds <= _preFramesTime) return;
Limit = true;
do
{
RecordSet[0].Dispose(); // Bitmap destroyed!
RecordSet.Remove(RecordSet[0]);
} while (((timeStamp - RecordSet[0].TimeStamp) / 1000) > _preFramesTime);
}
else
{
RecordSet[0].Dispose(); // Bitmap destroyed!
RecordSet.Remove(RecordSet[0]);
}
//GC.Collect();
}
TEST RESULT: CPU USAGE 2-3%
In your example, I think that calling GC.Collect isn't the issue, but rather there is a design issue.
If you are going to wake up at intervals, (set times) then your program should be crafted for a single execution (perform the task once) and then terminate. Then, you set the program up as a scheduled task to run at the scheduled intervals.
This way, you don't have to concern yourself with calling GC.Collect, (which you should rarely if ever, have to do).
That being said, Rico Mariani has a great blog post on this subject, which can be found here:
http://blogs.msdn.com/ricom/archive/2004/11/29/271829.aspx
One useful place to call GC.Collect() is in a unit test when you want to verify that you are not creating a memory leak (e. g. if you are doing something with WeakReferences or ConditionalWeakTable, dynamically generated code, etc).
For example, I have a few tests like:
WeakReference w = CodeThatShouldNotMemoryLeak();
Assert.IsTrue(w.IsAlive);
GC.Collect();
GC.WaitForPendingFinalizers();
Assert.IsFalse(w.IsAlive);
It could be argued that using WeakReferences is a problem in and of itself, but it seems that if you are creating a system that relies on such behavior then calling GC.Collect() is a good way to verify such code.
There are some situations where it is better safe than sorry.
Here is one situation.
It is possible to author an unmanaged DLL in C# using IL rewrites (because there are situations where this is necessary).
Now suppose, for example, the DLL creates an array of bytes at the class level - because many of the exported functions need access to such. What happens when the DLL is unloaded? Is the garbage collector automatically called at that point? I don't know, but being an unmanaged DLL it is entirely possible the GC isn't called. And it would be a big problem if it wasn't called. When the DLL is unloaded so too would be the garbage collector - so who is going to be responsible for collecting any possible garbage and how would they do it? Better to employ C#'s garbage collector. Have a cleanup function (available to the DLL client) where the class level variables are set to null and the garbage collector called.
Better safe than sorry.
The short answer is: never!
using(var stream = new MemoryStream())
{
bitmap.Save(stream, ImageFormat.Png);
techObject.Last().Image = Image.FromStream(stream);
bitmap.Dispose();
// Without this code, I had an OutOfMemory exception.
GC.Collect();
GC.WaitForPendingFinalizers();
//
}
Another reason is when you have a SerialPort opened on a USB COM port, and then the USB device is unplugged. Because the SerialPort was opened, the resource holds a reference to the previously connected port in the system's registry. The system's registry will then contain stale data, so the list of available ports will be wrong. Therefore the port must be closed.
Calling SerialPort.Close() on the port calls Dispose() on the object, but it remains in memory until garbage collection actually runs, causing the registry to remain stale until the garbage collector decides to release the resource.
From https://stackoverflow.com/a/58810699/8685342:
try
{
if (port != null)
port.Close(); //this will throw an exception if the port was unplugged
}
catch (Exception ex) //of type 'System.IO.IOException'
{
System.GC.Collect();
System.GC.WaitForPendingFinalizers();
}
port = null;
If you are creating a lot of new System.Drawing.Bitmap objects, the Garbage Collector doesn't clear them. Eventually GDI+ will think you are running out of memory and will throw a "The parameter is not valid" exception. Calling GC.Collect() every so often (not too often!) seems to resolve this issue.
i am still pretty unsure about this.
I am working since 7 years on an Application Server. Our bigger installations take use of 24 GB Ram. Its hightly Multithreaded, and ALL calls for GC.Collect() ran into really terrible performance issues.
Many third party Components used GC.Collect() when they thought it was clever to do this right now.
So a simple bunch of Excel-Reports blocked the App Server for all threads several times a minute.
We had to refactor all the 3rd Party Components in order to remove the GC.Collect() calls, and all worked fine after doing this.
But i am running Servers on Win32 as well, and here i started to take heavy use of GC.Collect() after getting a OutOfMemoryException.
But i am also pretty unsure about this, because i often noticed, when i get a OOM on 32 Bit, and i retry to run the same Operation again, without calling GC.Collect(), it just worked fine.
One thing i wonder is the OOM Exception itself...
If i would have written the .Net Framework, and i can't alloc a memory block, i would use GC.Collect(), defrag memory (??), try again, and if i still cant find a free memory block, then i would throw the OOM-Exception.
Or at least make this behavior as configurable option, due the drawbacks of the performance issue with GC.Collect.
Now i have lots of code like this in my app to "solve" the problem:
public static TResult ExecuteOOMAware<T1, T2, TResult>(Func<T1,T2 ,TResult> func, T1 a1, T2 a2)
{
int oomCounter = 0;
int maxOOMRetries = 10;
do
{
try
{
return func(a1, a2);
}
catch (OutOfMemoryException)
{
oomCounter++;
if (maxOOMRetries > 10)
{
throw;
}
else
{
Log.Info("OutOfMemory-Exception caught, Trying to fix. Counter: " + oomCounter.ToString());
System.Threading.Thread.Sleep(TimeSpan.FromSeconds(oomCounter * 10));
GC.Collect();
}
}
} while (oomCounter < maxOOMRetries);
// never gets hitted.
return default(TResult);
}
(Note that the Thread.Sleep() behavior is a really App apecific behavior, because we are running a ORM Caching Service, and the service takes some time to release all the cached objects, if RAM exceeds some predefined values. so it waits a few seconds the first time, and has increased waiting time each occurence of OOM.)
one good reason for calling GC is on small ARM computers with little memory, like the Raspberry PI (running with mono).
If unallocated memory fragments use too much of the system RAM, then the Linux OS can get unstable.
I have an application where I have to call GC every second (!) to get rid of memory overflow problems.
Another good solution is to dispose objects when they are no longer needed. Unfortunately this is not so easy in many cases.
This isn't that relevant to the question, but for XSLT transforms in .NET (XSLCompiledTranform) then you might have no choice. Another candidate is the MSHTML control.
If you are using a version of .net less than 4.5, manual collection may be inevitable (especially if you are dealing with many 'large objects').
this link describes why:
https://blogs.msdn.microsoft.com/dotnet/2011/10/03/large-object-heap-improvements-in-net-4-5/
Since there are Small object heap(SOH) and Large object heap(LOH)
We can call GC.Collect() to clear de-reference object in SOP, and move lived object to next generation.
In .net4.5, we can also compact LOH by using largeobjectheapcompactionmode
I have been working on Async calls and I found that the Async version of a method is running much slower than the Sync version. Can anyone comment on what I may be missing. Thanks.
Statistics
Sync method time is 00:00:23.5673480
Async method time is 00:01:07.1628415
Total Records/Entries returned per call = 19972
Below is the code that i am running.
-------------------- Test class ----------------------
[TestMethod]
public void TestPeoplePerformanceSyncVsAsync()
{
DateTime start;
DateTime end;
start = DateTime.Now;
for (int i = 0; i < 10; i++)
{
using (IPersonRepository repository = kernel.Get<IPersonRepository>())
{
IList<IPerson> people1 = repository.GetPeople();
IList<IPerson> people2 = repository.GetPeople();
}
}
end = DateTime.Now;
var diff = start - end;
Console.WriteLine(diff);
start = DateTime.Now;
for (int i = 0; i < 10; i++)
{
using (IPersonRepository repository = kernel.Get<IPersonRepository>())
{
Task<IList<IPerson>> people1 = GetPeopleAsync();
Task<IList<IPerson>> people2 = GetPeopleAsync();
Task.WaitAll(new Task[] {people1, people2});
}
}
end = DateTime.Now;
diff = start - end;
Console.WriteLine(diff);
}
private async Task<IList<IPerson>> GetPeopleAsync()
{
using (IPersonRepository repository = kernel.Get<IPersonRepository>())
{
return await repository.GetPeopleAsync();
}
}
-------------------------- Repository ----------------------------
public IList<IPerson> GetPeople()
{
List<IPerson> people = new List<IPerson>();
using (PersonContext context = new PersonContext())
{
people.AddRange(context.People);
}
return people;
}
public async Task<IList<IPerson>> GetPeopleAsync()
{
List<IPerson> people = new List<IPerson>();
using (PersonContext context = new PersonContext())
{
people.AddRange(await context.People.ToListAsync());
}
return people;
}
So we've got a whole bunch of issues here, so I'll just say right off the bat that this isn't going to be an exhaustive list.
First off, the point of asynchrony is not strictly to improve performance. It can be, in certain contexts, used to improve performance, but that's not necessarily its goal. It can also be used to keep a UI responsive, for example. Paralleization is usually used to increase performance, but parallelization and asynchrony aren't equivalent. On top of that, parallelization has an overhead. You're spending time creating threads, scheduling them, synchronizing data between them, etc. The benefit of performing some operations in parallel may or may not surpass this overhead. If it doesn't, a synchronous solution may well be more performant.
Next, your "asynchronous" example isn't asynchronous "all the way up". You're calling WaitAll on the tasks inside the loop. For the example to be properly asynchronous one would like to see it be asynchronous all the way up to a single operation, namely some form of message loop.
Next, the two aren't don't the exact same thing in an asynchronous and synchronous manor. They are doing different things, which will obviously affect performance:
Your "asynchronous" solution creates 3 repositories. Your synchronous solution creates one. There is going to be some overhead here.
GetPeopleAsync takes a list, then pulls all of the items out of the list and puts them into another list. That's unnecessary overhead.
Then there are problems with your benchmarking:
You're using DateTime.Now, which is not designed for timing how long an operation takes. it's precision isn't particularly high, for example. You should use a StopWatch to time how long code takes.
You aren't performing all that many iterations. There's plenty of opportunity for the variation to affect the results here.
You aren't accounting for the fact that the first few runs through a section of code will take longer. The JITter needs to "warm up".
Garbage collections can be affecting your timings, namely that the objects created in the first test can end up being cleaned up during the second test.
It may depend on your data, or rather the amount of it. You didn't post what test metrics you're using to run your tests but this is my experience:
Usually when you see a slowdown in the performance of parallel algorithms when you're expecting improvement it's that the overhead of loading the extra libraries and spawning threads etc. slows down the parallel algorithm and makes it look like the linear/single-threaded version is performing better.
A greater amount of data should show better performance. Also try running the same test twice when all the libraries are loaded to avoid the load overhead.
If you don't see improvement, something is seriously wrong.
Note: You're getting voted down, I'm guessing, because you posted much more code than context, metrics etc. in the OP. IMO, very few SOers will actually bother to read and grok even that much code without being able to execute it while also being presented with metrics that are not at all useful!
Why I didn't read the code: When I see a code block with scroll bars along with the kind of text that was present in the original OP, my brain says: Don't bother. I think many if not most, probably do this.
Things to try:
Two different synch times does not mean statistically significant data. You should run each algorithm a number of times (5 at least) to see if you're experiencing anomalies. If your results for the same algorithms vary wildly then you may have other issues such as bandwidth restriction, server load etc. and the issue is external.
Try a .NET memory performance and/or memory profiler to help you track down the issue.
See #servy's great answer for more clues. It seems that he actually took the time to look at your code more closely.
I wonder whether the following code can be optimized to execute faster. I currently seem to max out at around 1.4 million simple messages per second on a pretty simple data flow structure. I am aware that this sample process passes/transforms messages synchronously, however, I currently test TPL Dataflow as a possible replacement for my own custom solution based on Tasks and concurrent collections. I know the terms "concurrent" already suggest I run things in parallel but for current testing purposes I pushed messages on my own solution through synchronously and I get to about 5.1 million messages per second. What am I missing here, I read TPL Dataflow was pushed as a high throughput, low latency solution but so far I must be overlooking performance tweaks. Anyone who could point me into the right direction please?
class TPLDataFlowExperiments
{
public TPLDataFlowExperiments()
{
var buf1 = new BufferBlock<int>();
var transform = new TransformBlock<int, string>(t =>
{
return "";
});
var action = new ActionBlock<string>(s =>
{
//Thread.Sleep(100);
//Console.WriteLine(s);
});
buf1.LinkTo(transform);
transform.LinkTo(action);
//Propagate all Completions down the flow
buf1.Completion.ContinueWith(t =>
{
transform.Complete();
transform.Completion.ContinueWith(u =>
{
action.Complete();
});
});
Stopwatch watch = new Stopwatch();
watch.Start();
int cap = 10000000;
for (int i = 0; i < cap; i++)
{
buf1.Post(i);
}
//Mark Buffer as Complete
buf1.Complete();
action.Completion.ContinueWith(t =>
{
watch.Stop();
Console.WriteLine("All Blocks finished processing");
Console.WriteLine("Units processed per second: " + cap / watch.ElapsedMilliseconds * 1000);
});
Console.ReadLine();
}
}
I think this mostly comes down to one thing: your test is pretty much meaningless. All those blocks are supposed to do something, and use multiple cores and asynchronous operations to do that.
Also, in your test, it's likely that a lot of time is spent on synchronization. With a more realistic code, the code will take some time to execute, so there will be less contention, so the actual overhead will be smaller than what you measured.
But to actually answer your question, yes, you're overlooking some performance tweaks. Specifically, SingleProducerConstrained, which means data structures with less locking can be used. If I use this on both blocks (the BufferBlock is completely useless here, you can safely remove it), the rate raises from about 3–4 millions of items per second to more than 5 millions on my computer.
To add to svick's answer, the test uses only a single processing thread for a single action block. This way it tests nothing more than the overhead of using the blocks.
DataFlow works in a manner similar to F# Agents, Scala actors and MPI implementations. Each action block executes a single task at a time, listening to input and producing output. Speedup is provided by breaking an algorithm in steps that can be executed independently on multiple cores, passing only messages to each other.
While you can increase the number of concurrent tasks, the most important issue is designing a flow that perform the maximum amount of steps independently of the others.
You can also increase the degrees of parallelism for dataflow blocks. This may offer an additional speedup and can also help with load balancing between linear tasks if you find one of your blocks acts as a bottleneck to the rest.
If your workload is so granular that you expect to process millions of messages per second, then passing individual messages through the pipeline becomes not viable because of the associated overhead. You'll need to chunkify the workload by batching the messages to arrays or lists. For example:
var transform = new TransformBlock<int[], string[]>(batch =>
{
var results = new string[batch.Length];
for (int i = 0; i < batch.Length; i++)
{
results[i] = ProcessItem(batch[i]);
}
return results;
});
For batching your input you could use a BatchBlock, or the "linqy" Buffer extension method from the System.Interactive package, or the similar in functionality Batch method from the MoreLinq package, or do it manually.
I'm just diving into learning about the Parallel class in the 4.0 Framework and am trying to understand when it would be useful. At first after reviewing some of the documentation I tried to execute two loops, one using Parallel.Invoke and one sequentially like so:
static void Main()
{
DateTime start = DateTime.Now;
Parallel.Invoke(BasicAction, BasicAction2);
DateTime end = DateTime.Now;
var parallel = end.Subtract(start).TotalSeconds;
start = DateTime.Now;
BasicAction();
BasicAction2();
end = DateTime.Now;
var sequential = end.Subtract(start).TotalSeconds;
Console.WriteLine("Parallel:{0}", parallel.ToString());
Console.WriteLine("Sequential:{0}", sequential.ToString());
Console.Read();
}
static void BasicAction()
{
for (int i = 0; i < 10000; i++)
{
Console.WriteLine("Method=BasicAction, Thread={0}, i={1}", Thread.CurrentThread.ManagedThreadId, i.ToString());
}
}
static void BasicAction2()
{
for (int i = 0; i < 10000; i++)
{
Console.WriteLine("Method=BasicAction2, Thread={0}, i={1}", Thread.CurrentThread.ManagedThreadId, i.ToString());
}
}
There is no noticeable difference in time of execution here, or am I missing the point? Is it more useful for asynchronous invocations of web services or...?
EDIT: I removed the DateTime with Stopwatch, removed the write to the console with a simple addition operation.
UPDATE - Big Time Difference Now: Thanks for clearing up the problems I had when I involved Console
static void Main()
{
Stopwatch s = new Stopwatch();
s.Start();
Parallel.Invoke(BasicAction, BasicAction2);
s.Stop();
var parallel = s.ElapsedMilliseconds;
s.Reset();
s.Start();
BasicAction();
BasicAction2();
s.Stop();
var sequential = s.ElapsedMilliseconds;
Console.WriteLine("Parallel:{0}", parallel.ToString());
Console.WriteLine("Sequential:{0}", sequential.ToString());
Console.Read();
}
static void BasicAction()
{
Thread.Sleep(100);
}
static void BasicAction2()
{
Thread.Sleep(100);
}
The test you are doing is nonsensical; you are testing to see if something that you can not perform in parallel is faster if you perform it in parallel.
Console.Writeline handles synchronization for you so it will always act as though it is running on a single thread.
From here:
...call the SetIn, SetOut, or SetError method, respectively. I/O
operations using these streams are synchronized, which means multiple
threads can read from, or write to, the streams.
Any advantage that the parallel version gains from running on multiple threads is lost through the marshaling done by the console. In fact I wouldn't be surprised to see that all the thread switching actually means that the parallel run would be slower.
Try doing something else in the actions (a simple Thread.Sleep would do) that can be processed by multiple threads concurrently and you should see a large difference in the run times. Large enough that the inaccuracy of using DateTime as your timing mechanism will not matter too much.
It's not a matter of time of execution. The output to the console is determined by how the actions are scheduled to run. To get an accurate time of execution, you should be using StopWatch. At any rate, you are using Console.Writeline so it will appear as though it is in one thread of execution. Any thing you have tried to attain by using parallel.invoke is lost by the nature of Console.Writeline.
On something simple like that the run times will be the same. What Parallel.Invoke is doing is running the two methods at the same time.
In the first case you'll have lines spat out to the console in a mixed up order.
Method=BasicAction2, Thread=6, i=9776
Method=BasicAction, Thread=10, i=9985
// <snip>
Method=BasicAction, Thread=10, i=9999
Method=BasicAction2, Thread=6, i=9777
In the second case you'll have all the BasicAction's before the BasicAction2's.
What this shows you is that the two methods are running at the same time.
In ideal case (if number of delegates is equal to number of parallel threads & there are enough cpu cores) duration of operations will become MAX(AllDurations) instead of SUM(AllDurations) (if AllDurations is a list of each delegate execution times like {1sec,10sec, 20sec, 5sec} ). In less idealcase its moving in this direction.
Its useful when you don't care about the order in which delegates are invoked, but you care that you block thread execution until every delegate is completed, so yes it can be a situation where you need to gather data from various sources before you can proceed (they can be webservices or other types of sources).
Parallel.For can be used much more often I think, in this case its pretty much required that you got different tasks and each is taking substantial duration to execute, and I guess if you don't have an idea of possible range of execution times ( which is true for webservices) Invoke will shine the most.
Maybe your static constructor requires to build up two independant dictionaries for your type to use, you can invoke methods that fill them using Invoke() in parallel and shorten time 2x if they both take roughly same time for example.