Performance measurement of individual threads in a WaitAll construction - C#

Say I'm writing a piece of software that simulates a user performing certain actions on a system. I'm measuring the amount of time it takes for such an action to complete using a stopwatch.
Most of the time this is pretty straightforward: the click of a button is simulated, some service call is associated with this button, and the time it takes for this service call to complete is measured.
Now comes the crux: some actions have more than one service call associated with them. Since they're all still part of the same logical action, I'm 'grouping' these using the signalling mechanism offered by C#, like so (pseudo):
// Begin both asynchronous (APM) calls and collect their wait handles.
var syncResultList = new List<WaitHandle>();
var syncResultOne = service.BeginGetStuff();
var syncResultTwo = service.BeginDoOtherStuff();
syncResultList.Add(syncResultOne.AsyncWaitHandle);
syncResultList.Add(syncResultTwo.AsyncWaitHandle);

// Block until both calls have completed, then collect the results.
WaitHandle.WaitAll(syncResultList.ToArray());
var retValOne = service.EndGetStuff(syncResultOne);
var retValTwo = service.EndDoOtherStuff(syncResultTwo);
So, GetStuff and DoOtherStuff constitute one logical piece of work for that particular action. And, of course, I can easily measure the amount of time it takes for this conjunction of methods to complete by just placing a stopwatch around them. But I need a more fine-grained approach for my statistics: I'm really interested in the amount of time it takes for each of the methods to complete, without losing the 'grouped' semantics provided by WaitHandle.WaitAll.
What I've done to overcome this was write a wrapper class (or rather a code generation file) that implements a timing mechanism using a callback. Since I'm not that interested in the actual result (save exceptions, which are part of the statistics), I'd just have it return some statistic. But this turned out to be a performance drain somehow.
So, basically, I'm looking for an alternative to this approach. Maybe it's much simpler than I think right now, but I can't seem to figure it out by myself at the moment.

This looks like a prime candidate for Tasks (assuming you're using C# 4).
You can create Tasks from your APM methods using Task.Factory.FromAsync.
You can then use all the rich TPL goodness, like individual continuations.
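For example, a sketch (assuming the Begin/End methods follow the standard APM signatures, taking an AsyncCallback and a state object; Stuff, OtherStuff, and Record are placeholders, not your actual types):

var swOne = Stopwatch.StartNew();
var swTwo = Stopwatch.StartNew();

// Wrap each APM pair in a Task.
Task<Stuff> taskOne = Task.Factory.FromAsync<Stuff>(
    service.BeginGetStuff, service.EndGetStuff, state: null);
Task<OtherStuff> taskTwo = Task.Factory.FromAsync<OtherStuff>(
    service.BeginDoOtherStuff, service.EndDoOtherStuff, state: null);

// Individual continuations capture the per-call timings (and exceptions)...
taskOne.ContinueWith(t => Record("GetStuff", swOne.Elapsed, t.Exception));
taskTwo.ContinueWith(t => Record("DoOtherStuff", swTwo.Elapsed, t.Exception));

// ...while the grouped semantics of WaitHandle.WaitAll are preserved.
Task.WaitAll(taskOne, taskTwo);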

If your needs are simple enough, a simple approach would be to record each service call individually, then calculate the logical action's duration from the individual service calls.
I.e., if logical action A is made up of parallel service calls B and C, where B took 2 seconds and C took 1 second, then A takes 2 seconds:
A = Max(B, C)
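A sketch of that bookkeeping (swB and swC are hypothetical stopwatches, each stopped when its service call completes):

TimeSpan b = swB.Elapsed; // e.g. 2 seconds for service call B
TimeSpan c = swC.Elapsed; // e.g. 1 second for service call C
TimeSpan a = TimeSpan.FromTicks(Math.Max(b.Ticks, c.Ticks)); // A = Max(B, C)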

Related

Thread Contention on a ConcurrentDictionary in C#

I have a C# .NET program that uses an external API to process events for real-time stock market data. I use the API callback feature to populate a ConcurrentDictionary with the data it receives on a stock-by-stock basis.
I have a set of algorithms that each run in a constant loop until a terminal condition is met. They are called like this (but all from separate calling functions elsewhere in the code):
Task.Run(() => ExecutionLoop1());
Task.Run(() => ExecutionLoop2());
...
Task.Run(() => ExecutionLoopN());
Each one of those functions calls SnapTotals():
public void SnapTotals()
{
foreach (KeyValuePair<string, MarketData> kvpMarketData in
new ConcurrentDictionary<string, MarketData>(Handler.MessageEventHandler.Realtime))
{
...
The Handler.MessageEventHandler.Realtime object is the ConcurrentDictionary that is updated in real-time by the external API.
At a certain specific point in the day, there is an instant burst of data that comes in from the API. That is the precise time I want my ExecutionLoop() functions to do some work.
As I've grown the program and added more of those execution loop functions, and grown the number of elements in the ConcurrentDictionary, the performance of the program as a whole has seriously degraded. Specifically, those ExecutionLoop() functions all seem to freeze up and take much longer to meet their terminal condition than they should.
I added some logging to all of the functions above, and to the function that updates the ConcurrentDictionary. From what I can gather, the ExecutionLoop() functions appear to access the ConcurrentDictionary so often that they block the API from updating it with real-time data. The loops are dependent on that data to meet their terminal condition so they cannot complete.
I'm stuck trying to figure out a way to re-architect this. I would like for the thread that updates the ConcurrentDictionary to have a higher priority but the message events are handled from within the external API. I don't know if ConcurrentDictionary was the right type of data structure to use, or what the alternative could be, because obviously a regular Dictionary would not work here. Or is there a way to "pause" my execution loops for a few milliseconds to allow the market data feed to catch up? Or something else?
Your basic approach is sound except for one fatal flaw: all of those loops are hitting the same dictionary at the same time via iterators, sets, and gets. So you must do one thing: in SnapTotals, iterate over a copy of the concurrent dictionary.
When you iterate over Handler.MessageEventHandler.Realtime, or even new ConcurrentDictionary<string, MarketData>(Handler.MessageEventHandler.Realtime), you are using the ConcurrentDictionary<>'s iterator which, even though it is thread-safe, is going to be using the dictionary for the entire period of iteration (including however long it takes to do the processing for each and every entry in the dictionary). That is most likely where the contention occurs.
Making a copy of the dictionary is much faster, so it should lower contention.
Change SnapTotals to
public void SnapTotals()
{
var copy = Handler.MessageEventHandler.Realtime.ToArray();
foreach (var kvpMarketData in copy)
{
...
Now, each ExecutionLoopX can execute in peace, without write-side contention (your API updates) and without read-side contention from the other loops. The write side can execute without read-side contention as well.
The only "contention" should be for the short duration needed to make each copy.
And by the way, the dictionary copy (an array) is not thread-safe; it's just a plain array. But that is OK, because each task executes in isolation on its own copy.
I think that your main problem is not related to the ConcurrentDictionary, but to the large number of ExecutionLoopX methods. Each of these methods saturates a CPU core, and since there are more methods than cores on your machine, the whole CPU is saturated. My assumption is that if you find a way to limit the degree of parallelism of the ExecutionLoopX methods to a number smaller than Environment.ProcessorCount, your program will behave and perform better. Below is my suggestion for implementing this limitation.
The main obstacle is that currently your ExecutionLoopX methods are monolithic: they can't be broken into pieces that can be scheduled cooperatively. My suggestion is to change their return type from void to async Task, and place an await Task.Yield(); inside the outer loop. This way it becomes possible to execute them in steps, with each step being the code from one await to the next, like so:
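A sketch of one converted loop (the loop body and terminal condition are placeholders for the poster's actual code):

public async Task ExecutionLoop1()
{
    while (!TerminalConditionMet()) // hypothetical terminal condition
    {
        // Yield so the limited-concurrency scheduler can interleave the
        // other loops; execution resumes as a new step on the same scheduler.
        await Task.Yield();
        SnapTotals();
        // ... rest of the original loop body ...
    }
}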
Then create a TaskScheduler with limited concurrency, and a TaskFactory that uses this scheduler:
int maxDegreeOfParallelism = Environment.ProcessorCount - 1;
TaskScheduler scheduler = new ConcurrentExclusiveSchedulerPair(
TaskScheduler.Default, maxDegreeOfParallelism).ConcurrentScheduler;
TaskFactory taskFactory = new TaskFactory(scheduler);
Now you can parallelize the execution of the methods, by starting the tasks with the taskFactory.StartNew method instead of the Task.Run:
List<Task> tasks = new();
tasks.Add(taskFactory.StartNew(() => ExecutionLoop1(data)).Unwrap());
tasks.Add(taskFactory.StartNew(() => ExecutionLoop2(data)).Unwrap());
tasks.Add(taskFactory.StartNew(() => ExecutionLoop3(data)).Unwrap());
tasks.Add(taskFactory.StartNew(() => ExecutionLoop4(data)).Unwrap());
//...
Task.WaitAll(tasks.ToArray());
The .Unwrap() is needed because taskFactory.StartNew returns a nested task (Task<Task>). The Task.Run method does this unwrapping internally as well, when the action is asynchronous.
The Environment.ProcessorCount - 1 configuration means that one CPU core will be available for other work, like the communication with the external API and the updating of the ConcurrentDictionary.
A more cumbersome implementation of the same idea, using iterators and the Parallel.ForEach method instead of async/await, can be found in the first revision of this answer.
If you're not squeamish about mixing operations in a task, you could redesign so that instead of task A doing A things, task B doing B things, task C doing C things, and so on, you reduce the number of tasks to the number of processors and thus run fewer concurrently, greatly easing contention.
So, for example, say you have just two processors. Make a "general purpose/pluggable" task wrapper that accepts delegates: wrapper 1 would accept delegates to do A and B work, and wrapper 2 would accept delegates to do C and D work. Then ask each wrapper to spin up a task that calls its delegates in a loop over the dictionary.
This would of course need to be measured. What I am proposing is, say, 4 tasks each doing 4 different types of processing: that is 4 units of work per pass, over 4 loops. This is not the same as 16 tasks each doing 1 unit of work; in that case you have 16 loops.
16 loops would intuitively cause more contention than 4.
Again, this is a potential solution that should be measured. There is one drawback for sure: you will have to ensure that a piece of work within a task doesn't affect any of the others.
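A sketch of the wrapper idea (all names are illustrative, not the poster's actual types):

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class LoopWrapper
{
    private readonly List<Action<KeyValuePair<string, MarketData>>> _steps = new();

    public void Add(Action<KeyValuePair<string, MarketData>> step) => _steps.Add(step);

    // One task loops over a snapshot of the dictionary and runs every
    // plugged-in delegate per entry, so A-work and B-work share one loop.
    public Task Start(ConcurrentDictionary<string, MarketData> source, CancellationToken ct)
        => Task.Run(() =>
        {
            while (!ct.IsCancellationRequested)
            {
                foreach (var kvp in source.ToArray()) // one copy per pass
                    foreach (var step in _steps)
                        step(kvp);
            }
        });
}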

How to speed up foreach loop for IEnumerator

A little background: I am trying to use the InterSystems IRIS API.
In case you want to see it:
https://docs.intersystems.com/irislatest/csp/docbook/DocBook.UI.Page.cls?KEY=BNETNAT_refapi#BNETNAT_refapi_iris-iterator
So basically I have something like this (not the actual code):
string global = "myGlobal";
object[] subs = new object[2]; // must be object[], not object, so it can be indexed
subs[0] = "node1";
subs[1] = "node2";
IRISIterator iter = iris.GetIRISIterator(global, subs); // returns IEnumerator
foreach(var item in iter)
{
//for simplicity just printing it, actual code will process the data
Console.WriteLine(iter.CurrentSubscript.ToString());
}
It runs very slowly: it takes almost 4 seconds to read 50 records.
So my question is: given that I have no control over GetIRISIterator, is it even possible to improve the performance of the above code?
Can I do some kind of parallel processing or asynchronous execution to reduce the time?
It is possible that the iterator does some slow operation each time the .MoveNext() method is called to advance to the next item. This may be caused by the overhead of accessing some external system, like a database, and it is possible this overhead is incurred for each item.
So the first thing I would attempt is to not use GetIRISIterator: there seems to be a GetIRISList that returns a list instead. It is possible that this pays the overhead only once for the complete list.
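Something along these lines, assuming GetIRISList accepts the same (global, subscripts) arguments as GetIRISIterator; that signature is an assumption, so check the documentation linked in the question:

string global = "myGlobal";
object[] subs = { "node1", "node2" };

var list = iris.GetIRISList(global, subs); // hypothetical call shape
foreach (var item in list)
{
    Console.WriteLine(item);
}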
Parallel processing is unlikely to improve anything, since it is the fetching of items that takes time, not the processing of them.
Asynchronous execution will not reduce the time taken, but it might improve the user experience by showing the user that the system is working on the request and has not simply hung.
As whenever performance is discussed, I recommend doing appropriate measurements. Using a performance profiler might be very helpful, since it can tell you where most of this time is actually spent.

Ensure Parallel.Invoke doesn't use too much CPU

I have a C# program with WCF that uses Parallel.Invoke a bit.
First, every client call is handled in parallel on the service side by my WCF service.
I have a class A that contains a List of class B, and I can add a list of class B elements without adding an A.
I insert my list of B elements in parallel, because I do a lot of verification before adding. The same goes for A.
Some clients add a really big list of A elements in one go, so I use a Parallel.Invoke to add each A element.
I configure it with ParallelOptions to use no more than half of the CPU, to let other users doing other things use the CPU.
But the task that adds class A, which is already limited to half of the CPU, creates another Parallel.Invoke to add class B.
For example, one call of InvokeAddClassAList creates two AddClassA threads, and each AddClassA creates two AddClassB threads, so I now have 4 threads.
Are these 4 threads limited to half of the CPU? Or are only the two AddClassA threads limited to half the CPU, while each child thread can use as much CPU as it wants?
var pCount = Environment.ProcessorCount / 2;
var options = new ParallelOptions();
options.MaxDegreeOfParallelism = pCount > 0 ? pCount : 1;
Parallel.Invoke(options, actions.ToArray());
Your CPU is a resource that should be used; nobody will thank you if it idles. So limiting your process is pointless. You don't even know whether anybody else is using the computer at the time.
You could use your own TaskScheduler implementation to influence when and how many tasks will be started.
But again, there is no point: you should request as much as possible. If you want to be nice to other users, lower your process's priority instead; then, even if you use 100% of the CPU, they can still work with their higher-prioritized processes.
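A sketch of that approach:

using System.Diagnostics;
using System.Threading.Tasks;

// Lower the whole process's priority so other users' processes win the CPU...
Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.BelowNormal;

// ...then run the additions unthrottled, letting the scheduler use every core.
Parallel.Invoke(actions.ToArray());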

Reactive Extensions Test Scheduler Simulating Time elapse

I am working with the Rx scheduler classes, using the .Schedule(DateTimeOffset, Action<Action<DateTimeOffset>>) overload. Basically I have a scheduled action that can schedule itself again.
Code:
public SomeObject(IScheduler sch, Action variableAmountOfTime)
{
    this.sch = sch;
    sch.Schedule(GetNextTime(), (Action<DateTimeOffset> runAgain) =>
    {
        // Something that takes an unknown, variable amount of time.
        variableAmountOfTime();
        runAgain(GetNextTime());
    });
}

public DateTimeOffset GetNextTime()
{
    // Return some time offset based on the scheduler's current time,
    // which is irregular based on other inputs that I have left out.
    return this.sch.Now.AddMinutes(1);
}
My question concerns simulating the amount of time variableAmountOfTime might take, and testing that my code behaves as expected and only triggers it as expected.
I have tried advancing the test scheduler's time inside the delegate, but that does not work. Here is an example of code I wrote that doesn't work (assume GetNextTime() just schedules one minute out):
[Test]
public void TestCallsAppropriateNumberOfTimes()
{
    var sch = new TestScheduler();
    var timesCalled = 0;
    Action variableAmountOfTime = () =>
    {
        sch.AdvanceBy(TimeSpan.FromMinutes(3).Ticks);
        timesCalled++;
    };
    var someObject = new SomeObject(sch, variableAmountOfTime);
    sch.AdvanceTo(TimeSpan.FromMinutes(3).Ticks);
    Assert.That(timesCalled, Is.EqualTo(1));
}
Since I want to go 3 minutes into the future, but the execution itself takes 3 minutes, I want to see this trigger only once; instead it triggers 3 times.
How can I simulate something taking time during execution using the test scheduler?
Good question. Unfortunately, this is currently not supported in Rx v1.x and Rx v2.0 Beta (but read on). Let me explain the complication of nested Advance* calls to you.
Basically, Advance* implies starting the scheduler to run work till the point specified. This involves running the work in order on a single logical thread that represents the flow of time in the virtual scheduler. Allowing nested Advance* calls raises a few questions.
First of all, should a nested Advance* call cause a nested worker loop to be run? If that were the case, we'd no longer be mimicking a single logical thread of execution, as the current work item would be interrupted in favor of running the inner loop. In fact, Advance* would lead to an implicit yield, where the rest of the work (that was due now) after the Advance* call would not be allowed to run until all nested work had been processed. This leads to a situation where future work cannot depend on (or wait for) past work to finish its execution. One way out is to introduce real physical concurrency, which defeats various design points of the virtual time and historical schedulers to begin with.
Alternatively, should a nested Advance* call somehow communicate to the top-most worker loop dispatching call (Advance* or Start) it may need to extend its due time because a nested invocation has asked to advance to a point beyond the original due time. Now all sorts of things are getting weird though. The clock doesn't reflect the changes after returning from Advance* and the top-most call no longer finishes at a predictable time.
For Rx v2.0 RC (coming next month), we took a look at this scenario and decided Advance* is not the right thing to emulate "time slippage" because it'd need an overloaded meaning depending on the context where it's invoked from. Instead, we're introducing a Sleep method that can be used to slip time forward from any context, without the side-effect of running work. Think of it as a way to set the Clock property but with safeguarding against going back in time. The name also reflects the intent clearly.
In addition to the above, to reduce the surprise factor of nested Advance* calls having no effect, we made it detect this situation and throw an InvalidOperationException in a nested context. Sleep, on the other hand, can be called from anywhere.
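For illustration, the delegate from the question could then be written as follows (a sketch based on the Sleep method described above):

Action variableAmountOfTime = () =>
{
    // Slips the virtual clock forward without dispatching any queued work,
    // unlike AdvanceBy.
    sch.Sleep(TimeSpan.FromMinutes(3).Ticks);
    timesCalled++;
};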
One final note. It turns out we needed exactly the same feature for work we're doing in Rx v2.0 RC with regards to our treatment of time. Several tests required a deterministic way to emulate slippage of time due to the execution of user code that can take arbitrarily long (think of the OnNext handler to e.g. Observable.Interval).
Hope this helps... Stay tuned for our Rx v2.0 RC release in the next few weeks!
-Bart (Rx team)

C# Action vs. Event vs. Queue performance

I have a Camera class that produces very large images at a high FPS, which require processing by an ImageProcessor class. I also have a WPF control, my View, that displays this information. Each of these components needs to run on its own thread so it doesn't lock up the processing.
Method 1) Camera has an Action<Image> ImageCreated that ImageProcessor subscribes to. ImageProcessor has an Action<Image, Foo> ImageCreated that carries an altered Image and the Foo results for the View to show.
Method 2) Camera has a thread-safe (using locks and monitors) producer/consumer queue to which it produces Images, and on which ImageProcessor waits and consumes. Same story for the View.
Method 2 is nice because I can create and manage my own threads.
Method 1 is nice because I can have multiple ImageProcessors subscribed to the Camera class. But I'm not sure whose thread is doing the heavyweight work, or whether Action is wasting time creating threads. Again, these images come in many times per second.
I'm trying to get the images to my View as quickly as possible, without tying up processing or causing the View to lock up.
Thoughts?
Unless you do it yourself, Method 1) does not introduce any multithreading: invoking an action (unless you call BeginInvoke) executes synchronously, just like any normal method call.
I would advocate Method 2). There is no need to tie it to one single consumer: if you use the queue as a single point of contact between X cameras and Y processors, you've decoupled the cameras from the processors and can vary X and Y independently.
EDIT
At the risk of being accused of blog spam here, I remembered that I wrote a component a while ago that's similar (if not an exact match) to what you're looking for. See if this helps:
ProcessQueue
The gist of it is that you provide the queue's constructor with a delegate that can process a single item (in your case, Image), then call Start. As items are added to the queue using Enqueue, they're automatically dispatched to an appropriate thread and processed.
For example, if you wanted to have the image move Camera->Processor->Writer (and have a variable number of each), then I would do something like this:
ProcessQueue<Foo> processorQueue = new ProcessQueue<Foo>(f => WriteFoo(f));
ProcessQueue<Image> cameraQueue = new ProcessQueue<Image>(i => processorQueue.Enqueue(ProcessImage(i)));
You could vary the number of threads in cameraQueue (which controls the image processing) and processorQueue (which controls writing to disk) by using SetThreadCount.
Once you've done that, you would just call cameraQueue.Enqueue(image) whenever a camera captured an image.
Method one will not work: the Action<T> will execute on the thread that invoked it. (Although you should probably use events instead of plain delegates in scenarios like this.)
Method two is the way to go, but if possible you should use the new thread-safe collections of .NET 4.0 instead of doing the synchronization yourself; we all know how hard it is to get even the simplest multithreaded code correct.
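For example, a sketch using BlockingCollection (camera and processor are placeholders for the poster's classes):

using System.Collections.Concurrent;
using System.Threading.Tasks;

var images = new BlockingCollection<Image>(boundedCapacity: 8);

// Producer: the camera thread blocks when the queue is full, which
// naturally throttles capture to the processing speed.
Task.Factory.StartNew(() =>
{
    foreach (var image in camera.Capture()) // hypothetical capture loop
        images.Add(image);
    images.CompleteAdding();
}, TaskCreationOptions.LongRunning);

// Consumer: GetConsumingEnumerable blocks until items arrive and
// completes cleanly after CompleteAdding is called.
Task.Factory.StartNew(() =>
{
    foreach (var image in images.GetConsumingEnumerable())
        processor.Process(image); // hypothetical processing call
}, TaskCreationOptions.LongRunning);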
