How to speed up a foreach loop over an IEnumerator - C#

A little background:
I am trying to use the InterSystems IRIS API.
In case you want to see it:
https://docs.intersystems.com/irislatest/csp/docbook/DocBook.UI.Page.cls?KEY=BNETNAT_refapi#BNETNAT_refapi_iris-iterator
So basically I have something like this (not the actual code):
string global = "myGlobal";
object[] subs = new object[2];
subs[0] = "node1";
subs[1] = "node2";
IRISIterator iter = iris.GetIRISIterator(global, subs); // Returns an IEnumerator
foreach (var item in iter)
{
    // For simplicity this just prints; the actual code will process the data
    Console.WriteLine(iter.CurrentSubscript.ToString());
}
It runs very slowly: it takes almost 4 seconds to read 50 records.
So my question is: given that I have no control over GetIRISIterator, is it even possible to improve the performance of the above code?
Can I use some kind of parallel processing or asynchronous execution to reduce the time?

It is possible that the iterator performs some slow operation each time .MoveNext() is called to advance to the next item. This may be overhead from accessing an external system, such as a database, and that overhead may be incurred for every single item.
So the first thing I would try is to avoid GetIRISIterator; there seems to be a GetIRISList that returns a list instead. It is possible that this incurs the overhead only once for the complete list.
Parallel processing is unlikely to improve anything, since it is the fetching of the items that takes time, not the processing of them.
Asynchronous execution will not reduce the time taken, but it might improve the user experience by showing the user that the system is working on the request and has not simply hung.
As always when performance is discussed, I recommend taking appropriate measurements. Using a performance profiler can be very helpful, since it can tell you where most of this time is spent.
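For example, a minimal measurement sketch (using only the IRISIterator from the question plus System.Diagnostics.Stopwatch) would confirm whether the cost really is incurred per item:

using System;
using System.Diagnostics;

// Time each advance of the iterator to see whether the ~4 seconds are
// spread evenly across the 50 records or concentrated in the first fetch.
var sw = Stopwatch.StartNew();
while (iter.MoveNext()) // iter is the IRISIterator from the question
{
    Console.WriteLine("Item fetched after " + sw.ElapsedMilliseconds + " ms");
    sw.Restart();
}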

Related

Why does my code not speed up with a multithreaded Parallel.For loop?

I tried to transform a simple sequential loop into a parallel loop using the System.Threading.Tasks library.
The code compiles and returns correct results, but it does not save any computational cost; on the contrary, it takes longer.
EDIT: Sorry guys, I probably oversimplified the question and made some errors doing that.
To add some information: I am running the code on an i7-4700QM, and it is referenced in a Grasshopper script.
Here is the actual code. I also switched to non-thread-local variables:
public static class LineNet
{
    public static List<Ray> SolveCpu(List<Speaker> sources, List<Receiver> targets, List<Panel> surfaces)
    {
        ConcurrentBag<Ray> rays = new ConcurrentBag<Ray>();
        for (int i = 0; i < sources.Count; i++)
        {
            Parallel.For(
                0,
                targets.Count,
                j =>
                {
                    Line path = new Line(sources[i].Position, targets[j].Position);
                    Ray ray = new Ray(path, i, j);
                    if (Utils.CheckObstacles(ray, surfaces))
                    {
                        rays.Add(ray);
                    }
                }
            );
        }
    }
}
The Grasshopper implementation just collects sources, targets, and surfaces, calls the Solve method, and returns rays.
I understand that dispatching workload to threads is expensive, but is it really that expensive?
Or is the ConcurrentBag just preventing parallel calculation?
Also, my classes are immutable (?), but if I use a plain List the kernel aborts the operation and throws an exception; can someone tell me why?
Without a good Minimal, Complete, and Verifiable code example that reliably reproduces the problem, it is not possible to provide a definitive answer. The code you posted does not even appear to be an excerpt of real code, because the declared return type of the method does not match what the method actually returns.
However, the code you posted certainly does not seem like a good use of Parallel.For(). Your Line constructor would have to be fairly expensive to justify parallelizing the creation of the items. And to be clear, that is the only possible win here.
At the end, you still need to aggregate all of the Line instances you created into a single list, so all those intermediate lists created for the Parallel.For() tasks are pure overhead. And the aggregation is necessarily serialized (i.e. only one thread at a time can add an item to the result collection), and in the worst way (each thread gets to add only a single item before it gives up the lock and another thread has a chance to take it).
Frankly, you'd be better off storing each task's local List<T> in a collection and aggregating them all at once in the main thread after Parallel.For() returns. Not that this is likely to make the code perform better than a straight-up non-parallelized implementation, but at least it would be less likely to be worse. :)
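A minimal sketch of that pattern, using the question's types and assuming it runs inside the outer loop over sources (so i is in scope; localLists is a new helper variable):

var localLists = new ConcurrentQueue<List<Ray>>();

Parallel.For(
    0,
    targets.Count,
    () => new List<Ray>(),                   // localInit: one private list per task
    (j, state, local) =>
    {
        Line path = new Line(sources[i].Position, targets[j].Position);
        Ray ray = new Ray(path, i, j);
        if (Utils.CheckObstacles(ray, surfaces))
        {
            local.Add(ray);                  // no contention: the list is task-local
        }
        return local;
    },
    local => localLists.Enqueue(local));     // localFinally: thread-safe hand-off

// Aggregate once, on the calling thread, after Parallel.For() returns.
var rays = new List<Ray>();
foreach (var list in localLists)
{
    rays.AddRange(list);
}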
The bottom line is that you don't seem to have a workload that could benefit from parallelization. If you think otherwise, you'll need to explain the basis for that thought in a clearer, more detailed way.
if I use a plain List the kernel aborts the operation and throws an exception, is someone able to tell why?
You're already using (it appears) List<T> as the local data for each task, and indeed that should be fine, since tasks don't share their local data.
But if you are asking why you get an exception when you use List<T> instead of ConcurrentBag<T> for the result variable, that is entirely to be expected. The List<T> class is not thread-safe, yet Parallel.For() allows each task it runs to execute the localFinally delegate concurrently with all the others. So you have multiple threads all trying to modify the same non-thread-safe collection concurrently. That is a recipe for disaster. You are fortunate to get the exception; the actual behavior is undefined, and you are just as likely to silently corrupt the data structure as to cause a run-time exception.

Do I need to check Parallel.ForEach for small workloads?

I have a set of items in workload that can run in parallel. Sometimes there is just one; sometimes there are a lot. It is noticeably faster when I Parallel.ForEach over a large workload, so this seems to be a suitable use of Parallel.ForEach.
I've read that for workloads of 1 or 2 items it's better not to Parallel.ForEach over them. Does this mean I have to wrap every Parallel.ForEach of a variable workload size in this sort of pattern?
if (workItems.Count == 1)
{
    foreach (MyItem item in workItems)
    {
        bool failed = WorkItWorkIt(item);
    }
}
else
{
    Parallel.ForEach(workItems, (item, loopState) =>
    {
        bool failed = WorkItWorkIt(item);
    });
}
Well, Parallel.ForEach with a single task will add a bit of overhead, but not a tremendous amount. Parallel.ForEach will actually reuse the main thread, which means there are really only some small checks before it runs your work. Two work items can already be worthwhile, and more is only better.
The bigger issue with Parallel.ForEach isn't actually the number of items in the collection; it's the amount and type of work being done per item. If the loop bodies are very small (in terms of CPU time, etc.), the cost of parallelizing can quickly outweigh the benefits.
If your work per item is CPU-bound and reasonably large, then it's fairly safe to always parallelize, without checks.
That being said, the best solution, as always, is to actually profile your application with all of your various options.
For most common tasks, the CPU cost of sometimes incurring the overhead of Parallel.ForEach() for small workloads will be small compared to the development cost of maintaining two code paths, one for small and one for large workloads.
Unless measurements in your use case indicate to the contrary, I would opt for a single Parallel.ForEach() implementation.
It's definitely worth it if your task is performance-critical, and especially if your method WorkItWorkIt() needs some kind of locking (e.g. write operations on a DataTable), but I wouldn't test for only 1 or 2 elements. Instead, I'd have a threshold and decide on the best strategy based on the data set size.
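A minimal sketch of that threshold idea (the cutoff value is hypothetical and should be tuned by measurement):

const int ParallelThreshold = 64; // hypothetical cutoff; tune by measuring

if (workItems.Count < ParallelThreshold)
{
    // Small workload: a plain sequential loop avoids the parallel overhead.
    foreach (MyItem item in workItems)
    {
        bool failed = WorkItWorkIt(item);
    }
}
else
{
    // Large workload: let the TPL partition the items.
    Parallel.ForEach(workItems, item =>
    {
        bool failed = WorkItWorkIt(item);
    });
}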
Example: I worked on a real test case where Parallel.ForEach() took 4 seconds to process 1200 rows in a DataTable, while the single-threaded linear for loop did the same task in only 2 seconds (half the time).

Parallelization of long-running processes and performance optimization

I would like to parallelize an application that processes multiple video clips frame by frame. The sequence of frames within each clip matters (obviously).
I decided to go with TPL Dataflow, since I believe this is a good example of dataflow (movie frames being the data).
So I have one process that loads frames from a database (let's say in batches of 500, all bunched up).
Example sequence:
|mid:1 fr:1|mid:1 fr:2|mid:2 fr:1|mid:3 fr:1|mid:1 fr:3|mid:2 fr:2|mid:2 fr:3|mid:1 fr:4|
and posts them to a BufferBlock. To this BufferBlock I have linked ActionBlocks with a filter, so that there is one ActionBlock per MovieID, which gives me a kind of data partitioning. Each ActionBlock is sequential, but ideally multiple ActionBlocks for multiple movies can run in parallel.
I do have the network described above working, and it does run in parallel, but by my calculations only eight to ten ActionBlocks are executing simultaneously. I timed each ActionBlock's running time and it's around 100-200 ms.
What steps can I take to at least double the concurrency?
I did try converting the action delegates to async methods and making the database access within the ActionBlock's delegate asynchronous, but it did not help.
EDIT: I implemented an extra level of data partitioning: frames for movies with odd IDs are processed on ServerA, frames for even movies on ServerB. Both instances of the application hit the same database. If my problem were DB I/O, I would see little or no improvement in the total frames-processed count (under 20%). But I do see it doubling. This leads me to conclude that the thread pool is not spawning more threads to process more frames in parallel (both servers are quad-cores, and the profiler shows about 25-30 threads per application).
Some assumptions:
From your example data, you are receiving movie frames (and possibly the frames within each movie) out of order.
Your ActionBlock<T> instances are generic; they all call the same method for processing, and you just create a list of them based on each movie ID (you have the list of movie IDs beforehand), like so:
// The movie IDs
IEnumerable<int> movieIds = ...;

// The actions.
var actions = movieIds.Select(
    i => new { Id = i, Action = new ActionBlock<Frame>(MethodToProcessFrame) });

// The buffer block.
BufferBlock<Frame> bufferBlock = ...;

// Link everything up.
foreach (var action in actions)
{
    // Not necessary in C# 5.0, but still good practice:
    // take a copy of the loop variable for the closure.
    var actionCopy = action;

    // Link, filtering on this movie's ID.
    bufferBlock.LinkTo(actionCopy.Action, f => f.MovieId == actionCopy.Id);
}
If this is the case, you're creating too many ActionBlock<T> instances that aren't being given work; because your frames (and possibly movies) arrive out of order, you have no guarantee that all of the ActionBlock<T> instances will have work to do.
Additionally, when you create an ActionBlock<T> instance, it is created with a MaxDegreeOfParallelism of 1, which makes it thread-safe because only one thread can access the block at a time.
Additionally, the TPL Dataflow library ultimately relies on the Task<TResult> class, which schedules on the thread pool by default. The thread pool is going to do a few things here:
Make sure that all processor cores are saturated. This is very different from making sure that your ActionBlock<T> instances are saturated, and this is the metric you should be concerned with.
While the processor cores are saturated, make sure the work is distributed evenly and that not too many concurrent tasks are executing (context switches are expensive).
It also looks like the method that processes your movies is generic, and it doesn't matter which frame from which movie is passed in (if it does matter, you need to update your question, as that changes a lot of things). This would also mean that it's thread-safe.
Also, if it can be assumed that the processing of one frame doesn't rely on the processing of previous frames (or the frames of each movie come in order), you can use a single ActionBlock<T> but turn up the MaxDegreeOfParallelism value, like so:
// The buffer block.
BufferBlock<Frame> bufferBlock = ...;

// Have *one* ActionBlock<T>.
var action = new ActionBlock<Frame>(MethodToProcessFrame,
    // This is where you tweak the concurrency:
    new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = 4,
    }
);

// Link. No filter needed.
bufferBlock.LinkTo(action);
Now your ActionBlock<T> will always be saturated. Granted, any responsible task scheduler (the thread pool, by default) will still limit the maximum amount of concurrency, but it will do as much as it reasonably can at the same time.
To that end, if your action is truly thread-safe, you can set MaxDegreeOfParallelism to DataflowBlockOptions.Unbounded, like so:
// Have *one* ActionBlock<T>.
var action = new ActionBlock<Frame>(MethodToProcessFrame,
    // This is where you tweak the concurrency:
    new ExecutionDataflowBlockOptions
    {
        // We're thread-safe; let the scheduler determine
        // how nuts we can go.
        MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded,
    }
);
Of course, all of this assumes that everything else is optimal (I/O reads/writes, etc.)
Odds are that's the optimal degree of parallelization. The thread pool is honestly pretty darn good at determining the optimal number of active threads. My guess is that your hardware can support about that many processes actually working in parallel. If you added more, you wouldn't actually increase throughput; you'd just spend more time context switching between threads and less time actually working on them.
If you notice that, over an extended period of time, your CPU load, memory bus, network connection, disk access, etc. are all working below capacity, then you might have a problem and should check what is actually the bottleneck. Chances are, though, that some resource somewhere is at its capacity, and the TPL has recognized that and is making sure it doesn't oversaturate that resource.
I suspect you are I/O-bound. The question is where: on the read or on the write? Are you writing more data than you are reading? CPU may be under 50% because it cannot write out any faster.
I am not saying the ActionBlock is wrong, but I would consider a producer-consumer setup with BlockingCollection. Optimize how you read and write data.
This is different, but I have an app where I read blocks of text, parse the text, and then write the words back to SQL. I read on a single thread, then parallelize the parse, and then write on a single thread. I write on a single thread so as not to fracture indexes. If you are I/O-bound, you need to figure out which I/O is the slowest and then optimize that process.
Tell me more about that I/O.
In the question you mention reading from a database as well.
I would give BlockingCollection a try.
BlockingCollection Class
Give each one a size limit so you don't blow memory.
Make each just big enough that it (almost) never goes empty.
The BlockingCollection after the slowest step is the one that will go empty.
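A minimal sketch of that shape, assuming the read -> parse -> write split described above (the file name and the parse step are placeholders):

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// Bounded queues keep memory in check; the queue after the slowest
// stage is the one that runs empty.
var parseQueue = new BlockingCollection<string>(boundedCapacity: 1000);
var writeQueue = new BlockingCollection<string>(boundedCapacity: 1000);

var reader = Task.Run(() =>
{
    foreach (var line in File.ReadLines("input.txt")) // placeholder input
        parseQueue.Add(line);                         // blocks when the queue is full
    parseQueue.CompleteAdding();
});

var parser = Task.Run(() =>
{
    // Parallelize only the CPU-bound middle stage.
    Parallel.ForEach(parseQueue.GetConsumingEnumerable(), line =>
    {
        writeQueue.Add(line.ToUpperInvariant()); // stand-in for real parsing
    });
    writeQueue.CompleteAdding();
});

var writer = Task.Run(() =>
{
    foreach (var item in writeQueue.GetConsumingEnumerable())
    {
        // Single writer: one thread holds the output open, so indexes
        // are not fractured by concurrent writes.
    }
});

Task.WaitAll(reader, parser, writer);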
If you can process in parallel, then do so.
What I have found is that parallel inserts into a table are not faster.
Let one process take the lock, hold it, and keep that hose open.
Look closely at how you insert.
One row at a time is slow.
I use TVPs and insert 10,000 rows at a time, but a lot of people like Dapper or BulkInsert; a bulk-insert sketch follows below.
Dropping indexes and triggers and inserting sorted by the clustered index will be fastest.
Take a TABLOCK and hold it.
I am getting inserts in the 10 ms range.
Right now the update is the slowest step.
Look at that: are you doing just one row at a time?
Look at taking a TABLOCK and processing by video clip.
Unless it is an ugly update, it should not take longer than an insert.
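A minimal sketch of such a bulk insert holding a table lock (the connection string, table name, and frameTable are hypothetical; this uses SqlBulkCopy rather than TVPs, but the idea is the same):

using System.Data;
using System.Data.SqlClient;

// Insert many rows per round trip while holding a table lock.
using (var bulk = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.TableLock))
{
    bulk.DestinationTableName = "dbo.Frames"; // hypothetical target table
    bulk.BatchSize = 10000;                   // 10,000 rows at a time, not one
    bulk.WriteToServer(frameTable);           // a DataTable pre-sorted by the clustered index
}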

Performance measurement of individual threads in WaitAll construction

Say I'm writing a piece of software that simulates a user performing certain actions on a system. I'm measuring the amount of time it takes for such an action to complete using a stopwatch.
Most of the time this is pretty straightforward: the click of a button is simulated, and some service call is associated with this button. The time it takes for this service call to complete is measured.
Now comes the crux: some actions have more than one service call associated with them. Since they're all still part of the same logical action, I'm 'grouping' them using the signalling mechanism offered by C#, like so (pseudo):
var syncResultList = new List<WaitHandle>();
var syncResultOne = service.BeginGetStuff();
var syncResultTwo = service.BeginDoOtherStuff();
syncResultList.Add(syncResultOne.AsyncWaitHandle);
syncResultList.Add(syncResultTwo.AsyncWaitHandle);
WaitHandle.WaitAll(syncResultList.ToArray());
var retValOne = service.EndGetStuff(syncResultOne);
var retValTwo = service.EndDoOtherStuff(syncResultTwo);
So GetStuff and DoOtherStuff constitute one logical piece of work for that particular action. And, of course, I can easily measure the amount of time it takes for this conjunction of methods to complete by just placing a stopwatch around them. But I need a more fine-grained approach for my statistics. I'm really interested in the amount of time it takes for each of the methods to complete, without losing the 'grouped' semantics provided by WaitHandle.WaitAll.
What I've done to overcome this was write a wrapper class (or rather a code-generation file) which implements a timing mechanism using a callback; since I'm not that interested in the actual result (save exceptions, which are part of the statistic), I just let it return a statistic. But this turned out to be a performance drain somehow.
So, basically, I'm looking for an alternative to this approach. Maybe it's much simpler than I think, but I can't seem to figure it out by myself at the moment.
This looks like a prime candidate for Tasks (assuming you're using C# 4).
You can create Tasks from your APM methods using Task.Factory.FromAsync (see MSDN).
You can then use all the rich TPL goodness, like individual continuations.
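A minimal sketch of that idea, assuming the Begin/End pairs from the question have standard APM signatures (AsyncCallback, object); each continuation stops its own stopwatch, while Task.WaitAll preserves the grouped semantics:

using System;
using System.Diagnostics;
using System.Threading.Tasks;

var swOne = Stopwatch.StartNew();
var taskOne = Task.Factory
    .FromAsync(service.BeginGetStuff, service.EndGetStuff, null)
    .ContinueWith(t => { swOne.Stop(); return t.Result; });

var swTwo = Stopwatch.StartNew();
var taskTwo = Task.Factory
    .FromAsync(service.BeginDoOtherStuff, service.EndDoOtherStuff, null)
    .ContinueWith(t => { swTwo.Stop(); return t.Result; });

// The grouped semantics are preserved: wait for the logical action as a whole.
Task.WaitAll(taskOne, taskTwo);

Console.WriteLine("GetStuff took " + swOne.ElapsedMilliseconds + " ms");
Console.WriteLine("DoOtherStuff took " + swTwo.ElapsedMilliseconds + " ms");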
If your needs are simple enough, an easy alternative would be to record each service call individually, then calculate the logical action from the individual service calls.
I.e., if logical action A is made of parallel service calls B and C, where B took 2 seconds and C took 1 second, then A takes 2 seconds:
A = Max(B, C)

Multithreaded reading and processing of large text files

I have 10 lists of over 100 MB each with emails, and I want to process them using multiple threads as fast as possible and without loading them into memory (something like reading line by line, or reading small blocks).
I have created one function that removes invalid addresses based on a regex, and another that sorts them into other lists by domain.
I managed to do it using one thread with:
while (reader.Peek() != -1)
but it takes too damn long.
How can I use multiple threads (around 100-200) and maybe a BackgroundWorker or something, so I can still use the form while the lists are processed in parallel?
I'm new to C# :P
Unless the data is on multiple physical disks, chances are that more than a few threads will slow down, rather than speed up, the process.
What will happen is that rather than reading consecutive data (pretty fast), you end up seeking to one place to read data for one thread, then seeking somewhere else to read data for another thread, and so on. Seeking is relatively slow, so it ends up slower, often quite a lot slower.
About the best you can do is dedicate one thread to reading data from each physical disk and another to processing the data; unless your processing is quite complex, or you have a lot of fast hard drives, one processing thread may be entirely adequate.
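A minimal sketch of that split, with a hypothetical file location and a simplistic validation regex (one thread streams the lines, one thread consumes them):

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

var lines = new BlockingCollection<string>(boundedCapacity: 10000);
var emailRegex = new Regex(@"^\S+@\S+\.\S+$", RegexOptions.Compiled); // simplistic

// One reader: streams each file line by line, never loading a whole list.
var reader = Task.Run(() =>
{
    foreach (var path in Directory.EnumerateFiles("lists", "*.txt")) // hypothetical folder
        foreach (var line in File.ReadLines(path))
            lines.Add(line);
    lines.CompleteAdding();
});

// One processor: drops invalid addresses and buckets the rest by domain.
var processor = Task.Run(() =>
{
    var byDomain = new Dictionary<string, List<string>>();
    foreach (var email in lines.GetConsumingEnumerable())
    {
        if (!emailRegex.IsMatch(email)) continue;
        var domain = email.Substring(email.IndexOf('@') + 1);
        List<string> bucket;
        if (!byDomain.TryGetValue(domain, out bucket))
        {
            bucket = new List<string>();
            byDomain[domain] = bucket;
        }
        bucket.Add(email);
    }
});

Task.WaitAll(reader, processor);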
There are multiple approaches:
1.) You can create threads explicitly, like Thread t = new Thread(...), but this approach is expensive in terms of creating and managing threads.
2.) You can use the .NET ThreadPool and pass your function's address to the QueueUserWorkItem static method of the ThreadPool class. This approach needs some manual code management and synchronization primitives.
3.) You can create an array of System.Threading.Tasks.Task, each processing one list; they execute in parallel using all the processors available on the machine. Pass that array to Task.WaitAll(Task[]) to wait for their completion. This approach is called task parallelism, and you can find detailed information on MSDN:
Task[] tasks = new Task[10];
for (int i = 0; i < 10; i++)
{
    // Automatically creates an async task and executes it on a ThreadPool thread
    tasks[i] = Task.Factory.StartNew([address of function/lambda expression]);
}
try
{
    // Wait for all tasks to complete
    Task.WaitAll(tasks);
}
catch (AggregateException ae)
{
    // Handle the aggregate exception here.
    // It is raised if one or more tasks throw an exception; all the exceptions
    // from faulting tasks are accumulated in this exception object.
}
// Continue your processing further
You will want to take a look at the Task Parallel Library (TPL).
This library is made precisely for parallel work. It will perform your action on the thread pool in whatever is (typically) the most efficient fashion. The one caution is that if you run 100-200 threads at one time, you run into heavy context switching, unless you have 100-200 processors. A good rule of thumb is to run only as many tasks in parallel as you have processors.
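A minimal sketch of that rule of thumb (filePaths is a hypothetical array of list files; File.ReadLines streams, so nothing is loaded wholesale):

using System;
using System.IO;
using System.Threading.Tasks;

var options = new ParallelOptions
{
    // Cap the worker count at the core count instead of spawning 100-200 threads.
    MaxDegreeOfParallelism = Environment.ProcessorCount
};

Parallel.ForEach(filePaths, options, path =>
{
    foreach (var line in File.ReadLines(path)) // streams line by line
    {
        // Validate / bucket the email here.
    }
});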
Some other good resources to review how to use the TPL:
Why and how to use the TPL
How to start a task.
I would be inclined to use Parallel LINQ (PLINQ).
Something along the lines of:
Lists.AsParallel()
    .SelectMany(list => list)
    .Where(MyItemFilteringFunction)
    .GroupBy(DomainExtractionFunction)
AsParallel tells LINQ it can do this in parallel (which means the ordering of everything that follows will not be maintained).
SelectMany takes your individual lists and unrolls them so that all items from all lists are effectively in a single enumerable.
Where filters the items using your predicate function.
GroupBy collects them by key, where DomainExtractionFunction is a function that extracts a key (the domain name, in your case) from an item (i.e., the email).
