Implementation of traversing of chunk partition in PLINQ

Implementation of traversing of chunk partition in PLINQ - c#

I came across on implementation ContiguousChunkLazyEnumerator class, which is used by PLINQ (traversing of chunk is performed with this iterator). MoveNext method uses thread safe access to source IEnumerator (by using speficied lock), moreover, it saves results of access to internal buffer. It is brief piece of code:
lock (m_sourceSyncLock)
{
// Some .net stuff
try
{
for (; i < mutables.m_nextChunkMaxSize && m_source.MoveNext(); i++)
{
// Read the current entry into our buffer.
chunkBuffer[i] = m_source.Current;
}
}
// Some .net stuff
}
Such iterator will be used by worker threads (N worker threads work with the same iterator). But I really don't understand benefits of such parallel approach. Usage of lock in this context should kill any performance benefits. My assumption is that sequal access by the only worker thread should work with the same speed.

This is because using PLINQ optimizes for concurrent processing of the items, not for the concurrent enumeration of the items.
The heavy lock is done per chunk, so multiple threads will yield to each other between chunks.
This really shines when you have an IEnumerable that is quick to enumerate (like List<T> for example, in reality, there are internal optimisations for List<T>, so not the best example), and want to do some slow computational work on the results.
This code is about creating partitioned data to then be consumed by multiple threads. While it is thread-safe, it is not supposed to be about the fastest concurrent enumeration. It is optimised for data locality.

Related

Thread Contention on a ConcurrentDictionary in C#

I have a C# .NET program that uses an external API to process events for real-time stock market data. I use the API callback feature to populate a ConcurrentDictionary with the data it receives on a stock-by-stock basis.
I have a set of algorithms that each run in a constant loop until a terminal condition is met. They are called like this (but all from separate calling functions elsewhere in the code):
Task.Run(() => ExecutionLoop1());
Task.Run(() => ExecutionLoop2());
...
Task.Run(() => ExecutionLoopN());
Each one of those functions calls SnapTotals():
public void SnapTotals()
{
foreach (KeyValuePair<string, MarketData> kvpMarketData in
new ConcurrentDictionary<string, MarketData>(Handler.MessageEventHandler.Realtime))
{
...
The Handler.MessageEventHandler.Realtime object is the ConcurrentDictionary that is updated in real-time by the external API.
At a certain specific point in the day, there is an instant burst of data that comes in from the API. That is the precise time I want my ExecutionLoop() functions to do some work.
As I've grown the program and added more of those execution loop functions, and grown the number of elements in the ConcurrentDictionary, the performance of the program as a whole has seriously degraded. Specifically, those ExecutionLoop() functions all seem to freeze up and take much longer to meet their terminal condition than they should.
I added some logging to all of the functions above, and to the function that updates the ConcurrentDictionary. From what I can gather, the ExecutionLoop() functions appear to access the ConcurrentDictionary so often that they block the API from updating it with real-time data. The loops are dependent on that data to meet their terminal condition so they cannot complete.
I'm stuck trying to figure out a way to re-architect this. I would like for the thread that updates the ConcurrentDictionary to have a higher priority but the message events are handled from within the external API. I don't know if ConcurrentDictionary was the right type of data structure to use, or what the alternative could be, because obviously a regular Dictionary would not work here. Or is there a way to "pause" my execution loops for a few milliseconds to allow the market data feed to catch up? Or something else?

Your basic approach is sound except for one fatal flaw: they are all hitting the same dictionary at the same time via iterators, sets, and gets. So you must do one thing: in SnapTotals you must iterate over a copy of the concurrent dictionary.
When you iterate over Handler.MessageEventHandler.Realtime or even new ConcurrentDictionary<string, MarketData>(Handler.MessageEventHandler.Realtime) you are using the ConcurrentDictionary<>'s iterator, which even though is thread-safe, is going to be using the dictionary for the entire period of iteration (including however long it takes to do the processing for each and every entry in the dictionary). That is most likely where the contention occurs.
Making a copy of the dictionary is much faster, so should lower contention.
Change SnapTotals to
public void SnapTotals()
{
var copy = Handler.MessageEventHandler.Realtime.ToArray();
foreach (var kvpMarketData in copy)
{
...
Now, each ExecutionLoopX can execute in peace without write-side contention (your API updates) and without read-side contention from the other loops. The write-side can execute without read-side contention as well.
The only "contention" should be for the short duration needed to do each copy.
And by the way, the dictionary copy (an array) is not threadsafe; it's just a plain array, but that is ok because each task is executing in isolation on its own copy.

I think that your main problem is not related to the ConcurrentDictionary, but to the large number of ExecutionLoopX methods. Each of these methods saturates a CPU core, and since the methods are more than the cores of your machine, the whole CPU is saturated. My assumption is that if you find a way to limit the degree of parallelism of the ExecutionLoopX methods to a number smaller than the Environment.ProcessorCount, your program will behave and perform better. Below is my suggestion for implementing this limitation.
The main obstacle is that currently your ExecutionLoopX methods are monolithic: they can't be separated to pieces so that they can be parallelized. My suggestion is to change their return type from void to async Task, and place an await Task.Yield(); inside the outer loop. This way it will be possible to execute them in steps, with each step being the code from the one await to the next.
Then create a TaskScheduler with limited concurrency, and a TaskFactory that uses this scheduler:
int maxDegreeOfParallelism = Environment.ProcessorCount - 1;
TaskScheduler scheduler = new ConcurrentExclusiveSchedulerPair(
TaskScheduler.Default, maxDegreeOfParallelism).ConcurrentScheduler;
TaskFactory taskFactory = new TaskFactory(scheduler);
Now you can parallelize the execution of the methods, by starting the tasks with the taskFactory.StartNew method instead of the Task.Run:
List<Task> tasks = new();
tasks.Add(taskFactory.StartNew(() => ExecutionLoop1(data)).Unwrap());
tasks.Add(taskFactory.StartNew(() => ExecutionLoop2(data)).Unwrap());
tasks.Add(taskFactory.StartNew(() => ExecutionLoop3(data)).Unwrap());
tasks.Add(taskFactory.StartNew(() => ExecutionLoop4(data)).Unwrap());
//...
Task.WaitAll(tasks.ToArray());
The .Unwrap() is needed because the taskFactory.StartNew returns a nested task (Task<Task>). The Task.Run method is also doing this unwrapping internally, when the action is asynchronous.
An online demo of this idea can be found here.
The Environment.ProcessorCount - 1 configuration means that one CPU core will be available for other work, like the communication with the external API and the updating of the ConcurrentDictionary.
A more cumbersome implementation of the same idea, using iterators and the Parallel.ForEach method instead of async/await, can be found in the first revision of this answer.

If you're not squeamish about mixing operations in a task, you could redesign such that instead of task A doing A things, B doing B things, C doing C things, etc. you can reduce the number of tasks to the number of processors, and thus run fewer concurrently, greatly easing contention.
So, for example, say you have just two processors. Make a "general purpose/pluggable" task wrapper that accepts delegates. So, wrapper 1 would accept delegates to do A and B work. Wrapper 2 would accept delegates to do C and D work. Then ask each wrapper to spin up a task that calls the delegates in a loop over the dictionary.
This would of course need to be measured. What I am proposing is, say, 4 tasks each doing 4 different types of processing. This is 4 units of work per loop over 4 loops. This is not the same as 16 tasks each doing 1 unit of work. In that case you have 16 loops.
16 loops intuitively would cause more contention than 4.
Again, this is a potential solution that should be measured. There is one drawback for sure: you will have to ensure that a piece of work within a task doesn't affect any of the others.

Lock Free Concurrent Queue

This code snippet is from ConcurrentQueue implementation given from here.
internal bool TryPeek(out T result)
{
result = default(T);
int lowLocal = Low;
if (lowLocal > High)
return false;
SpinWait spin = new SpinWait();
while (m_state[lowLocal] == 0)
{
spin.SpinOnce();
}
result = m_array[lowLocal];
return true;
}
Is it really lock-free instead of spinning?

Spinning is a lock. This is stated in MSDN, Wikipedia and many other resources.
It's not about word. Lock-free is a guarantee. It doesn't mean that the code shouldn't use lock statement. Algorithm is lock-free if there is guaranteed system-wide progress. I don't see any difference between this code and the code using locks. The only difference is that the spin uses busy wait and thread yielding instead of putting thread in a sleep mode.
I don't see how this guarantees system-wide process, so personally I think that this is not a lock-free implementation. At least not this function.

Lock free means not using locks. Spinwaiting is not locking. There are a number of methods of synchronizing access to data without using locks. Performing spin waits is one (of many) options. Not all lock-free code will use spin-waits.

Spinning places the CPU in a tight loop without yielding the rest of it's current slice of processor time, avoiding problems that a user-provided loop may create. This can be useful if it is known that the state change is imminent. It is rare for this to be the best option for ordinary code, and represents an alternative to locking for this specialized situation.
So yes, the code is lock-free as the term lock is used in the .NET Framework.
http://msdn.microsoft.com/en-us/library/hh228603.aspx

multithread read and process large text files

I have 10 lists of over 100Mb each with emails and I wanna process them using multithreads as fast as possible and without loading them into memory (something like reading line by line or reading small blocks)
I have created a function which is removing invalid ones based on a regex and another one which is organizing them based on each domain to other lists.
I managed to do it using one thread with:
while (reader.Peek() != -1)
but it takes too damn long.
How can I use multithreads (around 100 - 200) and maybe a backgroundworker or something to be able to use the form while processing the lists in parallel?
I'm new to csharp :P

Unless the data is on multiple physical discs, chances are that any more than a few threads will slow down, rather than speed up, the process.
What'll happen is that rather than reading consecutive data (pretty fast), you'll end up seeking to one place to read data for one thread, then seeking to somewhere else to read data for another thread, and so on. Seeking is relatively slow, so it ends up slower -- often quite a lot slower.
About the best you can do is dedicate one thread to reading data from each physical disc, then another to process the data -- but unless your processing is quite complex, or you have a lot of fast hard drives, one thread for processing may be entirely adequate.

There are multiple approaches to it:
1.) You can create threads explicitly like Thread t = new Thread(), but this approach is expensive on creating and managing a thread.
2.) You can use .net ThreadPool and pass your executing function's address to QueueUserWorkItem static method of ThreadPool Class. This approach needs some manual code management and synchronization primitives.
3.) You can create an array of System.Threading.Tasks.Task each processing a list which are executed parallely using all your available processors on the machine and pass that array to task.WaitAll(Task[]) to wait for their completion. This approach is related to Task Parallelism and you can find detailed information on MSDN
Task[] tasks = null;
for(int i = 0 ; i < 10; i++)
{
//automatically create an async task and execute it using ThreadPool's thread
tasks[i] = Task.StartNew([address of function/lambda expression]);
}
try
{
//Wait for all task to complete
Task.WaitAll(tasks);
}
catch (AggregateException ae)
{
//handle aggregate exception here
//it will be raised if one or more task throws exception and all the exceptions from defaulting task get accumulated in this exception object
}
//continue your processing further

You will want to take a look at the Task Parallel Library (TPL).
This library is made for parallel work, in fact. It will perform your action on the Threadpool in whatever is the most efficient fashion (typically). The only thing that I would caution is that if you run 100-200 threads at one time, then you possibly run into having to deal with context switching. That is, unless you have 100-200 processors. A good rule of thumb is to only run as many tasks in parallel as you have processors.
Some other good resources to review how to use the TPL:
Why and how to use the TPL
How to start a task.

I would be inclined to use parallel linq (plinq).
Something along the lines of:
Lists.AsParallel()
.SelectMany(list => list)
.Where(MyItemFileringFunction)
.GroupBy(DomainExtractionFunction)
AsParallel tells linq it can do this in parallel (which will mean the ordering of everything following will not be maintained)
SelectMany takes your individual lists and unrolls them such that all all items from all lists are effectivly in a single Enumerable
Where filers the items using your predicate function
GroupBy collects them by key, where DomainExtractionFunction is a function which gets a key (the domain name in your case) from the items (ie, the email)

How to speed up routines making use of collections in multithreading scenario

I've an application that makes use of parallelization for processing data.
The main program is in C#, while one of the routine for analyzing data is on an external C++ dll. This library scans data and calls a callback everytime a certain signal is found within the data. Data should be collected, sorted and then stored into HD.
Here is my first simple implementation of the method invoked by the callback and of the method for sorting and storing data:
// collection where saving found signals
List<MySignal> mySignalList = new List<MySignal>();
// method invoked by the callback
private void Collect(int type, long time)
{
lock(locker) { mySignalList.Add(new MySignal(type, time)); }
}
// store signals to disk
private void Store()
{
// sort the signals
mySignalList.Sort();
// file is a object that manages the writing of data to a FileStream
file.Write(mySignalList.ToArray());
}
Data is made up of a bidimensional array (short[][] data) of size 10000 x n, with n variable. I use parallelization in this way:
Parallel.For(0, 10000, (int i) =>
{
// wrapper for the external c++ dll
ProcessData(data[i]);
}
Now for each of the 10000 arrays I estimate that 0 to 4 callbacks could be fired. I'm facing a bottleneck and given that my CPU resources are not over-utilized, I suppose that the lock (together with thousand of callbacks) is the problem (am I right or there could be something else?). I've tried the ConcurrentBag collection but performances are still worse (in line with other user findings).
I thought that a possible solution for use lock-free code would be to have multiple collections. Then it would be necessary a strategy to make each thread of the parallel process working on a single collection. Collections could be for instance inside a dictionary with thread ID as key, but I do not know any .NET facility for this (I should know the threads ID for initialize the dictionary before launching the parallelization). Could be this idea feasible and, in case yes, does exist some .NET tool for this? Or alternatively, any other idea to speed up the process?
[EDIT]
I've followed the Reed Copsey's suggestion and I used the following solution (according to the profiler of VS2010, before the burden for locking and adding to the list was taking 15% of the resources, while now only 1%):
// master collection where saving found signals
List<MySignal> mySignalList = new List<MySignal>();
// thread-local storage of data (each thread is working on its List<MySignal>)
ThreadLocal<List<MySignal>> threadLocal;
// analyze data
private void AnalizeData()
{
using(threadLocal = new ThreadLocal<List<MySignal>>(() =>
{ return new List<MySignal>(); }))
{
Parallel.For<int>(0, 10000,
() =>
{ return 0;},
(i, loopState, localState) =>
{
// wrapper for the external c++ dll
ProcessData(data[i]);
return 0;
},
(localState) =>
{
lock(this)
{
// add thread-local lists to the master collection
mySignalList.AddRange(local.Value);
local.Value.Clear();
}
});
}
}
// method invoked by the callback
private void Collect(int type, long time)
{
local.Value.Add(new MySignal(type, time));
}

thought that a possible solution for use lock-free code would be to have multiple collections. Then it would be necessary a strategy to make each thread of the parallel process working on a single collection. Collections could be for instance inside a dictionary with thread ID as key, but I do not know any .NET facility for this (I should know the threads ID for initialize the dictionary before launching the parallelization). Could be this idea feasible and, in case yes, does exist some .NET tool for this? Or alternatively, any other idea to speed up the process?
You might want to look at using ThreadLocal<T> to hold your collections. This automatically allocates a separate collection per thread.
That being said, there are overloads of Parallel.For which work with local state, and have a collection pass at the end. This, potentially, would allow you to spawn your ProcessData wrapper, where each loop body was working on its own collection, and then recombine at the end. This would, potentially, eliminate the need for locking (since each thread is working on it's own data set) until the recombination phase, which happens once per thread (instead of once per task,ie: 10000 times). This could reduce the number of locks you're taking from ~25000 (0-4*10000) down to a few (system and algorithm dependent, but on a quad core system, probably around 10 in my experience).
For details, see my blog post on aggregating data with Parallel.For/ForEach. It demonstrates the overloads and explains how they work in more detail.

You don't say how much of a "bottleneck" you're encountering. But let's look at the locks.
On my machine (quad core, 2.4 GHz), a lock costs about 70 nanoseconds if it's not contended. I don't know how long it takes to add an item to a list, but I can't imagine that it takes more than a few microseconds. But let's it takes 100 microseconds (I would be very surprised to find that it's even 10 microseconds) to add an item to the list, taking into account lock contention. So if you're adding 40,000 items to the list, that's 4,000,000 microseconds, or 4 seconds. And I would expect one core to be pegged if this were the case.
I haven't used ConcurrentBag, but I've found the performance of BlockingCollection to be very good.
I suspect, though, that your bottleneck is somewhere else. Have you done any profiling?

The basic collections in C# aren't thread safe.
The problem you're having is due to the fact that you're locking the entire collection just to call an add() method.
You could create a thread-safe collection that only locks single elements inside the collection, instead of the whole collection.
Lets look at a linked list for example.
Implement an add(item (or list)) method that does the following:
Lock collection.
A = get last item.
set last item reference to the new item (or last item in new list).
lock last item (A).
unclock collection.
add new items/list to the end of A.
unlock locked item.
This will lock the whole collection for just 3 simple tasks when adding.
Then when iterating over the list, just do a trylock() on each object. if it's locked, wait for the lock to be free (that way you're sure that the add() finished).
In C# you can do an empty lock() block on the object as a trylock().
So now you can add safely and still iterate over the list at the same time.
Similar solutions can be implemented for the other commands if needed.

Any built-in solution for a collection is going to involve some locking. There may be ways to avoid it, perhaps by segregating the actual data constructs being read/written, but you're going to have to lock SOMEWHERE.
Also, understand that Parallel.For() will use the thread pool. While simple to implement, you lose fine-grained control over creation/destruction of threads, and the thread pool involves some serious overhead when starting up a big parallel task.
From a conceptual standpoint, I would try two things in tandem to speed up this algorithm:
Create threads yourself, using the Thread class. This frees you from the scheduling slowdowns of the thread pool; a thread starts processing (or waiting for CPU time) when you tell it to start, instead of the thread pool feeding requests for threads into its internal workings at its own pace. You should be aware of the number of threads you have going at once; the rule of thumb is that the benefits of multithreading are overcome by the overhead when you have more than twice the number of active threads as "execution units" available to execute threads. However, you should be able to architect a system that takes this into account relatively simply.
Segregate the collection of results, by creating a dictionary of collections of results. Each results collection is keyed to some token carried by the thread doing the processing and passed to the callback. The dictionary can have multiple elements READ at one time without locking, and as each thread is WRITING to a different collection within the Dictionary there shouldn't be a need to lock those lists (and even if you did lock them you wouldn't be blocking other threads). The result is that the only collection that has to be locked such that it would block threads is the main dictionary, when a new collection for a new thread is added to it. That shouldn't have to happen often if you're smart about recycling tokens.

In C# would it be better to use Queue.Synchronized or lock() for thread safety?

I have a Queue object that I need to ensure is thread-safe. Would it be better to use a lock object like this:
lock(myLockObject)
{
//do stuff with the queue
}
Or is it recommended to use Queue.Synchronized like this:
Queue.Synchronized(myQueue).whatever_i_want_to_do();
From reading the MSDN docs it says I should use Queue.Synchronized to make it thread-safe, but then it gives an example using a lock object. From the MSDN article:
To guarantee the thread safety of the
Queue, all operations must be done
through this wrapper only.
Enumerating through a collection is
intrinsically not a thread-safe
procedure. Even when a collection is
synchronized, other threads can still
modify the collection, which causes
the enumerator to throw an exception.
To guarantee thread safety during
enumeration, you can either lock the
collection during the entire
enumeration or catch the exceptions
resulting from changes made by other
threads.
If calling Synchronized() doesn't ensure thread-safety what's the point of it? Am I missing something here?

Personally I always prefer locking. It means that you get to decide the granularity. If you just rely on the Synchronized wrapper, each individual operation is synchronized but if you ever need to do more than one thing (e.g. iterating over the whole collection) you need to lock anyway. In the interests of simplicity, I prefer to just have one thing to remember - lock appropriately!
EDIT: As noted in comments, if you can use higher level abstractions, that's great. And if you do use locking, be careful with it - document what you expect to be locked where, and acquire/release locks for as short a period as possible (more for correctness than performance). Avoid calling into unknown code while holding a lock, avoid nested locks etc.
In .NET 4 there's a lot more support for higher-level abstractions (including lock-free code). Either way, I still wouldn't recommend using the synchronized wrappers.

There's a major problem with the Synchronized methods in the old collection library, in that they synchronize at too low a level of granularity (per method rather than per unit-of-work).
There's a classic race condition with a synchronized queue, shown below where you check the Count to see if it is safe to dequeue, but then the Dequeue method throws an exception indicating the queue is empty. This occurs because each individual operation is thread-safe, but the value of Count can change between when you query it and when you use the value.
object item;
if (queue.Count > 0)
{
// at this point another thread dequeues the last item, and then
// the next line will throw an InvalidOperationException...
item = queue.Dequeue();
}
You can safely write this using a manual lock around the entire unit-of-work (i.e. checking the count and dequeueing the item) as follows:
object item;
lock (queue)
{
if (queue.Count > 0)
{
item = queue.Dequeue();
}
}
So as you can't safely dequeue anything from a synchronized queue, I wouldn't bother with it and would just use manual locking.
.NET 4.0 should have a whole bunch of properly implemented thread-safe collections, but that's still nearly a year away unfortunately.

There's frequently a tension between demands for 'thread safe collections' and the requirement to perform multiple operations on the collection in an atomic fashion.
So Synchronized() gives you a collection which won't smash itself up if multiple threads add items to it simultaneously, but it doesn't magically give you a collection that knows that during an enumeration, nobody else must touch it.
As well as enumeration, common operations like "is this item already in the queue? No, then I'll add it" also require synchronisation which is wider than just the queue.

This way we don't need to lock the queue just to find out it was empty.
object item;
if (queue.Count > 0)
{
lock (queue)
{
if (queue.Count > 0)
{
item = queue.Dequeue();
}
}
}

It seems clear to me that using a lock(...) {...} lock is the right answer.
To guarantee the thread safety of the Queue, all operations must be done through this wrapper only.
If other threads access the queue without using .Synchronized(), then you'll be up a creek - unless all your queue access is locked up.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.