Deterministic random numbers in parallel code - C#

I have a question regarding thread ordering for the TPL.
Indeed, it is very important for me that my Parallel.For loop be executed in loop order. What I mean is that, given 4 threads, I would like the first thread to execute every iteration 4k, the 2nd thread every iteration 4k+1, etc., with k between 0 and NbSim/4:
1st thread -> 1st loop, 2nd thread -> 2nd loop, 3rd thread -> 3rd loop,
4th thread -> 4th loop, 1st thread -> 5th loop, etc.
I have seen the OrderedPartition directive but I am not quite sure how I should apply it to a Parallel.For loop rather than a Parallel.ForEach loop.
Many Thanks for your help.
Following the previous remarks, I am completing the description:
Actually, after some consideration, I believe that my problem is not about ordering.
Indeed, I am working on a Monte-Carlo engine, in which for each iteration I am generating a set of random numbers (always the same (seed = 0)) and then applying some business logic to them. Thus everything should be deterministic, and when running the algorithm twice I should get the exact same results. But unfortunately this is not the case, and I am struggling to understand why. Any idea how to solve this kind of problem (without printing out every variable I have)?
Edit Number 2:
Thank you all for your suggestions
First, here is the way my code is ordered :
ParallelOptions options = new ParallelOptions();
options.MaxDegreeOfParallelism = 4; // or 1
ParallelLoopResult res = Parallel.For<LocalDataStruct>(1, NbSim, options,
    () => new LocalDataStruct(/* params of the LocalData constructor */),
    (iSim, loopState, localDataStruct) => {
        // logic
        return localDataStruct;
    }, localDataStruct => {
        lock (syncObject) {
            // critical section for outputting the parameters
        }
    });
When setting MaxDegreeOfParallelism to 1, everything works fine; however, when setting it to 4 I get results that are wrong and non-deterministic (running the code twice gives different results). It is probably due to mutable objects; that is what I am checking now, but the source code is quite extensive, so it takes time. Do you think there is a good strategy for checking the code other than reviewing it (printing out all variables is impossible in this case (> 1000))? Also, when setting the number of simulations to 4 for 4 threads, everything works fine as well, mostly due to luck I believe (that's why I mentioned my first idea regarding ordering).

You can enforce ordering in PLINQ but it comes at a cost. It gives ordered results but does not enforce ordering of execution.
You really cannot do this with TPL without essentially serializing your algorithm. The TPL works on a Task model. It allows you to schedule tasks which are executed by the scheduler with no guarantee as to the order in which the Tasks are executed. Typically parallel implementations take the PLINQ approach and guarantee ordering of results not ordering of execution.
Why is ordered execution important?
So, for a Monte-Carlo engine you would need to make sure that each index in your array receives the same random numbers. This does not mean that you need to order your threads, just make sure the random numbers are ordered across the work done by each thread. So if each loop of your Parallel.ForEach was passed not only the array of elements to work on but also its own instance of a random number generator (with a different fixed seed per thread), then you would still get deterministic results.
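As a sketch of that idea (the seeding scheme here derives each iteration's seed from a fixed base seed plus the loop index, a variant of the per-thread seeding described above; all names are illustrative):

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class DeterministicMc
{
    // Each simulation index derives its own RNG from a fixed base seed,
    // so results do not depend on which thread runs which iteration.
    static double RunSimulation(int iSim, int baseSeed)
    {
        var rng = new Random(baseSeed + iSim); // hypothetical per-index seeding
        double sum = 0;
        for (int i = 0; i < 1000; i++)
            sum += rng.NextDouble();
        return sum;
    }

    static void Main()
    {
        const int nbSim = 100;
        var run1 = new double[nbSim];
        var run2 = new double[nbSim];

        Parallel.For(0, nbSim, i => run1[i] = RunSimulation(i, baseSeed: 0));
        Parallel.For(0, nbSim, i => run2[i] = RunSimulation(i, baseSeed: 0));

        Console.WriteLine(run1.SequenceEqual(run2)); // True: identical across runs
    }
}
```

Because each iteration writes only its own array slot and owns its own Random instance, no shared state is touched during the loop.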
I'm assuming that you are familiar with the challenges related to parallelizing Monte-Carlo and generating good random number sequences. If not, here is something to get you started: Pseudo-random Number Generation for Parallel Monte Carlo—A Splitting Approach; Fast, High-Quality, Parallel Random-Number Generators: Comparing Implementations.
Some suggestions
I would start off by ensuring that you can get deterministic results in the sequential case, by replacing the Parallel.ForEach with a plain foreach and checking that this runs correctly. You could also compare the output of a sequential and a parallel run: add some diagnostic output, pipe it to a text file, and use a diff tool to compare the results.
If this is OK then it is something to do with your parallel implementation, which as is pointed out below is usually related to mutable state. Some things to consider:
Is your random number generator threadsafe? Random is a poor random number generator at best and as far as I know is not designed for parallel execution. It is certainly not suitable for M-C calculations, parallel or otherwise.
Does your code have other state shared between threads? If so, what is it? This state will be mutated in a non-deterministic manner and affect your results.
Are you merging results from different threads in parallel? The non-associativity of parallel floating-point operations will also cause you issues here; see How can floating point calculations be made deterministic?. Even if the per-thread results are deterministic, if you are combining them in a non-deterministic order you will still have issues.
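The non-associativity of floating-point addition is easy to demonstrate on its own:

```csharp
using System;

class FloatOrder
{
    static void Main()
    {
        // The same three values summed with different bracketing give different answers:
        double a = 1e16, b = -1e16, c = 1.0;
        Console.WriteLine((a + b) + c); // prints 1
        Console.WriteLine(a + (b + c)); // prints 0 (b + c rounds back to -1e16)
        // A parallel reduction that merges thread results in arrival order is
        // effectively choosing a random bracketing, so totals can vary per run.
    }
}
```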

Assuming all threads share the same random number generator, then although you are generating the same sequence every time, which thread gets which elements of this sequence is non-deterministic. Hence you could arrive at different results.
That's if the random number generator is thread-safe; if it isn't, then it's not even guaranteed to generate the same sequence when called from multiple threads.
Apart from that it is difficult to theorize what could be causing non-determinism to arise; basically any global mutable state is suspicious. Each Task should be working with its own data.

If, rather than using a random number generator, you set up an array [0...N-1] of pre-determined values, say [0, 1/N, 2/N, ...], and do a Parallel.ForEach on that, does it still give nondeterministic results? If so, the RNG isn't the issue.
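A minimal harness along those lines might look like this (BusinessLogic is a hypothetical stand-in for the real per-item computation):

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class RngIsolationTest
{
    // Hypothetical stand-in for the real per-item business logic.
    static double BusinessLogic(double x) => Math.Sin(x) * Math.Cos(x);

    static double Run()
    {
        const int n = 1000;
        // Pre-determined inputs [0, 1/N, 2/N, ...] instead of a shared RNG.
        double[] inputs = Enumerable.Range(0, n).Select(i => (double)i / n).ToArray();
        var results = new double[n];
        Parallel.For(0, n, i => results[i] = BusinessLogic(inputs[i]));
        return results.Sum(); // summed sequentially, so the combine order is fixed
    }

    static void Main()
    {
        // If two parallel runs over fixed inputs still differ, the RNG is not the culprit.
        Console.WriteLine(Run() == Run()); // True
    }
}
```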

Related

Parallel.Invoke vs Parallel.Foreach for running parallel process on large list

I have a C# list which contains around 8000 items (file paths). I want to run a method on all of these items in parallel. For this I have the 2 options below:
1) Manually divide the list into small chunks (say of 500 items each), create an array of actions for these small lists, and then call Parallel.Invoke like below:
var partitionedLists = MainList.DivideIntoChunks(500);
List<Action> actions = new List<Action>();
foreach (var lst in partitionedLists)
{
actions.Add(() => CallMethod(lst));
}
Parallel.Invoke(actions.ToArray());
2) Second option is to run Parallel.ForEach like below
Parallel.ForEach(MainList, item => { CallMethod(item) });
What will be the best option here?
How does Parallel.ForEach divide the list into small chunks?
Please suggest, thanks in advance.
The first option is a form of task-parallelism, in which you divide your task into groups of sub-tasks and execute them in parallel. As is obvious from the code you provided, you are responsible for choosing the level of granularity [chunks] when creating the sub-tasks. The selected granularity might be too large or too small if one does not rely on appropriate heuristics, and the resulting performance gain might not be significant. Task-parallelism is used in scenarios where the operation to be performed takes a similar time for all input values.
The second option is a form of data-parallelism, in which the input data is divided into smaller chunks based on the number of hardware threads/cores/processors available, and then each individual chunk is processed in isolation. In this case, the .NET library chooses the right level of granularity for you and ensures better CPU utilization. Conventionally, data-parallelism is used in scenarios when the operation to be performed can vary in terms of time taken, depending on the input value.
In conclusion, if your operation is more or less uniform over the range of input values and you know the right granularity [chunk size], go ahead with the first option. If however that's not the case or if you are unsure about the above questions, go with the second option which usually pans out better in most scenarios.
NOTE: If this is a very performance-critical component of your application, I would advise benchmarking the performance of both approaches in a production-like environment to get more data, in addition to the above recommendations.
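For illustration, here is a self-contained sketch of both options (the chunking helper is improvised, since DivideIntoChunks is the asker's own extension method, and CallMethod is a placeholder):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class ChunkVsForEach
{
    static int processed;

    // Placeholder for the real per-item method.
    static void CallMethod(string path) => Interlocked.Increment(ref processed);

    static void Main()
    {
        List<string> mainList =
            Enumerable.Range(0, 8000).Select(i => $"file{i}.txt").ToList();

        // Option 1: manual chunking + Parallel.Invoke (fixed granularity of 500).
        var actions = mainList
            .Select((item, index) => (item, index))
            .GroupBy(x => x.index / 500, x => x.item) // stand-in for DivideIntoChunks(500)
            .Select(chunk => (Action)(() => { foreach (var item in chunk) CallMethod(item); }))
            .ToArray();
        Parallel.Invoke(actions);

        // Option 2: let Parallel.ForEach pick the granularity.
        Parallel.ForEach(mainList, item => CallMethod(item));

        Console.WriteLine(processed); // 16000: every item visited once per option
    }
}
```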

Parallel.ForEach slows down towards end of the iteration

I have the following issue :
I am using a Parallel.ForEach iteration for a pretty CPU-intensive workload (applying a method to a number of items), and it works fine for about the first 80% of the items, using all CPU cores very nicely.
As the iteration comes near the end (around 80%, I would say), I see the number of active threads go down core by core, and at the end the last roughly 5% of the items are processed by only two cores. So instead of using all cores until the end, it slows down pretty hard towards the end of the iteration.
Please note that the workload can differ greatly per item. One item can last 1-2 seconds, another can take 2-3 minutes to finish.
Any idea or suggestion is very welcome.
Code used:
var source = myList.ToArray();
var rangePartitioner = Partitioner.Create(0, source.Length);
using (SqlConnection connection = new SqlConnection(cnStr))
{
    connection.Open();
    try
    {
        Parallel.ForEach(rangePartitioner, (range, loopState) =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                CPUIntensiveMethod(source[i]);
            }
        });
    }
    catch (AggregateException ae)
    {
        // Exception catching
    }
}
This is an unavoidable consequence of the fact the parallelism is per computation. It is clear that the whole parallel batch cannot run any quicker than the time taken by the slowest single item in the work-set.
Imagine a batch of 100 items, 8 of which are slow (say 1000s to run) and the rest quick (say 1s to run). You kick them off in a random order across 8 threads. It's clear that eventually each thread will be calculating one of your long-running items, at which point you are seeing full utilisation. Eventually the one(s) that hit their long op(s) first will finish them and quickly finish up any remaining short ops. At that time you ONLY have some of the long ops waiting to finish, so you will see the active utilisation drop off, i.e. at some point there are only 3 ops left to finish, so only 3 cores are in use.
Mitigation Tactics
Your long-running items might be amenable to 'internal parallelism', allowing them to have a faster minimum runtime.
Your long-running items may be able to be identified and prioritised to start first (which will ensure you get full CPU utilisation for as long as possible).
(See update below.) Don't use partitioning in cases where the body can be long-running, as this simply increases the 'hit' of this effect, i.e. get rid of your rangePartitioner entirely. Removing it will massively reduce the impact of this effect on your particular loop.
Either way, your batch run-time is bound by the run-time of the slowest item in the batch.
Update: I have also noticed you are using partitioning on your loop, which massively increases the scope of this effect: you are saying 'break this work-set down into N work-sets' and then parallelising the running of those N work-sets. In the example above this could mean that you get (say) 3 of the long ops into the same work-set, so those are all going to be processed on the same thread. As such, you should NOT be using partitioning if the inner body can be long-running. For example, the docs on partitioning here https://msdn.microsoft.com/en-us/library/dd560853(v=vs.110).aspx say this is aimed at short bodies.
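One way to act on that advice is sketched below; note that the load-balancing overload of Partitioner.Create used here goes a step beyond the answer's plain "drop the range partitioner" suggestion, and CPUIntensiveMethod is a stand-in:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class LoadBalanced
{
    static int calls;

    // Stand-in for the CPU-intensive body; in the question its
    // duration varies from seconds to minutes per item.
    static void CPUIntensiveMethod(int item) => Interlocked.Increment(ref calls);

    static void Main()
    {
        int[] source = Enumerable.Range(0, 1000).ToArray();

        // Load-balancing partitioner: idle threads keep taking more work,
        // instead of each thread owning a fixed index range up front.
        Parallel.ForEach(Partitioner.Create(source, loadBalance: true),
                         item => CPUIntensiveMethod(item));

        Console.WriteLine(calls); // 1000
    }
}
```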
If you have multiple threads that process the same number of items each and each item takes varying amount of time, then of course you will have some threads that finish earlier.
If you use a collection whose size is not known, the items will be taken one by one:
var source = myList.AsEnumerable();
Another approach can be a Producer-Consumer pattern
https://msdn.microsoft.com/en-us/library/dd997371
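A minimal producer-consumer sketch with BlockingCollection (the item count, bound, and the work done per item are placeholders):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class ProducerConsumer
{
    static void Main()
    {
        var queue = new BlockingCollection<int>(boundedCapacity: 100);
        int total = 0;

        // One producer feeds items one by one; no up-front partitioning,
        // so a slow item never strands a whole pre-assigned range.
        var producer = Task.Run(() =>
        {
            for (int i = 0; i < 1000; i++) queue.Add(i);
            queue.CompleteAdding();
        });

        // Several consumers each take the next item as soon as they are free.
        var consumers = new Task[4];
        for (int c = 0; c < consumers.Length; c++)
            consumers[c] = Task.Run(() =>
            {
                foreach (int item in queue.GetConsumingEnumerable())
                    Interlocked.Add(ref total, item); // stand-in for CPUIntensiveMethod
            });

        Task.WaitAll(consumers);
        producer.Wait();
        Console.WriteLine(total); // 499500 = 0 + 1 + ... + 999
    }
}
```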

merge sort with four threads in C#?

I want to write a merge sort program in C# with four threads. In this program each thread has some numbers which are sorted within that thread. I will explain with an example: the first thread has 100 numbers. This thread sorts those numbers with merge sort and then passes them to the second thread. The second thread itself has 100 numbers and sorts its numbers together with the numbers passed from the first thread. Again, after sorting the data in the second thread, all 200 numbers pass to the third thread to be sorted with the third thread's numbers, and finally all numbers are sorted in the fourth thread together with the fourth thread's numbers and the result is shown. I know that in this scenario a simple sequential sort method is probably faster than merge sort, but I must do the sorting this way for my school project; also, the 100 numbers per thread was only an example, and in my project each thread has more than 100 numbers. I want to sort numbers with merge sort using four threads. I especially have a problem passing the numbers between threads. I'm a beginner in C#, and if possible please help me with code. Thanks.
From the scenario you explained, it seems like a sequential process: one thread waits for the outcome of another thread.
What I would suggest instead is that if you really want to sort, say, 100 numbers using 4 threads, you pass 25 numbers to each thread and call merge sort on each thread.
When each thread is done sorting, at the end of the 1st iteration you have 4 sorted arrays. Now pass 2 sorted arrays to each thread and call the MERGE step of merge sort on each thread (at this stage you are only using 2 threads).
Once this merge is done, you are left with 2 sorted arrays.
You can just pass the 2 sorted arrays to any thread and call MERGE (not merge sort).
I think if you google hard, you will get the solution online.
http://penguin.ewu.edu/~trolfe/ParallelMerge/ParallelMerge.html
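A sketch of the scheme described above, with a plain two-way MERGE step (the array sizes and seed are arbitrary):

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class ParallelMergeSort
{
    // Standard two-way MERGE of two already-sorted arrays.
    static int[] Merge(int[] left, int[] right)
    {
        var result = new int[left.Length + right.Length];
        int i = 0, j = 0, k = 0;
        while (i < left.Length && j < right.Length)
            result[k++] = left[i] <= right[j] ? left[i++] : right[j++];
        while (i < left.Length) result[k++] = left[i++];
        while (j < right.Length) result[k++] = right[j++];
        return result;
    }

    static void Main()
    {
        var rng = new Random(0);
        int[][] parts = Enumerable.Range(0, 4)
            .Select(_ => Enumerable.Range(0, 25).Select(__ => rng.Next(1000)).ToArray())
            .ToArray();

        // Phase 1: four threads each sort their own quarter.
        Parallel.For(0, 4, t => Array.Sort(parts[t]));

        // Phase 2: merge pairs on two threads, then merge the two results.
        int[] left = null, right = null;
        Parallel.Invoke(
            () => left = Merge(parts[0], parts[1]),
            () => right = Merge(parts[2], parts[3]));
        int[] sorted = Merge(left, right);

        Console.WriteLine(sorted.SequenceEqual(sorted.OrderBy(x => x))); // True
    }
}
```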
A parallel merge sort is not necessarily going to be faster than a simple sequential sort method. Only once you have a large number of items to be sorted (typically much more than would fit in a 64K L1 processor cache, i.e. tens of thousands of 4 byte integers), you have dedicated cores and are able to partition the data over these cores, will you start to see any benefits. For small amounts of data, the parallel approach will actually be slower, due to the need for extra coordination and allocations.
In C# there are built-in methods to do this type of partitioning. PLINQ was created specifically for such tasks.
There are several existing articles/blog posts that discuss solving a parallel merge sort using PLINQ, that could be found by googling "plinq merge sort".
Two in particular that provide some in depth coverage and include some benchmarking versus sequential sorting can be found here:
http://blogs.msdn.com/b/pfxteam/archive/2011/06/07/10171827.aspx
http://dzmitryhuba.blogspot.nl/2010/10/parallel-merge-sort.html

ParallelEnumerable.Range vs Enumerable.Range.AsParallel?

What are the differences between ParallelEnumerable.Range and Enumerable.Range(...).AsParallel()?
ParallelEnumerable.Range creates a range partition (best for operations where CPU time is equal for each item),
whereas Enumerable.Range(...).AsParallel() might be executed with range or chunk partitioning.
Is there any performance difference? When should I use which?
If you're performing an operation where its CPU pressure grows as it iterates, then you're going to want to use AsParallel, because it can make adjustments as it iterates. However, if it's something that you just need to run in parallel and the operation doesn't create more pressure as it iterates, then use ParallelEnumerable.Range, because it can partition the work immediately.
Reference this article for a more detailed explanation.
So, let's assume you're performing some complex math in each iteration, and as the input values grow the math takes longer: you're going to want to use AsParallel so that it can make adjustments. But if you use that same option for something cheap like setting a registry value, you'll see a performance decrease, because there is overhead associated with AsParallel that is not necessary.
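Both forms side by side (the workload here is a trivial sum, just to show that the two produce the same result while partitioning differently):

```csharp
using System;
using System.Linq;

class RangeComparison
{
    static void Main()
    {
        // Range-partitioned from the start: the partitioner knows the count up front.
        long sum1 = ParallelEnumerable.Range(0, 1_000_000).Select(i => (long)i).Sum();

        // Sequential range wrapped for parallelism: PLINQ chooses the partitioning.
        long sum2 = Enumerable.Range(0, 1_000_000).AsParallel().Select(i => (long)i).Sum();

        Console.WriteLine(sum1 == sum2); // True: same result, different partitioning strategy
    }
}
```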

Why does using Random in Sort cause [Unable to sort IComparer.Compare error]

I tried shuffling a list of bytes (List<byte>) using either code:
myList.Sort((a, b) => this._Rnd.Next(-1, 1));
or
myList.Sort(delegate(byte b1, byte b2)
{
return this._Rnd.Next(-1, 1);
});
and they threw the following error:
Unable to sort because the IComparer.Compare() method returns inconsistent results. Either a value does not compare equal to itself, or one value repeatedly compared to another value yields different results. x: '{0}', x's type: '{1}', IComparer: '{2}'.
What is wrong with using a random rather than the compare function of byte?
I tried using the LINQ functions instead, and they work:
var myNewList = myList.OrderBy(s => Guid.NewGuid());
var myNewList = myList.OrderBy(s => this._Rnd.NextDouble());
I did read that these methods are slower than the Fisher–Yates shuffle, which runs in only O(n) time. But I was just wondering about using the Sort function with a random comparer.
Not only is the comparison relation required to be consistent, it is also required to impose a total ordering. For example, you cannot say "socks are smaller than shoes, shirts are neither smaller than nor greater than trousers", blah blah blah, feed that into a sort algorithm and expect to get a topological sort out the other end. Comparison sorts are called comparison sorts because they require a well-formed comparison relation. In particular, quicksort can run forever or give nonsensical results if the comparison relation is not a consistent, transitive total ordering.
If what you want is a shuffle then implement a Fisher-Yates shuffle. (Do it correctly; even though the algorithm is trivial, it is almost always implemented wrong.) If what you want is a topological sort then implement a topological sort. Use the right tool for the job.
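A minimal Fisher-Yates implementation, with the usual pitfalls noted in comments:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Shuffle
{
    // Fisher-Yates: walk from the end, swapping each slot with a uniformly
    // chosen slot at or below it. Note rng.Next(i + 1) includes i itself;
    // excluding it (Sattolo's variant) or drawing from the whole array on
    // every step are two classic ways to get this wrong.
    static void FisherYates<T>(IList<T> list, Random rng)
    {
        for (int i = list.Count - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            (list[i], list[j]) = (list[j], list[i]);
        }
    }

    static void Main()
    {
        var items = Enumerable.Range(0, 10).ToList();
        FisherYates(items, new Random(42));
        Console.WriteLine(string.Join(",", items));
        // Still a permutation of 0..9:
        Console.WriteLine(items.OrderBy(x => x).SequenceEqual(Enumerable.Range(0, 10))); // True
    }
}
```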
Because, as the error says, Random is not consistent. You must have a comparer that always returns the same result when given the same parameters; otherwise the sort will not be consistent.
Knuth has a random sort algorithm which works like an insertion sort, but you swap the value with a randomly chosen location in the existing array.
Sorting algorithms generally work by defining a comparison function. The algorithms will repeatedly compare two items in the sequence to be sorted and swap them if their current order doesn't match the desired order. The differences between algorithms have mainly to do with finding the most efficient way possible in the given circumstances to do all the compares.
In the process of making all these compares, it's common for the same two elements in a sequence to need to be compared more than once! Using non-numeric data to make this easier, let's say you have items with values "Red" and "Apple". The random comparer selects "Apple" as the greater item on the first comparison. Later on, if the random comparer selects "Red" as the greater item, and this back and forth continues, you can end up in a situation where the algorithm never finishes.
Mostly you get lucky, and nothing happens. But sometimes you don't. .NET is pretty good about not just running forever and guards against this, but it does (and should!) throw an exception when those guards detect inconsistent ordering.
Of course, the correct way to handle this in the general case is via a Knuth-Fisher-Yates shuffle.
It's further worth mentioning that there are times when a simple Fisher-Yates is not appropriate. One example is needing to randomize a sequence of unknown length... say you're wanting to randomly rearrange data received from a network stream, without knowing how much data is in the stream, and feed that data as quickly as possible to a worker thread elsewhere.
In this situation you can't perfectly randomize that data. Without knowing the length of the stream you don't have enough information to correctly do the shuffle, and even if you did you might find the length is so long as to make holding it all in RAM or even on disk impractical. Or you might find that the stream won't complete for hours, but your worker thread needs to get going much sooner. In this case, you'd likely settle for (and understanding that this is "settling" is important) an algorithm that loads a buffer of adequate length, randomizes the buffer, feeds out about half the buffer to the worker thread, and then re-fills the empty portion of the buffer to repeat the process. Even here, you're likely using Knuth-Fisher-Yates for the step that randomizes the buffer.
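A rough sketch of that buffered approach (the buffer size and the half-buffer emission policy are assumptions taken from the description above; this is a "settling" approximation, not a uniform shuffle):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class StreamShuffle
{
    // Approximate shuffle for a stream of unknown length: fill a buffer,
    // shuffle it, emit roughly half, refill, repeat. Not a uniform
    // permutation of the whole stream; an item can only move a limited
    // distance from its original position.
    static IEnumerable<T> BufferedShuffle<T>(IEnumerable<T> source, int bufferSize, Random rng)
    {
        var buffer = new List<T>(bufferSize);
        foreach (T item in source)
        {
            buffer.Add(item);
            if (buffer.Count == bufferSize)
            {
                FisherYates(buffer, rng);
                int emit = bufferSize / 2;            // feed out about half the buffer
                for (int i = 0; i < emit; i++) yield return buffer[i];
                buffer.RemoveRange(0, emit);          // then refill the empty portion
            }
        }
        FisherYates(buffer, rng);                     // flush whatever remains
        foreach (T item in buffer) yield return item;
    }

    static void FisherYates<T>(IList<T> list, Random rng)
    {
        for (int i = list.Count - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            (list[i], list[j]) = (list[j], list[i]);
        }
    }

    static void Main()
    {
        var output = BufferedShuffle(Enumerable.Range(0, 100), 16, new Random(1)).ToList();
        // Every input item comes out exactly once:
        Console.WriteLine(output.OrderBy(x => x).SequenceEqual(Enumerable.Range(0, 100))); // True
    }
}
```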
