merge sort with four threads in C#? - c#

I want to write a merge sort program in C# with four threads. In this program, each thread has some numbers which are sorted within that thread. I'll explain with an example: the first thread has 100 numbers. It sorts those numbers with merge sort and then passes them to the second thread. The second thread has 100 numbers of its own and sorts them together with the numbers passed from the first thread. After sorting, all 200 numbers pass to the third thread, which sorts them with its own numbers, and finally all the numbers are sorted with the fourth thread's numbers and the result is shown. I know that in this scenario a simple sequential sort is probably faster than merge sort, but I must do the sorting this way for my school project. Also, 100 numbers per thread was only an example; in my project each thread has more than 100 numbers. I want to sort numbers with merge sort using four threads, and I especially have a problem passing the numbers between threads. I'm a beginner in C#, so if possible please help me with some code. Thanks.

From the scenario you explained, it seems like a sequential process: each thread waits for the outcome of the previous one.
My guess is that if you really want to sort, say, 100 numbers using 4 threads, you should pass 25 numbers to each thread and call merge sort on each thread.
When each thread is done sorting, at the end of the first iteration you have 4 sorted arrays. Now pass 2 sorted arrays to each thread and call the MERGE step of merge sort on each (at this stage you are only using 2 threads).
Once this merge is done, you are left with 2 sorted arrays.
You can then pass those 2 sorted arrays to any thread and call MERGE (not merge sort) one final time.
If you search online you will find existing solutions, for example:
http://penguin.ewu.edu/~trolfe/ParallelMerge/ParallelMerge.html
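To make the thread-to-thread passing concrete, here is a minimal sketch of that scheme using Tasks; the `Merge` helper and the phase structure are my own illustration, not code from the linked article:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class ParallelMergeSortDemo
{
    // Merge two already-sorted arrays into one sorted array.
    static int[] Merge(int[] left, int[] right)
    {
        int[] result = new int[left.Length + right.Length];
        int i = 0, j = 0, k = 0;
        while (i < left.Length && j < right.Length)
            result[k++] = left[i] <= right[j] ? left[i++] : right[j++];
        while (i < left.Length) result[k++] = left[i++];
        while (j < right.Length) result[k++] = right[j++];
        return result;
    }

    public static int[] SortWithFourThreads(int[] data)
    {
        // Split the input into four roughly equal chunks.
        int chunk = (data.Length + 3) / 4;
        var parts = Enumerable.Range(0, 4)
            .Select(n => data.Skip(n * chunk).Take(chunk).ToArray())
            .ToArray();

        // Phase 1: four tasks each sort their own chunk.
        var sortTasks = parts.Select(p => Task.Run(() => { Array.Sort(p); return p; })).ToArray();
        Task.WaitAll(sortTasks);

        // Phase 2: two tasks each merge a pair of sorted chunks.
        var m1 = Task.Run(() => Merge(sortTasks[0].Result, sortTasks[1].Result));
        var m2 = Task.Run(() => Merge(sortTasks[2].Result, sortTasks[3].Result));
        Task.WaitAll(m1, m2);

        // Phase 3: final merge on the calling thread.
        return Merge(m1.Result, m2.Result);
    }

    static void Main()
    {
        var rnd = new Random(0);
        int[] data = Enumerable.Range(0, 400).Select(_ => rnd.Next(1000)).ToArray();
        int[] sorted = SortWithFourThreads(data);
        Console.WriteLine(sorted.SequenceEqual(data.OrderBy(x => x))); // True
    }
}
```

Passing data "between threads" here is simply a matter of one task consuming the `Result` of another; `Task.WaitAll` provides the synchronization.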

A parallel merge sort is not necessarily going to be faster than a simple sequential sort. Only once you have a large number of items to sort (typically far more than would fit in a 64 KB L1 processor cache, i.e. tens of thousands of 4-byte integers), have dedicated cores, and are able to partition the data over those cores will you start to see any benefit. For small amounts of data, the parallel approach will actually be slower, due to the need for extra coordination and allocations.
In C# there are built-in methods to do this type of partitioning. PLINQ was created specifically for such tasks.
There are several existing articles/blog posts that discuss solving a parallel merge sort using PLINQ, that could be found by googling "plinq merge sort".
Two in particular that provide some in depth coverage and include some benchmarking versus sequential sorting can be found here:
http://blogs.msdn.com/b/pfxteam/archive/2011/06/07/10171827.aspx
http://dzmitryhuba.blogspot.nl/2010/10/parallel-merge-sort.html
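As a minimal illustration of the PLINQ route (this is the generic `AsParallel().OrderBy(...)` pattern, not code from the linked posts), a parallel sort needs no hand-written merging at all:

```csharp
using System;
using System.Linq;

class PlinqSortDemo
{
    static void Main()
    {
        var rnd = new Random(0);
        int[] data = Enumerable.Range(0, 100_000).Select(_ => rnd.Next()).ToArray();

        // PLINQ partitions the data over the available cores and
        // merges the per-partition results back into one ordered sequence.
        int[] sorted = data.AsParallel().OrderBy(x => x).ToArray();

        Console.WriteLine(sorted.SequenceEqual(data.OrderBy(x => x))); // True
    }
}
```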

Related

C# (C++ would be cool too): Fastest way to find differences in two large arrays/lists with indexes

More Details:
For this problem, I'm specifically looking for the fastest way to do this, in general and specifically in C#. I don't necessarily mean "theoretical" fastest/algorithmic; instead I'm looking for practical implementation speed. In this specific situation, the arrays only have like 1000 elements each, which seems very small, but this computation is going to be running very rapidly and comparing many arrays (it blows up in size very quickly). I ultimately need the indexes of each element that is different.
I can obviously do a very simple implementation like:
public List<int> FindDifferences(List<double> Original, List<double> NewList)
{
    List<int> Changes = new List<int>();
    for (int i = 0; i < Original.Count; i++)
    {
        if (Original[i] != NewList[i])
        {
            Changes.Add(i);
        }
    }
    return Changes;
}
But from what I can see, this will be really slow overall, since it has to iterate once through each item in the list. Is there anything I can do to speed this up? Specifically, is there a way to do something like a parallel foreach that generates a list of the indexes of changes? I saw what I think was a similar question asked before, but I didn't quite understand the answer. Or would there be another way to run the calculation on all items of the list simultaneously (or somehow clustered)?
Assumptions
Each array or list being compared contains data of the same type (double, int, or string), so if array1 holds strings and is compared to array2, I know for certain that array2 will only hold strings and it will be of the same size (in terms of item count; I can see if maybe they are the same byte count too, if that could come into play).
The vast majority of the items in these comparisons will remain the same. My resultant "differences" list will probably only contain a few(1-10) items, if any.
Concerns
1) After a comparison is made (old and new list in the block above), the new list will overwrite the old list. If computation takes longer than the time between incoming messages (each a new list to compare), I can have a problem with collisions:
Let's say I have three lists: A, B, and C. A would be my global "current state" list. When a message containing a new list (B) is received, A is the list B would be compared to.
In an ideal world, A would be compared to B, and I would receive a list of integers representing the indexes of elements that differ between the two. After the method computes and returns this index list, A would become B (the values of B overwrite the values of A as my "current state"). When I receive another message (C), it would be compared to my new current state (A, but with the values previously belonging to B), I'd receive the list of differences, and C's values would overwrite A's and become the new current state. If the comparison between A and B is still running when C is received, I would need to make sure the new calculation either:
doesn't happen until after A and B's comparison finishes and A is overwritten with its new values, or
is instead made between B and C, with C overwriting A after the comparison finishes (the difference list is fired off elsewhere, so I'd still receive both change lists).
2) If this comparison between lists can't be sped up, is there somewhere else I can speed things up instead? The messages I'm receiving come as an object with three values: an ASCII-encoded byte array, a long string (the already-parsed byte array), and a "type" (the name of the list it corresponds to, so I know the data type of its contents). I currently ignore the byte array and parse the string by splitting it at newline characters.
I know this is inefficient, but I have trouble converting the byte array into ints or doubles. The doubles because they have a lot of "noise" (a value of 1.50 could come in as 1.4976789, so I actually have to round it to get its "real" value); the ints because there is no zero padding, so I don't know the length to chunk the byte array into. Below is an example of what I'm doing:
public List<string> ListFromString(string request)
{
    List<string> fulllist = request.Split('\n').ToList<string>();
    return fulllist.GetRange(1, fulllist.Count - 2); // There's always a label tacked on the beginning, so I start from 1
}

public List<double> RequestListAsDouble(string request)
{
    List<string> RequestAsString = ListFromString(request);
    List<double> RequestListAsDouble = new List<double>();
    foreach (string requestElement in RequestAsString)
    {
        double requestElementAsDouble = Math.Round(Double.Parse(requestElement), 2);
        RequestListAsDouble.Add(requestElementAsDouble);
    }
    return RequestListAsDouble;
}
Your single-threaded comparison of the two parsed lists is probably the fastest way to do it. It is certainly the easiest. As noted by another poster, you can get some speed advantage by pre-allocating the size of the "Changes" list to be some percentage of the size of your input list.
If you want to try parallel comparisons, you should set up N threads in advance and have them wait for a starting event, where N is the number of real processors on your system. Each thread should compare a portion of the lists and write its answers to the interlocked output list "Changes". On completion, the threads go back to sleep, waiting for the next starting event.
When all the threads have returned to their starting positions, the main thread can pick up "Changes" and pass it along. Repeat with the next list.
Be sure to clean up all the worker threads when your application is supposed to exit, or it won't exit.
There is a lot of overhead in starting and ending threads, and it is all too easy to lose all the processing speed to that overhead. That's why you want a pool of worker threads already set up and waiting on an event flag. Threads only improve processing speed up to the number of real CPUs in the system.
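As a rough sketch of that pooled idea, using the built-in thread pool instead of hand-rolled threads and events: `Parallel.ForEach` over a range partitioner compares slices of the lists concurrently (the class and method names here are illustrative, not from the question):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

static class ParallelDiff
{
    // Compare two equal-length lists in parallel: each worker scans one
    // index range and appends differing indexes to a shared concurrent bag.
    public static List<int> FindDifferences(List<double> original, List<double> newList)
    {
        var changes = new ConcurrentBag<int>();
        Parallel.ForEach(Partitioner.Create(0, original.Count), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
                if (original[i] != newList[i])
                    changes.Add(i);
        });
        var result = changes.ToList();
        result.Sort(); // bag order is nondeterministic, so sort the indexes
        return result;
    }
}
```

For lists of only ~1000 elements the coordination overhead may well outweigh the gain, as the answer above notes, so it is worth benchmarking against the plain loop.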
A small optimization would be to initialize the results list with the capacity of the original
https://msdn.microsoft.com/en-us/library/4kf43ys3(v=vs.110).aspx
If the size of the collection can be estimated, using the List<T>(Int32) constructor and specifying the initial capacity eliminates the need to perform a number of resizing operations while adding elements to the List.
List<int> Changes = new List<int>(Original.Count); // Count, not Length: Original is a List<double>

Parallel.Invoke vs Parallel.Foreach for running parallel process on large list

I have a C# list which contains around 8000 items (file paths). I want to run a method on all of these items in parallel. For this I have the 2 options below:
1) Manually divide the list into small chunks (say of 500 items each), create an array of actions for these small lists, and then call Parallel.Invoke like below:
var partitionedLists = MainList.DivideIntoChunks(500);
List<Action> actions = new List<Action>();
foreach (var lst in partitionedLists)
{
    actions.Add(() => CallMethod(lst));
}
Parallel.Invoke(actions.ToArray());
2) The second option is to run Parallel.ForEach as below:
Parallel.ForEach(MainList, item => { CallMethod(item); });
Which will be the best option here?
How does Parallel.ForEach divide the list into small chunks?
Please suggest; thanks in advance.
The first option is a form of task parallelism, in which you divide your task into groups of sub-tasks and execute them in parallel. As is obvious from the code you provided, you are responsible for choosing the level of granularity (the chunk size) when creating the sub-tasks. The selected granularity might be too coarse or too fine if one does not rely on appropriate heuristics, and the resulting performance gain might not be significant. Task parallelism is used in scenarios where the operation to be performed takes a similar time for all input values.
The second option is a form of data-parallelism, in which the input data is divided into smaller chunks based on the number of hardware threads/cores/processors available, and then each individual chunk is processed in isolation. In this case, the .NET library chooses the right level of granularity for you and ensures better CPU utilization. Conventionally, data-parallelism is used in scenarios when the operation to be performed can vary in terms of time taken, depending on the input value.
In conclusion, if your operation is more or less uniform over the range of input values and you know the right granularity [chunk size], go ahead with the first option. If however that's not the case or if you are unsure about the above questions, go with the second option which usually pans out better in most scenarios.
NOTE: If this is a very performance-critical component of your application, I would advise benchmarking both approaches in a production-like environment to get more data, in addition to the above recommendations.
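A minimal side-by-side sketch of the two options (the `DivideIntoChunks` extension below is a stand-in for the custom one in the question, and `CallMethod` is a dummy):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class ParallelOptionsDemo
{
    // Simple stand-in for the question's DivideIntoChunks extension.
    public static IEnumerable<List<T>> DivideIntoChunks<T>(this List<T> source, int size)
    {
        for (int i = 0; i < source.Count; i += size)
            yield return source.GetRange(i, Math.Min(size, source.Count - i));
    }

    static int processed;

    static void CallMethod(string path) => Interlocked.Increment(ref processed);

    static void Main()
    {
        var mainList = Enumerable.Range(0, 8000).Select(i => $"file{i}.txt").ToList();

        // Option 1: manual chunking + Parallel.Invoke (task parallelism).
        processed = 0;
        var actions = mainList.DivideIntoChunks(500)
            .Select(chunk => (Action)(() => chunk.ForEach(CallMethod)))
            .ToArray();
        Parallel.Invoke(actions);
        Console.WriteLine(processed); // 8000

        // Option 2: let Parallel.ForEach pick the partitioning (data parallelism).
        processed = 0;
        Parallel.ForEach(mainList, item => CallMethod(item));
        Console.WriteLine(processed); // 8000
    }
}
```

Both produce the same result; the difference is only in who chooses the chunk size.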

Deterministic random numbers in parallel code

I have a question regarding thread ordering for the TPL.
Indeed, it is very important for me that my Parallel.For loop be executed in the order of the loop. What I mean is that given 4 threads, I would like the first thread to execute every (4k)th iteration, the 2nd thread every (4k+1)th, etc., with k between 0 and NbSim/4:
1st thread -> 1st loop, 2nd thread -> 2nd loop, 3rd thread -> 3rd loop,
4th thread -> 4th loop, 1st thread -> 5th loop, etc.
I have seen the OrderedPartition directive, but I am not quite sure how I should apply it to a Parallel.For loop rather than a Parallel.ForEach loop.
Many Thanks for your help.
Following the previous remarks, I am completing the description:
Actually, after some consideration, I believe that my problem is not about ordering.
Indeed, I am working on a Monte-Carlo engine in which, for each iteration, I generate a set of random numbers (always the same, seed = 0) and then apply some business logic to them. Thus everything should be deterministic, and running the algorithm twice should give the exact same results. But unfortunately this is not the case, and I am struggling to understand why. Any idea how to solve this kind of problem (without printing out every variable I have)?
Edit Number 2:
Thank you all for your suggestions
First, here is the way my code is ordered :
ParallelOptions options = new ParallelOptions();
options.MaxDegreeOfParallelism = 4; // or 1
ParallelLoopResult res = Parallel.For<LocalDataStruct>(1, NbSim, options,
    () => new LocalDataStruct(/* params of the constructor of LocalData */),
    (iSim, loopState, localDataStruct) =>
    {
        // logic
        return localDataStruct;
    },
    localDataStruct =>
    {
        lock (syncObject)
        {
            // critical section for outputting the parameters
        }
    });
When setting MaxDegreeOfParallelism to 1, everything works fine; however, when setting it to 4, I am getting results that are wrong and non-deterministic (running the code twice gives different results). It is probably due to mutable objects, which is what I am checking now, but the source code is quite extensive, so it takes time. Do you think there is a good strategy to check the code other than reviewing it (printing out all variables is impossible in this case, > 1000)? Also, when setting the number of simulations to 4 for 4 threads, everything works fine as well, mostly due to luck I believe (that's why I mentioned my first idea regarding ordering).
You can enforce ordering in PLINQ but it comes at a cost. It gives ordered results but does not enforce ordering of execution.
You really cannot do this with TPL without essentially serializing your algorithm. The TPL works on a Task model. It allows you to schedule tasks which are executed by the scheduler with no guarantee as to the order in which the Tasks are executed. Typically parallel implementations take the PLINQ approach and guarantee ordering of results not ordering of execution.
Why is ordered execution important?
For a Monte-Carlo engine you need to make sure that each index in your array receives the same random numbers on every run. This does not mean that you need to order your threads, just that the random numbers are assigned consistently across the work done by each thread. So if each partition of your parallel loop was passed not only the array of elements to work on but also its own instance of a random number generator (with a different fixed seed per partition), then you would still get deterministic results.
I'm assuming that you are familiar with the challenges related to parallelizing Monte-Carlo and generating good random number sequences. If not, here's something to get you started: Pseudo-random Number Generation for Parallel Monte Carlo—A Splitting Approach; Fast, High-Quality, Parallel Random-Number Generators: Comparing Implementations.
Some suggestions
I would start off by ensuring that you get deterministic results in the sequential case, by replacing the parallel loop with a plain foreach and seeing if this runs correctly. You could also compare the output of a sequential and a parallel run: add some diagnostic output, pipe it to a text file, then use a diff tool to compare the results.
If this is OK then it is something to do with your parallel implementation, which as is pointed out below is usually related to mutable state. Some things to consider:
Is your random number generator threadsafe? Random is a poor random number generator at best and as far as I know is not designed for parallel execution. It is certainly not suitable for M-C calculations, parallel or otherwise.
Does your code have other state shared between threads? If so, what is it? This state will be mutated in a non-deterministic manner and affect your results.
Are you merging results from different threads in parallel? The non-associativity of parallel floating point operations will also cause you issues here; see How can floating point calculations be made deterministic?. Even if the per-thread results are deterministic, if you combine them in a non-deterministic way you will still have issues.
Assuming all threads share the same random number generator, then although you are generating the same sequence every time, which thread gets which elements of this sequence is non-deterministic. Hence you could arrive at different results.
That's if the random number generator is thread-safe; if it isn't, then it's not even guaranteed to generate the same sequence when called from multiple threads.
Apart from that it is difficult to theorize what could be causing non-determinism to arise; basically any global mutable state is suspicious. Each Task should be working with its own data.
If, rather than using a random number generator, you set up an array [0...N-1] of pre-determined values, say [0, 1/N, 2/N, ...], and do a Parallel.ForEach on that, does it still give nondeterministic results? If so, the RNG isn't the issue.
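One common way to make the parallel run reproducible is to derive each iteration's RNG from the iteration index instead of sharing one `Random`. This is a sketch only: seeding `Random` per iteration is statistically crude compared to the proper parallel RNG schemes in the papers linked above, and the names here are illustrative:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class DeterministicMonteCarlo
{
    // Seed a fresh RNG from the iteration index, so simulation i always
    // sees the same random sequence no matter which thread runs it.
    // Caveat: adjacent integer seeds give correlated sequences, so a real
    // engine should use a splittable parallel generator instead.
    public static double[] Run(int nbSim)
    {
        var results = new double[nbSim];
        Parallel.For(0, nbSim, i =>
        {
            var rng = new Random(i); // deterministic per-iteration seed
            double sum = 0;
            for (int k = 0; k < 1000; k++)
                sum += rng.NextDouble();
            results[i] = sum; // each iteration writes only its own slot
        });
        return results;
    }

    static void Main()
    {
        var a = Run(100);
        var b = Run(100);
        Console.WriteLine(a.SequenceEqual(b)); // True
    }
}
```

Because each iteration owns its RNG and its output slot, no shared mutable state remains and two runs produce identical results regardless of thread scheduling.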

Compare 10 Million Entities

I have to write a program that compares 10'000'000+ Entities against one another. The entities are basically flat rows in a database/csv file.
The comparison algorithm has to be pretty flexible, it's based on a rule engine where the end user enters rules and each entity is matched against every other entity.
I'm thinking about how I could possibly split this task into smaller workloads but I haven't found anything yet. Since the rules are entered by the end user pre-sorting the DataSet seems impossible.
What I'm trying to do now is fit the entire DataSet in memory and process each item. But that's not highly efficient and requires approx. 20 GB of memory (compressed).
Do you have an idea how I could split the workload or reduce its size?
Thanks
If your rules are at the highest level of abstraction (e.g. an arbitrary unknown comparison function), you can't achieve your goal: 10^14 comparison operations will run for ages.
If the rules are not completely general I see 3 solutions to optimize different cases:
If the comparison is transitive and you can calculate a hash (somebody already recommended this), do it. Hashes can also be complicated, not only your rules =). Find a good hash function and it might help in many cases.
If the entities are sortable, sort them. For this purpose I'd recommend not sorting in place, but building an array of indexes (or IDs) of the items. If your comparison can be transformed to SQL (as I understand it, your data is in a database), you can perform this on the DBMS side more efficiently and read back the sorted indexes (for example 3, 1, 2, meaning the item with ID=3 is the lowest, ID=1 is in the middle, and ID=2 is the largest). Then you only need to compare adjacent elements.
If neither works, I would try heuristic sorting or hashing. That is, I would create a hash which does not necessarily uniquely identify equal elements, but can split your dataset into groups between which there is definitely no pair of equal elements. Then all equal pairs fall inside the groups, and you can read the groups one by one and run the expensive comparison within a group of, say, 100 elements rather than 10,000,000. The other sub-approach is heuristic sorting, with the same purpose of guaranteeing that equal elements don't end up at opposite ends of the dataset. After that you can read elements one by one and compare each with, for example, the 1000 previous elements (already read and kept in memory). I would keep, say, 1100 elements in memory and free the oldest 100 every time a new 100 arrives; this would optimize your DB reads. Another implementation is possible when your rules contain clauses like (Attribute1 = Value1) AND (...), or (Attribute1 < Value2) AND (...), or any other simple clause. Then you can cluster by these criteria first and compare items within the resulting clusters.
By the way, what if your rule considers all 10,000,000 elements equal? Would you like to get 10^14 result pairs? This case proves that you can't solve the task in the general case. Try making some limitations and assumptions.
I would try to think about rule hierarchy.
Let's say for example that rule A is "Color" and rule B is "Shape".
If you first divide the objects by color, then there is no need to compare a red circle with a blue triangle. This will reduce the number of comparisons you need to do.
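That idea can be sketched with a cheap `GroupBy` key applied before the expensive pairwise rule; the `Entity` shape and the two rules here are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class ClusterCompareDemo
{
    public record Entity(int Id, string Color, string Shape);

    // Compare pairs only within groups that share the cheap key (Color),
    // instead of all N*(N-1)/2 pairs across the whole dataset.
    public static List<(int, int)> MatchingPairs(List<Entity> entities)
    {
        var pairs = new List<(int, int)>();
        foreach (var group in entities.GroupBy(e => e.Color))
        {
            var items = group.ToList();
            for (int i = 0; i < items.Count; i++)
                for (int j = i + 1; j < items.Count; j++)
                    if (items[i].Shape == items[j].Shape) // the "expensive" rule
                        pairs.Add((items[i].Id, items[j].Id));
        }
        return pairs;
    }
}
```

With k evenly-sized groups this cuts the pairwise work by roughly a factor of k, and each group can then be processed independently (and in parallel, or streamed from the database one group at a time).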
I would create a hash code from each entity. You probably have to exclude the ID from the hash generation and then test for equality. Once you have the hashes, you can sort all the hash codes; having all entities in order makes it pretty easy to check for duplicates.
If you want to compare each entity with all other entities, then effectively you need to cluster the data; there are very few reasons to compare totally unrelated things (comparing Clothes with Humans does not make sense). I think your rules will try to cluster the data anyway.
So you need to cluster the data; try some clustering algorithms like K-Means.
Also see Apache Mahout.
Are you looking for the most suitable sorting algorithm for this?
I think divide and conquer seems good.
If the algorithm seems good, you have plenty of other ways to do the calculation. In particular, parallel processing using MPICH or something similar may get you to your final destination.
But before deciding how to execute, you have to consider whether the algorithm fits first.

With PLINQ, how can I set the preferred chunk size when using AsParallel().MaxDegreeOfParallelism(4)?

I have a list with thousands of objects, on which an operation that can take between 1 and 3 minutes is performed.
I am of course using PLINQ, but I have noticed that when approaching the end of the input list, only one core is working, as if the partitioning had been determined ex ante.
So, with an IList, what is the best way to force PLINQ to keep using worker threads as long as there are items to be processed?
The computer has plenty of hardware cores available.
References:
Chunk partitioning vs range partitioning in PLINQ
Can I configure the number of threads used by Parallel Extensions?
How to make PLINQ to spawn more concurrent threads in .NET 4.0 beta 2?
From what I understand, PLINQ will choose range or chunk partitioning depending on whether the source sequence is an IList or not. If it is an IList, the bounds are known and elements can be accessed by index, so PLINQ chooses range partitioning to divide the list evenly between threads. For instance, if you have 1000 items in your list and you use 4 threads, each thread will have 250 items to process. On the other hand, if the source sequence is not an IList, PLINQ can't use range partitioning because it doesn't know what the ranges would be; so it uses chunk partitioning instead.
In your case, if you have an IList and you want to force chunk partitioning, you can just make it look like a simple IEnumerable: instead of writing this:
list.AsParallel()...
Write that:
list.Select(x => x).AsParallel()...
The dummy projection will hide the fact that the source is actually an IList, so PLINQ will use chunk partitioning.
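A small sketch of the trick, alongside the more explicit alternative of wrapping the list in a load-balancing `Partitioner` (the workload here is a dummy stand-in for the 1-3 minute operation):

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;

class PartitioningDemo
{
    static int Work(int x)
    {
        Thread.SpinWait(1000); // stand-in for uneven per-item work
        return x * 2;
    }

    static void Main()
    {
        var list = Enumerable.Range(0, 1000).ToList();

        // Range partitioning (PLINQ's default for an IList):
        // each worker gets one fixed slice up front.
        var ranged = list.AsParallel().WithDegreeOfParallelism(4)
            .Select(Work).ToArray();

        // Chunk partitioning via the dummy projection described above.
        var chunked = list.Select(x => x).AsParallel().WithDegreeOfParallelism(4)
            .Select(Work).ToArray();

        // Alternatively, a load-balancing partitioner states the intent explicitly:
        // workers take small chunks on demand, so no core idles at the tail.
        var balanced = Partitioner.Create(list, loadBalance: true).AsParallel()
            .WithDegreeOfParallelism(4).Select(Work).ToArray();

        Console.WriteLine(ranged.Sum() == chunked.Sum()
                       && chunked.Sum() == balanced.Sum()); // True
    }
}
```

The `Partitioner.Create(list, loadBalance: true)` overload is the documented way to opt out of static range partitioning for an IList, and it avoids the per-element overhead of the dummy `Select`.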
