Understanding and solving K-Way merge sort

Understanding and solving K-Way merge sort - c#

I want to:
count the number of comparisons needed by k-Way merge sort to sort random permutation of numbers from 0 to N-1.
to count the number of data moves needed by K-Way merge sort to sort random permutation of numbers from 0 to N-1.
I understand how 2-way merge sort works correctly, and understand the code very well. My problem now is I don't know how to start. How do I convert the 2-way merge sort into K-Way so that I can solve the above problems?
I have searched the web but can't find any tutorial to explain "k-Way merge sort" very well.
I need good explanation what to do so that I can take it from there and do it myself.
Like I said I understand the 2-Way, so how do I move to the K-Way merge sort? How do I implement the K-way?
Edit
I read some post http://bchalk.com/work/view/k_way_merge_sort
that BinaryHeap must be used to implement k-Way merge. Is that so or there are other ways?
How do I divide my list into K? Is there a special way of doing it?

When k > 2, the leading elements from each of the input streams are typically kept in a minheap structure. This makes it easy to find to the mininum of the the n-values, to pop that value off the heap, and insert a replacement value from the corresponding input stream.
A heap does O(lg2 k) comparisons for each insertion, so the total work for a k-way merge of n items is n * lg2(k).
Eventhough you asked about C# and Java, you can learn how to do it by looking at the Python standard library code for a k-way merge: http://hg.python.org/cpython/file/2.7/Lib/heapq.py#l323
To answer your other question, there is no special way to divide your list into K groups. Just take the first N/k elements in the first array, the next N/k elements into the next, etc. Sort each array and then merge them using heaps as mentioned above.

You could always think of a k-way merge as a series of 2-way merges, that is, do a 2-way merge with the result of the first and second, and the third: Merge(Merge(L1, L2), L3) and so on. Even faster would be to split it twice: Merge(Merge(L1, L2), Merge(L3, L4)). As you can see, for a k-way sort, you would then need some sort of loop (recursion).

Related

Grouping neighbouring entries with the same value in a 2d array

EDIT: I now realized the question was not appropriate for stack but I've gotten a lot of helpful advice anyway. Thanks everyone!
I have a 2d array and I want to group together neighbors of the same value. Using C# (working with unity).
Let's say I have this:
int[,] array {
0,0,0,0,0,0,1,0,0,0,
0,1,1,0,0,0,1,0,0,0,
0,1,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,1,1,1,0,
0,0,0,0,0,0,1,1,1,0
}
There are three "clusters" of 1:s. I want to add them to a dictionary with some variable for identification. So maybe first add the neighboring values to a list, add that list to a dictionary, clear the list and move onto the next cluster.
The columns and rows would be of equal length in the real thing.
I would also want the sorting method to accept arrays of various sizes so no hardcoded values. I parse the array from an XML document.
I've tried looking into Array.Sort but the resources I have found have been exclusively about sorting values in as/descending order. Just pointing me in the right direction, some some relevant web resources would be greatly appreciated!

I'm not going to give you the answer in full code since 1. you shouldn't be asking for it here and 2. you can definitely work it out yourself.
This is a good opportunity for you to whip out your pen and paper and figure out the algorithm. Lets say we want something similar to your task: just grouping the clusters of ones. The pseudocode might look like this.
Create a list of clusters
For each element in the grid, check if its a one.
If it is a one, check if it has a neighbor that is part of a cluster.
If so, add it to that cluster, else create a new cluster an add it.
If would then run through this on paper with a small example.
Once you have your desired algorithm, putting it into a dictionary and sorting it should be trivial.

Magic square brute force algorithm

Basically for an assignment I need to create a C# program that will take a number as input (n) then create a 2D array size n*n with the numbers 1 to (n*n). It needs to use a brute force method. I have done this but at the moment the program will just randomly generate the order of the numbers each time, so it will sometimes check the same order more than once. Obviously this means it takes a really long time to check any number above 3, and even for 3 it can take a few minutes. Basically I'm wondering if there is any way for me to make it only check each order once. I am only allowed to use "basic" C# functions, so just things like *, /, +, - and nothing like .Shuffle etc.

Let me make sure I understand the question: you wish to enumerate all permutations of the numbers 1 through n squared, and check whether the permutation produces a magic square. You are now generating permutations randomly, but instead you wish to generate all permutations.
I wrote a series of articles on generating permutations; it is too long to easily summarize here.
http://ericlippert.com/2013/04/15/producing-permutations-part-one/

Choosing random order, as you found, is not a good idea.
I suggest that you put all the number 1 ... (n*n) in array , and than find all the permutation.
when you have all the permutation, it's easy to create square (1 .. n ==> the first row, n+1 ... 2n ==> the second row and so on).
Now, finding all the permutation can be done with the basic operation with recursion

Compare 10 Million Entities

I have to write a program that compares 10'000'000+ Entities against one another. The entities are basically flat rows in a database/csv file.
The comparison algorithm has to be pretty flexible, it's based on a rule engine where the end user enters rules and each entity is matched against every other entity.
I'm thinking about how I could possibly split this task into smaller workloads but I haven't found anything yet. Since the rules are entered by the end user pre-sorting the DataSet seems impossible.
What I'm trying to do now is fit the entire DataSet in memory and process each item. But that's not highly efficient and requires approx. 20 GB of memory (compressed).
Do you have an idea how I could split the workload or reduce it's size?
Thanks

If your rules are on the highest level of abstraction (e.g. any unknown comparison function), you can't achive your goal. 10^14 comparison operations will run for ages.
If the rules are not completely general I see 3 solutions to optimize different cases:
if comparison is transitive and you can calculate hash (somebody already recommended this), do it. Hashes can also be complicated, not only your rules =). Find good hash function and it might help in many cases.
if entities are sortable, sort them. For this purpose I'd recommend not sorting in-place, but build an array of indexes (or IDs) of items. If your comparison can be transformed to SQL (as I understand your data is in database), you can perform this on the DBMS side more efficiently and read the sorted indexes (for example 3,1,2 which means that item with ID=3 is the lowest, with ID=1 is in the middle and with ID=2 is the largest). Then you need to compare only adjacent elements.
if things are worth, I would try to use some heuristical sorting or hashing. I mean I would create hash which not necessarily uniquely identifies equal elements, but can split your dataset in groups between which there are definitely no one pair of equal elements. Then all equal pairs will be in the inside groups and you can read groups one by one and do manual complex function calculation in the group of not 10 000 000, but for example 100 elements. The other sub-approach is heuristical sorting with the same purpose to guarantee that equal elements aren't on the different endings of a dataset. After that you can read elements one by one and compare with 1000 previous elements for example (already read and kept in memory). I would keep in memory for example 1100 elements and free oldest 100 every time new 100 comes. This would optimize your DB reads. The other implementation of this may be possible also in case your rules contains rules like (Attribute1=Value1) AND (...), or rule like (Attribute1 < Value2) AND (...) or any other simple rule. Then you can make clusterisation first by this criterias and then compare items in created clusters.
By the way, what if your rule considers all 10 000 000 elements equal? Would you like to get 10^14 result pairs? This case proves that you can't solve this task in general case. Try making some limitations and assumptions.

I would try to think about rule hierarchy.
Let's say for example that rule A is "Color" and rule B is "Shape".
If you first divide objects by color,
than there is no need to compare Red circle with Blue triangle.
This will reduce the number of compares you will need to do.

I would create a hashcode from each entity. You probably have to exclude the id from the hash generation and then test for equals. If you have the hashs you could order all the hashcodes alphabetical. Having all entities in order means that it's pretty easy to check for doubles.

If you want to compare each entity with all entities than effectively you need to cluster the data , there is very fewer reasons to compare totally unrelated things ( compare Clothes with Human does not make sense) , i think your rules will try to cluster the data.
so you need to cluster the data , try some clustering algorithms like K-Means.
Also see , Apache Mahout

Are you looking for the best suitable sorting algorithm, kind of a, for this?
I think Divide and Concur seems good.
If the algorithm seems good, you can have plenty of other ways to do the calculation. Specially parallel processing using MPICH or something may give you a final destination.
But before decide how to execute, you have to think if algorithm fits first.

Where can I find a large list of primes for an array literal?

I need a way to store and very efficiently retrieve the first 3512 primes in C#. As far as I know I would use an int array.
I have not been able to find a comma-separated listing of the first 3512 primes. What can I do to find/create such a list to paste in for the array, other than rolling my own prime generator?

Generating 3512 would be fairly easy and quick, but you can find lists pretty easily. Here's a list of the first 50,000 primes that would be easy to import or read in.

How can I sort an array of strings?

I have a list of input words separated by comma. I want to sort these words by alphabetical and length. How can I do this without using the built-in sorting functions?

Good question!! Sorting is probably the most important concept to learn as an up-and-coming computer scientist.
There are actually lots of different algorithms for sorting a list.
When you break all of those algorithms down, the most fundamental operation is the comparison of two items in the list, defining their "natural order".
For example, in order to sort a list of integers, I'd need a function that tells me, given any two integers X and Y whether X is less than, equal to, or greater than Y.
For your strings, you'll need the same thing: a function that tells you which of the strings has the "lesser" or "greater" value, or whether they're equal.
Traditionally, these "comparator" functions look something like this:
int CompareStrings(String a, String b) {
if (a < b)
return -1;
else if (a > b)
return 1;
else
return 0;
}
I've left out some of the details (like, how do you compute whether a is less than or greater than b? clue: iterate through the characters), but that's the basic skeleton of any comparison function. It returns a value less than zero if the first element is smaller and a value greater than zero if the first element is greater, returning zero if the elements have equal value.
But what does that have to do with sorting?
A sort routing will call that function for pairs of elements in your list, using the result of the function to figure out how to rearrange the items into a sorted list. The comparison function defines the "natural order", and the "sorting algorithm" defines the logic for calling and responding to the results of the comparison function.
Each algorithm is like a big-picture strategy for guaranteeing that ANY input will be correctly sorted. Here are a few of the algorithms that you'll probably want to know about:
Bubble Sort:
Iterate through the list, calling the comparison function for all adjacent pairs of elements. Whenever you get a result greater than zero (meaning that the first element is larger than the second one), swap the two values. Then move on to the next pair. When you get to the end of the list, if you didn't have to swap ANY pairs, then congratulations, the list is sorted! If you DID have to perform any swaps, go back to the beginning and start over. Repeat this process until there are no more swaps.
NOTE: this is usually not a very efficient way to sort a list, because in the worst cases, it might require you to scan the whole list as many as N times, for a list with N elements.
Merge Sort:
This is one of the most popular divide-and-conquer algorithms for sorting a list. The basic idea is that, if you have two already-sorted lists, it's easy to merge them. Just start from the beginning of each list and remove the first element of whichever list has the smallest starting value. Repeat this process until you've consumed all the items from both lists, and then you're done!
1 4 8 10
2 5 7 9
------------ becomes ------------>
1 2 4 5 7 8 9 10
But what if you don't have two sorted lists? What if you have just one list, and its elements are in random order?
That's the clever thing about merge sort. You can break any single list into smaller pieces, each of which is either an unsorted list, a sorted list, or a single element (which, if you thing about it, is actually a sorted list, with length = 1).
So the first step in a merge sort algorithm is to divide your overall list into smaller and smaller sub lists, At the tiniest levels (where each list only has one or two elements), they're very easy to sort. And once sorted, it's easy to merge any two adjacent sorted lists into a larger sorted list containing all the elements of the two sub lists.
NOTE: This algorithm is much better than the bubble sort method, described above, in terms of its worst-case-scenario efficiency. I won't go into a detailed explanation (which involves some fairly trivial math, but would take some time to explain), but the quick reason for the increased efficiency is that this algorithm breaks its problem into ideal-sized chunks and then merges the results of those chunks. The bubble sort algorithm tackles the whole thing at once, so it doesn't get the benefit of "divide-and-conquer".
Those are just two algorithms for sorting a list, but there are a lot of other interesting techniques, each with its own advantages and disadvantages: Quick Sort, Radix Sort, Selection Sort, Heap Sort, Shell Sort, and Bucket Sort.
The internet is overflowing with interesting information about sorting. Here's a good place to start:
http://en.wikipedia.org/wiki/Sorting_algorithms

Create a console application and paste this into the Program.cs as the body of the class.
public static void Main(string[] args)
{
string [] strList = "a,b,c,d,e,f,a,a,b".Split(new [] { ',' }, StringSplitOptions.RemoveEmptyEntries);
foreach(string s in strList.Sort())
Console.WriteLine(s);
}
public static string [] Sort(this string [] strList)
{
return strList.OrderBy(i => i).ToArray();
}
Notice that I do use a built in method, OrderBy. As other answers point out there are many different sort algorithms you could implement there and I think my code snippet does everything for you except the actual sort algorithm.
Some C# specific sorting tutorials

There is an entire area of study built around sorting algorithms. You may want to choose a simple one and implement it.
Though it won't be the most performant, it shouldn't take you too long to implement a bubble sort.

If you don't want to use build-in-functions, you have to create one by your self. I would recommend Bubble sort or some similar algorithm. Bubble sort is not an effective algoritm, but it get the works done, and is easy to understand.
You will find much good reading on wikipedia.

I would recommend doing a wiki for quicksort.
Still not sure why you don't want to use the built in sort?

Bubble sort damages the brain.
Insertion sort is at least as simple to understand and code, and is actually useful in practice (for very small data sets, and nearly-sorted data). It works like this:
Suppose that the first n items are already in order (you can start with n = 1, since obviously one thing on its own is "in the correct order").
Take the (n+1)th item in your array. Call this the "pivot". Starting with the nth item and working down:
- if it is bigger than the pivot, move it one space to the right (to create a "gap" to the left of it).
- otherwise, leave it in place, put the "pivot" one space to the right of it (that is, in the "gap" if you moved anything, or where it started if you moved nothing), and stop.
Now the first n+1 items in the array are in order, because the pivot is to the right of everything smaller than it, and to the left of everything bigger than it. Since you started with n items in order, that's progress.
Repeat, with n increasing by 1 at each step, until you've processed the whole list.
This corresponds to one way that you might physically put a series of folders into a filing cabinet in order: put one in; then put another one into its correct position by pushing everything that belongs after it over by one space to make room; repeat until finished. Nobody ever sorts physical objects by bubble sort, so it's a mystery to me why it's considered "simple".
All that's left now is that you need to be able to work out, given two strings, whether the first is greater than the second. I'm not quite sure what you mean by "alphabetical and length" : alphabetical order is done by comparing one character at a time from each string. If there not the same, that's your order. If they are the same, look at the next one, unless you're out of characters in one of the strings, in which case that's the one that's "smaller".

Use NSort
I ran across the NSort library a couple of years ago in the book Windows Developer Power Tools. The NSort library implements a number of sorting algorithms. The main advantage to using something like NSort over writing your own sorting is that is is already tested and optimized.

Posting link to fast string sort code in C#:
http://www.codeproject.com/KB/cs/fast_string_sort.aspx
Another point:
The suggested comparator above is not recommended for non-English languages:
int CompareStrings(String a, String b) {
if (a < b) return -1;
else if (a > b)
return 1; else
return 0; }
Checkout this link for non-English language sort:
http://msdn.microsoft.com/en-us/goglobal/bb688122
And as mentioned, use nsort for really gigantic arrays that don't fit in memory.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.