C# QuickSort too slow - c#

I'm learning different types of sorting now, and I found out that, starting from a certain point, my QuickSort algorithm doesn't work that quick at all.
Here is my code:
class QuickSort
{
// partitioning array on the key so that the left part is <=key, right part > key
private int Partition(int[] arr, int start, int end)
{
int key = arr[end];
int i = start - 1;
for (int j = start; j < end; j++)
{
if (arr[j] <= key) Swap(ref arr[++i], ref arr[j]);
}
Swap(ref arr[++i], ref arr[end]);
return i;
}
// sorting
public void QuickSorting(int[] arr, int start, int end)
{
if (start < end)
{
int key = Partition(arr, start, end);
QuickSorting(arr, start, key - 1);
QuickSorting(arr, key + 1, end);
}
}
}
class Test
{
static void Main(string[] args)
{
QuickSort quick = new QuickSort();
Random rnd = new Random(DateTime.Now.Millisecond);
int[] array = new int[1000000];
for (int i = 0; i < 1000000; i++)
{
int i_rnd = rnd.Next(1, 1000);
array[i] = i_rnd;
}
quick.QuickSorting(array, 0, array.Length - 1);
}
}
It takes about 15 seconds to run this code on an array of a million elements. While, for example, MergeSort or HeapSort do the same in less than a second.
Could you please explain to me why this can happen?

How quick your sort is and which algorithm you should use depends a lot of your input data. Is it random, nearly sorted, reversed etc.
There's a very nice page that illustrates how the different sorting algorithms work:
Sorting Algorithm Animations

Have you considered inlining the Swap method? It shouldn't be hard to do so, but it may be that the JIT is finding it hard to inline.
When I implemented quicksort for Edulinq I didn't see this problem at all - you may want to try my code (the simplest, recursive form probably) to see how that performs for you. If it does well, try to work out where the differences are.
While different algorithms will behave differently with the same data, I wouldn't expect to see this much difference on randomly-generated data.

You have 1,000,000 random elements with 1,000 distinct values. So, we can expect most values to appear about 1,000 times in your array. This gives you some quadratic O(n^2) running time.
To partition the array in 1,000 pieces, where every partition contains the same number, happens at a stack depth of about log2(1000), about 10. (That is assuming a call to partition neatly breaks it up in two pieces.) That's about 10,000,000 operations.
To quicksort the last 1,000 partitions, all containing 1,000 identical values. We need 1,000 times 1,000 + 999 + 998 + ... + 1 comparisons. (At every round quicksort reduces the problem by one, only removing the key/pivot.) That gives 500,000,000 operations. The most ideal way of quicksort 1,000 partitions would be 1,000 times 1,000*10 operations = 10,000,000. Because of the identical values, you hit a quadratic case here, quicksort's worst case performance. So, about halfway down the quicksort, it goes to worst case behavior.
If every value occurs only a few times, it doesn't matter if you sort those few tiny partitions in O(N^2) or O(N logN). But here we had a lot and huge partitions to be sorted in O(N^2).
To improve your code: partition in 3 pieces. Smaller than the pivot, equal to the pivot and bigger than the pivot. Then, only quicksort the first and last partitions. You will need to do an extra compare; test for equality first. But I think, for this input, it would be a lot faster.

Related

Comparing 1 million integers in an array without sorting it first

I have a task to find the difference between every integer in an array of random numbers and return the lowest difference. A requirement is that the integers can be between 0 and int.maxvalue and that the array will contain 1 million integers.
I put some code together which works fine for a small amount of integers but it takes a long time (so long most of the time I give up waiting) to do a million. My code is below, but I'm looking for some insight on how I can improve performance.
for(int i = 0; i < _RandomIntegerArray.Count(); i++) {
for(int ii = i + 1; ii < _RandomIntegerArray.Count(); ii++) {
if (_RandomIntegerArray[i] == _RandomIntegerArray[ii]) continue;
int currentDiff = Math.Abs(_RandomIntegerArray[i] - _RandomIntegerArray[ii]);
if (currentDiff < lowestDiff) {
Pairs.Clear();
}
if (currentDiff <= lowestDiff) {
Pairs.Add(new NumberPair(_RandomIntegerArray[i], _RandomIntegerArray[ii]));
lowestDiff = currentDiff;
}
}
}
Apologies to everyone that has pointed out that I don't sort; unfortunately sorting is not allowed.
Imagine that you have already found a pair of integers a and b from your random array such that a > b and a-b is the lowest among all possible pairs of integers in the array.
Does an integer c exist in the array such that a > c > b, i.e. c goes between a and b? Clearly, the answer is "no", because otherwise you'd pick the pair {a, c} or {c, b}.
This gives an answer to your problem: a and b must be next to each other in a sorted array. Sorting can be done in O(N*log N), and the search can be done in O(N) - an improvement over O(N2) algorithm that you have.
As per #JonSkeet try sorting the array first and then only compare consecutive array items, which means that you only need to iterate the array once:
Array.Sort(_RandomIntegerArray);
for (int i = 1; i < _RandomIntegerArray.Count(); i++)
{
int currentDiff = _RandomIntegerArray[i] - _RandomIntegerArray[i-1];
if (currentDiff < lowestDiff)
{
Pairs.Clear();
}
if (currentDiff <= lowestDiff)
{
Pairs.Add(new NumberPair(_RandomIntegerArray[i], _RandomIntegerArray[i-1]));
lowestDiff = currentDiff;
}
}
In my testing this results in < 200 ms elapsed for 1 million items.
You've got a million integers out of a possible 2.15 or 4.3 billion (signed or unsigned). That means the largest possible min distance is either about 2150 or 4300. Let's say that the max possible min distance is D.
Divide the legal integers into groups of length D. Create a hash h keyed on integers with arrays of ints as values. Process your array by taking each element x, and adding it to h[x/D].
The point of doing this is that any valid pair of points is either contained in h(k) for some k, or collectively in h(k) and h(k+1).
Find your pair of points by going through the keys of the hash and checking the points associated with adjacent keys. You can sort if you like, or use a bitvector, or any other method but now you're dealing with small arrays (on average 1 element per array).
As elements of the array are b/w 0 to int.maxvalue, so I suppose maxvalue will be less than 1 million. If it is so you just need to initialise the array[maxvalue] to 0 and then as you read 1 million values increment the value in your array.
Now read this array and find the lowest value as described by others as if all the values were sorted. If at any element is present more than 1 than its value will be >1 so you could easily say that min. difference is 0.
NOTE- This method is efficient only if you do not use sorting and more importantly int.maxvalue<<<<<(less than) 10^6(1 million).
It helps a little if you do not count on each iteration
int countIntegers = _RandomIntegerArray.Count();
for(int i = 0; i < countIntegers; i++) {
//...
for(int ii = i + 1; ii < countIntegers; ii++) {
//...
Given that Count() is only returning the number of Ints in an array on each successful count and not modifying the array or caching output until modifications.
How about splitting up the array into arraysize/number of processors sized chunks and running each chunk in a different thread. (Neil)
Assume three parts A, B and C of size as close as possible.
For each part, find the minimum "in-part" difference and that of pairs with the first component from the current part and the second from the next part (A being the next from C).
With a method taking O(n²) time, n/3 should take one ninth, done 2*3 times, this amounts to two thirds plus change for combining the results.
This calls to be applied recursively - remember Карацу́ба/Karatsuba multiplication?
Wait - maybe use two parts after all, for three fourth of the effort - very close to "Karatsuba". (When not seeing how to use an even number of parts, I was thinking multiprocessing with every processor doing "the same".)

Why my implementation of linked list quick sort is much slower than array one?

I been algorithm problem that requires me to do implementation of quick sort algorithm for linked list and array.
I have done both parts , algorithms are working, but it seems there is some bug in my quick-sort linked list implementation.
Here is my Quick sort linked list implementation.
public static void SortLinkedList(DataList items, DataList.Node low, DataList.Node high)
{
if( low != null && low !=high)
{
DataList.Node p = _PartitionLinkedList(items, low, high);
SortLinkedList(items, low, p);
SortLinkedList(items, p.Next(), null);
}
}
private static DataList.Node _PartitionLinkedList(DataList items, DataList.Node low, DataList.Node high)
{
DataList.Node pivot = low;
DataList.Node i = low;
for (DataList.Node j = i.Next(); j != high; j=j.Next())
{
if (j.Value().CompareTo(pivot.Value()) <= 0)
{
items.Swap(i.Next(),j);
i = i.Next();
}
}
items.Swap(pivot, i);
return i;
}
Here is Quick Sort array implementation
public static void SortData(DataArray items, int low, int high)
{
if (low < high)
{
int pi = _PartitionData(items, low, high);
SortData(items, low, pi - 1);
SortData(items, pi + 1, high);
}
}
static int _PartitionData(DataArray arr, int low, int high)
{
double pivot = arr[high];
int i = (low - 1);
for (int j = low; j <= high - 1; j++)
{
if (arr[j].CompareTo(pivot)<=0)
{
i++;
arr.Swap(i,j);
}
}
arr.Swap(i + 1, high);
return i + 1;
}
Here is Quick sort array and linked list performance. (left n, right time)
Picture
As you can see qs linked list took 10 min to sort 6400 elements. I dont think that its normal..
Also I dont think that its because of the data structure, because I was using same structure for selection sort and performance for both linked list and array were similar.
GitHub repo in case i forgot to provide some code. Repo
10 minutes is a very long time for 6400 elements. It would normally require 2 or 3 horrible mistakes, not just one.
Unfortunately, I only see one horrible mistake: Your second recursive call to SortLinkedList(items, p.Next(), null); goes all the way to the end of the list. You meant for it to stop at high.
That might account for the 10 minutes, but it seems a little unlikely.
It also looks to me like your sort is incorrect, even after you fix the above bug -- be sure to test the output!
I would look at your linked list, particularly the swap method. Unless we see the implementation of the linked list, I think the problem area is there.
Is there a reason why you're using linked lists? They have o(n) search which makes quicksort n^2lg(n) sort.
A different way to do it is to add all the items in your linked lists to a list, sort that list, and recreate your linkedlist. List.Sort() uses quick sort.
public static void SortLinkedList(DataList items)
{
list<object> actualList = new list<object>();
for (DataList.Node j = i.Next(); j != null; j=j.Next())
{
list.add(j.Value());
}
actualList.Sort();
items.Clear();
for (int i = 0; i < actualList.Count;i++)
{
items.Add(actualList[i]);
}
}
Quick sort for linked list is normally slightly different than quick sort for arrays. Use the first node's data value as the pivot value. Then the code creates 3 lists, one for values < pivot, one for values == pivot, one for values > pivot. It then does a recursive calls for the < pivot and > pivot lists. When the recursive call returns those 3 lists are now sorted, so the code only needs to concatenate the 3 lists.
To speed up concatenation of lists, keep track of a pointer to the last node. To simplify this, use circular lists, and use a pointer to the last node as the main way to access a list. This makes appending (joining) list simpler (no scanning). Once inside a function, use last->next in order to get a pointer to the first node of a list.
Two of the worst case data patterns are already sorted data or already reverse sorted data. If the circular list with pointer to last node method is used, then the average of last and first nodes could be used as a median of 2 which could help (note the list for nodes == pivot could end up empty).
Worst case time complexity is O(n^2). Worst case stack usage is O(n). The stack usage could be reduced by using recursion on the smaller of the list < pivot and list > pivot. After return, the now sorted smaller list would be concatenated with the list == pivot and saved in a 4th list. Then the sort process would iterate on the remaining unsorted list, then merging (or perhaps joining) with the saved list.
Sorting a linked list, using any method, including bottom up merge sort , will be slower than moving the linked list to an array, sorting the array, then creating a linked list from the sorted array. However the quick sort method I describe will be much faster than using an array oriented algorithm with a linked list.

Shuffling an array bottlenecks on Random.Nex(int)

I've been working on a small piece of code that sorts the provided array. The array should be sorted as fast as possible. Randomization is not that important. After profiling the method I found out that the biggest hog is Random.Next. Which takes up about 70% of the method execution time. After searching online for faster random generators I found no plug and play libraries that offer any improved performance.
So I was wondering whether there are any ways to improve the performance of this code any more.
[MethodImpl(MethodImplOptions.NoInlining)]
private static void Shuffle(byte[] chars)
{
for (var i = 0; i < chars.Length; i++)
{
var index = rnd.Next(chars.Length);
byte tmpStore = chars[index];
chars[index] = chars[i];
chars[i] = tmpStore;
}
}
Alright, this is getting into micro-optimization territory.
Random.Next(int) actually performs some ops internally that we can optimize out:
int index = (int)(rnd.Next() * (1.0 / int.Max) * chars.Length);
Since you're using the same maxValue over and over in a loop, a trivial optimization would be to precalculate your denominator outside of the loop. This way we get rid of an int->double conversion and a multiply:
double d = chars.Length / (double)int.Max;
And then:
int index = (int)(rnd.Next() * d);
On a separate note: your shuffle isn't going to have a uniform distribution. See Jeff Atwood's post The Danger of Naïveté which deals specifically with this subject and shows how to perform a uniform Fisher-Yates shuffle.
If n^n isn't too big for the double range, you could create one random double number, multiply it by n^n, then use modulo(n) each iteration as the current random number prior to dividing the random result by n as preparation for the next iteration.

Is using Random and OrderBy a good shuffle algorithm? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 1 year ago.
The community reviewed whether to reopen this question 3 months ago and left it closed:
Original close reason(s) were not resolved
Improve this question
I have read an article about various shuffle algorithms over at Coding Horror. I have seen that somewhere people have done this to shuffle a list:
var r = new Random();
var shuffled = ordered.OrderBy(x => r.Next());
Is this a good shuffle algorithm? How does it work exactly? Is it an acceptable way of doing this?
It's not a way of shuffling that I like, mostly on the grounds that it's O(n log n) for no good reason when it's easy to implement an O(n) shuffle. The code in the question "works" by basically giving a random (hopefully unique!) number to each element, then ordering the elements according to that number.
I prefer Durstenfeld's variant of the Fisher-Yates shuffle which swaps elements.
Implementing a simple Shuffle extension method would basically consist of calling ToList or ToArray on the input then using an existing implementation of Fisher-Yates. (Pass in the Random as a parameter to make life generally nicer.) There are plenty of implementations around... I've probably got one in an answer somewhere.
The nice thing about such an extension method is that it would then be very clear to the reader what you're actually trying to do.
EDIT: Here's a simple implementation (no error checking!):
public static IEnumerable<T> Shuffle<T>(this IEnumerable<T> source, Random rng)
{
T[] elements = source.ToArray();
// Note i > 0 to avoid final pointless iteration
for (int i = elements.Length-1; i > 0; i--)
{
// Swap element "i" with a random earlier element it (or itself)
int swapIndex = rng.Next(i + 1);
T tmp = elements[i];
elements[i] = elements[swapIndex];
elements[swapIndex] = tmp;
}
// Lazily yield (avoiding aliasing issues etc)
foreach (T element in elements)
{
yield return element;
}
}
EDIT: Comments on performance below reminded me that we can actually return the elements as we shuffle them:
public static IEnumerable<T> Shuffle<T>(this IEnumerable<T> source, Random rng)
{
    T[] elements = source.ToArray();
    for (int i = elements.Length - 1; i >= 0; i--)
    {
        // Swap element "i" with a random earlier element it (or itself)
// ... except we don't really need to swap it fully, as we can
// return it immediately, and afterwards it's irrelevant.
        int swapIndex = rng.Next(i + 1);
yield return elements[swapIndex];
        elements[swapIndex] = elements[i];
    }
}
This will now only do as much work as it needs to.
Note that in both cases, you need to be careful about the instance of Random you use as:
Creating two instances of Random at roughly the same time will yield the same sequence of random numbers (when used in the same way)
Random isn't thread-safe.
I have an article on Random which goes into more detail on these issues and provides solutions.
This is based on Jon Skeet's answer.
In that answer, the array is shuffled, then returned using yield. The net result is that the array is kept in memory for the duration of foreach, as well as objects necessary for iteration, and yet the cost is all at the beginning - the yield is basically an empty loop.
This algorithm is used a lot in games, where the first three items are picked, and the others will only be needed later if at all. My suggestion is to yield the numbers as soon as they are swapped. This will reduce the start-up cost, while keeping the iteration cost at O(1) (basically 5 operations per iteration). The total cost would remain the same, but the shuffling itself would be quicker. In cases where this is called as collection.Shuffle().ToArray() it will theoretically make no difference, but in the aforementioned use cases it will speed start-up. Also, this would make the algorithm useful for cases where you only need a few unique items. For example, if you need to pull out three cards from a deck of 52, you can call deck.Shuffle().Take(3) and only three swaps will take place (although the entire array would have to be copied first).
public static IEnumerable<T> Shuffle<T>(this IEnumerable<T> source, Random rng)
{
T[] elements = source.ToArray();
// Note i > 0 to avoid final pointless iteration
for (int i = elements.Length - 1; i > 0; i--)
{
// Swap element "i" with a random earlier element it (or itself)
int swapIndex = rng.Next(i + 1);
yield return elements[swapIndex];
elements[swapIndex] = elements[i];
// we don't actually perform the swap, we can forget about the
// swapped element because we already returned it.
}
// there is one item remaining that was not returned - we return it now
yield return elements[0];
}
Starting from this quote of Skeet:
It's not a way of shuffling that I like, mostly on the grounds that it's O(n log n) for no good reason when it's easy to implement an O(n) shuffle. The code in the question "works" by basically giving a random (hopefully unique!) number to each element, then ordering the elements according to that number.
I'll go on a little explaining the reason for the hopefully unique!
Now, from the Enumerable.OrderBy:
This method performs a stable sort; that is, if the keys of two elements are equal, the order of the elements is preserved
This is very important! What happens if two elements "receive" the same random number? It happens that they remain in the same order they are in the array. Now, what is the possibility for this to happen? It is difficult to calculate exactly, but there is the Birthday Problem that is exactly this problem.
Now, is it real? Is it true?
As always, when in doubt, write some lines of program: http://pastebin.com/5CDnUxPG
This little block of code shuffles an array of 3 elements a certain number of times using the Fisher-Yates algorithm done backward, the Fisher-Yates algorithm done forward (in the wiki page there are two pseudo-code algorithms... They produce equivalent results, but one is done from first to last element, while the other is done from last to first element), the naive wrong algorithm of http://blog.codinghorror.com/the-danger-of-naivete/ and using the .OrderBy(x => r.Next()) and the .OrderBy(x => r.Next(someValue)).
Now, Random.Next is
A 32-bit signed integer that is greater than or equal to 0 and less than MaxValue.
so it's equivalent to
OrderBy(x => r.Next(int.MaxValue))
To test if this problem exists, we could enlarge the array (something very slow) or simply reduce the maximum value of the random number generator (int.MaxValue isn't a "special" number... It is simply a very big number). In the end, if the algorithm isn't biased by the stableness of the OrderBy, then any range of values should give the same result.
The program then tests some values, in the range 1...4096. Looking at the result, it's quite clear that for low values (< 128), the algorithm is very biased (4-8%). With 3 values you need at least r.Next(1024). If you make the array bigger (4 or 5), then even r.Next(1024) isn't enough. I'm not an expert in shuffling and in math, but I think that for each extra bit of length of the array, you need 2 extra bits of maximum value (because the birthday paradox is connected to the sqrt(numvalues)), so that if the maximum value is 2^31, I'll say that you should be able to sort arrays up to 2^12/2^13 bits (4096-8192 elements)
It's probablly ok for most purposes, and almost always it generates a truly random distribution (except when Random.Next() produces two identical random integers).
It works by assigning each element of the series a random integer, then ordering the sequence by these integers.
It's totally acceptable for 99.9% of the applications (unless you absolutely need to handle the edge case above). Also, skeet's objection to its runtime is valid, so if you're shuffling a long list you might not want to use it.
This has come up many times before. Search for Fisher-Yates on StackOverflow.
Here is a C# code sample I wrote for this algorithm. You can parameterize it on some other type, if you prefer.
static public class FisherYates
{
// Based on Java code from wikipedia:
// http://en.wikipedia.org/wiki/Fisher-Yates_shuffle
static public void Shuffle(int[] deck)
{
Random r = new Random();
for (int n = deck.Length - 1; n > 0; --n)
{
int k = r.Next(n+1);
int temp = deck[n];
deck[n] = deck[k];
deck[k] = temp;
}
}
}
Seems like a good shuffling algorithm, if you're not too worried on the performance. The only problem I'd point out is that its behavior is not controllable, so you may have a hard time testing it.
One possible option is having a seed to be passed as a parameter to the random number generator (or the random generator as a parameter), so you can have more control and test it more easily.
I found Jon Skeet's answer to be entirely satisfactory, but my client's robo-scanner will report any instance of Random as a security flaw. So I swapped it out for System.Security.Cryptography.RNGCryptoServiceProvider. As a bonus, it fixes that thread-safety issue that was mentioned. On the other hand, RNGCryptoServiceProvider has been measured as 300x slower than using Random.
Usage:
using (var rng = new RNGCryptoServiceProvider())
{
var data = new byte[4];
yourCollection = yourCollection.Shuffle(rng, data);
}
Method:
/// <summary>
/// Shuffles the elements of a sequence randomly.
/// </summary>
/// <param name="source">A sequence of values to shuffle.</param>
/// <param name="rng">An instance of a random number generator.</param>
/// <param name="data">A placeholder to generate random bytes into.</param>
/// <returns>A sequence whose elements are shuffled randomly.</returns>
public static IEnumerable<T> Shuffle<T>(this IEnumerable<T> source, RNGCryptoServiceProvider rng, byte[] data)
{
var elements = source.ToArray();
for (int i = elements.Length - 1; i >= 0; i--)
{
rng.GetBytes(data);
var swapIndex = BitConverter.ToUInt32(data, 0) % (i + 1);
yield return elements[swapIndex];
elements[swapIndex] = elements[i];
}
}
Looking for an algorithm? You can use my ShuffleList class:
class ShuffleList<T> : List<T>
{
public void Shuffle()
{
Random random = new Random();
for (int count = Count; count > 0; count--)
{
int i = random.Next(count);
Add(this[i]);
RemoveAt(i);
}
}
}
Then, use it like this:
ShuffleList<int> list = new ShuffleList<int>();
// Add elements to your list.
list.Shuffle();
How does it work?
Let's take an initial sorted list of the 5 first integers: { 0, 1, 2, 3, 4 }.
The method starts by counting the nubmer of elements and calls it count. Then, with count decreasing on each step, it takes a random number between 0 and count and moves it to the end of the list.
In the following step-by-step example, the items that could be moved are italic, the selected item is bold:
0 1 2 3 4
0 1 2 3 4
0 1 2 4 3
0 1 2 4 3
1 2 4 3 0
1 2 4 3 0
1 2 3 0 4
1 2 3 0 4
2 3 0 4 1
2 3 0 4 1
3 0 4 1 2
This algorithm shuffles by generating a new random value for each value in a list, then ordering the list by those random values. Think of it as adding a new column to an in-memory table, then filling it with GUIDs, then sorting by that column. Looks like an efficient way to me (especially with the lambda sugar!)
Slightly unrelated, but here is an interesting method (that even though it is really excessibe, has REALLY been implemented) for truly random generation of dice rolls!
Dice-O-Matic
The reason I'm posting this here, is that he makes some interesting points about how his users reacted to the idea of using algorithms to shuffle, over actual dice. Of course, in the real world, such a solution is only for the really extreme ends of the spectrum where randomness has such an big impact and perhaps the impact affects money ;).
I would say that many answers here like "This algorithm shuffles by generating a new random value for each value in a list, then ordering the list by those random values" might be very wrong!
I'd think that this DOES NOT assign a random value to each element of the source collection. Instead there might be a sort algorithm running like Quicksort which would call a compare-function approximately n log n times. Some sort algortihm really expect this compare-function to be stable and always return the same result!
Couldn't it be that the IEnumerableSorter calls a compare function for each algorithm step of e.g. quicksort and each time calls the function x => r.Next() for both parameters without caching these!
In that case you might really mess up the sort algorithm and make it much worse than the expectations the algorithm is build up on. Of course, it eventually will become stable and return something.
I might check it later by putting debugging output inside a new "Next" function so see what happens.
In Reflector I could not immediately find out how it works.
It is worth noting that due to the deferred execution of LINQ, using a random number generator instance with OrderBy() can result in a possibly unexpected behavior: The sorting does not happen until the collection is read. This means each time you read or enumerate the collection, the order changes. One would possibly expect the elements to be shuffled once and then to retain the order each time it is accessed thereafter.
Random random = new();
var shuffled = ordered.OrderBy(x => random.Next())
The code above passes a lambda function x => random.Next() as a parameter to OrderBy(). This will capture the instance referred to by random and save it with the lambda by so that it can call Next() on this instance to perform the ordering later which happens right before it is enumerated(when the first element is requested from the collection).
The problem here, is since this execution is saved for later, the ordering happens each time just before the collection is enumerated using new numbers obtained by calling Next() on the same random instance.
Example
To demonstrate this behavior, I have used Visual Studio's C# Interactive Shell:
> List<int> list = new() { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
> Random random = new();
> var shuffled = list.OrderBy(element => random.Next());
> shuffled.ToList()
List<int>(10) { 5, 9, 10, 4, 6, 2, 8, 3, 1, 7 }
> shuffled.ToList()
List<int>(10) { 8, 2, 9, 1, 3, 6, 5, 10, 4, 7 } // Different order
> shuffled.ElementAt(0)
9 // First element is 9
> shuffled.ElementAt(0)
7 // First element is now 7
>
This behavior can even be seen in action by placing a breakpoint just after where the IOrderedEnumerable is created when using Visual Studio's debugger: each time you hover on the variable, the elements show up in a different order.
This, of course, does not apply if you immediately enumerate the elements by calling ToList() or an equivalent. However, this behavior can lead to bugs in many cases, one of them being when the shuffled collection is expected to contain a unique element at each index.
Startup time to run on code with clear all threads and cache every new test,
First unsuccessful code. It runs on LINQPad. If you follow to test this code.
Stopwatch st = new Stopwatch();
st.Start();
var r = new Random();
List<string[]> list = new List<string[]>();
list.Add(new String[] {"1","X"});
list.Add(new String[] {"2","A"});
list.Add(new String[] {"3","B"});
list.Add(new String[] {"4","C"});
list.Add(new String[] {"5","D"});
list.Add(new String[] {"6","E"});
//list.OrderBy (l => r.Next()).Dump();
list.OrderBy (l => Guid.NewGuid()).Dump();
st.Stop();
Console.WriteLine(st.Elapsed.TotalMilliseconds);
list.OrderBy(x => r.Next()) uses 38.6528 ms
list.OrderBy(x => Guid.NewGuid()) uses 36.7634 ms (It's recommended from MSDN.)
the after second time both of them use in the same time.
EDIT:
TEST CODE on Intel Core i7 4#2.1GHz, Ram 8 GB DDR3 #1600, HDD SATA 5200 rpm with [Data: www.dropbox.com/s/pbtmh5s9lw285kp/data]
using System;
using System.Runtime;
using System.Diagnostics;
using System.IO;
using System.Collections.Generic;
using System.Collections;
using System.Linq;
using System.Threading;
namespace Algorithm
{
class Program
{
public static void Main(string[] args)
{
try {
int i = 0;
int limit = 10;
var result = GetTestRandomSort(limit);
foreach (var element in result) {
Console.WriteLine();
Console.WriteLine("time {0}: {1} ms", ++i, element);
}
} catch (Exception e) {
Console.WriteLine(e.Message);
} finally {
Console.Write("Press any key to continue . . . ");
Console.ReadKey(true);
}
}
public static IEnumerable<double> GetTestRandomSort(int limit)
{
for (int i = 0; i < 5; i++) {
string path = null, temp = null;
Stopwatch st = null;
StreamReader sr = null;
int? count = null;
List<string> list = null;
Random r = null;
GC.Collect();
GC.WaitForPendingFinalizers();
Thread.Sleep(5000);
st = Stopwatch.StartNew();
#region Import Input Data
path = Environment.CurrentDirectory + "\\data";
list = new List<string>();
sr = new StreamReader(path);
count = 0;
while (count < limit && (temp = sr.ReadLine()) != null) {
// Console.WriteLine(temp);
list.Add(temp);
count++;
}
sr.Close();
#endregion
// Console.WriteLine("--------------Random--------------");
// #region Sort by Random with OrderBy(random.Next())
// r = new Random();
// list = list.OrderBy(l => r.Next()).ToList();
// #endregion
// #region Sort by Random with OrderBy(Guid)
// list = list.OrderBy(l => Guid.NewGuid()).ToList();
// #endregion
// #region Sort by Random with Parallel and OrderBy(random.Next())
// r = new Random();
// list = list.AsParallel().OrderBy(l => r.Next()).ToList();
// #endregion
// #region Sort by Random with Parallel OrderBy(Guid)
// list = list.AsParallel().OrderBy(l => Guid.NewGuid()).ToList();
// #endregion
// #region Sort by Random with User-Defined Shuffle Method
// r = new Random();
// list = list.Shuffle(r).ToList();
// #endregion
// #region Sort by Random with Parallel User-Defined Shuffle Method
// r = new Random();
// list = list.AsParallel().Shuffle(r).ToList();
// #endregion
// Result
//
st.Stop();
yield return st.Elapsed.TotalMilliseconds;
foreach (var element in list) {
Console.WriteLine(element);
}
}
}
}
}
Result Description: https://www.dropbox.com/s/9dw9wl259dfs04g/ResultDescription.PNG
Result Stat: https://www.dropbox.com/s/ewq5ybtsvesme4d/ResultStat.PNG
Conclusion:
Assume: LINQ OrderBy(r.Next()) and OrderBy(Guid.NewGuid()) are not worse than User-Defined Shuffle Method in First Solution.
Answer: They are contradiction.

What's wrong in terms of performance with this code? List.Contains, random usage, threading?

I have a local class with a method used to build a list of strings and I'm finding that when I hit this method (in a for loop of 1000 times) often it's not returning the amount I request.
I have a global variable:
string[] cachedKeys
A parameter passed to the method:
int requestedNumberToGet
The method looks similar to this:
List<string> keysToReturn = new List<string>();
int numberPossibleToGet = (cachedKeys.Length <= requestedNumberToGet) ?
cachedKeys.Length : requestedNumberToGet;
Random rand = new Random();
DateTime breakoutTime = DateTime.Now.AddMilliseconds(5);
//Do we have enough to fill the request within the time? otherwise give
//however many we currently have
while (DateTime.Now < breakoutTime
&& keysToReturn.Count < numberPossibleToGet
&& cachedKeys.Length >= numberPossibleToGet)
{
string randomKey = cachedKeys[rand.Next(0, cachedKeys.Length)];
if (!keysToReturn.Contains(randomKey))
keysToReturn.Add(randomKey);
}
if (keysToReturn.Count != numberPossibleToGet)
Debugger.Break();
I have approximately 40 strings in cachedKeys none exceeding 15 characters in length.
I'm no expert with threading so I'm literally just calling this method 1000 times in a loop and consistently hitting that debug there.
The machine this is running on is a fairly beefy desktop so I would expect the breakout time to be realistic, in fact it randomly breaks at any point of the loop (I've seen 20s, 100s, 200s, 300s).
Any one have any ideas where I'm going wrong with this?
Edit: Limited to .NET 2.0
Edit: The purpose of the breakout is so that if the method is taking too long to execute, the client (several web servers using the data for XML feeds) won't have to wait while the other project dependencies initialise, they'll just be given 0 results.
Edit: Thought I'd post the performance stats
Original
'0.0042477465711424217323710136' - 10
'0.0479597267250446634977350473' - 100
'0.4721072091564710039963179678' - 1000
Skeet
'0.0007076318358897569383818334' - 10
'0.007256508857969378789762386' - 100
'0.0749829936486341141122684587' - 1000
Freddy Rios
'0.0003765841748043396576939248' - 10
'0.0046003053460705201359390649' - 100
'0.0417058592642360970458535931' - 1000
Why not just take a copy of the list - O(n) - shuffle it, also O(n) - and then return the number of keys that have been requested. In fact, the shuffle only needs to be O(nRequested). Keep swapping a random member of the unshuffled bit of the list with the very start of the unshuffled bit, then expand the shuffled bit by 1 (just a notional counter).
EDIT: Here's some code which yields the results as an IEnumerable<T>. Note that it uses deferred execution, so if you change the source that's passed in before you first start iterating through the results, you'll see those changes. After the first result is fetched, the elements will have been cached.
static IEnumerable<T> TakeRandom<T>(IEnumerable<T> source,
int sizeRequired,
Random rng)
{
List<T> list = new List<T>(source);
sizeRequired = Math.Min(sizeRequired, list.Count);
for (int i=0; i < sizeRequired; i++)
{
int index = rng.Next(list.Count-i);
T selected = list[i + index];
list[i + index] = list[i];
list[i] = selected;
yield return selected;
}
}
The idea is that at any point after you've fetched n elements, the first n elements of the list will be those elements - so we make sure that we don't pick those again. When then pick a random element from "the rest", swap it to the right position and yield it.
Hope this helps. If you're using C# 3 you might want to make this an extension method by putting "this" in front of the first parameter.
The main issue are the using retries in a random scenario to ensure you get unique values. This quickly gets out of control, specially if the amount of items requested is near to the amount of items to get i.e. if you increase the amount of keys, you will see the issue less often but that can be avoided.
The following method does it by keeping a list of the keys remaining.
List<string> GetSomeKeys(string[] cachedKeys, int requestedNumberToGet)
{
int numberPossibleToGet = Math.Min(cachedKeys.Length, requestedNumberToGet);
List<string> keysRemaining = new List<string>(cachedKeys);
List<string> keysToReturn = new List<string>(numberPossibleToGet);
Random rand = new Random();
for (int i = 0; i < numberPossibleToGet; i++)
{
int randomIndex = rand.Next(keysRemaining.Count);
keysToReturn.Add(keysRemaining[randomIndex]);
keysRemaining.RemoveAt(randomIndex);
}
return keysToReturn;
}
The timeout was necessary on your version as you could potentially keep retrying to get a value for a long time. Specially when you wanted to retrieve the whole list, in which case you would almost certainly get a fail with the version that relies on retries.
Update: The above performs better than these variations:
List<string> GetSomeKeysSwapping(string[] cachedKeys, int requestedNumberToGet)
{
int numberPossibleToGet = Math.Min(cachedKeys.Length, requestedNumberToGet);
List<string> keys = new List<string>(cachedKeys);
List<string> keysToReturn = new List<string>(numberPossibleToGet);
Random rand = new Random();
for (int i = 0; i < numberPossibleToGet; i++)
{
int index = rand.Next(numberPossibleToGet - i) + i;
keysToReturn.Add(keys[index]);
keys[index] = keys[i];
}
return keysToReturn;
}
List<string> GetSomeKeysEnumerable(string[] cachedKeys, int requestedNumberToGet)
{
Random rand = new Random();
return TakeRandom(cachedKeys, requestedNumberToGet, rand).ToList();
}
Some numbers with 10.000 iterations:
Function Name Elapsed Inclusive Time Number of Calls
GetSomeKeys 6,190.66 10,000
GetSomeKeysEnumerable 15,617.04 10,000
GetSomeKeysSwapping 8,293.64 10,000
A few thoughts.
First, your keysToReturn list is potentially being added to each time through the loop, right? You're creating an empty list and then adding each new key to the list. Since the list was not pre-sized, each add becomes an O(n) operation (see MSDN documentation). To fix this, try pre-sizing your list like this.
int numberPossibleToGet = (cachedKeys.Length <= requestedNumberToGet) ? cachedKeys.Length : requestedNumberToGet;
List<string> keysToReturn = new List<string>(numberPossibleToGet);
Second, your breakout time is unrealistic (ok, ok, impossible) on Windows. All of the information I've ever read on Windows timing suggests that the best you can possibly hope for is 10 millisecond resolution, but in practice it's more like 15-18 milliseconds. In fact, try this code:
for (int iv = 0; iv < 10000; iv++) {
Console.WriteLine( DateTime.Now.Millisecond.ToString() );
}
What you'll see in the output are discrete jumps. Here is a sample output that I just ran on my machine.
13
...
13
28
...
28
44
...
44
59
...
59
75
...
The millisecond value jumps from 13 to 28 to 44 to 59 to 75. That's roughly a 15-16 millisecond resolution in the DateTime.Now function for my machine. This behavior is consistent with what you'd see in the C runtime ftime() call. In other words, it's a systemic trait of the Windows timing mechanism. The point is, you should not rely on a consistent 5 millisecond breakout time because you won't get it.
Third, am I right to assume that the breakout time is prevent the main thread from locking up? If so, then it'd be pretty easy to spawn off your function to a ThreadPool thread and let it run to completion regardless of how long it takes. Your main thread can then operate on the data.
Use HashSet instead, HashSet is much faster for lookup than List
HashSet<string> keysToReturn = new HashSet<string>();
int numberPossibleToGet = (cachedKeys.Length <= requestedNumberToGet) ? cachedKeys.Length : requestedNumberToGet;
Random rand = new Random();
DateTime breakoutTime = DateTime.Now.AddMilliseconds(5);
int length = cachedKeys.Length;
while (DateTime.Now < breakoutTime && keysToReturn.Count < numberPossibleToGet) {
int i = rand.Next(0, length);
while (!keysToReturn.Add(cachedKeys[i])) {
i++;
if (i == length)
i = 0;
}
}
Consider using Stopwatch instead of DateTime.Now. It may simply be down to the inaccuracy of DateTime.Now when you're talking about milliseconds.
The problem could quite possibly be here:
if (!keysToReturn.Contains(randomKey))
keysToReturn.Add(randomKey);
This will require iterating over the list to determine if the key is in the return list. However, to be sure, you should try profiling this using a tool. Also, 5ms is pretty fast at .005 seconds, you may want to increase that.

Categories

Resources