I've been working on a small piece of code that shuffles the provided array. The array should be shuffled as fast as possible; the quality of the randomization is not that important. After profiling the method I found out that the biggest hog is Random.Next, which takes up about 70% of the method execution time. After searching online for faster random generators I found no plug-and-play libraries that offer any improved performance.
So I was wondering whether there are any ways to improve the performance of this code any more.
[MethodImpl(MethodImplOptions.NoInlining)]
private static void Shuffle(byte[] chars)
{
for (var i = 0; i < chars.Length; i++)
{
var index = rnd.Next(chars.Length);
byte tmpStore = chars[index];
chars[index] = chars[i];
chars[i] = tmpStore;
}
}
Alright, this is getting into micro-optimization territory.
Random.Next(int) actually performs some ops internally that we can optimize out:
int index = (int)(rnd.Next() * (1.0 / int.MaxValue) * chars.Length);
Since you're using the same maxValue over and over in a loop, a trivial optimization would be to precalculate your denominator outside of the loop. This way we get rid of an int->double conversion and a multiply:
double d = chars.Length / (double)int.MaxValue;
And then:
int index = (int)(rnd.Next() * d);
On a separate note: your shuffle isn't going to have a uniform distribution. See Jeff Atwood's post The Danger of Naïveté which deals specifically with this subject and shows how to perform a uniform Fisher-Yates shuffle.
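For reference, a minimal sketch of a Fisher-Yates shuffle, assuming a caller-supplied Random instance (the class name ShuffleSketch is only illustrative and not from the original post). Each position i is swapped with a random position in [0, i] rather than with any position in the array:

```csharp
using System;

public static class ShuffleSketch
{
    // Fisher-Yates: walk from the end and swap position i with a random
    // position j in [0, i]. Every permutation is then equally likely,
    // provided rnd itself is unbiased.
    public static void Shuffle(byte[] items, Random rnd)
    {
        for (int i = items.Length - 1; i > 0; i--)
        {
            int j = rnd.Next(i + 1); // 0 <= j <= i
            byte tmp = items[j];
            items[j] = items[i];
            items[i] = tmp;
        }
    }
}
```

Note that the upper bound changes on every iteration here, so the precomputed denominator from above would have to be recomputed per step or dropped.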
If n^n isn't too big for the double range, you could create one random double number, multiply it by n^n, then use modulo(n) each iteration as the current random number prior to dividing the random result by n as preparation for the next iteration.
I apologize if this is in the incorrect forum. Despite finding a lot of Array manipulation on this site, most of these are averaging/summing... the array of numerics as a set using LINQ, which processes well for all values in an array. But I need to process each index over multiple arrays (of the same size).
My routine receives array data from devices, typically double[512] or ushort[512]. A single device will always send arrays of the same size, but the array size can range from 256 to 2048 depending on the device. I need to hold CountToAverage arrays to average. Each time an array is received, it must be pushed onto the queue and the oldest one popped, so that the number of arrays in the averaging process stays constant (this part of the process is fixed in Setup() for this benchmark). For comparison purposes, the benchmark results are shown after the code.
What I am looking for is the fastest, most efficient way to average the values at each index across all the arrays, and return a new array (of the same size) where each index is the average over the set of arrays. The count of arrays to be averaged can range from 3 to 25 (the code below sets the benchmark param to 10). I have 2 different averaging methods in the test; the 2nd is significantly faster, 6-7 times faster than the first. My first question is: is there any way to achieve this faster, ideally at O(1) or O(log n) time complexity?
Secondarily, I am using a Queue (which may be changed to ConcurrentQueue for implementation) as a holder for the arrays to be processed. My primary reasoning for using a queue is because I can guarantee FIFO processing of the feed of arrays which is critical. Also, I can process against the values in the Queue through a foreach loop (just like a List) without having to dequeue until I am ready. I would be interested if anyone knows whether this is performance hindering as I haven't benchmarked it. Keep in mind it must be thread-safe. If you have an alternative way to process multiple sets of array data in a thread-safe manner I am "all ears".
The reason for the performance requirement is this is not the only process that is happening, I have multiple devices that are sending array results "streamed" at an approximate rate of 1 every 1-5 milliseconds, for each device coming from different threads/processes/connections, that still has several other much more intensive algorithms to process through, so this cannot be a bottleneck.
Any insights on optimizations and performance are appreciated.
using System;
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;
using Microsoft.Diagnostics.Tracing.Parsers.MicrosoftAntimalwareEngine;
namespace ArrayAverage
{
public class ArrayAverage
{
[Params(10)]
public int CountToAverage;
[Params(512, 2048)]
public int PixelSize;
static Queue<double[]> calcRepo = new Queue<double[]>();
static List<double[]> spectra = new();
[Benchmark]
public double[] CalculateIndexAverages()
{
// This is too slow
var avg = new double[PixelSize];
for (int i = 0; i < PixelSize; i++)
{
foreach (var arrayData in calcRepo)
{
avg[i] += arrayData[i];
}
avg[i] /= calcRepo.Count;
}
return avg;
}
[Benchmark]
public double[] CalculateIndexAverages2()
{
// this is faster, but is it the fastest?
var sum = new double[PixelSize];
int cnt = calcRepo.Count;
foreach (var arrayData in calcRepo)
{
for (int i = 0; i < PixelSize; i++)
{
sum[i] += arrayData[i];
}
}
var avg = new double[PixelSize];
for (int i = 0; i < PixelSize; i++)
{
avg[i] = sum[i] / cnt;
}
return avg;
}
[GlobalSetup]
public void Setup()
{
// Just generating some data as simple Triangular curve simulating a range of spectra
for (double offset = 0; offset < CountToAverage; offset++)
{
var values = new double[PixelSize];
var decrement = 0;
for (int i = 0; i < PixelSize; i++)
{
if (i > (PixelSize / 2))
decrement--;
values[i] = (offset / 7) + i + (decrement * 2);
}
calcRepo.Enqueue(values);
}
}
}
public class App
{
public static void Main()
{
BenchmarkRunner.Run<ArrayAverage>();
}
}
}
Benchmark results:
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1348 (21H1/May2021Update)
Intel Core i7-6700HQ CPU 2.60GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-preview.7.21379.14
[Host] : .NET 5.0.12 (5.0.1221.52207), X64 RyuJIT [AttachedDebugger]
DefaultJob : .NET 5.0.12 (5.0.1221.52207), X64 RyuJIT
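| Method                  | Arrays To Average | Array Size | Mean       | Error     | StdDev    |
|-------------------------|-------------------|------------|------------|-----------|-----------|
| CalculateIndexAverages  | 10                | 512        | 32.164 μs  | 0.5485 μs | 0.5130 μs |
| CalculateIndexAverages2 | 10                | 512        | 5.792 μs   | 0.1135 μs | 0.2241 μs |
| CalculateIndexAverages  | 10                | 2048       | 123.628 μs | 2.3394 μs | 1.9535 μs |
| CalculateIndexAverages2 | 10                | 2048       | 22.311 μs  | 0.4366 μs | 0.8093 μs |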
When dealing with simple operations on a large amount of data, you'd be very interested in SIMD:
SIMD stands for "single instruction, multiple data". It’s a set of processor instructions that ... allows mathematical operations to execute over a set of values in parallel.
In your particular case, using the Vector<T> example would give you a quick win. Naively converting your fastest method to use Vectors already gives a ~2x speed up on my PC.
public double[] CalculateIndexAverages4() {
// Assumption: PixelSize is a round multiple of Vector<>.Count
// If not, you'll have to add in the 'remainder' from the example.
var batch = Vector<double>.Count;
var sum = new double[PixelSize];
foreach (var arrayData in calcRepo) {
// Vectorised summing:
for (int i = 0; i <= PixelSize - batch; i += batch) {
var vSum = new Vector<double>(sum, i);
var vData = new Vector<double>(arrayData, i);
(vSum + vData).CopyTo(sum, i);
}
}
var vCnt = Vector<double>.One * calcRepo.Count;
// Reuse sum[] for averaging, so we don't incur memory allocation cost
for (int i = 0; i <= PixelSize - batch; i += batch) {
var vSum = new Vector<double>(sum, i);
(vSum / vCnt).CopyTo(sum, i);
}
return sum;
}
The Vector<T>.Count gives you how many items are being parallelised into one instruction. In the case of double, it's likely to be 4 on most modern CPUs supporting AVX2.
If you're okay with losing precision and can go to float, you'll get a much bigger win by again doubling the amount of data processed in a single CPU op. All of this without even changing your algorithm.
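If the array length is not a multiple of Vector<double>.Count, the leftover elements need a scalar pass. The following is a sketch of that variant under a few assumptions: the names VectorAverageSketch and Average are made up for illustration, and the arrays are passed in as a parameter instead of being read from the static calcRepo.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

public static class VectorAverageSketch
{
    // Averages the arrays index by index. Queue<double[]> can be passed
    // directly, since it implements IReadOnlyCollection<double[]>.
    public static double[] Average(IReadOnlyCollection<double[]> arrays, int length)
    {
        int batch = Vector<double>.Count;
        var sum = new double[length];

        foreach (double[] data in arrays)
        {
            int i = 0;
            // Vectorised part: 'batch' elements per iteration.
            for (; i <= length - batch; i += batch)
            {
                var vSum = new Vector<double>(sum, i);
                var vData = new Vector<double>(data, i);
                (vSum + vData).CopyTo(sum, i);
            }
            // Scalar remainder: whatever did not fill a whole vector.
            for (; i < length; i++)
            {
                sum[i] += data[i];
            }
        }

        // Reuse sum[] for the result to avoid a second allocation.
        double count = arrays.Count;
        for (int i = 0; i < length; i++)
        {
            sum[i] /= count;
        }
        return sum;
    }
}
```

The final division is left scalar here for brevity; it could be vectorised the same way as the summing loop.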
You can further optimize the code by reducing memory allocations. If the method is called frequently, time spent on GC will dominate completely.
// Assuming the data fits on the stack. A few thousand pixels (2048 doubles = 16 KB) is safe; 100k doubles (800 KB) would nearly fill a typical 1 MB thread stack.
Span<double> sum = stackalloc double[PixelSize];
// ...
Span<double> avg = stackalloc double[PixelSize];
And possibly also remove the extra stack-allocation of avg and simply reuse the sum:
for (int i = 0; i < sum.Length; i++)
{
sum[i] /= cnt;
}
// TODO: Avoid array allocation! Maybe use a pre-allocated array and fill it here.
return sum.ToArray();
In my opinion this would be fairly well optimized code. A major reason the second option is faster is that it accesses memory linearly, instead of jumping between multiple different arrays. Another factor is that foreach loops have some overhead, so placing the foreach in the outer loop also helps a bit.
You might gain a little performance by switching from the queue and foreach loop to a list/array and for loop, but since PixelSize is much larger than CountToAverage I would expect the benefit to be fairly small.
Unrolling the loop to process, say, 4 values at a time might help a bit. It is possible for the C# compiler to apply such optimizations automatically, but it is often difficult to tell which optimizations are applied, so it might be easier just to test.
The next step would be to look at parallelization. Simple summing code like this might benefit from SIMD to process multiple values at a time. The linked article shows that processor-specific intrinsics give a much larger benefit than the more general Vector<T>, but may require separate code paths for each platform you are targeting. It also has performance examples of summing values at various levels of optimization, with example code, so it is well worth a read.
Another option would be to use multiple threads with Parallel.For/Foreach, but at 6μs it is likely that the overhead will be larger than any gains unless the size of the data is significantly larger.
So I am working on a program to factor a given number. You enter a number on the command line and it gives you all the factors. Doing this sequentially is VERY slow, because it uses only one thread, if I'm not mistaken. I thought about doing it with Parallel.For and it worked, but only with integers, so I wanted to try it with BigIntegers.
Here's the normal code:
public static void Factor(BigInteger f)
{
for(BigInteger i = 1; i <= f; i++)
{
if (f % i == 0)
{
Console.WriteLine("{0} is a factor of {1}", i, f);
}
}
}
The above code should be pretty easy to understand. But as I said, this is very slow for big numbers (above one billion it starts to get impractical). Here's my parallel code:
public static void ParallelFacotr(BigInteger d)
{
Parallel.For<BigInteger>(0, d, new ParallelOptions() { MaxDegreeOfParallelism = Environment.ProcessorCount }, i =>
{
try
{
if (d % i == 0)
{
Console.WriteLine("{0} is a factor of {1}", i, d);
}
}
catch (DivideByZeroException)
{
Console.WriteLine("Done!");
Console.ReadKey();
Launcher.Main();
}
});
}
Now, the above (parallel) code works just fine with integers (int), and it's VERY fast: it factored 1,000,000,000 in just 2 seconds. So I thought, why not try it with bigger integers? I thought that putting <BigInteger> after Parallel.For would do it, but it doesn't. So, how do you work with BigIntegers in a parallel for loop? I already tried a regular parallel loop with a BigInteger as the argument, but then it gives an error saying that it cannot convert from BigInteger to int. so how do you
Improve your algorithm efficiency first.
While it is possible to use BigInteger, the CPU has no hardware arithmetic for arbitrarily large numbers, so the operations are done in software and will be noticeably slower. So unless you need numbers bigger than about 9 quintillion (exactly 9,223,372,036,854,775,807), you can use the type long.
A second thing to note is that you do not need to loop over all candidates: factors come in pairs, and the smaller of each pair is at most the square root of f, so you can reduce
for(BigInteger i = 1; i <= f; i++)
to
for(long i = 1; i <= Math.Sqrt(f); i++)
For f = 1,000,000,000 that means roughly 31,623 iterations instead of 1,000,000,000. Each divisor i found this way also gives you the paired factor f / i.
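As a rough sketch of the square-root idea with long (the names FactorSketch and Factors are hypothetical), the loop below only visits candidates up to the square root and records the paired factor for each hit. The condition i <= n / i is used in place of Math.Sqrt so everything stays in integer arithmetic:

```csharp
using System;
using System.Collections.Generic;

public static class FactorSketch
{
    // Returns all positive factors of n in ascending order.
    public static List<long> Factors(long n)
    {
        var result = new List<long>();
        // i <= n / i is equivalent to i * i <= n but cannot overflow.
        for (long i = 1; i <= n / i; i++)
        {
            if (n % i != 0) continue;

            result.Add(i);
            long cofactor = n / i;
            if (cofactor != i) // avoid listing a square root twice
            {
                result.Add(cofactor);
            }
        }
        result.Sort();
        return result;
    }
}
```

For example, FactorSketch.Factors(36) yields 1, 2, 3, 4, 6, 9, 12, 18, 36 after only six loop iterations.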
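Additionally, if you still plan on using BigInteger, check the parameters of Parallel.For: the loop bounds must be int or long, and the generic type argument is the type of the thread-local state, not of the loop counter.
It should be something along the lines of
Parallel.For(
0,
(int)d,
() => BigInteger.Zero,
(x, state, subTotal) => subTotal + BigInteger.One,
subTotal => { /* combine each thread's subTotal here */ });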
Just for trivia: some programming languages are better suited to this kind of problem than others. In the Wolfram Language (previously Mathematica) it is much simpler, provided you know what you are doing.
They also offer Wolfram|Alpha, a Google-like alternative that processes your natural-language query and gives you a direct answer as best it can.
So finding the factors of a number is as easy as:
FactorInteger[9223372036854775809]
or use web api https://www.wolframalpha.com/input/?i=factor+9223372036854775809
You can also call Wolfram kernel from C#, but terms and conditions apply.
I have a piece of code in my C# Windows Form Application that looks like below:
List<string> RESULT_LIST = new List<string>();
int[] arr = My_LIST.ToArray();
string s = "";
Stopwatch sw = new Stopwatch();
sw.Start();
for (int i = 0; i < arr.Length; i++)
{
int counter = i;
for (int j = 1; j <= arr.Length; j++)
{
counter++;
if (counter == arr.Length)
{
counter = 0;
}
s += arr[counter].ToString();
RESULT_LIST.Add(s);
}
s = "";
}
sw.Stop();
TimeSpan ts = sw.Elapsed;
string elapsedTime = String.Format("{0:00}", ts.TotalMilliseconds * 1000);
MessageBox.Show(elapsedTime);
I use this code to get every combination of the numbers in My_LIST. I treat My_LIST as a circular list. The image below demonstrates my purpose very clearly:
All I need to do is:
Make a formula to calculate the approximate run time of these two nested for loops, so I can estimate the run time for any length and let the user know approximately how long he/she must wait.
I have used a C# Stopwatch like this: Stopwatch sw = new Stopwatch(); to show the run time, and below are the results. (Note that in order to reduce the chance of error I've repeated the calculation three times for each length. The numbers show the time in microseconds, since the code prints TotalMilliseconds * 1000, for the first, second and third attempt respectively.)
arr.Length = 400; 127838 - 107251 - 100898
arr.Length = 800; 751282 - 750574 - 739869
arr.Length = 1200; 2320517 - 2136107 - 2146099
arr.Length = 2000; 8502631 - 7554743 - 7635173
Note that there are only one-digit numbers in My_LIST to make the time of adding numbers to the list approximately equal.
How can I find out the relation between arr.Length and run time?
First, let's suppose you have examined the algorithm and noticed that it appears to be quadratic in the array length. This suggests to us that the time taken to run should be a function of the form
t = A + B n + C n^2
You've gathered some observations by running the code multiple times with different values for n and measuring t. That's a good approach.
The question now is: what are the best values for A, B and C such that they match your observations closely?
This problem can be solved in a variety of ways; I would suggest to you that the least-squares method of regression would be the place to start, and see if you get good results. There's a page on it here:
www.efunda.com/math/leastsquares/lstsqr2dcurve.cfm
UPDATE: I just looked at your algorithm again and realized it is cubic because you have a quadratic string concat in the inner loop. So this technique might not work so well. I suggest you use StringBuilder to make your algorithm quadratic.
Now, suppose you did not know ahead of time that the problem was quadratic. How would you determine the formula then? A good start would be to graph your points on log scale paper; if they roughly form a straight line then the slope of the line gives you a clue as to the power of the polynomial. If they don't form a straight line -- well, cross that bridge when you come to it.
Well, you're going to need some math here.
The inner loop body runs exactly n^2 times; not O(n^2), but exactly n^2 times.
So what you could do is keep a counter of the number of items processed and use it to estimate the remaining time:
int numItemsProcessed; // incremented once per inner-loop iteration
double timeElapsed; // read from the stopwatch
long totalItems = (long)n * n;
double remainingEstimate = (double)(totalItems - numItemsProcessed) / numItemsProcessed * timeElapsed;
Don't assume the algorithm is necessarily N^2 in time complexity.
Take the averages of your numbers, and plot the best fit on a log-log plot, then measure the gradient. This will give you an idea as to the largest term in the polynomial. (see wikipedia log-log plot)
Once you have that, you can do a least-squares regression to work out the coefficients of the polynomial of the correct order. This will allow an estimate from the data, of the time taken for an unseen problem.
Note: As Eric Lippert said, it depends on what you want to measure - averaging may not be appropriate depending on your use case - the first run time might be more correct.
This method will work for any polynomial algorithm. It will also tell you if the algorithm is polynomial (non-polynomial running times will not give straight lines on the log-log plot).
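To make the log-log step concrete, here is a small sketch that fits a straight line to (log n, log t) by least squares and returns its gradient. The class and method names are invented for illustration:

```csharp
using System;

public static class LogLogFitSketch
{
    // Least-squares slope of log(t) against log(n).
    // For t proportional to n^k the slope is k.
    public static double Slope(double[] n, double[] t)
    {
        int m = n.Length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < m; i++)
        {
            double x = Math.Log(n[i]);
            double y = Math.Log(t[i]);
            sx += x;
            sy += y;
            sxx += x * x;
            sxy += x * y;
        }
        // Standard closed form for the slope of a simple linear regression.
        return (m * sxy - sx * sy) / (m * sxx - sx * sx);
    }
}
```

Fed with the lengths 400, 800, 1200, 2000 and the averages of the three timings from the question (about 111996, 747242, 2200908 and 7897516), the slope comes out between 2 and 3, which suggests the measured growth is steeper than quadratic.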
I have a loop of about one million values. I want to store just the exponent of each value to display them e.g. in a histogram.
At the moment I'm doing it this way:
int[] histogram = new int[51]; //init all with 0
for(int i = 0; i < 1000000; i++)
{
int exponent = getExponent(getValue(i));
//getExponent(double value) gives the exponent(base 10)
//getValue(int i) gives the value for loop i
if(exponent > 25)
exponent = 25;
if(exponent < -25)
exponent = -25;
histogram[exponent+25]++;
}
Is there a more efficient and elegant way of doing this?
Perhaps without using an array?
Is there a more efficient and elegant way of doing this?
The only more elegant way would be to use Math.Min and Math.Max, but it wouldn't be any more efficient. histogram[Math.Max(0,Math.Min(50,exponent+25))]++ is fewer characters but no more performant or elegant in my opinion.
Perhaps without using an array?
An array is a raw set of values so is the most straightforward way of storing them.
Assuming that getExponent and getValue are optimized, the only way to optimize this is to use Parallel.For. I don't think the difference will be significant, but I see it as the only way.
As for the array, it is the best indexed low-level data structure that you can use for storing data.
Use Parallel Library [ using System.Threading.Tasks; ]:
int[] histogram = new int[51]; //init all with 0
Action<int> work = (i) =>
{
int exponent = getExponent(getValue(i));
//getExponent(double value) gives the exponent(base 10)
//getValue(int i) gives the value for loop i
if (exponent > 25)
exponent = 25;
if (exponent < -25)
exponent = -25;
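System.Threading.Interlocked.Increment(ref histogram[exponent + 25]); // a plain ++ is not atomic and would lose counts when threads race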
};
Parallel.For(0, 1000000, work);
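Because all workers update the same array, every increment on the shared histogram has to be synchronized. One possible alternative, sketched here, gives each worker a private histogram through the thread-local overload of Parallel.For and merges the results once at the end. The exponentOf delegate is a hypothetical stand-in for getExponent(getValue(i)):

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelHistogramSketch
{
    public static int[] Build(int count, Func<int, int> exponentOf)
    {
        var histogram = new int[51];
        object gate = new object();

        Parallel.For(
            0, count,
            () => new int[51],               // private histogram per worker
            (i, state, local) =>
            {
                // Clamp to [-25, 25] exactly as in the sequential version.
                int exponent = Math.Max(-25, Math.Min(25, exponentOf(i)));
                local[exponent + 25]++;      // no sharing, so no lock needed
                return local;
            },
            local =>
            {
                // Runs once per worker: merge into the shared result.
                lock (gate)
                {
                    for (int b = 0; b < histogram.Length; b++)
                    {
                        histogram[b] += local[b];
                    }
                }
            });

        return histogram;
    }
}
```

Whether this beats the sequential loop still depends on how expensive getValue and getExponent are, so it is worth profiling both.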
There's not much to optimize with the conditional and array access. However, if the getExponent function is a time consuming operation, you can cache the results. A lot is really going to depend on what typical data looks like. Profile to see if it helps. Also, someone else mentioned using Parallel. That's worth profiling too as it could make a difference if getValue or getExponent are slow enough to overcome the Parallel and locking overhead.
Important note: Using double is probably a bad idea for a dictionary. But I'm confused by the math going on and conversions from double's to int's.
Either way, the idea here is to cache a calculation. So perhaps you can find a better way if this test proves useful.
int[] histogram = Enumerable.Repeat(0, 51).ToArray();
...
Dictionary<double, int> cache = new Dictionary<double, int>(histogram.Length);
...
for (int i = 0; i < 1000000; i++)
{
double value = getValue(i);
int exponent;
if(!cache.TryGetValue(value, out exponent))
{
exponent = getExponent(value);
cache[value] = exponent;
}
if (exponent > 25)
exponent = 25;
else if (exponent < -25)
exponent = -25;
histogram[exponent + 25]++;
}
Try using a public property to store the values for your exponent, so you can test and check your results.
As far as elegance goes, I would replace the if conditions with actual functions or subroutines, and create additional abstraction with clear and concise naming.
I'm learning different types of sorting now, and I found out that, starting from a certain point, my QuickSort algorithm doesn't work that quick at all.
Here is my code:
class QuickSort
{
// partitioning array on the key so that the left part is <=key, right part > key
private int Partition(int[] arr, int start, int end)
{
int key = arr[end];
int i = start - 1;
for (int j = start; j < end; j++)
{
if (arr[j] <= key) Swap(ref arr[++i], ref arr[j]);
}
Swap(ref arr[++i], ref arr[end]);
return i;
}
// sorting
public void QuickSorting(int[] arr, int start, int end)
{
if (start < end)
{
int key = Partition(arr, start, end);
QuickSorting(arr, start, key - 1);
QuickSorting(arr, key + 1, end);
}
}
}
class Test
{
static void Main(string[] args)
{
QuickSort quick = new QuickSort();
Random rnd = new Random(DateTime.Now.Millisecond);
int[] array = new int[1000000];
for (int i = 0; i < 1000000; i++)
{
int i_rnd = rnd.Next(1, 1000);
array[i] = i_rnd;
}
quick.QuickSorting(array, 0, array.Length - 1);
}
}
It takes about 15 seconds to run this code on an array of a million elements. While, for example, MergeSort or HeapSort do the same in less than a second.
Could you please explain to me why this can happen?
How quick your sort is and which algorithm you should use depends a lot on your input data: is it random, nearly sorted, reversed, etc.?
There's a very nice page that illustrates how the different sorting algorithms work:
Sorting Algorithm Animations
Have you considered inlining the Swap method? It shouldn't be hard to do so, but it may be that the JIT is finding it hard to inline.
When I implemented quicksort for Edulinq I didn't see this problem at all - you may want to try my code (the simplest, recursive form probably) to see how that performs for you. If it does well, try to work out where the differences are.
While different algorithms will behave differently with the same data, I wouldn't expect to see this much difference on randomly-generated data.
You have 1,000,000 random elements with 1,000 distinct values. So, we can expect most values to appear about 1,000 times in your array. This gives you some quadratic O(n^2) running time.
Partitioning the array into 1,000 pieces, each containing a single repeated value, takes a recursion depth of about log2(1000), roughly 10. (That assumes each call to Partition neatly splits its range in two.) At 1,000,000 elements per level, that's about 10,000,000 operations.
Then the last 1,000 partitions, each containing about 1,000 identical values, have to be quicksorted. That takes 1,000 times (1,000 + 999 + 998 + ... + 1) comparisons, because with identical values each round only removes the pivot and reduces the problem by one. That gives about 500,000,000 operations. Ideally, quicksorting those 1,000 partitions would take 1,000 times (1,000 * 10) = 10,000,000 operations. Because of the identical values you hit the quadratic case, quicksort's worst-case performance. So about halfway down, the quicksort degrades to worst-case behavior.
If every value occurs only a few times, it doesn't matter if you sort those few tiny partitions in O(N^2) or O(N logN). But here we had a lot and huge partitions to be sorted in O(N^2).
To improve your code: partition in 3 pieces. Smaller than the pivot, equal to the pivot and bigger than the pivot. Then, only quicksort the first and last partitions. You will need to do an extra compare; test for equality first. But I think, for this input, it would be a lot faster.
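A minimal sketch of such a three-way partition, written from scratch rather than adapted from the code in the question (ThreeWayQuickSortSketch is an illustrative name). It compares for "smaller" first instead of equality, which is just one possible ordering of the tests:

```csharp
using System;

public static class ThreeWayQuickSortSketch
{
    public static void Sort(int[] arr) => Sort(arr, 0, arr.Length - 1);

    private static void Sort(int[] arr, int start, int end)
    {
        if (start >= end) return;

        int pivot = arr[end];
        int lt = start; // arr[start..lt-1] <  pivot
        int gt = end;   // arr[gt+1..end]   >  pivot
        int i = start;  // arr[lt..i-1]     == pivot

        while (i <= gt)
        {
            if (arr[i] < pivot) Swap(arr, lt++, i++);
            else if (arr[i] > pivot) Swap(arr, i, gt--);
            else i++;
        }

        // Everything in arr[lt..gt] equals the pivot and is already in place,
        // so only the strictly smaller and strictly larger parts recurse.
        Sort(arr, start, lt - 1);
        Sort(arr, gt + 1, end);
    }

    private static void Swap(int[] arr, int a, int b)
    {
        int tmp = arr[a];
        arr[a] = arr[b];
        arr[b] = tmp;
    }
}
```

With only 1,000 distinct values, each call removes every copy of its pivot at once, so the long runs of identical elements no longer trigger the quadratic behaviour described above.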