I've a line of code that is called millions of times inside a for loop, checking whether a passed argument is double.NaN. I've profiled my application, and one of the bottlenecks is this simple function:
public void DoSomething(double[] args)
{
    for (int i = 0; i < args.Length; i++)
    {
        if (double.IsNaN(args[i]))
        {
            //Do something
        }
    }
}
Can I optimize it even if I can't change the code inside the if?
If you have really optimized other parts of your code, you can let this function become a little bit cryptic and utilize the definition of Not a Number (NaN):
"The predicate x != y is True but all
others, x < y , x <= y , x == y , x >=
y and x > y , are False whenever x or
y or both are NaN.” (IEEE Standard 754
for Binary Floating-Point Arithmetic)
Translating that to your code you would get:
public void DoSomething(double[] args)
{
    for (int i = 0; i < args.Length; i++)
    {
        double value = args[i];
        if (value != value)
        {
            //Do something
        }
    }
}
On an ARM device using Windows CE + .NET Compact Framework 3.5, with around a 50% probability of getting a NaN, value != value is twice as fast as double.IsNaN(value).
Just be sure to measure your application execution after!
I find it hard (but not impossible) to believe that any other check on args[i] would be faster than double.IsNaN().
One possibility is if IsNaN ends up as an actual function call. There is overhead in calling functions, sometimes substantial, especially if the function itself is relatively small.
You could take advantage of the fact that the bit patterns for IEEE 754 NaNs are well known and just do some bit checks (without calling a function to do it) - this would remove that overhead. In C, I'd try that with a macro. Where the exponent bits are all 1 and the mantissa bits are not all 0, that's a NaN (signalling vs. quiet is decided by the most significant mantissa bit, but you're probably not concerned with that). In addition, NaNs are never equal to one another, so you could test args[i] for equality with itself - false means it's a NaN.
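A sketch of that bit check in C# (my own illustration; it relies on the IEEE 754 layout that .NET doubles use, and the helper name is made up):
// requires: using System;
static bool IsNaNBits(double value)
{
    long bits = BitConverter.DoubleToInt64Bits(value);
    // Exponent bits (62..52) all ones and mantissa bits (51..0) not all zero => NaN.
    return (bits & 0x7FF0000000000000L) == 0x7FF0000000000000L
        && (bits & 0x000FFFFFFFFFFFFFL) != 0L;
}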
Another possibility may be workable if the array is used more often than it's changed. Maintain another array of booleans which indicate whether or not the associated double is a NaN. Then, whenever one of the doubles changes, compute the associated boolean.
Then your function becomes:
public void DoSomething(double[] args, bool[] nan) {
    for (int i = 0; i < args.Length; i++) {
        if (nan[i]) {
            //Do something
        }
    }
}
This is the same sort of "trick" used in databases where you pre-compute values only when the data changes rather than every time you read it out. If you're in a situation where the data is being used a lot more than being changed, it's a good optimisation to look into (most algorithms can trade off space for time).
But remember the optimisation mantra: Measure, don't guess!
Just to further reiterate how important performance testing is, I ran the following test on my Core i5-750 in 64-bit native and 32-bit mode on Windows 7, compiled with VS 2010 targeting .NET 4.0, and got the following results:
public static bool DoSomething(double[] args) {
bool ret = false;
for (int i = 0; i < args.Length; i++) {
if (double.IsNaN(args[i])) {
ret = !ret;
}
}
return ret;
}
public static bool DoSomething2(double[] args) {
bool ret = false;
for (int i = 0; i < args.Length; i++) {
if (args[i] != args[i]) {
ret = !ret;
}
}
return ret;
}
public static IEnumerable<R> Generate<R>(Func<R> func, int num) {
for (int i = 0; i < num; i++) {
yield return func();
}
}
static void Main(string[] args) {
Random r = new Random();
double[] data = Generate(() => {
var res = r.NextDouble();
return res < 0.5 ? res : Double.NaN;
}, 1000000).ToArray();
Stopwatch sw = new Stopwatch();
sw.Start();
DoSomething(data);
Console.WriteLine(sw.ElapsedTicks);
sw.Reset();
sw.Start();
DoSomething2(data);
Console.WriteLine(sw.ElapsedTicks);
Console.ReadKey();
}
In x86 mode (Release, naturally):
DoSomething() = 139544
DoSomething2() = 137924
In x64 mode:
DoSomething() = 19417
DoSomething2() = 17448
However, something interesting happens if our distribution of NaNs is sparser. If we change our 0.5 constant to 0.9 (only 10% NaNs) we get:
x86:
DoSomething() = 31483
DoSomething2() = 31731
x64:
DoSomething() = 31432
DoSomething2() = 31513
Reordering the calls shows the same trend as well. Food for thought.
Related
I am trying to figure out what the difference between the following for loops is.
The first is code that I wrote while practicing algorithms on codewars.com. It times out when attempting the larger test cases.
The second is one of the top solutions. It seems functionally similar (obviously it's more concise) but runs much faster and does not time out. Can anyone explain to me what the difference is? Also, the return statement in the second snippet is confusing to me. What exactly does this syntax mean? Maybe this is where it is more efficient.
public static long findNb(long m)
{
int sum = 0;
int x = new int();
for (int n = 0; sum < m; n++)
{
sum += n*n*n;
x = n;
System.Console.WriteLine(x);
}
if (sum == m)
{
return x;
}
return -1;
}
vs
public static long findNb(long m) //seems similar but doesnt time out
{
long total = 1, i = 2;
for(; total < m; i++) total += i * i * i;
return total == m ? i - 1 : -1;
}
The second approach uses long for the total value. Chances are that you're using an m value that's high enough to exceed the number of values representable by int. So your sum overflows and becomes a negative number, and you get caught in an infinite loop, because the sum can never get as big as m.
And, like everyone else says, get rid of the WriteLine.
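A minimal fix of the first version along those lines - a long running sum and no WriteLine (a sketch, not the canonical solution):
public static long findNb(long m)
{
    long sum = 0;
    long n = 0;
    for (; sum < m; n++)
    {
        sum += n * n * n;
    }
    return sum == m ? n - 1 : -1;
}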
Also, the return statement in the second snippet is confusing to me. What exactly does this syntax mean?
It's a ternary conditional operator.
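Written out, the return line from the second snippet is equivalent to this:
// total == m ? i - 1 : -1
if (total == m)
    return i - 1;
else
    return -1;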
Both approaches are roughly the same, except for the unwanted System.Console.WriteLine(x); which spoils the fun: printing to the console (UI!) is a slow operation.
If you are looking for a fast solution (esp. for the large m and long loop) you can just precompute all (77936) values:
public class Solver {
    static Dictionary<long, long> s_Sums = new Dictionary<long, long>();
    private static void Build() {
        long total = 0;
        for (long i = 0; i <= 77936; ++i) {
            total += i * i * i;
            s_Sums.Add(total, i);
        }
    }
    static Solver() {
        Build();
    }
    public static long findNb(long m) {
        return s_Sums.TryGetValue(m, out long result)
            ? result
            : -1;
    }
}
When I run into micro-optimisation challenges like this, I always use BenchmarkDotNet. It's the tool to use to get all the insights into performance, memory allocations, deviations between .NET Framework versions, 64-bit vs 32-bit, etc.
But as others write - remember to remove the WriteLine() statement :)
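For example, a minimal BenchmarkDotNet harness for the loop-based solution could look like this (class and method names are illustrative, and the BenchmarkDotNet package has to be referenced via NuGet):
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
public class FindNbBenchmarks
{
    // 4004001000000 = (2000 * 2001 / 2)^2, i.e. the sum of the first 2000 cubes.
    private const long M = 4004001000000L;
    [Benchmark]
    public long LoopVersion()
    {
        long total = 1, i = 2;
        for (; total < M; i++) total += i * i * i;
        return total == M ? i - 1 : -1;
    }
}
public class Program
{
    public static void Main(string[] args)
    {
        BenchmarkRunner.Run<FindNbBenchmarks>();
    }
}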
I've stumbled upon this effect when debugging an application - see the repro code below.
It gives me the following results:
Data init, count: 100,000 x 10,000, 4.6133365 secs
Perf test 0 (False): 5.8289565 secs
Perf test 0 (True): 5.8485172 secs
Perf test 1 (False): 32.3222312 secs
Perf test 1 (True): 217.0089923 secs
As far as I understand, the array store operations shouldn't normally have such a drastic performance effect (32 vs 217 seconds). I wonder if anyone understands what effects are at play here?
UPD extra test added; Perf 0 shows the results as expected, Perf 1 - shows the performance anomaly.
class Program
{
static void Main(string[] args)
{
var data = InitData();
TestPerf0(data, false);
TestPerf0(data, true);
TestPerf1(data, false);
TestPerf1(data, true);
if (Debugger.IsAttached)
Console.ReadKey();
}
private static string[] InitData()
{
var watch = Stopwatch.StartNew();
var data = new string[100_000];
var maxString = 10_000;
for (int i = 0; i < data.Length; i++)
{
data[i] = new string('-', maxString);
}
watch.Stop();
Console.WriteLine($"Data init, count: {data.Length:n0} x {maxString:n0}, {watch.Elapsed.TotalSeconds} secs");
return data;
}
private static void TestPerf1(string[] vals, bool testStore)
{
var watch = Stopwatch.StartNew();
var counters = new int[char.MaxValue];
int tmp = 0;
for (var j = 0; ; j++)
{
var allEmpty = true;
for (var i = 0; i < vals.Length; i++)
{
var val = vals[i];
if (j < val.Length)
{
allEmpty = false;
var ch = val[j];
var count = counters[ch];
tmp ^= count;
if (testStore)
counters[ch] = count + 1;
}
}
if (allEmpty)
break;
}
// prevent the compiler from optimizing away our computations
tmp.GetHashCode();
watch.Stop();
Console.WriteLine($"Perf test 1 ({testStore}): {watch.Elapsed.TotalSeconds} secs");
}
private static void TestPerf0(string[] vals, bool testStore)
{
var watch = Stopwatch.StartNew();
var counters = new int[65536];
int tmp = 0;
for (var i = 0; i < 1_000_000_000; i++)
{
var j = i % counters.Length;
var count = counters[j];
tmp ^= count;
if (testStore)
counters[j] = count + 1;
}
// prevent the compiler from optimizing away our computations
tmp.GetHashCode();
watch.Stop();
Console.WriteLine($"Perf test 0 ({testStore}): {watch.Elapsed.TotalSeconds} secs");
}
}
After testing your code for quite some time my best guess is, as already said in the comments, that you experience a lot of cache-misses with your current solution. The line:
if (testStore)
counters[ch] = count + 1;
may force a whole new cache line to be loaded, displacing the current content. There might also be some problems with branch prediction in this scenario. This is highly hardware dependent, and I'm not aware of a really good way to test it from a JIT-compiled language like C# (it's also quite hard in natively compiled languages, where the target hardware is fixed and well known).
Going through the disassembly, you can clearly see that you also introduce a whole bunch of new instructions, which might aggravate the problems mentioned above.
Overall I'd advise you to rewrite the complete algorithm, as there are better places to improve performance than this one little assignment. These are the optimizations I'd suggest (they also improve readability):
Invert your i and j loop. This will remove the allEmpty variable completely.
Cast ch to int with var ch = (int) val[j]; - because you ALWAYS use it as an index.
Think about why this might be a problem at all. You introduce a new instruction and any instruction comes at a cost. If this is really the primary "hot-spot" of your code you can start to think about better solutions (Remember: "premature optimization is the root of all evil").
As this is a "test setting" which the name suggests, is this important at all? Just remove it.
EDIT: Why did I suggest inverting the loops? With this little rearrangement of code:
foreach (var val in vals)
{
foreach (int ch in val)
{
var count = counters[ch];
tmp ^= count;
if (testStore)
{
counters[ch] = count + 1;
}
}
}
I went from runtimes like the first benchmark screenshot down to runtimes like the second (screenshots not reproduced here).
Do you still think it's not worth a try? I saved some orders of magnitude here and nearly eliminated the effect of the if (to be clear - all optimizations are disabled in the settings). If there are special reasons not to do this you should tell us more about the context in which this code will be used.
EDIT2: For the in-depth answer. My best explanation for why this problem occurs is because you cross-reference your cache-lines. In the lines:
for (var i = 0; i < vals.Length; i++)
{
var val = vals[i];
you load a really massive dataset. This is by far bigger than a cache-line itself. So it will most likely need to be loaded every iteration fresh from the memory into a new cache-line (displacing the old content). This is also known as "cache-thrashing" if I remember correctly. Thanks to #mjwills for pointing this out in his comment.
In my suggested solution, on the other hand, the content of a cache-line can stay alive as long as the inner loop did not exceed its boundaries (which happens a lot less if you use this direction of memory access).
This is the closest explanation for why my code runs that much faster, and it also supports the assumption that you have serious caching problems with your code.
I've found two different methods to get the max value from an array, but I'm not really fond of parallel programming, so I don't really understand it.
I was wondering: do these methods do the same thing, or am I missing something?
I really don't have much information about them. Not even comments...
The first method:
int[] vec = ... (I guess the content doesn't matter)
static int naiveMax()
{
int max = vec[0];
object obj = new object();
Parallel.For(0, vec.Length, i =>
{
lock (obj) {
if (vec[i] > max) max = vec[i];
}
});
return max;
}
And the second one:
static int Max()
{
int max = vec[0];
object obj = new object();
Parallel.For(0, vec.Length, //could be Parallel.For<int>
() => vec[0],
(i, loopState, partial) =>
{
if(vec[i]>partial) partial = vec[i];
return partial;
},
partial => {
lock (obj) {
if( partial > max) max = partial;
}
});
return max;
}
Do these do the same thing or something different, and what? Thanks ;)
Both find the maximum value in an array of integers. In an attempt to find the maximum value faster, they do it "in parallel" using the Parallel.For Method. Both methods fail at this, though.
To see this, we first need a sufficiently large array of integers. For small arrays, parallel processing doesn't give us a speed-up anyway.
int[] values = new int[100000000];
Random random = new Random();
for (int i = 0; i < values.Length; i++)
{
values[i] = random.Next();
}
Now we can run the two methods and see how long they take. Using an appropriate performance measurement setup (Stopwatch, array of 100,000,000 integers, 100 iterations, Release build, no debugger attached, JIT warm-up) I get the following results on my machine:
naiveMax 00:06:03.3737078
Max 00:00:15.2453303
So Max is much much better than naiveMax (6 minutes! cough).
But how does it compare to, say, PLINQ?
static int MaxPlinq(int[] values)
{
return values.AsParallel().Max();
}
MaxPlinq 00:00:11.2335842
Not bad, saved a few seconds. Now, what about a plain, old, sequential for loop for comparison?
static int Simple(int[] values)
{
int result = values[0];
for (int i = 0; i < values.Length; i++)
{
if (result < values[i]) result = values[i];
}
return result;
}
Simple 00:00:05.7837002
I think we have a winner.
Lesson learned: Parallel.For is not pixie dust that you can sprinkle over your code to
make it magically run faster. If performance matters, use the right tools and measure, measure, measure, ...
They appear to do the same thing; however, they are very inefficient. The point of parallelization is to improve the speed of code that can be executed independently. Due to race conditions, discovering the maximum (as implemented here) requires an atomic semaphore/lock on the actual logic... which means you're spinning up many threads and related resources simply to do the code sequentially anyway... defeating the purpose of parallelization entirely.
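If you do want a parallel version whose cost is not dominated by locking, here is a sketch using range partitioning (one lock per partition instead of per element; this is my own illustration, in the style of the second method):
// requires: using System.Collections.Concurrent; using System.Threading.Tasks;
static int PartitionedMax(int[] vec)
{
    int max = vec[0];
    object obj = new object();
    Parallel.ForEach(
        Partitioner.Create(0, vec.Length),   // split the index space into ranges
        () => vec[0],                        // per-task local maximum
        (range, loopState, partial) =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
                if (vec[i] > partial) partial = vec[i];
            return partial;
        },
        partial => { lock (obj) { if (partial > max) max = partial; } });
    return max;
}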
I'm looking for a library or existing code to simplify fractions.
Does anyone have anything at hand or any links?
P.S. I already understand the process but really don't want to rewrite the wheel
Update
Ok i've checked out the fraction library on the CodeProject
BUT the problem I have is a little bit trickier than simplifying a fraction.
I have to reduce a percentage split which could be 20% / 50% / 30% (always equal to 100%)
I think you just need to divide by the GCD of all the numbers.
void Simplify(int[] numbers)
{
int gcd = GCD(numbers);
for (int i = 0; i < numbers.Length; i++)
numbers[i] /= gcd;
}
int GCD(int a, int b)
{
while (b > 0)
{
int rem = a % b;
a = b;
b = rem;
}
return a;
}
int GCD(int[] args)
{
// using LINQ:
return args.Aggregate((gcd, arg) => GCD(gcd, arg));
}
I haven't tried the code, but it seems simple enough to be right (assuming your numbers are all positive integers and you don't pass an empty array).
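For the percentage split from the question, usage would look roughly like this:
int[] percentages = { 20, 50, 30 };
Simplify(percentages);              // divides every entry by GCD(20, 50, 30) = 10
// percentages is now { 2, 5, 3 }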
You can use Microsoft.FSharp.Math.BigRational, which is in the free F# Power Pack library. Although it depends on F# (which is gratis and included in VS2010), it can be used from C#.
BigRational reduced = BigRational.FromInt(4)/BigRational.FromInt(6);
Console.WriteLine(reduced);
2/3
Console.WriteLine(reduced.Numerator);
2
Console.WriteLine(reduced.Denominator);
3
This library looks like it might be what you need:
var f = new Fraction(numerator, denominator);
numerator = f.Numerator;
denominator = f.Denominator;
Although, I haven't tested it, so it looks like you may need to play around with it to get it to work.
The best example of Fraction (aka Rational) I've seen is in Timothy Budd's "Classic Data Structures in C++". His implementation is very good. It includes a simple implementation of GCD algorithm.
It shouldn't be hard to adapt to C#.
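For illustration, a minimal C# sketch of such a Fraction type (my own sketch, not Budd's implementation):
// requires: using System;
public struct Fraction
{
    public readonly int Numerator;
    public readonly int Denominator;
    public Fraction(int numerator, int denominator)
    {
        if (denominator == 0)
            throw new ArgumentException("Denominator cannot be zero.");
        int gcd = Gcd(Math.Abs(numerator), Math.Abs(denominator));
        int sign = denominator < 0 ? -1 : 1;   // keep the sign on the numerator
        Numerator = sign * numerator / gcd;
        Denominator = sign * denominator / gcd;
    }
    private static int Gcd(int a, int b)
    {
        while (b != 0)
        {
            int rem = a % b;
            a = b;
            b = rem;
        }
        return a == 0 ? 1 : a;
    }
    public override string ToString()
    {
        return Numerator + "/" + Denominator;
    }
}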
A custom solution:
void simplify(int[] numbers)
{
for (int divideBy = 50; divideBy > 0; divideBy--)
{
bool divisible = true;
foreach (int cur in numbers)
{
//check for divisibility
if ((int)(cur/divideBy)*divideBy!=cur){
divisible = false;
break;
}
}
if (divisible)
{
for (int i = 0; i < numbers.GetLength(0);i++ )
{
numbers[i] /= divideBy;
}
}
}
}
Example usage:
int [] percentages = {20,30,50};
simplify(percentages);
foreach (int p in percentages)
{
Console.WriteLine(p);
}
Outputs:
2
3
5
By the way, this is my first c# program. Thought it would simply be a fun problem to try a new language with, and now I'm in love! It's like Java, but everything I wish was a bit different is exactly how I wanted it
<3 c#
Edit: Btw don't forget to make it static void if it's for your Main class.
(background: Why should I use int instead of a byte or short in C#)
To satisfy my own curiosity about the pros and cons of using the "appropriate size" integer vs the "optimized" integer, I wrote the following code, which reinforced what I previously held true about int performance in .NET (and which is explained in the link above): that it is optimized for int performance rather than for short or byte.
DateTime t;
long a, b, c;
t = DateTime.Now;
for (int index = 0; index < 127; index++)
{
Console.WriteLine(index.ToString());
}
a = DateTime.Now.Ticks - t.Ticks;
t = DateTime.Now;
for (short index = 0; index < 127; index++)
{
Console.WriteLine(index.ToString());
}
b=DateTime.Now.Ticks - t.Ticks;
t = DateTime.Now;
for (byte index = 0; index < 127; index++)
{
Console.WriteLine(index.ToString());
}
c=DateTime.Now.Ticks - t.Ticks;
Console.WriteLine(a.ToString());
Console.WriteLine(b.ToString());
Console.WriteLine(c.ToString());
This gives roughly consistent results in the area of...
~950000
~2000000
~1700000
This is in line with what I would expect to see.
However when I try repeating the loops for each data type like this...
t = DateTime.Now;
for (int index = 0; index < 127; index++)
{
Console.WriteLine(index.ToString());
}
for (int index = 0; index < 127; index++)
{
Console.WriteLine(index.ToString());
}
for (int index = 0; index < 127; index++)
{
Console.WriteLine(index.ToString());
}
a = DateTime.Now.Ticks - t.Ticks;
The numbers are more like...
~4500000
~3100000
~300000
I find this puzzling. Can anyone offer an explanation?
NOTE:
In the interest of comparing like for like, I've limited the loops to 127 because of the range of the byte value type.
Also this is an act of curiosity not production code micro-optimization.
First of all, it's not .NET that's optimized for int performance, it's the machine that's optimized because 32 bits is the native word size (unless you're on x64, in which case it's long or 64 bits).
Second, you're writing to the console inside each loop - that's going to be far more expensive than incrementing and testing the loop counter, so you're not measuring anything realistic here.
Third, a byte has range up to 255, so you can loop 254 times (if you try to do 255 it will overflow and the loop will never end - but you don't need to stop at 128).
Fourth, you're not doing anywhere near enough iterations to profile. Iterating a tight loop 128 or even 254 times is meaningless. What you should be doing is putting the byte/short/int loop inside another loop that iterates a much larger number of times, say 10 million, and check the results of that.
Finally, using DateTime.Now within calculations is going to result in some timing "noise" while profiling. It's recommended (and easier) to use the Stopwatch class instead.
Bottom line, this needs many changes before it can be a valid perf test.
Here's what I'd consider to be a more accurate test program:
class Program
{
const int TestIterations = 5000000;
static void Main(string[] args)
{
RunTest("Byte Loop", TestByteLoop, TestIterations);
RunTest("Short Loop", TestShortLoop, TestIterations);
RunTest("Int Loop", TestIntLoop, TestIterations);
Console.ReadLine();
}
static void RunTest(string testName, Action action, int iterations)
{
Stopwatch sw = new Stopwatch();
sw.Start();
for (int i = 0; i < iterations; i++)
{
action();
}
sw.Stop();
Console.WriteLine("{0}: Elapsed Time = {1}", testName, sw.Elapsed);
}
static void TestByteLoop()
{
int x = 0;
for (byte b = 0; b < 255; b++)
++x;
}
static void TestShortLoop()
{
int x = 0;
for (short s = 0; s < 255; s++)
++x;
}
static void TestIntLoop()
{
int x = 0;
for (int i = 0; i < 255; i++)
++x;
}
}
This runs each loop inside a much larger loop (5 million iterations) and performs a very simple operation inside the loop (increments a variable). The results for me were:
Byte Loop: Elapsed Time = 00:00:03.8949910
Short Loop: Elapsed Time = 00:00:03.9098782
Int Loop: Elapsed Time = 00:00:03.2986990
So, no appreciable difference.
Also, make sure you profile in release mode, a lot of people forget and test in debug mode, which will be significantly less accurate.
The majority of this time is probably spent writing to the console. Try doing something other than that in the loop...
Additionally:
Using DateTime.Now is a bad way of measuring time. Use System.Diagnostics.Stopwatch instead
Once you've got rid of the Console.WriteLine call, a loop of 127 iterations is going to be too short to measure. You need to run the loop lots of times to get a sensible measurement.
Here's my benchmark:
using System;
using System.Diagnostics;
public static class Test
{
const int Iterations = 100000;
static void Main(string[] args)
{
Measure(ByteLoop);
Measure(ShortLoop);
Measure(IntLoop);
Measure(BackToBack);
Measure(DelegateOverhead);
}
static void Measure(Action action)
{
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < Iterations; i++)
{
action();
}
sw.Stop();
Console.WriteLine("{0}: {1}ms", action.Method.Name,
sw.ElapsedMilliseconds);
}
static void ByteLoop()
{
for (byte index = 0; index < 127; index++)
{
index.ToString();
}
}
static void ShortLoop()
{
for (short index = 0; index < 127; index++)
{
index.ToString();
}
}
static void IntLoop()
{
for (int index = 0; index < 127; index++)
{
index.ToString();
}
}
static void BackToBack()
{
for (byte index = 0; index < 127; index++)
{
index.ToString();
}
for (short index = 0; index < 127; index++)
{
index.ToString();
}
for (int index = 0; index < 127; index++)
{
index.ToString();
}
}
static void DelegateOverhead()
{
// Nothing. Let's see how much
// overhead there is just for calling
// this repeatedly...
}
}
And the results:
ByteLoop: 6585ms
ShortLoop: 6342ms
IntLoop: 6404ms
BackToBack: 19757ms
DelegateOverhead: 1ms
(This is on a netbook - adjust the number of iterations until you get something sensible :)
That seems to show that it makes basically no significant difference which type you use.
Just out of curiosity I modified the program from Aaronaught a little and compiled it in both x86 and x64 modes. Strangely, int works much faster in x64:
x86
Byte Loop: Elapsed Time = 00:00:00.8636454
Short Loop: Elapsed Time = 00:00:00.8795518
UShort Loop: Elapsed Time = 00:00:00.8630357
Int Loop: Elapsed Time = 00:00:00.5184154
UInt Loop: Elapsed Time = 00:00:00.4950156
Long Loop: Elapsed Time = 00:00:01.2941183
ULong Loop: Elapsed Time = 00:00:01.3023409
x64
Byte Loop: Elapsed Time = 00:00:01.0646588
Short Loop: Elapsed Time = 00:00:01.0719330
UShort Loop: Elapsed Time = 00:00:01.0711545
Int Loop: Elapsed Time = 00:00:00.2462848
UInt Loop: Elapsed Time = 00:00:00.4708777
Long Loop: Elapsed Time = 00:00:00.5242272
ULong Loop: Elapsed Time = 00:00:00.5144035
I tried out the two programs above as they looked like they would produce different and possibly conflicting results on my dev machine.
Outputs from Aaronaught's test harness
Short Loop: Elapsed Time = 00:00:00.8299340
Byte Loop: Elapsed Time = 00:00:00.8398556
Int Loop: Elapsed Time = 00:00:00.3217386
Long Loop: Elapsed Time = 00:00:00.7816368
ints are much quicker
Outputs from Jon's
ByteLoop: 1126ms
ShortLoop: 1115ms
IntLoop: 1096ms
BackToBack: 3283ms
DelegateOverhead: 0ms
nothing in it
Jon's results include the big fixed cost of calling ToString, which may be hiding the benefits that could show up if the work done in the loop were smaller.
Aaronaught is using a 32-bit OS, which doesn't seem to benefit from using ints as much as the x64 rig I am using.
Hardware / Software
Results were collected on a Core i7 975 at 3.33GHz with turbo disabled and the core affinity set to reduce impact of other tasks. Performance settings all set to maximum and virus scanner / unnecessary background tasks suspended. Windows 7 x64 ultimate with 11 GB of spare ram and very little IO activity. Run in release config built in vs 2008 without a debugger or profiler attached.
Repeatability
Originally repeated 10 times, changing the order of execution for each test. Variation was negligible, so I only posted my first result. Under max CPU load the ratio of execution times stayed consistent. Repeat runs on multiple x64 XP Xeon blades give roughly the same results after taking CPU generation and GHz into account.
Profiling
Redgate / Jetbrains / Slimtune / CLR profiler and my own profiler all indicate that the results are correct.
Debug Build
Using the debug settings in VS gives consistent results like Aaronaught's.
A bit late to the game, but this question deserves an accurate answer.
The generated IL code for the int loop will indeed be faster than the other two. When using byte or short, a convert instruction is required. It is possible, though, that the jitter is able to optimize it away under certain conditions (not in scope of this analysis).
Benchmark
Targeting .NET Core 3.1 with Release (Any CPU) configuration. Benchmark executed on x64 CPU.
| Method | Mean | Error | StdDev |
|---------- |----------:|---------:|---------:|
| ByteLoop | 149.78 ns | 0.963 ns | 0.901 ns |
| ShortLoop | 149.40 ns | 0.322 ns | 0.286 ns |
| IntLoop | 79.38 ns | 0.764 ns | 0.638 ns |
Generated IL
Comparing the IL for the three methods, it becomes obvious that the induced cost comes from a conv instruction.
IL_0000: ldc.i4.0
IL_0001: stloc.0
IL_0002: br.s IL_0009
IL_0004: ldloc.0
IL_0005: ldc.i4.1
IL_0006: add
IL_0007: conv.i2 ; conv.i2 for short, conv.u1 for byte
IL_0008: stloc.0
IL_0009: ldloc.0
IL_000a: ldc.i4 0xff
IL_000f: blt.s IL_0004
IL_0011: ret
Complete test code
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
namespace LoopPerformance
{
public class Looper
{
[Benchmark]
public void ByteLoop()
{
for (byte b = 0; b < 255; b++) {}
}
[Benchmark]
public void ShortLoop()
{
for (short s = 0; s < 255; s++) {}
}
[Benchmark]
public void IntLoop()
{
for (int i = 0; i < 255; i++) {}
}
}
class Program
{
static void Main(string[] args)
{
var summary = BenchmarkRunner.Run<Looper>();
}
}
}
Profiling .Net code is very tricky because the run-time environment the compiled byte-code runs in can be doing run-time optimisations on the byte code. In your second example, the JIT compiler probably spotted the repeated code and created a more optimised version. But, without any really detailed description of how the run-time system works, it's impossible to know what is going to happen to your code. And it would be foolish to try and guess based on experimentation since Microsoft are perfectly within their rights to redesign the JIT engine at any time provided they don't break any functionality.
Console writes have next to nothing to do with the actual performance of the data types; they mostly measure the interaction with the console library calls. I suggest you do something interesting inside those loops that is independent of data size.
Suggestions: bit shifts, multiplies, array manipulation, addition, many others...
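For example, a sketch of such a loop body - arithmetic on the counter, accumulated into a checksum so it can't be optimized away (names and constants are illustrative):
static long Churn()
{
    long checksum = 0;
    for (int rep = 0; rep < 1000000; rep++)        // repeat enough times to measure
    {
        for (int index = 0; index < 127; index++)  // or byte/short, as in the question
        {
            checksum += (index << 3) ^ (index * 31);   // shifts and multiplies, no I/O
        }
    }
    return checksum;
}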
Adding to the discussion of the performance of different integral data types: I tested the performance of Int32 vs Int64 (i.e. int vs long) for an implementation of my prime number calculator, and found that on my x64 machine (Ryzen 1800X) there was no marked difference.
I couldn't really test with shorts (Int16 and UInt16) because it overflows pretty quickly.
And as others noted, your short loops - and especially your debugging statements - are obfuscating your results. You should try to use a worker thread instead.
The performance comparison chart of int vs long is not reproduced here; the two were effectively identical.
Of course, make sure to avoid long (and anything other than plain int) for array indices, since you can't even use them, and casting to int could only hurt performance (immeasurable in my test).
Here is my profiling code, which polls the progress as the worker thread spins forever. It does slow down slightly with repeated tests, so I made sure to test in other orderings and individually as well:
public static void Run() {
TestWrapper(new PrimeEnumeratorInt32());
TestWrapper(new PrimeEnumeratorInt64());
TestWrapper(new PrimeEnumeratorInt64Indices());
}
private static void TestWrapper<X>(X enumeration)
where X : IDisposable, IEnumerator {
int[] lapTimesMs = new int[] { 100, 300, 600, 1000, 3000, 5000, 10000 };
int sleepNumberBlockWidth = 2 + (int)Math.Ceiling(Math.Log10(lapTimesMs.Max()));
string resultStringFmt = string.Format("\tTotal time is {{0,-{0}}}ms, number of computed primes is {{1}}", sleepNumberBlockWidth);
int totalSlept = 0;
int offset = 0;
Stopwatch stopwatch = new Stopwatch();
Type t = enumeration.GetType();
FieldInfo field = t.GetField("_known", BindingFlags.NonPublic | BindingFlags.Instance);
Console.WriteLine("Testing {0}", t.Name);
_continue = true;
Thread thread = new Thread(InfiniteLooper);
thread.Start(enumeration);
stopwatch.Start();
foreach (int sleepSize in lapTimesMs) {
SleepExtensions.SleepWithProgress(sleepSize + offset);
//avoid race condition calling the Current property by using reflection to get private data
Console.WriteLine(resultStringFmt, stopwatch.ElapsedMilliseconds, ((IList)field.GetValue(enumeration)).Count);
totalSlept += sleepSize;
offset = totalSlept - (int)stopwatch.ElapsedMilliseconds;//synchronize to stopwatch laps
}
_continue = false;
thread.Join(100);//plz stop in time (Thread.Abort is no longer supported)
enumeration.Dispose();
stopwatch.Stop();
}
private static bool _continue = true;
private static void InfiniteLooper(object data) {
IEnumerator enumerator = (IEnumerator)data;
while (_continue && enumerator.MoveNext()) { }
}
}
Note you can replace SleepExtensions.SleepWithProgress with just Thread.Sleep
And the three variations of the algorithm being profiled:
Int32 version
class PrimeEnumeratorInt32 : IEnumerator<int> {
public int Current { get { return this._known[this._currentIdx]; } }
object IEnumerator.Current { get { return this.Current; } }
private int _currentIdx = -1;
private List<int> _known = new List<int>() { 2, 3 };
public bool MoveNext() {
if (++this._currentIdx >= this._known.Count)
this._known.Add(this.ComputeNext(this._known[^1]));
return true;//no end
}
private int ComputeNext(int lastKnown) {
int current = lastKnown + 2;//start at 2 past last known value, which is guaranteed odd because we initialize up thru 3
int testIdx;
int sqrt;
bool isComposite;
while (true) {//keep going until a new prime is found
testIdx = 1;//all test values are odd, so skip testing the first known prime (two)
sqrt = (int)Math.Sqrt(current);//round down, and avoid casting due to the comparison type of the while loop condition
isComposite = false;
while (this._known[testIdx] <= sqrt) {
if (current % this._known[testIdx++] == 0L) {
isComposite = true;
break;
}
}
if (isComposite) {
current += 2;
} else {
return current;//and end
}
}
}
public void Reset() {
this._currentIdx = -1;
}
public void Dispose() {
this._known = null;
}
}
Int64 version
class PrimeEnumeratorInt64 : IEnumerator<long> {
public long Current { get { return this._known[this._currentIdx]; } }
object IEnumerator.Current { get { return this.Current; } }
private int _currentIdx = -1;
private List<long> _known = new List<long>() { 2, 3 };
public bool MoveNext() {
if (++this._currentIdx >= this._known.Count)
this._known.Add(this.ComputeNext(this._known[^1]));
return true;//no end
}
private long ComputeNext(long lastKnown) {
long current = lastKnown + 2;//start at 2 past last known value, which is guaranteed odd because we initialize up thru 3
int testIdx;
long sqrt;
bool isComposite;
while (true) {//keep going until a new prime is found
testIdx = 1;//all test values are odd, so skip testing the first known prime (two)
sqrt = (long)Math.Sqrt(current);//round down, and avoid casting due to the comparison type of the while loop condition
isComposite = false;
while (this._known[testIdx] <= sqrt) {
if (current % this._known[testIdx++] == 0L) {
isComposite = true;
break;
}
}
if (isComposite)
current += 2;
else
return current;//and end
}
}
public void Reset() {
this._currentIdx = -1;
}
public void Dispose() {
this._known = null;
}
}
Int64 for both values and indices
Note the necessary casting of indices accessing the _known list.
class PrimeEnumeratorInt64Indices : IEnumerator<long> {
public long Current { get { return this._known[(int)this._currentIdx]; } }
object IEnumerator.Current { get { return this.Current; } }
private long _currentIdx = -1;
private List<long> _known = new List<long>() { 2, 3 };
public bool MoveNext() {
if (++this._currentIdx >= this._known.Count)
this._known.Add(this.ComputeNext(this._known[^1]));
return true;//no end
}
private long ComputeNext(long lastKnown) {
long current = lastKnown + 2;//start at 2 past last known value, which is guaranteed odd because we initialize up thru 3
long testIdx;
long sqrt;
bool isComposite;
while (true) {//keep going until a new prime is found
testIdx = 1;//all test values are odd, so skip testing the first known prime (two)
sqrt = (long)Math.Sqrt(current);//round down, and avoid casting due to the comparison type of the while loop condition
isComposite = false;
while (this._known[(int)testIdx] <= sqrt) {
if (current % this._known[(int)testIdx++] == 0L) {
isComposite = true;
break;
}
}
if (isComposite)
current += 2;
else
return current;//and end
}
}
public void Reset() {
this._currentIdx = -1;
}
public void Dispose() {
this._known = null;
}
}
In total, my test program uses 43 MB of memory after 20 seconds for Int32 and 75 MB for Int64, due to the List<...> _known collection - that is the biggest difference I'm observing.
I profiled versions using unsigned types as well. Here are my results (Release mode):
Testing PrimeEnumeratorInt32
Total time is 20000 ms, number of computed primes is 3842603
Testing PrimeEnumeratorUInt32
Total time is 20001 ms, number of computed primes is 3841554
Testing PrimeEnumeratorInt64
Total time is 20001 ms, number of computed primes is 3839953
Testing PrimeEnumeratorUInt64
Total time is 20002 ms, number of computed primes is 3837199
All 4 versions have essentially identical performance. I guess the lesson here is to never assume how performance will be affected, and that you should probably use Int64 if you are targeting an x64 architecture, since it matches my Int32 version even with the increased memory usage.
And I validated that my prime calculator is working (validation output not reproduced here).
P.S. Release mode had consistent results that were 1.1% faster.
P.P.S. Here are the necessary using statements:
using System;
using System.Collections;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Reflection;
using System.Threading;
Another use case where int16 or int32 may be preferable to int64 is for SIMD (Single Instruction, Multiple Data), so you can double/quadruple/octuple etc. your throughput, by stuffing more data into your instructions. This is because the register size is (generally) 256-bit, so you can evaluate 16, 8, or 4 values simultaneously, respectively. It is very useful for vector calculations.
The data structure on MSDN.
A couple use cases: improving performance with simd intrinsics in three use cases. I particularly found SIMD to be useful for higher-dimensional binary tree child index lookup operations (i.e. signal vectors).
You can also use SIMD to accelerate other array operations and further tighten your loops.
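For illustration, a sketch using System.Numerics.Vector<int> to find the max of an array, processing Vector<int>.Count lanes per iteration (8 lanes with 256-bit registers); the class and method names are my own:
using System;
using System.Numerics;
static class SimdDemo
{
    public static int SimdMax(int[] values)
    {
        if (values.Length == 0) throw new ArgumentException("Empty array.");
        int lanes = Vector<int>.Count;               // e.g. 8 on AVX2 hardware
        var best = new Vector<int>(int.MinValue);
        int i = 0;
        for (; i <= values.Length - lanes; i += lanes)
            best = Vector.Max(best, new Vector<int>(values, i));  // per-lane max
        int max = int.MinValue;
        for (int lane = 0; lane < lanes; lane++)     // reduce the vector lanes
            if (best[lane] > max) max = best[lane];
        for (; i < values.Length; i++)               // scalar tail
            if (values[i] > max) max = values[i];
        return max;
    }
}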