Tracking down a performance problem (micro, I know), I ended up with this test program. Compiled against .NET Framework 4.5 in Release mode, it takes around 10 ms on my machine.
What bothers me is that if I remove this line
public int[] value1 = new int[80];
times get closer to 2 ms. It looks like some memory fragmentation problem, but I can't explain why. I have tested the program on .NET Core 2.0 with the same results. Can anyone explain this behaviour?
using System;
using System.Collections.Generic;
using System.Diagnostics;

namespace ConsoleApp4
{
    public class MyObject
    {
        public int value = 1;
        public int[] value1 = new int[80];
    }

    class Program
    {
        static void Main(string[] args)
        {
            var list = new List<MyObject>();
            for (int i = 0; i < 500000; i++)
            {
                list.Add(new MyObject());
            }

            long total = 0;
            for (int i = 0; i < 200; i++)
            {
                int counter = 0;
                Stopwatch timer = Stopwatch.StartNew();
                foreach (var obj in list)
                {
                    if (obj.value == 1)
                        counter++;
                }
                timer.Stop();
                total += timer.ElapsedMilliseconds;
            }

            Console.WriteLine(total / 200);
            Console.ReadKey();
        }
    }
}
UPDATE:
After some research I came to the conclusion that it's just processor cache access time. Using the VS profiler, the cache misses are a lot higher:
Without array (profiler screenshot)
With array (profiler screenshot)
There are several implications involved.
When you have the line public int[] value1 = new int[80];, you get one extra memory allocation per object: a new array is created on the heap to accommodate 80 integers (320 bytes), plus the object overhead of the array itself. You do 500,000 of these allocations.
These allocations total more than 160 MB of RAM, which may cause the GC to kick in and check whether there is memory to be released.
Further, when you allocate so much memory, it is likely that some of the objects from the list are not retained in the CPU cache. When you later enumerate your collection, the CPU may need to read the data from RAM, not from cache, which will induce a serious performance penalty.
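To put a rough number on that extra allocation, here is a small probe I sketched (not from the original post), using `GC.GetAllocatedBytesForCurrentThread`, which is available on newer .NET runtimes:

```csharp
using System;

public class MyObject
{
    public int value = 1;
    public int[] value1 = new int[80];
}

public static class AllocProbe
{
    // Approximate bytes allocated for one MyObject instance (object + its int[80]).
    public static long BytesPerObject()
    {
        long before = GC.GetAllocatedBytesForCurrentThread();
        var obj = new MyObject();
        long after = GC.GetAllocatedBytesForCurrentThread();
        GC.KeepAlive(obj);
        return after - before;
    }

    public static void Main()
    {
        BytesPerObject(); // warm-up, so JIT-time allocations don't skew the number
        // On x64 this is roughly 32 bytes for the object itself plus ~344 bytes
        // for the int[80] (320 bytes of data plus the array header).
        Console.WriteLine(BytesPerObject());
    }
}
```

Multiplied by 500,000 objects, that per-instance array cost is where the 160+ MB figure above comes from.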
I'm not able to reproduce a big difference between the two and I wouldn't expect it either. Below are the results I get on .NET Core 2.2.
Instances of MyObject will be allocated on the heap. In one case, you have an int and a reference to the int array. In the other you have just the int. In both cases, you need to do the additional work of following the reference from the list. That is the same in both cases and the compiled code shows this.
Branch prediction will affect how fast this runs, but since you're branching on the same condition every time I wouldn't expect this to change from run to run (unless you change the data).
BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17134.376 (1803/April2018Update/Redstone4)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=2.2.200-preview-009648
[Host] : .NET Core 2.2.0 (CoreCLR 4.6.27110.04, CoreFX 4.6.27110.04), 64bit RyuJIT
DefaultJob : .NET Core 2.2.0 (CoreCLR 4.6.27110.04, CoreFX 4.6.27110.04), 64bit RyuJIT
Method | size | Mean | Error | StdDev | Ratio |
------------- |------- |---------:|----------:|----------:|------:|
WithArray | 500000 | 8.167 ms | 0.0495 ms | 0.0463 ms | 1.00 |
WithoutArray | 500000 | 8.167 ms | 0.0454 ms | 0.0424 ms | 1.00 |
For reference:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Collections.Generic;

namespace CoreSandbox
{
    [DisassemblyDiagnoser(printAsm: true, printSource: false, printPrologAndEpilog: true, printIL: false, recursiveDepth: 1)]
    //[MemoryDiagnoser]
    public class Test
    {
        private List<MyObject> dataWithArray;
        private List<MyObjectLight> dataWithoutArray;

        [Params(500_000)]
        public int size;

        public class MyObject
        {
            public int value = 1;
            public int[] value1 = new int[80];
        }

        public class MyObjectLight
        {
            public int value = 1;
        }

        static void Main(string[] args)
        {
            var summary = BenchmarkRunner.Run<Test>();
        }

        [GlobalSetup]
        public void Setup()
        {
            dataWithArray = new List<MyObject>(size);
            dataWithoutArray = new List<MyObjectLight>(size);
            for (var i = 0; i < size; i++)
            {
                dataWithArray.Add(new MyObject());
                dataWithoutArray.Add(new MyObjectLight());
            }
        }

        [Benchmark(Baseline = true)]
        public int WithArray()
        {
            var counter = 0;
            foreach (var obj in dataWithArray)
            {
                if (obj.value == 1)
                    counter++;
            }
            return counter;
        }

        [Benchmark]
        public int WithoutArray()
        {
            var counter = 0;
            foreach (var obj in dataWithoutArray)
            {
                if (obj.value == 1)
                    counter++;
            }
            return counter;
        }
    }
}
We have a piece of code that basically reads a template from a file and then replaces a bunch of placeholders with real data. The template is outside of the developers' control, and we have noticed that for large template files it can get quite CPU-intensive to perform the replaces.
At least, we believe it is the string replaces that are intensive. Debugging locally shows that each replace takes only milliseconds, but every template is different and some templates contain hundreds of these tags that need replacing.
I'll show a little bit of code first, before I continue.
This is a huge simplification of the real code. There could be hundreds of replacements happening in our real code.
string template = File.ReadAllText(@"path\to\file");
if (!string.IsNullOrEmpty(template))
{
    if (template.Contains("[NUMBER]"))
        template = template.Replace("[NUMBER]", myObject.Number);
    if (template.Contains("[VAT_TABLE]"))
        template = template.Replace("[VAT_TABLE]", ConstructVatTable(myObject));
    // etc ...
}

private string ConstructVatTable(Invoice myObject)
{
    string vatTemplate = "this is a template with tags of its own";
    StringBuilder builder = new StringBuilder();
    foreach (var item in myObject.VatItems)
    {
        builder.Append(vatTemplate.Replace("[TAG1]", item.Tag1).Replace("[TAG2]", item.Tag2));
    }
    return builder.ToString();
}
Is this the optimal way of replacing parts of a large string, or are there better ways? Are there ways we could profile what we are doing in more detail, to show us where the CPU-intensive parts lie? Any help or advice would be greatly appreciated.
You perhaps need to come up with some alternative strategies for your replacements and race your horses.
I did 4 here and benched them:
[MemoryDiagnoser]
[SimpleJob(RuntimeMoniker.Net50)]
public class Streplace
{
    private string _doc;
    private Dictionary<string, string> _tags;

    [GlobalSetup]
    public void GenerateDocAndTags()
    {
        var sb = new StringBuilder(50000);
        _tags = new Dictionary<string, string>();
        for (int i = 0; i < 20; i++)
        {
            _tags["TAG" + i] = new string((char)('a' + i), 1000);
        }
        for (int i = 0; i < 1000; i++)
        {
            sb.Append(Guid.NewGuid().ToString().ToUpper());
            if (i % 50 == 0)
            {
                sb.Append("[TAG" + i / 50 + "]");
            }
        }
        _doc = sb.ToString();
    }

    [Benchmark]
    public void NaiveString()
    {
        var str = _doc;
        foreach (var tag in _tags)
        {
            str = str.Replace("[" + tag.Key + "]", tag.Value);
        }
    }

    [Benchmark]
    public void NaiveStringBuilder()
    {
        var strB = new StringBuilder(_doc, _doc.Length * 2);
        foreach (var tag in _tags)
        {
            strB.Replace("[" + tag.Key + "]", tag.Value);
        }
        var s = strB.ToString();
    }

    [Benchmark]
    public void StringSplit()
    {
        var strs = _doc.Split('[', ']');
        for (int i = 1; i < strs.Length; i += 2)
        {
            strs[i] = _tags[strs[i]];
        }
        var s = string.Concat(strs);
    }

    [Benchmark]
    public void StringCrawl()
    {
        var strB = new StringBuilder(_doc.Length * 2);
        var str = _doc;
        var lastI = 0;
        for (int i = str.IndexOf('['); i > -1; i = str.IndexOf('[', i))
        {
            strB.Append(str, lastI, i - lastI); // up to the [
            i++;
            var j = str.IndexOf(']', i);
            var tag = str[i..j];
            strB.Append(_tags[tag]);
            lastI = j + 1;
        }
        strB.Append(str, lastI, str.Length - lastI);
        var s = strB.ToString();
    }
}
NaiveString - your replace-replace-replace approach.
NaiveStringBuilder - Rafal's approach. I was quite surprised how badly this performed, but I haven't looked into why. If anyone notices a glaring error in my code, let me know.
StringSplit - the approach I commented: split the string, swap out the odd indexes (which hold the tag names), then join the string again.
StringCrawl - travel the string looking for [ and ], appending either doc content or tag content to the StringBuilder, depending on whether we're inside or outside the brackets.
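A fifth strategy one could add to the race (my own sketch, not part of the benchmark above) is a single Regex.Replace pass with a MatchEvaluator, which scans the document once and looks up each tag as it is found:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexReplaceSketch
{
    // Replaces every [KEY] token in one pass; unknown keys are left as-is.
    public static string ReplaceTags(string doc, Dictionary<string, string> tags)
    {
        return Regex.Replace(doc, @"\[([A-Z0-9_]+)\]", m =>
            tags.TryGetValue(m.Groups[1].Value, out var value) ? value : m.Value);
    }

    public static void Main()
    {
        var tags = new Dictionary<string, string> { ["TAG0"] = "zero", ["TAG1"] = "one" };
        Console.WriteLine(ReplaceTags("a[TAG0]b[TAG1]c", tags)); // → azerobonec
    }
}
```

I'd expect it to land somewhere between NaiveString and the Split/Crawl approaches, but that's exactly what racing it would tell you.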
The test document was generated: thousands of GUIDs with [TAGx] (x from 0 to 19) inserted at regular intervals. Each [TAGx] was replaced with a considerably longer string of repeated chars. The resulting document was 56,000 chars.
The results were:
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1526 (21H2)
Intel Core i7-7820HQ CPU 2.90GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.200
[Host] : .NET 5.0.12 (5.0.1221.52207), X64 RyuJIT [AttachedDebugger]
.NET 5.0 : .NET 5.0.12 (5.0.1221.52207), X64 RyuJIT
Job=.NET 5.0 Runtime=.NET 5.0
| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------------- |-----------:|---------:|---------:|---------:|---------:|---------:|----------:|
| NaiveString | 1,266.9 us | 24.73 us | 39.93 us | 433.5938 | 433.5938 | 433.5938 | 1,820 KB |
| NaiveStringBuilder | 3,908.6 us | 73.79 us | 93.33 us | 78.1250 | 78.1250 | 78.1250 | 293 KB |
| StringSplit | 110.8 us | 2.15 us | 2.01 us | 34.4238 | 34.4238 | 34.4238 | 181 KB |
| StringCrawl | 101.5 us | 1.96 us | 2.40 us | 79.9561 | 79.9561 | 79.9561 | 251 KB |
NaiveString's memory usage is huge; NaiveStringBuilder's memory is better, but it is 3x slower. Crawl and Split are pretty good - about 12x faster than NaiveString and 40x faster than NaiveStringBuilder (and a bit less memory too, bonus).
I thought the Crawl would be better than the Split, but I'm not sure the 10% speedup is worth the extra memory/collections - that would be your call. Take the code, maybe add some more approaches, and race them. NuGet-install BenchmarkDotNet and use a Main like this:
public static async Task Main()
{
#if DEBUG
    var sr = new Streplace();
    sr.GenerateDocAndTags();
    sr.NaiveString();
    sr.NaiveStringBuilder();
    sr.StringSplit();
    sr.StringCrawl();
#else
    var summary = BenchmarkRunner.Run<Streplace>();
#endif
}
I was testing a .NET application on a Raspberry Pi, and whereas each iteration of the program took 500 milliseconds on a Windows laptop, the same took 5 seconds on the Raspberry Pi. After some debugging, I found that the majority of that time was being spent on a foreach loop concatenating strings.
Edit 1: To clarify, the 500 ms and 5 s times I mentioned are the time of the entire loop. I placed a timer before the loop and stopped it after the loop had finished. And the number of iterations is the same in both: 1000.
Edit 2: To time the loop, I used the answer mentioned here.
private static string ComposeRegs(List<list_of_bytes> registers)
{
    string ret = string.Empty;
    foreach (list_of_bytes register in registers)
    {
        ret += Convert.ToString(register.RegisterValue) + ",";
    }
    return ret;
}
Out of the blue I replaced the foreach with a for loop, and suddenly it started taking almost the same time as it did on that laptop: 500 to 600 milliseconds.
private static string ComposeRegs(List<list_of_bytes> registers)
{
    string ret = string.Empty;
    for (UInt16 i = 0; i < 1000; i++)
    {
        ret += Convert.ToString(registers[i].RegisterValue) + ",";
    }
    return ret;
}
Should I always use for loops instead of foreach? Or was this just a scenario in which a for loop is way faster than a foreach loop?
The actual problem is concatenating strings, not any difference between for and foreach. The reported timings are excruciatingly slow even on a Raspberry Pi. 1000 items is so little data that it fits in either machine's CPU cache, and an RPi has a 1+ GHz CPU, which means each concatenation takes at least 1000 cycles.
The problem is the concatenation. Strings are immutable: modifying or concatenating strings creates a new string. Your loops created 2000 temporary objects that need to be garbage collected, and that process is expensive. Use a StringBuilder instead, preferably with a capacity roughly equal to the size of the expected string.
[Benchmark]
public string StringBuilder()
{
    var sb = new StringBuilder(registers.Count * 3);
    foreach (list_of_bytes register in registers)
    {
        sb.AppendFormat("{0}", register.RegisterValue);
    }
    return sb.ToString();
}
Simply measuring a single execution, or even averaging 10 executions, won't produce valid numbers. It's quite possible the GC ran to collect those 2000 objects during one of the tests. It's also quite possible that one of the tests was delayed by JIT compilation, or for any number of other reasons. A test should run long enough to produce stable numbers.
The de facto standard for .NET benchmarking is BenchmarkDotNet. That library will run each benchmark long enough to eliminate startup and cooldown effects, and it accounts for memory allocations and GC collections. You'll see not only how long each test takes but also how much RAM is used and how many GCs are caused.
To actually measure your code, try this benchmark using BenchmarkDotNet:
[MemoryDiagnoser]
[MarkdownExporterAttribute.StackOverflow]
public class ConcatTest
{
    private readonly List<list_of_bytes> registers;

    public ConcatTest()
    {
        registers = Enumerable.Range(0, 1000).Select(i => new list_of_bytes(i)).ToList();
    }

    [Benchmark]
    public string StringBuilder()
    {
        var sb = new StringBuilder(registers.Count * 3);
        foreach (var register in registers)
        {
            sb.AppendFormat("{0}", register.RegisterValue);
        }
        return sb.ToString();
    }

    [Benchmark]
    public string ForEach()
    {
        string ret = string.Empty;
        foreach (list_of_bytes register in registers)
        {
            ret += Convert.ToString(register.RegisterValue) + ",";
        }
        return ret;
    }

    [Benchmark]
    public string For()
    {
        string ret = string.Empty;
        for (UInt16 i = 0; i < registers.Count; i++)
        {
            ret += Convert.ToString(registers[i].RegisterValue) + ",";
        }
        return ret;
    }
}
The tests are run by calling BenchmarkRunner.Run<ConcatTest>()
using System;
using System.Linq;
using System.Text;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    public static void Main(string[] args)
    {
        var summary = BenchmarkRunner.Run<ConcatTest>();
        Console.WriteLine(summary);
    }
}
Results
Running this on a MacBook produced the following results. Note that BenchmarkDotNet produces results ready to paste into Stack Overflow, and the runtime information is included in the results:
BenchmarkDotNet=v0.13.1, OS=macOS Big Sur 11.5.2 (20G95) [Darwin 20.6.0]
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK=6.0.100
[Host] : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT
DefaultJob : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT
Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Allocated |
-------------- |----------:|---------:|---------:|---------:|--------:|----------:|
StringBuilder | 34.56 μs | 0.682 μs | 0.729 μs | 7.5684 | 0.3052 | 35 KB |
ForEach | 278.36 μs | 5.509 μs | 5.894 μs | 818.8477 | 24.4141 | 3,763 KB |
For | 268.72 μs | 3.611 μs | 3.015 μs | 818.8477 | 24.4141 | 3,763 KB |
Both For and ForEach took almost 10 times longer than StringBuilder and used 100 times as much RAM.
If a string can change, as in your example, then using a StringBuilder is a better option and could help with the issue you're dealing with.
Modifying any string object results in the creation of a new string object, which makes heavy use of string costly. When you need repetitive operations on a string, StringBuilder comes into its own: it provides an optimized way to deal with repeated and multiple string manipulation operations. It represents a mutable string of characters. String objects are immutable, but StringBuilder is a mutable string type: it does not create a new modified instance of the current string object, it makes the modifications in the existing buffer.
So instead of creating many temporary objects that will need to be garbage collected and meanwhile take a lot of memory, just use a StringBuilder.
More about StringBuilder - https://learn.microsoft.com/en-us/dotnet/api/system.text.stringbuilder?view=net-6.0
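Applied to the question's ComposeRegs, the StringBuilder rewrite would look roughly like this (a sketch: the question doesn't show list_of_bytes, so RegisterValue is assumed to be an int here):

```csharp
using System.Collections.Generic;
using System.Text;

// Stand-in for the question's type; the real definition isn't shown.
public class list_of_bytes
{
    public int RegisterValue;
}

public static class Compose
{
    // One growing buffer instead of a new string per '+=' concatenation.
    public static string ComposeRegs(List<list_of_bytes> registers)
    {
        var sb = new StringBuilder(registers.Count * 4);
        foreach (var register in registers)
        {
            sb.Append(register.RegisterValue).Append(',');
        }
        return sb.ToString();
    }
}
```

On newer runtimes, StringBuilder.Append(int) can also format the number directly into the buffer, avoiding the intermediate string that Convert.ToString creates for each element.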
I am exploring async/await and have found a curious scenario that I need guidance on resolving.
For reference, the code seen in this question can be found here:
https://github.com/Mike-EEE/Stash/tree/master/AwaitPerformance
I have provided two simple ways of awaiting a set of tasks. The first is simply creating a List<Task>, adding tasks to this list, and awaiting the entire result at once with a call to Task.WhenAll:
public async Task<uint> AwaitList()
{
    var list = new List<Task>();
    for (var i = 0u; i < 10; i++)
    {
        list.Add(Task.Delay(1));
    }
    await Task.WhenAll(list).ConfigureAwait(false);
    return 123;
}
The second is by awaiting each task as they occur in the for loop:
public async Task<uint> AwaitEach()
{
    for (var i = 0u; i < 10; i++)
    {
        await Task.Delay(1).ConfigureAwait(false);
    }
    return 123;
}
When running these two methods with Benchmark.NET, however, I get surprisingly conflicting results:
// * Summary *
BenchmarkDotNet=v0.11.5, OS=Windows 10.0.18362
AMD Ryzen 7 2700X, 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-preview5-011568
[Host] : .NET Core 3.0.0-preview5-27626-15 (CoreCLR 4.6.27622.75, CoreFX 4.700.19.22408), 64bit RyuJIT
DefaultJob : .NET Core 3.0.0-preview5-27626-15 (CoreCLR 4.6.27622.75, CoreFX 4.700.19.22408), 64bit RyuJIT
| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---------- |----------:|----------:|----------:|------:|------:|------:|----------:|
| AwaitList | 15.60 ms | 0.0274 ms | 0.0243 ms | - | - | - | 2416 B |
| AwaitEach | 155.62 ms | 0.9113 ms | 0.8524 ms | - | - | - | 352 B |
As you can see, awaiting the list of tasks is much faster, but generates a ton of allocations. Awaiting each item, however, is the inverse: it is slower but generates way less garbage.
Is there an obvious, ideal way that I am overlooking to get the best of both worlds here? That is, is there a way to await a set of Task elements that is both fast and results in a low amount of allocations?
Thank you in advance for any assistance.
You are not comparing apples to apples here.
In your example:
AwaitList creates a list of Tasks and then runs them all in parallel (async).
AwaitEach runs each Task one after another, which makes the asynchrony useless.
If, however, you build your list of Tasks so that each task gets started first, and then compare WhenAll vs a loop, your comparison looks like this:
public async Task<uint> AwaitList()
{
    var list = new List<Task>();
    for (var i = 0u; i < 10; i++)
    {
        list.Add(Task.Delay(1));
    }
    await Task.WhenAll(list).ConfigureAwait(false);
    return 123;
}
versus
public async Task<uint> AwaitEach()
{
    var list = new List<Task>();
    for (var i = 0; i < 10; i++)
    {
        list.Add(Task.Delay(1));
    }
    for (var i = 0; i < 10; i++)
    {
        await list[i].ConfigureAwait(false);
    }
    return 123;
}
Now compare the stats on these two functions and you will find they are ballpark of one another.
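The difference is easy to see with a stopwatch (my own sketch, not from the answer): once all the delays have been started, awaiting them one by one completes in roughly the time of the longest delay, not the sum of them:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

public static class AwaitDemo
{
    public static async Task<long> AwaitStartedTasks()
    {
        // All ten 100 ms delays start here, before any of them is awaited.
        List<Task> tasks = Enumerable.Range(0, 10).Select(_ => Task.Delay(100)).ToList();
        var sw = Stopwatch.StartNew();
        foreach (var task in tasks)
        {
            await task.ConfigureAwait(false);
        }
        return sw.ElapsedMilliseconds; // ~100 ms, not ~1000 ms
    }

    public static async Task Main()
    {
        Console.WriteLine(await AwaitStartedTasks());
    }
}
```

By the time the first await completes, the remaining nine tasks are usually already finished, so the later awaits return almost immediately.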
I have a general question about dictionaries in C#.
Say I read in a text file, split it up into keys and values and store them in a dictionary.
Would it be more useful to put them all into a single dictionary or split it up into smaller ones?
It probably wouldn't make a huge difference with small text files, but some of them have more than 100,000 lines.
What would you recommend?
The first rule is always to benchmark before trying to optimize. That said, some people may have done the benchmarking for you. Check those results here.
From the article (Just in case it disappears from the net)
The smaller Dictionary (with half the number of keys) was much faster.
In this case, the behavior of both Dictionaries on the input was
identical. This means that having unneeded keys in the Dictionary
makes it slower.
My perspective is that you should use separate Dictionaries for
separate purposes. If you have two sets of keys, do not store them in
the same Dictionary. If you can divide them up, you can enhance lookup
performance.
Credit: dotnetperls.com
Also from the article :
Full Dictionary: 791 ms
Half-size Dictionary: 591 ms [faster]
Maybe you can live with much less code and 200 ms more; it really depends on your application.
I believe the original article is either inaccurate or outdated. In any case, the statements regarding "dictionary size" have since been removed. Now, to answer the question:
Targeting .NET 6 x64 gives BETTER performance for a SINGLE dictionary. In fact, performance gets worse the more dictionaries you use:
| Method | Mean | Error | StdDev | Median |
|-------------- |----------:|---------:|----------:|----------:|
| Dictionary_1 | 91.54 us | 1.815 us | 3.318 us | 89.88 us |
| Dictionary_2 | 122.55 us | 1.067 us | 0.998 us | 122.19 us |
| Dictionary_10 | 390.77 us | 7.757 us | 18.882 us | 382.55 us |
The results should come as no surprise. For an N-dictionary lookup you will calculate the hash code of the key up to N times for every item you look up, instead of doing it just once. You also have to loop through the list of dictionaries, which introduces a minuscule performance hit of its own. All in all, it just makes sense.
Now, under some bizarre conditions it might be possible to gain some speed with the N-dictionary approach - consider a tiny CPU cache, thrashing, hash code collisions, etc. I have yet to encounter such a scenario, though...
Benchmark code
using System;
using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

namespace MyBenchmarks;

public class DictionaryBenchmark
{
    private const int N = 1000000;
    private readonly string[] data;
    private readonly Dictionary<string, string> dictionary;
    private readonly List<Dictionary<string, string>> dictionaries2;
    private readonly List<Dictionary<string, string>> dictionaries10;

    public DictionaryBenchmark()
    {
        data = Enumerable.Range(0, N).Select(n => Guid.NewGuid().ToString()).ToArray();
        dictionary = data.ToDictionary(x => x);
        dictionaries2 = CreateDictionaries(2);
        dictionaries10 = CreateDictionaries(10);
    }

    private List<Dictionary<string, string>> CreateDictionaries(int count)
    {
        int chunkSize = N / count;
        return data.Select((item, index) => (Item: item, Index: index))
                   .GroupBy(x => x.Index / chunkSize)
                   .Select(g => g.Select(x => x.Item).ToDictionary(x => x))
                   .ToList();
    }

    [Benchmark]
    public void Dictionary_1()
    {
        for (int i = 0; i < N; i += 1000)
        {
            dictionary.ContainsKey(data[i]);
        }
    }

    [Benchmark]
    public void Dictionary_2()
    {
        for (int i = 0; i < N; i += 1000)
        {
            foreach (var d in dictionaries2)
            {
                if (d.ContainsKey(data[i]))
                {
                    break;
                }
            }
        }
    }

    [Benchmark]
    public void Dictionary_10()
    {
        for (int i = 0; i < N; i += 1000)
        {
            foreach (var d in dictionaries10)
            {
                if (d.ContainsKey(data[i]))
                {
                    break;
                }
            }
        }
    }
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<DictionaryBenchmark>();
}
(background: Why should I use int instead of a byte or short in C#)
To satisfy my own curiosity about the pros and cons of using the "appropriate size" integer vs the "optimized" integer, I wrote the following code, which reinforced what I previously held true about int performance in .NET (and which is explained in the link above): it is optimized for int performance rather than short or byte.
DateTime t;
long a, b, c;

t = DateTime.Now;
for (int index = 0; index < 127; index++)
{
    Console.WriteLine(index.ToString());
}
a = DateTime.Now.Ticks - t.Ticks;

t = DateTime.Now;
for (short index = 0; index < 127; index++)
{
    Console.WriteLine(index.ToString());
}
b = DateTime.Now.Ticks - t.Ticks;

t = DateTime.Now;
for (byte index = 0; index < 127; index++)
{
    Console.WriteLine(index.ToString());
}
c = DateTime.Now.Ticks - t.Ticks;

Console.WriteLine(a.ToString());
Console.WriteLine(b.ToString());
Console.WriteLine(c.ToString());
This gives roughly consistent results in the area of...
~950000
~2000000
~1700000
Which is in line with what I would expect to see.
However when I try repeating the loops for each data type like this...
t = DateTime.Now;
for (int index = 0; index < 127; index++)
{
    Console.WriteLine(index.ToString());
}
for (int index = 0; index < 127; index++)
{
    Console.WriteLine(index.ToString());
}
for (int index = 0; index < 127; index++)
{
    Console.WriteLine(index.ToString());
}
a = DateTime.Now.Ticks - t.Ticks;
The numbers are more like...
~4500000
~3100000
~300000
Which I find puzzling. Can anyone offer an explanation?
NOTE:
In the interest of comparing like for like, I've limited the loops to 127 because of the range of the byte value type.
Also, this is an act of curiosity, not production-code micro-optimization.
First of all, it's not .NET that's optimized for int performance, it's the machine, because 32 bits is the native word size (unless you're on x64, in which case long, or 64 bits, is native).
Second, you're writing to the console inside each loop - that's going to be far more expensive than incrementing and testing the loop counter, so you're not measuring anything realistic here.
Third, a byte has a range up to 255, so you can loop 254 times (if you try to do 255 it will overflow and the loop will never end - but you don't need to stop at 128).
Fourth, you're not doing anywhere near enough iterations to profile. Iterating a tight loop 128 or even 254 times is meaningless. What you should be doing is putting the byte/short/int loop inside another loop that iterates a much larger number of times, say 10 million, and checking the results of that.
Finally, using DateTime.Now within calculations is going to result in some timing "noise" while profiling. It's recommended (and easier) to use the Stopwatch class instead.
Bottom line, this needs many changes before it can be a valid perf test.
Here's what I'd consider to be a more accurate test program:
class Program
{
    const int TestIterations = 5000000;

    static void Main(string[] args)
    {
        RunTest("Byte Loop", TestByteLoop, TestIterations);
        RunTest("Short Loop", TestShortLoop, TestIterations);
        RunTest("Int Loop", TestIntLoop, TestIterations);
        Console.ReadLine();
    }

    static void RunTest(string testName, Action action, int iterations)
    {
        Stopwatch sw = new Stopwatch();
        sw.Start();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        sw.Stop();
        Console.WriteLine("{0}: Elapsed Time = {1}", testName, sw.Elapsed);
    }

    static void TestByteLoop()
    {
        int x = 0;
        for (byte b = 0; b < 255; b++)
            ++x;
    }

    static void TestShortLoop()
    {
        int x = 0;
        for (short s = 0; s < 255; s++)
            ++x;
    }

    static void TestIntLoop()
    {
        int x = 0;
        for (int i = 0; i < 255; i++)
            ++x;
    }
}
This runs each loop inside a much larger loop (5 million iterations) and performs a very simple operation inside the loop (increments a variable). The results for me were:
Byte Loop: Elapsed Time = 00:00:03.8949910
Short Loop: Elapsed Time = 00:00:03.9098782
Int Loop: Elapsed Time = 00:00:03.2986990
So, no appreciable difference.
Also, make sure you profile in release mode, a lot of people forget and test in debug mode, which will be significantly less accurate.
The majority of this time is probably spent writing to the console. Try doing something other than that in the loop...
Additionally:
Using DateTime.Now is a bad way of measuring time. Use System.Diagnostics.Stopwatch instead
Once you've got rid of the Console.WriteLine call, a loop of 127 iterations is going to be too short to measure. You need to run the loop lots of times to get a sensible measurement.
Here's my benchmark:
using System;
using System.Diagnostics;

public static class Test
{
    const int Iterations = 100000;

    static void Main(string[] args)
    {
        Measure(ByteLoop);
        Measure(ShortLoop);
        Measure(IntLoop);
        Measure(BackToBack);
        Measure(DelegateOverhead);
    }

    static void Measure(Action action)
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < Iterations; i++)
        {
            action();
        }
        sw.Stop();
        Console.WriteLine("{0}: {1}ms", action.Method.Name,
            sw.ElapsedMilliseconds);
    }

    static void ByteLoop()
    {
        for (byte index = 0; index < 127; index++)
        {
            index.ToString();
        }
    }

    static void ShortLoop()
    {
        for (short index = 0; index < 127; index++)
        {
            index.ToString();
        }
    }

    static void IntLoop()
    {
        for (int index = 0; index < 127; index++)
        {
            index.ToString();
        }
    }

    static void BackToBack()
    {
        for (byte index = 0; index < 127; index++)
        {
            index.ToString();
        }
        for (short index = 0; index < 127; index++)
        {
            index.ToString();
        }
        for (int index = 0; index < 127; index++)
        {
            index.ToString();
        }
    }

    static void DelegateOverhead()
    {
        // Nothing. Let's see how much
        // overhead there is just for calling
        // this repeatedly...
    }
}
And the results:
ByteLoop: 6585ms
ShortLoop: 6342ms
IntLoop: 6404ms
BackToBack: 19757ms
DelegateOverhead: 1ms
(This is on a netbook - adjust the number of iterations until you get something sensible :)
That seems to show it makes basically no significant difference which type you use.
Just out of curiosity, I modified Aaronaught's program a little and compiled it in both x86 and x64 modes. Strangely, int works much faster in x64:
x86
Byte Loop: Elapsed Time = 00:00:00.8636454
Short Loop: Elapsed Time = 00:00:00.8795518
UShort Loop: Elapsed Time = 00:00:00.8630357
Int Loop: Elapsed Time = 00:00:00.5184154
UInt Loop: Elapsed Time = 00:00:00.4950156
Long Loop: Elapsed Time = 00:00:01.2941183
ULong Loop: Elapsed Time = 00:00:01.3023409
x64
Byte Loop: Elapsed Time = 00:00:01.0646588
Short Loop: Elapsed Time = 00:00:01.0719330
UShort Loop: Elapsed Time = 00:00:01.0711545
Int Loop: Elapsed Time = 00:00:00.2462848
UInt Loop: Elapsed Time = 00:00:00.4708777
Long Loop: Elapsed Time = 00:00:00.5242272
ULong Loop: Elapsed Time = 00:00:00.5144035
I tried out the two programs above, as they looked like they would produce different and possibly conflicting results on my dev machine.
Outputs from Aaronaught's test harness
Short Loop: Elapsed Time = 00:00:00.8299340
Byte Loop: Elapsed Time = 00:00:00.8398556
Int Loop: Elapsed Time = 00:00:00.3217386
Long Loop: Elapsed Time = 00:00:00.7816368
ints are much quicker
Outputs from Jon's
ByteLoop: 1126ms
ShortLoop: 1115ms
IntLoop: 1096ms
BackToBack: 3283ms
DelegateOverhead: 0ms
nothing in it
Jon's has the big fixed cost of calling ToString in the results, which may be hiding the possible benefits that could occur if the work done in the loop were lighter.
Aaronaught is using a 32-bit OS, which doesn't seem to benefit from using ints as much as the x64 rig I am using.
Hardware / Software
Results were collected on a Core i7 975 at 3.33GHz with turbo disabled and the core affinity set to reduce impact of other tasks. Performance settings all set to maximum and virus scanner / unnecessary background tasks suspended. Windows 7 x64 ultimate with 11 GB of spare ram and very little IO activity. Run in release config built in vs 2008 without a debugger or profiler attached.
Repeatability
Originally repeated 10 times, changing the order of execution for each test. Variation was negligible, so I only posted my first result. Under max CPU load the ratio of execution times stayed consistent. Repeat runs on multiple x64 XP Xeon blades gave roughly the same results after taking CPU generation and GHz into account.
Profiling
Redgate / JetBrains / SlimTune / CLR Profiler and my own profiler all indicate that the results are correct.
Debug Build
Using the debug settings in VS gives consistent results like Aaronaught's.
A bit late to the game, but this question deserves an accurate answer.
The generated IL for the int loop will indeed be faster than the other two. When using byte or short, a convert instruction is required. It is possible, though, that the jitter is able to optimize it away under certain conditions (not in scope of this analysis).
Benchmark
Targeting .NET Core 3.1 with Release (Any CPU) configuration. Benchmark executed on x64 CPU.
| Method | Mean | Error | StdDev |
|---------- |----------:|---------:|---------:|
| ByteLoop | 149.78 ns | 0.963 ns | 0.901 ns |
| ShortLoop | 149.40 ns | 0.322 ns | 0.286 ns |
| IntLoop | 79.38 ns | 0.764 ns | 0.638 ns |
Generated IL
Comparing the IL for the three methods, it becomes obvious that the induced cost comes from a conv instruction.
IL_0000: ldc.i4.0
IL_0001: stloc.0
IL_0002: br.s IL_0009
IL_0004: ldloc.0
IL_0005: ldc.i4.1
IL_0006: add
IL_0007: conv.i2 ; conv.i2 for short, conv.u1 for byte
IL_0008: stloc.0
IL_0009: ldloc.0
IL_000a: ldc.i4 0xff
IL_000f: blt.s IL_0004
IL_0011: ret
Complete test code
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
namespace LoopPerformance
{
public class Looper
{
[Benchmark]
public void ByteLoop()
{
for (byte b = 0; b < 255; b++) {}
}
[Benchmark]
public void ShortLoop()
{
for (short s = 0; s < 255; s++) {}
}
[Benchmark]
public void IntLoop()
{
for (int i = 0; i < 255; i++) {}
}
}
class Program
{
static void Main(string[] args)
{
var summary = BenchmarkRunner.Run<Looper>();
}
}
}
Profiling .NET code is very tricky because the run-time environment the compiled byte-code runs in can be performing run-time optimisations on that byte-code. In your second example, the JIT compiler probably spotted the repeated code and created a more optimised version. But, without any really detailed description of how the run-time system works, it's impossible to know what is going to happen to your code. And it would be foolish to try to guess based on experimentation, since Microsoft is perfectly within its rights to redesign the JIT engine at any time, provided it doesn't break any functionality.
Console writes have nothing to do with the actual performance of the data types; they measure your interaction with the console library calls. I suggest you do something interesting inside those loops that is data-size independent.
Suggestions: bit shifts, multiplies, array manipulation, addition, many others...
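As a hedged sketch of that suggestion (the names and constants here are mine, not from the original), a data-size-independent loop body could combine a shift, a multiply, and an add, with the timing and output kept outside the loop:

```csharp
using System;
using System.Diagnostics;

class LoopWork
{
    static void Main()
    {
        // Do enough arithmetic per iteration that loop overhead and
        // console I/O stop dominating the measurement.
        var sw = Stopwatch.StartNew();
        long acc = 0;
        for (int i = 0; i < 10_000_000; i++)
        {
            acc += (i << 3) ^ (i * 31); // bit shift + multiply + add
        }
        sw.Stop();
        // Print once, outside the timed region, and use the accumulator
        // so the JIT cannot eliminate the loop as dead code.
        Console.WriteLine($"{sw.ElapsedMilliseconds} ms, acc = {acc}");
    }
}
```

Consuming the accumulator in the final write matters: an empty loop, or one whose result is never used, is exactly the kind of code the JIT is free to remove.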
Adding to the performance of different integral data types, I tested the performance of Int32 vs Int64 (i.e. int vs long) for an implementation of my prime number calculator, and found that on my x64 machine (Ryzen 1800X) there was no marked difference.
I couldn't really test with shorts (Int16 and UInt16) because they overflow pretty quickly.
And as others noted, short loops obscure your results, and so do debugging statements. You should try to use a worker thread instead.
Here is a performance comparison of int vs long:
Of course, make sure to avoid long (and anything other than plain int) for collection indices, since List<T> indexers only accept int anyway, and casting back to int could only hurt performance (the cost was immeasurable in my test).
Here is my profiling code, which polls the progress as the worker thread spins forever. It does slow down slightly with repeated tests, so I made sure to test in other orderings and individually as well:
public static void Run() {
TestWrapper(new PrimeEnumeratorInt32());
TestWrapper(new PrimeEnumeratorInt64());
TestWrapper(new PrimeEnumeratorInt64Indices());
}
private static void TestWrapper<X>(X enumeration)
where X : IDisposable, IEnumerator {
int[] lapTimesMs = new int[] { 100, 300, 600, 1000, 3000, 5000, 10000 };
int sleepNumberBlockWidth = 2 + (int)Math.Ceiling(Math.Log10(lapTimesMs.Max()));
string resultStringFmt = string.Format("\tTotal time is {{0,-{0}}}ms, number of computed primes is {{1}}", sleepNumberBlockWidth);
int totalSlept = 0;
int offset = 0;
Stopwatch stopwatch = new Stopwatch();
Type t = enumeration.GetType();
FieldInfo field = t.GetField("_known", BindingFlags.NonPublic | BindingFlags.Instance);
Console.WriteLine("Testing {0}", t.Name);
_continue = true;
Thread thread = new Thread(InfiniteLooper);
thread.Start(enumeration);
stopwatch.Start();
foreach (int sleepSize in lapTimesMs) {
SleepExtensions.SleepWithProgress(sleepSize + offset);
//avoid race condition calling the Current property by using reflection to get private data
Console.WriteLine(resultStringFmt, stopwatch.ElapsedMilliseconds, ((IList)field.GetValue(enumeration)).Count);
totalSlept += sleepSize;
offset = totalSlept - (int)stopwatch.ElapsedMilliseconds;//synchronize to stopwatch laps
}
_continue = false;
thread.Join(100);//plz stop in time (Thread.Abort is no longer supported)
enumeration.Dispose();
stopwatch.Stop();
}
private static bool _continue = true;
private static void InfiniteLooper(object data) {
IEnumerator enumerator = (IEnumerator)data;
while (_continue && enumerator.MoveNext()) { }
}
}
Note you can replace SleepExtensions.SleepWithProgress with just Thread.Sleep
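SleepExtensions.SleepWithProgress isn't shown above; a minimal stand-in (my own hypothetical sketch, not the original helper) would just sleep in small slices and print a progress dot per slice:

```csharp
using System;
using System.Threading;

static class SleepExtensions
{
    // Hypothetical stand-in for the helper used above: sleep in small
    // slices, printing a dot after each slice so the console shows progress.
    public static void SleepWithProgress(int totalMs, int sliceMs = 250)
    {
        for (int slept = 0; slept < totalMs; slept += sliceMs)
        {
            Thread.Sleep(Math.Min(sliceMs, totalMs - slept));
            Console.Write('.');
        }
        Console.WriteLine();
    }
}
```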
And the three variations of the algorithm being profiled:
Int32 version
class PrimeEnumeratorInt32 : IEnumerator<int> {
public int Current { get { return this._known[this._currentIdx]; } }
object IEnumerator.Current { get { return this.Current; } }
private int _currentIdx = -1;
private List<int> _known = new List<int>() { 2, 3 };
public bool MoveNext() {
if (++this._currentIdx >= this._known.Count)
this._known.Add(this.ComputeNext(this._known[^1]));
return true;//no end
}
private int ComputeNext(int lastKnown) {
int current = lastKnown + 2;//start at 2 past last known value, which is guaranteed odd because we initialize up thru 3
int testIdx;
int sqrt;
bool isComposite;
while (true) {//keep going until a new prime is found
testIdx = 1;//all test values are odd, so skip testing the first known prime (two)
sqrt = (int)Math.Sqrt(current);//round down, and avoid casting due to the comparison type of the while loop condition
isComposite = false;
while (this._known[testIdx] <= sqrt) {
if (current % this._known[testIdx++] == 0L) {
isComposite = true;
break;
}
}
if (isComposite) {
current += 2;
} else {
return current;//and end
}
}
}
public void Reset() {
this._currentIdx = -1;
}
public void Dispose() {
this._known = null;
}
}
Int64 version
class PrimeEnumeratorInt64 : IEnumerator<long> {
public long Current { get { return this._known[this._currentIdx]; } }
object IEnumerator.Current { get { return this.Current; } }
private int _currentIdx = -1;
private List<long> _known = new List<long>() { 2, 3 };
public bool MoveNext() {
if (++this._currentIdx >= this._known.Count)
this._known.Add(this.ComputeNext(this._known[^1]));
return true;//no end
}
private long ComputeNext(long lastKnown) {
long current = lastKnown + 2;//start at 2 past last known value, which is guaranteed odd because we initialize up thru 3
int testIdx;
long sqrt;
bool isComposite;
while (true) {//keep going until a new prime is found
testIdx = 1;//all test values are odd, so skip testing the first known prime (two)
sqrt = (long)Math.Sqrt(current);//round down, and avoid casting due to the comparison type of the while loop condition
isComposite = false;
while (this._known[testIdx] <= sqrt) {
if (current % this._known[testIdx++] == 0L) {
isComposite = true;
break;
}
}
if (isComposite)
current += 2;
else
return current;//and end
}
}
public void Reset() {
this._currentIdx = -1;
}
public void Dispose() {
this._known = null;
}
}
Int64 for both values and indices
Note the necessary casting of indices accessing the _known list.
class PrimeEnumeratorInt64Indices : IEnumerator<long> {
public long Current { get { return this._known[(int)this._currentIdx]; } }
object IEnumerator.Current { get { return this.Current; } }
private long _currentIdx = -1;
private List<long> _known = new List<long>() { 2, 3 };
public bool MoveNext() {
if (++this._currentIdx >= this._known.Count)
this._known.Add(this.ComputeNext(this._known[^1]));
return true;//no end
}
private long ComputeNext(long lastKnown) {
long current = lastKnown + 2;//start at 2 past last known value, which is guaranteed odd because we initialize up thru 3
long testIdx;
long sqrt;
bool isComposite;
while (true) {//keep going until a new prime is found
testIdx = 1;//all test values are odd, so skip testing the first known prime (two)
sqrt = (long)Math.Sqrt(current);//round down, and avoid casting due to the comparison type of the while loop condition
isComposite = false;
while (this._known[(int)testIdx] <= sqrt) {
if (current % this._known[(int)testIdx++] == 0L) {
isComposite = true;
break;
}
}
if (isComposite)
current += 2;
else
return current;//and end
}
}
public void Reset() {
this._currentIdx = -1;
}
public void Dispose() {
this._known = null;
}
}
In total, my test program is using 43 MB of memory after 20 seconds for Int32 and 75 MB for Int64, due to the List<...> _known collection, which is the biggest difference I'm observing.
I profiled versions using unsigned types as well. Here are my results (Release mode):
Testing PrimeEnumeratorInt32
Total time is 20000 ms, number of computed primes is 3842603
Testing PrimeEnumeratorUInt32
Total time is 20001 ms, number of computed primes is 3841554
Testing PrimeEnumeratorInt64
Total time is 20001 ms, number of computed primes is 3839953
Testing PrimeEnumeratorUInt64
Total time is 20002 ms, number of computed primes is 3837199
All 4 versions have essentially identical performance. I guess the lesson here is never to assume how performance will be affected: if you are targeting an x64 architecture, you can probably use Int64 freely, since it matches my Int32 version in speed, albeit with increased memory usage.
And a validation that my prime calculator is working:
P.S. Release mode had consistent results that were 1.1% faster.
P.P.S. Here are the necessary using statements:
using System;
using System.Collections;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Reflection;
using System.Threading;
Another use case where int16 or int32 may be preferable to int64 is SIMD (Single Instruction, Multiple Data), where you can double/quadruple/octuple your throughput by stuffing more data into each instruction. With 256-bit registers (e.g. AVX2), you can evaluate 16, 8, or 4 values simultaneously for 16-, 32-, and 64-bit integers respectively. It is very useful for vector calculations.
The data structure on MSDN.
A couple use cases: improving performance with simd intrinsics in three use cases. I particularly found SIMD to be useful for higher-dimensional binary tree child index lookup operations (i.e. signal vectors).
You can also use SIMD to accelerate other array operations and further tighten your loops.
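As a sketch of the idea (using System.Numerics.Vector<T>, whose lane count depends on the hardware; with AVX2, Vector<int>.Count is 8), a vectorized sum over an int array might look like this:

```csharp
using System;
using System.Numerics;

class SimdSum
{
    // Sum an int[] using Vector<int>: each loop iteration adds
    // Vector<int>.Count elements at once, then a scalar loop handles
    // the leftover tail.
    static int Sum(int[] data)
    {
        var acc = Vector<int>.Zero;
        int lanes = Vector<int>.Count;
        int i = 0;
        for (; i <= data.Length - lanes; i += lanes)
            acc += new Vector<int>(data, i);       // load + add, lanes at a time
        int total = Vector.Dot(acc, Vector<int>.One); // horizontal add of the lanes
        for (; i < data.Length; i++)                  // scalar tail
            total += data[i];
        return total;
    }

    static void Main()
    {
        int[] xs = new int[1000];
        for (int i = 0; i < xs.Length; i++) xs[i] = i + 1;
        Console.WriteLine(Sum(xs)); // 500500
    }
}
```

Halving the element width (int instead of long) doubles the lane count, which is exactly the throughput argument made above.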