Hash table faster in C# than C++?

Hash table faster in C# than C++? - c#

Here's a curiosity I've been investigating. The .NET Dictionary class performs ridiculously fast compared to the STL unordered_map in a test I keep running, and I can't figure out why.
(0.5 seconds vs. 4 seconds on my machine)
(.NET 3.5 SP1 vs. Visual Studio 2008 Express SP1's STL)
On the other hand, if I implement my own hash table in C# and C++, the C++ version is about twice as fast as the C# one, which is fine because it reinforces my common sense that native machine code is sometimes faster. (See. I said "sometimes".) Me being the same person in both languages, I wonder what tricks was the C# coder from Microsoft able to play that the C++ coder from Microsoft wasn't? I'm having trouble imagining how a compiler could play such tricks on its own, going through the trouble of optimizing what should look to it to be arbitrary function calls.
It's a simple test, storing and retrieving integers.
C#:
const int total = (1 << 20);
int sum = 0;
Dictionary<int, int> dict = new Dictionary<int, int>();
for(int i = 0; i < total; i++)
{
dict.Add(i, i * 7);
}
for(int j = 0; j < (1 << 3); j++)
{
int i = total;
while(i > 0)
{
i--;
sum += dict[i];
}
}
Console.WriteLine(sum);
C++:
const int total = (1 << 20);
int sum = 0;
std::tr1::unordered_map<int, int> dict;
for(int i = 0; i < total; i++)
{
dict.insert(pair<int, int>(i, i * 7));
}
for(int j = 0; j < (1 << 3); j++)
{
int i = total;
while(i > 0)
{
i--;
std::tr1::unordered_map<int, int>::const_iterator found =
dict.find(i);
sum += found->second;
}
}
cout << sum << endl;

the two versions are not equivalent , your are constructing an iterator in each pass of your C++ while loop. that takes CPU time and throws your results.

You are measuring the cost of explicit memory management. More statistics are available here. This is relevant too. And Chris Sells' attempt to add deterministic finalization to the CLR is notable.

There will be some differences at the code level: the fact that the unordered map takes a pair forces the construction of such object, when C# might be faster with passing two parameters in Add.
Another point is the implementation of the hashtables themselves: the implementation of the hashing function, or the way to deal with collisions, will cause different performance patterns.
Throw in alignment and caching, JIT-friendliness of some algorithms, and comparing two different implementations in two different languages becomes very difficult, and the only thing you can compare is the particular task at hand. Try with fewer or more elements, or try to access elements randomly instead of in sequence, and you might see very different results.

Related

.NET BitArray cardinality [duplicate]

I am implementing a library where I am extensively using the .Net BitArray class and need an equivalent to the Java BitSet.Cardinality() method, i.e. a method which returns the number of bits set. I was thinking of implementing it as an extension method for the BitArray class. The trivial implementation is to iterate and count the bits set (like below), but I wanted a faster implementation as I would be performing thousands of set operations and counting the answer. Is there a faster way than the example below?
count = 0;
for (int i = 0; i < mybitarray.Length; i++)
{
if (mybitarray [i])
count++;
}

This is my solution based on the "best bit counting method" from http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
public static Int32 GetCardinality(BitArray bitArray)
{
Int32[] ints = new Int32[(bitArray.Count >> 5) + 1];
bitArray.CopyTo(ints, 0);
Int32 count = 0;
// fix for not truncated bits in last integer that may have been set to true with SetAll()
ints[ints.Length - 1] &= ~(-1 << (bitArray.Count % 32));
for (Int32 i = 0; i < ints.Length; i++)
{
Int32 c = ints[i];
// magic (http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel)
unchecked
{
c = c - ((c >> 1) & 0x55555555);
c = (c & 0x33333333) + ((c >> 2) & 0x33333333);
c = ((c + (c >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;
}
count += c;
}
return count;
}
According to my tests, this is around 60 times faster than the simple foreach loop and still 30 times faster than the Kernighan approach with around 50% bits set to true in a BitArray with 1000 bits. I also have a VB version of this if needed.

you can accomplish this pretty easily with Linq
BitArray ba = new BitArray(new[] { true, false, true, false, false });
var numOnes = (from bool m in ba
where m
select m).Count();

BitArray myBitArray = new BitArray(...
int
bits = myBitArray.Count,
size = ((bits - 1) >> 3) + 1,
counter = 0,
x,
c;
byte[] buffer = new byte[size];
myBitArray.CopyTo(buffer, 0);
for (x = 0; x < size; x++)
for (c = 0; buffer[x] > 0; buffer[x] >>= 1)
counter += buffer[x] & 1;
Taken from "Counting bits set, Brian Kernighan's way" and adapted for bytes. I'm using it for bit arrays of 1 000 000+ bits and it's superb.
If your bits are not n*8 then you can count the mod byte manually.

I had the same issue, but had more than just the one Cardinality method to convert. So, I opted to port the entire BitSet class. Fortunately it was self-contained.
Here is the Gist of the C# port.
I would appreciate if people would report any bugs that are found - I am not a Java developer, and have limited experience with bit logic, so I might have translated some of it incorrectly.

Faster and simpler version than the accepted answer thanks to the use of System.Numerics.BitOperations.PopCount
C#
Int32[] ints = new Int32[(bitArray.Count >> 5) + 1];
bitArray.CopyTo(ints, 0);
Int32 count = 0;
for (Int32 i = 0; i < ints.Length; i++) {
count += BitOperations.PopCount(ints[i]);
}
Console.WriteLine(count);
F#
let ints = Array.create ((bitArray.Count >>> 5) + 1) 0u
bitArray.CopyTo(ints, 0)
ints
|> Array.sumBy BitOperations.PopCount
|> printfn "%d"
See more details in Is BitOperations.PopCount the best way to compute the BitArray cardinality in .NET?

You could use Linq, but it would be useless and slower:
var sum = mybitarray.OfType<bool>().Count(p => p);

There is no faster way with using BitArray - What it comes down to is you will have to count them - you could use LINQ to do that or do your own loop, but there is no method offered by BitArray and the underlying data structure is an int[] array (as seen with Reflector) - so this will always be O(n), n being the number of bits in the array.
The only way I could think of making it faster is using reflection to get a hold of the underlying m_array field, then you can get around the boundary checks that Get() uses on every call (see below) - but this is kinda dirty, and might only be worth it on very large arrays since reflection is expensive.
public bool Get(int index)
{
if ((index < 0) || (index >= this.Length))
{
throw new ArgumentOutOfRangeException("index", Environment.GetResourceString("ArgumentOutOfRange_Index"));
}
return ((this.m_array[index / 0x20] & (((int) 1) << (index % 0x20))) != 0);
}
If this optimization is really important to you, you should create your own class for bit manipulation, that internally could use BitArray, but keeps track of the number of bits set and offers the appropriate methods (mostly delegate to BitArray but add methods to get number of bits currently set) - then of course this would be O(1).

If you really want to maximize the speed, you could pre-compute a lookup table where given a byte-value you have the cardinality, but BitArray is not the most ideal structure for this, since you'd need to use reflection to pull the underlying storage out of it and operate on the integral types - see this question for a better explanation of that technique.
Another, perhaps more useful technique, is to use something like the Kernighan trick, which is O(m) for an n-bit value of cardinality m.
static readonly ZERO = new BitArray (0);
static readonly NOT_ONE = new BitArray (1).Not ();
public static int GetCardinality (this BitArray bits)
{
int c = 0;
var tmp = new BitArray (myBitArray);
for (c; tmp != ZERO; c++)
tmp = tmp.And (tmp.And (NOT_ONE));
return c;
}
This too is a bit more cumbersome than it would be in say C, because there are no operations defined between integer types and BitArrays, (tmp &= tmp - 1, for example, to clear the least significant set bit, has been translated to tmp &= (tmp & ~0x1).
I have no idea if this ends up being any faster than naively iterating for the case of the BCL BitArray, but algorithmically speaking it should be superior.
EDIT: cited where I discovered the Kernighan trick, with a more in-depth explanation

If you don't mind to copy the code of System.Collections.BitArray to your project and Edit it,you can write as fellow:
(I think it's the fastest. And I've tried use BitVector32[] to implement my BitArray, but it's still so slow.)
public void Set(int index, bool value)
{
if ((index < 0) || (index >= this.m_length))
{
throw new ArgumentOutOfRangeException("index", "Index Out Of Range");
}
SetWithOutAuth(index,value);
}
//When in batch setting values,we need one method that won't auth the index range
private void SetWithOutAuth(int index, bool value)
{
int v = ((int)1) << (index % 0x20);
index = index / 0x20;
bool NotSet = (this.m_array[index] & v) == 0;
if (value && NotSet)
{
CountOfTrue++;//Count the True values
this.m_array[index] |= v;
}
else if (!value && !NotSet)
{
CountOfTrue--;//Count the True values
this.m_array[index] &= ~v;
}
else
return;
this._version++;
}
public int CountOfTrue { get; internal set; }
public void BatchSet(int start, int length, bool value)
{
if (start < 0 || start >= this.m_length || length <= 0)
return;
for (int i = start; i < length && i < this.m_length; i++)
{
SetWithOutAuth(i,value);
}
}

I wrote my version of after not finding one that uses a look-up table:
private int[] _bitCountLookup;
private void InitLookupTable()
{
_bitCountLookup = new int[256];
for (var byteValue = 0; byteValue < 256; byteValue++)
{
var count = 0;
for (var bitIndex = 0; bitIndex < 8; bitIndex++)
{
count += (byteValue >> bitIndex) & 1;
}
_bitCountLookup[byteValue] = count;
}
}
private int CountSetBits(BitArray bitArray)
{
var result = 0;
var numberOfFullBytes = bitArray.Length / 8;
var numberOfTailBits = bitArray.Length % 8;
var tailByte = numberOfTailBits > 0 ? 1 : 0;
var bitArrayInBytes = new byte[numberOfFullBytes + tailByte];
bitArray.CopyTo(bitArrayInBytes, 0);
for (var i = 0; i < numberOfFullBytes; i++)
{
result += _bitCountLookup[bitArrayInBytes[i]];
}
for (var i = (numberOfFullBytes * 8); i < bitArray.Length; i++)
{
if (bitArray[i])
{
result++;
}
}
return result;
}

The problem is naturally O(n), as a result your solution is probably the most efficient.
Since you are trying to count an arbitrary subset of bits you cannot count the bits when they are set (would would provide a speed boost if you are not setting the bits too often).
You could check to see if the processor you are using has a command which will return the number of set bits. For example a processor with SSE4 could use the POPCNT according to this post. This would probably not work for you since .Net does not allow assembly (because it is platform independent). Also, ARM processors probably do not have an equivalent.
Probably the best solution would be a look up table (or switch if you could guarantee the switch will compiled to a single jump to currentLocation + byteValue). This would give you the count for the whole byte. Of course BitArray does not give access to the underlying data type so you would have to make your own BitArray. You would also have to guarantee that all the bits in the byte will always be part of the intersection which does not sound likely.
Another option would be to use an array of booleans instead of a BitArray. This has the advantage not needing to extract the bit from the others in the byte. The disadvantage is the array will take up 8x as much space in memory meaning not only wasted space, but also more data push as you iterate through the array to perform your count.
The difference between a standard array look up and a BitArray look up is as follows:
Array:
offset = index * indexSize
Get memory at location + offset and save to value
BitArray:
index = index/indexSize
offset = index * indexSize
Get memory at location + offset and save to value
position = index%indexSize
Shift value position bits
value = value and 1
With the exception of #2 for Arrays and #3 most of these commands take 1 processor cycle to complete. Some of the commands can be combined into 1 command using x86/x64 processors, though probably not with ARM since it uses a reduced set of instructions.
Which of the two (array or BitArray) perform better will be specific to your platform (processor speed, processor instructions, processor cache sizes, processor cache speed, amount of system memory (Ram), speed of system memory (CAS), speed of connection between processor and RAM) as well as the spread of indexes you want to count (are the intersections most often clustered or are they randomly distributed).
To summarize: you could probably find a way to make it faster, but your solution is the fastest you will get for your data set using a bit per boolean model in .NET.
Edit: make sure you are accessing the indexes you want to count in order. If you access indexes 200, 5, 150, 151, 311, 6 in that order then you will increase the amount of cache misses resulting in more time spent waiting for values to be retrieved from RAM.

Efficient way of counting specific occurrences during a for loop

I have a loop of about one million values. I want to store just the exponent of each value to display them e.g. in a histogram.
At the moment I'm doing it this way:
int histogram[51]; //init all with 0
for(int i = 0; i < 1000000; i++)
{
int exponent = getExponent(getValue(i));
//getExponent(double value) gives the exponent(base 10)
//getValue(int i) gives the value for loop i
if(exponent > 25)
exponent = 25;
if(exponent < -25)
exponent = -25;
histogramm[exponent+25]++;
}
Is there a more efficient and elegant way of doing this?
Perhaps without using an array?

Is there a more efficient and elegant way of doing this?
The only more elegant way would be to use Math.Min and Math.Max but it wouldn't be any more efficient. histogramm[Math.Max(0,Math.Min(50,exponent+25))]++ is fewer characters but no more performant or elegant in my opinion.
Perhaps without using an array?
An array is a raw set of values so is the most straightforward way of storing them.

Assuming that getExponent and getValue are optimized, the only way to optimize this is to use Parrallel.For. I don't think the difference will be significant, but i see it as the only way.
As for the array, it is the best indexed low-level data structure that you can use for storing data.
Use Parallel Library [ using System.Threading.Tasks; ]:
int[] histogram = new int[51]; //init all with 0
Action<int> work = (i) =>
{
int exponent = getExponent(getValue(i));
//getExponent(double value) gives the exponent(base 10)
//getValue(int i) gives the value for loop i
if (exponent > 25)
exponent = 25;
if (exponent < -25)
exponent = -25;
histogram[exponent + 25]++;
};
Parallel.For(0, 1000000, work);

There's not much to optimize with the conditional and array access. However, if the getExponent function is a time consuming operation, you can cache the results. A lot is really going to depend on what typical data looks like. Profile to see if it helps. Also, someone else mentioned using Parallel. That's worth profiling too as it could make a difference if getValue or getExponent are slow enough to overcome the Parallel and locking overhead.
Important note: Using double is probably a bad idea for a dictionary. But I'm confused by the math going on and conversions from double's to int's.
Either way, the idea here is to cache a calculation. So perhaps you can find a better way if this test proves useful.
int[] histogram = Enumerable.Repeat(0, 51).ToArray();
...
Dictionary<double, int> cache = new Dictionary<double, int>(histogram.Length);
...
for (int i = 0; i < 1000000; i++)
{
double value = getValue(i);
int exponent;
if(!cache.TryGetValue(value, out exponent))
{
exponent = getExponent(value);
cache[value] = exponent;
}
if (exponent > 25)
exponent = 25;
else if (exponent < -25)
exponent = -25;
histogram[exponent + 25]++;
}

Try using an public property to store your values for you exponent, so can test and check your results.
As for as elegance goes, I would simplify the if conditions with actual functions or subroutines, and create additional abstraction with clear and concise naming.

2D Array vs Array of Arrays in tight loop performance C#

I had a look and couldn't see anything quite answering my question.
I'm not exactly the best at creating accurate 'real life' tests, so i'm not sure if that's the problem here. Basically I want to create a few simple neural networks to create something to the effect of Gridworld. Performance of these neural networks will be critical and i dont want the hidden layer to be a bottleneck as much as possible.
I would rather use more memory and be faster, so I opted to use arrays instead of lists (due to lists having an extra bounds check over arrays). The arrays aren't always full, but because the if statement (check if the element is null) is the same until the end, it can be predicted and there is no performance drop from that at all.
My question comes from how I store the data for the network to process. I figured due to 2D arrays storing all the data together it would be better cache wise and would run faster. But from my mock up test that an array of arrays performs much better in this scenario.
Some Code:
private void RunArrayOfArrayTest(float[][] testArray, Data[] data)
{
for (int i = 0; i < testArray.Length; i++) {
for (int j = 0; j < testArray[i].Length; j++) {
var inputTotal = data[i].bias;
for (int k = 0; k < data[i].weights.Length; k++) {
inputTotal += testArray[i][k];
}
}
}
}
private void Run2DArrayTest(float[,] testArray, Data[] data, int maxI, int maxJ)
{
for (int i = 0; i < maxI; i++) {
for (int j = 0; j < maxJ; j++) {
var inputTotal = data[i].bias;
for (int k = 0; k < maxJ; k++) {
inputTotal += testArray[i, k];
}
}
}
}
These are the two functions that are timed. Each 'creature' has its own network (The first for loop), each network has hidden nodes (The second for loop) and i need to find the sum of the weights for each input (The third loop). In my test i stripped it so that it's not really what i am doing in my actual code, but the same amount of loops happen (The data variable would have it's own 2D array, but i didn't want to possibly skew the results). From this i was trying to get a feel for which one is faster, and to my surprise the array of arrays was.
Code to start the tests:
// Array of Array test
Stopwatch timer = Stopwatch.StartNew();
RunArrayOfArrayTest(arrayOfArrays, dataArrays);
timer.Stop();
Console.WriteLine("Array of Arrays finished in: " + timer.ElapsedTicks);
// 2D Array test
timer = Stopwatch.StartNew();
Run2DArrayTest(array2D, dataArrays, NumberOfNetworks, NumberOfInputNeurons);
timer.Stop();
Console.WriteLine("2D Array finished in: " + timer.ElapsedTicks);
Just wanted to show how i was testing it. The results from this in release mode give me values like:
Array of Arrays finished in: 8972
2D Array finished in: 16376
Can someone explain to me what i'm doing wrong? Why is an array of arrays faster in this situation by so much? Isn't a 2D array all stored together, meaning it would be more cache friendly?
Note i really do need this to be fast as it needs to sum up hundreds of thousands - millions of numbers per frame, and like i said i don't want this is be a problem. I know this can be multi threaded in the future quite easily because each network is completely separate and even each node is completely separate.
Last question i suppose, would something like this be possible to run on the GPU instead? I figure a GPU would not struggle to have much larger amounts of networks with much larger numbers of input/hidden neurons.

In the CLR, there are two different types of array:
Vectors, which are zero-based, single-dimensional arrays
Arrays, which can have non-zero bases and multiple dimensions
Your "array of arrays" is a "vector of vectors" in CLR terms.
Vectors are significantly faster than arrays, basically. It's possible that arrays could be optimized further in later CLR versions, but I doubt that there'll get the same amount of love as vectors, as they're so relatively rarely used. There's not a lot you can do to make CLR arrays faster. As you say, they'll be more cache friendly, but they have this CLR penalty.
You can improve your array-of-arrays code already, however, by only performing the first indexing operation once per row:
private void RunArrayOfArrayTest(float[][] testArray, Data[] data)
{
for (int i = 0; i < testArray.Length; i++) {
// These don't change in the loop below, so extract them
var row = testArray[i];
var inputTotal = data[i].bias;
var weightLength = data[i].weights.Length;
for (int j = 0; j < row.Length; j++) {
for (int k = 0; k < weightLength; k++) {
inputTotal += row[k];
}
}
}
}
If you want to get the cache friendliness and still use a vector, you could have a single float[] and perform the indexing yourself... but I'd probably start off with the array-of-arrays approach.

Should I prefer variables or multiple indirections on arrays in C# in perf critical code?

in some perf critical program (single threaded), if I have arrays of primitive types, and need to access the same index of those more than once in loops.
Should I use tmp variables, or would just constant indirection on the array be better/faster ?
I could imagine also that it is possible that either is the same / is transparently optimised at compile time.

Lets test this:
int[] arr = new int[]{1, 2, 3, 4, 5, 6, 7};
int t = arr[3];
int a = 0;
var start = DateTime.UtcNow;
for (int i = 0; i < 1000000000; i++)
{
a += t;
}
Console.WriteLine(a);
Console.WriteLine(DateTime.UtcNow-start);
a = 0;
start = DateTime.UtcNow;
for (int i = 0; i < 1000000000; i++)
{
a += arr[3];
}
Console.WriteLine(a);
Console.WriteLine(DateTime.UtcNow - start);
Output:
-294967296
00:00:02.1925000
-294967296
00:00:03.4250000
Yes its slower to access the array repeatedly.

In general access to an array is slower than a temporary because it has 2 additional pieces of overhead
A layer of indirection
A bounds check
However I would not be changing any code I had today based on that knowledge unless a profiler clearly showed a tight loop to be a significant perf problem in my application. And furthermore showed that changing to a cached local produced a significant perf benefit.

Array access always involves indirection, so if there are frequent accesses the variable will likely be faster.
That said, I find it incredibly unlikely that you will be able to measure the difference. This is an example of micro-optimization.

C# Prime Generator, Maxxing out Bit Array

(C#, prime generator)
Heres some code a friend and I were poking around on:
public List<int> GetListToTop(int top)
{
top++;
List<int> result = new List<int>();
BitArray primes = new BitArray(top / 2);
int root = (int)Math.Sqrt(top);
for (int i = 3, count = 3; i <= root; i += 2, count++)
{
int n = i - count;
if (!primes[n])
for (int j = n + i; j < top / 2; j += i)
{
primes[j] = true;
}
}
if (top >= 2)
result.Add(2);
for (int i = 0, count = 3; i < primes.Length; i++, count++)
{
if (!primes[i])
{
int n = i + count;
result.Add(n);
}
}
return result;
}
On my dorky AMD x64 1800+ (dual core), for all primes below 1 billion in 34546.875ms. Problem seems to be storing more in the bit array. Trying to crank more than ~2billion is more than the bitarray wants to store. Any ideas on how to get around that?

I would "swap" parts of the array out to disk. By that, I mean, divide your bit array into half-billion bit chunks and store them on disk.
The have only a few chunks in memory at any one time. With C# (or any other OO language), it should be easy to encapsulate the huge array inside this chunking class.
You'll pay for it with a slower generation time but I don't see any way around that until we get larger address spaces and 128-bit compilers.

Or as an alternative approach to the one suggested by Pax, make use of the new Memory-Mapped File classes in .NET 4.0 and let the OS decide which chunks need to be in memory at any given time.
Note however that you'll want to try and optimise the algorithm to increase locality so that you do not needlessly end up swapping pages in and out of memory (trickier than this one sentence makes it sound).

Use multiple BitArrays to increase the maximum size. If a number is to great bit-shift it and store the result in a bit-array for storing bits 33-64.
BitArray second = new BitArray(int.MaxValue);
long num = 23958923589;
if (num > int.MaxValue)
{
int shifted = (int)num >> 32;
second[shifted] = true;
}
long request = 0902305023;
if (request > int.MaxValue)
{
int shifted = (int)request >> 32;
return second[shifted];
}
else return first[request];
Of course it would be nice if BitArray would support size up to System.Numerics.BigInteger.
Swapping to disk will make your code really slow.
I have a 64-bit OS, and my BitArray is also limited to 32-bits.
PS: your prime number calculations looks wierd, mine looks like this:
for (int i = 2; i <= number; i++)
if (primes[i])
for (int scalar = i + i; scalar <= number; scalar += i)
{
primes[scalar] = false;
yield return scalar;
}

The Sieve algorithm would be better performing. I could determine all the 32-bit primes (total about 105 million) for the int range in less than 4 minutes with that. Of course returning the list of primes is a different thing as the memory requirement there would be a little over 400 MB (1 int = 4 bytes). Using a for loop the numbers were printed to a file and then imported to a DB for more fun :) However for the 64 bit primes the program would need several modifications and perhaps require distributed execution over multiple nodes. Also refer to the following links
http://www.troubleshooters.com/codecorn/primenumbers/primenumbers.htm
http://en.wikipedia.org/wiki/Prime-counting_function

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.