Reverse Engineering String.GetHashCode - c#

String.GetHashCode's behavior depends on the program's architecture, so it returns one value on x86 and another on x64. I have a test application which must run as x86, and it must predict the hash code output of an application which must run as x64.
Below is the disassembly of the String.GetHashCode implementation from mscorwks.
public override unsafe int GetHashCode()
{
    fixed (char* text1 = ((char*) this))
    {
        char* chPtr1 = text1;
        int num1 = 0x15051505;
        int num2 = num1;
        int* numPtr1 = (int*) chPtr1;
        for (int num3 = this.Length; num3 > 0; num3 -= 4)
        {
            num1 = (((num1 << 5) + num1) + (num1 >> 0x1b)) ^ numPtr1[0];
            if (num3 <= 2)
            {
                break;
            }
            num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr1[1];
            numPtr1 += 2;
        }
        return (num1 + (num2 * 0x5d588b65));
    }
}
Can anybody port this function to a safe implementation?

Hash codes are not intended to be repeatable across platforms, or even multiple runs of the same program on the same system. You are going the wrong way. If you don't change course, your path will be difficult and one day it may end in tears.
What is the real problem you want to solve? Would it be possible to write your own hash function, either as an extension method or as the GetHashCode implementation of a wrapper class and use that one instead?
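For illustration, the wrapper-class idea might look like this sketch (shown in Python rather than C# so it is self-contained; the FNV-1a constants are the standard published ones, everything else here is hypothetical):

```python
class StableKey:
    """Wrapper whose hash comes from our own platform-independent function
    (32-bit FNV-1a here) instead of the runtime's default string hash."""

    def __init__(self, value: str):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, StableKey) and self.value == other.value

    def __hash__(self):
        h = 2166136261  # FNV-1a 32-bit offset basis
        for ch in self.value:
            # multiply by the FNV prime, kept to 32 bits
            h = ((h ^ ord(ch)) * 16777619) & 0xFFFFFFFF
        return h

# Published FNV-1a test vector for "a":
assert StableKey("a").__hash__() == 0xE40C292C
# Equal wrappers find the same slot in a hash table:
table = {StableKey("a"): 1}
assert table[StableKey("a")] == 1
```

The point is that the stable hash lives in a type you own, so nothing depends on the runtime's unstable GetHashCode.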

First off, Jon is correct; this is a fool's errand. The internal debug builds of the framework that we use to "eat our own dogfood" change the hash algorithm every day precisely to prevent people from building systems -- even test systems -- that rely on unreliable implementation details that are documented as subject to change at any time.
Rather than enshrining an emulation of a system that is documented as being not suitable for emulation, my recommendation would be to take a step back and ask yourself why you're trying to do something this dangerous. Is it really a requirement?
Second, StackOverflow is a technical question and answer site, not a "do my job for me for free" site. If you are hell bent on doing this dangerous thing and you need someone who can rewrite unsafe code into equivalent safe code then I recommend that you hire someone who can do that for you.

While all of the warnings given here are valid, they don't answer the question. I had a situation in which GetHashCode() was unfortunately already being used for a persisted value in production, and I had no choice but to re-implement using the default .NET 2.0 32-bit x86 (little-endian) algorithm. I re-coded without unsafe as shown below, and this appears to be working. Hope this helps someone.
// The GetStringHashCode() extension method is equivalent to the Microsoft .NET Framework 2.0
// String.GetHashCode() method executed on 32 bit systems.
public static int GetStringHashCode(this string value)
{
    int hash1 = (5381 << 16) + 5381;
    int hash2 = hash1;
    int len = value.Length;
    int intval;
    int c0, c1;
    int i = 0;
    while (len > 0)
    {
        c0 = (int)value[i];
        c1 = len > 1 ? (int)value[i + 1] : 0;
        intval = c0 | (c1 << 16);
        hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ intval;
        if (len <= 2)
        {
            break;
        }
        i += 2;
        c0 = (int)value[i];
        c1 = len > 3 ? (int)value[i + 1] : 0;
        intval = c0 | (c1 << 16);
        hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ intval;
        len -= 4;
        i += 2;
    }
    return hash1 + (hash2 * 1566083941);
}
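As a cross-check of the pair-folding logic, here is a sketch of the same 32-bit algorithm in Python (illustrative only; the `& 0xFFFFFFFF` masking and the signed-conversion helper emulate C#'s unchecked int overflow and arithmetic `>>`):

```python
def _to_int32(x):
    """Interpret the low 32 bits of x as a signed C# int."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

def net20_x86_string_hash(s):
    """Model of the non-randomized 32-bit .NET string hash shown above."""
    def unit(i):  # UTF-16 code unit, with the implicit '\0' terminator past the end
        return ord(s[i]) if i < len(s) else 0

    h1 = h2 = (5381 << 16) + 5381   # 0x15051505
    n, i = len(s), 0
    while n > 0:
        w = unit(i) | (unit(i + 1) << 16)   # two chars packed into one int
        h1 = (((h1 << 5) + h1 + (_to_int32(h1) >> 27)) ^ w) & 0xFFFFFFFF
        if n <= 2:
            break
        w = unit(i + 2) | (unit(i + 3) << 16)
        h2 = (((h2 << 5) + h2 + (_to_int32(h2) >> 27)) ^ w) & 0xFFFFFFFF
        i += 4
        n -= 4
    return _to_int32(h1 + h2 * 1566083941)

# The empty string comes out as 0x2d2816fe, which matches the x86
# empty-string constant quoted in a later answer on this page:
assert net20_x86_string_hash("") == 0x2D2816FE
```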

The following exactly reproduces the default String hash codes on .NET 4.7 (and probably earlier). This is the hash code given by:
Default on a String instance: "abc".GetHashCode()
StringComparer.Ordinal.GetHashCode("abc")
Various String methods that take StringComparison.Ordinal enumeration.
System.Globalization.CompareInfo.GetStringComparer(CompareOptions.Ordinal)
Testing on release builds with full JIT optimization, these versions modestly outperform the built-in .NET code, and have also been heavily unit-tested for exact equivalence with .NET behavior. Notice there are separate versions for x86 versus x64. Your program should generally include both; below the respective code listings is a calling harness which selects the appropriate version at runtime.
x86 - (.NET running in 32-bit mode)
static unsafe int GetHashCode_x86_NET(int* p, int c)
{
    int h1, h2 = h1 = 0x15051505;
    while (c > 2)
    {
        h1 = ((h1 << 5) + h1 + (h1 >> 27)) ^ *p++;
        h2 = ((h2 << 5) + h2 + (h2 >> 27)) ^ *p++;
        c -= 4;
    }
    if (c > 0)
        h1 = ((h1 << 5) + h1 + (h1 >> 27)) ^ *p++;
    return h1 + (h2 * 0x5d588b65);
}
x64 - (.NET running in 64-bit mode)
static unsafe int GetHashCode_x64_NET(char* p)
{
    int h1, h2 = h1 = 5381;
    while (*p != 0)
    {
        h1 = ((h1 << 5) + h1) ^ *p++;
        if (*p == 0)
            break;
        h2 = ((h2 << 5) + h2) ^ *p++;
    }
    return h1 + (h2 * 0x5d588b65);
}
Calling harness / extension method for either platform (x86/x64):
readonly static int _hash_sz = IntPtr.Size == 4 ? 0x2d2816fe : 0x162a16fe;

public static unsafe int GetStringHashCode(this String s)
{
    // Note: the x64 string hash ignores the remainder after an embedded '\0' char (unlike x86)
    if (s.Length == 0 || (IntPtr.Size == 8 && s[0] == '\0'))
        return _hash_sz;
    fixed (char* p = s)
        return IntPtr.Size == 4 ?
            GetHashCode_x86_NET((int*)p, s.Length) :
            GetHashCode_x64_NET(p);
}
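The 64-bit variant can be modeled the same way (a Python sketch for illustration; note how it walks characters until a NUL rather than consuming a length, as described above):

```python
def net_x64_string_hash(s):
    """Model of the 64-bit framework loop: 5381-seeded, and it stops at the
    first NUL (the real code reads the string's implicit terminator)."""
    units = [ord(c) for c in s] + [0, 0]  # implicit terminator, padded for the look-ahead
    h1 = h2 = 5381
    i = 0
    while units[i] != 0:
        h1 = (((h1 << 5) + h1) ^ units[i]) & 0xFFFFFFFF
        if units[i + 1] == 0:             # mirrors `if (*p == 0) break;`
            break
        h2 = (((h2 << 5) + h2) ^ units[i + 1]) & 0xFFFFFFFF
        i += 2
    h = (h1 + h2 * 1566083941) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

# Matches 0x162a16fe, the empty-string constant used for x64 in the harness above:
assert net_x64_string_hash("") == 0x162A16FE
```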

Related

Hash tables with long (100+ character) key names

I am working on a data structure for a utility of mine, and I am TEMPTED to do a hash table in which the key is a very long string, specifically a file path. There are a number of reasons why this makes sense from a data standpoint, mainly the fact that the path is guaranteed unique. That said, every single example I have seen of a hash table has very short keys and potentially long values. So, I am wondering if that is just a function of easy examples? Or is there a performance or technical reason not to use long keys?
I will be using $variable = New-Object Collections.Specialized.OrderedDictionary for version agnostic ordering, if that makes any difference.
I think you are fine to have keys that are long strings.
Under the hood, the key lookup in OrderedDictionary ends up doing this:
    if (objectsTable.Contains(key)) {
where objectsTable is of type Hashtable.
If you follow the chain of getting the hash in the Hashtable class, you'll get to this:
https://referencesource.microsoft.com/#mscorlib/system/collections/hashtable.cs,4f6addb8551463cf
// Internal method to get the hash code for an Object. This will call
// GetHashCode() on each object if you haven't provided an IHashCodeProvider
// instance. Otherwise, it calls hcp.GetHashCode(obj).
protected virtual int GetHash(Object key)
{
if (_keycomparer != null)
return _keycomparer.GetHashCode(key);
return key.GetHashCode();
}
So, the question becomes, what's the cost of getting a HashCode on a string?
https://referencesource.microsoft.com/#mscorlib/system/string.cs
The function GetHashCode, you'll see, is a loop, but it's only O(n), since it grows only with the string length. You'll notice the computation for a hash is a bit different on 32-bit machines than on others, but O(n) is the worst case for either variant of the algorithm.
There are other parts of the function, but I think this is the key part, as it's the part that can grow (src is the char*, meaning a pointer to the characters in the string).
#if WIN32
    // 32 bit machines.
    int* pint = (int *)src;
    int len = this.Length;
    while (len > 2)
    {
        hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
        hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ pint[1];
        pint += 2;
        len -= 4;
    }
    if (len > 0)
    {
        hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
    }
#else
    int c;
    char *s = src;
    while ((c = s[0]) != 0) {
        hash1 = ((hash1 << 5) + hash1) ^ c;
        c = s[1];
        if (c == 0)
            break;
        hash2 = ((hash2 << 5) + hash2) ^ c;
        s += 2;
    }
#endif

unsigned char * - Equivalent C#

I am porting a library from C++ to C# but have come across a scenario I am unsure of how to resolve, which involves casting an unsigned char * to an unsigned int *.
C++
unsigned int c4;
unsigned int c2;
unsigned int h4;

int pos(unsigned char *p)
{
    c4 = *(reinterpret_cast<unsigned int *>(p - 4));
    c2 = *(reinterpret_cast<unsigned short *>(p - 2));
    h4 = ((c4 >> 11) ^ c4) & (N4 - 1);
    if ((tab4[h4][0] != 0) && (tab4[h4][1] == c4))
    {
        c = 256;
        return (tab4[h4][0]);
    }
    c = 257;
    return (tab2[c2]);
}
C# (It's wrong):
public uint pos(byte p)
{
    c4 = (uint)(p - 4);
    c2 = (ushort)(p - 2);
    h4 = ((c4 >> 11) ^ c4) & (1 << 20 - 1);
    if ((tab4[h4, 0] != 0) && (tab4[h4, 1] == c4)) {
        c = 256;
        return (tab4[h4, 0]);
    }
    c = 257;
    return (tab2[c2]);
}
I believe in the C# example you could change byte p to byte[], but I am clueless when it comes to converting byte[] to a single uint value.
Additionally, could anyone please explain why you would cast an unsigned char * to an unsigned int *? What purpose does it have?
Any help/push to direction would be very useful.
Translation of the problematic lines would be:
int pos(byte[] a, int offset)
{
    // Read the four bytes immediately preceding offset
    c4 = BitConverter.ToUInt32(a, offset - 4);
    // Read the two bytes immediately preceding offset
    c2 = BitConverter.ToUInt16(a, offset - 2);
and change the call from x = pos(&buf[i]) (which even in C++ is the same as x = pos(buf + i)) to
x = pos(buf, i);
An important note is that the existing C++ code is wrong as it violates the strict aliasing rule.
Implementing analogous functionality in C# does not need to involve code that replicates the C version on a statement-by-statement basis, especially when the original is using pointers.
When we assume an architecture where int is 32 bit, you could simplify the C# version like this:
uint[] tab2;
uint[,] tab4;
ushort c;

public uint pos(uint c4)
{
    var h4 = ((c4 >> 11) ^ c4) & ((1 << 20) - 1); // N4 - 1, where N4 == 1 << 20; note the extra parentheses, since 1 << 20 - 1 would parse as 1 << 19
    if ((tab4[h4, 0] != 0) && (tab4[h4, 1] == c4))
    {
        c = 256;
        return (tab4[h4, 0]);
    }
    else
    {
        c = 257;
        var c2 = (c4 >> 16) & 0xffff; // HIWORD
        return (tab2[c2]);
    }
}
This simplification is possible because c4 and c2 overlap: c2 is the high word of c4, and is needed only when the lookup in tab4 does not match.
(The identifier N4 was present in original code but replaced in your own translation by the expression 1<<20).
The calling code would have to loop over an array of int, which according to comments is possible. While the original C++ code starts at offset 4 and looks back, the C# equivalent would start at offset 0, which seems a more natural thing to do.
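The two reads, and the HIWORD overlap described above, can be sketched in Python, with struct doing the little-endian reinterpretation that BitConverter does in C# (the buffer contents are arbitrary example bytes):

```python
import struct

buf = bytes([0x11, 0x22, 0x33, 0x44, 0x55, 0x66])
offset = 6

# Equivalent of BitConverter.ToUInt32(a, offset - 4): four bytes, little-endian
c4 = struct.unpack_from("<I", buf, offset - 4)[0]
# Equivalent of BitConverter.ToUInt16(a, offset - 2): two bytes, little-endian
c2 = struct.unpack_from("<H", buf, offset - 2)[0]

assert c4 == 0x66554433            # bytes 33 44 55 66 read as a little-endian uint
assert c2 == 0x6655                # bytes 55 66 read as a little-endian ushort
assert c2 == (c4 >> 16) & 0xFFFF   # c2 is exactly the high word of c4
```

This is why the simplified C# version can derive c2 from c4 instead of reading memory twice.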
In the C++ code you are passing a pointer to char, but C# does not normally work with memory that way; you need an array instead of a pointer.
But you can use unsafe keyword to work directly with memory.
https://msdn.microsoft.com/en-us/library/chfa2zb8.aspx

Does String.GetHashCode consider the full string or only part of it?

I'm just curious because I guess it will have an impact on performance. Does it consider the full string? If yes, it will be slow on long strings. If it only considers part of the string, it will have bad dispersion (e.g. if it only considers the beginning of the string, it will perform badly when a HashSet contains mostly strings with the same prefix).
Be sure to obtain the Reference Source source code when you have questions like this; there's a lot more to it than what you can see from a decompiler. Pick the one that matches your preferred .NET target; the method has changed a great deal between versions. I'll just reproduce the .NET 4.5 version of it here, retrieved from Source.NET 4.5\4.6.0.0\net\clr\src\BCL\System\String.cs\604718\String.cs
public override int GetHashCode() {
#if FEATURE_RANDOMIZED_STRING_HASHING
    if (HashHelpers.s_UseRandomizedStringHashing)
    {
        return InternalMarvin32HashString(this, this.Length, 0);
    }
#endif // FEATURE_RANDOMIZED_STRING_HASHING
    unsafe {
        fixed (char *src = this) {
            Contract.Assert(src[this.Length] == '\0', "src[this.Length] == '\\0'");
            Contract.Assert(((int)src) % 4 == 0, "Managed string should start at 4 bytes boundary");
#if WIN32
            int hash1 = (5381<<16) + 5381;
#else
            int hash1 = 5381;
#endif
            int hash2 = hash1;
#if WIN32
            // 32 bit machines.
            int* pint = (int *)src;
            int len = this.Length;
            while (len > 2)
            {
                hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
                hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ pint[1];
                pint += 2;
                len -= 4;
            }
            if (len > 0)
            {
                hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
            }
#else
            int c;
            char *s = src;
            while ((c = s[0]) != 0) {
                hash1 = ((hash1 << 5) + hash1) ^ c;
                c = s[1];
                if (c == 0)
                    break;
                hash2 = ((hash2 << 5) + hash2) ^ c;
                s += 2;
            }
#endif
#if DEBUG
            // We want to ensure we can change our hash function daily.
            // This is perfectly fine as long as you don't persist the
            // value from GetHashCode to disk or count on String A
            // hashing before string B. Those are bugs in your code.
            hash1 ^= ThisAssembly.DailyBuildNumber;
#endif
            return hash1 + (hash2 * 1566083941);
        }
    }
}
This is possibly more than you bargained for, I'll annotate the code a bit:
The #if conditional compilation directives adapt this code to different .NET targets. The FEATURE_XX identifiers are defined elsewhere and turn features off wholesale throughout the .NET source code. WIN32 is defined when the target is the 32-bit version of the framework; the 64-bit version of mscorlib.dll is built separately and stored in a different subdirectory of the GAC.
The s_UseRandomizedStringHashing variable enables a secure version of the hashing algorithm, designed to keep programmers who do something unwise (like using GetHashCode() to generate hashes for passwords or encryption) out of trouble. It is enabled by an entry in the app.exe.config file
The fixed statement keeps indexing the string cheap, avoids the bounds checking done by the regular indexer
The first Assert ensures that the string is zero-terminated as it should be, required to allow the optimization in the loop
The second Assert ensures that the string is aligned to an address that's a multiple of 4 as it should be, required to keep the loop performant
The loop is unrolled by hand, consuming 4 characters per iteration in the 32-bit version. The cast to int* is a trick to store 2 characters (2 x 16 bits) in an int (32 bits). The extra statement after the loop deals with a string whose length is not a multiple of 4. Note that the zero terminator may or may not be included in the hash; it won't be if the length is even. It looks at all the characters in the string, answering your question
The 64-bit version of the loop is done differently, hand-unrolled by 2. Note that it terminates early on an embedded zero, so it doesn't look at all the characters in that case; strings like that are otherwise very uncommon. That's pretty odd, I can only guess that this has something to do with strings potentially being very large. But I can't think of a practical example
The debug code at the end ensures that no code in the framework ever takes a dependency on the hash code being reproducible between runs.
The hash algorithm is pretty standard. The value 1566083941 is a magic number, a prime that is common in a Mersenne twister.
Examining the source code (courtesy of ILSpy), we can see that it does iterate over the length of the string.
// string
[ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail), SecuritySafeCritical]
public unsafe override int GetHashCode()
{
    IntPtr arg_0F_0;
    IntPtr expr_06 = arg_0F_0 = this;
    if (expr_06 != 0)
    {
        arg_0F_0 = (IntPtr)((int)expr_06 + RuntimeHelpers.OffsetToStringData);
    }
    char* ptr = arg_0F_0;
    int num = 352654597;
    int num2 = num;
    int* ptr2 = (int*)ptr;
    for (int i = this.Length; i > 0; i -= 4)
    {
        num = ((num << 5) + num + (num >> 27) ^ *ptr2);
        if (i <= 2)
        {
            break;
        }
        num2 = ((num2 << 5) + num2 + (num2 >> 27) ^ ptr2[(IntPtr)4 / 4]);
        ptr2 += (IntPtr)8 / 4;
    }
    return num + num2 * 1566083941;
}

string.GetHashCode() returns different values in debug vs release, how do I avoid this?

To my surprise, the following method produces a different result in debug vs release:
int result = "test".GetHashCode();
Is there any way to avoid this?
I need a reliable way to hash a string and I need the value to be consistent in debug and release mode. I would like to avoid writing my own hashing function if possible.
Why does this happen?
FYI, reflector gives me:
[ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail), SecuritySafeCritical]
public override unsafe int GetHashCode()
{
    fixed (char* str = ((char*) this))
    {
        char* chPtr = str;
        int num = 0x15051505;
        int num2 = num;
        int* numPtr = (int*) chPtr;
        for (int i = this.Length; i > 0; i -= 4)
        {
            num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
            if (i <= 2)
            {
                break;
            }
            num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
            numPtr += 2;
        }
        return (num + (num2 * 0x5d588b65));
    }
}
GetHashCode() is not what you should be using to hash a string, almost 100% of the time. Without knowing what you're doing, I recommend that you use an actual hash algorithm, like SHA-1:
using (System.Security.Cryptography.SHA1Managed hp = new System.Security.Cryptography.SHA1Managed()) {
    // Use hp.ComputeHash(System.Text.Encoding.UTF8.GetBytes(theString))
    // (or another encoding such as ASCII, Unicode, or UTF32) to compute the hash.
}
Update: For something a little bit faster, there's also SHA1Cng, which is significantly faster than SHA1Managed.
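The same approach can be sketched in Python with hashlib (illustrative; truncating a cryptographic digest to 4 bytes is one common way to derive a stable int-sized hash, at the cost of cryptographic-strength collision resistance only within those 32 bits):

```python
import hashlib
import struct

def stable_hash(text: str) -> int:
    """Platform- and run-independent 32-bit hash derived from SHA-1."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    # take the first four digest bytes as a signed little-endian int
    return struct.unpack("<i", digest[:4])[0]

# Unlike GetHashCode, the result never varies between runs or platforms:
assert stable_hash("test") == stable_hash("test")
assert -2**31 <= stable_hash("test") < 2**31
```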
Here's a better approach that is much faster than SHA, and you can replace the modified GetHashCode with it: C# fast hash murmur2
There are several implementations with different levels of "unmanaged" code, so if you need fully managed it's there and if you can use unsafe it's there too.
/// <summary>
/// Default implementation of string.GetHashCode is not consistent on different platforms (x32/x64, which is our case) and frameworks.
/// FNV-1a (Fowler/Noll/Vo) is a fast, consistent, non-cryptographic hash algorithm with good dispersion. (see http://isthe.com/chongo/tech/comp/fnv/#FNV-1a)
/// </summary>
private static int GetFNV1aHashCode(string str)
{
    if (str == null)
        return 0;
    var length = str.Length;
    // original FNV-1a has 32 bit offset_basis = 2166136261, but length gives a bit better
    // dispersion (2%) for our case where all the strings are of equal length,
    // for example: "3EC0FFFF01ECD9C4001B01E2A707"
    int hash = length;
    for (int i = 0; i != length; ++i)
        hash = (hash ^ str[i]) * 16777619;
    return hash;
}
I guess this implementation is slower than the unsafe one posted here, but it's much simpler and safe. It works well when top speed is not needed.
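For reference, the same length-seeded loop in Python (a sketch; the masking emulates C#'s unchecked 32-bit overflow, and the length seed is this snippet's own deviation from the standard 2166136261 offset basis):

```python
def fnv1a_length_seeded(s):
    """Mirror of the GetFNV1aHashCode snippet above (string length as the seed)."""
    if s is None:
        return 0
    h = len(s)
    for ch in s:
        # multiply by the FNV prime with 32-bit wraparound
        h = ((h ^ ord(ch)) * 16777619) & 0xFFFFFFFF
    # reinterpret as a signed 32-bit int, like the C# return value
    return h - 0x100000000 if h >= 0x80000000 else h

assert fnv1a_length_seeded("") == 0          # empty string: seed 0, no rounds
assert fnv1a_length_seeded(None) == 0        # null maps to 0, as in the C# version
```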

Faster String GetHashCode (e.g. using Multicore or GPU)

According to http://www.codeguru.com/forum/showthread.php?t=463663, C#'s GetHashCode function in 3.5 is implemented as:
public override unsafe int GetHashCode()
{
    fixed (char* str = ((char*) this))
    {
        char* chPtr = str;
        int num = 0x15051505;
        int num2 = num;
        int* numPtr = (int*) chPtr;
        for (int i = this.Length; i > 0; i -= 4)
        {
            num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
            if (i <= 2)
            {
                break;
            }
            num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
            numPtr += 2;
        }
        return (num + (num2 * 0x5d588b65));
    }
}
I am curious if anyone can come up with a function which returns the same results, but is faster. It is OK to increase the overall starting and resource overhead of the main application. Requiring a one-time initialization (per application execution, not per call or per string) is OK.
Note that, unlike Microsoft, you can ignore considerations like "doing it this way will make everything else slower and has costs that make this method stupid!", so it is possible that even assuming Microsoft's implementation is perfect, it can be beaten by doing something "stupid."
This is purely an exercise in my own curiosity and will not be used in real code.
Examples of ideas I've thought of:
Using multiple cores (calculating num2 and num independently)
Using the gpu
One way to make a function go faster is to take special cases into account. A function with variable size inputs has special cases based on size.

Going parallel only makes sense when the cost of going parallel is smaller than the gain, and for this kind of computation it is likely that the string would have to be fairly large to overcome the cost of forking a parallel thread. But implementing that isn't hard; basically you need a test for this.Length exceeding an empirically determined threshold, and then forking multiple threads to compute hashes on substrings, with a final step composing the subhashes into a final hash. Implementation left for the reader.

Modern processors also have SIMD instructions, which can process up to 32 (or 64) bytes in a single instruction. This would allow you to process the string in 32-byte (16-character) chunks in one or two SIMD instructions per chunk, and then fold the 64-byte result into a single hashcode at the end. This is likely to be extremely fast for strings of any reasonable size. The implementation of this from C# is harder, because one doesn't expect a virtual machine to provide easy (or portable) access to the SIMD instructions that you need. Implementation also left for the reader. EDIT: Another answer suggests that the Mono system does provide SIMD instruction access.

Having said that, the particular implementation exhibited is pretty stupid. The key observation is that the loop checks the limit twice on every iteration. One can solve that problem by checking the end condition cases in advance, and executing a loop that does the correct number of iterations. One can do better than that by using Duff's device to jump into an unrolled loop of N iterations. This gets rid of the loop-limit checking overhead for N-1 iterations. That modification would be very easy and surely be worth the effort to implement. EDIT: You can also combine the SIMD idea and the loop unrolling idea to enable processing many chunks of 8/16 characters in a few SIMD instructions.

For languages that can't jump into loops, one can do the equivalent of Duff's device by simply peeling off the initial cases. A shot at how to recode the original code using the loop peeling approach is the following:
public override unsafe int GetHashCode()
{
    fixed (char* str = ((char*) this))
    {
        const int N = 3; // a power of two controlling number of loop iterations
        char* chPtr = str;
        int num = 0x15051505;
        int num2 = num;
        int* numPtr = (int*) chPtr;
        int count = this.Length;
        int unrolled_iterations = count >> (N + 1); // could be 0 and that's OK
        for (int i = unrolled_iterations; i > 0; i--)
        {
            // repeat 2**N times
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1]; }
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[2];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[3]; }
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[4];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[5]; }
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[6];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[7]; }
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[8];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[9]; }
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[10];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[11]; }
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[12];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[13]; }
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[14];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[15]; }
            numPtr += 16;
        }
        if ((count & ((1 << N) - 1)) != 0)
        {
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1]; }
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[2];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[3]; }
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[4];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[5]; }
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[6];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[7]; }
            numPtr += 8;
        }
        if ((count & ((1 << (N - 1)) - 1)) != 0)
        {
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1]; }
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[2];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[3]; }
            numPtr += 4;
        }
        if ((count & ((1 << (N - 2)) - 1)) != 0)
        {
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1]; }
            numPtr += 2;
        }
        // repeat N times and finally:
        if ((count & 1) != 0)
        {
            num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
            // numPtr += 1;
        }
        return (num + (num2 * 0x5d588b65));
    }
}
I haven't compiled or tested this code, but the idea is right. It depends on the compiler doing reasonable constant folding and address arithmetic.
I tried to code this to preserve the exact hash value of the original, but IMHO that isn't really a requirement. It would be even simpler and a tiny bit faster if it didn't use the num/num2 stunt, but simply updated num for each character.
Corrected version (by Brian) as a static function:
public static unsafe int GetHashCodeIra(string x)
{
    fixed (char* str = x.ToCharArray())
    {
        const int N = 2; // a power of two controlling number of loop iterations
        char* chPtr = str;
        int num = 0x15051505;
        int num2 = num;
        int* numPtr = (int*)chPtr;
        int count = (x.Length + 1) / 2;
        int unrolled_iterations = count >> (N + 1); // could be 0 and that's OK
        for (int i = unrolled_iterations; i > 0; i--)
        {
            // repeat 2**N times
            {
                num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
                num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
            }
            {
                num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[2];
                num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[3];
            }
            {
                num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[4];
                num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[5];
            }
            {
                num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[6];
                num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[7];
            }
            numPtr += 8;
        }
        if (0 != (count & (1 << N)))
        {
            {
                num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
                num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
            }
            {
                num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[2];
                num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[3];
            }
            numPtr += 4;
        }
        if (0 != (count & (1 << (N - 1))))
        {
            {
                num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
                num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
            }
            numPtr += 2;
        }
        // repeat N times and finally:
        if (1 == (count & 1))
        {
            num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
            // numPtr += 1;
        }
        return (num + (num2 * 0x5d588b65));
    }
}
Threads and the GPU will most certainly introduce overhead greater than any possible performance boost. The approach that could be justified is using SIMD instruction sets, such as SSE. However, it would require testing whether the particular instruction set is available, which may cost. It will also bring a boost on long strings only.
If you want to try it, consider testing Mono's support for SIMD before diving into C or assembly. Read here about development possibilities and gotchas.
You could parallelize this; however, the problem you will run into is that threads, CUDA, etc. have overheads associated with them. Even if you use a thread pool, if your strings are not very large (let's say a typical string is 128-256 characters, probably less), you will probably still end up making each call to this function take longer than it did originally.
Now, if you were dealing with very large strings, then yes, it would improve your time. The simple algorithm is "embarrassingly parallel."
I think all of your suggested approaches are very inefficient compared to the current implementation.
Using the GPU:
The string data needs to be transferred to the GPU and the result back, which takes a lot of time. GPUs are very fast, but only for floating-point calculations, which aren't used here. All operations are on integers, for which x86 CPU power is decent.
Using another CPU core:
This would involve creating a separate thread, locking down memory and synchronizing with the thread requesting the hash code. The incurred overhead simply outweighs the benefits of parallel processing.
If you wanted to calculate hash values of thousands of strings in one go, things might look a little different, but I can't imagine a scenario where this would justify implementing a faster GetHashCode().
Each step in the computation builds on the result of the previous step. If iterations of the loop run out of order, you will get a different result (the value of num from the previous iteration serves as input to the next iteration).
For that reason, any approach (multithreading, massively parallel execution on a GPU) that runs steps in parallel will generally skew the result.
Also, I would be surprised if the previously discussed loop unrolling is not already being done internally by the compiler to the extent that it actually makes a difference in execution time (compilers tend to be smarter than the average programmer these days, and loop unrolling has been around for a really long time as a compiler optimization technique).
Given that strings are immutable, the first thing that I would consider is caching the return result.
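That caching idea can be sketched like so (Python for illustration; functools.lru_cache plays the role of a memoization table keyed by the immutable string, and the hash loop inside is a stand-in for whatever expensive computation is being cached):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def cached_hash(s: str) -> int:
    """Expensive hash computed once per distinct string; safe to cache
    because strings are immutable."""
    h = 0x15051505  # same seed as the loop in the question, for flavor
    for ch in s:
        h = (((h << 5) + h) ^ ord(ch)) & 0xFFFFFFFF
    return h

cached_hash("hello")   # first call computes the value
cached_hash("hello")   # second call is just a dictionary lookup
assert cached_hash.cache_info().hits >= 1
```

In C# the equivalent would be a field or a ConcurrentDictionary memo; the trade-off is memory held by the cache versus recomputation.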
