SQL Server CHECKSUM in C#

I'm using the CHECKSUM function in SQL Server 2008 R2 and I would like to get the same int values in a C# app.
Is there an equivalent method in C# that returns the same values as the SQL CHECKSUM function?
Thanks

On the SQL Server forum, at this page, it's stated:
The built-in CHECKSUM function in SQL Server is built on a series of 4-bit left rotational XOR operations. See this post for more explanation.
I was able to port BINARY_CHECKSUM to C# and it seems to be working... I'll be looking at the plain CHECKSUM later...
private int SQLBinaryChecksum(string text)
{
    long sum = 0;
    byte overflow;
    for (int i = 0; i < text.Length; i++)
    {
        // shift left 4 bits and mix in the next character
        sum = (long)((16 * sum) ^ Convert.ToUInt32(text[i]));
        // fold the bits that overflowed past 32 back into the low bits,
        // completing a 4-bit left rotate of the 32-bit value
        overflow = (byte)(sum / 4294967296);
        sum = sum - overflow * 4294967296;
        sum = sum ^ overflow;
    }
    // reproduce BINARY_CHECKSUM's sign handling for the result
    if (sum > 2147483647)
        sum = sum - 4294967296;
    else if (sum >= 32768 && sum <= 65535)
        sum = sum - 65536;
    else if (sum >= 128 && sum <= 255)
        sum = sum - 256;
    return (int)sum;
}
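A quick way to sanity-check the port (the literal here is just an example; no particular value is claimed):
// compare the output of
Console.WriteLine(SQLBinaryChecksum("hello world"));
// against the result of running this in SQL Server:
// SELECT BINARY_CHECKSUM('hello world')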

The CHECKSUM docs don't disclose how the hash is computed. If you want a hash you can compute in both T-SQL and C#, pick one of the algorithms supported by HASHBYTES.
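Not from the original answer, just a minimal sketch of the HASHBYTES route, assuming SHA2_256 and an nvarchar input (HASHBYTES hashes the UTF-16LE bytes of an nvarchar, so the C# side uses Encoding.Unicode):
using System;
using System.Security.Cryptography;
using System.Text;

class HashBytesDemo
{
    static void Main()
    {
        // Intended to match: SELECT HASHBYTES('SHA2_256', N'hello')
        // nvarchar is UTF-16LE, so hash the same bytes on the C# side.
        using (var sha256 = SHA256.Create())
        {
            byte[] hash = sha256.ComputeHash(Encoding.Unicode.GetBytes("hello"));
            Console.WriteLine(BitConverter.ToString(hash).Replace("-", ""));
        }
    }
}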

The T-SQL documentation does not specify what algorithm is used by checksum() outside of this:
CHECKSUM computes a hash value, called the checksum, over its list of arguments. The hash value is intended for use in building hash indexes. If the arguments to CHECKSUM are columns, and an index is built over the computed CHECKSUM value, the result is a hash index. This can be used for equality searches over the columns.
It's unlikely to compute an MD5 hash, since its return value (the computed hash) is a 32-bit integer; an MD5 hash is 128 bits in length.

In case you need to do a checksum on a GUID, change dna2's answer to this:
private int SQLBinaryChecksum(byte[] text)
With a byte array, the value from SQL will match the value from C#. To test:
var a = Guid.Parse("DEAA5789-6B51-4EED-B370-36F347A0E8E4").ToByteArray();
Console.WriteLine(SQLBinaryChecksum(a));
vs SQL:
select BINARY_CHECKSUM(CONVERT(uniqueidentifier,'DEAA5789-6B51-4EED-B370-36F347A0E8E4'))
Both results will be -1897092103.

@Dan's implementation of BinaryChecksum can be greatly simplified in C# to:
int SqlBinaryChecksum(string text)
{
    uint accumulator = 0;
    for (int i = 0; i < text.Length; i++)
    {
        // rotate the 32-bit accumulator left by 4 bits, then xor in the next character
        var leftRotate4bit = (accumulator << 4) | (accumulator >> 28);
        accumulator = leftRotate4bit ^ text[i];
    }
    return (int)accumulator;
}
This also makes it clearer what the algorithm is doing: for each character, a 4-bit circular left shift, then an XOR with the character's value.

Related

CRC-16 and CRC-32 Checks

I need help trying to verify CRC-16 values (I also need help with CRC-32 values). I tried to sit down and understand how CRC works, but I am drawing a blank.
My first problem is with trying to use an online calculator to verify the message "BD001325E032091B94C412AC", whose CRC16 should be 12AC. The documentation states that the last two octets are the CRC16 value, so I am inputting "BD001325E032091B94C4" into the site http://www.lammertbies.nl/comm/info/crc-calculation.html and receive 5A90 as the result instead of 12AC.
Does anybody know why these values are different, and where I can find code for how to calculate CRC16 and CRC32 values? (I plan to learn how to do this myself later, but time doesn't allow right now.)
Some more messages are as follows:
16000040FFFFFFFF00015FCB
3C00003144010405E57022C7
BA00001144010101B970F0ED
3900010101390401B3049FF1
09900C800000000000008CF3
8590000000000000000035F7
00900259025902590259EBC9
0200002B00080191014BF5A2
BB0000BEE0014401B970E51E
3D000322D0320A2510A263A0
2C0001440000D60000D65E54
--Edit--
I have included more information. The documentation I was referencing is TIA-102.BAAA-A (from the TIA standard). The following is what the documentation states (trying to avoid copyright infringement as much as possible):
The Last Block in a packet comprises several octets of user information and / or
pad octets, followed by a 4-octet CRC parity check. This is referred to as the
packet CRC.
The packet CRC is a 4-octet cyclic redundancy check coded over all of the data
octets included in the Intermediate Blocks and the octets of user information of
the Last Block. The specific calculation is as follows.
Let k be the total number of user information and pad bits over which the packet
CRC is to be calculated. Consider the k message bits as the coefficients of a
polynomial M(x) of degree k–1, associating the MSB of the zero-th message
octet with x^(k–1) and the LSB of the last message octet with x^0. Define the
generator polynomial, GM(x), and the inversion polynomial, IM(x).
GM(x) = x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 +
x^4 + x^2 + x + 1
IM(x) = x^31 + x^30 + x^29 + ... + x^2 + x +1
The packet CRC polynomial, FM(x), is then computed from the following formula.
FM(x) = ( x^32 M(x) mod GM(x) ) + IM(x) modulo 2, i.e., in GF(2)
The coefficients of FM(x) are placed in the CRC field with the MSB of the zero-th
octet of the CRC corresponding to x^31 and the LSB of the third octet of the CRC
corresponding to x^0.
In the above quote, I have put ^ to show powers as the formatting didn't stay the same when quoted. I'm not sure what goes to what but does this help?
I have a class I converted from some C++ code I found on the internet; it uses a long to calculate a CRC32. It adheres to the standard and is the one used by PKZIP, WinZip and Ethernet. To test it, use WinZip to compress a file, then calculate the CRC of the same file with this class; it should return the same CRC. It does for me.
public class CRC32
{
    private int[] iTable;

    public CRC32() {
        this.iTable = new int[256];
        Init();
    }

    /**
     * Initialize the iTable applying the polynomial used by PKZIP, WINZIP and Ethernet.
     */
    private void Init()
    {
        // 0x04C11DB7 is the official polynomial used by PKZip, WinZip and Ethernet.
        int iPolynomial = 0x04C11DB7;

        // 256 values representing ASCII character codes.
        for (int iAscii = 0; iAscii <= 0xFF; iAscii++)
        {
            this.iTable[iAscii] = this.Reflect(iAscii, (byte) 8) << 24;
            for (int i = 0; i <= 7; i++)
            {
                if ((this.iTable[iAscii] & 0x80000000L) == 0) this.iTable[iAscii] = (this.iTable[iAscii] << 1) ^ 0;
                else this.iTable[iAscii] = (this.iTable[iAscii] << 1) ^ iPolynomial;
            }
            this.iTable[iAscii] = this.Reflect(this.iTable[iAscii], (byte) 32);
        }
    }

    /**
     * Reflection is a requirement for the official CRC-32 standard. Note that you can create a CRC without it,
     * but it won't conform to the standard.
     *
     * @param iReflect
     *            value to apply the reflection to
     * @param iValue
     *            number of bits to reflect
     * @return the calculated value
     */
    private int Reflect(int iReflect, int iValue)
    {
        int iReturned = 0;
        // Swap bit 0 for bit 7, bit 1 for bit 6, etc....
        for (int i = 1; i < (iValue + 1); i++)
        {
            if ((iReflect & 1) != 0)
            {
                iReturned |= (1 << (iValue - i));
            }
            iReflect >>= 1;
        }
        return iReturned;
    }

    /**
     * CalculateCRC calculates the CRC32 by looping through each byte in sData.
     *
     * @param lCRC
     *            the variable to hold the CRC. It must have been initialized.
     *            <p>
     *            See FullCRC for an example.
     *            </p>
     * @param sData
     *            array of bytes to calculate the CRC over
     * @param iDataLength
     *            the length of the data
     * @return the newly calculated CRC
     */
    public long CalculateCRC(long lCRC, byte[] sData, int iDataLength)
    {
        for (int i = 0; i < iDataLength; i++)
        {
            lCRC = (lCRC >> 8) ^ (long) (this.iTable[(int) (lCRC & 0xFF) ^ (int) (sData[i] & 0xff)] & 0xffffffffL);
        }
        return lCRC;
    }

    /**
     * Calculates the CRC32 for the given data.
     *
     * @param sData
     *            the data to calculate the CRC over
     * @param iDataLength
     *            the length of the data
     * @return the calculated CRC32
     */
    public long FullCRC(byte[] sData, int iDataLength)
    {
        long lCRC = 0xffffffffL;
        lCRC = this.CalculateCRC(lCRC, sData, iDataLength);
        return (lCRC /*& 0xffffffffL*/ ^ 0xffffffffL);
    }

    /**
     * Calculates the CRC32 of a file.
     *
     * @param sFileName
     *            the complete file path
     * @param context
     *            the context to open the file with
     * @return the calculated CRC32, or -1 if an error occurs (file not found).
     */
    long FileCRC(String sFileName, Context context)
    {
        long iOutCRC = 0xffffffffL; // Initialize the CRC.
        int iBytesRead = 0;
        int buffSize = 32 * 1024;
        FileInputStream isFile = null;
        try
        {
            byte[] data = new byte[buffSize]; // 32 KB buffer
            isFile = context.openFileInput(sFileName);
            try
            {
                while ((iBytesRead = isFile.read(data, 0, buffSize)) > 0)
                {
                    iOutCRC = this.CalculateCRC(iOutCRC, data, iBytesRead);
                }
                return (iOutCRC ^ 0xffffffffL); // Finalize the CRC.
            }
            catch (Exception e)
            {
                // Error reading file
            }
            finally
            {
                isFile.close();
            }
        }
        catch (Exception e)
        {
            // File not found
        }
        return -1L;
    }
}
Read Ross Williams' tutorial on CRCs to get a better understanding of CRCs, what defines a particular CRC, and their implementations.
The reveng website has an excellent catalog of known CRCs, and for each the CRC of a test string (nine bytes: "123456789" in ASCII/UTF-8). Note that there are 22 different 16-bit CRCs defined there.
The reveng software on that same site can be used to reverse engineer the polynomial, initialization, post-processing, and bit reversal given several examples as you have for the 16-bit CRC. (Hence the name "reveng".) I ran your data through and got:
./reveng -w 16 -s 16000040FFFFFFFF00015FCB 3C00003144010405E57022C7 BA00001144010101B970F0ED 3900010101390401B3049FF1 09900C800000000000008CF3 8590000000000000000035F7 00900259025902590259EBC9 0200002B00080191014BF5A2 BB0000BEE0014401B970E51E 3D000322D0320A2510A263A0 2C0001440000D60000D65E54
width=16 poly=0x1021 init=0xc921 refin=false refout=false xorout=0x0000 check=0x2fcf name=(none)
As indicated by the "(none)", that 16-bit CRC is not any of the 22 listed on reveng, though it is similar to several of them, differing only in the initialization.
The additional information you provided is for a 32-bit CRC, either CRC-32 or CRC-32/BZIP in the reveng catalog, depending on whether the bits are reversed or not.
There are quite a few parameters to CRC calculations: Polynomial, initial value, final XOR... see Wikipedia for details. Your CRC does not seem to fit the ones on the site you used, but you can try to find the right parameters from your documentation and use a different calculator, e.g. this one (though I'm afraid it doesn't support HEX input).
One thing to keep in mind is that CRC-16 is usually calculated over the data that is supposed to be checksummed plus two zero-bytes, e.g. you are probably looking for a CRC16 function where CRC16(BD001325E032091B94C40000) == 12AC. With checksums calculated in this way, the CRC of the data with checksum appended will work out to 0, which makes checking easier, e.g. CRC16(BD001325E032091B94C412AC) == 0000
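For completeness, here is a small C# sketch built from the parameters reveng reported above (poly 0x1021, init 0xC921, no reflection, no final XOR). Those parameters were reverse engineered, not taken from the TIA document, so treat the sketch as an assumption; if they are right, feeding in a whole sample message including its trailing two CRC octets should print 0000.
using System;

static class PacketCrc16
{
    // Parameters from the reveng output above (assumed, not official):
    // width=16, poly=0x1021, init=0xC921, refin=false, refout=false, xorout=0x0000.
    public static ushort Compute(byte[] data)
    {
        ushort crc = 0xC921;
        foreach (byte b in data)
        {
            crc ^= (ushort)(b << 8);              // feed the byte into the top of the register
            for (int bit = 0; bit < 8; bit++)     // MSB-first, no reflection
                crc = (crc & 0x8000) != 0
                    ? (ushort)((crc << 1) ^ 0x1021)
                    : (ushort)(crc << 1);
        }
        return crc;                               // xorout is 0, so return as-is
    }

    static byte[] FromHex(string hex)
    {
        var bytes = new byte[hex.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = Convert.ToByte(hex.Substring(2 * i, 2), 16);
        return bytes;
    }

    static void Main()
    {
        // Since xorout is 0, the CRC of a message with its CRC appended should be 0x0000.
        Console.WriteLine(Compute(FromHex("16000040FFFFFFFF00015FCB")).ToString("X4"));
    }
}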

CRC programming help needed, CRC32 conversion from the .NET class to C

Code (written in C):
unsigned long chksum_crc32 (unsigned char *block, unsigned int length)
{
    register unsigned long crc;
    unsigned long i;

    crc = 0xFFFFFFFF;
    for (i = 0; i < length; i++)
    {
        crc = ((crc >> 8) & 0x00FFFFFF) ^ crc_tab[(crc ^ *block++) & 0xFF];
    }
    return (crc ^ 0xFFFFFFFF);
}
/* chksum_crc32gentab() -- to a global crc_tab[256], this one will
 * calculate the crcTable for crc32-checksums.
 * it is generated to the polynom [..]
 */
void chksum_crc32gentab ()
{
    unsigned long crc, poly;
    int i, j;

    poly = 0xEDB88320L;
    for (i = 0; i < 256; i++)
    {
        crc = i;
        for (j = 8; j > 0; j--)
        {
            if (crc & 1)
            {
                crc = (crc >> 1) ^ poly;
            }
            else
            {
                crc >>= 1;
            }
        }
        crc_tab[i] = crc;
    }
}
For starters: I know how CRC works. First the FCS (frame check sequence) is calculated with a specified polynomial, then the FCS is appended to the data set and sent to the end user's system. Once the transfer is finished, the FCS is checked with the same polynomial used to calculate it, and if the remainder of the data divided by that polynomial is zero, you know the data is correct.
I do not understand the implementation of these two functions. From what I have learned, the function chksum_crc32gentab() generates all the possible hex values the checksum could take with the 32-bit CRC polynomial. One thing I don't get is how poly = 0xEDB88320L; is equivalent to a polynomial. I don't understand the logic at the bottom of this function either. For example, the conditional if (crc & 1): does this mean that for every bit in crc that is 1, compute, otherwise shift right one bit?
I also do not understand chksum_crc32(unsigned char *block, unsigned int length);. Does this function just take in a string of bytes and convert them to the proper CRC value computed with the table? I guess I am confused about the logic it uses within the for loop.
If anyone understands this code, an explanation would be great; it does work for the CRC32 conversion from the .NET class. An example of how data is converted and then used by these functions would be something like:
(C# source)
MemoryStream ms = new MemoryStream(System.Text.Encoding.Default.GetBytes(input));
foreach (byte b in crc32.ComputeHash(ms))
    hash += b.ToString("x2").ToLower();
Here is the original site and project the C code was taken from. http://www.codeproject.com/Articles/35134/How-to-calculate-CRC-in-C
Any explanation would help
Or just google it... Second hit is: http://www.opensource.apple.com/source/xnu/xnu-1456.1.26/bsd/libkern/crc32.c
Backporting it from C# is the hard way to do it; most of these algorithms are already available in C.
In CRC calculations, binary polynomials, which are sums of x^n with either a 0 or 1 coefficient, are represented simply as binary words where the position of the 0 or 1 indicates which power of x it is a coefficient of.
0xEDB88320L represents the coefficients of the CRC32 polynomial as 1's where there is an x^n term (except for the x^32 term, which is left out). The CRC32 polynomial (why oh why doesn't stackoverflow have TeX equations like math.stackexchange -- I can't write decent equations here! sigh, sorry for the rant ...) is:
x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
Because of how this CRC is defined with respect to bit-ordering, the lowest coefficients are in the highest bits. So the first E in the hex constant above is 1110 representing (in order from left to right in the bits), 1 + x + x^2.
You can find the construction in the crc32.c source file of zlib, from which a snippet is shown here:
static const unsigned char p[] = {0,1,2,4,5,7,8,10,11,12,16,22,23,26};
/* make exclusive-or pattern from polynomial (0xedb88320UL) */
poly = 0;
for (n = 0; n < (int)(sizeof(p)/sizeof(unsigned char)); n++)
poly |= (z_crc_t)1 << (31 - p[n]);
/* generate a crc for every 8-bit value */
for (n = 0; n < 256; n++) {
c = (z_crc_t)n;
for (k = 0; k < 8; k++)
c = c & 1 ? poly ^ (c >> 1) : c >> 1;
crc_table[0][n] = c;
}
The if (crc & 1) or c & 1 ? above looks at the low bit of the CRC at each step before it is shifted away. That is effectively a carry bit for the polynomial subtraction operation, so if it is a one, the polynomial is subtracted (exclusive-ored) from the shifted down polynomial in the CRC (multiplied by x). The CRC is shifted down whether the low bit is 1 or not.
The chksum_crc32() function that you show indeed computes the CRC on the provided block of data. It is the standard table-based approach for CRC calculations on strings of bytes, which indexes the table by the exclusive-or of the data byte and the low byte of the CRC. This does the same thing as shifting in a bit at a time and applying the polynomial for 1 bits, but does it in one step instead of eight. The CRC is effectively multiplied by x^8 (the >> 8), and is exclusive-ored with the effect of exclusive-oring with the polynomial 0 to 8 times at various shifted locations depending on the index value. It is simply a speed trick using a pre-computed table.
You can find even more extreme speed tricks in zlib's crc32.c, which uses larger tables and processes more data at a time.
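To make the equivalence concrete, here is a C# sketch (not part of the original answers) of the same standard CRC-32 done both ways: one bit at a time with the reflected polynomial 0xEDB88320, and eight bits at a time via the precomputed table. Both should return the well-known check value 0xCBF43926 for the ASCII string "123456789".
using System;
using System.Text;

static class Crc32Demo
{
    static readonly uint[] Table = BuildTable();

    // Same table construction as the C code above: poly 0xEDB88320, reflected form.
    static uint[] BuildTable()
    {
        var table = new uint[256];
        for (uint n = 0; n < 256; n++)
        {
            uint c = n;
            for (int k = 0; k < 8; k++)
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            table[n] = c;
        }
        return table;
    }

    // One bit at a time: shift down, and xor in the polynomial whenever a 1 falls out.
    public static uint BitByBit(byte[] data)
    {
        uint crc = 0xFFFFFFFFu;
        foreach (byte b in data)
        {
            crc ^= b;
            for (int k = 0; k < 8; k++)
                crc = (crc & 1) != 0 ? 0xEDB88320u ^ (crc >> 1) : crc >> 1;
        }
        return crc ^ 0xFFFFFFFFu;
    }

    // Eight bits at a time through the table: the speed trick described above.
    public static uint TableDriven(byte[] data)
    {
        uint crc = 0xFFFFFFFFu;
        foreach (byte b in data)
            crc = (crc >> 8) ^ Table[(crc ^ b) & 0xFF];
        return crc ^ 0xFFFFFFFFu;
    }

    static void Main()
    {
        byte[] check = Encoding.ASCII.GetBytes("123456789");
        Console.WriteLine(BitByBit(check).ToString("X8"));     // CBF43926
        Console.WriteLine(TableDriven(check).ToString("X8"));  // CBF43926
    }
}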

Converting a partial MD5 hash code into a long

I'm using the MD5 algorithm to hash the key for an on-disk hash table (I know it's questionable whether this is the best algorithm to use for this, but I'm going with it for now. The problem is generalizable to any algorithm that produces a byte array). My problem is this:
The size of the hash code determines the number of combinations (buckets) in the hash table. Since MD5 is 128 bit, there are a huge number of combinations (~ 3.4e38) which is way too big for my purpose. So what I want to do is pick off the first n bits of the byte array that MD5 produces, and convert those into a long (or ulong) value. Since MD5 produces a byte array, it would be easy to do if I wanted an integral number of bytes, but this leads to too big a jump in the number of combinations. I'm finding the single bit version to be a lot trickier.
Goal:
n = 10 // I.e. I want 2^10 combinations
long pos = someFcn(byte[] key, n)
where key is the value being hashed, and n is the number of bits of the MD5 result I want to use. Pos, then, will be an integer from 0 to 1023 (in the case of n = 10). If n = 11, the code will be from 0 to 2^11-1 = 2047, etc. It has to be somewhat fast/efficient.
Doesn't seem that hard but it's eluding me. Any help would be much appreciated. Thanks.
First, convert the first four bytes into an integer with BitConverter.ToInt32. It's getting four bytes no matter what, but this probably won't make it measurably slower, since you're working with 32-bit registers for the rest of the calculations anyway, and complex stuff like "if it's < 16 then do this with the first two bytes" will just make it more complicated.
Then, given that integer, take the lowest N bits. If you really want a specific number of bits [a power of two number of buckets] not known at compile time, ~((-1)<<N) is a nice trick to get 2^N-1.
Or you could simply use ToUInt32 instead and modulo a prime number [it might be slightly better to convert to UInt64 instead, then you've got fully half the bits to start with, in this case]
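A minimal sketch of that approach (the method name and the UTF-8 encoding of the key are assumptions, not from the question):
using System;
using System.Security.Cryptography;
using System.Text;

static class HashBucket
{
    // Hash the key with MD5, read the first four bytes as an int, keep the lowest n bits.
    public static long BucketFor(string key, int n)
    {
        using (var md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(key));
            int first32 = BitConverter.ToInt32(digest, 0);
            int mask = ~((-1) << n);        // 2^n - 1, the trick mentioned above (for n < 32)
            return (uint)(first32 & mask);  // a value in [0, 2^n)
        }
    }
}
For n = 10 this yields a bucket index between 0 and 1023, e.g. long pos = HashBucket.BucketFor(key, 10);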
To obtain the first 10 bits, for example:
int result = ((int)key[0] << 2) | (((int)key[1] >> 6) & 0x03)
If you have an array like this,
unsigned char data[2000];
then you can just scrape off the first n bits into an integer like so:
typedef unsigned long long int MyInt;

MyInt scrape(size_t n, unsigned char * data)
{
    MyInt result = 0;
    size_t b;

    for (b = 0; b < n / 8; ++b)
    {
        result <<= 8;
        result += data[b];
    }

    const size_t remaining_bits = n % 8;
    result <<= remaining_bits;
    result += (data[b] >> (8 - remaining_bits));

    return result;
}
I'm assuming that CHAR_BIT == 8; feel free to generalize the code if you like. Also, the size of the array times 8 must be at least n.

Reversing a hash function

I have the following hash function, and I'm trying to find a way to reverse it, so that I can find the key from a hashed value.
uint Hash(string s)
{
    uint result = 0;
    for (int i = 0; i < s.Length; i++)
    {
        result = ((result << 5) + result) + s[i];
    }
    return result;
}
The code is in C# but I assume it is clear.
I am aware that for one hashed value, there can be more than one key, but my intent is not to find them all, just one that satisfies the hash function suffices.
EDIT :
The string that the function accepts is formed only from the digits 0 to 9 and the chars '*' and '#', hence the Unhash function must respect this criterion too.
Any ideas? Thank you.
This should reverse the operations:
string Unhash(uint hash)
{
    List<char> s = new List<char>();
    while (hash != 0)
    {
        s.Add((char)(hash % 33));
        hash /= 33;
    }
    s.Reverse();
    return new string(s.ToArray());
}
This should return a string that gives the same hash as the original string, but it is very unlikely to be the exact same string.
Characters 0-9,*,# have ASCII values 48-57,42,35, or binary: 00110000 ... 00111001, 00101010, 00100011
The first 5 bits of those values are all different, and the 6th bit is always 1. This means that you can deduce the last character in a loop from the current hash:
uint lastChar = hash & 0x1F - ((hash >> 5) - 1) & 0x1F + 0x20;
(if this doesn't work, I don't know who wrote it)
Now roll back hash,
hash = (hash - lastChar) / 33;
and repeat the loop until hash becomes zero. I don't have C# on me, but I'm 70% confident that this should work with only minor changes.
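Here is a C# sketch of the same roll-back idea, using hash % 33 instead of the bit manipulation (a variation on the answer above, not the answerer's code). It relies on the fact that the allowed characters all have distinct values mod 33, and it assumes the original string was short enough that the uint accumulator never wrapped (roughly six characters or fewer):
using System.Collections.Generic;
using System.Linq;

string Unhash(uint hash)
{
    const string allowed = "0123456789*#";
    var chars = new List<char>();
    while (hash != 0)
    {
        uint remainder = hash % 33;                                    // equals lastChar % 33 when no overflow occurred
        char last = allowed.FirstOrDefault(c => (uint)c % 33 == remainder);
        if (last == '\0') return null;                                 // not producible from this alphabet
        if (hash < last) return null;                                  // hash must have wrapped; give up
        chars.Add(last);
        hash = (hash - last) / 33;                                     // undo result = result * 33 + c
    }
    chars.Reverse();
    return new string(chars.ToArray());
}
If this returns a string at all, re-hashing it with the original Hash function should give back the same value.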
Brute force should work if uint is 32 bits. Try at least 2^32 strings and one of them is likely to hash to the same value. Should only take a few minutes on a modern pc.
You have 12 possible characters, and 12^9 is about 2^32, so if you try 9 character strings you're likely to find your target hash. I'll do 10 character strings just to be safe.
(simple recursive implementation in C++, don't know C# that well)
#define NUM_VALID_CHARS 12
#define STRING_LENGTH 10

const char valid_chars[NUM_VALID_CHARS] = {'0', ..., '#', '*'};

void unhash(uint hash_value, char *string, int nchars) {
    if (nchars == STRING_LENGTH) {
        string[STRING_LENGTH] = 0;
        if (Hash(string) == hash_value) { printf("%s\n", string); }
    } else {
        for (int i = 0; i < NUM_VALID_CHARS; i++) {
            string[nchars] = valid_chars[i];
            unhash(hash_value, string, nchars + 1);
        }
    }
}
Then call it with:
char string[STRING_LENGTH + 1];
unhash(hash_value, string, 0);
Hash functions are designed to be difficult or impossible to reverse, hence the name (visualize meat + potatoes being ground up)
I would start out by writing each step that result = ((result << 5) + result) + s[i]; does on a separate line. This will make solving a lot easier. Then all you have to do is the opposite of each line (in the opposite order too).

Are there any working implementations of the rolling hash function used in the Rabin-Karp string search algorithm?

I'm looking to use a rolling hash function so I can take hashes of n-grams of a very large string.
For example:
"stackoverflow", broken up into 5 grams would be:
"stack", "tacko", "ackov", "ckove",
"kover", "overf", "verfl", "erflo", "rflow"
This is ideal for a rolling hash function because after I calculate the first n-gram hash, the following ones are relatively cheap to calculate because I simply have to drop the first letter of the first hash and add the new last letter of the second hash.
I know that in general this hash function is generated as:
H = c1*a^(k-1) + c2*a^(k-2) + c3*a^(k-3) + ... + ck*a^0, where a is a constant and c1,...,ck are the input characters.
If you follow this link on the Rabin-Karp string search algorithm, it states that "a" is usually some large prime.
I want my hashes to be stored in 32 bit integers, so how large of a prime should "a" be, such that I don't overflow my integer?
Does there exist an existing implementation of this hash function somewhere that I could already use?
Here is an implementation I created:
public class hash2
{
    public int prime = 101;

    public int hash(String text)
    {
        int hash = 0;
        for (int i = 0; i < text.length(); i++)
        {
            char c = text.charAt(i);
            hash += c * (int) (Math.pow(prime, text.length() - 1 - i));
        }
        return hash;
    }

    public int rollHash(int previousHash, String previousText, String currentText)
    {
        char firstChar = previousText.charAt(0);
        char lastChar = currentText.charAt(currentText.length() - 1);
        int firstCharHash = firstChar * (int) (Math.pow(prime, previousText.length() - 1));
        int hash = (previousHash - firstCharHash) * prime + lastChar;
        return hash;
    }

    public static void main(String[] args)
    {
        hash2 hashify = new hash2();
        int firstHash = hashify.hash("mydog");
        System.out.println(firstHash);
        System.out.println(hashify.hash("ydogr"));
        System.out.println(hashify.rollHash(firstHash, "mydog", "ydogr"));
    }
}
I'm using 101 as my prime. Does it matter if my hashes will overflow? I think this is desirable but I'm not sure.
Does this seem like the right way to go about this?
I remember a slightly different implementation which seems to be from one of Sedgewick's algorithms books (it also contains example code - try to look it up). Here's a summary adjusted to 32-bit integers:
You use modulo arithmetic to prevent your integer from overflowing after each operation.
Initially set:
c = text ("stackoverflow")
M = length of the "n-grams"
d = size of your alphabet (256)
q = a large prime so that (d+1)*q doesn't overflow (8355967 might be a good choice)
dM = d^(M-1) mod q
First calculate the hash value of the first n-gram:
h = 0
for i from 1 to M:
    h = (h*d + c[i]) mod q
and for every following n-gram:
for i from 1 to length(c)-M:
    // first subtract the oldest character
    h = (h + d*q - c[i]*dM) mod q
    // then add the next character
    h = (h*d + c[i+M]) mod q
The reason why you have to add d*q before subtracting the oldest character is that otherwise you might run into negative values, due to the small values caused by the previous modulo operation.
Errors included, but I think you should get the idea. Try to find one of Sedgewick's algorithms books for details, fewer errors and a better description. :)
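Here is a C# transcription of that summary (not from the answer; the names follow the pseudocode, d is the 256-symbol alphabet size the answer assumes, and q is the suggested prime):
using System;
using System.Collections.Generic;

static class RollingHash
{
    public static IEnumerable<long> Hashes(string c, int M)
    {
        const long d = 256;        // alphabet size from the answer
        const long q = 8355967;    // large prime so intermediate products stay well inside a long

        long dM = 1;
        for (int i = 1; i < M; i++)
            dM = (dM * d) % q;                    // dM = d^(M-1) mod q

        long h = 0;
        for (int i = 0; i < M; i++)
            h = (h * d + c[i]) % q;               // hash of the first n-gram
        yield return h;

        for (int i = 0; i + M < c.Length; i++)
        {
            h = (h + d * q - c[i] * dM % q) % q;  // subtract the oldest character; adding d*q keeps the value non-negative
            h = (h * d + c[i + M]) % q;           // add the next character
            yield return h;
        }
    }
}
For example, foreach (var h in RollingHash.Hashes("stackoverflow", 5)) Console.WriteLine(h); prints one hash per 5-gram.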
As I understand it, it's a function minimization for:
2^31 - sum(maxchar) * A^k
where maxchar = 62 (for A-Za-z0-9). I've just calculated it in Excel (OO Calc, exactly) :) and the max A it found is 76, or 73 for a prime number.
Not sure what your aim is here, but if you are trying to improve performance, using Math.pow will cost you far more than you save by calculating a rolling hash value.
I suggest you start by keeping it simple and efficient, and you will very likely find it is fast enough.
