Converting a partial MD5 hash code into a long

Converting a partial MD5 hash code into a long - c#

I'm using the MD5 algorithm to hash the key for an on-disk hash table (I know it's questionable whether this is the best algorithm to use for this, but I'm going with it for now. The problem is generalizable to any algorithm that produces a byte array). My problem is this:
The size of the hash code determines the number of combinations (buckets) in the hash table. Since MD5 is 128 bit, there are a huge number of combinations (~ 3.4e38) which is way too big for my purpose. So what I want to do is pick off the first n bits of the byte array that MD5 produces, and convert those into a long (or ulong) value. Since MD5 produces a byte array, it would be easy to do if I wanted an integral number of bytes, but this leads to too big a jump in the number of combinations. I'm finding the single bit version to be a lot trickier.
Goal:
n = 10 // I.e. I want 2^10 combinations
long pos = someFcn(byte[] key, n)
where key is the value being hashed, and n is the number of bits of the MD5 result I want to use. Pos, then, will be an integer from 0 to 1023 (in the case of n = 10). If n = 11, the code will be from 0 to 2^11-1 = 2027, etc. Has to be somewhat fast/efficient.
Doesn't seem that hard but it's eluding me. Any help would be much appreciated. Thanks.

First, convert the first four bytes into an integer, with BitConverter.ToInt32. It's getting four bytes no matter what, but this probably won't make it measurably slower, since you're working with 32-bit registers for the rest of the calculations anyway, and complex stuff like "if it's < 16 then do this with the first two bytes" will just make it more complicated
Then, given that integer, take the lowest N bits. If you really want a specific number of bits [a power of two number of buckets] not known at compile time, ~((-1)<<N) is a nice trick to get 2^N-1.
Or you could simply use ToUInt32 instead and modulo a prime number [it might be slightly better to convert to UInt64 instead, then you've got fully half the bits to start with, in this case]

To obtain the first 10 bits, for example:
int result = ((int)key[0] << 2) | (((int)key[1] >> 6) & 0x03)

If you have an array like this,
unsigned char data[2000];
then you can just scrape off the first n bits into an integer like so:
typedef unsigned long long int MyInt;
MyInt scrape(size_t n, unsigned char * data)
{
MyInt result = 0;
size_t b;
for (b = 0; b < n / 8; ++b)
{
result <<= 8;
result += data[b];
}
const size_t remaining_bits = n % 8;
result <<= remaining_bits;
result += (data[b] >> (8 - remaining_bits));
return result;
}
I'm assuming that CHAR_BITS == 8, feel free to generalize the code if you like. Also the size of the array times 8 must be at least n.

Related

How to (theoretically) print all possible double precision numbers in C#?

For a little personal research project I want to generate a string list of all possible values a double precision floating point number can have.
I've found the "r" formatting option, which guarantees that the string can be parsed back into the exact same bit representation:
string s = myDouble.ToString("r");
But how to generate all possible bit combinations? Preferably ordered by value.
Maybe using the unchecked keyword somehow?
unchecked
{
//for all long values
myDouble[i] = myLong++;
}
Disclaimer: It's more a theoretical question, I am not going to read all the numbers... :)

using unsafe code:
ulong i = 0; //long is 64 bit, like double
unsafe
{
double* d = (double*)&i;
for(;i<ulong.MaxValue;i++)
Console.WriteLine(*d);
}

You can start with all possible values 0 <= x < 1. You can create those by having zero for exponent and use different values for the mantissa.
The mantissa is stored in 52 bits of the 64 bits that make a double precision number, so that makes for 2 ^ 52 = 4503599627370496 different numbers between 0 and 1.
From the description of the decimal format you can figure out how the bit pattern (eight bytes) should be for those numbers, then you can use the BitConverter.ToDouble method to do the conversion.
Then you can set the first bit to make the negative version of all those numbers.
All those numbers are unique, beyond that you will start getting duplicate values because there are several ways to express the same value when the exponent is non-zero. For each new non-zero exponent you would get the value that were not possible to express with the previously used expontents.
The values between 0 and 1 will however keep you busy for the forseeable future, so you can just start with those.

This should be doable in safe code: Create a bit string. Convert that to a double. Output. Increment. Repeat.... A LOT.
string bstr = "01010101010101010101010101010101"; // this is 32 instead of 64, adjust as needed
long v = 0;
for (int i = bstr.Length - 1; i >= 0; i--) v = (v << 1) + (bstr[i] - '0');
double d = BitConverter.ToDouble(BitConverter.GetBytes(v), 0);
// increment bstr and loop

How to work with the bits in a byte

I have a single byte which contains two values. Here's the documentation:
The authority byte is split into two fields. The three least significant bits carry the user’s authority level (0-5). The five most
significant bits carry an override reject threshold. If these bits are
set to zero, the system reject threshold is used to determine whether
a score for this user is considered an accept or reject. If they are
not zero, then the value of these bits multiplied by ten will be the
threshold score for this user.
Authority Byte:
7 6 5 4 3 ......... 2 1 0
Reject Threshold .. Authority
I don't have any experience of working with bits in C#.
Can someone please help me convert a Byte and get the values as mentioned above?
I've tried the following code:
BitArray BA = new BitArray(mybyte);
But the length comes back as 29 and I would have expected 8, being each bit in the byte.
-- Thanks for everyone's quick help. Got it working now! Awesome internet.

Instead of BitArray, you can more easily use the built-in bitwise AND and right-shift operator as follows:
byte authorityByte = ...
int authorityLevel = authorityByte & 7;
int rejectThreshold = authorityByte >> 3;
To get the single byte back, you can use the bitwise OR and left-shift operator:
int authorityLevel = ...
int rejectThreshold = ...
Debug.Assert(authorityLevel >= 0 && authorityLevel <= 7);
Debug.Assert(rejectThreshold >= 0 && rejectThreshold <= 31);
byte authorityByte = (byte)((rejectThreshold << 3) | authorityLevel);

Your use of the BitArray is incorrect. This:
BitArray BA = new BitArray(mybyte);
..will be implicitly converted to an int. When that happens, you're triggering this constructor:
BitArray(int length);
..therefore, its creating it with a specific length.
Looking at MSDN (http://msdn.microsoft.com/en-us/library/x1xda43a.aspx) you want this:
BitArray BA = new BitArray(new byte[] { myByte });
Length will then be 8 (as expected).

To get a value of the five most significant bits in a byte as an integer, shift the byte to the right by 3 (i.e. by 8-5), and set the three upper bits to zero using bitwise AND operation, like this:
byte orig = ...
int rejThreshold = (orig >> 3) & 0x1F;
>> is the "shift right" operator. It moves bits 7..3 into positions 4..0, dropping the three lower bits.
0x1F is the binary number 00011111, which has the upper three bits set to zero, and the lower five bits set to one. AND-ing with this number zeroes out three upper bits.
This technique can be generalized to get other bit patterns and other integral data types. You shift the bits that you want into the least-significant position, and apply a mask that "cuts out" the number of bits that you want. In some cases, shifting would not be necessary (e.g. when you get the least significant group of bits). In other cases, such as above, the masking would not be necessary, because you get the most significant group of bits in an unsigned type (if the type is signed, ANDing would be required).

You're using the wrong constructor (probably).
The one that you're using is probably this one, while you need this one:
var bitArray = new BitArray(new [] { myByte } );

Generating uniform random integers with a certain maximum

I want to generate uniform integers that satisfy 0 <= result <= maxValue.
I already have a generator that returns uniform values in the full range of the built in unsigned integer types. Let's call the methods for this byte Byte(), ushort UInt16(), uint UInt32() and ulong UInt64(). Assume that the result of these methods is perfectly uniform.
The signature of the methods I want are uint UniformUInt(uint maxValue) and ulong UniformUInt(ulong maxValue).
What I'm looking for:
Correctness
I'd prefer the return values to be distributed in the given interval.
But a very small bias is acceptable if it increases performance significantly. By that I mean a bias of an order that allows distinguisher with probability 2/3 given 2^64 values.
It must work correctly for any maxValue.
Performance
The method should be fast.
Efficiency
The method does consume little raw randomness, since depending on the underlying generator, generating the raw bytes might be costly. Wasting a few bits is fine, but consuming say 128 bits to generate a single number is probably excessive.
It's also possible to cache some left over randomness from the previous call in some member variables.
Be careful with int overflows, and wrapping behavior.
I already have a solution(I'll post it as an answer), but it's a bit ugly for my tastes. So I'd like to get ideas for better solutions.
Suggestions on how to unit test with large maxValues would be nice too, since I can't generate a histogram with 2^64 buckets and 2^74 random values. Another complication is that with certain bugs, only some maxValue distributions are biased a lot, and others only very slightly.

How about something like this as a general-purpose solution? The algorithm is based on that used by Java's nextInt method, rejecting any values that would cause a non-uniform distribution. So long as the output of your UInt32 method is perfectly uniform then this should be too.
uint UniformUInt(uint inclusiveMaxValue)
{
unchecked
{
uint exclusiveMaxValue = inclusiveMaxValue + 1;
// if exclusiveMaxValue is a power of two then we can just use a mask
// also handles the edge case where inclusiveMaxValue is uint.MaxValue
if ((exclusiveMaxValue & (~exclusiveMaxValue + 1)) == exclusiveMaxValue)
return UInt32() & inclusiveMaxValue;
uint bits, val;
do
{
bits = UInt32();
val = bits % exclusiveMaxValue;
// if (bits - val + inclusiveMaxValue) overflows then val has been
// taken from an incomplete chunk at the end of the range of bits
// in that case we reject it and loop again
} while (bits - val + inclusiveMaxValue < inclusiveMaxValue);
return val;
}
}
The rejection process could, theoretically, keep looping forever; in practice the performance should be pretty good. It's difficult to suggest any generally applicable optimisations without knowing (a) the expected usage patterns, and (b) the performance characteristics of your underlying RNG.
For example, if most callers will be specifying a max value <= 255 then it might not make sense to ask for four bytes of randomness every time. On the other hand, the performance benefit of requesting fewer bytes might be outweighed by the additional cost of always checking how many you actually need. (And, of course, once you do have specific information then you can keep optimising and testing until your results are good enough.)

I am not sure, that his is an answer. It definitly needs more space than a comment, so I have to write it here, but I am willing to delete if others think this is stupid.
From the OQ I get, that
Entropy bits are very expensive
Everything else should be considered expensive, but less so than entropy.
My idea is to use binary digits to half, quater ... the maxValue space, until it is reduced to a number. Somthing like
I'l use maxValue=333 (decimal) as an example and assume a function getBit(), that randomly returns 0 or 1
offset:=0
space:=maxValue
while (space>0)
//Right-shift the value, keeping the rightmost bit this should be
//efficient on x86 and x64, if coded in real code, not pseudocode
remains:=space & 1
part:=floor(space/2)
space:=part
//In the 333 example, part is now 166, but 2*166=332 If we were to simply chose one
//half of the space, we would be heavily biased towards the upper half, so in case
//we have a remains, we consume a bit of entropy to decide which half is bigger
if (remains)
if(getBit())
part++;
//Now we decide which half to chose, consuming a bit of entropy
if (getBit())
offset+=part;
//Exit condition: The remeinind number space=0 is guaranteed to be met
//In the 333 example, offset will be 0, 166 or 167, remaining space will be 166
}
randomResult:=offset
getBit() can either come from your entropy source, if it is bit-based, or by consuming n bits of entropy at once on first call (obviously with n being the optimum for your entropy source), and shifting this until empty.

My current solution. A bit ugly for my tastes. It also has two divisions per generated number, which might negatively impact performance (I haven't profiled this part yet).
uint UniformUInt(uint maxResult)
{
uint rand;
uint count = maxResult + 1;
if (maxResult < 0x100)
{
uint usefulCount = (0x100 / count) * count;
do
{
rand = Byte();
} while (rand >= usefulCount);
return rand % count;
}
else if (maxResult < 0x10000)
{
uint usefulCount = (0x10000 / count) * count;
do
{
rand = UInt16();
} while (rand >= usefulCount);
return rand % count;
}
else if (maxResult != uint.MaxValue)
{
uint usefulCount = (uint.MaxValue / count) * count;//reduces upper bound by 1, to avoid long division
do
{
rand = UInt32();
} while (rand >= usefulCount);
return rand % count;
}
else
{
return UInt32();
}
}
ulong UniformUInt(ulong maxResult)
{
if (maxResult < 0x100000000)
return InternalUniformUInt((uint)maxResult);
else if (maxResult < ulong.MaxValue)
{
ulong rand;
ulong count = maxResult + 1;
ulong usefulCount = (ulong.MaxValue / count) * count;//reduces upper bound by 1, since ulong can't represent any more
do
{
rand = UInt64();
} while (rand >= usefulCount);
return rand % count;
}
else
return UInt64();
}

Need a way to pick a common bit in two bitmasks at random

Imagine two bitmasks, I'll just use 8 bits for simplicity:
01101010
10111011
The 2nd, 4th, and 6th bits are both 1. I want to pick one of those common "on" bits at random. But I want to do this in O(1).
The only way I've found to do this so far is pick a random "on" bit in one, then check the other to see if it's also on, then repeat until I find a match. This is still O(n), and in my case the majority of the bits are off in both masks. I do of course & them together to initially check if there's any common bits at all.
Is there a way to do this? If so, I can increase the speed of my function by around 6%. I'm using C# if that matters. Thanks!
Mike

If you are willing to have an O(lg n) solution, at the cost of a possibly nonuniform probability, recursively half split, i.e. and with the top half of the bits set and the bottom half set. If both are nonzero then chose one randomly, else choose the nonzero one. Then half split what remains, etc. This will take 10 comparisons for a 32 bit number, maybe not as few as you would like, but better than 32.
You can save a few ands by choosing to and with the high half or low half at random, and if there are no hits taking the other half, and if there are hits taking the half tested.
The random number only needs to be generated once, as you are only using one bit at each test, just shift the used bit out when you are done with it.
If you have lots of bits, this will be more efficient. I do not see how you can get this down to O(1) though.
For example, if you have a 32 bit number first and the anded combination with either 0xffff0000 or 0x0000ffff if the result is nonzero (say you anded with 0xffff0000) conitinue on with 0xff000000 of 0x00ff0000, and so on till you get to one bit. This ends up being a lot of tedious code. 32 bits takes 5 layers of code.

Do you want a uniform random distribution? If so, I don't see any good way around counting the bits and then selecting one at random, or selecting random bits until you hit one that is set.
If you don't care about uniform, you can select a set bit out of a word randomly with:
unsigned int pick_random(unsigned int w, int size) {
int bitpos = rng() % size;
unsigned int mask = ~((1U << bitpos) - 1);
if (mask & w)
w &= mask;
return w - (w & (w-1));
}
where rng() is your random number generator, w is the word you want to pick from, and size is the relevant size of the word in bits (which may be the machine wordsize, or may be less as long as you don't set the upper bits of the word. Then, for your example, you use pick_random(0x6a & 0xbb, 8) or whatever values you like.

This function uniformly randomly selects one bit which is high in both masks. If there are
no possible bits to pick, zero is returned instead. The running time is O(n), where n is the number of high bits in the anded masks. So if you have a low number of high bits in your masks, this function could be faster even though the worst case is O(n) which happens when all the bits are high. The implementation in C is as follows:
unsigned int randomMasksBit(unsigned a, unsigned b){
unsigned int i = a & b; // Calculate the bits which are high in both masks.
unsigned int count = 0
unsigned int randomBit = 0;
while (i){ // Loop through all high bits.
count++;
// Randomly pick one bit from the bit stream uniformly, by selecting
// a random floating point number between 0 and 1 and checking if it
// is less then the probability needed for random selection.
if ((rand() / (double)RAND_MAX) < (1 / (double)count)) randomBit = i & -i;
i &= i - 1; // Move on to the next high bit.
}
return randomBit;
}

O(1) with uniform distribution (or as uniform as random generator offers) can be done, depending on whether you count certain mathematical operation as O(1). As a rule we would, though in the case of bit-tweaking one might make a case that they are not.
The trick is that while it's easy enough to get the lowest set bit and to get the highest set bit, in order to have uniform distribution we need to randomly pick a partitioning point, and then randomly pick whether we'll go for the highest bit below it or the lowest bit above (trying the other approach if that returns zero).
I've broken this down a bit more than might be usual to allow the steps to be more easily followed. The only question on constant timing I can see is whether Math.Pow and Math.Log should be considered O(1).
Hence:
public static uint FindRandomSharedBit(uint x, uint y)
{//and two nums together, to find shared bits.
return FindRandomBit(x & y);
}
public static uint FindRandomBit(uint val)
{//if there's none, we can escape out quickly.
if(val == 0)
return 0;
Random rnd = new Random();
//pick a partition point. Note that Random.Next(1, 32) is in range 1 to 31
int maskPoint = rnd.Next(1, 32);
//pick which to try first.
bool tryLowFirst = rnd.Next(0, 2) == 1;
// will turn off all bits above our partition point.
uint lowerMask = Convert.ToUInt32(Math.Pow(2, maskPoint) - 1);
//will turn off all bits below our partition point
uint higherMask = ~lowerMask;
if(tryLowFirst)
{
uint lowRes = FindLowestBit(val & higherMask);
return lowRes != 0 ? lowRes : FindHighestBit(val & lowerMask);
}
uint hiRes = FindHighestBit(val & lowerMask);
return hiRes != 0 ? hiRes : FindLowestBit(val & higherMask);
}
public static uint FindLowestBit(uint masked)
{ //e.g 00100100
uint minusOne = masked - 1; //e.g. 00100011
uint xord = masked ^ minusOne; //e.g. 00000111
uint plusOne = xord + 1; //e.g. 00001000
return plusOne >> 1; //e.g. 00000100
}
public static uint FindHighestBit(uint masked)
{
double db = masked;
return (uint)Math.Pow(2, Math.Floor(Math.Log(masked, 2)));
}

I believe that, if you want uniform, then the answer will have to be Theta(n) in terms of the number of bits, if it has to work for all possible combinations.
The following C++ snippet (stolen) should be able to check if any given num is a power of 2.
if (!var || (var & (var - 1))) {
printf("%u is not power of 2\n", var);
}
else {
printf("%u is power of 2\n", var);
}

If you have few enough bits to worry about, you can get O(1) using a lookup table:
var lookup8bits = new int[256][] = {
new [] {},
new [] {0},
new [] {1},
new [] {0, 1},
...
new [] {0, 1, 2, 3, 4, 5, 6, 7}
};
Failing that, you can find the least significant bit of a number x with (x & -x), assuming 2s complement. For example, if x = 46 = 101110b, then -x = 111...111010010b, hence x & -x = 10.
You can use this technique to enumerate the set bits of x in O(n) time, where n is the number of set bits in x.
Note that computing a pseudo random number is going to take you a lot longer than enumerating the set bits in x!

This can't be done in O(1), and any solution for a fixed number of N bits (unless it's totally really ridiculously stupid) will have a constant upper bound, for that N.

Convert 2 bytes to a number

I have a control that has a byte array in it.
Every now and then there are two bytes that tell me some info about number of future items in the array.
So as an example I could have:
...
...
Item [4] = 7
Item [5] = 0
...
...
The value of this is clearly 7.
But what about this?
...
...
Item [4] = 0
Item [5] = 7
...
...
Any idea on what that equates to (as an normal int)?
I went to binary and thought it may be 11100000000 which equals 1792. But I don't know if that is how it really works (ie does it use the whole 8 items for the byte).
Is there any way to know this with out testing?
Note: I am using C# 3.0 and visual studio 2008

BitConverter can easily convert the two bytes in a two-byte integer value:
// assumes byte[] Item = someObject.GetBytes():
short num = BitConverter.ToInt16(Item, 4); // makes a short
// out of Item[4] and Item[5]

A two-byte number has a low and a high byte. The high byte is worth 256 times as much as the low byte:
value = 256 * high + low;
So, for high=0 and low=7, the value is 7. But for high=7 and low=0, the value becomes 1792.
This of course assumes that the number is a simple 16-bit integer. If it's anything fancier, the above won't be enough. Then you need more knowledge about how the number is encoded, in order to decode it.
The order in which the high and low bytes appear is determined by the endianness of the byte stream. In big-endian, you will see high before low (at a lower address), in little-endian it's the other way around.

You say "this value is clearly 7", but it depends entirely on the encoding. If we assume full-width bytes, then in little-endian, yes; 7, 0 is 7. But in big endian it isn't.
For little-endian, what you want is
int i = byte[i] | (byte[i+1] << 8);
and for big-endian:
int i = (byte[i] << 8) | byte[i+1];
But other encoding schemes are available; for example, some schemes use 7-bit arithmetic, with the 8th bit as a continuation bit. Some schemes (UTF-8) put all the continuation bits in the first byte (so the first has only limited room for data bits), and 8 bits for the rest in the sequence.

If you simply want to put those two bytes next to each other in binary format, and see what that big number is in decimal, then you need to use this code:
if (BitConverter.IsLittleEndian)
{
byte[] tempByteArray = new byte[2] { Item[5], Item[4] };
ushort num = BitConverter.ToUInt16(tempByteArray, 0);
}
else
{
ushort num = BitConverter.ToUInt16(Item, 4);
}
If you use short num = BitConverter.ToInt16(Item, 4); as seen in the accepted answer, you are assuming that the first bit of those two bytes is the sign bit (1 = negative and 0 = positive). That answer also assumes you are using a big endian system. See this for more info on the sign bit.

If those bytes are the "parts" of an integer it works like that. But beware, that the order of bytes is platform specific and that it also depends on the length of the integer (16 bit=2 bytes, 32 bit=4bytes, ...)

In case that item[5] is the MSB
ushort result = BitConverter.ToUInt16(new byte[2] { Item[5], Item[4] }, 0);
int result = 256 * Item[5] + Item[4];

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.