I have run into a weird problem with the following code. The loop below is supposed to stop after one iteration, but it just keeps going. However, if I remove the last "result_bytes = md5.ComputeHash(orig_bytes);" then it works. Has anyone run into a similar problem before?
MD5 md5;
byte[] orig_bytes;
byte[] result_bytes;
Dictionary<byte[], string> hashes = new Dictionary<byte[], string>();
string input = "NEW YORK";
result_bytes = UnicodeEncoding.Default.GetBytes("HELLO");
while (!hashes.ContainsKey(result_bytes))
{
    md5 = new MD5CryptoServiceProvider();
    orig_bytes = UnicodeEncoding.Default.GetBytes(input);
    result_bytes = md5.ComputeHash(orig_bytes);
    hashes.Add(result_bytes, input);
    Console.WriteLine(BitConverter.ToString(result_bytes));
    Console.WriteLine(hashes.ContainsKey(result_bytes));
    result_bytes = md5.ComputeHash(orig_bytes);
}
When you reassign result_bytes to a new value in the last line, you have a new reference to a byte array, one which is not equal to the key already in the collection, so hashes.ContainsKey returns false.
You're assuming that byte arrays override Equals and GetHashCode to compare for equality: they don't. They just use the default identity test - so without the extra assignment at the end, you're just checking whether the exact key object you've just added is still in the dictionary - which of course it is.
One way round this would be to store a reversible string representation of the hash (e.g. using base64) instead of the hash itself. Or write your own implementation of IEqualityComparer<byte[]> and pass that to the Dictionary constructor, so that the dictionary uses it to compute hash codes for byte arrays and compare them with each other (a sketch of that follows below).
In short: this has nothing to do with MD5, and everything to do with the fact that
Console.WriteLine(new byte[0].Equals(new byte[0]));
will print False :)
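To illustrate the comparer suggestion, here's a minimal sketch of an IEqualityComparer<byte[]> (the class name and hash constants are illustrative choices, not from the original answer):

using System.Collections.Generic;

// Sketch: compares byte arrays by content rather than by reference.
public class ByteArrayEqualityComparer : IEqualityComparer<byte[]>
{
    public bool Equals(byte[] first, byte[] second)
    {
        if (ReferenceEquals(first, second)) return true;
        if (first == null || second == null) return false;
        if (first.Length != second.Length) return false;
        for (int i = 0; i < first.Length; i++)
        {
            if (first[i] != second[i]) return false;
        }
        return true;
    }

    public int GetHashCode(byte[] array)
    {
        unchecked
        {
            int hash = 17;
            foreach (byte b in array)
            {
                hash = hash * 31 + b;
            }
            return hash;
        }
    }
}

With that in place, the dictionary in the question would be constructed as new Dictionary<byte[], string>(new ByteArrayEqualityComparer()) and the loop would terminate after one iteration.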
One module in my app generates a small array of integers, typically 25 of them, each usually less than 10000. I'd like to save all the unique arrays in a container of some sort. The number of arrays generated can be in the millions.
So, for every new array I need to figure out whether it already exists, and if it does, what its index is.
A naive approach is to keep all arrays in a list and then just call:
MyList.FindIndex(x => x.SequenceEqual(Small_Array));
But this becomes very slow once the number of arrays gets into the thousands.
A less naive approach is to store all arrays in a dictionary where the key is a hash value computed from the array. But if the hash is just another 32-bit integer, I cannot find a good hashing algorithm that avoids collisions.
That, I think, leaves me using a hashing algorithm like MD5, whose output can be treated as a 128-bit integer. Is that a good way to tackle my problem?
Rather than make the key the hash, make it the array itself - with a custom comparer. The value would be the notional "index".
The comparer doesn't need to be hugely efficient, nor does the hash generation need to go to great lengths to avoid duplicates, so long as there aren't too many collisions. (You should potentially add logging to check that.) Here's a really simple start:
public class Int32ArrayEqualityComparer : IEqualityComparer<int[]>
{
    // Note: SequenceEqual already checks the count before looking at content.
    public bool Equals(int[] first, int[] second) =>
        first.SequenceEqual(second);

    public int GetHashCode(int[] array)
    {
        unchecked
        {
            int hash = 23;
            foreach (var item in array)
            {
                hash = hash * 31 + item;
            }
            return hash;
        }
    }
}
You'd then create the dictionary like this:
var arrayMap = new Dictionary<int[], int>(new Int32ArrayEqualityComparer());
Then you'd have something like:
public int MaybeAddArray(int[] array)
{
    if (!arrayMap.TryGetValue(array, out var index))
    {
        index = arrayMap.Count + 1;
        arrayMap[array] = index;
    }
    return index;
}
Note that ConcurrentDictionary has simpler ways of doing this. Also note that the "index" is somewhat artificial here. You may not even need this, depending on what you're doing.
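To illustrate the ConcurrentDictionary point, a hedged sketch (the counter handling is my own, not from the answer): GetOrAdd folds the lookup and the insert into a single call.

using System.Collections.Concurrent;
using System.Threading;

// Sketch: thread-safe map from array contents to an index. Note the
// value factory may run more than once under contention, so indices
// can end up with gaps; fine if the index is just an opaque tag.
private readonly ConcurrentDictionary<int[], int> arrayMap =
    new ConcurrentDictionary<int[], int>(new Int32ArrayEqualityComparer());
private int nextIndex;

public int MaybeAddArray(int[] array) =>
    arrayMap.GetOrAdd(array, _ => Interlocked.Increment(ref nextIndex));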
I'm using xxHash for C# to hash a value for consistency.
ComputeHash returns a byte[], but I need to store the results in a long.
I'm able to convert the results into an int32 using the BitConverter. Here is what I've tried:
var xxHash = new System.Data.HashFunction.xxHash();
byte[] hashedValue = xxHash.ComputeHash(Encoding.UTF8.GetBytes(valueItem));
long value = BitConverter.ToInt64(hashedValue, 0);
When I use ToInt32 this works fine, but when I change it to ToInt64 it fails.
Here's the exception I get:
Destination array is not long enough to copy all the items in the collection. Check array index and length.
When you construct your xxHash object, you need to supply a hash size:
var hasher = new xxHash(32);
Valid hash sizes are 32 and 64.
See https://github.com/brandondahler/Data.HashFunction/blob/master/src/xxHash/xxHash.cs for the source.
Adding a new answer because the current implementation of xxHash from Brandon Dahler uses a hashing factory, which you initialize with a configuration containing the hash size and seed:
using System.Data.HashFunction.xxHash;
//can also set seed here, (ulong) Seed=234567
xxHashConfig config = new xxHashConfig() { HashSizeInBits = 64 };
var factory = xxHashFactory.Instance.Create(config);
byte[] hashedValue = factory.ComputeHash(Encoding.UTF8.GetBytes(valueItem)).Hash;
BitConverter.ToInt64 expects hashedValue to contain 8 bytes (= 64 bits). You could manually extend the 4-byte result to 8 bytes and then pass it.
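A minimal sketch of that manual extension, assuming hashedValue came back as 4 bytes from a 32-bit hash:

// Pad the 4-byte (32-bit) hash out to 8 bytes so ToInt64 has enough
// data; the upper 4 bytes simply stay zero.
byte[] padded = new byte[8];
Array.Copy(hashedValue, padded, hashedValue.Length);
long value = BitConverter.ToInt64(padded, 0);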
I am using the code below to generate random keys like the following: TX8L1I
public string GetNewId()
{
    string result = string.Empty;
    Random rnd = new Random();
    short codeLength = 5;
    string chars = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    StringBuilder builder = new StringBuilder(codeLength);
    for (int i = 0; i < codeLength; ++i)
        builder.Append(chars[rnd.Next(chars.Length)]);
    result = string.Format("{0}{1}", "T", builder.ToString());
    return result;
}
Every time a key is generated, a corresponding record is created in the database using the generated result as the primary key. Of course this is not safe, since a primary-key violation might occur.
What's the correct approach to this? Should I load all existing keys, verify against that list whether the key already exists, and if it does generate another one?
Or maybe a more efficient way would be to move this logic to the database side? But even on the database side I would still need to check that the randomly generated key does not already exist in the table.
Any ideas?
Thanks
Move the decision to the database. Add a primary key of type uniqueidentifier. Then it's a simple "fire and forget" algorithm: the database decides what the key is.
I think you should use a Random instance created outside of your method.
A Random object is seeded from the system clock, which means that if you call your method several times in quick succession, each call will use the same seed, and you'll end up with the same string each time.
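A minimal sketch of that change (single-threaded; a thread-safe variant appears in a later answer below):

// Create Random once and reuse it, so successive calls continue the
// same pseudo-random sequence instead of re-seeding from the clock.
private static readonly Random rnd = new Random();

public string GetNewId()
{
    const int codeLength = 5;
    const string chars = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    var builder = new StringBuilder(codeLength);
    for (int i = 0; i < codeLength; ++i)
        builder.Append(chars[rnd.Next(chars.Length)]);
    return "T" + builder;
}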
Well, my decision to use random keys is that I don't want users to be able to work out how the keys are generated (the format), since the website is public and I don't want users trying to access keys that aren't theirs.
You've merged two problems into one that are really separate:
Identify a row in a database.
Have an identifier that is passed to and from clients, that matches this, but is not predictable.
This is a specific case of a more general set of two problems:
Identify a row in a database.
Have an identifier that is passed to and from clients, that matches this.
In this case, when we don't care about guessing, then the easiest way to deal with it is to just use the same identifier. E.g. for the database row identified by the integer 42 we use the string "42" and the mapping is a trivial int.Parse() or int.TryParse() in one direction and a trivial .ToString() or implicit .ToString() in the other.
And that's therefore the pattern we use without even thinking about it; public IDs and database keys are the same (perhaps with some conversion).
But the best solution to your specific case where you want to prevent guessing is not to change the key, but to change mapping between the key and the public identifier.
First, use auto-incremented integers ("IDENTITY" in SQL Server, and various similar concepts in other databases).
Then, when you are sending the key to the client (i.e. using it in a form value or appending it to a URI) then map it as so:
private const string Seed = "this is my secret seed ¾Ÿˇʣכ ↼⊜┲◗ blah balh";
private static string GetProtectedID(int id)
{
    using(var sha = System.Security.Cryptography.SHA1.Create())
    {
        return string.Join("", sha.ComputeHash(Encoding.UTF8.GetBytes(id.ToString() + Seed)).Select(b => b.ToString("X2"))) + id.ToString();
    }
}
For example, with an ID of 123 this produces "989178D90470D8777F77C972AF46C4DED41EF0D9123".
Now map back to the key with:
private static bool GetIDFromProtectedID(string str, out int id)
{
    int chkID;
    if(int.TryParse(str.Substring(40), out chkID))
    {
        using(var sha = System.Security.Cryptography.SHA1.Create())
        {
            if(string.Join("", sha.ComputeHash(Encoding.UTF8.GetBytes(chkID.ToString() + Seed)).Select(b => b.ToString("X2"))) == str.Substring(0, 40))
            {
                id = chkID;
                return true;
            }
        }
    }
    id = 0;
    return false;
}
For "989178D90470D8777F77C972AF46C4DED41EF0D9123" this returns true and sets the id argument to 123. For "989178D90470D8777F77C972AF46C4DED41EF0D9122" (because I tried to guess the ID to attack your site) it returns false and sets id to 0. (The correct ID for key 122 is "F8AD0F55CA1B9426D18F684C4857E0C4D43984BA122", which is not easy to guess from having seen that for 123).
If you want, you can remove some of the 40 characters of the output to produce smaller IDs. This makes it less secure (an attacker has fewer possible IDs to brute-force) but should still be reasonable for many uses.
Obviously, you should use a different value for Seed than the one here, so that someone reading this answer can't use it to predict your IDs; change the seed and you change the IDs. Seed cannot be changed once set without invalidating every ID in the system. Sometimes that's a good thing (if identifiers are never meant to have long-term value anyway), but normally it's bad (you've just 404'd every page on the site that used one).
I would move the logic to the database instead. Even though the database still has to check for the existence of the key, it's already within the database operation at that point, so there is no back-and-forth happening.
Also, if there's an index on this randomly generated key, a simple IF EXISTS check will be quick enough to determine whether the key has been used before. A sketch of that idea follows.
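As an illustration (the table and column names Keys/Id are hypothetical, and this assumes SQL Server via System.Data.SqlClient): insert only when the key is still free, and retry with a fresh key otherwise.

using System.Data.SqlClient;

// Sketch: let the database decide atomically whether the key is free.
public string InsertWithNewKey(SqlConnection conn)
{
    while (true)
    {
        string id = GetNewId();
        using (var cmd = new SqlCommand(
            "INSERT INTO Keys (Id) " +
            "SELECT @id WHERE NOT EXISTS (SELECT 1 FROM Keys WHERE Id = @id)",
            conn))
        {
            cmd.Parameters.AddWithValue("@id", id);
            if (cmd.ExecuteNonQuery() == 1)
                return id;   // key was free and is now reserved
            // Key already taken: loop and generate another. Under heavy
            // concurrency you'd also catch the PK-violation exception.
        }
    }
}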
You can do the following:
Generate your string and then concatenate the following onto it:
string newUniqueString = string.Format("{0}{1}", result, DateTime.Now.ToString("yyyyMMddHHmmssfffffff"));
This way, you will never have the same key ever again!
Or use
var StringGuid = Guid.NewGuid();
You've committed a typical sin with Random: one shouldn't create a Random instance each time one wants to generate a random value (otherwise you'll get a badly skewed distribution with many repeats). Move Random out of the function:
// Let the generator be thread safe
private static ThreadLocal<Random> s_Generator = new ThreadLocal<Random>(
    () => new Random());

public static Random Generator {
    get {
        return s_Generator.Value;
    }
}

// For the repetition test. One can remove the repetition test if the
// number of generated ids << 15000
private HashSet<String> m_UsedIds = new HashSet<String>();

public string GetNewId() {
    int codeLength = 5;
    string chars = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    while (true) {
        StringBuilder builder = new StringBuilder(codeLength);
        for (int i = 0; i < codeLength; ++i)
            builder.Append(chars[Generator.Next(chars.Length)]); // <- same generator for all calls
        string result = string.Format("{0}{1}", "T", builder.ToString());

        // Test whether the random string is in fact a repetition
        if (m_UsedIds.Contains(result))
            continue;

        m_UsedIds.Add(result);
        return result;
    }
}
For codeLength = 5 there are Math.Pow(chars.Length, codeLength) possible strings (Math.Pow(36, 5) == 60466176 ≈ 6e7). According to the birthday paradox you can expect the first repetition after about 2 * sqrt(possible strings) == 2 * sqrt(6e7) ≈ 15000 draws. If that's acceptable, you can skip the repetition test; otherwise the HashSet<String> could be a solution.
I am computing the MD5 hash of files to check whether they are identical, so I wrote the following:
private static byte[] GetMD5(string p)
{
    FileStream fs = new FileStream(p, FileMode.Open);
    HashAlgorithm alg = new HMACMD5();
    byte[] hashValue = alg.ComputeHash(fs);
    fs.Close();
    return hashValue;
}
and to test it, for a start, I called it like this:
var v1 = GetMD5("C:\\test.mp4");
var v2 = GetMD5("C:\\test.mp4");
and from the debugger I inspected the values of v1 and v2, and they are different! Why is that?
It's because you're using HMACMD5, a keyed hashing algorithm, which combines a key with the input to produce a hash value. When you create an HMACMD5 via its default constructor, it uses a random key each time, therefore the hashes will always be different.
You need to use MD5:
private static byte[] GetMD5(string p)
{
    using(var fs = new FileStream(p, FileMode.Open))
    {
        using(var alg = new MD5CryptoServiceProvider())
        {
            return alg.ComputeHash(fs);
        }
    }
}
I've changed the code to use usings as well.
From the HMACMD5 constructor doc:
HMACMD5 is a type of keyed hash algorithm that is constructed from the
MD5 hash function and used as a Hash-based Message Authentication Code
(HMAC). The HMAC process mixes a secret key with the message data,
hashes the result with the hash function, mixes that hash value with
the secret key again, and then applies the hash function a second
time. The output hash will be 128 bits in length.
With this constructor, a 64-byte, randomly generated key is used.
(Emphasis mine)
With every call to GetMD5(), you're generating a new random key.
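If you actually wanted an HMAC (rather than a plain MD5 hash), you'd supply the key explicitly so the result is repeatable. A minimal sketch (the helper name and key parameter are my own):

// With a fixed key, the same file always yields the same HMAC.
private static byte[] GetHMACMD5(string p, byte[] key)
{
    using (var fs = new FileStream(p, FileMode.Open))
    using (var hmac = new HMACMD5(key))
    {
        return hmac.ComputeHash(fs);
    }
}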
You might want to use System.Security.Cryptography.MD5Cng
My guess is that you did something like:
Console.WriteLine(v1);
Console.WriteLine(v2);
or
Console.WriteLine(v1 == v2);
That just shows that the variable values refer to distinct arrays - it doesn't say anything about the values within those arrays.
Instead, try this (to print out the hex):
Console.WriteLine(BitConverter.ToString(v1));
Console.WriteLine(BitConverter.ToString(v2));
Use the BitConverter.ToString() method to get the value of the byte array.
I'm writing an application that needs to verify HMAC-SHA256 checksums. The code I currently have looks something like this:
static bool VerifyIntegrity(string secret, string checksum, string data)
{
    // Verify HMAC-SHA256 Checksum
    byte[] key = System.Text.Encoding.UTF8.GetBytes(secret);
    byte[] value = System.Text.Encoding.UTF8.GetBytes(data);
    byte[] checksum_bytes = System.Text.Encoding.UTF8.GetBytes(checksum);

    using (var hmac = new HMACSHA256(key))
    {
        byte[] expected_bytes = hmac.ComputeHash(value);
        return checksum_bytes.SequenceEqual(expected_bytes);
    }
}
I know that this is susceptible to timing attacks.
Is there a message digest comparison function in the standard library? I realize I could write my own time hardened comparison method, but I have to believe that this is already implemented elsewhere.
EDIT: Original answer is below - still worth reading IMO, but regarding the timing attack...
The page you referenced gives some interesting points about compiler optimizations. Given that you know the two byte arrays will be the same length (assuming the size of the checksum isn't particularly secret, you can immediately return if the lengths are different) you might try something like this:
public static bool CompareArraysExhaustively(byte[] first, byte[] second)
{
    if (first.Length != second.Length)
    {
        return false;
    }
    bool ret = true;
    for (int i = 0; i < first.Length; i++)
    {
        ret = ret & (first[i] == second[i]);
    }
    return ret;
}
Now that still won't take the same amount of time for all inputs - if the two arrays are both in L1 cache for example, it's likely to be faster than if it has to be fetched from main memory. However, I suspect that is unlikely to cause a significant issue from a security standpoint.
Is this okay? Who knows. Different processors and different versions of the CLR may take different amounts of time for an & operation depending on the two operands. Basically this is the same as the conclusion of the page you referenced - that it's probably as good as we'll get in a portable way, but that it would require validation on every platform you try to run on.
At least the above code only uses relatively simple operations. I would personally avoid using LINQ operations here as there could be sneaky optimizations going on in some cases. I don't think there would be in this case - or they'd be easy to defeat - but you'd at least have to think about them. With the above code, there's at least a reasonably close relationship between the source code and IL - leaving "only" the JIT compiler and processor optimizations to worry about :)
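As a side note for readers on a newer runtime (this postdates the original discussion): .NET Core 2.1 and later ship a constant-time comparison in the framework, which sidesteps the hand-rolled loop entirely:

using System.Security.Cryptography;

// Compares the contents of two buffers in an amount of time that
// doesn't depend on where (or whether) they differ.
bool equal = CryptographicOperations.FixedTimeEquals(checksum_bytes, expected_bytes);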
Original answer
There's one significant problem with this: in order to provide the checksum, you have to have a string whose UTF-8 encoded form is the same as the checksum. There are plenty of byte sequences which simply don't represent UTF-8-encoded text. Basically, trying to encode arbitrary binary data as text using UTF-8 is a bad idea.
Base64, on the other hand, is basically designed for this:
static bool VerifyIntegrity(string secret, string checksum, string data)
{
    // Verify HMAC-SHA256 Checksum
    byte[] key = Encoding.UTF8.GetBytes(secret);
    byte[] value = Encoding.UTF8.GetBytes(data);
    byte[] checksumBytes = Convert.FromBase64String(checksum);

    using (var hmac = new HMACSHA256(key))
    {
        byte[] expectedBytes = hmac.ComputeHash(value);
        return checksumBytes.SequenceEqual(expectedBytes);
    }
}
On the other hand, instead of using SequenceEqual on the byte array, you could Base64 encode the actual hash, and see whether that matches:
static bool VerifyIntegrity(string secret, string checksum, string data)
{
    // Verify HMAC-SHA256 Checksum
    byte[] key = Encoding.UTF8.GetBytes(secret);
    byte[] value = Encoding.UTF8.GetBytes(data);

    using (var hmac = new HMACSHA256(key))
    {
        return checksum == Convert.ToBase64String(hmac.ComputeHash(value));
    }
}
I don't know of anything better within the framework. It wouldn't be too hard to write a specialized SequenceEqual operator for arrays (or general ICollection<T> implementations) which checked for equal lengths first... but given that the hashes are short, I wouldn't worry about that.
If you're worried about the timing of the SequenceEqual, you could always replace it with something like this:
checksum_bytes.Zip( expected_bytes, (a,b) => a == b ).Aggregate( true, (a,r) => a && r );
This returns the same result as SequenceEqual but always checks every element before giving an answer, so there's less chance of revealing anything through a timing attack. (Note that Zip stops at the end of the shorter sequence, so you should still compare lengths first.)
How is it susceptible to timing attacks? Your code runs in the same amount of time whether the digest is valid or invalid, and computing the digest and then checking it looks like the easiest way to verify this.