Background (You can skip this section)
I have a large amount of data (about 3 MB) that needs to be kept up to date on several hundred machines. Some of the machines run C# and some run Java. The data could change at any time and needs to be propagated to the clients within minutes. The data is delivered in JSON format from 4 load-balanced servers running ASP.NET 4.0 with MVC 3 and C# 4.0.
The code that runs on the 4 servers hashes the JSON response and converts the hash to a string, which is given to the client. Then, every few minutes, the clients ping the server with their hash; if the hash is out of date the new JSON object is returned, and if it is still current a 304 with an empty body is returned.
Occasionally the hashes generated by the 4 boxes are inconsistent across the boxes, which means that the clients are constantly downloading the data (each request could hit a different server).
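For context, a minimal sketch of the conditional-response pattern described above; the controller action, helper methods (BuildJson, ComputeHash), and header name are hypothetical, not the original code:
public ActionResult GetData(string clientHash)
{
    string json = BuildJson();               // assumed to build the ~3 MB JSON payload
    string currentHash = ComputeHash(json);  // the base64 SHA-1 hash discussed below
    if (clientHash == currentHash)
    {
        return new HttpStatusCodeResult(304); // hash still current: empty body
    }
    Response.AppendHeader("X-Content-Hash", currentHash); // hypothetical header name
    return Content(json, "application/json");
}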
Code Snippet
Here is the code that is used to generate the hash.
internal static HashAlgorithm Hasher { get; set; }
...
Hasher = new SHA1Managed();
...
Convert.ToBase64String(Hasher.ComputeHash(Encoding.ASCII.GetBytes(jsonString)));
To try and debug the problem I split it out like this:
Prehash = PreHashBuilder.ToString();
ASCIIBytes = Encoding.ASCII.GetBytes(Prehash);
HashedBytes = Hasher.ComputeHash(ASCIIBytes);
Hash = Convert.ToBase64String(HashedBytes);
I then added a route which spits out the above values and used Beyond Compare to compare the differences.
Byte arrays are converted to a string format for BeyondCompare use by using:
private static string GetString(byte[] bytes)
{
    StringBuilder sb = new StringBuilder();
    foreach (byte b in bytes)
    {
        sb.Append(b);
    }
    return sb.ToString();
}
As you can see, the byte array is displayed literally as a sequence of decimal byte values. It is not 'converted'.
The Problem
I discovered that the Prehash and ASCIIBytes values were the same, but the HashedBytes values were different - which meant that the Hash was also different.
I restarted the IIS websites on the 4 server boxes several times and, when they had different hashes, compared the values in Beyond Compare. In every case it was the "HashedBytes" value that was different (the result of SHA1Managed.ComputeHash(...)).
The Question
What am I doing wrong? The input to the ComputeHash function is identical. Is SHA1Managed machine dependent? That doesn't make sense, because half the time the 4 machines have the same hash.
I have searched Stack Overflow and Bing but have been unable to find anyone else with this problem. The closest thing I could find was people with encoding problems, but I think I have proven that the encoding is not an issue.
Output
I was hoping not to dump everything here because of how long it is, but here is a snippet of the dump I am comparing:
Hash:o1ZxBaVuU6OhE6De96wJXUvmz3M=
HashedBytes:163861135165110831631611916022224717299375230207115
ASCIIBytes:1151169710310146991111094779114100101114831011141181059910147115101114118105991014611511899591151051031101171129511510111411810599101114101102101114101110991011159598979910710111010011111410010111411510111411810599101951185095117114108611041161161125847471051159897991071011101004610910211598101115116971031014699111109477911410010111483101114118105991014711510111411810599101461151189947118505911510510311011711295115101114118105991011141011021011141011109910111595989799107101110100112971211091011101161151161111141011151011141....
Prehash:...
When I compare the two pages on the different servers, the ASCIIBytes are identical but the HashedBytes are not. The dump method I use for the bytes does no conversion; it simply dumps each byte out in sequence. I could delimit the bytes with a '.', I suppose.
Follow Up
I have made the b.ToString(CultureInfo.InvariantCulture) change and have made the HashAlgorithm a local variable instead of a static property. I am waiting for the code to deploy to the servers.
I have been trying to duplicate the issue but have been unable to do so since making the SHA1Managed instance a local variable instead of a global static property.
The problem was with multi-threading. My code was thread safe except for the SHA1Managed instance that I had stored in a static property. I assumed SHA1Managed.ComputeHash would be thread safe underneath, but it is not when a single instance is shared across threads: ComputeHash is an instance method, and concurrent calls corrupt the instance's internal state.
To repeat, ComputeHash on a shared SHA1Managed instance is not thread safe.
MSDN states:
Any public static (Shared in Visual Basic) members of this type are thread safe. Any instance members are not guaranteed to be thread safe.
That guarantee covers static members of the type itself; ComputeHash is an instance member, so storing the instance in a static property gives no thread-safety guarantee at all.
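A minimal sketch of the fix: construct the hash object locally per call instead of sharing one instance through a static property.
internal static string ComputeJsonHash(string jsonString)
{
    using (var hasher = new SHA1Managed()) // local instance: nothing shared between threads
    {
        byte[] hash = hasher.ComputeHash(Encoding.ASCII.GetBytes(jsonString));
        return Convert.ToBase64String(hash);
    }
}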
I would mark @pst as the answer and add a comment to clarify the problem, but @pst made a comment so I can't mark it as the answer.
Thanks for all your input.
Your GetString method could potentially produce different results on machines of different cultures, because StringBuilder.Append(byte) calls byte.ToString(CultureInfo.CurrentCulture). Try
private static string GetString(byte[] bytes)
{
    StringBuilder sb = new StringBuilder();
    foreach (byte b in bytes)
    {
        sb.Append(b.ToString(CultureInfo.InvariantCulture));
    }
    return sb.ToString();
}
But using a method that doesn't use decimal string representations of the byte values would be better.
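One culture-independent alternative (a sketch): dump the bytes as hex instead of concatenated decimals.
private static string GetString(byte[] bytes)
{
    return BitConverter.ToString(bytes); // e.g. "A3-56-71-05-..." - delimited and unambiguous
}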
The problem is likely that your byte-to-string code is ambiguous: the decimal values are concatenated with no delimiters and no leading zeros, so different byte arrays can produce the same string. Use the following as your array-to-string code for comparison; it produces reliable results and is specifically designed for turning byte arrays into strings so they can be transmitted between machines.
using System.Runtime.Remoting.Metadata.W3cXsd2001;

public byte[] StringToBytes(string value)
{
    SoapHexBinary soapHexBinary = SoapHexBinary.Parse(value);
    return soapHexBinary.Value;
}

public string BytesToString(byte[] value)
{
    SoapHexBinary soapHexBinary = new SoapHexBinary(value);
    return soapHexBinary.ToString();
}
Also, I would recommend that you check that the JSON itself is not subtly different across servers, as that would create a totally different hash. For example, some cultures represent the number one thousand six hundred point seven as 1,600.7, 1 600.7, or even 1 600,7 (see this Wikipedia page).
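A quick illustration of that pitfall, assuming any part of the JSON is formatted by hand: use the invariant culture so every server emits the same text.
double value = 1600.7;
string serverDependent = value.ToString();                             // "1600,7" on a fr-FR server
string stable          = value.ToString(CultureInfo.InvariantCulture); // always "1600.7"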
Related
I want to write an application that gets a list of URLs.
For each of them, I need to periodically monitor whether the content has changed.
I thought:
to use HtmlAgilityPack to fetch the HTML content (any other recommendation?)
I don't need to spot the change itself, so I thought to hash the content, save it in the DB, and re-compare the hash in the future.
How would you suggest hashing? .NET's GetHashCode()?
I saw this documentation http://support.microsoft.com/kb/307020
which advises using
tmpSource = ASCIIEncoding.ASCII.GetBytes(sSourceData);
why?
You should absolutely not use GetHashCode() for this. The documentation explicitly states:
Furthermore, the .NET Framework does not guarantee the default implementation of the GetHashCode method, and the value it returns will be the same between different versions of the .NET Framework.
The results of GetHashCode can change between runs - all that's guaranteed is that calling it on two equal objects in the same process (possibly AppDomain) will give the same hash code. Indeed, String.GetHashCode's algorithm has changed over time, and in .NET 4 the 32-bit implementation is different to the 64-bit implementation.
If you want to use hashing, use MD5, SHA1 etc - something with a specified algorithm which will not change. (Note that these operate on binary data rather than string data, which is probably more appropriate too - you don't need to bother decoding the data as text.)
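A sketch of that approach: hash the raw response bytes directly, with no text decoding involved (the URL is a placeholder).
using (var client = new WebClient())
using (var sha1 = SHA1.Create())
{
    byte[] data = client.DownloadData("http://example.com/page");
    string hash = Convert.ToBase64String(sha1.ComputeHash(data));
    // store 'hash' in the DB and compare it against the next fetch
}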
It's not clear to me whether refetching periodically is really the best idea though - do these servers not support last modified times, etags etc?
As you have asked for suggestions, I would have used this method instead:
WebClient client = new WebClient();
String htmlCode = client.DownloadString("http://google.com");
And I would have saved this string in my DB. After the given interval I could have compared them again.
But yes, I do agree the string size would be really large.
If I just want an alert that the content has changed somehow, I would use MD5, as an MD5 hash is only 16 bytes (32 hex characters).
Hence it is easier to compare and store in the DB.
I know there are similar questions already on SO, but none of them seems to address this problem. I have inherited the following C# code that has been used to create password hashes in a legacy .NET app; for various reasons the C# implementation is now being migrated to PHP:
string input = "fred";
SHA256CryptoServiceProvider provider = new SHA256CryptoServiceProvider();
byte[] hashedValue = provider.ComputeHash(Encoding.ASCII.GetBytes(input));
string output = "";
string asciiString = ASCIIEncoding.ASCII.GetString(hashedValue);
foreach ( char c in asciiString ) {
int tmp = c;
output += String.Format("{0:x2}",
(uint)System.Convert.ToUInt32(tmp.ToString()));
}
return output;
My PHP code is very simple, but for the same input "fred" it doesn't produce the same result:
$output = hash('sha256', "fred");
I've traced the problem down to an encoding issue - if I change this line in the C# code:
string asciiString = ASCIIEncoding.ASCII.GetString(hashedValue);
to
string asciiString = ASCIIEncoding.UTF7.GetString(hashedValue);
Then the php and C# output match (it yields d0cfc2e5319b82cdc71a33873e826c93d7ee11363f8ac91c4fa3a2cfcd2286e5).
Since I'm not able to change the .NET code, I need to work out how to replicate the results in PHP.
Thanks in advance for any help,
I don’t know PHP well enough to answer your question; however, I must point out that your C# code is broken. Try generating the hash of these two inputs: "âèí" and "çñÿ". You will find that their hash collides:
3f3b221c6c6e3f71223f51695d456d52223f243f3f363949443f3f763b483615
The first bug lies in this operation:
Encoding.ASCII.GetBytes(input)
This assumes that all characters within your input are US-ASCII. Any non-ASCII characters would cause the encoder to fall back to the byte value for the ? character, thereby giving (unwanted) hash collisions, as demonstrated above. Notwithstanding, this will not be an issue if your input is constrained to only allow US-ASCII characters.
The other (more severe) bug lies in the following operation:
ASCIIEncoding.ASCII.GetString(hashedValue)
ASCII only defines mappings for values 0–127. Since the elements of your hashedValue byte array may contain any byte value (0–255), encoding them as ASCII would cause data to be lost whenever a value greater than 127 is encountered. This may lead to further “unwanted” (read: potentially maliciously generated) hash collisions, even when your original input was US-ASCII.
Given that, statistically, half of the bytes constituting your hashes will be greater than 127, you are losing at least half the strength of your hash algorithm. If a hacker gains access to your stored hashes, it is quite likely that they will manage to devise an attack to generate hash collisions by exploiting this cryptographic weakness.
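A quick demonstration of that data loss (a sketch): every byte above 127 round-trips through ASCII as '?' (0x3F).
byte[] raw = { 0x41, 0x80, 0xFF };
string text = Encoding.ASCII.GetString(raw);  // "A??"
byte[] back = Encoding.ASCII.GetBytes(text);  // { 0x41, 0x3F, 0x3F } - the original values are gone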
Edit: Notwithstanding the considerations mentioned in my posts and Jon’s, here is the PHP code that succumbs to the same weakness – so to speak – as your C# code, and thereby gives the same hash:
$output = hash('sha256', $input, true);
for ($i = 0; $i < strlen($output); $i++) {
    if ($output[$i] > chr(127)) {
        $output[$i] = '?';
    }
}
$output = bin2hex($output);
Could you use mb_convert_encoding (see http://php.net/manual/en/function.mb-convert-encoding.php - the page also has a link to a list of supported encodings) to convert the PHP string to ASCII from UTF7?
I've traced the problem down to an encoding issue
Yes. You're trying to treat arbitrary binary data as if it's valid text-encoded data. It's not. You should not be using any Encoding here.
If you want the results in hex, the simplest approach is to use BitConverter.ToString
string text = BitConverter.ToString(hashedValue).Replace("-", "").ToLower();
And yes, as pointed out elsewhere, you probably shouldn't be using ASCII to convert the text to binary at the start of the hashing process. I'd probably use UTF-8.
It's really important that you understand the problem here though, as otherwise you'll run into it in other places too. You should only use encodings such as ASCII, UTF-8 etc (on any platform) when you've genuinely got encoded text data. You shouldn't use them for images, the results of cryptography, the results of hashing, etc.
EDIT: Okay, you say you can't change the C# code... it's not clear whether that just means you've got legacy data, or whether you need to keep using the C# code regardless. You should absolutely not run this code for a second longer than you have to.
But in PHP, you may find you can get away with just replacing every byte with a value >= 0x80 in the hash with 0x3F, which is the ASCII for "question mark". If you look through your data you'll probably find there are a lot of 3F bytes in there.
If you can get this to work, I would strongly suggest that you migrate over to the true SHA-256 hash without losing information like this. Wherever you're storing the hashes, store two: the legacy one (which is all you have now) and the rehashed one. Whenever you're asked to validate that a password is correct, you should:
Check whether you have a "new" one; if so, only use that - ignore the legacy one.
If you only have a legacy one:
Hash the password in the broken way to check whether it's correct
If it is, hash it again properly and store the results in the "new" place.
Then when everyone's logged in correctly once, you'll be able to wipe out the legacy hashes.
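A minimal sketch of that verify-and-rehash flow, in C# for illustration; the UserRecord type and the LegacyHash/ProperHash/SaveUser helpers are hypothetical:
public bool ValidatePassword(string password, UserRecord user)
{
    if (user.NewHash != null)
    {
        return ProperHash(password) == user.NewHash; // only trust the new hash once it exists
    }
    if (LegacyHash(password) == user.LegacyHash)     // check against the broken legacy scheme
    {
        user.NewHash = ProperHash(password);         // rehash properly and store it
        SaveUser(user);
        return true;
    }
    return false;
}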
I have the following T-SQL code which I have converted to C#.
DECLARE @guidRegular UNIQUEIDENTIFIER, @dtmNow DATETIME
SELECT @guidRegular = '{5bf8e554-8dbc-4008-9d48-5c6e0a4d28d7}'
SELECT @dtmNow = '2012-02-09 18:31:38'
PRINT (CAST(CAST(@guidRegular AS BINARY(10)) + CAST(@dtmNow AS BINARY(6)) AS UNIQUEIDENTIFIER))
When I execute the .NET version of the code (using the same Guid and DateTime) I get a different guid. It looks like it has something to do with the datetime element - can anyone help?
C# extension code:
using System.Data.Linq;
...
...
public static class GuidExtensions
{
    public static Guid ToNewModifiedGuid(this Guid guid)
    {
        var dateTime = new DateTime(2012, 02, 09, 18, 31, 38);
        var guidBinary = new Binary(guid.ToByteArray().Take(10).ToArray());
        var dateBinary = new Binary(BitConverter.GetBytes(dateTime.ToBinary()).Take(6).ToArray());
        var bytes = new byte[guidBinary.Length + dateBinary.Length];
        Buffer.BlockCopy(guidBinary.ToArray(), 0, bytes, 0, guidBinary.ToArray().Length);
        Buffer.BlockCopy(dateBinary.ToArray(), 0, bytes, guidBinary.ToArray().Length, dateBinary.ToArray().Length);
        return new Guid(bytes);
    }
}
I'm not surprised that SQL and .NET have different binary representations of a date/time. I would be surprised if they were the same.
Your C# code is asking the DateTime structure to serialize a value to a 64-bit (8-byte) array that can be used to recreate the same value. Then you're throwing away 2 bytes (the year? the millisecond? a checksum? who knows?).
Your SQL code is asking the SQL engine to take its internal representation of a datetime - which is also 8 bytes - throw away two, and give the result.
So:
If you want identical values, you need to stop relying on the internals of how a datetime is stored/serialized. Convert it to 6 bytes using a repeatable method you can implement in both .NET and T-SQL (see the sketch after this list).
Realize that you are removing the 6 bytes of a guid that represent the spatially unique portion and replacing them with the time. So you are creating a GUID that has the time encoded twice, and are greatly increasing the odds of duplicate GUIDs being created.
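For the first point, a minimal sketch of one such repeatable method; "whole seconds since 2000-01-01" is an arbitrary choice for illustration, and in T-SQL the same value would be DATEDIFF(second, '2000-01-01', @dtmNow):
static byte[] DateTimeToSixBytes(DateTime value)
{
    long seconds = (long)(value - new DateTime(2000, 1, 1)).TotalSeconds;
    byte[] all = BitConverter.GetBytes(seconds);          // 8 bytes, little-endian on x86/x64
    if (BitConverter.IsLittleEndian) Array.Reverse(all);  // normalize to big-endian
    byte[] result = new byte[6];
    Array.Copy(all, 2, result, 0, 6);                     // keep the low 6 bytes
    return result;
}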
Of course, this ignores the more glaring issue of "why would anyone want to do that?" I'm going to assume that it's some really brilliant subsystem, instead of the more likely explanation that somebody is desperately trying to solve the wrong problem.
The original article has a flaw in the logic. The author describes both Natural and Surrogate keys but doesn't recognize that the RFC for UUIDs can be used to create a Natural key. Of course, doing so would require creating a custom function for generating a UUID based on some solution domain information, rather than relying on the default machine/time-based function for their generation.
Doing a single function to replace the generation of the keys makes a lot more sense than this, though.
I have some model objects that I save in a DB, serialized with protobuf. I want to compare the version I am about to save with the existing one, to avoid adding the same version twice.
Ideally I would do:
byte[] existingBlob = GetFromDBExistingModelObject();
ModelType existingModel = existingBlob.Deserialize();
if (!model.Equals(existingModel))
{
    byte[] serializedModel = model.Serialize();
    Save(serializedModel); // save the new blob in the DB
}
However, I would have to implement .Equals on every model object, and that would be quite painful. I would like to do:
byte[] existingBlob = GetFromDBExistingModelObject();
byte[] serializedModel = model.Serialize();
if (!compareBlob(existingBlob, serializedModel))
{
    Save(serializedModel);
}

private bool compareBlob(byte[] existingBlob, byte[] serializedModel)
{
    if (serializedModel.Length != existingBlob.Length)
    {
        return false;
    }
    return !serializedModel.Where((t, i) => t != existingBlob[i]).Any();
}
I also do this for performance, because I don't have to deserialize the existingBlob.
What do you think of this implementation? Can I rely on this comparison? I use protobuf for serialization.
Thanks for your comment.
protobuf-net will produce a predictable output, but strictly speaking that is not guaranteed by the spec: there are 2 edge cases (field order, and sub-normal forms† for varint encoding) that technically could produce different output with the same meaning, but protobuf-net will always produce the same output currently.
I am toying with adding an option to deliberately use sub-normal varint forms to avoid some memory shuffling, but that would be opt-in only.
So: as long as you aren't building your binary files by appending (protobuf is an appendable format, but obviously all bets are off if you are appending in arbitrary orders), then yes: the data on the wire should be predictable, and you can compare the byte sequences to test for equality.
As a minor note, I would recommend a regular for loop here, for efficiency:
if (serializedModel.Length != existingBlob.Length)
{
    return false;
}
for (int i = 0; i < serializedModel.Length; i++)
{
    if (serializedModel[i] != existingBlob[i]) return false;
}
return true;
(If you are particularly speed-crazy, you could even use unsafe code and compare it as an int* or long* instead (taking 1/4 or 1/8 of the tests), and just check the last few bytes manually.)
You might also consider comparing a hash (sha1 etc) instead of byte-by-byte; this would be especially useful for large models, especially if you can store the hashed value along with the original (so you never have to fetch the original existing BLOB - just the existing hash).
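A sketch of that idea: store the hash next to the blob and compare hashes instead of payloads (SHA-1 here is an arbitrary pick).
private static string HashBlob(byte[] blob)
{
    using (var sha1 = SHA1.Create())
    {
        return Convert.ToBase64String(sha1.ComputeHash(blob));
    }
}

// later, with 'existingHash' read from the DB alongside the row:
// if (HashBlob(serializedModel) != existingHash) Save(serializedModel);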
† : specifically, the bit-sequence 10000000 or 00000000 at the "big end" of a varint just means "and more zeros at the big end" (with or without more data to follow), so has no impact on the number; hence any (reasonable) number of 0x80 0x80 0x00 on the end of a varint does not change the result; there is a use-case where-by this could be used to avoid having to move data around, by deliberately using an oversized varint as a length-prefix.
I have an object with the following properties
GID
ID
Code
Name
Some of the clients don't want to enter the Code, so the initial plan was to put the ID in the Code, but the base object of the ORM is different, so I'm stuck there.
My plan was to put totally random ####-#### values in Code. How can I generate something like that - say, a Windows 7 serial-generator type of thing? But wouldn't that have an overhead? What would you do in this case?
Do you want a random value, or a unique value?
random != unique.
Remember, random merely states a probability of not generating the same value, or a probability of generating the same value again. As time goes on, the likelihood of generating a previous value increases - becoming a near certainty. Which do you require?
Personally, I recommend just using a Guid with some context [refer to easiest section below]. I also provided some other suggestions so you have options, depending on your situation.
easiest
If Code is an unbounded string [i.e. it can be of any length], the easiest semi-legible means of generating a unique code would be:
OrmObject ormObject = new OrmObject();
string code = string.Format("{0} [{1}]", ormObject.Name, Guid.NewGuid()).Trim();
// generates something like
// "My Product [DA9190E1-7FC6-49d6-9EA5-589BBE6E005E]"
You can substitute ormObject.Name with any distinguishable string. I would typically use objectInstance.GetType().Name, but that will only work if OrmObject is a base class; if it's a concrete class used for everything, they will all end up with similar tags. The point is to add some user context, such that - as in @Yuriy Faktorovich's referenced WTF article - users have something to read.
random
I responded a day or two ago about random number generation - not so much generating numbers as building a simple, flexible framework around a generator to improve the quality of code and data. That should help streamline your source.
If you read that, you could easily write an extension method, say
public static class IRandomExtensions
{
    public static CodeType GetCode (this IRandom random)
    {
        // 1. get as many random bytes as required
        // 2. transform bytes into a 'Code'
        // 3. bob's your uncle
        ...
    }
}

// elsewhere in code
...
OrmObject ormObject = new OrmObject();
ormObject.Code = random.GetCode();
...
To actually generate a value, I would suggest implementing an IRandom interface with a System.Security.Cryptography.RNGCryptoServiceProvider implementation. Said implementation would generate a buffer of X random bytes and dole out as many as required, refilling the buffer when it is exhausted.
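A minimal sketch of that buffered implementation, assuming the IRandom interface exposes just a GetBytes method (the interface shape is hypothetical):
using System.Security.Cryptography;

public interface IRandom
{
    byte[] GetBytes(int count);
}

public sealed class CryptoRandom : IRandom
{
    private readonly RNGCryptoServiceProvider _rng = new RNGCryptoServiceProvider();
    private readonly byte[] _buffer = new byte[1024];
    private int _offset = 1024; // forces a fill on first use

    public byte[] GetBytes(int count)
    {
        byte[] result = new byte[count];
        for (int i = 0; i < count; i++)
        {
            if (_offset == _buffer.Length)
            {
                _rng.GetBytes(_buffer); // refill the exhausted buffer
                _offset = 0;
            }
            result[i] = _buffer[_offset++];
        }
        return result;
    }
}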
Furthermore - I don't know why I keep writing, I guess this problem is really quite fascinating! - if CodeType is in fact a string and you want something readable, you could just take said random bytes and turn them into a "seemingly" readable string via base64 conversion:
public static class IRandomExtensions
{
    // assuming 'CodeType' is in fact a string
    public static string GetCode (this IRandom random)
    {
        // 1. get as many random bytes as required
        byte[] randomBytes; // fill from random
        // 2. transform bytes into a 'Code'
        string randomBase64String =
            System.Convert.ToBase64String(randomBytes).Trim('=');
        // 3. bob's your uncle
        ...
    }
}
Remember
random != unique.
Your values will repeat. Eventually.
unique
There are a number of questions you need to ask yourself about your problem.
Must all Code values be unique? [if not, you're trying too hard]
What Type is Code? [if any-length string, use a full Guid]
Is this a distributed application? [if not, use a DB value as suggested by @LBushkin above]
If it is a distributed application, can client applications generate and submit instances of these objects? [if so, then you want a globally unique identifier, and again Guids are a sure bet]
I'm sure you have more constraints, but this is an example of the kind of line of inquiry you need to perform when you encounter a problem like your own. From these questions, you will come up with a series of constraints. These constraints will inform your design.
Hope this helps :)
Btw, you will receive better-quality solutions if you post more details [i.e. constraints] about your problem. Again, what Type is Code? Are there length constraints? Format constraints? Character constraints?
Arg, last edit, I swear. If you do end up using Guids, you may wish to obfuscate this, or even "compress" their representation by encoding them in base64 - similar to base64 conversion above for random numbers.
public static class GuidExtensions
{
    public static string ToBase64String (this Guid id)
    {
        return System.Convert.ToBase64String(id.ToByteArray()).Trim('=');
    }
}
Unlike truncating, base64 conversion is not a lossy transformation. Of course, the trim above is lossy in the context of full base64 expansion - but '=' is just padding, extra information introduced by the conversion, and not part of the original Guid data. If you want to go back to a Guid from this base64-converted value, you will have to re-pad your base64 string until its length is a multiple of 4 - don't ask, just look up base64 if you are interested :)
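A sketch of that reverse trip: re-pad to the next multiple of 4, then decode back into a Guid.
public static Guid FromBase64String(string trimmed)
{
    // pad back up to the next multiple of 4 with '='
    string padded = trimmed.PadRight((trimmed.Length + 3) / 4 * 4, '=');
    return new Guid(System.Convert.FromBase64String(padded));
}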
You could generate a Guid using:
Guid.NewGuid().ToString();
It would give you something like:
788E94A0-C492-11DE-BFD4-FCE355D89593
Use an autonumber column or sequencer from your database to generate a unique code number. Almost all modern databases support automatically generated numbers in one form or another; look into what your database supports.
Autonumber/sequencer values from the DB are guaranteed to be unique and are relatively inexpensive to acquire. If you want to avoid assigning completely sequential numbers to codes, you can pad and concatenate several sequencer values together.
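For example (a sketch, assuming two sequencer values fetched through a hypothetical GetNextSequenceValue helper):
long seq1 = GetNextSequenceValue("seq_code_a"); // hypothetical DB helper
long seq2 = GetNextSequenceValue("seq_code_b");
string code = string.Format("{0:D4}-{1:D4}", seq1, seq2); // e.g. "0042-7319", matching the ####-#### shape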