From this sample code from MSDN
http://msdn.microsoft.com/en-us/library/system.string.gethashcode.aspx
The hash code for "abc" is: 536991770
But how can I convert "536991770" back to "abc"?
There is no way to get the value back from the hash code. See the definition of a hash function.
Hash values are not used to uniquely identify the original value; hash values are not unique for each input value.
A hash function may map two or more
keys to the same hash value. In many
applications, it is desirable to
minimize the occurrence of such
collisions, which means that the hash
function must map the keys to the hash
values as evenly as possible.
You cannot. Hashes are one way.
The thing with hashes is that you lose information. Regardless of the length of the string, the result is always an integer. This means, for example, that the hash of a string of 10,000 characters is also just an integer. It is of course impossible to get the original string back from this integer.
There is no way to "decrypt" the hash code. Amongst other reasons, because two different strings may very well produce the same hash code. That feature alone would make it impossible to reverse the process.
You cannot.
Even if you had a table of every string in the world and its hash code, you wouldn't be able to do it: there are more strings than ints (about 4 billion ints), so several strings must produce the same hash code.
I'm looking for a hash function that maps strings to ints with the following restrictions:
Same strings go to same number.
Different strings go to different numbers.
During one run of the application all the strings have the same length, but the length is only known at runtime.
Any suggestions on how to create the hash function?
A hash function never guarantees that two different values (strings in your case) yield different hash codes. However, equal values will always yield the same hash code.
This is because information gets lost. A string of 32 characters takes 64 bytes (2 bytes per char), while an int hash code has only four bytes. Different strings must therefore sometimes share a hash code; this is inevitable and is called a collision.
Note: Dictionary<TKey,TValue> uses a hash table internally and therefore implements a collision resolution strategy. See An Extensive Examination of Data Structures Using C# 2.0 on MSDN.
Here is the current implementation of dictionary.cs.
You aren't going to find a hash algorithm that guarantees that the same integer won't be returned for different strings. By definition, hash algorithms have collisions. There are far more possible strings in the world than there are possible 32-bit integers.
Different strings go to different numbers.
There are more strings than there are numbers, so this is flat out impossible without restricting the input set. You can't put n pigeons in m boxes with n > m without having at least one box contain more than one pigeon.
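The pigeonhole argument is easy to watch in action if you shrink the hash. Here is a minimal sketch, assuming a made-up 16-bit toy hash (TinyHash is hypothetical and is not how String.GetHashCode works). With only 65,536 possible outputs, the search below has to find two different strings with the same hash within 65,537 tries.

```csharp
using System;
using System.Collections.Generic;

static class PigeonholeDemo
{
    // Hypothetical 16-bit toy hash, used only for illustration.
    public static ushort TinyHash(string s)
    {
        unchecked
        {
            uint h = 17u;
            foreach (char c in s)
                h = h * 31u + (uint)c;
            // Fold the 32-bit state down to 16 bits.
            return (ushort)(h ^ (h >> 16));
        }
    }

    // Hashes "s0", "s1", ... until two different strings share a hash value.
    public static (string First, string Second) FindCollision()
    {
        var seen = new Dictionary<ushort, string>();
        for (int i = 0; ; i++)
        {
            string candidate = "s" + i;
            ushort h = TinyHash(candidate);
            if (seen.TryGetValue(h, out string earlier))
                return (earlier, candidate);
            seen[h] = candidate;
        }
    }

    static void Main()
    {
        var (a, b) = FindCollision();
        Console.WriteLine($"\"{a}\" and \"{b}\" both hash to {TinyHash(a)}");
    }
}
```

A 32-bit hash only pushes the limit out to 2^32 + 1 distinct strings; it does not remove it.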
Is the String.GetHashCode function not right for your needs?
On this blog post, there is a sentence as below:
This hash is unique for the given text. If you use the hash function
on the same text again, you'll get the same hash. But there is no way
to get the given text from the hash.
Forgive my ignorance of math, but I cannot understand why it is not possible to get the given text from the hash.
I would understand it if we used one key to encrypt the value and another to decrypt it, but I cannot figure this one out. What is really going on behind the scenes?
Anything that clears my mind will be appreciated.
Hashing is not encryption.
A hash produces a "digest" - a summary of the input. Whatever the input size, the hash size is always the same (see how MD5 returns the same size result for any input size).
With a hash, you can get the same hash from several different inputs (hash collisions) - how would you reverse this? Which is the correct input?
I suggest reading this blog post from Troy Hunt on the matter in order to gain better understanding of hashes, passwords and security.
Encryption is a different thing: the input and the key together produce a ciphertext, and the ciphertext tends to grow as the input grows. It is reversible if you have the right key.
Update (following the different comments):
Though collisions can happen, when using a cryptographically strong hash (like the ones you have posted about), they will be rare and difficult to produce.
When hashing passwords, always use a salt - this reduces the chances of the hash being reversed by rainbow tables to almost nothing (assuming a good salt has been used).
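To make the salting advice concrete, here is a rough sketch of salted password hashing with PBKDF2 via Rfc2898DeriveBytes. The salt size, output size and iteration count are my own assumptions, not recommendations from this answer, and this is not a description of what the membership provider does internally. It needs a runtime where the constructor taking a HashAlgorithmName is available (.NET Framework 4.7.2+ or .NET Core 2.0+); the newest runtimes may flag the constructor as obsolete.

```csharp
using System;
using System.Security.Cryptography;

static class SaltedPasswordSketch
{
    const int SaltSize = 16;        // bytes (assumed value)
    const int HashSize = 32;        // bytes (assumed value)
    const int Iterations = 100000;  // assumed value; tune to your hardware

    // A fresh random salt per password means identical passwords get different hashes.
    public static byte[] NewSalt()
    {
        var salt = new byte[SaltSize];
        using (var rng = RandomNumberGenerator.Create())
            rng.GetBytes(salt);
        return salt;
    }

    public static byte[] Hash(string password, byte[] salt)
    {
        using (var kdf = new Rfc2898DeriveBytes(password, salt, Iterations, HashAlgorithmName.SHA256))
            return kdf.GetBytes(HashSize);
    }

    // Verification re-hashes the attempt with the stored salt; nothing is ever "decrypted".
    public static bool Verify(string attempt, byte[] salt, byte[] expected)
    {
        byte[] actual = Hash(attempt, salt);
        int diff = actual.Length ^ expected.Length;
        for (int i = 0; i < actual.Length && i < expected.Length; i++)
            diff |= actual[i] ^ expected[i];   // compare every byte, no early exit
        return diff == 0;
    }

    static void Main()
    {
        byte[] salt = NewSalt();
        byte[] stored = Hash("correct horse", salt);
        Console.WriteLine(Verify("correct horse", salt, stored));
        Console.WriteLine(Verify("wrong guess", salt, stored));
    }
}
```

You store the salt next to the hash. The salt is not secret; its job is to make precomputed tables useless.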
You need to decide about the tradeoffs of the cost of hashing (can be processor intensive) and the cost of what you are protecting.
As you are simply protecting the login details, using the .NET membership provider should provide enough security.
Hash functions are many to one functions. This means that many inputs will give the same result but that for any given input you get one and only one result.
Why this is so can be intuitively seen by considering a hash function that takes a string input of any length and generates a 32 bit integer. There are obviously far more strings than 2^32 which means that your hash function cannot give each input string a unique output. (see http://en.wikipedia.org/wiki/Pigeonhole_principle for more discussion - the Uses and applications section specifically talks about hashes)
Given that any result from our hash function could have been generated from more than one input, and that we have no information other than the result, we have no way to determine which input was used, so it cannot be reversed.
There are at least two reasons:
Hashing usually uses one-way functions for its calculations, meaning that reversing an operation is MUCH more difficult (in time, resources and effort) than performing it in the forward direction.
Hashes produced by the same algorithm are always of the same length, meaning there is a limited set of possible hashes. This means that for every hash there is an infinite number of collisions: different source data blocks that produce the same hash value.
It's not encryption/decryption. For example, take this simple hash function:
int hash(int data)
{
    return data % 2;
}
Problem?
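To spell out the problem with the toy function in this answer, here is a small sketch that lists every input below some limit that produces a given hash value. The class and the PreImages helper are my own naming, purely for illustration.

```csharp
using System;
using System.Collections.Generic;

static class ParityHashDemo
{
    // The toy hash from this answer.
    public static int Hash(int data) => data % 2;

    // Collects every input in [0, limit) that maps to the given hash value.
    public static List<int> PreImages(int hashValue, int limit)
    {
        var result = new List<int>();
        for (int i = 0; i < limit; i++)
            if (Hash(i) == hashValue)
                result.Add(i);
        return result;
    }

    static void Main()
    {
        // Prints: 0, 2, 4, 6, 8
        Console.WriteLine(string.Join(", ", PreImages(0, 10)));
    }
}
```

Given only the hash value 0, there is no way to tell which of those inputs was the original.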
Hashing is like using a checksum to verify data, not to encrypt or compress data.
This is essentially math: a hash function is a function that is NOT one-to-one. It takes inputs from the set of all binary data B* and maps them into the set B^n of binary strings of some fixed length n. (A function like this can still be onto.)
You can try to calculate a pre-image of a given hash via brute force, but without knowing the size of the input, the search space is infinite.
You can hash any length of data you want, from a single byte to a terabyte file. All possible data can be hashed to a 256 bit value (taking SHA-256 as an example). That means that there are 2^256 possible values output from the SHA-256 hash algorithm. However, there are a lot more than 2^256 possible values that can be input to SHA-256. You can input any combination of bytes for any length you want.
Because there are far more possible inputs than possible outputs, then some of the inputs must generate the same output. Since you don't know which of the many possible inputs generated the output, it is not possible to reliably go backwards.
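A quick way to convince yourself of the fixed output size is to hash inputs of very different lengths. This sketch uses SHA256 from System.Security.Cryptography; the helper name Sha256Of is mine.

```csharp
using System;
using System.Security.Cryptography;

static class DigestSizeDemo
{
    public static byte[] Sha256Of(byte[] input)
    {
        using (var sha = SHA256.Create())
            return sha.ComputeHash(input);
    }

    static void Main()
    {
        byte[] oneByte = { 0x61 };
        byte[] oneMegabyte = new byte[1024 * 1024];

        // Both digests are 32 bytes (256 bits), whatever the input size.
        Console.WriteLine(Sha256Of(oneByte).Length);
        Console.WriteLine(Sha256Of(oneMegabyte).Length);
    }
}
```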
A very simple hash algorithm would be to take the first character of each word within a text. If you take the same text you will always get the same hash, but it is impossible to rebuild the original text from only the first character of each word.
Example hash from my answer above:
AvshawbtttfcoewwatIyttstycagotshbisitrtotfohtfcoew
And now try to find out the corresponding text from the given hash. ;-)
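For anyone who wants to play with it, the first-letter hash can be sketched in a few lines. Splitting on whitespace is my assumption about what counts as a word, and the second sentence is just something I made up to show two different texts landing on the same hash.

```csharp
using System;
using System.Linq;

static class FirstLetterHash
{
    // Keeps only the first character of each whitespace-separated word.
    public static string Hash(string text)
    {
        string[] words = text.Split(
            new[] { ' ', '\t', '\r', '\n' },
            StringSplitOptions.RemoveEmptyEntries);
        return new string(words.Select(w => w[0]).ToArray());
    }

    static void Main()
    {
        Console.WriteLine(Hash("A very simple hash"));    // Avsh
        Console.WriteLine(Hash("All voters stay home"));  // Avsh
    }
}
```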
I'm implementing data serialization and I've encountered a problem.
I've got:
4 byte fields:
Values range 0-255
Values range 0- 4
Values range 0-255
Values range 0- 100
and 1 int field (only positive values)
My idea is to convert everything to a byte array (length 8) or an int array (length 2) and call C#'s GetHashCode method on it.
Is GetHashCode strong enough to use as an identifier for this data?
Or does someone have a better idea, maybe?
GetHashCode isn't meant to create a unique identifier. Its primary use is assigning values to buckets in hashed data structures (like Hashtable); see http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/. When I need a unique identifier for an object, and for some reason the object itself doesn't provide one, I usually just fall back on GUIDs. They are trivial to generate in C# and, for all practical purposes, unique within the scope of whatever you're doing.
GetHashCode is purely for hashing in dictionaries. You should not use it as an identifier anywhere because of possible hash collisions. It returns an Int32, and for String, for example, there are clearly more than 4,294,967,296 (2^32) possible strings, so two different strings can have the same hash code. Having said that, you have two options:
1) If you need your identifier to be derived from the actual values (for example, to quickly tell whether a new object has already been persisted without deserializing all the objects and comparing them to the object in question), you can use ComputeHash on SHA1, for example.
2) If you don't need the identifier to be derived from the actual values, you can simply generate a Guid, as bbogovich has suggested.
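Here is a rough sketch of both options for the data in the question (four byte fields and one int). The class name, method names and byte layout are my own assumptions; only SHA1.Create().ComputeHash and Guid.NewGuid are real framework calls.

```csharp
using System;
using System.Security.Cryptography;

static class IdentifierSketch
{
    // Option 1: identifier derived from the values (same values, same identifier).
    public static string ContentId(byte a, byte b, byte c, byte d, int number)
    {
        // Assumed layout: the four bytes, then the int in little-endian order.
        byte[] payload =
        {
            a, b, c, d,
            (byte)number, (byte)(number >> 8), (byte)(number >> 16), (byte)(number >> 24)
        };
        using (var sha1 = SHA1.Create())
            return BitConverter.ToString(sha1.ComputeHash(payload)).Replace("-", "");
    }

    // Option 2: identifier unrelated to the values.
    public static Guid RandomId() => Guid.NewGuid();

    static void Main()
    {
        Console.WriteLine(ContentId(255, 4, 255, 100, 12345));
        Console.WriteLine(RandomId());
    }
}
```

The SHA-1 result is 20 bytes, so it is still a hash and collisions exist in principle, but an accidental one is far less likely than with a 32-bit GetHashCode value.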
The GetHashCode() value for an int, or for a non-negative long no larger than int.MaxValue, is the same as the value itself. For arrays, however, the hash code is not computed from the contents, so it is not stable. So don't use it.
Why not convert the entire structure to a long and use that?
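Since the four byte fields plus the int add up to exactly 8 bytes, they fit in a long with nothing lost. A minimal sketch, with a field order that I picked arbitrarily and names that are hypothetical:

```csharp
using System;

static class LongPacking
{
    // Four bytes in the high 32 bits, the int in the low 32 bits.
    public static long Pack(byte a, byte b, byte c, byte d, int number)
    {
        return ((long)a << 56) | ((long)b << 48) | ((long)c << 40) | ((long)d << 32)
             | (uint)number;
    }

    // Reverses Pack exactly; this works because no information was thrown away.
    public static (byte A, byte B, byte C, byte D, int Number) Unpack(long packed)
    {
        unchecked
        {
            return ((byte)(packed >> 56), (byte)(packed >> 48),
                    (byte)(packed >> 40), (byte)(packed >> 32),
                    (int)(packed & 0xFFFFFFFF));
        }
    }

    static void Main()
    {
        long id = Pack(255, 4, 255, 100, 12345);
        Console.WriteLine(id);
        Console.WriteLine(Unpack(id));
    }
}
```

Unlike a hash, this mapping is one-to-one, so the long really is a unique identifier for the five values.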
I use some identity classes/structs that contains 1-2 ints, maybe a datetime or a small string as well. I use these as keys in a dictionary.
What would be a good override of GetHashCode for something like this? Something quite simple but still somewhat performant hopefully.
Thanks
Take a look into Essential C#.
It contains a detailed description of how to override GetHashCode() correctly.
Extract from the book
The purpose of the hash code is to efficiently balance a hash table by generating a number that corresponds to the value of an object.
Required: Equal objects must have equal hash codes (if a.Equals(b), then a.GetHashCode() == b.GetHashCode())
Required: GetHashCode()'s returns over the life of a particular object should be constant (the same value), even if the object's data changes. In many cases, you should cache the method return to enforce this.
Required: GetHashCode() should not throw any exceptions; GetHashCode() must always successfully return a value.
Performance: Hash codes should be unique whenever possible. However, since hash codes return only an int, there has to be an overlap in hash codes for objects that have potentially more values than an int can hold -- virtually all types. (An obvious example is long, since there are more possible long values than an int could uniquely identify.)
Performance: The possible hash code values should be distributed evenly over the range of an int. For example, creating a hash that doesn't consider the fact that distribution of a string in Latin-based languages primarily centers on the initial 128 ASCII characters would result in a very uneven distribution of string values and would not be a strong GetHashCode() algorithm.
Performance: GetHashCode() should be optimized for performance. GetHashCode() is generally used in Equals() implementations to short-circuit a full equals comparison if the hash codes are different. As a result, it is frequently called when the type is used as a key type in dictionary collections.
Performance: Small differences between two objects should result in large differences between hash codes values -- ideally, a 1-bit difference in the object results in around 16 bits of the hash code changing, on average. This helps ensure that the hash table remains balanced no matter how it is "bucketing" the hash values.
Security: It should be difficult for an attacker to craft an object that has a particular hash code. The attack is to flood a hash table with large amounts of data that all hash to the same value. The hash table implementation then becomes O(n) instead of O(1), resulting in a possible denial-of-service attack.
As already mentioned here, you also have to consider some points about overriding Equals(), and there are some code examples showing how to implement these two functions.
This information should give you a starting point, but I recommend buying the book and reading the complete chapter 9 (at least the first twelve pages) to get all the points on how to correctly implement these two crucial functions.
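As a starting point for the kind of key described in the question (one or two ints plus a small string), here is a minimal sketch. OrderKey and its fields are hypothetical, and the 17/31 multiply-and-add pattern is a common convention I am swapping in, not the book's code. On newer runtimes System.HashCode.Combine does a similar job.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical immutable identity type used as a dictionary key.
public struct OrderKey : IEquatable<OrderKey>
{
    public int CustomerId { get; }
    public int OrderNumber { get; }
    public string Region { get; }

    public OrderKey(int customerId, int orderNumber, string region)
    {
        CustomerId = customerId;
        OrderNumber = orderNumber;
        Region = region;
    }

    public bool Equals(OrderKey other) =>
        CustomerId == other.CustomerId &&
        OrderNumber == other.OrderNumber &&
        string.Equals(Region, other.Region, StringComparison.Ordinal);

    public override bool Equals(object obj) => obj is OrderKey other && Equals(other);

    // Uses exactly the fields that Equals compares, so equal keys get equal hash codes.
    public override int GetHashCode()
    {
        unchecked // integer overflow is fine here, it just wraps around
        {
            int hash = 17;
            hash = hash * 31 + CustomerId;
            hash = hash * 31 + OrderNumber;
            hash = hash * 31 + (Region != null ? Region.GetHashCode() : 0);
            return hash;
        }
    }
}

static class OrderKeyDemo
{
    static void Main()
    {
        var lookup = new Dictionary<OrderKey, string>();
        lookup[new OrderKey(1, 42, "EU")] = "first order";

        // A separately constructed but equal key finds the same entry.
        Console.WriteLine(lookup[new OrderKey(1, 42, "EU")]);
    }
}
```

Making the type immutable keeps the hash code constant for the life of the object, which takes care of the second "Required" rule without any caching.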
I am retrieving lists of CRC32 hashes of file names, not of their contents.
I need to be able to decrypt the strings, which are hashed names like "vacationplans_2010.txt"
and are less than 25 characters long.
Is this possible?
It is a one-way hash function. It can't be decrypted.
Despite what other users answered, CRC32 is not a cryptographic hash function; it is meant for integrity checks (data checksums).
Cryptographic hash functions are often described as "one-way hash functions"; CRC32 lacks the "one-way" part.
That being said, you should consider the following: since the set of all possible file names of 25 characters or fewer has more than 2^32 members, some file names are bound to have the same hash value. Therefore, for some of the CRC32 values you get, there may be several possible sources (file names). You will need a way to determine the "real" source (I assume that a human decision would be the best choice, since our brain is a great pattern-recognition device, but it really depends on your scenario).
Several methods can be used to partially achieve what you are asking for. Brute force is one of them (although with file names up to 25 characters long, brute force may take a while). A modified dictionary attack is another option. Other options are based on analysis of the CRC32 algorithm and will require that you dive into its implementation details (otherwise you'll have a hard time understanding what you're implementing). For example, see this article, or this article.
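To give an idea of the dictionary-attack route, here is a rough sketch. It assumes the hashes were produced by the common CRC-32 variant (reflected polynomial 0xEDB88320, as used by zip and PNG) over ASCII bytes; the original poster's source may use a different variant or encoding. The guess list is invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class Crc32Guess
{
    static readonly uint[] Table = BuildTable();

    // Standard lookup table for the reflected CRC-32 polynomial.
    static uint[] BuildTable()
    {
        var table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint c = i;
            for (int bit = 0; bit < 8; bit++)
                c = (c & 1u) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            table[i] = c;
        }
        return table;
    }

    public static uint Compute(string text)
    {
        uint crc = 0xFFFFFFFFu;
        foreach (byte b in Encoding.ASCII.GetBytes(text))
            crc = Table[(crc ^ b) & 0xFFu] ^ (crc >> 8);
        return crc ^ 0xFFFFFFFFu;
    }

    // Returns every guess whose CRC-32 equals the target; there may be more than one.
    public static List<string> Candidates(uint target, IEnumerable<string> guesses)
    {
        var matches = new List<string>();
        foreach (string guess in guesses)
            if (Compute(guess) == target)
                matches.Add(guess);
        return matches;
    }

    static void Main()
    {
        uint target = Compute("vacationplans_2010.txt"); // stands in for a retrieved hash
        var guesses = new[] { "budget_2010.txt", "vacationplans_2009.txt", "vacationplans_2010.txt" };
        Console.WriteLine(string.Join(", ", Candidates(target, guesses)));
    }
}
```

A match only tells you the guess is a possible source. With a large enough guess list you can get several matches for one hash, which is where the human decision mentioned above comes in.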
EDIT: definitions by Bruce Schneier (author of Applied Cryptography, among other things):
One-way functions are relatively easy
to compute, but significantly harder
to reverse. … . In this context,
"hard" is defined as something like:
It would take millions of years to
compute x from f(x), even if all the
computers in the world were assigned
to the problem.
A hash function is a function,
mathematical or otherwise, that takes
a variable-length input string
(called a pre-image) and converts it
to a fixed length (generally smaller)
output string (called a hash value).
The security of a one-way hash
function is its one-wayness.
A hash function like CRC32 calculates a simple value given (variable) input. The calculation is not reversible - i.e. you cannot reliably get the original value given only the hash.
Yep, the general method is to find out the rule by which your hash/encrypt result comes out the same as