From a list of integers in C#, I need to generate a list of unique values. I considered MD5 or something similar, but those generate too many bytes.
Integer size is 2 bytes.
I want to get a one way correspondence, for example
0 -> ARY812Q3
1 -> S6321Q66
2 -> 13TZ79K2
So, given the hash, the user cannot recover the integer or infer a sequence behind a list of hashes.
For now, I have tried MD5(my number) and kept the first 8 characters. However, I found the first collision at 51389. What other alternatives could I use?
As I say, I only need one way. It is not necessary to be able to calculate the integer from the hash. The system uses a dictionary to find them.
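A collision that early is what the birthday bound predicts: 8 hex characters carry only 32 bits, so a first collision is expected after roughly 2^16 inputs (tens of thousands). The following is a minimal sketch for reproducing the experiment. It assumes the number is hashed as its decimal ASCII text, which the question does not state; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public static class TruncatedMd5
{
    // First `hexChars` hex characters of MD5 over the decimal text of n.
    // Hashing the decimal text is an assumption, not the asker's known method.
    public static string Prefix(int n, int hexChars)
    {
        using (var md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(Encoding.ASCII.GetBytes(n.ToString()));
            return BitConverter.ToString(digest).Replace("-", "").Substring(0, hexChars);
        }
    }

    // Returns the first n in [0, limit) whose prefix was already seen, or -1.
    public static int FirstCollision(int hexChars, int limit)
    {
        var seen = new HashSet<string>();
        for (int n = 0; n < limit; n++)
        {
            if (!seen.Add(Prefix(n, hexChars))) return n;
        }
        return -1;
    }
}
```

Calling TruncatedMd5.FirstCollision(8, 65536) scans the whole 16-bit range; a result of -1 would mean no collision occurred in that range under this particular encoding.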
UPDATE:
Replying to some suggestions about using GetHashCode(): for an int, GetHashCode returns the same integer. My purpose is to hide the integer from the end user. In this case, the integer is the primary key of a database table. I do not want to give this information to users because they could deduce the number of records in the database or the growth in records per week.
Hashes are not unique, so maybe I need to use encryption like TripleDes or so, but I wanted to use something fast and simple. Also, TripleDes returns too many bytes too.
UPDATE 2:
I was talking about hashes, and that was a mistake. In reality, I am trying to obfuscate the integer. I tried to do it with a hash algorithm, which is not a good idea because hashes are not unique.
Update May 2017
Feel free to use (or modify) the library I developed, installable via Nuget with:
Install-Package Kent.Cryptography.Obfuscation
This converts a non-negative id such as 127 to an 8-character string, e.g. xVrAndNb, and back (with some available options to randomize the sequence each time it's generated).
Example Usage
var obfuscator = new Obfuscator();
string maskedID = obfuscator.Obfuscate(15);
Full documentation at: Github.
Old Answer
I came across this problem a while back and couldn't find what I wanted on StackOverflow, so I wrote this obfuscation class and shared it on GitHub.
Obfuscation.cs - Github
You can use it by:
Obfuscation obfuscation = new Obfuscation();
string maskedValue = obfuscation.Obfuscate(5);
int? value = obfuscation.DeObfuscate(maskedValue);
Perhaps it can be of help to future visitors :)
Encrypt it with Skip32, which produces a 32 bit output. I found this C# implementation but can't vouch for its correctness. Skip32 is a relatively uncommon crypto choice and probably hasn't been analyzed much. Still it should be sufficient for your obfuscation purposes.
The strong choice would be format preserving encryption using AES in FFX mode. But that's pretty complicated and probably overkill for your application.
When encoded with Base32 (case insensitive, alphanumeric) a 32 bit value corresponds to 7 characters. When encoded in hex, it corresponds to 8 characters.
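To illustrate the idea of encrypting a 32-bit value into another 32-bit value, here is a toy Feistel permutation. This is not Skip32 and it has not been analyzed for security; the HMAC-SHA256 round function, the four rounds, and all names are assumptions chosen for illustration. What it does show is that a keyed permutation never collides, unlike a truncated hash, and that the result formats to 8 hex characters.

```csharp
using System;
using System.Security.Cryptography;

public static class FeistelObfuscator
{
    // Round function (assumption): first two bytes of HMAC-SHA256(key, round || half).
    private static ushort Round(byte[] key, int round, ushort half)
    {
        using (var hmac = new HMACSHA256(key))
        {
            byte[] input = { (byte)round, (byte)(half >> 8), (byte)half };
            byte[] mac = hmac.ComputeHash(input);
            return (ushort)((mac[0] << 8) | mac[1]);
        }
    }

    // Each round maps (L, R) to (R, L xor F(R)), which is invertible for any F.
    public static uint Encrypt(uint value, byte[] key, int rounds = 4)
    {
        ushort left = (ushort)(value >> 16), right = (ushort)value;
        for (int i = 0; i < rounds; i++)
        {
            ushort next = (ushort)(left ^ Round(key, i, right));
            left = right;
            right = next;
        }
        return ((uint)left << 16) | right;
    }

    // Runs the rounds backwards to undo Encrypt.
    public static uint Decrypt(uint value, byte[] key, int rounds = 4)
    {
        ushort left = (ushort)(value >> 16), right = (ushort)value;
        for (int i = rounds - 1; i >= 0; i--)
        {
            ushort prev = (ushort)(right ^ Round(key, i, left));
            right = left;
            left = prev;
        }
        return ((uint)left << 16) | right;
    }

    // 32 bits always format to exactly 8 hex characters.
    public static string ToToken(uint id, byte[] key)
    {
        return Encrypt(id, key).ToString("X8");
    }
}
```

For anything beyond casual obfuscation, a vetted cipher such as the ones named above would be the safer choice.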
There is also the non cryptographic alternative of generating a random value, storing it in the database and handling collisions.
XOR the integer, perhaps with a random key that is generated per user (stored in the session). While it's not strictly a hash (as it is reversible), the advantages are that you don't need to store it anywhere, and the size will be the same.
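A minimal sketch of the XOR approach, assuming a 16-bit id and a 16-bit per-user key (both names are hypothetical):

```csharp
public static class XorMask
{
    // XOR is its own inverse, so the same operation masks and unmasks.
    public static ushort Mask(ushort id, ushort key)
    {
        return (ushort)(id ^ key);
    }

    public static ushort Unmask(ushort masked, ushort key)
    {
        return (ushort)(masked ^ key);
    }
}
```

One caveat worth keeping in mind: Mask(a) XOR Mask(b) equals a XOR b whatever the key is, so two masked values seen under the same key still reveal how the underlying ids differ bit by bit.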
For what you want, I'd recommend using GUIDs (or other kind of unique identifier where the probability of collision is either minimal or none) and storing them in the database row, then just never show the ID to the user.
IMHO, it's kind of bad practice to ever show the primary key in the database to the user (much less to let users do any kind of operations on them).
If they need to have raw access to the database for some reason, then just don't use ints as primary keys, and make them guids (but then your requirement loses importance since they can just access the number of records)
Edit
Based on your requirements, if you don't mind that the algorithm is potentially computationally expensive, you can just generate a random 8-byte string every time a new row is added, and keep generating random strings until you find one that is not already in the database.
This is far from optimal, and -can- be computationally expensive, but given that you use a 16-bit id and the maximum number of rows is 65536, I wouldn't worry too much about it (the chance that a random 8-byte string is already among at most 65536 stored values is minimal, so you'll probably succeed on the first or at most the second try, if your pseudo-random generator is good).
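The generate-and-retry loop described above could look like the sketch below. The in-memory set stands in for a unique column in the database, and the 32-symbol alphabet and all names are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

public static class RandomTokens
{
    // 32 symbols: 256 is a multiple of 32, so byte % 32 has no modulo bias.
    private const string Alphabet = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789";

    public static string NewToken(int length)
    {
        var bytes = new byte[length];
        using (var rng = RandomNumberGenerator.Create())
        {
            rng.GetBytes(bytes);
        }
        var chars = new char[length];
        for (int i = 0; i < length; i++)
        {
            chars[i] = Alphabet[bytes[i] % Alphabet.Length];
        }
        return new string(chars);
    }

    // `existing` is a stand-in for the table's unique column (assumption).
    // Keeps drawing until Add succeeds, i.e. the token was not present yet.
    public static string NewUniqueToken(ISet<string> existing, int length = 8)
    {
        string token;
        do
        {
            token = NewToken(length);
        } while (!existing.Add(token));
        return token;
    }
}
```

Against a real database, the uniqueness check would typically be a unique index plus a retry on constraint violation rather than a lookup followed by an insert.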
Related
I need a random string of 32 characters to be used as salt for hashing some value. This random string is generated per user.
What is the difference between generating a guid per user and using the RNGCryptoServiceProvider?
It's the difference between generating a unique key and generating 32 random characters. That's about it. Do what you intend to do.
If you need some way of identifying that user uniquely, even if databases are merged, use a GUID. If you need a salt for hashing a password, then use a random byte[]. Neither of them works well in the other context.
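As a rough sketch of the two tools side by side (the class and method names are made up for the example): a GUID for identity, and bytes from the cryptographic random number generator for the salt. 24 random bytes happen to encode to exactly 32 Base64 characters, which matches the length asked for.

```csharp
using System;
using System.Security.Cryptography;

public static class IdsAndSalts
{
    // Unique identifier for the user; meant to be unique, not unpredictable.
    public static Guid NewUserId()
    {
        return Guid.NewGuid();
    }

    // 24 bytes * 4/3 = 32 Base64 characters, with no '=' padding.
    public static string NewSalt()
    {
        var bytes = new byte[24];
        using (var rng = RandomNumberGenerator.Create())
        {
            rng.GetBytes(bytes);
        }
        return Convert.ToBase64String(bytes);
    }
}
```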
After reading this, I understood the difference
http://blogs.msdn.com/b/oldnewthing/archive/2012/05/23/10309199.aspx
GUIDs are designed to be unique, not random
The GUID generation algorithm was designed for uniqueness. It was not designed for randomness or for unpredictability. Indeed, if you
look at an earlier discussion, you can see that so-called Algorithm 1
is non-random and totally predictable. If you use an Algorithm 1 GUID
generator to assign GUIDs to candidates, you'll find that the GUIDs
are assigned in numerically ascending order (because the timestamp
increases). The customer's proposed algorithm would most likely end up
choosing for jury duty the first N people entered into the system
after a 32-bit timer rollover. Definitely not random.
Similarly, the person who wanted to use a GUID for password generation would find that the passwords are totally predictable if
you know what time the GUID was generated and which computer generated
the GUID (which you can get by looking at the final six bytes from
some other password-GUID). Totally-predictable passwords are probably
not a good idea.
For my employer, I have to present customers of a web app with checksums for certain files they download.
I'd like to present the user with the hash their client tools are also likely to generate, hence I have been comparing online hashing tools. My question is regarding their form of hashing, since they differ, strangely enough.
After a quick search I tested with 5:
http://www.convertstring.com/Hash/SHA256
http://www.freeformatter.com/sha256-generator.html#ad-output
http://online-encoder.com/sha256-encoder-decoder.html
http://www.xorbin.com/tools/sha256-hash-calculator
http://www.everpassword.com/sha-256-generator
Entering the value 'test' (without 'enter' after it) all 5 give me the same SHA256 result. However, and here begins the peculiar thing, when I enter the value 'test[enter]test' (so two lines) online tool 1, 2 and 3 give me the same SHA256 hash, and site 4 and 5 give me a different one (so 1, 2 and 3 are equal, and 4 and 5 are equal). This most likely has to do with the way the tool, or underlying code handles \r\n, or at least I think so.
Coincidentally, site 1, 2 and 3 present me with the same hash as my C# code does:
var sha256Now = ComputeHash(Encoding.UTF8.GetBytes("test\r\ntest"), new SHA256CryptoServiceProvider());
private static string ComputeHash(byte[] inputBytes, HashAlgorithm algorithm)
{
var hashedBytes = algorithm.ComputeHash(inputBytes);
return BitConverter.ToString(hashedBytes);
}
The question is: which sites are 'right'?
Is there any way to know if a hash is compliant with the standard?
UPDATE1: Changed the encoding to UTF8. This has no influence on the output hash being created though. Thx @Hans. (because my Encoding.Default is probably Encoding.UTF8)
UPDATE2: Maybe I should expand the question a bit, since it may have been under-explained, sorry. I guess what I am asking is more of a usability question than a technical one; Should I offer all the hashes with different line endings? Or should I stick to one? The client will probably call my company afraid that their file was changed somehow if they have a different way of calculating the hash. How is this usually solved?
All those sites return valid values.
Sites 4 and 5 use \n as line break.
EDIT
I see you edited your question to add Encoding.Default.GetBytes in the code example.
This is interesting, because it shows that a string-to-byte-array conversion runs before the hash is computed. The line-break convention (\n or \r\n) and the text encoding both affect how your string is turned into bytes, so they can produce different byte values.
Once you have the same bytes as input, all hash results will be identical.
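A small sketch that makes the point concrete: the same visible text hashed with \r\n and with \n yields two different SHA-256 values, because the input bytes differ. The helper name is hypothetical, and the output is lowercase hex without dashes, which is the format most command-line tools print.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class Sha256Demo
{
    // SHA-256 over the UTF-8 bytes of the text, as lowercase hex.
    public static string HexHash(string text)
    {
        using (var sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(text));
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

Sha256Demo.HexHash("test") returns 9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08, while Sha256Demo.HexHash("test\r\ntest") and Sha256Demo.HexHash("test\ntest") differ from each other.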
EDIT 2:
If you're dealing with bytes directly, then just compute the hash with those bytes. Don't try to provide different hash values; a hash must only return one value. If your clients have a different hash value than yours, then they are doing it wrong.
That being said, I'm pretty sure it won't ever happen because there isn't any way to misinterpret a byte array.
I need to hash a number (about 22 digits) and the result length must be less than 12 characters. It can be a number or a mix of characters, and must be unique. (The number entered will be unique too).
For example, if the number entered is 000000000000000000001, the result should be something like 2s5As5A62s.
I looked at the typicals, like MD5, SHA-1, etc., but they give high length results.
The problem with your question is that the input space is larger than the output space, yet you expect the output to be unique as well. That cannot happen. With an input space of 22 numeric digits (10^22 possibilities) and an output space of 11 hexadecimal digits (16^11, about 1.8 x 10^13 possibilities), you end up with far more possible inputs than possible outputs.
You would need an output space of at least 19 hexadecimal digits (16^19 is the smallest power of 16 above 10^22) and a perfect one-to-one function; otherwise you will have collisions pretty often (more than 50% of the time). I assume this is something you do not want, but you did not specify.
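The counting argument can be checked with a few lines of arithmetic. This sketch (names are hypothetical) finds the smallest number of digits in a given radix whose output space covers a given number of inputs:

```csharp
using System.Numerics;

public static class SpaceSizes
{
    // Smallest d such that radix^d >= inputs.
    public static int MinDigits(BigInteger inputs, int radix)
    {
        int digits = 0;
        BigInteger capacity = BigInteger.One;
        while (capacity < inputs)
        {
            capacity *= radix;
            digits++;
        }
        return digits;
    }
}
```

SpaceSizes.MinDigits(BigInteger.Pow(10, 22), 16) gives 19, and the same call with radix 64 gives 13, so no 11-character output over either alphabet can be unique for every 22-digit input.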
Since what you want cannot be done, I would suggest rethinking your design or using a checksum such as the cyclic redundancy check (CRC). CRC-64 will produce a 64 bit output and when encoded with any base64 algorithm, will give you something along the lines of what you want. This does not provide cryptographic strength like SHA-1, so it should never be used in anything related to information security.
However, if you were able to change your criteria to allow for long hash outputs, then I would strongly suggest you look at SHA-512, as it will provide high quality outputs with an extremely low chance of duplication. By a low chance I mean that no two inputs have yet been found to equal the same hash in the history of the algorithm.
If both of these suggestions still are not great for you, then your last alternative is probably just going with only base64 on the input data. It will essentially utilize the standard English alphabet in the best way possible to represent your data, thus reducing the number of characters as much as possible while retaining a complete representation of the input data. This is not a hash function, but simply a method for encoding binary data.
Why not take MD5 or SHA-N, re-encode the digest as Base64 (or base-whatever), and keep only 12 characters of it?
NB: In all cases the hash is NEVER guaranteed to be unique (but it can offer a low collision probability).
You can't use a hash if it has to be unique.
You need about 74 bits to store such a number. If you convert it to base-64 it will be about 12 characters (13 to cover every possible 22-digit value).
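A sketch of that conversion, treating the number as a BigInteger and writing it in radix 64. The URL-safe alphabet and the names are assumptions; this is an encoding, not a hash, so it is reversible and collision-free.

```csharp
using System;
using System.Numerics;
using System.Text;

public static class Radix64
{
    private const string Alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

    // Writes a non-negative number in radix 64, most significant digit first.
    public static string Encode(BigInteger value)
    {
        if (value.Sign < 0) throw new ArgumentOutOfRangeException(nameof(value));
        if (value.IsZero) return Alphabet[0].ToString();
        var sb = new StringBuilder();
        while (!value.IsZero)
        {
            sb.Insert(0, Alphabet[(int)(value % 64)]);
            value /= 64;
        }
        return sb.ToString();
    }

    // Inverse of Encode.
    public static BigInteger Decode(string text)
    {
        BigInteger value = BigInteger.Zero;
        foreach (char c in text)
        {
            int digit = Alphabet.IndexOf(c);
            if (digit < 0) throw new FormatException("Unexpected character: " + c);
            value = value * 64 + digit;
        }
        return value;
    }
}
```

The largest 22-digit number, 9999999999999999999999, encodes to 13 characters, while small values come out shorter unless they are padded.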
Can you elaborate on what your requirement is for the hashing? Do you need to make sure the result is diverse? (i.e. not 1 = a, 2 = b)
Just thinking out loud, and a little bit laterally, but could you not apply principles of run-length encoding on your number, treating it as data you want to compress. You could then use the base64 version of your compressed version.
On this blog post, there is a sentence as below:
This hash is unique for the given text. If you use the hash function
on the same text again, you'll get the same hash. But there is no way
to get the given text from the hash.
Forgive my ignorance on math but I cannot understand why it is not possible to get the given text from the hash.
I would understand if we use one key to encrypt the value and another to decrypt but I cannot figure it out in my mind. What is really going on here behind the scenes?
Anything that clears my mind will be appreciated.
Hashing is not encryption.
A hash produces a "digest" - a summary of the input. Whatever the input size, the hash size is always the same (see how MD5 returns the same size result for any input size).
With a hash, you can get the same hash from several different inputs (hash collisions) - how would you reverse this? Which is the correct input?
I suggest reading this blog post from Troy Hunt on the matter in order to gain better understanding of hashes, passwords and security.
Encryption is a different thing - you would get a different cypher from the input and key - and the size of the cypher will tend to be larger as the input is larger. This is reversible if you have the right key.
Update (following the different comments):
Though collisions can happen, when using a cryptographically significant hash (like the ones you have posted about), they will be rare and difficult to produce.
When hashing passwords, always use a salt - this reduces the chances of the hash being reversed by rainbow tables to almost nothing (assuming a good salt has been used).
You need to decide about the tradeoffs of the cost of hashing (can be processor intensive) and the cost of what you are protecting.
As you are simply protecting the login details, using the .NET membership provider should provide enough security.
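For readers who want to see what salting looks like outside the membership provider, here is a hedged sketch using PBKDF2. It assumes .NET 6 or later (for Rfc2898DeriveBytes.Pbkdf2); the iteration count, salt size, and names are illustrative choices rather than recommendations from this answer.

```csharp
using System;
using System.Security.Cryptography;

public static class PasswordHasher
{
    // A fresh random salt per user defeats precomputed (rainbow) tables.
    public static byte[] NewSalt()
    {
        var salt = new byte[16];
        using (var rng = RandomNumberGenerator.Create())
        {
            rng.GetBytes(salt);
        }
        return salt;
    }

    // The iteration count trades CPU cost against brute-force resistance.
    public static byte[] Hash(string password, byte[] salt, int iterations = 100000)
    {
        return Rfc2898DeriveBytes.Pbkdf2(
            password, salt, iterations, HashAlgorithmName.SHA256, 32);
    }

    // Constant-time comparison avoids leaking where the hashes differ.
    public static bool Verify(string password, byte[] salt, byte[] expected)
    {
        return CryptographicOperations.FixedTimeEquals(Hash(password, salt), expected);
    }
}
```

The salt is stored next to the hash; it does not need to be secret, only different for each user.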
Hash functions are many to one functions. This means that many inputs will give the same result but that for any given input you get one and only one result.
Why this is so can be intuitively seen by considering a hash function that takes a string input of any length and generates a 32 bit integer. There are obviously far more strings than 2^32 which means that your hash function cannot give each input string a unique output. (see http://en.wikipedia.org/wiki/Pigeonhole_principle for more discussion - the Uses and applications section specifically talks about hashes)
Given that any result from our hash function could have been generated from more than one input, and that we have no information other than the result, we have no way to determine which input was used, so it cannot be reversed.
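The pigeonhole argument is easy to see with a deliberately tiny hash. This toy function (an illustration, not a real hash algorithm) has only 256 possible outputs, so different strings are forced to share results:

```csharp
using System.Text;

public static class ToyHash
{
    // Sums the UTF-8 bytes modulo 256: only 256 possible outputs.
    public static byte Hash(string text)
    {
        int sum = 0;
        foreach (byte b in Encoding.UTF8.GetBytes(text))
        {
            sum += b;
        }
        return (byte)(sum % 256);
    }
}
```

ToyHash.Hash("ab") and ToyHash.Hash("ba") are equal, as are the results for "ad" and "bc". Given only the output, there is no way to tell which of those inputs produced it.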
There are at least two reasons:
Hashing usually uses functions that are asymmetric in cost (one-way functions), meaning that reversing an operation is MUCH more difficult (in time/resources/effort) than performing it.
Hashes from the same algorithm are always the same length, meaning there is a limited set of possible hashes. This means that for every hash there will be an infinite number of collisions: different source data blocks that produce the same hash value.
It's not encrypt/decrypt. For example, simple hash function:
int hash(int data)
{
return data % 2;
}
Problem?
Hashing is like using a checksum to verify data, not to encrypt or compress data.
This is essentially math: a hash function is a function that is NOT one-to-one. It takes inputs from the set of all binary data B* and maps them to the set Bn of fixed-length binary strings, for some fixed n. (By this definition the function is onto, however.)
You can try to calculate a pre-image of a given hash via brute force, but without knowing the input size, the search space is infinite.
You can hash any length of data you want, from a single byte to a terabyte file. All possible data can be hashed to a 256 bit value (taking SHA-256 as an example). That means that there are 2^256 possible values output from the SHA-256 hash algorithm. However, there are a lot more than 2^256 possible values that can be input to SHA-256. You can input any combination of bytes for any length you want.
Because there are far more possible inputs than possible outputs, then some of the inputs must generate the same output. Since you don't know which of the many possible inputs generated the output, it is not possible to reliably go backwards.
A very simple hash algorithm would be to take the first character of each word within a text. If you take the same text you can always get out the same hash, but it is impossible to rebuild the original text from only the first character of each word.
Example hash from my answer above:
AvshawbtttfcoewwatIyttstycagotshbisitrtotfohtfcoew
And now try to find out the corresponding text from the given hash. ;-)
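The first-letter scheme described above fits in a few lines (the class and method names are made up for the example):

```csharp
using System;
using System.Linq;

public static class FirstLetterHash
{
    // Keeps only the first character of each whitespace-separated word.
    public static string Hash(string text)
    {
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Concat(words.Select(w => w[0]));
    }
}
```

FirstLetterHash.Hash("same text") and FirstLetterHash.Hash("some thing") both return "st", which is exactly why the original text cannot be recovered.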
I am retrieving lists of CRC32 hashes of file names, not of their contents.
I need to be able to decrypt the strings, which are hashed names like "vacationplans_2010.txt"
and are fewer than 25 characters long.
Is this possible?
It is a one-way hash function. It can't be decrypted.
Despite what other users answered, CRC32 is not a cryptographic hash function; it is meant for integrity checks (data checksums).
Cryptographic hash functions are often described as "one-way hash functions"; CRC32 lacks the "one-way" part.
That being said, you should consider the following: since the set of all possible 25-characters-or-less filenames is more than 2^32, some file names are bound to have the same hash value. Therefore, it might be that for some of the CRC32 values you get, there will be several possible sources (file names). You will need a way to determine the "real" source (I assume that human decision would be the best choice, since our brain is a great pattern-recognition device, but it really depends on your scenario).
Several methods can be used to partially achieve what you are asking for. Brute-force is one of them (although, with 25 characters long file names, brute-force may take a while). A modified dictionary attack is another option. Other options are based on analysis of the CRC32 algorithm, and will require that you dive into the implementation details of the algorithm (otherwise you'll have a hard time understanding what you're implementing). For example, see this article, or this article.
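A dictionary attack of the kind mentioned above can be sketched as follows. It assumes the lists use the common CRC-32 variant (reflected polynomial 0xEDB88320, as in zlib and PNG) over ASCII file names, which the question does not confirm; the candidate list is whatever plausible names you can generate.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class Crc32
{
    private static readonly uint[] Table = BuildTable();

    // Precomputes the 256-entry lookup table for the reflected polynomial.
    private static uint[] BuildTable()
    {
        var table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint c = i;
            for (int k = 0; k < 8; k++)
            {
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            }
            table[i] = c;
        }
        return table;
    }

    // CRC-32 over the ASCII bytes of the text (the encoding is an assumption).
    public static uint Compute(string text)
    {
        uint crc = 0xFFFFFFFFu;
        foreach (byte b in Encoding.ASCII.GetBytes(text))
        {
            crc = Table[(crc ^ b) & 0xFF] ^ (crc >> 8);
        }
        return crc ^ 0xFFFFFFFFu;
    }

    // Returns every candidate whose CRC matches; there may be more than one.
    public static List<string> FindMatches(uint target, IEnumerable<string> candidates)
    {
        var matches = new List<string>();
        foreach (string name in candidates)
        {
            if (Compute(name) == target) matches.Add(name);
        }
        return matches;
    }
}
```

Because several names can share one CRC32 value, the result is a list to be narrowed down by context rather than a single answer.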
EDIT: definitions by Bruce Schneier (author of Applied Cryptography, among other things):
One-way functions are relatively easy
to compute, but significantly harder
to reverse. … . In this context,
"hard" is defined as something like:
It would take millions of years to
compute x from f(x), even if all the
computers in the world were assigned
to the problem.
A hash function is a function,
mathematical or otherwise, that takes
a variable length input string
(called a pre-image) and converts it
to a fixed length (generally smaller)
output string (called a hash value).
The security of a one-way hash
function is its one-wayness.
A hash function like CRC32 calculates a simple value given (variable) input. The calculation is not reversible - i.e. you cannot reliably get the original value given only the hash.
Yep, the general method is to find out the rule how u hash encryt result be the same as