Hash function to obtain a limited length result

I need to hash a number (about 22 digits), and the result must be fewer than 12 characters long. It can be a number or a mix of characters, and it must be unique. (The number entered will be unique too.)
For example, if the number entered is 000000000000000000001, the result should be something like 2s5As5A62s.
I looked at the usual candidates, like MD5, SHA-1, etc., but they produce results that are too long.

The problem with your question is that the input space is larger than the output space, yet you expect the output to be unique. That cannot happen. If you have an input space of, say, 22 numeric digits (10^22 possibilities) and an output space of 11 hexadecimal digits (16^11, about 1.8 x 10^13 possibilities), you end up with more input possibilities than output possibilities, so some inputs must share an output.
You would need an output space of 19 hexadecimal digits (16^18 < 10^22 < 16^19) and a perfect one-to-one function; otherwise you will have collisions pretty often (more than 50% of the time). I assume this is something you do not want, but you did not specify.
Since what you want cannot be done, I would suggest rethinking your design or using a checksum such as the cyclic redundancy check (CRC). CRC-64 produces a 64-bit output which, when Base64-encoded, comes to 11 characters without padding, something along the lines of what you want. This does not provide cryptographic strength like SHA-1, so it should never be used in anything related to information security.
However, if you are able to change your criteria to allow long hash outputs, then I would strongly suggest you look at SHA-512, as it provides high-quality outputs with an extremely low chance of duplication. By a low chance I mean that no two inputs have yet been found to produce the same hash in the history of the algorithm.
If both of these suggestions still are not great for you, then your last alternative is probably just going with only base64 on the input data. It will essentially utilize the standard English alphabet in the best way possible to represent your data, thus reducing the number of characters as much as possible while retaining a complete representation of the input data. This is not a hash function, but simply a method for encoding binary data.

Why not take MD5 or SHA-N, re-encode the digest as Base64 (or base-whatever), and keep only 12 characters of it?
NB: In any case the hash will NEVER be unique (but it can offer a low collision probability).
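A minimal sketch of that truncation idea, assuming SHA-256 as the digest and a URL-safe Base64 alphabet; the class name `ShortHash` and the default length of 11 (to stay under the asker's limit of 12) are choices made here for illustration, not part of the answer above. Eleven Base64 characters carry 66 bits, so by the birthday bound collisions only become likely somewhere around 2^33 distinct inputs, but they are never impossible.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class ShortHash
{
    // Returns the first `length` characters of the URL-safe Base64 form
    // of the SHA-256 digest. Shorter output means more collisions.
    public static string Compute(string input, int length = 11)
    {
        // A SHA-256 digest is 32 bytes, i.e. 43 Base64 characters without padding.
        if (length < 1 || length > 43)
            throw new ArgumentOutOfRangeException("length");

        using (var sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(input));
            string encoded = Convert.ToBase64String(digest)
                .TrimEnd('=')
                .Replace('+', '-')
                .Replace('/', '_');
            return encoded.Substring(0, length);
        }
    }
}
```

Calling `ShortHash.Compute("000000000000000000001")` always returns the same 11-character string for that input; if uniqueness really matters, the result still has to be checked against the values already issued.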

You can't use a hash if it has to be unique.
You need about 74 bits to store such a number (log2(10^22) is about 73.1). Converted to base 64 at 6 bits per character, that comes to about 12 or 13 characters.
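To make the arithmetic concrete, here is a rough sketch of such a reversible re-encoding. It is not a hash: it simply rewrites the number in base 64, so uniqueness is preserved. The class name `CompactId` and the URL-safe alphabet are assumptions made for this example.

```csharp
using System;
using System.Numerics;
using System.Text;

public static class CompactId
{
    // URL-safe Base64 alphabet; the position in the string is the digit value.
    const string Alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

    public static string Encode(string decimalDigits)
    {
        BigInteger value = BigInteger.Parse(decimalDigits);
        if (value.Sign < 0) throw new ArgumentException("Value must be non-negative.");
        if (value.IsZero) return Alphabet[0].ToString();

        var sb = new StringBuilder();
        while (!value.IsZero)
        {
            // Peel off the least significant base-64 digit each round.
            sb.Insert(0, Alphabet[(int)(value % 64)]);
            value /= 64;
        }
        return sb.ToString();
    }

    public static BigInteger Decode(string encoded)
    {
        BigInteger value = BigInteger.Zero;
        foreach (char c in encoded)
        {
            int digit = Alphabet.IndexOf(c);
            if (digit < 0) throw new FormatException("Unexpected character: " + c);
            value = value * 64 + digit;
        }
        return value;
    }
}
```

The largest 22-digit number encodes to 13 characters, so this alone does not get under 12. Leading zeros are dropped by `BigInteger.Parse`; something like `CompactId.Decode(s).ToString("D22")` would restore them when the original width is known.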

Can you elaborate on what your requirement is for the hashing? Do you need to make sure the result is diverse? (i.e. not 1 = a, 2 = b)
Just thinking out loud, and a little bit laterally, but could you apply the principles of run-length encoding to your number, treating it as data you want to compress? You could then use the Base64 encoding of the compressed version.

Related

SHA256 hash calculation

For my employer, I have to present customers of a web app with checksums for certain files they download.
I'd like to present the user with the hash their client tools are also likely to generate, hence I have been comparing online hashing tools. My question is regarding their form of hashing, since they differ, strangely enough.
After a quick search I tested with 5:
http://www.convertstring.com/Hash/SHA256
http://www.freeformatter.com/sha256-generator.html#ad-output
http://online-encoder.com/sha256-encoder-decoder.html
http://www.xorbin.com/tools/sha256-hash-calculator
http://www.everpassword.com/sha-256-generator
Entering the value 'test' (without 'enter' after it) all 5 give me the same SHA256 result. However, and here begins the peculiar thing, when I enter the value 'test[enter]test' (so two lines) online tool 1, 2 and 3 give me the same SHA256 hash, and site 4 and 5 give me a different one (so 1, 2 and 3 are equal, and 4 and 5 are equal). This most likely has to do with the way the tool, or underlying code handles \r\n, or at least I think so.
Coincidentally, site 1, 2 and 3 present me with the same hash as my C# code does:
// Hash the two-line string using CRLF (\r\n) as the line break.
var sha256Now = ComputeHash(Encoding.UTF8.GetBytes("test\r\ntest"), new SHA256CryptoServiceProvider());

private static string ComputeHash(byte[] inputBytes, HashAlgorithm algorithm)
{
    var hashedBytes = algorithm.ComputeHash(inputBytes);
    // BitConverter.ToString yields uppercase hex pairs separated by '-'.
    return BitConverter.ToString(hashedBytes);
}
The question is: which sites are 'right'?
Is there any way to know if a hash is compliant with the standard?
UPDATE1: Changed the encoding to UTF8. This has no influence on the output hash, though. Thx @Hans. (Probably because my Encoding.Default is Encoding.UTF8.)
UPDATE2: Maybe I should expand the question a bit, since it may have been under-explained, sorry. I guess what I am asking is more of a usability question than a technical one; Should I offer all the hashes with different line endings? Or should I stick to one? The client will probably call my company afraid that their file was changed somehow if they have a different way of calculating the hash. How is this usually solved?

All those sites return valid values.
Sites 4 and 5 use \n as line break.
EDIT
I see you edited your question to add Encoding.Default.GetBytes in the code example.
This is interesting, because you see there is a string-to-byte-array conversion to run before computing the hash. The line-break convention (\n or \r\n) and the text encoding both affect how your string is turned into bytes, so different choices yield different byte values.
Once you have the same bytes as input, all hash results will be identical.
EDIT 2:
If you're dealing with bytes directly, then just compute the hash with those bytes. Don't try to provide different hash values; a hash must only return one value. If your clients have a different hash value than yours, then they are doing it wrong.
That being said, I'm pretty sure it won't ever happen because there isn't any way to misinterpret a byte array.
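To see the effect for yourself, a small sketch along these lines can help. The class and method names are made up for this example; only the standard `SHA256`, `BitConverter` and `File.ReadAllBytes` calls come from .NET.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

public static class LineEndingDemo
{
    // Lowercase hex SHA-256 of exactly the bytes given.
    public static string Sha256Hex(byte[] data)
    {
        using (var sha = SHA256.Create())
        {
            return BitConverter.ToString(sha.ComputeHash(data))
                .Replace("-", "")
                .ToLowerInvariant();
        }
    }

    // For downloads, hash the file's bytes directly: no string, no encoding,
    // no line-ending interpretation involved.
    public static string Sha256HexOfFile(string path)
    {
        return Sha256Hex(File.ReadAllBytes(path));
    }
}
```

`Sha256Hex(Encoding.UTF8.GetBytes("test\r\ntest"))` and `Sha256Hex(Encoding.UTF8.GetBytes("test\ntest"))` give different digests because the byte arrays differ by one byte, which is presumably what separates sites 1 to 3 from sites 4 and 5.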

Create a unique id with built-in checksum?

I want to auto-generate a unique 8-10 character ID string that includes a checksum bit of some kind to guard against a typo at data entry. I would prefer something that does not have sequential numbers where the data entry person would end up in a "rut" and get used to typing the same sequence all the time.
Are there any best practices/ pitfalls associated with this sort of thing?
UPDATE: OK, I guess I need to provide more detail.
I want to use alphanumerics, not just digits
I want behavior similar to a credit card checksum, except with 8-10 characters instead of 16 digits
I want to have the id be unique; there should not be a possibility of collision.
SECOND UPDATE: OK, I don't understand what is confusing about this, but I will try to explain further. I am trying to create tracking numbers that will go on forms, which will be filled out and data-entered at a later time. I will generate the id and slap it on the form; the id needs to be unique, it needs to support a LOT of numbers, and it needs to be reasonably idiot-proof for data-entry.
I don't know if this has been done, or even if it can be done, but it does not hurt to ask.

Your question is VERY general - thus just some general aspects:
Does the ID need to be "unguessable" ?
IF yes then some sort of hash should be in the mix.
Does the ID need to be "secure" (like for example an activation key or something) ?
IF yes then some sort of public key cryptography should be in the mix.
Does the ID / checksum calculation need to be fast ?
IF yes then perhaps some very simple algorithm like CRC32 or Luhn (credit card checksum algorithm) or some barcode checksum algorithm could be worth looking at.
Is the ID generation centralized ?
IF not then you might need to check out GUIDs, current time, MAC address and similar stuff.
UPDATE - as per comments:
use a sequence in the DB
take that value and hash it, for example with MD5
take the least significant 40-48 bits of that hash
encode it as Base-36 (0-9 and A-Z) which gives you 8-10 "digits" (alphanumeric)
check the result against the DB and discard if the ID already there (for the very rare possibility of a collision)
calculate CRC-6-ITU (see http://www.itu.int/rec/T-REC-G.704-199810-I/en on page 3)
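attach the CRC result as the last "digit" (as base-36 too; note that a 6-bit CRC has 64 possible values, so it has to be reduced modulo 36 or spread over two characters to fit)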
and thus you have a unique ID including checksum
to check the entered value you can just recalculate CRC-6-ITU from all digits but the last one and compare the result with the last digit.
The above is rather "unguessable" but definitely not of "high security".
UPDATE 2 - as per comment:
For some inspiration on how to calculate CRC in javascript see this - it contains javascript code for CRC-8 etc.
You should be able to adapt this code based on the CRC-6-ITU polynomial.
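The first few steps of that recipe (sequence value, MD5, low 40 bits, Base-36) might look roughly like this. This is only a sketch: hashing the decimal text of the sequence value, the class name `TrackingId` and the fixed width of 8 characters are assumptions made here, and the database uniqueness check and the check digit are left out.

```csharp
using System;
using System.Globalization;
using System.Security.Cryptography;
using System.Text;

public static class TrackingId
{
    const string Base36 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    public static string FromSequence(long sequenceValue)
    {
        // Assumption: hash the decimal text of the sequence value.
        byte[] input = Encoding.ASCII.GetBytes(
            sequenceValue.ToString(CultureInfo.InvariantCulture));

        byte[] digest;
        using (var md5 = MD5.Create())
        {
            digest = md5.ComputeHash(input);
        }

        // Keep the last 5 bytes (40 bits) of the 16-byte digest.
        ulong bits = 0;
        for (int i = digest.Length - 5; i < digest.Length; i++)
        {
            bits = (bits << 8) | digest[i];
        }

        // 2^40 < 36^8, so eight base-36 characters always suffice.
        var chars = new char[8];
        for (int i = chars.Length - 1; i >= 0; i--)
        {
            chars[i] = Base36[(int)(bits % 36)];
            bits /= 36;
        }
        return new string(chars);
    }
}
```

Because only 40 bits of the digest survive, two sequence values can occasionally map to the same ID, which is why the "check against the DB and discard" step in the list matters.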

You might imitate airline reservation systems: they convert a number into base-36, using A-Z and 0-9 as the characters. Their upper limit is thus 36^6.
If you need to guarantee uniqueness, and you don't want them to be sequential, you have to keep the used-up random numbers in a table somewhere.
After you have your random or pseudorandom ID, you only need to calculate your checkdigit.
Use a CRC algorithm. They can be adapted to any desired length (in your case, 6 bits).
Edit
In case it's not clear: even if you use alpha codes, you'll have to turn it into a number before generating the checkdigit.
Edit
Checksum validation is not heavyweight, it can be implemented client-side in javascript.
A six-character alphanumeric code (i.e. an airline record locator) gives 36^6 = 2,176,782,336 combinations, about 2.2 billion. Surely that's enough? (See Wolfram Alpha for exact result.)

Most credit cards use the Luhn algorithm (also known as mod10 algorithm) as checksum algorithm to validate card numbers. From Wikipedia:
The Luhn algorithm will detect any single-digit error, as well as
almost all transpositions of adjacent digits. It will not, however,
detect transposition of the two-digit sequence 09 to 90 (or vice
versa).
The algorithm is generic and can be applied to any identification number.

As @BrokenGlass noted, you can use the Luhn check digit algorithm. Credit cards and the like use the Luhn algorithm modulo 10. Luhn mod 10 computes a check digit for a sentence drawn from the alphabet consisting solely of decimal digits (0-9). However, it is easily adapted to compute a check digit for sentences drawn from an alphabet of any size (binary, octal, hex, alphanumeric, etc.)
To do that, all you need are two methods and one property:
The number of codepoints in the alphabet in use.
This is essentially the base of the numbering system. For instance, the hexadecimal (base 16) alphabet consists of 16 characters (ignoring the issue of case-sensitivity): '0123456789ABCDEF'. '0'–'9' have their usual meaning; 'A'–'F' are the base-16 digits representing 10–15.
A means of converting a character from the alphabet in use into its corresponding codepoint.
For instance in hexadecimal, the characters '0'–'9' represent code points 0–9; the characters 'A'–'F' represent codepoints 10-15.
A means of converting a codepoint into the corresponding character.
The converse of the above. For instance, in hexadecimal, the codepoint 12 would convert to the character 'C'.
You should probably throw an ArgumentException if the code point given doesn't exist in the alphabet.
The Wikipedia article, "Luhn mod N algorithm" does a pretty good job of explaining the computation of the check digit and its validation.
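Putting the one property and two methods together, a Luhn mod N helper could be sketched as follows. The class name `LuhnModN` and its shape are assumptions made for this example; the arithmetic follows the usual description of the Luhn mod N algorithm and assumes an alphabet with an even number of characters (such as base 10, 16 or 36).

```csharp
using System;

public sealed class LuhnModN
{
    private readonly string alphabet;

    public LuhnModN(string alphabet)
    {
        if (string.IsNullOrEmpty(alphabet))
            throw new ArgumentException("Alphabet must not be empty.");
        this.alphabet = alphabet;
    }

    // The number of code points in the alphabet (the base).
    public int Base { get { return alphabet.Length; } }

    public int CodePointFromCharacter(char c)
    {
        int codePoint = alphabet.IndexOf(c);
        if (codePoint < 0)
            throw new ArgumentException("Character not in alphabet: " + c);
        return codePoint;
    }

    public char CharacterFromCodePoint(int codePoint)
    {
        if (codePoint < 0 || codePoint >= alphabet.Length)
            throw new ArgumentException("Code point not in alphabet: " + codePoint);
        return alphabet[codePoint];
    }

    // Walks right to left, alternating the factors 1 and 2, and adds up
    // the base-N "digits" of each product.
    private int WeightedSum(string input, int startFactor)
    {
        int factor = startFactor, sum = 0, n = Base;
        for (int i = input.Length - 1; i >= 0; i--)
        {
            int addend = factor * CodePointFromCharacter(input[i]);
            factor = (factor == 2) ? 1 : 2;
            sum += addend / n + addend % n;
        }
        return sum;
    }

    public char GenerateCheckCharacter(string input)
    {
        int n = Base;
        int remainder = WeightedSum(input, 2) % n;
        return CharacterFromCodePoint((n - remainder) % n);
    }

    public bool Validate(string inputWithCheck)
    {
        return WeightedSum(inputWithCheck, 1) % Base == 0;
    }
}
```

With the decimal alphabet, `new LuhnModN("0123456789").GenerateCheckCharacter("7992739871")` returns `'3'`, matching the familiar credit-card style example 79927398713; passing `"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"` instead gives an alphanumeric check character.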

Hashes (MD5, SHA1, SHA256, SHA384, SHA512) - why isn't it possible to get the value back from the hash?

On this blog post, there is a sentence as below:
This hash is unique for the given text. If you use the hash function
on the same text again, you'll get the same hash. But there is no way
to get the given text from the hash.
Forgive my ignorance on math but I cannot understand why it is not possible to get the given text from the hash.
I would understand if we use one key to encrypt the value and another to decrypt but I cannot figure it out in my mind. What is really going on here behind the scenes?
Anything that clears my mind will be appreciated.

Hashing is not encryption.
A hash produces a "digest" - a summary of the input. Whatever the input size, the hash size is always the same (see how MD5 returns the same size result for any input size).
With a hash, you can get the same hash from several different inputs (hash collisions) - how would you reverse this? Which is the correct input?
I suggest reading this blog post from Troy Hunt on the matter in order to gain better understanding of hashes, passwords and security.
Encryption is a different thing - you would get a different cypher from the input and key - and the size of the cypher will tend to be larger as the input is larger. This is reversible if you have the right key.
Update (following the different comments):
Though collisions can happen, when using a cryptographically significant hash (like the ones you have posted about), they will be rare and difficult to produce.
When hashing passwords, always use a salt - this reduces the chances of the hash being reversed by rainbow tables to almost nothing (assuming a good salt has been used).
You need to decide about the tradeoffs of the cost of hashing (can be processor intensive) and the cost of what you are protecting.
As you are simply protecting the login details, using the .NET membership provider should provide enough security.

Hash functions are many to one functions. This means that many inputs will give the same result but that for any given input you get one and only one result.
Why this is so can be intuitively seen by considering a hash function that takes a string input of any length and generates a 32 bit integer. There are obviously far more strings than 2^32 which means that your hash function cannot give each input string a unique output. (see http://en.wikipedia.org/wiki/Pigeonhole_principle for more discussion - the Uses and applications section specifically talks about hashes)
Given we now know that any result from our hash function could have been generated from one or more inputs and we have no information other than the result we have no way to determine which input was used so it cannot be reversed.

There are at least two reasons:
Hashing usually uses one-way functions for its calculations, meaning that inverting an operation is MUCH more difficult (in time, resources and effort) than performing it.
Hashes from the same algorithm are always the same length, meaning there is a limited set of possible hashes. For every hash there is therefore an infinite number of collisions: different source data blocks that produce the same hash value.

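It's not encryption/decryption. For example, take this simple hash function:
int hash(int data)
{
    return data % 2;
}
See the problem? Every even input returns 0, so the output cannot tell you which input was hashed.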

Hashing is like using a checksum to verify data, not to encrypt or compress data.

This is essentially math: a hash function is a function that is NOT one-to-one. It takes inputs from the set of all binary strings B* and maps them to the set of fixed-length binary strings B^n for some fixed n (the mapping is onto, however).
You can try to compute a pre-image of a given hash by brute force, but without knowing the input size, the search space is infinite.

You can hash any length of data you want, from a single byte to a terabyte file. All possible data can be hashed to a 256 bit value (taking SHA-256 as an example). That means that there are 2^256 possible values output from the SHA-256 hash algorithm. However, there are a lot more than 2^256 possible values that can be input to SHA-256. You can input any combination of bytes for any length you want.
Because there are far more possible inputs than possible outputs, then some of the inputs must generate the same output. Since you don't know which of the many possible inputs generated the output, it is not possible to reliably go backwards.
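The same pigeonhole argument can be watched in action by shrinking the output space. The sketch below uses a toy hash, the first 16 bits of SHA-256, which is purely an assumption for demonstration; with only 65,536 possible outputs, a collision is guaranteed within 65,537 distinct inputs and usually turns up after a few hundred.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public static class PigeonholeDemo
{
    // Toy hash for illustration only: the first 16 bits of SHA-256.
    public static int Hash16(string input)
    {
        using (var sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(input));
            return (digest[0] << 8) | digest[1];
        }
    }

    // Tries "input-0", "input-1", ... until two inputs share a hash value.
    public static Tuple<string, string> FindCollision()
    {
        var seen = new Dictionary<int, string>();
        for (int i = 0; ; i++)
        {
            string candidate = "input-" + i;
            int hash = Hash16(candidate);
            string earlier;
            if (seen.TryGetValue(hash, out earlier))
            {
                return Tuple.Create(earlier, candidate);
            }
            seen[hash] = candidate;
        }
    }
}
```

Given only the shared 16-bit value, nothing tells you which of the two colliding inputs was hashed, and the full 256-bit case differs only in scale.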

A very simple hash algorithm would be to take the first character of each word within a text. If you take the same text you can always get out the same hash but it is impossible to rebuild the original text from only having the first character of each word.
Example hash from my answer above:
AvshawbtttfcoewwatIyttstycagotshbisitrtotfohtfcoew
And now try to find out the corresponding text from the given hash. ;-)
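For anyone who wants to play with it, that first-letter "hash" fits in a few lines. The class name `FirstLetterHash` is invented for this sketch, and words are assumed to be separated by whitespace.

```csharp
using System;
using System.Text;

public static class FirstLetterHash
{
    // Concatenates the first character of every whitespace-separated word.
    public static string Compute(string text)
    {
        var sb = new StringBuilder();
        string[] words = text.Split(
            new[] { ' ', '\t', '\r', '\n' },
            StringSplitOptions.RemoveEmptyEntries);
        foreach (string word in words)
        {
            sb.Append(word[0]);
        }
        return sb.ToString();
    }
}
```

Both "Hash functions are many to one" and "Hello from a mad tea outing" come out as "Hfamto", which is exactly why the result cannot be turned back into the original text.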

Compression of small string

I have 340 bytes of data in a string that consists mostly of signs and numbers, like "føàA¹º#ƒUë5§Ž§".
I want to compress it to 250 bytes or fewer so I can store it on my RFID card.
As this data is a fingerprint template, I need lossless compression.
Is there an algorithm I can implement in C# to compress it?

If the data is strictly numbers and signs, I highly recommend converting the numbers into integer values, e.g.:
+12939272-23923+927392
can be compressed into three 32-bit integers, taking it from 22 bytes to 12 bytes. Picking the right integer size (32-bit, 24-bit or 16-bit) should help.
If the integer sizes vary greatly, you could start with 8 bits per value and use the value 255 to signal that the next 8 bits hold the more significant bits of the integer, making it 15-bit.
Alternatively, you could identify the most frequent character and assign it the code 0. The second most frequent character gets 10, and the third 110. This is a very crude compression, but if your data is very limited, it might just do the job for you.
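The fixed-width variant of this idea could be sketched as below, assuming the string really is nothing but explicitly signed decimal numbers that each fit in 32 bits. The class name `SignedNumberPacker` is made up for the example, and leading zeros would not survive the round trip.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class SignedNumberPacker
{
    // "+12939272-23923+927392" -> 3 x 4 bytes.
    public static byte[] Pack(string text)
    {
        var bytes = new List<byte>();
        foreach (Match m in Regex.Matches(text, @"[+-]\d+"))
        {
            int value = int.Parse(
                m.Value, NumberStyles.AllowLeadingSign, CultureInfo.InvariantCulture);
            bytes.AddRange(BitConverter.GetBytes(value));
        }
        return bytes.ToArray();
    }

    // Rebuilds the text, always writing an explicit sign.
    public static string Unpack(byte[] data)
    {
        var sb = new StringBuilder();
        for (int i = 0; i + 4 <= data.Length; i += 4)
        {
            int value = BitConverter.ToInt32(data, i);
            sb.Append(value.ToString("+0;-0", CultureInfo.InvariantCulture));
        }
        return sb.ToString();
    }
}
```

For the example string this yields 12 bytes. Whether it helps with the asker's actual data depends on the data really being digits and signs, which the sample in the question does not obviously suggest.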

Is there any other information you know about your string? For instance, does it contain certain characters more often than others? Does it use all 256 possible byte values or just a subset of them?
If so, Huffman encoding may help you; see this or this other link for implementations in C#.
To be honest, it just depends on what your input string looks like. What I'd do is try rar, zip and 7zip (LZMA) with very small dictionary sizes (otherwise they'll just use up too much space for preprocessed information) and see how big the raw compressed output is (you will probably have to use their libraries to strip headers and conserve space). If any of them produces output under 250 bytes, then find the C# library for it and there you go.

Making smaller a string, c#

I need a library/tool/function that compresses a 50-60 char long string to smaller.
Do you know any?

Effective compression on that scale will be difficult. You might consider Huffman coding. This might give you smaller output than gzip (since it will result in binary codes instead of a base-85 sequence).

Are you perhaps thinking of a cryptographic hash? For example, SHA-1 (http://en.wikipedia.org/wiki/SHA-1) can be used on an input string to produce a 20-byte digest. Of course, the digest will always be 20 bytes - even if the input string is shorter than 20 bytes.

The framework includes the GZipStream and DeflateStream classes. But that might not really be what you are after - what input strings have to be compressed? ASCII only? Letters only? Alphanumerical string? Full Unicode? And what are allowed output strings?
From an algorithmic standpoint, and without any further knowledge of the space of possible inputs, I suggest using arithmetic coding. This might shrink the compressed size by a few additional bits compared to Huffman coding because it is not restricted to an integral number of bits per symbol, something that can turn out to be important when dealing with such small inputs.

If your string only contains lowercase characters a-z and digits 0-9 (36 symbols), you could encode each character in 6 bits.
This will compress a 60-char string to 45 bytes. If you don't need digits, 5 bits per character are enough for the 26 letters, bringing it down to 38 bytes.
So choosing the right compression method depends on what data your string contains.
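As a sketch of that kind of bit packing, assume the alphabet is exactly a-z plus 0-9: those 36 symbols fit in 6 bits each, so 60 characters take 45 bytes. The class name `SixBitPacker` is hypothetical, and the character count has to be stored or known separately to unpack.

```csharp
using System;
using System.Text;

public static class SixBitPacker
{
    const string Alphabet = "abcdefghijklmnopqrstuvwxyz0123456789";
    const int BitsPerChar = 6;

    public static byte[] Pack(string text)
    {
        var output = new byte[(text.Length * BitsPerChar + 7) / 8];
        int bitPos = 0;
        foreach (char c in text)
        {
            int code = Alphabet.IndexOf(c);
            if (code < 0) throw new ArgumentException("Unsupported character: " + c);

            // Write the 6 bits of the code, most significant bit first.
            for (int b = BitsPerChar - 1; b >= 0; b--, bitPos++)
            {
                if (((code >> b) & 1) != 0)
                {
                    output[bitPos / 8] |= (byte)(0x80 >> (bitPos % 8));
                }
            }
        }
        return output;
    }

    public static string Unpack(byte[] data, int charCount)
    {
        var sb = new StringBuilder(charCount);
        int bitPos = 0;
        for (int i = 0; i < charCount; i++)
        {
            int code = 0;
            for (int b = 0; b < BitsPerChar; b++, bitPos++)
            {
                int bit = (data[bitPos / 8] >> (7 - bitPos % 8)) & 1;
                code = (code << 1) | bit;
            }
            sb.Append(Alphabet[code]);
        }
        return sb.ToString();
    }
}
```

The output is raw binary, so if the result has to be stored as text again (Base64, for instance), most of the saving is lost.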

You could simply gzip it
http://www.example-code.com/csharp/gzip_compressString.asp
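For reference, a sketch using the framework's own `GZipStream` rather than the library behind that link (the class name `GzipDemo` is made up here). Bear in mind that the gzip container alone adds 18 bytes of header and trailer, so for a 50-60 character string the "compressed" output will often be larger than the input.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipDemo
{
    public static byte[] Compress(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var buffer = new MemoryStream())
        {
            // Dispose the GZipStream before reading the buffer so that
            // the trailer is flushed.
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            return buffer.ToArray();
        }
    }

    public static string Decompress(byte[] data)
    {
        using (var input = new MemoryStream(data))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gzip.CopyTo(output);
            return Encoding.UTF8.GetString(output.ToArray());
        }
    }
}
```

Comparing `GzipDemo.Compress(s).Length` with `Encoding.UTF8.GetByteCount(s)` on real sample strings is the quickest way to find out whether gzip is worth it at this size.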

I would use something basic like RLE or shared-dictionary-based compression, followed by a block cipher that keeps the size constant.
Maybe smaz is also interesting for you.
Examples of basic compression algorithms:
RLE
(Modified or not) Huffman coding
Burrows-Wheeler transformation
Examples of block ciphers ("bit twiddlers"):
AES
Blowfish
DES
Triple DES
Serpent
Twofish
You will be able to find out what fulfills your needs using Wikipedia (links above).
