I want to auto-generate a unique 8-10 character ID string that includes a checksum bit of some kind to guard against a typo at data entry. I would prefer something that does not have sequential numbers where the data entry person would end up in a "rut" and get used to typing the same sequence all the time.
Are there any best practices or pitfalls associated with this sort of thing?
UPDATE: OK, I guess I need to provide more detail.
I want to use alphanumerics, not just digits
I want behavior similar to a credit card checksum, except with 8-10 characters instead of 16 digits
I want to have the id be unique; there should not be a possibility of collision.
SECOND UPDATE OK, I don't understand what is confusing about this, but I will try to explain further. I am trying to create tracking numbers that will go on forms, which will be filled out and data-entered at a later time. I will generate the id and slap it on the form; the id needs to be unique, it needs to support a LOT of numbers, and it needs to be reasonably idiot-proof for data-entry.
I don't know if this has been done, or even if it can be done, but it does not hurt to ask.
Your question is VERY general - thus just some general aspects:
Does the ID need to be "unguessable" ?
IF yes then some sort of hash should be in the mix.
Does the ID need to be "secure" (like for example an activation key or something) ?
IF yes then some sort of public key cryptography should be in the mix.
Does the ID / checksum calculation need to be fast ?
IF yes then perhaps some very simple algorithm like CRC32 or Luhn (the credit card checksum algorithm) or some barcode checksum algorithm could be worth looking at.
Is the ID generation centralized ?
IF not then you might need to check out GUIDs, current time, MAC address and similar stuff.
UPDATE - as per comments:
use a sequence in the DB
take that value and hash it, for example with MD5
take the least significant 40-48 bits of that hash
encode it as Base-36 (0-9 and A-Z) which gives you 8-10 "digits" (alphanumeric)
check the result against the DB and discard if the ID already there (for the very rare possibility of a collision)
calculate CRC-6-ITU (see http://www.itu.int/rec/T-REC-G.704-199810-I/en on page 3)
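attach the CRC result as the last "digit" (note that a 6-bit CRC has 64 possible values, so it needs a 64-symbol alphabet, or a reduction such as modulo 36, to fit into a single base-36 character)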
and thus you have a unique ID including checksum
to check the entered value you can just recalculate CRC-6-ITU from all digits but the last one and compare the result with the last digit.
The above is rather "unguessable" but definitely not of "high security".
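The hashing and encoding steps above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the names `TrackingId` and `FromSequence` are hypothetical, the sequence value is assumed to come from the database, the digest is read as a big-endian number so that its last 5 bytes are the "least significant" 40 bits, and the database uniqueness check and the check character are left out.

```csharp
using System;
using System.Globalization;
using System.Security.Cryptography;
using System.Text;

public static class TrackingId
{
    private const string Alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Hash a sequence value with MD5, keep 40 bits of the digest,
    // and encode them as 8 base-36 characters.
    public static string FromSequence(long sequenceValue)
    {
        byte[] input = Encoding.ASCII.GetBytes(
            sequenceValue.ToString(CultureInfo.InvariantCulture));
        byte[] hash;
        using (MD5 md5 = MD5.Create())
        {
            hash = md5.ComputeHash(input);
        }

        // Assumption: treat the last 5 bytes of the 16-byte digest as the low 40 bits.
        ulong value = 0;
        for (int i = hash.Length - 5; i < hash.Length; i++)
        {
            value = (value << 8) | hash[i];
        }

        // 36^8 is larger than 2^40, so 8 characters always suffice;
        // smaller values are padded with leading '0' characters.
        char[] chars = new char[8];
        for (int i = chars.Length - 1; i >= 0; i--)
        {
            chars[i] = Alphabet[(int)(value % 36)];
            value /= 36;
        }
        return new string(chars);
    }
}
```

The same sequence value always yields the same ID, so the collision check against the database is still needed before an ID is handed out.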
UPDATE 2 - as per comment:
For some inspiration on how to calculate CRC in javascript see this - it contains javascript code for CRC-8 etc.
You should be able to adapt this code based on the CRC-6-ITU polynomial.
You might imitate airline reservation systems: they convert a number into base-36, using A-Z and 0-9 as the characters. Their upper limit is thus 36^6.
If you need to guarantee uniqueness, and you don't want them to be sequential, you have to keep the used-up random numbers in a table somewhere.
After you have your random or pseudorandom ID, you only need to calculate your checkdigit.
Use a CRC algorithm. They can be adapted to any desired length (in your case, 6 bits).
Edit
In case it's not clear: even if you use alpha codes, you'll have to turn it into a number before generating the checkdigit.
Edit
Checksum validation is not heavyweight, it can be implemented client-side in javascript.
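For illustration, a bitwise CRC with the polynomial x^6 + x + 1 might look like the sketch below. The class name `Crc6` is hypothetical, and the bit ordering (most significant bit first, initial value 0) is an assumption of this sketch rather than something taken from the ITU text, so check it against the specification before relying on interoperability. The ID is turned into numbers by taking its ASCII bytes.

```csharp
using System.Text;

public static class Crc6
{
    // x^6 + x + 1 with the x^6 term dropped (it is implied by the 6-bit register).
    private const int Polynomial = 0x03;

    public static int Compute(byte[] data)
    {
        int crc = 0;
        foreach (byte b in data)
        {
            // Feed the bits of each byte, most significant bit first.
            for (int bit = 7; bit >= 0; bit--)
            {
                int inputBit = (b >> bit) & 1;
                int topBit = (crc >> 5) & 1;
                crc = (crc << 1) & 0x3F;
                if ((topBit ^ inputBit) == 1)
                {
                    crc ^= Polynomial;
                }
            }
        }
        return crc; // always in the range 0..63
    }

    // Convenience overload: checksum over the ASCII bytes of an ID string.
    public static int Compute(string id)
    {
        return Compute(Encoding.ASCII.GetBytes(id));
    }
}
```

Because the result has 64 possible values, it does not map one-to-one onto a single character of a 36-symbol alphabet; a 64-symbol check alphabet or two check characters would be needed to keep all 6 bits.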
A six-character alphanumeric ID (like an airline record locator) gives 36^6 = 2,176,782,336 combinations, about 2.2 billion. Surely that's enough? (See Wolfram Alpha for exact result.)
Most credit cards use the Luhn algorithm (also known as mod10 algorithm) as checksum algorithm to validate card numbers. From Wikipedia:
The Luhn algorithm will detect any single-digit error, as well as
almost all transpositions of adjacent digits. It will not, however,
detect transposition of the two-digit sequence 09 to 90 (or vice
versa).
The algorithm is generic and can be applied to any identification number.
As @BrokenGlass noted, you can use the Luhn check digit algorithm. Credit cards and the like use the Luhn algorithm modulo 10. Luhn mod 10 computes a check digit for a sentence drawn from the alphabet consisting solely of decimal digits (0-9). However, it is easily adapted to compute a check digit for sentences drawn from an alphabet of any size (binary, octal, hex, alphanumeric, etc.)
To do that, all you need are two methods and one property:
The number of codepoints in the alphabet in use.
This is essentially the base of the numbering system. For instance, the hexadecimal (base 16) alphabet consists of 16 characters (ignoring the issue of case-sensitivity): '0123456789ABCDEF'. '0'–'9' have their usual meaning; 'A'–'F' are the base-16 digits representing 10–15.
A means of converting a character from the alphabet in use into its corresponding codepoint.
For instance in hexadecimal, the characters '0'–'9' represent code points 0–9; the characters 'A'–'F' represent codepoints 10-15.
A means of converting a codepoint into the corresponding character.
The converse of the above. For instance, in hexadecimal, the codepoint 12 would convert to the character 'C'.
You should probably throw an ArgumentException if the code point given doesn't exist in the alphabet.
The Wikipedia article, "Luhn mod N algorithm" does a pretty good job of explaining the computation of the check digit and its validation.
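As a rough illustration of the above, here is a sketch of Luhn mod N for a 36-character alphabet, following the computation described in that Wikipedia article. The class name `LuhnModN` and its method names are hypothetical, and treating input as case-insensitive is an assumption of this sketch.

```csharp
using System;

public static class LuhnModN
{
    // The alphabet in use; its length (36) is the N in "Luhn mod N".
    private const string Alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    private static int CodePointFromCharacter(char c)
    {
        int codePoint = Alphabet.IndexOf(char.ToUpperInvariant(c));
        if (codePoint < 0)
        {
            throw new ArgumentException("Character '" + c + "' is not in the alphabet.");
        }
        return codePoint;
    }

    // Walks the input from right to left, alternating the factor between 1 and 2.
    private static int WeightedSum(string input, int startFactor)
    {
        int n = Alphabet.Length;
        int factor = startFactor;
        int sum = 0;
        for (int i = input.Length - 1; i >= 0; i--)
        {
            int addend = factor * CodePointFromCharacter(input[i]);
            factor = (factor == 2) ? 1 : 2;
            // Add the base-N "digits" of the addend.
            sum += (addend / n) + (addend % n);
        }
        return sum;
    }

    public static char GenerateCheckCharacter(string input)
    {
        int n = Alphabet.Length;
        int remainder = WeightedSum(input, 2) % n;
        return Alphabet[(n - remainder) % n];
    }

    public static bool Validate(string inputWithCheck)
    {
        return inputWithCheck.Length > 1
            && WeightedSum(inputWithCheck, 1) % Alphabet.Length == 0;
    }
}
```

For example, `LuhnModN.GenerateCheckCharacter("A1")` yields `'O'`, so `"A1O"` validates while the typo `"A2O"` and the transposition `"1AO"` do not.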
Related
I have a program whose input is like
~1^(2~&3) 0x3FFE 0x2FCE 0xFCC1
and right now I'm constructing the algorithm that parses the equation
~1^(2~&3)
and hopefully I'll be able to do it without any repeated passes through sections of the equation. Does C#, in its standard libraries, have a way of parsing an int and keeping track of the number of characters parsed? So that, for example, if I'm at the point
~1207300&11
 ^
in an equation then I want to be able to grab 1207300 and know that I parsed 7 characters so that I can move 7 indices forward to
~1207300&11
        ^
Or will I have to hand-roll such a function?
Does C#, in its standard libraries, have a way of parsing an int and keeping track of the number of characters parsed?
No, unfortunately not. All the parsing routines expect the input to be a number and nothing else. (Whitespace is also allowed and ignored.)
Find out how many chars are in the number by running a simple loop. Then, Substring the number out and pass it to int.Parse.
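A minimal sketch of that loop-then-parse approach is below. The helper name `IntScanner.TryParseAt` is hypothetical, and it only handles unsigned runs of the digits 0-9.

```csharp
using System;
using System.Globalization;

public static class IntScanner
{
    // Parses the run of decimal digits that starts at 'start' and reports
    // how many characters were consumed, so the caller can advance its index.
    public static bool TryParseAt(string text, int start, out int value, out int length)
    {
        int end = start;
        while (end < text.Length && text[end] >= '0' && text[end] <= '9')
        {
            end++;
        }

        length = end - start;
        value = 0;
        // NumberStyles.None accepts digits only; TryParse fails on overflow.
        return length > 0
            && int.TryParse(text.Substring(start, length), NumberStyles.None,
                            CultureInfo.InvariantCulture, out value);
    }
}
```

Called as `IntScanner.TryParseAt("~1207300&11", 1, out value, out length)`, it returns true with `value == 1207300` and `length == 7`.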
From a list of integers in C#, I need to generate a list of unique values. I thought of MD5 or similar, but they generate too many bytes.
Integer size is 2 bytes.
I want to get a one way correspondence, for example
0 -> ARY812Q3
1 -> S6321Q66
2 -> 13TZ79K2
So, given the hash, the user cannot recover the integer or infer a sequence behind a list of hashes.
For now, I tried using MD5(my number) and keeping the first 8 characters. However, I found the first collision at 51389. What other alternatives could I use?
As I say, I only need one way. It is not necessary to be able to calculate the integer from the hash. The system uses a dictionary to find them.
UPDATE:
Replying to some suggestions about using GetHashCode(): GetHashCode returns the same integer. My purpose is to hide the integer from the end user. In this case, the integer is the primary key of a database. I do not want to give this information to users because they could deduce the number of records in the database or the weekly growth in records.
Hashes are not unique, so maybe I need to use encryption like TripleDES or so, but I wanted to use something fast and simple. Also, TripleDES returns too many bytes.
UPDATE 2:
I was talking about hashes, and that was a mistake. In reality, I am trying to obfuscate the integer. I tried a hash algorithm for that, which is not a good idea because hashes are not unique.
Update May 2017
Feel free to use (or modify) the library I developed, installable via Nuget with:
Install-Package Kent.Cryptography.Obfuscation
This converts a non-negative id such as 127 to an 8-character string, e.g. xVrAndNb, and back (with some available options to randomize the sequence each time it's generated).
Example Usage
var obfuscator = new Obfuscator();
string maskedID = obfuscator.Obfuscate(15);
Full documentation at: Github.
Old Answer
I came across this problem way back and I couldn't find what I want in StackOverflow. So I made this obfuscation class and just shared it on github.
Obfuscation.cs - Github
You can use it by:
Obfuscation obfuscation = new Obfuscation();
string maskedValue = obfuscation.Obfuscate(5);
int? value = obfuscation.DeObfuscate(maskedValue);
Perhaps it can be of help to future visitors :)
Encrypt it with Skip32, which produces a 32 bit output. I found this C# implementation but can't vouch for its correctness. Skip32 is a relatively uncommon crypto choice and probably hasn't been analyzed much. Still it should be sufficient for your obfuscation purposes.
The strong choice would be format preserving encryption using AES in FFX mode. But that's pretty complicated and probably overkill for your application.
When encoded with Base32 (case insensitive, alphanumeric) a 32 bit value corresponds to 7 characters. When encoded in hex, it corresponds to 8 characters.
There is also the non cryptographic alternative of generating a random value, storing it in the database and handling collisions.
Xor the integer, maybe with a random key that is generated per user (stored in session). While it's not strictly a hash (as it is reversible), the advantages are that you don't need to store it anywhere, and the size will be the same.
For what you want, I'd recommend using GUIDs (or other kind of unique identifier where the probability of collision is either minimal or none) and storing them in the database row, then just never show the ID to the user.
IMHO, it's kind of bad practice to ever show the primary key in the database to the user (much less to let users do any kind of operations on them).
If they need to have raw access to the database for some reason, then just don't use ints as primary keys, and make them guids (but then your requirement loses importance since they can just access the number of records)
Edit
Based on your requirements, if you don't care that the algorithm is potentially computationally expensive, then you can just generate a random 8 byte string every time a new row is added, and keep generating random strings until you find one that is not already in the database.
This is far from optimal, and -can- be computationally expensive, but given that you use a 16-bit id and the maximum number of rows is 65536, I'd not care too much about it (the chance of a random 8 byte string already being among at most 65536 stored values is minimal, so you'll probably succeed on the first or, at worst, the second try, if your pseudo-random generator is good).
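The retry loop described above might be sketched like this. The names `RandomIdGenerator` and `NextUnique` are hypothetical, the `ISet<string>` stands in for a lookup against the values already stored in the database, and `RandomNumberGenerator.GetInt32` assumes .NET Core 3.0 or later.

```csharp
using System.Collections.Generic;
using System.Security.Cryptography;

public static class RandomIdGenerator
{
    private const string Alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // 'existing' is a stand-in for the set of IDs already in the database.
    public static string NextUnique(ISet<string> existing, int length = 8)
    {
        while (true)
        {
            char[] chars = new char[length];
            for (int i = 0; i < length; i++)
            {
                chars[i] = Alphabet[RandomNumberGenerator.GetInt32(Alphabet.Length)];
            }

            string candidate = new string(chars);
            // Add returns false when the value is already present; then retry.
            if (existing.Add(candidate))
            {
                return candidate;
            }
        }
    }
}
```

With a real database, a unique index on the column plus a retry on constraint violation would play the role of `existing.Add`, which also covers concurrent inserts.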
I need to hash a number (about 22 digits) and the result length must be less than 12 characters. It can be a number or a mix of characters, and must be unique. (The number entered will be unique too).
For example, if the number entered is 000000000000000000001, the result should be something like 2s5As5A62s.
I looked at the typicals, like MD5, SHA-1, etc., but they give high length results.
The problem with your question is that the input is larger than the output and unique. If you're expecting a unique output as well, it won't happen. The reason is that if you have an input space of, say, 22 numeric digits (10^22 possibilities) and an output space of hexadecimal digits with a length of 11 digits (16^11, about 1.8 × 10^13 possibilities), you end up with more input possibilities than output possibilities.
The graph below shows that you would need an output space of 19 hexadecimal digits (16^19 is about 7.6 × 10^22, the first power of 16 that exceeds 10^22) and a perfect one-to-one function; otherwise you will have collisions pretty often (more than 50% of the time). I assume this is something you do not want, but you did not specify.
Since what you want cannot be done, I would suggest rethinking your design or using a checksum such as the cyclic redundancy check (CRC). CRC-64 will produce a 64 bit output and when encoded with any base64 algorithm, will give you something along the lines of what you want. This does not provide cryptographic strength like SHA-1, so it should never be used in anything related to information security.
However, if you were able to change your criteria to allow for long hash outputs, then I would strongly suggest you look at SHA-512, as it will provide high quality outputs with an extremely low chance of duplication. By a low chance I mean that no two inputs have yet been found to equal the same hash in the history of the algorithm.
If both of these suggestions still are not great for you, then your last alternative is probably just going with only base64 on the input data. It will essentially utilize the standard English alphabet in the best way possible to represent your data, thus reducing the number of characters as much as possible while retaining a complete representation of the input data. This is not a hash function, but simply a method for encoding binary data.
Why not take MD5 or SHA-N, re-encode the result as Base64 (or base-whatever), and keep only 12 characters of it?
NB: In any case the hash will NEVER be unique (but it can offer a low collision probability)
You can't use a hash if it has to be unique.
You need about 74 bits to store such a number. If you convert it to base-64 it will be about 13 characters (12 base-64 characters hold only 72 bits).
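A quick way to check the resulting length is to encode the number directly. The sketch below is an assumption-laden illustration: `Base64Number` is a hypothetical name, and the digit set is the URL-safe base-64 alphabet, though any 64 distinct characters would work. It writes the number in radix 64 rather than base-64-encoding its bytes.

```csharp
using System;
using System.Numerics;
using System.Text;

public static class Base64Number
{
    private const string Alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

    // Writes a non-negative number in radix 64, most significant digit first.
    public static string Encode(BigInteger value)
    {
        if (value.Sign < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(value));
        }
        if (value.IsZero)
        {
            return Alphabet[0].ToString();
        }

        var sb = new StringBuilder();
        while (value > 0)
        {
            sb.Insert(0, Alphabet[(int)(value % 64)]);
            value /= 64;
        }
        return sb.ToString();
    }
}
```

The largest 22-digit number, 9999999999999999999999, comes out at 13 characters, while values below 64^12 (about 4.7 × 10^21) fit in 12 or fewer. The mapping is reversible and collision-free because it is an encoding, not a hash.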
Can you elaborate on what your requirement is for the hashing? Do you need to make sure the result is diverse? (i.e. not 1 = a, 2 = b)
Just thinking out loud, and a little bit laterally, but could you not apply principles of run-length encoding on your number, treating it as data you want to compress. You could then use the base64 version of your compressed version.
Looking for some hash function to make string to int mapping with following restrictions.
restrictions:
Same strings go to same number.
Different strings go to different numbers.
During one run of the application all the strings have the same length, but I only know that length at runtime.
Any suggestions how to create the hash function ?
A hash function never guarantees that two different values (strings in your case) yield different hash codes. However, the same values will always yield the same hash codes.
This is because information gets lost. If you have a string with a length of 32 characters, it will have 64 bytes (2 bytes per char). An int hash code has four bytes. So some different strings must end up with the same hash code; this is inevitable and is called a collision.
Note: Dictionary<TKey,TValue> uses a hash table internally. Therefore it implements a collision resolution strategy. See An Extensive Examination of Data Structures Using C# 2.0 on MSDN.
Here is the current implementation of dictionary.cs.
You aren't going to find a hash algorithm that guarantees that the same integer won't be returned for different strings. By definition, hash algorithms have collisions. There are far more possible strings in the world than there are possible 32-bit integers.
Different strings go to different numbers.
There are more strings than there are numbers, so this is flat out impossible without restricting the input set. You can't put n pigeons in m boxes with n > m without having at least one box contain more than one pigeon.
Is the String.GetHashCode function not right for your needs?
What's the best way to convert (to hash) a string like 3800290030, which represents an id for a classification, into a four-character one like 3450 (I need to support at most 9999 classes)? We will only have fewer than 1000 classes in the 10-character space and it will never grow to more than 10k.
The hash needs to be unique and always the same for the same input.
The resulting string should be numeric (but it will be saved as char(4) in SQL Server).
I removed the requirement for reversibility.
This is my solution, please comment:
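// Requires: using System.Numerics; using System.Security.Cryptography; using System.Text;
string classTIC = "3254002092";
byte[] classHash;
using (MD5 md5Hasher = MD5.Create())
{
    // UTF-8 yields the same bytes everywhere; Encoding.Default varies by platform.
    classHash = md5Hasher.ComputeHash(Encoding.UTF8.GetBytes(classTIC));
}
// Treat the 16 hash bytes as one non-negative integer (the trailing zero byte
// clears the sign) instead of parsing concatenated decimal text as a lossy double.
byte[] magnitude = new byte[classHash.Length + 1];
classHash.CopyTo(magnitude, 0);
// Result is in 1..9999; distinct inputs can still collide.
string newClass = ((int)(new BigInteger(magnitude) % 9999) + 1).ToString();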
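You can do something like
(str.GetHashCode() & 0x7FFFFFFF) % 9999 + 1;
(The mask is there because GetHashCode() can return a negative value. Also note that string.GetHashCode() is not guaranteed to return the same value across processes or .NET versions, so it does not satisfy "always the same" if the result is stored.)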
The hash can't be unique since you have more than 9,999 strings
It is not unique so it cannot be reversible
and of course my answer is wrong in case you don't have more than 9999 different 10 character classes.
In case you don't have more than 9999 classes you need to have a mapping from string id to its 4 char representation - for example - save the strings in a list and each string key will be its index in the list
When you want to reverse the process, and have no knowledge about the id's apart from that there are at most 9999 of them, I think you need to use a translation dictionary to map each id to its short version.
Even without the need to reverse the process, I don't think there is a way to guarantee unique id's without such a dictionary.
This short version could then simply be incremented by one with each new id.
You do not want a hash. Hashing by design allows for collisions. There is no possible hashing function for the kind of strings you work with that won't have collisions.
You need to build a persistent mapping table to convert the string to a number. Logically similar to a Dictionary<string, int>. The first string you add gets number 0. When you need to map, look up the string and return its associated number. If it is not present, add the string and simply assign it a number equal to the current count.
Making this mapping table persistent is what you'll need to think about. Trivially done with a database, of course.
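An in-memory sketch of such a mapping table is below. The class name `ClassCodeMap` is hypothetical, and persistence is deliberately left out; in practice the dictionary would be backed by a database table.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public sealed class ClassCodeMap
{
    private readonly Dictionary<string, int> codes = new Dictionary<string, int>();

    // Returns the existing four-digit code for an id, or assigns the next free one.
    public string GetCode(string classId)
    {
        int code;
        if (!codes.TryGetValue(classId, out code))
        {
            if (codes.Count >= 10000)
            {
                throw new InvalidOperationException("No four-digit codes left.");
            }
            code = codes.Count; // first id gets 0, the next gets 1, and so on
            codes.Add(classId, code);
        }
        return code.ToString("D4", CultureInfo.InvariantCulture);
    }
}
```

The first id passed in receives "0000", the second "0001", and asking again for a known id returns the code it was given before, so the mapping is unique and stable by construction.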
ehn no idea
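Unique is difficult: you have, in your request, 4 numeric characters. That's at most 10,000 distinct codes (0000 to 9999), so collisions will occur as soon as a hash has to squeeze a larger input space into them.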
Hash is not reversible. Data is lost (obviously).
I think you might need to create and store a lookup table to be able to support your requirements. And in that case you don't even need a hash you could just increment the last used 4 digit lookup code.
use md5 or sha like:
string = substring(md5("05910395410"),0,4)
or write your own simple method, for example
int sum = 0;
foreach (char c in input) // input is the id string
{
    sum += c; // ids with the same characters in a different order get the same sum
}
sum %= 9999;
Convert the number to base35/base36
ex: 3800290030 decimal = 22CGHK5 base-35 //length: 7
Or maybe convert to Base60 [leaving out capital O and small o so they are not confused with 0]
ex: 3800290030 decimal = 4tDw7A base-60 //length: 6
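A small sketch of such a base conversion is below. The name `BaseConverter.ToBase` is hypothetical; the caller supplies the alphabet, and its length determines the base.

```csharp
using System;
using System.Text;

public static class BaseConverter
{
    // Converts a number to the base given by the alphabet's length,
    // using the alphabet's characters as digits (most significant first).
    public static string ToBase(ulong value, string alphabet)
    {
        if (alphabet == null || alphabet.Length < 2)
        {
            throw new ArgumentException("The alphabet needs at least two characters.");
        }
        if (value == 0)
        {
            return alphabet[0].ToString();
        }

        ulong radix = (ulong)alphabet.Length;
        var sb = new StringBuilder();
        while (value > 0)
        {
            sb.Insert(0, alphabet[(int)(value % radix)]);
            value /= radix;
        }
        return sb.ToString();
    }
}
```

With the 35-character alphabet `"0123456789ABCDEFGHIJKLMNOPQRSTUVWXY"`, `BaseConverter.ToBase(3800290030, alphabet)` returns `"22CGHK5"`, matching the base-35 example above. Note that this shortens the id to 6 or 7 characters but cannot reach 4.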
Convert your int to binary and then base64 encode it. It won't be numeric then, but it will be reversible (strictly speaking, an encoding rather than a hash).
Edit:
As far as my sense tells me you are asking for the impossible.
You cannot take totally random data and somehow reduce the amount of data it takes to encode it (some results might be shorter, others might be longer). Thus your requirement that the number be unique cannot be met: there has to be some data loss somewhere, and no matter how you do it, it won't ensure uniqueness.
Second, due to the above, it is also not possible to make it reversible. Thus that is out of the question.
Therefore, the only possible way I can see is if you have an enumerable data source, i.e. you know all the values prior to calculating the value. In that case you can simply assign them a sequential id.