Represent string into unique int code - c#

My question is how to represent a string as an int code. I don't want it parsed or converted to an int; it is more like translating from English to French or German.
What I want is to convert the string into an int code that can be used as a search reference. I was going to use the string's hash code, but since hashing depends on the environment settings of the machine, it is not optimal for my project. I had also considered using the ASCII codes of each letter in the word, but there are a great many long words in several languages and the app is globalized, so that is not a viable solution. The project is going to be deployed as an Azure cloud site, so I don't have full-text search.
Any ideas what I can do in this case?

You are already giving a solution instead of giving requirements, so it may very well be that there are better options.
Anyway, you could use a platform-independent hash from the managed cryptography classes, such as SHA512Managed. Be aware, however, that this is not guaranteed to be unique, so you might end up with collisions; but at least it's built in and you don't have to reinvent the wheel. Go here for an example.
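As a minimal sketch of that idea (not the linked example): the helper below, whose class and method names are made up for illustration, takes the SHA-256 digest of the string's UTF-8 bytes and keeps the first four bytes as an int. SHA-256 is swapped in for SHA512Managed here only for brevity; the result is stable across machines but, as noted above, not collision-free.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class StableStringCode
{
    // Hypothetical helper: derives a 32-bit code from the SHA-256 digest of the
    // UTF-8 bytes, so the result does not depend on machine or runtime settings.
    public static int FromString(string text)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(text));
            // Assemble the first four bytes explicitly (little-endian) instead of
            // relying on BitConverter, whose byte order follows the host CPU.
            return digest[0] | (digest[1] << 8) | (digest[2] << 16) | (digest[3] << 24);
        }
    }
}
```

Truncating to 32 bits keeps the value usable as an int key, at the cost of making collisions far more likely than with the full digest.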

I solved this by creating another table that contains just the words from each post and referring to each word by its ID.

One such hash is the sum of the char codes of each letter:
int hash = s.Select<char, int>(x => (int)x).Aggregate((x, y) => x + y);
However, that hash has collisions. Another way is to concatenate the char codes of each letter, but you quickly exceed the range of an integer. One workaround is to subtract 64 from the value of each char before concatenating, although long strings will still overflow a uint:
uint hash = Convert.ToUInt32(s.Select<char, string>(x => (((int)x) - 64).ToString()).Aggregate((x, y) => x + y));
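To make the collision problem concrete, here is a small sketch of the sum-of-char-codes hash. The class name SumHashDemo is only illustrative. Any two strings that are anagrams of each other get the same value, because addition ignores order.

```csharp
using System;
using System.Linq;

public static class SumHashDemo
{
    // Sum of the UTF-16 code units; Sum (unlike Aggregate) also accepts an empty string.
    public static int SumHash(string s) => s.Sum(c => (int)c);
}
```

For instance, SumHash("ab") and SumHash("ba") are both 97 + 98 = 195, so the two strings cannot be told apart by this code.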

Related

How to obfuscate an integer?

From a list of integers in C#, I need to generate a list of unique values. I thought of MD5 or similar, but they generate too many bytes.
The integer size is 2 bytes.
I want a one-way correspondence, for example:
0 -> ARY812Q3
1 -> S6321Q66
2 -> 13TZ79K2
So, given the hash, the user cannot determine the integer or infer a sequence behind a list of hashes.
For now, I tried MD5(my number) and kept the first 8 characters. However, I found the first collision at 51389. What other alternatives could I use?
As I said, I only need one direction. It is not necessary to be able to calculate the integer from the hash. The system uses a dictionary to find them.
UPDATE:
In reply to some suggestions about using GetHashCode(): for an int, GetHashCode returns the same integer. My purpose is to hide the integer from the end user. In this case, the integer is the primary key of a database table. I do not want to give this information to users because they could deduce the number of records in the database or the growth in records per week.
Hashes are not unique, so maybe I need to use encryption such as TripleDES, but I wanted something fast and simple. Also, TripleDES returns too many bytes.
UPDATE 2:
I was talking about hashes, and that was a mistake. In reality, I am trying to obfuscate the integer. I tried it with a hash algorithm, which is not a good idea because hashes are not unique.

Update May 2017
Feel free to use (or modify) the library I developed, installable via Nuget with:
Install-Package Kent.Cryptography.Obfuscation
This converts a non-negative id such as 127 to an 8-character string, e.g. xVrAndNb, and back (with some available options to randomize the sequence each time it's generated).
Example Usage
var obfuscator = new Obfuscator();
string maskedID = obfuscator.Obfuscate(15);
Full documentation at: Github.
Old Answer
I came across this problem a while back and couldn't find what I wanted on Stack Overflow. So I made this obfuscation class and shared it on GitHub.
Obfuscation.cs - Github
You can use it by:
Obfuscation obfuscation = new Obfuscation();
string maskedValue = obfuscation.Obfuscate(5);
int? value = obfuscation.DeObfuscate(maskedValue);
Perhaps it can be of help to future visitors :)

Encrypt it with Skip32, which produces a 32 bit output. I found this C# implementation but can't vouch for its correctness. Skip32 is a relatively uncommon crypto choice and probably hasn't been analyzed much. Still it should be sufficient for your obfuscation purposes.
The strong choice would be format preserving encryption using AES in FFX mode. But that's pretty complicated and probably overkill for your application.
When encoded with Base32 (case insensitive, alphanumeric) a 32 bit value corresponds to 7 characters. When encoded in hex, it corresponds to 8 characters.
There is also the non cryptographic alternative of generating a random value, storing it in the database and handling collisions.

XOR the integer, maybe with a random key that is generated per user (stored in the session). While it's not strictly a hash (as it is reversible), the advantages are that you don't need to store it anywhere, and the size will be the same.
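A minimal sketch of the XOR idea follows; the class and method names are made up for illustration, and the key is assumed to come from wherever you store per-user secrets. Applying the same operation twice restores the original value.

```csharp
public static class XorMask
{
    // XOR is its own inverse: Apply(Apply(id, key), key) == id.
    public static int Apply(int id, int key) => id ^ key;
}
```

Keep in mind that this is weak obfuscation: XOR-ing two masked values together cancels the key and reveals the XOR of the two underlying ids, so sequential ids still show a recognizable pattern.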

For what you want, I'd recommend using GUIDs (or other kind of unique identifier where the probability of collision is either minimal or none) and storing them in the database row, then just never show the ID to the user.
IMHO, it's kind of bad practice to ever show the primary key in the database to the user (much less to let users do any kind of operations on them).
If they need to have raw access to the database for some reason, then just don't use ints as primary keys, and make them guids (but then your requirement loses importance since they can just access the number of records)
Edit
Based on your requirements, if you don't mind that the algorithm is potentially computationally expensive, you can generate a random 8-byte string every time a new row is added, and keep generating random strings until you find one that is not already in the database.
This is far from optimal and can be computationally expensive, but given that you use a 16-bit id and the maximum number of rows is 65536, I would not worry too much about it. The chance that an 8-byte random string is already among at most 65536 stored values is minimal, so you will probably succeed on the first or at most the second try, provided your pseudo-random generator is good.
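The retry loop described above could be sketched as follows. This is only an illustration under stated assumptions: the class name, the 32-symbol alphabet, and the use of an in-memory set in place of the database lookup are all hypothetical choices, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

public static class RandomCodeGenerator
{
    // 32 symbols (no I, O, 0, 1). 256 is a multiple of 32, so "b % 32" is unbiased.
    private const string Alphabet = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789";

    public static string NewCode(int length)
    {
        var bytes = new byte[length];
        using (RandomNumberGenerator rng = RandomNumberGenerator.Create())
        {
            rng.GetBytes(bytes);
        }
        var chars = new char[length];
        for (int i = 0; i < length; i++)
        {
            chars[i] = Alphabet[bytes[i] % Alphabet.Length];
        }
        return new string(chars);
    }

    // "existing" stands in for a lookup against a unique database column.
    public static string NewUniqueCode(ISet<string> existing, int length = 8)
    {
        string code;
        do
        {
            code = NewCode(length);
        }
        while (!existing.Add(code)); // Add returns false when the code is already taken.
        return code;
    }
}
```

In a real application the uniqueness check would be a unique constraint on the column plus a retry on violation, so that two concurrent inserts cannot both pass the check.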

hash that maps strings to integers

I am looking for a hash function that maps strings to ints with the following restrictions.
Restrictions:
The same string always maps to the same number.
Different strings map to different numbers.
During one run of the application, all the strings I get have the same length, but I only know that length at runtime.
Any suggestions on how to create the hash function?

A hash function never guarantees that two different values (strings in your case) yield different hash codes. However, the same value will always yield the same hash code.
This is because information gets lost. A string of 32 characters occupies 64 bytes (2 bytes per char), while an int hash code has only four bytes, so some different strings must share a hash code. This is inevitable and is called a collision.
Note: Dictionary<TKey,TValue> uses a hash table internally. Therefore it implements a collision resolution strategy. See An Extensive Examination of Data Structures Using C# 2.0 on MSDN.
Here is the current implementation of dictionary.cs.

You aren't going to find a hash algorithm that guarantees that the same integer won't be returned for different strings. By definition, hash algorithms have collisions. There are far more possible strings in the world than there are possible 32-bit integers.

Different strings go to different numbers.
There are more strings than there are numbers, so this is flat out impossible without restricting the input set. You can't put n pigeons in m boxes with n > m without having at least one box contain more than one pigeon.

Is the String.GetHashCode function not right for your needs?

Convert ten character classification string into four character one in C#

What's the best way to convert (hash) a string like 3800290030, which represents an id for a classification, into a four-character one like 3450? (I need to support at most 9999 classes.) We will have fewer than 1000 classes in the 10-character space, and it will never grow to more than 10k.
The hash needs to be unique and always the same for the same input.
The resulting string should be numeric (but it will be saved as char(4) in SQL Server).
I removed the requirement for reversibility.
This is my solution, please comment:
string classTIC = "3254002092";
MD5 md5Hasher = MD5.Create();
byte[] classHash = md5Hasher.ComputeHash(Encoding.Default.GetBytes(classTIC));
StringBuilder sBuilder = new StringBuilder();
foreach (byte b in classHash)
{
    sBuilder.Append(b.ToString());
}
string newClass = (double.Parse(sBuilder.ToString()) % 9999 + 1).ToString();

You can do something like:
Math.Abs(str.GetHashCode() % 9999) + 1;
(The Math.Abs is needed because GetHashCode can return a negative value.)
The hash can't be unique if you have more than 9,999 strings.
Since it is not unique, it cannot be reversible.
Of course, this answer does not apply if you have no more than 9,999 different 10-character classes.
In that case you need a mapping from each string id to its 4-character representation. For example, save the strings in a list and let each string's key be its index in the list.

If you want to reverse the process and know nothing about the ids apart from the fact that there are at most 9999 of them, I think you need a translation dictionary that maps each id to its short version.
Even without the need to reverse the process, I don't think there is a way to guarantee unique ids without such a dictionary.
The short version could then simply be incremented by one with each new id.

You do not want a hash. Hashing by design allows for collisions. There is no possible hashing function for the kind of strings you work with that won't have collisions.
You need to build a persistent mapping table to convert the string to a number, logically similar to a Dictionary<string, int>. The first string you add gets number 0. When you need to map, look up the string and return its associated number. If it is not present, add the string and assign it a number equal to the current count.
Making this mapping table persistent is what you'll need to think about. That is trivially done with a database, of course.
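The lookup-or-assign logic described above can be sketched in memory like this. The class name StringIdMap is made up for illustration, and persistence (the part the answer says needs thought) is deliberately left out.

```csharp
using System.Collections.Generic;

public sealed class StringIdMap
{
    private readonly Dictionary<string, int> ids = new Dictionary<string, int>();

    // Returns the number already assigned to key, or assigns the next free one.
    public int GetOrAdd(string key)
    {
        int id;
        if (!ids.TryGetValue(key, out id))
        {
            id = ids.Count; // The first string gets 0, the next 1, and so on.
            ids.Add(key, id);
        }
        return id;
    }
}
```

Because numbers are handed out in order, fewer than 10,000 distinct strings always fit in four digits (format with ToString("D4") if a fixed width is needed), and no two strings ever share a number.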

ehn no idea
Unique is difficult. Your request allows 4 characters, which is a maximum of 9999 values, so collisions will occur.
Hash is not reversible. Data is lost (obviously).

I think you might need to create and store a lookup table to be able to support your requirements. And in that case you don't even need a hash you could just increment the last used 4 digit lookup code.

Use MD5 or SHA, for example (pseudocode):
code = substring(md5("05910395410"), 0, 4)
Or write your own simple method, for example:
int sum = 0;
foreach (char c in input)
{
    sum += c;
}
sum %= 9999;

Convert the number to base 35 or base 36.
ex: 3800290030 decimal = 22CGHK5 base-35 //length: 7
Or maybe convert to base 60 [ignoring capital O and small o so they are not confused with 0]
ex: 3800290030 decimal = 4tDw7A base-60 //length: 6
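A minimal sketch of such a radix conversion, assuming the usual digit order 0-9 then A-Z (the class name and digit string are illustrative choices; the base-60 alphabet used above is not specified, so it is not reproduced here):

```csharp
using System;
using System.Text;

public static class RadixEncoder
{
    private const string Digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Converts value to the given radix (2..36) by repeated division.
    public static string ToRadix(ulong value, int radix)
    {
        if (radix < 2 || radix > Digits.Length)
        {
            throw new ArgumentOutOfRangeException(nameof(radix));
        }
        if (value == 0)
        {
            return "0";
        }
        var sb = new StringBuilder();
        while (value > 0)
        {
            // The remainder is the next least-significant digit.
            sb.Insert(0, Digits[(int)(value % (ulong)radix)]);
            value /= (ulong)radix;
        }
        return sb.ToString();
    }
}
```

With this digit order, ToRadix(3800290030UL, 35) gives the 22CGHK5 shown above. The result is shorter than the decimal form but still seven characters, so it does not by itself reach the four-character target.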

Convert your int to binary and then Base64-encode it. It won't be numeric then, but it will be a reversible encoding.
Edit:
As far as I can tell, you are asking for the impossible.
You cannot take totally random data and somehow reduce the amount of data it takes to encode it (some values might come out shorter, others longer). So your requirement that the number be unique cannot be met: there has to be some data loss somewhere, and no matter how you do it, it won't ensure uniqueness.
Second, because of the above, it is also not possible to make it reversible, so that is out of the question.
Therefore, the only way I can see is if you have an enumerable data source, i.e. you know all the values before calculating the codes. In that case you can simply assign them a sequential id.

long/large numbers and modulus in .NET

I'm currently writing a quick custom encoding method where I stamp a key with a number to verify that it is a valid key.
Basically, I take whatever number comes out of the encoding and multiply it by a key.
I would then deploy the product of those numbers to the user/customer who purchases the key. I wanted to simply use (Code % Key == 0) to verify that the key is valid, but for large values the mod operator does not seem to work as expected.
Number = 468721387;
Key = 12345678;
Code = Number * Key;
Using the numbers above:
Code % Key == 11418772
And for smaller numbers it would correctly return 0. Is there a reliable way to check divisibility for a long in .NET?
Thanks!
EDIT:
Ok, tell me if I'm special and missing something...
long a = DateTime.Now.Ticks;
long b = 12345;
long c = a * b;
long d = c % b;
d == 10001 (Bad)
and
long a = DateTime.Now.Ticks;
long b = 12;
long c = a * b;
long d = c % b;
d == 0 (Good)
What am I doing wrong?

As others have said, your problem is integer overflow. You can make this more obvious by checking "Check for arithmetic overflow/underflow" in the "Advanced Build Settings" dialog. When you do so, you'll get an OverflowException when you perform DateTime.Now.Ticks * 12345.
One simple solution is to change "long" to "decimal" in your code. (A "double" avoids the overflow too, but it keeps only about 15 to 17 significant digits, so a remainder computed on it is not reliable for values this large.)
In .NET 4.0, there is a new BigInteger class.
Finally, you say you're "... writing a quick custom encoding method ...", so a simple homebrew solution may be satisfactory for your needs. However, if this is production code, you might consider more robust solutions involving cryptography or something from a third-party who specializes in software licensing.
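To illustrate both points, here is a small sketch (class and method names are hypothetical) that detects the overflow with a checked multiplication and performs the divisibility test with System.Numerics.BigInteger, which does not overflow. On .NET Framework this assumes a reference to the System.Numerics assembly.

```csharp
using System;
using System.Numerics;

public static class KeyCheck
{
    // Returns false instead of silently wrapping when a * b does not fit in an int.
    public static bool TryMultiply(int a, int b, out int product)
    {
        try
        {
            product = checked(a * b);
            return true;
        }
        catch (OverflowException)
        {
            product = 0;
            return false;
        }
    }

    // Exact divisibility test on arbitrarily large integers.
    public static bool IsMultipleOf(BigInteger code, BigInteger key)
    {
        return code % key == 0;
    }
}
```

With the numbers from the question, TryMultiply(468721387, 12345678, out _) reports an overflow, while IsMultipleOf(new BigInteger(468721387) * 12345678, 12345678) is true.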

The answers that say that integer overflow is the likely culprit are almost certainly correct; you can verify that by putting a "checked" block around the multiplication and seeing if it throws an exception.
But there is a much larger problem here that everyone seems to be ignoring.
The best thing to do is to take a large step back and reconsider the wisdom of this entire scheme. It appears that you are attempting to design a crypto-based security system but you are clearly not an expert on cryptographic arithmetic. That is a huge red warning flag. If you need a crypto-based security system DO NOT ATTEMPT TO ROLL YOUR OWN. There are plenty of off-the-shelf crypto systems that are built by experts, heavily tested, and readily available. Use one of them.
If you are in fact hell-bent on rolling your own crypto, getting the math right in 64 bits is the least of your worries. 64 bit integers are way too small for this crypto application. You need to be using a much larger integer size; otherwise, finding a key that matches the code is trivial.
Again, I cannot emphasize strongly enough how difficult it is to construct correct crypto-based security code that actually protects real users from real threats.

Integer Overflow...see my comment.
The value of the multiplication you're doing overflows the int data type and causes it to wrap (int values fall between -2147483648 and 2147483647).
Pick a more appropriate data type to hold a value as large as 5786683315615386 (the result of your multiplication).
UPDATE
Your new example changes things a little.
You're using long, but now you're using System.DateTime.Ticks which on Mono (not sure about the MS platform) is returning 633909674610619350.
When you multiply that by a large number, you are now overflowing a long just like you were overflowing an int previously. At that point, you'll probably need a wider type such as decimal (depending on how large your multiplier gets) or BigInteger; a double has the range but not the precision for an exact remainder.

Apparently, your Code fails to fit in the int data type. Try using long instead:
long code = (long)number * key;
The (long) cast is necessary. Without the cast, the multiplication will be done in 32-bit integer arithmetic (assuming the number and key variables are typed int) and the already-overflowed result will then be converted to long, which is not what you want. By casting one of the operands to long, you tell the compiler to perform the multiplication on two long values.

Need a smaller alternative to GUID for DB ID but still unique and random for URL

I have looked all over the place for this and I can't seem to find a complete answer. So if the answer already exists on Stack Overflow, I apologize in advance.
I want a unique and random ID so that users of my website can't guess the next number and just hop to someone else's information. I plan to stick with an incrementing ID for the primary key but also store a random and unique ID (sort of a hash) for that row in the DB and put an index on it.
From my searching I realize that I would like to avoid collisions, and I have read some mentions of SHA1.
My basic requirements are
Something smaller than a GUID. (Looks horrible in URL)
Must be unique
Avoid collisions
Not a long list of strange characters that are unreadable.
An example of what I am looking for would be www.somesite.com/page.aspx?id=AF78FEB
I am not sure whether I should be implementing this in the database (I am using SQL Server 2005) or in the code (I am using C# ASP.Net)
EDIT:
From all the reading I have done, I realize that this is security through obscurity. I do intend to have proper authorization and authentication for access to the pages, using .NET's authentication and authorization framework. But suppose a legitimate user has logged in and is accessing a legitimate (but dynamically created) page filled with links to items that belong to him. For example, a link might be www.site.com/page.aspx?item_id=123. What is stopping him from clicking on that link, then altering the URL to www.site.com/page.aspx?item_id=456, which does NOT belong to him? I know some Java technologies like Struts (I stand to be corrected) store everything in the session and somehow work it out from that, but I have no idea how this is done.

Raymond Chen has a good article on why you shouldn't use "half a guid", and offers a suitable solution to generating your own "not quite guid but good enough" type value here:
GUIDs are globally unique, but substrings of GUIDs aren't
His strategy (without a specific implementation) was based on:
Four bits to encode the computer number,
56 bits for the timestamp, and
four bits as a uniquifier.
We can reduce the number of bits to make the computer unique since the number of computers in the cluster is bounded, and we can reduce the number of bits in the timestamp by assuming that the program won’t be in service 200 years from now.
You can get away with a four-bit uniquifier by assuming that the clock won’t drift more than an hour out of skew (say) and that the clock won’t reset more than sixteen times per hour.
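The article gives no implementation, so the following is only a sketch of how the 4 + 56 + 4 bit layout could be packed into a single 64-bit value. The class name, the parameter names, and the field order (machine in the top bits, uniquifier in the bottom bits) are assumptions made for illustration.

```csharp
public static class CompactId
{
    // Layout (assumed): [4 bits machine][56 bits timestamp][4 bits uniquifier].
    public static ulong Pack(int machine, ulong timestamp, int uniquifier)
    {
        ulong machineBits = (ulong)(machine & 0xF) << 60;
        ulong timeBits = (timestamp & 0x00FFFFFFFFFFFFFFUL) << 4;
        ulong uniqueBits = (ulong)(uniquifier & 0xF);
        return machineBits | timeBits | uniqueBits;
    }
}
```

Values outside each field's range are masked off rather than rejected; a production version would more likely validate the inputs. Note that an id built this way is unique but not random, so it is guessable unless it is further obfuscated.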

UPDATE (4 Feb 2017):
Walter Stabosz discovered a bug in the original code. Upon investigation, further bugs were discovered; however, extensive testing and reworking of the code by myself, the original author (CraigTP), has now fixed all of these issues. I've updated the code here with the correct working version, and you can also download a Visual Studio 2015 solution here which contains the "shortcode" generation code and a fairly comprehensive test suite to prove correctness.
One interesting mechanism I've used in the past is to internally just use an incrementing integer/long, but to "map" that integer to an alphanumeric "code".
Example
Console.WriteLine($"1371 as a shortcode is: {ShortCodes.LongToShortCode(1371)}");
Console.WriteLine($"12345 as a shortcode is: {ShortCodes.LongToShortCode(12345)}");
Console.WriteLine($"7422822196733609484 as a shortcode is: {ShortCodes.LongToShortCode(7422822196733609484)}");
Console.WriteLine($"abc as a long is: {ShortCodes.ShortCodeToLong("abc")}");
Console.WriteLine($"ir6 as a long is: {ShortCodes.ShortCodeToLong("ir6")}");
Console.WriteLine($"atnhb4evqqcyx as a long is: {ShortCodes.ShortCodeToLong("atnhb4evqqcyx")}");
// PLh7lX5fsEKqLgMrI9zCIA
Console.WriteLine(GuidToShortGuid( Guid.Parse("957bb83c-5f7e-42b0-aa2e-032b23dcc220") ) );
Code
The following code shows a simple class that will change a long to a "code" (and back again!):
public static class ShortCodes
{
    // You may change the "shortcodeKeyspace" variable to contain as many or as few characters as you
    // please. The more characters that are included in the "shortcodeKeyspace" string, the shorter
    // the codes you can produce for a given long.
    private static string shortcodeKeyspace = "abcdefghijklmnopqrstuvwxyz0123456789";

    public static string LongToShortCode(long number)
    {
        // Guard clause. If passed 0 as input
        // we always return empty string.
        if (number == 0)
        {
            return string.Empty;
        }

        var keyspaceLength = shortcodeKeyspace.Length;
        var shortcodeResult = "";
        var numberToEncode = number;
        do
        {
            // Digits run from 1 to keyspaceLength (there is no zero digit).
            var characterValue = numberToEncode % keyspaceLength == 0 ? keyspaceLength : numberToEncode % keyspaceLength;
            var indexer = (int) characterValue - 1;
            shortcodeResult = shortcodeKeyspace[indexer] + shortcodeResult;
            numberToEncode = ((numberToEncode - characterValue) / keyspaceLength);
        }
        while (numberToEncode != 0);
        return shortcodeResult;
    }

    public static long ShortCodeToLong(string shortcode)
    {
        var keyspaceLength = shortcodeKeyspace.Length;
        long shortcodeResult = 0;
        var shortcodeLength = shortcode.Length;
        var codeToDecode = shortcode;
        foreach (var character in codeToDecode)
        {
            shortcodeLength--;
            var codeChar = character;
            var codeCharIndex = shortcodeKeyspace.IndexOf(codeChar);
            if (codeCharIndex < 0)
            {
                // The character is not part of the keyspace and so entire shortcode is invalid.
                return 0;
            }
            try
            {
                checked
                {
                    shortcodeResult += (codeCharIndex + 1) * (long) (Math.Pow(keyspaceLength, shortcodeLength));
                }
            }
            catch(OverflowException)
            {
                // We've overflowed the maximum size for a long (possibly the shortcode is invalid or too long).
                return 0;
            }
        }
        return shortcodeResult;
    }
}
This is essentially your own baseX numbering system (where X is the number of unique characters in the shortcodeKeyspace string).
To make things unpredictable, start your internal incrementing numbering at something other than 1 or 0 (e.g. start at 184723) and also change the order of the characters in the shortcodeKeyspace string (i.e. use the letters A-Z and the numbers 0-9, but scramble their order within the string). This will help make each code somewhat unpredictable.
If you're using this to "protect" anything, this is still security by obscurity, and if a given user can observe enough of these generated codes, they can predict the relevant code for a given long. The "security" (if you can call it that) of this is that the shortcodeKeyspace string is scrambled and remains secret.
EDIT:
If you just want to generate a GUID and transform it into something that is still unique but contains fewer characters, this little function will do the trick:
public static string GuidToShortGuid(Guid gooid)
{
    string encoded = Convert.ToBase64String(gooid.ToByteArray());
    encoded = encoded.Replace("/", "_").Replace("+", "-");
    return encoded.Substring(0, 22);
}
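Since the 22-character form drops only the two "=" padding characters, it can be turned back into the original GUID. The following sketch is not part of the original answer; the ShortGuid class and its method names are hypothetical, and Encode repeats the function above so that the block is self-contained.

```csharp
using System;

public static class ShortGuid
{
    // Same transformation as GuidToShortGuid above: URL-safe Base64 without padding.
    public static string Encode(Guid value)
    {
        string encoded = Convert.ToBase64String(value.ToByteArray());
        return encoded.Replace("/", "_").Replace("+", "-").Substring(0, 22);
    }

    // Restores the standard Base64 alphabet and padding, then rebuilds the GUID.
    public static Guid Decode(string shortGuid)
    {
        string base64 = shortGuid.Replace("_", "/").Replace("-", "+") + "==";
        return new Guid(Convert.FromBase64String(base64));
    }
}
```

Decode assumes a well-formed 22-character input; malformed strings make Convert.FromBase64String throw a FormatException, which a caller handling URL input would want to catch.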

If you don't want other users to see other people's information, why don't you secure the page that uses the id?
If you do that, then it won't matter if you use an incrementing id.

[In response to the edit]
You should consider query strings as "evil input". You need to programmatically check that the authenticated user is allowed to view the requested item.
if( !item456.BelongsTo(user123) )
{
    // Either show them one of their own items or show an error message.
}
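Spelled out slightly more, the ownership check might look like the sketch below. The Item type, its OwnerUserId property, and the ItemAccess helper are all hypothetical stand-ins for whatever your data model provides.

```csharp
public sealed class Item
{
    public int Id { get; set; }
    public int OwnerUserId { get; set; }

    public bool BelongsTo(int userId)
    {
        return OwnerUserId == userId;
    }
}

public static class ItemAccess
{
    // Treats a missing item the same as a forbidden one, so the response
    // does not reveal which ids exist.
    public static bool CanView(Item item, int authenticatedUserId)
    {
        return item != null && item.BelongsTo(authenticatedUserId);
    }
}
```

The important part is that authenticatedUserId comes from the server-side session or authentication ticket, never from the query string.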

You could randomly generate a number. Check that this number is not already in the DB and use it. If you want it to appear as a random string you could just convert it to hexadecimal, so you get A-F in there just like in your example.

A GUID is 128 bits. If you represent these bits not with a character set of just 16 characters (16 = 2^4 and 128/4 = 32 characters) but with a character set of, let's say, 64 characters (like Base64), you end up with only 22 characters (64 = 2^6 and 128/6 = 21.333, so 22 characters).

Take your auto-increment ID and HMAC-SHA1 it with a secret known only to you. This generates 160 random-looking bits that hide the real incremental ID. Then take a prefix long enough to make collisions sufficiently unlikely for your application, say 64 bits, which you can encode in 11 Base64 characters (or 16 hex characters). Use this as your string.
HMAC ensures that no one without the secret can map from the bits shown back to the underlying number. Because each auto-increment ID is unique, the full HMAC outputs are, for practical purposes, distinct. So your risk of collisions comes from the likelihood of a 64-bit partial collision in SHA1. With this method, you can determine in advance whether you will have any collisions by pre-generating all the strings this method would generate (e.g. up to the number of rows you expect) and checking.
Of course, if you are willing to specify a unique condition on your database column, then simply generating a totally random number will work just as well. You just have to be careful about the source of randomness.
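A minimal sketch of the HMAC approach, assuming the row id is fed to the MAC as its invariant decimal text and the first 64 bits are shown as hex. The class name, method name, and both of those encoding choices are illustrative, not prescribed by the answer.

```csharp
using System;
using System.Globalization;
using System.Security.Cryptography;
using System.Text;

public static class PublicIdGenerator
{
    // Returns 16 hex characters: the first 8 bytes (64 bits) of HMAC-SHA1(secretKey, rowId).
    public static string FromRowId(long rowId, byte[] secretKey)
    {
        byte[] message = Encoding.ASCII.GetBytes(rowId.ToString(CultureInfo.InvariantCulture));
        using (var hmac = new HMACSHA1(secretKey))
        {
            byte[] mac = hmac.ComputeHash(message);
            return BitConverter.ToString(mac, 0, 8).Replace("-", "");
        }
    }
}
```

The same id and key always produce the same string, so the value can either be stored in an indexed column or recomputed on demand; the secret key must stay on the server.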

How long is too long? You could convert the GUID to Base 64, which ends up making it quite a bit shorter.

What you could do is something I do when I want exactly what you are after.
1. Create your GUID.
2. Remove the dashes, and take a substring as long as you want your ID to be.
3. Check the DB for that ID; if it exists, go to step 1.
4. Insert the record.
This is the simplest way to ensure it is obscured and unique.

I have just had an idea and I see Greg also pointed it out. I have the user stored in the session with a user ID. When I create my query I will join on the Users table with that User ID, if the result set is empty then we know he was hacking the URL and I can redirect to an error page.

A GUID is just a number
The latest generation of GUIDs (version 4) is basically a big random number*
Because it's a big random number the chances of a collision are REALLY small.
The biggest number you can make with a GUID is over:
5,000,000,000,000,000,000,000,000,000,000,000,000
So if you generate two GUIDs the chance the second GUID is the same as the first is:
1 in 5,000,000,000,000,000,000,000,000,000,000,000,000
If you generate 100 BILLION GUIDs.
The chance your 100 billionth GUID collides with the other 99,999,999,999 GUIDs is:
1 in 50,000,000,000,000,000,000,000,000
Why 128 bits?
One reason is that computers like working with multiples of 8 bits.
8, 16, 32, 64, 128, etc
The other reason is that the guy who came up with the GUID felt 64 wasn't enough, and 256 was way too much.
Do you need 128 bits?
No. How many bits you need depends on how many numbers you expect to generate and how sure you want to be that they don't collide.
64 bit example
Suppose you used a 64-bit random number instead. Then the chance that your second number would collide with the first would be:
1 in 18,000,000,000,000,000,000 (64 bit)
Instead of:
1 in 5,000,000,000,000,000,000,000,000,000,000,000,000 (128 bit)
What about the 100 billionth number?
The chance your 100 billionth number collides with the other 99,999,999,999 would be:
1 in 180,000,000 (64 bit)
Instead of:
1 in 50,000,000,000,000,000,000,000,000 (128 bit)
So should you use 64 bits?
It depends. Are you generating 100 billion numbers? Even if you were, do odds of 1 in 180,000,000 make you uncomfortable?
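The "1 in N" figures above can be reproduced with a little BigInteger arithmetic. This sketch (hypothetical class and method names) uses the same approximation as the text: the odds that one new random value matches any of the existing values are about 2^bits divided by the number of existing values.

```csharp
using System.Numerics;

public static class CollisionOdds
{
    // Approximate N in "1 in N" for a new value of randomBits bits colliding
    // with any of existingCount previously generated values.
    public static BigInteger OneIn(int randomBits, long existingCount)
    {
        return BigInteger.Pow(2, randomBits) / existingCount;
    }
}
```

For example, OneIn(64, 100000000000) is roughly 1.8 x 10^8 and OneIn(122, 100000000000) is roughly 5.3 x 10^25, matching the rounded figures above.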
A little more details about GUIDs
I'm specifically talking about version 4.
Version 4 doesn't actually use all 128 bits for the random number portion; it uses 122 bits. The other 6 bits are used to indicate that it is version 4 of the GUID standard (the version and variant fields).
The numbers in this answer are based on 122 bits.
And yes, since it's just a random number, you can take however many bits you want from it. (Just make sure you don't take any of the 6 versioning bits that never change; see above.)
Instead of taking bits from the GUID, though, you could use the same random number generator the GUID got its bits from.
It probably used the random number generator that comes with the operating system.

Late to the party, but I found this to be the most reliable way to generate Base62 random strings in C#. (Note that System.Random is not cryptographically secure, so use a cryptographic generator if the strings must be unguessable.)
private static Random random = new Random();
void Main()
{
    var s = RandomString(7);
    Console.WriteLine(s);
}
public static string RandomString(int length)
{
    const string chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    return new string(Enumerable.Repeat(chars, length)
        .Select(s => s[random.Next(s.Length)]).ToArray());
}
