Convert a String to unique Long - c#

I have a string made up of 2 parts (see code)
I want to know the UnknownDeterministicFunction which returns a Long, which can deterministically produce the same long for a given string.
private void MyProgram()
{
string resultStr = "XXX"+"12345678";
//1st part is a string of characters (the "XXX")
//2nd part is a string of numbers (the "12345678")
long resultLng = UnknownDeterministicFunction(resultStr);
}
private long UnknownDeterministicFunction(string inputStr)
{
// ???
}
Is this possible in C#?

First of all:
there are roughly 65536 ** (2 ** 30) different strings (a .NET string can hold about 2 ** 30 UTF-16 characters, i.e. up to 2 GB)
there are only 2 ** 64 different long values (long is a 64-bit integer)
So you can't guarantee that the long will be unique (the good old pigeonhole principle). If you can accept possible, though improbable, collisions (i.e. different strings may well return the same long), you may want to implement a hash function, e.g.
hash function for string
or
Good Hash Function for Strings
Usually a hash function returns an Int32; in that case just combine two ints into one long:
int hash1 = GetHashOneAlgorithm(myString);
int hash2 = GetHashAnotherAlgorithm(myString);
// cast to uint so a negative hash2 is not sign-extended over the upper 32 bits
long result = ((long) hash1 << 32) | (uint) hash2;
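If the value has to be the same across runs and machines, one option is to compute a 64-bit hash directly instead of gluing two 32-bit hashes together. Below is a minimal sketch using FNV-1a over the UTF-8 bytes of the string. The names StableHash and Fnv1a64 are made up for illustration, and collisions are still possible, just unlikely.

```csharp
using System;
using System.Text;

static class StableHash
{
    // 64-bit FNV-1a: xor each byte into the hash, then multiply by the FNV prime.
    public static long Fnv1a64(string input)
    {
        const ulong offsetBasis = 0xcbf29ce484222325UL;
        const ulong prime = 0x100000001b3UL;

        ulong hash = offsetBasis;
        foreach (byte b in Encoding.UTF8.GetBytes(input))
        {
            hash ^= b;
            hash = unchecked(hash * prime); // wraps modulo 2^64 on purpose
        }
        return unchecked((long)hash);       // reinterpret the 64 bits as a signed long
    }
}
```

Usage would be something like long id = StableHash.Fnv1a64("XXX12345678"); the same input always yields the same long, regardless of process or runtime version.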

OK the answer is simple.
private long UnknownDeterministicFunction(string inputStr)
{
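//not taking care of null...
//note: string.GetHashCode() is only 32 bits wide and is not guaranteed to be stable across processes or .NET versions
return (long)inputStr.GetHashCode();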
}

Related

How do I implement BN_num_bytes() (and BN_num_bits() ) in C#?

I'm porting this line from C++ to C#, and I'm not an experienced C++ programmer:
unsigned int nSize = BN_num_bytes(this);
In .NET I'm using System.Numerics.BigInteger
BigInteger num = originalBigNumber;
byte[] numAsBytes = num.ToByteArray();
uint compactBitsRepresentation = 0;
uint size2 = (uint)numAsBytes.Length;
I think there is a fundamental difference in how they operate internally, since the sources' unit tests' results don't match if the BigInt equals:
0
Any negative number
0x00123456
I know literally nothing about BN_num_bytes (edit: the comments just told me that it's a macro for BN_num_bits).
Question
Would you verify these guesses about the code:
I need to port BN_num_bytes which is a macro for ((BN_num_bits(bn)+7)/8) (Thank you @WhozCraig)
I need to port BN_num_bits which is floor(log2(w))+1
Then, if the possibility exists that leading and trailing bytes aren't counted, then what happens on Big/Little endian machines? Does it matter?
Based on these answers on Security.StackExchange, and that my application isn't performance critical, I may use the default implementation in .NET and not use an alternate library that may already implement a comparable workaround.
Edit: so far my implementation looks something like this, but I'm not sure what the "LookupTable" is as mentioned in the comments.
private static int BN_num_bytes(byte[] numAsBytes)
{
int bits = BN_num_bits(numAsBytes);
return (bits + 7) / 8;
}
private static int BN_num_bits(byte[] numAsBytes)
{
var log2 = Math.Log(numAsBytes.Length, 2);
var floor = Math.Floor(log2);
return (int)floor + 1;
}
Edit 2:
After some more searching, I found that:
BN_num_bits does not return the number of significant bits of a given bignum, but rather the position of the most significant 1 bit, which is not necessarily the same thing
Though I still don't know what the source of it looks like...
The man page (OpenSSL project) of BN_num_bits says that "Basically, except for a zero, it returns floor(log2(w))+1.".
So these are the correct implementations of the BN_num_bytes and BN_num_bits functions for .Net's BigInteger.
public static int BN_num_bytes(BigInteger number) {
if (number == 0) {
return 0;
}
return 1 + (int)Math.Floor(BigInteger.Log(BigInteger.Abs(number), 2)) / 8;
}
public static int BN_num_bits(BigInteger number) {
if (number == 0) {
return 0;
}
return 1 + (int)Math.Floor(BigInteger.Log(BigInteger.Abs(number), 2));
}
You should probably change these into extension methods for convenience.
You should understand that these functions measure the minimum number of bits/bytes that are needed to express a given integer number. Variables declared as int (System.Int32) take 4 bytes of memory, but you only need 1 byte (or 3 bits) to express the integer number 7. This is what BN_num_bytes and BN_num_bits calculate - the minimum required storage size for a concrete number.
You can find the source code of the original implementations of the functions in the official OpenSSL repository.
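BigInteger.Log works in floating point, so for very large values close to a power of two the result could in principle be off by one. As a hedged alternative, the bit count can be read directly from the little-endian byte array returned by ToByteArray. This is only a sketch: the names BigIntegerBits, NumBits and NumBytes are invented here, and negative numbers are measured by their absolute value, like the Log-based version above.

```csharp
using System;
using System.Numerics;

static class BigIntegerBits
{
    // Position of the most significant 1 bit of |value| (0 for zero).
    public static int NumBits(BigInteger value)
    {
        if (value.IsZero) return 0;

        // Little-endian two's complement; positive values may end with a 0x00 sign byte.
        byte[] bytes = BigInteger.Abs(value).ToByteArray();
        int top = bytes.Length - 1;
        while (top > 0 && bytes[top] == 0) top--;

        int bits = top * 8;                               // all lower bytes count fully
        for (int b = bytes[top]; b != 0; b >>= 1) bits++; // bits used in the top byte
        return bits;
    }

    // Minimum number of bytes needed to hold |value|.
    public static int NumBytes(BigInteger value) => (NumBits(value) + 7) / 8;
}
```

For the 0x00123456 case from the question this gives 21 bits and 3 bytes, and 0 for both on a zero input.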
Combine what WhozCraig in the comments said with this link explaining BN_num_bits:
http://www.openssl.org/docs/crypto/BN_num_bytes.html
And you end up with something like this, which should tell you the significant number of bytes:
public static int NumberOfBytes(BigInteger bigInt)
{
if (bigInt == 0)
{
return 0; //you need to check what BN_num_bits actually does here as not clear from docs, probably returns 0
}
return (int)Math.Ceiling(BigInteger.Log(bigInt + 1, 2) + 7) / 8;
}

What's the best way to represent System.Double as a sortable string?

In data formats where all underlying types are strings, numeric types must be converted to a standardized string format which can be compared alphabetically. For example, a short for the value 27 could be represented as 00027 if there are no negatives.
What's the best way to represent a double as a string? In my case I can ignore negatives, but I'd be curious how you'd represent the double in either case.
UPDATE
Based on Jon Skeet's suggestion, I'm now using this, though I'm not 100% sure it'll work correctly:
static readonly string UlongFormatString = new string('0', ulong.MaxValue.ToString().Length);
public static string ToSortableString(this double n)
{
return BitConverter.ToUInt64(BitConverter.GetBytes(BitConverter.DoubleToInt64Bits(n)), 0).ToString(UlongFormatString);
}
public static double DoubleFromSortableString(this string n)
{
return BitConverter.Int64BitsToDouble(BitConverter.ToInt64(BitConverter.GetBytes(ulong.Parse(n)), 0));
}
UPDATE 2
I have confirmed what Jon suspected - negatives don't work using this method. Here is some sample code:
void Main()
{
var a = double.MaxValue;
var b = double.MaxValue/2;
var c = 0d;
var d = double.MinValue/2;
var e = double.MinValue;
Console.WriteLine(a.ToSortableString());
Console.WriteLine(b.ToSortableString());
Console.WriteLine(c.ToSortableString());
Console.WriteLine(d.ToSortableString());
Console.WriteLine(e.ToSortableString());
}
static class Test
{
static readonly string UlongFormatString = new string('0', ulong.MaxValue.ToString().Length);
public static string ToSortableString(this double n)
{
return BitConverter.ToUInt64(BitConverter.GetBytes(BitConverter.DoubleToInt64Bits(n)), 0).ToString(UlongFormatString);
}
}
Which produces the following output:
09218868437227405311
09214364837600034815
00000000000000000000
18437736874454810623
18442240474082181119
Clearly not sorted as expected.
UPDATE 3
The accepted answer below is the correct one. Thanks guys!
Padding is potentially rather awkward for doubles, given the enormous range (double.MaxValue is 1.7976931348623157E+308).
Does the string representation still have to be human-readable, or just reversible?
If it only has to be reversible, a conversion based on the raw 64-bit pattern of the double leads to a reasonably short string representation preserving lexicographic ordering - but it wouldn't be at all obvious what the double value was just from the string.
EDIT: Don't use BitConverter.DoubleToInt64Bits alone. That reverses the ordering for negative values.
I'm sure you can perform this conversion using DoubleToInt64Bits and then some bit-twiddling, but unfortunately I can't get it to work right now, and I have three kids who are desperate to go to the park...
In order to make everything sort correctly, negative numbers need to be stored in ones-complement format instead of sign magnitude (otherwise negatives and positives sort in opposite orders), and the sign bit needs to be flipped (to make negative sort less-than positives). This code should do the trick:
static ulong EncodeDouble(double d)
{
long ieee = System.BitConverter.DoubleToInt64Bits(d);
ulong widezero = 0;
return ((ieee < 0)? widezero: ((~widezero) >> 1)) ^ (ulong)~ieee;
}
static double DecodeDouble(ulong lex)
{
ulong widezero = 0;
long ieee = (long)(((0 <= (long)lex)? widezero: ((~widezero) >> 1)) ^ ~lex);
return System.BitConverter.Int64BitsToDouble(ieee);
}
Demonstration here: http://ideone.com/JPNPY
Here's the complete solution, to and from strings:
static string EncodeDouble(double d)
{
long ieee = System.BitConverter.DoubleToInt64Bits(d);
ulong widezero = 0;
ulong lex = ((ieee < 0)? widezero: ((~widezero) >> 1)) ^ (ulong)~ieee;
return lex.ToString("X16");
}
static double DecodeDouble(string s)
{
ulong lex = ulong.Parse(s, System.Globalization.NumberStyles.AllowHexSpecifier);
ulong widezero = 0;
long ieee = (long)(((0 <= (long)lex)? widezero: ((~widezero) >> 1)) ^ ~lex);
return System.BitConverter.Int64BitsToDouble(ieee);
}
Demonstration: http://ideone.com/pFciY
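To check the ordering claim locally, a small sketch along these lines can help. It applies the same transformation as EncodeDouble (negative values get all bits flipped, non-negative values get only the sign bit set) but written with explicit masks, then compares numeric order against ordinal string order. SortableDoubleDemo and OrderIsPreserved are hypothetical names, and NaN is deliberately left out of the picture.

```csharp
using System;
using System.Linq;

static class SortableDoubleDemo
{
    public static string Encode(double d)
    {
        ulong bits = unchecked((ulong)BitConverter.DoubleToInt64Bits(d));
        // Negative: flip every bit. Non-negative: flip only the sign bit.
        ulong mask = (bits >> 63) == 1 ? ulong.MaxValue : 0x8000000000000000UL;
        return (bits ^ mask).ToString("X16");
    }

    // True when sorting by the encoded string gives the same order as sorting numerically.
    public static bool OrderIsPreserved(double[] values)
    {
        var numeric = values.OrderBy(v => v).ToArray();
        var lexical = values.OrderBy(v => Encode(v), StringComparer.Ordinal).ToArray();
        return numeric.SequenceEqual(lexical);
    }
}
```

Note the StringComparer.Ordinal: a culture-sensitive comparison is not needed for hex digits and is best avoided for keys like this.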
I believe that a modified scientific notation, with the exponent first, and using underscore for positive, would sort lexically in the same order as numerically.
If you want, you can even append the normal representation, since a suffix won't affect sorting.
Examples
E000M3 +3.0
E001M2.7 +27.0
Unfortunately, it doesn't work for either negative numbers or negative exponents. You could introduce a bias for the exponent, like the IEEE format uses internally.
As it turns out... The org.apache.solr.util package contains the NumberUtils class. This class has static methods that do everything needed to convert doubles (and other data values) to sortable strings (and back). The methods could not be easier to use. A few notes:
Of course, NumberUtils is written in Java (not C#). My guess is that the code could be converted to C#... However, I am not well versed in C#. The source is readily available online.
The resulting strings are not printable (at all).
The comments in the code indicate that all exotic cases, including negative numbers and infinities, should work correctly.
I haven't done any benchmarks... However, based on a quick scan of the code, it should be very fast.
The code below shows what needs to be done to use this library.
String key = NumberUtils.double2sortableStr(35.2);

Converting a partial MD5 hash code into a long

I'm using the MD5 algorithm to hash the key for an on-disk hash table (I know it's questionable whether this is the best algorithm to use for this, but I'm going with it for now. The problem is generalizable to any algorithm that produces a byte array). My problem is this:
The size of the hash code determines the number of combinations (buckets) in the hash table. Since MD5 is 128 bit, there are a huge number of combinations (~ 3.4e38) which is way too big for my purpose. So what I want to do is pick off the first n bits of the byte array that MD5 produces, and convert those into a long (or ulong) value. Since MD5 produces a byte array, it would be easy to do if I wanted an integral number of bytes, but this leads to too big a jump in the number of combinations. I'm finding the single bit version to be a lot trickier.
Goal:
n = 10 // I.e. I want 2^10 combinations
long pos = someFcn(byte[] key, n)
where key is the value being hashed, and n is the number of bits of the MD5 result I want to use. pos, then, will be an integer from 0 to 1023 (in the case of n = 10). If n = 11, pos will range from 0 to 2^11-1 = 2047, etc. It has to be reasonably fast/efficient.
Doesn't seem that hard but it's eluding me. Any help would be much appreciated. Thanks.
First, convert the first four bytes into an integer with BitConverter.ToInt32. It reads four bytes no matter what, but this probably won't make it measurably slower, since you're working with 32-bit registers for the rest of the calculations anyway, and special cases like "if it's < 16 then do this with the first two bytes" would just make it more complicated.
Then, given that integer, take the lowest N bits. If you really want a specific number of bits [a power-of-two number of buckets] not known at compile time, ~((-1)<<N) is a nice trick to get 2^N-1.
Or you could simply use ToUInt32 instead and take it modulo a prime number [it might be slightly better to convert to UInt64 instead; then you've got fully half the bits to start with, in this case].
To obtain the first 10 bits, for example:
int result = ((int)key[0] << 2) | (((int)key[1] >> 6) & 0x03);
If you have an array like this,
unsigned char data[2000];
then you can just scrape off the first n bits into an integer like so:
typedef unsigned long long int MyInt;
MyInt scrape(size_t n, unsigned char * data)
{
MyInt result = 0;
size_t b;
for (b = 0; b < n / 8; ++b)
{
result <<= 8;
result += data[b];
}
const size_t remaining_bits = n % 8;
result <<= remaining_bits;
result += (data[b] >> (8 - remaining_bits));
return result;
}
I'm assuming that CHAR_BIT == 8; feel free to generalize the code if you like. Also, the size of the array times 8 must be greater than n, since data[b] is read even when n is a multiple of 8.
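Since the question is about C#, here is a rough equivalent of the scrape function for a byte[] such as the one MD5 returns. It is only a sketch: HashBits and TakeFirstBits are invented names, n is limited to 64 so the result fits in a ulong, and the partial byte is only touched when n is not a multiple of 8.

```csharp
using System;

static class HashBits
{
    // Returns the first n bits of hash (most significant bit of hash[0] first) as a number.
    public static ulong TakeFirstBits(byte[] hash, int n)
    {
        if (hash == null) throw new ArgumentNullException(nameof(hash));
        if (n < 0 || n > 64 || n > hash.Length * 8)
            throw new ArgumentOutOfRangeException(nameof(n));

        ulong result = 0;
        int fullBytes = n / 8;
        for (int i = 0; i < fullBytes; i++)
            result = (result << 8) | hash[i];        // append whole bytes

        int remaining = n % 8;
        if (remaining > 0)                           // append the top bits of the next byte
            result = (result << remaining) | (ulong)(hash[fullBytes] >> (8 - remaining));

        return result;
    }
}
```

With n = 10 the result is in the range 0 to 1023, so it can be used directly as a bucket index.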

Reversing a hash function

I have the following hash function, and I'm trying to find a way to reverse it, so that I can find the key from a hashed value.
uint Hash(string s)
{
uint result = 0;
for (int i = 0; i < s.Length; i++)
{
result = ((result << 5) + result) + s[i];
}
return result;
}
The code is in C# but I assume it is clear.
I am aware that for one hashed value, there can be more than one key, but my intent is not to find them all, just one that satisfies the hash function suffices.
EDIT :
The string that the function accepts is formed only from the digits 0 to 9 and the chars '*' and '#', hence the Unhash function must respect this criterion too.
Any ideas? Thank you.
This should reverse the operations:
string Unhash(uint hash)
{
List<char> s = new List<char>();
while (hash != 0)
{
s.Add((char)(hash % 33));
hash /= 33;
}
s.Reverse();
return new string(s.ToArray());
}
This should return a string that gives the same hash as the original string, but it is very unlikely to be the exact same string.
Characters 0-9,*,# have ASCII values 48-57,42,35, or binary: 00110000 ... 00111001, 00101010, 00100011
First 5 bits of those values are different, and 6th bit is always 1. This means that you can deduce your last character in a loop by taking current hash:
uint lastChar = hash & 0x1F - ((hash >> 5) - 1) & 0x1F + 0x20;
(if this doesn't work, I don't know who wrote it)
Now roll back hash,
hash = (hash - lastChar) / 33;
and repeat the loop until hash becomes zero. I don't have C# on me, but I'm 70% confident that this should work with only minor changes.
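A variation on the same roll-back idea that is easier to verify: as long as the 32-bit result never wrapped around, hash = 33 * previous + lastChar, so hash % 33 equals lastChar % 33. The allowed characters (35, 42, 48..57) all leave different remainders modulo 33, so the last character can be read off directly. The sketch below assumes no wraparound, which is guaranteed only for keys of up to 6 characters from this alphabet; HashReversal and TryUnhashShort are made-up names.

```csharp
using System;
using System.Text;

static class HashReversal
{
    const string Alphabet = "0123456789*#";

    // Same function as in the question: result * 33 + c, wrapping at 32 bits.
    public static uint Hash(string s)
    {
        uint result = 0;
        foreach (char c in s)
            result = unchecked(result * 33 + c);
        return result;
    }

    // Tries to rebuild a key under the assumption that the hash never overflowed.
    public static bool TryUnhashShort(uint hash, out string key)
    {
        var sb = new StringBuilder();
        key = "";
        while (hash != 0)
        {
            // All allowed characters lie in 33..65, so c = 33 + (hash mod 33).
            char c = (char)(33 + hash % 33);
            if (Alphabet.IndexOf(c) < 0 || hash < c) return false;
            sb.Insert(0, c);
            hash = (hash - c) / 33;   // roll back one step
        }
        key = sb.ToString();
        return true;
    }
}
```

When this returns false, the hash can only come from a longer key whose intermediate value wrapped around, and something like the brute-force search in the next answer is still needed.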
Brute force should work if uint is 32 bits. Try at least 2^32 strings and one of them is likely to hash to the same value. Should only take a few minutes on a modern pc.
You have 12 possible characters, and 12^9 is about 2^32, so if you try 9 character strings you're likely to find your target hash. I'll do 10 character strings just to be safe.
(simple recursive implementation in C++, don't know C# that well)
#define NUM_VALID_CHARS 12
#define STRING_LENGTH 10
const char valid_chars[NUM_VALID_CHARS] = {'0', ..., '#' ,'*'};
void unhash(uint hash_value, char *string, int nchars) {
if (nchars == STRING_LENGTH) {
string[STRING_LENGTH] = 0;
if (Hash(string) == hash_value) { printf("%s\n", string); }
} else {
for (int i = 0; i < NUM_VALID_CHARS; i++) {
string[nchars] = valid_chars[i];
unhash(hash_value, string, nchars + 1);
}
}
}
Then call it with:
char string[STRING_LENGTH + 1];
unhash(hash_value, string, 0);
Hash functions are designed to be difficult or impossible to reverse, hence the name (visualize meat + potatoes being ground up)
I would start out by writing each step that result = ((result << 5) + result) + s[i]; does on a separate line. This will make solving a lot easier. Then all you have to do is the opposite of each line (in the opposite order too).

YouTube-like GUID

Is it possible to generate short GUID like in YouTube (N7Et6c9nL9w)?
How can it be done? I want to use it in web app.
You could use Base64:
string base64Guid = Convert.ToBase64String(Guid.NewGuid().ToByteArray());
That generates a string like E1HKfn68Pkms5zsZsvKONw==. Since a GUID is always 128 bits, you can omit the == that you know will always be present at the end and that will give you a 22 character string. This isn't as short as YouTube though.
URL Friendly Solution
As mentioned in the accepted answer, base64 is a good solution but it can cause issues if you want to use the GUID in a URL. This is because + and / are valid base64 characters, but have special meaning in URLs.
Luckily, there are unused characters in base64 that are URL friendly. Here is a more complete answer:
public string ToShortString(Guid guid)
{
var base64Guid = Convert.ToBase64String(guid.ToByteArray());
// Replace URL unfriendly characters
base64Guid = base64Guid.Replace('+', '-').Replace('/', '_');
// Remove the trailing ==
return base64Guid.Substring(0, base64Guid.Length - 2);
}
public Guid FromShortString(string str)
{
str = str.Replace('_', '/').Replace('-', '+');
var byteArray = Convert.FromBase64String(str + "==");
return new Guid(byteArray);
}
Usage:
Guid guid = Guid.NewGuid();
string shortStr = ToShortString(guid);
// shortStr will look something like 2LP8GcHr-EC4D__QTizUWw
Guid guid2 = FromShortString(shortStr);
Assert.AreEqual(guid, guid2);
EDIT:
Can we do better? (Theoretical limit)
The above yields a 22 character, URL friendly GUID.
This is because a GUID uses 128 bits, so representing it in base64 requires
128 / log2(64) = 128 / 6
characters, which is 21.33, which rounds up to 22.
There are actually 66 URL friendly characters (we aren't using . and ~). So theoretically, we could use base66 to get
128 / log2(66)
characters, which is 21.17, which also rounds up to 22.
So this is optimal for a full, valid GUID.
However, GUID uses 6 bits to indicate the version and variant, which in our case are constant. So we technically only need 122 bits, which in both bases rounds up to 21 (122 / log2(64) = 20.33). So with more manipulation, we could remove another character. This requires wrangling the bits out however, so I leave this as an exercise to the reader.
How does youtube do it?
YouTube IDs use 11 characters. How do they do it?
A GUID uses 122 bits, which guarantees collisions are virtually impossible. This means you can generate a random GUID and be certain it is unique without checking. However, we don't need so many bits for just a regular ID.
We could use a smaller ID. If we use 66 bits or less, we have a higher risk of collision, but can represent this ID with 11 characters (even in base64). One could either accept the risk of collision, or test for a collision and regenerate.
With 122 bits (regular GUID), you would have to generate ~10^17 GUIDs to have a 1% chance of collision.
With 66 bits, you would have to generate ~10^9 or 1 billion IDs to have a 1% chance of collision. That is not that many IDs.
My guess is youtube uses 64 bits (which is more memory friendly than 66 bits), and checks for collisions to regenerate the ID if necessary.
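For anyone who wants to redo the collision numbers for a different ID size, they follow from the usual birthday approximation n ≈ sqrt(2 * 2^bits * ln(1 / (1 - p))). A small sketch, with CollisionEstimate and IdsForCollisionProbability being names invented for this example:

```csharp
using System;

static class CollisionEstimate
{
    // Approximate number of uniformly random IDs of the given bit length
    // that must be generated before a collision occurs with the given probability.
    public static double IdsForCollisionProbability(int bits, double probability)
    {
        double space = Math.Pow(2, bits);
        return Math.Sqrt(2 * space * Math.Log(1 / (1 - probability)));
    }
}
```

IdsForCollisionProbability(66, 0.01) comes out at roughly 1.2 billion, and for 64 bits it is about half of that, which is why checking for collisions is a sensible precaution at that size.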
If you want to abandon GUIDs in favor of smaller IDs, here is code for that:
class IdFactory
{
private Random random = new Random();
public int CharacterCount { get; }
public IdFactory(int characterCount)
{
CharacterCount = characterCount;
}
public string Generate()
{
// bitCount = characterCount * log (targetBase) / log(2)
var bitCount = 6 * CharacterCount;
var byteCount = (int)Math.Ceiling(bitCount / 8f);
byte[] buffer = new byte[byteCount];
random.NextBytes(buffer);
string guid = Convert.ToBase64String(buffer);
// Replace URL unfriendly characters
guid = guid.Replace('+', '-').Replace('/', '_');
// Trim characters to fit the count
return guid.Substring(0, CharacterCount);
}
}
Usage:
var factory = new IdFactory(characterCount: 11);
string guid = factory.Generate();
// guid will look like Mh3darwiZhp
This uses 64 characters which is not optimal, but requires much less code (since we can reuse Convert.ToBase64String).
You should be a lot more careful of collisions if you use this.
9 chars is not a GUID. Given that, you could use the hexadecimal representation of an int, which gives you an 8 char string.
You can use an id you might already have. Also you can use .GetHashCode against different simple types and there you have a different int. You can also xor different fields. And if you are into it, you might even use a Random number - hey, you have well above 2.000.000.000+ possible values if you stick to the positives ;)
It's not a GUID but rather an auto-incremented unique alphanumeric string
Please see the following code, where I am trying to do the same. It uses the TotalMilliseconds since the epoch and a valid set of characters to generate a unique string that is incremented with each passing millisecond.
The other way is to use numeric counters, but that is expensive to maintain and creates a series where you can add or subtract values to guess the previous or the next unique string in the system, and we don't want that to happen.
Do remember:
This will not be globally unique but unique to the instance where it's defined
It uses Thread.Sleep() to handle multithreading issue
public string YoutubeLikeId()
{
Thread.Sleep(1);//make everything unique while looping
long ticks = (long)(DateTime.UtcNow
.Subtract(new DateTime(1970, 1, 1,0,0,0,0))).TotalMilliseconds;//EPOCH
char[] baseChars = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
.ToCharArray();
int i = 32;
char[] buffer = new char[i];
int targetBase= baseChars.Length;
do{
buffer[--i] = baseChars[ticks % targetBase];
ticks = ticks / targetBase;
}
while (ticks > 0);
char[] result = new char[32 - i];
Array.Copy(buffer, i, result, 0, 32 - i);
return new string(result);
}
The output will come something like
XOTgBsu
XOTgBtB
XOTgBtR
XOTgBtg
XOTgBtw
XOTgBuE
Update: The same can be achieved from Guid as
var guid = Guid.NewGuid();
guid.ToString("N");
guid.ToString("N").Substring(0,8);
guid.ToString("N").Substring(8,4);
guid.ToString("N").Substring(12,4);
guid.ToString("N").Substring(16,4);
guid.ToString("N").Substring(20,12);
For a Guid ecd65132-ab5a-4587-87b8-b875e2fe0f35 it will break it down in chunks as ecd65132 ,ab5a , 4587,87b8,b875e2fe0f35
but it's not guaranteed to always be unique.
Update 2: There is also a project called ShortGuid that produces a URL-friendly GUID, which can be converted from/to a regular Guid.
When I looked under the hood I found it works by encoding the Guid to Base64, as in the code below:
public static string Encode(Guid guid)
{
string encoded = Convert.ToBase64String(guid.ToByteArray());
encoded = encoded
.Replace("/", "_")
.Replace("+", "-");
return encoded.Substring(0, 22);
}
The good thing about it is that it can be decoded again to get the Guid back with
public static Guid Decode(string value)
{
// avoid parsing larger strings/blobs
if (value.Length != 22)
{
throw new ArgumentException("A ShortGuid must be exactly 22 characters long. Receive a character string.");
}
string base64 = value
.Replace("_", "/")
.Replace("-", "+") + "==";
byte[] blob = Convert.FromBase64String(base64);
var guid = new Guid(blob);
var sanityCheck = Encode(guid);
if (sanityCheck != value)
{
throw new FormatException(
$"Invalid strict ShortGuid encoded string. The string '{value}' is valid URL-safe Base64, " +
$"but failed a round-trip test expecting '{sanityCheck}'."
);
}
return guid;
}
So a Guid 4039124b-6153-4721-84dc-f56f5b057ac2 will be encoded as SxI5QFNhIUeE3PVvWwV6wg and the Output will look something like.
ANf-MxRHHky2TptaXBxcwA
zpjp-stmVE6ZCbOjbeyzew
jk7P-XYFokmqgGguk_530A
81t6YZtkikGfLglibYkDhQ
qiM2GmqCK0e8wQvOSn-zLA
As others have mentioned, YouTube's VideoId is not technically a GUID since it's not inherently unique.
As per Wikipedia:
The total number of unique keys is 2^128 or 3.4×10^38. This number is so
large that the probability of the same number being generated randomly
twice is negligible.
The uniqueness of YouTube's VideoId is maintained by their generator algorithm.
You can either write your own algorithm, or you can use some sort of random string generator and utilize a UNIQUE constraint in SQL to enforce its uniqueness.
First, create a UNIQUE CONSTRAINT in your database:
ALTER TABLE MyTable
ADD CONSTRAINT UniqueUrlId
UNIQUE (UrlId);
Then, for example, generate a random string (from philipproplesch's answer):
string shortUrl = System.Web.Security.Membership.GeneratePassword(11, 0);
If the generated UrlId is sufficiently random and sufficiently long you should rarely encounter the exception that is thrown when SQL encounters a duplicate UrlId. In such an event, you can easily handle the exception in your web app.
Technically it's not a Guid. Youtube has a simple randomized string generator that you can probably whip up in a few minutes using an array of allowed characters and a random number generator.
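A sketch of what such a generator might look like, assuming an 11-character ID over a 64-symbol URL-safe alphabet. RandomId and Generate are hypothetical names, the alphabet is just one reasonable choice, and uniqueness still has to be enforced separately (for example by a unique constraint in the database).

```csharp
using System;
using System.Security.Cryptography;

static class RandomId
{
    // 26 + 26 + 10 + 2 = 64 URL-safe symbols.
    const string Alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

    public static string Generate(int length)
    {
        var bytes = new byte[length];
        using (var rng = RandomNumberGenerator.Create())
            rng.GetBytes(bytes);

        var chars = new char[length];
        for (int i = 0; i < length; i++)
            chars[i] = Alphabet[bytes[i] & 63]; // 64 symbols, so keeping 6 bits is unbiased
        return new string(chars);
    }
}
```

Calling RandomId.Generate(11) yields 66 random bits per ID, in the same ballpark as the figures discussed in the answers above.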
It might be not the best solution, but you can do something like that:
string shortUrl = System.Web.Security.Membership.GeneratePassword(11, 0);
This id is probably not globally unique. GUID's should be globally unique as they include elements which should not occur elsewhere (the MAC address of the machine generating the ID, the time the ID was generated, etc.)
If what you need is an ID that is unique within your application, use a number fountain - perhaps encoding the value as a hexadecimal number. Every time you need an id, grab it from the number fountain.
If you have multiple servers allocating ids, you could grab a range of numbers (a few tens or thousands depending on how quickly you're allocating ids) and that should do the job. An 8 digit hex number will give you 4 billion ids - but your first ids will be a lot shorter.
Maybe using NanoId will save you from a lot of headaches:
https://github.com/codeyu/nanoid-net
You can do something like:
var id = Nanoid.Generate("1234567890abcdef", 10); //=> "4f90d13a42"
And you can check the collision probability here:
https://alex7kom.github.io/nano-nanoid-cc/
