Generate unique hash from filename - c#

I'm looking to generate a unique random hash that has a minuscule chance of being duplicated. It should only contain numbers, and I want it to be 4 characters long. I have the file path in the form of
filepath = "c:\\users\\john\\filename.csv"
Now, I'd like to only select the "filename" part of that string and create a hash from that filename, though I want it to be different each time so if two users upload a similarly named file it will likely generate a different hash code. What's the best way to go about doing this?
I will be using this hash to append "001", "002", etc. on to create student IDs.

Generating a unique hash from a file's filename is fairly simple.
However...
It should only contain numbers, and I want it to be 4 characters long.
With only 4 numeric characters there are just 10,000 possible values, so you're guaranteed a collision once you have more than 10,000 different files, and by the birthday paradox a collision becomes more likely than not after only about 120 files. This makes it impossible to have a "minuscule chance of being duplicated".
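To put numbers on that, here is a minimal sketch of the birthday-problem calculation. It assumes every 4-digit value is equally likely and independent of the others, and the class and method names are made up for illustration.

```csharp
using System;

public static class CollisionOdds
{
    // Probability that at least two of `items` values collide when each is
    // drawn uniformly and independently from `slots` possible values.
    public static double ProbabilityOfCollision(int items, int slots)
    {
        if (items > slots) return 1.0; // pigeonhole: a collision is certain

        double allDistinct = 1.0;
        for (int i = 0; i < items; i++)
        {
            // The (i + 1)-th value must avoid the i values already taken.
            allDistinct *= (double)(slots - i) / slots;
        }
        return 1.0 - allDistinct;
    }

    // Smallest number of items for which a collision is more likely than not.
    public static int ItemsForEvenOdds(int slots)
    {
        int items = 1;
        while (ProbabilityOfCollision(items, slots) <= 0.5) items++;
        return items;
    }
}
```

Calling CollisionOdds.ItemsForEvenOdds(10000) gives the number of files at which a duplicate 4-digit value becomes more likely than not.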
Edit in response to comments:
You could do some simple type of hash, though this will give quite a few collisions:
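string ComputeFourDigitStringHash(string filepath)
{
    // Keep only "filename" from "c:\\users\\john\\filename.csv".
    string filename = System.IO.Path.GetFileNameWithoutExtension(filepath);
    // GetHashCode() can be negative, so take the absolute value of the remainder.
    int hash = System.Math.Abs(filename.GetHashCode() % 10000);
    return hash.ToString("0000");
}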
This will give you a 4 digit "hash" from the filename portion of the string. Note that it will have a lot of collisions, but it will give you something you can use.
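Since the question asks for a value that differs even when two users upload a file with the same name, one alternative is to skip the hash and hand out random 4-digit prefixes while remembering which ones are taken. This is only a sketch: PrefixAllocator is a hypothetical name, the in-memory HashSet stands in for whatever persistent store you actually use, and the class is not thread-safe.

```csharp
using System;
using System.Collections.Generic;

public sealed class PrefixAllocator
{
    private readonly HashSet<string> _used = new HashSet<string>();
    private readonly Random _random = new Random();

    // Returns a 4-digit prefix that this allocator has not handed out before.
    public string Next()
    {
        if (_used.Count >= 10000)
            throw new InvalidOperationException("All 10,000 four-digit prefixes are in use.");

        string prefix;
        do
        {
            // Next(10000) yields 0..9999; "0000" pads to four digits.
            prefix = _random.Next(10000).ToString("0000");
        }
        while (!_used.Add(prefix)); // Add returns false if the prefix was already taken

        return prefix;
    }
}
```

The student IDs would then be built as prefix + "001", prefix + "002", and so on.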

Related

Probability of already existing file System.IO.Path.GetRandomFileName()

Recently I got the exception:
Message:
System.IO.IOException: The file 'C:\Windows\TEMP\635568456627146499.xlsx' already exists.
at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
This was the result of the following code I used for generating file names:
Path.Combine(Path.GetTempPath(), DateTime.Now.Ticks + ".xlsx");
After realising that it is possible to create two files in one Tick, I changed the code to:
Path.Combine(Path.GetTempPath(), Path.GetRandomFileName() + ".xlsx");
But I am still wondering what is the probability of the above exception in the new case?
Internally, GetRandomFileName uses RNGCryptoServiceProvider to generate an 11-character (name:8+ext:3) string. The string represents a base-32 encoded number, so the total number of possible strings is 32^11, or 2^55.
Assuming uniform distribution, the chances of making a duplicate are about 2^-55, or 1 in 36 quadrillion. That's pretty low: for comparison, your chances of winning NY lotto are roughly one million times higher.
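That figure is the chance that one particular pair of names matches. Assuming names are drawn uniformly and independently, the standard birthday approximation gives the chance of any clash among n names in the same directory (n is a symbol introduced here for illustration):

```latex
P(\text{any duplicate among } n \text{ names})
  \approx \frac{n(n-1)}{2} \cdot 2^{-55}
  = \frac{n(n-1)}{2^{56}},
\qquad
n = 10^{6} \;\Rightarrow\; P \approx \frac{10^{12}}{7.2 \times 10^{16}} \approx 1.4 \times 10^{-5}.
```

So even with a million files sharing one directory, a clash remains very unlikely.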
The probability of getting duplicate names with GetRandomFileName is really low, but if you look at its source, you will see that it doesn't check whether the name is a duplicate (it can't, because it doesn't know the path where the file will be created).
Path.GetTempFileName, by contrast, returns a unique file name inside the Temp directory.
(This also removes the need to build the temp path in your code.)
GetTempFileName uses the Win32 API GetTempFileName to request the creation of a unique file name.
The Win32 API creates the file with zero length and releases the handle.
So you don't run into concurrency problems. Better to use this one.
GetRandomFileName() returns an 8.3 string, that is, 11 characters that can vary. Assuming it contains only letters and digits, the "alphabet" has at most 36 characters, so there are up to 36^11 variations (32^11 if only 32 of those characters are used, as the base-32 encoding mentioned above implies). Either way, the probability of the above exception is extremely low.
I would like to put my answer in comment area rather than here, but I don't have enough reputation to add comment.
For your first snippet, I think you can check beforehand whether the file exists.
For the second one, the code will generate a random name, but random means there is still a tiny possibility of getting the exception. I don't think you need to worry about this, though, and an existence check will help.
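A separate existence check leaves a small window in which another process can create the same file between the check and the write. A minimal sketch of one way to close that window is to let the file system do the check by opening the file with FileMode.CreateNew and retrying on a clash. TempFiles, CreateUniqueFile and maxAttempts are hypothetical names, and the exception filter assumes C# 6 or later.

```csharp
using System;
using System.IO;

public static class TempFiles
{
    // Creates a new, empty file with a random name and the given extension
    // in `directory`, retrying in the unlikely event of a name clash.
    public static string CreateUniqueFile(string directory, string extension, int maxAttempts = 10)
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            string path = Path.Combine(directory, Path.GetRandomFileName() + extension);
            try
            {
                // FileMode.CreateNew fails if the file already exists, so the
                // existence check and the creation happen in a single step.
                using (new FileStream(path, FileMode.CreateNew, FileAccess.Write)) { }
                return path;
            }
            catch (IOException) when (File.Exists(path))
            {
                // Name already taken: loop and try another one.
            }
        }
        throw new IOException("Could not create a unique file after " + maxAttempts + " attempts.");
    }
}
```

Usage would look like TempFiles.CreateUniqueFile(Path.GetTempPath(), ".xlsx"), after which the returned path can be opened for writing.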

Should I Use Path.GetRandomFileName or use a Guid?

I need to generate unique folder names, should I use Path.GetRandomFileName or just use Guid.NewGuid?
Guids say they are globally unique, GetRandomFileName does not make such a claim.
I think both are equally random, the difference being that Path.GetRandomFileName will produce an 8.3 filename (total of 11 characters) so is going to have a smaller set of unique names than those generated by Guid.NewGuid.
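For the Guid route, a minimal sketch might look like this. UniqueFolders and the parent parameter are made-up names; the "N" format specifier produces 32 hex digits without hyphens or braces.

```csharp
using System;
using System.IO;

public static class UniqueFolders
{
    // Creates a folder named after a fresh Guid under `parent` and returns its path.
    public static string Create(string parent)
    {
        string path = Path.Combine(parent, Guid.NewGuid().ToString("N"));
        // CreateDirectory does not fail if the folder already exists, so this
        // relies on the Guid being unique rather than on an explicit check.
        Directory.CreateDirectory(path);
        return path;
    }
}
```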

Compress Guids by hashing in small data sets

I'm working on a mobile app and I want to optimise the data that it's receiving from the server (as JSON).
There are 3 lists returned (each containing its own class of objects, the approximate list sizes are 50, 100 and 170). Each object has a Guid id and there is some relation data for each object. E.g.:
o = { Id = "8f088552-5b24-4ba4-a6e5-8958c4353581",
RelatedIds = ["19d2e562-0874-473f-8e05-7052e8defd9a", "615b4c47-199a-4f7d-8268-08ed43d9c891", ... ] }
Is there a way to compress these Guids to something shorter without storing an identity map? Perhaps using a hash function?
You can convert the 16-byte representation of a GUID into a Base 64 string. However you didn't mention a programming language so we can't help further.
A hash function is not recommended here because hash functions are generally lossy.
No. One of the attributes of (non-cryptographic) hashes is that they collide: hash(a) == hash(b) but a != b. They are a performance optimization in the case where you are doing a lot of equality checks and you expect many false results (because if hash(a) != hash(b) then a != b). A GUID->counter map is probably the best way to get smaller ids here.
You can convert hex (base16) to base64, and remove all the punctuation. You should save 25% for using base64, and another 4 bytes for punctuation.
Thinking about it some more, I've realized that HTTP compression (if enabled) is probably going to compress that data well enough anyway, so it's not really worth the effort to compress data manually.

Creating unique URLs in ASP.NET

In my website, I need to create unique URLs that an admin user would use to send it to a group of users. The unique URL is created whenever an admin creates a new form. I understand I can use a guid to represent unique URLs, but I am looking for something shorter (hopefully around 4 characters, since it's easier to remember). How would I generate a unique URL in ASP.NET that would look like this:
http://mydomain.com/ABCD
I understand some of the URL shortener websites (like bit.ly) do something like this with a very short unique URL. Is there an algorithm I can use?
How about something like
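// Shared instance: creating a new Random on every call can repeat seeds
// (and therefore strings) when calls happen in quick succession.
private static readonly Random rnd = new Random();

public static string GetRandomString (int length)
{
    string charPool = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890";
    StringBuilder sb = new StringBuilder();   // requires using System.Text;
    while ((length--) > 0)
        sb.Append(charPool[rnd.Next(charPool.Length)]);
    return sb.ToString();
}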
and call
GetRandomString(4);
Just write an algorithm to select a certain number of characters from a GUID (e.g. the first 4 or 8 characters, every even character up to 4 or 8 characters.)
Be sure to check it against the database to make sure it isn't already in use, and if it is regenerate it. As a safeguard, maybe make a timeout (if it tries to generate 10 and they're all in use, give up,) but it's unlikely to use every possible combination.
I believe bit.ly performs a hash and then base64 encodes the result. You could do the same, although it'll be more than 4 characters. Be sure to add code that handles hashing collisions. You could append 1, 2, 3, etc. when the first hash is in use.
Another approach is to create a new table in a database. Every time you need a new URL, add a row to this table. You could use the PK as the URL value. This will give you up to 10,000 unique values using only four characters. Base64 encode for even more.
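As a sketch of the primary-key idea: standard Base64 includes + and /, which are awkward in a URL path, so this example swaps in a base-62 alphabet (digits plus upper and lower case letters) instead. ShortCode and Encode are hypothetical names, and the id is assumed to be a non-negative integer key from the database.

```csharp
using System;
using System.Text;

public static class ShortCode
{
    // 10 digits + 26 upper case + 26 lower case = 62 URL-safe characters.
    private const string Alphabet =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    // Converts a non-negative id into its base-62 representation.
    public static string Encode(long id)
    {
        if (id < 0) throw new ArgumentOutOfRangeException(nameof(id));
        if (id == 0) return Alphabet[0].ToString();

        var sb = new StringBuilder();
        while (id > 0)
        {
            // Prepend the least significant base-62 digit, then shift right.
            sb.Insert(0, Alphabet[(int)(id % Alphabet.Length)]);
            id /= Alphabet.Length;
        }
        return sb.ToString();
    }
}
```

With 62 characters, four positions cover 62^4 = 14,776,336 ids, compared with 10,000 for four decimal digits.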

C# Random Code Field Generator for Object

I have an object with the following properties
GID
ID
Code
Name
Some of the clients don't want to enter the Code, so the initial plan was to put the ID in the Code, but the base object of the ORM is different, so I'm stuck.
My new plan is to put totally random values of the form ####-#### in Code. How can I generate something like that, along the lines of a Windows 7 serial-number generator? Would that not add overhead? What would you do in this case?
Do you want a random value, or a unique value?
random != unique.
Remember, random merely states a probability of not generating the same value, or a probability of generating the same value again. As time increases, likelihood of generating a previous value increases - becoming a near certainty. Which do you require?
Personally, I recommend just using a Guid with some context [refer to easiest section below]. I also provided some other suggestions so you have options, depending on your situation.
easiest
If Code is an unbounded string [ie can be of any length], easiest semi-legible means of generating a unique code would be
OrmObject ormObject= new OrmObject ();
string code = string.
Format ("{0} [{1}]", ormObject.Name, Guid.NewGuid ()).
Trim ();
// generates something like
// "My Product [DA9190E1-7FC6-49d6-9EA5-589BBE6E005E]"
You can substitute any distinguishable string for ormObject.Name. I would typically use objectInstance.GetType ().Name, but that only works if OrmObject is a base class; if it's a concrete class used for everything, all instances will end up with the same tag. The point is to add some user context so that, as in @Yuriy Faktorovich's referenced wtf article, users have something to read.
random
I responded a day or two ago about random number generation. Not so much generating numbers as building a simple flexible framework around a generator to improve quality of code and data, this should help streamline your source.
If you read that, you could easily write an extension method, say
public static class IRandomExtensions
{
public static CodeType GetCode (this IRandom random)
{
// 1. get as many random bytes as required
// 2. transform bytes into a 'Code'
// 3. bob's your uncle
...
}
}
// elsewhere in code
...
OrmObject ormObject = new OrmObject ();
ormObject.Code = random.GetCode ();
...
To actually generate a value, I would suggest implementing an IRandom interface with a System.Security.Cryptography.RNGCryptoServiceProvider implementation. Said implementation would generate a buffer of X random bytes, and dole out as many as required, regenerating a stream when exhausted.
Furthermore - I don't know why I keep writing, I guess this problem is really quite fascinating! - if CodeType is string and you want something readable, you could just take said random bytes and turn them into a "seemingly" readable string via Base64 conversion
public static class IRandomExtensions
{
// assuming 'CodeType' is in fact a string
public static string GetCode (this IRandom random)
{
// 1. get as many random bytes as required
byte[] randomBytes; // fill from random
// 2. transform bytes into a 'Code'
string randomBase64String =
System.Convert.ToBase64String (randomBytes).TrimEnd ('=');
// 3. bob's your uncle
...
}
}
Remember
random != unique.
Your values will repeat. Eventually.
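For the ####-#### shape asked about in the question, a minimal sketch could look like the following. It assumes each # is a decimal digit and uses System.Security.Cryptography.RandomNumberGenerator directly instead of the IRandom wrapper discussed above; CodeGenerator and NewCode are made-up names.

```csharp
using System;
using System.Security.Cryptography;

public static class CodeGenerator
{
    // Returns a code shaped like "1234-5678" built from cryptographically
    // strong random bytes. Random, not unique: repeats are possible.
    public static string NewCode()
    {
        byte[] buffer = new byte[4];
        using (RandomNumberGenerator rng = RandomNumberGenerator.Create())
        {
            rng.GetBytes(buffer);
        }

        // Reduce to eight decimal digits; the modulo introduces a very slight bias.
        uint value = BitConverter.ToUInt32(buffer, 0) % 100000000u;
        string digits = value.ToString("D8");
        return digits.Substring(0, 4) + "-" + digits.Substring(4, 4);
    }
}
```

If the codes must never repeat, the generated value still needs to be checked against the ones already stored.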
unique
There are a number of questions you need to ask yourself about your problem.
Must all Code values be unique? [if not, you're trying too hard]
What Type is Code? [if any-length string, use a full Guid]
Is this a distributed application? [if not, use a DB value as suggested by @LBushkin]
If it is a distributed application, can client applications generate and submit instances of these objects? [if so, then you want a globally unique identifier, and again Guids are a sure bet]
I'm sure you have more constraints, but this is an example of the kind of line of inquiry you need to perform when you encounter a problem like your own. From these questions, you will come up with a series of constraints. These constraints will inform your design.
Hope this helps :)
Btw, you will receive better quality solutions if you post more details [ie constraints] about your problem. Again, what Type is Code, are there length constraints? Format constraints? Character constraints?
Arg, last edit, I swear. If you do end up using Guids, you may wish to obfuscate this, or even "compress" their representation by encoding them in base64 - similar to base64 conversion above for random numbers.
public static class GuidExtensions
{
public static string ToBase64String (this Guid id)
{
return System.Convert.
ToBase64String (id.ToByteArray ()).
TrimEnd ('=');
}
}
Unlike truncating, base64 conversion is not a lossy transformation. Of course, the trim above is lossy in the context of full base64 expansion, but = is just padding, extra information introduced by the conversion, and not part of the original Guid data. If you want to go back to a Guid from this base64 converted value, then you will have to re-pad your base64 string until its length is a multiple of 4 - don't ask, just look up base64 if you are interested :)
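The re-padding step can be sketched as a round trip. GuidBase64, ToShortString and FromShortString are hypothetical names; note that the result can still contain + and /, so it may need further escaping if it ends up in a URL.

```csharp
using System;

public static class GuidBase64
{
    // 16 bytes encode to 24 Base64 characters, the last two being '=' padding.
    public static string ToShortString(Guid id)
    {
        return Convert.ToBase64String(id.ToByteArray()).TrimEnd('=');
    }

    public static Guid FromShortString(string value)
    {
        // Re-pad with '=' until the length is a multiple of 4.
        int padding = (4 - value.Length % 4) % 4;
        byte[] bytes = Convert.FromBase64String(value + new string('=', padding));
        return new Guid(bytes);
    }
}
```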
You could generate a Guid using :
Guid.NewGuid().ToString();
It would give you something like :
788E94A0-C492-11DE-BFD4-FCE355D89593
Use an Autonumber column or Sequencer from your database to generate a unique code number. Almost all modern databases support automatically generated numbers in one form or another. Look into what your database supports.
Autonumber/Sequencer values from the DB are guaranteed to be unique and are relatively inexpensive to acquire. If you want to avoid completely sequential numbers assigned to codes, you can pad and concatenate several sequencer values together.
