I'm looking for an algorithm that generates identifiers suitable both for external use (e.g. in URLs) and for persistence, with the following requirements:
Short, like a max. of 8 characters
URL-friendly, so no special characters
Human-friendly, e.g. no ambiguous characters like L/l, 0/O
Incremental for fast indexing
Random to prevent guessing without knowing the algorithm (would be nice, but not important)
Unique without requiring a check against the database
I looked at various solutions, but all I found have some major tradeoffs. For example:
GUID: Too long, not incremental
GUID base64 encoded: Still too long, not incremental
GUID ascii85 encoded: Short, not incremental, too many unsuitable characters
GUID encodings like base32, base36: Short, but loss of information
Comb GUID: Too long, however incremental
All others based on random: Require checking the DB for uniqueness
Time-based: Prone to collisions in clustered or multi-threaded environments
Edit: Why has this been marked off-topic? The requirements describe a specific problem to which numerous legitimate solutions can be provided. In fact, some of the solutions here are so good, I'm struggling with choosing the one to mark as answer.
If at all possible I'd keep the user requirements (short, readable) and the database requirements (incremental, fast indexing) separate. User-facing requirements change. You don't want to have to modify your tables because tomorrow you decide to change the length or other specifics of your user-facing ID.
One approach is to generate your ID using user-friendly characters, like
23456789ABCDEFGHJKLMNPQRSTUVWXYZ and just make it random.
But when inserting into the database, don't make that value the primary key for the record it references or even store it in that table. Insert it into its own table with an identity primary key, and then store that int or bigint key with your record.
That way your primary table can have an incremental primary key. If you need to reference a record by its "friendly" ID then you join to your friendly ID table.
My guess is that if you're generating a high enough volume of these IDs that you're concerned about index performance then the rate at which human users retrieve those values will be much lower. So the slightly slower lookup of the random value in the friendly ID table won't be a problem.
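As a rough illustration (not part of the original answer), a random ID built from that unambiguous character set could be generated along these lines; the method name and the length of 8 are just assumptions for the example:
public static string GenerateFriendlyId(int length = 8)
{
    // 32-character alphabet without 0/O or 1/I/L; requires System.Linq and System.Security.Cryptography
    const string alphabet = "23456789ABCDEFGHJKLMNPQRSTUVWXYZ";
    var bytes = new byte[length];
    using (var rng = new RNGCryptoServiceProvider())
    {
        rng.GetBytes(bytes);
    }
    // 256 is evenly divisible by 32, so the modulo introduces no bias
    return new string(bytes.Select(b => alphabet[b % alphabet.Length]).ToArray());
}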
The following uses a combination of an ID that is known to be unique (because it comes from a unique ID column in a relational database) and a random sequence of letters and numbers to generate a token:
public static string GenerateAccessToken(string uniqueId) // generates a unique, random, and alphanumeric token
{
    // requires System.Linq and System.Security.Cryptography
    const string availableChars = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    using (var generator = new RNGCryptoServiceProvider())
    {
        var bytes = new byte[16];
        generator.GetBytes(bytes);
        var chars = bytes.Select(b => availableChars[b % availableChars.Length]);
        var token = new string(chars.ToArray());
        return uniqueId + token;
    }
}
The token is guaranteed to be both unique and random (or at least "pseudo-random"). You can control the token's length by changing the length of the bytes array.
To avoid confusion between "0" and "O" or "l" and "1", you can remove those characters from availableChars.
Edit
I just realized this doesn't quite pass the "no database check" requirement, though when I've used code like this, I've always already had an entity in memory that I knew contained a unique ID, so I'm hoping the same applies to your situation. I don't think it's possible to quite achieve all your requirements, so I'm hoping this would still be a good balance of attributes.
Have you tried proquints?
A Proquint is a PRO-nouncable QUINT-uplet of alternating unambiguous consonants and vowels, for example: "lusab".
I think they meet almost all your requirements.
See the proposal here.
And here is the official implementation in C and Java.
I've worked on a port to .NET that you can download as Proquint.NET.
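To give a rough idea of the encoding (this is a sketch, not the Proquint.NET code itself), one 16-bit value maps to a five-letter consonant/vowel quintuplet using the alphabets from the proposal:
// Encodes a 16-bit value as a pronounceable quintuplet, e.g. 0x7F00 -> "lusab".
private const string Consonants = "bdfghjklmnprstvz"; // 16 unambiguous consonants (4 bits each)
private const string Vowels = "aiou";                 // 4 unambiguous vowels (2 bits each)

public static string ToProquint(ushort value)
{
    return new string(new[]
    {
        Consonants[(value >> 12) & 0x0F],
        Vowels[(value >> 10) & 0x03],
        Consonants[(value >> 6) & 0x0F],
        Vowels[(value >> 4) & 0x03],
        Consonants[value & 0x0F],
    });
}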
A simple solution I implemented before does not fulfill all of your constraints but might be acceptable if you think about your problem a little bit differently.
First, I used a reversible function to obfuscate the database id, func(id) => y, so that the id can be recovered from y. (I used a Feistel cipher, and here is an example of implementing such a function.) Second, convert the obfuscated id to base 62 so that it becomes short and URL-friendly. (You can use a smaller character set to make it human-friendly.) This creates a one-to-one mapping from database ids to string identifiers. In my implementation, 1 and 2 map to 2PawdM and 5eeGE8 respectively, and I can get the database ids 1 and 2 back from the strings 2PawdM and 5eeGE8. The mapping could be entirely different with a different obfuscation function.
With this solution the identifiers themselves are NOT incremental; however, because the identifiers map directly to database ids, you can compute the corresponding database id and run any database query against the id column instead. You don't need to generate a string identifier and store it in the database, and uniqueness is guaranteed by the database itself when you store the record with an auto-incremented id column.
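As a sketch of the obfuscation step (assuming a 32-bit id and a purely illustrative round function, not the exact cipher from the linked example), a tiny balanced Feistel network is reversible without storing anything; the result can then be run through any base-62 conversion:
// Obfuscates a 32-bit id; the inverse runs the same rounds in reverse order.
private static uint FeistelObfuscate(uint id)
{
    uint left = id >> 16, right = id & 0xFFFF;
    for (int round = 0; round < 4; round++)
    {
        uint temp = left ^ (RoundFunction(right, round) & 0xFFFF);
        left = right;
        right = temp;
    }
    return (left << 16) | right;
}

private static uint FeistelDeobfuscate(uint value)
{
    uint left = value >> 16, right = value & 0xFFFF;
    for (int round = 3; round >= 0; round--)
    {
        uint temp = right ^ (RoundFunction(left, round) & 0xFFFF);
        right = left;
        left = temp;
    }
    return (left << 16) | right;
}

// Illustrative mixing function; in practice use secret constants of your own.
private static uint RoundFunction(uint half, int round)
{
    return (half * 0x9E3779B1u + (uint)round * 0x85EBCA77u) ^ (half >> 7);
}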
I have mainly been using lists to retrieve small amounts of data from a database that feeds into a web application, but I recently came across dictionaries, which produce more readable code with keys. What is the performance difference when simply referring to items by index/key?
I understand that a dictionary uses more memory, but what is best practice in this scenario, and is it worth the performance/maintenance trade-off, bearing in mind that I will not be performing searches or sorting the data?
When you want to find a single item in a list, you may have to look at ALL the items until you find the one with the matching key.
Let's look at a basic example. You have
public class Person
{
    public int ID { get; set; }
    public string Name { get; set; }
}
and you have a collection List<Person> persons and you want to find a person by their ID:
var person = persons.FirstOrDefault(x => x.ID == 5);
As written it has to enumerate the entire List until it finds the entry in the List that has the correct ID (does entry 0 match the lambda? No... Does entry 1 match the lambda? No... etc etc). This is O(n).
However, if you want to find it through a Dictionary<int, Person> dictPersons:
var person = dictPersons[5];
When you look up an element by key in a dictionary, it can jump straight to where the element is stored - this is O(1), or O(n) if you do it once for every person. (If you want to know how this is done: the Dictionary runs a mathematical operation on the key, called a hash function, which turns it into a position inside the dictionary - the same position used when the item was inserted.)
So, Dictionary is faster than List because Dictionary does not iterate through the whole collection; it takes the item from the exact place the hash function calculates. It is a better algorithm.
Dictionary relies on chaining (maintaining a list of items for each hash table bucket) to resolve collisions, whereas Hashtable uses rehashing (when a collision occurs, it tries another hash function to map the key to a bucket). You can read more about how hash functions work and the difference between chaining and rehashing.
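For completeness, a small sketch showing how such a dictionary could be built from the existing list (assuming the IDs in the list are unique):
// Requires System.Linq; throws if two persons share the same ID.
Dictionary<int, Person> dictPersons = persons.ToDictionary(p => p.ID);
var person = dictPersons[5]; // O(1) lookup by key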
Unless you're actually experiencing performance issues and need to optimize it's better to go with what's more readable and maintainable. That's especially true since you mentioned that it's small amounts of data. Without exaggerating - it's possible that over the life of the application the cumulative difference in performance (if any) won't equal the time you save by making your code more readable.
To put it in perspective, consider the work that your application already does just to read request headers and parse views and read values from configuration files. Not only will the difference in performance between the list and the dictionary be small, it will also be a tiny fraction of the overall processing your application does just to serve a single page request.
And even then, if you were to see performance issues and needed to optimize, there would probably be plenty of other optimizations (like caching) that would make a bigger difference.
I'm looking at using a Guid as a random anonymous visitor identifier for a website (stored both as a cookie client-side, and in a db server-side), and I wanted a cryptographically strong way of generating Guids (so as to minimize the chance of collisions).
For the record, there are 16 bytes (or 128 bits) in a Guid.
This is what I have in mind:
/// <summary>
/// Generate a cryptographically strong Guid
/// </summary>
/// <returns>a random Guid</returns>
private Guid GenerateNewGuid()
{
    byte[] guidBytes = new byte[16]; // Guids are 16 bytes long
    using (var random = new RNGCryptoServiceProvider())
    {
        random.GetBytes(guidBytes);
    }
    return new Guid(guidBytes);
}
Is there a better way to do this?
Edit:
This will be used for two purposes, a unique Id for a visitor, and a transaction Id for purchases (which will briefly be the token needed for viewing/updating sensitive information).
In answer to the OP's actual question of whether this is cryptographically strong, the answer is yes, since it is created directly from RNGCryptoServiceProvider. However, the currently accepted answer provides a solution that is most definitely not cryptographically secure, as per this SO answer:
Is Microsoft's GUID generator cryptographically secure.
Whether this is the correct approach architecturally due to theoretical lack of uniqueness (easily checked with a db lookup) is another concern.
So, what you're building is not technically a GUID. A GUID is a Globally Unique Identifier. You're building a random string of 128 bits. I suggest, like the previous answerer, that you use the built-in GUID generation methods. This method has an (albeit tremendously small) chance of generating duplicate GUIDs.
There are a few advantages to using the built-in functionality, including cross-machine uniqueness (partially due to the MAC address being referenced in version 1 guids; see http://en.wikipedia.org/wiki/Globally_Unique_Identifier).
Regardless of whether you use the built in methods, I suggest that you not expose the Purchase GUID to the customer. The standard method used by Microsoft code is to expose a Session GUID that identifies the customer and expires comparatively quickly. Cookies track customer username and saved passwords for session creation. Thus your 'short term purchase ID' is never actually passed to (or, more importantly, received from) the client and there is a more durable wall between your customers' personal information and the Interwebs at large.
Collisions are practically impossible (it's not Globally Unique for nothing), but predictability is a whole other question. As Christopher Stevenson correctly points out, given a few previously generated GUIDs it actually becomes possible to start predicting a pattern within a much smaller keyspace than you'd think. GUIDs guarantee uniqueness, not predictability. Most algorithms take it into account, but you should never count on it, especially not as a transaction Id for purchases, however briefly. You're creating an open door for brute-force session hijacking attacks.
To create a proper unique ID, take some random data from your system, append some visitor-specific information, append a string only you know on the server, and then run a good hash algorithm over the whole thing. Hashes are meant to be unpredictable and irreversible, unlike GUIDs.
To simplify: if uniqueness is all you care about, why not just give all your visitors sequential integers, from 1 to infinity? Guaranteed to be unique, just terribly predictable: when you have just purchased item 684, you can start hacking away at 685 until it appears.
To avoid collisions:
If you can't keep a global count, then use Guid.NewGuid().
Otherwise, increment some integer and use 1, 2, 3, 4...
"But isn't that ridiculously easy to guess?"
Yes, but accidental and deliberate collisions are different problems with different solutions, best solved separately, not least because predictability helps prevent accidental collisions while simultaneously making deliberate collisions easier.
If you can increment globally, then number 2 guarantees no collisions. UUIDs were invented as a means to approximate that without the ability to globally track.
Let's say we use incrementing integers. Let's say the ID we have in a given case is 123.
We can then do something like:
private static string GetProtectedID(int id)
{
    // append a seed string known only to the server before hashing
    string hashString = id.ToString() + "this is my secret seed kjٵتשڪᴻᴌḶḇᶄ™∞ﮟﻑfasdfj90213";
    using (var sha = System.Security.Cryptography.SHA1.Create())
    {
        return string.Join("", sha.ComputeHash(Encoding.UTF8.GetBytes(hashString)).Select(b => b.ToString("X2"))) + id.ToString();
    }
}
Which produces 09C495910319E4BED2A64EA16149521C51791D8E123. To decode it back to the id we do:
private static int GetIDFromProtectedID(string str)
{
    int chkID;
    if (int.TryParse(str.Substring(40), out chkID))
    {
        // recompute the hash for the claimed id and compare it with the first 40 characters
        string chkHash = chkID.ToString() + "this is my secret seed kjٵتשڪᴻᴌḶḇᶄ™∞ﮟﻑfasdfj90213";
        using (var sha = System.Security.Cryptography.SHA1.Create())
        {
            if (string.Join("", sha.ComputeHash(Encoding.UTF8.GetBytes(chkHash)).Select(b => b.ToString("X2"))) == str.Substring(0, 40))
                return chkID;
        }
    }
    return 0; // or perhaps raise an exception here
}
Even if someone guessed from that they were given number 123, it wouldn't let them deduce that the id for 122 was B96594E536C9F10ED964EEB4E3D407F183FDA043122.
Alternatively, the two could be given as separate tokens, and so on.
I generally just use Guid.NewGuid();
http://msdn.microsoft.com/en-us/library/system.guid.newguid(v=vs.110).aspx
I'm developing a ticketing system for tracking bugs and software changes using ASP.NET MVC 4 and Entity Framework 5. I need a way to pick a unique number from a set of possible numbers. My thought is to create a set of possible numbers and mark numbers from this set as they are used and assigned to a support ticket.
I have this code for generating all possible ticket numbers to choose from, but I want to have leading zeroes so that all ticket numbers have the same length:
public static class GenerateNumber
{
    private static IEnumerable<int> GenerateNumbers(int count)
    {
        return Enumerable.Range(0, count);
    }

    public static IEnumerable<string> GenerateTicketNumbers(int count)
    {
        return GenerateNumbers(count).Select(n => "TN" + n.ToString());
    }
}
I want the output of
IEnumerable<string> ticketNumbers = GenerateNumber.GenerateTicketNumbers(Int32.MaxValue);
to be something like this:
TN0000000001
.
.
.
TN2147483647
Hopefully we won't need anything as large as Int32.MaxValue, as that would mean we have way too many bugs haha. I just wanted to be safe rather than sorry about the limits of the available numbers. Perhaps we could use the methodology of reusing ticket numbers after they have been resolved. However, I don't know how I feel about reuse, as it could lead to ambiguity when referring to documentation later on.
Considering the size of this set, is this the most efficient method to go about having unique ticket numbers?
Use an identity column in the database - this will autoincrement for you.
If you need a prefix as well, then store this as a separate varchar column and then for display purposes you can concatenate it (with your requisite leading zeros if that is absolutely really necessary). Trying to store an incrementing number in a varchar field is going to bite you in the ass one day.
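If the zero-padded display form from the question really is wanted (e.g. TN0000000001), a minimal sketch of the display-side concatenation might look like this (the helper name is hypothetical):
// id comes from the identity column, prefix from its own varchar column.
public static string FormatTicketNumber(string prefix, int id)
{
    return prefix + id.ToString("D10"); // e.g. ("TN", 1) -> "TN0000000001"
}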
As a side note, why the leading zeros? If I am fixing a ticket, I want to annotate my code with the ticket number. Leading zeros are just a pain - why not just have TN-123 and have the number get bigger as required?
I'm looking to generate unique ids for identifying some data in my system. I'm using an elaborate system which concatenates some (non unique, relevant) meta-data with System.Guid.NewGuid()s. Are there any drawbacks to this approach, or am I in the clear?
I'm looking to generate unique ids for identifying some data in my system.
I'd recommend a GUID then, since they are by definition globally unique identifiers.
I'm using an elaborate system which concatenates some (non unique, relevant) meta-data with System.Guid.NewGuid(). Are there any drawbacks to this approach, or am I in the clear?
Well, since we do not know what you would consider a drawback, it is hard to say. A number of possible drawbacks come to mind:
GUIDs are big: 128 bits is a lot of bits.
GUIDs are not guaranteed to have any particular distribution; it is perfectly legal for GUIDs to be generated sequentially, and it is perfectly legal for them to be distributed uniformly over their 124 bit space (128 bits minus the four bits that are the version number of course.) This can have serious impacts on database performance if the GUID is being used as a primary key on a database that is indexed into sorted order by the GUID; insertions are much more efficient if the new row always goes at the end. A uniformly distributed GUID will almost never be at the end.
Version 4 GUIDs are not necessarily cryptographically random; if GUIDs are generated by a non-crypto-random generator, an attacker could in theory predict what your GUIDs are when given a representative sample of them. An attacker could in theory determine the probability that two GUIDs were generated in the same session. Version one GUIDs are of course barely random at all, and can tell the sophisticated reader when and where they were generated.
And so on.
I am planning a series of articles about these and other characteristics of GUIDs in the next couple of weeks; watch my blog for details.
UPDATE: https://ericlippert.com/2012/04/24/guid-guide-part-one/
When you use System.Guid.NewGuid(), you may still want to check that the guid doesn't already exist in your system.
While a guid is so complex as to be virtually unique, there is nothing to guarantee that it doesn't already exist except probability. It's just incredibly statistically unlikely, to the point that in almost any case it's the same as being unique.
Generating two identical guids is like winning the lottery twice - there's nothing to actually prevent it, it's just so unlikely it might as well be impossible.
Most of the time you could probably get away with not checking for existing matches, but in a very extreme case with lots of generation going on, or where the system absolutely must not fail, it could be worth checking.
EDIT
Let me clarify a little more. It is highly, highly unlikely that you would ever see a duplicate guid. That's the point. It's "globally unique", meaning there's such an infinitesimally small chance of a duplicate that you can assume it will be unique. However, if we are talking about code that keeps an aircraft in the sky, monitors a nuclear reactor, or handles life support on the International Space Station, I, personally, would still check for a duplicate, just because it would really be terrible to hit that edge case. If you're just writing a blog engine, on the other hand, go ahead, use it without checking.
Feel free to use NewGuid(). There is no problem with its uniqueness.
The probability that it will generate the same guid twice is extremely low; a nice example can be found here: Simple proof that GUID is not unique
var bigHeapOGuids = new Dictionary<Guid, Guid>();
try
{
    do
    {
        Guid guid = Guid.NewGuid();
        bigHeapOGuids.Add(guid, guid);
    } while (true);
}
catch (OutOfMemoryException)
{
}
At some point it just crashed on OutOfMemory and not on duplicated key conflict.
Hey guys, I want to generate unique base62 hashes - something similar to what TinyURL and bit.ly do - using C#. This would be based on an auto-increment ID field of type bigint (like most of these sites).
Min chars would be 1 and max chars would be 6... if you had to write the fastest code (least amount of CPU usage) in C# for this hash, how would you write it?
Please see my answer to another Stack Overflow question which is similar, here:
Need a smaller alternative to GUID for DB ID but still unique and random for URL
I posted a C# class called "ShortCodes" that does exactly what you're looking for, i.e. generate a unique baseX (where X is anything you like!) hash based upon an integer/long number, and also convert back again.
I actually wrote this little class precisely to mimic the short code/hash generation of sites like TinyUrl.com and Bit.ly for my own purposes.
I can't say if this is the absolute fastest way of achieving this, but it's not exactly slow either! :)
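For reference, a minimal base-62 sketch of the same idea (not the ShortCodes class itself; the alphabet order is arbitrary, so pick one and stick to it) could look like this:
private const string Base62Alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

// Encodes a non-negative id; ids up to 62^6 - 1 (about 56.8 billion) fit in 6 characters.
public static string ToBase62(long value)
{
    if (value == 0) return "0";
    var sb = new System.Text.StringBuilder();
    while (value > 0)
    {
        sb.Insert(0, Base62Alphabet[(int)(value % 62)]);
        value /= 62;
    }
    return sb.ToString();
}

public static long FromBase62(string encoded)
{
    long value = 0;
    foreach (char c in encoded)
        value = value * 62 + Base62Alphabet.IndexOf(c);
    return value;
}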
Eric Lippert suggested lookup tables in a similar earlier question. His answer is perfect for your purposes as well.