fastest code to generate unique base62 hashes - c#

Hey guys, I want to generate unique base62 hashes - something similar to what TinyURL and bit.ly do - using C#. This would be based on an auto-increment ID field of type bigint (like most of these sites use).
Min chars would be 1 and max chars would be 6... if you had to write the fastest code (least amount of CPU usage) in C# for this hash, how would you write it?

Please see my answer to another Stack Overflow question which is similar, here:
Need a smaller alternative to GUID for DB ID but still unique and random for URL
I posted a C# class called "ShortCodes" that does exactly what you're looking for, i.e. generates a unique baseX (where X is anything you like!) hash based upon an integer/long number, and also converts it back again.
I actually wrote this little class precisely to mimic the short code/hash generation of sites like TinyUrl.com and Bit.ly for my own purposes.
I can't say if this is the absolute fastest way of achieving this, but it's not exactly slow either! :)

Eric Lippert suggested lookup tables in a similar earlier question. His answer is perfect for your purposes as well.
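To make the lookup-table idea concrete, here is a minimal C# sketch (not the ShortCodes class itself; the class name and alphabet ordering are arbitrary choices for illustration) that encodes a long ID to base62 and decodes it back:

    using System;
    using System.Text;

    public static class Base62
    {
        // Lookup table: digit value -> character. The ordering here is an arbitrary choice.
        private const string Alphabet =
            "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

        public static string Encode(long id)
        {
            if (id == 0) return Alphabet[0].ToString();

            var sb = new StringBuilder();
            while (id > 0)
            {
                // each remainder is one base62 digit, inserted at the front
                sb.Insert(0, Alphabet[(int)(id % 62)]);
                id /= 62;
            }
            return sb.ToString();
        }

        public static long Decode(string code)
        {
            long id = 0;
            foreach (char c in code)
            {
                id = id * 62 + Alphabet.IndexOf(c);
            }
            return id;
        }
    }

Note that 6 characters give you 62^6 (roughly 5.68e10) distinct codes, so a bigint ID that grows beyond that would need a seventh character.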

Related

Generate another Guid using one Guid and vice versa

Is there a way/algorithm/method to generate a new Guid (x) using our old Guid (y) and then get y back from x whenever we want?
Something similar to the answer below, but it only shows a way to convert the old Guid (which I can treat as a string) into a new Guid, not a way back.
https://stackoverflow.com/a/9386095/5887074
How can I generate a GUID for a string?
P.S.: This is not related to security in any way. The two Guids will just be used to find records in the table. We can convert the Guid to a string in this conversion if required.
There are thousands of ways: a guid is 128 bits, so you could flip one bit which would make it simple to translate back and forth. Or you could do modulo 42 and make it look as if you made something unpredictable. Or you could reverse the order of the bits, do a NOT operation on all of them or rearrange the bits by some predefined pattern.
But I suspect that you have a use case which you do not define. Please tell us a bit more about the problem you want to solve. Your request sounds a little dangerous, as it sounds as if you want to enable some kind of tracking between seemingly unrelated entities. If there are security issues involved, you are very likely to get it wrong if both the cleartext (guid pre-translation) and the cipher (guid post-translation) are public. Perhaps simple AES encryption would suffice as a translation function, but I think you need to specify your problem in much more detail to get a useful answer.
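To make one of the simpler options above concrete, here is a sketch of reversing the byte order of a Guid (the class and method names are made up for illustration). Reversal is its own inverse, so the same method translates y to x and x back to y:

    using System;

    public static class GuidTranslator
    {
        // Reversing the byte order is a self-inverse mapping:
        // Transform(Transform(g)) == g for every Guid g.
        public static Guid Transform(Guid value)
        {
            byte[] bytes = value.ToByteArray();
            Array.Reverse(bytes);
            return new Guid(bytes);
        }
    }

As noted above, this only hides the relationship from a casual observer; it is trivially reversible and offers no security.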

Optimizing output parameters

I'm trying to solve a problem statement using C# as programming language.
In the problem system, for an input (double/decimal), say Hi, the output generated is a dataset containing a number of parameters (Fi, Pi and Ti). I somehow have to filter out only those entries in the dataset which satisfy the following conditions:
Fi > Fmin, where Fmin is some constant
Pi > Pmin, where Pmin is some constant
Ti < Tmax, where Tmax is some constant
Is there an efficient algorithm I could use in such cases to zero in on an optimal set of values for Hi for which the output parameter values are well within the constraints? I also thought using Genetic Algorithms makes sense here, but somehow I'm not able to formulate the problem in terms that fit a Genetic Algorithm.
Any pointers/ suggestions are truly appreciated.
You can use a LINQ query:
var result = DataSet.Where(x => x.Fi > Fmin && x.Pi > Pmin && x.Ti < Tmax);
Well, it's hard for me to guess. I don't know the properties of the function for Fi etc.
A log-barrier method could be something interesting here, or the SQP method, but the function has to be differentiable.
Otherwise simulated annealing could be interesting.
But these are just some guesses. It really depends on the problem.
I doubt that a Genetic Algorithm makes sense, seeing as you have only one input variable (Hi) that determines the outputs (Fi, Pi, Ti). The power of a Genetic Algorithm is that it blends good solutions into new solutions. If your solution is only one number, blending two good solutions will probably mean that you're finding some Hi in between (such as the average -> 0.5Hi1 + 0.5Hi2, or some other linear combination aHi1 + (1-a)Hi2 with a between 0 and 1).
I would recommend looking into multi-start local search heuristics. This is a pretty solid class of heuristics that allows you to explore the solution space for Hi.
In their simplest form, such heuristics calculate the performance for N random values of Hi, and then search for further improvements in the area of the best performing Hi values out of those N initial values.
This sort of stuff is also pretty straightforward to code, assuming that you have a way to obtain the Fi, Ti, and Pi values from your Hi input, and that you have some way to figure out which of your solutions performs 'best' (for instance through a fitness function, as mentioned in the comments).
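For example, a rough sketch of a multi-start local search might look like the code below. The Evaluate method is a stand-in for whatever produces (Fi, Pi, Ti) from Hi in your system, and the Fitness scoring is purely illustrative; both are assumptions, not part of your problem statement:

    using System;

    public class MultiStartSearch
    {
        private static readonly Random Rng = new Random();

        // Stand-in for the real system: returns (Fi, Pi, Ti) for a given Hi.
        public static (double F, double P, double T) Evaluate(double hi)
        {
            throw new NotImplementedException("Plug in the real system here.");
        }

        // Illustrative fitness: reject constraint violations, otherwise reward slack.
        public static double Fitness(double hi, double fMin, double pMin, double tMax)
        {
            var (f, p, t) = Evaluate(hi);
            if (f <= fMin || p <= pMin || t >= tMax) return double.NegativeInfinity;
            return (f - fMin) + (p - pMin) + (tMax - t);
        }

        public static double Search(double hiLow, double hiHigh,
                                    double fMin, double pMin, double tMax,
                                    int starts = 50, int localSteps = 100)
        {
            double bestHi = hiLow, bestScore = double.NegativeInfinity;

            for (int s = 0; s < starts; s++)
            {
                // Random starting point in the allowed range for Hi.
                double hi = hiLow + Rng.NextDouble() * (hiHigh - hiLow);
                double score = Fitness(hi, fMin, pMin, tMax);

                // Simple local search: small random perturbations, keep improvements.
                double step = (hiHigh - hiLow) / 100.0;
                for (int i = 0; i < localSteps; i++)
                {
                    double candidate = hi + (Rng.NextDouble() - 0.5) * step;
                    double candidateScore = Fitness(candidate, fMin, pMin, tMax);
                    if (candidateScore > score) { hi = candidate; score = candidateScore; }
                }

                if (score > bestScore) { bestScore = score; bestHi = hi; }
            }
            return bestHi;
        }
    }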

Are there any drawbacks to relying on the System.Guid.NewGuid() function when looking for unique IDs for data?

I'm looking to generate unique ids for identifying some data in my system. I'm using an elaborate system which concatenates some (non unique, relevant) meta-data with System.Guid.NewGuid()s. Are there any drawbacks to this approach, or am I in the clear?
I'm looking to generate unique ids for identifying some data in my system.
I'd recommend a GUID then, since they are by definition globally unique identifiers.
I'm using an elaborate system which concatenates some (non unique, relevant) meta-data with System.Guid.NewGuid(). Are there any drawbacks to this approach, or am I in the clear?
Well, since we do not know what you would consider a drawback, it is hard to say. A number of possible drawbacks come to mind:
GUIDs are big: 128 bits is a lot of bits.
GUIDs are not guaranteed to have any particular distribution; it is perfectly legal for GUIDs to be generated sequentially, and it is perfectly legal for them to be distributed uniformly over their 124 bit space (128 bits minus the four bits that are the version number of course.) This can have serious impacts on database performance if the GUID is being used as a primary key on a database that is indexed into sorted order by the GUID; insertions are much more efficient if the new row always goes at the end. A uniformly distributed GUID will almost never be at the end.
Version 4 GUIDs are not necessarily cryptographically random; if GUIDs are generated by a non-crypto-random generator, an attacker could in theory predict what your GUIDs are when given a representative sample of them. An attacker could in theory determine the probability that two GUIDs were generated in the same session. Version one GUIDs are of course barely random at all, and can tell the sophisticated reader when and where they were generated.
And so on.
I am planning a series of articles about these and other characteristics of GUIDs in the next couple of weeks; watch my blog for details.
UPDATE: https://ericlippert.com/2012/04/24/guid-guide-part-one/
When you use System.Guid.NewGuid(), you may still want to check that the guid doesn't already exist in your system.
While a guid is so complex as to be virtually unique, there is nothing to guarantee that it doesn't already exist except probability. It's just incredibly statistically unlikely, to the point that in almost any case it's the same as being unique.
Generating two identical guids is like winning the lottery twice - there's nothing to actually prevent it, it's just so unlikely it might as well be impossible.
Most of the time you could probably get away with not checking for existing matches, but in a very extreme case with lots of generation going on, or where the system absolutely must not fail, it could be worth checking.
EDIT
Let me clarify a little more. It is highly, highly unlikely that you would ever see a duplicate guid. That's the point. It's "globally unique", meaning there's such an infinitesimally small chance of a duplicate that you can assume it will be unique. However, if we are talking about code that keeps an aircraft in the sky, monitors a nuclear reactor, or handles life support on the International Space Station, I, personally, would still check for a duplicate, just because it would really be terrible to hit that edge case. If you're just writing a blog engine, on the other hand, go ahead, use it without checking.
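If you do decide to check, a minimal sketch is just a retry loop against whatever collection of existing ids you keep - a HashSet here, purely for illustration:

    using System;
    using System.Collections.Generic;

    public static class SafeGuid
    {
        // Keep generating until the value is not already in use.
        // In practice the loop body virtually never runs more than once.
        public static Guid NewUniqueGuid(HashSet<Guid> existing)
        {
            Guid candidate;
            do
            {
                candidate = Guid.NewGuid();
            } while (!existing.Add(candidate)); // Add returns false if the guid is already present
            return candidate;
        }
    }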
Feel free to use NewGuid(). There is no problem with its uniqueness.
The probability that it will generate the same guid twice is vanishingly low; a nice example can be found here: Simple proof that GUID is not unique
var bigHeapOGuids = new Dictionary<Guid, Guid>();
try
{
    do
    {
        Guid guid = Guid.NewGuid();
        bigHeapOGuids.Add(guid, guid);
    } while (true);
}
catch (OutOfMemoryException)
{
}
At some point it just crashed with OutOfMemory, not with a duplicate key conflict.

Efficient Datastructure for tags?

Imagine you wanted to serialize and deserialize stackoverflow posts including their tags as space efficiently as possible (in binary), but also for performance when doing tag lookups. Is there a good datastructure for that kind of scenario?
Stack Overflow has about 28,532 different tags. You could create a table with all tags and assign each an integer; furthermore, you could sort them by frequency so that the most common tags have the lowest numbers. Still, storing them simply as a string in the format "1 32 45" seems a bit inefficient, both from a searching and a storing perspective.
Another idea would be to save the tags as a variable-length bit array, which is attractive from a lookup and serializing perspective. Since the most common tags come first, you could potentially fit the tags into a small amount of memory.
The problem, of course, is that uncommon tags would yield huge bit arrays. Is there any standard for "compressing" bit arrays with large runs of 0's? Or should one use some other structure completely?
EDIT
I'm not looking for a DB solution or a solution where I need to keep entire tables in memory, but a structure for filtering individual items
Not to undermine your question but 28k records is really not all that many. Are you perhaps optimizing prematurely?
I would first stick to using 'regular' indices on a DB table. The hashing heuristics they use are typically very efficient and not trivial to beat (or, if you can beat them, is it really worth the effort in time, and are the gains large enough?).
Also depending on where you actually do the tag query, is the user really noticing the 200ms time gain you optimized for?
First measure then optimize :-)
EDIT
Without a DB I would probably have a master table holding all tags together with an ID (if possible hold it in memory). Keep a regular sorted list of IDs together with each post.
Not sure how much storage based on commonality would help. A sorted list in which you can do a regular binary search may prove fast enough; measure :-)
Here you would need to iterate all posts for every tag query though.
If this ends up being too slow, you could resort to storing a pocket of post identifiers for each tag. This data structure may become somewhat large, though, and may require a file to seek and read against.
For a smaller table you could resort to building one based on a hashed value (with duplicates). This way you could use it to quickly get down to a smaller candidate list of posts that need further checking to see whether they match or not.
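A minimal sketch of the sorted-list-per-post idea from above (the Post type and its members are made up for illustration):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    public class Post
    {
        // Tag ids are kept sorted so membership tests can use binary search.
        public int[] TagIds { get; }

        public Post(IEnumerable<int> tagIds)
        {
            TagIds = tagIds.OrderBy(id => id).ToArray();
        }

        public bool HasTag(int tagId) => Array.BinarySearch(TagIds, tagId) >= 0;
    }

    // A tag query then iterates all posts:
    // var matches = allPosts.Where(p => p.HasTag(tagId));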
You need a second table with 2 fields: tag_id, question_id.
That's it. Then you create indexes on (tag_id, question_id) and (question_id, tag_id) - those would be covering indexes, so all your queries would be very fast.
I have a feeling you abstracted your question too much; you didn't say very much about how you want to access the datastructure, which is very important.
That being said, I suggest counting the number of occurrences for each tag and then using Huffman coding to come up with the shortest encoding which can be used for the tags. This is not entirely perfect, but I'd stick with it until you've demonstrated that it's inappropriate. You can then associate the codes with each question.
If you want to efficiently look up questions with a specific tag, you will need some kind of index. Maybe all Tag objects could have an array of references (references, pointers, numeric ids, etc.) to all the questions that are tagged with this particular tag. This way you simply need to find the tag object and you have an array pointing to all the questions with that tag.
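For instance, a simple in-memory inverted index along those lines (all names are illustrative) could look like this:

    using System;
    using System.Collections.Generic;

    public class TagIndex
    {
        // tag id -> ids of every question carrying that tag
        private readonly Dictionary<int, List<int>> _questionsByTag =
            new Dictionary<int, List<int>>();

        public void Add(int questionId, IEnumerable<int> tagIds)
        {
            foreach (int tagId in tagIds)
            {
                if (!_questionsByTag.TryGetValue(tagId, out var list))
                {
                    list = new List<int>();
                    _questionsByTag[tagId] = list;
                }
                list.Add(questionId);
            }
        }

        public IReadOnlyList<int> QuestionsFor(int tagId) =>
            _questionsByTag.TryGetValue(tagId, out var list)
                ? (IReadOnlyList<int>)list
                : Array.Empty<int>();
    }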

Approximate string matching

I know this question has been asked many times.
I want a suggestion on which algorithm is suitable for approximate string matching.
The application is specifically for company name matching only and nothing else.
The biggest challenge is probably the company name suffix part and the short/abbreviated name part.
Example:
1. companyA pty ltd vs companyA pty. ltd. vs companyA
2. WES Engineering vs W.E.S. Engineering (extremely rare occurrence)
Do you think Levenshtein Edit Distance is adequate?
I'm using C#
Regards,
Max
There are various string distance metrics you could use.
I would recommend Jaro-Winkler. Unlike edit-distance where the result of a comparison is in discrete units of edits, JW gives you a 0-1 score. It is especially suited for proper names. Also look at this nice tutorial and this SO question.
I haven't worked with C# but here are some implementations of JW I found online:
Impl 1 (They have a DOT NET version too if you look at the file list)
Impl 2
If you want to do a bit more sophisticated matching, you can try to do some custom normalization of word forms commonly occurring in company names such as ltd/limited, inc/incorporated, corp/corporation to account for case insensitivity, abbreviations etc. This way if you compute
distance(normalize("foo corp."), normalize("FOO CORPORATION"))
you should get the result to be 0 rather than 14 (which is what you would get if you computed levenshtein edit-distance).
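A sketch of that kind of normalization in C#, paired with a plain Levenshtein implementation (the suffix map below is just an example, not a complete list):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class CompanyNameMatcher
    {
        // Example abbreviation expansions only; a real list would be longer.
        private static readonly Dictionary<string, string> Suffixes = new Dictionary<string, string>
        {
            { "ltd", "limited" },
            { "inc", "incorporated" },
            { "corp", "corporation" },
            { "pty", "proprietary" }
        };

        public static string Normalize(string name)
        {
            // Lower-case, strip punctuation, expand known abbreviations.
            var words = new string(name.ToLowerInvariant()
                    .Select(c => char.IsLetterOrDigit(c) || c == ' ' ? c : ' ')
                    .ToArray())
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                .Select(w => Suffixes.TryGetValue(w, out var full) ? full : w);
            return string.Join(" ", words);
        }

        public static int Levenshtein(string a, string b)
        {
            var d = new int[a.Length + 1, b.Length + 1];
            for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
            for (int j = 0; j <= b.Length; j++) d[0, j] = j;
            for (int i = 1; i <= a.Length; i++)
                for (int j = 1; j <= b.Length; j++)
                {
                    int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                    d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                       d[i - 1, j - 1] + cost);
                }
            return d[a.Length, b.Length];
        }
    }

With that, Levenshtein(Normalize("foo corp."), Normalize("FOO CORPORATION")) is 0, while on the raw strings it is 14.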
Yes, Levenshtein distance is suitable for this. It will work for all those you have listed at least.
You could also possibly use Soundex, but I don't think you'll need it.
In these simple examples, just removing all non-alphanumeric characters gives you a match, and it is the easiest to do since you can pre-compute the normalized data on each side and then do a straight equality match, which will be a lot faster than cross-multiplying and calculating the edit distance.
I have provided my answer already in another question.
https://stackoverflow.com/a/30120166/2282794
I have worked on really large scale system with similar name matching requirements that you have talked about.
Name matching is not very straightforward and the order of first and last names might be different.
Simple fuzzy name matching algorithms fail miserably in such scenarios.
If we just want to talk about approximate string matching algorithms, then there are many. A few of them are: Jaro-Winkler, edit distance (Levenshtein), Jaccard similarity, Soundex/phonetics-based algorithms, etc. A simple Google search would give us all the details.
You can implement all of them in C#.
The irony is, they work when you try to match two given input strings - fine in theory, and good for demonstrating the way fuzzy or approximate string matching works.
However, the grossly understated point is how to use the same thing in a production setting. Not everybody I know of who was scouting for an approximate string matching algorithm knew how they could solve it in a production environment.
I might as well just mention Lucene, which is specific to Java, but there is Lucene for .NET also.
https://lucenenet.apache.org/
