I am working with existing data and have records that contain arrays of doubles (a double[23] and a double[46]). The values in an array can be identical across multiple records. I would like to generate an id (perhaps an int) to uniquely identify the values in each array.
There are places in the application where I need to group records based on the values in the array being identical. While there are ways to query for this, I was hoping for a single int field (or something similar) to group on. This would really help simplify queries and especially help with report tools where grouping on a smaller single field would help immensely.
I thought of generating a hash code, but I understand these are not guaranteed to be the same for each double[] with matching values. I had tried implementing
((IStructuralEquatable)combined).GetHashCode(EqualityComparer<double>.Default);
To compare the structure and data, but again, I don't think this is guaranteed to match another double[] having the same values.
Perhaps a form of checksum would work but admittedly I am having trouble implementing something. I am looking for suggestions/direction.
Here is data for 3 sample records. The data in records 1 and 3 are the same, so a generated id should match for those.
32.7,48.9,55.9,48.9,47.7,46.9,45.7,44.4,43.4,41.9,40.4,38.4,36.7,34.4,32.4,30.4,27.9,25.4,22.4,19.4,16.4,13.4,10.4,47.9
40.8,49.0,50.0,49.0,47.8,47.0,45.8,44.5,43.5,42.0,40.5,38.5,36.8,34.5,32.5,30.5,28.0,25.5,22.5,19.5,16.5,13.5,10.5,48.0
32.7,48.9,55.9,48.9,47.7,46.9,45.7,44.4,43.4,41.9,40.4,38.4,36.7,34.4,32.4,30.4,27.9,25.4,22.4,19.4,16.4,13.4,10.4,47.9
Perhaps this is not possible without just checking all the data, but I was hoping for a better solution to simplify the application and improve speed.
The goal is to add a new id field to the existing records to represent the array data. That way, passing records into report tools would group together easily on one field rather than checking the whole array on each record.
I appreciate any direction.
EDIT - Some issues I ran into trying things (in case it helps someone)
When originally trying to understand this, I was calling the code below (which is part of .NET). I understood these functions would hash the values of the array together (only the last 8 values in this case), and I didn't think it included the array handle. The result was not quite as expected because of a bug MS corrected in .NET, as noted in the commented line below. With the fix I was getting better results.
int IStructuralEquatable.GetHashCode(IEqualityComparer comparer) {
    if (comparer == null)
        throw new ArgumentNullException("comparer");
    Contract.EndContractBlock();

    int ret = 0;
    for (int i = (this.Length >= 8 ? this.Length - 8 : 0); i < this.Length; i++) {
        ret = CombineHashCodes(ret, comparer.GetHashCode(GetValue(i)));
        //.NET 4.6.2, in .NET 4.5.2 it is ret = CombineHashCodes(ret, comparer.GetHashCode(GetValue(0)))
    }
    return ret;
}

internal static int CombineHashCodes(int h1, int h2) {
    return (((h1 << 5) + h1) ^ h2);
}
I modified this to handle more than 8 values and still had some hashes not matching. I later determined the issue was in the data; I was unaware some of the records had some doubles stored with more than one decimal place (should have been rounded). This of course changed the hash. Now that I have the data consistent, I am seeing matching hashes; any arrays with identical values have an identical hash.
I thought of generating a hash code, but I understand these are not guaranteed to be the same for each double[] with matching values
Quite the opposite: a hash function is required by design to return equal hashes for equal inputs. For example, a function that simply returns 0 for every input is a trivially valid hash function, since equal rows get equal hashes. Everything beyond that is just an optimization to reduce false positives (collisions).
Perhaps this is not possible without just checking all the data
Of course you need to check all the data; how else would you do it?
However, your implementation is broken. The default hash function for an array hashes the reference to the array itself, so different array instances with the same data show up as different. What you want to do is use a HashCode instance and Add() each element of the array to it to get a proper hash code.
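A minimal sketch of that suggestion, assuming the System.HashCode struct is available (.NET Core 2.1+ or the Microsoft.Bcl.HashCode package). Note that HashCode is randomly seeded per process, so the value is stable within a single run but should not be persisted across runs:

// Requires: using System;
static int GetArrayHash(double[] values)
{
    var hash = new HashCode();
    foreach (var v in values)
        hash.Add(v); // combine every element, not just the last 8
    return hash.ToHashCode();
}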
I'm looking for an algorithm which generates identifiers suitable for both, external use in e.g. URLs as well as persistence with the following requirements:
Short, like a max. of 8 characters
URL-friendly, so no special characters
Human-friendly, e.g. no ambiguous characters like L/l, 0/O
Incremental for fast indexing
Random to prevent guessing without knowing the algorithm (would be nice, but not important)
Unique without requiring to check the database
I looked at various solutions, but all I found have some major tradeoffs. For example:
GUID: Too long, not incremental
GUID base64 encoded: Still too long, not incremental
GUID ascii85 encoded: Short, not incremental, too many unsuitable characters
GUID encodings like base32, base36: Short, but loss of information
Comb GUID: Too long, however incremental
All others based on random: Require checking the DB for uniqueness
Time-based: Prone to collisions in clustered or multi-threaded environments
Edit: Why has this been marked off-topic? The requirements describe a specific problem to which numerous legitimate solutions can be provided. In fact, some of the solutions here are so good, I'm struggling to choose which one to mark as the answer.
If at all possible I'd keep the user requirements (short, readable) and the database requirements (incremental, fast indexing) separate. User-facing requirements change. You don't want to have to modify your tables because tomorrow you decide to change the length or other specifics of your user-facing ID.
One approach is to generate your ID using user-friendly characters, like
23456789ABCDEFGHJKLMNPQRSTUVWXYZ and just make it random.
But when inserting into the database, don't make that value the primary key for the record it references or even store it in that table. Insert it into its own table with an identity primary key, and then store that int or bigint key with your record.
That way your primary table can have an incremental primary key. If you need to reference a record by its "friendly" ID then you join to your friendly ID table.
My guess is that if you're generating a high enough volume of these IDs that you're concerned about index performance then the rate at which human users retrieve those values will be much lower. So the slightly slower lookup of the random value in the friendly ID table won't be a problem.
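As a minimal sketch of that approach, assuming the 32-character alphabet above (the method name is illustrative). With a 32-character alphabet the byte % 32 mapping introduces no modulo bias:

// Requires: using System.Security.Cryptography; using System.Text;
static string NewFriendlyId(int length = 8)
{
    const string alphabet = "23456789ABCDEFGHJKLMNPQRSTUVWXYZ";
    var bytes = new byte[length];
    using (var rng = RandomNumberGenerator.Create())
        rng.GetBytes(bytes);
    var sb = new StringBuilder(length);
    foreach (var b in bytes)
        sb.Append(alphabet[b % alphabet.Length]); // map each random byte onto the friendly alphabet
    return sb.ToString();
}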
The following uses a combination of an ID that is known to be unique (because it comes from a unique ID column in a relational database) and a random sequence of letters and numbers to generate a token:
// Requires: using System.Linq; using System.Security.Cryptography;
public static string GenerateAccessToken(string uniqueId) // generates a unique, random, alphanumeric token
{
    const string availableChars = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    using (var generator = new RNGCryptoServiceProvider())
    {
        var bytes = new byte[16];
        generator.GetBytes(bytes);
        // Map each random byte onto the allowed alphabet.
        var chars = bytes.Select(b => availableChars[b % availableChars.Length]);
        var token = new string(chars.ToArray());
        return uniqueId + token;
    }
}
The token is guaranteed to be both unique and random (or at least "pseudo random"). You can manipulate the length by changing the length of bytes.
To avoid confusion between "0" and "O" or "l" and "1", you can remove those characters from availableChars.
Edit
I just realized this doesn't quite pass the "no database check" requirement, though when I've used code like this, I've always already had an entity in memory that I knew contained a unique ID, so I'm hoping the same applies to your situation. I don't think it's possible to quite achieve all your requirements, so I'm hoping this would still be a good balance of attributes.
Have you tried proquints?
A Proquint is a PRO-nouncable QUINT-uplet of alternating unambiguous consonants and vowels, for example: "lusab".
I think they meet almost all your requirements.
See the proposal here.
And here is the official implementation in C and Java.
I've worked on a port to .NET that you can download as Proquint.NET.
A simple solution I implemented before does not fulfill all of your constraints but might be acceptable if you think about your problem a little bit differently.
First, I used a function to obfuscate the database id, so that func(id) => y and func(y) => id. (I used a Feistel cipher; here is an example of implementing such a function.) Second, I convert the obfuscated id to base 62 so it becomes short and URL-friendly. (You can use a smaller character set to make it more human-friendly.) This creates a one-to-one mapping from database ids to string identifiers. In my implementation, 1 and 2 map to 2PawdM and 5eeGE8 respectively, and I can get the database ids 1 and 2 back from the strings 2PawdM and 5eeGE8. The mapping would be entirely different with a different obfuscation function.
With this solution, the identifiers themselves are NOT incremental; however, because the identifiers map directly to database ids, you can compute the corresponding database id and run any database queries on the id column instead. You don't need to generate a string identifier and store it in the database, and uniqueness is guaranteed by the database itself when it stores the record with an auto-incremented id column.
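A minimal sketch of the base-62 step (the obfuscation function itself, e.g. a Feistel round over the id, is assumed to have run first; the method name is illustrative):

// Encodes an already-obfuscated, non-negative id in base 62.
static string ToBase62(long value)
{
    const string alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    if (value == 0) return "0";
    var sb = new System.Text.StringBuilder();
    while (value > 0)
    {
        sb.Insert(0, alphabet[(int)(value % 62)]);
        value /= 62;
    }
    return sb.ToString();
}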
I am writing a templating engine and I am searching for a good way to detect if a template has changed.
For this I have the following requirements (in order of importance):
non-equal strings must be detected as different
as fast as possible
as little memory as possible (=> do not store the whole string for comparison)
high probability of detecting equal strings as equal
It is not a big problem if equal strings are sometimes not detected as equal, as this would just trigger an unneeded "re-rendering"; but because of the "heavy work" involved, this should happen as rarely as possible.
I first thought of using String.GetHashCode(), but the probability of getting the same hash code for two non-equal strings is pretty high.
Are there any good combinations, like checking the hash code and Length, that push the probability of two non-equal strings being wrongly detected as equal down to an unrealistically low number?
Or is using some hashing algorithm, like MD5 or SHA, a good alternative (after the hash codes are equal)?
My rendering looks something like the following:
public string RenderTemplate(string name, string template)
{
    var cachedTemplate = Cache.Get(name);
    if (cachedTemplate == null || !cachedTemplate.Equals(template)) // <= Equals
    {
        cachedTemplate = new Template(name, template);
        cachedTemplate.Render();
        Cache.Set(name, cachedTemplate);
    }
    return cachedTemplate.Result;
}
The Equals is the point I am asking about.
I am also open for other suggestions how this could be solved.
UPDATE:
To add some numbers to get more context:
I expect to have >1000 individual templates, and each template will have at least a few thousand characters.
This is why I would like to avoid storing the whole template-string "in memory" only for the comparison.
Most of the templates are stored in the DB.
UPDATE 2:
What do you think about extending my RenderTemplate method with a timestamp as suggested by Nikola:
public string RenderTemplate(string name, string template, DateTime timestamp)
Then I could compare name, GetHashCode, and timestamp, which does not need much memory, should be pretty fast, and makes the probability of a "wrongly detected equality" practically 0. The timestamp I can read from the DB (I already have it there) or from the file system's "last changed date" for a file-based template.
You don't have much choice. If you don't compare strings by comparing their content, use a hash algorithm to determine if strings are equal. Personally, I would probably use a hash algorithm. If you are a bit paranoid and afraid of a collision, choose an algorithm with the widest space (e.g. SHA512).
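A minimal sketch of that idea, assuming SHA-256 and that the cache keeps only the digest rather than the full template (names are illustrative):

// Requires: using System.Linq; using System.Security.Cryptography; using System.Text;
static byte[] ComputeTemplateDigest(string template)
{
    // Hash the template text so only the digest needs to be kept in memory.
    using (var sha = SHA256.Create())
        return sha.ComputeHash(Encoding.UTF8.GetBytes(template));
}

// Usage: re-render only when the stored digest differs.
// bool changed = !storedDigest.SequenceEqual(ComputeTemplateDigest(template));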
Why do you need to compare strings to determine that a template has changed? Why not use a different approach?
If the file is stored on disk, why not use a file watcher?
If it is stored in a database, why not use a timestamp to detect when it was saved?
If the application is restarted, reload the templates anyway
Also, it's worrying that a UI template changes so often that you must make checks like this. I think you have more design problems besides comparing strings.
I'm looking at using a Guid as a random anonymous visitor identifier for a website (stored both as a cookie client-side, and in a db server-side), and I wanted a cryptographically strong way of generating Guids (so as to minimize the chance of collisions).
For the record, there are 16 bytes (or 128 bits) in a Guid.
This is what I have in mind:
/// <summary>
/// Generate a cryptographically strong Guid
/// </summary>
/// <returns>a random Guid</returns>
private Guid GenerateNewGuid()
{
    byte[] guidBytes = new byte[16]; // Guids are 16 bytes long
    RNGCryptoServiceProvider random = new RNGCryptoServiceProvider();
    random.GetBytes(guidBytes);
    return new Guid(guidBytes);
}
Is there a better way to do this?
Edit:
This will be used for two purposes, a unique Id for a visitor, and a transaction Id for purchases (which will briefly be the token needed for viewing/updating sensitive information).
In answer to the OP's actual question whether this is cryptographically strong, the answer is yes since it is created directly from RNGCryptoServiceProvider. However the currently accepted answer provides a solution that is most definitely not cryptographically secure as per this SO answer:
Is Microsoft's GUID generator cryptographically secure.
Whether this is the correct approach architecturally due to theoretical lack of uniqueness (easily checked with a db lookup) is another concern.
So, what you're building is not technically a GUID. A GUID is a Globally Unique Identifier. You're building a random string of 128 bits. I suggest, like the previous answerer, that you use the built-in GUID generation methods. This method has an (albeit tremendously small) chance of generating duplicate GUIDs.
There are a few advantages to using the built-in functionality, including cross-machine uniqueness [partially due to the MAC address being referenced in the GUID; see here: http://en.wikipedia.org/wiki/Globally_Unique_Identifier].
Regardless of whether you use the built in methods, I suggest that you not expose the Purchase GUID to the customer. The standard method used by Microsoft code is to expose a Session GUID that identifies the customer and expires comparatively quickly. Cookies track customer username and saved passwords for session creation. Thus your 'short term purchase ID' is never actually passed to (or, more importantly, received from) the client and there is a more durable wall between your customers' personal information and the Interwebs at large.
Collisions are theoretically impossible (it's not Globally Unique for nothing), but predictability is a whole other question. As Christopher Stevenson correctly points out, given a few previously generated GUIDs it actually becomes possible to start predicting a pattern within a much smaller keyspace than you'd think. GUIDs guarantee uniqueness, not predictability. Most algorithms take it into account, but you should never count on it, especially not as transaction Id for purchases, however briefly. You're creating an open door for brute force session hijacking attacks.
To create a proper unique ID, take some random stuff from your system, append some visitor specific information, and append a string only you know on the server, and then put a good hash algorithm over the whole thing. Hashes are meant to be unpredictable and unreversable, unlike GUIDs.
To simplify: if uniqueness is all you care about, why not just give all your visitors sequential integers, from 1 to infinity? Guaranteed to be unique, just terribly predictable: when you've just purchased item 684, you can start hacking away at 685 until it appears.
To avoid collisions:
If you can't keep a global count, then use Guid.NewGuid().
Otherwise, increment some integer and use 1, 2, 3, 4...
"But isn't that ridiculously easy to guess?"
Yes, but accidental and deliberate collisions are different problems with different solutions, best solved separately, not least because predictability helps prevent accidental collision while simultaneously making deliberate collision easier.
If you can increment globally, then number 2 guarantees no collisions. UUIDs were invented as a means to approximate that without the ability to globally track.
Let's say we use incrementing integers. Let's say the ID we have in a given case is 123.
We can then do something like:
private static string GetProtectedID(int id)
{
    // Append a server-side secret to the id before hashing.
    string hashString = id.ToString() + "this is my secret seed kjٵتשڪᴻᴌḶḇᶄ™∞ﮟﻑfasdfj90213";
    using (var sha = System.Security.Cryptography.SHA1.Create())
    {
        return string.Join("", sha.ComputeHash(Encoding.UTF8.GetBytes(hashString)).Select(b => b.ToString("X2"))) + id.ToString();
    }
}
Which produces 09C495910319E4BED2A64EA16149521C51791D8E123. To decode it back to the id we do:
private static int GetIDFromProtectedID(string str)
{
    int chkID;
    if (int.TryParse(str.Substring(40), out chkID))
    {
        // Recompute the hash with the same secret and compare it to the first 40 characters.
        string chkHash = chkID.ToString() + "this is my secret seed kjٵتשڪᴻᴌḶḇᶄ™∞ﮟﻑfasdfj90213";
        using (var sha = System.Security.Cryptography.SHA1.Create())
        {
            if (string.Join("", sha.ComputeHash(Encoding.UTF8.GetBytes(chkHash)).Select(b => b.ToString("X2"))) == str.Substring(0, 40))
                return chkID;
        }
    }
    return 0; // or perhaps raise an exception here
}
Even if someone guessed from that they were given number 123, it wouldn't let them deduce that the id for 122 was B96594E536C9F10ED964EEB4E3D407F183FDA043122.
Alternatively, the two could be given as separate tokens, and so on.
I generally just use Guid.NewGuid();
http://msdn.microsoft.com/en-us/library/system.guid.newguid(v=vs.110).aspx
I have an object with the following properties
GID
ID
Code
Name
Some of the clients don't want to enter the Code, so the initial plan was to put the ID in the Code, but the base object of the ORM is different, so I'm basically stuck...
My plan was to put totally random ####-#### values in Code. How can I generate something like that (say, Windows 7 serial-generator-type stuff), but wouldn't that have overhead? What would you do in this case?
Do you want a random value, or a unique value?
random != unique.
Remember, random merely states a probability of not generating the same value, or a probability of generating the same value again. As time goes on, the likelihood of generating a previous value increases, becoming a near certainty. Which do you require?
Personally, I recommend just using a Guid with some context [refer to easiest section below]. I also provided some other suggestions so you have options, depending on your situation.
easiest
If Code is an unbounded string [ie can be of any length], easiest semi-legible means of generating a unique code would be
OrmObject ormObject = new OrmObject ();
string code = string.Format ("{0} [{1}]", ormObject.Name, Guid.NewGuid ()).Trim ();
// generates something like
// "My Product [DA9190E1-7FC6-49d6-9EA5-589BBE6E005E]"
you can substitute ormObject.Name for any distinguishable string. I would typically use objectInstance.GetType ().Name, but that will only work if OrmObject is a base class; if it's a concrete class used for everything, they will all end up with similar tags. The point is to add some user context, such that - as in #Yuriy Faktorovich's referenced wtf article - users have something to read.
random
I responded a day or two ago about random number generation. Not so much generating numbers as building a simple flexible framework around a generator to improve quality of code and data, this should help streamline your source.
If you read that, you could easily write an extension method, say
public static class IRandomExtensions
{
    public static CodeType GetCode (this IRandom random)
    {
        // 1. get as many random bytes as required
        // 2. transform bytes into a 'Code'
        // 3. bob's your uncle
        ...
    }
}
// elsewhere in code
...
OrmObject ormObject = new OrmObject ();
ormObject.Code = random.GetCode ();
...
To actually generate a value, I would suggest implementing an IRandom interface with a System.Security.Cryptography.RNGCryptoServiceProvider implementation. Said implementation would generate a buffer of X random bytes, and dole out as many as required, regenerating a stream when exhausted.
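A minimal sketch of that buffered generator, assuming a hypothetical IRandom interface shaped as described (the interface and buffer size are illustrative):

// Requires: using System.Security.Cryptography;
public interface IRandom
{
    byte[] GetBytes (int count);
}

public class CryptoRandom : IRandom
{
    private readonly RNGCryptoServiceProvider _rng = new RNGCryptoServiceProvider ();
    private readonly byte[] _buffer = new byte[1024];
    private int _offset = int.MaxValue; // forces a refill on first use

    public byte[] GetBytes (int count)
    {
        var result = new byte[count];
        for (int i = 0; i < count; i++)
        {
            if (_offset >= _buffer.Length)
            {
                _rng.GetBytes (_buffer); // refill the buffer when exhausted
                _offset = 0;
            }
            result[i] = _buffer[_offset++];
        }
        return result;
    }
}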
Furthermore - I don't know why I keep writing, I guess this problem is really quite fascinating! - if CodeType is string and you want something readable, you could just take said random bytes and turn them into a "seemingly" readable string via Base64 conversion
public static class IRandomExtensions
{
    // assuming 'CodeType' is in fact a string
    public static string GetCode (this IRandom random)
    {
        // 1. get as many random bytes as required
        byte[] randomBytes; // fill from random
        // 2. transform bytes into a 'Code'
        string randomBase64String =
            System.Convert.ToBase64String (randomBytes).Trim ('=');
        // 3. bob's your uncle
        ...
    }
}
Remember
random != unique.
Your values will repeat. Eventually.
unique
There are a number of questions you need to ask yourself about your problem.
Must all Code values be unique? [if not, you're trying too hard]
What Type is Code? [if any-length string, use a full Guid]
Is this a distributed application? [if not, use a DB value as suggested by #LBushkin above]
If it is a distributed application, can client applications generate and submit instances of these objects? [if so, then you want a globally unique identifier, and again Guids are a sure bet]
I'm sure you have more constraints, but this is an example of the kind of line of inquiry you need to perform when you encounter a problem like your own. From these questions, you will come up with a series of constraints. These constraints will inform your design.
Hope this helps :)
Btw, you will receive better quality solutions if you post more details [ie constraints] about your problem. Again, what Type is Code, are there length constraints? Format constraints? Character constraints?
Arg, last edit, I swear. If you do end up using Guids, you may wish to obfuscate this, or even "compress" their representation by encoding them in base64 - similar to base64 conversion above for random numbers.
public static class GuidExtensions
{
    public static string ToBase64String (this Guid id)
    {
        return System.Convert.
            ToBase64String (id.ToByteArray ()).
            Trim ('=');
    }
}
Unlike truncating, base64 conversion is not a lossy transformation. Of course, the trim above is lossy in the context of full base64 expansion - but = is just padding, extra information introduced by the conversion, and not part of the original Guid data. If you want to go back to a Guid from this base64-converted value, you will have to re-pad your base64 string until its length is a multiple of 4 - don't ask, just look up base64 if you are interested :)
You could generate a Guid using :
Guid.NewGuid().ToString();
It would give you something like :
788E94A0-C492-11DE-BFD4-FCE355D89593
Use an Autonumber column or Sequencer from your database to generate a unique code number. Almost all modern databases support automatically generated numbers in one form or another. Look into what your database supports.
Autonumber/Sequencer values from the DB are guaranteed to be unique and are relatively inexpensive to acquire. If you want to avoid completely sequential numbers assigned to codes, you can pad and concatenate several sequencer values together.
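A small sketch of the padding and concatenation idea (the sequence values are assumed to come from whatever autonumber/sequence mechanism your database provides; the method name is illustrative):

// Formats two database-generated sequence values as a fixed-width code.
static string BuildCode(long seqA, long seqB)
{
    // e.g. 00000042-00001337
    return string.Format("{0:D8}-{1:D8}", seqA, seqB);
}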