I have a tree structure where each node knows its CRC. What's a reasonable way to compute a CRC for each sub-tree that would give me a good value for the entire sub-tree to that point? In other words, a value to identify if any part of the sub-tree was changed.
My current thought is to simply take each child node's CRC, convert it to a string/byte[], concatenate them all together, and take the CRC of that byte[]. But I'm not sure whether this might lead to easy collisions, as I suspect it discards quite a bit of information.
(I looked at crc32_combine but it seems inappropriate because I don't have any lengths. I could use a length of zero, but would that be any better or worse?)
Working in C# but I guess this is really language agnostic.
EDIT: Ended up going with this technique. Will switch to longer hashes if collisions seem to be a problem. While I don't need leaf order to be important, I am not using XOR in case it matters later on.
Ideally you would combine the CRCs of the nodes to compute the CRC of a sub-tree, using something like crc32_combine(). The result would be the same as computing the CRC over all the nodes in whatever canonical ordering you have defined. This would only check the ordering though, not the structure of the tree. A different structure with the same ordering would give the same CRC. This will be true no matter how you combine the CRCs, unless you include additional information on the tree structure.
The use of crc32_combine() requires the length of the data for each of the CRCs being combined (except the first). This is apparently not saved and not available in this case. You can instead make a stream of bytes of the CRCs in the canonical order and take the CRC of that stream. (You will need to decide if the CRCs are to be stored big or little endian in the byte stream, and then stick to your convention.)
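A minimal sketch of that byte-stream approach, assuming the standard reflected CRC-32 (polynomial 0xEDB88320, as used by zlib) and a little-endian convention for the child CRCs. The names SubtreeCrc and Combine are made up for illustration and are not part of any library:

```csharp
using System;
using System.Collections.Generic;

static class SubtreeCrc
{
    // Lookup table for the reflected CRC-32 polynomial 0xEDB88320 (assumed variant).
    private static readonly uint[] Table = BuildTable();

    private static uint[] BuildTable()
    {
        var table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint c = i;
            for (int bit = 0; bit < 8; bit++)
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            table[i] = c;
        }
        return table;
    }

    // Plain table-driven CRC-32 over a byte array.
    public static uint Crc32(byte[] data)
    {
        uint crc = 0xFFFFFFFFu;
        foreach (byte b in data)
            crc = Table[(crc ^ b) & 0xFF] ^ (crc >> 8);
        return crc ^ 0xFFFFFFFFu;
    }

    // Writes each CRC as 4 little-endian bytes (an arbitrary but fixed convention),
    // in the caller's canonical order, and takes the CRC of the resulting stream.
    public static uint Combine(IReadOnlyList<uint> crcsInCanonicalOrder)
    {
        var buffer = new byte[4 * crcsInCanonicalOrder.Count];
        for (int i = 0; i < crcsInCanonicalOrder.Count; i++)
        {
            uint v = crcsInCanonicalOrder[i];
            buffer[4 * i]     = (byte)(v & 0xFF);
            buffer[4 * i + 1] = (byte)((v >> 8) & 0xFF);
            buffer[4 * i + 2] = (byte)((v >> 16) & 0xFF);
            buffer[4 * i + 3] = (byte)((v >> 24) & 0xFF);
        }
        return Crc32(buffer);
    }
}
```

Calling Combine on the CRCs of a node's children, bottom-up, gives one value per sub-tree. Unlike XOR, the result depends on the order in which the CRCs are supplied.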
The use of cryptographic signatures such as SHA1 or MD5 is unnecessary, unless you are worried for some reason that a devious human is interfering with your computed check values and trying to trick you into thinking that contents of the tree have not changed when they have. (The devious human can already do this at the level of the nodes anyway, since CRCs are easily spoofed.) Otherwise such signatures are just a waste of CPU time. If you simply want a longer hash, more than 32 bits, to reduce the probability of collisions, then you can use a fast hash function such as one from the CityHash family.
I'd probably use at least SHA1 for your checksums, since collisions aren't that infrequent for MD5. Your idea about combining the CRCs seems solid, though personally I'd XOR the hashes together to save on RAM and CPU cycles.
You should use something designed for this such as SHA-2. You may be able to get away with CRC32 depending on your particular requirements. There is a similar question posted here with more discussion:
Can CRC32 be used as a hash function?
From a list of integers in C#, I need to generate a list of unique values. I thought of MD5 or similar, but they generate too many bytes.
Integer size is 2 bytes.
I want to get a one way correspondence, for example
0 -> ARY812Q3
1 -> S6321Q66
2 -> 13TZ79K2
So, given the hash, the user cannot recover the integer or infer a sequence from a list of hashes.
For now, I tried using MD5(my number) and keeping the first 8 characters. However, I found the first collision at 51389. Which other alternatives could I use?
As I say, I only need one way. It is not necessary to be able to calculate the integer from the hash. The system uses a dictionary to find them.
UPDATE:
Replying to some suggestions about using GetHashCode(): for an int, GetHashCode returns the same integer. My purpose is to hide the integer from the end user. In this case, the integer is the primary key of a database table. I do not want to give this information to users because they could deduce the number of records in the database or how many records are added each week.
Hashes are not unique, so maybe I need to use encryption like TripleDES, but I wanted to use something fast and simple. Also, TripleDES returns too many bytes.
UPDATE 2:
I was talking about hashes, which was a mistake. In reality, I am trying to obfuscate the integer. I tried doing that with a hash algorithm, which is not a good idea because hashes are not unique.
Update May 2017
Feel free to use (or modify) the library I developed, installable via Nuget with:
Install-Package Kent.Cryptography.Obfuscation
This converts a non-negative id such as 127 to an 8-character string, e.g. xVrAndNb, and back (with some available options to randomize the sequence each time it's generated).
Example Usage
var obfuscator = new Obfuscator();
string maskedID = obfuscator.Obfuscate(15);
Full documentation at: Github.
Old Answer
I came across this problem way back and I couldn't find what I want in StackOverflow. So I made this obfuscation class and just shared it on github.
Obfuscation.cs - Github
You can use it by:
Obfuscation obfuscation = new Obfuscation();
string maskedValue = obfuscation.Obfuscate(5);
int? value = obfuscation.DeObfuscate(maskedValue);
Perhaps it can be of help to future visitors :)
Encrypt it with Skip32, which produces a 32 bit output. I found this C# implementation but can't vouch for its correctness. Skip32 is a relatively uncommon crypto choice and probably hasn't been analyzed much. Still it should be sufficient for your obfuscation purposes.
The strong choice would be format preserving encryption using AES in FFX mode. But that's pretty complicated and probably overkill for your application.
When encoded with Base32 (case insensitive, alphanumeric) a 32 bit value corresponds to 7 characters. When encoded in hex, it corresponds to 8 characters.
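To illustrate the length arithmetic, here is a rough sketch that writes a 32-bit value as 7 base-32 digits (7 × 5 = 35 bits, enough for 32). It is a fixed-width numeric conversion using the RFC 4648 alphabet, not the RFC's byte-stream encoding, and the Base32Id name is hypothetical:

```csharp
using System;

static class Base32Id
{
    // RFC 4648 Base32 alphabet; letters are matched case-insensitively on decode.
    private const string Alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";

    // Writes the value as exactly 7 base-32 digits, most significant first.
    public static string Encode(uint value)
    {
        var chars = new char[7];
        for (int i = 6; i >= 0; i--)
        {
            chars[i] = Alphabet[(int)(value & 31)];
            value >>= 5;
        }
        return new string(chars);
    }

    // Reverses Encode; rejects unknown characters and values above 32 bits.
    public static uint Decode(string text)
    {
        if (text.Length != 7)
            throw new ArgumentException("Expected 7 characters.", nameof(text));
        ulong value = 0;
        foreach (char c in text.ToUpperInvariant())
        {
            int digit = Alphabet.IndexOf(c);
            if (digit < 0)
                throw new ArgumentException("Invalid character.", nameof(text));
            value = (value << 5) | (uint)digit;
        }
        if (value > uint.MaxValue)
            throw new ArgumentException("Value does not fit in 32 bits.", nameof(text));
        return (uint)value;
    }
}
```

The hex form needs no helper: value.ToString("X8") gives the 8-character version. Either encoding would be applied to the output of the cipher, not to the raw id.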
There is also the non cryptographic alternative of generating a random value, storing it in the database and handling collisions.
XOR the integer, maybe with a random key that is generated per user (stored in the session). While it's not strictly a hash (as it is reversible), the advantages are that you don't need to store it anywhere, and the size will be the same.
For what you want, I'd recommend using GUIDs (or other kind of unique identifier where the probability of collision is either minimal or none) and storing them in the database row, then just never show the ID to the user.
IMHO, it's kind of bad practice to ever show the primary key in the database to the user (much less to let users do any kind of operations on them).
If they need to have raw access to the database for some reason, then just don't use ints as primary keys, and make them guids (but then your requirement loses importance since they can just access the number of records)
Edit
Based on your requirements, if you don't mind that the algorithm is potentially computationally expensive, then you can just generate a random 8-byte string every time a new row is added, and keep generating random strings until you find one that is not already in the database.
This is far from optimal, and -can- be computationally expensive, but given that you use a 16-bit id and the maximum number of rows is 65536, I'd not care too much about it (the chance of an 8-byte random string already being in a list of at most 65536 entries is minimal, so you'll probably succeed on the first or at most the second try, if your pseudo-random generator is good).
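The retry loop could look roughly like the sketch below. The HashSet stands in for a unique column in the database, the 32-character alphabet is an arbitrary choice, and RandomIdAllocator is a made-up name. RandomNumberGenerator.GetInt32 assumes .NET Core 3.0 or later:

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

sealed class RandomIdAllocator
{
    // Arbitrary 32-character alphabet that avoids look-alike characters (assumption).
    private const string Alphabet = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789";

    // Stand-in for a unique, indexed column in the real database.
    private readonly HashSet<string> _used = new HashSet<string>();

    // Generates random 8-character strings until one is not already taken.
    public string Allocate()
    {
        while (true)
        {
            var chars = new char[8];
            for (int i = 0; i < chars.Length; i++)
                chars[i] = Alphabet[RandomNumberGenerator.GetInt32(Alphabet.Length)];

            string candidate = new string(chars);
            if (_used.Add(candidate))   // Add returns false on a collision, so we retry
                return candidate;
        }
    }
}
```

With a real database, the uniqueness check would be an insert against a unique index, retried on a constraint violation, rather than an in-memory set.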
I need to hash a number (about 22 digits) and the result length must be less than 12 characters. It can be a number or a mix of characters, and must be unique. (The number entered will be unique too).
For example, if the number entered is 000000000000000000001, the result should be something like 2s5As5A62s.
I looked at the usual ones, like MD5, SHA-1, etc., but their results are too long.
The problem with your question is that the input is larger than the output and unique. If you're expecting a unique output as well, it won't happen. The reason is that if you have an input space of, say, 22 numeric digits (10^22 possibilities) and an output space of hexadecimal digits with a length of 11 digits (16^11 possibilities), you end up with more input possibilities than output possibilities.
The graph below shows that you would need an output space of 19 hexadecimal digits and a perfect one-to-one function, otherwise you will have collisions pretty often (more than 50% of the time). I assume this is something you do not want, but you did not specify.
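The counting argument can be checked with a few lines of BigInteger arithmetic. This is only a sketch, and OutputSpace and DigitsNeeded are illustrative names:

```csharp
using System.Numerics;

static class OutputSpace
{
    // Smallest number of digits in the given radix whose combinations
    // cover at least inputCount distinct values.
    public static int DigitsNeeded(BigInteger inputCount, int radix)
    {
        int digits = 0;
        BigInteger capacity = BigInteger.One;
        while (capacity < inputCount)
        {
            capacity *= radix;
            digits++;
        }
        return digits;
    }
}
```

For 10^22 inputs, DigitsNeeded returns 19 for radix 16, 74 for radix 2 and 13 for radix 64, so no one-to-one mapping fits in fewer than 13 base-64 characters.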
Since what you want cannot be done, I would suggest rethinking your design or using a checksum such as the cyclic redundancy check (CRC). CRC-64 will produce a 64 bit output and when encoded with any base64 algorithm, will give you something along the lines of what you want. This does not provide cryptographic strength like SHA-1, so it should never be used in anything related to information security.
However, if you were able to change your criteria to allow for long hash outputs, then I would strongly suggest you look at SHA-512, as it will provide high quality outputs with an extremely low chance of duplication. By a low chance I mean that no two inputs have yet been found to produce the same hash in the history of the algorithm.
If both of these suggestions still are not great for you, then your last alternative is probably just going with only base64 on the input data. It will essentially utilize the standard English alphabet in the best way possible to represent your data, thus reducing the number of characters as much as possible while retaining a complete representation of the input data. This is not a hash function, but simply a method for encoding binary data.
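As a sketch of that last option, the number can be treated as an integer and rewritten in radix 64, which needs 13 characters for any 22-digit input. This is a plain radix conversion with a URL-safe alphabet rather than standard Base64 over bytes, and CompactNumber is a hypothetical helper:

```csharp
using System;
using System.Numerics;

static class CompactNumber
{
    // URL-safe 64-character alphabet (an assumption; any 64 distinct characters work).
    private const string Alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";
    private const int Width = 13;   // 64^13 > 10^22, while 64^12 is too small

    // Converts up to 22 decimal digits into exactly 13 radix-64 characters.
    public static string Encode(string decimalDigits)
    {
        BigInteger value = BigInteger.Parse(decimalDigits);
        if (value.Sign < 0 || value >= BigInteger.Pow(10, 22))
            throw new ArgumentOutOfRangeException(nameof(decimalDigits));

        var chars = new char[Width];
        for (int i = Width - 1; i >= 0; i--)
        {
            value = BigInteger.DivRem(value, 64, out BigInteger remainder);
            chars[i] = Alphabet[(int)remainder];
        }
        return new string(chars);
    }

    // Reverses Encode and restores the leading zeros of a 22-digit number.
    public static string Decode(string text)
    {
        BigInteger value = BigInteger.Zero;
        foreach (char c in text)
        {
            int digit = Alphabet.IndexOf(c);
            if (digit < 0)
                throw new FormatException("Invalid character.");
            value = value * 64 + digit;
        }
        return value.ToString("D22");
    }
}
```

The mapping is reversible and collision-free, but it is an encoding, not a hash: consecutive inputs give visibly consecutive outputs.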
Why not take MD5 or SHA-N, then convert the result to Base64 (or base-whatever) and keep only 12 characters of it?
NB: In all cases the hash will NEVER be guaranteed unique (but can offer a low collision probability).
You can't use a hash if it has to be unique.
You need about 74 bits to store such a number. If you convert it to base-64 it will be about 13 characters (12 characters hold only 72 bits).
Can you elaborate on what your requirement is for the hashing? Do you need to make sure the result is diverse? (i.e. not 1 = a, 2 = b)
Just thinking out loud, and a little bit laterally, but could you not apply the principles of run-length encoding to your number, treating it as data you want to compress? You could then use the base64 version of the compressed result.
I'm working on an application that needs to pass around large sets of Int32 values. The sets are expected to contain ~1,000,000-50,000,000 items, where each item is a database key in the range 0-50,000,000. I expect distribution of ids in any given set to be effectively random over this range. The operations I need on the set are dirt simple:
Add a new value
Iterate over all of the values.
There is a serious concern about the memory usage of these sets, so I'm looking for a data structure that can store the ids more efficiently than a simple List<int> or HashSet<int>. I've looked at BitArray, but that can be wasteful depending on how sparse the ids are. I've also considered a bitwise trie, but I'm unsure how to calculate the space efficiency of that solution for the expected data. A Bloom filter would be great, if only I could tolerate the false positives.
I would appreciate any suggestions of data structures suitable for this purpose. I'm interested in both out-of-the-box and custom solutions.
EDIT: To answer your questions:
No, the items don't need to be sorted
By "pass around" I mean both pass between methods and serialize and send over the wire. I clearly should have mentioned this.
There could be a decent number of these sets in memory at once (~100).
Use the BitArray. It uses only some 6MB of memory; the only real problem is that iteration is Theta(N), i.e. you have to walk the entire range. Locality of reference is good though and you can allocate the entire structure in one operation.
As for wasting space: you waste 6MB in the worst case.
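A minimal sketch of the BitArray approach, covering the two operations the question asks for. IdBitSet is an illustrative wrapper, and the default range of 0 to 50,000,000 is taken from the question:

```csharp
using System.Collections;
using System.Collections.Generic;

sealed class IdBitSet
{
    private readonly BitArray _bits;

    // One bit per possible id; about 6 MB for the default range.
    public IdBitSet(int maxIdInclusive = 50_000_000)
    {
        _bits = new BitArray(maxIdInclusive + 1);
    }

    // Adding is a single bit write; adding the same id twice has no effect.
    public void Add(int id) => _bits[id] = true;

    // Iteration walks the whole range, so it costs time proportional to the
    // range size rather than to the number of stored ids.
    public IEnumerable<int> Items()
    {
        for (int id = 0; id < _bits.Length; id++)
            if (_bits[id])
                yield return id;
    }
}
```

Items come back in ascending order as a side effect, which is convenient if ranges are sent over the wire.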
EDIT: ok, you've lots of sets and you're serializing. For serializing on disk, I suggest 6MB files :)
For sending over the wire, just iterate and consider sending ranges instead of individual elements. That does require a sorted structure.
You need lots of these sets. Consider if you have 600MB to spare. Otherwise, check out:
Bytewise tries: O(1) insert, O(n) iteration, much lower constant factors than bitwise tries
A custom hash table, perhaps Google sparsehash through C++/CLI
BSTs storing ranges/intervals
Supernode BSTs
It would depend on the distribution of the sizes of your sets. Unless you expect most of the sets to be (close to) the minimum you've specified, I'd probably use a bitset. To cover a range up to 50,000,000, a bitset ends up ~6 megabytes.
Compared to storing the numbers directly, this is marginally larger for the minimum size set you've specified (~6 megabytes instead of ~4), but considerably smaller for the maximum size set (1/32nd the size).
The second possibility would be to use a delta encoding: keep the numbers sorted and, instead of storing each number directly, store the difference between that number and the previous number that was included. Given a maximum magnitude of 50,000,000 and a minimum size of 1,000,000 items, the average difference between one number and the next is ~50. This means you can theoretically store the difference in <6 bits on average. I'd probably use the 7 least significant bits directly, and if you need to encode a larger gap, set the msb and (for example) store the size of the gap in the lower 7 bits plus the next three bytes. That can't happen very often, so in most cases you're using only one byte per number, for about 4:1 compression compared to storing numbers directly. In the best case this would use ~1 megabyte for a set, and in the worst about 50 megabytes.
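The byte layout described above might be sketched as follows: a gap below 128 takes one byte, and a larger gap takes four bytes with the top bit of the first byte set. DeltaCodec is a made-up name, and the input is assumed to be sorted ascending and non-negative:

```csharp
using System;
using System.Collections.Generic;

static class DeltaCodec
{
    // Encodes ascending, non-negative ids as gaps from the previous id.
    public static byte[] Encode(IEnumerable<int> sortedIds)
    {
        var output = new List<byte>();
        int previous = 0;
        foreach (int id in sortedIds)
        {
            int gap = id - previous;
            if (gap < 0)
                throw new ArgumentException("Ids must be sorted ascending.");

            if (gap < 128)
            {
                output.Add((byte)gap);                    // msb clear: one-byte gap
            }
            else
            {
                output.Add((byte)(0x80 | (gap >> 24)));   // msb set + top 7 bits
                output.Add((byte)((gap >> 16) & 0xFF));
                output.Add((byte)((gap >> 8) & 0xFF));
                output.Add((byte)(gap & 0xFF));
            }
            previous = id;
        }
        return output.ToArray();
    }

    // Rebuilds the ids by accumulating the stored gaps.
    public static IEnumerable<int> Decode(byte[] data)
    {
        int previous = 0;
        int i = 0;
        while (i < data.Length)
        {
            int gap;
            if ((data[i] & 0x80) == 0)
            {
                gap = data[i];
                i += 1;
            }
            else
            {
                gap = ((data[i] & 0x7F) << 24) | (data[i + 1] << 16)
                    | (data[i + 2] << 8) | data[i + 3];
                i += 4;
            }
            previous += gap;
            yield return previous;
        }
    }
}
```

Because each value depends on the ones before it, this form suits iteration and serialization; adding an id in the middle means re-encoding from that point.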
If you don't mind a little bit of extra code, you could use an adaptive scheme -- delta encoding for small sets (up to 6,000,000 numbers), and a bitmap for larger sets.
I think the answer depends on what you mean by "passing around" and what you're trying to accomplish. You say you are only adding to the list: how often do you add? How fast will the list grow? What is an acceptable overhead for memory use, versus the time to reallocate memory?
In your worst case, 50,000,000 32-bit numbers = 200 megabytes using the most efficient possible data storage mechanism. Assuming you may end up with this much use in your worst case scenario, is it OK to use this much memory all the time? Is that better than having to reallocate memory frequently? What's the distribution of typical usage patterns? You could always just use an int[] that's pre-allocated to the whole 50 million.
As far as access speed for your operations, nothing is faster than iterating and adding to a pre-allocated chunk of memory.
From OP edit: There could be a decent number of these sets in memory at once (~100).
Hey now. You need to store 100 sets of 1 to 50 million numbers in memory at once? I think the bitset method is the only possible way this could work.
That would be 600 megabytes. Not insignificant, but unless they are (typically) mostly empty, it seems very unlikely that you would find a more efficient storage mechanism.
Now, if you don't use bitsets, but rather use dynamically sized constructs, and they could somehow take up less space to begin with, you're talking about a real ugly memory allocation/deallocation/garbage collection scenario.
Let's assume you really need to do this, though I can only imagine why. So your server's got a ton of memory, just allocate as many of these 6 megabyte bitsets as you need and recycle them. Allocation and garbage collection are no longer a problem. Yeah, you're using a ton of memory, but that seems inevitable.
I am retrieving lists of CRC32 hashes of file names, not their contents.
I need to be able to decrypt these hashes back into names like "vacationplans_2010.txt",
which are less than 25 characters long.
Is this possible?
It is a one-way hash function. It can't be decrypted.
Despite what other users answered, CRC32 is not a cryptographic hash function; it is meant for integrity checks (data checksums).
Cryptographic hash functions are often described as "one-way hash functions", CRC32 lacks the "one-way" part.
That being said, you should consider the following: since the set of all possible 25-characters-or-less filenames is more than 2^32, some file names are bound to have the same hash value. Therefore, it might be that for some of the CRC32 values you get - there will be several possible sources (file-names). You will need a way to determine the "real" source (i assume that human-decision would be the best choice, since our brain is a great pattern-recognition device, but it really depends on your scenario).
Several methods can be used to partially achieve what you are asking for. Brute-force is one of them (although, with 25-character file names, brute-force may take a while). A modified dictionary attack is another option. Other options are based on analysis of the CRC32 algorithm, and will require that you dive into the implementation details of the algorithm (otherwise you'll have a hard time understanding what you're implementing). For example, see this article, or this article.
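A dictionary-style search could be sketched like this. It assumes the hashes are standard CRC-32 values computed over ASCII file names, which the question does not confirm, and the word_year.ext pattern and all names here are illustrative guesses:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class Crc32NameSearch
{
    // Bitwise CRC-32 with the reflected polynomial 0xEDB88320 (assumed variant).
    public static uint Crc32(byte[] data)
    {
        uint crc = 0xFFFFFFFFu;
        foreach (byte b in data)
        {
            crc ^= b;
            for (int bit = 0; bit < 8; bit++)
                crc = (crc & 1) != 0 ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
        }
        return ~crc;
    }

    // Builds candidate names of the form word_year.ext (a hypothetical pattern).
    public static IEnumerable<string> Candidates(
        IEnumerable<string> words, IEnumerable<int> years, IEnumerable<string> extensions)
    {
        foreach (string word in words)
            foreach (int year in years)
                foreach (string ext in extensions)
                    yield return $"{word}_{year}{ext}";
    }

    // Returns every candidate whose CRC matches; more than one match is possible.
    public static List<string> FindMatches(uint targetCrc, IEnumerable<string> candidates)
    {
        var matches = new List<string>();
        foreach (string name in candidates)
            if (Crc32(Encoding.ASCII.GetBytes(name)) == targetCrc)
                matches.Add(name);
        return matches;
    }
}
```

Any match is only a plausible source: several different names can share one CRC, so the results still need to be judged by a person or by context.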
EDIT: definitions by Bruce Schneier (author of Applied Cryptography, among other things):
One-way functions are relatively easy to compute, but significantly harder to reverse. … In this context, "hard" is defined as something like: It would take millions of years to compute x from f(x), even if all the computers in the world were assigned to the problem.
A hash function is a function, mathematical or otherwise, that takes a variable-length input string (called a pre-image) and converts it to a fixed-length (generally smaller) output string (called a hash value).
The security of a one-way hash function is its one-wayness.
A hash function like CRC32 calculates a simple value given (variable) input. The calculation is not reversible - i.e. you cannot reliably get the original value given only the hash.
Yep, the general method is to work out the rule by which your hashed result comes out the same as
I need to perform a bitwise equality between two bytes. That means, for instance, that if I have the two bytes 00011011 and 00011110, the result is 11111010.
The only fast way I see is to use the following statement:
byte a, b; // set input bytes
byte c = (byte)~(a ^ b); // output byte; the cast is needed because C# promotes byte operands to int
But I wonder if there is a faster solution for this. After these equality operation I want to mask the bits I need. So I need to use an AND-operation. So the code becomes:
byte a, b; // set input bytes
byte m; // mask: interesting bits are set to 1, others to 0
byte c = (byte)(~(a ^ b) & m); // output byte; cast back from the int that the operators produce
Aren't there any faster and simpler methods that don't need all those bitwise operations? This part of the code will be called very often.
I doubt it can be done in fewer operations. That looks optimal. Perhaps you could store ~(a^b) in a lookup table (256*256 entries)? I doubt you would get much benefit and possibly even make things worse, but you could try it.
You are looking in the wrong place for this optimization; you won't end up finding any better bitwise operation here. Even if you did, it's hardly going to speed anything up. The real win will come from processing more than just a byte at a time. The processor is already having to do a bunch of bit shifting and masking operations just so that it can pretend for you that you are working with bytes. Process your arrays of bytes 1 word at a time, or use vector instructions if they are available.
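A rough sketch of the word-at-a-time idea: the byte buffers are reinterpreted as 64-bit words so each loop iteration handles eight bytes, with a byte-wise loop for the leftover tail. BitwiseEquality and MaskedXnor are made-up names, and Span with MemoryMarshal.Cast assumes .NET Core 2.1 or later:

```csharp
using System;
using System.Runtime.InteropServices;

static class BitwiseEquality
{
    // result[i] = ~(a[i] ^ b[i]) & mask[i] for every byte, computed 8 bytes at a time.
    public static void MaskedXnor(
        ReadOnlySpan<byte> a, ReadOnlySpan<byte> b, ReadOnlySpan<byte> mask, Span<byte> result)
    {
        if (a.Length != b.Length || a.Length != mask.Length || a.Length != result.Length)
            throw new ArgumentException("All buffers must have the same length.");

        // Reinterpret the buffers as 64-bit words; any tail shorter than 8 bytes is left out.
        ReadOnlySpan<ulong> wa = MemoryMarshal.Cast<byte, ulong>(a);
        ReadOnlySpan<ulong> wb = MemoryMarshal.Cast<byte, ulong>(b);
        ReadOnlySpan<ulong> wm = MemoryMarshal.Cast<byte, ulong>(mask);
        Span<ulong> wr = MemoryMarshal.Cast<byte, ulong>(result);

        for (int i = 0; i < wr.Length; i++)
            wr[i] = ~(wa[i] ^ wb[i]) & wm[i];

        // Handle the remaining bytes one at a time.
        for (int i = wr.Length * sizeof(ulong); i < result.Length; i++)
            result[i] = (byte)(~(a[i] ^ b[i]) & mask[i]);
    }
}
```

Since the operations act on each bit independently, the result does not depend on byte order. Whether this is actually faster than the per-byte loop is something to measure with a profiler rather than assume.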
These operations seem fast enough to be honest. I think you shouldn't try to optimize them further, but finish your software first, see if you are happy with the overall performance and use a profiler if you are not. I am fairly sure the problem will be elsewhere.
What you want is an XNOR operation. Unfortunately this is not supported by C#/Mono. I think your solution is optimal.