I've got a huge (>>10m) list of entries. Each entry offers two hash functions:
Cheap: quickly computes hash, but its distribution is terrible (may put 99% of items in 1% of hash space)
Expensive: takes a lot of time to compute, but its distribution is much better
An ordinary Dictionary lets me use only one of these hash functions. I'd like a Dictionary that uses the cheap hash function first, and checks the expensive one on collisions.
It seems like a good idea to use a dictionary inside a dictionary for this. I currently basically use this monstrosity:
Dictionary<int, Dictionary<int, List<Foo>>>;
I improved this design so the expensive hash gets called only if there are actually two items with the same cheap hash.
It fits perfectly and does a flawless job for me, but it looks like something that should have died 65 million years ago.
To my knowledge, this functionality is not included in the basic framework. I am about to write a DoubleHashedDictionary class, but I wanted to hear your opinions first.
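Roughly the shape I have in mind (just a sketch; the hash functions are passed in as delegates, and a final equality check on the returned candidates is still up to the caller):

using System;
using System.Collections.Generic;

// Sketch of the nested-dictionary idea behind a "DoubleHashedDictionary".
// The cheap hash picks the outer bucket; the expensive hash picks the inner
// bucket; the final List resolves full collisions. For brevity this sketch
// computes the expensive hash on every Add/Find, whereas my improved version
// defers it until a cheap-hash collision actually occurs.
public class DoubleHashedDictionary<T>
{
    private readonly Func<T, int> _cheapHash;
    private readonly Func<T, int> _expensiveHash;
    private readonly Dictionary<int, Dictionary<int, List<T>>> _buckets =
        new Dictionary<int, Dictionary<int, List<T>>>();

    public DoubleHashedDictionary(Func<T, int> cheapHash, Func<T, int> expensiveHash)
    {
        _cheapHash = cheapHash;
        _expensiveHash = expensiveHash;
    }

    public void Add(T item)
    {
        int outer = _cheapHash(item);
        Dictionary<int, List<T>> inner;
        if (!_buckets.TryGetValue(outer, out inner))
            _buckets[outer] = inner = new Dictionary<int, List<T>>();

        int innerKey = _expensiveHash(item);
        List<T> slot;
        if (!inner.TryGetValue(innerKey, out slot))
            inner[innerKey] = slot = new List<T>();

        slot.Add(item);
    }

    // Returns every item that shares both hashes with the probe;
    // callers still need a real equality check on the candidates.
    public IEnumerable<T> FindCandidates(T probe)
    {
        Dictionary<int, List<T>> inner;
        List<T> slot;
        if (_buckets.TryGetValue(_cheapHash(probe), out inner) &&
            inner.TryGetValue(_expensiveHash(probe), out slot))
            return slot;
        return new List<T>();
    }
}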
As for my specific case:
First hash function = number of files in a file system directory (fast)
Second hash function = sum of size of files (slow)
Edits:
Changed title and added more information.
Added quite important missing detail
In your case, you are technically using a modified hash function (A|B), not true double hashing. However, depending on how huge your "huge" list of entries is and the characteristics of your data, consider the following:
A 20% full hash table with a not-so-good distribution can have more than an 80% chance of collision. This means your expected hashing cost per operation could be roughly (cheap + 0.8 × expensive), plus the cost of the extra lookups. So if your table is more than 20% full, the (A|B) scheme may not be worth it.
Coming up with a perfect hash function is possible, but it is O(n^3), which makes it impractical.
If performance is supremely important, you can make a specifically tuned hash table for your specific data by testing various hash functions on your key data.
First off, I think you're on the right path to implement your own hash table, if what you are describing is truly what you need. But as a critic, I'd like to ask a few questions:
Have you considered using something more unique for each entry?
I am assuming each entry holds information about a file system directory. Have you considered using its full path as the key, perhaps prefixed with the computer name/IP address?
On the other hand, if you're using the number of files as the hash key, are those directories never going to change? Because if the hash key/result changes, you will never be able to find the entry again.
While on this topic, if the directory content/size is never going to change, can you store that value somewhere to save the time it takes to recalculate it?
Just my two cents.
Have you taken a look at the Power Collections or C5 Collections libraries? The Power Collections library hasn't had much action recently, but the C5 stuff seems to be fairly up to date.
I'm not sure if either library has what you need, but they're pretty useful, and they're open source, so they may provide a decent base implementation for you to extend with your desired functionality.
You're basically talking about a hash table of hash tables, each using a different GetHashCode implementation... while it's possible, I think you'd want to consider seriously whether you'll actually get a performance improvement over just doing one or the other...
Will there actually be a substantial number of objects that can be located via the quick-hash mechanism without having to resort to the more expensive one to narrow it down further? Because if you can't locate a significant number purely off the first calculation, you really save nothing by doing it in two steps (not knowing the data, it's hard to predict whether this is the case).
If a significant number can be located in one step, then you'll probably have to do a fair bit of tuning to work out how many records to store at each hash location of the outer table before resorting to an inner "expensive" hash table lookup rather than a simpler structure, but under certain circumstances I can see how you'd get a performance gain from this (the circumstances would be few and far between, but they aren't inconceivable).
Edit
I just saw your amendment to the question - you plan to do both lookups regardless... I doubt you'll get any performance benefit from this that you couldn't get just by configuring the main hash table a bit better. Have you tried using a single dictionary with an appropriate capacity passed in the constructor and perhaps an XOR of the two hash codes as your hash code?
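Something along those lines, as a sketch - the DirectoryEntry type and its two hash methods are made-up stand-ins for your cheap (file count) and expensive (total size) hashes:

using System;
using System.Collections.Generic;

// Hypothetical entry type standing in for the question's directory entries.
public class DirectoryEntry
{
    public string FullPath { get; set; }
    public int CheapHash()     { return 0; /* e.g. number of files   */ }
    public int ExpensiveHash() { return 0; /* e.g. sum of file sizes */ }
}

public class CombinedHashComparer : IEqualityComparer<DirectoryEntry>
{
    public bool Equals(DirectoryEntry x, DirectoryEntry y)
    {
        // Real equality still has to be checked; the hash only buckets items.
        return x.FullPath == y.FullPath;
    }

    public int GetHashCode(DirectoryEntry e)
    {
        // XOR the two hash codes so the expensive hash spreads out the
        // clusters produced by the cheap one.
        return e.CheapHash() ^ e.ExpensiveHash();
    }
}

class Demo
{
    static void Main()
    {
        // Passing an expected capacity up front avoids repeated rehashing.
        var map = new Dictionary<DirectoryEntry, string>(
            10000000, new CombinedHashComparer());
    }
}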
Related
I need to store about 60000 IP-address-like things such that I can quickly determine if the store contains an address or return all the addresses that match a pattern like 3.4.*.* or *.5.*.*. The previous implementation used HashTables nested four levels deep. It's not fully thread safe, and this is causing us bugs. I need to make this thread safe with locking on the outer layer, or I could change all those to ConcurrentDictionaries, but neither of those options seemed quite right. Using a byte for a key in a dictionary never felt quite right to me in general, especially a heavy-weight dictionary. Suggestions?
Guava uses a prefix trie for storing IP lookup matches. You can see the code here:
https://code.google.com/p/google-collections/source/browse/trunk/src/com/google/common/collect/PrefixTrie.java?r=2
This is Java code, but I'm sure you could easily adapt it to C#. The technique of a prefix trie is applicable independent of the language and gets you trailing-wildcard matches for free. If you want arbitrary wildcards, you'll still need to implement that yourself. Alternatively, you could build a data structure similar to a directed acyclic word graph (DAWG). This will let you implement the arbitrary wildcard matches more directly.
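For illustration, here is a minimal C# sketch of the same idea (not a port of the Guava class; it assumes 4-octet IPv4 addresses and only handles trailing wildcards):

using System.Collections.Generic;

// Minimal sketch of a prefix trie over the four octets of an IPv4 address.
// Exact membership and trailing-wildcard queries such as "3.4.*.*" come out
// naturally; arbitrary wildcards like "*.5.*.*" would need a recursive walk.
public class IpPrefixTrie
{
    private class Node
    {
        public readonly Dictionary<byte, Node> Children = new Dictionary<byte, Node>();
        public bool IsTerminal; // true when a full 4-octet address ends here
    }

    private readonly Node _root = new Node();

    public void Add(byte[] address) // expects exactly 4 octets
    {
        var node = _root;
        foreach (byte octet in address)
        {
            Node child;
            if (!node.Children.TryGetValue(octet, out child))
                node.Children[octet] = child = new Node();
            node = child;
        }
        node.IsTerminal = true;
    }

    public bool Contains(byte[] address)
    {
        var node = Walk(address, address.Length);
        return node != null && node.IsTerminal;
    }

    // Trailing-wildcard match: "3.4.*.*" is just the prefix {3, 4}.
    public bool MatchesPrefix(byte[] prefix)
    {
        return Walk(prefix, prefix.Length) != null;
    }

    private Node Walk(byte[] octets, int count)
    {
        var node = _root;
        for (int i = 0; i < count; i++)
        {
            Node child;
            if (!node.Children.TryGetValue(octets[i], out child))
                return null;
            node = child;
        }
        return node;
    }
}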
This question already has answers here:
How to compare 2 files fast using .NET?
(20 answers)
Closed 9 years ago.
I need to synchronize files from directory A to directory B. I check for files in A and then compare them with the files in B one by one. If a file with the same name is found in B, I check whether the files are different by comparing their sizes. If the file sizes are different, I log this and move on to the next file. However, if the file sizes are the same, I need to verify whether the contents of the files differ as well. For this, I thought of creating hashes of both files and comparing them. Is this better, or should I compare the files byte by byte? Please also tell me why you would choose either of the methods.
I am using C# (.NET 4) and need to preserve all files on B while replicating newly added files on A and reporting (and skipping) any duplicates.
Thanks.
EDIT: This job will run nightly and I have the option of storing hashes of the files in directory B only; directory A will be populated dynamically, so I cannot pre-hash those files. Also, which hash algorithms are better for this purpose, as I want to avoid hash collisions as well?
If you need to synchronize files, there's another thing you can compare: the file date - if this differs, the file has most probably been changed.
Also, in most cases the hash (I'd go for MD5 or SHA-1, not CRC, because of its limited value range and therefore rather frequent collisions) will be sufficient. And if those hashes are equal, you should do a byte-by-byte compare. Surely this is an additional step, but it's rarely needed, if at all.
Actually, you should save the hash on B so you don't need to recalculate it every time, but you must make sure that the files on B cannot be changed without updating their hashes.
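As a rough sketch of that flow (the class and method names are just for illustration):

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Sketch of the suggested strategy: compare a stored/precomputed hash first,
// and only fall back to a byte-by-byte comparison when the hashes match.
static class FileComparer
{
    public static byte[] ComputeHash(string path)
    {
        using (var md5 = MD5.Create())          // or SHA1.Create()
        using (var stream = File.OpenRead(path))
            return md5.ComputeHash(stream);
    }

    public static bool ContentsEqual(string pathA, string pathB, byte[] knownHashOfB)
    {
        // Cheap rejection: different hashes mean different contents.
        if (!ComputeHash(pathA).SequenceEqual(knownHashOfB))
            return false;

        // Hashes match: confirm with a byte-by-byte comparison.
        using (var a = File.OpenRead(pathA))
        using (var b = File.OpenRead(pathB))
        {
            if (a.Length != b.Length)
                return false;

            int byteA, byteB;
            do
            {
                byteA = a.ReadByte();
                byteB = b.ReadByte();
                if (byteA != byteB)
                    return false;
            } while (byteA != -1);
        }
        return true;
    }
}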
You already have a hash function here: file --> (filename, filesize). Also, since you can only have one file with a given filename in a directory, you are guaranteed not to have more than one collision for each file per run.
You're asking if you need a better one. Well, I don't know - is performance adequate with the hash function you already have? If it is, you don't need a better hash function.
If you use only a hash code to compare two files, then if the hash codes differ you can be sure that the files are different.
But if the hash codes are the same, then you don't know for sure if the files are really the same.
If you use a 32-bit hash code, then there is roughly a 1 in 2^32 chance that two different files end up with the same hash code. For a 64-bit hash code, the chance is naturally about 1 in 2^64.
Storing the hash codes for all the files on B will make initial comparing much faster, but you then have to decide what to do if two hash codes are the same. Do you take a chance and assume that they are both the same? Or do you go and do a byte-by-byte comparison after you discover two files with the same hash?
Note that if you do a byte-by-byte comparison after you have computed the hash code for a file, you'll end up accessing the file contents twice. This can make using hash codes slower if a significant proportion of the files are the same. As ever, you have to do some timings to see which is faster.
If you can live with the small chance that you'll falsely assume two files to be the same you can avoid the confirming comparison... but I wouldn't like to take that chance myself.
In summary, I would probably just do the comparison each time and not bother with the hashing (other than what you're already doing with comparing the filename and size).
Note that if you find that almost all files that match by filename and size are also identical, then using hashing will almost certainly slow things down.
I'm working on a solution, and one of the features is to check that some files have not been tampered with - in other words, hacked. I was planning on using an MD5 sum with a mixture of created and modified dates, but wanted to see if anybody has done something like this before. I'm using C# at the moment, but you could suggest any other language. I just want to hear about the technique or architecture.
We have an application that checks file validity for safety reasons. The CRC32 checksums are stored in a separate file and used via a simple dictionary lookup. Whether you use CRC32, MD5, or any other hashing/checksumming function is purely a matter of choice: you simply need to know if the file has changed (at least that's what you've said). As each byte of the file is included in the calculation, any change will be picked up, including a simple swapping of bytes.
Don't use file dates: too unreliable and can be easily changed.
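Roughly what that looks like; MD5 is used below only because it ships with the framework (CRC32 itself would need a separate implementation or library), and the baseline file format is an assumption for illustration:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Sketch of the "checksums in a separate file, simple dictionary lookup" idea.
// Any checksum works here, since you only need to detect that a file changed.
static class IntegrityCheck
{
    public static Dictionary<string, string> LoadBaseline(string baselinePath)
    {
        // One "relativePath|hash" entry per line (assumed format).
        return File.ReadAllLines(baselinePath)
                   .Select(line => line.Split('|'))
                   .ToDictionary(parts => parts[0], parts => parts[1]);
    }

    public static bool IsUnchanged(string filePath, string expectedHash)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(filePath))
        {
            string actual = BitConverter.ToString(md5.ComputeHash(stream));
            return string.Equals(actual, expectedHash, StringComparison.OrdinalIgnoreCase);
        }
    }
}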
This question is probably quite different from what you are used to reading here - I hope it can provide a fun challenge.
Essentially I have an algorithm that uses 5 (or more) variables to compute a single value, called the outcome. Now I have to implement this algorithm on an embedded device which has no memory limitations, but has very harsh processing constraints.
Because of this, I would like to run a calculation engine which computes the outcome for, say, 20 different values of each variable and stores this information in a file. You may think of this as a 5 (or more)-dimensional matrix or a 5 (or more)-dimensional array, each dimension being 20 entries long.
In any modern language, filling this array is as simple as having 5 (or more) nested for loops. The tricky part is that I need to dump these values into a file that can then be placed onto the embedded device so that the device can use it as a lookup table.
The questions now are:
1. What format(s) might be acceptable for storing the data?
2. What programs (MATLAB, C#, etc.) might be best suited to compute the data?
3. C# must be used to import the data on the device - is this possible given your answer to #1?
Edit:
Is it possible to read from my lookup table file without reading the entire file into memory? Can you explain how that might be done in C#?
I'll comment on 1 and 3 as well. It may be preferable to use a fixed width output file rather than a CSV. This may take up more or less space than a CSV, depending on the output numbers. However, it tends to work well for lookup tables, as figuring out where to look in a fixed width data file can be done without reading the entire file. This is usually important for a lookup table.
Fixed width data, as with CSV, is trivial to read and write. Some math-oriented languages might offer poor string and binary manipulation functionality, but it should be really easy to convert the data to fixed width during the import step regardless.
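To make the "without reading the entire file" part concrete, here is a sketch; the record width and the 20-values-per-dimension layout are assumptions made up for illustration:

using System;
using System.IO;

// Sketch of reading a single record from a fixed-width lookup file without
// loading the whole file: compute the record's offset and seek straight to it.
static class FixedWidthLookup
{
    const int ValuesPerDimension = 20;
    const int RecordWidth = 16; // characters per stored value, incl. newline

    public static double Lookup(string path, int i0, int i1, int i2, int i3, int i4)
    {
        // Row-major index into the flattened 5-dimensional table.
        long recordIndex = ((((long)i0 * ValuesPerDimension + i1)
                              * ValuesPerDimension + i2)
                              * ValuesPerDimension + i3)
                              * ValuesPerDimension + i4;

        using (var stream = File.OpenRead(path))
        {
            stream.Seek(recordIndex * RecordWidth, SeekOrigin.Begin);
            var buffer = new byte[RecordWidth];
            stream.Read(buffer, 0, RecordWidth);
            return double.Parse(System.Text.Encoding.ASCII.GetString(buffer).Trim());
        }
    }
}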
Number 2 is harder to answer, particularly without knowing what kind of algorithm you are computing. MATLAB and similar programs tend to be great at certain types of computations and often have a lot of stuff built in to make them easier. That said, much of the math functionality built into such languages is available for other languages in the form of libraries.
I'll comment on (1) and (3). All you need to do is dump the data in slices. Pick a traversal and dump data out in that order. Write it out as comma-delimited numbers.
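For example, something like this (the Compute function is a placeholder for your actual algorithm):

using System.IO;

// Sketch of "pick a traversal and dump in that order": five nested loops over
// 20 values per variable, one comma-delimited line per innermost slice.
class TableDumper
{
    const int N = 20;
    static double Compute(int a, int b, int c, int d, int e) { return a + b + c + d + e; }

    static void Main()
    {
        using (var writer = new StreamWriter("lookup.csv"))
            for (int a = 0; a < N; a++)
                for (int b = 0; b < N; b++)
                    for (int c = 0; c < N; c++)
                        for (int d = 0; d < N; d++)
                        {
                            var values = new string[N];
                            for (int e = 0; e < N; e++)
                                values[e] = Compute(a, b, c, d, e).ToString("R");
                            writer.WriteLine(string.Join(",", values));
                        }
    }
}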
How are random numbers generated? How do languages such as Java generate random numbers, and especially how is it done for GUIDs? I found that algorithms like pseudorandom number generators use initial seed values.
But I need to create a random number program in which a number, once it has occurred, never repeats, even if the system is restarted, etc. I thought that I would need to store the values somewhere so that I can check whether a number repeats, but that becomes too complex when the list grows beyond limits.
First: If the number is guaranteed to never repeat, it's not very random.
Second: There are lots of PRNG algorithms.
UPDATE:
Third: There's an IETF RFC for UUIDs (what MS calls GUIDs), but you should recognize that (U|G)UIDs are not cryptographically secure, if that is a concern for you.
UPDATE 2:
If you want to actually use something like this in production code (not just for your own edification) please use a pre-existing library. This is the sort of code that is almost guaranteed to have subtle bugs in it if you've never done it before (or even if you have).
UPDATE 3:
Here are the docs for .NET's Guid.
There are a lot of ways you could generate random numbers. It's usually done with a system/library call which uses a pseudorandom number generator with a seed, as you've already described.
But, there are other ways of getting random numbers which involve specialized hardware to get TRUE random numbers. I know of some poker sites that use this kind of hardware. It's very interesting to read how they do it.
Most random number generators have a way to "randomly" reïnitialize the seed value. (Sometimes called randomize).
If that's not possible, you can also use the system clock to initialize the seed.
You could use this code sample:
http://xkcd.com/221/
Or, you can use this book:
http://www.amazon.com/Million-Random-Digits-Normal-Deviates/dp/0833030477
But seriously, don't implement it yourself, use an existing library. You can't be the first person to do this.
Specifically regarding Java:
java.util.Random uses a linear congruential generator, which is not very good
java.util.UUID#randomUUID() uses java.security.SecureRandom, an interface for a variety of cryptographically secure RNGs - the default is based on SHA-1, I believe.
UUIDs/GUIDs are not necessarily random
It's easy to find implementations of RNGs on the net that are much better than java.util.Random, such as the Mersenne Twister or multiply-with-carry
I understand that you are seeking a way to generate random numbers using C#. If so, RNGCryptoServiceProvider is what you are looking for.
[EDIT]
If you generate a fairly long sequence of bytes using RNGCryptoServiceProvider, it is likely to be unique, but there is no guarantee. In theory, true random numbers are not required to be unique. You can flip a coin twice and get heads both times, yet the flips are still random. TRUE RANDOM!
I guess to enforce uniqueness, you just have to roll your own mechanism for keeping a history of previously generated numbers.
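A rough sketch of that idea; persisting the history across restarts (file, database, ...) is left out here, but it would be needed for your "even if the system is restarted" requirement:

using System;
using System.Collections.Generic;
using System.Security.Cryptography;

// Sketch of "crypto RNG plus your own history": generate random 64-bit values
// with RNGCryptoServiceProvider and reject any value that was seen before.
class UniqueRandomSource
{
    private readonly RNGCryptoServiceProvider _rng = new RNGCryptoServiceProvider();
    private readonly HashSet<ulong> _seen = new HashSet<ulong>();

    public ulong Next()
    {
        var buffer = new byte[8];
        while (true)
        {
            _rng.GetBytes(buffer);
            ulong value = BitConverter.ToUInt64(buffer, 0);
            if (_seen.Add(value))   // Add returns false if the value was already present
                return value;
        }
    }
}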