This question already has answers here:
How to compare 2 files fast using .NET?
(20 answers)
Closed 9 years ago.
I need to synchronize files from directory A to directory B. I check for files in A and then compare them with the files in B one by one. If a file with the same name as in A is found in B, I check whether the files differ by comparing their sizes. If the file sizes are different, I log this and move on to the next file. However, if the file sizes are the same, I need to check whether the contents of the files differ as well. For this, I thought of creating hashes of both files and comparing them. Is this better, or should I compare the files byte by byte? Please also explain why you would choose either method.
I am using C# (.NET 4) and need to preserve all files on B while replicating newly added files from A and reporting (and skipping) any duplicates.
Thanks.
EDIT: This job will run nightly, and I have the option of storing hashes of the files in directory B only; directory A will be populated dynamically, so I cannot pre-hash those files. Also, which hash algorithms are better suited for this purpose, since I want to avoid hash collisions as well?
If you need to synchronize files, there's another thing you can compare: the file date. If it differs, the file has most probably been changed.
Also, in most cases a hash (I'd go for MD5 or SHA-1, not CRC, because of its limited value range and therefore rather frequent collisions) will be sufficient. And if those hashes are equal, you should do a byte-by-byte compare. Sure, this is an additional step, but it's rarely needed, if at all.
Actually, you should save the hashes on B so you don't need to recalculate them every time, but you must make sure that the files on B cannot be changed without their hashes being updated.
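As a rough sketch of what computing such a hash could look like in C# (the class and method names here are just placeholders; MD5 is used because it ships with .NET):

using System;
using System.IO;
using System.Security.Cryptography;

static class FileHasher
{
    // Streams the file's contents through MD5 and returns the hash as a hex string.
    // The result could be stored next to the files on B and compared on the next nightly run.
    public static string ComputeMd5(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            byte[] hash = md5.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }
}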
You already have a hash-function here. Your hash function is file-->(filename, filesize). Also, since you can only have one file with a given filename in a directory, you are guaranteed not to have more than one collision for each file per run.
You're asking if you need a better one. Well, I don't know, is performance adequate with the hash function you already have? If it's adequate for you, you don't need a better hash function.
If you use only a hash code to compare two files, then if the hash codes differ you can be sure that the files are different.
But if the hash codes are the same, then you don't know for sure if the files are really the same.
If you use a 32-bit hash code, then there is roughly a 1 in 2^32 chance that two different files end up with the same hash code. For a 64-bit hash code, the chance is naturally about 1 in 2^64.
Storing the hash codes for all the files on B will make initial comparing much faster, but you then have to decide what to do if two hash codes are the same. Do you take a chance and assume that they are both the same? Or do you go and do a byte-by-byte comparison after you discover two files with the same hash?
Note that if you do a byte-by-byte comparison after you have computed the hash code for a file, you'll end up accessing the file contents twice. This can make using hash codes slower if a significant proportion of the files are the same. As ever, you have to do some timings to see which is faster.
If you can live with the small chance that you'll falsely assume two files to be the same you can avoid the confirming comparison... but I wouldn't like to take that chance myself.
In summary, I would probably just do the comparison each time and not bother with the hashing (other than what you're already doing with comparing the filename and size).
Note that if you find that almost all files that match by filename and size are also identical, then using hashing will almost certainly slow things down.
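If you do go byte by byte, a buffered comparison is straightforward; here is a minimal sketch (buffer size and names are arbitrary):

using System;
using System.IO;

static class FileComparer
{
    // Returns true if both files have identical contents.
    // Reads both files in fixed-size chunks so memory use stays constant.
    public static bool ContentsEqual(string pathA, string pathB)
    {
        const int bufferSize = 64 * 1024;
        if (new FileInfo(pathA).Length != new FileInfo(pathB).Length)
            return false;                       // different sizes: cannot be equal

        using (var a = File.OpenRead(pathA))
        using (var b = File.OpenRead(pathB))
        {
            var bufA = new byte[bufferSize];
            var bufB = new byte[bufferSize];
            int readA;
            while ((readA = a.Read(bufA, 0, bufferSize)) > 0)
            {
                int readB = ReadFully(b, bufB, readA);  // read the same number of bytes from B
                if (readB != readA)
                    return false;
                for (int i = 0; i < readA; i++)
                    if (bufA[i] != bufB[i])
                        return false;
            }
            return true;
        }
    }

    // Stream.Read may return fewer bytes than requested, so loop until we have 'count' or hit EOF.
    private static int ReadFully(Stream s, byte[] buffer, int count)
    {
        int total = 0;
        while (total < count)
        {
            int n = s.Read(buffer, total, count - total);
            if (n == 0) break;
            total += n;
        }
        return total;
    }
}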
I am working on transferring files over the network. There is zero tolerance for data loss during the transfers. I've been asked to compute the SHA256 values for the original and the copied file to verify that the contents are the same. So far I have made comparisons based on copying and pasting the file and letting Windows rename the file with "-copy" appended to the filename. I have also tried renaming the file after that rename, as well as removing the file extension. So far they all produce the same hash. I've also tried altering file attributes in code (just lastWrittenTime and fileCreationTime), and this does not seem to have an effect on the hash.
Checksum result of copying and pasting a file (Explorer appends "-copy" to the name):
E7273D248F191A0F914837A21BE39D229D790CA242D38651BAA06DAC9EBB63F7
E7273D248F191A0F914837A21BE39D229D790CA242D38651BAA06DAC9EBB63F7
Checksum result of renaming the "-copy" file in Explorer:
E7273D248F191A0F914837A21BE39D229D790CA242D38651BAA06DAC9EBB63F7
E7273D248F191A0F914837A21BE39D229D790CA242D38651BAA06DAC9EBB63F7
Checksum result of changing file extension:
E7273D248F191A0F914837A21BE39D229D790CA242D38651BAA06DAC9EBB63F7
E7273D248F191A0F914837A21BE39D229D790CA242D38651BAA06DAC9EBB63F7
What part/s of the file are used when the hash is created?
OK, zero tolerance was a bit much; if the hash doesn't match, the file will have to be resent.
The entire binary file contents are streamed through the hashing algorithm. File metadata (such as name, date etc) doesn't play a part.
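A minimal sketch of that streaming in C# (paths are placeholders; this is just the standard SHA256 class applied to a FileStream):

using System;
using System.IO;
using System.Security.Cryptography;

static class TransferVerifier
{
    // Only the file's bytes go into the hash; the name, dates and attributes never enter
    // the calculation, which is why renaming or touching the file leaves the checksum unchanged.
    public static string Sha256Hex(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
        {
            return BitConverter.ToString(sha.ComputeHash(stream)).Replace("-", "");
        }
    }
}

// Usage after a transfer: compare the two hex strings for the source and the copy.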
First, a general recommendation: don't do this. Use rsync or something similar to do bulk file transfers. Rsync has years of optimisations and debugging behind it, has countless options to control how (and whether) the copying happens, and is available on Windows. Don't waste time building something that has already been built.
But if you must…
Hashing algorithms generally care about bytes, not files. When applying SHA256 to a file, you are simply reading the bytes and passing them through the algorithm.
If you want to hash paths, permissions, etc., you should do this at the directory level, because these things constitute the "contents" of a directory. There is no standard byte-level representation of directories, so you'll have to make one up yourself. Something that looks like a directory listing in sorted order usually suffices. Make sure that each entry contains the hash of the corresponding thing, be it a file or another directory. This way, the hash of the directory uniquely specifies not only the name and attributes of each child, but, recursively, the entire contents of each subdirectory.
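For illustration, a hedged sketch of one such made-up directory hash in C# (the "F"/"D" listing format is invented for this example; it is not a standard):

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

static class DirectoryHasher
{
    // Hash of a file: just its bytes.
    static byte[] HashFile(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
            return sha.ComputeHash(stream);
    }

    // Hash of a directory: a sorted "listing" where each entry is the child's name
    // plus the hash of that child (file hash or, recursively, directory hash).
    public static byte[] HashDirectory(string path)
    {
        var listing = new StringBuilder();

        foreach (var file in Directory.GetFiles(path).OrderBy(f => f, StringComparer.Ordinal))
            listing.AppendLine("F " + Path.GetFileName(file) + " " +
                               BitConverter.ToString(HashFile(file)));

        foreach (var dir in Directory.GetDirectories(path).OrderBy(d => d, StringComparer.Ordinal))
            listing.AppendLine("D " + Path.GetFileName(dir) + " " +
                               BitConverter.ToString(HashDirectory(dir)));

        using (var sha = SHA256.Create())
            return sha.ComputeHash(Encoding.UTF8.GetBytes(listing.ToString()));
    }
}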
Note: the fact that identical files have the same hash can actually work in your favour, by avoiding transmission of the second file once the system realises that a file with the same hash is already present at the destination. Of course, you would have to code for this explicitly. But also note that doing so can allow super-cheap syncing when files have been moved or copied, since they will have the same hash as before. Only affected directories (from the immediate parent(s) to the root) will have different hash values.
Finally, a minor quibble: there is no such thing as zero tolerance. Forget whether SHA256 collisions will happen in the lifetime of the Universe. A gamma ray can flip the bit that says, "These two files don't match!" Such flips happen exceedingly rarely, but more often than you might think. In a noisy quantum universe, we should avoid talking in absolutes.
In .NET, I need a way to compare two files. I thought of a class, which represents a diff:
public enum DiffEntryState
{
    New,
    Removed,
    Changed
}

public class DiffEntry
{
    public byte[] Bytes;
    public long FileOffset;
    public DiffEntryState State = DiffEntryState.Changed;
}
The names should be pretty self-explanatory. I thought of adding a State to each entry so that I can distinguish between the cases where the first file is larger than the second, or vice versa.
I'm wondering if there is a common and fast way to retrieve the byte-by-byte differences of two files. I would simply create a stream for each file and compare chunks of these streams until one ends. Is there a better way, or does the Framework have a built-in solution? Keep in mind that I need the differences themselves, not only the feedback that there ARE differences.
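There is nothing built in for this; a rough sketch of the straightforward positional scan (reading both files into memory for simplicity, and reusing the DiffEntry class above) might look like this:

using System;
using System.Collections.Generic;
using System.IO;

static class NaiveDiffer
{
    // Naive forward scan: reads both files into memory (fine for a sketch, not for huge files)
    // and records each run of differing bytes as a DiffEntry with its offset.
    public static List<DiffEntry> Diff(string oldPath, string newPath)
    {
        byte[] oldBytes = File.ReadAllBytes(oldPath);
        byte[] newBytes = File.ReadAllBytes(newPath);
        var entries = new List<DiffEntry>();
        long common = Math.Min(oldBytes.Length, newBytes.Length);

        long i = 0;
        while (i < common)
        {
            if (oldBytes[i] == newBytes[i]) { i++; continue; }

            long start = i;                              // start of a differing run
            while (i < common && oldBytes[i] != newBytes[i]) i++;

            var changed = new byte[i - start];
            Array.Copy(newBytes, start, changed, 0, i - start);
            entries.Add(new DiffEntry { FileOffset = start, Bytes = changed, State = DiffEntryState.Changed });
        }

        if (newBytes.Length > common)                    // the new file has extra bytes at the end
        {
            var added = new byte[newBytes.Length - common];
            Array.Copy(newBytes, common, added, 0, added.Length);
            entries.Add(new DiffEntry { FileOffset = common, Bytes = added, State = DiffEntryState.New });
        }
        else if (oldBytes.Length > common)               // the old file was longer: mark the tail as removed
        {
            entries.Add(new DiffEntry { FileOffset = common, Bytes = new byte[0], State = DiffEntryState.Removed });
        }

        return entries;
    }
}

Note that this positional scan only captures in-place changes; as soon as bytes are inserted or removed, everything after that point shows up as "changed".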
//Edit:
After sleeping a night over the problem, I guess I'm taking the wrong approach here. The whole tool is a backup solution, which will be able to save only the changed bytes and thus reduce the overall necessary space for the backup. Instead of saving a compressed 14 MB file each time, only 200k or less will be saved.
But, after thinking about the problem, I realized that it wouldn't be enough to save only the differences byte by byte. Take a piece of text, for example:
"This is a string."
"This was a string."
As a matter of fact, the only change here is "is" to "was". But my approach would assume that the changed content is now "was a string". If this happens at the beginning of a huge file, well, this approach is useless.
Obviously, I need a way to index a file and detect all moved, copied or changed blocks in comparison to the original file.
Phew...
Take a look at Diff.NET, it could be helpful.
For general case binary differencing, look at A Linear Time, Constant Space Differencing Algorithm by Randal C. Burns and Darrell D. E. Long. Also, Randal Burns' master's thesis, Differential Compression: A Generalized Solution For Binary Files, goes into more detail and provides pseudo-code for the algorithm.
You might also get some useful ideas from About Remote Differential Compression and from Optimizing File Replication over Limited-Bandwidth Networks using Remote Differential Compression
For text file differencing, I recommend starting with An O(ND) Difference Algorithm and Its Variations by Eugene W. Myers. This algorithm can be used to diff any two sequences. To compare two text files, generate a sequence of hash codes (e.g., by calling string.GetHashCode()) for the lines of each file. Then run those sequences (e.g., as IList<int>) through Myers' algorithm to find the shortest edit script (i.e., inserts and deletes) that will convert the first sequence into the second.
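Generating those per-line hash sequences is the easy part; a minimal sketch (the Myers step itself is left to an existing library):

using System.Collections.Generic;
using System.IO;

static class LineHasher
{
    // Turns a text file into a sequence of line hash codes.
    // Two identical lines produce the same hash, so the diff algorithm can work on
    // small integers instead of whole strings. The hashes are only meaningful within a single run.
    public static IList<int> HashLines(string path)
    {
        var hashes = new List<int>();
        foreach (string line in File.ReadAllLines(path))
            hashes.Add(line.GetHashCode());
        return hashes;
    }
}

// The two IList<int> sequences for the old and new file are then fed to an
// implementation of Myers' algorithm to get the shortest edit script.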
I hope this helps. I'm the author of Diff.Net, and it uses Burns' algorithm for binary differencing and Myers' algorithm for text differencing. The source code for Diff.Net's libraries (Menees.Diffs and Menees.Diffs.Controls) is available under the Apache License, Version 2.0, and the references above should help you implement your own solution without having to start from scratch.
There is no built-in functionality.
So you have to compare the files byte by byte or use a library that does this for you.
I'm working on a solution, and one of its features is to check that some files have not been tampered with, in other words hacked. I was planning on using the MD5 sum combined with the created and modified dates, but wanted to see if anybody has done something like this before. I'm using C# at the moment, but you could suggest any other language. I just want to hear about the technique or architecture side of it.
We have an application that checks file validity for safety reasons. The CRC32 checksums are stored in a separate file using a simple dictionary lookup. Whether you use CRC32, MD5, or any other hashing/checksum algorithm is purely a matter of choice: you simply need to know if the file has changed (at least that's what you've said). Since each byte of the file is included in the calculation, any change will be picked up, including a simple swapping of bytes.
Don't use file dates: too unreliable and can be easily changed.
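As a rough sketch of that kind of stored-checksum check (MD5 is used here because .NET ships with it; CRC32 would need a hand-rolled or third-party implementation, and the manifest format below is made up for this example):

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

static class IntegrityChecker
{
    static string HashOf(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
            return BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "");
    }

    // Manifest format (one line per file): <path>|<hash>
    public static Dictionary<string, string> LoadManifest(string manifestPath)
    {
        var known = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (string line in File.ReadAllLines(manifestPath))
        {
            string[] parts = line.Split('|');
            known[parts[0]] = parts[1];
        }
        return known;
    }

    // Returns the paths whose current hash no longer matches the stored one.
    public static IEnumerable<string> FindTampered(Dictionary<string, string> known)
    {
        foreach (var entry in known)
        {
            if (!File.Exists(entry.Key) || HashOf(entry.Key) != entry.Value)
                yield return entry.Key;
        }
    }
}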
I've got a huge (>>10m) list of entries. Each entry offers two hash functions:
Cheap: computes the hash quickly, but its distribution is terrible (it may put 99% of items in 1% of the hash space)
Expensive: takes a lot of time to compute, but its distribution is a lot better
An ordinary Dictionary lets me use only one of these hash functions. I'd like a Dictionary that uses the cheap hash function first, and checks the expensive one on collisions.
It seems like a good idea to use a dictionary inside a dictionary for this. I currently basically use this monstrosity:
Dictionary<int, Dictionary<int, List<Foo>>>;
I improved this design so the expensive hash gets called only if there are actually two items of the same cheap hash.
It fits perfectly and does a flawless job for me, but it looks like something that should have died 65 million years ago.
To my knowledge, this functionality is not included in the basic framework. I am about to write a DoubleHashedDictionary class but I wanted to know of your opinion first.
As for my specific case:
First hash function = number of files in a file system directory (fast)
Second hash function = sum of size of files (slow)
Edits:
Changed the title and added more information.
Added a quite important missing detail.
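For concreteness, a hedged sketch of the nested-dictionary idea (the type parameter and the two delegates stand in for your entry type and its two hash functions; for simplicity this sketch always computes the expensive hash on insert rather than deferring it until a cheap-hash collision actually occurs):

using System;
using System.Collections.Generic;

// Outer key: cheap hash. Inner key: expensive hash. The list absorbs true collisions.
class DoubleHashedDictionary<T>
{
    private readonly Func<T, int> _cheap;
    private readonly Func<T, int> _expensive;
    private readonly Dictionary<int, Dictionary<int, List<T>>> _buckets =
        new Dictionary<int, Dictionary<int, List<T>>>();

    public DoubleHashedDictionary(Func<T, int> cheap, Func<T, int> expensive)
    {
        _cheap = cheap;
        _expensive = expensive;
    }

    public void Add(T item)
    {
        int outerKey = _cheap(item);
        Dictionary<int, List<T>> inner;
        if (!_buckets.TryGetValue(outerKey, out inner))
            _buckets[outerKey] = inner = new Dictionary<int, List<T>>();

        int innerKey = _expensive(item);
        List<T> bucket;
        if (!inner.TryGetValue(innerKey, out bucket))
            inner[innerKey] = bucket = new List<T>();

        bucket.Add(item);
    }

    // Returns all stored items whose cheap AND expensive hashes match the probe item.
    public IEnumerable<T> Candidates(T item)
    {
        Dictionary<int, List<T>> inner;
        if (!_buckets.TryGetValue(_cheap(item), out inner)) yield break;

        List<T> bucket;
        if (!inner.TryGetValue(_expensive(item), out bucket)) yield break;

        foreach (T candidate in bucket)
            yield return candidate;
    }
}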
In your case, you are technically using a modified hash function (A|B), not a true double hash. However, depending on how huge your "huge" list of entries is and the characteristics of your data, consider the following:
A 20% full hash table with a not-so-good distribution can have more than an 80% chance of collision. This means your expected hash cost could be (0.8 * expensive + 0.2 * cheap) + (cost of lookups). So if your table is more than 20% full, it may not be worth using the (A|B) scheme.
Coming up with a perfect hash function is possible but O(n^3) which makes it impractical.
If performance is supremely important, you can make a specifically tuned hash table for your specific data by testing various hash functions on your key data.
First off, I think you're on the right path to implement your own hash table, if what you are describing is truly desired. But as a critic, I'd like to ask a few questions:
Have you considered using something more unique for each entry?
I am assuming that each entry represents file system directory information; have you considered using its full path as the key, perhaps prefixed with the computer name or IP address?
On the other hand, if you're using the number of files as the hash key, are those directories never going to change? Because if the hash key/result changes, you will never be able to find the entry again.
While on this topic, if the directory content/size is never going to change, can you store that value somewhere to save the time of actually calculating it?
Just my two cents.
Have you taken a look at the Power Collections or C5 Collections libraries? The Power Collections library hasn't had much action recently, but the C5 stuff seems to be fairly up to date.
I'm not sure if either library has what you need, but they're pretty useful and they're open source so it may provide a decent base implementation for you to extend to your desired functionality.
You're basically talking about a hash table of hash tables, each using a different GetHashCode implementation... While that's possible, I think you'd want to consider seriously whether you'll actually get a performance improvement over just doing one or the other.
Will there actually be a substantial number of objects that can be located via the quick-hash mechanism alone, without having to resort to the more expensive one to narrow things down further? Because if you can't locate a significant number purely off the first calculation, you really save nothing by doing it in two steps (not knowing the data, it's hard to predict whether this is the case).
If a significant number can be located in one step, then you'll probably have to do a fair bit of tuning to work out how many records to store at each hash location of the outer table before resorting to an inner "expensive" hashtable lookup rather than simply scanning the records stored there, but under certain circumstances I can see how you'd get a performance gain from this (the circumstances would be few and far between, but aren't inconceivable).
Edit
I just saw your amendment to the question - you plan to do both lookups regardless... I doubt you'll get any performance benefit from this that you can't get just by configuring the main hash table a bit better. Have you tried using a single dictionary with an appropriate capacity passed in the constructor, and perhaps an XOR of the two hash codes as your hash code?
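If you try the single-dictionary route, combining the two hash codes can be as simple as this sketch (the multiply-then-XOR mix is just one common choice, not anything prescribed by the answer above):

static class HashCombiner
{
    // Combine the cheap and expensive hash codes into one value that can serve as
    // the key's GetHashCode result in a single ordinary Dictionary.
    public static int Combine(int cheapHash, int expensiveHash)
    {
        unchecked
        {
            // Multiplying by a prime before XOR spreads the bits better than a plain XOR,
            // which would cancel to zero whenever the two hashes happen to be equal.
            return (cheapHash * 397) ^ expensiveHash;
        }
    }
}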
Does anyone have, or know of, a binary patch generation algorithm implementation in C#?
Basically, compare two files (designated old and new), and produce a patch file that can be used to upgrade the old file to have the same contents as the new file.
The implementation would have to be relatively fast, and work with huge files. It should exhibit O(n) or O(log n) runtimes.
My own algorithms tend to either be lousy (fast but produce huge patches) or slow (produce small patches but have O(n^2) runtime).
Any advice, or pointers for implementation would be nice.
Specifically, the implementation will be used to keep servers in sync for various large datafiles that we have one master server for. When the master server datafiles change, we need to update several off-site servers as well.
The most naive algorithm I have made, which only works for files that can be kept in memory, is as follows:
1. Grab the first four bytes from the old file; call this the key.
2. Add those bytes to a dictionary, mapping key -> position, where position is the position at which I grabbed those 4 bytes (0 to begin with).
3. Skip the first of these four bytes, grab another 4 (3 overlapping, 1 new), and add them to the dictionary the same way.
4. Repeat steps 1-3 for all 4-byte blocks in the old file.
5. From the start of the new file, grab 4 bytes and attempt to look them up in the dictionary.
6. If found, find the longest match if there are several, by comparing bytes from the two files.
7. Encode a reference to that location in the old file, and skip the matched block in the new file.
8. If not found, encode 1 byte from the new file, and skip it.
9. Repeat steps 5-8 for the rest of the new file.
This is somewhat like compression without windowing, so it will use a lot of memory. It is, however, fairly fast, and produces quite small patches, as long as I try to keep the encoded output minimal.
A more memory-efficient algorithm uses windowing, but produces much bigger patch files.
There are more nuances to the above algorithm that I skipped in this post, but I can post more details if necessary. I do, however, feel that I need a different algorithm altogether, so improving on the above algorithm is probably not going to get me far enough.
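To make the steps above concrete, here is a condensed, in-memory sketch of that scheme (the COPY/LITERAL string output is a toy encoding for illustration only, not the format the author actually uses):

using System;
using System.Collections.Generic;

static class NaivePatcher
{
    // Steps 1-4: index every overlapping 4-byte block of the old file by its starting position.
    static Dictionary<uint, List<int>> IndexBlocks(byte[] oldData)
    {
        var index = new Dictionary<uint, List<int>>();
        for (int i = 0; i + 4 <= oldData.Length; i++)
        {
            uint key = BitConverter.ToUInt32(oldData, i);
            List<int> positions;
            if (!index.TryGetValue(key, out positions))
                index[key] = positions = new List<int>();
            positions.Add(i);
        }
        return index;
    }

    // Steps 5-9: walk the new file, emitting either a copy reference or a literal byte.
    public static List<string> Diff(byte[] oldData, byte[] newData)
    {
        var index = IndexBlocks(oldData);
        var ops = new List<string>();
        int pos = 0;
        while (pos < newData.Length)
        {
            List<int> candidates;
            if (pos + 4 <= newData.Length &&
                index.TryGetValue(BitConverter.ToUInt32(newData, pos), out candidates))
            {
                // Step 6: among all positions sharing the 4-byte key, keep the longest match.
                int bestPos = -1, bestLen = 0;
                foreach (int oldPos in candidates)
                {
                    int len = 0;
                    while (oldPos + len < oldData.Length && pos + len < newData.Length &&
                           oldData[oldPos + len] == newData[pos + len])
                        len++;
                    if (len > bestLen) { bestLen = len; bestPos = oldPos; }
                }
                // Step 7: reference the old file and skip the matched block.
                ops.Add("COPY old@" + bestPos + " len=" + bestLen);
                pos += bestLen;
            }
            else
            {
                // Step 8: no match, emit one literal byte.
                ops.Add("LITERAL 0x" + newData[pos].ToString("X2"));
                pos++;
            }
        }
        return ops;
    }
}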
Edit #1: Here is a more detailed description of the above algorithm.
First, combine the two files, so that you have one big file. Remember the cut-point between the two files.
Second, do the "grab 4 bytes and add their position to the dictionary" step for every position in the whole combined file.
Third, from where the new file starts, run the loop that attempts to locate an existing combination of 4 bytes and then finds the longest match. Make sure we only consider positions from the old file, or positions earlier in the new file than the one we're currently at. This ensures that we can reuse material from both the old and the new file during patch application.
Edit #2: Source code to the above algorithm
You might get a warning about the certificate having some problems. I don't know how to resolve that so for the time being just accept the certificate.
The source uses lots of other types from the rest of my library so that file isn't all it takes, but that's the algorithm implementation.
#lomaxx, I have tried to find good documentation for the algorithm used in Subversion, called xdelta, but unless you already know how the algorithm works, the documents I've found fail to tell me what I need to know.
Or perhaps I'm just dense... :)
I took a quick peek at the algorithm from the site you gave, and unfortunately it is not usable. A comment from the binary diff file says:
Finding an optimal set of differences requires quadratic time relative to the input size, so it becomes unusable very quickly.
My needs aren't optimal though, so I'm looking for a more practical solution.
Thanks for the answer though, added a bookmark to his utilities if I ever need them.
Edit #1: Note, I will look at his code to see if I can find some ideas, and I'll also send him an email later with questions, but I've read the book he references, and though the solution is good for finding optimal solutions, it is impractical in use due to the time requirements.
Edit #2: I'll definitely hunt down the python xdelta implementation.
Sorry I couldn't be more help. I would definitely keep looking at xdelta, because I have used it a number of times to produce quality diffs on 600MB+ ISO files we have generated for distributing our products, and it performs very well.
bsdiff was designed to create very small patches for binary files. As stated on its page, it requires max(17*n,9*n+m)+O(1) bytes of memory and runs in O((n+m) log n) time (where n is the size of the old file and m is the size of the new file).
The original implementation is in C, but a C# port is described here and available here.
Have you seen VCDiff? It is part of a Misc library that appears to be fairly active (last release r259, April 23rd 2008). I haven't used it, but thought it was worth mentioning.
It might be worth checking out what some of the other guys are doing in this space and not necessarily in the C# arena either.
This is a library written in C#.
SVN also has a binary diff algorithm, and I know there's an implementation in Python, although I couldn't find it with a quick search. They might give you some ideas on where to improve your own algorithm.
If this is for installation or distribution, have you considered using the Windows Installer SDK? It has the ability to patch binary files.
http://msdn.microsoft.com/en-us/library/aa370578(VS.85).aspx
This is a rough guideline, but the following describes the rsync algorithm, which can be used to create your binary patches.
http://rsync.samba.org/tech_report/tech_report.html
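The heart of that paper is its weak rolling checksum, which can be slid across a file one byte at a time in O(1). A minimal sketch following the report's formulas (block length and names are arbitrary):

static class RollingChecksum
{
    const int M = 1 << 16;

    // Weak checksum of data[offset .. offset+length-1], as in the rsync tech report:
    // a = sum of the bytes, b = sum of the bytes weighted by their distance from the window end.
    public static void Compute(byte[] data, int offset, int length, out int a, out int b)
    {
        a = 0; b = 0;
        for (int i = 0; i < length; i++)
        {
            a = (a + data[offset + i]) % M;
            b = (b + (length - i) * data[offset + i]) % M;
        }
    }

    // Slide the window one byte to the right: remove 'outgoing', add 'incoming'.
    // This is O(1), which is what makes checking every offset of a large file feasible.
    public static void Roll(ref int a, ref int b, byte outgoing, byte incoming, int length)
    {
        a = (a - outgoing + incoming + M) % M;
        b = (b - length * outgoing + a) % M;
        if (b < 0) b += M;
    }

    // The two halves are usually packed into a single 32-bit value.
    public static uint Combine(int a, int b)
    {
        return (uint)a | ((uint)b << 16);
    }
}

rsync then computes its stronger per-block hash only at offsets where this weak checksum matches one of the destination's blocks, which keeps the expensive work proportional to the amount of actual overlap.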