Say I have a file A.doc.
Then I copy it to B.doc and move it to another directory.
To me, it is still the same file.
But how can I determine that it is?
When I download files I sometimes read about getting the MD5 hash or the checksum, but I don't know what that is about.
Is there a way to check whether these files are binary equal?
If you want to be 100% sure that the exact bytes in the files are the same, then opening two streams and comparing the files byte by byte is the only way.
If you just want to be pretty sure (99.9999%?), I would calculate a MD5 hash of each file and compare the hashes instead. Check out System.Security.Cryptography.MD5CryptoServiceProvider.
In my testing, if the files are usually equivalent then comparing MD5 hashes is about three times faster than comparing each byte of the file.
If the files are usually different then comparing byte-by-byte will be much faster, because you don't have to read in the whole file, you can stop as soon as a single byte differs.
Edit: I originally based this answer on a quick test that read each file byte-by-byte and compared them as it went. I wrongly assumed that the buffered nature of System.IO.FileStream would save me from worrying about hard disk block sizes and read speeds; this was not true. I retested with a program that reads each file in 4096-byte chunks and then compares the chunks - this method is slightly faster overall than MD5 even when the files are exactly the same, and will of course be much faster if they differ.
I'm leaving this answer as a mild warning about the FileStream class, and because I still think it has some value as an answer to "how do I calculate the MD5 of a file in .NET". Apart from that though, it's not the best way to fulfill the original request.
Example of calculating the MD5 hashes of two files (now tested!):
using (var reader1 = new System.IO.FileStream(filepath1, System.IO.FileMode.Open, System.IO.FileAccess.Read))
{
    using (var reader2 = new System.IO.FileStream(filepath2, System.IO.FileMode.Open, System.IO.FileAccess.Read))
    {
        byte[] hash1;
        byte[] hash2;
        using (var md51 = new System.Security.Cryptography.MD5CryptoServiceProvider())
        {
            md51.ComputeHash(reader1);
            hash1 = md51.Hash;
        }
        using (var md52 = new System.Security.Cryptography.MD5CryptoServiceProvider())
        {
            md52.ComputeHash(reader2);
            hash2 = md52.Hash;
        }
        int j = 0;
        for (j = 0; j < hash1.Length; j++)
        {
            if (hash1[j] != hash2[j])
            {
                break;
            }
        }
        if (j == hash1.Length)
        {
            Console.WriteLine("The files were equal.");
        }
        else
        {
            Console.WriteLine("The files were not equal.");
        }
    }
}
First compare the sizes of the files; if the sizes are not the same, the files are different. If the sizes are the same, then compare the files' content.
Indeed there is. Open both files, read them in as byte arrays, and compare each byte. If all bytes are equal, the files are equal.
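For example, a minimal sketch of that idea (assuming the files fit comfortably in memory; for very large files a streamed, chunked comparison is preferable):
static bool FilesAreEqual(string path1, string path2)
{
    byte[] a = File.ReadAllBytes(path1);
    byte[] b = File.ReadAllBytes(path2);
    if (a.Length != b.Length)
    {
        return false; // different sizes can never be equal
    }
    for (int i = 0; i < a.Length; i++)
    {
        if (a[i] != b[i])
        {
            return false;
        }
    }
    return true;
}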
I have homework from my university teacher: I have to write code that encrypts/decrypts a small part of a big file (about 10 GB), using the Salsa20 algorithm.
The main constraint is not to load everything into RAM. As he said, I should read, for example, 100 lines, then encrypt/decrypt them, write them to a file, and repeat.
I create a List:
List<string> dict = new List<string>();
Then I read lines (because reading all the bytes at once uses a lot of RAM):
using (StreamReader sReader = new StreamReader(filePath))
{
    while (dict.Count < 100)
    {
        dict.Add(sReader.ReadLine());
    }
}
Then I try to create one line from the list:
string words = string.Join("", dict.ToArray());
And encrypt this line:
string encrypted;
using (var salsa = new Salsa20.Salsa20())
using (var mstream_out = new MemoryStream())
{
    salsa.Key = key;
    salsa.IV = iv;
    using (var cstream = new CryptoStream(mstream_out,
        salsa.CreateEncryptor(), CryptoStreamMode.Write))
    {
        var bytes = Encoding.UTF8.GetBytes(words);
        cstream.Write(bytes, 0, bytes.Length);
    }
    encrypted = Encoding.UTF8.GetString(mstream_out.ToArray());
}
Then I need to write the 100 encrypted lines to a file, but I don't know how to do it! Is there any solution?
OK, so here's what you could do.
Accept a filename, a starting line number and an ending line number.
Read the lines, simply writing them to another file if they are below the starting line number or above the ending line number.
Once you read a line that is in the range, you can encrypt it with the key and an IV. You will probably need to encode it to a byte array first, e.g. using UTF-8, as modern ciphers such as Salsa20 operate on bytes, not text.
You can use the line number possibly as nonce/IV for your stream cipher, if you don't expect the number of lines to change. Otherwise you can prefix the ciphertext with a large, fixed size, random nonce.
The ciphertext - possibly including the nonce - can be encoded as base64 without line endings. Then you write the base 64 line to the other file.
Keep encrypting the lines until you reach the ending line number. It is up to you whether the ending line is inclusive or exclusive.
Now read the remaining lines, and write them to the other file.
Don't forget to finalize the encryption and close the file. You may possibly want to destroy the source input file.
Encrypting bytes may be easier as you could write to the original file. However, writing encrypted strings will likely always expand the ciphertext compared with the plaintext. So you need to copy the file, as it needs to grow from the middle out.
I haven't got a clue why you would keep a list or dictionary in memory. If that's part of the requirements then I don't see it in the rest of the question. If you read in all the lines of a file that way then clearly you're using up memory.
Of course, if your 4 GiB file is just a single line then you're still using way too much memory. In that case you need to stream everything, parsing text from files, putting it in a character buffer, character-decoding it, encrypting it, encoding it again to base 64 and writing it to file. Certainly doable, but tricky if you've never done such things.
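To make the steps concrete, here is a rough, untested sketch that reuses the Salsa20 wrapper and the key variable from the question; it assumes the wrapper behaves like a standard SymmetricAlgorithm and accepts an 8-byte IV, and inputPath, outputPath, startLine and endLine are placeholders you would supply:
using (var reader = new StreamReader(inputPath))
using (var writer = new StreamWriter(outputPath))
using (var salsa = new Salsa20.Salsa20())
{
    salsa.Key = key;
    string line;
    int lineNumber = 0;
    while ((line = reader.ReadLine()) != null)
    {
        lineNumber++;
        if (lineNumber < startLine || lineNumber > endLine)
        {
            writer.WriteLine(line); // outside the range: copy the line unchanged
            continue;
        }
        // use the line number as the nonce/IV so each line gets a distinct keystream
        salsa.IV = BitConverter.GetBytes((long)lineNumber);
        using (var mstream = new MemoryStream())
        {
            using (var cstream = new CryptoStream(mstream, salsa.CreateEncryptor(), CryptoStreamMode.Write))
            {
                byte[] plaintext = Encoding.UTF8.GetBytes(line);
                cstream.Write(plaintext, 0, plaintext.Length);
            }
            // base64, not UTF-8 decoding: raw ciphertext bytes are not valid text
            writer.WriteLine(Convert.ToBase64String(mstream.ToArray()));
        }
    }
}
Decryption is the mirror image: base64-decode the lines in the range, decrypt with the same per-line IV, and write the recovered text back out.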
I need to compare the contents of very large files. Speed of the program is important. I need a 100% match. I read a lot of information but did not find the optimal solution. I am considering two choices, and both have problems.
Compare whole file byte by byte - not fast enough for large files.
File comparison using hashes - two files with the same hash are not guaranteed to be a 100% match.
What would you suggest? Maybe I could make use of threads? Could MemoryMappedFile be helpful?
If you really need to guarantee 100% that the files are 100% identical, then you need to do a byte-to-byte comparison. That's just entailed in the problem - the only hashing method with 0% risk of false matching is the identity function!
What we're left with is short-cuts that can give us quick answers to let us skip the byte-for-byte comparison some of the time.
As a rule, the only short-cut on proving equality is proving identity. In OO code that would be showing two objects are in fact the same object. The closest thing with files is if a binding or an NTFS junction means two paths refer to the same file. This happens so rarely that unless the nature of the work makes it more usual than normal, it's not going to be a net gain to check for it.
So we're left short-cutting on finding mismatches. It does nothing to increase our passes, but makes our fails faster:
Different size, not byte-for-byte equal. Simples!
If you will examine the same file more than once, then hash it and record the hash. Different hash, guaranteed not equal. The reduction in files that need a one-to-one comparison is massive.
Many file formats are likely to have some areas in common. Particularly the first bytes of many formats tend to be "magic numbers", headers etc. Either skip them, or skip them and then check them last (if there is a chance of them being different, but it's low).
Then there's the matter of making the actual comparison as fast as possible. Loading batches of 4 octets at a time into an integer and doing integer comparison will often be faster than octet-per-octet (see the sketch after this list).
Threading can help. One way is to split the actual comparison of the file into more than one operation, but if possible a bigger gain will be found by doing completely different comparisons in different threads. I'd need to know a bit more about just what you are doing to advise much, but the main thing is to make sure the output of the tests is thread-safe.
If you do have more than one thread examining the same files, have them work far from each other. E.g. if you have four threads, you could split the file in four, or you could have one take byte 0, 4, 8 while another takes byte 1, 5, 9, etc. (or 4-octet group 0, 4, 8 etc). The latter is much more likely to have false sharing issues than the former, so don't do that.
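For instance, a small illustrative sketch of the word-at-a-time comparison mentioned above, here using 8-byte words via BitConverter (it assumes both buffers hold at least count valid bytes):
static bool ChunksEqual(byte[] a, byte[] b, int count)
{
    int i = 0;
    for (; i + 8 <= count; i += 8) // compare 8 bytes per iteration
    {
        if (BitConverter.ToInt64(a, i) != BitConverter.ToInt64(b, i))
        {
            return false;
        }
    }
    for (; i < count; i++) // remaining tail bytes
    {
        if (a[i] != b[i])
        {
            return false;
        }
    }
    return true;
}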
Edit:
It also depends on just what you're doing with the files. You say you need 100% certainty, so this bit doesn't apply to you, but it's worth adding for the more general problem: if the cost of a false positive is a waste of resources, time or memory rather than an actual failure, then reducing it through a fuzzy short-cut could be a net win, and it can be worth profiling to see if this is the case.
If you are using a hash to speed things up (it can at least find some definite mismatches faster), then Bob Jenkins' SpookyHash is a good choice; it's not cryptographically secure, but if that's not your purpose it creates a 128-bit hash very quickly (much faster than a cryptographic hash, or even than the approaches taken by many GetHashCode() implementations) that is extremely good at avoiding accidental collisions (the sort of deliberate collisions cryptographic hashes avoid is another matter). I implemented it for .NET and put it on NuGet because nobody else had when I found myself wanting to use it.
Serial Compare
Test File Size(s): 118 MB
Duration: 579 ms
Equal? true
static bool Compare(string filePath1, string filePath2)
{
    using (FileStream file = File.OpenRead(filePath1))
    {
        using (FileStream file2 = File.OpenRead(filePath2))
        {
            if (file.Length != file2.Length)
            {
                return false;
            }
            int count;
            const int size = 0x1000000;
            var buffer = new byte[size];
            var buffer2 = new byte[size];
            while ((count = file.Read(buffer, 0, buffer.Length)) > 0)
            {
                file2.Read(buffer2, 0, buffer2.Length);
                for (int i = 0; i < count; i++)
                {
                    if (buffer[i] != buffer2[i])
                    {
                        return false;
                    }
                }
            }
        }
    }
    return true;
}
Parallel Compare
Test File Size(s): 118 MB
Duration: 340 ms
Equal? true
static bool Compare2(string filePath1, string filePath2)
{
    bool success = true;
    var info = new FileInfo(filePath1);
    var info2 = new FileInfo(filePath2);
    if (info.Length != info2.Length)
    {
        return false;
    }
    long fileLength = info.Length;
    const int size = 0x1000000;
    // round the chunk count up so the final partial chunk is compared as well
    Parallel.For(0, (fileLength + size - 1) / size, x =>
    {
        long start = x * size; // long, so files larger than 2 GB don't overflow
        if (start >= fileLength)
        {
            return;
        }
        using (FileStream file = File.OpenRead(filePath1))
        {
            using (FileStream file2 = File.OpenRead(filePath2))
            {
                var buffer = new byte[size];
                var buffer2 = new byte[size];
                file.Position = start;
                file2.Position = start;
                int count = file.Read(buffer, 0, size);
                file2.Read(buffer2, 0, size); // assumes the second read returns at least count bytes
                for (int i = 0; i < count; i++)
                {
                    if (buffer[i] != buffer2[i])
                    {
                        success = false;
                        return;
                    }
                }
            }
        }
    });
    return success;
}
MD5 Compare
Test File Size(s): 118 MB
Duration: 702 ms
Equal? true
static bool Compare3(string filePath1, string filePath2)
{
    byte[] hash1 = GenerateHash(filePath1);
    byte[] hash2 = GenerateHash(filePath2);
    if (hash1.Length != hash2.Length)
    {
        return false;
    }
    for (int i = 0; i < hash1.Length; i++)
    {
        if (hash1[i] != hash2[i])
        {
            return false;
        }
    }
    return true;
}

static byte[] GenerateHash(string filePath)
{
    using (MD5 crypto = MD5.Create()) // dispose the hash object when done
    using (FileStream stream = File.OpenRead(filePath))
    {
        return crypto.ComputeHash(stream);
    }
}
tl;dr Compare byte segments in parallel to determine if two files are equal.
Why not both?
Compare with hashes for the first pass, then return to conflicts and perform the byte-by-byte comparison. This allows maximal speed with guaranteed 100% match confidence.
There's no avoiding byte-for-byte comparisons if you want perfect comparisons (the file still has to be read byte-for-byte to do any hashing), so the issue is how you're reading and comparing the data.
So there are two things you'll want to address:
Concurrency - Make sure you're reading data at the same time you're checking it.
Buffer Size - Reading the file 1 byte at a time is going to be slow, so make sure you're reading it into a decent-sized buffer (about 8 MB should do nicely on very large files).
The objective is to make sure you can do your comparison as fast as the hard disk can read the data, and that you're always reading data with no delays. If you're doing everything as fast as the data can be read from the drive, then that's as fast as it is possible to do it, since the hard disk read speed becomes the bottleneck.
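A rough sketch of both points (not production code): two buffer pairs are used so that the next pair of reads is already in flight while the current pair is being compared. It assumes .NET 4.5+ for ReadAsync, uses C# 7 tuples for brevity, and assumes local FileStreams fill the buffer except at end of file; the 8 MB buffer size is just the figure suggested above:
static async Task<bool> StreamsEqualAsync(string path1, string path2)
{
    const int BufferSize = 8 * 1024 * 1024;

    using (var s1 = new FileStream(path1, FileMode.Open, FileAccess.Read, FileShare.Read, 4096, FileOptions.Asynchronous | FileOptions.SequentialScan))
    using (var s2 = new FileStream(path2, FileMode.Open, FileAccess.Read, FileShare.Read, 4096, FileOptions.Asynchronous | FileOptions.SequentialScan))
    {
        if (s1.Length != s2.Length)
        {
            return false;
        }

        var front = (a: new byte[BufferSize], b: new byte[BufferSize]);
        var back = (a: new byte[BufferSize], b: new byte[BufferSize]);

        // kick off the first pair of reads
        var pending = Task.WhenAll(s1.ReadAsync(front.a, 0, BufferSize),
                                   s2.ReadAsync(front.b, 0, BufferSize));
        while (true)
        {
            int[] counts = await pending; // wait for the current chunk pair
            if (counts[0] != counts[1])
            {
                return false; // unequal short reads (unexpected for local files of equal length)
            }
            if (counts[0] == 0)
            {
                return true; // both streams exhausted with no mismatch found
            }

            // start reading the next chunk pair into the back buffers ...
            pending = Task.WhenAll(s1.ReadAsync(back.a, 0, BufferSize),
                                   s2.ReadAsync(back.b, 0, BufferSize));

            // ... while comparing the chunk pair we already have
            for (int i = 0; i < counts[0]; i++)
            {
                if (front.a[i] != front.b[i])
                {
                    await pending; // let the in-flight reads finish before the streams are disposed
                    return false;
                }
            }

            (front, back) = (back, front); // swap buffer pairs for the next round
        }
    }
}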
Ultimately a hash is going to read the file byte by byte anyway ... so if you are looking for an accurate comparison then you might as well do the comparison. Can you give some more background on what you are trying to accomplish? How big are the 'big' files? How often do you have to compare them?
If you have a large set of files and you are trying to identify duplicates, I would try to break down the work by order of expense.
I might try something like the following:
1) Group files by size. Files with different sizes clearly can't be identical. This information is very inexpensive to retrieve. If each group contains only 1 file, you are done - no dupes - otherwise proceed to step 2.
2) Within each size group, generate a hash of the first n bytes of the file. Identify a reasonable n that will likely detect differences. Many files have identical headers, so you want to make sure n is greater than that header length. Group by the hashes; if each group contains 1 file, you are done (no dupes within this group), otherwise proceed to step 3.
3) At this point you will likely have to do more expensive work, such as generating a hash of the whole file or doing a byte-by-byte comparison. Depending on the number of files and the nature of the file contents, you might try different approaches. Hopefully, the previous groupings will have narrowed down likely duplicates so that the number of files you actually have to fully scan is very small.
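A rough sketch of steps 1 and 2, assuming System.Linq, System.IO and System.Security.Cryptography are in scope; the groups it yields still need the step-3 full comparison, and the method names are just illustrative:
static string HashFirstBytes(string path, int n)
{
    using (var md5 = MD5.Create())
    using (var fs = File.OpenRead(path))
    {
        var buffer = new byte[n];
        int read = fs.Read(buffer, 0, n); // may be shorter than n for small files
        return BitConverter.ToString(md5.ComputeHash(buffer, 0, read));
    }
}

static IEnumerable<List<string>> FindDuplicateCandidates(IEnumerable<string> paths, int n = 4096)
{
    // step 1: group by size; a group of one cannot contain duplicates
    foreach (var sizeGroup in paths.GroupBy(p => new FileInfo(p).Length).Where(g => g.Count() > 1))
    {
        // step 2: within a size group, group by a hash of the first n bytes
        foreach (var prefixGroup in sizeGroup.GroupBy(p => HashFirstBytes(p, n)).Where(g => g.Count() > 1))
        {
            // step 3 (full hash or byte-by-byte comparison) is still needed within each yielded group
            yield return prefixGroup.ToList();
        }
    }
}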
To calculate a hash, the entire file needs to be read.
How about opening both files together, and comparing them chunk by chunk?
Pseudo code:
open file A
open file B
if size of A != size of B
    return false
while file A has more data
{
    if next chunk of A != next chunk of B
        return false
}
return true
This way you are not loading too much into memory at once, and you stop reading as soon as you find a mismatch. You should set up a test that varies the chunk size to determine the optimal size for performance.
What's the best and the fastest method to remove an item from a binary file?
I have a binary file, and I know that I need to remove B bytes starting at position A. How do I do it?
Thanks
You might want to consider working in batches to prevent allocation on the LOH, but that depends on the size of your file and the frequency with which you call this logic.
long skipIndex = 100;
int skipLength = 40;

using (FileStream fileStream = File.Open("file.dat", FileMode.Open))
{
    int bufferSize;
    checked
    {
        bufferSize = (int)(fileStream.Length - (skipLength + skipIndex));
    }
    byte[] buffer = new byte[bufferSize];

    // read all data after
    fileStream.Position = skipIndex + skipLength;
    fileStream.Read(buffer, 0, bufferSize);

    // write to displacement
    fileStream.Position = skipIndex;
    fileStream.Write(buffer, 0, bufferSize);

    fileStream.SetLength(fileStream.Position); // trim the file
}
Depends... There are a few ways to do this, depending on your requirements.
The basic solution is to read chunks of data from the source file into a target file, skipping over the bits that must be removed (is it always only one segment to remove, or multiple segments?). After you're done, delete the original file and rename the temp file to the original's name.
Things to keep in mind here: you should tend towards larger chunks rather than smaller ones. The size of your files will determine a suitable value; 1 MB is a good default.
The simple approach assumes that deleting and renaming a new file is not a problem. If you have specific permissions attached to the file, or used NTFS streams or some-such, this approach won't work.
In that case, make a copy of the original file. Then skip to the first byte after the segment to ignore in the copied file, skip to the start of the segment in the original file, and transfer bytes from the copy to the original. If you're using Streams, you'll want to call Stream.SetLength to truncate the original to the correct size.
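A rough, untested sketch of the first (copy to a temp file) approach, for a single segment of length bytes starting at offset; all names are illustrative:
static void RemoveSegmentViaTempFile(string path, long offset, long length)
{
    string tempPath = path + ".tmp";
    var buffer = new byte[1024 * 1024]; // 1 MB chunks, per the suggestion above
    using (var source = File.OpenRead(path))
    using (var target = File.Create(tempPath))
    {
        CopyRange(source, target, 0, offset, buffer); // everything before the segment
        CopyRange(source, target, offset + length, source.Length - (offset + length), buffer); // everything after it
    }
    File.Delete(path);
    File.Move(tempPath, path);
}

static void CopyRange(FileStream source, FileStream target, long start, long count, byte[] buffer)
{
    source.Position = start;
    while (count > 0)
    {
        int read = source.Read(buffer, 0, (int)Math.Min(buffer.Length, count));
        if (read == 0)
        {
            break; // unexpected end of file
        }
        target.Write(buffer, 0, read);
        count -= read;
    }
}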
If you want to just rewrite the original file and remove a sequence from it, the best way is to "rearrange" the file.
The idea is:
for i = A+1 to file.length - B
    file[i] = file[i + B]
For better performance it's best to read and write in chunks rather than single bytes. Test with different chunk sizes to see what works best for your target system.
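A rough sketch of that rearrangement done in chunks rather than single bytes, where a is the start position and b the number of bytes to remove (as in the question):
static void RemoveSegmentInPlace(string path, long a, long b)
{
    var buffer = new byte[1024 * 1024];
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
    {
        long readPos = a + b;  // first byte after the removed segment
        long writePos = a;     // where that byte should end up
        while (true)
        {
            fs.Position = readPos;
            int read = fs.Read(buffer, 0, buffer.Length);
            if (read == 0)
            {
                break; // nothing left to shift
            }
            fs.Position = writePos;
            fs.Write(buffer, 0, read);
            readPos += read;
            writePos += read;
        }
        fs.SetLength(fs.Length - b); // trim the now-duplicated tail
    }
}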
I have some code that is really slow. I knew it would be, and now it is. Basically, I am reading files from a bunch of directories. The file names change but the data does not. To determine whether I have already read a file, I hash its bytes and compare that to a list of hashes of already processed files. There are about 1000 files in each directory, and figuring out what's new in each directory takes a good minute or so (and then the processing starts). Here's the basic code:
public static class ProgramExtensions
{
    public static byte[] ToSHA256Hash(this FileInfo file)
    {
        using (FileStream fs = new FileStream(file.FullName, FileMode.Open))
        {
            using (SHA256 hasher = new SHA256Managed())
            {
                return hasher.ComputeHash(fs);
            }
        }
    }

    public static string ToHexString(this byte[] p)
    {
        char[] c = new char[p.Length * 2 + 2];
        byte b;
        c[0] = '0'; c[1] = 'x';
        for (int y = 0, x = 2; y < p.Length; ++y, ++x)
        {
            b = ((byte)(p[y] >> 4));
            c[x] = (char)(b > 9 ? b + 0x37 : b + 0x30);
            b = ((byte)(p[y] & 0xF));
            c[++x] = (char)(b > 9 ? b + 0x37 : b + 0x30);
        }
        return new string(c);
    }
}

class Program
{
    static void Main(string[] args)
    {
        var allFiles = new DirectoryInfo("c:\\temp").GetFiles("*.*");
        List<string> readFileHashes = GetReadFileHashes();
        List<FileInfo> filesToRead = new List<FileInfo>();
        foreach (var file in allFiles)
        {
            // negated so only files whose hash is NOT already known get queued
            if (!readFileHashes.Contains(file.ToSHA256Hash().ToHexString()))
                filesToRead.Add(file);
        }
        //read new files
    }
}
Is there any way I can speed this up?
I believe you can achieve the most significant performance improvement by simply checking the file size first: if the sizes don't match, you can skip the entire file and not even open it.
Instead of just saving a list of known hashes, you would also keep a list of known file sizes and only do a content comparison when the file sizes match. When the file size doesn't match, you save yourself from even looking at the file content.
Depending on the general size of your files, a further improvement can be worthwhile:
Either do a binary compare with early abort when the first byte is different (this saves reading the entire file, which can be a very significant improvement if your files are generally large; any hash algorithm would read the entire file, whereas detecting that the first byte differs saves you from reading the rest). If your lookup list is likely to contain many files of the same size, so that you'd often have to do a binary comparison against several files, consider instead:
Hashing in blocks of, say, 1 MB each. First check the first block only against the precalculated first-block hash in your lookup. Only compare the 2nd block if the 1st block is the same; this saves reading beyond the 1st block in most cases where the files differ. Both of these options are only really worth the effort when your files are large.
I doubt that changing the hashing algorithm itself (e.g. doing a CRC first, as suggested) would make any significant difference. Your bottleneck is likely disk IO, not CPU, so avoiding disk IO is what will give you the most improvement. But as always with performance, measure.
Then, if this is still not enough (and only then), experiment with asynchronous IO (remember, though, that sequential reads are generally faster than random access, so too much random asynchronous reading can hurt your performance).
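For example, a sketch of the size-first lookup, assuming the known files are kept as an ILookup from size to hash (System.Linq) and reusing the extension methods from the question:
// knownHashesBySize could be built once, e.g. knownFiles.ToLookup(f => f.Size, f => f.Hash),
// where knownFiles is whatever store you already keep (these names are hypothetical)
static bool IsAlreadyProcessed(FileInfo file, ILookup<long, string> knownHashesBySize)
{
    if (!knownHashesBySize.Contains(file.Length))
    {
        return false; // no known file of this size, so this file must be new
    }
    string hash = file.ToSHA256Hash().ToHexString(); // only hash when the size matches something
    return knownHashesBySize[file.Length].Contains(hash);
}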
Create a file list
Sort the list by filesize
Eliminate files with unique sizes from the list
Now do hashing (a fast hash first might improve performance as well)
Use a data structure for your readFileHashes store that has an efficient search capability (hashing or binary search). I think a HashSet or a TreeSet would serve you better here.
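For example, a sketch against the question's code (assuming GetReadFileHashes can feed a HashSet); Contains on a HashSet<string> is O(1) on average, versus O(n) on a List<string>:
var readFileHashes = new HashSet<string>(GetReadFileHashes());
foreach (var file in allFiles)
{
    if (!readFileHashes.Contains(file.ToSHA256Hash().ToHexString()))
        filesToRead.Add(file);
}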
Use an appropriate checksum (hash) function. SHA256 is a cryptographic hash that is probably overkill. CRC is less computationally expensive; it was originally intended for catching unintentional/random changes (transmission errors), but it is susceptible to changes that are designed/intended to be hidden. What fits the differences between the files you are scanning?
See http://en.wikipedia.org/wiki/List_of_checksum_algorithms#Computational_costs_of_CRCs_vs_Hashes
Would a really simple checksum via sampling (e.g. checksum = (first 10 bytes and last 10 bytes)) work?
I'd do a quick CRC hash check first, as it is less expensive.
If the CRC does not match, continue on with a more "reliable" hash test such as SHA.
Your description of the problem still isn't clear enough.
The biggest problem is that you are doing a bunch of hashing. This is guaranteed to be slow.
You might want to try searching for the modification time, which does not change if a filename is changed:
http://msdn.microsoft.com/en-us/library/ms724320(VS.85,loband).aspx
Or you might want to monitor the folder for any new file changes:
http://www.codeguru.com/forum/showthread.php?t=436716
First group the files by file size - this will leave you with smaller groups of files. Now it depends on the group size and the file sizes. You could just start reading all files in a group in parallel until you find a difference. If there is a difference, split the group into smaller groups having the same value at the current position. If you have information about how the files typically differ, you can use it - start reading at the end, skip byte-by-byte comparison if larger clusters change, or whatever you know about the files. This solution might introduce I/O performance problems if you have to read many files in parallel, causing random disc access.
You could also calculate hash values for all files in each group and compare them. You don't necessarily have to process the whole files at once - just calculate the hash of a few bytes (maybe a 4 kiB cluster, or whatever fits your file sizes) and check whether there are already differences. If not, calculate the hashes of the next few bytes. This gives you the possibility to process larger blocks of each file without having to keep one such large block per file in a group in memory.
So it's all about a time-memory (disc I/O-memory) trade-off. You have to find your way between reading all files in a group into memory and comparing them byte by byte (high memory requirement, fast sequential access, but may read too much data) and reading the files byte by byte and comparing only the last byte read (low memory requirement, slow random access, reads only required data). Further, if the groups are very large, comparing the files byte by byte will become slower - comparing one byte across n files is an O(n) operation - and it might become more efficient to calculate hash values first and then compare only the hash values.
Updated: definitely DO NOT make file size your only check. If your OS version allows it, use FileInfo.LastWriteTime.
I've implemented something similar for an in-house project compiler/packager. We have over 8k files, so we store the last-modified dates and the hash data in a SQL database. Then on subsequent runs we query first against the modified date of any specific file, and only then against the hash data... that way we only calculate new hash data for those files that appear to be modified...
.NET has a way to check the last-modified date, in the FileInfo class... I suggest you check it out. EDIT: here is the link: LastWriteTime
Our packager takes about 20 secs to find out what files have been modified.
I'm working on a program that searches entire drives for a given file. At the moment, I calculate an MD5 hash for the known file and then scan all files recursively, looking for a match.
The only problem is that MD5 is painfully slow on large files. Is there a faster alternative that I can use while retaining a very small probability of false positives?
All code is in C#.
Thank you.
Update
I've read that even MD5 can be pretty quick and that disk I/O should be the limiting factor. That leads me to believe that my code might not be optimal. Are there any problems with this approach?
MD5 md5 = MD5.Create();
StringBuilder sb = new StringBuilder();
try
{
    using (FileStream fs = File.Open(fileName, FileMode.Open, FileAccess.Read))
    {
        foreach (byte b in md5.ComputeHash(fs))
            sb.Append(b.ToString("X2"));
    }
    return sb.ToString();
}
catch (Exception)
{
    return "";
}
I hope you're checking for an MD5 match only if the file size already matches.
Another optimization is to do a quick checksum of the first 1K (or some other arbitrary, but reasonably small number) and make sure those match before working the whole file.
Of course, all this assumes that you're just looking for a match/nomatch decision for a particular file.
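For example, a hypothetical prefix check along those lines, where knownPrefix holds the first kilobyte (or so) of the file you are searching for:
static bool PrefixMatches(string path, byte[] knownPrefix)
{
    var buffer = new byte[knownPrefix.Length];
    using (var fs = File.OpenRead(path))
    {
        int total = 0;
        int read;
        // fill the buffer; Read may return fewer bytes than requested
        while (total < buffer.Length && (read = fs.Read(buffer, total, buffer.Length - total)) > 0)
        {
            total += read;
        }
        if (total < knownPrefix.Length)
        {
            return false; // the file is shorter than the prefix
        }
    }
    for (int i = 0; i < knownPrefix.Length; i++)
    {
        if (buffer[i] != knownPrefix[i])
        {
            return false;
        }
    }
    return true;
}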
Regardless of cryptographic requirements, the possibility of a hash collision exists, so no hashing function can be used to guarantee that two files are identical.
I wrote similar code a while back, which I got to run pretty fast by indexing all the files first and discarding any with a different size. A fast hash comparison (on part of each file) was then performed on the remaining entries (comparing bytes for this step proved to be less useful - many file types have common headers which have identical bytes at the start of the file). Any files that were left after this stage were then checked using MD5, and finally a byte comparison of the whole file if the MD5 matched, just to ensure that the contents were the same.
Just read the file linearly? It seems pretty pointless to read the entire file, compute an MD5 hash, and then compare the hash.
Reading the file sequentially, a few bytes at a time, would allow you to discard the vast majority of files after reading, say, 4 bytes. And you'd save all the processing overhead of computing a hashing function which doesn't give you anything in your case.
If you already had the hashes for all the files in the drive, it'd make sense to compare them, but if you have to compute them on the fly, there just doesn't seem to be any advantage to the hashing.
Am I missing something here? What does hashing buy you in this case?
First consider what your bottleneck really is: the hash function itself, or rather the disk access speed? If you are bound by the disk, changing the hashing algorithm won't give you much. From your description I infer that you are always scanning the whole disk to find a match - consider building the index first and then only matching a given hash against the index; this will be much faster.
There is one small problem with using MD5 to compare files: there are known pairs of files which are different but have the same MD5.
This means you can use MD5 to tell if the files are different (if the MD5 is different, the files must be different), but you cannot use MD5 to tell if the files are equal (if the files are equal, the MD5 must be the same, but if the MD5 is equal, the files might or might not be equal).
You should either use a hash function which has not been broken (SHA-1 also has published collisions now, so prefer something like SHA-256), or (as #SoapBox mentioned) use MD5 only as a fast way to find candidates for a deeper comparison.
References:
http://www.win.tue.nl/hashclash/SoftIntCodeSign/
Use MD5CryptoServiceProvider and BufferedStream
using (FileStream stream = File.OpenRead(filePath))
{
    using (var bufferedStream = new BufferedStream(stream, 1024 * 32))
    using (var md5 = new MD5CryptoServiceProvider()) // dispose the provider when done
    {
        byte[] checksum = md5.ComputeHash(bufferedStream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}