Improve performance of SHA-1 ComputeHash - C#

I'm using the following code to do a checksum of a file which works fine. But when I generate a hash for a large file, say 2 GB, it is quite slow. How can I improve the performance of this code?
fs = new FileStream(txtFile.Text, FileMode.Open);
formatted = string.Empty;
using (SHA1Managed sha1 = new SHA1Managed())
{
    byte[] hash = sha1.ComputeHash(fs);
    foreach (byte b in hash)
    {
        formatted += b.ToString("X2");
    }
}
fs.Close();
Update:
System:
OS: Win 7 64bit, CPU: I5 750, RAM: 4GB, HDD: 7200rpm
Tests:
Test1 = 59.895 seconds
Test2 = 59.94 seconds

The first question is what you need this checksum for. If you don't need the cryptographic properties, then a non-cryptographic hash, or a hash that is less cryptographically secure (MD5 being "broken" doesn't prevent it being a good hash, nor still strong enough for some uses), is likely to be more performant. You could also make your own hash by reading only a subset of the data (I'd advise making this subset work in 4096-byte chunks of the underlying file, as that matches the buffer size used by SHA1Managed and allows faster chunk reads than you would get by reading, say, every X bytes for some value of X).
Edit: An upvote reminding me of this answer has also reminded me that I have since written SpookilySharp, which provides high-performance 32-, 64- and 128-bit hashes that are not cryptographic, but are good for providing checksums against errors, storage, etc. (This in turn has reminded me that I should update it to support .NET Core.)
Of course, if you want the SHA-1 of the file to interoperate with something else, you are stuck.
I would experiment with different buffer sizes, as increasing the size of the FileStream's buffer can increase speed at the cost of extra memory. I would advise a whole multiple of 4096 (4096 is the default, incidentally), as SHA1Managed will ask for 4096-byte chunks at a time; this way there will be no case where FileStream either returns less than the amount asked for (allowed, but sometimes suboptimal) or does more than one copy at a time.
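For illustration, a sketch of opening the file with a larger buffer that is a whole multiple of 4096 (the buffer size and path are placeholders to experiment with, not tuned values):
// Sketch: larger FileStream buffer (a whole multiple of 4096) feeding SHA1Managed.
using System;
using System.IO;
using System.Security.Cryptography;

static class HashWithLargerBuffer
{
    public static string Sha1Hex(string path)
    {
        // 64 KB buffer: 16 x the 4096-byte chunks the hasher requests at a time.
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, 16 * 4096))
        using (var sha1 = new SHA1Managed())
        {
            byte[] hash = sha1.ComputeHash(fs);
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }
}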

Well, is it IO-bound or CPU-bound? If it's CPU-bound, there's not a lot we can do about that.
It's possible that opening the FileStream with different parameters would allow the file system to do more buffering or assume that you're going to read the file sequentially - but I doubt that will help very much. (It's certainly not going to do a lot if it's CPU-bound.)
How slow is "quite slow" anyway? Compared with, say, copying the file?
If you have a lot of memory (e.g. 4GB or more) how long does it take to hash the file a second time, when it may be in the file system cache?
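One rough way to answer those questions yourself (paths are placeholders): time a plain copy of the file, then time the hash twice so the second run can benefit from the file system cache.
// Sketch: rough timings to decide whether hashing is IO-bound or CPU-bound.
using System;
using System.Diagnostics;
using System.IO;
using System.Security.Cryptography;

class HashTimings
{
    static void Main()
    {
        const string source = @"C:\path\to\bigfile.bin";   // placeholder
        const string copy = @"C:\path\to\bigfile.copy";     // placeholder

        var sw = Stopwatch.StartNew();
        File.Copy(source, copy, true);
        Console.WriteLine("Copy:       {0}", sw.Elapsed);

        // Run the hash twice; a much faster second run points at disk IO, not CPU.
        for (int run = 1; run <= 2; run++)
        {
            sw.Restart();
            using (var fs = File.OpenRead(source))
            using (var sha1 = SHA1.Create())
            {
                sha1.ComputeHash(fs);
            }
            Console.WriteLine("Hash run {0}: {1}", run, sw.Elapsed);
        }
    }
}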

First of all, have you measured "quite slow"? From this site, SHA-1 has about half the speed of MD5, at roughly 100 MB/s (depending on the CPU), so hashing 2 GB would take about 20 seconds. Also note that if you're using a slow HDD, this might be your real bottleneck, as 30-70 MB/s isn't unusual.
To speed things up, you might just not hash the whole file, but only the first X KB or other representative parts of it (the parts that will most likely differ). If your files aren't too similar, this shouldn't cause duplicates.
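A sketch of hashing only the first N KB (the prefix size is illustrative; pick one that is likely to differ between your files):
// Sketch: hash only the first prefixBytes of the file instead of the whole thing.
using System;
using System.IO;
using System.Security.Cryptography;

static class PrefixHash
{
    public static string Sha1OfPrefix(string path, int prefixBytes = 256 * 1024)
    {
        var buffer = new byte[prefixBytes];
        int read = 0;
        using (var fs = File.OpenRead(path))
        {
            // Fill the buffer (or stop at end of file for small files).
            int r;
            while (read < buffer.Length &&
                   (r = fs.Read(buffer, read, buffer.Length - read)) > 0)
            {
                read += r;
            }
        }
        using (var sha1 = SHA1.Create())
        {
            byte[] hash = sha1.ComputeHash(buffer, 0, read);
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }
}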

First: SHA-1 file hashing should be I/O-bound on non-ancient CPUs - and an i5 certainly doesn't qualify as ancient. Of course it depends on the implementation of SHA-1, but I doubt SHA1Managed is über-slow.
Next, 60 sec for 2 GB of data is ~34 MB/s - that's slow for hard-disk reads; even a 2.5" laptop disk can read faster than that. Assuming the hard drive is internal (no USB2 or network bottleneck) and there isn't a lot of other disk I/O activity, I'd be surprised to see less than 60 MB/s read from a modern drive.
My guess would be that ComputeHash() uses a small buffer internally. Try manually reading/hashing, so you can specify a larger buffer (64 KB or even larger) to increase throughput. You could also move to async processing so the disk read and the hash computation can be overlapped.
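A sketch of the manual read/hash loop with an explicit buffer size (1 MB here is illustrative; measure what works best on your hardware):
// Sketch: chunked hashing so the read buffer size can be chosen explicitly.
using System;
using System.IO;
using System.Security.Cryptography;

static class ChunkedSha1
{
    public static string Compute(string path, int bufferSize = 1024 * 1024)
    {
        using (var sha1 = SHA1.Create())
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, bufferSize,
                                       FileOptions.SequentialScan))
        {
            var buffer = new byte[bufferSize];
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Feed each chunk to the hash without building up the whole file in memory.
                sha1.TransformBlock(buffer, 0, read, null, 0);
            }
            sha1.TransformFinalBlock(buffer, 0, 0);
            return BitConverter.ToString(sha1.Hash).Replace("-", "");
        }
    }
}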

Neither is SHA1Managed the best choice for large inputs, nor is Byte.ToString("X2") the fastest way to convert the byte array to a string.
I just finished an article with detailed benchmarks on that topic. It compares SHA1Managed, SHA1CryptoServiceProvider and SHA1Cng, and also considers SHA1.Create(), on inputs of different lengths.
In the second part, it shows five different methods of converting the byte array to a string, of which Byte.ToString("X2") is the worst (a couple of common alternatives are sketched at the end of this answer).
My largest input was only 10,000 characters, so you may want to run my benchmarks against your 2 GB file. It would be quite interesting to see if and how that changes the numbers.
http://wintermute79.wordpress.com/2014/10/10/c-sha-1-benchmark/
However, for file integrity checks you are better off using MD5 as you already wrote.
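For reference, here is a sketch of two commonly used alternatives to per-byte ToString("X2") (general-purpose conversions, not the specific methods benchmarked in the article):
// Sketch: two alternatives to calling b.ToString("X2") for every byte.
using System;

static class HexConvert
{
    // One-liner using BitConverter (produces "AA-BB-..." and strips the dashes).
    public static string ToHexBitConverter(byte[] hash)
    {
        return BitConverter.ToString(hash).Replace("-", "");
    }

    // Lookup-table approach: index into a hex alphabet instead of formatting each byte.
    private const string HexAlphabet = "0123456789ABCDEF";

    public static string ToHexLookup(byte[] hash)
    {
        var chars = new char[hash.Length * 2];
        for (int i = 0; i < hash.Length; i++)
        {
            chars[2 * i] = HexAlphabet[hash[i] >> 4];
            chars[2 * i + 1] = HexAlphabet[hash[i] & 0x0F];
        }
        return new string(chars);
    }
}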

You can use the following logic to get a SHA-1 value. I was using it in Java.
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class sha1Calculate {

    public static void main(String[] args) {
        File file = new File("D:\\Android Links.txt");

        try (FileInputStream input = new FileInputStream(file)) {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");

            // Feed the digest chunk by chunk instead of buffering the whole file in memory.
            byte[] buffer = new byte[65536];
            int l;
            while ((l = input.read(buffer)) > 0) {
                digest.update(buffer, 0, l);
            }

            byte[] bytes = digest.digest();
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) {
                sb.append(String.format("%02X", b));
            }
            System.out.println("Digest (in hex format): " + sb.toString());

        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (NoSuchAlgorithmException e) {
            e.printStackTrace();
        }
    }
}

Related

compare the contents of large files

I need to compare the contents of very large files. Speed of the program is important. I need a 100% match. I have read a lot of information but have not found the optimal solution. I am considering the following two choices, and both have problems.
Compare whole file byte by byte - not fast enough for large files.
File comparison using hashes - two files with the same hash are not guaranteed to be a 100% match.
What would you suggest? Maybe I could make use of threads? Could MemoryMappedFile be helpful?
If you really need to guarantee 100% that the files are 100% identical, then you need to do a byte-by-byte comparison. That's just entailed in the problem - the only hashing method with 0% risk of false matching is the identity function!
What we're left with is short-cuts that can quickly give us answers that let us skip the byte-for-byte comparison some of the time.
As a rule, the only short-cut to proving equality is proving identity. In OO code that would be showing that two references are in fact the same object. The closest thing with files is if a binding or NTFS junction means two paths refer to the same file. This happens so rarely that, unless the nature of the work makes it more usual than normal, it's not going to be a net gain to check for it.
So we're left with short-cuts for finding mismatches. That does nothing to speed up our passes, but it makes our fails faster:
Different size, not byte-for-byte equal. Simples!
If you will examine the same file more than once, then hash it and record the hash. Different hash, guaranteed not equal. The reduction in files that need a one-to-one comparison is massive.
Many file formats are likely to have some areas in common. Particularly the first bytes for many formats tend to be "magic numbers", headers, etc. Either skip them, or skip them and then check them last (if there is a chance of them being different, but it's low).
Then there's the matter of making the actual comparison as fast as possible. Loading batches of 4 octets at a time into an integer and doing integer comparison will often be faster than octet-per-octet.
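For instance, a sketch of comparing buffers a machine word at a time (8 bytes here), falling back to per-byte comparison for the tail:
// Sketch: compare two equal-length buffers 8 bytes at a time; names are illustrative.
static bool BuffersEqual(byte[] a, byte[] b, int count)
{
    int i = 0;
    // Compare in 8-byte (long) strides; typically faster than a per-byte loop.
    for (; i + sizeof(long) <= count; i += sizeof(long))
    {
        if (BitConverter.ToInt64(a, i) != BitConverter.ToInt64(b, i))
            return false;
    }
    // Handle any remaining bytes one at a time.
    for (; i < count; i++)
    {
        if (a[i] != b[i]) return false;
    }
    return true;
}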
Threading can help. One way is to split the actual comparison of the file into more than one operation, but if possible a bigger gain will be found by doing completely different comparisons in different threads. I'd need to know a bit more about just what you are doing to advise much, but the main thing is to make sure the output of the tests is thread-safe.
If you do have more than one thread examining the same files, have them work far from each other. E.g. if you have four threads, you could split the file in four, or you could have one take byte 0, 4, 8 while another takes byte 1, 5, 9, etc. (or 4-octet group 0, 4, 8 etc). The latter is much more likely to have false sharing issues than the former, so don't do that.
Edit:
It also depends on just what you're doing with the files. You say you need 100% certainty, so this bit doesn't apply to you, but it's worth adding for the more general problem that if the cost of a false-positive is a waste of resources, time or memory rather than an actual failure, then reducing it through a fuzzy short-cut could be a net-win and it can be worth profiling to see if this is the case.
If you are using a hash to speed things up (it can at least find some definite mismatches faster), then Bob Jenkins' Spooky Hash is a good choice; it's not cryptographically secure, but if that's not your purpose it creates a 128-bit hash very quickly (much faster than a cryptographic hash, or even than the approaches taken with many GetHashCode() implementations) that is extremely good at not having accidental collisions (the sort of deliberate collisions cryptographic hashes avoid is another matter). I implemented it for .NET and put it on NuGet because nobody else had when I found myself wanting to use it.
Serial Compare
Test File Size(s): 118 MB
Duration: 579 ms
Equal? true
static bool Compare(string filePath1, string filePath2)
{
    using (FileStream file = File.OpenRead(filePath1))
    {
        using (FileStream file2 = File.OpenRead(filePath2))
        {
            if (file.Length != file2.Length)
            {
                return false;
            }

            int count;
            const int size = 0x1000000;
            var buffer = new byte[size];
            var buffer2 = new byte[size];

            while ((count = file.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Read may return fewer bytes than requested, so fill buffer2
                // until it holds the same number of bytes as buffer.
                int count2 = 0;
                while (count2 < count)
                {
                    int read = file2.Read(buffer2, count2, count - count2);
                    if (read <= 0)
                    {
                        return false;
                    }
                    count2 += read;
                }

                for (int i = 0; i < count; i++)
                {
                    if (buffer[i] != buffer2[i])
                    {
                        return false;
                    }
                }
            }
        }
    }
    return true;
}
Parallel Compare
Test File Size(s): 118 MB
Duration: 340 ms
Equal? true
static bool Compare2(string filePath1, string filePath2)
{
    bool success = true;
    var info = new FileInfo(filePath1);
    var info2 = new FileInfo(filePath2);

    if (info.Length != info2.Length)
    {
        return false;
    }

    long fileLength = info.Length;
    const int size = 0x1000000;

    // Round up so the final, partial chunk is compared as well.
    long chunks = (fileLength + size - 1) / size;

    Parallel.For(0, chunks, x =>
    {
        long start = x * size;   // long arithmetic avoids int overflow on large files

        using (FileStream file = File.OpenRead(filePath1))
        {
            using (FileStream file2 = File.OpenRead(filePath2))
            {
                var buffer = new byte[size];
                var buffer2 = new byte[size];

                file.Position = start;
                file2.Position = start;

                int count = file.Read(buffer, 0, size);
                int count2 = file2.Read(buffer2, 0, size);

                // The files have the same length, so the reads should match;
                // treat a mismatch as "not equal".
                if (count != count2)
                {
                    success = false;
                    return;
                }

                for (int i = 0; i < count; i++)
                {
                    if (buffer[i] != buffer2[i])
                    {
                        success = false;
                        return;
                    }
                }
            }
        }
    });

    return success;
}
MD5 Compare
Test File Size(s): 118 MB
Duration: 702 ms
Equal? true
static bool Compare3(string filePath1, string filePath2)
{
    byte[] hash1 = GenerateHash(filePath1);
    byte[] hash2 = GenerateHash(filePath2);

    if (hash1.Length != hash2.Length)
    {
        return false;
    }

    for (int i = 0; i < hash1.Length; i++)
    {
        if (hash1[i] != hash2[i])
        {
            return false;
        }
    }
    return true;
}

static byte[] GenerateHash(string filePath)
{
    using (MD5 crypto = MD5.Create())
    using (FileStream stream = File.OpenRead(filePath))
    {
        return crypto.ComputeHash(stream);
    }
}
tl;dr Compare byte segments in parallel to determine if two files are equal.
Why not both?
Compare with hashes for the first pass, then return to conflicts and perform the byte-by-byte comparison. This allows maximal speed with guaranteed 100% match confidence.
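A minimal sketch of that two-pass idea, reusing the GenerateHash and Compare methods shown above (SequenceEqual requires using System.Linq):
// Sketch: hash first to rule out obvious mismatches, then fall back to the exact
// byte-by-byte Compare only when the hashes collide.
static bool CompareHashThenBytes(string filePath1, string filePath2)
{
    byte[] hash1 = GenerateHash(filePath1);
    byte[] hash2 = GenerateHash(filePath2);

    // Different hashes: the files are guaranteed to differ.
    if (!hash1.SequenceEqual(hash2))
    {
        return false;
    }

    // Same hash: confirm with the byte-by-byte comparison for 100% certainty.
    return Compare(filePath1, filePath2);
}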
There's no avoiding doing byte-for-byte comparisons if you want perfect comparisons (The file still has to be read byte-for-byte to do any hashing), so the issue is how you're reading and comparing the data.
So there are two things you'll want to address:
Concurrency - Make sure you're reading data at the same time you're checking it.
Buffer Size - Reading the file 1 byte at a time is going to be slow; make sure you're reading it into a decent-sized buffer (about 8 MB should do nicely for very large files)
The objective is to make sure you can do your comparison as fast as the hard disk can read the data, and that you're always reading data with no delays. If you're doing everything as fast as the data can be read from the drive, then that's as fast as it is possible to do it, since the hard-disk read speed becomes the bottleneck. A sketch of that overlap is shown below.
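A rough sketch of overlapping the reads with the comparison using double buffering (the buffer size is illustrative and error handling is kept minimal):
// Sketch: prefetch the next pair of chunks asynchronously while comparing the current pair.
using System;
using System.IO;
using System.Threading.Tasks;

static class OverlappedCompare
{
    public static bool Equal(string path1, string path2, int bufferSize = 8 * 1024 * 1024)
    {
        using (var f1 = new FileStream(path1, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, bufferSize, FileOptions.SequentialScan))
        using (var f2 = new FileStream(path2, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, bufferSize, FileOptions.SequentialScan))
        {
            if (f1.Length != f2.Length) return false;

            var bufA1 = new byte[bufferSize]; var bufA2 = new byte[bufferSize];
            var bufB1 = new byte[bufferSize]; var bufB2 = new byte[bufferSize];

            // Start the first pair of reads.
            Task<int> read1 = f1.ReadAsync(bufA1, 0, bufferSize);
            Task<int> read2 = f2.ReadAsync(bufA2, 0, bufferSize);
            bool useA = true;

            while (true)
            {
                int n1 = read1.Result;
                int n2 = read2.Result;
                if (n1 != n2) return false;   // same-length files, so reads should match
                if (n1 == 0) return true;     // both streams exhausted, no mismatch found

                byte[] cur1 = useA ? bufA1 : bufB1;
                byte[] cur2 = useA ? bufA2 : bufB2;
                byte[] next1 = useA ? bufB1 : bufA1;
                byte[] next2 = useA ? bufB2 : bufA2;

                // Kick off the next reads before comparing the current chunks.
                read1 = f1.ReadAsync(next1, 0, bufferSize);
                read2 = f2.ReadAsync(next2, 0, bufferSize);

                for (int i = 0; i < n1; i++)
                {
                    if (cur1[i] != cur2[i])
                    {
                        // Let the in-flight reads finish before the streams are disposed.
                        Task.WaitAll(read1, read2);
                        return false;
                    }
                }
                useA = !useA;
            }
        }
    }
}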
Ultimately a hash is going to read the file byte by byte anyway ... so if you are looking for an accurate comparison then you might as well do the comparison. Can you give some more background on what you are trying to accomplish? How big are the 'big' files? How often do you have to compare them?
If you have a large set of files and you are trying to identify duplicates, I would try to break down the work by order of expense.
I might try something like the following:
1) group files by size. Files with different sizes clearly can't be identical. This information is very inexpensive to retrieve. If each group only contains 1 file, you are done, no dupes, otherwise proceed to step 2.
2) Within each size group, generate a hash of the first n bytes of the file. Identify a reasonable n that will likely detect differences. Many files have identical headers, so you want to make sure n is greater than that header length. Group by the hashes; if each group contains 1 file, you are done (no dupes within this group), otherwise proceed to step 3.
3) At this point you are likely going to have to do more expensive work, like generating a hash of the whole file or doing a byte-by-byte comparison. Depending on the number of files and the nature of the file contents, you might try different approaches. Hopefully the previous groupings will have narrowed down the likely duplicates so that the number of files you actually have to fully scan is very small. A sketch of this staged approach follows.
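A rough sketch of that staged filtering, with illustrative choices for the prefix size and hash (adapt both to your data):
// Sketch: group by size, then by a hash of the first n bytes; only the surviving
// groups still need a full hash or byte-by-byte comparison.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class DuplicateFinder
{
    const int PrefixBytes = 64 * 1024;   // should exceed common format headers

    public static IEnumerable<List<string>> CandidateDuplicateGroups(IEnumerable<string> paths)
    {
        // Step 1: group by size; unique sizes cannot have duplicates.
        foreach (var sizeGroup in paths.GroupBy(p => new FileInfo(p).Length)
                                       .Where(g => g.Count() > 1))
        {
            // Step 2: within a size group, group by a hash of the first n bytes.
            foreach (var prefixGroup in sizeGroup.GroupBy(PrefixHashHex)
                                                 .Where(g => g.Count() > 1))
            {
                // Step 3: these still need a full hash or byte-by-byte comparison.
                yield return prefixGroup.ToList();
            }
        }
    }

    static string PrefixHashHex(string path)
    {
        var buffer = new byte[PrefixBytes];
        int read = 0;
        using (var fs = File.OpenRead(path))
        {
            int r;
            while (read < buffer.Length &&
                   (r = fs.Read(buffer, read, buffer.Length - read)) > 0)
            {
                read += r;
            }
        }
        using (var md5 = MD5.Create())
        {
            return BitConverter.ToString(md5.ComputeHash(buffer, 0, read));
        }
    }
}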
To calculate a hash, the entire file needs to be read.
How about opening both files together, and comparing them chunk by chunk?
Pseudo code:
open file A
open file B
while file A has more data
{
if next chunk of A != next chunk of B return false
}
return true
This way you are not loading too much into memory at once, and you are not reading the entire file if you find a mismatch early. You should set up a test that varies the chunk size to determine the best value for optimal performance.

Parsing large data file from disk significantly slower than parsing in memory?

While writing a simple library to parse a game's data files, I noticed that reading an entire data file into memory and parsing from there was significantly faster (by up to 15x, 106s v 7s).
Parsing is usually sequential but seeks will be done every now and then to read some data stored elsewhere in a file, linked by an offset.
I realise that parsing from memory will definitely be faster, but something is wrong if the difference is so significant. I wrote some code to simulate this:
public static void Main(string[] args)
{
    Stopwatch n = new Stopwatch();
    n.Start();
    byte[] b = File.ReadAllBytes(@"D:\Path\To\Large\File");
    using (MemoryStream s = new MemoryStream(b, false))
        RandomRead(s);
    n.Stop();
    Console.WriteLine("Memory read done in {0}.", n.Elapsed);

    b = null;
    n.Reset();
    n.Start();
    using (FileStream s = File.Open(@"D:\Path\To\Large\File", FileMode.Open))
        RandomRead(s);
    n.Stop();
    Console.WriteLine("File read done in {0}.", n.Elapsed);
    Console.ReadLine();
}

private static void RandomRead(Stream s)
{
    // simulate a mostly sequential, but sometimes random, read
    using (BinaryReader br = new BinaryReader(s))
    {
        long l = s.Length;
        Random r = new Random();
        int c = 0;
        while (l > 0)
        {
            l -= br.ReadBytes(r.Next(1, 5)).Length;
            if (c++ <= r.Next(10, 15)) continue;
            // simulate seeking
            long o = s.Position;
            s.Position = r.Next(0, (int)s.Length);
            l -= br.ReadBytes(r.Next(1, 5)).Length;
            s.Position = o;
            c = 0;
        }
    }
}
I used one of the game's data files as input to this. That file was about 102 MB, and it produced this result (Memory read done in 00:00:03.3092618. File read done in 00:00:32.6495245.), with memory reading about 11x faster than reading from the file.
The memory read was done before the file read to try to improve the file read's speed via the file system cache. The file read is still that much slower.
I've tried increasing or decreasing FileStream's buffer size; nothing produced significantly better results, and increasing or decreasing it too much just worsened the speed.
Is there something I'm doing wrong, or is this to be expected? Is there any way to at least make the slowdown less significant?
Why is reading the entire file at once and then parsing it so much faster than reading and parsing simultaneously?
I've actually compared to a similar library written in C++, which uses the Windows native CreateFileMapping and MapViewOfFile to read files, and it's very fast. Could it be the constant switching from managed to unmanaged and the involved marshaling that causes this?
I've also tried MemoryMappedFiles present in .NET 4; the speed gain was only about one second.
Is there something I'm doing wrong, or is this to be expected?
No, nothing wrong. This is entirely expected. That accessing the disk is an order of magnitude slower than accessing memory is more than reasonable.
Update:
That a single read of the file followed by processing is faster than processing while reading is also expected.
Doing a large IO operation and processing in memory would be faster than getting a bit from disk, processing it, calling the disk again (waiting for the IO to complete), processing that bit etc...
Is there something I'm doing wrong, or is this to be expected?
A harddisk has, compared to RAM, huge access times. Sequential reads are pretty speedy, but as soon as the heads have to move (because data is fragmented) it takes lots of milliseconds to get the next bit of data, during which your application is idling.
Is there any way to at least make the slowdown less significant?
Buy an SSD.
You also can take a look at Memory-Mapped Files for .NET:
MemoryMappedFile.CreateFromFile().
As for your edit: I'd go with @Oded and say that reading the whole file beforehand adds a penalty. However, that should not cause the method that first reads the whole file to be seven times as slow as 'process-as-you-read'.
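A minimal sketch of the memory-mapped approach suggested above (the path and the read size are placeholders):
// Sketch: map the file and read it through a stream view instead of a FileStream.
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class MappedRead
{
    static void Main()
    {
        const string path = @"D:\Path\To\Large\File";   // placeholder

        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (MemoryMappedViewStream view = mmf.CreateViewStream())
        using (var reader = new BinaryReader(view))
        {
            // Read the first few bytes through the mapping as a demonstration.
            byte[] header = reader.ReadBytes(16);
            Console.WriteLine(BitConverter.ToString(header));
        }
    }
}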
I decided to do some benchmarks comparing various ways of reading a file in C++ and C#. First I created a 256 MB file. In the C++ benchmarks, "buffered" means I first copied the entire file to a buffer and then read the data from the buffer. All the benchmarks read the file, directly or indirectly, byte by byte sequentially and calculate a checksum. All times are measured from the moment I open the file until I am completely done and the file is closed. All benchmarks were run multiple times to maintain consistent OS file caching.
C++
Unbuffered memory mapped file: 300ms
Buffered memory mapped file: 500ms
Unbuffered fread: 23,000ms
Buffered fread: 500ms
Unbuffered ifstream: 26,000ms
Buffered ifstream: 700ms
C#
MemoryMappedFile: 112,000ms
FileStream: 2,800ms
MemoryStream: 2,300ms
ReadAllBytes: 600ms
Interpret the data as you wish. C#'s memory-mapped files are slower than even the worst-case C++ code, whereas C++'s memory-mapped files are the fastest option. C#'s ReadAllBytes is decently fast, only about twice as slow as C++'s memory-mapped files. So if you want the best performance, I recommend you use ReadAllBytes and then access the data directly from the array without using a stream.

What is the best memory buffer size to allocate to download a file from Internet?

What is the best memory buffer size to allocate when downloading a file from the Internet? Some of the samples I have seen use 1 KB. Why is that, in general? And what difference does it make whether we download a small .PNG or a large .AVI?
Stream remoteStream;
Stream localStream;
WebResponse response;
try
{
    response = request.EndGetResponse(result);
    if (response == null)
        return;

    remoteStream = response.GetResponseStream();
    var localFile = Path.Combine(FileManager.GetFolderContent(), TaskResult.ContentItem.FileName);
    localStream = File.Create(localFile);

    var buffer = new byte[1024];
    int bytesRead;
    do
    {
        bytesRead = remoteStream.Read(buffer, 0, buffer.Length);
        localStream.Write(buffer, 0, bytesRead);
        BytesProcessed += bytesRead;
    } while (bytesRead > 0);
}
For what it's worth, I tested reading a 1484 KB text file using buffer sizes of progressive powers of two (2, 4, 8, 16, ...). I printed to the console window the number of milliseconds required to read the file with each size. Much past 8192 there didn't seem to be much of a difference. Here are the results on my Windows 7 64-bit machine.
2^1 = 2 :264.0151
2^2 = 4 :193.011
2^3 = 8 :175.01
2^4 = 16 :153.0088
2^5 = 32 :139.0079
2^6 = 64 :134.0077
2^7 = 128 :132.0075
2^8 = 256 :130.0075
2^9 = 512 :133.0076
2^10 = 1024 :133.0076
2^11 = 2048 :90.0051
2^12 = 4096 :69.0039
2^13 = 8192 :60.0035
2^14 = 16384 :56.0032
2^15 = 32768 :53.003
2^16 = 65536 :53.003
2^17 = 131072 :52.003
2^18 = 262144 :53.003
2^19 = 524288 :54.0031
2^20 = 1048576 :55.0031
2^21 = 2097152 :54.0031
2^22 = 4194304 :54.0031
2^23 = 8388608 :54.003
2^24 = 16777216 :55.0032
Use at least 4KB. It's the normal page size for Windows (i.e. the granularity at which Windows itself manages memory), which means that the .Net memory allocator doesn't need to break down a 4KB page into 1KB allocations.
Of course, using a 64KB block will be faster, but only marginally so.
2K, 4K or 8K are good choices.
The page size is not that important; the change in speed would be marginal and unpredictable.
First of all, C# memory can be moved; C# uses a compacting generational garbage collector. There is no guarantee about where data will be allocated.
Second, arrays in C# can be backed by non-contiguous areas of memory! Arrays are stored contiguously in virtual memory, but contiguous virtual memory doesn't mean contiguous physical memory.
Third, the array data structure in C# occupies a few more bytes than the content itself (it stores the array size and other information). If you allocate exactly a page-size worth of bytes, using the array will almost always straddle a page boundary!
I would think that optimizing code around the page size can be a non-optimization.
Usually C# arrays perform very well, but if you really need precise allocation of data you need to use pinned arrays or Marshal allocation, and that will slow down the garbage collector.
Using Marshal allocation and unsafe code can be a little faster, but it really isn't worth the effort.
I would say it is better to just use your arrays without thinking too much about the page size. Use 2K, 4K or 8K buffers.
I had a problem with the remote machine closing the connection when a 64K buffer was used while downloading from IIS.
I solved the problem by raising the buffer to 2 MB.
It will depend on the hardware and scope too. I work on cloud-deployed workloads; in the server world you may find 40G Ethernet cards, and you can assume MTUs of 9000 bytes. Additionally, you don't want your Ethernet card to interrupt your processor for every single frame. So, ignoring the middle actors in the Windows/Linux kernel, you should go one or two folds higher:
100 * 9000 ≈ 900 KB, so I generally choose 512 KB as a default value (as long as I know this value is not oversized for the file sizes I typically expect to download).
In some cases you can find out (or know, or hack around in a debugger and hence find out, albeit in a non-change-resistant way) the size of the buffer used by the stream(s) you are writing to or reading from. In this case it will give a slight advantage if you match that size or, failing that, if one buffer is a whole multiple of the other.
Otherwise 4096 unless you've a reason otherwise (wanting a small buffer to give rapid UI feedback for example), for the reasons already given.

What is the fastest way to find if an array of byte arrays contains another byte array?

I have some code that is really slow. I knew it would be, and now it is. Basically, I am reading files from a bunch of directories. The file names change but the data does not. To determine whether I have already read a file, I am hashing its bytes and comparing that to a list of hashes of already-processed files. There are about 1000 files in each directory, and figuring out what's new in each directory takes a good minute or so (and then the processing starts). Here's the basic code:
public static class ProgramExtensions
{
    public static byte[] ToSHA256Hash(this FileInfo file)
    {
        using (FileStream fs = new FileStream(file.FullName, FileMode.Open))
        {
            using (SHA256 hasher = new SHA256Managed())
            {
                return hasher.ComputeHash(fs);
            }
        }
    }

    public static string ToHexString(this byte[] p)
    {
        char[] c = new char[p.Length * 2 + 2];
        byte b;
        c[0] = '0'; c[1] = 'x';
        for (int y = 0, x = 2; y < p.Length; ++y, ++x)
        {
            b = ((byte)(p[y] >> 4));
            c[x] = (char)(b > 9 ? b + 0x37 : b + 0x30);
            b = ((byte)(p[y] & 0xF));
            c[++x] = (char)(b > 9 ? b + 0x37 : b + 0x30);
        }
        return new string(c);
    }
}

class Program
{
    static void Main(string[] args)
    {
        var allFiles = new DirectoryInfo("c:\\temp").GetFiles("*.*");
        List<string> readFileHashes = GetReadFileHashes();
        List<FileInfo> filesToRead = new List<FileInfo>();

        foreach (var file in allFiles)
        {
            // collect files whose hash has not been seen before
            if (!readFileHashes.Contains(file.ToSHA256Hash().ToHexString()))
                filesToRead.Add(file);
        }

        // read new files
    }
}
Is there anyway I can speed this up?
I believe you can achieve the most significant performance improvement by simply checking the file size first: if the file size does not match, you can skip the entire file and not even open it.
Instead of just saving a list of known hashes, you would also keep a list of known file sizes and only do a content comparison when the file sizes match. When the file size doesn't match, you save yourself from even looking at the file content. A sketch of this size-first filtering is shown at the end of this answer.
Depending on how large your files generally are, one further improvement can be worthwhile:
Either do a binary compare with early abort when the first difference is found (this saves reading the entire file, which can be a very significant improvement if your files generally are large; any hash algorithm would read the entire file, whereas detecting that the first byte is different saves you from reading the rest of the file). If your lookup list likely contains many files of the same size, so that you'd likely have to do a binary comparison against several files, instead consider:
Hashing in blocks of, say, 1 MB each. First check the first block only against the precalculated first-block hash in your lookup. Only compare the second block if the first block is the same; this saves reading beyond the first block in most cases for files that differ. Both of those options are only really worth the effort when your files are large.
I doubt that changing the hashing algorithm itself (e.g. doing a CRC first, as suggested) would make any significant difference. Your bottleneck is likely disk IO, not CPU, so avoiding disk IO is what will give you the most improvement. But as always with performance, measure.
Then, if this is still not enough (and only then), experiment with asynchronous IO (remember though that sequential reads are generally faster than random access, so too much random asynchronous reading can hurt your performance)
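A rough sketch of the size-first filtering mentioned above; the lookup of known sizes to hashes and the SHA-256 helper are illustrative, not an exact drop-in for the asker's code:
// Sketch: only hash a candidate file when its size matches something already seen.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class NewFileFilter
{
    // knownHashesBySize maps known file sizes to the hashes of files with that size,
    // e.g. built with knownFiles.ToLookup(f => f.Size, f => f.Hash) (hypothetical fields).
    public static List<FileInfo> FindNewFiles(IEnumerable<FileInfo> candidates,
                                              ILookup<long, string> knownHashesBySize)
    {
        var newFiles = new List<FileInfo>();
        foreach (var file in candidates)
        {
            // Size not seen before: the file cannot match any known file, skip hashing.
            if (!knownHashesBySize.Contains(file.Length))
            {
                newFiles.Add(file);
                continue;
            }

            // Size matches: only now pay for reading and hashing the content.
            string hash = Sha256Hex(file.FullName);
            if (!knownHashesBySize[file.Length].Contains(hash))
            {
                newFiles.Add(file);
            }
        }
        return newFiles;
    }

    static string Sha256Hex(string path)
    {
        using (var sha = SHA256.Create())
        using (var fs = File.OpenRead(path))
        {
            return BitConverter.ToString(sha.ComputeHash(fs)).Replace("-", "");
        }
    }
}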
Create a file list
Sort the list by filesize
Eliminate files with unique sizes from the list
Now do hashing (a fast hash first might improve performance as well)
Use a data structure for your readFileHashes store that has an efficient lookup capability (hashing or binary search). I think a HashSet or TreeSet would serve you better here.
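For example, a sketch of swapping the List<string> for a HashSet<string>, reusing the asker's extension methods:
// Sketch: HashSet<string> gives amortized O(1) lookups instead of the O(n)
// scan that List<string>.Contains performs for every file.
var readFileHashes = new HashSet<string>(GetReadFileHashes());

foreach (var file in allFiles)
{
    string hash = file.ToSHA256Hash().ToHexString();
    if (!readFileHashes.Contains(hash))
    {
        filesToRead.Add(file);   // not seen before: this is a new file
    }
}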
Use an appropriate checksum (hash) function. SHA256 is a cryptographic hash that is probably overkill. CRC is less computationally expensive; it was originally intended for catching unintentional/random changes (transmission errors), but it is susceptible to changes that are designed/intended to be hidden. What fits the differences between the files you are scanning?
See http://en.wikipedia.org/wiki/List_of_checksum_algorithms#Computational_costs_of_CRCs_vs_Hashes
Would a really simple checksum via sampling (e.g. checksum = (first 10 bytes and last 10 bytes)) work?
I'd do a quick CRC check first, as it is less expensive.
If the CRCs differ, the files are definitely different; only when the CRC matches continue on with a more "reliable" hash test such as SHA to confirm the match.
Your description of the problem still isn't clear enough.
The biggest problem is that you are doing a bunch of hashing. This is guaranteed to be slow.
You might want to try searching for the modification time, which does not change if a filename is changed:
http://msdn.microsoft.com/en-us/library/ms724320(VS.85,loband).aspx
Or you might want to monitor the folder for any new file changes:
http://www.codeguru.com/forum/showthread.php?t=436716
First group the files by file size - this will leave you with smaller groups of files. Now it depends on the group sizes and file sizes. You could just start reading all files in a group in parallel until you find a difference. If there is a difference, split the group into smaller groups having the same value at the current position. If you have information about how the files typically differ, you can use it - start reading at the end, don't read and compare byte by byte if larger clusters change, or whatever else you know about the files. This solution might introduce I/O performance problems if you have to read many files in parallel, causing random disk access.
You could also calculate hash values for all files in each group and compare them. You don't necessarily have to process the whole files at once - just calculate the hash of a few bytes (maybe a 4 KiB cluster, or whatever fits your file sizes) and check if there are already differences. If not, calculate the hashes of the next few bytes. This gives you the possibility to process larger blocks of each file without having to keep one such large block per file in a group in memory.
So it's all about a time-memory (disk I/O-memory) trade-off. You have to find your way between reading all files in a group into memory and comparing them byte by byte (high memory requirement, fast sequential access, but may read too much data) and reading the files byte by byte and comparing only the last byte read (low memory requirement, slow random access, reads only the required data). Further, if the groups are very large, comparing the files byte by byte will become slower - comparing one byte from n files is an O(n) operation - and it might become more efficient to calculate hash values first and then compare only the hash values.
Updated: Definitely DO NOT make file size your only check. If your OS version allows it, use FileInfo.LastWriteTime.
I've implemented something similar for an in-house project compiler/packager. We have over 8k files, so we store the last-modified dates and hash data in a SQL database. Then on subsequent runs we query first against the modified date of any specific file, and only then against the hash data... that way we only calculate new hash data for those files that appear to have been modified.
.NET has a way to check the last-modified date, in the FileInfo class; I suggest you check it out. EDIT: here is the link: LastWriteTime.
Our packager takes about 20 seconds to find out which files have been modified.

Faster MD5 alternative?

I'm working on a program that searches entire drives for a given file. At the moment, I calculate an MD5 hash for the known file and then scan all files recursively, looking for a match.
The only problem is that MD5 is painfully slow on large files. Is there a faster alternative that I can use while retaining a very small probability of false positives?
All code is in C#.
Thank you.
Update
I've read that even MD5 can be pretty quick and that disk I/O should be the limiting factor. That leads me to believe that my code might not be optimal. Are there any problems with this approach?
MD5 md5 = MD5.Create();
StringBuilder sb = new StringBuilder();
try
{
    using (FileStream fs = File.Open(fileName, FileMode.Open, FileAccess.Read))
    {
        foreach (byte b in md5.ComputeHash(fs))
            sb.Append(b.ToString("X2"));
    }
    return sb.ToString();
}
catch (Exception)
{
    return "";
}
I hope you're checking for an MD5 match only if the file size already matches.
Another optimization is to do a quick checksum of the first 1K (or some other arbitrary, but reasonably small number) and make sure those match before working the whole file.
Of course, all this assumes that you're just looking for a match/nomatch decision for a particular file.
Regardless of cryptographic requirements, the possibility of a hash collision exists, so no hashing function can be used to guarantee that two files are identical.
I wrote similar code a while back, which I got to run pretty fast by indexing all the files first and discarding any with a different size. A fast hash comparison (on part of each file) was then performed on the remaining entries (comparing bytes for this step proved to be less useful - many file types have common headers, which have identical bytes at the start of the file). Any files that were left after this stage were then checked using MD5, and finally a byte comparison of the whole file if the MD5 matched, just to ensure that the contents were the same.
Just read the file linearly? It seems pretty pointless to read the entire file, compute an MD5 hash, and then compare the hash.
Reading the file sequentially, a few bytes at a time, would allow you to discard the vast majority of files after reading, say, 4 bytes. And you'd save all the processing overhead of computing a hashing function which doesn't give you anything in your case.
If you already had the hashes for all the files in the drive, it'd make sense to compare them, but if you have to compute them on the fly, there just doesn't seem to be any advantage to the hashing.
Am I missing something here? What does hashing buy you in this case?
First consider what your bottleneck really is: the hash function itself, or rather the disk access speed? If you are bound by the disk, changing the hashing algorithm won't gain you much. From your description I gather that you are always scanning the whole disk to find a match - consider building the index first and then only matching a given hash against the index; this will be much faster.
There is one small problem with using MD5 to compare files: there are known pairs of files which are different but have the same MD5.
This means you can use MD5 to tell if the files are different (if the MD5 is different, the files must be different), but you cannot use MD5 to tell if the files are equal (if the files are equal, the MD5 must be the same, but if the MD5 is equal, the files might or might not be equal).
You should either use a hash function which has not been broken yet (like SHA-1), or (as @SoapBox mentioned) use MD5 only as a fast way to find candidates for a deeper comparison.
References:
http://www.win.tue.nl/hashclash/SoftIntCodeSign/
Use MD5CryptoServiceProvider and BufferedStream
using (FileStream stream = File.OpenRead(filePath))
{
    using (var bufferedStream = new BufferedStream(stream, 1024 * 32))
    {
        using (var md5 = new MD5CryptoServiceProvider())
        {
            byte[] checksum = md5.ComputeHash(bufferedStream);
            return BitConverter.ToString(checksum).Replace("-", String.Empty);
        }
    }
}
