I'm working on a program that searches entire drives for a given file. At the moment, I calculate an MD5 hash for the known file and then scan all files recursively, looking for a match.
The only problem is that MD5 is painfully slow on large files. Is there a faster alternative that I can use while retaining a very small probability of false positives?
All code is in C#.
Thank you.
Update
I've read that even MD5 can be pretty quick and that disk I/O should be the limiting factor. That leads me to believe that my code might not be optimal. Are there any problems with this approach?
MD5 md5 = MD5.Create();
StringBuilder sb = new StringBuilder();
try
{
    using (FileStream fs = File.Open(fileName, FileMode.Open, FileAccess.Read))
    {
        foreach (byte b in md5.ComputeHash(fs))
            sb.Append(b.ToString("X2"));
    }
    return sb.ToString();
}
catch (Exception)
{
    return "";
}
I hope you're checking for an MD5 match only if the file size already matches.
Another optimization is to do a quick checksum of the first 1K (or some other arbitrary, but reasonably small, amount) and make sure those bytes match before working through the whole file.
Of course, all this assumes that you're just looking for a match/nomatch decision for a particular file.
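A rough sketch of the two cheap checks above (size first, then a hash of the first 1 KB, then the full MD5 only if both pass); the class and method names and the 1 KB threshold are illustrative, not anything from the question:

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class QuickFileCompare
{
    // MD5 of just the first 1 KB of a file.
    static byte[] HashPrefix(string path)
    {
        const int PrefixLength = 1024;
        using (var md5 = MD5.Create())
        using (var fs = File.OpenRead(path))
        {
            var buffer = new byte[PrefixLength];
            int read = fs.Read(buffer, 0, buffer.Length);
            return md5.ComputeHash(buffer, 0, read);
        }
    }

    // Size check, then prefix hash, then full MD5 only if both cheap checks pass.
    // In a real search you would compute the known file's size and hashes once up front.
    public static bool ProbablySameFile(string knownPath, string candidatePath)
    {
        if (new FileInfo(knownPath).Length != new FileInfo(candidatePath).Length)
            return false;                                 // different size: cannot match

        if (!HashPrefix(knownPath).SequenceEqual(HashPrefix(candidatePath)))
            return false;                                 // first 1 KB already differs

        using (var md5 = MD5.Create())
        using (var a = File.OpenRead(knownPath))
        using (var b = File.OpenRead(candidatePath))
        {
            byte[] hashA = md5.ComputeHash(a);
            byte[] hashB = md5.ComputeHash(b);            // ComputeHash resets state, so reuse is fine
            return hashA.SequenceEqual(hashB);
        }
    }
}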
Regardless of cryptographic requirements, the possibility of a hash collision exists, so no hashing function can be used to guarantee that two files are identical.
I wrote similar code a while back, which I got to run pretty fast by indexing all the files first and discarding any with a different size. A fast hash comparison (on part of each file) was then performed on the remaining entries (comparing bytes for this step proved less useful; many file types have common headers with identical bytes at the start of the file). Any files left after this stage were then checked using MD5, and finally a byte comparison of the whole file if the MD5 matched, just to ensure that the contents were the same.
Why not just read the file linearly? It seems pretty pointless to read the entire file, compute an MD5 hash, and then compare the hash.
Reading the file sequentially, a few bytes at a time, would allow you to discard the vast majority of files after reading, say, 4 bytes. And you'd save all the processing overhead of computing a hashing function which doesn't give you anything in your case.
If you already had the hashes for all the files in the drive, it'd make sense to compare them, but if you have to compute them on the fly, there just doesn't seem to be any advantage to the hashing.
Am I missing something here? What does hashing buy you in this case?
First consider what your bottleneck really is: the hash function itself, or the disk access speed? If you are bound by the disk, changing the hashing algorithm won't give you much. From your description I gather that you are always scanning the whole disk to find a match; consider building an index first and then matching a given hash only against that index, which will be much faster.
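A minimal sketch of the index idea, assuming a simple full-file MD5 index held in memory (the class name and dictionary layout are my own; a real scan would also need to handle access-denied directories):

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

static class HashIndex
{
    // Build the index once; afterwards a lookup is a dictionary probe
    // instead of a full rescan of the drive.
    public static Dictionary<string, List<string>> Build(string root)
    {
        var index = new Dictionary<string, List<string>>();
        using (var md5 = MD5.Create())
        {
            // Note: GetFiles throws on folders you cannot access; a robust
            // version would walk the tree and skip those.
            foreach (var path in Directory.GetFiles(root, "*", SearchOption.AllDirectories))
            {
                string hash;
                using (var fs = File.OpenRead(path))
                    hash = BitConverter.ToString(md5.ComputeHash(fs));

                List<string> paths;
                if (!index.TryGetValue(hash, out paths))
                    index[hash] = paths = new List<string>();
                paths.Add(path);
            }
        }
        return index;
    }
}

// Usage: build once, then match any known hash against it.
//   var index = HashIndex.Build(@"C:\data");
//   List<string> matches;
//   bool found = index.TryGetValue(knownHash, out matches);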
There is one small problem with using MD5 to compare files: there are known pairs of files which are different but have the same MD5.
This means you can use MD5 to tell if the files are different (if the MD5 is different, the files must be different), but you cannot use MD5 to tell if the files are equal (if the files are equal, the MD5 must be the same, but if the MD5 is equal, the files might or might not be equal).
You should either use a hash function which has not been broken yet (like SHA-1), or (as #SoapBox mentioned) use MD5 only as a fast way to find candidates for a deeper comparison.
References:
http://www.win.tue.nl/hashclash/SoftIntCodeSign/
Use MD5CryptoServiceProvider and BufferedStream
using (FileStream stream = File.OpenRead(filePath))
{
    using (var bufferedStream = new BufferedStream(stream, 1024 * 32))
    {
        var md5 = new MD5CryptoServiceProvider();
        byte[] checksum = md5.ComputeHash(bufferedStream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}
Related
We have many files (Word, Excel) on a local network, and we back these files up every day. Now I want to know whether a file differs from its backup. For example, suppose we have a file "test.docx" and its backup is named "test_backup.docx"; I want to know whether a user has made any change to "test.docx". I would like to compare these two files.
One way is to compare the files word by word, and when a difference is detected, conclude that the file has been updated.
Now my question is: is there any other way, such as a checksum, to detect this difference? And with such a method, can I find where the update occurred?
Thanks.
Have you seen SyncToy?
It sounds like you want to automate a backup copy process and you don't really care about the particular differences, just trying to determine if there are any. My answer is based on that presumption.
Hashing is a good approach to determine if a file should really be backed up but it requires reading the whole file and performing an expensive task on it.
You can pre-process the backup file list by looking at each file's size and timestamps (modified, accessed): if they don't match, back the file up without computing a checksum. If they do match, it's up to you either to assume the files are the same or to hash the contents. I'd first try assuming they are the same whenever all timestamps and the size match the backup copy, and if this heuristic proves wrong, resort to hashing; but find the fastest algorithm possible. Your application of hashes doesn't seem to require high security but rather high performance, so SHA and MD5 would both be overkill with dreadful performance.
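A minimal sketch of that heuristic, assuming a backup copy that preserves the source's last-write time (the method name is illustrative; swap in a hash comparison where the comment indicates if you don't trust timestamps):

using System;
using System.IO;

static class BackupCheck
{
    // Cheap metadata checks only; no file contents are read.
    public static bool NeedsBackup(string sourcePath, string backupPath)
    {
        var src = new FileInfo(sourcePath);
        var bak = new FileInfo(backupPath);

        if (!bak.Exists) return true;                         // no backup yet
        if (src.Length != bak.Length) return true;            // size differs: file changed
        if (src.LastWriteTimeUtc != bak.LastWriteTimeUtc)     // timestamp differs: probably changed
            return true;

        // Metadata matches. Either trust it (return false) or, if the
        // heuristic proves unreliable, compare hashes of both files here.
        return false;
    }
}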
Here's how we compute a signature of a file:
public static string Signature(this FileInfo input)
{
    // Hash the stream directly rather than reading the whole file into memory,
    // and dispose the stream and hash object even if an exception is thrown.
    using (var cryptoTransform = new MD5CryptoServiceProvider())
    using (var fs = new FileStream(input.FullName, FileMode.Open, FileAccess.Read))
    {
        return BitConverter.ToString(cryptoTransform.ComputeHash(fs)).Replace("-", "");
    }
}
Then we compare that signature against the signature of the previous version to detect changes.
Possible Duplicate:
Reversing an MD5 Hash
Given this method in C#:
public string CalculateFileHash(string filePaths) {
    var csp = new MD5CryptoServiceProvider();
    var pathBytes = csp.ComputeHash(Encoding.UTF8.GetBytes(filePaths));
    return BitConverter.ToUInt64(pathBytes, 0).ToString();
}
how would one reverse this process with a "DecodeFileHash" method?
var fileQuery = "fileone.css,filetwo.css,file3.css";
var hashedQuery = CalculateFileHash(fileQuery); // e.g. "23948759234"
var decodedQuery = DecodeFileHash(hashedQuery); // "fileone.css,filetwo.css,file3.css"
where decodedQuery == fileQuery in the end.
Is this even possible? If it isn't possible, would there be any way to generate a hash that I could easily decode?
Edit: So just to be clear, I just want to compress the variable "fileQuery" and decompress fileQuery to determine what it originally was. Any suggestions for solving that problem since hashing/decoding is out?
Edit Again: just doing a base64 encode/decode sounds like the optimal solution then.
public string EncodeTo64(string toEncode) {
    var toEncodeAsBytes = Encoding.ASCII.GetBytes(toEncode);
    var returnValue = System.Convert.ToBase64String(toEncodeAsBytes);
    return returnValue;
}

public string DecodeFrom64(string encodedData) {
    var encodedDataAsBytes = System.Convert.FromBase64String(encodedData);
    var returnValue = Encoding.ASCII.GetString(encodedDataAsBytes);
    return returnValue;
}
Impossible. By definition and by design, hashes cannot be reversed to recover the plain text or original input.
It sounds like you are actually trying to compress the files. If that is the case, here is a simple method to do so using GZip:
public static byte[] Compress( byte[] data )
{
    var output = new MemoryStream();
    using ( var gzip = new GZipStream( output, CompressionMode.Compress, true ) )
    {
        gzip.Write( data, 0, data.Length );
        gzip.Close();
    }
    return output.ToArray();
}
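For completeness, a matching decompress helper in the same style (this is my addition, not part of the original answer; it assumes the same System.IO and System.IO.Compression namespaces):

public static byte[] Decompress( byte[] data )
{
    var output = new MemoryStream();
    using ( var input = new MemoryStream( data ) )
    using ( var gzip = new GZipStream( input, CompressionMode.Decompress ) )
    {
        // Copy the decompressed stream into the output buffer.
        var buffer = new byte[4096];
        int read;
        while ( ( read = gzip.Read( buffer, 0, buffer.Length ) ) > 0 )
            output.Write( buffer, 0, read );
    }
    return output.ToArray();
}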
A hash is derived from the original information, but it does not contain the original information. If you want a shorter value that hides the original information, but can be resolved to the original value your options are fairly limited:
Compress the original information. If you need a string, then your original information would have to be fairly large in order for the compressed, base-64-encoded version to not be even bigger than the original data.
Encrypt the original information - this is more secure than just compressing it, and can be combined with compression, but it's also probably going to be larger than the original information.
Store the original information somewhere and return a lookup key.
If you want to be able to get the data back, you want compression, not hashing.
What you want to do is Encrypt and Decrypt...
Not Hash and Unhash, which, as @Thomas pointed out, is impossible.
Hashes are typically defeated using rainbow tables or some other data set that includes something which produces the same hash; it's not guaranteed to be the input value, just some value that produces the same output from the hashing algorithm.
Jeff Atwood has some good code for understanding encryption here, if that's useful to you:
http://www.codeproject.com/KB/security/SimpleEncryption.aspx
A cryptographic hash is by definition not reversible with typical amounts of computation power. It's usually not even possible to find any input which has the same hash as your original input.
Getting back the original input is mathematically impossible if there are more than 2^n different inputs, where n is the bit length of the hash (128 for MD5). Look up the pigeonhole principle.
A hash is not a lossless compression function.
A cryptographic hash, like MD5, is designed to be a one-way function; that is, it is computationally infeasible to derive the source data from which a given hash was computed. MD5, though, hasn't been considered secure for some time, due to weaknesses that have been discovered:
Wikipedia on MD5 Security
MD5 Considered Harmful
Another weakness of MD5 is that, due to its relatively small size, large rainbow tables have been published that let you look up a given MD5 hash and get back a source input that collides with the specified hash value.
I'm using the following code to do a checksum of a file which works fine. But when I generate a hash for a large file, say 2 GB, it is quite slow. How can I improve the performance of this code?
fs = new FileStream(txtFile.Text, FileMode.Open);
formatted = string.Empty;
using (SHA1Managed sha1 = new SHA1Managed())
{
    byte[] hash = sha1.ComputeHash(fs);
    foreach (byte b in hash)
    {
        formatted += b.ToString("X2");
    }
}
fs.Close();
Update:
System:
OS: Win 7 64bit, CPU: I5 750, RAM: 4GB, HDD: 7200rpm
Tests:
Test1 = 59.895 seconds
Test2 = 59.94 seconds
The first question is what you need this checksum for. If you don't need the cryptographic properties, then a non-cryptographic hash, or a hash that is less cryptographically secure (MD5 being "broken" doesn't prevent it being a good hash, nor still strong enough for some uses), is likely to be more performant. You could make your own hash by reading a subset of the data (I'd advise making this subset work in 4096-byte chunks of the underlying file, as that matches the buffer size used by SHA1Managed and allows a faster chunk read than you'd get reading, say, every X bytes for some value of X).
Edit: An upvote reminded me of this answer, and also reminded me that I have since written SpookilySharp, which provides high-performance 32-, 64- and 128-bit hashes that are not cryptographic but are good for providing checksums against errors, storage, etc. (This in turn reminded me that I should update it to support .NET Core.)
Of course, if you want the SHA-1 of the file to interoperate with something else, you are stuck.
I would experiment with different buffer sizes, as increasing the size of the FileStream's buffer can increase speed at the cost of extra memory. I would advise a whole multiple of 4096 (which is the default, incidentally), as SHA1Managed will ask for 4096-byte chunks at a time; this way there is no case where FileStream either returns less than was asked for (allowed, but sometimes suboptimal) or has to do more than one copy at a time.
Well, is it IO-bound or CPU-bound? If it's CPU-bound, there's not a lot we can do about that.
It's possible that opening the FileStream with different parameters would allow the file system to do more buffering or assume that you're going to read the file sequentially - but I doubt that will help very much. (It's certainly not going to do a lot if it's CPU-bound.)
How slow is "quite slow" anyway? Compared with, say, copying the file?
If you have a lot of memory (e.g. 4GB or more) how long does it take to hash the file a second time, when it may be in the file system cache?
First of all, have you measured "quite slow"? From this site, SHA-1 has about half the speed of MD5 with about 100 MB/s (depending on the CPU), so 2 GB would take about 20 seconds to hash. Also, note that if you're using a slow HDD, this might be your real bottleneck as 30-70 MB/s aren't unusual.
To speed things up, you might not hash the whole file but only the first X KB or representative parts of it (the parts that will most likely differ). If your files aren't too similar, this shouldn't cause duplicates.
First: SHA-1 file hashing should be I/O bound on non-ancient CPUs, and an i5 certainly doesn't qualify as ancient. Of course it depends on the implementation of SHA-1, but I doubt SHA1Managed is über-slow.
Next, 60 seconds for 2 GB of data is ~34 MB/s; that's slow for hard disk reads, as even a 2.5" laptop disk can read faster than that. Assuming the hard drive is internal (no USB2/whatever or network bottleneck), and there's not a lot of other disk I/O activity, I'd be surprised to see less than 60 MB/s read from a modern drive.
My guess would be that ComputeHash() uses a small buffer internally. Try manually reading/hashing, so you can specify a larger buffer (64kb or even larger) to increase throughput. You could also move to async processing so disk-read and compute can be overlapped.
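A sketch of that manual read/hash loop with a larger buffer; the 64 KB size and the method name are arbitrary choices, and it uses the standard TransformBlock/TransformFinalBlock pattern:

using System;
using System.IO;
using System.Security.Cryptography;

static string ComputeSha1(string path)
{
    const int BufferSize = 64 * 1024;                  // read in 64 KB chunks
    using (var sha1 = SHA1.Create())
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                   FileShare.Read, BufferSize))
    {
        var buffer = new byte[BufferSize];
        int read;
        while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
            sha1.TransformBlock(buffer, 0, read, null, 0);   // feed the hash incrementally

        sha1.TransformFinalBlock(buffer, 0, 0);              // finalize
        return BitConverter.ToString(sha1.Hash).Replace("-", "");
    }
}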
Neither is SHA1Managed the best choice for large input strings, nor is Byte.ToString("X2") the fastest way to convert the byte array to a string.
I just finished an article with detailed benchmarks on that topic. It compares SHA1Managed, SHA1CryptoServiceProvider, SHA1Cng and also considers SHA1.Create() on different length input strings.
In the second part, it shows 5 different methods of converting the byte array to string where Byte.ToString("X2") is the worst.
My largest input was only 10,000 characters, so you may want to run my benchmarks on your 2 GB file. It would be quite interesting to see if and how that changes the numbers.
http://wintermute79.wordpress.com/2014/10/10/c-sha-1-benchmark/
However, for file integrity checks you are better off using MD5 as you already wrote.
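As one illustration of a faster byte-to-hex conversion than Byte.ToString("X2"), here is a simple lookup-table approach (one of several possibilities, and not taken from the linked article):

static string ToHexFast(byte[] bytes)
{
    const string digits = "0123456789abcdef";
    var chars = new char[bytes.Length * 2];
    for (int i = 0; i < bytes.Length; i++)
    {
        chars[2 * i]     = digits[bytes[i] >> 4];     // high nibble
        chars[2 * i + 1] = digits[bytes[i] & 0xF];    // low nibble
    }
    return new string(chars);
}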
You can use the following logic to get a SHA-1 value. I was using it in Java.
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class sha1Calculate {

    public static void main(String[] args) throws Exception
    {
        File file = new File("D:\\Android Links.txt");

        try {
            // Read the whole file into memory.
            FileInputStream input = new FileInputStream(file);
            ByteArrayOutputStream output = new ByteArrayOutputStream();
            byte[] buffer = new byte[65536];
            int l;
            while ((l = input.read(buffer)) > 0)
                output.write(buffer, 0, l);
            input.close();
            output.close();
            byte[] data = output.toByteArray();

            // Compute the SHA-1 digest of the file contents.
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            digest.update(data, 0, data.length);
            byte[] bytes = digest.digest();

            // Format the digest as a hex string.
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes)
            {
                sb.append(String.format("%02X", b));
            }
            System.out.println("Digest (in hex format): " + sb.toString());

        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (NoSuchAlgorithmException e) {
            e.printStackTrace();
        }
    }
}
I have some code that is really slow. I knew it would be, and now it is. Basically, I am reading files from a bunch of directories. The file names change but the data does not. To determine whether I have already read a file, I hash its bytes and compare that to a list of hashes of already processed files. There are about 1000 files in each directory, and figuring out what's new in each directory takes a good minute or so (and then the processing starts). Here's the basic code:
public static class ProgramExtensions
{
    public static byte[] ToSHA256Hash(this FileInfo file)
    {
        using (FileStream fs = new FileStream(file.FullName, FileMode.Open))
        {
            using (SHA256 hasher = new SHA256Managed())
            {
                return hasher.ComputeHash(fs);
            }
        }
    }

    public static string ToHexString(this byte[] p)
    {
        char[] c = new char[p.Length * 2 + 2];
        byte b;
        c[0] = '0'; c[1] = 'x';
        for (int y = 0, x = 2; y < p.Length; ++y, ++x)
        {
            b = ((byte)(p[y] >> 4));
            c[x] = (char)(b > 9 ? b + 0x37 : b + 0x30);
            b = ((byte)(p[y] & 0xF));
            c[++x] = (char)(b > 9 ? b + 0x37 : b + 0x30);
        }
        return new string(c);
    }
}

class Program
{
    static void Main(string[] args)
    {
        var allFiles = new DirectoryInfo("c:\\temp").GetFiles("*.*");

        List<string> readFileHashes = GetReadFileHashes();
        List<FileInfo> filesToRead = new List<FileInfo>();

        foreach (var file in allFiles)
        {
            // Only queue files whose hash has NOT been seen before.
            if (!readFileHashes.Contains(file.ToSHA256Hash().ToHexString()))
                filesToRead.Add(file);
        }

        //read new files
    }
}
Is there any way I can speed this up?
I believe you can achieve the most significant performance improvement by simply checking the file size first: if the file size does not match, you can skip the entire file and not even open it.
Instead of just saving a list of known hashes, you would also keep a list of known file sizes and only do a content comparison when the file sizes match. When the file size doesn't match, you save yourself from even looking at the file content.
Depending on how large your files generally are, a further improvement can be worthwhile:
Either do a binary compare with early abort when the first byte differs (this saves reading the entire file, which can be a very significant improvement if your files are generally large; any hash algorithm would read the entire file, whereas detecting that the first byte differs saves you from reading the rest). If your lookup file list likely contains many files of the same size, so that you'd have to do a binary comparison against several files instead, consider:
Hashing in blocks of, say, 1 MB each. First check the first block only against the precalculated first-block hash in your lookup; only compare the second block if the first block is the same. This saves reading beyond the first block in most cases when files differ. Both options are only really worth the effort when your files are large.
I doubt that changing the hashing algorithm itself (e.g first check doing a CRC as suggested) would make any significant difference. Your bottleneck is likely disk IO, not CPU so avoiding disk IO is what will give you the most improvement. But as always in performance, do measure.
Then, if this is still not enough (and only then), experiment with asynchronous IO (remember though that sequential reads are generally faster than random access, so too much random asynchronous reading can hurt your performance)
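A rough sketch of the size-first filtering described above (the names and the choice of SHA-256 mirror the question; a real version would precompute the known sizes and hashes once):

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

static class NewFileFinder
{
    public static List<FileInfo> FindUnknown(
        IEnumerable<FileInfo> files,
        HashSet<long> knownSizes,
        HashSet<string> knownHashes)
    {
        var unknown = new List<FileInfo>();
        using (var sha = SHA256.Create())
        {
            foreach (var file in files)
            {
                // Cheap test first: a size we've never seen means the file must be new.
                if (!knownSizes.Contains(file.Length))
                {
                    unknown.Add(file);
                    continue;
                }

                // Only now pay for reading and hashing the file.
                string hash;
                using (var fs = file.OpenRead())
                    hash = BitConverter.ToString(sha.ComputeHash(fs));

                if (!knownHashes.Contains(hash))
                    unknown.Add(file);
            }
        }
        return unknown;
    }
}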
Create a file list
Sort the list by filesize
Eliminate files with unique sizes from the list
Now do hashing (a fast hash first might improve performance as well)
Use a data structure for your readFileHashes store that has an efficient search capability (hashing or binary search). I think HashSet or TreeSet would serve you better here.
Use an appropriate checksum (hash) function. SHA256 is a cryptographic hash that is probably overkill. CRC is less computationally expensive; it was originally intended for catching unintentional/random changes (transmission errors), but it is susceptible to changes that are designed/intended to be hidden. What fits the differences between the files you are scanning?
See http://en.wikipedia.org/wiki/List_of_checksum_algorithms#Computational_costs_of_CRCs_vs_Hashes
Would a really simple checksum via sampling (e.g. checksum = (first 10 bytes and last 10 bytes)) work?
I'd do a quick CRC check first, as it is less expensive.
If the CRC does not match, continue on with a more "reliable" hash test such as SHA.
Your description of the problem still isn't clear enough.
The biggest problem is that you are doing a bunch of hashing. This is guaranteed to be slow.
You might want to try searching for the modification time, which does not change if a filename is changed:
http://msdn.microsoft.com/en-us/library/ms724320(VS.85,loband).aspx
Or you might want to monitor the folder for any new file changes:
http://www.codeguru.com/forum/showthread.php?t=436716
First group the files by file size; this will leave you with smaller groups of files. What to do next depends on the group size and the file sizes. You could just start reading all files in a group in parallel until you find a difference. If there is a difference, split the group into smaller groups having the same value at the current position. If you have information about how the files typically differ, you can use it: start reading at the end, don't read and compare byte by byte if larger clusters change, or whatever else you know about the files. This solution might introduce I/O performance problems if you have to read many files in parallel, causing random disk access.
You could also calculate hash values for all files in each group and compare them. You don't necessarily have to process the whole files at once: just calculate the hash of a few bytes (maybe a 4 KiB cluster, or whatever fits your file sizes) and check whether there are already differences. If not, calculate the hashes of the next few bytes. This gives you the possibility to process larger blocks of each file without having to keep one such large block per file in a group in memory.
So it's all about a time-memory (disk I/O-memory) trade-off. You have to find your way between reading all files in a group into memory and comparing them byte by byte (high memory requirement, fast sequential access, but may read too much data) and reading the files byte by byte and comparing only the last byte read (low memory requirement, slow random access, reads only required data). Further, if the groups are very large, comparing the files byte by byte will become slower (comparing one byte from n files is an O(n) operation), and it might become more efficient to calculate hash values first and then compare only the hash values.
Updated: definitely DO NOT make file size your only check. If your OS version allows it, use FileInfo.LastWriteTime.
I've implemented something similar for an in-house project compiler/packager. We have over 8k files, so we store the last modified dates and hash data in a SQL database. Then on subsequent runs we query first against the modified date of any specific file, and only then against the hash data; that way we only calculate new hash data for those files that appear to have been modified.
.NET has a way to check the last modified date, in the FileInfo class; I suggest you check it out. EDIT: here is the link: LastWriteTime.
Our packager takes about 20 secs to find out what files have been modified.
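A minimal in-memory sketch of that idea (the answer uses a SQL database; here a dictionary keyed by path stands in for it, and the class name is my own):

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

class HashCache
{
    // path -> (last write time when we hashed it, the hash we computed then)
    private readonly Dictionary<string, KeyValuePair<DateTime, string>> _cache =
        new Dictionary<string, KeyValuePair<DateTime, string>>();

    public string GetHash(string path)
    {
        DateTime lastWrite = File.GetLastWriteTimeUtc(path);

        KeyValuePair<DateTime, string> entry;
        if (_cache.TryGetValue(path, out entry) && entry.Key == lastWrite)
            return entry.Value;               // unchanged since last run: reuse the stored hash

        using (var sha = SHA256.Create())
        using (var fs = File.OpenRead(path))
        {
            string hash = BitConverter.ToString(sha.ComputeHash(fs));
            _cache[path] = new KeyValuePair<DateTime, string>(lastWrite, hash);
            return hash;
        }
    }
}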
To interact with an external data feed I need to pass a rolling security key which has been MD5 hashed (every day we need to generate a new MD5 hashed key).
I'm weighing up whether or not to do it every time we call the external feed. I need to hash a string of about 10 characters for the feed.
It's for an ASP.NET (C# / .NET 3.5) site and the feed is used on pretty much every page. Would I be best off generating the hash once a day and then storing it in the application cache, and taking the memory hit, or generating it on each request?
The only acceptable basis for optimizations is data. Measure generating this inline and measure caching it.
My high-end workstation can calculate well over 100k MD5 hashes of a 10-byte data segment in a second. There would be zero benefit from caching this for me and I bet it's the same for you.
Generate some sample data. Well, a lot of it. Compute the MD5 of the sample data. Measure the time it takes. Decide for yourself.
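If you want to measure it for yourself, a throwaway benchmark along these lines (the payload and iteration count are arbitrary) will settle it:

using System;
using System.Diagnostics;
using System.Security.Cryptography;
using System.Text;

class Md5Benchmark
{
    static void Main()
    {
        byte[] payload = Encoding.UTF8.GetBytes("0123456789");   // ~10-byte key material
        const int iterations = 100000;

        using (var md5 = MD5.Create())
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++)
                md5.ComputeHash(payload);
            sw.Stop();

            Console.WriteLine("{0} hashes in {1} ms", iterations, sw.ElapsedMilliseconds);
        }
    }
}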
Calculate the time complexity of the algorithm!
Look at the following code:
public string GetMD5Hash(string input)
{
    System.Security.Cryptography.MD5CryptoServiceProvider x = new System.Security.Cryptography.MD5CryptoServiceProvider();
    byte[] bs = System.Text.Encoding.UTF8.GetBytes(input);
    bs = x.ComputeHash(bs);
    System.Text.StringBuilder s = new System.Text.StringBuilder();
    foreach (byte b in bs)
    {
        s.Append(b.ToString("x2").ToLower()); // "x2" already yields lowercase hex, so ToLower is redundant
    }
    string password = s.ToString();
    return password;
}
If we were to calculate the time complexity we would get T = 11 + 2n; however, this is just "what we see", i.e. ToLower might do some heavy work we don't know about. But from this point we can see that this algorithm is O(n) in all cases, meaning that time grows as the data grows.
Also, to address the cache issue, I'd rather keep my "heavy" work in memory, since memory is less expensive compared to CPU usage.
If it will be the same for a given day, caching it might be an idea. You could even set the cache to expire after 24 hours and write code to regenerate the hash when the cache expires.
Using the ASP.NET cache is very easy, so I don't see why you shouldn't cache the key.
Storing the key in cache may even save some memory since you can reuse it instead of creating a new one for each request.
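For reference, a sketch of that caching approach using the ASP.NET cache (System.Web.Caching); the cache key, the key-derivation string, and the midnight expiry are illustrative assumptions, not anything from the question:

using System;
using System.Security.Cryptography;
using System.Text;
using System.Web;
using System.Web.Caching;

public static class DailyKey
{
    public static string GetHashedKey()
    {
        const string cacheKey = "DailyFeedKeyHash";

        var cached = HttpRuntime.Cache[cacheKey] as string;
        if (cached != null)
            return cached;                               // already computed for today

        // Hypothetical key derivation: secret + today's date.
        string hash = ComputeMd5("secret-" + DateTime.UtcNow.ToString("yyyyMMdd"));

        // Expire at midnight UTC so the next request regenerates tomorrow's key.
        HttpRuntime.Cache.Insert(cacheKey, hash, null,
            DateTime.UtcNow.Date.AddDays(1), Cache.NoSlidingExpiration);

        return hash;
    }

    private static string ComputeMd5(string input)
    {
        using (var md5 = MD5.Create())
            return BitConverter.ToString(md5.ComputeHash(Encoding.UTF8.GetBytes(input)))
                               .Replace("-", "").ToLowerInvariant();
    }
}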