I'm writing unit tests around an HTML to PDF process and have a set of sample input HTML files and a set of PDF representing what the expected result would be.
I'd like to compare these to determine that the process has generated the correct output.
Obviously PDF files have some non-deterministic components in them so I can't do a straight up binary compare. I don't particularly want to delve into parsing the PDF output, so I thought it might be neat to just check how much the files differ by (and have the test pass if they differ by, say, less than 1%).
I can't simply count the differing bytes in the same array location as it seems there can be slight size differences in the output, so things will be offset slightly differently in each file.
So, the question is, is there a tried and tested algorithm for determining how much the general content of 2 large byte arrays differ?
Thanks,
Steve.
Edit:
Attaching an image to illustrate how the files generated are broadly the same...
OK, so I've found a method that seems to work pretty well. It's not necessarily hugely efficient, but each test still runs in about half a second, so it's fine for my purpose. Posting it here in case it helps someone else out.
It basically just sums the bytes in each array and calculates the difference:
public static bool IsAnalogousTo(this byte[] left, byte[] right, int tolerance)
{
long leftSum = 0;
foreach (var b in left)
{
leftSum += b;
}
long rightSum = 0;
foreach (var b in right)
{
rightSum += b;
}
return Math.Abs(leftSum - rightSum) < left.Length / tolerance;
}
My thinking is that the files are ~115k in size - if the sum of all the bytes in files of that size is less than ~15k then that means less than one bit in every ten bytes (the tolerance parameter) is different.
This works well for what I want. For other purposes where more accuracy is required, it would probably better to do it in chunks to ensure regions of the file are similar.
Of course, on a small set of data, this would be useless. It would say that [10, 20, 30, 40] and [0, 0, 0, 100] are the same, but on 115,000 bytes of reasonably structured data such as a PDF, I think this is probably acceptable.
Related
[Apologies for the wall of text, I'm not hugely sure if the problem is my misunderstanding of writing values in big-endian order, converting the source value into bytes, or a misunderstanding of a file format, so I've tried to include enough background information that someone can tell me of my stupidity.]
Some time ago I wrote some code that would read and write Adobe Photoshop Color Swatch files (*.aco). I wrote the code, I tested it with unit tests, and via the simple expedient of actually opening a generated swatch file in an older version of Photoshop, and all seemed well.
In an ACO file (sticking to RGB color space only for the purposes of this question), each color channel is a uint16 using the full range of 0-65535 for the value, and is written to files in big-endian order. In the original versions of my code, I was using System.Drawing.Color which of course only does the range 0-255, so I would multiple and divide by 256 as appropriate.
This is an example of how I would write the values to a file
private void WriteInt16(Stream stream, short value)
{
stream.WriteByte((byte)(value >> 8));
stream.WriteByte((byte)(value >> 0));
}
I break an int16 into two bytes, and write them using big-endian order. Simple enough I thought.
After the code was in production use, I received bug reports starting that third party applications that supported aco files were reading palettes created by my code as pure black. I still found Photoshop itself was very happy with the files, but I did find a third party application that turned them into black. So the problem definitely existed.
The screenshot below is from a diagnostic tool I ended up writing trying to work out what was going on. (I've munged the screenshot a bit to try and make it fit in this narrow width, but it should show enough detail)
The block on the left is an original aco file. The block on the right is the same file, after it had been re-saved using my code. The coloring is fairly irrelevant for this post, save that the pale salmon is the RGB color data (plus a forth dummy channel).
From looking at this image, the problem was readily discoverable. Due to my conversion of (byte)*256, this meant that one byte was always zero, and when written out as zero it seemed that every other third party application in existence read the file wrong, except for Photoshop.
Newer versions of the code look like this
public static void WriteBigEndian(this Stream stream, ushort value)
{
byte[] buffer;
buffer = new byte[2];
buffer[0] = (byte)(value >> 8);
buffer[1] = (byte)value;
if (buffer[1] == 0 && buffer[0] != 0)
{
buffer[1] = buffer[0];
}
stream.Write(buffer, 0, 2);
}
Where I "fix" the problem changing the 0 byte to match the non-zero byte. I also store and use the original uint16 values to avoid lose of precision by converting them to bytes and back, but that's beside the point.
My question (finally) is WHY does this work. I don't understand the fix I stumbled into (this was my first time at working with big-endian ordering). Does big-endian order for uint16 not support a non-zero byte followed by a zero byte, have I misunderstood how to break down a uint16 into its component bytes, am I misunderstanding the ACO file format, or is it just the case that all of these programs read the files wrong (that seems somewhat unlikely!). Or, did I diagnose the issue incorrectly in the first place?
I've used the same conversion / serialization code in other places and its worrying that I'm propagating a fix for an issue that might only be relevant in the original ACO scenario, thus introducing other errors elsewhere.
I have tried searching for answers, but come up with a blank. I've actually avoided posting this question for months with my uncertainty over the root cause, but I would like an answer if someone knows.
I need to compare the contents of very large files. Speed of the program is important. I need 100% match.I read a lot of information but did not find the optimal solution. I am haveconsidering two choices and both problems.
Compare whole file byte by byte - not fast enough for large files.
File Comparison using Hashes - not 100% match the two files with the same hash.
What would you suggest? Maybe I could make use of threads? Could MemoryMappedFile be helpful?
If you really need to to guarantee 100% that the files are 100% identical, then you need to do a byte-to-byte comparison. That's just entailed in the problem - the only hashing method with 0% risk of false matching is the identity function!
What we're left with is short-cuts that can quickly give us quick answers to let us skip the byte-for-byte comparison some of the time.
As a rule, the only short-cut on proving equality is proving identity. In OO code that would be showing two objects where in fact the same object. The closest thing in files is if a binding or NTFS junction meant two paths were to the same file. This happens so rarely that unless the nature of the work made it more usual than normal, it's not going to be a net-gain to check on.
So we're left short-cutting on finding mis-matches. Does nothing to increase our passes, but makes our fails faster:
Different size, not byte-for-byte equal. Simples!
If you will examine the same file more than once, then hash it and record the hash. Different hash, guaranteed not equal. The reduction in files that need a one-to-one comparison is massive.
Many file formats are likely to have some areas in common. Particularly the first bytes for many formats tend to be "magic numbers", headers etc. Either skip them, or skip then and then check last (if there is a chance of them being different but it's low).
Then there's the matter of making the actual comparison as fast as possible. Loading batches of 4 octets at a time into an integer and doing integer comparison will often be faster than octet-per-octet.
Threading can help. One way is to split the actual comparison of the file into more than one operation, but if possible a bigger gain will be found by doing completely different comparisons in different threads. I'd need to know a bit more about just what you are doing to advise much, but the main thing is to make sure the output of the tests is thread-safe.
If you do have more than one thread examining the same files, have them work far from each other. E.g. if you have four threads, you could split the file in four, or you could have one take byte 0, 4, 8 while another takes byte 1, 5, 9, etc. (or 4-octet group 0, 4, 8 etc). The latter is much more likely to have false sharing issues than the former, so don't do that.
Edit:
It also depends on just what you're doing with the files. You say you need 100% certainty, so this bit doesn't apply to you, but it's worth adding for the more general problem that if the cost of a false-positive is a waste of resources, time or memory rather than an actual failure, then reducing it through a fuzzy short-cut could be a net-win and it can be worth profiling to see if this is the case.
If you are using a hash to speed things (it can at least find some definite mis-matches faster), then Bob Jenkins' Spooky Hash is a good choice; it's not cryptographically secure, but if that's not your purpose it creates as 128-bit hash very quickly (much faster than a cryptographic hash, or even than the approaches taken with many GetHashCode() implementations) that are extremely good at not having accidental collisions (the sort of deliberate collisions cryptographic hashes avoid is another matter). I implemented it for .Net and put it on nuget because nobody else had when I found myself wanting to use it.
Serial Compare
Test File Size(s): 118 MB
Duration: 579 ms
Equal? true
static bool Compare(string filePath1, string filePath2)
{
using (FileStream file = File.OpenRead(filePath1))
{
using (FileStream file2 = File.OpenRead(filePath2))
{
if (file.Length != file2.Length)
{
return false;
}
int count;
const int size = 0x1000000;
var buffer = new byte[size];
var buffer2 = new byte[size];
while ((count = file.Read(buffer, 0, buffer.Length)) > 0)
{
file2.Read(buffer2, 0, buffer2.Length);
for (int i = 0; i < count; i++)
{
if (buffer[i] != buffer2[i])
{
return false;
}
}
}
}
}
return true;
}
Parallel Compare
Test File Size(s): 118 MB
Duration: 340 ms
Equal? true
static bool Compare2(string filePath1, string filePath2)
{
bool success = true;
var info = new FileInfo(filePath1);
var info2 = new FileInfo(filePath2);
if (info.Length != info2.Length)
{
return false;
}
long fileLength = info.Length;
const int size = 0x1000000;
Parallel.For(0, fileLength / size, x =>
{
var start = (int)x * size;
if (start >= fileLength)
{
return;
}
using (FileStream file = File.OpenRead(filePath1))
{
using (FileStream file2 = File.OpenRead(filePath2))
{
var buffer = new byte[size];
var buffer2 = new byte[size];
file.Position = start;
file2.Position = start;
int count = file.Read(buffer, 0, size);
file2.Read(buffer2, 0, size);
for (int i = 0; i < count; i++)
{
if (buffer[i] != buffer2[i])
{
success = false;
return;
}
}
}
}
});
return success;
}
MD5 Compare
Test File Size(s): 118 MB
Duration: 702 ms
Equal? true
static bool Compare3(string filePath1, string filePath2)
{
byte[] hash1 = GenerateHash(filePath1);
byte[] hash2 = GenerateHash(filePath2);
if (hash1.Length != hash2.Length)
{
return false;
}
for (int i = 0; i < hash1.Length; i++)
{
if (hash1[i] != hash2[i])
{
return false;
}
}
return true;
}
static byte[] GenerateHash(string filePath)
{
MD5 crypto = MD5.Create();
using (FileStream stream = File.OpenRead(filePath))
{
return crypto.ComputeHash(stream);
}
}
tl;dr Compare byte segments in parallel to determine if two files are equal.
Why not both?
Compare with hashes for the first pass, then return to conflicts and perform the byte-by-byte comparison. This allows maximal speed with guaranteed 100% match confidence.
There's no avoiding doing byte-for-byte comparisons if you want perfect comparisons (The file still has to be read byte-for-byte to do any hashing), so the issue is how you're reading and comparing the data.
So a there are two things you'll want to address:
Concurrency - Make sure you're reading data at the same time you're checking it.
Buffer Size - Reading the file 1 byte at a time is going to be slow, make sure you're reading it into a decent size buffer (about 8MB should do nicely on very large files)
The objective is to make sure you can do your comparison as fast as the hard disk can read the data, and that you're always reading data with no delays. If you're doing everything as fast as the data can be read from the drive then that's as fast as it is possible to do it since the hard disk read speed becomes the bottleneck.
Ultimately a hash is going to read the file byte by byte anyway ... so if you are looking for an accurate comparison then you might as well do the comparison. Can you give some more background on what you are trying to accomplish? How big are the 'big' files? How often do you have to compare them?
If you have a large set of files and you are trying to identify duplicates, I would try to break down the work by order of expense.
I might try something like the following:
1) group files by size. Files with different sizes clearly can't be identical. This information is very inexpensive to retrieve. If each group only contains 1 file, you are done, no dupes, otherwise proceed to step 2.
2) Within each size group generate a hash of the first n bytes of the file. Identify a reasonable n that will likely detect differences. Many files have identical headers, so you wan't to make sure n is greater that that header length. Group by the hashes, if each group contains 1 file, you are done (no dupes within this group), otherwise proceed to step 3.
3) At this point you are likely going to have to do more expensive work like generate a hash of the whole file, or do a byte by byte comparison. Depending on the number of files, and the nature of the file contents, you might try different approaches. Hopefully, the previous groupings will have narrowed down likely duplicates so that the number of files that you actually have to fully scan will be very small.
To calculate a hash, the entire file needs to be read.
How about opening both files together, and comparing them chunk by chunk?
Pseudo code:
open file A
open file B
while file A has more data
{
if next chunk of A != next chunk of B return false
}
return true
This way you are not loading too much together, and not reading in the entire file if you find a mismatch earlier. You should set up a test that varies the chunk size to determine the right size for optimal performance.
How to decode (shifting and xoring) a massive byte array in a fast way?
I need it for a file viewer application that opens the archive file and decodes the files inside and display them to the user. The files are encrypted with a byte shifting and xoring system. It is impossible for me to change the algorithm. Currently, I just read all the bytes and then run the Decode function on them.
The decode function that I currently use:
byte[] DecodeVOQ(byte[] EncodedBytes)
{
for (int i = 0; i < EncodedBytes.Length; i++)
{
EncodedBytes[i] ^= (byte)194;
EncodedBytes[i] = (byte)((EncodedBytes[i] << 4) | (EncodedBytes[i] >> 4));
}
return EncodedBytes;
}
Edit: I found out that the real performance problem is with displaying the text. Reading + Decoding is pretty fast.
One possible optimization would be to precompute the output for any input byte. So you'd have:
private static byte[] DecodedBytes = PrecomputeDecodedBytes();
public static byte[] DecodeVOQ(byte[] data)
{
for (int i = 0; i < data.Length; i++)
{
data[i] = DecodedBytes[data[i]];
}
return data;
}
It's quite possible that that will be slower than your existing bitshifting algorithm though. EDIT: I've just tried comparing this with the original bitshift but using a temporary local variable: they're about the same.
Have you benchmarked the current performance? Is it definitely too slow? In particular, loading the file from just about any storage medium will be much much slower than the cost of decoding. I've just tried this on my laptop - for 200MB of data, it takes about half a second. (EDIT: With Marcelo's answer, it takes under half a second.) Is that really too slow?
Would you be happy to use more than one processor? It's an embarrassingly parallelizable routine, after all. If you're using .NET 4, the TPL may well make this pretty simple.
I should emphasize again though that this isn't "encryption" - it's a mild form of obfuscation, in the same way that the base-64 encoding of a username/password for basic HTTP authentication is.
I'm thinking a table driven approach would be faster, right? Since it's just bytes, and no byte depends on an adjacent byte, there are only 256 possible choices, so just lookup each one
You might speed things up by using a temporary:
byte b = EncodedBytes[i] ^ (byte)194;
EncodedBytes[i] = (byte)((b << 4) | (b >> 4));
You might speed things up further by using unsafe and raw pointers, thus avoiding checked accesses (though I don't know if that's a consideration with current JIT optimisers).
One approach to consider is to decode the data just as it's being displayed. That is, only decode a portion at a time. But I suspect you are just dumping the data to an edit control or something, which would not really make that an option. How are you displaying the data?
Other than that, I'm not sure how you'd speed it up too much.
Without .NET you'd be able to decode that 4 bytes in a time, but here, actually, the only one thing you can do is to precompute translation table.
That's not xor, that's shift and or...
In assembly, that would be a single "rotate byte by 4" instruction.
BTW can't you decode it on demand? Decode the file in blocks, as you stream the file.
I am going to store 350M pre-calculated double numbers in a binary file, and load them into memory as my dll starts up. Is there any built in way to load it up in parallel, or should I split the data into multiple files myself and take care of multiple threads myself too?
Answering the comments: I will be running this dll on powerful enough boxes, most likely only on 64 bit ones. Because all the access to my numbers will be via properties anyway, I can store my numbers in several arrays.
[update]
Everyone, thanks for answering! I'm looking forward to a lot of benchmarking on different boxes.
Regarding the need: I want to speed up a very slow calculation, so I am going to pre-calculate a grid, load it into memory, and then interpolate.
Well I did a small test and I would definitely recommend using Memory Mapped Files.
I Created a File containing 350M double values (2.6 GB as many mentioned before) and then tested the time it takes to map the file to memory and then access any of the elements.
In all my tests in my laptop (Win7, .Net 4.0, Core2 Duo 2.0 GHz, 4GB RAM) it took less than a second to map the file and at that point accessing any of the elements took virtually 0ms (all time is in the validation of the index).
Then I decided to go through all 350M numbers and the whole process took about 3 minutes (paging included) so if in your case you have to iterate they may be another option.
Nevertheless I wrapped the access, just for example purposes there a lot conditions you should check before using this code, and it looks like this
public class Storage<T> : IDisposable, IEnumerable<T> where T : struct
{
MemoryMappedFile mappedFile;
MemoryMappedViewAccessor accesor;
long elementSize;
long numberOfElements;
public Storage(string filePath)
{
if (string.IsNullOrWhiteSpace(filePath))
{
throw new ArgumentNullException();
}
if (!File.Exists(filePath))
{
throw new FileNotFoundException();
}
FileInfo info = new FileInfo(filePath);
mappedFile = MemoryMappedFile.CreateFromFile(filePath);
accesor = mappedFile.CreateViewAccessor(0, info.Length);
elementSize = Marshal.SizeOf(typeof(T));
numberOfElements = info.Length / elementSize;
}
public long Length
{
get
{
return numberOfElements;
}
}
public T this[long index]
{
get
{
if (index < 0 || index > numberOfElements)
{
throw new ArgumentOutOfRangeException();
}
T value = default(T);
accesor.Read<T>(index * elementSize, out value);
return value;
}
}
public void Dispose()
{
if (accesor != null)
{
accesor.Dispose();
accesor = null;
}
if (mappedFile != null)
{
mappedFile.Dispose();
mappedFile = null;
}
}
public IEnumerator<T> GetEnumerator()
{
T value;
for (int index = 0; index < numberOfElements; index++)
{
value = default(T);
accesor.Read<T>(index * elementSize, out value);
yield return value;
}
}
System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
{
T value;
for (int index = 0; index < numberOfElements; index++)
{
value = default(T);
accesor.Read<T>(index * elementSize, out value);
yield return value;
}
}
public static T[] GetArray(string filePath)
{
T[] elements;
int elementSize;
long numberOfElements;
if (string.IsNullOrWhiteSpace(filePath))
{
throw new ArgumentNullException();
}
if (!File.Exists(filePath))
{
throw new FileNotFoundException();
}
FileInfo info = new FileInfo(filePath);
using (MemoryMappedFile mappedFile = MemoryMappedFile.CreateFromFile(filePath))
{
using(MemoryMappedViewAccessor accesor = mappedFile.CreateViewAccessor(0, info.Length))
{
elementSize = Marshal.SizeOf(typeof(T));
numberOfElements = info.Length / elementSize;
elements = new T[numberOfElements];
if (numberOfElements > int.MaxValue)
{
//you will need to split the array
}
else
{
accesor.ReadArray<T>(0, elements, 0, (int)numberOfElements);
}
}
}
return elements;
}
}
Here is an example of how you can use the class
Stopwatch watch = Stopwatch.StartNew();
using (Storage<double> helper = new Storage<double>("Storage.bin"))
{
Console.WriteLine("Initialization Time: {0}", watch.ElapsedMilliseconds);
string item;
long index;
Console.Write("Item to show: ");
while (!string.IsNullOrWhiteSpace((item = Console.ReadLine())))
{
if (long.TryParse(item, out index) && index >= 0 && index < helper.Length)
{
watch.Reset();
watch.Start();
double value = helper[index];
Console.WriteLine("Access Time: {0}", watch.ElapsedMilliseconds);
Console.WriteLine("Item: {0}", value);
}
else
{
Console.Write("Invalid index");
}
Console.Write("Item to show: ");
}
}
UPDATE I added a static method to load all data in a file to an array. Obviously this approach takes more time initially (on my laptop takes between 1 and 2 min) but after that access performance is what you expect from .Net. This method should be useful if you have to access data frequently.
Usage is pretty simple
double[] helper = Storage<double>.GetArray("Storage.bin");
HTH
It sounds extremely unlikely that you'll actually be able to fit this into a contiguous array in memory, so presumably the way in which you parallelize the load depends on the actual data structure.
(Addendum: LukeH pointed out in comments that there is actually a hard 2GB limit on object size in the CLR. This is detailed in this other SO question.)
Assuming you're reading the whole thing from one disk, parallelizing the disk reads is probably a bad idea. If there's any processing you need to do to the numbers as or after you load them, you might want to consider running that in parallel at the same time you're reading from disk.
The first question you have presumably already answered is "does this have to be precalculated?". Is there some algorithm you can use that will make it possible to calculate the required values on demand to avoid this problem? Assuming not...
That is only 2.6GB of data - on a 64 bit processor you'll have no problem with a tiny amount of data like that. But if you're running on a 5 year old computer with a 10 year old OS then it's a non-starter, as that much data will immediately fill the available working set for a 32-bit application.
One approach that would be obvious in C++ would be to use a memory-mapped file. This makes the data appear to your application as if it is in RAM, but the OS actually pages bits of it in only as it is accessed, so very little real RAM is used. I'm not sure if you could do this directly from C#, but you could easily enough do it in C++/CLI and then access it from C#.
Alternatively, assuming the question "do you need all of it in RAM simultaneously" has been answered with "yes", then you can't go for any kind of virtualisation approach, so...
Loading in multiple threads won't help - you are going to be I/O bound, so you'll have n threads waiting for data (and asking the hard drive to seek between the chunks they are reading) rather than one thread waiitng for data (which is being read sequentially, with no seeks). So threads will just cause more seeking and thus may well make it slower. (The only case where splitting the data up might help is if you split it to different physical disks so different chunks of data can be read in parallel - don't do this in software; buy a RAID array)
The only place where multithreading may help is to make the load happen in the background while the rest of your application starts up, and allow the user to start using the portion of the data that is already loaded while the rest of the buffer fills, so the user (hopefully) doesn't have to wait much while the data is loading.
So, you're back to loading the data into one massive array in a single thread...
However, you may be able to speed this up considerably by compressing the data. There are a couple of general approaches woth considering:
If you know something about the data, you may be able to invent an encoding scheme that makes the data smaller (and therefore faster to load). e.g. if the values tend to be close to each other (e.g. imagine the data points that describe a sine wave - the values range from very small to very large, but each value is only ever a small increment from the last) you may be able to represent the 'deltas' in a float without losing the accuracy of the original double values, halving the data size. If there is any symmetry or repetition to the data you may be able to exploit it (e.g. imagine storing all the positions to describe a whole circle, versus storing one quadrant and using a bit of trivial and fast maths to reflect it 4 times - an easy way to quarter the amount of data I/O). Any reduction in data size would give a corresponding reduction in load time. In addition, many of these schemes would allow the data to remain "encoded" in RAM, so you'd use far less RAM but still be able to quickly fetch the data when it was needed.
Alternatively, you can very easily wrap your stream with a generic compression algorithm such as Deflate. This may not work, but usually the cost of decompressing the data on the CPU is less than the I/O time that you save by loading less source data, so the net result is that it loads significantly faster. And of course, save a load of disk space too.
In typical case, loading speed will be limited by speed of storage you're loading data from--i.e. hard drive.
If you want it to be faster, you'll need to use faster storage, f.e. multiple hard drives joined in a RAID scheme.
If your data can be reasonably compressed, do that. Try to find algorithm which will use exactly as much CPU power as you have---less than that and your external storage speed will be limiting factor; more than that and your CPU speed will be limiting factor. If your compression algorithm can use multiple cores, then multithreading can be useful.
If your data are somehow predictable, you might want to come up with custom compression scheme. F.e. if consecutive numbers are close to each other, you might want to store differences between numbers---this might help compression efficiency.
Do you really need double precision? Maybe floats will do the job? Maybe you don't need full range of doubles? For example if you need full 53 bits of mantissa precision, but need only to store numbers between -1.0 and 1.0, you can try to chop few bits per number by not storing exponents in full range.
Making this parallel would be a bad idea unless you're running on a SSD. The limiting factor is going to be the disk IO--and if you run two threads the head is going to be jumping back and forth between the two areas being read. This will slow it down a lot more than any possible speedup from parallelization.
Remember that drives are MECHANICAL devices and insanely slow compared to the processor. If you can do a million instructions in order to avoid a single head seek you will still come out ahead.
Also, once the file is on disk make sure to defrag the disk to ensure it's in one contiguous block.
That does not sound like a good idea to me. 350,000,000 * 8 bytes = 2,800,000,000 bytes. Even if you manage to avoid the OutOfMemoryException the process may be swapping in/out of the page file anyway. You might as well leave the data in the file and load smaller chucks as they are needed. The point is that just because you can allocate that much memory does not mean you should.
With a suitable disk configuration, splitting into multiple files across disks would make sense - and reading each file in a separate thread would then work nicely (if you've some stripyness - RAID whatever :) - then it could make sense to read from a single file with multiple threads).
I think you're on a hiding to nothing attempting this with a single physical disk, though.
Just saw this : .NET 4.0 has support for memory mapped files. That would be a very fast way to do it, and no support required for parallelization etc.
I have some code that is really slow. I knew it would be and now it is. Basically, I am reading files from a bunch of directories. The file names change but the data does not. To determine if I have read the file, I am hashing it's bytes and comparing that to a list of hashes of already processed files. There are about 1000 files in each directory, and figuring out what's new in each directory takes a good minute or so (and then the processing starts). Here's the basic code:
public static class ProgramExtensions
{
public static byte[] ToSHA256Hash(this FileInfo file)
{
using (FileStream fs = new FileStream(file.FullName, FileMode.Open))
{
using (SHA256 hasher = new SHA256Managed())
{
return hasher.ComputeHash(fs);
}
}
}
public static string ToHexString(this byte[] p)
{
char[] c = new char[p.Length * 2 + 2];
byte b;
c[0] = '0'; c[1] = 'x';
for (int y = 0, x = 2; y < p.Length; ++y, ++x)
{
b = ((byte)(p[y] >> 4));
c[x] = (char)(b > 9 ? b + 0x37 : b + 0x30);
b = ((byte)(p[y] & 0xF));
c[++x] = (char)(b > 9 ? b + 0x37 : b + 0x30);
}
return new string(c);
}
}
class Program
{
static void Main(string[] args)
{
var allFiles = new DirectoryInfo("c:\\temp").GetFiles("*.*");
List<string> readFileHashes = GetReadFileHashes();
List<FileInfo> filesToRead = new List<FileInfo>();
foreach (var file in allFiles)
{
if (readFileHashes.Contains(file.ToSHA256Hash().ToHexString()))
filesToRead.Add(file);
}
//read new files
}
}
Is there anyway I can speed this up?
I believe you can archive the most significant performance improvement by simply first checking the filesize, if the filesize does not match, you can skip the entire file and don't even open it.
Instead of just saving a list of known hashes, you would also keep a list of known filesizes and only do a content comparison when filesizes match. When filesize doesn't match, you can save yourself from even looking at the file content.
Depending on the general size your files generally have, a further improvement can be worthwhile:
Either doing a binary compare with early abort when the first byte is different (saves reading the entire file which can be a very significant improvement if your files generally are large, any hash algorithm would read the entire file. Detecting that the first byte is different saves you from reading the rest of the file). If your lookup file list likely contains many files of the same size so you'd likely have to do a binary comparison against several files instead consider:
hashing in blocks of say 1MB each. First check the first block only against the precalculated 1st block hash in your lookup. Only compare 2nd block if 1st block is the same, saves reading beyond 1st block in most cases for different files. Both those options are only really worth the effort when your files are large.
I doubt that changing the hashing algorithm itself (e.g first check doing a CRC as suggested) would make any significant difference. Your bottleneck is likely disk IO, not CPU so avoiding disk IO is what will give you the most improvement. But as always in performance, do measure.
Then, if this is still not enough (and only then), experiment with asynchronous IO (remember though that sequential reads are generally faster than random access, so too much random asynchronous reading can hurt your performance)
Create a file list
Sort the list by filesize
Eliminate files with unique sizes from the list
Now do hashing (a fast hash first might improve performance as well)
Use an data structure for your readFileHashes store that has an efficient search capability (hashing or binary search). I think HashSet or TreeSet would serve you better here.
Use an appropriate checksum (hash sum) function. SHA256 is a cryptographic hash that is probably overkill. CRC is less computationally expensive, originally intended for catching unintentional/random changes (tranmission errors), but is susceptable to changes to are designed/intended to be hidden. What fits the differences between the files you are scanning?
See http://en.wikipedia.org/wiki/List_of_checksum_algorithms#Computational_costs_of_CRCs_vs_Hashes
Would a really simple checksum via sampling (e.g. checksum = (first 10 bytes and last 10 bytes)) work?
I'd do a quick CRC hash check first, as it is less expensive.
if the CRC does not match, continue on with a more "reliable" hash test such as SHA
Your description of the problem still isn't clear enough.
The biggest problem is that you are doing a bunch of hashing. This is guaranteed to be slow.
You might want to try searching for the modification time, which does not change if a filename is changed:
http://msdn.microsoft.com/en-us/library/ms724320(VS.85,loband).aspx
Or you might want to monitor the folder for any new file changes:
http://www.codeguru.com/forum/showthread.php?t=436716
First group the files by file sizes - this will leave you with smaller groups of files. Now it depends on the group size and file sizes. You could just start reading all files in parallel until you find a difference. If there is a difference, split the group into smaller groups having the same value at the current position. If you have information how the files differ, you can use this information - start reading at the end, don't read and compare byte by byte if larger cluster change, or what ever you know about the files. This solution might introduce I/O performance problems if you have to read many files in parallel causing random disc access.
You could also calculate hash values for all files in each group and compare them. You must not neccessarily process the whole files at once - just calculate the hash of a few (maybe a 4kiB cluster or whatever fits your file sizes) bytes and check if there are allready differences. If not, calculate the hashes of the next few bytes. This will give you the possibility to process larger blocks of each file without requiring to keep one such large block for each file in a group in the memory.
So its all about a time-memory (disc I/O-memory) trade-off. You have to find your way between reading all files in a group into memory and comparing them byte by byte (high memory requirement, fast sequential access, but may read to much data) and reading the files byte by byte and comparing only the last byte read (low memory requirement, slow random access, reads only required data). Further, if the groups are very large, comparing the files byte by byte will become slower - comparing one byte from n files is a O(n) operation - and it might become more efficient to calculate hash values first and then compare only the hash values.
updated: Definitely DO NOT make your only check for file size. If your os version allows use FileInfo.LastWriteTime
I've implemented something similar for an in-house project compiler/packager. We have over 8k files so we store the last modified dates and hash data into a sql database. then on subsequent runs we query first against the modified date on any specific file, and only then on the hash data... that way we only calculate new hash data for those files that appear to be modified...
.net has a way to check for last modified date, in the FileInfo class.. I suggest you check it out. EDIT: here is the link LastWriteTime
Our packager takes about 20 secs to find out what files have been modified.