compare the contents of large files - c#

I need to compare the contents of very large files. Speed of the program is important. I need a 100% match. I read a lot of information but did not find the optimal solution. I am considering two choices, and both have problems.
Compare whole file byte by byte - not fast enough for large files.
File comparison using hashes - two files with the same hash are not guaranteed to be a 100% match.
What would you suggest? Maybe I could make use of threads? Could MemoryMappedFile be helpful?

If you really need to guarantee 100% that the files are 100% identical, then you need to do a byte-to-byte comparison. That's just entailed in the problem - the only hashing method with 0% risk of false matching is the identity function!
What we're left with is short-cuts that can give us quick answers to let us skip the byte-for-byte comparison some of the time.
As a rule, the only short-cut on proving equality is proving identity. In OO code that would be showing two objects were in fact the same object. The closest thing with files is if a binding or NTFS junction meant two paths pointed to the same file. This happens so rarely that unless the nature of the work makes it more common than usual, it's not going to be a net gain to check for.
So we're left short-cutting on finding mis-matches. That does nothing to speed up our passes, but makes our fails faster:
Different size, not byte-for-byte equal. Simples!
If you will examine the same file more than once, then hash it and record the hash. Different hash, guaranteed not equal. The reduction in files that need a one-to-one comparison is massive.
Many file formats are likely to have some areas in common. Particularly the first bytes for many formats tend to be "magic numbers", headers etc. Either skip them, or skip them and check them last (if there is a chance of them being different but it's low).
Then there's the matter of making the actual comparison as fast as possible. Loading batches of 4 octets at a time into an integer and doing integer comparison will often be faster than octet-per-octet.
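For instance, something along these lines (an illustrative helper; it assumes both buffers already hold count valid bytes):
static bool BuffersEqual(byte[] buffer1, byte[] buffer2, int count)
{
    int i = 0;
    // Compare 4 octets at a time by loading them into integers.
    for (; i <= count - sizeof(int); i += sizeof(int))
    {
        if (BitConverter.ToInt32(buffer1, i) != BitConverter.ToInt32(buffer2, i))
        {
            return false;
        }
    }
    // Compare any trailing bytes one at a time.
    for (; i < count; i++)
    {
        if (buffer1[i] != buffer2[i])
        {
            return false;
        }
    }
    return true;
}
On a 64-bit process the same pattern with BitConverter.ToInt64 and 8-byte steps is usually faster still.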
Threading can help. One way is to split the actual comparison of the file into more than one operation, but if possible a bigger gain will be found by doing completely different comparisons in different threads. I'd need to know a bit more about just what you are doing to advise much, but the main thing is to make sure the output of the tests is thread-safe.
If you do have more than one thread examining the same files, have them work far from each other. E.g. if you have four threads, you could split the file in four, or you could have one take byte 0, 4, 8 while another takes byte 1, 5, 9, etc. (or 4-octet group 0, 4, 8 etc). The latter is much more likely to have false sharing issues than the former, so don't do that.
Edit:
It also depends on just what you're doing with the files. You say you need 100% certainty, so this bit doesn't apply to you, but it's worth adding for the more general problem: if the cost of a false positive is a waste of resources, time or memory rather than an actual failure, then reducing it through a fuzzy short-cut could be a net win, and it can be worth profiling to see if this is the case.
If you are using a hash to speed things up (it can at least find some definite mis-matches faster), then Bob Jenkins' Spooky Hash is a good choice; it's not cryptographically secure, but if that's not your purpose it creates a 128-bit hash very quickly (much faster than a cryptographic hash, or even than the approaches taken with many GetHashCode() implementations) that is extremely good at not having accidental collisions (the sort of deliberate collisions cryptographic hashes avoid is another matter). I implemented it for .NET and put it on NuGet because nobody else had when I found myself wanting to use it.

Serial Compare
Test File Size(s): 118 MB
Duration: 579 ms
Equal? true
static bool Compare(string filePath1, string filePath2)
{
    using (FileStream file = File.OpenRead(filePath1))
    using (FileStream file2 = File.OpenRead(filePath2))
    {
        if (file.Length != file2.Length)
        {
            return false;
        }
        int count;
        const int size = 0x1000000;
        var buffer = new byte[size];
        var buffer2 = new byte[size];
        while ((count = file.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Read may return fewer bytes than requested, so fill buffer2 up to count.
            int read = 0;
            while (read < count)
            {
                read += file2.Read(buffer2, read, count - read);
            }
            for (int i = 0; i < count; i++)
            {
                if (buffer[i] != buffer2[i])
                {
                    return false;
                }
            }
        }
    }
    return true;
}
Parallel Compare
Test File Size(s): 118 MB
Duration: 340 ms
Equal? true
static bool Compare2(string filePath1, string filePath2)
{
    bool success = true;
    var info = new FileInfo(filePath1);
    var info2 = new FileInfo(filePath2);
    if (info.Length != info2.Length)
    {
        return false;
    }
    long fileLength = info.Length;
    const int size = 0x1000000;
    // Round up so the final partial chunk is compared too.
    long chunks = (fileLength + size - 1) / size;
    Parallel.For(0, chunks, (x, state) =>
    {
        // Keep the offset in a long so it doesn't overflow for files over 2 GB.
        long start = x * (long)size;
        using (FileStream file = File.OpenRead(filePath1))
        using (FileStream file2 = File.OpenRead(filePath2))
        {
            var buffer = new byte[size];
            var buffer2 = new byte[size];
            file.Position = start;
            file2.Position = start;
            int count = file.Read(buffer, 0, size);
            // Read may return fewer bytes than requested, so fill buffer2 up to count.
            int read = 0;
            while (read < count)
            {
                read += file2.Read(buffer2, read, count - read);
            }
            for (int i = 0; i < count; i++)
            {
                if (buffer[i] != buffer2[i])
                {
                    success = false;
                    state.Stop(); // no point comparing the remaining chunks
                    return;
                }
            }
        }
    });
    return success;
}
MD5 Compare
Test File Size(s): 118 MB
Duration: 702 ms
Equal? true
static bool Compare3(string filePath1, string filePath2)
{
    byte[] hash1 = GenerateHash(filePath1);
    byte[] hash2 = GenerateHash(filePath2);
    if (hash1.Length != hash2.Length)
    {
        return false;
    }
    for (int i = 0; i < hash1.Length; i++)
    {
        if (hash1[i] != hash2[i])
        {
            return false;
        }
    }
    return true;
}
static byte[] GenerateHash(string filePath)
{
    using (MD5 crypto = MD5.Create())
    using (FileStream stream = File.OpenRead(filePath))
    {
        return crypto.ComputeHash(stream);
    }
}
tl;dr Compare byte segments in parallel to determine if two files are equal.

Why not both?
Compare with hashes for the first pass, then return to conflicts and perform the byte-by-byte comparison. This allows maximal speed with guaranteed 100% match confidence.
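As a rough sketch of that idea, reusing the Compare and GenerateHash methods from the answer above (this only pays off when the same files are compared repeatedly and the hashes are cached; for a one-off comparison of two files it does strictly more work than Compare alone):
// Requires System.Linq for SequenceEqual.
static bool FilesAreIdentical(string filePath1, string filePath2)
{
    // First pass: different hashes mean the files definitely differ.
    if (!GenerateHash(filePath1).SequenceEqual(GenerateHash(filePath2)))
    {
        return false;
    }
    // Hashes match: confirm byte by byte to rule out a hash collision.
    return Compare(filePath1, filePath2);
}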

There's no avoiding doing byte-for-byte comparisons if you want perfect comparisons (The file still has to be read byte-for-byte to do any hashing), so the issue is how you're reading and comparing the data.
So there are two things you'll want to address:
Concurrency - make sure you're reading data at the same time you're checking it.
Buffer size - reading the file 1 byte at a time is going to be slow; make sure you're reading it into a decent-sized buffer (about 8MB should do nicely on very large files).
The objective is to make sure you can do your comparison as fast as the hard disk can read the data, and that you're always reading data with no delays. If you're doing everything as fast as the data can be read from the drive then that's as fast as it is possible to do it since the hard disk read speed becomes the bottleneck.
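One way to get that overlap (an untested sketch: the names, the 8 MB buffers and the use of .NET 4.5's ReadAsync are all illustrative) is to start reading the next pair of chunks while the current pair is being compared:
// Requires System.IO and System.Threading.Tasks.
static async Task<bool> CompareOverlappedAsync(string path1, string path2, int bufferSize = 8 * 1024 * 1024)
{
    using (FileStream f1 = File.OpenRead(path1))
    using (FileStream f2 = File.OpenRead(path2))
    {
        if (f1.Length != f2.Length)
        {
            return false;
        }
        byte[][] current = { new byte[bufferSize], new byte[bufferSize] };
        byte[][] next = { new byte[bufferSize], new byte[bufferSize] };
        int count = await ReadPairAsync(f1, f2, current);
        while (count > 0)
        {
            // Kick off the next read before comparing the chunk we already have.
            Task<int> nextRead = ReadPairAsync(f1, f2, next);
            bool equal = true;
            for (int i = 0; i < count; i++)
            {
                if (current[0][i] != current[1][i]) { equal = false; break; }
            }
            count = await nextRead; // always let the pending read finish
            if (!equal)
            {
                return false;
            }
            byte[][] tmp = current; current = next; next = tmp; // swap the buffer pairs
        }
        return true;
    }
}
// Reads one chunk from each file and makes sure the second buffer holds as many bytes as the first.
static async Task<int> ReadPairAsync(FileStream f1, FileStream f2, byte[][] buffers)
{
    int count = await f1.ReadAsync(buffers[0], 0, buffers[0].Length);
    int filled = 0;
    while (filled < count)
    {
        int read = await f2.ReadAsync(buffers[1], filled, count - filled);
        if (read == 0) break; // shouldn't happen since the lengths were checked
        filled += read;
    }
    return count;
}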

Ultimately a hash is going to read the file byte by byte anyway ... so if you are looking for an accurate comparison then you might as well do the comparison. Can you give some more background on what you are trying to accomplish? How big are the 'big' files? How often do you have to compare them?

If you have a large set of files and you are trying to identify duplicates, I would try to break down the work by order of expense.
I might try something like the following:
1) group files by size. Files with different sizes clearly can't be identical. This information is very inexpensive to retrieve. If each group only contains 1 file, you are done, no dupes, otherwise proceed to step 2.
2) Within each size group generate a hash of the first n bytes of the file. Identify a reasonable n that will likely detect differences. Many files have identical headers, so you want to make sure n is greater than that header length. Group by the hashes; if each group contains 1 file, you are done (no dupes within this group), otherwise proceed to step 3 (a sketch of steps 1 and 2 follows after this list).
3) At this point you are likely going to have to do more expensive work like generate a hash of the whole file, or do a byte by byte comparison. Depending on the number of files, and the nature of the file contents, you might try different approaches. Hopefully, the previous groupings will have narrowed down likely duplicates so that the number of files that you actually have to fully scan will be very small.
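A rough sketch of steps 1 and 2 (the helper names and the 64 KB prefix length are just illustrative; it needs System.Linq, System.IO and System.Security.Cryptography):
static List<List<FileInfo>> FindDuplicateCandidates(IEnumerable<FileInfo> files, int prefixLength = 64 * 1024)
{
    return files
        .GroupBy(f => f.Length)                                       // step 1: different sizes can't be duplicates
        .Where(g => g.Count() > 1)
        .SelectMany(g => g.GroupBy(f => PrefixHash(f, prefixLength))) // step 2: hash the first n bytes within each size group
        .Where(g => g.Count() > 1)
        .Select(g => g.ToList())
        .ToList();                                                    // step 3: these groups still need the expensive comparison
}
static string PrefixHash(FileInfo file, int prefixLength)
{
    using (var md5 = MD5.Create())
    using (var stream = file.OpenRead())
    {
        var buffer = new byte[prefixLength];
        int read = stream.Read(buffer, 0, buffer.Length);
        return BitConverter.ToString(md5.ComputeHash(buffer, 0, read));
    }
}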

To calculate a hash, the entire file needs to be read.
How about opening both files together, and comparing them chunk by chunk?
Pseudo code:
open file A
open file B
while file A has more data
{
if next chunk of A != next chunk of B return false
}
return true
This way you are not loading too much together, and not reading in the entire file if you find a mismatch earlier. You should set up a test that varies the chunk size to determine the right size for optimal performance.

Related

Determining Similarity of Binary Files

I'm writing unit tests around an HTML to PDF process and have a set of sample input HTML files and a set of PDFs representing what the expected result would be.
I'd like to compare these to determine that the process has generated the correct output.
Obviously PDF files have some non-deterministic components in them so I can't do a straight up binary compare. I don't particularly want to delve into parsing the PDF output, so I thought it might be neat to just check how much the files differ by (and have the test pass if they differ by, say, less than 1%).
I can't simply count the differing bytes in the same array location as it seems there can be slight size differences in the output, so things will be offset slightly differently in each file.
So, the question is, is there a tried and tested algorithm for determining how much the general content of 2 large byte arrays differ?
Thanks,
Steve.
Edit:
Attaching an image to illustrate how the files generated are broadly the same...
OK, so I've found a method that seems to work pretty well. It's not necessarily hugely efficient, but each test still runs in about half a second, so it's fine for my purpose. Posting it here in case it helps someone else out.
It basically just sums the bytes in each array and calculates the difference:
public static bool IsAnalogousTo(this byte[] left, byte[] right, int tolerance)
{
    long leftSum = 0;
    foreach (var b in left)
    {
        leftSum += b;
    }
    long rightSum = 0;
    foreach (var b in right)
    {
        rightSum += b;
    }
    return Math.Abs(leftSum - rightSum) < left.Length / tolerance;
}
My thinking is that the files are ~115k in size - if the sums of all the bytes in files of that size differ by less than ~15k, that means less than one bit in every ten bytes (the tolerance parameter) is different.
This works well for what I want. For other purposes where more accuracy is required, it would probably be better to do it in chunks to ensure regions of the file are similar.
Of course, on a small set of data, this would be useless. It would say that [10, 20, 30, 40] and [0, 0, 0, 100] are the same, but on 115,000 bytes of reasonably structured data such as a PDF, I think this is probably acceptable.
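A chunked variant of the same idea might look like this (untested; the chunk size and the per-chunk threshold are just starting points), so a local difference can't hide behind the overall total:
public static bool IsAnalogousInChunksTo(this byte[] left, byte[] right, int chunkSize, int tolerance)
{
    int length = Math.Max(left.Length, right.Length);
    for (int offset = 0; offset < length; offset += chunkSize)
    {
        long leftSum = 0;
        long rightSum = 0;
        for (int i = offset; i < offset + chunkSize; i++)
        {
            if (i < left.Length) leftSum += left[i];
            if (i < right.Length) rightSum += right[i];
        }
        // Apply the same tolerance rule per region instead of to the whole file.
        if (Math.Abs(leftSum - rightSum) >= chunkSize / tolerance)
        {
            return false;
        }
    }
    return true;
}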

Fastest way to delete the first few bytes of a file

I am using a Windows Mobile Compact Edition 6.5 phone and am writing out binary data to a file from Bluetooth. These files get quite large, 16M+, and what I need to do is, once the file is written, search the file for a start character and then delete everything before it, thus eliminating garbage. I cannot do this inline as the data comes in, due to graphing issues and speed, as I get a lot of data coming in and there are already too many if conditions on the incoming data. I figured it was best to post-process. Anyway, here is my dilemma: searching for the start bytes and rewriting the file sometimes takes 5 mins or more... I basically move the file over to a temp file, parse through it and rewrite a whole new file. I have to do this byte by byte.
private void closeFiles() {
    try {
        // Close file stream for raw data.
        if (this.fsRaw != null) {
            this.fsRaw.Flush();
            this.fsRaw.Close();
            // Move file, seek the first sync bytes,
            // write to fsRaw stream with sync byte and rest of data after it
            File.Move(this.s_fileNameRaw, this.s_fileNameRaw + ".old");
            FileStream fsRaw_Copy = File.Open(this.s_fileNameRaw + ".old", FileMode.Open);
            this.fsRaw = File.Create(this.s_fileNameRaw);
            int x = 0;
            bool syncFound = false;
            // search for sync byte algorithm
            while (x != -1) {
                ... logic to search for sync byte
                if (x != -1 && syncFound) {
                    this.fsPatientRaw.WriteByte((byte)x);
                }
            }
            this.fsRaw.Close();
            fsRaw_Copy.Close();
            File.Delete(this.s_fileNameRaw + ".old");
        }
    } catch(IOException e) {
        CLogger.WriteLog(ELogLevel.ERROR,"Exception in writing: " + e.Message);
    }
}
There has got to be a faster way than this!
------------Testing times using answer -------------
Initial test my way, with one byte read and one byte write:
27 Kb/sec
using an answer below and a 32768 byte buffer:
321 Kb/sec
using an answer below and a 65536 byte buffer:
501 Kb/sec
You're doing a byte-wise copy of the entire file. That can't be efficient for a load of reasons. Search for the start offset (and end offset if you need both), then copy from one stream to another the entire contents between the two offsets (or the start offset and end of file).
EDIT
You don't have to read the entire contents to make the copy. Something like this (untested, but you get the idea) would work.
private void CopyPartial(string sourceName, byte syncByte, string destName)
{
    using (var input = File.OpenRead(sourceName))
    using (var reader = new BinaryReader(input))
    using (var output = File.Create(destName))
    {
        var start = 0;
        // seek to sync byte
        while (reader.ReadByte() != syncByte)
        {
            start++;
        }
        var buffer = new byte[4096]; // 4k page - adjust as you see fit
        do
        {
            var actual = reader.Read(buffer, 0, buffer.Length);
            output.Write(buffer, 0, actual);
        } while (reader.PeekChar() >= 0);
    }
}
EDIT 2
I actually needed something similar to this today, so I decided to write it without the PeekChar() call. Here's the kernel of what I did - feel free to integrate it with the second do...while loop above.
var buffer = new byte[1024];
var total = 0;
do
{
    var actual = reader.Read(buffer, 0, buffer.Length);
    writer.Write(buffer, 0, actual);
    total += actual;
} while (total < reader.BaseStream.Length);
Don't discount an approach because you're afraid it will be too slow. Try it! It'll only take 5-10 minutes to give it a try and may result in a much better solution.
If the detection process for the start of the data is not too complex/slow, then avoiding writing data until you hit the start may actually make the program skip past the junk data more efficiently.
How to do this:
Use a simple bool to know whether or not you have detected the start of the data. If you are reading junk, then don't waste time writing it to the output, just scan it to detect the start of the data. Once you find the start, then stop scanning for the start and just copy the data to the output. Just copying the good data will incur no more than an if (found) check, which really won't make any noticeable difference to your performance.
You may find that in itself solves the problem. But you can optimise it if you need more performance:
What can you do to minimise the work you do to detect the start of the data? Perhaps if you are looking for a complex sequence you only need to check for one particular byte value that starts the sequence, and it's only if you find that start byte that you need to do any more complex checking. There are some very simple but efficient string searching algorithms that may help in this sort of case too. Or perhaps you can allocate a buffer (e.g. 4kB) and gradually fill it with bytes from your incoming stream. When the buffer is filled, then and only then search for the end of the "junk" in your buffer. By batching the work you can make use of memory/cache coherence to make the processing considerably more efficient than it would be if you did the same work byte by byte.
Do all the other "conditions on the incoming data" need to be continually checked? How can you minimise the amount of work you need to do but still achieve the required results? Perhaps some of the ideas above might help here too?
Do you actually need to do any processing on the data while you are skipping junk? If not, then you can break the whole thing into two phases (skip junk, copy data), and skipping the junk won't cost you anything when it actually matters.
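A sketch of that two-phase, buffered approach (untested; the sync byte, file names and 64 KB buffer are placeholders for whatever marks your real data):
static void StripLeadingJunk(string sourcePath, string destPath, byte syncByte)
{
    var buffer = new byte[64 * 1024];
    bool found = false;
    using (var input = File.OpenRead(sourcePath))
    using (var output = File.Create(destPath))
    {
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            int offset = 0;
            if (!found)
            {
                // Still skipping junk: look for the sync byte in this buffer.
                offset = Array.IndexOf(buffer, syncByte, 0, read);
                if (offset < 0)
                    continue; // whole buffer is junk, write nothing
                found = true;
            }
            // From the sync byte onward, copy everything without further scanning.
            output.Write(buffer, offset, read - offset);
        }
    }
}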

Improve performance of SHA-1 ComputeHash

I'm using the following code to do a checksum of a file which works fine. But when I generate a hash for a large file, say 2 GB, it is quite slow. How can I improve the performance of this code?
fs = new FileStream(txtFile.Text, FileMode.Open);
formatted = string.Empty;
using (SHA1Managed sha1 = new SHA1Managed())
{
    byte[] hash = sha1.ComputeHash(fs);
    foreach (byte b in hash)
    {
        formatted += b.ToString("X2");
    }
}
fs.Close();
Update:
System:
OS: Win 7 64bit, CPU: I5 750, RAM: 4GB, HDD: 7200rpm
Tests:
Test1 = 59.895 seconds
Test2 = 59.94 seconds
The first question is what you need this checksum for. If you don't need the cryptographic properties, then a non-cryptographic hash, or a hash that is less cryptographically secure (MD5 being "broken" doesn't prevent it being a good hash, nor does it stop it being strong enough for some uses), is likely to be more performant. You could make your own hash by reading a subset of the data (I'd advise making this subset work in 4096-byte chunks of the underlying file, as that would match the buffer size used by SHA1Managed as well as allowing for a faster chunk read than you would get if you read, say, every X bytes for some value of X).
Edit: An upvote reminding me of this answer has also reminded me that I have since written SpookilySharp, which provides high-performance 32-, 64- and 128-bit hashes that are not cryptographic, but good for providing checksums against errors, storage, etc. (This in turn has reminded me that I should update it to support .NET Core.)
Of course, if you want the SHA-1 of the file to interoperate with something else, you are stuck.
I would experiment with different buffer sizes, as increasing the size of the filestream's buffer can increase speed at the cost of extra memory. I would advise a whole multiple of 4096 (4096 is the default, incidentally) as SHA1Managed will ask for 4096 chunks at a time, and this way there'll be no case where either FileStream returns less than the most asked for (allowed but sometimes suboptimal) or does more than one copy at a time.
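For example (the 256 KB buffer here is just one size to try; it is a whole multiple of 4096):
byte[] hash;
using (var fs = new FileStream(txtFile.Text, FileMode.Open, FileAccess.Read, FileShare.Read, 64 * 4096))
using (var sha1 = new SHA1Managed())
{
    hash = sha1.ComputeHash(fs);
}
// BitConverter.ToString also avoids the string concatenation in the loop above.
string formatted = BitConverter.ToString(hash).Replace("-", "");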
Well, is it IO-bound or CPU-bound? If it's CPU-bound, there's not a lot we can do about that.
It's possible that opening the FileStream with different parameters would allow the file system to do more buffering or assume that you're going to read the file sequentially - but I doubt that will help very much. (It's certainly not going to do a lot if it's CPU-bound.)
How slow is "quite slow" anyway? Compared with, say, copying the file?
If you have a lot of memory (e.g. 4GB or more) how long does it take to hash the file a second time, when it may be in the file system cache?
First of all, have you measured "quite slow"? From this site, SHA-1 has about half the speed of MD5 with about 100 MB/s (depending on the CPU), so 2 GB would take about 20 seconds to hash. Also, note that if you're using a slow HDD, this might be your real bottleneck as 30-70 MB/s aren't unusual.
To speed things up, you might just not hash the whole file, but the first X KB or representative parts of it (the parts that will most likely differ). If your files aren't too similar, this shouldn't cause duplicates.
First: SHA-1 file hashing should be I/O bound on non-ancient CPUs - and I5 certainly doesn't qualify as ancient. Of course it depends on the implementation of SHA-1, but I doubt SHA1Managed is über-slow.
Next, 60sec for 2GB data is ~34MB/s - that's slow for harddisk reads; even a 2.5" laptop disk can read faster than that. Assuming the harddrive is internal (no USB2/whatever or network bottleneck), and there's not a lot of other disk I/O activity, I'd be surprised to see less than 60MB/s read from a modern drive.
My guess would be that ComputeHash() uses a small buffer internally. Try manually reading/hashing, so you can specify a larger buffer (64kb or even larger) to increase throughput. You could also move to async processing so disk-read and compute can be overlapped.
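A sketch of the manual approach (the 1 MB buffer is an arbitrary starting point; TransformBlock/TransformFinalBlock let you feed the hash in whatever chunk size you read):
static byte[] HashFile(string path)
{
    using (var sha1 = SHA1.Create())
    using (var fs = File.OpenRead(path))
    {
        var buffer = new byte[1024 * 1024];
        int read;
        while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
        {
            sha1.TransformBlock(buffer, 0, read, null, 0);
        }
        sha1.TransformFinalBlock(buffer, 0, 0);
        return sha1.Hash;
    }
}
From there the read could be made asynchronous, so the next chunk is coming off the disk while the current one is being hashed.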
Neither is SHA1Managed the best choice for large input strings, nor is Byte.ToString("X2") the fastest way to convert the byte array to a string.
I just finished an article with detailed benchmarks on that topic. It compares SHA1Managed, SHA1CryptoServiceProvider, SHA1Cng and also considers SHA1.Create() on different length input strings.
In the second part, it shows 5 different methods of converting the byte array to string where Byte.ToString("X2") is the worst.
My largest input was only 10,000 characters so you may want to run my benchmarks on your 2 GB file. Would be quite interesting if/how that changes the numbers.
http://wintermute79.wordpress.com/2014/10/10/c-sha-1-benchmark/
However, for file integrity checks you are better off using MD5 as you already wrote.
You can use this logic for getting the SHA-1 value.
I was using it in Java.
public class sha1Calculate {

    public static void main(String[] args) throws Exception
    {
        File file = new File("D:\\Android Links.txt");
        String outputTxt = "";
        String hashcode = null;
        try {
            FileInputStream input = new FileInputStream(file);
            ByteArrayOutputStream output = new ByteArrayOutputStream();
            byte[] buffer = new byte[65536];
            int l;
            while ((l = input.read(buffer)) > 0)
                output.write(buffer, 0, l);
            input.close();
            output.close();
            byte[] data = output.toByteArray();
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            byte[] bytes = data;
            digest.update(bytes, 0, bytes.length);
            bytes = digest.digest();
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes)
            {
                sb.append(String.format("%02X", b));
            }
            System.out.println("Digest(in hex format):: " + sb.toString());
        } catch (FileNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (NoSuchAlgorithmException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}

Quickly load 350M numbers into a double[] array in C#

I am going to store 350M pre-calculated double numbers in a binary file, and load them into memory as my dll starts up. Is there any built in way to load it up in parallel, or should I split the data into multiple files myself and take care of multiple threads myself too?
Answering the comments: I will be running this dll on powerful enough boxes, most likely only on 64 bit ones. Because all the access to my numbers will be via properties anyway, I can store my numbers in several arrays.
[update]
Everyone, thanks for answering! I'm looking forward to a lot of benchmarking on different boxes.
Regarding the need: I want to speed up a very slow calculation, so I am going to pre-calculate a grid, load it into memory, and then interpolate.
Well I did a small test and I would definitely recommend using Memory Mapped Files.
I created a file containing 350M double values (2.6 GB as many mentioned before) and then tested the time it takes to map the file to memory and then access any of the elements.
In all my tests in my laptop (Win7, .Net 4.0, Core2 Duo 2.0 GHz, 4GB RAM) it took less than a second to map the file and at that point accessing any of the elements took virtually 0ms (all time is in the validation of the index).
Then I decided to go through all 350M numbers and the whole process took about 3 minutes (paging included), so if in your case you have to iterate over all the values, there may be better options.
Nevertheless I wrapped the access, just for example purposes (there are a lot of conditions you should check before using this code), and it looks like this:
public class Storage<T> : IDisposable, IEnumerable<T> where T : struct
{
    MemoryMappedFile mappedFile;
    MemoryMappedViewAccessor accesor;
    long elementSize;
    long numberOfElements;

    public Storage(string filePath)
    {
        if (string.IsNullOrWhiteSpace(filePath))
        {
            throw new ArgumentNullException();
        }
        if (!File.Exists(filePath))
        {
            throw new FileNotFoundException();
        }
        FileInfo info = new FileInfo(filePath);
        mappedFile = MemoryMappedFile.CreateFromFile(filePath);
        accesor = mappedFile.CreateViewAccessor(0, info.Length);
        elementSize = Marshal.SizeOf(typeof(T));
        numberOfElements = info.Length / elementSize;
    }

    public long Length
    {
        get
        {
            return numberOfElements;
        }
    }

    public T this[long index]
    {
        get
        {
            if (index < 0 || index >= numberOfElements)
            {
                throw new ArgumentOutOfRangeException();
            }
            T value = default(T);
            accesor.Read<T>(index * elementSize, out value);
            return value;
        }
    }

    public void Dispose()
    {
        if (accesor != null)
        {
            accesor.Dispose();
            accesor = null;
        }
        if (mappedFile != null)
        {
            mappedFile.Dispose();
            mappedFile = null;
        }
    }

    public IEnumerator<T> GetEnumerator()
    {
        T value;
        for (int index = 0; index < numberOfElements; index++)
        {
            value = default(T);
            accesor.Read<T>(index * elementSize, out value);
            yield return value;
        }
    }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        T value;
        for (int index = 0; index < numberOfElements; index++)
        {
            value = default(T);
            accesor.Read<T>(index * elementSize, out value);
            yield return value;
        }
    }

    public static T[] GetArray(string filePath)
    {
        T[] elements;
        int elementSize;
        long numberOfElements;
        if (string.IsNullOrWhiteSpace(filePath))
        {
            throw new ArgumentNullException();
        }
        if (!File.Exists(filePath))
        {
            throw new FileNotFoundException();
        }
        FileInfo info = new FileInfo(filePath);
        using (MemoryMappedFile mappedFile = MemoryMappedFile.CreateFromFile(filePath))
        {
            using (MemoryMappedViewAccessor accesor = mappedFile.CreateViewAccessor(0, info.Length))
            {
                elementSize = Marshal.SizeOf(typeof(T));
                numberOfElements = info.Length / elementSize;
                elements = new T[numberOfElements];
                if (numberOfElements > int.MaxValue)
                {
                    //you will need to split the array
                }
                else
                {
                    accesor.ReadArray<T>(0, elements, 0, (int)numberOfElements);
                }
            }
        }
        return elements;
    }
}
Here is an example of how you can use the class
Stopwatch watch = Stopwatch.StartNew();
using (Storage<double> helper = new Storage<double>("Storage.bin"))
{
    Console.WriteLine("Initialization Time: {0}", watch.ElapsedMilliseconds);
    string item;
    long index;
    Console.Write("Item to show: ");
    while (!string.IsNullOrWhiteSpace((item = Console.ReadLine())))
    {
        if (long.TryParse(item, out index) && index >= 0 && index < helper.Length)
        {
            watch.Reset();
            watch.Start();
            double value = helper[index];
            Console.WriteLine("Access Time: {0}", watch.ElapsedMilliseconds);
            Console.WriteLine("Item: {0}", value);
        }
        else
        {
            Console.Write("Invalid index");
        }
        Console.Write("Item to show: ");
    }
}
UPDATE: I added a static method to load all the data in a file into an array. Obviously this approach takes more time initially (on my laptop it takes between 1 and 2 min) but after that access performance is what you expect from .NET. This method should be useful if you have to access data frequently.
Usage is pretty simple
double[] helper = Storage<double>.GetArray("Storage.bin");
HTH
It sounds extremely unlikely that you'll actually be able to fit this into a contiguous array in memory, so presumably the way in which you parallelize the load depends on the actual data structure.
(Addendum: LukeH pointed out in comments that there is actually a hard 2GB limit on object size in the CLR. This is detailed in this other SO question.)
Assuming you're reading the whole thing from one disk, parallelizing the disk reads is probably a bad idea. If there's any processing you need to do to the numbers as or after you load them, you might want to consider running that in parallel at the same time you're reading from disk.
The first question you have presumably already answered is "does this have to be precalculated?". Is there some algorithm you can use that will make it possible to calculate the required values on demand to avoid this problem? Assuming not...
That is only 2.6GB of data - on a 64 bit processor you'll have no problem with a tiny amount of data like that. But if you're running on a 5 year old computer with a 10 year old OS then it's a non-starter, as that much data will immediately fill the available working set for a 32-bit application.
One approach that would be obvious in C++ would be to use a memory-mapped file. This makes the data appear to your application as if it is in RAM, but the OS actually pages bits of it in only as it is accessed, so very little real RAM is used. I'm not sure if you could do this directly from C#, but you could easily enough do it in C++/CLI and then access it from C#.
Alternatively, assuming the question "do you need all of it in RAM simultaneously" has been answered with "yes", then you can't go for any kind of virtualisation approach, so...
Loading in multiple threads won't help - you are going to be I/O bound, so you'll have n threads waiting for data (and asking the hard drive to seek between the chunks they are reading) rather than one thread waiting for data (which is being read sequentially, with no seeks). So threads will just cause more seeking and thus may well make it slower. (The only case where splitting the data up might help is if you split it to different physical disks so different chunks of data can be read in parallel - don't do this in software; buy a RAID array.)
The only place where multithreading may help is to make the load happen in the background while the rest of your application starts up, and allow the user to start using the portion of the data that is already loaded while the rest of the buffer fills, so the user (hopefully) doesn't have to wait much while the data is loading.
So, you're back to loading the data into one massive array in a single thread...
However, you may be able to speed this up considerably by compressing the data. There are a couple of general approaches worth considering:
If you know something about the data, you may be able to invent an encoding scheme that makes the data smaller (and therefore faster to load). e.g. if the values tend to be close to each other (e.g. imagine the data points that describe a sine wave - the values range from very small to very large, but each value is only ever a small increment from the last) you may be able to represent the 'deltas' in a float without losing the accuracy of the original double values, halving the data size. If there is any symmetry or repetition to the data you may be able to exploit it (e.g. imagine storing all the positions to describe a whole circle, versus storing one quadrant and using a bit of trivial and fast maths to reflect it 4 times - an easy way to quarter the amount of data I/O). Any reduction in data size would give a corresponding reduction in load time. In addition, many of these schemes would allow the data to remain "encoded" in RAM, so you'd use far less RAM but still be able to quickly fetch the data when it was needed.
Alternatively, you can very easily wrap your stream with a generic compression algorithm such as Deflate. This may not work, but usually the cost of decompressing the data on the CPU is less than the I/O time that you save by loading less source data, so the net result is that it loads significantly faster. And of course, save a load of disk space too.
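A sketch of the DeflateStream variant (it assumes the file was written through a DeflateStream in the same element order and that the element count is known; note that the 2 GB object-size caveat discussed above still applies to the target array):
// Requires System.IO and System.IO.Compression.
static double[] LoadCompressed(string path, int count)
{
    var values = new double[count];
    using (var file = File.OpenRead(path))
    using (var deflate = new DeflateStream(file, CompressionMode.Decompress))
    using (var reader = new BinaryReader(deflate))
    {
        for (int i = 0; i < count; i++)
        {
            values[i] = reader.ReadDouble(); // decompress on the CPU instead of reading raw bytes from disk
        }
    }
    return values;
}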
In the typical case, loading speed will be limited by the speed of the storage you're loading data from--i.e. the hard drive.
If you want it to be faster, you'll need to use faster storage, e.g. multiple hard drives joined in a RAID scheme.
If your data can be reasonably compressed, do that. Try to find an algorithm which will use exactly as much CPU power as you have---less than that and your external storage speed will be the limiting factor; more than that and your CPU speed will be the limiting factor. If your compression algorithm can use multiple cores, then multithreading can be useful.
If your data are somehow predictable, you might want to come up with a custom compression scheme. E.g. if consecutive numbers are close to each other, you might want to store differences between numbers---this might help compression efficiency.
Do you really need double precision? Maybe floats will do the job? Maybe you don't need the full range of doubles? For example, if you need the full 53 bits of mantissa precision but only need to store numbers between -1.0 and 1.0, you can try to chop a few bits per number by not storing exponents in full range.
Making this parallel would be a bad idea unless you're running on a SSD. The limiting factor is going to be the disk IO--and if you run two threads the head is going to be jumping back and forth between the two areas being read. This will slow it down a lot more than any possible speedup from parallelization.
Remember that drives are MECHANICAL devices and insanely slow compared to the processor. If you can do a million instructions in order to avoid a single head seek you will still come out ahead.
Also, once the file is on disk make sure to defrag the disk to ensure it's in one contiguous block.
That does not sound like a good idea to me. 350,000,000 * 8 bytes = 2,800,000,000 bytes. Even if you manage to avoid the OutOfMemoryException, the process may be swapping in/out of the page file anyway. You might as well leave the data in the file and load smaller chunks as they are needed. The point is that just because you can allocate that much memory does not mean you should.
With a suitable disk configuration, splitting into multiple files across disks would make sense - and reading each file in a separate thread would then work nicely (if you've some stripyness - RAID whatever :) - then it could make sense to read from a single file with multiple threads).
I think you're on a hiding to nothing attempting this with a single physical disk, though.
Just saw this : .NET 4.0 has support for memory mapped files. That would be a very fast way to do it, and no support required for parallelization etc.

What is the fastest way to find if an array of byte arrays contains another byte array?

I have some code that is really slow. I knew it would be and now it is. Basically, I am reading files from a bunch of directories. The file names change but the data does not. To determine if I have read the file, I am hashing its bytes and comparing that to a list of hashes of already processed files. There are about 1000 files in each directory, and figuring out what's new in each directory takes a good minute or so (and then the processing starts). Here's the basic code:
public static class ProgramExtensions
{
    public static byte[] ToSHA256Hash(this FileInfo file)
    {
        using (FileStream fs = new FileStream(file.FullName, FileMode.Open))
        {
            using (SHA256 hasher = new SHA256Managed())
            {
                return hasher.ComputeHash(fs);
            }
        }
    }

    public static string ToHexString(this byte[] p)
    {
        char[] c = new char[p.Length * 2 + 2];
        byte b;
        c[0] = '0'; c[1] = 'x';
        for (int y = 0, x = 2; y < p.Length; ++y, ++x)
        {
            b = ((byte)(p[y] >> 4));
            c[x] = (char)(b > 9 ? b + 0x37 : b + 0x30);
            b = ((byte)(p[y] & 0xF));
            c[++x] = (char)(b > 9 ? b + 0x37 : b + 0x30);
        }
        return new string(c);
    }
}

class Program
{
    static void Main(string[] args)
    {
        var allFiles = new DirectoryInfo("c:\\temp").GetFiles("*.*");
        List<string> readFileHashes = GetReadFileHashes();
        List<FileInfo> filesToRead = new List<FileInfo>();
        foreach (var file in allFiles)
        {
            if (readFileHashes.Contains(file.ToSHA256Hash().ToHexString()))
                filesToRead.Add(file);
        }
        //read new files
    }
}
Is there anyway I can speed this up?
I believe you can achieve the most significant performance improvement by simply checking the file size first: if the file sizes do not match, you can skip the entire file and not even open it.
Instead of just saving a list of known hashes, you would also keep a list of known filesizes and only do a content comparison when filesizes match. When filesize doesn't match, you can save yourself from even looking at the file content.
Depending on the size your files generally have, a further improvement can be worthwhile:
Either do a binary compare with early abort when the first byte is different (this saves reading the entire file, which can be a very significant improvement if your files are generally large; any hash algorithm would read the entire file, whereas detecting that the first byte is different saves you from reading the rest of the file). If your lookup file list likely contains many files of the same size, so you'd likely have to do a binary comparison against several files, instead consider:
Hashing in blocks of, say, 1MB each. First check the first block only against the precalculated first-block hash in your lookup. Only compare the 2nd block if the 1st block is the same; this saves reading beyond the 1st block in most cases for different files. Both of those options are only really worth the effort when your files are large.
I doubt that changing the hashing algorithm itself (e.g first check doing a CRC as suggested) would make any significant difference. Your bottleneck is likely disk IO, not CPU so avoiding disk IO is what will give you the most improvement. But as always in performance, do measure.
Then, if this is still not enough (and only then), experiment with asynchronous IO (remember though that sequential reads are generally faster than random access, so too much random asynchronous reading can hurt your performance)
Create a file list
Sort the list by filesize
Eliminate files with unique sizes from the list
Now do hashing (a fast hash first might improve performance as well)
Use a data structure for your readFileHashes store that has an efficient search capability (hashing or binary search). I think HashSet or TreeSet would serve you better here.
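In the code above that would be a one-line change (a sketch, assuming the intent is to collect files whose hash has not been seen before):
// HashSet<string>.Contains is O(1); List<string>.Contains is O(n) per file.
HashSet<string> readFileHashes = new HashSet<string>(GetReadFileHashes());
foreach (var file in allFiles)
{
    if (!readFileHashes.Contains(file.ToSHA256Hash().ToHexString()))
        filesToRead.Add(file);
}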
Use an appropriate checksum (hash sum) function. SHA256 is a cryptographic hash that is probably overkill. CRC is less computationally expensive; it was originally intended for catching unintentional/random changes (transmission errors), but is susceptible to changes that are designed/intended to be hidden. What fits the differences between the files you are scanning?
See http://en.wikipedia.org/wiki/List_of_checksum_algorithms#Computational_costs_of_CRCs_vs_Hashes
Would a really simple checksum via sampling (e.g. checksum = (first 10 bytes and last 10 bytes)) work?
I'd do a quick CRC hash check first, as it is less expensive.
if the CRC does not match, continue on with a more "reliable" hash test such as SHA
Your description of the problem still isn't clear enough.
The biggest problem is that you are doing a bunch of hashing. This is guaranteed to be slow.
You might want to try searching for the modification time, which does not change if a filename is changed:
http://msdn.microsoft.com/en-us/library/ms724320(VS.85,loband).aspx
Or you might want to monitor the folder for any new file changes:
http://www.codeguru.com/forum/showthread.php?t=436716
First group the files by file size - this will leave you with smaller groups of files. Now it depends on the group size and file sizes. You could just start reading all files in parallel until you find a difference. If there is a difference, split the group into smaller groups having the same value at the current position. If you have information about how the files differ, you can use this information - start reading at the end, don't read and compare byte by byte if large clusters change, or whatever you know about the files. This solution might introduce I/O performance problems if you have to read many files in parallel, causing random disc access.
You could also calculate hash values for all files in each group and compare them. You don't necessarily have to process the whole files at once - just calculate the hash of a few bytes (maybe a 4 KiB cluster or whatever fits your file sizes) and check if there are already differences. If not, calculate the hashes of the next few bytes. This gives you the possibility to process larger blocks of each file without having to keep one such large block for each file in a group in memory.
So it's all about a time-memory (disc I/O-memory) trade-off. You have to find your way between reading all files in a group into memory and comparing them byte by byte (high memory requirement, fast sequential access, but may read too much data) and reading the files byte by byte and comparing only the last byte read (low memory requirement, slow random access, reads only required data). Further, if the groups are very large, comparing the files byte by byte will become slower - comparing one byte from n files is an O(n) operation - and it might become more efficient to calculate hash values first and then compare only the hash values.
Updated: Definitely DO NOT make file size your only check. If your OS version allows, use FileInfo.LastWriteTime.
I've implemented something similar for an in-house project compiler/packager. We have over 8k files, so we store the last modified dates and hash data in a SQL database. Then on subsequent runs we query first against the modified date on any specific file, and only then against the hash data... that way we only calculate new hash data for those files that appear to be modified...
.NET has a way to check for the last modified date, in the FileInfo class. I suggest you check it out. EDIT: here is the link: LastWriteTime.
Our packager takes about 20 secs to find out what files have been modified.
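The date check itself is cheap; a sketch of the idea (the dictionary here is a plain in-memory cache keyed by path, standing in for the SQL table mentioned above):
// Maps full path -> (last write time, hash) from a previous run.
static bool NeedsRehash(FileInfo file, Dictionary<string, Tuple<DateTime, string>> knownFiles)
{
    Tuple<DateTime, string> cached;
    if (knownFiles.TryGetValue(file.FullName, out cached) && cached.Item1 == file.LastWriteTimeUtc)
    {
        return false; // unchanged since it was last hashed
    }
    return true;      // new or modified, so hash it again
}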
