Compare two files' checksums to detect any differences between them - C#

We have many files (Word, Excel) in our systems on a local network, and we back these files up every day. Now I want to know whether a file differs from its backup. For example, suppose we have a file "test.docx" whose backup is named "test_backup.docx"; I want to know whether a user has made any change to "test.docx". I would like to compare these two files.
One way is to compare these files word by word; when a difference is detected, we can conclude that the file has been updated.
Now my question is: is there another way, such as a checksum, to detect this difference? And with that method, can I find where the update occurred?
Thanks.

Have you seen SyncToy?
It sounds like you want to automate a backup copy process and you don't really care about the particular differences; you're just trying to determine whether there are any. My answer is based on that presumption.
Hashing is a good approach to determine whether a file should really be backed up, but it requires reading the whole file and performing an expensive computation on it.
You can pre-process the backup file list by looking at each file's size and timestamps (modified, accessed): if they don't match, back it up without a checksum. If they do match, it's up to you either to assume the files are the same or to hash the contents. I'd first assume they are the same whenever all timestamps and the size match the backup copy, and if that heuristic proves wrong, resort to hashing. But find the fastest algorithm possible: your application of hashes doesn't seem to require high security, just high performance, so SHA and MD5 would both be overkill with dreadful performance.
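As a minimal sketch of that heuristic (the method name is mine, not from the answer):
// Hypothetical pre-check: only consider hashing when the cheap metadata
// comparison says the file *might* still match its backup.
static bool NeedsBackup(FileInfo source, FileInfo backup)
{
    if (!backup.Exists) return true;
    if (source.Length != backup.Length) return true;                     // size differs: changed
    if (source.LastWriteTimeUtc != backup.LastWriteTimeUtc) return true; // timestamp differs
    // Metadata matches: either trust it, or confirm with a fast hash.
    return false;
}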

Here's how we compute a signature of a file:
public static string Signature(this FileInfo input)
{
    // Hash the stream directly so the whole file never has to sit in memory,
    // and dispose everything even if an exception is thrown.
    using (var md5 = new MD5CryptoServiceProvider())
    using (var fs = new FileStream(input.FullName, FileMode.Open, FileAccess.Read))
    {
        byte[] hash = md5.ComputeHash(fs);
        return BitConverter.ToString(hash).Replace("-", "");
    }
}
Then we compare that signature against the signature of the previous version to detect changes.
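For example, a hedged usage sketch against the file names from the question (the network paths are my own assumption):
var original = new FileInfo(@"\\server\docs\test.docx");
var backup   = new FileInfo(@"\\server\backup\test_backup.docx");

// Equal signatures mean the content is unchanged (up to MD5 collisions).
bool changed = original.Signature() != backup.Signature();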

Related

Why is the result of binary data writing a string?

I want to create a binary file and store string data in it. I used this sample:
FileStream fs = new FileStream("c:\\test.data", FileMode.Create);
BinaryWriter bw = new BinaryWriter(fs);
bw.Write(Encoding.ASCII.GetBytes("david stein"));
bw.Close();
but when I opened the created file (test.data) in Notepad, it had the string data in it ("david stein"). Now my question is: what's the difference between this binary writing and text writing, when the result is a string either way?
I want to create data in a binary file such that a user cannot open and read my data in Notepad; if the user opens it in Notepad, they should see raw binary data.
When you open some files, like JPGs, in a text editor you cannot read their content, yet they do not use any encryption. What about that? How can I write my data like that?
Now my question is: what's the difference between this binary writing and text writing, when the result is a string?
The data in a file is always "a sequence of bytes". In this case, the sequence of bytes you've written is "the bytes representing the text 'david stein'" in the ASCII encoding. So yes, if you open the file in any editor which tries to interpret the bytes as text in a way which is compatible with ASCII, you'll see the text "david stein". Really it's just a load of bytes though - it all depends on how you interpret them.
If you'd written:
File.WriteAllText("c:\\test.data", "david stein", Encoding.ASCII);
you'd have ended up with the exact same sequence of bytes. There are any number of ways you could have created a file with the same bytes in. There's nothing about File.WriteAllText which "marks" the file as text, and there's nothing about FileStream or BinaryWriter which marks the file as binary.
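To make that concrete, a small sketch (the temp-file paths are my own) that writes the same text both ways and compares the resulting bytes:
string pathA = Path.Combine(Path.GetTempPath(), "a.data");
string pathB = Path.Combine(Path.GetTempPath(), "b.data");

// "Binary" write, as in the question.
using (var fs = new FileStream(pathA, FileMode.Create))
using (var bw = new BinaryWriter(fs))
{
    bw.Write(Encoding.ASCII.GetBytes("david stein"));
}

// "Text" write.
File.WriteAllText(pathB, "david stein", Encoding.ASCII);

// Both files contain exactly the same eleven bytes.
bool identical = File.ReadAllBytes(pathA).SequenceEqual(File.ReadAllBytes(pathB)); // needs System.Linq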
EDIT: From comments:
I want to create data in a binary file such that a user cannot open and read my data in Notepad
Well, there are lots of ways of doing that with different levels of security. Ideally, you'd want some sort of encryption - but then the code reading the data would need to be able to decrypt it as well, which means it would need to be able to get a key. That then moves the question to "how do I store a key".
If you only need to verify the data in the file (e.g. check that it matches something from elsewhere) then you could use a cryptographic hash instead.
If you only need to prevent the most casual of snoopers, you could use something which is basically just obfuscation: a very weak form of encryption with no "key" as such. Anyone who decompiled your code would easily be able to get at the data in that case, but you may not care too much about that.
It all depends on your requirements.
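Purely as an illustration of the "obfuscation, not security" option described above (the XOR mask and method name are my own invention):
// Obfuscation only: XOR every byte with a fixed mask. Unreadable in Notepad,
// but trivially reversible by anyone who reads the code. NOT real encryption.
static byte[] Obfuscate(byte[] data, byte mask = 0x5A)
{
    var result = new byte[data.Length];
    for (int i = 0; i < data.Length; i++)
        result[i] = (byte)(data[i] ^ mask);
    return result; // running it again on the output restores the original
}

// File.WriteAllBytes("c:\\test.data", Obfuscate(Encoding.ASCII.GetBytes("david stein")));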
All data is binary. A text file is binary data that happens to be a limited subset that represent valid characters, but it's still binary.
The way text editors typically differentiate a text file from a binary file is to scan a certain portion of the file for zero bytes, \0. These almost never exist in text-only files and almost always exist in binary files.
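A rough sketch of that heuristic (the probe size and method name are assumptions, not a standard):
// Heuristic only: probe the first chunk of the file for NUL bytes.
static bool LooksBinary(string path, int probeSize = 8192)
{
    var probe = new byte[probeSize];
    using (var fs = File.OpenRead(path))
    {
        int read = fs.Read(probe, 0, probe.Length);
        for (int i = 0; i < read; i++)
            if (probe[i] == 0) return true; // \0 almost never appears in plain text
    }
    return false;
}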

C#: Reading Huge CSV File

I'm parsing a 40MB CSV file.
It works nicely right now, and it's rather easy to parse; the only problem I have is performance, which of course is rather poor.
I'd like to know if there is a way I can improve this. I only need to find a row by its key and then stop looping, so if the entry is at the beginning of the file it finishes quickly, but if it's at the end it takes a while.
I could balance this by giving it a random start line, but the algorithm would still be O(n)... So I'm not sure if it's really worth it.
Is there a way I can improve my sequential parsing algorithm?
First: "Reading Huge CSV File" and "I'm parsing a 40MB CSV file." I have space-delimited files here of 10+ gigabytes; what would you call those?
Also: the size of the file is irrelevant; you process it line by line anyway.
the only problem I have is performance, which of course is rather slow
Define. What do you think is slow? Parsing them is quite fast when done properly.
I'd like to know if there is a way I can improve this. I only need to find a row by its key and then stop looping, so if the entry is at the beginning of the file it finishes quickly, but if it's at the end it takes a while.
Do NOT use a CSV file? More than 60 years ago people invented databases for this.
Is there a way I can improve my sequential parsing algorithm?
You mean other than pulling the parsing into a separate thread and using efficient code (which you may not have; no one knows)?
Theoretically you could:
Read on one thread, with a decent buffer (less IO = faster)
Move field splitting onto a second thread (optional)
Use tasks to parse the fields (one per field per line) so you use all processors.
I am currently processing some (around 10,000) files (sadly with sizes in the double-digit gigabytes) and I go this way (I have to process them in a specific order) to use my computer fully.
That should give you a lot; seriously, a 40 MB file should load in a fraction of a second (0.5 to 0.6).
STILL, that is very inefficient. Any reason you don't load the file into a database like everyone else does? CSV is good as a transport format; it sucks as a database.
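For scale, even a plain buffered, line-by-line scan with an early exit is already quick on a 40 MB file. A hedged sketch (the comma delimiter and the assumption that the key is the first field are mine):
static string FindRow(string path, string key)
{
    // Large read buffer so the drive is hit with few, big sequential reads.
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                   FileShare.Read, 1024 * 1024))
    using (var reader = new StreamReader(fs))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            int comma = line.IndexOf(',');
            string field = comma < 0 ? line : line.Substring(0, comma);
            if (field == key)
                return line; // stop looping as soon as the key is found
        }
    }
    return null; // key not present
}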
Why don't you convert your CSV to a normal database? Even SQL Server Express would be fine.
Of course.
Say you order it alphabetically.
Then, start in the middle.
Each iteration, move to the middle of the top or bottom; whichever has the appropriate key.
This algorithm is O(log n).
This is called a "binary search", and is what Mike Christianson suggests in his comment.
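A minimal sketch of that approach, assuming the keys have been loaded and sorted once up front:
// keys: sorted key column; rows: the corresponding full CSV lines.
static string BinaryFind(List<string> keys, List<string> rows, string key)
{
    int lo = 0, hi = keys.Count - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        int cmp = string.CompareOrdinal(keys[mid], key);
        if (cmp == 0) return rows[mid];
        if (cmp < 0) lo = mid + 1;
        else hi = mid - 1;
    }
    return null; // not found
}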
I would suggest breaking the one 40 MB file into a few smaller files.
Using Parallel.ForEach, you could then improve file-processing performance.
You could also load the CSV into a DataTable and use the available operations, which could be faster than looping through it.
Loading it into a database and performing your operation there is another option.
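A hedged sketch of the Parallel.ForEach idea, assuming the big file has already been split into chunk files and the key is the first field (both assumptions, not from the answer):
string[] chunkFiles = Directory.GetFiles("c:\\chunks", "*.csv"); // hypothetical split output
string searchKey = "12345";                                      // hypothetical key
string match = null;

Parallel.ForEach(chunkFiles, (chunk, state) =>
{
    foreach (string line in File.ReadLines(chunk))
    {
        if (line.StartsWith(searchKey + ","))
        {
            match = line;     // last writer wins if two chunks both match
            state.Stop();     // ask the other workers to stop early
            break;
        }
    }
});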
This, I believe, is the fastest way to read a CSV file sequentially. There may be other ways to extract data from a CSV, but if you are limited to this approach, then this solution might work for you.
const int BUFFER_SIZE = 0x8000; // 32768 bytes

public unsafe void parseCSV(string filePath)
{
    byte[] buffer = new byte[BUFFER_SIZE];
    int workingSize = 0;   // how many bytes are left in the buffer
    int bufferSize = 0;    // how many bytes were read by the file stream
    StringBuilder builder = new StringBuilder();
    char cByte;            // character representation of the current byte

    using (FileStream fs = new FileStream(filePath, FileMode.Open, FileAccess.Read))
    {
        do
        {
            bufferSize = fs.Read(buffer, 0, BUFFER_SIZE);
            workingSize = bufferSize;
            fixed (byte* bufferPtr = buffer)
            {
                byte* workingBufferPtr = bufferPtr;
                while (workingSize-- > 0)
                {
                    switch (cByte = (char)*workingBufferPtr++)
                    {
                        case '\n':
                            break;                  // end of line: nothing to append
                        case '\r':
                        case ',':
                            builder.ToString();     // field is complete: test it against your key here
                            builder.Clear();
                            break;
                        default:
                            builder.Append(cByte);  // accumulate field characters
                            break;
                    }
                }
            }
        } while (bufferSize != 0);
    }
}
Explanation:
Reading the file into a byte buffer. This is done with the basic FileStream class, which gives access to the always-fast Read().
Unsafe code. While I generally recommend against unsafe code, when traversing any kind of buffer, using pointers can bring a speedup.
StringBuilder, since we will be concatenating bytes into workable strings to test against the key. StringBuilder is by far the fastest way to append bytes together and get a workable string out of them.
Note that this method is only fairly compliant with RFC 4180, but if you need to deal with quoted fields, you can easily modify the code I posted to handle trimming them.

Sorting Binary File Index Based

I have a binary file which can be seen as a concatenation of different sub-files:
INPUT FILE:
Hex Offset   ID          SortIndex
0000000      SubFile#1   3
0000AAA      SubFile#2   1
0000BBB      SubFile#3   2
...
FFFFFFF      SubFile#N   N
This is the information I have about each sub-file:
Starting offset
Length in bytes
Final sequence order
What's the fastest way to produce a sorted output file, in your opinion?
For instance, the OUTPUT FILE will contain the sub-files in the following order:
SubFile#2
SubFile#3
SubFile#1
...
I have thought about:
Splitting the input file, extracting each sub-file to disk, then concatenating them in the correct order
Using FileSeek to move around the file and adding each sub-file to a BinaryWriter stream.
Consider the following information also:
The input file can be really huge (200 MB to 1 GB)
For those who know them, I am speaking about IBM AFP files.
Both of my solutions are easy to implement, but neither looks very performant in my opinion.
Thanks in advance
Even if the file is big, the number of IDs is not that huge.
You can just load all your IDs, sort indexes, offsets, and lengths into RAM, sort them there with a simple quicksort, and when you finish, rewrite the entire file in the order given by your sorted array.
I expect this to be faster than other methods.
So... let's write some pseudocode.
public struct FileItem : IComparable<FileItem>
{
    public String Id;
    public int SortIndex;
    public uint Offset;
    public uint Length;

    public int CompareTo(FileItem other) { return this.SortIndex.CompareTo(other.SortIndex); }
}

public static FileItem[] LoadAndSortFileItems(string indexSource)
{
    FileItem[] result = /* fill the array from your sub-file index */;
    Array.Sort(result);
    return result;
}

public static void WriteFileItems(FileItem[] items, FileStream inputFile, FileStream outputFile)
{
    byte[] buffer = new byte[1024 * 1024]; // block-copy buffer; tune the size to your hardware
    foreach (FileItem item in items)
    {
        // Seek to the sub-file inside the input and copy Length bytes to the output.
        inputFile.Seek(item.Offset, SeekOrigin.Begin);
        long remaining = item.Length;
        while (remaining > 0)
        {
            int read = inputFile.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
            if (read == 0) break;
            outputFile.Write(buffer, 0, read);
            remaining -= read;
        }
    }
}
The number of read operations is linear, O(n), but seeking is required.
The only performance concern with seeking is hard-drive cache misses.
Modern hard drives have a big cache, from 8 to 32 megabytes; seeking around a big file in random order means cache misses, but I would not worry too much, because the time spent copying the files is, I guess, greater than the time spent seeking.
If you are using a solid state disk instead, seek time is effectively zero :)
Writing the output file, however, is O(n) and sequential, and this is a very good thing since it is totally cache friendly.
You can ensure better time if you preallocate the size of the file before starting to write it.
FileStream myFileStream = ...
myFileStream.SetLength(predictedTotalSizeOfFile);
Sorting the FileItem structures in RAM is O(n log n), but even with 100,000 items it will be fast and will use only a small amount of memory.
The copy is the slowest part. Use a block size of 256 kilobytes to 2 megabytes for the block copy, to ensure that copying big chunks of file A to file B is fast; you can adjust the block-copy size by running some tests, always keeping in mind that every machine is different.
It is not useful to try a multithreaded approach; it will just slow down the copy.
It is obvious, but copying from drive C: to drive D:, for example, will be faster (of course, not two partitions but two different SATA drives).
Consider also that you need to seek, either when reading or when writing, at some point. If you split the original file into several smaller files, you will make the OS seek through those smaller files instead, which doesn't make sense; it will be messy, slower, and probably harder to code.
Consider also that if the files are fragmented, the OS will seek by itself, and that is out of your control.
The first solution I thought of was to read the input file sequentially and build a SubFile object for every sub-file. These objects would be put into a b+tree as soon as they are created, with the tree ordering the sub-files by their SortIndex. A good b-tree implementation has linked child nodes, which lets you iterate over the sub-files in the correct order and write them to the output file.
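A rough sketch of that idea, using SortedDictionary as a stand-in for the b+tree (ReadIndex and CopySubFile are hypothetical helpers; FileItem is the struct from the answer above):
var bySortIndex = new SortedDictionary<int, FileItem>();

// Build phase: one entry per sub-file, keyed by its final sequence order.
foreach (FileItem item in ReadIndex(inputFile))
    bySortIndex[item.SortIndex] = item;

// SortedDictionary enumerates in key order, so sub-files come out already sorted.
foreach (FileItem item in bySortIndex.Values)
    CopySubFile(inputFile, outputFile, item.Offset, item.Length);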
Another way could be to use random access files. You can load all the SortIndexes and offsets, sort them, and write the output file in that order. In this case everything depends on how the random access file reader is implemented; if it just reads the file up to the specified position, it would not be very performant. Honestly, I have no idea how they work... :(

What is the fastest way to find if an array of byte arrays contains another byte array?

I have some code that is really slow. I knew it would be, and now it is. Basically, I am reading files from a bunch of directories. The file names change but the data does not. To determine whether I have already read a file, I hash its bytes and compare that to a list of hashes of already processed files. There are about 1000 files in each directory, and figuring out what's new in each directory takes a good minute or so (and then the processing starts). Here's the basic code:
public static class ProgramExtensions
{
    public static byte[] ToSHA256Hash(this FileInfo file)
    {
        using (FileStream fs = new FileStream(file.FullName, FileMode.Open))
        using (SHA256 hasher = new SHA256Managed())
        {
            return hasher.ComputeHash(fs);
        }
    }

    public static string ToHexString(this byte[] p)
    {
        // Builds a "0x..." string with two hex characters per byte.
        char[] c = new char[p.Length * 2 + 2];
        byte b;
        c[0] = '0'; c[1] = 'x';
        for (int y = 0, x = 2; y < p.Length; ++y, ++x)
        {
            b = ((byte)(p[y] >> 4));
            c[x] = (char)(b > 9 ? b + 0x37 : b + 0x30);
            b = ((byte)(p[y] & 0xF));
            c[++x] = (char)(b > 9 ? b + 0x37 : b + 0x30);
        }
        return new string(c);
    }
}

class Program
{
    static void Main(string[] args)
    {
        var allFiles = new DirectoryInfo("c:\\temp").GetFiles("*.*");
        List<string> readFileHashes = GetReadFileHashes();
        List<FileInfo> filesToRead = new List<FileInfo>();

        foreach (var file in allFiles)
        {
            // A file is new if its hash is NOT already in the processed list.
            if (!readFileHashes.Contains(file.ToSHA256Hash().ToHexString()))
                filesToRead.Add(file);
        }

        //read new files
    }
}
Is there any way I can speed this up?
I believe you can achieve the most significant performance improvement by simply checking the file size first: if the file size does not match, you can skip the entire file and not even open it.
Instead of just saving a list of known hashes, you would also keep a list of known file sizes and only do a content comparison when the file sizes match. When the file size doesn't match, you save yourself from even looking at the file content.
Depending on how large your files generally are, a further improvement can be worthwhile:
Either do a binary compare with early abort when the first byte differs. This saves reading the entire file, which can be a very significant improvement if your files are generally large; any hash algorithm would read the entire file, whereas detecting that the first byte differs saves you from reading the rest. If your lookup file list likely contains many files of the same size, so that you would have to do binary comparisons against several files, consider instead:
hashing in blocks of, say, 1 MB each. First check only the first block against the precalculated first-block hash in your lookup, and only compare the second block if the first block is the same. This saves reading beyond the first block in most cases where the files differ. Both options are only really worth the effort when your files are large.
I doubt that changing the hashing algorithm itself (e.g. doing a first check with a CRC, as suggested) would make any significant difference. Your bottleneck is likely disk IO, not CPU, so avoiding disk IO is what will give you the most improvement. But as always with performance, measure.
Then, if this is still not enough (and only then), experiment with asynchronous IO (but remember that sequential reads are generally faster than random access, so too much random asynchronous reading can hurt your performance).
Create a file list
Sort the list by filesize
Eliminate files with unique sizes from the list
Now do hashing (a fast hash first might improve performance as well)
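A sketch of that pre-filtering, assuming the sizes of already-processed files are stored alongside their hashes (GetKnownFileSizes is a hypothetical helper; needs System.Linq):
HashSet<long> knownSizes = GetKnownFileSizes(); // hypothetical: sizes of already-processed files

var needHashing = new DirectoryInfo("c:\\temp").GetFiles("*.*")
    .OrderBy(f => f.Length)
    .Where(f => knownSizes.Contains(f.Length)) // a unique size cannot match anything known
    .ToList();

// Only 'needHashing' requires the expensive hash; everything else is definitely new.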
Use a data structure for your readFileHashes store that has an efficient search capability (hashing or binary search). I think HashSet or TreeSet would serve you better here.
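For the lookup itself, a sketch of the change (not the poster's exact code):
// O(1) membership test instead of List<string>.Contains, which is O(n) per file.
var readFileHashes = new HashSet<string>(GetReadFileHashes());

if (!readFileHashes.Contains(file.ToSHA256Hash().ToHexString()))
    filesToRead.Add(file);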
Use an appropriate checksum (hash) function. SHA256 is a cryptographic hash that is probably overkill. CRC is less computationally expensive; it was originally intended for catching unintentional/random changes (transmission errors), but it is susceptible to changes that are designed/intended to be hidden. What fits the differences between the files you are scanning?
See http://en.wikipedia.org/wiki/List_of_checksum_algorithms#Computational_costs_of_CRCs_vs_Hashes
Would a really simple checksum via sampling (e.g. checksum = first 10 bytes + last 10 bytes) work?
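A sketch of that sampling idea (10-byte windows as suggested; the method name is mine, and matching fingerprints would still need a full comparison):
static byte[] SampleFingerprint(string path)
{
    var head = new byte[10];
    var tail = new byte[10];
    using (var fs = File.OpenRead(path))
    {
        fs.Read(head, 0, (int)Math.Min(10, fs.Length));
        if (fs.Length > 10)
        {
            fs.Seek(-10, SeekOrigin.End);
            fs.Read(tail, 0, 10);
        }
    }
    return head.Concat(tail).ToArray(); // needs System.Linq
}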
I'd do a quick CRC hash check first, as it is less expensive.
If the CRC does not match, continue on with a more "reliable" hash test such as SHA.
Your description of the problem still isn't clear enough.
The biggest problem is that you are doing a bunch of hashing. This is guaranteed to be slow.
You might want to try looking at the modification time, which does not change when a file is renamed:
http://msdn.microsoft.com/en-us/library/ms724320(VS.85,loband).aspx
Or you might want to monitor the folder for any new file changes:
http://www.codeguru.com/forum/showthread.php?t=436716
First group the files by file size; this will leave you with smaller groups of files. Now it depends on the group size and the file sizes. You could just start reading all files in parallel until you find a difference. If there is a difference, split the group into smaller groups having the same value at the current position. If you have information about how the files typically differ, you can use it: start reading at the end, don't read and compare byte by byte if larger clusters change, or whatever else you know about the files. This solution might introduce I/O performance problems if you have to read many files in parallel, causing random disc access.
You could also calculate hash values for all files in each group and compare them. You don't necessarily have to process the whole files at once; just calculate the hash of a few bytes (maybe a 4 KiB cluster, or whatever fits your file sizes) and check whether there are already differences. If not, calculate the hashes of the next few bytes. This lets you process larger blocks of each file without having to keep one such large block per file of a group in memory.
So it's all about a time-memory (disk I/O versus memory) trade-off. You have to find your way between reading all files of a group into memory and comparing them byte by byte (high memory requirement, fast sequential access, but may read too much data) and reading the files byte by byte and comparing only the last byte read (low memory requirement, slow random access, reads only required data). Further, if the groups are very large, comparing the files byte by byte becomes slower; comparing one byte from n files is an O(n) operation, and it might become more efficient to calculate hash values first and then compare only the hash values.
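Simplified to just two files, a minimal sketch of the block-wise comparison with early abort (4 KiB blocks, as mentioned above):
static bool SameContent(string pathA, string pathB)
{
    const int BlockSize = 4096;
    var a = new byte[BlockSize];
    var b = new byte[BlockSize];
    using (var fsA = File.OpenRead(pathA))
    using (var fsB = File.OpenRead(pathB))
    {
        if (fsA.Length != fsB.Length) return false; // cheapest check first
        int readA;
        while ((readA = fsA.Read(a, 0, BlockSize)) > 0)
        {
            if (fsB.Read(b, 0, BlockSize) != readA) return false;
            for (int i = 0; i < readA; i++)
                if (a[i] != b[i]) return false;     // abort at the first difference
        }
    }
    return true;
}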
Updated: definitely do NOT make file size your only check. If your OS version allows it, use FileInfo.LastWriteTime.
I've implemented something similar for an in-house project compiler/packager. We have over 8k files, so we store the last-modified dates and hash data in a SQL database. Then on subsequent runs we query first against the modified date of any specific file, and only then against the hash data; that way we only calculate new hash data for files that appear to have been modified...
.NET has a way to check the last-modified date, in the FileInfo class; I suggest you check it out. EDIT: here is the link: LastWriteTime
Our packager takes about 20 seconds to find out which files have been modified.

Faster MD5 alternative?

I'm working on a program that searches entire drives for a given file. At the moment, I calculate an MD5 hash for the known file and then scan all files recursively, looking for a match.
The only problem is that MD5 is painfully slow on large files. Is there a faster alternative that I can use while retaining a very small probability of false positives?
All code is in C#.
Thank you.
Update
I've read that even MD5 can be pretty quick and that disk I/O should be the limiting factor. That leads me to believe that my code might not be optimal. Are there any problems with this approach?
MD5 md5 = MD5.Create();
StringBuilder sb = new StringBuilder();
try
{
    using (FileStream fs = File.Open(fileName, FileMode.Open, FileAccess.Read))
    {
        foreach (byte b in md5.ComputeHash(fs))
            sb.Append(b.ToString("X2"));
    }
    return sb.ToString();
}
catch (Exception)
{
    return "";
}
I hope you're checking for an MD5 match only if the file size already matches.
Another optimization is to do a quick checksum of the first 1K (or some other arbitrary, but reasonably small, number of bytes) and make sure those match before working through the whole file.
Of course, all this assumes that you're just looking for a match/nomatch decision for a particular file.
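A sketch of those two pre-checks, assuming the target file's size and first kilobyte were captured once up front (names are mine):
static bool CouldMatch(string candidatePath, long knownSize, byte[] knownFirstKb)
{
    if (new FileInfo(candidatePath).Length != knownSize)
        return false;                                   // size check costs almost nothing

    var buffer = new byte[knownFirstKb.Length];
    using (var fs = File.OpenRead(candidatePath))
    {
        if (fs.Read(buffer, 0, buffer.Length) != knownFirstKb.Length)
            return false;
        for (int i = 0; i < buffer.Length; i++)
            if (buffer[i] != knownFirstKb[i]) return false; // first-1K check: cheap
    }
    return true; // only now is the full MD5 worth computing
}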
Regardless of cryptographic requirements, the possibility of a hash collision exists, so no hashing function can be used to guarantee that two files are identical.
I wrote similar code a while back, which I got to run pretty fast by indexing all the files first and discarding any with a different size. A fast hash comparison (on part of each file) was then performed on the remaining entries (comparing bytes for this step proved to be less useful; many file types have common headers with identical bytes at the start of the file). Any files left after this stage were then checked using MD5, and finally with a byte comparison of the whole file if the MD5 matched, just to ensure that the contents were the same.
Just read the file linearly? It seems pretty pointless to read the entire file, compute an MD5 hash, and then compare the hash.
Reading the file sequentially, a few bytes at a time, would allow you to discard the vast majority of files after reading, say, 4 bytes. And you'd save all the processing overhead of computing a hashing function which doesn't give you anything in your case.
If you already had the hashes for all the files in the drive, it'd make sense to compare them, but if you have to compute them on the fly, there just doesn't seem to be any advantage to the hashing.
Am I missing something here? What does hashing buy you in this case?
First consider what your bottleneck really is: the hash function itself, or rather the disk access speed? If you are bounded by the disk, changing the hashing algorithm won't give you much. From your description I gather that you are always scanning the whole disk to find a match; consider building the index first and then only matching a given hash against the index. That will be much faster.
There is one small problem with using MD5 to compare files: there are known pairs of files which are different but have the same MD5.
This means you can use MD5 to tell if the files are different (if the MD5 is different, the files must be different), but you cannot use MD5 to tell if the files are equal (if the files are equal, the MD5 must be the same, but if the MD5 is equal, the files might or might not be equal).
You should either use a hash function which has not been broken yet (like SHA-1), or (as #SoapBox mentioned) use MD5 only as a fast way to find candidates for a deeper comparison.
References:
http://www.win.tue.nl/hashclash/SoftIntCodeSign/
Use MD5CryptoServiceProvider and BufferedStream
using (FileStream stream = File.OpenRead(filePath))
{
    using (var bufferedStream = new BufferedStream(stream, 1024 * 32))
    {
        var md5 = new MD5CryptoServiceProvider();
        byte[] checksum = md5.ComputeHash(bufferedStream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}
