I am working on transferring files over the network. There is zero tolerance for data loss during the transfers. I've been asked to compute the SHA256 values for the original and the copied file to verify the contents are the same. So far I have made comparisons based on copying and pasting the file and letting Windows append "- Copy" to the filename. I have also tried renaming that copy, as well as removing the file extension. So far they all produce the same hash. I've also written code that alters file attributes (it just changes lastWrittenTime and fileCreationTime), and this does not seem to have an effect on the hash.
Checksum result of copying and pasting a file (Explorer appends "- Copy" to the name):
E7273D248F191A0F914837A21BE39D229D790CA242D38651BAA06DAC9EBB63F7
E7273D248F191A0F914837A21BE39D229D790CA242D38651BAA06DAC9EBB63F7
Checksum result of renaming the "- Copy" file in Explorer:
E7273D248F191A0F914837A21BE39D229D790CA242D38651BAA06DAC9EBB63F7
E7273D248F191A0F914837A21BE39D229D790CA242D38651BAA06DAC9EBB63F7
Checksum result of changing file extension:
E7273D248F191A0F914837A21BE39D229D790CA242D38651BAA06DAC9EBB63F7
E7273D248F191A0F914837A21BE39D229D790CA242D38651BAA06DAC9EBB63F7
What part(s) of the file are used when the hash is created?
OK, "zero tolerance" was a bit much; if the hash doesn't match, the file will have to be resent.
The entire binary file contents are streamed through the hashing algorithm. File metadata (such as the name, dates, etc.) plays no part.
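To illustrate, here is a minimal C# sketch (C#/.NET is assumed here, since the attribute-changing code mentioned above suggests it) that hashes a file. Only the bytes read from the stream feed the digest, so renaming the file or touching its timestamps cannot change the result.

using System;
using System.IO;
using System.Security.Cryptography;

class FileHasher
{
    // Streams the whole file through SHA256 and returns the digest as a hex string.
    static string Sha256OfFile(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
        {
            byte[] digest = sha.ComputeHash(stream);
            return BitConverter.ToString(digest).Replace("-", "");
        }
    }

    static void Main(string[] args)
    {
        // Identical contents produce identical hex strings, regardless of name or attributes.
        Console.WriteLine(Sha256OfFile(args[0]));
        Console.WriteLine(Sha256OfFile(args[1]));
    }
}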
First, a general recommendation: don't do this. Use rsync or something similar to do bulk file transfers. Rsync has years of optimisations and debugging behind it, has countless options to control how (and whether) the copying happens, and is available on Windows. Don't waste time building something that has already been built.
But if you must…
Hashing algorithms generally care about bytes, not files. When applying SHA256 to a file, you are simply reading the bytes and passing them through the algorithm.
If you want to hash paths, permissions, etc, you should do this at the directory level, because these things constitute the "contents" of a directory. There is no standard byte-level representation of directories, so you'll have to make one up yourself. Something that looks like a directory listing in sorted order usually suffices. And make sure that each entry contains the hash of the corresponding thing, be it a file or another directory. This way, the hash of the directory uniquely specifies not only the name and attributes of each child, but, recursively, the entire contents of the subdirectory.
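As an illustration only: one made-up representation (there is no standard) could hash a file by its contents and a directory by a sorted listing of child names plus child hashes. Attributes or permissions could be appended to each listing entry in the same way.

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

static class DirHash
{
    // Hash of a file = hash of its bytes; hash of a directory = hash of a sorted
    // "name<TAB>child-hash" listing, so the value covers the whole subtree.
    public static string HashPath(string path)
    {
        using (var sha = SHA256.Create())
        {
            if (File.Exists(path))
            {
                using (var stream = File.OpenRead(path))
                {
                    return Hex(sha.ComputeHash(stream));
                }
            }

            var listing = new StringBuilder();
            foreach (var entry in Directory.EnumerateFileSystemEntries(path)
                                           .OrderBy(e => e, StringComparer.Ordinal))
            {
                listing.Append(Path.GetFileName(entry))
                       .Append('\t')
                       .Append(HashPath(entry))   // recursion: a child directory's hash covers its contents
                       .Append('\n');
            }
            return Hex(sha.ComputeHash(Encoding.UTF8.GetBytes(listing.ToString())));
        }
    }

    static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", "");
    }
}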
Note: the fact that identical files have the same hash can actually work in your favour, by avoiding transmission of the second file once the system realises that a file with the same hash is already present at the destination. Of course, you would have to code for this explicitly. But also note that doing so can allow super-cheap syncing when files have been moved or copied, since they will have the same hash as before. Only affected directories (from the immediate parent(s) to the root) will have different hash values.
Finally, a minor quibble: there is no such thing as zero tolerance. Forget whether SHA256 collisions will happen in the lifetime of the Universe. A gamma ray can flip the bit that says, "These two files don't match!" Such bit flips happen exceedingly rarely, but more often than you might think. In a noisy quantum universe, we should avoid talking in absolutes.
The metadata of a .NET EXE shows that it uses SHA1 for its internal purposes.
The property path is: Metadata -> Headers -> FileInfo -> SHA1
Steps to reproduce:
Create any console app with .NET Framework or .NET Core
Generate the EXE
Use any .NET reflector/decompiler to view the metadata, e.g. dotPeek
Load the EXE and navigate to the above path - Metadata->Headers->FileInfo->SHA1
It shows SHA1 as a key with some value associated with it.
Screenshot of the same:
Questions:
Given that SHA1 is not secure and SHA256 should be used everywhere:
What is this property about and where is it used internally?
Do we have the option to change it to SHA256 due to security reasons?
Docs: PE Format#Certificate Data
A PE image hash (or file hash) is similar to a file checksum in that the hash algorithm produces a message digest that is related to the integrity of a file. However, a checksum is produced by a simple algorithm and is used primarily to detect whether a block of memory on disk has gone bad and the values stored there have become corrupted. A file hash is similar to a checksum in that it also detects file corruption. However, unlike most checksum algorithms, it is very difficult to modify a file without changing the file hash from its original unmodified value. A file hash can thus be used to detect intentional and even subtle modifications to a file, such as those introduced by viruses, hackers, or Trojan horse programs.
Emphasis mine.
Modifying (or recreating) an executable and making it have the same hash is still not trivial, not even for SHA-1. See also Cryptography.SE: How secure is SHA1? What are the chances of a real exploit?
I am wondering if I can make the MD5 for a dll/exe consistent after a new build?
Every time I rebuild my project, I get a different MD5 from the tool "Microsoft File Checksum Integrity Verifier".
I found some articles about the issue; someone said it was due to the timestamp in the header of the PE32 file. I have no knowledge about that. Could anyone please help? Thank you in advance!
Below is how I get the MD5 sum. The two MD5Compare.exe files are exactly the same except that they were not created in the same build.
C:\Users\Administrator>fciv.exe D:\Lab\MD5Compare\MD5Compare\bin\Debug\2 -wp MD5Compare.exe
//
// File Checksum Integrity Verifier version 2.05.
//
5cdca6373aca0e588e1e3df92a1d5d0a MD5Compare.exe

C:\Users\Administrator>fciv.exe D:\Lab\MD5Compare\MD5Compare\bin\Debug\2 -wp MD5Compare.exe
//
// File Checksum Integrity Verifier version 2.05.
//
cf5caace5481edc79fd7bf3e99b48a5b MD5Compare.exe
No, the checksum has to be different, because the data in the file has actually changed even if no code has: there is no functional difference in the compiled output and no new features have been added to the assembly, but the timestamp of the build, for one, will be different.
So you need to take metadata into account here: how it is stored, how it affects the file as it exists on the file system, and therefore how it affects integrity checks.
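To make the cause concrete, here is a small, hedged C# sketch (not part of the original question) that reads the TimeDateStamp field from the COFF header of a PE file. Because this value lives inside the executable's bytes, two builds of identical source still produce different MD5 sums.

using System;
using System.IO;

class PeTimestamp
{
    static void Main(string[] args)
    {
        using (var fs = File.OpenRead(args[0]))
        using (var reader = new BinaryReader(fs))
        {
            fs.Position = 0x3C;                        // e_lfanew: offset of the "PE\0\0" signature
            uint peOffset = reader.ReadUInt32();

            fs.Position = peOffset + 8;                // skip signature (4) + Machine (2) + NumberOfSections (2)
            uint timeDateStamp = reader.ReadUInt32();  // seconds since 1970 (or a hash, if deterministic builds are enabled)
            Console.WriteLine(DateTimeOffset.FromUnixTimeSeconds(timeDateStamp));
        }
    }
}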
Please consider what MD5 is supposed to do: it's supposed to ensure that nobody has changed your files at the binary level, and that your file is exactly the same. Having multiple builds (different files) share the same MD5 checksum would defeat the purpose of having MD5.
If you could change the files while the checksum stayed the same, so could hackers.
I need to synchronize files from directory A to directory B. I check for files in A and then compare them with the files in B one by one. If a file with the same name as in A is found in B, I check whether the files are different by comparing their sizes. If the file sizes are different, I log this and move on to the next file. However, if the file sizes are the same, I need to check whether the contents of the files differ as well. For this, I thought of creating hashes of both files and comparing them. Is this better, or should I compare the files byte by byte? Please also tell me why you would choose either of the methods.
I am using C# (.NET 4) and need to preserve all files on B while replicating newly added files on A and reporting (and skipping) any duplicates.
Thanks.
EDIT: This job will run nightly, and I have the option of storing hashes of files in directory B only; directory A will be populated dynamically, so I cannot pre-hash those files. Also, which hash algorithms are better for this purpose, as I want to avoid hash collisions as well?
If you need to synchronize files, there's another thing you can compare: the file date. If it differs, the file has most probably been changed.
Also, in most cases a hash (I'd go for MD5 or SHA1, not CRC, because of its limited value range and therefore rather frequent collisions) will be sufficient. And if those hashes are equal, you should do a byte-by-byte comparison. Surely this is an additional step, but it's rarely needed, if at all.
Actually, you should save the hashes on B so you don't need to recalculate them every time, but you must make sure that the files on B cannot be changed without updating their hashes.
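A rough sketch of that caching idea, assuming a made-up cache file ("hashes.txt") that maps each path in B to its last known hash:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class HashCache
{
    // Loads "path<TAB>hash" lines into a dictionary; returns an empty cache on first run.
    public static Dictionary<string, string> Load(string cacheFile)
    {
        if (!File.Exists(cacheFile))
            return new Dictionary<string, string>();
        return File.ReadAllLines(cacheFile)
                   .Select(line => line.Split('\t'))
                   .ToDictionary(parts => parts[0], parts => parts[1]);
    }

    public static void Save(string cacheFile, Dictionary<string, string> cache)
    {
        File.WriteAllLines(cacheFile, cache.Select(kv => kv.Key + "\t" + kv.Value));
    }

    // MD5 of the file contents; swap in SHA1/SHA256 if preferred.
    public static string HashFile(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            return BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "");
        }
    }
}

The cache must be rewritten whenever a file in B is added or replaced, otherwise it goes stale, which is exactly the caveat above.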
You already have a hash function here. Your hash function is file -> (filename, filesize). Also, since you can only have one file with a given filename in a directory, you are guaranteed not to have more than one collision for each file per run.
You're asking if you need a better one. Well, I don't know, is performance adequate with the hash function you already have? If it's adequate for you, you don't need a better hash function.
If you use only a hash code to compare two files, then if the hash codes differ you can be sure that the files are different.
But if the hash codes are the same, then you don't know for sure if the files are really the same.
If you use a 32-bit hash code, then for two different files there is roughly a 1 in 2^32 chance that they nevertheless produce the same hash code. For a 64-bit hash code, the chance is naturally about 1 in 2^64.
Storing the hash codes for all the files on B will make initial comparing much faster, but you then have to decide what to do if two hash codes are the same. Do you take a chance and assume that they are both the same? Or do you go and do a byte-by-byte comparison after you discover two files with the same hash?
Note that if you do a byte-by-byte comparison after you have computed the hash code for a file, you'll end up accessing the file contents twice. This can make using hash codes slower if a significant proportion of the files are the same. As ever, you have to do some timings to see which is faster.
If you can live with the small chance that you'll falsely assume two files to be the same you can avoid the confirming comparison... but I wouldn't like to take that chance myself.
In summary, I would probably just do the comparison each time and not bother with the hashing (other than what you're already doing with comparing the filename and size).
Note that if you find that almost all files that match by filename and size are also identical, then using hashing will almost certainly slow things down.
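For reference, a simple buffered byte-by-byte comparison might look like the sketch below (it assumes local FileStreams, which fill the buffer on each read until end of file, and that the sizes were already found to be equal):

using System.IO;

static class FileCompare
{
    public static bool ContentsEqual(string pathA, string pathB)
    {
        const int BufferSize = 64 * 1024;
        var bufferA = new byte[BufferSize];
        var bufferB = new byte[BufferSize];

        using (var a = File.OpenRead(pathA))
        using (var b = File.OpenRead(pathB))
        {
            while (true)
            {
                int readA = a.Read(bufferA, 0, BufferSize);
                int readB = b.Read(bufferB, 0, BufferSize);
                if (readA != readB)
                    return false;               // lengths differ (should not happen if sizes matched)
                if (readA == 0)
                    return true;                // both files ended with no differences found
                for (int i = 0; i < readA; i++)
                {
                    if (bufferA[i] != bufferB[i])
                        return false;           // first differing byte: stop early
                }
            }
        }
    }
}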
In .NET, I need a way to compare two files. I thought of a class that represents a diff:
public enum DiffEntryState
{
    New,
    Removed,
    Changed
}

public class DiffEntry
{
    public byte[] Bytes;          // the differing bytes taken from the second file
    public long FileOffset;       // offset at which the difference starts
    public DiffEntryState State = DiffEntryState.Changed;
}
The names should be pretty self-explanatory. I thought of adding a State to each entry so that I can distinguish between the cases where the first file is larger than the second or vice versa.
I'm wondering if there is a common and fast way to retrieve the byte-by-byte differences of two files. I would simply create a stream for each file and compare chunks of these streams until one ends. Is there a better way, or does the Framework have a built-in solution? Keep in mind that I need the differences themselves, not only the feedback that there ARE differences.
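A minimal sketch of the stream comparison described above; as the edit below explains, this only catches in-place changes, not insertions or deletions that shift the rest of the file:

using System.Collections.Generic;
using System.IO;

static class NaiveDiff
{
    // Walks both files in parallel and records every offset whose byte differs
    // (including the tail of the longer file, since ReadByte returns -1 at end of stream).
    public static List<long> DifferingOffsets(string oldPath, string newPath)
    {
        var offsets = new List<long>();
        using (var oldFile = File.OpenRead(oldPath))
        using (var newFile = File.OpenRead(newPath))
        {
            long position = 0;
            int oldByte, newByte;
            do
            {
                oldByte = oldFile.ReadByte();
                newByte = newFile.ReadByte();
                if (oldByte != newByte)
                    offsets.Add(position);
                position++;
            } while (oldByte != -1 || newByte != -1);
        }
        return offsets;
    }
}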
//Edit:
After sleeping a night on the problem, I guess I'm taking the wrong approach here. The whole tool is a backup solution which will be able to save only the changed bytes and thus reduce the overall space necessary for the backup. Instead of saving a compressed 14 MB file each time, only 200 KB or less will be saved.
But, after thinking about the problem, I realized that it wouldn't be enough to save only the byte-by-byte differences. Take a piece of text, for example:
"This is a string."
"This was a string."
As a matter of fact, the only change here is "is" to "was". But my approach would assume that the changed content is now "was a string". If this happens at the beginning of a huge file, well, this approach is useless.
Obviously, I need a way to index a file and detect all moved, copied or changed blocks in comparison to the original file.
Phew...
Take a look at Diff.NET; it could be helpful.
For general case binary differencing, look at A Linear Time, Constant Space Differencing Algorithm by Randal C. Burns and Darrell D. E. Long. Also, Randal Burns' master's thesis, Differential Compression: A Generalized Solution For Binary Files, goes into more detail and provides pseudo-code for the algorithm.
You might also get some useful ideas from About Remote Differential Compression and from Optimizing File Replication over Limited-Bandwidth Networks using Remote Differential Compression
For text file differencing, I recommend starting with An O(ND) Difference Algorithm and Its Variations by Eugene W. Myers. This algorithm can be used to diff any two sequences. To compare two text files, generate sequences of hash codes (e.g., by calling string.GetHashCode()) for each line in each file. Then run those sequences (e.g., IList<int>) through Myers' algorithm to find the shortest edit script (i.e., inserts and deletes) that will convert the first sequence into the second.
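A small sketch of that preprocessing step (the Myers implementation itself is not shown); within a single diff run the hash codes only need to be consistent with each other:

using System.Collections.Generic;
using System.IO;

static class LineHashes
{
    // One hash code per line; feed the two resulting sequences to Myers' algorithm.
    public static IList<int> FromFile(string path)
    {
        var hashes = new List<int>();
        foreach (var line in File.ReadLines(path))
            hashes.Add(line.GetHashCode());
        return hashes;
    }
}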
I hope this helps. I'm the author of Diff.Net, and it uses Burns' algorithm for binary differencing and Myers' algorithm for text differencing. The source code for Diff.Net's libraries (Menees.Diffs and Menees.Diffs.Controls) is available under the Apache License, Version 2.0, and the references above should help you implement your own solution without having to start from scratch.
There is no built-in functionality.
So you have to compare the files byte by byte or use a library that does this for you.
I'm working on a solution, and one of its features is to check that certain files have not been tampered with, in other words hacked. I was planning on using the MD5 sum combined with the created and modified dates, but wanted to see if anybody has done something like this before. I'm using C# at the moment, but you could suggest any other language; I just want to hear about the technique or the architecture.
We have an application that checks file validity for safety reasons. The CRC32 checksums are stored in a separate file using a simple dictionary lookup. Whether you use CRC32, MD5, or any other hashing/checksumming function is purely a matter of choice: you simply need to know if the file has changed (at least, that's what you've said). As each byte of the file is included in the calculation, any change will be picked up, including a simple swapping of bytes.
Don't use file dates: they're too unreliable and can be easily changed.
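As a rough sketch of the stored-checksum idea: generate a manifest once, then verify against it later to flag tampering. SHA256 is used here rather than CRC32 only because it ships with the base class library, and the "path<TAB>hash" manifest format is made up for illustration.

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class Manifest
{
    static string Hash(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
        {
            return BitConverter.ToString(sha.ComputeHash(stream)).Replace("-", "");
        }
    }

    // Writes one "path<TAB>hash" line per file under the given directory.
    public static void Generate(string directory, string manifestFile)
    {
        var lines = Directory.EnumerateFiles(directory, "*", SearchOption.AllDirectories)
                             .Select(f => f + "\t" + Hash(f));
        File.WriteAllLines(manifestFile, lines);
    }

    // Recomputes every hash and reports files that changed or disappeared.
    public static bool Verify(string manifestFile)
    {
        bool allGood = true;
        foreach (var parts in File.ReadLines(manifestFile).Select(l => l.Split('\t')))
        {
            if (!File.Exists(parts[0]) || Hash(parts[0]) != parts[1])
            {
                Console.WriteLine("Modified or missing: " + parts[0]);
                allGood = false;
            }
        }
        return allGood;
    }
}

Note that the manifest itself must be protected (signed or stored elsewhere), since an attacker who can rewrite the files can also rewrite a co-located manifest.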