Find identical files in two folders with .Net

I have a folder of music videos that I want to back up from my laptop to an external HDD. I don't want to use a backup image but a direct file copy, so that I can watch the music videos straight from the backup HDD on another computer, laptop, or console.
Currently I use the freeware SyncBack Free to mirror the files to the external HDD. SyncBack Free is a nice tool, but it does not fully satisfy my needs. The problem is that I like to rename my music videos from time to time. Although SyncBack Free has an option for files with identical content, it does not seem to work for videos, and you end up with two copies of the same file in each folder when you synchronise after a file name change.
So I'm thinking about writing my own freeware backup software.
The questions are:
- How can I identify identical files with C#/.NET 4.0 without using the file name? I'm thinking of generating hashes or checksums for the files, but I don't know much about them.
- Is that too slow to really be used for backup software?

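You can get a hash of a file like this:
using System;
using System.IO;
using System.Security.Cryptography;

static string GetFileHash(string filename)
{
    // Reads the whole file into memory, which is fine for small files.
    byte[] data = File.ReadAllBytes(filename);
    using (MD5 md5 = MD5.Create())
    {
        byte[] hash = md5.ComputeHash(data);
        return Convert.ToBase64String(hash);
    }
}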
MD5 is not the most secure hash, but it is fast, which makes it a good fit for file checksums. If the files are large, ComputeHash() also accepts a Stream, so the file does not have to be read into memory all at once.
You may also want to check out some other checksum algorithms in the HashLib library. It contains CRC and other algorithms which should be even faster. You can download it with NuGet.
There are other strategies you can use as well, such as comparing only the first x bytes to rule out non-matching files quickly.
You can keep a database of hashes that have been backed up so that you don't have to recompute the hashes each time the backup runs. You could loop through only files which have been modified since the last backup time and see if their hash is in your hash database. SQLite comes to mind as a good database to use for this if you want your backup program to be portable.
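As a rough illustration of the hashing approach, here is a minimal sketch that pairs up files with identical content in two folders regardless of their names. The class and method names (DuplicateFinder, HashFile, FindIdentical) are made up for this example. Grouping by file length before hashing is an extra assumption that is not part of the answer above; it simply avoids hashing files that cannot match. Only the top level of each folder is scanned.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

static class DuplicateFinder
{
    // Hash a file by streaming it, so large videos are never fully loaded into memory.
    public static string HashFile(string path)
    {
        using (MD5 md5 = MD5.Create())
        using (FileStream stream = File.OpenRead(path))
        {
            return Convert.ToBase64String(md5.ComputeHash(stream));
        }
    }

    // Returns source path -> backup path for files whose contents match, whatever their names.
    public static Dictionary<string, string> FindIdentical(string sourceDir, string backupDir)
    {
        // Index the backup folder by file length; only files of equal length can be identical.
        var backupBySize = new Dictionary<long, List<string>>();
        foreach (string path in Directory.GetFiles(backupDir))
        {
            long size = new FileInfo(path).Length;
            List<string> bucket;
            if (!backupBySize.TryGetValue(size, out bucket))
            {
                bucket = new List<string>();
                backupBySize[size] = bucket;
            }
            bucket.Add(path);
        }

        var matches = new Dictionary<string, string>();
        foreach (string src in Directory.GetFiles(sourceDir))
        {
            List<string> candidates;
            if (!backupBySize.TryGetValue(new FileInfo(src).Length, out candidates))
                continue; // no backup file of this size, so nothing to hash

            string srcHash = HashFile(src);
            foreach (string candidate in candidates)
            {
                if (HashFile(candidate) == srcHash)
                {
                    matches[src] = candidate;
                    break;
                }
            }
        }
        return matches;
    }
}
```

Calling DuplicateFinder.FindIdentical(laptopFolder, backupFolder) yields the files that already exist on the backup drive under a different or identical name, so they can be renamed there instead of copied again. This sketch rehashes candidate files on every comparison; caching the hashes, for instance in the database mentioned above, would avoid that.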

Related

Zip folder to SVN?

This may sound like a silly question, but I just wanted to clear something up. I've zipped up a folder and added it to my SVN repository. Is that OK, or should I upload the unzipped folder instead?
I just need to be sure!

If you are going to change the contents of the directory, you should store it unzipped. Keeping it in a zip file will exhaust storage on the server much faster, as if you were storing every version of the zip as a separate file.
The zip format has one cool property: every file inside the archive occupies its own segment of bytes and is compressed and decompressed independently of all the other files. As a result, if you have a 100 MB zip and modify two files inside it, each 1 MB in size, the new zip will contain at most 2 MB of new data; the remaining 98 MB will most likely be byte-exact copies of pieces of the old zip. So in theory it is possible to represent small in-zip changes as small deltas. But there are many problems in practice.
First of all, you must be sure that you don't recompress the unchanged files. If you create both the first zip and the second zip from scratch using different programs, program versions, compression settings, etc., you can get slightly different compression of the unchanged files. As a result, the actual bytes in the zip file will differ greatly, and any hope of a small delta is lost. The better approach is to take the first zip and add or remove files in it.
The main problem, however, is how SVN stores deltas. As far as I know, SVN uses the xdelta algorithm for computing deltas. This algorithm is perfectly capable of detecting equal blocks inside a zip file if given unlimited memory. The problem is that SVN uses a memory-limited version with a window size of 100 KB. Even if you simply remove a segment longer than 100 KB from a file, SVN's delta computation will break on it, and the rest of the file will simply be copied into the delta. Most likely, the delta will take as much space as the whole file.

How are delta file backups encoded?

With a backup application, a good and space-efficient way to back up is to detect changes in files. Some online services such as Dropbox do this as well since Dropbox includes version history. How do backup applications detect changes in files and store them?
If you have a monumentally large file which has already been backed up, and you make a small change (such as in a Microsoft Word document), how can an application detect a change and process it? If the file has changes made often, there must be an efficient algorithm to only process changes and not the entire file. Is there an algorithm to do this in C# .NET?
Edit: I'm trying to figure out how to encode two files as the original plus the changes (in VCDIFF or a similar format). I know how to use the format and decode it just fine.

To detect changes, you can compute a hash (such as MD5) of the original and the modified versions of the file. If the hashes are identical, no changes were made.
I think Dropbox has its own protocol to detect which part of a file was modified.
You can work out your own approach. For example, divide the file into fixed-size parts and store their hashes. When the client downloads the file, send this information to the client. After the file is modified, recalculate the hashes of the parts, compare them with the original hashes, upload only the parts that were modified, and rebuild the file from the original parts and the modified parts.
rsync is an open source tool that synchronizes files using delta encoding.
EDIT: My idea above is very simple and not efficient. You can take a look at VCDIFF, which is described in a research paper and has implementations in many languages, including C#.
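The fixed-size parts idea from this answer could be sketched roughly as follows. This is only an illustration of that simple scheme, not VCDIFF and not what rsync or Dropbox actually do; the names BlockDiff, HashBlocks, and ChangedBlocks are invented for the example, and the block size is left to the caller.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

static class BlockDiff
{
    // Hash each fixed-size block of a file; the last block may be shorter.
    public static List<string> HashBlocks(string path, int blockSize)
    {
        var hashes = new List<string>();
        byte[] buffer = new byte[blockSize];
        using (MD5 md5 = MD5.Create())
        using (FileStream stream = File.OpenRead(path))
        {
            int filled;
            while ((filled = FillBuffer(stream, buffer)) > 0)
            {
                hashes.Add(Convert.ToBase64String(md5.ComputeHash(buffer, 0, filled)));
            }
        }
        return hashes;
    }

    // Stream.Read may return fewer bytes than requested, so loop until the block is full or EOF.
    private static int FillBuffer(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0) break;
            total += read;
        }
        return total;
    }

    // Indices of blocks that differ, or that exist in only one of the two versions.
    public static List<int> ChangedBlocks(List<string> oldHashes, List<string> newHashes)
    {
        var changed = new List<int>();
        int count = Math.Max(oldHashes.Count, newHashes.Count);
        for (int i = 0; i < count; i++)
        {
            bool same = i < oldHashes.Count && i < newHashes.Count && oldHashes[i] == newHashes[i];
            if (!same) changed.Add(i);
        }
        return changed;
    }
}
```

Only the blocks returned by ChangedBlocks would need to be uploaded. The weakness is visible right away: inserting a single byte near the start shifts every later block, so all of them count as changed. rsync works around this with a rolling checksum that can find matching blocks at any offset.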

C# - Volume Shadow Copy Alternatives for Backup?

I implemented a RAMDisk into my C# application, and everything is going great, except I need to back up the contents regularly due to it being volatile. I have been battling with AlphaVSS for Shadow Copy backups for a week, then someone informed me that VSS does not work on a RAMDisk.
The contents of the RAMDisk (world files for Minecraft) are very small, but there can be hundreds of them. The majority are .dat files only a few hundred bytes in size, and there are other files that are 2-8 MB each.
I posted about this yesterday here, and the suggested solution was to use a FileStream and save the data out of it. I have just read on another Stack Overflow question that this is a horrible idea for binary data, so I am looking for a better approach to back up all of these little files, some of which might be in use.

I suggest you first zip all the small files together, then back the archive up to another location.
References:
Zip library: http://www.icsharpcode.net/opensource/sharpziplib/
Use System.IO.File.Copy to copy the resulting zip file.

How to encrypt a .zip file?

Is it possible to encrypt a zip file? I know encryption is used to make .txt files unreadable until they are decrypted with a key, and I want to do the same with a .zip file.
I have multiple files that I want users to download from the internet through a program I'm creating, so I thought I would compress the files into a .zip and then encrypt the zip for added security. (I don't want users to access the files within the .zip without a serial code.)
I was going to keep the 'serial key' in an online database, from which the program would fetch it.
Am I going about this in the wrong way?

Both DotNetZip and SharpZipLib support encrypting the contents of zips and are free.

Use the DotNetZip library to perform your zipping/unzipping operations.
It supports AES encryption. From the website:
The library supports zip passwords, Unicode, ZIP64, stream input and output, AES encryption, multiple compression levels, self-extracting archives, spanned archives, and more.

Yes you can use third party zip libraries as shown by other answers, but keep in mind that your method of protecting files is rudimentary... it would not be terribly difficult to observe your program operating and recover the files after your program helpfully decrypts them. If you are storing the key as a constant in the program, that is pretty trivial to extract as well.
Software protection is a complex field, and it's very difficult to prevent determined users from viewing data that is stored on systems they control. Commercial software goes to great lengths to prevent this, and the solutions are quite complicated. (e.g. try hooking a debugger to Skype and see what happens)

XCopy - Only grab files that are fully uploaded

I have an automated job that pulls files that are uploaded to our servers via a client facing site using xcopy.
Is there any way to only pull files that are fully uploaded?
I have thought about creating a second "inProcess" folder that will be used for uploading and then move those files once fully uploaded, but that still creates a window of time when the file is in transition to a "Done" folder...
Any thoughts?

Upload each file under a temporary name with a .filepart extension and rename it once the transfer is complete.
It's probably the simplest and clearest way of doing this.
WinSCP does this.
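If the upload side is under your control, the temporary-extension approach might look roughly like this. It is a sketch under assumptions: the names UploadStaging, SaveUpload, and CompletedFiles are invented here, and the upload is assumed to arrive as a Stream.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class UploadStaging
{
    const string PartialSuffix = ".filepart";

    // Write the incoming data under a temporary name, then rename once everything has arrived.
    public static void SaveUpload(Stream upload, string finalPath)
    {
        string partialPath = finalPath + PartialSuffix;
        using (FileStream target = File.Create(partialPath))
        {
            upload.CopyTo(target);
        }
        // File.Move throws if finalPath already exists, so a finished file is never overwritten.
        File.Move(partialPath, finalPath);
    }

    // Files that are safe to pull: everything that does not carry the temporary suffix.
    public static List<string> CompletedFiles(string folder)
    {
        var done = new List<string>();
        foreach (string path in Directory.GetFiles(folder))
        {
            if (!path.EndsWith(PartialSuffix, StringComparison.OrdinalIgnoreCase))
                done.Add(path);
        }
        return done;
    }
}
```

Because the rename happens within the same folder, no second "Done" folder is needed, and the pulling job only has to skip names ending in .filepart. With xcopy itself, the /EXCLUDE option can be pointed at a text file that lists .filepart as a string to skip.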

You can upload an MD5 hash of the file first and then upload the file itself. If the uploaded file doesn't match the MD5, it's not finished (or, if it takes too long, perhaps it didn't upload properly).
MD5 is often used to check the integrity of a file by creating a hash that represents the file. If the file varies at all, it will almost always generate a different MD5 hash (a collision basically never happens for our purposes). The only reason a file would not match its previously uploaded MD5 hash is that the upload wasn't finished or that the MD5 or the file was corrupted during upload.
There is also this, but it's Perl and from Experts Exchange (ick).
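A minimal sketch of the check on the pulling side, assuming the expected hash was uploaded as a hexadecimal string (the format most MD5 tools print). The names UploadCheck and MatchesMd5 are hypothetical.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class UploadCheck
{
    // True when the file's MD5 equals the expected hex digest (case-insensitive).
    public static bool MatchesMd5(string path, string expectedHex)
    {
        using (MD5 md5 = MD5.Create())
        using (FileStream stream = File.OpenRead(path))
        {
            // BitConverter.ToString yields "5D-41-40-..."; strip the dashes to get plain hex.
            string actualHex = BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "");
            return string.Equals(actualHex, expectedHex, StringComparison.OrdinalIgnoreCase);
        }
    }
}
```

The job would copy a file only when MatchesMd5 returns true. If the uploader still holds the file open for writing, File.OpenRead may throw an IOException, which the job could reasonably treat as "not finished yet".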
