How are delta file backups encoded? - c#

With a backup application, a good and space-efficient way to back up is to detect changes in files. Some online services such as Dropbox do this as well since Dropbox includes version history. How do backup applications detect changes in files and store them?
If you have a monumentally large file which has already been backed up, and you make a small change (such as in a Microsoft Word document), how can an application detect a change and process it? If the file has changes made often, there must be an efficient algorithm to only process changes and not the entire file. Is there an algorithm to do this in C# .NET?
Edit: I'm trying to figure out how to encode two files as the original and the changes (in a VCDIFF format or etc.) I know how to use the format and decode it just fine.

to detect changes, you can compute the Hash code (such as MD5) for the original and the modified versions of the file. if they are identical, no changes are made.
I think DropBox has its own protocol to detect which part of this file is modified.
you can figure out to find out your own way, for example, divide the file to fixed-size parts, store their hash codes. when the client download the file, send these information to the client. after modifying the file, recalculate the hash codes for the parts, compare them with original hash codes, upload the parts that were modified, rebuild the file from original parts and the modified parts.
rsync is an open source tool that synchronizes files using delta encoding.
----------------------------------------------------EDIT: my idea above is very simple and not efficient. you can a look at VCDIFF that were explained by research paper and implemented in many languages (C#).

Related

Saving images locally yet unreadable

I have a program in c# that downloads images from a web service.
The download usually takes time so I want to save the images locally so I would only need to download each image once. The problem with that is when the images saves the user of the program can see the image in the files and change it.
Is there a way to save the image in the program, yet keep it from users to see and change in the folder?
EDIT: solution used:
Encrypting the images and their names when I save them, and only access them this way. (decrypting when after reading them).
What is your intent? Anything your program has access to do, your user does as well. If you're just trying to prevent people from accidentally mucking with your images, then save off a SHA1 or similar hash of the file and store it separately. When you need an image, check the SHA1 and redownload if it doesn't match. This will prevent casual tampering, but still isn't 100% effective against malicious changes.

Find identical files in two folders with .Net

I have a folder with music videos which I want to backup from my laptop to a external hdd. I dont want to use a backup-Image, but a direct file copy so I can directly watch the music videos from the backup hdd on another computer/laptop or a console.
Curently I use the freeware SyncBack Free to mirror the files to the external hdd. SyncBack Free is a nice tool, but it does not seem to fully satisfy my needs. The problem is that I like to modify the filenames of my music videos from time to time. Though SyncBack Free has a option for files with identical content it does not seem to work for videos and you end up with two copies from the same file in each folder when you synchronise after a file name change.
So im thinking about writing my own freeware backup software.
The question is:
-how can I identify identical files with c#/.Net 4.0 without using the filename? Im thinking of generating hashes or a checksum for the files without knowing much about it
-Is it to slow to really be used for a backup software?
You can get a hash of a file like this
using System.Security.Cryptography;
static string GetFileHash(string filename)
{
byte[] data = File.ReadAllBytes(filename);
byte[] hash = MD5.Create().ComputeHash(data);
return Convert.ToBase64String(hash);
}
MD5 is not the most secure hash, but it is still fast which makes it good for file checksums. If the files are large ComputerHash() also takes a Stream.
You may also want to check out some other check sum algorithms in the HashLib library. It contains CRC and other algorithms which should be even faster. You can download it with nuget.
There are other strategies you can use as well such as checking if only the first x bytes are the same.
You can keep a database of hashes that have been backed up so that you don't have to recompute the hashes each time the backup runs. You could loop through only files which have been modified since the last backup time and see if their hash is in your hash database. SQLite comes to mind as a good database to use for this if you want your backup program to be portable.

Best way to store multiple revisions of a text file to a single file

I'm working on a C# application that needs to store all the successive revisions of a given report file to a single project file: each time the (plain text) report file changes, the contents of the new version shall be appended to the project file, along with some metadata. Other requirements:
each version of the report file is 100 kB to 1 MB. Theoritically, the maximum number of revisions is unlimited but it should be less than 1000 in practice.
to keep things simple, I'd like to avoid computing differences between the revisions of the report - just store the whole report to the project file every time it has changed.
the project file should be compressed - it doesn't need to be a text file
it should be easy to retrieve a given version of the report from the application
How can I implement this in an efficient way? Should I create a custom binary file, consider using a database, other ideas?
Many thanks, Guy.
What's wrong with the simple workflow?
Un-gzip file
Append header and new report
Gzip project file
Gzip is a standard format, so it's easily accessible. Subsequent reports probably won't change that much, so you'll have a great compression ratio. To file every report, just open the file and scan the headers. (If scanning doesn't work, also mirror the metadata in an SQLite database, and make sure to include offsets into the project file so you can seek to the right place quickly.)
If your requirements are flexible (e.g. that "shall append" part) and you just want something to keep track of past versions of the file, a revision control system will do all of what you need quite easily.
No need to implement that. I would suggest you to use source control. Personally I use subversion with TortoiseSVN client. There is also a plug-in that integrates Subversion with Visual Studio, VisualSVN. Have a look at them.
If using SVN is not an option, you can just store each revision in an individual file (with filename that represents date for example). You can use separate files for metadata as well. Then all the aforementioned files are zipped into one file (look at http://DotNetZip.codeplex.com/ for example).
I don't think there is much point building this yourself when there are already tens, if not hundreds, of systems that are basically designed to do exactly this - source control systems.
I'd recommend choosing some source control solution that has bindings to C# and store your document in there. Then you can easily check out any revision of the document. You will also be able to diff, branch, etc. if necessary.
To give just one example to get you started you can use Subversion with C# bindings.
You could use alternate data streams to store the old revisions of your file. There is no built-in support in the .NET framework, but there exist some helper classes and articles like here and here.
I have never used this myself, so I can't really tell if this is a good option. But it seems, it would make an elegant solution, since you could store each file version in a separate data stream and only the current version in the "main file". In any case, it will probably only work on NTFS drives.
I think that the already SVN (or another source control system) is a very good idea because source control seems to have exactly the features you require. But if that's not an option you could use a file database like SQL Server Compact Edition or SQLite.

Options for header in raw byte file

I have a large raw data file (up to 1GB) which contains raw samples from a USB data logger.
I need to store extra information relating to the file (sample rate, description, trigger point, last seek position etc) and was looking into adding this as a some sort of header.
The header file should ideally be human readable and flexible so I've so far ruled out some sort of binary serialization into a header.
I also want to avoid two separate files as they could end up separated when copied or backed up. I remembered somebody telling me that newer *.*x Microsoft Office documents are actually a number of files in a zip. Is there a simple way to achieve this? Could I still keep the quick seek times to the raw file?
Update
I started using the binary serializer and found it to be a pain. I ended up using the xml serializer as I'm more comfortable using it.
I reserve some space at the start of the files for the xml. Simple
When you say you want to make the header human readable, this suggests opening the file in a text editor. Do you really want to do this considering the file size and (I'm assuming), the remainder of the file being non-human readable binary data? If it is, just write the text header data to the start of the binary file - it will be visible when the file is opened but, of course, the remainder of the file will look like garbage.
You could create an uncompressed ZIP archive, which may allow you to seek directly to the binary data. See this for information on creating a ZIP archive: http://weblogs.asp.net/jgalloway/archive/2007/10/25/creating-zip-archives-in-net-without-an-external-library-like-sharpziplib.aspx

Transferring files with metadata

I am writing a client windows app which will allow files and respective metadata to be uploaded to a server. For example gear.stl (original file) and gear.stl.xml (metadata). I am trying to figure out the correct protcol to use to transfer the files.
I was thinking about using ftp since it is widely used and a proven method to transfer files, except that I would have to transfer 2 files for every actual file (.stl and .stl.xml). However, another thought had also crossed my mind ... What if I create an object and wrap the file, metadata and the directory I needed to tranfer it to, serialize the object and then submit a request to a webservice, to transfer the file.
Original file size would range from 100k to 10MB. Metadata size would probably be less than 200k
The webservice call seems like an easier process to me to deserialize the object and distribute the file and respective metadata accordingly. However I'm not sure if this is a sound idea or if there is a better way to transfer this data other than the two methods I have mentioned.
If someone can point me in the right direction it would be much appreciated.
You could wrap it in a zip file like the "new" office document format does. You might even be able to use their classes to package it all up.
Edit:
Take a look at the System.IO.Packaging.Package class. It seems to be what you need. This class resides in the WindowsBase.dll assembly and became available in .NET 3.0.
PS: Remember that even though it is a zip file, it doesn't need to be compressed. If you have very large files, it may be better to keep them uncompressed. It all depends on how they're going to be used and if the transport size is an issue.

Categories

Resources