Compare two versions of a file and apply changes to the older file - C#

I've been searching and googling a lot about this question, and I already know how to compare two files (hashes, checksums, etc.), but that's not quite what I need. What I need is described below.
Let's assume I have a file and I've backed it up. Later I make some changes to this file, so I want to apply those changes to the backup version. Since the two files can be big and the changes small, I don't want to rewrite the whole file, because I'm planning to back it up over the internet (maybe FTP), which can take a lot of time.
How I see this (sample):
Backup version of file (bytes)
134 253 637 151
Newer version of file (bytes)
134 624 151 890
Instead of rewriting all bytes, we should:
change 253 to 624 (change bytes)
remove 637 bytes (remove bytes)
write 890 at the end of file (insert bytes)
Options 1-3 do not necessarily all appear in every case.
Note that the backup file could be located somewhere else, and I only have access to it over the internet (the server could return something so we can compare the files).
How can I achieve this? I know it's possible, because I know software where it's implemented (but I couldn't find out how).
Any hints, tutorials, etc. are welcome and highly appreciated.
Thanks in advance.

You're trying to solve the same problem that every MMORPG has solved... creating and applying small patch files to update older versions of large binaries.
This is a well-studied problem and there are a number of solutions out there. For several existing options, see
Binary patch-generation in C#
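As a rough illustration of what such a patch format does internally, here is a minimal sketch (the `PatchOp`/`CopyOp`/`InsertOp` types are made up for this example, not from any library): a patch is a list of operations that either copy a run of bytes from the old file or insert literal new bytes, and applying the patch replays those operations to reconstruct the new file. Real tools like bsdiff or xdelta use far smarter encodings, but the apply step has this general shape.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// A patch is a sequence of operations: either copy a run of bytes
// from the old file, or insert literal bytes that only exist in the
// new file. (Hypothetical types for illustration only.)
abstract class PatchOp { }
class CopyOp : PatchOp { public long Offset; public int Length; }
class InsertOp : PatchOp { public byte[] Data; }

static class PatchApplier
{
    // Rebuild the new file by replaying the operations against the old file.
    public static byte[] Apply(byte[] oldFile, IEnumerable<PatchOp> ops)
    {
        using var output = new MemoryStream();
        foreach (var op in ops)
        {
            if (op is CopyOp c)
                output.Write(oldFile, (int)c.Offset, c.Length);
            else if (op is InsertOp i)
                output.Write(i.Data, 0, i.Data.Length);
        }
        return output.ToArray();
    }
}
```

The hard part, which the linked question covers, is *generating* a small operation list from the two files; applying it is the easy half.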

Related

Zip folder to SVN?

This may sound like a silly question, but I just wanted to clear something up. I've zipped a folder and added it to my SVN repository. Is doing this OK, or should I upload the unzipped folder instead?
I just need to be sure!
If you are going to change the contents of the directory, then you should store it unzipped. Keeping it in a zip file will exhaust storage on the server much faster, as if you were storing every version of your zip as a separate file on the server.
The zip format has one cool property: every file inside the archive occupies its own segment of bytes and is compressed/decompressed independently of all the other files. As a result, if you have a 100 MB zip and modify two files inside, each 1 MB in size, then the new zip will have at most 2 MB of new data; the remaining 98 MB will most likely be byte-exact copies of pieces of the old zip. So it is in theory possible to represent small in-zip changes as small deltas. But there are many problems in practice.
First of all, you must be sure that you don't recompress the unchanged files. If you create both the first zip and the second zip from scratch using different programs, program versions, compression settings, etc., you can get slightly different compression of the unchanged files. As a result, the actual bytes in the zip files will differ greatly, and any hope of a small delta will be lost. The better approach is to take the first zip and add/remove files in it.
The main problem, however, is how SVN stores deltas. As far as I know, SVN uses the xdelta algorithm for computing deltas. This algorithm is perfectly capable of detecting equal blocks inside a zip file, if given unlimited memory. The problem is that SVN uses a memory-limited version with a window of size 100 KB. Even if you simply remove a segment longer than 100 KB from a file, SVN's delta computation will break on it, and the rest of the file will simply be copied into the delta. Most likely, the delta will take as much space as the whole file.

How are delta file backups encoded?

With a backup application, a good and space-efficient way to back up is to detect changes in files. Some online services such as Dropbox do this as well since Dropbox includes version history. How do backup applications detect changes in files and store them?
If you have a monumentally large file which has already been backed up, and you make a small change (such as in a Microsoft Word document), how can an application detect a change and process it? If the file has changes made often, there must be an efficient algorithm to only process changes and not the entire file. Is there an algorithm to do this in C# .NET?
Edit: I'm trying to figure out how to encode two files as the original plus the changes (in VCDIFF format or similar). I know how to use the format and decode it just fine.
To detect changes, you can compute a hash (such as MD5) for the original and the modified versions of the file. If the hashes are identical, no changes were made.
I think Dropbox has its own protocol to detect which parts of a file were modified.
You can work out your own approach. For example: divide the file into fixed-size parts and store their hash codes. When the client downloads the file, send this information along. After modifying the file, recalculate the hash codes for the parts, compare them with the original hash codes, upload only the parts that were modified, and rebuild the file from the original parts plus the modified parts.
rsync is an open source tool that synchronizes files using delta encoding.
EDIT: My idea above is very simple and not efficient. You can take a look at VCDIFF, which is described in a research paper and has implementations in many languages, including C#.
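A minimal sketch of the fixed-size block idea described above (this is a simplification; real rsync adds a rolling checksum so it can also match blocks that have *shifted* position, which plain fixed-offset hashing cannot):

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

static class BlockHasher
{
    // Split the data into fixed-size blocks and return an MD5 hash per block.
    public static List<string> HashBlocks(byte[] data, int blockSize)
    {
        var hashes = new List<string>();
        using var md5 = MD5.Create();
        for (int offset = 0; offset < data.Length; offset += blockSize)
        {
            int len = Math.Min(blockSize, data.Length - offset);
            byte[] hash = md5.ComputeHash(data, offset, len);
            hashes.Add(BitConverter.ToString(hash));
        }
        return hashes;
    }

    // Indices of blocks whose hash changed - these are the only
    // blocks the client would need to upload.
    public static List<int> ChangedBlocks(List<string> oldHashes, List<string> newHashes)
    {
        var changed = new List<int>();
        for (int i = 0; i < newHashes.Count; i++)
            if (i >= oldHashes.Count || oldHashes[i] != newHashes[i])
                changed.Add(i);
        return changed;
    }
}
```

Note the limitation: inserting a single byte near the start of the file shifts every later block, so all of them hash differently; that is precisely the problem rsync's rolling checksum solves.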

I can't decompress file correctly with C#

Hello I am trying to compress a file using GZipStream.
I have created my own extension, let's call it .myextension
I am trying to compress a .myextension file while keeping its extension. Example: I have myfile.myextension and I want the compressed file to also be named myfile.myextension. It works; I can compress my file just fine.
The problem is that when I try to decompress it using GZipStream it says that the magic number is incorrect.
How can I fix that? When decompressing, should I just change the extension to .gz? Should I convert it somehow? Please help me, I have no idea how to continue.
This is a common question. Here are some similar threads with solutions:
http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=427166&SiteID=1
A 'magic number' is usually a fixed value, often appearing somewhat arbitrary and possibly indecipherable. For example, a line of code may have:
If X = 31 then
'Do Something
End If
In this case, 31 is a 'magic number': it has no obvious meaning (and, as far as coding is concerned, the term is one of derision).
Files (of different types) often have their first few bytes set to certain values; for example, a file whose first two bytes are the hexadecimal values 42 4D is a bitmap file. These numbers are 'magic numbers' (in this case, 42 4D corresponds to the characters BM). Other file types have similar magic numbers.
http://forums.microsoft.com/msdn/showpost.aspx?postid=1154042&siteid=1
Of course, the minute someone (team) develops a no-fuss compression/decompression custom task which supports zip,bzip2, gzip, rar, cab, jar, data and iso files, I'll use that, until that time, I'll stick with the open-source command-line utilities.
Of course, you can code up a solution, but this one is such low-hanging fruit. For handling zip files, there is no native .NET library (at least not yet). There is support for handling the compressed streams INSIDE a zip file, but not for navigating the archive itself.
Now, as I mentioned previously, there are plenty of open-source zip utilities, like those on SourceForge. They work fine on Win2003 Server x64; I can attest to that.
However, if you're insistent on a .NET solution for zip decompression, use http://www.icsharpcode.net/OpenSource/SharpZipLib/, which is open source, and which has a clean and reliable 100% .NET implementation.
First off, based on reports from other users who have had various issues, GZipStream should not be used, since it has bugs: it does not compress short strings well and it does not detect corrupted compressed data. It is a very poor implementation.
As for your problem, others using GZipStream have seen a four-byte prefix before the gzip data containing the number of uncompressed bytes. If that prefix is written to the file, it would cause exactly the problem you are seeing. A gzip file should start with the hex bytes 1f 8b.
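To diagnose this, you can round-trip your data through GZipStream and inspect the first two bytes of the output; if your file on disk does not start with 1f 8b, something extra (like a length prefix) is being written before the compressed stream. A small sketch using only the standard System.IO.Compression API:

```csharp
using System;
using System.IO;
using System.IO.Compression;

static class GzipHelper
{
    // Compress a byte array into a standard gzip stream.
    public static byte[] Compress(byte[] data)
    {
        using var output = new MemoryStream();
        using (var gz = new GZipStream(output, CompressionMode.Compress))
            gz.Write(data, 0, data.Length);
        // MemoryStream.ToArray is valid even after the stream is closed.
        return output.ToArray();
    }

    // Decompress a gzip stream back into the original bytes.
    public static byte[] Decompress(byte[] compressed)
    {
        using var input = new MemoryStream(compressed);
        using var gz = new GZipStream(input, CompressionMode.Decompress);
        using var output = new MemoryStream();
        gz.CopyTo(output);
        return output.ToArray();
    }

    // A valid gzip file always starts with the magic bytes 0x1f 0x8b.
    public static bool HasGzipMagic(byte[] data) =>
        data.Length >= 2 && data[0] == 0x1f && data[1] == 0x8b;
}
```

If `HasGzipMagic` fails on your file but succeeds on `Compress`'s output, compare the first few bytes of both to find what extra data your writing code is prepending.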

C# - Volume Shadow Copy Alternatives for Backup?

I implemented a RAMDisk into my C# application, and everything is going great, except I need to back up the contents regularly due to it being volatile. I have been battling with AlphaVSS for Shadow Copy backups for a week, then someone informed me that VSS does not work on a RAMDisk.
The contents located on the RAMDisk (world files for Minecraft) are very small, but there can be hundreds of them. Most are .dat files only a few hundred bytes in size, and there are other files that are 2-8 MB each.
I posted about this yesterday Here, and the suggested solution was to use a FileStream and save the data out of it. I just read in another Stack Overflow question that this is a horrible idea for binary data, so I am looking for a better approach to back up all of these little files, some of which might be in use.
I suggest you first zip all the small files together, then back up the archive to your backup location.
ref:
zip library: http://www.icsharpcode.net/opensource/sharpziplib/
use System.IO.File.Copy to copy the zip package.
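For what it's worth, .NET 4.5+ also ships a built-in alternative to SharpZipLib in System.IO.Compression. A minimal sketch of the zip-then-copy idea (paths and names here are placeholders; note that `ZipFile.CreateFromDirectory` will throw if a file is locked for exclusive access, so files "in use" must at least be readable):

```csharp
using System.IO;
using System.IO.Compression;

static class RamDiskBackup
{
    // Zip the whole RAMDisk directory into a single archive, then copy
    // that one file to the backup location.
    public static void Backup(string sourceDir, string backupZipPath)
    {
        string tempZip = Path.Combine(Path.GetTempPath(), "ramdisk-backup.zip");
        if (File.Exists(tempZip)) File.Delete(tempZip);   // CreateFromDirectory refuses to overwrite
        ZipFile.CreateFromDirectory(sourceDir, tempZip);
        File.Copy(tempZip, backupZipPath, overwrite: true);
    }
}
```

Packing hundreds of tiny .dat files into one archive also avoids the per-file overhead of copying them individually.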

Best way to store multiple revisions of a text file to a single file

I'm working on a C# application that needs to store all the successive revisions of a given report file to a single project file: each time the (plain text) report file changes, the contents of the new version shall be appended to the project file, along with some metadata. Other requirements:
each version of the report file is 100 kB to 1 MB. Theoretically, the maximum number of revisions is unlimited, but in practice it should be fewer than 1000.
to keep things simple, I'd like to avoid computing differences between the revisions of the report - just store the whole report to the project file every time it has changed.
the project file should be compressed - it doesn't need to be a text file
it should be easy to retrieve a given version of the report from the application
How can I implement this in an efficient way? Should I create a custom binary file, consider using a database, other ideas?
Many thanks, Guy.
What's wrong with the simple workflow?
Un-gzip file
Append header and new report
Gzip project file
Gzip is a standard format, so it's easily accessible. Subsequent reports probably won't change much, so you'll get a great compression ratio. To find a given report, just open the file and scan the headers. (If scanning doesn't work, also mirror the metadata in an SQLite database, and make sure to include offsets into the project file so you can seek to the right place quickly.)
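The append step can be sketched like this: each appended revision becomes an independent gzip member, and the gzip format explicitly allows members to be concatenated in one file. (This is only a sketch; whether GZipStream reads back all concatenated members in a single pass depends on your .NET version, so you may need to split the file into members yourself when reading, using the stored offsets.)

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

static class ProjectFile
{
    // Append one report (plus a small metadata header) to the project
    // file as a new, independently compressed gzip member.
    public static void AppendRevision(string path, string header, string report)
    {
        using var file = new FileStream(path, FileMode.Append);
        using var gz = new GZipStream(file, CompressionMode.Compress);
        byte[] payload = Encoding.UTF8.GetBytes(header + "\n" + report);
        gz.Write(payload, 0, payload.Length);
    }
}
```

One caveat versus re-gzipping the whole project file: each member is compressed independently, so the "subsequent reports barely change" redundancy is not exploited *across* revisions, only within each one.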
If your requirements are flexible (e.g. that "shall append" part) and you just want something to keep track of past versions of the file, a revision control system will do all of what you need quite easily.
No need to implement that yourself. I would suggest you use source control. Personally, I use Subversion with the TortoiseSVN client. There is also a plug-in that integrates Subversion with Visual Studio, VisualSVN. Have a look at them.
If using SVN is not an option, you can just store each revision in an individual file (with a filename that encodes the date, for example). You can use separate files for the metadata as well. Then all the aforementioned files are zipped into one file (look at http://DotNetZip.codeplex.com/ for example).
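Since .NET 4.5 this can also be done with the built-in System.IO.Compression instead of DotNetZip. A sketch of the one-entry-per-revision idea (the timestamp-based entry naming is just one possible convention):

```csharp
using System;
using System.IO;
using System.IO.Compression;

static class RevisionStore
{
    // Add one revision of the report as a new entry inside a single
    // zip-based project file; Update mode creates the file if it
    // doesn't exist yet.
    public static void StoreRevision(string projectFile, string reportPath)
    {
        using var archive = ZipFile.Open(projectFile, ZipArchiveMode.Update);
        string entryName = DateTime.UtcNow.ToString("yyyyMMdd-HHmmss-fff") + ".txt";
        archive.CreateEntryFromFile(reportPath, entryName);
    }
}
```

Retrieving a given version is then just opening the archive read-only and extracting the entry whose name matches the wanted revision.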
I don't think there is much point building this yourself when there are already tens, if not hundreds, of systems that are basically designed to do exactly this - source control systems.
I'd recommend choosing some source control solution that has bindings to C# and store your document in there. Then you can easily check out any revision of the document. You will also be able to diff, branch, etc. if necessary.
To give just one example to get you started you can use Subversion with C# bindings.
You could use alternate data streams to store the old revisions of your file. There is no built-in support in the .NET Framework, but there are some helper classes and articles, like here and here.
I have never used this myself, so I can't really tell if this is a good option. But it seems, it would make an elegant solution, since you could store each file version in a separate data stream and only the current version in the "main file". In any case, it will probably only work on NTFS drives.
I think the SVN suggestion already made (or another source control system) is a very good idea, because source control offers exactly the features you require. But if that's not an option, you could use a file database like SQL Server Compact Edition or SQLite.
