Here's an interesting question that I don't know much about in terms of existing solutions or research, though I would imagine it relates to the field of compression.
Given two potentially large strings of text, where one is a later version of the other, is it possible (well, I know it's possible; what I'm really asking is whether there are existing solutions) to compare those two strings and reduce them to a set of differences that could later be used to deterministically reconstruct the original strings?
In my case, I'm interested in storing the latest version of the string, but keeping "compressed" (diffed) historical backups that can be restored as needed, without actually having to store all of the duplicated information.
I don't know what to tag this; please help me out.
There are no built-in classes in the CLR that support diffing.
Related questions seem to have useful information (e.g. Creating Delta Diff Patches of large Binary Files in C#). You can also search for "Delta encoding" as a starting point (see http://en.wikipedia.org/wiki/Delta_encoding).
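If you want to sketch the idea yourself before reaching for a library, here is a deliberately naive, hedged example (line-based and positional only; the NaiveDelta class and its method names are made up for illustration). It keeps the latest version in full plus a small reverse delta from which the older version can be rebuilt deterministically; a real delta encoder (xdelta, bsdiff, a Myers-style diff) produces far smaller deltas because it handles insertions and deletions properly.

using System.Collections.Generic;

static class NaiveDelta
{
    // Reverse delta: the old line count plus every old line that differs positionally from the new text.
    public static (int OldLineCount, Dictionary<int, string> ChangedLines) Make(string oldText, string newText)
    {
        var oldLines = oldText.Split('\n');
        var newLines = newText.Split('\n');
        var changed = new Dictionary<int, string>();
        for (int i = 0; i < oldLines.Length; i++)
            if (i >= newLines.Length || oldLines[i] != newLines[i])
                changed[i] = oldLines[i];
        return (oldLines.Length, changed);
    }

    // Rebuilds the old text from the latest text plus the stored delta.
    public static string Apply(string newText, (int OldLineCount, Dictionary<int, string> ChangedLines) delta)
    {
        var newLines = newText.Split('\n');
        var oldLines = new string[delta.OldLineCount];
        for (int i = 0; i < delta.OldLineCount; i++)
            oldLines[i] = delta.ChangedLines.TryGetValue(i, out var line) ? line : newLines[i];
        return string.Join("\n", oldLines);
    }
}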
I am making a combinations generator. For small numbers of elements it is not a problem to keep the data in RAM rather than in a file, but when the number of elements gets bigger my computer runs out of memory (an OutOfMemoryException occurs). The combinations are numbers stored in Lists, which are in turn stored in another List.
But this is only the first step - the generator itself works correctly. I want the data to be stored in a file, from which a different program will be able to extract the combinations it needs. I mainly need to store the data in a separate file because the generator will have to create more and bigger combinations in the future. The computer will have to read certain parts of the data without copying all of it into memory, because that is impossible.
I don't want to turn the data into text and then convert the text back into data when needed; I think the conversions would make things slower. I want the lists stored in a custom-made file from which the program can extract the data directly, without any conversion.
There are a ton of options available; I will briefly describe a few.
Use a database. From your description this does not look like a good choice, but it is the most flexible option for all clients and offers relatively fast and efficient storage.
Use one of the .NET serializers; from your description the binary serializer will be your best choice. The serializers offer a lot of advantages: they are relatively fast, baked into .NET with built-in support, and very easy to use.
Use a custom binary format. This will be the fastest option, especially if you combine it with a memory-mapped file. However, binary formats can be hard to use and easy to screw up.
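To make the custom-binary-format option concrete, here is a hedged sketch (the CombinationFile class and the k-ints-per-record layout are illustrative assumptions, not an established format). Each combination of k ints becomes one fixed-size record, so a single combination can be read back by seeking to index * k * sizeof(int) without loading the rest of the file.

using System.Collections.Generic;
using System.IO;

static class CombinationFile
{
    // Append every combination as k consecutive 32-bit ints.
    public static void WriteAll(string path, IEnumerable<int[]> combinations, int k)
    {
        using var writer = new BinaryWriter(File.Create(path));
        foreach (var combo in combinations)
            for (int j = 0; j < k; j++)
                writer.Write(combo[j]);
    }

    // Read back the combination at the given record index by seeking to its offset.
    public static int[] ReadOne(string path, long index, int k)
    {
        using var reader = new BinaryReader(File.OpenRead(path));
        reader.BaseStream.Seek(index * k * sizeof(int), SeekOrigin.Begin);
        var combo = new int[k];
        for (int j = 0; j < k; j++)
            combo[j] = reader.ReadInt32();
        return combo;
    }
}

The same offset arithmetic carries over to a MemoryMappedViewAccessor if you later switch to a memory-mapped file.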
If you really want to store your data in a file, you can use the BinaryFormatter class. It is probably the most efficient way of serializing data objects into a binary stream.
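For example, a minimal round-trip of the nested lists might look like the hedged sketch below (the file path and class name are arbitrary; note that BinaryFormatter is considered obsolete in recent .NET versions, but it works as shown on the classic Framework):

using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

static class CombinationStorage
{
    public static void Save(string path, List<List<int>> combinations)
    {
        using (var stream = File.Create(path))
            new BinaryFormatter().Serialize(stream, combinations);
    }

    public static List<List<int>> Load(string path)
    {
        using (var stream = File.OpenRead(path))
            return (List<List<int>>)new BinaryFormatter().Deserialize(stream);
    }
}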
But I wouldn't recommend generating combinations this way unless you really need to store them all at one time and load them a long time afterwards. It's better to use lazy generation of combinations: produce them one by one, completely generated, without the need to "generate bigger combinations in the future" (generate "the biggest" needed combinations one at a time - you might want to change your generation algorithm a bit; there are plenty of answers on how to do that already).
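A hedged sketch of that lazy approach (the Combinations helper is an assumption, not a standard API): an iterator yields the k-element combinations of 0..n-1 one at a time, so only the current combination is ever held in memory.

using System.Collections.Generic;
using System.Linq;

static IEnumerable<int[]> Combinations(int n, int k)
{
    var indices = Enumerable.Range(0, k).ToArray();    // first combination: 0, 1, ..., k-1
    while (true)
    {
        yield return (int[])indices.Clone();
        // Find the rightmost index that can still be incremented.
        int i = k - 1;
        while (i >= 0 && indices[i] == i + n - k)
            i--;
        if (i < 0)
            yield break;                               // last combination reached
        indices[i]++;
        for (int j = i + 1; j < k; j++)
            indices[j] = indices[j - 1] + 1;
    }
}

Each yielded array can be written straight to the file (one fixed-size record at a time) instead of being collected into a List of Lists.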
There's a good write-up on how to serialize a List<> to a file at http://www.switchonthecode.com/tutorials/csharp-tutorial-serialize-objects-to-a-file
You can use something like a persistent data structure instead; this will reduce the amount of memory needed by your app without changing the current code too much.
Have a look at this question:
Looking for a simple standalone persistent dictionary implementation in C#
There are a lot of resources on doing that; in particular, this answer seems to point to some really interesting links:
In .NET, I need a way to compare two files. I thought of a class, which represents a diff:
public enum DiffEntryState
{
    New,      // content present only in the second file
    Removed,  // content present only in the first file
    Changed   // content differs between the two files
}

public class DiffEntry
{
    public byte[] Bytes;       // the differing bytes
    public long FileOffset;    // offset of the difference within the file
    public DiffEntryState State = DiffEntryState.Changed;
}
The names should be pretty self-explanatory. I thought of adding a State to each entry, so that I can distinguish between the cases where the first file is larger than the second or vice versa.
I'm wondering if there is a common and fast way to retrieve the byte-by-byte differences of two files. I would simply create a stream for each file and compare chunks of these streams until one ends. Is there a better way, or does the Framework have a built-in solution? Keep in mind that I need the differences themselves, not only the feedback that there ARE differences.
//Edit:
After sleeping a night over the problem, I guess I'm taking the wrong approach here. The whole tool is a backup solution, which will be able to save only the changed bytes and thus reduce the overall necessary space for the backup. Instead of saving a compressed 14 MB file each time, only 200k or less will be saved.
But, after thinking about the problem, I realized that it wouldn't be enough to save only the differences byte by byte. Take a text, for example:
"This is a string."
"This was a string."
As a matter of fact, the only change here is "is" to "was". But my approach would assume that the changed content is now "was a string". If this happens at the beginning of a huge file, well, this approach is useless.
Obviously, I need a way to index a file and detect all moved, copied or changed blocks in comparison to the original file.
Phew...
Take a look at Diff.NET, it could be helpful.
For general case binary differencing, look at A Linear Time, Constant Space Differencing Algorithm by Randal C. Burns and Darrell D. E. Long. Also, Randal Burns' master's thesis, Differential Compression: A Generalized Solution For Binary Files, goes into more detail and provides pseudo-code for the algorithm.
You might also get some useful ideas from About Remote Differential Compression and from Optimizing File Replication over Limited-Bandwidth Networks using Remote Differential Compression.
For text file differencing, I recommend starting with An O(ND) Difference Algorithm and Its Variations by Eugene W. Myers. This algorithm can be used to diff any two sequences. To compare two text files, generate sequences of hash codes (e.g., by calling string.GetHashCode()) for each line in each file. Then run those sequences (e.g., IList<int>) through Myers' algorithm to find the shortest edit script (i.e., inserts and deletes) that will convert the first sequence into the second.
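The preprocessing step described there is small enough to sketch; a hedged example of turning each file into a hash sequence (HashLines is a made-up helper name):

using System.Collections.Generic;
using System.IO;
using System.Linq;

static IList<int> HashLines(string path) =>
    File.ReadLines(path).Select(line => line.GetHashCode()).ToList();

// var oldHashes = HashLines("old.txt");
// var newHashes = HashLines("new.txt");
// Feed both sequences to a Myers O(ND) implementation to get the shortest edit
// script; hash collisions can be double-checked against the actual line text.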
I hope this helps. I'm the author of Diff.Net, and it uses Burns' algorithm for binary differencing and Myers' algorithm for text differencing. The source code for Diff.Net's libraries (Menees.Diffs and Menees.Diffs.Controls) is available under the Apache License, Version 2.0, and the references above should help you implement your own solution without having to start from scratch.
There is no built-in functionality.
So you have to compare the files byte by byte or use a library that does this for you.
I need to perform operations chronologically on huge time series implemented as IList. The data is ultimately stored into a database, but it would not make sense to submit tens of millions of queries to the database.
Currently the in-memory IList triggers an OutOfMemory exception when trying to store more than 8 million (small) objects, though I would need to deal with tens of millions.
After some research, it looks like the best way to do it would be to store data on disk and access it through an IList wrapper.
Memory-mapped files (introduced in .NET 4.0) seem the right interface to use, but I wonder what is the best way to write a class that should implement IList (for easy access) and internally deal with a memory-mapped file.
I am also curious to hear if you know about other ways! I thought for example of an IList wrapper using data from db4o (someone mentioned here using a memory-mapped file as the IoAdapterFile, though using db4o probably adds a performance cost vs. dealing directly with the memory-mapped file).
I have come across this question asked in 2009, but it did not yield useful answers or serious ideas.
I found this PersistentDictionary<>, but it only works with strings, and by reading the source code I am not sure it was designed for very large datasets.
More scalable (up to 16 TB), the ESENT PersistentDictionary<> uses the ESENT database engine present in Windows (XP+) and can store all serializable objects containing simple types.
Disk Based Data Structures, including Dictionary, List and Array with an "intelligent" serializer, looked exactly like what I was looking for, but it did not run smoothly with extremely large datasets, especially as it does not make use of the "native" .NET MemoryMappedFiles yet, and support for 32-bit systems is experimental.
Update 1: I ended up implementing my own version that makes extensive use of .NET MemoryMappedFiles; it is very fast and I will probably release it on Codeplex once I have made it better for more general purpose usages.
Update 2: TeaFiles.Net also worked great for my purpose. Highly recommended (and free).
I see several options:
"in-memory-DB"
for example SQLite can be used this way - no need for any setup etc. just deploying the DLL (1 or 2) together with the app and the rest can be done programmatically
Load all data into temporary table(s) into the DB, with unknown (but big) amounts of data I found that this pays off really fast (and processing can usually be done inside the DB whcih is even better!)
use a MemoryMappedFile and a fixed structure size (array-like access via offset) but beware that physical memory is the limit except you use some sort of "sliding window" to map only parts into memory
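A hedged sketch of the memory-mapped option (the Tick struct, the file layout, and the class name are illustrative assumptions): fixed-size records are read and written by offset through a view accessor, and an IList<T>-style wrapper would simply delegate its indexer to this.

using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
struct Tick
{
    public long Timestamp;
    public double Price;
    public int Volume;
}

sealed class TickStore : IDisposable
{
    private static readonly int RecordSize = Marshal.SizeOf(typeof(Tick));
    private readonly MemoryMappedFile _file;
    private readonly MemoryMappedViewAccessor _accessor;

    public TickStore(string path, long capacityRecords)
    {
        _file = MemoryMappedFile.CreateFromFile(
            path, FileMode.OpenOrCreate, null, capacityRecords * RecordSize);
        _accessor = _file.CreateViewAccessor();
    }

    // Array-like access: record i lives at byte offset i * RecordSize.
    public Tick this[long index]
    {
        get { _accessor.Read(index * RecordSize, out Tick tick); return tick; }
        set { var tick = value; _accessor.Write(index * RecordSize, ref tick); }
    }

    public void Dispose()
    {
        _accessor.Dispose();
        _file.Dispose();
    }
}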
Memory-mapped files are a nice way to do it, but they are going to be very slow if you need to access things randomly.
Your best bet is probably to come up with a fixed structure size for the saved data (if you can) and then use the offset as the list item id. However, deletes and sorting are always a problem.
This question is probably quite different from what you are used to reading here - I hope it can provide a fun challenge.
Essentially I have an algorithm that uses 5(or more) variables to compute a single value, called outcome. Now I have to implement this algorithm on an embedded device which has no memory limitations, but has very harsh processing constraints.
Because of this, I would like to run a calculation engine which computes outcome for, say, 20 different values of each variable and stores this information in a file. You may think of this as a 5(or more)-dimensional matrix or 5(or more)-dimensional array, each dimension being 20 entries long.
In any modern language, filling this array is as simple as having 5(or more) nested for loops. The tricky part is that I need to dump these values into a file that can then be placed onto the embedded device so that the device can use it as a lookup table.
The questions now are:
1. What format(s) might be acceptable for storing the data?
2. What programs (MATLAB, C#, etc.) might be best suited to compute the data?
3. C# must be used to import the data on the device - is this possible given your answer to #1?
Edit:
Is it possible to read from my lookup table file without reading the entire file into memory? Can you explain how that might be done in C#?
I'll comment on 1 and 3 as well. It may be preferable to use a fixed width output file rather than a CSV. This may take up more or less space than a CSV, depending on the output numbers. However, it tends to work well for lookup tables, as figuring out where to look in a fixed width data file can be done without reading the entire file. This is usually important for a lookup table.
Fixed width data, as with CSV, is trivial to read and write. Some math-oriented languages might offer poor string and binary manipulation functionality, but it should be really easy to convert the data to fixed width during the import step regardless.
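As a hedged illustration of that point (the record width and one-value-per-record layout are assumptions): if every line of the table is padded to the same byte length, a single outcome can be read by seeking straight to its record. The record index can be computed from the five variable indices, e.g. ((((i1*20 + i2)*20 + i3)*20 + i4)*20 + i5).

using System.Globalization;
using System.IO;
using System.Text;

static double ReadOutcome(string path, long recordIndex)
{
    const int lineWidth = 16;                        // assumed fixed record width in bytes, newline included
    var buffer = new byte[lineWidth];
    using var stream = File.OpenRead(path);
    stream.Seek(recordIndex * lineWidth, SeekOrigin.Begin);
    int read = stream.Read(buffer, 0, lineWidth);
    return double.Parse(Encoding.ASCII.GetString(buffer, 0, read).Trim(),
                        CultureInfo.InvariantCulture);
}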
Number 2 is harder to answer, particularly without knowing what kind of algorithm you are computing. Matlab and similar programs tend to be great about certain types of computations and often have a lot of stuff built in to make it easier. That said, a lot of the math stuff that is built into such languages is available for other languages in the form of libraries.
I'll comment on (1) and (3). All you need to do is dump the data in slices. Pick a traversal and dump data out in that order. Write it out as comma-delimited numbers.
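A hedged sketch of that dump (the Compute method stands in for the real algorithm, and the 20-value axes are assumptions): five nested loops in a fixed traversal order, so the line number of any combination of indices is predictable later on.

using System.IO;

static void DumpTable(string path, double[] v1, double[] v2, double[] v3, double[] v4, double[] v5)
{
    using var writer = new StreamWriter(path);
    foreach (var a in v1)
    foreach (var b in v2)
    foreach (var c in v3)
    foreach (var d in v4)
    foreach (var e in v5)
        writer.WriteLine($"{a},{b},{c},{d},{e},{Compute(a, b, c, d, e)}");
}

// Placeholder for the actual 5-variable algorithm.
static double Compute(double a, double b, double c, double d, double e) => a + b + c + d + e;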
Does anyone have, or know of, a binary patch generation algorithm implementation in C#?
Basically, compare two files (designated old and new), and produce a patch file that can be used to upgrade the old file to have the same contents as the new file.
The implementation would have to be relatively fast, and work with huge files. It should exhibit O(n) or O(logn) runtimes.
My own algorithms tend to either be lousy (fast but produce huge patches) or slow (produce small patches but have O(n^2) runtime).
Any advice, or pointers for implementation would be nice.
Specifically, the implementation will be used to keep servers in sync for various large datafiles that we have one master server for. When the master server datafiles change, we need to update several off-site servers as well.
The most naive algorithm I have made, which only works for files that can be kept in memory, is as follows:
1. Grab the first four bytes from the old file; call this the key.
2. Add those bytes to a dictionary as key -> position, where position is the position at which I grabbed those 4 bytes (0 to begin with).
3. Skip the first of these four bytes, grab another 4 (3 overlapping, 1 new), and add them to the dictionary the same way.
4. Repeat steps 1-3 for all 4-byte blocks in the old file.
5. From the start of the new file, grab 4 bytes and attempt to look them up in the dictionary.
6. If found, find the longest match if there are several, by comparing bytes from the two files.
7. Encode a reference to that location in the old file, and skip the matched block in the new file.
8. If not found, encode 1 byte from the new file, and skip it.
9. Repeat steps 5-8 for the rest of the new file.
This is somewhat like compression without windowing, so it will use a lot of memory. It is, however, fairly fast, and produces quite small patches, as long as I try to keep the encoded output minimal.
A more memory-efficient algorithm uses windowing, but produces much bigger patch files.
There are more nuances to the above algorithm that I skipped in this post, but I can post more details if necessary. I do, however, feel that I need a different algorithm altogether, so improving on the above algorithm is probably not going to get me far enough.
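For anyone who wants the gist in code, here is a rough, hedged sketch of the steps above (greedy matching, no windowing; the patch "encoding" is reduced to copy/literal tuples rather than a real output format, and NaivePatch is a made-up name):

using System;
using System.Collections.Generic;

static class NaivePatch
{
    // A patch entry is either a copy (OldOffset, Length) from the old file or a single literal byte.
    public static List<(long OldOffset, int Length, byte? Literal)> Diff(byte[] oldData, byte[] newData)
    {
        // Steps 1-4: index every overlapping 4-byte block of the old file (key -> positions).
        var index = new Dictionary<int, List<int>>();
        for (int i = 0; i + 4 <= oldData.Length; i++)
        {
            int key = BitConverter.ToInt32(oldData, i);
            if (!index.TryGetValue(key, out var positions))
                index[key] = positions = new List<int>();
            positions.Add(i);
        }

        var patch = new List<(long OldOffset, int Length, byte? Literal)>();
        int pos = 0;
        while (pos < newData.Length)
        {
            int bestStart = -1, bestLen = 0;
            // Step 5: look up the next 4 bytes of the new file.
            if (pos + 4 <= newData.Length &&
                index.TryGetValue(BitConverter.ToInt32(newData, pos), out var candidates))
            {
                // Step 6: among all positions sharing the key, keep the longest match.
                foreach (int start in candidates)
                {
                    int len = 0;
                    while (start + len < oldData.Length && pos + len < newData.Length &&
                           oldData[start + len] == newData[pos + len])
                        len++;
                    if (len > bestLen) { bestLen = len; bestStart = start; }
                }
            }

            if (bestLen >= 4)
            {
                // Step 7: emit a reference into the old file and skip the matched block.
                patch.Add((bestStart, bestLen, null));
                pos += bestLen;
            }
            else
            {
                // Step 8: no usable match; emit one literal byte.
                patch.Add((0, 0, newData[pos]));
                pos++;
            }
        }
        return patch;
    }
}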
Edit #1: Here is a more detailed description of the above algorithm.
First, combine the two files, so that you have one big file. Remember the cut-point between the two files.
Secondly, do the "grab 4 bytes and add their position to the dictionary" step for everything in the whole file.
Thirdly, from where the new file starts, do the loop with attempting to locate an existing combination of 4 bytes, and find the longest match. Make sure we only consider positions from the old file, or from earlier in the new file than we're currently at. This ensures that we can reuse material in both the old and the new file during patch application.
Edit #2: Source code to the above algorithm
You might get a warning about the certificate having some problems. I don't know how to resolve that so for the time being just accept the certificate.
The source uses lots of other types from the rest of my library so that file isn't all it takes, but that's the algorithm implementation.
#lomaxx, I have tried to find good documentation for the algorithm used in Subversion, called xdelta, but unless you already know how the algorithm works, the documents I've found fail to tell me what I need to know.
Or perhaps I'm just dense... :)
I took a quick peek at the algorithm on the site you gave, and unfortunately it is not usable. A comment from the binary diff file says:
Finding an optimal set of differences requires quadratic time relative to the input size, so it becomes unusable very quickly.
I don't need optimal results though, so I'm looking for a more practical solution.
Thanks for the answer anyway; I've bookmarked his utilities in case I ever need them.
Edit #1: Note, I will look at his code to see if I can find some ideas, and I'll also send him an email later with questions, but I've read that book he references and though the solution is good for finding optimal solutions, it is impractical in use due to the time requirements.
Edit #2: I'll definitely hunt down the python xdelta implementation.
Sorry I couldn't be of more help. I would definitely keep looking at xdelta, because I have used it a number of times to produce quality diffs on 600MB+ ISO files we have generated for distributing our products, and it performs very well.
bsdiff was designed to create very small patches for binary files. As stated on its page, it requires max(17*n,9*n+m)+O(1) bytes of memory and runs in O((n+m) log n) time (where n is the size of the old file and m is the size of the new file).
The original implementation is in C, but a C# port is described here and available here.
Have you seen VCDiff? It is part of a Misc library that appears to be fairly active (last release r259, April 23rd 2008). I haven't used it, but thought it was worth mentioning.
It might be worth checking out what some of the other guys are doing in this space and not necessarily in the C# arena either.
This is a library written in C#.
SVN also has a binary diff algorithm, and I know there's an implementation in Python, although I couldn't find it with a quick search. They might give you some ideas on where to improve your own algorithm.
If this is for installation or distribution, have you considered using the Windows Installer SDK? It has the ability to patch binary files.
http://msdn.microsoft.com/en-us/library/aa370578(VS.85).aspx
This is a rough guideline, but the following describes the rsync algorithm, which can be used to create your binary patches.
http://rsync.samba.org/tech_report/tech_report.html
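For the curious, the core of rsync's block matching is a weak rolling checksum that can be slid one byte at a time instead of being recomputed for every window. A hedged sketch (rsync additionally reduces both sums mod 2^16 and pairs the weak sum with a strong per-block hash, both omitted here):

static (uint A, uint B) BlockChecksum(byte[] data, int offset, int blockSize)
{
    uint a = 0, b = 0;
    for (int i = 0; i < blockSize; i++)
    {
        a += data[offset + i];
        b += (uint)(blockSize - i) * data[offset + i];
    }
    return (a, b);
}

// Slide the window one byte to the right: remove 'outgoing', append 'incoming'.
static (uint A, uint B) Roll((uint A, uint B) sum, byte outgoing, byte incoming, int blockSize)
{
    uint a = sum.A - outgoing + incoming;
    uint b = sum.B - (uint)blockSize * outgoing + a;
    return (a, b);
}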