What is the best way to search a large binary file for a certain substring in C#?
To provide some specifics, I'm trying to extract the DWARF information from an executable, so I only care about certain parts of the binary file (namely the sections starting with the strings .debug_info, .debug_abbrev, etc.)
I don't see anything obvious in Stream, FileStream, or BinaryReader, so it looks like I'll have to read chunks in and search through the data for the strings myself.
Is there a better way?
There's nothing built into .NET that will do the search for you, so you're going to need to read in the file chunk by chunk and scan for what you want to find.
You can speed up the search in two ways.
Firstly, use buffered I/O and transfer large chunks at a time: don't read byte by byte, read 64 KB, 256 KB or 1 MB chunks.
Secondly, don't do a naive linear scan for the piece you want. Check out the Boyer-Moore string search algorithm (see its Wikipedia article); you can apply it to searching for the DWARF section names you want.
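To make the two suggestions above concrete, here is a minimal sketch that reads the file in chunks and scans each chunk with the Horspool simplification of Boyer-Moore (a swapped-in variant, chosen because it needs only one shift table). The class name `BinarySearcher`, the method `IndexOf` and the default chunk size are illustrative assumptions, not part of any .NET API.

```csharp
using System;
using System.IO;

public static class BinarySearcher
{
    // Returns the stream offset of the first occurrence of pattern, or -1 if absent.
    public static long IndexOf(Stream stream, byte[] pattern, int chunkSize = 64 * 1024)
    {
        int m = pattern.Length;
        if (m == 0) return 0;

        // Horspool shift table: how far the window may slide, keyed by its last byte.
        var shift = new int[256];
        for (int i = 0; i < 256; i++) shift[i] = m;
        for (int i = 0; i < m - 1; i++) shift[pattern[i]] = m - 1 - i;

        // Room for one chunk plus a carried-over tail shorter than the pattern.
        var buffer = new byte[Math.Max(chunkSize, m) + m - 1];
        long bufferStart = 0; // stream offset of buffer[0]
        int filled = 0;       // number of valid bytes in buffer

        while (true)
        {
            int read = stream.Read(buffer, filled, buffer.Length - filled);
            if (read == 0) return -1; // end of stream, no match
            filled += read;

            int pos = 0;
            while (pos + m <= filled)
            {
                int j = m - 1;
                while (j >= 0 && buffer[pos + j] == pattern[j]) j--;
                if (j < 0) return bufferStart + pos;
                pos += shift[buffer[pos + m - 1]];
            }

            // Keep the unscanned tail so a match spanning two chunks is still found.
            int keep = filled - pos;
            Buffer.BlockCopy(buffer, pos, buffer, 0, keep);
            bufferStart += pos;
            filled = keep;
        }
    }
}
```

For the DWARF case you would call it with something like `Encoding.ASCII.GetBytes(".debug_info")` as the pattern and a `FileStream` as the stream. Note that this only locates the section name string; it does not parse the section headers.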
I think you'll have to do it yourself, BinaryReader was not designed for searching for text in a binary file. However, you should be mindful of the text encoding you use when searching.
There must be a DWARF C library you could compile and use interop with? I did some searching and found this. If a library from there could be compiled into a DLL on Windows (I assume you're using Windows), then you could use System.Runtime.InteropServices to interact with the DLL and extract your information from there.
Perhaps?
I need to fit as much information as I can into a small file size. In this case, the data is in a comma separated format and all values are stored as 2dp decimals (no titles).
I've had a look and my understanding is that all the characters I need are stored using ASCII (1 byte per character) in my standard .txt file that I am currently using. Apparently ASCII has 256 possible values, which is way more than I need - I could get by with only 16 characters.
Could I save my data in some kind of 4bit text file? I will be creating the file using c# (all google searches result in advice on making a text file, not how to make a smaller "font" text). Would doing this save any space in the end anyway?
I could zip up anything before I send it, but any advice on ideas to get the filesize down would be greatly appreciated.
[the file] it will be read by a piece of c# code
You are therefore controlling the serialization format. You can pick any format you like.
A quick way to save space and reuse your existing code is to compress the CSV. Gzip is built in, but it is rather weak. You can use a 7-Zip library instead; the 7-Zip algorithm (LZMA) is state of the art. It will get rid of the redundancies caused by decimal points and by mostly using the characters 0-9. It will not remove 100% of that but 99%(?).
You can make this even more efficient by using a better format. You can use BinaryReader/Writer to easily write something entirely custom.
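As a rough illustration of the custom-format idea, the sketch below stores each 2dp value as a 32-bit integer scaled by 100, preceded by a count. The class name `FixedPointFile` and the layout are assumptions made for this example, and it assumes every scaled value fits in an `int`.

```csharp
using System;
using System.IO;
using System.Text;

public static class FixedPointFile
{
    // Writes a count followed by each value as an Int32 holding value * 100.
    public static void Write(Stream output, decimal[] values)
    {
        using var writer = new BinaryWriter(output, Encoding.UTF8, leaveOpen: true);
        writer.Write(values.Length);
        foreach (decimal value in values)
        {
            // Assumption: value * 100 fits in an int; the cast throws otherwise.
            writer.Write((int)Math.Round(value * 100m));
        }
    }

    // Reads back what Write produced.
    public static decimal[] Read(Stream input)
    {
        using var reader = new BinaryReader(input, Encoding.UTF8, leaveOpen: true);
        var values = new decimal[reader.ReadInt32()];
        for (int i = 0; i < values.Length; i++)
        {
            values[i] = reader.ReadInt32() / 100m;
        }
        return values;
    }
}
```

This costs a fixed 4 bytes per value, versus one byte per character plus a comma in the CSV. Whether it beats a compressed CSV depends on the data, so it is worth measuring both.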
Protocol Buffers is a bit easier and also extremely compact.
I think that the question is legitimate, but the answer is that you impose logical conditions that leave no place for any solution.
So if you could avoid CSV structure for your custom structure you could save something, but you need it and it pretty much determines your solution. The only variable left is how do you encode the text, but you can't encode the text in less than 8 bits, you can just use higher values like Unicode (16 bits).
I won't comment on using compression as you already mentioned that you are looking for alternative answers and you are aware of that.
In .NET, I need a way to compare two files. I thought of a class, which represents a diff:
public enum DiffEntryState
{
    New,
    Removed,
    Changed
}
public class DiffEntry
{
    public byte[] Bytes;
    public long FileOffset;
    public DiffEntryState State = DiffEntryState.Changed;
}
The names should be pretty self-explanatory. I thought of adding a State to each entry, so that I can distinguish between the cases where the first file is larger than the second or vice versa.
I'm wondering if there is a common and fast way to retrieve the byte-by-byte differences of two files. I would simply create a stream for each file and compare chunks of these streams until one ends. Is there a better way, or does the Framework have a built-in solution? Keep in mind that I need the differences themselves, not only the feedback that there ARE differences.
//Edit:
After sleeping a night over the problem, I guess I'm taking the wrong approach here. The whole tool is a backup solution, which will be able to save only the changed bytes and thus reduce the overall necessary space for the backup. Instead of saving a compressed 14 MB file each time, only 200k or less will be saved.
But, after thinking about the problem, I realized that it wouldn't be enough to save only the differences byte-by-byte. Take a Text for example:
"This is a string."
"This was a string."
As a matter of fact, the only change here is "is" to "was". But my approach would assume that the changed content is now "was a string". If this happens at the beginning of a huge file, well, this approach is useless.
Obviously, I need a way to index a file and detect all moved, copied or changed blocks in comparison to the original file.
Phew...
Take a look at Diff.NET; it could be helpful.
For general case binary differencing, look at A Linear Time, Constant Space Differencing Algorithm by Randal C. Burns and Darrell D. E. Long. Also, Randal Burns' master's thesis, Differential Compression: A Generalized Solution For Binary Files, goes into more detail and provides pseudo-code for the algorithm.
You might also get some useful ideas from About Remote Differential Compression and from Optimizing File Replication over Limited-Bandwidth Networks using Remote Differential Compression
For text file differencing, I recommend starting with An O(ND) Difference Algorithm and Its Variations by Eugene W. Myers. This algorithm can be used to diff any two sequences. To compare two text files, generate sequences of hash codes (e.g., by calling string.GetHashCode()) for each line in each file. Then run those sequences (e.g., IList) through Myers' algorithm to find the shortest edit script (i.e., inserts and deletes) that will convert the first sequence into the second.
I hope this helps. I'm the author of Diff.Net, and it uses Burns' algorithm for binary differencing and Myers' algorithm for text differencing. The source code for Diff.Net's libraries (Menees.Diffs and Menees.Diffs.Controls) is available under the Apache License, Version 2.0, and the references above should help you implement your own solution without having to start from scratch.
There is no built-in functionality.
So you have to compare the files byte by byte or use a library that does this for you.
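If you go the byte-by-byte route, a minimal sketch might look like the following. It compares two streams chunk by chunk and reports runs of differing bytes over their common length; the names `ByteDiff` and `ChangedRuns` are made up for this example. Anything past the end of the shorter stream is left to the caller to record as added or removed, and, as the question's edit notes, an insertion will make everything after it show up as changed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ByteDiff
{
    // Returns (offset, length) runs where the two streams differ,
    // considering only the length they have in common.
    public static List<(long Offset, long Length)> ChangedRuns(
        Stream a, Stream b, int chunkSize = 64 * 1024)
    {
        var runs = new List<(long Offset, long Length)>();
        var bufferA = new byte[chunkSize];
        var bufferB = new byte[chunkSize];
        long offset = 0;    // stream offset of the current chunk
        long runStart = -1; // start of the open run of differences, or -1

        while (true)
        {
            int n = Math.Min(ReadFull(a, bufferA), ReadFull(b, bufferB));
            for (int i = 0; i < n; i++)
            {
                bool differs = bufferA[i] != bufferB[i];
                if (differs && runStart < 0)
                {
                    runStart = offset + i;
                }
                else if (!differs && runStart >= 0)
                {
                    runs.Add((runStart, offset + i - runStart));
                    runStart = -1;
                }
            }
            offset += n;
            if (n < chunkSize) break; // at least one stream has ended
        }

        if (runStart >= 0) runs.Add((runStart, offset - runStart));
        return runs;
    }

    // Stream.Read may return fewer bytes than requested, so loop until
    // the buffer is full or the stream ends.
    private static int ReadFull(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0) break;
            total += read;
        }
        return total;
    }
}
```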
I have an ogg vorbis file and I have to do two operations with it:
Cutting a part of a file from one position to another
Merging another file with existing one
How can I do these two operations in C#?
You can do this with libzplay http://libzplay.sourceforge.net/
The steps needed to do what is being asked about:
OpenFile
Seek
SetWaveOutFile (this supports .ogg exporting as well as other formats)
StartPlayback
StopPlayback(at time needed)
Everything is extremely well documented on the linked site for multiple languages, including c#.
This answer is for all the other people that spent hours searching and weren't helped by the previous answers. This isn't a very efficient solution to the problem here, but while searching this question came up many times, and this might be helpful to others. :)
I'd look into the c documentation for libogg, and figure out how to do this with c. And then write almost the same code in C# using a wrapper over libogg.
I've created a low level wrapper over libogg and libvorbis using the interop assistant:
https://github.com/CodesInChaos/Xiph/blob/master/LowLevel.cs
That project also contains some higher level constructs, but I don't think they'll be useful for what you're doing.
BTW if the stream IDs between the files differ, you can simply append a file to another creating a valid file that plays both streams in sequence.
You probably need to read the input files packet wise using the decoding API, and then write the combined data out packet wise. Possibly replacing the stream ID and granulepos in between.
StreamID is an integer that identifies substreams in an ogg file. To append multiple such substreams you can simply ensure that they have a different ID and then write the data.
Splitting is a bit more annoying, since granulepos is a codec dependent timestamp, and I don't remember how it is defined for vorbis. Another problem here is that you can't simply split in the middle of a packet without reencoding.
I need to compress a very large xml file to the smallest possible size.
I work in C#, and I prefer it to be some open source or application that I can access thru my code, but I can handle an algorithm as well.
Thank you!
It may not be the "smallest size possible", but you could use System.IO.Compression to compress it. Zipping tends to provide very good compression for text.
using (var fileStream = File.OpenWrite(...))
using (var zipStream = new GZipStream(fileStream, CompressionMode.Compress))
{
zipStream.Write(...);
}
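For a version of the snippet above with nothing elided, here is a small in-memory round trip. The helper class `GzipHelper` is hypothetical; `GZipStream`, `CompressionLevel` and `CompressionMode` are the standard System.IO.Compression types. For a very large XML file you would wrap file streams rather than byte arrays, but the calls are the same.

```csharp
using System.IO;
using System.IO.Compression;

public static class GzipHelper
{
    // Compresses a byte array with gzip at the highest built-in level.
    public static byte[] Compress(byte[] data)
    {
        using var output = new MemoryStream();
        using (var gzip = new GZipStream(output, CompressionLevel.Optimal))
        {
            gzip.Write(data, 0, data.Length);
        } // disposing the GZipStream flushes the final compressed block
        return output.ToArray();
    }

    // Reverses Compress.
    public static byte[] Decompress(byte[] compressed)
    {
        using var input = new MemoryStream(compressed);
        using var gzip = new GZipStream(input, CompressionMode.Decompress);
        using var output = new MemoryStream();
        gzip.CopyTo(output);
        return output.ToArray();
    }
}
```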
As stated above, Efficient XML Interchange (EXI) achieves the best available XML compression pretty consistently. Even without schemas, it is not uncommon for EXI to be 2-5 times smaller than zip. With schemas, you'll do even better.
If you're not opposed to a commercial implementation, you can use the .NET version of Efficient XML and call it directly from your C# code using standard .NET APIs. You can download a free trial copy from http://www.agiledelta.com/efx_download.html.
Have a look at XML Compression Tools. You can also compress it using SharpZipLib.
If you have a schema available for the XML file, you could try EXIficient. It is an implementation of the Efficient XML Interchange (EXI) format that is pretty much the best available general-purpose XML compression method. If you don't have a schema, EXI is still better than regular zip (the deflate algorithm, that is), but not very much, especially for large files.
EXIficient is only Java but you can probably make it into an application that you can call. I'm not aware of any open-source implementations of EXI in C#.
File size is not the only advantage of EXI (or any binary scheme). The processing time and memory overhead are also greatly reduced when reading/writing it. Imagine a program that copies floating point numbers to disk by simply copying the bytes. Now imagine another program converts the floating point numbers to formatted text, and pastes them into a text stream, and then feeds that stream through an expensive compression algorithm. Because of this ridiculous overhead, XML is basically unusable for very large files that could have been effortlessly processed with a binary representation.
Binary XML promises to address this longstanding weakness of XML. It would be very easy to make a utility that converts between binary/text representations (without knowing the XML schema), which means you can still edit the files easily when you want to.
XML is highly compressible. You can use DotNetZip to produce compressed zip files from your XML.
If you require the maximum compression level, I would recommend LZMA. There is an SDK (including C#) that is part of the open source 7-Zip project, available here.
If you are looking for the smallest possible size then try Fast Infoset as binary XML encoding and then compress using BZIP2 or LZMA. You will probably get better results than compressing text XML or using EXI. FastInfoset.NET includes implementations of the Fast Infoset standard and several compression formats to choose from but it's commercial.
Does anyone have, or know of, a binary patch generation algorithm implementation in C#?
Basically, compare two files (designated old and new), and produce a patch file that can be used to upgrade the old file to have the same contents as the new file.
The implementation would have to be relatively fast, and work with huge files. It should exhibit O(n) or O(log n) runtimes.
My own algorithms tend to either be lousy (fast but produce huge patches) or slow (produce small patches but have O(n^2) runtime).
Any advice, or pointers for implementation would be nice.
Specifically, the implementation will be used to keep servers in sync for various large datafiles that we have one master server for. When the master server datafiles change, we need to update several off-site servers as well.
The most naive algorithm I have made, which only works for files that can be kept in memory, is as follows:
Grab the first four bytes from the old file, call this the key
Add those bytes to a dictionary, where key -> position, where position is the position where I grabbed those 4 bytes, 0 to begin with
Skip the first of these four bytes, grab another 4 (3 overlapping, 1 new), and add to the dictionary the same way
Repeat steps 1-3 for all 4-byte blocks in the old file
From the start of the new file, grab 4 bytes, and attempt to look it up in the dictionary
If found, find the longest match if there are several, by comparing bytes from the two files
Encode a reference to that location in the old file, and skip the matched block in the new file
If not found, encode 1 byte from the new file, and skip it
Repeat steps 5-8 for the rest of the new file
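Steps 1-9 above can be sketched roughly as follows. This is one reading of the description, not the original source: the names `NaivePatch` and `PatchOp` are invented here, and the dictionary keeps every position for a key (rather than a single one) so that step 6 can pick the longest match.

```csharp
using System;
using System.Collections.Generic;

// One patch instruction: copy a block from the old file, or emit one literal byte.
public sealed class PatchOp
{
    public bool IsCopy;
    public int OldOffset;
    public int Length;
    public byte Literal;
}

public static class NaivePatch
{
    public static List<PatchOp> Create(byte[] oldData, byte[] newData)
    {
        // Steps 1-4: index every overlapping 4-byte window of the old file.
        var index = new Dictionary<uint, List<int>>();
        for (int i = 0; i + 4 <= oldData.Length; i++)
        {
            uint key = BitConverter.ToUInt32(oldData, i);
            if (!index.TryGetValue(key, out var positions))
                index[key] = positions = new List<int>();
            positions.Add(i);
        }

        // Steps 5-9: walk the new file, preferring the longest match in the old one.
        var ops = new List<PatchOp>();
        int pos = 0;
        while (pos < newData.Length)
        {
            int bestOffset = -1, bestLength = 0;
            if (pos + 4 <= newData.Length &&
                index.TryGetValue(BitConverter.ToUInt32(newData, pos), out var candidates))
            {
                foreach (int candidate in candidates)
                {
                    int length = 0;
                    while (candidate + length < oldData.Length &&
                           pos + length < newData.Length &&
                           oldData[candidate + length] == newData[pos + length])
                        length++;
                    if (length > bestLength) { bestLength = length; bestOffset = candidate; }
                }
            }

            if (bestLength >= 4)
            {
                ops.Add(new PatchOp { IsCopy = true, OldOffset = bestOffset, Length = bestLength });
                pos += bestLength;
            }
            else
            {
                ops.Add(new PatchOp { Literal = newData[pos] });
                pos++;
            }
        }
        return ops;
    }

    // Rebuilds the new file from the old file and the patch instructions.
    public static byte[] Apply(byte[] oldData, List<PatchOp> ops)
    {
        var result = new List<byte>();
        foreach (var op in ops)
        {
            if (op.IsCopy)
                for (int i = 0; i < op.Length; i++) result.Add(oldData[op.OldOffset + i]);
            else
                result.Add(op.Literal);
        }
        return result.ToArray();
    }
}
```

Note that when the same 4-byte window occurs many times in the old file, the candidate loop makes this sketch quadratic in the worst case, and the index holds an entry for every byte of the old file.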
This is somewhat like compression, without windowing, so it will use a lot of memory. It is, however, fairly fast, and produces quite small patches, as long as I try to make the codes output minimal.
A more memory-efficient algorithm uses windowing, but produces much bigger patch files.
There are more nuances to the above algorithm that I skipped in this post, but I can post more details if necessary. I do, however, feel that I need a different algorithm altogether, so improving on the above algorithm is probably not going to get me far enough.
Edit #1: Here is a more detailed description of the above algorithm.
First, combine the two files, so that you have one big file. Remember the cut-point between the two files.
Secondly, do that grab 4 bytes and add their position to the dictionary step for everything in the whole file.
Thirdly, from where the new file starts, do the loop with attempting to locate an existing combination of 4 bytes, and find the longest match. Make sure we only consider positions from the old file, or from earlier in the new file than we're currently at. This ensures that we can reuse material in both the old and the new file during patch application.
Edit #2: Source code to the above algorithm
You might get a warning about the certificate having some problems. I don't know how to resolve that so for the time being just accept the certificate.
The source uses lots of other types from the rest of my library so that file isn't all it takes, but that's the algorithm implementation.
@lomaxx, I have tried to find good documentation for the algorithm used in Subversion, called xdelta, but unless you already know how the algorithm works, the documents I've found fail to tell me what I need to know.
Or perhaps I'm just dense... :)
I took a quick peek on the algorithm from that site you gave, and it is unfortunately not usable. A comment from the binary diff file says:
Finding an optimal set of differences requires quadratic time relative to the input size, so it becomes unusable very quickly.
My needs aren't optimal though, so I'm looking for a more practical solution.
Thanks for the answer though, added a bookmark to his utilities if I ever need them.
Edit #1: Note, I will look at his code to see if I can find some ideas, and I'll also send him an email later with questions, but I've read that book he references and though the solution is good for finding optimal solutions, it is impractical in use due to the time requirements.
Edit #2: I'll definitely hunt down the python xdelta implementation.
Sorry I couldn't be more help. I would definitely keep looking at xdelta, because I have used it a number of times to produce quality diffs on 600MB+ ISO files we have generated for distributing our products, and it performs very well.
bsdiff was designed to create very small patches for binary files. As stated on its page, it requires max(17*n,9*n+m)+O(1) bytes of memory and runs in O((n+m) log n) time (where n is the size of the old file and m is the size of the new file).
The original implementation is in C, but a C# port is described here and available here.
Have you seen VCDiff? It is part of a Misc library that appears to be fairly active (last release r259, April 23rd 2008). I haven't used it, but thought it was worth mentioning.
It might be worth checking out what some of the other guys are doing in this space and not necessarily in the C# arena either.
This is a library written in c#
SVN also has a binary diff algorithm and I know there's an implementation in python although I couldn't find it with a quick search. They might give you some ideas on where to improve your own algorithm
If this is for installation or distribution, have you considered using the Windows Installer SDK? It has the ability to patch binary files.
http://msdn.microsoft.com/en-us/library/aa370578(VS.85).aspx
This is a rough guideline, but the following is for the rsync algorithm which can be used to create your binary patches.
http://rsync.samba.org/tech_report/tech_report.html
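The piece of rsync that makes it practical for large files is the weak rolling checksum described in that report: it can be slid forward one byte at a time in constant time, so every offset of the new file can be tested against the block checksums of the old file. A minimal sketch follows, with the class name `RollingChecksum` being an assumption of this example. A real implementation also confirms candidate matches with a strong hash, which is omitted here.

```csharp
using System;

public sealed class RollingChecksum
{
    private readonly int length; // window (block) size in bytes
    private uint a;              // sum of the bytes in the window, mod 2^16
    private uint b;              // position-weighted sum of the bytes, mod 2^16

    // Computes the checksum of data[start .. start + length).
    public RollingChecksum(byte[] data, int start, int length)
    {
        this.length = length;
        unchecked
        {
            for (int i = 0; i < length; i++)
            {
                a += data[start + i];
                b += (uint)(length - i) * data[start + i];
            }
        }
        a &= 0xFFFF;
        b &= 0xFFFF;
    }

    // Combined 32-bit value: a in the low half, b in the high half.
    public uint Value => a | (b << 16);

    // Slides the window one byte forward in O(1): 'outgoing' leaves at the
    // front and 'incoming' enters at the back.
    public void Roll(byte outgoing, byte incoming)
    {
        unchecked
        {
            a = (a - outgoing + incoming) & 0xFFFF;
            b = (b - (uint)length * outgoing + a) & 0xFFFF;
        }
    }
}
```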