I need to compress a very large xml file to the smallest possible size.
I work in C#, and I would prefer an open-source library or application that I can access through my code, but I can handle an algorithm as well.
Thank you!
It may not be the "smallest size possible", but you could use System.IO.Compression to compress it. Zipping tends to provide very good compression for text.
// needs System.IO and System.IO.Compression; the path and xmlBytes are placeholders
using (var fileStream = File.OpenWrite("data.xml.gz"))
using (var zipStream = new GZipStream(fileStream, CompressionMode.Compress))
{
    zipStream.Write(xmlBytes, 0, xmlBytes.Length);   // xmlBytes: your XML as a byte[]
}
As stated above, Efficient XML Interchange (EXI) achieves the best available XML compression pretty consistently. Even without schemas, it is not uncommon for EXI to be 2-5 times smaller than zip. With schemas, you'll do even better.
If you're not opposed to a commercial implementation, you can use the .NET version of Efficient XML and call it directly from your C# code using standard .NET APIs. You can download a free trial copy from http://www.agiledelta.com/efx_download.html.
Have a look at XML Compression Tools. You can also compress it using SharpZipLib.
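For example, a minimal sketch with SharpZipLib (GZipOutputStream and SetLevel are SharpZipLib's API; the file names are placeholders):

// using ICSharpCode.SharpZipLib.GZip;
using (var input = File.OpenRead("data.xml"))
using (var output = File.Create("data.xml.gz"))
using (var gzip = new GZipOutputStream(output))
{
    gzip.SetLevel(9);       // 9 = best compression
    input.CopyTo(gzip);
}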
If you have a schema available for the XML file, you could try EXIficient. It is an implementation of the Efficient XML Interchange (EXI) format that is pretty much the best available general-purpose XML compression method. If you don't have a schema, EXI is still better than regular zip (the deflate algorithm, that is), but not very much, especially for large files.
EXIficient is only Java but you can probably make it into an application that you can call. I'm not aware of any open-source implementations of EXI in C#.
File size is not the only advantage of EXI (or any binary scheme). The processing time and memory overhead are also greatly reduced when reading/writing it. Imagine a program that copies floating point numbers to disk by simply copying the bytes. Now imagine another program converts the floating point numbers to formatted text, and pastes them into a text stream, and then feeds that stream through an expensive compression algorithm. Because of this ridiculous overhead, XML is basically unusable for very large files that could have been effortlessly processed with a binary representation.
Binary XML promises to address this longstanding weakness of XML. It would be very easy to make a utility that converts between binary/text representations (without knowing the XML schema), which means you can still edit the files easily when you want to.
XML is highly compressible. You can use DotNetZip to produce compressed zip files from your XML.
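A minimal sketch with DotNetZip (ZipFile, AddFile and Save are from its Ionic.Zip API; the file names are placeholders):

// using Ionic.Zip;
using (var zip = new ZipFile())
{
    zip.CompressionLevel = Ionic.Zlib.CompressionLevel.BestCompression;
    zip.AddFile("data.xml");      // the XML file to compress
    zip.Save("data.zip");
}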
If you require the maximum compression level, I would recommend LZMA. There is an SDK (including C#) that is part of the open-source 7-Zip project, available here.
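If you go that route, a rough sketch based on the SDK's C# sample code might look like the following; the SevenZip.Compression.LZMA.Encoder class and the .lzma header layout are taken from that sample and may differ between SDK versions, and the file names are placeholders:

using (var input = File.OpenRead("data.xml"))
using (var output = File.Create("data.xml.lzma"))
{
    var encoder = new SevenZip.Compression.LZMA.Encoder();
    encoder.WriteCoderProperties(output);                      // 5-byte properties header
    output.Write(BitConverter.GetBytes(input.Length), 0, 8);   // uncompressed size, little-endian
    encoder.Code(input, output, input.Length, -1, null);       // compress the whole stream
}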
If you are looking for the smallest possible size then try Fast Infoset as binary XML encoding and then compress using BZIP2 or LZMA. You will probably get better results than compressing text XML or using EXI. FastInfoset.NET includes implementations of the Fast Infoset standard and several compression formats to choose from but it's commercial.
I cannot work out how to create an EXI decoder in C#/.NET which accepts a MemoryStream containing valid EXI data and simply outputs another MemoryStream containing XML. I will parse the XML later with custom methods; I'm using EXI only to achieve the best compression performance and a low memory footprint. So far I have found some Java implementations as examples, but no C#/.NET counterpart; hints of any kind are really appreciated.
The program that I am working on saves a snapshot of the current state to an XML file. I would like to store this in the database (as a blob) instead of as an XML file.
Firstly, I think XML files are quite space-consuming and redundant, so we would like to compress the string in some way before storing it in the database. In addition, we would also like to introduce some simple cryptography so that people won't be able to figure out what it means without at least a simple key/password.
Note that I want to store it in the database as a blob, so zipping it and then encrypting the zip file won't do, I guess.
How can I go about doing this?
Compress the XML data with DeflateStream and write its output to a MemoryStream, then call the .ToArray() method to obtain your blob data. You can do encryption with .NET in a similar streaming way (after compression, of course). If you find that deflate doesn't save enough space, try the XWRT library.
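A sketch using only standard .NET classes (System.IO.Compression and System.Security.Cryptography); xmlString stands in for your snapshot, and key management/derivation from the password is up to you:

byte[] xmlBytes = Encoding.UTF8.GetBytes(xmlString);

// 1. compress with Deflate
byte[] compressed;
using (var buffer = new MemoryStream())
{
    using (var deflate = new DeflateStream(buffer, CompressionMode.Compress))
        deflate.Write(xmlBytes, 0, xmlBytes.Length);
    compressed = buffer.ToArray();      // ToArray still works after the stream is closed
}

// 2. encrypt the compressed bytes with AES
byte[] blob;
using (var aes = Aes.Create())          // persist aes.Key/aes.IV (or derive them from the password)
using (var encryptor = aes.CreateEncryptor())
using (var buffer = new MemoryStream())
{
    using (var crypto = new CryptoStream(buffer, encryptor, CryptoStreamMode.Write))
        crypto.Write(compressed, 0, compressed.Length);
    blob = buffer.ToArray();            // store this byte[] as the database blob
}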
Firstly, have a look at your serialization mechanism. The whole point of XML is that it's human readable. If that's no longer an important goal for you then it might be time to look at other serialization technologies which would be more suited to database storage (compressing XML into binary completely defeats the point of it :)
As an alternative format, BSON could be a good choice.
I have two separate apps - one a client (in C#), one a server (in C++). They need to exchange data in the form of "structs" and ~ about 1 MB of data a minute is sent from server to client.
What's better to use - XML or my own binary format?
With XML:
Translating XML to a struct using a parser would be slow, I believe ("good", but: load the parser, load the XML, then parse)
The other option is parsing XML with regex (bad!)
With Binary:
compact data sizes
no need for meta information like tags;
but structs cannot be changed easily to accommodate new structs/new members in structs in the future;
no conversion from text (XML) to binary (struct) is necessary, so it is faster to receive and "assemble" into a struct
Any pointers? Should I not be considering binary at all?? A bit confused about what approach to take.
1MB of data per minute is pretty tiny if you've got a reasonable network connection.
There are other choices between binary and XML - other human-readable text serialization formats, such as JSON.
When it comes to binary, you don't have to have versioning problems - technologies like Protocol Buffers (I'm biased: I work for Google and I've ported PB to C#) are explicitly designed with backward and forward compatibility in mind. There are other binary formats to consider as well, such as Thrift.
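As a sketch, here is what a message might look like with protobuf-net, one of the C# implementations (the attributes and Serializer calls are protobuf-net's; the StatusMessage type and networkStream variable are made up for illustration):

// using ProtoBuf;  (protobuf-net)
[ProtoContract]
public class StatusMessage
{
    [ProtoMember(1)] public int Sequence { get; set; }
    [ProtoMember(2)] public string Payload { get; set; }
    // new members can be added later with new tag numbers without breaking old readers
}

// serialize onto any Stream (e.g. the network stream)
Serializer.Serialize(networkStream, new StatusMessage { Sequence = 1, Payload = "example" });

// deserialize on the receiving side
var message = Serializer.Deserialize<StatusMessage>(networkStream);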
If you're worried about performance though, you should really measure it. I'm pretty sure my phone could parse 1MB of XML sufficiently quickly for it not to be a problem in this case... basically work out what you're most concerned about, in terms of:
Simplicity of code
Interoperability
Performance in terms of CPU
Network traffic
Backward/forward compatibility
Human readability of on-the-wire format
It's all a balancing act - but you're the one who has to decide how much weight to give each of those factors.
If you have .NET applications in both ends, use Windows Communication Foundation. This will allow you to defer the decision until deployment time, as it supports both binary and XML serialization.
As you stated, XML is a (little) slower but much more flexible and reliable. I would go with XML until there is a proven problem with performance.
You should also take a look at ProtoBuf as an alternative.
And, after your update, any cross-language, cross-platform and cross-version requirement strongly points away from binary formatting.
A good point for XML would be interoperability. Do you have other clients that also access your server?
Before you use your own binary format or run regexes over XML, have you considered the serialization namespaces in .NET? There are binary formatters, SOAP formatters, and there is also XML serialization.
Another advantage of XML is that you can extend the data you are sending by adding an element; you won't have to alter the receiver's code to cope with the extra data until you are ready to.
Also, even minimal (fast) compression of XML can dramatically reduce the wire load.
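For instance, XmlSerializer handles the struct-to-XML mapping for you (System.Xml.Serialization; the StatusUpdate type and networkStream variable are just placeholders for illustration):

// using System.Xml.Serialization;
public class StatusUpdate
{
    public int Sequence { get; set; }
    public string Payload { get; set; }
}

var serializer = new XmlSerializer(typeof(StatusUpdate));

// write to whatever Stream you're sending over
serializer.Serialize(networkStream, new StatusUpdate { Sequence = 1, Payload = "example" });

// and read it back on the other side
var update = (StatusUpdate)serializer.Deserialize(networkStream);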
text/xml
Human readable
Easier to debug
Bandwidth can be saved by compressing
Tags document the data they contain
binary
Compact
Easy to parse (if fixed size fields are used, just overlay a struct)
Difficult to debug (hex editors are a pain)
Needs a separate document to understand what the data is.
Both forms are extensible and can be upgraded to newer versions provided you insert a type and version field at the beginning of the datagram.
You did not say if they are on the same machine or not; I assume not.
In that case there is another downside to binary: you cannot simply dump the structs on the wire, because you could have endianness and sizeof issues.
XML is very wordy; YAML or JSON are much smaller.
Don't forget that what most people think of as XML is XML serialized as text. It can be serialized to binary instead. This is what the netTcpBinding and other such bindings do in WCF. The XML infoset is output as binary, not as text. It's still XML, just in binary.
You could also use Google Protocol Buffers, which is a compact binary representation for structured data.
I need to do some processing on fairly large XML files (large here being potentially upwards of a gigabyte) in C#, including performing some complex XPath queries. The problem I have is that the standard way I would normally do this through the System.XML libraries likes to load the whole file into memory before it does anything with it, which can cause memory problems with files of this size.
I don't need to be updating the files at all just reading them and querying the data contained in them. Some of the XPath queries are quite involved and go across several levels of parent-child type relationship - I'm not sure whether this will affect the ability to use a stream reader rather than loading the data into memory as a block.
One way I can see of making it work is to perform the simple analysis using a stream-based approach and perhaps wrapping the XPath statements into XSLT transformations that I could run across the files afterward, although it seems a little convoluted.
Alternately, I know that there are some elements that the XPath queries will not run across, so I guess I could break the document up into a series of smaller fragments based on its original tree structure, which could perhaps be small enough to process in memory without causing too much havoc.
I've tried to explain my objective here so if I'm barking up totally the wrong tree in terms of general approach I'm sure you folks can set me right...
XPathReader is the answer. It isn't part of the C# runtime, but it is available for download from Microsoft. Here is an MSDN article.
If you construct an XPathReader with an XmlTextReader you get the efficiency of a streaming read with the convenience of XPath expressions.
I haven't used it on gigabyte sized files, but I have used it on files that are tens of megabytes, which is usually enough to slow down DOM based solutions.
Quoting from the below: "The XPathReader provides the ability to perform XPath over XML documents in a streaming manner".
Download from Microsoft
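From memory of that MSDN sample, usage looks roughly like this; the XPathCollection/ReadUntilMatch names are recalled from the sample and may not match the download exactly, and the file path and query are placeholders:

var queries = new XPathCollection();
queries.Add("//order/item[@status='shipped']");

using (var textReader = new XmlTextReader("huge.xml"))
{
    var xpathReader = new XPathReader(textReader, queries);
    while (xpathReader.ReadUntilMatch())
    {
        // the reader is positioned on a matching node; read out what you need
        Console.WriteLine(xpathReader.ReadString());
    }
}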
Gigabyte XML files! I don't envy you this task.
Is there any way that the files could be sent in a better way? E.g. Are they being sent over the net to you - if they are then a more efficient format might be better for all concerned. Reading the file into a database isn't a bad idea but it could be very time consuming indeed.
I wouldn't try and do it all in memory by reading the entire file - unless you have a 64bit OS and lots of memory. What if the file becomes 2, 3, 4GB?
One other approach could be to read in the XML file and use SAX to parse the file and write out smaller XML files according to some logical split. You could then process these with XPath. I've used XPath on 20-30MB files and it is very quick. I was originally going to use SAX but thought I would give XPath a go and was surprised how quick it was. I saved a lot of development time and probably only lost 250ms per query. I was using Java for my parsing but I suspect there would be little difference in .NET.
I did read that XML::Twig (A Perl CPAN module) was written explicitly to handle SAX based XPath parsing. Can you use a different language?
This might also help https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-1044772.html
http://msdn.microsoft.com/en-us/library/bb387013.aspx has a relevant example leveraging XStreamingElement.
In order to perform XPath queries with the standard .NET classes, the whole document tree needs to be loaded into memory, which might not be a good idea if it can take up to a gigabyte. IMHO the XmlReader is a nice class for handling such tasks.
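A typical forward-only pass with XmlReader looks like this (the path and the "record" element name are placeholders for whatever you actually need to extract):

// using System.Xml; using System.Xml.Linq;
using (var reader = XmlReader.Create("huge.xml"))
{
    while (reader.Read())
    {
        if (reader.NodeType == XmlNodeType.Element && reader.Name == "record")
        {
            // ReadSubtree isolates just this element, which is small enough to
            // materialize and query with XPath/LINQ without loading the whole file
            var fragment = XElement.Load(reader.ReadSubtree());
            // ... inspect fragment ...
        }
    }
}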
It seems that you already tried using XPathDocument and could not accommodate the parsed XML document in memory.
If this is the case, before starting to split the file (which is ultimately the right decision!) you may try using the Saxon XSLT/XQuery processor. It has a very efficient in-memory representation of a loaded XML document (the "tinytree" model). In addition, Saxon SA (the schema-aware version, which isn't free) has some streaming extensions. Read more about this here.
How about just reading the whole thing into a database and then working with the temp database? That might be better because then your queries can be done more efficiently using T-SQL.
I think the best solution is to write your own XML parser that can read small chunks rather than the whole file, or to split the large file into small files and use the .NET classes on those.
The problem is that you cannot parse some of the data until all of it is available, so I would recommend your own parser rather than the .NET classes.
Have you been trying XPathDocument?
This class is optimized for handling XPath queries efficiently.
If you cannot handle your input documents efficiently using XPathDocument you might consider preprocessing and/or splitting up your input documents using an XmlReader.
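For reference, the basic XPathDocument pattern is (the path and XPath expression are placeholders):

// using System.Xml.XPath;
var document = new XPathDocument("input.xml");      // loads the whole file into a compact read-only tree
var navigator = document.CreateNavigator();

XPathNodeIterator results = navigator.Select("//order/item[@price > 100]");
while (results.MoveNext())
{
    Console.WriteLine(results.Current.Value);
}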
You've outlined your choices already.
Either you need to abandon XPath and use XmlTextReader, or you need to break the document up into manageable chunks on which you can use XPath.
If you choose the latter, use XPathDocument; its read-only restriction allows better use of memory.
Since in your case the data size can run into gigabytes, have you considered using ADO.NET with XML as a database? The memory footprint would not be huge.
Another approach would be using LINQ to XML with streaming elements like XStreamingElement. Hope this helps.
What is the best way to search a large binary file for a certain substring in C#?
To provide some specifics, I'm trying to extract the DWARF information from an executable, so I only care about certain parts of the binary file (namely the sections starting with the strings .debug_info, .debug_abbrev, etc.)
I don't see anything obvious in Stream, FileStream, or BinaryReader, so it looks like I'll have to read chunks in and search through the data for the strings myself.
Is there a better way?
There's nothing built into .NET that will do the search for you, so you're going to need to read in the file chunk by chunk and scan for what you want to find.
You can speed up the search in two ways.
Firstly, use buffered IO and transfer large chunks at a time - don't read byte by byte; read 64KB, 256KB or 1MB chunks.
Secondly, don't do a linear scan for the piece you want - check out the Boyer-Moore (wikipedia link) algorithm for string searches - you can apply this to searching for the DWARF information you want.
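A naive chunked scan might look like this before you layer Boyer-Moore on top; the buffer size is arbitrary, and the overlap handling is there so a match that straddles two chunks isn't missed:

static long FindPattern(string path, byte[] pattern)
{
    const int bufferSize = 256 * 1024;
    var buffer = new byte[bufferSize + pattern.Length - 1];
    using (var stream = File.OpenRead(path))
    {
        int carried = 0;
        long offset = 0;        // file position of buffer[0]
        int read;
        while ((read = stream.Read(buffer, carried, bufferSize)) > 0)
        {
            int total = carried + read;
            for (int i = 0; i <= total - pattern.Length; i++)
            {
                int j = 0;
                while (j < pattern.Length && buffer[i + j] == pattern[j]) j++;
                if (j == pattern.Length)
                    return offset + i;          // absolute file offset of the match
            }
            // keep the tail in case the pattern straddles the chunk boundary
            carried = Math.Min(pattern.Length - 1, total);
            Array.Copy(buffer, total - carried, buffer, 0, carried);
            offset += total - carried;
        }
    }
    return -1;   // not found
}

// e.g. FindPattern(exePath, Encoding.ASCII.GetBytes(".debug_info"))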
I think you'll have to do it yourself, BinaryReader was not designed for searching for text in a binary file. However, you should be mindful of the text encoding you use when searching.
There must be a DWARF C library you could compile and use interop with? I did some searching and found this. If a library from there could be compiled into a DLL on Windows (I assume you're using Windows), then you could use System.Runtime.InteropServices to interact with the DLL and extract your information from there.
Perhaps?