Compressing an XML file - C#

Hi,
I have an XML file of 500KB that I need to send to a web service, so I want to compress the data before sending it.
I have heard of something called Base64 encoding...
Can anyone throw more light on this?
Suppose I use GZipStream, how would I send the file to the web service?
Thanks in advance.

Something like the code below (the first part just writes some random XML for us to work with). Your web service would ideally take a byte[] argument, and would have (if using WSE3 or WCF over basic-http) MTOM enabled, which reduces the base-64 overhead. You just post it the byte[] and then reverse the compression at the other end.
if (File.Exists("my.xml")) File.Delete("my.xml");
using (XmlWriter xmlFile = XmlWriter.Create("my.xml")) {
    Random rand = new Random();
    xmlFile.WriteStartElement("xml");
    for (int i = 0; i < 1000; i++) {
        xmlFile.WriteElementString("add", rand.Next().ToString());
    }
    xmlFile.WriteEndElement();
    xmlFile.Close();
}
// now we have some xml!
using (MemoryStream ms = new MemoryStream()) {
    int origBytes = 0;
    // leaveOpen: true, so ms is still usable after the zip stream is closed
    using (GZipStream zip = new GZipStream(ms, CompressionMode.Compress, true))
    using (FileStream file = File.OpenRead("my.xml")) {
        byte[] buffer = new byte[2048];
        int bytes;
        while ((bytes = file.Read(buffer, 0, buffer.Length)) > 0) {
            zip.Write(buffer, 0, bytes);
            origBytes += bytes;
        }
    }
    // the zip stream has been closed here, so the gzip data is complete
    byte[] blob = ms.ToArray();
    string asBase64 = Convert.ToBase64String(blob);
    Console.WriteLine("Original: " + origBytes);
    Console.WriteLine("Raw: " + blob.Length);
    Console.WriteLine("Base64: " + asBase64.Length);
}
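At the other end you would just reverse the process; a minimal sketch (assuming the byte[] arrives intact; the method name is illustrative):
static string DecompressToXml(byte[] blob) {
    // gunzip the posted blob back into the original xml text
    using (MemoryStream ms = new MemoryStream(blob))
    using (GZipStream zip = new GZipStream(ms, CompressionMode.Decompress))
    using (StreamReader reader = new StreamReader(zip)) {
        return reader.ReadToEnd();
    }
}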
Alternatively, consider a different serialization format; there are dense binary protocols that are much smaller (and, as a consequence, don't benefit much from gzip etc.). For example, serializing via protobuf-net would give you a very efficient size. But this only applies to an object model, not to arbitrary xml data.
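For illustration, a rough sketch of the protobuf-net route (the Reading type and its members are made up for the example):
[ProtoContract]
public class Reading {
    [ProtoMember(1)] public int Id { get; set; }
    [ProtoMember(2)] public string Value { get; set; }
}
// ...
using (MemoryStream ms = new MemoryStream()) {
    // protobuf-net writes a compact binary encoding of the object
    ProtoBuf.Serializer.Serialize(ms, new Reading { Id = 1, Value = "abc" });
    byte[] payload = ms.ToArray(); // typically far smaller than the equivalent xml
}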

There are some good articles about that here:
http://www.ibm.com/developerworks/xml/library/x-tipcomp.html
http://www.xml.com/pub/a/2000/03/22/deviant/

The best way to handle this scenario would be to have your web service accept a byte[] parameter which will represent the compressed XML; the Base64 encoding will be done automatically. To improve the effective compression ratio you could use MTOM encoding. This avoids the Base64 step, which consists of converting your byte array to a string in order to send it over the wire, and which loses some of the gain from compression.
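For example, the service side might look something like this (an ASMX-style sketch; the method and parameter names are hypothetical):
[WebMethod]
public void UploadCompressedXml(byte[] compressedXml) {
    // undo the gzip compression the client applied before posting
    using (MemoryStream ms = new MemoryStream(compressedXml))
    using (GZipStream zip = new GZipStream(ms, CompressionMode.Decompress))
    using (XmlReader reader = XmlReader.Create(zip)) {
        // process the xml here
    }
}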

You have the following to choose from:
BinaryFormatter
ArrayList itemsToSerialize = new ArrayList();
itemsToSerialize.Add("john");
itemsToSerialize.Add("smith");
Stream stream = new FileStream(@"MyApplicationData.dat", System.IO.FileMode.Create);
IFormatter formatter = new BinaryFormatter();
formatter.Serialize(stream, itemsToSerialize);
stream.Close();
You could use WCF netTcpBinding
You could use *HttpBinding for a WCF service hosted on IIS and follow this blog, which walks you through setting up WCF gzip compression through IIS
You could gzip the request and response and then decompress them, e.g.:
new StreamReader(new GZipStream(webResponse.GetResponseStream(), CompressionMode.Decompress));
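A minimal sketch of that last option with HttpWebRequest (the URL is a placeholder, and the server must be expecting gzipped bodies):
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://example.com/service");
request.Method = "POST";
request.Headers["Content-Encoding"] = "gzip"; // tell the server the body is gzipped
using (GZipStream zip = new GZipStream(request.GetRequestStream(), CompressionMode.Compress))
{
    byte[] xmlBytes = File.ReadAllBytes("my.xml");
    zip.Write(xmlBytes, 0, xmlBytes.Length);
}
using (WebResponse webResponse = request.GetResponse())
using (StreamReader reader = new StreamReader(
    new GZipStream(webResponse.GetResponseStream(), CompressionMode.Decompress)))
{
    string body = reader.ReadToEnd(); // assumes the server gzips its response
}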

Related

C# increase the size of reading binary data

I am using the code below from Jon Skeet's article. Of late, the binary data that needs to be processed has grown multi-fold. The file I am trying to import is ~900 MB, almost 1 GB. How do I increase the memory stream size?
public static byte[] ReadFully(Stream stream)
{
    byte[] buffer = new byte[32768];
    using (MemoryStream ms = new MemoryStream())
    {
        while (true)
        {
            int read = stream.Read(buffer, 0, buffer.Length);
            if (read <= 0)
                return ms.ToArray();
            ms.Write(buffer, 0, read);
        }
    }
}
Your method returns a byte array, which means it will return all of the data in the file. Your entire file will be loaded into memory.
If that is what you want to do, then simply use the built in File methods:
byte[] bytes = System.IO.File.ReadAllBytes(path);
string text = System.IO.File.ReadAllText(path);
If you don't want to load the entire file into memory, take advantage of your Stream:
using (var fs = new FileStream("path", FileMode.Open))
using (var reader = new StreamReader(fs))
{
    var line = reader.ReadLine();
    // do stuff with 'line' here, or use one of the other
    // StreamReader methods.
}
You don't have to increase the size of MemoryStream - by default it expands to fit the contents.
Apparently there can be problems with memory fragmentation, but you can pre-allocate memory to avoid them:
using (MemoryStream ms = new MemoryStream(1024 * 1024 * 1024)) // initial capacity 1GB
{
}
In my opinion 1GB should be no big deal these days, but it's probably better to process the data in chunks if possible. That is what Streams are designed for.
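For example, a sketch of chunk-by-chunk processing that never holds more than one buffer in memory (the file name and buffer size are arbitrary):
using (FileStream input = File.OpenRead("huge.dat"))
{
    byte[] buffer = new byte[81920]; // ~80 KB per chunk
    int read;
    while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        // process buffer[0..read) here instead of accumulating everything
    }
}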

Send/Receive GZip compressed MSMQ messages in C#

I am trying to send large objects (>30MB) to an MSMQ queue. Due to the large amount of data we are trying to send, the idea was to GZip the objects prior to sending them, then unzip them on the receiving end.
However, writing the compressed stream to the message.BodyStream property seems to work, while reading it back out from there does not.
I don't know what's wrong.
Message l_QueueMessage = new Message();
l_QueueMessage.Priority = priority;
using (MessageQueue l_Queue = CreateQueue())
{
    GZipStream stream = new GZipStream(l_QueueMessage.BodyStream, CompressionMode.Compress);
    Formatter.Serialize(stream, message);
    l_Queue.Send(l_QueueMessage);
}
The Formatter is a global property of type BinaryFormatter. This is used to serialize/deserialize to the type of object we want to send/receive, e.g. "ProductItem".
The receiving end looks like this:
GZipStream stream = new GZipStream(l_Message.BodyStream, CompressionMode.Decompress);
object decompressedObject = Formatter.Deserialize(stream);
ProductItem l_Item = decompressedObject as ProductItem;
m_ProductReceived(sender, new MessageReceivedEventArgs<ProductItem>(l_Item));
l_ProductQueue.BeginReceive();
Trying to deserialize, I get an EndOfStreamException: "Unable to read beyond the end of the stream."
at System.IO.BinaryReader.ReadByte()
By using the message.BodyStream property I actually circumvent the message.Formatter, which I don't initialize to anything, because I'm using my own serialization/deserialization mechanism with the GZipStream. However, I am not sure if that's the correct way of doing this.
What am I missing?
Thanks!
In your original code, the problem is that you need to close the GZipStream in order for the GZip footer to be written correctly, and only then can you send the message. If you don't, you end up sending bytes that cannot be deserialized. That's also why your new code, where the sending is done later, works.
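A minimal illustration of the point (the payload is made up): the bytes in the underlying stream only form a complete gzip stream once the GZipStream has been disposed.
byte[] payload = Encoding.UTF8.GetBytes("example payload");
using (MemoryStream ms = new MemoryStream())
{
    using (GZipStream zip = new GZipStream(ms, CompressionMode.Compress, true)) // leaveOpen
    {
        zip.Write(payload, 0, payload.Length);
    } // Dispose() flushes the final block and writes the GZip footer
    byte[] complete = ms.ToArray(); // only now is this valid gzip data
}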
OK, I made this work. The key was to convert the decompressed stream on the receiver to a byte[] array. Then the deserialization started working.
The sender code (notice the stream is closed before sending the message):
using (MessageQueue l_Queue = CreateQueue())
{
    using (GZipStream stream = new GZipStream(l_QueueMessage.BodyStream, CompressionMode.Compress, true))
    {
        Formatter.Serialize(stream, message);
    }
    l_Queue.Send(l_QueueMessage);
}
The receiving end (notice how I convert the stream to a byte[] then deserialize):
using (GZipStream stream = new GZipStream(l_QueueMessage.BodyStream, CompressionMode.Decompress))
{
    byte[] bytes = ReadFully(stream);
    using (MemoryStream ms = new MemoryStream(bytes))
    {
        decompressedObject = Formatter.Deserialize(ms);
    }
}
Still, I don't know why this works using the ReadFully() function and not Stream.CopyTo(). Does anyone?
Btw, ReadFully() is a function that creates a byte[] out of a Stream. I have to credit Jon Skeet for it: http://www.yoda.arachsys.com/csharp/readbinary.html. Thanks!
Try to separate compressing and sending:
byte[] binaryBuffer = null;
using (MemoryStream compressedBody = new MemoryStream())
{
    using (GZipStream stream = new GZipStream(compressedBody, CompressionMode.Compress, true))
    {
        Formatter.Serialize(stream, message);
    }
    // ToArray() is only meaningful once the GZipStream has been closed
    // (footer written); it also returns just the bytes actually used,
    // unlike GetBuffer(), which returns the whole underlying buffer.
    binaryBuffer = compressedBody.ToArray();
}
using (MessageQueue l_Queue = CreateQueue())
{
    l_QueueMessage.BodyStream.Write(binaryBuffer, 0, binaryBuffer.Length);
    l_QueueMessage.BodyStream.Seek(0, SeekOrigin.Begin);
    l_Queue.Send(l_QueueMessage);
}

C# HttpListener Response + GZipStream

I use HttpListener for my own HTTP server (I do not use IIS). I want to compress my OutputStream with GZip compression:
byte[] refBuffer = Encoding.UTF8.GetBytes(...some data source...);
var varByteStream = new MemoryStream(refBuffer);
System.IO.Compression.GZipStream refGZipStream = new GZipStream(varByteStream, CompressionMode.Compress, false);
refGZipStream.BaseStream.CopyTo(refHttpListenerContext.Response.OutputStream);
refHttpListenerContext.Response.AddHeader("Content-Encoding", "gzip");
But I am getting an error in Chrome:
ERR_CONTENT_DECODING_FAILED
If I remove the AddHeader call, then it works, but the size of the response does not seem to be compressed. What am I doing wrong?
The problem is that your transfer is going in the wrong direction. What you want to do is attach the GZipStream to the Response.OutputStream and then call CopyTo on the MemoryStream, passing in the GZipStream, like so:
refHttpListenerContext.Response.AddHeader("Content-Encoding", "gzip");
byte[] refBuffer = Encoding.UTF8.GetBytes(...some data source...);
var varByteStream = new MemoryStream(refBuffer);
System.IO.Compression.GZipStream refGZipStream = new GZipStream(refHttpListenerContext.Response.OutputStream, CompressionMode.Compress, false);
varByteStream.CopyTo(refGZipStream);
refGZipStream.Flush();
The first problem (as mentioned by Brent M Spell) is the wrong position of the header. The second is that you are not using the GZipStream properly: GZipStream wraps the stream the compressed bytes get written to, so you write your uncompressed buffer into the GZipStream, and the underlying (initially empty) MemoryStream ends up holding the compressed content. So you need something like:
byte[] buffer = ....;
using (var ms = new MemoryStream())
{
    using (var zip = new GZipStream(ms, CompressionMode.Compress, true))
        zip.Write(buffer, 0, buffer.Length);
    buffer = ms.ToArray();
}
response.AddHeader("Content-Encoding", "gzip");
response.ContentLength64 = buffer.Length;
response.OutputStream.Write(buffer, 0, buffer.Length);
Hopefully this might help; this related question discusses how to get GZIP working:
Sockets in C#: How to get the response stream?

Raw Stream Has Data, Deflate Returns Zero Bytes

I'm reading data (an adCenter report, as it happens), which is supposed to be zipped. Reading the contents with an ordinary stream, I get a couple thousand bytes of gibberish, so this seems reasonable. So I feed the stream to DeflateStream.
First, it reports "Block length does not match with its complement." A brief search suggests that there is a two-byte prefix, and indeed if I call ReadByte() twice before opening DeflateStream, the exception goes away.
However, DeflateStream now returns nothing at all. I've spent most of the afternoon chasing leads on this, with no luck. Help me, StackOverflow, you're my only hope! Can anyone tell me what I'm missing?
Here's the code. Naturally I only enabled one of the two commented blocks at a time when testing.
_results = new List<string[]>();
using (Stream compressed = response.GetResponseStream())
{
    // Skip the zlib prefix, which conflicts with the deflate specification
    compressed.ReadByte(); compressed.ReadByte();
    // Reports reading 3,000-odd bytes, followed by random characters
    /*byte[] buffer = new byte[4096];
    int bytesRead = compressed.Read(buffer, 0, 4096);
    Console.WriteLine("Read {0} bytes.", bytesRead.ToString("#,##0"));
    string content = Encoding.ASCII.GetString(buffer, 0, bytesRead);
    Console.WriteLine(content);*/
    using (DeflateStream decompressed = new DeflateStream(compressed, CompressionMode.Decompress))
    {
        // Reports reading 0 bytes, and no output
        /*byte[] buffer = new byte[4096];
        int bytesRead = decompressed.Read(buffer, 0, 4096);
        Console.WriteLine("Read {0} bytes.", bytesRead.ToString("#,##0"));
        string content = Encoding.ASCII.GetString(buffer, 0, bytesRead);
        Console.WriteLine(content);*/
        using (StreamReader reader = new StreamReader(decompressed))
            while (reader.EndOfStream == false)
                _results.Add(reader.ReadLine().Split('\t'));
    }
}
As you can probably guess from the last line, the unzipped content should be TDT (tab-delimited text).
Just for fun, I tried decompressing with GZipStream, but it reports that the magic number is not correct. MS's docs just say "The downloaded report is compressed by using zip compression. You must unzip the report before you can use its contents."
Here's the code that finally worked. I had to save the content out to a file and read it back in. This does not seem reasonable, but for the small quantities of data I'm working with, it's acceptable; I'll take it!
WebRequest request = HttpWebRequest.Create(reportURL);
WebResponse response = request.GetResponse();
_results = new List<string[]>();
using (Stream compressed = response.GetResponseStream())
{
    // Save the content to a temporary location
    string zipFilePath = @"\\Server\Folder\adCenter\Temp.zip";
    using (StreamWriter file = new StreamWriter(zipFilePath))
    {
        compressed.CopyTo(file.BaseStream);
        file.Flush();
    }
    // Get the first file from the temporary zip
    ZipFile zipFile = ZipFile.Read(zipFilePath);
    if (zipFile.Entries.Count > 1) throw new ApplicationException("Found " + zipFile.Entries.Count.ToString("#,##0") + " entries in the report; expected 1.");
    ZipEntry report = zipFile[0];
    // Extract the data
    using (MemoryStream decompressed = new MemoryStream())
    {
        report.Extract(decompressed);
        decompressed.Position = 0; // Note that the stream does NOT start at the beginning
        using (StreamReader reader = new StreamReader(decompressed))
            while (reader.EndOfStream == false)
                _results.Add(reader.ReadLine().Split('\t'));
    }
}
You will find that DeflateStream is hugely limited in what data it will decompress. In fact, if you are expecting entire files it will be of no use at all.
There are hundreds of (mostly small) variations of ZIP files, and DeflateStream will get along with only two or three of them.
The best approach is likely to use a dedicated library for reading Zip files/streams, such as DotNetZip or SharpZipLib (somewhat unmaintained).
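On .NET 4.5 or later, the framework's built-in ZipArchive can also read a zip from a stream; a minimal sketch (zipStream is assumed to be a readable, seekable stream holding the zip):
// requires a reference to the System.IO.Compression assembly on .NET 4.5+
using (ZipArchive archive = new ZipArchive(zipStream, ZipArchiveMode.Read))
using (Stream entryStream = archive.Entries[0].Open()) // first file in the zip
using (StreamReader reader = new StreamReader(entryStream))
{
    string firstLine = reader.ReadLine();
}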
You could write the stream to a file and try my tool Precomp on it. If you use it like this:
precomp -c- -v [name of input file]
any ZIP/gZip stream(s) inside the file will be detected and some verbose information will be reported (position and length of the stream). Additionally, if they can be decompressed and recompressed bit-to-bit identical, the output file will contain the decompressed stream(s).
Precomp detects ZIP/gZip (and some other) streams anywhere in the file, so you won't have to worry about header bytes or garbage at the beginning of the file.
If it doesn't detect a stream like this, try to add -slow, which detects deflate streams even if they don't have a ZIP/gZip header. If this fails, you can try -brute which even detects deflate streams that lack the two byte header, but this will be extremely slow and can cause false positives.
After that, you'll know if there is a (valid) deflate stream in the file and if so, the additional information should help you to decompress other reports correctly using zLib decompression routines or similar.

XmlMtomReader reading strategy

Consider the following code:
Stream stream = GetStreamFromSomewhere();
XmlDictionaryReader mtomReader = XmlDictionaryReader.CreateMtomReader
(
    stream,
    Encoding.UTF8,
    XmlDictionaryReaderQuotas.Max
);
/// ...
/// is there a better way to read binary data from mtomReader's element?
string elementString = mtomReader.ReadElementString();
byte[] elementBytes = Convert.FromBase64String(elementString);
Stream elementFileStream = new FileStream(tempFileLocation, FileMode.Create);
elementFileStream.Write(elementBytes, 0, elementBytes.Length);
elementFileStream.Close();
/// ...
mtomReader.Close();
The problem is that the size of the binary attachment can be over 100 MB. Is there a way to read the element's binary attachment block by block and then write it to the temporary file stream, so I can avoid allocating memory for the whole thing?
The second, even more specific, issue: does mtomReader create any internal cache of the MIME binary attachment before I read the element's content, i.e. allocate memory for the binary data? Or does it read bytes from the input stream directly?
For those who may be interested in the solution:
using (Stream stream = GetStreamFromSomewhere())
{
    using (
        XmlDictionaryReader mtomReader = XmlDictionaryReader.CreateMtomReader(
            stream, Encoding.UTF8, XmlDictionaryReaderQuotas.Max))
    {
        byte[] buffer = new byte[1024];
        using (
            Stream elementFileStream =
                new FileStream(tempFileLocation, FileMode.Create))
        {
            int bytesRead;
            // read the base64 element content in chunks, so the whole
            // attachment is never held in memory at once
            while ((bytesRead = mtomReader.ReadElementContentAsBase64(
                       buffer, 0, buffer.Length)) > 0)
            {
                elementFileStream.Write(buffer, 0, bytesRead);
            }
        }
        /// ...
        mtomReader.Close();
    }
}
ReadElementContentAsBase64(...) helps read binary parts block by block. The second issue of my post was covered perfectly here: Does XmlMtomReader cache binary data from the input stream internally?
For an attachment of that size it would be better to use streaming.
Streamed transfers can improve the scalability of a service by eliminating the requirement for large memory buffers. Whether changing the transfer mode improves scalability depends on the size of the messages being transferred. Large message sizes favor using streamed transfers.
See: http://msdn.microsoft.com/en-us/library/ms731913.aspx
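In code, switching a WCF binding to streamed transfers looks something like this (a sketch; the size limit shown is arbitrary):
BasicHttpBinding binding = new BasicHttpBinding
{
    TransferMode = TransferMode.Streamed,         // stream instead of buffering whole messages
    MaxReceivedMessageSize = 200L * 1024 * 1024   // raise the 64 KB default for large attachments
};
The same can be done in configuration via the transferMode attribute on the binding element.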
To begin with, your code should be more like this:
using (Stream stream = GetStreamFromSomewhere())
{
    using (
        XmlDictionaryReader mtomReader = XmlDictionaryReader.CreateMtomReader(
            stream, Encoding.UTF8, XmlDictionaryReaderQuotas.Max))
    {
        string elementString = mtomReader.ReadElementString();
        byte[] elementBytes = Convert.FromBase64String(elementString);
        using (
            Stream elementFileStream =
                new FileStream(tempFileLocation, FileMode.Create))
        {
            elementFileStream.Write(
                elementBytes, 0, elementBytes.Length);
        }
        /// ...
        mtomReader.Close();
    }
}
Without the using blocks, you're at risk of resource leaks.
