Stream chaining in computing a checksum: avoiding memory issues - c#

I have a FileStream connected to an XML file that I would like to read directly into a SHA512 object in order to compute a hash for the purposes of a checksum (not a security use).
The issue is twofold:
I want to omit some of the nodes in the XML,
the file is quite large, and I would rather not load the whole thing into memory
I can read the whole file into an XML structure, delete the nodes, then write it to a stream that would then be plugged into SHA512.ComputeHash, but that will cause a performance loss. I would prefer to somehow perform the deletion of the nodes as an operation on a stream, and then chain the streams together into a single stream that can be passed into SHA512.ComputeHash(Stream).
How can I accomplish this?

using (var hash = new SHA512Cng())
using (var stream = new CryptoStream(Stream.Null, hash, CryptoStreamMode.Write)) // discard the bytes, but feed them to the hash
using (var writer = XmlWriter.Create(stream))
using (var reader = XmlReader.Create("input.xml"))
{
    while (reader.Read())
    {
        // ... write node to writer ...
    }
    writer.Flush();           // push any buffered XML into the CryptoStream
    stream.FlushFinalBlock(); // finalize the hash computation
    var result = hash.Hash;
}
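For the node-skipping part, the copy loop can filter as it goes. Below is a minimal sketch under the assumption that the elements to omit can be identified by local name; the HashXmlExcluding method and the excludedElements set are illustrative, not part of the original answer:

using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Xml;

static byte[] HashXmlExcluding(string path, ISet<string> excludedElements)
{
    using (var hash = SHA512.Create())
    using (var crypto = new CryptoStream(Stream.Null, hash, CryptoStreamMode.Write))
    using (var writer = XmlWriter.Create(crypto))
    using (var reader = XmlReader.Create(path))
    {
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element &&
                excludedElements.Contains(reader.LocalName))
            {
                reader.Skip(); // jumps past this element and its entire subtree
                continue;
            }
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                    writer.WriteAttributes(reader, true);
                    reader.MoveToElement(); // WriteAttributes may leave the reader on an attribute
                    if (reader.IsEmptyElement)
                        writer.WriteEndElement();
                    break;
                case XmlNodeType.EndElement:
                    writer.WriteFullEndElement();
                    break;
                case XmlNodeType.Text:
                    writer.WriteString(reader.Value);
                    break;
                case XmlNodeType.CDATA:
                    writer.WriteCData(reader.Value);
                    break;
                case XmlNodeType.Comment:
                    writer.WriteComment(reader.Value);
                    break;
                case XmlNodeType.Whitespace:
                case XmlNodeType.SignificantWhitespace:
                    writer.WriteWhitespace(reader.Value);
                    break;
            }
            reader.Read();
        }
        writer.Flush();
        crypto.FlushFinalBlock();
        return hash.Hash;
    }
}

Since nothing is buffered beyond the XmlReader's and XmlWriter's own small internal buffers, memory use stays flat regardless of the file size.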

Related

Update a file in a ZipArchive

I have a ZipArchive object which contains an XML file that I am modifying. I then want to return the modified ZipArchive.
Here's the code I have:
var package = File.ReadAllBytes(/* location of existing .zip */);
using (var packageStream = new MemoryStream(package, true))
using (var zipPackage = new ZipArchive(packageStream, ZipArchiveMode.Update))
{
    // obtain the specific entry
    var myEntry = zipPackage.Entries.FirstOrDefault(entry => /* code elided */);
    XElement xContents;
    using (var reader = new StreamReader(myEntry.Open()))
    {
        // read the contents of the myEntry XML file
        // then modify the contents into xContents
    }
    using (var writer = new StreamWriter(myEntry.Open()))
    {
        writer.Write(xContents.ToString());
    }
    return packageStream.ToArray();
}
This code throws a "Memory stream is not expandable" exception on the packageStream.ToArray() call.
Can anyone explain what I've done wrongly, and what is the correct way of updating an existing file inside a ZipArchive?
Clearly, ZipArchive wants to expand or resize the ZIP archive stream. However, you have provided a MemoryStream with a fixed length: the MemoryStream(byte[], bool) constructor creates a non-resizable stream whose capacity is exactly the length of the array passed to it.
Since ZipArchive wants to expand (or resize) the stream, provide a resizable MemoryStream (created with the parameterless constructor). Then copy the original file data into this MemoryStream and proceed with the ZIP archive manipulations.
And don't forget to reset the MemoryStream read/write position back to 0 after copying the original file data into it; otherwise ZipArchive will only see "End of Stream" when trying to read the ZIP archive data from the stream.
using (var packageStream = new MemoryStream())
{
    using (var fs = File.OpenRead(/* location of existing .zip */))
    {
        fs.CopyTo(packageStream);
    }
    packageStream.Position = 0;
    using (var zipPackage = new ZipArchive(packageStream, ZipArchiveMode.Update))
    {
        // ... do your thing ...
    }
    return packageStream.ToArray();
}
This code contains one more correction. In the original code in the question, return packageStream.ToArray(); was placed within the using block of the ZipArchive. At the time that line executes, the ZipArchive instance might not yet have written all data to the MemoryStream: it may still be holding data in internal buffers and/or have deferred writing some ZIP data structures.
To ensure that the ZipArchive has actually written all necessary data to the MemoryStream, it is sufficient here to move return packageStream.ToArray(); outside, after the ZipArchive using block. At the end of its using block, the ZipArchive is disposed, which also ensures that any as-yet-unwritten data is flushed to the stream. Thus, accessing the MemoryStream after the ZipArchive has been disposed of will yield the complete data of the updated ZIP archive.
Side note: Do this only with small-ish ZIP files. The MemoryStream will obviously use internal data buffers (arrays) to hold the data in the MemoryStream. However, packageStream.ToArray(); will create a copy of the data in the MemoryStream, so for a period of time the memory requirements of this routine will be a little more than twice the size of the ZIP archive.

CopyToAsync vs ReadAsStreamAsync for huge request payload

I have to compute a hash for a huge payload, so I am using streams to avoid loading all of the request content into memory. The question is: what are the differences between this code:
using (var md5 = MD5.Create())
using (var stream = await authenticatableRequest.request.Content.ReadAsStreamAsync())
{
    return md5.ComputeHash(stream);
}
And that one:
using (var md5 = MD5.Create())
using (var stream = new MemoryStream())
{
    await authenticatableRequest.request.Content.CopyToAsync(stream);
    stream.Position = 0;
    return md5.ComputeHash(stream);
}
I expect the same behavior internally, but maybe I am missing something.
The first version looks OK: let the hasher handle the stream reading; it was designed for that.
ComputeHash(stream) reads blocks in a while loop and calls TransformBlock() repeatedly.
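In essence it behaves like this simplified sketch (not the actual framework source):

byte[] buffer = new byte[4096];
int bytesRead;
// read a block, feed it to the hash, repeat until the stream is drained
while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
{
    md5.TransformBlock(buffer, 0, bytesRead, null, 0);
}
md5.TransformFinalBlock(buffer, 0, 0); // finalize with an empty block
byte[] result = md5.Hash;

Only one block-sized buffer is alive at a time, which is why the payload size doesn't matter.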
But the second piece of code will load everything into memory, so don't do that:
using (var stream = new MemoryStream())
{
    await authenticatableRequest.request.Content.CopyToAsync(stream);
The second snippet will not only load everything into memory, it will use more memory than HttpContent.ReadAsByteArrayAsync().
A MemoryStream is a Stream API over a byte[] buffer whose initial size is zero. As data gets written into it, the buffer has to be reallocated, each time to twice its previous size. This can create a lot of temporary buffer objects, and the final buffer's capacity will typically exceed the actual content size.
This can be avoided by allocating the maximum expected buffer size up front, via the capacity parameter of the MemoryStream constructor.
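For example (a sketch; ContentLength is nullable and may be absent if the server didn't send the header):

// Pre-size the MemoryStream from Content-Length when available, so the
// internal buffer is allocated once instead of growing by doubling.
long? contentLength = authenticatableRequest.request.Content.Headers.ContentLength;
using (var md5 = MD5.Create())
using (var stream = new MemoryStream(contentLength.HasValue ? (int)contentLength.Value : 0))
{
    await authenticatableRequest.request.Content.CopyToAsync(stream);
    stream.Position = 0;
    return md5.ComputeHash(stream);
}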
At best, this will be similar to calling:

var bytes = await authenticatableRequest.request.Content.ReadAsByteArrayAsync();
return md5.ComputeHash(bytes);
I expect the same behavior internally,
Why? In one case you must load everything into memory (because, well, you created a MemoryStream). In the other case, not necessarily.

How to read all bytes of a stream but the last 8

I have the following code:
using (var fs = new FileStream(@"C:\dump.bin", FileMode.Create))
{
    income.CopyTo(fs);
}
income is a stream that I need to save to disk; the problem is that I want to ignore the last 8 bytes and save everything before them. The income stream is read-only and forward-only, so I cannot predict its size, and I don't want to load the whole stream into memory because huge files are being sent.
Any help will be appreciated.
Maybe (or rather probably) there is a cleaner way of doing it, but being pragmatic, the first thought that comes to mind is this:

using (var fs = new FileStream(@"C:\dump.bin", FileMode.Create))
{
    income.CopyTo(fs);
    fs.SetLength(Math.Max(fs.Length - 8, 0)); // truncate the trailing 8 bytes
}

This sets the file length after the stream has been written. Note that the truncation has to be based on fs.Length rather than income.Length, since a forward-only stream doesn't report its length; once the copy has finished, the file stream knows exactly how many bytes were written.
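If writing the extra 8 bytes and truncating afterwards is not acceptable (for example, when the destination is not seekable), a streaming alternative is to lag the writes behind the reads so the last 8 bytes are never written at all. A sketch, where the CopyAllButLast name and the buffer size are illustrative:

using System;
using System.IO;

static void CopyAllButLast(Stream source, Stream destination, int holdBack)
{
    byte[] buffer = new byte[81920];
    byte[] pending = new byte[holdBack]; // the most recent bytes, not yet written
    int pendingCount = 0;

    int read;
    while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
    {
        int total = pendingCount + read;
        int writable = total - holdBack; // bytes guaranteed not to be in the last holdBack
        if (writable > 0)
        {
            // write the oldest pending bytes first, then the safe part of the new chunk
            int fromPending = Math.Min(pendingCount, writable);
            destination.Write(pending, 0, fromPending);
            destination.Write(buffer, 0, writable - fromPending);

            // keep the newest holdBack bytes as the new pending tail
            Buffer.BlockCopy(pending, fromPending, pending, 0, pendingCount - fromPending);
            pendingCount -= fromPending;
            int fromBuffer = read - (writable - fromPending);
            Buffer.BlockCopy(buffer, writable - fromPending, pending, pendingCount, fromBuffer);
            pendingCount += fromBuffer;
        }
        else
        {
            // everything seen so far still fits inside the hold-back window
            Buffer.BlockCopy(buffer, 0, pending, pendingCount, read);
            pendingCount += read;
        }
    }
    // the final pendingCount bytes (at most holdBack) are intentionally dropped
}

Called as CopyAllButLast(income, fs, 8), this copies everything except the trailing 8 bytes in a single forward pass with constant memory.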

Difference Between StreamReader(string filepath) and StreamReader(Stream _stream)

I am a little confused between two different constructors of the StreamReader class, i.e.
1. StreamReader(Stream)
I know it takes a stream as input, but the respective output is the same.
Here is my code using the StreamReader(Stream) constructor:
string filepath = @"C:\Users\Suchit\Desktop\p022_names.txt";
using (FileStream fs = new FileStream(filepath, FileMode.Open, FileAccess.Read))
{
    using (StreamReader sw = new StreamReader(fs))
    {
        while (!sw.EndOfStream)
        {
            Console.WriteLine(sw.ReadLine());
        }
    }
}
2. StreamReader(String)
This constructor takes the physical file path where the respective file exists, but the output is again the same.
Here is my code using StreamReader(String):
string filepath = @"C:\Users\Suchit\Desktop\p022_names.txt";
using (StreamReader sw = new StreamReader(filepath))
{
    while (!sw.EndOfStream)
    {
        Console.WriteLine(sw.ReadLine());
    }
}
So, which one is better? When and where should we use each one, so that our code becomes more optimized and readable?
The StreamReader class (as well as StreamWriter) is just a wrapper around a FileStream; it needs a FileStream to read from or write to a file.
So basically you have two options (constructor overloads):
Create the FileStream explicitly yourself and wrap the StreamReader around it
Let the StreamReader create the FileStream for you
Consider this scenario:
using (FileStream fs = File.Open(@"C:\Temp\1.pb", FileMode.OpenOrCreate, FileAccess.ReadWrite))
{
    using (StreamReader reader = new StreamReader(fs))
    {
        // ... read something
        reader.ReadLine();
        using (StreamWriter writer = new StreamWriter(fs))
        {
            // ... write something
            writer.WriteLine("hello");
        }
    }
}
Both reader and writer work with the same FileStream. Now if we change it to:
using (StreamReader reader = new StreamReader(@"C:\Temp\1.pb"))
{
    // ... read something
    reader.ReadLine();
    using (StreamWriter writer = new StreamWriter(@"C:\Temp\1.pb"))
    {
        // ... write something
        writer.WriteLine("hello");
    }
}
a System.IO.IOException is thrown: "The process cannot access the file C:\Temp\1.pb because it is being used by another process." This is because we try to open the file with a second FileStream while the first one is still in use. So, generally speaking, if you want to open a file, perform one read/write operation, and close it, the StreamReader(string) overload is fine. If you would like to use the same FileStream for multiple operations, or for any other reason want more control over the FileStream, instantiate it first and pass it to StreamReader(Stream).
Which one is better?
Neither; both are the same. As the name suggests, StreamReader is used to work with streams. When you create an instance of StreamReader with a path, it creates the FileStream internally.
When and where should we use each one?
When you have the Stream upfront, use the overload which takes a Stream; otherwise use the path overload.
One advantage of using the Stream overload is that you can configure the FileStream as you want. For example, if you're going to work with asynchronous methods, you need to open the file in asynchronous mode; if you don't, the operations will not be truly asynchronous.
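For instance (a sketch inside an async method; the path is illustrative), opening the file with useAsync: true is what makes the subsequent awaited reads genuinely asynchronous:

// Open the file in asynchronous mode, then wrap it in a StreamReader.
using (var fs = new FileStream(@"C:\Temp\1.pb", FileMode.Open, FileAccess.Read,
                               FileShare.Read, bufferSize: 4096, useAsync: true))
using (var reader = new StreamReader(fs))
{
    string contents = await reader.ReadToEndAsync();
}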
When in doubt, don't hesitate to check the source yourself.
Note that the Stream overload doesn't take a FileStream. This allows you to read data from any sub class of Stream, which allows you to do things like read the result of a web request, read unzipped data, or read decrypted data.
Use the string path overload if you only want to read from a file and you don't need to use the FileStream for anything else. It just saves you from writing a line of code:
using (var stream = File.OpenRead(path))
using (var reader = new StreamReader(stream))
{
...
}
File.OpenText also does the same thing.
Both are the same, just overloads; use whichever suits your need. If you have a local file, you can use StreamReader(string path); otherwise, if you just have a stream from the network or some other source, the other overload helps you, i.e. StreamReader(Stream stream).
Well, after searching the new open-source reference source, you can see that the latter internally expands to the former: passing a raw file path into the StreamReader makes it internally create a FileStream. To me this means both are equivalent, and you can use whichever you prefer.
My personal opinion is to use the latter, because it's less code to write and more explicit. I don't like the way Java does it, with its thousand ByteReaders, StreamReaders, OutputReaderReaders and so on...
Basically both work the same; by default they use UTF-8 encoding and a 1024-byte buffer.
But the StreamReader object calls Dispose() on the provided Stream object when StreamReader.Dispose is called.
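If that is not what you want, there is an overload with a leaveOpen flag (available since .NET 4.5). A small sketch, assuming fs is an already-open stream you want to keep using afterwards:

// Keep the underlying stream open after the reader is disposed.
using (var reader = new StreamReader(fs, Encoding.UTF8,
        detectEncodingFromByteOrderMarks: true, bufferSize: 1024, leaveOpen: true))
{
    Console.WriteLine(reader.ReadLine());
}
// fs is still usable here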
You can refer to the documentation for the Stream and String overloads.
You can use either of them, depending on what you have in hand: a Stream or a string file path.
Hope this makes it clear.
StreamReader(string) is just an overload of StreamReader(Stream).
In the context of your question, you are probably better off using the StreamReader(string) overload, just because it means less code. StreamReader(Stream) might be minutely faster, but you have to create a FileStream from the string you could have passed straight to the StreamReader, so whatever benefit you gained is lost.
Basically, StreamReader(string) is for files with static or easily mapped paths (as appears to be the case for you), while StreamReader(Stream) can be thought of as a fallback for when you have programmatic access to a file whose path is difficult to pin down.

very large string in memory

I am writing a program for formatting hundreds of MB of string data (nearing a gig) into XML, and I am required to return it as the response to an HTTP GET request.
I am using a StringWriter/XmlWriter to build the XML of the records in a loop and returning the resulting string:

using (StringWriter writer = new StringWriter())
using (XmlWriter xmlWriter = XmlWriter.Create(writer, settings)) // settings are the XML props
{
    // ... write the records ...
    return writer.ToString();
}
During testing I saw a few OutOfMemoryExceptions and am quite clueless about how to find a solution. Do you have any suggestions for a memory-optimized delivery of the response?
Is there a memory-efficient way of encoding the data, or maybe chunking the data?
I just cannot think of how to return it without building the whole thing into one HUGE string object.
Thanks
--
a few clarifications --
this is an ASP.NET web services app over a gigabit Ethernet link, as josh noted. I am not very familiar with it, so there is still a bit of a learning curve.
I am using an XmlWriter to create the XML and a StringWriter to turn it into a string
some stats --
response XML size = about 385 MB (my data size will grow very quickly to way more than this)
string object size as measured by a memory profiler = peaked at 605 MB
and thanks to everyone who responded...
Use an XmlTextWriter wrapped around Response.OutputStream to send the XML to the client, and periodically flush the response. This way you never have to hold more than a few MB in memory at any one time (at least for sending to the client).
Can't you just stream the response to the client? XmlWriter doesn't require its underlying stream to be buffered in memory. If it's ASP.NET, you can use Response.OutputStream; if it's WCF, you can use response streaming.
An HTTP GET for 1 gig? That's a lot! Perhaps you should reconsider.
At the very least, gzipping the output could help.
You should not create XML using string manipulation.
Instead, you should use the XmlTextWriter, XmlDocument, or (in .NET 3.5) XElement classes to build an XML tree in memory, then write it directly to Response.OutputStream using an XmlTextWriter.
Writing directly to an XmlTextWriter that wraps Response.OutputStream will be most efficient (you'll never have an entire element tree in memory at once), but will be somewhat more complicated.
By doing it this way, you will never have a single string (or array) containing the entire object, and should thus avoid OutOfMemoryExceptions.
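A sketch of that approach in classic ASP.NET (System.Web), using XmlWriter.Create as the modern replacement for new XmlTextWriter(...); GetRecords() and the element names are illustrative placeholders:

// Stream the XML straight into the response, flushing periodically so that
// only a small window of output is buffered at any time.
Response.ContentType = "text/xml";
Response.BufferOutput = false;
using (var xml = XmlWriter.Create(Response.OutputStream))
{
    xml.WriteStartDocument();
    xml.WriteStartElement("records");
    int count = 0;
    foreach (var record in GetRecords()) // hypothetical data source
    {
        xml.WriteStartElement("record");
        xml.WriteElementString("id", record.Id.ToString()); // illustrative field
        xml.WriteEndElement();
        if (++count % 1000 == 0)
        {
            xml.Flush();      // push buffered XML into the response stream
            Response.Flush(); // send it to the client
        }
    }
    xml.WriteEndElement();
    xml.WriteEndDocument();
}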
Had a similar problem, hope this will help someone. My initial code was:
var serializer = new XmlSerializer(type);
string xmlString;
using (var writer = new StringWriter())
{
    serializer.Serialize(writer, objectData, sn); // OutOfMemoryException here
    xmlString = writer.ToString();
}
I ended up replacing StringWriter with MemoryStream, and this solved my problem:
using (var mem = new MemoryStream())
{
    serializer.Serialize(mem, objectData, sn);
    xmlString = Encoding.UTF8.GetString(mem.ToArray());
}
You'll have to return each record (or a small group of records) on their own individual GETs.
