CopyToAsync vs ReadAsStreamAsync for huge request payload - c#

I have to compute a hash for a huge payload, so I am using streams to avoid loading the entire request content into memory. The question is: what are the differences between this code:
using (var md5 = MD5.Create())
using (var stream = await authenticatableRequest.request.Content.ReadAsStreamAsync())
{
return md5.ComputeHash(stream);
}
And that one:
using (var md5 = MD5.Create())
using (var stream = new MemoryStream())
{
await authenticatableRequest.request.Content.CopyToAsync(stream);
stream.Position = 0;
return md5.ComputeHash(stream);
}
I expect the same behavior internally, but maybe I am missing something.

The first version looks OK: let the hasher handle the stream reading, it was designed for that.
ComputeHash(stream) will read the stream in blocks in a while loop and call TransformBlock() repeatedly.
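Conceptually, that loop looks something like this (an illustrative sketch of what HashAlgorithm.ComputeHash(Stream) does, not the framework's actual source):
var buffer = new byte[4096];
int bytesRead;
while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
{
    // feed each chunk into the hash; only the small buffer lives in memory
    md5.TransformBlock(buffer, 0, bytesRead, null, 0);
}
md5.TransformFinalBlock(Array.Empty<byte>(), 0, 0);
return md5.Hash;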
But the second piece of code will load everything into memory, so don't do that:
using (var stream = new MemoryStream())
{
await authenticatableRequest.request.Content.CopyToAsync(stream);

The second snippet will not only load everything into memory, it will use more memory than HttpContent.ReadAsByteArrayAsync().
A MemoryStream is a Stream API over a byte[] buffer whose initial size is zero. As data gets written into it, the buffer has to be replaced by one twice as large, with the existing data copied over. This can create a lot of temporary buffers whose combined size exceeds the final content.
This can be avoided by allocating the maximum expected buffer size up front, by passing the capacity parameter to the MemoryStream constructor.
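For example (a sketch; it assumes the request carries a Content-Length header and that the payload fits into a single byte[]):
var length = authenticatableRequest.request.Content.Headers.ContentLength ?? 0;
using (var md5 = MD5.Create())
using (var stream = new MemoryStream((int)length)) // pre-sized, so no reallocations
{
    await authenticatableRequest.request.Content.CopyToAsync(stream);
    stream.Position = 0;
    return md5.ComputeHash(stream);
}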
At best, this will be similar to calling:
var bytes = await authenticatableRequest.request.Content.ReadAsByteArrayAsync();
return md5.ComputeHash(bytes);

I expect the same behavior internally,
Why? In one case you must load everything into memory (because, after all, you create a MemoryStream). In the other case, not necessarily.

Related

Using memory stream is throwing out of memory exception

I have a requirement where I need to encrypt a file of size 1-2 GB in an Azure Function. I am using the PGP Core library to encrypt the file in memory. The code below throws an out of memory exception if the file size is above 700 MB. Note: I am using an Azure Function; scaling up the App Service plan didn't help.
Is there any alternative to MemoryStream that I can use? After encryption, I am uploading the file into blob storage.
var privateKeyEncoded = Encoding.UTF8.GetString(Convert.FromBase64String(_options.PGPKeys.PublicKey));
using Stream privateKeyStream = StringToStreamUtility.GenerateStreamFromString(privateKeyEncoded);
privateKeyStream.Position = 0;
var encryptionKeys = new EncryptionKeys(privateKeyStream);
var pgp = new PGP(encryptionKeys);
//encrypt stream
var encryptStream = new MemoryStream();
await pgp.EncryptStreamAsync(streamToEncrypt, encryptStream);
MemoryStream is a Stream wrapper over a byte[] buffer. Every time that buffer is full, a new one with double the size is allocated and the data is copied. This eventually uses double the final buffer size (4 GB for a 2 GB file) but, worse, it causes such memory fragmentation that eventually the memory allocator can't find a contiguous block large enough. That's when you get an OOM.
While you could avoid OOM errors by specifying a capacity in the constructor, storing 2GB in memory before even starting to write it is very wasteful. With a real FileStream the encrypted bytes would be written out as soon as they were available.
Azure Functions allow temporary storage. This means you can create a temporary file, open a stream on it and use it for encryption.
var tempPath = Path.GetTempFileName();
try
{
    using (var outputStream = File.Open(tempPath, FileMode.Create))
    {
        await pgp.EncryptStreamAsync(streamToEncrypt, outputStream);
        ...
    }
}
finally
{
    File.Delete(tempPath);
}
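Once the encrypted data is on disk (i.e. inside the try block, before the finally deletes the temporary file), it can be streamed to blob storage without buffering it in memory. A sketch, assuming the Azure.Storage.Blobs package and a blobClient (BlobClient) instance created elsewhere:
using (var uploadStream = File.OpenRead(tempPath))
{
    // UploadAsync reads the stream in chunks; the multi-GB payload never sits in memory
    await blobClient.UploadAsync(uploadStream, overwrite: true);
}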
MemoryStream uses a byte[] internally, and any byte[] is going to get a bit brittle as it gets around/above 1GiB (although in theory a byte[] can be nearly 2 GiB, in reality this isn't a good idea, and is rarely seen).
Frankly, MemoryStream simply isn't a good choice here; I'd probably suggest using a temporary file instead, with a FileStream. That doesn't attempt to keep everything in memory at once, and is more reliable at large sizes. Alternatively: avoid ever needing all the data at once, by performing the encryption in a pass-through, streaming way.

What is the simplest way to decompress a ZIP buffer in C#?

When I use zlib in C/C++, I have a simple method uncompress which only requires two buffers and nothing else. Its definition is like this:
int uncompress (Bytef *dest, uLongf *destLen, const Bytef *source,
uLong sourceLen);
/*
Decompresses the source buffer into the destination buffer. sourceLen is the byte length of the source buffer. Upon entry,
destLen is the total size of the destination buffer, which must be
large enough to hold the entire uncompressed data. (The size of
the uncompressed data must have been saved previously by the
compressor and transmitted to the decompressor by some mechanism
outside the scope of this compression library.) Upon exit, destLen
is the actual size of the uncompressed data.
uncompress returns Z_OK if success, Z_MEM_ERROR if there was not enough memory, Z_BUF_ERROR if there was not enough room in the output
buffer, or Z_DATA_ERROR if the input data was corrupted or incomplete.
In the case where there is not enough room, uncompress() will fill
the output buffer with the uncompressed data up to that point.
*/
I want to know if C# has a similar way. I checked SharpZipLib FAQ as follows but did not quite understand:
How do I compress/decompress files in memory?
Use a memory stream when creating the Zip stream!
MemoryStream outputMemStream = new MemoryStream();
using (ZipOutputStream zipOutput = new ZipOutputStream(outputMemStream)) {
// Use zipOutput stream as normal
...
You can get the resulting data with memory stream methods ToArray or GetBuffer.
ToArray is the cleanest and easiest to use correctly, with the penalty of duplicating the allocated memory. GetBuffer returns the raw buffer, so you need to account for the true length yourself.
See the framework class library help for more information.
I can't figure out whether this block of code is for compression or decompression, or whether outputMemStream is a compressed stream or an uncompressed stream. I really hope there is an easy-to-understand way like in zlib. Thank you very much if you can help me.
Check out the ZipArchive class, which I think has the features you need to accomplish in-memory decompression of ZIP files.
Assuming you have an array of bytes (byte[]) which represents the ZIP file in memory, you have to instantiate a ZipArchive object which will be used to read that array of bytes and interpret them as the ZIP file you wish to load. If you check the ZipArchive class' available constructors in the documentation, you will see that they require a stream object from which the data will be read. So the first step is to convert your byte[] array into a stream that can be read by the constructors, and you can do this by using a MemoryStream object.
Here's an example of how to list all entries inside of a ZIP archive represented in memory as a bytes array:
byte [] zipArchiveBytes = ...; // Read the ZIP file in memory as an array of bytes
using (var inputStream = new MemoryStream(zipArchiveBytes))
using (var zipArchive = new ZipArchive(inputStream, ZipArchiveMode.Read))
{
    Console.WriteLine("Listing archive entries...");
    foreach (var archiveEntry in zipArchive.Entries)
        Console.WriteLine($" {archiveEntry.FullName}");
}
Each file in the ZIP archive will be represented as a ZipArchiveEntry instance. This class offers properties which allow you to retrieve information such as the original length of a file from the ZIP archive, its compressed length, its name, etc.
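For instance, a small sketch using documented ZipArchiveEntry properties:
foreach (var entry in zipArchive.Entries)
    Console.WriteLine($"{entry.FullName}: {entry.Length} bytes uncompressed, {entry.CompressedLength} bytes compressed");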
In order to read a specific file which is contained inside the ZIP file, you can use ZipArchiveEntry.Open(). The following exemplifies how to open a specific file from an archive, if you have its FullName inside the ZIP archive:
ZipArchiveEntry archEntry = zipArchive.GetEntry("my-folder-inside-zip/dog-picture.jpg");
byte[] readResult;
using (Stream entryReadStream = archEntry.Open())
{
    using (var tempMemStream = new MemoryStream())
    {
        entryReadStream.CopyTo(tempMemStream);
        readResult = tempMemStream.ToArray();
    }
}
This example reads the given file contents, and returns them as an array of bytes (stored in the byte[] readResult variable) which you can then use according to your needs.
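If the data is a raw DEFLATE buffer rather than a full ZIP archive, a closer analogue to zlib's uncompress is a small helper around DeflateStream. A sketch; it assumes raw deflate data with no zlib header (for zlib-wrapped data, .NET 6+ also provides ZLibStream):
static byte[] Decompress(byte[] compressed)
{
    using (var input = new MemoryStream(compressed))
    using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
    using (var output = new MemoryStream())
    {
        // unlike zlib's uncompress, the output grows as needed,
        // so the uncompressed size does not have to be known up front
        deflate.CopyTo(output);
        return output.ToArray();
    }
}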

StreamReader.ReadToEnd causes massive memory usage / leaks

What it does: for each EncryptedBase64PictureFile, it reads the content, decrypts the base64 string and creates a PictureBox.
Where the problem is: insane memory usage! I guess that some data is not released properly after each loop. For example, 100 loops with input around 100 MB of encrypted data, which should generate around 100 MB of image files, use around 1.5 GB of memory! And when I try to decrypt just a little more data, around 150 MB, I get an OutOfMemory exception. Visual Studio's memory profiling report says that the "string fileContent = reader.ReadToEnd();" line is responsible for 80% of allocations.
for each EncryptedBase64PictureFile {
Rijndael rijAlg = Rijndael.Create();
rijAlg.Key = ASCIIEncoding.ASCII.GetBytes(sKey);
rijAlg.IV = ASCIIEncoding.ASCII.GetBytes(sKey);
FileStream fsread = new FileStream(EncryptedBase64PictureFile, FileMode.Open, FileAccess.Read);
ICryptoTransform desdecrypt = rijAlg.CreateDecryptor();
CryptoStream cryptostreamDecr = new CryptoStream(fsread,desdecrypt, CryptoStreamMode.Read);
StreamReader reader = new StreamReader(cryptostreamDecr);
string fileContent= reader.ReadToEnd(); //this should be the memory eater
var ms = new MemoryStream(Convert.FromBase64String(fileContent));
PictureBox myPictureBox= new PictureBox();
myPictureBox.Image = Image.FromStream(ms);
ms.Close();
reader.Close();
cryptostreamDecr.Close();
fsread.Close();
}
So the question is: is there a way to deallocate memory properly after each loop? Or is the problem something else?
Thanks for any ideas!
EDIT:
Of course I tried to Dispose() all 4 streams, but the result was the same...
ms.Dispose();
reader.Dispose();
cryptostreamDecr.Dispose();
fsread.Dispose();
EDIT:
Found the problem. It was not Dispose(), but creating the picture from the stream. After deleting the picture, memory usage went from 1.5 GB to 20 MB.
EDIT:
Pictures are about 500 kB in .jpg format, around 700 kB in base64-encrypted form. But I really have no idea how big the PictureBox object is.
EDIT:
"100 loops with input around 100 MB" means that each loop takes around 1 MB; 100 MB is the total for 100 loops.
Another answer: live with it.
As in: you work with 100 MB blocks in what appears to be a 32-bit application. This will not work without reusing buffers, due to the large object heap and general memory fragmentation.
As in: the memory is there, just not in large enough contiguous blocks. This results in allocation errors.
There is no real way around this except going 64-bit, where the larger address space handles the issue.
Information about this may be at:
https://connect.microsoft.com/VisualStudio/feedback/details/521147/large-object-heap-fragmentation-causes-outofmemoryexception
https://www.simple-talk.com/dotnet/.net-framework/large-object-heap-compaction-should-you-use-it/
has a possible solution these days, enabling large object heap compaction:
GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect(); // forces a blocking collection so the compaction happens now; otherwise it occurs on the next full blocking GC
LOH operations are expensive, but 100 MB blocks flying around is not exactly a GC-recommended scenario. Not in 32 bit.
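Whether the process actually runs as 64-bit can be checked at runtime with a standard BCL property (a trivial sanity check, nothing more):
Console.WriteLine(Environment.Is64BitProcess ? "running as 64-bit" : "running as 32-bit");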
Use a base 64 transform when decrypting your stream. Do not use Convert.FromBase64String as this requires all data to be in memory.
using (FileStream f64 = File.Open(fileout, FileMode.Open)) // content is in base64
using (var cs = new CryptoStream(f64, new FromBase64Transform(), CryptoStreamMode.Read)) // transform passed to constructor
using (var fo = File.Open(filein + ".orig", FileMode.Create))
{
    cs.CopyTo(fo); // stream is read as if it were already decoded
}
Code sample taken from this related answer -
Convert a VERY LARGE binary file into a Base64String incrementally
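Applied to the loop in the question, the Rijndael decryptor and the Base64 decoder can be chained so the decoded image bytes never pass through a giant string or Convert.FromBase64String. A sketch reusing the question's rijAlg, EncryptedBase64PictureFile and myPictureBox; it assumes the decrypted text is plain ASCII/UTF-8 Base64 with no BOM:
var ms = new MemoryStream();
using (var fsread = new FileStream(EncryptedBase64PictureFile, FileMode.Open, FileAccess.Read))
using (var decrypted = new CryptoStream(fsread, rijAlg.CreateDecryptor(), CryptoStreamMode.Read))
using (var decoded = new CryptoStream(decrypted, new FromBase64Transform(), CryptoStreamMode.Read))
{
    decoded.CopyTo(ms); // decoded image bytes, no intermediate Base64 string
}
ms.Position = 0;
myPictureBox.Image = Image.FromStream(ms); // GDI+ expects this stream to remain available for the Image's lifetime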

How to shrink a string and be able to find the original later

I am working on an app that is still in beta, so I set up a logging system. The log is too long to be used in a mailto URL, so I thought about shrinking the text and then recovering the original later.
Let's say I have a 50-line log; this should help me turn it into something like zef16z1e6f8, and then have a procedure that uses that token to get back all 50 lines of the log.
I would like to note that I don't need any fancy TripleDES encryption or anything like that.
First I would suggest re-examining why you can't just mail the entire log content. Unless you have large logs (>5 MB) I'd suggest just mailing the log. If you still want to pursue some shrinking strategy, there are two I'd consider.
If you want a simple reference string which can be used to look up your log data at some later stage, you can just associate some sort of identifier with the data (e.g. a GUID as suggested by Eugene). This has the benefit of having a constant length, irrespective of the log size.
Alternatively you could just compress the log; this will shrink the data somewhat (anything up to about 90%, as Dan mentioned). However this has the downside of having a variable length, and for very large logs it may still exceed your size limitations. If you go this route you could do something like this (not tested):
private string GetCompressedString()
{
    byte[] byteArray = Encoding.UTF8.GetBytes("Some long log string");
    using (var ms = new MemoryStream())
    {
        using (var gz = new GZipStream(ms, CompressionMode.Compress, true))
        {
            // write through the GZipStream so the data actually gets compressed
            gz.Write(byteArray, 0, byteArray.Length);
        }
        ms.Position = 0;
        var compressedBytes = new byte[ms.Length];
        ms.Read(compressedBytes, 0, compressedBytes.Length);
        return Convert.ToBase64String(compressedBytes);
    }
}
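A matching helper to recover the original text from the Base64 string (a sketch along the same lines, equally untested):
private string GetDecompressedString(string base64)
{
    byte[] compressedBytes = Convert.FromBase64String(base64);
    using (var ms = new MemoryStream(compressedBytes))
    using (var gz = new GZipStream(ms, CompressionMode.Decompress))
    using (var reader = new StreamReader(gz, Encoding.UTF8))
    {
        return reader.ReadToEnd();
    }
}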

Stream chaining in computing a checksum: avoiding memory issues

I have a FileStream connected to an XML file that I would like to read directly into a SHA512 object in order to compute a hash for the purposes of a checksum (not a security use).
The issue is twofold:
I want to omit some of the nodes in the XML,
the file is quite large, and I would rather not load the whole thing into memory.
I could read the whole file into an XML structure, delete the nodes, then write it to a stream that would then be plugged into SHA512.ComputeHash, but that would cause a performance loss. I would prefer to do the deletion of the nodes as an operation on a stream, and then chain the streams together into a single stream that can be passed into SHA512.ComputeHash(Stream).
How can I accomplish this?
Chain the pieces through a CryptoStream that discards its output: the hash algorithm acts as the transform, Stream.Null is the destination, and an XmlWriter writes the filtered XML into it, so only the nodes you choose to copy contribute to the hash.
using (var hash = new SHA512Cng())
using (var stream = new CryptoStream(Stream.Null, hash, CryptoStreamMode.Write))
using (var writer = XmlWriter.Create(stream))
using (var reader = XmlReader.Create("input.xml"))
{
    while (reader.Read())
    {
        // skip the nodes you want to omit; copy everything else to writer
        // ... write node to writer ...
    }
    writer.Flush();
    stream.FlushFinalBlock();
    var result = hash.Hash;
}
