I'm sent large files from an API as base64-encoded strings, which I convert to a byte[] (in one go) and then return to the client via a controller action. Example:
byte[] fileBytes = Convert.FromBase64String(base64File);
return this.File(fileBytes);
FileContentResult - MSDN
In some cases, these files end up being very large, which I believe causes the fileBytes object to be allocated on the large object heap (LOH), where it won't be immediately freed once out of scope. This is happening often enough that it causes out-of-memory exceptions and application restarts.
My question is: how can I decode these large base64 strings without allocating a byte[] on the LOH? I thought about reading the data into a stream and returning a FileStreamResult instead, e.g.:
using (var ms = new MemoryStream(fileBytes))
{
    // return stream from action
}
But I'd still need to convert the base64 to a byte[] first. Is it possible to read the base64 itself in smaller chunks, thereby creating smaller byte[] arrays that wouldn't end up on the LOH?
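For illustration, a minimal sketch of that chunked idea (the method name is mine; it assumes the base64 string contains no whitespace, so any 4-character-aligned slice is itself valid base64):

using System;
using System.IO;

// Decode the base64 string in 4-char-aligned chunks so every decoded
// byte[] stays well below the ~85,000-byte LOH threshold.
public static void WriteBase64ToStream(string base64File, Stream output)
{
    const int charChunk = 8192; // multiple of 4; decodes to 6,144 bytes
    for (int offset = 0; offset < base64File.Length; offset += charChunk)
    {
        int count = Math.Min(charChunk, base64File.Length - offset);
        byte[] decoded = Convert.FromBase64String(base64File.Substring(offset, count));
        output.Write(decoded, 0, decoded.Length); // e.g. into Response.Body
    }
}

Note that the base64 string itself is still one large object; avoiding that entirely would mean streaming the API response rather than materialising the whole string first.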
As far as I know, there are two main ways to convert a string into a MemoryStream:
1. MemoryStream ms = new MemoryStream(Encoding.UTF8.GetBytes(contents));
2. UnmanagedMemoryStream ums = new UnmanagedMemoryStream(p.ToPointer(), 100); or Marshal.Copy(p, buffer, 0, buffer.Length); which amounts to the same thing
Instead of copying the string into a byte[] and then creating a MemoryStream from it, I'm wondering whether there is a nice workaround using ReadOnlyMemory<> and/or Span<> to create a MemoryStream, to get more performant/efficient code (with zero GC allocations and instant assignment).
I know that "someStringContentFoo".AsMemory() yields a ReadOnlyMemory<char>, but I just cannot convert it to a MemoryStream.
Moreover, there is a NuGet package which converts a ReadOnlyMemory<byte> to a MemoryStream using an AsStream() extension method, but I cannot convert a ReadOnlyMemory<char> to a ReadOnlyMemory<byte> without using ToArray(), which sacrifices memory and adds GC pressure.
In brief: is there a way to convert a string to a MemoryStream without paying the memory-copy penalty?
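For what it's worth, here is a sketch of the closest approach I know of, built on the CommunityToolkit.HighPerformance package mentioned above (which, if I remember its API correctly, offers a Cast extension alongside AsStream()). The caveat: reinterpreting ReadOnlyMemory<char> as bytes exposes the string's raw UTF-16 code units, not UTF-8 text, so it only helps if the consumer expects exactly those bytes:

using System;
using System.IO;
using CommunityToolkit.HighPerformance;

ReadOnlyMemory<char> chars = "someStringContentFoo".AsMemory();
ReadOnlyMemory<byte> bytes = chars.Cast<char, byte>(); // reinterpret in place, no copy
using Stream stream = bytes.AsStream();                // wraps the memory, no copy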
I have created a text file using TextWriter in C#. On completion, the text file often has several rows of whitespace at the end. The whitespace is not included in any of the string objects that make up the file, and I don't know what is causing it. The larger the file, the more whitespace there is.
I've tried various tests to see if the whitespace depends on the content of the strings, but this is not the case; i.e. I identified the row where the whitespace starts and changed the string to something completely different, but the whitespace still occurs.
//To start:
MemoryStream memoryStream = new MemoryStream();
TextWriter tw = new StreamWriter(memoryStream);
//Loop through records & create a concatenated string object
string strUTL1 = string.Format("{0}{1}{2}{3}{4}{5}{6}{7}", strUTL1_1, strUTL1_2, strUTL1_3, strUTL1_4, strUTL1_5, strUTL1_6, strUTL1_7, strUTL1_8);
//Add the line to the text file
tw.WriteLine(strUTL1);
//Once all rows are added I complete the file
tw.Flush();
tw.Close();
//Then return the file
return File(memoryStream.GetBuffer(), "text/plain", txtFileName);
I don't want to manipulate the file after completion (e.g. replace blank spaces), as this could lead to other problems. The file will be exchanged with a third party and needs to be formatted exactly.
Thank you for your assistance.
As the doc for MemoryStream.GetBuffer explains:
Note that the buffer contains allocated bytes which might be unused. For example, if the string "test" is written into the MemoryStream object, the length of the buffer returned from GetBuffer is 256, not 4, with 252 bytes unused. To obtain only the data in the buffer, use the ToArray method; however, ToArray creates a copy of the data in memory.
Use .ToArray() (which allocates a new array of the right size), or keep using the buffer returned from .GetBuffer(), but then check the stream's .Length to see how many valid bytes are in it.
GetBuffer() returns all the memory that was allocated, which is almost always more bytes than what you actually wrote into it.
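Concretely, the two options, each shown as the last lines of the question's action (a sketch; pick one):

// Option 1: ToArray() copies exactly Length bytes into a right-sized array.
return File(memoryStream.ToArray(), "text/plain", txtFileName);

// Option 2: GetBuffer() avoids the copy, but only the first Length bytes
// are data, so wrap just that slice in a non-copying MemoryStream.
byte[] raw = memoryStream.GetBuffer();
var slice = new MemoryStream(raw, 0, (int)memoryStream.Length, writable: false);
return File(slice, "text/plain", txtFileName);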
Might I suggest using Encoding.UTF8.GetBytes(...) instead:
string strUTL1 = string.Format("{0}{1}{2}{3}{4}{5}{6}{7}", strUTL1_1, strUTL1_2, strUTL1_3, strUTL1_4, strUTL1_5, strUTL1_6, strUTL1_7, strUTL1_8);
var bytes = Encoding.UTF8.GetBytes(strUTL1);
return File(bytes, "text/plain", txtFileName);
Use ToArray() instead of GetBuffer(), since the buffer is larger than needed.
That's often the case: classes or functions that work with buffers usually reserve a certain amount of memory to hold the data, and then report how many bytes were actually written to the buffer. You should then use only the first n bytes of the buffer.
Quoting MSDN:
For example, if the string "test" is written into the MemoryStream object, the length of the buffer returned from GetBuffer() is 256, not 4, with 252 bytes unused. To obtain only the data in the buffer, use the ToArray() method; however, ToArray() creates a copy of the data in memory.
I'm building an API which has a method that accepts a file via a POST request.
Based on that file, I need to create a hash of the file itself (not the name), check whether the hash already exists, and do some other actions.
My problem is that whatever file I send through Postman, the hash is always the same for every file, which means that every time I end up with only one file, which gets overwritten.
Here is my method:
private string GetHashFromImage(IFormFile file)
{
/* Creates a hash with the image as a parameter
* with the SHA1 algorithm and returns the hash
* as a string since the ComputeHash() method
* creates a byte array.
*/
System.IO.MemoryStream image = new System.IO.MemoryStream();
file.CopyTo(image);
var hashedValue = System.Security.Cryptography.SHA1.Create().ComputeHash(image);
var hashAsString = Convert.ToBase64String(hashedValue).Replace("/", "");
image.Seek(0, System.IO.SeekOrigin.Begin);
return hashAsString;
}
I need a hash method that is OS-agnostic and will return the same hash for a given file every time.
Not entirely sure why your solution is not working, but I think I have an idea of how to achieve what you want, using MD5 instead of SHA1.
Let's create a function that will receive an IFormFile, compute the MD5 hash of its contents then return the hash value as a string.
using System;
using System.IO;
using System.Security.Cryptography;
private string GetMD5Hash(IFormFile file)
{
    // copy the uploaded file's stream into a MemoryStream
    MemoryStream stream = new MemoryStream();
    file.OpenReadStream().CopyTo(stream);

    // compute the MD5 hash of the file's bytes
    byte[] bytes = MD5.Create().ComputeHash(stream.ToArray());
    return BitConverter.ToString(bytes).Replace("-", string.Empty).ToLower();
}
Hope it works for you!
The real reason for this behaviour is the stream's position: when the hash is computed, the position is at the end (the same as after image.Seek(0, System.IO.SeekOrigin.End)).
Stream operations like CopyTo, ComputeHash, etc. change the position of streams because they have to iterate through them. The hash of any stream whose position is at the end is always the same: the hash of an empty stream or empty array.
Converting the stream to an array works, of course, because ToArray reads the whole underlying buffer (from position = 0), but it is generally not a very elegant solution, because you copy the whole stream into memory a second time (for a MemoryStream the data is already in memory).
When you work directly with a stream, a function like ComputeHash reads it in small chunks (e.g. 4096 bytes) and computes the hash iteratively (see the .NET source code). This means the original solution should work if the seek back to the start is performed before the hash calculation rather than after it.
In fact, you should be able to compute the hash directly from the input stream (from IFormFile) without copying the whole stream into memory (array or MemoryStream), with better performance and no risk of, e.g., an OutOfMemoryException. For example:
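A sketch of that suggestion, reworking the question's method:

// Hash the upload stream directly; ComputeHash reads it in
// small chunks internally, so nothing is buffered in full.
private string GetHashFromImage(IFormFile file)
{
    using (var sha1 = System.Security.Cryptography.SHA1.Create())
    using (var stream = file.OpenReadStream())
    {
        var hashedValue = sha1.ComputeHash(stream);
        return Convert.ToBase64String(hashedValue).Replace("/", "");
    }
}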
I want to convert an image file to a string. The following works:
MemoryStream ms = new MemoryStream();
Image1.Save(ms, ImageFormat.Jpeg);
byte[] picture = ms.ToArray();
string formmattedPic = Convert.ToBase64String(picture);
However, when saving this to an XmlWriter, it takes ages before it's saved (20 seconds for a 26 KB image file). Is there a way to speed this up?
Thanks,
Raks
There are three points where you are doing large operations needlessly:
Getting the stream's bytes
Converting it to Base64
Writing it to the XmlWriter.
Instead: first call Length and GetBuffer. This lets you operate on the stream's buffer directly (do flush the stream first, though).
Then, implement base-64 yourself. It's relatively simple: you take groups of 3 bytes, split them into four 6-bit values, use each value as an index into the base-64 alphabet, and output those four characters. At the very end you add = symbols according to how many bytes were in the last block (== for one remainder byte, = for two remainder bytes, and none if there were no partial blocks).
Do this writing into a char buffer (a char[]). The most efficient size is a matter for experimentation, but I'd start with 2048 characters. When you've filled the buffer, call XmlWriter.WriteRaw on it, and then start writing back at index 0 again.
This way you do fewer allocations, and you start producing output the moment the image is loaded into the memory stream. Generally, this should give better throughput.
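If hand-rolling the encoder feels risky, here's a sketch of the same chunked idea using the BCL's Convert.ToBase64CharArray instead (the method name is mine; note that XmlWriter also has a built-in WriteBase64 method that does the chunking for you):

using System;
using System.IO;
using System.Xml;

// Encode the stream's buffer in 3-byte-multiple chunks, so padding can
// only appear in the final block, and write each encoded chunk raw.
static void WriteImageBase64(XmlWriter writer, MemoryStream ms)
{
    ms.Flush();
    byte[] buffer = ms.GetBuffer();
    int length = (int)ms.Length;    // only this many bytes are valid data
    char[] chars = new char[2048];
    const int byteChunk = 1536;     // 1536 bytes -> exactly 2048 base64 chars
    for (int offset = 0; offset < length; offset += byteChunk)
    {
        int count = Math.Min(byteChunk, length - offset);
        int written = Convert.ToBase64CharArray(buffer, offset, count, chars, 0);
        writer.WriteRaw(chars, 0, written);
    }
}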
Let's assume we have a simple internet socket, and it's going to send 10 megabytes (because I want to ignore memory issues) of random data through.
Is there any performance difference or a best practice method that one should use for receiving data? The final output data should be represented by a byte[]. Yes I know writing an arbitrary amount of data to memory is bad, and if I was downloading a large file I wouldn't be doing it like this. But for argument's sake let's ignore that and assume it's a smallish amount of data. I also realise that the bottleneck here is probably not the memory management but rather the socket receiving. I just want to know what would be the most efficient method of receiving data.
A few dodgy ways I can think of:
Have a List<byte> and a buffer; after the buffer is full, add its contents to the list, and at the end call list.ToArray() to get the byte[].
Write each buffer to a MemoryStream; once it's complete, construct a byte[] of stream.Length and read the stream into it to get the byte[] output.
Is there a more efficient/better way of doing this?
Just write to a MemoryStream and then call ToArray - that does the business of constructing an appropriately-sized byte array for you. That's effectively what a List<byte> would be like anyway, but using a MemoryStream will be a lot simpler.
Well, Jon Skeet's answer is great (as usual), but there's no code, so here's my interpretation. (Worked fine for me.)
using (var mem = new MemoryStream())
{
    using (var tcp = new TcpClient())
    {
        tcp.Connect(new IPEndPoint(IPAddress.Parse("192.0.0.192"), 8880));
        tcp.GetStream().CopyTo(mem);
    }
    var bytes = mem.ToArray();
}
(Why not combine the two usings? Well, if you want to debug, you might want to release the TCP connection before taking your time looking at the bytes received.)
This code will receive multiple packets and aggregate their data, FYI. So it's a simple way to receive all the TCP data sent during a connection.
What is the encoding of your data? Is it plain ASCII, or is it something else, like UTF-8/Unicode?
If it is plain ASCII, you could just allocate a StringBuilder of the required size (get the size from the Content-Length header of the response) and keep appending your data to the builder, after converting it into a string using Encoding.ASCII.
If it is Unicode/UTF-8 then you have an issue: you cannot just call Encoding.UTF8.GetString(buffer, 0, bytesRead) on each read, because bytesRead might end in the middle of a multi-byte character, so the bytes read might not constitute a complete string fragment in that encoding. In that case you can either buffer the entire entity body into memory (or a file) and decode it in one go, or use a System.Text.Decoder, which carries partial characters across reads.
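A minimal sketch of that Decoder approach (the method name is mine):

using System.IO;
using System.Text;

// A Decoder remembers trailing partial characters between calls,
// so each network chunk can be decoded safely as it arrives.
static string ReadAllText(Stream stream)
{
    Decoder decoder = Encoding.UTF8.GetDecoder();
    var sb = new StringBuilder();
    byte[] buffer = new byte[4096];
    char[] chars = new char[Encoding.UTF8.GetMaxCharCount(buffer.Length)];
    int bytesRead;
    while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
    {
        int charCount = decoder.GetChars(buffer, 0, bytesRead, chars, 0);
        sb.Append(chars, 0, charCount);
    }
    return sb.ToString();
}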
You could write to a memory stream, then use a StreamReader or something like that to get the data. What are you doing with the data? I ask because it would be more efficient, from a memory standpoint, to write the incoming data to a file or database table as it is received, rather than storing the entire contents in memory.