very large string in memory

I am writing a program that formats hundreds of MB of string data (nearing a gigabyte) into XML, and I am required to return it as the response to an HTTP GET request.
I am using a StringWriter/XmlWriter to build the XML from the records in a loop and returning the result of writer.ToString():
using (StringWriter writer = new StringWriter())
using (XmlWriter xmlWriter = XmlWriter.Create(writer, settings)) // where settings are the XML properties
{
    // ... write the records in a loop ...
    return writer.ToString();
}
During testing I saw a few OutOfMemoryExceptions and I am quite clueless about how to find a solution. Do you have any suggestions for a memory-optimized delivery of the response?
Is there a memory-efficient way of encoding the data, or maybe of chunking it?
I just cannot think of how to return it without building the whole thing into one huge string object.
Thanks.
--
A few clarifications:
This is an ASP.NET web services app over a gigabit Ethernet link, as Josh noted. I am not very familiar with it, so there is still a bit of a learning curve.
I am using XmlWriter to create the XML and then creating a string out of it using StringWriter.
Some stats:
Response XML size = about 385 MB (my data size will grow very quickly to far more than this)
String object size as calculated by a memory profiler = peaked at 605 MB
and thanks to everyone who responded...

Use an XmlTextWriter wrapped around Response.OutputStream to send the XML to the client, and periodically flush the response. This way you never need more than a few MB in memory at any one time (at least for sending to the client).
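As a rough illustration of that idea, here is a minimal sketch that writes records directly to an output stream and flushes at intervals. The class name XmlStreamingSketch, the method WriteRecords, the element names and the flushEvery parameter are all made up for the example; it uses XmlWriter.Create rather than XmlTextWriter, and any writable Stream stands in for Response.OutputStream.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

public static class XmlStreamingSketch
{
    // Writes records straight to 'output' (for example Response.OutputStream)
    // instead of building one large string first.
    public static void WriteRecords(Stream output, IEnumerable<string> records, int flushEvery = 1000)
    {
        var settings = new XmlWriterSettings
        {
            Encoding = new UTF8Encoding(false), // UTF-8 without a byte order mark
            CloseOutput = false                 // the caller owns the stream
        };

        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartDocument();
            writer.WriteStartElement("records"); // hypothetical root element name

            int count = 0;
            foreach (string record in records)
            {
                writer.WriteElementString("record", record); // hypothetical element name

                // Push buffered bytes to the underlying stream every so often.
                if (++count % flushEvery == 0)
                {
                    writer.Flush();
                }
            }

            writer.WriteEndElement();
            writer.WriteEndDocument();
        } // disposing the writer flushes whatever is left
    }
}
```

In classic ASP.NET the response itself is buffered by default, so this probably only helps if response buffering is turned off as well (for example via Response.BufferOutput = false); treat that as something to verify for your hosting setup.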

Can't you just stream the response to the client? XmlWriter doesn't require its underlying stream to be buffered in memory. If it's ASP.NET you can use the Response.OutputStream or if it's WCF, you can use response streaming.

An HTTP GET for 1 GB? That's a lot! Perhaps you should reconsider.
At the very least, gzipping the output could help.
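A minimal sketch of the gzip suggestion, assuming the body is produced by some callback that writes to a stream. GzipSketch, WriteCompressed and the writeBody delegate are hypothetical names; GZipStream is the standard System.IO.Compression class.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class GzipSketch
{
    // Wraps 'output' in a GZipStream and lets the caller write the
    // uncompressed body into it; compressed bytes go out as they are produced.
    public static void WriteCompressed(Stream output, Action<Stream> writeBody)
    {
        using (var gzip = new GZipStream(output, CompressionLevel.Fastest, leaveOpen: true))
        {
            writeBody(gzip);
        } // disposing the GZipStream writes the gzip trailer
    }
}
```

When this is used for an HTTP response, the client also has to be told the body is compressed (normally with a Content-Encoding: gzip header), and the web server may already offer compression as a configuration option.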

You should not create XML using string manipulation.
Instead, you should use the XmlTextWriter, XmlDocument, or (in .NET 3.5) XElement classes to build an XML tree in memory, then write it directly to Response.OutputStream using an XmlTextWriter.
Writing directly to an XmlTextWriter that wraps Response.OutputStream will be most efficient (you'll never have an entire element tree in memory at once), but will be somewhat more complicated.
By doing it this way, you will never have a single string (or array) containing the entire object, and should thus avoid OutOfMemoryExceptions.

I had a similar problem; I hope this helps someone. My initial code was:
var serializer = new XmlSerializer(type);
string xmlString;
using (var writer = new StringWriter())
{
    serializer.Serialize(writer, objectData, sn); // OutOfMemoryException here
    xmlString = writer.ToString();
}
I ended up replacing StringWriter with MemoryStream, and this solved my problem:
using (var mem = new MemoryStream())
{
    serializer.Serialize(mem, objectData, sn);
    xmlString = Encoding.UTF8.GetString(mem.ToArray());
}

You'll have to return each record (or a small group of records) on their own individual GETs.

Related

On-the-fly formatting a stream of JSON using System.Text.Json

I have a JSON string that is not indented, e.g.:
{"hash":"123","id":456}
I want to indent the string and serialize it to a JSON file. Naively, I can indent the string using Newtonsoft as follows.
using Newtonsoft.Json.Linq;
JToken token = JToken.Parse(json);
var formattedJson = JObject.Parse(token.ToString()).ToString();
However, since I am using a decent number of large JSON objects, I am mainly interested in solutions that can operate on a stream of data. For performance reasons, I have decided to use System.Text.Json, and I am wondering if it comes with any out-of-box functionality for processing data streams.
Before rolling my own solution, I am wondering if there is any approach with mostly out-of-box functionality ideally intercepting a stream of the input while it is written to storage (i.e., on-the-fly conversion). Alternatively, I can process a serialized stream, but that needs to read through the file, make the necessary changes, and write to the output file without requiring to deserialize the entire JSON into memory first. I am mainly interested in the first approach because (a) I would be going through the JSON once, and (b) that does not require storing an intermediate file (stream -> unformatted-JSON -> formatted-JSON).
Motivation
An upstream service streams a large collection of information in JSON format. A downstream service reads through the JSON line by line and extracts the required fields, presumably because the JSON files are too large to deserialize in memory. However, a few conversions must be applied to the streamed JSON to make it compatible with the downstream service. One of them is indentation with one key-value pair per line. It seems the upstream service drops all formatting to stream fewer bits, while the downstream service relies on that formatting to extract information. Both services are beyond my control. The goal of the service I'm writing is to sit in the middle and apply the necessary conversions (formatting such as indentation is one of them) to the streamed JSON to make it compatible with the downstream service.
As explained above, deserializing the streamed JSON into an object, making the necessary changes, and serializing the updated JSON to disk, seems an obvious solution, however, given the size and the volume of data, this approach is impractical/infeasible for my application.
I can think of a middle layer that processes the streamed JSON on the fly and makes the changes before writing the bits to persistent storage. However, before going down that path, I wanted to double-check whether there is any out-of-box functionality in System.Text.Json for processing streams of information.
Update
The question is largely updated for clarity and emphasis on the main point: is there any out-of-box functionality in System.Text.Json for processing stream of JSON?

It seems that there's no easy way to do it with System.Text.Json, because you cannot use a Stream object with System.Text.Json.Utf8JsonReader directly. To bypass this limitation, you need to load the file content into memory with a System.Text.Json.JsonDocument object, which will obviously take up a lot of memory.
For now, from what I have read on the web, the only memory-efficient solution is to use the Newtonsoft.Json library.
using (var streamReader = new StreamReader(sourceFilePath))
using (var jsonTextReader = new JsonTextReader(streamReader))
using (var streamWriter = File.CreateText(destinationFilePath))
using (var jsonTextWriter = new JsonTextWriter(streamWriter))
{
    jsonTextWriter.Formatting = Formatting.Indented;
    while (jsonTextReader.Read())
    {
        jsonTextWriter.WriteToken(jsonTextReader);
    }
}
This is much faster with a 1 MB buffer instead of the default 4 KB.
It indented a 6.33 GB file to 13.7 GB in 5m05s, a total of around 200,000,000 lines. It was reading from and writing to an HDD, running in Visual Studio (Debug build), and used only 17 MB of RAM. I couldn't test on an SSD because of space limitations.
string filename = @"VeryBig.json";
using FileStream inputFileStream = new(filename, FileMode.Open, FileAccess.Read, FileShare.Read, 1 * 1024 * 1024);
using StreamReader streamReader = new(inputFileStream);
using JsonTextReader jsonTextReader = new(streamReader);
string filenameOutput = Path.ChangeExtension(filename, ".indented.json");
// FileMode.Create truncates an existing output file; OpenOrCreate would leave stale bytes behind if the old file was longer.
using FileStream outputFileStream = new(filenameOutput, FileMode.Create, FileAccess.Write, FileShare.Read, 1 * 1024 * 1024);
using StreamWriter streamWriter = new(outputFileStream);
using JsonTextWriter jsonTextWriter = new(streamWriter);
jsonTextWriter.Formatting = Formatting.Indented;
while (jsonTextReader.Read())
{
    jsonTextWriter.WriteToken(jsonTextReader);
}

I just tested this and didn't find any problem:
var json="{\"hash\":\"123\",\"id\":456}";
var jsonObject=JsonDocument.Parse(json);
json = System.Text.Json.JsonSerializer.Serialize(jsonObject,
new JsonSerializerOptions() { WriteIndented = true });
Test result:
{
  "hash": "123",
  "id": 456
}

The easy way using System.Text.Json and friends:
using System;
using System.Text;
using System.Text.Json;
using System.IO;
public class Program
{
    const string template = @"
Original:
---------
{0}
---------
Pretty:
---------
{1}
---------
";
    public static void Main()
    {
        var src = "{\"hash\":\"123\",\"id\":456}";
        using ( var doc = JsonDocument.Parse(src, new JsonDocumentOptions{ AllowTrailingCommas = true }) )
        using ( var ms = new MemoryStream() )
        using ( var jsonWriter = new Utf8JsonWriter( ms, new JsonWriterOptions{ Indented = true } ) )
        {
            doc.RootElement.WriteTo(jsonWriter);
            jsonWriter.Flush();
            ms.Flush();
            string pretty = Encoding.UTF8.GetString(ms.ToArray());
            Console.WriteLine( template , src, pretty );
        }
    }
}
Which produces the expected
Original:
---------
{"hash":"123","id":456}
---------
Pretty:
---------
{
  "hash": "123",
  "id": 456
}
---------

Is there any out-of-box functionality in System.Text.Json for processing a stream of JSON?
Yes... sort of.
You can use Utf8JsonReader to read from a stream, but it doesn't support streams natively: it works on ReadOnlySpan<byte> or ReadOnlySequence<byte> input. So the code for streaming with it is pretty gnarly, and involves creating your own buffers and reading from the stream yourself.
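To give an idea of what that buffer management looks like, here is a sketch that re-indents JSON from one stream to another without loading the whole document. It is an illustration under assumptions, not a tested production implementation: JsonStreamIndenter, Indent and CopyToken are made-up names, the buffer-growth strategy is the simplest one that works, and WriteRawValue needs .NET 6 or later.

```csharp
using System;
using System.IO;
using System.Text.Json;

public static class JsonStreamIndenter
{
    public static void Indent(Stream input, Stream output, int initialBufferSize = 64 * 1024)
    {
        byte[] buffer = new byte[initialBufferSize];
        int bytesInBuffer = 0;
        bool isFinalBlock = false;
        var state = new JsonReaderState();

        using var writer = new Utf8JsonWriter(output, new JsonWriterOptions { Indented = true });

        while (!isFinalBlock)
        {
            // Top up the buffer behind any bytes left over from the last pass.
            int read = input.Read(buffer, bytesInBuffer, buffer.Length - bytesInBuffer);
            isFinalBlock = read == 0;
            bytesInBuffer += read;

            // The reader is a ref struct, so it is recreated per chunk and
            // resumed from the saved state.
            var reader = new Utf8JsonReader(buffer.AsSpan(0, bytesInBuffer), isFinalBlock, state);
            while (reader.Read())
            {
                CopyToken(ref reader, writer);
            }
            state = reader.CurrentState;

            // Move the unconsumed tail (a partial token) to the front.
            int consumed = (int)reader.BytesConsumed;
            bytesInBuffer -= consumed;
            Buffer.BlockCopy(buffer, consumed, buffer, 0, bytesInBuffer);

            // A single token larger than the buffer: grow it.
            if (bytesInBuffer == buffer.Length)
            {
                Array.Resize(ref buffer, buffer.Length * 2);
            }

            writer.Flush(); // keep the writer's internal buffer small
        }
    }

    private static void CopyToken(ref Utf8JsonReader reader, Utf8JsonWriter writer)
    {
        switch (reader.TokenType)
        {
            case JsonTokenType.StartObject: writer.WriteStartObject(); break;
            case JsonTokenType.EndObject: writer.WriteEndObject(); break;
            case JsonTokenType.StartArray: writer.WriteStartArray(); break;
            case JsonTokenType.EndArray: writer.WriteEndArray(); break;
            case JsonTokenType.PropertyName: writer.WritePropertyName(reader.GetString()); break;
            case JsonTokenType.String: writer.WriteStringValue(reader.GetString()); break;
            // Copy numbers verbatim so no precision is lost (.NET 6+).
            case JsonTokenType.Number: writer.WriteRawValue(reader.ValueSpan); break;
            case JsonTokenType.True:
            case JsonTokenType.False: writer.WriteBooleanValue(reader.GetBoolean()); break;
            case JsonTokenType.Null: writer.WriteNullValue(); break;
        }
    }
}
```

Strings are decoded and re-encoded here, so the writer's default escaping rules apply to the output; if byte-for-byte fidelity of string values matters, that part would need a different approach.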

Parsing XmlTextWriter object to String

I'm building a class library in C# which uses the XmlTextWriter class to build an XML which is then exported as a HTML document.
However when I save the file with a .html extension using the XmlTextWriter object as content, the resulting file only contains the text "System.Xml.XmlTextWriter"
This comes up in the method defined below, specifically in the final line:-
public void SaveAsHTML(string filepath)
{
    XmlTextWriter html;
    html = new XmlTextWriter(@"D:/HTMLWriter/XML/HTMLSaveAsConfig.xml", System.Text.Encoding.UTF8);
    html.WriteStartDocument();
    html.WriteStartElement("html");
    html.WriteRaw(Convert.ToString(Head));
    html.WriteRaw(Convert.ToString(Body));
    html.WriteEndElement();
    html.WriteEndDocument();
    html.Flush();
    html.Close();
    System.IO.File.WriteAllText(filepath, html.ToString());
}
For context, the variables Head and Body are also XmlTextWriter objects containing what will become the <head> and <body> elements of the HTML file respectively.
I've tried using Convert.ToString(), which causes the same issue.
I'm tempted to try overriding the ToString() method for my class as a fix, potentially using the XmlSerializer class. However I was wondering if there's a less noisy way of returning the Xml object as a string?

The following function will extract the string from the System.Xml.XmlTextWriter objects you've created as Head and Body.
private string XmlTextWriterToString(XmlTextWriter writer)
{
    // Ensure underlying stream is flushed.
    writer.Flush();
    // Reset position to beginning of stream.
    writer.BaseStream.Position = 0;
    using (var reader = new StreamReader(writer.BaseStream))
    {
        // Read and return content of stream as a single string
        var result = reader.ReadToEnd();
        return result;
    }
}
One caveat here is that the underlying System.IO.Stream object associated with the System.Xml.XmlTextWriter must support both read and seek operations (i.e., the Stream.CanRead and Stream.CanSeek properties must both return true).
I've made the following edits to your original code:
Replaced the Convert.ToString() calls with calls to this new function.
Made an assumption that you're intending to write to the file specified by the filepath parameter to your SaveAsHTML() function, and not the hard-coded path.
Wrapped the creation (and use, and disposal) of the System.Xml.XmlTextWriter in a using block (if you're not familiar, see What are the uses of “using” in C#?).
Following is your code with those changes.
public void SaveAsHTML(string filepath)
{
    using (var html = new XmlTextWriter(filepath, System.Text.Encoding.UTF8))
    {
        html.WriteStartDocument();
        html.WriteStartElement("html");
        html.WriteRaw(XmlTextWriterToString(Head));
        html.WriteRaw(XmlTextWriterToString(Body));
        html.WriteEndElement();
        html.WriteEndDocument();
        html.Flush();
        html.Close();
    }
}
Another thing of which to be mindful is that, not knowing for sure from the code provided how they're being managed, the lifetimes of Head and Body are subject to the same exception-based resource leak potential that html was before wrapping it in the using block.
A final thought: the page for System.Xml.XmlTextWriter notes the following: Starting with the .NET Framework 2.0, we recommend that you create XmlWriter instances by using the XmlWriter.Create method and the XmlWriterSettings class to take advantage of new functionality.

The last line writes the value of XmlTextWriter.ToString(), which does not return the text representation of the XML you wrote; it returns the type name. Try leaving off the last line: it looks like your XmlTextWriter is already writing to a file.
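If the goal is really to get the markup back as a string, one option (a sketch, not the asker's code) is to have the writer target a StringWriter and read the text from that, never from the XML writer's own ToString(). The names XmlToStringSketch and BuildHtml are hypothetical, and the head and body markup are assumed to be available as strings already.

```csharp
using System.IO;
using System.Xml;

public static class XmlToStringSketch
{
    public static string BuildHtml(string headMarkup, string bodyMarkup)
    {
        var text = new StringWriter();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };

        using (XmlWriter writer = XmlWriter.Create(text, settings))
        {
            writer.WriteStartElement("html");
            writer.WriteRaw(headMarkup); // written as-is, not escaped
            writer.WriteRaw(bodyMarkup);
            writer.WriteEndElement();
        } // disposing flushes everything into 'text'

        // The string lives in the StringWriter; writer.ToString() would
        // only return the writer's type name.
        return text.ToString();
    }
}
```

For example, XmlToStringSketch.BuildHtml("<head/>", "<body/>") should produce a single html element wrapping the two fragments, which can then be passed to File.WriteAllText.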

@PhilBrubaker's solution seems to be on the right track. There are still a few bugs in my code that I'm working on fixing, but the good news is that the casting seems to be working now.
protected string XmlToString(XmlWriter xmlBody)
{
    XmlTextWriter textXmlBody = (XmlTextWriter)xmlBody;
    textXmlBody.BaseStream.Position = 0;
    using (var reader = new StreamReader(textXmlBody.BaseStream))
    {
        // The using block disposes the reader; no explicit Dispose() call is needed.
        var result = reader.ReadToEnd();
        return result;
    }
}
I've changed the input parameter type to XmlWriter and cast it explicitly to XmlTextWriter in the method, so that the method also works when the Create() method is used instead of direct initialisation, as recommended since .NET 2.0. It's not 100% reliable at the moment, as XmlWriter doesn't always cast correctly to XmlTextWriter (depending on the features), but that's out of scope for this thread and I'm investigating it separately.
Thanks for your help!
On a side note, the using block is something I haven't come across before, but it's provided so many solutions across the board for me. So thanks for that too!

C#- Renci.Ssh.Net- Which one gives optimized performance- WriteAllText Vs. UploadFile

I need to generate multiple XML files at an SFTP location from C# code. For SFTP connectivity, I am using Renci.SshNet. I found that there are different methods for creating files, including WriteAllText() and UploadFile(). I produce the XML string at runtime, and currently I use the WriteAllText() method (just to avoid creating the XML file locally, and thus to avoid an IO operation).
using (SftpClient client = new SftpClient(host, port, sftpUser, sftpPassword))
{
    client.Connect();
    if (client.IsConnected)
    {
        client.BufferSize = 1024;
        var filePath = sftpDir + fileName;
        client.WriteAllText(filePath, contents);
        client.Disconnect();
    }
    client.Dispose();
}
Will using UploadFile(), with either a FileStream or a MemoryStream, give me better performance in the long run?
The resulting documents will be small, around 60 KB each.
Thanks!

SftpClient.UploadFile is optimized for uploads of large amounts of data.
But for 60 KB, I'm pretty sure that it makes no difference whatsoever. So you can continue using the more convenient SftpClient.WriteAllText.
Though, I believe that most XML generators (like the .NET XmlWriter) are able to write XML to a Stream (it's usually the preferred output API, rather than a string). So the use of SftpClient.UploadFile can be more convenient in the end.
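A sketch of that stream-based variant follows. To keep it self-contained, the upload step is passed in as a delegate; XmlUploadSketch, BuildAndUpload, writeXml and upload are hypothetical names, and for a 60 KB document a MemoryStream is assumed to be an acceptable intermediate buffer.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class XmlUploadSketch
{
    // Builds the XML in a MemoryStream and hands the stream to 'upload',
    // so no intermediate string or local file is needed.
    public static void BuildAndUpload(string remotePath, Action<XmlWriter> writeXml, Action<Stream, string> upload)
    {
        using (var buffer = new MemoryStream())
        {
            var settings = new XmlWriterSettings
            {
                Encoding = new UTF8Encoding(false), // UTF-8 without a byte order mark
                CloseOutput = false                 // keep 'buffer' open for the upload
            };

            using (XmlWriter writer = XmlWriter.Create(buffer, settings))
            {
                writeXml(writer);
            }

            buffer.Position = 0; // rewind before the upload reads it
            upload(buffer, remotePath);
        }
    }
}
```

With a connected SftpClient named client, the call could look like XmlUploadSketch.BuildAndUpload(filePath, w => { /* write elements */ }, (stream, path) => client.UploadFile(stream, path));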
See also What is the difference between SftpClient.UploadFile and SftpClient.WriteAllBytes?

Use PdfReport.Core in a WEB-API .NET CORE 2

I am looking into PdfReport.Core and have been asked to have our .NET Core 2.0 Web API return a PDF to the calling client. The client could be any HTTPS caller, such as an AJAX or MVC client.
Below is a bit of the code I am using. I am using Swashbuckle to test the API, which looks like it is returning the report, but when I try to open it in a PDF viewer it says the file is corrupted. I suspect I am not actually writing the PDF to the stream. Any suggestions?
[HttpGet]
[Route("api/v1/pdf")]
public FileResult GetPDF()
{
    var outputStream = new MemoryStream();
    InMemoryPdfReport.CreateStreamingPdfReport(_hostingEnvironment.WebRootPath, outputStream);
    outputStream.Position = 0;
    return new FileStreamResult(outputStream, "application/pdf")
    {
        FileDownloadName = "report.pdf"
    };
}

I'm not familiar with that particular library, but generally speaking with streams, file corruption is a result of either 1) the write not being flushed or 2) incorrect positioning within the stream.
Since you've set the position back to zero, I'm guessing the problem is that your write isn't being flushed correctly. Essentially, when you write to a stream, the data is not necessarily "complete" in the stream. Sometimes writes are queued so they can be written more efficiently in batches. Sometimes there are cleanup tasks a particular stream writer needs to complete to "finalize" everything. For example, with a format like PDF, format-specific end matter may need to be appended to the bytes. A stream writer that writes PDF would take care of this in a flush operation, since it cannot be completed until all writing is done.
Long story short, review the documentation of the library. In particular, look for any method or process that deals with "flushing". That's most likely what you're missing.

Memory stream is empty

I need to generate a huge XML file from different sources (functions). I decided to use XmlTextWriter since it uses less memory than XmlDocument.
First, I initialize an XmlTextWriter with an underlying MemoryStream:
MemoryStream ms = new MemoryStream();
XmlTextWriter xmlWriter = new XmlTextWriter(ms, new UTF8Encoding(false, false));
xmlWriter.Formatting = Formatting.Indented;
Then I pass the XmlWriter (note that the XML writer is kept open until the very end) to a function that generates the beginning of the XML file:
xmlWriter.WriteStartDocument();
xmlWriter.WriteStartElement("Root");
// xmlWriter.WriteEndElement(); // Do not write the end of the root element in the first function, so that later functions can add more elements
xmlWriter.WriteEndDocument(); // Note: WriteEndDocument closes any elements that are still open, including the root
xmlWriter.Flush();
xmlWriter.Flush();
But I found that the underlying memory stream is empty (checked by converting the byte array to a string and printing it). Any ideas why?
Also, I have a general question about how to generate a huge XML file from different sources (functions). What I do now is keep the XmlWriter open (I assume the underlying memory stream stays open as well), pass it to each function, and write. In the first function, I do not write the end of the root element. After the last function, I manually add the end of the root element with:
string endRoot = "</Root>";
byte[] byteEndRoot = Encoding.ASCII.GetBytes(endRoot);
ms.Write(byteEndRoot, 0, byteEndRoot.Length);
Not sure if this works or not.
Thanks a lot!

Technically you should only ask one question per question, so I'm only going to answer the first one because this is just a quick visit to SO for me at the moment.
I think you need to call Flush before attempting to read from the Stream.
Edit
Just bubbling up my second hunch from the comments below to justify the accepted answer here.
In addition to the call to Flush, if reading from the Stream is done using the Read method and its brethren, then the position in the stream must first be reset back to the start. Otherwise no bytes will be read.
ms.Position = 0; /*reset Position to start*/
StreamReader reader = new StreamReader(ms);
string text = reader.ReadToEnd();
Console.WriteLine(text);
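Putting both points together with the multi-function layout from the question, a minimal sketch could look like the following. MultiStepXmlSketch, WriteBeginning and WriteItems are made-up names; the root element is closed through the writer at the very end (instead of appending raw bytes to the stream), which keeps the writer's internal state and the output consistent.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class MultiStepXmlSketch
{
    // First function: opens the document and the root element, nothing else.
    static void WriteBeginning(XmlWriter w)
    {
        w.WriteStartDocument();
        w.WriteStartElement("Root"); // left open on purpose
    }

    // Later functions: add child elements to the still-open root.
    static void WriteItems(XmlWriter w, string name, int count)
    {
        for (int i = 0; i < count; i++)
        {
            w.WriteElementString(name, i.ToString());
        }
    }

    public static string Build()
    {
        var ms = new MemoryStream();
        var xmlWriter = new XmlTextWriter(ms, new UTF8Encoding(false, false));
        xmlWriter.Formatting = Formatting.Indented;

        WriteBeginning(xmlWriter);
        WriteItems(xmlWriter, "A", 2);
        WriteItems(xmlWriter, "B", 1);

        // Close the root through the writer only after the last function.
        xmlWriter.WriteEndElement();
        xmlWriter.WriteEndDocument();
        xmlWriter.Flush();  // push buffered text into the MemoryStream

        ms.Position = 0;    // rewind before reading
        using (var reader = new StreamReader(ms))
        {
            return reader.ReadToEnd();
        }
    }
}
```

For a genuinely huge file, a FileStream (or the response stream) would be a better target than a MemoryStream, since the MemoryStream still holds the whole document in memory.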

Perhaps you need to call Flush() on the XML writer before checking the memory stream.

Make sure you call Flush on the XmlTextWriter before checking the memory stream.
