I have a problem when I try to parse a large JSON file, which is around 200 MB.
I'm doing it with Newtonsoft.Json, and it throws an OutOfMemoryException.
This is my code:
using (StreamReader sr = File.OpenText("path"))
{
    JObject file = (JObject)JToken.ReadFrom(new JsonTextReader(sr));
}
How can I do this? (preferably using JObject)
You can use JsonTextReader to read the JSON in a DataReader-like fashion, as described in this question:
Incremental JSON Parsing in C#
You will have to write your own logic to process the JSON data, but it will certainly solve your memory issues:
using (var reader = new JsonTextReader(File.OpenText("path")))
{
    while (reader.Read())
    {
        // Your logic here (anything you need is in the [reader] object), for instance:
        if (reader.TokenType == JsonToken.StartArray)
        {
            // Process array
            MyMethodToProcessArray(reader);
        }
        else if (reader.TokenType == JsonToken.StartObject)
        {
            // Process object
            MyMethodToProcessObject(reader);
        }
    }
}
You would actually build a recursive JSON parser.
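For illustration, here is a hedged sketch of what those two helper methods might look like. The method names come from the snippet above; the bodies are illustrative only, consuming tokens until the matching end token and recursing into nested containers:

static void MyMethodToProcessArray(JsonTextReader reader)
{
    // Assumes the reader is positioned on a StartArray token.
    while (reader.Read() && reader.TokenType != JsonToken.EndArray)
    {
        if (reader.TokenType == JsonToken.StartObject)
            MyMethodToProcessObject(reader);
        else if (reader.TokenType == JsonToken.StartArray)
            MyMethodToProcessArray(reader);
        // else: a scalar array element is available in reader.Value
    }
}

static void MyMethodToProcessObject(JsonTextReader reader)
{
    // Assumes the reader is positioned on a StartObject token.
    while (reader.Read() && reader.TokenType != JsonToken.EndObject)
    {
        if (reader.TokenType == JsonToken.PropertyName)
        {
            var name = (string)reader.Value;
            reader.Read(); // advance to the property value

            if (reader.TokenType == JsonToken.StartObject)
                MyMethodToProcessObject(reader);
            else if (reader.TokenType == JsonToken.StartArray)
                MyMethodToProcessArray(reader);
            // else: the scalar property value is available in reader.Value
        }
    }
}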
Related
I'm trying to deserialize a list of heavy objects from a JSON file. I don't want to deserialize it the classic way, directly into a list, because that would expose me to an OutOfMemoryException. So I'm looking for a way to handle the objects one by one, storing each in the database as it is read, so the process stays memory safe.
I already handle the serialization and it works well, but I'm facing some difficulties with deserialization.
Any ideas?
Thanks in advance.
// Serialization
using (var fileStream = new FileStream(DirPath + "/TPV.Json", FileMode.Create))
{
    using (var sw = new StreamWriter(fileStream))
    {
        using (var jw = new JsonTextWriter(sw))
        {
            jw.WriteStartArray();
            using (var _Database = new InspectionBatimentsDataContext(TheBrain.DBClient.ConnectionString))
            {
                foreach (var TPVId in TPVIds)
                {
                    var pic = (from p in _Database.TPV
                               where Operators.ConditionalCompareObjectEqual(p.Release, TPVId.Release, false) & Operators.ConditionalCompareObjectEqual(p.InterventionId, TPVId.InterventionId, false)
                               select p).FirstOrDefault();
                    var ser = new JsonSerializer();
                    ser.Serialize(jw, pic);
                    jw.Flush();
                }
            }
            jw.WriteEndArray();
        }
    }
}
I finally found a way to do it by using a custom separator between each object during serialization. Then, for deserialization, I simply read the JSON file as a string until I find my custom separator and deserialize the string I have read, all in a loop. It's not the perfect answer because I'm breaking the JSON format in my files, but that's not a constraint in my case.
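For reference, a streaming alternative that keeps the standard JSON array format is sketched below. It reads the file back one object at a time with JsonTextReader, deserializing only the current element; the TPV element type and the SaveToDatabase call are placeholders for your own entity type and per-object database insert:

using (var sr = File.OpenText(DirPath + "/TPV.Json"))
using (var reader = new JsonTextReader(sr))
{
    var serializer = new JsonSerializer();
    while (reader.Read())
    {
        if (reader.TokenType == JsonToken.StartObject)
        {
            // Only the current element of the array is materialized in memory.
            var pic = serializer.Deserialize<TPV>(reader);
            SaveToDatabase(pic); // placeholder for the per-object database insert
        }
    }
}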
I'm trying to convert a huge JSON file (2 GB) to an XML file, and I'm having trouble reading the huge JSON file.
I've been researching how to read huge JSON files.
I found these:
Out of memory exception while loading large json file from disk
How to parse huge JSON file as stream in Json.NET?
Parsing large json file in .NET
It may seem that I'm duplicating my question, but I have some troubles which aren't solved in those posts.
So, I need to load the huge JSON file, and the community proposes something like this:
using (StreamReader sr = new StreamReader("foo.json"))
using (JsonTextReader reader = new JsonTextReader(sr))
{
    var serializer = new JsonSerializer();
    reader.SupportMultipleContent = true;
    while (reader.Read())
    {
        if (reader.TokenType == JsonToken.StartObject)
        {
            // Deserialize each object from the stream individually and process it
            var o = serializer.Deserialize<MyObject>(reader);
            // Do something with the object
        }
    }
}
So we can read in parts and deserialize the objects one by one.
Here is my code:
JsonSerializer serializer = new JsonSerializer();
string hugeJson = "hugJSON.json";

using (FileStream s = File.Open(hugeJson, FileMode.Open))
{
    using (StreamReader sr = new StreamReader(s))
    {
        using (JsonReader reader = new JsonTextReader(sr))
        {
            reader.SupportMultipleContent = true;
            while (reader.Read())
            {
                if (reader.TokenType == JsonToken.StartObject)
                {
                    var jsonObject = serializer.Deserialize(reader);
                    string xmlString = "";
                    XmlDocument doc = JsonConvert.DeserializeXmlNode(jsonObject.ToString(), "json");
                    using (var stringWriter = new StringWriter())
                    {
                        using (var xmlTextWriter = XmlWriter.Create(stringWriter))
                        {
                            doc.WriteTo(xmlTextWriter);
                            xmlTextWriter.Flush();
                            xmlString = stringWriter.GetStringBuilder().ToString();
                        }
                    }
                }
            }
        }
    }
}
But when I try doc.WriteTo(xmlTextWriter), I get an exception: 'Exception of type System.OutOfMemoryException was thrown'.
I've also been trying with BufferedStream. That class lets me handle big files, but I have another problem:
I'm reading in byte[] format, and when I convert the bytes to a string the JSON is split, so I can't parse it to an XML file because there are characters missing,
for example:
{ foo:[{
foo:something,
foo1:something,
foo2:something
},
{
foo:something,
foo:som
it is cut off.
Is there any way to read a huge JSON file and convert it to XML without loading the JSON in parts? Or I could load and convert it in parts, but I don't know how to do this.
Any ideas?
UPDATE:
I have been trying with this code:
static void Main(string[] args)
{
    string json = "";
    string pathJson = "foo.json";
    string xmlString = "";

    // Read file
    string temp = "";
    using (FileStream fs = new FileStream(pathJson, FileMode.Open))
    {
        using (BufferedStream bf = new BufferedStream(fs))
        {
            byte[] array = new byte[70000];
            int bytesRead;
            while ((bytesRead = bf.Read(array, 0, 70000)) != 0)
            {
                // Only convert the bytes actually read in this pass
                json = Encoding.UTF8.GetString(array, 0, bytesRead);
                temp = String.Concat(temp, json);
            }
        }
    }

    XmlDocument doc = new XmlDocument();
    doc = JsonConvert.DeserializeXmlNode(temp, "json");
    using (var stringWriter = new StringWriter())
    using (var xmlTextWriter = XmlWriter.Create(stringWriter))
    {
        doc.WriteTo(xmlTextWriter);
        xmlTextWriter.Flush();
        xmlString = stringWriter.GetStringBuilder().ToString();
    }
    File.WriteAllText("outputPath", xmlString);
}
This code converts a JSON file to an XML file, but when I try to convert a big JSON file (2 GB), I can't. The process takes a lot of time, and the string doesn't have the capacity to store all the JSON. How can I store it? Is there any way to do this conversion without using the string data type?
UPDATE:
The json format is:
[{
'key':[some things],
'data': [some things],
'data1':[A LOT OF ENTRIES],
'data2':[A LOT OF ENTRIES],
'data3':[some things],
'data4':[some things]
}]
Out-of-memory exceptions in .Net can be caused by several problems including:
Allocating too much total memory.
If this might be happening, check whether you are running in 64-bit mode as described here. If not, rebuild in 64-bit mode as described here and re-test (a quick runtime check is sketched after this list).
Allocating too many objects on the large object heap causing memory fragmentation.
Allocating a single object that is larger than the .Net object size limit.
Failing to dispose of unmanaged memory (not applicable here).
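As a minimal sanity check for the first cause (assuming .NET 4.0 or later, where these properties exist), you can report whether the process is actually running as 64-bit:

// A 32-bit process is limited to a few GB of address space, so check this first.
Console.WriteLine($"64-bit process: {Environment.Is64BitProcess}, 64-bit OS: {Environment.Is64BitOperatingSystem}");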
In your case, you may be trying to allocate too much total memory but are definitely allocating three very large objects: the in-memory temp JSON string, the in-memory xmlString XML string and the in-memory stringWriter.
You can substantially reduce your memory footprint and completely eliminate these objects by constructing an XDocument or XmlDocument directly via a streaming translation from the JSON file. Then afterward, write the document directly to the XML file using XDocument.Save() or XmlDocument.Save().
To do this, you will need to allocate your own XmlNodeConverter, then construct a JsonSerializer using it and deserialize as shown in Deserialize JSON from a file. The following method(s) do the trick:
// Requires: using System.IO; using System.Text; using System.Xml.Linq;
//           using Newtonsoft.Json; using Newtonsoft.Json.Converters;
public static partial class JsonExtensions
{
    public static XDocument LoadXNode(string pathJson, string deserializeRootElementName)
    {
        using (var stream = File.OpenRead(pathJson))
            return LoadXNode(stream, deserializeRootElementName);
    }

    public static XDocument LoadXNode(Stream stream, string deserializeRootElementName)
    {
        // Let caller dispose the underlying streams.
        using (var textReader = new StreamReader(stream, Encoding.UTF8, true, 1024, true))
            return LoadXNode(textReader, deserializeRootElementName);
    }

    public static XDocument LoadXNode(TextReader textReader, string deserializeRootElementName)
    {
        var settings = new JsonSerializerSettings
        {
            Converters = { new XmlNodeConverter { DeserializeRootElementName = deserializeRootElementName } },
        };
        using (var jsonReader = new JsonTextReader(textReader) { CloseInput = false })
            return JsonSerializer.CreateDefault(settings).Deserialize<XDocument>(jsonReader);
    }

    public static void StreamJsonToXml(string pathJson, string pathXml, string deserializeRootElementName, SaveOptions saveOptions = SaveOptions.None)
    {
        var doc = LoadXNode(pathJson, deserializeRootElementName);
        doc.Save(pathXml, saveOptions);
    }
}
Then use them as follows:
JsonExtensions.StreamJsonToXml(pathJson, outputPath, "json");
Here I am using XDocument instead of XmlDocument because I believe (but have not checked personally) that it uses less memory, e.g. as reported in Some hard numbers about XmlDocument, XDocument and XmlReader (x86 versus x64) by Ken Lassesen.
This approach eliminates the three large objects mentioned previously and substantially reduces the chance of running out of memory due to problems #2 or #3.
Demo fiddle here.
If you are still running out of memory even after ensuring you are running in 64-bit mode and streaming directly from and to your file(s) using the methods above, then it may simply be that your XML is too large to fit in your computer's virtual memory space using XDocument or XmlDocument. If that is so, you will need to adopt a pure streaming solution that transforms from JSON to XML on the fly as it streams. Unfortunately, Json.NET does not provide this functionality out of the box, so you will need a more complex solution.
So, what are your options?
You could fork your own version of XmlNodeConverter.cs and rewrite ReadElement(JsonReader reader, IXmlDocument document, IXmlNode currentNode, string propertyName, XmlNamespaceManager manager) to write directly to an XmlWriter instead of an IXmlDocument.
While probably doable with a couple of days' effort, the difficulty would seem to exceed that of a single Stack Overflow answer.
You could use the reader returned by JsonReaderWriterFactory to translate JSON to XML on the fly, and pass that reader directly to XmlWriter.WriteNode(XmlReader). The readers and writers returned by this factory are used internally by DataContractJsonSerializer but can be used directly as well.
If your JSON has a fixed schema (which is unclear from your question) you have many more straightforward options. Incrementally deserializing to some C# data model as shown in Parsing large json file in .NET and re-serializing that model to XML is likely to use much less memory than loading into some generic DOM such as XDocument.
Option #2 can be implemented very simply, as follows:
using (var stream = File.OpenRead(pathJson))
using (var jsonReader = JsonReaderWriterFactory.CreateJsonReader(stream, XmlDictionaryReaderQuotas.Max))
{
    using (var xmlWriter = XmlWriter.Create(outputPath))
    {
        xmlWriter.WriteNode(jsonReader, true);
    }
}
However, the XML thereby produced is much less pretty than the XML generated by XmlNodeConverter. For instance, given the simple input JSON
{"Root":[{
"key":["a"],
"data": [1, 2]
}]}
XmlNodeConverter will create the following XML:
<json>
  <Root>
    <key>a</key>
    <data>1</data>
    <data>2</data>
  </Root>
</json>
While JsonReaderWriterFactory will create the following (indented for clarity):
<root type="object">
  <Root type="array">
    <item type="object">
      <key type="array">
        <item type="string">a</item>
      </key>
      <data type="array">
        <item type="number">1</item>
        <item type="number">2</item>
      </data>
    </item>
  </Root>
</root>
The exact format of the XML generated can be found in
Mapping Between JSON and XML.
Still, once you have valid XML, there are streaming XML-to-XML transformation solutions that will allow you to transform the generated XML to your final, desired format, including:
C# XSLT Transforming Large XML Files Quickly.
How to: Perform Streaming Transform of Large XML Documents (C#).
Combining the XmlReader and XmlWriter classes for simple streaming transformations.
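For instance, here is a minimal sketch in the spirit of the last approach, assuming the generated XML has been written to a file and that simply dropping the type attributes is the desired transformation (file names are placeholders):

// Copy elements and text with XmlReader/XmlWriter, omitting all attributes
// (including the "type" attributes added by the JSON reader).
using (var reader = XmlReader.Create("generated.xml"))
using (var writer = XmlWriter.Create("transformed.xml", new XmlWriterSettings { Indent = true }))
{
    while (reader.Read())
    {
        switch (reader.NodeType)
        {
            case XmlNodeType.Element:
                writer.WriteStartElement(reader.LocalName);
                if (reader.IsEmptyElement)
                    writer.WriteEndElement();
                break;
            case XmlNodeType.Text:
                writer.WriteString(reader.Value);
                break;
            case XmlNodeType.EndElement:
                writer.WriteEndElement();
                break;
        }
    }
}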
Is it possible to go the other way?
Unfortunately
JsonReaderWriterFactory.CreateJsonWriter().WriteNode(xmlReader, true);
isn't really suited for conversion of arbitrary XML to JSON as it only allows for conversion of XML with the precise schema specified by Mapping Between JSON and XML.
Furthermore, when converting from arbitrary XML to JSON the problem of array recognition exists: JSON has arrays, XML doesn't, it only has repeating elements. To recognize repeating elements (or tuples of elements where identically named elements may not be adjacent) and convert them to JSON array(s) requires buffering either the XML input or the JSON output (or a complex two-pass algorithm). Mapping Between JSON and XML avoids the problem by requiring type="object" or type="array" attributes.
I have an OData response as JSON (which is a few MB) and the requirement is to stream certain parts of the JSON without even loading them into memory.
For example: when I'm reading the property "value[0].Body.Content" in the JSON below (which will be MBs in size), I want to stream this value part without deserializing it into an object of type string. So basically, read the value part into a fixed-size byte array and write that byte array to the destination stream (repeating the step until that data is finished processing).
JSON:
{
  "#odata.context": "https://localhost:5555/api/v2.0/$metadata#Me/Messages",
  "value": [
    {
      "#odata.id": "https://localhost:5555/api/v2.0/",
      "#odata.etag": "W/\"Something\"",
      "Id": "vccvJHDSFds43hwy98fh",
      "CreatedDateTime": "2018-12-01T01:47:53Z",
      "LastModifiedDateTime": "2018-12-01T01:47:53Z",
      "ChangeKey": "SDgf43tsdf",
      "WebLink": "https://localhost:5555/?ItemID=dfsgsdfg9876ijhrf",
      "Body": {
        "ContentType": "HTML",
        "Content": "<html>\r\n<body>Huge Data Here\r\n</body>\r\n</html>\r\n"
      },
      "ToRecipients": [{
        "EmailAddress": {
          "Name": "ME",
          "Address": "me#me.com"
        }
      }],
      "CcRecipients": [],
      "BccRecipients": [],
      "ReplyTo": [],
      "Flag": {
        "FlagStatus": "NotFlagged"
      }
    }
  ],
  "#odata.nextLink": "http://localhost:5555/rest/jersey/sleep?%24filter=LastDeliveredDateTime+ge+2018-12-01+and+LastDeliveredDateTime+lt+2018-12-02&%24top=50&%24skip=50"
}
Approaches Tried:
1. Newtonsoft
I initially tried using Newtonsoft streaming, but it internally converts the data into a string and loads it into memory. (This results in the LOH shooting up and memory not getting released until compaction happens. We have a memory limit for our worker process and cannot keep this in memory.)
Code:
using (var jsonTextReader = new JsonTextReader(sr))
{
    var pool = new CustomArrayPool();

    // Checking if pooling will help with memory
    jsonTextReader.ArrayPool = pool;

    while (jsonTextReader.Read())
    {
        if (jsonTextReader.TokenType == JsonToken.PropertyName
            && ((string)jsonTextReader.Value).Equals("value"))
        {
            jsonTextReader.Read();

            if (jsonTextReader.TokenType == JsonToken.StartArray)
            {
                while (jsonTextReader.Read())
                {
                    if (jsonTextReader.TokenType == JsonToken.StartObject)
                    {
                        var Current = JToken.Load(jsonTextReader);

                        // By now, the LOH shoots up.
                        // Want to avoid the code below, which converts this JToken back to a byte array.
                        var bytes = Encoding.ASCII.GetBytes(Current.ToString());
                        destinationStream.Write(bytes, 0, bytes.Length);
                    }
                    else if (jsonTextReader.TokenType == JsonToken.EndArray)
                    {
                        break;
                    }
                }
            }
        }

        if (jsonTextReader.TokenType == JsonToken.StartObject)
        {
            var Current = JToken.Load(jsonTextReader);

            // Do some processing with Current
            var bytes = Encoding.ASCII.GetBytes(Current.ToString());
            destinationStream.Write(bytes, 0, bytes.Length);
        }
    }
}
2. OData.Net
I was thinking this might be doable using the OData.Net library, as it looks like it supports streaming of string fields. But I couldn't get far, as I ended up creating a model for the data, which would mean the value would get converted into one string object of several MB.
Code:
ODataMessageReaderSettings settings = new ODataMessageReaderSettings();
IODataResponseMessage responseMessage = new InMemoryMessage { Stream = stream };
responseMessage.SetHeader("Content-Type", "application/json;odata.metadata=minimal;");
// ODataMessageReader reader = new ODataMessageReader((IODataResponseMessage)message, settings, GetEdmModel());
ODataMessageReader reader = new ODataMessageReader(responseMessage, settings, new EdmModel());
var oDataResourceReader = reader.CreateODataResourceReader();
var property = reader.ReadProperty();
Any idea how to parse this JSON in parts using OData.Net/Newtonsoft and stream the values of certain fields? Is the only way to do this to manually parse the stream?
If you are copying portions of JSON from one stream to another, you can do this more efficiently with JsonWriter.WriteToken(JsonReader) thus avoiding the intermediate Current = JToken.Load(jsonTextReader) and Encoding.ASCII.GetBytes(Current.ToString()) representations and their associated memory overhead:
using (var textWriter = new StreamWriter(destinationStream, new UTF8Encoding(false, true), 1024, true))
using (var jsonWriter = new JsonTextWriter(textWriter) { Formatting = Formatting.Indented, CloseOutput = false })
{
    // Use Formatting.Indented or Formatting.None as required.
    jsonWriter.WriteToken(jsonTextReader);
}
However, Json.NET's JsonTextReader does not have the ability to read a single string value in "chunks" in the same way as XmlReader.ReadValueChunk(). It will always fully materialize each atomic string value. If your strings values are so large that they are going on the large object heap, even using JsonWriter.WriteToken() will not prevent these strings from being completely loaded into memory.
As an alternative, you might consider the readers and writers returned by JsonReaderWriterFactory. These readers and writers are used by DataContractJsonSerializer and translate JSON to XML on-the-fly as it is being read and written. Since the base classes for these readers and writers are XmlReader and XmlWriter, they do support reading and writing string values in chunks. Using them appropriately will avoid allocation of strings in the large object heap.
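For illustration only, here is a hedged sketch of chunked value reading with the reader returned by JsonReaderWriterFactory; the buffer size and the stream variables are placeholders, and it simply copies every text value to a writer in fixed-size pieces rather than materializing each value as a single string:

using (var xr = JsonReaderWriterFactory.CreateJsonReader(inputStream, XmlDictionaryReaderQuotas.Max))
using (var textWriter = new StreamWriter(destinationStream, Encoding.UTF8, 1024, true))
{
    var buffer = new char[4096];
    while (xr.Read())
    {
        if (xr.NodeType == XmlNodeType.Text && xr.CanReadValueChunk)
        {
            int charsRead;
            // Pull the text content in fixed-size chunks instead of one large string.
            while ((charsRead = xr.ReadValueChunk(buffer, 0, buffer.Length)) > 0)
                textWriter.Write(buffer, 0, charsRead);
        }
    }
}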
To copy only the parts you want, first define the following extension methods, which copy a selected subset of JSON value(s) from an input stream to an output stream, as specified by a path to the data to be streamed:
public static class JsonExtensions
{
    public static void StreamNested(Stream from, Stream to, string[] path)
    {
        var reversed = path.Reverse().ToArray();

        using (var xr = JsonReaderWriterFactory.CreateJsonReader(from, XmlDictionaryReaderQuotas.Max))
        {
            foreach (var subReader in xr.ReadSubtrees(s => s.Select(n => n.LocalName).SequenceEqual(reversed)))
            {
                using (var xw = JsonReaderWriterFactory.CreateJsonWriter(to, Encoding.UTF8, false))
                {
                    subReader.MoveToContent();
                    xw.WriteStartElement("root");
                    xw.WriteAttributes(subReader, true);
                    subReader.Read();
                    while (!subReader.EOF)
                    {
                        if (subReader.NodeType == XmlNodeType.Element && subReader.Depth == 1)
                            xw.WriteNode(subReader, true);
                        else
                            subReader.Read();
                    }
                    xw.WriteEndElement();
                }
            }
        }
    }
}
public static class XmlReaderExtensions
{
    public static IEnumerable<XmlReader> ReadSubtrees(this XmlReader xmlReader, Predicate<Stack<XName>> filter)
    {
        Stack<XName> names = new Stack<XName>();
        while (xmlReader.Read())
        {
            if (xmlReader.NodeType == XmlNodeType.Element)
            {
                names.Push(XName.Get(xmlReader.LocalName, xmlReader.NamespaceURI));
                if (filter(names))
                {
                    using (var subReader = xmlReader.ReadSubtree())
                    {
                        yield return subReader;
                    }
                }
            }
            if ((xmlReader.NodeType == XmlNodeType.Element && xmlReader.IsEmptyElement)
                || xmlReader.NodeType == XmlNodeType.EndElement)
            {
                names.Pop();
            }
        }
    }
}
Now, the string [] path argument to StreamNested() is not any sort of jsonpath path. Instead, it is a path corresponding to the hierarchy of XML elements corresponding to the JSON you want to select as translated by the XmlReader returned by JsonReaderWriterFactory.CreateJsonReader(). The mapping used for this translation is, in turn, documented by Microsoft in Mapping Between JSON and XML. To select and stream only those JSON values matching value[*], the XML path required is //root/value/item. Thus, you can select and stream your desired nested objects by doing:
JsonExtensions.StreamNested(inputStream, destinationStream, new[] { "root", "value", "item" });
Notes:
Mapping Between JSON and XML is somewhat complex. It's often easier just to load some sample JSON into an XDocument using the following helper method:
static XDocument ParseJsonAsXDocument(string json)
{
    using (var xr = JsonReaderWriterFactory.CreateJsonReader(new MemoryStream(Encoding.UTF8.GetBytes(json)), Encoding.UTF8, XmlDictionaryReaderQuotas.Max, null))
    {
        return XDocument.Load(xr);
    }
}
And then determine the correct XML path observationally.
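For example, a hedged usage sketch (the JSON literal is a stand-in for your own sample data):

var sample = "{\"value\":[{\"Id\":\"1\"}]}";
var doc = ParseJsonAsXDocument(sample);
// The element hierarchy printed here (root/value/item/...) is what goes into the
// path argument of StreamNested().
Console.WriteLine(doc);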
For a related question, see JObject.SelectToken Equivalent in .NET.
I need to parse a big XML file (≈100 MB, 400,000 rows) and then put the data into a List.
First I tried parsing the XML file in a C# console application, and it took about 2 s to finish the job.
Then I copied the code into Unity, and it took about 17 s to finish the job.
Does anybody know why it becomes so slow? And how can I make it faster?
Thanks!
Code:
// Used for storing data
stateList = new List<BallisticState>();

XmlTextReader reader = new XmlTextReader(filepath);
while (reader.Read())
{
    if (reader.NodeType == XmlNodeType.Element)
    {
        if (reader.Name == "row")
        {
            string value = reader.GetAttribute("value").TrimStart();
            BallisticState state = new BallisticState();
            // this method converts string to float
            SetBallisticState(state, value);
            stateList.Add(state);
        }
    }
}
I am trying to process a very large amount of data (~1000 separate files, each of them ~30 MB) to use as input to the training phase of a machine learning algorithm. The raw data files are formatted as JSON, and I deserialize them using the JsonSerializer class of Json.NET. Towards the end of the program, Newtonsoft.Json.dll throws an OutOfMemoryException. Is there a way to reduce the data in memory, or do I have to change my whole approach (such as switching to a big data framework like Spark) to handle this problem?
public static List<T> DeserializeJsonFiles<T>(string path)
{
    if (string.IsNullOrWhiteSpace(path))
        return null;

    var jsonObjects = new List<T>();
    //var sw = new Stopwatch();
    try
    {
        //sw.Start();
        foreach (var filename in Directory.GetFiles(path))
        {
            using (var streamReader = new StreamReader(filename))
            using (var jsonReader = new JsonTextReader(streamReader))
            {
                jsonReader.SupportMultipleContent = true;
                var serializer = new JsonSerializer();
                while (jsonReader.Read())
                {
                    if (jsonReader.TokenType != JsonToken.StartObject)
                        continue;

                    var jsonObject = serializer.Deserialize<dynamic>(jsonReader);
                    var reducedObject = ApplyFiltering(jsonObject); // returns null if the filtering conditions are not met
                    if (reducedObject == null)
                        continue;

                    jsonObject = reducedObject;
                    jsonObjects.Add(jsonObject);
                }
            }
        }
        //sw.Stop();
        //Console.WriteLine($"Elapsed time: {sw.Elapsed}, Elapsed mili: {sw.ElapsedMilliseconds}");
    }
    catch (Exception ex)
    {
        Console.WriteLine($"Error: {ex}");
        return null;
    }
    return jsonObjects;
}
Thanks.
It's not really a problem with Newtonsoft. You are reading all of these objects into one big list in memory. It gets to a point where you ask the JsonSerializer to create another object and it fails.
You need to return IEnumerable<T> from your method, yield return each object, and deal with them in the calling code without storing them in memory. That means iterating the IEnumerable<T>, processing each item, and writing to disk or wherever they need to end up.
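A hedged sketch of that shape is shown below. It mirrors the original method but streams results instead of collecting them; ApplyFiltering is the same hypothetical filter from the question and is assumed to return null when an item should be skipped:

public static IEnumerable<T> DeserializeJsonFilesLazy<T>(string path)
{
    // No try/catch here: C# does not allow yield return inside a try block that has a catch clause.
    foreach (var filename in Directory.GetFiles(path))
    {
        using (var streamReader = new StreamReader(filename))
        using (var jsonReader = new JsonTextReader(streamReader) { SupportMultipleContent = true })
        {
            var serializer = new JsonSerializer();
            while (jsonReader.Read())
            {
                if (jsonReader.TokenType != JsonToken.StartObject)
                    continue;

                var jsonObject = serializer.Deserialize<dynamic>(jsonReader);
                var reducedObject = ApplyFiltering(jsonObject); // returns null if the filtering conditions are not met
                if (reducedObject == null)
                    continue;

                // Hand the object to the caller immediately instead of accumulating a List<T>.
                yield return reducedObject;
            }
        }
    }
}

The caller would then foreach over DeserializeJsonFilesLazy<T>(path), persist each item, and let it go out of scope so the garbage collector can reclaim it before the next one is read.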