I have a very large JSON file (1000+ MB) containing a list of identically structured JSON objects. For example:
[
{
"id": 1,
"value": "hello",
"another_value": "world",
"value_obj": {
"name": "obj1"
},
"value_list": [
1,
2,
3
]
},
{
"id": 2,
"value": "foo",
"another_value": "bar",
"value_obj": {
"name": "obj2"
},
"value_list": [
4,
5,
6
]
},
{
"id": 3,
"value": "a",
"another_value": "b",
"value_obj": {
"name": "obj3"
},
"value_list": [
7,
8,
9
]
},
...
]
Every single item in the root JSON list follows the same structure and thus would be individually deserializable. I already have the C# classes written to receive this data, and deserializing a JSON file containing a single object without the list works as expected.
At first, I tried to just directly deserialize my objects in a loop:
JsonSerializer serializer = new JsonSerializer();
MyObject o;
using (FileStream s = File.Open("bigfile.json", FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
while (!sr.EndOfStream)
{
o = serializer.Deserialize<MyObject>(reader);
}
}
This didn't work: it threw an exception clearly stating that an object was expected, not a list. My understanding is that this call reads a single object at the root level of the JSON file, and since the root here is a list of objects, the request is invalid.
My next idea was to deserialize as a C# List of objects:
JsonSerializer serializer = new JsonSerializer();
List<MyObject> o;
using (FileStream s = File.Open("bigfile.json", FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
while (!sr.EndOfStream)
{
o = serializer.Deserialize<List<MyObject>>(reader);
}
}
This does succeed. However, it only somewhat reduces the issue of high RAM usage. In this case it does look like the application is deserializing items one at a time, and so is not reading the entire JSON file into RAM, but we still end up with a lot of RAM usage because the C# List object now contains all of the data from the JSON file in RAM. This has only displaced the problem.
I then decided to simply try taking a single character off the beginning of the stream (to eliminate the [) by doing sr.Read() before going into the loop. The first object then does read successfully, but subsequent ones do not, with an exception of "unexpected token". My guess is this is the comma and space between the objects throwing the reader off.
Simply removing square brackets won't work since the objects do contain a primitive list of their own, as you can see in the sample. Even trying to use }, as a separator won't work since, as you can see, there are sub-objects within the objects.
My goal is to read the objects from the stream one at a time: read an object, do something with it, discard it from RAM, then read the next object, and so on. This would eliminate the need to load either the entire JSON string or the entire contents of the data into RAM as C# objects.
What am I missing?
This should resolve your problem. It works just like your initial code, except that it only deserializes an object when the reader is positioned on a { token; otherwise it keeps reading until it finds the next start-object token.
JsonSerializer serializer = new JsonSerializer();
MyObject o;
using (FileStream s = File.Open("bigfile.json", FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
while (reader.Read())
{
// deserialize only when there's "{" character in the stream
if (reader.TokenType == JsonToken.StartObject)
{
o = serializer.Deserialize<MyObject>(reader);
}
}
}
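The same loop can be wrapped in an iterator so that callers simply foreach over the items. This is a minimal sketch, assuming Json.NET (Newtonsoft.Json) is referenced; the names JsonStreaming and ReadArrayItems are hypothetical helpers, not part of the library.

```csharp
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

public static class JsonStreaming
{
    // Hypothetical helper: lazily yields each object found in a root-level JSON array.
    // Only the item currently being yielded is materialized in memory.
    public static IEnumerable<T> ReadArrayItems<T>(TextReader textReader)
    {
        var serializer = new JsonSerializer();
        using (var reader = new JsonTextReader(textReader))
        {
            while (reader.Read())
            {
                // Deserialize consumes the whole object, including nested
                // objects and lists, and leaves the reader on its end token.
                if (reader.TokenType == JsonToken.StartObject)
                {
                    yield return serializer.Deserialize<T>(reader);
                }
            }
        }
    }
}
```

A caller could then write foreach (var o in JsonStreaming.ReadArrayItems<MyObject>(File.OpenText("bigfile.json"))) and process one item per iteration. Because the method is an iterator, nothing is read until enumeration starts.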
I think we can do better than the accepted answer, using more features of JsonReader to make a more generalized solution.
As a JsonReader consumes tokens from a JSON, the path is recorded in the JsonReader.Path property.
We can use this to precisely select deeply nested data from a JSON file, using regex to ensure that we're on the right path.
So, using the following extension method:
public static class JsonReaderExtensions
{
public static IEnumerable<T> SelectTokensWithRegex<T>(
this JsonReader jsonReader, Regex regex)
{
JsonSerializer serializer = new JsonSerializer();
while (jsonReader.Read())
{
if (regex.IsMatch(jsonReader.Path)
&& jsonReader.TokenType != JsonToken.PropertyName)
{
yield return serializer.Deserialize<T>(jsonReader);
}
}
}
}
The data you are concerned with lies on paths:
[0]
[1]
[2]
... etc
We can construct the following regex to precisely match this path:
var regex = new Regex(@"^\[\d+\]$");
It now becomes possible to stream objects out of your data (without fully loading or parsing the entire JSON) as follows:
IEnumerable<MyObject> objects = jsonReader.SelectTokensWithRegex<MyObject>(regex);
Or, if we want to dig even deeper into the structure, we can be even more precise with our regex:
var regex = new Regex(@"^\[\d+\]\.value$");
IEnumerable<string> objects = jsonReader.SelectTokensWithRegex<string>(regex);
to only extract value properties from the items in the array.
I've found this technique extremely useful for extracting specific data from huge (100 GiB) JSON dumps, directly from HTTP using a network stream (with low memory requirements and no intermediate storage required).
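To see which paths the reader reports before writing a regex, it can help to dump the token type and Path for a small sample of the data. The following is a rough sketch, assuming Json.NET; JsonPathDump and DescribeTokens are hypothetical names used only for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

public static class JsonPathDump
{
    // Hypothetical helper: returns one "TokenType @ 'Path'" line per token in the input.
    public static List<string> DescribeTokens(string json)
    {
        var lines = new List<string>();
        using (var reader = new JsonTextReader(new StringReader(json)))
        {
            while (reader.Read())
            {
                lines.Add($"{reader.TokenType} @ '{reader.Path}'");
            }
        }
        return lines;
    }
}
```

Running this over a one-item sample such as [{"value":"a"}] shows the start of the item at path [0] and its string value at path [0].value, which are the paths the two regexes above target.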
.NET 6
This is easily done with the System.Text.Json.JsonSerializer in .NET 6:
using (FileStream fileStream = new FileStream("hugefile.json", FileMode.Open))
{
IAsyncEnumerable<Person?> people = JsonSerializer.DeserializeAsyncEnumerable<Person?>(fileStream);
await foreach (Person? person in people)
{
// person can be null if the array contains a JSON null
Console.WriteLine($"Hello, my name is {person?.Name}!");
}
}
Here is another easy way to parse a large JSON file, using Cinchoo ETL, an open source library (it uses JSON.NET under the hood to parse the JSON in a streaming manner):
using (var r = ChoJSONReader<MyObject>.LoadText(json))
{
foreach (var rec in r)
Console.WriteLine(rec.Dump());
}
Sample fiddle: https://dotnetfiddle.net/i5qJ5R
Is this what you're looking for? Found on a previous question
The current version of Json.net does not allow you to use the accepted answer code. A current alternative is:
public static object DeserializeFromStream(Stream stream)
{
var serializer = new JsonSerializer();
using (var sr = new StreamReader(stream))
using (var jsonTextReader = new JsonTextReader(sr))
{
return serializer.Deserialize(jsonTextReader);
}
}
Documentation: Deserialize JSON from a file stream
Related
I'm trying to deserialize a pretty ugly JSON provided by an external REST API and am wondering about the "proper" way to do that (I'm using System.Text.Json in .net 6). Details follow:
I have a model for the data:
class DeviceData{
//lots of properties
}
which works fine (i.e I can just JsonSerializer.Deserialize<DeviceData> the response) when making an API query for a single instance, since it returns a nice JSON one would expect:
{
"property1_name": value,
"property2_name": value,
...
}
The problem begins when I use the batch query provided by the API. The response to api_url/batch?=device1,device2,... looks as if someone failed to make an array (the device names are alphanumeric strings pulled from a database):
{
"names":[
"device1",
"device2",
...
],
"device1":{
"stuff_i_dont_need": value,
"device1": {
"property1_name": value,
"property2_name": value,
...
}
},
"device2":{
...
}
...
}
The double nesting of dynamic property names means I can't just deserialize the second response as a dictionary of <string, myclass> pairs. I managed to hack something together using JsonDocument, but it's extremely ugly, and it feels like there should be a nice short way to do that with just JsonSerializer and maybe some reader overrides.
Using Deserialize subsections of a JSON payload from How to use a JSON document, Utf8JsonReader, and Utf8JsonWriter in System.Text.Json as template you could do something like this:
JsonNode root = JsonNode.Parse(json)!;
Dictionary<string, X> devices = new();
foreach(string name in root["names"]!.AsArray()) {
var o = root[name][name].AsObject();
using var stream = new MemoryStream();
using var writer = new Utf8JsonWriter(stream);
o.WriteTo(writer);
writer.Flush();
X? x = JsonSerializer.Deserialize<X>(stream.ToArray());
devices[name] = x;
}
foreach(var d in devices) Console.WriteLine($"{d.Key}: {d.Value}");
This prints
device1: X { property1_name = 12, property2_name = 13 }
device2: X { property1_name = 22, property2_name = 23 }
I'm not sure if this is faster/better than calling ToJsonString():
JsonNode root = JsonNode.Parse(json)!;
Dictionary<string, X> devices = new();
foreach(string name in root["names"]!.AsArray()) {
var innerJson = root[name][name].ToJsonString();
devices[name] = JsonSerializer.Deserialize<X>(innerJson);
}
foreach(var d in devices) Console.WriteLine($"{d.Key}: {d.Value}");
If you're after fancy you could go full LINQ:
JsonNode root = JsonNode.Parse(json)!;
Dictionary<string, X> devices = root["names"]!.AsArray()
.Select(name => (string)name)
.ToDictionary(
keySelector: name => name,
elementSelector: name => System.Text.Json.JsonSerializer.Deserialize<X>(root[name][name].ToJsonString()));
foreach(var d in devices) Console.WriteLine($"{d.Key}: {d.Value}");
Both print the same output as the first version.
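Another variation skips the intermediate string or stream entirely by deserializing the inner node directly. This is a sketch under a few assumptions: it targets .NET 6 or later, where System.Text.Json provides a Deserialize<T> extension method on JsonNode, and the record X and the BatchParser class are hypothetical stand-ins for the real model.

```csharp
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Nodes;

// Hypothetical model; property names mirror the sample output above.
public record X(int property1_name, int property2_name);

public static class BatchParser
{
    // Hypothetical helper: maps each entry in "names" to its doubly nested device object.
    public static Dictionary<string, X> ParseDevices(string json)
    {
        JsonNode root = JsonNode.Parse(json)!;
        var devices = new Dictionary<string, X>();
        foreach (JsonNode? nameNode in root["names"]!.AsArray())
        {
            string name = nameNode!.GetValue<string>();
            // root[name] is the outer wrapper; root[name][name] holds the actual data.
            devices[name] = root[name]![name]!.Deserialize<X>()!;
        }
        return devices;
    }
}
```

The null-forgiving operators assume the payload is well formed; real code would probably want explicit checks for missing names or wrappers.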
I am simply trying to serialize and deserialize a string array in Bson format using Json.NET, but the following code fails:
var jsonSerializer = new JsonSerializer();
var array = new string [] { "A", "B" };
// Serialization
byte[] bytes;
using (var ms = new MemoryStream())
using (var bson = new BsonWriter(ms))
{
jsonSerializer.Serialize(bson, array, typeof(string[]));
bytes = ms.ToArray();
}
// Deserialization
using (var ms = new MemoryStream(bytes))
using (var bson = new BsonReader(ms))
{
// Exception here
array = jsonSerializer.Deserialize<string[]>(bson);
}
Exception message:
Cannot deserialize the current JSON object (e.g. {"name":"value"}) into type 'System.String[]' because the type requires a JSON array (e.g. [1,2,3]) to deserialize correctly.
To fix this error either change the JSON to a JSON array (e.g. [1,2,3]) or change the deserialized type so that it is a normal .NET type (e.g. not a primitive type like integer, not a collection type like an array or List) that can be deserialized from a JSON object. JsonObjectAttribute can also be added to the type to force it to deserialize from a JSON object.
How can I get this to work?
Set ReadRootValueAsArray to true on BsonReader
http://james.newtonking.com/projects/json/help/index.html?topic=html/P_Newtonsoft_Json_Bson_BsonReader_ReadRootValueAsArray.htm
This setting is required because the BSON data spec doesn't save metadata about whether the root value is an object or an array.
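Applied to the code in the question, the fix might look like the sketch below. It assumes the Newtonsoft.Json.Bson package, which provides BsonDataReader and BsonDataWriter; on Json.NET 9 and earlier the equivalent types are BsonReader and BsonWriter. The class and method names are hypothetical.

```csharp
using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Bson;

public static class BsonArrayRoundTrip
{
    // Hypothetical helper: serializes a string array to BSON and reads it back.
    public static string[] RoundTrip(string[] input)
    {
        var serializer = new JsonSerializer();

        var output = new MemoryStream();
        using (var writer = new BsonDataWriter(output))
        {
            serializer.Serialize(writer, input);
        }
        // ToArray still works after the writer has closed the stream.
        byte[] bytes = output.ToArray();

        using (var ms = new MemoryStream(bytes))
        using (var reader = new BsonDataReader(ms))
        {
            // Tell the reader that the root document represents an array, not an object.
            reader.ReadRootValueAsArray = true;
            return serializer.Deserialize<string[]>(reader);
        }
    }
}
```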
Hmmm, from where I sit, your code should work, but Json.Net seems to think that your serialized array of strings is a dictionary. This could be because, according to the BSON specification, arrays actually do get serialized as a list of key-value pairs just like objects do. The keys in this case are simply the string representations of the array index values.
In any case, I was able to work around the issue in a couple of different ways:
Deserialize to a Dictionary and then manually convert it back to an array.
var jsonSerializer = new JsonSerializer();
var array = new string[] { "A", "B" };
// Serialization
byte[] bytes;
using (var ms = new MemoryStream())
using (var bson = new BsonWriter(ms))
{
jsonSerializer.Serialize(bson, array);
bytes = ms.ToArray();
}
// Deserialization
using (var ms = new MemoryStream(bytes))
using (var bson = new BsonReader(ms))
{
var dict = jsonSerializer.Deserialize<Dictionary<string, string>>(bson);
array = dict.OrderBy(kvp => kvp.Key).Select(kvp => kvp.Value).ToArray();
}
Wrap the array in an outer object.
class Wrapper
{
public string[] Array { get; set; }
}
Then serialize and deserialize using the wrapper object.
var jsonSerializer = new JsonSerializer();
var obj = new Wrapper { Array = new string[] { "A", "B" } };
// Serialization
byte[] bytes;
using (var ms = new MemoryStream())
using (var bson = new BsonWriter(ms))
{
jsonSerializer.Serialize(bson, obj);
bytes = ms.ToArray();
}
// Deserialization
using (var ms = new MemoryStream(bytes))
using (var bson = new BsonReader(ms))
{
obj = jsonSerializer.Deserialize<Wrapper>(bson);
}
Hope this helps.
As explained in this answer by James Newton-King, the BSON format doesn't save metadata about whether the root value is a collection, making it necessary to set BsonDataReader.ReadRootValueAsArray appropriately before beginning to deserialize.
One easy way to do this, when deserializing to some known POCO type (rather than dynamic or JToken), is to initialize the reader based on whether the root type will be serialized using an array contract. The following extension methods do this:
public static partial class BsonExtensions
{
public static T DeserializeFromFile<T>(string path, JsonSerializerSettings settings = null)
{
using (var stream = new FileStream(path, FileMode.Open))
return Deserialize<T>(stream, settings);
}
public static T Deserialize<T>(byte [] data, JsonSerializerSettings settings = null)
{
using (var stream = new MemoryStream(data))
return Deserialize<T>(stream, settings);
}
public static T Deserialize<T>(byte [] data, int index, int count, JsonSerializerSettings settings = null)
{
using (var stream = new MemoryStream(data, index, count))
return Deserialize<T>(stream, settings);
}
public static T Deserialize<T>(Stream stream, JsonSerializerSettings settings = null)
{
// Use BsonReader in Json.NET 9 and earlier.
using (var reader = new BsonDataReader(stream) { CloseInput = false }) // Let caller dispose the stream
{
var serializer = JsonSerializer.CreateDefault(settings);
//https://www.newtonsoft.com/json/help/html/DeserializeFromBsonCollection.htm
if (serializer.ContractResolver.ResolveContract(typeof(T)) is JsonArrayContract)
reader.ReadRootValueAsArray = true;
return serializer.Deserialize<T>(reader);
}
}
}
Now you can simply do:
var newArray = BsonExtensions.Deserialize<string []>(bytes);
Notes:
BSON support was moved to its own package, Newtonsoft.Json.Bson, in Json.NET 10.0.1. In this version and later versions BsonDataReader replaces the now-obsolete BsonReader.
The same extension methods can be used to deserialize a dictionary, e.g.:
var newDictionary = BsonExtensions.Deserialize<SortedDictionary<int, string>>(bytes);
By checking the contract type ReadRootValueAsArray is set appropriately.
Demo fiddle here.
In general, you could check the data type before setting ReadRootValueAsArray to true, like this:
if (typeof(IEnumerable).IsAssignableFrom(type))
bsonReader.ReadRootValueAsArray = true;
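One caveat with a plain IEnumerable check is that string and dictionary types also implement IEnumerable, yet they are not written as a root-level array. A slightly stricter heuristic could look like the sketch below; BsonRootKind and LooksLikeArrayRoot are hypothetical names, and checking the serializer's contract type is likely to be more robust for custom collection types.

```csharp
using System;
using System.Collections;

public static class BsonRootKind
{
    // Hypothetical heuristic: true for collection-like types that serialize as arrays.
    public static bool LooksLikeArrayRoot(Type type)
    {
        // string is IEnumerable<char> but is serialized as a single value.
        if (type == typeof(string)) return false;
        // Dictionaries are enumerable but are serialized as objects (documents).
        if (typeof(IDictionary).IsAssignableFrom(type)) return false;
        return typeof(IEnumerable).IsAssignableFrom(type);
    }
}
```

The result would then be assigned to ReadRootValueAsArray before deserializing.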
I know this is an old thread, but I discovered an easy way to deserialize using MongoDB.Driver.
You can use BsonDocument.Parse(jsonString) to deserialize a JSON object, so to deserialize a string array, wrap it in an object first:
string jsonArray = "[\"value1\", \"value2\", \"value3\"]";
BsonArray deserializedArray = BsonDocument.Parse("{\"arr\":" + jsonArray + "}")["arr"].AsBsonArray;
deserializedArray can then be used like any other collection, for example in a foreach loop.
With the Algolia online cloud search engine their examples work fine.
// Load JSON file ( from file system )
StreamReader re = File.OpenText("contacts.json");
JsonTextReader reader = new JsonTextReader(re);
JArray batch = JArray.Load(reader);
// Add objects
Index index = client.InitIndex("contacts");
index.AddObjects(batch);
What I want to do is take a C# class whose properties I have serialized to JSON and somehow load it as a JArray to send to Algolia.
// works fine
var json = new JavaScriptSerializer().Serialize(boom);
JArray batch = JArray.Parse(json); // breaks
Index index = client.InitIndex("myindex");
index.AddObjects(batch);
This breaks
JArray batch = JArray.Parse(json);
It's most likely failing because boom is not an array. What you can do is put boom in an anonymous array, and serialize that instead:
var json = new JavaScriptSerializer().Serialize(new[] { boom });
var batch = JArray.Parse(json);
Even better, you can skip over the serialization and create the JArray immediately from your object:
var batch = JArray.FromObject(new[] { boom });
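As a small illustration of the second approach, the sketch below wraps a single object in a one-element array before converting it. It assumes Json.NET; BatchBuilder and ToBatch are hypothetical names, and since the type of boom is not shown in the question, any serializable object is accepted.

```csharp
using Newtonsoft.Json.Linq;

public static class BatchBuilder
{
    // Hypothetical helper: turns a single object into a one-element JArray.
    public static JArray ToBatch(object item)
    {
        // Wrapping in an array makes the root token a JSON array, which JArray requires.
        return JArray.FromObject(new[] { item });
    }
}
```

For example, BatchBuilder.ToBatch(new { name = "Ada" }) produces a JArray containing one JObject, which is the shape the AddObjects call in the question expects.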
I am trying to store a collection of lists (each containing over 20,000 ints) and was hoping to use a nested list for this, since a new list will be added each day.
Eventually I need to access the data in the following way:
"Take the first value of each list and compile a new list".
Ideally I'd like to serialize a List<List<int>>, but this does not seem to work (I can serialize a List<int>). Is there a trick to doing this (preferably without any add-ons)?
If not, how would you advise me to store such data efficiently and quickly?
The way I try it now:
static void saveFunction(List<int> data, string name)
{
using (Stream stream = File.Open(name + ".bin", FileMode.OpenOrCreate))
{
BinaryFormatter bin = new BinaryFormatter();
if (stream.Length == 0)
{
List<List<int>> List = new List<List<int>>();
List.Add(data);
bin.Serialize(stream, List);
}
else
{
List<List<int>> List = (List<List<int>>)bin.Deserialize(stream);
List.Add(data);
bin.Serialize(stream, List);
}
}
}
Strangely, List.Count remains 1 and the number of ints in the list stays the same, while the file size increases.
You need to rewind the stream and clear the previous data between reading and writing:
static void saveFunction(List<int> data, string name)
{
using (Stream stream = File.Open(name + ".bin", FileMode.OpenOrCreate))
{
BinaryFormatter bin = new BinaryFormatter();
if (stream.Length == 0)
{
var List = new List<List<int>>();
List.Add(data);
bin.Serialize(stream, List);
}
else
{
var List = (List<List<int>>)bin.Deserialize(stream);
List.Add(data);
stream.SetLength(0); // Clear the old data from the file
bin.Serialize(stream, List);
}
}
}
What you are doing now is appending the new list to the end of the file while leaving the old list as-is -- which BinaryFormatter will happily read as the (first) object in the file when it is re-opened.
As for your second question, "how would you advice me to store such data efficiently and quick?", since your plan is to "take the first value of each list and compile a new list", it appears you're going to need to re-read the preceding lists when writing a new list. If that were not true, however, and each new list was independent of the preceding lists, BinaryFormatter does support writing multiple root objects to the same file. See here for details: Serializing lots of different objects into a single file
I know there were already many discussions on that topic, like this one:
BinaryFormatter and Deserialization Complex objects
but this looks awfully complicated. What I'm looking for is an easier way to serialize and deserialize a generic List of objects into/from one file. This is what I've tried:
public void SaveFile(string fileName)
{
List<object> objects = new List<object>();
// Add all tree nodes
objects.Add(treeView.Nodes.Cast<TreeNode>().ToList());
// Add dictionary (Type: Dictionary<int, Tuple<List<string>, List<string>>>)
objects.Add(dictionary);
using(Stream file = File.Open(fileName, FileMode.Create))
{
BinaryFormatter bf = new BinaryFormatter();
bf.Serialize(file, objects);
}
}
public void LoadFile(string fileName)
{
ClearAll();
using(Stream file = File.Open(fileName, FileMode.Open))
{
BinaryFormatter bf = new BinaryFormatter();
object obj = bf.Deserialize(file);
// Error: ArgumentNullException in System.Core.dll
TreeNode[] nodeList = (obj as IEnumerable<TreeNode>).ToArray();
treeView.Nodes.AddRange(nodeList);
dictionary = obj as Dictionary<int, Tuple<List<string>, List<string>>>;
}
}
The serialization works, but the deserialization fails with an ArgumentNullException. Does anyone know how to pull the dictionary and the tree nodes out and cast them back, maybe with a different approach, but one that is also nice and simple? Thanks!
You have serialized a list of objects where the first item is a list of nodes and the second a dictionary. So when deserializing, you will get the same objects back.
The result from deserializing will be a List<object>, where the first element is a List<TreeNode> and the second element a Dictionary<int, Tuple<List<string>, List<string>>>
Something like this:
public static void LoadFile(string fileName)
{
ClearAll();
using(Stream file = File.Open(fileName, FileMode.Open))
{
BinaryFormatter bf = new BinaryFormatter();
object obj = bf.Deserialize(file);
var objects = obj as List<object>;
//you may want to run some checks (objects is not null and contains 2 elements for example)
var nodes = objects[0] as List<TreeNode>;
var dictionary = objects[1] as Dictionary<int, Tuple<List<string>,List<string>>>;
//use nodes and dictionary
}
}
You can give it a try on this fiddle.