Deserializing large files using Json.NET - C#

I am trying to process a very large amount of data (~1000 separate files, each ~30 MB) to use as input to the training phase of a machine learning algorithm. The raw data files are formatted as JSON, and I deserialize them using the JsonSerializer class of Json.NET. Towards the end of the program, Newtonsoft.Json.dll throws an OutOfMemoryException. Is there a way to reduce the data in memory, or do I have to change my whole approach (such as switching to a big data framework like Spark) to handle this problem?
public static List<T> DeserializeJsonFiles<T>(string path)
{
    if (string.IsNullOrWhiteSpace(path))
        return null;
    var jsonObjects = new List<T>();
    //var sw = new Stopwatch();
    try
    {
        //sw.Start();
        foreach (var filename in Directory.GetFiles(path))
        {
            using (var streamReader = new StreamReader(filename))
            using (var jsonReader = new JsonTextReader(streamReader))
            {
                jsonReader.SupportMultipleContent = true;
                var serializer = new JsonSerializer();
                while (jsonReader.Read())
                {
                    if (jsonReader.TokenType != JsonToken.StartObject)
                        continue;
                    var jsonObject = serializer.Deserialize<dynamic>(jsonReader);
                    var reducedObject = ApplyFiltering(jsonObject); // returns null if the filtering conditions are not met
                    if (reducedObject == null)
                        continue;
                    jsonObject = reducedObject;
                    jsonObjects.Add(jsonObject);
                }
            }
        }
        //sw.Stop();
        //Console.WriteLine($"Elapsed time: {sw.Elapsed}, Elapsed mili: {sw.ElapsedMilliseconds}");
    }
    catch (Exception ex)
    {
        Console.WriteLine($"Error: {ex}");
        return null;
    }
    return jsonObjects;
}
Thanks.

It's not really a problem with Newtonsoft. You are reading all of these objects into one big list in memory. It gets to a point where you ask the JsonSerializer to create another object and it fails.
You need to return IEnumerable<T> from your method, yield return each object, and deal with them in the calling code without storing them in memory. That means iterating the IEnumerable<T>, processing each item, and writing to disk or wherever they need to end up.
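For illustration, here is a minimal sketch of that streaming version, reusing the ApplyFiltering helper from the question; the caller's processing step is only indicated by a comment:
public static IEnumerable<T> DeserializeJsonFiles<T>(string path)
{
    if (string.IsNullOrWhiteSpace(path))
        yield break;
    var serializer = new JsonSerializer();
    foreach (var filename in Directory.GetFiles(path))
    {
        using (var streamReader = new StreamReader(filename))
        using (var jsonReader = new JsonTextReader(streamReader))
        {
            jsonReader.SupportMultipleContent = true;
            while (jsonReader.Read())
            {
                if (jsonReader.TokenType != JsonToken.StartObject)
                    continue;
                var jsonObject = serializer.Deserialize<dynamic>(jsonReader);
                var reducedObject = ApplyFiltering(jsonObject); // null if the filtering conditions are not met
                if (reducedObject != null)
                    yield return reducedObject; // hand one object at a time to the caller
            }
        }
    }
}
// Caller: process each item as it is produced instead of holding everything in memory
foreach (var item in DeserializeJsonFiles<dynamic>(path))
{
    // feed the item to the training pipeline, write it to disk, etc.
}
Because the enumerable is lazy, only one deserialized object (plus whatever the caller decides to keep) is alive at any time.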

Related

Deserialize object one by one from file .Net

I'm trying to deserialize a list of heavy objects from a JSON file. I do not want to deserialize it the classic way, directly into a list, because that would expose me to an OutOfMemory exception. So I'm looking for a way to handle the objects one by one: store each one in the database as it is read, and stay memory-safe.
I already handle the serialization and it's working well, but I'm facing some difficulties with deserialization.
Any ideas?
Thanks in advance.
// Serialization
using (var fileStream = new FileStream(DirPath + "/TPV.Json", FileMode.Create))
using (var sw = new StreamWriter(fileStream))
using (var jw = new JsonTextWriter(sw))
{
    jw.WriteStartArray();
    using (var _Database = new InspectionBatimentsDataContext(TheBrain.DBClient.ConnectionString))
    {
        foreach (var TPVId in TPVIds)
        {
            var pic = (from p in _Database.TPV
                       where Operators.ConditionalCompareObjectEqual(p.Release, TPVId.Release, false)
                             & Operators.ConditionalCompareObjectEqual(p.InterventionId, TPVId.InterventionId, false)
                       select p).FirstOrDefault();
            var ser = new JsonSerializer();
            ser.Serialize(jw, pic);
            jw.Flush();
        }
    }
    jw.WriteEndArray();
}
I finally found a way to do it by using a custom separator between each object during serialization. For deserialization, I simply read the JSON file as a string until I find my custom separator, deserialize the string I have read, and repeat in a loop. It's not the perfect answer because I'm breaking the JSON format in my files, but that's not a constraint in my case.
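For reference, a minimal sketch of that separator-based approach could look like the following; the "--OBJ--" marker, the pics collection, the TPV element type and the SaveToDatabase helper are illustrative names, not part of the original code:
const string Separator = "--OBJ--";
// Serialization: write each object followed by the separator on its own line
using (var writer = new StreamWriter(DirPath + "/TPV.Json"))
{
    foreach (var pic in pics) // "pics" stands for the objects fetched one by one
    {
        writer.WriteLine(JsonConvert.SerializeObject(pic));
        writer.WriteLine(Separator);
    }
}
// Deserialization: accumulate lines until the separator, then deserialize a single object
using (var reader = new StreamReader(DirPath + "/TPV.Json"))
{
    var buffer = new StringBuilder();
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        if (line == Separator)
        {
            var obj = JsonConvert.DeserializeObject<TPV>(buffer.ToString());
            SaveToDatabase(obj); // hypothetical helper: store the object, then discard it
            buffer.Clear();
        }
        else
        {
            buffer.AppendLine(line);
        }
    }
}
This keeps only one object's JSON text in memory at a time, at the cost of the file no longer being valid JSON as a whole.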

Processing large text (JSON) file

I have a requirement to allow my intranet .NET web portal users to send a free-text SQL query to the backend (a read-only DB on SQL Server 2014) and get the results in Excel. This works fine for most cases, but the code fails when the results are too big (around 350 MB, 250k records) to be processed.
My first attempt was to get the results directly as JSON and serialize them into a data table on the frontend.
That failed, since iterating through the result set would throw a System.OutOfMemoryException:
private JavaScriptSerializer _serializer;
return _serializer.Serialize(results);
So I decided it's not a good idea anyway to display this amount of results on the interface directly, since IE will struggle. I went instead with the option of prompting users to download an Excel copy of the output, by saving the results into a JSON file, then reading the file and converting it to Excel:
using (StreamReader sr = new StreamReader(filePath))
{
    string json;
    // Read and display lines from the file until the end of
    // the file is reached.
    while ((json = sr.ReadLine()) != null)
    {
        Console.WriteLine(json);
    }
}
However, the ReadLine() method throws the same exception. It is worth noting that ReadLine fails because the file consists of only one line; otherwise I would iterate over the file line by line.
Lastly, I tried accessing the IEnumerable directly to write it into Excel:
var results = new ReportProcess().ExecuteSql(sqlQuery, out string eventMessage);
List<object> queryObjects = new List<object>();
foreach (var result in results)
{
    queryObjects.Add(result);
}
var row = queryObjects.FirstOrDefault();
if (row != null)
{
    var recordType = row.GetType();
    using (var excelFile = new ExcelPackage())
    {
        MemberInfo[] membersToInclude = recordType
            .GetProperties(BindingFlags.Instance | BindingFlags.Public)
            .ToArray();
        var worksheet = excelFile.Workbook.Worksheets.Add("Sheet1");
        worksheet.Cells.LoadFromCollection(queryObjects, true,
            OfficeOpenXml.Table.TableStyles.None,
            BindingFlags.Instance | BindingFlags.Public,
            membersToInclude);
        fileName = Guid.NewGuid() + ".xlsx";
        excelFile.SaveAs(new FileInfo(HttpContext.Current.Server.MapPath("~/temp/") + fileName));
    }
}
And again the code fails around
foreach (var result in results)
{
    queryObjects.Add(result);
}
With the same exception
So now I'm stuck: no matter how I try to iterate through the IEnumerable results, I always get the OutOfMemoryException.
I have also tried to allow larger objects by setting gcAllowVeryLargeObjects to true in the web.config, but to no avail:
<runtime>
<gcAllowVeryLargeObjects enabled="true"/>
</runtime>
Other attempts and Googling around did not bring up anything to resolve the issue. Any suggestions/ideas?
Eventually I had to rewrite my code to use an external library, CsvHelper, for the CSV serialization:
using (StreamReader sr = new StreamReader(filePath))
{
    var csvReader = new CsvReader(sr);
    var records = csvReader.GetRecords<object>();
    var result = string.Empty;
    try
    {
        return JsonConvert.SerializeObject(new ServerData(records, _eventMessage));
    }
    catch (Exception ex)
    {
        _eventMessage.Level = EventMessage.EventLevel.Error;
    }
    return _serializer.Serialize(new ServerData(result, _eventMessage));
}
That seems to work fine with large data sets; the OutOfMemoryException no longer appears.

How to save a file in LocalStorage?

I have an ObservableCollection<T>. I want to insert various elements into it and then save the newly created file in LocalStorage. How can I do that?
SQLiteAsyncConnection conn = new SQLiteAsyncConnection(Path.Combine(ApplicationData.Current.LocalFolder.Path, "Database.db"), true);
await conn.CreateTableAsync<Musei>();
var Dbase = Path.Combine(ApplicationData.Current.LocalFolder.Path, "Database.db");
var con = new SQLiteAsyncConnection(Dbase, true);
var query = await con.Table<Musei>().ToListAsync();
ObservableCollection<Musei> favMusei = new ObservableCollection<Musei>();
if (query.Count > 0)
{
    favMusei.Clear();
    foreach (Musei museifav in query)
    {
        favMusei.Add(museifav);
    }
}
I'm using a JSON file to store the data in local storage. JSON is a lightweight, widely used message exchange format. You will have to make slight modifications to the code if you want a different file format.
Your collection is serialized to storage at save time and has to be deserialized when reading it back.
Add your own generic implementation of the collection. To recreate your situation I'm using a simple ObservableCollection<int>. Don't forget to initialize the collection to some meaningful values; here I'm using the default constructor.
using System.Collections.ObjectModel;
using System.Runtime.Serialization.Json;
using Windows.Storage;
//Add your own generic implementation of the collection
//and make changes accordingly
private ObservableCollection<int> temp;
private string file = "temp.json";
private async void saveToFile()
{
    //add your items to the collection
    temp = new ObservableCollection<int>();
    var jsonSerializer = new DataContractJsonSerializer(typeof(ObservableCollection<int>));
    using (var stream = await ApplicationData.Current.LocalFolder.OpenStreamForWriteAsync(file, CreationCollisionOption.ReplaceExisting))
    {
        jsonSerializer.WriteObject(stream, temp);
    }
}
private async Task getFormFile()
{
    var jsonSerializer = new DataContractJsonSerializer(typeof(ObservableCollection<int>));
    try
    {
        using (var stream = await ApplicationData.Current.LocalFolder.OpenStreamForReadAsync(file))
        {
            temp = (ObservableCollection<int>)jsonSerializer.ReadObject(stream);
        }
    }
    catch
    {
        //if some error is caught while reading data from the file, initializing
        //the collection to a default constructor instance is a reasonable choice;
        //again, it's your choice and may differ in your scenario
        temp = new ObservableCollection<int>();
    }
}
To add some functionality to the code you can also have an ensureDataLoaded() function which would ensure that the data has been read from the JSON file.
public async Task ensureDataLoaded()
{
    if (temp.Count == 0)
        await getFormFile();
    return;
}
Before using the global variable temp (which holds the ObservableCollection), call the ensureDataLoaded function. This avoids unnecessary NullReferenceExceptions.
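A call site might then look roughly like this (the listView control is just an illustrative consumer of the collection):
await ensureDataLoaded();      // make sure temp has been read from the JSON file
listView.ItemsSource = temp;   // bind or otherwise use the collection afterwards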

How to save a Dictionary variable in roaming - C#

After studying the manual of Visual C#, I'm starting to program a simple app for Windows 8. I'm using Visual Studio 2013 Express and .NET Framework 4.5.1. In the code I've written up to now, in the code-behind of a page I create a list of people:
private Dictionary<string, People> listPeople = new Dictionary<string, People>();
After this, I would like this list to fill a ComboBox control on another page. The solution I thought of is to save the Dictionary<string, People> variable in roaming, and then use it where I need it. This would also solve the problem of keeping the list of people saved even when the app is terminated.
How can I do this?
Did you already try it like this?
var applicationData = Windows.Storage.ApplicationData.Current;
applicationData.RoamingSettings.Values["PeopleList"] = listPeople;
var applicationData = Windows.Storage.ApplicationData.Current;
var listPeople = (Dictionary<string, People>)applicationData.RoamingSettings.Values["PeopleList"];
The overwhelming advice is to serialize your setting as a string and then set that value in Values (or save the values of the complex type individually).
But I think that really ignores the fact that Values is a series of key-value pairs... If you want to support complex types, I think creating your own file in the roaming folder is a better idea. I've written a small helper class to accomplish this via JSON serialization:
public class ApplicationStorageHelper
{
    private static readonly JsonSerializer jsonSerializer = JsonSerializer.Create();

    public static async Task<bool> SaveData<T>(T data)
    {
        var file =
            await ApplicationData.Current.RoamingFolder.CreateFileAsync("settings.dat", CreationCollisionOption.ReplaceExisting);
        using (var stream = await file.OpenAsync(FileAccessMode.ReadWrite))
        {
            using (var outputStream = stream.GetOutputStreamAt(0))
            {
                using (var writer = new StreamWriter(outputStream.AsStreamForWrite()))
                {
                    var jsonWriter = new JsonTextWriter(writer);
                    jsonSerializer.Serialize(jsonWriter, data);
                    return true;
                }
            }
        }
    }

    public static async Task<T> LoadData<T>()
    {
        try
        {
            var file =
                await ApplicationData.Current.RoamingFolder.GetFileAsync("settings.dat");
            using (var inputStream = await file.OpenSequentialReadAsync())
            {
                using (var reader = new StreamReader(inputStream.AsStreamForRead()))
                {
                    var jsonReader = new JsonTextReader(reader);
                    return jsonSerializer.Deserialize<T>(jsonReader);
                }
            }
        }
        catch (FileNotFoundException)
        {
            return default(T);
        }
    }
}
You'll have to reference Json.NET from NuGet or in some other way.
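Usage with the Dictionary<string, People> from the question would then be roughly as follows (a sketch, assuming People serializes cleanly with Json.NET):
// Save the dictionary to settings.dat in the roaming folder
await ApplicationStorageHelper.SaveData(listPeople);
// Later, e.g. on the other page, load it back (LoadData returns default(T) if the file is missing)
var people = await ApplicationStorageHelper.LoadData<Dictionary<string, People>>();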

Out of memory for JObject

I have a problem when I try to parse a large JSON file of around 200 MB.
I'm doing it with Newtonsoft.Json, and it gives an OutOfMemoryException.
This is my code:
using (StreamReader sr = File.OpenText("path"))
{
    JObject file = (JObject)JToken.ReadFrom(new JsonTextReader(sr));
}
How can I do this? (preferably using JObject)
You can use JsonTextReader to read the text in a DataReader fashion, as described in this question:
Incremental JSON Parsing in C#
You will have to write your own logic to process the JSON data, but it will certainly solve your memory issues:
using (var reader = new JsonTextReader(File.OpenText("path")))
{
    while (reader.Read())
    {
        // Your logic here (anything you need is in the [reader] object), for instance:
        if (reader.TokenType == JsonToken.StartArray)
        {
            // Process array
            MyMethodToProcessArray(reader);
        }
        else if (reader.TokenType == JsonToken.StartObject)
        {
            // Process object
            MyMethodToProcessObject(reader);
        }
    }
}
You would effectively be building a recursive JSON parser.
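Since you would prefer to work with JObject, one of those handlers can, as a sketch, load just the current object from the reader; JObject.Load consumes exactly one object, so memory usage stays proportional to a single object rather than the whole file (the "Id" property below is only an illustrative name):
private static void MyMethodToProcessObject(JsonTextReader reader)
{
    // The reader is positioned on StartObject; Load reads through the matching EndObject
    JObject obj = JObject.Load(reader);
    // Work with the object here, e.g. read a property ("Id" is an example name)
    Console.WriteLine((string)obj["Id"]);
    // Let the JObject go out of scope so it can be collected before the next one is read
}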
