Processing large text (JSON) file - c#

I have a requirement to allow my intranet .NET web portal users to send a free-text SQL query to the backend (a read-only database on SQL Server 2014) and get the results in Excel. This works fine for most cases, but the code fails when the result set is too big to be processed (around 350 MB, 250k records).
My first attempt was to return the results directly as JSON and render them into a data table on the frontend.
That failed, since iterating through the result set would throw a System.OutOfMemoryException:
private JavaScriptSerializer _serializer;
return _serializer.Serialize(results);
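For what it's worth, here is a minimal sketch of streaming the serialization straight to a file with Json.NET (an assumption; the question uses JavaScriptSerializer) instead of building the entire JSON string in memory. filePath and results are the names used elsewhere in the question:

// Sketch only: stream the output to a file rather than building one giant string.
using (var sw = new StreamWriter(filePath))
using (var jsonWriter = new JsonTextWriter(sw))
{
    var serializer = new JsonSerializer();
    serializer.Serialize(jsonWriter, results); // writes to the file as it enumerates
}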
So I decided it is not a good idea anyway to display this amount of data on the interface directly, since IE will struggle. I went with the option of prompting users to download an Excel copy of the output instead, by saving the results into a JSON file, then reading the file and converting it to Excel:
using (StreamReader sr = new StreamReader(filePath))
{
    string json;
    // Read and display lines from the file until the end of the file is reached.
    while ((json = sr.ReadLine()) != null)
    {
        Console.WriteLine(json);
    }
}
However, the ReadLine() method throws the same exception. It is worth noting that ReadLine() fails because the file consists of only one line; otherwise I would iterate over the file line by line.
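For reference, a file that is one huge line can still be consumed in fixed-size chunks rather than with ReadLine(); a minimal sketch, reusing filePath from above:

// Sketch: read a single-line file in 80 KB chunks instead of one ReadLine() call.
char[] buffer = new char[81920];
using (var sr = new StreamReader(filePath))
{
    int read;
    while ((read = sr.Read(buffer, 0, buffer.Length)) > 0)
    {
        // process buffer[0..read) incrementally instead of holding the whole line
    }
}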
Lastly, I tried accessing the IEnumerable directly to write it into Excel:
var results = new ReportProcess().ExecuteSql(sqlQuery, out string eventMessage);
List<object> queryObjects = new List<object>();
foreach (var result in results)
{
    queryObjects.Add(result);
}
var row = queryObjects.FirstOrDefault();
if (row != null)
{
    var recordType = row.GetType();
    using (var excelFile = new ExcelPackage())
    {
        MemberInfo[] membersToInclude = recordType
            .GetProperties(BindingFlags.Instance | BindingFlags.Public)
            .ToArray();
        var worksheet = excelFile.Workbook.Worksheets.Add("Sheet1");
        worksheet.Cells.LoadFromCollection(queryObjects, true,
            OfficeOpenXml.Table.TableStyles.None,
            BindingFlags.Instance | BindingFlags.Public,
            membersToInclude);
        fileName = Guid.NewGuid() + ".xlsx";
        excelFile.SaveAs(new FileInfo(HttpContext.Current.Server.MapPath("~/temp/") + fileName));
    }
}
And again the code fails around
foreach (var result in results)
{
    queryObjects.Add(result);
}
with the same exception.
So now I'm stuck on the fact that no matter how I try to iterate through the IEnumerable results, I always get the OutOfMemoryException.
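For comparison, here is a sketch of filling the sheet row by row as the results are enumerated, instead of copying everything into a List first. This is only a sketch: EPPlus still keeps the whole workbook in memory, so it removes the intermediate list but may not be enough on its own. Names mirror the code above.

using (var excelFile = new ExcelPackage())
{
    var worksheet = excelFile.Workbook.Worksheets.Add("Sheet1");
    PropertyInfo[] properties = null;
    int row = 1;
    foreach (var result in results) // results is the IEnumerable returned by ExecuteSql
    {
        if (properties == null)
        {
            properties = result.GetType().GetProperties(BindingFlags.Instance | BindingFlags.Public);
            for (int col = 0; col < properties.Length; col++)
                worksheet.Cells[row, col + 1].Value = properties[col].Name; // header row
            row++;
        }
        for (int col = 0; col < properties.Length; col++)
            worksheet.Cells[row, col + 1].Value = properties[col].GetValue(result);
        row++;
    }
    excelFile.SaveAs(new FileInfo(HttpContext.Current.Server.MapPath("~/temp/") + Guid.NewGuid() + ".xlsx"));
}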
I have also tried to increase the memory available to large objects by setting gcAllowVeryLargeObjects to true in the web.config (this setting only lifts the 2 GB per-object limit on 64-bit processes; it does not add memory), but to no avail:
<runtime>
<gcAllowVeryLargeObjects enabled="true"/>
</runtime>
Other attempts:
Googling around did not turn up anything that resolved the issue. Any suggestions/ideas?

Eventually I had to rewrite my code to use an external library, CsvHelper, to work with the results as CSV:
using (StreamReader sr = new StreamReader(filePath))
{
    var csvReader = new CsvReader(sr);
    var records = csvReader.GetRecords<object>();
    var result = string.Empty;
    try
    {
        return JsonConvert.SerializeObject(new ServerData(records, _eventMessage));
    }
    catch (Exception ex)
    {
        _eventMessage.Level = EventMessage.EventLevel.Error;
    }
    return _serializer.Serialize(new ServerData(result, _eventMessage));
}
That seems to work fine with large data sets; the OutOfMemoryException no longer appears.
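For completeness, the writing side can be streamed the same way. Below is a minimal sketch using CsvHelper's CsvWriter (an assumption about how the file gets produced; newer CsvHelper versions also require a CultureInfo argument in the constructor):

// Sketch: stream the query results straight to a CSV file, record by record,
// without building a large string or list in memory. "results" is assumed to
// be the IEnumerable returned by ExecuteSql.
using (var sw = new StreamWriter(filePath))
using (var csvWriter = new CsvWriter(sw)) // newer versions: new CsvWriter(sw, CultureInfo.InvariantCulture)
{
    csvWriter.WriteRecords(results);
}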

Related

Replacing Invalid XML characters from an excel file and writing it back to disk causes file is corrupted error on opening in MS Excel

A little background on the problem:
We have an ASP.NET MVC5 application where we use FlexMonster to show the data in a grid. The data source is a stored procedure that brings all the data into the UI grid, and once the user clicks the export button, it exports the report to Excel. However, in some cases the export to Excel fails.
Some of the data has invalid characters, and it is not possible/feasible to fix the source as suggested here.
My approach so far:
The EPPlus library fails on initializing the workbook because the input Excel file contains some invalid XML characters. I found that the file is dumped with an invalid character in it. I looked into the possible approaches.
Firstly, I identified the problematic character in the Excel file. I first tried to replace the invalid character with a blank space manually using Notepad++, and EPPlus could then successfully read the file.
Now, using the approaches given in other SO threads here and here, I replaced all possible occurrences of invalid chars. At the moment I am using the XmlConvert.IsXmlChar method to find the problematic XML characters and replace them with a blank space.
I created a sample program where I am trying to work on the problematic Excel sheet.
// in main method
String readFile = File.ReadAllText(filePath);
string content = RemoveInvalidXmlChars(readFile);
File.WriteAllText(filePath, content);

// removal of invalid characters
static string RemoveInvalidXmlChars(string inputText)
{
    StringBuilder withoutInvalidXmlCharsBuilder = new StringBuilder();
    int firstOccurenceOfRealData = inputText.IndexOf("<t>");
    int lastOccurenceOfRealData = inputText.LastIndexOf("</t>");
    if (firstOccurenceOfRealData < 0 ||
        lastOccurenceOfRealData < 0 ||
        firstOccurenceOfRealData > lastOccurenceOfRealData)
        return inputText;
    withoutInvalidXmlCharsBuilder.Append(inputText.Substring(0, firstOccurenceOfRealData));
    int remaining = lastOccurenceOfRealData - firstOccurenceOfRealData;
    string textToCheckFor = inputText.Substring(firstOccurenceOfRealData, remaining);
    foreach (char c in textToCheckFor)
    {
        withoutInvalidXmlCharsBuilder.Append((XmlConvert.IsXmlChar(c)) ? c : ' ');
    }
    withoutInvalidXmlCharsBuilder.Append(inputText.Substring(lastOccurenceOfRealData));
    return withoutInvalidXmlCharsBuilder.ToString();
}
If I replace the problematic character manually using Notepad++, the file opens fine in MS Excel. The above-mentioned code successfully replaces the same invalid character and writes the content back to the file. However, when I try to open the Excel file using MS Excel, it throws an error saying that the file may have been corrupted, and no content is displayed (snapshots below). Moreover, the following code
var excelPackage = new ExcelPackage(new FileInfo(filePath));
on the file that I updated via Notepad++ throws the following exception:
"CRC error: the file being extracted appears to be corrupted. Expected 0x7478AABE, Actual 0xE9191E00"
My Questions:
Is my approach of modifying the content this way correct?
If yes, how can I write the updated string back to an Excel file?
If my approach is wrong, how can I proceed to get rid of the invalid XML chars?
Errors shown on opening the file (without the invalid XML char):
First pop-up
When I click on Yes
Thanks in advance!
It does sound like a binary (presumably XLSX) file based on your last comment. To confirm, open the file created by FlexMonster with 7-Zip. If it opens properly and you see a bunch of XML files in folders, it's an XLSX.
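If you prefer to check programmatically rather than with 7-Zip, here is a small sketch (an XLSX is a ZIP container, so the file starts with the two bytes "PK"):

// Sketch: a ZIP (and therefore an XLSX) starts with the signature bytes 'P' 'K'.
static bool LooksLikeZip(string path)
{
    using (var fs = File.OpenRead(path))
    {
        return fs.ReadByte() == 'P' && fs.ReadByte() == 'K';
    }
}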
In that case, a search/replace on a binary file sounds like a very bad idea. It might work on the XML parts but might also replace legitimate chars in other parts. I think the better approach would be to do as @PanagiotisKanavos suggests and use ZipArchive. But you have to rebuild it in the right order, otherwise Excel complains. Similar to how it was done here https://stackoverflow.com/a/33312038/1324284, you could do something like this:
public static void ReplaceXmlString(this ZipArchive xlsxZip, FileInfo outFile, string oldString, string newstring)
{
    using (var outStream = outFile.Open(FileMode.Create, FileAccess.ReadWrite))
    using (var copiedzip = new ZipArchive(outStream, ZipArchiveMode.Update))
    {
        //Go through each file in the zip one by one and copy over to the new file - entries need to be in order
        foreach (var entry in xlsxZip.Entries)
        {
            var newentry = copiedzip.CreateEntry(entry.FullName);
            var newstream = newentry.Open();
            var orgstream = entry.Open();
            //Copy non-xml files over
            if (!entry.Name.EndsWith(".xml"))
            {
                orgstream.CopyTo(newstream);
            }
            else
            {
                //Load the xml document to manipulate
                var xdoc = new XmlDocument();
                xdoc.Load(orgstream);
                var xml = xdoc.OuterXml.Replace(oldString, newstring);
                xdoc = new XmlDocument();
                xdoc.LoadXml(xml);
                xdoc.Save(newstream);
            }
            orgstream.Close();
            newstream.Flush();
            newstream.Close();
        }
    }
}
When it is used like this:
[TestMethod]
public void ReplaceXmlTest()
{
    var datatable = new DataTable("tblData");
    datatable.Columns.AddRange(new[]
    {
        new DataColumn("Col1", typeof (int)),
        new DataColumn("Col2", typeof (int)),
        new DataColumn("Col3", typeof (string))
    });
    for (var i = 0; i < 10; i++)
    {
        var row = datatable.NewRow();
        row[0] = i;
        row[1] = i * 10;
        row[2] = i % 2 == 0 ? "ABCD" : "AXCD";
        datatable.Rows.Add(row);
    }
    using (var pck = new ExcelPackage())
    {
        var workbook = pck.Workbook;
        var worksheet = workbook.Worksheets.Add("source");
        worksheet.Cells.LoadFromDataTable(datatable, true);
        worksheet.Tables.Add(worksheet.Cells["A1:C11"], "Table1");
        //Now simulate the copy/open of the excel file into a zip archive
        using (var orginalzip = new ZipArchive(new MemoryStream(pck.GetAsByteArray()), ZipArchiveMode.Read))
        {
            var fi = new FileInfo(@"c:\temp\ReplaceXmlTest.xlsx");
            if (fi.Exists)
                fi.Delete();
            orginalzip.ReplaceXmlString(fi, "AXCD", "REPLACED!!");
        }
    }
}
Gives this:
Just keep in mind that this is completely brute force. Anything you can do to make the file filter smarter, rather than simply doing ALL xml files, would be a very good thing. Maybe limit it to the sharedStrings.xml file if that is where the problem lies, or to the xml files in the worksheet folders. Hard to say without knowing more about the data.
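For example, here is one possible filter, assuming the bad characters only show up in the shared strings or the sheet XML (an assumption, not something the question confirms):

// Hypothetical filter: only scrub the parts of the package that normally hold cell text.
static bool ShouldScrubEntry(ZipArchiveEntry entry)
{
    return entry.FullName.Equals("xl/sharedStrings.xml", StringComparison.OrdinalIgnoreCase)
        || (entry.FullName.StartsWith("xl/worksheets/", StringComparison.OrdinalIgnoreCase)
            && entry.FullName.EndsWith(".xml", StringComparison.OrdinalIgnoreCase));
}

In ReplaceXmlString above, the !entry.Name.EndsWith(".xml") check could then become !ShouldScrubEntry(entry).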

How to efficiently download, read and process CSV in C#

I'm working on a service that will collect a large CSV file from an online resource and then, as it's downloading, read the lines (preferably in batches) and send them to a database. It should not use more than 256 MB of RAM at any time, and it must not save a file to disk.
This is for a service that will run once every 7 days and collect all the companies in the Norwegian Company Register (a nifty 250 MB, 1.1-million-line CSV is found here: http://hotell.difi.no/download/brreg/enhetsregisteret ).
My application can easily download the file, add it to a List<>, and process it, but it uses 3.3 GB of RAM:
public async Task<bool> CollectAndUpdateNorwegianCompanyRegistry()
{
    var request = await _httpClient.GetAsync(_options.Value.Urls["BrregCsv"]);
    request.EnsureSuccessStatusCode();
    using (var stream = await request.Content.ReadAsStreamAsync())
    using (var streamReader = new StreamReader(stream))
    {
        while (!streamReader.EndOfStream)
        {
            using (var csv = new CsvReader(streamReader)) // CsvReader is from the CsvHelper -nuget
            {
                csv.Configuration.Delimiter = ";";
                csv.Configuration.BadDataFound = null;
                csv.Configuration.RegisterClassMap<NorwegianCompanyClassMap>();
                await _sqlRepository.UpdateNorwegianCompaniesTable(csv.GetRecords<NorwegianCompany>().ToList());
            }
        }
    }
    return true;
}
A small note on the SqlRepository: I've replaced it with a simple "destroyer" method that just clears the data, so as not to use any extra resources while debugging.
What I'd expect is that the garbage collector would "destroy" the resources used as the lines of the file are processed, but it doesn't.
Put simply, I want the following to happen:
As the CSV downloads, it reads a few lines; these are then sent to a method, and the lines in memory are then released.
I'm definitely inexperienced at working with large datasets, so I'm working off other people's work and not getting the results I expect.
Thank you for your time and assistance
So getting some pointers from Sami Kuhmonen (@sami-kuhmonen) helped, and here's what I came up with:
public async Task<bool> CollectAndUpdateNorwegianCompanyRegistry()
{
    using (var stream = await _httpClient.GetStreamAsync(_options.Value.Urls["BrregCsv"]))
    using (var streamReader = new StreamReader(stream))
    using (var csv = new CsvReader(streamReader))
    {
        csv.Configuration.Delimiter = ";";
        csv.Configuration.BadDataFound = null;
        csv.Configuration.RegisterClassMap<NorwegianCompanyClassMap>();
        await _sqlRepository.UpdateNorwegianCompaniesTable(csv.GetRecords<NorwegianCompany>());
    }
    return true;
}
It downloads the entire file and sends it to the SqlRepository in 20 seconds, never surpassing 15% CPU or 30 MB of RAM.
Now my next challenge is the SqlRepository, but this issue is solved.
Another solution I'm now implementing, which is more predictable in its resource use, is this:
public async Task<bool> CollectAndUpdateNorwegianCompanyRegistryAlternate()
{
    using (var stream = await _httpClient.GetStreamAsync(_options.Value.Urls["BrregCsv"]))
    using (var reader = new StreamReader(stream))
    using (var csv = new CsvReader(reader))
    {
        csv.Configuration.RegisterClassMap<NorwegianCompanyClassMap>();
        csv.Configuration.Delimiter = ";";
        csv.Configuration.BadDataFound = null;
        var tempList = new List<NorwegianCompany>();
        while (csv.Read())
        {
            tempList.Add(csv.GetRecord<NorwegianCompany>());
            if (tempList.Count() > 50000)
            {
                await Task.Factory.StartNew(() => _sqlRepository.UpdateNorwegianCompaniesTable(tempList));
                tempList.Clear();
            }
        }
        // Send any remaining records that did not fill a complete batch
        if (tempList.Count > 0)
        {
            await Task.Factory.StartNew(() => _sqlRepository.UpdateNorwegianCompaniesTable(tempList));
        }
    }
    return true;
}
Now it takes 3 minutes, but never peaks above 200 MB and uses 7-12% CPU, even when doing SQL "bulk updates" (the SqlBulkTool NuGet package is excellent for my needs here) every X lines.
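The batching itself can be paired with any bulk loader. The following is not the SqlBulkTool code from the answer, just an illustrative sketch using the framework's SqlBulkCopy, with hypothetical table, column, and property names, to show what "bulk insert every X lines" can look like:

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

// Sketch: push one batch of parsed records to SQL Server in a single bulk operation.
static void BulkInsertBatch(IEnumerable<NorwegianCompany> batch, string connectionString)
{
    var table = new DataTable();
    table.Columns.Add("OrganisationNumber", typeof(string)); // hypothetical columns
    table.Columns.Add("Name", typeof(string));
    foreach (var company in batch)
        table.Rows.Add(company.OrganisationNumber, company.Name); // hypothetical properties

    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();
        using (var bulkCopy = new SqlBulkCopy(connection) { DestinationTableName = "dbo.NorwegianCompanies" })
        {
            bulkCopy.WriteToServer(table);
        }
    }
}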

C# Error: OutOfMemoryException - Reading a large text file and replacing from dictionary

I'm new to C# and object-oriented programming in general. I have an application which parses a text file.
The objective of the application is to read the contents of the provided text file and replace the matching values.
When a file of about 800 MB to 1.2 GB is provided as the input, the application crashes with a System.OutOfMemoryException.
On researching, I came across a couple of answers which recommend changing the Target Platform to x64.
The same issue persists after changing the target platform.
Following is the code:
// Reading the text file
var _data = string.Empty;
using (StreamReader sr = new StreamReader(logF))
{
    _data = sr.ReadToEnd();
    sr.Dispose();
    sr.Close();
}
foreach (var replacement in replacements)
{
    _data = _data.Replace(replacement.Key, replacement.Value);
}
// Writing the text file
using (StreamWriter sw = new StreamWriter(logF))
{
    sw.WriteLine(_data);
    sw.Dispose();
    sw.Close();
}
The error points to
_data = sr.ReadToEnd();
replacements is a dictionary. The Key contains the original word and the Value contains the word to be replaced.
The Key elements are replaced with the Value elements of the KeyValuePair.
The approach being followed is reading the file, replacing, and writing.
I tried using a StringBuilder instead of a string, yet the application still crashed.
Can this be overcome by reading the file one line at a time, replacing and writing? What would be an efficient and faster way of doing this?
Update: The system memory is 8 GB, and on monitoring the performance it spikes up to 100% memory usage.
@Tim Schmelter's answer works well.
However, the memory utilization spikes over 90%. It could be due to the following code:
String[] arrayofLine = File.ReadAllLines(logF);
// Generating replacement information
Dictionary<int, string> _replacementInfo = new Dictionary<int, string>();
for (int i = 0; i < arrayofLine.Length; i++)
{
    foreach (var replacement in replacements.Keys)
    {
        if (arrayofLine[i].Contains(replacement))
        {
            arrayofLine[i] = arrayofLine[i].Replace(replacement, masking[replacement]);
            if (_replacementInfo.ContainsKey(i + 1))
            {
                _replacementInfo[i + 1] = _replacementInfo[i + 1] + "|" + replacement;
            }
            else
            {
                _replacementInfo.Add(i + 1, replacement);
            }
        }
    }
}
// Creating replacement information
StringBuilder sb = new StringBuilder();
foreach (var Replacement in _replacementInfo)
{
    foreach (var replacement in Replacement.Value.Split('|'))
    {
        sb.AppendLine(string.Format("Line {0}: {1} ---> \t\t{2}", Replacement.Key, replacement, masking[replacement]));
    }
}
// Writing the replacement information
if (sb.Length != 0)
{
    using (StreamWriter swh = new StreamWriter("logF_Rep.txt"))
    {
        swh.WriteLine(sb.ToString());
        swh.Dispose();
        swh.Close();
    }
}
sb.Clear();
It finds the line number on which the replacement was made. Can this be captured using Tim's code in order to avoid loading the data into memory multiple times?
If you have very large files, you should try MemoryMappedFile, which is designed for this purpose (files > 1 GB) and enables you to read "windows" of a file into memory. But it's not easy to use.
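For illustration only (not the optimization below), a minimal sketch of reading one window of a large file through a memory-mapped view:

using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

// Sketch: read one fixed-size "window" of a huge file via a memory-mapped view.
// logF is the input path from the question; the window size is arbitrary.
static void ReadFirstWindow(string logF)
{
    long windowSize = 64L * 1024 * 1024; // 64 MB
    long fileLength = new FileInfo(logF).Length;

    using (var mmf = MemoryMappedFile.CreateFromFile(logF, FileMode.Open))
    using (var view = mmf.CreateViewStream(0, Math.Min(windowSize, fileLength)))
    using (var reader = new StreamReader(view, Encoding.UTF8))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // process the line; to cover the rest of the file, open further views
            // at increasing offsets (and handle lines that straddle two windows)
        }
    }
}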
A simpler optimization would be to read and replace line by line:
int lineNumber = 0;
var _replacementInfo = new Dictionary<int, List<string>>();
using (StreamReader sr = new StreamReader(logF))
{
    using (StreamWriter sw = new StreamWriter(logF_Temp))
    {
        while (!sr.EndOfStream)
        {
            string line = sr.ReadLine();
            lineNumber++;
            foreach (var kv in replacements)
            {
                bool contains = line.Contains(kv.Key);
                if (contains)
                {
                    List<string> lineReplaceList;
                    if (!_replacementInfo.TryGetValue(lineNumber, out lineReplaceList))
                        lineReplaceList = new List<string>();
                    lineReplaceList.Add(kv.Key);
                    _replacementInfo[lineNumber] = lineReplaceList;
                    line = line.Replace(kv.Key, kv.Value);
                }
            }
            sw.WriteLine(line);
        }
    }
}
At the end you can use File.Copy(logF_Temp, logF, true); if you want to overwrite the old file.
Read the file line by line and append each changed line to another file. At the end, replace the source file with the new one (with or without creating a backup):
var tmpFile = Path.GetTempFileName();
using (StreamReader sr = new StreamReader(logF))
{
    using (StreamWriter sw = new StreamWriter(tmpFile))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            foreach (var replacement in replacements)
                line = line.Replace(replacement.Key, replacement.Value);
            sw.WriteLine(line);
        }
    }
}
File.Replace(tmpFile, logF, null); // you can pass a backup file name instead of null if you want a backup of the logF file
An OutOfMemoryException is thrown whenever the application tries and fails to allocate memory to perform an operation. According to Microsoft's documentation, the following operations can potentially throw an OutOfMemoryException:
Boxing (i.e., wrapping a value type in an Object)
Creating an array
Creating an object
If you try to create an infinite number of objects, then it's pretty reasonable to assume that you're going to run out of memory sooner or later.
(Note: don't forget about the garbage collector. Depending on the lifetimes of the objects being created, it will delete some of them if it determines they're no longer in use.)
What I suspect is this line:
foreach (var replacement in replacements)
{
    _data = _data.Replace(replacement.Key, replacement.Value);
}
Sooner or later you will run out of memory. Did you ever count how many times it loops?
Try the following:
Increase the available memory.
Reduce the amount of data you are retrieving.

Deserializing large files using Json.NET

I am trying to process a very large amount of data (~1000 separate files, each of them ~30 MB) to use as input to the training phase of a machine learning algorithm. The raw data files are formatted as JSON, and I deserialize them using the JsonSerializer class of Json.NET. Towards the end of the program, Newtonsoft.Json.dll throws an OutOfMemoryException. Is there a way to reduce the data in memory, or do I have to change my whole approach (such as switching to a big data framework like Spark) to handle this problem?
public static List<T> DeserializeJsonFiles<T>(string path)
{
    if (string.IsNullOrWhiteSpace(path))
        return null;
    var jsonObjects = new List<T>();
    //var sw = new Stopwatch();
    try
    {
        //sw.Start();
        foreach (var filename in Directory.GetFiles(path))
        {
            using (var streamReader = new StreamReader(filename))
            using (var jsonReader = new JsonTextReader(streamReader))
            {
                jsonReader.SupportMultipleContent = true;
                var serializer = new JsonSerializer();
                while (jsonReader.Read())
                {
                    if (jsonReader.TokenType != JsonToken.StartObject)
                        continue;
                    var jsonObject = serializer.Deserialize<dynamic>(jsonReader);
                    var reducedObject = ApplyFiltering(jsonObject); // returns null if the filtering conditions are not met
                    if (reducedObject == null)
                        continue;
                    jsonObject = reducedObject;
                    jsonObjects.Add(jsonObject);
                }
            }
        }
        //sw.Stop();
        //Console.WriteLine($"Elapsed time: {sw.Elapsed}, Elapsed mili: {sw.ElapsedMilliseconds}");
    }
    catch (Exception ex)
    {
        Console.WriteLine($"Error: {ex}");
        return null;
    }
    return jsonObjects;
}
Thanks.
It's not really a problem with Newtonsoft. You are reading all of these objects into one big list in memory. It gets to a point where you ask the JsonSerializer to create another object and it fails.
You need to return IEnumerable<T> from your method, yield return each object, and deal with them in the calling code without storing them in memory. That means iterating the IEnumerable<T>, processing each item, and writing to disk or wherever they need to end up.
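A minimal sketch of what that could look like, keeping the names from the question. It assumes ApplyFiltering can work on one item at a time; note that yield return cannot live inside a try block that has a catch, so the original error handling would have to move to the caller.

public static IEnumerable<T> DeserializeJsonFiles<T>(string path)
{
    if (string.IsNullOrWhiteSpace(path))
        yield break;

    foreach (var filename in Directory.GetFiles(path))
    {
        using (var streamReader = new StreamReader(filename))
        using (var jsonReader = new JsonTextReader(streamReader))
        {
            jsonReader.SupportMultipleContent = true;
            var serializer = new JsonSerializer();
            while (jsonReader.Read())
            {
                if (jsonReader.TokenType != JsonToken.StartObject)
                    continue;

                var jsonObject = serializer.Deserialize<dynamic>(jsonReader);
                var reducedObject = ApplyFiltering(jsonObject); // null if the filtering conditions are not met
                if (reducedObject == null)
                    continue;

                yield return reducedObject; // hand one object at a time to the caller
            }
        }
    }
}

The calling code can then foreach over the returned sequence and write each item wherever it needs to go, so only one deserialized object is held at a time.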

Clearing the memory associated to a large List

I followed the advice in this SO question. It did not work for me. Here is my situation and the code associated with it.
I have a very large list; it has 2.04M items in it. I read it into memory to sort it, and then write it to a .csv file. I have 11 .csv files that I need to read and subsequently sort. The first iteration gives me a memory usage of just over 1 GB. I tried setting the list to null. I tried calling List.Clear(). I also tried List.TrimExcess(). I have also waited for the GC to do its thing, hoping that it would know there are no reads or writes going to that list.
Here is my code that I am using. Any advice is always greatly appreciated.
foreach (var zone in zones)
{
    var filePath = string.Format("outputs/zone-{0}.csv", zone);
    var lines = new List<string>();
    using (StreamReader reader = new StreamReader(filePath))
    {
        var headers = reader.ReadLine();
        while (!reader.EndOfStream)
        {
            var line = reader.ReadLine();
            lines.Add(line);
        }
        //sort the file, then rewrite the file into inputs
        lines = lines.OrderByDescending(l => l.ElementAt(0)).ThenByDescending(l => l.ElementAt(1)).ToList();
        using (StreamWriter writer = new StreamWriter(string.Format("inputs/zone-{0}-sorted.csv", zone)))
        {
            writer.WriteLine(headers);
            writer.Flush();
            foreach (var line in lines)
            {
                writer.WriteLine(line);
                writer.Flush();
            }
        }
        lines.Clear();
        lines.TrimExcess();
    }
}
Try putting the whole thing in a using:
using (var lines = new List<string>())
{ ... }
Although I'm not sure about the nested usings.
Instead, where you have lines.Clear();, add lines = null;. That should encourage the garbage collector.
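For reference, a sketch of the "drop the reference and let the GC reclaim it" idea. Note that List<T> does not implement IDisposable, so the using shown above will not compile; forcing a collection is rarely necessary and is shown only for illustration or measurement:

lines.Clear();
lines.TrimExcess();
lines = null; // drop the last reference so the list becomes eligible for collection

// Only for diagnostics/measurement: force a full collection.
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();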
