How to efficiently download, read and process CSV in C#

I'm working on a service that will collect a large CSV file from an online resource and, as it downloads, read the lines (preferably in batches) and send them to a database. It should never use more than 256 MB of RAM and must not save a file to disk.
This is for a service that will run once every 7 days and collect all the companies in the Norwegian Company Register (a nifty, 250 MB, 1.1-million-line CSV can be found here: http://hotell.difi.no/download/brreg/enhetsregisteret ).
My application can easily download the file, add it to a List<>, and process it, but it uses 3.3 GB of RAM:
public async Task<bool> CollectAndUpdateNorwegianCompanyRegistry()
{
    var request = await _httpClient.GetAsync(_options.Value.Urls["BrregCsv"]);
    request.EnsureSuccessStatusCode();
    using (var stream = await request.Content.ReadAsStreamAsync())
    using (var streamReader = new StreamReader(stream))
    {
        while (!streamReader.EndOfStream)
        {
            using (var csv = new CsvReader(streamReader)) // CsvReader is from the CsvHelper NuGet package
            {
                csv.Configuration.Delimiter = ";";
                csv.Configuration.BadDataFound = null;
                csv.Configuration.RegisterClassMap<NorwegianCompanyClassMap>();
                await _sqlRepository.UpdateNorwegianCompaniesTable(csv.GetRecords<NorwegianCompany>().ToList());
            }
        }
    }
    return true;
}
A small note on the SqlRepository: I've replaced it with a simple "destroyer" method that just discards the data, so as not to use any extra resources while debugging.
What I'd expect is that the garbage collector would reclaim the memory used by the lines of the file as they are processed, but it doesn't.
Put simply, I want the following to happen:
as the CSV downloads, a few lines are read, sent to a method, and then flushed from memory.
I'm definitely inexperienced at working with large datasets, so I'm working off other people's examples and not getting the results I expect.
Thank you for your time and assistance.

Getting some pointers from Sami Kuhmonen helped, and here's what I came up with:
public async Task<bool> CollectAndUpdateNorwegianCompanyRegistry()
{
    using (var stream = await _httpClient.GetStreamAsync(_options.Value.Urls["BrregCsv"]))
    using (var streamReader = new StreamReader(stream))
    using (var csv = new CsvReader(streamReader))
    {
        csv.Configuration.Delimiter = ";";
        csv.Configuration.BadDataFound = null;
        csv.Configuration.RegisterClassMap<NorwegianCompanyClassMap>();
        await _sqlRepository.UpdateNorwegianCompaniesTable(csv.GetRecords<NorwegianCompany>());
    }
    return true;
}
It downloads the entire file and sends it to the SqlRepository in 20 seconds, never surpassing 15% CPU or 30 MB of RAM.
Now my next challenge is the SqlRepository, but this issue is solved.
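For anyone facing the same follow-up, here is a minimal sketch of how the repository could consume the lazy IEnumerable<NorwegianCompany> in batches with SqlBulkCopy instead of materializing everything at once. The table name, column names, connection string field and the OrganizationNumber/Name properties are assumptions for illustration, not part of the original code:
// Hypothetical repository method: streams the lazy sequence into SQL Server in batches.
// Requires System.Data and System.Data.SqlClient; names below are illustrative.
public async Task UpdateNorwegianCompaniesTable(IEnumerable<NorwegianCompany> companies)
{
    const int batchSize = 50000;
    var table = new DataTable();
    table.Columns.Add("OrganizationNumber", typeof(string));
    table.Columns.Add("Name", typeof(string));

    using (var connection = new SqlConnection(_connectionString))
    {
        await connection.OpenAsync();
        using (var bulkCopy = new SqlBulkCopy(connection))
        {
            bulkCopy.DestinationTableName = "NorwegianCompanies";
            foreach (var company in companies)
            {
                table.Rows.Add(company.OrganizationNumber, company.Name);
                if (table.Rows.Count >= batchSize)
                {
                    await bulkCopy.WriteToServerAsync(table); // flush the batch
                    table.Clear();                            // keep memory bounded
                }
            }
            if (table.Rows.Count > 0)
                await bulkCopy.WriteToServerAsync(table);     // flush the final partial batch
        }
    }
}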

Another solution I'm now implementing, which is more predictable in its resource use, is this:
public async Task<bool> CollectAndUpdateNorwegianCompanyRegistryAlternate()
{
    using (var stream = await _httpClient.GetStreamAsync(_options.Value.Urls["BrregCsv"]))
    using (var reader = new StreamReader(stream))
    using (var csv = new CsvReader(reader))
    {
        csv.Configuration.RegisterClassMap<NorwegianCompanyClassMap>();
        csv.Configuration.Delimiter = ";";
        csv.Configuration.BadDataFound = null;
        var tempList = new List<NorwegianCompany>();
        while (csv.Read())
        {
            tempList.Add(csv.GetRecord<NorwegianCompany>());
            if (tempList.Count > 50000)
            {
                // Task.Run unwraps the Task returned by the repository, so the batch
                // is fully written before the list is cleared.
                await Task.Run(() => _sqlRepository.UpdateNorwegianCompaniesTable(tempList));
                tempList.Clear();
            }
        }
        // Flush any remaining records that didn't fill a complete batch.
        if (tempList.Count > 0)
            await Task.Run(() => _sqlRepository.UpdateNorwegianCompaniesTable(tempList));
    }
    return true;
}
Now it takes 3 minutes, but it never peaks above 200 MB and uses 7-12% CPU, even when doing SQL bulk updates (the SqlBulkTool NuGet package is excellent for my needs here) every X lines.
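The "every X lines" pattern can also be pulled out into a small reusable helper that turns any lazy sequence into fixed-size batches. This is a generic sketch for illustration; the Batch name is not from the original code:
// Yields lists of at most batchSize items from a lazy sequence
// without ever materializing the whole source.
public static IEnumerable<List<T>> Batch<T>(IEnumerable<T> source, int batchSize)
{
    var batch = new List<T>(batchSize);
    foreach (var item in source)
    {
        batch.Add(item);
        if (batch.Count == batchSize)
        {
            yield return batch;
            batch = new List<T>(batchSize); // start a fresh list so the caller can keep the old one
        }
    }
    if (batch.Count > 0)
        yield return batch; // final partial batch
}
With that in place, the read loop reduces to foreach (var batch in Batch(csv.GetRecords<NorwegianCompany>(), 50000)) { await ... }.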

Related

Processing large text (JSON) file

I have a requirement to allow my intranet .NET web portal users to send free-text SQL queries to the backend (a read-only DB on SQL Server 2014) and get the results in Excel. This works fine for most cases, but the code fails when the results are too big (around 350 MB, 250k records) to be processed.
My first attempt was to get the results directly as JSON and serialize them into a data table on the frontend.
That failed, since iterating through the result set would throw System.OutOfMemoryException:
private JavaScriptSerializer _serializer;
return _serializer.Serialize(results);
So I decided it's not a good idea anyway to display this amount of results on the interface directly, since IE will struggle. I went instead with the option of prompting users to download an Excel copy of the output, by saving the results into a JSON file, then reading the file and converting it to Excel:
using (StreamReader sr = new StreamReader(filePath))
{
    String json;
    // Read and display lines from the file until the end of
    // the file is reached.
    while ((json = sr.ReadLine()) != null)
    {
        Console.WriteLine(json);
    }
}
However, the ReadLine() method throws the same exception. It is worth noting that ReadLine() fails because the file consists of only one line; otherwise I would simply iterate over the file line by line.
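Since the file is one giant line, line-oriented reading can't help, but a token-based reader can stream it. Here is a minimal sketch, assuming the file contains a single JSON array of records and that Newtonsoft.Json (used further down with JsonConvert) is available; filePath and the per-record handling are placeholders:
// Streams a huge single-line JSON array one element at a time instead of
// loading the whole file into memory. Assumes Newtonsoft.Json (JObject is in Newtonsoft.Json.Linq).
using (var streamReader = new StreamReader(filePath))
using (var jsonReader = new JsonTextReader(streamReader))
{
    while (jsonReader.Read())
    {
        if (jsonReader.TokenType == JsonToken.StartObject)
        {
            // Load only the current element; the reader advances past it.
            JObject record = JObject.Load(jsonReader);
            // ... write the record's values to the Excel sheet here ...
        }
    }
}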
Lastly, I tried accessing the IEnumerable directly to write it into Excel:
var results = new ReportProcess().ExecuteSql(sqlQuery, out string eventMessage);
List<object> queryObjects = new List<object>();
foreach (var result in results)
{
    queryObjects.Add(result);
}
var row = queryObjects.FirstOrDefault();
if (row != null)
{
    var recordType = row.GetType();
    using (var excelFile = new ExcelPackage())
    {
        MemberInfo[] membersToInclude = recordType
            .GetProperties(BindingFlags.Instance | BindingFlags.Public)
            .ToArray();
        var worksheet = excelFile.Workbook.Worksheets.Add("Sheet1");
        worksheet.Cells.LoadFromCollection(queryObjects, true,
            OfficeOpenXml.Table.TableStyles.None,
            BindingFlags.Instance | BindingFlags.Public,
            membersToInclude);
        fileName = Guid.NewGuid() + ".xlsx";
        excelFile.SaveAs(new FileInfo(HttpContext.Current.Server.MapPath("~/temp/") + fileName));
    }
}
And again the code fails around
foreach (var result in results)
{
    queryObjects.Add(result);
}
with the same exception.
So now I'm stuck: no matter what I do to iterate through the IEnumerable results, I always get the OutOfMemoryException.
I have also tried to increase the maximum object size by setting gcAllowVeryLargeObjects to true in web.config, but to no avail:
<runtime>
  <gcAllowVeryLargeObjects enabled="true" />
</runtime>
Other attempts: Googling around did not turn up anything that resolved the issue. Any suggestions/ideas?
Eventually I had to rewrite my code to use an external library, CsvHelper, to read the records:
using (StreamReader sr = new StreamReader(filePath))
{
    var csvReader = new CsvReader(sr);
    var records = csvReader.GetRecords<object>();
    var result = string.Empty;
    try
    {
        return JsonConvert.SerializeObject(new ServerData(records, _eventMessage));
    }
    catch (Exception ex)
    {
        _eventMessage.Level = EventMessage.EventLevel.Error;
    }
    return _serializer.Serialize(new ServerData(result, _eventMessage));
}
That seems to work fine with large data sets; the OutOfMemoryException no longer appears.
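If the single serialized string itself ever becomes the bottleneck, the same idea can be pushed further by serializing straight to the output stream instead of building the JSON in memory first. A rough sketch, assuming Newtonsoft.Json and classic ASP.NET; ServerData and _eventMessage are the poster's own types, referenced only as placeholders:
// Serializes directly to the HTTP response stream so the complete JSON string
// is never held in memory. Assumes Newtonsoft.Json and System.Web.
using (var sr = new StreamReader(filePath))
{
    var csvReader = new CsvReader(sr);
    var records = csvReader.GetRecords<object>();   // lazy enumeration

    var response = HttpContext.Current.Response;
    response.ContentType = "application/json";

    using (var writer = new StreamWriter(response.OutputStream))
    using (var jsonWriter = new JsonTextWriter(writer))
    {
        var serializer = new JsonSerializer();
        serializer.Serialize(jsonWriter, new ServerData(records, _eventMessage));
    }
}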

C# Error: OutOfMemoryException - Reading a large text file and replacing from dictionary

I'm new to C# and object-oriented programming in general. I have an application which parses a text file.
The objective of the application is to read the contents of the provided text file and replace the matching values.
When a file of about 800 MB to 1.2 GB is provided as input, the application crashes with a System.OutOfMemoryException.
While researching, I came across a couple of answers which recommend changing the target platform to x64.
The same issue exists after changing the target platform.
Following is the code:
// Reading the text file
var _data = string.Empty;
using (StreamReader sr = new StreamReader(logF))
{
    _data = sr.ReadToEnd();
    sr.Dispose();
    sr.Close();
}
foreach (var replacement in replacements)
{
    _data = _data.Replace(replacement.Key, replacement.Value);
}
// Writing the text file
using (StreamWriter sw = new StreamWriter(logF))
{
    sw.WriteLine(_data);
    sw.Dispose();
    sw.Close();
}
The error points to:
_data = sr.ReadToEnd();
replacements is a dictionary; the key contains the original word and the value contains the word it should be replaced with.
The Key elements are replaced with the Value elements of each KeyValuePair.
The approach being followed is reading the file, replacing, and writing.
I tried using a StringBuilder instead of a string, yet the application still crashed.
Can this be overcome by reading the file one line at a time, replacing, and writing? What would be an efficient and fast way of doing this?
Update: The system memory is 8 GB, and while monitoring the performance it spikes up to 100% memory usage.
Tim Schmelter's answer works well.
However, the memory utilization spikes to over 90%. It could be due to the following code:
String[] arrayofLine = File.ReadAllLines(logF);
// Generating replacement information
Dictionary<int, string> _replacementInfo = new Dictionary<int, string>();
for (int i = 0; i < arrayofLine.Length; i++)
{
    foreach (var replacement in replacements.Keys)
    {
        if (arrayofLine[i].Contains(replacement))
        {
            arrayofLine[i] = arrayofLine[i].Replace(replacement, masking[replacement]);
            if (_replacementInfo.ContainsKey(i + 1))
            {
                _replacementInfo[i + 1] = _replacementInfo[i + 1] + "|" + replacement;
            }
            else
            {
                _replacementInfo.Add(i + 1, replacement);
            }
        }
    }
}
// Creating replacement information
StringBuilder sb = new StringBuilder();
foreach (var Replacement in _replacementInfo)
{
    foreach (var replacement in Replacement.Value.Split('|'))
    {
        sb.AppendLine(string.Format("Line {0}: {1} ---> \t\t{2}", Replacement.Key, replacement, masking[replacement]));
    }
}
// Writing the replacement information
if (sb.Length != 0)
{
    using (StreamWriter swh = new StreamWriter("logF_Rep.txt"))
    {
        swh.WriteLine(sb.ToString());
        swh.Dispose();
        swh.Close();
    }
}
sb.Clear();
It finds the line number on which each replacement was made. Can this be captured using Tim's code, so as to avoid loading the data into memory multiple times?
If you have very large files you should try MemoryMappedFile, which is designed for this purpose (files > 1 GB) and lets you read "windows" of a file into memory. But it's not easy to use.
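For completeness, a bare-bones sketch of the MemoryMappedFile route might look like the following; it only shows opening a view over part of the file and reading it as text, and the map name, offset, and window size are illustrative assumptions (a real implementation would walk the file window by window and handle lines that span window boundaries):
// Reads a "window" of a huge file through a memory-mapped view.
// Requires System.IO.MemoryMappedFiles; offset and size are illustrative.
using (var mmf = MemoryMappedFile.CreateFromFile(logF, FileMode.Open, "log"))
using (var view = mmf.CreateViewStream(0, 64 * 1024 * 1024)) // first 64 MB of the file
using (var reader = new StreamReader(view))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // inspect / replace within the line here
    }
}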
A simple optimization would be to read and replace line by line
int lineNumber = 0;
var _replacementInfo = new Dictionary<int, List<string>>();
using (StreamReader sr = new StreamReader(logF))
{
    using (StreamWriter sw = new StreamWriter(logF_Temp))
    {
        while (!sr.EndOfStream)
        {
            string line = sr.ReadLine();
            lineNumber++;
            foreach (var kv in replacements)
            {
                bool contains = line.Contains(kv.Key);
                if (contains)
                {
                    List<string> lineReplaceList;
                    if (!_replacementInfo.TryGetValue(lineNumber, out lineReplaceList))
                        lineReplaceList = new List<string>();
                    lineReplaceList.Add(kv.Key);
                    _replacementInfo[lineNumber] = lineReplaceList;
                    line = line.Replace(kv.Key, kv.Value);
                }
            }
            sw.WriteLine(line);
        }
    }
}
At the end you can use File.Copy(logF_Temp, logF, true); if you want to overwrite the old file.
Read the file line by line and append each changed line to another file. At the end, replace the source file with the new one (with or without creating a backup):
var tmpFile = Path.GetTempFileName();
using (StreamReader sr = new StreamReader(logF))
{
    using (StreamWriter sw = new StreamWriter(tmpFile))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            foreach (var replacement in replacements)
                line = line.Replace(replacement.Key, replacement.Value);
            sw.WriteLine(line);
        }
    }
}
File.Replace(tmpFile, logF, null); // you can pass a backup file name instead of null if you want a backup of logF
An OutOfMemoryException is thrown whenever the application tries and fails to allocate memory to perform an operation. According to Microsoft's documentation, the following operations can potentially throw an OutOfMemoryException:
Boxing (i.e., wrapping a value type in an Object)
Creating an array
Creating an object
If you try to create an infinite number of objects, then it's pretty reasonable to assume that you're going to run out of memory sooner or later.
(Note: don't forget about the garbage collector. Depending on the lifetimes of the objects being created, it will delete some of them if it determines they're no longer in use.)
What I suspect is this line:
foreach (var replacement in replacements)
{
    _data = _data.Replace(replacement.Key, replacement.Value);
}
Each call to Replace allocates a brand-new copy of the whole (huge) string, so sooner or later you will run out of memory. Have you ever counted how many times it loops?
Try:
Increase the available memory.
Reduce the amount of data you are retrieving.

async reading and writing lines of text

I've found plenty of examples of how to read/write text to a file asynchronously, but I'm having a hard time finding how to do it with a List.
For the reading I've got this, which seems to work:
public async Task<List<string>> GetTextFromFile(string file)
{
    using (var reader = File.OpenText(file))
    {
        var fileText = await reader.ReadToEndAsync();
        return fileText.Split(new[] { Environment.NewLine }, StringSplitOptions.None).ToList();
    }
}
The writing is a bit trickier though ...
public async Task WriteTextToFile(string file, List<string> lines, bool append)
{
    if (!append && File.Exists(file)) File.Delete(file);
    using (var writer = File.OpenWrite(file))
    {
        StringBuilder builder = new StringBuilder();
        foreach (string value in lines)
        {
            builder.Append(value);
            builder.Append(Environment.NewLine);
        }
        Byte[] info = new UTF8Encoding(true).GetBytes(builder.ToString());
        await writer.WriteAsync(info, 0, info.Length);
    }
}
My problem with this is that, for a moment, my data exists in memory three times:
the original List of lines, then the StringBuilder that turns it into a single string with newlines, and then the byte representation of that string in info.
Having three copies of essentially the same data in memory seems excessive.
I am concerned about this because at times I'll be reading and writing large text files.
Following up on that, let me be clear: I know that for extremely large text files I can do all of this line by line. What I am looking for are two methods of reading/writing data: the first reads in the whole thing and processes it, and the second does it line by line. Right now I am working on the first approach for my small and moderately sized text files. But I am still concerned about the data-replication issue.
The following might suit your needs, as it does not copy the data into another buffer and writes it line by line:
public async Task WriteTextToFile(string file, List<string> lines, bool append)
{
    if (!append && File.Exists(file))
        File.Delete(file);
    using (var writer = File.OpenWrite(file))
    {
        using (var streamWriter = new StreamWriter(writer))
        {
            foreach (var line in lines)
                await streamWriter.WriteLineAsync(line);
        }
    }
}
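For the read side, the same line-by-line idea could look roughly like this; it is a sketch of the second (line-by-line) approach the question mentions, and the method name is just for illustration:
// Reads a file into a List<string> one line at a time, asynchronously,
// so the whole file text is never held as a single large string.
public async Task<List<string>> GetTextFromFileByLine(string file)
{
    var lines = new List<string>();
    using (var reader = File.OpenText(file))
    {
        string line;
        while ((line = await reader.ReadLineAsync()) != null)
        {
            lines.Add(line);
        }
    }
    return lines;
}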

Clearing the memory associated with a large List

I followed the advice in this SO question. It did not work for me. Here is my situation and the code associated with it.
I have a very large list; it has 2.04M items in it. I read it into memory to sort it, and then write it to a .csv file. I have 11 .csv files that I need to read and subsequently sort. The first iteration gives me a memory usage of just over 1 GB. I tried setting the list to null. I tried calling List.Clear(). I also tried List.TrimExcess(). I have also waited for the GC to do its thing, hoping that it would know that there are no reads or writes going to that list.
Here is the code that I am using. Any advice is always greatly appreciated.
foreach (var zone in zones)
{
    var filePath = string.Format("outputs/zone-{0}.csv", zone);
    var lines = new List<string>();
    using (StreamReader reader = new StreamReader(filePath))
    {
        var headers = reader.ReadLine();
        while (!reader.EndOfStream)
        {
            var line = reader.ReadLine();
            lines.Add(line);
        }
        // sort the file, then rewrite the file into inputs
        lines = lines.OrderByDescending(l => l.ElementAt(0)).ThenByDescending(l => l.ElementAt(1)).ToList();
        using (StreamWriter writer = new StreamWriter(string.Format("inputs/zone-{0}-sorted.csv", zone)))
        {
            writer.WriteLine(headers);
            writer.Flush();
            foreach (var line in lines)
            {
                writer.WriteLine(line);
                writer.Flush();
            }
        }
        lines.Clear();
        lines.TrimExcess();
    }
}
Try putting the whole thing in a using:
using (var lines = new List<string>())
{ ... }
Although I'm not sure about the nested usings.
Instead, where you have lines.Clear();, add lines = null;. That should encourage the garbage collector.

How to read from a StorageFile until a character is found in a Windows Store app?

I have a Windows Store app, and what I am trying to do is read text from a file. I have two text fields; the descriptionTextField accepts new lines.
// Read from file
public async Task ReadFile()
{
    try
    {
        // get the file
        StorageFile notesStorageFile = await localFolder.GetFileAsync("NotesData.txt");
        var readThis = await FileIO.ReadLinesAsync(notesStorageFile);
        foreach (var line in readThis)
        {
            notesRepository.Add(new Note(line.Split(';')[0], line.Split(';')[1]));
        }
        Debug.WriteLine("File read successfully.");
    }
    catch (FileNotFoundException ex)
    {
        Debug.WriteLine("Error1: " + ex);
    }
}
Now, if NotesData.txt contains:
Eggs;description eggs;
it works fine.
But if NotesData.txt contains:
Groceries;buy 10 eggs
buy 1 kg meat;
I get an index-out-of-bounds error. I just can't figure out how to fix the ReadFile() code.
The exception appears when I am calling the method. The problem, I believe, is with the descriptionTextBox, which can accept new lines.
NotesData.txt
Apples;description apples; // works ok
Pears; description line 1
description line 2
description line 3; // problem
Pears; description line 1; // works ok
It seems to me you're trying to read back the contents of a file you have previously saved, and the problems you're having are just a consequence of the format you selected for saving the data in the first place. Looking at it, new lines are not the only difficulty you're going to have: what if the user decides to enter a semicolon in one of the textboxes? Are you preventing that?
I suggest you abandon your own serialization format and use one of the existing ones instead. If your notesRepository is a List<Note>, this could be your (de)serialization code for XML:
private async Task Save(List<Note> notesRepository)
{
    var xmlSerializer = new XmlSerializer(typeof(List<Note>));
    using (var stream = await ApplicationData.Current.LocalFolder.OpenStreamForWriteAsync("notes.xml", CreationCollisionOption.ReplaceExisting))
    {
        xmlSerializer.Serialize(stream, notesRepository);
    }
}

private async Task<List<Note>> Load()
{
    var xmlSerializer = new XmlSerializer(typeof(List<Note>));
    using (var stream = await ApplicationData.Current.LocalFolder.OpenStreamForReadAsync("notes.xml"))
    {
        return (List<Note>)xmlSerializer.Deserialize(stream);
    }
}
And this for JSON:
private async Task Save(List<Note> notesRepository)
{
    var jsonSerializer = new DataContractJsonSerializer(typeof(List<Note>));
    using (var stream = await ApplicationData.Current.LocalFolder.OpenStreamForWriteAsync("notes.json", CreationCollisionOption.ReplaceExisting))
    {
        jsonSerializer.WriteObject(stream, notesRepository);
    }
}

private async Task<List<Note>> Load()
{
    var jsonSerializer = new DataContractJsonSerializer(typeof(List<Note>));
    using (var stream = await ApplicationData.Current.LocalFolder.OpenStreamForReadAsync("notes.json"))
    {
        return (List<Note>)jsonSerializer.ReadObject(stream);
    }
}
When the repository gets too large to always load and save as a whole, you could even consider structured storage like SQLite.
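Both serializers above assume a Note type they can round-trip. Here is a minimal sketch of such a type, assuming the two fields seen in the question (a title and a description); the property names are illustrative:
// Minimal Note type for the (de)serialization examples above.
// XmlSerializer requires a public parameterless constructor; both serializers
// work with public read/write properties. Property names are assumptions.
public class Note
{
    public string Title { get; set; }
    public string Description { get; set; }

    public Note() { } // required by XmlSerializer

    public Note(string title, string description)
    {
        Title = title;
        Description = description;
    }
}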
This line:
notesRepository.Add(new Note(line.Split(';')[0], line.Split(';')[1]));
assumes that you'll always have at least one semicolon in each line. If you've got a line in your file which doesn't have one (e.g. a blank line), then it will fail.
It's not clear whether that's where your problem is, because you haven't said where the exception is coming from, but that would be my first guess.
I'd also only do the split once:
string[] bits = line.Split(';');
if (bits.Length >= 2)
{
    // What do you want to do with lines with more than one semi-colon?
    notesRepository.Add(new Note(bits[0], bits[1]));
}
else
{
    // Handle lines without a semi-colon at all.
}
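Since the real cause in the question is descriptions that span several lines, another option is to keep the format but accumulate lines until a record terminator is seen. A rough sketch, assuming (as in the samples above) that every complete record ends with a trailing semicolon and that the first semicolon separates the title from the description:
// Re-assembles records that span multiple lines. Assumes each complete record
// ends with ';' and that the first ';' separates the title from the description.
var buffer = new StringBuilder();
foreach (var line in readThis)
{
    buffer.Append(buffer.Length == 0 ? line : Environment.NewLine + line);
    if (line.TrimEnd().EndsWith(";"))
    {
        var record = buffer.ToString().TrimEnd(';');
        int separator = record.IndexOf(';');
        if (separator >= 0)
        {
            notesRepository.Add(new Note(record.Substring(0, separator),
                                         record.Substring(separator + 1)));
        }
        buffer.Clear();
    }
}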
