Parsing a big CSV file (C# .NET 4)

I know this question has been asked before, but I can't seem to get it working with the answers I've read. I have a CSV file of roughly 1.2 GB. If I run the process as 32-bit I get an OutOfMemoryException; it works as a 64-bit process, but it still takes 3.4 GB of memory. I know I'm storing a lot of data in my CustomData class, but still, 3.4 GB of RAM? Am I doing something wrong when reading the file?
dict is a dictionary that simply maps a column index to the property the value should be saved in. Am I reading the file the right way?
StreamReader reader = new StreamReader(File.OpenRead(path));
while (!reader.EndOfStream)
{
    String line = reader.ReadLine();
    String[] values = line.Split(';');
    CustomData data = new CustomData();
    string value;
    for (int i = 0; i < values.Length; i++)
    {
        dict.TryGetValue(i, out value);
        Type targetType = data.GetType();
        PropertyInfo prop = targetType.GetProperty(value);
        if (values[i] == null)
        {
            prop.SetValue(data, "NULL", null);
        }
        else
        {
            prop.SetValue(data, values[i], null);
        }
    }
    dataList.Add(data);
}

There doesn't seem to be anything wrong with your usage of the stream reader: you read a line into memory, then forget it.
However, in C# a string is encoded in memory as UTF-16, so on average a character consumes 2 bytes.
If your CSV also contains a lot of empty fields that you convert to "NULL", you add up to 7 bytes for each empty field.
So on the whole, since you basically store all the data from your file in memory, it's not really surprising that you need almost 3 times the size of the file in memory.
The actual solution is to parse your data in chunks of N lines, process them, and free them from memory (see the sketch at the end of this answer).
Note: consider using a CSV parser; there is more to CSV than just commas or semicolons. What if one of your fields contains a semicolon, a newline, a quote...?
Edit
Actually, each string takes up about 20 + (N/2)*4 bytes in memory (see C# in Depth).
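Going back to the chunked approach mentioned above, here is a minimal sketch, assuming a hypothetical ProcessChunk method that does whatever you need with the parsed rows (map to CustomData, write to a database, aggregate, ...):

const int chunkSize = 10000;
var chunk = new List<string[]>(chunkSize);
using (var reader = new StreamReader(File.OpenRead(path)))
{
    while (!reader.EndOfStream)
    {
        // keep at most one chunk of parsed lines in memory at a time
        chunk.Add(reader.ReadLine().Split(';'));
        if (chunk.Count == chunkSize)
        {
            ProcessChunk(chunk);
            chunk.Clear();   // let the previous chunk be collected
        }
    }
    if (chunk.Count > 0)
        ProcessChunk(chunk); // last partial chunk
}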

OK, a couple of points here.
As pointed out in the comments, .NET under x86 can only consume about 1.5 GB per process, so consider that your maximum memory in 32-bit.
The StreamReader itself will have an overhead. I don't know whether it caches the entire file in memory or not (maybe someone can clarify?). If it does, reading and processing the file in chunks might be a better solution.
The CustomData class: how many fields does it have, and how many instances are created? Note that you need 32 bits for each reference in x86 and 64 bits for each reference in x64. So if CustomData has 10 fields of type System.Object, each instance requires 88 bytes before it stores any data.
The dataList.Add at the end. I assume you are adding to a generic List<T>? If so, note that List<T> employs a doubling algorithm to resize. If you have 1 GB in a List and it needs 1 more byte of capacity, it creates a 2 GB array and copies the 1 GB into it on resize. So all of a sudden the 1 GB + 1 byte actually requires 3 GB to manipulate. Another alternative is to use a pre-sized array, or a pre-sized List as sketched below.
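A minimal sketch of pre-sizing, assuming you are willing to spend one extra streaming pass over the file to count the rows first:

// Count the lines once (File.ReadLines streams, it does not load the file),
// then allocate the list with that capacity so it never has to resize.
int rowCount = File.ReadLines(path).Count();
var dataList = new List<CustomData>(rowCount);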

Related

How to load file fully and process record csvreader?

I use the CSV reader and found that it takes a lot of time to parse the data. How can I load the entire CSV file into memory and then process it record by record, since I have to do custom mapping of the records?
TextReader tr = new StreamReader(File.Open(@"C:\MarketData\" + symbol + ".txt", FileMode.Open));
CsvReader csvr = new CsvReader(tr);
while (csvr.Read())
{
    // do your magic
}
Create a class that exactly represents/mirrors your CSV file. Then read all the contents into a list of that class. The following snippet is from CsvHelper's documentation.
var csv = new CsvReader( textReader );
var records = csv.GetRecords<MyClass>().ToList();
The important part is the .ToList(), as this will force the load of all the data into your list, rather than yielding results as you access them.
You can then perform additional mapping / extraction on that list, which will be in memory.
If you're already doing this, you may benefit from loading your CSV into a HashSet rather than a List via ToHashSet(). See HashSet vs List Performance.
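For illustration, the MyClass used above is just a plain class whose properties line up with the CSV columns; the column names here are hypothetical market-data fields:

public class MyClass
{
    // properties are matched to CSV header names by CsvHelper's default mapping
    public DateTime Date { get; set; }
    public decimal Open { get; set; }
    public decimal High { get; set; }
    public decimal Low { get; set; }
    public decimal Close { get; set; }
    public long Volume { get; set; }
}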
To answer your question directly: you can load the file fully into a memory stream and then re-read from that stream using your CsvReader. Similarly, you can create a bigger read buffer for your FileStream, e.g. 15 MB, which would read the entire file into the buffer in one hit. I doubt either of these will actually improve performance for 10 MB files.
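A minimal sketch of the memory-stream variant, assuming the whole file fits comfortably in RAM and CsvReader is whatever CSV reader you are already using:

// Read the whole file into memory once, then let the CSV reader parse the
// in-memory copy instead of touching the disk per read.
byte[] fileBytes = File.ReadAllBytes(filePath);
using (var ms = new MemoryStream(fileBytes))
using (var tr = new StreamReader(ms))
{
    var csvr = new CsvReader(tr);
    while (csvr.Read())
    {
        // map the current record as before
    }
}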
Find your real performance bottleneck: time to read the file content from disk, time to parse the CSV into fields, or time to process a record? A 10 MB file looks really small. I'm processing sets of 250 MB+ CSV files with a custom CSV reader with no complaints.
If processing is the bottleneck, you have several threads available, and your CSV file format does not need to support escaped line breaks, then you could read the entire file into a list of lines (System.IO.File.ReadAllLines / .ReadLines) and parse each line using a different Task. For example:
System.IO.File.ReadLines(path)
    .Skip(1)                // header line; assume trusted to be correct
    .AsParallel()
    .Select(ParseRecord)    // RecordClass ParseRecord(string line)
    .ForAll(ProcessRecord); // void ProcessRecord(RecordClass record)
If you have many files to parse, you could process each file in a different Task and use async methods to maximise throughput. If they all come from the same physical disk then your mileage will vary, and may even get worse than a single-threaded approach.
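A sketch of the one-Task-per-file idea, assuming a hypothetical ProcessFile(string) method that reads and parses a single file (Task.Factory.StartNew is used since this is .NET 4):

// One Task per file; the thread pool schedules them concurrently.
string[] files = Directory.GetFiles(@"C:\MarketData", "*.txt");
Task[] tasks = files
    .Select(f => Task.Factory.StartNew(() => ProcessFile(f)))
    .ToArray();
Task.WaitAll(tasks);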
More advanced:
If you know your files contain 8-bit characters only, then you can operate on byte arrays and skip the StreamReader overhead of decoding bytes into chars. This way you can read the entire file into a byte array in a single call and scan for line breaks, assuming no line-break escapes need to be supported. In that case scanning for line breaks can be done by multiple threads, each looking at a part of the byte array.
If you don't need to support field escapes (a,"b,c",d), then you can write a faster parser, simply looking for field separators (typically a comma). You can also split field-demarcation parsing and field-content parsing into separate threads if that's a bottleneck, though memory-access locality may negate any benefits.
Under certain circumstances you may not need to parse fields into intermediate data structures (e.g. doubles, strings) and can work directly off references to the start/end of fields, saving yourself some intermediate data structure creation.
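A rough sketch of the byte-level scan, assuming single-byte characters, '\n' line endings, and no escaped line breaks:

// Read the whole file as raw bytes and record the start index of every line.
// The index list can then be split across threads, each parsing its own
// slice of lines directly from the byte array.
byte[] buffer = File.ReadAllBytes(path);
var lineStarts = new List<int> { 0 };
for (int i = 0; i < buffer.Length; i++)
{
    if (buffer[i] == (byte)'\n' && i + 1 < buffer.Length)
        lineStarts.Add(i + 1);
}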

Removing redundant data from large file

I have a log file which has single strings on each line. I am trying to remove duplicate data from the file and save it out as a new file. I had first thought of reading the data into a HashSet and then saving the contents of the HashSet out; however, I get an OutOfMemoryException when attempting to do this (on the line that adds the string to the HashSet).
There are around 32,000,000 lines in the files. It's not practical to re-read the entire file for each comparison.
Any ideas? My other thought was to output the entire contents into a SQLite database and selecting DISTINCT values, but I'm not sure that'd work either with that many values.
Thanks for any input!
The first thing you need to think about is: is high memory consumption actually a problem?
If your application will always run on a server with a lot of RAM available, or in any other case where you know you'll have enough memory, you can do a lot of things you can't do if your application will run in a low-memory environment, or in an unknown environment. If memory isn't the problem, then make sure your application is running as a 64-bit application (on a 64-bit OS, of course), otherwise you'll be limited to 2 GB of memory (4 GB if you use the LARGEADDRESSAWARE flag). I'd guess that in this case this is your problem, and all you've got to do is change it and it'll work fine (assuming you have enough memory).
If memory is a problem and you need to avoid using too much of it, you can, as you suggested, add all the data to a database (I'm more familiar with databases like SQL Server, but I guess SQLite will do), make sure you have the right index on the column, and then select the distinct values.
Another option is to read the file as a stream, line by line; for each line compute a hash, write the line into the other file only if the hash hasn't been seen before, and keep just the hashes in memory. If the hash already exists, move on to the next line (and, if you wish, increment a counter of removed lines). That way you keep much less data in memory (only a hash per unique line). A rough sketch of that approach:
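This is only a sketch, assuming hypothetical inputPath/outputPath variables; note that two different lines can share a hash code, so in rare collision cases a non-duplicate line could be dropped:

var seen = new HashSet<int>();
int removed = 0;
using (var reader = new StreamReader(inputPath))
using (var writer = new StreamWriter(outputPath))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        if (seen.Add(line.GetHashCode()))
            writer.WriteLine(line);   // first time this hash has been seen
        else
            removed++;                // duplicate (or, rarely, a hash collision)
    }
}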
Best of luck.
Have you tried using an array to initialize the HashSet? I assume that the doubling algorithm of HashSet is the reason for the OutOfMemoryException.
var uniqueLines = new HashSet<string>(File.ReadAllLines(@"C:\Temp\BigFile.log"));
Edit:
I am testing the result of the .Add() method to see if it returns false to count the number of items that are redundant. I'd like to keep this feature if possible.
Then you should try to initialize the HashSet with the correct (maximum) size, i.e. the number of lines in the file:
int lineCount = File.ReadLines(path).Count();
List<string> fooList = new List<String>(lineCount);
var uniqueLines = new HashSet<string>(fooList);
fooList.Clear();
foreach (var line in File.ReadLines(path))
uniqueLines.Add(line);
I took a similar approach to Tim, using a HashSet, but I added manual line counting and comparison.
I read the setup log from my Windows 8 install, which was 58 MB in size at 312,248 lines, and ran it in LINQPad in 0.993 seconds.
var temp = new List<string>(10000);
var uniqueHash = new HashSet<int>();
int lineCount = 0;
int uniqueLineCount = 0;
using (var fs = new FileStream(@"C:\windows\panther\setupact.log", FileMode.Open, FileAccess.Read))
using (var sr = new StreamReader(fs, true))
{
    while (!sr.EndOfStream)
    {
        lineCount++;
        var line = sr.ReadLine();
        var key = line.GetHashCode();
        if (!uniqueHash.Contains(key))
        {
            uniqueHash.Add(key);
            temp.Add(line);
            uniqueLineCount++;
            // write unique lines out in batches of 10,000 to keep memory flat
            if (temp.Count > 10000)
            {
                File.AppendAllLines(@"c:\temp\output.txt", temp);
                temp.Clear();
            }
        }
    }
    // flush the final partial batch
    if (temp.Count > 0)
        File.AppendAllLines(@"c:\temp\output.txt", temp);
}
Console.WriteLine("Total Lines: " + lineCount.ToString());
Console.WriteLine("Lines Removed: " + (lineCount - uniqueLineCount).ToString());

Read Extremely large file efficiently in C#. Currently using StreamReader

I have a JSON file that is 50 GB and beyond.
Following is what I have written to read a very small chunk of the JSON. I now need to modify this to read the large file.
internal static IEnumerable<T> ReadJson<T>(string filePath)
{
    DataContractJsonSerializer ser = new DataContractJsonSerializer(typeof(T));
    using (StreamReader sr = new StreamReader(filePath))
    {
        String line;
        // Read and deserialize lines from the file until the end of
        // the file is reached.
        while ((line = sr.ReadLine()) != null)
        {
            byte[] jsonBytes = Encoding.UTF8.GetBytes(line);
            XmlDictionaryReader jsonReader = JsonReaderWriterFactory.CreateJsonReader(jsonBytes, XmlDictionaryReaderQuotas.Max);
            var myPerson = ser.ReadObject(jsonReader);
            jsonReader.Close();
            yield return (T)myPerson;
        }
    }
}
Would it suffice if I specify the buffer size while constructing the StreamReader in the current code?
Please correct me if I am wrong here: the buffer size basically specifies how much data is read from disk into memory at a time. So if the file is 100 MB in size with a buffer size of 5 MB, it reads 5 MB at a time into memory until the entire file is read.
Assuming my understanding above is right, what would be the ideal buffer size with such a large text file? Would int.MaxValue be a bad idea? On a 64-bit PC int.MaxValue is 2147483647. I presume the buffer size is in bytes, which evaluates to about 2 GB. This itself could consume time. I was looking at something like 100 MB - 300 MB as the buffer size.
It is going to read a line at a time (of the input file), which could be 10 bytes or could be all 50 GB. So it comes down to: how is the input file structured? And if the input JSON has newlines anywhere other than cleanly at the breaks between objects, this could get really ill.
The buffer size might impact how much it reads while looking for the end of each line, but ultimately: it needs to find a new-line each time (at least, how it is written currently).
I think you should first compare different parsers before worrying about details such as the buffer size.
The differences between DataContractJsonSerializer, Raven JSON or Newtonsoft JSON will be quite significant.
So your main issue here is: where are your boundaries? Given that your document is JSON, it seems likely that your boundaries are objects (classes). I assume (or hope) you don't have one honking great class that is 50 GB large. I also assume you don't really need all those classes in memory, but you may need to search the whole thing for your subset... does that sound roughly right? If so, I think your pseudocode is something like:
using a JSON parser that accepts a StreamReader (Newtonsoft?)
    read and parse until EOF
        yield return your parsed class if it matches the criteria
        read and parse the next class
end
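A rough sketch of that idea using Newtonsoft Json.NET, assuming the file is one huge JSON array of objects and using a hypothetical Person class and MatchesCriteria filter:

// JsonTextReader streams the file token by token; only the object currently
// being deserialized is held in memory.
static IEnumerable<Person> ReadMatching(string filePath)
{
    var serializer = new JsonSerializer();
    using (var sr = new StreamReader(filePath))
    using (var reader = new JsonTextReader(sr))
    {
        while (reader.Read())
        {
            if (reader.TokenType == JsonToken.StartObject)
            {
                var person = serializer.Deserialize<Person>(reader);
                if (MatchesCriteria(person))   // hypothetical filter
                    yield return person;
            }
        }
    }
}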

Custom archive format File Reading

C#.NET 4.0
I'm having an interesting problem here with reading a custom archive file format. In C#, I wrote a program that creates an archive header (some overhead info about the archive as a whole: number of files, those kinds of things). It then takes an input file to be stored, reads its bytes, writes some overhead about the file (filename, type, size and such) and then the actual file data. I can also extract files from the archive through this program. To test it, I stored a PNG image and extracted it by reading the file size from the overhead, allocating a byte array of that size, pulling the file data into that array, and writing it out with a stream writer. No big deal, worked fine. Now, we go to the C++ side...
C++
My C++ program needs to read the file data in, determine the file type, and then pass it off to the appropriate processing class. The processing classes were giving errors, which they shouldn't have. So I decided to write the file data out from the C++ program after reading it using fwrite(), and the resulting file appears to be damaged. In a nutshell, this is the code being used to read the file...
unsigned char * data = 0;
char temp = 0;
__int64 fileSize = 0;

fread(&fileSize, sizeof(__int64), 1, _fileHandle);
data = new unsigned char[fileSize];

for (__int64 i = 0; i < fileSize; i++)
{
    fread(&temp, 1, 1, _fileHandle);
    data[i] = temp;
}
(I'm at work right now, so I just wrote this from memory. However, I'm 99% positive it's accurate to my code at home. I'm also not concerned with non MS Standards at the moment, so please bear with the __int64.)
I haven't gone through all 300-something-thousand bytes to determine if everything is consistent, but the first 20 or so bytes that I looked at appear to be correct. I don't exactly see why there is a problem. Is there something funny about fread()? To double-check the file in the archive, I also removed all the archive overhead and saved just the image data to a new PNG image with Notepad, which worked fine.
Should I be reading this differently? Is there something wrong with using fread() to read in this data?
Given that the first n bytes appear to be correct, did you by chance forget to open the file in binary mode ("rb")? If you didn't then it's helpfully converting any sequences of \r\n into \n for you which would obviously not be what you want.
Since this question is tagged C++ did you consider using the canonical C++ approach of iostreams rather than the somewhat antiquated FILE* streams from C?

parsing binary file in C#

I have a binary file that I have stored in a byte array. The file size can be 20 MB or more. I then want to parse it, or find a particular value in the file. I am doing it in two ways:
1. By converting the full file to a char array.
2. By converting the full file to a hex string (I also have the hex values).
What is the best way to parse the full file, or should I do it in binary form? I am using VS 2005.
From the aspect of memory consumption, it would be best if you could parse it directly, on the fly.
Converting it to a char array in C# effectively doubles its size in memory (presuming you convert each byte to a char), while a hex string will take at least 4 times the size (C# chars are 16-bit Unicode characters).
On the other hand, if you need to run many searches and parses over an existing set of data repeatedly, you may benefit from having it stored in whichever form suits your needs better.
What's stopping you from searching in the byte[]?
IMHO, if you're simply searching for a byte of a specified value, or several consecutive bytes, this is the easiest and most efficient way to do it.
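A minimal sketch of searching the raw byte[] directly for a multi-byte pattern; the pattern bytes below are just an example:

// Naive forward scan for the first occurrence of pattern in data;
// returns -1 if not found. Fine for a 20 MB file; reach for KMP/Boyer-Moore
// only if profiling shows this scan is a bottleneck.
static int IndexOf(byte[] data, byte[] pattern)
{
    for (int i = 0; i <= data.Length - pattern.Length; i++)
    {
        int j = 0;
        while (j < pattern.Length && data[i + j] == pattern[j])
            j++;
        if (j == pattern.Length)
            return i;
    }
    return -1;
}

byte[] data = File.ReadAllBytes(path);
int pos = IndexOf(data, new byte[] { 0xCA, 0xFE });  // example pattern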
If I understood your question correctly you need to find strings which can contain any characters in a large binary file. Does the binary file contain text? If so do you know the encoding? If so you can use StreamReader class like so:
using (StreamReader sr = new StreamReader(@"C:\test.dat", System.Text.Encoding.UTF8))
{
    string s = sr.ReadLine();
}
In any case I think it's much more efficient to use some kind of stream access to the file instead of loading it all into memory.
You could load it into memory in chunks and then use a pattern-matching algorithm (like Knuth-Morris-Pratt or Rabin-Karp).
