How to load a file fully and process records with CsvReader? - C#

I use the CSV reader and found that it takes a lot of time to parse the data. How can I load the entire CSV file into memory and then process it record by record, as I have to do custom mapping of the records?
TextReader tr = new StreamReader(File.Open(@"C:\MarketData\" + symbol + ".txt", FileMode.Open));
CsvReader csvr = new CsvReader(tr);
while (csvr.Read())
{
    // do your magic
}

Create a class that exactly represents/mirrors your CSV file. Then read all the contents into a list of that class. The following snippet is from CsvHelper's documentation.
var csv = new CsvReader( textReader );
var records = csv.GetRecords<MyClass>().ToList();
The important part is the .ToList(), as this will force the load of all the data into your list, rather than yielding results as you access them.
You can then perform additional mapping / extraction on that list, which will be in memory.
If you're already doing this, you may benefit from loading your CSV into a HashSet rather than a List via ToHashSet(). See HashSet vs List Performance.
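For completeness, here is a rough sketch of what that class and the full load could look like. The property names and file path are only assumptions based on the question, and newer CsvHelper versions also require a CultureInfo argument in the CsvReader constructor:
public class MyClass
{
    // assumed columns; match these to your actual CSV headers
    public string Symbol { get; set; }
    public DateTime Date { get; set; }
    public decimal Close { get; set; }
}

using (var textReader = new StreamReader(@"C:\MarketData\" + symbol + ".txt"))
{
    var csv = new CsvReader(textReader); // new CsvReader(textReader, CultureInfo.InvariantCulture) on newer versions
    var records = csv.GetRecords<MyClass>().ToList(); // everything is now in memory
    // var lookup = records.ToHashSet();              // optional: faster membership tests than a List
    // custom mapping / extraction happens here
}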

To answer your question directly: you can load the file fully into a MemoryStream and then re-read it from that stream with your CsvReader. Similarly, you can create a bigger read buffer for your FileStream, e.g. 15 MB, which would read the entire file into the buffer in one hit. I doubt either of these will actually improve performance for 10 MB files.
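As a rough sketch of that idea (CsvReader here stands for whichever CSV reader you're already using):
byte[] raw = File.ReadAllBytes(@"C:\MarketData\" + symbol + ".txt"); // one disk read for the whole file
using (var ms = new MemoryStream(raw))
using (var tr = new StreamReader(ms))
{
    var csvr = new CsvReader(tr);
    while (csvr.Read())
    {
        // per-record custom mapping, now parsing from memory rather than disk
    }
}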
Find your real performance bottleneck: time to read the file content from disk, time to parse the CSV into fields, or time to process a record? A 10 MB file looks really small. I'm processing sets of 250 MB+ CSV files with a custom CSV reader with no complaints.
If processing is the bottleneck and you have several threads available and your csv file format does not need to support escaped line breaks, then you could read the entire file into a list of lines (System.IO.File.ReadAllLines / .ReadLines) and parse each line using a different Task. For example:
System.IO.File.ReadLines(path)   // 'path' = your CSV file
    .Skip(1)                     // header line. Assume trusted to be correct.
    .AsParallel()
    .Select(ParseRecord)         // RecordClass ParseRecord(string line)
    .ForAll(ProcessRecord);      // void ProcessRecord(RecordClass record)
If you have many files to parse, you could process each file in a different Task and use async methods to maximise throughput. If they all come from the same physical disk then your mileage will vary and may even get worse than a single-threaded approach.
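A loose sketch of that per-file parallelism, using Task.Run for simplicity rather than true async I/O (the folder path and the ParseFile method are placeholders):
var tasks = Directory.EnumerateFiles(@"C:\MarketData", "*.txt")
    .Select(path => Task.Run(() => ParseFile(path))) // RecordClass[] ParseFile(string path)
    .ToArray();
Task.WaitAll(tasks);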
More advanced:
If you know your files contain only 8-bit characters, then you can operate on byte arrays and skip the StreamReader overhead of converting bytes into chars. This way you can read the entire file into a byte array in a single call and scan for line breaks, assuming no line-break escapes need to be supported. In that case scanning for line breaks can be done by multiple threads, each looking at a part of the byte array.
If you don't need to support field escapes (a,"b,c",d), then you can write a faster parser, simply looking for field separators (typically a comma). You can also split field-demarcation parsing and field-content parsing into separate threads if that's a bottleneck, though memory access locality may negate any benefits.
Under certain circumstances you may not need to parse fields into intermediate data structures (e.g. doubles, strings) and can process directly off references to the start/end of fields, saving yourself some intermediate data structure creation.
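For illustration, a rough sketch of that byte-level scan, assuming ASCII text, '\n' line endings, and no quoting at all ('path' is a placeholder):
byte[] data = File.ReadAllBytes(path);   // whole file in one call
int lineStart = 0;
for (int i = 0; i < data.Length; i++)
{
    if (data[i] != (byte)'\n') continue;
    int lineEnd = (i > lineStart && data[i - 1] == (byte)'\r') ? i - 1 : i;

    int fieldStart = lineStart;
    for (int j = lineStart; j <= lineEnd; j++)
    {
        if (j == lineEnd || data[j] == (byte)',')
        {
            // field bytes are data[fieldStart .. j); parse them in place
            // (e.g. digits to int/double) instead of allocating strings
            fieldStart = j + 1;
        }
    }
    lineStart = i + 1;
}
// note: a final line with no trailing '\n' is not handled in this sketch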

Related

Approach to Implementing a CSV Generator from Linear Information

I have a pseudo-code question for a problem I've encountered. I have a binary file of variable data recorded at certain record rates (20 Hz, 40 Hz, etc.). This information is laid out linearly in the file: for example, if I have var1 and var2, I'd read var1's data from the file, then var2's data, then var1's next sample, and so on. I'm pretty sure the best way to construct a CSV is by row. My original thought was to just read in the binary file and parse the information into a temporary buffer/structure. Once all the binary data is read in, begin writing the CSV file row by row. My only concern with this approach is memory consumption. There can be anywhere from 300-400 parameters recorded at rates as high as 160 Hz. That's a lot of data to have stored. I was wondering if there are any other approaches that are more efficient. The language I'm using is C#.
As I understand it, you have:
{ some large number of var1 samples }
{ some large number of var2 samples }
{ some large number of var3 samples }
And you want to create:
var1, var2, var3, etc.
var1, var2, var3, etc.
If you have enough memory to hold all of that data, then your first approach is the way to go.
Only you can say whether you have enough memory. If the file is all binary data (i.e. integers, floats, doubles, etc.), then you can get a pretty good idea of how much memory you'll need just by looking at the size of the file (for instance, 400 parameters sampled at 160 Hz for one hour, stored as 8-byte doubles, comes to roughly 400 × 160 × 3600 × 8 ≈ 1.8 GB).
Assuming that you don't have enough memory to hold all of the data at once, you could easily process the data in two passes.
On the first pass, you read all of the var1 data and immediately write it to a temporary file called var1Data. Then do the same with var2, var3, etc. When the first pass is done, you have N binary files, each one containing the data for that variable.
The second pass is a simple matter of opening all of those files, and then looping:
while not end of data
    read from var1Data
    read from var2Data
    read from var3Data
    etc.
    create structure
    write to CSV
Or, you could do this:
while not end of data
    read from var1Data
    write to CSV
    read from var2Data
    write to CSV
    etc.
Granted, it's two passes over the data, but if you can't fit all of the data into memory that's the way you'll have to go.
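Here's a loose sketch of that second pass in C#, assuming each temporary file is a flat sequence of doubles for one parameter and every file holds the same number of samples (the names are illustrative):
string[] varNames = { "var1", "var2", "var3" }; // ... up to 300-400 parameters
var readers = varNames.Select(v => new BinaryReader(File.OpenRead(v + "Data.bin"))).ToArray();
using (var csv = new StreamWriter("out.csv"))
{
    csv.WriteLine(string.Join(",", varNames)); // header row
    while (readers[0].BaseStream.Position < readers[0].BaseStream.Length)
    {
        var row = readers.Select(r => r.ReadDouble().ToString(CultureInfo.InvariantCulture));
        csv.WriteLine(string.Join(",", row));
    }
}
foreach (var r in readers) r.Dispose();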
One drawback is that you'll have 300 or 400 files open concurrently. That shouldn't be a problem. But there is another way to do it.
On the first pass, read, say, the first 100,000 values for each parameter into memory, create your structures, and write those to the CSV. Then make another pass over the file, reading items 100,000 to 199,999 for each parameter into memory and append to the CSV. Do that until you've processed the entire file.
That might be easier, depending on how your binary file is structured. If you know where each parameter's data starts in the file, and all the values for that parameter are the same size, then you can seek directly to the start for that parameter (or to the 100,000th entry for that parameter), and start reading. And once you've read however many values for var1, you can seek directly to the start of the var2 data and start reading from there. You skip over data you're not ready to process in this pass.
Which method to use will depend on how much memory you have and how your data is structured. As I said, if it all fits into memory then your job is very easy. If it won't fit into memory, then if the binary file is structured correctly you can do it with multiple passes over the input file, on each pass skipping over the data you don't want for that pass. Otherwise, you can use the multiple files method, or you can do multiple passes over the input, reading sequentially (i.e. not skipping over data).
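And a sketch of the seek-based variant, assuming you know each parameter's start offset in the file and that every value is an 8-byte double (all names here are placeholders):
using (var fs = File.OpenRead(binaryPath))
using (var br = new BinaryReader(fs))
{
    // jump straight to the chunk of this parameter that this pass needs
    fs.Seek(paramStartOffset + (long)firstSampleIndex * sizeof(double), SeekOrigin.Begin);
    var chunk = new double[samplesPerPass];
    for (int i = 0; i < samplesPerPass; i++)
        chunk[i] = br.ReadDouble();
    // hold 'chunk' in memory, repeat for the other parameters, then write CSV rows
}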

Efficient Methods of Comparing Text Files Simultaneously

I did check to see if any existing questions matched mine but I didn't see any, if I did, my mistake.
I have two text files to compare against each other: one is a temporary log file that is overwritten sometimes, and the other is a permanent log, which collects and appends the contents of the temp log into one file (it collects the new lines added since it last checked and appends them to the end of the complete log). However, after a point this may lead to the complete log becoming quite large, and therefore not so efficient to compare against, so I have been thinking about different methods to approach this.
My first idea is to "buffer" the temp log's lines (it will normally be the smaller of the two) into a list and simply loop through the archive log and do something like:
List<string> bufferedLines = new List<string>(File.ReadLines(tempPath)); // buffer the temp log's lines
using (StreamReader archiveStream = new StreamReader(ArchivePath))
{
    string archiveLine;
    while ((archiveLine = archiveStream.ReadLine()) != null)
    {
        if (bufferedLines.Contains(archiveLine))
        {
            // this line is already in the archive
        }
    }
}
Now there are a couple of ways I could proceed from here. I could create yet another list to store the inconsistencies, close the read stream (I'm not sure you can both read and write at the same time; if you can, that might make things easier for my options), then open a write stream in append mode and write the list to the file. Alternatively, cutting out buffering the inconsistencies, I could open a write stream while the files are being compared and write the unmatched lines on the spot.
The other method I could think of was limited by my knowledge of whether it could be done or not: rather than buffer either file, compare the streams side by side as they are read and append the lines on the fly. Something like:
using (StreamReader archiveStream = new StreamReader(ArchivePath))
{
    using (StreamReader templogStream = new StreamReader(tempPath))
    {
        if (!(archiveStream.ReadAllLines.Contains(templogStream.ReadLine())))
        {
            // write the line to the file
        }
    }
}
As I said, I'm not sure whether that would work, or whether it would be more efficient than the first method, so I figured I'd ask and see if anyone has insight into how this might properly be implemented, and whether it is the most efficient way or there is a better method out there.
Effectively what you want here is all of the items from one set that aren't in another set. This is set subtraction, or in LINQ terms, Except. If your data sets were sufficiently small you could simply do this:
var lines = File.ReadLines(TempPath)
.Except(File.ReadLines(ArchivePath))
.ToList();//can't write to the file while reading from it
File.AppendAllLines(ArchivePath, lines);
Of course, this code requires bringing all of the lines of the archive file into memory, because that's just how Except is implemented: it builds a HashSet of the second sequence's items so that it can efficiently find matches while streaming through the first.
Presumably the number of lines that need to be added here is pretty small, so the fact that the lines we find all need to be stored in memory isn't a problem. If there could potentially be a lot of them, you'd want to write them out to another file besides the first one (possibly concatenating the two files together when done, if needed).
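If you'd rather make that set explicit, here's a sketch of the same idea that hashes the archive once and streams the temp log against it (unlike Except, this keeps duplicate temp-log lines):
var archived = new HashSet<string>(File.ReadLines(ArchivePath)); // archive hashed once
var newLines = File.ReadLines(tempPath)
    .Where(line => !archived.Contains(line))
    .ToList(); // materialise before reopening the archive for writing
File.AppendAllLines(ArchivePath, newLines);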

How to append to large XML files in C# using memory efficiently

Is there some way I can combine two XmlDocuments without holding the first in memory?
I have to cycle through a list of up to a hundred large (~300MB) XML files, appending to each up to 1000 nodes, repeating the whole process several times (as the new node list is cleared to save memory). Currently I load the whole XmlDocument into memory before appending new nodes, which is currently not tenable.
What would you say is the best way to go about this? I have a few ideas but I'm not sure which is best:
Never load the whole XmlDocument; instead use XmlReader and XmlWriter simultaneously to write to a temp file, which is subsequently renamed.
Make an XmlDocument for the new nodes only, and then manually write it to the existing file (i.e. file.WriteLine("<node>\n")).
Something else?
Any help will be much appreciated.
Edit Some more details in answer to some of the comments:
The program parses several large logs into XML, grouping into different files by source. It only needs to run once a day, and once the XML is written there is a lightweight proprietary reader program which gives reports on the data. Because it only runs once a day it can be slow, but it runs on a server which performs other actions, mainly file compression and transfer, which cannot be affected too much.
A database would probably be easier, but the company isn't going to do this any time soon!
As is, the program runs on the dev machine using a few GB of memory at most, but throws out-of-memory exceptions when run on the server.
Final Edit
The task is quite low-priority, which is why getting a database would just be an extra cost (though I will look into mongo).
The file will only be appended to, and won't grow indefinitely - each final file is only for a day's worth of the log, and then new files are generated the following day.
I'll probably use the XmlReader/Writer method since it will be easiest to ensure XML validity, but I have taken all your comments/answers into consideration. I know that having XML files this large is not a particularly good solution, but it's what I'm limited to, so thanks for all the help given.
If you wish to be completely certain of the XML structure, using XmlWriter and XmlReader is the best way to go.
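For example, here is a minimal sketch of that approach: copy the existing document node by node to a temp file, write the new nodes before the closing root tag, then swap the files. The names (existingPath, newNodes) are placeholders, and root-element attributes are not copied in this sketch.
// needs System.Xml, System.Xml.Linq, System.IO
static void AppendNodes(string existingPath, IEnumerable<XElement> newNodes)
{
    string tempPath = existingPath + ".tmp";
    using (var reader = XmlReader.Create(existingPath))
    using (var writer = XmlWriter.Create(tempPath))
    {
        writer.WriteStartDocument();
        reader.MoveToContent(); // positions the reader on the root element
        bool emptyRoot = reader.IsEmptyElement;
        writer.WriteStartElement(reader.LocalName, reader.NamespaceURI);

        if (!emptyRoot && reader.Read()) // move to the first child of the root
        {
            // stream-copy every existing child without loading the whole document
            while (!reader.EOF && reader.NodeType != XmlNodeType.EndElement)
                writer.WriteNode(reader, true);
        }

        foreach (var node in newNodes) // append the new nodes
            node.WriteTo(writer);

        writer.WriteEndElement();  // close the root element
        writer.WriteEndDocument();
    }
    File.Delete(existingPath);
    File.Move(tempPath, existingPath);
}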
However, for absolutely highest possible performance, you may be able to recreate this code quickly using direct string functions. You could do this, although you'd lose the ability to verify the XML structure - if one file had an error you wouldn't be able to correct it:
using (StreamWriter sw = new StreamWriter("out.xml")) {
    foreach (string filename in files) {
        sw.Write(String.Format(@"<inputfile name=""{0}"">", filename));
        using (StreamReader sr = new StreamReader(filename)) {
            // Using .NET 4's Stream.CopyTo(); alternatively try http://bit.ly/RiovFX
            if (max_performance) {
                sw.Flush(); // flush buffered text before copying raw bytes
                sr.BaseStream.CopyTo(sw.BaseStream);
            } else {
                string line;
                while ((line = sr.ReadLine()) != null) {
                    // parse the line and make any modifications you want
                    sw.Write(line);
                    sw.Write("\n");
                }
            }
        }
        sw.Write("</inputfile>");
    }
}
Depending on the way your input XML files are structured, you might opt to remove the XML headers, maybe the document element, or a few other unnecessary structures. You could do that by parsing the file line by line.

Reading EDI files and writing into new file

I have a big text file (about 20k lines) that I need to use to replace some lines of text in other text files (about 60-70 of them).
The other files can be thought of as templates. The lines in these templates need to be replaced based on some conditions.
Sample content of the file:
ISA*00* *00* *01*000123456 *ZZ*PARTNERID~ *090827*0936*U*00401*000000055*0*T*>~
GS*PO*000123456*PARTNERID*20090827*1041*2*X*004010~
ST*850*0003~
BEG*00*SA*1000012**20090827~
REF*SR*N~
CSH*Y~
TD5*****UPSG~
N1*ST*John Doe~
N3*126 Any St*~
N4*Hauppauge*NY*11788-1234*US~
PO1*1*1*EA*19.95**VN*0054321~
CTT*1*1~
SE*11*0003~
GE*1*2~
IEA*1*000000001~
I am loading a FileStream for the content file as below and reading it with a StreamReader.
FileStream baseFileStream = new FileStream("C:\\Content.txt", FileMode.Open);
Then I need to loop through the template files in a folder one by one. Once I pick a template file I will load it into another FileStream (templates will have at most 300 lines).
While reading the file I will have to go back to previous lines numerous times, but if I read the file using the ReadToEnd() or ReadLine() methods, going back to previous lines will not be possible.
To overcome this I am reading the template into a collection of lines. But would it be a good idea to read the content file into a collection too, given that it's very large? There will be a lot of searching involved in this file. Would a BufferedStream be of any use here?
Or is there any better approach for this?
Thanks
In my opinion, you're almost in a catch-22 situation. Either you load the large file into memory (via your collection), which, depending on the average size and the memory available on the server, might be the best approach, or you iterate through the template files and for each iteration open a new file stream over the large file (slower due to file I/O, but low memory consumption) so that you can perform your search, since, as we all know, a file stream is forward-only.
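As a sketch of the first option (templateFolder and the output naming are placeholders):
// read the 20k-line content file into memory once, then reuse it for every template
string[] contentLines = File.ReadAllLines("C:\\Content.txt");
foreach (string templatePath in Directory.EnumerateFiles(templateFolder))
{
    string[] templateLines = File.ReadAllLines(templatePath); // ~300 lines, so indexable and cheap
    // walk templateLines by index (so you can jump back to earlier lines),
    // search contentLines - or a Dictionary built from it - as needed,
    // and replace lines in templateLines in place
    File.WriteAllLines(templatePath + ".out", templateLines);
}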

Fast/low-memory method to parse first two columns in a large csv file using c#

I'm parsing a large CSV file - about 500 MB (many rows, many columns). I only need the first two columns (so up to the second comma on each line). Also, multiple threads need access to this file at the same time, so I can't take an exclusive lock.
What's the fastest/least memory consuming approach to this problem? What classes/methods should I be looking at? I assume that I should stay as low-level as possible - reading character by character, line by line?
Perhaps this is a way to allow simultaneous access?
using (var filestream = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read))
{
    using (var reader = new StreamReader(filestream))
    {
        ...
    }
}
Edit
Decided to check out http://www.codeproject.com/KB/database/CsvReader.aspx
which seems to give me the ability to read just two columns and then skip to the next line.
They also have some benchmarks showing fast performance and low memory profile.
If you want low memory, you'll probably use a StreamReader and ReadLine, line by line.
In a similar case the other day, I was able to skip the first 20,000,000 lines in a 500 MB file and build a string (using StringBuilder) for the next 1,000,000 lines in about 7 seconds.
Assuming that the file contains ASCII encoded text (would be typical for csv), your best bet may be to use Stream directly and the Stream.Read method, which allows you to read into a pre-allocated buffer. This has a few advantages:
You only allocate a buffer once, whereas ReadLine() will create a new String for every line.
You don't have to perform the Unicode conversion for the entire line; you can either do this only for the portion up to the second comma or (if you're severely time-constrained), you can write your own numeric parser that operates on the ASCII string data in the buffer (I'm sure there are well-documented algorithms for doing this.) This is assuming you need numeric data, of course.
Additional methods you'll likely need include the ASCII Encoding methods, particularly Encoding.ASCII.GetString.
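Putting those pieces together, a rough sketch of that approach, assuming ASCII text and no quoted commas in the first two fields (filePath and the callback are placeholders); partial lines are carried over between reads:
static void ScanFirstTwoColumns(string filePath, Action<string> handleFirstTwo)
{
    var buffer = new byte[64 * 1024];           // reused for every read
    var carry = new MemoryStream();             // holds a partial line between reads

    using (var fs = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        int read;
        while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
        {
            int lineStart = 0;
            for (int i = 0; i < read; i++)
            {
                if (buffer[i] != (byte)'\n') continue;

                carry.Write(buffer, lineStart, i - lineStart);  // complete the line
                byte[] line = carry.ToArray();
                carry.SetLength(0);
                lineStart = i + 1;

                // find the second comma and convert only that prefix
                int commas = 0, cut = line.Length;
                for (int j = 0; j < line.Length; j++)
                    if (line[j] == (byte)',' && ++commas == 2) { cut = j; break; }
                handleFirstTwo(Encoding.ASCII.GetString(line, 0, cut));
            }
            carry.Write(buffer, lineStart, read - lineStart);   // stash the tail
        }
        // note: a final line with no trailing '\n' is still sitting in 'carry' here
    }
}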
