I am writing a framework for writing out collections into different formats for a project at my employer. One of the output formats is delimited text files (commonly known as CSV, even though CSVs aren't always delimited by a comma).
I am using the Microsoft.Jet.OLEDB.4.0 provider via OleDbConnection in ADO.NET. For reading these files, it's very quick. However, for writing, it's extremely slow.
In one case, I have a file with 160 records, with each record having about 250 fields. It takes approximately 30 seconds to create this file, seemingly CPU bound.
I have done the following, which provided significant performance boosts, but I can't think of anything else:
Preparing the statement once
Using unnamed parameters
Any other suggestions to speed this up some?
How about "don't use OleDbConnection"... writing delimited files with TextWriter is pretty simple (escaping aside). For reading, CsvReader.
I have written a small and simple set of classes at my employer to do just that (write and read CSV files or other flat files with a fixed field length).
I have just used the StreamWriter & StreamReader classes, and it is quite fast actually.
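As a rough illustration of the TextWriter approach suggested above, here is a minimal sketch. The class and method names (DelimitedWriter, Escape, WriteRecords) are made up for this example, and the quoting rule (wrap a field in double quotes when it contains the delimiter, a quote, or a line break, and double any embedded quotes) is the common CSV convention, assumed here rather than taken from the original posts.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of any library mentioned in the thread.
public static class DelimitedWriter
{
    // Quote a field only when it contains the delimiter, a quote, or a line break.
    public static string Escape(string field, char delimiter)
    {
        if (field == null) return "";
        bool needsQuotes = field.IndexOf(delimiter) >= 0
            || field.IndexOf('"') >= 0
            || field.IndexOf('\n') >= 0
            || field.IndexOf('\r') >= 0;
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }

    // Write one record per line to any TextWriter (StreamWriter, StringWriter, ...).
    public static void WriteRecords(TextWriter writer,
                                    IEnumerable<IEnumerable<string>> records,
                                    char delimiter)
    {
        foreach (IEnumerable<string> record in records)
        {
            writer.WriteLine(string.Join(delimiter.ToString(),
                                         record.Select(f => Escape(f, delimiter))));
        }
    }
}
```

Passing a StreamWriter opened on the output file writes straight to disk, so 160 records of 250 fields each should involve nothing more than string concatenation and buffered I/O.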
Try using System.Configuration.CommaDelimitedStringCollection, as in this code, which prints a list of objects to a TextWriter. Note that it does not quote or escape values, so it only suits data that contains no commas.
public void CommaSeparatedWriteLine(TextWriter sw, params Object[] list)
{
    if (list.Length > 0)
    {
        System.Configuration.CommaDelimitedStringCollection commaStr = new System.Configuration.CommaDelimitedStringCollection();
        foreach (Object obj in list)
        {
            commaStr.Add(obj.ToString());
        }
        sw.WriteLine(commaStr.ToString());
    }
}
Take a look at this LINQ to CSV library from code project:
http://www.codeproject.com/KB/linq/LINQtoCSV.aspx
I have not used this yet but I have had it in my reference file for about a year now.
"This library makes it easy to use CSV files with LINQ queries."
I use the CSV reader and found that it takes a lot of time to parse the data. How can I load the entire CSV file into memory and then process it record by record? I have to do custom mapping of the records.
TextReader tr = new StreamReader(File.Open(@"C:\MarketData\" + symbol + ".txt", FileMode.Open));
CsvReader csvr = new CsvReader(tr);
while (csvr.Read())
{
    // do your magic
}
Create a class that exactly represents/mirrors your CSV file. Then read all the contents into a list of that class. The following snippet is from CsvHelper's documentation.
var csv = new CsvReader( textReader );
var records = csv.GetRecords<MyClass>().ToList();
The important part is the .ToList(), as this will force the load of all the data into your list, rather than yielding results as you access them.
You can then perform additional mapping / extraction on that list, which will be in memory.
If you're already doing this, you may benefit from loading your CSV into a HashSet rather than a List (via ToHashSet()). See "HashSet vs List Performance".
To answer your question directly: You can load the file fully into a memory stream and then re-read from that stream using your CsvReader. Similarly, you can create a bigger read buffer for your FileStream, e.g. 15 MB, which would read the entire file into the buffer in one hit. I doubt either of these will actually improve performance for 10 MB files.
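Both options can be sketched in a few lines. The class and method names below (InMemoryFile, OpenBuffered, OpenWithLargeBuffer) are invented for illustration; the resulting TextReader would be handed to whichever CSV reader you already use.

```csharp
using System.IO;

// Hypothetical helper names; only the System.IO calls are real APIs.
public static class InMemoryFile
{
    // Option 1: read the whole file into memory in one call and wrap it in a reader.
    public static TextReader OpenBuffered(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);
        return new StreamReader(new MemoryStream(bytes));
    }

    // Option 2: keep streaming from disk, but give the FileStream a larger buffer.
    public static TextReader OpenWithLargeBuffer(string path, int bufferSize)
    {
        var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                    FileShare.Read, bufferSize);
        return new StreamReader(stream);
    }
}
```

For example, new CsvReader(InMemoryFile.OpenBuffered(path)) would parse entirely from memory. Timing both variants against the plain StreamReader is the only way to tell whether disk I/O was the bottleneck in the first place.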
Find your real performance bottleneck: Time to read file content from disk, time to parse CSV into fields, or time to process a record? A 10MB file looks really small. I'm processing sets of 250MB+ csv files with a custom csv reader with no complaints.
If processing is the bottleneck and you have several threads available and your csv file format does not need to support escaped line breaks, then you could read the entire file into a list of lines (System.IO.File.ReadAllLines / .ReadLines) and parse each line using a different Task. For example:
System.IO.File.ReadLines(path) // path: the CSV file to read
    .Skip(1) // header line. Assume trusted to be correct.
    .AsParallel()
    .Select(ParseRecord) // RecordClass ParseRecord(string line)
    .ForAll(ProcessRecord); // void ProcessRecord(RecordClass)
If you have many files to parse, you could process each file in a different Task and use async methods to maximise throughput. If they all come from the same physical disk then your mileage will vary, and performance may even get worse than with a single-threaded approach.
More advanced:
If you know your files contain 8-bit characters only, then you can operate on byte arrays and skip the StreamReader overhead of decoding bytes into chars. This way you can read the entire file into a byte array in a single call and scan for line breaks, assuming no line break escapes need to be supported. In that case scanning for line breaks can be done by multiple threads, each looking at a part of the byte array.
If you don't need to support field escapes (a,"b,c",d), then you can write a faster parser, simply looking for field separators (typically comma). You can also split field-demarcation parsing and field content parsing into threads if that's a bottleneck, though memory access locality may negate any benefits.
Under certain circumstances you may not need to parse fields into intermediate data structures (eg doubles, strings) and can process directly off references to the start/end of fields and save yourself some intermediate data structure creation.
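To make the last two points concrete, here is a minimal sketch of a scanner that only looks for separators and returns field positions instead of substrings. It assumes the line uses no quoting or escapes at all; the names SimpleFieldScanner and FindFields are made up for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: valid only for lines without quoted or escaped fields.
public static class SimpleFieldScanner
{
    // Returns (start, length) pairs, one per field, without allocating substrings.
    public static List<KeyValuePair<int, int>> FindFields(string line, char separator)
    {
        var fields = new List<KeyValuePair<int, int>>();
        int start = 0;
        while (true)
        {
            int end = line.IndexOf(separator, start);
            if (end < 0)
            {
                // Last field runs to the end of the line.
                fields.Add(new KeyValuePair<int, int>(start, line.Length - start));
                return fields;
            }
            fields.Add(new KeyValuePair<int, int>(start, end - start));
            start = end + 1;
        }
    }
}
```

A caller that needs only a few columns can then convert just those ranges, leaving the rest of the line untouched.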
Is there some way I can combine two XmlDocuments without holding the first in memory?
I have to cycle through a list of up to a hundred large (~300MB) XML files, appending to each up to 1000 nodes, repeating the whole process several times (as the new node list is cleared to save memory). Currently I load the whole XmlDocument into memory before appending new nodes, which is not tenable.
What would you say is the best way to go about this? I have a few ideas but I'm not sure which is best:
Never load the whole XmlDocument, instead using XmlReader and XmlWriter simultaneously to write to a temp file which is subsequently renamed.
Make an XmlDocument for the new nodes only, and then manually write it to the existing file (i.e. file.WriteLine( "<node>\n" ))
Something else?
Any help will be much appreciated.
Edit Some more details in answer to some of the comments:
The program parses several large logs into XML, grouping into different files by source. Once the XML is written, a lightweight proprietary reader program gives reports on the data. The program only needs to run once a day, so it can be slow, but it runs on a server which performs other actions, mainly file compression and transfer, which cannot be affected too much.
A database would probably be easier, but the company isn't going to do this any time soon!
As is, the program runs on the dev machine using a few GB of memory at the most, but throws out-of-memory exceptions when run on the server.
Final Edit
The task is quite low-priority, which is why it would only cost extra to get a database (though I will look into mongo).
The file will only be appended to, and won't grow indefinitely - each final file is only for a day's worth of the log, and then new files are generated the following day.
I'll probably use the XmlReader/Writer method since it will be easiest to ensure XML validity, but I have taken all your comments/answers into consideration. I know that having XML files this large is not a particularly good solution, but it's what I'm limited to, so thanks for all the help given.
If you wish to be completely certain of the XML structure, using XmlWriter and XmlReader is the best way to go.
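A minimal sketch of that reader/writer approach, under several assumptions: the new nodes are appended as the last children of the root element, the file has no DOCTYPE, and anything before the root element (comments, processing instructions) does not need to be preserved. XmlAppender and AppendToRoot are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper; streams the existing file so it is never fully in memory.
public static class XmlAppender
{
    // Copies sourcePath to a temp file, adds newNodes just before the root
    // element closes, then replaces the original file with the temp file.
    public static void AppendToRoot(string sourcePath, IEnumerable<XmlNode> newNodes)
    {
        string tempPath = sourcePath + ".tmp";
        using (XmlReader reader = XmlReader.Create(sourcePath))
        using (XmlWriter writer = XmlWriter.Create(tempPath))
        {
            reader.MoveToContent();                  // position on the root element
            writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
            writer.WriteAttributes(reader, false);   // copy the root's attributes

            if (!reader.IsEmptyElement)
            {
                reader.Read();                       // step inside the root
                while (reader.NodeType != XmlNodeType.EndElement && !reader.EOF)
                {
                    // Copies one child (with its subtree) and advances the reader.
                    writer.WriteNode(reader, false);
                }
            }

            foreach (XmlNode node in newNodes)
            {
                node.WriteTo(writer);                // only the new nodes live in memory
            }
            writer.WriteEndElement();
        }
        File.Delete(sourcePath);
        File.Move(tempPath, sourcePath);
    }
}
```

The new nodes can come from a small XmlDocument that holds only the additions, so memory use depends on the size of the batch rather than on the 300MB file.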
However, for absolutely highest possible performance, you may be able to recreate this code quickly using direct string functions. You could do this, although you'd lose the ability to verify the XML structure - if one file had an error you wouldn't be able to correct it:
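// files (the input file names) and max_performance are assumed to be defined elsewhere
using (StreamWriter sw = new StreamWriter("out.xml")) {
    foreach (string filename in files) {
        sw.Write(String.Format(@"<inputfile name=""{0}"">", filename));
        using (StreamReader sr = new StreamReader(filename)) {
            if (max_performance) {
                // .NET 4's CopyTo() (alternatively try http://bit.ly/RiovFX) works on
                // streams, not readers, so copy blocks of chars instead
                char[] buffer = new char[64 * 1024];
                int read;
                while ((read = sr.Read(buffer, 0, buffer.Length)) > 0) {
                    sw.Write(buffer, 0, read);
                }
            } else {
                string line;
                while ((line = sr.ReadLine()) != null) {
                    // parse the line and make any modifications you want
                    sw.Write(line);
                    sw.Write("\n");
                }
            }
        }
        sw.Write("</inputfile>");
    }
}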
Depending on the way your input XML files are structured, you might opt to remove the XML headers, maybe the document element, or a few other unnecessary structures. You could do that by parsing the file line by line.
I need to read in a text file that can range from 8k to 5MB. This file is made up of a single line of text, with no carriage returns or end-of-line characters. I then need to break it down into its individual pieces. Those pieces are delimited by size. For example, the first chunk of information is made up of 240 characters. Within those 240 characters the first 30 are the Name field. The next 35 are the Address, and so on. Parsing aside, is the StreamReader class the best choice for reading it into memory?
Look at the TextFieldParser class. Though it is in the Microsoft.VisualBasic.FileIO namespace, it can easily be used with C#.
The class description on MSDN is:
Provides methods and properties for parsing structured text files.
An example usage would be:
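using (var tfp = new TextFieldParser("path to text file"))
{
    tfp.TextFieldType = FieldType.FixedWidth;
    tfp.FieldWidths = new int[] {5, 10, 11, -1}; // -1: the last field takes the rest of the line
    while (!tfp.EndOfData)
    {
        string[] fields = tfp.ReadFields();
        // use fields[0], fields[1], ...
    }
}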
I'd very much recommend using a StreamReader as opposed to reading all the text into a string, for reasons of heap efficiency. I have run into lots of trouble with strings over 2 MB without much effort (on 32-bit .NET).
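One way to treat the input as a stream is to pull one fixed-size record at a time from the reader. This is a sketch only: the 240-character record length and the 30/35-character Name and Address widths come from the question, while the names FixedWidthReader, ReadRecords and Field are invented here.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for files that are one long run of fixed-size records.
public static class FixedWidthReader
{
    // Yields records lazily, so only one record is held in memory at a time.
    public static IEnumerable<string> ReadRecords(TextReader reader, int recordLength)
    {
        char[] buffer = new char[recordLength];
        int read;
        // ReadBlock keeps reading until the buffer is full or the input ends.
        while ((read = reader.ReadBlock(buffer, 0, recordLength)) > 0)
        {
            yield return new string(buffer, 0, read);
        }
    }

    // Extracts one field, tolerating a short final record.
    public static string Field(string record, int start, int length)
    {
        if (start >= record.Length) return "";
        return record.Substring(start, Math.Min(length, record.Length - start)).TrimEnd();
    }
}
```

With a StreamReader opened on the file, each record from ReadRecords(reader, 240) can be split with Field(record, 0, 30) for the name, Field(record, 30, 35) for the address, and so on for the remaining widths.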
Do you need further guidance? It seems to me you might be looking for help in treating the stream. It is common for programmers to have more experience in handling strings, and therefore to prefer stringy solutions.
If you paste some more specifics about the structure of the data, I could help you out a bit. For now, just a single general pointer:
All general-purpose parsers and lexers employ a streaming input model. For example, look at Coco/C# for a simple-to-use parser generator.
I have a Text File (Sorry, I'm not allowed to work on XML files :(), and it includes customer records. Each text file looks like:
Account_ID: 98734BLAH9873
User Name: something_85
First Name: ILove
Last Name: XML
Age: 209
etc... And I need to be able to use LINQ to get the data from these text files and just store them in memory.
I have seen many Linq to SQL, Linq to BLAH but nothing for Linq to Text. Can someone please help me out a bit?
Thank you
You can use code like this:
var pairs = File.ReadAllLines("filename.txt")
    .Select(line => line.Split(new[] { ':' }, 2)) // split at the first colon only
    .ToDictionary(cells => cells[0].Trim(), cells => cells[1].Trim());
Or use the .NET 4.0 File.ReadLines() method to return an IEnumerable, which is useful for processing big text files.
The concept of a text file data source is extremely broad (consider that XML is stored in text files). For that reason, I think it is unlikely that such a beast exists.
It should be simple enough to read the text file into a collection of Account objects and then use LINQ-to-Objects.
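A minimal sketch of that idea, assuming one customer record per file and the field labels shown in the question; the Account class and the AccountLoader.Parse method are hypothetical names introduced for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record type mirroring the labels in the sample file.
public class Account
{
    public string AccountId;
    public string UserName;
    public string FirstName;
    public string LastName;
    public int Age;
}

public static class AccountLoader
{
    // Parses "Key: value" lines into an Account; unknown keys are ignored.
    public static Account Parse(IEnumerable<string> lines)
    {
        Dictionary<string, string> pairs = lines
            .Select(line => line.Split(new[] { ':' }, 2))
            .Where(parts => parts.Length == 2)       // skip lines without a colon
            .ToDictionary(parts => parts[0].Trim(), parts => parts[1].Trim());

        var account = new Account();
        string value;
        if (pairs.TryGetValue("Account_ID", out value)) account.AccountId = value;
        if (pairs.TryGetValue("User Name", out value)) account.UserName = value;
        if (pairs.TryGetValue("First Name", out value)) account.FirstName = value;
        if (pairs.TryGetValue("Last Name", out value)) account.LastName = value;
        if (pairs.TryGetValue("Age", out value)) account.Age = int.Parse(value);
        return account;
    }
}
```

Loading every file in a folder then becomes Directory.EnumerateFiles(folder, "*.txt").Select(f => AccountLoader.Parse(File.ReadLines(f))).ToList(), after which ordinary LINQ-to-Objects queries such as accounts.Where(a => a.Age > 30) work on the in-memory list.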
FileHelpers is a really great open source solution to this:
http://filehelpers.sourceforge.net/
You just declare a class with attributes, and FileHelpers reads the flat file for you:
[FixedLengthRecord]
public class PriceRecord
{
    [FieldFixedLength(6)]
    public int ProductId;

    [FieldFixedLength(8)]
    [FieldConverter(typeof(MoneyConverter))]
    public decimal PriceList;

    [FieldFixedLength(8)]
    [FieldConverter(typeof(MoneyConverter))]
    public decimal PriceOnePay;
}
Once FileHelpers gives you back an array of rows, you can use Linq to Objects to query the data.
We've had great success with it. I actually think Kaerber's solution is a nice simple solution, so maybe stave off migrating to FileHelpers till you really need the extra power.
I am trying to compare two large datasets from a SQL query. Right now the SQL query is done externally and the results from each dataset are saved into their own csv file. My little C# console application loads up the two text/csv files and compares them for differences and saves the differences to a text file.
It's a very simple application that just loads all the data from the first file into an arraylist and does a .compare() on the arraylist as each line is read from the second csv file. Then it saves the records that don't match.
The application works but I would like to improve the performance. I figure I can greatly improve performance if I can take advantage of the fact that both files are sorted, but I don't know a datatype in C# that keeps order and would allow me to select a specific position. There's a basic array, but I don't know how many items are going to be in each list. I could have over a million records. Is there a data type available that I should be looking at?
If the data in both of your CSV files is already sorted and has the same number of records, you could skip the data structure entirely and do in-place analysis.
StreamReader one = new StreamReader(@"C:\file1.csv");
StreamReader two = new StreamReader(@"C:\file2.csv");
String lineOne;
String lineTwo;
StreamWriter differences = new StreamWriter("Output.csv");
while (!one.EndOfStream)
{
    lineOne = one.ReadLine();
    lineTwo = two.ReadLine();
    // do your comparison; a plain string comparison stands in for it here
    bool areDifferent = lineOne != lineTwo;
    if (areDifferent)
        differences.WriteLine(lineOne + lineTwo);
}
one.Close();
two.Close();
differences.Close();
System.Collections.Specialized.StringCollection allows you to add a range of values and, using the .IndexOf(string) method, allows you to retrieve the index of that item.
That being said, you could likely just load up a couple of byte[] from a filestream and do byte comparison... don't even worry about loading that stuff into a formal datastructure like StringCollection or string[]; if all you're doing is checking for differences, and you want speed, I would reckon byte differences are where it's at.
This is an adaptation of David Sokol's code to work with a varying number of lines, outputting the lines that are in one file but not the other:
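StreamReader one = new StreamReader(@"C:\file1.csv");
StreamReader two = new StreamReader(@"C:\file2.csv");
StreamWriter differences = new StreamWriter("Output.csv");
String lineOne = one.ReadLine();
String lineTwo = two.ReadLine();
// ReadLine() returns null at end of file, so loop until both files are exhausted
while (lineOne != null || lineTwo != null)
{
    // Strings cannot be compared with <; use an ordinal comparison instead.
    // This must match the order in which the files were sorted.
    int cmp = lineOne == null ? 1
            : lineTwo == null ? -1
            : String.CompareOrdinal(lineOne, lineTwo);
    if (cmp == 0)
    {
        // lines match, read next line from each and continue
        lineOne = one.ReadLine();
        lineTwo = two.ReadLine();
    }
    else if (cmp < 0)
    {
        // lineOne is only in file one
        differences.WriteLine(lineOne);
        lineOne = one.ReadLine();
    }
    else
    {
        // lineTwo is only in file two
        differences.WriteLine(lineTwo);
        lineTwo = two.ReadLine();
    }
}
one.Close();
two.Close();
differences.Close();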
Standard caveat about code written off the top of my head applies -- you may need to special-case running out of lines in one while the other still has lines, but I think this basic approach should do what you're looking for.
Well, there are several approaches that would work. You could write your own data structure that did this. Or you can try and use SortedList. You can also return the DataSets in code, and then use .Select() on the table. Granted, you would have to do this on both tables.
You can easily use a SortedList to do fast lookups. If the data you are loading is already sorted, insertions into the SortedList should not be slow.
If you are looking simply to see if all lines in FileA are included in FileB you could read it in and just compare streams inside a loop.
File 1
Entry1
Entry2
Entry3
File 2
Entry1
Entry3
You could loop through with two counters and find omissions, going line by line through each file and see if you get what you need.
Maybe I misunderstand, but the ArrayList will maintain its elements in the same order by which you added them. This means you can compare the two ArrayLists within one pass only - just increment the two scanning indices according to the comparison results.
One question I have is have you considered "out-sourcing" your comparison. There are plenty of good diff tools that you could just call out to. I'd be surprised if there wasn't one that let you specify two files and get only the differences. Just a thought.
I think the reason everyone has so many different answers is that you haven't quite got your problem specified well enough to be answered. First off, it depends what kind of differences you want to track. Are you wanting the differences to be output like in a WinDiff where the first file is the "original" and second file is the "modified" so you can list changes as INSERT, UPDATE or DELETE? Do you have a primary key that will allow you to match up two lines as different versions of the same record (when fields other than the primary key are different)? Or is this some sort of reconciliation where you just want your difference output to say something like "RECORD IN FILE 1 AND NOT FILE 2"?
I think the answers to these questions will help everyone to give you a suitable answer to your problem.
If you have two files that are each a million lines long, as mentioned in your post, you might be using up a lot of memory. Some of the performance problem might be that you are swapping to disk. If you are simply comparing line 1 of file A to line 1 of file B, line 2 of file A to line 2 of file B, and so on, I would recommend a technique that does not store so much in memory. You could read from two file streams, as a previous commenter posted, and write out your results "in real time" as you find them. This would not explicitly store anything in memory. Alternatively, you could load chunks of each file into memory, say one thousand lines at a time, into something like a List. This could be fine-tuned to meet your needs.
To resolve question #1 I'd recommend looking into creating a hash of each line. That way you can compare hashes quick and easy using a dictionary.
To resolve question #2, one quick and dirty solution would be to use an IDictionary<string, string>, using itemId as the key and the rest of the line as the value. You can then quickly find whether an itemId exists and compare the lines. This of course assumes .NET 2.0+.
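Here is a rough sketch of the dictionary approach. It assumes each line starts with the itemId followed by a comma, and it borrows the INSERT/UPDATE/DELETE labels from the earlier answer; KeyedDiff, SplitKey and Compare are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: compares two sets of lines keyed by their first field.
public static class KeyedDiff
{
    // Splits "itemId,rest of line" at the first comma.
    private static KeyValuePair<string, string> SplitKey(string line)
    {
        int comma = line.IndexOf(',');
        return comma < 0
            ? new KeyValuePair<string, string>(line, "")
            : new KeyValuePair<string, string>(line.Substring(0, comma),
                                               line.Substring(comma + 1));
    }

    // Reports records that were added, changed, or removed between the inputs.
    public static List<string> Compare(IEnumerable<string> oldLines,
                                       IEnumerable<string> newLines)
    {
        var oldById = new Dictionary<string, string>();
        foreach (string line in oldLines)
        {
            KeyValuePair<string, string> kv = SplitKey(line);
            oldById[kv.Key] = kv.Value;
        }

        var report = new List<string>();
        foreach (string line in newLines)
        {
            KeyValuePair<string, string> kv = SplitKey(line);
            string oldRest;
            if (!oldById.TryGetValue(kv.Key, out oldRest))
            {
                report.Add("INSERT " + line);
            }
            else
            {
                if (oldRest != kv.Value) report.Add("UPDATE " + line);
                oldById.Remove(kv.Key);   // whatever remains at the end was deleted
            }
        }
        foreach (KeyValuePair<string, string> kv in oldById)
        {
            report.Add("DELETE " + kv.Key);
        }
        return report;
    }
}
```

Only the first file is held in memory; the second can be streamed with File.ReadLines, so this trades the sorted-merge approach's low memory use for not needing the files to be sorted at all.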