I am grabbing one value at a time and dynamically loading it into a grid.
Is there a way to index a csv file to only look up a value at a certain row and column?
I can't read all the rows as that would defeat the purpose of loading dynamically.
The CSV parser (Fast CSV Parser, in my case) can grab a value like so: csv[row][column]. When looking at the source, I noticed that it loops over everything in the file until it reaches the correct row/column pair. Grabbing a value at row 100,000, column 80 can take quite a long time.
Any help much appreciated.
Well, you could do a fast first pass and store the offsets of each row. That would make subsequently locating a row much faster. If you have 80 columns but 100K rows, I'd focus on fast row-finding rather than fast column-finding.
ETA: OK, I'm assuming your CSV file is on disk and that you can get exclusive access to it. Some of this code was based on this.
List<int> offsets = new List<int>();
using (StreamReader reader = new StreamReader("myfile.csv"))
{
    int offset = 0;
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        offsets.Add(offset);
        offset += (line.Length + 2); // The 2 is for NewLine (\r\n)
    }
    offsets.Add(offset); // pick up the end of the last line
}
At the end of this, you will have the List variable offsets, which is indexed by line number and contains the offset to each line. You can then, when reading the file (when doing your grid-building), use offsets[n] to get the offset to Seek to (I'm assuming you're using a FileStream or a StreamReader) and offsets[n+1] - offsets[n] to get the length.
As far as parsing the returned line of text is concerned, I assume the CSV library you're adapting has good logic for that.
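For example, here is a minimal sketch of what that lookup might look like, assuming single-byte text (plain ASCII/UTF-8 without multi-byte characters) so the character counts above line up with byte positions in the file; GetRow is just a hypothetical helper name, and you would still hand the returned line to your CSV library to split out the column:
// Minimal sketch: fetch row n using the offsets built above.
// Assumes single-byte text so character counts equal byte offsets.
string GetRow(string path, List<int> offsets, int n)
{
    using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
    {
        int length = offsets[n + 1] - offsets[n];
        byte[] buffer = new byte[length];
        fs.Seek(offsets[n], SeekOrigin.Begin);

        int total = 0;
        while (total < length)
        {
            int read = fs.Read(buffer, total, length - total);
            if (read == 0) break; // end of file (e.g. last line without a trailing newline)
            total += read;
        }

        // Strip the trailing newline before handing the line to the CSV parser.
        return Encoding.ASCII.GetString(buffer, 0, total).TrimEnd('\r', '\n');
    }
}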
CSV files do not have any support for indexing where a particular row might be, no.
The best I think you can do is read each row until you find the one you want. So you'll average reading half the file when scanning for a row, which is better than reading the entire file.
If you use the CSV parser I present in the article Reading and Writing CSV Files in C#, you can just read one row at a time.
The other option is if you are going to be accessing multiple rows from the same file. In this case, you could run through the file and build a list of indexes. But this only pays off if you're going to look up multiple lines in a single session.
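If it really is a one-off lookup per session, a rough sketch of skipping straight to a given row with a plain StreamReader (targetRow and the file name are just placeholders):
// Rough sketch: pull out just row 'targetRow' (zero-based) for a one-off lookup.
// Still O(n) in the row number, but only one line is ever held in memory.
int targetRow = 100000; // example row from the question
string line = null;
using (var reader = new StreamReader("myfile.csv"))
{
    for (int row = 0; row <= targetRow && (line = reader.ReadLine()) != null; row++)
    {
        // keep reading; when the loop exits, 'line' holds row 'targetRow' (or null if the file is shorter)
    }
}
// split 'line' with your CSV parser to get the column you want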
If you are permitted to use third-party libraries, I'd look at some of those. MySQL has CSV engine support, and therefore it seems likely that you'd be able to do this using a library from them.
C#, however, doesn't provide a great way to handle CSV files.
http://dev.mysql.com/doc/refman/5.0/en/csv-storage-engine.html
I'm presently working on an application that allows the user to input a list of names/search terms and a folder path. The application then searches for each phrase and copies any responsive documents to an output path. It's important to note that it is often used with directories containing from 100 GB up to a few TB of documents, and it can sometimes be required to run thousands of search terms.
Initially I simply used the System.IO.Directory.GetFiles() function for this, but I've found I get better results by creating a data table of all documents in the input path and running my searches over that data table (see below).
// Constructing a data table of all files in the input path
foreach (var file in fileArray)
{
    System.Data.DataRow row = searchTable.NewRow();
    row[1] = file;
    row[0] = System.IO.Path.GetFileName(file);
    searchTable.Rows.Add(row);
}

// For each line input by the user, search the data table to find any responsive file names
foreach (var line in searchArray)
{
    for (int i = 0; i < searchTable.Rows.Count; i++)
    {
        if (searchTable.Rows[i][0].ToString().Contains(line))
        {
            string file = searchTable.Rows[i][1].ToString();
            string output = SwiftBank.CalculateOutputFilePath(outputPath, inputPath, file);
            System.IO.File.Copy(file, output);
        }
    }
}
I've found that while this works, it isn't optimised and runs very slowly for large data sets. It is obviously doing a lot of repeat work, searching the full data table for every search term. I'm wondering if someone on here might have a better idea?
In my experience, doing a handful of Contains queries over a few thousand fairly short strings should take less than a second. If you are searching inside much larger data sets, like searching through 100 GB of content, you should look at a more advanced library, like Lucene.
There are a few things I would suggest changing:
Use a regular list instead of a DataTable. Something like List<(string filePath, string fileName)> would be much simpler and contain the same information.
Perform all checks for a specific file at once, i.e. reorder your loops so that the file loop is the outer one. This should help cache usage a little bit (see the sketch below).
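Put together, a rough sketch of both suggestions, reusing the names from the question (fileArray, searchArray, inputPath, outputPath and SwiftBank.CalculateOutputFilePath are the questioner's own):
// Sketch: list of tuples instead of a DataTable, with the file loop outermost.
var files = new List<(string filePath, string fileName)>();
foreach (var file in fileArray)
    files.Add((file, System.IO.Path.GetFileName(file)));

foreach (var (filePath, fileName) in files)
{
    // Check every search term against this one file name before moving on.
    foreach (var line in searchArray)
    {
        if (fileName.Contains(line))
        {
            string output = SwiftBank.CalculateOutputFilePath(outputPath, inputPath, filePath);
            System.IO.File.Copy(filePath, output);
            break; // one match is enough to copy the file once
        }
    }
}
The break is optional; it just avoids copying (and colliding on) the same file more than once when several terms match it.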
However, the vast majority of the time will likely be spent on copying files. This is many orders of magnitude slower than doing some simple searching in a few kilobytes of memory. You might gain a little bit by doing more than one copy in parallel, since SSDs may be able to improve throughput at higher loads, but that is likely only true if the files are small. You might consider alternatives to the copying, like adding shortcuts, instead.
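If you do want to experiment with parallel copies, here is a minimal sketch, assuming the search above is changed to collect its hits into a hypothetical matches list of (source, destination) pairs instead of copying immediately:
// Sketch: copy matched files a few at a time; tune MaxDegreeOfParallelism for your disk.
// 'matches' is assumed to be a List<(string source, string destination)> built by the search.
var options = new System.Threading.Tasks.ParallelOptions { MaxDegreeOfParallelism = 4 };
System.Threading.Tasks.Parallel.ForEach(matches, options, pair =>
{
    System.IO.File.Copy(pair.source, pair.destination, overwrite: true);
});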
I have many large CSV files (1-10 GB each) which I'm importing into databases. For each file, I need to replace the first line so I can format the headers to be the column names. My current solution is:
using (var reader = new StreamReader(file))
using (var writer = new StreamWriter(fixedPath)) // "fixed" is a C# keyword, so the output path is called fixedPath here
{
    var line = reader.ReadLine();
    var fixedLine = parseHeaders(line);
    writer.WriteLine(fixedLine);
    while ((line = reader.ReadLine()) != null)
        writer.WriteLine(line);
}
What is a quicker way to only replace line 1 without iterating through every other line of these huge files?
If you can guarantee that fixedLine is the same length (or less) as line, you can update the files in-place instead of copying them.
If not, you can possibly get a little performance improvement by accessing the .BaseStream of your StreamReader and StreamWriter and doing big block copies (using, say, a 32K byte buffer) to do the copying, which will at least eliminate the time spent checking every character to see if it's an end-of-line character as happens now with reader.ReadLine().
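A minimal sketch of that idea. Rather than mixing ReadLine with BaseStream (a StreamReader buffers ahead, which makes that combination tricky), this reads the header bytes straight from a FileStream, writes the fixed header, and block-copies the rest. It assumes .NET 4+ for Stream.CopyTo with a buffer size, ignores any BOM for brevity, and reuses file, fixedPath and parseHeaders from the question:
// Sketch: replace the header, then copy the remainder in large blocks instead of line by line.
using (var input = new FileStream(file, FileMode.Open, FileAccess.Read))
using (var output = new FileStream(fixedPath, FileMode.Create, FileAccess.Write))
{
    // Read the original header one byte at a time, stopping at the first '\n'
    // (the newline itself is consumed but not stored).
    var headerBytes = new List<byte>();
    int b;
    while ((b = input.ReadByte()) != -1 && b != '\n')
        headerBytes.Add((byte)b);

    string oldHeader = Encoding.UTF8.GetString(headerBytes.ToArray()).TrimEnd('\r');
    string newHeader = parseHeaders(oldHeader);

    byte[] headerOut = Encoding.UTF8.GetBytes(newHeader + "\r\n");
    output.Write(headerOut, 0, headerOut.Length);

    // Everything after the first newline is copied verbatim in 32K blocks.
    input.CopyTo(output, 32 * 1024);
}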
The only thing that can significantly speed it up is if you can really replace the first line in place. If the new first line is no longer than the old one, replace the first line carefully (with space padding if needed).
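A hedged sketch of that in-place replacement, assuming plain single-byte text with no BOM, and reusing file and parseHeaders from the question:
// Sketch: overwrite the first line in place, padding with spaces so the byte length is unchanged.
using (var fs = new FileStream(file, FileMode.Open, FileAccess.ReadWrite))
{
    // Read the original header bytes (up to, but not including, the newline).
    var oldBytes = new List<byte>();
    int b;
    while ((b = fs.ReadByte()) != -1 && b != '\r' && b != '\n')
        oldBytes.Add((byte)b);

    string oldHeader = Encoding.ASCII.GetString(oldBytes.ToArray());
    string newHeader = parseHeaders(oldHeader);
    if (newHeader.Length > oldHeader.Length)
        throw new InvalidOperationException("New header is longer; fall back to copying the file.");

    // Pad with spaces to the old length so the newline and the rest of the file stay untouched.
    byte[] outBytes = Encoding.ASCII.GetBytes(newHeader.PadRight(oldHeader.Length));
    fs.Seek(0, SeekOrigin.Begin);
    fs.Write(outBytes, 0, outBytes.Length);
}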
Otherwise you have to create a new file and copy the rest after the first line. You may be able to optimize the copying a bit by adjusting buffer sizes, copying explicitly as binary, or pre-allocating the file size, but it will not change the fact that you need to copy the whole file.
One more cheat, if you are planning to drop the CSV data into a DB anyway: if order does not matter, you can read some lines from the beginning, replace them with the new header, and append the removed lines to the end of the file.
Side note: if this is a one-time operation, I'd simply copy the files and be done with it... Debugging code that inserts data into the middle of a text file with a potentially different encoding may not be worth the effort.
var fileList = Directory.GetFiles("./", "split*.dat");
int fileCount = fileList.Length;
int i = 0;

foreach (string path in fileList)
{
    string[] contents = File.ReadAllLines(path); // OutOfMemoryException
    Array.Sort(contents);
    string newpath = path.Replace("split", "sorted");
    File.WriteAllLines(newpath, contents);
    File.Delete(path);
    contents = null;
    GC.Collect();
    SortChunksProgressChanged(this, (double)i / fileCount);
    i++;
}
For a file that consists of ~20-30 big lines (every line ~20 MB) I get an OutOfMemoryException when I call the ReadAllLines method. Why is this exception raised? And how do I fix it?
P.S. I use Mono on MacOS
You should always be very careful about performing operations with potentially unbounded results. In your case, reading a file. As you mention, the file size and/or line length is unbounded.
The answer lies in reading 'enough' of a line to sort by, then skipping characters until the next line and reading the next 'enough'. You probably want to build a line-index lookup so that when you reach an ambiguous sorting order you can go back and get more data from the line (Seek to the file position). When you go back, you only need to read the next sortable chunk to disambiguate the conflicting lines.
You may need to think about the file encoding; don't go straight to bytes unless you know it is one byte per char.
The built-in sort is not as fast as you'd like.
Side Note:
If you call GC.* you've probably done it wrong
setting contents = null does not help you
If you are using a foreach and maintaining the index yourself, then you may be better off with a for (int i = ...) loop, for readability
Okay, let me give you a hint to help you with your homework. Loading the complete file into memory will, as you know, not work, because it is given as a precondition of the assignment. You need to find a way to lazily load the data from disk as you go and throw as much data away as soon as possible. Because single lines could be too big, you will have to do this one char at a time.
Try creating a class that represents an abstraction over a line, for instance by wrapping the starting index and ending index of that line. When you let this class implement IComparable<T>, that line can be sorted against other lines. Again, the trick is to be able to read characters from the file one at a time. You will need to work with Streams (File.Open) directly.
When you do this, you will be able to write your application code like this:
List<FileLine> lines = GetLines("fileToSort.dat");
lines.Sort();
foreach (var line in lines)
{
    line.AppendToFile("sortedFile.dat");
}
Your task will be to implement GetLines(string path) and create the FileLine class.
Note that I assume that the actual number of lines will be small enough that the List<FileLine> will fit into memory (which means an approximate maximum of 40,000,000 lines). If the number of lines can be higher, you would need an even more flexible approach, but since you are talking about 20 to 30 lines, this should not be a problem.
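To make the shape concrete, here is a bare skeleton of what such a class might look like; the comparison and file-reading logic is deliberately left out, since that is the actual assignment, and all names are only suggestions:
// Skeleton only: a line is identified by where it lives in the file, not by its contents.
public class FileLine : IComparable<FileLine>
{
    private readonly string path;
    public long Start { get; private set; }   // offset of the first character of the line
    public long End { get; private set; }     // offset just past the last character of the line

    public FileLine(string path, long start, long end)
    {
        this.path = path;
        Start = start;
        End = end;
    }

    public int CompareTo(FileLine other)
    {
        // Open both lines (File.Open + Seek) and compare them character by character,
        // stopping at the first difference. This is the part you need to write.
        throw new NotImplementedException();
    }

    public void AppendToFile(string targetPath)
    {
        // Stream the characters from Start to End onto the end of targetPath.
        throw new NotImplementedException();
    }
}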
Basically, your approach is flawed. You are violating a constraint of the homework you were given, and this constraint has been put there to make you think more.
As you said:
I must implement external sort and show my teacher that it works for files bigger than my RAM
OK, so how do you think you will ever read the whole file in? ;) This constraint is there on purpose. ReadAllLines does NOT implement an incremental external sort. As a result, it blows up.
I need a bit of help. I have two sources of information, and the information is exported to two different CSV files by different programs. They are supposed to contain the same information; however, this is what needs to be checked.
Therefore what I would like to do is as follows:
Take the information from the two files.
Compare
Output any differences and which file the difference was in (e.g. File A contained this, but File B did not, and vice versa).
The files are 200,000-odd rows, so this will need to be as efficient as possible.
I tried doing this with Excel, but it proved too complicated, and I'm really struggling to find a way to do it programmatically.
Assuming that the files are really supposed to be identical, right down to text qualifiers, ordering of rows, and number of rows contained in each file, the simplest approach may be to simply iterate through both files together and compare each line.
using (StreamReader f1 = new StreamReader(path1))
using (StreamReader f2 = new StreamReader(path2)) {
    var differences = new List<string>();
    int lineNumber = 0;

    while (!f1.EndOfStream) {
        if (f2.EndOfStream) {
            differences.Add("Differing number of lines - f2 has less.");
            break;
        }

        lineNumber++;
        var line1 = f1.ReadLine();
        var line2 = f2.ReadLine();

        if (line1 != line2) {
            differences.Add(string.Format("Line {0} differs. File 1: {1}, File 2: {2}", lineNumber, line1, line2));
        }
    }

    if (!f2.EndOfStream) {
        differences.Add("Differing number of lines - f1 has less.");
    }
}
Depending on your answers to the comments on your question, if it doesn't really need to be done with code, you could do worse than downloading a compare tool, which is likely to be more sophisticated.
(WinMerge, for example)
OK, for anyone else who googles this and finds this page, here is what my answer was.
I exported the details to a CSV and ordered them numerically when they were exported for ease of use. Once they were exported as two CSV files, I then used a program called Beyond Compare which can be found here. This allows the files to be compared.
At first I used Beyond Compare manually to test that what I was exporting was correct, etc.; however, Beyond Compare can also be driven from the command line to run comparisons. This means everything can be done programmatically; all that has to be done is for a user to view the results in Beyond Compare. You may be able to export them to another CSV; I haven't looked, as the GUI of Beyond Compare is very nice and useful, so it is easier to use that.
I am trying to compare two large datasets from a SQL query. Right now the SQL query is done externally and the results from each dataset are saved into their own CSV file. My little C# console application loads up the two text/CSV files, compares them for differences, and saves the differences to a text file.
It's a very simple application that just loads all the data from the first file into an ArrayList and does a .compare() against the ArrayList as each line is read from the second CSV file, then saves the records that don't match.
The application works, but I would like to improve the performance. I figure I can greatly improve performance if I can take advantage of the fact that both files are sorted, but I don't know of a data type in C# that keeps order and would allow me to select a specific position. There's a basic array, but I don't know how many items are going to be in each list. I could have over a million records. Is there a data type available that I should be looking at?
If the data in both of your CSV files is already sorted and they have the same number of records, you could skip the data structure entirely and do the analysis in place.
StreamReader one = new StreamReader(@"C:\file1.csv");
StreamReader two = new StreamReader(@"C:\file2.csv");
String lineOne;
String lineTwo;

StreamWriter differences = new StreamWriter("Output.csv");
while (!one.EndOfStream)
{
    lineOne = one.ReadLine();
    lineTwo = two.ReadLine();

    // do your comparison.
    bool areDifferent = true;
    if (areDifferent)
        differences.WriteLine(lineOne + lineTwo);
}

one.Close();
two.Close();
differences.Close();
System.Collections.Specialized.StringCollection allows you to add a range of values and, using the .IndexOf(string) method, allows you to retrieve the index of that item.
That being said, you could likely just load up a couple of byte[] from a FileStream and do byte comparison... don't even worry about loading that stuff into a formal data structure like StringCollection or string[]; if all you're doing is checking for differences, and you want speed, I would reckon byte differences are where it's at.
This is an adaptation of David Sokol's code to work with a varying number of lines, outputting the lines that are in one file but not the other:
StreamReader one = new StreamReader(@"C:\file1.csv");
StreamReader two = new StreamReader(@"C:\file2.csv");
String lineOne;
String lineTwo;

StreamWriter differences = new StreamWriter("Output.csv");
lineOne = one.ReadLine();
lineTwo = two.ReadLine();

while (lineOne != null || lineTwo != null)
{
    if (lineOne == lineTwo)
    {
        // lines match, read next line from each and continue
        lineOne = one.ReadLine();
        lineTwo = two.ReadLine();
        continue;
    }
    if (lineTwo == null || (lineOne != null && string.Compare(lineOne, lineTwo) < 0))
    {
        // line is in file 1 but not (yet) in file 2
        differences.WriteLine(lineOne);
        lineOne = one.ReadLine();
    }
    else
    {
        // line is in file 2 but not (yet) in file 1
        differences.WriteLine(lineTwo);
        lineTwo = two.ReadLine();
    }
}
Standard caveat about code written off the top of my head applies -- you may need to special-case running out of lines in one while the other still has lines, but I think this basic approach should do what you're looking for.
Well, there are several approaches that would work. You could write your own data structure that did this. Or you can try and use SortedList. You can also return the DataSets in code, and then use .Select() on the table. Granted, you would have to do this on both tables.
You can easily use a SortedList to do fast lookups. If the data you are loading is already sorted, insertions into the SortedList should not be slow.
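A minimal sketch of that approach, assuming (purely for illustration) that the first comma-separated field of each row is the sort key, and reusing the file paths from the earlier answers:
// Sketch: load file 1 into a SortedList keyed by its first column, then probe it while streaming file 2.
var lookup = new SortedList<string, string>();
using (var readerA = new StreamReader(@"C:\file1.csv"))
{
    string line;
    while ((line = readerA.ReadLine()) != null)
    {
        int comma = line.IndexOf(',');
        if (comma < 0) continue; // skip rows without a key column
        lookup[line.Substring(0, comma)] = line.Substring(comma + 1);
    }
}

using (var readerB = new StreamReader(@"C:\file2.csv"))
using (var diff = new StreamWriter("Output.csv"))
{
    string line;
    while ((line = readerB.ReadLine()) != null)
    {
        int comma = line.IndexOf(',');
        if (comma < 0) continue;
        string key = line.Substring(0, comma);
        string valueInA;
        if (!lookup.TryGetValue(key, out valueInA) || valueInA != line.Substring(comma + 1))
            diff.WriteLine(line); // missing from file 1, or different in file 1
    }
}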
If you are simply looking to see whether all lines in File A are included in File B, you could read them in and just compare the streams inside a loop.
File 1
Entry1
Entry2
Entry3
File 2
Entry1
Entry3
You could loop through with two counters and find omissions, going line by line through each file and see if you get what you need.
Maybe I misunderstand, but the ArrayList will maintain its elements in the same order in which you added them. This means you can compare the two ArrayLists in a single pass - just increment the two scanning indices according to the comparison results.
One question I have: have you considered "outsourcing" your comparison? There are plenty of good diff tools that you could just call out to. I'd be surprised if there wasn't one that let you specify two files and get back only the differences. Just a thought.
I think the reason everyone has so many different answers is that you haven't quite specified your problem well enough to be answered. First off, it depends what kind of differences you want to track. Do you want the differences to be output like in a WinDiff, where the first file is the "original" and the second file is the "modified", so you can list changes as INSERT, UPDATE or DELETE? Do you have a primary key that will allow you to match up two lines as different versions of the same record (when fields other than the primary key are different)? Or is this some sort of reconciliation where you just want your difference output to say something like "RECORD IN FILE 1 AND NOT FILE 2"?
I think the answers to these questions will help everyone give you a suitable answer to your problem.
If you have two files that are each a million lines, as mentioned in your post, you might be using up a lot of memory. Some of the performance problem might be that you are swapping to disk. If you are simply comparing line 1 of file A to line 1 of file B, line 2 of file A to line 2 of file B, etc., I would recommend a technique that does not store so much in memory. You could either read and write off two file streams, as a previous commenter posted, and write out your results "in real time" as you find them, which would not explicitly store anything in memory; or you could read chunks of each file into memory, say one thousand lines at a time, into something like a List. This could be fine-tuned to meet your needs.
To resolve question #1, I'd recommend looking into creating a hash of each line. That way you can compare hashes quickly and easily using a dictionary.
To resolve question #2, one quick and dirty solution would be to use an IDictionary<string, string>, with itemId as the key and the rest of the line as the value. You can then quickly find whether an itemId exists and compare the lines. This of course assumes .NET 2.0+.
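One way to read the hashing suggestion, as a rough sketch (it uses HashSet<string>, so .NET 3.5+ rather than 2.0, and SHA1 is just an example hash):
// Sketch: remember a hash of every line in file 1, then flag file 2 lines whose hash is absent.
var hashesInFileOne = new HashSet<string>();
using (var sha = System.Security.Cryptography.SHA1.Create())
using (var reader = new StreamReader(@"C:\file1.csv"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
        hashesInFileOne.Add(Convert.ToBase64String(sha.ComputeHash(Encoding.UTF8.GetBytes(line))));
}

using (var sha = System.Security.Cryptography.SHA1.Create())
using (var reader = new StreamReader(@"C:\file2.csv"))
using (var differences = new StreamWriter("Differences.txt"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        string key = Convert.ToBase64String(sha.ComputeHash(Encoding.UTF8.GetBytes(line)));
        if (!hashesInFileOne.Contains(key))
            differences.WriteLine(line); // in file 2 but not in file 1
    }
}
For the itemId variant, a plain Dictionary<string, string> keyed by itemId and holding the rest of the line works the same way.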