I have many large CSV files (1-10 GB each) which I'm importing into databases. For each file, I need to replace the first line so I can format the headers to be the column names. My current solution is:
using (var reader = new StreamReader(file))
{
    using (var writer = new StreamWriter(fixedFile)) // "fixed" is a C# keyword; use another name
    {
        var line = reader.ReadLine();
        var fixedLine = parseHeaders(line);
        writer.WriteLine(fixedLine);

        while ((line = reader.ReadLine()) != null)
            writer.WriteLine(line);
    }
}
What is a quicker way to only replace line 1 without iterating through every other line of these huge files?
If you can guarantee that fixedLine is always the same length as line (or shorter), you can update the files in place instead of copying them.
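For example, a minimal sketch of that in-place update, assuming single-byte content (e.g. ASCII) and padding the new header with spaces to preserve the file length (OverwriteHeader is a hypothetical helper, not from the question):

using System;
using System.IO;
using System.Text;

static void OverwriteHeader(string path, string newHeader)
{
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
    {
        // Measure the old header: count bytes up to the first '\r' or '\n'.
        int b, oldLen = 0;
        while ((b = fs.ReadByte()) != -1 && b != '\r' && b != '\n')
            oldLen++;

        if (newHeader.Length > oldLen)
            throw new InvalidOperationException("New header too long for in-place update.");

        // Overwrite the old header bytes, padding with spaces to keep offsets intact.
        fs.Position = 0;
        byte[] bytes = Encoding.ASCII.GetBytes(newHeader.PadRight(oldLen));
        fs.Write(bytes, 0, bytes.Length);
    }
}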
If not, you can possibly get a small performance improvement by accessing the BaseStream of your StreamReader and StreamWriter and doing big block copies (using, say, a 32 KB byte buffer). That at least eliminates the time spent checking every character for an end-of-line, as happens now with reader.ReadLine().
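A sketch of that idea, assuming an ASCII header. One caveat: after ReadLine(), the reader's BaseStream position is past its internal read-ahead buffer, not just past line 1, so this version scans the first line at the byte level itself and then copies the remainder in 32 KB blocks:

using System;
using System.IO;
using System.Text;

static void FixHeaderByBlockCopy(string sourcePath, string destPath,
                                 Func<string, string> parseHeaders)
{
    using (var input = File.OpenRead(sourcePath))
    using (var output = File.Create(destPath))
    {
        // Read the raw bytes of the first line, up to and including '\n'.
        var headerBytes = new MemoryStream();
        int b;
        while ((b = input.ReadByte()) != -1)
        {
            headerBytes.WriteByte((byte)b);
            if (b == '\n') break;
        }

        string oldHeader = Encoding.ASCII.GetString(headerBytes.ToArray()).TrimEnd('\r', '\n');
        byte[] newHeader = Encoding.ASCII.GetBytes(parseHeaders(oldHeader) + "\r\n");
        output.Write(newHeader, 0, newHeader.Length);

        // Copy everything after the first line in large blocks; no per-character
        // end-of-line scanning as with ReadLine().
        input.CopyTo(output, 32 * 1024);
    }
}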
The only thing that can significantly speed this up is if you can really replace the first line in place. If the new first line is no longer than the old one, carefully overwrite the first line (with space padding if needed).
Otherwise you have to create a new file and copy the rest after the first line. You may be able to optimize the copying a bit by adjusting buffer sizes, copying explicitly as binary, or pre-allocating the size, but that will not change the fact that you need to copy the whole file.
One more cheat, if you are planning to drop the CSV data into a DB anyway: if order does not matter, you can read some lines from the beginning, replace them with the new header, and add the removed lines to the end of the file.
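A rough sketch of that trick, assuming ASCII content, CRLF line endings, and an importer that tolerates trailing spaces on the header line (ReplaceHeaderInPlace and its internals are hypothetical):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static void ReplaceHeaderInPlace(string path, string newHeader)
{
    var displaced = new List<string>();
    long regionLength = 0;
    byte[] headerBytes = Encoding.ASCII.GetBytes(newHeader);

    using (var fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
    {
        using (var reader = new StreamReader(fs, Encoding.ASCII, false, 4096, leaveOpen: true))
        {
            // Consume whole lines from the start until the region can hold
            // the new header plus its line break.
            string line;
            while (regionLength < headerBytes.Length + 2
                   && (line = reader.ReadLine()) != null)
            {
                if (regionLength > 0)            // skip the old header itself
                    displaced.Add(line);
                regionLength += line.Length + 2; // +2 for "\r\n" (ASCII assumed)
            }
        }

        int pad = (int)(regionLength - headerBytes.Length - 2);
        if (pad < 0)
            throw new InvalidOperationException("File too short for the new header.");

        // Overwrite the region: new header, space padding, then a line break.
        fs.Position = 0;
        byte[] region = Encoding.ASCII.GetBytes(newHeader + new string(' ', pad) + "\r\n");
        fs.Write(region, 0, region.Length);

        // Append the displaced data lines at the end (assumes the file already
        // ends with a line break).
        fs.Seek(0, SeekOrigin.End);
        using (var writer = new StreamWriter(fs, Encoding.ASCII))
            foreach (var d in displaced)
                writer.WriteLine(d);
    }
}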
Side note: if this is a one-time operation, I'd simply copy the files and be done with it... Debugging code that inserts data into the middle of a text file with a potentially different encoding may not be worth the effort.
I use the CSV reader and found that it takes a lot of time to parse the data. How can I load the entire CSV file into memory and then process it record by record, given that I have to do custom mapping of the records?
TextReader tr = new StreamReader(File.Open(@"C:\MarketData\" + symbol + ".txt", FileMode.Open));
CsvReader csvr = new CsvReader(tr);
while (csvr.Read())
{
// do your magic
}
Create a class that exactly represents/mirrors your CSV file, then read all the contents into a list of that class. The following snippet is from CsvHelper's documentation.
var csv = new CsvReader( textReader );
var records = csv.GetRecords<MyClass>().ToList();
The important part is the .ToList(), as this will force the load of all the data into your list, rather than yielding results as you access them.
You can then perform additional mapping / extraction on that list, which will be in memory.
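For illustration, a hypothetical record class for market data like the question's; the property names here are assumptions, and CsvHelper maps them to the header columns by name:

using System;

public class MyClass
{
    // Assumed column names; adjust to match your CSV's header.
    public string Symbol { get; set; }
    public DateTime Date { get; set; }
    public decimal Open { get; set; }
    public decimal Close { get; set; }
}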
If you're already doing this, you may benefit from loading your CSV into a HashSet rather than a List via ToHashSet(). See HashSet vs List Performance.
To answer your question directly: you can load the file fully into a MemoryStream and then re-read it with your CsvReader. Similarly, you can create a bigger read buffer for your FileStream, e.g. 15 MB, which would read the entire file into the buffer in one hit. I doubt either of these will actually improve performance for 10 MB files.
Find your real performance bottleneck: the time to read the file content from disk, to parse the CSV into fields, or to process a record? A 10 MB file looks really small; I'm processing sets of 250 MB+ CSV files with a custom CSV reader without complaints.
If processing is the bottleneck, several threads are available, and your CSV file format does not need to support escaped line breaks, you could read the entire file into a list of lines (System.IO.File.ReadAllLines / .ReadLines) and parse each line on a different task. For example:
System.IO.File.ReadLines(path)
    .Skip(1)                 // header line; assumed trusted to be correct
    .AsParallel()
    .Select(ParseRecord)     // RecordClass ParseRecord(string line)
    .ForAll(ProcessRecord);  // void ProcessRecord(RecordClass record)
If you have many files to parse, you could process each file in a different Task and use async methods to maximise throughput. If they all come from the same physical disk, your mileage will vary, and may even be worse than a single-threaded approach.
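For instance, a minimal sketch; files and ProcessFile are stand-ins for your own file list and per-file work:

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

static async Task ProcessAllAsync(IEnumerable<string> files)
{
    // One task per file; ProcessFile is a hypothetical per-file handler.
    var tasks = files.Select(f => Task.Run(() => ProcessFile(f)));
    await Task.WhenAll(tasks); // throughput depends heavily on the disk
}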
More advanced:
If you know your files contain 8-bit characters only, you can operate on byte arrays and skip the StreamReader overhead of decoding bytes into chars. You can then read the entire file into a byte array in a single call and scan for line breaks, assuming no line-break escapes need to be supported. In that case the scan can be done by multiple threads, each looking at a part of the byte array (see the sketch after this list).
If you don't need to support field escapes (a,"b,c",d), you can write a faster parser that simply looks for field separators (typically commas). You can also split field-demarcation parsing and field-content parsing across threads if that's a bottleneck, though memory access locality may negate any benefit.
Under certain circumstances you may not need to parse fields into intermediate data structures (e.g. doubles, strings) at all, and can instead work directly off references to the start/end of fields, saving yourself some intermediate data structure creation.
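A rough sketch of the byte-level newline scan from the first point above, assuming 8-bit content and no escaped line breaks (names are illustrative):

using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

static List<int>[] FindLineBreaks(string path, int threadCount)
{
    byte[] data = File.ReadAllBytes(path);   // one read for the whole file
    var results = new List<int>[threadCount];
    int sliceSize = data.Length / threadCount + 1;

    // Each task scans its own slice of the byte array for '\n'.
    Parallel.For(0, threadCount, t =>
    {
        var offsets = new List<int>();
        int start = t * sliceSize;
        int end = Math.Min(start + sliceSize, data.Length);
        for (int i = start; i < end; i++)
            if (data[i] == (byte)'\n')
                offsets.Add(i);
        results[t] = offsets;
    });

    return results; // concatenated in slice order, these are all line-break offsets
}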
I did check whether any existing questions matched mine, but I didn't see any; if I missed one, my mistake.
I have two text files to compare against each other. One is a temporary log file that is overwritten sometimes; the other is a permanent log, which collects and appends all of the contents of the temp log into one file (it collects the new lines since it last checked and appends them to the end of the complete log). After a point, though, the complete log may become quite large and therefore inefficient to compare against, so I have been thinking about different ways to approach this.
My first idea is to buffer the temp log's lines (it will normally be the smaller of the two) into a list and simply loop through the archive log, doing something like:
List<string> bufferedLines = new List<string>(File.ReadLines(tempPath));

using (StreamReader archiveStream = new StreamReader(ArchivePath))
{
    string line;
    while ((line = archiveStream.ReadLine()) != null)
    {
        if (bufferedLines.Contains(line))
        {
            // the archive already has this temp-log line
        }
    }
}
There are a couple of ways I could proceed from here. I could create yet another list to store the inconsistencies, close the read stream (I'm not sure you can both read and write at the same time; if you can, that might make things easier), then open a write stream in append mode and write the list to the file. Alternatively, cutting out buffering the inconsistencies, I could open a write stream while the files are being compared and write the unmatched lines on the spot.
The other method I could think of was limited by my knowledge of whether it could be done or not: rather than buffering either file, compare the streams side by side as they are read and append the lines on the fly. Something like:
using (StreamReader archiveStream = new StreamReader(ArchivePath))
{
    using (StreamReader tempLogStream = new StreamReader(tempPath))
    {
        // StreamReader has no ReadAllLines; something like this is the idea:
        if (!File.ReadLines(ArchivePath).Contains(tempLogStream.ReadLine()))
        {
            // write the line to the file
        }
    }
}
As I said, I'm not sure whether that would work or whether it would be more efficient than the first method, so I figured I'd ask and see if anyone had insight into how this might properly be implemented, and whether it is the most efficient way or there is a better method out there.
Effectively what you want here is all of the items from one set that aren't in another set. This is set subtraction, or in LINQ terms, Except. If your data sets are sufficiently small, you can simply do this:
var lines = File.ReadLines(TempPath)
.Except(File.ReadLines(ArchivePath))
.ToList();//can't write to the file while reading from it
File.AppendAllLines(ArchivePath, lines);
Of course, this code requires bringing all of the lines of the archive file into memory, because that's just how Except is implemented: it builds a HashSet of the second sequence's items so that it can efficiently find matches from the first.
Presumably the number of lines that need to be added is pretty small, so the fact that they all need to be stored in memory isn't a problem. If there could potentially be a lot of them, you'd want to write them out to another file besides the first one (possibly concatenating the two files when done, if needed).
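A sketch of that, assuming the archive's lines still fit in a HashSet; NewLinesPath is a hypothetical side file:

var archived = new HashSet<string>(File.ReadLines(ArchivePath));

// Stream the temp lines and write each miss straight to the side file,
// so the missing lines are never all held in memory at once.
using (var writer = new StreamWriter(NewLinesPath))
{
    foreach (var line in File.ReadLines(TempPath))
        if (!archived.Contains(line))
            writer.WriteLine(line);
}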
I have a huge text file which I need to read. Currently I am reading the text file like this:
string[] lines = File.ReadAllLines(FileToCopy);
But here all the lines get stored in the lines array first, and only afterwards are they processed according to the condition. That is not efficient: it first reads irrelevant rows (lines) of the text file into the array, and the same goes for the processing.
So my question is: can I specify the line number from which to start reading the text file? Suppose last time it read 10001 lines; next time it should start from 10002.
How can I achieve this?
Well, you don't have to store all those lines, but you definitely have to read them. Unless the lines are of a fixed length (in bytes, not characters), how would you expect to be able to skip to a particular part of the file?
To store only the lines you want in memory though, use:
List<string> lines = File.ReadLines(FileToCopy).Skip(linesToSkip).ToList();
Note that File.ReadLines() was introduced in .NET 4, and reads the lines on-demand with an iterator instead of reading the entire file into memory.
If you only want to process a certain number of lines, you can use Take as well:
List<string> lines = File.ReadLines(FileToCopy)
.Skip(linesToSkip)
.Take(linesToRead)
.ToList();
So for example, linesToSkip=10000 and linesToRead=1000 would give you lines 10001-11000.
Ignore the line numbers; they're useless. If every line isn't the same length, you're going to have to read them one by one again, and that's a huge waste.
Instead, use the position of the file stream. This way, you can skip right to where you left off on the second attempt, with no need to read the data all over again. After that, just use ReadLine in a loop until you get to the end, and record the new end position.
Please, don't use ReadLines().Skip(). If you have a 10 GB file, it will read all 10 GB, create the appropriate strings, throw them away, and then, finally, read the 100 bytes you want. That's just crazy :) Of course, it's better than using File.ReadAllLines, but only because it doesn't need to keep the whole file in memory at once. Other than that, you're still reading every single byte of the file (you have to find out where the lines end).
Sample code for a method that reads from the last known location:

string[] ReadAllLinesFromBookmark(string fileName, ref long lastPosition)
{
    using (var fs = File.OpenRead(fileName))
    {
        // Jump straight to where we stopped last time.
        fs.Position = lastPosition;
        using (var sr = new StreamReader(fs))
        {
            string line = null;
            List<string> lines = new List<string>();
            while ((line = sr.ReadLine()) != null)
            {
                lines.Add(line);
            }

            // Safe here because we read to the end of the file. If you stop
            // mid-file, the StreamReader's read-ahead buffering makes
            // fs.Position overshoot the last line actually returned.
            lastPosition = fs.Position;
            return lines.ToArray();
        }
    }
}
Well, you do have line numbers, in the form of the array index. Keep a note of the previously read line's array index and you can start reading from the next array index.
Use the FileStream.Position property to get the current position in the file, and set it later to resume from that position.
I have a text file which contains 200000 rows. I want to read the first 50000 rows, process them, then read the second part, say 50001 to 100000, etc. When I read the second block I don't want to loop over rows 1 to 50000 again. I want the reader pointer to go directly to row number 50001 and start reading.
How is this possible? Which reader should be used for that?
You need the StreamReader class.
With this you can do line-by-line reading with the ReadLine() method. You will need to keep track of the line count yourself and call a method to process your data every 50000 lines, but as long as you keep the reader open, you should not need to restart the reading.
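A minimal sketch, assuming a ProcessBlock method of your own:

using (var reader = new StreamReader(path))
{
    var block = new List<string>(50000);
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        block.Add(line);
        if (block.Count == 50000)
        {
            ProcessBlock(block); // your processing
            block.Clear();       // the reader stays open, so reading just continues
        }
    }
    if (block.Count > 0)
        ProcessBlock(block);     // final partial block
}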
No, unfortunately there is no way you can skip counting the lines. At the raw level, files do not work on a line-number basis; instead they work on a position/offset basis. The file system itself has no concept of lines; that is a concept added by higher-level components.
So there is no way to tell the operating system "please open the file at line X". Instead you have to open the file and skip through it counting newlines until you've passed the specified number, then store the following bytes into an array until you hit the next newline.
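For illustration, a sketch of that counting, assuming a single-byte encoding (FindLineOffset is a hypothetical helper returning the byte offset at which a 1-based line number starts):

using System.IO;

static long FindLineOffset(string fileName, int lineNumber)
{
    using (var fs = File.OpenRead(fileName))
    {
        int current = 1, b;
        long offset = 0;

        // Count newlines until we've passed lineNumber - 1 of them.
        while (current < lineNumber && (b = fs.ReadByte()) != -1)
        {
            offset++;
            if (b == '\n') current++;
        }
        return offset;
    }
}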
If each line has an equal number of bytes, though, you can skip the scan entirely and try the following:
using (Stream stream = File.Open(fileName, FileMode.Open))
{
    // Jump straight to the start of the target line (myLine is 1-based).
    stream.Seek(bytesPerLine * (myLine - 1), SeekOrigin.Begin);
    using (StreamReader reader = new StreamReader(stream))
    {
        string line = reader.ReadLine();
    }
}
I believe the best way would be to use a StreamReader.
Here are two questions related to yours whose answers may help. Ultimately, though, if you want to read blocks of text, it is very hard to do unless each block is a set size.
However, I believe these would be a good read for you:
Reading Block of text file
This one shows you how to separate blocks of a file to read. Its answer would suit you best: you can track how many lines you have read, check whether the line count == 50000, and then do something.
As you can see, that answer makes use of the keyword continue, which I believe will be useful for what you intend to do.
Reading text file block by block
This one gives a more readable answer, but doesn't really cover reading in blocks.
As for your question, I believe what you want to do has confused you a little. It seems like you want to select 50000 lines and then read them as one; that is not the way StreamReader works, and yes, reading line by line makes the process longer, but unfortunately that's how it is.
Unless the rows are exactly the same length, you can't start directly at row 50001.
What you can do, however, is when reading the first 50000 rows, remember where the last row ends. You can then seek directly to that offset and continue reading from there.
Where the row length is fixed, you can do something like this:
myfile.Seek(50000 * (rowCharacters + 2), SeekOrigin.Begin);
Seek goes to a specific offset in bytes, so you just need to tell it how many bytes 50000 rows occupy. Given an ASCII encoding, that's the number of characters in the line, plus 2 for the newline sequence.
var fileList = Directory.GetFiles("./", "split*.dat");
int fileCount = fileList.Length;
int i = 0;
foreach (string path in fileList)
{
string[] contents = File.ReadAllLines(path); // OutOfMemoryException
Array.Sort(contents);
string newpath = path.Replace("split", "sorted");
File.WriteAllLines(newpath, contents);
File.Delete(path);
contents = null;
GC.Collect();
SortChunksProgressChanged(this, (double)i / fileCount);
i++;
}
And for a file that consists of ~20-30 big lines (every line ~20 MB), I get an OutOfMemoryException when I call the ReadAllLines method. Why is this exception raised? And how do I fix it?
P.S. I use Mono on MacOS
You should always be very careful about performing operations with potentially unbounded results, in your case reading a file: as you mention, the file size and/or line length is unbounded.
The answer lies in reading 'enough' of a line to sort it, then skipping characters until the next line and reading the next 'enough'. You probably want to build a line-index lookup so that, when you reach an ambiguous sort order between lines, you can go back and get more data from the line (Seek to the file position). When you go back, you only need to read the next sortable chunk to disambiguate the conflicting lines.
You may need to think about the file encoding; don't go straight to bytes unless you know it is one byte per char.
The built-in sort is not as fast as you'd like.
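A rough sketch of that line-index idea, assuming ASCII content and that 64 characters usually suffice to order two lines; resolving ties by seeking back into the file is left out:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static List<(long Position, string Prefix)> BuildLineIndex(string path)
{
    const int prefixLength = 64; // 'enough' of each line to sort by, usually
    var index = new List<(long Position, string Prefix)>();

    using (var fs = File.OpenRead(path))
    {
        long lineStart = 0, pos = 0;
        var prefix = new List<byte>(prefixLength);
        int b;
        while ((b = fs.ReadByte()) != -1)
        {
            pos++;
            if (b == '\n')
            {
                index.Add((lineStart, Encoding.ASCII.GetString(prefix.ToArray())));
                prefix.Clear();
                lineStart = pos;
            }
            else if (b != '\r' && prefix.Count < prefixLength)
            {
                prefix.Add((byte)b);
            }
        }
        if (prefix.Count > 0) // last line without a trailing newline
            index.Add((lineStart, Encoding.ASCII.GetString(prefix.ToArray())));
    }

    // Sort by prefix; equal prefixes would need a Seek back to disambiguate.
    index.Sort((x, y) => string.CompareOrdinal(x.Prefix, y.Prefix));
    return index;
}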
Side notes:
- If you call GC.*, you've probably done it wrong.
- Setting contents = null does not help you.
- If you are using a foreach and maintaining the index, you may be better off with a for (int i...) loop for readability.
Okay, let me give you a hint to help you with your homework. Loading the complete file into memory will, as you know, not work, because it is given as a precondition of the assignment. You need to find a way to lazily load the data from disk as you go and to throw data away as soon as possible. Because single lines could be too big, you will have to do this one char at a time.
Try creating a class that represents an abstraction over a line, for instance by wrapping the starting index and ending index of that line. When you let this class implement IComparable<T>, it allows you to sort that line against other lines. Again, the trick is to be able to read characters from the file one at a time. You will need to work with Streams (File.Open) directly.
When you do this, you will be able to write your application code like this:
List<FileLine> lines = GetLines("fileToSort.dat");
lines.Sort();
foreach (var line in lines)
{
line.AppendToFile("sortedFile.dat");
}
Your task will be to implement GetLines(string path) and create the FileLine class.
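To make the shape concrete (and only the shape; the real work is the exercise), a hypothetical skeleton:

using System;
using System.IO;

class FileLine : IComparable<FileLine>
{
    private readonly Stream file; // shared stream from File.Open
    public long Start { get; }    // offset of the line's first character
    public long End { get; }      // offset just past its last character

    public FileLine(Stream file, long start, long end)
    {
        this.file = file;
        Start = start;
        End = end;
    }

    public int CompareTo(FileLine other)
    {
        // Read small chunks from 'file' at Start..End and other.Start..other.End,
        // comparing them char by char; deliberately left unimplemented.
        throw new NotImplementedException();
    }

    public void AppendToFile(string path)
    {
        // Copy the bytes Start..End to the target file; also the exercise.
        throw new NotImplementedException();
    }
}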
Note that I assume the actual number of lines will be small enough that the List<FileLine> fits into memory (an approximate maximum of 40,000,000 lines). If the number of lines can be higher, you would need an even more flexible approach, but since you are talking about 20 to 30 lines, this should not be a problem.
Basically your approach is bull: you are violating a constraint of the homework you were given, and that constraint has been put there to make you think more.
As you said:
I must implement external sort and show my teacher that it works for files bigger than my RAM
OK, so how do you think you will ever read the file in? ;) This is there on purpose. ReadAllLines does NOT implement an incremental external sort. As a result, it blows up.