I have a huge text file which I need to read. Currently I am reading the text file like this:
string[] lines = File.ReadAllLines(FileToCopy);
But this way all the lines are stored in the lines array, and only afterwards are they processed according to the condition. That is not efficient, because it first reads the irrelevant rows (lines) of the text file into the array and then runs them through the processing as well.
So my question is: can I specify the line number to start reading the text file from? Suppose last time it read 10001 lines; next time it should start from 10002.
How do I achieve this?
Well you don't have to store all those lines - but you definitely have to read them. Unless the lines are of a fixed length (in bytes, not characters) how would you expect to be able to skip to a particular part of the file?
To store only the lines you want in memory though, use:
List<string> lines = File.ReadLines(FileToCopy).Skip(linesToSkip).ToList();
Note that File.ReadLines() was introduced in .NET 4, and reads the lines on-demand with an iterator instead of reading the entire file into memory.
If you only want to process a certain number of lines, you can use Take as well:
List<string> lines = File.ReadLines(FileToCopy)
.Skip(linesToSkip)
.Take(linesToRead)
.ToList();
So for example, linesToSkip=10000 and linesToRead=1000 would give you lines 10001-11000.
Ignore the line numbers, they're useless here - unless every line is the same length, you're going to have to read the lines one by one again, and that's a huge waste.
Instead, use the position of the file stream. This way, you can skip right there on the second attempt, no need to read the data all over again. After that, you'll just use ReadLine in a loop until you get to the end, and mark the new end position.
Please, don't use ReadLines().Skip(). If you have a 10 GB file, it will read all 10 GB, create the corresponding strings, throw them away, and then, finally, read the 100 bytes you actually want. That's just crazy :) Of course, it's better than using File.ReadAllLines, but only because it doesn't need to keep the whole file in memory at once. Other than that, you're still reading every single byte of the file (you have to find out where the lines end).
Sample code for a method that reads from the last known location:
string[] ReadAllLinesFromBookmark(string fileName, ref long lastPosition)
{
    using (var fs = File.OpenRead(fileName))
    {
        // Jump straight to where we stopped last time.
        fs.Position = lastPosition;
        using (var sr = new StreamReader(fs))
        {
            string line = null;
            List<string> lines = new List<string>();
            while ((line = sr.ReadLine()) != null)
            {
                lines.Add(line);
            }
            // We read to the end of the file, so the stream position now
            // marks the new bookmark for the next call.
            lastPosition = fs.Position;
            return lines.ToArray();
        }
    }
}
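For example (hypothetical usage; the bookmark file name and the idea of persisting the position between runs are assumptions, not part of the answer above):
// Hypothetical usage of ReadAllLinesFromBookmark: keep the byte position
// between runs so each run only sees the lines appended since the last one.
long lastPosition = File.Exists("bookmark.txt")
    ? long.Parse(File.ReadAllText("bookmark.txt"))
    : 0;

string[] newLines = ReadAllLinesFromBookmark(FileToCopy, ref lastPosition);

foreach (string line in newLines)
{
    // process only the newly appended lines here
}

File.WriteAllText("bookmark.txt", lastPosition.ToString());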
Well, you do have line numbers, in the form of the array index. Keep a note of the previously read line's array index and you can start reading from the next array index.
Use the FileStream.Position property to get the current position in the file, and then set that position again later to resume reading from there.
So I've been using the same code for about a year now. Normally I find new ways to do old tasks and slowly improve, but I just seem to have stagnated with this one. I was curious if anyone could provide any insight on how I would do this task differently. I'm loading a text file, reading all its lines into a string array, and then looping over those entries to perform an operation on each line.
string[] config = File.ReadAllLines("Config.txt");
foreach (string line in config)
{
    DoOperations(line);
}
Eventually I'll just move to OpenFileDialog, but that's for a time in the future, and using OpenFileDialog in a multithreaded console application seems like bad practice.
Since you don't act on the whole file at any point, you could read it one line at a time. Given that your file looks like a config file it's probably not massive, but if you were trying to read a large file with File.ReadAllLines() you could run into memory issues. Reading one line at a time helps avoid that.
using (StreamReader file = new StreamReader("config.txt"))
{
    string line;
    while ((line = file.ReadLine()) != null)
    {
        DoOperations(line);
    }
}
You could rename config to lines for readability ;)
You could use var
Select? (if DoSomething returns something)
var parsed = File.ReadAllLines("Config.txt").Select(l => Parsed(l));
ForEach?
lines.ToList().ForEach(l => DoSomething(l));
Read line by line with ReadLines?
foreach (var line in File.ReadLines("Config.txt"))
{
(...)
}
The ReadLines and ReadAllLines methods differ as follows: when you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned; when you use ReadAllLines, you must wait for the whole array of strings to be returned before you can access the array. Therefore, when you are working with very large files, ReadLines can be more efficient.
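As a rough illustration of that difference (the early-exit condition below is made up for the example):
// Sketch: because File.ReadLines is lazy, breaking out of the loop means the
// rest of the file is never read. The "stop after 100 lines" rule is just an
// example condition; File.ReadAllLines would have read the whole file regardless.
int processed = 0;
foreach (var line in File.ReadLines("Config.txt"))
{
    DoOperations(line);
    if (++processed >= 100)
        break;
}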
Let's say I have the following file format (key-value pairs):
Object1Key: Object1Value
Object2Key: Object2Value
Object3Key: Object3Value
Object4Key: Object4Value1
Object4Value2
Object4Value3
Object5Key: Object5Value
Object6Key: Object6Value
I'm reading this line by line with StreamReader. For objects 1, 2, 3, 5 and 6 it wouldn't be a problem, because the whole object is on one line, so it's possible to process the object.
But for object 4 I need to process multiple lines. Can I use Peek for this? (MSDN for Peek: Returns the next available character but does not consume it.) Is there a method like Peek which returns the next line and not the next character?
If I can use Peek, then my question is: can I use Peek two times so I can read the next two lines (or three) until I know there is a new object (object 5) to be processed?
I would strongly recommend that you separate the IO from the line handling entirely.
Instead of making your processing code use a StreamReader, pass it either an IList<string> or an IEnumerable<string>... if you use IList<string> that will make it really easy to just access the lines by index (so you can easily keep track of "the key I'm processing started at line 5" or whatever), but it would mean either doing something clever or reading the whole file in one go.
If it's not a big file, then just using File.ReadAllLines is going to be the very simplest way of reading a file as a list of lines.
If it is a big file, use File.ReadLines to obtain an IEnumerable<string>, and then your processing code needs to be a bit smarter... for example, it might want to create a List<string> for each key that it processes, containing all the lines for that key - and let that list be garbage collected when you read the next key.
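A minimal sketch of that idea, assuming a key line is any line containing a colon (that heuristic, the path variable and the ProcessKey method are assumptions for illustration):
// Sketch: stream the file with File.ReadLines and buffer only the lines that
// belong to the key currently being processed.
List<string> currentKeyLines = null;
foreach (string line in File.ReadLines(path))
{
    if (line.Contains(":")) // start of a new key (assumed heuristic)
    {
        if (currentKeyLines != null)
            ProcessKey(currentKeyLines); // hypothetical handler for one key
        currentKeyLines = new List<string>();
    }
    if (currentKeyLines != null)
        currentKeyLines.Add(line);
}
if (currentKeyLines != null)
    ProcessKey(currentKeyLines); // flush the last key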
There is no way to use Peek multiple times the way you are thinking, because it always returns only the "top" character in the stream: it reads the character but does not advance the stream position.
To sum up, the stream pointer stays in the same place after Peek.
If you use, for example, a FileStream, you can use Seek to go back, but you didn't specify what type of stream you are using.
You could do something like this:
List<MyObject> objects = new List<MyObject>();
using (StreamReader sr = new StreamReader(aPath))
{
    MyObject curObj = null; // initialized so the compiler allows the else branch
    while (!sr.EndOfStream)
    {
        string line = sr.ReadLine();
        if (line.IndexOf(':') >= 0) // or whatever identifies the beginning of a new object
        {
            curObj = new MyObject(line);
            objects.Add(curObj);
        }
        else
        {
            // continuation line: attach it to the object started above
            curObj.AddAttribute(line);
        }
    }
}
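The answer leaves MyObject undefined; a minimal sketch of what it might look like (the member names are assumptions):
// Hypothetical MyObject for the snippet above: splits "Key: Value" on the
// first colon and collects any continuation lines as extra values.
class MyObject
{
    public string Key { get; private set; }
    public List<string> Values { get; private set; }

    public MyObject(string keyLine)
    {
        int idx = keyLine.IndexOf(':');
        Key = keyLine.Substring(0, idx).Trim();
        Values = new List<string> { keyLine.Substring(idx + 1).Trim() };
    }

    public void AddAttribute(string line)
    {
        Values.Add(line.Trim());
    }
}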
I have a text file which contains 200000 rows. I want to read the first 50000 rows, then process them, then read the second part, say rows 50001 to 100000, and so on. When I read the second block I don't want to loop over the first 1 to 50000 again; I want the reader pointer to go directly to row number 50001 and start reading there.
How can that be done? Which reader is used for that?
You need the StreamReader class.
With this you can do line by line reading with the ReadLine() method. You will need to keep track of the line count yourself and call a method to process your data every 50000 lines, but so long as you keep the reader open you should not need to restart the reading.
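A minimal sketch of that approach (the ProcessBatch method and the file name are assumptions):
// Sketch: one pass over the file, handing off every 50,000 lines to a
// hypothetical ProcessBatch method. The reader stays open, so nothing is
// re-read between batches.
const int batchSize = 50000;
var batch = new List<string>(batchSize);

using (var reader = new StreamReader("data.txt"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        batch.Add(line);
        if (batch.Count == batchSize)
        {
            ProcessBatch(batch);
            batch.Clear();
        }
    }
    if (batch.Count > 0)
        ProcessBatch(batch); // remaining lines
}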
No, unfortunately there is no way you can skip counting the lines. At the raw level, files do not work on a line-number basis; instead they work on a position/offset basis. The underlying file system has no concept of lines; that is a concept added by higher-level components.
So there is no way to tell the operating system "please open the file at the specified line". Instead you have to open the file and skip through it, counting newlines, until you've passed the specified number. Then you store the next set of bytes into an array until you hit the next newline.
If each line has the same number of bytes, though, you can try the following:
using (Stream stream = File.Open(fileName, FileMode.Open))
{
    // With fixed-length lines, line N starts at byte offset bytesPerLine * (N - 1).
    stream.Seek(bytesPerLine * (myLine - 1), SeekOrigin.Begin);
    using (StreamReader reader = new StreamReader(stream))
    {
        string line = reader.ReadLine();
    }
}
I believe the best way would be to use a StreamReader.
Here are two questions related to yours, from which you can get answers. Ultimately, though, if you want to read blocks of text it is very hard to do unless each block is a set size.
However, I believe these would be a good read for you:
Reading Block of text file
This one shows you how to separate blocks of text to read. Its answer would be best suited: you can just track how many lines you have read, check whether the line count == 50000 and so on, and then do something.
As you can see, that answer makes use of the continue keyword, which I believe will be useful for what you are intending to do.
Reading text file block by block
This one shows a more readable answer, but doesn't really answer what you are looking for in reading blocks.
As for your question, I believe what you want to do has confused you a little. It sounds like you want to grab 50000 lines and read them as a single unit; that is not the way StreamReader works. Yes, reading line by line makes the process longer, but unfortunately that's the case.
Unless the rows are exactly the same length, you can't start directly at row 50001.
What you can do, however, is when reading the first 50000 rows, remember where the last row ends. You can then seek directly to that offset and continue reading from there.
Where the row length is fixed, you do something like this:
myfile.Seek(50000 * (rowCharacters + 2), SeekOrigin.Begin);
Seek goes to a specific offset in bytes, so you just need to tell it how many bytes 50000 rows occupy. Given an ASCII encoding, that's the number of characters in the line, plus 2 for the newline sequence.
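For variable-length rows, here is a sketch of the "remember where the last row ends" idea. It assumes a single-byte encoding (e.g. ASCII) and two-byte \r\n line endings, so the byte counting would need adjusting for other files; path is a placeholder:
// Sketch: count the bytes consumed while reading the first block, then seek
// straight back to that offset for the next block. Assumes ASCII text and
// "\r\n" line endings; other encodings need Encoding.GetByteCount instead.
long offset = 0;
using (var reader = new StreamReader(path))
{
    string line;
    int rowsRead = 0;
    while (rowsRead < 50000 && (line = reader.ReadLine()) != null)
    {
        offset += line.Length + 2; // +2 bytes for "\r\n"
        rowsRead++;
        // ... process the row ...
    }
}

// Later: resume directly at row 50001 without re-reading the first block.
using (var fs = File.OpenRead(path))
{
    fs.Seek(offset, SeekOrigin.Begin);
    using (var reader = new StreamReader(fs))
    {
        string row50001 = reader.ReadLine();
    }
}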
I have many large csv files (1-10 gb each) which I'm importing into databases. For each file, I need to replace the 1st line so I can format the headers to be the column names. My current solution is:
using (var reader = new StreamReader(file))
{
    // "fixed" is a C# keyword, so the output path needs a different name
    using (var writer = new StreamWriter(fixedFile))
    {
        var line = reader.ReadLine();
        var fixedLine = parseHeaders(line);
        writer.WriteLine(fixedLine);
        while ((line = reader.ReadLine()) != null)
            writer.WriteLine(line);
    }
}
What is a quicker way to only replace line 1 without iterating through every other line of these huge files?
If you can guarantee that fixedLine is the same length (or less) as line, you can update the files in-place instead of copying them.
If not, you can possibly get a little performance improvement by accessing the .BaseStream of your StreamReader and StreamWriter and doing big block copies (using, say, a 32K byte buffer) to do the copying, which will at least eliminate the time spent checking every character to see if it's an end-of-line character as happens now with reader.ReadLine().
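A minimal sketch of the in-place idea, assuming a single-byte encoding (ASCII/UTF-8 without a BOM) and that the new header is no longer than the old one, so it can be padded with spaces to the exact same byte count (parseHeaders is the method from the question):
// Sketch: overwrite only the first line, in place. Assumes single-byte
// characters and pads the new header with spaces so no byte after it moves.
using (var fs = new FileStream(file, FileMode.Open, FileAccess.ReadWrite))
{
    // Read the original header bytes (up to, not including, '\r' or '\n').
    var headerBytes = new List<byte>();
    int b;
    while ((b = fs.ReadByte()) != -1 && b != '\r' && b != '\n')
        headerBytes.Add((byte)b);

    string oldHeader = Encoding.ASCII.GetString(headerBytes.ToArray());
    string newHeader = parseHeaders(oldHeader);

    if (newHeader.Length <= oldHeader.Length)
    {
        // Pad to the original length so the line terminator and the rest of
        // the file stay exactly where they are.
        byte[] replacement = Encoding.ASCII.GetBytes(newHeader.PadRight(oldHeader.Length));
        fs.Seek(0, SeekOrigin.Begin);
        fs.Write(replacement, 0, replacement.Length);
    }
    // Otherwise, fall back to copying the whole file as in the question.
}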
The only thing that can significantly speed it up is if you can really replace the first line in place. If the new first line is no longer than the old one, replace it carefully (with space padding if needed).
Otherwise you have to create a new file and copy the rest after the first line. You may be able to optimize the copying a bit by adjusting buffer sizes / copying as binary / pre-allocating the size, but it will not change the fact that you need to copy the whole file.
One more cheat, if you are planning to drop the CSV data into a DB anyway: if order does not matter, you can read some lines from the beginning, replace them with the new header, and append the removed lines to the end of the file.
Side note: if this is a one-time operation, I'd simply copy the files and be done with it... Debugging code that inserts data into the middle of a text file with a potentially different encoding may not be worth the effort.
var fileList = Directory.GetFiles("./", "split*.dat");
int fileCount = fileList.Length;
int i = 0;

foreach (string path in fileList)
{
    string[] contents = File.ReadAllLines(path); // OutOfMemoryException
    Array.Sort(contents);
    string newpath = path.Replace("split", "sorted");
    File.WriteAllLines(newpath, contents);
    File.Delete(path);
    contents = null;
    GC.Collect();
    SortChunksProgressChanged(this, (double)i / fileCount);
    i++;
}
And for a file that consists of ~20-30 big lines (every line ~20 MB) I get an OutOfMemoryException when I call the ReadAllLines method. Why is this exception raised? And how do I fix it?
P.S. I use Mono on macOS.
You should always be very careful about performing operations with potentially unbounded results; in your case, reading a file. As you mention, the file size and/or line length is unbounded.
The answer lies in reading 'enough' of a line to sort then skipping characters until the next line and reading the next 'enough'. You probably want to aim to create a line index lookup such that when you reach an ambiguous line sorting order you can go back to get more data from the line (Seek to file position). When you go back you only need to read the next sortable chunk to disambiguate the conflicting lines.
You may need to think about the file encoding, don't go straight to bytes unless you know it is one byte per char.
The built-in sort is not as fast as you'd like.
Side Note:
If you call GC.* you've probably done it wrong
setting contents = null does not help you
If you are using a foreach and maintaining the index then you may be better with a for(int i...) for readability
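A rough sketch of the line-index idea described above, assuming one byte per character and '\n' line endings (the prefix length and helper names are made up; ties between equal prefixes would still require seeking back into the file for more characters):
// Sketch of the "line index" approach: record each line's byte offset plus a
// short sortable prefix. Assumes one byte per char and '\n' line endings.
class LineIndexEntry : IComparable<LineIndexEntry>
{
    public long Offset;      // where the line starts in the file
    public string Prefix;    // first few characters, enough to sort most lines

    public int CompareTo(LineIndexEntry other)
    {
        return string.CompareOrdinal(Prefix, other.Prefix);
        // If prefixes are equal, seek back to Offset in the file and read more.
    }
}

static List<LineIndexEntry> BuildIndex(string path, int prefixLength)
{
    var index = new List<LineIndexEntry>();
    using (var fs = File.OpenRead(path))
    {
        long lineStart = 0, pos = 0;
        var prefix = new StringBuilder();
        int b;
        while ((b = fs.ReadByte()) != -1)
        {
            pos++;
            if (b == '\n')
            {
                index.Add(new LineIndexEntry { Offset = lineStart, Prefix = prefix.ToString() });
                prefix.Clear();
                lineStart = pos;
            }
            else if (b != '\r' && prefix.Length < prefixLength)
            {
                prefix.Append((char)b);
            }
        }
        if (pos > lineStart)
            index.Add(new LineIndexEntry { Offset = lineStart, Prefix = prefix.ToString() });
    }
    return index;
}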
Okay, let me give you a hint to help you with your home work. Loading the complete file into memory will -as you know- not work, because it is given as a precondition of the assignment. You need to find a way to lazily load the data from disk as you go and throw as much data away as soon as possible. Because single lines could be too big, you will have to do this one char at a time.
Try creating a class that represents an abstraction over a line, for instance by wrapping the starting index and ending index of that line. When you let this class implement IComparable<T> it allows you to sort that line with other lines. Again, the trick is to be able to read characters from the file one at a time. You will need to work with Streams (File.Open) directly.
When you do this, you will be able to write your application code like this:
List<FileLine> lines = GetLines("fileToSort.dat");
lines.Sort();

foreach (var line in lines)
{
    line.AppendToFile("sortedFile.dat");
}
Your task will be to implement GetLines(string path) and create the FileLine class.
Note that I assume that the actual number of lines will be small enough that the List<FileLine> will fit into memory (which means an approximate maximum of 40,000,000 lines). If the amount of lines can be higher, you would even need a more flexible approach, but since you are talking about 20 to 30 lines, this should not be a problem.
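A skeleton of what the FileLine class might look like (the member names are assumptions, and the method bodies are deliberately left unimplemented, since that is the homework part):
// Skeleton only: a FileLine knows where its line lives in the file and can be
// compared to another FileLine by reading characters from the file on demand.
class FileLine : IComparable<FileLine>
{
    private readonly string path;
    public long StartIndex { get; private set; }   // byte offset where the line begins
    public long EndIndex { get; private set; }     // byte offset just past the line

    public FileLine(string path, long startIndex, long endIndex)
    {
        this.path = path;
        StartIndex = startIndex;
        EndIndex = endIndex;
    }

    public int CompareTo(FileLine other)
    {
        // Open the file (File.Open), seek to both lines, and compare them one
        // character at a time so no whole line ever has to fit in memory.
        throw new NotImplementedException();
    }

    public void AppendToFile(string targetPath)
    {
        // Stream this line's bytes from StartIndex..EndIndex into targetPath.
        throw new NotImplementedException();
    }
}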
Basically your approach is flawed. You are violating a constraint of the homework you were given, and that constraint has been put there to make you think harder.
As you said:
I must implement external sort and show my teacher that it works for files bigger than my RAM
OK, so how do you think you will ever read the whole file in? ;) This constraint is there on purpose. ReadAllLines does NOT implement an incremental external sort. As a result, it blows up.