I got the problem of reading single line form large file encoded in UTF-8. Lines in the file has the constant length.
The file in average has 300k lines. The time is the main constraint, so I want to do it the fastest way possible.
I've tried LinQ
File.ReadLines("file.txt").Skip(noOfLines).Take(1).First();
But the time is not satisfactory enough.
My biggest hope was using the stream, and setting its possition to the desired line start, but the problem is that lines sizes in bytes differ.
Any ideas, how to do it?
Now this is where you don't want to use linq (-:
You actually want to find a nth occurrence of a new line in the file and read something till the next new line.
You probably want to check out this documentation on memory mapped files as well:
https://msdn.microsoft.com/en-us/library/system.io.memorymappedfiles.memorymappedfile(v=vs.110).aspx
There is also a post comparing different access methods
http://cc.davelozinski.com/c-sharp/fastest-way-to-read-text-files
Related
Can any one let me know fastest way of showing Range of Lines in a files of 5 GB size. For Example: If the File is having a Size of 5GB and it has line numbers has one of the column in the file. Say if the number of lines in a file are 1 million, I have Start Index Line # and End Index Line #. Say i want to read 25th Line to 89 th line of a large file, rather than reading each and every line, is there any fastest way of reading specific lines from 25th to 89th without reading whole file from begining in C#
In short, no. How can you possibly know where the carriage returns/line numbers are before you actually read them?
To avoid memory issues you could:
File.ReadLines(path)
.SkipWhile(line=>someCondition)
.TakeWhile(line=>someOtherCondition)
5GB is a huge amount of data to sift through without building some sort of index. I think you've stumbled upon a case where loading your data into a database and adding the appropriate indexes might serve you best.
I'm trying to write a simple .txt via StreamWriter. I want it to look like this:
12
26
100
So simple numbers. But how do I tell the Reader/writer in which line to write or read.
http://msdn.microsoft.com/en-us/library/system.io.streamreader.aspx
Here it says that ReadLine() reads a line of current the Stream. But how do I know which line it is. Or is it always the first one?
I want to read the numbers, modify them and then write them back.
Any suggestions?
Thanks in advance!
A reader is conceptually a unidirectional thing, from the start (or more accurately, the current position in the stream) to the end.
Each time you read a line, it is simply buffering data until it finds a new line; it doesn't really have the concept of a current line (or moving between lines).
As long as the file isn't massive, you should be OK reading the entire file, working on a string (or string array), then saving the entire file; inserting/removing text content is otherwise non-trivial (especially when you consider the mysteries of encodings).
File.ReadAllLines and File.WriteAllLines may be easier in your scenario.
I think the only way is to read all lines in array (or any other data structude, i.e. list), modify and write it back to file.
Maybe xml will be better for your purposes?
I have a file data.txt. data.txt contains text line by line as:
one
two
three
six
Here I need to write data in file as:
one
two
three
four
five
six
I dont know how to write file like this!!
Generally, you have to re-write the file when inserting - because text files have variable length rows.
There are optimizations you could employ: like extending a file and buffering and writing, but you may have to buffer an arbitrary amount - i.e. inserting a row at the top.
If we knew more about your complete scenario, we would be more able to help usefully.
Loop through your text file and put lines as array. Modify the array and save it back to file. But it's not a good idea if you have some other text file, for this particular example it can work no problem.
var fileList = Directory.GetFiles("./", "split*.dat");
int fileCount = fileList.Length;
int i = 0;
foreach (string path in fileList)
{
string[] contents = File.ReadAllLines(path); // OutOfMemoryException
Array.Sort(contents);
string newpath = path.Replace("split", "sorted");
File.WriteAllLines(newpath, contents);
File.Delete(path);
contents = null;
GC.Collect();
SortChunksProgressChanged(this, (double)i / fileCount);
i++;
}
And for file that consists ~20-30 big lines(every line ~20mb) I have OutOfMemoryException when I perform ReadAllLines method. Why does this exception raise? And how do I fix it?
P.S. I use Mono on MacOS
You should always be very careful about performing operations with potentially unbounded results. In your case reading a file. As you mention, the file size and or line length is unbounded.
The answer lies in reading 'enough' of a line to sort then skipping characters until the next line and reading the next 'enough'. You probably want to aim to create a line index lookup such that when you reach an ambiguous line sorting order you can go back to get more data from the line (Seek to file position). When you go back you only need to read the next sortable chunk to disambiguate the conflicting lines.
You may need to think about the file encoding, don't go straight to bytes unless you know it is one byte per char.
The built in sort is not as fast as you'd like.
Side Note:
If you call GC.* you've probably done it wrong
setting contents = null does not help you
If you are using a foreach and maintaining the index then you may be better with a for(int i...) for readability
Okay, let me give you a hint to help you with your home work. Loading the complete file into memory will -as you know- not work, because it is given as a precondition of the assignment. You need to find a way to lazily load the data from disk as you go and throw as much data away as soon as possible. Because single lines could be too big, you will have to do this one char at a time.
Try creating a class that represents an abstraction over a line, for instance by wrapping the starting index and ending index of that line. When you let this class implement IComparable<T> it allows you to sort that line with other lines. Again, the trick is to be able to read characters from the file one at a time. You will need to work with Streams (File.Open) directly.
When you do this, you will be able to write your application code like this:
List<FileLine> lines = GetLines("fileToSort.dat");
lines.Sort();
foreach (var line in lines)
{
line.AppendToFile("sortedFile.dat");
}
Your task will be to implement GetLines(string path) and create the FileLine class.
Note that I assume that the actual number of lines will be small enough that the List<FileLine> will fit into memory (which means an approximate maximum of 40,000,000 lines). If the amount of lines can be higher, you would even need a more flexible approach, but since you are talking about 20 to 30 lines, this should not be a problem.
Basically you rapproach is bull. You are violatin a constraint of the homework you are given, and this constraint has been put there to make you think more.
As you said:
I must implement external sort and show my teacher that it works for files bigger than my
RAM
Ok, so how you think you will ever read the file in ;) this is there on purpose. ReadAllLiens does NOT implement incremental external sort. As a result, it blows.
I have a text file that has the following format:
1234
ABC123 1000 2000
The first integer value is a weight and the next line has three values, a product code, weight and cost, and this line can be repeated any number of times. There is a space in between each value.
I have been able to read in the text file, store the first value on the first line into a variable, and then the subsequent lines into an array and then into a list, using first readline.split('').
To me this seems an inefficient way of doing it, and I have been trying to find a way where I can read from the second line where the product codes, weights and costs are listed down into a list without the need of using an array. My list control contains an object where I am only storing the weight and cost, not the product code.
Does anyone know how to read in a text file, take in some values from the file straight into a list control?
Thanks
What you do is correct. There is no generalized way of doing it, since what you did is that you descirbed the algorithm for it, that has to be coded or parametrized somehow.
Since your text file isn't as structured as a CSV file, this kind of manual parsing is probably your best bet.
C# doesn't have a Scanner class like Java, so what you wan't doesn't exist in the BCL, though you could write your own.
The other answers are correct - there's no generalized solution for this.
If you've got a relatively small file, you can use File.ReadAllLines(), which will at least get rid of a lot cruft code, since it'll immediately convert it to a string array for you.
If you don't want to parse strings from the file and to reserve an additional memory for holding split strings you can use a binary format to store your information in the file. Then you can use the class BinaryReader with methods like ReadInt32(), ReadDouble() and others. It is more efficient than read by characters.
But one thing: binary format is bad readable by humans. It will be difficult to edit the file in the editor. But programmatically - without any problems.