How can I efficiently index a file?

I am dealing with an application that needs to randomly read an entire line of text from a series of potentially large text files (~3+ GB).
The lines can be of a different length.
To reduce GC pressure and avoid creating unnecessary strings, I am using the solution provided at "Is there a better way to determine the number of lines in a large txt file(1-2 GB)?" to detect each new line and store it in a map in one pass, producing an index of lineNo => position, i.e.:
// maps each line to its corresponding fileStream.Position in the file
List<int> _lineNumberToFileStreamPositionMapping = new List<int>();
Go through the entire file.
When a new line is detected, increment lineCount and add fileStream.Position to _lineNumberToFileStreamPositionMapping.
We then use an API similar to:
public void ReadLine(int lineNumber)
{
    var getStreamPosition = _lineNumberToFileStreamPositionMapping[lineNumber];
    //... set the stream position, read the byte array, convert to string etc.
}
This solution currently performs well; however, there are two things I do not like:
Since I do not know the total number of lines in the file, I cannot preallocate an array, so I have to use a List<int>, which can end up resized to double what I actually need.
Memory usage: for example, for a text file of ~1 GB with ~5 million lines of text, the index occupies ~150 MB. I would really like to decrease this as much as possible.
Any ideas are very much appreciated.

Use List.Capacity to manually increase the capacity, perhaps every 1000 lines or so.
If you want to trade performance for memory, you can do this: instead of storing the positions of every line, store only the positions of every 100th (or something) line. Then when, say, line 253 is required, go to the position of line 200 and count forward 53 lines.
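The "every 100th line" idea could look roughly like the sketch below. The class name SparseLineIndex, the stride parameter, and the method names are made up for illustration. It assumes lines end in \n or \r\n, that the text is UTF-8 or ASCII, and that line numbers are zero-based; it stores long offsets so that files over 2 GB do not overflow.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper: remembers the byte offset of every `stride`-th line only.
public sealed class SparseLineIndex
{
    private readonly List<long> _checkpoints = new List<long>();
    private readonly string _path;
    private readonly int _stride;

    public SparseLineIndex(string path, int stride = 100)
    {
        _path = path;
        _stride = stride;
        Build();
    }

    // One pass over the raw bytes; no strings are created while indexing.
    private void Build()
    {
        using (var fs = File.OpenRead(_path))
        {
            var buffer = new byte[1 << 16];
            long position = 0;    // offset of the byte being examined
            long lineNumber = 0;  // zero-based number of the line that starts next
            _checkpoints.Add(0);  // line 0 starts at offset 0
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < read; i++, position++)
                {
                    if (buffer[i] != (byte)'\n') continue;
                    lineNumber++;
                    if (lineNumber % _stride == 0)
                        _checkpoints.Add(position + 1); // the next line starts after the '\n'
                }
            }
        }
    }

    // Seeks to the nearest earlier checkpoint, then reads forward to the requested line.
    public string ReadLine(long lineNumber)
    {
        using (var fs = File.OpenRead(_path))
        {
            fs.Position = _checkpoints[(int)(lineNumber / _stride)];
            using (var reader = new StreamReader(fs, Encoding.UTF8))
            {
                string line = null;
                for (long i = lineNumber % _stride; i >= 0; i--)
                    line = reader.ReadLine();
                return line; // null if the line does not exist
            }
        }
    }
}
```

With a stride of 100, the index for 5 million lines holds about 50,000 offsets instead of 5 million, at the cost of reading up to 99 extra lines per lookup.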

Related

Call Length Property on Returned Array in Chained String/LINQ Methods of C#

I found this post on selecting a range from an array, and have to use the LINQ option:
Selecting a range of items inside an array in C#
Ultimately, I'm trying to get the last four lines from some text file. After, I've read in and cleaned the lines for unwanted characters and empty lines, I have an array with all of the lines. I'm using the following to do so:
string[] allLines = GetEachLine(results);
string[] lastFourLines = allLines.Skip(allLines.Length - 4).Take(4).ToArray();
This works fine, but I'm wondering if I could somehow skip assigning to the allLines variable altogether. Such as:
string[] lastFourLines = GetEachLine(results).Skip(returnedArrayLength - 4).Take(4).ToArray();

It would be better to change GetEachLine and code preceding it (however results is computed) to use IEnumerable<T> and avoid using an array to read the entire file in memory for the last four lines (unless you use all of results for something else) - consider using File.ReadLines.
However, if you are using .NET Core 2.0 or greater, you can use Enumerable.TakeLast to efficiently return the last four lines:
var lastFourLines = GetEachLine(results).TakeLast(4);

If GetEachLine() returns string[] then that should work fine, though null checking may be needed.
As you chain more calls, you may want to use line breaks to increase readability:
string[] lastFourLines = GetEachLine(results)
    .Skip(allLines.Length - 4)
    .Take(4)
    .ToArray();
Note that allLines.Length won't exist unless you still have the first line of code from your question (the allLines assignment). You can avoid calling GetEachLine() twice by using TakeLast():
string[] lastFourLines = GetEachLine(results)
    .TakeLast(4)
    .ToArray();

If you are looking to efficiently retrieve the last N (filtered) lines of a large file, you really need to start at the point where you are reading the file contents.
Consider a 1GB log file containing 10M records, where you only want the last few lines. Ideally, you would want to start by reading the last couple KB and then start extracting lines by searching for line breaks from the end, extracting each line and returning them in an iterator yield. If you run out of data, read the preceding block. Continue only as long as the consumer requests more values from the iterator.
Offhand, I don't know a built-in way to do this, and coding this from scratch could get pretty involved. Luckily, a search turned up this similar question having a highly rated answer.
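For what it's worth, a minimal sketch of the block-by-block backward read described above might look like this. It is not the code from the linked answer; ReverseLineReader and ReadLinesBackward are invented names, and it assumes UTF-8 or ASCII text without a byte order mark and \n or \r\n line endings.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class ReverseLineReader
{
    // Lazily yields the lines of a file from last to first, reading blocks from the end.
    public static IEnumerable<string> ReadLinesBackward(string path, int blockSize = 4096)
    {
        using (var fs = File.OpenRead(path))
        {
            var pending = new List<byte>(); // bytes of the current line, in reverse order
            var block = new byte[blockSize];
            long position = fs.Length;
            bool atFileEnd = true; // true until the last byte of the file has been examined

            while (position > 0)
            {
                int count = (int)Math.Min(blockSize, position);
                position -= count;
                fs.Position = position;

                int read = 0;
                while (read < count)
                {
                    int n = fs.Read(block, read, count - read);
                    if (n == 0) break;
                    read += n;
                }

                for (int i = read - 1; i >= 0; i--)
                {
                    if (block[i] == (byte)'\n')
                    {
                        // A '\n' as the very last byte only terminates the final line.
                        if (!atFileEnd) yield return Decode(pending);
                        pending.Clear();
                    }
                    else
                    {
                        pending.Add(block[i]);
                    }
                    atFileEnd = false;
                }
            }

            // Whatever is left is the first line of the file.
            if (!atFileEnd) yield return Decode(pending);
        }
    }

    private static string Decode(List<byte> reversed)
    {
        byte[] bytes = reversed.ToArray();
        Array.Reverse(bytes);
        int length = bytes.Length;
        if (length > 0 && bytes[length - 1] == (byte)'\r') length--; // strip the CR of a CRLF
        return Encoding.UTF8.GetString(bytes, 0, length);
    }
}
```

Because the iterator is lazy, something like ReverseLineReader.ReadLinesBackward(path).Where(filter).Take(4) (with System.Linq, and filter being your own predicate) stops reading as soon as four matching lines have been found; the result comes back last line first.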

Reading Range of Lines from a File

Can anyone let me know the fastest way of showing a range of lines in a file of 5 GB in size? For example: the file is 5 GB and has the line number as one of its columns. Say the file has 1 million lines, and I have a start line number and an end line number. If I want to read the 25th to the 89th line of the file, is there a fast way in C# of reading just those lines without reading the whole file from the beginning?

In short, no. How can you possibly know where the carriage returns/line numbers are before you actually read them?
To avoid memory issues you could:
File.ReadLines(path)
    .SkipWhile(line=>someCondition)
    .TakeWhile(line=>someOtherCondition)
5GB is a huge amount of data to sift through without building some sort of index. I think you've stumbled upon a case where loading your data into a database and adding the appropriate indexes might serve you best.

How to read text file after specified line in c#

I have a huge text file which I need to read. Currently I am reading the text file like this:
string[] lines = File.ReadAllLines(FileToCopy);
But here all the lines are stored in the lines array and only afterwards processed programmatically according to a condition. That is not efficient, because it first reads the irrelevant lines of the text file into the array too, and then has to process them as well.
So my question is: can I specify the line number to start reading from? Suppose last time it read 10001 lines; next time it should start from 10002.
How can I achieve this?

Well you don't have to store all those lines - but you definitely have to read them. Unless the lines are of a fixed length (in bytes, not characters) how would you expect to be able to skip to a particular part of the file?
To store only the lines you want in memory though, use:
List<string> lines = File.ReadLines(FileToCopy).Skip(linesToSkip).ToList();
Note that File.ReadLines() was introduced in .NET 4, and reads the lines on-demand with an iterator instead of reading the entire file into memory.
If you only want to process a certain number of lines, you can use Take as well:
List<string> lines = File.ReadLines(FileToCopy)
    .Skip(linesToSkip)
    .Take(linesToRead)
    .ToList();
So for example, linesToSkip=10000 and linesToRead=1000 would give you lines 10001-11000.

Ignore the lines, they're useless - if every line isn't the same length, you're going to have to read them one by one again, that's a huge waste.
Instead, use the position of the file stream. This way, you can skip right there on the second attempt, no need to read the data all over again. After that, you'll just use ReadLine in a loop until you get to the end, and mark the new end position.
Please, don't use ReadLines().Skip(). If you have a 10 GB file, it will read all the 10 GBs, create the appropriate strings, throw them away, and then, finally, read the 100 bytes you want to read. That's just crazy :) Of course, it's better than using File.ReadAllLines, but only because that doesn't need to keep the whole file in memory at once. Other than that, you're still reading every single byte of the file (you have to find out where the lines end).
Sample code of a method to read from last known location:
string[] ReadAllLinesFromBookmark(string fileName, ref long lastPosition)
{
    using (var fs = File.OpenRead(fileName))
    {
        // Jump straight to where the previous call stopped.
        fs.Position = lastPosition;
        using (var sr = new StreamReader(fs))
        {
            string line = null;
            List<string> lines = new List<string>();
            while ((line = sr.ReadLine()) != null)
            {
                lines.Add(line);
            }
            // The reader has consumed everything, so this is the end of the file.
            lastPosition = fs.Position;
            return lines.ToArray();
        }
    }
}

Well, you do have line numbers, in the form of the array index. Keep a note of the index of the last line you read, and start reading from the next array index.

Use the FileStream.Position property to get the current position in the file, and set it later to resume from that position.

data structure for indexing big file

I need to build an index for a very big (50GB+) ASCII text file which will enable me to provide fast random read access to file (get nth line, get nth word in nth line). I've decided to use List<List<long>> map, where map[i][j] element is position of jth word of ith line in the file.
I will build the index sequentially, i.e. read the whole file and populating index with map.Add(new List<long>()) (new line) and map[i].Add(position) (new word). I will then retrieve specific word position with map[i][j].
The only problem I see is that I can't predict total count of lines/words, so I will bump into O(n) on every List reallocation, no idea of how I can avoid this.
Are there any other problems with the data structure I chose for the task? Which structure could be better?
UPD: File will not be altered during the runtime. There are no other ways to retrieve content except what I've listed.

Increasing the size of a large list is a very expensive operation, so it's better to reserve the list size at the beginning.
I'd suggest using two lists. The first contains the positions of words within the file, and the second contains indexes into the first list (the index of the first word of the corresponding line).
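A rough sketch of that two-list layout follows. WordIndex and its members are illustrative names, not an existing API. Note that List<T> cannot hold more than about two billion elements, so for a 50 GB file this only shows the shape of the structure; a real index of that size would need chunked or file-backed storage.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical flat replacement for List<List<long>>.
public sealed class WordIndex
{
    // File position of every word, in file order.
    private readonly List<long> _wordPositions = new List<long>();
    // For each line, the index in _wordPositions of its first word.
    private readonly List<int> _lineStarts = new List<int>();

    public int LineCount => _lineStarts.Count;

    // Call once at the start of every line, then AddWord for each word on it.
    public void BeginLine() => _lineStarts.Add(_wordPositions.Count);

    public void AddWord(long position) => _wordPositions.Add(position);

    public int WordCount(int line)
    {
        int start = _lineStarts[line];
        int end = line + 1 < _lineStarts.Count ? _lineStarts[line + 1] : _wordPositions.Count;
        return end - start;
    }

    // Equivalent of map[line][word] in the nested-list version.
    public long GetWordPosition(int line, int word)
    {
        if (word < 0 || word >= WordCount(line))
            throw new ArgumentOutOfRangeException(nameof(word));
        return _wordPositions[_lineStarts[line] + word];
    }
}
```

Compared with one List<long> per line, this avoids a separate object header and backing array for every line, which matters when there are many short lines.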
You are very likely to exceed all available RAM. And when the system starts to page GC-managed RAM in and out, the performance of the program will be completely killed. I'd suggest storing your data in a memory-mapped file rather than in managed memory. http://msdn.microsoft.com/en-us/library/dd997372.aspx
UPD: memory-mapped files are effective when you need to work with huge amounts of data that do not fit in RAM. Basically, it's your only choice if your index becomes bigger than the available RAM.
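As an illustration only, an index of long offsets kept in a memory-mapped file could be sketched like this. MappedOffsetIndex is a made-up wrapper; it assumes the number of entries is known (or over-estimated) before the index file is created, for example from a counting pass, and that entryCount is greater than zero.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

// Hypothetical fixed-size array of long offsets backed by a file instead of the managed heap.
public sealed class MappedOffsetIndex : IDisposable
{
    private readonly MemoryMappedFile _file;
    private readonly MemoryMappedViewAccessor _view;

    public MappedOffsetIndex(string indexPath, long entryCount)
    {
        // The capacity is given in bytes; the backing file is grown to this size if needed.
        _file = MemoryMappedFile.CreateFromFile(
            indexPath, FileMode.OpenOrCreate, null, entryCount * sizeof(long));
        _view = _file.CreateViewAccessor();
    }

    public void Set(long entry, long offset) => _view.Write(entry * sizeof(long), offset);

    public long Get(long entry) => _view.ReadInt64(entry * sizeof(long));

    public void Dispose()
    {
        _view.Dispose();
        _file.Dispose();
    }
}
```

The operating system then decides which parts of the index stay in physical memory, and the garbage collector never has to scan or move it.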

Big strings: System.OutOfMemoryException

var fileList = Directory.GetFiles("./", "split*.dat");
int fileCount = fileList.Length;
int i = 0;
foreach (string path in fileList)
{
    string[] contents = File.ReadAllLines(path); // OutOfMemoryException
    Array.Sort(contents);
    string newpath = path.Replace("split", "sorted");
    File.WriteAllLines(newpath, contents);
    File.Delete(path);
    contents = null;
    GC.Collect();
    SortChunksProgressChanged(this, (double)i / fileCount);
    i++;
}
For a file that consists of ~20-30 big lines (every line ~20 MB), I get an OutOfMemoryException when I call the ReadAllLines method. Why is this exception raised? And how do I fix it?
P.S. I use Mono on MacOS

You should always be very careful about performing operations with potentially unbounded results, in your case reading a file. As you mention, the file size and/or line length is unbounded.
The answer lies in reading 'enough' of a line to sort then skipping characters until the next line and reading the next 'enough'. You probably want to aim to create a line index lookup such that when you reach an ambiguous line sorting order you can go back to get more data from the line (Seek to file position). When you go back you only need to read the next sortable chunk to disambiguate the conflicting lines.
You may need to think about the file encoding, don't go straight to bytes unless you know it is one byte per char.
The built in sort is not as fast as you'd like.
Side notes:
If you call GC.* you've probably done it wrong.
Setting contents = null does not help you.
If you are using a foreach and maintaining the index yourself, then you may be better off with a for(int i...) for readability.

Okay, let me give you a hint to help you with your homework. Loading the complete file into memory will, as you know, not work, because that is given as a precondition of the assignment. You need to find a way to lazily load the data from disk as you go and throw away as much data as soon as possible. Because single lines could be too big, you will have to do this one char at a time.
Try creating a class that represents an abstraction over a line, for instance by wrapping the starting index and ending index of that line. When you let this class implement IComparable<T> it allows you to sort that line with other lines. Again, the trick is to be able to read characters from the file one at a time. You will need to work with Streams (File.Open) directly.
When you do this, you will be able to write your application code like this:
List<FileLine> lines = GetLines("fileToSort.dat");
lines.Sort();
foreach (var line in lines)
{
line.AppendToFile("sortedFile.dat");
}
Your task will be to implement GetLines(string path) and create the FileLine class.
Note that I assume that the actual number of lines will be small enough that the List<FileLine> will fit into memory (which means an approximate maximum of 40,000,000 lines). If the amount of lines can be higher, you would even need a more flexible approach, but since you are talking about 20 to 30 lines, this should not be a problem.
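To make the shape of that abstraction concrete, here is one possible minimal sketch, certainly not the only way to do it. It compares lines as raw bytes (ordinal order), assumes \n line endings, and exposes GetLines as a static method on FileLine rather than a free function; all of these are assumptions on my part, not part of the assignment.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Represents a line by its location in the file; its content is never loaded as a string.
public sealed class FileLine : IComparable<FileLine>
{
    private readonly string _path;
    private readonly long _start;
    private readonly long _length; // excluding the terminating '\n'

    public FileLine(string path, long start, long length)
    {
        _path = path;
        _start = start;
        _length = length;
    }

    // Compares two lines one byte at a time, stopping at the first difference.
    public int CompareTo(FileLine other)
    {
        if (other == null) return 1;
        using (var a = OpenAt(_path, _start))
        using (var b = OpenAt(other._path, other._start))
        {
            long common = Math.Min(_length, other._length);
            for (long i = 0; i < common; i++)
            {
                int x = a.ReadByte(), y = b.ReadByte();
                if (x != y) return x.CompareTo(y);
            }
        }
        return _length.CompareTo(other._length); // the shorter line sorts first
    }

    // Copies the line to the end of the target file in small chunks.
    public void AppendToFile(string targetPath)
    {
        using (var source = OpenAt(_path, _start))
        using (var target = new FileStream(targetPath, FileMode.Append, FileAccess.Write))
        {
            var buffer = new byte[8192];
            long remaining = _length;
            while (remaining > 0)
            {
                int read = source.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
                if (read == 0) break;
                target.Write(buffer, 0, read);
                remaining -= read;
            }
            target.WriteByte((byte)'\n');
        }
    }

    // Scans the file once and records where each line starts and how long it is.
    public static List<FileLine> GetLines(string path)
    {
        var lines = new List<FileLine>();
        using (var fs = File.OpenRead(path))
        {
            long start = 0, position = 0;
            int b;
            while ((b = fs.ReadByte()) != -1)
            {
                position++;
                if (b != '\n') continue;
                lines.Add(new FileLine(path, start, position - 1 - start));
                start = position;
            }
            if (position > start) lines.Add(new FileLine(path, start, position - start));
        }
        return lines;
    }

    private static FileStream OpenAt(string path, long position)
    {
        var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read);
        fs.Position = position;
        return fs;
    }
}
```

Opening two streams per comparison is slow, but with only 20 to 30 lines the number of comparisons is tiny; for many lines you would want to reuse streams or cache a short prefix of each line.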

Basically, your approach is bull. You are violating a constraint of the homework you were given, and this constraint was put there to make you think more.
As you said:
I must implement external sort and show my teacher that it works for files bigger than my RAM
OK, so how do you think you will ever read the file in? ;) This is there on purpose. ReadAllLines does NOT implement an incremental external sort. As a result, it blows up.
