I need to build an index for a very big (50GB+) ASCII text file which will enable me to provide fast random read access to the file (get nth line, get nth word in nth line). I've decided to use a List<List<long>> map, where element map[i][j] is the position of the jth word of the ith line in the file.
I will build the index sequentially, i.e. read the whole file, populating the index with map.Add(new List<long>()) (new line) and map[i].Add(position) (new word). I will then retrieve a specific word position with map[i][j].
The only problem I see is that I can't predict the total number of lines/words, so every List reallocation will cost O(n), and I have no idea how to avoid this.
Are there any other problems with the data structure I chose for the task? Which structure could be better?
UPD: The file will not be altered at runtime. There are no other ways to retrieve content besides those I've listed.
Growing a large list is a very expensive operation, so it's better to reserve the list's capacity up front.
I'd suggest using two lists. The first contains the offsets of words within the file, and the second contains indexes into the first list (the index of the first word of each line).
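A minimal sketch of that two-list layout (the names here are illustrative, not from the question):

using System.Collections.Generic;

List<long> wordPositions = new List<long>();   // byte offset of every word, in file order
List<int> lineStarts = new List<int>();        // for each line, the index in wordPositions of its first word

// Position of the j-th word of the i-th line:
long GetWordPosition(int i, int j)
{
    return wordPositions[lineStarts[i] + j];
}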
You are very likely to exceed all available RAM. And once the system starts paging GC-managed memory in and out, the program's performance will be completely killed. I'd suggest storing your data in a memory-mapped file rather than in managed memory: http://msdn.microsoft.com/en-us/library/dd997372.aspx
UPD: Memory-mapped files are effective when you need to work with huge amounts of data that don't fit in RAM. Basically, it's your only choice if the index grows bigger than available RAM.
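A minimal sketch of keeping the word offsets in a memory-mapped file; the file name, capacity, and variable names are assumptions for illustration:

using System;
using System.IO;
using System.IO.MemoryMappedFiles;

// Reserve room for up to 100 million word offsets (8 bytes each).
const long maxWords = 100_000_000;
using var mmf = MemoryMappedFile.CreateFromFile(
    "word-index.bin", FileMode.Create, null, maxWords * sizeof(long));
using var accessor = mmf.CreateViewAccessor();

// While scanning the source file: store the byte offset of word number n.
long n = 42;
long offsetInSourceFile = 123_456_789;
accessor.Write(n * sizeof(long), offsetInSourceFile);

// Later, random access is a single read; the OS pages the index in and out as needed.
long position = accessor.ReadInt64(n * sizeof(long));
Console.WriteLine(position);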
Related
I need to work with a large 2-dimensional array of doubles, with more than 100 million cells. The matrix first needs to be filled and then manipulated by taking either one row or one column. The matrix can be bigger than 1 terabyte in size and will not fit in memory.
How can the array be stored efficiently? The main operations are quickly saving it from memory row by row (double[100k] each) and quickly reading one row or one column back into memory.
You could use memory-mapped files. You are essentially still working with an array, but you let the kernel choose which parts to load into memory. You could also possibly use fixed-size buffers to read whole sections of the memory-mapped file.
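A rough sketch of that idea, assuming a row-major layout of doubles; the file name and dimensions are made up (and scaled down from the 1 TB case):

using System.IO;
using System.IO.MemoryMappedFiles;

const long rows = 100_000, cols = 100_000;   // 100k x 100k doubles, ~80 GB on disk
using var mmf = MemoryMappedFile.CreateFromFile(
    "matrix.bin", FileMode.OpenOrCreate, null, rows * cols * sizeof(double));
using var view = mmf.CreateViewAccessor();

// Write a single cell (row r, column c) in row-major order.
long r = 123, c = 456;
view.Write((r * cols + c) * sizeof(double), 3.14);

// Read one whole row back into memory; the kernel pages in only the touched region.
var row = new double[cols];
view.ReadArray(r * cols * sizeof(double), row, 0, row.Length);

With this layout a column read touches one page per row, so if column access dominates, a column-major or tiled layout would be worth considering.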
I am dealing with an application that needs to randomly read an entire line of text from a series of potentially large text files (~3+ GB).
The lines can be of different lengths.
In order to reduce GC pressure and avoid creating unnecessary strings, I am using the solution provided at Is there a better way to determine the number of lines in a large txt file (1-2 GB)? to detect each new line and store its position in a map in one pass, thereby producing an index of lineNo => position, i.e.:
// maps each line to its corresponding fileStream.Position in the file
List<int> _lineNumberToFileStreamPositionMapping = new List<int>();
Go through the entire file.
When a new line is detected, increment lineCount and add the fileStream.Position to _lineNumberToFileStreamPositionMapping.
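A minimal sketch of that indexing pass (the file name is hypothetical; offsets are stored as long here because positions in a 3+ GB file overflow int):

using System.Collections.Generic;
using System.IO;

var lineNumberToFileStreamPositionMapping = new List<long>();
using (var fileStream = new FileStream("big.txt", FileMode.Open, FileAccess.Read))
{
    lineNumberToFileStreamPositionMapping.Add(0);   // line 0 starts at offset 0
    int b;
    // Simplified byte-at-a-time scan; a real pass would read in large buffered blocks.
    while ((b = fileStream.ReadByte()) != -1)
    {
        if (b == '\n')
            lineNumberToFileStreamPositionMapping.Add(fileStream.Position);
    }
}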
We then use an API similar to:
public void ReadLine(int lineNumber)
{
var getStreamPosition = _lineNumberToFileStreamPositionMapping[lineNumber];
//... set the stream position, read the byte array, convert to string etc.
}
This solution currently provides good performance; however, there are two things I do not like:
Since I do not know the total number of lines in the file, I cannot preallocate an array; therefore I have to use a List<int>, which can inefficiently resize to double the capacity I actually need;
Memory usage: as an example, for a ~1 GB text file with ~5 million lines, the index occupies ~150 MB. I would really like to decrease this as much as possible.
Any ideas are very much appreciated.
Use List.Capacity to manually increase the capacity, perhaps every 1000 lines or so.
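A small sketch of that suggestion, growing the capacity in fixed chunks instead of letting the list double; the chunk size and loop are only illustrative:

using System.Collections.Generic;

var positions = new List<long>();
const int chunk = 1000;

for (long pos = 0; pos < 1_000_000; pos += 37)   // stand-in for the real indexing loop
{
    if (positions.Count == positions.Capacity)
        positions.Capacity += chunk;             // grow in fixed steps rather than doubling
    positions.Add(pos);
}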
If you want to trade performance for memory, you can do this: instead of storing the positions of every line, store only the positions of every 100th (or something) line. Then when, say, line 253 is required, go to the position of line 200 and count forward 53 lines.
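A minimal sketch of that trade-off, assuming the index holds the byte offset of every 100th line and the file is plain ASCII/UTF-8 (names are made up):

using System.Collections.Generic;
using System.IO;

// checkpointPositions[k] holds the byte offset of line k*100, recorded during the single indexing pass.
string ReadLineSparse(List<long> checkpointPositions, string path, int lineNumber)
{
    using var fs = new FileStream(path, FileMode.Open, FileAccess.Read);
    fs.Position = checkpointPositions[lineNumber / 100];    // jump to the nearest checkpoint

    using var reader = new StreamReader(fs);
    for (int i = 0; i < lineNumber % 100; i++)
        reader.ReadLine();                                   // count forward the remaining lines

    return reader.ReadLine();
}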
Can anyone let me know the fastest way of showing a range of lines from a 5 GB file? For example: the file is 5 GB in size and has the line number as one of its columns. Say the file has 1 million lines, and I have a start line number and an end line number. If I want to read the 25th through the 89th line of this large file, is there any fast way in C# of reading those specific lines without reading the whole file from the beginning, line by line?
In short, no. How can you possibly know where the carriage returns/line numbers are before you actually read them?
To avoid memory issues you could:
File.ReadLines(path)
.SkipWhile(line=>someCondition)
.TakeWhile(line=>someOtherCondition)
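For the concrete 25th-to-89th-line case this becomes Skip/Take; it still streams over the first 24 lines, but never loads the whole file (the path is hypothetical):

using System;
using System.IO;
using System.Linq;

string path = "huge.txt";

// Lines 25..89 (1-based): skip the first 24 lines, then take 65.
foreach (var line in File.ReadLines(path).Skip(24).Take(89 - 25 + 1))
    Console.WriteLine(line);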
5GB is a huge amount of data to sift through without building some sort of index. I think you've stumbled upon a case where loading your data into a database and adding the appropriate indexes might serve you best.
I'm reading a 500 MB+ log file, and I was wondering which of the following approaches would be faster.
Currently I'm using a ListView in virtual mode. I want to be able to scroll through the ~1 million entries, and it shouldn't be slow thanks to the virtual mode (hopefully). It loads successfully; however, scrolling is a bit laggy.
The question is about the virtual-mode retrieve-item handler. The two approaches are:
(1) Store every log entry in a list, and have the retrieve-item call display the requested entry straight from the list, e.g. list[e.ItemIndex].
(2) Store only the position in the log file where each entry begins in a list, then call a read function that reads from that position up to the entry delimiter (yielding one log entry). For example, the first entry would read from 0 to the delimiter, the second might read 16-43, the third 43-60 (the log entries all vary in size).
There are pros and cons of both, but I am curious to see what others think in terms of speed.
On one hand, (1) reads all the data of the ~1 million entries into a list and then serves them from memory; virtual mode helps by displaying only the items that are viewable at the time (approx. 10). However, the overhead is that all the data is in memory.
With (2), no actual log entries are stored in memory; however, it has to go out to the file and start reading at a specific position, and it has to make this call for each item (a sketch of this approach follows below).
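A minimal sketch of what option (2) might look like in a RetrieveVirtualItem handler; the offsets list, file name, and encoding are assumptions:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Windows.Forms;

// Built once, in a single pass over the log: the byte offset where each entry starts.
List<long> entryOffsets = new List<long>();
FileStream logStream = new FileStream("app.log", FileMode.Open, FileAccess.Read);

void listView_RetrieveVirtualItem(object sender, RetrieveVirtualItemEventArgs e)
{
    long start = entryOffsets[e.ItemIndex];
    long end = e.ItemIndex + 1 < entryOffsets.Count ? entryOffsets[e.ItemIndex + 1] : logStream.Length;

    var buffer = new byte[end - start];
    logStream.Position = start;
    logStream.Read(buffer, 0, buffer.Length);    // simplified: assumes the read completes in one call

    e.Item = new ListViewItem(Encoding.ASCII.GetString(buffer).TrimEnd('\r', '\n'));
}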
Is there an alternative? These are the fastest approaches I found in my research.
I have an issue where I need to load a fixed-length file, process some of the fields, generate a few others, and finally output a new file. The difficult part is that the file contains part numbers, and some of the products are superseded by other products (which can themselves be superseded). What I need to do is follow the supersession trail to get the information I need to replace some of the fields in the row I am looking at. So how can I best handle about 200,000 lines from a file and the need to move up and down within the given products? I thought about using a collection to hold the data, or a DataSet, but I just don't think that is the right way. Here is an example of what I am trying to do:
Before

Part Number   List Price   Description        Superseding Part Number
0913982                                       3852943
3852943       0006710      CARRIER,BEARING

After

Part Number   List Price   Description        Superseding Part Number
0913982       0006710      CARRIER,BEARING    3852943
3852943       0006710      CARRIER,BEARING
As usual any help would be appreciated, thanks.
Wade
Create a structure with the given fields.
Read the file and put the structures into a collection. You could use the part number as the key of a hashtable to provide the fastest lookups.
Scan the collection and fix the data.
200,000 objects built from the given lines will fit easily in memory. For example, if your structure is 50 bytes, you will need only about 10 MB of memory, which is nothing for a modern PC.
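A rough sketch of those steps, with a hypothetical Part structure and a dictionary keyed by part number; the chain walk pulls the price and description forward from the last part in the supersession chain:

using System.Collections.Generic;

// Filled while reading the fixed-length file: one entry per row, keyed by part number.
var parts = new Dictionary<string, Part>();

foreach (var part in parts.Values)
{
    var current = part;
    // Follow the supersession chain to its end (real code should guard against cycles).
    while (!string.IsNullOrEmpty(current.SupersedingPartNumber)
           && parts.TryGetValue(current.SupersedingPartNumber, out var next))
    {
        current = next;
    }

    if (!ReferenceEquals(current, part))
    {
        part.ListPrice = current.ListPrice;
        part.Description = current.Description;
    }
}

class Part
{
    public string PartNumber;
    public string ListPrice;
    public string Description;
    public string SupersedingPartNumber;   // empty when the part is not superseded
}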