Can anyone tell me the fastest way of showing a range of lines in a file of 5 GB size? For example, the file is 5 GB and has the line number as one of its columns. Say the file has 1 million lines, and I have a start line index and an end line index. If I want to read the 25th line to the 89th line of this large file, rather than reading each and every line, is there a fast way in C# of reading just lines 25 to 89 without reading the whole file from the beginning?
In short, no. How can you possibly know where the carriage returns/line numbers are before you actually read them?
To avoid memory issues you could:
File.ReadLines(path)
    .SkipWhile(line => someCondition)
    .TakeWhile(line => someOtherCondition)
5GB is a huge amount of data to sift through without building some sort of index. I think you've stumbled upon a case where loading your data into a database and adding the appropriate indexes might serve you best.
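For the concrete range in the question (lines 25 to 89), a minimal sketch of the same streaming idea, using Skip/Take in place of the placeholder conditions above; the file path is an assumption:

using System;
using System.IO;
using System.Linq;

class RangeReader
{
    static void Main()
    {
        const string path = "huge.txt";   // assumed path to the 5 GB file
        const int startLine = 25;         // 1-based start line from the question
        const int endLine = 89;           // 1-based end line from the question

        // File.ReadLines streams lazily, so only lines up to endLine are materialized,
        // but the file is still scanned from the beginning to find line 25.
        var range = File.ReadLines(path)
                        .Skip(startLine - 1)
                        .Take(endLine - startLine + 1);

        foreach (var line in range)
            Console.WriteLine(line);
    }
}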
I have the problem of reading a single line from a large file encoded in UTF-8. The lines in the file have a constant length.
The file has about 300k lines on average. Time is the main constraint, so I want to do it the fastest way possible.
I've tried LinQ
File.ReadLines("file.txt").Skip(noOfLines).Take(1).First();
But the time is not satisfactory.
My biggest hope was using a stream and setting its position to the start of the desired line, but the problem is that the lines' sizes in bytes differ (UTF-8 is a variable-width encoding), even though their character lengths are constant.
Any ideas, how to do it?
Now this is where you don't want to use LINQ (-:
You actually want to find the nth occurrence of a newline in the file and read from there until the next newline.
You probably want to check out this documentation on memory mapped files as well:
https://msdn.microsoft.com/en-us/library/system.io.memorymappedfiles.memorymappedfile(v=vs.110).aspx
There is also a post comparing different access methods
http://cc.davelozinski.com/c-sharp/fastest-way-to-read-text-files
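A hedged sketch of that idea, scanning raw bytes for the nth newline and decoding only the single target line; this works for UTF-8 because multi-byte sequences never contain the 0x0A newline byte. The file name, line number and buffer size are assumptions:

using System;
using System.IO;
using System.Text;

class NthLineReader
{
    // Returns the lineIndex-th (0-based) line without decoding the rest of the file.
    static string ReadLine(string path, int lineIndex)
    {
        using var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                          FileShare.Read, bufferSize: 1 << 20);
        var buffer = new byte[1 << 20];
        var lineBytes = new MemoryStream();
        int newlinesSeen = 0;
        int read;

        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                if (newlinesSeen == lineIndex)
                {
                    if (buffer[i] == (byte)'\n')
                        return Encoding.UTF8.GetString(lineBytes.ToArray()).TrimEnd('\r');
                    lineBytes.WriteByte(buffer[i]);
                }
                else if (buffer[i] == (byte)'\n')
                {
                    newlinesSeen++;
                }
            }
        }
        return Encoding.UTF8.GetString(lineBytes.ToArray()).TrimEnd('\r');
    }

    static void Main()
    {
        Console.WriteLine(ReadLine("file.txt", 150000)); // assumed file and line number
    }
}

A memory-mapped file could replace the FileStream here with the same byte-scanning logic; the win over the LINQ version is that no string is allocated or decoded for the skipped lines.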
So we have a number of files with a custom file format. These files are then processed to generate a new file based on their content. Think, for example, of processing a .zip file.
The file is read sequentially, and some output content is created based on what is read.
For instance, reading sequentially could yield the following:
1st Byte: 'S' at index #0 = 'S'
2nd Byte: 'U' at index #0 = 'US'
3rd Byte: 'C' at index #0 = 'CUS'
4th Byte: 'B' at index #0 = 'BCUS'
5th Byte: 'A' at index #0 and index #2 = 'ABACUS'
A few points worth noting about this:
The file contents tend to start from the end of the resultant file and lead towards the start; however, this is not always the case.
Reading the file backwards - I think - is not an option, since this would then mess up the indexes.
The resultant file length cannot be determined beforehand, unless the entire content of the file is read and parsed.
Indexes can potentially span the entire range of the file.
It cannot be known beforehand whether there are empty spaces in between bytes, for instance between 'B' and 'C' in 'BCUS', which are later filled up, as in 'ABACUS'.
Currently I'm writing the resultant content into an in-memory List<byte> and then writing the result to a file. This is not ideal, since it means that the whole resultant file is held in memory.
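For illustration, a minimal sketch of that in-memory approach; the (index, value) write records are hypothetical stand-ins for whatever the custom format actually encodes, reproducing the ABACUS example above:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class InMemoryBuilder
{
    static void Main()
    {
        // Hypothetical decoded writes: each entry means "insert this byte at this
        // index, pushing any existing content towards the end".
        var writes = new (int index, byte value)[]
        {
            (0, (byte)'S'), (0, (byte)'U'), (0, (byte)'C'),
            (0, (byte)'B'), (0, (byte)'A'), (2, (byte)'A'),
        };

        var result = new List<byte>();       // the whole output is held in memory
        foreach (var (index, value) in writes)
            result.Insert(index, value);     // O(n) shift on every insert

        Console.WriteLine(Encoding.ASCII.GetString(result.ToArray())); // ABACUS
        File.WriteAllBytes("output.bin", result.ToArray());            // assumed output path
    }
}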
I have done a bit of checking and found memory mapping in C#, which seemed like a great idea at first glance; however, from what I've seen, 1) it requires knowing the file length beforehand, and 2) it has no support for inserting bytes at specified indexes whilst pushing any existing content to adjacent bytes.
I was also thinking of storing bits of the data as chunks, for instance every 1 MB of file content as a separate file, while processing. However, due to the random-access nature of the writes, possibly spanning the entire length of the file, I think there would be a lot of file I/O in terms of opening/closing files and re-reading their data.
Do you have any ideas of how this can be performed efficiently?
I need to build an index for a very big (50GB+) ASCII text file which will enable me to provide fast random read access to the file (get the nth line, get the nth word in the nth line). I've decided to use a List<List<long>> map, where the element map[i][j] is the position of the jth word of the ith line in the file.
I will build the index sequentially, i.e. read through the whole file, populating the index with map.Add(new List<long>()) (new line) and map[i].Add(position) (new word). I will then retrieve a specific word position with map[i][j].
The only problem I see is that I can't predict the total count of lines/words, so I will bump into an O(n) copy on every List reallocation, and I have no idea how I can avoid this.
Are there any other problems with the data structure I chose for the task? Which structure could be better?
UPD: The file will not be altered at runtime. There are no other ways to retrieve content except the ones I've listed.
Increasing the size of a large list is a very expensive operation, so it's better to reserve the list capacity at the beginning.
I'd suggest using two lists. The first contains the positions of the words within the file, and the second contains indexes into the first list (the index of the first word of each line).
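A minimal sketch of that two-list layout; the type and member names are assumptions, not code from the question:

using System.Collections.Generic;

// wordPositions holds the byte offset of every word in file order;
// lineStarts[i] is the index in wordPositions of the first word of line i.
class TwoListIndex
{
    private readonly List<long> wordPositions = new List<long>();
    private readonly List<int> lineStarts = new List<int>();

    public void BeginLine() => lineStarts.Add(wordPositions.Count);
    public void AddWord(long bytePosition) => wordPositions.Add(bytePosition);

    // Byte offset of the j-th word of the i-th line (both 0-based).
    public long GetWordPosition(int i, int j) => wordPositions[lineStarts[i] + j];
}

This also avoids allocating a separate inner list object per line, which List<List<long>> would require.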
You are very likely to exceed all available RAM, and when the system starts paging GC-managed memory in and out, the performance of the program will be completely killed. I'd suggest storing your data in a memory-mapped file rather than in managed memory. http://msdn.microsoft.com/en-us/library/dd997372.aspx
UPD: memory-mapped files are effective when you need to work with huge amounts of data that don't fit in RAM. Basically, it's your only choice if your index becomes bigger than the available RAM.
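A hedged sketch of keeping the word-position part of such an index in a memory-mapped file instead of a managed list; the file name and the fixed capacity are assumptions:

using System;
using System.IO.MemoryMappedFiles;

class MappedIndex
{
    static void Main()
    {
        const long maxWords = 100000000;                 // assumed upper bound on word count
        const long capacity = maxWords * sizeof(long);   // 8 bytes per stored position

        // The index lives on disk and is paged in by the OS on demand,
        // so it can grow far beyond the available RAM.
        using var mmf = MemoryMappedFile.CreateFromFile(
            "index.bin", System.IO.FileMode.Create, null, capacity);
        using var accessor = mmf.CreateViewAccessor();

        // Store the byte position of word #k, then read it back.
        long k = 42, position = 123456789;
        accessor.Write(k * sizeof(long), position);
        Console.WriteLine(accessor.ReadInt64(k * sizeof(long)));
    }
}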
I have 500 CSV files; each of them is about 10~20 MB in size.
As a sample, the content of the files looks like below:
file1 :
column1 column2 column3 column4 .... column50
file2:
column51 column52 ... ... column100
So, what I want to do is merge all the files into one large file like below:
fileAll
column1 , column2 ...... column2500
My solution now is:
1. Merge every 100 files into 5 large files
2. Merge the 5 large files into one large file
But the performance is very bad.
So, can anyone give me some advice to improve the performance?
Thanks!
What language are you working in? Off the top of my head, I would think you would get the best performance by doing a line-by-line stream.
So, for instance, read the first line of all the files in, write the first line of the merged file out. Continue until you're done.
The reason this is better than your solution is that your solution reads and writes the same data to and from disk several times, which is slow. I assume you can't fit all the files in memory (and you wouldn't want to anyway, the caching would be horrible), but you want to minimize disk reads and writes (the slowest operations) and try to do it in a fashion where each segment to be written fits in your cache.
Also, depending on what language you're using, you may be taking a huge hit on concatenating strings. A language that uses null-terminated arrays as its string implementation is going to take a huge hit when concatenating large strings, because it has to search for the null terminator; C is an example off the top of my head. So you may want to limit the size of the strings you work with. In the above example, read in x many chars, write out x many chars, etc. But you should still only be reading the data in once, and writing the data out once, if at all possible.
You could try doing it as a streamed operation; don't do 1. load file 1, 2. load file 2, 3. merge, 4. write result. Instead do 1. load line 1 of file 1 & 2, 2. merge the line, 3. write the line. This way you speed things up by doing smaller chunks of read, process, write, and thereby allow the disk to empty its read/write buffers while you merge each line (row). There could be other things slowing down your process; please post code. For example, string operations could easily be slowing things down if not done carefully. Finally, Release mode (as opposed to Debug) is more optimized and will typically run significantly faster.
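A hedged sketch of that streamed, line-by-line merge; the input directory, file pattern and separator are assumptions:

using System;
using System.IO;
using System.Linq;

class CsvColumnMerger
{
    static void Main()
    {
        // Assumed layout: the 500 input files sit in ./input and are merged
        // side by side into fileAll.csv.
        var inputPaths = Directory.GetFiles("input", "*.csv").OrderBy(p => p).ToArray();
        var readers = inputPaths.Select(p => new StreamReader(p)).ToArray();

        using var writer = new StreamWriter("fileAll.csv");
        try
        {
            while (true)
            {
                var parts = readers.Select(r => r.ReadLine()).ToArray();
                if (parts.All(p => p == null))   // every input file is exhausted
                    break;

                // Rows missing from shorter files become empty fields.
                writer.WriteLine(string.Join(",", parts.Select(p => p ?? "")));
            }
        }
        finally
        {
            foreach (var r in readers)
                r.Dispose();
        }
    }
}

Each input file is opened once and every line is read and written exactly once, which is the property both answers above are recommending.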
I have a file data.txt. data.txt contains text line by line, like this:
one
two
three
six
Here I need to write the data in the file as:
one
two
three
four
five
six
I don't know how to write the file like this!
Generally, you have to re-write the file when inserting, because text files have variable-length rows.
There are optimizations you could employ, like extending the file, buffering and writing, but you may have to buffer an arbitrary amount - e.g. when inserting a row at the top.
If we knew more about your complete scenario, we would be better able to help usefully.
Loop through your text file and read the lines into an array. Modify the array and save it back to the file. It's not a good idea if you have some other, much bigger text file, but for this particular example it will work no problem.
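A minimal sketch of that read-modify-write approach for the data.txt example above; the missing lines and their insert position are taken from the question:

using System.Collections.Generic;
using System.IO;

class InsertLines
{
    static void Main()
    {
        // Read the whole file into memory (fine for a small file like data.txt).
        var lines = new List<string>(File.ReadAllLines("data.txt"));

        // Insert "four" and "five" before "six".
        int sixIndex = lines.IndexOf("six");
        if (sixIndex >= 0)
            lines.InsertRange(sixIndex, new[] { "four", "five" });

        // Rewrite the entire file, as the previous answer explains is necessary.
        File.WriteAllLines("data.txt", lines);
    }
}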