So we have a number of files in a custom file format. These files are processed to generate a new file based on their content; think, for example, of processing a .zip file.
Each file is read sequentially, and output content is built up from what is read.
For instance, reading sequentially could yield the following:
1st Byte: 'S' at index #0 = 'S'
2nd Byte: 'U' at index #0 = 'US'
3rd Byte: 'C' at index #0 = 'CUS'
4th Byte: 'B' at index #0 = 'BCUS'
5th Byte: 'A' at index #0 and index #2 = 'ABACUS'
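In terms of the in-memory list mentioned further down, that sequence amounts to inserts that push the existing bytes along (a small illustration, not the actual processing code):

var result = new List<byte>();
result.Insert(0, (byte)'S');   // "S"
result.Insert(0, (byte)'U');   // "US"
result.Insert(0, (byte)'C');   // "CUS"
result.Insert(0, (byte)'B');   // "BCUS"
result.Insert(0, (byte)'A');   // "ABCUS"
result.Insert(2, (byte)'A');   // "ABACUS"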
A few points worth noting about this:
The output content tends to be produced starting from the end of the resultant file and working towards the start; however, this is not always the case.
Reading the file backwards is not, I think, an option, since this would mess up the indexes.
The length of the resultant file cannot be determined beforehand, unless the entire content of the file is read and parsed.
Indexes can potentially span the entire range of the file.
It cannot be known beforehand whether there will be gaps between bytes, for instance between 'B' and 'C' in 'BCUS', which are only filled in later to give 'ABACUS'.
Currently I'm writing the resultant content into an in-memory List<byte> and then writing the result to a file. This is not ideal, since it means the whole resultant file is held in memory.
I have done a bit of checking and found memory-mapped files in C#, which seemed like a great idea at first glance. However, from what I've seen, 1) they require knowing the file length beforehand, and 2) they have no support for inserting bytes at a specified index while pushing any existing content to adjacent bytes.
I was also thinking of storing bits of the data in chunks, for instance every 1 MB of file content as a separate file, while processing. However, due to the random-access nature of the writes, which can span the entire length of the file, I think there would be a lot of file I/O in terms of opening/closing files and re-reading their data.
Do you have any ideas on how this can be done efficiently?
Related
I use the CSV reader and found that it takes a lot of time to parse the data. How can I load the entire CSV file into memory and then process it record by record, given that I have to do custom mapping of the records?
TextReader tr = new StreamReader(File.Open(@"C:\MarketData\" + symbol + ".txt", FileMode.Open));
CsvReader csvr = new CsvReader(tr);
while (csvr.Read())
{
    // do your magic
}
Create a class that exactly represents/mirrors your CSV file, then read all the contents into a list of that class. The following snippet is from CsvHelper's documentation.
var csv = new CsvReader( textReader );
var records = csv.GetRecords<MyClass>().ToList();
The important part is the .ToList(), as this will force the load of all the data into your list, rather than yielding results as you access them.
You can then perform additional mapping / extraction on that list, which will be in memory.
If you're already doing this, you may benefit from loading your CSV into a HashSet rather than a List (via ToHashSet()). See HashSet vs List Performance.
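For illustration, a rough sketch of the class-plus-ToList() approach described above, using the same CsvReader constructor as the snippets here (newer CsvHelper versions also take a CultureInfo argument); MarketRecord and its properties are placeholders for whatever mirrors your actual columns:

public class MarketRecord        // placeholder: properties should mirror your CSV columns
{
    public string Symbol { get; set; }
    public decimal Price { get; set; }
}

public static List<MarketRecord> LoadRecords(string path)
{
    using (var textReader = new StreamReader(path))
    {
        var csv = new CsvReader(textReader);
        // .ToList() forces every record to be materialised in memory up front
        return csv.GetRecords<MarketRecord>().ToList();
    }
}

// custom mapping / extraction then runs against the in-memory list, e.g.:
// var records = LoadRecords(@"C:\MarketData\" + symbol + ".txt");
// var prices = records.Select(r => r.Price).ToList();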
To answer your question directly: you can load the file fully into a memory stream and then re-read it from that stream using your CsvReader. Similarly, you can create a bigger read buffer for your FileStream, e.g. 15 MB, which would read the entire file into the buffer in one hit. I doubt either of these will actually improve performance for 10 MB files.
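For reference, a rough sketch of both options (path is a placeholder; the buffer size is just the 15 MB figure mentioned above):

// Option 1: pull the whole file into memory first, then parse from the MemoryStream
using (var reader = new StreamReader(new MemoryStream(File.ReadAllBytes(path))))
{
    var csv = new CsvReader(reader);
    while (csv.Read())
    {
        // custom mapping per record
    }
}

// Option 2: give the FileStream a large read buffer so the file is read in one hit
using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, 15 * 1024 * 1024))
using (var reader = new StreamReader(fs))
{
    var csv = new CsvReader(reader);
    while (csv.Read())
    {
        // same processing as above
    }
}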
Find your real performance bottleneck: Time to read file content from disk, time to parse CSV into fields, or time to process a record? A 10MB file looks really small. I'm processing sets of 250MB+ csv files with a custom csv reader with no complaints.
If processing is the bottleneck and you have several threads available and your csv file format does not need to support escaped line breaks, then you could read the entire file into a list of lines (System.IO.File.ReadAllLines / .ReadLines) and parse each line using a different Task. For example:
System.IO.File.ReadLines(csvPath)       // csvPath: full path to the CSV file
    .Skip(1)                            // header line; assumed trusted to be correct
    .AsParallel()
    .Select(ParseRecord)                // RecordClass ParseRecord(string line)
    .ForAll(ProcessRecord);             // void ProcessRecord(RecordClass record)
If you have many files to parse, you could process each file in a different Task and use async methods to maximise throughput. If they all come from the same physical disk, your mileage will vary and may even be worse than a single-threaded approach.
More advanced:
If you know your files contain 8-bit characters only, then you can operate on byte arrays and skip the StreamReader overhead of converting bytes into chars. That way you can read the entire file into a byte array in a single call and scan for line breaks, assuming no line-break escapes need to be supported. In that case, scanning for line breaks can be done by multiple threads, each looking at a part of the byte array (a rough single-threaded sketch of this follows below).
If you don't need to support field escapes (a,"b,c",d), then you can write a faster parser that simply looks for field separators (typically a comma). You can also split field-demarcation parsing and field-content parsing across threads if that's a bottleneck, though memory-access locality may negate any benefits.
Under certain circumstances you may not need to parse fields into intermediate data structures (e.g. doubles, strings) and can work directly off references to the start/end of fields, saving yourself some intermediate data-structure creation.
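A rough sketch of the byte-level idea, kept single-threaded for brevity (it assumes ASCII content, '\n' or '\r\n' line endings, comma separators, and no quoted fields; processRecord is a placeholder for your own per-record handler):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Reads the whole file in one call, then splits records and fields by scanning the bytes directly.
static void ParseCsvBytes(string path, Action<List<string>> processRecord)
{
    byte[] data = File.ReadAllBytes(path);
    var fields = new List<string>();
    int fieldStart = 0;

    for (int i = 0; i <= data.Length; i++)
    {
        bool endOfData = i == data.Length;
        bool endOfLine = !endOfData && data[i] == (byte)'\n';
        bool endOfField = !endOfData && data[i] == (byte)',';

        if (endOfData || endOfLine || endOfField)
        {
            int length = i - fieldStart;
            if (length > 0 && data[fieldStart + length - 1] == (byte)'\r')
                length--;                       // drop the '\r' of a Windows line ending
            fields.Add(Encoding.ASCII.GetString(data, fieldStart, length));
            fieldStart = i + 1;

            if (endOfLine || (endOfData && (fields.Count > 1 || length > 0)))
            {
                processRecord(fields);          // hand one parsed record to the caller
                fields = new List<string>();
            }
        }
    }
}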
Can anyone tell me the fastest way of showing a range of lines from a file of 5 GB size? For example: the file is 5 GB in size and has the line number as one of its columns. Say the file has 1 million lines, and I have a start line # and an end line #. If I want to read the 25th through the 89th line of the large file, rather than reading each and every line, is there any faster way of reading just those specific lines without reading the whole file from the beginning in C#?
In short, no. How can you possibly know where the carriage returns/line numbers are before you actually read them?
To avoid memory issues you could:
var lines = File.ReadLines(path)
    .SkipWhile(line => someCondition)
    .TakeWhile(line => someOtherCondition);
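For the stated 25th-to-89th-line example, where the boundaries are known line numbers rather than conditions, the same streaming approach can use a plain Skip/Take; a rough sketch:

var selected = File.ReadLines(path)
    .Skip(24)    // skip lines 1..24
    .Take(65);   // take lines 25..89 (65 lines)

foreach (var line in selected)
    Console.WriteLine(line);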
5GB is a huge amount of data to sift through without building some sort of index. I think you've stumbled upon a case where loading your data into a database and adding the appropriate indexes might serve you best.
I need to build an index for a very big (50GB+) ASCII text file which will enable me to provide fast random read access to the file (get the nth line, get the nth word in the nth line). I've decided to use a List<List<long>> map, where the element map[i][j] is the position of the jth word of the ith line in the file.
I will build the index sequentially, i.e. read the whole file, populating the index with map.Add(new List<long>()) (new line) and map[i].Add(position) (new word). I will then retrieve a specific word position with map[i][j].
The only problem I see is that I can't predict the total count of lines/words, so I will run into an O(n) copy on every List reallocation, and I have no idea how to avoid this.
Are there any other problems with the data structure I chose for the task? Which structure could be better?
UPD: The file will not be altered at runtime. There are no other ways to retrieve the content besides those I've listed.
Increasing the size of a large list is a very expensive operation, so it's better to reserve the list size at the beginning.
I'd suggest using two lists. The first contains the positions of words within the file, and the second contains indexes into the first list (the index of the first word of each line).
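A rough sketch of that two-list layout (the names and the estimated capacities are illustrative):

// wordPositions[k] = byte offset of the k-th word in the file, counted across all lines
var wordPositions = new List<long>(estimatedWordCount);   // reserve capacity up front if you can estimate it

// lineStarts[i] = index into wordPositions of the first word of line i
var lineStarts = new List<int>(estimatedLineCount);

// byte offset of the j-th word of the i-th line: wordPositions[lineStarts[i] + j]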
You are very likely to exceed all available RAM, and when the system starts paging GC-managed memory in and out, the performance of the program will be completely killed. I'd suggest storing your data in a memory-mapped file rather than in managed memory. http://msdn.microsoft.com/en-us/library/dd997372.aspx
UPD: Memory-mapped files are effective when you need to work with huge amounts of data that don't fit in RAM. Basically, they are your only choice if your index becomes bigger than the available RAM.
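A minimal sketch of keeping such an index in a memory-mapped file instead of managed lists (the file name, capacity, and the k/wordOffset values are illustrative; the capacity has to be chosen, or the file re-mapped, up front):

using System.IO;
using System.IO.MemoryMappedFiles;

long capacity = 1L << 30;   // room for ~134 million 8-byte offsets
using (var mmf = MemoryMappedFile.CreateFromFile("words.idx", FileMode.Create, null, capacity))
using (var accessor = mmf.CreateViewAccessor())
{
    long k = 42;              // running word counter (illustrative)
    long wordOffset = 12345;  // byte offset of word k in the big text file (illustrative)

    accessor.Write(k * sizeof(long), wordOffset);           // store
    long readBack = accessor.ReadInt64(k * sizeof(long));   // retrieve later without managed-heap pressure
}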
I am looking to create a file structured in fixed-size blocks. Essentially I am looking to create a rudimentary file system.
I need to write a header, and then an "infinite" (unbounded) number of entries of the same size/structure. The important parts are:
Each block of data needs to be read/writable individually
Header needs to be readable/writable as its own entity
Need a way to store this data and be able to determine its location in the file quickly
I would imagine the file would resemble something like:
[HEADER][DATA1][DATA2][DATA3][...]
What is the proper way to handle something like this? Let's say I want to read DATA3 from the file: how do I know where that data chunk starts?
If I understand you correctly, and you need a way to assign names/IDs to your DATA chunks, you can try introducing yet another type of chunk.
Let's call it TOC (table of contents).
So, the file structure will look like [HEADER][TOC1][DATA1][DATA2][DATA3][TOC2][...].
A TOC chunk will contain names/IDs and references to multiple DATA chunks. It will also contain some internal data, such as a pointer to the next TOC chunk (so you might consider each TOC chunk a linked-list node).
At runtime, all the TOC chunks can be represented as a kind of hash map, where the key is the name/ID of a DATA chunk and the value is its location in the file.
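A rough sketch of what reading such TOC chunks could look like (the exact field layout here is an assumption for illustration, not part of any standard):

using System.Collections.Generic;
using System.IO;
using System.Text;

// Assumed TOC chunk layout:
// [long nextTocOffset][int entryCount] then entryCount x ([int nameLength][name bytes][long dataOffset][int dataLength])
class TocEntry
{
    public long DataOffset;
    public int DataLength;
}

static Dictionary<string, TocEntry> BuildIndex(BinaryReader reader, long firstTocOffset)
{
    var index = new Dictionary<string, TocEntry>();
    long tocOffset = firstTocOffset;

    while (tocOffset != 0)   // 0 marks the end of the linked list of TOC chunks
    {
        reader.BaseStream.Seek(tocOffset, SeekOrigin.Begin);
        long nextTocOffset = reader.ReadInt64();
        int entryCount = reader.ReadInt32();

        for (int i = 0; i < entryCount; i++)
        {
            int nameLength = reader.ReadInt32();
            string name = Encoding.UTF8.GetString(reader.ReadBytes(nameLength));
            index[name] = new TocEntry
            {
                DataOffset = reader.ReadInt64(),
                DataLength = reader.ReadInt32()
            };
        }
        tocOffset = nextTocOffset;
    }
    return index;   // name/ID -> location of the DATA chunk in the file
}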
We can store the chunk size in the header. If chunk sizes are variable, you can store pointers that point to the actual chunks. An interesting design for variable sizes is the PostgreSQL heap file page: http://doxygen.postgresql.org/bufpage_8h_source.html
I am working in reverse but this may help.
I write decompilers for binary files. Generally there is a fixed header of a known number of bytes. This contains specific file identification so we can recognize the file type we are dealing with.
Following that will be a fixed number of bytes containing the number of sections (groups of data). This number tells us how many data pointers there will be. Each data pointer may be four bytes (or whatever you need) representing the start of a data block. From these we can work out the size of each block. The decompiler then works through the pointers one at a time to get the size and location in the file of each data block. The job then is to extract that block of bytes and do whatever is needed.
We step through the file one block at a time. The size of the last block is measured from its start pointer to the end of the file.
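A rough sketch of reading a layout like that (the 4-byte magic and the field widths are assumptions for illustration):

using (var reader = new BinaryReader(File.OpenRead(path)))
{
    byte[] magic = reader.ReadBytes(4);        // fixed file-identification bytes
    int sectionCount = reader.ReadInt32();     // number of data blocks

    // read the pointer table; append the file length as a sentinel so the last block has an "end" too
    var offsets = new long[sectionCount + 1];
    for (int i = 0; i < sectionCount; i++)
        offsets[i] = reader.ReadUInt32();
    offsets[sectionCount] = reader.BaseStream.Length;

    for (int i = 0; i < sectionCount; i++)
    {
        long size = offsets[i + 1] - offsets[i];   // each block's size falls out of consecutive pointers
        reader.BaseStream.Seek(offsets[i], SeekOrigin.Begin);
        byte[] block = reader.ReadBytes((int)size);
        // ... process/decompile the block
    }
}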
I am writing a program to diff files and copy entire files or segments based on changes on either end (rsync-esque... but more like Unison). The main idea is to keep my music folder (all mp3s) up to date across multiple locations.
I'd like to send segmented updates if only small portions of the file have changed, as opposed to copying the entire file. For this, I need a way to diff segments of the file.
I initially tried generating hashes for blocks of every file (every n bytes, I'd hash that segment). I noticed that when I changed one attribute (an id3v2 tag on an mp3), all the hashed blocks changed. This makes sense, as I would guess the header grows as it acquires new information.
This leads me to my actual question. I would like to know how to determine the length of an mp3's header, so I could create 2 comparable hashes.
1) The meta info of the file (header)
2) The actual mpeg stream with audio (This hash should remain unchanged if all I do is alter tag info)
Am I missing anything else?
If all you want to check is the length of the id3v2 tag, then you can find information about its structure at http://www.id3.org/id3v2.4.0-structure.
If you read the first 3 bytes and they are equal to "ID3", then skip to the 7th byte and read the header size. Be careful, though, because the size is stored as a "synchsafe integer".
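A small sketch of that check (it only handles an ID3v2 tag at the start of the file and ignores the optional ID3v2.4 footer):

// Returns the number of bytes occupied by a leading ID3v2 tag, or 0 if there is none.
static long GetId3v2TagLength(string path)
{
    using (var fs = File.OpenRead(path))
    {
        var header = new byte[10];
        if (fs.Read(header, 0, 10) < 10)
            return 0;
        if (header[0] != 'I' || header[1] != 'D' || header[2] != '3')
            return 0;

        // Bytes 6..9 hold the tag size as a synchsafe integer: 7 bits per byte, high bit always 0.
        long size = (header[6] << 21) | (header[7] << 14) | (header[8] << 7) | header[9];
        return 10 + size;   // 10-byte header + tag body
    }
}

With that offset, you can hash the tag bytes and everything after them separately, so editing only the tag info should leave the audio-stream hash unchanged.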
If you want to determine the header information, you'll either:
a) need to use an mp3 library that can do the parsing for you, or
b) go to the mp3 specification and parse it out as needed.
I wound up using TagLibSharp. developer.novell.com/wiki/index.php/TagLib_Sharp