Efficient way to read index file in .NET - c#

A dictionary usually has an index file and a data file. I'm writing a dictionary application as a hobby project, and I'm confused about how to read the offset (index) file in .NET. The index file is 4-5 MB in size. What is the most efficient way to fetch the offset/length value of a word?
EDIT:
I only need to know how to read the offset file when I have a word to search for, i.e., how to search the index file for a word so that I can get the subsequent 8 bytes.

Stream.Seek(long offset, SeekOrigin origin) will be useful to get to the offset.
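A minimal sketch of that seek-then-read step, assuming the offset and length have already been looked up in the index. The `DataFile` class and `ReadEntry` method are hypothetical names used only for illustration.

```csharp
using System;
using System.IO;

public static class DataFile
{
    // Reads exactly `length` bytes starting at `offset` in the data stream.
    public static byte[] ReadEntry(Stream data, long offset, int length)
    {
        data.Seek(offset, SeekOrigin.Begin);
        var buffer = new byte[length];
        int total = 0;
        while (total < length)
        {
            // Read may return fewer bytes than requested, so loop until done.
            int n = data.Read(buffer, total, length - total);
            if (n == 0) throw new EndOfStreamException("Entry runs past the end of the data.");
            total += n;
        }
        return buffer;
    }
}
```

The loop is there because a single Stream.Read call is allowed to return fewer bytes than were asked for.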

4-5 megabytes for the index? That's nothing. Read the entire thing into a byte array and wrap it in a MemoryStream or, more appropriately, parse the entire contents into data structures suited to quick searching (hash table, B-tree, etc.).
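A rough sketch of the parse-into-a-hash-table idea. The question does not spell out the index layout, so this assumes each entry is a null-terminated UTF-8 word followed by a 4-byte offset and a 4-byte length, both big-endian. The layout and the `DictionaryIndex` name are assumptions, so adjust them to the real format.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class DictionaryIndex
{
    // Assumed entry layout: word bytes, 0x00, 4-byte offset, 4-byte length.
    public static Dictionary<string, (uint Offset, uint Length)> Load(byte[] index)
    {
        var entries = new Dictionary<string, (uint Offset, uint Length)>(StringComparer.Ordinal);
        int pos = 0;
        while (pos < index.Length)
        {
            int end = Array.IndexOf(index, (byte)0, pos);
            // Stop on a missing terminator or a truncated 8-byte tail.
            if (end < 0 || end + 9 > index.Length) break;

            string word = Encoding.UTF8.GetString(index, pos, end - pos);
            uint offset = ReadUInt32BigEndian(index, end + 1);
            uint length = ReadUInt32BigEndian(index, end + 5);
            entries[word] = (offset, length);
            pos = end + 9;
        }
        return entries;
    }

    private static uint ReadUInt32BigEndian(byte[] data, int start)
    {
        return ((uint)data[start] << 24) | ((uint)data[start + 1] << 16)
             | ((uint)data[start + 2] << 8) | data[start + 3];
    }
}
```

With the whole index loaded once via File.ReadAllBytes, each lookup is then a single dictionary access.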

System.IO.BinaryReader has a ReadUInt32 method that reads a 4-byte unsigned integer (in little-endian order). It also has methods for reading the other primitive types from binary files.

Related

How to optimize sequential reading and backtracking the position of the file c#?

I have an indefinitely big file. I need to find the longest matches between segments of the file and some byte arrays of different lengths.
What I do now is this.
Create a FileStream fs
ForEach byte b in fs
    save currentPosition
    //these byte arrays are different depending on b
    ForEach byte array
        while matching bytes
            read from fs
        print matched sequence
        seek to position
Now the program is slow. How can I improve my reading from the file?
From what I read, the fs has an internal buffer, so when I read a byte, it reads ahead 4 KB by default.
My questions:
Am I correct in assuming that the sequential reads of the bytes in fs inside the while loop are satisfied from that buffer?
If so, what happens when I seek back? Is the buffer discarded and then refilled with the same content for each byte array? I need the same buffer; I just want to iterate over it again.
Also, after I have iterated over all the byte arrays and want to move on to the next b, what happens to that buffer? What I really want is that same buffer, but without the first byte.
How does this work? Do I need to create a wrapper for the FileStream that reads a byte array (that buffer) myself and satisfies my reads from it?
Edit: From Task Manager I see that my program's average CPU usage is 2%. So the program must be slow because of the file reads.
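One possible shape for such a wrapper, as a sketch only: keep your own sliding window over the stream so that "seeking back" is just resetting an array index. It assumes the candidate byte arrays are known up front (in the question they depend on b, so you would pick the list per position), and `PatternScanner` and `LongestMatches` are made-up names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class PatternScanner
{
    // Yields (position, length) of the longest prefix match at each position.
    public static IEnumerable<(long Position, int Length)> LongestMatches(
        Stream stream, IReadOnlyList<byte[]> patterns, int chunkSize = 64 * 1024)
    {
        int maxLen = patterns.Max(p => p.Length);
        var window = new byte[chunkSize + maxLen];
        int filled = 0;
        long basePos = 0;
        bool eof = false;

        while (!eof || filled > 0)
        {
            // Top up the window; bytes kept from the last round are at the front.
            while (!eof && filled < window.Length)
            {
                int n = stream.Read(window, filled, window.Length - filled);
                if (n == 0) eof = true; else filled += n;
            }

            // Before EOF, only scan positions with maxLen bytes available.
            int limit = eof ? filled : filled - maxLen + 1;
            for (int i = 0; i < limit; i++)
            {
                int best = 0;
                foreach (byte[] p in patterns)
                {
                    int k = 0;
                    while (k < p.Length && i + k < filled && window[i + k] == p[k]) k++;
                    if (k > best) best = k;
                }
                if (best > 0) yield return (basePos + i, best);
            }

            // Slide: move the unscanned tail to the front of the window.
            int keep = filled - limit;
            Array.Copy(window, limit, window, 0, keep);
            basePos += limit;
            filled = keep;
        }
    }
}
```

This does not change the amount of comparison work, only the number of Seek and Read calls made against the FileStream.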

How to read/write a specific number of bytes to file

I am looking to create a file structured as fixed-size blocks. Essentially I am looking to create a rudimentary file system.
I need to write a header, and then an "infinite" possible number of entries of the same size/structure. The important parts are:
Each block of data needs to be read/writable individually
Header needs to be readable/writable as its own entity
Need a way to store this data and be able to determine its location in the file quickly
I would imagine the file would resemble something like:
[HEADER][DATA1][DATA2][DATA3][...]
What is the proper way to handle something like this? Let's say I want to read DATA3 from the file. How do I know where that data chunk starts?
If I understand you correctly and you need a way to assign a kind of names/IDs to your DATA chunks, you can try to introduce yet another type of chunk.
Let's call it TOC (table of contents).
So, the file structure will look like [HEADER][TOC1][DATA1][DATA2][DATA3][TOC2][...].
TOC chunk will contain names/IDs and references to multiple DATA chunks. Also, it will contain some internal data such as pointer to the next TOC chunk (so, you might consider each TOC chunk as a linked-list node).
At runtime all TOC chunks could be represented as a kind of HashMap, where key is a name/ID of the DATA chunk and value is its location in the file.
We can store the chunk size in the header. If the chunk sizes are variable, you can store pointers to the actual chunks. An interesting design for variable-size entries is the PostgreSQL heap file page: http://doxygen.postgresql.org/bufpage_8h_source.html
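For the fixed-size case, locating a chunk is plain arithmetic: header size plus index times chunk size. A small sketch, assuming a 16-byte header and zero-based chunk indexes (both are assumptions, as is the `BlockFile` name):

```csharp
using System;
using System.IO;

public static class BlockFile
{
    public const int HeaderSize = 16; // hypothetical header size

    // Byte position of block `index` (0-based) for a given block size.
    public static long BlockOffset(int index, int blockSize)
    {
        return HeaderSize + (long)index * blockSize;
    }

    public static void WriteBlock(Stream file, int index, byte[] block)
    {
        file.Seek(BlockOffset(index, block.Length), SeekOrigin.Begin);
        file.Write(block, 0, block.Length);
    }

    public static byte[] ReadBlock(Stream file, int index, int blockSize)
    {
        file.Seek(BlockOffset(index, blockSize), SeekOrigin.Begin);
        var block = new byte[blockSize];
        int total = 0;
        while (total < blockSize)
        {
            int n = file.Read(block, total, blockSize - total);
            if (n == 0) throw new EndOfStreamException("Block is incomplete.");
            total += n;
        }
        return block;
    }
}
```

With zero-based indexes, DATA3 from the question is index 2, so it starts at HeaderSize + 2 * blockSize.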
I am working in reverse but this may help.
I write decompilers for binary files. Generally there is a fixed header of a known number of bytes. This contains specific file identification so we can recognize the file type we are dealing with.
Following that will be a fixed number of bytes containing the number of sections (groups of data). This number then tells us how many data pointers there will be. Each data pointer may be four bytes (or whatever you need) representing the start of the data block. From this we can work out the size of each block. The decompiler then reads the pointers one at a time to get the size and location in the file of each data block. The job then is to extract that block of bytes and do whatever is needed.
We step through the file one block at a time. The size of the last block is the distance from its start pointer to the end of the file.
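A sketch of reading such a pointer table, assuming a 4-byte section count right after the header followed by 4-byte little-endian start pointers in ascending order. The field widths and the `SectionTable` name are assumptions, not taken from any particular format.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class SectionTable
{
    // Returns the start and size of every section listed after the header.
    public static List<(uint Start, long Size)> Read(Stream file, int headerSize)
    {
        var sections = new List<(uint Start, long Size)>();
        file.Seek(headerSize, SeekOrigin.Begin);

        // leaveOpen: true so disposing the reader does not close the file.
        using (var reader = new BinaryReader(file, Encoding.UTF8, leaveOpen: true))
        {
            uint count = reader.ReadUInt32();
            var starts = new uint[count];
            for (uint i = 0; i < count; i++) starts[i] = reader.ReadUInt32();

            for (uint i = 0; i < count; i++)
            {
                // A section ends where the next begins; the last ends at EOF.
                long end = (i + 1 < count) ? starts[i + 1] : file.Length;
                sections.Add((starts[i], end - starts[i]));
            }
        }
        return sections;
    }
}
```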

parsing binary file in C#

I have a binary file that I store in a byte array. The file size can be 20 MB or more. I then want to parse the file or find a particular value in it. I am doing it in 2 ways:
1. By converting the full file into a char array.
2. By converting the full file into a hex string. (I also have hex values.)
What is the best way to parse the full file? Or should I do it in binary form? I am using VS 2005.
From the aspect of memory consumption, it would be best if you could parse it directly, on the fly.
Converting it to a char array in C# effectively doubles its size in memory (presuming you are converting each byte to a char), while a hex string will take at least 4 times the size (C# chars are 16-bit Unicode characters).
On the other hand, if you need to search and parse the same set of data repeatedly, you may benefit from storing it in whatever form suits your needs better.
What's stopping you from searching in the byte[]?
IMHO, if you're simply searching for a byte of a specified value, or several contiguous bytes, this is the easiest and most efficient way to do it.
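Searching the byte[] directly can be as simple as the following sketch. `ByteSearch.IndexOf` is a hypothetical helper name, and this is the naive algorithm, which is usually adequate for a 20 MB buffer.

```csharp
public static class ByteSearch
{
    // Returns the index of the first occurrence of needle, or -1 if absent.
    public static int IndexOf(byte[] haystack, byte[] needle)
    {
        if (needle.Length == 0) return 0;
        for (int i = 0; i <= haystack.Length - needle.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j]) j++;
            if (j == needle.Length) return i;
        }
        return -1;
    }
}
```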
If I understood your question correctly you need to find strings which can contain any characters in a large binary file. Does the binary file contain text? If so do you know the encoding? If so you can use StreamReader class like so:
// The @ prefix keeps "\t" in the path from being read as a tab escape.
using (StreamReader sr = new StreamReader(@"C:\test.dat", System.Text.Encoding.UTF8))
{
    string s = sr.ReadLine();
}
In any case I think it's much more efficient using some kind of stream access to the file, instead of loading it all to memory.
You could load it into memory in chunks, and then use a pattern matching algorithm (like Knuth-Morris-Pratt or Rabin-Karp).
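Knuth-Morris-Pratt suits chunked reading well because it never moves backwards in the input, so a match that straddles two chunks needs no special handling. A sketch under that approach (`StreamSearch` is a made-up name and the 64 KB chunk size is arbitrary):

```csharp
using System.IO;

public static class StreamSearch
{
    // Returns the position of the first occurrence of pattern, or -1.
    public static long IndexOf(Stream stream, byte[] pattern)
    {
        if (pattern.Length == 0) return 0;
        int[] failure = BuildFailureTable(pattern);
        var chunk = new byte[64 * 1024];
        long position = 0; // absolute position of chunk[i] in the stream
        int matched = 0;   // pattern bytes matched so far
        int read;
        while ((read = stream.Read(chunk, 0, chunk.Length)) > 0)
        {
            for (int i = 0; i < read; i++, position++)
            {
                // On mismatch, fall back to the longest reusable prefix.
                while (matched > 0 && chunk[i] != pattern[matched])
                    matched = failure[matched - 1];
                if (chunk[i] == pattern[matched]) matched++;
                if (matched == pattern.Length)
                    return position - pattern.Length + 1;
            }
        }
        return -1;
    }

    // failure[i] = length of the longest proper prefix of pattern[0..i]
    // that is also a suffix of it.
    private static int[] BuildFailureTable(byte[] pattern)
    {
        var failure = new int[pattern.Length];
        int k = 0;
        for (int i = 1; i < pattern.Length; i++)
        {
            while (k > 0 && pattern[i] != pattern[k]) k = failure[k - 1];
            if (pattern[i] == pattern[k]) k++;
            failure[i] = k;
        }
        return failure;
    }
}
```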

How to write a file format handler

Today I'm cutting video at work (yay me!), and I came across a strange video format: a MOD file with a companion MOI file.
I found this article online from the wiki, and I wanted to write a file format handler, but I'm not sure how to begin.
I want to write a file format handler to read the information files. Has anyone ever done this, and how would I begin?
Edit:
Thanks for all the suggestions, I'm going to attempt this tonight, and I'll let you know. The MOI files are not very large, maybe 5KB in size at most (I don't have them in front of me).
You're in luck in that the MOI format at least spells out the file definition. All you need to do is read in the file and interpret the results based on the file definition.
Following the definition, you should be able to create a class that could read and interpret a file which returns all of the file format definitions as properties in their respective types.
Reading the file requires opening it and generally reading it byte by byte, such as:
// pathToYourFile is a string holding the path to the MOI file.
using (FileStream fs = File.OpenRead(pathToYourFile)) {
    while (true) {
        int b = fs.ReadByte();
        if (b == -1) {
            break; // ReadByte returns -1 at end of file
        }
        //Interpret byte or bytes here....
    }
}
Per the wiki article's referenced PDF, it looks like someone already reverse engineered the format. From the PDF, here's the first entry in the format:
Hex-Address: 0x00
Data Type: 2 Byte ASCII
Value (Hex): "V6"
Meaning: Version
So, a simplistic implementation could pull the first 2 bytes of data from the file stream and convert to ASCII, which would provide a property value for the Version.
Next entry in the format definition:
Hex-Address: 0x02
Data Type: 4 Byte Unsigned Integer
Value (Hex):
Meaning: Total size of MOI-file
Interpreting the next 4 bytes and converting to an unsigned int would provide a property value for the MOI file size.
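Those two entries could be read along these lines. This is only a sketch: the byte order of the size field is not stated above, so big-endian is an assumption to check against the PDF, and `MoiHeader` is a hypothetical class covering just the first 6 bytes.

```csharp
using System.IO;
using System.Text;

public sealed class MoiHeader
{
    public string Version { get; private set; }
    public uint FileSize { get; private set; }

    public static MoiHeader Read(Stream stream)
    {
        var raw = new byte[6];
        int total = 0;
        while (total < raw.Length)
        {
            int n = stream.Read(raw, total, raw.Length - total);
            if (n == 0) throw new EndOfStreamException("File is shorter than 6 bytes.");
            total += n;
        }
        return new MoiHeader
        {
            // 0x00: 2-byte ASCII version, e.g. "V6"
            Version = Encoding.ASCII.GetString(raw, 0, 2),
            // 0x02: 4-byte unsigned integer, ASSUMED big-endian here
            FileSize = ((uint)raw[2] << 24) | ((uint)raw[3] << 16)
                     | ((uint)raw[4] << 8) | raw[5]
        };
    }
}
```

If the field turns out to be little-endian, BinaryReader.ReadUInt32 would read it directly.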
Hope this helps.
If the files are very large and just need to be streamed in, I would create a new reader object that uses an UnmanagedMemoryStream to read the information in.
I've done a lot of different file format processing like this. More recently, I've taken to making a lot of my readers more functional, where reading tends to use 'yield return' to return read-only objects from the file.
However, it all depends on what you want to do. If you are trying to create a general-purpose format for use in other applications or create an API, you probably want to conform to an existing standard. If, however, you just want to get data into your own application, you are free to do it however you want. You could use a BinaryReader on the stream and construct the information you need within your app, or get the reader to return objects representing the contents of the file.
The one thing I would recommend: make sure the reader implements IDisposable and that you wrap it in a using block!
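To illustrate the 'yield return' style together with IDisposable, here is a small sketch. The record layout (a 4-byte int id followed by an 8-byte double) is invented purely for the example, as is the `RecordReader` name, and it assumes a seekable stream.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public sealed class RecordReader : IDisposable
{
    private const int RecordSize = 12; // 4-byte id + 8-byte value (made up)
    private readonly BinaryReader _reader;

    public RecordReader(Stream stream)
    {
        _reader = new BinaryReader(stream);
    }

    // Lazily yields one record at a time; nothing is read until enumerated.
    public IEnumerable<(int Id, double Value)> ReadRecords()
    {
        Stream s = _reader.BaseStream;
        while (s.Position + RecordSize <= s.Length)
        {
            int id = _reader.ReadInt32();
            double value = _reader.ReadDouble();
            yield return (id, value);
        }
    }

    // Disposing the BinaryReader also closes the underlying stream.
    public void Dispose()
    {
        _reader.Dispose();
    }
}
```

A caller would wrap it as `using (var reader = new RecordReader(File.OpenRead(path))) { foreach (var record in reader.ReadRecords()) { ... } }`.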

How can I determine the length of an mp3 file's header?

I am writing a program to diff, and copy entire files or segments based on changes on either end (Rsync-esque... but more like Unison). The main idea is to keep my music folder (all mp3s) up to date over multiple locations.
I'd like to send segmented updates if only small portions of the file have changed, as opposed to copying the entire file. For this, I need a way to diff segments of the file.
I initially tried generating hashes for blocks of every file (every n bytes I'd hash the segment). I noticed that when I changed one attribute (an id3v2 tag on an mp3), all the hashed blocks would change. This makes sense, as I would guess the header grows as it acquires new information.
This leads me to my actual question. I would like to know how to determine the length of an mp3's header, so I could create 2 comparable hashes.
1) The meta info of the file (header)
2) The actual mpeg stream with audio (This hash should remain unchanged if all I do is alter tag info)
Am I missing anything else?
Thanks!
Ty
If all you want to check the length of is id3v2 tags, then you can find out information about its structure at http://www.id3.org/id3v2.4.0-structure.
If you read the first 3 bytes and they are equal to "ID3", then skip to the 7th byte and read the header size, which occupies 4 bytes. Be careful though, because the size is stored as a "synchsafe integer".
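A synchsafe integer uses only the low 7 bits of each byte, so the 4 size bytes hold a 28-bit value. A sketch of decoding it from the first 10 bytes of the file (the `Id3v2` class and `TagLength` method are illustrative names):

```csharp
public static class Id3v2
{
    // Returns the total length of a leading ID3v2 tag, or 0 if none.
    // `header` must hold at least the first 10 bytes of the file.
    public static int TagLength(byte[] header)
    {
        if (header.Length < 10 ||
            header[0] != (byte)'I' || header[1] != (byte)'D' || header[2] != (byte)'3')
        {
            return 0;
        }

        // Bytes 6..9: synchsafe size, 7 significant bits per byte.
        int size = ((header[6] & 0x7F) << 21)
                 | ((header[7] & 0x7F) << 14)
                 | ((header[8] & 0x7F) << 7)
                 | (header[9] & 0x7F);

        // The size excludes the 10-byte header itself. In ID3v2.4 a set
        // footer flag (0x10 in byte 5) adds another 10 bytes at the end.
        bool hasFooter = (header[5] & 0x10) != 0;
        return 10 + size + (hasFooter ? 10 : 0);
    }
}
```

The audio stream would then start at the returned offset, so the two hashes can be computed over the bytes before and after it.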
If you want to determine the header information, you'll need to either:
a) use an mp3 library that can do the parsing for you, or
b) go to the mp3 specification and parse it out as needed.
I wound up using TagLibSharp. developer.novell.com/wiki/index.php/TagLib_Sharp
