I am looking to create a file by structuring it in size blocks. Essentially I am looking to create a rudimentary file system.
I need to write a header, and then an "infinite" possible number of entries of the same size/structure. The important parts are:
Each block of data needs to be read/writable individually
Header needs to be readable/writable as its own entity
Need a way to store this data and be able to determine its location in the file quickly
The would imagine the file would resemble something like:
[HEADER][DATA1][DATA2][DATA3][...]
What is the proper way to handle something like this? Lets say I want to read DATA3 from the file, how do I know where that data chunk starts?
If I understand you correctly and you need a way to assign a kind of names/IDs to your DATA chunks, you can try to introduce yet another type of chunk.
Let's call it TOC (table of contents).
So, the file structure will look like [HEADER][TOC1][DATA1][DATA2][DATA3][TOC2][...].
TOC chunk will contain names/IDs and references to multiple DATA chunks. Also, it will contain some internal data such as pointer to the next TOC chunk (so, you might consider each TOC chunk as a linked-list node).
At runtime all TOC chunks could be represented as a kind of HashMap, where key is a name/ID of the DATA chunk and value is its location in the file.
We can store in the header the size of chunk. If the size of chunks are variable, you can store pointers which points to actual chunk. An interesting design for variable size is in postgres heap file page. http://doxygen.postgresql.org/bufpage_8h_source.html
I am working in reverse but this may help.
I write decompilers for binary files. Generally there is a fixed header of a known number of bytes. This contains specific file identification so we can recognize the file type we are dealing with.
Following that will be a fixed number of bytes containing the number of sections (groups of data) This number then tells us how many data pointers there will be. Each data pointer may be four bytes (or whatever you need) representing the start of the data block. From this we can work out the size of each block. The decompiler then reads the blocks one at a time to get the size and location in the file of each data block. The job then is to extract that block of bytes and do whatever is needed.
We step through the file one block at a time. The size of the last block is the start pointer to the end of the file.
Related
I'm currently using SSIS to do an improvement on a project. need to insert single documents in a MongoDB collection of type Time Series. At some point I want to retrieve rows of data after going through a C# transformation script. I did this:
foreach (BsonDocument bson in listBson)
{
OutputBuffer.AddRow();
OutputBuffer.DatalineX = (string) bson.GetValue("data");
}
But this piece of code that works great with small file does not work with a 6 million line file. That is, there are no lines in the output. The other following tasks validate but react as if they had received nothing as input.
Where could the problem come from?
Your OuputBuffer has DatalineX defined as a string, either DT_STR or DT_WSTR and a specific length. When you exceed that value, things go bad. In normal strings, you'd have a maximum length of 8k or 4k respectively.
Neither of which are useful for your use case of at least 6M characters. To handle that, you'll need to change your data type to DT_TEXT/DT_NTEXT Those data types do not require a length as they are "max" types. There are lots of things to be aware of when using the LOB types.
Performance can suck depending on whether SSIS can keep the data in memory (good) or has to write intermediate values to disk (bad)
You can't readily manipulate them in a data flow
You'll use a different syntax in a Script Component to work with them
e.g.
// TODO: convert to bytes
Output0Buffer.DatalineX.AddBlobData(bytes);
Longer example of questionable accuracy with regard to encoding the bytes that you get to solve at https://stackoverflow.com/a/74902194/181965
I have a pseudo-code question for a problem I've encountered. I have a binary file of recorded variable data at certain record rates (20Hz,40Hz, etc..). This information is linear in the file. For example if I have var1 and var2, I'd read from the file var1's data, then var2's data, then var1's next sample, etc...I'm pretty sure the best way to construct a CSV is by row. My original thought was to just read in the binary file and parse the information into a contemporary buffer/structure. Once all the binary data is read in then begin writing the CSV file by row. My only concern with this approach is memory consumption. There can be anywhere from 300-400 parameters recorded as high as 160HZ. That's a lot of data to have stored. I was wondering if there's any other approaches that are more efficient. Language I'm using is C#
As I understand it, you have:
{ some large number of var1 samples }
{ some large number of var2 samples }
{ some large number of var3 samples }
And you want to create:
var1, var2, var3, etc.
var1, var2, var3, etc.
If you have enough memory to hold all of that data, then your first approach is the way to go.
Only you can say whether you have enough memory. If the file is all binary data (i.e. integers, floats, doubles, etc.), then you can get a pretty good idea of how much memory you'll need just by looking at the size of the file.
Assuming that you don't have enough memory to hold all of the data at once, you could easily process the data in two passes.
On the first pass, you read all of the var1 data and immediately write it to a temporary file called var1Data. Then do the same with var2, var3, etc. When the first pass is done, you have N binary files, each one containing the data for that variable.
The second pass is a simple matter of opening all of those files, and then looping:
while not end of data
read from var1Data
read from var2Data
read from var3Data
etc.
create structure
write to CSV
Or, you could do this:
while not end of data
read from var1Data
write to CSV
read from var2Data
write to CSV
etc.
Granted, it's two passes over the data, but if you can't fit all of the data into memory that's the way you'll have to go.
One drawback is that you'll have 300 or 400 files open concurrently. That shouldn't be a problem. But there is another way to do it.
On the first pass, read, say, the first 100,000 values for each parameter into memory, create your structures, and write those to the CSV. Then make another pass over the file, reading items 100,000 to 199,999 for each parameter into memory and append to the CSV. Do that until you've processed the entire file.
That might be easier, depending on how your binary file is structured. If you know where each parameter's data starts in the file, and all the values for that parameter are the same size, then you can seek directly to the start for that parameter (or to the 100,000th entry for that parameter), and start reading. And once you've read however many values for var1, you can seek directly to the start of the var2 data and start reading from there. You skip over data you're not ready to process in this pass.
Which method to use will depend on how much memory you have and how your data is structured. As I said, if it all fits into memory then your job is very easy. If it won't fit into memory, then if the binary file is structured correctly you can do it with multiple passes over the input file, on each pass skipping over the data you don't want for that pass. Otherwise, you can use the multiple files method, or you can do multiple passes over the input, reading sequentially (i.e. not skipping over data).
I need to build an index for a very big (50GB+) ASCII text file which will enable me to provide fast random read access to file (get nth line, get nth word in nth line). I've decided to use List<List<long>> map, where map[i][j] element is position of jth word of ith line in the file.
I will build the index sequentially, i.e. read the whole file and populating index with map.Add(new List<long>()) (new line) and map[i].Add(position) (new word). I will then retrieve specific word position with map[i][j].
The only problem I see is that I can't predict total count of lines/words, so I will bump into O(n) on every List reallocation, no idea of how I can avoid this.
Are there any other problems with the data structure I chose for the task? Which structure could be better?
UPD: File will not be altered during the runtime. There are no other ways to retrieve content except what I've listed.
Increasing size of a large list is very expensive operation; so, it's better to reserve list size at the beginning.
I'd suggest to use 2 lists. The first contains indexes of words within file, and the second contains indexes in the first list (index of the first word in the appropriate line).
You are very likely to exceed all available RAM. And when the system starts to page in/page out GC-managed RAM, performance of the program will be completely killed. I'd suggest to store your data in memory-mapped file rather than in managed memory. http://msdn.microsoft.com/en-us/library/dd997372.aspx
UPD memory mapped files are effective, when you need to work with huge amounts of data not fitting in RAM. Basically, it's your the only choice if your index becomes bigger than available RAM.
I have an issue where I need to load a fixed-length file. Process some of the fields, generate a few others, and finally output a new file. The difficult part is that the file is of part numbers and some of the products are superceded by other products (which can also be superceded). What I need to do is follow the superceded trail to get information I need to replace some of the fields in the row I am looking at. So how can I best handle about 200000 lines from a file and the need to move up and down within the given products? I thought about using a collection to hold the data or a dataset, but I just don't think this is the right way. Here is an example of what I am trying to do:
Before
Part Number List Price Description Superceding Part Number
0913982 3852943
3852943 0006710 CARRIER,BEARING
After
Part Number List Price Description Superceding Part Number
0913982 0006710 CARRIER,BEARING 3852943
3852943 0006710 CARRIER,BEARING
As usual any help would be appreciated, thanks.
Wade
Create structure of given fields.
Read file and put structures in collection. You may use part number as key for hashtable to provide fastest searching.
Scan collection and fix the data.
200 000 objects from given lines will fit easily in memory.
For example.
If your structure size is 50 bytes then you will need only 10Mb of memory. It is nothing for modern PC.
I am writing a program to diff, and copy entire files or segments based on changes on either end (Rsync-esque... but more like Unison). The main idea is to keep my music folder (all mp3s) up to date over multiple locations.
I'd like to send segmented updates if only small portions of the file have changed, as opposed to copying the entire file. For this, I need a way to diff segments of the file.
I initially tried generating hashes for blocks of every file (Every n bytes I'd hash the segment). I noticed that when I changed one attribute (id3v2 tag on an mp3) all the hashed blocks would change. This makes sense, as I would guess the header is growing as it acquired new information.
This leads me to my actual question. I would like to know how to determine the length of an mp3's header, so I could create 2 comparable hashes.
1) The meta info of the file (header)
2) The actual mpeg stream with audio (This hash should remain unchanged if all I do is alter tag info)
Am I missing anything else?
Thanks!
Ty
If all you want to check the length of is id3v2 tags, then you can find out information about its structure at http://www.id3.org/id3v2.4.0-structure.
If you read the first 3 bytes, and they are equal to "ID3", then skip to the 7th byte, then read the header size. Be careful though, because the size is stored as a "synchsafe integer".
If you want to determine the header information, you'll either:
a) need to use a mp3 library that can do the parsing for you, or
b) go to the mp3 specification and parse it out as needed.
I wound up using TagLibSharp. developer.novell.com/wiki/index.php/TagLib_Sharp