I have a pseudo-code question for a problem I've encountered. I have a binary file of recorded variable data at certain record rates (20 Hz, 40 Hz, etc.). This information is laid out linearly in the file: for example, if I have var1 and var2, I'd read var1's sample, then var2's sample, then var1's next sample, and so on.
I'm pretty sure the best way to construct a CSV is by row. My original thought was to read in the binary file and parse the information into an intermediate buffer/structure, and once all the binary data is read in, begin writing the CSV file row by row. My only concern with this approach is memory consumption: there can be anywhere from 300-400 parameters recorded at rates as high as 160 Hz, and that's a lot of data to hold in memory. I was wondering whether there are any other approaches that are more efficient. The language I'm using is C#.
As I understand it, you have:
{ some large number of var1 samples }
{ some large number of var2 samples }
{ some large number of var3 samples }
And you want to create:
var1, var2, var3, etc.
var1, var2, var3, etc.
If you have enough memory to hold all of that data, then your first approach is the way to go.
Only you can say whether you have enough memory. If the file is all binary data (i.e. integers, floats, doubles, etc.), then you can get a pretty good idea of how much memory you'll need just by looking at the size of the file.
Assuming that you don't have enough memory to hold all of the data at once, you could easily process the data in two passes.
On the first pass, you read all of the var1 data and immediately write it to a temporary file called var1Data. Then do the same with var2, var3, etc. When the first pass is done, you have N binary files, each one containing the data for that variable.
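A minimal sketch of that first pass in C#, assuming (as above) that each variable's samples sit contiguously in the input, that every sample is a 4-byte float, and that the per-variable sample counts are already known; all of those are assumptions for illustration:
using System.IO;

// Hypothetical first pass: split one sequential binary file into one temp file
// per variable. Assumes 4-byte float samples and known per-variable sample counts.
static void SplitIntoTempFiles(string inputPath, long[] sampleCounts)
{
    using (var input = new BinaryReader(File.OpenRead(inputPath)))
    {
        for (int v = 0; v < sampleCounts.Length; v++)
        {
            using (var temp = new BinaryWriter(File.Create("var" + (v + 1) + "Data.tmp")))
            {
                for (long s = 0; s < sampleCounts[v]; s++)
                    temp.Write(input.ReadSingle());   // copy one sample at a time
            }
        }
    }
}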
The second pass is a simple matter of opening all of those files, and then looping:
while not end of data
    read from var1Data
    read from var2Data
    read from var3Data
    etc.
    create structure
    write to CSV
Or, you could do this:
while not end of data
    read from var1Data
    write to CSV
    read from var2Data
    write to CSV
    etc.
Granted, it's two passes over the data, but if you can't fit all of the data into memory that's the way you'll have to go.
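A sketch of that second pass in C#, assuming the temp files all hold 4-byte floats and the same number of rows (the method and parameter names are placeholders):
using System.IO;
using System.Linq;

// Hypothetical second pass: open every per-variable temp file and emit the CSV
// row by row. Assumes each temp file holds 4-byte floats and the same row count.
static void MergeTempFilesToCsv(string[] tempPaths, string csvPath, long rowCount)
{
    BinaryReader[] readers = tempPaths.Select(p => new BinaryReader(File.OpenRead(p))).ToArray();
    try
    {
        using (var csv = new StreamWriter(csvPath))
        {
            for (long row = 0; row < rowCount; row++)
            {
                // one value from each variable's file becomes one CSV row
                var fields = readers.Select(r => r.ReadSingle().ToString());
                csv.WriteLine(string.Join(",", fields));
            }
        }
    }
    finally
    {
        foreach (var r in readers) r.Dispose();
    }
}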
One drawback is that you'll have 300 or 400 files open concurrently. That shouldn't be a problem. But there is another way to do it.
On the first pass, read, say, the first 100,000 values for each parameter into memory, create your structures, and write those to the CSV. Then make another pass over the file, reading items 100,000 to 199,999 for each parameter into memory and append to the CSV. Do that until you've processed the entire file.
That might be easier, depending on how your binary file is structured. If you know where each parameter's data starts in the file, and all the values for that parameter are the same size, then you can seek directly to the start for that parameter (or to the 100,000th entry for that parameter), and start reading. And once you've read however many values for var1, you can seek directly to the start of the var2 data and start reading from there. You skip over data you're not ready to process in this pass.
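A sketch of that kind of direct seek, assuming 4-byte values and a known starting offset for the parameter (both assumptions on my part):
using System.IO;

// Hypothetical: if a parameter's data starts at a known byte offset and every
// value is 4 bytes, you can seek straight to the slice you want for this pass.
static float[] ReadParameterSlice(FileStream file, long parameterStartOffset,
                                  long firstValueIndex, int valueCount)
{
    file.Seek(parameterStartOffset + firstValueIndex * sizeof(float), SeekOrigin.Begin);
    var reader = new BinaryReader(file);
    var values = new float[valueCount];
    for (int i = 0; i < valueCount; i++)
        values[i] = reader.ReadSingle();
    return values;
}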
Which method to use will depend on how much memory you have and how your data is structured. As I said, if it all fits into memory then your job is very easy. If it won't fit into memory, then if the binary file is structured correctly you can do it with multiple passes over the input file, on each pass skipping over the data you don't want for that pass. Otherwise, you can use the multiple files method, or you can do multiple passes over the input, reading sequentially (i.e. not skipping over data).
I'm currently using SSIS to make an improvement on a project. I need to insert single documents into a MongoDB collection of type Time Series. At some point I want to retrieve rows of data after going through a C# transformation script. I did this:
foreach (BsonDocument bson in listBson)
{
    OutputBuffer.AddRow();
    OutputBuffer.DatalineX = (string)bson.GetValue("data");
}
But this piece of code, which works great with a small file, does not work with a 6-million-line file. That is, there are no lines in the output. The other following tasks validate but react as if they had received nothing as input.
Where could the problem come from?
Your OutputBuffer has DatalineX defined as a string, either DT_STR or DT_WSTR, with a specific length. When you exceed that length, things go bad. Even at their maximums, those normal string types top out at 8,000 or 4,000 characters respectively.
Neither of which is useful for your use case of at least 6M characters. To handle that, you'll need to change your data type to DT_TEXT/DT_NTEXT. Those data types do not require a length, as they are "max" types. There are lots of things to be aware of when using the LOB types:
Performance can suck depending on whether SSIS can keep the data in memory (good) or has to write intermediate values to disk (bad)
You can't readily manipulate them in a data flow
You'll use a different syntax in a Script Component to work with them
e.g.
// the blob column wants bytes; Unicode (UTF-16) is the usual encoding choice for DT_NTEXT
byte[] bytes = System.Text.Encoding.Unicode.GetBytes((string)bson.GetValue("data"));
Output0Buffer.DatalineX.AddBlobData(bytes);
There's a longer example, of questionable accuracy with regard to encoding the bytes, that you can adapt at https://stackoverflow.com/a/74902194/181965
I use the CSV reader and found that it takes a lot of time to parse the data. How can I load the entire CSV file into memory and then process it record by record, since I have to do custom mapping of the records?
TextReader tr = new StreamReader(File.Open(@"C:\MarketData\" + symbol + ".txt", FileMode.Open));
CsvReader csvr = new CsvReader(tr);
while (csvr.Read())
{
    // do your magic
}
Create a class that exactly represents/mirrors your CSV file. Then read all the contents into a list of that class. The following snip is from CsvHelper's documentation.
var csv = new CsvReader( textReader );
var records = csv.GetRecords<MyClass>().ToList();
The important part is the .ToList(), as this will force the load of all the data into your list, rather than yielding results as you access them.
You can then perform additional mapping / extraction on that list, which will be in memory.
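For illustration, MyClass is just a plain class whose property names are assumed to match your CSV headers; the particular columns below are made up:
using System;

// Hypothetical record type; the property names are assumed to match your CSV headers.
public class MyClass
{
    public string Symbol { get; set; }
    public DateTime Date { get; set; }
    public decimal Open { get; set; }
    public decimal Close { get; set; }
}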
If you're already doing this, you may benefit from loading your CSV into a HashSet rather than a List via ToHashSet(). See HashSet vs List Performance.
To answer your question directly: you can load the file fully into a memory stream and then re-read from that stream using your CsvReader. Similarly, you can create a bigger read buffer for your FileStream, e.g. 15 MB, which would read the entire file into the buffer in one hit. I doubt either of these will actually improve performance for 10 MB files.
Find your real performance bottleneck: the time to read the file content from disk, the time to parse the CSV into fields, or the time to process a record? A 10 MB file looks really small. I'm processing sets of 250 MB+ CSV files with a custom CSV reader with no complaints.
If processing is the bottleneck and you have several threads available and your CSV file format does not need to support escaped line breaks, then you could read the entire file into a list of lines (System.IO.File.ReadAllLines / .ReadLines) and parse each line using a different Task. For example:
System.IO.File.ReadLines("marketdata.csv")   // any input path; the original snippet omitted it
    .Skip(1)                  // header line. Assume trusted to be correct.
    .AsParallel()
    .Select(ParseRecord)      // RecordClass ParseRecord(string line)
    .ForAll(ProcessRecord);   // void ProcessRecord(RecordClass record)
If you have many files to parse, you could process each file in a different Task and use async methods to maximise throughput. If they all come from the same physical disk then your mileage will vary, and may even get worse than a single-threaded approach.
More advanced:
If you know your files contain 8-bit characters only, then you can operate on byte arrays and skip the StreamReader overhead of converting bytes into chars. This way you can read the entire file into a byte array in a single call and scan for line breaks, assuming no line-break escapes need to be supported. In that case scanning for line breaks can be done by multiple threads, each looking at a part of the byte array (see the sketch after this list).
If you don't need to support field escapes (a,"b,c",d), then you can write a faster parser, simply looking for field separators (typically comma). You can also split field-demarcation parsing and field content parsing into threads if that's a bottleneck, though memory access locality may negate any benefits.
Under certain circumstances you may not need to parse fields into intermediate data structures (e.g. doubles, strings) and can work directly off references to the start/end of fields, saving yourself some intermediate data structure creation.
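A rough sketch of the byte-array line scanning mentioned above; the slice count and the assumption of '\n' line endings with no escaped line breaks are mine:
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// Read the whole file once, then let each task scan its own slice for '\n'.
// Assumes 8-bit characters and no escaped line breaks, as described above.
static int[] FindLineBreaks(string path, int sliceCount = 4)
{
    byte[] data = File.ReadAllBytes(path);
    var offsets = new ConcurrentBag<int>();
    int sliceSize = (data.Length + sliceCount - 1) / sliceCount;

    Parallel.For(0, sliceCount, slice =>
    {
        int start = slice * sliceSize;
        int end = Math.Min(start + sliceSize, data.Length);
        for (int i = start; i < end; i++)
            if (data[i] == (byte)'\n')
                offsets.Add(i);            // position of every line break in this slice
    });

    int[] result = offsets.ToArray();
    Array.Sort(result);                    // slices finish in arbitrary order
    return result;
}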
I am looking to create a file by structuring it in size blocks. Essentially I am looking to create a rudimentary file system.
I need to write a header, and then a possibly "infinite" number of entries of the same size/structure. The important parts are:
Each block of data needs to be read/writable individually
Header needs to be readable/writable as its own entity
Need a way to store this data and be able to determine its location in the file quickly
I would imagine the file would resemble something like:
[HEADER][DATA1][DATA2][DATA3][...]
What is the proper way to handle something like this? Let's say I want to read DATA3 from the file; how do I know where that data chunk starts?
If I understand you correctly, and you need a way to assign names/IDs to your DATA chunks, you can try introducing yet another type of chunk.
Let's call it TOC (table of contents).
So, the file structure will look like [HEADER][TOC1][DATA1][DATA2][DATA3][TOC2][...].
The TOC chunk will contain names/IDs and references to multiple DATA chunks. It will also contain some internal data, such as a pointer to the next TOC chunk (so you might consider each TOC chunk a linked-list node).
At runtime, all TOC chunks could be represented as a kind of hash map, where the key is the name/ID of a DATA chunk and the value is its location in the file.
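A rough C# sketch of that in-memory view; the field names and sizes are assumptions, not a fixed format:
using System.Collections.Generic;
using System.IO;

// Hypothetical in-memory view of the TOC: map each chunk's name/ID to where it lives.
class TocEntry
{
    public string Id;      // name/ID of the DATA chunk
    public long Offset;    // where the chunk starts in the file
    public long Length;    // size of the chunk in bytes
}

class TableOfContents
{
    private readonly Dictionary<string, TocEntry> _entries = new Dictionary<string, TocEntry>();

    public void Add(TocEntry entry) { _entries[entry.Id] = entry; }

    public byte[] ReadChunk(FileStream file, string id)
    {
        TocEntry entry = _entries[id];
        file.Seek(entry.Offset, SeekOrigin.Begin);
        var buffer = new byte[entry.Length];
        file.Read(buffer, 0, buffer.Length);
        return buffer;
    }
}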
You can store the chunk size in the header. If the chunks are variable-sized, you can store pointers that point to the actual chunks. An interesting design for variable-sized chunks is the Postgres heap file page: http://doxygen.postgresql.org/bufpage_8h_source.html
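For fixed-size chunks the lookup is just arithmetic; a minimal sketch, where the header and block sizes are whatever your format defines (assumed known here):
using System.IO;

// If every DATA block has the same size, block N starts at a computable offset.
// headerSize and blockSize are assumed to be defined by your format.
static byte[] ReadBlock(FileStream file, int blockIndex, int headerSize, int blockSize)
{
    long offset = headerSize + (long)blockIndex * blockSize;   // e.g. DATA3 is blockIndex = 2 (zero-based)
    file.Seek(offset, SeekOrigin.Begin);
    var buffer = new byte[blockSize];
    file.Read(buffer, 0, blockSize);
    return buffer;
}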
I am working in reverse but this may help.
I write decompilers for binary files. Generally there is a fixed header of a known number of bytes. This contains specific file identification so we can recognize the file type we are dealing with.
Following that will be a fixed number of bytes containing the number of sections (groups of data). This number tells us how many data pointers there will be. Each data pointer may be four bytes (or whatever you need) representing the start of a data block. From this we can work out the size of each block. The decompiler then reads the blocks one at a time to get the size and location in the file of each data block. The job then is to extract that block of bytes and do whatever is needed.
We step through the file one block at a time. The size of the last block runs from its start pointer to the end of the file.
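A small C# sketch of reading that kind of layout; the 8-byte identifier and the 4-byte count and pointers are assumptions for illustration:
using System.IO;

// Hypothetical layout: fixed identification header, then a section count,
// then one 4-byte start pointer per data block.
static uint[] ReadBlockPointers(string path)
{
    using (var reader = new BinaryReader(File.OpenRead(path)))
    {
        byte[] fileId = reader.ReadBytes(8);       // fixed header identifying the file type
        int sectionCount = reader.ReadInt32();     // how many data blocks follow
        var pointers = new uint[sectionCount];
        for (int i = 0; i < sectionCount; i++)
            pointers[i] = reader.ReadUInt32();     // start of each data block
        return pointers;                           // block i's size = pointers[i + 1] - pointers[i]
    }
}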
I have a binary file which can be seen as a concatenation of different sub-file:
INPUT FILE:
Hex Offset ID SortIndex
0000000 SubFile#1 3
0000AAA SubFile#2 1
0000BBB SubFile#3 2
...
FFFFFFF SubFile#N N
This is the information I have about each SubFile:
Starting Offset
Length in bytes
Final sequence Order
What's the fastest way, in your opinion, to produce a sorted output file?
For instance, the OUTPUT FILE will contain the SubFiles in the following order:
SubFile#2
SubFile#3
SubFile#1
...
I have thought about:
Splitting the input file, extracting each SubFile to disk, then concatenating them in the correct order
Using FileSeek to move around the file, adding each SubFile to a BinaryWriter stream.
Consider the following information also:
The input file can be really huge (200 MB~1 GB)
For those who know, I am speaking about IBM AFP files.
Both of my solutions are easy to implement, but neither looks very performant in my opinion.
Thanks in advance
Even if the file is big, the number of IDs is not so huge.
You can just load all your IDs, sort indexes, offsets and lengths into RAM, sort them with a simple quicksort, and when you finish, rewrite the entire file in the order given by your sorted array.
I expect this to be faster than other methods.
So... let's make some pseudocode.
public struct FileItem : IComparable<FileItem>
{
    public String Id;
    public int SortIndex;
    public uint Offset;
    public uint Length;

    public int CompareTo(FileItem other) { return this.SortIndex.CompareTo(other.SortIndex); }
}

public static FileItem[] LoadAndSortFileItems(Stream inputFile)
{
    FileItem[] result = // fill the array from the ID/SortIndex/offset/length info you already have
    Array.Sort(result); // orders by SortIndex via CompareTo above
    return result;
}

public static void WriteFileItems(FileItem[] items, Stream inputFile, Stream outputFile)
{
    var buffer = new byte[256 * 1024]; // block-copy buffer; tune the size as discussed below
    foreach (FileItem item in items)
    {
        inputFile.Seek(item.Offset, SeekOrigin.Begin);
        long remaining = item.Length;
        while (remaining > 0)
        {
            int read = inputFile.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
            if (read <= 0) break;          // unexpected end of input
            outputFile.Write(buffer, 0, read);
            remaining -= read;
        }
    }
}
The number of read operations is linear, O(n), but seeking is required.
The only performance problem with seeking is missing the hard drive's cache.
Modern hard drives have a big cache, from 8 to 32 megabytes; seeking around a big file in random order means cache misses, but I would not worry too much, because the amount of time spent copying the data is, I guess, greater than the amount of time required by the seeks.
If you are using a solid-state disk instead, seek time is effectively zero :)
Writing the output file, however, is O(n) and sequential, and this is a very good thing since you will be totally cache friendly.
You can ensure better time if you preallocate the size of the file before starting to write it.
FileStream myFileStream = ...
myFileStream.SetLength(predictedTotalSizeOfFile);
Sorting the FileItem structures in RAM is O(n log n), but even with 100,000 items it will be fast and will use only a small amount of memory.
The copy is the slowest part. Use a block size of 256 KB to 2 MB for the block copy to ensure that copying big chunks of file A to file B is fast; you can adjust the block-copy buffer size with some tests, always keeping in mind that every machine is different.
It is not useful to try a multithreaded approach; it will just slow down the copy.
It is obvious, but copying from drive C: to drive D:, for example, will be faster (not two partitions, of course, but two different Serial ATA drives).
Consider also that you need to seek at some point, either while reading or while writing. If you split the original file into several smaller files, you will make the OS seek among the smaller files instead, which doesn't make sense; it will be messy, slower, and probably more difficult to code.
Consider also that if the files are fragmented, the OS will seek by itself, and that is out of your control.
The first solution I thought of was to read the input file sequentially and build a SubFile object for every subfile. These objects are put into a B+ tree as soon as they are created, and the tree orders the subfiles by their SortIndex. A good B+ tree implementation will have linked leaf nodes, which lets you iterate over the subfiles in the correct order and write them into the output file.
Another way could be to use random-access files. You can load all the SortIndexes and offsets, then sort them and write the output file in that order. In that case everything depends on how the random-access file reader is implemented: if it just reads the file up to a specified position, it would not be very performant. Honestly, I have no idea how they work... :(
I have 500 CSV files,
each of them about 10~20 MB in size.
As a sample, the content of the files looks like this:
file1 :
column1 column2 column3 column4 .... column50
file2:
column51 column52 ... ... column100
So, what I want to do is merge all the files into one large file like below:
fileAll
column1 , column2 ...... column2500
My solution now is:
1. Merge the files in groups of 100, giving 5 large files
2. Merge the 5 large files into one large file
But the performance is very bad.
So, can anyone give me some advice to improve the performance?
Thanks!
What language are you working in? Off the top of my head, I would think you would get the best performance by doing a line-by-line stream.
So, for instance, read the first line of all the files in, write the first line of the merged output, and continue until you're done.
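A sketch of that line-by-line merge in C# (the question doesn't name a language, so this is just one way to do it); it assumes every input file has the same number of rows:
using System.IO;
using System.Linq;

// Stream one row from every input file at a time and write one wide row out.
// Assumes all input files have the same number of lines.
static void MergeColumns(string[] inputPaths, string outputPath)
{
    StreamReader[] readers = inputPaths.Select(p => new StreamReader(p)).ToArray();
    try
    {
        using (var output = new StreamWriter(outputPath))
        {
            string firstFileLine;
            while ((firstFileLine = readers[0].ReadLine()) != null)
            {
                var parts = new string[readers.Length];
                parts[0] = firstFileLine;
                for (int i = 1; i < readers.Length; i++)
                    parts[i] = readers[i].ReadLine();       // same row from every other file
                output.WriteLine(string.Join(",", parts));  // one wide row in the merged file
            }
        }
    }
    finally
    {
        foreach (var r in readers) r.Dispose();
    }
}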
The reason this is better than your solution is that your solution reads and writes the same data to and from disk several times, which is slow. I assume you can't fit all the files in memory (and you wouldn't want to anyway; the caching would be horrible), but you want to minimize disk reads and writes (the slowest operations) and try to do it in a fashion where each segment to be written can fit in your cache.
Also, depending on what language you're using, you may be taking a huge hit on concatenating strings. Any language that uses null-terminated arrays as its string implementation is going to take a huge hit when concatenating large strings, because it has to search for the null terminator; Python is the example that comes to mind off the top of my head, although there the cost comes from strings being immutable rather than from the terminator. So you may want to limit the size of the strings you work with. In the above example, read in x many chars, write out x many chars, etc. But you should still only be reading the data in once, and writing the data out once, if at all possible.
You could try doing it as a streamed operation; don't do 1. Load File 1, 2. Load File 2, 3. Merge, 4. Write Result. Instead do 1. Load line 1 of File 1 and File 2, 2. Merge the line, 3. Write the line. This way you speed things up by doing smaller chunks of read, process, write, and thereby allow the disk to empty its read/write buffers while you do the merge of each line (row).
There could be other things slowing down your process; please post code. For example, string operations could easily be slowing things down if not done carefully. Finally, Release mode (as opposed to Debug) is more optimized and will typically run significantly faster.