I have 500 CSV files, each about 10~20 MB in size.
As a sample, the content of the files looks like below ↓
file1:
column1 column2 column3 column4 .... column50
file2:
column51 column52 ... ... column100
So, what I want to do is merge all the files into one large file like below ↓
fileAll:
column1 , column2 ...... column2500
My solution now is:
1. Merge each group of 100 files into 5 large files.
2. Merge the 5 large files into one large file.
But the performance is very bad.
So, can anyone give me some advice to improve the performance?
Thanks!
What language are you working in? Off the top of my head, I would think you would get the best performance by doing a line-by-line stream.
So, for instance, read the first line of all the files in, write the first line of the merge out, and continue until you're done.
The reason this is better than your solution is that your solution reads and writes the same data to and from disk several times, which is slow. I assume you can't fit all the files in memory (and you wouldn't want to anyway; the caching would be horrible), but you want to minimize disk reads and writes (the slowest operations) and try to do it in a fashion where each segment to be written fits in your cache.
Also, depending on what language you're using, you may be taking a huge hit on concatenating strings. Any language that uses null-terminated arrays as its string implementation is going to take a huge hit when concatenating large strings, because it has to search for the null terminator; C is an example off the top of my head. So you may want to limit the size of the strings you work with. In the above example, read in x many chars, write out x many chars, etc. But you should still only be reading the data in once, and writing the data out once, if at all possible.
You could try doing it as a streamed operation; don't do 1. load file 1, 2. load file 2, 3. merge, 4. write result. Instead do 1. read line 1 of each file, 2. merge the lines, 3. write the merged line. This way you speed things up by doing smaller chunks of read, process, write, and thereby allow the disk to empty its read/write buffers while you merge each line (row). There could be other things slowing down your process; please post your code. For example, string operations could easily be slowing things down if not done carefully. Finally, Release mode (as opposed to Debug) is more optimized and will typically run significantly faster.
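A minimal C# sketch of the streamed merge both answers describe (C# is assumed since the later questions use it; the comma delimiter and the assumption that every file has the same number of lines are illustrative, not taken from the question):

```csharp
using System.IO;
using System.Linq;

class MergeColumns
{
    static void Main(string[] args)
    {
        // args: output file followed by the input files, in column order
        string outputPath = args[0];
        string[] inputPaths = args.Skip(1).ToArray();

        // One open reader per input file, a single writer for the output.
        StreamReader[] readers = inputPaths.Select(p => new StreamReader(p)).ToArray();
        using (StreamWriter writer = new StreamWriter(outputPath))
        {
            while (true)
            {
                // Read the current line of every file.
                string[] pieces = readers.Select(r => r.ReadLine()).ToArray();
                if (pieces[0] == null) break;   // assumes all files have equal line counts

                // Join this row's pieces with the CSV delimiter and write once.
                writer.WriteLine(string.Join(",", pieces));
            }
        }
        foreach (StreamReader r in readers) r.Dispose();
    }
}
```

Each input row is read once and written once. Note that this holds 500 file handles open at the same time; that is usually fine on modern operating systems, but the per-process limit is worth checking.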
Related
I have a pseudo-code question for a problem I've encountered. I have a binary file of variable data recorded at certain rates (20 Hz, 40 Hz, etc.). This information is laid out linearly in the file: for example, if I have var1 and var2, I'd read var1's data from the file, then var2's data, then var1's next sample, and so on. I'm pretty sure the best way to construct a CSV is by row. My original thought was to just read in the binary file and parse the information into a temporary buffer/structure, and once all the binary data is read in, begin writing the CSV file row by row. My only concern with this approach is memory consumption: there can be anywhere from 300 to 400 parameters recorded at rates as high as 160 Hz. That's a lot of data to have stored. I was wondering whether there are other approaches that are more efficient. The language I'm using is C#.
As I understand it, you have:
{ some large number of var1 samples }
{ some large number of var2 samples }
{ some large number of var3 samples }
And you want to create:
var1, var2, var3, etc.
var1, var2, var3, etc.
If you have enough memory to hold all of that data, then your first approach is the way to go.
Only you can say whether you have enough memory. If the file is all binary data (i.e. integers, floats, doubles, etc.), then you can get a pretty good idea of how much memory you'll need just by looking at the size of the file.
Assuming that you don't have enough memory to hold all of the data at once, you could easily process the data in two passes.
On the first pass, you read all of the var1 data and immediately write it to a temporary file called var1Data. Then do the same with var2, var3, etc. When the first pass is done, you have N binary files, each one containing the data for that variable.
The second pass is a simple matter of opening all of those files, and then looping:
while not end of data
read from var1Data
read from var2Data
read from var3Data
etc.
create structure
write to CSV
Or, you could do this:
while not end of data
read from var1Data
write to CSV
read from var2Data
write to CSV
etc.
Granted, it's two passes over the data, but if you can't fit all of the data into memory that's the way you'll have to go.
One drawback is that you'll have 300 or 400 files open concurrently. That shouldn't be a problem, but if it is, there is another way to do it.
On the first pass, read, say, the first 100,000 values for each parameter into memory, create your structures, and write those to the CSV. Then make another pass over the file, reading items 100,000 to 199,999 for each parameter into memory and append to the CSV. Do that until you've processed the entire file.
That might be easier, depending on how your binary file is structured. If you know where each parameter's data starts in the file, and all the values for that parameter are the same size, then you can seek directly to the start for that parameter (or to the 100,000th entry for that parameter), and start reading. And once you've read however many values for var1, you can seek directly to the start of the var2 data and start reading from there. You skip over data you're not ready to process in this pass.
Which method to use will depend on how much memory you have and how your data is structured. As I said, if it all fits into memory then your job is very easy. If it won't fit into memory, then if the binary file is structured correctly you can do it with multiple passes over the input file, on each pass skipping over the data you don't want for that pass. Otherwise, you can use the multiple files method, or you can do multiple passes over the input, reading sequentially (i.e. not skipping over data).
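A hedged C# sketch of the temporary-file method from the first pass onward, assuming the layout described above (all of var1's samples, then all of var2's) and 32-bit float samples; the file names, variable count, and sample count are placeholders, not values from the question:

```csharp
using System;
using System.IO;

class TwoPassTranspose
{
    static void Main()
    {
        int varCount = 300;         // assumed number of recorded parameters
        int samplesPerVar = 100000; // assumed samples per variable in the input
        string input = "recording.bin";

        // Pass 1: split the input into one binary temp file per variable.
        string[] tempPaths = new string[varCount];
        using (BinaryReader reader = new BinaryReader(File.OpenRead(input)))
        {
            for (int v = 0; v < varCount; v++)
            {
                tempPaths[v] = Path.Combine(Path.GetTempPath(), "var" + v + ".tmp");
                using (BinaryWriter w = new BinaryWriter(File.Create(tempPaths[v])))
                    for (int i = 0; i < samplesPerVar; i++)
                        w.Write(reader.ReadSingle());
            }
        }

        // Pass 2: open every temp file and write one CSV row per sample index.
        BinaryReader[] readers = new BinaryReader[varCount];
        for (int v = 0; v < varCount; v++)
            readers[v] = new BinaryReader(File.OpenRead(tempPaths[v]));

        using (StreamWriter csv = new StreamWriter("recording.csv"))
        {
            for (int i = 0; i < samplesPerVar; i++)
            {
                string[] row = new string[varCount];
                for (int v = 0; v < varCount; v++)
                    row[v] = readers[v].ReadSingle().ToString();
                csv.WriteLine(string.Join(",", row));
            }
        }
        foreach (BinaryReader r in readers) r.Dispose();
    }
}
```

Only one variable's worth of data streams through memory at a time, at the cost of writing every value to disk twice.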
Can anyone let me know the fastest way of showing a range of lines in a file of 5 GB size? For example: the file is 5 GB and has the line number as one of its columns. Say the file has 1 million lines, and I have a start line number and an end line number. If I want to read the 25th line through the 89th line of the large file, is there any fast way of reading those specific lines in C#, rather than reading each and every line from the beginning?
In short, no. How can you possibly know where the line breaks are before you actually read them?
To avoid memory issues you could stream the file lazily:
File.ReadLines(path)
    .Skip(startLine - 1)
    .Take(endLine - startLine + 1)
File.ReadLines enumerates the file on demand, so lines after the range are never read; the lines before it, however, still have to be scanned to find the line breaks.
5 GB is a huge amount of data to sift through without building some sort of index. I think you've stumbled upon a case where loading your data into a database and adding the appropriate indexes might serve you best.
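If the same file is queried repeatedly, a one-time index of line start offsets makes later range reads cheap. A sketch of that idea (the byte-at-a-time scan is kept simple for illustration; a buffered scan would be faster in practice, and the index itself could be persisted to disk):

```csharp
using System.Collections.Generic;
using System.IO;

class LineIndex
{
    // Build once: byte offset of the start of every line (offsets[0] = line 1).
    public static List<long> Build(string path)
    {
        var offsets = new List<long> { 0 };
        using (FileStream fs = File.OpenRead(path))
        {
            int b;
            long pos = 0;
            while ((b = fs.ReadByte()) != -1)
            {
                pos++;
                if (b == '\n') offsets.Add(pos);  // next line starts after each newline
            }
        }
        return offsets;
    }

    // Read lines startLine..endLine (1-based, inclusive) by seeking straight to the range.
    public static IEnumerable<string> Range(string path, List<long> offsets,
                                            int startLine, int endLine)
    {
        using (FileStream fs = File.OpenRead(path))
        {
            fs.Seek(offsets[startLine - 1], SeekOrigin.Begin);
            using (StreamReader reader = new StreamReader(fs))
                for (int i = startLine; i <= endLine; i++)
                    yield return reader.ReadLine();
        }
    }
}
```

Building the index still costs one full pass over the 5 GB, so this only pays off when many range queries follow.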
This question is related to the discussion here, but can also be treated standalone.
Also, I think it would be nice to have the relevant results in one separate thread; I couldn't find anything on the web that deals with the topic comprehensively.
Let's say we need to work with a very, very large List<double> of data (e.g. 2 billion entries).
Having such a large list loaded into memory leads either to a "System.OutOfMemoryException",
or, if one uses <gcAllowVeryLargeObjects enabled="true" />, it eventually just eats up the entire memory.
First: how to minimize the size of the container:
Would an array of 2 billion doubles be less expensive?
Should I use decimals instead of doubles?
Is there a data type even less expensive than decimal in C#? (A range of -100
to 100 in 0.00001 steps would do the job for me.)
Second: saving the data to disk and reading it back:
What would be the fastest way to save the list to disk and read it again?
Here I think the size of the file could be a problem. A text file containing 2 billion entries will be huge, and opening it could take ages. (Perhaps some kind of stream between the program and the file would do the job?) Some sample code would be much appreciated :)
Third: iteration:
If the list is in memory, I think using yield would make sense.
If the list is saved to disk, the speed of iteration is mostly limited by the speed at
which we can read it from the disk.
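On the second point, a binary stream avoids text parsing entirely: each double is exactly 8 bytes on disk. A hedged sketch with BinaryWriter/BinaryReader (the file name and the 1 MB buffer size are assumptions):

```csharp
using System.Collections.Generic;
using System.IO;

class DoubleStream
{
    // Write a sequence of doubles to a compact binary file (8 bytes per value).
    public static void Save(string path, IEnumerable<double> values)
    {
        using (var writer = new BinaryWriter(
            new BufferedStream(File.Create(path), 1 << 20)))  // 1 MB buffer, assumed
            foreach (double v in values)
                writer.Write(v);
    }

    // Lazily read the values back one at a time, so nothing large sits in memory.
    public static IEnumerable<double> Load(string path)
    {
        long count = new FileInfo(path).Length / sizeof(double);
        using (var reader = new BinaryReader(
            new BufferedStream(File.OpenRead(path), 1 << 20)))
            for (long i = 0; i < count; i++)
                yield return reader.ReadDouble();
    }
}
```

For the stated range (-100 to 100 in 0.00001 steps), the values could also be stored as 32-bit integer step counts, halving the file size relative to doubles.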
I have an issue where I need to load a fixed-length file, process some of the fields, generate a few others, and finally output a new file. The difficult part is that the file contains part numbers, and some of the products are superseded by other products (which can themselves be superseded). What I need to do is follow the supersession trail to get the information I need to replace some of the fields in the row I am looking at. So how can I best handle about 200,000 lines from a file while needing to move up and down within the given products? I thought about using a collection to hold the data, or a DataSet, but I just don't think that is the right way. Here is an example of what I am trying to do:
Before

Part Number   List Price   Description       Superseding Part Number
0913982                                      3852943
3852943       0006710      CARRIER,BEARING

After

Part Number   List Price   Description       Superseding Part Number
0913982       0006710      CARRIER,BEARING   3852943
3852943       0006710      CARRIER,BEARING
As usual any help would be appreciated, thanks.
Wade
1. Create a structure for the given fields.
2. Read the file and put the structures in a collection. You can use the part number as the key of a hashtable to provide the fastest searching.
3. Scan the collection and fix the data.
200,000 objects built from the given lines will fit easily in memory.
For example, if your structure size is 50 bytes, you will need only about 10 MB of memory. That is nothing for a modern PC.
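The three steps above might look like the following sketch, using a Dictionary keyed by part number (the field names mirror the example columns; treating the final part in each chain as the source of the price and description is my reading of the example):

```csharp
using System.Collections.Generic;

class Part
{
    public string PartNumber;
    public string ListPrice;
    public string Description;
    public string SupersedingPartNumber;  // empty/null when the part is current
}

class SupersessionFixup
{
    // Step 3: follow each part's supersession chain to the final part
    // and copy that part's fields back onto the original row.
    public static void Resolve(Dictionary<string, Part> partsByNumber)
    {
        foreach (Part part in partsByNumber.Values)
        {
            Part current = part;
            // Walk the chain until we reach a part with no successor.
            // Real data could contain a cycle; production code should
            // guard with a visited set or a hop limit.
            while (!string.IsNullOrEmpty(current.SupersedingPartNumber)
                   && partsByNumber.TryGetValue(current.SupersedingPartNumber,
                                                out Part next))
                current = next;

            part.ListPrice = current.ListPrice;
            part.Description = current.Description;
        }
    }
}
```

Each Dictionary lookup is O(1), so the whole fixup is roughly linear in the total chain length across all 200,000 rows.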
I have a binary file which can be seen as a concatenation of different sub-files:
INPUT FILE:
Hex Offset ID SortIndex
0000000 SubFile#1 3
0000AAA SubFile#2 1
0000BBB SubFile#3 2
...
FFFFFFF SubFile#N N
This is the information I have about each SubFile:
Starting offset
Length in bytes
Final sequence order
What's the fastest way to produce a sorted output file, in your opinion?
For instance OUTPUT FILE will contain the SubFile in the following order:
SubFile#2
SubFile#3
SubFile#1
...
I have thought about:
Splitting the input file, extracting each SubFile to disk, and then
concatenating them in the correct order
Using file seeks to move around the file, adding each SubFile to a BinaryWriter stream
Consider the following information also:
The input file can be really huge (200 MB~1 GB)
For those who know, I am speaking about IBM AFP files.
Both my solutions are easy to implement, but they really don't look performant in my opinion.
Thanks in advance
Even if the file is big, the number of IDs is not that huge.
You can just load all your IDs, sort indexes, offsets, and lengths into RAM, sort them there with a simple quicksort, and when you finish, rewrite the entire file in the order you have in your sorted array.
I expect this to be faster than the other methods.
So... let's write some pseudocode.
public struct FileItem : IComparable<FileItem>
{
public String Id;
public int SortIndex;
public uint Offset;
public uint Length;
public int CompareTo(FileItem other) { return this.SortIndex.CompareTo(other.SortIndex); }
}
public static FileItem[] LoadAndSortFileItems(Stream inputFile)
{
    FileItem[] result = /* fill the array from the file's index */;
    Array.Sort(result);
    return result;
}
public static void WriteFileItems(FileItem[] items, Stream inputFile, Stream outputFile)
{
    byte[] buffer = new byte[1 << 20]; // 1 MB copy buffer
    foreach (FileItem item in items)
    {
        // Copy inputFile[item.Offset .. item.Offset + item.Length] to outputFile.
        inputFile.Seek(item.Offset, SeekOrigin.Begin);
        long remaining = item.Length;
        while (remaining > 0)
        {
            int read = inputFile.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
            outputFile.Write(buffer, 0, read);
            remaining -= read;
        }
    }
}
The number of read operations is linear, O(n), but seeking is required.
The only performance problem with seeking is hard-drive cache misses.
Modern hard drives have a big cache, from 8 to 32 megabytes, and seeking around a big file in random order means cache misses, but I would not worry too much, because the amount of time spent copying the files is, I guess, greater than the amount of time required by the seeks.
If you are using a solid-state disk instead, seek time is effectively zero :)
Writing the output file, however, is O(n) and sequential, and this is a very good thing since it is totally cache friendly.
You can ensure a better time if you preallocate the size of the file before starting to write it:
FileStream myFileStream = ...
myFileStream.SetLength(predictedTotalSizeOfFile);
Sorting the FileItem structures in RAM is O(n log n), but even with 100,000 items it will be fast and use only a small amount of memory.
The copy is the slowest part. Use a block size of 256 kilobytes to 2 megabytes to ensure that copying big chunks of file A to file B is fast; you can tune the block size by running some tests, always keeping in mind that every machine is different.
It is not useful to try a multithreaded approach; it will just slow down the copy.
It is obvious, but copying from drive C: to drive D:, for example, will be faster (of course, not two partitions but two different SATA drives).
Consider also that you need to seek either while reading or while writing; at some point, you will need to seek. Also, if you split the original file into several smaller files, you will make the OS seek among the smaller files, and this doesn't make sense: it will be messy, slower, and probably also more difficult to code.
Consider also that if the files are fragmented, the OS will seek by itself, and that is out of your control.
The first solution I thought of was to read the input file sequentially and build a SubFile object for every sub-file. These objects are put into a B+ tree as soon as they are created, with the tree ordering the sub-files by their SortIndex. A good B+ tree implementation has linked leaf nodes, which lets you iterate over the sub-files in the correct order and write them to the output file.
Another way could be to use random-access files: load all the SortIndexes and offsets, sort them, and write the output file in that order. In this case everything depends on how the random-access file reader is implemented; if it just reads the file sequentially up to the specified position, it would not be very performant. Honestly, I have no idea how they work... :(