What is the best way to store data (no serialization, just a Stream plus BinaryWriter/BinaryReader) in this scenario, so that individual files can be accessed quickly and easily?
DataContainer contains 10 files; each file is 1 MB.
If I need to write to or read from file 5, it should touch only that part of the 10 MB container and return 1 MB, located by a unique name/ID, possibly stored in a header. Problems occur when you want to update a file in the middle of the container, because the offsets of the files after it change in the stream (if the updated object is larger or smaller than the existing one).
How do I handle this without having to rewrite the entire DataContainer when updating?
I wish to write this for myself instead of using pre-existing libraries.
Any ideas?
I think you can store the sizes of the files instead of their offsets, so that the offsets can be calculated. For example, consider this container:
file  size
1     10
2     15
3     20
4     25
The offset of file 4 is calculated simply as 10+15+20 = 45.
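The calculation above can be sketched in C# roughly as follows. This is only a minimal illustration: the class name SizeTableContainer, the method names, and the dataStart parameter (where the file data begins, after whatever header holds the size table) are assumptions for the example, not an existing API.

```csharp
using System;
using System.IO;
using System.Text;

public static class SizeTableContainer
{
    // Offset of entry `index` inside the data area:
    // the sum of the sizes of all entries stored before it.
    public static long OffsetOf(long[] sizes, int index)
    {
        long offset = 0;
        for (int i = 0; i < index; i++)
            offset += sizes[i];
        return offset;
    }

    // Reads only the bytes of one entry.
    // dataStart (hypothetical) is the position where the data area begins.
    public static byte[] ReadEntry(Stream container, long dataStart, long[] sizes, int index)
    {
        container.Seek(dataStart + OffsetOf(sizes, index), SeekOrigin.Begin);
        // leaveOpen: true so the caller keeps ownership of the stream.
        using (var reader = new BinaryReader(container, Encoding.UTF8, true))
        {
            return reader.ReadBytes(checked((int)sizes[index]));
        }
    }
}
```

With the sizes from the table, OffsetOf(sizes, 3) gives the offset of file 4 (index 3 when counting from zero). Note that this only locates an entry; growing or shrinking an entry in the middle still moves the data that follows it.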
Can anyone tell me the fastest way to show a range of lines from a 5 GB file? For example, the file is 5 GB in size and has the line number as one of its columns. Say the file has 1 million lines and I have a start line number and an end line number. If I want to read lines 25 to 89 of the large file, is there a faster way in C# than reading every line from the beginning?
In short, no. How can you possibly know where the carriage returns/line numbers are before you actually read them?
To avoid memory issues you could stream the lines lazily instead of loading the whole file:
var lines = File.ReadLines(path)
    .SkipWhile(line => someCondition)
    .TakeWhile(line => someOtherCondition);
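For the concrete case in the question (lines 25 to 89), the same idea can be written with Skip and Take. This is a small sketch; the LineRange class and ReadLineRange method are names made up for the example. It still has to read past the first 24 lines, but it never holds more than one line in memory and stops reading after line 89.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class LineRange
{
    // Lazily yields lines firstLine..lastLine (1-based, inclusive).
    public static IEnumerable<string> ReadLineRange(string path, int firstLine, int lastLine)
    {
        return File.ReadLines(path)
                   .Skip(firstLine - 1)              // discard the lines before the range
                   .Take(lastLine - firstLine + 1);  // stop once the range is complete
    }
}
```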
5GB is a huge amount of data to sift through without building some sort of index. I think you've stumbled upon a case where loading your data into a database and adding the appropriate indexes might serve you best.
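If a database is not an option, one kind of index you could build yourself is a list of the byte offsets at which each line starts. The sketch below is only an illustration under some assumptions: LineIndex, Build and ReadRange are hypothetical names, the file is assumed to be ASCII or UTF-8 with lines ending in "\n" or "\r\n", and the file must not change after the index is built. Building the index costs one full pass over the file; after that, any range can be reached with a single seek.

```csharp
using System.Collections.Generic;
using System.IO;

public static class LineIndex
{
    // One pass over the file: records the byte offset where each line starts.
    public static List<long> Build(string path)
    {
        var offsets = new List<long> { 0 };   // line 1 starts at offset 0
        using (var stream = new BufferedStream(File.OpenRead(path), 1 << 16))
        {
            long position = 0;
            int b;
            while ((b = stream.ReadByte()) != -1)
            {
                position++;
                if (b == '\n') offsets.Add(position);   // next line starts after '\n'
            }
        }
        return offsets;
    }

    // Seeks straight to firstLine (1-based) and yields lines up to lastLine.
    public static IEnumerable<string> ReadRange(string path, List<long> offsets, int firstLine, int lastLine)
    {
        using (var stream = File.OpenRead(path))
        {
            stream.Seek(offsets[firstLine - 1], SeekOrigin.Begin);
            using (var reader = new StreamReader(stream))
            {
                for (int n = firstLine; n <= lastLine; n++)
                {
                    string line = reader.ReadLine();
                    if (line == null) yield break;   // file ended before lastLine
                    yield return line;
                }
            }
        }
    }
}
```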
This question is related to the discussion here but can also be treated stand alone.
Also I think it would be nice to have the relevant results in one separate thread. I couldn't find anything on the web that was dealing with the topic comprehensively.
Let's say we need to work with a very large List<double> of data (e.g. 2 billion entries).
Loading such a list into memory either throws a "System.OutOfMemoryException"
or, if one uses <gcAllowVeryLargeObjects enabled="true|false" />, eventually just eats up the entire memory.
First: How to minimize the size of the container:
would an array of 2 billion doubles be less expensive?
use decimals instead of doubles?
is there a data type even less expensive than decimal in C# (-100
to 100 in 0.00001 steps would do the job for me)?
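One possible answer to the last point, given only as a sketch: a range of -100 to 100 in 0.00001 steps has 20,000,001 distinct values, which fits in a 4-byte int if each value is stored as a count of steps. A double takes 8 bytes and a decimal 16, so decimal would be larger, not smaller. The FixedPoint class below is a made-up helper for illustration and assumes every value really lies on that grid.

```csharp
using System;

public static class FixedPoint
{
    // Number of steps per unit: 1 / 0.00001.
    public const double Scale = 100000.0;

    // Stores a value as a whole number of 0.00001 steps (-10,000,000..10,000,000 for -100..100).
    public static int Encode(double value)
    {
        return checked((int)Math.Round(value * Scale));
    }

    // Converts the stored step count back to a double.
    public static double Decode(int stored)
    {
        return stored / Scale;
    }
}
```

An int[] of encoded values would then take half the memory of a double[] with the same number of entries.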
Second: Saving the data to disk and reading it
What would be the fastest way to save the List to disk and read it again?
Here I think the size of the file could be a problem. A text file containing 2 billion entries will be huge, and opening it could take forever. (Perhaps some type of stream between the program and the text file would do the job?) Some sample code would be much appreciated :)
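As a rough sketch of the stream idea, the values can be written in binary form (8 bytes per double) with BinaryWriter and read back one at a time with BinaryReader, so the whole list never has to be in memory at once. The DoubleFile class, its method names, and the 1 MB buffer size are assumptions chosen for the example; this is not a benchmarked or definitive solution.

```csharp
using System.Collections.Generic;
using System.IO;

public static class DoubleFile
{
    private const int BufferSize = 1 << 20;   // 1 MB buffer, an arbitrary choice

    // Writes each double as 8 raw bytes, without converting to text.
    public static void Write(string path, IEnumerable<double> values)
    {
        using (var writer = new BinaryWriter(new BufferedStream(File.Create(path), BufferSize)))
        {
            foreach (double value in values)
                writer.Write(value);
        }
    }

    // Lazily yields the values back; only the buffer is held in memory.
    public static IEnumerable<double> Read(string path)
    {
        using (var stream = new BufferedStream(File.OpenRead(path), BufferSize))
        using (var reader = new BinaryReader(stream))
        {
            long count = stream.Length / sizeof(double);
            for (long i = 0; i < count; i++)
                yield return reader.ReadDouble();
        }
    }
}
```

Because Read is an iterator, a loop such as foreach (double d in DoubleFile.Read(path)) processes the file sequentially while it is being read.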
Third: Iteration
if the List is in memory, I think using yield would make sense.
if the List is saved to disk, the speed of iteration is mostly limited by the speed at
which we can read it from the disk.
I need to load a fixed-length file, process some of the fields, generate a few others, and finally output a new file. The difficult part is that the file is a list of part numbers, and some of the products are superseded by other products (which can themselves be superseded). I need to follow the supersession trail to get the information I need to replace some of the fields in the row I am looking at. So how can I best handle about 200,000 lines from a file, given the need to move up and down within the products? I thought about holding the data in a collection or a DataSet, but I just don't think this is the right way. Here is an example of what I am trying to do:
Before
Part Number  List Price  Description      Superseding Part Number
0913982                                   3852943
3852943      0006710     CARRIER,BEARING
After
Part Number  List Price  Description      Superseding Part Number
0913982      0006710     CARRIER,BEARING  3852943
3852943      0006710     CARRIER,BEARING
As usual any help would be appreciated, thanks.
Wade
Create a structure holding the given fields.
Read the file and put the structures in a collection. You can use the part number as the key of a hash table for the fastest lookups.
Scan the collection and fix the data.
200,000 objects built from the given lines will easily fit in memory.
For example, if your structure is 50 bytes, you will need only about 10 MB of memory. That is nothing for a modern PC.
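The steps above might look roughly like this in C#. It is a sketch under assumptions: the Part class and its field names are invented from the column headings in the question, parsing of the fixed-length lines is left out, and a HashSet guards against a supersession trail that loops back on itself.

```csharp
using System.Collections.Generic;

public sealed class Part
{
    public string Number;
    public string ListPrice;
    public string Description;
    public string SupersededBy;   // empty or null when the part is current
}

public static class Supersession
{
    // Follows the trail from `start` to the last part that is not superseded.
    public static Part ResolveFinal(Dictionary<string, Part> parts, Part start)
    {
        var seen = new HashSet<string>();
        Part current = start;
        Part next;
        while (!string.IsNullOrEmpty(current.SupersededBy)
               && seen.Add(current.Number)                          // stops on a cycle
               && parts.TryGetValue(current.SupersededBy, out next))
        {
            current = next;
        }
        return current;
    }

    // Copies price and description from the end of each trail into the superseded row.
    public static void FillFromSuperseding(Dictionary<string, Part> parts)
    {
        foreach (Part part in parts.Values)
        {
            Part final = ResolveFinal(parts, part);
            if (!ReferenceEquals(final, part))
            {
                part.ListPrice = final.ListPrice;
                part.Description = final.Description;
            }
        }
    }
}
```

The dictionary is keyed by part number, so each step along the trail is a constant-time lookup rather than a scan of the file.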
I have 500 CSV files,
each about 10~20 MB in size.
As a sample, the content of the files looks like this:
file1:
column1 column2 column3 column4 .... column50
file2:
column51 column52 ... ... column100
So what I want to do is merge all the files into one large file like this:
fileAll
column1, column2 ...... column2500
My current solution is:
1. Merge each group of 100 files, giving 5 large files
2. Merge the 5 large files into one large file
But the performance is very bad.
Can anyone give me some advice on improving the performance?
Thanks !
What language are you working in? Off the top of my head, I would think you would get the best performance by streaming line by line.
For instance, read the first line of every file, then write the first line of the merged file. Continue until you're done.
The reason this is better than your solution is that your solution reads and writes the same data to and from disk several times, which is slow. I assume you can't fit all the files in memory (and you wouldn't want to anyway, the caching would be horrible), but you want to minimize disk reads and writes (the slowest operation) and try to do it in a fashion where each segment to be written can fit in your cache.
Also, depending on what language you're using, you may be taking a huge hit on concatenating strings. Any language that uses null-terminated arrays as its string implementation takes a hit for concatenating large strings, because it has to search for the null terminator. C is an example off the top of my head. So you may want to limit the size of the strings you work with. In the above example, read in x chars, write out x chars, etc. But you should still read the data only once and write it only once if at all possible.
You could try doing it as a streamed operation. Don't do: 1. Load File 1, 2. Load File 2, 3. Merge, 4. Write Result. Instead do: 1. Load line 1 of Files 1 and 2, 2. Merge the line, 3. Write the line. This way you speed things up by doing smaller chunks of read, process, write, and thereby allow the disk to empty its read/write buffers while you merge each line (row). There could be other things slowing down your process, so please post your code. For example, string operations could easily be slowing things down if not done carefully. Finally, Release mode (as opposed to Debug) is more optimized and will typically run significantly faster.
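A line-by-line merge along these lines could be sketched in C# as follows (the question does not say which language is in use, so C# is an assumption here, as are the CsvColumnMerge class and Merge method names). The sketch also assumes that every input file has the same number of rows and that rows are joined with a comma; it stops as soon as any file runs out of lines. Each input is read once and the output is written once.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class CsvColumnMerge
{
    // Joins row N of every input file into row N of the output file.
    public static void Merge(IList<string> inputPaths, string outputPath)
    {
        var readers = new List<StreamReader>();
        try
        {
            foreach (string path in inputPaths)
                readers.Add(new StreamReader(path));

            using (var writer = new StreamWriter(outputPath))
            {
                var row = new StringBuilder();   // reused for every row to avoid repeated concatenation
                while (true)
                {
                    row.Clear();
                    bool complete = true;
                    for (int i = 0; i < readers.Count; i++)
                    {
                        string line = readers[i].ReadLine();
                        if (line == null) { complete = false; break; }   // a file has ended
                        if (i > 0) row.Append(',');
                        row.Append(line);
                    }
                    if (!complete) break;
                    writer.WriteLine(row.ToString());
                }
            }
        }
        finally
        {
            foreach (StreamReader reader in readers)
                reader.Dispose();
        }
    }
}
```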
I am writing a program to diff, and copy entire files or segments based on changes on either end (Rsync-esque... but more like Unison). The main idea is to keep my music folder (all mp3s) up to date over multiple locations.
I'd like to send segmented updates if only small portions of the file have changed, as opposed to copying the entire file. For this, I need a way to diff segments of the file.
I initially tried generating hashes for blocks of every file (every n bytes I'd hash the segment). I noticed that when I changed one attribute (an id3v2 tag on an mp3), all the hashed blocks would change. This makes sense: I would guess the header grows as it acquires new information, which shifts every byte after it, so the fixed-size blocks no longer line up.
This leads me to my actual question. I would like to know how to determine the length of an mp3's header, so I could create 2 comparable hashes.
1) The meta info of the file (header)
2) The actual mpeg stream with audio (This hash should remain unchanged if all I do is alter tag info)
Am I missing anything else?
Thanks!
Ty
If all you want is the length of the id3v2 tag, you can find information about its structure at http://www.id3.org/id3v2.4.0-structure.
If the first 3 bytes are equal to "ID3", skip to the 7th byte and read the header size from there. Be careful though, because the size is stored as a "synchsafe integer": four bytes in which only the low 7 bits of each byte are used.
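A minimal sketch of that check in C#, assuming the 10-byte header layout from the linked structure document ("ID3", two version bytes, one flags byte, four size bytes). The Id3v2 class and GetTagLength method are names invented for the example. The stored size excludes the 10-byte header itself, and in v2.4 a flag bit (0x10) signals an extra 10-byte footer, so both are added to get the offset at which the audio data should start.

```csharp
using System.IO;

public static class Id3v2
{
    // Returns the total length of a leading ID3v2 tag in bytes, or 0 if there is none.
    public static int GetTagLength(Stream stream)
    {
        var header = new byte[10];
        int read = 0;
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0) return 0;   // stream shorter than a tag header
            read += n;
        }

        if (header[0] != 'I' || header[1] != 'D' || header[2] != '3')
            return 0;

        // Synchsafe integer: 4 bytes, 7 significant bits each, most significant first.
        int size = ((header[6] & 0x7F) << 21)
                 | ((header[7] & 0x7F) << 14)
                 | ((header[8] & 0x7F) << 7)
                 |  (header[9] & 0x7F);

        bool hasFooter = (header[5] & 0x10) != 0;   // v2.4 "footer present" flag
        return 10 + size + (hasFooter ? 10 : 0);
    }
}
```

Hashing the first GetTagLength bytes separately from the rest of the file would give the two comparable hashes described in the question.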
If you want to determine the header information, you'll either:
a) need to use a mp3 library that can do the parsing for you, or
b) go to the mp3 specification and parse it out as needed.
I wound up using TagLibSharp. developer.novell.com/wiki/index.php/TagLib_Sharp