Most efficient way to save large double arrays as files in C#

I'm writing a program that reads data from .dat files into double[,,] arrays, calculates some results, and then needs to write the arrays back to a file so they can be reused later.
These arrays can have up to [64x64x150000] elements, which already forces me to load the files into the program in small parts (otherwise an OutOfMemoryException is thrown). Until now I have used text files to save smaller arrays on my hard disk, but saving a [64x64x150000] array piece by piece ends up taking more than 6 GB per file, which is quite a lot when you have to work with many of these .dat files and essentially keep all of the resulting .txt files.
So I would like to know whether any other file type saves some disk space, or whether there is another way to store these arrays outside my program for later use with a smaller disk footprint.
(I need to be able to exchange the files between different computers.)

(8 bytes/double * 64 * 64 * 150,000 doubles) / (10^9 bytes/GB) ≈ 4.9 GB
So unless you either reduce to a lower precision (floats) or perform some kind of compression, you'll need roughly 4.9 GB to store all those doubles. Reducing to floats would take about 2.5 GB per file.
For each of the 64 * 64 vectors of length 150,000, you may be able to apply signal compression (depending on what the data looks like). That's a broad topic, so without knowing more, all I can give you is a starting point: Signal compression.

Either use compression, or try binary serialization. A double can take up a couple of dozen bytes as text, depending on your encoding (1-2 bytes per digit). In binary, each one is exactly 8 bytes (plus whatever bookkeeping overhead there is, which is probably minimal).
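A minimal sketch of the binary approach, writing one 64 x 64 slice of the array with BinaryWriter and adding a GZip layer on top (the class, method names, and slice-by-slice layout are just for illustration, not the only way to do it):

    using System.IO;
    using System.IO.Compression;

    static class BinaryDoubleStore
    {
        // Writes one [64 x 64] slice as raw 8-byte doubles, compressed with GZip.
        public static void WriteSlice(string path, double[,] slice)
        {
            using (var file = File.Create(path))
            using (var gzip = new GZipStream(file, CompressionMode.Compress))
            using (var writer = new BinaryWriter(gzip))
            {
                for (int i = 0; i < slice.GetLength(0); i++)
                    for (int j = 0; j < slice.GetLength(1); j++)
                        writer.Write(slice[i, j]);   // exactly 8 bytes per double
            }
        }

        // Reads the slice back; rows/cols must match what was written.
        public static double[,] ReadSlice(string path, int rows, int cols)
        {
            var slice = new double[rows, cols];
            using (var file = File.OpenRead(path))
            using (var gzip = new GZipStream(file, CompressionMode.Decompress))
            using (var reader = new BinaryReader(gzip))
            {
                for (int i = 0; i < rows; i++)
                    for (int j = 0; j < cols; j++)
                        slice[i, j] = reader.ReadDouble();
            }
            return slice;
        }
    }

Even without the GZip layer, the raw binary file stays close to the 8-bytes-per-value minimum, which is already well under the text version.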

Related

C# Compressing a lot of data blocks fast/efficiently

I have around 270k data block pairs, each pair consists of one 32KiB and one 16KiB block.
When I save them to one file I of course get a very large file.
But the data is easily compressed.
After compressing the 5.48GiB file with WinRAR, with strong compression, the resulting file size is 37.4MiB.
But I need random access to each individual block, so I can only compress the blocks individually.
For that I used the Deflate class provided by .NET, which reduced the file size to 382MiB (which I could live with).
But the speed is not good enough.
A lot of the speed loss is probably due to always creating a new MemoryStream and Deflate instance for each block.
But it seems they aren't designed to be reused.
And I guess (much?) better compression could be achieved if a "global" dictionary were used instead of having one for each block.
Is there an implementation of a compression algorithm (preferably in C#) which is suited for that task?
The following link contains the percentage at which each byte value occurs, divided into three block types (32 KiB blocks only).
The first and third block types each have an occurrence of 37.5%, and the second 25%.
Block type percentages
Long file short story:
Type1 consists mostly of ones.
Type2 consists mostly of zeros and ones
Type3 consists mostly of zeros
Values greater than 128 do not occur (yet).
The 16KiB block consists almost always of zeros
If you want to try a different compression you can start with RLE, which should be suitable for your data - http://en.wikipedia.org/wiki/Run-length_encoding - it will be blazingly fast even in the simplest implementation. The related http://en.wikipedia.org/wiki/Category:Lossless_compression_algorithms contains more links to get you started on other algorithms, whether you want to roll your own or find someone else's implementation.
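For what it's worth, a bare-bones RLE encoder might look like the sketch below. It stores each run as a (count, value) byte pair and caps runs at 255; a real implementation would also want an escape scheme so that incompressible data doesn't double in size:

    using System.Collections.Generic;

    static class Rle
    {
        // Encodes each run of identical bytes as (count, value), count <= 255.
        public static byte[] Encode(byte[] input)
        {
            var output = new List<byte>();
            int i = 0;
            while (i < input.Length)
            {
                byte value = input[i];
                int run = 1;
                while (i + run < input.Length && input[i + run] == value && run < 255)
                    run++;
                output.Add((byte)run);
                output.Add(value);
                i += run;
            }
            return output.ToArray();
        }
    }

Given that the blocks are dominated by zeros and ones, runs should be long and the encode loop is trivially fast.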
Random comment: "...A lot of the speed loss is probably..." is not the way to solve a performance problem. Measure and see whether it really is.
Gzip is known to be "fine", which means compression ratio is okay, and speed is good.
If you want more compression, other alternatives exist, such as 7z.
If you want more speed, which seems to be your objective, a faster alternative will provide a significant speed advantage at the cost of some compression efficiency. "Significant" here means many times faster, such as 5x-10x. Such algorithms are favored for "in-memory" compression scenarios like yours, since they make accessing the compressed blocks almost painless.
As an example, Clayton Stangeland just released LZ4 for C#. The source code is available here under a BSD license:
https://github.com/stangelandcl/LZ4Sharp
There are some comparison metrics against gzip on the project homepage, such as:
i5 memcpy 1658 MB/s
i5 Lz4 Compression 270 MB/s Decompression 1184 MB/s
i5 LZ4C# Compression 207 MB/s Decompression 758 MB/s 49%
i5 LZ4C# whole corpus Compression 267 MB/s Decompression 838 MB/s Ratio 47%
i5 gzip whole corpus Compression 48 MB/s Decompression 266 MB/s Ratio 33%
Hope this helps.
You can't have random access to a Deflate stream, no matter how hard you try (unless you forfeit the LZ77 part, but that's mostly what is responsible for making your compression ratio so high right now -- and even then, there are tricky issues to circumvent). This is because one part of the compressed data is allowed to refer to a previous part up to 32K bytes back, which may in turn refer to another part, and so on; you end up having to decode the stream from the beginning to get the data you want, even if you know exactly where it is in the compressed stream (which, currently, you don't).
But, what you could do is compress many (but not all) blocks together using one stream. Then you'd get fairly good speed and compression, but you wouldn't have to decompress all the blocks to get at the one you wanted; just the particular chunk that your block happens to be in. You'd need an additional index that tracks where each compressed chunk of blocks starts in the file, but that's fairly low overhead. Think of it as a compromise between compressing everything together (which is great for compression but sucks for random access), and compressing each chunk individually (which is great for random access but sucks for compression and speed).
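A rough sketch of that compromise, assuming an arbitrary group size of 64 blocks per chunk and an in-memory list of chunk start offsets serving as the index (the class name and layout are made up for illustration):

    using System.Collections.Generic;
    using System.IO;
    using System.IO.Compression;

    static class ChunkedStore
    {
        const int BlocksPerChunk = 64;   // tune for your speed/ratio trade-off

        // Compresses blocks in groups of BlocksPerChunk, one DeflateStream per
        // group, and records where each group starts so that a single block can
        // later be retrieved by decompressing only its group.
        public static List<long> WriteChunks(string path, IList<byte[]> blocks)
        {
            var chunkOffsets = new List<long>();
            using (var file = File.Create(path))
            {
                for (int i = 0; i < blocks.Count; i += BlocksPerChunk)
                {
                    chunkOffsets.Add(file.Position);
                    // true = leave the underlying file stream open between chunks
                    using (var deflate = new DeflateStream(file, CompressionMode.Compress, true))
                    {
                        for (int j = i; j < blocks.Count && j < i + BlocksPerChunk; j++)
                            deflate.Write(blocks[j], 0, blocks[j].Length);
                    }
                }
            }
            return chunkOffsets;   // persist these alongside the data as the index
        }
    }

To read block N you would seek to chunkOffsets[N / BlocksPerChunk], decompress that one chunk, and pick the block out by its (fixed or recorded) size.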

Comparing large text files - Is comparing hashes faster than using subsets of the file?

Say I have two large (text) files which are allegedly identical, but I want to make sure. The entire Harry Potter series of 'adult' and 'child' editions perhaps...
If the full text's string representation is too large to be held in memory at once, is it going to be faster to:
a) Hash both files in their entirety and then test to see if the hashes are identical
or
b) Read in manageable chunks of each file and compare them until you either reach EOF or find a mismatch
In other words, would the convenience of comparing 2 small hashes be offset by the time it took to generate said hashes?
I'm expecting a couple of "it depends" answers, so if you want some assumptions to work with:
Language is C# in .NET
Text files are 3GB each
Hash function is MD5
Maximum 'spare' RAM is 1GB
The MD5 checksum will be slower, since you need to process both files completely to get a result. You say you have 3 GB files and only 1 GB of spare memory; do the math.
Checking them in byte chunks will detect any difference earlier; you can also start by comparing the file sizes, lengths, and so on.
I would go with option (b).
Option A is only useful if you will reuse the hash (i.e. you have other files to compare against), so that the cost of calculating the hash isn't a factor...
Otherwise, option B is what I would go for...
To get the maximum speed I would use MemoryMappedFile instances and XOR the contents; the comparison can stop at the first difference encountered (i.e. the XOR operation returns something != 0). Regarding memory consumption, you can use a "moving window" (i.e. via calls to CreateViewAccessor), which would allow for processing files of literally terabyte size...
It could even be worth testing the performance of XOR against some LINQ-based comparison methods... and always start by comparing the file sizes; that way you avoid doing unnecessary calculations...
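For reference, the plain chunked comparison (option b, without the memory mapping) might look like the sketch below; the class name and the 1 MB buffer size are arbitrary:

    using System.IO;

    static class FileCompare
    {
        // Compares two files chunk by chunk, bailing out at the first difference.
        public static bool AreEqual(string pathA, string pathB)
        {
            if (new FileInfo(pathA).Length != new FileInfo(pathB).Length)
                return false;                      // cheap early exit on size

            const int BufferSize = 1024 * 1024;
            var bufferA = new byte[BufferSize];
            var bufferB = new byte[BufferSize];

            using (var streamA = File.OpenRead(pathA))
            using (var streamB = File.OpenRead(pathB))
            {
                while (true)
                {
                    int readA = ReadFully(streamA, bufferA);
                    int readB = ReadFully(streamB, bufferB);
                    if (readA != readB)
                        return false;
                    if (readA == 0)
                        return true;               // both streams hit EOF together
                    for (int i = 0; i < readA; i++)
                        if (bufferA[i] != bufferB[i])
                            return false;
                }
            }
        }

        // Keeps reading until the buffer is full or the stream ends.
        static int ReadFully(Stream stream, byte[] buffer)
        {
            int total = 0, read;
            while (total < buffer.Length &&
                   (read = stream.Read(buffer, total, buffer.Length - total)) > 0)
                total += read;
            return total;
        }
    }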
Assuming you have no future use for the hash (to compare against other texts, or to check after potential future changes), there are two cases:
A) documents are same
B) documents are different
If A, then there's almost no difference between the two scenarios. Both involve reading the entire files one chunk at a time and doing a calculation/compare on every byte. The computational overhead of the hash is minimal compared to the work of reading the files.
If B, then it's possible you'd find a difference in the first page of the files, at which point you'd be able to quit the process.
So depending on the relative probability of A vs. B, it seems comparing would be faster on average. Note also that you could then report where the change occurs, which you could not do in the hash scenario.

Best way to store long binary (up to 512 bit) in C#

I'm trying to figure out the best way to store large binary numbers (more than 96 bits) in C#.
I'm building an application that will automatically allocate workers to shifts. Shifts can be as short as 15 minutes (and might become even shorter in the future). To avoid double-booking workers, I plan to keep a binary map of each worker's day: 24 hours divided into equal 15-minute chunks, where every chunk has a flag (0 for free, 1 for busy).
So when we try to give another shift to a worker, we can do a binary comparison of the worker's daily availability against the shift's time. Simple and easy to decide.
But a C# long only holds up to 64 bits, and with the current setup I need at least 96 bits (24 hours * 60 minutes / 15 minutes per period).
This representation must be memory friendly, as there will be about a million objects operated at a time.
A few other options I considered:
String: memory-hungry, and bitwise operations are awkward to implement.
Array of bits: but as far as I know, C# does not have a bit type.
Array of unsigned integers, where each element covers only part of the day. The best I can think of.
Any other suggestions??
Thanks in advance!
Have you looked at the BitArray class? It should be pretty much exactly what you're looking for.
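A minimal sketch of the conflict check with BitArray, assuming 96 slots of 15 minutes each (the helper name is made up):

    using System.Collections;

    static class ShiftPlanner
    {
        // One bit per 15-minute slot (96 per day): true = busy/requested.
        // Returns true if the worker is already busy during any slot the shift needs.
        public static bool Overlaps(BitArray workerBusy, BitArray shiftSlots)
        {
            // And() mutates its target, so operate on a copy of the worker's map.
            var overlap = new BitArray(workerBusy).And(shiftSlots);
            for (int i = 0; i < overlap.Length; i++)
                if (overlap[i])
                    return true;
            return false;
        }
    }

Internally a 96-bit BitArray is backed by an array of three ints (plus object overhead), so it is close to the hand-rolled option memory-wise.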
Try the following:
.NET 4 has a built-in BigInteger type:
http://msdn.microsoft.com/en-us/library/system.numerics.biginteger.aspx
There is also a .NET 2 project on CodeProject:
http://www.codeproject.com/KB/cs/biginteger.
Another alternative:
http://www.codeplex.com/IntX/
Unless you have millions of employees that all need to be scheduled at the same time, I'd be tempted to store your 96 booleans as a char array, with '0' meaning "free" and '1' meaning "busy". Simple to index/access/update. The rest of the employees' schedules can sit in their database rows on disk, where you simply don't care about "96 megabytes".
If you can find a class which implements a bit array, you could use that. (You could code one easily, too.) But does it really matter space-wise?
Frankly, if your organization really has a million employees to schedule, surely you can afford a machine with room for a 96 MB array as well as the rest of your code?
The one good excuse I can see for using bit vectors has to do with execution time. If your scheduling algorithm essentially ANDs one employee's bit vector against another looking for conflicts, and does that on a large scale, bit vectors might reduce the computation time by a factor of roughly 10 (use two longs per employee to get your 96 bits). I'd wait until my algorithm worked before worrying about this.
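A sketch of that two-longs-per-employee idea, with the 96 slots packed into two 64-bit words (the struct and member names are made up):

    struct DayBits
    {
        // 96 slots packed into two 64-bit words; the top 32 bits of High are unused.
        public ulong Low;    // slots 0..63
        public ulong High;   // slots 64..95

        // Conflict check is just a pair of ANDs.
        public bool Overlaps(DayBits other)
        {
            return (Low & other.Low) != 0 || (High & other.High) != 0;
        }

        public void Book(int slot)
        {
            if (slot < 64) Low |= 1UL << slot;
            else High |= 1UL << (slot - 64);
        }
    }

At 16 bytes per employee, a million employees is about 16 MB, and the conflict check compiles down to a couple of machine instructions.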
You could use an array of bytes. I don't think any language supports an array of bits, as a byte is the smallest addressable piece of memory. Another option is an array of booleans, but each boolean is (I believe) stored as a byte anyway, so there would be wasted memory, though it might be easier to work with. It really depends on how many days you are going to work with. You could also just store the start and end of each shift and use other means to figure out whether schedules overlap. That would probably make the most sense and be the easiest to debug.
BitArray has already been mentioned; it uses an array of ints, much like you planned to do anyway.
That also means it adds an extra layer of indirection (and some extra bytes); it also does a lot of checking everywhere, to make sure e.g. that the lengths of two BitArrays are the same when operating on them. So I would be careful with them. They're easy to use, but slower than necessary; the difference (compared to handling the array yourself) is especially big for smallish bit arrays.

What type to use in code and database for huge file size?

My project (C#) deals with many files about 1 MB to 2 GB in size; the database is SQL Server 2008. In the long term I need to do some operations over them, like summing their total size and ...
At first I planned to store their size in bytes (a long in C# code, BigInt in the database), since file sizes are by nature precise byte counts. Then I thought it might be a better idea to use a double and express the size in MB, because most of the files are 1~2000 MB and that makes more sense when talking about the files in this project. Is there any advantage/disadvantage to either design, both in code (performance when there are many mathematical operations) and in the database (batch operations over many files)?
You should use long/BigInt, for a few reasons:
File sizes are in bytes, which is a precise, discrete measure, so you might as well use a precise, discrete value.
If you use a decimal, it's hard to know what scale you're dealing with: KB? bytes? MB? But if you're dealing with integers, you'll probably know that it's in bytes.
Not a super big deal, but performance is slightly better on most processors with longs than with decimals.
As far as I know, it's the conventional thing to do.
Store the exact value in a long, formatting/interpreting as MB, etc should be done by the client.
Either use uint or ulong, since:
you can't have negative file sizes, and
you can't have files that take fractions of bytes (e.g. 3,231.5 bytes is not a valid answer)
With a max value of 4,294,967,295 a uint seems the best solution for files of the size you're describing, as long as you can guarantee they'll never be any larger than this many bytes. It takes less space than a long or a double, and with only 32 bits it might even be quicker to evaluate depending on the processor and such.
Why not Decimal? Decimal.MaxValue is 79,228,162,514,264,337,593,543,950,335.
SQL supports this for most of the valid .NET range. Limitations in conversion are noted here.
If I remember correctly, file size is already returned as a long in .NET. Also, doubles are 64-bit and longs are 64-bit, but floating-point arithmetic is slower.
Also, if you do your calculations as doubles, you will have to convert a long to a double for every file, whereas if you wait until the end there is only one conversion.
Do all of your calculations in bytes, then at the last step convert to a floating-point value to report something like 2.051 GB or whatever.
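In code that might look like the sketch below; the directory path is just a placeholder, and FileInfo.Length is indeed a long holding the size in bytes:

    using System;
    using System.IO;
    using System.Linq;

    static class SizeReport
    {
        static void Main()
        {
            // Sum in bytes as a long; convert to a floating-point GB figure
            // only when formatting the result for display.
            long totalBytes = new DirectoryInfo(@"C:\data")   // placeholder path
                .EnumerateFiles()
                .Sum(f => f.Length);

            Console.WriteLine("{0:F3} GB", totalBytes / (1024.0 * 1024 * 1024));
        }
    }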

Reading huge amounts of small files in sequence

I have this problem: I have a collection of small files, each about 2000 bytes in size (they are all the exact same size), and there are about ~100,000 of them, which equals about 200 megabytes of space. I need to be able to, in real time, select a range of these files, say files 1000 to 1100 (100 files total), read them, and send them over the network decently fast.
The good thing is that the files will always be read in sequence, i.e. it's always going to be a range like "this file and a hundred more", never "this file here, and that file over there, etc.".
Files can also be added to this collection during runtime, so it's not a fixed amount of files.
The current scheme I've come up with is this: no file is larger than 2000 bytes, so instead of having several files allocated on the disk, I'm going to have one large file containing all the others at even 2048-byte intervals, with the first 2 bytes of each 2048-byte block holding the actual byte size of the file stored in the following 2046 bytes (the files range between roughly 1800 and 1950 bytes in size), and then seek inside this file instead of opening a new file handle for each file I need to read.
So when I need the file at position X, I will just seek to X*2048, read the first two bytes, and then read the number of bytes given by those two bytes, starting at (X*2048)+2. This large 200 MB file will be append-only, so it's safe to read from it even while the serialized input thread/process (haven't decided yet) appends more data to it.
This has to be doable on Windows; C is an option, but I would prefer C#.
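For reference, that scheme might be sketched roughly as below; the class and method names are made up, and the 2-byte length prefix is assumed to be little-endian:

    using System;
    using System.IO;

    class PackedFileReader
    {
        const int SlotSize = 2048;
        readonly FileStream stream;

        public PackedFileReader(string path)
        {
            // FileShare.ReadWrite so the appending writer can keep going.
            stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
        }

        // Reads record X: seek to X * 2048, read the whole slot, then use the
        // 2-byte length prefix to trim it down to the real payload.
        public byte[] ReadRecord(long index)
        {
            var slot = new byte[SlotSize];
            stream.Seek(index * SlotSize, SeekOrigin.Begin);
            int read = 0;
            while (read < SlotSize)
            {
                int n = stream.Read(slot, read, SlotSize - read);
                if (n == 0) break;                   // hit end of file
                read += n;
            }
            int length = slot[0] | (slot[1] << 8);   // assumed little-endian prefix
            var payload = new byte[length];
            Array.Copy(slot, 2, payload, 0, length);
            return payload;
        }
    }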
Do you have anything against storing these files in a database?
A simple RDBMS would drastically speed up searching and sorting a bunch of 2 KB files.
I think your idea is probably the best you can do with a reasonable amount of work.
Alternatively you could buy a solid state disk and not care about the filesize.
Or you could just preload the entire data into a collection into memory if you don't depend on keeping RAM usage low (will also be the fastest option).
Or you could use a database, but the overhead here will be substantial.
That sounds like a reasonable option.
When reading the data for the range, I'd be quite tempted to seek to the start of the "block of data" and read the whole lot into memory (i.e. the 2048-byte buffers for all the files) in one go. That will keep the file I/O to a minimum.
Once you've got all the data in memory, you can decode the sizes and send just the bits which are real data.
Loading all of it into memory may well be a good idea, but that will entirely depend on how often it's modified and how often it's queried.
Was there anything more to the question than just "is this a sane thing to do"?
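That bulk read could be sketched like this, reusing the 2048-byte slot layout from the question (names are made up); the caller then decodes each slot's 2-byte length prefix in memory before sending:

    using System.IO;

    static class RangeReader
    {
        const int SlotSize = 2048;

        // Reads slots [first, first + count) with one seek and one looped read.
        public static byte[] ReadRange(string path, long first, int count)
        {
            var buffer = new byte[count * SlotSize];
            using (var stream = File.OpenRead(path))
            {
                stream.Seek(first * SlotSize, SeekOrigin.Begin);
                int total = 0, read;
                while (total < buffer.Length &&
                       (read = stream.Read(buffer, total, buffer.Length - total)) > 0)
                    total += read;
            }
            return buffer;
        }
    }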
Are you sure you will never want to delete files from, say, 1200 to 1400? What happens when you are done transferring? Is the data archived or will it continuously grow?
I really don't see why appending all of the data to a single file would improve performance. Instead it's likely to cause more issues for you down the line. So, why would you combine them?
Other things to consider are, what happens if the massive file gets some corruption in the middle from bad sectors on the disk? Looks like you lose everything. Keeping them separate should increase their survivability.
You can certainly work with large files without loading the entire thing in memory, but that's not exactly easy and you will ultimately have to drop down to some low level coding to do it. Don't constrain yourself. Also, what if the file requires a bit of hand editing? Most programs would force you to load and lock the entire thing.
Further, having a single large file would mean that you can't have multiple processes reading / writing the data. This limits scalability.
If you know you need files #1000 to #1100, you can use the built-in (C#) APIs to get a collection of files meeting that criterion.
You can simply concatenate all the files into one big file, 'dbase', without any header or footer.
In another file, 'index', you can save the position of each small file within 'dbase'. This index file, being very small, can be cached completely in memory.
This scheme lets you read the required files quickly and add new ones at the end of your collection.
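A rough sketch of that dbase-plus-index idea, keeping the index in memory (the names are made up, and persisting the index to its own file is left out):

    using System.Collections.Generic;
    using System.IO;

    class AppendOnlyStore
    {
        // 'dbase' holds the raw bytes back to back; the in-memory lists play the
        // role of the tiny 'index' file: one (offset, length) pair per record.
        readonly List<long> offsets = new List<long>();
        readonly List<int> lengths = new List<int>();
        readonly FileStream dbase;

        public AppendOnlyStore(string path)
        {
            dbase = new FileStream(path, FileMode.Append, FileAccess.Write, FileShare.Read);
        }

        public void Append(byte[] record)
        {
            offsets.Add(dbase.Position);
            lengths.Add(record.Length);
            dbase.Write(record, 0, record.Length);
            dbase.Flush();
            // A real implementation would also append (offset, length) to the index file.
        }
    }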
Your plan sounds workable. It seems like a FileStream can perform the seeks and reads that you need. Are you running into specific problems with the implementation, or are you looking for a better way to do it?
Whether there is a better way might depend on how fast you can read the files vs. how fast you can transmit them over the network. Assuming you can read tons of individual files faster than you can send them, you could set up a bounded buffer where you read ahead X files into a queue, while another thread reads from the queue and sends them over the network.
I would modify your scheme in one way: instead of reading the first two bytes, then using those to determine the size of the next read, I'd just read 2KiB immediately, then use the first two bytes to determine how many bytes you transmit.
You'll probably save more time by using only one disk read than by avoiding transferring the last ~150 bytes from the disk into memory.
The other possibility would be to pack the data for the files together, and maintain a separate index to tell you the start position of each. For your situation, this has the advantage that instead of doing a lot of small (2K) reads from the disk, you can combine an arbitrary number into one large read. Getting up to around 64-128K per read will generally save a fair amount of time.
You could stick with your solution of one big file but use memory mapping to access it (see here e.g.). This might be a bit more performant, since you also avoid paging and the virtual memory management is optimized for transferring chunks of 4096 bytes.
AFAIK, there's no direct support for memory mapping, but here is an example of how to wrap the Win32 API calls for C#.
See also here for a related question on SO.
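For what it's worth, .NET 4 later added System.IO.MemoryMappedFiles, which covers this without a Win32 wrapper; a rough sketch of reading one 2048-byte slot through a view accessor:

    using System.IO;
    using System.IO.MemoryMappedFiles;

    static class MappedReader
    {
        const int SlotSize = 2048;

        // Maps the big file and reads one slot through a small view.
        public static byte[] ReadSlot(string path, long index)
        {
            using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
            using (var view = mmf.CreateViewAccessor(index * SlotSize, SlotSize))
            {
                var slot = new byte[SlotSize];
                view.ReadArray(0, slot, 0, SlotSize);
                return slot;
            }
        }
    }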
Interestingly, this problem reminds me of this older SO question:
Is this an over-the-top question for Senior Java developer role?
