I have around 270k data block pairs, each pair consists of one 32KiB and one 16KiB block.
When I save them to one file I of course get a very large file.
But the data is easily compressed.
After compressing the 5.48GiB file with WinRAR, with strong compression, the resulting file size is 37.4MiB.
But I need random access to each individual block, so I can only compress the blocks individually.
For that I used the DeflateStream class provided by .NET, which reduced the file size to 382MiB (which I could live with).
But the speed is not good enough.
A lot of the speed loss is probably due to always creating a new MemoryStream and Deflate instance for each block.
But it seems they aren't designed to be reused.
And I guess (much?) better compression can be achieved when a "global" dictionary is used instead of having one for each block.
Is there an implementation of a compression algorithm (preferably in C#) which is suited for that task?
The following link contains the percentage with which each byte value occurs, divided into three block types (32KiB blocks only).
The first and third block types each have an occurrence of 37.5%, and the second 25%.
Block type percentages
Long file short story:
Type1 consists mostly of ones.
Type2 consists mostly of zeros and ones.
Type3 consists mostly of zeros.
Values greater than 128 do not occur (yet).
The 16KiB block consists almost always of zeros.
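For illustration, the per-block approach described above presumably boils down to something like this (a sketch with assumed names, not the actual code):

using System.IO;
using System.IO.Compression;

// One new MemoryStream and DeflateStream per 32 KiB (or 16 KiB) block.
static byte[] CompressBlock(byte[] block)
{
    using (var output = new MemoryStream())
    {
        using (var deflate = new DeflateStream(output, CompressionMode.Compress, true))
        {
            deflate.Write(block, 0, block.Length);
        }
        return output.ToArray();
    }
}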
If you want to try a different compression scheme, you can start with RLE, which should be suitable for your data - http://en.wikipedia.org/wiki/Run-length_encoding - it will be blazingly fast even in the simplest implementation (a minimal sketch follows below). The related http://en.wikipedia.org/wiki/Category:Lossless_compression_algorithms contains more links to start from for other algorithms if you want to roll your own or find someone's implementation.
Random comment: "...A lot of the speed loss is probably..." is not a way to solve a performance problem. Measure and see whether it really is.
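A minimal run-length encoder for byte data might look like this (a sketch only; the matching decoder simply expands each (count, value) pair, and data dominated by runs of length 1 would need an escape scheme):

using System.Collections.Generic;

// Emits (count, value) byte pairs with the count capped at 255.
static byte[] RleEncode(byte[] input)
{
    var output = new List<byte>();
    int i = 0;
    while (i < input.Length)
    {
        byte value = input[i];
        int run = 1;
        while (i + run < input.Length && input[i + run] == value && run < 255)
            run++;
        output.Add((byte)run);
        output.Add(value);
        i += run;
    }
    return output.ToArray();
}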
Gzip is known to be "fine", which means its compression ratio is okay and its speed is good.
If you want more compression, other alternatives exist, such as 7z.
If you want more speed, which seems to be your objective, a faster alternative will provide a significant speed advantage at the cost of some compression efficiency. "Significant" here means many times faster, such as 5x-10x. Such algorithms are favored for "in-memory" compression scenarios such as yours, since they make accessing the compressed blocks almost painless.
As an example, Clayton Stangeland just released LZ4 for C#. The source code is available here under a BSD license:
https://github.com/stangelandcl/LZ4Sharp
There are some comparison metrics with gzip on the project homepage, such as:
i5 memcpy 1658 MB/s
i5 Lz4 Compression 270 MB/s Decompression 1184 MB/s
i5 LZ4C# Compression 207 MB/s Decompression 758 MB/s 49%
i5 LZ4C# whole corpus Compression 267 MB/s Decompression 838 MB/s Ratio 47%
i5 gzip whole corpus Compression 48 MB/s Decompression 266 MB/s Ratio 33%
Hope this helps.
You can't have random access to a Deflate stream, no matter how hard you try (unless you forfeit the LZ77 part, but that's what's mostly responsible for making your compression ratio so high right now -- and even then, there are tricky issues to circumvent). This is because one part of the compressed data is allowed to refer to a previous part up to 32K bytes back, which may in turn refer to another part, etc., and you end up having to start decoding the stream from the beginning to get the data you want, even if you know exactly where it is in the compressed stream (which, currently, you don't).
But, what you could do is compress many (but not all) blocks together using one stream. Then you'd get fairly good speed and compression, but you wouldn't have to decompress all the blocks to get at the one you wanted; just the particular chunk that your block happens to be in. You'd need an additional index that tracks where each compressed chunk of blocks starts in the file, but that's fairly low overhead. Think of it as a compromise between compressing everything together (which is great for compression but sucks for random access), and compressing each chunk individually (which is great for random access but sucks for compression and speed).
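A sketch of that chunk-plus-index idea; the chunk size of 64 block pairs and the helper names are arbitrary choices for illustration:

using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

// Compress groups of blocks into one DeflateStream each and remember where
// every compressed chunk starts. To read block N, decompress only the chunk
// that contains it (chunk N / BlocksPerChunk) and skip within it.
static List<long> WriteChunkedFile(string path, IEnumerable<byte[]> blocks)
{
    const int BlocksPerChunk = 64;                 // example value
    var chunkOffsets = new List<long>();           // persist this index separately
    using (var file = File.Create(path))
    {
        DeflateStream deflate = null;
        int inChunk = 0;
        foreach (var block in blocks)
        {
            if (inChunk == 0)
            {
                chunkOffsets.Add(file.Position);   // start of a new chunk
                deflate = new DeflateStream(file, CompressionMode.Compress, true);
            }
            deflate.Write(block, 0, block.Length);
            if (++inChunk == BlocksPerChunk)
            {
                deflate.Dispose();                 // flush this chunk to the file
                inChunk = 0;
            }
        }
        if (deflate != null && inChunk > 0)
            deflate.Dispose();                     // flush the final partial chunk
    }
    return chunkOffsets;
}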
Related
I was going through some source code written by another developer and I came across the following line of code when it comes to streams (file, memory, etc) and file/content uploads. Is there a particular reason that this person is using 1024 as a buffer? Why is this 1024 multiplied by 16?
byte[] buffer = new byte[16*1024];
Could someone please clarify this further? Also, it would be awesome if anyone can direct me towards articles, etc to further read and understand this.
The practice of allocating memory in powers of 2 is a holdover from days of yore. Word sizes are powers of 2 (e.g. fullword = 32 bits, doubleword = 64 bits), and virtual memory page sizes were powers of 2. You wanted your allocated memory to align on convenient boundaries, as it made execution more efficient. For instance, once upon a time, loading a word or doubleword into a register was more expensive in terms of CPU cycles if it wasn't on an appropriate boundary (e.g., a memory address divisible by 4 or 8, respectively). And if you were allocating a big chunk of memory, you might as well consume a whole page of virtual memory, because you'd likely lock an entire page anyway.
These days it doesn't really matter, but old practices die hard.
[And unless you knew something about how the memory allocator worked and how many words of overhead were involved in each malloc()'d block... it probably didn't work anyway.]
1024 is the exact number of bytes in a kilobyte. All that line means is that they are creating a buffer of 16 KB. That's really all there is to it. If you want to go down the route of why there are 1024 bytes in a kilobyte and why it's a good idea to use that in programming, this would be a good place to start. Here would also be a good place to look. Although it's talking about disk space, it's the same general idea.
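For illustration, such a buffer is typically just the scratch space of a read/write loop like the following sketch:

using System.IO;

// Copy a stream using a 16 KB buffer: read at most buffer.Length bytes per
// call and write out only what was actually read.
static void Copy(Stream source, Stream destination)
{
    byte[] buffer = new byte[16 * 1024];
    int bytesRead;
    while ((bytesRead = source.Read(buffer, 0, buffer.Length)) > 0)
        destination.Write(buffer, 0, bytesRead);
}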
I'm writing a program that reads data from .dat files into double[,,] arrays, calculates some things, and needs to write the arrays to a file to save them for later use.
These arrays can have up to [64x64x150000] elements, which forces me to load those files into the program in small parts in order to use them (otherwise a MemoryException is thrown). Until now I used text files to save smaller arrays to my hard disk, but saving a [64x64x150000] array step by step ends up above 6 GB per file, which is quite a lot when you have to work with many of those .dat files and pretty much have to keep all the .txt files.
So I would like to know whether another file format saves some hard disk space, or whether there is another way to save those arrays outside of my program for later use with a smaller disk space requirement.
(I need to be able to exchange the files between different computers).
(8 B/double × 64 × 64 × 150,000 doubles) / (10^9 B/GB) ≈ 4.9 GB
So unless you either reduce to a lower precision (floats) or perform some kind of compression, you'll need roughly 5 GB to store all those doubles. Reducing to floats would halve that to about 2.5 GB per file.
For each of the 64 * 64 vectors of length 150000, you may be able to perform a signal compression (depending on what the data looks like). That's a broad topic, so without knowing more all I can give you is a starting point: Signal compression.
Either compression, or try Binary Serialization. A double can take up dozens of bytes in text, particularly depending on your encoding (1-2 per digit). In binary, each one is exactly 8 bytes (+ however much overhead for bookkeeping, probably minimal).
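A sketch of the binary route, writing each double as its raw 8 bytes (the method names and the loading dimensions are made up for the example):

using System.IO;

// Write a [d0,d1,d2] array as raw 8-byte doubles; read it back the same way.
static void SaveSlice(string path, double[,,] data)
{
    using (var writer = new BinaryWriter(File.Create(path)))
        foreach (double value in data)      // enumerates every element in order
            writer.Write(value);            // exactly 8 bytes per double
}

static double[,,] LoadSlice(string path, int d0, int d1, int d2)
{
    var data = new double[d0, d1, d2];
    using (var reader = new BinaryReader(File.OpenRead(path)))
        for (int i = 0; i < d0; i++)
            for (int j = 0; j < d1; j++)
                for (int k = 0; k < d2; k++)
                    data[i, j, k] = reader.ReadDouble();
    return data;
}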
I am writing a server which will read and write huge files and database data.
I have used Stream read and write functions in many places, using 8192 as the buffer size.
I am also reading large input from TCP sockets.
I don't know what would be the configuration of the VMs where the service will be deployed.
Is there any built-in function I can use to determine the most suitable buffer size for my server?
I often wondered this myself, but in the end I do not think there is a general rule to apply. It always comes down to your specific needs.
As a rule of thumb, a bigger buffer means fewer round trips to the file system or database, which in general is best for most cases.
However, how much data your system can read into memory at once without affecting other applications depends very much on your individual environment. Some mobile device might have different characteristics than your over-the-top server hardware, and so on.
Other things to consider are network bandwidth and other shared resources, as well as the sheer performance impact of your own operations.
For example, in a project with thousands of image files, we found after several tries that - for us - the ideal buffer size was around 1 MB. For images smaller than that we used a buffer size equal to the file size. For your scenario this would of course not fit.
Rico Mariani, performance expert at Microsoft, names the 10 most important aspects of programming for performance: Measure, measure, measure, measure, ... (You get the point. :-) )
It depends on throughput, communication channel utilization and connection stability in production environment.
From my point of view, the best approach here is to use an adaptive algorithm which changes the buffer size depending on the factors mentioned above.
UPDATE.
Be careful when using buffers that are 85,000 bytes or larger. Such buffers should be reused as much as possible (because of LOH behavior).
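For example, rather than allocating a fresh 85,000+ byte array per operation, rent and return one from a pool (System.Buffers.ArrayPool, shown here, is one option if it is available to you; a long-lived buffer field achieves the same thing):

using System.Buffers;
using System.IO;

// Reuse a large buffer instead of allocating a new one per call, so the
// Large Object Heap is not churned by short-lived 85 KB+ arrays.
static void CopyWithPooledBuffer(Stream source, Stream destination)
{
    byte[] buffer = ArrayPool<byte>.Shared.Rent(128 * 1024);   // example size
    try
    {
        int read;
        while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            destination.Write(buffer, 0, read);
    }
    finally
    {
        ArrayPool<byte>.Shared.Return(buffer);
    }
}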
The critical factor is not the size of the application's buffer but the size of the socket send and receive buffers, which must be >= the bandwidth-delay product of the link. Any increase above that should yield zero benefit; any decrease below it will become visible in suboptimal bandwidth. Application buffers have a role to play in reducing system calls but 8192 is normally quite enough for most purposes, especially networking ones.
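As a concrete illustration (with made-up figures of 100 Mbit/s bandwidth and 40 ms round-trip time, giving a bandwidth-delay product of about 500 KB):

using System;
using System.Net.Sockets;

// Size the socket buffers to at least the bandwidth-delay product of the link.
static Socket CreateTunedSocket()
{
    const double bitsPerSecond = 100e6;        // assumed link bandwidth
    const double roundTripSeconds = 0.040;     // assumed round-trip time
    int bdpBytes = (int)(bitsPerSecond / 8 * roundTripSeconds);   // ~500,000 bytes

    var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
    socket.ReceiveBufferSize = Math.Max(socket.ReceiveBufferSize, bdpBytes);
    socket.SendBufferSize = Math.Max(socket.SendBufferSize, bdpBytes);
    return socket;
}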
Our web server needs to process many compositions of large images together before sending the results to web clients. This process is performance critical because the server can receive several thousands of requests per hour.
Right now our solution loads PNG files (around 1MB each) from the HD and sends them to the video card so the composition is done on the GPU. We first tried loading our images using the PNG decoder exposed by the XNA API. We saw the performance was not too good.
To understand if the problem was loading from the HD or the decoding of the PNG, we modified that by loading the file in a memory stream, and then sending that memory stream to the .NET PNG decoder. The difference of performance using XNA or using System.Windows.Media.Imaging.PngBitmapDecoder class is not significant. We roughly get the same levels of performance.
Our benchmarks show the following performance results:
Load images from disk: 37.76ms 1%
Decode PNGs: 2816.97ms 77%
Load images on Video Hardware: 196.67ms 5%
Composition: 87.80ms 2%
Get composition result from Video Hardware: 166.21ms 5%
Encode to PNG: 318.13ms 9%
Store to disk: 3.96ms 0%
Clean up: 53.00ms 1%
Total: 3680.50ms 100%
From these results we see that the slowest part by far is decoding the PNGs.
So we are wondering if there wouldn't be a PNG decoder we could use that would allow us to reduce the PNG decoding time. We also considered keeping the images uncompressed on the hard disk, but then each image would be 10MB in size instead of 1MB and since there are several tens of thousands of these images stored on the hard disk, it is not possible to store them all without compression.
EDIT: More useful information:
The benchmark simulates loading 20 PNG images and compositing them together. This will roughly correspond to the kind of requests we will get in the production environment.
Each image used in the composition is 1600x1600 in size.
The solution will involve as many as 10 load balanced servers like the one we are discussing here. So extra software development effort could be worth the savings on the hardware costs.
Caching the decoded source images is something we are considering, but each composition will most likely be done with completely different source images, so cache misses will be high and performance gain, low.
The benchmarks were done with a crappy video card, so we can expect the PNG decoding to be even more of a performance bottleneck using a decent video card.
There is another option. And that is, you write your own GPU-based PNG decoder. You could use OpenCL to perform this operation fairly efficiently (and perform your composition using OpenGL which can share resources with OpenCL). It is also possible to interleave transfer and decoding for maximum throughput. If this is a route you can/want to pursue I can provide more information.
Here are some resources related to GPU-based DEFLATE (and INFLATE).
Accelerating Lossless compression with GPUs
gpu-block-compression using CUDA on Google code.
Floating point data-compression at 75 Gb/s on a GPU - note that this doesn't use INFLATE/DEFLATE but a novel parallel compression/decompression scheme that is more GPU-friendly.
Hope this helps!
Have you tried the following two things?
1)
Multi-thread it. There are several ways of doing this, but one would be an "all in" method: basically spawn X threads, each running the full process.
2)
Perhaps consider having a number of threads do all the CPU work and then feed the results to the GPU thread.
Your question is very well formulated for a new user, but some information about the scenario might be useful:
Are we talking about a batch job or serving pictures in real time?
Do the 10k pictures change?
Hardware resources
You should also take into account what hardware resources you have at your disposal.
Normally the two cheapest things are CPU power and disk space, so if you only have 10k pictures that rarely change, then converting them all into a format that is quicker to handle might be the way to go.
Multi-thread trivia
Another thing to consider when doing multithreading is that it's normally smart to create the threads with BelowNormal priority, so you don't make the entire system "lag". You have to experiment a bit with the number of threads to use; if you're lucky you can get close to a 100% speed gain per core, but this depends a lot on the hardware and the code you are running.
I normally use Environment.ProcessorCount to get the current CPU count and work from there :)
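A sketch of that pattern (the per-thread work is passed in as a placeholder delegate):

using System;
using System.Threading;

// One worker per core, each at BelowNormal priority so the rest of the
// system stays responsive.
static void RunWorkers(Action workPerThread)
{
    var threads = new Thread[Environment.ProcessorCount];
    for (int i = 0; i < threads.Length; i++)
    {
        threads[i] = new Thread(() => workPerThread())
        {
            Priority = ThreadPriority.BelowNormal,
            IsBackground = true
        };
        threads[i].Start();
    }
    foreach (var thread in threads)
        thread.Join();
}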
I've written a pure C# PNG coder/decoder (PngCs), you might want to give it a look.
But I highly doubt it will have better speed performance [*]; it's not highly optimized, it rather tries to minimize the memory usage for dealing with huge images (it encodes/decodes sequentially, line by line). But perhaps it serves you as boilerplate to plug in some better compression/decompression implementation. As I see it, the speed bottleneck is zlib (inflater/deflater), which (contrary to Java) is not implemented natively in C# - I used the SharpZipLib library, with pure C# managed code; this cannot be very efficient.
I'm a little surprised, however, that in your tests decoding was so much slower than encoding. That seems strange to me, because in most compression algorithms (perhaps in all, and surely in zlib) encoding is much more compute-intensive than decoding.
Are you sure about that?
(For example, this speed test, which reads and writes 5000x5000 RGB8 images (not very compressible, about 20MB on disk), gives me about 4.5 seconds for writing and 1.5 seconds for reading.) Perhaps there are other factors apart from pure PNG decoding?
[*] Update: new versions (since 1.1.14) have several optimizations; if you can use .NET 4.5, especially, it should provide better decoding speed.
You have multiple options:
Improve the performance of the decoding process
You could implement another, faster PNG decoder
(libpng is a standard library which might be faster)
You could switch to another picture format that uses a simpler/faster decodable compression
Parallelize
Use the .NET parallel processing capabilities to decode concurrently. Decoding is likely single-threaded, so this could help if you run on multicore machines (see the sketch after this list).
Store the files uncompressed but on a device that compresses
For instance a compressed folder or even a SandForce SSD.
This will still compress, but differently, and burden other software with the decompression. I am not sure this will really help and would only try it as a last resort.
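The sketch referenced under "Parallelize" above, assuming the WPF PngBitmapDecoder already used in the question (paths and the helper name are illustrative):

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using System.Windows.Media.Imaging;   // PresentationCore

// Decode the ~20 source PNGs of one composition concurrently.
static IDictionary<string, BitmapFrame> DecodeAll(IEnumerable<string> paths)
{
    var frames = new ConcurrentDictionary<string, BitmapFrame>();
    Parallel.ForEach(paths, path =>
    {
        using (var stream = File.OpenRead(path))
        {
            var decoder = new PngBitmapDecoder(stream,
                BitmapCreateOptions.PreservePixelFormat, BitmapCacheOption.OnLoad);
            var frame = decoder.Frames[0];
            frame.Freeze();            // so the bitmap can be used from other threads
            frames[path] = frame;
        }
    });
    return frames;
}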
I have this problem: I have a collection of small files that are about 2000 bytes each (they are all exactly the same size) and there are about ~100,000 of them, which adds up to about 200 megabytes. I need to be able to, in real time, select a range of these files, say files 1000 to 1100 (100 files total), read them, and send them over the network reasonably fast.
The good thing is the files will always be read in sequence, i.e. it's always going to be a range of say "from this file and a hundred more" and not "this file here, and that file over there, etc.".
Files can also be added to this collection during runtime, so it's not a fixed amount of files.
The current scheme I've come up with is this: no file is larger than 2000 bytes, so instead of having several files allocated on the disk I'm going to have one large file containing all the other files at even 2048-byte intervals, with the first 2 bytes of each 2048-byte block holding the actual byte size of the file contained in the following 2046 bytes (the files range between roughly 1800 and 1950 bytes in size), and then seek inside this file instead of opening a new file handle for each file I need to read.
So when I need to get the file at position X, I will just seek to X*2048, read the first two bytes, and then read from (X*2048)+2 as many bytes as the size contained in those first two bytes. This large 200 MB file will be append-only, so it's safe to read even while the serialized input thread/process (haven't decided yet) appends more data to it.
This has to be doable on Windows, C is an option but I would prefer C#.
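A sketch of the read side of that scheme, assuming the 2-byte size prefix is stored little-endian:

using System.IO;

// Seek to slot X (X * 2048), read the 2-byte size prefix, then read that many
// payload bytes from the remainder of the slot.
static byte[] ReadPackedEntry(FileStream packed, long index)
{
    const int SlotSize = 2048;
    packed.Seek(index * SlotSize, SeekOrigin.Begin);

    var prefix = new byte[2];
    ReadExactly(packed, prefix, 2);
    int length = prefix[0] | (prefix[1] << 8);   // little-endian, at most 2046

    var payload = new byte[length];
    ReadExactly(packed, payload, length);
    return payload;
}

static void ReadExactly(Stream stream, byte[] buffer, int count)
{
    int done = 0;
    while (done < count)
    {
        int n = stream.Read(buffer, done, count - done);
        if (n == 0) throw new EndOfStreamException();
        done += n;
    }
}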
Do you have anything against storing these files in a database?
A simple RDBMS would drastically speed up the searching and sorting of a bunch of 2k files.
I think your idea is probably the best you can do with a reasonable amount of work.
Alternatively you could buy a solid-state disk and not care about the file size.
Or you could just preload the entire data into a collection into memory if you don't depend on keeping RAM usage low (will also be the fastest option).
Or you could use a database, but the overhead here will be substantial.
That sounds like a reasonable option.
When reading the data for the range, I'd be quite tempted to seek to the start of the "block of data", and read the whole lot into memory (i.e. the 2048 byte buffers for all the files) in one go. That will get the file IO down to a minimum.
Once you've got all the data in memory, you can decode the sizes and send just the bits which are real data.
Loading all of it into memory may well be a good idea, but that will entirely depend on how often it's modified and how often it's queried.
Was there anything more to the question than just "is this a sane thing to do"?
Are you sure you will never want to delete files from, say, 1200 to 1400? What happens when you are done transferring? Is the data archived or will it continuously grow?
I really don't see why appending all of the data to a single file would improve performance. Instead it's likely to cause more issues for you down the line. So, why would you combine them?
Other things to consider are, what happens if the massive file gets some corruption in the middle from bad sectors on the disk? Looks like you lose everything. Keeping them separate should increase their survivability.
You can certainly work with large files without loading the entire thing in memory, but that's not exactly easy and you will ultimately have to drop down to some low level coding to do it. Don't constrain yourself. Also, what if the file requires a bit of hand editing? Most programs would force you to load and lock the entire thing.
Further, having a single large file would mean that you can't have multiple processes reading / writing the data. This limits scalability.
If you know you need files from #1000 to #1100, you can use the built-in (C#) file APIs to get a collection of files meeting that criterion.
You can simply concatenate all the files into one big file, 'dbase', without any header or footer.
In another file, 'index', you can save the position of every small file in 'dbase'. This index file, being very small, can be cached completely in memory.
This scheme allows you to read the required files quickly and to add new ones at the end of your collection.
Your plan sounds workable. It seems like a FileStream can perform the seeks and reads that you need. Are you running into specific problems with the implementation, or are you looking for a better way to do it?
Whether there is a better way might depend on how fast you can read the files vs. how fast you can transmit them over the network. Assuming that you can read tons of individual files faster than you can send them, you could set up a bounded buffer where you read ahead some number of files into a queue. Another thread would read from the queue and send them over the network.
I would modify your scheme in one way: instead of reading the first two bytes, then using those to determine the size of the next read, I'd just read 2KiB immediately, then use the first two bytes to determine how many bytes you transmit.
You'll probably save more time by using only one disk read than by avoiding transferring the last ~150 bytes from the disk into memory.
The other possibility would be to pack the data for the files together, and maintain a separate index to tell you the start position of each. For your situation, this has the advantage that instead of doing a lot of small (2K) reads from the disk, you can combine an arbitrary number into one large read. Getting up to around 64-128K per read will generally save a fair amount of time.
You could stick with your solution of one big file but use memory mapping to access it (see here e.g.). This might be a bit more performant, since you also avoid paging and the virtual memory management is optimized for transferring chunks of 4096 bytes.
Afaik, there's no direct support for memory mapping, but here is an example of how to wrap the WIN32 API calls for C#.
See also here for a related question on SO.
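Note that since .NET 4, memory mapping is available directly via System.IO.MemoryMappedFiles, so the WIN32 wrapper is only needed on older framework versions. A sketch of reading slot X of the packed file through a mapped view (slot layout as described in the question):

using System.IO;
using System.IO.MemoryMappedFiles;

// Map the packed file (here per call for brevity; a real server would keep
// the mapping open) and read one 2048-byte slot.
static byte[] ReadSlot(string path, long index)
{
    const int SlotSize = 2048;
    using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
    using (var accessor = mmf.CreateViewAccessor(index * SlotSize, SlotSize))
    {
        ushort length = accessor.ReadUInt16(0);        // 2-byte size prefix
        var payload = new byte[length];
        accessor.ReadArray(2, payload, 0, length);
        return payload;
    }
}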
Interestingly, this problem reminds me of this older SO question:
Is this an over-the-top question for Senior Java developer role?