Create zip-style file without compression - C#

I know that there are many free and not-so-free compression libraries out there, but for the project I am working on I need to be able to take file data from a stream and put it into some kind of zip or pack file, without compression, because I will need to access these files quickly without waiting for them to decompress.
Does anyone know how this could be approached, or whether there are libraries out there that do this that I am not aware of?

You can use ZIP for this. You would use a compression level of something like "none" or "store", which just combines the files without compression. The commonly enumerated levels are:
Maximum - The slowest of the compression options, but the most useful for creating small archives.
Normal - The default value.
Low - Faster than the default, but less effective.
Minimum - Extremely fast compression, but not as efficient as other methods.
None - Creates a ZIP file but does not compress it. File size may be slightly larger if archive is encrypted or made self-extracting.
Here are some C# examples:
CodeProject
EggheadCafe
For the Unix-unaware, this is exactly what tar does: when you see .tar.gz files, it's just a bunch of files combined into a tar archive and then run through gzip.
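If the newer built-in System.IO.Compression API (.NET 4.5+) is an option, storing entries without compression is straightforward via CompressionLevel.NoCompression; a minimal sketch (the file and entry names are placeholders):
using System.IO;
using System.IO.Compression;   // reference System.IO.Compression.dll on .NET Framework

class StoreOnlyZip
{
    static void Main()
    {
        using (FileStream zipFile = new FileStream("pack.zip", FileMode.Create))
        using (ZipArchive archive = new ZipArchive(zipFile, ZipArchiveMode.Create))
        {
            // The entry is "stored": bytes are copied verbatim, no DEFLATE pass.
            ZipArchiveEntry entry = archive.CreateEntry("data.bin", CompressionLevel.NoCompression);
            using (Stream entryStream = entry.Open())
            using (FileStream source = File.OpenRead("data.bin"))
            {
                source.CopyTo(entryStream);
            }
        }
    }
}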

Have a look at the System.IO.Packaging namespace.
Quote from MSDN:
System.IO.Packaging
Provides classes that support storage of multiple data objects in a single container.
Package is an abstract class that can be used to organize objects into a single entity of a defined physical format for portability and efficient access.
A ZIP file is the primary physical format for the Package. Other Package implementations might use other physical formats such as an XML document, a database, or Web service.
You can select different compression options for your package:
NotCompressed - Compression is turned off.
Normal - Compression is optimized for a balance between size and performance.
Maximum - Compression is optimized for size.
Fast - Compression is optimized for performance.
SuperFast - Compression is optimized for high performance.
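A minimal sketch of an uncompressed package using this API (the file names and content type are placeholders; WindowsBase.dll must be referenced):
using System;
using System.IO;
using System.IO.Packaging;   // requires a reference to WindowsBase.dll

class UncompressedPackage
{
    static void Main()
    {
        // Placeholder file names; CompressionOption.NotCompressed stores the part as-is.
        using (Package package = Package.Open("container.pack", FileMode.Create))
        {
            Uri partUri = PackUriHelper.CreatePartUri(new Uri("/data.bin", UriKind.Relative));
            PackagePart part = package.CreatePart(
                partUri, "application/octet-stream", CompressionOption.NotCompressed);

            using (Stream partStream = part.GetStream())
            using (FileStream source = File.OpenRead("data.bin"))
            {
                source.CopyTo(partStream);
            }
        }
    }
}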

Perhaps just use a zip with compression set to "none"; SharpZipLib would suffice.
Be careful about assuming that compression is slower, though - it might actually (depending on the scenario) be quicker with compression, since you reduce the amount of physical IO and IPC (often a bottleneck), and simply do a bit more CPU work; but you generally have plenty of CPU.
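With SharpZipLib, level 0 means "store"; a short sketch, assuming the ZipOutputStream API and placeholder file names:
using System.IO;
using ICSharpCode.SharpZipLib.Zip;

class SharpZipStore
{
    static void Main()
    {
        using (ZipOutputStream zip = new ZipOutputStream(File.Create("pack.zip")))
        {
            zip.SetLevel(0);   // 0 = store (no compression); 9 = best compression
            zip.PutNextEntry(new ZipEntry("data.bin"));
            using (FileStream source = File.OpenRead("data.bin"))
            {
                source.CopyTo(zip);   // entry data goes straight through
            }
            zip.CloseEntry();
        }
    }
}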

Traditionally, simple storage files under Windows are cabinet (.cab) files, which support compression as well as signing, which ZIP does not.
It is worth checking whether there is a way to create cabinet files from .NET.

Remember to profile first. Your hard drive is much slower than your CPU or RAM. If the file is sitting on disk, reading a smaller, compressed file will take less time than reading an uncompressed blob, and the difference may well be more than the time it takes to decompress it.
Also, the OS may cache the file in memory. When that happens, the hard drive is removed from the loop entirely (transparently to you), and the decompression time could become the dominant cost.
I learned this "technique" when dealing with slow internet connections: the client needed the data fast and we had cycles to spare, so sending compressed packets improved the throughput and latency of the application.

I had an additional requirement that the resulting pack file be browsable with standard tools (at least FAR Manager).
So far I've tried:
OPC (Open Packaging Conventions, the System.IO.Packaging namespace, ZIP-based, the backend for MS Office .docx files). Built-in and standard, but quite slow, probably because it first copies all data to a temporary location in case it has to be compressed (even when it doesn't), and only then writes to the final destination. Unbearably slow. Note that there's also a Windows built-in implementation which is not .NET-based; it might be faster but does not span all of the OS versions I have to support.
ITSS (InfoTech Storage System, the backend of CHM files). Built into Windows, somewhat standard. Surprisingly, the implementation is incomplete and deadly slow, even slower than OPC.
DOC (COM Compound File Structured Storage, the backend for MS Office .doc files, .msi files, etc.). Built into Windows, quite standard. Does not support file names longer than 32 characters, which is a significant drawback in my case. Fast enough at small to medium sizes (it totally outruns the .NET OPC implementation), but has scalability issues when it gets into gigabytes.
Various ZIP implementations are still to be tested.

Related

Fastest PNG decoder for .NET

Our web server needs to process many compositions of large images together before sending the results to web clients. This process is performance critical because the server can receive several thousands of requests per hour.
Right now our solution loads PNG files (around 1MB each) from the HD and sends them to the video card so the composition is done on the GPU. We first tried loading our images using the PNG decoder exposed by the XNA API. We saw the performance was not too good.
To understand whether the problem was loading from the HD or decoding the PNG, we modified that by loading the file into a memory stream and then handing that memory stream to the .NET PNG decoder. The difference in performance between XNA and the System.Windows.Media.Imaging.PngBitmapDecoder class is not significant; we get roughly the same levels of performance.
Our benchmarks show the following performance results:
Load images from disk: 37.76ms 1%
Decode PNGs: 2816.97ms 77%
Load images on Video Hardware: 196.67ms 5%
Composition: 87.80ms 2%
Get composition result from Video Hardware: 166.21ms 5%
Encode to PNG: 318.13ms 9%
Store to disk: 3.96ms 0%
Clean up: 53.00ms 1%
Total: 3680.50ms 100%
From these results we see that the slowest part is decoding the PNGs.
So we are wondering whether there is a PNG decoder we could use that would reduce the PNG decoding time. We also considered keeping the images uncompressed on the hard disk, but then each image would be 10MB instead of 1MB, and since there are several tens of thousands of these images stored on the hard disk, it is not possible to store them all uncompressed.
EDIT: More useful information:
The benchmark simulates loading 20 PNG images and compositing them together. This will roughly correspond to the kind of requests we will get in the production environment.
Each image used in the composition is 1600x1600 in size.
The solution will involve as many as 10 load balanced servers like the one we are discussing here. So extra software development effort could be worth the savings on the hardware costs.
Caching the decoded source images is something we are considering, but each composition will most likely be done with completely different source images, so cache misses will be high and performance gain, low.
The benchmarks were done with a crappy video card, so we can expect the PNG decoding to be even more of a performance bottleneck using a decent video card.
There is another option: write your own GPU-based PNG decoder. You could use OpenCL to perform this operation fairly efficiently (and perform your composition using OpenGL, which can share resources with OpenCL). It is also possible to interleave transfer and decoding for maximum throughput. If this is a route you can/want to pursue, I can provide more information.
Here are some resources related to GPU-based DEFLATE (and INFLATE).
Accelerating Lossless compression with GPUs
gpu-block-compression using CUDA on Google code.
Floating point data-compression at 75 Gb/s on a GPU - note that this doesn't use INFLATE/DEFLATE but a novel parallel compression/decompression scheme that is more GPU-friendly.
Hope this helps!
Have you tried the following two things?
1)
Multithread it. There are several ways of doing this, but one would be an "all in" method: simply spawn X threads, each running the full process.
2)
Perhaps consider having X threads do all the CPU work and then feed the results to the GPU thread.
Your question is very well formulated for a new user, but some information about the scenario might be useful:
Are we talking about a batch job or serving pictures in real time?
Do the 10k pictures change?
Hardware resources
You should also take into account what hardware resources you have at your disposal.
Normally the two cheapest things are CPU power and disk space, so if you only have 10k pictures that rarely change, then converting them all into a format that is quicker to handle might be the way to go.
Multithreading trivia
Another thing to consider when multithreading is that it is normally smart to create the threads with BelowNormal priority, so you don't make the entire system lag. You have to experiment a bit with the number of threads to use; if you're lucky you can get close to a 100% gain in speed per core, but this depends a lot on the hardware and the code you are running.
I normally use Environment.ProcessorCount to get the current CPU count and work from there :)
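A rough sketch of that approach - one BelowNormal-priority worker per core pulling file paths from a shared queue; DecodeAndComposite and the folder path are hypothetical placeholders for your own pipeline:
using System;
using System.Collections.Concurrent;
using System.Threading;

class WorkerPool
{
    static void Main()
    {
        var files = new ConcurrentQueue<string>(
            System.IO.Directory.GetFiles(@"C:\images", "*.png"));
        int workerCount = Environment.ProcessorCount;
        var threads = new Thread[workerCount];

        for (int i = 0; i < workerCount; i++)
        {
            threads[i] = new Thread(() =>
            {
                string path;
                while (files.TryDequeue(out path))
                {
                    DecodeAndComposite(path);   // hypothetical per-image CPU work
                }
            });
            threads[i].Priority = ThreadPriority.BelowNormal; // keep the box responsive
            threads[i].Start();
        }

        foreach (var t in threads) t.Join();
    }

    static void DecodeAndComposite(string path)
    {
        // placeholder for the real decode + composition work
    }
}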
I've written a pure C# PNG encoder/decoder (PngCs); you might want to give it a look.
But I highly doubt it will have better speed [*]; it's not highly optimized, and it rather tries to minimize memory usage when dealing with huge images (it encodes/decodes sequentially, line by line). Perhaps it can serve as boilerplate to plug in a better compression/decompression implementation. As I see it, the speed bottleneck is zlib (inflater/deflater), which (contrary to Java) is not implemented natively in C# - I used the SharpZipLib library, with pure C# managed code; this cannot be very efficient.
I'm a little surprised, however, that in your tests decoding was so much slower than encoding. That seems strange to me because, in most compression algorithms (perhaps in all, and surely in zlib), encoding is much more compute-intensive than decoding.
Are you sure about that?
(For example, this speed test, which reads and writes 5000x5000 RGB8 images (not very compressible, about 20MB on disk), gives me about 4.5 seconds for writing and 1.5 seconds for reading.) Perhaps there are other factors apart from pure PNG decoding?
[*] Update: new versions (since 1.1.14) have several optimizations; if you can use .NET 4.5 especially, it should provide better decoding speed.
You have multiple options:
Improve the performance of the decoding process
You could use another, faster PNG decoder (libpng is a standard library which might be faster).
You could switch to another picture format that uses simpler/faster-to-decode compression.
Parallelize
Use the .NET parallel-processing capabilities to decode concurrently. Decoding is likely single-threaded, so this could help on multicore machines (see the sketch after this list).
Store the files uncompressed, but on a device that compresses
For instance a compressed folder or even a SandForce SSD. This will still compress, but differently, and it burdens other software with the decompression. I am not sure this will really help and would only try it as a last resort.
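The parallel-decoding idea as a minimal sketch, assuming the same WPF PngBitmapDecoder the question already uses (the paths are placeholders):
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;
using System.Windows.Media.Imaging;   // reference PresentationCore / WindowsBase

class ParallelPngDecode
{
    static void Main()
    {
        string[] paths = Directory.GetFiles(@"C:\images", "*.png");
        var decoded = new ConcurrentBag<BitmapFrame>();

        Parallel.ForEach(paths, path =>
        {
            using (var stream = File.OpenRead(path))
            {
                var decoder = new PngBitmapDecoder(stream,
                    BitmapCreateOptions.PreservePixelFormat, BitmapCacheOption.OnLoad);
                BitmapFrame frame = decoder.Frames[0];
                frame.Freeze();              // make the frame usable from other threads
                decoded.Add(frame);
            }
        });

        Console.WriteLine("Decoded {0} images", decoded.Count);
    }
}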

Memory-Mapped Files vs. RAM Disk

For the game Minecraft, the general approach when running the server application is to run it on a RAM disk, as it uses hundreds of tiny files for world generation and I/O speed is the major bottleneck.
In a recent attempt, I tried to use Dokan/ImDisk to create a RAM disk programmatically for the server application. Dokan was considerably slower than an average hard drive, and I was unable to get ImDisk to function properly. Since these are the only two filesystem drivers I know of that have a .NET API, I'm looking into alternatives now.
It was mentioned to me previously to try memory-mapped files. My current approach is to create the RAM disk, create a symbolic link between the game server's data folder and the RAM disk, and then launch the game server process.
Can memory-mapped files function the same way, i.e., by creating a virtual drive I can create a symbolic link to, such as G:\Data_Files\?
Are there any other alternatives to Dokan/ImDisk with a .NET API/bindings floating around?
After looking at a bunch of solutions and doing a few benchmarks, we couldn't pass up RAMDisk from DataRam. We kicked around a bunch of the Windows driver stuff and some other freebie solutions and ultimately couldn't justify the expense compared to the tiny price tag of a commercial solution.
There are several approaches that depend on specifics of your task.
If you need to work with the file system (i.e. via filesystem API functions and classes) and you want it fast, then (as I suggested in reply to your previous question) you'd need to create a RAM disk driver. The Windows Driver Kit includes a sample driver which (coincidence?) is named "RamDisk". Driver development, though, is tricky, and if something goes wrong with the sample or you need to extend it, you would need to dig deep into kernel-mode development (or hire someone to do the job). Why kernel mode? Because, as you saw with Dokan, switching back to user mode to store the data causes a major slowdown.
If all you need is handy management of a bunch of files in memory using the Stream class (with the possibility of flushing all of it to disk), then you can use one of the virtual file systems. Our SolFS (Application Edition) is one such product (I can also remember CodeBase File System, but they don't seem to provide an evaluation version). SolFS seems to fit your task nicely, so if you think so too, you can contact me privately (see my profile) for assistance.
To answer your questions:
No. Memory-mapped files (MMFs) are literally files on the disk (including on a virtual disk, if you have one) that can be accessed not through the filesystem API but directly using in-memory operations. MMFs tend to be faster for most file operations, which is why they are frequently mentioned (a short sketch follows below).
Our Callback File System or CallbackDisk products (see the virtual storage line) are an alternative; however, as I mentioned in the first paragraph, they won't solve your problem due to the user-mode context switch.
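For completeness, a minimal sketch of the .NET memory-mapped file API (System.IO.MemoryMappedFiles, available since .NET 4.0); the path, map name and sizes are hypothetical:
using System.IO;
using System.IO.MemoryMappedFiles;

class MmfSketch
{
    static void Main()
    {
        // Map an existing (or newly created) file into memory and read/write it directly.
        using (var mmf = MemoryMappedFile.CreateFromFile(@"C:\Data_Files\region.dat",
                                                         FileMode.OpenOrCreate,
                                                         "region",            // map name
                                                         16 * 1024 * 1024))   // 16 MB capacity
        using (MemoryMappedViewAccessor view = mmf.CreateViewAccessor())
        {
            int value = view.ReadInt32(0);   // read 4 bytes at offset 0
            view.Write(0, value + 1);        // write them back; no filesystem call involved
        }
    }
}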
Update:
I see no obstacle to the driver keeping a copy in memory and performing writes to disk asynchronously when needed, but this would require modifying the sample RamDisk driver (and that involves quite a lot of kernel-mode programming).
With SolFS or another virtual file system you can keep a copy of the storage on disk as well. In the case of a virtual file system, it may turn out that working with the container file on disk gives you satisfactory results (since a virtual file system usually has a memory cache) and you won't need to keep an in-memory copy at all.

What is the difference between zlib's gzip compression and the compression used by .NET's GZipStream?

Having an odd problem - one of my app suites has to read/write gzip-compressed files that are used on both Windows and Linux, and I am finding that the files I generate using zlib on Linux are 2-3 times larger than those I generate using GZipStream on Windows. They read perfectly on either platform, so I know that the compression is correct regardless of which platform created the file. The thing is, the files are transferred across the network at various times, and obviously file size is a concern.
My question is:
Has anyone else encountered this?
Is there some documented difference between the two? I do know that GZipStream does not provide a way to specify the compression level as you can with zlib, but I am using maximum compression on the zlib side. Shouldn't I see roughly the same file size, assuming that GZipStream is written to use maximum compression as well?
And the answer is .... the Linux version was never compressing the data to begin with. It took a lot of debugging to find the bug that caused it, but after correcting it, the sizes are now comparable on both platforms.
I think the reason you are experiencing this is not because of the compression algorithm used, but because of how the files are compressed. From the zLib manual:
"The zlib format was designed to be compact and fast for use in memory and on communications channels. The gzip format was designed for single- file compression on file systems, has a larger header than zlib to maintain directory information, and uses a different, slower check method than zlib."
I think what is happening is that the files on your Linux machine are being tar'red together into one file, and then that single file is being compressed. On Windows, I think each individual file is compressed and then the compressed files are stored in one file.
This is my theory, but I have nothing to really support it. I might try some tests at home later, just to satisfy my curiosity.

Create contiguous file using C#?

Is it possible to create a large (physically) contiguous file on a disk using C#? Preferably using only managed code, but if it's not possible then an unmanaged solution would be appreciated.
Any file you create will be logically contiguous.
If you want physical contiguity, you are in OS and filesystem territory - really (far) beyond the control of normal I/O APIs.
But what will probably come close is to claim the space up-front: create an empty stream and set the Length or Position property to what you need.
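A tiny sketch of that pre-allocation idea (the path and size are placeholders); the file becomes logically the right size up front, while the physical layout remains up to the filesystem:
using System.IO;

class Preallocate
{
    static void Main()
    {
        using (var fs = new FileStream(@"C:\bigfile.dat", FileMode.CreateNew,
                                       FileAccess.Write, FileShare.None))
        {
            fs.SetLength(2L * 1024 * 1024 * 1024);   // reserve 2 GB in one go
        }
    }
}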
Writing a defragger?
It sounds like you're after the defragmentation API anyway:-
http://msdn.microsoft.com/en-us/library/aa363911%28v=vs.85%29.aspx
Follow the link below, since it seems you've missed the C# wrapper that someone has kindly produced:
http://blogs.msdn.com/b/jeffrey_wall/archive/2004/09/13/229137.aspx
With modern file systems it is hard to ensure a contiguous file on the hard disk. Logically the file is always contiguous, but the physical blocks that hold the data vary from file system to file system.
Your best bet would be to use an old file system (ext2, FAT32, etc.), ask for a large file by seeking to the desired size, and then flush the file. More up-to-date file systems will probably record the large file size but won't actually write anything to the hard disk, instead returning zeros on a future read without actually reading.
int fileSize = 1024 * 1024 * 512;
using (FileStream file = new FileStream("C:\\MyFile", FileMode.Create, FileAccess.Write))
{
    // Seek to the last byte and write it so the file is actually extended;
    // seeking past the end on its own does not change the file's length.
    file.Seek(fileSize - 1, SeekOrigin.Begin);
    file.WriteByte(0);
}
To build a database, you will need to use the scatter-gather I/O functions provided by the Windows API. This is a special type of file I/O that allows you to either "scatter" data from a file into memory or "gather" data from memory and write it to a contiguous region of a file. While the buffers into which the data is scattered or from which it is gathered need not be contiguous, the source or destination file region is always contiguous.
This functionality consists of two primary functions, both of which work asynchronously. The ReadFileScatter function reads contiguous data from a file on disk and writes it into an array of non-contiguous memory buffers. The WriteFileGather function reads non-contiguous data from memory buffers and writes it to a contiguous file on disk. Of course, you'll also need the OVERLAPPED structure that is used by both of these functions.
This is exactly what SQL Server uses when it reads and writes to the database and/or its log files, and in fact this functionality was added to an early service pack for NT 4.0 specifically for SQL Server's use.
Of course, this is pretty advanced level stuff, and hardly for the faint of heart. Surprisingly, you can actually find the P/Invoke definitions on pinvoke.net, but I have an intensely skeptical mistrust of the site. Since you'll need to spend a lot of quality time with the documentation just to understand how these functions work, you might as well write the declarations yourself. And doing it from C# will create a whole host of additional problems for you, such that I don't even recommend it. If this kind of I/O performance is important to you, I think you're using the wrong tool for the job.
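For reference, a hedged, declarations-only sketch of the P/Invoke signatures (untested; in real use the file must be opened with FILE_FLAG_NO_BUFFERING and FILE_FLAG_OVERLAPPED, the buffers must be page-aligned, and the segment array must be terminated with a zero element):
using System;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

static class ScatterGatherNative
{
    // One element per page-sized buffer; the array passed to the functions
    // must end with a zero (null) element.
    [StructLayout(LayoutKind.Explicit, Size = 8)]
    public struct FILE_SEGMENT_ELEMENT
    {
        [FieldOffset(0)] public long Buffer;      // PVOID64
        [FieldOffset(0)] public ulong Alignment;  // ULONGLONG
    }

    [DllImport("kernel32.dll", SetLastError = true)]
    public static extern bool ReadFileScatter(
        SafeFileHandle hFile,
        [In] FILE_SEGMENT_ELEMENT[] aSegmentArray,
        uint nNumberOfBytesToRead,
        IntPtr lpReserved,      // must be IntPtr.Zero
        IntPtr lpOverlapped);   // pointer to a pinned OVERLAPPED structure (required)

    [DllImport("kernel32.dll", SetLastError = true)]
    public static extern bool WriteFileGather(
        SafeFileHandle hFile,
        [In] FILE_SEGMENT_ELEMENT[] aSegmentArray,
        uint nNumberOfBytesToWrite,
        IntPtr lpReserved,      // must be IntPtr.Zero
        IntPtr lpOverlapped);   // pointer to a pinned OVERLAPPED structure (required)
}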
The poor man's solution is contig.exe, a single-file defragmenter available for free download here.
In short, no.
The OS will do this in the background. What I would do is make the file as big as you expect it to be; that way the OS is more likely to place it contiguously. If you need to grow the file later, grow it by something like 10% each time.
This is similar to how SQL Server keeps its database files.
When opening the FileStream, open it with append.
Example:
FileStream fwriter = new FileStream("C:\\test.txt", FileMode.Append, FileAccess.Write, FileShare.Read);

Multithreaded compression in C#

Is there a library in .net that does multithreaded compression of a stream? I'm thinking of something like the built in System.IO.GZipStream, but using multiple threads to perform the work (and thereby utilizing all the cpu cores).
I know that, for example 7-zip compresses using multiple threads, but the C# SDK that they've released doesn't seem to do that.
I think your best bet is to split the data stream into equal-sized parts yourself and launch threads to compress each part separately in parallel, if you are using non-parallelized algorithms. (A single thread then concatenates the parts into one stream; you can write a stream class that continues reading from the next stream when the current one ends.)
You may wish to take a look at SharpZipLib which is somewhat better than the intrinsic compression streams in .NET.
EDIT: You will need a header to tell where each new stream begins, of course. :)
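A rough sketch of that idea - compress fixed-size chunks in parallel as independent GZip streams and length-prefix each one so a reader knows where every stream begins (the chunk size and file names are arbitrary):
using System;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;

class ParallelGzip
{
    const int ChunkSize = 4 * 1024 * 1024;   // 4 MB per chunk, arbitrary

    static void Main()
    {
        byte[] input = File.ReadAllBytes("input.dat");

        // Split into chunks and compress each one independently, in parallel.
        int chunkCount = (input.Length + ChunkSize - 1) / ChunkSize;
        byte[][] compressed = new byte[chunkCount][];

        Parallel.For(0, chunkCount, i =>
        {
            int offset = i * ChunkSize;
            int length = Math.Min(ChunkSize, input.Length - offset);
            using (var ms = new MemoryStream())
            {
                using (var gz = new GZipStream(ms, CompressionMode.Compress, leaveOpen: true))
                {
                    gz.Write(input, offset, length);
                }
                compressed[i] = ms.ToArray();
            }
        });

        // Concatenate sequentially, length-prefixing each chunk so the reader
        // knows where each compressed stream begins.
        using (var output = new FileStream("output.pgz", FileMode.Create))
        using (var writer = new BinaryWriter(output))
        {
            writer.Write(chunkCount);
            foreach (byte[] chunk in compressed)
            {
                writer.Write(chunk.Length);
                writer.Write(chunk);
            }
        }
    }
}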
Found this library: http://www.codeplex.com/sevenzipsharp
Looks like it wraps the unmanaged 7z.dll, which does support multithreading. Obviously it is not ideal having to wrap unmanaged code, but it looks like this is currently the only option out there.
I recently found a compression library that supports multithreaded bzip2 compression: DotNetZip. The nice thing about this library is that the ParallelBZip2OutputStream class derives from System.IO.Stream and takes a System.IO.Stream as output. This means that you can create a chain of classes derived from System.IO.Stream like:
ICSharpCode.SharpZipLib.Tar.TarOutputStream
Ionic.BZip2.ParallelBZip2OutputStream (from the DotNetZip library)
System.Security.Cryptography.CryptoStream (for encryption)
System.IO.FileStream
In this case we create a .tar.bz2 file, encrypt it (maybe with AES), and write it directly to a file.
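A sketch of that chain, assuming the DotNetZip ParallelBZip2OutputStream and SharpZipLib TarOutputStream APIs; the key handling is deliberately simplified and the file names are placeholders:
using System.IO;
using System.Security.Cryptography;
using ICSharpCode.SharpZipLib.Tar;     // TarOutputStream
using Ionic.BZip2;                     // ParallelBZip2OutputStream (DotNetZip)

class TarBz2Aes
{
    static void Main()
    {
        using (var aes = Aes.Create())                                  // demo key/IV only
        using (var file = new FileStream("archive.tar.bz2.enc", FileMode.Create))
        using (var crypto = new CryptoStream(file, aes.CreateEncryptor(), CryptoStreamMode.Write))
        using (var bzip = new ParallelBZip2OutputStream(crypto))
        using (var tar = new TarOutputStream(bzip))
        {
            // Add one file to the tar; bzip2 compression and AES encryption
            // happen transparently as the bytes flow down the stream chain.
            byte[] data = File.ReadAllBytes("data.bin");
            var entry = TarEntry.CreateTarEntry("data.bin");
            entry.Size = data.Length;
            tar.PutNextEntry(entry);
            tar.Write(data, 0, data.Length);
            tar.CloseEntry();
        }
    }
}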
A compression format (but not necessarily the algorithm) needs to be aware of the fact that you can use multiple threads. Or rather, not necessarily that you use multiple threads, but that you're compressing the original data in multiple steps, parallel or otherwise.
Let me explain.
Most compression algorithms compress data in a sequential manner. Any data can be compressed using information learned from already-compressed data. So, for instance, if you're compressing a book by a bad author who uses a lot of the same words, clichés and sentences over and over, by the time the compression algorithm reaches the second or later occurrence of those things, it will usually be able to compress that occurrence better than the first.
However, a side-effect of this is that you can't really splice together two compressed files without decompressing both and recompressing them as one stream. The knowledge from one file would not match the other file.
The solution of course is to tell the decompression routine that "Hey, I just switched to an altogether new data stream, please start fresh building up knowledge about the data".
If the compression format has support for such a code, you can easily compress multiple parts at the same time.
For instance, a 1 GB file could be split into four 256 MB parts, each part compressed on a separate core, and the pieces spliced together at the end.
If you're building your own compression format, you can of course build support for this yourself.
Whether .ZIP or .RAR or any of the known compression formats can support this is unknown to me, but I know the .7Z format can.
Normally I would say try Intel Parallel Studio, which lets you develop code specifically targeted at multi-core systems, but for now it does C/C++ only. Maybe create just the library in C/C++ and call that from your C# code?
