Multithreaded compression in C#

Is there a library in .net that does multithreaded compression of a stream? I'm thinking of something like the built in System.IO.GZipStream, but using multiple threads to perform the work (and thereby utilizing all the cpu cores).
I know that 7-zip, for example, compresses using multiple threads, but the C# SDK they've released doesn't seem to do that.

If you're stuck with non-parallelized algorithms, I think your best bet is to split the data stream into equal-sized parts yourself and launch threads to compress each part in parallel. Afterwards, a single thread concatenates the results into one stream (you can write a stream class that continues reading from the next stream when the current one ends).
You may wish to take a look at SharpZipLib which is somewhat better than the intrinsic compression streams in .NET.
EDIT: You will need a header to tell where each new stream begins, of course. :)
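For illustration, here's a minimal sketch of that approach, assuming GZipStream per chunk, an arbitrary 4 MB chunk size, and a made-up length-prefix header so the parts can be located again:

using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

// Sketch only: compress fixed-size chunks in parallel and concatenate them,
// each prefixed with its compressed length. The header layout is not a standard.
static void CompressChunked(Stream input, Stream output, int chunkSize = 4 * 1024 * 1024)
{
    var chunks = new List<byte[]>();
    var buffer = new byte[chunkSize];
    int read;
    while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        var chunk = new byte[read];
        Array.Copy(buffer, chunk, read);
        chunks.Add(chunk);
    }

    // Compress every chunk independently, using all available cores.
    byte[][] compressed = chunks
        .AsParallel().AsOrdered()
        .Select(chunk =>
        {
            using (var ms = new MemoryStream())
            {
                using (var gz = new GZipStream(ms, CompressionMode.Compress, leaveOpen: true))
                    gz.Write(chunk, 0, chunk.Length);
                return ms.ToArray();
            }
        })
        .ToArray();

    // A single thread concatenates the parts behind a small header.
    using (var writer = new BinaryWriter(output, System.Text.Encoding.UTF8, leaveOpen: true))
    {
        writer.Write(compressed.Length);   // number of parts
        foreach (var part in compressed)
        {
            writer.Write(part.Length);     // length of this compressed part
            writer.Write(part);
        }
    }
}

Bigger chunks compress a little better (more shared context per chunk) but parallelize less evenly; smaller chunks do the opposite.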

Found this library: http://www.codeplex.com/sevenzipsharp
Looks like it wraps the unmanaged 7z.dll, which does support multithreading. Obviously it's not ideal to have to wrap unmanaged code, but it looks like this is currently the only option out there.

I recently found a compression library that supports multithreaded bzip2 compression: DotNetZip. The nice thing about this library is that the ParallelBZip2OutputStream class is derived from System.IO.Stream and takes a System.IO.Stream as output. This means that you can create a chain of classes derived from System.IO.Stream like:
ICSharpCode.SharpZipLib.Tar.TarOutputStream
Ionic.BZip2.ParallelBZip2OutputStream (from the DotNetZip library)
System.Security.Cryptography.CryptoStream (for encryption)
System.IO.FileStream
In this case we create a .tar.bz2 file, encrypt it (maybe with AES) and write it directly to a file.
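A sketch of such a chain might look like this (constructor overloads differ between SharpZipLib/DotNetZip versions, key management is simplified, and the tar entry writing is elided):

// Tar -> parallel bzip2 -> AES -> file. The generated key/IV would need to be
// stored somewhere to decrypt the result later.
using (var aes = System.Security.Cryptography.Aes.Create())
using (var file = System.IO.File.Create("backup.tar.bz2.aes"))
using (var crypto = new System.Security.Cryptography.CryptoStream(
           file, aes.CreateEncryptor(), System.Security.Cryptography.CryptoStreamMode.Write))
using (var bzip2 = new Ionic.BZip2.ParallelBZip2OutputStream(crypto))
using (var tar = new ICSharpCode.SharpZipLib.Tar.TarOutputStream(bzip2))
{
    // PutNextEntry / Write / CloseEntry here for each file to archive ...
}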

A compression format (but not necessarily the algorithm) needs to be aware of the fact that you can use multiple threads. Or rather, not necessarily that you use multiple threads, but that you're compressing the original data in multiple steps, parallel or otherwise.
Let me explain.
Most compression algorithms compress data in a sequential manner. Any data can be compressed using information learned from already-compressed data. So, for instance, if you're compressing a book by a bad author who uses a lot of the same words, clichés and sentences multiple times, by the time the compression algorithm comes to the second and later occurrences of those things, it will usually be able to compress the current occurrence better than the first one.
However, a side-effect of this is that you can't really splice together two compressed files without decompressing both and recompressing them as one stream. The knowledge from one file would not match the other file.
The solution of course is to tell the decompression routine that "Hey, I just switched to an altogether new data stream, please start fresh building up knowledge about the data".
If the compression format has support for such a code, you can easily compress multiple parts at the same time.
For instance, a 1GB file could be split into four 256MB parts, each part compressed on a separate core, and the results spliced together at the end.
If you're building your own compression format, you can of course build support for this yourself.
Whether .ZIP or .RAR or any of the known compression formats can support this is unknown to me, but I know the .7Z format can.
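To make the splice-and-reset idea concrete, here's a hedged sketch of the reading side, assuming the parts were compressed independently with GZipStream and stored behind a made-up count/length header (not any standard container): each part gets a brand-new decompressor, so no knowledge carries over between parts.

using System.IO;
using System.IO.Compression;

// Sketch only: read back independently compressed, length-prefixed parts.
static void DecompressChunked(Stream input, Stream output)
{
    using (var reader = new BinaryReader(input, System.Text.Encoding.UTF8, leaveOpen: true))
    {
        int partCount = reader.ReadInt32();
        for (int i = 0; i < partCount; i++)
        {
            int length = reader.ReadInt32();
            byte[] part = reader.ReadBytes(length);

            // "Start fresh" for every part, as described above.
            using (var ms = new MemoryStream(part))
            using (var gz = new GZipStream(ms, CompressionMode.Decompress))
                gz.CopyTo(output);
        }
    }
}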

Normally I would say try Intel Parallel Studio, which lets you develop code specifically targeted at multi-core systems, but for now it does C/C++ only. Maybe create just the library in C/C++ and call that from your C# code?

Related

efficient continuous data writes on HDD

In my application I need to continuously write data chunks (around 2MB) about every 50ms in a large file (around 2-7 GB). This is done in a sequential, circular way, so I write chunk after chunk into the file and when I'm at the end of the file I start again at the beginning.
Currently I'm doing it as follows:
In C# I call File.OpenWrite once to open the file with read access and set the size of the file with SetLength. When I need to write a chunk, I pass the safe file handle to the unmanaged WriteFile (kernel32.dll). In that call I pass an overlapped structure to specify the position within the file where the chunk has to be written. The chunk I need to write is stored in unmanaged memory, so I have an IntPtr which I can pass to WriteFile.
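A condensed sketch of this setup (error handling omitted; the chunk pointer, size and position are placeholders):

using System;
using System.IO;
using System.Runtime.InteropServices;
using System.Threading;
using Microsoft.Win32.SafeHandles;

static class RawWriter
{
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool WriteFile(
        SafeFileHandle hFile, IntPtr lpBuffer, uint nNumberOfBytesToWrite,
        out uint lpNumberOfBytesWritten, ref NativeOverlapped lpOverlapped);

    // file was opened via File.OpenWrite(path) and sized with SetLength(...)
    public static void WriteChunk(FileStream file, IntPtr chunk, uint chunkSize, long position)
    {
        // The OVERLAPPED structure carries the target offset within the file.
        var overlapped = new NativeOverlapped
        {
            OffsetLow = (int)(position & 0xFFFFFFFF),
            OffsetHigh = (int)(position >> 32)
        };
        uint written;
        WriteFile(file.SafeFileHandle, chunk, chunkSize, out written, ref overlapped);
    }
}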
Now I'd like to know if and how I can make this process more efficient. Any ideas?
Some questions in detail:
Will changing from file I/O to memory-mapped file help?
Can I include some optimizations for NTFS?
Are there some useful parameters when creating the file that I'm missing? (maybe an unmanaged call with special parameters)
Using better hardware will probably be the most cost efficient way to increase file writing efficiency.
There is a paper from Microsoft Research that will answer most of your questions: Sequential File Programming Patterns and Performance with .NET, along with downloadable source code (C#) if you want to run the tests from the paper on your machine.
In short:
The default behavior provides excellent performance on a single disk.
Unbuffered IO should be tested if you have a disk array; it could improve write speed by a factor of eight.
This thread on social.msdn might also be of interest.
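If you want to try the unbuffered/write-through behaviour from the paper in managed code, one commonly used trick (an assumption to verify for your setup) is to cast the native FILE_FLAG_NO_BUFFERING value into FileOptions; note that unbuffered IO requires sector-aligned buffer sizes and file offsets:

using System.IO;

// Sketch: write-through plus (via a cast of the native flag) no OS buffering.
const FileOptions FileFlagNoBuffering = (FileOptions)0x20000000; // FILE_FLAG_NO_BUFFERING

using (var fs = new FileStream(
    "data.bin",
    FileMode.Create, FileAccess.Write, FileShare.None,
    bufferSize: 4096,
    options: FileOptions.WriteThrough | FileFlagNoBuffering))
{
    // write sector-aligned chunks at sector-aligned offsets here
}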

Do ghostscript convertion in-memory using dot net or any other language

Can I use ghostscript API to convert PDF to some other format without reading data from disk or writing results to disk?
It has a big overhead!
I need something like this:
public static byte[][] ConvertPDF(byte[] pdfData)
{
    // Returns an array of byte arrays, one per page
}
Using the Ghostscript API you can send input from anywhere you like. Depending on the output device you choose you may be able to send the output to stdout, or to retrieve a bitmap in memory.
If you want TIFF output then you have to have an output file (Tagged Image File Format, the clue is in the name...)
Similarly, you can't do this with PDF files as input, those have to be available as a file, because PDF is a random access format.
What leads you to think that this is a performance problem ?
Since there still isn't a correct answer here all these years later, I'll provide one.
Ghostscript performs its operations on disk. It doesn't use an input and output path merely to load the file into memory, perform its operations, and write it back. It actually reads and writes parts of the file to disk as it goes (using multiple threads). While this IS slower, it also uses much less memory (bearing in mind that these files could potentially be quite large).
Because the operations are performed on disk, there was not (at the time of this question) any way to pass in or retrieve a byte array/memory stream, because to do so would be "dishonest": it might imply a "shortcut" that avoids disk IO when in fact there is none. Later, support was added to accept and return memory streams, but it's important to note that this support merely accepts the memory stream, writes it to a temporary file, performs the operations, and then reads the result back into a new memory stream.
If that still meets your needs (for example, if you want the inevitable IO to be handled by the library rather than your business logic), here are a couple of links demonstrating how to go about it (your exact needs will change the mechanics).
Image to pdf (memory stream to memory stream via rasterizer)
Image to pdf (file to memory stream via processor)
Pdf to image (memory stream to memory stream via rasterizer)
Hopefully these will, collectively, provide enough information to solve this issue for others who, like me & OP, mostly found people saying it was impossible and that I shouldn't even be trying.
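As a rough sketch of the memory-stream route with the Ghostscript.NET rasterizer (the Open/GetPage signatures vary between versions of that wrapper, so treat every call here as an assumption to check, and remember it still works through temporary files under the hood):

using System.Collections.Generic;
using System.Drawing.Imaging;
using System.IO;
using Ghostscript.NET.Rasterizer;

// Sketch: render each PDF page to a PNG held in memory. In some versions
// GetPage takes a single dpi argument instead of xDpi/yDpi.
public static byte[][] ConvertPDF(byte[] pdfData)
{
    var pages = new List<byte[]>();
    using (var pdfStream = new MemoryStream(pdfData))
    using (var rasterizer = new GhostscriptRasterizer())
    {
        rasterizer.Open(pdfStream);
        for (int page = 1; page <= rasterizer.PageCount; page++)
        {
            using (var image = rasterizer.GetPage(96, 96, page))
            using (var pageStream = new MemoryStream())
            {
                image.Save(pageStream, ImageFormat.Png);
                pages.Add(pageStream.ToArray());
            }
        }
    }
    return pages.ToArray();
}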

C# array to implement support for page-file architecture

Let me explain what I need to accomplish. I need to load a file into RAM and analyze its structure. What I was doing is this:
// Stream streamFile;
byte[] bytesFileBuff = new byte[streamFile.Length];
if (streamFile.Read(bytesFileBuff, 0, (int)streamFile.Length) == streamFile.Length)
{
    // Loaded OK, can now analyze 'bytesFileBuff'
    // Go through the bytes in 'bytesFileBuff' from 0 to streamFile.Length
}
But in my previous experience with Windows and 32-bit processes, it seems like even smaller amounts of RAM can be hard to allocate. (In that particular example I failed to allocate 512MB on a Windows 7 machine with 16GB of installed RAM.)
So I was curious: is there a special class that would allow me to work with the contents of a file of hypothetically any length (by implementing an internal analog of a page-file architecture)?
If linear stream access (even with multiple passes) is not a viable option, the solution in Win32 would be to use Memory Mapped Files with relatively small Views.
I didn't think you could do that in C# easily, but I was wrong. It turns out that .NET 4.0 and above provide classes wrapping the Memory Mapped Files API.
See http://msdn.microsoft.com/en-us/library/dd997372.aspx
If you have used memory mapped files in C/C++, you will know what to do.
The basic idea would be to use MemoryMappedFile.CreateFromFile to obtain a MemoryMappedFile object. With that object you can call the CreateViewAccessor method to get MemoryMappedViewAccessor objects that represent chunks of the file; you can use these objects to read the file in chunks of your choice. Make sure you dispose each MemoryMappedViewAccessor diligently to release the mapped memory.
You have to work out the right strategy for using memory mapped files. You don't want to create too many small views or you will suffer a lot of overhead. Too few larger views and you will consume a lot of memory.
(As I said, I didn't know about these class wrappers in .NET. Do read the MSDN docs carefully: I might have easily missed something important in the few minutes I spent reviewing them)
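A small sketch of that strategy with the .NET 4.0 wrappers (the 64 MB view size is an arbitrary choice):

using System;
using System.IO;
using System.IO.MemoryMappedFiles;

// Sketch: walk a large file through modest-sized memory-mapped views.
const long viewSize = 64 * 1024 * 1024;
long fileLength = new FileInfo("huge.bin").Length;

using (var mmf = MemoryMappedFile.CreateFromFile("huge.bin", FileMode.Open))
{
    for (long offset = 0; offset < fileLength; offset += viewSize)
    {
        long size = Math.Min(viewSize, fileLength - offset);
        using (var view = mmf.CreateViewAccessor(offset, size, MemoryMappedFileAccess.Read))
        {
            var buffer = new byte[size];
            view.ReadArray(0, buffer, 0, (int)size);
            // ... analyze this chunk of the file ...
        } // disposing the accessor releases the mapped view
    }
}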

File I/O best practice - byte[] or FileStream?

I'm currently working with a lot of different file types (txt, binary, office, etc). I typically use a byte[] or string to hold the file data in memory (while it is being written/parsed) and in order to read/write it into files I write the entire data using a FileStream after the data has been completely processed.
Should I be using a TextStream instead of a string while generating data for a text file?
Should I be using a FileStream instead of a byte[] while generating data for a binary file?
Would using streams give me better performance instead of calculating the entire data and outputting it in one go at the end?
Is it a general rule that File I/O should always use streams or is my approach fine in some cases?
The advantage of a byte[]/string vs a stream may be that the byte[]/string is in memory, and accessing it may be faster. If the file is very large, however, you may end up paging thus reducing performance. Another advantage of the byte[]/string approach is that the parsing may be a little easier (simply use File.ReadAllText, say).
If your parsing allows (particularly if you don't need to seek randomly), using a FileStream can be more efficient especially if the file is rather large. Also, you can make use of C#'s (4.5) async/await features to very easily read/write the file asynchronously and process chunks that you read in.
Personally, I'd probably just read the file into memory if I'm not too worried about performance, or the file is very small. Otherwise I'd consider using streams.
Ultimately I would say write some simple test programs and time the performance of each if you're worried about the performance differences, that should give you your best answer.
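For example, the async chunked-read pattern mentioned above might look roughly like this (the buffer size and the processChunk callback are placeholders):

using System;
using System.IO;
using System.Threading.Tasks;

// Sketch: read a file asynchronously in chunks and hand each chunk to some
// processing routine. processChunk is a stand-in for your own parsing.
static async Task ReadInChunksAsync(string path, Action<byte[], int> processChunk)
{
    var buffer = new byte[64 * 1024];
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                   FileShare.Read, buffer.Length, useAsync: true))
    {
        int read;
        while ((read = await fs.ReadAsync(buffer, 0, buffer.Length)) > 0)
            processChunk(buffer, read); // only the first 'read' bytes are valid
    }
}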
Apart from talking about the size of the data, another important question is the purpose of the data. Manipulation is easier to perform when working with strings and arrays. If both strings and arrays are equally convenient then an array of bytes would be preferred. Strings have to be interpreted which brings in complexity (Encoding, BOM etc) and therefore increases the likelihood of a bug. Use strings only for text. Binary data should always be handled by byte arrays or streams.
Streams should be considered each time you either don't have to perform any manipulation or the subjected data is very large or the subjected data is coming in very slowly. Streams are a natural way of processing data part by part whereas strings and arrays in general expect the data to be there in its entirety before processing it.
Working with streams will generally yield better performance, since it opens up the possibility of having different channels reading and writing asynchronously.
"while generating data for a text file"
If the file data must be flushed as it is generated, choose a StreamWriter over the FileStream. If not, build the text in a StringBuilder and write it out at the end.
"while generating data for a binary file?"
A MemoryStream is one option; better still, use a BinaryWriter over that MemoryStream.
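To illustrate both suggestions (file names are placeholders):

using System.IO;

// Text: write straight through a StreamWriter when data should be flushed as
// it is produced; otherwise accumulate in a StringBuilder and write once.
using (var writer = new StreamWriter(new FileStream("log.txt", FileMode.Create)))
{
    writer.WriteLine("generated line");
}

// Binary: build the data in a MemoryStream through a BinaryWriter, then dump
// it to disk in one go.
using (var ms = new MemoryStream())
{
    using (var bw = new BinaryWriter(ms, System.Text.Encoding.UTF8, leaveOpen: true))
    {
        bw.Write(42);
        bw.Write(3.14);
    }
    File.WriteAllBytes("data.bin", ms.ToArray());
}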

Create zip style file without compression

I know that there are many free and not so free compression libraries out there, but for the project i am working on, i need to be able to take file data from a stream and put it into some kind zip or pack file, but without compression, because i will need to access these files quickly without having to wait for them to decompress.
Anyone know how this could be approached, or if there are some libraries out there that do this that i am not aware of?
You can use Zip for this. You would use a compression level of something like "none" or "store", which just combines the files without compression. This site enumerates some of them:
Maximum - The slowest of the compression options, but the most useful for creating small archives.
Normal - The default value.
Low - Faster than the default, but less effective.
Minimum - Extremely fast compression, but not as efficient as other methods.
None - Creates a ZIP file but does not compress it. File size may be slightly larger if archive is encrypted or made self-extracting.
Here are some C# examples:
CodeProject
EggheadCafe
For the unix unaware, this is exactly what tar does. When you see .tar.gz files, it's just a bunch of files combined into a tar file, and then run through gzip.
Have a look at System.IO.Packaging namespace.
Quote from MSDN:
System.IO.Packaging
Provides classes that support storage of multiple data objects in a single container.
Package is an abstract class that can be used to organize objects into a single entity of a defined physical format for portability and efficient access.
A ZIP file is the primary physical format for the Package. Other Package implementations might use other physical formats such as an XML document, a database, or Web service.
You can select different compression options for your package:
NotCompressed - Compression is turned off.
NotCompressed - Compression is turned off.
Normal - Compression is optimized for a balance between size and performance.
Maximum - Compression is optimized for size.
Fast - Compression is optimized for performance.
SuperFast - Compression is optimized for high performance.
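A brief sketch with the Package API and compression turned off (the part name and content type are arbitrary; you need a reference to WindowsBase):

using System;
using System.IO;
using System.IO.Packaging;

// Sketch: store a file inside an OPC package (a ZIP under the hood) without
// compressing it.
using (Package package = Package.Open("container.pack", FileMode.Create))
{
    Uri partUri = PackUriHelper.CreatePartUri(new Uri("/data.bin", UriKind.Relative));
    PackagePart part = package.CreatePart(
        partUri, "application/octet-stream", CompressionOption.NotCompressed);

    using (Stream source = File.OpenRead("data.bin"))
    using (Stream destination = part.GetStream())
        source.CopyTo(destination);
}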
Perhaps just use a zip with compression set to "none"; SharpZipLib would suffice.
Be careful about assuming that compression is slower, though - it might actually (depending on the scenario) be quicker with compression, since you reduce the amount of physical IO and IPC (often a bottleneck), and simply do a bit more CPU work; but you generally have plenty of CPU.
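For instance, with SharpZipLib's ZipOutputStream you can simply set the level to 0 (store):

using System;
using System.IO;
using ICSharpCode.SharpZipLib.Zip;

// Sketch: a ZIP written with SharpZipLib at compression level 0, i.e. the
// entries are stored without compression.
using (var zip = new ZipOutputStream(File.Create("pack.zip")))
{
    zip.SetLevel(0); // 0 = store only, no compression

    var entry = new ZipEntry("data.bin") { DateTime = DateTime.Now };
    zip.PutNextEntry(entry);
    using (var source = File.OpenRead("data.bin"))
        source.CopyTo(zip);
    zip.CloseEntry();
    zip.Finish();
}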
Traditionally, the simple storage files under Windows are cabinet files, which support compression as well as signing; ZIP does not support signing.
Look for a way to create cabinet files from within .NET.
Remember to profile first. Your hard drive is much slower than your CPU or RAM. If the file is sitting on disk, reading a smaller, compressed file will take less time than reading an uncompressed blob. The difference may well be more than the time it takes to decompress it.
Also the OS may cache the file in memory. When that happens the harddrive is completely removed from the loop (transparent to you). That could make the decompression time too costly.
I learned this "technique" when dealing with slow internet connections. The client needed the data fast and we had cycles to spare. Sending compressed packets increase the throughput/latency of the application.
I had an additional requirement that the resulting pack file be browsable with standard tools (at least FAR Manager).
So far I've tried:
OPC (Open Packaging Conventions, the System.IO.Packaging namespace, ZIP-based, the backend for MSO .docx files). Built-in and standard, but quite slow, probably because it first copies all data to a temporary location in case it has to be compressed (even when it doesn't), and only then writes to the final destination. Unbearably slow. Note that there's also a Windows built-in implementation which is not .NET-based; it might be faster but does not span all of the OS versions I have to support.
ITSS (InfoTech Storage System, the backend of CHM files). Built-in in Windows, somewhat standard. Surprisingly, the implementation is incomplete, and it's deadly slow, even slower than OPC.
DOC (COM Compound File Structured Storage, backend for MSO .doc files, .msi files, etc). Built-in in Windows, quite standard. Does not support file names longer than 32 characters, which is a significant drawback in my case. Fast enough on small to middle sizes (totally outruns .NET OPC impl), but has some scalability issues when it goes up to gigabytes.
Various ZIP implementations are still to be tested.
