Efficient compression of folder with same file copied multiple times - c#

I am creating a *.zip using Ionic.Zip. However, my *.zip contains the same files multiple times, sometimes even 20x, and the ZIP format does not take advantage of that at all.
What's worse, Ionic.Zip sometimes crashes with an OutOfMemoryException, since I am compressing the files into a MemoryStream.
Is there a .NET library for compressing that takes advantage of redundancy between files?
Users decompress the files on their own, so it cannot be an exotic format.

I ended up creating a tar.gz using the SharpZipLib library. Using this solution on 1 file, the archive is 3kB. Using it on 20 identical files, the archive is only 6kB, whereas in .zip it was 64kB.
Nuget:
Install-Package SharpZipLib
Usings:
using ICSharpCode.SharpZipLib.GZip;
using ICSharpCode.SharpZipLib.Tar;
Code:
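var output = new MemoryStream();
using (var gzip = new GZipOutputStream(output))
using (var tar = TarArchive.CreateOutputTarArchive(gzip))
{
    // files is assumed to be a collection of file paths
    foreach (string file in files)
    {
        var tarEntry = TarEntry.CreateEntryFromFile(file);
        tar.WriteEntry(tarEntry, false); // false: do not recurse into directories
    }
    // Keep the underlying streams open so output can still be read afterwards
    tar.IsStreamOwner = false;
    gzip.IsStreamOwner = false;
}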

No, there is no such API exposed by the well-known ones (such as GZip, PPMd, Zip, LZMA). They all operate per file (or per stream of bytes, to be more specific).
You could concatenate all the files, i.e. using a tarball format, and then apply a compression algorithm to the result.
Or, it's trivial to implement your own check: compute a hash for each file and store it in a hash-to-filename dictionary. If the hash matches for a later file, you can decide what you want to do, such as ignore that file completely, or perhaps note its name and save it in another file to mark duplicates.
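The hash check described above could look roughly like the following. This is only a sketch: the class and method names (DuplicateFinder, FindDuplicates) are made up for illustration, SHA-256 is one possible choice of hash, and what you do with the detected duplicates is left to the caller.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper, not part of any zip library.
public static class DuplicateFinder
{
    // Maps each duplicate file path to the first path seen with the same content hash.
    public static Dictionary<string, string> FindDuplicates(IEnumerable<string> paths)
    {
        var firstSeenByHash = new Dictionary<string, string>();
        var duplicates = new Dictionary<string, string>();

        using (SHA256 sha = SHA256.Create())
        {
            foreach (string path in paths)
            {
                string hash;
                using (FileStream stream = File.OpenRead(path))
                {
                    // Hash the file contents as a stream, without loading the whole file.
                    hash = BitConverter.ToString(sha.ComputeHash(stream));
                }

                string original;
                if (firstSeenByHash.TryGetValue(hash, out original))
                {
                    duplicates[path] = original;   // same content as an earlier file
                }
                else
                {
                    firstSeenByHash[hash] = path;  // first file with this content
                }
            }
        }
        return duplicates;
    }
}
```

Files that appear in the returned dictionary could then be skipped when building the archive and listed in a small manifest entry instead, so that they can be restored after extraction.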

Yes, 7-zip. There is a SevenZipSharp library you could use, but from my experience, launching a compressing process directly using command line is much faster.
My personal experience:
We used SevenZipSharp at a company to decompress archives of up to 1 GB, and it was terribly slow until I reworked the code to run 7-Zip directly through its command-line interface. After that it was as fast as decompressing manually in Windows Explorer.

I haven't tested this, but according to one answerer in How many times can a file be compressed?
If you have a large number of duplicate files, the zip format will zip each independently, and you can then zip the first zip file to remove duplicate zip information.

Related

split zip files to volumes [duplicate]

I need to create spanned (multi-volume) zip files using .Net, but I have been unable to find a library that enables me to do it.
Spanned zip is a zip compressed file that is split among a number of files, which usually have extensions like .z00, .z01, and so on.
The library would have to be open-source or free, because I'm going to use it for an open-source project.
(it's a duplicate to this question, but there are no answers there and I'm not going for ASP specific anyway)
DotNetZip example:
int segmentsCreated;
using (ZipFile zip = new ZipFile())
{
    zip.UseUnicode = true; // utf-8
    zip.AddDirectory(@"MyDocuments\ProjectX");
    zip.Comment = "This zip was created at " + System.DateTime.Now.ToString("G");
    zip.MaxOutputSegmentSize = 100 * 1024; // 100k segments
    zip.Save("MyFiles.zip");
    segmentsCreated = zip.NumberOfSegmentsForMostRecentSave;
}
If segmentsCreated comes back as 5, then you have the following files, each no more than 100 KB in size.
MyFiles.zip
MyFiles.z01
MyFiles.z02
MyFiles.z03
MyFiles.z04
Edited to note: DotNetZip used to live at Codeplex. Codeplex has been shut down. The old archive is still available at Codeplex. It looks like the code has migrated to GitHub:
https://github.com/DinoChiesa/DotNetZip. This looks to be the original author's repo.
https://github.com/haf/DotNetZip.Semverd. This looks to be the currently maintained version. It's also packaged up and available via NuGet at https://www.nuget.org/packages/DotNetZip/
DotNetZip allows you to do this. From their documentation:
The library supports zip passwords, Unicode, ZIP64, stream input and output,
AES encryption, multiple compression levels, self-extracting archives,
spanned archives, and more.
Take a look at the SevenZipSharp library. It supports multi-volume archives.

Creating large multi-part zip with Ionic fails with OutOfMemoryException

I try to create a multi-part zip file with files of a total size of 17 GB, using IonicZip.
Each zip part is set to be not larger than around 500 MB.
zip.MaxOutputSegmentSize = 500000000;
The files that go into the zip are of various sizes each, but never larger than 350MB, usually much smaller, just a couple of KB or MB.
My machine where I create the zip file has 4GB RAM.
When I start the zip creation in my program, I get an OutOfMemoryException at some point due to the used RAM.
(The same works fine when the total size of all files is about 2 GB instead of 17GB).
The code:
ZipFile zip = new ZipFile(zipFileName, Encoding.UTF8);
zip.CompressionLevel = CompressionLevel.BestCompression;
zip.ParallelDeflateThreshold = -1; // https://stackoverflow.com/questions/11981143/compression-fails-when-using-ionic-zip
zip.MaxOutputSegmentSize = 500000000; // around 500MB
[...]
while (...)
{
zip.AddEntry(adjustedFilePath, File.ReadAllBytes(filepath));
}
zip.Save();
I am wondering how IonicZip handles zip.Save in combination with multi-part creation. It should not be necessary to hold all parts in memory, only the current one, right?
And since I set zip.MaxOutputSegmentSize to only around 500MB and the maximum size of a single file that goes into the zip is never more than 350MB, I don't see why it should eat up so much memory.
On the other hand, when the OutOfMemoryException occurs, there is not even any single part of the multi-part written to disk yet. Usually, with smaller amount of files where the zip creation succeeds, the multiple parts are on the filesystem with different creation timestamps, approx. 5 seconds apart each. So I am really not sure what IonicZip is doing internally exactly until it spits out the first zip part.
Sorry, I'm new to C# and .NET. Is the IonicZip the best library for this? Could I use the System.IO.Compression or System.IO.Packaging package instead (I did not see that they support multi-part zips) or the commercial Xceed?
Posts that I already checked but did not help:
Compression fails when using ionic zip
Ionic zip throws out of memory exception
OutOfMemoryException when creating large ZIP file using System.IO.Packaging (not IonicZip related)
I suggest using the built-in .NET System.IO.Compression namespace classes to zip/unzip files. They are stream based, whereas your example does not use streams: it reads each file fully into a byte array with File.ReadAllBytes, so the data has to be held in RAM. With the native .NET libraries you can write out compressed data in chunks to the output stream.
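A minimal sketch of the stream-based approach, assuming .NET Framework 4.5 or later (where ZipArchive is available). The class and method names (StreamingZip, ZipFolder) are hypothetical, and this produces a single zip file; as the question notes, System.IO.Compression does not offer multi-part output.

```csharp
using System.IO;
using System.IO.Compression;

// Hypothetical helper for illustration only.
public static class StreamingZip
{
    // Zips every file under sourceFolder into zipPath, copying each file through
    // a stream so that no file has to be loaded into memory as a whole.
    // zipPath is assumed to lie outside sourceFolder.
    public static void ZipFolder(string sourceFolder, string zipPath)
    {
        using (var output = new FileStream(zipPath, FileMode.Create, FileAccess.Write))
        using (var archive = new ZipArchive(output, ZipArchiveMode.Create))
        {
            foreach (string file in Directory.EnumerateFiles(
                         sourceFolder, "*", SearchOption.AllDirectories))
            {
                // Entry name relative to the source folder, with forward slashes.
                string entryName = file.Substring(sourceFolder.Length)
                    .TrimStart(Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar)
                    .Replace('\\', '/');

                ZipArchiveEntry entry = archive.CreateEntry(entryName, CompressionLevel.Optimal);
                using (Stream entryStream = entry.Open())
                using (FileStream input = File.OpenRead(file))
                {
                    input.CopyTo(entryStream); // copies in small buffered chunks
                }
            }
        }
    }
}
```

Because the archive is written straight to a FileStream, memory use should stay roughly constant regardless of the total size of the input files.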
The solution I settled on, which seems best in terms of effort, is to use Xceed with pretty much the same code logic as I used for IonicZip. I did not want to reinvent the wheel and write multi-part logic for the standard zip/compression package; it seems like such a standard feature that it should be part of any zipping package.
It seems that IonicZip is not really handling it in an efficient way in terms of memory. Xceed works fine without any major increase in memory usage.
using Xceed.Zip;
using Xceed.FileSystem;
[...]
ZipArchive zip = new ZipArchive(new DiskFile(zipFileName));
zip.SplitNameFormat = SplitNameFormat.PkZip;
zip.SplitSize = 500000;
zip.DefaultCompressionLevel = Xceed.Compression.CompressionLevel.Highest;
zip.TempFolder = new DiskFolder(@"C:\temp");
new DiskFolder(localSourceFolder).CopyFilesTo(zip, true, true);

Create a ZIP file without entries touching the disk?

I'm trying to create a program that has the capability of creating a zipped package containing files based on user input.
I don't need any of those files to be written to the hard drive before they're zipped, as that would be unnecessary, so how do I create these files without actually writing them to the hard drive, and then have them zipped?
I'm using DotNetZip.
See the documentation here, specifically the example called "Create a zip using content obtained from a stream":
using (ZipFile zip = new ZipFile())
{
ZipEntry e= zip.AddEntry("Content-From-Stream.bin", "basedirectory", StreamToRead);
e.Comment = "The content for entry in the zip file was obtained from a stream";
zip.AddFile("Readme.txt");
zip.Save(zipFileToCreate);
}
If your files are not already in a stream format, you'll need to convert them to one. You'll probably want to use a MemoryStream for that.
I use SharpZipLib, but if DotNetZip can do everything against a basic System.IO.Stream, then yes, just feed it a MemoryStream to write to.
Writing to the hard disk shouldn't be something you avoid just because it seems unnecessary. That's backwards. Unless it's a requirement that the entire zipping process is done in memory, avoid doing it in memory and write to the hard disk instead.
The hard disk is better suited for storing large amounts of data than memory is. If by some chance your zip file ends up being around a gigabyte in size your application could croak or at least cause a system slowdown. If you write directly to the hard drive the zip could be several gigabytes in size without causing an issue.

Decompression of uploaded file with: The magic number in GZip header is not correct. Make sure you are passing in a GZip stream

I'm uploading a zip file (compressed with WinRAR) to my server using a FileUpload control. On the server I use this code to decompress the file:
HttpPostedFile myFile = FileUploader.PostedFile;
using (Stream inFile = myFile.InputStream)
{
using (GZipStream decompress = new GZipStream(inFile, CompressionMode.Decompress))
{
StreamReader reader = new StreamReader(decompress);
string text = reader.ReadToEnd(); // Here is an error
}
}
But I get this error:
The magic number in GZip header is not correct. Make sure you are passing in a GZip stream
Is there any way to fix this? I'm using .NET 2.0.
Thank you very much for your help.
ZIP and GZIP are not quite the same. You can use a third-party library like #ziplib to decompress ZIP files.
GZip is a format that compresses a given stream into another stream. When used with files it is conventionally given the .gz extension and the content-type application/x-gzip (though often we use the content type of the contained stream and another means of indicating that it's g-zipped). On the web it's often used as a content-encoding or (alas, less well supported, given it's closer to what we generally really want) transfer-encoding to reduce download and upload time "invisibly" (the user thinks they're downloading a large HTML page, but really they're downloading a smaller GZip of it).
Zip is a format that compresses an archive of one or more files, along with information about relative paths. The file produced is conventionally given the .zip extension, and the content-type application/zip (registered with IANA).
There are definite similarities aside from the name, as in they both (generally) use the DEFLATE algorithm, and we can combine the use of GZip with the use of Tar to create an archive similar to what Zip gives us, but they have different uses.
You've got two options:
The simplest (from the programming side of things anyway) is to get a Windows tool that produces GZip files (WinRAR will open but not create them, but there are dozens of tools that will create them, including quite a few that are free). Then your code will work.
The other is to use the Package Class. It's a bit more complicated to use, because a package of potentially several files is inherently more complicated than a single file, but not dreadful by any means. This will let you examine a Zip file, extract the file(s) contained, make changes to them, etc.
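Before choosing a decompressor, it can help to check which format was actually uploaded. The sketch below inspects the leading signature bytes: GZip data starts with 0x1F 0x8B, and a ZIP archive normally starts with "PK" followed by 0x03 0x04 (an empty ZIP archive starts with a different "PK" record and is reported as Unknown here). The type names (ArchiveKind, ArchiveSniffer) are made up for illustration, and the stream is assumed to be seekable so that its position can be restored.

```csharp
using System.IO;

public enum ArchiveKind { Unknown, GZip, Zip }

// Hypothetical helper for illustration only.
public static class ArchiveSniffer
{
    // Reads up to four bytes, classifies them, and rewinds the stream.
    // Assumes stream.CanSeek is true.
    public static ArchiveKind Detect(Stream stream)
    {
        long start = stream.Position;
        byte[] header = new byte[4];
        int read = 0;
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0) break; // end of stream
            read += n;
        }
        stream.Position = start; // leave the stream where we found it

        if (read >= 2 && header[0] == 0x1F && header[1] == 0x8B)
            return ArchiveKind.GZip;
        if (read >= 4 && header[0] == 0x50 && header[1] == 0x4B &&
            header[2] == 0x03 && header[3] == 0x04)
            return ArchiveKind.Zip;
        return ArchiveKind.Unknown;
    }
}
```

With a check like this, the upload handler could pass the stream to GZipStream only when Detect returns ArchiveKind.GZip, and otherwise hand it to a ZIP library or reject it with a clear message.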

How to Decompress nested GZip (TGZ) files in C#

I am receiving a TGZ file that will contain one plain text file along with possibly one or more nested TGZ files. I have figured out how to decompress the main TGZ file and read the plain text file contained in it, but I have not been able to figure out how to recognize and decompress the nested TGZ files. Has anyone come across this problem before?
Also, I do not have control over the file I am receiving, so I cannot change the format of a TGZ file containing nested TGZ files. One other caveat (even though I don't think it matters) is that these files are being compressed and tarred in a Unix or Linux environment.
Thanks in advance for any help.
Try the SharpZipLib (http://www.icsharpcode.net/OpenSource/SharpZipLib/Download.aspx) free library.
It lets you work with TGZ and has methods to test files before trying to inflate them; so you can either rely on the file extensions being correct, or test them individually to see if you can read them as compressed files - then inflate them once the main file has been decompressed.
To read and write .tar and .tgz (or .tar.gz ) files from .NET, you can use this one-file tar class:
http://cheesoexamples.codeplex.com/SourceControl/changeset/view/97756#1868643
Very simple usage. To create an archive:
string[] filenames = { ... };
Ionic.Tar.CreateArchive("archive.tar", filenames);
Create a compressed (gzip'd) tar archive:
string[] filenames = { ... };
Ionic.Tar.CreateArchive("archive.tgz", filenames, TarOptions.Compress);
Read a tar archive:
var entries = Ionic.Tar.List("archive.tar"); // also handles .tgz files
Extract all entries in a tar archive:
var entries = Ionic.Tar.Extract("archive.tar"); // also handles .tgz files
Take a look at DotNetZip on CodePlex.
"If all you want is a better DeflateStream or GZipStream class to replace the one that is built-into the .NET BCL, that is here, too. DotNetZip's DeflateStream and GZipStream are available in a standalone assembly, based on a .NET port of Zlib. These streams support compression levels and deliver much better performance than the built-in classes. There is also a ZlibStream to complete the set (RFC 1950, 1951, 1952)."
It appears that you can iterate through the compressed file and pull the individual files out of the archive. You can then test the files you uncompressed and see if any of them are themselves GZip files.
Here is a snippet from their Examples page:
using (ZipFile zip = ZipFile.Read(ExistingZipFile))
{
foreach (ZipEntry e in zip)
{
e.Extract(OutputStream);
}
}
Keith
