How to Decompress nested GZip (TGZ) files in C#

I am receiving a TGZ file that will contain one plain text file along with possibly one or more nested TGZ files. I have figured out how to decompress the main TGZ file and read the plain text file contained in it, but I have not been able to figure out how to recognize and decompress the nested TGZ files. Has anyone come across this problem before?
Also, I do not have control over the file I am receiving, so I cannot change the format of a TGZ file containing nested TGZ files. One other caveat (even though I don't think it matters) is that these files are being compressed and tarred in a Unix or Linux environment.
Thanks in advance for any help.

Try the SharpZipLib (http://www.icsharpcode.net/OpenSource/SharpZipLib/Download.aspx) free library.
It lets you work with TGZ files and has methods to test files before trying to inflate them, so you can either rely on the file extensions being correct, or test each file individually to see whether it can be read as a compressed stream - then inflate the nested archives once the main file has been decompressed.
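To make that concrete, here is a minimal sketch of the recursive approach. The class name `NestedTgz` is mine, and for brevity it uses the built-in `GZipStream` plus `System.Formats.Tar` (available on .NET 7+) for the tar layer - with SharpZipLib the structure is the same, only the tar reader type differs. Nested TGZ entries are detected by the GZip magic bytes `0x1F 0x8B` rather than by extension:

```csharp
// Sketch: recursively unpack a .tgz that may contain nested .tgz entries.
// Assumptions: .NET 7+ (System.Formats.Tar); NestedTgz is an illustrative name.
using System;
using System.IO;
using System.IO.Compression;
using System.Formats.Tar;

static class NestedTgz
{
    // GZip streams always begin with the magic bytes 0x1F 0x8B.
    public static bool LooksLikeGZip(ReadOnlySpan<byte> head) =>
        head.Length >= 2 && head[0] == 0x1F && head[1] == 0x8B;

    public static void Extract(Stream tgz, string outDir)
    {
        Directory.CreateDirectory(outDir);
        using var gz  = new GZipStream(tgz, CompressionMode.Decompress);
        using var tar = new TarReader(gz);

        TarEntry entry;
        while ((entry = tar.GetNextEntry()) != null)
        {
            if (entry.EntryType == TarEntryType.Directory || entry.DataStream == null)
                continue;

            // Buffer the entry so we can peek at its first bytes and rewind.
            var data = new MemoryStream();
            entry.DataStream.CopyTo(data);
            data.Position = 0;

            byte[] head = new byte[2];
            int n = data.Read(head, 0, 2);
            data.Position = 0;

            if (n == 2 && LooksLikeGZip(head))
                Extract(data, Path.Combine(outDir, entry.Name + ".extracted")); // nested TGZ: recurse
            else
                File.WriteAllBytes(Path.Combine(outDir, Path.GetFileName(entry.Name)), data.ToArray());
        }
    }
}
```

The magic-number test is more reliable than checking for a `.tgz` extension, since the sender controls the file names.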

To read and write .tar and .tgz (or .tar.gz) files from .NET, you can use this one-file tar class:
http://cheesoexamples.codeplex.com/SourceControl/changeset/view/97756#1868643
Very simple usage. To create an archive:
string[] filenames = { ... };
Ionic.Tar.CreateArchive("archive.tar", filenames);
Create a compressed (gzip'd) tar archive:
string[] filenames = { ... };
Ionic.Tar.CreateArchive("archive.tgz", filenames, TarOptions.Compress);
Read a tar archive:
var entries = Ionic.Tar.List("archive.tar"); // also handles .tgz files
Extract all entries in a tar archive:
var entries = Ionic.Tar.Extract("archive.tar"); // also handles .tgz files

Take a look at DotNetZip on CodePlex.
"If all you want is a better
DeflateStream or GZipStream class to
replace the one that is built-into the
.NET BCL, that is here, too.
DotNetZip's DeflateStream and
GZipStream are available in a
standalone assembly, based on a .NET
port of Zlib. These streams support
compression levels and deliver much
better performance than the built-in
classes. There is also a ZlibStream to
complete the set (RFC 1950, 1951,
1952)."
It appears that you can iterate through the compressed file and pull the individual files out of the archive. You can then test the files you uncompressed and see if any of them are themselves GZip files.
Here is a snippet from their Examples page:
using (ZipFile zip = ZipFile.Read(ExistingZipFile))
{
    foreach (ZipEntry e in zip)
    {
        e.Extract(OutputStream);
    }
}
Keith

Related

How to locate compressed bits of certain files in a zip archive for further processing?

Context:
To save more space, I want to further compress some files in a zip archive with one algorithm, and other files in the same archive with another algorithm. Later I need to revert the process to get the original zip archive back, because the zip files are owned by users.
How to locate compressed bits of certain files in a zip archive for further processing?
Language: I guess this kind of code is usually C/C++ for performance, but C# is good too.
OS: Windows Server 2012 R2 or later.
Edit:
I learned that in the zip (zlib) format, compressed files are organized in blocks. We should be able to locate the files by searching for headers. Still checking on how to code it.
The zip file format is fully documented. e.g. http://www.fileformat.info/format/zip/corion.htm
You can find other sources and descriptions easily.
What you have to do is read bytes according to the format, and then you know exactly where a certain file's compressed bits are. You can find open-source libraries for this, or you can roll your own in your preferred language.
As a side note, compressing an already-compressed file in a zip archive may not be worth the effort.
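As a sketch of what "read bytes according to the format" looks like in practice (the `ZipScan` class and `Entry` record are illustrative names, not from any library): each entry is preceded by a local file header with signature `PK\x03\x04`, and the compressed bytes begin right after the header's variable-length name and extra fields. Production code should walk the central directory instead, because entries written to non-seekable streams keep their sizes in trailing data descriptors, not in the local header:

```csharp
// Sketch: walk local file headers to find where each entry's compressed
// bytes start. Caveat: relies on sizes being present in the local headers;
// use the central directory for robustness.
using System;
using System.Collections.Generic;
using System.IO;

static class ZipScan
{
    public record Entry(string Name, long DataOffset, uint CompressedSize, ushort Method);

    public static List<Entry> Scan(byte[] zip)
    {
        var list = new List<Entry>();
        using var r = new BinaryReader(new MemoryStream(zip));
        while (r.BaseStream.Position + 4 <= r.BaseStream.Length)
        {
            if (r.ReadUInt32() != 0x04034b50) break;   // not a local header: central directory reached
            r.BaseStream.Seek(4, SeekOrigin.Current);  // version needed, general-purpose flags
            ushort method = r.ReadUInt16();            // 0 = stored, 8 = DEFLATE
            r.BaseStream.Seek(8, SeekOrigin.Current);  // mod time, mod date, CRC-32
            uint csize = r.ReadUInt32();               // compressed size
            r.BaseStream.Seek(4, SeekOrigin.Current);  // uncompressed size
            ushort nameLen  = r.ReadUInt16();
            ushort extraLen = r.ReadUInt16();
            string name = System.Text.Encoding.ASCII.GetString(r.ReadBytes(nameLen));
            r.BaseStream.Seek(extraLen, SeekOrigin.Current);
            list.Add(new Entry(name, r.BaseStream.Position, csize, method));
            r.BaseStream.Seek(csize, SeekOrigin.Current); // skip the compressed bits
        }
        return list;
    }
}
```

The `DataOffset`/`CompressedSize` pair is exactly the span of bytes you would swap out for your re-compressed payload (and restore later).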

Programmatically merging zip segments made by DotNetZip

I have a problem with merging zip segments which I made using DotNetZip library.
I'm zipping a big file, which produces files like: abc.zip, abc.z01 and abc.z02.
using (ZipFile zip = new ZipFile())
{
    zip.AddDirectory(fullDir);
    zip.MaxOutputSegmentSize = 65536;
    zip.Save(packageFullName);
    return zip.NumberOfSegmentsForMostRecentSave;
}
In another service I want to download these files and merge them into a single zip file. I did this by simply concatenating their byte arrays. Sadly, I get an error that the archive I created isn't valid.
I'm not sure why my approach isn't right. I found this Super User question: https://superuser.com/questions/15935/how-do-i-reassemble-a-zip-file-that-has-been-emailed-in-multiple-parts - the accepted answer also produces an invalid archive.
Does anybody know how I can merge a few DotNetZip segments? I don't really want to extract them in memory and pack them once again, but maybe it's the only way.
DotNetZip can read segmented zip files without a problem. (Note that simply concatenating the segments' bytes produces an invalid archive: a segmented zip's central directory records the segment number each entry starts on, so those fields would need rewriting.) You can refer to DotNetZip's source code to see how it handles the segment files as one zip file. It's an internal class you cannot use directly, but it may give you a clue how to do it:
http://dotnetzip.codeplex.com/SourceControl/latest#Zip/ZipSegmentedStream.cs

Uncompress docx files, compare their contents and create a new merged docx file

Why is a docx re-compressed with ZipFile.CreateFromDirectory not identical to the original one?
I'm building a module for comparing "docx" (and other Word) documents. First I uncompress two "docx" files. Then I compare and merge the XML files in the directory structures created by decompression. Finally I compress the merged directory and create the new "docx" file. The two "docx" files (the original and the merged one) are the same according to Microsoft Word's comparison. The XML contents are also the same according to a CRC32 comparison, but either the size or the CRC32 value of the merged "docx" file differs from the original. For the decompression I use the System.IO.Compression library.
Is this a compression problem? What compression algorithm do Microsoft Word (and other viewers) use for creating Open XML documents such as "docx" files?
I run some unit tests for several docx comparisons, so I think the only way to check whether a test passed correctly is to compare the CRC32 numbers.
public static void CreateCompressFile(string dirinfo, string originalFile)
{
    FileInfo fi = new FileInfo(originalFile);
    ZipFile.CreateFromDirectory(dirinfo,
        originalFile.Replace(fi.Extension, "_tmp" + fi.Extension),
        CompressionLevel.Fastest, false);
}
Docx is a ZIP file. As long as the decompressed content is the same, the files can be considered identical from Word's point of view (unless you need to sign the ZIP file itself for some reason).
The ZIP file format does not require a particular encoding of the compressed data - it explicitly allows variation in compression quality. Each compression library/tool is free to pick a compression level based on its internal criteria. It is unlikely that two different implementations will produce identical ZIP files from the same content, even if the options passed to compression are similar.
Indeed, even the sample you have shows the ability to pick a CompressionLevel: ZipFile.CreateFromDirectory(..., CompressionLevel.Fastest, ...);.
Similar questions have been discussed before on SE: ZIP files created with GUI have more bytes than ZIP files created in a shell.
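The practical consequence is to compare archives entry-by-entry rather than byte-by-byte. A hedged sketch (`DocxCompare` is my name, not a library type; `ZipArchiveEntry.Crc32` requires .NET Core 2.0 or later) that treats two docx files as equal when their entry names, CRC32 values, and uncompressed lengths all match:

```csharp
// Sketch: compare two Office documents entry-by-entry. Two docx files with
// identical content can differ as raw bytes because each zip tool picks its
// own compression level; per-entry CRC32 comparison sidesteps that.
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;

static class DocxCompare
{
    public static bool SameContent(Stream a, Stream b)
    {
        using var za = new ZipArchive(a, ZipArchiveMode.Read);
        using var zb = new ZipArchive(b, ZipArchiveMode.Read);

        // Order by name so entry ordering inside the archives doesn't matter.
        var ea = za.Entries.OrderBy(e => e.FullName)
                   .Select(e => (e.FullName, e.Crc32, e.Length)).ToList();
        var eb = zb.Entries.OrderBy(e => e.FullName)
                   .Select(e => (e.FullName, e.Crc32, e.Length)).ToList();
        return ea.SequenceEqual(eb);
    }
}
```

This is also a reasonable equality check for the unit tests mentioned above, since it is stable across compression levels and tools.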

Efficient compression of folder with same file copied multiple times

I am creating a *.zip using Ionic.Zip. However, my *.zip contains the same files multiple times, sometimes even 20x, and the ZIP format does not take advantage of this at all.
What's worse, Ionic.Zip sometimes crashes with an OutOfMemoryException, since I am compressing the files into a MemoryStream.
Is there a .NET library for compression that takes advantage of redundancy between files?
Users decompress the files on their own, so it cannot be an exotic format.
I ended up creating a tar.gz using the SharpZipLib library. Using this solution on 1 file, the archive is 3kB. Using it on 20 identical files, the archive is only 6kB, whereas in .zip it was 64kB.
Nuget:
Install-Package SharpZipLib
Usings:
using ICSharpCode.SharpZipLib.GZip;
using ICSharpCode.SharpZipLib.Tar;
Code:
var output = new MemoryStream();
using (var gzip = new GZipOutputStream(output))
using (var tar = TarArchive.CreateOutputTarArchive(gzip))
{
    for (int i = 0; i < files.Count; i++)
    {
        var tarEntry = TarEntry.CreateEntryFromFile(files[i]);
        tar.WriteEntry(tarEntry, false);
    }
    // Keep the underlying MemoryStream open after the archive is closed.
    tar.IsStreamOwner = false;
    gzip.IsStreamOwner = false;
}
No, there is no such API exposed by the well-known formats (GZip, PPMd, Zip, LZMA). They all operate per file (or, more specifically, per stream of bytes).
You could concatenate all the files, i.e. use a tarball format, and then apply a compression algorithm.
Or it's trivial to implement your own check: compute a hash for each file and store it in a hash-to-filename dictionary. If the hash matches for the next file, you can decide what to do, such as ignore that file completely, or note its name and save it in another file to mark the duplicates.
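A minimal sketch of that hash-to-filename idea (the `Dedup` class and `Split` helper are illustrative names; it assumes file contents fit in memory - hash a stream instead for large files):

```csharp
// Sketch: SHA-256 each file, keep the first occurrence of each distinct
// content, and map duplicates back to the first name seen.
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

static class Dedup
{
    // Returns (uniqueNames, duplicateOf) where duplicateOf maps a duplicate's
    // name to the first name seen with the same content.
    public static (List<string> Unique, Dictionary<string, string> DuplicateOf)
        Split(IEnumerable<(string Name, byte[] Content)> files)
    {
        var firstByHash = new Dictionary<string, string>();
        var unique = new List<string>();
        var dupOf = new Dictionary<string, string>();
        using var sha = SHA256.Create();
        foreach (var (name, content) in files)
        {
            string h = Convert.ToHexString(sha.ComputeHash(content)); // .NET 5+
            if (firstByHash.TryGetValue(h, out var first))
                dupOf[name] = first;      // same bytes already stored once
            else
            {
                firstByHash[h] = name;
                unique.Add(name);
            }
        }
        return (unique, dupOf);
    }
}
```

You would then add only the unique files to the archive and ship the duplicate map alongside (or as an extra entry), reconstructing the copies on extraction.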
Yes, 7-zip. There is a SevenZipSharp library you could use, but from my experience, launching the compression process directly via the command line is much faster.
My personal experience:
We used SevenZipSharp in a company to decompress archives up to 1 GB, and it was terribly slow until I reworked it to call the 7-zip executable directly through its command-line interface. Then it was as fast as decompressing manually in Windows Explorer.
I haven't tested this, but according to one answerer in How many times can a file be compressed?
If you have a large number of duplicate files, the zip format will zip each independently, and you can then zip the first zip file to remove duplicate zip information.

Decompression of uploaded file with: The magic number in GZip header is not correct. Make sure you are passing in a GZip stream

I'm uploading a zip file (compressed with WinRAR) to my server via a FileUpload control. On the server I use this code to decompress the file:
HttpPostedFile myFile = FileUploader.PostedFile;
using (Stream inFile = myFile.InputStream)
{
    using (GZipStream decompress = new GZipStream(inFile, CompressionMode.Decompress))
    {
        StreamReader reader = new StreamReader(decompress);
        string text = reader.ReadToEnd(); // The error is thrown here
    }
}
But I get error:
The magic number in GZip header is not correct. Make sure you are passing in a GZip stream
Is there any way to fix this? I'm using .NET 2.0.
Thank you very much for the help.
ZIP and GZIP are not quite the same. You can use a third-party library like #ziplib to decompress ZIP files.
GZip is a format that compresses a given stream into another stream. When used with files it is conventionally given the .gz extension and the content type application/x-gzip (though often we use the content type of the contained stream and another means of indicating that it's gzipped). On the web it's often used as a content-encoding or (alas, less well supported, given it's closer to what we generally really want) transfer-encoding to reduce download and upload time "invisibly" (the user thinks they're downloading a large HTML page, but really they're downloading a smaller gzip of it).
Zip is a format that compresses an archive of one or more files, along with information about relative paths. The file produced is conventionally given the .zip extension, and the content-type application/zip (registered with IANA).
There are definite similarities aside from the name: they both (generally) use the DEFLATE algorithm, and we can combine GZip with Tar to create an archive similar to what Zip gives us, but they have different uses.
You've got two options:
The simplest (from the programming side of things, anyway) is to get a Windows tool that produces GZip files (WinRAR will open but not create them, but there are dozens of tools that will create them, including quite a few free ones). Then your code will work.
The other is to use the Package class. It's a bit more complicated to use, because a package of potentially several files is inherently more complicated than a single file, but not dreadful by any means. This will let you examine a Zip file, extract the contained file(s), make changes to them, etc.
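With either option, it helps to sniff the upload's first bytes before choosing a decompressor, since the two formats have distinct magic numbers: ZIP local headers start with `PK\x03\x04`, GZip streams with `0x1F 0x8B`. A small sketch (`FormatSniff` is an illustrative name; it assumes a seekable stream so it can rewind after peeking):

```csharp
// Sketch: distinguish ZIP from GZip by magic number instead of trusting
// the file extension or the user's upload.
using System;
using System.IO;

static class FormatSniff
{
    public enum Kind { Unknown, Zip, GZip }

    public static Kind Detect(Stream s)
    {
        var head = new byte[4];
        int n = s.Read(head, 0, 4);
        s.Seek(-n, SeekOrigin.Current);  // rewind so the caller can re-read
        if (n >= 2 && head[0] == 0x1F && head[1] == 0x8B)
            return Kind.GZip;
        if (n >= 4 && head[0] == 'P' && head[1] == 'K' && head[2] == 3 && head[3] == 4)
            return Kind.Zip;
        return Kind.Unknown;
    }
}
```

In the question above this check would have reported `Zip`, explaining the "magic number in GZip header is not correct" error before any decompression was attempted.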
