I'm searching for a way to add embedded resources to my solution. These resources will be folders with a lot of files in them. On user demand they need to be decompressed.
I'm searching for a way to store such folders in the executable without involving third-party libraries (it looks rather stupid, but that is the task).
I have found that I can GZip and UnGZip them using the standard libraries. But GZip handles a single file only. In such cases TAR should come onto the scene. But I haven't found a TAR implementation among the standard classes.
Is it possible to decompress TAR with bare C#?
While looking for a quick answer to the same question, I came across this thread, and was not entirely satisfied with the current answers, as they all point to taking third-party dependencies on much larger libraries, all just to achieve simple extraction of a tar.gz file to disk.
While the gz format could be considered rather complicated, tar on the other hand is quite simple. At its core, it just takes a bunch of files, prepends a 500-byte header (padded to 512 bytes) to each one describing the file, and writes them all to a single archive on a 512-byte alignment. There is no compression; that is typically handled by compressing the created file into a gz archive, which .NET conveniently has built in, and which takes care of the hard part.
Having looked at the spec for the tar format, there are really only two values (especially on Windows) we need to pick out of the header in order to extract a file from a stream. The first is the name, and the second is the size. Using those two values, we need only seek to the appropriate position in the stream and copy the bytes to a file.
I made a very rudimentary, down-and-dirty method to extract a tar archive to a directory, and added some helper functions for opening from a stream or filename, and decompressing the gz file first using built-in functions.
The primary method is this:
public static void ExtractTar(Stream stream, string outputDir)
{
    var buffer = new byte[100];
    while (true)
    {
        // The file name occupies the first 100 bytes of each 512-byte header.
        stream.Read(buffer, 0, 100);
        var name = Encoding.ASCII.GetString(buffer).Trim('\0');
        if (String.IsNullOrWhiteSpace(name))
            break;

        // Skip mode, uid and gid (8 bytes each), then read the 12-byte octal size field.
        stream.Seek(24, SeekOrigin.Current);
        stream.Read(buffer, 0, 12);
        var size = Convert.ToInt64(Encoding.ASCII.GetString(buffer, 0, 12).Trim(), 8);

        // Skip the remainder of the 512-byte header (100 + 24 + 12 + 376 = 512).
        stream.Seek(376L, SeekOrigin.Current);

        var output = Path.Combine(outputDir, name);
        if (!Directory.Exists(Path.GetDirectoryName(output)))
            Directory.CreateDirectory(Path.GetDirectoryName(output));
        using (var str = File.Open(output, FileMode.OpenOrCreate, FileAccess.Write))
        {
            var buf = new byte[size];
            stream.Read(buf, 0, buf.Length);
            str.Write(buf, 0, buf.Length);
        }

        // File data is padded to the next 512-byte boundary; skip the padding.
        var pos = stream.Position;
        var offset = 512 - (pos % 512);
        if (offset == 512)
            offset = 0;
        stream.Seek(offset, SeekOrigin.Current);
    }
}
And here are a few helper functions for opening from a file, and for automating decompressing a tar.gz file/stream first before extracting.
public static void ExtractTarGz(string filename, string outputDir)
{
    using (var stream = File.OpenRead(filename))
        ExtractTarGz(stream, outputDir);
}

public static void ExtractTarGz(Stream stream, string outputDir)
{
    // A GZipStream is not seekable, so decompress it to a MemoryStream first.
    using (var gzip = new GZipStream(stream, CompressionMode.Decompress))
    {
        const int chunk = 4096;
        using (var memStr = new MemoryStream())
        {
            int read;
            var buffer = new byte[chunk];
            // Read until the decompressed stream is exhausted; Read may return
            // fewer bytes than requested even before the end of the stream.
            while ((read = gzip.Read(buffer, 0, chunk)) > 0)
            {
                memStr.Write(buffer, 0, read);
            }
            memStr.Seek(0, SeekOrigin.Begin);
            ExtractTar(memStr, outputDir);
        }
    }
}

public static void ExtractTar(string filename, string outputDir)
{
    using (var stream = File.OpenRead(filename))
        ExtractTar(stream, outputDir);
}
Here is a gist of the full file with some comments.
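To tie this back to the original question about embedded resources, here is a minimal usage sketch. The resource name, output directory and the Tar class name holding the methods above are all made up for illustration:
using System.Reflection;

// "MyApp.Resources.data.tar.gz" is a hypothetical embedded resource name; substitute your own.
using (var resource = Assembly.GetExecutingAssembly()
    .GetManifestResourceStream("MyApp.Resources.data.tar.gz"))
{
    Tar.ExtractTarGz(resource, @"C:\Temp\ExtractedData");
}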
Tar-cs will do the job, but it is quite slow. I would recommend using SharpCompress, which is significantly quicker. It also supports other compression types, and it has been updated recently.
using System;
using System.IO;
using SharpCompress.Common;
using SharpCompress.Reader;

private static String directoryPath = @"C:\Temp";

public static void unTAR(String tarFilePath)
{
    using (Stream stream = File.OpenRead(tarFilePath))
    {
        var reader = ReaderFactory.Open(stream);
        while (reader.MoveToNextEntry())
        {
            if (!reader.Entry.IsDirectory)
            {
                ExtractionOptions opt = new ExtractionOptions
                {
                    ExtractFullPath = true,
                    Overwrite = true
                };
                reader.WriteEntryToDirectory(directoryPath, opt);
            }
        }
    }
}
See tar-cs
using (FileStream unarchFile = File.OpenRead(tarfile))
{
    TarReader reader = new TarReader(unarchFile);
    reader.ReadToEnd("out_dir");
}
Since you are not allowed to use outside libraries, you are not restricted to a specific format of the tar file either. In fact, the data doesn't even need to be all in the same file.
You can write your own tar-like utility in C# that walks a directory tree and produces two files: a "header" file consisting of a serialized dictionary mapping relative file paths to offset/length pairs, and a big file containing the content of the individual files concatenated into one giant blob. This is not a trivial task, but it's not overly complicated either.
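A minimal sketch of the packing side, assuming a hypothetical SimplePacker class and a simple BinaryWriter-based header format (the unpacking side would read the header back and slice the blob using the recorded offsets and lengths):
using System;
using System.IO;

public static class SimplePacker
{
    // Packs every file under sourceDir into two outputs:
    // a header listing (relative path, offset, length) and a blob with the raw bytes.
    public static void Pack(string sourceDir, string headerPath, string blobPath)
    {
        using (var blob = File.Create(blobPath))
        using (var header = new BinaryWriter(File.Create(headerPath)))
        {
            var files = Directory.GetFiles(sourceDir, "*", SearchOption.AllDirectories);
            header.Write(files.Length);
            foreach (var file in files)
            {
                var bytes = File.ReadAllBytes(file);
                header.Write(file.Substring(sourceDir.Length + 1)); // relative path (assumes no trailing slash on sourceDir)
                header.Write(blob.Position);                        // offset into the blob
                header.Write((long)bytes.Length);                   // length of this file
                blob.Write(bytes, 0, bytes.Length);
            }
        }
    }
}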
There are two ways to compress/decompress in .NET. First, you can use the GZipStream and DeflateStream classes. Both can compress your data, but only GZipStream produces the .gz format, so a file compressed with GZipStream can be opened with popular compression applications such as WinZip, WinRAR or 7-Zip, while a file compressed with DeflateStream cannot. These two classes have been available since .NET 2.0.
There is another option, the Package class. It is similar to GZipStream and DeflateStream; the only difference is that you can compress multiple files, and the result can then be opened with WinZip/WinRAR/7-Zip. That is all .NET has, but it is not even a generic .zip file;
it is the format Microsoft uses to compress its *x Office files. If you decompress any .docx file with the Package class you can see everything stored in it. So don't use the .NET libraries for compressing or decompressing if you need a generic zip file; you have to consider a third-party library such as
http://www.icsharpcode.net/OpenSource/SharpZipLib/
or implement everything from the ground up.
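For completeness, a small sketch of the Package class mentioned above. It lives in System.IO.Packaging and requires a reference to WindowsBase on .NET Framework; the file paths and part name here are made up:
using System;
using System.IO;
using System.IO.Packaging;

using (Package package = Package.Open(@"C:\Temp\output.zip", FileMode.Create))
{
    // Each file becomes a "part" identified by a URI and a content type.
    Uri partUri = PackUriHelper.CreatePartUri(new Uri("files/sample.txt", UriKind.Relative));
    PackagePart part = package.CreatePart(partUri, "text/plain", CompressionOption.Normal);
    using (var source = File.OpenRead(@"C:\Temp\sample.txt"))
    using (var destination = part.GetStream())
    {
        source.CopyTo(destination);
    }
}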
Related
I need to encode a zip file in base64 format.
I followed this approach:
string text = File.ReadAllText("../../../SampleDat.dat");
byte[] stringbyte = Encoding.UTF8.GetBytes(text); // convert the file content to bytes before compressing
byte[] compress0 = Compress(stringbyte);
string short_com0 = base64_encode(compress0);
public static byte[] Compress(byte[] data)
{
    using (var compressedStream = new MemoryStream())
    using (var zipStream = new GZipStream(compressedStream, CompressionMode.Compress))
    {
        zipStream.Write(data, 0, data.Length);
        zipStream.Close();
        return compressedStream.ToArray();
    }
}

public string base64_encode(byte[] data)
{
    if (data == null)
        throw new ArgumentNullException("data");
    return Convert.ToBase64String(data);
}
After using this I got this encoded string.
H4sIAAAAAAAEAJVQTU/CQBS8m/gfejHRgxQpoJJ4qGXBKlBsq6KXph8P2NjdrbuLleT9eBe/QvSgHt7hTWYmMzMmsdt3Yxe9lBe0SDVcisytqpLmqaaCkxctU5/PBQ5GZNabkjAxFwWThPhxQgYDNJd4bkyGQXifeEGfYKoUKMWA60nKYP+n5mwCTKyksjxJNUiaHmxpolzIf4tuZPk3iWcaLoRce6IAJPP5iHLwC5wC3ZSU7K30JwmjVcaoUgYynOGN38fI+OUQrZUGZrDtN6g5SAzhaUUV3dhMViwzyNey7//uzpiEQ/L74N/D46agaYZuwSinyvA0fQbLNQGVTrm2Di3CtVxbI3iGEjttXGpdqZ5t13XdyD9szLxVIxfMXlIJCkrItS2hElIrm/ICXuzH6V7rfL4oTx+CIMtY/+7aiaNZq7ZFnLfDinavZsFtBvfNpZ9HZIH4MyriUctpd7rHJ6dNvPDGDX88HaFz3MGO02w6r7wgTAN2AgAA
When I created the zip manually and read that file in the code to encode it:
//file zipped manually
string filePath1 = "../../../git_only/oraclehcm1/dbscripts/SampleDat.zip";
byte[] physicalfile1 = File.ReadAllBytes(filePath1);
string long_com1 = base64_encode(physicalfile1);
The response I get is
UEsDBBQAAAAIAECDYlK8IEwDbAEAAHYCAAANAAAAU2FtcGxlRGF0LmRhdJVQTU/CQBS8m/gfejHRgxQpqJB4qGXBKlBsq6KXph8P2NjdrbuLleT9eBc/4tdBPbzDvMxMZmZMYrfvxi56KS9okWo4F5lbVSXNU00FJ09apj6fCxyMyKw3JWFiLgomCfHjhAwGaC7x3JgMg/A28YI+wVQpUIoB15OUwe5PzckEmFhJZXmSapA03fukiXIh/y26kuXfJJ5puBBy7YkCkMznI8rBL3AKdFNSspfS7ySMVhmjSpmX4Qyv/D5Gxi+HaK00ML/4AoOag8QQHlZU0Y3NZMUykB/LvuLtrTEJh+T3wb+Hx01B0wzdglFOleFp+giWawIqnXJt7VuEa7m2RvAIJXbauNS6Uj3bruu6kb/ZmHmrRi6YvaQSFJSQa1tCJaRWNuUFPNn3053W6XxRdu+CIMtY/+bSiaNZq7ZFnLfDih5ezILrDG6bSz+PyALxZ1TEg5bT7hweHXebeOaNG/54OkLnqIMdp9l0ngFQSwECHwAUAAAACABAg2JSvCBMA2wBAAB2AgAADQAkAAAAAAAAACAAAAAAAAAAU2FtcGxlRGF0LmRhdAoAIAAAAAAAAQAYAEMpLaJSD9cBq6mosXsP1wFNJS5xSw7XAVBLBQYAAAAAAQABAF8AAACXAQAAAAA=
This is the actual response. I also noticed that the two zips are different sizes, and in the zip I created programmatically the files have no extensions.
Please help me create the second encoding through the program. The .NET version I am using is 4.5,
and I cannot use Zip.createDirectory() due to project dependencies.
Any help is appreciated.
Thanks in advance!
The first one is a gzip file, the second one is a zip file. If you want to make a zip file, try the ZipFile class as opposed to the GZipStream class.
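For example, a minimal sketch with the built-in ZipArchive (available since .NET 4.5; the helper name and file path are made up) that builds a zip in memory and base64-encodes it:
using System;
using System.IO;
using System.IO.Compression;

public static string ZipAndEncode(string sourceFile)
{
    using (var ms = new MemoryStream())
    {
        // leaveOpen: true so the MemoryStream is still usable after the archive is written out.
        using (var archive = new ZipArchive(ms, ZipArchiveMode.Create, leaveOpen: true))
        {
            // CreateEntryFromFile is an extension method in System.IO.Compression.FileSystem.
            archive.CreateEntryFromFile(sourceFile, Path.GetFileName(sourceFile));
        }
        return Convert.ToBase64String(ms.ToArray());
    }
}

// e.g. string encoded = ZipAndEncode("../../../SampleDat.dat");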
I wouldn't expect two different zip algorithms/libraries to yield the same output. For one, in the programmatic way the file metadata (name, modification date, attributes) are not set, while the command-line version will include all that information for unzipping purposes.
Plus, libraries update at a different cadence than standalone tools, and you might not have the fixes synchronized to reliably match the outputs.
I'm writing a little C# appx package editor (appx is basically a zip file containing a bunch of XML metadata files).
In order to make a valid appx file, I need to create a block map file (XML) that contains, for each file, two attributes: hash and size, as explained here (https://learn.microsoft.com/en-us/uwp/schemas/blockmapschema/element-block).
Hash represents a 64 KB uncompressed chunk of a file. Size represents the size of that chunk after being compressed (deflate algorithm). Here is what I wrote so far as a proof of concept:
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;

namespace StreamTest
{
    class Program
    {
        static void Main(string[] args)
        {
            using (var srcFile = File.OpenRead(@"C:\Test\sample.txt"))
            {
                ZipAndHash(srcFile);
            }
            Console.ReadLine();
        }

        static void ZipAndHash(Stream inStream)
        {
            const int blockSize = 65536; //64KB
            var uncompressedBuffer = new byte[blockSize];
            int bytesRead;
            int totalBytesRead = 0;

            //Create a ZIP file
            using (FileStream zipStream = new FileStream(@"C:\Test\test.zip", FileMode.Create))
            {
                using (ZipArchive zipArchive = new ZipArchive(zipStream, ZipArchiveMode.Create))
                {
                    using (BinaryWriter zipWriter = new BinaryWriter(zipArchive.CreateEntry("test.txt").Open()))
                    {
                        //Read stream with a 64kb buffer
                        while ((bytesRead = inStream.Read(uncompressedBuffer, 0, uncompressedBuffer.Length)) > 0)
                        {
                            totalBytesRead = totalBytesRead + bytesRead;

                            //Compress current block to the Zip entry
                            if (uncompressedBuffer.Length == bytesRead)
                            {
                                //Hash each 64kb block before compression
                                hashBlock(uncompressedBuffer);

                                //Compress
                                zipWriter.Write(uncompressedBuffer);
                            }
                            else
                            {
                                //Hash remaining bytes and compress
                                hashBlock(uncompressedBuffer.Take(bytesRead).ToArray());
                                zipWriter.Write(uncompressedBuffer.Take(bytesRead).ToArray());
                            }
                        }

                        //How to obtain the size of the compressed block after invoking zipWriter.Write() ?
                        Console.WriteLine($"total bytes : {totalBytesRead}");
                    }
                }
            }
        }

        private static void hashBlock(byte[] uncompressedBuffer)
        {
            // hash the block
        }
    }
}
I can easily get the hash attribute by using a 64 KB buffer while reading a stream. My question is:
How do I obtain the compressed size of each 64 KB block after using zipWriter.Write()? Is it even possible with System.IO.Compression, or should I use something else?
If your problem is still actual, there are two approaches you can use for creating the container.
The first is the managed OPC Packaging APIs, which provide support for applications that produce or consume Open Packaging Conventions compliant files, called packages; you develop all the specific parts yourself (a general description is here: https://msdn.microsoft.com/en-us/library/windows/desktop/dd371623(v=vs.85).aspx).
For getting the compressed size of blocks on the fly you can use a DeflateStream and a MemoryStream like below (a sketch tying this into the question's read loop follows after the second approach):
private static long getCompressSize(byte[] input)
{
    long length = 0;
    using (MemoryStream compressedStream = new MemoryStream())
    {
        compressedStream.Position = 0;
        using (DeflateStream compressionStream = new DeflateStream(compressedStream, CompressionLevel.Optimal, true))
        {
            compressionStream.Write(input, 0, input.Length);
        }
        length = compressedStream.Length;
    }
    Logger.WriteLine("input length:" + input.Length + " compressed stream: " + length);
    return length;
}
The second is the C++ API for Appx containers (but in this case the project would need to be rewritten in C++, or an additional library written and imported into the C# project).
The main advantage is that it already has methods for creating all the needed package parts. A general description is here (https://msdn.microsoft.com/en-us/library/windows/desktop/hh446766(v=vs.85).aspx).
Solution for getting block size and Hash:
The IAppxBlockMapBlock interface provides a read-only object that represents an individual block within a file contained in the block map file (AppxBlockMap.xml) for the App package. The IAppxBlockMapFile::GetBlocks method is used to return an enumerator for traversing and retrieving the individual blocks of a file listed in the package block map.
The IAppxBlockMapBlock interface inherits from the IUnknown interface and has these methods:
GetCompressedSize - Retrieves the compressed size of the block.
GetHash - Retrieves the hash value of the block.
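Going back to the first approach, here is a hypothetical way to tie getCompressSize into the question's 64 KB read loop; it compresses each block once just to measure it and then writes it to the archive, which is wasteful but keeps the idea simple:
// Inside the question's while loop, for the current block:
byte[] block = uncompressedBuffer.Take(bytesRead).ToArray();
hashBlock(block);                             // Hash attribute: the uncompressed chunk
long compressedSize = getCompressSize(block); // Size attribute: the chunk after deflate
zipWriter.Write(block);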
I have a function I use for aggregating streams from a zip archive.
private void ExtractMiscellaneousFiles()
{
    foreach (var miscellaneousFileName in _fileData.MiscellaneousFileNames)
    {
        var fileEntry = _archive.GetEntry(miscellaneousFileName);
        if (fileEntry == null)
        {
            throw new ZipArchiveMissingFileException("Couldn't find " + miscellaneousFileName);
        }
        var stream = fileEntry.Open();
        OtherFileStreams.Add(miscellaneousFileName, (DeflateStream) stream);
    }
}
This works well in most cases. However, if I have a zip within a zip, I get an exception on casting the stream to a DeflateStream:
System.InvalidCastException: Unable to cast object of type 'System.IO.Compression.SubReadStream' to type 'System.IO.Compression.DeflateStream'.
I am unable to find Microsoft documentation for a SubReadStream. I would like my zip within a zip as a DeflateStream. Is this possible? If so how?
UPDATE
Still no success. I attempted @Sunshine's suggestion of copying the stream using the following code:
private void ExtractMiscellaneousFiles()
{
    _logger.Log("Extracting misc files...");
    foreach (var miscellaneousFileName in _fileData.MiscellaneousFileNames)
    {
        _logger.Log($"Opening misc file stream for {miscellaneousFileName}");
        var fileEntry = _archive.GetEntry(miscellaneousFileName);
        if (fileEntry == null)
        {
            throw new ZipArchiveMissingFileException("Couldn't find " + miscellaneousFileName);
        }
        var openStream = fileEntry.Open();
        var deflateStream = openStream;
        if (!(deflateStream is DeflateStream))
        {
            var memoryStream = new MemoryStream();
            deflateStream.CopyTo(memoryStream);
            memoryStream.Position = 0;
            deflateStream = new DeflateStream(memoryStream, CompressionLevel.NoCompression, true);
        }
        OtherFileStreams.Add(miscellaneousFileName, (DeflateStream)deflateStream);
    }
}
But I get a
System.NotSupportedException: Stream does not support reading.
I inspected deflateStream.CanRead and it is true.
I've discovered this happens not just on zips, but on files that are in the zip but are not compressed (because they are too small, for example). Surely there's a way to deal with this; surely someone has encountered this before. I'm opening a bounty on this question.
Here's the .NET source for SubReadStream, thanks to @Quantic.
The return type of ZipArchiveEntry.Open() is Stream, an abstract type; in practice it can be a DeflateStream (you'd be happy), a SubReadStream (boo) or a WrappedStream (boo). Woe be you if they decide to improve the class some day and use a ZopfliStream (boo). The workaround is not good either; you are trying to deflate data that is not compressed (boo).
Too many boos.
The only good solution is to change the type of your OtherFileStreams member. We can't see it, but it smells like a Dictionary<string, DeflateStream>. It needs to be a Dictionary<string, Stream>.
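A minimal sketch of that change, assuming OtherFileStreams is the dictionary implied by the question's Add(name, stream) calls and ZipArchiveMissingFileException is the question's own type:
public Dictionary<string, Stream> OtherFileStreams { get; } = new Dictionary<string, Stream>();

private void ExtractMiscellaneousFiles()
{
    foreach (var name in _fileData.MiscellaneousFileNames)
    {
        var entry = _archive.GetEntry(name);
        if (entry == null)
            throw new ZipArchiveMissingFileException("Couldn't find " + name);
        // No cast: store whatever concrete Stream Open() happens to return.
        OtherFileStreams.Add(name, entry.Open());
    }
}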
So it looks like when storing a zip file inside another zip, it doesn't deflate the zip but rather just inlines the content of the zip alongside the rest of the files, with some information that these entries are part of a sub zip file. That makes sense, because applying compression to something that is already compressed is a waste of time.
This zip file is marked as CompressionMethodValues.Stored in the archive, which causes .NET to just return the original stream it read instead of wrapping it in a DeflateStream.
Source here: https://github.com/dotnet/corefx/blob/master/src/System.IO.Compression/src/System/IO/Compression/ZipArchiveEntry.cs#L670
You could pass the stream into a ZipArchive, if it's not a DeflateStream (if you are interested in the file inside)
var stream = entry.Open();
if (!(stream is DeflateStream))
{
    var subArchive = new ZipArchive(stream);
}
Or you can copy the stream to a FileStream (if you want to save it to disk)
var stream = entry.Open();
if (!(stream is DeflateStream))
{
    var fs = File.Create(Path.GetTempFileName());
    stream.CopyTo(fs);
    fs.Close();
}
Or copy to any stream you are interested in using.
Note: This is also how .NET 4.6 behaves
I have a compressed (LZMA) .txt file and need to decompress it, but I have to exclude the first 4 bytes as they are not part of the file content.
I load my file like this:
byte[] curFile = File.ReadAllBytes(files[i]);
Performance is critical as I have to loop through 14k+ files; the average file size is around 4 KB.
for (int i = 0; i < files.Length; i++)
{
    var positionToSkipTo = 4;
    using (var fileStream = File.OpenRead(files[i]))
    {
        fileStream.Seek(positionToSkipTo, SeekOrigin.Begin);
        var curFile = new byte[fileStream.Length - positionToSkipTo];
        fileStream.Read(curFile, 0, curFile.Length);
        //Do your thing
    }
}
Everything is self-explanatory. Important functions are listed at MSDN FileStream class documentation.
If you're just using a byte array, you can utilize the ConstrainedCopy method in the Array class.
Array.ConstrainedCopy(unclippedArray, 4, clippedArray, 0, unclippedArray.Length - 4);
If you're not going to just be dealing with the raw bytes, utilize a MemoryStream and a BinaryReader, or a FileStream like other people suggested.
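For completeness, a small sketch of the ConstrainedCopy route (assuming the files array from the question); the destination array has to be allocated with the reduced length first:
byte[] unclippedArray = File.ReadAllBytes(files[i]);
byte[] clippedArray = new byte[unclippedArray.Length - 4];
// Copy everything after the first 4 bytes into the new array.
Array.ConstrainedCopy(unclippedArray, 4, clippedArray, 0, unclippedArray.Length - 4);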
Does anyone know of a good way to compress or decompress files and folders in C# quickly? Handling large files might be necessary.
The .NET 2.0 framework namespace System.IO.Compression supports the GZip and Deflate algorithms. Here are two methods that compress and decompress a byte stream, which you can get from your file object. You can substitute DeflateStream for GZipStream in the methods below to use that algorithm. This still leaves the problem of handling files compressed with different algorithms, though.
public static byte[] Compress(byte[] data)
{
    MemoryStream output = new MemoryStream();
    GZipStream gzip = new GZipStream(output, CompressionMode.Compress, true);
    gzip.Write(data, 0, data.Length);
    gzip.Close();
    return output.ToArray();
}

public static byte[] Decompress(byte[] data)
{
    MemoryStream input = new MemoryStream();
    input.Write(data, 0, data.Length);
    input.Position = 0;
    GZipStream gzip = new GZipStream(input, CompressionMode.Decompress, true);
    MemoryStream output = new MemoryStream();
    byte[] buff = new byte[64];
    int read = gzip.Read(buff, 0, buff.Length);
    while (read > 0)
    {
        output.Write(buff, 0, read);
        read = gzip.Read(buff, 0, buff.Length);
    }
    gzip.Close();
    return output.ToArray();
}
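A quick usage sketch of these two methods (the file paths are made up); the round trip should give back the original bytes:
byte[] original = File.ReadAllBytes(@"C:\Temp\input.txt");
byte[] packed = Compress(original);
File.WriteAllBytes(@"C:\Temp\input.txt.gz", packed);
byte[] roundTripped = Decompress(packed); // same contents as original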
I've always used the SharpZip Library.
Here's a link
You can use a 3rd-party library such as SharpZip as Tom pointed out.
Another way (without going 3rd-party) is to use the Windows Shell API. You'll need to set a reference to the Microsoft Shell Controls and Automation COM library in your C# project. Gerald Gibson has an example at:
Internet Archive's copy of the dead page
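For reference, a rough sketch of the Shell API approach; it assumes the COM reference generates the Shell32 interop namespace and that the zip and destination folder below already exist:
// Requires a COM reference to "Microsoft Shell Controls And Automation" (Shell32).
var shell = new Shell32.Shell();
Shell32.Folder source = shell.NameSpace(@"C:\Temp\archive.zip");
Shell32.Folder destination = shell.NameSpace(@"C:\Temp\extracted");
// 4 = no progress dialog, 16 = respond "yes to all" to any prompts.
destination.CopyHere(source.Items(), 4 | 16);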
As of .Net 1.1 the only available method is reaching into the java libraries.
Using the Zip Classes in the J# Class Libraries to Compress Files and Data with C#
Not sure if this has changed in recent versions.
My answer would be close your eyes and opt for DotNetZip. It's been tested by a large community.
GZipStream is a really good utility to use.
This is very easy to do in java, and as stated above you can reach into the java.util.zip libraries from C#. For references see:
java.util.zip javadocs
sample code
I used this a while ago to do a deep (recursive) zip of a folder structure, but I don't think I ever used the unzipping. If I'm so motivated I may pull that code out and edit it into here later.
Another good alternative is DotNetZip.
You can create a zip file with this method:
public async Task<string> CreateZipFile(string sourceDirectoryPath, string name)
{
    var path = HostingEnvironment.MapPath(TempPath) + name;
    await Task.Run(() =>
    {
        if (File.Exists(path)) File.Delete(path);
        ZipFile.CreateFromDirectory(sourceDirectoryPath, path);
    });
    return path;
}
and then you can unzip a zip file with these methods:
1- This method works with a zip file path
public async Task ExtractZipFile(string filePath, string destinationDirectoryName)
{
    await Task.Run(() =>
    {
        var archive = ZipFile.Open(filePath, ZipArchiveMode.Read);
        foreach (var entry in archive.Entries)
        {
            entry.ExtractToFile(Path.Combine(destinationDirectoryName, entry.FullName), true);
        }
        archive.Dispose();
    });
}
2- This method works with a zip file stream
public async Task ExtractZipFile(Stream zipFile, string destinationDirectoryName)
{
    string filePath = HostingEnvironment.MapPath(TempPath) + Utility.GetRandomNumber(1, int.MaxValue);
    using (FileStream output = new FileStream(filePath, FileMode.Create))
    {
        await zipFile.CopyToAsync(output);
    }
    await Task.Run(() => ZipFile.ExtractToDirectory(filePath, destinationDirectoryName));
    await Task.Run(() => File.Delete(filePath));
}