Compress large log file before reading - c#

We have a large number of logs (117 files totalling about 17 GB). They're plain text, so I know they will compress well. I'm not looking for great compression or speed (though either would be a nice bonus). What I currently do is get a list of log files to read (they have a date stamp in the file name, so I filter on that first). After I get the list, I read each file using File.ReadAllLines(), but we also filter the lines...
private void GetBulkUpdateItems(List<string> allLines, Regex updatedRowsRegEx)
{
    foreach (var file in this)
        allLines.AddRange(File.ReadAllLines(file).Where(x => updatedRowsRegEx.IsMatch(x)));
    allLines.Sort();
}
Reading 5 files from the network takes about 22 seconds. What I'd like to do is compress the list of files into a single zip file, copy the zip file locally, then unzip it and do the rest. The problem is I can't figure out how to start. Since I'm using .NET 4.5 I first tried System.IO.Compression.ZipFile, but it wants a directory and I don't want all 117 files. I saw someone use a network stream and 7zip, which sounded promising, and I'm fairly certain that 7zip is installed on the server I need the logs from (probably not important, because we use the UNC path). So I'm stuck. Any suggestions?

ZipArchive is the underlying class for ZipFile and allows more granular manipulation.
Sample from the article adding hardcoded text:
using (FileStream zipToOpen = new FileStream(
    @"c:\users\exampleuser\release.zip", FileMode.Open))
{
    using (ZipArchive archive = new ZipArchive(zipToOpen, ZipArchiveMode.Update))
    {
        ZipArchiveEntry readmeEntry = archive.CreateEntry("Readme.txt");
        using (StreamWriter writer = new StreamWriter(readmeEntry.Open()))
        {
            writer.WriteLine("Information about this package.");
            writer.WriteLine("========================");
        }
    }
}
As Praveen Paulose suggested, you can use ZipFileExtensions.CreateEntryFromFile to create an archive entry directly from a file on disk.
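For the question's case of zipping only a chosen list of files, a minimal sketch could look like the following. The class and method names (LogZipper, ZipSelectedFiles) are hypothetical, and it assumes the file names in the list are unique, since each entry is stored under its bare file name.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

public static class LogZipper
{
    // Packs only the given files into a new archive at zipPath.
    // LogZipper and ZipSelectedFiles are illustrative names, not part of any library.
    public static void ZipSelectedFiles(IEnumerable<string> files, string zipPath)
    {
        // ZipFile.Open in Create mode fails if the target already exists,
        // so remove any stale output first.
        if (File.Exists(zipPath))
            File.Delete(zipPath);

        using (ZipArchive archive = ZipFile.Open(zipPath, ZipArchiveMode.Create))
        {
            foreach (string file in files)
            {
                // Assumption: bare file names are unique across the list.
                archive.CreateEntryFromFile(file, Path.GetFileName(file), CompressionLevel.Fastest);
            }
        }
    }
}
```

Note that if this runs on the client against UNC paths, the uncompressed bytes still cross the network while the archive is being built; the saving only appears if the zipping happens on the machine that holds the logs.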

Related

How to read data from inner archives without extracting zip file?

I have a zip file which contains an inner zip file (e.g. ZipFile1.zip -> ZipFile2.zip -> file.txt). I want to read the content of a file in the inner archive (file.txt) using the ICSharpCode.SharpZipLib library without extracting anything to disk. Is this possible? If so, how can I do it?
Based on this answer, you can open a file within the zip as a Stream. You can also open a ZipFile from a Stream. I'm sure you can see where this is heading.
using (var zip = new ZipFile("ZipFile1.zip"))
{
    var nestedZipEntry = zip.GetEntry("ZipFile2.zip");
    using (var nestedZipStream = zip.GetInputStream(nestedZipEntry))
    using (var nestedZip = new ZipFile(nestedZipStream))
    {
        var fileEntry = nestedZip.GetEntry("file.txt");
        using (var fileStream = nestedZip.GetInputStream(fileEntry))
        using (var reader = new StreamReader(fileStream))
        {
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}
What we're doing here:
Open ZipFile1.zip
Find the entry for ZipFile2.zip
Open ZipFile2.zip as a Stream
Create a new ZipFile object around nestedZipStream.
Find the entry for file.txt
Create a StreamReader around fileStream to read the text file.
Read the contents of file.txt and output it to the console.
Try it online - in this sample, the base64 data is the binary data of a zip file which contains "test.zip", which in turn contains "file.txt". The content of that text file is "hello".
P.S. If an entry isn't found then GetEntry will return null. You'll want to check for that in any code you write. It works here because I'm sure that these entries exist in their respective archives.
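As a rough illustration of those null checks, here is the same nested-archive read written against the built-in System.IO.Compression classes rather than SharpZipLib (a deliberate swap, so the sketch has no external dependency). ZipArchive.GetEntry also returns null for a missing entry. The NestedZipReader name is hypothetical, and the inner archive is buffered in memory, which assumes it is small enough to fit.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class NestedZipReader
{
    // Reads a text entry from a zip stored inside another zip, without touching disk.
    // NestedZipReader is an illustrative name; it uses System.IO.Compression, not SharpZipLib.
    public static string ReadNestedText(Stream outerZip, string innerZipName, string fileName)
    {
        using (var outer = new ZipArchive(outerZip, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry innerEntry = outer.GetEntry(innerZipName);
            if (innerEntry == null)
                throw new FileNotFoundException("Inner archive not found: " + innerZipName);

            // Assumption: the inner archive fits in memory. Copy it into a seekable buffer.
            using (var buffer = new MemoryStream())
            {
                using (Stream innerStream = innerEntry.Open())
                    innerStream.CopyTo(buffer);
                buffer.Position = 0;

                using (var inner = new ZipArchive(buffer, ZipArchiveMode.Read))
                {
                    ZipArchiveEntry fileEntry = inner.GetEntry(fileName);
                    if (fileEntry == null)
                        throw new FileNotFoundException("Entry not found: " + fileName);

                    using (var reader = new StreamReader(fileEntry.Open()))
                        return reader.ReadToEnd();
                }
            }
        }
    }
}
```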

decompress files on google storage on the fly using c#

I have a very interesting problem that I hope I can solve using .NET. I have a zip file in Google Storage which I want to decompress and move to a different bucket, but I have neither enough memory nor enough storage to save the whole file and decompress it. To solve this, I have to read the central directory at the end of the zip file and then do a streaming decompress. Has anyone worked on a similar issue?
So far I've worked out how to get the last 1024 bytes of the file using the following code:
var fileInfo = _storage.GetObject(BucketName, fileName, new GetObjectOptions { Projection = Projection.Full });
var stream = new MemoryStream();
_storage.DownloadObject(BucketName, fileName, stream, new DownloadObjectOptions { Range = new System.Net.Http.Headers.RangeHeaderValue((long)(fileInfo.Size - 1024), (long)(fileInfo.Size)) });
The problem is I can't read the central directory from this stream:
ZipArchive z = new ZipArchive(stream);
You can try to adapt sunzip to your needs. It reads a zip file as a stream and decompresses it.
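For the central-directory approach described in the question, a minimal sketch of locating the end-of-central-directory (EOCD) record in the downloaded tail bytes might look like this. The ZipTail name is hypothetical, and the sketch assumes a little-endian host, a plain (non-Zip64) archive, and a tail buffer large enough to contain the record plus any archive comment.

```csharp
using System;

public static class ZipTail
{
    // Finds the EOCD record in the tail bytes of a zip file and reads its key fields.
    // Returns false if the "PK\x05\x06" signature is not present in the buffer.
    // Assumptions: little-endian host (BitConverter), no Zip64 extensions.
    public static bool TryReadEocd(byte[] tail, out int entryCount,
                                   out uint directorySize, out uint directoryOffset)
    {
        entryCount = 0;
        directorySize = 0;
        directoryOffset = 0;

        // The fixed part of the record is 22 bytes; scan backwards so a trailing
        // archive comment does not hide it.
        for (int i = tail.Length - 22; i >= 0; i--)
        {
            if (tail[i] == 0x50 && tail[i + 1] == 0x4B && tail[i + 2] == 0x05 && tail[i + 3] == 0x06)
            {
                entryCount = BitConverter.ToUInt16(tail, i + 10);      // total number of entries
                directorySize = BitConverter.ToUInt32(tail, i + 12);   // central directory size in bytes
                directoryOffset = BitConverter.ToUInt32(tail, i + 16); // offset from the start of the file
                return true;
            }
        }
        return false;
    }
}
```

The directory offset is measured from the start of the whole file, not from the start of the tail, which is likely one reason ZipArchive cannot open the 1024-byte fragment on its own. With the offset and size in hand, a second ranged download could fetch just the central directory.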

How do I extract any archives with SharpZipLib?

I'm using SharpZipLib to extract archives. I managed to extract .zip archives:
FastZip fastZip = new FastZip();
fastZip.ExtractZip(file, directory, null);
and to extract .tar.gz:
// Use a 4K buffer. Any larger is a waste.
byte[] dataBuffer = new byte[4096];
using (Stream fileStream = new FileStream(file, FileMode.Open, FileAccess.Read))
{
    using (GZipInputStream gzipStream = new GZipInputStream(fileStream))
    {
        // Change this to your needs
        string fnOut = Path.Combine(directory, Path.GetFileNameWithoutExtension(file));
        using (FileStream fsOut = File.Create(fnOut))
        {
            StreamUtils.Copy(gzipStream, fsOut, dataBuffer);
        }
    }
}
Is there also a way to extract any kind of archive where I don't need to know the type of archive upfront? (e.g. SharpZipLib.ExtractAnyArchive(file, directory))
SharpZipLib unfortunately is currently not able to auto-detect the format of an archive file/stream.
You either have to implement the functionality yourself in some form, or look for an alternative library that is able to auto-detect the format of an archive. An example of such a library is SharpCompress; however, as you already noted in the comments, different libraries can come with different kinds of limitations and bugs that might affect the functionality of your software.
If you decide to roll your own auto-detection functionality for SharpZipLib, you can choose between different approaches, such as:
Try opening an (unknown) archive using the archive (reader/stream) classes for every archive format supported by SharpZipLib, until you find one which can open and process the archive file successfully.
Implement some format detection routine that scans an archive file/stream for 'magic' signature bytes identifying a particular archive format. If the format of an archive file/stream has been thus identified, select and use the appropriate SharpZipLib classes for handling the detected archive format.
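The second approach could be sketched roughly as follows. The ArchiveKind enum and ArchiveSniffer class are hypothetical names, and only four well-known signatures are checked (zip local file header, gzip, bzip2, and the "ustar" marker at offset 257 of a tar header); other formats, and zip files that do not start with a local file header, would come back as Unknown.

```csharp
using System.IO;
using System.Text;

// Illustrative names; neither type is part of SharpZipLib.
public enum ArchiveKind { Unknown, Zip, GZip, BZip2, Tar }

public static class ArchiveSniffer
{
    // Inspects the leading bytes of a stream and guesses the archive format.
    public static ArchiveKind Detect(Stream stream)
    {
        // 262 bytes covers the tar magic, which sits at offset 257.
        byte[] header = new byte[262];
        int read = 0;
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0) break;
            read += n;
        }

        // Rewind so the caller can hand the same stream to the matching reader.
        if (stream.CanSeek)
            stream.Seek(-read, SeekOrigin.Current);

        if (read >= 4 && header[0] == 0x50 && header[1] == 0x4B && header[2] == 0x03 && header[3] == 0x04)
            return ArchiveKind.Zip;
        if (read >= 2 && header[0] == 0x1F && header[1] == 0x8B)
            return ArchiveKind.GZip;
        if (read >= 3 && header[0] == (byte)'B' && header[1] == (byte)'Z' && header[2] == (byte)'h')
            return ArchiveKind.BZip2;
        if (read >= 262 && Encoding.ASCII.GetString(header, 257, 5) == "ustar")
            return ArchiveKind.Tar;

        return ArchiveKind.Unknown;
    }
}
```

The result would then select the extraction path, for example FastZip for Zip and GZipInputStream for GZip as in the question. A .tar.gz file is reported as GZip; the tar layer only becomes visible after decompression.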

Compress large files using .NET Framework ZipArchive class on ASP.NET

I have code that gets all the files in a directory, compresses each one and creates a .zip file. I'm using the .NET Framework ZipArchive class in the System.IO.Compression namespace and the extension method CreateEntryFromFile. This works well except when processing large files (approximately 1 GB and up), where it throws a System.IO.Stream Exception "Stream too large".
On the extension method reference on MSDN it states that:
When ZipArchiveMode.Update is present, the size limit of an entry is limited to Int32.MaxValue. This limit is because update mode uses a MemoryStream internally to allow the seeking required when updating an archive, and MemoryStream has a maximum equal to the size of an int.
So this explains the exception I get, but it offers no way to overcome the limitation. How can I process large files?
Here is my code. It's part of a class; the GetDatabaseBackupFiles() and GetDatabaseCompressedBackupFiles() functions each return a list of FileInfo objects that I iterate over:
public void CompressBackupFiles()
{
    var originalFiles = GetDatabaseBackupFiles();
    var compressedFiles = GetDatabaseCompressedBackupFiles();
    var pendingFiles = originalFiles.Where(c => compressedFiles.All(d => Path.GetFileName(d.Name) != Path.GetFileName(c.Name)));
    foreach (var file in pendingFiles)
    {
        var zipPath = Path.Combine(_options.ZippedBackupFilesBasePath, Path.GetFileNameWithoutExtension(file.Name) + ".zip");
        using (ZipArchive archive = ZipFile.Open(zipPath, ZipArchiveMode.Update))
        {
            archive.CreateEntryFromFile(file.FullName, Path.GetFileName(file.Name));
        }
    }
    DeleteFiles(originalFiles);
}
If you are only creating a zip file, replace ZipArchiveMode.Update with ZipArchiveMode.Create.
Update mode is meant for cases where you need to delete files from an existing archive or add new files to one.
In update mode the whole zip file is loaded into memory, which consumes a lot of memory for big files. This mode should therefore be avoided when possible.
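A minimal sketch of writing a single large file in Create mode, copying it through streams so that nothing is held in memory as a whole. The BackupZipper name and its parameters are hypothetical; opening the output with FileMode.Create is a choice made here so that an existing archive is overwritten.

```csharp
using System.IO;
using System.IO.Compression;

public static class BackupZipper
{
    // Compresses one file into a new archive, streaming the data in chunks.
    // BackupZipper is an illustrative name, not an existing API.
    public static void CompressFile(string sourcePath, string zipPath)
    {
        // FileMode.Create overwrites any archive already at zipPath.
        using (var zipStream = new FileStream(zipPath, FileMode.Create, FileAccess.Write))
        using (var archive = new ZipArchive(zipStream, ZipArchiveMode.Create))
        {
            ZipArchiveEntry entry = archive.CreateEntry(Path.GetFileName(sourcePath), CompressionLevel.Optimal);
            using (Stream source = File.OpenRead(sourcePath))
            using (Stream target = entry.Open())
            {
                // CopyTo moves the data buffer by buffer rather than loading the file.
                source.CopyTo(target);
            }
        }
    }
}
```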

ICSharpCode.SharpZipLib.Zip example with crc variable details

I am using the ICSharpCode.SharpZipLib DLL to zip SharePoint files using C# in ASP.NET.
When I open the output.zip file, it shows "zip file is either corrupted or damaged".
The CRC value for the files in output.zip shows as 000000.
How do we calculate or configure the CRC value using SharpZipLib?
Does anyone have a good example of zipping using memory streams?
It seems you're not creating each ZipEntry.
Here's some code that I adapted to my needs:
http://wiki.sharpdevelop.net/SharpZipLib-Zip-Samples.ashx#Create_a_Zip_fromto_a_memory_stream_or_byte_array_1
Anyway with SharpZipLib there are many ways you can work with zip file: the ZipFile class, the ZipOutputStream and the FastZip.
I'm using the ZipOutputStream to create an in-memory ZIP file, adding in-memory streams to it and finally flushing to disk, and it's working quite well. Why ZipOutputStream? Because it's the only choice available if you want to specify a compression level and use Streams.
Good luck :)
1:
You could do it manually but the ICSharpCode library will take care of it for you. Also something I've discovered: 'zip file is either corrupted or damaged' can also be a result of not adding your zip entry name correctly (such as an entry that sits in a chain of subfolders).
2:
I solved this problem by creating a compressionHelper utility. I had to dynamically compose and return zip files. Temp files were not an option as the process was to be run by a webservice.
The trick with this was the BeginZipUpdate(), AddEntry() and EndZipUpdate() methods (because I made it into a utility to be invoked; you could just use the code directly if need be).
Something I've excluded from the example is the checks for initialization (like calling EndZipUpdate() first by mistake) and proper disposal code (best to implement IDisposable and close your zipOutputStream and your memoryStream if applicable).
using System;
using System.Collections.Generic;
using System.IO;
using ICSharpCode.SharpZipLib.Zip;

// Backing fields shared by the helper methods (types inferred from how they are used).
private MemoryStream _memoryStream;
private ZipOutputStream _zipOutputStream;
private readonly List<string> _zipOutputStreamEntries = new List<string>();

public void BeginZipUpdate()
{
    _memoryStream = new MemoryStream(200);
    _zipOutputStream = new ZipOutputStream(_memoryStream);
}

public void EndZipUpdate()
{
    _zipOutputStream.Finish();
    _zipOutputStream.Close();
    _zipOutputStream = null;
}

// Entry name could be 'somefile.txt' or 'Assemblies\MyAssembly.dll' to indicate a folder.
// Unsure where you'd be getting your file, I'm reading the data from the database.
public void AddEntry(string entryName, byte[] bytes)
{
    ZipEntry entry = new ZipEntry(entryName);
    entry.DateTime = DateTime.Now;
    entry.Size = bytes.Length;
    _zipOutputStream.PutNextEntry(entry);
    _zipOutputStream.Write(bytes, 0, bytes.Length);
    _zipOutputStreamEntries.Add(entryName);
}
So you're actually having the zipOutputStream write to a memoryStream. Then once _zipOutputStream is closed, you can return the contents of the memoryStream.
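public byte[] GetResultingZipFile()
{
    // Finish the archive only if EndZipUpdate() has not already closed it.
    if (_zipOutputStream != null)
    {
        _zipOutputStream.Finish();
        _zipOutputStream.Close();
        _zipOutputStream = null;
    }
    return _memoryStream.ToArray();
}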
Just be aware of how much you want to add to a zipfile (delay in process/IO/timeouts etc).
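For comparison, the same in-memory idea can be sketched with the built-in System.IO.Compression classes instead of SharpZipLib (a deliberate swap so the example needs no external package). The InMemoryZip name is hypothetical, and the sketch assumes all entry data fits in memory.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

public static class InMemoryZip
{
    // Builds a zip archive in memory from name/content pairs and returns its bytes.
    // InMemoryZip is an illustrative name; this uses ZipArchive, not ZipOutputStream.
    public static byte[] Build(IDictionary<string, byte[]> files)
    {
        using (var memory = new MemoryStream())
        {
            // leaveOpen keeps the MemoryStream usable after the archive is disposed.
            using (var archive = new ZipArchive(memory, ZipArchiveMode.Create, leaveOpen: true))
            {
                foreach (KeyValuePair<string, byte[]> file in files)
                {
                    // Use forward slashes in the name to place an entry in a folder.
                    ZipArchiveEntry entry = archive.CreateEntry(file.Key);
                    using (Stream entryStream = entry.Open())
                    {
                        entryStream.Write(file.Value, 0, file.Value.Length);
                    }
                }
            }
            // The central directory is written when the archive is disposed,
            // so the bytes are only complete at this point.
            return memory.ToArray();
        }
    }
}
```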
