Is it possible to download and unzip in parallel? - C#

I have some large zip files that I'm downloading and then unzipping in my program. Performance is important, and one direction I started thinking about was whether it was possible to start the download and then begin unzipping the data as it arrives, instead of waiting for the download to complete and then start unzipping. Is this possible? From what I understand of DEFLATE, it should be theoretically possible right?
I'm currently using DotNetZip as my zip library, but it refuses to act on a non-seekable stream.
Code would be something like this:
// HTTP GET the application from the server
var request = (System.Net.HttpWebRequest)System.Net.WebRequest.Create(url);
request.Method = "GET";
Directory.CreateDirectory(localPath);

using (var response = (HttpWebResponse)request.GetResponse())
using (Stream input = response.GetResponseStream())
{
    // Unzip being some function which will start unzipping and
    // return when unzipping is done
    return Unzip(input, localPath);
}

I started thinking about was whether it was possible to start the download and then begin unzipping the data as it arrives, instead of waiting for the download to complete and then start unzipping. Is this possible?
If you want to start unzipping whilst the response body is still downloading, you can't really do this.
In a ZIP file, the Central Directory Record, which contains the list of files in the ZIP file, is located at the very end of the ZIP file. It will be the last thing you download. Without it, you can't reliably determine where the individual file records are located in your ZIP file.
This would also explain why DotNetZip needs a seekable stream. It needs to be able to read the Central Directory Record at the end of the file first, then jump back to earlier sections to read information about individual ZIP entries to extract them.
If you have very specific ZIP files you could make certain assumptions about the layout of those individual file records and extract them by hand, without seeking backwards, but it would not be broadly compatible with ZIP files in general.

You could use an async Task to unzip:
await Task.Run(() => ZipFile.ExtractToDirectory(Path.Combine(localPath, fileName), destinationPath));

In the vast majority of zip files, the archive consists of local file records, each followed by its compressed data, repeated until you hit the central directory. So streaming decompression, as asked in this question, is very much possible; the fflate JavaScript library does it, for example.
It is possible to create (a) a self-executing zip file, or (b) some other unusually structured zip file that isn't laid out this way, but you'd be hard pressed to find one in the wild.
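A minimal sketch of that forward-only layout (not from any answer above, and not a general extractor: it assumes the writer had a seekable stream and so wrote real sizes into the local header, i.e. no data descriptors, no ZIP64, and only deflated or stored entries):

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class StreamingUnzipSketch
{
    // Build a small zip in memory so the example is self-contained.
    public static byte[] MakeZip(string entryName, string content)
    {
        var ms = new MemoryStream();
        using (var zip = new ZipArchive(ms, ZipArchiveMode.Create, leaveOpen: true))
        using (var w = new StreamWriter(zip.CreateEntry(entryName).Open()))
            w.Write(content);
        return ms.ToArray();
    }

    // Read the first local file header forward-only and inflate its data,
    // never touching the central directory at the end of the archive.
    public static string ExtractFirstEntry(byte[] zipBytes)
    {
        using var s = new MemoryStream(zipBytes);
        using var r = new BinaryReader(s);
        if (r.ReadUInt32() != 0x04034b50)            // "PK\3\4": local file header signature
            throw new InvalidDataException("no local file header");
        s.Seek(4, SeekOrigin.Current);               // version needed + general-purpose flags
        ushort method = r.ReadUInt16();              // 0 = stored, 8 = deflate
        s.Seek(8, SeekOrigin.Current);               // mod time, mod date, crc32
        uint compSize = r.ReadUInt32();
        r.ReadUInt32();                              // uncompressed size (unused here)
        ushort nameLen = r.ReadUInt16();
        ushort extraLen = r.ReadUInt16();
        s.Seek(nameLen + extraLen, SeekOrigin.Current);

        byte[] comp = r.ReadBytes((int)compSize);
        if (method == 0)
            return Encoding.UTF8.GetString(comp);    // stored: data is verbatim
        using var inflater = new DeflateStream(new MemoryStream(comp), CompressionMode.Decompress);
        using var outMs = new MemoryStream();
        inflater.CopyTo(outMs);
        return Encoding.UTF8.GetString(outMs.ToArray());
    }
}
```

Only forward reads are needed here, which is what makes streaming extraction possible for typically laid-out archives. A writer targeting a non-seekable stream, however, sets the data-descriptor flag and leaves the sizes in the local header at zero, which breaks this approach.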

Related

What archive file format is good for random access during distributed processing?

I'm looking for an archive file type that I can use for processing large archive files in AWS Lambda. The entries in the archive are not so large by themselves, the largest maybe 100 MB, but there could be a lot of them. My strategy is to create a lambda for processing each entry, where the parameters to the lambda are a path to the file in S3, as well as a byte range for the entry inside the archive. This would allow for processing each entry without needing to load the entire file. I can write a format to handle this, but I figure something like this probably already exists.
Not required, but hoping to work with these files in C#.
As long as your files are not that big, I can suggest the following approach.
1. Function invoked.
2. If there is a file in /tmp, go to step 4.
3. If there is no file in /tmp, download a new file from S3.
4. Pop data from the file in chunks, making sure the remaining file shrinks while you process it.
5. Process the popped chunks of data.
6. If the function is about to time out, stop processing the file and invoke yourself again (call a sibling). It may spawn in the same container or in a different one, and will either start processing another file (remaining from some other run) or continue the same one.
7. When a file is completely processed, mark it in some way (e.g. a tag) in S3.
There are some limitations here:
- You should not care about the order of processing the files and the rows inside files.
- Occasional multiple processing of same chunks of data should not cause any problem.
- You probably also want to keep track of processed files somewhere external.
A pretty similar approach is used in the Scheduler class of the sosw package. This is a Python package, not C#, but the idea could help you.
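The "pop data so the remaining file shrinks" step can be sketched like this (a local stand-in for the /tmp file in a Lambda; the names and the chunk-from-the-end strategy are illustrative, not taken from sosw):

```csharp
using System;
using System.IO;

public static class ChunkPopper
{
    // Pop up to chunkSize bytes off the end of the file and truncate it,
    // so the remaining work visibly shrinks after every step.
    public static byte[] PopChunk(string path, int chunkSize)
    {
        using var fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite);
        int take = (int)Math.Min(chunkSize, fs.Length);
        if (take == 0) return Array.Empty<byte>();
        var buf = new byte[take];
        fs.Seek(-take, SeekOrigin.End);
        int off = 0;
        while (off < take)                       // Read may return fewer bytes than asked
            off += fs.Read(buf, off, take - off);
        fs.SetLength(fs.Length - take);          // shrink: this chunk is now "popped"
        return buf;
    }
}
```

If the function nears its timeout between PopChunk calls, it can simply re-invoke itself; the truncated file left in /tmp records exactly how much work remains.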

How to download mp3 files in sequence?

I am using _webClient.OpenReadAsync(myURI) to download files, and it works fine. I want to download the files in sequence, from 0 to 20: the 1st file should be downloaded, then the 2nd, and so on.
I am using the code below to download, but it's not doing what I expect.
foreach (string s in files)
    _webClient.OpenReadAsync(new Uri(string.Format("{0}{1}", selectedReciter.DownloadURL, s)));
The loop should only continue to the 2nd, 3rd, and so on once the 1st file has finished downloading.
You are opening the URL for reading asynchronously, and that word carries a heavy meaning: the call does not wait for the file to be read, it returns much sooner.
What you need to do there is to await the result, something like this:
async Task DownloadAll(List<string> addresses)
{
    var wc = new WebClient();
    foreach (var address in addresses)
        await wc.OpenReadTaskAsync(address);
}
Don't forget to add the NuGet package: Microsoft.Bcl.Async first.
Use background file transfer to download files. Background file transfer lets you download files even when your application is deactivated or running in the background.
Here is more about background file transfer, and here is an example of how to use it.
You could use the non-async version of the same method, which blocks execution until the OpenRead stream is complete.

End of Central Directory record could not be found

I am downloading a zip file using c# program and I get the error
at System.IO.Compression.ZipArchive.ReadEndOfCentralDirectory()
at System.IO.Compression.ZipArchive.Init(Stream stream, ZipArchiveMode mode,
Boolean leaveOpen)
at System.IO.Compression.ZipArchive..ctor(Stream stream, ZipArchiveMode mode,
Boolean leaveOpen, Encoding entryNameEncoding)
at System.IO.Compression.ZipFile.Open(String archiveFileName, ZipArchiveMode
mode, Encoding entryNameEncoding)
at System.IO.Compression.ZipFile.ExtractToDirectory(String sourceArchiveFileN
ame, String destinationDirectoryName, Encoding entryNameEncoding)
at System.IO.Compression.ZipFile.ExtractToDirectory(String sourceArchiveFileN
ame, String destinationDirectoryName)
Here's the program
response = (HttpWebResponse)request.GetResponse();
Stream receiveStream = response.GetResponseStream();
byte[] buffer = new byte[1024];
FileStream outFile = new FileStream(zipFilePath, FileMode.Create);
int bytesRead;
while ((bytesRead = receiveStream.Read(buffer, 0, buffer.Length)) != 0)
    outFile.Write(buffer, 0, bytesRead);
outFile.Close();
response.Close();

try
{
    ZipFile.ExtractToDirectory(zipFilePath, destnDirectoryName);
}
catch (Exception e)
{
    Console.WriteLine(e.ToString());
    Console.ReadLine();
}
I do not understand the error. Can anybody explain it?
Thanks,
MR
The problem is that ZipFile can't find the End of Central Directory record that marks the end of the archive, so one of the following is true:
- It is not a .zip archive.
  It may be a .rar or other compressed type. Or, as I suspect here, you are downloading an HTML page that auto-redirects to the zip file.
  Solution: find a correct archive to use with this code.
- The archive is corrupt.
  Solution: the archive will need repairing.
- There is more than one part to the archive (a multi-part zip file).
  Solution: read in all the parts before decompressing.
- As @ElliotSchmelliot notes in the comments, the file may be hidden or have extended characters in the name.
  Solution: check your file attributes/permissions and verify the file name.
Opening the file with your favorite zip/unzip utility (7-Zip, WinZip, etc.) will tell you which of these it could be.
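You can also tell most of these cases apart programmatically, before calling ExtractToDirectory, by sniffing the first few bytes of the downloaded file (a heuristic sketch; ArchiveSniffer and GuessKind are illustrative names, not library APIs). A zip archive starts with "PK" (0x50 0x4B), a RAR archive with "Rar!", a gzip stream with 0x1F 0x8B, and a redirecting HTML page usually with "<":

```csharp
using System.Text;

public static class ArchiveSniffer
{
    public static string GuessKind(byte[] head)
    {
        if (head.Length >= 2 && head[0] == 0x50 && head[1] == 0x4B)
            return "zip";
        if (head.Length >= 4 && Encoding.ASCII.GetString(head, 0, 4) == "Rar!")
            return "rar";
        if (head.Length >= 2 && head[0] == 0x1F && head[1] == 0x8B)
            return "gzip";
        if (head.Length >= 1 && head[0] == (byte)'<')
            return "html";    // probably an error page or redirect, not an archive
        return "unknown";
    }
}
```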
From your old question, which you deleted:
I get System.IO.InvalidDataException: End of Central Directory record could not be found.
This most likely means that whatever file you are passing in is malformed and the zip library is failing. Since you already have the file outfile on the hard drive, I would recommend trying to open it with Windows' built-in zip extractor and seeing if that works. If it fails, the problem is not with your unzipping code but with the data the server is sending you.
I have the same problem, but in my case the problem is with the compression part, not the decompression.
During compression I needed to use the "Using" statement with both the Stream and the ZipArchive objects. The "Using" statement closes the archive properly, and then I can decompress it without any problem.
The working code in my case in VB.Net:
Using zipStreamToCreate As New MemoryStream()
    Using archive As New ZipArchive(zipStreamToCreate, ZipArchiveMode.Create)
        ' Add entries...
    End Using
    ' Return the zip byte array, for example:
    Return zipStreamToCreate.ToArray()
End Using
I encountered this same problem. There are many types of compression, .zip being only one of the types. Look and make sure that you aren't trying to 'unzip' a .rar or similar file.
In my case I absolutely KNEW that my zip was not corrupted, and I figured out through trial and error that I was extracting the files to a directory whose FOLDER name contained a filename and extension.
So Unzipping /tmp/data.zip to:
/tmp/staging/data.zip/files_go_here
failed with the error [End of Central Directory record could not be found]
but extracting data.zip to this worked just fine:
/tmp/staging/data/files_go_here
While it might seem unusual to some folks to name a folder a filename with extension, I can't think of a single reason why you should not be able to do this, and more importantly -- the error returned is not obviously related to the cause.
I was getting the same error with both the System.IO.Compression library and 3rd party packages such as SharpZipLib -- which is what eventually clued me in that it was a more general issue.
I hope this helps someone and saves them some time/frustration.
I used the SharpCompress C#/.NET library, available via the NuGet package manager; it solved my unzipping problem.
I just came across this thread when I had the same error from a PowerShell script calling the Net.WebClient DownloadFile method.
In my case, the problem was that the web server was unable to provide the requested zip file, and instead provided an HTML page with an error message in it, which obviously could not be unzipped.
So instead, I created an exception handler to extract and present the "real" error message.
Might be useful to someone else. I dealt with this by adding an exception handler to my code, which then:
1. Creates a temporary directory
2. Extracts the zip archive (this normally works)
3. Renames the original zip archive to *.bak
4. Zips and replaces the original archive file with one that works
For me, the problem had to do with git settings.
To solve it, I added:
*.zip binary
to my .gitattributes file.
Then I downloaded an uncorrupted version of the file (without using git) and added a new commit updating the .zip file to the uncorrupted version and also updating the .gitattributes file.
I wish I could avoid adding that extra commit to update the .zip file, but the only way I can think of to avoid it would be to insert a commit updating the .gitattributes file into or before the commit that added the .zip file (using a rebase) and then use git push -f to update the remote repo, which I can't do.
I also had this error because I was trying to open a .json file as a .zip archive:
using (ZipArchive archive = ZipFile.Open(fileToSend.FilePath, ZipArchiveMode.Read))
{
    ZipArchiveEntry entry = archive.GetEntry(fileToSend.FileName);
    using (StreamReader reader = new StreamReader(entry.Open(), Encoding.UTF8))
    {
        fileContent = reader.ReadToEnd();
    }
}
I was expecting that fileToSend.FilePath = "C:\MyProject\mydata.zip"
but it was actually fileToSend.FilePath = "C:\MyProject\mydata.json" and that was causing the error.
Write the stream out to a file, then inspect it with a (hex) editor.
I got the same message in Visual Studio when downloading a .nupkg from nuget.org. It was because nuget.org was blacklisted by the firewall, so instead of the package I got an HTML error page, which (of course) cannot be unzipped.
In my case, I was mistakenly saving an input stream as *.zip.
While Archive Utility had no issue opening the file, everything else (the unzip command, Java libraries) failed with the same "end of central" error.
The plot twist: the file I was downloading was in gzip format, i.e. *.gz, not zip.
Make sure it is actually a zip file you are trying to decompress.
The web service I query zips the results when there are two files, but in this instance it was returning just one. My code was saving the embedded base64 as a stream and assigning the zip extension regardless.
Whereas it was actually just a plain PDF...
In my case, I was receiving this error in combination with a FileSystemWatcher, which triggered a processing method on the zip archive before the archive was fully copied/created in its target folder.
I solved it with a check of whether the zip archive was truly eligible for reading in a try/catch block within a while loop.
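A minimal version of such a check (illustrative names, not the poster's exact code; it assumes the writer eventually finishes): keep retrying ZipFile.OpenRead until it stops throwing.

```csharp
using System.IO;
using System.IO.Compression;
using System.Threading;

public static class ZipReadyCheck
{
    // Returns true once the file opens cleanly as a zip archive (i.e. the
    // End of Central Directory record is present), or false after maxAttempts.
    public static bool WaitUntilReadable(string path, int maxAttempts = 10, int delayMs = 200)
    {
        for (int i = 0; i < maxAttempts; i++)
        {
            try
            {
                using (ZipFile.OpenRead(path))
                    return true;
            }
            catch (IOException) { }            // file still locked by the writer
            catch (InvalidDataException) { }   // partially written: no EOCD yet
            Thread.Sleep(delayMs);
        }
        return false;
    }
}
```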
My solution: compress with PowerShell:
Compress-Archive * -DestinationPath a.zip
I found a resolution.
Go to "Tools -> NuGet Package Manager -> Package Manager Settings", and in the NuGet Package Manager General tab, click the Clear All NuGet Caches button and OK. You can then install the package from online.

Decompression of uploaded file with: The magic number in GZip header is not correct. Make sure you are passing in a GZip stream

I'm uploading a zip file (compressed with WinRAR) to my server via a FileUpload control. On the server I use this code to decompress the file:
HttpPostedFile myFile = FileUploader.PostedFile;
using (Stream inFile = myFile.InputStream)
{
    using (GZipStream decompress = new GZipStream(inFile, CompressionMode.Decompress))
    {
        StreamReader reader = new StreamReader(decompress);
        string text = reader.ReadToEnd(); // Here is an error
    }
}
But I get this error:
The magic number in GZip header is not correct. Make sure you are passing in a GZip stream
Is there any way to fix this? I'm using .NET 2.0.
Thank you very much for the help.
ZIP and GZIP are not quite the same. You can use a third-party library like #ziplib to decompress ZIP files.
GZip is a format that compresses a single stream into another stream. When used with files, it is conventionally given the .gz extension and the content type application/x-gzip (though often we use the content type of the contained stream plus some other means of indicating that it's gzipped). On the web it's often used as a content-encoding or (alas, less well supported, despite being closer to what we generally want) transfer-encoding to reduce download and upload time "invisibly": the user thinks they're downloading a large HTML page, but really they're downloading a smaller gzip of it.
Zip is a format that compresses an archive of one or more files, along with information about relative paths. The file produced is conventionally given the .zip extension, and the content-type application/zip (registered with IANA).
There are definite similarities aside from the name, as in they both (generally) use the DEFLATE algorithm, and we can combine the use of GZip with the use of Tar to create an archive similar to what Zip gives us, but they have different uses.
You've got two options:
The simplest (from the programming side of things, anyway) is to get a Windows tool that produces GZip files (WinRAR will open but not create them, but there are dozens of tools that will create them, including quite a few free ones). Then your code will work.
The other is to use the Package Class. It's a bit more complicated to use, because a package of potentially several files is inherently more complicated than a single file, but not dreadful by any means. This will let you examine a Zip file, extract the file(s) contained, make changes to them, etc.
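The difference is visible in the very first bytes of each format. A small round-trip sketch (using System.IO.Compression's ZipArchive from later .NET versions for brevity, so not directly usable on .NET 2.0): the 0x1F 0x8B pair is exactly the "magic number" the error message complains about, while a zip archive begins with "PK".

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

public static class MagicNumbers
{
    public static byte[] GZipBytes(string text)
    {
        var ms = new MemoryStream();
        using (var gz = new GZipStream(ms, CompressionMode.Compress))
        {
            var data = Encoding.UTF8.GetBytes(text);
            gz.Write(data, 0, data.Length);
        }
        return ms.ToArray();   // starts with the gzip magic number 0x1F 0x8B
    }

    public static byte[] ZipBytes(string text)
    {
        var ms = new MemoryStream();
        using (var zip = new ZipArchive(ms, ZipArchiveMode.Create, leaveOpen: true))
        using (var w = new StreamWriter(zip.CreateEntry("file.txt").Open()))
            w.Write(text);
        return ms.ToArray();   // starts with the zip signature "PK"
    }
}
```

Feeding the zip bytes to a GZipStream is precisely what triggers "The magic number in GZip header is not correct": the stream begins with "PK", not 0x1F 0x8B.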

Appending bytes using Amazon S3 .Net SDK

I have the following piece of code which works great for a simple file upload. But let's say I wanted to append to an existing file or simply upload random chunks of bytes, like the first and last 10 bytes? Is this even possible with the official SDK?
PutObjectRequest request = new PutObjectRequest();
FileStream fs = new FileStream(@"C:\myFolder\MyFile.bin", FileMode.Open);
request.WithInputStream(fs);
request.WithBucketName(bucketName);
request.WithKey(keyName);
client.PutObject(request);
fs.Close();
There is no way to append data to existing objects in S3. You have to overwrite the entire file.
That said, it is possible to a degree with Amazon's large-file support, where uploads are broken into chunks and reassembled on S3. But you have to do it as part of a single transfer, and it's only for large files.
This previous answer appears to no longer be the case. You can currently manage an append-like process by using an existing object as the initial part of a multipart upload, then deleting the previous object when the transfer is done.
See:
http://docs.aws.amazon.com/AmazonS3/latest/dev/CopyingObjctsUsingLLNetMPUapi.html
http://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadUploadPartCopy.html
