Reading zip files without full download - C#

Is it possible to read the contents of a .ZIP file without fully downloading it?
I'm building a crawler and I'd rather not have to download every zip file just to index its contents.
Thanks!

The tricky part is identifying the start of the central directory, which sits at the end of the file. Since each entry is the same fixed size, you can do a kind of binary search starting from the end of the file. The binary search is trying to guess how many entries are in the central directory. Start with some reasonable value, N, and retrieve the portion of the file at end - (N * sizeof(DirectoryEntry)). If that file position does not start with the central directory entry signature, then N is too large: halve it and repeat. Otherwise, N is too small: double it and repeat. Like binary search, the process maintains the current upper and lower bounds; when the two become equal, you've found N, the number of entries.
The number of times you hit the webserver is at most 16, since a (non-Zip64) archive can contain no more than 64K entries.
Whether this is more efficient than downloading the whole file depends on the file size. You could request the size of the resource before downloading and, if it's smaller than a given threshold, download the entire resource. For large resources, requesting multiple ranges will be quicker and, provided the threshold is set high enough, less taxing on the webserver overall.
HTTP/1.1 allows byte ranges of a resource to be downloaded (the Range header). With HTTP/1.0 you have no choice but to download the whole file.
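To make the range idea concrete, here is a minimal C# sketch (the URL is a placeholder). Instead of the guessing loop, it exploits the fact that the fixed-size, 22-byte end-of-central-directory (EOCD) record sits at the very end of any archive that has no trailing comment; the EOCD stores the entry count and the central directory offset directly, so two range requests are enough to index the archive:

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

class ZipProbe
{
    // EOCD layout: signature(4) diskNum(2) cdDisk(2) entriesOnDisk(2)
    // totalEntries(2) cdSize(4) cdOffset(4) commentLength(2) = 22 bytes
    const int EocdSize = 22;
    const uint EocdSignature = 0x06054b50; // "PK\x05\x06"

    static async Task Main()
    {
        var url = "https://example.com/archive.zip"; // placeholder
        using var http = new HttpClient();

        // Ask for the resource size without downloading the body.
        using var head = new HttpRequestMessage(HttpMethod.Head, url);
        using var headResponse = await http.SendAsync(head);
        long length = headResponse.Content.Headers.ContentLength
                      ?? throw new InvalidOperationException("No Content-Length");

        // Range request for the last 22 bytes, where the EOCD record
        // lives when the archive has no trailing comment.
        using var get = new HttpRequestMessage(HttpMethod.Get, url);
        get.Headers.Range = new RangeHeaderValue(length - EocdSize, length - 1);
        using var response = await http.SendAsync(get);
        byte[] eocd = await response.Content.ReadAsByteArrayAsync();

        if (BitConverter.ToUInt32(eocd, 0) != EocdSignature)
            throw new InvalidOperationException(
                "EOCD not at the end; the archive probably has a comment.");

        // All multi-byte fields in the zip format are little-endian.
        ushort totalEntries = BitConverter.ToUInt16(eocd, 10);
        uint cdSize = BitConverter.ToUInt32(eocd, 12);
        uint cdOffset = BitConverter.ToUInt32(eocd, 16);
        Console.WriteLine($"{totalEntries} entries; central directory at " +
                          $"offset {cdOffset}, {cdSize} bytes");

        // A second range request for [cdOffset, cdOffset + cdSize) would
        // fetch the central directory itself for indexing.
    }
}
```

If the archive may carry a comment, fetch the last 64 KB + 22 bytes instead and scan backwards for the signature.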

The format puts the key piece of information about what's in the file (the central directory) at the end of it. Individual files are then located via offsets recorded in those directory entries, so you'll need to have access to the whole thing, I believe.
GZip, on the other hand, can be read as a stream, I believe.
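For comparison, here is a minimal sketch of consuming a remote gzip file as a stream in C# (the URL is a placeholder); decompression happens as bytes arrive, with nothing saved to disk:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Net.Http;
using System.Threading.Tasks;

class GzipStreamDemo
{
    static async Task Main()
    {
        using var http = new HttpClient();
        // GetStreamAsync starts yielding data before the download completes.
        using var remote = await http.GetStreamAsync("https://example.com/data.gz"); // placeholder
        using var gzip = new GZipStream(remote, CompressionMode.Decompress);
        using var reader = new StreamReader(gzip);
        Console.WriteLine(await reader.ReadLineAsync()); // first decompressed line
    }
}
```

This works because gzip is a pure forward stream; zip, by contrast, needs the central directory at the end to be read reliably.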

I don't know if this helps, as I'm not a programmer. But in Outlook you can preview zip files and see the actual content, not just the file directory (if they are previewable documents like a pdf).

There is a solution implemented in ArchView
"ArchView can open archive file online without downloading the whole archive."
https://addons.mozilla.org/en-US/firefox/addon/5028/
Inside archview-0.7.1.xpi, in the file "archview.js", you can look at their JavaScript approach.

It's possible. All you need is a server that supports byte-range requests: fetch the end record (to learn the size of the central directory), fetch the central directory (to learn where each file starts and ends), and then fetch the proper byte ranges and handle them.
Here is an implementation in Python: onlinezip
[full disclosure: I'm the author of the library]

Related

Zipping a large amount of data into an output stream without loading all the data into memory first in C#

I have a C# program that generates a bunch of short (10 seconds or so) video files. These are stored in Azure blob storage. I want the user to be able to download these files at a later date as a zip. However, it would take a substantial amount of memory to load the entire collection of video files into memory to create the zip. I was wondering if it is possible to pull data from a stream into memory, zip-encode it, output it to another stream, and dispose of it before moving on to the next segment of data.
Let's say the user has generated 100 10 MB videos. If possible, this would allow me to send the zip to the user without first loading the entire 1 GB of footage into memory (or storing the entire zip in memory after the fact).
The individual videos are pretty small, so loading one entire file into memory at a time is fine, as long as I can release it after it has been encoded and transmitted, before moving on to the next file.
Yes, it is certainly possible to stream files in, without requiring any of them to be entirely in memory at any one time, and to compress, stream out, and transmit a zip file containing them, without holding the entire zip file either in memory or in mass storage. The zip format is designed to be streamable. However, I am not aware of a library that will do all of that for you.
ZipFile would require saving the entire zip file before transmitting it. If you're ok with saving the zip file in mass storage (not memory) before transmitting, then use ZipFile.
To write your own zip streamer, you would need to generate the zip file format manually. The zip format is documented here. You can use DeflateStream to do the actual compression and Crc32 to compute the CRC-32s. You would transmit the local header before each file's compressed data, followed by a data descriptor after each. You would save the local header information in memory as you go along, and then transmit the central directory and end record after all of the local entries.
zip is a relatively straightforward format, so while it would take a little bit of work, it is definitely doable.
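As a sketch of the streaming idea (assuming a .NET version whose ZipArchive accepts a forward-only output stream in ZipArchiveMode.Create; the source-opening helper below is a hypothetical stand-in for reading a blob):

```csharp
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;

static class StreamingZip
{
    // Writes one zip entry at a time to `output` (e.g. the HTTP response
    // body), so only a single video's data is in flight at any moment.
    public static async Task WriteZipAsync(Stream output, string[] videoNames)
    {
        using var zip = new ZipArchive(output, ZipArchiveMode.Create, leaveOpen: true);
        foreach (var name in videoNames)
        {
            // Video is already compressed, so skip expensive deflate work.
            var entry = zip.CreateEntry(name, CompressionLevel.NoCompression);
            using var entryStream = entry.Open();
            using var source = await OpenVideoStreamAsync(name);
            await source.CopyToAsync(entryStream); // copied in small buffers
        }
    }

    // Hypothetical helper; in the question's scenario this would open a
    // read stream from Azure blob storage. A local file stands in here.
    static Task<Stream> OpenVideoStreamAsync(string name) =>
        Task.FromResult<Stream>(File.OpenRead(name));
}
```

CompressionLevel.NoCompression is a deliberate choice here: the videos are already compressed, so deflating them again costs CPU for little gain.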

What archive file format is good for random access during distributed processing?

I'm looking for an archive file type that I can use for processing large archive files in AWS Lambda. The entries in the archive are not that large by themselves, the largest maybe 100 MB, but there could be a lot of them. My strategy is to create a lambda for processing each entry, where the parameters to the lambda are a path to the file in S3, as well as a byte range for the entry inside the archive. This would allow processing each entry without needing to load the entire file. I can write a format to handle this, but I figure something like this probably already exists.
Not required, but hoping to work with these files in C#.
As long as your files are not that big, I can suggest the following approach.
1. Function is invoked.
2. If there is a file in /tmp, go to step 4.
3. If there is no file in /tmp, download a new file from S3.
4. Pop data from the file in chunks, making sure the remaining file shrinks as you process it.
5. Process the popped chunks of data.
6. If the function is about to time out, stop processing the file and invoke yourself again (call a sibling). It may spawn in the same container or in a different one and will either start processing another file (remaining from some other run) or continue the same one.
7. When the file is completely processed, mark it in some way (e.g. a tag) in S3.
There are some limitations here:
- You should not care about the order in which files (or the rows inside them) are processed.
- Occasional multiple processing of the same chunks of data should not cause any problems.
- You probably also want to keep track of processed files somewhere external.
A pretty similar approach is used in the Scheduler class of the sosw package. This is a Python package, not C#, but the idea could help you.
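A rough C# outline of that loop (the AWS SDK calls are real, but DownloadNextFileFromS3Async, TryPopChunk, ProcessChunk, and MarkFileProcessedInS3Async are hypothetical stubs you'd fill in):

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using Amazon.Lambda;            // AWSSDK.Lambda
using Amazon.Lambda.Core;
using Amazon.Lambda.Model;

public class ChunkWorker
{
    const string TmpPath = "/tmp/current.dat";
    static readonly TimeSpan SafetyMargin = TimeSpan.FromSeconds(30);

    public async Task Handler(ILambdaContext context)
    {
        // Steps 2-3: reuse a half-processed file if the container kept it.
        if (!File.Exists(TmpPath))
            await DownloadNextFileFromS3Async(TmpPath);

        // Steps 4-5: pop and process chunks; the file shrinks as we go.
        while (TryPopChunk(TmpPath, out byte[] chunk))
        {
            ProcessChunk(chunk);

            // Step 6: about to time out, so hand off to a sibling.
            if (context.RemainingTime < SafetyMargin)
            {
                using var lambda = new AmazonLambdaClient();
                await lambda.InvokeAsync(new InvokeRequest
                {
                    FunctionName = context.FunctionName,
                    InvocationType = InvocationType.Event // async, fire-and-forget
                });
                return;
            }
        }

        await MarkFileProcessedInS3Async(); // step 7: e.g. set an S3 object tag
    }

    // Hypothetical stubs; real implementations would use AWSSDK.S3.
    Task DownloadNextFileFromS3Async(string path) => Task.CompletedTask;
    bool TryPopChunk(string path, out byte[] chunk) { chunk = null; return false; }
    void ProcessChunk(byte[] chunk) { }
    Task MarkFileProcessedInS3Async() => Task.CompletedTask;
}
```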

Zip folder to SVN?

This may sound like a silly question, but I just wanted to clear something up. I've zipped a folder and added it to my SVN repository. Is doing this OK, or should I commit the unzipped folder instead?
I just need to be sure!
If you are going to change the contents of the directory, then you should store it unzipped. Keeping it in a zip file will exhaust storage on the server much faster, as if you were storing every version of your zip as a separate file on the server.
The zip format has one cool property: every file inside the archive occupies a contiguous segment of bytes and is compressed/decompressed independently of all the other files. As a result, if you have a 100 MB zip and modify two files inside it, each 1 MB in size, the new zip will contain at most ~2 MB of new data; the remaining 98 MB will most likely be byte-exact copies of pieces of the old zip. So it is in theory possible to represent small in-zip changes as small deltas. But there are many problems in practice.
First of all, you must make sure that the unchanged files are not recompressed. If you build both the first zip and the second zip from scratch using different programs, program versions, compression settings, etc., you can get slightly different compressed output for the unchanged files. As a result, the actual bytes in the zip files will differ greatly, and any hope for a small delta is lost. The better approach is to take the first zip and add/remove files in it, as the sketch below shows.
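For instance, with System.IO.Compression in C# (file names are placeholders), updating an archive in place should leave the untouched entries' compressed bytes identical, which is exactly what gives a delta algorithm something to match:

```csharp
using System.IO;
using System.IO.Compression;

class ZipUpdateDemo
{
    static void Main()
    {
        // Replace one file inside an existing archive without rebuilding
        // the rest; unchanged entries keep their original compressed bytes.
        using var zip = ZipFile.Open("project.zip", ZipArchiveMode.Update);
        zip.GetEntry("docs/readme.txt")?.Delete();      // drop the old copy
        var entry = zip.CreateEntry("docs/readme.txt"); // write the new one
        using var writer = new StreamWriter(entry.Open());
        writer.Write("updated contents");
    }
}
```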
The main problem, however, is how SVN stores deltas. As far as I know, SVN uses the xdelta algorithm for computing deltas. This algorithm is perfectly capable of detecting equal blocks inside a zip file if given unlimited memory. The problem is that SVN uses a memory-limited version with a window of 100 KB. Even if you simply remove a segment longer than 100 KB from a file, SVN's delta computation will break on it, and the rest of the file will simply be copied into the delta. Most likely, the delta will take as much space as the whole file.

How to detect if file is downloading in c# or python

I have mixed Python/C# code that scans a list of directories and manipulates the files in a loop.
Sometimes a file is downloaded directly into the income directory, and the program starts manipulating it before the download has completed.
Is there any way to detect whether a file has finished downloading?
A simple way to detect whether a file is done downloading is to compare file sizes. If you always keep a previous "snapshot" of the files in the directory, you can see which files exist and which don't at a given moment. Once you see a new file, you know a download has started. From that point, compare that file's size over time; once the previous size equals the current size, the file has finished downloading. Each new "snapshot" would be taken, for example, 1 ms after the previous one. This may not be simple to implement depending on your knowledge of Python or C#, but I think this algorithm would get you what you want; see the sketch below.
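A minimal C# sketch of that polling loop (the poll interval is a placeholder; a size unchanged across two polls is taken to mean the download is complete):

```csharp
using System;
using System.IO;
using System.Threading;

static class DownloadWatcher
{
    // Blocks until the file's size stops changing between two polls,
    // a heuristic for "the download has finished".
    public static void WaitUntilStable(string path, TimeSpan pollInterval)
    {
        long lastSize = -1;
        while (true)
        {
            long size = new FileInfo(path).Length;
            if (size == lastSize)
                return; // unchanged since the last poll: assume complete
            lastSize = size;
            Thread.Sleep(pollInterval);
        }
    }
}
```

A complementary check worth knowing: trying to open the file with FileShare.None fails while another process still holds it open for writing, which can be a more reliable completion signal than size alone.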
When you download, you get the expected file size. You can check the file's size on disk against that before processing; if the size on disk matches the download size, then allow processing.

Options for header in raw byte file

I have a large raw data file (up to 1 GB) which contains raw samples from a USB data logger.
I need to store extra information relating to the file (sample rate, description, trigger point, last seek position, etc.) and was looking into adding this as some sort of header.
The header should ideally be human-readable and flexible, so I've so far ruled out binary serialization into a header.
I also want to avoid two separate files, as they could end up separated when copied or backed up. I remember somebody telling me that newer *.*x Microsoft Office documents are actually a number of files in a zip. Is there a simple way to achieve this? Could I still keep the quick seek times into the raw file?
Update
I started using the binary serializer and found it to be a pain. I ended up using the XML serializer, as I'm more comfortable with it.
I reserve some space at the start of the file for the XML. Simple.
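A sketch of that reservation scheme (the metadata type and sizes are assumptions): fix the header region's size up front, serialize the XML into it, and pad the remainder so the raw samples always begin at a known offset:

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

public class CaptureInfo   // hypothetical metadata type
{
    public int SampleRate;
    public string Description;
    public long TriggerPoint;
}

static class HeaderDemo
{
    const int HeaderSize = 4096; // reserved region; raw samples start here

    public static void WriteHeader(string path, CaptureInfo info)
    {
        using var fs = new FileStream(path, FileMode.OpenOrCreate, FileAccess.Write);
        var xml = new MemoryStream();
        new XmlSerializer(typeof(CaptureInfo)).Serialize(xml, info);
        if (xml.Length > HeaderSize)
            throw new InvalidOperationException("Header region too small.");
        fs.Write(xml.GetBuffer(), 0, (int)xml.Length);
        var padding = new byte[HeaderSize - xml.Length];
        fs.Write(padding, 0, padding.Length); // pad to the fixed size
    }
}
```

Seeking to sample data is then just `stream.Seek(HeaderSize + offset, SeekOrigin.Begin)`.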
When you say you want to make the header human-readable, this suggests opening the file in a text editor. Do you really want to do that, considering the file size and (I'm assuming) that the remainder of the file is non-human-readable binary data? If so, just write the text header data at the start of the binary file; it will be visible when the file is opened, but, of course, the remainder of the file will look like garbage.
You could create an uncompressed ZIP archive, which may allow you to seek directly to the binary data. See this for information on creating a ZIP archive: http://weblogs.asp.net/jgalloway/archive/2007/10/25/creating-zip-archives-in-net-without-an-external-library-like-sharpziplib.aspx
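A sketch of that layout with System.IO.Compression (names are placeholders): a small human-readable XML entry for the metadata, plus the raw samples stored with CompressionLevel.NoCompression so their bytes stay contiguous inside the container:

```csharp
using System.IO;
using System.IO.Compression;

class ContainerDemo
{
    static void Main()
    {
        using var zip = ZipFile.Open("capture.dlog", ZipArchiveMode.Create);

        // Human-readable header, viewable with any zip tool.
        var header = zip.CreateEntry("header.xml");
        using (var w = new StreamWriter(header.Open()))
            w.Write("<capture sampleRate=\"48000\" trigger=\"1024\" />");

        // Raw samples, stored uncompressed so offsets into the data
        // remain cheap to compute and seek to.
        var data = zip.CreateEntry("samples.raw", CompressionLevel.NoCompression);
        using var dst = data.Open();
        using var src = File.OpenRead("samples.raw"); // placeholder source
        src.CopyTo(dst);
    }
}
```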
