I have this C# code which unzips a zip file.
ZipFile.ExtractToDirectory(_downloadPath, _extractPath);
To verify the download, I compare the file sizes. But for the extraction process, how do I ensure it succeeded? The output could be corrupted (for example, if extraction stops halfway). Can I compare file counts?
I suggest you compare the MD5 hash of each file in the archive with that of the extracted file. Though it is definitely not the fastest approach, this way you'll be 100% sure the data is not corrupted.
You can find how to get the MD5 of a file inside an archive here:
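For illustration, a minimal sketch of that comparison, using the same `_downloadPath`/`_extractPath` pair as the question (the helper name is made up):

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Security.Cryptography;

static class ExtractionVerifier
{
    // Compare the MD5 of every archive entry with the MD5 of the
    // corresponding extracted file; any mismatch means corruption.
    public static bool VerifyExtraction(string zipPath, string extractPath)
    {
        using var archive = ZipFile.OpenRead(zipPath);
        using var md5 = MD5.Create();

        foreach (var entry in archive.Entries)
        {
            if (string.IsNullOrEmpty(entry.Name))
                continue; // directory entry, nothing to compare

            var extractedFile = Path.Combine(extractPath, entry.FullName);
            if (!File.Exists(extractedFile))
                return false;

            using var entryStream = entry.Open();
            var entryHash = md5.ComputeHash(entryStream);

            using var fileStream = File.OpenRead(extractedFile);
            var fileHash = md5.ComputeHash(fileStream);

            if (!entryHash.SequenceEqual(fileHash))
                return false;
        }
        return true;
    }
}
```

You would call `VerifyExtraction(_downloadPath, _extractPath)` right after `ExtractToDirectory`.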
I need to get the directory of a file inside the zip file.
I'm looking for an archive file type that I can use for processing large archive files in AWS Lambda. The entries in the archive are not that large themselves (the largest is maybe 100 MB), but there could be a lot of them. My strategy is to have a Lambda for processing each entry, where the parameters to the Lambda are a path to the file in S3 plus a byte range for the entry inside the archive. That would allow processing each entry without loading the entire file. I can write a format to handle this, but I figure something like this probably already exists.
Not required, but hoping to work with these files in C#.
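For the S3 side of that strategy in C#, ranged reads are directly supported by the AWS SDK for .NET. A minimal sketch, where the bucket, key, and offsets are placeholders that would come from the Lambda's input payload:

```csharp
using System.IO;
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;

static class EntryReader
{
    // Fetch only the byte range of one archive entry from S3;
    // bucket, key, start and end come from the invocation payload.
    public static async Task<byte[]> ReadEntryAsync(
        IAmazonS3 s3, string bucket, string key, long start, long end)
    {
        var request = new GetObjectRequest
        {
            BucketName = bucket,
            Key = key,
            ByteRange = new ByteRange(start, end) // inclusive, sent as an HTTP Range header
        };

        using var response = await s3.GetObjectAsync(request);
        using var buffer = new MemoryStream();
        await response.ResponseStream.CopyToAsync(buffer);
        return buffer.ToArray();
    }
}
```

This is only the range-fetch half of the strategy; the offsets themselves still have to come from whatever index your archive format keeps.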
As long as your files are not that big, I can suggest the following approach.
1. Function invoked.
2. If there is a file in /tmp, go to step 4.
3. If there is no file in /tmp, download a new file from S3.
4. Pop data from the file in chunks, making sure the remaining file shrinks while you process it.
5. Process the popped chunks of data.
6. If the function is about to time out, stop processing the file and invoke yourself again (call a sibling); see the sketch after this answer. The new invocation may spawn in the same container or a different one, and will either start processing another file (remaining from some other run) or continue the same one.
7. When a file is completely processed, mark it in some way (e.g. a tag) in S3.
There are some limitations here:
- You should not care about the order of processing the files and the rows inside files.
- Occasional multiple processing of same chunks of data should not cause any problem.
- You probably also want to keep track of processed files somewhere external.
A pretty similar approach is used in the Scheduler class of the sosw package. It is a Python package, not C#, but the idea may still help you.
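If you stay in C#, the timeout check from step 6 might look roughly like this. `ILambdaContext.RemainingTime` and the asynchronous self-invocation are real SDK pieces; the chunk helpers are made-up stubs:

```csharp
using System;
using System.Threading.Tasks;
using Amazon.Lambda;
using Amazon.Lambda.Core;
using Amazon.Lambda.Model;

static class ChunkProcessor
{
    // Step 6: stop before the Lambda times out and re-invoke the
    // same function asynchronously so another run continues the file.
    public static async Task ProcessChunksAsync(ILambdaContext context, IAmazonLambda lambda)
    {
        var safetyMargin = TimeSpan.FromSeconds(30);

        while (TryPopChunk(out var chunk)) // hypothetical: pops data and shrinks the /tmp file
        {
            Process(chunk); // hypothetical per-chunk processing

            if (context.RemainingTime < safetyMargin)
            {
                await lambda.InvokeAsync(new InvokeRequest
                {
                    FunctionName = context.FunctionName,
                    InvocationType = InvocationType.Event // async "call sibling"
                });
                return;
            }
        }
        // file fully processed: tag the S3 object here (step 7)
    }

    // Stubs so the sketch compiles; the real versions are yours to write.
    static bool TryPopChunk(out byte[] chunk) { chunk = Array.Empty<byte>(); return false; }
    static void Process(byte[] chunk) { }
}
```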
I want to have multiple network stream threads writing/downloading into one file simultaneously.
So, e.g., you have one file and download the ranges:
0-1000
1001-2002
2003-3004...
And I want them all to write their received bytes into one file as efficiently as possible.
Right now I download each range part into its own file and combine them into the final file once they have all finished.
I would like them all to write into one file, if that is possible, to reduce disk usage; I feel like this could all be done better.
You could use persisted memory-mapped files; see https://learn.microsoft.com/en-us/dotnet/standard/io/memory-mapped-files
Persisted files are memory-mapped files that are associated with a source file on a disk. When the last process has finished working with the file, the data is saved to the source file on the disk. These memory-mapped files are suitable for working with extremely large source files.
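A rough sketch of that idea, with your example ranges: pre-size one persisted memory-mapped file and let each downloader write its range through its own view (the download call is a stub):

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Threading.Tasks;

class RangedDownload
{
    static async Task Main()
    {
        long totalLength = 3005; // known up front, e.g. from Content-Length
        var ranges = new (long Offset, int Length)[]
        {
            (0, 1001), (1001, 1002), (2003, 1002) // the 0-1000, 1001-2002, 2003-3004 parts
        };

        // One pre-sized file on disk; every task writes through its own view,
        // so there is no merge step at the end.
        using var mmf = MemoryMappedFile.CreateFromFile(
            "download.bin", FileMode.Create, null, totalLength);

        await Task.WhenAll(Array.ConvertAll(ranges, range => Task.Run(() =>
        {
            byte[] data = DownloadRange(range.Offset, range.Length); // hypothetical HTTP range download

            using var view = mmf.CreateViewAccessor(range.Offset, range.Length);
            view.WriteArray(0, data, 0, data.Length);
        })));
    }

    // Stub standing in for the real ranged HTTP request.
    static byte[] DownloadRange(long offset, int length) => new byte[length];
}
```

Because each view covers a disjoint byte range, the writers need no locking between them.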
I have an ASP.NET website that stores large numbers of files such as videos. I want an easy way to allow the user to download all the files in a single package. I was thinking about creating ZIP files dynamically.
All the examples I have seen involve creating the file before it is downloaded, but potentially terabytes of information will be downloaded, and therefore the user would face a long wait. Apparently ZIP files store all the information about their contents at the end of the file.
My idea is to create the file dynamically as it is downloaded. That way the user could click download, the download would start immediately, and no server space would be needed for pre-packaging, since the content would be copied over uncompressed, sequentially. The final part of the file would contain the information about the contents that were downloaded.
Has anyone had any experience with this? Does anyone know a better way of doing it? At the moment I can't see any pre-made utilities for this, but I believe it will work. If nothing exists, I'm thinking I will have to read the ZIP file format specification and write my own code, something that will take more time than I was intending to spend on this.
https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
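You likely don't need to write your own format: `ZipArchive` in `ZipArchiveMode.Create` only requires a writable stream, not a seekable one, so it can write straight to a forward-only output. A rough sketch, where `outputStream` stands in for the HTTP response stream and `filePaths` for your videos:

```csharp
using System.IO;
using System.IO.Compression;

static class ZipStreamer
{
    // Build the zip directly on the output stream: entries are copied
    // through uncompressed, and disposing the archive writes the
    // central directory at the very end, as the spec describes.
    public static void StreamZip(Stream outputStream, string[] filePaths)
    {
        using var zip = new ZipArchive(outputStream, ZipArchiveMode.Create, leaveOpen: true);

        foreach (var path in filePaths)
        {
            var entry = zip.CreateEntry(Path.GetFileName(path), CompressionLevel.NoCompression);
            using var entryStream = entry.Open();
            using var fileStream = File.OpenRead(path);
            fileStream.CopyTo(entryStream);
        }
    }
}
```

With `CompressionLevel.NoCompression` the server does little more than copy bytes, and the client starts receiving data immediately.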
Is it possible to read the contents of a .ZIP file without fully downloading it?
I'm building a crawler and I'd rather not have to download every zip file just to index their contents.
Thanks!
The tricky part is locating the start of the central directory, which sits at the end of the file. You can't simply step backwards over the entries, because central directory entries are not fixed size (each one embeds its file name). Instead, use the End of Central Directory (EOCD) record, the very last structure in the file: it is 22 bytes plus an optional comment of up to 64 KiB. Fetch the final 64 KiB + 22 bytes with a ranged request and scan backwards for the EOCD signature (0x06054b50); the record then gives you the offset and size of the central directory, which one more ranged request retrieves.
That way you hit the webserver only two or three times, no matter how many of the up-to-64K entries the archive contains.
Whether this is more efficient than downloading the whole file depends on the file size. You could request the size of the resource first and, if it is smaller than a given threshold, download the entire thing; for large resources, requesting a few ranges is quicker and, with the threshold set high enough, less taxing on the webserver overall.
Note that HTTP/1.1 allows byte ranges of a resource to be downloaded; with HTTP/1.0 you have no choice but to download the whole file.
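A minimal sketch of the ranged-request side in C# (the URL is a placeholder, and parsing the directory bytes is left out):

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

static class ZipTailFetcher
{
    // Fetch just the tail of a remote zip: enough to contain the EOCD
    // record (22 bytes plus a comment of up to 64 KiB), which can then
    // be scanned backwards for its signature, 0x06054b50.
    public static async Task<byte[]> FetchTailAsync(HttpClient client, string url)
    {
        using var head = new HttpRequestMessage(HttpMethod.Head, url);
        using var headResponse = await client.SendAsync(head);
        long length = headResponse.Content.Headers.ContentLength
                      ?? throw new InvalidOperationException("server did not report a length");

        long tail = Math.Min(length, 22 + 65536);

        using var request = new HttpRequestMessage(HttpMethod.Get, url);
        request.Headers.Range = new RangeHeaderValue(length - tail, length - 1);

        using var response = await client.SendAsync(request);
        return await response.Content.ReadAsByteArrayAsync();
    }
}
```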
The format puts the key piece of information about what's in the file at the end of it. Entries are then specified as an offset from that record, so I believe you'll need access to the whole thing.
GZip streams, on the other hand, can be read sequentially, I believe.
I don't know if this helps, as I'm not a programmer, but in Outlook you can preview zip files and see the actual content, not just the file directory (if they contain previewable documents like PDFs).
There is a solution implemented in ArchView
"ArchView can open archive file online without downloading the whole archive."
https://addons.mozilla.org/en-US/firefox/addon/5028/
Inside the archview-0.7.1.xpi, in the file "archview.js", you can look at their JavaScript approach.
It's possible. All you need is a server that allows reading bytes in ranges: fetch the end record (to learn the size of the central directory), fetch the central directory (to learn where each file starts and ends), and then fetch the proper bytes and handle them.
Here is an implementation in Python: onlinezip
[full disclosure: I'm the author of the library]