Validate very large .zip files (~12 GB) downloaded from FTP using Chilkat - C#

I am using Chilkat to download large .zip files from an FTP server.
The files are usually around 12-13 GB, and after downloading I need to validate that the file is not corrupt.
I've been trying to use ICSharpCode.SharpZipLib.Zip, like this:
ZipFile zip = new ZipFile(path);
bool isValidZip = zip.TestArchive(true, TestStrategy.FindFirstError, null);
But validation takes a VERY long time or even crashes.
Is there any quicker solution?

If the customer is uploading to FTP, then maybe the customer can also upload a SHA-256 hash. For example, if the customer uploads x.zip, they can compute the SHA-256 of x.zip and also upload x.zip.sha256. Your application can then download both x.zip and x.zip.sha256, use Chilkat.Crypt2.HashFile to hash x.zip, and check the result against x.zip.sha256.
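Roughly, that check could look like the sketch below. It uses Chilkat's Crypt2 hashing as I recall the API (HashAlgorithm, EncodingMode, HashFileENC); verify those names against the Chilkat reference for your version, and note the file paths are only illustrative.
// Sketch: compute SHA-256 of the downloaded zip and compare to the sidecar file.
Chilkat.Crypt2 crypt = new Chilkat.Crypt2();
crypt.HashAlgorithm = "sha256";
crypt.EncodingMode = "hex";

string actualHash = crypt.HashFileENC(@"C:\downloads\x.zip");            // hex digest of the download
string expectedHash = System.IO.File.ReadAllText(@"C:\downloads\x.zip.sha256").Trim();

bool hashOk = string.Equals(actualHash, expectedHash, StringComparison.OrdinalIgnoreCase);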
If it's not possible to get an expected hash value, then you might first check the file size against what is on the server. FTP servers differ in how file information is provided. Older servers provide human-readable directory listings (the LIST command), whereas newer servers (i.e. within the last 10 years) support MLSD. Chilkat will use MLSD if possible. Older FTP servers might provide inaccurate (non-exact) file size information, whereas MLSD is accurate. You can call the Ftp2.Feat method to check whether MLSD is supported. If so, you can first validate the size of the downloaded file. If it's not the expected size, you can skip any remaining validation because you already know it's invalid. (You can also set Ftp2.AutoGetSizeForProgress = true, and Chilkat will then not return a success status when MLSD is used and the total number of bytes downloaded does not equal the expected download size.)
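As a rough sketch (the method names Feat and GetSizeByName64 are my recollection of the Ftp2 API, so treat them as assumptions and check the Chilkat reference; host, credentials, and paths are illustrative):
Chilkat.Ftp2 ftp = new Chilkat.Ftp2();
ftp.Hostname = "ftp.example.com";
ftp.Username = "user";
ftp.Password = "secret";
if (!ftp.Connect()) throw new Exception(ftp.LastErrorText);

// Only trust the size comparison if the server supports MLSD (exact sizes).
bool hasMlsd = ftp.Feat().Contains("MLSD");
long remoteSize = ftp.GetSizeByName64("x.zip");
long localSize = new System.IO.FileInfo(@"C:\downloads\x.zip").Length;

if (hasMlsd && remoteSize >= 0 && remoteSize != localSize)
{
    // Byte counts differ: the download is already known to be bad, skip further checks.
}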
Assuming the byte counts are equal, or if you can't get an accurate byte count and you don't have an expected hash, you can then test whether the zip is valid. The first option is to call the Chilkat.Zip.OpenZip method. Opening the .zip walks the zip's local file headers and central directory headers, and most errors will be caught if the .zip is corrupt. A more comprehensive check is only possible by actually decompressing the data for each file within the zip -- and this is probably why SharpZipLib takes so long. The only way to validate the compressed data is to actually do the decompression. Corrupted bytes would likely cause the decompressor to reach an impossible internal state, which clearly indicates corruption. Also, the CRC-32 of the uncompressed data is stored in each local file header within the .zip, and checking the CRC-32 requires decompression. SharpZipLib is surely checking the CRC-32 (after it decompresses; it's probably trying to decompress in memory and runs out of memory). Chilkat.Zip.OpenZip does not check the CRC-32 because it is not decompressing. You can call Chilkat.Zip.Unzip to unzip to the filesystem, and the act of unzipping also checks the CRC-32.
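In code, the two levels of checking might look something like this sketch (OpenZip returning a bool and Unzip returning the number of files extracted, or -1 on failure, is how I recall the Chilkat.Zip API; the paths are illustrative):
Chilkat.Zip zip = new Chilkat.Zip();

// Cheap check: walk the local file headers and central directory headers.
bool headersOk = zip.OpenZip(@"C:\downloads\x.zip");

// Thorough check: actually decompress to disk, which also verifies each entry's CRC-32.
int filesUnzipped = headersOk ? zip.Unzip(@"C:\temp\zip-validate") : -1;
bool fullyValid = filesUnzipped >= 0;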
Anyway, you might decide that checking the byte count and being able to call Chilkat.Zip.OpenZip successfully is sufficient for the validation check.
Otherwise, it's best to design the validation (using a parallel .sha256 file) into the system architecture if you're dealing with huge files.

Some FTP servers have implemented hash commands (see Appendix B). Issue HELP at the ftp prompt to get a list of all available commands and see if your server supports a hash command. Otherwise you must stick to zip testing.
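For example, from code you might probe for one of the non-standard hash commands some servers expose (XCRC, XMD5, XSHA1, XSHA256). The sketch below assumes an already-connected Chilkat Ftp2 object and assumes Ftp2 has a raw-command method named SendCommand that returns the server's reply; verify both against your Chilkat version.
string helpReply = ftp.SendCommand("HELP");
if (helpReply.Contains("XSHA256"))
{
    string hashReply = ftp.SendCommand("XSHA256 x.zip");
    // Parse the digest out of the reply and compare it with a locally computed SHA-256 of the download.
}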

Related

Zipping a large amount of data into an output stream without loading all the data into memory first in C#

I have a C# program that generates a bunch of short (10 seconds or so) video files. These are stored in an Azure file storage blob. I want the user to be able to download these files at a later date as a zip. However, it would take a substantial amount of memory to load the entire collection of video files into memory to create the zip. I was wondering if it is possible to pull data from a stream into memory, zip-encode it, output it to another stream, and dispose of it before moving on to the next segment of data.
Let's say the user has generated 100 10 MB videos. If possible, this would allow me to send the zip to the user without first loading the entire 1 GB of footage into memory (or storing the entire zip in memory after the fact).
The individual videos are pretty small, so if I need to load an entire file into memory at a time, that is fine, as long as I can remove it from memory after it has been encoded and transmitted, before moving on to the next file.
Yes, it is certainly possible to stream in files, without requiring even one of them to be entirely in memory at any one time, and to compress, stream out, and transmit a zip file containing them, without holding the entire zip file either in memory or in mass storage. The zip format is designed to be streamable. However, I am not aware of a library that will do that for you.
ZipFile would require saving the entire zip file before transmitting it. If you're ok with saving the zip file in mass storage (not memory) before transmitting, then use ZipFile.
To write your own zip streamer, you would need to generate the zip file format manually. The zip format is documented here. You can use DeflateStream to do the actual compression and Crc32 to compute the CRC-32s. You would transmit the local header before each file's compressed data, followed by a data descriptor after each. You would save the local header information in memory as you go along, and then transmit the central directory and end record after all of the local entries.
zip is a relatively straightforward format, so while it would take a little bit of work, it is definitely doable.
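To make the per-entry part of that concrete, here is a hedged sketch under a few simplifications: a local header is written with general-purpose bit 3 set, the data is deflated, and a data descriptor then carries the CRC-32 and sizes. The central directory and end record (written after all entries) are omitted, output.Position is used for byte counting (a truly non-seekable output would need a small counting wrapper instead), and entries over 4 GB would additionally need Zip64. Names and buffer sizes are my own choices, not from the original answer.
using System.IO;
using System.IO.Compression;
using System.Text;

// Writes one streamed zip entry: local header (bit 3 set), deflated data, data descriptor.
// The caller keeps entryName, crc32, sizes, and the header offset in memory so it can write
// the central directory and end-of-central-directory record after all entries.
static void WriteZipEntry(Stream output, Stream input, string entryName,
                          out uint crc32, out long compressedSize, out long uncompressedSize)
{
    var writer = new BinaryWriter(output, Encoding.ASCII, leaveOpen: true);
    byte[] name = Encoding.ASCII.GetBytes(entryName);

    writer.Write(0x04034b50u);      // local file header signature "PK\3\4"
    writer.Write((ushort)20);       // version needed to extract
    writer.Write((ushort)0x0008);   // flags: bit 3 = data descriptor follows
    writer.Write((ushort)8);        // compression method: deflate
    writer.Write(0u);               // DOS mod time + date (left zero in this sketch)
    writer.Write(0u);               // CRC-32 (deferred to the data descriptor)
    writer.Write(0u);               // compressed size (deferred)
    writer.Write(0u);               // uncompressed size (deferred)
    writer.Write((ushort)name.Length);
    writer.Write((ushort)0);        // no extra field
    writer.Write(name);
    writer.Flush();

    // Deflate the input, counting bytes and accumulating the CRC-32 as we go.
    crc32 = 0xFFFFFFFFu;
    uncompressedSize = 0;
    long dataStart = output.Position;   // non-seekable output: use a counting wrapper instead
    using (var deflate = new DeflateStream(output, CompressionLevel.Optimal, leaveOpen: true))
    {
        var buffer = new byte[81920];
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            crc32 = UpdateCrc32(crc32, buffer, read);
            uncompressedSize += read;
            deflate.Write(buffer, 0, read);
        }
    }
    crc32 ^= 0xFFFFFFFFu;
    compressedSize = output.Position - dataStart;

    writer.Write(0x08074b50u);      // data descriptor signature
    writer.Write(crc32);
    writer.Write((uint)compressedSize);
    writer.Write((uint)uncompressedSize);   // 4-byte fields; over 4 GB needs Zip64
    writer.Flush();
}

// Table-driven CRC-32 using the zip polynomial.
static readonly uint[] CrcTable = BuildCrcTable();
static uint[] BuildCrcTable()
{
    var table = new uint[256];
    for (uint i = 0; i < 256; i++)
    {
        uint c = i;
        for (int k = 0; k < 8; k++) c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        table[i] = c;
    }
    return table;
}
static uint UpdateCrc32(uint crc, byte[] buffer, int count)
{
    for (int i = 0; i < count; i++) crc = CrcTable[(crc ^ buffer[i]) & 0xFF] ^ (crc >> 8);
    return crc;
}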

Find identical files in two folders with .Net

I have a folder with music videos which I want to back up from my laptop to an external HDD. I don't want to use a backup image, but a direct file copy, so I can directly watch the music videos from the backup HDD on another computer/laptop or a console.
Currently I use the freeware SyncBack Free to mirror the files to the external HDD. SyncBack Free is a nice tool, but it does not seem to fully satisfy my needs. The problem is that I like to modify the filenames of my music videos from time to time. Though SyncBack Free has an option for files with identical content, it does not seem to work for videos, and you end up with two copies of the same file in each folder when you synchronise after a file name change.
So I'm thinking about writing my own freeware backup software.
The questions are:
- How can I identify identical files with C#/.NET 4.0 without using the filename? I'm thinking of generating hashes or a checksum for the files, without knowing much about it.
- Is it too slow to really be used for backup software?
You can get a hash of a file like this:
using System;
using System.IO;
using System.Security.Cryptography;

static string GetFileHash(string filename)
{
    // Reads the whole file into memory, then hashes it (fine for small files).
    byte[] data = File.ReadAllBytes(filename);
    using (var md5 = MD5.Create())
    {
        return Convert.ToBase64String(md5.ComputeHash(data));
    }
}
MD5 is not the most secure hash, but it is still fast, which makes it good for file checksums. If the files are large, ComputeHash() also accepts a Stream, so you don't have to read the whole file into memory.
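With the same usings as above, a streamed variant (just a sketch, with an illustrative name) never loads a whole video into memory:
static string GetFileHashStreamed(string filename)
{
    // Stream the file through MD5 instead of calling File.ReadAllBytes.
    using (var md5 = MD5.Create())
    using (var stream = File.OpenRead(filename))
    {
        return Convert.ToBase64String(md5.ComputeHash(stream));
    }
}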
You may also want to check out some other checksum algorithms in the HashLib library. It contains CRC and other algorithms which should be even faster. You can download it with NuGet.
There are other strategies you can use as well, such as checking whether only the first x bytes are the same.
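A sketch of that pre-check (the 64 KB limit and the helper name are arbitrary choices here): a mismatch lets you skip hashing entirely, while a match still needs a full hash to be certain.
static bool FirstBytesEqual(string pathA, string pathB, int maxBytes = 64 * 1024)
{
    using (var a = File.OpenRead(pathA))
    using (var b = File.OpenRead(pathB))
    {
        if (a.Length != b.Length) return false;     // different sizes cannot be identical
        var bufA = new byte[maxBytes];
        var bufB = new byte[maxBytes];
        int readA = a.Read(bufA, 0, maxBytes);
        int readB = b.Read(bufB, 0, maxBytes);
        if (readA != readB) return false;
        for (int i = 0; i < readA; i++)
            if (bufA[i] != bufB[i]) return false;
        return true;                                // same prefix; confirm with a full hash
    }
}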
You can keep a database of hashes that have already been backed up so that you don't have to recompute the hashes each time the backup runs. You could loop through only the files which have been modified since the last backup and check whether their hash is in your hash database. SQLite comes to mind as a good database for this if you want your backup program to be portable.

How are delta file backups encoded?

With a backup application, a good and space-efficient way to back up is to detect changes in files. Some online services such as Dropbox do this as well since Dropbox includes version history. How do backup applications detect changes in files and store them?
If you have a monumentally large file which has already been backed up, and you make a small change (such as in a Microsoft Word document), how can an application detect a change and process it? If the file has changes made often, there must be an efficient algorithm to only process changes and not the entire file. Is there an algorithm to do this in C# .NET?
Edit: I'm trying to figure out how to encode two files as the original and the changes (in VCDIFF format, etc.). I know how to use the format and decode it just fine.
To detect changes, you can compute a hash (such as MD5) for the original and the modified versions of the file. If they are identical, no changes have been made.
I think Dropbox has its own protocol to detect which part of the file has been modified.
You can figure out your own way; for example, divide the file into fixed-size parts and store their hash codes. When the client downloads the file, send this information along. After modifying the file, the client recalculates the hash codes for the parts, compares them with the original hash codes, and uploads only the parts that were modified; the file is then rebuilt from the original parts and the modified parts.
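A sketch of that fixed-size-parts idea (block size and names here are arbitrary choices): hash each block of a file so two versions can be compared block by block, and only the blocks whose hashes differ need to be uploaded.
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

static List<string> BlockHashes(string path, int blockSize = 4 * 1024 * 1024)
{
    var hashes = new List<string>();
    using (var md5 = MD5.Create())
    using (var stream = File.OpenRead(path))
    {
        var buffer = new byte[blockSize];
        int read;
        while ((read = stream.Read(buffer, 0, blockSize)) > 0)
        {
            // Hash only the bytes actually read (the last block may be short).
            hashes.Add(Convert.ToBase64String(md5.ComputeHash(buffer, 0, read)));
        }
    }
    return hashes;
}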
rsync is an open source tool that synchronizes files using delta encoding.
EDIT: my idea above is very simple and not efficient. You can have a look at VCDIFF, which is explained in a research paper and has implementations in many languages (including C#).

Reading zip files without full download

Is it possible to read the contents of a .ZIP file without fully downloading it?
I'm building a crawler and I'd rather not have to download every zip file just to index their contents.
Thanks;
The tricky part is identifying the start of the central directory, which occurs at the end of the file. Since each entry is the same fixed size, you can do a kind of binary search starting from the end of the file. The binary search is trying to guess how many entries are in the central directory. Start with some reasonable value, N, and retrieve the portion of the file at end-(N*sizeof(DirectoryEntry)). If that file position does not start with the central directory entry signature, then N is too large: halve it and repeat; otherwise N is too small: double it and repeat. Like binary search, the process maintains the current upper and lower bounds. When the two become equal, you've found the value of N, the number of entries.
The number of times you hit the webserver is at most 16, since there can be no more than 64K entries.
Whether this is more efficient than downloading the whole file depends on the file size. You might request the size of the resource before downloading, and if it's smaller than a given threshold, download the entire resource. For large resources, requesting multiple offsets will be quicker, and overall less taxing for the webserver, if the threshold is set high.
HTTP/1.1 allows ranges of a resource to be downloaded. For HTTP/1.0 you have no choice but to download the whole file.
The format suggests that the key piece of information about what's in the file resides at the end of it. Entries are then specified as an offset from that particular entry, so you'll need to have access to the whole thing, I believe.
GZip formats can be read as a stream, I believe.
I don't know if this helps, as I'm not a programmer. But in Outlook you can preview zip files and see the actual content, not just the file directory (if they are previewable documents, like a PDF).
There is a solution implemented in ArchView
"ArchView can open archive file online without downloading the whole archive."
https://addons.mozilla.org/en-US/firefox/addon/5028/
Inside the archview-0.7.1.xpi, in the file "archview.js", you can look at their JavaScript approach.
It's possible. All you need is a server that allows reading bytes in ranges: fetch the end record (to know the size of the central directory), fetch the central directory (to know where each file starts and ends), and then fetch the proper bytes and handle them.
Here is an implementation in Python: onlinezip
[full disclosure: I'm the author of the library]
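In C#, the first step of that approach might look like the sketch below: fetch just the tail of the remote .zip with an HTTP range request and then locate the end-of-central-directory record (signature 0x06054b50) inside it. It assumes the server honors Range headers and reports a Content-Length; the method name and tail size are illustrative.
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

static async Task<byte[]> FetchZipTailAsync(string url, int tailSize = 64 * 1024)
{
    using (var http = new HttpClient())
    {
        // Ask for the resource size first.
        var head = new HttpRequestMessage(HttpMethod.Head, url);
        using (var headResponse = await http.SendAsync(head))
        {
            long length = headResponse.Content.Headers.ContentLength
                          ?? throw new InvalidOperationException("Server did not report Content-Length.");
            // Then request only the last tailSize bytes.
            var get = new HttpRequestMessage(HttpMethod.Get, url);
            get.Headers.Range = new RangeHeaderValue(Math.Max(0, length - tailSize), length - 1);
            using (var response = await http.SendAsync(get))
            {
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsByteArrayAsync();
            }
        }
    }
}
// Scan the returned bytes backwards for 50 4B 05 06 to find the end record, read the central
// directory's offset and size from it, then issue a second range request for just that region.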

XCopy - Only grab files that are fully uploaded

I have an automated job that pulls files that are uploaded to our servers via a client-facing site, using xcopy.
Is there any way to only pull files that are fully uploaded?
I have thought about creating a second "inProcess" folder that will be used for uploading and then move those files once fully uploaded, but that still creates a window of time when the file is in transition to a "Done" folder...
Any thoughts?
Use the .filepart extension for temporary files.
It's probably the most simple and clear way of doing this.
WinSCP does this.
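If the pulling job is (or can be wrapped in) a small C# step instead of a bare xcopy call, skipping in-progress files is then trivial. A sketch, with illustrative paths and the .filepart convention from above:
using System;
using System.IO;

static void PullCompletedFiles(string sourceDir, string destDir)
{
    Directory.CreateDirectory(destDir);
    foreach (var file in Directory.EnumerateFiles(sourceDir))
    {
        if (file.EndsWith(".filepart", StringComparison.OrdinalIgnoreCase))
            continue;   // still being uploaded; pick it up on the next run
        File.Copy(file, Path.Combine(destDir, Path.GetFileName(file)), overwrite: true);
    }
}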
You can upload an MD5 hash of the file and then upload the file; if the uploaded file doesn't match the MD5, then it's not finished (or, if it takes too long, perhaps it didn't upload properly).
MD5 is often used to check the integrity of a file by creating a hash that represents the file. If the file varies at all, it will almost always generate a different MD5 hash (collisions basically never matter for our purposes). The only reason a file would not match its previously uploaded MD5 hash is if it wasn't finished or the MD5/file was corrupted during upload.
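A sketch of that check (the ".md5" sidecar naming and the lowercase hex format are assumptions here, not something the original post specifies):
using System;
using System.IO;
using System.Security.Cryptography;

static bool UploadLooksComplete(string filePath)
{
    string expected = File.ReadAllText(filePath + ".md5").Trim().ToLowerInvariant();
    using (var md5 = MD5.Create())
    using (var stream = File.OpenRead(filePath))
    {
        string actual = BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "").ToLowerInvariant();
        return actual == expected;   // a mismatch means an incomplete or corrupted upload
    }
}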
There is also this, but it's Perl and from Experts Exchange (ick).
