This may sound like a silly question, but I just wanted to clear something up. I've zipped a folder up and added it to my SVN repository. Is doing this OK, or should I upload the unzipped folder instead?
I just need to be sure!
If you are going to change the contents of the directory, then you should store it unzipped. Keeping it in a zip file will exhaust storage on the server much faster, as if you were storing every version of the zip as a separate file on the server.
The zip format has one cool property: every file inside the archive occupies its own segment of bytes and is compressed/decompressed independently of all the other files. As a result, if you have a 100 MB zip and modify two files inside, each 1 MB in size, then the new zip will contain at most 2 MB of new data; the remaining 98 MB will most likely be byte-exact copies of pieces of the old zip. So it is in theory possible to represent small in-zip changes as small deltas. But there are many problems in practice.
First of all, you must be sure that you don't recompress the unchanged files. If you build both the first zip and the second zip from scratch using different programs, program versions, compression settings, etc., you can get slightly different compression of the unchanged files. As a result, the actual bytes in the zip file will differ greatly, and any hope for a small delta will be lost. The better approach is to take the first zip and add/remove files in it.
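For illustration, here is a minimal sketch of that approach with .NET's built-in ZipArchive (the paths and entry names are placeholders, and whether a given tool really leaves untouched entries byte-identical is worth verifying before relying on it):

using System.IO.Compression;

// Update the existing archive in place instead of rebuilding it from scratch,
// so the entries you do not touch keep their exact bytes and the next revision
// has a chance of being stored as a small delta.
using (ZipArchive archive = ZipFile.Open("project.zip", ZipArchiveMode.Update))
{
    archive.GetEntry("docs/readme.txt")?.Delete();                           // remove the stale copy
    archive.CreateEntryFromFile(@"C:\work\docs\readme.txt", "docs/readme.txt");
}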
The main problem, however, is how SVN stores deltas. As far as I know, SVN uses the xdelta algorithm for computing deltas. This algorithm is perfectly capable of detecting equal blocks inside a zip file if given unlimited memory. The problem is that SVN uses a memory-limited version with a window size of 100 KB. Even if you simply remove a segment longer than 100 KB from a file, SVN's delta computation will break on it, and the rest of the file will simply be copied into the delta. Most likely, the delta will take as much space as the whole file.
I have a C# program that generates a bunch of short (10 seconds or so) video files. These are stored in an Azure file storage blob. I want the user to be able to download these files at a later date as a zip. However, it would take a substantial amount of memory to load the entire collection of video files into memory to create the zip. I was wondering if it is possible to pull data from a stream into memory, zip-encode it, output it to another stream, and dispose of it before moving on to the next segment of data.
Let's say the user has generated 100 videos of 10 MB each. If possible, this would allow me to send the zip to the user without first loading the entire 1 GB of footage into memory (or storing the entire zip in memory after the fact).
The individual videos are pretty small, so if I need to load an entire file into memory at a time, that is fine, as long as I can remove it from memory after it has been encoded and transmitted, before moving on to the next file.
Yes, it is certainly possible to stream in the files, without requiring any of them to be entirely in memory at any one time, and to compress, stream out, and transmit a zip file containing them, without holding the entire zip file in either memory or mass storage. The zip format is designed to be streamable. However, I am not aware of a library that will do that for you.
ZipFile would require saving the entire zip file before transmitting it. If you're ok with saving the zip file in mass storage (not memory) before transmitting, then use ZipFile.
To write your own zip streamer, you would need to generate the zip file format manually. The zip format is documented here. You can use DeflateStream to do the actual compression and Crc32 to compute the CRC-32s. You would transmit the local header before each file's compressed data, followed by a data descriptor after each. You would save the local header information in memory as you go along, and then transmit the central directory and end record after all of the local entries.
zip is a relatively straightforward format, so while it would take a little bit of work, it is definitely doable.
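To make that concrete, here is a rough, untested sketch of that layout in C#, using DeflateStream for the compression and the Crc32 type from the System.IO.Hashing NuGet package (an assumption; any CRC-32 implementation will do). It buffers one file's compressed bytes at a time, which the question says is acceptable, and it leaves out zip64 support, real timestamps, error handling, and non-ASCII names:

using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.IO.Hashing;   // from the "System.IO.Hashing" NuGet package (assumed)
using System.Text;

// Sketch of the layout described above: a local header, the compressed data and a
// data descriptor per file, then the central directory and the end record.
class StreamingZip
{
    readonly Stream _out;
    long _offset;                                        // bytes written to _out so far
    readonly List<(byte[] Name, uint Crc, uint Comp, long Size, long Hdr)> _dir = new();

    public StreamingZip(Stream output) => _out = output;

    public void AddFile(string name, Stream source)
    {
        byte[] nameBytes = Encoding.ASCII.GetBytes(name);

        // Deflate this one file into memory while computing its CRC-32 and raw length.
        var crc = new Crc32();
        long size = 0;
        var compressed = new MemoryStream();
        using (var deflate = new DeflateStream(compressed, CompressionLevel.Optimal, leaveOpen: true))
        {
            var buffer = new byte[81920];
            int n;
            while ((n = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                crc.Append(buffer.AsSpan(0, n));
                size += n;
                deflate.Write(buffer, 0, n);
            }
        }
        uint crc32 = BitConverter.ToUInt32(crc.GetCurrentHash(), 0);   // hash bytes are little-endian

        long headerOffset = _offset;
        var w = NewWriter(out MemoryStream header);      // local file header
        w.Write(0x04034b50u);                            // signature
        w.Write((ushort)20);                             // version needed to extract
        w.Write((ushort)0x0008);                         // bit 3: CRC/sizes follow in a data descriptor
        w.Write((ushort)8);                              // method 8 = deflate
        w.Write((ushort)0); w.Write((ushort)0x21);       // dummy DOS time/date (1980-01-01)
        w.Write(0u); w.Write(0u); w.Write(0u);           // CRC, compressed size, uncompressed size deferred
        w.Write((ushort)nameBytes.Length); w.Write((ushort)0);
        w.Write(nameBytes);
        Emit(header);

        compressed.Position = 0;                         // the compressed data itself
        compressed.CopyTo(_out);
        _offset += compressed.Length;

        w = NewWriter(out MemoryStream dd);              // data descriptor
        w.Write(0x08074b50u);
        w.Write(crc32); w.Write((uint)compressed.Length); w.Write((uint)size);
        Emit(dd);

        _dir.Add((nameBytes, crc32, (uint)compressed.Length, size, headerOffset));
    }

    public void Finish()
    {
        long cdStart = _offset;
        foreach (var e in _dir)
        {
            var w = NewWriter(out MemoryStream cd);      // central directory entry
            w.Write(0x02014b50u);
            w.Write((ushort)20); w.Write((ushort)20); w.Write((ushort)0x0008); w.Write((ushort)8);
            w.Write((ushort)0); w.Write((ushort)0x21);
            w.Write(e.Crc); w.Write(e.Comp); w.Write((uint)e.Size);
            w.Write((ushort)e.Name.Length); w.Write((ushort)0); w.Write((ushort)0);
            w.Write((ushort)0); w.Write((ushort)0); w.Write(0u);    // disk, internal/external attributes
            w.Write((uint)e.Hdr);                        // offset of this entry's local header
            w.Write(e.Name);
            Emit(cd);
        }
        var ew = NewWriter(out MemoryStream eocd);       // end of central directory record
        ew.Write(0x06054b50u);
        ew.Write((ushort)0); ew.Write((ushort)0);
        ew.Write((ushort)_dir.Count); ew.Write((ushort)_dir.Count);
        ew.Write((uint)(_offset - cdStart)); ew.Write((uint)cdStart);
        ew.Write((ushort)0);
        Emit(eocd);
    }

    BinaryWriter NewWriter(out MemoryStream ms) { ms = new MemoryStream(); return new BinaryWriter(ms); }
    void Emit(MemoryStream ms) { ms.WriteTo(_out); _offset += ms.Length; }
}

A caller would wrap the HTTP response body in this writer, call AddFile once per video pulled from blob storage, then call Finish; nothing larger than one compressed video is held in memory at a time.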
Good day. I've created my own custom wizard installer for my website project. My goal is to minimize the work during installation for our clients.
I'm trying to extract a 7z file that has millions of tiny files (about 200 bits each) inside. I'm using sharpcompress for the extraction, but it seems that it will take hours to finish the task, which is very bad for the user.
I don't care about compression. What I need is to reduce the time of the extraction process for these millions of tiny files, or, if possible, to speed up the extraction in some other way.
My question is: what is the fastest way to extract millions of tiny files? Or is there any method to pack and unpack the files that gives the highest unpacking speed?
I'm extracting the 7z file with this code:
using (SevenZipArchive zipArchive = SevenZipArchive.Open(source7z))
{
    zipArchive.WriteToDirectory(destination7z,
        new ExtractionOptions { Overwrite = true, ExtractFullPath = true });
}
But the extraction time seems to be very slow for tiny files.
I'm working on a web site which will host thousands of user-uploaded images in the formats .png, .jpeg, and .gif.
Since there will be such a huge number of images, saving just a few KB of space per file will in the end mean quite a lot for the total storage requirements.
My first thought was to enable Windows folder compression on the folder that the files are stored in (using a Windows/IIS server). On a total of 1 GB of data, the total space saved by this was ~200 KB.
This seems like a poor result to me. I therefore went to check whether the Windows folder compression could be tweaked, but according to this post it can't be: NTFS compressed folders
My next thought was that I could use a library such as Seven Zip Sharp to compress the files individually as I save them. But before doing that, I went to test a few different compression programs on a few images.
The results on a 7 MB .gif were:
7z, compress to .7z = 1 KB space saved
7z, compress to .zip = 2 KB space INCREASE
Windows, native zip = 4 KB space saved
So this leaves me with two thoughts: either the zipping programs I'm using aren't very good, or images are pretty much already compressed as far as they can be (and I'm surprised that Windows' built-in compression is better than 7z).
So my question is, is there any way to decrease the filesize of an image archive consisting of the image formats listed above?
the zipping programs I'm using suck, or images are pretty much already compressed as far as they can be
Most common image formats (PNG, JPEG, etc.) are already compressed. Compressing a file twice will almost never yield any positive result; most likely it will only increase the file size.
So my question is, is there any way to decrease the filesize of an image archive consisting of the image formats listed above?
No, not likely. Compressed files might have at most a little more to give, but you have to work on the images themselves, not the compression algorithm. Some good options are available in Robert Levy's answer. A tool I have used to strip out metadata is PNGOUT.
Most users will likely be uploading files that already have a basic level of compression applied to them, so that's why you aren't seeing a ton of benefit. Some users may be uploading uncompressed files, though, in which case your attempts would make a difference.
That said, image compression should be thought of as a field distinct from normal file compression. Normal file compression techniques are "lossless", ensuring that every bit of the file is restored when the file is uncompressed; images (and other media) can be compressed in "lossy" ways without degrading the file to an unacceptable level.
There are specialized tools which you can use to do things like strip out metadata, apply a slight blur, perform sampling, reduce quality, reduce dimensions, etc. Have a look at the answer here for a good example: Recommendation for compressing JPG files with ImageMagick. The top answer took the example file from 264 KB to 170 KB.
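As a rough illustration of that kind of processing (a hedged sketch using the Magick.NET wrapper for ImageMagick; the file names and the quality value are placeholders, not recommendations):

using ImageMagick;   // Magick.NET NuGet package (assumed)

// Recompress an uploaded JPEG: strip metadata and lower the quality a notch.
// This is lossy, so verify an acceptable quality level against real uploads.
using (var image = new MagickImage("upload.jpg"))
{
    image.Strip();          // drop EXIF and other profile metadata
    image.Quality = 85;     // placeholder value
    image.Write("upload-optimized.jpg");
}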
I implemented a RAMDisk in my C# application, and everything is going great, except that I need to back up its contents regularly because it is volatile. I battled with AlphaVSS for Shadow Copy backups for a week, and then someone informed me that VSS does not work on a RAMDisk.
The contents located on the RAMDisk (world files for Minecraft) are very small, but there can be hundreds of them. The majority are .dat files only a few hundred bytes in size, and there are other files that are 2-8 MB each.
I posted about this yesterday Here, and the suggested solution was to use a FileStream and save the data out of it. I have since read on another Stack Overflow question that this is a horrible idea for binary data, so I am looking for a better approach to back up all of these little files, some of which might be in use.
I suggest you first zip all the small files together, then back the zip up to another location.
ref:
zip library: http://www.icsharpcode.net/opensource/sharpziplib/
use System.IO.File.Copy to copy the packed zip.
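A minimal sketch of that idea, using the framework's built-in System.IO.Compression rather than the linked SharpZipLib (the paths are placeholders, and files still being written by the server may need retry or skip handling):

using System;
using System.IO;
using System.IO.Compression;

// Pack the RAMDisk folder into one zip in a staging location, then copy that
// single file to the backup target with File.Copy.
string ramDisk    = @"R:\worlds";                                   // placeholder path
string stagingZip = Path.Combine(Path.GetTempPath(), "worlds-backup.zip");
string backupZip  = @"D:\backups\worlds-" + DateTime.Now.ToString("yyyyMMdd-HHmmss") + ".zip";

if (File.Exists(stagingZip))
    File.Delete(stagingZip);
ZipFile.CreateFromDirectory(ramDisk, stagingZip, CompressionLevel.Fastest, includeBaseDirectory: false);
File.Copy(stagingZip, backupZip, overwrite: true);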
Is it possible to read the contents of a .ZIP file without fully downloading it?
I'm building a crawler and I'd rather not have to download every zip file just to index their contents.
Thanks;
The tricky part is identifying the start of the central directory, which sits at the end of the file. Since each entry is the same fixed size, you can do a kind of binary search starting from the end of the file. The binary search is trying to guess how many entries are in the central directory. Start with some reasonable value, N, and retrieve the portion of the file at end - (N * sizeof(DirectoryEntry)). If that file position does not start with the central directory entry signature, then N is too large: halve it and repeat. Otherwise, N is too small: double it and repeat. Like binary search, the process maintains the current upper and lower bounds. When the two become equal, you've found the value of N, the number of entries.
The number of times you hit the web server is at most 16, since there can be no more than 64K entries.
Whether this is more efficient than downloading the whole file depends on the file size. You might request the size of the resource before downloading and, if it's smaller than a given threshold, download the entire resource. For large resources, requesting multiple offsets will be quicker, and overall less taxing for the web server, if the threshold is set high.
HTTP/1.1 allows ranges of a resource to be downloaded. For HTTP/1.0 you have no choice but to download the whole file.
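To make the range idea concrete, here is a hedged sketch in C# of fetching just the tail of a remote zip over HTTP/1.1 (the URL and the 64 KB guess are placeholders; the server has to honor Range requests and reply with 206 Partial Content):

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

class RemoteZipProbe
{
    static async Task Main()
    {
        var http = new HttpClient();
        var url = "https://example.com/archive.zip";     // placeholder URL

        // Ask only for the tail of the resource, where the end-of-central-directory
        // record (signature 0x06054b50) and the central directory live.
        var request = new HttpRequestMessage(HttpMethod.Get, url);
        request.Headers.Range = new RangeHeaderValue(null, 64 * 1024);   // "last 64 KB", a guess

        using var response = await http.SendAsync(request);
        byte[] tail = await response.Content.ReadAsByteArrayAsync();

        // Scan backwards for the end-of-central-directory signature; the record then
        // tells you the central directory's offset and size for a second range request.
        for (int i = tail.Length - 22; i >= 0; i--)
        {
            if (BitConverter.ToUInt32(tail, i) == 0x06054b50u)
            {
                Console.WriteLine($"Found end record {tail.Length - i} bytes from the end.");
                break;
            }
        }
    }
}

From the end record you can read where the central directory starts and how big it is, fetch just that slice with a second range request, and index the entry names without ever downloading the file data.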
The format suggests that the key piece of information about what's in the file resides at the end of it. Entries are then specified as an offset from that particular entry, so you'll need to have access to the whole thing, I believe.
The gzip format can be read as a stream, I believe.
I don't know if this helps, as I'm not a programmer, but in Outlook you can preview zip files and see the actual content, not just the file directory (if they are previewable documents, like a PDF).
There is a solution implemented in ArchView
"ArchView can open archive file online without downloading the whole archive."
https://addons.mozilla.org/en-US/firefox/addon/5028/
Inside the archview-0.7.1.xpi, in the file "archview.js", you can look at their JavaScript approach.
It's possible. All you need is a server that allows reading bytes in ranges: fetch the end record (to know the size of the central directory), fetch the central directory (to know where each file starts and ends), and then fetch the proper bytes and handle them.
Here is an implementation in Python: onlinezip
[Full disclosure: I'm the author of the library.]