I conceived this idea to merge an arbitrary number of small text files into a single gzip file using the GZipStream class. I spent several nights making it work, but the outcome is that the final file ended up being bigger than if the text files had simply been concatenated together. I vaguely know how Huffman coding works, so I don't know whether this is practical, or whether there's a better alternative. Ultimately, I want an external sorted index file to map out each blob for fast access. What do you think?
// keep track of the current byte offset in the merged file
long indexByteOffset = 0;
// in reality the blobs vary in size from 1k to 300k bytes
string[] originalData = { "data blob1", "data blob2", "data blob3", "data blob4" /* etc etc etc */ };
// merged compressed file
using (BinaryWriter zipWriter = new BinaryWriter(File.Create(@"c:\temp\merged.gz")))
// keep track of beginning position and size of each blob
using (StreamWriter indexWriter = new StreamWriter(File.Create(@"c:\temp\index.txt")))
{
    foreach (var blob in originalData)
    {
        using (MemoryStream ms = new MemoryStream())
        {
            using (GZipStream zipper = new GZipStream(ms, CompressionMode.Compress))
            {
                byte[] encodeBuffer = Encoding.UTF8.GetBytes(blob);
                zipper.Write(encodeBuffer, 0, encodeBuffer.Length);
            }
            byte[] compressedData = ms.ToArray();
            zipWriter.Write(compressedData);
            zipWriter.Seek(0, SeekOrigin.End);
            // record start and end offset of this blob, tab-separated ("\t" as a string, not a char)
            indexWriter.WriteLine(indexByteOffset + "\t" + (indexByteOffset + compressedData.Length));
            indexByteOffset += compressedData.Length;
        }
    }
}
Different data can compress with different effectiveness. Small data usually isn't worth trying to compress. One common approach is to allow for an "is it compressed?" flag - do a speculative compress, but if it is larger store the original. That information could be included in the index. Personally, though, I'd probably be tempted to go for a single file - either a .zip, or just including the length of each fragment as a 4-byte chunk (or maybe a "varint") before each - then seeking to the n-th fragment is just a case of "read length prefix, decode as int, seek that many bytes, repeat". You could also reserve one bit of that for "is it compressed".
But as for "is it worth compression": that depends on your data.
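For what it's worth, here is a minimal sketch of that length-prefix-with-flag idea (the WriteFragment helper and the exact bit layout are illustrative assumptions, not a standard format; it assumes the usual System, System.IO and System.IO.Compression usings):
static void WriteFragment(Stream output, byte[] data)
{
    // speculative compress: gzip the fragment, but keep the original if gzip made it bigger
    byte[] candidate;
    using (var ms = new MemoryStream())
    {
        using (var gz = new GZipStream(ms, CompressionMode.Compress))
            gz.Write(data, 0, data.Length);
        candidate = ms.ToArray();
    }
    bool compressed = candidate.Length < data.Length;
    byte[] payload = compressed ? candidate : data;
    // 4-byte header: payload length shifted left by one, lowest bit = "is it compressed"
    int header = (payload.Length << 1) | (compressed ? 1 : 0);
    output.Write(BitConverter.GetBytes(header), 0, 4);
    output.Write(payload, 0, payload.Length);
}
Reading the n-th fragment is then just: read 4 bytes, decode the header, and either skip the payload or decompress and return it.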
Related
I have homework from my university teacher. I have to write code which will encrypt/decrypt a small part of a big file (about 10 GB). I use the Salsa20 algorithm.
The main constraint is not to load everything into RAM. As he said, I should read, for example, 100 lines, then encrypt/decrypt them, write them to the file, and repeat.
I create a List:
List<string> dict = new List<string>();
Read lines (because reading all the bytes at once uses a lot of RAM):
using (StreamReader sReader = new StreamReader(filePath))
{
    string line;
    while (dict.Count < 100 && (line = sReader.ReadLine()) != null)
    {
        dict.Add(line);
    }
}
Try to create one line from the list:
string words = string.Join("", dict.ToArray());
Encrypt this line
string encrypted;
using (var salsa = new Salsa20.Salsa20())
using (var mstream_out = new MemoryStream())
{
salsa.Key = key;
salsa.IV = iv;
using (var cstream = new CryptoStream(mstream_out,
salsa.CreateEncryptor(), CryptoStreamMode.Write))
{
var bytes = Encoding.UTF8.GetBytes(words);
cstream.Write(bytes, 0, bytes.Length);
}
encrypted = Encoding.UTF8.GetString(mstream_out.ToArray());
}
Then I need to write the encrypted text back as 100 lines, but I don't know how to do it. Is there any solution?
OK, so here's what you could do.
1. Accept a filename, a starting line number and an ending line number.
2. Read the lines, simply writing them to another file if they are lower than the starting line number or higher than the ending line number.
3. Once you reach a line that is in the range, encrypt it with the key and an IV. You will probably need to encode it to a byte array first, e.g. using UTF-8, as modern ciphers such as Salsa20 operate on bytes, not text.
4. You can possibly use the line number as the nonce/IV for your stream cipher, if you don't expect the number of lines to change. Otherwise you can prefix the ciphertext with a large, fixed-size, random nonce.
5. The ciphertext - possibly including the nonce - can be encoded as base64 without line endings. Then you write the base64 line to the other file.
6. Keep encrypting lines until you reach the end index. It is up to you whether your ending line is inclusive or exclusive.
7. Now read the remaining lines, and write them to the other file.
8. Don't forget to finalize the encryption and close the files. You may also want to destroy the source input file.
Encrypting raw bytes would be easier, as you could then write back into the original file in place. Writing encrypted strings, however, will almost always expand the ciphertext compared with the plaintext, so you need to copy to a new file: the file would have to grow from the middle out.
I haven't got a clue why you would keep a list or dictionary in memory. If that's part of the requirements then I don't see it in the rest of the question. If you read in all the lines of a file that way then clearly you're using up memory.
Of course, if your 10 GB file is just a single line then you're still using way too much memory. In that case you need to stream everything: parsing text from the file, putting it in a character buffer, encoding the characters to bytes, encrypting them, encoding the result as base64 and writing it to the other file. Certainly doable, but tricky if you've never done such things.
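A minimal sketch of the procedure above, reusing the third-party Salsa20.Salsa20 class from the question (key setup omitted). Deriving the 8-byte IV from the line number is the assumption from point 4, and the exact IV length your library expects may differ:
static void EncryptLines(string inputPath, string outputPath, byte[] key, int startLine, int endLine)
{
    using (var reader = new StreamReader(inputPath))
    using (var writer = new StreamWriter(outputPath))
    using (var salsa = new Salsa20.Salsa20())
    {
        salsa.Key = key;
        string line;
        int lineNumber = 0;
        while ((line = reader.ReadLine()) != null)
        {
            lineNumber++;
            if (lineNumber < startLine || lineNumber > endLine)
            {
                writer.WriteLine(line);                          // pass through unchanged
                continue;
            }
            salsa.IV = BitConverter.GetBytes((long)lineNumber);  // line number as nonce (assumption)
            using (var encryptor = salsa.CreateEncryptor())
            {
                byte[] plain = Encoding.UTF8.GetBytes(line);
                byte[] cipher = encryptor.TransformFinalBlock(plain, 0, plain.Length);
                writer.WriteLine(Convert.ToBase64String(cipher)); // one base64 line per plaintext line
            }
        }
    }
}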
I have two binary files, "bigFile.bin" and "smallFile.bin".
The "bigFile.bin" contains "smallFile.bin".
Opening it in beyond compare confirms that.
I want to extract the smaller file from the bigger one into a "result.bin" that equals "smallFile.bin".
I have two keywords: one for the start position ("Section") and one for the end position ("Man").
I tried the following:
byte[] bigFile = File.ReadAllBytes("bigFile.bin");
UTF8Encoding enc = new UTF8Encoding();
string text = enc.GetString(bigFile);
int startIndex = text.IndexOf("Section");
int endIndex = text.IndexOf("Man");
string smallFile = text.Substring(startIndex, endIndex - startIndex);
File.WriteAllBytes("result.bin",enc.GetBytes(smallFile));
I tried to compare the result file with the origin small file in beyond compare, which shows hex representation comparison.
Most of the bytes are equal, but some are not.
For example, in the new file I have 84, but in the old file I have the sequence EF BF BD instead.
What can cause those differences? Where am I mistaken?
Since you are working with binary files, you should not use text-related functionality (which includes encodings etc). Work with byte-related methods instead.
Your current code could be converted to work by making it into something like this:
byte[] bigFile = File.ReadAllBytes("bigFile.bin");
int startIndex = /* assume we somehow know this */
int endIndex = /* assume we somehow know this */
var length = endIndex - startIndex;
var smallFile = new byte[length];
Array.Copy(bigFile, startIndex, smallFile, 0, length);
File.WriteAllBytes("result.bin", smallFile);
To find startIndex and endIndex you could even use your previous technique, but searching the byte array for the marker byte sequences directly would be more appropriate.
However this would still be problematic because:
Stuffing both binary data and "text" into the same file is going to complicate matters
There is still a lot of unnecessary copying going on here; you should work with your input as a Stream rather than an array of bytes
Even worse than the unnecessary copying, any non-stream solution would either need to load all of your input file in memory as happens above (wasteful), or be exceedingly complicated to code
So, what to do?
Don't read file contents in memory as byte arrays. Work with FileStream instead.
Wrap a StreamReader around the FileStream and use it to find the markers for the start and end indexes. Even better, change your file format so that you don't need to search for text.
After you know startIndex and length, use stream functions to seek to the relevant part of your input stream and copy length bytes to the output stream.
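A minimal sketch of that last step, assuming startIndex and length have already been determined somehow (the method name and buffer size are just illustrative):
static void ExtractSection(string bigPath, string resultPath, long startIndex, long length)
{
    using (var input = new FileStream(bigPath, FileMode.Open, FileAccess.Read))
    using (var output = new FileStream(resultPath, FileMode.Create, FileAccess.Write))
    {
        input.Seek(startIndex, SeekOrigin.Begin);
        var buffer = new byte[81920];
        long remaining = length;
        while (remaining > 0)
        {
            int read = input.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
            if (read == 0) break;                 // unexpected end of file
            output.Write(buffer, 0, read);
            remaining -= read;
        }
    }
}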
What's the best and the fastest method to remove an item from a binary file?
I have a binary file and I know that I need to remove B bytes starting at position A. How do I do it?
Thanks
You might want to consider working in batches to prevent allocations on the large object heap (LOH), but that depends on the size of your file and how frequently you call this logic.
long skipIndex = 100;   // position A where removal starts
int skipLength = 40;    // number of bytes B to remove
using (FileStream fileStream = File.Open("file.dat", FileMode.Open))
{
    int bufferSize;
    checked
    {
        bufferSize = (int)(fileStream.Length - (skipLength + skipIndex));
    }
    byte[] buffer = new byte[bufferSize];
    // read all data after the section being removed
    fileStream.Position = skipIndex + skipLength;
    int totalRead = 0;
    while (totalRead < bufferSize)
    {
        int read = fileStream.Read(buffer, totalRead, bufferSize - totalRead);
        if (read == 0) break;
        totalRead += read;
    }
    // write it back starting at the removal point
    fileStream.Position = skipIndex;
    fileStream.Write(buffer, 0, totalRead);
    fileStream.SetLength(fileStream.Position); // trim the file
}
Depends... There are a few ways to do this, depending on your requirements.
The basic solution is to read chunks of data from the source file into a target file, skipping over the bits that must be removed (is it always only one segment to remove, or multiple segments?). After you're done, delete the original file and rename the temp file to the original's name.
Things to keep in mind here are that you should tend towards larger chunks rather than smaller. The size of your files will determine a suitable value. 1MB is a good 'default'.
The simple approach assumes that deleting and renaming a new file is not a problem. If you have specific permissions attached to the file, or use NTFS alternate data streams or some such, this approach won't work.
In that case, make a copy of the original file. Then skip to the first byte after the segment to ignore in the copied file, skip to the start of the segment in the original file, and transfer bytes from the copy back to the original. If you're using Streams, you'll want to call Stream.SetLength to truncate the original to the correct size.
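A rough sketch of the first (copy to a temp file, delete and rename) approach described above, assuming a single segment [removeStart, removeStart + removeLength) has to go (the method names and temp-file convention are illustrative):
static void RemoveSegment(string path, long removeStart, long removeLength)
{
    string tempPath = path + ".tmp";
    using (var input = new FileStream(path, FileMode.Open, FileAccess.Read))
    using (var output = new FileStream(tempPath, FileMode.Create, FileAccess.Write))
    {
        var buffer = new byte[1024 * 1024];                              // ~1MB chunks, as suggested above
        CopyBytes(input, output, removeStart, buffer);                   // everything before the segment
        input.Seek(removeLength, SeekOrigin.Current);                    // skip the segment
        CopyBytes(input, output, input.Length - input.Position, buffer); // everything after it
    }
    File.Delete(path);
    File.Move(tempPath, path);
}

static void CopyBytes(Stream input, Stream output, long count, byte[] buffer)
{
    while (count > 0)
    {
        int read = input.Read(buffer, 0, (int)Math.Min(buffer.Length, count));
        if (read == 0) break;
        output.Write(buffer, 0, read);
        count -= read;
    }
}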
If you want to just rewrite the original file in place and remove a sequence from it, the best way is to "rearrange" the file.
The idea is:
for i = A+1 to file.length - B
file[i] = file[i+B]
For better performance it's best to read and write in chunks rather than single bytes. Test with different chunk sizes to see what works best for your target system.
I have a wmv file whose size is 300 bytes. I want to split it into several parts (for example, 150 bytes each, or 3 parts of 100 bytes). How do I implement this in C#?
It really depends on whether you want the resulting files to be playable or not. Splitting the file into chunks is easy: read it into a byte array, then have a for loop that copies CHUNK-sized parts of the array to separate files, without forgetting the final, possibly smaller, chunk. Splitting it into files that still play is a different matter.
I would try to just stream it without explicit splitting (the TCP stack will split it as it likes). If you have a good codec it will play anyway. (VLC can play videos while they are still downloading.)
The real answer is: just use a streaming server and forget about writing a streaming protocol yourself. That's crazy. To split a file into byte segments you could use something like the code below. Note that it's untested, but it should be about 95% complete.
You should take a look at the protocol spec if you haven't already: http://msdn.microsoft.com/en-us/library/cc251059(v=PROT.10).aspx. And if you have, and you still asked this question, you don't stand an ice cube's chance in hell of making it work.
int chunkSize = 300;
using (var file = File.Open(@"c:\file.wmv", FileMode.Open))
{
    // number of chunks, rounding up so the final partial chunk is included
    int numberOfChunks = (int)((file.Length + chunkSize - 1) / chunkSize);
    byte[][] fileBytes = new byte[numberOfChunks][];
    for (int i = 0; i < numberOfChunks; i++)
    {
        int bytesToRead = chunkSize;
        if (i == numberOfChunks - 1)
        {
            // the last chunk may be smaller than chunkSize
            bytesToRead = (int)(file.Length - (long)i * chunkSize);
        }
        fileBytes[i] = new byte[bytesToRead];
        // second argument is the offset into the buffer, not the file position
        file.Read(fileBytes[i], 0, bytesToRead);
    }
}
I have some code that is really slow. I knew it would be, and now it is. Basically, I am reading files from a bunch of directories. The file names change but the data does not. To determine whether I have already read a file, I hash its bytes and compare that to a list of hashes of already processed files. There are about 1000 files in each directory, and figuring out what's new in each directory takes a good minute or so (and then the processing starts). Here's the basic code:
public static class ProgramExtensions
{
public static byte[] ToSHA256Hash(this FileInfo file)
{
using (FileStream fs = new FileStream(file.FullName, FileMode.Open))
{
using (SHA256 hasher = new SHA256Managed())
{
return hasher.ComputeHash(fs);
}
}
}
public static string ToHexString(this byte[] p)
{
char[] c = new char[p.Length * 2 + 2];
byte b;
c[0] = '0'; c[1] = 'x';
for (int y = 0, x = 2; y < p.Length; ++y, ++x)
{
b = ((byte)(p[y] >> 4));
c[x] = (char)(b > 9 ? b + 0x37 : b + 0x30);
b = ((byte)(p[y] & 0xF));
c[++x] = (char)(b > 9 ? b + 0x37 : b + 0x30);
}
return new string(c);
}
}
class Program
{
static void Main(string[] args)
{
var allFiles = new DirectoryInfo("c:\\temp").GetFiles("*.*");
List<string> readFileHashes = GetReadFileHashes();
List<FileInfo> filesToRead = new List<FileInfo>();
foreach (var file in allFiles)
{
    // keep only the files whose hash has NOT been seen before
    if (!readFileHashes.Contains(file.ToSHA256Hash().ToHexString()))
        filesToRead.Add(file);
}
//read new files
}
}
Is there any way I can speed this up?
I believe you can achieve the most significant performance improvement by simply checking the file size first: if the file size does not match, you can skip the entire file and not even open it.
Instead of just saving a list of known hashes, you would also keep a list of known file sizes and only do a content comparison when the file sizes match. When the file size doesn't match, you save yourself from even looking at the file content.
Depending on how large your files generally are, a further improvement can be worthwhile:
Either do a binary compare with early abort as soon as the first differing byte is found. That saves reading the entire file, which can be a very significant improvement if your files are generally large: any hash algorithm has to read the whole file, whereas detecting that the first byte differs saves you from reading the rest. If your lookup list is likely to contain many files of the same size, so that you would have to do a binary comparison against several files, consider instead:
Hashing in blocks of, say, 1 MB each. First check only the first block against the precalculated first-block hash in your lookup; only compare the second block if the first block matches. This saves reading beyond the first block in most cases for files that differ. Both options are only really worth the effort when your files are large.
I doubt that changing the hashing algorithm itself (e.g. first doing a CRC check as suggested) would make any significant difference. Your bottleneck is likely disk I/O, not CPU, so avoiding disk I/O is what will give you the biggest improvement. But as always with performance, do measure.
Then, if this is still not enough (and only then), experiment with asynchronous I/O (remember though that sequential reads are generally faster than random access, so too much random asynchronous reading can hurt your performance).
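A minimal sketch of the size-check-first idea, keyed on file size (knownFiles and IsKnown are hypothetical names; ToSHA256Hash and ToHexString are the question's own extension methods):
static bool IsKnown(FileInfo file, Dictionary<long, HashSet<string>> knownFiles)
{
    HashSet<string> hashesForSize;
    // cheap check first: if no known file has this exact size, skip hashing entirely
    if (!knownFiles.TryGetValue(file.Length, out hashesForSize))
        return false;
    // only now pay for reading and hashing the whole file
    return hashesForSize.Contains(file.ToSHA256Hash().ToHexString());
}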
1. Create a file list
2. Sort the list by file size
3. Eliminate files with unique sizes from the list
4. Now do the hashing (a fast hash first might improve performance as well)
Use a data structure for your readFileHashes store that has efficient lookup (a hash set or a sorted set with binary search). HashSet<string> would serve you better here than a List<string>.
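For example, reusing the question's GetReadFileHashes and extension methods (sketch only):
// O(1) average lookups instead of the O(n) List<string>.Contains used in the question
HashSet<string> readFileHashes = new HashSet<string>(GetReadFileHashes());
bool alreadyProcessed = readFileHashes.Contains(file.ToSHA256Hash().ToHexString());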
Use an appropriate checksum (hash) function. SHA256 is a cryptographic hash and is probably overkill. CRC is much less computationally expensive; it was originally intended for catching unintentional/random changes (transmission errors), but it is susceptible to changes that are designed/intended to be hidden. What fits the differences between the files you are scanning?
See http://en.wikipedia.org/wiki/List_of_checksum_algorithms#Computational_costs_of_CRCs_vs_Hashes
Would a really simple checksum via sampling (e.g. a checksum over the first 10 bytes and the last 10 bytes) work?
I'd do a quick CRC check first, as it is less expensive.
If the CRC does not match, continue with a more "reliable" hash test such as SHA.
Your description of the problem still isn't clear enough.
The biggest problem is that you are doing a bunch of hashing. This is guaranteed to be slow.
You might want to try checking the modification time instead, which does not change when a file is renamed:
http://msdn.microsoft.com/en-us/library/ms724320(VS.85,loband).aspx
Or you might want to monitor the folder for any new file changes:
http://www.codeguru.com/forum/showthread.php?t=436716
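For example (the path is hypothetical; the FileSystemWatcher part corresponds to the folder-monitoring link above):
// read the last write time of a file
DateTime modified = File.GetLastWriteTimeUtc(@"c:\temp\somefile.dat");

// or watch a folder for new/renamed files
var watcher = new FileSystemWatcher(@"c:\temp") { EnableRaisingEvents = true };
watcher.Created += (s, e) => Console.WriteLine("new file: " + e.FullPath);
watcher.Renamed += (s, e) => Console.WriteLine("renamed to: " + e.FullPath);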
First group the files by file size; this will leave you with smaller groups of files. What to do next depends on the group sizes and the file sizes. You could just start reading all files of a group in parallel until you find a difference. If there is a difference, split the group into smaller groups having the same value at the current position. If you have information about how the files typically differ, you can use it: start reading at the end, compare larger clusters rather than single bytes if whole sections tend to change, or whatever else you know about the files. This solution might introduce I/O performance problems if you have to read many files in parallel, causing random disk access.
You could also calculate hash values for all files in each group and compare them. You don't necessarily have to process the whole files at once; just calculate the hash of a few bytes (maybe a 4 KiB cluster, or whatever fits your file sizes) and check whether there are already differences. If not, calculate the hashes of the next few bytes. This lets you process larger blocks of each file without having to keep one such large block per file of a group in memory.
So it's all about a time-memory (disk I/O vs. memory) trade-off. You have to find your way between reading all the files of a group into memory and comparing them byte by byte (high memory requirement, fast sequential access, but may read too much data) and reading the files byte by byte and comparing only the last byte read (low memory requirement, slow random access, reads only the required data). Further, if the groups are very large, comparing the files byte by byte becomes slower - comparing one byte from n files is an O(n) operation - and it may become more efficient to calculate hash values first and then compare only the hash values.
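A small sketch of the block-wise idea: hash only the first block of each file and use that to split a same-size group further before doing any full comparison (BlockSize, the method name, and the use of MD5 as a readily available hash are assumptions for illustration):
const int BlockSize = 4096;

static string FirstBlockHash(string path)
{
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
    using (var md5 = MD5.Create())
    {
        byte[] block = new byte[BlockSize];
        int read = fs.Read(block, 0, block.Length);
        return BitConverter.ToString(md5.ComputeHash(block, 0, read));
    }
}

// usage: split a group of same-size files by their first-block hash
// var subGroups = sameSizeFiles.GroupBy(f => FirstBlockHash(f.FullName));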
Updated: definitely do NOT make file size your only check. If your OS version allows it, use FileInfo.LastWriteTime.
I've implemented something similar for an in-house project compiler/packager. We have over 8k files, so we store the last-modified dates and hash data in a SQL database. Then on subsequent runs we query first against the modified date of any specific file, and only then against the hash data; that way we only calculate new hash data for the files that appear to have been modified.
.NET has a way to check the last-modified date, in the FileInfo class; I suggest you check it out. EDIT: here is the link: LastWriteTime
Our packager takes about 20 seconds to find out which files have been modified.