String compression result as a string - c#

I founded following code on internet for string compression. When I compress a simple string, return value is very different.
For example, Compress("abc") returns "AwAAAB+LCAAAAAAABADtvQdgHEmWJSYvbcp7f0r1StfgdKEIgGATJNiQQBDswYjN5pLsHWlHIymrKoHKZVZlXWYWQMztnbz33nvvvffee++997o7nU4n99//P1xmZAFs9s5K2smeIYCqyB8/fnwfPyKyyfT/AcJBJDUDAAAA"
Can I take simple string result.
Thanks
using System.IO.Compression;
using System.Text;
using System.IO;
public static string Compress(string text)
{
byte[] buffer = Encoding.UTF8.GetBytes(text);
MemoryStream ms = new MemoryStream();
using (GZipStream zip = new GZipStream(ms, CompressionMode.Compress, true))
{
zip.Write(buffer, 0, buffer.Length);
}
ms.Position = 0;
MemoryStream outStream = new MemoryStream();
byte[] compressed = new byte[ms.Length];
ms.Read(compressed, 0, compressed.Length);
byte[] gzBuffer = new byte[compressed.Length + 4];
System.Buffer.BlockCopy(compressed, 0, gzBuffer, 4, compressed.Length);
System.Buffer.BlockCopy(BitConverter.GetBytes(buffer.Length), 0, gzBuffer, 0, 4);
return Convert.ToBase64String (gzBuffer);
}

Code you are using is intended for compress really large string. It compress source string by using GZip compression algorithm and then make it readable (or maybe usable / "passable") by using BASE64 encoding.
Base64 expand source string up to ~1.33 times large (8 bit symbol is encoded as 6 bit + 2 bit overflow for next symbol). So to make sense string have to be compressed at least to 70% from source length.
The result is expected and usual when using that encoding.
To answer your question please define what you mean by "simple string result"

sure, because the result is in base64 (see the last line in your code).

Compression doesn't always result in a smaller output for a few reasons:
The input might be completely random, in which case most compressions will not compress anything, but still need to save the decompression "instructions". The result of compressing such data is data + instructions...bigger.
The input has no features searched by the used compression algorithm. This is a very similar case to the previous one, except it is dependent on the compression algorithm used (in your case Gzip).
Very small input. The smaller the input, the less chance to find compressible segments in it, so there is a big chance you'll get pseudo-random input (not random, but so small it looks random), and we go back to the first case again.
Base64 is a big point here, yes, but just don't forget these small facts about compression in general.

Related

C# Out of Memory while handling and encrypting large files

Im trying to encrypt a large file (Camtasia.exe) with the AES encryption.
Now for some reason I get a "Out of Memory" Exception. Im really new to this and I don't know how I could possibly fix that. This is my code
I use this to call my encryption method.
bytes = File.ReadAllBytes("Camtasia.exe");
Cryptography.Encryption.EncryptAES(System.Text.Encoding.Default.GetString(bytes), encryptionKey);
This is the AES encryption itself
public static string EncryptAES(string content, string password)
{
byte[] bytes = Encoding.UTF8.GetBytes(content);
using (SymmetricAlgorithm crypt = Aes.Create())
using (HashAlgorithm hash = MD5.Create())
using (MemoryStream memoryStream = new MemoryStream())
{
crypt.Key = hash.ComputeHash(Encoding.UTF8.GetBytes(password));
// This is really only needed before you call CreateEncryptor the second time,
// since it starts out random. But it's here just to show it exists.
crypt.GenerateIV();
using (CryptoStream cryptoStream = new CryptoStream(
memoryStream, crypt.CreateEncryptor(), CryptoStreamMode.Write))
{
cryptoStream.Write(bytes, 0, bytes.Length);
}
string base64IV = Convert.ToBase64String(crypt.IV);
string base64Ciphertext = Convert.ToBase64String(memoryStream.ToArray());
return base64IV + "!" + base64Ciphertext;
}
}
Here is the error again that I get when calling the function "EncryptAES" at the top. I would be glad if someone could explain how this happens and how to solve it
https://imgur.com/xqcLsKW
You're reading the entire exe into memory, interpreting it as a UTF-16 string (??!), turning that back into UTF-8 bytes, and encrypting those. This converting to/from a string is horifically broken. An executable file is not a human-readable string, and even if it was, you're in a real muddle as to which encoding you're using. I think you can drop the whole string thing.
You're also reading the entire thing into memory (several times in fact, because of the whole string thing), which is wasteful. You don't need to do that: you can encrypt it bit-by-bit. To do this, use a Stream.
Something like this should work (untested): at least it gets the general concept across. We set up a series of streams which lets us read the data out of the input file bit-by-bit, and write them out to the output file bit-by-bit.
// The file we're reading from
using var inputStream = File.OpenRead("Camtasia.exe");
// The file we're writing to
using var outputStream = File.OpenWrite("EncryptedFile.txt");
using var HashAlgorithm hash = MD5.Create();
using var aes = Aes.Create();
aes.Key = hash.ComputeHash(Encoding.UTF8.GetBytes(password));
// Turn the IV into a base64 string, add "!", encode as UTF-8, and write to the file
string base64IV = Convert.ToBase64String(aes.IV) + "!";
byte[] base64IVBytes = Encoding.UTF8.GetBytes(base64IV);
outputStream.Write(base64IVBytes, 0, base64IVBytes.Length);
// Create a stream which, when we write bytes to it, turns those into
// base64 characters and writes them to outputStream
using var base64Stream = new CryptoStream(outputStream, new ToBase64Transform(), CryptoStreamMode.Write);
// Create a stream which, when we write bytes to it, encrypts them and sends them to
// base64Stream
using var encryptStream = new CryptoStream(base64Stream, aes.CreateEncryptor(), CryptoStreamMode.Write);
// Copy the entirety of our input file into encryptStream. This will encrypt them and
// push them into base64Stream, which will base64-encode them and push them into
// outputStream
inputStream.CopyTo(encryptStream);
Note, that using MD5 to derive key bytes isn't best practice. Use Rfc2898DeriveBytes.
Also note that you don't necessarily need to base64-encode the encrypted result before writing it to a file -- you can just write the encrypted bytes straight out. To go this, get rid of base64Stream, and tell the encryptStream to write straight to outputStream.

Compressing\decompressing binary files

I need to decompress the *.bin file into a certain temp-file, perform certain actions with it in the my program and then compress it. There is a function that unpacks the *.bin file. I.e., temp-file for the correct operation of the program was obtained. Function:
public void Unzlib_bin(string initial_location, string target_location)
{
byte[] data, new_data;
Inflater inflater;
inflater = new Inflater(false);
data = File.ReadAllBytes(initial_location);
inflater.SetInput(data, 16, BitConverter.ToInt32(data, 8));
new_data = new byte[BitConverter.ToInt32(data, 12)];
inflater.Inflate(new_data);
File.WriteAllBytes(target_location, new_data);
}
The question is how to pack the temp file to its original state? I'm trying to do something like the following, but the result is wrong:
public void Zlib_bin(byte[] data, int length, string target_location)
{
byte[] new_data;
Deflater deflater;
List<byte> compress_data;
compress_data = new List<byte>();
new_data = new byte[length];
deflater = new Deflater();
compress_data.AddRange(new byte[] { ... }); //8 bytes from header, it does not matter
deflater.SetLevel(Deflater.BEST_COMPRESSION);
deflater.SetInput(data, 0, data.Length);
deflater.Finish();
deflater.Deflate(new_data);
compress_data.AddRange(BitConverter.GetBytes(new_data.Length));
compress_data.AddRange(BitConverter.GetBytes(data.Length));
compress_data.AddRange(new_data);
File.WriteAllBytes(target_location, compress_data.ToArray());
}
Any ideas?
If by "the result is wrong" you mean that you are not able to compress back to exactly the same bytes, that is entirely to be expected. There is no guarantee that decompressing followed by compressing gives you the same thing, unless you are using exactly the same compression code, version of that code, and settings for that code.
There are many compressed data streams to represent the same uncompressed data, and compressors are free to use any of them, usually due to tradeoffs in execution time, memory used, and compressed ratio. That is why compressors have "levels" and other adjustments.
The only guarantee for a lossless compressor is that when you compress and then decompress that you get exactly what you started with.
If decompressing what you got gives the same uncompressed data, then all is well.

Base64 - CryptoStream with StreamWriter vs Convert.ToBase64String()

Following feedback from Alexei, a simplification of the question:
How do I use a buffered Stream approach to convert the contents of a CryptoStream (using ToBase64Transform) into a StreamWriter (Unicode encoding) without using Convert.ToBase64String()?
Note: Calling Convert.ToBase64String() throws OutOfMemoryException, hence the need for a buffered/Stream approach to the conversion.
You probably should implement custom Stream, not a TextWriter. It is much easier to compose streams than writers (like pass your stream to compressed stream).
To create custom stream - derive from Stream and implement at least Write and Flush (and Read if you need R/W stream). The rest is more or less optional and depends on you additional needs, regular copy to other stream does not need anything else.
In constructor get inner stream passed to you for writing to. Base64 is always producing ASCII characters, so it should be easy to write output as UTF-8 with or without BOM directly to a stream, but if you want to specify encoding you can wrap inner stream with StreamWriter internally.
In your Write implementation buffer data till you get enough bytes to have block of multiple of 3 bytes (i.e. 300) and call Convert.ToBase64String on that portion. Make sure not to loose not-yet-converted portion. Since Base64 converts 3 bytes to 4 characters converting in blocks of multiple of 3 size will never have =/== padding at the end and can be concatenated with next block. So write that converted portion into inner stream/writer. Note that you want to limit block size to something relatively small like 3*10000 to avoid allocation of your blocks on large objects heap.
In Flush make sure to convert the last unwritten bytes (this will be the only one with = padding at the end) and write it to the stream too.
For reading you may need to be more careful as in Base64 white spaces are allowed, so you can't read fixed number of characters and convert to bytes. The easiest approach would be to read by character from StreamReader and convert each 4 non-space ones to bytes.
Note: you can consider writing/reading Base64 by hand directly from bytes. It will give you some performance benefits, but may be hard if you are not good with bit shifting.
Please try using following to encrypt. I am using fileName/filePath as input. You can adjust it as per your requirement. Using this I have encrypted over 1 gb file successfully without any out of memory exception.
public bool EncryptUsingStream(string inputFileName, string outputFileName)
{
bool success = false;
// here assuming that you already have key
byte[] key = new byte[128];
SymmetricAlgorithm algorithm = SymmetricAlgorithm.Create();
algorithm.Key = key;
using (ICryptoTransform transform = algorithm.CreateEncryptor())
{
CryptoStream cs = null;
FileStream fsEncrypted = null;
try
{
using (FileStream fsInput = new FileStream(inputFileName, FileMode.Open, FileAccess.Read))
{
//First write IV
fsEncrypted = new FileStream(outputFileName, FileMode.Create, FileAccess.Write);
fsEncrypted.Write(algorithm.IV, 0, algorithm.IV.Length);
//then write using stream
cs = new CryptoStream(fsEncrypted, transform, CryptoStreamMode.Write);
int bytesRead;
int _bufferSize = 1048576; //buggersize = 1mb;
byte[] buffer = new byte[_bufferSize];
do
{
bytesRead = fsInput.Read(buffer, 0, _bufferSize);
cs.Write(buffer, 0, bytesRead);
} while (bytesRead > 0);
success = true;
}
}
catch (Exception ex)
{
//handle exception or throw.
}
finally
{
if (cs != null)
{
cs.Close();
((IDisposable)cs).Dispose();
if (fsEncrypted != null)
{
fsEncrypted.Close();
}
}
}
}
return success;
}

Raw Stream Has Data, Deflate Returns Zero Bytes

I'm reading data (an adCenter report, as it happens), which is supposed to be zipped. Reading the contents with an ordinary stream, I get a couple thousand bytes of gibberish, so this seems reasonable. So I feed the stream to DeflateStream.
First, it reports "Block length does not match with its complement." A brief search suggests that there is a two-byte prefix, and indeed if I call ReadByte() twice before opening DeflateStream, the exception goes away.
However, DeflateStream now returns nothing at all. I've spent most of the afternoon chasing leads on this, with no luck. Help me, StackOverflow, you're my only hope! Can anyone tell me what I'm missing?
Here's the code. Naturally I only enabled one of the two commented blocks at a time when testing.
_results = new List<string[]>();
using (Stream compressed = response.GetResponseStream())
{
// Skip the zlib prefix, which conflicts with the deflate specification
compressed.ReadByte(); compressed.ReadByte();
// Reports reading 3,000-odd bytes, followed by random characters
/*byte[] buffer = new byte[4096];
int bytesRead = compressed.Read(buffer, 0, 4096);
Console.WriteLine("Read {0} bytes.", bytesRead.ToString("#,##0"));
string content = Encoding.ASCII.GetString(buffer, 0, bytesRead);
Console.WriteLine(content);*/
using (DeflateStream decompressed = new DeflateStream(compressed, CompressionMode.Decompress))
{
// Reports reading 0 bytes, and no output
/*byte[] buffer = new byte[4096];
int bytesRead = decompressed.Read(buffer, 0, 4096);
Console.WriteLine("Read {0} bytes.", bytesRead.ToString("#,##0"));
string content = Encoding.ASCII.GetString(buffer, 0, bytesRead);
Console.WriteLine(content);*/
using (StreamReader reader = new StreamReader(decompressed))
while (reader.EndOfStream == false)
_results.Add(reader.ReadLine().Split('\t'));
}
}
As you can probably guess from the last line, the unzipped content should be TDT.
Just for fun, I tried decompressing with GZipStream, but it reports that the magic number is not correct. MS' docs just say "The downloaded report is compressed by using zip compression. You must unzip the report before you can use its contents."
Here's the code that finally worked. I had to save the content out to a file and read it back in. This does not seem reasonable, but for the small quantities of data I'm working with, it's acceptable, I'll take it!
WebRequest request = HttpWebRequest.Create(reportURL);
WebResponse response = request.GetResponse();
_results = new List<string[]>();
using (Stream compressed = response.GetResponseStream())
{
// Save the content to a temporary location
string zipFilePath = #"\\Server\Folder\adCenter\Temp.zip";
using (StreamWriter file = new StreamWriter(zipFilePath))
{
compressed.CopyTo(file.BaseStream);
file.Flush();
}
// Get the first file from the temporary zip
ZipFile zipFile = ZipFile.Read(zipFilePath);
if (zipFile.Entries.Count > 1) throw new ApplicationException("Found " + zipFile.Entries.Count.ToString("#,##0") + " entries in the report; expected 1.");
ZipEntry report = zipFile[0];
// Extract the data
using (MemoryStream decompressed = new MemoryStream())
{
report.Extract(decompressed);
decompressed.Position = 0; // Note that the stream does NOT start at the beginning
using (StreamReader reader = new StreamReader(decompressed))
while (reader.EndOfStream == false)
_results.Add(reader.ReadLine().Split('\t'));
}
}
You will find that DeflateStream is hugely limited in what data it will decompress. In fact if you are expecting entire files it will be of no use at all.
There are hundereds of (mostly small) variations of ZIP files and DeflateStream will get along only with two or three of them.
Best way is likely to use a dedicated library for reading Zip files/streams like DotNetZip or SharpZipLib (somewhat unmaintained).
You could write the stream to a file and try my tool Precomp on it. If you use it like this:
precomp -c- -v [name of input file]
any ZIP/gZip stream(s) inside the file will be detected and some verbose information will be reported (position and length of the stream). Additionally, if they can be decompressed and recompressed bit-to-bit identical, the output file will contain the decompressed stream(s).
Precomp detects ZIP/gZip (and some other) streams anywhere in the file, so you won't have to worry about header bytes or garbage at the beginning of the file.
If it doesn't detect a stream like this, try to add -slow, which detects deflate streams even if they don't have a ZIP/gZip header. If this fails, you can try -brute which even detects deflate streams that lack the two byte header, but this will be extremely slow and can cause false positives.
After that, you'll know if there is a (valid) deflate stream in the file and if so, the additional information should help you to decompress other reports correctly using zLib decompression routines or similar.

C++ zlib inflate failing - translation of c# fixup?

I'm trying to inflate a string using zlib's deflate, but it's failing, apparently because it doesn't have the right header. I read elsewhere that the C# solution to this problem is:
public static byte[] FlateDecode(byte[] inp, bool strict) {
MemoryStream stream = new MemoryStream(inp);
InflaterInputStream zip = new InflaterInputStream(stream);
MemoryStream outp = new MemoryStream();
byte[] b = new byte[strict ? 4092 : 1];
try {
int n;
while ((n = zip.Read(b, 0, b.Length)) > 0) {
outp.Write(b, 0, n);
}
zip.Close();
outp.Close();
return outp.ToArray();
}
catch {
if (strict)
return null;
return outp.ToArray();
}
}
But I know nothing about C#. I can surmise that all it's doing is adding a prefix to the string, but what that prefix is, I have no idea. Would someone be able to phrase this function (or even just the header creation and string concatenation) in C++?
The data which I'm trying to inflate is taken from a PDF using zlib deflation.
Thanks a million,
Wyatt
I've had better luck using SharpZipLib for zlib interop than with the native .Net Framework classes. This correctly handles streams from C++ (zlib native) and from Java's compression classes without any funny business being needed.
I can't see any prefixes, sorry. Here's what the logic appears to be; sorry this isn't in C++:
MemoryStream stream = new MemoryStream(inp);
InflaterInputStream zip = new InflaterInputStream(stream);
Create an inflate stream from the data passed
MemoryStream outp = new MemoryStream();
Create a memory buffer stream for output
byte[] b = new byte[strict ? 4092 : 1];
try {
int n;
while ((n = zip.Read(b, 0, b.Length)) > 0) {
If you're in strict mode, read up to 4092 bytes - or 1 in non-strict mode - into a byte buffer
outp.Write(b, 0, n);
Write all the bytes decoded (may be less than the 4092) to the output memory buffer stream
zip.Close();
outp.Close();
return outp.ToArray();
Clean up, and return the output memory buffer stream as an array.
I'm a bit confused, though: why not just cut array b off at n elements and return that rather than go via a MemoryStream? The code also ought really to take care to clean up the memory streams and zip on exception (e.g. using using) since they're all IDisposable but I guess that's not really important since they don't correspond to I/O file handles, only memory structures.

Categories

Resources