I am looking to implement the Rosetta Code LZSW Decompression method in C# and I need some help. The original code is available here: http://rosettacode.org/wiki/LZW_compression#C.23
I am only focusing on the Decompress method as I "simply" (if only) want to decompress .Z-files in my C# program in .NET 6.
I want my version to take a byte[] as input and return a byte[] (as I am reading the file with File.ReadAllBytes() and want to create a new file with the decompressed result).
My problem comes from the fact that in C#, chars are 16-bit (2 bytes), not 8-bit (1 byte). This really messes with my head, as that consequently (in my mind) means each char should be represented by two bytes. In the code at Rosetta Code, the initial dictionary only contains integer keys 0 -> 255, i.e. up to one byte, not two. Is this an error in their implementation? What do you think? And how would you go about converting this algorithm to a method with the signature byte[] Decompress(byte[])?
Thanks
No, there is no error. No, you don't need to convert the algorithm to work on 16-bit values. The usual lossless compression libraries operate on sequences of bytes. Your string of characters would first need to be converted to a sequence of bytes, e.g. to UTF-8, e.g. byte[] bs = Encoding.UTF8.GetBytes(str);. UTF-8 would be the right choice, since that encoding gives the compressor the best shot at compressing. In fact, just encoding UTF-16 to UTF-8 will almost always compress the strings, which makes it a good starting point. (In fact, using UTF-16 as the standard for character strings in .NET is a terrible choice for exactly this reason, but I digress.)
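For example, a minimal UTF-8 round trip with the standard Encoding APIs (the string here is arbitrary):

```csharp
using System;
using System.Text;

// Convert a .NET string (UTF-16 internally) to UTF-8 bytes, e.g. before
// compression, and back again after decompression.
string str = "héllo";
byte[] bs = Encoding.UTF8.GetBytes(str);      // 6 bytes: 'é' takes two
string back = Encoding.UTF8.GetString(bs);

Console.WriteLine(bs.Length);   // 6
Console.WriteLine(back == str); // True
```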
Any data you compress would first be serialized to a sequence of bytes for these compressors, if it isn't bytes already, in a way that permits reversing the transformation after decompression on the other end.
Since you are decompressing, someone encoded the characters into a sequence of bytes, so you need to first find out what they did. It may just be a sequence of ASCII characters, which are already one byte per character. Then you would use System.Text.Encoding.ASCII.GetString(bs); to make a character string out of it.
When compressing data we usually talk about sequences of symbols. A symbol in this context might be a byte, a character, or something completely different.
Your example obviously uses characters as its symbols, but there should not be any real problem replacing them with bytes. The more difficult part will be its use of strings to represent sequences of characters. You will need an equivalent representation of byte sequences that provides functionality like:
Concatenation/appending
Equality
GetHashCode (for performance)
Immutability (i.e. appending a byte should produce a new sequence, not modify the existing sequence)
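A minimal sketch of such a type — the name ByteSeq and the hash scheme are my own illustrative choices, not part of the posted example:

```csharp
using System;

sealed class ByteSeq : IEquatable<ByteSeq>
{
    private readonly byte[] _data;
    private readonly int _hash;

    public ByteSeq(byte[] data)
    {
        _data = (byte[])data.Clone();   // defensive copy => immutability
        _hash = ComputeHash(_data);     // cached for dictionary performance
    }

    public ByteSeq Append(byte b)       // appending yields a NEW sequence
    {
        var copy = new byte[_data.Length + 1];
        Array.Copy(_data, copy, _data.Length);
        copy[^1] = b;
        return new ByteSeq(copy);
    }

    public byte[] ToArray() => (byte[])_data.Clone();

    public bool Equals(ByteSeq other) =>
        other != null && _data.AsSpan().SequenceEqual(other._data);

    public override bool Equals(object obj) => Equals(obj as ByteSeq);
    public override int GetHashCode() => _hash;

    private static int ComputeHash(byte[] data)
    {
        unchecked
        {
            int h = 17;
            foreach (byte b in data) h = h * 31 + b;
            return h;
        }
    }
}
```

With something like this, the decompressor's dictionary becomes e.g. Dictionary<int, ByteSeq>, and entries are built with Append() instead of string concatenation.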
Note that LZW implementations have to agree on some specific things to be compatible, so implementing the posted example may or may not allow you to decode .Z files encoded with another implementation. If your goal is to decode actual files, you may have better luck asking on Software Recommendations for a preexisting decompression library.
Related
I need to implement DbDataReader.GetChars() for an ADO.NET provider, with the caveat that the data in the cell may not be UTF-16, in fact may be any one of a number of different encodings.
The implementation is specifically for 'long data', and the source data is on the server. The interface I have to the server (which cannot realistically be changed) is to request a range of bytes for the cell. The server does not interpret these bytes in any way, it is simply binary data for it.
I can special-case UTF-16LE and UTF-16BE with obvious implementations, but for other encodings, there is no direct way to translate the request "get me UTF-16 codeunits X to X + Y" into the request "get me bytes X' to X' + Y' in encoding Z".
Some 'requirements' that eliminate obvious implementations:
I do not wish to retrieve all data for a given cell to the client at any one time, unless it is necessary. The cells may be very large, and an application asking for a few kilobytes shouldn't cause hundreds of megabytes of memory to be allocated to satisfy the request.
I wish to support the random access exposed by GetChars() relatively efficiently. In the case of the first request asking for codeunits 1 billion to 1 billion + 10, I don't see any way of avoiding retrieving all data in the cell from the server up until the requested codeunits, but subsequently asking for codeunits 1 billion + 10 to 1 billion + 20, or even codeunits 999,999,000 to 1 billion, should not imply re-retrieving all that data.
I'm guessing that the great majority of applications won't actually access long-data cells 'randomly', but it would be nice to avoid horrible performance if one did, so if I can't find a relatively easy way to support it, I suppose I'll have to give it up.
My idea was to keep a mapping of #{UTF-16 code units} -> #{bytes of data in server encoding}, updating it as I retrieved data from the cell, and using it to find a 'close' place to start requesting data from the server (rather than retrieving from the beginning every time). On a side note, the lack of something similar to C++'s std::map::lower_bound in the .NET framework frustrates me quite a bit. Unfortunately, I found it very difficult to generate this mapping!
I've been trying to use the Decoder class, specifically Decoder.Convert(), to convert the data piecemeal, but I can't figure out how to reliably tell that a given number of bytes of the source data maps to exactly X UTF-16 codeunits, as the 'bytesUsed' parameter seems to include source bytes that were merely stashed into the object's internal state rather than output as chars. This causes me problems when trying to decode starting from, or ending in, the middle of a partial codepoint, giving me garbage.
So, my question is, is there some 'trick' I can use to accomplish this (figuring out the exact mapping of #bytes to #codeunits, without resorting to something like converting in a loop, decreasing the size of the source byte-by-byte)?
Do you know which encodings may be supplied by your server? I ask because some encodings are "stateful", which means that the meaning of a given byte may depend on the sequence of bytes that precede it. For instance (source), in the encoding standard ISO 2022-JP, the two bytes 0x24 0x2c may mean the Japanese Hiragana character 'GA' (が) or the two ASCII characters '$' and ',' according to the "shift state" -- the presence of a preceding control sequence. In several pre-Unicode "Shift-JIS" Japanese encodings, these shift states can appear anywhere in the string and will apply to all subsequent characters until a new shift control sequence is encountered. In the worst case, according to this site, "Often, character boundaries can be detected reliably only by reading the non-Unicode text linearly from the beginning".
Even the UTF-16 encoding used by C#, which is notionally stateless, is more complicated than is generally realized, due to the presence of surrogate pairs and combining characters. Surrogate pairs are pairs of chars that together specify a given character such as 𩸽; these are required because there are more than ushort.MaxValue Unicode code points. Combining characters are sequences of diacritical marks applied to preceding characters, such as in the string "Ĥ=T̂+V̂". And of course these can coexist, albeit unbeautifully: 𩸽̂ , which means that a single abstract UTF-16 "text element" can be made up of one or two "base" chars plus some number of diacriticals or other combining characters. All of these make up just one single character from the point of view of the user, and so should never be split or orphaned.
So the general algorithm would be, when you want to fetch N characters from the server starting at offset K, to fetch N+E starting at K-E for some "large enough" E, then scan backwards until the first text element boundary is found. Sadly, for UTF-16, Microsoft doesn't provide an API to do this directly; one would need to reverse-engineer their method
internal static int GetCurrentTextElementLen(String str, int index, int len, ref UnicodeCategory ucCurrent, ref int currentCharCount)
In StringInfo.cs.
A bit of nuisance, but doable.
For other, stateful, encodings, I would not know how to do this, and the logic of scanning backwards to find the first character boundary would be specific to each encoding. For encodings like those in the Shift-JIS family you may well need to scan back arbitrarily far.
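As an aside, StringInfo does expose a public forward iterator over text elements, which can stand in for the internal method quoted above when scanning forward from a known boundary:

```csharp
using System;
using System.Globalization;

// 'a' + combining circumflex (U+0302), then 'b': two text elements, three chars.
// GetTextElementEnumerator groups a base char with its combining marks.
string s = "a\u0302b";
var it = StringInfo.GetTextElementEnumerator(s);
while (it.MoveNext())
    Console.WriteLine(((string)it.Current).Length); // 2, then 1
```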
Not really an answer but way too long for a comment.
Update
You might try your algorithm for all single-byte encodings. There are 95 such encodings on my computer:
var singleByteEncodings = Encoding.GetEncodings().Where((enc) => enc.GetEncoding().IsSingleByte).ToList(); // 95 found.
var singleByteEncodingNames = Encoding.GetEncodings().Where((enc) => enc.GetEncoding().IsSingleByte).Select((enc) => enc.DisplayName).ToList(); // 95 names displayed.
Encoding.GetEncoding("iso-8859-1").IsSingleByte // returns true.
This might be useful in practice because a lot of older databases only support single-byte character encodings, or do not have multibyte characters enabled for their tables. The default character encoding for a SQL Server database is iso_1 a.k.a ISO 8859-1, for instance. But see this caution from a Microsoft blogger:
Use IsSingleByte() to try to figure out if an encoding is a single byte code page, however I'd really recommend that you don't make too many assumptions about encodings. Code that assumes a 1 to 1 relationship and then tries to seek or back up or something is likely to get confused, encodings aren't conducive to that kind of behavior. Fallbacks, decoders and encoders can change the byte count behavior for individual calls and encodings can sometimes do unexpected things.
I figured out how to deal with potentially losing conversion state: I keep a copy of the Decoder around in my mapping to use when restarting from the associated offset. This way I don't lose any partial codepoints it was keeping around in its internal buffers. This also lets me avoid adding encoding-specific code, and deals with potential problems with encodings such as Shift-JIS that dbc brought up.
Decoder is not cloneable, so I use serialization + deserialization to make the copy.
Short version
Is this an identity function?
f = (gₐ · hᵤ · gᵤ · hₐ)
where:
hₐ is the UTF-16 conversion from bytes to string,
gₐ is the UTF-16 conversion from string to bytes,
gᵤ is the Encoding.UTF8.GetBytes(),
hᵤ is the Encoding.UTF8.GetString().
Long version
I'm using WebSocket4Net to send and receive messages through WebSockets between a C# application and a C# service.
Since some messages are binary, I have to convert them from and to strings when interacting with the library: while its Send() method allows sending an array of bytes, its MessageReceived event communicates the received message as a string only.
To convert bytes to string and string to bytes, I follow the answer by Mehrdad where the internal encoding of .NET Framework, i.e. UTF-16, is used.
On the other hand, according to the code source (see for example DraftHybi10Processor.cs, line 114), WebSocket4Net uses UTF-8 to convert string to bytes and bytes to string.
Would it cause issues? Is data loss possible?
If you need to send binary data as a string, well, that's what Base-64 and similar encodings are for. If you need to send a string as string... well, send it as a string. If you need to send a string as bytes, Unicode (UTF-16) or UTF-8 will do just fine. Strings aren't simple byte arrays (even if they can be represented that way if necessary). Unicode especially is quite a complicated encoding (see http://www.joelonsoftware.com/articles/Unicode.html; read it - it's a must). Did you know that you can get a unicode normalization that splits a single character into 5 bytes? The same character could also be interpreted as 2. Or a completely different number. I haven't observed it, but I'd expect that some byte arrays will be outright invalid in UTF-16 (which is the current default string encoding in .NET).
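For example, the Base-64 route is a lossless way to push arbitrary bytes through a string-only channel:

```csharp
using System;
using System.Linq;

byte[] payload = { 200, 8, 255, 0 };            // bytes that are not valid UTF-8
string wire = Convert.ToBase64String(payload);  // safe to send as text
byte[] roundTrip = Convert.FromBase64String(wire);

Console.WriteLine(payload.SequenceEqual(roundTrip)); // True
```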
I'm not going to go through the proof that your "double-encoding" is flawed. I'm not sure, it might even work. However, the string you're going to get is going to be pretty silly and you'll have a lot of trouble encoding it to make sure that you're not sending commands or something.
The more important thing is - you're not showing intent. You're doing micro-optimizations, and sacrificing readability. Worse, you're relying on implementation details, which aren't necessarily portable or stable across later versions of .NET, not to mention other environments.
Unless you have a very, very good reason (based on actual performance analysis, not a "gut feeling"), go with the simple, readable solution. You can always improve if you have to.
EDIT: Sample code to show why round-tripping arbitrary bytes through a text encoding is a bad idea:
Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(new byte[] { 200, 8 }))
The two bytes on input turned into four bytes, { 239, 191, 189, 8 }. Not quite what you wanted.
My understanding is that reads to a UTF8 or UTF16 Encoded file can't necessarily be random because of the occasional surrogate byte (used in Eastern languages for example).
How can I use .NET to skip to an approximate position within the file, and read the unicode text from a semi-random position?
Do I discard surrogate bytes and wait for a word break to continue reading? If so, what are the valid word breaks I should wait for until I start the decoding?
Easy, UTF-8 is self-synchronizing.
Simply jump to a random byte in the file and skip-read all bytes with leading bits 10 (continuation bytes). The first byte that does not have leading 10 is the starting byte of a proper UTF-8 character, and you can read the following bytes using a regular UTF-8 decoder.
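A minimal sketch of that skip-read (the helper name is mine):

```csharp
using System;
using System.Text;

// UTF-8 continuation bytes all match the bit pattern 10xxxxxx,
// i.e. (b & 0xC0) == 0x80, so a landing point inside a multi-byte
// sequence is detected and skipped in at most three steps.
static int FindCharStart(byte[] data, int offset)
{
    while (offset < data.Length && (data[offset] & 0xC0) == 0x80)
        offset++;
    return offset; // first byte of a complete character (or end of data)
}

byte[] utf8 = Encoding.UTF8.GetBytes("héllo"); // 68 C3 A9 6C 6C 6F
Console.WriteLine(FindCharStart(utf8, 2));     // 3: index 2 is the 'é' tail byte
```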
Assuming that you're looking to extract a pseudo-random character from a UTF-8 file, I personally would lean away from trying to work out how to jump into a random place and then scroll forwards to a guaranteed 'start of character' position (my feeling was that this would be a tricky proposition; edit: this is wrong). How about something like:
Establish the length of the file in bytes
Heuristically guess the number of characters - for example, by scaling by a constant established from some suitable corpus; or by examining the first n bytes and seeing how many characters they describe, in order to get a scaling constant that might be more representative of this file
Pick a pseudo-random number in 1..<guessed number of characters in file>
If the file is very big (which I'm guessing it must be, else you wouldn't be asking this), use a buffered read to:
Read the file's bytes, decoding from UTF-8, until you reach the desired character. If you fall off the end of the file, use the last character.
A buffered read here will need to use two buffers which are alternately 'first' to avoid losing context when a character's bytes are split across two reads, eg:
Read Buffer A : bytes 1000-1999
Read Buffer B : bytes 2000-2999
If a character occupies bytes 1998-2001, using a single buffer would lose context.
Read Buffer A : bytes 3000-3999
Now in effect buffer A follows buffer B when we convert the byte stream into characters.
As noted by @jleedev below, and as seen in the other answer, it is actually easy and safe to 'scroll forward' to a guaranteed character start. But the character count estimation stuff above might still prove useful.
For UTF-16, you always have to jump to an even byte position. Then you can check whether a trailing surrogate follows. If so, skip it, otherwise you are at the start of a well-formed UTF-16 code unit sequence (always assuming that the file is well-formed, of course).
The Unicode encodings UTF-8 and UTF-16 were specifically designed to be self-synchronizing, and there are strong guarantees that you only have to skip at most a small number of code units.
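A minimal sketch for UTF-16LE (the helper name is mine; it assumes well-formed little-endian data):

```csharp
using System;
using System.Text;

static int AlignUtf16Le(byte[] data, int offset)
{
    offset &= ~1;  // code units are 2 bytes, so force an even position
    if (offset + 1 < data.Length)
    {
        char c = (char)(data[offset] | (data[offset + 1] << 8)); // little-endian
        if (char.IsLowSurrogate(c))
            offset += 2;  // we landed on the trailing half of a surrogate pair
    }
    return offset;  // start of a well-formed code unit sequence
}

// 'a', then 𩸽 (a surrogate pair), then 'b'.
byte[] utf16 = Encoding.Unicode.GetBytes("a\uD867\uDE3Db");
Console.WriteLine(AlignUtf16Le(utf16, 4)); // 6: offset 4 is the low surrogate
```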
I need a library/tool/function that compresses a 50-60 char long string to smaller.
Do you know any?
Effective compression at that scale will be difficult. You might consider Huffman coding. It might give you smaller output than gzip, whose fixed header and trailer overhead is significant on inputs this short.
Are you perhaps thinking of a cryptographic hash? For example, SHA-1 (http://en.wikipedia.org/wiki/SHA-1) can be used on an input string to produce a 20-byte digest. Of course, the digest will always be 20 bytes - even if the input string is shorter than 20 bytes.
The framework includes the GZipStream and DeflateStream classes. But that might not really be what you are after - what input strings have to be compressed? ASCII only? Letters only? Alphanumerical string? Full Unicode? And what are allowed output strings?
From an algorithmic stand point and without any further knowledge of the space of possible inputs I suggest to use arithmetic coding. This might shrink the compressed size by a few additional bits compared to Huffman coding because it is not restricted to an integral number of bits per symbol - something that can turn out important when dealing with such small inputs.
If your string only contains lowercase letters a-z and digits 0-9, that's 36 distinct symbols, so you could encode it in 6 bits per character.
This will pack a 60-char string into 45 bytes. If you don't need digits, the 26 letters fit in 5 bits, bringing it down to 38 bytes.
So choosing the right compression method depends on what data your string contains.
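A sketch of that kind of fixed-width packing (the helper names and the symbol mapping are my own choices): a-z plus 0-9 is 36 distinct symbols, which fits in 6 bits per character since 2^6 = 64, so 60 characters pack into ceil(360/8) = 45 bytes.

```csharp
using System;

// Pack fixed-width codes end to end, most significant bit first.
static byte[] Pack(int[] codes, int bitsPerCode)
{
    var output = new byte[(codes.Length * bitsPerCode + 7) / 8];
    int bitPos = 0;
    foreach (int code in codes)
        for (int i = bitsPerCode - 1; i >= 0; i--, bitPos++)
            if (((code >> i) & 1) != 0)
                output[bitPos / 8] |= (byte)(0x80 >> (bitPos % 8));
    return output;
}

// Map 'a'..'z' -> 0..25 and '0'..'9' -> 26..35 (an arbitrary mapping).
static int[] ToCodes(string s) => Array.ConvertAll(
    s.ToCharArray(), c => char.IsDigit(c) ? 26 + (c - '0') : c - 'a');

string input = new string('a', 30) + new string('7', 30);  // 60 chars
Console.WriteLine(Pack(ToCodes(input), 6).Length);         // 45
```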
You could simply gzip it
http://www.example-code.com/csharp/gzip_compressString.asp
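A minimal sketch using the framework's GZipStream (note that gzip's fixed header and trailer, roughly 18 bytes, often make 50-60 character strings larger rather than smaller):

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static byte[] GzipCompress(string s)
{
    using var ms = new MemoryStream();
    using (var gz = new GZipStream(ms, CompressionMode.Compress))
    {
        byte[] data = Encoding.UTF8.GetBytes(s);
        gz.Write(data, 0, data.Length);
    } // GZipStream must be closed before reading the result
    return ms.ToArray();
}

static string GzipDecompress(byte[] compressed)
{
    using var gz = new GZipStream(new MemoryStream(compressed),
                                  CompressionMode.Decompress);
    using var reader = new StreamReader(gz, Encoding.UTF8);
    return reader.ReadToEnd();
}

string original = "a 50-60 character string to squeeze down a little bit";
Console.WriteLine(GzipDecompress(GzipCompress(original)) == original); // True
```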
I would use something basic like RLE or shared-dictionary-based compression, followed by a block cipher that keeps the size constant.
Maybe smaz is also interesting for you.
Examples of basic compression algorithms:
RLE
(Modified or not) Huffman coding
Burrows-Wheeler transformation
Examples of block ciphers ("bit twiddlers"):
AES
Blowfish
DES
Triple DES
Serpent
Twofish
You will be able to find out what fulfills your needs using Wikipedia (links above).
Is it possible to get strings, ints, etc in binary format? What I mean is that assume I have the string:
"Hello" and I want to store it in binary format, so assume "Hello" is
11110000110011001111111100000000 in binary (I know it's not; I just typed something quickly).
Can I store the above binary not as a string, but in the actual format with the bits.
In addition to this, is it actually possible to store less than 8 bits. What I am getting at is if the letter A is the most frequent letter used in a text, can I use 1 bit to store it with regards to compression instead of building a binary tree.
Is it possible to get strings, ints,
etc in binary format?
Yes. There are several different methods for doing so. One common method is to make a MemoryStream out of an array of bytes, and then make a BinaryWriter on top of that memory stream, and then write ints, bools, chars, strings, whatever, to the BinaryWriter. That will fill the array with the bytes that represent the data you wrote. There are other ways to do this too.
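A minimal sketch of that approach:

```csharp
using System;
using System.IO;

var ms = new MemoryStream();
using (var writer = new BinaryWriter(ms))
{
    writer.Write(42);       // int: 4 bytes, little-endian
    writer.Write(true);     // bool: 1 byte
    writer.Write("Hello");  // length prefix (1 byte here) + 5 UTF-8 bytes
}
byte[] raw = ms.ToArray();  // the binary representation of everything written
Console.WriteLine(raw.Length); // 11
```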
Can I store the above binary not as a string, but in the actual format with the bits.
Sure, you can store an array of bytes.
is it actually possible to store less than 8 bits.
No. The smallest unit of storage in C# is a byte. However, there are classes that will let you treat an array of bytes as an array of bits. You should read about the BitArray class.
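For example (BitArray indexes bits least-significant first within each byte):

```csharp
using System;
using System.Collections;

var bits = new BitArray(new byte[] { 0b1010_0001 }); // one byte viewed as 8 bits
Console.WriteLine(bits[0]); // True  (bit 0 is the least significant bit)
Console.WriteLine(bits[1]); // False
Console.WriteLine(bits[7]); // True  (the most significant bit)
```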
What encoding would you be assuming?
What you are looking for is something like Huffman coding, it's used to represent more common values with a shorter bit pattern.
How you store the bit codes is still limited to whole bytes. There is no data type that uses less than a byte. The way that you store variable width bit values is to pack them end to end in a byte array. That way you have a stream of bit values, but that also means that you can only read the stream from start to end, there is no random access to the values like you have with the byte values in a byte array.
What I am getting at is if the letter
A is the most frequent letter used in
a text, can I use 1 bit to store it
with regards to compression instead of
building a binary tree.
The algorithm you're describing is known as Huffman coding. To relate to your example, if 'A' appears frequently in the data, then the algorithm will represent 'A' as simply 1. If 'B' also appears frequently (but less frequently than A), the algorithm usually would represent 'B' as 01. Then, the rest of the characters would be 00xxxxx... etc.
In essence, the algorithm performs statistical analysis on the data and generates a code that will give you the most compression.
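To make the idea concrete, here is a toy encoding pass with the code table above hard-coded (not a full Huffman implementation, which would first build the table from symbol frequencies):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Using the illustrative codes from the answer: 'A' -> "1", 'B' -> "01".
// Neither code is a prefix of the other, so the stream decodes unambiguously.
var table = new Dictionary<char, string> { ['A'] = "1", ['B'] = "01" };
string encoded = string.Concat("AAABBA".Select(c => table[c]));
Console.WriteLine(encoded); // "11101011": 8 bits instead of 6 full bytes
```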
You can use things like:
BitConverter.GetBytes(1);
Encoding.ASCII.GetBytes("text");
Encoding.Unicode.GetBytes("text");
Once you have the bytes, you can do all the bit twiddling you want. You would need an algorithm of some sort before we can give you much more useful information.
The string is actually stored in binary format, as are all strings.
The difference between a string and another data type is that when your program displays the string, it retrieves the binary and shows the corresponding (ASCII) characters.
If you were to store data in a compressed format, you would need to assign more than 1 bit to some characters. How else would you identify which character is the most frequent?
If 1 represents an 'A', what does 0 mean? All the other characters?