Is it possible to get strings, ints, etc in binary format? What I mean is that assume I have the string:
"Hello" and I want to store it in binary format, so assume "Hello" is
11110000110011001111111100000000 in binary (I know it's not, I just typed something quickly).
Can I store the above binary not as a string, but in the actual format with the bits.
In addition to this, is it actually possible to store less than 8 bits? What I am getting at is: if the letter A is the most frequent letter used in a text, can I use 1 bit to store it for compression purposes, instead of building a binary tree?
Is it possible to get strings, ints,
etc in binary format?
Yes. There are several different methods for doing so. One common method is to make a MemoryStream out of an array of bytes, and then make a BinaryWriter on top of that memory stream, and then write ints, bools, chars, strings, whatever, to the BinaryWriter. That will fill the array with the bytes that represent the data you wrote. There are other ways to do this too.
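A minimal sketch of that approach, using nothing beyond the standard MemoryStream and BinaryWriter classes:

using System;
using System.IO;

// Write a few values and collect the bytes that represent them.
using (var stream = new MemoryStream())
using (var writer = new BinaryWriter(stream))
{
    writer.Write(42);        // int -> 4 bytes, little-endian
    writer.Write(true);      // bool -> 1 byte
    writer.Write('H');       // char, encoded with the writer's encoding (UTF-8 by default)
    writer.Write("Hello");   // length-prefixed UTF-8 string

    byte[] bytes = stream.ToArray(); // the raw binary representation of everything written
}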
Can I store the above binary not as a string, but in the actual format with the bits.
Sure, you can store an array of bytes.
is it actually possible to store less than 8 bits.
No. The smallest unit of storage in C# is a byte. However, there are classes that will let you treat an array of bytes as an array of bits. You should read about the BitArray class.
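For example (a small sketch; the bit values are arbitrary):

using System;
using System.Collections;

byte[] data = { 0b1010_0001 };
var bits = new BitArray(data);           // wraps the bytes as individually addressable bits

Console.WriteLine(bits[0]);              // True (least significant bit of the first byte)
bits[1] = true;                          // set an individual bit

byte[] result = new byte[(bits.Length + 7) / 8];
bits.CopyTo(result, 0);                  // copy the modified bits back into bytes for storage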
What encoding would you be assuming?
What you are looking for is something like Huffman coding, which is used to represent more common values with shorter bit patterns.
How you store the bit codes is still limited to whole bytes; there is no data type that uses less than a byte. The way to store variable-width bit values is to pack them end to end in a byte array. That gives you a stream of bit values, but it also means you can only read the stream from start to end; there is no random access to the values like you have with the bytes in a byte array.
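A rough sketch of that end-to-end packing (the PackBits helper is hypothetical, written here with codes emitted most-significant-bit first):

using System;
using System.Collections.Generic;

// Pack variable-width bit codes end to end into bytes.
static byte[] PackBits(IEnumerable<(int Code, int BitLength)> codes)
{
    var bytes = new List<byte>();
    int bitPos = 0;                                   // next free bit in the last byte (0..7)

    foreach (var (code, bitLength) in codes)
    {
        for (int i = bitLength - 1; i >= 0; i--)
        {
            if (bitPos == 0) bytes.Add(0);            // start a new byte when needed
            int bit = (code >> i) & 1;
            bytes[bytes.Count - 1] |= (byte)(bit << (7 - bitPos));
            bitPos = (bitPos + 1) % 8;
        }
    }
    return bytes.ToArray();
}

// "A" = 1 (1 bit), "B" = 01 (2 bits): the sequence A, B, A packs into a single byte.
byte[] packed = PackBits(new[] { (1, 1), (1, 2), (1, 1) });
Console.WriteLine(packed.Length); // 1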
What I am getting at is if the letter
A is the most frequent letter used in
a text, can I use 1 bit to store it
with regards to compression instead of
building a binary tree.
The algorithm you're describing is known as Huffman coding. To relate to your example, if 'A' appears frequently in the data, then the algorithm will represent 'A' as simply 1. If 'B' also appears frequently (but less frequently than A), the algorithm usually would represent 'B' as 01. Then, the rest of the characters would be 00xxxxx... etc.
In essence, the algorithm performs statistical analysis on the data and generates a code that will give you the most compression.
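A toy illustration of such a code table (not a real Huffman implementation; the codes for C and D are made up for the example):

using System;
using System.Collections.Generic;
using System.Text;

// Fixed prefix codes like the ones described above: 'A' -> 1, 'B' -> 01, rest start with 00.
var codes = new Dictionary<char, string>
{
    ['A'] = "1",
    ['B'] = "01",
    ['C'] = "001",
    ['D'] = "000",
};

string text = "AABACAD";
var bits = new StringBuilder();
foreach (char c in text) bits.Append(codes[c]);

Console.WriteLine(bits);          // 110110011000
Console.WriteLine(bits.Length);   // 12 bits instead of 7 * 8 = 56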
You can use things like:
BitConverter.GetBytes(1);
Encoding.ASCII.GetBytes("text");
Encoding.Unicode.GetBytes("text");
Once you have the bytes, you can do all the bit twiddling you want. You would need an algorithm of some sort before we can give you much more useful information.
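For example, a small sketch tying those calls together with some basic bit inspection:

using System;
using System.Text;

byte[] intBytes = BitConverter.GetBytes(1);          // 4 bytes for an int
byte[] ascii    = Encoding.ASCII.GetBytes("text");   // 1 byte per character
byte[] unicode  = Encoding.Unicode.GetBytes("text"); // 2 bytes per character (UTF-16)

byte b = ascii[0];                                   // 't' = 0x74
Console.WriteLine(Convert.ToString(b, 2).PadLeft(8, '0')); // 01110100
Console.WriteLine((b & (1 << 5)) != 0);              // True: test an individual bit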
The string is actually stored in binary format, as are all strings.
The difference between a string and another data type is that when your program displays the string, it retrieves the binary data and shows the corresponding characters (in .NET, strings are encoded as UTF-16, not ASCII).
If you were to store data in a compressed format, you would need to assign more than 1 bit per character. How else would you identify which character is the most frequent?
If 1 represents an 'A', what does 0 mean? All the other characters?
I am looking to implement the Rosetta Code LZW decompression method in C# and I need some help. The original code is available here: http://rosettacode.org/wiki/LZW_compression#C.23
I am only focusing on the Decompress method as I "simply" (if only) want to decompress .Z-files in my C# program in .NET 6.
I want my version to take a byte[] as input and return a byte[] (as I am reading .ReadAllBytes() from the file and want to create a new file with the decompressed result).
My problem comes from the fact that in C#, chars are 16 bits (2 bytes), not 8 bits (1 byte). This really messes with my head, as that consequently (in my mind) means that each char should be represented by two bytes. In the code at Rosetta Code, the initial dictionary created only contains integer keys of 0 -> 255, meaning up to 1 byte, not two. I am wondering if this is an error in their implementation? What do you think? And how would you go about converting this algorithm to a method with the signature: byte[] Decompress(byte[]) ?
Thanks
No, there is no error, and no, you don't need to convert the algorithm to work on 16-bit values. The usual lossless compression libraries operate on sequences of bytes. Your string of characters would first need to be converted to a sequence of bytes, e.g. to UTF-8: byte[] bs = Encoding.UTF8.GetBytes(str);. UTF-8 would be the right choice, since that encoding gives the compressor the best shot at compressing. In fact, just converting UTF-16 to UTF-8 will almost always shrink the strings, which makes it a good starting point. (Using UTF-16 as the standard for character strings in .NET is a terrible choice for exactly this reason, but I digress.)
Any data you compress would first be serialized to a sequence of bytes for these compressors, if it isn't bytes already, in a way that permits reversing the transformation after decompression on the other end.
Since you are decompressing, someone encoded the characters into a sequence of bytes, so you need to first find out what they did. It may just be a sequence of ASCII characters, which are already one byte per character. Then you would use System.Text.Encoding.ASCII.GetString(bs); to make a character string out of it.
When compressing data we usually talk about sequences of symbols. A symbol in this context might be a byte, a character, or something completely different.
Your example obviously uses characters as its symbols, but there should not be any real problem just replacing these with bytes instead. The more difficult part will be its use of strings to represent sequences of characters. You will need an equivalent representation of byte sequences (see the sketch after this list) that provides functionality like:
Concatenation/appending
Equality
GetHashCode (for performance)
Immutability (i.e. appending a byte should produce a new sequence, not modify the existing sequence)
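One way to meet those requirements is a small immutable wrapper around a byte array that can be used as a dictionary key. The ByteString type below is hypothetical, just a sketch; the copy-on-append is O(n), which is fine for illustration but worth optimising in a real implementation.

using System;
using System.Linq;

// Immutable byte sequence with value equality and a cached hash code.
sealed class ByteString : IEquatable<ByteString>
{
    private readonly byte[] _bytes;
    private readonly int _hash;

    public ByteString(byte[] bytes)
    {
        _bytes = (byte[])bytes.Clone();
        unchecked                                  // simple FNV-1a style hash, cached once
        {
            int h = (int)2166136261;
            foreach (byte b in _bytes) h = (h ^ b) * 16777619;
            _hash = h;
        }
    }

    public ByteString Append(byte b)               // appending returns a new sequence
    {
        var copy = new byte[_bytes.Length + 1];
        _bytes.CopyTo(copy, 0);
        copy[_bytes.Length] = b;
        return new ByteString(copy);
    }

    public byte[] ToArray() => (byte[])_bytes.Clone();

    public bool Equals(ByteString other) =>
        other != null && _bytes.SequenceEqual(other._bytes);

    public override bool Equals(object obj) => Equals(obj as ByteString);
    public override int GetHashCode() => _hash;
}

Such a type can then serve as the key in the LZW dictionary, e.g. Dictionary<ByteString, int>, in place of the string keys used by the posted example.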
Note that LZW implementations have to agree on some specific things to be compatible, so implementing the posted example may or may not allow you to decode .Z files encoded with another implementation. If your goal is to decode actual files, you may have better luck asking on Software Recommendations for a preexisting decompression library.
I'm reading up on the ProtectedMemory class in C# (which uses the Data Protection API in Windows (DPAPI)) and I see that in order to use the Protect() Method of the class, the data to be encrypted must be stored in a byte array whose size/length is a multiple of 16.
I know how to convert many different data types to byte array form and back again, but how can I guarantee that the size of a byte array is a multiple of 16? Do I literally need to create an array whose size is a multiple of 16 and keep track of the original data's length using another variable, or am I missing something? With traditional block ciphers all of these details are handled for you automatically with padding settings. Likewise, when I attempt to convert data back to its original form from a byte array, how do I ensure that any additional bytes are ignored, assuming of course that the original data wasn't a multiple of 16?
In the code sample provided in the .NET Framework documentation, the byte array used just so happens to be 16 bytes long, so I'm not sure what best practice is in relation to this, hence the question.
Yes, just to enumerate the possibilities given in the comments (and give an answer to this nice question), you can use:
a padding method that is also used for block cipher modes, see all the options on the Wikipedia page on the subject.
prefix a length in some form or other. A fixed size of 32 bits / 4 bytes is probably easiest. Do write down the type of encoding for the size (unsigned, little endian is probably best for C#).
Both of these already operate on bytes, so you may need to define a character encoding such as UTF-8 if you use a string.
You could also use a specific encoding of the string, e.g. one defined by ASN.1 / DER and then perform zero padding. That way you can even indicate the type of the data that has been encoded in a platform independent way. You may want to read up on masochism before taking this route.
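A sketch of the length-prefix option, assuming a 4-byte little-endian length followed by zero padding up to the next multiple of 16 (the PadTo16 and Unpad helpers are hypothetical names):

using System;
using System.Text;

static byte[] PadTo16(byte[] data)
{
    int total = ((4 + data.Length + 15) / 16) * 16;    // round up to a multiple of 16
    byte[] padded = new byte[total];
    BitConverter.GetBytes(data.Length).CopyTo(padded, 0); // 4-byte length prefix
    data.CopyTo(padded, 4);
    return padded;                                      // trailing bytes stay zero
}

static byte[] Unpad(byte[] padded)
{
    int length = BitConverter.ToInt32(padded, 0);       // read the original length back
    byte[] data = new byte[length];
    Array.Copy(padded, 4, data, 0, length);
    return data;                                        // extra padding bytes are ignored
}

byte[] secret = Encoding.UTF8.GetBytes("some secret text");
byte[] padded = PadTo16(secret);                        // length is now a multiple of 16
// ProtectedMemory.Protect(padded, MemoryProtectionScope.SameProcess); // Windows / .NET Framework
byte[] restored = Unpad(padded);                        // the original bytes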
I am trying to generate an unlock key like XXXX-XXXX-XXXX, or simply a short string or hex string. I am using the RSA algorithm to encrypt and decrypt the key. I got a long string like
Q65g2+uiytyEUW5SFsiI/c5z9NSxyuU2CM1SEly6cAVv9PdTpH81XaWS8lITcaTZ4IjdmINwhHBosvt5kdg==
when I convert the byte array (64 bytes) using the conversion method below.
Convert.ToBase64String(bytes);
My requirement is to generate a key of minimal length. Is there any way to convert the byte array (64 bytes) to a shorter string, and then get the byte array back from it? Any other suggestions to minimize the string length would be helpful.
I have tried converting the output string to hexadecimal, but that is even longer than the Base64 string.
You may want to take a look at What is the most efficient way to encode an arbitrary GUID into readable ASCII (33-127)? There the Base 85 encoding is discussed which is used to compress PDF files.
Though, the difference between Base64 and Base85 in your case is 8 characters.
You can safely remove trailing '==' in Base64 string because it is used for alignment and will always be there for 64-byte values (Of course you will have to add these characters back to decode the string).
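For example (a small sketch of dropping and restoring the padding):

using System;

byte[] key = new byte[64];
string full    = Convert.ToBase64String(key);               // 88 characters, ends with "=="
string trimmed = full.TrimEnd('=');                         // 86 characters for display/typing
byte[] decoded = Convert.FromBase64String(trimmed + "==");  // restore the padding before decoding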
Since you mention you want users to be able to type in the string,
there will be a trade-off between ease of use from the users' point of view and the length of the string.
Even typing a Base64 string is prone to a lot of errors. Base32 strings are much easier to type, but correspondingly the length will increase.
If the users can copy and paste the key, then the above is moot and there is no strong reason to make the string as short as possible.
Obviously, you can only fit a certain amount of data into a fixed number of characters. You have pretty much maxed out the limit with Base64 already, which gives you 6 bits per character.
Therefore you need to reduce the amount of data that needs to be stored. Can you reduce the key length? You could use a 96 bit key (by always leaving all other bytes zero). That would require 16 base64 characters which is much better.
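A quick sketch of that: 12 random bytes (96 bits) encode to exactly 16 Base64 characters with no '=' padding, because 12 is a multiple of 3.

using System;
using System.Security.Cryptography;

byte[] key = new byte[12];
using (var rng = RandomNumberGenerator.Create())
{
    rng.GetBytes(key);                             // 96-bit random key
}

string text = Convert.ToBase64String(key);
Console.WriteLine(text.Length);                    // 16

byte[] roundTrip = Convert.FromBase64String(text); // back to the 12 key bytes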
It seems you don't need much security against brute forcing, so you can reduce the key size even further.
I have data of 340 bytes in a string, mostly consisting of signs and numbers, like "føàA¹º#ƒUë5§Ž§".
I want to compress it into 250 bytes or fewer to save it on my RFID card.
As this data is related to a fingerprint temp., I want lossless compression.
So is there any algorithm which I can implement in C# to compress it?
If the data is strictly numbers and signs, I highly recommend changing the numbers into int-based values, e.g.:
+12939272-23923+927392
can be compressed into three 32-bit integers, which takes you from 22 bytes down to 12. Picking the right integer size (32-bit, 24-bit, or 16-bit) should help.
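A sketch of that idea, parsing the signed numbers out of the example text and writing them as fixed-width binary integers (the regular expression and BinaryWriter usage are just one way to do it):

using System;
using System.IO;
using System.Text.RegularExpressions;

string text = "+12939272-23923+927392";

using var stream = new MemoryStream();
using var writer = new BinaryWriter(stream);

foreach (Match m in Regex.Matches(text, @"[+-]\d+"))
    writer.Write(int.Parse(m.Value));              // 4 bytes per number

writer.Flush();
Console.WriteLine(stream.Length);                  // 12 bytes instead of 22 characters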
If the integer size varies greatly, you could start with 8 bits and use the value 255 to indicate that the next 8 bits are the more significant bits of the integer, effectively giving you 15-bit values.
Alternatively, you could identify the most frequent character and assign 0 to it; the second most frequent character gets 10, and the third 110. This is very crude compression, but if your data is very limited, it might just do the job for you.
Is there any other information you know about your string? For instance does it contain certain characters more often than others? Does it contain all 255 characters or just a subset of them?
If so, Huffman encoding may help you; see this or this other link for implementations in C#.
To be honest, it just depends on what your input string looks like. What I'd do is try rar, zip, and 7zip (LZMA) with very small dictionary sizes (otherwise they'll just use up too much space for preprocessed information) and see how big the raw compressed output they produce is (you will probably have to use their libraries in order to strip headers and conserve space). If any of them produces a file under 250 bytes, then find the C# library for it and there you go.
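Deflate isn't one of the formats named above, but .NET's built-in DeflateStream is a quick, dependency-free way to sanity-check how well general-purpose compression does on your data before committing to a library (the input string here is only a placeholder):

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

byte[] input = Encoding.UTF8.GetBytes("føàA¹º#ƒUë5§Ž§ …your 340-byte payload here…");

using var output = new MemoryStream();
using (var deflate = new DeflateStream(output, CompressionLevel.Optimal))
{
    deflate.Write(input, 0, input.Length);        // compress into the MemoryStream
}
Console.WriteLine($"{input.Length} -> {output.ToArray().Length} bytes");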
Can anyone tell me how many bytes the below string will take up?
string abc = "a";
From my article on strings:
In the current implementation at least, strings take up 20+(n/2)*4 bytes (rounding the value of n/2 down), where n is the number of characters in the string. The string type is unusual in that the size of the object itself varies. The only other classes which do this (as far as I know) are arrays. Essentially, a string is a character array in memory, plus the length of the array and the length of the string (in characters).
The length of the array isn't always the same as the length in characters, as strings can be "over-allocated" within mscorlib.dll, to make building them up easier. (StringBuilder does this, for instance.) While strings are immutable to the outside world, code within mscorlib can change the contents, so StringBuilder creates a string with a larger internal character array than the current contents requires, then appends to that string until the character array is no longer big enough to cope, at which point it creates a new string with a larger array.
The string length member also contains a flag in its top bit to say whether or not the string contains any non-ASCII characters. This allows for extra optimisation in some cases.
I suspect that was written before I had a chance to work with a 64-bit CLR; I suspect in 64-bit land each string takes up either 4 or 8 more bytes.
EDIT: I wrote up a blog post more recently which includes 64-bit information (and contradicts the above slightly for x86...)
Basically, each string object requires a constant 20 bytes for the object data.
The buffer requires 2 bytes per character.
The memory usage estimate for a string, in bytes, is: 20 + (2 * Length).
So, normally, the memory used in the CLR for this string is 22 bytes.
However, when we pass or send this string to another end, or use it in any other way, we do not need this much memory (we never need the 20 bytes of object data). So it depends on the type of encoding you select when you use it.
With the default encoding, it will take 1 byte per character.
So the answer is 1 byte for the default encoding.
You can check with this code:
Encoding.Default.GetBytes("a"); //It will give you a byte array of size 1.
Encoding.Default.GetBytes("ABC"); //It will give you a byte array of size 3.
If you are asking about the size of the string object, then it is hard to answer: without a debugger it is impossible to say exactly what it is, and I am not sure it is possible with a debugger either. string uses pointers internally.
If you are asking about the size of the sequence of chars that it contains, then it is 2 bytes, because strings are stored in UTF-16 and all chars in the Basic Multilingual Plane are coded with two bytes.