UTF-8 encoding to a Base64 string and storing it in a database - C#

I'm currently trying to encode data before storing it in my database.
try
{
    byte[] byteEncString = new byte[_strToConvert.Length];
    byteEncString = System.Text.Encoding.UTF8.GetBytes(_strToConvert);
    string strEncoded = Convert.ToBase64String(byteEncString);
    return strEncoded;
}
Does anybody know how long a 15-character string will be after it is encoded via UTF-8 and Convert.ToBase64String? Also, is there a maximum? My field on SQL Server is only 50 characters and I want to stay within that range. Thoughts?

Well, for one thing there's no point in creating a byte array and then ignoring it - so your code would be simpler as:
byte[] byteEncString = Encoding.UTF8.GetBytes(_strToConvert);
return Convert.ToBase64String(byteEncString);
A .NET char can end up as 3 bytes when UTF-8 encoded [1], so for 15 characters that gives a 45-byte maximum. Base64 then converts every 3 bytes into 4 characters, so the result is at most a 60-character base64-encoded string.
[1] This is because any character outside the Basic Multilingual Plane is represented as a surrogate pair. That pair ends up as 4 bytes, but it consumes 2 input chars, so the average "bytes per char" in that case is only 2.
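A quick sketch (mine, not part of the original answer) that checks the arithmetic with a worst-case 15-character string made of '€' (U+20AC), a 3-byte character in UTF-8:

using System;
using System.Text;

class MaxLengthCheck
{
    static void Main()
    {
        // 15 copies of '€', the 3-byte worst case for a single .NET char.
        string worstCase = new string('\u20AC', 15);

        byte[] utf8 = Encoding.UTF8.GetBytes(worstCase);
        string base64 = Convert.ToBase64String(utf8);

        Console.WriteLine(utf8.Length);   // 45 bytes  (15 chars x 3 bytes)
        Console.WriteLine(base64.Length); // 60 chars  (4 x ceil(45 / 3))
    }
}

Note that Encoding.UTF8.GetMaxByteCount(15) returns a somewhat larger, more conservative figure, because it allows for worst cases such as encoder fallback.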

Related

Cannot convert byte array to string and vice versa

I am trying to convert a byte array to a string, but the bytes are not being converted correctly.
byte[] testByte = new byte[]
{
    2, 200
};
string string1 = Encoding.ASCII.GetString(testByte);
byte[] byte1 = Encoding.ASCII.GetBytes(string1);
string1 comes out as "\u0002?" and byte1 is not converted back to 2 and 200. I tried UTF8 as well, but it gives the same problem.
I have been given an array of 256 chars and integer values. I need to write these values to media as a string and read them back as bytes, so I need a conversion in both directions. I run into problems when an integer value is greater than 127.
What should I do so that I get the original byte values back from the string?
You appear to be using an encoding backwards. A text Encoding (such as ASCII) is for converting arbitrary text data into encoded (meaning: specially formatted) binary data.
(A caveat: not all encodings support all text characters; ASCII only supports code points 0-127, for example, but that isn't the main problem with the code shown.)
You appear to want the exact opposite: to treat arbitrary binary data as text. For that, don't use a text encoding at all - just use base-N for some N. Hex (base-16) would work, but base-64 is more space efficient:
string encoded = Convert.ToBase64String(testByte);
byte[] decoded = Convert.FromBase64String(encoded);
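To confirm the round trip with the bytes from the question, here is a minimal sketch (mine, not part of the answer); the "Asg=" in the comment is simply the base-64 form of those two bytes:

using System;

class Base64RoundTrip
{
    static void Main()
    {
        byte[] testByte = { 2, 200 };

        // Base64 treats the input as opaque binary, so every byte value
        // (including 200, which ASCII cannot represent) survives the round trip.
        string encoded = Convert.ToBase64String(testByte);   // "Asg="
        byte[] decoded = Convert.FromBase64String(encoded);

        Console.WriteLine(encoded);     // Asg=
        Console.WriteLine(decoded[0]);  // 2
        Console.WriteLine(decoded[1]);  // 200
    }
}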

How does deserialising a byte array as UTF-8 know where each character starts and ends?

I am a bit confused about how networking handles this. I have a string in C# and I serialise it to UTF-8. But according to UTF-8, each character takes up anywhere from 1 to 4 bytes.
So if my server receives this byte array over the network and deserialises it, knowing it's a UTF-8 string of some size, how does it know how many bytes each character is in order to convert it properly?
Will I have to include the byte length of each character in the protocol, e.g.:
[message length][char byte length=1][2][char byte length=2][56][123][ ... etc...]
Or is this unnecessary?
UTF-8 encodes the number of bytes required in the bits of the character itself, so no per-character length field is needed. Read the description on Wikipedia: only single-byte code points start with a zero bit, only the lead byte of a two-byte sequence starts with the bits 110, and only continuation bytes inside a multi-byte sequence start with 10.
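Here is a small sketch (mine, not from the answer) that reads each lead byte to see how the sequence length is self-describing; the test string mixes a 1-byte, a 3-byte and a 4-byte character:

using System;
using System.Text;

class Utf8LeadBytes
{
    static void Main()
    {
        // "a", "€" (U+20AC) and a Gothic letter (U+10348).
        byte[] data = Encoding.UTF8.GetBytes("a\u20AC\U00010348");

        for (int i = 0; i < data.Length; )
        {
            byte lead = data[i];
            int len =
                (lead & 0x80) == 0x00 ? 1 :  // 0xxxxxxx -> single byte
                (lead & 0xE0) == 0xC0 ? 2 :  // 110xxxxx -> 2-byte sequence
                (lead & 0xF0) == 0xE0 ? 3 :  // 1110xxxx -> 3-byte sequence
                4;                           // 11110xxx -> 4-byte sequence

            Console.WriteLine($"lead byte 0x{lead:X2} starts a {len}-byte character");
            i += len;
        }
    }
}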

How can I get the maximum byte array length from a string that has always the same length?

Let's say I have a fixed string with 245 chars, for example
v0iRfw0rBic4HlLIDmIm5MtLlbKvakb3Q2kXxMWssNctLgw445dre2boZG1a1kQ+xTUZWvry61QBmTykFEJii217m+BW7gEz3xlMxwXZnWwk2P6Pk1bcOkK3Nklbx2ckhtj/3jtj6Nc05XvgpiROJ/zPfztD0/gXnmCenre32BeyJ0Es2r4xwO8nWq3a+5MdaQ5NjEgr4bLg50DaxUoffQ1jLn/jIQ==
then I transform it into a byte array using
System.Text.Encoding.UTF8.GetBytes
and the length of the byte array is 224.
Then I generate another string, e.g.
PZ2+Sxx4SjyjzIA1qGlLz4ZFjkzzflb7pQfdoHfMFDlHwQ/uieDFOpWqnA5FFXYTwpOoOVXVWb9Hw6YUm6rF1rhG7eZaXEWmgFS2SeFItY+Qyt3jI9rkcWhPp8Y5sJ/q5MVV/iePuGVOArgBHhDe/g0Wg9DN4bLeYXt+CrR/bNC1zGQb8rZoABF4lSEh41NXcai4IizOHQMSd52rEa2wzpXoS1KswgxWroK/VUyRvH4oJpkMxkqj565gCHsZvO9jx8aLOZcBq66cYXOpDsi2gboeg+oUpAdLRGSjS7qQPfKTW42FBYPmJ3vrb2TW+g==
but now the array length is 320.
So my question is: how can I determine the maximum length of the byte array resulting from a string fixed at 245 chars?
This is the class that I'm using for generating the random string
static class Utilities
{
    static Random randomGenerator = new Random();

    internal static string GenerateRandomString(int length)
    {
        byte[] randomBytes = new byte[randomGenerator.Next(length)];
        randomGenerator.NextBytes(randomBytes);
        return Convert.ToBase64String(randomBytes);
    }
}
According to RFC 3629:
In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
accessible range) are encoded using sequences of 1 to 4 octets.
The maximum number of bytes per UTF-8 character is 4, so the maximum length of your byte array is 4 times 245 = 980.
If you are encoding using the Byte Order Mark (BOM) you'll need 3 extra bytes
[...] the BOM
will always appear as the octet sequence EF BB BF.
so 983 in total.
Additional info:
In your example, you also converted the byte array to Base64, which carries 6 bits per character and therefore produces 4 * Math.Ceiling(bytes / 3.0) characters - in your case (983 bytes including the BOM) that is 1312 ASCII characters.
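Putting those numbers into code, this is the arithmetic the answer above describes (a sketch of mine, not code from the question):

using System;

class WorstCaseLength
{
    static void Main()
    {
        int chars = 245;

        int maxUtf8Bytes = chars * 4;        // 980: 4 bytes per character, worst case
        int withBom      = maxUtf8Bytes + 3; // 983: plus the EF BB BF byte order mark

        // Base64 turns every 3 bytes (rounded up) into 4 characters.
        int base64Length = 4 * (int)Math.Ceiling(withBom / 3.0);

        Console.WriteLine($"{maxUtf8Bytes} / {withBom} / {base64Length}"); // 980 / 983 / 1312
    }
}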
By design, UTF-8 is extensible (https://en.wikipedia.org/wiki/UTF-8), so in theory there is no hard maximum length.
In practice, though, it is restricted to at most 4 bytes per character, so the byte length is capped at the character count x 4:
245 chars => 980 bytes
If you are looking for a fixed-length encoding, use Encoding.Unicode (UTF-16), which always uses 2 bytes per .NET char.
Also, Encoding provides a method that gives the maximum number of bytes for a given number of chars:
Encoding.UTF8.GetMaxByteCount(charCount: 245)
Encoding.Unicode.GetMaxByteCount(charCount: 245)
Simply put, you can't know the exact size in advance. UTF-8 (Unicode Transformation Format, 8-bit), which you are using, takes 1, 2, 3 or 4 bytes per character (as Tommy said), so the only way to get the exact byte count is to traverse the chars and count them (Encoding.UTF8.GetByteCount()); GetMaxByteCount() only gives an upper bound.
Alternatively, if you are going to keep using Base64-like strings, you don't need UTF-8 at all: you can use ASCII or any other one-byte-per-char encoding, and the total byte array size will simply be the Length of your string.
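A sketch contrasting the exact count with the upper bounds the Encoding class reports (the 245-character test string is my own, not one from the question):

using System;
using System.Text;

class ByteCountDemo
{
    static void Main()
    {
        // Worst case for characters in the Basic Multilingual Plane: 3 bytes each.
        string s = new string('\u20AC', 245);

        // Exact size: walks the string and counts what UTF-8 actually needs.
        int exact = Encoding.UTF8.GetByteCount(s);          // 735

        // Upper bounds computed without looking at the content - useful for
        // sizing buffers or database columns up front.
        int maxUtf8  = Encoding.UTF8.GetMaxByteCount(s.Length);
        int maxUtf16 = Encoding.Unicode.GetMaxByteCount(s.Length);

        Console.WriteLine($"exact: {exact}, UTF-8 max: {maxUtf8}, UTF-16 max: {maxUtf16}");
    }
}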

Why does C# Convert.ToBase64String() give me 88 as a length when I'm passing in 64 bytes?

I'm trying to understand the following:
I declare an array of 64 bytes (the buffer), but when I convert it to a base-64 string, the length is 88. Shouldn't the length be 64, since I am passing in 64 bytes? I could be totally misunderstanding how this actually works. If so, could you please explain?
//Generate a cryptographic random number
RNGCryptoServiceProvider rng = new RNGCryptoServiceProvider();
// Create byte array
byte[] buffer = new byte[64];
// Get random bytes
rng.GetBytes(buffer);
// This line gives me 88 as a result.
// Shouldn't it give me 64 as declared above?
throw new Exception(Convert.ToBase64String(buffer).Length.ToString());
// Return a Base64 string representation of the random number
return Convert.ToBase64String(buffer);
No - base-64 encoding uses a whole byte (one output character) to represent just six bits of the data being encoded. The lost two bits are the price of using only letters, digits, plus and slash as your symbols (basically, excluding byte values that would map to invisible or special characters in plain ASCII/UTF-8). The result you are getting is 64 * 4/3, rounded up to the nearest 4-character boundary.
Base64 encoding converts 3 octets into 4 encoded characters; therefore
ceil(64 / 3) * 4 = 22 * 4 = 88 characters.
Shouldn't the length only be 64, since I am passing in 64 bytes?
No. You are passing 64 tokens in base-256 notation. Base64 carries less information per token, so it needs more tokens; 88 sounds about right.
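The expected length can be computed up front; here is a minimal sketch (mine, using RandomNumberGenerator rather than the question's RNGCryptoServiceProvider, which is marked obsolete in recent .NET versions):

using System;
using System.Security.Cryptography;

class Base64Length
{
    static void Main()
    {
        byte[] buffer = new byte[64];
        using (var rng = RandomNumberGenerator.Create())
        {
            rng.GetBytes(buffer);
        }

        // Base64 output length: 4 characters for every 3 input bytes, rounded
        // up to a whole 4-character block ('=' padding fills the last block).
        int predicted = 4 * ((buffer.Length + 2) / 3); // 4 * ceil(64 / 3) = 88

        string encoded = Convert.ToBase64String(buffer);
        Console.WriteLine($"{predicted} == {encoded.Length}"); // 88 == 88
    }
}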

TextReader.ReadToEnd vs. the Unix wc Command

Another question about Unicode, terminals, and now C# and wc. If I write this simple piece of code:
int i = 0;
foreach (char c in Console.In.ReadToEnd())
{
    if (c != '\n') i++;
}
Console.WriteLine("{0}", i);
and feed it only the character "€" (3 bytes in UTF-8), wc reports 3 characters (maybe using wint_t, though I haven't checked), but ReadToEnd() returns 1 (one character). What exactly is the behavior of ReadToEnd in this case? How do I know what ReadToEnd is doing behind the scenes?
I'm running xterm initialized with a UTF-8 en_US locale, on Ubuntu Linux with Mono.
Thank you.
wc and most Unix-like commands deal with characters in terms of the C char data type, which is usually an 8-bit integer. wc simply reads the bytes from standard input one by one, with no conversion, and determines that there are 3 of them.
.NET deals with characters in terms of its own Char data type, which is a 16-bit unsigned integer representing a UTF-16 code unit. The Console class has received the 3 bytes of input, determined that the console it is attached to is UTF-8, and correctly converted them to a single UTF-16 euro character.
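To see the same conversion outside the console, here is a minimal sketch (mine, not from the answer) feeding the three raw UTF-8 bytes of "€" through the same decoding step:

using System;
using System.Text;

class EuroBytes
{
    static void Main()
    {
        // The three bytes wc sees on stdin for "€" (U+20AC) in a UTF-8 terminal.
        byte[] raw = { 0xE2, 0x82, 0xAC };

        string s = Encoding.UTF8.GetString(raw);

        Console.WriteLine(raw.Length); // 3 - what wc counts
        Console.WriteLine(s.Length);   // 1 - a single UTF-16 char on the .NET side
    }
}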
You need to take the character encoding into consideration. Currently you are merely counting bytes on one side and chars on the other, and bytes and chars are not necessarily the same size.
Encoding encoding = Encoding.UTF8;
string s = "€";
int byteCount = encoding.GetByteCount(s);
Console.WriteLine(byteCount); // prints "3" on the console
byte[] bytes = new byte[byteCount];
encoding.GetBytes(s, 0, s.Length, bytes, 0);
int charCount = encoding.GetCharCount(bytes);
Console.WriteLine(charCount); // prints "1" on the console
ReadToEnd returns a string, and all strings in .NET are Unicode; they're not just an array of bytes.
Apparently, wc is returning the number of bytes. The number of bytes and the number of characters only used to be the same thing, back when everything was a single-byte encoding.
wc, by default, returns the number of lines, words and bytes in a file. If you want it to return the number of characters according to the active locale's encoding, rather than just the number of bytes, look at the -m or --chars option, which modern versions of wc have.
