How to know the size of a string in bytes? - C#

I'm wondering how to find out how long a string is in bytes in C#. Does anyone know?

You can use an encoding such as ASCII to get one byte per character, using the System.Text.Encoding class.
Or try this:
System.Text.ASCIIEncoding.Unicode.GetByteCount(yourString);
System.Text.ASCIIEncoding.ASCII.GetByteCount(yourString);

From MSDN:
A String object is a sequential collection of System.Char objects that represent a string.
So you can use this:
var howManyBytes = yourString.Length * sizeof(Char);
Note that this gives the size of the string in memory (UTF-16, two bytes per Char), not its size in any particular encoding.

System.Text.ASCIIEncoding.Unicode.GetByteCount(yourString);
Or
System.Text.ASCIIEncoding.ASCII.GetByteCount(yourString);

How many bytes a string will take depends on the encoding you choose (or that is chosen automatically in the background without your knowledge). This sample code shows the difference:
void Main()
{
    string text = "a🡪";
    Console.WriteLine("{0,15} length: {1}", "String", text.Length);
    PrintInfo(text, Encoding.ASCII);   // Note that '🡪' cannot be encoded in ASCII, information loss will occur
    PrintInfo(text, Encoding.UTF8);    // This should always be your choice nowadays
    PrintInfo(text, Encoding.Unicode);
    PrintInfo(text, Encoding.UTF32);
}

void PrintInfo(string input, Encoding encoding)
{
    byte[] bytes = encoding.GetBytes(input);
    var info = new StringBuilder();
    info.AppendFormat("{0,16} bytes: {1} (", encoding.EncodingName, bytes.Length);
    info.AppendJoin(' ', bytes);
    info.Append(')');
    string decodedString = encoding.GetString(bytes);
    info.AppendFormat(", decoded string: \"{0}\"", decodedString);
    Console.WriteLine(info.ToString());
}
Output:
String length: 3
US-ASCII bytes: 3 (97 63 63), decoded string: "a??"
Unicode (UTF-8) bytes: 5 (97 240 159 161 170), decoded string: "a🡪"
Unicode bytes: 6 (97 0 62 216 106 220), decoded string: "a🡪"
Unicode (UTF-32) bytes: 8 (97 0 0 0 106 248 1 0), decoded string: "a🡪"

Related

Order of bytes after BitArray to byte[] conversion

I'm trying to figure out the byte order after conversion from a BitArray to byte[].
Firstly, here is the BitArray content:
BitArray encoded = huffmanTree.Encode(input);
foreach (bool bit in encoded)
{
    Console.Write(bit ? 1 : 0);
}
Console.WriteLine();
Output:
Encoded: 000001010110101011111111
Okay, so if we convert this binary to hex manually, we get: 05 6A FF
However, when I do the conversion in C#, here is what I get:
BitArray encoded = huffmanTree.Encode(input);
byte[] bytes = new byte[encoded.Length / 8 + (encoded.Length % 8 == 0 ? 0 : 1)];
encoded.CopyTo(bytes, 0);
string StringByte = BitConverter.ToString(bytes);
Console.WriteLine(StringByte); // just to check the Hex
Output:
A0-56-FF
Nevertheless, as I have mentioned, it should be 05 6A FF. Please help me understand why that is.
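The difference comes down to bit order: BitArray.CopyTo packs bit index 0 into the least significant bit of the first byte, while reading the printed bits left to right treats the first bit as the most significant. Below is a minimal sketch (with a fixed bit pattern standing in for huffmanTree.Encode(input)) that reproduces the first output byte:
using System;
using System.Collections;

class BitArrayOrderDemo
{
    static void Main()
    {
        // The first 8 printed bits from the question: 0 0 0 0 0 1 0 1
        bool[] firstEightBits = { false, false, false, false, false, true, false, true };
        var encoded = new BitArray(firstEightBits);

        byte[] bytes = new byte[1];
        encoded.CopyTo(bytes, 0);

        // Bit index 0 lands in the least significant bit, so the byte is
        // 1010_0000 = 0xA0 rather than 0000_0101 = 0x05.
        Console.WriteLine(BitConverter.ToString(bytes)); // A0
    }
}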

Why do I get a different value after turning an integer into ASCII and then back to an integer?

Why, when I turn an int value into bytes, then into ASCII, and back, do I get a different value?
Example:
var asciiStr = new string(Encoding.ASCII.GetChars(BitConverter.GetBytes(2000)));
var intVal = BitConverter.ToInt32(Encoding.ASCII.GetBytes(asciiStr), 0);
Console.WriteLine(intVal);
// Result: 1855
ASCII is only 7-bit - code points above 127 are unsupported. Unsupported characters are converted to ? per the docs on Encoding.ASCII:
The ASCIIEncoding object that is returned by this property might not have the appropriate behavior for your app. It uses replacement fallback to replace each string that it cannot encode and each byte that it cannot decode with a question mark ("?") character.
So 2000 decimal = D0 07 00 00 hexadecimal (little endian) = [unsupported character] [BEL character] [NUL character] [NUL character] = ? [BEL character] [NUL character] [NUL character] = 3F 07 00 00 hexadecimal (little endian) = 1855 decimal.
TL;DR: Everything's fine. But you're a victim of character replacement.
We start with 2000. Let's acknowledge, first, that this number can be represented in hexadecimal as 0x000007d0.
BitConverter.GetBytes
BitConverter.GetBytes(2000) is an array of 4 bytes, because 2000 is a 32-bit integer literal. So the 32-bit integer representation, in little endian (least significant byte first), is given by the byte sequence { 0xd0, 0x07, 0x00, 0x00 }. In decimal, those same bytes are { 208, 7, 0, 0 }.
Encoding.ASCII.GetChars
Uh oh! Problem. Here's where things likely took an unexpected turn for you.
You're asking the system to interpret those bytes as ASCII-encoded data. The problem is that ASCII uses codes from 0-127. The byte with value 208 (0xd0) doesn't correspond to any character encodable by ASCII. So what actually happens?
When decoding ASCII, if the decoder encounters a byte that is out of the range 0-127, it decodes that byte to a replacement character and moves on to the next byte. This replacement character is a question mark ?. So the 4 chars you get back from Encoding.ASCII.GetChars are ?, BEL (bell), NUL (null) and NUL (null).
BEL is the ASCII name of the character with code 7, which traditionally elicits a beep when presented on a capable terminal. NUL (code 0) is a null character traditionally used for representing the end of a string.
new string
Now you create a string from that array of chars. In C# a string is perfectly capable of representing a NUL character within the body of a string, so your string will have two NUL chars in it. They can be represented in C# string literals with "\0", in case you want to try that yourself. A C# string literal that represents the string you have would be "?\a\0\0". Did you know that the BEL character can be represented with the escape sequence \a? Many people don't.
Encoding.ASCII.GetBytes
Now you begin the reverse journey. Your string is comprised entirely of characters in the ASCII range. The encoding of a question mark is code 63 (0x3F), the BEL is 7, and the NUL is 0, so the bytes are { 0x3f, 0x07, 0x00, 0x00 }. Surprised? Well, you're encoding a question mark now where before you provided a 208 (0xd0) byte that was not representable with ASCII encoding.
BitConverter.ToInt32
Converting these four bytes back to a 32-bit integer gives the integer 0x0000073f, which, in decimal, is 1855.
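A minimal sketch that reproduces the round trip and prints the intermediate values described above:
using System;
using System.Text;

class RoundTripDemo
{
    static void Main()
    {
        byte[] bytes = BitConverter.GetBytes(2000);
        Console.WriteLine(BitConverter.ToString(bytes));   // D0-07-00-00

        // 0xD0 is not valid ASCII, so it decodes to the replacement character '?'
        string asciiStr = new string(Encoding.ASCII.GetChars(bytes)); // "?\a\0\0"

        byte[] back = Encoding.ASCII.GetBytes(asciiStr);
        Console.WriteLine(BitConverter.ToString(back));    // 3F-07-00-00

        Console.WriteLine(BitConverter.ToInt32(back, 0));  // 1855
    }
}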
String encoding (ASCII, UTF8, SHIFT_JIS, etc.) is designed to pigeonhole human language into a binary (byte) form. It isn't designed to store arbitrary binary data, such as the binary form of an integer.
Your binary data will be interpreted as a string, but some of the information will be lost, meaning that storing binary data in this way will fail in the general case. You can see the point where this fails using the following code:
for (int i = 0; i < 256; ++i)
{
    var byteData = new byte[] { (byte)i };
    var stringData = System.Text.Encoding.ASCII.GetString(byteData);
    var encodedAsBytes = System.Text.Encoding.ASCII.GetBytes(stringData);
    Console.WriteLine("{0} vs {1}", i, (int)encodedAsBytes[0]);
}
As you can see, it starts off well because all of those character codes correspond to ASCII characters, but once we get up in the numbers (i.e. 128 and beyond), we start to require more than 7 bits to store the binary value. At that point the value ceases to be decoded correctly, and we start seeing 63 come back instead of the input value.
Ultimately you will have this problem encoding binary data using any string encoding. You need to choose an encoding method specifically meant for storing binary data as a string.
Two popular methods are:
Hexadecimal
Base64 using ToBase64String and FromBase64String
Hexadecimal example (using the ByteArrayToString and StringToByteArray helpers from a linked answer; a minimal sketch of them follows the example):
int initialValue = 2000;
Console.WriteLine(initialValue);
// Convert from int to bytes and then to hex
byte[] bytesValue = BitConverter.GetBytes(initialValue);
string stringValue = ByteArrayToString(bytesValue);
Console.WriteLine("As hex: {0}", stringValue); // outputs D0070000
// Convert from hex to bytes and then to int
byte[] decodedBytesValue = StringToByteArray(stringValue);
int intValue = BitConverter.ToInt32(decodedBytesValue, 0);
Console.WriteLine(intValue);
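The ByteArrayToString and StringToByteArray helpers come from a linked answer and aren't reproduced in this thread; a minimal version might look like this (on .NET 5+ you could instead use Convert.ToHexString and Convert.FromHexString):
static string ByteArrayToString(byte[] bytes)
{
    // Format each byte as two uppercase hex digits, e.g. { 0xD0, 0x07, 0x00, 0x00 } -> "D0070000"
    var sb = new System.Text.StringBuilder(bytes.Length * 2);
    foreach (byte b in bytes)
        sb.AppendFormat("{0:X2}", b);
    return sb.ToString();
}

static byte[] StringToByteArray(string hex)
{
    // Parse every pair of hex digits back into one byte
    byte[] bytes = new byte[hex.Length / 2];
    for (int i = 0; i < bytes.Length; i++)
        bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
    return bytes;
}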
Base64 example:
int initialValue = 2000;
Console.WriteLine(initialValue);
// Convert from int to bytes and then to base64
byte[] bytesValue = BitConverter.GetBytes(initialValue);
string stringValue = Convert.ToBase64String(bytesValue);
Console.WriteLine("As base64: {0}", stringValue); // outputs 0AcAAA==
// Convert from base64 to bytes and then to int
byte[] decodedBytesValue = Convert.FromBase64String(stringValue);
int intValue = BitConverter.ToInt32(decodedBytesValue, 0);
Console.WriteLine(intValue);
P.S. If you simply want to convert your integer to a string (e.g. "2000"), you can just use .ToString():
int initialValue = 2000;
string stringValue = initialValue.ToString();

Find the size of a JSON string before submitting it to an Azure Queue [duplicate]


Sending chars from string to byte array C#

Which encoding should I use to write bbb to a file as exact bytes, so if the file were opened in a hex editor, its contents would be "99 59"?
The following methods produced incorrect results, as listed:
Byte[] bbb = { 0x99, 0x59 };
string o = System.Text.Encoding.UTF32.GetString(bbb);
UTF32 (above) writes 'EF BF BD', UTF7 writes 'C2 99 59', UTF8 writes 'EF BF BD 59', Unicode writes 'E5 A6 99', ASCII writes '3F 59'
What encoding will produce the un-changed 8-bit bytes?
If you want bytes to be written unencoded to a file/stream, simply write them to the file/stream.
File.WriteAllBytes(@"d:\temp\test.bin", bbb);
or
stream.Write(bbb, 0, bbb.Length);
Don't encode them at all.
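A self-contained sketch (the file path is just an example) that writes the two bytes and verifies what a hex editor would show:
using System;
using System.IO;

class WriteRawBytes
{
    static void Main()
    {
        byte[] bbb = { 0x99, 0x59 };
        string path = Path.Combine(Path.GetTempPath(), "test.bin");

        // Write the raw bytes without any text encoding
        File.WriteAllBytes(path, bbb);

        // Read them back and print the hex a hex editor would display
        byte[] readBack = File.ReadAllBytes(path);
        Console.WriteLine(BitConverter.ToString(readBack)); // 99-59
    }
}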

convert Hex UTF-8 bytes to Hex code point

How can I convert the hex UTF-8 bytes E0 A4 A4 to the hex code point 0924?
ref: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=e0+a4+a4&mode=bytes
I need this because when I read Unicode data in C#, it treats it as a single-byte sequence and displays 3 characters instead of 1, but I need the 3-byte sequence (read 3 bytes and display a single character). I have tried many solutions but didn't get the result.
If I can display or store a 3-byte UTF-8 sequence as a single character, then I don't need the conversion.
The scenario is like this:
string str = getivrresult();
In str I have a word where each character arrives as a 3-byte UTF-8 sequence.
Edited:
string str = "à¤¤"; // the mis-decoded form: three chars with the byte values E0 A4 A4
// I want it as "त" in str.
Character: त
Character name: DEVANAGARI LETTER TA
Hex code point: 0924
Decimal code point: 2340
Hex UTF-8 bytes: E0 A4 A4
Octal UTF-8 bytes: 340 244 244
UTF-8 bytes as Latin-1 characters: à ¤ ¤
Thank You.
Use the GetString method of the Encoding class:
byte[] data = { 0xE0, 0xA4, 0xA4 };
string str = Encoding.UTF8.GetString(data);
The string now contains one character with the character code 0x924.
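If you want to see the code point itself, you can, for example, print the char from str as a four-digit hex number:
Console.WriteLine(((int)str[0]).ToString("X4")); // 0924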
// UTF-8 single-byte-sequence input: each char's code point is one UTF-8 byte value (0xE0, 0xA4, 0xA4)
string str = "à¤¤";
int i = 0;
byte[] data = new byte[3];
foreach (char c in str)
{
    string tmpstr = String.Format("{0:x2}", (int)c);
    data[i] = Convert.ToByte(int.Parse(tmpstr, System.Globalization.NumberStyles.HexNumber));
    i++;
}
// UTF-8 3-byte-sequence output: stp now contains "त".
string stp = Encoding.UTF8.GetString(data);
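An alternative sketch for the same conversion: since each character's code point in the mis-decoded string is already a single byte value, you can cast directly instead of round-tripping through hex strings (assumes using System.Linq):
// Each char has a code point below 256, so casting to byte recovers the UTF-8 byte values
byte[] data = str.Select(c => (byte)c).ToArray();
string stp = Encoding.UTF8.GetString(data); // "त"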
