How can I convert the hex UTF-8 bytes E0 A4 A4 to the hex code point 0924?
ref: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=e0+a4+a4&mode=bytes
I need this because when I read Unicode data in C#, it is treated as a single-byte sequence and displays 3 characters instead of 1, but I need the 3-byte sequence (read 3 bytes, display a single character). I tried many solutions but didn't get the result.
If I can display or store a 3-byte UTF-8 character directly, then I don't need the conversion.
The scenario is like this:
string str = getivrresult();
In str I have a word in which each character is a 3-byte UTF-8 sequence.
Edited:
string str = "à¤¤";
// I want it as "त" in str.
Character: त
Character name: DEVANAGARI LETTER TA
Hex code point: 0924
Decimal code point: 2340
Hex UTF-8 bytes: E0 A4 A4
Octal UTF-8 bytes: 340 244 244
UTF-8 bytes as Latin-1 characters: à ¤ ¤
Thank You.
Use the GetString method of the Encoding class:
byte[] data = { 0xE0, 0xA4, 0xA4 };
string str = Encoding.UTF8.GetString(data);
The string now contains one character with the character code 0x924.
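To verify the result, here is a quick sketch (assuming System and System.Text are imported) that prints the code point and also derives it by hand from the 3-byte UTF-8 bit pattern 1110xxxx 10xxxxxx 10xxxxxx:
byte[] data = { 0xE0, 0xA4, 0xA4 };
string str = Encoding.UTF8.GetString(data);
Console.WriteLine(str);                           // त
Console.WriteLine(((int)str[0]).ToString("X4"));  // 0924
// Same code point computed manually from the UTF-8 payload bits:
int cp = ((0xE0 & 0x0F) << 12) | ((0xA4 & 0x3F) << 6) | (0xA4 & 0x3F);
Console.WriteLine(cp.ToString("X4"));             // 0924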
// UTF-8 single-byte-sequence input: each char holds one raw byte value (0-255).
string str = "à¤¤";
int i = 0;
byte[] data = new byte[3];
foreach (char c in str)
{
// Round-trip each char's numeric value through a 2-digit hex string, then to a byte.
string tmpstr = String.Format("{0:x2}", (int)c);
data[i] = Convert.ToByte(int.Parse(tmpstr, System.Globalization.NumberStyles.HexNumber));
i++;
}
// UTF-8 3-byte-sequence output: stp now contains "त".
string stp = Encoding.UTF8.GetString(data);
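The hex round trip is not actually needed: since each char already holds a raw byte value, a plain cast does the same job. A shorter sketch, assuming System.Linq is available:
// Each char is a byte value (0-255) smuggled into a char, so cast it back.
byte[] data = str.Select(c => (byte)c).ToArray();
string stp = Encoding.UTF8.GetString(data); // "त"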
Related
Why, when I turn an INT value to bytes, then to ASCII, and back, do I get another value?
Example:
var asciiStr = new string(Encoding.ASCII.GetChars(BitConverter.GetBytes(2000)));
var intVal = BitConverter.ToInt32(Encoding.ASCII.GetBytes(asciiStr), 0);
Console.WriteLine(intVal);
// Result: 1855
ASCII is only 7-bit - code points above 127 are unsupported. Unsupported characters are converted to ? per the docs on Encoding.ASCII:
The ASCIIEncoding object that is returned by this property might not have the appropriate behavior for your app. It uses replacement fallback to replace each string that it cannot encode and each byte that it cannot decode with a question mark ("?") character.
So 2000 decimal = D0 07 00 00 hexadecimal (little endian) = [unsupported character] [BEL character] [NUL character] [NUL character] = ? [BEL character] [NUL character] [NUL character] = 3F 07 00 00 hexadecimal (little endian) = 1855 decimal.
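A minimal check of that replacement behavior:
// 0xD0 is outside the 7-bit ASCII range, so decoding falls back to '?' (code 63).
char[] chars = Encoding.ASCII.GetChars(new byte[] { 0xD0 });
Console.WriteLine((int)chars[0]); // 63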
TL;DR: Everything's fine. But you're a victim of character replacement.
We start with 2000. Let's acknowledge, first, that this number can be represented in hexadecimal as 0x000007d0.
BitConverter.GetBytes
BitConverter.GetBytes(2000) is an array of 4 bytes, because 2000 is a 32-bit integer literal. So the 32-bit integer representation, in little endian (least significant byte first), is given by the byte sequence { 0xd0, 0x07, 0x00, 0x00 }. In decimal, those same bytes are { 208, 7, 0, 0 }.
Encoding.ASCII.GetChars
Uh oh! Problem. Here's where things likely took an unexpected turn for you.
You're asking the system to interpret those bytes as ASCII-encoded data. The problem is that ASCII uses codes from 0-127. The byte with value 208 (0xd0) doesn't correspond to any character encodable by ASCII. So what actually happens?
When decoding ASCII, if it encounters a byte that is out of the range 0-127 then it decodes that byte to a replacement character and moves to the next byte. This replacement character is a question mark ?. So the 4 chars you get back from Encoding.ASCII.GetChars are ?, BEL (bell), NUL (null) and NUL (null).
BEL is the ASCII name of the character with code 7, which traditionally elicits a beep when presented on a capable terminal. NUL (code 0) is a null character traditionally used for representing the end of a string.
new string
Now you create a string from that array of chars. In C# a string is perfectly capable of representing a NUL character within its body, so your string will have two NUL chars in it. They can be represented in C# string literals with "\0", in case you want to try that yourself. A C# string literal that represents the string you have would be "?\a\0\0". Did you know that the BEL character can be represented with the escape sequence \a? Many people don't.
Encoding.ASCII.GetBytes
Now you begin the reverse journey. Your string consists entirely of characters in the ASCII range. The encoding of a question mark is code 63 (0x3F), the BEL is 7, and the NUL is 0, so the bytes are { 0x3f, 0x07, 0x00, 0x00 }. Surprised? Well, you're encoding a question mark now, where before you provided a 208 (0xd0) byte that was not representable with ASCII encoding.
BitConverter.ToInt32
Converting these four bytes back to a 32-bit integer gives the integer 0x0000073f, which, in decimal, is 1855.
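Putting the whole walkthrough into one runnable sketch (values as derived above, on a little-endian machine):
byte[] bytes = BitConverter.GetBytes(2000);            // { 0xD0, 0x07, 0x00, 0x00 }
string s = new string(Encoding.ASCII.GetChars(bytes)); // "?\a\0\0"
byte[] back = Encoding.ASCII.GetBytes(s);              // { 0x3F, 0x07, 0x00, 0x00 }
Console.WriteLine(BitConverter.ToInt32(back, 0));      // 1855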
String encoding (ASCII, UTF8, SHIFT_JIS, etc.) is designed to pigeonhole human language into a binary (byte) form. It isn't designed to store arbitrary binary data, such as the binary form of an integer.
While your binary data will be interpreted as a string, some of the information will be lost, meaning that storing binary data in this way will fail in the general case. You can see the point where this fails using the following code:
for (int i = 0; i < 256; ++i) // every possible byte value
{
var byteData = new byte[] { (byte)i };
var stringData = System.Text.Encoding.ASCII.GetString(byteData);
var encodedAsBytes = System.Text.Encoding.ASCII.GetBytes(stringData);
Console.WriteLine("{0} vs {1}", i, (int)encodedAsBytes[0]);
}
As you can see, it starts off well because all of the character codes correspond to ASCII characters, but once we get up in the numbers (i.e. 128 and beyond), we start to require more than 7 bits to store the binary value. At this point it ceases to be decoded correctly, and we start seeing 63 come back instead of the input value.
Ultimately you will have this problem encoding binary data using any string encoding. You need to choose an encoding method specifically meant for storing binary data as a string.
Two popular methods are:
Hexadecimal
Base64 using ToBase64String and FromBase64String
Hexadecimal example (using hex helper methods; a sketch of them follows the example):
int initialValue = 2000;
Console.WriteLine(initialValue);
// Convert from int to bytes and then to hex
byte[] bytesValue = BitConverter.GetBytes(initialValue);
string stringValue = ByteArrayToString(bytesValue);
Console.WriteLine("As hex: {0}", stringValue); // outputs D0070000
// Convert from hex to bytes and then to int
byte[] decodedBytesValue = StringToByteArray(stringValue);
int intValue = BitConverter.ToInt32(decodedBytesValue, 0);
Console.WriteLine(intValue);
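ByteArrayToString and StringToByteArray were linked helpers in the original answer and are not defined above; a minimal sketch of what they could look like (assuming System.Linq; compare the Jared Parsons version quoted further down):
static string ByteArrayToString(byte[] bytes)
{
    // Two uppercase hex digits per byte, e.g. { 0xD0, 0x07, 0x00, 0x00 } -> "D0070000".
    return string.Concat(bytes.Select(b => b.ToString("X2")));
}

static byte[] StringToByteArray(string hex)
{
    // Inverse operation: consume the string two hex digits at a time.
    return Enumerable.Range(0, hex.Length / 2)
        .Select(i => Convert.ToByte(hex.Substring(i * 2, 2), 16))
        .ToArray();
}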
Base64 example:
int initialValue = 2000;
Console.WriteLine(initialValue);
// Convert from int to bytes and then to base64
byte[] bytesValue = BitConverter.GetBytes(initialValue);
string stringValue = Convert.ToBase64String(bytesValue);
Console.WriteLine("As base64: {0}", stringValue); // outputs 0AcAAA==
// Convert from base64 to bytes and then to int
byte[] decodedBytesValue = Convert.FromBase64String(stringValue);
int intValue = BitConverter.ToInt32(decodedBytesValue, 0);
Console.WriteLine(intValue);
P.S. If you simply wanted to convert your integer to a string (e.g. "2000"), you can use .ToString():
int initialValue = 2000;
string stringValue = initialValue.ToString();
I'm wondering how I can find out how many bytes a string takes in C#. Does anyone know?
You can use an encoding like ASCII, which gives one byte per character, via the System.Text.Encoding class.
Or try this:
System.Text.Encoding.Unicode.GetByteCount(yourString);
System.Text.Encoding.ASCII.GetByteCount(yourString);
From MSDN:
A String object is a sequential collection of System.Char objects that represent a string.
So you can use this:
var howManyBytes = yourString.Length * sizeof(Char);
System.Text.Encoding.Unicode.GetByteCount(yourString);
Or
System.Text.Encoding.ASCII.GetByteCount(yourString);
How many bytes a string will take depends on the encoding you choose (or is automatically chosen in the background without your knowledge). This sample code shows the difference:
void Main()
{
string text = "a🡪";
Console.WriteLine("{0,15} length: {1}", "String", text.Length);
PrintInfo(text, Encoding.ASCII); // Note that '🡪' cannot be encoded in ASCII, information loss will occur
PrintInfo(text, Encoding.UTF8); // This should always be your choice nowadays
PrintInfo(text, Encoding.Unicode);
PrintInfo(text, Encoding.UTF32);
}
void PrintInfo(string input, Encoding encoding)
{
byte[] bytes = encoding.GetBytes(input);
var info = new StringBuilder();
info.AppendFormat("{0,16} bytes: {1} (", encoding.EncodingName, bytes.Length);
info.AppendJoin(' ', bytes);
info.Append(')');
string decodedString = encoding.GetString(bytes);
info.AppendFormat(", decoded string: \"{0}\"", decodedString);
Console.WriteLine(info.ToString());
}
Output:
String length: 3
US-ASCII bytes: 3 (97 63 63), decoded string: "a??"
Unicode (UTF-8) bytes: 5 (97 240 159 161 170), decoded string: "a🡪"
Unicode bytes: 6 (97 0 62 216 106 220), decoded string: "a🡪"
Unicode (UTF-32) bytes: 8 (97 0 0 0 106 248 1 0), decoded string: "a🡪"
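Why is the String length 3 when only two characters are visible? The arrow (U+1F86A, per the UTF-8 bytes above) lies outside the Basic Multilingual Plane, so .NET stores it as a surrogate pair of two chars. A quick check:
string text = "a🡪";
Console.WriteLine(text.Length);                                // 3 (1 char + surrogate pair)
Console.WriteLine(char.IsSurrogatePair(text, 1));              // True
Console.WriteLine(char.ConvertToUtf32(text, 1).ToString("X")); // 1F86A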
I convert my hex dump to get special characters (like the £ symbol), but when I try to convert "0x18" I get "\u0018" as the value. Can anyone give me a solution regarding this matter?
Here is my code:
public static string FromHexDump(string sText)
{
Int32 lIdx;
string prValue ="" ;
for (lIdx = 1; lIdx < sText.Length; lIdx += 2)
{
// Mid is VB's 1-based substring: Mid(sText, lIdx, 2) is equivalent to sText.Substring(lIdx - 1, 2).
string prString = "0x" + Mid(sText, lIdx, 2);
string prUniCode = Convert.ToChar(Convert.ToInt64(prString, 16)).ToString();
prValue = prValue + prUniCode;
}
return prValue;
}
I used the VB language. I have a database that stores my password as already-encrypted text, with a value like BAA37D40186D, so I loop over it in steps of 2 to get 0xBA, 0xA3, 0x7D, 0x40, 0x18, 0x6D, and the result comes out like º£}@m.
You can use this code:
var myHex = '\x0633';
string formattedString = string.Format(@"\x{0:x4}", (int)myHex);
Or you can use this code from MSDN (https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/types/how-to-convert-between-hexadecimal-strings-and-numeric-types):
string hexValues = "48 65 6C 6C 6F 20 57 6F 72 6C 64 21";
string[] hexValuesSplit = hexValues.Split(' ');
foreach (string hex in hexValuesSplit)
{
// Convert the number expressed in base-16 to an integer.
int value = Convert.ToInt32(hex, 16);
// Get the character corresponding to the integral value.
string stringValue = Char.ConvertFromUtf32(value);
char charValue = (char)value;
Console.WriteLine("hexadecimal value = {0}, int value = {1}, char value = {2} or {3}",
hex, value, stringValue, charValue);
}
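For reference, those hex values spell out "Hello World!", so the loop prints one line per byte: its hex form, its integer value, and the corresponding character.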
The question is unclear - what is the database column's type? Does it contain 6 bytes, or 12 characters with the hex encoding of the bytes? In any case, this has nothing to do with special characters or encodings.
First, 0x18 is the byte value of the Cancel Character in the Latin 1 codepage, not the pound sign. That's 0xA3. It seems that the byte values in the question are just the Latin 1 bytes for the string in hex.
.NET strings are Unicode (UTF-16LE specifically). There is no UTF-8 string or Latin-1 string. Encodings and code pages apply when converting bytes to strings or vice versa. This is done using the Encoding class and e.g. Encoding.GetBytes.
In this case, this code will convert the bytes to the expected string form, including the unprintable character:
var dbBytes = new byte[] { 0xBA, 0xA3, 0x7D, 0x40, 0x18, 0x6D };
var latinEncoding = Encoding.GetEncoding(1252);
var result = latinEncoding.GetString(dbBytes);
The result is:
º£}@m
With the Cancel character between @ and m.
If the database column contains the byte values as strings:
it takes double the required space, and
the hex values have to be converted back to bytes before converting to strings.
The x format specifier converts numbers or bytes to their hex form and vice versa. For each byte value, ToString("x2") returns the two-digit hex string.
The hex string can be produced from the original buffer with:
var dbBytes = new byte[] { 0xBA, 0xA3, 0x7D, 0x40, 0x18, 0x6D };
// "x2" keeps two digits per byte, so values below 0x10 round-trip correctly.
var hexString = String.Join("", dbBytes.Select(c => c.ToString("x2")));
There are many questions that show how to parse a byte string into a byte array. I'll just steal Jared Parsons' LINQ answer:
public static byte[] StringToByteArray(string hex) {
return Enumerable.Range(0, hex.Length)
.Where(x => x % 2 == 0)
.Select(x => Convert.ToByte(hex.Substring(x, 2), 16))
.ToArray();
}
With that, we can parse the hex string into a byte array and convert it back to the original string:
var bytes=StringToByteArray(hexString);
var latinEncoding=Encoding.GetEncoding(1252);
var result=latinEncoding.GetString(bytes);
First of all, you don't need a hex dump but Unicode. I would recommend reading about Unicode, encodings, etc. and why this is a problem with strings.
I have a textbox that I use to convert things like:
74 00 65 00 73 00 74 00
Back into a string. The above says "test", but for some reason when I click the convert button only the first letter "t" (74 00) is displayed; other byte arrays work as expected and the entire text is converted.
Here are the two pieces of code I have tried, which produce the same behavior of not properly converting the entire byte array back into a word:
byte[] bArray = ByteStrToByteArray(iSequence.Text);
ASCIIEncoding enc = new ASCIIEncoding();
string word = enc.GetString(bArray);
iResult.Text = word + Environment.NewLine;
which uses the function:
private byte[] ByteStrToByteArray(string byteString)
{
byteString = byteString.Replace(" ", string.Empty);
byte[] buffer = new byte[byteString.Length / 2];
for (int i = 0; i < byteString.Length; i += 2)
buffer[i / 2] = Convert.ToByte(byteString.Substring(i, 2), 16);
return buffer;
}
Another way I was using is:
string str = iSequence.Text.Replace(" ", "");
byte[] bArray = Enumerable.Range(0, str.Length)
.Where(x => x % 2 == 0)
.Select(x => Convert.ToByte(str.Substring(x, 2), 16))
.ToArray();
ASCIIEncoding enc = new ASCIIEncoding();
string word = enc.GetString(bArray);
iResult.Text = word + Environment.NewLine;
I tried checking the lengths to see if it was iterating through, and it was.
I don't really know how to debug why this is happening with this byte array; all the other byte arrays seemed to work just fine, but this one outputs only its first letter.
Have I done something wrong that could somehow produce this behavior?
What could I try in order to find out what is wrong?
If you have the byte sequence
var bytes = new byte[] { 0x74, 0x00, 0x65, 0x00, 0x73, 0x00, 0x74, 0x00 };
and you decode it to a string using ASCII encoding (Encoding.ASCII), then you get
var result = Encoding.ASCII.GetString(bytes);
// result == "\x74\x00\x65\x00\x73\x00\x74\x00" == "t\0e\0s\0t\0"
Notice the NUL (\0) characters? When you display such a string in a textbox, only the part of the string up to the first NUL character is displayed.
Since you say the result should read "test", the input is actually not encoded in ASCII but in UTF-16LE (Encoding.Unicode).
var result = Encoding.Unicode.GetString(bytes);
// result == "\u0074\u0065\u0073\u0074" == "test"
You're converting a Unicode string to ASCII, and you're not specifying the code page on your machine to convert from:
System.Text.Encoding.GetEncoding("codepage").GetString()
if my memory serves me correctly. Also note that every control in .NET is Unicode, so what you're trying to put into the text box (if the conversion isn't correct) could be an end-of-line character, an EOF, or any other kind of control character; it all depends on your code page.
I tried debugging the first program using breakpoints in VS2010. I found out that the line
string word = enc.GetString(bArray);
outputs word as "t\0e\0s\0t".
The last line
iResult.Text = word + Environment.NewLine;
gives iResult.Text as simply "t".
So I was thinking that since \0 is not a valid escape sequence, the compiler ignored everything after it. I could be wrong though, but try removing all occurrences of 00 from the input string.
I'm not really into C#. I'm only suggesting this because it looks like C++.
It works for me:
string outputText = "t\0e\0s\0t";
outputText = outputText.Replace("\0", " ");
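Replacing the NULs treats the symptom; as the earlier answer shows, decoding the bytes with the encoding they were actually written in avoids them entirely. A one-line sketch, reusing bArray from the question:
string word = Encoding.Unicode.GetString(bArray); // "test", no embedded NULs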