Converting a Unicode character to a single hexadecimal value in C#

I am getting a character from an EMF record using Encoding.Unicode.GetString, and the resulting string contains only one character but is two bytes long. I don't know much about encoding schemes or multi-byte character sets. I want to convert that character to its equivalent single hexadecimal value. Can you help me with this?

It's not clear what you mean. A char in C# is a 16-bit unsigned value. If you've got a binary data source and you want to get Unicode characters, you should use an Encoding to decode the binary data into a string, which you can then access as a sequence of char values.
You can convert a char to a hex string by first converting it to an integer, and then using the X format specifier like this:
char c = '\u0123';
string hex = ((int)c).ToString("X4"); // Now hex = "0123"
Now, that leaves one more issue: surrogate pairs. Values which aren't in the Basic Multilingual Plane (U+0000 to U+FFFF) are represented by two UTF-16 code units - a high surrogate and a low surrogate. You can use the char.IsSurrogate* methods to check for surrogate pairs... although it's harder (as far as I can see) to then convert a surrogate pair into a UCS-4 value. If you're lucky, you won't need to deal with this... if you're happy converting your binary data into a sequence of UTF-16 code units instead of strict UCS-4 values, you don't need to worry.
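If you do end up having to handle a surrogate pair, the framework can combine the two code units into a code point for you. A minimal sketch (the sample string is my own, chosen to sit outside the BMP):
// "\uD834\uDD1E" is U+1D11E (MUSICAL SYMBOL G CLEF), stored as a high/low surrogate pair in UTF-16
string s = "\uD834\uDD1E";
if (char.IsSurrogatePair(s, 0))
{
    int codePoint = char.ConvertToUtf32(s, 0); // 0x1D11E
    string hex = codePoint.ToString("X");      // "1D11E"
}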
EDIT: Given your comments, it's still not entirely clear what you've got to start with. You say you've got two bytes... are they separate, or in a byte array? What do they represent? Text in a particular encoding, presumably... but which encoding? Once you know the encoding, you can convert a byte array into a string easily:
byte[] bytes = ...;
// For example, if your binary data is UTF-8
string text = Encoding.UTF8.GetString(bytes);
char firstChar = text[0];
string hex = ((int)firstChar).ToString("X4");
If you could edit your question to give more details about your actual situation, it would be a lot easier to help you get to a solution. If you're generally confused about encodings and the difference between text and binary data, you might want to read my article about it.

Try this:
// Requires a using System.Linq; directive for Aggregate
string hex = System.Text.Encoding.Unicode.GetBytes(theChar.ToString())
    .Aggregate("", (agg, val) => agg + val.ToString("X2"));
However, since you don't specify exactly what encoding the character is in, this could fail. Further, you don't make it very clear whether you want the output to be a string of hex characters or bytes. I'm guessing the former, since I'd guess you want to generate HTML. Let me know if any of this is wrong.

I created an extension method to convert a Unicode or non-Unicode string to a hex string.
I'm sharing it for anyone who might find it useful.
public static class StringHelper
{
    public static string ToHexString(this string str)
    {
        byte[] bytes = str.IsUnicode() ? Encoding.UTF8.GetBytes(str) : Encoding.Default.GetBytes(str);
        return BitConverter.ToString(bytes).Replace("-", string.Empty);
    }

    public static bool IsUnicode(this string input)
    {
        const int maxAnsiCode = 255;
        return input.Any(c => c > maxAnsiCode);
    }
}
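For example (a quick sketch; the exact result of the Encoding.Default branch depends on the system's default code page):
string plain = "AB".ToHexString(); // "4142" - no char above 255, so Encoding.Default is used
string mixed = "€".ToHexString();  // "E282AC" - '€' is U+20AC, so the UTF-8 branch is taken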

Get thee to StringInfo:
http://msdn.microsoft.com/en-us/library/system.globalization.stringinfo.aspx
http://msdn.microsoft.com/en-us/library/8k5611at.aspx
The .NET Framework supports text elements. A text element is a unit of text that is displayed as a single character, called a grapheme. A text element can be a base character, a surrogate pair, or a combining character sequence. The StringInfo class provides methods that allow your application to split a string into its text elements and iterate through the text elements. For an example of using the StringInfo class, see String Indexing.
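A short sketch of iterating text elements and printing the UTF-16 code units each one contains (the sample string is my own; assumes using System, System.Globalization and System.Linq directives, and .NET 4 or later for this string.Join overload):
string s = "a\u0301\uD834\uDD1E"; // 'a' + COMBINING ACUTE ACCENT, then U+1D11E as a surrogate pair
TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(s);
while (elements.MoveNext())
{
    string element = (string)elements.Current;
    // Each text element may span several UTF-16 code units.
    Console.WriteLine(string.Join(" ", element.Select(c => ((int)c).ToString("X4"))));
    // prints "0061 0301" and then "D834 DD1E"
}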

Related

C# - Comparing strings of different encodings

Using C#, I fetch a TextBox.Text value from an .ascx page. When I compare the value for equality with a regular string object inside a LINQ query, it always returns false.
I have come to the conclusion that they are differently encoded, but have so far had no luck in converting or comparing them.
docname = "Testdoc 1.docx"; //regular string created in C#
fetchedVal = ((TextBox)e.Item.FindControl("txtSelectedDocs")).Text; //UTF-8
The above two strings are identical when written as literals, but comparing their byte[] representations they are obviously different due to the encoding.
I've tried a lot of different things, such as:
System.Text.Encoding.Default.GetString(utf8.GetBytes(fetchedVal));
but that will return the value "Testdoc 1.docx".
If I instead try
System.Text.Encoding.Default.GetString(System.Text.Encoding.Default.GetBytes(fetchedVal));
it returns "Testdoc 1.docx" but an Equals()-check still returns false.
I have also tried the following, which seem to be the recommended approach, but with no luck:
byte[] utf8Bytes = Encoding.UTF8.GetBytes(fetchedVal);
byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);
string fetchedValConverted = Encoding.Unicode.GetString(unicodeBytes);
The culprit appears to be the whitespace, because when examining the byte sequence it's always the seventh byte that differs.
How do you properly convert from UTF-8 to default string encoding in C#?
Strings themselves don't have an encoding or an underlying byte array. Encodings only come into play when you convert a string into a byte array, and you can only do that by specifying which encoding should produce the bytes.
It sounds like you actually simply have different characters in your strings. You might have an invisible character in one of them, or they might have different characters that look the same.
To find out, look at the Unicode code point values of each character in each string (e.g. (int)str[0]).
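A minimal way to see where the two strings diverge, using the variable names from the question (the Dump helper name is just for illustration; assumes a using System.Linq; directive):
// Dump each string as a sequence of UTF-16 code unit values
static string Dump(string s)
{
    return string.Join(" ", s.Select(c => ((int)c).ToString("X4")));
}

Console.WriteLine(Dump(docname));    // e.g. "0054 0065 0073 0074 0064 006F 0063 0020 0031 ..."
Console.WriteLine(Dump(fetchedVal)); // a 00A0 (no-break space) where you expect 0020 would stand out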

Get a Null-Terminated String when using Encoding.ANSI or Encoding.Unicode

I have a c# string like this:
string a = "Hello";
How can I use the Encoding class to get the exact length of characters including null-terminating characters? For example, if I used Encoding.Unicode.GetByteCount, I should get 12 and if I used Encoding.ASCII.GetByteCount, I should get 6.
How can I use the Encoding class to encode the string into a byte array including the null-terminating characters?
Thank you for help!
As far as I remember, null-termination is specific to C/C++-style languages/platforms. The Unicode and ANSI encodings do not specify any requirement for the string to be null-terminated, nor does the C#/CLR platform. You can't expect them to include that extra character, so you will probably have a hard time making those classes emit it from your 5-character "Hello" string.
However, in C#/CLR, strings can contain null characters.
So, based on that, try converting this 6-character string instead:
string a = "Hello\0";
or
string a = "Hello";
a += "\0"; // if you can't include the \0 up front, you can simply append it afterwards
and I'm pretty sure you will get the result you wanted from both the ANSI and Unicode encodings (a single 0 byte in ANSI/UTF-8, two 0 bytes in UTF-16, and so on).
(Also, note that if you are P/Invoking, you don't need to handle that manually. The marshaller will null-terminate the string correctly, assuming the marshalled type is declared as string-like rather than array-like data.)
In .NET, strings are not null terminated, so you need to add the null character yourself if the protocol you're working with requires one. That means:
You need to manually add 1 to the string length.
You need to manually write a null character (e.g. (byte)0) to the end of the byte array.
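Putting those pieces together, a minimal sketch (assuming a using System.Text; directive; the 12 and 6 byte counts match what the question expects once the terminator is part of the string):
string a = "Hello";
string terminated = a + "\0";                                // 6 characters including the terminator

int utf16Bytes = Encoding.Unicode.GetByteCount(terminated);  // 12
int asciiBytes = Encoding.ASCII.GetByteCount(terminated);    // 6

byte[] buffer = Encoding.Unicode.GetBytes(terminated);       // last two bytes are 0x00 0x00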

Strange results when converting from byte array to string

I get strange results when converting byte array to string and then converting the string back to byte array.
Try this:
byte[] b = new byte[1];
b[0] = 172;
string s = Encoding.ASCII.GetString(b);
byte[] b2 = Encoding.ASCII.GetBytes(s);
MessageBox.Show(b2[0].ToString());
And the result for me is not 172 as I'd expect but... 63.
Why does it happen?
Why does it happen?
Because ASCII only contains values up to 127.
When faced with binary data which is invalid for the given encoding, Encoding.GetString can provide a replacement character, or throw an exception. Here, it's using a replacement character of ?.
It's not clear exactly what you're trying to achieve, but:
If you're converting arbitrary binary data to text, use Convert.ToBase64String instead; do not try to use an encoding, as you're not really representing text. You can then use Convert.FromBase64String to decode (see the sketch after this list).
Encoding.ASCII is usually a bad choice, and certainly binary data including a byte of 172 is not ASCII text
You need to work out which encoding you're actually using. Personally I dislike using Encoding.Default unless you really know the data is in the default encoding for the platform you're working on. If you get the choice, using UTF-8 is a good one.
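For example, a minimal Base64 round trip with the byte from the question (just a sketch):
byte[] b = new byte[] { 172 };

string s = Convert.ToBase64String(b);    // "rA==" - plain ASCII, safe to store or transmit as text
byte[] b2 = Convert.FromBase64String(s);

Console.WriteLine(b2[0]);                // 172 - the original value survives the round trip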
ASCII is a 7-bit encoding. If you take a look at the generated string, it contains "?", the replacement for an unrecognized character. You might choose Encoding.Default instead.
ASCII is a seven-bit character encoding, so 172 falls outside that range; when converting to a string, it becomes "?", which is used for characters that cannot be represented.

How to retrieve the unicode decimal representation of the chars in a string containing hindi text?

I am using Visual Studio 2010 and C# to convert text into Unicode values. For example, I have a string abc = "मेरा".
There are 4 characters in this string, and I need the Unicode value of each of the four characters.
Please help me.
When you write code like string abc = "मेरा";, you already have it as Unicode (specifically, UTF-16), so you don't have to convert anything. If you want to access the individual characters, you can do that with the normal indexer: e.g. abc[1] is े (DEVANAGARI VOWEL SIGN E).
If you want to see the numeric representations of those characters, just cast them to integers. For example
abc.Select(c => (int)c)
gives the sequence of numbers 2350, 2375, 2352, 2366. If you want to see the hexadecimal representation of those numbers, use ToString():
abc.Select(c => ((int)c).ToString("x4"))
returns the sequence of strings "092e", "0947", "0930", "093e".
Note that when I said numeric representations, I actually meant their encoding using UTF-16. For characters in the Basic Multilingual Plane, this is the same as their Unicode code point. The vast majority of used characters lie in BMP, including those 4 Hindi characters presented here.
If you wanted to handle characters in other planes too, you could use code like the following.
byte[] bytes = Encoding.UTF32.GetBytes(abc);
int codePointCount = bytes.Length / 4;
int[] codePoints = new int[codePointCount];
for (int i = 0; i < codePointCount; i++)
    codePoints[i] = BitConverter.ToInt32(bytes, i * 4);
Since UTF-32 encodes all (21-bit) code points directly, this will give you them. (Maybe there is a more straightforward solution, but I haven't found one.)
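As a possibly more direct alternative (my own sketch, not part of the original answer), char.ConvertToUtf32 combines a surrogate pair into a single code point as it walks the string, assuming the string is well formed (requires a using System.Collections.Generic; directive):
var codePoints = new List<int>();
for (int i = 0; i < abc.Length; i++)
{
    codePoints.Add(char.ConvertToUtf32(abc, i)); // combines a surrogate pair into one code point
    if (char.IsHighSurrogate(abc[i]))
        i++;                                     // skip the low surrogate we just consumed
}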
Since a .NET char is a Unicode character (at least for code points in the BMP), you can simply enumerate all characters in a string:
var abc = "मेरा";
foreach (var c in abc)
{
    Console.WriteLine((int)c);
}
resulting in
2350
2375
2352
2366
use
System.Text.Encoding.UTF8.GetBytes(abc)
which will return the string's UTF-8 encoded bytes.
If you are trying to convert files from a legacy encoding into Unicode:
Read the file, supplying the correct encoding of the source files, then write the file using the desired Unicode encoding scheme.
using (StreamReader reader = new StreamReader(@"C:\MyFile.txt", Encoding.GetEncoding("ISCII")))
using (StreamWriter writer = new StreamWriter(@"C:\MyConvertedFile.txt", false, Encoding.UTF8))
{
    writer.Write(reader.ReadToEnd());
}
If you are looking for a mapping of Devanagari characters to the Unicode code points:
You can find the chart at the Unicode Consortium website here.
Note that Unicode code points are traditionally written in hexadecimal. So rather than the decimal number 2350, the code point would be written as U+092E, and it appears as 092E on the code chart.
If you have the string s = "मेरा" then you already have the answer.
This string contains four code points in the BMP which in UTF-16 are represented by 8 bytes. You can access them by index with s[i], with a foreach loop etc.
If you want the underlying 8 bytes, you can access them like so:
string str = "मेरा";
byte[] arr = System.Text.Encoding.Unicode.GetBytes(str);

How do I encode a Binary blob as Unicode blob?

I'm trying to store a Gzip-serialized object into Active Directory's "Extension Attribute", more info here. This field is a Unicode string according to its oM syntax of 64.
What is the most efficient way to store a binary blob as Unicode? Once I get this down, the rest is a piece of cake.
There are, of course, many ways of reliably packing an arbitrary byte array into Unicode characters, but none of them are very efficient. It is very unfortunate that ActiveDirectory would choose to use Unicode for data that is not textual in nature. It’s like using a string to represent a 32-bit integer, or like using Nutella to write a love letter.
My recommendation would be to “play it safe” and use an ASCII-based encoding such as base64. The reason I recommend this is because there is already a built-in .NET implementation for this:
var base64Encoded = Convert.ToBase64String(byteArray);
var original = Convert.FromBase64String(base64Encoded);
In theory you could come up with an encoding that is more efficient than this by making use of more of the Unicode character set. However, in order to do so reliably, you would need to know quite a bit about Unicode.
Normally, this would be the way to convert between bytes and Unicode text:
// string from bytes
string text = System.Text.Encoding.Unicode.GetString(bytes);
// bytes from string
byte[] encoded = System.Text.Encoding.Unicode.GetBytes(text);
EDIT:
But since not every possible byte sequence is a valid Unicode string, you should use a method that can create a string from an arbitrary byte sequence:
// string from bytes
Convert.ToBase64String(byteArray);
// bytes from string
Convert.FromBase64String(base64Encoded);
(Thanks to @Timwi, who pointed this out!)
