C# - Comparing strings of different encodings

Using C#, I fetch a TextBox.Text value from an .ascx page. When I compare the value for equality with a regular string object inside a LINQ query, it always returns false.
I have come to the conclusion that they are differently encoded, but have so far had no luck in converting or comparing them.
docname = "Testdoc 1.docx"; //regular string created in C#
fetchedVal = ((TextBox)e.Item.FindControl("txtSelectedDocs")).Text; //UTF-8
The above two strings are identical when viewed as literals, but a comparison of their byte[] representations shows they differ, which I put down to the encoding.
I've tried a lot of different things, such as:
System.Text.Encoding.Default.GetString(System.Text.Encoding.UTF8.GetBytes(fetchedVal));
but that will return the value "Testdoc 1.docx".
If I instead try
System.Text.Encoding.Default.GetString(System.Text.Encoding.Default.GetBytes(fetchedVal));
it returns "Testdoc 1.docx" but an Equals()-check still returns false.
I have also tried the following, which seems to be the recommended approach, but with no luck:
byte[] utf8Bytes = Encoding.UTF8.GetBytes(fetchedVal);
byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);
string fetchedValConverted = Encoding.Unicode.GetString(unicodeBytes);
The culprit appears to be the whitespace, because when examining the byte sequence it's always the seventh byte that differs.
How do you properly convert from UTF-8 to default string encoding in C#?

Strings don't have encodings or byte arrays. Encodings only come into play when you convert a string into a byte array; you can only do that by specifying which encoding to use to pick bytes.
It sounds like you simply have different characters in your strings. One of them might contain an invisible character, or they might contain different characters that merely look the same.
To find out, look at the Unicode code point value of each character in each string (e.g. (int) str[0]).
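To make this concrete, here is a minimal sketch (assuming using System and reusing the two variable names from the question) that dumps the code point of every character in both strings; a difference such as a regular space (U+0020) in one and a non-breaking space (U+00A0) in the other would show up immediately:
static void DumpCodePoints(string label, string s)
{
    Console.Write(label + ": ");
    foreach (char c in s)
        Console.Write("U+{0:X4} ", (int)c); // each UTF-16 code unit as hex
    Console.WriteLine();
}
DumpCodePoints("docname", docname);
DumpCodePoints("fetchedVal", fetchedVal);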

Related

Get a Null-Terminated String when using Encoding.ANSI or Encoding.Unicode

I have a c# string like this:
string a = "Hello";
How can I use the Encoding class to get the exact length of characters including null-terminating characters? For example, if I used Encoding.Unicode.GetByteCount, I should get 12 and if I used Encoding.ASCII.GetByteCount, I should get 6.
How can I use the Encoding class to encode the string into a byte array including the null-terminating characters?
Thank you for help!
As far as I remember, null-termination is specific to C/C++-style languages and platforms. Neither the Unicode nor the ANSI encodings specify any requirement for a string to be null-terminated, nor does the C#/CLR platform, so you can't expect the Encoding classes to emit that extra character from your 5-character "Hello" string.
However, in C#/CLR, strings can contain null characters.
So, based on that, try converting this 6-character string:
string a = "Hello\0";
or
string a = "Hello";
a += "\0"; // if you really can't have the \0 at first time, you can simply add it
and I'm pretty sure you will get the result you wanted through both Encoding.ANSI and Encoding.Unicode (a single 0x00 byte in ANSI and UTF-8, two 0x00 bytes in UTF-16, and so on).
(Also note that if you are P/Invoking, you don't need to handle that manually: the marshaller will null-terminate the string correctly, as long as the data type is marshalled as string-like data rather than array-like data.)
In .NET, strings are not null terminated, so you need to add the null character yourself if the protocol you're working with requires one. That means:
- You need to manually add 1 to the string length.
- You need to manually write a null character (e.g. (byte)0) to the end of the byte array.
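For what it's worth, a minimal sketch of the byte counts and the trailing zero bytes (note that .NET's Encoding class actually has no ANSI member; Encoding.Default, the system ANSI code page on Windows, is the closest equivalent):
using System;
using System.Text;

string a = "Hello\0"; // 6 characters, terminator included

Console.WriteLine(Encoding.Unicode.GetByteCount(a)); // 12 (UTF-16: 2 bytes per char)
Console.WriteLine(Encoding.ASCII.GetByteCount(a));   // 6

byte[] utf16 = Encoding.Unicode.GetBytes(a); // ends with 0x00 0x00
byte[] ansi = Encoding.Default.GetBytes(a);  // ends with a single 0x00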

Strange results when converting from byte array to string

I get strange results when converting a byte array to a string and then converting the string back to a byte array.
Try this:
byte[] b = new byte[1];
b[0] = 172;
string s = Encoding.ASCII.GetString(b);
byte[] b2 = Encoding.ASCII.GetBytes(s);
MessageBox.Show(b2[0].ToString());
And the result for me is not 172 as I'd expect but... 63.
Why does it happen?
Why does it happen?
Because ASCII only contains values up to 127.
When faced with binary data which is invalid for the given encoding, Encoding.GetString can provide a replacement character, or throw an exception. Here, it's using a replacement character of ?.
It's not clear exactly what you're trying to achieve, but:
- If you're converting arbitrary binary data to text, use Convert.ToBase64String instead; do not try to use an encoding, as you're not really representing text. You can then use Convert.FromBase64String to decode (see the sketch below).
- Encoding.ASCII is usually a bad choice, and binary data that includes a byte of 172 is certainly not ASCII text.
- You need to work out which encoding you're actually using. Personally I dislike using Encoding.Default unless you really know the data is in the default encoding for the platform you're working on. If you get the choice, UTF-8 is a good one.
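A small sketch contrasting the two round trips (expected outputs in comments):
using System;
using System.Text;

byte[] b = { 172 };

// Lossy: ASCII replaces the out-of-range byte with '?'
string ascii = Encoding.ASCII.GetString(b);
Console.WriteLine(Encoding.ASCII.GetBytes(ascii)[0]); // 63, i.e. '?'

// Lossless: Base64 round-trips arbitrary bytes exactly
string base64 = Convert.ToBase64String(b); // "rA=="
Console.WriteLine(Convert.FromBase64String(base64)[0]); // 172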
ASCII is a 7-bit encoding. If you take a look at the generated string, it contains "?", the replacement for an unrecognized character. You might choose Encoding.Default instead.
ASCII is a seven-bit character encoding, so 172 falls outside its range; when converting to a string, it becomes "?", which is used for characters that cannot be represented.

How do I encode a Binary blob as Unicode blob?

I'm trying to store a Gzip-serialized object in Active Directory's "Extension Attribute"; more info here. This field is a Unicode string according to its oM syntax of 64.
What is the most efficient way to store a binary blob as Unicode? Once I get this down, the rest is a piece of cake.
There are, of course, many ways of reliably packing an arbitrary byte array into Unicode characters, but none of them are very efficient. It is very unfortunate that ActiveDirectory would choose to use Unicode for data that is not textual in nature. It’s like using a string to represent a 32-bit integer, or like using Nutella to write a love letter.
My recommendation would be to “play it safe” and use an ASCII-based encoding such as base64. The reason I recommend this is because there is already a built-in .NET implementation for this:
var base64Encoded = Convert.ToBase64String(byteArray);
var original = Convert.FromBase64String(base64Encoded);
In theory you could come up with an encoding that is more efficient than this by making use of more of the Unicode character set. However, in order to do so reliably, you would need to know quite a bit about Unicode.
Normally, this would be the way to convert between bytes and Unicode text:
// string from bytes
string text = System.Text.Encoding.Unicode.GetString(bytes);
// bytes from string
byte[] roundTripped = System.Text.Encoding.Unicode.GetBytes(text);
EDIT:
But since not every possible byte sequence is a valid Unicode string, you should use a method that can create a string from an arbitrary byte sequence:
// string from bytes
Convert.ToBase64String(byteArray);
// bytes from string
Convert.FromBase64String(base64Encoded);
(Thanks to @Timwi, who pointed this out!)
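Putting the pieces together, a hypothetical end-to-end sketch of packing a gzipped blob into a plain string and getting it back (the variable names are illustrative, not part of any Active Directory API):
using System;
using System.IO;
using System.IO.Compression;

byte[] payload = { 1, 2, 3, 4 }; // stand-in for the serialized object

// Gzip the payload
byte[] gzipped;
using (var ms = new MemoryStream())
{
    using (var gz = new GZipStream(ms, CompressionMode.Compress))
        gz.Write(payload, 0, payload.Length);
    gzipped = ms.ToArray();
}

// Pack into a string that is safe to store as Unicode text...
string attributeValue = Convert.ToBase64String(gzipped);

// ...and later unpack, byte-for-byte identical
byte[] restored = Convert.FromBase64String(attributeValue);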

Converting unicode character to a single hexadecimal value in C#

I am getting a character from an EMF record using Encoding.Unicode.GetString, and the resulting string contains only one character but has two bytes. I don't have any idea about the encoding scheme or multi-byte character sets. I want to convert that character to its equivalent single hexadecimal value. Can you help me with this?
It's not clear what you mean. A char in C# is a 16-bit unsigned value. If you've got a binary data source and you want Unicode characters, you should use an Encoding to decode the binary data into a string, which you can then access as a sequence of char values.
You can convert a char to a hex string by first converting it to an integer, and then using the X format specifier like this:
char c = '\u0123';
string hex = ((int)c).ToString("X4"); // Now hex == "0123"
Now, that leaves one more issue: surrogate pairs. Values which aren't in the Basic Multilingual Plane (U+0000 to U+FFFF) are represented by two UTF-16 code units - a high surrogate and a low surrogate. You can use the char.IsSurrogate* methods to check for surrogate pairs... although it's harder (as far as I can see) to then convert a surrogate pair into a UCS-4 value. If you're lucky, you won't need to deal with this... if you're happy converting your binary data into a sequence of UTF-16 code units instead of strict UCS-4 values, you don't need to worry.
EDIT: Given your comments, it's still not entirely clear what you've got to start with. You say you've got two bytes... are they separate, or in a byte array? What do they represent? Text in a particular encoding, presumably... but which encoding? Once you know the encoding, you can convert a byte array into a string easily:
byte[] bytes = ...;
// For example, if your binary data is UTF-8
string text = Encoding.UTF8.GetString(bytes);
char firstChar = text[0];
string hex = ((int)firstChar).ToString("X4");
If you could edit your question to give more details about your actual situation, it would be a lot easier to help you get to a solution. If you're generally confused about encodings and the difference between text and binary data, you might want to read my article about it.
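For what it's worth, the framework does provide char.ConvertToUtf32 for combining a surrogate pair into a single code point; a minimal sketch (the sample string is hypothetical):
using System;

string s = "\U0001F600"; // one code point outside the BMP = two UTF-16 code units
if (s.Length > 1 && char.IsHighSurrogate(s[0]) && char.IsLowSurrogate(s[1]))
{
    int codePoint = char.ConvertToUtf32(s[0], s[1]);
    Console.WriteLine(codePoint.ToString("X")); // "1F600"
}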
Try this:
System.Text.Encoding.Unicode.GetBytes(theChar.ToString())
    .Aggregate("", (agg, val) => agg + val.ToString("X2")); // requires using System.Linq
However, since you don't specify exactly what encoding the character is in, this could fail. Further, you don't make it very clear whether you want the output to be a string of hex chars or bytes. I'm guessing the former, since I'd guess you want to generate HTML. Let me know if any of this is wrong.
I created an extension method to convert a Unicode or non-Unicode string to a hex string.
I'm sharing it for anyone who's interested.
using System;
using System.Linq;
using System.Text;

public static class StringHelper
{
    public static string ToHexString(this string str)
    {
        byte[] bytes = str.IsUnicode() ? Encoding.UTF8.GetBytes(str) : Encoding.Default.GetBytes(str);
        return BitConverter.ToString(bytes).Replace("-", string.Empty);
    }

    public static bool IsUnicode(this string input)
    {
        const int maxAnsiCode = 255;
        return input.Any(c => c > maxAnsiCode);
    }
}
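Hypothetical usage of the extension above (ANSI output shown for a typical Windows code page):
Console.WriteLine("Hello".ToHexString());    // "48656C6C6F" (ANSI path)
Console.WriteLine("Hi\u2122".ToHexString()); // "4869E284A2" (UTF-8 path, since U+2122 > 255)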
Get thee to StringInfo:
http://msdn.microsoft.com/en-us/library/system.globalization.stringinfo.aspx
http://msdn.microsoft.com/en-us/library/8k5611at.aspx
The .NET Framework supports text elements. A text element is a unit of text that is displayed as a single character, called a grapheme. A text element can be a base character, a surrogate pair, or a combining character sequence. The StringInfo class provides methods that allow your application to split a string into its text elements and iterate through the text elements. For an example of using the StringInfo class, see String Indexing.

Base64 String throwing invalid character error

I keep getting a Base64 invalid character error even though I shouldn't.
The program takes an XML file and exports it to a document. If the user wants, it will compress the file as well. The compression works fine and returns a Base64 String which is encoded into UTF-8 and written to a file.
When it's time to reload the document into the program, I have to check whether it's compressed or not; the code is simply:
byte[] gzBuffer = System.Convert.FromBase64String(text);
return "1F-8B-08" == BitConverter.ToString(new List<Byte>(gzBuffer).GetRange(4, 3).ToArray());
It checks the beginning of the string to see if it has GZip's code in it.
Now the thing is, all my tests work: I take a string, compress it, decompress it, and compare it to the original. The problem arises when I get the string back from an ADO Recordset. The string is exactly what was written to the file (with the addition of a "\0" at the end, but I don't think that even does anything; even with it trimmed off, it still throws). I even copied and pasted the entire string into a test method and compressed/decompressed it. Works fine.
The tests pass, but the code fails with the exact same string? The only difference is that instead of declaring a regular string and passing it in, I'm getting one returned from a recordset.
Any ideas on what I'm doing wrong?
You say:
The string is exactly what was written to the file (with the addition of a "\0" at the end, but I don't think that even does anything).
In fact, it does do something: it causes your code to throw a FormatException ("Invalid character in a Base-64 string"), because Convert.FromBase64String does not consider "\0" to be a valid Base64 character.
byte[] data1 = Convert.FromBase64String("AAAA\0"); // Throws exception
byte[] data2 = Convert.FromBase64String("AAAA"); // Works
Solution: Get rid of the zero termination (e.g. call text.Trim('\0')).
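A quick sketch of the fix; the sample value here is hypothetical (conveniently, "H4sI" decodes to the GZip magic bytes 1F 8B 08):
using System;

string text = "H4sIAAAAAAAA\0"; // value read back with a trailing NUL
// Convert.FromBase64String(text) would throw a FormatException
byte[] gzBuffer = Convert.FromBase64String(text.Trim('\0')); // works
Console.WriteLine(BitConverter.ToString(gzBuffer, 0, 3)); // "1F-8B-08"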
Notes:
The MSDN docs for Convert.FromBase64String say it will throw a FormatException when:
The length of s, ignoring white-space characters, is not zero or a multiple of 4.
-or-
The format of s is invalid. s contains a non-base-64 character, more than two padding characters, or a non-white-space character among the padding characters.
and that:
The base-64 digits in ascending order from zero are the uppercase characters 'A' to 'Z', the lowercase characters 'a' to 'z', the numerals '0' to '9', and the symbols '+' and '/'.
Whether the null char is allowed really depends on the Base64 codec in question.
Given the looseness of the Base64 standard (there is no authoritative, exact specification), many implementations would just ignore it as white space, others would flag it as a problem, and the buggiest ones wouldn't notice and would happily try decoding it... :-/
But it sounds like the C# implementation does not like it (which is one valid approach), so if removing it helps, that should be done.
One minor additional comment: UTF-8 is not a requirement; ISO-8859-x (aka Latin-x) and 7-bit ASCII would work as well, because Base64 was specifically designed to use only a 7-bit subset that works with all ASCII-compatible encodings.
string stringToDecrypt = HttpContext.Current.Request.QueryString.ToString();
// change to
string stringToDecrypt = HttpUtility.UrlDecode(HttpContext.Current.Request.QueryString.ToString());
(Base64 output can contain '+' and '/', which have special meaning in URLs, so a value read from a query string needs to be URL-decoded first.)
If removing the \0 from the end of the string is impossible, you can append a character of your own to each string you encode and remove it on decode.
One gotcha to do with converting Base64 from a string is that some conversion functions use the preceding "data:image/jpg;base64," and others only accept the actual data.
