C# char/byte encoding equality

C# char/byte encoding equality - c#

I have some code to dump strings to stdout to check their encoding, it looks like this:
private void DumpString(string s)
{
System.Console.Write("{0}: ", s);
foreach (byte b in s)
{
System.Console.Write("{0}({1}) ", (char)b, b.ToString("x2"));
}
System.Console.WriteLine();
}
Consider two strings, each of which appear as "ë", but with different encodings. DumpString will produce the following output:
ë: e(65)(08)
ë: ë(eb)
The code looks like this:
DumpString(string1);
DumpString(string2);
How can I convert string2, using the System.Text.Encoding, to be byte equivalent to string1.

They don't have different encodings. Strings in C# are always UTF-16 (thus, you shouldn't use byte to iterate over strings because you'll lose the top 8 bits). What they have is different normalization forms.
Your first string is "\u0065\u0308": LATIN SMALL LETTER E + COMBINING DIAERESIS. This is the decomposed form (NFD).
The second is "\u00EB": LATIN SMALL LETTER E WITH DIAERESIS. This is the precomposed form (NFC).
You can convert between them with string.Normalize.

You're looking for the String.Normalize method.

Related

How to perform mutliple Replace calls at once

I have a bit of a weird question here at hands. I have a text that's encoded in such a way that each character is replaced by another character and I'm creating an application that will replace each character with a correct one. But I've come across a problem that I have trouble solving. Let me show with an example:
Original text: This is a line.
Encoded text: (.T#*T#*%*=T50;
Now, as I said, each character represents another character, '(' is 'T', '.' is actually a 'h' and so on.
Now I could just go with
string decoded = encoded.Replace('(','T'); //T.T#*T#*%*=T50;
And that will solve one problem, but when I reach character 'T' that is actually encoded character 'i' I will have to replace all 'T' with 'i', which means that all previously decoded letter 'T's (that were once '(') will also change along with the encoded 'T'.
//T.T#*T#*%*=T50; -> i.i#*i#*%*=i50;
in this situation it's obvious that I should've just went the other way around, first change 'T' to 'i' and then '(' to 'T', but in the text I'm changing that kind of analysis is not an option.
What's the alternative here that I could do to perform the task correctly?
Thank you!

One possible solution is do not use replace string method at all.
Instead you can create method which for every encoded character will output decoded one, and then go through your string as through array of char and for every character in this array use "decryption" method to get decoded character - thus you'll receive decoded string.
For example (using StringBulder to create new string):
private static char Decode(char source)
{
if (source == '(')
return 'T';
else if (source == '.')
return 'h';
//.... and so on
}
string source = "ABC";
var builder = new StringBuilder();
foreach (var c in source)
builder.Append(Decode(c));
var result = builder.ToString();

Using .Replace() probably isn't the way to go in the first place, since as you're finding it covers the whole string every time. And once you've modified the whole string once, the encoding is lost.
Instead, loop over the string one time and replace characters individually.
Create a function that accepts a char and returns the replaced char. For simplicity, I'll just show the signature:
private char Decode(char c);
Then just loop over the string and call that function on each character. LINQ can make short work of that:
var decodedString = new string(encodedString.Select(c => Decode(c)).ToArray());
(This is freehand and untested, you may or may not need that .ToArray() for the string constructor to be happy, I'm not certain. But you get the idea.)
If it's easier to read you can also just loop manually over the string and perhaps use a StringBuilder with each successive char to build the final decoded result.

Without knowledge of your encryption algorithm, this answer assumes that it's a simple character translation akin to the Caesar Cipher.
Pass in your encrypted string, the method loops over each character, adjusting it by the value of shiftDelta and returns the resulting string.
private string Decrypt(string input)
{
const int shiftDelta = 10;
var inputChars = input.ToCharArray();
var outputChars = new char[inputChars.Length];
for (var i = 0; i < outputChars.Length; i++)
{
// Perform character translation here
outputChars[i] = (char)(inputChars[i] + shiftDelta);
}
return outputChars.ToString();
}

Converting a string getting Wrong value

I write a simple code and need to get each number in a string as an Integer
But when I try to use Convert.ToInt32() it give me another value
Example:
string x="4567";
Console.WriteLine(x[0]);
the result will be 4 , but if i try to use Convert
Console.WriteLine(Convert.ToInt32(x[0]));
it Give me 52 !!
I try using int.TryParse() and its the same

According to the docs:
The ToInt32(Char) method returns a 32-bit signed integer that represents the UTF-16 encoded code unit of the value argument. If value is not a low surrogate or a high surrogate, this return value also represents the Unicode code point of value.
In order to get 4, you'd have to convert it to a string before converting to int32:
Console.WriteLine(Convert.ToInt32(x[0].ToString()));

you can alternatively do this way to loop through all characters in a string
foreach (char c in x)
{
Console.WriteLine(c);
}
Will Print 4,5,6,7
And as suggested in earlier answer before converting to integer make that as a sting so it doesn't return the ascii code

A char is internally a short integer of it's ASCII representation.
If you cast/convert (explicitly or implicitly) a char to int, you will get it's Ascii value. Example Convert.ToInt32('4') = 52
But when you print it to Console, you are using it's ToString() method implicitly, so you are actually printing the ASCII character '4'. Example: Console.WriteLine(x[0]) is equivalent to Console.WriteLine("4")
Try using an ASCII letter so you will notice the difference clearly:
Console.WriteLine((char)('a')); // a
Console.WriteLine((int)('a')); // 97
Now play with char math:
Console.WriteLine((char)('a'+5)); // f
Console.WriteLine((int)('a'+5)); // 102
Bottom line, just use int digit = int.Parse(x[0]);
Or the funny (but less secure) way: int digit = x[0] - '0';

x isnt x[0], x[0] is the first char, '4' which is 52.

How to compare Unicode characters that "look alike"?

I fall into a surprising issue.
I loaded a text file in my application and I have some logic which compares the value having µ.
And I realized that even if the texts are same the compare value is false.
Console.WriteLine("μ".Equals("µ")); // returns false
Console.WriteLine("µ".Equals("µ")); // return true
In later line the character µ is copy pasted.
However, these might not be the only characters that are like this.
Is there any way in C# to compare the characters which look the same but are actually different?

Because it is really different symbols even they look the same, first is the actual letter and has char code = 956 (0x3BC) and the second is the micro sign and has 181 (0xB5).
References:
Unicode Character 'GREEK SMALL LETTER MU' (U+03BC)
Unicode Character 'MICRO SIGN' (U+00B5)
So if you want to compare them and you need them to be equal, you need to handle it manually, or replace one char with another before comparison. Or use the following code:
public void Main()
{
var s1 = "μ";
var s2 = "µ";
Console.WriteLine(s1.Equals(s2)); // false
Console.WriteLine(RemoveDiacritics(s1).Equals(RemoveDiacritics(s2))); // true
}
static string RemoveDiacritics(string text)
{
var normalizedString = text.Normalize(NormalizationForm.FormKC);
var stringBuilder = new StringBuilder();
foreach (var c in normalizedString)
{
var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
if (unicodeCategory != UnicodeCategory.NonSpacingMark)
{
stringBuilder.Append(c);
}
}
return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}
And the Demo

In many cases, you can normalize both of the Unicode characters to a certain normalization form before comparing them, and they should be able to match. Of course, which normalization form you need to use depends on the characters themselves; just because they look alike doesn't necessarily mean they represent the same character. You also need to consider if it's appropriate for your use case — see Jukka K. Korpela's comment.
For this particular situation, if you refer to the links in Tony's answer, you'll see that the table for U+00B5 says:
Decomposition <compat> GREEK SMALL LETTER MU (U+03BC)
This means U+00B5, the second character in your original comparison, can be decomposed to U+03BC, the first character.
So you'll normalize the characters using full compatibility decomposition, with the normalization forms KC or KD. Here's a quick example I wrote up to demonstrate:
using System;
using System.Text;
class Program
{
static void Main(string[] args)
{
char first = 'μ';
char second = 'µ';
// Technically you only need to normalize U+00B5 to obtain U+03BC, but
// if you're unsure which character is which, you can safely normalize both
string firstNormalized = first.ToString().Normalize(NormalizationForm.FormKD);
string secondNormalized = second.ToString().Normalize(NormalizationForm.FormKD);
Console.WriteLine(first.Equals(second)); // False
Console.WriteLine(firstNormalized.Equals(secondNormalized)); // True
}
}
For details on Unicode normalization and the different normalization forms refer to System.Text.NormalizationForm and the Unicode spec.

They both have different character codes: Refer this for more details
Console.WriteLine((int)'μ'); //956
Console.WriteLine((int)'µ'); //181
Where, 1st one is:
Display Friendly Code Decimal Code Hex Code Description
====================================================================
μ μ μ μ Lowercase Mu
µ µ µ µ micro sign Mu

For the specific example of μ (mu) and µ (micro sign), the latter has a compatibility decomposition to the former, so you can normalize the string to FormKC or FormKD to convert the micro signs to mus.
However, there are lots of sets of characters that look alike but aren't equivalent under any Unicode normalization form. For example, A (Latin), Α (Greek), and А (Cyrillic). The Unicode website has a confusables.txt file with a list of these, intended to help developers guard against homograph attacks. If necessary, you could parse this file and build a table for “visual normalization” of strings.

Search both characters in a Unicode database and see the difference.
One is the Greek small Letter µ and the other is the Micro Sign µ.
Name : MICRO SIGN
Block : Latin-1 Supplement
Category : Letter, Lowercase [Ll]
Combine : 0
BIDI : Left-to-Right [L]
Decomposition : <compat> GREEK SMALL LETTER MU (U+03BC)
Mirror : N
Index entries : MICRO SIGN
Upper case : U+039C
Title case : U+039C
Version : Unicode 1.1.0 (June, 1993)
Name : GREEK SMALL LETTER MU
Block : Greek and Coptic
Category : Letter, Lowercase [Ll]
Combine : 0
BIDI : Left-to-Right [L]
Mirror : N
Upper case : U+039C
Title case : U+039C
See Also : micro sign U+00B5
Version : Unicode 1.1.0 (June, 1993)

EDIT After the merge of this question with How to compare 'μ' and 'µ' in C#
Original answer posted:
"μ".ToUpper().Equals("µ".ToUpper()); //This always return true.
EDIT
After reading the comments, yes it is not good to use the above method because it may provide wrong results for some other type of inputs, for this we should use normalize using full compatibility decomposition as mentioned in wiki. (Thanks to the answer posted by BoltClock)
static string GREEK_SMALL_LETTER_MU = new String(new char[] { '\u03BC' });
static string MICRO_SIGN = new String(new char[] { '\u00B5' });
public static void Main()
{
string Mus = "µμ";
string NormalizedString = null;
int i = 0;
do
{
string OriginalUnicodeString = Mus[i].ToString();
if (OriginalUnicodeString.Equals(GREEK_SMALL_LETTER_MU))
Console.WriteLine(" INFORMATIO ABOUT GREEK_SMALL_LETTER_MU");
else if (OriginalUnicodeString.Equals(MICRO_SIGN))
Console.WriteLine(" INFORMATIO ABOUT MICRO_SIGN");
Console.WriteLine();
ShowHexaDecimal(OriginalUnicodeString);
Console.WriteLine("Unicode character category " + CharUnicodeInfo.GetUnicodeCategory(Mus[i]));
NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormC);
Console.Write("Form C Normalized: ");
ShowHexaDecimal(NormalizedString);
NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormD);
Console.Write("Form D Normalized: ");
ShowHexaDecimal(NormalizedString);
NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormKC);
Console.Write("Form KC Normalized: ");
ShowHexaDecimal(NormalizedString);
NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormKD);
Console.Write("Form KD Normalized: ");
ShowHexaDecimal(NormalizedString);
Console.WriteLine("_______________________________________________________________");
i++;
} while (i < 2);
Console.ReadLine();
}
private static void ShowHexaDecimal(string UnicodeString)
{
Console.Write("Hexa-Decimal Characters of " + UnicodeString + " are ");
foreach (short x in UnicodeString.ToCharArray())
{
Console.Write("{0:X4} ", x);
}
Console.WriteLine();
}
Output
INFORMATIO ABOUT MICRO_SIGN
Hexa-Decimal Characters of µ are 00B5
Unicode character category LowercaseLetter
Form C Normalized: Hexa-Decimal Characters of µ are 00B5
Form D Normalized: Hexa-Decimal Characters of µ are 00B5
Form KC Normalized: Hexa-Decimal Characters of µ are 03BC
Form KD Normalized: Hexa-Decimal Characters of µ are 03BC
________________________________________________________________
INFORMATIO ABOUT GREEK_SMALL_LETTER_MU
Hexa-Decimal Characters of µ are 03BC
Unicode character category LowercaseLetter
Form C Normalized: Hexa-Decimal Characters of µ are 03BC
Form D Normalized: Hexa-Decimal Characters of µ are 03BC
Form KC Normalized: Hexa-Decimal Characters of µ are 03BC
Form KD Normalized: Hexa-Decimal Characters of µ are 03BC
________________________________________________________________
While reading information in Unicode_equivalence I found
The choice of equivalence criteria can affect search results. For instance some typographic ligatures like U+FB03 (ﬃ), ..... so a search for U+0066 (f) as substring would succeed in an NFKC normalization of U+FB03 but not in NFC normalization of U+FB03.
So to compare equivalence we should normally use FormKC i.e. NFKC normalization or FormKD i.e NFKD normalization.
I was little curious to know more about all the Unicode characters so I made sample which would iterate over all the Unicode character in UTF-16 and I got some results I want to discuss
Information about characters whose FormC and FormD normalized values were not equivalent
Total: 12,118
Character (int value): 192-197, 199-207, 209-214, 217-221, 224-253, ..... 44032-55203
Information about characters whose FormKC and FormKD normalized values were not equivalent
Total: 12,245
Character (int value): 192-197, 199-207, 209-214, 217-221, 224-228, ..... 44032-55203, 64420-64421, 64432-64433, 64490-64507, 64512-64516, 64612-64617, 64663-64667, 64735-64736, 65153-65164, 65269-65274
All the character whose FormC and FormD normalized value were not equivalent, there FormKC and FormKD normalized values were also not equivalent except these characters
Characters: 901 '΅', 8129 '῁', 8141 '῍', 8142 '῎', 8143 '῏', 8157 '῝', 8158 '῞'
, 8159 '῟', 8173 '῭', 8174 '΅'
Extra character whose FormKC and FormKD normalized value were not equivalent, but there FormC and FormD normalized values were equivalent
Total: 119
Characters: 452 'Ǆ' 453 'ǅ' 454 'ǆ' 12814 '㈎' 12815 '㈏' 12816 '㈐' 12817 '㈑' 12818 '㈒'
12819 '㈓' 12820 '㈔' 12821 '㈕', 12822 '㈖' 12823 '㈗' 12824 '㈘' 12825 '㈙' 12826 '㈚'
12827 '㈛' 12828 '㈜' 12829 '㈝' 12830 '㈞' 12910 '㉮' 12911 '㉯' 12912 '㉰' 12913 '㉱'
12914 '㉲' 12915 '㉳' 12916 '㉴' 12917 '㉵' 12918 '㉶' 12919 '㉷' 12920 '㉸' 12921 '㉹' 12922 '㉺' 12923 '㉻' 12924 '㉼' 12925 '㉽' 12926 '㉾' 13056 '㌀' 13058 '㌂' 13060 '㌄' 13063 '㌇' 13070 '㌎' 13071 '㌏' 13072 '㌐' 13073 '㌑' 13075 '㌓' 13077 '㌕' 13080 '㌘' 13081 '㌙' 13082 '㌚' 13086 '㌞' 13089 '㌡' 13092 '㌤' 13093 '㌥' 13094 '㌦' 13099 '㌫' 13100 '㌬' 13101 '㌭' 13102 '㌮' 13103 '㌯' 13104 '㌰' 13105 '㌱' 13106 '㌲' 13108 '㌴' 13111 '㌷' 13112 '㌸' 13114 '㌺' 13115 '㌻' 13116 '㌼' 13117 '㌽' 13118 '㌾' 13120 '㍀' 13130 '㍊' 13131 '㍋' 13132 '㍌' 13134 '㍎' 13139 '㍓' 13140 '㍔' 13142 '㍖' .......... ﺋ' 65164 'ﺌ' 65269 'ﻵ' 65270 'ﻶ' 65271 'ﻷ' 65272 'ﻸ' 65273 'ﻹ' 65274'
There are some characters which can not be normalized, they throw ArgumentException if tried
Total:2081
Characters(int value): 55296-57343, 64976-65007, 65534
This links can be really helpful to understand what rules govern for Unicode equivalence
Unicode_equivalence
Unicode_compatibility_characters

Most likely, there are two different character codes that make (visibly) the same character. While technically not equal, they look equal. Have a look at the character table and see whether there are multiple instances of that character. Or print out the character code of the two chars in your code.

You ask "how to compare them" but you don't tell us what you want to do.
There are at least two main ways to compare them:
Either you compare them directly as you are and they are different
Or you use Unicode Compatibility Normalization if your need is for a comparison that finds them to match.
There could be a problem though because Unicode compatibility normalization will make many other characters compare equal. If you want only these two characters to be treated as alike you should roll your own normalization or comparison functions.
For a more specific solution we need to know your specific problem. What is the context under which you came across this problem?

If I would like to be pedantic, I would say that your question doesn't make sense, but since we are approaching christmas and the birds are singing, I'll proceed with this.
First off, the 2 entities that you are trying to compare are glyphs, a glyph is part of a set of glyphs provided by what is usually know as a "font", the thing that usually comes in a ttf, otf or whatever file format you are using.
The glyphs are a representation of a given symbol, and since they are a representation that depends on a specific set, you can't just expect to have 2 similar or even "better" identical symbols, it's a phrase that doesn't make sense if you consider the context, you should at least specify what font or set of glyphs you are considering when you formulate a question like this.
What is usually used to solve a problem similar to the one that you are encountering, it's an OCR, essentially a software that recognize and compares glyphs, If C# provides an OCR by default I don't know that, but it's generally a really bad idea if you don't really need an OCR and you know what to do with it.
You can possibly end up interpreting a physics book as an ancient greek book without mentioning the fact that OCR are generally expensive in terms of resources.
There is a reason why those characters are localized the way they are localized, just don't do that.

It's possible to draw both of chars with the same font style and size with DrawString method. After two bitmaps with symbols has been generated, it's possible to compare them pixel by pixel.
Advantage of this method is that you can compare not only absolute equal charcters, but similar too (with definite tolerance).

How do I convert from unicode to single byte in C#?

How do I convert from unicode to single byte in C#?
This does not work:
int level =1;
string argument;
// and then argument is assigned
if (argument[2] == Convert.ToChar(level))
{
// does not work
}
And this:
char test1 = argument[2];
char test2 = Convert.ToChar(level);
produces funky results. test1 can be: 49 '1' while test2 will be 1 ''

How do I convert from unicode to single byte in C#?
This question makes no sense, and the sample code just makes things worse.
Unicode is a mapping from characters to code points. The code points are numbered from 0x0 to 0x10FFFF, which is far more values than can be stored in a single byte.
And the sample code has an int, a string, and a char. There are no bytes anywhere.
What are you really trying to do?

Use UnicodeEncoding.GetBytes().
UnicodeEncoding unicode = new UnicodeEncoding();
Byte[] encodedBytes = unicode.GetBytes(unicodeString);

char and string are always Unicode in .NET. You can't do it the way you're trying.
In fact, what are you trying to accomplish?

If you want to test whether the int level matches the char argument[2] then use
if (argument[2] == Convert.ToChar(level + (int)'0'))

Best way to decode hex sequence of unicode characters to string

I'm working with C# .Net
I would like to know how to convert a Unicode form string like "\u1D0EC"
(note that it's above "\uFFFF") to it's symbol... "𝃬"
Thanks For Advance!!!

That Unicode codepoint is encoded in UTF32. .NET and Windows encode Unicode in UTF16, you'll have to translate. UTF16 uses "surrogate pairs" to handle codepoints above 0xffff, a similar kind of approach as UTF8. The first code of the pair is 0xd800..dbff, the second code is 0xdc00..dfff. Try this sample code to see that at work:
using System;
using System.Text;
class Program {
static void Main(string[] args) {
uint utf32 = uint.Parse("1D0EC", System.Globalization.NumberStyles.HexNumber);
string s = Encoding.UTF32.GetString(BitConverter.GetBytes(utf32));
foreach (char c in s.ToCharArray()) {
Console.WriteLine("{0:X}", (uint)c);
}
Console.ReadLine();
}
}

Convert each sequence with int.Parse(String, NumberStyles) and char.ConvertFromUtf32:
string s = #"\U1D0EC";
string converted = char.ConvertFromUtf32(int.Parse(s.Substring(2), NumberStyles.HexNumber));

I have recently push my FOSS Uncode Converter at Codeplex (http://unicode.codeplex.com)
you can convert whatever you want to Hex code and from Hex code to get the right character, also there is a full information character database.
I use this code
public static char ConvertHexToUnicode(string hexCode)
{
if (hexCode != string.Empty)
return ((char)int.Parse(hexCode, NumberStyles.AllowHexSpecifier));
char empty = new char();
return empty;
}//end
you can see entire code on the http://unicode.codeplex.com/

It appears you just want this in your code... you can type it as a string literal using the escape code \Uxxxxxxxx (note that this is a capital U, and there must be 8 digits). For this example, it would be: "\U0001D0EC".

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

C# char/byte encoding equality - c#

You're looking for the String.Normalize method.

Related

How to perform mutliple Replace calls at once

Converting a string getting Wrong value

How to compare Unicode characters that "look alike"?

How do I convert from unicode to single byte in C#?

Best way to decode hex sequence of unicode characters to string

Categories

Resources