Character ä is represented in different Char Codes in the same string

Character ä is represented in different Char Codes in the same string - c#

I have a on web uploaded File Name "Schränke Wintsch.pdf".
The file Name is saved in a XML file like so:
<File>Schra?nke Wintsch.pdf</File>
If I debug this in c# and maunally add an ä, then it is saved correctly.
<File>Schra?nke Wintsch-ä.pdf</File>
OK I know it is an Encoding Problem.
But why is the same ä character represented with different char codes(example on Img 2)?

XML defines the encoding used within the document using the header. It will look something like this.. <?xml version="1.0" encoding="ISO-8859-9" ?>.
If you append the string make sure to use the same encoding to avoid a mismatch.
Test appending the char bytes and see if that helps.
var en = Encoding.GetEncoding("ISO-8859-9");
en.GetString(Encoding.GetBytes("ä"));

The original XML that you have is using the Unicode 'COMBINING DIAERESIS' code (int value 776) to use two characters to representä.
(Note how the combining character has been displayed as ? in the <File>Schra?nke Wintsch.pdf</File> image in your post.)
The 776 code says to put the double-dots above the previous character (an a).
However, where you typed in the ä it has been stored as the unicode character with code 228.
The question you need to answer is: Why is the original source XML using the "Combining Diaeresis" character rather than the more usual ä? (Without knowing the origin of the XML file, we cannot answer that question.)
Incidentally, you can "normalise" those sorts of characters by using string.Normalize(), as demonstrated by the following program:
using System;
namespace Demo
{
static class Program
{
static void Main()
{
char[] a = {(char)97, (char)776};
string s = new string(a);
Console.WriteLine(s + " -> " + s.Length); // Prints a¨ -> 2
var t = s.Normalize();
Console.WriteLine(t + " -> " + t.Length); // Prints ä -> 1
}
}
}
Note how the length of s is 2, but the length of t is only 1 (and it contains the single character ä).
So you might be able to improve things by using string.Normalize() to normalise these unexpected characters.

string.Normalize() ist the working solution for the string "Schränke Wintsch-ä.pdf". So it ist correctly saved as Schränke Wintsch-ä.pdf

Related

Replace text in file regardless of the end of the line

I've been working on a tool to modify a text file to change graphics settings for a game. A few examples of the settings are as follows:
sg.ShadowQuality=0
ResolutionSizeX=1440
ResolutionSizeY=1080
bUseVSync=False
I want to be able to find sg.ShadowQuality=(rest of line, regardless of what is after this text), and replace it. This is so that a user can set this to say, 10 then 1 without having to check for 10 and 1 etc.
Basically, I'm try to find out what I need to use to find/replace a string in a text file without knowing the end of the string.
My current code looks like:
FileInfo GameUserSettings = new FileInfo(#SD + GUSDirectory);
GameUserSettings.IsReadOnly = false;
string text = File.ReadAllText(SD + GUSDirectory);
text = text.Replace("sg.ShadowQuality=0", "sg.ShadowQuality=" + Shadows.Value.ToString());
File.WriteAllText(SD + GUSDirectory, text);
text = text.Replace("sg.ShadowQuality=1", "sg.ShadowQuality=" + Shadows.Value.ToString());
File.WriteAllText(SD + GUSDirectory, text);
SD + GUSDirectory is the location of the text file.
The file must have readonly Off to be edited, otherwise the game can revert the settings back, hence the need for this.(It is turned back to readonly On after any change, its just not included in this code provided)

You can do it like you do, if you use a regular expression to match all the line
FileInfo gameUserSettings = new FileInfo(Path.Combine(#SD, GUSDirectory)); //name local varaible in camelCase, use Path.Combine to combine paths
gameUserSettings.IsReadOnly = false;
string text = File.ReadAllText(gameUserSettings.FullName); //use the fileinfo you just made rather than make the path again
text = Regex.Replace(text, "^sg[.]ShadowQuality=.*$", $"sg.ShadowQuality={Shadows.Value}", RegexOptions.Multiline); //note switch to interpolated strings
File.WriteAllText(gameUserSettings.FullName, text);
That regex is a Multiline one (so ^ and $ have altered meanings):
^sg[.]ShadowQuality=.*$
start of line ^ (not start of input)
followed by sg
followed by period . (in a character class it loses its "any character" meaning)
followed by ShadowQuality=
followed by any number of any character(.*)
followed by end of line $ (not end of input)
The vital bit is "any number of any character" that can cope with the vlaue in the file being 1, 2, 7, hello and so on..
The replacement is:
$"sg.ShadowQuality={Shadows.Value}"
This is an interpolated string; a neater way of representing strings that mix constant content (hardcoded chars) and variable content. When a $tring contains a { that "breaks out" of the string and back into normal c# code so you can write code that resolves to values that will be included in the string -> if Shadows.Value is for example a decimal? of 1.23 it will become 1.23
You can format data too; calling for $"to one dp is {Shadows.Value:F1}" would produce "to one dp is 1.2" - the 1.23 is formatted to 1 decimal place by the F1, just like calling Shadows.Value.ToString("F1") would

Delete a specific string from a file in C#

Title says it all, I have a file called test.txt with these contents:
Hello from th [BACK]e
This i [BACK]s line two.
Here, [BACK] is just a visible representation of the backspace. So a i [BACK]s would mean is. Because a backspace is implemented after a space and i.
So basically, at the click of a button, i should be able to access this file and remove ALL strings containing the word [BACK]-1. -1 implemented because a [BACK] means a backspace and is used to remove the last string before the word [BACK].
EDIT:
This time i replaced [BACK] with [SDRWUE49CDKAS]. Just to make it a unique string. I also tested on another file. This time a .html with following contents:
Alpha, Brav [SDRWUE49CDKAS]o, Charlie, Dr [SDRWUE49CDKAS][SDRWUE49CDKAS]elta, Echo.
// ^^Implementing "backspace" ^^Here doing it double because we made a mistake in spelling "Delta"
//These sentences should be Alpha, Bravo, Charlie, Delta, Echo
Did some experimenting and tested it out with this code:
string s = File.ReadAllText(path2html);
string line = "";
string contents = File.ReadAllText(path2html);
if (contents.Contains("[SDRWUE49CDKAS]"))
{
System.IO.StreamWriter sw = new System.IO.StreamWriter(path2html);
s = s.Remove(s.LastIndexOf(line + "[SDRWUE49CDKAS]") - 2, 15);
sw.WriteLine(s);
sw.Close();
}
The edited code above will give me an output of actually deleting [SDRWUE49CDKAS] but not exactly how i wanted it to be:
Alpha, BraS]o, Charlie, DS]elta, Echo.
This really caused some confusion with the testing. And also not to mention that i had to run this code 3 times, because we had 3 x [SDRWUE49CDKAS]. So a loop will do good. I checked out a bunch of similar problems on the web, but couldn't find a one working. I'm tryna test out this one too. But it's using a StreamReader and a StreamWriter at the same time. Or maybe i should make a copy the original, and make a temp file?

var s = "i [BACK]s";
s = s.Remove(s.IndexOf("[BACK]")-1, 1 + "[BACK]".Length);
result is "is".
Explanation:
Find the start-index of [Back]
Go back one position to start at the char before
Remove from there that extra char + the marker chars
But there are several issues:
This assumes that that search string is not at the start of that string (there is a char to remove)
Plus it only removes the first occurrence so you will have to repeat this until all are removed
And it doesn't handle "surrogate pairs"

Use Regex. You may have multiple spaces or no space is it is at the beginning of the line
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
string input =
"Hello from th [BACK]e\n" +
"This i [BACK]s line two.\n";
string pattern = #"\s*\[BACK\]";
string output = Regex.Replace(input, pattern, "");
}
}

String equivalence in C# requires encoding match?

I've been struggling with a problem for a few days and have finally worked out what's going wrong but I've only been able to find contradicting answers on StackOverflow (et al) so would like to ask for an explanation of what's going on.
For example this link (in common with many other reference for example this one, or these seemingly go-to references on the topic by Jon Skeet here and here) states that "A string in C# is always UTF-16 [Unicode?], there is no way to "convert" it. The encoding is irrelevant as long as you manipulate the string in memory, it only matters if you write the string to a stream (file, memory stream, network stream...)."
The much simplified Test case I've built to demonstrate my issue is as below, it's probably not copy paste replicable as it depends on some of the strings to have a different encoding, but believe me the test passes as written. I'm using VS2012 Update 4.
The oddity is that the following two lines pass.
Assert.IsFalse(copiedFromXmlDoubleQuote == copiedFromXmlEscapedQuote);
Assert.AreNotEqual(copiedFromXmlDoubleQuote, copiedFromXmlEscapedQuote);
The identical strings fail equivalency as they are encoded differently (copiedFromXmlDoubleQuote had the \ replaced by " in the editor).
All this suggests that the Visual Studio editor is encoding aware, and the strings that the code declares are also encoding aware. My question is, have I done something stupid or can anyone please concur with my findings and if possible refer me to something that will help clarify what the story is with string encoding equivalence... As I'm going to be working in an Xml world a lot is it best practice to explicitly convert everything to Unicode at point of deserialization, and recode it as required when serializing out again?
[TestMethod]
public void EscapedCharacterDoesNotEqualLiteralString()
{
string actual = "\"";
Assert.AreEqual("\"", actual);
Assert.AreEqual(#"""", actual);
string typedEscapedQuote = "<?xml version=\"1.0\" encoding=\"utf-16\"?>";
string typedDoubleQuote = #"<?xml version=""1.0"" encoding=""utf-16""?>";
Assert.IsTrue(typedDoubleQuote == typedEscapedQuote);
Assert.AreEqual(typedDoubleQuote, typedEscapedQuote);
string copiedFromXmlEscapedQuote = "<?xml version=\"1.0\" encoding=\"utf-16\"?>";
string copiedFromXmlDoubleQuote = #"<?xml version=""1.0"" encoding=""utf-16""?>";
Assert.IsFalse(copiedFromXmlDoubleQuote == copiedFromXmlEscapedQuote);
Assert.AreNotEqual(copiedFromXmlDoubleQuote, copiedFromXmlEscapedQuote);
Assert.IsTrue(copiedFromXmlDoubleQuote.ToUnicode() == copiedFromXmlEscapedQuote.ToUnicode());
Assert.AreEqual(copiedFromXmlDoubleQuote.ToUnicode(), copiedFromXmlEscapedQuote.ToUnicode());
}
private static string BytesToString(byte[] bytes, Encoding encoding)
{
using (MemoryStream ms = new MemoryStream(bytes))
{
using (StreamReader sr = new StreamReader(ms, encoding))
{
string s = sr.ReadToEnd();
sr.Close();
return s;
}
}
}
public static string ToUnicode(this string s)
{
return BytesToString(new UnicodeEncoding().GetBytes(s), Encoding.Unicode);
}
I've loaded an example Vs2012 sln in a zip here

My initial check of your ZIP file shows that
static string copiedFromXmlEscapedQuote = "<?xml version=\"1.0\" encoding=\"utf-16\"?>";
static string copiedFromXmlDoubleQuote = #"<?xml version=""1.0"" encoding=""utf-16""?>";
? copiedFromXmlEscapedQuote.Length
39
? copiedFromXmlDoubleQuote.Length
40
The first check for string equivalence in the .net framework is length check - it doesn't bother checking the content if the strings are different lengths.
Further checking;
? copiedFromXmlDoubleQuote.Last()
62 '>'
? copiedFromXmlEscapedQuote.Last()
62 '>'
? copiedFromXmlEscapedQuote.First()
60 '<'
? copiedFromXmlDoubleQuote.First()
65279 ''
So its the first char which is different. The value of 65279 is covered in this article. What is this char? 65279 ''.
It seems you are correct - it is the VS.net editor which is preserving the BOM char, and opening the program file in the binary editor shows these are different, so I'm guessing the use of # in VS.net tells the compiler to open the following bytes using a different encoder.

How to compare Unicode characters that "look alike"?

I fall into a surprising issue.
I loaded a text file in my application and I have some logic which compares the value having µ.
And I realized that even if the texts are same the compare value is false.
Console.WriteLine("μ".Equals("µ")); // returns false
Console.WriteLine("µ".Equals("µ")); // return true
In later line the character µ is copy pasted.
However, these might not be the only characters that are like this.
Is there any way in C# to compare the characters which look the same but are actually different?

Because it is really different symbols even they look the same, first is the actual letter and has char code = 956 (0x3BC) and the second is the micro sign and has 181 (0xB5).
References:
Unicode Character 'GREEK SMALL LETTER MU' (U+03BC)
Unicode Character 'MICRO SIGN' (U+00B5)
So if you want to compare them and you need them to be equal, you need to handle it manually, or replace one char with another before comparison. Or use the following code:
public void Main()
{
var s1 = "μ";
var s2 = "µ";
Console.WriteLine(s1.Equals(s2)); // false
Console.WriteLine(RemoveDiacritics(s1).Equals(RemoveDiacritics(s2))); // true
}
static string RemoveDiacritics(string text)
{
var normalizedString = text.Normalize(NormalizationForm.FormKC);
var stringBuilder = new StringBuilder();
foreach (var c in normalizedString)
{
var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
if (unicodeCategory != UnicodeCategory.NonSpacingMark)
{
stringBuilder.Append(c);
}
}
return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}
And the Demo

In many cases, you can normalize both of the Unicode characters to a certain normalization form before comparing them, and they should be able to match. Of course, which normalization form you need to use depends on the characters themselves; just because they look alike doesn't necessarily mean they represent the same character. You also need to consider if it's appropriate for your use case — see Jukka K. Korpela's comment.
For this particular situation, if you refer to the links in Tony's answer, you'll see that the table for U+00B5 says:
Decomposition <compat> GREEK SMALL LETTER MU (U+03BC)
This means U+00B5, the second character in your original comparison, can be decomposed to U+03BC, the first character.
So you'll normalize the characters using full compatibility decomposition, with the normalization forms KC or KD. Here's a quick example I wrote up to demonstrate:
using System;
using System.Text;
class Program
{
static void Main(string[] args)
{
char first = 'μ';
char second = 'µ';
// Technically you only need to normalize U+00B5 to obtain U+03BC, but
// if you're unsure which character is which, you can safely normalize both
string firstNormalized = first.ToString().Normalize(NormalizationForm.FormKD);
string secondNormalized = second.ToString().Normalize(NormalizationForm.FormKD);
Console.WriteLine(first.Equals(second)); // False
Console.WriteLine(firstNormalized.Equals(secondNormalized)); // True
}
}
For details on Unicode normalization and the different normalization forms refer to System.Text.NormalizationForm and the Unicode spec.

They both have different character codes: Refer this for more details
Console.WriteLine((int)'μ'); //956
Console.WriteLine((int)'µ'); //181
Where, 1st one is:
Display Friendly Code Decimal Code Hex Code Description
====================================================================
μ μ μ μ Lowercase Mu
µ µ µ µ micro sign Mu

For the specific example of μ (mu) and µ (micro sign), the latter has a compatibility decomposition to the former, so you can normalize the string to FormKC or FormKD to convert the micro signs to mus.
However, there are lots of sets of characters that look alike but aren't equivalent under any Unicode normalization form. For example, A (Latin), Α (Greek), and А (Cyrillic). The Unicode website has a confusables.txt file with a list of these, intended to help developers guard against homograph attacks. If necessary, you could parse this file and build a table for “visual normalization” of strings.

Search both characters in a Unicode database and see the difference.
One is the Greek small Letter µ and the other is the Micro Sign µ.
Name : MICRO SIGN
Block : Latin-1 Supplement
Category : Letter, Lowercase [Ll]
Combine : 0
BIDI : Left-to-Right [L]
Decomposition : <compat> GREEK SMALL LETTER MU (U+03BC)
Mirror : N
Index entries : MICRO SIGN
Upper case : U+039C
Title case : U+039C
Version : Unicode 1.1.0 (June, 1993)
Name : GREEK SMALL LETTER MU
Block : Greek and Coptic
Category : Letter, Lowercase [Ll]
Combine : 0
BIDI : Left-to-Right [L]
Mirror : N
Upper case : U+039C
Title case : U+039C
See Also : micro sign U+00B5
Version : Unicode 1.1.0 (June, 1993)

EDIT After the merge of this question with How to compare 'μ' and 'µ' in C#
Original answer posted:
"μ".ToUpper().Equals("µ".ToUpper()); //This always return true.
EDIT
After reading the comments, yes it is not good to use the above method because it may provide wrong results for some other type of inputs, for this we should use normalize using full compatibility decomposition as mentioned in wiki. (Thanks to the answer posted by BoltClock)
static string GREEK_SMALL_LETTER_MU = new String(new char[] { '\u03BC' });
static string MICRO_SIGN = new String(new char[] { '\u00B5' });
public static void Main()
{
string Mus = "µμ";
string NormalizedString = null;
int i = 0;
do
{
string OriginalUnicodeString = Mus[i].ToString();
if (OriginalUnicodeString.Equals(GREEK_SMALL_LETTER_MU))
Console.WriteLine(" INFORMATIO ABOUT GREEK_SMALL_LETTER_MU");
else if (OriginalUnicodeString.Equals(MICRO_SIGN))
Console.WriteLine(" INFORMATIO ABOUT MICRO_SIGN");
Console.WriteLine();
ShowHexaDecimal(OriginalUnicodeString);
Console.WriteLine("Unicode character category " + CharUnicodeInfo.GetUnicodeCategory(Mus[i]));
NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormC);
Console.Write("Form C Normalized: ");
ShowHexaDecimal(NormalizedString);
NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormD);
Console.Write("Form D Normalized: ");
ShowHexaDecimal(NormalizedString);
NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormKC);
Console.Write("Form KC Normalized: ");
ShowHexaDecimal(NormalizedString);
NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormKD);
Console.Write("Form KD Normalized: ");
ShowHexaDecimal(NormalizedString);
Console.WriteLine("_______________________________________________________________");
i++;
} while (i < 2);
Console.ReadLine();
}
private static void ShowHexaDecimal(string UnicodeString)
{
Console.Write("Hexa-Decimal Characters of " + UnicodeString + " are ");
foreach (short x in UnicodeString.ToCharArray())
{
Console.Write("{0:X4} ", x);
}
Console.WriteLine();
}
Output
INFORMATIO ABOUT MICRO_SIGN
Hexa-Decimal Characters of µ are 00B5
Unicode character category LowercaseLetter
Form C Normalized: Hexa-Decimal Characters of µ are 00B5
Form D Normalized: Hexa-Decimal Characters of µ are 00B5
Form KC Normalized: Hexa-Decimal Characters of µ are 03BC
Form KD Normalized: Hexa-Decimal Characters of µ are 03BC
________________________________________________________________
INFORMATIO ABOUT GREEK_SMALL_LETTER_MU
Hexa-Decimal Characters of µ are 03BC
Unicode character category LowercaseLetter
Form C Normalized: Hexa-Decimal Characters of µ are 03BC
Form D Normalized: Hexa-Decimal Characters of µ are 03BC
Form KC Normalized: Hexa-Decimal Characters of µ are 03BC
Form KD Normalized: Hexa-Decimal Characters of µ are 03BC
________________________________________________________________
While reading information in Unicode_equivalence I found
The choice of equivalence criteria can affect search results. For instance some typographic ligatures like U+FB03 (ﬃ), ..... so a search for U+0066 (f) as substring would succeed in an NFKC normalization of U+FB03 but not in NFC normalization of U+FB03.
So to compare equivalence we should normally use FormKC i.e. NFKC normalization or FormKD i.e NFKD normalization.
I was little curious to know more about all the Unicode characters so I made sample which would iterate over all the Unicode character in UTF-16 and I got some results I want to discuss
Information about characters whose FormC and FormD normalized values were not equivalent
Total: 12,118
Character (int value): 192-197, 199-207, 209-214, 217-221, 224-253, ..... 44032-55203
Information about characters whose FormKC and FormKD normalized values were not equivalent
Total: 12,245
Character (int value): 192-197, 199-207, 209-214, 217-221, 224-228, ..... 44032-55203, 64420-64421, 64432-64433, 64490-64507, 64512-64516, 64612-64617, 64663-64667, 64735-64736, 65153-65164, 65269-65274
All the character whose FormC and FormD normalized value were not equivalent, there FormKC and FormKD normalized values were also not equivalent except these characters
Characters: 901 '΅', 8129 '῁', 8141 '῍', 8142 '῎', 8143 '῏', 8157 '῝', 8158 '῞'
, 8159 '῟', 8173 '῭', 8174 '΅'
Extra character whose FormKC and FormKD normalized value were not equivalent, but there FormC and FormD normalized values were equivalent
Total: 119
Characters: 452 'Ǆ' 453 'ǅ' 454 'ǆ' 12814 '㈎' 12815 '㈏' 12816 '㈐' 12817 '㈑' 12818 '㈒'
12819 '㈓' 12820 '㈔' 12821 '㈕', 12822 '㈖' 12823 '㈗' 12824 '㈘' 12825 '㈙' 12826 '㈚'
12827 '㈛' 12828 '㈜' 12829 '㈝' 12830 '㈞' 12910 '㉮' 12911 '㉯' 12912 '㉰' 12913 '㉱'
12914 '㉲' 12915 '㉳' 12916 '㉴' 12917 '㉵' 12918 '㉶' 12919 '㉷' 12920 '㉸' 12921 '㉹' 12922 '㉺' 12923 '㉻' 12924 '㉼' 12925 '㉽' 12926 '㉾' 13056 '㌀' 13058 '㌂' 13060 '㌄' 13063 '㌇' 13070 '㌎' 13071 '㌏' 13072 '㌐' 13073 '㌑' 13075 '㌓' 13077 '㌕' 13080 '㌘' 13081 '㌙' 13082 '㌚' 13086 '㌞' 13089 '㌡' 13092 '㌤' 13093 '㌥' 13094 '㌦' 13099 '㌫' 13100 '㌬' 13101 '㌭' 13102 '㌮' 13103 '㌯' 13104 '㌰' 13105 '㌱' 13106 '㌲' 13108 '㌴' 13111 '㌷' 13112 '㌸' 13114 '㌺' 13115 '㌻' 13116 '㌼' 13117 '㌽' 13118 '㌾' 13120 '㍀' 13130 '㍊' 13131 '㍋' 13132 '㍌' 13134 '㍎' 13139 '㍓' 13140 '㍔' 13142 '㍖' .......... ﺋ' 65164 'ﺌ' 65269 'ﻵ' 65270 'ﻶ' 65271 'ﻷ' 65272 'ﻸ' 65273 'ﻹ' 65274'
There are some characters which can not be normalized, they throw ArgumentException if tried
Total:2081
Characters(int value): 55296-57343, 64976-65007, 65534
This links can be really helpful to understand what rules govern for Unicode equivalence
Unicode_equivalence
Unicode_compatibility_characters

Most likely, there are two different character codes that make (visibly) the same character. While technically not equal, they look equal. Have a look at the character table and see whether there are multiple instances of that character. Or print out the character code of the two chars in your code.

You ask "how to compare them" but you don't tell us what you want to do.
There are at least two main ways to compare them:
Either you compare them directly as you are and they are different
Or you use Unicode Compatibility Normalization if your need is for a comparison that finds them to match.
There could be a problem though because Unicode compatibility normalization will make many other characters compare equal. If you want only these two characters to be treated as alike you should roll your own normalization or comparison functions.
For a more specific solution we need to know your specific problem. What is the context under which you came across this problem?

If I would like to be pedantic, I would say that your question doesn't make sense, but since we are approaching christmas and the birds are singing, I'll proceed with this.
First off, the 2 entities that you are trying to compare are glyphs, a glyph is part of a set of glyphs provided by what is usually know as a "font", the thing that usually comes in a ttf, otf or whatever file format you are using.
The glyphs are a representation of a given symbol, and since they are a representation that depends on a specific set, you can't just expect to have 2 similar or even "better" identical symbols, it's a phrase that doesn't make sense if you consider the context, you should at least specify what font or set of glyphs you are considering when you formulate a question like this.
What is usually used to solve a problem similar to the one that you are encountering, it's an OCR, essentially a software that recognize and compares glyphs, If C# provides an OCR by default I don't know that, but it's generally a really bad idea if you don't really need an OCR and you know what to do with it.
You can possibly end up interpreting a physics book as an ancient greek book without mentioning the fact that OCR are generally expensive in terms of resources.
There is a reason why those characters are localized the way they are localized, just don't do that.

It's possible to draw both of chars with the same font style and size with DrawString method. After two bitmaps with symbols has been generated, it's possible to compare them pixel by pixel.
Advantage of this method is that you can compare not only absolute equal charcters, but similar too (with definite tolerance).

How to prevent conversion of Windows-1252 argument into a Unicode string?

I've written my first COM classes. My unit tests work fine, but my first use of the COM objects has hit a snag.
The COM classes provide methods which accept a string, manipulate it and return a string. The consumer of the COM objects is a dBASE PLUS program.
When the input string contains common keyboard characters (ASCII 127 or lower), the COM methods work fine. However, if the string contains characters beyond the ASCII range, some of them get remapped from Windows-1252 to C#'s Unicode. This table shows the mapping that takes place: http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
For example, if the dBASE program calls the COM object with:
oMyComObject.MyMethod("It will cost€123") where the € is hex 80,
the C# method receives it as Unicode:
public string MyMethod(string source)
{
// source is Unicode and now the Euro symbol is hex 20AC
...
}
I would like to avoid this remapping because I want the original hex content of the string.
I've tried adding the following to MyMethod to convert the string back to Windows-1252, but the Euro symbol gets lost because it becomes a question mark:
byte[] UnicodeBytes = Encoding.Unicode.GetBytes(source.ToString());
byte[] Win1252Bytes = Encoding.Convert(Encoding.Unicode, Encoding.GetEncoding(1252), UnicodeBytes);
string Win1252 = Encoding.GetEncoding(1252).GetString(Win1252Bytes);
Is there a way to prevent this conversion of the "source" parameter to Unicode? Or, is there a way to convert it 100% from Unicode back to Windows-1252?

Yes, I'm answering my own question. The answer by "Jigsore" put me on the right track, but I want to explain more clearly in case someone else makes the same mistake I made.
I eventually figured out that I had misdiagnosed the problem. dBASE was passing the string fine and C# was receiving it fine. It was how I checked the contents of the string that was in error.
This turnkey builds on Jigsore's answer:
void Main()
{
string unicodeText = "\u20AC\u0160\u0152\u0161";
byte[] unicodeBytes = Encoding.Unicode.GetBytes(unicodeText);
byte[] win1252bytes = Encoding.Convert(Encoding.Unicode, Encoding.GetEncoding(1252), unicodeBytes);
for (int i = 0; i < win1252bytes.Length; i++)
Console.Write("0x{0:X2} ", win1252bytes[i]); // output: 0x80 0x8A 0x8C 0x9A
// win1252String represents the string passed from dBASE to C#
string win1252String = Encoding.GetEncoding(1252).GetString(win1252bytes);
Console.WriteLine("\r\nWin1252 string is " + win1252String); // output: Win1252 string is €ŠŒš
Console.WriteLine("looking at the code of the first character the wrong way: " + (int)win1252String[0]);
// output: looking at the code of the first character the wrong way: 8364
byte[] bytes = Encoding.GetEncoding(1252).GetBytes(win1252String[0].ToString());
Console.WriteLine("looking at the code of the first character the right way: " + bytes[0]);
// output: looking at the code of the first character the right way: 128
// Warning: If your input contains character codes which are large in value than what a byte
// can hold (ex: multi-byte Chinese characters), then you will need to look at more than just bytes[0].
}
The reason the first method was wrong is that casting (int)win1252String[0] (or the converse of casting an integer j to a character with (char)j) involves an implicit conversion with the Unicode character set C# uses.
I consider this resolved and would like to thank each person who took the time to comment or answer for their time and trouble. It is appreciated!

Actually you're doing the Unicode to Win-1252 conversion correctly, but you're performing an extra step. The original Win1252 codes are in the Win1252Bytes array!
Check the following code:
string unicodeText = "\u20AC\u0160\u0152\u0161";
byte[] unicodeBytes = Encoding.Unicode.GetBytes(unicodeText);
byte[] win1252bytes = Encoding.Convert(Encoding.Unicode, Encoding.GetEncoding(1252), unicodeBytes);
for (i = 0; i < win1252bytes.Length; i++)
Console.Write("0x{0:X2} ", win1252bytes[i]);
The output shows the Win-1252 codes for the unicodeText string, you can check this by looking at the CP1252.TXT table.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Character ä is represented in different Char Codes in the same string - c#

string.Normalize() ist the working solution for the string "Schränke Wintsch-ä.pdf". So it ist correctly saved as Schränke Wintsch-ä.pdf

Related

Replace text in file regardless of the end of the line

Delete a specific string from a file in C#

String equivalence in C# requires encoding match?

How to compare Unicode characters that "look alike"?

How to prevent conversion of Windows-1252 argument into a Unicode string?

Categories

Resources