How to compare characters (respecting a culture) - c#

For my answer in this question I have to compare two characters. I thought that the normal char.CompareTo() method would allow me to specify a CultureInfo, but that's not the case.
So my question is: How can I compare two characters and specify a CultureInfo for the comparison?

There is no culture-aware comparison for characters; you have to convert the characters to strings so that you can use, for example, the String.Compare(string, string, CultureInfo, CompareOptions) method.
Example:
char a = 'å';
char b = 'ä';
// outputs -1:
Console.WriteLine(String.Compare(
a.ToString(),
b.ToString(),
CultureInfo.GetCultureInfo("sv-SE"),
CompareOptions.IgnoreCase
));
// outputs 1:
Console.WriteLine(String.Compare(
a.ToString(),
b.ToString(),
CultureInfo.GetCultureInfo("en-GB"),
CompareOptions.IgnoreCase
));

There is indeed a difference between comparing characters and strings. Let me try to explain the basic issue, which is quite simple: a char always represents a single UTF-16 code unit (which is a whole Unicode code point for any character in the Basic Multilingual Plane). Comparing chars always compares these numeric values, without any regard for whether the characters are equivalent in meaning.
If you want to compare characters for equal meaning, you need to create a string and use the comparison methods provided there. These include support for different cultures. See Guffa's answer on how to do that.
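As a rough sketch of that difference: char.CompareTo only looks at the numeric values, while the string overload accepts a culture. The class and helper names here (CharCompareSketch, CompareChars) are made up for illustration, and the sign of the culture-aware result depends on the collation data the runtime uses, so no output is claimed for it.

```csharp
using System;
using System.Globalization;

public static class CharCompareSketch
{
    // Hypothetical helper: culture-aware comparison of two chars via strings.
    public static int CompareChars(char x, char y, CultureInfo culture)
    {
        return string.Compare(x.ToString(), y.ToString(), culture, CompareOptions.None);
    }

    public static void Main()
    {
        char a = '\u00E5'; // å
        char b = '\u00E4'; // ä

        // Numeric comparison only: 0xE5 is greater than 0xE4.
        Console.WriteLine(a.CompareTo(b) > 0); // True

        // Culture-aware: the sign depends on the culture's collation rules.
        CultureInfo swedish = CultureInfo.GetCultureInfo("sv-SE");
        Console.WriteLine(Math.Sign(CompareChars(a, b, swedish)));
    }
}
```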

Did you try the String.Compare method?
The comparison uses the current culture to obtain culture-specific information such as casing rules and the alphabetic order of individual characters. For example, a culture could specify that certain combinations of characters be treated as a single character, or uppercase and lowercase characters be compared in a particular way, or that the sorting order of a character depends on the characters that precede or follow it.
String.Compare(str1, str2, false, new CultureInfo("en-US"))

I don't think CultureInfo matters when comparing chars in C#. A char is already a Unicode character, so two characters can easily be compared without a CultureInfo.

Related

Why does float.Parse("123,45") not throw an exception

... but returns 12345?
The doc for Single.Parse says:
Exceptions
...
FormatException
s does not represent a numeric value.
...
As I understand it, "123,45" doesn't represent a proper numeric value (in countries that use the comma as the thousands separator).
The system's CultureInfo has:
CultureInfo.CurrentCulture.NumberFormat.NumberDecimalSeparator == "."
CultureInfo.CurrentCulture.NumberFormat.NumberGroupSeparator == ","
CultureInfo.CurrentCulture.NumberFormat.NumberGroupSizes == [3]
Apparently the comma is simply ignored, which leads to even more irritating results: "123,45.67" or "1,23,45.67", which look utterly wrong, become 12345.67.
Supplementary question
I don't get what this sentence in the doc is supposed to mean and whether this is relevant for this case:
If a separator is encountered in the s parameter during a parse operation, and the applicable currency or number decimal and group separators are the same, the parse operation assumes that the separator is a decimal separator rather than a group separator.
In the default and US culture, the comma (,) is legal as a separator between groups. Think of larger numbers like this:
987,654,321
That it's in the wrong place for a group doesn't really matter; the parser isn't that smart. It just ignores the separator.
For the supplemental question, some cultures use commas as the decimal separator, rather than a group separator. This part of the documentation clarifies what will happen if the group separator and decimal separator are somehow set to the same character.
As Joel said, "the parser isn't that smart". The source code is available, so here's the proof.
The code for Single.Parse ends up calling Number.ParseNumber.
Interestingly, Number.ParseNumber is given a NumberFormatInfo object, which does have a NumberGroupSizes property, which defines "the number of digits in each group to the left of the decimal".
However, you'll notice that on line 851, where it checks for the group separator, it doesn't bother to reference the NumberGroupSizes property to check if the group separator is in an expected position. In fact Number.ParseNumber never uses the NumberGroupSizes property.
NumberFormatInfo.NumberGroupSizes is only ever used when converting a number to a string.
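If misplaced group separators should be rejected, one option is to parse without NumberStyles.AllowThousands, so that any group separator makes the input invalid. This is only a sketch under that assumption; GroupSeparatorSketch and TryParseStrict are hypothetical names, and the invariant culture is used so that "." is the decimal separator and "," the group separator.

```csharp
using System;
using System.Globalization;

public static class GroupSeparatorSketch
{
    // Hypothetical helper: NumberStyles.Float does not include AllowThousands,
    // so input containing a group separator is rejected.
    public static bool TryParseStrict(string s, out float value)
    {
        return float.TryParse(s, NumberStyles.Float, CultureInfo.InvariantCulture, out value);
    }

    public static void Main()
    {
        CultureInfo inv = CultureInfo.InvariantCulture;

        // Default styles allow group separators and do not check their position.
        Console.WriteLine(float.Parse("123,45", inv) == 12345f);  // True

        float parsed;
        Console.WriteLine(TryParseStrict("123,45", out parsed));  // False
        Console.WriteLine(TryParseStrict("123.45", out parsed));  // True
    }
}
```

The trade-off is that well-formed input such as "987,654,321" is rejected as well.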

C# toUpper for language without Uppercase

When using String.ToUpper(), are there any additional precautions which must be taken when attempting to "format" a language which does not contain uppercase characters, such as Arabic?
string arabic = "مرحبا بالعالم";
string upper = arabic.ToUpper();
Sidebar: Never call the parameterless .ToUpper() or .ToLower() when localization matters, because they use the current culture implicitly and so do not make your intent (about localization) clear. You should prefer an overload that takes an explicit CultureInfo, or CultureInfo.TextInfo.ToUpper, instead.
But to answer your question: case-conversions do not affect characters not subject to casing, they are kept as-is. This also happens in en-US and other Latin-alphabet languages because characters like digits 0, 1, 2 etc don't have cases either - so your Arabic characters will be preserved as-is.
Note how the non-alphabetical and already-uppercase characters are ignored:
"abcDEF1234!##" -> "ABCDEF1234!##"
Another thing to be aware of is that some languages have characters that don't have a one-to-one mapping between lowercase and uppercase forms, most famously the Turkish I, which is written up here: https://haacked.com/archive/2012/07/05/turkish-i-problem-and-why-you-should-care.aspx/. It's why FxCop yells at you if you ever use ToLower instead of ToUpper, and why you should use StringComparison.OrdinalIgnoreCase or CurrentCultureIgnoreCase, and never str1.ToLower() == str2.ToLower(), for case-insensitive string comparison.
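A small sketch of the points above, with hypothetical names (CasingSketch, EqualsIgnoreCase). It assumes the runtime has culture data for tr-TR available; the Turkish result is printed as a code point rather than asserted.

```csharp
using System;
using System.Globalization;

public static class CasingSketch
{
    // Hypothetical helper: case-insensitive equality without converting either string.
    public static bool EqualsIgnoreCase(string x, string y)
    {
        return string.Equals(x, y, StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        string arabic = "\u0645\u0631\u062D\u0628\u0627"; // مرحبا

        // Characters without case come back unchanged.
        Console.WriteLine(arabic.ToUpperInvariant() == arabic); // True

        // Explicit culture instead of the implicit current culture.
        // Under Turkish rules "i" is expected to map to U+0130 (dotted capital I).
        CultureInfo turkish = CultureInfo.GetCultureInfo("tr-TR");
        string upper = turkish.TextInfo.ToUpper("i");
        Console.WriteLine(((int)upper[0]).ToString("X4"));

        Console.WriteLine(EqualsIgnoreCase("FILE", "file")); // True
    }
}
```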

Inserting spaces between chars of two strings modify their order [duplicate]

This question already has answers here:
Unexpected behavior when sorting strings with letters and dashes
(2 answers)
Closed 5 years ago.
I have 2 strings of the same length.
I was assuming (probably wrongly) that inserting a space between each character of each string will not change their order.
var e1 = "12*4";
var e2 = "12-4";
Console.WriteLine(String.Compare(e1,e2)); // -1 (e1 < e2)
var f1 = "1 2 * 4";
var f2 = "1 2 - 4";
Console.WriteLine(String.Compare(f1,f2)); // +1 (f1 > f2)
If I insert other characters (_ x for instance), the order is preserved.
What's going on?
Thanks in advance.
If you use an ordinal comparison, you will get the result you expect.
The reason is that an ordinal comparison works by evaluating the numeric value of each of the chars in the string, so inserting the same spaces into both strings makes no difference to their relative order.
If you use other types of comparisons, there are other things involved. From the documentation:
An operation that uses word sort rules performs a culture-sensitive comparison wherein certain nonalphanumeric Unicode characters might have special weights assigned to them. Using word sort rules and the conventions of a specific culture, the hyphen ("-") might have a very small weight assigned to it so that "coop" and "co-op" appear next to each other in a sorted list.
An operation that uses ordinal sort rules performs a comparison based on the numeric value (Unicode code point) of each Char in the string. An ordinal comparison is fast but culture-insensitive. When you use ordinal sort rules to sort strings that start with Unicode characters (U+), the string U+xxxx comes before the string U+yyyy if the value of xxxx is numerically less than yyyy.
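Applied to the strings from the question, an ordinal comparison could look like the sketch below. OrdinalSketch and OrdinalSign are hypothetical names; the helper reduces the result to its sign because String.Compare only guarantees a negative, zero, or positive value.

```csharp
using System;

public static class OrdinalSketch
{
    // Hypothetical helper: sign of an ordinal comparison (-1, 0, or 1).
    public static int OrdinalSign(string x, string y)
    {
        return Math.Sign(string.Compare(x, y, StringComparison.Ordinal));
    }

    public static void Main()
    {
        // '*' is U+002A and '-' is U+002D, so the "*" string sorts first
        // with or without the inserted spaces.
        Console.WriteLine(OrdinalSign("12*4", "12-4"));       // -1
        Console.WriteLine(OrdinalSign("1 2 * 4", "1 2 - 4")); // -1
    }
}
```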
From MSDN:
The comparison uses the current culture to obtain culture-specific information such as casing rules and the alphabetic order of individual characters. For example, a culture could specify that certain combinations of characters be treated as a single character, or uppercase and lowercase characters be compared in a particular way, or that the sorting order of a character depends on the characters that precede or follow it.
When comparing strings, you should call the Compare(String, String, StringComparison) method, which requires that you explicitly specify the type of string comparison that the method uses.
It suggests that there is some cultural issue which means that the last space changes the sort order of the two.

Difference between the different overloads of String.Compare

Concretely what is the difference between
String.Compare(String, String, StringComparison) and
String.Compare(String, String, CultureInfo, CompareOptions)
I feel that the second one offers more options (comparison using any culture instead of only the current or invariant one, ignoring special characters, ignoring the width of katakana (!!), etc.) than the first one. Both were introduced in .NET 2.0, so I guess it can't be a question of backward compatibility.
So what's the difference and when should I use the first one and when should I use the second one?
I had a look at this post and this article, but I think they deal with slightly different matters.
Your answer is in the remarks for the second overload.
http://msdn.microsoft.com/en-us/library/cc190529.aspx
"The comparison uses the culture parameter to obtain culture-specific information, such as casing rules and the alphabetical order of individual characters. For example, a particular culture could specify that certain combinations of characters be treated as a single character, that uppercase and lowercase characters be compared in a particular way, or that the sort order of a character depends on the characters that precede or follow it."
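The other overload does not take a culture at all: depending on the StringComparison value you pass, it uses the current culture, the invariant culture, or ordinal rules.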
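To make the two call shapes concrete, here is a minimal sketch. The wrapper names (OverloadSketch, CompareOrdinalIgnoreCase, CompareWithCulture) are hypothetical, and the choice of CompareOptions flags is just one example of combining them.

```csharp
using System;
using System.Globalization;

public static class OverloadSketch
{
    // StringComparison overload: the rule set is picked from a fixed enum.
    public static int CompareOrdinalIgnoreCase(string x, string y)
    {
        return string.Compare(x, y, StringComparison.OrdinalIgnoreCase);
    }

    // CultureInfo overload: any culture, plus combinable CompareOptions flags.
    public static int CompareWithCulture(string x, string y, string cultureName)
    {
        CultureInfo culture = CultureInfo.GetCultureInfo(cultureName);
        return string.Compare(x, y, culture,
            CompareOptions.IgnoreCase | CompareOptions.IgnoreWidth);
    }

    public static void Main()
    {
        Console.WriteLine(CompareOrdinalIgnoreCase("abc", "ABC"));    // 0
        Console.WriteLine(CompareWithCulture("abc", "ABC", "en-US")); // 0
    }
}
```

A reasonable rule of thumb is to reach for the first overload unless you need a specific culture or an option that StringComparison cannot express.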

What is a String Culture

Just trying to understand this, as I have never used it before. How is a culture different from ToUpper() / ToLower()?
As SLaks says, different cultures handle casing differently.
A specific example from MSDN:
In most Latin alphabets, the character i (Unicode 0069) is the lowercase version of the character I (Unicode 0049). However, the Turkish alphabet has two versions of the character I: one with a dot and one without a dot. In Turkish, the character I (Unicode 0049) is considered the uppercase version of a different character ı (Unicode 0131).
Different cultures have different rules for converting between uppercase and lowercase characters.
They also have different rules for comparing and sorting strings, and for converting numbers and dates to strings.
The Turkish I is the most common example of cultural differences in case mappings, but there are many others.
I recommend checking out the Unicode Consortium's information on this.
http://www.unicode.org/faq/casemap_charprop.html
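Casing is only one place where the culture shows up. As an illustration of the number formatting mentioned above, the sketch below formats the same value under two cultures. CultureFormatSketch and FormatNumber are hypothetical names, and the outputs assume the usual culture data for en-US and de-DE.

```csharp
using System;
using System.Globalization;

public static class CultureFormatSketch
{
    // Hypothetical helper: format a number with one decimal place for a named culture.
    public static string FormatNumber(double value, string cultureName)
    {
        return value.ToString("N1", CultureInfo.GetCultureInfo(cultureName));
    }

    public static void Main()
    {
        Console.WriteLine(FormatNumber(1234.5, "en-US")); // 1,234.5
        Console.WriteLine(FormatNumber(1234.5, "de-DE")); // 1.234,5
    }
}
```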
