Concretely what is the difference between
String.Compare(String, String, StringComparison) and
String.Compare(String, String, CultureInfo, CompareOptions)
I feel like the second one offers more options (comparison using any culture instead of only the current or invariant one, ignoring special characters, ignoring the width of katakana (!!), etc.) than the first one. Both were introduced in .NET 2.0, so I guess it can't be a question of backward compatibility.
So what's the difference and when should I use the first one and when should I use the second one?
I had a look at this post and this article, but I think they're dealing with slightly different matters.
Your answer is in the remarks for the second overload.
http://msdn.microsoft.com/en-us/library/cc190529.aspx
"The comparison uses the culture parameter to obtain culture-specific information, such as casing rules and the alphabetical order of individual characters. For example, a particular culture could specify that certain combinations of characters be treated as a single character, that uppercase and lowercase characters be compared in a particular way, or that the sort order of a character depends on the characters that precede or follow it."
The other overload is limited to the choices in the StringComparison enum: the current culture, the invariant culture, or ordinal comparison, each with or without case sensitivity.
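For example, StringComparison.CurrentCulture is just a named shortcut for passing CultureInfo.CurrentCulture with CompareOptions.None, while the second overload can express combinations the enum cannot, such as ignoring kana width (a rough sketch, following the documented behaviour of CompareOptions.IgnoreWidth):

using System;
using System.Globalization;

// These two calls are equivalent:
Console.WriteLine(String.Compare("apple", "Apple", StringComparison.CurrentCulture));
Console.WriteLine(String.Compare("apple", "Apple",
    CultureInfo.CurrentCulture, CompareOptions.None));

// Only the second overload can do this: full-width "ア" and half-width "ｱ"
// katakana compare as equal when width differences are ignored.
Console.WriteLine(String.Compare("ア", "ｱ",
    CultureInfo.GetCultureInfo("ja-JP"), CompareOptions.IgnoreWidth)); // 0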
Related
Based on my understanding (see my other question), in order to decide whether to test string equality using ordinal or cultural rules, the semantics of the performed comparison must be taken into account.
If the two compared strings must be considered as raw sequences of characters (in other words, two symbols) then an ordinal string comparison must be performed. This is the case for most string comparisons performed in server side code.
Example: performing a user lookup by username. In this case the usernames of available users and the searched username are just symbols; they are not words in a specific language, so there is no need to take linguistic elements into account when comparing them. In this context two symbols composed of different characters must be considered different, regardless of any linguistic rule.
If the two compared strings must be considered as words in a specific language, then cultural rules must be taken into account during the comparison. It is entirely possible that two strings, composed of different characters, are considered the same word in a certain language, based on the grammatical rules of that language.
Example: the two words strasse and straße both mean street in German. So, in the context of comparing strings representing German words, this rule must be taken into account and these two strings must be considered equal (think of an application for the German market where the user inputs the name of a street and that street must be looked up in a database, in order to get the city where the street is located).
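In code, the difference between the two kinds of comparison looks like this (whether the culture-sensitive check actually reports equality here depends on the collation data the runtime uses, e.g. NLS vs. ICU):

using System;
using System.Globalization;

// Ordinal: different character sequences are never equal.
Console.WriteLine(string.Equals("strasse", "straße", StringComparison.Ordinal)); // False

// Culture-sensitive: the German collation rules decide whether
// "ss" and "ß" count as the same word.
Console.WriteLine(String.Compare("strasse", "straße",
    CultureInfo.GetCultureInfo("de-DE"), CompareOptions.None) == 0);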
So far, so good.
Given all of this, in which cases does using the .NET invariant culture for string equality make sense?
The point is that the invariant culture (as opposed to the German culture mentioned in the example above) is a fabricated culture based on American English linguistic rules.
Put another way, there is no human language whose rules are based on the .NET invariant culture, so why should I compare two strings by using this fictitious culture?
I know that the invariant culture is typically used to format and parse strings used in machine-to-machine communication scenarios (such as the contracts exposed by a web API).
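For example, when formatting a number for a wire format rather than for a human reader:

using System.Globalization;

double price = 1234.56;
// The invariant culture gives a stable, culture-independent rendering:
Console.WriteLine(price.ToString(CultureInfo.InvariantCulture));        // 1234.56
// The same value formatted for a German user looks different:
Console.WriteLine(price.ToString(CultureInfo.GetCultureInfo("de-DE"))); // 1234,56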
I would like to understand when calling string.Equals with StringComparison.InvariantCulture, as opposed to StringComparison.CurrentCulture (for some manually set thread culture, in order not to depend on the machine's OS configuration), really makes sense.
Combining diacritics / non-normalised strings is one example. See this answer for a decent treatment with code: https://stackoverflow.com/a/31361980/2701753
In summary, for (many) 'alphabets' there are several potential Unicode (and UCS-2) representations of the same glyph (letter)
For example:
Unicode Character “á” (U+00E1) [one Unicode code point]
Unicode Character “a” (U+0061) [followed by] Unicode Character “◌́” (U+0301) [two Unicode code points]
so:
á
á
Same linguistic string (for all cultures; they are supposed to represent the same character) but a different ordinal string (different bytes).
So invariant equality comparison is [in this case] like normalising the strings before comparing them
Look up Unicode normalisation / decomposition for more info.
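A quick snippet showing the effect (the two literals below are the two encodings of á from above):

string precomposed = "\u00E1"; // á as a single code point
string decomposed = "a\u0301"; // a followed by a combining acute accent

Console.WriteLine(string.Equals(precomposed, decomposed,
    StringComparison.Ordinal));          // False: different code points
Console.WriteLine(string.Equals(precomposed, decomposed,
    StringComparison.InvariantCulture)); // True: canonically equivalent
Console.WriteLine(precomposed.Normalize() == decomposed.Normalize());
                                         // True: both normalise to U+00E1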
There are other interesting cases, ligatures for example. And left-to-right and right-to-left marks, and so on.
So, in summary, once you have 'interesting' alphabets in play (pretty much anything outside pure ASCII), and once you are interested in any sort of comparison of the strings as linguistic items / streams of glyphs, you probably do want to go beyond ordinal comparison.
To directly answer the question: If you have a multicultural user-base, but still need the above linguistic sensitivity, what culture would you choose for:
StringComparison.CurrentCulture (for some manually set thread culture, in order to not depend on the machine OS configurations)
other than InvariantCulture?
When using String.toUpper() are there any additional precautions which must be taken when attempting to "format" a language which does not contain uppercase characters such as Arabic?
string arabic = "مرحبا بالعالم";
string upper = arabic.ToUpper();
Sidebar: Never call the parameterless .ToUpper() or .ToLower() when localization matters, because those overloads give you no way to make your intent (about localization) explicit: they silently use the current culture. Prefer the overloads that take a CultureInfo, or CultureInfo.TextInfo.ToUpper, instead.
But to answer your question: case conversions do not affect characters not subject to casing; they are kept as-is. The same happens in en-US and other Latin-alphabet languages, because characters like the digits 0, 1, 2, etc. don't have cases either, so your Arabic characters will be preserved as-is.
Note how the non-alphabetical and already-uppercase characters are ignored:
"abcDEF1234!##" -> "ABCDEF1234!##"
Another thing to be aware of is that some languages have characters without a one-to-one mapping between lowercase and uppercase forms, most notably the Turkish I, which is written up here: https://haacked.com/archive/2012/07/05/turkish-i-problem-and-why-you-should-care.aspx/. It's why FxCop yells at you if you ever use ToLower instead of ToUpper, and why for case-insensitive string comparison you should use StringComparison.OrdinalIgnoreCase or CurrentCultureIgnoreCase, and never str1.ToLower() == str2.ToLower().
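A short sketch of why the ToLower-based comparison breaks, using the standard tr-TR casing rules:

var tr = CultureInfo.GetCultureInfo("tr-TR");
// In Turkish the lowercase of "I" is the dotless "ı" (U+0131), so a
// ToLower-based "case-insensitive" check quietly fails:
Console.WriteLine("FILE".ToLower(tr) == "file".ToLower(tr)); // False: "fıle" vs "file"
// An explicit comparison mode states the intent and works everywhere:
Console.WriteLine("FILE".Equals("file", StringComparison.OrdinalIgnoreCase)); // True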
I need to validate user input for a property name to retrieve.
For example, the user can type the "Parent.Container" property for a Windows Forms control object, or just the "Name" property. Then I use reflection to get the value of the property.
What I need is to check whether the user typed legal symbols for a C# property name (or just legal word characters, like \w), and the property can also be composite (containing two or more words separated with dots).
I have this as of now; is this a right solution?
^([\w]+\.)+[\w]+$|([\w]+)
I used the Regex.IsMatch method and it returned true when I passed "?someproperty", even though \w does not include "?" (presumably because the second alternative, ([\w]+), is unanchored and matches a substring).
I was looking for this too, but I knew none of the existing answers are complete. After a little digging, here's what I found.
Clarifying what we want
First we need to know which kind of "valid" we want: valid according to the runtime, or valid according to the language? Examples:
Foo\u0123Bar is a valid property name for the C# language but not for the runtime. The difference is smoothed over by the compiler, which quietly converts the identifier to FooģBar.
For verbatim identifiers (@ prefix), the language treats the @ as part of the identifier, but the runtime doesn't see it.
Either could make sense depending on your needs. If you're feeding the validated text into Reflection methods such as GetProperty(string), you'll need the runtime-valid version. If you want the syntax that's more familiar to C# developers, though, you'd want the language-valid version.
"Valid" based on the runtime
C# version 5 is (as of 7/2018) the latest version with formal standards: the ECMA-334 spec. Its rule says:
The rules for identifiers given in this subclause correspond exactly to those recommended by the Unicode Standard Annex 15 except that underscore is allowed as an initial character (as is traditional in the C programming language), Unicode escape sequences are permitted in identifiers, and the "@" character is allowed as a prefix to enable keywords to be used as identifiers.
The "Unicode Standard Annex 15" mentioned is Unicode TR 15, Annex 7, which formalizes the basic pattern as:
<identifier> ::= <identifier_start> ( <identifier_start> | <identifier_extend> )*
<identifier_start> ::= [{Lu}{Ll}{Lt}{Lm}{Lo}{Nl}]
<identifier_extend> ::= [{Mn}{Mc}{Nd}{Pc}{Cf}]
The {codes in curly braces} are Unicode classes, which map directly to Regex via \p{category}. So (after a little simplification) the basic regex to check for "valid" according to the runtime would be:
#"^[\p{L}\p{Nl}_][\p{Cf}\p{L}\p{Mc}\p{Mn}\p{Nd}\p{Nl}\p{Pc}]*$"
All the ugly details
The C# spec also requires that identifiers be in Unicode Normalization Form C. It doesn't require that the compiler actually enforce it, though. At least the Roslyn C# compiler allows non-normal-form identifiers (e.g., E\u0304\u0306) and treats them as distinct from equivalent normal-form identifiers (e.g., \u0100\u0306). And anyway, to my knowledge there's no sane way to represent such a rule with a regex. If you don't need/want the user to be able to differentiate properties that look exactly the same, my suggestion is to just run string.Normalize() on the user's input and be done with it.
The C# spec says that two identifiers are equivalent if they only differ by formatting characters. For example, Elmo (four characters) and Elmo (El\u00ADmo) are the same identifier. (Note: that's the soft-hyphen, which is normally invisible; some fonts may display it, though.) If the presence of invisible characters would cause you trouble, you can drop the \p{Cf} from the regex. That doesn't reduce which identifiers you accept—just which formats you accept.
The C# spec reserves identifiers containing "__" for its own use. Depending on your needs you may want to exclude that. That should likely be an operation separate from the regex.
Nesting, generics, etc.
Reflection, Type, IL, and perhaps other places sometimes show class names or method names with extra symbols. For example, a type name may be given as X`1+Y[T]. That extra stuff is not part of the identifier—it's an unrelated way of representing type information.
"Valid" based on the language
This is just the previous regex but also allowing for:
Prefixed @
Unicode escape sequences
The first is a trivial modification: just add @?.
Unicode escape sequences are of the form @"\\[Uu][\dA-Fa-f]{4}". We may be tempted to wedge that into both [...] pairs and call it done, but that would incorrectly allow (for example) \u0000 as an identifier. We need to limit the escape sequences to ones that produce otherwise-acceptable characters. One way to do that is to do a pre-pass that converts the escape sequences: replace all \\[Uu][\dA-Fa-f]{4} with the corresponding character.
So putting it all together, a check for whether a string is valid from a C# language standpoint would be:
using System;
using System.Globalization;
using System.Text.RegularExpressions;

bool IsValidIdentifier(string input)
{
    if (input is null) { throw new ArgumentNullException(nameof(input)); }

    // Technically the input must be in normal form C. Implementations aren't required
    // to verify that, though, so you could remove this check if your runtime doesn't
    // mind.
    if (!input.IsNormalized())
    {
        return false;
    }

    // Convert escape sequences to the characters they represent. The only allowed
    // escape sequences are of form \u0000 or \U0000, where 0 is a hex digit.
    MatchEvaluator replacer = (Match match) =>
    {
        string hex = match.Groups[1].Value;
        var codepoint = int.Parse(hex, NumberStyles.HexNumber);
        return new string((char)codepoint, 1);
    };
    var escapeSequencePattern = @"\\[Uu]([\dA-Fa-f]{4})";
    var withoutEscapes = Regex.Replace(input, escapeSequencePattern, replacer,
        RegexOptions.CultureInvariant);

    // Now do the real check.
    var isIdentifier = @"^@?[\p{L}\p{Nl}_][\p{Cf}\p{L}\p{Mc}\p{Mn}\p{Nd}\p{Nl}\p{Pc}]*$";
    return Regex.IsMatch(withoutEscapes, isIdentifier, RegexOptions.CultureInvariant);
}
Back to the original question
The asker is long gone, but I feel obliged to include an answer to the actual question:
foreach (string part in input.Split('.'))
{
    if (!IsValidIdentifier(part)) { return false; }
}
return true;
Sources
ECMA 334 § 7.4.3; ECMA 335 § I.10; Unicode TR 15 Annex 7
Not the best, but this will work. Demo here.
^@?[a-zA-Z_]\w*(\.@?[a-zA-Z_]\w*)*$
Note that
* Numbers 0-9 are not allowed as the first character
* @ is allowed only as the first character of each part, not anywhere else (the compiler will strip it off, though)
* _ is allowed
Edit
Looking at your requirement, the below regex will be more useful, as the input property name need not have @ in it. Check here.
^[a-zA-Z_]\w*(\.[a-zA-Z_]\w*)*$
What you posted in the comments is almost right. But it won't detect single properties like "Name".
^(?:[\w]+\.)*\w+$
Works as expected. I just changed the + to * and made the group non-capturing, since you are not concerned about groups here.
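A quick check against the inputs from the question:

using System;
using System.Text.RegularExpressions;

var pattern = @"^(?:[\w]+\.)*\w+$";
Console.WriteLine(Regex.IsMatch("Parent.Container", pattern)); // True
Console.WriteLine(Regex.IsMatch("Name", pattern));             // True
Console.WriteLine(Regex.IsMatch("?someproperty", pattern));    // False: fully anchored now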
For my answer in this question I have to compare two characters. I thought that the normal char.CompareTo() method would allow me to specify a CultureInfo, but that's not the case.
So my question is: How can I compare two characters and specify a CultureInfo for the comparison?
There is no culture-aware comparison for characters; you have to convert the characters to strings so that you can use, for example, the String.Compare(String, String, CultureInfo, CompareOptions) method.
Example:
char a = 'å';
char b = 'ä';

// outputs -1:
Console.WriteLine(String.Compare(
    a.ToString(),
    b.ToString(),
    CultureInfo.GetCultureInfo("sv-SE"),
    CompareOptions.IgnoreCase));

// outputs 1:
Console.WriteLine(String.Compare(
    a.ToString(),
    b.ToString(),
    CultureInfo.GetCultureInfo("en-GB"),
    CompareOptions.IgnoreCase));
There is indeed a difference between comparing characters and comparing strings. Let me try to explain the basic issue, which is quite simple: a char always represents a single UTF-16 code unit (for most practical purposes, a single Unicode code point). Comparing characters always compares these numeric values, without any regard to whether they carry the same meaning.
If you want to compare characters by their linguistic meaning, you need to create strings and use the comparison methods provided there, which include support for different cultures. See Guffa's answer for how to do that.
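For contrast with the sv-SE result above, the plain char comparison is purely numeric:

// char.CompareTo compares UTF-16 code units as numbers:
Console.WriteLine('å'.CompareTo('ä')); // positive: U+00E5 > U+00E4
Console.WriteLine('å' > 'ä');          // True, for the same reason

Note that the culture-sensitive Swedish comparison above ordered the same two characters the other way around.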
Did you try the String.Compare method?
The comparison uses the current culture to obtain culture-specific information such as casing rules and the alphabetic order of individual characters. For example, a culture could specify that certain combinations of characters be treated as a single character, or uppercase and lowercase characters be compared in a particular way, or that the sorting order of a character depends on the characters that precede or follow it.
String.Compare(str1, str2, false, new CultureInfo("en-US"))
I don't think CultureInfo matters while comparing chars in C#. A char is already a Unicode character, so two characters can be easily compared without a CultureInfo.
Just trying to understand that - I have never used it before. How does a culture make a difference to ToUpper() / ToLower()?
As SLaks says, different cultures handle casing differently.
A specific example from MSDN:
In most Latin alphabets, the character i (Unicode 0069) is the lowercase version of the character I (Unicode 0049). However, the Turkish alphabet has two versions of the character I: one with a dot and one without a dot. In Turkish, the character I (Unicode 0049) is considered the uppercase version of a different character ı (Unicode 0131).
Different cultures have different rules for converting between uppercase and lowercase characters.
They also have different rules for comparing and sorting strings, and for converting numbers and dates to strings.
The Turkish I is the most common example of cultural differences in case mappings, but there are many others.
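A small demonstration, using the standard en-US and tr-TR casing tables:

using System.Globalization;

var tr = CultureInfo.GetCultureInfo("tr-TR");
var en = CultureInfo.GetCultureInfo("en-US");

Console.WriteLine("i".ToUpper(en)); // I (U+0049)
Console.WriteLine("i".ToUpper(tr)); // İ (U+0130, capital I with dot)
Console.WriteLine("I".ToLower(tr)); // ı (U+0131, dotless lowercase i)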
I recommend checking out the Unicode Consortium's information on this.
http://www.unicode.org/faq/casemap_charprop.html