Why does "\u1FFF:foo".StartsWith(":") return true?

Why does "\u1FFF:foo".StartsWith(":") return true? - c#

The string "\u1FFF:foo" starts with \u1FFF (or "῿"), right?
So how can these both be true?
"\u1FFF:foo".StartsWith(":") // equals true
"\u1FFF:foo".StartsWith("\u1FFF") // equals true
// alternatively, the same:
"῿:foo".StartsWith(":") // equals true
"῿:foo".StartsWith("῿") // equals true
Does .NET claim that this string starts with two different characters?
And while I find this very surprising and would like to understand the "why", I'm equally interested in how I can force .NET to search exclusively by codepoints instead (using InvariantCulture doesn't seem to do a thing)?
And for comparison, one characters below that, "\u1FFE:foo".StartsWith(":") returns false.

That a string in general might be considered to start with two different strings that are not byte-for-byte identical is not surprising (because Unicode is complicated). For example, these results are almost always going to reflect what a user wants:
"n\u0303".StartsWith("\u00f1") // true
"n\u0303".StartsWith("n") // false
Using System.Globalization.CharUnicodeInfo.GetUnicodeCategory, you can see that '\u1fff' is in the "OtherNotAssigned" category; it's unclear to me whether that should affect string search/sort/comparison operations (it does not appear to affect normalization, that is, the characters remain after normalization).
If you want a byte-for-byte comparison, use StringComparison.Ordinal.

Because you are using String.StartsWith() incorrectly. You should use String.StartsWith (String, StringComparison) overload and StringComparison.Ordinal.
There is no character assigned to \u1FFF. I.e. there is no linguistic meaning attached to this code. See Greek Extended, Range: 1F00–1FFF excerpt from character code tables for Unicode Standard. Best Practices for Using Strings in .NET document from MSDN explicitly states that if you need to compare strings in a manner that ignores features of natural languages then you should use StringComparison.Ordinal:
Specifying the StringComparison.Ordinal or StringComparison.OrdinalIgnoreCase value in a method call signifies a non-linguistic comparison in which the features of natural languages are ignored. Methods that are invoked with these StringComparison values base string operation decisions on simple byte comparisons instead of casing or equivalence tables that are parameterized by culture. In most cases, this approach best fits the intended interpretation of strings while making code faster and more reliable.
Moreover, it recommends to always explicitly specify StringComparison in such method calls:
When you develop with .NET, follow these simple recommendations when you use strings:
Use overloads that explicitly specify the string comparison rules for string operations. Typically, this involves calling a method overload that has a parameter of type StringComparison.

Related

Is there a syntactically legal expression that has 2 consecutive identifiers separated only by white space in C#?

That might not be the best way to phrase it, but I'm considering writing a tool that converts identifiers separated by spaces in my code to camel case. A quick example:
var zoo animals = GetZooAnimals(); // i can't help but type this
var zooAnimals = GetZooAnimals(); // i want it to rewrite it like this
I was wondering if writing a tool like this would run into any ambiguities assuming it ignores all keywords. The only reason I can think of is if there is a syntactically valid expression with 2 identifiers only separated by white space.
Looking through the grammar I could not immediately find a place that allows it, but perhaps someone else would know better.
On a side note, I realize this is not a practical solution to a real problem a lot of people have, but just something I do all the time and wanted to take a stab at fixing with tools instead of forcing myself to always write camel case.

It is hard to tell whether a space-separated sequence of identifiers represents a single variable or not without doing full semantic analysis. For example
Myclass myVariable;
is a pair of space-separated identifiers which are perfectly valid. This would cause an ambiguity if you want to camel-case both type names and variable names.

If one enters:
csharp> var i j = 3;
(1,7): error CS1525: Unexpected symbol `j', expecting `,', `;', or `='
in the csharp interactive shell, one gets an error generated by the parser (a (LA)LR parser does bookkeeping what to expect next). Such parser works left-to-right so it doesn't know which characters to come next. It simply knows that the next characters are one of the list shown above.
So that means that there is probably no way to - at least declare a variable - with spaces.
Furthermore based on this context-free grammar for C# there doesn't seem to be a case where one can place two identifiers next to each other. It is for instance possible that a primary expressions is an identifier, but there is no situation where a primary expression is placed next to an identifier (or with an optional part in between).
As #dasblinkenlight says, you can indeed see the rule "local-variable-declaration":
type variable-declarator
with type that can be evaluated to an identifier and variable-declarator starting with an identifier. You can however know that the type is the first identifier (or the var keyword). Some kind of rewrite rule is thus:
(\w+)(\s+\w+)+ -> \1 concat(\2)
where you need to combine (concat) the identifiers of the second group. In case of an assignment.

Does the culture of the StringComparison type of String.Equals matter?

In C#, you can compare two strings with String.Equals and supply a StringComparison.
I've recently been looking to update my archaic method of comparing ToLower() because I read that it doesn't work on all languages/cultures.
From what I can tell, the comparison types are used to determine order when confronted with a list containing aé and ae as to which should appear first (some cultures order things differently).
With string.Equals, ordering is not important. Therefore is it safe to assume that many of the options are irrelevent, and only [Ordinal] and [Ordinal]IgnoreCase are important?
The MSDN article for String.Equals says
The comparisonType parameter indicates whether the comparison should
use the current or invariant culture, honor or ignore the case of the
two strings being compared, or use word or ordinal sort rules.
string.Equals(myString, theirString, StringComparison.OrdinalIgnoreCase)
I'd also be interested to know how the sort method works internally, does it use String.Compare to work out the relative positioning of two strings?

Case insensitive comparisons are culture dependent. For example using Turkish culture, i is not lowercase for I. With that culture I is paired with ı, and İ is paired with i. See Dotted and dotless I on Wikipedia.
There are a number of weird effects related to culture sensitive string operations. For example "KonNy".StartsWith("Kon") can return false.
So I recommend switching to culture insensitive operations even for seemingly harmless operations.
And even with culture insensitive operations there is plenty of unintuitive behavior in unicode, such as multiple representations of the same glyph, different codepoints that look identical, zero-width characters that are ignored by some operations, but observed by others,...

How to Compare localized strings in C#

I am doing localization for ASP.NET Web Application, when user enters a localized string "XXXX" and i am comparing that string with a value in my localized resource file.
Example :
if ( txtCalender.Text == Resources.START_NOW)
{
//do something
}
But When i do that even when the two strings(localized strings) are equal, it returns false. ie.
txtCalender.Text ="இப்போது தொடங்க"
Resources.START_NOW="இப்போது தொடங்க"
This is localized for Tamil.
Please help..

Use one of the string.Equals overloads that takes a StringComparison value - this allows you to use the current culture for comparison..
if ( txtCalender.Text.Equals(Resources.START_NOW, StringComparison.CurrentCulture))
{
//do something
}
Or, if you want case insensitive comparison:
if ( txtCalender.Text.Equals(Resources.START_NOW,
StringComparison.CurrentCultureIgnoreCase))
{
//do something
}

I found the answer and it works. Here is the solution,
it was not working when i tried from Chrome browser and it works with Firefox. Actually when i converted both string to char array,
txtCalender.Text Returns 40 characters and Resource.START_NOW returned 46. So i have tried to Normalize the string using Normalize() method
if(txtCalender.Text.Normalize() == Resources.START_NOW.Normalize())
It was interpreting one character as two different characters when i didn't put normalize method.
it has worked fine. Thanks for your answers.

You can compare with InvariantCulture in String.Equals (statis method):
String.Equals("XXX", "XXX", StringComparison.InvariantCulture);
Not sure whether this helps though, could others comment on it? I've never come across your actual error.

Use String.Equals or String.Compare.
There is some performance differences between these two. String.Compare is faster than String.Equal because String.Compare is static method and String.Equals is instance method.
String.Equal returns a boolean. String.Compare returns 0 when the strings equal, but if they're different they return a positive or negative number depending on whether the first string is before (less) or after (greater) the second string. Therefore, use String.Equals when you need to know if they are the same or String.Compare when you need to make a decision based on more than equality.

You probably need to use .Equals
if(txt.Calendar.Text.Equals(Resources.START_NOW))
{ //...
And if case-insensitive comparison is what you're after (often is) use StringComparison.OrdinalIgnoreCase as the second argument to the .Equals call.
If this isn't working - then can I suggest you breakpoint the line and check the actual value of Resources.START_NOW - the only possible reason why this equality comparison would fail is if the two strings really aren't the same. So my guess is that your culture management isn't working properly.

string.ToLower() and string.ToLowerInvariant()

What's the difference and when to use what? What's the risk if I always use ToLower() and what's the risk if I always use ToLowerInvariant()?

Depending on the current culture, ToLower might produce a culture specific lowercase letter, that you aren't expecting. Such as producing ınfo without the dot on the i instead of info and thus mucking up string comparisons. For that reason, ToLowerInvariant should be used on any non-language-specific data. When you might have user input that might be in their native language/character-set, would generally be the only time you use ToLower.
See this question for an example of this issue:
C#- ToLower() is sometimes removing dot from the letter "I"

TL;DR:
When working with "content" (e.g. articles, posts, comments, names, places, etc.) use ToLower(). When working with "literals" (e.g. command line arguments, custom grammars, strings that should be enums, etc.) use ToLowerInvariant().
Examples:
=Using ToLowerInvariant incorrectly=
In Turkish, DIŞ means "outside" and diş means "tooth". The proper lower casing of DIŞ is dış. So, if you use ToLowerInvariant incorrectly you may have typos in Turkey.
=Using ToLower incorrectly=
Now pretend you are writing an SQL parser. Somewhere you will have code that looks like:
if(operator.ToLower() == "like")
{
// Handle an SQL LIKE operator
}
The SQL grammar does not change when you change cultures. A Frenchman does not write SÉLECTIONNEZ x DE books instead of SELECT X FROM books. However, in order for the above code to work, a Turkish person would need to write SELECT x FROM books WHERE Author LİKE '%Adams%' (note the dot above the capital i, almost impossible to see). This would be quite frustrating for your Turkish user.

I think this can be useful:
http://msdn.microsoft.com/en-us/library/system.string.tolowerinvariant.aspx
update
If your application depends on the case of a string changing in a predictable way that is unaffected by the current culture, use the ToLowerInvariant method. The ToLowerInvariant method is equivalent to ToLower(CultureInfo.InvariantCulture). The method is recommended when a collection of strings must appear in a predictable order in a user interface control.
also
...ToLower is very similar in most places to ToLowerInvariant. The documents indicate that these methods will only change behavior with Turkish cultures. Also, on Windows systems, the file system is case-insensitive, which further limits its use...
http://www.dotnetperls.com/tolowerinvariant-toupperinvariant
hth

String.ToLower() uses the default culture while String.ToLowerInvariant() uses the invariant culture. So you are essentially asking the differences between invariant culture and ordinal string comparision.

When to use char field in a data class

Say I have a Currency class
public class Currency
{
string Name; // eg "US Dollars"
string Symbol; // eg "$"
decimal Rate; // eg 1.7
}
So, symbol is $, £, ..etc i.e. it's going to be a single character.
I used string simply because it makes translation to/from views easier, but is there any reason why I should be using char? What are the advantages/disadvantages of using a char field instead of a string? Or is this a case like byte/int where you should basically always prefer int.

Not all currency symbols are a single character: Symbols List.

Well, if you use char:
It's guaranteed to be exactly a single character, and that's obvious to the developer (probably the biggest "pro" for char)
It can never be null (although it could be U+0000)
You won't be able to handle characters which aren't in the Basic Multilingual Plane
I don't know if any of those are important to you.

Or is this a case like byte/int where you should basically always prefer int.
First off, I would argue that there are valid uses for byte. byte takes less space than int - if you only need a byte, use a byte.
System.Char has a couple of advantages to string, in the right case. First, its a single, immutable value type. This is more efficient if you only need a single character.
In addition, using a char in your API makes it impossible to put more than one character into that field. This may eliminate or simplify some of the validation required.
If it truly makes sense that you'd only ever want a single char, I'd say to use a char. It simplifies the code (less validation), makes it more efficient (less memory as you don't have another object reference), and most importantly, expresses the intent more clearly, since you're saying "I just want one character here."

string is acceptable here, you just have to ensure that the string always contains only 1 character, which means a little extra code on that end to prevent exceptions. Really, you can pick your poison here.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Why does "\u1FFF:foo".StartsWith(":") return true? - c#

Related

Is there a syntactically legal expression that has 2 consecutive identifiers separated only by white space in C#?

Does the culture of the StringComparison type of String.Equals matter?

How to Compare localized strings in C#

string.ToLower() and string.ToLowerInvariant()

When to use char field in a data class

Categories

Resources