string.ToLower() and string.ToLowerInvariant() - c#

What's the difference and when to use what? What's the risk if I always use ToLower() and what's the risk if I always use ToLowerInvariant()?

Depending on the current culture, ToLower might produce a culture-specific lowercase letter that you aren't expecting, such as ınfo (without the dot on the i) instead of info, and thus muck up string comparisons. For that reason, ToLowerInvariant should be used on any non-language-specific data. Generally, the only time you would use ToLower is when you have user input that might be in the user's native language or character set.
See this question for an example of this issue:
C#- ToLower() is sometimes removing dot from the letter "I"
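To make the ınfo/info case concrete, here is a minimal sketch written as C# top-level statements. It assumes the tr-TR culture data is available on the machine (it is not when the runtime runs in globalization-invariant mode), and the variable names are only illustrative.

```csharp
using System;
using System.Globalization;

string key = "INFO";

// Culture-independent: 'I' always maps to 'i'.
string invariantLower = key.ToLowerInvariant();
Console.WriteLine(invariantLower);

// Culture-sensitive: under Turkish rules 'I' lowercases to dotless 'ı' (U+0131).
// Assumption: tr-TR culture data is installed. Without it, this may throw
// CultureNotFoundException or silently behave like the invariant culture.
try
{
    string turkishLower = key.ToLower(new CultureInfo("tr-TR"));
    Console.WriteLine(turkishLower == invariantLower
        ? "Same result (Turkish culture data is probably missing)"
        : "Different result under tr-TR: " + turkishLower);
}
catch (CultureNotFoundException)
{
    Console.WriteLine("tr-TR is not available on this machine.");
}
```

If the second line of output differs from the first, any comparison of the lowercased key against the literal "info" fails for that user.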

TL;DR:
When working with "content" (e.g. articles, posts, comments, names, places, etc.) use ToLower(). When working with "literals" (e.g. command line arguments, custom grammars, strings that should be enums, etc.) use ToLowerInvariant().
Examples:
=Using ToLowerInvariant incorrectly=
In Turkish, DIŞ means "outside" and diş means "tooth". The proper lower casing of DIŞ is dış, but ToLowerInvariant produces diş. So, if you use ToLowerInvariant on Turkish content, you may turn "outside" into "tooth".
=Using ToLower incorrectly=
Now pretend you are writing an SQL parser. Somewhere you will have code that looks like:
// "op" holds the operator token; "operator" itself is a reserved word in C#
if (op.ToLower() == "like")
{
    // Handle an SQL LIKE operator
}
The SQL grammar does not change when you change cultures. A Frenchman does not write SÉLECTIONNEZ x DE books instead of SELECT x FROM books. However, in order for the above code to work, a Turkish person would need to write SELECT x FROM books WHERE Author LİKE '%Adams%' (note the dot above the capital i, almost impossible to see). This would be quite frustrating for your Turkish user.
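One way to avoid the problem in the parser example is to skip lowercasing entirely and compare with StringComparison.OrdinalIgnoreCase. This is a sketch of an alternative, not the answer's own code; IsLikeOperator is a hypothetical helper name.

```csharp
using System;

// Hypothetical helper: decides whether a token is the SQL LIKE keyword.
// OrdinalIgnoreCase ignores the current culture, so the result is the same
// on a Turkish machine as on any other.
static bool IsLikeOperator(string token) =>
    string.Equals(token, "like", StringComparison.OrdinalIgnoreCase);

Console.WriteLine(IsLikeOperator("LIKE"));  // plain ASCII keyword
Console.WriteLine(IsLikeOperator("Like"));
Console.WriteLine(IsLikeOperator("LİKE"));  // capital dotted İ (U+0130) is not the keyword
```

Calling token.ToLowerInvariant() == "like" would behave the same way for these inputs; the comparison overload just avoids allocating a new string.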

I think this can be useful:
http://msdn.microsoft.com/en-us/library/system.string.tolowerinvariant.aspx
update
If your application depends on the case of a string changing in a predictable way that is unaffected by the current culture, use the ToLowerInvariant method. The ToLowerInvariant method is equivalent to ToLower(CultureInfo.InvariantCulture). The method is recommended when a collection of strings must appear in a predictable order in a user interface control.
also
...ToLower is very similar in most places to ToLowerInvariant. The documents indicate that these methods will only change behavior with Turkish cultures. Also, on Windows systems, the file system is case-insensitive, which further limits its use...
http://www.dotnetperls.com/tolowerinvariant-toupperinvariant
hth

String.ToLower() uses the current culture while String.ToLowerInvariant() uses the invariant culture. So you are essentially asking about the difference between the current culture and the invariant culture.

Related

Why does "\u1FFF:foo".StartsWith(":") return true?

The string "\u1FFF:foo" starts with \u1FFF (or "῿"), right?
So how can these both be true?
"\u1FFF:foo".StartsWith(":") // equals true
"\u1FFF:foo".StartsWith("\u1FFF") // equals true
// alternatively, the same:
"῿:foo".StartsWith(":") // equals true
"῿:foo".StartsWith("῿") // equals true
Does .NET claim that this string starts with two different characters?
And while I find this very surprising and would like to understand the "why", I'm equally interested in how I can force .NET to search exclusively by codepoints instead (using InvariantCulture doesn't seem to do a thing)?
And for comparison, one character below that, "\u1FFE:foo".StartsWith(":") returns false.
That a string in general might be considered to start with two different strings that are not byte-for-byte identical is not surprising (because Unicode is complicated). For example, these results are almost always going to reflect what a user wants:
"n\u0303".StartsWith("\u00f1") // true
"n\u0303".StartsWith("n") // false
Using System.Globalization.CharUnicodeInfo.GetUnicodeCategory, you can see that '\u1fff' is in the "OtherNotAssigned" category; it's unclear to me whether that should affect string search/sort/comparison operations (it does not appear to affect normalization, that is, the characters remain after normalization).
If you want a byte-for-byte comparison, use StringComparison.Ordinal.
Because you are using String.StartsWith() incorrectly. You should use String.StartsWith (String, StringComparison) overload and StringComparison.Ordinal.
There is no character assigned to \u1FFF, i.e. there is no linguistic meaning attached to this code point. See the Greek Extended block (range 1F00–1FFF) in the character code tables of the Unicode Standard. The Best Practices for Using Strings in .NET document from MSDN explicitly states that if you need to compare strings in a manner that ignores features of natural languages, you should use StringComparison.Ordinal:
Specifying the StringComparison.Ordinal or StringComparison.OrdinalIgnoreCase value in a method call signifies a non-linguistic comparison in which the features of natural languages are ignored. Methods that are invoked with these StringComparison values base string operation decisions on simple byte comparisons instead of casing or equivalence tables that are parameterized by culture. In most cases, this approach best fits the intended interpretation of strings while making code faster and more reliable.
Moreover, it recommends always specifying StringComparison explicitly in such method calls:
When you develop with .NET, follow these simple recommendations when you use strings:
Use overloads that explicitly specify the string comparison rules for string operations. Typically, this involves calling a method overload that has a parameter of type StringComparison.
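Applied to the string from the question, the ordinal overload looks roughly like this. A small sketch as C# top-level statements; the variable names are made up for illustration.

```csharp
using System;

string text = "\u1FFF:foo";

// Ordinal comparison looks only at the UTF-16 code units, so the unassigned
// code point U+1FFF is treated like any other character.
bool startsWithColon = text.StartsWith(":", StringComparison.Ordinal);
bool startsWithU1FFF = text.StartsWith("\u1FFF", StringComparison.Ordinal);

Console.WriteLine(startsWithColon);   // False
Console.WriteLine(startsWithU1FFF);   // True
```

The same applies to the ñ example: with StringComparison.Ordinal, "n\u0303" starts with "n" but not with "\u00f1".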

C# - What is Culture-Specific?

I'm quite new at C#. ReSharper warned me that using "string.IndexOf" is culture-specific. What exactly is being culture-specific?
Culture refers to a language and region, for example American English (en-US) and British English (en-GB) (obviously the same language, but different cultures as far as .NET is concerned). The reason it might matter when using "string.IndexOf" is that certain characters (think characters with accents and umlauts) get treated differently in different cultures. Unicode can represent some accented characters in more than one way, for example an 'a' followed by a combining umlaut or a single precomposed 'ä', and culture-sensitive comparison rules decide whether such combinations count as one character and whether they match each other. So using "string.IndexOf" on a string with an umlaut might yield different results depending on the set culture. But in most circumstances, especially if you're just learning, the default behavior of the string class will be just fine.
ReSharper is an extension that, among other things, checks your code against coding standards. If a method has an overload that accepts a culture or a comparison type, it is good practice to call that overload explicitly in your code. As the documentation puts it:
The comparisonType parameter specifies to search for the value parameter using the current or invariant culture, using a case-sensitive or case-insensitive search, and using word or ordinal comparison rules.
More Info
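To illustrate what "culture-specific" means for IndexOf, here is a hedged sketch with a made-up string that contains 'n' followed by a combining tilde. Only the ordinal results are stated in comments; the culture-sensitive result depends on the runtime's globalization data, so it is just printed.

```csharp
using System;

// "mañana" spelled with 'n' + combining tilde (U+0303) instead of the precomposed 'ñ' (U+00F1).
string decomposed = "man\u0303ana";

// Ordinal: compares code units only, so the precomposed 'ñ' is not found.
int ordinalIndex = decomposed.IndexOf("\u00F1", StringComparison.Ordinal);
Console.WriteLine(ordinalIndex);   // -1

// Linguistic: may treat the two spellings as equivalent and report a match.
int linguisticIndex = decomposed.IndexOf("\u00F1", StringComparison.InvariantCulture);
Console.WriteLine(linguisticIndex);
```

Passing the StringComparison explicitly is also what makes the ReSharper warning go away, because the intent is then spelled out in the call.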

Does the culture of the StringComparison type of String.Equals matter?

In C#, you can compare two strings with String.Equals and supply a StringComparison.
I've recently been looking to update my archaic method of comparing ToLower() because I read that it doesn't work on all languages/cultures.
From what I can tell, the comparison types are used to determine order when confronted with a list containing aé and ae as to which should appear first (some cultures order things differently).
With string.Equals, ordering is not important. Therefore, is it safe to assume that many of the options are irrelevant, and only [Ordinal] and [Ordinal]IgnoreCase are important?
The MSDN article for String.Equals says
The comparisonType parameter indicates whether the comparison should
use the current or invariant culture, honor or ignore the case of the
two strings being compared, or use word or ordinal sort rules.
string.Equals(myString, theirString, StringComparison.OrdinalIgnoreCase)
I'd also be interested to know how the sort method works internally, does it use String.Compare to work out the relative positioning of two strings?
Case insensitive comparisons are culture dependent. For example using Turkish culture, i is not lowercase for I. With that culture I is paired with ı, and İ is paired with i. See Dotted and dotless I on Wikipedia.
There are a number of weird effects related to culture sensitive string operations. For example "KonNy".StartsWith("Kon") can return false.
So I recommend switching to culture insensitive operations even for seemingly harmless operations.
And even with culture-insensitive operations there is plenty of unintuitive behavior in Unicode, such as multiple representations of the same glyph, different code points that look identical, zero-width characters that are ignored by some operations but observed by others, ...
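As a rough before/after for the ToLower()-based comparison mentioned in the question, something like the following could be used. The strings are invented for the example, and the first result is deliberately not stated because it depends on the current culture.

```csharp
using System;

string stored = "TITLE";
string typed = "title";

// Old approach: ToLower() uses the current culture, so under a Turkish
// culture "TITLE" lowercases with a dotless 'ı' and the comparison can fail.
bool viaToLower = stored.ToLower() == typed.ToLower();

// Explicit approach: same answer regardless of the current culture.
bool viaOrdinalIgnoreCase = string.Equals(stored, typed, StringComparison.OrdinalIgnoreCase);

Console.WriteLine($"ToLower comparison: {viaToLower}");
Console.WriteLine($"OrdinalIgnoreCase:  {viaOrdinalIgnoreCase}");
```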

Query strings and text case

Is it safe for me to evaluate a query string's value and take case (upper/lower) into consideration? Do some browsers lowercase the whole string, for example? Is it reliable enough to code as though whatever parameters I add to the query string will keep the same case? (Obviously putting to one side the fact that users might mess with it.)
Tagged with C# as I'm not sure if the platform evaluating the query string affects the answer to this question; and it's C# I'm coding in.
Convention is key. If you use camel-cased query strings throughout your app, use camel-case, etc. You're going to be the one passing arguments and specifying query strings, so keep it consistent to make life easy on yourself. Other than keeping it consistent, there's no real benefit to a particular casing convention.
The browser will keep capitalization intact.

Is there a regex to test if a string is for a locale? [closed]

I don't know anything about regular expressions, but I think I have to use them for my problem. I have some file names that look like:
MyResource
MyResource.en-GB
MyResource.en-US
MyResource.fr-FR
MyResource.de-DE
The idea is to test if my strings end with "[letter][letter]-[letter][letter]"
I know this is a very noob, but I just have no idea about how to do it, even if I know exactly what I wanna do... :(
To cater for basic variants:
^[A-Za-z]{2,4}([_-][A-Za-z]{4})?([_-]([A-Za-z]{2}|[0-9]{3}))?$
which consists of:
Language code: ISO 639 2 or 3, or 4 for future use, alpha.
Optional script code: ISO 15924 4 alpha.
Optional country code: ISO 3166-1 2 alpha or 3 digit.
Separated by underscores or dashes.
Valid examples are:
de
en-US
zh-Hant-TW
En-au
aZ_cYrl-aZ.
For the OP's specific question, this would need to be prefixed by /^MyResource[.] and suffixed by $/ to ensure the whole file name is for a valid resource file that ends in a locale.
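Since the question is tagged C#, here is a sketch of that prefixed pattern using System.Text.RegularExpressions. The pattern is the lenient one from this answer with the MyResource prefix in front; the test file names are assumptions based on the question, plus one made-up non-match.

```csharp
using System;
using System.Text.RegularExpressions;

// Lenient locale pattern from the answer, anchored to the "MyResource." prefix.
var resourceFile = new Regex(
    @"^MyResource[.][A-Za-z]{2,4}([_-][A-Za-z]{4})?([_-]([A-Za-z]{2}|[0-9]{3}))?$");

string[] names = { "MyResource", "MyResource.en-GB", "MyResource.zh-Hant-TW", "MyResource.english" };
foreach (string name in names)
{
    // IsMatch only checks the shape of the tag, not whether the locale actually exists.
    Console.WriteLine($"{name}: {resourceFile.IsMatch(name)}");
}
```

Here the bare MyResource and MyResource.english are rejected, while the two names with well-formed tags are accepted.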
Note that some programming language's functions may only accept particular forms, like only underscores and uppercase country code. PHP's intl functions accept either case and separators. PayPal accepts only the language, or the la_CY form, where la is the language and CY is the country/region. The PHP locale_canonicalize function can be used to standardise to this format.
IETF RFC 5646, which governs internet usage of these tags, recommends a capitalisation and separation format like az-Cyrl-AZ, as used in the first three examples above, though it says processors should accept any mix of case and either separator, as per the last two examples. When displaying locales, using - as the separator allows finer-grained line-wrapping; the non-wrapping _ can otherwise leave lines largely empty, especially in table cells.
The regex for the recommended basic format is:
^[a-z]{2,4}(-[A-Z][a-z]{3})?(-([A-Z]{2}|[0-9]{3}))?$
The regexp only covers the basic format. There are variants for extras, like local region. RFC 5646 allows for such variants, along with private extensions and backwards-compatibility forms. It all depends upon the granularity required. The CLDR Unicode database, which is used by PHP's intl functions and other programs, may include such variants from version to version, though they can also disappear at a later time.
If using a CLDR-based function set, like PHP's intl extension, you can check if a locale exists in the intl database using a function like:
<?php
function is_locale($locale = '') {
    // STANDARDISE INPUT (e.g. "en-us" becomes "en_US")
    $locale = locale_canonicalize($locale);
    // LOAD ARRAY WITH LOCALES
    $locales = resourcebundle_locales('');
    // RETURN WHETHER FOUND (strict check, because index 0 is a valid hit)
    return array_search($locale, $locales) !== false;
}
?>
It takes about half a millisecond to load and search the data, so it won't be too much of a performance hit.
Of course, it will only find those in the database of the CLDR version supplied with the PHP version used, but will be updated with each subsequent PHP release.
Note that some locales are not for countries, but regions, and these are typically numeric, like 001 for 'World', 150 for 'Europe' and 419 for 'Latin America'. So there are now en-001, en-150, ar-001, and es-419, which can be used for generic language purposes. For example, en-001 was designed to decouple dependence upon en-us as an ersatz English, especially since its date formats and spellings are radically different from the 100 other regional en variants. The en-150 locale is the same as en-001 except for numbering separators and other Europe-specific formats.
In general, a regexp is a good front-end sanity check to filter out illegal characters, and especially to reserve the format for possible future additions. It also helps to prevent malicious character combinations being sent to the lookup facility, especially if text-based lookup command mechanisms, like SQL or Xpath, are used.
That would be testing your input against:
\.[a-z]{2}-[A-Z]{2}$
This is really very literal: match a dot (\., the dot being a special character in regexes), followed by exactly two of any characters from a to z ([a-z]{2}, where [...] is a character class), followed by a dash (-), followed by two of any characters from A to Z ([A-Z]{2}), followed by the end of input ($).
http://www.dotnetperls.com/regex-match <-- how to apply this regex in C# against an input. It means the code would look like (UNTESTED):
// Post edit: this will really return a boolean
if (Regex.Match(input, @"\.[a-z]{2}-[A-Z]{2}$").Success) {
// there is a match
}
http://regex.info <-- buy that and read it, it is the BEST resource for regular expressions in the universe
http://regular-expressions.info <-- the second best resource
Rather than use Regex, I suggest you use the built-in support for cultures in .NET, i.e. the System.Globalization.CultureInfo class. The constructor recognizes valid culture strings and gives you an object that can be used for culture-specific operations:
try
{
string fileName = "MyResource.en-GB";
string cultureName = System.IO.Path.GetExtension(fileName).TrimStart('.');
CultureInfo cultureInfo = new CultureInfo(cultureName);
}
catch (ArgumentException)
{
// Invalid culture.
}
You could try something like this:
[a-z]{2}-[a-z]{2}
You almost answered it in the question. Try:
// This basically grabs the locale.
string x = MyResource.whatever.... //Whatever it might be.
string locale = x.Substring(x.Length - 5); // Assuming the locale is 5 characters long.
// Now you have a 'locale' that is ready for comparisons.
if (locale == "en-GB") { .... }
if (locale == "fr-FR") { .... }
etc....
On a similar note, here is a useful list of two letter country codes.
http://en.wikipedia.org/wiki/ISO_3166-1_alpha-2
I know this isn't really regex, but you didn't seem sure about needing to use it absolutely.
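// Requires: using System.Globalization; using System.Linq;
var cultures = CultureInfo.GetCultures(CultureTypes.AllCultures);
// Skip the invariant culture: its Name is "" and would match every file name.
var matches = cultures.Where(o => o.Name.Length > 0 && filename.EndsWith("." + o.Name, StringComparison.Ordinal));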
This might not be an answer to this question, but one may pass by and be looking for this answer.
To match locales like en_GB you can use this expression:
/^[a-z]{2}_[A-Z]{2}$/
I'll try to explain it here:
^[a-z] means start with lower case letters and {2} means you expect exactly 2 of those
follow with _
[A-Z]{2}$ means end with upper case letters and match exactly 2 of those, $ means that these letters have to be in the end of the string.
An extension to the great answer by Patanjali, but also including named groups and support for private-use subtags as defined in RFC 5646. For example: de-DE-x-goethe or zh-Hant-CN-x-private1-private2.
^(?<language>[A-Za-z]{2,4})([_-](?<script>[A-Za-z]{4}|[0-9]{3}))?([_-](?<country>[A-Za-z]{2}|[0-9]{3}))?([_-]x[_-](?<private>[A-Za-z0-9-_]+))?$
^[a-z]{2}([_])?([A-Za-z]{2})?$
I used this regex, and it works for locales where the '_' and the region part are optional.
For example:
en,
de,
en_us,
en_US
So the regex matches if the locale is exactly two lowercase characters,
or two lowercase characters + _ + two characters (which can be uppercase).
