I'm quite new at C#. ReSharper warned me that using "string.IndexOf" is culture-specific. What exactly is being culture-specific?
A culture identifies a language together with its regional conventions, for example American English (en-US) and British English (en-GB): the same language, but different cultures as far as .NET is concerned. It matters for "string.IndexOf" because certain characters (think characters with accents and umlauts) are treated differently under different cultures. Unicode also allows some characters to be written in more than one way, for instance 'ä' as a single code point or as an 'a' followed by a combining umlaut; a culture-sensitive search can treat such a sequence as one character, while an ordinal search compares the raw values. So using "string.IndexOf" on a string with an umlaut might yield different results depending on the current culture. In most circumstances, especially if you're just learning, the default behavior of the string class will be just fine.
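To make the difference concrete, here is a minimal sketch that contrasts an ordinal search with a culture-sensitive one. The class and method names (IndexOfDemo, OrdinalIndex, CultureIndex) are made up for illustration; only string.IndexOf and CompareInfo.IndexOf are real framework calls.

```csharp
using System;
using System.Globalization;

public static class IndexOfDemo
{
    // "ä" as one precomposed code point (U+00E4).
    public const string Precomposed = "\u00E4";

    // "ä" as 'a' followed by a combining diaeresis (U+0308).
    public const string Decomposed = "a\u0308";

    // Ordinal search: compares the raw UTF-16 values, no linguistic rules.
    public static int OrdinalIndex(string text, string value) =>
        text.IndexOf(value, StringComparison.Ordinal);

    // Culture-sensitive search using the rules of an explicitly chosen culture.
    public static int CultureIndex(string text, string value, CultureInfo culture) =>
        culture.CompareInfo.IndexOf(text, value, CompareOptions.None);
}
```

An ordinal search for IndexOfDemo.Decomposed inside "B\u00E4r" finds nothing, because the raw values differ even though both spellings render as "ä". Calling CultureIndex with, say, CultureInfo.InvariantCulture lets you compare the two approaches on your own runtime; the culture-sensitive result can vary between .NET versions and platforms, so no value is quoted here.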
ReSharper is an extension that helps you follow coding standards. If a method has an overload that accepts a culture, it is very good practice to use that overload in your code.
The comparisonType parameter specifies whether to search for the value parameter using the current or invariant culture, using a case-sensitive or case-insensitive search, and using word or ordinal comparison rules.
In C#, you can compare two strings with String.Equals and supply a StringComparison.
I've recently been looking to update my archaic method of comparing ToLower() because I read that it doesn't work on all languages/cultures.
From what I can tell, the comparison types are used to determine order when confronted with a list containing aé and ae as to which should appear first (some cultures order things differently).
With string.Equals, ordering is not important. Is it therefore safe to assume that many of the options are irrelevant, and only [Ordinal] and [Ordinal]IgnoreCase are important?
The MSDN article for String.Equals says
The comparisonType parameter indicates whether the comparison should
use the current or invariant culture, honor or ignore the case of the
two strings being compared, or use word or ordinal sort rules.
string.Equals(myString, theirString, StringComparison.OrdinalIgnoreCase)
I'd also be interested to know how the sort method works internally: does it use String.Compare to work out the relative positioning of two strings?
Case insensitive comparisons are culture dependent. For example using Turkish culture, i is not lowercase for I. With that culture I is paired with ı, and İ is paired with i. See Dotted and dotless I on Wikipedia.
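A minimal sketch of the two kinds of case-insensitive equality, assuming you pass in whichever culture you want to test against. The class and method names are hypothetical; string.Equals with a StringComparison and string.Compare with a CultureInfo and CompareOptions are the real calls.

```csharp
using System;
using System.Globalization;

public static class CaseInsensitiveEquality
{
    // Culture-independent: 'I' and 'i' always match, whatever the user's locale.
    public static bool EqualsOrdinalIgnoreCase(string a, string b) =>
        string.Equals(a, b, StringComparison.OrdinalIgnoreCase);

    // Culture-sensitive: the casing pairs come from the supplied culture.
    public static bool EqualsIgnoreCase(string a, string b, CultureInfo culture) =>
        string.Compare(a, b, culture, CompareOptions.IgnoreCase) == 0;
}
```

With new CultureInfo("tr-TR") the culture-sensitive variant is expected to treat "FILE" and "file" as different strings, because Turkish pairs I with ı rather than with i. The ordinal variant is not affected by the culture at all.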
There are a number of weird effects related to culture sensitive string operations. For example "KonNy".StartsWith("Kon") can return false.
So I recommend switching to culture insensitive operations even for seemingly harmless operations.
And even with culture-insensitive operations there is plenty of unintuitive behavior in Unicode, such as multiple representations of the same glyph, different code points that look identical, and zero-width characters that are ignored by some operations but observed by others.
I don't know anything about regular expressions, but I think I have to use them for my problem. I have some filenames that look like:
MyResource
MyResource.en-GB
MyResource.en-US
MyResource.fr-FR
MyResource.de-DE
The idea is to test if my strings end with "[letter][letter]-[letter][letter]"
I know this is a very noob question, but I just have no idea how to do it, even though I know exactly what I want to do... :(
To cater for basic variants:
^[A-Za-z]{2,4}([_-][A-Za-z]{4})?([_-]([A-Za-z]{2}|[0-9]{3}))?$
which consists of:
Language code: ISO 639 2 or 3, or 4 for future use, alpha.
Optional script code: ISO 15924 4 alpha.
Optional country code: ISO 3166-1 2 alpha or 3 digit.
Separated by underscores or dashes.
Valid examples are:
de
en-US
zh-Hant-TW
En-au
aZ_cYrl-aZ.
For the OP's specific question, this would need /^MyResource[.] in place of the leading ^, with the trailing $ kept before the closing /, to ensure the whole file name is for a valid resource file that ends in a locale.
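Since the question is about C#, here is a minimal sketch of that composition using System.Text.RegularExpressions. The class name ResourceFileNames and the method HasLocaleSuffix are invented for the example, and the "MyResource" prefix is taken from the question.

```csharp
using System.Text.RegularExpressions;

public static class ResourceFileNames
{
    // The basic locale pattern from this answer, without its ^ and $ anchors.
    private const string LocalePattern =
        @"[A-Za-z]{2,4}([_-][A-Za-z]{4})?([_-]([A-Za-z]{2}|[0-9]{3}))?";

    // Anchored so the whole file name must be "MyResource." plus a locale.
    private static readonly Regex ResourceWithLocale =
        new Regex(@"^MyResource[.]" + LocalePattern + "$");

    public static bool HasLocaleSuffix(string fileName) =>
        ResourceWithLocale.IsMatch(fileName);
}
```

HasLocaleSuffix("MyResource.en-GB") matches, while the bare "MyResource" does not. Because the pattern only checks the shape of the tag, something like "MyResource.txt" also passes as a three-letter language code, so treat this as a format check rather than proof that the locale exists.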
Note that some programming languages' functions may only accept particular forms, like only underscores and uppercase country code. PHP's intl functions accept either case and separators. PayPal accepts only the language, or the la_CY form, where la is the language and CY is the country/region. The PHP locale_canonicalize function can be used to standardise to this format.
IETF RFC 5646, which governs internet usage of these tags, recommends a capitalisation and separation format like az-Cyrl-AZ, as used in the first three examples above, though it says processors should accept any mix of case and either separator, as per the last two examples. When displaying locales, using - as the separator allows finer-grained line-wrapping, which might otherwise produce significantly empty lines when the non-wrapping _ is used, especially in table cells.
The regex for the recommended basic format is:
^[a-z]{2,4}(-[A-Z][a-z]{3})?(-([A-Z]{2}|[0-9]{3}))?$
The regexp only covers the basic format. There are variants for extras, like local region. RFC 5646 allows for such variants, along with private extensions and backwards-compatibility forms. It all depends upon the granularity required. The CLDR Unicode database, which is used by PHP's intl functions and other programs, may include such variants from version to version, though they can also disappear at a later time.
If using a CLDR-based function set, like PHP's intl extension, you can check if a locale exists in the intl database using a function like:
<?php
function is_locale($locale=''){
// STANDARDISE INPUT
$locale=locale_canonicalize($locale);
// LOAD ARRAY WITH LOCALES
$locales=resourcebundle_locales('');
// RETURN WHETHER FOUND
return (array_search($locale,$locales)!==false);
}
?>
It takes about half a millisecond to load and search the data, so it won't be too much of a performance hit.
Of course, it will only find those in the database of the CLDR version supplied with the PHP version used, but will be updated with each subsequent PHP release.
Note that some locales are not for countries, but regions, and these are typically numeric, like 001 for 'World', 150 for 'Europe' and 419 for 'Latin America'. So there are now en-001, en-150, ar-001, and es-419, which can be used for generic language purposes. For example, en-001 was designed to decouple dependence upon en-us as an ersatz English, especially since its date formats and spellings are radically different from the 100 other regional en variants. The en-150 locale is the same as en-001 except for numbering separators and other Europe-specific formats.
In general, a regexp is a good front-end sanity check to filter out illegal characters, and especially to reserve the format for possible future additions. It also helps to prevent malicious character combinations being sent to the lookup facility, especially if text-based lookup command mechanisms, like SQL or Xpath, are used.
That would be testing your input against:
\.[a-z]{2}-[A-Z]{2}$
This is really very literal: "match a dot (\., the dot being a special character in regexes), followed by exactly two of any characters from a to z ([a-z]{2} -- [...] is a character class), followed by a dash (-), followed by two of any characters from A to Z ([A-Z]{2}), followed by the end of input ($)".
http://www.dotnetperls.com/regex-match <-- how to apply this regex in C# against an input. It means the code would look like (UNTESTED):
// Post edit: this will really return a boolean
if (Regex.Match(input, @"\.[a-z]{2}-[A-Z]{2}$").Success) {
// there is a match
}
http://regex.info <-- buy that and read it, it is the BEST resource for regular expressions in the universe
http://regular-expressions.info <-- the second best resource
Rather than use Regex, I suggest you use the built-in support for cultures in .Net, i.e., the System.Globalization.CultureInfo class; the constructor recognizes valid culture strings, and gives you an object that can be used for culture specific operations:
try
{
string fileName = "MyResource.en-GB";
string cultureName = System.IO.Path.GetExtension(fileName).TrimStart('.');
CultureInfo cultureInfo = new CultureInfo(cultureName);
}
catch (ArgumentException)
{
// Invalid culture.
}
You could try something like this:
[a-z]{2}-[a-z]{2}
You almost answered it in the question. Try:
// This basically grabs the locale.
string x = "MyResource.en-GB"; // Whatever it might be.
string locale = x.Substring(x.Length - 5); // Assuming the locale is 5 characters long.
// Now you have a 'locale' that is ready for comparisons.
if (locale == "en-GB") { .... }
if (locale == "fr-FR") { .... }
etc....
On a similar note, here is a useful list of two letter country codes.
http://en.wikipedia.org/wiki/ISO_3166-1_alpha-2
I know this isn't really regex, but you didn't seem sure about needing to use it absolutely.
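var cultures = CultureInfo.GetCultures(System.Globalization.CultureTypes.AllCultures);
// Skip the invariant culture, whose empty Name would match every file name.
var matches = cultures.Where(o => o.Name.Length > 0 && filename.EndsWith("." + o.Name, StringComparison.OrdinalIgnoreCase));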
This might not be an answer to this question, but one may pass by and be looking for this answer.
To match locales like en_GB you can use this expression:
/^[a-z]{2}_[A-Z]{2}$/
I'll try to explain it here:
^[a-z] means start with lower case letters and {2} means you expect exactly 2 of those
follow with _
[A-Z]{2}$ means end with upper case letters and match exactly 2 of those, $ means that these letters have to be in the end of the string.
An extension to the great answer by Patanjali, but also including named groups and support for private-use as defined in RFC 4647. For example: de-DE-x-goethe or zh-Hant-CN-x-private1-private2.
^(?<language>[A-Za-z]{2,4})([_-](?<script>[A-Za-z]{4}|[0-9]{3}))?([_-](?<country>[A-Za-z]{2}|[0-9]{3}))?([_-]x[_-](?<private>[A-Za-z0-9-_]+))?$
^[a-z]{2}([_])?([A-Za-z]{2})?$
I used this regex and it works for locale only having optional '_'
For example:
en,
de,
en_us,
en_US
So Regex works if the locale has only fixed two chars (only lowercase)
or it has two chars (only lowercase) + _ + two chars (can be uppercase)
Background
I need to validate user input in some fields, where these are defining how to show time in some views.
Requirements
Time format must be expressed in Microsoft .NET way (check this MSDN Library article if you want to learn more about framework's date and time formatting: http://msdn.microsoft.com/en-us/library/8kb3ddd4.aspx)
Keep in mind I'm looking to validate the format instead of an actual time string.
For example, user may input:
HH:mm
hh:mm
ss
hh:ss
mm:ss
... and so on.
In fact, it should validate from the shortest to longest time format available.
Another point is that I need to do it client-side using JavaScript. In other words, any regular expression you suggest should work in browsers' JavaScript regular expression engines.
I'll appreciate any self-tailored one, any link or pasted expression!
Thank you in advance.
NOTE (Update)
I can't use ASP.NET validation engine, or any other. Because of project's requirements, I need to avoid that.
As far as I understand, there aren't many options: about 20 at most. Why not just enumerate them all in one big regex without many special symbols? Like
'hh:mm|hh:mm:ss|yyyy-MM-dd hh:mm|<etc>'
You could then make it case sensitive to differentiate between M for month and m for minute, use [hH] for hours, use [:/-] wherever you allow different separators (with the - placed last so it is not read as a range), and lots of other similar things. But the main idea is simply to enumerate all options separated by |, with only a small amount of regex syntax between one | and the next.
What is your definition of a "valid" format string? Only once you know that can it be possible to validate a format string.
"K" is also a valid format string
"zz" is also a valid format string
"e" is also a valid format (it would fall into the "The character is copied to the result string unchanged." case)
I'm not even sure what formats would actually cause .NET .ToString() to throw an exception (if that's what you are trying to avoid).
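One way to find out is to ask .NET directly. The sketch below is only a server-side probe for exploring which strings ToString rejects; it does not meet the client-side JavaScript requirement in the question. The class name TimeFormatProbe, the method name and the sample date are arbitrary choices.

```csharp
using System;
using System.Globalization;

public static class TimeFormatProbe
{
    // Arbitrary fixed sample value; any DateTime would do.
    private static readonly DateTime Sample = new DateTime(2000, 1, 2, 3, 4, 5);

    // Returns true when DateTime.ToString accepts the format string,
    // false when it throws FormatException.
    public static bool IsAcceptedByToString(string format)
    {
        try
        {
            Sample.ToString(format, CultureInfo.InvariantCulture);
            return true;
        }
        catch (FormatException)
        {
            return false;
        }
    }
}
```

Formats such as "HH:mm" are accepted, while a string ending in a lone backslash (an escape with nothing left to escape) is rejected. Keep in mind that single-character strings are read as standard format specifiers rather than custom ones, so they are worth probing separately.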
What's the difference and when to use what? What's the risk if I always use ToLower() and what's the risk if I always use ToLowerInvariant()?
Depending on the current culture, ToLower might produce a culture-specific lowercase letter that you aren't expecting, such as producing ınfo without the dot on the i instead of info, and thus mucking up string comparisons. For that reason, ToLowerInvariant should be used on any non-language-specific data. Generally, the only time to use ToLower is when the input is user text that might be in the user's native language or character set.
See this question for an example of this issue:
C#- ToLower() is sometimes removing dot from the letter "I"
TL;DR:
When working with "content" (e.g. articles, posts, comments, names, places, etc.) use ToLower(). When working with "literals" (e.g. command line arguments, custom grammars, strings that should be enums, etc.) use ToLowerInvariant().
Examples:
=Using ToLowerInvariant incorrectly=
In Turkish, DIŞ means "outside" and diş means "tooth". The proper lower casing of DIŞ is dış, but ToLowerInvariant produces diş. So, if you use ToLowerInvariant incorrectly you may have typos in Turkey.
=Using ToLower incorrectly=
Now pretend you are writing an SQL parser. Somewhere you will have code that looks like:
if (@operator.ToLower() == "like") // operator is a C# keyword, hence the @ prefix
{
// Handle an SQL LIKE operator
}
The SQL grammar does not change when you change cultures. A Frenchman does not write SÉLECTIONNEZ x DE books instead of SELECT X FROM books. However, in order for the above code to work, a Turkish person would need to write SELECT x FROM books WHERE Author LİKE '%Adams%' (note the dot above the capital i, almost impossible to see). This would be quite frustrating for your Turkish user.
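A minimal sketch of a culture-independent keyword check, next to the fragile ToLower variant with its culture made explicit so the effect can be reproduced. The class and method names are hypothetical.

```csharp
using System;
using System.Globalization;

public static class KeywordMatching
{
    // Culture-independent: suitable for fixed grammar keywords.
    public static bool IsLikeOperator(string token) =>
        string.Equals(token, "like", StringComparison.OrdinalIgnoreCase);

    // The fragile approach: the result depends on the culture used for lowercasing.
    public static bool IsLikeOperatorViaToLower(string token, CultureInfo culture) =>
        token.ToLower(culture) == "like";
}
```

IsLikeOperator("LIKE") is true regardless of the user's settings. Passing new CultureInfo("tr-TR") to the second method lowercases "LIKE" to "lıke" with a dotless ı, so that check fails for exactly the user described above.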
I think this can be useful:
http://msdn.microsoft.com/en-us/library/system.string.tolowerinvariant.aspx
update
If your application depends on the case of a string changing in a predictable way that is unaffected by the current culture, use the ToLowerInvariant method. The ToLowerInvariant method is equivalent to ToLower(CultureInfo.InvariantCulture). The method is recommended when a collection of strings must appear in a predictable order in a user interface control.
also
...ToLower is very similar in most places to ToLowerInvariant. The documents indicate that these methods will only change behavior with Turkish cultures. Also, on Windows systems, the file system is case-insensitive, which further limits its use...
http://www.dotnetperls.com/tolowerinvariant-toupperinvariant
hth
String.ToLower() uses the current (default) culture while String.ToLowerInvariant() uses the invariant culture. So you are essentially asking about the differences between invariant culture and ordinal string comparison.
I have some Excel data, and I programmatically create SQL table columns from the Excel columns. One of the column names is mydetail. When I try to convert it to uppercase I get MYDETAİL. How do I use the ToUpper() method to obtain MYDETAIL, not MYDETAİL?
I'm guessing that you are Turkish, or at least using a Turkish computer.
In Turkish the "i" does convert to "İ" in upper case.
You need to use a different culture when doing the conversion, by using the String.ToUpper method that takes a CultureInfo object as an argument. If you use en-US or en-GB you should get what you want.
In fact the example on the page I linked to uses en-US and tr-TR (Turkey-Turkish) on the word "indigo" as an example of the differences.
Try something like:
String result = source.ToUpper(CultureInfo.InvariantCulture);
From MSDN:
use the InvariantCulture to ensure that the behavior will be consistent regardless of the culture settings of the system
You will need to call .ToUpper() with the desired CultureInfo. See MSDN with some examples on how to use .ToUpper(CultureInfo).
It is recommended to specify CultureInfo on all string manipulation methods like String.Format(), <primitive>.ToString() or, for example, Convert.ToInt32(object, IFormatProvider).
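A short sketch of what passing the culture explicitly looks like for formatting and parsing. The class and method names are made up for the example; ToString(string, IFormatProvider) and double.Parse(string, NumberStyles, IFormatProvider) are the framework calls being illustrated.

```csharp
using System;
using System.Globalization;

public static class ExplicitCultureDemo
{
    // Formats with the separators of the culture the caller chooses.
    public static string FormatAmount(double amount, CultureInfo culture) =>
        amount.ToString("N2", culture);

    // Parses machine-readable text the same way on every system.
    public static double ParseInvariant(string text) =>
        double.Parse(text, NumberStyles.Float, CultureInfo.InvariantCulture);
}
```

With CultureInfo.InvariantCulture, FormatAmount(1234.5, ...) gives "1,234.50" on any machine; passing a user's culture instead swaps in that culture's group and decimal separators. The usual split is invariant for data that programs read back, and an explicit user culture for text shown to people.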
FxCop does a good job in reminding you on issues with this in your code.