I have ISO 3166-1 (alpha-2) country codes, which are two-letter codes, such as "US" and "NL". How do I get the corresponding flag emoji?
EDIT: Preferably I would like to do this without using an explicit mapping between country codes and their corresponding emojis. It has been done in JavaScript but I'm not sure how to do it in C#.
Solution:
public static string IsoCountryCodeToFlagEmoji(this string country)
{
    return string.Concat(country.ToUpper().Select(x => char.ConvertFromUtf32(x + 0x1F1A5)));
}
string gb = "gb".IsoCountryCodeToFlagEmoji(); // 🇬🇧
string fr = "fr".IsoCountryCodeToFlagEmoji(); // 🇫🇷
(Original answer, superseded by the solution above:)
You will need to generate a cross-reference table or dictionary that allows you to look up the corresponding emoji. Luckily it looks like you've already found a great source for the information you need!
You can go here to find a chart of the appropriate Unicode symbols for each letter. Basically, you just use the regional indicator symbol for each letter. For example, NZ would be U+1F1F3 (N) + U+1F1FF (Z). These two symbols are interpreted as the NZ flag if the platform supports that emoji.
Because these regional indicator symbols are all contiguous, you can calculate the appropriate code point for a given letter by using an offset from the normal upper-case letters. You may have seen it in the code repository you referenced: the offset is 127397 (0x1F1A5). Thus, 'A' + 127397 is the regional indicator symbol for A.
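For illustration, here is a minimal sketch of the reverse direction using the same 127397 offset (the helper name is my own, and it assumes the input is a well-formed sequence of regional indicator symbols):

using System.Text;

public static string FlagEmojiToIsoCountryCode(this string flag)
{
    // Each regional indicator symbol is a surrogate pair in UTF-16, so step
    // through the string by code point and subtract the offset to get 'A'..'Z'.
    var sb = new StringBuilder();
    for (int i = 0; i < flag.Length; i += char.IsSurrogatePair(flag, i) ? 2 : 1)
        sb.Append((char)(char.ConvertToUtf32(flag, i) - 127397));
    return sb.ToString();
}

// "🇬🇧".FlagEmojiToIsoCountryCode() == "GB"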
Thanks for teaching me something new today, and good luck!
Related
First of all I'd like to mention that I'm new to programming and this site, so I'm still an infant in this world. However, I have a problem.
I have to make code that can compare two strings but the second string (from a file) will have unique identifiers within it. For example:
first string:
I have 10 cats and their fur is #000000
Second string from a file:
I have <d> cats and their fur is <h>
Although I probably don't need to explain: 'd' is for numbers (decimal) and 'h' is for hex. There are also 's' and 'a' identifiers associated with ASCII.
What's supposed to happen is that the first string can contain any number (of any length) and/or any hex value when the data comes in, while the rest of the message stays the same, e.g.
I have 1500 cats and their fur is #000000
The code should still match the two strings as a true match, because it effectively ignores anything that is an int or hex value. (These identifiers are user-defined, so they can appear anywhere in any string.)
The end game is that if it finds a relative match, the code will change the colour of the text in the app, among other things. It's basically to highlight errors in a log file.
I've searched high and low on Stack Overflow and looked into regex and string comparisons. I'm about to make a start on the code, but I would like some input/help.
Obviously I'm not asking for something to be written for me, just to be pointed in the right direction so I can learn.
Many thanks in advance! And apologies if there is a similar post out there, but alas I couldn't find it if there is.
If I understand it correctly, I would solve this by replacing the <d> etc. with a regex expression, and then using that regex to replace the values with an empty string. That way you can compare the strings without the values.
Hope that makes sense. I didn't include any code because you asked for just some directions.
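That said, here is a minimal sketch of the idea (the placeholder-to-pattern mapping below is my assumption; adjust it for your 's' and 'a' identifiers too):

using System.Text.RegularExpressions;

static bool MatchesTemplate(string input, string template)
{
    // Escape regex metacharacters in the template, then swap each placeholder
    // for a character class matching the values it stands for.
    string pattern = "^" + Regex.Escape(template)
        .Replace("<d>", @"\d+")               // decimal numbers
        .Replace("<h>", @"#?[0-9A-Fa-f]+")    // hex values, optionally prefixed with #
        + "$";
    return Regex.IsMatch(input, pattern);
}

// MatchesTemplate("I have 1500 cats and their fur is #000000",
//                 "I have <d> cats and their fur is <h>")  // true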
This question already has answers here: How to get distinct characters? (9 answers). Closed 8 years ago.
Let's say we have the variable myString = "blabla" or myString = "998769".
myString.Length; // will get you the total length
myString.Count(char.IsLetter); // if you only want the count of letters
How do I get the unique character count? I mean, for "blabla" the result must be 3, and for "998769" it will be 4. Is there a ready-to-go function? Any suggestions?
You can use LINQ:
var count = myString.Distinct().Count();
It uses the fact that string implements IEnumerable<char>.
Without LINQ, you can do the same stuff Distinct does internally and use HashSet<char>:
var count = (new HashSet<char>(myString)).Count;
If you only handle ANSI text in English (or, more generally, characters from the BMP), then 80% of the time, if you write:
myString.Distinct().Count()
you will live happily and won't ever have any trouble. I'm posting this answer only for those who really need to handle this in the proper way. I'd say everyone should, but I know it's not true (quote from Wikipedia):
Because the most commonly used characters are all in the Basic Multilingual Plane, handling of surrogate pairs is often not thoroughly tested. This leads to persistent bugs and potential security holes, even in popular and well-reviewed application software (e.g. CVE-2008-2938, CVE-2012-2135)
The problem with our first naïve solution is that it doesn't handle Unicode properly, and it also doesn't consider what the user perceives as a character. Try "𠀑".Distinct().Count() and your code will wrongly return... 2, because its UTF-16 representation is 0xD840 0xDC11 (by the way, neither of them, alone, is a valid Unicode character, because they're a high and a low surrogate, respectively).
Here I won't be very strict about terms and definitions, so please refer to www.unicode.org as a reference. For a (much) broader discussion please read How can I perform a Unicode aware character by character comparison?; encoding isn't the only issue you have to consider.
1) It doesn't take into account that a .NET System.Char doesn't represent a character (or, more precisely, a grapheme) but a code unit of UTF-16 encoded text. Often they coincide, but not always (as with ideographic characters, for example).
2) If you're counting what the user thinks of (or perceives) as a character, then this will fail again, because it doesn't handle combining characters like ا́ (there are many examples of this in Arabic). There are also duplicates that exist for historical reasons: for example, é is both a single Unicode code point and a combination, so that code will fail (see the sketch after this list).
3) We're talking about a Western/American definition of character. If you're counting characters for end users, you may need to change your definition to what they expect (for example, in Korean the definition of a character may not be so obvious; another example is the Czech digraph ch, which is always counted as a single character). Finally, don't forget the strange things that happen when you convert characters to upper/lower case (for example, in German ß becomes SS in upper case; see also this post).
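To make point 2 concrete, a quick sketch (same perceived character, two different encodings, two different counts; assumes using System; using System.Linq;):

string precomposed = "\u00E9";   // é as a single code point (U+00E9)
string combining   = "e\u0301";  // e + COMBINING ACUTE ACCENT (U+0301)

Console.WriteLine(precomposed.Distinct().Count()); // 1
Console.WriteLine(combining.Distinct().Count());   // 2, not what the user perceives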
Encoding
C# strings are encoded as UTF-16 (a char is two bytes), but UTF-16 isn't a fixed-size encoding, and char should properly be called a code unit. What does that mean? That you may have a string whose Length is 2 but where the user will see (and it actually is) just one character (so the count should be 1).
If you need to handle this properly, then you have to make things more complicated (and slower). Fortunately the Char class has some helpful methods for handling surrogates.
The following code is untested (and for illustration purposes only, so absolutely not optimized; I'm sure it can be done much better than this), so take it just as a starting point for further investigation:
int CountCharacters(string text)
{
    HashSet<string> characters = new HashSet<string>();
    string currentCharacter = "";

    for (int i = 0; i < text.Length; ++i)
    {
        if (Char.IsHighSurrogate(text, i))
        {
            // Do not count this, next one will give the full pair
            currentCharacter = text[i].ToString();
            continue;
        }
        else if (Char.IsLowSurrogate(text, i))
        {
            // Our "character" is encoded as previous one plus this one
            currentCharacter += text[i];
        }
        else
        {
            currentCharacter = text[i].ToString();
        }

        if (!characters.Contains(currentCharacter))
            characters.Add(currentCharacter);
    }

    return characters.Count;
}
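For instance, assuming the method above and the usual usings (a quick sanity check; the string below is two identical ideographs followed by 'a', i.e. three perceived characters, two of them unique):

Console.WriteLine(CountCharacters("𠀑𠀑a"));   // 2
Console.WriteLine("𠀑𠀑a".Distinct().Count()); // 3 -- counts raw UTF-16 code units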
Note that this example doesn't handle duplicates (where the same character may have different encodings, or may be either a single code point or a combining sequence).
Combining Characters
If you have to handle combining characters (and of course encoding), then the best way to do it is to use the StringInfo class. You'll enumerate (and then count) text elements, which cover both combining sequences and surrogate pairs:
StringInfo.GetTextElementEnumerator(text).Walk()
.Distinct().Count();
Walk() is a trivial extension method to implement; it simply walks through all the elements of an IEnumerator (we need it because GetTextElementEnumerator() returns an IEnumerator instead of an IEnumerable).
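For example, a minimal sketch of such a Walk() helper might look like this (the name is just what this answer uses; it is not part of the framework):

using System.Collections;
using System.Collections.Generic;

public static class EnumeratorExtensions
{
    // Adapts the non-generic IEnumerator returned by GetTextElementEnumerator
    // into an IEnumerable<string> of text elements.
    public static IEnumerable<string> Walk(this IEnumerator enumerator)
    {
        while (enumerator.MoveNext())
            yield return (string)enumerator.Current;
    }
}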
Please note that after the text has been properly split it can be counted with our first solution (the point is that the building block isn't a char but a sequence of chars, for simplicity returned here as a string). Again, this code doesn't handle duplicates.
Culture
There is not much you can do to handle the issues listed at point 3. Each language has its own rules, and supporting them all can be a pain. There are more examples of culture issues in this longer, more specific post.
It's important to be aware of them (so you have to know a little bit about the languages you're targeting), and don't forget that Unicode and a few translated resx files won't make your application global.
If text processing is important in your application, you can solve many issues using specialized DLLs for each locale you support (to count characters, to count words and so on), as word processors do. For example, the issues I listed can be solved simply using dictionaries. What I usually do is avoid the standard .NET string functions (also because of some bugs); I create a Unicode class with static methods for everything I need (character counting, conversions, comparison) and many specialized derived classes for each supported language. At run time those static methods use the current thread's culture name to pick the proper implementation from a dictionary and delegate the work to it. A skeleton may look something like this:
abstract class Unicode
{
    public static int CountCharacters(string text)
    {
        return GetConcreteClass().CountCharactersCore(text);
    }

    protected virtual int CountCharactersCore(string text)
    {
        // Default implementation, overridden in derived classes if needed.
        // Walk() is the same helper extension method shown earlier.
        return StringInfo.GetTextElementEnumerator(text).Walk()
            .Distinct().Count();
    }

    private static Dictionary<string, Unicode> _implementations;

    private static Unicode GetConcreteClass()
    {
        string cultureName = Thread.CurrentThread.CurrentCulture.Name;

        // Check if concrete class has been loaded and put in dictionary
        ...

        return _implementations[cultureName];
    }
}
If you're using C#, then LINQ comes nicely to the rescue - again:
"blabla".Distinct().Count()
will do it.
Closed. This question needs to be more focused. It is not currently accepting answers. Closed 3 years ago.
I don't know anything about regular expressions, but I think I have to use them for my problem. I have some file names that look like:
MyResource
MyResource.en-GB
MyResource.en-US
MyResource.fr-FR
MyResource.de-DE
The idea is to test if my strings end with "[letter][letter]-[letter][letter]"
I know this is a very noob question, but I just have no idea how to do it, even though I know exactly what I want to do... :(
To cater for basic variants:
^[A-Za-z]{2,4}([_-][A-Za-z]{4})?([_-]([A-Za-z]{2}|[0-9]{3}))?$
which consists of:
Language code: ISO 639, 2 or 3 letters (or 4, reserved for future use).
Optional script code: ISO 15924, 4 letters.
Optional country code: ISO 3166-1, 2 letters or 3 digits.
Separated by underscores or dashes.
Valid examples are:
de
en-US
zh-Hant-TW
En-au
aZ_cYrl-aZ.
For the OP's specific question, this would need to be prefixed by /^MyResource[.] and suffixed by $/ to ensure the whole file name is for a valid resource file that ends in a locale.
Note that some programming languages' functions may only accept particular forms, like only underscores and an uppercase country code. PHP's intl functions accept either case and either separator. PayPal accepts only the language, or the la_CY form, where la is the language and CY is the country/region. The PHP locale_canonicalize function can be used to standardise to this format.
IETF RFC 5646, which governs internet usage of these tags, recommends a capitalisation and separation format like az-Cyrl-AZ, as used in the first three examples above, though it says processors should accept any mix of case and either separator, as per the last two examples. When displaying locales, using - as the separator allows finer-grained line-wrapping, which might otherwise produce largely empty lines when the non-wrapping _ is used, especially in table cells.
The regex for the recommended basic format is:
^[a-z]{2,4}(-[A-Z][a-z]{3})?(-([A-Z]{2}|[0-9]{3}))?$
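As a quick illustration, the same recommended-format pattern used from C# (a sketch; the example strings come from this answer):

using System.Text.RegularExpressions;

var basicLocale = new Regex(@"^[a-z]{2,4}(-[A-Z][a-z]{3})?(-([A-Z]{2}|[0-9]{3}))?$");

Console.WriteLine(basicLocale.IsMatch("zh-Hant-TW")); // True
Console.WriteLine(basicLocale.IsMatch("en-150"));     // True (numeric region)
Console.WriteLine(basicLocale.IsMatch("aZ_cYrl-aZ")); // False: mixed case and '_' need the looser pattern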
The regexp only covers the basic format. There are variants for extras, like local region. RFC 5646 allows for such variants, along with private extensions and backwards-compatibility forms. It all depends upon the granularity required. The CLDR Unicode database, which is used by PHP's intl functions and other programs, may include such variants from version to version, though they can also disappear at a later time.
If using a CLDR-based function set, like PHP's intl extension, you can check if a locale exists in the intl database using a function like:
<?php
function is_locale($locale=''){
    // STANDARDISE INPUT
    $locale=locale_canonicalize($locale);
    // LOAD ARRAY WITH LOCALES
    $locales=resourcebundle_locales('');
    // RETURN WHETHER FOUND
    return (array_search($locale,$locales)!==false);
}
?>
It takes about half a millisecond to load and search the data, so it won't be too much of a performance hit.
Of course, it will only find those in the database of the CLDR version supplied with the PHP version used, but will be updated with each subsequent PHP release.
Note that some locales are not for countries, but regions, and these are typically numeric, like 001 for 'World', 150 for 'Europe' and 419 for 'Latin America'. So there are now en-001, en-150, ar-001, and es-419, which can be used for generic language purposes. For example, en-001 was designed to decouple dependence upon en-us as an ersatz English, especially since its date formats and spellings are radically different from the 100 other regional en variants. The en-150 locale is the same as en-001 except for numbering separators and other Europe-specific formats.
In general, a regexp is a good front-end sanity check to filter out illegal characters, and especially to reserve the format for possible future additions. It also helps to prevent malicious character combinations being sent to the lookup facility, especially if text-based lookup command mechanisms, like SQL or Xpath, are used.
That would be testing your input against:
\.[a-z]{2}-[A-Z]{2}$
This is really very literal: "match a dot (\., the dot being a special character in regexes), followed by exactly two of any characters from a to z ([a-z]{2} -- [...] is a character class), followed by a dash (-), followed by two of any characters from A to Z ([A-Z]{2}), followed by the end of input ($)."
http://www.dotnetperls.com/regex-match <-- how to apply this regex in C# against an input. It means the code would look like (UNTESTED):
// Post edit: this will really return a boolean
if (Regex.Match(input, @"\.[a-z]{2}-[A-Z]{2}$").Success) {
    // there is a match
}
http://regex.info <-- buy that and read it, it is the BEST resource for regular expressions in the universe
http://regular-expressions.info <-- the second best resource
Rather than use Regex, I suggest you use the built-in support for cultures in .Net, i.e., the System.Globalization.CultureInfo class; the constructor recognizes valid culture strings, and gives you an object that can be used for culture specific operations:
try
{
    string fileName = "MyResource.en-GB";
    string cultureName = System.IO.Path.GetExtension(fileName).TrimStart('.');
    CultureInfo cultureInfo = new CultureInfo(cultureName);
}
catch (ArgumentException)
{
    // Invalid culture.
}
You could try something like this:
[a-z]{2}-[a-z]{2}
You almost answered it in the question. Try:
// This basically grabs the locale.
string x = MyResource.whatever.... // Whatever it might be.
string locale = x.Substring(x.Length - 5); // Assuming the locale is 5 characters long.
// Now you have a 'locale' that is ready for comparisons.
if (locale == "en-GB") { .... }
if (locale == "fr-FR") { .... }
etc....
On a similar note, here is a useful list of two letter country codes.
http://en.wikipedia.org/wiki/ISO_3166-1_alpha-2
I know this isn't really regex, but you didn't seem sure about absolutely needing to use it.
var cultures = CultureInfo.GetCultures(System.Globalization.CultureTypes.AllCultures);
var matches = cultures.Where(o => filename.EndsWith(o.Name));
This might not be an answer to this exact question, but someone may pass by looking for this answer.
To match locales like en_GB you can use this expression:
/^[a-z]{2}_[A-Z]{2}$/
I'll try to explain it here:
^[a-z] means it starts with lower-case letters, and {2} means exactly 2 of those,
followed by _,
[A-Z]{2}$ means it ends with exactly 2 upper-case letters; $ means these letters have to be at the end of the string.
An extension to the great answer by Patanjali, but also including named groups and support for private-use as defined in RFC 4647. For example: de-DE-x-goethe or zh-Hant-CN-x-private1-private2.
^(?<language>[A-Za-z]{2,4})([_-](?<script>[A-Za-z]{4}|[0-9]{3}))?([_-](?<country>[A-Za-z]{2}|[0-9]{3}))?([_-]x[_-](?<private>[A-Za-z0-9-_]+))?$
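A quick sketch of how the named groups can be consumed from C# (the output comments assume the second example tag above):

using System;
using System.Text.RegularExpressions;

var rx = new Regex(
    @"^(?<language>[A-Za-z]{2,4})([_-](?<script>[A-Za-z]{4}|[0-9]{3}))?" +
    @"([_-](?<country>[A-Za-z]{2}|[0-9]{3}))?([_-]x[_-](?<private>[A-Za-z0-9-_]+))?$");

var m = rx.Match("zh-Hant-CN-x-private1-private2");
Console.WriteLine(m.Groups["language"].Value); // zh
Console.WriteLine(m.Groups["script"].Value);   // Hant
Console.WriteLine(m.Groups["country"].Value);  // CN
Console.WriteLine(m.Groups["private"].Value);  // private1-private2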
^[a-z]{2}([_])?([A-Za-z]{2})?$
I used this regex and it works for locales with only an optional '_'.
For example:
en,
de,
en_us,
en_US
So the regex works if the locale is just two characters (lowercase only),
or two characters (lowercase only) + _ + two characters (which can be uppercase).
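For a quick check of that claim (a sketch):

using System;
using System.Text.RegularExpressions;

var rx = new Regex(@"^[a-z]{2}([_])?([A-Za-z]{2})?$");
foreach (var s in new[] { "en", "de", "en_us", "en_US", "EN_us" })
    Console.WriteLine($"{s}: {rx.IsMatch(s)}");
// en: True, de: True, en_us: True, en_US: True, EN_us: False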
I have a string in C#.
How can I detect if this string contains characters from different languages?
I.e.: a person fills in his English name in a text box and also his local-language name.
I want to disallow that.
Something like this:
"check the language table of the chars in the string and if they come
from different Unicode tables - return ERROR".
But I think there is a problem for 'a' in US or UK.
Maybe I'm wrong.
How can I recognize more than one language?
I think you're searching for code points: the unique identifier of a character in a code page. I think this should be useful to you: How would you get an array of Unicode code points from a .NET String?. Once you get the code-point array from the string, you can check it against the ranges of code points you want.
Hope this helps.
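To sketch the idea, something like the following could work. The ranges below are rough assumptions for illustration only (a real implementation would use the Unicode script/block tables), and it assumes well-formed UTF-16 input:

using System;
using System.Collections.Generic;

static bool ContainsMultipleScripts(string s)
{
    var scripts = new HashSet<string>();
    for (int i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
    {
        int cp = char.ConvertToUtf32(s, i);
        if (cp >= 0x0041 && cp <= 0x024F && char.IsLetter(s, i)) scripts.Add("Latin");
        else if (cp >= 0x0590 && cp <= 0x05FF) scripts.Add("Hebrew");
        else if (cp >= 0x0600 && cp <= 0x06FF) scripts.Add("Arabic");
        else if (cp >= 0x4E00 && cp <= 0x9FFF) scripts.Add("CJK");
        // ...add whichever ranges matter for your users
    }
    return scripts.Count > 1;
}

// ContainsMultipleScripts("John Smith")  -> false
// ContainsMultipleScripts("John يوحنا")  -> true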
I have an address class that uses a regular expression to parse the house number, street name, and street type from the first line of an address. This code is generally working well, but I'm posting here to share with the community and to see if anyone has suggestions for improvement.
Note: The STREETTYPES and QUADRANTS constants contain all of the relevant street types and quadrants respectively.
I've included a subset here:
private const string STREETTYPES = @"ALLEY|ALY|ANNEX|AX|ARCADE|ARC|AVENUE|AV|AVE|BAYOU|BYU|BEACH|...";
private const string QUADRANTS = "N|NORTH|S|SOUTH|E|EAST|W|WEST|NE|NORTHEAST|NW|NORTHWEST|SE|SOUTHEAST|SW|SOUTHWEST";
HouseNumber, Quadrant, StreetName, and StreetType are all properties on the class.
private void Parse(string line1)
{
HouseNumber = string.Empty;
Quadrant = string.Empty;
StreetName = string.Empty;
StreetType = string.Empty;
if (!String.IsNullOrEmpty(line1))
{
string noPeriodsLine1 = String.Copy(line1);
noPeriodsLine1 = noPeriodsLine1.Replace(".", "");
string addressParseRegEx =
#"(?ix)
^
\s*
(?:
(?<housenumber>\d+)
(?:(?:\s+|-)(?<quadrant>" +
QUADRANTS +
#"))?
(?:(?:\s+|-)(?<streetname>\S+(?:\s+\S+)*?))??
(?:(?:\s+|-)(?<quadrant>" +
QUADRANTS + #"))?
(?:(?:\s+|-)(?<streettype>" + STREETTYPES +
#"))?
(?:(?:\s+|-)(?<streettypequalifier>(?!(?:" +
QUADRANTS +
#"))(?:\d+|\S+)))?
(?:(?:\s+|-)(?<streettypequadrant>(" +
QUADRANTS + #")))??
(?:(?:\s+|-)(?<suffix>(?:ste|suite|po\sbox|apt)\s*\S*))?
|
(?:(?:po|postoffice|post\s+office)\s+box\s+(?<postofficebox>\S+))
)
\s*
$
";
Match match = Regex.Match(noPeriodsLine1, addressParseRegEx);
if (match.Success)
{
HouseNumber = match.Groups["housenumber"].Value;
Quadrant = (string.IsNullOrEmpty(match.Groups["quadrant"].Value)) ? match.Groups["streettypequadrant"].Value : match.Groups["quadrant"].Value;
if (match.Groups["streetname"].Captures.Count > 1)
{
foreach (Capture capture in match.Groups["streetname"].Captures)
{
StreetName += capture.Value + " ";
}
StreetName = StreetName.Trim();
}
else
{
StreetName = (string.IsNullOrEmpty(match.Groups["streetname"].Value)) ? match.Groups["streettypequalifier"].Value : match.Groups["streetname"].Value;
}
StreetType = match.Groups["streettype"].Value;
//if the matched street type is found
//use the abbreviated version...especially for credit bureau calls
string streetTypeAbbreviation;
if (StreetTypes.TryGetValue(StreetType.ToUpper(), out streetTypeAbbreviation))
{
StreetType = streetTypeAbbreviation;
}
}
}
}
Have fun with addresses and regexes, you're in for a long, horrible ride.
You're trying to lay order upon chaos.
For every "123 Simple Way", there's a "14 1/2 South".
Then, for extra laughs, there's Salt Lake City: "855 South 1300 East".
Have fun with that.
There are more exceptions than rules when it comes to street addresses.
I don't know what country you're in, but if you're in the USA and want to spend some money on address validation, you can buy related USPS products here. And here is a good place to find free word lists from the USPS for expected words and abbreviations. I'm sure similar pages are available for other countries.
I think you should clarify your usage scenario.
Unless you're in a very, very limited scenario where you know that the addresses were entered following a strict schema, parsing addresses for content is an extremely hard problem to solve and, usually, quite futile (unless it's the raison d'être of your application).
If you're limited to a particular country that has very specific conventions for writing addresses, then using these regexes might get you 90% of the way.
However, as soon as you have to start accepting foreign addresses, you're screwed.
Even if you're a US-centric site, there is a good chance that you may have to be able to accept addresses from US citizens living abroad, for instance.
Again, it may be OK in a very narrow field, but it's almost always a bad idea to validate or split addresses that were not strictly validated and constrained at the time the user entered them.
When you do enforce strict rules for users entering their addresses, these end up being inadequate in a small portion of cases, even with the best address validation components out there.
Just a few things that mess up address parsing:
postal codes (Zip codes) are sometimes placed before, after, or may even not exist at all.
not all postal codes follow strict rules: a 10-digit Zip code is probably easy to spot as invalid, but what about a non-existent one? What about more complex codes such as those used in the UK, for instance?
What about a place like Hong Kong where you could write the address in either English, Traditional Chinese or Mandarin?
What if it's perfectly fine to split your address and write it out of sequence?
even if you're just parsing US addresses, there are at least a handful of ways to describe a PO box: you can also use poste restante or general delivery, and then you need to add a 4-digit code to the Zip code, which would normally not be present at all...
Bottom line is
If getting addresses in a parseable format is really important, be 100% sure that you can get all possible combinations right, or you're going to have a percentage of failures that will mean frustrated users and lost sales.
If you don't have 100% case coverage then don't enforce strict rules on the user.
I can't count the number of websites I gave up purchasing from because they would require a Zip/Postal Code when the place I live in has none.
Sorry for the rant, but I think it's important that people wanting to do address validation and parsing think hard about what they're getting themselves in.
This actually works pretty well except that it doesn't pull apartment numbers. We're working on that. It also coughed a little when we had an address of 769 Branch Ave. Of course "branch" is one of the street types it's looking for. It all goes back to that making-order-out-of-chaos thing. We know that it's going to break here and there.
If someone runs into this problem in 2013/2014 :)
You can use the Google Geocoding API. It provides more functionality than just a regex - you can even get the lat/long for an address. And it's free.
For example, for an address:
http://maps.googleapis.com/maps/api/geocode/xml?address=2520%20Cohasset%20Rd%20-%20Chico%2C%20CA%2095973-1307%20530-893-1300%20%20&sensor=false
I tried to get this to work, but it seems as though you have a static member of a StreetTypes class that is not included. It seems to work except for that, but I cannot do much testing without it.
I'll agree that your strictness is going to be a problem. I'm writing an address parser designed to strip addresses from classified ads where the format could be just about anything. For instance, for your quadrant matches, you're ignoring punctuation altogether. I have to search data that could represent NE in all these different ways:
"NE", "N.E", "N E", "N.E.", "N. E", "North East", "Northeast"
so I am using the following pattern match which should catch all direction qualifiers no matter how they are expressed:
\b(?:(?:[nesw]\.? ?){0,2}|(?:north|no\.|east|south|so\.|west){0,2})\b
Of course, context is also important since "no" is going to be matched by this. But "NE" for Nebraska would be matched by either, so you really have to be careful about what's to the left and right in your larger expression. I'm having to compile lists of words that commonly appear interspersed in address texts which are not address components, such as "near, x-street, in, across", etc.
It is a very tough problem, and I agree Salt Lake City is a bitch. In addition to having the double direction/coordinate format, they also compound it by referring to stuff like "3700 North 5300 East Arborville Way" where the streets can be referenced by name, number, or both.