How can I detect multiple languages in a string?

I have a string in C#.
How can I detect whether this string contains characters from different languages?
For example, a person enters his English name in a text box and also his name in his local language.
I want to disallow that.
Something like this:
"Check the Unicode table of each character in the string and, if the characters come
from different Unicode tables, return ERROR."
But I think there is a problem with 'a' in US or UK English.
Maybe I'm wrong.
How can I recognize more than one language?

I think you're looking for code points: the unique numbers that identify each character in Unicode. This question should be useful to you: "How would you get an array of Unicode code points from a .NET String?". Once you have the array of code points for the string, you can check each one against the ranges of code points you want to allow.
Hope this helps.
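A minimal sketch of that idea, assuming .NET Core 3.0 or later (for System.Text.Rune) and a small, deliberately incomplete table of Unicode block ranges. The names BlockOf, BlocksIn and MixesScripts are made up for illustration. Note that this detects scripts, not languages: 'a' is the same code point (U+0061) in US and UK English, so that case is not a problem, but English and French cannot be told apart this way.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Map a code point to a coarse block name. The ranges are an illustrative
// subset only; extend the table for the scripts you need to support.
static string BlockOf(int cp)
{
    if (cp <= 0x024F) return "Latin";                     // Basic Latin .. Latin Extended-B
    if (cp >= 0x0370 && cp <= 0x03FF) return "Greek";
    if (cp >= 0x0400 && cp <= 0x04FF) return "Cyrillic";
    if (cp >= 0x0590 && cp <= 0x05FF) return "Hebrew";
    if (cp >= 0x0600 && cp <= 0x06FF) return "Arabic";
    if (cp >= 0x4E00 && cp <= 0x9FFF) return "CJK";
    return "Other";
}

// Collect the blocks of all letters; digits, spaces and punctuation are ignored.
static HashSet<string> BlocksIn(string text)
{
    var blocks = new HashSet<string>();
    foreach (Rune rune in text.EnumerateRunes())
    {
        if (Rune.IsLetter(rune))
            blocks.Add(BlockOf(rune.Value));
    }
    return blocks;
}

// True when the letters of the string come from more than one block.
static bool MixesScripts(string text) => BlocksIn(text).Count > 1;

Console.WriteLine(MixesScripts("John Smith"));   // False
Console.WriteLine(MixesScripts("John Иван"));    // True
```

Only letters are classified, so a name followed by digits or punctuation is not reported as mixed.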

Related

How to convert two-letter country codes to flag emojis?

I have ISO 3166-1 (alpha-2) country codes, which are two-letter codes, such as "US" and "NL". How do I get the corresponding flag emoji?
EDIT: Preferably I would like to do this without using an explicit mapping between country codes and their corresponding emojis. It has been done in JavaScript but I'm not sure how to do it in C#.

Solution:
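using System.Linq;

// Extension methods must be declared inside a static class.
public static string IsoCountryCodeToFlagEmoji(this string country)
{
    // 0x1F1A5 (127397) is the offset from 'A' to regional indicator symbol letter A (U+1F1E6).
    // ToUpperInvariant avoids culture-specific casing (e.g. the Turkish "i").
    return string.Concat(country.ToUpperInvariant().Select(x => char.ConvertFromUtf32(x + 0x1F1A5)));
}
string gb = "gb".IsoCountryCodeToFlagEmoji(); // 🇬🇧
string fr = "fr".IsoCountryCodeToFlagEmoji(); // 🇫🇷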

XXXXXXXX SKIP THIS SECTION WHICH CONTAINS MY ORIGINAL ANSWER XXXXXXXX
You will need to generate a cross-reference table or dictionary that allows you to look up the corresponding emoji. Luckily it looks like you've already found a great source for the information you need!
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
You can go here to find a chart of the appropriate Unicode symbols for each letter. Basically, you just use the regional indicator symbol for each letter. For example, NZ would be U+1F1F3 (N) + U+1F1FF (Z). These two symbols are rendered as the NZ flag if the platform supports that emoji.
Because these letters are all contiguous, you can calculate the appropriate code point for a given letter by adding a fixed offset to the ordinary uppercase letter. You may have seen it in the code repository you referenced: it is 127397 (0x1F1A5). Thus, 'A' + 127397 is the regional indicator symbol for A (U+1F1E6).
Thanks for teaching me something new today, and good luck!

C# Comparing two strings, one with unique identifiers

First of all, I'd like to mention that I'm new to programming and to this site, so I'm still an infant in this world. However, I have a problem.
I have to make code that can compare two strings but the second string (from a file) will have unique identifiers within it. For example:
first string:
I have 10 cats and their fur is #000000
Second string from a file:
I have <d> cats and their fur is <h>
Although I probably don't need to explain, 'd' is for numbers (integer or decimal) and 'h' is for hex. There are also 's' and 'a', which are associated with ASCII.
What's supposed to happen is that the first string can contain any number, of any length, and/or any hex value when the data comes in, but the rest of the message stays the same, e.g.
I have 1500 cats and their fur is #000000
The code will still treat the two strings as a match, as it'll effectively ignore anything that is an int or hex. (These identifiers are user-defined, so they can be anywhere in any string.)
The end goal is that if it finds a match, the code will change the colour of the text in the app, among other things. It's basically to highlight errors in a log file.
I've searched high and low on Stack Overflow and looked into Regex and string comparisons. I'm about to make a start on the code, but I would like some input/help.
Obviously I'm not asking for something to be written for me, just to be pointed in the right direction so I can learn.
Many thanks in advance! And apologies if there is a similar post out there, but alas I couldn't find it if there is.

If I understand it correctly, I would solve this by replacing each placeholder (<d>, <h>, etc.) with a regular expression. Then use those expressions to replace the matching values in the incoming string with an empty string. That way you can compare the two strings without the values.
Hope that makes sense. I didn't include any code because you asked for just some directions.
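For anyone who does want a starting point, here is a rough sketch of a closely related variation: instead of stripping the values and comparing what is left, it turns the whole template into one anchored regex and matches the incoming line against it. The helper name BuildTemplateRegex and the patterns chosen for <d> and <h> are assumptions, not something from the question; <s> and <a> are left out because their meaning isn't specified.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Turn a template such as "I have <d> cats" into an anchored regex.
static Regex BuildTemplateRegex(string template)
{
    // The capture group makes Regex.Split keep the placeholders in the result.
    string[] parts = Regex.Split(template, "(<[dh]>)");
    string pattern = string.Concat(parts.Select(part => part switch
    {
        "<d>" => @"\d+(?:\.\d+)?",     // assumed: integer or decimal number
        "<h>" => @"#?[0-9A-Fa-f]+",    // assumed: hex digits with an optional '#'
        _ => Regex.Escape(part)        // literal text must match exactly
    }));
    return new Regex("^" + pattern + "$");
}

Regex regex = BuildTemplateRegex("I have <d> cats and their fur is <h>");
Console.WriteLine(regex.IsMatch("I have 10 cats and their fur is #000000"));    // True
Console.WriteLine(regex.IsMatch("I have 1500 cats and their fur is #000000"));  // True
Console.WriteLine(regex.IsMatch("I have ten cats and their fur is #000000"));   // False
```

Escaping the literal parts matters because log lines often contain characters such as '.', '(' or '#' that have a special meaning in a regex.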

How to find a string in a column that contains a lengthy "word" or set of characters

I am looking for an unusually long word or grouping of characters in a specific column of data that contains notes written by users. For example, if something like this -
I am looking for an unusuallylongwordorgroupingofcharactersina specific column
exists, I need to find it so I can add spaces if necessary. My question is: How do I find a word or set of characters that exceeds a certain number of characters?
The problem is that somewhere in this data, an unusually long word or grouping of characters is being parsed and causing an OutOfMemoryException, so I need to find the source and fix it.

You could use a regex in C# if the raw string fits in memory: \w{15,} gives you words at least 15 characters in length. There are many ways to tweak this (lookahead, lookbehind, more specific character classes, etc.).
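A short sketch of how that regex might be applied, assuming the notes have already been read into a string. FindLongWords is a made-up helper name, and the threshold of 15 is simply the one from the pattern above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Return every run of at least minLength word characters.
static string[] FindLongWords(string text, int minLength)
{
    return Regex.Matches(text, @"\w{" + minLength + ",}")
                .Cast<Match>()
                .Select(match => match.Value)
                .ToArray();
}

string note = "I am looking for an unusuallylongwordorgroupingofcharactersina specific column";
foreach (string word in FindLongWords(note, 15))
{
    Console.WriteLine(word);   // unusuallylongwordorgroupingofcharactersina
}
```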

You can write a C# stored procedure that can be run against the column in question.
It would split the column value into an array of strings, each containing one word. Then you can easily find the longest word in the column.
See http://msdn.microsoft.com/en-us/library/vstudio/zxsa8hkf%28v=vs.100%29.aspx
for details on how to write, install, and debug a C# stored procedure in SQL Server.

Using the answers given, I created a program that pulls the data and tosses each word into a list. It then pulls words of a given length (in my case, I did greater than 20 characters) and found the bad "word". Now I can fix the data.
I appreciate all your help, guys.

JavaScript to replace Chinese characters

I am building a JavaScript array depending on the input of the user. The array builds fine, but if the user enters Chinese symbols it crashes. I'm assuming that this happens when the user enters a Chinese " or , or '. I have the program replacing the English versions of these, but I don't know how to replace the Chinese versions.
Can anyone help?
Thanks to all for their input

From "What's the complete range for Chinese characters in Unicode?", the CJK Unicode ranges are:
4E00-9FFF (common)
3400-4DFF (rare)
F900-FAFF (compatibility - duplicates, unifiable variants, corporate characters)
20000-2A6DF (rare, historic)
2F800-2FA1F (compatibility - supplement)
Because JS strings are sequences of UTF-16 code units, a plain \uXXXX escape only reaches FFFF, so the last two ranges (which need surrogate pairs) probably aren't of great interest. Thus, if you're building a JS string, you should be able to filter out Chinese characters using something like:
replace(/[\u4e00-\u9fff\u3400-\u4dff\uf900-\ufaff]/g, '')

You need to use unicode replacer.
I think it will help you: http://answers.yahoo.com/question/index?qid=20080528045141AAJ0AIS

.NET provides JavaScriptSerializer and its Serialize method, which creates correctly escaped JavaScript literals (I personally haven't used it with Chinese characters, but there is no reason it shouldn't work).

Building on broofa's answer:
If you just want to find and replace the Chinese punctuation like " or " or a . then you'll want to use Unicode characters in the range FF00-FFEF. Here is a PDF from Unicode showing them: http://unicode.org/charts/PDF/UFF00.pdf
I think you'd want to replace at least these: FF01, FF02, FF07, FF0C, FF0E, FF1F, and FF61. Those should be the major Chinese punctuation marks. You can use broofa's replace function.

Not asked in the question, but by adding \u30a0-\u30ff\u3040-\u309f you can also strip out the Japanese Hiragana and Katakana:
replace(/[\u4e00-\u9fff\u3400-\u4dff\uf900-\ufaff\u30a0-\u30ff\u3040-\u309f]/g, '')
https://regex101.com/r/4Aw9Q8/1
https://en.wikipedia.org/wiki/Katakana_(Unicode_block)
https://en.wikipedia.org/wiki/Hiragana_(Unicode_block)

Replacing specific Unicode characters in strings read from Excel

I am attempting to replace some undesirable characters in a string retrieved from an Excel spreadsheet. The reason being that our Oracle database is using the WE8ISO8859P1 character set, which does not define several characters that Excel "helpfully" inserts for you in text (curly quotes, em and en dashes, etc.) Since I have no control over the database or how the Excel spreadsheets are created I need to replace the characters with something else.
I retrieve the cell contents into a string thus:
string s = xlRange.get_Range("A1", Missing.Value).Value2.ToString().Trim();
Viewing the string in Visual Studio's Text Visualiser shows the text to be complete and correctly retrieved. Next I try and replace one of the undesirable characters (in this case the right-hand curly quote symbol):
s = Regex.Replace(s, "\u0094", "\u0022");
But it does nothing (Text Visualiser shows it still to be there). To try and verify that the character I want to replace is actually in there, I tried:
bool a = s.Contains("\u0094");
but it returns false. However:
bool b = s.Contains("”");
returns true.
My (somewhat lacking) understanding of strings in .NET is that they're encoded in UTF-16, whereas Excel would probably be using ANSI. So does that mean I need to change the encoding of the text as it comes out of Excel? Or am I doing something else wrong here? Any advice would be greatly appreciated. I have read and re-read all articles I can find about Unicode and encoding but am still none the wiser.

Yes, strings in .NET are UTF-16.
Your approach is right; perhaps your hex value is incorrect.
The character you tested for isn't "\u0094" (I'm not sure that's what you meant). The following worked for me:
((int)"”"[0]).ToString("X") returns "201D"
"”" == "\u201D" returns true
"\u0094" == "" (right hand side is the empty string) returns false
A lot of UTF-16 characters will look like an empty string in the Text Visualiser, but they may be either an undisplayable character or part of a surrogate pair (i.e. some characters need to be written as "\UXXXXXXXX", while others can be written with four digits as "\uXXXX"). My knowledge of this domain is very limited.
References - Jon Skeet's articles on:
Strings
Unicode
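To make the point above concrete: 0x94 is the byte that the Windows-1252 ("ANSI") code page uses for the right curly quote, but inside a .NET string that character is U+201D, so that is the value to search for. Below is a minimal sketch of replacing the usual typographic characters with plain equivalents. The helper name ReplaceTypographicChars and the exact mapping are assumptions; adjust the table to whatever actually turns up in your spreadsheets.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static string ReplaceTypographicChars(string input)
{
    // Assumed mapping from typographic characters to plain replacements.
    var replacements = new Dictionary<char, string>
    {
        ['\u2018'] = "'",     // left single quotation mark
        ['\u2019'] = "'",     // right single quotation mark
        ['\u201C'] = "\"",    // left double quotation mark
        ['\u201D'] = "\"",    // right double quotation mark
        ['\u2013'] = "-",     // en dash
        ['\u2014'] = "-",     // em dash
        ['\u2026'] = "..."    // horizontal ellipsis
    };

    var result = new StringBuilder(input.Length);
    foreach (char c in input)
    {
        if (replacements.TryGetValue(c, out var replacement))
            result.Append(replacement);
        else
            result.Append(c);
    }
    return result.ToString();
}

Console.WriteLine(ReplaceTypographicChars("\u201CHello\u201D \u2013 it\u2019s fine"));   // "Hello" - it's fine
```

A single pass with a lookup table avoids chaining many Replace calls, and characters that are not in the table are copied through unchanged.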

You can use NVARCHAR and NTEXT instead of VARCHAR and TEXT for the columns that need to accommodate those characters.
That way you don't have to convert the whole database, and you are future-proof, because the columns will be Unicode.
