JavaScript to replace Chinese characters - c#

I am building a JavaScript array depending on the input of the user. The array is building fine but if the user enters Chinese symbols it crashes. I'm assuming that it is if the user enters a chinese " or a , or a '. I have the program replacing the English versions of this but i don't know how to replace the Chinese versions of it.
Can anyone help?
Thanks to all for their input

From What's the complete range for Chinese characters in Unicode?, the CJK unicode ranges are:
4E00-9FFF (common)
3400-4DFF (rare)
F900-FAFF (compatability - Duplicates, unifiable variants, corporate characters)
20000-2A6DF (rare, historic)
2F800-2FA1F (compatability - supplement)
Because JS strings only support UCS-2, which max out at FFFF, the last two ranges probably aren't of great interest. Thus, if you're building a JS string should be able to filter out chinese characters using something like:
replace(/[\u4e00-\u9fff\u3400-\u4dff\uf900-\ufaff]/g, '')

You need to use unicode replacer.
I think it will help you: http://answers.yahoo.com/question/index?qid=20080528045141AAJ0AIS

.Net provides JavaScriptSerializer and it's method Serialize, which creates correctly escaped JavaScript literals (although I personally haven't used it with Chinese characters, but there is no reason it shouldn't work).

Building on broofa's answer:
If you just want to find and replace the Chinese punctuation like " or " or a . then you'll want to use unicode characters in the range of FF00-FFEF. Here is a PDF from Unicode showing them: http://unicode.org/charts/PDF/UFF00.pdf
I think you'd want at least replace these: FF01, FF02, FF07, FF0C, FF0E, FF1F, and FF61. That should be the major Chinese punctuation marks. You can use broofa's replace function.

Not asked by the question, but adding \u30a0-\u30ff\u3040-\u309f you can also take out the Hiragana and Katakana from Japanese:
replace(/[\u4e00-\u9fff\u3400-\u4dff\uf900-\ufaff\u30a0-\u30ff\u3040-\u309f]/g, '')
https://regex101.com/r/4Aw9Q8/1
https://en.wikipedia.org/wiki/Katakana_(Unicode_block)
https://en.wikipedia.org/wiki/Hiragana_(Unicode_block)

Related

Is it possible to display (convert?) the unicode hex \u0092 to an unicode html entity in .NET?

I have some string that contains the following code/value:
"You won\u0092t find a ...."
It looks like that string contains the Right Apostrophe special character.
ref1: Unicode control 0092
ref2: ASCII chart (both 127 + extra extended ascii)
I'm not sure how to display this to the webbrowser. It keeps displaying the TOFU square-box character instead. I'm under the impression that the unicode (hex) value 00092 can be converted to unicode (html) ’
Is my understanding correct?
Update 1:
It was suggested by #sam-axe that I HtmlEncode the unicode. That didn't work. Here it is...
Note the ampersand got correctly encoded....
It looks like there's an encoding mix-up. In .NET, strings are normally encoded as UTF-16, and a right apostrophe should be represented as \u2019. But in your example, the right apostrophe is represented as \x92, which suggests the original encoding was Windows code page 1252. If you include your string in a Unicode document, the character \x92 won't be interpreted properly.
You can fix the problem by re-encoding your string as UTF-16. To do so, treat the string as an array of bytes, and then convert the bytes back to Unicode using the 1252 code page:
string title = "You won\u0092t find a cheaper apartment * Sauna & Spa";
byte[] bytes = title.Select(c => (byte)c).ToArray();
title = Encoding.GetEncoding(1252).GetString(bytes);
// Result: "You won’t find a cheaper apartment * Sauna & Spa"
Note: much of my answer is based on guessing and looking at the decompiled code of System.Web 4.0. The reference source looks very similar (identical?).
You're correct that "’" (6 characters) can be displayed in the browser. Your output string, however, contains "\u0092" (1 character). This is a control character, not an HTML entity.
According to the reference code, WebUtility.HtmlEncode() doesn't transform characters between 128 and 160 - all characters in this range are control characters (ampersand is special-cased in the code as are a few other special HTML symbols).
My guess is that because these are control characters, they're output without transformation because transforming it would change the meaning of the string. (I tried running some examples using LinqPad, this character was not rendered.)
If you really want to transform these characters (or remove them), you'll probably have to write your own function before/after calling HtmlEncode() - there may be something that does this already but I don't know of any.
Hope this helps.
Edit: Michael Liu's answer seems correct. I'm leaving my answer here because it may be useful in cases when the input encoding of a string is not known.

Right To Left Language Bracket Reversed

I am using a StringBuilder in C# to append some text, which can be English (left to right) or Arabic (right to left)
stringBuilder.Append("(");
stringBuilder.Append(text);
stringBuilder.Append(") ");
stringBuilder.Append(text);
If text = "A", then output is "(A) A"
But if text = "بتث", then output is "(بتث) بتث"
Any ideas?
This is a well-known flaw in the Windows text rendering engine when asked to render Right-To-Left text, Arabic or Hebrew. It has a difficult problem to solve, people often fall back to Western words and punctuation when there is no good alternative word available in the language. Brand and company names for example. The renderer tries to guess at the proper render order by looking at the code points, with characters in the Latin character set clearly having to be rendered left-to-right.
But it fumbles at punctuation, with brackets being the most visible. You have to be explicit about it so it knows what to do, you must use the Unicode Right-to-left mark, U+200F or \u200f in C# code. Conversely, use the Left-to-right mark if you know you need LTR rendering, U+200E.
Use AppendFormat instead of just Append:
stringBuilder.AppendFormat("({0}) {0}", text)
This may fix the issue, but it may - you need to look at the text value - it probably has LTR/RTL markers characters embedded. These need to either be removed or corrected in the value.
I had a similar issue and I managed to solve it by creating a function that checks each Char in Unicode. If it is from page FE then I add 202C after it as shown below. Without this it gets RTL and LTF mixed for what I wanted.
string us = string.Format("\uFE9E\u202C\uFE98\u202C\uFEB8\u202C\uFEC6\u202C\uFEEB\u202C\u0020\u0660\u0662\u0664\u0668 Aa1");

Issue with HttpUtility.HtmlEncode with other language characters

I was trying to encode the HTML special characters like ', ", <,> etc with HttpUtility.HtmlEncode. But I noticed this is also encoding french characters like (é) to é and now é is getting displayed as it is on my HTML page. I don't want this I just want to encode ', ", <,> and few other characters.
Should those characters look differently? Why is it a problem if they are replaced? This is by design. You can take a look at this question to see longer discussion. Unless your users can't properly see text you are displaying, you shouldn't mess with this, for security/compatibility reasons.
HtmlUtility seems to encode several classes of characters, among which ISO-8859-1 character set
If you still don't want a specific character to be encoded, you are forced to use string.Replace() for this purpose.
The various .NET text encoding functions are notorious for being badly documented and doing weird conversions. In some cases I've had better luck with the encoding functions in Microsoft Anti-XSS library, but not sure if this would work in your particular application.

How can i detect multiLanguages in string?

I have a string in c#
How can i detect if this string contains Chars from Different Languages ?
i.e : a person fills his english name in text box and also his local language name.
I want to disallow that.
something like this :
"check the language table of the chars in the string and if it comes
from different unicode tables - return ERROR".
but i think there is a problem for 'a' in us or uk.
maybe im wrong.
how can i recognize more than one language ?
I think you're searching for codepoints. The unique identifiers of a character in codepage. I think this should be useful to you How would you get an array of Unicode code points from a .NET String?. Once you get codepoints array from the string, you can check it against the range of code points you want.
Hope this helps.

How to limit users from entering english characters in a textbox

Do I need to use regex to ensure that the user has typed in English? All characters are valid except non English characters.
How do I validate this textbox?
A regex would work quite well for this. Something like
^[a-zA-Z0-9 ?!.,;:$]*$
would be a good starting point. It would allow all alphabetical and numerical characters, as well as some common punctuation. You would need to change it depending on what your definition of English characters is.
See the regex docs here for more information.
This has nothing to do with regular expressions and the link referred to by gurukulki does not answer the question either. To change language, you need to implement localization in your website:
http://www.west-wind.com/presentations/wwdbResourceProvider/introtolocalization.aspxlink text
http://www.codeproject.com/KB/aspnet/localizationByVivekTakur.aspx
You might use the FilteredTextBox form ASP.NET AJAX.

Categories

Resources