Right To Left Language Bracket Reversed - c#

I am using a StringBuilder in C# to append some text, which can be English (left to right) or Arabic (right to left):
stringBuilder.Append("(");
stringBuilder.Append(text);
stringBuilder.Append(") ");
stringBuilder.Append(text);
If text = "A", then output is "(A) A"
But if text = "بتث", then output is "(بتث) بتث"
Any ideas?

This is a well-known flaw in the Windows text rendering engine when it is asked to render right-to-left text such as Arabic or Hebrew. It has a difficult problem to solve: people often fall back to Western words and punctuation when no good alternative is available in the language, brand and company names for example. The renderer tries to guess the proper render order by looking at the code points, with characters in the Latin character set clearly having to be rendered left-to-right.
But it fumbles at punctuation, with brackets being the most visible case. You have to be explicit so it knows what to do: use the Unicode right-to-left mark, U+200F (\u200f in C# code). Conversely, use the left-to-right mark, U+200E, if you know you need LTR rendering.
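As a minimal sketch of the idea, assuming the surrounding text is left-to-right and the closing bracket should stay attached to the first word, a left-to-right mark can be appended right after the bracket. The class and method names (BidiMarks, Bracketed) are made up for illustration; whether you need U+200E or U+200F depends on the layout you want.

```csharp
using System.Text;

public static class BidiMarks
{
    public const char Lrm = '\u200E'; // LEFT-TO-RIGHT MARK
    public const char Rlm = '\u200F'; // RIGHT-TO-LEFT MARK

    // Builds "(text) text" and places an LRM after the closing bracket,
    // so the bracket sits next to a strong left-to-right character.
    // Assumption: the paragraph direction is left-to-right.
    public static string Bracketed(string text)
    {
        var sb = new StringBuilder();
        sb.Append('(');
        sb.Append(text);
        sb.Append(')');
        sb.Append(Lrm);
        sb.Append(' ');
        sb.Append(text);
        return sb.ToString();
    }
}
```

The mark is invisible and has no width, so for plain Latin text the visible result is unchanged.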

Use AppendFormat instead of just Append:
stringBuilder.AppendFormat("({0}) {0}", text)
This may fix the issue, but it may not. Look at the text value: it probably has LTR/RTL marker characters embedded, and these need to be either removed or corrected in the value.

I had a similar issue and solved it by creating a function that checks the Unicode value of each char. If it is from page FE (U+FE00 to U+FEFF), I add U+202C after it, as shown below. Without this, RTL and LTR got mixed up for what I wanted.
string us = string.Format("\uFE9E\u202C\uFE98\u202C\uFEB8\u202C\uFEC6\u202C\uFEEB\u202C\u0020\u0660\u0662\u0664\u0668 Aa1");
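The function itself is not shown in the answer. A rough sketch of what it might look like, assuming "page FE" means any UTF-16 code unit whose high byte is 0xFE, follows; the names PresentationForms and AppendPopAfterPageFE are hypothetical.

```csharp
using System.Text;

public static class PresentationForms
{
    // U+202C POP DIRECTIONAL FORMATTING, the character the answer appends.
    private const char PopDirectionalFormatting = '\u202C';

    // Copies the input and appends U+202C after every char in U+FE00..U+FEFF.
    public static string AppendPopAfterPageFE(string input)
    {
        var sb = new StringBuilder(input.Length * 2);
        foreach (char c in input)
        {
            sb.Append(c);
            if ((c >> 8) == 0xFE)
            {
                sb.Append(PopDirectionalFormatting);
            }
        }
        return sb.ToString();
    }
}
```

For example, AppendPopAfterPageFE("\uFE9E\uFE98 Aa1") yields "\uFE9E\u202C\uFE98\u202C Aa1", which follows the same pattern as the literal above.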

Related

Detect non-displayable characters in string C#

Hopefully someone can help me with this, because I haven't found any solution online so far.
I am processing strings with special characters and I want to detect whether any character in a string can't be displayed properly by, for instance, a web browser or even Visual Studio itself. The following string shows such characters. This comes from the Text Visualizer in VS2019:
TargetsforReduceCO
I've checked similar questions, but the answers were mostly limited to checking if the character code exceeds 255. However, there are lots of characters that can still be displayed, like Greek and Cyrillic symbols.
I also found this website that has an overview of all Unicode characters and shows how they are displayed in the browser, but there doesn't seem to be any logic connecting which characters can't be displayed and their character codes.
I can imagine that VS doesn't know which characters can't be displayed in various browsers, but I'm hoping that there is at least a way of checking if VS can display them.
Thanks in advance for your help!
Edit:
Right now I'm using
input.Any(c => !char.IsLetterOrDigit(c) && c > 255);
Because the input shouldn't normally contain other symbols than what you can usually find in a text, but I'm sure it will be triggered on symbols that can actually be displayed by VS or a webbrowser.
Type char has a number of static member methods like IsPunctuation() that should help you "categorize" character by character. See example on this page System.Char reference. Each of those methods' documentation explains what characters it applies to. As commenters have mentioned, your "displayable" criterion is more a font-presentation problem than a character value problem but you'll be able to narrow down what your system can work with using these methods. Look out for other methods like GetUnicodeCategory().
It may be that something as simple as !char.IsControl(c) will do the trick.
See similar Q&A here C# Printable Characters
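As a rough illustration of the category-based approach, the sketch below flags control, unassigned, and private-use characters. The choice of categories and the names DisplayCheck, IsSuspect and HasSuspectChars are assumptions for illustration, not an authoritative definition of "displayable"; whether a glyph actually renders still depends on the font.

```csharp
using System.Globalization;
using System.Linq;

public static class DisplayCheck
{
    // True for characters that usually have no visible glyph of their own.
    // Tab, line feed and carriage return are allowed as ordinary text controls.
    public static bool IsSuspect(char c)
    {
        if (c == '\t' || c == '\n' || c == '\r')
        {
            return false;
        }

        switch (char.GetUnicodeCategory(c))
        {
            case UnicodeCategory.Control:
            case UnicodeCategory.OtherNotAssigned:
            case UnicodeCategory.PrivateUse:
                return true;
            default:
                return false;
        }
    }

    public static bool HasSuspectChars(string input)
    {
        return input.Any(IsSuspect);
    }
}
```

Unlike the c > 255 check, this leaves Greek and Cyrillic text alone while still catching stray control characters such as U+0092.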

How to collect special letters like "ᑕƘᏔ®✞ℍ"

I'm scraping a social platform using Selenium, and a lot of users use special characters like HEᑕƘᏔ®✞ℍ, fire emojis and so on. These characters turn into question marks like "HE?????????".
I've tried to use the decode and encode utilities but I've had absolutely no luck.
See here:
WebUtility.HtmlDecode(string);
WebUtility.HtmlEncode(string);
I get the feeling I'm barking up the wrong tree here, but have no idea where to start, as special character answers normally talk about Unicode, and I'm pretty sure this isn't relevant in this case.
EDIT:
This is how I'm fetching the content using selenium
title = driver.FindElement(By.XPath("//*[@id=\"header-section\"]/div[2]/div/div/div/div/div[1]/div/h1")).Text;
What you are looking at is HTML encoding and decoding, which only replaces characters to make them HTML safe; for example, £ becomes &pound;.
You want to look at text encoding instead, as this controls which characters are available, with different character sets giving you different characters. If a character is not available in the character set you are using, it shows as a question mark or a black block.
You can use Encoding.Convert() see this discussion for more info.
It is likely you will want to convert your input to UTF-8 text encoding to see the full character set.
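To see the effect described above, the following sketch round-trips a string through an encoding: characters the encoding cannot represent come back as question marks. The helper name RoundTrip is made up for illustration, and this only demonstrates the symptom; where the lossy conversion happens in a given scraping pipeline is something you would still have to track down.

```csharp
using System.Text;

public static class EncodingDemo
{
    // Encodes the text to bytes and decodes it again with the same encoding.
    // Characters the encoding cannot represent are replaced (with '?' by default).
    public static string RoundTrip(string text, Encoding encoding)
    {
        byte[] bytes = encoding.GetBytes(text);
        return encoding.GetString(bytes);
    }
}
```

With the input "HE\u1455\u0198\u13D4\u00AE\u271E\u210D" (the characters from the question), RoundTrip(input, Encoding.ASCII) gives "HE??????", while RoundTrip(input, Encoding.UTF8) returns the input unchanged.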

Is it possible to display (convert?) the unicode hex \u0092 to an unicode html entity in .NET?

I have some string that contains the following code/value:
"You won\u0092t find a ...."
It looks like that string contains the Right Apostrophe special character.
ref1: Unicode control 0092
ref2: ASCII chart (both 127 + extra extended ascii)
I'm not sure how to display this in the web browser. It keeps displaying the tofu square-box character instead. I'm under the impression that the Unicode (hex) value 0092 can be converted to the HTML entity &#146;.
Is my understanding correct?
Update 1:
It was suggested by @sam-axe that I HtmlEncode the Unicode string. That didn't work. Here it is...
Note the ampersand got correctly encoded....
It looks like there's an encoding mix-up. In .NET, strings are normally encoded as UTF-16, and a right apostrophe should be represented as \u2019. But in your example, the right apostrophe is represented as \x92, which suggests the original encoding was Windows code page 1252. If you include your string in a Unicode document, the character \x92 won't be interpreted properly.
You can fix the problem by re-encoding your string as UTF-16. To do so, treat the string as an array of bytes, and then convert the bytes back to Unicode using the 1252 code page:
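using System.Linq;
using System.Text;
// On .NET Core / .NET 5+, call Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first.
string title = "You won\u0092t find a cheaper apartment * Sauna & Spa";
// Every char here is below 256, so it can be narrowed to the byte it originally was.
byte[] bytes = title.Select(c => (byte)c).ToArray();
title = Encoding.GetEncoding(1252).GetString(bytes);
// Result: "You won’t find a cheaper apartment * Sauna & Spa"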
Note: much of my answer is based on guessing and looking at the decompiled code of System.Web 4.0. The reference source looks very similar (identical?).
You're correct that "&#146;" (6 characters) can be displayed in the browser. Your output string, however, contains "\u0092" (1 character). This is a control character, not an HTML entity.
According to the reference code, WebUtility.HtmlEncode() doesn't transform characters between 128 and 160 - all characters in this range are control characters (ampersand is special-cased in the code as are a few other special HTML symbols).
My guess is that because these are control characters, they're output without transformation, since transforming them would change the meaning of the string. (I tried running some examples in LinqPad, and this character was not rendered.)
If you really want to transform these characters (or remove them), you'll probably have to write your own function before/after calling HtmlEncode() - there may be something that does this already but I don't know of any.
Hope this helps.
Edit: Michael Liu's answer seems correct. I'm leaving my answer here because it may be useful in cases when the input encoding of a string is not known.
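For the case where the input encoding really is unknown, the custom pre-processing step mentioned above could be sketched as follows. It simply drops the C1 control characters (U+0080 to U+009F) before calling WebUtility.HtmlEncode; the names C1Cleaner, IsC1Control and StripC1AndEncode are hypothetical, and removing is only one option, since it loses whatever the character was meant to be.

```csharp
using System.Linq;
using System.Net;

public static class C1Cleaner
{
    // C1 control characters occupy U+0080..U+009F.
    public static bool IsC1Control(char c)
    {
        return c >= '\u0080' && c <= '\u009F';
    }

    // Removes C1 controls, then HTML-encodes what is left.
    public static string StripC1AndEncode(string input)
    {
        string cleaned = string.Concat(input.Where(c => !IsC1Control(c)));
        return WebUtility.HtmlEncode(cleaned);
    }
}
```

Note that this turns "won\u0092t" into "wont", so re-decoding with the correct code page is preferable whenever the original encoding can be identified.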

Outputting Programmatically to MSword; sensing end of line

I'm trying to use the MS Word Interop library to write a C# application that outputs specially formatted text (isolated Arabic letters) to a file. The problem I'm running into is determining how many characters remain before the text wraps onto a new line. I need each word to stay on a single line rather than wrap, which is the default behavior. I'm finding this difficult because when the Arabic letters of a word are isolated with spaces, they are treated as individual characters and therefore behave differently than connected words.
Any help is appreciated. Thanks.
Add each character to your range and then check the number of lines in the range:
LineCount = range.ComputeStatistics(Word.WdStatistic.wdStatisticLines);
When the line count changes, you know the text has wrapped, and you can remove the last character or reformat accordingly.
I don't know how this behaves today, but I wrote something against the MS Word API a while ago and ran into a somewhat weird fact: you can't actually find that out. In MS Word, text in a document is always in paragraphs.
If you input text into your document, it doesn't just land on a page; the page will contain at least one paragraph holding the text you wrote.
Unfortunately I can't check this again, because I don't have a license for MS Word these days.
Give it a try and look at the problem again in this way.
Hope this helps, and if not, please provide the code that generates the input and the exact version of MSWord.
Greetings,
Kjellski
I'm not sure what "Arabic letters of the word isolated with spaces" means exactly, but I assume that non breaking space is what you need.
Here's more details.
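If the non-breaking space suggestion fits, a small sketch of joining the letters of a word with U+00A0 instead of a regular space might look like this. It assumes each char is one letter (no combining marks or surrogate pairs), and the names IsolatedLetters and Isolate are made up for illustration.

```csharp
using System.Linq;

public static class IsolatedLetters
{
    // U+00A0 NO-BREAK SPACE: separates the letters without offering a line-break opportunity.
    private const string NoBreakSpace = "\u00A0";

    // Inserts a no-break space between every pair of adjacent chars in the word.
    public static string Isolate(string word)
    {
        return string.Join(NoBreakSpace, word.Select(c => c.ToString()));
    }
}
```

Regular spaces would then be used only between words, so those are the only places where the text may wrap.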

JavaScript to replace Chinese characters

I am building a JavaScript array depending on the input of the user. The array builds fine, but if the user enters Chinese symbols it crashes. I'm assuming that happens if the user enters a Chinese " or a , or a '. I have the program replacing the English versions of these, but I don't know how to replace the Chinese versions.
Can anyone help?
Thanks to all for their input
From What's the complete range for Chinese characters in Unicode?, the CJK unicode ranges are:
4E00-9FFF (common)
3400-4DFF (rare)
F900-FAFF (compatibility - Duplicates, unifiable variants, corporate characters)
20000-2A6DF (rare, historic)
2F800-2FA1F (compatibility - supplement)
Because JS strings are sequences of 16-bit code units, a \uXXXX escape maxes out at FFFF, so the last two ranges (which need surrogate pairs) probably aren't of great interest. Thus, if you're building a JS string, you should be able to filter out Chinese characters using something like:
replace(/[\u4e00-\u9fff\u3400-\u4dff\uf900-\ufaff]/g, '')
You need to use unicode replacer.
I think it will help you: http://answers.yahoo.com/question/index?qid=20080528045141AAJ0AIS
.NET provides JavaScriptSerializer and its Serialize method, which creates correctly escaped JavaScript literals (I personally haven't used it with Chinese characters, but there is no reason it shouldn't work).
Building on broofa's answer:
If you just want to find and replace the Chinese punctuation like " or " or a . then you'll want to use unicode characters in the range of FF00-FFEF. Here is a PDF from Unicode showing them: http://unicode.org/charts/PDF/UFF00.pdf
I think you'd want to replace at least these: FF01, FF02, FF07, FF0C, FF0E, FF1F, and FF61. Those should be the major Chinese punctuation marks. You can use broofa's replace function.
Not asked by the question, but adding \u30a0-\u30ff\u3040-\u309f you can also take out the Hiragana and Katakana from Japanese:
replace(/[\u4e00-\u9fff\u3400-\u4dff\uf900-\ufaff\u30a0-\u30ff\u3040-\u309f]/g, '')
https://regex101.com/r/4Aw9Q8/1
https://en.wikipedia.org/wiki/Katakana_(Unicode_block)
https://en.wikipedia.org/wiki/Hiragana_(Unicode_block)
