Hopefully someone can help me with this, because I haven't found any solution online so far.
I am processing strings with special characters and I want to detect whether any character in a string can't be displayed properly by, for instance, a web browser or even Visual Studio itself. The following string contains such characters. It comes from the Text Visualizer in VS2019:
TargetsforReduceCO
I've checked similar questions, but the answers were mostly limited to checking if the character code exceeds 255. However, there are lots of characters that can still be displayed, like Greek and Cyrillic symbols.
I also found this website that has an overview of all Unicode characters and shows how they are displayed in the browser, but there doesn't seem to be any relationship between a character's code and whether it can be displayed.
I can imagine that VS doesn't know which characters can't be displayed in various browsers, but I'm hoping that there is at least a way of checking if VS can display them.
Thanks in advance for your help!
Edit:
Right now I'm using
input.Any(c => !char.IsLetterOrDigit(c) && c > 255);
because the input shouldn't normally contain symbols other than those you usually find in text, but I'm sure it will also be triggered by symbols that VS or a web browser can actually display.
The char type has a number of static methods, such as IsPunctuation(), that should help you categorize a string character by character. See the examples in the System.Char reference. Each method's documentation explains which characters it applies to. As commenters have mentioned, your "displayable" criterion is more a font-presentation problem than a character-value problem, but these methods let you narrow down what your system can work with. Also look at other methods such as GetUnicodeCategory().
It may be that something as simple as !char.IsControl(c) will do the trick.
See similar Q&A here C# Printable Characters
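As a minimal sketch of the category-based check suggested above: the helper names IsNeverDisplayable and HasSuspectCharacters are made up for illustration, and the choice of categories is an assumption, not a rule taken from the documentation. It flags only characters that have no glyph of their own in any font; whether a given font can draw the remaining characters still cannot be decided from the character code.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper: true for characters that have no visible glyph of
// their own, regardless of the font in use.
static bool IsNeverDisplayable(char c)
{
    // Tabs and line breaks are control characters too, but they are expected in text.
    if (char.IsWhiteSpace(c)) return false;

    switch (char.GetUnicodeCategory(c))
    {
        case UnicodeCategory.Control:          // e.g. U+0001 or U+0082
        case UnicodeCategory.Format:           // e.g. zero-width joiner
        case UnicodeCategory.PrivateUse:       // meaning depends entirely on the font
        case UnicodeCategory.OtherNotAssigned: // no character defined at this code
            return true;
        default:
            // Surrogates are deliberately not flagged here, because emoji are
            // stored as surrogate pairs in a .NET string.
            return false;
    }
}

static bool HasSuspectCharacters(string input) => input.Any(IsNeverDisplayable);

Console.WriteLine(HasSuspectCharacters("Targets\u0082for\u0082Reduce")); // True
Console.WriteLine(HasSuspectCharacters("αβγ Привет"));                   // False
```

Unlike the `c > 255` test, this leaves Greek and Cyrillic text alone while still catching embedded control codes.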
I am currently working on a little dictionary app for Korean in C# (which I am trying to learn). I would like to add a feature that shows a conjugation chart with all basic verb forms for a given verb. To ensure the verbs are conjugated correctly, I have to check whether a verb is irregular. To do this, I have to check whether a verb stem ends with a certain character or not.
The problem is, however, that a computer sees an entire syllable of a Korean word as a character, not the individual 2 or 3 letters that form that specific syllable, but I need to compare the final letter of a syllable to do it correctly.
For example the Korean verb 춥다 is an irregular verb and we can tell because the verb stem 춥 has ㅂ as the final letter. Yet 춥 is the char, not ㅂ in the case of the verb stem. So this does not work:
verbStem = "춥";
verbStem.EndsWith("ㅂ");
I am currently a bit puzzled on how to make this work and thus I would be quite happy if I could get some directions.
Using the popular Korean Q&A service 지식IN (link to original answer) I was able to find the answer to my question. I am so grateful.
The first step is to separate the individual letters by normalizing the string. This is done using the Normalize method:
string a = "안녕";
string b = a.Normalize( System.Text.NormalizationForm.FormKD);
When the Normalize method is applied to a Korean string, each syllable is split into its individual component Unicode characters.
However, the extremely helpful answer at 지식IN did not stop there. It pointed out that, even after the split, a letter has a different Unicode code point depending on whether it is in the initial position or not, so I have to compare against the appropriate code point ('ᄋ' is different from 'ᆼ'). The code points are listed at Hangul Jamo (Unicode block).
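Putting the two points together, a check for the 춥 example might look like the sketch below. The helper name EndsWithFinalJamo and the constant FinalPieup are made up for illustration; U+11B8 is the syllable-final form of ㅂ in the Hangul Jamo block. It assumes the runtime has Unicode normalization data available.

```csharp
using System;
using System.Text;

// U+11B8 is HANGUL JONGSEONG PIEUP, the syllable-final form of ㅂ.
// The initial form (U+1107) and the compatibility letter ㅂ (U+3142) are different characters.
const char FinalPieup = '\u11B8';

// Hypothetical helper: decomposes the stem and compares its last component letter.
static bool EndsWithFinalJamo(string stem, char finalJamo)
{
    if (string.IsNullOrEmpty(stem)) return false;
    string decomposed = stem.Normalize(NormalizationForm.FormKD);
    return decomposed[decomposed.Length - 1] == finalJamo;
}

Console.WriteLine(EndsWithFinalJamo("춥", FinalPieup)); // True: 춥 ends in final ㅂ
Console.WriteLine(EndsWithFinalJamo("가", FinalPieup)); // False: 가 has no final consonant
```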
I am so glad someone managed to answer this question for me, but I felt I ought to write out the answer at Stackoverflow as well since you might never know someone else might want to learn how to do something similar.
I'm scraping a social platform using Selenium, and a lot of users use special characters like HEᑕƘᏔ®✞ℍ, fire emojis and so on. These characters turn into question marks, like "HE?????????".
I've tried to use the decode and encode utilities but I've had absolutely no luck.
See here:
WebUtility.HtmlDecode(string);
WebUtility.HtmlEncode(string);
I get the feeling I'm barking up the wrong tree here, but have no idea where to start, as special character answers normally talk about Unicode, and I'm pretty sure this isn't relevant in this case.
EDIT:
This is how I'm fetching the content using selenium
title = driver.FindElement(By.XPath("//*[@id=\"header-section\"]/div[2]/div/div/div/div/div[1]/div/h1")).Text;
What you are looking at is HTML decoding and encoding, which replaces characters to make them HTML-safe; for example, £ becomes &pound;.
You want to look at text encoding instead, as this controls which characters are available: different character sets give you different characters. If a character is not available in the character set you are using, it shows as a question mark or a black block.
You can use Encoding.Convert() see this discussion for more info.
It is likely you will want to convert your input to UTF-8 text encoding to see the full character set.
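To illustrate where the question marks come from, here is a small sketch (the helper name RoundTrip is made up). It does not show where the loss happens in the scraper itself, which the question does not reveal; it only shows that passing text through an encoding that lacks a character replaces it with '?', while UTF-8 preserves it.

```csharp
using System;
using System.Text;

// Hypothetical helper: encodes text to bytes and decodes it again with the same encoding.
static string RoundTrip(string text, Encoding encoding) =>
    encoding.GetString(encoding.GetBytes(text));

string name = "HE\u1455\u0198"; // "HEᑕƘ"

// ASCII has no ᑕ or Ƙ, so each is replaced by '?'.
Console.WriteLine(RoundTrip(name, Encoding.ASCII) == "HE??"); // True

// UTF-8 can represent every Unicode character, so nothing is lost.
Console.WriteLine(RoundTrip(name, Encoding.UTF8) == name);    // True
```

If the question marks only appear when printing, setting `Console.OutputEncoding = Encoding.UTF8;` before writing is worth trying; if they appear in a saved file, check the encoding passed to the writer.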
I decompiled an executable and can't understand what these symbols in the source code (C#) are. I would paste source examples here, but when I tried, the special characters in the decompiled source could not be printed in this editor. So I'm taking a few screenshots and pasting links here so anyone can see them. The examples are:
My guess is that this source code is obfuscated, right? And that these symbols are allowed in MSIL, but become illegal characters when translated as-is into C#. Is that right? Any suggestions on how to get past this, such as doing a replace-all on these names?
MSIL has very lax rules for what is allowed as an identifier name. Obfuscators intentionally choose characters which C# cannot represent, so you can't round-trip to C#.
You can decompile to IL however and be able to compile the project.
Also look at C#'s Unicode identifiers. You can use Unicode escape sequences inside C# identifiers, which is surprising to many. Example:
class @class
{
    public static void @static(bool @bool) {
        if (@bool)
            System.Console.WriteLine("true");
        else
            System.Console.WriteLine("false");
    }
}
class Class1
{
    static void M() {
        cl\u0061ss.st\u0061tic(true);
    }
}
You could look at the file with a hex editor, figure out the 'rules' of these values, and then you might be able to write yourself a program that would convert them to ascii representations with some prefix - ie, obs_627 or whatever.
Of course you can only change names which will be referred to only from within the codebase you are changing. Any external linkage to these special names, or internal use of whatever the equivalent of reflection is, would break. If there's reason to expect either of these are the case, then it would be a wasted effort.
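A rough sketch of the renaming idea follows. The function name ToSafeName and the obs_ prefix are only illustrative (the prefix follows the obs_627 suggestion above). It handles a single name; finding and replacing every occurrence in the decompiled source is a separate step not shown here.

```csharp
using System;
using System.Linq;

// Hypothetical helper: returns the name unchanged if it is a plain ASCII identifier,
// otherwise builds a replacement from the hex codes of its characters.
static string ToSafeName(string name)
{
    bool isPlain = name.Length > 0
        && (char.IsLetter(name[0]) || name[0] == '_')
        && name.All(c => c < 128 && (char.IsLetterOrDigit(c) || c == '_'));
    if (isPlain) return name;

    // Each UTF-16 code unit becomes four hex digits, so distinct obfuscated names stay distinct.
    return "obs_" + string.Join("_", name.Select(c => ((int)c).ToString("X4")));
}

Console.WriteLine(ToSafeName("Main"));         // Main
Console.WriteLine(ToSafeName("\u0001\u2500")); // obs_0001_2500
```

A generated name could still collide with an existing identifier that already starts with obs_, so it is worth checking for that before replacing.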
These are from the old MS-DOS character set (the OEM code page 437, not ANSI).
The first example you posted contains line-drawing characters. In code page 437, the shading and box-drawing characters start at 176 decimal (0xB0 hex).
The second and third contain ASCII control characters between 1 and 31 decimal (0x01-0x1F in hex notation), which that character set displayed as symbols.
You can't copy and paste them because the characters displayed don't exist in most modern fonts.
I'm trying to use the MSWord Interop Library to write a C# application that outputs specially formatted text (isolated Arabic letters) to a file. The problem I'm running into is determining how many characters remain before the text wraps onto a new line. I need each word to stay on a single line instead of wrapping, which is what happens by default. I'm finding this difficult because when the Arabic letters of a word are isolated with spaces, they are treated as individual characters and therefore behave differently than connected words.
Any help is appreciated. Thanks.
Add each character to your range, then check the number of lines in the range:
LineCount = range.ComputeStatistics(Word.WdStatistic.wdStatisticLines);
When the line count changes, you know the text has wrapped, and you can remove the last character or reformat accordingly.
I don't know how this behaves today, but I ran into a somewhat odd fact when I wrote something against the MSWord API: you can't actually find that out. In MSWord, text in a document is always in paragraphs.
If you add text to your document, it doesn't simply sit on a page; the page will contain at least one paragraph holding the text you wrote.
Unfortunately I can't work this out again, because I don't have a license for MS Word these days.
Give it a try and look at the problem again in this way.
Hope this helps, and if not, please provide the code that generates the input and the exact version of MSWord.
Greetings,
Kjellski
I'm not sure exactly what "Arabic letters of the word isolated with spaces" means, but I assume that a non-breaking space is what you need.
Here's more details.
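If the non-breaking space is the right fit, building the text could look roughly like this. The helper name IsolateLetters is made up, and the assumption that Word keeps text joined by U+00A0 on one line is the answerer's suggestion, not something verified here. Text elements are used instead of single chars so that combining marks stay attached to their base letter.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper: separates the letters of a word with non-breaking spaces (U+00A0).
static string IsolateLetters(string word)
{
    var parts = new List<string>();
    TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(word);
    while (elements.MoveNext())
    {
        // One text element is a base character plus any combining marks that follow it.
        parts.Add(elements.GetTextElement());
    }
    return string.Join("\u00A0", parts);
}

string isolated = IsolateLetters("كتب");
Console.WriteLine(isolated.Length); // 5: three letters and two non-breaking spaces
```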
I have a string in a watch window in VS2008 and want to see the hex representation of each character. If I right click there's a hexadecimal option but this doesn't appear to do anything. Anybody know how to view the string as a series of hex values?
Add your string as a watch, then edit the watch expression and append ".ToCharArray()" to view it as an array of chars. When you expand your watch you will see char code next to each individual char. Checking "Hexadecimal display" will show you hex codes for each character.
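If changing the code temporarily is an option, the same information can be produced with a small helper and inspected in the watch window or written to the debug output. The name ToHexCodes is made up for illustration.

```csharp
using System;
using System.Linq;

// Hypothetical helper: formats each UTF-16 code unit of the string as four hex digits.
static string ToHexCodes(string s) =>
    string.Join(" ", s.Select(c => ((int)c).ToString("X4")));

Console.WriteLine(ToHexCodes("A\u0082é")); // 0041 0082 00E9
```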
Default visualizer in VS (at least 2005) does not support this. However, apparently it isn't too much trouble to roll one's own visualizer: http://msdn.microsoft.com/en-us/library/ms379596.aspx (That's an old article from 2005 beta times, but I don't think the API changed much.)
Perhaps somebody somewhere even wrote one, but I haven't seen one yet.