How to collect special letters like "ᑕƘᏔ®✞ℍ" - c#

I'm scraping a social platform using selenium, and a lot of users use special characters like HEᑕƘᏔ®✞ℍ, fire Emojis and so on. These characters turn into questions marks like "HE?????????".
I've tried to use the decode and encode utilities but I've had absolutely no luck.
See here:
WebUtility.HtmlDecode(string);
WebUtility.HtmlEncode(string);
I get the feeling I'm barking up the wrong tree here, but have no idea where to start, as special character answers normally talk about Unicode, and I'm pretty sure this isn't relevant in this case.
EDIT:
This is how I'm fetching the content using selenium
title = driver.FindElement(By.XPath("//*[#id=\"header-
section\"]/div[2]/div/div/div/div/div[1]/div/h1")).Text;

What you are doing is looking at HTML decode and encode rather which replaces letters to make them HTML safe for example £ becomes £
You want to look at text encoding, as this controls which characters are available with different characters sets giving you different characters. If a character is not available in the character set you are using it shows as a question mark or black block.
You can use Encoding.Convert() see this discussion for more info.
It is likely you will want to convert your input to UTF-8 text encoding to see the full character set.

Related

Detect non-displayable characters in string C#

Hopefully someone can help me with this, because I haven't found any solution online so far.
I am processing strings with special characters and I want to detect if any character in a string can't be displayed properly by for instance a webbrowser or even Visual Studio itself. The following string shows such characters. This comes from the Text vizualizer in VS2019:
TargetsforReduceCO
I've checked similar questions, but the answers were mostly limited to checking if the character code exceeds 255. However, there are lots of characters that can still be displayed, like Greek and Cyrillic symbols.
I also found this website that has an overview of all Unicode characters and show how they are displayed in the browser, but there doesn't seem to be any logic in which characters can't be displayed and their character code.
I can imagine that VS doesn't know which characters can't be displayed in various browsers, but I'm hoping that there is at least a way of checking if VS can display them.
Thanks in advance for your help!
Edit:
Right now I'm using
input.Any(c => !char.IsLetterOrDigit(c) && c > 255);
Because the input shouldn't normally contain other symbols than what you can usually find in a text, but I'm sure it will be triggered on symbols that can actually be displayed by VS or a webbrowser.
Type char has a number of static member methods like IsPunctuation() that should help you "categorize" character by character. See example on this page System.Char reference. Each of those methods' documentation explains what characters it applies to. As commenters have mentioned, your "displayable" criterion is more a font-presentation problem than a character value problem but you'll be able to narrow down what your system can work with using these methods. Look out for other methods like GetUnicodeCategory().
It may be that something as simple as !char.IsControl(c) will do the trick.
See similar Q&A here C# Printable Characters

Is it possible to display (convert?) the unicode hex \u0092 to an unicode html entity in .NET?

I have some string that contains the following code/value:
"You won\u0092t find a ...."
It looks like that string contains the Right Apostrophe special character.
ref1: Unicode control 0092
ref2: ASCII chart (both 127 + extra extended ascii)
I'm not sure how to display this to the webbrowser. It keeps displaying the TOFU square-box character instead. I'm under the impression that the unicode (hex) value 00092 can be converted to unicode (html) ’
Is my understanding correct?
Update 1:
It was suggested by #sam-axe that I HtmlEncode the unicode. That didn't work. Here it is...
Note the ampersand got correctly encoded....
It looks like there's an encoding mix-up. In .NET, strings are normally encoded as UTF-16, and a right apostrophe should be represented as \u2019. But in your example, the right apostrophe is represented as \x92, which suggests the original encoding was Windows code page 1252. If you include your string in a Unicode document, the character \x92 won't be interpreted properly.
You can fix the problem by re-encoding your string as UTF-16. To do so, treat the string as an array of bytes, and then convert the bytes back to Unicode using the 1252 code page:
string title = "You won\u0092t find a cheaper apartment * Sauna & Spa";
byte[] bytes = title.Select(c => (byte)c).ToArray();
title = Encoding.GetEncoding(1252).GetString(bytes);
// Result: "You won’t find a cheaper apartment * Sauna & Spa"
Note: much of my answer is based on guessing and looking at the decompiled code of System.Web 4.0. The reference source looks very similar (identical?).
You're correct that "’" (6 characters) can be displayed in the browser. Your output string, however, contains "\u0092" (1 character). This is a control character, not an HTML entity.
According to the reference code, WebUtility.HtmlEncode() doesn't transform characters between 128 and 160 - all characters in this range are control characters (ampersand is special-cased in the code as are a few other special HTML symbols).
My guess is that because these are control characters, they're output without transformation because transforming it would change the meaning of the string. (I tried running some examples using LinqPad, this character was not rendered.)
If you really want to transform these characters (or remove them), you'll probably have to write your own function before/after calling HtmlEncode() - there may be something that does this already but I don't know of any.
Hope this helps.
Edit: Michael Liu's answer seems correct. I'm leaving my answer here because it may be useful in cases when the input encoding of a string is not known.

Issue with HttpUtility.HtmlEncode with other language characters

I was trying to encode the HTML special characters like ', ", <,> etc with HttpUtility.HtmlEncode. But I noticed this is also encoding french characters like (é) to é and now é is getting displayed as it is on my HTML page. I don't want this I just want to encode ', ", <,> and few other characters.
Should those characters look differently? Why is it a problem if they are replaced? This is by design. You can take a look at this question to see longer discussion. Unless your users can't properly see text you are displaying, you shouldn't mess with this, for security/compatibility reasons.
HtmlUtility seems to encode several classes of characters, among which ISO-8859-1 character set
If you still don't want a specific character to be encoded, you are forced to use string.Replace() for this purpose.
The various .NET text encoding functions are notorious for being badly documented and doing weird conversions. In some cases I've had better luck with the encoding functions in Microsoft Anti-XSS library, but not sure if this would work in your particular application.

Encoding conversion from RSS feed chars

I am trying to show a simple text RSS feed from a CodePlex project in a window.
My problem is that the feed text contains a lot of character sequences that looks like:
:
-
etc..
I know that they represent the punctuation and some special chars, with some kind of encoding, but I do not know how I can convert them back to simple ascii chars... I mean, without a switch/case covering each special char of course.
Thank you !
Sum-up: How can I convert "My name is : Aurelien" to "My name is : Aurelien" ?
As you can see by the question generated by your markup, those are HTML encoded characters.
All you have to do to decode them is use HttpUtility.HtmlDecode() to decode them.
If you're using .NET 4.0, you could also use System.Net.WebUtility.HtmlDecode() which would allow you to continue to target the Client Profile rather than the full framework.

Why is this appearing in my c# strings: £

I have a a string in c# initialised as follows:
string strVal = "£2000";
However whenever I write this string out the following is written:
£2000
It does not do this with dollars.
An example bit of code I am using to write out the value:
System.IO.File.AppendAllText(HttpContext.Current.Server.MapPath("/logging.txt"), strVal);
I'm guessing it's something to do with localization but if c# strings are just unicode surely this should just work?
CLARIFICATION: Just a bit more info, Jon Skeet's answer is correct, however I also get the issue when I URLEncode the string. Is there a way of preventing this?
So the URL encoded string looks like this:
"%c2%a32000"
%c2 = Â
%a3 = £
If I encode as ASCII the £ comes out as ?
Any more ideas?
AppendAllText is writing out the text in UTF-8.
What are you using to look at it? Chances are it's something that doesn't understand UTF-8, or doesn't try UTF-8 first. Tell your editor/viewer that it's a UTF-8 file and all should be well. Alternatively, use the overload of AppendAllText which allows you to specify the encoding and use whichever encoding is going to be most convenient for you.
EDIT: In response to your edited question, the reason it fails when you encode with ASCII is that £ is not in the ASCII character set (which is Unicode 0-127).
URL encoding is also using UTF-8, by the looks of it. Again, if you want to use a different encoding, specify it to the HttpUtility.UrlEncode overload which accepts an encoding.
The default character set of URLs when used in HTML pages and in HTTP headers is called ISO-8859-1 or ISO Latin-1.
It's not the same as UTF-8, and it's not the same as ASCII, but it does fit into one-byte-per-character. The range 0 to 127 is a lot like ASCII, and the whole range 0 to 255 is the same as the range 0000-00FF of Unicode.
So you can generate it from a C# string by casting each character to a byte, or you can use Encoding.GetEncoding("iso-8859-1") to get an object to do the conversion for you.
(In this character set, the UK pound symbol is 163.)
Background
The RFC says that unencoded text must be limited to the traditional 7-bit US ASCII range, and anything else (plus the special URL delimiter characters) must be encoded. But it leaves open the question of what character set to use for the upper half of the 8-bit range, making it dependent on the context in which the URL appears.
And that context is defined by two other standards, HTTP and HTML, which do specify the default character set, and which together create a practically irresistable force on implementers to assume that the address bar contains percent-encodings that refer to ISO-8859-1.
ISO-8859-1 is the character set of text-based content sent via HTTP except where otherwise specified. So by the time a URL string appears in the HTTP GET header, it ought to be in ISO-8859-1.
The other factor is that HTML also uses ISO-8859-1 as its default, and URLs typically originate as links in HTML pages. So when you craft a simple minimal HTML page in Notepad, the URLs you type into that file are in ISO-8859-1.
It's sometimes described as "hole" in the standards, but it's not really; it's just that HTML/HTTP fill in the blank left by the RFC for URLs.
Hence, for example, the advice on this page:
URL encoding of a character consists
of a "%" symbol, followed by the
two-digit hexadecimal representation
(case-insensitive) of the ISO-Latin
code point for the character.
(ISO-Latin is another name for IS-8859-1).
So much for the theory. Paste this into notepad, save it as an .html file, and open it in a few browsers. Click the link and Google should search for UK pound.
<HTML>
<BODY>
Test
</BODY>
</HTML>
It works in IE, Firefox, Apple Safari, Google Chrome - I don't have any others available right now.
Note that %a3 cannot be encoded in ASCII (7 bit, Basic Latin).
The Pound Sign (down the page) is part of Latin-1 encoding.
I have noticed that this is happening only when long strings are used (over 4000) chars. My solution was upon receiving the parameter in database, I simply replace the  sign with nothing.
Be careful, Â may actually be needed, and if that is the case this solution is not appropriate.

Categories

Resources