I am trying to show a simple text RSS feed from a CodePlex project in a window.
My problem is that the feed text contains a lot of character sequences that looks like:
:
-
etc..
I know that they represent the punctuation and some special chars, with some kind of encoding, but I do not know how I can convert them back to simple ascii chars... I mean, without a switch/case covering each special char of course.
Thank you !
Sum-up: How can I convert "My name is : Aurelien" to "My name is : Aurelien" ?
As you can see by the question generated by your markup, those are HTML encoded characters.
All you have to do to decode them is use HttpUtility.HtmlDecode() to decode them.
If you're using .NET 4.0, you could also use System.Net.WebUtility.HtmlDecode() which would allow you to continue to target the Client Profile rather than the full framework.
Related
I'm scraping a social platform using selenium, and a lot of users use special characters like HEᑕƘᏔ®✞ℍ, fire Emojis and so on. These characters turn into questions marks like "HE?????????".
I've tried to use the decode and encode utilities but I've had absolutely no luck.
See here:
WebUtility.HtmlDecode(string);
WebUtility.HtmlEncode(string);
I get the feeling I'm barking up the wrong tree here, but have no idea where to start, as special character answers normally talk about Unicode, and I'm pretty sure this isn't relevant in this case.
EDIT:
This is how I'm fetching the content using selenium
title = driver.FindElement(By.XPath("//*[#id=\"header-
section\"]/div[2]/div/div/div/div/div[1]/div/h1")).Text;
What you are doing is looking at HTML decode and encode rather which replaces letters to make them HTML safe for example £ becomes £
You want to look at text encoding, as this controls which characters are available with different characters sets giving you different characters. If a character is not available in the character set you are using it shows as a question mark or black block.
You can use Encoding.Convert() see this discussion for more info.
It is likely you will want to convert your input to UTF-8 text encoding to see the full character set.
I have some string that contains the following code/value:
"You won\u0092t find a ...."
It looks like that string contains the Right Apostrophe special character.
ref1: Unicode control 0092
ref2: ASCII chart (both 127 + extra extended ascii)
I'm not sure how to display this to the webbrowser. It keeps displaying the TOFU square-box character instead. I'm under the impression that the unicode (hex) value 00092 can be converted to unicode (html)
Is my understanding correct?
Update 1:
It was suggested by #sam-axe that I HtmlEncode the unicode. That didn't work. Here it is...
Note the ampersand got correctly encoded....
It looks like there's an encoding mix-up. In .NET, strings are normally encoded as UTF-16, and a right apostrophe should be represented as \u2019. But in your example, the right apostrophe is represented as \x92, which suggests the original encoding was Windows code page 1252. If you include your string in a Unicode document, the character \x92 won't be interpreted properly.
You can fix the problem by re-encoding your string as UTF-16. To do so, treat the string as an array of bytes, and then convert the bytes back to Unicode using the 1252 code page:
string title = "You won\u0092t find a cheaper apartment * Sauna & Spa";
byte[] bytes = title.Select(c => (byte)c).ToArray();
title = Encoding.GetEncoding(1252).GetString(bytes);
// Result: "You won’t find a cheaper apartment * Sauna & Spa"
Note: much of my answer is based on guessing and looking at the decompiled code of System.Web 4.0. The reference source looks very similar (identical?).
You're correct that "" (6 characters) can be displayed in the browser. Your output string, however, contains "\u0092" (1 character). This is a control character, not an HTML entity.
According to the reference code, WebUtility.HtmlEncode() doesn't transform characters between 128 and 160 - all characters in this range are control characters (ampersand is special-cased in the code as are a few other special HTML symbols).
My guess is that because these are control characters, they're output without transformation because transforming it would change the meaning of the string. (I tried running some examples using LinqPad, this character was not rendered.)
If you really want to transform these characters (or remove them), you'll probably have to write your own function before/after calling HtmlEncode() - there may be something that does this already but I don't know of any.
Hope this helps.
Edit: Michael Liu's answer seems correct. I'm leaving my answer here because it may be useful in cases when the input encoding of a string is not known.
I created a sql table with a field in "latin_swedish_ci" and downloaded a lot of data in this field. But a lot of special foreign characters have automatically converted in something very difficult to use.
For example I have those east-europe characters :
ō
ą
ł
š
Problem is those characters are part of a url and .net does not automatically convert into "real characters". For example when I use with webclient the real url isn't reached.
Is there a way to directly convert those characters code into the real characters in the db.
Or at least a way to turn them for use in the webclient DownloadString function ?
Those codes looks like special html characters. Have a look here and check if you get the right character out of the code. If so at the bottom of the post there are some hint about how to use code for conversion.
You can also look at these SO answers to get some more examples:
Answer 1
Answer 2
Hey,
so we have a backend written in C# and we have text in that backend in a language which has "special characters".
Problem is when I output my saved text (from C# app) to the web page (ASP.NET), the characters are all messed up even though the browser interprest the page as UTF (since I have placed a meta tag telling the browser that it is UTF8).
But since its all messed up, Im sort of questioning what the output from C# is. Its probably not UTF8, but something else. Somewhere I read that text in .NET is usually UTF-16?
Basically, I am assigning a label (that can do HTML) with a value taken from the backend. That needs to be in UTF8.
How do I do that in the best way?
.NET strings are natively encoded as UTF-16. The following will set the HTTP output to UTF-8:
Response.ContentEncoding = System.Text.Encoding.UTF8;
When outputting special characters in HTML, you should escape them anyways using Unicode escape sequences (for example é makes é).
Better resources:
http://msdn.microsoft.com/en-us/library/39d1w2xf.aspx
Response.ContentEncoding = Encoding.GetEncoding(xxx);
I have a a string in c# initialised as follows:
string strVal = "£2000";
However whenever I write this string out the following is written:
£2000
It does not do this with dollars.
An example bit of code I am using to write out the value:
System.IO.File.AppendAllText(HttpContext.Current.Server.MapPath("/logging.txt"), strVal);
I'm guessing it's something to do with localization but if c# strings are just unicode surely this should just work?
CLARIFICATION: Just a bit more info, Jon Skeet's answer is correct, however I also get the issue when I URLEncode the string. Is there a way of preventing this?
So the URL encoded string looks like this:
"%c2%a32000"
%c2 = Â
%a3 = £
If I encode as ASCII the £ comes out as ?
Any more ideas?
AppendAllText is writing out the text in UTF-8.
What are you using to look at it? Chances are it's something that doesn't understand UTF-8, or doesn't try UTF-8 first. Tell your editor/viewer that it's a UTF-8 file and all should be well. Alternatively, use the overload of AppendAllText which allows you to specify the encoding and use whichever encoding is going to be most convenient for you.
EDIT: In response to your edited question, the reason it fails when you encode with ASCII is that £ is not in the ASCII character set (which is Unicode 0-127).
URL encoding is also using UTF-8, by the looks of it. Again, if you want to use a different encoding, specify it to the HttpUtility.UrlEncode overload which accepts an encoding.
The default character set of URLs when used in HTML pages and in HTTP headers is called ISO-8859-1 or ISO Latin-1.
It's not the same as UTF-8, and it's not the same as ASCII, but it does fit into one-byte-per-character. The range 0 to 127 is a lot like ASCII, and the whole range 0 to 255 is the same as the range 0000-00FF of Unicode.
So you can generate it from a C# string by casting each character to a byte, or you can use Encoding.GetEncoding("iso-8859-1") to get an object to do the conversion for you.
(In this character set, the UK pound symbol is 163.)
Background
The RFC says that unencoded text must be limited to the traditional 7-bit US ASCII range, and anything else (plus the special URL delimiter characters) must be encoded. But it leaves open the question of what character set to use for the upper half of the 8-bit range, making it dependent on the context in which the URL appears.
And that context is defined by two other standards, HTTP and HTML, which do specify the default character set, and which together create a practically irresistable force on implementers to assume that the address bar contains percent-encodings that refer to ISO-8859-1.
ISO-8859-1 is the character set of text-based content sent via HTTP except where otherwise specified. So by the time a URL string appears in the HTTP GET header, it ought to be in ISO-8859-1.
The other factor is that HTML also uses ISO-8859-1 as its default, and URLs typically originate as links in HTML pages. So when you craft a simple minimal HTML page in Notepad, the URLs you type into that file are in ISO-8859-1.
It's sometimes described as "hole" in the standards, but it's not really; it's just that HTML/HTTP fill in the blank left by the RFC for URLs.
Hence, for example, the advice on this page:
URL encoding of a character consists
of a "%" symbol, followed by the
two-digit hexadecimal representation
(case-insensitive) of the ISO-Latin
code point for the character.
(ISO-Latin is another name for IS-8859-1).
So much for the theory. Paste this into notepad, save it as an .html file, and open it in a few browsers. Click the link and Google should search for UK pound.
<HTML>
<BODY>
Test
</BODY>
</HTML>
It works in IE, Firefox, Apple Safari, Google Chrome - I don't have any others available right now.
Note that %a3 cannot be encoded in ASCII (7 bit, Basic Latin).
The Pound Sign (down the page) is part of Latin-1 encoding.
I have noticed that this is happening only when long strings are used (over 4000) chars. My solution was upon receiving the parameter in database, I simply replace the  sign with nothing.
Be careful, Â may actually be needed, and if that is the case this solution is not appropriate.