Output from C# to html web page - UTF8 fails?


Hey,
so we have a backend written in C# and we have text in that backend in a language which has "special characters".
The problem is that when I output my saved text (from the C# app) to the web page (ASP.NET), the characters are all messed up, even though the browser interprets the page as UTF-8 (I have placed a meta tag telling the browser that it is UTF-8).
But since it's all messed up, I'm questioning what the output from C# actually is. It's probably not UTF-8 but something else. Somewhere I read that text in .NET is usually UTF-16?
Basically, I am assigning a label (that can do HTML) with a value taken from the backend. That needs to be in UTF8.
How do I do that in the best way?

.NET strings are natively encoded as UTF-16. The following will set the HTTP output to UTF-8:
Response.ContentEncoding = System.Text.Encoding.UTF8;
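To see what that setting changes, here is a minimal console sketch. The `Utf8Demo` class and `ToUtf8Hex` helper are hypothetical names for illustration, not part of the original answer. It takes a string that .NET holds in memory as UTF-16 and encodes it as UTF-8 bytes, which is roughly what happens when the response is written with a UTF-8 content encoding.

```csharp
using System;
using System.Text;

public static class Utf8Demo
{
    // Hypothetical helper: encode a .NET (UTF-16) string as UTF-8 and show the bytes as hex.
    public static string ToUtf8Hex(string text)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        return BitConverter.ToString(utf8);
    }

    public static void Main()
    {
        // "café"; the \u escape avoids depending on the source file's own encoding.
        string text = "caf\u00E9";

        // In memory, each character here is one UTF-16 code unit.
        Console.WriteLine(text.Length);                          // 4

        // Encoded as UTF-8, é takes two bytes (C3 A9).
        Console.WriteLine(ToUtf8Hex(text));                      // 63-61-66-C3-A9

        // Encoded as UTF-16, every character in this string takes two bytes.
        Console.WriteLine(Encoding.Unicode.GetByteCount(text));  // 8
    }
}
```

If the page still shows garbage after setting the response encoding, comparing the bytes actually sent with output like this is one way to find out which side is wrong.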

When outputting special characters in HTML, you should escape them anyway using numeric character references (for example, &#233; renders as é).

Better resources:
http://msdn.microsoft.com/en-us/library/39d1w2xf.aspx
Response.ContentEncoding = Encoding.GetEncoding(xxx);

Related

How to read and store string in UTF-8 format in C#?

I have a file with URLs, one of which is http://en.wikipedia.org/wiki/São_Paulo. Note that 'ã'. When I read the URLs (in C#) and try to print it, it appears as http://en.wikipedia.org/wiki/S?o_Paulo.
I tried reading the URLs as following:
List<string> urls = System.IO.File.ReadAllLines(wikiURL_FilePath, Encoding.UTF8).ToList();
Note that I have passed second argument to read it in UTF8 format, but still the problem is not rectified. How can I read and store the string in correct form?

The data you have shown is simply not UTF-8, despite having a UTF-8 BOM; the UTF-8 for São is 53-C3-A3-6F; you have 53-E3-6F, which is... the right unicode code-points for basic multi-lingual plane data, but incorrectly encoded to disk as UTF-8. You probably need to fix the code that wrote this file, or: agree on what the encoding is (it could be a single-byte code-page, but you need to agree which, else everything falls apart).
Likely looking encodings (if we take away the BOM):
utf-7
windows-1252
windows-1254
iso-8859-1
iso-8859-4
iso-8859-9
iso-8859-15
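The byte-level claim above can be checked with a small sketch. The class and method names are made up for illustration, and ISO-8859-1 is used only as one of the candidate encodings from the list, not as a confirmed answer for this particular file.

```csharp
using System;
using System.Text;

public static class SaoBytesDemo
{
    // The three bytes quoted in the answer for "São" as found in the file.
    public static readonly byte[] OnDisk = { 0x53, 0xE3, 0x6F };

    // Decode the on-disk bytes with a named encoding.
    public static string DecodeAs(string encodingName)
    {
        return Encoding.GetEncoding(encodingName).GetString(OnDisk);
    }

    public static void Main()
    {
        // E3 followed by 6F is not a valid UTF-8 sequence, so the decoder substitutes U+FFFD.
        string asUtf8 = Encoding.UTF8.GetString(OnDisk);
        Console.WriteLine(asUtf8.IndexOf('\uFFFD') >= 0);   // True

        // Read as ISO-8859-1, byte E3 maps directly to U+00E3 (ã).
        string asLatin1 = DecodeAs("iso-8859-1");
        Console.WriteLine(asLatin1 == "S\u00E3o");          // True

        // What correct UTF-8 for the same text looks like.
        byte[] properUtf8 = Encoding.UTF8.GetBytes("S\u00E3o");
        Console.WriteLine(BitConverter.ToString(properUtf8)); // 53-C3-A3-6F
    }
}
```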

HTML Decode and Encode

I have tried to decode the HTML text that I have in the database in my MVC 3 Razor application.
The HTML text in the database is not encoded.
I tried HttpUtility.HtmlDecode and Server.HtmlDecode, but neither of them works.
Finally I managed to make it work with Html.Raw(string).
Sample of non-working code:
@Server.HtmlDecode(item.ShortDescription)
@HttpUtility.HtmlDecode(item.ShortDescription)
Do you know why we cannot use Html.Decode in my case?
I thought this would save someone else from looking for a few hours.

It works just fine to decode the text, but then it will automatically be encoded again when it's put in the page using the @ syntax.
The Html.Raw method wraps the string in an HtmlString, which tells the razor engine not to encode it when it's put in the page.

If you want to display the value as-is without any HTML encoding you could use the Html.Raw helper:
@Html.Raw(item.ShortDescription)
Be warned, though, that by doing this you are opening your site to XSS attacks, so you should be very careful about what HTML this ShortDescription property contains. If it is the user who enters it, you should absolutely ensure that it is safe. You could use the AntiXss library for this.
Do you know why we cannot use Html.Decode in my case?
Because Html.Decode returns a string, and when you feed a string to the @() Razor function it automatically HTML-encodes it again and ruins your previous efforts. That's why the Html.Raw helper exists.
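The decode-then-re-encode effect can be imitated outside Razor with a console sketch. This does not use Razor itself: `RenderLikeRazor` is a hypothetical stand-in that assumes `WebUtility.HtmlEncode` behaves closely enough to Razor's encoder for this simple input, and the sample markup is invented.

```csharp
using System;
using System.Net;

public static class RazorEncodingDemo
{
    // Hypothetical stand-in for what Razor's inline expression does to a plain string:
    // it HTML-encodes the value on output.
    public static string RenderLikeRazor(string value)
    {
        return WebUtility.HtmlEncode(value);
    }

    public static void Main()
    {
        // Unencoded HTML as stored in the database (made-up sample value).
        string stored = "<b>Hi</b>";

        // Decoding text that contains no entities changes nothing.
        string decoded = WebUtility.HtmlDecode(stored);
        Console.WriteLine(decoded == stored);          // True

        // The plain string is then encoded on output, so the browser shows the tags as text.
        Console.WriteLine(RenderLikeRazor(decoded));   // &lt;b&gt;Hi&lt;/b&gt;
    }
}
```

Html.Raw avoids the second step by wrapping the value in a type that the view engine emits without encoding, which is also why the XSS warning above applies.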

Which HttpUtility decode method to use?

this may be a silly question, but it trips me up every time.
HttpUtility has the methods HtmlDecode and UrlDecode. Do these two methods decode anything (Html/Http related) I might find? When do I have to use them, and which one am I supposed to use?
Just now I hit an error. This is my error log:
Payment receiver was not payment@mysite.com. (it was payment%40mysite.com).
But, I wrapped the email address here in HttpUtility.HtmlDecode before using it. It turns out I have to use .UrlDecode instead, but this email address didn't come from a URL so this wasn't obvious to me.
Can someone clarify this?

See What is meant by htmlencode and urlencode?
It's the reverse of your case, but essentially you need to use UrlEncode/Decode anytime you are using an address of sorts (urls and yes, email addresses). HtmlEncode/Decode is for code that typically a browser would render (html/xml tags).

This same encoding is also used in Form POST requests as well.
My guess is something read it 'naked' without decoding it.
Html Encoding/Decoding is only used to escape strings that contain characters that would otherwise be interpreted as html control characters. The process turns the characters into html entities and back again.
Url Encoding is to get around the fact that many characters are not allowed in Uris; or because they too could be misinterpreted. Thus the percent encoding is used.
Percent encoding is also used in the body of http requests.
In both cases, of course, it's also a way of expressing a specific character code in a request/response independent of character sets; but equally, interpreting what is meant by a particular code can also be dependent on knowing a particular character set. Generally you don't worry about that - but it can be important (especially in the HTML case).

URLEncode converts characters that aren't allowed in a URL into character equivalents which are parsable as a URL. In your example @ became %40. URLDecode reverses this.
HTMLEncode is similar to URLEncode, but the target environment is text NESTED inside of HTML. This keeps the browser from interpreting your content as HTML, but when rendered it should look like the decoded version. HTMLDecode reverses this.

When you see %xx this means percent encoding has occurred - this is a URL encoding scheme, so you need to use UrlEncode / UrlDecode.
The HtmlEncode and HtmlDecode methods are for encoding and decoding elements for HTML display - so things like & get encoded to &amp; and > to &gt;.
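A short sketch of the difference, using the email address from the question. It uses `System.Net.WebUtility` rather than `HttpUtility` so that it runs as a plain console program; the class name is made up.

```csharp
using System;
using System.Net;

public static class DecodeChoiceDemo
{
    public static void Main()
    {
        string received = "payment%40mysite.com";

        // HtmlDecode only looks for entities such as &amp;, so %40 is left alone.
        Console.WriteLine(WebUtility.HtmlDecode(received));     // payment%40mysite.com

        // UrlDecode reverses percent encoding.
        Console.WriteLine(WebUtility.UrlDecode(received));      // payment@mysite.com

        // HtmlEncode turns markup characters into entities.
        Console.WriteLine(WebUtility.HtmlEncode("a & b > c"));  // a &amp; b &gt; c
    }
}
```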

Encoding conversion from RSS feed chars

I am trying to show a simple text RSS feed from a CodePlex project in a window.
My problem is that the feed text contains a lot of character sequences that look like:
&#58;
&#45;
etc.
I know that they represent punctuation and some special chars, with some kind of encoding, but I do not know how I can convert them back to simple ASCII chars... I mean, without a switch/case covering each special char of course.
Thank you!
Sum-up: How can I convert "My name is &#58; Aurelien" to "My name is : Aurelien"?

As you can see by the question generated by your markup, those are HTML encoded characters.
All you have to do to decode them is use HttpUtility.HtmlDecode() to decode them.
If you're using .NET 4.0, you could also use System.Net.WebUtility.HtmlDecode() which would allow you to continue to target the Client Profile rather than the full framework.
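As a minimal sketch of that suggestion: the sample input assumes the feed uses decimal numeric character references, and `FeedTextDemo.Clean` is a hypothetical wrapper name.

```csharp
using System;
using System.Net;

public static class FeedTextDemo
{
    // Hypothetical wrapper: turn HTML character references in feed text back into characters.
    public static string Clean(string feedText)
    {
        return WebUtility.HtmlDecode(feedText);
    }

    public static void Main()
    {
        // &#58; is the colon and &#45; is the hyphen-minus.
        Console.WriteLine(Clean("My name is &#58; Aurelien"));  // My name is : Aurelien
        Console.WriteLine(Clean("one &#45; two"));              // one - two
    }
}
```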

Why is this appearing in my c# strings: Â£

I have a string in C# initialised as follows:
string strVal = "£2000";
However whenever I write this string out the following is written:
Â£2000
It does not do this with dollars.
An example bit of code I am using to write out the value:
System.IO.File.AppendAllText(HttpContext.Current.Server.MapPath("/logging.txt"), strVal);
I'm guessing it's something to do with localization but if c# strings are just unicode surely this should just work?
CLARIFICATION: Just a bit more info, Jon Skeet's answer is correct, however I also get the issue when I URLEncode the string. Is there a way of preventing this?
So the URL encoded string looks like this:
"%c2%a32000"
%c2 = Â
%a3 = £
If I encode as ASCII the £ comes out as ?
Any more ideas?

AppendAllText is writing out the text in UTF-8.
What are you using to look at it? Chances are it's something that doesn't understand UTF-8, or doesn't try UTF-8 first. Tell your editor/viewer that it's a UTF-8 file and all should be well. Alternatively, use the overload of AppendAllText which allows you to specify the encoding and use whichever encoding is going to be most convenient for you.
EDIT: In response to your edited question, the reason it fails when you encode with ASCII is that £ is not in the ASCII character set (which is Unicode 0-127).
URL encoding is also using UTF-8, by the looks of it. Again, if you want to use a different encoding, specify it to the HttpUtility.UrlEncode overload which accepts an encoding.
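The symptom can be reproduced in a few lines. This is a sketch only: the class name is invented, and ISO-8859-1 stands in for whatever single-byte encoding the viewer actually assumed.

```csharp
using System;
using System.Text;

public static class PoundDemo
{
    public static void Main()
    {
        // The pound sign, written as an escape to avoid source-file encoding issues.
        string pound = "\u00A3";

        // UTF-8 needs two bytes for it.
        byte[] utf8 = Encoding.UTF8.GetBytes(pound);
        Console.WriteLine(BitConverter.ToString(utf8));       // C2-A3

        // A viewer that assumes a single-byte encoding shows each byte as its own character.
        string misread = Encoding.GetEncoding("iso-8859-1").GetString(utf8);
        Console.WriteLine(misread == "\u00C2\u00A3");         // True (the two characters Â and £)

        // ASCII has no pound sign, so the encoder falls back to '?'.
        byte[] ascii = Encoding.ASCII.GetBytes(pound);
        Console.WriteLine((char)ascii[0]);                    // ?
    }
}
```

The same two bytes, C2 and A3, are what appear as %c2%a3 in the URL-encoded string from the question.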

The default character set of URLs when used in HTML pages and in HTTP headers is called ISO-8859-1 or ISO Latin-1.
It's not the same as UTF-8, and it's not the same as ASCII, but it does fit into one-byte-per-character. The range 0 to 127 is a lot like ASCII, and the whole range 0 to 255 is the same as the range 0000-00FF of Unicode.
So you can generate it from a C# string by casting each character to a byte, or you can use Encoding.GetEncoding("iso-8859-1") to get an object to do the conversion for you.
(In this character set, the UK pound symbol is 163.)
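Both routes mentioned above can be sketched as follows. `PercentEncodeLatin1` is a hypothetical helper written for this illustration; it percent-encodes every byte, including ones a real URL encoder would leave as they are.

```csharp
using System;
using System.Text;

public static class Latin1PercentDemo
{
    // Hypothetical helper: percent-encode each byte of the ISO-8859-1 form of the text.
    public static string PercentEncodeLatin1(string text)
    {
        byte[] bytes = Encoding.GetEncoding("iso-8859-1").GetBytes(text);
        var sb = new StringBuilder();
        foreach (byte b in bytes)
        {
            sb.Append('%').Append(b.ToString("x2"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        char pound = '\u00A3';

        // Casting works because the first 256 Unicode code points match ISO-8859-1.
        Console.WriteLine((byte)pound);                     // 163

        // Using the encoding object gives the same single byte, shown here percent-encoded.
        Console.WriteLine(PercentEncodeLatin1("\u00A3"));   // %a3
    }
}
```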
Background
The RFC says that unencoded text must be limited to the traditional 7-bit US ASCII range, and anything else (plus the special URL delimiter characters) must be encoded. But it leaves open the question of what character set to use for the upper half of the 8-bit range, making it dependent on the context in which the URL appears.
And that context is defined by two other standards, HTTP and HTML, which do specify the default character set, and which together create a practically irresistible force on implementers to assume that the address bar contains percent-encodings that refer to ISO-8859-1.
ISO-8859-1 is the character set of text-based content sent via HTTP except where otherwise specified. So by the time a URL string appears in the HTTP GET header, it ought to be in ISO-8859-1.
The other factor is that HTML also uses ISO-8859-1 as its default, and URLs typically originate as links in HTML pages. So when you craft a simple minimal HTML page in Notepad, the URLs you type into that file are in ISO-8859-1.
It's sometimes described as a "hole" in the standards, but it's not really; it's just that HTML/HTTP fill in the blank left by the RFC for URLs.
Hence, for example, the advice on this page:
URL encoding of a character consists
of a "%" symbol, followed by the
two-digit hexadecimal representation
(case-insensitive) of the ISO-Latin
code point for the character.
(ISO-Latin is another name for ISO-8859-1).
So much for the theory. Paste this into notepad, save it as an .html file, and open it in a few browsers. Click the link and Google should search for UK pound.
<HTML>
<BODY>
Test
</BODY>
</HTML>
It works in IE, Firefox, Apple Safari, Google Chrome - I don't have any others available right now.

Note that %a3 cannot be encoded in ASCII (7 bit, Basic Latin).
The Pound Sign (down the page) is part of Latin-1 encoding.

I have noticed that this happens only when long strings are used (over 4000 chars). My solution was, upon receiving the parameter in the database, to simply replace the Â sign with nothing.
Be careful, Â may actually be needed, and if that is the case this solution is not appropriate.
