Why is this appearing in my c# strings: Â£

Why is this appearing in my c# strings: Â£ - c#

I have a a string in c# initialised as follows:
string strVal = "£2000";
However whenever I write this string out the following is written:
Â£2000
It does not do this with dollars.
An example bit of code I am using to write out the value:
System.IO.File.AppendAllText(HttpContext.Current.Server.MapPath("/logging.txt"), strVal);
I'm guessing it's something to do with localization but if c# strings are just unicode surely this should just work?
CLARIFICATION: Just a bit more info, Jon Skeet's answer is correct, however I also get the issue when I URLEncode the string. Is there a way of preventing this?
So the URL encoded string looks like this:
"%c2%a32000"
%c2 = Â
%a3 = £
If I encode as ASCII the £ comes out as ?
Any more ideas?

AppendAllText is writing out the text in UTF-8.
What are you using to look at it? Chances are it's something that doesn't understand UTF-8, or doesn't try UTF-8 first. Tell your editor/viewer that it's a UTF-8 file and all should be well. Alternatively, use the overload of AppendAllText which allows you to specify the encoding and use whichever encoding is going to be most convenient for you.
EDIT: In response to your edited question, the reason it fails when you encode with ASCII is that £ is not in the ASCII character set (which is Unicode 0-127).
URL encoding is also using UTF-8, by the looks of it. Again, if you want to use a different encoding, specify it to the HttpUtility.UrlEncode overload which accepts an encoding.

The default character set of URLs when used in HTML pages and in HTTP headers is called ISO-8859-1 or ISO Latin-1.
It's not the same as UTF-8, and it's not the same as ASCII, but it does fit into one-byte-per-character. The range 0 to 127 is a lot like ASCII, and the whole range 0 to 255 is the same as the range 0000-00FF of Unicode.
So you can generate it from a C# string by casting each character to a byte, or you can use Encoding.GetEncoding("iso-8859-1") to get an object to do the conversion for you.
(In this character set, the UK pound symbol is 163.)
Background
The RFC says that unencoded text must be limited to the traditional 7-bit US ASCII range, and anything else (plus the special URL delimiter characters) must be encoded. But it leaves open the question of what character set to use for the upper half of the 8-bit range, making it dependent on the context in which the URL appears.
And that context is defined by two other standards, HTTP and HTML, which do specify the default character set, and which together create a practically irresistable force on implementers to assume that the address bar contains percent-encodings that refer to ISO-8859-1.
ISO-8859-1 is the character set of text-based content sent via HTTP except where otherwise specified. So by the time a URL string appears in the HTTP GET header, it ought to be in ISO-8859-1.
The other factor is that HTML also uses ISO-8859-1 as its default, and URLs typically originate as links in HTML pages. So when you craft a simple minimal HTML page in Notepad, the URLs you type into that file are in ISO-8859-1.
It's sometimes described as "hole" in the standards, but it's not really; it's just that HTML/HTTP fill in the blank left by the RFC for URLs.
Hence, for example, the advice on this page:
URL encoding of a character consists
of a "%" symbol, followed by the
two-digit hexadecimal representation
(case-insensitive) of the ISO-Latin
code point for the character.
(ISO-Latin is another name for IS-8859-1).
So much for the theory. Paste this into notepad, save it as an .html file, and open it in a few browsers. Click the link and Google should search for UK pound.
<HTML>
<BODY>
Test
</BODY>
</HTML>
It works in IE, Firefox, Apple Safari, Google Chrome - I don't have any others available right now.

Note that %a3 cannot be encoded in ASCII (7 bit, Basic Latin).
The Pound Sign (down the page) is part of Latin-1 encoding.

I have noticed that this is happening only when long strings are used (over 4000) chars. My solution was upon receiving the parameter in database, I simply replace the Â sign with nothing.
Be careful, Â may actually be needed, and if that is the case this solution is not appropriate.

Related

Is it possible to display (convert?) the unicode hex \u0092 to an unicode html entity in .NET?

I have some string that contains the following code/value:
"You won\u0092t find a ...."
It looks like that string contains the Right Apostrophe special character.
ref1: Unicode control 0092
ref2: ASCII chart (both 127 + extra extended ascii)
I'm not sure how to display this to the webbrowser. It keeps displaying the TOFU square-box character instead. I'm under the impression that the unicode (hex) value 00092 can be converted to unicode (html) 
Is my understanding correct?
Update 1:
It was suggested by #sam-axe that I HtmlEncode the unicode. That didn't work. Here it is...
Note the ampersand got correctly encoded....

It looks like there's an encoding mix-up. In .NET, strings are normally encoded as UTF-16, and a right apostrophe should be represented as \u2019. But in your example, the right apostrophe is represented as \x92, which suggests the original encoding was Windows code page 1252. If you include your string in a Unicode document, the character \x92 won't be interpreted properly.
You can fix the problem by re-encoding your string as UTF-16. To do so, treat the string as an array of bytes, and then convert the bytes back to Unicode using the 1252 code page:
string title = "You won\u0092t find a cheaper apartment * Sauna & Spa";
byte[] bytes = title.Select(c => (byte)c).ToArray();
title = Encoding.GetEncoding(1252).GetString(bytes);
// Result: "You won’t find a cheaper apartment * Sauna & Spa"

Note: much of my answer is based on guessing and looking at the decompiled code of System.Web 4.0. The reference source looks very similar (identical?).
You're correct that "" (6 characters) can be displayed in the browser. Your output string, however, contains "\u0092" (1 character). This is a control character, not an HTML entity.
According to the reference code, WebUtility.HtmlEncode() doesn't transform characters between 128 and 160 - all characters in this range are control characters (ampersand is special-cased in the code as are a few other special HTML symbols).
My guess is that because these are control characters, they're output without transformation because transforming it would change the meaning of the string. (I tried running some examples using LinqPad, this character was not rendered.)
If you really want to transform these characters (or remove them), you'll probably have to write your own function before/after calling HtmlEncode() - there may be something that does this already but I don't know of any.
Hope this helps.
Edit: Michael Liu's answer seems correct. I'm leaving my answer here because it may be useful in cases when the input encoding of a string is not known.

How to read and store string in UTF-8 format in C#?

I have a file with URLs, one of which is http://en.wikipedia.org/wiki/São_Paulo. Note that 'ã'. When I read the URLs (in C#) and try to print it, it appears as http://en.wikipedia.org/wiki/S?o_Paulo.
I tried reading the URLs as following:
List<string> urls = System.IO.File.ReadAllLines(wikiURL_FilePath, Encoding.UTF8).ToList();
Note that I have passed second argument to read it in UTF8 format, but still the problem is not rectified. How can I read and store the string in correct form?

The data you have shown is simply not UTF-8, despite having a UTF-8 BOM; the UTF-8 for São is 53-C3-A3-6F; you have 53-E3-6F, which is... the right unicode code-points for basic multi-lingual plane data, but incorrectly encoded to disk as UTF-8. You probably need to fix the code that wrote this file, or: agree on what the encoding is (it could be a single-byte code-page, but you need to agree which, else everything falls apart).
Likely looking encodings (if we take away the BOM):
utf-7
windows-1252
windows-1254
iso-8859-1
iso-8859-4
iso-8859-9
iso-8859-15

Which HttpUtility decode method to use?

this may be a silly question, but it trips me up every time.
HttpUtility has the methods HtmlDecode and UrlDecode. Do these two methods decode anything (Html/Http related) I might find? When do I have to use them, and which one am I supposed to use?
Just now I hit an error. This is my error log:
Payment receiver was not payment#mysite.com. (it was payment%40mysite.com).
But, I wrapped the email address here in HttpUtility.HtmlDecode before using it. It turns out I have to use .UrlDecode instead, but this email address didn't come from a URL so this wasn't obvious to me.
Can someone clarify this?

See What is meant by htmlencode and urlencode?
It's the reverse of your case, but essentially you need to use UrlEncode/Decode anytime you are using an address of sorts (urls and yes, email addresses). HtmlEncode/Decode is for code that typically a browser would render (html/xml tags).

This same encoding is also used in Form POST requests as well.
My guess is something read it 'naked' without decoding it.
Html Encoding/Decoding is only used to escape strings that contain characters that would otherwise be interpreted as html control characters. The process turns the characters into html entities and back again.
Url Encoding is to get around the fact that many characters are not allowed in Uris; or because they too could be misinterpreted. Thus the percent encoding is used.
Percent encoding is also used in the body of http requests.
In both cases, of course, it's also a way of expressing a specific character code in a request/response independent of character sets; but equally, interpreting what is meant by a particular code can also be dependent on knowing a particular character set. Generally you don't worry about that - but it can be important (especially in the HTML case).

URLEncode converts characters that aren't allowed in a URL into character equivalents which are parsable as a URL. In your example # became %40. URLDecode reverses this.
HTMLEncode is similar to URLEncode, but the target environment is text NESTED inside of HTML. This helps the browser from interpereting your content as HTML, but when rendered it should look like the decoded version. HTMLDecode reverses this.

When you see %xx this means percent encoding has occured - this is a URL encoding scheme, so you need to use UrlEncode / UrlDecode.
The HtmlEncode and HtmlDecode methods are for encoding and decoding elements for HTML display - so things like & get encoded to & and > to >.

Server.UrlEncode(string s)... doesn't

Server.UrlEncode("My File.doc") returns "My+File.doc", whereas the javascript escape("My File.doc") returns "My%20File.doc". As far as i understand it the javascript is corectly URL encoding the string whereas the .net method is not. It certainly seems to work that way in practice putting http://somesite/My+File.doc will not fetch "My File.doc" in any case i could test using firefox/i.e. and IIS, whereas http://somesite/My%20File.doc works fine. Am i missing something or does Server.UrlEncode simply not work properly?

Use Javascripts encodeURIComponent()/decodeURIComponent() for "round-trip" encoding with .Net's URLEncode/URLDecode.
EDIT
As far as I know, historically the "+" has been used in URL encoding as a special substitution for the space char ( ASCII 20 ). If an implementation does not take the space into consideration as a special character with the '+' substitution, then it still has to escape it using its ASCII code ( hence '%20' ).
There is a really good discussion of the situation at http://bytes.com/topic/php/answers/5624-urlencode-vs-rawurlencode. It's inconclusive, by the way. RFC 2396 lumps the space with other characters without an unreserved representation, which sides with the '%20' crowd.
RFC 1630 sides with the '+' crowd ( via forum discusion )...
Within the query string, the plus sign
is reserved as shorthand notation for
a space. Therefore, real plus signs
must beencoded. This method was used
to make query URIs easier to pass in
systems which did not allow spaces.
Also, the core RFCs are...
RFC 1630 - Universal Resource Identifiers in WWW
RFC 1738 - Uniform Resource Locators (URL)
RFC 2396 - Uniform Resource Identifiers (URI): Generic Syntax

As far as i understand it the javascript is corectly URL encoding the string whereas the .net method is not
Actually they're both wrong!
JavaScript escape() should never be used. As well as failing to encode the + character to %2B, it encodes all non-ASCII characters as a non-standard %uNNNN sequence.
Meanwhile Server.UrlEncode is not exactly URL-encoding as such, but encoding to application/x-www-form-urlencoded, which should only normally be used for query parameters. Using + to represent a space outside of a form name=value construct, such as in a path part, is wrong.
This is rather unfortunate. You might want to try doing a string replace of the + character with %20 after encoding with UrlEncode() when you are encoding into a path part rather than a parameter. In a parameter, + and %20 are equally good.

A + instead of a space is correct URL encoding, as would escaping it to %20. See this article (CGI Programming in Perl - URL Encoding).
The + is not something that JavaScript can parse, so javascript will escape the space or + to %20.

Using System.Uri.EscapeDataString() serverside and decodeURIComponent clientside works.

Two encodings used in RTF string won't display correct in RichTextBox?

I am trying to parse some RTF, that i get back from the server. For most text i get back this works fine (and using a RichTextBox control will do the job), however some of the RTF seems to contain an additional "encoding" and some of the characters get corrupted.
The original string is as follows (and contains some of the characters used in Polish):
ąćęłńóśźż
The RTF string with hex encoded characters that is send back looks like this
{\lang1045\langfe1045\f16383 {\'b9\'e6\'ea\'b3{\f7 \'a8\'bd\'a8\'ae}\'9c\'9f\'bf}}
I am having problems decoding the ńó characters in the returned string, they seem to be represented by two hex values each, whereas the rest of the string is represented (as expected) by single hex values.
Using a RichTextBox control to "parse" the RTF results in corrupter text (the two characters in question are displayed as four different unwanted characters).
If i would encode the plain string myself to hex using the expected codepage (1250, Latin 2, the ANSI codepage for lcid 1045) i would get the following:
\'B9\'E6\'EA\'B3\'F1\'F3\'9C\'9F\'BF
I am lost as to how i can correctly decode the {\f7 \'a8\'bd\'a8\'ae} part of the returned string that should correspond to ńó.
Note that there is no font definition for \f7 in the RTF header and the string looks fine when viewed directly on the server meaning that the characters (if they are corrupted) are corrupted somewhere in the conversion before sending.
I am not sure if the problem is on the server side (as i have no control over that), but since the server is used for a lot of translation work i assume that the returned string is ok.
I have been going through the RTF specs but can not find any hint regarding this type of combination of encodings.

I don't know why it's happening, but the encoding appears to be GBK (or something sufficiently similar).
Perhaps the server tries to do some "clever" matching to find the characters, or the server's default character encoding is GBK or so, and those characters (and only those) also occur in GBK so it prefers that.
I found out by adding the offending hex codes (A8 BD A8 AE) as bytes into a simple HTML file, so I could go through my browser's encodings and see if anything matched:
<html><body>¨½¨®</body></html>
To my surprise, my browser came up with "ńó" straight away.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.