Caveats Encoding a C# string to a Javascript string

Caveats Encoding a C# string to a Javascript string - c#

I'm trying to write a custom Javascript MVC3 Helper class foe my project, and one of the methods is supposed to escape C# strings to Javascript strings.
I know C# strings are UTF-16 encoded, and Javascript strings also seem to be UTF-16. No problem here.
I know some characters like backslash, single quotes or double quotes must be backslash-escaped on Javascript so:
\ becomes \\
' becomes \'
" becomes \"
Is there any other caveat I must be aware of before writing my conversion method ?
EDIT:
Great answers so far, I'm adding some references from the answers in the question to help others in the future.
Alex K. suggested using System.Web.HttpUtility.JavaScriptStringEncode, which I marked as the right answer for me, because I'm using .Net 4. But this function is not available to previous .Net versions, so I'm adding some other resources here:
CR becomes \r // Javascript string cannot be broke into more than 1 line
LF becomes \n // Javascript string cannot be broke into more than 1 line
TAB becomes \t
Control characters must be Hex-Escaped
JP Richardson gave an interesting link informing that Javascript uses UCS-2, which is a subset of UTF-16, but how to encode this correctly is an entirely new question.
LukeH on the comments below reminded the CR, LF and TAB chars, and that reminded me of the control chars (BEEP, NULL, ACK, etc...).

(.net 4) You can;
System.Web.HttpUtility.JavaScriptStringEncode(#"aa\bb ""cc"" dd\tee", true);
==
"aa\\bb \"cc\" dd\\tee"

It's my understanding that you do have to be careful, as JavaScript is not UTF-16, rather, it's UCS-2 which I believe is a subset of UTF-16. What this means for you, is that any character that is represented than a higher code point of 2 bytes (0xFFFF) could give you problems in JavaScript.
In summary, under the covers, the engine may use UTF-16, but it only exposes UCS-2 like methods.
Great article on the issue:
http://mathiasbynens.be/notes/javascript-encoding

Just use Microsoft.JScript.GlobalObject.escape
Found it here: http://forums.asp.net/p/1308104/4468088.aspx/1?Re+C+equivalent+of+JavaScript+escape+

Instead of using JavaScriptStringEncode() method, you can encode server side using:
HttpUtility.UrlEncode()
When you need to read the encoded string client side, you have to call unescape() javascript function before using the string.

Related

Is it possible to display (convert?) the unicode hex \u0092 to an unicode html entity in .NET?

I have some string that contains the following code/value:
"You won\u0092t find a ...."
It looks like that string contains the Right Apostrophe special character.
ref1: Unicode control 0092
ref2: ASCII chart (both 127 + extra extended ascii)
I'm not sure how to display this to the webbrowser. It keeps displaying the TOFU square-box character instead. I'm under the impression that the unicode (hex) value 00092 can be converted to unicode (html) 
Is my understanding correct?
Update 1:
It was suggested by #sam-axe that I HtmlEncode the unicode. That didn't work. Here it is...
Note the ampersand got correctly encoded....

It looks like there's an encoding mix-up. In .NET, strings are normally encoded as UTF-16, and a right apostrophe should be represented as \u2019. But in your example, the right apostrophe is represented as \x92, which suggests the original encoding was Windows code page 1252. If you include your string in a Unicode document, the character \x92 won't be interpreted properly.
You can fix the problem by re-encoding your string as UTF-16. To do so, treat the string as an array of bytes, and then convert the bytes back to Unicode using the 1252 code page:
string title = "You won\u0092t find a cheaper apartment * Sauna & Spa";
byte[] bytes = title.Select(c => (byte)c).ToArray();
title = Encoding.GetEncoding(1252).GetString(bytes);
// Result: "You won’t find a cheaper apartment * Sauna & Spa"

Note: much of my answer is based on guessing and looking at the decompiled code of System.Web 4.0. The reference source looks very similar (identical?).
You're correct that "" (6 characters) can be displayed in the browser. Your output string, however, contains "\u0092" (1 character). This is a control character, not an HTML entity.
According to the reference code, WebUtility.HtmlEncode() doesn't transform characters between 128 and 160 - all characters in this range are control characters (ampersand is special-cased in the code as are a few other special HTML symbols).
My guess is that because these are control characters, they're output without transformation because transforming it would change the meaning of the string. (I tried running some examples using LinqPad, this character was not rendered.)
If you really want to transform these characters (or remove them), you'll probably have to write your own function before/after calling HtmlEncode() - there may be something that does this already but I don't know of any.
Hope this helps.
Edit: Michael Liu's answer seems correct. I'm leaving my answer here because it may be useful in cases when the input encoding of a string is not known.

MySQL (and c# ?) convert character format

I created a sql table with a field in "latin_swedish_ci" and downloaded a lot of data in this field. But a lot of special foreign characters have automatically converted in something very difficult to use.
For example I have those east-europe characters :
ō
ą
ł
š
Problem is those characters are part of a url and .net does not automatically convert into "real characters". For example when I use with webclient the real url isn't reached.
Is there a way to directly convert those characters code into the real characters in the db.
Or at least a way to turn them for use in the webclient DownloadString function ?

Those codes looks like special html characters. Have a look here and check if you get the right character out of the code. If so at the bottom of the post there are some hint about how to use code for conversion.
You can also look at these SO answers to get some more examples:
Answer 1
Answer 2

Semicolon in url as a separator for query strings

I keep hearing that W3C recommends to use ";" instead of "&" as a query string separator.
We recommend that HTTP server implementors, and in particular, CGI
implementors support the use of ";" in place of "&" to save authors
the trouble of escaping "&" characters in this manner.
Can somebody please explain why ";" is recommended instead of "&"?
Also, i tried using ";" instead of "&". (example: .com?str1=val1;str2=val2 ) . When reading as Request.QueryString["str1"] i get "val1;str2=val2". So if ";" is recommended, how do we read the query strings?

As the linked document says, ; is recommended over & because
the use of the "&" character to separate form fields interacts with its use in SGML attribute values to delimit character entity references.
For example, say you want your URL to be ...?q1=v1&q2=v2
There's nothing wrong with & there. But if you want to put that query into an HTML attribute, <a href="...?q1=v1&q2=v2">, it breaks because, inside an HTML attribute, & represents the start of a character entity. You have to escape the & as &, giving <a href="...?q1=v1&q2=v2">, and it'd be easier if you didn't have to.
; isn't overloaded like this at all; you can put one in an HTML attribute and not worry about it. Thus it'd be much simpler if servers recognised ; as a query parameter separator.
However, by the look of things (based on your experiment), ASP.Net doesn't recognise it as such. How to get it to? I'm not sure you can.

In short, HTML is a big mess (due to its leniency), and using semicolons help to simplify this a LOT.
In order to use semicolons as the separator, i don't know if .NET allows this customization or whether we developers need to write our own methods to process the QueryString. .NET does give us access to the raw QueryString, and we can run with it from there. This is what i did. I wrote my own methods, which wasn't too hard, but it took a lot of testing time and debugging, some of which was Microsoft's fault for not even conforming to web standards when dealing with surrogate pairs. I made sure my implementation works with the full range of Unicode characters including the Multilingual plane (thus for Chinese and Japanese characters, etc.).
Before adding my own findings, I want also confirm and include the great info that Rawling, Jeevan, and BeniBela have pointed out in Rowling's answer and their comments to such answer: it is incorrect in HTML to not escape them, but it usually works, but only because parsers are so tolerant. With that, i also explain why this can lead to bugs with such improper encoding (which probably most developers fall victim to).
One cannot depend on this leniency of improperly encoding ampersands in QueryStrings, and sometimes this leniency leads to nasty bugs. Let's say for instance a QueryString passes a random ASCII string (or user input) and they are not properly encoded. Then 'amp;' which follows '&' gets decoded and the unexpected consequence is that 'amp;' is essentially 'swallowed'. (By swallowed, i mean it gets 'eaten' or it goes missing.) A practical usage scenario is when the user is asked for input that goes into a database and the user inputs HTML (like here at StackOverflow) but because it is not posted correctly then nasty bugs develop.
The real advantage of the ';' separator is in simplicity: proper encoding of ampersand separated QueryStrings takes two steps of complication for URL strings in an HTML page (and in XML too). First keys and values shud be URL encoded and then all concatenated, and then the whole QueryString or URL shud be HTML encoded (or for XML, encoded with a very similar encoding to HTML encoding). Also don't forget that the encoding process for HTML encoding and URL encoding are different, and it's important that they are different. A developer needs to be careful between the two. And since they are similar, it's not uncommon to see them mixed up by novice programmers.
A good example of a potential problematic URL is when passing two name/values in a QueryString:
a = 'me & you', and
b = 'you & me'.
Here, using '&' as a separator, then '?a=me+%26+you&b=you+%26+me' is a proper querystring BUT it shud also be HTML encoded before being written to HTML source code. This is important to be bug free. Most developers aren't careful to do this two step process of first URL Encoding the keys and values and then HTML encoding the full URL in the HTML source. It's no wonder why, when i had to sit down and seriously think this process thru and test out my conclusions thoroughly. Imaging when the name value is 'year=año' or far more complex when we need Chinese or Japanese characters that use surrogate pairs to represent them!
For the same above key value pairs for a and b, when using ';' as the separator, the process is MUCH simpler. As a matter of fact, the ampersand separator makes the process more than twice as complex as using the semicolon separator! Here's the same info represented using the ';' as a separator: '?a=me+%26+you;b=you+%26+me'. We notice that the only difference tho is that there's no '&' in the string. But using this ';' separator means that no second process of HTML encoding the URL or QueryString is needed. Now imagine if i were writing HTML and wanted correct HTML and needed to write the HTML to explain all this! All this HTML encoding with '&' really adds a lot of complication (and for many developers, quite a lot of confusion too).
Novice developers wud simply not HTML encode the QueryString or URL, which is CORRECT when ; is the separator. But it leaves room for bugs when ampersand is improperly encoded. So '?someText=blah&blah' wud need proper encoding.
Also in .NET, we can write XML documentation for our methods. Well, just today, i wrote a little explanation that used the above 'a=me+%26+you&b=you+%26+me' example. And in my XML, i had to manually type all those & character entities for the XML. In XML documentation, it's picky so one must correctly encode ampersands. But the leniency in HTML adds to ambiguity.
Perhaps this wasn't too confusing. But all the confusion or difficulty is due to using a character which shud be HTML encoded as the separator, thus '&' is the culprit. And semicolon relieves all that complication.
One last consideration: with how much more complicated the '&' separator makes this process, it's no wonder to me why the Microsoft implementation of surrogate pairs in QueryStrings still does not follow the official specifications. And if you write your own methods, you MUST account for Microsoft's incorrect use of percent-encoding surrogate pairs. The official specs forbid percent-encoding of surrogate pairs in UTF-8. So anyone who writes their own methods which also handle the full range of Unicode characters, beware of this.

Issue with HttpUtility.HtmlEncode with other language characters

I was trying to encode the HTML special characters like ', ", <,> etc with HttpUtility.HtmlEncode. But I noticed this is also encoding french characters like (é) to é and now é is getting displayed as it is on my HTML page. I don't want this I just want to encode ', ", <,> and few other characters.

Should those characters look differently? Why is it a problem if they are replaced? This is by design. You can take a look at this question to see longer discussion. Unless your users can't properly see text you are displaying, you shouldn't mess with this, for security/compatibility reasons.
HtmlUtility seems to encode several classes of characters, among which ISO-8859-1 character set
If you still don't want a specific character to be encoded, you are forced to use string.Replace() for this purpose.

The various .NET text encoding functions are notorious for being badly documented and doing weird conversions. In some cases I've had better luck with the encoding functions in Microsoft Anti-XSS library, but not sure if this would work in your particular application.

Server.UrlEncode(string s)... doesn't

Server.UrlEncode("My File.doc") returns "My+File.doc", whereas the javascript escape("My File.doc") returns "My%20File.doc". As far as i understand it the javascript is corectly URL encoding the string whereas the .net method is not. It certainly seems to work that way in practice putting http://somesite/My+File.doc will not fetch "My File.doc" in any case i could test using firefox/i.e. and IIS, whereas http://somesite/My%20File.doc works fine. Am i missing something or does Server.UrlEncode simply not work properly?

Use Javascripts encodeURIComponent()/decodeURIComponent() for "round-trip" encoding with .Net's URLEncode/URLDecode.
EDIT
As far as I know, historically the "+" has been used in URL encoding as a special substitution for the space char ( ASCII 20 ). If an implementation does not take the space into consideration as a special character with the '+' substitution, then it still has to escape it using its ASCII code ( hence '%20' ).
There is a really good discussion of the situation at http://bytes.com/topic/php/answers/5624-urlencode-vs-rawurlencode. It's inconclusive, by the way. RFC 2396 lumps the space with other characters without an unreserved representation, which sides with the '%20' crowd.
RFC 1630 sides with the '+' crowd ( via forum discusion )...
Within the query string, the plus sign
is reserved as shorthand notation for
a space. Therefore, real plus signs
must beencoded. This method was used
to make query URIs easier to pass in
systems which did not allow spaces.
Also, the core RFCs are...
RFC 1630 - Universal Resource Identifiers in WWW
RFC 1738 - Uniform Resource Locators (URL)
RFC 2396 - Uniform Resource Identifiers (URI): Generic Syntax

As far as i understand it the javascript is corectly URL encoding the string whereas the .net method is not
Actually they're both wrong!
JavaScript escape() should never be used. As well as failing to encode the + character to %2B, it encodes all non-ASCII characters as a non-standard %uNNNN sequence.
Meanwhile Server.UrlEncode is not exactly URL-encoding as such, but encoding to application/x-www-form-urlencoded, which should only normally be used for query parameters. Using + to represent a space outside of a form name=value construct, such as in a path part, is wrong.
This is rather unfortunate. You might want to try doing a string replace of the + character with %20 after encoding with UrlEncode() when you are encoding into a path part rather than a parameter. In a parameter, + and %20 are equally good.

A + instead of a space is correct URL encoding, as would escaping it to %20. See this article (CGI Programming in Perl - URL Encoding).
The + is not something that JavaScript can parse, so javascript will escape the space or + to %20.

Using System.Uri.EscapeDataString() serverside and decodeURIComponent clientside works.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.