Semicolon in url as a separator for query strings

Semicolon in url as a separator for query strings - c#

I keep hearing that W3C recommends to use ";" instead of "&" as a query string separator.
We recommend that HTTP server implementors, and in particular, CGI
implementors support the use of ";" in place of "&" to save authors
the trouble of escaping "&" characters in this manner.
Can somebody please explain why ";" is recommended instead of "&"?
Also, i tried using ";" instead of "&". (example: .com?str1=val1;str2=val2 ) . When reading as Request.QueryString["str1"] i get "val1;str2=val2". So if ";" is recommended, how do we read the query strings?

As the linked document says, ; is recommended over & because
the use of the "&" character to separate form fields interacts with its use in SGML attribute values to delimit character entity references.
For example, say you want your URL to be ...?q1=v1&q2=v2
There's nothing wrong with & there. But if you want to put that query into an HTML attribute, <a href="...?q1=v1&q2=v2">, it breaks because, inside an HTML attribute, & represents the start of a character entity. You have to escape the & as &, giving <a href="...?q1=v1&q2=v2">, and it'd be easier if you didn't have to.
; isn't overloaded like this at all; you can put one in an HTML attribute and not worry about it. Thus it'd be much simpler if servers recognised ; as a query parameter separator.
However, by the look of things (based on your experiment), ASP.Net doesn't recognise it as such. How to get it to? I'm not sure you can.

In short, HTML is a big mess (due to its leniency), and using semicolons help to simplify this a LOT.
In order to use semicolons as the separator, i don't know if .NET allows this customization or whether we developers need to write our own methods to process the QueryString. .NET does give us access to the raw QueryString, and we can run with it from there. This is what i did. I wrote my own methods, which wasn't too hard, but it took a lot of testing time and debugging, some of which was Microsoft's fault for not even conforming to web standards when dealing with surrogate pairs. I made sure my implementation works with the full range of Unicode characters including the Multilingual plane (thus for Chinese and Japanese characters, etc.).
Before adding my own findings, I want also confirm and include the great info that Rawling, Jeevan, and BeniBela have pointed out in Rowling's answer and their comments to such answer: it is incorrect in HTML to not escape them, but it usually works, but only because parsers are so tolerant. With that, i also explain why this can lead to bugs with such improper encoding (which probably most developers fall victim to).
One cannot depend on this leniency of improperly encoding ampersands in QueryStrings, and sometimes this leniency leads to nasty bugs. Let's say for instance a QueryString passes a random ASCII string (or user input) and they are not properly encoded. Then 'amp;' which follows '&' gets decoded and the unexpected consequence is that 'amp;' is essentially 'swallowed'. (By swallowed, i mean it gets 'eaten' or it goes missing.) A practical usage scenario is when the user is asked for input that goes into a database and the user inputs HTML (like here at StackOverflow) but because it is not posted correctly then nasty bugs develop.
The real advantage of the ';' separator is in simplicity: proper encoding of ampersand separated QueryStrings takes two steps of complication for URL strings in an HTML page (and in XML too). First keys and values shud be URL encoded and then all concatenated, and then the whole QueryString or URL shud be HTML encoded (or for XML, encoded with a very similar encoding to HTML encoding). Also don't forget that the encoding process for HTML encoding and URL encoding are different, and it's important that they are different. A developer needs to be careful between the two. And since they are similar, it's not uncommon to see them mixed up by novice programmers.
A good example of a potential problematic URL is when passing two name/values in a QueryString:
a = 'me & you', and
b = 'you & me'.
Here, using '&' as a separator, then '?a=me+%26+you&b=you+%26+me' is a proper querystring BUT it shud also be HTML encoded before being written to HTML source code. This is important to be bug free. Most developers aren't careful to do this two step process of first URL Encoding the keys and values and then HTML encoding the full URL in the HTML source. It's no wonder why, when i had to sit down and seriously think this process thru and test out my conclusions thoroughly. Imaging when the name value is 'year=año' or far more complex when we need Chinese or Japanese characters that use surrogate pairs to represent them!
For the same above key value pairs for a and b, when using ';' as the separator, the process is MUCH simpler. As a matter of fact, the ampersand separator makes the process more than twice as complex as using the semicolon separator! Here's the same info represented using the ';' as a separator: '?a=me+%26+you;b=you+%26+me'. We notice that the only difference tho is that there's no '&' in the string. But using this ';' separator means that no second process of HTML encoding the URL or QueryString is needed. Now imagine if i were writing HTML and wanted correct HTML and needed to write the HTML to explain all this! All this HTML encoding with '&' really adds a lot of complication (and for many developers, quite a lot of confusion too).
Novice developers wud simply not HTML encode the QueryString or URL, which is CORRECT when ; is the separator. But it leaves room for bugs when ampersand is improperly encoded. So '?someText=blah&blah' wud need proper encoding.
Also in .NET, we can write XML documentation for our methods. Well, just today, i wrote a little explanation that used the above 'a=me+%26+you&b=you+%26+me' example. And in my XML, i had to manually type all those & character entities for the XML. In XML documentation, it's picky so one must correctly encode ampersands. But the leniency in HTML adds to ambiguity.
Perhaps this wasn't too confusing. But all the confusion or difficulty is due to using a character which shud be HTML encoded as the separator, thus '&' is the culprit. And semicolon relieves all that complication.
One last consideration: with how much more complicated the '&' separator makes this process, it's no wonder to me why the Microsoft implementation of surrogate pairs in QueryStrings still does not follow the official specifications. And if you write your own methods, you MUST account for Microsoft's incorrect use of percent-encoding surrogate pairs. The official specs forbid percent-encoding of surrogate pairs in UTF-8. So anyone who writes their own methods which also handle the full range of Unicode characters, beware of this.

Related

Caveats Encoding a C# string to a Javascript string

I'm trying to write a custom Javascript MVC3 Helper class foe my project, and one of the methods is supposed to escape C# strings to Javascript strings.
I know C# strings are UTF-16 encoded, and Javascript strings also seem to be UTF-16. No problem here.
I know some characters like backslash, single quotes or double quotes must be backslash-escaped on Javascript so:
\ becomes \\
' becomes \'
" becomes \"
Is there any other caveat I must be aware of before writing my conversion method ?
EDIT:
Great answers so far, I'm adding some references from the answers in the question to help others in the future.
Alex K. suggested using System.Web.HttpUtility.JavaScriptStringEncode, which I marked as the right answer for me, because I'm using .Net 4. But this function is not available to previous .Net versions, so I'm adding some other resources here:
CR becomes \r // Javascript string cannot be broke into more than 1 line
LF becomes \n // Javascript string cannot be broke into more than 1 line
TAB becomes \t
Control characters must be Hex-Escaped
JP Richardson gave an interesting link informing that Javascript uses UCS-2, which is a subset of UTF-16, but how to encode this correctly is an entirely new question.
LukeH on the comments below reminded the CR, LF and TAB chars, and that reminded me of the control chars (BEEP, NULL, ACK, etc...).

(.net 4) You can;
System.Web.HttpUtility.JavaScriptStringEncode(#"aa\bb ""cc"" dd\tee", true);
==
"aa\\bb \"cc\" dd\\tee"

It's my understanding that you do have to be careful, as JavaScript is not UTF-16, rather, it's UCS-2 which I believe is a subset of UTF-16. What this means for you, is that any character that is represented than a higher code point of 2 bytes (0xFFFF) could give you problems in JavaScript.
In summary, under the covers, the engine may use UTF-16, but it only exposes UCS-2 like methods.
Great article on the issue:
http://mathiasbynens.be/notes/javascript-encoding

Just use Microsoft.JScript.GlobalObject.escape
Found it here: http://forums.asp.net/p/1308104/4468088.aspx/1?Re+C+equivalent+of+JavaScript+escape+

Instead of using JavaScriptStringEncode() method, you can encode server side using:
HttpUtility.UrlEncode()
When you need to read the encoded string client side, you have to call unescape() javascript function before using the string.

Which HttpUtility decode method to use?

this may be a silly question, but it trips me up every time.
HttpUtility has the methods HtmlDecode and UrlDecode. Do these two methods decode anything (Html/Http related) I might find? When do I have to use them, and which one am I supposed to use?
Just now I hit an error. This is my error log:
Payment receiver was not payment#mysite.com. (it was payment%40mysite.com).
But, I wrapped the email address here in HttpUtility.HtmlDecode before using it. It turns out I have to use .UrlDecode instead, but this email address didn't come from a URL so this wasn't obvious to me.
Can someone clarify this?

See What is meant by htmlencode and urlencode?
It's the reverse of your case, but essentially you need to use UrlEncode/Decode anytime you are using an address of sorts (urls and yes, email addresses). HtmlEncode/Decode is for code that typically a browser would render (html/xml tags).

This same encoding is also used in Form POST requests as well.
My guess is something read it 'naked' without decoding it.
Html Encoding/Decoding is only used to escape strings that contain characters that would otherwise be interpreted as html control characters. The process turns the characters into html entities and back again.
Url Encoding is to get around the fact that many characters are not allowed in Uris; or because they too could be misinterpreted. Thus the percent encoding is used.
Percent encoding is also used in the body of http requests.
In both cases, of course, it's also a way of expressing a specific character code in a request/response independent of character sets; but equally, interpreting what is meant by a particular code can also be dependent on knowing a particular character set. Generally you don't worry about that - but it can be important (especially in the HTML case).

URLEncode converts characters that aren't allowed in a URL into character equivalents which are parsable as a URL. In your example # became %40. URLDecode reverses this.
HTMLEncode is similar to URLEncode, but the target environment is text NESTED inside of HTML. This helps the browser from interpereting your content as HTML, but when rendered it should look like the decoded version. HTMLDecode reverses this.

When you see %xx this means percent encoding has occured - this is a URL encoding scheme, so you need to use UrlEncode / UrlDecode.
The HtmlEncode and HtmlDecode methods are for encoding and decoding elements for HTML display - so things like & get encoded to & and > to >.

Issue with HttpUtility.HtmlEncode with other language characters

I was trying to encode the HTML special characters like ', ", <,> etc with HttpUtility.HtmlEncode. But I noticed this is also encoding french characters like (é) to é and now é is getting displayed as it is on my HTML page. I don't want this I just want to encode ', ", <,> and few other characters.

Should those characters look differently? Why is it a problem if they are replaced? This is by design. You can take a look at this question to see longer discussion. Unless your users can't properly see text you are displaying, you shouldn't mess with this, for security/compatibility reasons.
HtmlUtility seems to encode several classes of characters, among which ISO-8859-1 character set
If you still don't want a specific character to be encoded, you are forced to use string.Replace() for this purpose.

The various .NET text encoding functions are notorious for being badly documented and doing weird conversions. In some cases I've had better luck with the encoding functions in Microsoft Anti-XSS library, but not sure if this would work in your particular application.

Server.UrlEncode(string s)... doesn't

Server.UrlEncode("My File.doc") returns "My+File.doc", whereas the javascript escape("My File.doc") returns "My%20File.doc". As far as i understand it the javascript is corectly URL encoding the string whereas the .net method is not. It certainly seems to work that way in practice putting http://somesite/My+File.doc will not fetch "My File.doc" in any case i could test using firefox/i.e. and IIS, whereas http://somesite/My%20File.doc works fine. Am i missing something or does Server.UrlEncode simply not work properly?

Use Javascripts encodeURIComponent()/decodeURIComponent() for "round-trip" encoding with .Net's URLEncode/URLDecode.
EDIT
As far as I know, historically the "+" has been used in URL encoding as a special substitution for the space char ( ASCII 20 ). If an implementation does not take the space into consideration as a special character with the '+' substitution, then it still has to escape it using its ASCII code ( hence '%20' ).
There is a really good discussion of the situation at http://bytes.com/topic/php/answers/5624-urlencode-vs-rawurlencode. It's inconclusive, by the way. RFC 2396 lumps the space with other characters without an unreserved representation, which sides with the '%20' crowd.
RFC 1630 sides with the '+' crowd ( via forum discusion )...
Within the query string, the plus sign
is reserved as shorthand notation for
a space. Therefore, real plus signs
must beencoded. This method was used
to make query URIs easier to pass in
systems which did not allow spaces.
Also, the core RFCs are...
RFC 1630 - Universal Resource Identifiers in WWW
RFC 1738 - Uniform Resource Locators (URL)
RFC 2396 - Uniform Resource Identifiers (URI): Generic Syntax

As far as i understand it the javascript is corectly URL encoding the string whereas the .net method is not
Actually they're both wrong!
JavaScript escape() should never be used. As well as failing to encode the + character to %2B, it encodes all non-ASCII characters as a non-standard %uNNNN sequence.
Meanwhile Server.UrlEncode is not exactly URL-encoding as such, but encoding to application/x-www-form-urlencoded, which should only normally be used for query parameters. Using + to represent a space outside of a form name=value construct, such as in a path part, is wrong.
This is rather unfortunate. You might want to try doing a string replace of the + character with %20 after encoding with UrlEncode() when you are encoding into a path part rather than a parameter. In a parameter, + and %20 are equally good.

A + instead of a space is correct URL encoding, as would escaping it to %20. See this article (CGI Programming in Perl - URL Encoding).
The + is not something that JavaScript can parse, so javascript will escape the space or + to %20.

Using System.Uri.EscapeDataString() serverside and decodeURIComponent clientside works.

Can someone give me a Regular Expression to identify an encoded URL?

I am currently looking to detect whether an URL is encoded or not. Here are some specific examples:
http://www.linxology.com/browse.php?u=Oi8vZXNwbnN0YXIuY29tL21lZGlhLXBsYXllci8%3D&b=13
http://www.linxology.com/browse.php?u=Oi8vZXNwbnN0YXIuY29tL290aGVyX2ZpbGVzL2VzcG5zdGFyL25hdl9iZy1vZmYucG5n&b=13
Can you please give me a Regular Expression for this?
Is there a self learning regular expression generator out there which can filter a perfect Regex as the number of inputs are increased?

If you are interested in the base64-encoded URLs, you can do it.
A little theory. If L, R are regular languages and T is a regular transducer, then LR (concatenation), L & R (intersection), L | R (union), TR(L) (image), TR^-1(L) (kernel) are all regular languages. Every regular language has a regular expression that generates it, and every regexp generates a regular language. URLs can be described by regular language (except if you need a subset of those that is not), almost every escaping scheme (and base64) is a regular transducer. Therefore, in theory, it's possible.
In practice, it gets rather messy.
A regex for valid base64 strings is ([A-Za-z0-9+/]{4})*(|[A-Za-z0-9+/]{2}(==|[A-Za-z0-9+/]=)
If it is embedded in a query parameter of an url, it will probably be urlencoded. Let's assume only the = will be urlencoded (because other characters can too, but don't need to).
This gets us to something like [?&][^?&#=;]+=([A-Za-z0-9+/]{4})*(|[A-Za-z0-9+/]{2}(%3D%3D|[A-Za-z0-9+/]%3D)
Another possibility is to consider only those base64 encoded URLs that have some property - in your case, thy all begin with "://", which is fortunate, because that translates exactly to 4 characters "Oi8v". Otherwise, it would be more complex.
This gets [?&][^?&#=;]+=Oi8v([A-Za-z0-9+/]{4})*(|[A-Za-z0-9+/]{2}(%3D%3D|[A-Za-z0-9+/]%3D)
As you can see, it gets messier and messier. Therefore, I'd recommend you rather to
break the URL in its parts (eg. protocol, host, query string)
get the parameters from the query string, and urldecode them
try base64 decode on the values of the parameters
apply your criterion for "good encoded URLs"

Well, depending on what is in that encoded text, you might not even need a regular expression. If there are multiple querystring parameters in that one "u" key, perhaps you could just check the length of the text on each querystring value, and if it is over (say) 50, you can assume it's probably encoded. I doubt any unencoded single parameters would be as long as these, since those would have to be string data, and therefore they would probably need to be encoded!

This question may be harder than you realize. For example:
I could say that if a query string includes a question mark character then what follows it is encoded.
Now, it may be simple encoding like "?year=2009" or complicated like in your examples.
Or
The site URLs could use URL rewriting (like this site does). Look at the URL of this question. The "615958" is encoded and... no question marks were used!
In fact, you could say that the entire URL is encoded!
Perhaps you need to better define what you mean by "encoded".

You can't reliably parse URL using regex. (Is this an SO mantra yet?)
Here are some specific examples:
It's not clear what ‘encoded’ means — can you give some counter-examples of URLs you consider “not encoded”?
Are you talking about the Base64 encoding in the ‘u’ parameter? Whilst it is possible to say whether a string is a valid Base64 string, it's not possible to detect Base64 and distinguish it from anything else; for example the word “sausages” also happens to be valid Base64 (it decodes to '\xb1\xab\xacj\x07\xac').

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.