Server.UrlEncode("My File.doc") returns "My+File.doc", whereas the javascript escape("My File.doc") returns "My%20File.doc". As far as i understand it the javascript is corectly URL encoding the string whereas the .net method is not. It certainly seems to work that way in practice putting http://somesite/My+File.doc will not fetch "My File.doc" in any case i could test using firefox/i.e. and IIS, whereas http://somesite/My%20File.doc works fine. Am i missing something or does Server.UrlEncode simply not work properly?
Use JavaScript's encodeURIComponent()/decodeURIComponent() for "round-trip" encoding with .NET's UrlEncode/UrlDecode.
EDIT
As far as I know, historically the "+" has been used in URL encoding as a special substitution for the space character (ASCII 0x20). If an implementation does not treat the space as a special character with the '+' substitution, then it still has to escape it using its ASCII code (hence '%20').
There is a really good discussion of the situation at http://bytes.com/topic/php/answers/5624-urlencode-vs-rawurlencode. It's inconclusive, by the way. RFC 2396 lumps the space with other characters without an unreserved representation, which sides with the '%20' crowd.
RFC 1630 sides with the '+' crowd (via forum discussion)...
Within the query string, the plus sign is reserved as shorthand notation for a space. Therefore, real plus signs must be encoded. This method was used to make query URIs easier to pass in systems which did not allow spaces.
Also, the core RFCs are...
RFC 1630 - Universal Resource Identifiers in WWW
RFC 1738 - Uniform Resource Locators (URL)
RFC 2396 - Uniform Resource Identifiers (URI): Generic Syntax
As far as I understand it, the JavaScript is correctly URL encoding the string whereas the .NET method is not
Actually they're both wrong!
JavaScript escape() should never be used. As well as failing to encode the + character to %2B, it encodes all non-ASCII characters as a non-standard %uNNNN sequence.
Meanwhile Server.UrlEncode is not exactly URL-encoding as such, but encoding to application/x-www-form-urlencoded, which should only normally be used for query parameters. Using + to represent a space outside of a form name=value construct, such as in a path part, is wrong.
This is rather unfortunate. You might want to try doing a string replace of the + character with %20 after encoding with UrlEncode() when you are encoding into a path part rather than a parameter. In a parameter, + and %20 are equally good.
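A minimal sketch of that replace-after-encoding workaround, using HttpUtility.UrlEncode (which behaves like Server.UrlEncode here) and a made-up helper name:

using System;
using System.Web;

class PathEncodingSketch
{
    // Encode a path segment: form-style encoding first, then swap '+' for %20.
    static string EncodePathSegment(string segment)
    {
        // "My File.doc" -> "My+File.doc" -> "My%20File.doc"
        return HttpUtility.UrlEncode(segment).Replace("+", "%20");
    }

    static void Main()
    {
        Console.WriteLine(EncodePathSegment("My File.doc")); // expected: My%20File.doc
    }
}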
A + instead of a space is correct URL encoding, as is escaping it to %20. See this article (CGI Programming in Perl - URL Encoding).
The + is not something that JavaScript can parse, so JavaScript will escape the space or + to %20.
Using System.Uri.EscapeDataString() serverside and decodeURIComponent clientside works.
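As a rough illustration of the difference between the two server-side encoders (the output comments are what I'd expect on recent .NET; verify on your version):

using System;
using System.Web;

class EncodingComparison
{
    static void Main()
    {
        // Form-style encoding: space becomes '+'
        Console.WriteLine(HttpUtility.UrlEncode("My File.doc"));   // My+File.doc
        // RFC 3986-style percent-encoding: space becomes %20
        Console.WriteLine(Uri.EscapeDataString("My File.doc"));    // My%20File.doc
        // The client side can then round-trip the second form with
        // decodeURIComponent("My%20File.doc")
    }
}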
Related
Is it safe to use an @ symbol as part of a URL? For example, a possible URL would be http://example.com/@dave.
The idea is that, nowadays, users are commonly called "@user", so why not make the user page "@username"?
Percent-encoded …
You can use the @ character in HTTP URI paths if you percent-encode it as %40.
Many browsers would still display it as @, but e.g. when you copy-and-paste the URI into a text document, it will be %40.
… but also directly
Instead of percent-encoding it, you may use @ directly in the HTTP URI path.
See the syntax for the path of a URI. Various unrelated clauses aside, the path may consist of characters in the segment, segment-nz, or segment-nz-nc set. segment and segment-nz consist of characters from the pchar set, which is defined as:
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
As you can see, the @ is listed explicitly.
The segment-nz-nc set also lists the @ character explicitly:
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
So, an HTTP URI like this is totally valid:
http://example.com/@dave
Example
Here is an example Wikipedia page:
copy-and-paste: http://en.wikipedia.org/wiki/%22@%22_%28album%29
As you can see, the ", (, and ) characters are percent-encoded, but the @ and the _ are used directly.
Can you use the @-symbol in a URL? - Yes, you can!
Note that the @-character, hexadecimal value 40, decimal value 64, is a reserved character for URIs. Its usage is for things like email addresses in mailto: URIs, for example mailto:username@somewhere.foo, and for passing username and password information in a URI (which is a bad idea, but possible): http://username:password@somewhere.foo
If you want a URL that has an @-symbol in a path, you need to encode it with so-called "URL encoding". For example, like this: http://somewhere.foo/profile/username%40somewhere.foo
All modern browsers will display this as http://somewhere.foo/profile/username@somewhere.foo, and will convert any typed-in @-sign to %40, so it's easy to use.
Many web frameworks will also help you, either automatically or with helper functions, to convert to and from URL-encoded URLs.
So, in summary: yes, you can use the @-symbol in a URL, but you have to make sure it's encoded, as you can't use the raw @-character.
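As a rough sketch of that encoding step in C# (the profile URL layout is just an illustration, and Uri.EscapeDataString is only one of several ways to percent-encode):

using System;

class AtSignEncoding
{
    static void Main()
    {
        string user = "username@somewhere.foo";
        // Uri.EscapeDataString percent-encodes reserved characters, so '@' becomes %40
        string path = "/profile/" + Uri.EscapeDataString(user);
        Console.WriteLine("http://somewhere.foo" + path);
        // expected: http://somewhere.foo/profile/username%40somewhere.foo
        // ('.' is unreserved and stays as-is)
    }
}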
In the RFC the following characters:
* ' ( ) ; : @ & = + $ , / ? % # [ ]
are reserved and:
The purpose of reserved characters is to provide a set of delimiting
characters that are distinguishable from other data within a URI.
So it is not recommended to use these characters without encoding.
Basically, no.
@ is a reserved character and should only be used for its intended purpose.
See: http://perishablepress.com/stop-using-unsafe-characters-in-urls/ and http://www.ietf.org/rfc/rfc3986.txt
It can be used encoded, but I don't think that is what you were asking.
Apparently modern browsers will handle this. However you asked if this was safe and according to the spec of the RFC you should not be using it (unencoded) unless it is for its intended purpose.
I found this question when I tried to search site:typescriptlang.org @ts-ignore in Chrome, and then got the result "This page isn't working, ts-ignore is currently unable to handle this request", and I saw the URL had become "http://site:typescriptlang.org%20@ts-ignore/". I felt so confused, then searched for the @ symbol's function in a URL, and I found my answer on Wikipedia.
The full format of a URL is scheme://userInfo@host:port/path?query#fragment. So when we search site:typescriptlang.org @ts-ignore, the browser will think you want to visit "http://site:typescriptlang.org%20@ts-ignore/". In this URL, http is the scheme, site:typescriptlang.org%20 is the userInfo ("%20" is the escaped space character), and "ts-ignore" is the host. Of course, we can't visit a host named "ts-ignore" without a domain.
So, the @ symbol can be a separator between the userInfo and the host.
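You can see that separator role with System.Uri; a quick sketch (the credentials here are purely illustrative):

using System;

class UserInfoDemo
{
    static void Main()
    {
        var uri = new Uri("http://user:secret@example.com/path?q=1#frag");
        Console.WriteLine(uri.Scheme);    // http
        Console.WriteLine(uri.UserInfo);  // user:secret  (everything before the '@')
        Console.WriteLine(uri.Host);      // example.com  (everything after the '@')
        Console.WriteLine(uri.Fragment);  // #frag
    }
}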
I keep hearing that the W3C recommends using ";" instead of "&" as a query string separator.
We recommend that HTTP server implementors, and in particular, CGI implementors support the use of ";" in place of "&" to save authors the trouble of escaping "&" characters in this manner.
Can somebody please explain why ";" is recommended instead of "&"?
Also, i tried using ";" instead of "&". (example: .com?str1=val1;str2=val2 ) . When reading as Request.QueryString["str1"] i get "val1;str2=val2". So if ";" is recommended, how do we read the query strings?
As the linked document says, ; is recommended over & because
the use of the "&" character to separate form fields interacts with its use in SGML attribute values to delimit character entity references.
For example, say you want your URL to be ...?q1=v1&q2=v2
There's nothing wrong with & there. But if you want to put that query into an HTML attribute, <a href="...?q1=v1&q2=v2">, it breaks because, inside an HTML attribute, & represents the start of a character entity. You have to escape the & as &amp;, giving <a href="...?q1=v1&amp;q2=v2">, and it'd be easier if you didn't have to.
; isn't overloaded like this at all; you can put one in an HTML attribute and not worry about it. Thus it'd be much simpler if servers recognised ; as a query parameter separator.
However, by the look of things (based on your experiment), ASP.Net doesn't recognise it as such. How to get it to? I'm not sure you can.
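If you do want to accept ';' in ASP.NET, one possible workaround (a sketch, not a framework feature) is to bypass Request.QueryString and split the raw query string yourself:

using System;
using System.Collections.Generic;
using System.Web;

static class SemicolonQuery
{
    // Parses "?a=1;b=2" or "?a=1&b=2" into a dictionary, treating ';' and '&' alike.
    public static Dictionary<string, string> Parse(string rawQuery)
    {
        var result = new Dictionary<string, string>();
        string query = rawQuery.TrimStart('?');
        foreach (string pair in query.Split(new[] { '&', ';' }, StringSplitOptions.RemoveEmptyEntries))
        {
            string[] parts = pair.Split(new[] { '=' }, 2);
            string key = HttpUtility.UrlDecode(parts[0]);
            string value = parts.Length > 1 ? HttpUtility.UrlDecode(parts[1]) : "";
            result[key] = value;
        }
        return result;
    }
}

Usage, e.g. inside a handler: var args = SemicolonQuery.Parse(Request.Url.Query); then args["str1"] == "val1" for ".com?str1=val1;str2=val2".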
In short, HTML is a big mess (due to its leniency), and using semicolons helps to simplify this a LOT.
In order to use semicolons as the separator, I don't know if .NET allows this customization or whether we developers need to write our own methods to process the QueryString. .NET does give us access to the raw QueryString, and we can run with it from there. This is what I did. I wrote my own methods, which wasn't too hard, but it took a lot of testing time and debugging, some of which was Microsoft's fault for not even conforming to web standards when dealing with surrogate pairs. I made sure my implementation works with the full range of Unicode characters, including the supplementary Multilingual Plane (thus for Chinese and Japanese characters, etc.).
Before adding my own findings, I also want to confirm and include the great info that Rawling, Jeevan, and BeniBela have pointed out in Rawling's answer and their comments on that answer: it is incorrect in HTML not to escape the ampersands, but it usually works, only because parsers are so tolerant. With that, I also explain why this can lead to bugs with such improper encoding (which probably most developers fall victim to).
One cannot depend on this leniency of improperly encoding ampersands in QueryStrings, and sometimes this leniency leads to nasty bugs. Let's say for instance a QueryString passes a random ASCII string (or user input) and it is not properly encoded. Then the 'amp;' which follows an '&' gets decoded, and the unexpected consequence is that the 'amp;' is essentially 'swallowed'. (By swallowed, I mean it gets 'eaten' or goes missing.) A practical usage scenario is when the user is asked for input that goes into a database and the user inputs HTML (like here at StackOverflow), but because it is not posted correctly, nasty bugs develop.
The real advantage of the ';' separator is in simplicity: proper encoding of ampersand-separated QueryStrings takes two steps of complication for URL strings in an HTML page (and in XML too). First the keys and values should be URL encoded and then all concatenated, and then the whole QueryString or URL should be HTML encoded (or, for XML, encoded with a very similar encoding to HTML encoding). Also don't forget that the encoding processes for HTML encoding and URL encoding are different, and it's important that they are different. A developer needs to be careful between the two. And since they are similar, it's not uncommon to see them mixed up by novice programmers.
A good example of a potential problematic URL is when passing two name/values in a QueryString:
a = 'me & you', and
b = 'you & me'.
Here, using '&' as a separator, '?a=me+%26+you&b=you+%26+me' is a proper querystring BUT it should also be HTML encoded before being written to HTML source code. This is important to be bug free. Most developers aren't careful to do this two-step process of first URL encoding the keys and values and then HTML encoding the full URL in the HTML source. It's no wonder why, when I had to sit down and seriously think this process through and test out my conclusions thoroughly. Imagine when the name value is 'year=año', or, far more complex, when we need Chinese or Japanese characters that use surrogate pairs to represent them!
For the same key/value pairs for a and b above, when using ';' as the separator, the process is MUCH simpler. As a matter of fact, the ampersand separator makes the process more than twice as complex as using the semicolon separator! Here's the same info represented using ';' as a separator: '?a=me+%26+you;b=you+%26+me'. We notice that the only difference is that there's no '&' in the string. But using this ';' separator means that no second process of HTML encoding the URL or QueryString is needed. Now imagine if I were writing HTML and wanted correct HTML and needed to write the HTML to explain all this! All this HTML encoding with '&' really adds a lot of complication (and for many developers, quite a lot of confusion too).
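To make the two-step process concrete, here is a minimal sketch of building that '&'-separated query and then HTML-encoding it for an href (page name and variable names are illustrative):

using System;
using System.Web;

class TwoStepEncoding
{
    static void Main()
    {
        // Step 1: URL encode each key and value, then concatenate with '&'
        string query = "?a=" + HttpUtility.UrlEncode("me & you")
                     + "&b=" + HttpUtility.UrlEncode("you & me");
        // query is "?a=me+%26+you&b=you+%26+me"

        // Step 2: HTML encode the whole thing before writing it into an attribute
        string href = HttpUtility.HtmlEncode("page.aspx" + query);
        Console.WriteLine("<a href=\"" + href + "\">link</a>");
        // The '&' separators come out as &amp; in the HTML source.

        // With ';' as the separator, step 2 would not be needed:
        // "?a=me+%26+you;b=you+%26+me" is already safe inside an HTML attribute.
    }
}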
Novice developers would simply not HTML encode the QueryString or URL, which is CORRECT when ';' is the separator. But it leaves room for bugs when the ampersand is improperly encoded. So '?someText=blah&blah' would need proper encoding.
Also in .NET, we can write XML documentation for our methods. Well, just today, I wrote a little explanation that used the above 'a=me+%26+you&b=you+%26+me' example. And in my XML, I had to manually type all those &amp; character entities for the XML. XML documentation is picky, so one must correctly encode ampersands. But the leniency in HTML adds to ambiguity.
Perhaps this wasn't too confusing. But all the confusion or difficulty is due to using a character which should be HTML encoded as the separator, thus '&' is the culprit. And the semicolon relieves all that complication.
One last consideration: with how much more complicated the '&' separator makes this process, it's no wonder to me why the Microsoft implementation of surrogate pairs in QueryStrings still does not follow the official specifications. And if you write your own methods, you MUST account for Microsoft's incorrect use of percent-encoding surrogate pairs. The official specs forbid percent-encoding of surrogate pairs in UTF-8. So anyone who writes their own methods which also handle the full range of Unicode characters, beware of this.
I'm trying to write a custom JavaScript MVC3 helper class for my project, and one of the methods is supposed to escape C# strings to JavaScript strings.
I know C# strings are UTF-16 encoded, and Javascript strings also seem to be UTF-16. No problem here.
I know some characters like backslash, single quotes or double quotes must be backslash-escaped on Javascript so:
\ becomes \\
' becomes \'
" becomes \"
Is there any other caveat I must be aware of before writing my conversion method ?
EDIT:
Great answers so far, I'm adding some references from the answers in the question to help others in the future.
Alex K. suggested using System.Web.HttpUtility.JavaScriptStringEncode, which I marked as the right answer for me, because I'm using .NET 4. But this function is not available in previous .NET versions, so I'm adding some other resources here:
CR becomes \r // a JavaScript string cannot be broken across more than one line
LF becomes \n // a JavaScript string cannot be broken across more than one line
TAB becomes \t
Control characters must be Hex-Escaped
JP Richardson gave an interesting link informing that Javascript uses UCS-2, which is a subset of UTF-16, but how to encode this correctly is an entirely new question.
LukeH in the comments below reminded me of the CR, LF and TAB chars, and that reminded me of the control chars (BEL, NUL, ACK, etc.).
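For older .NET versions without JavaScriptStringEncode, a hand-rolled escaper along the lines of the table above might look like this (a sketch, not an exhaustive implementation; it hex-escapes control characters and leaves other characters alone):

using System.Text;

static class JsStringEscaper
{
    // Escapes a C# string so it can be embedded inside a JavaScript string literal.
    public static string Escape(string value)
    {
        var sb = new StringBuilder(value.Length);
        foreach (char c in value)
        {
            switch (c)
            {
                case '\\': sb.Append("\\\\"); break;
                case '\'': sb.Append("\\'"); break;
                case '"':  sb.Append("\\\""); break;
                case '\r': sb.Append("\\r"); break;
                case '\n': sb.Append("\\n"); break;
                case '\t': sb.Append("\\t"); break;
                default:
                    if (char.IsControl(c))
                        sb.Append("\\u").Append(((int)c).ToString("x4")); // hex-escape control chars
                    else
                        sb.Append(c);
                    break;
            }
        }
        return sb.ToString();
    }
}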
(.NET 4) You can:
System.Web.HttpUtility.JavaScriptStringEncode(@"aa\bb ""cc"" dd\tee", true);
==
"aa\\bb \"cc\" dd\\tee"
It's my understanding that you do have to be careful, as JavaScript is not UTF-16; rather, it's UCS-2, which I believe is a subset of UTF-16. What this means for you is that any character represented by a code point higher than 0xFFFF (i.e. outside 2 bytes) could give you problems in JavaScript.
In summary, under the covers, the engine may use UTF-16, but it only exposes UCS-2 like methods.
Great article on the issue:
http://mathiasbynens.be/notes/javascript-encoding
Just use Microsoft.JScript.GlobalObject.escape
Found it here: http://forums.asp.net/p/1308104/4468088.aspx/1?Re+C+equivalent+of+JavaScript+escape+
Instead of using JavaScriptStringEncode() method, you can encode server side using:
HttpUtility.UrlEncode()
When you need to read the encoded string client side, you have to call the unescape() JavaScript function before using the string.
I am currently looking to detect whether a URL is encoded or not. Here are some specific examples:
http://www.linxology.com/browse.php?u=Oi8vZXNwbnN0YXIuY29tL21lZGlhLXBsYXllci8%3D&b=13
http://www.linxology.com/browse.php?u=Oi8vZXNwbnN0YXIuY29tL290aGVyX2ZpbGVzL2VzcG5zdGFyL25hdl9iZy1vZmYucG5n&b=13
Can you please give me a Regular Expression for this?
Is there a self learning regular expression generator out there which can filter a perfect Regex as the number of inputs are increased?
If you are interested in the base64-encoded URLs, you can do it.
A little theory. If L, R are regular languages and T is a regular transducer, then LR (concatenation), L & R (intersection), L | R (union), TR(L) (image), TR^-1(L) (kernel) are all regular languages. Every regular language has a regular expression that generates it, and every regexp generates a regular language. URLs can be described by regular language (except if you need a subset of those that is not), almost every escaping scheme (and base64) is a regular transducer. Therefore, in theory, it's possible.
In practice, it gets rather messy.
A regex for valid base64 strings is ([A-Za-z0-9+/]{4})*(|[A-Za-z0-9+/]{2}(==|[A-Za-z0-9+/]=))
If it is embedded in a query parameter of an url, it will probably be urlencoded. Let's assume only the = will be urlencoded (because other characters can too, but don't need to).
This gets us to something like [?&][^?&#=;]+=([A-Za-z0-9+/]{4})*(|[A-Za-z0-9+/]{2}(%3D%3D|[A-Za-z0-9+/]%3D))
Another possibility is to consider only those base64 encoded URLs that have some property - in your case, they all begin with "://", which is fortunate, because that translates exactly to 4 characters, "Oi8v". Otherwise, it would be more complex.
This gets [?&][^?&#=;]+=Oi8v([A-Za-z0-9+/]{4})*(|[A-Za-z0-9+/]{2}(%3D%3D|[A-Za-z0-9+/]%3D))
As you can see, it gets messier and messier. Therefore, I'd recommend that you instead (see the sketch after this list):
break the URL in its parts (eg. protocol, host, query string)
get the parameters from the query string, and urldecode them
try base64 decode on the values of the parameters
apply your criterion for "good encoded URLs"
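A rough C# sketch of those steps (the "begins with ://" test is just the criterion from this particular case, and the helper name is made up):

using System;
using System.Text;
using System.Web;

static class EncodedUrlDetector
{
    // Pull each query parameter, url-decode it, try to base64-decode it,
    // and apply the "good encoded URL" criterion.
    public static bool LooksBase64Encoded(string url)
    {
        var uri = new Uri(url);
        var query = HttpUtility.ParseQueryString(uri.Query); // url-decodes the values
        foreach (string key in query)
        {
            string value = query[key];
            if (string.IsNullOrEmpty(value) || value.Length % 4 != 0)
                continue; // cannot be complete base64
            try
            {
                byte[] bytes = Convert.FromBase64String(value);
                string decoded = Encoding.UTF8.GetString(bytes);
                if (decoded.StartsWith("://"))   // the criterion from this question
                    return true;
            }
            catch (FormatException)
            {
                // not valid base64, keep looking at the other parameters
            }
        }
        return false;
    }
}

For example, LooksBase64Encoded("http://www.linxology.com/browse.php?u=Oi8vZXNwbnN0YXIuY29tL21lZGlhLXBsYXllci8%3D&b=13") should return true.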
Well, depending on what is in that encoded text, you might not even need a regular expression. If there are multiple querystring parameters in that one "u" key, perhaps you could just check the length of the text on each querystring value, and if it is over (say) 50, you can assume it's probably encoded. I doubt any unencoded single parameters would be as long as these, since those would have to be string data, and therefore they would probably need to be encoded!
This question may be harder than you realize. For example:
I could say that if a query string includes a question mark character then what follows it is encoded.
Now, it may be simple encoding like "?year=2009" or complicated like in your examples.
Or
The site URLs could use URL rewriting (like this site does). Look at the URL of this question. The "615958" is encoded and... no question marks were used!
In fact, you could say that the entire URL is encoded!
Perhaps you need to better define what you mean by "encoded".
You can't reliably parse a URL using regex. (Is this an SO mantra yet?)
Here are some specific examples:
It's not clear what ‘encoded’ means — can you give some counter-examples of URLs you consider “not encoded”?
Are you talking about the Base64 encoding in the ‘u’ parameter? Whilst it is possible to say whether a string is a valid Base64 string, it's not possible to detect Base64 and distinguish it from anything else; for example the word “sausages” also happens to be valid Base64 (it decodes to '\xb1\xab\xacj\x07\xac').
I have a string in C# initialised as follows:
string strVal = "£2000";
However whenever I write this string out the following is written:
£2000
It does not do this with dollars.
An example bit of code I am using to write out the value:
System.IO.File.AppendAllText(HttpContext.Current.Server.MapPath("/logging.txt"), strVal);
I'm guessing it's something to do with localization, but if C# strings are just Unicode, surely this should just work?
CLARIFICATION: Just a bit more info, Jon Skeet's answer is correct, however I also get the issue when I URLEncode the string. Is there a way of preventing this?
So the URL encoded string looks like this:
"%c2%a32000"
%c2 = Â
%a3 = £
If I encode as ASCII the £ comes out as ?
Any more ideas?
AppendAllText is writing out the text in UTF-8.
What are you using to look at it? Chances are it's something that doesn't understand UTF-8, or doesn't try UTF-8 first. Tell your editor/viewer that it's a UTF-8 file and all should be well. Alternatively, use the overload of AppendAllText which allows you to specify the encoding and use whichever encoding is going to be most convenient for you.
EDIT: In response to your edited question, the reason it fails when you encode with ASCII is that £ is not in the ASCII character set (which is Unicode 0-127).
URL encoding is also using UTF-8, by the looks of it. Again, if you want to use a different encoding, specify it to the HttpUtility.UrlEncode overload which accepts an encoding.
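For illustration, both APIs mentioned here have overloads that take an Encoding; a minimal sketch (the output comments are what I'd expect, but check on your framework version):

using System;
using System.IO;
using System.Text;
using System.Web;

class PoundSignEncoding
{
    static void Main()
    {
        string strVal = "£2000";

        // Write the log file with an explicit encoding instead of the UTF-8 default
        File.AppendAllText("logging.txt", strVal, Encoding.GetEncoding("iso-8859-1"));

        // URL encoding depends on the character encoding you feed it
        Console.WriteLine(HttpUtility.UrlEncode(strVal, Encoding.UTF8));                      // %c2%a32000
        Console.WriteLine(HttpUtility.UrlEncode(strVal, Encoding.GetEncoding("iso-8859-1"))); // %a32000
    }
}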
The default character set of URLs when used in HTML pages and in HTTP headers is called ISO-8859-1 or ISO Latin-1.
It's not the same as UTF-8, and it's not the same as ASCII, but it does fit into one-byte-per-character. The range 0 to 127 is a lot like ASCII, and the whole range 0 to 255 is the same as the range 0000-00FF of Unicode.
So you can generate it from a C# string by casting each character to a byte, or you can use Encoding.GetEncoding("iso-8859-1") to get an object to do the conversion for you.
(In this character set, the UK pound symbol is 163.)
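A quick sketch of that conversion:

using System;
using System.Text;

class Latin1Bytes
{
    static void Main()
    {
        byte[] bytes = Encoding.GetEncoding("iso-8859-1").GetBytes("£2000");
        Console.WriteLine(bytes[0]);  // 163 - the ISO-8859-1 code for the UK pound sign
        // The "cast each char to a byte" approach gives the same result because
        // ISO-8859-1 matches Unicode code points 0-255:
        Console.WriteLine((byte)'£'); // 163
    }
}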
Background
The RFC says that unencoded text must be limited to the traditional 7-bit US ASCII range, and anything else (plus the special URL delimiter characters) must be encoded. But it leaves open the question of what character set to use for the upper half of the 8-bit range, making it dependent on the context in which the URL appears.
And that context is defined by two other standards, HTTP and HTML, which do specify the default character set, and which together create a practically irresistible force on implementers to assume that the address bar contains percent-encodings that refer to ISO-8859-1.
ISO-8859-1 is the character set of text-based content sent via HTTP except where otherwise specified. So by the time a URL string appears in the HTTP GET header, it ought to be in ISO-8859-1.
The other factor is that HTML also uses ISO-8859-1 as its default, and URLs typically originate as links in HTML pages. So when you craft a simple minimal HTML page in Notepad, the URLs you type into that file are in ISO-8859-1.
It's sometimes described as "hole" in the standards, but it's not really; it's just that HTML/HTTP fill in the blank left by the RFC for URLs.
Hence, for example, the advice on this page:
URL encoding of a character consists of a "%" symbol, followed by the two-digit hexadecimal representation (case-insensitive) of the ISO-Latin code point for the character.
(ISO-Latin is another name for ISO-8859-1.)
So much for the theory. Paste this into notepad, save it as an .html file, and open it in a few browsers. Click the link and Google should search for UK pound.
<HTML>
<BODY>
<A href="http://www.google.com/search?q=%a3">Test</A>
</BODY>
</HTML>
It works in IE, Firefox, Apple Safari, Google Chrome - I don't have any others available right now.
Note that %a3 cannot be encoded in ASCII (7 bit, Basic Latin).
The Pound Sign (down the page) is part of Latin-1 encoding.
I have noticed that this is happening only when long strings are used (over 4000 chars). My solution was, upon receiving the parameter in the database, to simply replace the Â sign with nothing.
Be careful, Â may actually be needed, and if that is the case this solution is not appropriate.