http://example.com/something/somewhere//somehow/script.js
Does the double slash break anything on the server side? I have a script that parses URLs and I was wondering if it would break anything (or change the path) if I replaced multiple slashes with a single slash. Especially on the server side, some frameworks like CodeIgniter and Joomla use segmented URL schemes and routing. I just want to know if it breaks anything.
RFC 2396 (the generic URI syntax specification) defines the path separator to be a single slash.
However, unless you're using some kind of URL rewriting (in which case the rewriting rules may be affected by the number of slashes), the URI typically maps to a path on disk, and in (most?) modern operating systems (Linux/Unix, Windows) multiple path separators in a row have no special meaning, so /path/to/foo and /path//to////foo would eventually map to the same file.
An additional thing that might be affected is caching. Since both your browser and the server cache individual pages (according to their caching settings), requesting the same file multiple times via slightly different URIs might affect the caching (depending on the server and client implementations).
The correct answer to this question is: it depends on the implementation of the server!
Preface: a double slash is syntactically valid according to RFC 2396, which defines URL path syntax. As amn explains, it therefore implies an empty URI segment. Note, however, that RFC 2396 only defines the syntax, not the semantics, of paths, including empty path segments, so it is up to your server to decide the semantics of an empty segment.
You didn't mention the server software stack you're using, perhaps you're even rolling your own? So please use your imagination as to what the semantics could be!
Practically, I would like to point out some everyday, semantics-related reasons why you should avoid double slashes even though they are syntactically valid:
Since not everyone expects an empty segment to be valid, it can cause bugs. And even though your server technology of today might be compatible with it, either your server technology of tomorrow or the next version of your server technology of today might decide not to support it any more. Example: the ASP.NET MVC Web API library throws an error when you try to specify a route template with a double slash.
Some servers might interpret // as indicating the root path. This can either be on purpose or a bug, and if it is a bug it is likely a security bug, i.e. a directory traversal vulnerability.
Because it is sometimes a bug, and a security bug, some clever server stacks and firewalls will see the substring '//', deduce you are possibly making an attempt at exploiting such a bug, and therefore they will return 403 Forbidden or 400 Bad Request etc, and refuse to actually do any further processing of the URI.
URLs don't have to map to filesystem paths. So even if // in a filesystem path is equivalent to /, you can't guarantee the same is true for all URLs.
Consider the declaration of the relevant path-absolute non-terminal in "RFC3986: Uniform Resource Identifier (URI): Generic Syntax" (specified, as is typical, in ABNF syntax):
path-absolute = "/" [ segment-nz *( "/" segment ) ]
Then consider the segment declaration a few lines further down in the same document:
segment = *pchar
If you can read ABNF, the asterisk (*) specifies that the following element pchar may be repeated any number of times, including zero, to make up a segment. Knowing this and re-reading the path-absolute declaration above, you can see that segments may be empty, so the separating "/" may repeat indefinitely, allowing runs like ////// (any number of consecutive slashes) within a path-absolute once its first, non-empty segment (segment-nz) is in place.
As all URLs are URIs, we can conclude that yes, URLs are allowed multiple consecutive forward slashes, per the quoted RFC. (In an http URL that has an authority component, the path is actually the even more permissive path-abempty = *( "/" segment ), which allows empty segments from the very first position.)
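To make the grammar concrete, here is a rough C# sketch (illustrative only; the pchar character class is transcribed by hand from RFC 3986) that turns the quoted ABNF into a regular expression and tests a few paths:

using System;
using System.Text.RegularExpressions;

class PathAbsoluteDemo
{
    // pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
    const string Pchar = "(?:[A-Za-z0-9._~!$&'()*+,;=:@-]|%[0-9A-Fa-f]{2})";

    // path-absolute = "/" [ segment-nz *( "/" segment ) ]
    static readonly Regex PathAbsolute =
        new Regex("^/(?:" + Pchar + "+(?:/" + Pchar + "*)*)?$");

    static void Main()
    {
        Console.WriteLine(PathAbsolute.IsMatch("/path/to/foo"));     // True
        Console.WriteLine(PathAbsolute.IsMatch("/path//to////foo")); // True: empty segments are allowed
        Console.WriteLine(PathAbsolute.IsMatch("//path"));           // False: first segment must be non-empty (segment-nz)
    }
}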
But it's not like everyone follows or implements URI parsers per specification, so I am fairly sure there are non-compliant URI/URL parsers and all kinds of software that stacks on top of these where such corner cases break larger systems.
One thing you may want to consider is that it might affect your page indexing in a search engine. According to this web page,
A URL with the same path repeated 3 times will not be indexed in Google
The example they use is:
example.com/path/path/path/
I haven't confirmed whether this would also be true if you used example.com///, but I would certainly want to find out if SEO was critical for my website.
They mention that "This is because Google thinks it has hit a URL trap." If anyone else knows the answer for sure, please add a comment to this answer; otherwise, I thought it relevant to include this case for consideration.
Yes, it can most definitely break things.
The spec considers http://host/pages/foo.html and http://host/pages//foo.html to be different URIs, and servers are free to assign different meanings to them. However, most servers will treat the paths /pages/foo.html and /pages//foo.html identically (because the underlying file system does, too). But even when dealing with such servers, it is easy for an extra slash to break things. Consider the situation where a relative URI is returned by the server.
http://host/pages/foo.html + ../images/foo.png = http://host/images/foo.png
http://host/pages//foo.html + ../images/foo.png = http://host/pages/images/foo.png
Let me explain what that means. Say your server returns an HTML document that contains the following:
<img src="../images/foo.png">
If your browser obtained that page using
http://host/pages/foo.html # Path has 2 segments: "pages" and "foo.html"
your browser will attempt to load
http://host/images/foo.png # ok
However, if your browser obtained that page using
http://host/pages//foo.html # Path has 3 segments: "pages", "" and "foo.html"
you'll probably get the same page (because the server probably doesn't distinguish /pages//foo.html from /pages/foo.html), but your browser will erroneously try to load
http://host/pages/images/foo.png # XXX
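A quick way to see the difference yourself, assuming your URI library implements RFC 3986 reference resolution (System.Uri appears to, at least in recent .NET versions):

using System;

class RelativeResolutionDemo
{
    static void Main()
    {
        var single  = new Uri("http://host/pages/foo.html");
        var doubled = new Uri("http://host/pages//foo.html");

        // Resolve the same relative reference against both base URIs.
        Console.WriteLine(new Uri(single, "../images/foo.png"));
        // http://host/images/foo.png

        Console.WriteLine(new Uri(doubled, "../images/foo.png"));
        // http://host/pages/images/foo.png -- the empty segment shifts the result
    }
}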
You may be surprised, for example, when building links to resources in your app.
<script src="mysite.com/resources/jquery//../angular/script.js"></script>
will not resolve to mysite.com/resources/angular/script.js but to mysite.com/resources/jquery/angular/script.js, which is probably not what you wanted.
Double slashes are evil, try to avoid them.
Your question is "does it break anything". In terms of the URL specification, extra slashes are allowed. Rather than reading the RFC, here is a quick experiment you can run to see whether your browser silently mangles the URL:
echo '<?= $_SERVER["REQUEST_URI"];' > tmp.php
php -S localhost:4000 tmp.php
Requesting http://localhost:4000/hello//world on macOS 10.14 (18A391) with Safari 12.0 (14606.1.36.1.9) and with Chrome 69.0.3497.100, both browsers get the result:
/hello//world
This indicates that an extra slash is passed through unchanged and is visible to the web application.
Certain use cases will be broken by a double slash. This includes URL redirecting/routing that expects a single-slashed URL, or other CGI applications that analyze the URI directly.
For normal cases of serving static content, such as your example, you will still get the correct content, but the client will get a cache miss against the same content accessed with a different number of slashes.
I tried searching Google and Stack Exchange for an answer to this question, and while it did yield some documentation on the Href() method (documentation I have seen before), it seems that questions on this method are either very obscure or nonexistent, and I am still left puzzled about why/when to really use it.
MSDN Documentation: http://msdn.microsoft.com/en-us/library/system.web.webpages.webpageexecutingbase.href(v=vs.111).aspx
I can tell, through this documentation, that it builds an absolute url from a relative path.
My questions are:
Why should I need to do this?
Shouldn't a relative path work fine outside of the testing environment and even on the web server?
Should I change EVERY one of the relative paths in my site to incorporate the Href() method? (Example: Change Context.RedirectLocal("~/") to Context.RedirectLocal(Href("~/")))
What's the best practices for use with this method considering an ASP.NET web-pages environment?
I apologize for being so confused over what seems to be such a simple thing, but I would hate to get my website live only to find out that it was broken or had security holes (first impressions can be a killer).
You would use the method explicitly if you are working with Web Pages 1 and want to ensure that your virtual paths always map correctly to an absolute URL. From Web Pages 2, the Href method is called by the framework whenever the parser encounters a tilde (~) in a URL in your cshtml file, e.g.
<script type="text/javascript" src="~/Scripts/jquery.js"></script>
When might a path not resolve correctly without using the Href method explicitly or the tilde? It might not work if your site's root path structure changes, for example if you change hosting. It also might not work if your internal folder structure changes, or if you move files about. If that is unlikely to happen, you probably don't need to worry about using the method. I didn't tend to use it until the Href method was replaced by the tilde. Now I always use it, on the basis that it is so much easier to use, and I would prefer to add one additional keystroke to each URL than be sorry at some stage in the future.
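For comparison, in Web Pages 1 you would call the method explicitly yourself; assuming a script at ~/Scripts/jquery.js (a hypothetical path), that looks like:

<script type="text/javascript" src="@Href("~/Scripts/jquery.js")"></script>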
You can find out more about Href about half way down this page: http://www.asp.net/web-pages/tutorials/basics/2-introduction-to-asp-net-web-programming-using-the-razor-syntax
So I've been building a C# HTML sanitizer using the HTML Agility Pack with a whitelist. It works fine, except for cases like these:
<img src="javascript:alert('BadStuff');" />
<img src="jav ascript:alert('BadStuff');">
I DO want to allow the src attribute, just not malicious stuff within it, obviously. All of the stuff I've looked up has just recommended a whitelist for tags and their attributes. How would you handle something like this, though? I know this won't work in any newer browser, but I'm not very familiar with security and I'm sure there are some other clever things attackers could do.
Something like "must be a valid URI, either relative or absolute with an http/https scheme" is a good starting point.
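A rough sketch of that rule in C#, assuming the src value has already been HTML-entity-decoded; the helper name and the exact checks are mine, not a standard API:

using System;
using System.Linq;

static class SrcValidator
{
    static readonly string[] AllowedSchemes = { "http", "https" };

    // Sketch of "must be a valid relative URI, or absolute with an http/https
    // scheme". Use it as one layer of a sanitizer, not as the whole sanitizer.
    public static bool IsAllowedSrc(string src)
    {
        if (string.IsNullOrWhiteSpace(src))
            return false;

        // Reject control characters and whitespace outright; browsers used to
        // strip them, which is how "jav ascript:" style payloads sneak through.
        if (src.Any(c => char.IsControl(c) || char.IsWhiteSpace(c)))
            return false;

        Uri uri;
        if (!Uri.TryCreate(src, UriKind.RelativeOrAbsolute, out uri))
            return false;

        if (uri.IsAbsoluteUri)
            return AllowedSchemes.Contains(uri.Scheme);

        // For relative URIs, also reject protocol-relative ("//evil.example")
        // and anything with a colon before the first slash.
        return !src.StartsWith("//") && !src.Split('/')[0].Contains(":");
    }
}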
You can safely allow the src attribute, provided that you sanitize and handle the input properly. To do this you should first sanitize it through a whitelist of valid URL characters, canonicalize it, and then verify that it points to a valid image.
The whitelist you mentioned is the first step (and an important one at that). To implement the whitelist, simply strip out every character that isn't valid for a URL. Also verify that the URL is properly formed, meaning that it points to a valid resource that the user should be able to access. For example, the user shouldn't be able to access a local file on the server by passing in file://sensitive.txt or something. If http or https are the only protocols that should be used, check that the URL starts with those. If you are extra paranoid you may reject the request altogether, since it is obvious it has been tampered with. Whitelisting is important, but whitelisting alone will not keep the feature secure.
Canonicalization is important because many attacks depend on submitting URLs that eventually take you to a certain location, but may abuse the computer's innate lack of reasoning to get at things it shouldn't. This will also help to eliminate duplicated paths to the same resource which may improve performance (or at least allow you to improve performance by not rechecking a known file that hasn't changed since the last time you checked it. Be careful with this though because it is possible to spoof a last modified date so an attacker could swap a malicious file in after you've already "checked and trusted" it).
To verify that you are pointing to a valid image, open the file and read in the first few bytes. Do not simply trust the file extension, though do check it first before opening the file (for performance and for security). Every image format has a certain pattern of bytes that you can check; a good one to look at first is JPEG. It may still be possible for a malicious user to put shellcode or other attack code in an image file that contains the proper headers, but it is much more difficult to do. This will be a performance bottleneck, so plan appropriately if you implement it.
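For illustration, a minimal sketch of such a magic-byte check; the signatures below are the standard JPEG/PNG/GIF ones, and you would extend the list for any other formats you accept:

using System.IO;
using System.Linq;

static class ImageSniffer
{
    // Check the file's leading bytes ("magic numbers") instead of trusting its extension.
    static readonly byte[][] Signatures =
    {
        new byte[] { 0xFF, 0xD8, 0xFF },                               // JPEG
        new byte[] { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A }, // PNG
        new byte[] { 0x47, 0x49, 0x46, 0x38 },                         // GIF8
    };

    public static bool LooksLikeImage(Stream stream)
    {
        var header = new byte[8];
        int read = stream.Read(header, 0, header.Length);
        return Signatures.Any(sig => read >= sig.Length &&
                                     sig.SequenceEqual(header.Take(sig.Length)));
    }
}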
I have an MVC3 app written in C# that I'd like to generate rel=canonical tags for. In searching SO for ways to achieve this automatically, I came across this post.
I implemented it in my dev environment and it works as intended and generates tags such as
<link href="http://localhost/" rel="canonical" />.
My question is, what good does this do? Shouldn't the canonical URL point to explicitly where I want it to (i.e. my production site), rather than whatever the URL happens to be?
The reason I bring this up is because my hosting provider (who shall remain nameless for now) also generates another URL that points to my site (same IP address just a different hostname, I have no idea why, they claim it's for reverse DNS purposes -- this is another subject). However, I've started seeing my page show up in Google search results under this mirrored URL. Not good for SEO, since it's "duplicate content". Now, I've fixed it by simply configuring my IIS site to respond only to requests to my site's domain, however, it seemed a good time to look at what type of a solution canonical URLs could have provided here.
Using the solution in the post above, the rel=canonical link tag would have output a canonical URL containing the MIRRORED URL if someone were to go to the mirrored site, which is not at all what I would want. It should ALWAYS be <link rel="canonical" href="http://www.productionsite.com" />, regardless of the URL in the address bar, right? I mean, isn't that the point of canonical URLs or am I missing something?
Assuming I'm correct, is there an accepted, generic way to generate canonical URLs for an MVC3 app? I can obviously define them individually for every page, or I can simply replace the rawUrl.Host parameter in the solution I linked with a hard-coded domain name, I'm just wondering why I see so many examples of people generating canonical URLs this way when it doesn't seem to fit the purpose (at least in my example). What problem are they trying to solve by just inserting the current URL into a rel=canonical link element?
Great question, and you're bang on regarding the mirrored site still getting marked as canonical. In fact, you've got to fix a couple of problems before it hammers your "link juice" any harder.
I suspect the main reason is that MVC, by design, is a URL rewriting/routing system. So, depending on how the originally requested URL gets massaged, people are trying to set the canonical link to the "settled on" final URL format, post-rewriting. That said, I think you've put your finger on an oversight most people make, which is: "What about URLs that reached the page but were NOT anticipated and REWRITTEN to become the valid, canonical path?" The answer here is to rewrite these "bad requests" as you discover them. For example, if you rewrote your ISP's mirrored domain requests, then by the time the page loads it is a valid URL, because it was "fixed" by your rewrite rules. Make sense? So you'll need to update your MVC routes to handle the bad route created by your ISP. NOTE: You MUST make sure you use the final, rewritten URL, not the originally requested one, when building the canonical link value.
Continue on for my WWW vs. non-WWW tip, as well as a concern about something you mentioned regarding not processing the invalid urls.
People also do this because your site already "mirrors" another domain that people always forget about. The "WWW" subdomain.
Believe it or not, although it is debated, many say that having both www.yourdomain.com/mypage.htm and yourdomain.com/mypage.htm actually hurts your page ranking due to "duplicated" content. I suspect this is why people are showing the "same domain" there, because it's actually the domain stripped of the "www". (I use a rewrite rule to make the www vs. no-www consistent.)
Also, be careful regarding "configuring my IIS site to respond only to requests to my site's domain", because if Google still sees links there and considers them a part of your site, it might actually penalize you for having pages that fail to load (i.e. 404s). I recommend having a rewrite rule that sends them to your "real" domain, OR at least have the canonical link set up to only use your "real" domain, with the www consistently there or not there. (It is argued which is better; I don't think it matters as long as you are consistent.)
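As one possible way to do that redirect in an MVC3 app without touching the IIS configuration, you could hook BeginRequest in your existing Global.asax.cs; www.productionsite.com below is a placeholder for your real host:

using System;
using System.Web;

public class MvcApplication : HttpApplication
{
    // Add this alongside your existing Application_Start / route registration.
    protected void Application_BeginRequest(object sender, EventArgs e)
    {
        const string canonicalHost = "www.productionsite.com"; // placeholder

        if (!Request.Url.Host.Equals(canonicalHost, StringComparison.OrdinalIgnoreCase))
        {
            var canonical = new UriBuilder(Request.Url) { Host = canonicalHost };
            // Permanent (301) redirect so search engines transfer ranking to the real domain.
            Response.RedirectPermanent(canonical.Uri.AbsoluteUri);
        }
    }
}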
What problem are they trying to solve by just inserting the current URL into a rel=canonical link element?
None! They just make things even worse! They have just been misled.
There are misleading answers on this matter, here on Stack Overflow as well, which have been accepted and up-voted!
The whole concept is to produce a unique link ID for each page with different content in a canonical tag.
So a good way to produce unique links for your canonical tags is based on:
Controller name, action name and language. Those three variations will provide different content.
Domain, protocol and letter casing don't!
See the question and my answer here for a better understanding.
MVC Generating rel="canonical" automatically
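To illustrate the idea (controller + action, with a hard-coded production host so neither the mirrored domain nor www/non-www variants can ever leak into the tag), here is a sketch of a helper; the class and the host name are placeholders, not a standard API:

using System.Web.Mvc;

public static class CanonicalHelper
{
    // Placeholder: your real production host, never the current request's host.
    private const string CanonicalHost = "https://www.productionsite.com";

    public static MvcHtmlString CanonicalLink(this HtmlHelper html)
    {
        var values = html.ViewContext.RouteData.Values;
        string controller = (values["controller"] ?? "").ToString().ToLowerInvariant();
        string action = (values["action"] ?? "").ToString().ToLowerInvariant();

        // Index actions collapse to the controller root; casing is normalized.
        string path = action == "index" ? "/" + controller : "/" + controller + "/" + action;
        return MvcHtmlString.Create(
            "<link rel=\"canonical\" href=\"" + CanonicalHost + path + "\" />");
    }
}

You would then call it from your layout, e.g. @Html.CanonicalLink() inside <head>.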
Creating a canonical URL based on the current URL does NOT do any good. You should create a canonical URL based on something static, like database information. For instance, if your URL includes, say, the title of a book, you should pull that book title from the database and create the canonical URL from THAT, not from the current page's URL. That way, if part of the URL is missing and the page still displays, the canonical URL will always be the same.
I have a web application that allows users to upload their content for processing. The processing engine expects UTF8 (and I'm composing XML from multiple users' files), so I need to ensure that I can properly decode the uploaded files.
Since I'd be surprised if any of my users even knew that their files had an encoding, I have very little hope that they'd be able to correctly specify the encoding (decoder) to use. And so, my application is left with the task of detecting it before decoding.
This seems like such a universal problem, I'm surprised not to find either a framework capability or general recipe for the solution. Can it be I'm not searching with meaningful search terms?
I've implemented BOM-aware detection (http://en.wikipedia.org/wiki/Byte_order_mark) but I'm not sure how often files will be uploaded w/o a BOM to indicate encoding, and this isn't useful for most non-UTF files.
My questions boil down to:
Is BOM-aware detection sufficient for the vast majority of files?
In the case where BOM-detection fails, is it possible to try different decoders and determine if they are "valid"? (My attempts indicate the answer is "no.")
Under what circumstances will a "valid" file fail with the C# encoder/decoder framework?
Is there a repository anywhere that has a multitude of files with various encodings to use for testing?
While I'm specifically asking about C#/.NET, I'd like to know the answer for Java, Python and other languages for the next time I have to do this.
So far I've found:
A "valid" UTF-16 file with Ctrl-S characters has caused encoding to UTF-8 to throw an exception (Illegal character?) (That was an XML encoding exception.)
Decoding a valid UTF-16 file with UTF-8 succeeds but gives text with null characters. Huh?
Currently, I only expect UTF-8, UTF-16 and probably ISO-8859-1 files, but I want the solution to be extensible if possible.
My existing set of input files isn't nearly broad enough to uncover all the problems that will occur with live files.
Although the files I'm trying to decode are "text", I think they are often created with methods that leave garbage characters in them. Hence "valid" files may not be "pure". Oh joy.
Thanks.
There won't be an absolutely reliable way, but you may be able to get a "pretty good" result with some heuristics.
If the data starts with a BOM, use it.
If the data contains 0-bytes, it is likely UTF-16 or UTF-32. You can distinguish between these, and between their big-endian and little-endian variants, by looking at the positions of the 0-bytes.
If the data can be decoded as UTF-8 (without errors), then it is very likely UTF-8 (or US-ASCII, which is a subset of UTF-8).
Next, if you want to go international, map the browser's language setting to the most likely encoding for that language.
Finally, assume ISO-8859-1.
Whether "pretty good" is "good enough" depends on your application, of course. If you need to be sure, you might want to display the results as a preview, and let the user confirm that the data looks right. If it doesn't, try the next likely encoding, until the user is satisfied.
Note: this algorithm will not work if the data contains garbage characters. For example, a single garbage byte in otherwise valid UTF-8 will cause UTF-8 decoding to fail, sending the algorithm down the wrong path. You may need to take additional measures to handle this. For example, if you can identify possible garbage beforehand, strip it before you try to determine the encoding. (It doesn't matter if you strip too aggressively; once you have determined the encoding, you can decode the original, unstripped data, just configure the decoders to replace invalid characters instead of throwing an exception.) Or count decoding errors and weight them appropriately. But this probably depends a lot on the nature of your garbage, i.e. what assumptions you can make.
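A minimal C# sketch of the heuristic above (BOM, then 0-bytes, then strict UTF-8, then ISO-8859-1 as the fallback); it skips UTF-32 BOMs and the browser-language step for brevity:

using System;
using System.Text;

static class EncodingGuesser
{
    public static Encoding Guess(byte[] data)
    {
        // 1. A BOM wins if present.
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
            return Encoding.UTF8;
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
            return Encoding.Unicode;          // UTF-16 LE
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
            return Encoding.BigEndianUnicode; // UTF-16 BE

        // 2. Zero bytes strongly suggest UTF-16; for mostly-ASCII text, an even
        //    offset of the first zero byte points at big-endian, odd at little-endian.
        int firstZero = Array.IndexOf(data, (byte)0);
        if (firstZero >= 0)
            return firstZero % 2 == 0 ? Encoding.BigEndianUnicode : Encoding.Unicode;

        // 3. If the bytes decode as strict UTF-8, assume UTF-8 (this covers US-ASCII too).
        try
        {
            new UTF8Encoding(false, true).GetString(data); // second argument: throw on invalid bytes
            return Encoding.UTF8;
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8; fall through.
        }

        // 4. Last resort.
        return Encoding.GetEncoding("ISO-8859-1");
    }
}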
Have you tried reading a representative cross-section of your files from your users, running them through your program, testing, correcting any errors and moving on?
I've found File.ReadAllLines() pretty effective across a very wide range of applications without worrying about all of the encodings. It seems to handle it pretty well.
XmlReader has done fairly well once I figured out how to use it properly.
Maybe you could post some specific examples of data and get some better responses.
This is a well-known problem. You can try to do what Internet Explorer does. There is a nice article on CodeProject that describes Microsoft's solution to the problem. However, no solution is 100% accurate, as everything is based on heuristics. And it is also not safe to assume that a BOM will be present.
You may like to look at a Python-based solution called chardet. It's a Python port of Mozilla code. Although you may not be able to use it directly, its documentation is well worth reading, as is the original Mozilla article it references.
I ran into a similar issue. I needed a PowerShell script that figured out whether a file was text-encoded (in any common encoding) or not.
It's definitely not exhaustive, but here's my solution...
PowerShell search script that ignores binary files
I'm having troubles with HttpWebRequest/HttpWebResponse and cookies/CookieContainer/CookieCollection.
The thing is, if the web server does not send/use a "path" in the cookie, Cookie.Path ends up equal to the path part of the request URI in my application, instead of "/" or empty.
Therefore, those cookies do not work for the whole domain, which they do in proper web browsers.
Any ideas how to solve this issue?
Thanks in advance
Ah, I see what you mean. Generally what browsers really do is take the folder containing the document as the path; for ‘/login.php’ that would be ‘/’ so it would effectively work across the whole domain. ‘/potato/login.php’ would be limited to ‘/potato/’; anything with trailing path-info parts (eg. ‘/login.php/’) would not work.
In this case the Netscape spec could be considered wrong or at least misleading in claiming that path defaults to the current document path... depending on how exactly you read ‘path’ there. However the browser behaviour is consistent back as far as the original Netscape version. Netscape never were that good at writing specs...
If .NET's HttpWebRequest is really defaulting CookieContainer.Path to the entire path of the current document, I'd file a bug against it.
Unfortunately the real-world behaviour is not actually currently described in a standards document... there is RFC 2965, which does get the path thing right, but makes several other changes not representative of real-world browser behaviour, so that's not wholly reliable either. :-(
It seems I cannot go any further with the default cookie handler, so I got annoyed and did it the hard way. Haha. So parsing response.Headers["Set-Cookie"] myself is my solution. Not my preferred one, but it works. And I eliminated the problem of splitting at the wrong comma by using regular expressions.
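For anyone curious, a rough sketch of that approach; the regex and the attribute handling are simplified, so treat it as a starting point rather than a complete Set-Cookie parser:

using System;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

static class SetCookieParser
{
    // Split the combined Set-Cookie header on the commas that separate cookies,
    // not on the comma inside "expires=Wed, 09 Jun 2021 ..." dates: the lookahead
    // only matches a comma that is directly followed by a name=value pair.
    public static CookieCollection Parse(string header, Uri requestUri)
    {
        var cookies = new CookieCollection();
        foreach (string entry in Regex.Split(header, @",(?=[^;,]*=)"))
        {
            string[] parts = entry.Split(';');
            string[] nameValue = parts[0].Split(new[] { '=' }, 2);
            if (nameValue.Length != 2) continue;

            var cookie = new Cookie(nameValue[0].Trim(), nameValue[1].Trim())
            {
                // Default to "/" so the cookie applies to the whole domain,
                // which is what real browsers effectively do when no path is sent.
                Path = "/",
                Domain = requestUri.Host
            };

            foreach (string attribute in parts.Skip(1))
            {
                string[] kv = attribute.Split(new[] { '=' }, 2);
                string name = kv[0].Trim().ToLowerInvariant();
                if (kv.Length == 2 && name == "path") cookie.Path = kv[1].Trim();
                else if (kv.Length == 2 && name == "domain") cookie.Domain = kv[1].Trim();
            }
            cookies.Add(cookie);
        }
        return cookies;
    }
}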
If I could give out points here, I would give you some of them, bobince, because you gave me valuable information. I would also vote up if I could (need higher rep. score), but since this behavior probably is a bug, as you mentioned, I will accept that as an answer.
Thank you. :)
That's the way cookies work. ‘Proper’ web browsers do exactly the same, as originally specified in the ancient Netscape cookies doc: http://cgi.netscape.com/newsref/std/cookie_spec.html
Web apps must effectively always set a ‘path’ (often ‘/’).