So I've been building a C# HTML sanitizer using HTML Agility Pack with a whitelist. It works fine, except for cases like these:
<img src="javascript:alert('BadStuff');" />
<img src="jav ascript:alert('BadStuff');">
I DO want to allow the src attribute, just not malicious stuff within it, obviously. All of the material I've looked up just recommends a whitelist for tags and their attributes. How would you handle something like this, though? I know this won't work in any newer browser, but I'm not very familiar with security, and I'm sure there are other clever things attackers could do.
Something like "must be valid Uri either relative or absolute with http/https scheme" is good starting point.
You can safely allow the src attribute, provided that you sanitize and handle the input properly. To do this you should first sanitize it through a whitelist of valid URL characters, canonicalize it, and then verify that it points to a valid image.
The whitelist you mentioned is the first step (and an important one at that). To implement it, simply strip out every character that isn't valid in a URL. Also verify that the URL is properly formed, meaning that it points to a valid resource that the user should be able to access; for example, the user shouldn't be able to reach a local file on the server by passing in file://sensitive.txt or similar. If http and https are the only protocols that should be used, check that the URL starts with those. If you are extra paranoid, you may reject the request altogether, since it has obviously been tampered with. Whitelisting is important, but whitelisting alone will not keep the feature secure.
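A sketch of such a scheme check (in Python for brevity; the same logic ports directly to C#'s Uri class). The normalization step is what defeats tricks like "jav ascript:": older browsers tolerated whitespace and control characters embedded in the scheme, so strip them before inspecting it. Stripping spaces as well is stricter than what browsers do, but for a validator, stricter is fine:

```python
import re
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}

# Browsers strip tabs/newlines while parsing URLs, so "jav\tascript:..."
# is really "javascript:...". Normalize the same way before checking.
_JUNK = re.compile(r"[\x00-\x20]")

def is_safe_src(url):
    """Allow only http(s) URLs and genuinely relative paths as image sources."""
    normalized = _JUNK.sub("", url)
    try:
        parts = urlsplit(normalized)
    except ValueError:
        return False
    if parts.scheme:
        return parts.scheme.lower() in ALLOWED_SCHEMES
    # Scheme-less (relative) URL: also reject protocol-relative "//host" forms.
    return not normalized.startswith("//")
```

Anything that fails the check gets its src attribute stripped by the sanitizer rather than repaired.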
Canonicalization is important because many attacks depend on submitting URLs that eventually take you to a certain location but abuse the computer's innate lack of reasoning to get at things they shouldn't. Canonicalizing will also eliminate duplicate paths to the same resource, which may improve performance (or at least allow you to improve it by not re-checking a known file that hasn't changed since the last check; be careful with this, though, because a last-modified date can be spoofed, so an attacker could swap a malicious file in after you've already "checked and trusted" it).
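As an illustration, canonicalization can be as simple as normalizing the path so that equivalent URLs compare equal. This is a minimal sketch; a full canonicalizer would also normalize percent-encoding:

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Collapse dot-segments and duplicate slashes, and lowercase the
    scheme and host, so equivalent URLs compare equal."""
    parts = urlsplit(url)
    path = posixpath.normpath(parts.path) if parts.path else "/"
    # normpath drops a trailing slash; restore it, since it is significant.
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, parts.fragment))

print(canonicalize("http://Example.COM/a/b/../c"))  # http://example.com/a/c
print(canonicalize("http://example.com/a//./c"))    # http://example.com/a/c
```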
To verify that you are pointing to a valid image, open the file and read the first few bytes. Do not simply trust the file extension, though do check it before opening the file (for both performance and security). Every image format has a characteristic pattern of bytes that you can check; a good one to look at first is JPEG. It may still be possible for a malicious user to put shellcode or other attack code in an image file that contains the proper headers, but it is much more difficult to do. This check will be a performance bottleneck, so plan appropriately if you implement it.
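For instance, a magic-number check might look like this (a sketch; the signature table covers a few common formats and is illustrative, not exhaustive):

```python
# Magic-number prefixes for a few common image formats. Extend this table
# with whatever formats your whitelist accepts.
IMAGE_SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}

def sniff_image_type(first_bytes):
    """Return the detected format, or None if the bytes match no signature."""
    for signature, fmt in IMAGE_SIGNATURES.items():
        if first_bytes.startswith(signature):
            return fmt
    return None
```

You only need the first dozen or so bytes of the file, so the check stays cheap even for large uploads.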
http://example.com/something/somewhere//somehow/script.js
Does the double slash break anything on the server side? I have a script that parses URLs, and I was wondering if it would break anything (or change the path) if I replaced multiple slashes with a single slash, especially on the server side, where some frameworks like CodeIgniter and Joomla use segmented URL schemes and routing. I would just like to know if it breaks anything.
RFC 2396 (the URI generic syntax specification) defines the path separator to be a single slash.
However, unless you're using some kind of URL rewriting (in which case the rewriting rules may be affected by the number of slashes), the URI usually maps to a path on disk, and in (most?) modern operating systems (Linux/Unix, Windows) multiple path separators in a row have no special meaning, so /path/to/foo and /path//to////foo would eventually map to the same file.
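You can observe the filesystem-level behaviour directly; for example, Python's path normalization collapses repeated separators:

```python
import os.path

# Repeated separators are redundant at the filesystem level:
print(os.path.normpath("/path//to////foo"))  # /path/to/foo on POSIX
```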
An additional thing that might be affected is caching. Since both your browser and the server cache individual pages (according to their caching settings), requesting the same file multiple times via slightly different URIs might affect the caching (depending on the server and client implementation).
The correct answer to this question is: it depends on the implementation of the server!
Preface: Double-slash is syntactically valid according to RFC 2396, which defines URL path syntax. As amn explains, it therefore implies an empty URI segment. Note however that RFC 2396 only defines the syntax, not semantics of paths, including empty path segments, so it is up to your server to decide the semantics of the empty path.
You didn't mention the server software stack you're using, perhaps you're even rolling your own? So please use your imagination as to what the semantics could be!
Practically, I would like to point out some everyday, semantics-related reasons why you should avoid double slashes even though they are syntactically valid:
Since an empty segment being valid is not something everyone expects, it can cause bugs. Even though your server technology of today might tolerate it, your server technology of tomorrow, or the next version of your current one, might decide not to support it any more. Example: the ASP.NET MVC Web API library throws an error when you try to specify a route template containing a double slash.
Some servers might interpret // as indicating the root path. This can either be on-purpose, or a bug - and then likely it is a security bug, i.e. a directory traversal vulnerability.
Because it is sometimes a bug, and a security bug at that, some clever server stacks and firewalls will see the substring //, deduce that you are possibly attempting to exploit such a bug, and return 403 Forbidden or 400 Bad Request instead of doing any further processing of the URI.
URLs don't have to map to filesystem paths. So even if // in a filesystem path is equivalent to /, you can't guarantee the same is true for all URLs.
Consider the declaration of the relevant path-absolute non-terminal in "RFC3986: Uniform Resource Identifier (URI): Generic Syntax" (specified, as is typical, in ABNF syntax):
path-absolute = "/" [ segment-nz *( "/" segment ) ]
Then consider the segment declaration a few lines further down in the same document:
segment = *pchar
If you can read ABNF, the asterisk (*) specifies that the following element pchar may be repeated any number of times to make up a segment, including zero times. Knowing this and re-reading the path-absolute declaration above, you can see that a potentially empty segment implies that the second "/" may repeat indefinitely, hence allowing valid combinations like ////// (an arbitrary run of at least one /) as part of path-absolute (which is itself used in the rule describing a URI).
As all URLs are URIs we can conclude that yes, URLs are allowed multiple consecutive forward slashes, per quoted RFC.
But it's not like everyone follows or implements URI parsers per specification, so I am fairly sure there are non-compliant URI/URL parsers and all kinds of software that stacks on top of these where such corner cases break larger systems.
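A compliant parser accordingly preserves the empty segments rather than collapsing them; Python's urllib, for example:

```python
from urllib.parse import urlsplit

# Empty path segments survive parsing; nothing is collapsed:
print(urlsplit("http://example.com//a///b").path)  # //a///b
```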
One thing you may want to consider is that it might affect your page indexing in a search engine. According to this web page,
A URL with the same path repeated 3 times will not be indexed in Google
The example they use is:
example.com/path/path/path/
I haven't confirmed whether this would also be true for example.com///, but I would certainly want to find out if SEO was critical for my website.
They mention that "This is because Google thinks it has hit a URL trap." If anyone else knows the answer for sure, please add a comment to this answer; otherwise, I thought it relevant to include this case for consideration.
Yes, it can most definitely break things.
The spec considers http://host/pages/foo.html and http://host/pages//foo.html to be different URIs, and servers are free to assign different meanings to them. However, most servers will treat the paths /pages/foo.html and /pages//foo.html identically (because the underlying file system does too). But even with such servers, it is easy for an extra slash to break things. Consider the situation where a relative URI is returned by the server.
http://host/pages/foo.html + ../images/foo.png = http://host/images/foo.png
http://host/pages//foo.html + ../images/foo.png = http://host/pages/images/foo.png
Let me explain what that means. Say your server returns an HTML document that contains the following:
<img src="../images/foo.png">
If your browser obtained that page using
http://host/pages/foo.html # Path has 2 segments: "pages" and "foo.html"
your browser will attempt to load
http://host/images/foo.png # ok
However, if your browser obtained that page using
http://host/pages//foo.html # Path has 3 segments: "pages", "" and "foo.html"
you'll probably get the same page (because the server probably doesn't distinguish /pages//foo.html from /pages/foo.html), but your browser will erroneously try to load
http://host/pages/images/foo.png # XXX
You may be surprised for example when building links for resources in your app.
<script src="mysite.com/resources/jquery//../angular/script.js"></script>
will not resolve to mysite.com/resources/angular/script.js but to mysite.com/resources/jquery/angular/script.js, which is probably not what you wanted.
Double slashes are evil, try to avoid them.
Your question is "does it break anything". In terms of the URL specification, extra slashes are allowed. Rather than reading the RFC, here is a quick experiment you can try to see whether your browser silently mangles the URL:
echo '<?= $_SERVER["REQUEST_URI"];' > tmp.php
php -S localhost:4000 tmp.php
Then request http://localhost:4000/hello//world.
I tested macOS 10.14 (18A391) with Safari 12.0 (14606.1.36.1.9) and Chrome 69.0.3497.100 and both get the result:
/hello//world
This indicates that an extra slash is visible to the web application.
Certain use cases will be broken by a double slash. This includes URL redirects/routing that expect a single-slashed URL, or other CGI applications that analyze the URI directly.
But for the normal case of serving static content, such as your example, you will still get the correct content. The client will, however, get a cache miss against the same content accessed with different slashes.
I have read (and am coming to terms with) the fact that no solution can be 100% effective against XSS attacks. It seems that the best we can hope for is to stop "most" XSS attack avenues, and to have good recovery and/or legal plans afterwards. Lately, I've been struggling to find a good frame of reference for what should and shouldn't be an acceptable risk.
After having read this article, by Mike Brind (A very good article, btw):
http://www.mikesdotnetting.com/Article/159/WebMatrix-Protecting-Your-Web-Pages-Site
I can see that using an HTML sanitizer can also be very effective in reducing the avenues for XSS attacks if you need to accept unvalidated user input.
However, in my case, it's kind of the opposite. I have a (very limited) CMS with a web interface. The user input (after being URL-encoded) is saved to a JSON file, which is then picked up (decoded) on the viewable page. My main defence against XSS here is that you would have to be one of a few registered members in order to change content at all. By logging registered users, IP addresses, and timestamps, I feel that this threat is mostly mitigated. However, I would like to use a try/catch statement to catch the YSOD produced by ASP.NET's default request validator, in addition to the previously mentioned methods.
My question is: How much can I trust this validator? I know it will detect tags (this partial CMS is NOT set up to accept any tags, logistically speaking, so I am fine with an error being thrown if ANY tag is detected). But what else (if anything) does this inborn validator detect?
I know that XSS can be implemented without ever having touched an angle bracket (or a full tag, at all, for that matter), as html sources can be saved, edited, and subsequently ran from the client computer after having simply added an extra "onload='BS XSS ATTACK'" to some random tag.
Just curious how much this validator can be trusted if a person does want to use it as part of their anti-XSS plans (obviously with a try/catch, so the users don't see the YSOD). Is this validator pretty decent but not perfect, or is this just a "best guess" that anyone with enough knowledge to know XSS, at all, would have enough knowledge that this validation wouldn't really matter?
-----------------------EDIT-------------------------------
At this site...: http://msdn.microsoft.com/en-us/library/hh882339(v=vs.100).aspx
...I found this example for web-pages.
var userComment = Request.Form["userInput"]; // Validated, throws error if input includes markup
Request.Unvalidated("userInput"); // Validation bypassed
Request.Unvalidated().Form["userInput"]; // Validation bypassed
Request.QueryString["userPreference"]; // Validated
Request.Unvalidated().QueryString["userPreference"]; // Validation bypassed;
Per the comment "// Validated, throws error if input includes markup", I take it that the validator throws an error if the string contains anything that is considered markup. Now the question (for me) really becomes: what is considered markup? Through testing I have found that a single angle bracket won't throw an error, but if anything (that I have tested so far) comes after that angle bracket, such as
"<l"
it seems to throw. I am sure it does more checking than that, however, and I would love to see what does and does not qualify as markup in the eyes of the request validator.
I believe the ASP.NET request validation is fairly trustworthy but you should not rely on it alone. For some projects I leave it enabled to provide an added layer of security. In general it is preferable to use a widely tested/utilized solution than to craft one yourself. If the "YSOD" (or custom error page) becomes an issue with my clients, I usually just disable the .NET request validation feature for the page.
Once doing so, I carefully ensure that my input is sanitized but more importantly that my output is encoded. So anywhere where I push user-entered (or web service, etc. -- anything that comes from a third party) content to the user it gets wrapped in Server.HtmlEncode(). This approach has worked pretty well for a number of years now.
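The same output-encoding discipline applies in any stack; in Python, for instance, the equivalent of wrapping output in Server.HtmlEncode() is html.escape:

```python
import html

user_comment = "<script>alert('xss')</script>"
# Encode on output so the browser renders the text instead of executing it:
print(html.escape(user_comment, quote=True))
# &lt;script&gt;alert(&#x27;xss&#x27;)&lt;/script&gt;
```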
The link you provided to Microsoft's documentation is quite good. To answer your question about what is considered markup (or what should be considered markup) get on your hacker hat and check out the OWASP XSS Evasion Cheat Sheet.
https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet#HTML_entities
Let's say my program is an anti-virus.
Let's also say I have a file called "Signatures.dat". It contains a list of virus signatures to scan for.
I would like to encrypt that file in a way that it can be opened by my anti-virus on any computer, but the users won't be able to see its contents.
How would I accomplish that task?
I was looking at things like DPAPI, but I don't think that would work in my case because it's based on the user's settings; I need my solution to be universal.
I've got a method to encrypt the file, but then I am not sure how to store the keys.
I know that storing them in my code is really insecure, so I am really not sure what to do at this point.
You want the computers of the users to be able to read the file, and you want the computers of the users to be unable to read the file. As you see, this is a contradiction, and it cannot be solved.
What you are implementing is basically a DRM scheme. Short of using TPM (no, that doesn't work in reality, don't even think about it), you simply cannot make it secure. You can just use obfuscation to make it as difficult as possible to reverse-engineer it and retrieve the key. You can store parts of the key on a server and retrieve it online (basically doing what EA did with their games) etc., but you probably will only make your product difficult to use for legitimate users, and anyone who really wants to will still be able to get the key, and thus the file.
In your example are you trying to verify the integrity of the file (to ensure it hasn't been modified), or hide the contents?
If you are trying to hide the contents then as has been stated ultimately you can't.
If you want to verify the file hasn't been modified, then you can do this via hashes. You don't appear to have confused the two use cases, but sometimes people assume encryption ensures a file hasn't been tampered with.
Your best bet might be to use both methods: encrypt the file to deter casual browsers, but know that this is not really going to deter anyone with enough time. Then verify the hash of the file with your server (use HTTPS, and ensure you validate the certificate's thumbprint). This will ensure the file hasn't been modified even if someone has cracked your encryption.
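The integrity half of that plan is straightforward; here is a sketch of the client-side hash check (the trusted digest would be fetched from your server over HTTPS; trusted_digest below is a placeholder name):

```python
import hashlib

def file_sha256(path):
    """Stream the file through SHA-256 so large files aren't read into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare against the digest your server published over HTTPS:
# if file_sha256("Signatures.dat") != trusted_digest:
#     refuse_to_load_signatures()
```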
I have a web application that allows users to upload their content for processing. The processing engine expects UTF8 (and I'm composing XML from multiple users' files), so I need to ensure that I can properly decode the uploaded files.
Since I'd be surprised if any of my users even knew their files had an encoding, I have very little hope they'd be able to correctly specify the encoding (decoder) to use. And so my application is left with the task of detecting it before decoding.
This seems like such a universal problem, I'm surprised not to find either a framework capability or general recipe for the solution. Can it be I'm not searching with meaningful search terms?
I've implemented BOM-aware detection (http://en.wikipedia.org/wiki/Byte_order_mark), but I'm not sure how often files will be uploaded without a BOM to indicate their encoding, and a BOM isn't useful for most non-UTF files.
My questions boil down to:
Is BOM-aware detection sufficient for the vast majority of files?
In the case where BOM-detection fails, is it possible to try different decoders and determine if they are "valid"? (My attempts indicate the answer is "no.")
Under what circumstances will a "valid" file fail with the C# encoder/decoder framework?
Is there a repository anywhere that has a multitude of files with various encodings to use for testing?
While I'm specifically asking about C#/.NET, I'd like to know the answer for Java, Python and other languages for the next time I have to do this.
So far I've found:
A "valid" UTF-16 file containing Ctrl-S characters caused re-encoding to UTF-8 to throw an exception (illegal character?). (That was an XML encoding exception.)
Decoding a valid UTF-16 file with UTF-8 succeeds but gives text with null characters. Huh?
Currently, I only expect UTF-8, UTF-16 and probably ISO-8859-1 files, but I want the solution to be extensible if possible.
My existing set of input files isn't nearly broad enough to uncover all the problems that will occur with live files.
Although the files I'm trying to decode are "text" I think they are often created w/methods that leave garbage characters in the files. Hence "valid" files may not be "pure". Oh joy.
Thanks.
There won't be an absolutely reliable way, but you may be able to get "pretty good" result with some heuristics.
If the data starts with a BOM, use it.
If the data contains 0-bytes, it is likely UTF-16 or UTF-32. You can distinguish between these, and between their big-endian and little-endian variants, by looking at the positions of the 0-bytes.
If the data can be decoded as utf-8 (without errors), then it is very likely utf-8 (or US-ASCII, but this is a subset of utf-8)
Next, if you want to go international, map the browser's language setting to the most likely encoding for that language.
Finally, assume ISO-8859-1
Whether "pretty good" is "good enough" depends on your application, of course. If you need to be sure, you might want to display the results as a preview, and let the user confirm that the data looks right. If it doesn't, try the next likely encoding, until the user is satisfied.
Note: this algorithm will not work if the data contains garbage characters. For example, a single garbage byte in otherwise valid UTF-8 will cause UTF-8 decoding to fail, sending the algorithm down the wrong path. You may need additional measures to handle this. For example, if you can identify possible garbage beforehand, strip it before you try to determine the encoding. (It doesn't matter if you strip too aggressively: once you have determined the encoding, you can decode the original unstripped data, configuring the decoder to replace invalid characters instead of throwing an exception.) Or count decoding errors and weight them appropriately. This probably depends a lot on the nature of your garbage, i.e. what assumptions you can make.
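Putting those heuristics together might look like this (a sketch; production code would add the browser-language step and the garbage handling discussed above):

```python
import codecs

def guess_encoding(data: bytes) -> str:
    """Heuristic detection: BOM first, then 0-byte patterns, then a
    trial UTF-8 decode, then an ISO-8859-1 fallback."""
    # UTF-32 BOMs must be tested before UTF-16, since the UTF-32-LE BOM
    # starts with the same bytes as the UTF-16-LE BOM.
    for bom, enc in ((codecs.BOM_UTF8, "utf-8-sig"),
                     (codecs.BOM_UTF32_LE, "utf-32-le"),
                     (codecs.BOM_UTF32_BE, "utf-32-be"),
                     (codecs.BOM_UTF16_LE, "utf-16-le"),
                     (codecs.BOM_UTF16_BE, "utf-16-be")):
        if data.startswith(bom):
            return enc
    if b"\x00" in data:
        # For mostly-ASCII text, UTF-16-LE puts the 0-byte in odd positions,
        # UTF-16-BE in even ones. (A crude but effective heuristic.)
        return "utf-16-le" if data[1:2] == b"\x00" else "utf-16-be"
    try:
        data.decode("utf-8")
        return "utf-8"  # or US-ASCII, which is a subset of UTF-8
    except UnicodeDecodeError:
        return "iso-8859-1"
```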
Have you tried reading a representative cross-section of your files from user, running them through your program, testing, correcting any errors and moving on?
I've found File.ReadAllLines() pretty effective across a very wide range of applications without worrying about all of the encodings. It seems to handle it pretty well.
XmlReader has done fairly well once I figured out how to use it properly.
Maybe you could post some specific examples of data and get some better responses.
This is a well-known problem. You can try to do what Internet Explorer does. This is a nice article on CodeProject that describes Microsoft's solution to the problem. However, no solution is 100% accurate, as everything is based on heuristics. And it is also not safe to assume that a BOM will be present.
You may like to look at a Python-based solution called chardet. It's a Python port of Mozilla code. Although you may not be able to use it directly, its documentation is well worth reading, as is the original Mozilla article it references.
I ran into a similar issue. I needed a PowerShell script that figured out whether a file was text-encoded (in any common encoding) or not.
It's definitely not exhaustive, but here's my solution...
PowerShell search script that ignores binary files
I'm having troubles with HttpWebRequest/HttpWebResponse and cookies/CookieContainer/CookieCollection.
The thing is, if the web server does not send/use a "path" in the cookie, Cookie.Path equals the path part of the request URI instead of "/" or being empty in my application.
Therefore, those cookies do not work for the whole domain, which they actually do in proper web browsers.
Any ideas how to solve this issue?
Thanks in advance
Ah, I see what you mean. Generally what browsers really do is take the folder containing the document as the path; for ‘/login.php’ that would be ‘/’ so it would effectively work across the whole domain. ‘/potato/login.php’ would be limited to ‘/potato/’; anything with trailing path-info parts (eg. ‘/login.php/’) would not work.
In this case the Netscape spec could be considered wrong or at least misleading in claiming that path defaults to the current document path... depending on how exactly you read ‘path’ there. However the browser behaviour is consistent back as far as the original Netscape version. Netscape never were that good at writing specs...
If .NET's HttpWebRequest is really defaulting CookieContainer.Path to the entire path of the current document, I'd file a bug against it.
Unfortunately the real-world behaviour is not actually currently described in a standards document... there is RFC 2965, which does get the path thing right, but makes several other changes not representative of real-world browser behaviour, so that's not wholly reliable either. :-(
Seems like I cannot go any further with the default cookie handler, so I got annoyed and did it the hard way. Haha. So parsing response.Headers["Set-Cookie"] myself is my solution; not my preferred one, but it works. And I simply eliminated the problem of splitting at the wrong comma by using regular expressions.
If I could give out points here, I would give you some of them, bobince, because you gave me valuable information. I would also vote up if I could (need higher rep. score), but since this behavior probably is a bug, as you mentioned, I will accept that as an answer.
Thank you. :)
That's the way cookies work. ‘Proper’ web browsers do exactly the same, as originally specified in the ancient Netscape cookies doc: http://cgi.netscape.com/newsref/std/cookie_spec.html
Web apps must effectively always set a ‘path’ (often ‘/’).
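For reference, the default-path computation that browsers actually apply was later codified in RFC 6265 section 5.1.4; a sketch:

```python
def default_cookie_path(request_path):
    """Default cookie path per RFC 6265 section 5.1.4 (matches the browser
    behaviour described above: the folder containing the document,
    without a trailing slash)."""
    # If the path is empty, doesn't start with "/", or contains only the
    # leading "/", the default path is "/".
    if not request_path.startswith("/") or request_path.count("/") == 1:
        return "/"
    # Otherwise: everything up to, but not including, the right-most "/".
    return request_path.rsplit("/", 1)[0]

print(default_cookie_path("/login.php"))         # /
print(default_cookie_path("/potato/login.php"))  # /potato
```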