I am trying to protect my website from Cross-Site Scripting (XSS) and I'm thinking of using regular expressions to validate user inputs.
Here is my question: I have a list of dangerous HTML tags...
<applet>
<body>
<embed>
<frame>
<script>
<frameset>
<html>
<iframe>
<img>
<style>
<layer>
<link>
<ilayer>
<meta>
<object>
...and I want to include them in regular expressions - is this possible? If not, what should I use? Do you have any ideas how to implement something like that?
// Requires: using System.Text; using System.Text.RegularExpressions;
public static bool ValidateAntiXSS(string inputParameter)
{
    if (string.IsNullOrEmpty(inputParameter))
        return true;

    // The following regex covers the JS events and HTML tags mentioned in these links:
    // https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet
    // https://msdn.microsoft.com/en-us/library/ff649310.aspx
    var pattern = new StringBuilder();

    // Checks for JS events and calls, e.g. onKeyUp(), onBlur(), alert() and custom JS functions.
    pattern.Append(@"((alert|on\w+|function\s+\w+)\s*\(\s*(['+\d\w](,?\s*['+\d\w]*)*)*\s*\))");

    // Checks for HTML tags, e.g. <script, <embed, <object etc.
    pattern.Append(@"|(<(script|iframe|embed|frame|frameset|object|img|applet|body|html|style|layer|link|ilayer|meta|bgsound))");

    return !Regex.IsMatch(System.Web.HttpUtility.UrlDecode(inputParameter), pattern.ToString(),
        RegexOptions.IgnoreCase | RegexOptions.Compiled);
}
Please read over the OWASP XSS (Cross Site Scripting) Prevention Cheat Sheet for a broad array of information. Blacklisting tags is not a very effective way to do it and will leave gaps. You should filter input, sanitize before outputting to the browser, encode HTML entities, and use the various other techniques discussed in that link.
You should encode the string as HTML. Use the .NET method:
HttpUtility.HtmlEncode(string text)
There are more details at http://msdn.microsoft.com/en-us/library/73z22y6h.aspx
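A minimal sketch of that call (the input value here is just an illustration):

using System;
using System.Web;

class Program
{
    static void Main()
    {
        string userInput = "<b onmouseover=alert(1)>hi</b>";

        // HtmlEncode turns markup characters into entities, so the browser
        // renders the text instead of interpreting it as HTML.
        string safe = HttpUtility.HtmlEncode(userInput);

        Console.WriteLine(safe); // &lt;b onmouseover=alert(1)&gt;hi&lt;/b&gt;
    }
}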
Blacklisting as sanitization is not effective, as has already been discussed. Think about what happens to your blacklist when someone submits crafted input:
<SCRIPT>
<ScRiPt>
< S C R I P T >
<scr\0ipt> (with an embedded null byte)
<scr<script>ipt> (did you apply the blacklist recursively ;-) )
This is not an enumeration of possible attacks, but just some examples to keep in mind about how the blacklist can be defeated. These will all render in the browser correctly.
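To make the nested case concrete, here is a small sketch (the regex and input are illustrative) of how a single-pass filter re-assembles the forbidden tag:

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string input = "<scr<script>ipt>alert(1)</scr</script>ipt>";

        // A naive one-pass blacklist: strip <script> and </script> once.
        string filtered = Regex.Replace(input, "</?script>", "", RegexOptions.IgnoreCase);

        // The removed fragments stitch the outer pieces back together:
        Console.WriteLine(filtered); // <script>alert(1)</script>
    }
}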
I want to display user content in a JavaScript variable.
As with all user-generated content, I want to sanitize it before outputting.
ASP.Net MVC does a great job of this by default:
@{
    var name = "Jón";
}
<script> var name = '@name'; </script>
The output for the above is:
J&#243;n
This is great as it protects me from users putting <tags> and <script>evilStuff</script> in their names and playing silly games.
In the above example, I want protection from evil-doers, but I don't want to HTML encode valid UTF-8 characters that aren't evil.
I want the output to read:
Jón
but I also want the XSS protection that encoding gives me.
Outside of using a white-listing framework (e.g. Microsoft.AntiXSS), is there any built-in MVC function that helps here?
UPDATE:
It looks like the following appears to do the job:
@{
    var name = "Jón";
}
<script> var name = '@Html.Raw(HttpUtility.JavaScriptStringEncode(name))'; </script>
Will this protect against most, if not all, XSS attacks?
You'd have to write your own encoder or find another third-party one. The default encoders in ASP.NET tend to err on the side of security, encoding more than might strictly be needed.
Having said that, please don't write your own encoder! Writing correct HTML encoding routines is a very difficult job that is appropriate only for those who have specific advanced security expertise.
My recommendation is to use what's built-in because it is correct, and quite secure. While it might appear to produce less-than-ideal HTML output, you're better safe than sorry.
Now, please note that this code:
@Html.Raw(HttpUtility.JavaScriptStringEncode(name))
is not correct and is not secure, because it is invalid to use a JavaScript encoding routine to render HTML markup.
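For context, HttpUtility.JavaScriptStringEncode is meant for embedding a value inside a JavaScript string literal, not for producing HTML. A minimal sketch of that intended use (the value below is just an illustration):

using System;
using System.Web;

class Program
{
    static void Main()
    {
        string name = "Jó\"n</script>";

        // Escapes quotes, backslashes, control characters and markup-significant
        // characters so the value cannot break out of the string literal;
        // the bool overload also adds the surrounding double quotes.
        string jsLiteral = HttpUtility.JavaScriptStringEncode(name, addDoubleQuotes: true);

        Console.WriteLine("var name = " + jsLiteral + ";");
    }
}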
I am using the ASP.NET HtmlEditorExtender and unfortunately there is no working XSS sanitizer for it right now. So as a quick solution I am replacing all occurrences of the words script and java in user input, as below:
var regex = new Regex("script", RegexOptions.IgnoreCase);
srSendText = regex.Replace(srSendText, "");
regex = new Regex("java", RegexOptions.IgnoreCase);
srSendText = regex.Replace(srSendText, "");
Can I assume that I am safe from XSS attacks?
(I am actually using the HtmlAgilityPack anti-XSS sanitizer, but it is not even removing script tags, so it is totally useless.)
No, you can't.
An attacker can easily circumvent such a check, for example by encoding script with HTML character references, e.g. scr&#105;pt.
Making the code safe from XSS attacks is done by making sure that any content that can come from a user is never put in the page without proper encoding, so that any code in the text is never executed at all.
Custom XSS prevention is usually a no-no in my opinion, as you should always use a library. However, stripping script and java is not enough; anything that's passed up to the server should go through HttpUtility.HtmlEncode, which will encode any input from the user.
Also ensure that validateRequest="true" is set in the config file.
Other dangerous tags may include:
applet
body
embed
frame
script
frameset
html
iframe
img
style
layer
link
ilayer
meta
object
http://msdn.microsoft.com/en-us/library/ff649310.aspx
I am looking for the best way to remove all of the text in between 2 div tags, including the tags themselves.
For example:
<body>
<div id="spacer"> This is a title </div>
</body>
becomes:
<body>
</body>
Edit: This needs to happen on the server side (C#)
You can use this library: http://htmlagilitypack.codeplex.com/ to manipulate the HTML on the server side. Below is an example for your case:
var doc = new HtmlDocument();
doc.LoadHtml("<body><div id=\"spacer\"> This is a title </div></body>");

// Remove the element (and its contents) by id.
doc.GetElementbyId("spacer").Remove();

// Write the modified document back out to a string.
var stream = new StringWriter();
doc.Save(stream);
var result = stream.ToString();
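As a side note, and assuming your HtmlAgilityPack version exposes it (current ones do), the StringWriter round-trip can be replaced with the OuterHtml property:

// Serialize the whole document directly to a string.
var result = doc.DocumentNode.OuterHtml;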
Edit:
You also can use xpath to select any nodes you want:
var nodes = doc.DocumentNode.SelectNodes("body/div");
nodes.ToList().ForEach(node => node.Remove());
Not sure what you are trying to achieve, but the best way to hide or remove detail on the fly in your case would be jQuery/JavaScript, since you are not referring to a server-side control.
In case you are just parsing a string:
1) Parse, find the first/last occurrence, and trim things in between (a rough sketch follows below).
2) XML parsing would be the other way, and a better one I guess, because you can iterate through the XML to manipulate it in a better way.
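A rough sketch of option 1 in C#, assuming the markup is simple enough that the first <div and the first </div> after it delimit the block to remove; nested divs or attributes spanning lines need a real parser:

using System;

class Program
{
    static void Main()
    {
        string html = "<body>\n<div id=\"spacer\"> This is a title </div>\n</body>";

        // Locate the opening tag, then the first closing tag after it.
        int start = html.IndexOf("<div", StringComparison.OrdinalIgnoreCase);
        if (start >= 0)
        {
            int end = html.IndexOf("</div>", start, StringComparison.OrdinalIgnoreCase);
            if (end >= 0)
            {
                // Remove the tags and everything between them.
                html = html.Remove(start, end + "</div>".Length - start);
            }
        }

        Console.WriteLine(html);
    }
}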
You can use regular expressions to strip HTML tags and text. You will find several examples on Google.
I need to get LINK and META elements from ASP.NET pages, user controls and master pages, grab their contents and then write back updated values to these files in a utility I'm working on.
I could try using regular expressions to grab just these elements but there are several issues with that approach:
I expect many of the input files to contain broken HTML (missing / out-of-sequence elements, etc.)
SCRIPT elements that contain comments and/or VBScript/JavaScript that looks like valid elements, etc.
I need to be able to special-case IE conditional comments and META and LINK elements inside IE conditional comments
Not to mention how HTML is not a regular language
I did some research for HTML parsers in .NET and many SO posts and blogs recommend the HTML Agility Pack. I've never used it before and I don't know if it can parse broken HTML and HTML fragments. (For example, imagine a user control that only contains a HEAD element with some content in it - no HTML or BODY.) I know I could read the documentation but it'd save me quite a bit of time if someone could advise. (Most SO posts involve parsing full HTML pages.)
Absolutely, that is what it excels at.
In fact, many web pages you'll find in the wild could be described as HTML fragments, due to missing <html> tags, or improperly closed tags.
The HtmlAgilityPack simulates what the browser has to do - try to make sense of what is sometimes a jumble of mismatched tags. An imperfect science, but the HtmlAgilityPack does it very well.
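A small sketch of parsing a fragment with it (the head-only markup below is hypothetical, the kind of user control mentioned in the question):

using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // A fragment with no <html> or <body> element parses without complaint.
        var doc = new HtmlDocument();
        doc.LoadHtml("<head><link rel=\"stylesheet\" href=\"site.css\" /><meta charset=\"utf-8\" /></head>");

        // SelectNodes returns null when nothing matches, so guard before iterating.
        var links = doc.DocumentNode.SelectNodes("//link");
        if (links != null)
        {
            foreach (var link in links)
                Console.WriteLine(link.GetAttributeValue("href", ""));
        }
    }
}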
An alternative to the Html Agility Pack is CsQuery, a C# jQuery port of which I am the primary author. It lets you use CSS selectors and the full jQuery API to access and manipulate the DOM, which for many people is easier than XPATH. Additionally, its HTML parser is designed specifically with a variety of purposes in mind, and there are several options for parsing HTML: as a full document (missing html and body tags will be added, and any orphaned content moved inside the body); as a content block (meaning it won't be wrapped as a full document, but optional tags such as tbody that are still mandatory in the DOM are added automatically, the same as browsers do); and as a true fragment where no tags are created (e.g. in case you're just working with building blocks).
See creating a new DOM for details.
Additionally, CsQuery's HTML parser has been designed to honor the HTML5 spec for optional closing tags. For example, closing p tags are optional, but there are specific rules that determine when the block should be closed. In order to produce the same DOM that a browser does, the parser needs to implement the same rules. CsQuery does this to provide a high degree of compatibility with browser DOM for a given source.
Using CsQuery is very straightforward, e.g.
CQ docFromString = CQ.Create(htmlString);
CQ docFromWeb = CQ.CreateFromUrl(someUrl);
// there are other methods for asynchronous web gets, creating from files, streams, etc.
// css selector: the indexer [] is like jQuery $(..)
CQ lastCellInFirstRow = docFromString["table tr:first-child td:last-child"];
// Text() is a jQuery method returning text contents of selection
string textOfCell = lastCellInFirstRow.Text();
Finally CsQuery indexes documents on class, id, attribute, and tag - making selectors extremely fast compared to Html Agility Pack.
I want to protect my page when a user inputs the following:
<script type="text/javascript">
alert("hi");
</script>
I'm using ShowDown:
jQuery.fn.markDown = function()
{
return this.each(function() {
var caller = this;
var converter = new Showdown.converter();
var text = $(caller).text();
var html = converter.makeHtml(text);
$(caller).html(html);
});
}
If you want to sanitize HTML in .NET server-side code, I'd advise you to use the Microsoft Web Protection Library, after transforming the markdown to HTML and before rendering it to the page.
e.g. the following snippet:
var x = @"<div>safe</div>
<script type='text/javascript'>
alert('hi');
</script>";
return Microsoft.Security.Application.Sanitizer.GetSafeHtmlFragment(x);
returns <div>safe</div>
http://wpl.codeplex.com/
One solution that could be effective would be to strip all the tags in the source, or to HTML-encode the tags, before the text is transformed with Showdown.
For how to strip all the HTML tags, there are a couple of ways to do it that you can find in this question:
Strip HTML from Text JavaScript
For how to HTML encode the tags, you can use this:
myString.replace(/</g, '&lt;').replace(/>/g, '&gt;');
Note: This will remove your ability to use HTML in Showdown.
The ShowDown demo page strips any JavaScript, so I don't know what you mean exactly. But you can't do this on the client. If the text is never going to be submitted to the server, then it doesn't matter. However, 99% of the time, you want to store it on the server.
I think the best approach is to create a server-side DOM object out of the HTML that is submitted (the client-side conversion could be spoofed to bypass ShowDown) and look for any script or other dangerous tags. This is not so simple!
The best compromise for me is to use a server-side markdown library (like https://github.com/charliesome/bbsharp) to generate the HTML. You would then HTML-encode any raw HTML before passing the text to the tool that converts the markdown to HTML.
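A rough sketch of that flow in C# (ConvertMarkdownToHtml below is a hypothetical stand-in for whatever server-side markdown converter is used):

using System;
using System.Web;

class MarkdownPipeline
{
    // Hypothetical stand-in for a real server-side markdown converter
    // (e.g. something like bbsharp); it just echoes its input here.
    static string ConvertMarkdownToHtml(string markdown)
    {
        return markdown;
    }

    static string RenderUserMarkdown(string rawInput)
    {
        // Encode raw HTML first, so <script> tags survive only as inert text...
        string encoded = HttpUtility.HtmlEncode(rawInput);

        // ...then let the markdown converter generate the markup.
        return ConvertMarkdownToHtml(encoded);
    }

    static void Main()
    {
        Console.WriteLine(RenderUserMarkdown("**hi** <script>alert(1)</script>"));
    }
}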
I use HTML Purifier, which works very well for filtering user input and is highly customizable.
I assume you can use it with Markdown, although I have never tried.