Smart HTML encoding - c#

I'm looking for the best way to do some sort of "smart" HTML encoding.
For instance:
From: <a>Next >></a> to: <a>Next &gt;&gt;</a>
From: <p><a><b><< Prev</b></a><br/><a>Next >></a></p> to: <p><a><b>&lt;&lt; Prev</b></a><br/><a>Next &gt;&gt;</a></p>
So only the non XML / HTML part of the text would be encoded as if HtmlEncode is called.
Any suggestions?
EDIT: This should be as lightweight as possible. The incoming text will come from users who have no knowledge of HTML encoding.

Yes: don’t ever write HTML into your source code. Instead work with an API like DOM that takes care of all encoding issues for you.

If you want a solid and totally reliable C# solution (but heavy-weight) then I'd use the HTML Agility Pack library. You could then iterate through nodes and HTML encode the contents. It's a bit more bullet-proof than regular expressions, but obviously more intense.
If you want to do it client-side, then use jQuery. See Encode HTML entities with jQuery.

Have you thought about using tidy.net? You could throw your user input into that and see what it comes up with; it is very, very good at turning garbage into something that you actually want. It's a DLL and all managed code, I believe, so you can easily bolt it in.
As for the "no to regexp" bandwagon, I disagree. If the data is limited (you don't say whether it is or not), then you could come up with some rules for at least trying to validate your input string, if not cleaning it up. I suspect, though, that your data could literally be anything, in which case you would be better off using something else, but regexes should not be ruled out completely.
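If the input really is limited to a handful of simple tags, the rule-based idea could look roughly like the sketch below. It is only an illustration: the class name and the tag whitelist are made up for this example, and it assumes that anything not shaped like a whitelisted tag is plain text. Attributes on whitelisted tags are passed through unchecked, so this is not an XSS filter.

```csharp
using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public static class SmartHtmlEncoder
{
    // Hypothetical whitelist: only these tag names are treated as markup.
    // The lookahead makes sure the name ends right there (so <br/> matches, <bold> does not).
    private static readonly Regex AllowedTag = new Regex(
        @"</?(?:a|b|i|em|strong|p|br)(?=[\s/>])[^<>]*>",
        RegexOptions.IgnoreCase);

    public static string Encode(string input)
    {
        var result = new StringBuilder();
        int position = 0;

        foreach (Match tag in AllowedTag.Matches(input))
        {
            // Encode the text between the previous tag and this one.
            result.Append(WebUtility.HtmlEncode(input.Substring(position, tag.Index - position)));
            // Copy the whitelisted tag through unchanged (attributes are NOT checked).
            result.Append(tag.Value);
            position = tag.Index + tag.Length;
        }

        // Encode whatever follows the last tag.
        result.Append(WebUtility.HtmlEncode(input.Substring(position)));
        return result.ToString();
    }
}
```

With this sketch, SmartHtmlEncoder.Encode("<a>Next >></a>") gives <a>Next &gt;&gt;</a>, while a tag that is not on the list, such as <script>, comes out fully encoded.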

You are probably trying to solve the wrong problem. (I know this is not what you want to hear.)
If users are allowed to write unencoded >> and << into HTML, then presumably they would also be able to write <> or <b>, and in that case there is no way you can reliably distinguish between text and markup. (Never mind that this makes you vulnerable to XSS attacks.)
You really have to intercept the text and encode it before it is interpolated into HTML. You should probably explain the workflow leading to your problem. There must be a better way to solve it.
Edit in response to comment: There is simply no way to reliably encode input which can be either text or HTML at the same time. Anyway, if users are technical enough to enter raw HTML, presumably they are able to write entities; otherwise they shouldn't be entering raw HTML in the first place. If HTML input is only for advanced users, then you could have a check-box that indicates whether the input is text or HTML. But you should probably look into using a rich-text editor.

I would probably try to write a good regular expression for this. Are you doing this in code behind (C#) or on client-side with JavaScript?
http://www.regular-expressions.info/

Related

In C#, how to prevent XSS while allowing HTML input, including br's?

I've been using MS's AntiXSS library for a while now. Recently I decided to change the textareas in my site to be plain textareas (used to be WYSIWYG), and run a conversion on the newlines to br's.
Problem is, MS's AntiXSS library doesn't support this... it strips out the br's. I don't want to let the user's entry go directly into my DB unchecked. Without using the MS AntiXSS library, what's a reliable way to prevent XSS while allowing HTML input, including br's (in C#)?
You can disable your AntiXSS for this field and store directly the input from the user in your database.
That way, you'll be able to render this text on any output and not only HTML.
Now, when you want to display this text on an HTML page using ASP.NET MVC Razor, you can use something like this:
@Html.Raw(Html.Encode(Model.MyMultilineTextField).Replace("\n", "<br />"))
Html.Encode encodes the text so that HTML tags are not interpreted and XSS is not possible. Html.Raw then keeps Razor from encoding the inserted <br /> tags a second time.
You may add an extension method on Html that does the transformation (with the replace) for you. You may also want to handle \r.
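The extension-method idea could be sketched as follows. To keep the example free of MVC dependencies, it is written as a plain string extension rather than an extension on Html; the class and method names are invented for illustration, and WebUtility.HtmlEncode stands in for Html.Encode.

```csharp
using System;
using System.Net;

public static class MultilineTextExtensions
{
    // Hypothetical helper: encode first, then insert the only markup we trust.
    public static string ToEncodedHtmlWithBreaks(this string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return string.Empty;
        }

        string encoded = WebUtility.HtmlEncode(text);

        // Normalize \r\n and lone \r to \n, then turn every \n into a <br /> tag.
        return encoded
            .Replace("\r\n", "\n")
            .Replace("\r", "\n")
            .Replace("\n", "<br />");
    }
}
```

In a Razor view the result would still need Html.Raw around it, for example @Html.Raw(Model.MyMultilineTextField.ToEncodedHtmlWithBreaks()), because the string already contains the markup that should reach the browser.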
Is it possible to get a copy of the AntiXSS output? If so, run your input through AntiXSS, make the replacement afterward, and store the data yourself.
To resolve this, I decided to store the raw HTML as-is, performing a replace of Environment.NewLine with <br /> before storing it.
Then on the flip side, when showing it to visitors I use the MS AntiXSS code to clean it up. Not 100% the ideal way I'd like to do it, but gets the job done.
I do a bit of caching here to make sure it's not running through AntiXSS on every request too.

Detect HTML in ASP.NET

(clarification: this is an old question that has been tweaked for admin purposes)
There have been a fair amount of questions on this site about parsing HTML from textareas and whatnot, or not allowing HTML in Textboxes. My question is similar: How would I detect if HTML is present in the textbox? Would I need to run it through a regular expression of all known HTML tags? Is there a current library for .NET that has the ability to detect when HTML is inserted into a Textarea?
Edit: Similarly, is there a JavaScript Library that does this?
Edit #2: Due to the way that the web app works (it validates textarea text on asynchronous postback using the Validate method of ASP.NET), it bombs before it can get back to the code-behind to use HTML.Encode. My concern was trying to find another way of handling HTML in those instances.
Not really an answer, but why do you need it at all? You need to sanitize HTML input only if you are going to output it without modifications, i.e. if you want your users to actually be able to use HTML. And if you want that, you do not have to "detect" HTML; you just need to make sure that you handle it safely. Jeff Atwood has a good routine for this.
If you want to prevent HTML output altogether, you can take whatever the user inputs, without any checks. Just take care to HtmlEncode it, and store it that way. Then your output will not contain any "real" HTML from what the user wrote.
Yes, a regular expression is probably the easiest way to do that.
One regex would be: <([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1> (use case-insensitive matching, since the pattern lists only uppercase letters)
You can run that in both ASP.NET and JavaScript. The .NET Framework class you use is System.Text.RegularExpressions.Regex
Hope that helps!
bool containsHtml = Regex.IsMatch(MyTextbox.Text, @"<(.|\n)*?>");
As far as I know, you cannot paste HTML into a TextArea and have it work automatically, at least in .NET 2.0. ASP.NET's request validation rejects input that looks like markup. You need to set the ValidateRequest page directive to false to accept it.
If you want to allow HTML tags and want to pick from a possible list of tags, I suggest you lookup 'Markdown' and this Jeff Atwood Post.
+1 Sunny. “detecting” HTML is a fool's errand. You need to escape it on output, and as long as you're doing that you're safe. If you're not escaping it, sanitisation hacks aren't going to make you secure, they're just going to obfuscate the problem.
 Due to the way that the web app works (it validates textarea text on asynchronous postback using the Validate method of ASP.NET)
Yeah, you'll want to stop doing that. ASP.NET's “request validation” is utterly bogus and needs to be turned off if you want to have any chance of processing uploaded content consistently.
Well, in HTML you can't do a lot without a less than symbol "<".
So, I would look for a less than symbol followed by some characters followed by a greater than symbol. If you find that, you can be fairly sure that it is HTML.
I don't think you have to look for specific tags, as HTML will ignore invalid tags as part of the specification and it would still be considered HTML.
EDIT: Oops! Almost forgot... the ampersand character! If you see one in the text, you MIGHT have HTML, since it is used to specify special characters (like &copy; for ©). This can be dangerous because the user could specify &lt; for <, so it might turn into HTML later...
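Pulling the regex suggestions and the ampersand caveat together, a detection helper might look something like the sketch below. The class and method names are made up, and the result is only a heuristic: as other answers point out, it is escaping on output that keeps the page safe, not detection.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlDetector
{
    // "<", an optional "/", a letter, then anything up to the next ">".
    private static readonly Regex TagLike = new Regex(
        @"</?[a-z][^<>]*>", RegexOptions.IgnoreCase);

    // Named or numeric character references such as &copy;, &#169; or &#xA9;.
    private static readonly Regex EntityLike = new Regex(
        @"&(?:[a-z][a-z0-9]*|#[0-9]+|#x[0-9a-f]+);", RegexOptions.IgnoreCase);

    // Heuristic only: true when the text contains something shaped like a tag or an entity.
    public static bool LooksLikeHtml(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return false;
        }

        return TagLike.IsMatch(input) || EntityLike.IsMatch(input);
    }
}
```

Requiring a letter straight after the "<" keeps ordinary text such as "1 < 2 and 3 > 2" from being flagged, which the looser <(.|\n)*?> pattern would match.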

markup or RTF? in (ASP).NET with C#

I am developing a site and I would like some simple markup. I would need to keep the user's newlines (easy enough: replace \n with <br> or use <pre>), a way to allow links, and perhaps bold.
Would it be best to use a markup language or RTF? I was thinking I might want special tokens like :username: to create a link to a user, or :icon-username: to display a link and the avatar of the user. Maybe other things like that.
Is there a good markup library I can use, or should I find something that allows the user to write in RTF and run a pass before displaying it to output links, icons, etc.?
What libraries do you like and think I should use?
My personal preference is markdown/textile, and perhaps something like the open source WMD editor I am using to type in this message.
I am not sure having the users write RTF is a good choice. How many people are comfortable with the syntax?
HTML/XHTML would be much better. Plus you'd have the choice of one of the dozens of browser 'editor' components out there for WYSIWYG editing.
Use whatever syntax you want, but include an icon for adding items to the editor. E.g. there can be a 'username' icon, where clicking it would add ':username:' to the editor, similar to Stack Overflow's editor toolbar.
If you need RTF in the future, HTML/XHTML can be converted to RTF using third-party libraries. I've used XHTML in that capacity before and it actually worked out well. The hardest part was parsing the CSS (not hard at all). The XHTML was taken care of with a standard XML parser.
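Whichever markup is chosen, the :username: and :icon-username: tokens from the question could be expanded in a separate pass, roughly as sketched here. The class name and the /users/ and /avatars/ URL scheme are pure assumptions for illustration; the input is HTML-encoded first so that only the generated links and images end up as markup.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class UserTokenExpander
{
    // :name: or :icon-name:, where name starts with a letter
    // (so a time such as 10:30:45 is not mistaken for a token).
    private static readonly Regex Token = new Regex(@":(icon-)?([A-Za-z][A-Za-z0-9_]*):");

    public static string Expand(string text)
    {
        // Encode first so that any HTML typed by the user is displayed, not interpreted.
        string encoded = WebUtility.HtmlEncode(text);

        return Token.Replace(encoded, match =>
        {
            string user = match.Groups[2].Value;

            // Hypothetical URL scheme; adjust both paths to the real site.
            string link = "<a href=\"/users/" + user + "\">" + user + "</a>";

            return match.Groups[1].Success
                ? "<img src=\"/avatars/" + user + ".png\" alt=\"\" /> " + link
                : link;
        });
    }
}
```

A real implementation would probably also check that the user exists before emitting a link, since any word wrapped in colons matches the pattern.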

Self learning regular expression or xpath query?

Is it possible to write code which generates a regular expression or XPath that parses links based on some HTML document?
What I want is to parse a page for some links. The only thing I know is that the majority of the links on the page are the links I'm after.
For a simple example, take a Google search engine results page, for example this. The majority of the links are from the search results and look something like this:
<h3 class="r"><a onmousedown="return rwt(this,'','','res','1','AFQjCNERidL9Hb6OvGW93_Y6MRj3aTdMVA','')" class="l" href="http://stackoverflow.com/"><em>Stack Overflow</em></a></h3>
Is it possible to write code that learns this and recognizes this and is able to parse all links, even if Google changes their presentation?
I'm thinking of parsing out all links, and looking X chars before and after each tag and then work from that.
I understand that this could also be done with XPath, but the question is still the same. Can I parse this content and generate a valid XPath to find the SERP links?
As I understand them, most machine learning algorithms work best when they have many examples from which they generalize an 'intelligent' behavior. In this case, you don't have many examples. Google isn't likely to change their format often. Even if it feels often to us, it's probably not enough for a machine learning algorithm.
It may be easier to monitor the current format and if it changes, change your code. If you make the expected format a configurable regular expression, you can re-deploy the new format without rebuilding the rest of your project.
If I understand your question, there's really no need to write a learning algorithm. Regular expressions are powerful enough to pick this up. You can get all the links in an HTML page with the following regular expression:
(?<=href=")[^"]+(?=")
Verified in Regex Hero, this regular expression uses a positive lookbehind and a positive lookahead to grab the url inside of href="".
If you want to take it a step further you can also look for the anchor tag to ensure you're getting an actual anchor link and not a reference to a css file or something. You can do that like this:
(?<=<a[^<]+href=")[^"]+(?=")
This should work fine as long as the page follows the href="" convention for the links. If they're using onclick events, then everything becomes more complicated, as you're going to be dealing with the unpredictability of JavaScript. Even Google doesn't crawl JavaScript links.
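In C#, applying the anchor-only pattern could look roughly like this. The class and method names are invented for the example, and the lookbehind has been tightened slightly (<a followed by whitespace, and no < or > before href) so that it cannot run across tag boundaries. Note that the pattern relies on variable-length lookbehind, which the .NET regex engine supports but many other engines do not.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Lookbehind: an <a ...> tag up to href="; lookahead: the closing quote.
    private static readonly Regex AnchorHref = new Regex(
        @"(?<=<a\s[^<>]*href="")[^""]+(?="")", RegexOptions.IgnoreCase);

    // Returns the href value of every <a> tag that uses double-quoted attributes.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();

        foreach (Match match in AnchorHref.Matches(html))
        {
            links.Add(match.Value);
        }

        return links;
    }
}
```

Run against the sample result markup from the question, this should pick out http://stackoverflow.com/ while ignoring href attributes on other elements such as <link>.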
Does that help?

How to allow simple HTML tags in comments or anywhere?

In my web application I am developing a comment feature where users can post comments. The problem is that I want to allow simple HTML tags in the comment box: tags like <b>, <strong>, <i>, <em>, <u>, etc., that are normally allowed in a comment box. I also want line breaks entered by the user to be converted automatically into <br /> tags and stored in the database, so that when I display the comments on the web page they look the way the user entered them.
Can you please tell me how to restrict the input to the allowed set of HTML tags, how to convert line breaks into <br /> tags, and how to store the result in the database?
Or does anyone have a better idea or suggestion for implementing this kind of functionality? I am using ASP.NET 2.0 (C#).
I noticed that StackOverflow.com is doing the same thing on Profile Editing. When we edit our profile then below the "About Me" field "basic HTML allowed" line is written, I want to do almost the same functionality.
I don't have a C# specific answer for you, but you can go about it a few different ways. One is to let the user input whatever they want, then you run a filter over it to strip out the "bad" html. There are numerous open source filters that do this for PHP, Python, etc. In general, it's a pretty difficult problem, and it's best to let some well developed 3rd party code do this rather than write it yourself.
Another way to handle it is to allow the user to enter comments in some kind of simpler markup language like BBCode, Textile, or Markdown (Stack Overflow is using Markdown), perhaps in conjunction with a nice JavaScript editor. You then run the user's text through a processor for one of these markup languages to get the HTML. You can usually obtain implementations of these processors for whatever language you are using. These processors usually strip out the "bad" HTML.
It's rather "simple" to do that in PHP and Python because of the large number of built-in functions. I am still learning C# and haven't yet come across an equivalent, but chances are that it exists and all you need to do is search for it. I mean a function that can take the user input, search for the allowed tags (held in an array, of course), and replace their <> with something else like [], then use a function to escape the other HTML tags. In PHP we use htmlentities().
Something like
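$txt = $_POST['comment'];
// Swap the allowed tag for a placeholder token so it survives escaping
$txt = preg_replace('#<(/?)b>#i', '[$1b]', $txt);
// Escape everything else
$securetxt = htmlentities($txt);
// Turn the placeholder tokens back into real tags
$finaltxt = preg_replace('#\[(/?)b\]#i', '<$1b>', $securetxt);
// Now save $finaltxt to the DB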
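A rough C# counterpart of the same idea is sketched below, with the placeholder step folded away: everything is encoded first, and then only the encoded forms of a few attribute-free tags are turned back into markup. The class name and the tag list are assumptions for illustration. WebUtility.HtmlEncode needs .NET 4.0 or later; on older frameworks HttpUtility.HtmlEncode from System.Web would play the same role. The sketch does not check that tags are balanced.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class CommentFormatter
{
    // Encoded forms of the attribute-free tags we are willing to restore.
    private static readonly Regex EncodedAllowedTag = new Regex(
        @"&lt;(/?)(b|strong|i|em|u)&gt;", RegexOptions.IgnoreCase);

    public static string Format(string comment)
    {
        // 1. Escape everything, so no user-supplied markup survives.
        string encoded = WebUtility.HtmlEncode(comment ?? string.Empty);

        // 2. Restore only the whitelisted tags, and only when they carry no attributes.
        string restored = EncodedAllowedTag.Replace(encoded, "<$1$2>");

        // 3. Normalize line endings and turn each line break into a <br /> tag.
        return restored
            .Replace("\r\n", "\n")
            .Replace("\r", "\n")
            .Replace("\n", "<br />");
    }
}
```

Because a tag such as <b onclick="..."> never matches the restore pattern, it stays encoded and is shown to the reader as text.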
I'm not sure, but I think you have to escape html characters when inserting in database and when retrieving echo them unescaped, so the browser can see it just like html.
I don't know ASP.NET, but in PHP there's an easy function, strip_tags, that lets you add exceptions (in your case, b, em, etc.). If there's nothing like that in C#, you can write a regular expression that strips out all tags except the allowed ones, but chances are that such an expression already exists, so it should be easy to find.
Replacing \n (or something similar) with br shouldn't be a problem either, with a simple search and replace.
This is a dangerous road to go down. You might think you can write some awesome regexes, or find someone who can help you with them, but sanitizing SOME markup and leaving the rest is just crazy talk.
I highly recommend you look into BBCode or another token system. Even something untokenized such as what SO uses, is probably a much better solution.
