string replacement in page created from template - c#

I've got some aspx pages being created by the user from a template. Included is some string replacement (anything with ${fieldname}), so a portion of the template looks like this:
<%
string title = @"${title}";
%>
<title><%=HttpUtility.HtmlEncode(title) %></title>
When an aspx file is created from this template, the ${title} gets replaced by the value the user entered.
But obviously they can inject arbitrary HTML by just closing the double quote in their input string. How do I get around this? I feel like it should be obvious, but I can't figure a way around this.
I have no control over the template instantiating process -- I need to accept that as a given.

Can you store their values in another file (XML, maybe) or in a database? That way their input is not compiled into your page; you just read the data into variables. Then all you have to worry about is HTML, which your HtmlEncode call would take care of.

If they include a double quote in their string, that will not inject arbitrary HTML, but arbitrary code, which is even worse.
You can use a regex to filter the input string. I would use an inclusive regex rather than trying to exclude dangerous characters: only allow A-Za-z0-9 and whitespace.
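A minimal sketch of that allow-list check, assuming the value can be validated before it reaches the template step. `TitleValidator` and `IsSafeTitle` are hypothetical names used only for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleValidator
{
    // Accept only ASCII letters, digits and whitespace.
    // \A and \z anchor the match to the whole input, so a single
    // stray quote or angle bracket anywhere makes the check fail.
    private static readonly Regex Allowed = new Regex(@"\A[A-Za-z0-9\s]+\z");

    public static bool IsSafeTitle(string input)
    {
        // Null and empty values are rejected as well.
        return input != null && Allowed.IsMatch(input);
    }
}
```

Rejecting the input outright is simpler to reason about than trying to repair it, since nothing that could close the string literal ever gets written into the generated page.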

Not sure I understand fully, but...
Try using a regex to strip HTML from the title instead of HTML-encoding it:
public string StripHTML(string text)
{
return Regex.Replace(text, @"<(.|\n)*?>", string.Empty);
}
Is this possible?
<%
string title = Regex.Replace(@"${title}", @"<(.|\n)*?>", string.Empty);
%>
or
<title><%=HttpUtility.HtmlEncode(System.Text.RegularExpressions.Regex.Replace(title, @"<(.|\n)*?>", string.Empty)) %></title>

Related

Search for a url in a string and remove enclosing tag

I have a string in which I need to search for a url and get its immediate enclosing script tag and remove it.
example
string test="<script>test</script><script
src="https://cdn.getsmartcontent.com/xxxxx.js"></script><script></script>"
should give
string test="<script>test</script><script></script>"
The xxxxx.js can be any alphanumeric name
The correct answer is to use the HTML Agility Pack and parse the HTML properly.
However, in regard to your comment
I have 13000 sharepoint sites and for each site I have to parse the
master page and remove the above specific script tag
You can use something nasty like this, I guess :/
Regex.Replace(yourPage, @"<script src=""https://cdn.getsmartcontent.com/.+?\/script>", String.Empty);
You could use the multiline option, I guess, but this is still a bad idea.
Note/Disclaimer/Warning : See bold comment

Format HtmlEncoded text to ASP

I am taking a string from the database, which will then be HTML-encoded. How do I format newlines and tabs?
I don't think I will be able to use CSS because it is only one string (unless CSS can be used to replace a substring).
One way I've tried is putting <br> and &nbsp; inside the text in the database and then using HttpUtility.HtmlDecode to format it, but I am not sure it is the right way.
Any suggestion and feedback is welcomed.
If you are getting an HTML-encoded string from the database, you just have to use HtmlDecode to decode it, and the tabs and newlines will be placed.
Before that, check whether the string is actually HTML-encoded or whether some other encoding has been used.

How can i add double quotes to a string?

I want to add double quotes to a string. I know that we can add double quotes by using \". My string is
string scrip = "$(function () {$(\"[src='" + names[i, 0] + "']\"" + ").pinit();});";
When I do this, in the browser I am getting &quot instead of " quotes. How can I overcome this problem?
If your browser displays "&quot" instead of a " character, then there are only a few possible causes. The character should have been emitted to the browser either as itself or as the HTML entity &quot;. Please note the semicolon at the end. If a browser sees such 'code', it presents a quote. This makes it easier to write HTML whose attributes need to contain special characters. Compare:
<div attribute="blahblahblah" />
if you want to put a " into the blahs, it'd terminate the attribute's notation, and the HTML code would break. So, adding a single " character should look like:
<div attribute="blah&quot;blahblah" />
Now, if you miss the semicolon, the browser will display blah&quotblahblah instead of blah"blahblah.
I've just noticed that your code is actually gluing together JavaScript code. In JavaScript, the semicolon is a statement terminator, so probably there is actually a &quot; in the emitted HTML and it is just improperly presented in the error message... Or maybe you have forgotten to open/close some quotes in the JavaScript, and the semicolon is actually treated as a statement terminator?
Also be sure to check why the JavaScript code undergoes HTML-entity translation. Usually, <script> blocks are not reparsed. Are you setting that JavaScript code as an HTML element attribute, like OnClick or OnSend? Then stop doing it now. Create a JavaScript function with this code and call that function from the click/send instead. It is not worth encoding long JS expressions into an attribute! It's just a waste of time and nerves.
If all else fails and the JavaScript is emitted correctly, then look for any text-correcting, text-highlighting or text-formatting modules you have on your site. It is quite probable that one of them is misreading the HTML entities and removing the semicolon, or the opposite: adding entities where they are not needed. ASP.NET itself generally does its job right and translates the entities correctly wherever they are needed, so I'd look at the other libraries first.
You can use something like this:
String str = @"hello,,?!";
This should escape all characters
Or
String TestString = "This is a <Test String>.";
String EncodedString = Server.HtmlEncode(TestString);
Here's the manual: http://msdn.microsoft.com/en-us/library/w3te6wfz.aspx
What else are you doing with the string?
It seems that somewhere after that the string gets encoded. You could use HttpUtility.HtmlDecode(str); but first you'll have to figure out where your string gets encoded in the first place.
Keep in mind that if you use <%: %> in ASPX or @yourvar in Razor, the value gets encoded automatically. You'll have to use @Html.Raw(yourvar) to suppress that.

HTML Decode and Encode

I have tried to decode the HTML text that I have in the database in my MVC 3 Razor application.
The HTML text in the database is not encoded.
I tried HttpUtility.HtmlDecode and Server.HtmlDecode, but neither of them works.
Finally I managed to make it work with Html.Raw(string).
sample of non working code
@Server.HtmlDecode(item.ShortDescription)
@HttpUtility.HtmlDecode(item.ShortDescription)
Do you know why HtmlDecode cannot be used in my case?
I thought this would save some one else from looking for few hours.
It works just fine to decode the text, but then it will automatically be encoded again when it's put in the page using the @ syntax.
The Html.Raw method wraps the string in an HtmlString, which tells the razor engine not to encode it when it's put in the page.
If you want to display the value as-is without any HTML encoding you could use the Html.Raw helper:
@Html.Raw(item.ShortDescription)
Be warned, though, that by doing this you are opening your site to XSS attacks, so you should be very careful about what HTML this ShortDescription property contains. If it is the user who enters it, you should absolutely ensure that it is safe. You could use the AntiXss library for this.
Do you know why HtmlDecode cannot be used in my case?
Because HtmlDecode returns a string, and when you feed a string to the @() Razor function it automatically HTML-encodes it again and ruins your previous efforts. That's why the Html.Raw helper exists.
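The decode-then-re-encode effect can be reproduced outside Razor. The sketch below uses System.Net.WebUtility as a stand-in for the encoding Razor applies to plain strings; that substitution is an assumption for illustration (Razor's own encoder may differ in detail), and `RazorLikeOutput` and `Emit` are hypothetical names:

```csharp
using System;
using System.Net;

public static class RazorLikeOutput
{
    // Simulates emitting a plain string with @: the value is HTML-encoded on output.
    public static string Emit(string value)
    {
        return WebUtility.HtmlEncode(value);
    }

    // Decoding first does not help: the decoded string is still a plain
    // string, so it is encoded again when it is written to the page.
    public static string DecodeThenEmit(string value)
    {
        return Emit(WebUtility.HtmlDecode(value));
    }
}
```

For unencoded markup such as `<b>Sale</b>`, both methods return the same escaped text, which is why the tags show up literally on the page until the value is wrapped with Html.Raw.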

How Can I strip HTML from Text in .NET?

I have an asp.net web page that has a TinyMCE box. Users can format text and send the HTML to be stored in a database.
On the server, I would like to strip the HTML from the text so I can store only the text in a full-text indexed column for searching.
It's a breeze to strip the html on the client using jQuery's text() function, but I would really rather do this on the server. Are there any existing utilities that I can use for this?
EDIT
See my answer.
I downloaded the HtmlAgilityPack and created this function:
string StripHtml(string html)
{
// create whitespace between html elements, so that words do not run together
html = html.Replace(">","> ");
// parse html
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
// strip html decoded text from html
string text = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);
// replace all whitespace with a single space and remove leading and trailing whitespace
return Regex.Replace(text, @"\s+", " ").Trim();
}
Take a look at this Strip HTML tags from a string using regular expressions
Here's Jeff Atwood's RefactorMe code link for his Sanitize HTML method
TextReader tr = new StreamReader(@"Filepath");
string str = tr.ReadToEnd();
str= Regex.Replace(str,"<(.|\n)*?>", string.Empty);
but you need to have a namespace referenced, i.e.:
System.Text.RegularExpressions
Take only this logic for your website.
If you are just storing text for indexing then you probably want to do a bit more than just remove the HTML, such as ignoring stop-words and removing words shorter than (say) 3 characters. However, a simple tag and entity stripper I once wrote goes something like this:
public static string StripTags(string value)
{
if (value == null)
return string.Empty;
string pattern = @"&.{1,8};";
value = Regex.Replace(value, pattern, " ");
pattern = @"<(.|\n)*?>";
return Regex.Replace(value, pattern, string.Empty);
}
It's old and I'm sure it can be optimised (perhaps using a compiled reg-ex?). But it does work and may help...
You could:
Use a plain old TEXTAREA (styled for height/width/font/etc.) rather than TinyMCE.
Use TinyMCE's built-in configuration options for stripping unwanted HTML.
Use HtmlDecode(Regex.Replace(mystring, "<[^>]+>", "")) on the server.
As you may have malformed HTML in the system: BeautifulSoup or similar could be used.
It is written in Python; I am not sure how it could be interfaced - using the .NET language IronPython?
You can use HTQL COM, and query the source with a query:
<body> &tx;
You can use something like this
string strwithouthtmltag;
strwithouthtmltag = Regex.Replace(strWithHTMLTags, "<[^>]*>", string.Empty);
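The regex-based answers above can be combined into one helper. This is only a sketch: it borrows the space-between-elements and whitespace-collapsing steps from the HtmlAgilityPack answer but uses a plain regex instead of a parser, so it will not cope with malformed markup, script bodies, or a ">" inside an attribute value. `HtmlText` and `Strip` are hypothetical names, and System.Net.WebUtility is used so the example has no dependency on System.Web:

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    public static string Strip(string html)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        // Replace each tag with a space so adjacent words do not run together.
        string noTags = Regex.Replace(html, "<[^>]*>", " ");

        // Turn entities such as &amp; back into the characters they stand for.
        string decoded = WebUtility.HtmlDecode(noTags);

        // Collapse runs of whitespace and trim both ends.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

Decoding after the tags are removed, rather than before, keeps encoded text such as `&lt;b&gt;` from being mistaken for a real tag and stripped.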
