How Can I strip HTML from Text in .NET? - c#

I have an asp.net web page that has a TinyMCE box. Users can format text and send the HTML to be stored in a database.
On the server, I would like to strip the HTML from the text so I can store only the text in a Full Text indexed column for searching.
It's a breeze to strip the html on the client using jQuery's text() function, but I would really rather do this on the server. Are there any existing utilities that I can use for this?
EDIT
See my answer.

I downloaded the HtmlAgilityPack and created this function:
string StripHtml(string html)
{
    // create whitespace between html elements, so that words do not run together
    html = html.Replace(">", "> ");
    // parse html
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    // take the inner text of the document and decode any html entities in it
    string text = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);
    // replace all whitespace with a single space and remove leading and trailing whitespace
    return Regex.Replace(text, @"\s+", " ").Trim();
}

Take a look at this Strip HTML tags from a string using regular expressions

Here's Jeff Atwood's RefactorMe code link for his Sanitize HTML method

TextReader tr = new StreamReader(@"Filepath");
string str = tr.ReadToEnd();
str = Regex.Replace(str, "<(.|\n)*?>", string.Empty);
For this to compile you need to reference the following namespace:
System.Text.RegularExpressions
only take this logic for your website

If you are just storing text for indexing then you probably want to do a bit more than just remove the HTML, such as ignoring stop-words and removing words shorter than (say) 3 characters. However, a simple tag and entity stripper I once wrote goes something like this:
public static string StripTags(string value)
{
    if (value == null)
        return string.Empty;
    // replace entities such as &nbsp; or &amp; with a space
    string pattern = @"&.{1,8};";
    value = Regex.Replace(value, pattern, " ");
    // remove anything that looks like a tag
    pattern = @"<(.|\n)*?>";
    return Regex.Replace(value, pattern, string.Empty);
}
It's old and I'm sure it can be optimised (perhaps using a compiled reg-ex?). But it does work and may help...
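As a rough sketch of the "compiled regex" idea, the variant below uses only the base class library. The names PlainText and FromHtml are made up for illustration. It swaps in WebUtility.HtmlDecode to decode entities instead of blanking them out, and it replaces tags with a space so words do not run together. It is still a regex approach, so it will not cope with script/style content or badly malformed markup.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class PlainText
{
    // Compiled once and reused across calls.
    private static readonly Regex Tags = new Regex(@"<[^>]*>", RegexOptions.Compiled);
    private static readonly Regex Whitespace = new Regex(@"\s+", RegexOptions.Compiled);

    public static string FromHtml(string html)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        // Replace each tag with a space so adjacent words do not run together.
        string text = Tags.Replace(html, " ");
        // Decode entities such as &amp; rather than discarding them.
        text = WebUtility.HtmlDecode(text);
        // Collapse runs of whitespace and trim both ends.
        return Whitespace.Replace(text, " ").Trim();
    }
}
```

One trade-off: because every tag becomes a space, inline markup in the middle of a word (for example hel<b>lo</b>) is split into two words.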

You could:
Use a plain old TEXTAREA (styled for height/width/font/etc.) rather than TinyMCE.
Use TinyMCE's built-in configuration options for stripping unwanted HTML.
Use HttpUtility.HtmlDecode(Regex.Replace(mystring, "<[^>]+>", "")) on the server.

As you may have malformed HTML in the system: BeautifulSoup or similar could be used.
It is written in Python; I am not sure how it could be interfaced - using the .NET language IronPython?

You can use HTQL COM, and query the source with a query:
<body> &tx;

You can use something like this
string strwithouthtmltag;
strwithouthtmltag = Regex.Replace(strWithHTMLTags, "<[^>]*>", string.Empty);

Related

Regex to get values between Double Quotes

I have a value that I am pulling from a database:
<iframe width="420" height="315" src="//www.youtube.com/embed/8GRDA1gG8R8" frameborder="0" allowfullscreen></iframe>
I am trying to get the src as a value using regex.
Regex.Match(details.Tables["MarketingDetails"].Rows[0]["MarketingVideo"].ToString(), "\\\"([^\\\"]*)\\\"").Groups[2].Value
That is how I am currently writing it.
How would I write this to pull the correct value of src?
You could do it like this....
Match match = Regex.Match(@"<iframe width=""420"" height=""315"" src=""//www.youtube.com/embed/8GRDA1gG8R8"" frameborder=""0"" allowfullscreen></iframe>", @"src=(\""[^\""]*\"")");
// Note: group 1 includes the surrounding double quotes.
Console.WriteLine(match.Groups[1].Value);
However, as others have already commented on your question... it's better practice to use an actual html parser.
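If the regex route is kept despite that caveat, a named group placed inside the quotes returns the bare URL. This is only a sketch: SrcExtractor and GetSrc are illustrative names, and it assumes the attribute value is always double-quoted.

```csharp
using System.Text.RegularExpressions;

public static class SrcExtractor
{
    // Returns the value of the first double-quoted src attribute, or null if none is found.
    public static string GetSrc(string html)
    {
        Match m = Regex.Match(
            html,
            @"\bsrc\s*=\s*""(?<src>[^""]*)""",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups["src"].Value : null;
    }
}
```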
Don't use regex to parse XML or HTML. It's not worth it. I'll let you read this post, which sort of exaggerates the point, but the main thing to keep in mind is that you can get into a lot of trouble with regex and HTML.
So, instead you should use an actual html/xml parser! For starters, use XElement, a class built into the .net framework.
string input = "<iframe width=\"420\" height=\"315\" src=\"//www.youtube.com/embed/8GRDA1gG8R8\" frameborder=\"0\" allowfullscreen=''></iframe>";
XElement html = XElement.Parse(input);
string src = html.Attribute("src").Value;
This will make src have the value //www.youtube.com/embed/8GRDA1gG8R8. You can then split that up to get whatever you need from it.
I should also note that your input is not valid xml. allowfullscreen does not have a value attached, which is why I added =''.
If you need to handle more complex markup, such as your original input, use an HTML parser (XElement is meant for XML). Use the Html Agility Pack like this (using the previous example):
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(input);
string src = doc.DocumentNode
.Element("iframe")
.Attributes["src"]
.Value;
This parser is more forgiving of invalid, incorrect, or just irregular input. It will parse your original input just fine, even without the added =''.

How to Remove all the HTML tags and display a plain text using C#

I want to remove all HTML tags from a string. I can achieve this using a regex,
but if the string contains a number inside angle brackets, such as <100>, the number should not be removed.
var withHtml = "<p>hello <b>there<1234></b></p>";
var withoutHtml = Regex.Replace(withHtml, "\\<[^\\>]*\\>", string.Empty);
Result: hello there
but needed output :
hello there 1234
Your example of HTML isn't valid HTML since it contains a non-HTML tag. I figure you intended for the angle-brackets to be encoded.
I don't think regular expressions are suitable for HTML parsing. I recommend using an HTML parser such as HTML Agility Pack to do this.
Here's an example:
var withHtml = "<p>hello <b>there<1234></b></p>";
var document = new HtmlDocument();
document.LoadHtml(withHtml);
var withoutHtml = HtmlEntity.DeEntitize(document.DocumentNode.InnerText);
Just add the HtmlAgilityPack NuGet package and a reference to System.Xml to make it work.
Not sure you can do this in one regular expression, or that a regex is really the correct way as others have suggested. A simple improvement that gets you almost there is:
Regex.Replace(withHtml, "\\<[^\\>0-9]*\\>", string.Empty);
This gives "hello there<1234>". You then just need to replace the remaining angle brackets.
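A slightly different two-step ordering can be sketched as follows: first rewrite the numeric pseudo-tags as plain text, then strip every remaining tag. This is a swapped-in variation rather than the exact regex above; it has the side benefit that real tags containing digits, such as <h1>, are still removed. TagStripper and StripKeepingNumbers are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class TagStripper
{
    public static string StripKeepingNumbers(string html)
    {
        // Step 1: turn digit-only pseudo-tags such as <1234> into " 1234".
        string text = Regex.Replace(html, @"<(\d+)>", " $1");
        // Step 2: remove every remaining tag.
        text = Regex.Replace(text, @"<[^>]*>", string.Empty);
        return text.Trim();
    }
}
```

For the input from the question this yields the requested "hello there 1234".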

Regex for Removing Comma between <a> tag text C#

I have the following string. I have tried many regexes to remove the comma from the text of an <a> tag, but none of them worked. Whenever the text inside an <a> tag contains a comma, I want the comma to be replaced by an empty string.
Getty Center, Restaurant at the
I have tried this regex, but it is not working. Here input is a string that contains HTML.
input = Regex.Replace(input, @"<a(\s+[^>]*)?>[^\w\s]</a(\s+[^>]*)?>", "");
Please help me out. Thank you.
You can use the Regex to find and modify the content of the tag like so.
var input = "Getty Center, Restaurant at the";
var regex = new Regex(@"<a[^>]*>(?<content>.*?)</a[^>]*>",
    RegexOptions.Singleline);
var match = regex.Match(input);
while (match.Success) {
    var group = match.Groups["content"];
    // rebuild the string with the commas removed from the tag content
    input = input.Substring(0, group.Index)
        + group.Value.Replace(",", "")
        + input.Substring(group.Index + group.Length);
    // continue searching after the content that was just rewritten
    match = regex.Match(input, group.Index);
}
The loop is in place to catch multiple tags in the same string. The Regex however is fairly naive. It will mess with tags nested inside the A tag, and will parse incorrectly if a > is in any of the attributes. (Though that would probably be bad HTML anyway.) A proper HTML parser is recommended for this reason.
I would suggest to use a HTML parser. There are plenty available which are open source and are free. One of the best I found is HTMLAgilityPack at HTMLAgilityPack
Some examples at Some Examples
In a nutshell, the following code snippet will give you all the img tags:
HtmlDocument myDoc = new HtmlDocument();
myDoc.Load(path);
HtmlNodeCollection imgs = myDoc.DocumentNode.SelectNodes("//img");
Hope that helps
If you want to directly use the replace, you will have to match only the comma and not the text before or after the comma. You'd have to use look ahead and look behind to check if the comma is in the tag. Although this is doable, it is not advised to do this.
An alternative is to use matching groups to match the whole text in the tag and group the comma if it exists and replace the match.
<a[^>]+>[\w\s]*(,?)[\w\s]*<\/a>
The first capture group captures the comma if present. You can test it here: http://rubular.com/r/K2jjIaObty
The best option would be to use a html parser to capture contents of the a tag, search for comma and replace.
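If a regex is used anyway, one way to sketch the "match the whole tag, then rewrite its content" idea is Regex.Replace with a match evaluator, which avoids manual index bookkeeping. The class and method names are illustrative, and the sample markup in the usage note is an assumption, since the original input's tags are not shown. Like the other regex answers, it does not handle anchors nested inside anchors or a > inside an attribute value.

```csharp
using System.Text.RegularExpressions;

public static class AnchorText
{
    // \b after "a" keeps tags such as <abbr> from matching.
    private static readonly Regex Anchor = new Regex(
        @"(?<open><a\b[^>]*>)(?<content>.*?)(?<close></a\s*>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string RemoveCommas(string html)
    {
        // Rebuild each anchor with commas removed from its content only.
        return Anchor.Replace(html, m =>
            m.Groups["open"].Value
            + m.Groups["content"].Value.Replace(",", "")
            + m.Groups["close"].Value);
    }
}
```

With a hypothetical input such as <a href="/getty">Getty Center, Restaurant at the</a>, Los Angeles, only the comma inside the link text is removed; the one after the closing tag is kept.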

Convert HTML text to Plain text

I have a text area.
I allow HTML markup to be entered in it, so any HTML code can be entered.
Now I want to convert that HTML code to plain text without using a third-party tool. How can it be done?
Currently I am doing it like this:
var desc = Convert.ToString(Html.Raw(Convert.ToString(drJob["Description"])));
drJob["Description"] is datarow from where I fetch description and I want to convert description to plain text.
There is no direct way to do this built into .NET. You either need to resort to a third-party tool like the Html Agility Pack, or do this in JavaScript.
document.getElementById('myTextContainer').innerText = document.getElementById('myMarkupContainer').innerText;
For your safety, don't use a regex. ( http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html )
You can replace html tags with empty string using System.Text.RegularExpressions.Regex
String desc = Regex.Replace(drJob["Description"].ToString(), @"<[^>]*>", String.Empty);
You can simply use a replace method using regex "<[^>]+>"
using System.Text.RegularExpressions;

private void button1_Click(object sender, EventArgs e)
{
    string sauce = htm.Text; // htm = your html box
    // match runs of text that sit between tags (or at the start/end of the input)
    Regex myRegex = new Regex(@"(?<=^|>)[^><]+?(?=<|$)", RegexOptions.Compiled);
    foreach (Match iMatch in myRegex.Matches(sauce))
    {
        txt.AppendText(Environment.NewLine + iMatch.Value); // txt = your destination box
    }
}
Let me know if you need more clarification.
[EDIT:] Be aware that this is not a clean function, so add a line to clean up empty spaces or line breaks. But the actual getting of text from in-between tags should work fine. If you want to save space - use regex and see if this works for you. Although the person who posted about regex not being clean is right, there might be other ways; Regex is usually better when separating a single type of tag from html. (I use it for rainmeter to parse stuff and never had any issues)

string replacement in page created from template

I've got some aspx pages being created by the user from a template. Included is some string replacement (anything with ${fieldname}), so a portion of the template looks like this:
<%
string title = @"${title}";
%>
<title><%=HttpUtility.HtmlEncode(title) %></title>
When an aspx file is created from this template, the ${title} gets replaced by the value the user entered.
But obviously they can inject arbitrary HTML by just closing the double quote in their input string. How do I get around this? I feel like it should be obvious, but I can't figure a way around this.
I have no control over the template instantiating process -- I need to accept that as a given.
Can you store their values in another file(xml maybe) or in a database? That way their input is not compiled into your page. Then you just read the data into variables. Then all you have to worry about is html, which your html encode would take care of.
If they include a double quote in their string, that will not inject arbitrary HTML, but arbitrary code, which is even worse.
You can use a regex to filter the input string. I would use an inclusive regex rather than trying to exclude dangerous chars. Only allow them A-Za-z0-9 and whitespace.
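A minimal sketch of such an inclusive filter is shown below; TitleFilter and its method names are made up for illustration. Note the assumption that it can run on the user's input before the value is substituted into the template, which may not be possible if the instantiation step is outside your control.

```csharp
using System.Text.RegularExpressions;

public static class TitleFilter
{
    // \A and \z anchor the pattern to the entire string.
    private static readonly Regex Allowed = new Regex(@"\A[A-Za-z0-9\s]*\z");

    // True only if every character is a letter A-Z/a-z, a digit 0-9, or whitespace.
    public static bool IsAllowed(string input)
    {
        return input != null && Allowed.IsMatch(input);
    }

    // Alternative: silently drop every character outside the allowed set.
    public static string RemoveDisallowed(string input)
    {
        return Regex.Replace(input ?? string.Empty, @"[^A-Za-z0-9\s]", string.Empty);
    }
}
```

Rejecting the input with IsAllowed is the stricter choice; RemoveDisallowed trades that for convenience by quietly changing what the user typed.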
Not sure I understand fully, but...
Try using a regex to strip html from the title instead of html encoding it:
public string StripHTML(string text)
{
    return Regex.Replace(text, @"<(.|\n)*?>", string.Empty);
}
Is this possible?
<%
string title = Regex.Replace(@"${title}", @"<(.|\n)*?>", string.Empty);
%>
or
<title><%=HttpUtility.HtmlEncode(System.Text.RegularExpressions.Regex.Replace(title, @"<(.|\n)*?>", string.Empty)) %></title>
