Stripping out malformed HTML from string

Sometimes from a 3rd party API I get malformed HTML elements returned:
olor:red">Text</span>
when I expect:
<span style="color:red">Text</span>
For my context, the text content of the HTML is more important so it does not matter if I lose surrounding tags/formatting.
What would be the best way to strip out the malformed tags such that the first example would read
Text
and the second would not change?

I recommend taking a look at HtmlAgilityPack, which is also a very handy tool for HTML sanitization.
Here is an example approach using that library:
static void Main()
{
var inputs = new[] {
@"olor:red"">Text</span>",
@"<span style=""color:red"">Text</span>",
@"Text</span>",
@"<span style=""color:red"">Text",
@"<span style=""color:red"">Text"
};
var doc = new HtmlDocument();
inputs.ToList().ForEach(i => {
if (!i.StartsWith("<"))
{
if (i.IndexOf(">") != i.Length-1)
i = "<" + i;
else
i = i.Substring(0, i.IndexOf("<"));
doc.LoadHtml(i);
Console.WriteLine(doc.DocumentNode.InnerText);
}
else
{
doc.LoadHtml(i);
Console.WriteLine(doc.DocumentNode.OuterHtml);
}
});
}
Outputs:
Text
<span style="color:red">Text</span>
Text
<span style="color:red">Text</span>
<span style="color:red">Text</span>

If you just need the content of the tags, and no information of what type of tag etc, you could use Regular Expressions:
var r = new Regex(">([^>]+)<");
var text = "olor:red\">Text</span>";
var m = r.Match(text);
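This captures the text between a > and the next < in group 1 (here m.Groups[1].Value is Text). Use r.Matches(text) instead of r.Match(text) to find the inner text of every tag.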
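To collect every match rather than only the first one, the same pattern can be run through Regex.Matches. The following is a minimal sketch; the InnerTextFinder class and FindAll method names are made up for illustration and are not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class InnerTextFinder
{
    // Same pattern as in the answer: text that sits between a '>' and a following '<'.
    private static readonly Regex Inner = new Regex(">([^>]+)<");

    // Returns the captured text of every match, in document order.
    public static List<string> FindAll(string html)
    {
        var results = new List<string>();
        foreach (Match m in Inner.Matches(html))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }
}
```

Note that text which is not enclosed on both sides (plain Text, or <span>Text with no closing tag) produces no match with this pattern, so it only covers some of the malformed cases.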

Very crudely, you could strip out all 'tags' by stripping everything before a > and keeping everything before a <.
I'm assuming you also need to consider the situation where the text you receive has no tags at all, e.g. Text.
In pseudo-code:
returnText = ""
loop:
    gtI = text.IndexOf(">")
    ltI = text.IndexOf("<")
    if -1==gtI and -1==ltI:
        returnText += text
        we're done
    if gtI==-1:
        returnText += text up to position ltI
        return returnText
    if ltI==-1:
        returnText += text after gtI
        return returnText
    if ltI < gtI:
        returnText += textBefore ltI
        text = text after ltI
        loop
    // gtI < ltI:
    text = text after gtI
    loop
It's crude and can be done much better (and faster) with a custom coded parser, but essentially the logic would be the same.
You should really be asking why the API returns only part of what you require: I can't see why it should be returning ext</span> either, which really messes you up.
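The pseudo-code above could be sketched in C# roughly as follows. This is only an illustration of the same crude logic, not a robust parser; the CrudeTagStripper class and StripTags method names are assumptions made for the example.

```csharp
using System;
using System.Text;

public static class CrudeTagStripper
{
    // Keeps text that precedes a '<' and discards everything up to and including a '>'.
    public static string StripTags(string text)
    {
        var result = new StringBuilder();
        while (true)
        {
            int gt = text.IndexOf('>');
            int lt = text.IndexOf('<');

            if (gt == -1 && lt == -1)
            {
                // No brackets left: the remainder is plain text.
                result.Append(text);
                break;
            }
            if (gt == -1)
            {
                // Unterminated tag at the end: keep only what precedes it.
                result.Append(text.Substring(0, lt));
                break;
            }
            if (lt == -1)
            {
                // Tail of a tag with no further tags: keep what follows it.
                result.Append(text.Substring(gt + 1));
                break;
            }
            if (lt < gt)
            {
                // Text before the next tag is content; then step into the tag.
                result.Append(text.Substring(0, lt));
                text = text.Substring(lt + 1);
            }
            else
            {
                // We are inside a (possibly truncated) tag: skip past its '>'.
                text = text.Substring(gt + 1);
            }
        }
        return result.ToString();
    }
}
```

With the inputs from the question, both olor:red">Text</span> and <span style="color:red">Text</span> reduce to Text. Like the pseudo-code, this treats every > as the end of a tag, so a > inside an attribute value or in the text itself will confuse it.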

Related

Determining if string is inside Text Content or part of attributes

I am trying to come up with an algorithm that identifies whether a string is part of the text content of an element or part of the element's attributes.
For example:
<a class="tag tag-red-dark" href="/keywords?q=PARTOFATTRIBUTE"> Found TEXTCONTENT </a>
If you perform regex on TEXTCONTENT or PARTOFATTRIBUTE, you can run this algorithm to check if they are part of the text or part of the attributes:
MatchCollection matches = Regex.Matches(html, @"(?i)TEXTCONTENT");
for (int i = matches.Count-1; i >= 0 ; i--){
Match m = matches[i];
int currentIndex = m.Index;
bool isTextContent = false;
while (html[currentIndex] != '<'){
currentIndex--;
if (html[currentIndex] == '>'){ // text is placed between > and <
isTextContent = true;
break;
}
}
if (isTextContent){
// do something with text content
}else{
// do something with attribute
}
}
But the algorithm is fragile. If your html looks like this:
<a class="tag tag-red-dark" title="a>b" href="/keywords?q=PARTOFATTRIBUTE"> Found TEXTCONTENT </a>
PARTOFATTRIBUTE will be recognized as text, which it is not.
Moreover, you could also have text with < in it, which makes the algorithm think that it found attribute:
<a class="tag tag-red-dark" title="a>b" href="/keywords?q=PARTOFATTRIBUTE"> < Found TEXTCONTENT </a>
Placing < in text without escaping is invalid HTML, which I would like to handle. Placing > in attributes, on the other hand, is valid. Is it possible to determine whether the selected string is part of the attributes or of the text content based solely on the context in which it is placed?

HtmlAgilityPack is not slow, and you do not have to parse the entire page, just the a tag. Since you have probably already extracted the a tags from your HTML, pass in only the HTML that you need parsed.
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml("<a class=\"tag tag - red - dark\" title=\"a > b\" href=\" / keywords ? q = PARTOFATTRIBUTE\"> < Found TEXTCONTENT </a>");
if (htmlDoc.DocumentNode.ChildNodes[0].InnerHtml.Contains("TEXTCONTENT"))
{
// do something with text content
}
if (htmlDoc.DocumentNode.ChildNodes[0].Attributes["href"].Value.Contains("PARTOFATTRIBUTE"))
{
// do something with attribute
}

Replace with .Replace/.Regex

I am using Html.Raw(Html.Encode()) to allow a subset of HTML. For example, I want bold, italic, code, etc. I am not sure it's the right method; the code seems pretty ugly.
Input
Hello, this text will be [b]bold[/b]. [code]alert("Test...")[/code]
Output
Code
@Html.Raw(Html.Encode(Model.Body)
.Replace(Environment.NewLine, "<br />")
.Replace("[b]", "<b>")
.Replace("[/b]", "</b>")
.Replace("[code]", "<div class='codeContainer'><pre name='code' class='javascript'>")
.Replace("[/code]", "</pre></div>"))
My Solution
I want to do it a bit differently. Instead of using BB tags I want to use simpler tags. For example, * will stand for bold. That means if I input This text is *bold*. it will be replaced with This text is <b>bold</b>.. Kind of like what this website uses, BTW.
Problem
To implement this I need some Regex and I have little to no experience with it. I've searched many sites, but no luck.
My implementation of it looks something like this, but it fails since I can't really replace a char with string.
static void Main(string[] args)
{
string myString = "Hello, this text is *bold*, this text is also *bold*. And this is code: ~MYCODE~";
string findString = "\\*";
int firstMatch, nextMatch;
Match match = Regex.Match(myString, findString);
while (match.Success == true)
{
Console.WriteLine(match.Index);
firstMatch = match.Index;
match = match.NextMatch();
if (match.Success == true)
{
nextMatch = match.Index;
myString = myString[firstMatch] = "<b>"; // Ouch!
}
}
Console.ReadLine();
}

To implement this I need some Regex
Ah no, you don't need Regex. Manipulating HTML with Regex could lead to some undesired effects. You could simply use MarkdownSharp, which by the way is what this site uses to safely render Markdown markup into HTML.
Like this:
var markdown = new Markdown();
string html = markdown.Transform(SomeTextContainingMarkDown);
Of course to polish this you would write an HTML helper so that in your view:
@Html.Markdown(Model.Body)

Dynamic String Format C#

In C#, Windows Form, how would I accomplish this:
07:55 Header Text: This is the data<br/>07:55 Header Text: This is the data<br/>07:55 Header Text: This is the data<br/>
So, as you can see, i have a return string, that can be rather long, but i want to be able to format the data to be something like this:
<b><font color="Red">07:55 Header Text</font></b>: This is the data<br/><b><font color="Red">07:55 Header Text</font></b>: This is the data<br/><b><font color="Red">07:55 Header Text</font></b>: This is the data<br/>
As you can see, i essentially want to prepend <b><font color="Red"> to the front of the header text & time, and append </font></b> right before the : section.
So yeah lol i'm kinda lost.
I have messed around with .Replace() and Regex patterns, but without much success. I don't really want to REPLACE text, just append/prepend at certain positions.
Is there an easy way to do this?
Note: the [] tags are actually <> tags, but i can't use them here lol

Just because you're using RegEx doesn't mean you have to replace text.
The following regular expression:
(\d+:\d+.*?:)(\s.*?\[br/\])
Has two 'capturing groups.' You can then replace the entire text string with the following (in .NET replacement strings the backreferences are written $1 and $2 rather than \1 and \2):
[b][font color="Red"]\1[/font][/b]\2
Which should result in the following output:
[b][font color="Red"]07:55 Header Text:[/font][/b] This is the data[br/]
[b][font color="Red"]07:55 Header Text:[/font][/b] This is the data[br/]
[b][font color="Red"]07:55 Header Text:[/font][/b] This is the data[br/]
Edit: Here's some C# code which demonstrates the above:
var fixMe = @"07:55 Header Text: This is the data[br/]07:55 Header Text: This is the data[br/]07:55 Header Text: This is the data[br/]";
var regex = new Regex(@"(\d+:\d+.*?:)(\s.*?\[br/\])");
var matches = regex.Matches(fixMe);
var prepend = @"[b][font color=""Red""]";
var append = @"[/font][/b]";
string outputString = "";
foreach (Match match in matches)
{
outputString += prepend + match.Groups[1] + append + match.Groups[2] + Environment.NewLine;
}
Console.Out.WriteLine(outputString);
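If the goal is only to wrap the header, a single Regex.Replace call with $1/$2 substitutions may be enough, with no loop over the matches. This is a minimal sketch that works on real <br/> tags and keeps the colon outside the markup, as in the question. It assumes each header starts with a time such as 07:55 and ends at the first colon followed by whitespace; the HeaderHighlighter class and Highlight method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeaderHighlighter
{
    // Group 1: the time plus header text (lazy, so it stops at the first ": ").
    // Group 2: the colon and the whitespace that follows it.
    private static readonly Regex Header = new Regex(@"(\d+:\d+.*?)(:\s)");

    public static string Highlight(string input)
    {
        // $1 and $2 re-insert the captured text, so nothing is lost:
        // the markup is only added around group 1.
        return Header.Replace(input, "<b><font color=\"Red\">$1</font></b>$2");
    }
}
```

For example, Highlight("07:55 Header Text: This is the data<br/>") should give <b><font color="Red">07:55 Header Text</font></b>: This is the data<br/>. If the data part can itself contain something that looks like a time followed later by ": ", the pattern would need to be tightened.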

Have you tried string.Insert()?

Have you considered creating a style and setting the css class of each line by wrapping each line in a p or div tag?
Easier to maintain and to construct.

The easiest way is probably to use string.Replace() and string.Split(). Say your input string is input (untested):
var output = string.Join("<br/>", input
.Split(new[] { "<br/>" }, StringSplitOptions.RemoveEmptyEntries)
.Select(l => "<b><font color=\"Red\">" + l.Replace(": ", "</font></b>: "))
.ToList()
) + "<br/>";

Simple text to HTML conversion

I have a very simple asp:textbox with the multiline attribute enabled. I then accept just text, with no markup, from the textbox. Is there a common method by which line breaks and returns can be converted to <p> and <br/> tags?
I'm not looking for anything earth shattering, but at the same time I don't just want to do something like:
html.Insert(0, "<p>");
html.Replace(Environment.NewLine + Environment.NewLine, "</p><p>");
html.Replace(Environment.NewLine, "<br/>");
html.Append("</p>");
The above code doesn't work right, as in generating correct html, if there are more than 2 line breaks in a row. Having html like <br/></p><p> is not good; the <br/> can be removed.

I know this is old, but I couldn't find anything better after some searching, so here is what I'm using:
public static string TextToHtml(string text)
{
text = HttpUtility.HtmlEncode(text);
text = text.Replace("\r\n", "\r");
text = text.Replace("\n", "\r");
text = text.Replace("\r", "<br>\r\n");
text = text.Replace("  ", " &nbsp;");
return text;
}
If you can't use HttpUtility for some reason, then you'll have to do the HTML encoding some other way, and there are lots of minor details to worry about (not just <>&).
HtmlEncode only handles the special characters for you, so after that I convert any combo of carriage-return and/or line-feed to a BR tag, and any double-spaces to a single-space plus a NBSP.
Optionally you could use a PRE tag for the last part, like so:
public static string TextToHtml(string text)
{
text = "<pre>" + HttpUtility.HtmlEncode(text) + "</pre>";
return text;
}

Your other option is to take the text box contents and, instead of trying to handle line and paragraph breaks, just put the text between PRE tags. Like this:
<PRE>
Your text from the text box...
and a line after a break...
</PRE>

Depending on exactly what you are doing with the content, my typical recommendation is to ONLY use the <br /> syntax, and not to try and handle paragraphs.

How about throwing it in a <pre> tag. Isn't that what it's there for anyway?

I know this is an old post, but I recently had a similar problem using C# with MVC4, so I thought I'd share my solution.
We had a description saved in a database. The text was a direct copy/paste from a website, and we wanted to convert it into semantic HTML, using <p> tags. Here is a simplified version of our solution:
string description = getSomeTextFromDatabase();
foreach (var line in description.Split('\n'))
{
Console.Write("<p>" + line + "</p>");
}
In our case, to write out a variable, we needed to prefix @ before any variable or identifier, because of the Razor syntax in the ASP.NET MVC framework. I've shown this with Console.Write, but you should be able to figure out how to implement it in your specific project based on this :)

Combining all the previous answers, and also considering titles and subtitles within the text, I came up with this:
public static string ToHtml(this string text)
{
var sb = new StringBuilder();
var sr = new StringReader(text);
var str = sr.ReadLine();
while (str != null)
{
str = str.TrimEnd();
str = str.Replace("  ", " &nbsp;");
if (str.Length > 80)
{
sb.AppendLine($"<p>{str}</p>");
}
else if (str.Length > 0)
{
sb.AppendLine($"{str}<br/>");
}
str = sr.ReadLine();
}
return sb.ToString();
}
The snippet could be enhanced by defining rules for short strings.

I realize I am 13 years late with this answer, but maybe someone else needs it.
sample line 1 \r\n
sample line 2 (last at paragraph) \r\n\r\n [\r\n]+
sample line 3 \r\n
Example code
private static readonly Regex _breakRegex = new("(\r?\n)+");
private static readonly Regex _paragraphBreakRegex = new("(?:\r?\n){2,}");
public static string ConvertTextToHtml(string description) {
    // Two or more consecutive line breaks separate paragraphs.
    string[] descriptionParagraphs = _paragraphBreakRegex.Split(description.Trim());
    if (descriptionParagraphs.Length > 0)
    {
        description = string.Empty;
        foreach (string paragraph in descriptionParagraphs)
        {
            description += $"<p>{paragraph}</p>";
        }
    }
    // Remaining single line breaks inside a paragraph become <br/>.
    return _breakRegex.Replace(description, "<br/>");
}
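The paragraph-splitting idea above can be combined with the HTML encoding step from the earlier answer, so that user-entered characters such as < and & are escaped before the tags are added. This is a minimal sketch using only the base class library; the PlainTextHtml class and Convert method names are assumptions made for the example.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

public static class PlainTextHtml
{
    // Two or more consecutive line breaks end a paragraph.
    private static readonly Regex ParagraphBreak = new Regex(@"(?:\r?\n){2,}");
    // A single remaining line break becomes <br/>.
    private static readonly Regex LineBreak = new Regex(@"\r?\n");

    public static string Convert(string text)
    {
        if (string.IsNullOrWhiteSpace(text))
        {
            return string.Empty;
        }

        var paragraphs = ParagraphBreak.Split(text.Trim())
            // Encode first, so the tags added afterwards are not escaped themselves.
            .Select(p => WebUtility.HtmlEncode(p))
            .Select(p => "<p>" + LineBreak.Replace(p, "<br/>") + "</p>");

        return string.Concat(paragraphs);
    }
}
```

Because runs of blank lines are consumed by the paragraph split, this never produces the <br/></p><p> sequence the question wants to avoid.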

How can I strip HTML tags from a string in ASP.NET?

Using ASP.NET, how can I strip the HTML tags from a given string reliably (i.e. not using regex)? I am looking for something like PHP's strip_tags.
Example:
<ul><li>Hello</li></ul>
Output:
"Hello"
I am trying not to reinvent the wheel, but I have not found anything that meets my needs so far.

If it is just stripping all HTML tags from a string, this works reliably with regex as well. Replace:
<[^>]*(>|$)
with the empty string, globally. Don't forget to normalize the string afterwards, replacing:
[\s\r\n]+
with a single space, and trimming the result. Optionally replace any HTML character entities back to the actual characters.
Note:
There is a limitation: HTML and XML allow > in attribute values. This solution will return broken markup when encountering such values.
The solution is technically safe, as in: The result will never contain anything that could be used to do cross site scripting or to break a page layout. It is just not very clean.
As with all things HTML and regex:
Use a proper parser if you must get it right under all circumstances.
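The three steps described above (remove tags, normalize whitespace, optionally decode entities) could look roughly like this in C#. It is a sketch of the regex approach with the limitations already noted, not a replacement for a parser; the RegexTagStripper class and Strip method names are made up for illustration.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class RegexTagStripper
{
    public static string Strip(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        // 1. Remove anything that looks like a tag, including an
        //    unterminated tag at the very end of the string.
        string noTags = Regex.Replace(html, @"<[^>]*(>|$)", string.Empty);

        // 2. Collapse runs of whitespace (including line breaks) into one space.
        string normalized = Regex.Replace(noTags, @"\s+", " ").Trim();

        // 3. Optionally turn entities such as &amp; back into characters.
        return WebUtility.HtmlDecode(normalized);
    }
}
```

One design point to keep in mind: after step 3 the result is plain text, so it may again contain < or > characters that came from entities. If it is written back into a page, it should be HTML-encoded again at that point.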

Go download HtmlAgilityPack, now! ;)
This allows you to load and parse HTML. Then you can navigate the DOM and extract the inner values of all attributes. Seriously, it will take you about 10 lines of code at the maximum. It is one of the greatest free .net libraries out there.
Here is a sample:
string htmlContents = new System.IO.StreamReader(resultsStream,Encoding.UTF8,true).ReadToEnd();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlContents);
if (doc == null) return null;
string output = "";
foreach (var node in doc.DocumentNode.ChildNodes)
{
output += node.InnerText;
}

Regex.Replace(htmlText, "<.*?>", string.Empty);

protected string StripHtml(string Txt)
{
return Regex.Replace(Txt, "<(.|\\n)*?>", string.Empty);
}
Protected Function StripHtml(Txt as String) as String
Return Regex.Replace(Txt, "<(.|\n)*?>", String.Empty)
End Function

I've posted this on the asp.net forums, and it still seems to be one of the easiest solutions out there. I won't guarantee it's the fastest or most efficient, but it's pretty reliable.
In .NET you can use the HTML Web Control objects themselves. All you really need to do is insert your string into a temporary HTML object such as a DIV, then use the built-in 'InnerText' to grab all text that is not contained within tags. See below for a simple C# example:
System.Web.UI.HtmlControls.HtmlGenericControl htmlDiv = new System.Web.UI.HtmlControls.HtmlGenericControl("div");
htmlDiv.InnerHtml = htmlString;
String plainText = htmlDiv.InnerText;

I have written a pretty fast method in c# which beats the hell out of the Regex. It is hosted in an article on CodeProject.
Its advantages are, besides better performance, the ability to replace named and numbered HTML entities (those like &amp; and &#203;), comment block replacement, and more.
Please read the related article on CodeProject.
Thank you.

For those of you who can't use HtmlAgilityPack, .NET's XML reader is an option. It can fail even on valid HTML that is not well-formed XML, though, so always add a catch with a regex as a backup. Note this is NOT fast, but it does provide a nice opportunity for old-school step-through debugging.
public static string RemoveHTMLTags(string content)
{
var cleaned = string.Empty;
try
{
StringBuilder textOnly = new StringBuilder();
using (var reader = XmlReader.Create(new System.IO.StringReader("<xml>" + content + "</xml>")))
{
while (reader.Read())
{
if (reader.NodeType == XmlNodeType.Text)
textOnly.Append(reader.ReadContentAsString());
}
}
cleaned = textOnly.ToString();
}
catch
{
//A tag is probably not closed. fallback to regex string clean.
string textOnly = string.Empty;
Regex tagRemove = new Regex(@"<[^>]*(>|$)");
Regex compressSpaces = new Regex(@"[\s\r\n]+");
textOnly = tagRemove.Replace(content, string.Empty);
textOnly = compressSpaces.Replace(textOnly, " ");
cleaned = textOnly;
}
return cleaned;
}

string result = Regex.Replace(anytext, @"<(.|\n)*?>", string.Empty);

I've looked at the Regex-based solutions suggested here, and they don't fill me with any confidence except in the most trivial cases. An angle bracket in an attribute is all it would take to break them, let alone malformed HTML from the wild. And what about entities like &amp;? If you want to convert HTML into plain text, you need to decode entities too.
So I propose the method below.
Using HtmlAgilityPack, this extension method efficiently strips all HTML tags from an HTML fragment. It also decodes HTML entities like &amp;. It returns just the inner text items, with a new line between each text item.
public static string RemoveHtmlTags(this string html)
{
if (String.IsNullOrEmpty(html))
return html;
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
if (doc.DocumentNode == null || doc.DocumentNode.ChildNodes == null)
{
return WebUtility.HtmlDecode(html);
}
var sb = new StringBuilder();
var i = 0;
foreach (var node in doc.DocumentNode.ChildNodes)
{
var text = node.InnerText.SafeTrim();
if (!String.IsNullOrEmpty(text))
{
sb.Append(text);
if (i < doc.DocumentNode.ChildNodes.Count - 1)
{
sb.Append(Environment.NewLine);
}
}
i++;
}
var result = sb.ToString();
return WebUtility.HtmlDecode(result);
}
public static string SafeTrim(this string str)
{
if (str == null)
return null;
return str.Trim();
}
If you are really serious, you'd want to ignore the contents of certain HTML tags too (<script>, <style>, <svg>, <head>, <object> come to mind!) because they probably don't contain readable content in the sense we are after. What you do there will depend on your circumstances and how far you want to go, but using HtmlAgilityPack it would be pretty trivial to whitelist or blacklist selected tags.
If you are rendering the content back to an HTML page, make sure you understand XSS vulnerabilities and how to prevent them, i.e. always encode any user-entered text that gets rendered back onto an HTML page (> becomes &gt; etc).
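As a small illustration of that last point, text can be encoded with WebUtility.HtmlEncode just before it is written into a page. This is a sketch only; the SafeOutput class and ForHtml method names are made up for the example, and a real application would normally rely on its view engine's built-in encoding.

```csharp
using System;
using System.Net;

public static class SafeOutput
{
    // Encodes characters that are significant in HTML (<, >, &, quotes)
    // so that user-entered text is displayed rather than interpreted as markup.
    public static string ForHtml(string userText)
    {
        return WebUtility.HtmlEncode(userText ?? string.Empty);
    }
}
```

For example, SafeOutput.ForHtml("<b>5 > 3</b>") yields &lt;b&gt;5 &gt; 3&lt;/b&gt;, which a browser shows as literal text.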

For those who are complaining about Michael Tiptop's solution not working, here is the .NET 4+ way of doing it:
public static string StripTags(this string markup)
{
try
{
StringReader sr = new StringReader(markup);
XPathDocument doc;
using (XmlReader xr = XmlReader.Create(sr,
new XmlReaderSettings()
{
ConformanceLevel = ConformanceLevel.Fragment
// for multiple roots
}))
{
doc = new XPathDocument(xr);
}
return doc.CreateNavigator().Value; // .Value is similar to .InnerText of
// XmlDocument or JavaScript's innerText
}
catch
{
return string.Empty;
}
}

using System.Text.RegularExpressions;
string str = Regex.Replace(HttpUtility.HtmlDecode(HTMLString), "<.*?>", string.Empty);

You can also do this with AngleSharp which is an alternative to HtmlAgilityPack (not that HAP is bad). It is easier to use than HAP to get the text out of a HTML source.
var parser = new HtmlParser();
var htmlDocument = parser.ParseDocument(source);
var text = htmlDocument.Body.Text();
You can take a look at the key features section where they make a case at being "better" than HAP. I think for the most part, it is probably overkill for the current question but still, it is an interesting alternative.

For the second parameter, i.e. keeping some tags, you may need some code like this using HtmlAgilityPack:
public string StripTags(HtmlNode documentNode, IList<string> keepTags)
{
var result = new StringBuilder();
foreach (var childNode in documentNode.ChildNodes)
{
if (childNode.Name.ToLower() == "#text")
{
result.Append(childNode.InnerText);
}
else
{
if (!keepTags.Contains(childNode.Name.ToLower()))
{
result.Append(StripTags(childNode, keepTags));
}
else
{
result.Append(childNode.OuterHtml.Replace(childNode.InnerHtml, StripTags(childNode, keepTags)));
}
}
}
return result.ToString();
}
More explanation on this page: http://nalgorithm.com/2015/11/20/strip-html-tags-of-an-html-in-c-strip_html-php-equivalent/

Simply use string.StripHTML();
