How to parse HTML to modify all words

This seems to be a recurring question, but here goes.
I have HTML which is well-formatted (it comes from a controlled source, so this can be taken to be a given). I need to iterate through the contents of the body of the HTML, look for all the words in the document, perform some editing on those words, and save the results.
For example, I have a file sample.html and I want to run it through my application and produce output.html, which is exactly the same as the original, plus my edits.
I found the following using HTMLAgilityPack, but all the examples I've found look at the attributes of the specified tags - is there an easy modification that will look at the contents and perform my edits?
HtmlDocument HD = new HtmlDocument();
HD.Load(@"e:\test.htm");
var NoAltElements = HD.DocumentNode.SelectNodes("//img[not(@alt)]");
if (NoAltElements != null)
{
    foreach (HtmlNode HN in NoAltElements)
    {
        HN.Attributes.Append("alt", "no alt image");
    }
}
HD.Save(@"e:\test.htm");
The above looks for image tags with no alt attribute. I want to look for all tags in the <body> of the file and do something with the contents (which may involve creating new tags in the process).
A very simple sample of what I might do is take the following input:
<html>
<head><title>Some Title</title></head>
<body>
<h1>This is my page</h1>
<p>This is a paragraph of text.</p>
</body>
</html>
and produce the output, which takes every word and alternates between making it uppercase and making it italics:
<html>
<head><title>Some Title</title></head>
<body>
<h1>THIS <em>is</em> MY <em>page</em></h1>
<p>THIS <em>is</em> A <em>paragraph</em> OF <em>text</em>.</p>
</body>
</html>
Ideas, suggestions?

Personally, given this setup, I'd work with the InnerText property of HtmlNode to find the words (probably with a Regex, so I can exclude punctuation rather than simply relying on spaces) and then use the InnerHtml property to make the changes through iterative calls to Regex.Replace (the instance method has an overload that lets you specify both the maximum number of replacements and the start position).
Processing code:
IEnumerable<HtmlNode> nodes = doc.DocumentNode.DescendantNodes().Where(n => n.InnerText == "something");
foreach (HtmlNode node in nodes)
{
string[] words = getWords(node.InnerText);
node.InnerHtml = processHtml(node.InnerHtml, words);
}
Identify the words (there's probably some slicker way to do this, but here's an initial stab):
private string[] getWords(string text)
{
    // \w+ matches runs of word characters, so punctuation and spaces are skipped.
    Regex reg = new Regex(@"\w+");
    MatchCollection matches = reg.Matches(text);
    List<string> words = new List<string>();
    foreach (Match match in matches)
    {
        words.Add(match.Value);
    }
    return words.ToArray();
}
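Process the HTML:
private string processHtml(string html, string[] words)
{
    int startPosition = 0;
    foreach (string word in words)
    {
        // Find the next occurrence of the word at or after the current position.
        startPosition = html.IndexOf(word, startPosition);
        if (startPosition < 0) break;
        string replacement = alterWord(word);
        // Escape the word so any regex metacharacters are matched literally.
        Regex reg = new Regex(Regex.Escape(word));
        html = reg.Replace(html, replacement, 1, startPosition);
        // Continue after the text just inserted so it is not processed again.
        startPosition += replacement.Length;
    }
    return html;
}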
I'll leave the details of alterWord() to you. :)
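As a rough illustration of what alterWord() could look like for the upper-case/italics example in the question, here is a minimal sketch. The names WordStyler, AlterWord and StyleWords are made up for this example, and unlike the alterWord(word) call above this version also takes the word's index so it can alternate. It works on plain text only (for instance the text of a single text node), not on a string that still contains markup.

```csharp
using System.Text.RegularExpressions;

public static class WordStyler
{
    // Hypothetical helper: even-indexed words become upper case,
    // odd-indexed words are wrapped in <em>.
    public static string AlterWord(string word, int index)
    {
        return index % 2 == 0 ? word.ToUpperInvariant() : "<em>" + word + "</em>";
    }

    // Applies AlterWord to every run of word characters in a piece of plain text.
    // Punctuation and whitespace between the words are left untouched.
    public static string StyleWords(string text)
    {
        int index = 0;
        return Regex.Replace(text, @"\w+", m => AlterWord(m.Value, index++));
    }
}
```

Calling WordStyler.StyleWords("This is my page") gives THIS <em>is</em> MY <em>page</em>, which matches the heading in the question's expected output.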

Try .SelectNodes("//body//*"). That'll get you all elements within any body element, at any depth.

Related

How to decode HTML into string?

I need to decode HTML into plain text. I know that there are a lot of questions like this but I noticed one problem with those solutions and don't know how to solve it.
For example we have this piece of HTML:
<h1><strong>Some text</strong></h1><p><br></p><p>Some more text</p>
Tried regex solutions, HttpUtility.HtmlDecode method. And all of them give this output: Some textSome more text. Words get connected where they should be separate. Is there a way to decode string without merging words?

It's not clear what separator you want between things that were not separated in the first place, so I used a newline (\n).
The Where(x => !string.IsNullOrWhiteSpace(x)) filter removes the empty elements, which would otherwise produce runs of \n\n in a more complex HTML document.
var input = "<h1><strong>Some text</strong></h1><p><br></p><p>Some more text</p>";
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(input);
var result = string.Join(
"\n",
htmlDocument
.DocumentNode
.ChildNodes
.Select(x=> x.InnerText)
.Where(x=>!string.IsNullOrWhiteSpace(x))
);
Result:
"Some text\nSome more text"

An easy way to do it is to use HTML Agility Pack:
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(htmlString);
string res = htmlDocument.DocumentNode.SelectSingleNode("YOUR XPATH TO THE INTERESTING ELEMENT").InnerText;

You can use something like the following. In this sample I have used a new line to separate the inner text; hopefully you can adapt this to suit your scenario.
public static string GetPlainTextFromHTML(string inputText)
{
// Extracted plain text
var plainText = string.Empty;
if(string.IsNullOrWhiteSpace(inputText))
{
return plainText;
}
var htmlNote = new HtmlDocument();
htmlNote.LoadHtml(inputText);
var nodes = htmlNote.DocumentNode.ChildNodes;
if(nodes == null)
{
return plainText;
}
StringBuilder innerString = new StringBuilder();
// Replace <p> with new lines
foreach (HtmlNode node in nodes)
{
innerString.Append(node.InnerText);
innerString.Append("\n");
}
plainText = innerString.ToString();
return plainText;
}

You can use a regex : <(div|/div|br|p|/p)[^>]{0,}>
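To make the regex suggestion concrete, here is a minimal sketch that replaces block-level tags with a newline before stripping the rest. BlockTextExtractor and ToPlainText are hypothetical names, and the pattern is a slightly extended variant of the one above (heading tags h1 to h6 are an added assumption). It uses only System.Net.WebUtility and Regex, and like any regex approach it assumes reasonably tidy markup.

```csharp
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

public static class BlockTextExtractor
{
    public static string ToPlainText(string html)
    {
        // 1. Turn block-level boundaries into newlines so adjacent words do not merge.
        string withBreaks = Regex.Replace(
            html,
            @"<(div|/div|br|p|/p|h[1-6]|/h[1-6])\b[^>]*>",
            "\n",
            RegexOptions.IgnoreCase);

        // 2. Drop every remaining tag (inline elements such as <strong>).
        string noTags = Regex.Replace(withBreaks, @"<[^>]+>", string.Empty);

        // 3. Decode entities such as &amp; into their characters.
        string decoded = WebUtility.HtmlDecode(noTags);

        // 4. Trim each line and discard the empty ones produced by consecutive block tags.
        var lines = decoded.Split('\n').Select(l => l.Trim()).Where(l => l.Length > 0);
        return string.Join("\n", lines);
    }
}
```

For the sample in the question, ToPlainText returns the two pieces of text separated by a single newline instead of "Some textSome more text".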

How can I grab a value between two tags from a string

I am trying to grab data from a webpage. I have downloaded the webpage into a string variable.
I am wondering how I can grab the value between two tags. I have included a snippet of the downloaded string and the value I want is 895
<div class="split2r right">
<strong>Avg. asking rent in M4:</strong>
<strong class="price big">£897 pcm</strong><br>
<strong>No. of properties to rent in M4:</strong> <strong><a data-ga-category="Area stats" data-ga-action="properties_to_rent" data-ga-label="/tracking/home-values/results/" href="/to-rent/property/manchester/isaac-way/m4-7ed/">225</a></strong>
</div>
A code example would be great.

This is actually quite easy using the HtmlAgilityPack library to parse the HTML.
The first step is to add a reference to the HtmlAgilityPack library. Then you can start parsing the HTML:
const string Html = "<strong>Avg. price:</strong> <strong class=\"price big\">£895 pcm</strong><br><strong>this is the price of zed headphones</strong>";
var doc = new HtmlDocument();
doc.LoadHtml(Html);
The next step is to find the element you are looking for, in this case that is the <strong> element with its class set to price big:
var priceNode = doc.DocumentNode.SelectSingleNode("//strong[@class='price big']");
Now our final step is to retrieve the actual number from the node's InnerText property. Probably the best way to do this is through a regular expression, which can be quite simple if we assume that the required number is the only number in the inner text of the node:
var priceMatch = Regex.Match(priceNode.InnerText, @"(\d+)");
Console.WriteLine(priceMatch); // Will output 895

private void button1_Click(object sender, EventArgs e)
{
    string input = @"<strong class=""price big"">£895 pcm</strong><br>";
    List<int> prices = new List<int>();
    // Capture one to five digits between the pound sign and " pcm".
    MatchCollection mc = Regex.Matches(input, @">£(\d{1,5}) pcm");
    foreach (Match m in mc)
    {
        prices.Add(Convert.ToInt32(m.Groups[1].Value));
    }
}

Assuming your string value is called "source" and all extracts are formatted as the example
var value = Regex.Replace(source, @"\D", string.Empty);
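Stripping every non-digit only works when the string holds nothing but the price element; on the full snippet from the question, the digits from "M4" and the property count would be concatenated too. A narrower sketch, assuming the markup looks exactly like the example (PriceExtractor and FirstPrice are hypothetical names, and the class attribute is matched literally):

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class PriceExtractor
{
    // Returns the first whole-pound amount found inside a
    // <strong class="price big"> element, or null when there is none.
    public static int? FirstPrice(string html)
    {
        Match m = Regex.Match(html, @"<strong\s+class=""price big""\s*>\s*£([0-9]{1,9})");
        if (!m.Success)
        {
            return null;
        }
        // At most nine ASCII digits, so the value always fits in an int.
        return int.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture);
    }
}
```

This is deliberately strict: a different attribute order or quoting style would need the HtmlAgilityPack approach shown earlier in this thread.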

Get colored texts within HTML code

I have a Html code and I want to Convert it to plain text but keep only colored text tags.
for example:
when I have below Html:
<body>
This is a <b>sample</b> html text.
<p align="center" style="color:#ff9999">this is only a sample<p>
....
and some other tags...
</body>
</html>
I want the output:
this is a sample html text.
<#ff9999>this is only a sample<>
....
and some other tags...

I'd use a parser such as HtmlAgilityPack to parse the HTML, and use regular expressions to find the color value in the attributes.
First, find all the nodes that contain style attribute with color defined in it by using xpath:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var nodes = doc.DocumentNode
.SelectNodes("//*[contains(@style, 'color')]")
.ToArray();
Then the simplest regex to match a color value: (?<=color:\s*)#?\w+.
var colorRegex = new Regex(@"(?<=color:\s*)#?\w+", RegexOptions.IgnoreCase);
Then iterate through these nodes and if there is a regex match, replace the inner html of the node with html encoded tags (you'll understand why a little bit later):
foreach (var node in nodes)
{
var style = node.Attributes["style"].Value;
if (colorRegex.IsMatch(style))
{
var color = colorRegex.Match(style).Value;
node.InnerHtml =
HttpUtility.HtmlEncode("<" + color + ">") +
node.InnerHtml +
HttpUtility.HtmlEncode("</" + color + ">");
}
}
And finally get the inner text of the document and perform html decoding on it (this is because inner text strips all the tags):
var txt = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);
This should return something like this:
This is a sample html text.
<#ff9999>this is only a sample</#ff9999>
....
and some other tags...
Of course you could improve it for your needs.
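One possible improvement: both the XPath test contains(@style, 'color') and the lookbehind regex also match background-color. A minimal sketch of a stricter extraction, assuming the style value is a plain semicolon-separated declaration list (StyleColor and ExtractColor are hypothetical names):

```csharp
using System.Text.RegularExpressions;

public static class StyleColor
{
    // "color" must not be preceded by a hyphen or word character,
    // which rules out background-color, border-color and similar properties.
    private static readonly Regex ColorRegex = new Regex(
        @"(?<![-\w])color\s*:\s*(#?\w+)",
        RegexOptions.IgnoreCase);

    // Returns the color value from an inline style string, or null if none is set.
    public static string ExtractColor(string style)
    {
        Match m = ColorRegex.Match(style ?? string.Empty);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

ExtractColor could replace the colorRegex.IsMatch/Match pair inside the loop, with a null check on the result.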

It is possible to do it using regular expressions but... You should not parse (X)HTML with regex.
The first regexp I came with to solve the problem is:
<p(\w|\s|[="])+color:(#([0-9a-f]{6}|[0-9a-f]{3}))">(\w|\s)+</p>
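Group 2 will be the colour (a # followed by 3 or 6 hexadecimal digits) and group 4 matches the text inside the tag. Because group 4 is repeated, it holds only the last character matched; in .NET the full text is available through that group's Captures collection.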
Obviously, it's not the best solution as I'm not a regexp master and obviously it needs some testing and probably generalisation... But still it's a good point to start with.

is it possible to fix the problem in HtmlAgilityPack when there is a not closed html tag?

well i have the following problem.
the html i have is malformed and i have problems with selecting nodes using html agility pack when this is the case.
the code is below:
string strHtml = @"
<html>
<div>
<p><strong>Elem_A</strong>String_A1_2 String_A1_2</p>
<p><strong>Elem_B</strong>String_B1_2 String_B1_2</p>
</div>
<div>
<p><strong>Elem_A</strong>String_A2_2 <String_A2_2> asdas</p>
<p><strong>Elem_B</strong>String_B2_2 String_B2_2</p>
</div>
</html>";
HtmlAgilityPack.HtmlDocument objHtmlDocument = new HtmlAgilityPack.HtmlDocument();
objHtmlDocument.LoadHtml(strHtml);
HtmlAgilityPack.HtmlNodeCollection colnodePs = objHtmlDocument.DocumentNode.SelectNodes("//p");
List<string> lststrText = new List<string>();
foreach (HtmlAgilityPack.HtmlNode nodeP in colnodePs)
{
lststrText.Add(nodeP.InnerHtml);
}
the problem is that String_A2_2 is enclosed in brackets.
so htmlagility pack returns 5 strings instead of 4 in the lststrText.
so is it possible to let htmlagility pack return element 3 as
"<strong>Elem_A</strong>String_A2_2 <String_A2_2> asdas"?
or maybe i can do some preprocessing to close the element?
the current content of lststrText is
lststrText[0] = "<strong>Elem_A</strong>String_A1_2 String_A1_2"
lststrText[1] = "<strong>Elem_B</strong>String_B1_2 String_B1_2"
lststrText[2] = ""
lststrText[3] = ""
lststrText[4] = "<strong>Elem_B</strong>String_B2_2 String_B2_2"

Most html parsers try to build a working DOM, meaning dangling tags are not accepted. They will be converted, or closed in some way.
If only selecting the nodes is of importance to you, and speed and huge amounts of data is not an issue, you could grab all your <p> tags with a regular expression instead:
Regex reMatchP = new Regex(@"<(p)>.*?</\1>");
foreach (Match m in reMatchP.Matches(strHtml))
{
Console.WriteLine(m.Value);
}
This regular expression assumes the <p> tags are well formed and closed.
If you are to run this Regex a lot in your program you should declare it as:
static Regex reMatchP = new Regex(@"<(p)>.*?</\1>", RegexOptions.Compiled);
[Edit: Agility pack change]
If you want to use HtmlAgility pack you can modify the PushNodeEnd function in HtmlDocument.cs:
if (HtmlNode.IsCDataElement(CurrentNodeName()))
{
_state = ParseState.PcData;
return true;
}
// new code start
if ( !AllowedTags.Contains(_currentnode.Name) )
{
close = true;
}
// new code end
where AllowedTags would be a list of all known tags: b, p, br, span, div, etc.
the output is not 100% what you want, but maybe close enough?
<strong>Elem_A</strong>String_A1_2 String_A1_2
<strong>Elem_B</strong>String_B1_2 String_B1_2
<strong>Elem_A</strong>String_A2_2 <ignorestring_a2_2></ignorestring_a2_2> asdas
<strong>Elem_B</strong>String_B2_2 String_B2_2

You could use TidyNet to do the pre/postprocessing you allude to. Can you edit your question to explain why that wouldn't be applicable in your case?
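If an external tidy step is not an option, the preprocessing the question asks about can also be sketched with a regex pass that HTML-encodes anything that looks like a tag but is not on a list of known tag names, before the string is handed to the parser. TagSanitizer, EscapeUnknownTags and the contents of KnownTags are assumptions for illustration; the allow-list would need every tag that can legitimately appear in the input.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class TagSanitizer
{
    // Assumed allow-list; extend it with every tag the real input may contain.
    private static readonly HashSet<string> KnownTags = new HashSet<string>(
        new[] { "html", "head", "body", "div", "p", "span", "strong", "em", "b", "i", "br", "a" },
        StringComparer.OrdinalIgnoreCase);

    // Leaves known tags alone and encodes unknown ones, e.g. <String_A2_2>
    // becomes &lt;String_A2_2&gt;, so the parser sees it as plain text.
    public static string EscapeUnknownTags(string html)
    {
        return Regex.Replace(html, @"</?([A-Za-z][\w-]*)[^>]*>", m =>
            KnownTags.Contains(m.Groups[1].Value) ? m.Value : WebUtility.HtmlEncode(m.Value));
    }
}
```

After this pass the third paragraph's InnerHtml would contain &lt;String_A2_2&gt; rather than the raw brackets; decoding it afterwards restores the original text if that is needed.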

How can I strip HTML tags from a string in ASP.NET?

Using ASP.NET, how can I strip the HTML tags from a given string reliably (i.e. not using regex)? I am looking for something like PHP's strip_tags.
Example:
<ul><li>Hello</li></ul>
Output:
"Hello"
I am trying not to reinvent the wheel, but I have not found anything that meets my needs so far.

If it is just stripping all HTML tags from a string, this works reliably with regex as well. Replace:
<[^>]*(>|$)
with the empty string, globally. Don't forget to normalize the string afterwards, replacing:
[\s\r\n]+
with a single space, and trimming the result. Optionally replace any HTML character entities back to the actual characters.
Note:
There is a limitation: HTML and XML allow > in attribute values. This solution will return broken markup when encountering such values.
The solution is technically safe, as in: The result will never contain anything that could be used to do cross site scripting or to break a page layout. It is just not very clean.
As with all things HTML and regex:
Use a proper parser if you must get it right under all circumstances.
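The steps described above can be sketched as follows. RegexTagStripper and StripTags are hypothetical names; the patterns are the ones from this answer (with [\s\r\n]+ written as the equivalent \s+), and the same limitation about > inside attribute values applies.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class RegexTagStripper
{
    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        // 1. Remove every tag, including an unterminated one at the end of the string.
        string noTags = Regex.Replace(html, @"<[^>]*(>|$)", string.Empty);

        // 2. Collapse runs of whitespace (\s already covers \r and \n) and trim.
        string normalized = Regex.Replace(noTags, @"\s+", " ").Trim();

        // 3. Optionally turn character entities back into the actual characters.
        //    Decoding last means an encoded &lt; is never mistaken for a tag.
        return WebUtility.HtmlDecode(normalized);
    }
}
```

Note that after step 3 the result may contain literal < or > characters again, so it should be re-encoded before being written back into a page.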

Go download HTMLAgilityPack, now! ;)
This allows you to load and parse HTML. Then you can navigate the DOM and extract the inner values of all attributes. Seriously, it will take you about 10 lines of code at the maximum. It is one of the greatest free .net libraries out there.
Here is a sample:
string htmlContents = new System.IO.StreamReader(resultsStream,Encoding.UTF8,true).ReadToEnd();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlContents);
if (doc == null) return null;
string output = "";
foreach (var node in doc.DocumentNode.ChildNodes)
{
output += node.InnerText;
}

Regex.Replace(htmlText, "<.*?>", string.Empty);

protected string StripHtml(string Txt)
{
return Regex.Replace(Txt, "<(.|\\n)*?>", string.Empty);
}
Protected Function StripHtml(Txt as String) as String
Return Regex.Replace(Txt, "<(.|\n)*?>", String.Empty)
End Function

I've posted this on the asp.net forums, and it still seems to be one of the easiest solutions out there. I won't guarantee it's the fastest or most efficient, but it's pretty reliable.
In .NET you can use the HTML Web Control objects themselves. All you really need to do is insert your string into a temporary HTML object such as a DIV, then use the built-in 'InnerText' to grab all text that is not contained within tags. See below for a simple C# example:
System.Web.UI.HtmlControls.HtmlGenericControl htmlDiv = new System.Web.UI.HtmlControls.HtmlGenericControl("div");
htmlDiv.InnerHtml = htmlString;
String plainText = htmlDiv.InnerText;

I have written a pretty fast method in c# which beats the hell out of the Regex. It is hosted in an article on CodeProject.
Its advantages are, besides better performance, the ability to replace named and numbered HTML entities (those like &amp; and &#203;), comment block replacement, and more.
Please read the related article on CodeProject.
Thank you.

For those of you who can't use the HtmlAgilityPack, .NET's XML reader is an option. It can fail even on well-formatted HTML, though (valid HTML is not necessarily well-formed XML), so always add a catch with a regex as a backup. Note this is NOT fast, but it does provide a nice opportunity for old-school step-through debugging.
public static string RemoveHTMLTags(string content)
{
var cleaned = string.Empty;
try
{
StringBuilder textOnly = new StringBuilder();
using (var reader = XmlNodeReader.Create(new System.IO.StringReader("<xml>" + content + "</xml>")))
{
while (reader.Read())
{
if (reader.NodeType == XmlNodeType.Text)
textOnly.Append(reader.ReadContentAsString());
}
}
cleaned = textOnly.ToString();
}
catch
{
//A tag is probably not closed. fallback to regex string clean.
string textOnly = string.Empty;
Regex tagRemove = new Regex(@"<[^>]*(>|$)");
Regex compressSpaces = new Regex(@"[\s\r\n]+");
textOnly = tagRemove.Replace(content, string.Empty);
textOnly = compressSpaces.Replace(textOnly, " ");
cleaned = textOnly;
}
return cleaned;
}

string result = Regex.Replace(anytext, @"<(.|\n)*?>", string.Empty);

I've looked at the Regex-based solutions suggested here, and they don't fill me with any confidence except in the most trivial cases. An angle bracket in an attribute is all it would take to break them, let alone malformed HTML from the wild. And what about entities like &amp;? If you want to convert HTML into plain text, you need to decode entities too.
So I propose the method below.
Using HtmlAgilityPack, this extension method efficiently strips all HTML tags from an HTML fragment. It also decodes HTML entities like &amp;. It returns just the inner text items, with a new line between each text item.
public static string RemoveHtmlTags(this string html)
{
if (String.IsNullOrEmpty(html))
return html;
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
if (doc.DocumentNode == null || doc.DocumentNode.ChildNodes == null)
{
return WebUtility.HtmlDecode(html);
}
var sb = new StringBuilder();
var i = 0;
foreach (var node in doc.DocumentNode.ChildNodes)
{
var text = node.InnerText.SafeTrim();
if (!String.IsNullOrEmpty(text))
{
sb.Append(text);
if (i < doc.DocumentNode.ChildNodes.Count - 1)
{
sb.Append(Environment.NewLine);
}
}
i++;
}
var result = sb.ToString();
return WebUtility.HtmlDecode(result);
}
public static string SafeTrim(this string str)
{
if (str == null)
return null;
return str.Trim();
}
If you are really serious, you'd want to ignore the contents of certain HTML tags too (<script>, <style>, <svg>, <head>, <object> come to mind!) because they probably don't contain readable content in the sense we are after. What you do there will depend on your circumstances and how far you want to go, but using HtmlAgilityPack it would be pretty trivial to whitelist or blacklist selected tags.
If you are rendering the content back to an HTML page, make sure you understand XSS vulnerabilities and how to prevent them, i.e. always encode any user-entered text that gets rendered back onto an HTML page (> becomes &gt; etc).
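The two points above can be illustrated with a small sketch that uses only the base class library. SafeText, DropNonContent and EncodeForHtml are hypothetical names, and the regex-based removal of script and style blocks is a simplification; with HtmlAgilityPack you would remove those nodes from the DOM instead.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class SafeText
{
    // Removes <script> and <style> elements together with their contents,
    // since that content is not readable text.
    public static string DropNonContent(string html)
    {
        return Regex.Replace(
            html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>",
            string.Empty,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    // Encodes text before it is written back into an HTML page,
    // so that characters such as < and > are rendered rather than interpreted.
    public static string EncodeForHtml(string userText)
    {
        return WebUtility.HtmlEncode(userText);
    }
}
```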

For those who are complaining about Michael Tiptop's solution not working, here is the .NET 4+ way of doing it:
public static string StripTags(this string markup)
{
try
{
StringReader sr = new StringReader(markup);
XPathDocument doc;
using (XmlReader xr = XmlReader.Create(sr,
new XmlReaderSettings()
{
ConformanceLevel = ConformanceLevel.Fragment
// for multiple roots
}))
{
doc = new XPathDocument(xr);
}
return doc.CreateNavigator().Value; // .Value is similar to .InnerText of
// XmlDocument or JavaScript's innerText
}
catch
{
return string.Empty;
}
}

using System.Text.RegularExpressions;
string str = Regex.Replace(HttpUtility.HtmlDecode(HTMLString), "<.*?>", string.Empty);

You can also do this with AngleSharp which is an alternative to HtmlAgilityPack (not that HAP is bad). It is easier to use than HAP to get the text out of a HTML source.
var parser = new HtmlParser();
var htmlDocument = parser.ParseDocument(source);
var text = htmlDocument.Body.Text();
You can take a look at the key features section where they make a case at being "better" than HAP. I think for the most part, it is probably overkill for the current question but still, it is an interesting alternative.

For the second parameter of strip_tags, i.e. keeping some tags, you may need some code like this using HtmlAgilityPack:
public string StripTags(HtmlNode documentNode, IList keepTags)
{
var result = new StringBuilder();
foreach (var childNode in documentNode.ChildNodes)
{
if (childNode.Name.ToLower() == "#text")
{
result.Append(childNode.InnerText);
}
else
{
if (!keepTags.Contains(childNode.Name.ToLower()))
{
result.Append(StripTags(childNode, keepTags));
}
else
{
result.Append(childNode.OuterHtml.Replace(childNode.InnerHtml, StripTags(childNode, keepTags)));
}
}
}
return result.ToString();
}
More explanation on this page: http://nalgorithm.com/2015/11/20/strip-html-tags-of-an-html-in-c-strip_html-php-equivalent/

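Simply use string.StripHTML(); (this assumes a StripHTML extension method on string is already defined in your project; .NET has no built-in method of that name).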
