HTML Agility Pack - Grab Text after a node - c#

I have some HTML that I'm parsing using C#.
A sample is below; the real input repeats this about 150 times with different records.
<strong>Title</strong>: Mr<br>
<strong>First name</strong>: Fake<br>
<strong>Surname</strong>: Guy<br>
I'm trying to get the text in an array which will be like
customerArray [0,0] = Title
customerArray [0,1] = Mr
customerArray [1,0] = First Name
customerArray [1,1] = Fake
customerArray [2,0] = Surname
customerArray [2,1] = Guy
I can get the text into the array, but I'm having trouble getting the text after the closing STRONG tag up to the BR tag, and then finding the next STRONG tag.
Any help would be appreciated.

You can use the XPath following-sibling::text()[1] to get the text node located directly after each strong element. Here is a minimal but complete example:
var raw = @"<div>
<strong>Title</strong>: Mr<br>
<strong>First name</strong>: Fake<br>
<strong>Surname</strong>: Guy<br>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(raw);
foreach(HtmlNode node in doc.DocumentNode.SelectNodes("//strong"))
{
var val = node.SelectSingleNode("following-sibling::text()[1]");
Console.WriteLine(node.InnerText + ", " + val.InnerText);
}
output :
Title, : Mr
First name, : Fake
Surname, : Guy
You should be able to remove the ":" by doing simple string manipulation, if needed...
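A minimal sketch of that cleanup, assuming each value arrives as the raw text node shown in the output above (for example ": Mr"). The names CleanValue and ToCustomerArray are hypothetical helpers, not part of HTML Agility Pack; the sample pairs in Main stand in for the node.InnerText and val.InnerText values from the loop.

```csharp
using System;
using System.Collections.Generic;

public static class CustomerFields
{
    // Hypothetical helper: strips the leading ':' separator and surrounding whitespace.
    public static string CleanValue(string raw)
    {
        if (raw == null) return string.Empty;
        return raw.Trim().TrimStart(':').Trim();
    }

    // Hypothetical helper: packs (label, raw value) pairs into the
    // [row, 0] = label / [row, 1] = value layout described in the question.
    public static string[,] ToCustomerArray(IList<KeyValuePair<string, string>> pairs)
    {
        var result = new string[pairs.Count, 2];
        for (int i = 0; i < pairs.Count; i++)
        {
            result[i, 0] = pairs[i].Key.Trim();
            result[i, 1] = CleanValue(pairs[i].Value);
        }
        return result;
    }

    public static void Main()
    {
        // Stand-ins for the values read from the parsed document.
        var pairs = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("Title", ": Mr"),
            new KeyValuePair<string, string>("First name", ": Fake"),
            new KeyValuePair<string, string>("Surname", ": Guy"),
        };

        string[,] customerArray = ToCustomerArray(pairs);
        for (int i = 0; i < customerArray.GetLength(0); i++)
        {
            Console.WriteLine(customerArray[i, 0] + " = " + customerArray[i, 1]);
        }
    }
}
```

In the real loop you would add one pair per strong node instead of hard-coding them.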

<strong> is a common tag, so here is something more specific to the sample format you provided. It only matches a <strong> that is followed by a non-empty text node and then a <br>.
var html = @"
<div>
<strong>First name</strong><em>italic</em>: Fake<br>
<strong>Bold</strong> <a href='#'>hyperlink</a><br>.
<strong>bold</strong>
<strong>bold</strong> <br>
text
</div>
<div>
<strong>Title</strong>: Mr<BR>
<strong>First name</strong>: Fake<br>
<strong>Surname</strong>: Guy<br>
</div>";
var document = new HtmlDocument();
document.LoadHtml(html);
// 1. <strong>
var strong = document.DocumentNode.SelectNodes("//strong");
if (strong != null)
{
foreach (var node in strong.Where(
// 2. followed by non-empty text node
x => x.NextSibling is HtmlTextNode
&& !string.IsNullOrEmpty(x.NextSibling.InnerText.Trim())
// 3. followed by <br>
&& x.NextSibling.NextSibling is HtmlNode
&& x.NextSibling.NextSibling.Name.ToLower() == "br"))
{
Console.WriteLine("{0} {1}", node.InnerText, node.NextSibling.InnerText);
}
}

Related

HtmlAgilityPack getting id of parent node

Given the snippet of HTML and code below, if you know part of the src (e.g. 'FileName'), how do you get the post ID of the parent div? The div could be higher up the DOM tree, and there could be 0, 1 or many src attributes containing the same 'FileName'.
I'm after "postId_19701770".
I've attempted to follow this page and this page, but I get error CS1061: 'HtmlNodeCollection' does not contain a definition for 'ParentNode'.
namespace GetParent
{
class Program
{
static void Main(string[] args)
{
var html =
@"<body>
<div id='postId_19701770' class='b-post'>
<h1>This is <b>bold</b> heading</h1>
<p>This is <u>underlined</u> paragraph <div src='example.com/FileName_720p.mp4' </div></p>
</div>
</body>";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
string keyword = "FileName";
var node = htmlDoc.DocumentNode.SelectNodes("//*[text()[contains(., '" + keyword + "')]]");
var parentNode = node.ParentNode;
Console.WriteLine(parentNode.Name);
Console.ReadLine();
}
}
}
Your code is not working because you are looking up ParentNode on a collection of nodes. You need to select a single node and then look up its parent.
You can also search for all nodes whose src attribute contains the text you are looking for. Once you have the collection, you can inspect each node to find the one you need, or take the First() one from the collection and get its parent.
var html =
@"<body>
<div id='postId_19701770' class='b-post'>
<h1>This is <b>bold</b> heading</h1>
<p>This is <u>underlined</u> paragraph <div src='example.com/FileName_720p.mp4' </div></p>
</div>
</body>";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
string keyword = "FileName";
var node = htmlDoc.DocumentNode.SelectNodes("//*[contains(@src, '" + keyword + "')]");
var parent = node.First().ParentNode; //node is a collection so get the first node for ex.
Console.WriteLine(parent.GetAttributeValue("id", string.Empty));
// Prints
postId_19701770
Instead of looking up all matching nodes, you can search for just one node via the SelectSingleNode method:
var singleNode = htmlDoc.DocumentNode.SelectSingleNode(@"//*[contains(@src, '" + keyword + "')]");
Console.WriteLine(singleNode.ParentNode.GetAttributeValue("id", string.Empty));
// prints
postId_19701770
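Since the question mentions that the div carrying the id may sit higher up the tree than the direct parent, here is a minimal sketch that walks the ancestors instead of reading ParentNode once. FindPostId is a hypothetical helper, the "postId_" prefix is taken from the sample id, and the HTML below is a tidied stand-in for the question's snippet rather than the original markup.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class PostLookup
{
    // Hypothetical helper: returns the id of the nearest ancestor <div>
    // whose id starts with "postId_", or null when there is none.
    public static string FindPostId(HtmlNode node)
    {
        HtmlNode post = node.Ancestors("div")
            .FirstOrDefault(d => d.GetAttributeValue("id", string.Empty)
                .StartsWith("postId_", StringComparison.Ordinal));
        return post == null ? null : post.GetAttributeValue("id", string.Empty);
    }

    public static void Main()
    {
        // Assumed, well-formed sample: the src element is nested several levels deep.
        var html = @"<body>
<div id='postId_19701770' class='b-post'>
  <div class='media'><span><div src='example.com/FileName_720p.mp4'></div></span></div>
</div>
</body>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, which covers the "0 matches" case.
        HtmlNodeCollection matches = doc.DocumentNode.SelectNodes("//*[contains(@src, 'FileName')]");
        if (matches == null)
        {
            Console.WriteLine("No element with a matching src was found.");
            return;
        }

        // Handles 1 or many matches: print the post id for each of them.
        foreach (HtmlNode match in matches)
        {
            Console.WriteLine(FindPostId(match));
        }
    }
}
```

Ancestors("div") yields the closest div first, so FirstOrDefault picks the nearest enclosing post rather than an outer container.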

Get all divs under a div with known ID and iterate over it

I'm building an app that crawls OkCupid matches. Their match result contains Html that looks like this.
<div id="match_results">
<div>person1</div>
<div>person2</div>
<div>person3</div>
</div>
I want to do a foreach person's div inside the div match_results. However, something's not quite right with my C# code. matchesList only contains one element (itself? and not all the divs inside it...)
HtmlDocument matchesHtmlDoc = new HtmlDocument();
matchesHtmlDoc.LoadHtml(matches);
string matchResultDivId = "match_results";
// match results
HtmlNodeCollection matchesList = matchesHtmlDoc.DocumentNode.SelectNodes("//div[@id = '" + matchResultDivId + "']");
foreach (HtmlNode match in matchesList)
{
//test
Console.WriteLine(match.ToString());
}
You forgot to select child divs:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(matches);
string matchResultDivId = "match_results";
string xpath = String.Format("//div[@id='{0}']/div", matchResultDivId);
var people = doc.DocumentNode.SelectNodes(xpath).Select(p => p.InnerText);
foreach(var person in people)
Console.WriteLine(person);
Output:
person1
person2
person3

How to get Contents from HTML string in Array

I am working with some html contents. The format of the HTML is like below.
<li>
<ul>
<li>Test1</li>
<li>Test2</li>
</ul>
Odd string 1
<ul>
<li>Test3</li>
<li>Test4</li>
</ul>
Odd string 2
<ul>
<li>Test5</li>
<li>Test6</li>
</ul>
</li>
There can be multiple "odd strings" in the HTML content, and I want all of them in an array. Is there an easy way? (I am using C# and HtmlAgilityPack.)
Select ul elements and refer to next sibling node, which will be your text:
HtmlDocument html = new HtmlDocument();
html.Load(html_file);
var odds = from ul in html.DocumentNode.Descendants("ul")
let sibling = ul.NextSibling
where sibling != null &&
sibling.NodeType == HtmlNodeType.Text && // check if text node
!String.IsNullOrWhiteSpace(sibling.InnerHtml)
select sibling.InnerHtml.Trim();
Something like this, using a regular expression:
MatchCollection matches = Regex.Matches(HTMLString, "</ul>.*?<ul>", RegexOptions.Singleline);
foreach (Match match in matches)
{
String oddstring = match.ToString().Replace("</ul>","").Replace("<ul>","");
}
Get all the ul descendants and check whether the next sibling node is of type HtmlNodeType.Text and is not empty:
List<string> oddStrings = new List<string>();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
foreach (HtmlNode ul in doc.DocumentNode.Descendants("ul"))
{
HtmlNode nextSibling = ul.NextSibling;
if (nextSibling != null && nextSibling.NodeType == HtmlNodeType.Text)
{
string trimmedText = nextSibling.InnerText.Trim();
if (!String.IsNullOrEmpty(trimmedText))
{
oddStrings.Add(trimmedText);
}
}
}
Agility Pack can already query those text nodes directly:
var nodes = doc.DocumentNode.SelectNodes("/html[1]/body[1]/li[1]/text()");
Use this XPATH:
//body/li[1]/text()

Select node based on sibling properties - HtmlAgilityPack - C#

I have an HTML-document that is structured as follows
<ul class="beverageFacts">
<li>
<span>Vintage</span>
<strong>2007 </strong>
</li>
<li>
<span>ABV</span>
<strong>13,0 %</strong>
</li>
<li>
<span>Sugar</span>
<strong>5 gram/liter</strong>
</li>
I need to parse the values of the <strong> tags into the corresponding strings, depending on the value of the <span> tag.
I have the following:
String vintage;
String sugar;
String abv;
As of now, I am looping through each child node of the beverageFacts-node checking the values to parse it to the correct corresponding string.
The code I have so far to get the "Vintage"-value is the following, though the result is always null.
HtmlNodeCollection childNodes = bevFactNode.ChildNodes;
foreach (HtmlNode subNode in childNodes)
{
if (subNode.InnerText.TrimStart() == "Vintage")
vintage = subNode.NextSibling.InnerText.Trim();
}
I believe my selection of the nodes is incorrect, but I cannot figure out how to properly do it in the most efficient way.
Is there an easy way to achieve this?
Edit 2013-07-29
I have tried to remove the whitespaces as suggested by enricoariel in the comments using the following code
HtmlAgilityPack.HtmlDocument page = new HtmlWeb().Load("http://www.systembolaget.se/" + articleID);
string cleanDoc = Regex.Replace(page.DocumentNode.OuterHtml, @"\s*(?<capture><(?<markUp>\w+)>.*<\/\k<markUp>>)\s*", "${capture}", RegexOptions.Singleline);
HtmlDocument cleanPage = new HtmlDocument();
cleanPage.LoadHtml(cleanDoc);
The result is still
String vintage = null;
Looking at the HTML markup, I realized I didn't go deep enough into the nodes.
Also, as enricoariel pointed out, there is whitespace between the tags that I did not clean properly. By skipping the sibling that holds the whitespace and jumping to the one after it, I get the correct result.
foreach (HtmlNode bevFactNode in bevFactsNodes)
{
HtmlNodeCollection childNodes = bevFactNode.ChildNodes;
foreach (HtmlNode node in childNodes)
{
foreach(HtmlNode subNode in node.ChildNodes)
{
if (subNode.InnerText.Trim() == "Årgång")
vintage = HttpUtility.HtmlDecode(subNode.NextSibling.NextSibling.InnerText.Trim());
}
}
}
Console.WriteLine("Vintage: " + vintage);
will output
Vintage: 2007
I decoded the HTML to get the result formatted correctly.
Lessons learned!
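The double NextSibling works because the line break between the span and the strong is parsed as its own text node. As a sketch of a less position-dependent variant, the hypothetical helper NextElementSibling below walks forward until it reaches an element, however many text nodes sit in between. The HTML is a shortened copy of the sample from the question.

```csharp
using System;
using HtmlAgilityPack;

public static class SiblingHelper
{
    // Hypothetical helper: returns the next sibling that is an element,
    // skipping whitespace text nodes and comments; null if there is none.
    public static HtmlNode NextElementSibling(HtmlNode node)
    {
        HtmlNode next = node.NextSibling;
        while (next != null && next.NodeType != HtmlNodeType.Element)
        {
            next = next.NextSibling;
        }
        return next;
    }

    public static void Main()
    {
        var html = @"<ul class='beverageFacts'>
<li>
<span>Vintage</span>
<strong>2007 </strong>
</li>
<li>
<span>ABV</span>
<strong>13,0 %</strong>
</li>
</ul>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection spans = doc.DocumentNode.SelectNodes("//ul[@class='beverageFacts']//span");
        if (spans == null) return; // nothing matched

        foreach (HtmlNode span in spans)
        {
            HtmlNode value = NextElementSibling(span);
            if (value != null)
            {
                Console.WriteLine(span.InnerText.Trim() + ": " + value.InnerText.Trim());
            }
        }
    }
}
```

The same idea can be expressed in XPath with following-sibling::strong[1] evaluated from each span.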
To summarize, I think the best solution is to strip the whitespace between tags with a regex before retrieving the NextSibling value:
string myHtml =
@"
<ul class='beverageFacts'>
<li>
<span>Vintage</span>
<strong>2007 </strong>
</li>
<li>
<span>ABV</span>
<strong>13,0 %</strong>
</li>
<li>
<span>Sugar</span>
<strong>5 gram/liter</strong>
</li>";
//Remove whitespace before tags and collapse whitespace after tags
myHtml = Regex.Replace(myHtml, @"\s+<", "<", RegexOptions.Multiline | RegexOptions.Compiled);
myHtml = Regex.Replace(myHtml, @">\s+", "> ", RegexOptions.Compiled | RegexOptions.Multiline);
HtmlDocument doc = new HtmlDocument();
doc.OptionFixNestedTags = true; // set before LoadHtml so it applies to parsing
doc.LoadHtml(myHtml.Replace("\r", "").Replace("\n", ""));
HtmlNodeCollection vals = doc.DocumentNode.SelectNodes("//ul[@class='beverageFacts']//span");
var myNodeContent = string.Empty;
foreach (HtmlNode val in vals)
{
if (val.InnerText == "Vintage")
{
myNodeContent = val.NextSibling.InnerText;
}
}
return myNodeContent;

Grab all text from html with Html Agility Pack

Input
<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>
Output
foo
bar
baz
I know of htmldoc.DocumentNode.InnerText, but it will give foobarbaz - I want to get each text, not all at a time.
XPATH is your friend :)
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>");
foreach(HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
{
Console.WriteLine("text=" + node.InnerText);
}
var root = doc.DocumentNode;
var sb = new StringBuilder();
foreach (var node in root.DescendantNodesAndSelf())
{
if (!node.HasChildNodes)
{
string text = node.InnerText;
if (!string.IsNullOrEmpty(text))
sb.AppendLine(text.Trim());
}
}
This does what you need, but I am not sure if this is the best way. Maybe you should iterate through something other than DescendantNodesAndSelf for optimal performance.
I was in the need of a solution that extracts all text but discards the content of script and style tags. I could not find it anywhere, but I came up with the following which suits my own needs:
StringBuilder sb = new StringBuilder();
IEnumerable<HtmlNode> nodes = doc.DocumentNode.Descendants().Where( n =>
n.NodeType == HtmlNodeType.Text &&
n.ParentNode.Name != "script" &&
n.ParentNode.Name != "style");
foreach (HtmlNode node in nodes) {
Console.WriteLine(node.InnerText);
}
var pageContent = "{html content goes here}";
var pageDoc = new HtmlDocument();
pageDoc.LoadHtml(pageContent);
var pageText = pageDoc.DocumentNode.InnerText;
The specified example for html content:
<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>
will produce the following output:
foo bar baz
public string html2text(string html) {
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(@"<html><body>" + html + "</body></html>");
return doc.DocumentNode.SelectSingleNode("//body").InnerText;
}
This workaround is based on Html Agility Pack. You can also install it via NuGet (package name: HtmlAgilityPack).
https://github.com/jamietre/CsQuery
Have you tried CsQuery? Although it is not actively maintained, it's still my favorite for converting HTML to text. Here's a one-liner showing how simple it is to get the text from HTML.
var text = CQ.CreateDocument(htmlText).Text();
Here's a complete console application:
using System;
using CsQuery;
public class Program
{
public static void Main()
{
var html = "<div><h1>Hello World <p> some text inside h1 tag under p tag </p> </h1></div>";
var text = CQ.CreateDocument(html).Text();
Console.WriteLine(text); // Output: Hello World some text inside h1 tag under p tag
}
}
I understand that the OP asked for HtmlAgilityPack only, but CsQuery is a lesser-known library and one of the best solutions I've found, so I wanted to share it in case someone finds it helpful. Cheers!
I just changed and fixed some people's answers to work better:
var document = new HtmlDocument();
document.LoadHtml(result);
var sb = new StringBuilder();
foreach (var node in document.DocumentNode.DescendantsAndSelf())
{
if (!node.HasChildNodes && node.Name == "#text" && node.ParentNode.Name != "script" && node.ParentNode.Name != "style")
{
string text = node.InnerText?.Trim();
if (!string.IsNullOrEmpty(text) && !text.StartsWith("<") && !text.EndsWith(">"))
sb.AppendLine(System.Web.HttpUtility.HtmlDecode(text.Trim()));
}
}
