Inner text of Node ignoring inner text of children - c#

Pardon me if this sounds too simple to ask here, but this is my very first day with html-agility-pack and I can't work out how to select the text that belongs directly to a node while ignoring the inner text of its child nodes.
For example:
<div id="div1">
<div class="h1"> this needs to be selected
<small> and not this</small>
</div>
</div>
Currently I am trying this:
HtmlDocument page = new HtmlWeb().Load(url);
var s = page.DocumentNode.SelectSingleNode("//div[@id='div1']//div[@class='h1']");
string selText = s.InnerText;
which returns the whole text (e.g. "this needs to be selected and not this").
Any suggestions?

The div could have multiple text nodes if there is text both before and after its children. I think the best way to get all the direct text content of a node is to do something like:
HtmlDocument page = new HtmlWeb().Load(url);
var nodes = page.DocumentNode.SelectNodes("//div[@id='div1']//div[@class='h1']/text()");
StringBuilder sb = new StringBuilder();
foreach(var node in nodes)
{
sb.Append(node.InnerText);
}
string content = sb.ToString();

You can use the /text() option to get all text nodes directly under a specific tag. If you only need the first one, add [1] to it:
page.LoadHtml(text);
var s = page.DocumentNode.SelectSingleNode("//div[@id='div1']//div[@class='h1']/text()[1]");
string selText = s.InnerText;
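Both answers above dereference the result of the XPath query directly. As far as I know, SelectNodes returns null by default (not an empty collection) when nothing matches, so a guarded variant may be safer. The following is only a sketch: the class name DirectTextDemo, the helper GetDirectText, and the inline markup are made up for illustration and are not part of Html Agility Pack.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class DirectTextDemo
{
    // Returns the concatenated direct-child text of every node matched by
    // elementXPath, or an empty string when nothing matches.
    // (Hypothetical helper, not a library method.)
    public static string GetDirectText(HtmlDocument doc, string elementXPath)
    {
        // By default SelectNodes returns null when there is no match.
        HtmlNodeCollection textNodes = doc.DocumentNode.SelectNodes(elementXPath + "/text()");
        if (textNodes == null)
        {
            return string.Empty;
        }
        return string.Concat(textNodes.Select(n => n.InnerText)).Trim();
    }

    static void Main()
    {
        // Assumed sample markup, modelled on the question.
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id=\"div1\"><div class=\"h1\"> this needs to be selected <small> and not this</small></div></div>");

        Console.WriteLine(GetDirectText(doc, "//div[@id='div1']//div[@class='h1']"));
        Console.WriteLine(GetDirectText(doc, "//div[@class='missing']").Length);
    }
}
```

With the sample markup, the first call should yield only the div's own text, and the second call falls through the null check instead of throwing.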

Related

Looping a node collection gives me unique nodes but selecting nodes inside from these give me the results of the first loop item

Context: Using the HtmlAgilityPack library, I'm looping over an HtmlNodeCollection. Printing each node's HTML gives me the data I need, but when I select nodes inside that HTML, every iteration gives me the result from the first item.
Writing out node.InnerHtml gives me each node's unique HTML, all correct, but when I call SelectSingleNode, every node gives me the same data.
Because of the project, I cannot disclose the website. What I can say is that there are 17 nodes, all of them divs with the class k-user-item, and every item is different.
Thanks for the help!
Code:
var nodes = w.DocumentNode.SelectNodes("//div[contains(@class, 'k-user-item')]");
List<Sales> saleList = new List<Sales>();
foreach (HtmlNode node in nodes)
{
//This line prints correct html, selecting single nodes gives me always the same data of the first item from the loop.
//Debug.WriteLine(node.InnerHtml);
string payout = node.SelectSingleNode("//*[@class=\"k-item--buy-date\"]").InnerText;
string size = node.SelectSingleNode("//*[@class=\"k-panel-title\"]").SelectNodes("//span")[1].InnerText;
var trNodes = node.SelectNodes("//tr");
string status = trNodes[1].SelectSingleNode("//b").InnerText;
string orderId = trNodes[2].SelectNodes("//td")[1].SelectSingleNode("//span").InnerHtml;
string sellDate = node.SelectSingleNode("//*[@class=\"k-panel-heading\"]").SelectNodes("//small")[1].InnerHtml;
}
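This issue was solved by adding a "." to the start of each XPath expression (for example ".//tr" instead of "//tr").
Without the leading dot, an expression that begins with // is evaluated from the document root, so it searches the whole document rather than only the descendants of the current node.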
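To make the difference concrete, here is a small sketch that runs the same query with and without the leading dot. The markup is invented to stand in for the undisclosed page, and the class RelativeXPathDemo and method ReadBuyDates are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class RelativeXPathDemo
{
    // Runs innerXPath against every k-user-item div and collects the text found.
    // (Hypothetical helper for illustration.)
    public static List<string> ReadBuyDates(string html, string innerXPath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var results = new List<string>();
        foreach (HtmlNode item in doc.DocumentNode.SelectNodes("//div[contains(@class, 'k-user-item')]"))
        {
            results.Add(item.SelectSingleNode(innerXPath).InnerText);
        }
        return results;
    }

    static void Main()
    {
        // Assumed sample markup with two distinct items.
        string html =
            "<div class='k-user-item'><span class='k-item--buy-date'>first</span></div>" +
            "<div class='k-user-item'><span class='k-item--buy-date'>second</span></div>";

        // "//" starts at the document root: every item reports the first match in the page.
        Console.WriteLine(string.Join(", ", ReadBuyDates(html, "//*[@class='k-item--buy-date']")));

        // ".//" starts at the current item: each item reports its own value.
        Console.WriteLine(string.Join(", ", ReadBuyDates(html, ".//*[@class='k-item--buy-date']")));
    }
}
```

The same applies to chained calls in the question such as trNodes[1].SelectSingleNode("//b"): each of those inner expressions needs its own leading dot.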

Html Agility Pack get specific content from a div

I'm trying to pull text from a "div" and to exclude everything else. Can you help me please ?!
<div class="article">
<div class="date">01.01.2000</div>
<div class="news-type">Breaking News</div>
"Here is the location of the text i would like to pull"
</div>
When I pull the "article" class I get everything, but I don't know how to exclude class="date", class="news-type", and everything inside them.
Here is the code I use:
HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[contains(@class,'article')]"))
{
name_text.text += node.InnerHtml.Trim();
}
Thank you!
Another way would be to use the XPath text()[normalize-space()] to get the non-empty, direct-child text nodes of the div elements:
var divs = doc.DocumentNode.SelectNodes("//div[contains(@class,'article')]");
foreach (HtmlNode div in divs)
{
var node = div.SelectSingleNode("text()[normalize-space()]");
Console.WriteLine(node.InnerText.Trim());
}
output :
"Here is the location of the text i would like to pull"
You want the ChildNodes that are type HtmlTextNode. Untested suggested code:
var textNodes = node.ChildNodes.OfType<HtmlTextNode>();
if (textNodes.Any())
{
name_text.text += string.Join(string.Empty, textNodes.Select(tn => tn.InnerHtml));
}

text of node's inner text and first child nodes text

I have multiple links in page of structure like this:
<a ....>
<b>Text I Need</b>
Also Text I need
</a>
And i want to extract string for example from code above "Text I NeedAlso Text I need"
I successfully extract second part, but I'm not sure how to select text inside b tags as well, currently I'm using this:
var link_list = doc.DocumentNode.SelectNodes(@"/a/text()");
foreach (var link in link_list)
{
Console.WriteLine(link.InnerText);
}
Should I instead get the HTML of the a element, strip the tags with a regex, and extract the text from that, or is there some other way?
Accessing the InnerText property of <a> should give you all text nodes at once:
var html = @"<a ....>
<b>Text I Need</b>
Also Text I need
</a>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var link_list = doc.DocumentNode.SelectNodes("/a");
foreach (var link in link_list)
{
Console.WriteLine(link.InnerText);
}
Or, if you really need only the direct-child text nodes and the grandchild text nodes, try it this way:
var link_list = doc.DocumentNode.SelectNodes("/a");
foreach (var link in link_list)
{
var texts = link.SelectNodes("text() | */text()");
Console.WriteLine(String.Join("", texts.Select(o => o.InnerText)));
}
output :
Text I Need
Also Text I need

get html node inner text segmented?

I am trying to parse an HTML page and I want to get the inner text of a node in segments, i.e. iterate over the node's children, treating each run of text as if it were a child of its own:
<node1>
This text I WANT on iterate#1
<innernode>This text I WANT on iterate#2</innernode>
This text I WANT on iterate#3
<innernode>This text I WANT on iterate#4</innernode>
This text I WANT on iterate#5
</node1>
I am using htmlagilitypack as the parser, but I think I would face this problem with any other HTML parser.
Depending on your .NET version, you could use an extension method that works on the node you want.
The version below is a sketch: it walks the node's direct children in document order, so each text run and each inner element becomes one segment.
For example:
using System.Collections.Generic;
using HtmlAgilityPack;

public static class HtmlNodeExtensions
{
    // Yields one trimmed string per direct child that carries text.
    public static IEnumerable<string> GetTextSegments(this HtmlNode node)
    {
        foreach (HtmlNode child in node.ChildNodes)
        {
            // Keep text nodes and elements; skip comments.
            if (child.NodeType != HtmlNodeType.Text && child.NodeType != HtmlNodeType.Element)
                continue;
            string text = child.InnerText.Trim();
            if (text.Length > 0)
                yield return text; // whitespace-only runs between elements are skipped
        }
    }
}
You could then call it like so:
HtmlNode nodeOfTypeNode1 = ... // the <node1> element
foreach (string text in nodeOfTypeNode1.GetTextSegments())
{
    Console.WriteLine(text);
}
To get your goal, use SelectNodes with XPath.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(content);//content is the variable containing your html.
var items = doc.DocumentNode.SelectNodes("/node1//text()");
foreach (var item in items)
{
Console.WriteLine(item.OuterHtml.Replace("\r\n",""));
}

C#, Html Agility, Selecting every paragraph within a div tag

How can I select every paragraph in a div tag? For example:
<div id="body_text">
<p>Hi</p>
<p>Help Me Please</P>
<p>Thankyou</P>
I have Html Agility Pack downloaded and referenced in my program; all I need is the paragraphs. There may be a variable number of paragraphs, and there are lots of different div tags, but I only need the content within body_text. I assume this can then be stored as a string, which I want to write to a .txt file for later reference. Thank you.
The valid XPath for your case is //div[@id='body_text']/p
foreach (HtmlNode node in yourHTMLAgilityPackDocument.DocumentNode.SelectNodes("//div[@id='body_text']/p"))
{
string text = node.InnerText; //that's the text you are looking for
}
Here's a solution that grabs the paragraphs as an enumeration of HtmlNodes:
HtmlDocument doc = new HtmlDocument();
doc.Load("your.html");
var div = doc.GetElementbyId("body_text");
var paragraphs = div.ChildNodes.Where(item => item.Name == "p");
Without explicit Linq:
var paragraphs = doc.GetElementbyId("body_text").Elements("p");
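The question also asks about saving the result to a .txt file, which the answers above leave open. One possible way, sketched under assumptions: the class ParagraphExport, the method ExtractParagraphs, the inline markup, and the output file name paragraphs.txt are all made up for illustration.

```csharp
using System;
using System.IO;
using System.Linq;
using HtmlAgilityPack;

static class ParagraphExport
{
    // Returns the trimmed text of each <p> that is a direct child of #body_text.
    // (Hypothetical helper for illustration.)
    public static string[] ExtractParagraphs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode div = doc.GetElementbyId("body_text");
        if (div == null)
        {
            return new string[0]; // no such div in this document
        }

        return div.Elements("p")
                  .Select(p => HtmlEntity.DeEntitize(p.InnerText).Trim())
                  .ToArray();
    }

    static void Main()
    {
        // Assumed sample markup, modelled on the question.
        string html = "<div id=\"body_text\"><p>Hi</p><p>Help Me Please</p><p>Thankyou</p></div>";

        string[] paragraphs = ExtractParagraphs(html);

        // One paragraph per line; the file name is an arbitrary choice.
        File.WriteAllLines("paragraphs.txt", paragraphs);
        Console.WriteLine(string.Join(Environment.NewLine, paragraphs));
    }
}
```

If a single string is preferred over an array, string.Join(Environment.NewLine, paragraphs) followed by File.WriteAllText would do the same job.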
