Only grab some innertext from a SelectNode with HtmlAgilityPack - c#

I've been using HtmlAgilityPack in order to parse some html in a web page. The current html looks like this:
<div class="price__child price__price flex-child__auto tooltip-container">
<div class="price__min-order tooltip-container js-minOrder">
<i>⚠️</i>
<div class="price__min-order-tooltip tooltip">
Minimum order of $15.00.
</div>
</div>
$1.75
</div>
I only want to retrieve the text of the price at the very end, in this case the $1.75. Doing something like the code below returns that number, but also all of the other text within the larger div.
return node
.SelectSingleNode(".//div[contains(@class, 'price__child price__price')]")
.InnerText
.Trim().Replace(" ", "")
.TrimStart('$');
Is there a way to exclude/not grab the innertext from the price__min-order tooltip-container js-minOrder and also the price__min-order-tooltip tooltip, and only grab the 1.75 from the larger div?

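I found a way to do it. If you select the unwanted child node and call Remove() on it, it is taken out of the parent, so its text no longer shows up in the parent's InnerText.
// With the markup above, ChildNodes[0] is the whitespace text node,
// so ChildNodes[1] is the nested price__min-order div.
var priceNode = node
.SelectSingleNode(".//div[contains(@class, 'price__child price__price')]")
?.ChildNodes[1];
priceNode?.Remove();
return node
.SelectSingleNode(".//div[contains(@class, 'price__child price__price')]")
.InnerText
.Trim().Replace(" ", "")
.TrimStart('$');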
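A non-destructive alternative is to read only the text nodes that are direct children of the price div, using the XPath text() step. The following is a minimal sketch, not the original poster's method. It assumes C# 9 or later (top-level statements) and the HtmlAgilityPack NuGet package, and it uses a plain "!" in place of the warning icon from the question's markup.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Sample markup modeled on the question (the icon is replaced by "!").
var html = @"<div class=""price__child price__price flex-child__auto tooltip-container"">
  <div class=""price__min-order tooltip-container js-minOrder"">
    <i>!</i>
    <div class=""price__min-order-tooltip tooltip"">
      Minimum order of $15.00.
    </div>
  </div>
  $1.75
</div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// text() matches only text nodes that are direct children of the price div,
// so the nested tooltip text is skipped and nothing has to be removed.
var textNodes = doc.DocumentNode.SelectNodes(
    "//div[contains(@class, 'price__child price__price')]/text()");

// SelectNodes returns null when nothing matches, hence the null check.
string price = textNodes == null
    ? string.Empty
    : string.Concat(textNodes.Select(n => n.InnerText)).Trim().TrimStart('$');

Console.WriteLine(price);
```

Because the document is left untouched, the same node can still be queried afterwards for the tooltip text if it is needed.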

Related

How to replace span with inline style tag to b tag in c#?

I have some text like the following:
<span style="font-weight: 700;">Aanbod wielen (banden + velgen) </span>
<br><br>
<span style="font-weight: 500;">lichtmetalen originele Volvo set met winterbanden:<br>origineel:</span> Volvo<br>
<b>inch maat:</b> 15''<br>
<p>steek:</p> 5x108mm<br>
<span style="font-weight: 700;">naafgat:</span>
I need to find each span tag that has an inline font-weight style and replace it with a <b> tag in C#, replacing the closing </span> with </b> as well. The result should look like this:
<b>Aanbod wielen (banden + velgen)</b>
<br><br>
<b>lichtmetalen originele Volvo set met winterbanden:<br>origineel:</b> Volvo <br>
<b>inch maat:</b> 15''<br>
<p>steek:</p> 5x108mm<br>
<b>naafgat:</b>
How can I identify and replace these tags? Please help.
You can replace your span with b by using HtmlAgilityPack, which is free and open source.
You can install HtmlAgilityPack from NuGet: Install-Package HtmlAgilityPack -Version 1.8.9
// Requires: using System.IO; using HtmlAgilityPack;
public string ReplaceSpanByB()
{
    HtmlDocument doc = new HtmlDocument();
    string htmlContent = File.ReadAllText(@"C:\Users\xxx\source\repos\ConsoleApp4\ConsoleApp4\Files\HTMLPage1.html");
    doc.LoadHtml(htmlContent);
    // SelectNodes returns null (not an empty collection) when nothing matches.
    var spans = doc.DocumentNode.SelectNodes("//span");
    if (spans != null)
    {
        foreach (HtmlNode node in spans)
        {
            var attributes = node.Attributes;
            foreach (var item in attributes)
            {
                if (item.Name.Equals("style") && item.Value.Contains("font-weight"))
                {
                    // Build a <b> with the same content and swap it in for the <span>.
                    HtmlNode b = doc.CreateElement("b");
                    b.InnerHtml = node.InnerHtml;
                    node.ParentNode.ReplaceChild(b, node);
                    break; // the span is replaced; stop scanning its attributes
                }
            }
        }
    }
    return doc.DocumentNode.OuterHtml;
}
1st: Don't use Regex. Although it is possible and seems like the logical choice,
it is mostly wrong and full of pain.
A happy post about it can be found HERE.
2nd:
Use an HTML parser such as https://html-agility-pack.net/ to traverse the tree
(you can use XPath to easily find all the span elements you want to replace)
and replace each of those span elements with a b (don't forget to set the new b element's contents).
Side note: As far as I recall, the b tag is discouraged,
so if you only need the span text to be bold,
it already is because of the inline style (font-weight: 700 is the same as bold).
On https://developer.mozilla.org/en-US/docs/Web/HTML/Element/b:
"Historically, the <b> element was meant to make text boldface. Styling information has been deprecated since HTML4, so the meaning of the <b> element has been changed." and "The HTML Bring Attention To element (<b>) is used to draw the reader's attention to the element's contents, which are not otherwise granted special importance." – Thanks @Richardissimo

Select a Node which has specified subnodes

I have to write a web scraper. My php page is:
<a href="Something.php">
<div class="SPECIFIEDCLASS" title="other something">
</div>
</a>
What I wrote so far is:
var diiv = doc.DocumentNode.SelectNodes("//a/div[@class='SPECIFIEDCLASS']");
var hrefLiist = diiv.Select(q => q.GetAttributeValue("href", "not found")).ToList();
but it's not working.
Your XPath expression selects the div tags with the specified class that sit inside a tags.
What you want are the a tags that contain a div tag with the specified class. Use this XPath expression instead:
var diiv = doc.DocumentNode.SelectNodes("//a[div[@class='SPECIFIEDCLASS']]");
For a more visual explanation:
Your XPath does this to each a tag:
1. Get the a tag.
2. Get its child div tag.
3. Select the div tags with class = "SPECIFIEDCLASS". So ultimately, the div tags themselves are selected.
The correct XPath should do this:
1. Get the a tag.
2. Select the a tags where a child div tag has class = "SPECIFIEDCLASS". Here the a tags are selected.
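The difference between the two expressions can be seen in a small sketch. It assumes C# 9 or later and the HtmlAgilityPack NuGet package. The second anchor (Other.php) is a hypothetical addition that is not in the question; it is only there to show that non-matching links are filtered out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// First anchor is from the question; the second one is a made-up counterexample.
var html = @"<a href=""Something.php"">
  <div class=""SPECIFIEDCLASS"" title=""other something""></div>
</a>
<a href=""Other.php""><div class=""somethingelse""></div></a>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// The predicate [div[@class='SPECIFIEDCLASS']] filters the a elements,
// so the nodes returned are the anchors themselves, not their child divs.
var anchors = doc.DocumentNode.SelectNodes("//a[div[@class='SPECIFIEDCLASS']]");

// SelectNodes returns null when nothing matches, so fall back to an empty list.
List<string> hrefs = anchors == null
    ? new List<string>()
    : anchors.Select(a => a.GetAttributeValue("href", "not found")).ToList();

foreach (var href in hrefs)
{
    Console.WriteLine(href);
}
```

Since the selected nodes are a elements, GetAttributeValue("href", ...) now reads the attribute from the right node instead of falling back to the default value.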

How to extract text inside a div tag using htmlagilitypack

I want to extract the text "Some text goes here" between the div class.
I am using html agility pack, and c#
<div class="productDescriptionWrapper">
Some Text Goes here...
<div class="emptyClear"> </div>
</div>
this is what I have :
Description = doc.DocumentNode.SelectNodes("//div[@class=\"productDescriptionWrapper\"]").Descendants("div").Select(x => x.InnerText).ToList();
I get this error :
An unhandled exception of type 'System.NullReferenceException'
I know how to extract the text if it is between <h1> or <p> tags: instead of "div" in Descendants, I would pass "h1" or "p".
Somebody please assist.
Use single quotes, such as
//div[@class='productDescriptionWrapper']
To get all descendants of all types, use:
//div[@class='productDescriptionWrapper']//*
To get all descendants of a specific type, such as p, use:
//div[@class='productDescriptionWrapper']//p
To get all descendants that are either a div or a p, use:
//div[@class='productDescriptionWrapper']//*[self::div or self::p]
To get all non-blank descendant text nodes, use:
//div[@class='productDescriptionWrapper']//text()[normalize-space()]
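As a rough illustration of the last expression, the sketch below collects the non-blank text nodes under the wrapper div. It assumes C# 9 or later and the HtmlAgilityPack NuGet package; the inner div's content ("Inner text") is a placeholder added for the example and is not in the question's markup.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Markup modeled on the question; "Inner text" is a placeholder for illustration.
var html = @"<div class=""productDescriptionWrapper"">
Some Text Goes here...
<div class=""emptyClear"">Inner text</div>
</div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// //text() visits text nodes at any depth below the wrapper div;
// [normalize-space()] keeps only those that are not pure whitespace.
var textNodes = doc.DocumentNode.SelectNodes(
    "//div[@class='productDescriptionWrapper']//text()[normalize-space()]");

// SelectNodes returns null when nothing matches, so fall back to an empty list.
List<string> texts = textNodes == null
    ? new List<string>()
    : textNodes.Select(n => n.InnerText.Trim()).ToList();

foreach (var text in texts)
{
    Console.WriteLine(text);
}
```

Checking the result of SelectNodes for null before chaining LINQ calls is also a common way to avoid a NullReferenceException when an expression matches nothing.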
Given that doc is created from the HTML snippet you posted, there is no way you can get a null reference exception. Anyway, if you meant to get the text within the outer <div> but not from the inner one, use the XPath step /text(), which means "get the direct child text nodes".
For example, given this HTML snippet:
var html = @"<div class=""productDescriptionWrapper"">
Some Text Goes here...
<div class=""emptyClear"">Don't get this one</div>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
...this expression returns the text from the outer <div> only:
var Description = doc.DocumentNode
.SelectNodes("//div[@class='productDescriptionWrapper']/text()")
.Select(x => x.InnerText.Trim())
.First();
//Description :
//"Some Text Goes here..."
...while in contrast, the following returns all the text:
var Description = doc.DocumentNode
.SelectNodes("//div[@class='productDescriptionWrapper']")
.Select(x => x.InnerText.Trim())
.First();
//Description :
//"Some Text Goes here...
//Don't get this one"

How to unwrap an element if it exists with CsQuery?

I'm using CsQuery to read values of HTML elements.
In advance, I don't know if the <a> element contains a <font> element or not.
Is there a way to read the InnerText of an anchor regardless of whether it contains a font element or not?
Scenario 1: Text inside font element
<div class="link">
<a href="http://www.example.com/1">
<font>Foo</font>
</a>
</div>
Scenario 2: Text without font element
<div class="link">
<a href="http://www.example.com/2">
Foo
</a>
</div>
I've got the following working solution:
var dom = CQ.CreateFromUrl("http://www.myurl.com");
var a = new CQ(dom.Select("div.link a").InnerHTML);
var font = a.Select("font");
var myValue = font.Count() > 0 ? font[0].InnerText : a[0].InnerText;
But it's a bit messy and I'd rather just always remove the font element - if present - so I could go for the anchor value right away. Something like Contents() in combination with UnWrap(), but I haven't succeeded to make it work. Ideas anyone?
var dom = CQ.CreateFromUrl("http://www.myurl.com");
string result = dom[".link a"].Text();

Select link inside div tag

I would like to get a link (URL to be specific) inside a div class. This is the code I have that gets me the text inside div class (Some text...).
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='content']"))
{
//saves text (node.InnerText) in array
}
This is the HTML from the site. I would like to get www.google.com
<div class="content">
<p>Some text...
<a href="www.google.com">LINK</a>
</p>
</div>
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='novica']/p/a[@href='www.google.com']"))
{
//saves text (node.InnerText) in array
}
That code is not valid based on what you wrote; however, you have 2 options:
Once you have the node for the div, use .Descendants("a") or walk its child nodes to pull out the link, then get its href attribute.
Amend your SelectNodes() XPath to get the a tag instead: //div[@class='novica']/p/a.
The first is obviously better if you also need the .InnerText of that element to get Some text...; however, the second would be faster.
foreach (var node in doc.DocumentNode.SelectNodes("//div[@class='novica']"))
{
var links = node.Descendants("a").Select(n => n.InnerText).ToList();
}
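The snippet above collects the link text. To get the URL itself, the href attribute can be read from each a node. This is a minimal sketch under a few assumptions: C# 9 or later, the HtmlAgilityPack NuGet package, and the class name content from the question's markup (the answers use novica, so adjust the class to match the real page).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Markup modeled on the question's snippet.
var html = @"<div class=""content"">
  <p>Some text...
    <a href=""www.google.com"">LINK</a>
  </p>
</div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Select every a element with an href anywhere below the content div.
var anchors = doc.DocumentNode.SelectNodes("//div[@class='content']//a[@href]");

// SelectNodes returns null when nothing matches, so fall back to an empty list.
List<string> urls = anchors == null
    ? new List<string>()
    : anchors.Select(a => a.GetAttributeValue("href", string.Empty)).ToList();

foreach (var url in urls)
{
    Console.WriteLine(url);
}
```

Using the descendant step (//a) rather than /p/a keeps the query working if the link is wrapped in additional elements.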
