C#: What is the #text node in HtmlNode?

I am trying to go through each html node and get its attribute and innerText. At the moment when I am scanning through any html I am getting this stupid #text node even though it doesn't exist.
Here is my html
<div class="demographic-info adr editable-item" id="demographics">
<div id="location-container" data-li-template="location">
<div id="location" class="editable-item">
<dl>
<dt>Location</dt>
<dd>
<span class="locality">Bolton, United Kingdom</span>
</dd>
<dt>Industry</dt>
<dd class="industry">Computer Games</dd>
</dl>
</div>
</div>
</div>
And here is my c#
foreach (HtmlNode node in j.ChildNodes)
    if (node.HasChildNodes)
        checkNode(node);

static void checkNode(HtmlNode node)
{
    foreach (HtmlNode n in node.ChildNodes)
    {
        if (n.HasChildNodes)
            checkNode(n);
        else
        {
            HtmlNode nodeValue = hasValueInNode(n);
            if (nodeValue != null)
                addCategories(nodeValue);
        }
    }
}
When I step through in debug mode to check which node the code is currently at, I get this:
1 = div, 2 = #text, 3 = div, 4 = #text, 5 = div, 6 = #text, 7 = dl ...
and so on!
I am guessing that it is detecting whitespace or line breaks as nodes, but this is such a waste of loop iterations. Can someone explain this to me and suggest a way to avoid it? Thanks

This is how HTML/XML works. There is a text node every time there is some text inside a node. In this case it happens to be whitespace, but it is still text and it cannot be discarded. The node is not "stupid" and it does exist.
Your code is free to check if the text node is whitespace and ignore it if you want to, or you can make the XML so that there isn't any whitespace.
Just as a thought: how would you tell the parser which whitespace should be important:
<div>
<div>Test<span>
</span>test</div>
</div>
So, should the parser just say "there's Test, then an empty span element, then test, so the text inside is actually 'Testtest'"? Or how would it know what to do?
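As a minimal sketch of the "check and ignore" option mentioned above, assuming HtmlAgilityPack is the parser in use (the question's HtmlNode type suggests it), the loop can skip text nodes that contain only whitespace. The Walk method and the sample markup are made up for illustration:

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        // Shortened, hypothetical version of the markup from the question.
        doc.LoadHtml("<div>\n  <dl>\n    <dt>Location</dt>\n  </dl>\n</div>");
        Walk(doc.DocumentNode, 0);
    }

    // Hypothetical helper: prints the name of every node except
    // whitespace-only text nodes, then recurses into its children.
    static void Walk(HtmlNode node, int depth)
    {
        foreach (HtmlNode child in node.ChildNodes)
        {
            bool isBlankText = child.NodeType == HtmlNodeType.Text
                               && string.IsNullOrWhiteSpace(child.InnerText);
            if (isBlankText)
                continue; // the "#text" nodes produced by indentation and line breaks

            Console.WriteLine(new string(' ', depth * 2) + child.Name);
            Walk(child, depth + 1);
        }
    }
}
```

Text nodes that carry real content (such as "Location") are still visited; only the ones made up entirely of whitespace are skipped.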

Related

Only grab some innertext from a SelectNode with HtmlAgilityPack

I've been using HtmlAgilityPack in order to parse some html in a web page. The current html looks like this:
<div class="price__child price__price flex-child__auto tooltip-container">
<div class="price__min-order tooltip-container js-minOrder">
<i>⚠️</i>
<div class="price__min-order-tooltip tooltip">
Minimum order of $15.00.
</div>
</div>
$1.75
</div>
I only want to retrieve the text of the price at the very end, in this case the $1.75. Doing something like the code below returns that number, but also all of the other text within the larger div.
return node
    .SelectSingleNode(".//div[contains(@class, 'price__child price__price')]")
    .InnerText
    .Trim().Replace(" ", "")
    .TrimStart('$');
Is there a way to exclude/not grab the innertext from the price__min-order tooltip-container js-minOrder and also the price__min-order-tooltip tooltip, and only grab the 1.75 from the larger div?
I found a way to do it. If you select the unwanted child node and call Remove() on it, its text no longer shows up in the parent's InnerText.
var priceNode = node
    .SelectSingleNode(".//div[contains(@class, 'price__child price__price')]")
    ?.ChildNodes[1];
priceNode?.Remove();
return node
    .SelectSingleNode(".//div[contains(@class, 'price__child price__price')]")
    .InnerText
    .Trim().Replace(" ", "")
    .TrimStart('$');
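An alternative that leaves the document untouched is to read only the text nodes that are direct children of the price div. This is a sketch under the assumption that the price always sits directly inside that div, as in the markup above; the sample HTML is a shortened stand-in and the shorter class test is just for brevity:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        // Shortened, hypothetical stand-in for the markup in the question.
        doc.LoadHtml(
            "<div class=\"price__child price__price\">" +
            "<div class=\"price__min-order\"><i>!</i>" +
            "<div class=\"tooltip\">Minimum order of $15.00.</div></div>" +
            " $1.75 </div>");

        HtmlNode priceDiv = doc.DocumentNode
            .SelectSingleNode("//div[contains(@class, 'price__price')]");

        // Concatenate only the direct text children, so the text of the
        // nested tooltip div is never included.
        string price = string.Concat(
                priceDiv.ChildNodes
                    .Where(n => n.NodeType == HtmlNodeType.Text)
                    .Select(n => n.InnerText))
            .Trim()
            .TrimStart('$');

        Console.WriteLine(price);
    }
}
```

Unlike the ChildNodes[1] approach, this does not depend on the position of the nested div among the children.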

C# HTMLNode get correctly innerText of div

I am trying to correctly extract the innerText of a list of div I am getting from a website.
This is what I came up with but still a bit buggy as it misses whitespaces and the - symbol.
var first = mainmenuTitles[x]
    .Descendants("div")
    .FirstOrDefault(o => o.GetAttributeValue("class", "") == "left")
    .Elements("a")
    .ToList();
string final = "";
foreach (var countfirst in first)
{
    final += countfirst.InnerText;
}
Console.WriteLine("Title: " + final);
This is how the html code looks like
<div class="row row-tall mt4">
<div class="clear">
<div class="left">
<a href="/soccer/italy/">
<strong>Italy</strong>
</a>
-
Serie C:: group B
</div> <div class="right fs11"> March 31 </div> </div> </div>
The text I am trying to get should look like this ->
Italy - Serie C:: group B
I am not a html guru so forgive me if it is too simple and I am missing it.
You can write a query to look up all nodes with xpath //div/a and then concatenate the inner text to get the text you are looking for. Make sure you trim the text to get rid of extra spaces and returns.
Console.WriteLine(string.Join(" - ", doc.DocumentNode.SelectNodes("//div/a").Select(x => x.InnerText.Trim())));
Output:
Italy - Serie C:: group B
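Side note: you can use a more specific query to make sure you get the right div, for example by matching on the class name: .SelectNodes("//div[@class='row row-tall mt4']//a"). This gives you all the <a> tags under that div. Note the double slash before a: a single slash would only match <a> elements that are direct children of the div.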
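One thing to keep in mind with the markup in the question: the dash and "Serie C:: group B" are a text node that is a sibling of the <a> element, not part of it, so a query that returns only <a> elements will not include them. A sketch of an alternative, assuming the div with class "left" contains exactly the text that is wanted, is to take that div's InnerText and collapse the runs of whitespace:

```csharp
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        // The "left" div from the question, reduced to a string literal.
        doc.LoadHtml(
            "<div class=\"left\">\n<a href=\"/soccer/italy/\">\n" +
            "<strong>Italy</strong>\n</a>\n-\nSerie C:: group B\n</div>");

        HtmlNode left = doc.DocumentNode.SelectSingleNode("//div[@class='left']");

        // InnerText covers the link text and the trailing text node;
        // DeEntitize decodes entities such as &amp; if any are present.
        string raw = HtmlEntity.DeEntitize(left.InnerText);

        // Collapse line breaks and indentation into single spaces.
        string title = Regex.Replace(raw, @"\s+", " ").Trim();

        Console.WriteLine("Title: " + title);
    }
}
```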

How to replace span with inline style tag to b tag in c#?

I have some text like as below
<span style="font-weight: 700;">Aanbod wielen (banden + velgen) </span>
<br><br>
<span style="font-weight: 500;">lichtmetalen originele Volvo set met winterbanden:<br>origineel:</span> Volvo<br>
<b>inch maat:</b> 15''<br>
<p>steek:</p> 5x108mm<br>
<span style="font-weight: 700;">naafgat:</span>
I need to find each span tag that has an inline font-weight style and replace it with a <b> tag in C#, replacing the closing </span> with </b> as well. The result should look like this:
<b>Aanbod wielen (banden + velgen)</b>
<br><br>
<b>lichtmetalen originele Volvo set met winterbanden:<br>origineel:</b> Volvo <br>
<b>inch maat:</b> 15''<br>
<p>steek:</p> 5x108mm<br>
<b>naafgat:</b>
How can I identify and replace these spans? Any help is appreciated.
You can replace your span elements with b elements using HtmlAgilityPack, which is free and open source.
You can install HtmlAgilityPack from NuGet: Install-Package HtmlAgilityPack -Version 1.8.9
public string ReplaceSpanByB()
{
    HtmlDocument doc = new HtmlDocument();
    string htmlContent = File.ReadAllText(@"C:\Users\xxx\source\repos\ConsoleApp4\ConsoleApp4\Files\HTMLPage1.html");
    doc.LoadHtml(htmlContent);
    // SelectNodes returns null (not an empty list) when nothing matches.
    if (doc.DocumentNode.SelectNodes("//span") != null)
    {
        foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//span"))
        {
            var attributes = node.Attributes;
            foreach (var item in attributes)
            {
                if (item.Name.Equals("style") && item.Value.Contains("font-weight"))
                {
                    // Build a <b> with the same contents and swap it in.
                    HtmlNode b = doc.CreateElement("b");
                    b.InnerHtml = node.InnerHtml;
                    node.ParentNode.ReplaceChild(b, node);
                    break; // the span has been replaced, stop checking its attributes
                }
            }
        }
    }
    return doc.DocumentNode.OuterHtml;
}
1st: Don't use Regex. Although it is possible and seems like the logical choice,
it is mostly wrong and full of pain.
A happy post about it can be found HERE
2nd:
Use an HTML parser such as https://html-agility-pack.net/ to traverse the tree
(you can use XPath to easily find all the span elements you want to replace)
and replace those span elements with b elements (don't forget to set the contents of the new b element).
Side note: As far as I recall, the b tag is discouraged,
so if you only need the span text to be bold,
it already is, because of "font-weight:bold".
On https://developer.mozilla.org/en-US/docs/Web/HTML/Element/b :
"Historically, the element was meant to make text boldface. Styling information has been deprecated since HTML4, so the meaning of the element has been changed." and "The HTML Bring Attention To element (<b>) is used to draw the reader's attention to the element's contents, which are not otherwise granted special importance." – Thanks @Richardissimo

XPath retrieving values from multiple tags inside a node

I'm currently creating a crawler and I'm at the point where I need to extract the data as a set so I can send it to a database as a single row, nice and neat.
Here is a snippet of my program. So far it correctly goes to each page and retrieves the corresponding URL.
int tempflag = 0;
//linkValueList is full of sub urls previously crawled in the program
foreach (string str in linkValueList)
{
    string tempURL = baseURL + str;
    HtmlWeb tempWeb = new HtmlWeb();
    HtmlDocument tempHtml = tempWeb.Load(tempURL);
    foreach (HtmlNode node in tempHtml.DocumentNode.SelectNodes("//article[@itemprop='product']"))
    {
        //get the category from the linkNameList
        string tempCategory = linkNameList.ElementAt(tempflag);
        //grab url
        string tempHref = node.GetAttributeValue("data-itemurl", string.Empty);
        //grab image url
        //grab brand
        //grab name
        //grab price
        //send to database via INSERT
    }
    tempflag++;
}
Here is the site code I am working with, this is an example of one item, each item looks similar
<article .... itemprop="product" data-itemurl="Item's url">
<figure>
<a ....>
<img .... src="item's image source" ...>
</a>
<div ...>
<a>....</a>
</div>
</figure>
<div ...>
<a ....>
<div class="brand" itemprop="brand">Item's Brand</div>
<div class="title" itemprop="name">Item's Name</div>
</a>
<div ....>
<div class="msrp"></div>
<div class="price" itemprop="price">$18.99 - $119.99</div>
<span ...> ... </span>
</div>
</div>
</article>
As you can see I have already used XPath to get myself inside of the <article> tag to get the data-itemurl to retrieve the item's url. My question is now that I am already inside of the <article> tag, is there an easy way to now access the other tags nested inside?
I need to get to the <img> tag for the image's url, <div itemprop="brand"> for the brand, <div itemprop="name"> for the item name, and <div itemprop="price"> for the price.
As I mentioned before, I am trying to get all of that information in one go around so I can query it all into a database as a single insert statement at the end of each loop.
Sure, you can use another XPath to query within a given element. One thing to note, which has troubled many people: never start a relative XPath with /, because it will then search the entire document. Start it with ./ (or .// for any descendant) instead. For example (SelectSingleNode() is assumed to always find the target element here; otherwise you need to check that the result is not null first):
foreach (HtmlNode node in tempHtml.DocumentNode.SelectNodes("//article[@itemprop='product']"))
{
    string img = node.SelectSingleNode(".//img").GetAttributeValue("src", "");
    string brand = node.SelectSingleNode(".//div[@itemprop='brand']").InnerText.Trim();
    .....
}
You can also use LINQ instead of XPath: node.Descendants("img") or node.Descendants("div").Where(d => d.Attributes.Contains("itemprop") && d.Attributes["itemprop"].Value.Equals("price"))
Hope it helps.
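For the case where an element might be missing, here is a minimal sketch of the same idea with the null checks mentioned above. The TextOf helper and the sample markup are made up for illustration; only the itemprop names come from the question:

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    // Hypothetical helper: trimmed text of the first match of a relative
    // XPath inside scope, or an empty string when nothing matches.
    static string TextOf(HtmlNode scope, string relativeXPath)
    {
        HtmlNode found = scope.SelectSingleNode(relativeXPath);
        return found == null
            ? string.Empty
            : HtmlEntity.DeEntitize(found.InnerText).Trim();
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        // Hypothetical stand-in for one product from the crawled page.
        doc.LoadHtml(
            "<article itemprop=\"product\" data-itemurl=\"/item/1\">" +
            "<figure><a><img src=\"/img/1.jpg\"></a></figure>" +
            "<div><a><div itemprop=\"brand\">Brand</div>" +
            "<div itemprop=\"name\">Name</div></a>" +
            "<div><div itemprop=\"price\">$18.99</div></div></div>" +
            "</article>");

        var articles = doc.DocumentNode.SelectNodes("//article[@itemprop='product']");
        if (articles == null)
            return; // SelectNodes returns null when there is no match

        foreach (HtmlNode article in articles)
        {
            string url = article.GetAttributeValue("data-itemurl", string.Empty);
            // The leading dot keeps each query inside the current article.
            string img = article.SelectSingleNode(".//img")
                             ?.GetAttributeValue("src", string.Empty) ?? string.Empty;
            string brand = TextOf(article, ".//div[@itemprop='brand']");
            string name = TextOf(article, ".//div[@itemprop='name']");
            string price = TextOf(article, ".//div[@itemprop='price']");

            // All values for one row are now available for a single INSERT.
            Console.WriteLine(string.Join(" | ", url, img, brand, name, price));
        }
    }
}
```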

How to unwrap an element if it exists with CsQuery?

I'm using CsQuery to read values of HTML elements.
In advance, I don't know if the <a> element contains a <font> element or not.
Is there a way to read the InnerText of an anchor regardless of whether it contains a font element or not?
Scenario 1: Text inside font element
<div class="link">
<a href="http://www.example.com/1">
<font>Foo</font>
</a>
</div>
Scenario 2: Text without font element
<div class="link">
<a href="http://www.example.com/2">
Foo
</a>
</div>
I've got the following working solution:
var dom = CQ.CreateFromUrl("http://www.myurl.com");
var a = new CQ(dom.Select("div.link a").InnerHTML);
var font = a.Select("font");
var myValue = font.Count() > 0 ? font[0].InnerText : a[0].InnerText;
But it's a bit messy and I'd rather just always remove the font element - if present - so I could go for the anchor value right away. Something like Contents() in combination with UnWrap(), but I haven't succeeded to make it work. Ideas anyone?
var dom = CQ.CreateFromUrl("http://www.myurl.com");
string result = dom[".link a"].Text();
