Getting inner text with HTML Agility Pack - c#

I have the following webpage:
I am trying to grab the fields which have IDs and classnames:
label = node.SelectSingleNode(".//h3[@class='item_header']")
    .InnerText.Replace("Label: ", "").Trim();
Console.WriteLine(label);
However, I am having a hard time figuring out how to get the text here:
How do you parse the text within tags that have no IDs or classes, such as the following?
<b>Label Cat. #: WEST 3007/8</b>
If it is at all helpful, here is the unique selector:
#\31 42248 > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(1) > b:nth-child(1)

The HTML Agility Pack has a companion CSS Selector library, where you could use the selector in your question to find the element.
https://www.nuget.org/packages/HtmlAgilityPack.CssSelectors/

You have the ID of the table. You can just go from there.
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//table[@id='142248']//b");
foreach (HtmlNode n in nodes)
{
    if (n.InnerText.ToLower().Contains("label"))
    {
        Console.WriteLine(n.InnerText);
    }
}
The XPath in the above code gives you all the <b> elements in the table with the id 142248.
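One thing the loop above does not cover is that SelectNodes returns null, not an empty collection, when the XPath matches nothing. A minimal sketch of a null-safe variant follows; the class and method names (LabelFinder, FindLabels) are made up for illustration, and it assumes the table id contains no quote characters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LabelFinder
{
    // Hypothetical helper: returns the text of every <b> inside the table
    // with the given id whose text contains "label" (case-insensitive).
    public static List<string> FindLabels(string html, string tableId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when there is no match, so guard before iterating.
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//table[@id='" + tableId + "']//b");
        if (nodes == null)
        {
            return new List<string>();
        }

        return nodes
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(text => text.IndexOf("label", StringComparison.OrdinalIgnoreCase) >= 0)
            .ToList();
    }
}
```

Called with the page's HTML and "142248", this should return the "Label Cat. #: ..." text without throwing when the table is missing.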

Related

How to replace span with inline style tag to b tag in c#?

I have some text like the following:
<span style="font-weight: 700;">Aanbod wielen (banden + velgen) </span>
<br><br>
<span style="font-weight: 500;">lichtmetalen originele Volvo set met winterbanden:<br>origineel:</span> Volvo<br>
<b>inch maat:</b> 15''<br>
<p>steek:</p> 5x108mm<br>
<span style="font-weight: 700;">naafgat:</span>
I need to find every span tag with an inline font-weight style and replace it with a <b> tag, including the closing tag, in C#. The result should look like this:
<b>Aanbod wielen (banden + velgen)</b>
<br><br>
<b>lichtmetalen originele Volvo set met winterbanden:<br>origineel:</b> Volvo <br>
<b>inch maat:</b> 15''<br>
<p>steek:</p> 5x108mm<br>
<b>naafgat:</b>
How can I identify and replace these tags? Please help.
You can replace your span tags with b tags using HtmlAgilityPack, which is free and open source.
You can install HtmlAgilityPack from NuGet: Install-Package HtmlAgilityPack -Version 1.8.9
public string ReplaceSpanByB()
{
    HtmlDocument doc = new HtmlDocument();
    string htmlContent = File.ReadAllText(@"C:\Users\xxx\source\repos\ConsoleApp4\ConsoleApp4\Files\HTMLPage1.html");
    doc.LoadHtml(htmlContent);
    // SelectNodes returns null when no span is found.
    if (doc.DocumentNode.SelectNodes("//span") != null)
    {
        foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//span"))
        {
            var attributes = node.Attributes;
            foreach (var item in attributes)
            {
                if (item.Name.Equals("style") && item.Value.Contains("font-weight"))
                {
                    // Build a <b> with the same content and swap it in for the span.
                    HtmlNode b = doc.CreateElement("b");
                    b.InnerHtml = node.InnerHtml;
                    node.ParentNode.ReplaceChild(b, node);
                }
            }
        }
    }
    return doc.DocumentNode.OuterHtml;
}
Output:
1st: Don't use regex. Although it is possible and seems logical, it is mostly wrong and full of pain.
A happy post about it can be found HERE.
2nd: Use an HTML parser such as https://html-agility-pack.net/ to traverse the tree
(you can use XPath to easily find all the span elements you want to replace)
and replace each span element with a b (don't forget to set the new b element's contents).
Side note: As far as I recall, the b tag is discouraged,
so if you only need the span text to be bold,
it already is because of "font-weight:bold".
On https://developer.mozilla.org/en-US/docs/Web/HTML/Element/b :
"Historically, the <b> element was meant to make text boldface. Styling information has been deprecated since HTML4, so the meaning of the <b> element has been changed." and "The HTML Bring Attention To element (<b>) is used to draw the reader's attention to the element's contents, which are not otherwise granted special importance." (Thanks @Richardissimo)
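The XPath approach suggested in the second answer can be sketched as follows. This is only an illustration, not the answerer's code: the class and method names (SpanToBold, Convert) are invented here, and the XPath predicate moves the font-weight check into the query instead of looping over attributes.

```csharp
using System;
using HtmlAgilityPack;

public static class SpanToBold
{
    // Hypothetical helper: replaces every <span> whose style attribute
    // mentions font-weight with a <b> that has the same inner HTML.
    public static string Convert(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The predicate selects only spans with an inline font-weight style.
        HtmlNodeCollection spans =
            doc.DocumentNode.SelectNodes("//span[contains(@style, 'font-weight')]");

        // SelectNodes returns null when nothing matches.
        if (spans != null)
        {
            foreach (HtmlNode span in spans)
            {
                HtmlNode b = doc.CreateElement("b");
                b.InnerHtml = span.InnerHtml;
                span.ParentNode.ReplaceChild(b, span);
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Note that this treats any font-weight value as bold, including 500 in the sample, which matches the desired output in the question. Nested matching spans are not handled by this sketch.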

How read content of a span tag using HtmlAgilityPack?

I'm using HtmlAgilityPack to scrape data from a link (site). There are many p, header and span tags on the site. I need to scrape data from one particular span tag.
var webGet = new HtmlWeb();
var document = webGet.Load(URL);
foreach (HtmlNode node in document.DocumentNode.SelectNodes("\\span"))
{
string strData = node.InnerText.Trim();
}
I tried matching on a keyword in the parent tag, but that did not work for all kinds of URLs.
Please help me fix it.
What is the error?
You can start by fixing this:
foreach (HtmlNode node in document.DocumentNode.SelectNodes("\\span"))
it should be:
foreach (HtmlNode node in document.DocumentNode.SelectNodes("//span"))
But I want exact data. For example, there are many span tags in the source, such as <span>abc</span>, <span>def</span>, <span>pqr</span>, <span>xyz</span>. I want the result to be "pqr". Is there any option to get it by the count of a particular tag or by index?
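If you want to get, for example, the third span tag in the document:
doc.DocumentNode.SelectSingleNode("(//span)[3]")
Note that //span[3] without the parentheses selects every span that is the third span child of its own parent, which gives the same result only when the spans are siblings.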
If you want to get the node containing the text "pqr":
doc.DocumentNode.SelectSingleNode("//span[contains(text(),'pqr')]");
You can use SelectNodes for the latter to get all span tags containing "pqr" in the text.
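Both lookups can be wrapped in small helpers, sketched below. The names (SpanPicker, NthSpanText, SpanContaining) are hypothetical, and the sketch assumes the search text contains no single quotes, since it is spliced directly into the XPath string. The index version uses (//span)[n], which counts spans in document order across the whole page, whereas //span[n] counts among siblings only.

```csharp
using System;
using HtmlAgilityPack;

public static class SpanPicker
{
    // Hypothetical helper: text of the n-th <span> in document order (1-based),
    // or null when there are fewer than n spans.
    public static string NthSpanText(string html, int index)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode("(//span)[" + index + "]");
        return node == null ? null : node.InnerText.Trim();
    }

    // Hypothetical helper: text of the first <span> whose text contains the
    // given value, or null when none matches. Assumes no single quotes in text.
    public static string SpanContaining(string html, string text)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode(
            "//span[contains(text(), '" + text + "')]");
        return node == null ? null : node.InnerText.Trim();
    }
}
```

Both helpers return null instead of throwing when nothing matches, because SelectSingleNode itself returns null in that case.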

Parsing results from HTMLAgilityPack

I'm trying to parse the Yahoo Finance page for a list of stock symbols and company names. The URL I'm using is: http://uk.finance.yahoo.com/q/cp?s=%5EFTSE
The code I'm using is:
HtmlAgilityPack.HtmlDocument page = new HtmlWeb().Load("http://uk.finance.yahoo.com/q/cp?s=%5EFTSE");
var titles = page.DocumentNode.SelectNodes("//td[@class='yfnc_tabledata1']");
// Returns all td nodes with the class yfnc_tabledata1 as a collection.
foreach (var title in titles)
{
    txtLog.AppendText(title.InnerHtml + System.Environment.NewLine);
}
The txtLog.AppendText line was just for testing. The code correctly gets each td node that has the class yfnc_tabledata1. Inside the foreach loop, I now need to parse title to grab the symbol and company name from the following HTML:
<b>GLEN.L</b>
GLENCORE XSTRAT
<b>343.95</b> <nobr><small>3 May 16:35</small></nobr>
<img width="10" height="14" style="margin-right:-2px;" border="0"
src="http://l.yimg.com/os/mit/media/m/base/images/transparent-1093278.png"
class="pos_arrow" alt="Up"> <b style="color:#008800;">12.80</b>
<b style="color:#008800;"> (3.87%)</b> 68,086,160
Is it possible to parse the results of a parsed document? I'm a little unsure on where to start.
You just need to continue the XPath extraction from where you are. There are many possibilities. The difficulty is that all the yfnc_tabledata1 nodes are at the same level. Here is one way to do it (a console app example that dumps the list of symbols and companies):
HtmlAgilityPack.HtmlDocument page = new HtmlWeb().Load("http://uk.finance.yahoo.com/q/cp?s=%5EFTSE");
// Get the symbols directly under the first TD element: recursively search for an A element that has an HREF attribute under this TD.
var symbols = page.DocumentNode.SelectNodes("//td[@class='yfnc_tabledata1']//a[@href]");
foreach (var symbol in symbols)
{
    // From the current A element, go up two levels and get the next TD element.
    var company = symbol.SelectSingleNode("../../following-sibling::td").InnerText.Trim();
    Console.WriteLine(symbol.InnerText + ": " + company);
}
More on XPATH axes here: XPATH Axes
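To see the following-sibling step in isolation, here is a small sketch that runs the same two XPath expressions against an inline HTML string instead of the live page. The markup in the usage note is an assumption about how the table row was laid out (td > b > a for the symbol, then a sibling td for the company); the real page may differ. The class and method names (SymbolTable, Extract) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class SymbolTable
{
    // Hypothetical helper: maps each symbol link's text to the text of the
    // first <td> that follows the link's grandparent element.
    public static Dictionary<string, string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new Dictionary<string, string>();
        HtmlNodeCollection links =
            doc.DocumentNode.SelectNodes("//td[@class='yfnc_tabledata1']//a[@href]");
        if (links == null)
        {
            return result; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode link in links)
        {
            // Up two levels (a -> b -> td, under the assumed markup), then the next td.
            HtmlNode companyCell = link.SelectSingleNode("../../following-sibling::td");
            if (companyCell != null)
            {
                result[link.InnerText.Trim()] = companyCell.InnerText.Trim();
            }
        }
        return result;
    }
}
```

For example, passing a row such as `<tr><td class='yfnc_tabledata1'><b><a href='/q?s=GLEN.L'>GLEN.L</a></b></td><td class='yfnc_tabledata1'>GLENCORE XSTRAT</td></tr>` inside a table should map GLEN.L to GLENCORE XSTRAT. If the link sits at a different depth, the number of `..` steps has to change accordingly.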

Select link inside div tag

I would like to get a link (URL to be specific) inside a div class. This is the code I have that gets me the text inside div class (Some text...).
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='content']"))
{
//saves text (node.InnerText) in array
}
This is the HTML from the site. I would like to get www.google.com
<div class="content">
<p>Some text...
LINK
</p>
</div>
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='novica']/p/a[@href='www.google.com']"))
{
//saves text (node.InnerText) in array
}
That code is not valid based on your writing, however you have 2 options:
Once you have the node for the div, use .Descendants("a") or the child nodes to pull out the link, then get its href attribute.
Amend your SelectNodes() XPath to get the a tag instead: //div[@class='novica']/p/a.
The first is obviously better if you also need the .InnerText of that element to get Some text..., however the second would be faster.
foreach (var node in doc.DocumentNode.SelectNodes("//div[@class='novica']"))
{
    // Collect the href of every link inside the div (empty string if the attribute is missing).
    var links = node.Descendants("a").Select(n => n.GetAttributeValue("href", string.Empty)).ToList();
}
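The second option, selecting the a elements directly and reading their href attribute, might look like the sketch below. The helper name (LinkExtractor.GetLinks) is hypothetical, and the sample markup in the usage note assumes the link in the question was an ordinary `<a href="...">` element, which the posted HTML does not show.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Hypothetical helper: returns the href of every <a> that has one
    // inside any <div> with exactly the given class attribute value.
    public static List<string> GetLinks(string html, string cssClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection anchors =
            doc.DocumentNode.SelectNodes("//div[@class='" + cssClass + "']//a[@href]");
        if (anchors == null)
        {
            return new List<string>(); // no matching links
        }

        // GetAttributeValue falls back to the given default when the attribute is absent.
        return anchors
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .ToList();
    }
}
```

With `<div class="content"><p>Some text... <a href="www.google.com">LINK</a></p></div>` and the class name "content", this should yield a single entry, www.google.com.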

C#, Html Agility, Selecting every paragraph within a div tag

How can I select every paragraph in a div tag? For example:
<div id="body_text">
<p>Hi</p>
<p>Help Me Please</p>
<p>Thankyou</p>
</div>
I have Html Agility Pack downloaded and referenced in my program. All I need is the paragraphs. There may be a variable number of paragraphs, and there are loads of different div tags, but I only need the content within body_text. I assume this can then be stored as a string, which I want to write to a .txt file for later reference. Thank you.
The valid XPath for your case is //div[@id='body_text']/p
foreach (HtmlNode node in yourHTMLAgilityPackDocument.DocumentNode.SelectNodes("//div[@id='body_text']/p"))
{
    string text = node.InnerText; // that's the text you are looking for
}
Here's a solution that grabs the paragraphs as an enumeration of HtmlNodes:
HtmlDocument doc = new HtmlDocument();
doc.Load("your.html");
var div = doc.GetElementbyId("body_text");
var paragraphs = div.ChildNodes.Where(item => item.Name == "p");
Without explicit Linq:
var paragraphs = doc.GetElementbyId("body_text").Elements("p");
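The question also asks about storing the result as a string and writing it to a .txt file, which the answers leave out. A minimal sketch of that last step follows; the helper name (ParagraphExporter.ExtractParagraphs) and the output file name in the usage note are placeholders, not anything from the original posts.

```csharp
using System;
using System.IO;
using System.Linq;
using HtmlAgilityPack;

public static class ParagraphExporter
{
    // Hypothetical helper: joins the text of every direct <p> child of the
    // element with the given id, one paragraph per line.
    public static string ExtractParagraphs(string html, string divId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode div = doc.GetElementbyId(divId);
        if (div == null)
        {
            return string.Empty; // no element with that id
        }

        var texts = div.Elements("p")
            .Select(p => HtmlEntity.DeEntitize(p.InnerText).Trim());
        return string.Join(Environment.NewLine, texts);
    }
}
```

The returned string can then be saved with something like `File.WriteAllText("body_text.txt", ParagraphExporter.ExtractParagraphs(html, "body_text"));`, where the file name is only an example.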
