HTML Agility Pack - Select node after particular paragraph - c#

I have this kind of situation: various files with the following HTML. I need to retrieve only the list after the "targetWord" paragraph (its position changes across the pages I need to parse). How can I do this with HTML Agility Pack?
<p>Word1</p>
<ul>
<li>listobject1</li>
<li>listobject2</li>
<li>listobject3</li>
</ul>
<p>targetWord</p>
<ul>
<li>listobject4</li>
<li>listobject5</li>
<li>listobject6</li>
</ul>
<p>Word2</p>
<ul>
<li>listobject7</li>
<li>listobject8</li>
<li>listobject9</li>
</ul>
My code should obtain only the list nodes after targetWord:
foreach (var node in retrievedNodes)
{
    s[i] = node.InnerText;
    // Print before incrementing, otherwise s[i] refers to the next (unset) slot
    Console.WriteLine(s[i]);
    i++;
}
OUTPUT:
listobject4
listobject5
listobject6

You need to craft an XPath expression that matches your requirement.
Assuming your snippet has been loaded into an HtmlAgilityPack HtmlDocument held in a variable named htmlSnippet, then
htmlSnippet.DocumentNode.SelectNodes("//p[text()='targetWord']/following-sibling::ul[1]//li")
will return the set of li nodes inside the first ul that follows your targetWord p tag.
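A minimal end-to-end sketch of the XPath approach above, assuming the HtmlAgilityPack NuGet package is referenced. The names TargetListDemo and ItemsAfter are made up for illustration, and the HTML is a shortened version of the snippet in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TargetListDemo
{
    // Returns the text of each <li> inside the first <ul> that follows
    // the <p> whose text equals `word`.
    public static List<string> ItemsAfter(string html, string word)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: `word` contains no single quote, because it is
        // spliced directly into the XPath string.
        var nodes = doc.DocumentNode.SelectNodes(
            "//p[text()='" + word + "']/following-sibling::ul[1]//li");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (nodes == null)
            return new List<string>();

        return nodes.Select(n => n.InnerText.Trim()).ToList();
    }

    public static void Main()
    {
        const string html =
            "<p>Word1</p><ul><li>listobject1</li></ul>" +
            "<p>targetWord</p><ul><li>listobject4</li><li>listobject5</li></ul>" +
            "<p>Word2</p><ul><li>listobject7</li></ul>";

        foreach (var item in ItemsAfter(html, "targetWord"))
            Console.WriteLine(item);
    }
}
```

The null check matters in practice: when the target paragraph is missing from a page, iterating the result of SelectNodes directly would throw a NullReferenceException.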

Related

How to replace span with inline style tag to b tag in c#?

I have some text like the following:
<span style="font-weight: 700;">Aanbod wielen (banden + velgen) </span>
<br><br>
<span style="font-weight: 500;">lichtmetalen originele Volvo set met winterbanden:<br>origineel:</span> Volvo<br>
<b>inch maat:</b> 15''<br>
<p>steek:</p> 5x108mm<br>
<span style="font-weight: 700;">naafgat:</span>
I need to find each span tag that has an inline font-weight style and replace it with a <b> tag, replacing the closing </span> with </b> as well, in C#. The result should look like this:
<b>Aanbod wielen (banden + velgen)</b>
<br><br>
<b>lichtmetalen originele Volvo set met winterbanden:<br>origineel:</b> Volvo <br>
<b>inch maat:</b> 15''<br>
<p>steek:</p> 5x108mm<br>
<b>naafgat:</b>
How can I identify and replace these spans? Please help.
You can replace each span with a b element using HtmlAgilityPack, which is free and open source.
You can install HtmlAgilityPack from NuGet: Install-Package HtmlAgilityPack -Version 1.8.9
public string ReplaceSpanByB()
{
HtmlDocument doc = new HtmlDocument();
string htmlContent = File.ReadAllText(@"C:\Users\xxx\source\repos\ConsoleApp4\ConsoleApp4\Files\HTMLPage1.html");
doc.LoadHtml(htmlContent);
if (doc.DocumentNode.SelectNodes("//span") != null)
{
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//span"))
{
var attributes = node.Attributes;
foreach (var item in attributes)
{
if (item.Name.Equals("style") && item.Value.Contains("font-weight"))
{
HtmlNode b = doc.CreateElement("b");
b.InnerHtml = node.InnerHtml;
node.ParentNode.ReplaceChild(b, node);
}
}
}
}
return doc.DocumentNode.OuterHtml;
}
First: don't use regex. Although it is possible and seems logical,
it is mostly wrong and full of pain.
A happy post about it can be found HERE.
Second:
Use an HTML parser such as https://html-agility-pack.net/ to traverse the tree
(you can use XPath to easily find all the span elements you want to replace)
and replace each span element with a b (don't forget to set the new b element's contents).
Side note: as far as I recall, the b tag is discouraged,
so if you only need the span text to be bold,
it already is because of "font-weight:bold".
From https://developer.mozilla.org/en-US/docs/Web/HTML/Element/b:
"Historically, the <b> element was meant to make text boldface. Styling information has been deprecated since HTML4, so the meaning of the <b> element has been changed." and "The HTML Bring Attention To element (<b>) is used to draw the reader's attention to the element's contents, which are not otherwise granted special importance." – Thanks @Richardissimo
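The parser-based steps described above (find the spans with XPath, create a b, copy the contents, swap the nodes) can be sketched as follows. This is only an illustration under assumptions: the names SpanToBold and ReplaceStyledSpans are hypothetical, the HtmlAgilityPack package is assumed to be referenced, and spans nested inside other matching spans are not handled.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class SpanToBold
{
    // Replaces every <span> whose style attribute mentions font-weight
    // with a <b> element carrying the same inner HTML.
    public static string ReplaceStyledSpans(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The XPath predicate filters on the style attribute, so there is
        // no need to loop over each node's attributes by hand.
        var spans = doc.DocumentNode.SelectNodes(
            "//span[contains(@style, 'font-weight')]");
        if (spans == null)
            return html; // nothing matched, return the input unchanged

        // Work on a copy of the collection while the tree is being modified.
        foreach (var span in spans.ToList())
        {
            HtmlNode b = doc.CreateElement("b");
            b.InnerHtml = span.InnerHtml;          // carry the contents over
            span.ParentNode.ReplaceChild(b, span); // swap <span> for <b>
        }

        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        const string html =
            "<span style=\"font-weight: 700;\">naafgat:</span> <p>steek:</p>";
        Console.WriteLine(ReplaceStyledSpans(html));
    }
}
```

Spans without a font-weight style are left untouched, because the XPath predicate never selects them.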

How to find a link in HTML under a certain header AND parse it

I am currently attempting to parse a link from an HTML doc based off the header above it, but no matter what I try, the program is unable to find it.
Here is the method I have that isn't working:
public string findMajorURL(string collegeURL, string major)
{
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(collegeURL);
var root = doc.DocumentNode;
var htmlNodes = root.Descendants();
//Find html node containing the major heading
foreach(HtmlNode node in htmlNodes)
{
if (node.InnerText == major)
{
HtmlNode target = node.NextSibling;
List<string> links = target.Descendants("a").Select(a => a.Attributes["href"].Value).ToList();
return links.First()+ "__IT WORKED__";
}
}
return "Major not found";
}
This is what the HTML looks like that I am attempting to parse:
<div style="padding-left: 20px">
<h3 id="ent1629">Biological Sciences </h3>
<a href="preview_entity.php?catoid=5&ent_oid=1629&returnto=818">Go to information for this department.</a>
<br>
<p>...</p>
<div id="data_c_1629" style="display: none">...</div>
<!--script language="javascript">hideshow(data_c_1630)</script-->
The major the user inputs is supposed to match the heading, Biological Sciences. Based off of the header, I want to get the link under it, which in this case is preview_entity.php?catoid=5&ent_oid=1629&returnto=818
WARNING: I cannot use XPath with the version of Visual Studio that I have, so I'm assuming LINQ would be the best way to go, but again I'm not sure.
EDIT: It turns out that the InnerText is not matching the major. I don't see how that's possible, as I took it directly from the HTML code. Any ideas as to what's wrong?
According to the HTML snippet posted, node inside your if block references the <h3> element, and target references the next sibling of <h3>, which is <a>. That said, you don't need target.Descendants("a"). Just get the href attribute from target directly:
if (node.InnerText == major)
{
HtmlNode target = node.NextSibling;
return target.GetAttributeValue("href", "")+ "__IT WORKED__";
}
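One possible reason for the mismatch mentioned in the question's EDIT is that the heading text in the posted HTML ends with a space ("Biological Sciences "), and NextSibling can land on a whitespace text node rather than the link. The sketch below guards against both using LINQ only, with no XPath. MajorLinkFinder and FindLink are hypothetical names, and the sketch takes the downloaded HTML as a string rather than a URL.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class MajorLinkFinder
{
    // Returns the href of the <a> that directly follows the <h3> whose
    // trimmed text equals `major`, or null when no such heading/link exists.
    public static string FindLink(string html, string major)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Decode entities and trim so "Biological Sciences " still matches.
        HtmlNode heading = doc.DocumentNode.Descendants("h3")
            .FirstOrDefault(h => HtmlEntity.DeEntitize(h.InnerText).Trim() == major);
        if (heading == null)
            return null;

        // Skip whitespace text nodes and comments until the next element.
        HtmlNode next = heading.NextSibling;
        while (next != null && next.NodeType != HtmlNodeType.Element)
            next = next.NextSibling;

        if (next == null || next.Name != "a")
            return null;

        return next.GetAttributeValue("href", "");
    }

    public static void Main()
    {
        const string html =
            "<div><h3 id=\"ent1629\">Biological Sciences </h3>\n" +
            "<a href=\"preview_entity.php\">Go to information for this department.</a></div>";
        Console.WriteLine(FindLink(html, "Biological Sciences"));
    }
}
```

Restricting the search to h3 elements also avoids matching an ancestor whose InnerText happens to equal the heading text.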

Getting inner text with HTML Agility Pack

I have the following webpage:
I am trying to grab the fields which have IDs and classnames:
label =
node.SelectSingleNode(
".//h3[@class='item_header']"
).InnerText.Replace("Label: ","").Trim();
Console.WriteLine(label);
However, I am having a difficult time trying to figure out how to get the text here:
How do you parse the text within tags that have no id's or class's such as the following?
<b>Label Cat. #: WEST 3007/8</b>
If it is at all helpful, here is the unique selector:
#\31 42248 > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(1) > b:nth-child(1)
The HTML Agility Pack has a companion CSS Selector library, where you could use the selector in your question to find the element.
https://www.nuget.org/packages/HtmlAgilityPack.CssSelectors/
You have the ID of the table. You can just go from there.
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//table[@id='142248']//b");
foreach (HtmlNode n in nodes)
{
if (n.InnerText.ToLower().Contains("label"))
{
Console.WriteLine(n.InnerText);
}
}
The XPath in the above code gives you all the <b> elements in the table with the id 142248.

Select link inside div tag

I would like to get a link (URL to be specific) inside a div class. This is the code I have that gets me the text inside div class (Some text...).
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='content']"))
{
//saves text (node.InnerText) in array
}
This is the HTML from the site. I would like to get www.google.com
<div class="content">
<p>Some text...
<a href="www.google.com">LINK</a>
</p>
</div>
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='novica']/p/a[@href='www.google.com']"))
{
//saves text (node.InnerText) in array
}
That code is not valid as written; however, you have two options:
1. Once you have the node for the div, use .Descendants("a") or walk its children to pull out the link, then get its href attribute.
2. Amend your SelectNodes() XPath to get the a tag instead: //div[@class='novica']/p/a.
The first is obviously better if you also need the .InnerText of that element to get Some text..., whereas the second would be faster.
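foreach (var node in doc.DocumentNode.SelectNodes("//div[@class='novica']"))
{
    // Read the href attribute; InnerText would return the link text ("LINK"), not the URL
    var links = node.Descendants("a").Select(n => n.GetAttributeValue("href", "")).ToList();
}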
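As a rough, self-contained version of the second option (selecting the a tags directly in the XPath), the sketch below returns the href values. LinkExtractor and LinksIn are hypothetical names, and the class name passed in is assumed to contain no single quote.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Returns the href of every <a> found anywhere inside a <div>
    // whose class attribute equals `cssClass` exactly.
    public static List<string> LinksIn(string html, string cssClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // a[@href] skips anchors that have no href attribute at all.
        var anchors = doc.DocumentNode.SelectNodes(
            "//div[@class='" + cssClass + "']//a[@href]");
        if (anchors == null)
            return new List<string>(); // SelectNodes returns null on no match

        return anchors.Select(a => a.GetAttributeValue("href", "")).ToList();
    }

    public static void Main()
    {
        const string html =
            "<div class=\"content\"><p>Some text... " +
            "<a href=\"www.google.com\">LINK</a></p></div>";

        foreach (var url in LinksIn(html, "content"))
            Console.WriteLine(url);
    }
}
```

Note that @class='content' is an exact string comparison, so a div with several classes (for example class="content wide") would need contains(@class, 'content') or similar instead.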

Get all <li> elements from inside a certain <div> with C#

I have a web page consisting of several <div> elements.
I would like to write a program that prints all the li elements inside a <div> after a certain <h4> header. Could anyone give me some help or sample code?
<div id="content">
<h4>Header</h4>
<ul>
<li><a href...></a> THIS IS WHAT I WANT TO GET</li>
</ul>
</div>
When it comes to parsing HTML in C#, don't try to write your own parser. The HTML Agility Pack is almost certainly capable of doing what you want!
What parts are constant:
The 'id' in the DIV?
The h4
Searching a complete HTML document and reacting on H4 alone is likely to be a mess, whereas if you know the DIV has the ID of "content" then just look for that!
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(yourHtml);
if ( doc.DocumentNode != null )
{
    var divs = doc.DocumentNode
        .SelectNodes("//div")
        // The inner lambda needs its own parameter name; reusing `e` does not compile
        .Where(e => e.Descendants().Any(d => d.Name == "h4"));
// You now have all of the divs with an 'h4' inside of it.
// The rest of the element structure, if constant needs to be examined to get
// the rest of the content you're after.
}
If it's a web page, why would you need to do HTML parsing? Wouldn't the technology you are using to build the web page give you access to all the elements of the page? For example, if you are using ASP.NET, you could assign ids to your UL and LI (with the runat="server" attribute) and they would be available in the code-behind.
Could you explain your scenario and what you are trying to do? If you are making a web request and downloading the HTML as a string, then scraping the HTML would make sense.
EDIT
I think this should work:
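HtmlDocument doc = new HtmlDocument();
doc.Load(myHtmlFile);
foreach (HtmlNode div in doc.DocumentNode.SelectNodes("//div"))
{
    // GetAttributeValue avoids a NullReferenceException on divs without an id
    if (div.GetAttributeValue("id", "") == "content")
    {
        // Elements("ul") returns only the direct <ul> children of this div
        foreach (HtmlNode ul in div.Elements("ul"))
        {
            // Walk back over whitespace text nodes to the preceding element
            HtmlNode prev = ul.PreviousSibling;
            while (prev != null && prev.NodeType != HtmlNodeType.Element)
                prev = prev.PreviousSibling;
            // InnerText is a property, not a method
            if (prev != null && prev.Name == "h4" && prev.InnerText.Trim() == "Header")
            {
                foreach (HtmlNode li in ul.Elements("li"))
                {
                    // li is one <li> of the <ul> that follows the header
                }
            }
        }
    }
}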
If all you want is the stuff that's between all <li></li> tags underneath the <div id="content"> tag and comes right after a <h4> tag, then this should suffice:
//Load your document first.
//Load() accepts a Stream, a TextReader, or a string path to the file on your computer
//If the entire document is loaded into a string, then use .LoadHtml() instead.
HtmlDocument mainDoc = new HtmlDocument();
// Verbatim string: in a regular string literal "\f" would be read as a form-feed escape
mainDoc.Load(@"c:\foobar.html");
//Select all the <li> nodes that are inside of an element with the id of "content"
// and come directly after an <h4> tag.
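// The leading "." keeps the search inside the "content" element;
// a bare "//h4" would search the whole document again.
HtmlNodeCollection processMe = mainDoc.GetElementbyId("content")
    .SelectNodes(".//h4/following-sibling::*[1]//li");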
//Iterate through each <li> node and print the inner text to the console
foreach (HtmlNode listElement in processMe)
{
Console.WriteLine(listElement.InnerText);
}
