Html nodes issue with HtmlAgilityPack

I'm having a lot of trouble parsing this HTML with the HtmlAgilityPack library.
In this snippet, I would like to retrieve only the URL (href) that refers to uploaded.net, but I can't determine which URL refers to it.
<div class='downloads' id='download_block'>
<h5 style='text-align:center'>FREE DOWNLOAD LINKS</h5>
<h4>uploadable.ch</h4>
<ul class='parts'>
<li>
text here
</li>
</ul>
<h4>uploaded.net</h4>
<ul class='parts'>
<li>
text here
</li>
</ul>
<h4>novafile.com</h4>
<ul class='parts'>
<li>
text here
</li>
</ul>
</div>
And this is what I have:
nodes = myHrmlDoc.DocumentNode.SelectNodes(".//div[@class='downloads']/ul[@class='parts']")
I can't just use an array-index to determine the position like:
nodes(0) = uploadable.ch node
nodes(1) = uploaded.net node
nodes(2) = novafile.com node
...because the site could change the number of nodes and the order of the hosts.
Note also that the URLs do not contain the host names; they are redirects like:
http://xxxxxx/r/YEHUgL44xONfQAnCNUVw_aYfY5JYAy0DT-i--
What could I do, in C# or VB.NET?

This should do it, though it is untested:
doc.DocumentNode.SelectSingleNode("//h4[contains(text(),'uploaded.net')]/following-sibling::ul//a").Attributes["href"].Value
Use contains() as well, because you never know whether the text includes extra whitespace.
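
A slightly more defensive variant of the expression above might look like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced; the helper name FindFirstLink and the sample markup (including its example.com URLs) are made up for illustration. The only change to the XPath is the [1] after following-sibling::ul, which limits the search to the nearest list so that a host with no links does not pick up the next host's link.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class HostLinkFinder
{
    // Hypothetical helper: returns the href of the first link in the <ul>
    // that directly follows the <h4> naming the host, or null if none exists.
    // The host string is pasted into the XPath, so it must not contain a quote.
    public static string FindFirstLink(string html, string host)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // [1] keeps only the nearest following <ul>; a[@href] skips anchors
        // that carry no href attribute.
        string xpath = "//h4[contains(text(),'" + host + "')]" +
                       "/following-sibling::ul[1]//a[@href]";

        HtmlNode link = doc.DocumentNode.SelectSingleNode(xpath);

        // SelectSingleNode returns null when nothing matches.
        return link == null ? null : link.GetAttributeValue("href", "");
    }

    public static void Main()
    {
        // Made-up sample markup; the URLs are placeholders.
        string html =
            "<div class='downloads'>" +
            "<h4>uploadable.ch</h4><ul class='parts'><li><a href='http://example.com/r/one'>part 1</a></li></ul>" +
            "<h4>uploaded.net</h4><ul class='parts'><li><a href='http://example.com/r/two'>part 1</a></li></ul>" +
            "</div>";

        Console.WriteLine(FindFirstLink(html, "uploaded.net") ?? "(no link found)");
    }
}
```

Returning null instead of dereferencing .Attributes["href"] directly avoids a NullReferenceException when the host is missing from the page.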

The only way I see this working is a two-step approach. Sorry, I don't have HtmlAgilityPack at hand, but here is an example using the standard XmlDocument. Even though you said you can't use array indexes, this process lets you do so by looking up the correct index dynamically.
void Main()
{
var xml = @"
<div class=""downloads"" id=""download_block"">
<h5 style=""text-align:center"">FREE DOWNLOAD LINKS</h5>
<h4>uploadable.ch</h4>
<ul class=""parts"">
<li>
text here
</li>
</ul>
<h4>uploaded.net</h4>
<ul class=""parts"">
<li>
text here
</li>
</ul>
<h4>novafile.com</h4>
<ul class=""parts"">
<li>
text here
</li>
</ul>
</div>";
var xmlDocument = new XmlDocument();
xmlDocument.LoadXml(xml);
var nav = xmlDocument.CreateNavigator();
var index = nav.Evaluate("count(//h4[text()='uploaded.net']/preceding-sibling::h4)+1").ToString();
var text = xmlDocument.SelectSingleNode("//ul["+index +"]//a/@href").InnerText;
Console.WriteLine(text);
}
Basically, it gets the index of the uploaded.net h4, then uses that index to select the matching ul tag and read the URL out of the underlying anchor tag.
Sorry for the not-so-clean and error-prone code, but it should point you in the right direction.

Given the snippet you supplied, this will help you get started.
var page = "<div class=\"downloads\" id=\"download_block\"> <h5 style=\"text-align:center\">FREE DOWNLOAD LINKS</h5> <h4>uploadable.ch</h4> <ul class=\"parts\"> <li> text here </li> </ul> <h4>uploaded.net</h4> <ul class=\"parts\"> <li> text here </li> </ul> <h4>novafile.com</h4> <ul class=\"parts\"> <li> text here </li> </ul></div>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);
var nodes = doc.DocumentNode.Descendants("h4").Where(n => n.InnerText.Contains("uploadable"));
foreach (var node in nodes)
{
var attr = node.NextSibling.NextSibling.Descendants().Where(x=> x.Name == "a").FirstOrDefault().Attributes["href"];
attr.Value.Dump();
}

Related

get div information with html agility pack

Hi, I want to process information on an HTML page. With the following code I can get the information.
This is the order in which it is received:
new-link-1
new-link-2
new-link-3
But when it reaches the new-link-no-title section, the order breaks and changes to:
new-link-3
new-link-1
new-link-2
At the end, the program stops with an ArgumentOutOfRangeException.
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = await web.LoadFromWebAsync(Link);
foreach ((var item, int index) in doc.DocumentNode.SelectNodes(".//div[@class='new-link-1']").WithIndex())
{
var x = item.SelectNodes("//div[@class='new-link-2']")[index].InnerText;
var xx = item.SelectNodes("//div[@class='new-link-3']//a")[index];
MessageBox.Show(item.InnerText);
MessageBox.Show(x);
MessageBox.Show(xx.Attributes["href"].Value);
}
And the HTML:
<div id="new-link">
<ul>
<li>
<div class="new-link-1"> فصل پنجم</div>
<div class="new-link-2"> تکمیل شده</div>
<div class="new-link-3">
دانلود با لینک مستقیم
</div>
</li>
<li class="new-link-no-titel">
<div class="new-link-1"> فصل ششم</div>
<div class="new-link-2"> درحال پخش</div>
<div class="new-link-3">
<i class="fa fa-arrow-down" title="حال پخش">
</i>
</div>
</li>
<li>
<div class="new-link-1"> قسمت 1</div>
<div class="new-link-2"> پخش شده</div>
<div class="new-link-3">
دانلودلینک مستقیم
</div>
</li>
<li>
<div class="new-link-1"> قسمت 7</div>
<div class="new-link-2"> پخش شده</div>
<div class="new-link-3">
دانلود با لینک مستقیم
</div>
</li>
</ul>
</div>

This is what I found to be the issue with your code.
foreach ((var item, int index) in doc.DocumentNode.SelectNodes(".//div[@class='new-link-1']").WithIndex()) // -> gives 4 values for index
item.SelectNodes("//div[@class='new-link-2']") // -> this produces 4 nodes
item.SelectNodes("//div[@class='new-link-3']//a") // -> this produces only 3 nodes
Issue:
When you search with //div, you search all nodes in the document, not just those under the item you are currently on. Because only 3 links exist, using the fourth index on that collection throws the ArgumentOutOfRangeException.
Solution/Suggestion: An expression that starts with // searches from the root node. If you prefix it with a dot (.//), only the descendants of the current node are considered. (Excerpt from here)
foreach (HtmlNode item in doc.DocumentNode.SelectNodes(".//li"))
{
try
{
var x0 = item.SelectSingleNode(".//div[@class='new-link-1']");
var x = item.SelectSingleNode(".//div[@class='new-link-2']");
var xx = item.SelectSingleNode(".//a");
MessageBox.Show(x0.InnerText);
MessageBox.Show(x.InnerText);
if (xx.Attributes["href"] != null)
MessageBox.Show(xx.Attributes["href"].Value);
}
catch { }
}
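
To see the difference between the two prefixes in isolation, a minimal sketch such as the following may help. It assumes the HtmlAgilityPack NuGet package is referenced; the class name, method name, and sample markup are made up for illustration. It runs both forms of the query from the same li node and reports how many anchors each one finds.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class RelativeXPathDemo
{
    // Hypothetical helper: counts the anchors found from the first <li>,
    // once with a root-anchored query and once with a relative one.
    public static (int FromRoot, int FromItem) CountLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode firstItem = doc.DocumentNode.SelectSingleNode("//li");

        // "//a" ignores the context node and searches the whole document.
        HtmlNodeCollection global = firstItem.SelectNodes("//a");

        // ".//a" searches only the descendants of firstItem.
        HtmlNodeCollection local = firstItem.SelectNodes(".//a");

        // SelectNodes returns null when nothing matches, so treat null as 0.
        return (global == null ? 0 : global.Count,
                local == null ? 0 : local.Count);
    }

    public static void Main()
    {
        // Made-up sample: three list items, two of which contain a link.
        string html =
            "<ul>" +
            "<li><a href='#1'>one</a></li>" +
            "<li>no link here</li>" +
            "<li><a href='#3'>three</a></li>" +
            "</ul>";

        var counts = CountLinks(html);
        Console.WriteLine("//a from first li:  " + counts.FromRoot);
        Console.WriteLine(".//a from first li: " + counts.FromItem);
    }
}
```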

How to get first tag in a list?

Suppose I have the following list:
<div id='page_competition_1_block_competition_left_tree_2'>
<div>
<ul>
<li>
<a href="#" />
<ul>
<li>
<a href="#">
</li>
</ul>
</li>
<li>
<a href="#" />
...
How can I get the first a tag for each li?
I tried using:
HtmlNodeCollection compsLi = doc.DocumentNode
.SelectNodes("div[@id='page_match_1_block_competition_left_tree_2']//div//ul/li[1]");
but this returns null.

You need to specify a single / instead of //, so:
HtmlNodeCollection compsLi = doc.DocumentNode.SelectNodes("div[@id='page_match_1_block_competition_left_tree_2']//ul/li/a");
Essentially:
/: selects direct children of the current node.
//: selects descendants at any depth; at the start of an expression, it searches from the document root.

You should be able to iterate through the compsLi object you retrieve. Additionally, I don't think you need the [1] in your selector. Once you get the <li> item you should be able to do something like this:
foreach(var node in compsLi)
{
var aNode = node.SelectSingleNode("./a");
...
}
You can take a look here for something similar.

Regex to catch the parent element of an <li> tag

Simply put I have HTML that looks like this:
<ul>
<li>Unorderd Item 1</li>
<li>Unordered Item 2</li>
<li>Unordered Item 3
<ol>
<li>Ordered Item 1</li>
<li>Ordered Item 2</li>
</ol>
</li>
<li>Unordered Item 4</li>
</ul>
I'm looking for a regular expression or some logic of that nature that replaces the <li> tag with something depending on what its parent list element is.
I can use straight up RegEx or I can use (more than likely my route here) the .Net System.Text.RegularExpressions class so:
Regex.Replace
Regex.Matches
<-- I know I could/should be using an HTML parser, but this is being used in conjunction with an XSLT config doc, so using a Regex seems to be the best way to go. -->
Desired Output:
<ul>
<Unordered>Unordered Item 1</Unordered>
<Unordered>....</Unordered>
<ol>
<Ordered>......</Ordered>
<Ordered>......</Ordered>
</ol>
<Unordered>.....</Unordered>
</ul>

I would use HtmlAgilityPack for this
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlString);
foreach (var li in doc.DocumentNode.Descendants("li"))
{
if (li.ParentNode.Name == "ul") li.Name = "Unordered";
if (li.ParentNode.Name == "ol") li.Name = "Ordered";
}
var newHtml = doc.DocumentNode.OuterHtml;
Output:
<ul>
<unordered>Unorderd Item 1</unordered>
<unordered>Unordered Item 2</unordered>
<unordered>Unordered Item 3
<ol>
<ordered>Ordered Item 1</ordered>
<ordered>Ordered Item 2</ordered>
</ol>
</unordered>
<unordered>Unordered Item 4</unordered>
</ul>

Parsing images out of a list using HtmlAgilityPack

There's an HTML page like this:
<li class="liclass">
some html
<a href="blabla" class="first aclass">
some other html
</li>
<li class="liclass">
some html
<a href="blabla" class="first aclass">
some other html
</li>
<li class="liclass">
some html
<a href="blabla" class="first aclass">
some other html
</li>
...
I want to get all the hrefs from the lis, but in a way that keeps the relation between each li and its a tag.
So the first li goes with the first a tag, the second with the second, and so on.
I have this code, but it always returns the same href:
foreach (var node in docu.DocumentNode.SelectNodes("//li[@class='liclass']"))
{
String href = node.SelectNodes("//a[@class='first aclass']")[0].Attributes["href"].Value;
}
How can I improve that code?

You probably want to collect all the hrefs. Start the inner XPath with a dot (.//a) so that it searches only inside the current li:
string href = "";
foreach (var node in docu.DocumentNode.SelectNodes("//li[@class='liclass']"))
{
href += node.SelectNodes(".//a[@class='first aclass']")[0].Attributes["href"].Value + ",";
}

Fetch data from website using HtmlAgilityPack

I am developing an app in which I have to fetch data from a website. The format of the website is:
<div id="id1" class="class1">
<ol class="cls_ol">
<li>
<div class="class2">Content 1</div>
<div class="cls_img">
*** Code for some image ***
</div>
Content 2
</li>
<li> *** Same like above <li> *** </li>
<li> *** Same like above <li> *** </li>
</ol>
</div>
I use this code to fetch it:
protected void Button1_Click(object sender, EventArgs e)
{
var obj = new HtmlWeb();
var document = obj.Load(" ** url of a website ** ");
var bold = document.DocumentNode.SelectNodes("//div[@class='class1']");
foreach (var i in bold)
{
Response.Write(i.InnerHtml);
}
}
But the problem with my code is that it also fetches the images in <div class="cls_img"></div>, which I don't need. How can I fetch all the content of <div id="id1" class="class1"> without the image from <div class="cls_img">?

Step 1 - select and remove images inside the <div class="cls_img"> inside the <div class="class1"> tag:
var images = document.DocumentNode.SelectNodes(
"//div[@class='class1']//*//div[@class='cls_img']//img"
);
// note that if no nodes are found, "images" will be null, so guard the loop
if (images != null)
{
foreach (var image in images)
{
image.Remove();
}
}
Step 2 - select the <div class="class1"> elements (you have already done this), now without those images:
var bold = document.DocumentNode.SelectNodes("//div[@class='class1']");
foreach (var node in bold)
{
Console.Write(node.InnerHtml);
}

Loop through the nodes and find a node with the matching attribute of class="cls_img" and remove that node.
node.ParentNode.RemoveChild(node);
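
Both suggestions can be combined into one small routine. The sketch below is only an illustration: it assumes the HtmlAgilityPack NuGet package is referenced, the helper name InnerHtmlWithoutImageBlocks and the sample markup are made up, and it removes the whole <div class="cls_img"> wrapper rather than just the img inside it.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class ImageBlockRemover
{
    // Hypothetical helper: returns the inner HTML of the first
    // <div class="class1"> after dropping every <div class="cls_img"> in it.
    public static string InnerHtmlWithoutImageBlocks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection imageBlocks = doc.DocumentNode.SelectNodes(
            "//div[@class='class1']//div[@class='cls_img']");

        // SelectNodes returns null when nothing matches, so guard the loop.
        if (imageBlocks != null)
        {
            foreach (HtmlNode block in imageBlocks)
            {
                block.Remove(); // detaches the node from its parent
            }
        }

        HtmlNode container = doc.DocumentNode.SelectSingleNode("//div[@class='class1']");
        return container == null ? "" : container.InnerHtml;
    }

    public static void Main()
    {
        // Made-up sample markup modelled on the structure in the question.
        string html =
            "<div id='id1' class='class1'><ol class='cls_ol'><li>" +
            "<div class='class2'>Content 1</div>" +
            "<div class='cls_img'><img src='picture.png'></div>" +
            "Content 2</li></ol></div>";

        Console.WriteLine(InnerHtmlWithoutImageBlocks(html));
    }
}
```

Iterating over the collection while removing is safe here because SelectNodes returns its own list of matches; removing a node changes the document tree, not that list.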
