HtmlAgilityPack doesn't get nodes - C#

I am using HtmlAgilityPack in C#, but when I try to select the images in a div at the URL below, the selector finds nothing, even though I believe the selector is correct.
The code is in this fiddle. Thanks.
https://dotnetfiddle.net/NNIC3X
var url = "https://fotogaleri.haberler.com/unlu-sarkici-imaj-degistirdi-gorenler-gozlerine/";
//I will get the images src values in .col-big div at this url.
var web = new HtmlWeb();
var doc = web.Load(url);
var htmlNodes = doc.DocumentNode.SelectNodes("//div[@class='col-big']//*/img");
//I am selecting all images in div.col-big, but nothing is found.
foreach (var node in htmlNodes)
{
Console.WriteLine(node.Attributes["src"].Value);
}

Your XPath is wrong because there is no div tag whose class attribute has the value 'col-big'. There is, however, a div tag whose class attribute has the value 'col-big pull-left'. So try:
var htmlNodes = doc.DocumentNode.SelectNodes("//div[@class='col-big pull-left']//*/img");
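Note that matching the full class string is brittle: it breaks again as soon as another class is added to or removed from the div. A contains()-based XPath (a sketch, assuming the same page structure) matches the 'col-big' class regardless of any other classes on the element:

```csharp
// Matches any div whose class list contains "col-big", whether or not
// "pull-left" (or anything else) is also present.
var htmlNodes = doc.DocumentNode.SelectNodes(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' col-big ')]//*/img");

if (htmlNodes != null) // SelectNodes returns null when nothing matches
{
    foreach (var node in htmlNodes)
        Console.WriteLine(node.GetAttributeValue("src", ""));
}
```

GetAttributeValue with a default value also avoids a NullReferenceException when an img happens to have no src attribute.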

Related

Store links into variable instead of text file

I am at a very early point on the C# learning curve. I have code that stores web links in a text file. How can I store them in a variable instead, so I can loop through them later in the code and access each one separately?
string pdfLinksUrl = "https://www.nordicwater.com/products/waste-water/";
// Load HTML content
var webGet = new HtmlAgilityPack.HtmlWeb();
var doc = webGet.Load(pdfLinksUrl);
// select all <A> nodes from the document using XPath
// (unfortunately we can't select attribute nodes directly as
// it is not yet supported by HAP)
var linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");
// select all href attribute values starting with the product URL (case-insensitive)
var pdfUrls = from linkNode in linkNodes
let href = linkNode.Attributes["href"].Value
where href.ToLower().StartsWith("https://www.nordicwater.com/product/")
select href;
// write all PDF links to file
System.IO.File.WriteAllLines(@"c:\temp\pdflinks.txt", pdfUrls.ToArray());
pdfUrls already holds all of your URLs; you are using it when you write them all into the file.
You can use a foreach loop to iterate over the URLs easily:
foreach (string url in pdfUrls.ToArray()) {
Console.WriteLine($"PDF URL: {url}");
}
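Since the LINQ query is evaluated lazily, it re-runs against the parsed document every time you iterate it. Materializing it once into a List&lt;string&gt; (a minimal sketch) gives you a variable you can index and loop over repeatedly:

```csharp
// Evaluate the query once and keep the results in memory.
List<string> urls = pdfUrls.ToList();

Console.WriteLine($"Found {urls.Count} links");
if (urls.Count > 0)
    Console.WriteLine(urls[0]); // access any link separately by index
```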

HTMLNode null when selecting nodes from htmldocument

So I'm trying to get a SoundCloud track ID. I'm not sure how to go about this, but so far I've figured out that I should be able to read a meta tag from the song's page on SoundCloud. Here is my code:
string url = "https://soundcloud.com/hardstyle/scarphase-angernoizer-chaos-of-the-mayans-feat-tha-watcher-bkjn-vs-partyraiser-2017-anthem";
HtmlWeb w = new HtmlWeb();
HtmlDocument d = w.Load(url);
var x = d.DocumentNode.SelectSingleNode("/html/head/meta[30]");
Console.WriteLine(x.InnerText);
I'm trying to read the following tag:
<meta property="twitter:app:url:googleplay" content="soundcloud://sounds:322162984">
So I can get the content and then the track ID.
When I try to display the InnerText of variable x, there is nothing to display; with a breakpoint set, it says that x is null. Can anyone explain why this is and how to fix it?
You need to read the attributes of the selected node:
string url = "https://soundcloud.com/hardstyle/scarphase-angernoizer-chaos-of-the-mayans-feat-tha-watcher-bkjn-vs-partyraiser-2017-anthem";
HtmlWeb w = new HtmlWeb();
HtmlDocument d = w.Load(url);
var x = d.DocumentNode.SelectSingleNode("/html/head/meta[30]");
// Get the property attribute of x
var prop = x.GetAttributeValue("property", "");
Console.WriteLine(prop);
// output: twitter:app:url:googleplay
// Similarly, get the content attribute of x
var content = x.GetAttributeValue("content", "");
Console.WriteLine(content);
// output: soundcloud://sounds:322162984
Hope this helps.
You need to get the attribute; there is no inner text in that tag.
Use var x = d.DocumentNode.SelectSingleNode("/html/head/meta[30]").GetAttributeValue("content", ""); instead. This points your query at the content attribute, from which you can extract the soundcloud://... value.
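Selecting the meta tag by position (meta[30]) breaks as soon as SoundCloud reorders its head section. Selecting by the property attribute is more robust; this sketch then splits out the numeric track ID, assuming the content keeps the soundcloud://sounds:&lt;id&gt; shape shown in the question:

```csharp
var meta = d.DocumentNode.SelectSingleNode(
    "//meta[@property='twitter:app:url:googleplay']");

if (meta != null)
{
    // content looks like: soundcloud://sounds:322162984
    var content = meta.GetAttributeValue("content", "");
    var trackId = content.Substring(content.LastIndexOf(':') + 1);
    Console.WriteLine(trackId); // the numeric track ID
}
```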

Parse Compelete Web Page

How do I parse a complete HTML web page, not just specific nodes, using HTML Agility Pack or any other technique?
I am using the code below, but it only parses specific nodes; I need to parse the complete page into neat, clean content.
List<string> list = new List<string>();
string url = "https://www.google.com";
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
list.Add(node.InnerText);
}
To get all descendant text nodes, use something like:
var textNodes = doc.DocumentNode.SelectNodes("//text()")
    .Select(t => t.InnerText);
To get all non-empty descendant text nodes:
var textNodes = doc.DocumentNode
    .SelectNodes("//text()[normalize-space()]")
    .Select(t => t.InnerText);
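Building on that, the non-empty text nodes can be joined into a single cleaned-up string for the whole page (a sketch; HtmlEntity.DeEntitize decodes HTML entities such as &amp;amp;):

```csharp
var pageText = string.Join(
    Environment.NewLine,
    doc.DocumentNode
       .SelectNodes("//text()[normalize-space()]")
       .Select(t => HtmlEntity.DeEntitize(t.InnerText.Trim())));

Console.WriteLine(pageText);
```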
Use SelectNodes("//*"). The asterisk is the wildcard selector, and combined with the descendant axis // it matches every element on the page. (On its own, "*" selects only the direct children of the current node.)

How to download content of an element by its hierarchy

I'm new to Stack Overflow and I hope my question isn't odd.
I just want to download the text inside the svalue of the sindex element, as well as the content of another <p> tag. This is its hierarchy:
/html/body/table/tbody/tr[2]/td/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr/td/table/tbody/tr/td/table[4]/tbody/tr[2]/td[2]/table/tbody/tr/td/div/span/span/p/span/sindex
Is it possible to download the content by its hierarchy, for example with HtmlAgilityPack, or in some other way?
Thanks
WebClient client = new WebClient();
string url = "http://www.google.com";
var content = client.DownloadString(url);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(content);
// ?
Update after @MSI's answer, I use this:
var value = doc.DocumentNode
.SelectSingleNode("//html/body/div/div/a/div");
But the return value is always null. Maybe I'm getting the hierarchy the wrong way. I use Firebug and look at the HTML tab for the hierarchy; is that wrong?
Can't you use something along the lines of the following?
Considering svalue to be an attribute:
doc.DocumentNode
.SelectSingleNode("//html/element1/element2")
.Attributes["svalue"].Value;
or for element,
doc.DocumentNode
.SelectSingleNode("//html/element1/element2/svalue").InnerText;
EDIT:
Re. SelectSingleNode returning null for my previous examples: with google.com.au as the reference HTML source, use the following method to get the desired result.
doc.DocumentNode
.SelectSingleNode(".//element1/element2/svalue").InnerText;
DocumentNode should refer to the html document root node and .// is relative to that.
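One common reason such a long absolute path returns null: HtmlAgilityPack parses the raw HTML, while Firebug shows the browser's live DOM, and browsers insert tbody elements that may not exist in the source at all, so one mismatched step makes the whole path fail. A short descendant-axis query that skips the intermediate steps (a sketch using the tag name from the question) avoids the problem entirely:

```csharp
// Search the whole document for the sindex element instead of
// spelling out every tbody-laden step of the hierarchy.
var sindex = doc.DocumentNode.SelectSingleNode("//sindex");
if (sindex != null)
    Console.WriteLine(sindex.InnerText);
```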

Html Agility Pack, SelectNodes from a node

Why does this pick all of my <li> elements in my document?
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
var travelList = new List<Page>();
var liOfTravels = doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']")
.SelectNodes("//li");
What I want is to get all <li> elements in the <div> with an id of "myTrips".
It's a bit confusing because you expect it to run SelectNodes only on the div with id "myTrips"; however, if you do another SelectNodes("//li"), it will perform another search from the top of the document.
I fixed this by combining the statement into one, but that only works on a web page with a single div with the id "myTrips". The query looks like this:
doc.DocumentNode.SelectNodes("//div[@id='myTrips']//li");
var liOfTravels = doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']")
.SelectNodes(".//li");
Note the dot in the second line. In this regard HtmlAgilityPack relies completely on XPath syntax, but the result is non-intuitive, because these queries are effectively the same:
doc.DocumentNode.SelectNodes("//li");
some_deeper_node.SelectNodes("//li");
Creating a new node can be beneficial in some situations and lets you use the xpaths more intuitively. I've found this useful in a couple of places.
var myTripsDiv = doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']");
var myTripsNode = HtmlNode.CreateNode(myTripsDiv.InnerHtml);
var liOfTravels = myTripsNode.SelectNodes("//li");
You can do this with a LINQ query:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
var travelList = new List<HtmlNode>();
foreach (var matchingDiv in doc.DocumentNode.DescendantNodes().Where(n=>n.Name == "div" && n.Id == "myTrips"))
{
travelList.AddRange(matchingDiv.DescendantNodes().Where(n=> n.Name == "li"));
}
I hope it helps
This seems counter-intuitive to me as well; if you run the SelectNodes method on a particular node, I thought it would only search underneath that node, not in the document in general.
Anyway OP if you change this line :
var liOfTravels =
doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']").SelectNodes("//li");
TO:
var liOfTravels =
doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']").SelectNodes("li");
I think you'll be OK; I've just had the same issue and that fixed it for me. I'm not sure, though, whether the li has to be a direct child of the node.
