HtmlAgilityPack doesn't get nodes - C#

I am using HtmlAgilityPack in C#, but when I try to select the images in a div at the URL below, the selector finds nothing, even though I believe the selector is correct.
The code is in this fiddle. Thanks.
https://dotnetfiddle.net/NNIC3X
var url = "https://fotogaleri.haberler.com/unlu-sarkici-imaj-degistirdi-gorenler-gozlerine/";
//I will get the images src values in .col-big div at this url.
var web = new HtmlWeb();
var doc = web.Load(url);
var htmlNodes = doc.DocumentNode.SelectNodes("//div[@class='col-big']//*/img");
//I am selecting all images in div.col-big, but nothing is found.
foreach (var node in htmlNodes)
{
Console.WriteLine(node.Attributes["src"].Value);
}

Your XPath is wrong because there is no div tag whose class attribute has the value 'col-big'. There is, however, a div tag whose class attribute has the value 'col-big pull-left'. So try:
var htmlNodes = doc.DocumentNode.SelectNodes("//div[@class='col-big pull-left']//*/img");
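Note that matching the full class string is brittle: it breaks again as soon as another class is added to or removed from the div. A contains()-based XPath (a sketch, assuming the same page structure) matches the 'col-big' class regardless of any other classes on the element:

```csharp
// Matches any div whose class list contains "col-big", whether or not
// "pull-left" (or anything else) is also present.
var htmlNodes = doc.DocumentNode.SelectNodes(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' col-big ')]//*/img");

if (htmlNodes != null) // SelectNodes returns null when nothing matches
{
    foreach (var node in htmlNodes)
        Console.WriteLine(node.GetAttributeValue("src", ""));
}
```

GetAttributeValue with a default value also avoids a NullReferenceException when an img happens to have no src attribute.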

Related

Store links into variable instead of text file

I am at a very early point on the C# learning curve. I have code that stores web links in a text file. How can I store them in a variable instead, so I can loop through them later in the code and access each one separately?
string pdfLinksUrl = "https://www.nordicwater.com/products/waste-water/";
// Load HTML content
var webGet = new HtmlAgilityPack.HtmlWeb();
var doc = webGet.Load(pdfLinksUrl);
// select all <A> nodes from the document using XPath
// (unfortunately we can't select attribute nodes directly as
// it is not yet supported by HAP)
var linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");
// select all href attribute values starting with the product URL (case-insensitive)
var pdfUrls = from linkNode in linkNodes
let href = linkNode.Attributes["href"].Value
where href.ToLower().StartsWith("https://www.nordicwater.com/product/")
select href;
// write all PDF links to file
System.IO.File.WriteAllLines(@"c:\temp\pdflinks.txt", pdfUrls.ToArray());
pdfUrls already holds all of your URLs; you are using it when you write them all into the file.
You can use a foreach loop to iterate over the URLs easily:
foreach (string url in pdfUrls.ToArray()) {
Console.WriteLine($"PDF URL: {url}");
}
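Since the LINQ query is evaluated lazily, it re-runs against the parsed document every time you iterate it. Materializing it once into a List&lt;string&gt; (a minimal sketch) gives you a variable you can index and loop over repeatedly:

```csharp
// Evaluate the query once and keep the results in memory.
List<string> urls = pdfUrls.ToList();

Console.WriteLine($"Found {urls.Count} links");
if (urls.Count > 0)
    Console.WriteLine(urls[0]); // access any link separately by index
```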

HTMLNode null when selecting nodes from htmldocument

So I'm trying to get a SoundCloud track ID. I'm not sure how to go about this, but so far I've figured out that I should be able to read a meta tag from the song's page on SoundCloud. Here is my code:
string url = "https://soundcloud.com/hardstyle/scarphase-angernoizer-chaos-of-the-mayans-feat-tha-watcher-bkjn-vs-partyraiser-2017-anthem";
HtmlWeb w = new HtmlWeb();
HtmlDocument d = w.Load(url);
var x = d.DocumentNode.SelectSingleNode("/html/head/meta[30]");
Console.WriteLine(x.InnerText);
I'm trying to read the following tag:
<meta property="twitter:app:url:googleplay" content="soundcloud://sounds:322162984">
So I can get the content and then the track ID.
When I try to display the InnerText of variable x, there is nothing to display; with a breakpoint set, it says that x is null. Can anyone explain why this is and how to fix it?
You need to read the attributes of the selected node:
string url = "https://soundcloud.com/hardstyle/scarphase-angernoizer-chaos-of-the-mayans-feat-tha-watcher-bkjn-vs-partyraiser-2017-anthem";
HtmlWeb w = new HtmlWeb();
HtmlDocument d = w.Load(url);
var x = d.DocumentNode.SelectSingleNode("/html/head/meta[30]");
// Get the property attribute of x
var prop = x.GetAttributeValue("property", "");
Console.WriteLine(prop);
// output: twitter:app:url:googleplay
// Similarly, get the content attribute of x
var content = x.GetAttributeValue("content", "");
Console.WriteLine(content);
// output: soundcloud://sounds:322162984
Hope this helps.
You need to get the attribute; there is no inner text in that tag.
Use var x = d.DocumentNode.SelectSingleNode("/html/head/meta[30]").GetAttributeValue("content", ""); instead. This points your query at the content attribute, from which you can extract the soundcloud://... value.
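Selecting the meta tag by position (meta[30]) breaks as soon as SoundCloud reorders its head section. Selecting by the property attribute is more robust; this sketch then splits out the numeric track ID, assuming the content keeps the soundcloud://sounds:&lt;id&gt; shape shown in the question:

```csharp
var meta = d.DocumentNode.SelectSingleNode(
    "//meta[@property='twitter:app:url:googleplay']");

if (meta != null)
{
    // content looks like: soundcloud://sounds:322162984
    var content = meta.GetAttributeValue("content", "");
    var trackId = content.Substring(content.LastIndexOf(':') + 1);
    Console.WriteLine(trackId); // the numeric track ID
}
```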

Parse Compelete Web Page

How do I parse a complete HTML web page, not just specific nodes, using HTML Agility Pack or any other technique?
I am using the code below, but it only parses specific nodes; I need to parse the complete page into neat, clean content.
List<string> list = new List<string>();
string url = "https://www.google.com";
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
list.Add(node.InnerText);
}
To get all descendant text nodes, use something like:
var textNodes = doc.DocumentNode.SelectNodes("//text()")
    .Select(t => t.InnerText);
To get all non-empty descendant text nodes:
var textNodes = doc.DocumentNode
    .SelectNodes("//text()[normalize-space()]")
    .Select(t => t.InnerText);
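Building on that, the non-empty text nodes can be joined into a single cleaned-up string for the whole page (a sketch; HtmlEntity.DeEntitize decodes HTML entities such as &amp;amp;):

```csharp
var pageText = string.Join(
    Environment.NewLine,
    doc.DocumentNode
       .SelectNodes("//text()[normalize-space()]")
       .Select(t => HtmlEntity.DeEntitize(t.InnerText.Trim())));

Console.WriteLine(pageText);
```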
Use SelectNodes("//*"). The asterisk is the wildcard selector, and combined with the descendant axis // it matches every element on the page. (On its own, "*" selects only the direct children of the current node.)

How to download content of an element by its hierarchy

I'm new to Stack Overflow and I hope my question isn't odd.
I just want to download the text inside the svalue of the sindex element, as well as the content of another <p> tag. This is its hierarchy:
/html/body/table/tbody/tr[2]/td/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr/td/table/tbody/tr/td/table[4]/tbody/tr[2]/td[2]/table/tbody/tr/td/div/span/span/p/span/sindex
Is it possible to download the content by its hierarchy, for example with HtmlAgilityPack, or in some other way?
Thanks
WebClient client = new WebClient();
string url = "http://www.google.com";
var content = client.DownloadString(url);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(content);
// ?
Update after @MSI's answer, I use this:
var value = doc.DocumentNode
.SelectSingleNode("//html/body/div/div/a/div");
But the return value is always null. Maybe I'm getting the hierarchy the wrong way. I use Firebug and look at the HTML tab for the hierarchy; is that wrong?
Can't you use something along the lines of the following?
Considering svalue to be an attribute:
doc.DocumentNode
.SelectSingleNode("//html/element1/element2")
.Attributes["svalue"].Value;
or for element,
doc.DocumentNode
.SelectSingleNode("//html/element1/element2/svalue").InnerText;
EDIT:
Re. SelectSingleNode returning null for my previous examples: with google.com.au as the reference HTML source, use the following method to get the desired result.
doc.DocumentNode
.SelectSingleNode(".//element1/element2/svalue").InnerText;
DocumentNode should refer to the html document root node and .// is relative to that.
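One common reason such a long absolute path returns null: HtmlAgilityPack parses the raw HTML, while Firebug shows the browser's live DOM, and browsers insert tbody elements that may not exist in the source at all, so one mismatched step makes the whole path fail. A short descendant-axis query that skips the intermediate steps (a sketch using the tag name from the question) avoids the problem entirely:

```csharp
// Search the whole document for the sindex element instead of
// spelling out every tbody-laden step of the hierarchy.
var sindex = doc.DocumentNode.SelectSingleNode("//sindex");
if (sindex != null)
    Console.WriteLine(sindex.InnerText);
```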

Html Agility Pack, SelectNodes from a node

Why does this pick all of my <li> elements in my document?
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
var travelList = new List<Page>();
var liOfTravels = doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']")
.SelectNodes("//li");
What I want is to get all <li> elements in the <div> with an id of "myTrips".
It's a bit confusing because you expect it to run SelectNodes only on the div with id "myTrips"; however, if you do another SelectNodes("//li"), it will perform another search from the top of the document.
I fixed this by combining the statement into one, but that only works on a web page with a single div with the id "myTrips". The query looks like this:
doc.DocumentNode.SelectNodes("//div[@id='myTrips']//li");
var liOfTravels = doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']")
.SelectNodes(".//li");
Note the dot in the second line. In this regard HtmlAgilityPack relies completely on XPath syntax, but the result is non-intuitive, because these queries are effectively the same:
doc.DocumentNode.SelectNodes("//li");
some_deeper_node.SelectNodes("//li");
Creating a new node can be beneficial in some situations and lets you use the xpaths more intuitively. I've found this useful in a couple of places.
var myTripsDiv = doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']");
var myTripsNode = HtmlNode.CreateNode(myTripsDiv.InnerHtml);
var liOfTravels = myTripsNode.SelectNodes("//li");
You can do this with a LINQ query:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
var travelList = new List<HtmlNode>();
foreach (var matchingDiv in doc.DocumentNode.DescendantNodes().Where(n=>n.Name == "div" && n.Id == "myTrips"))
{
travelList.AddRange(matchingDiv.DescendantNodes().Where(n=> n.Name == "li"));
}
I hope it helps
This seems counter-intuitive to me as well; if you run the SelectNodes method on a particular node, I thought it would only search underneath that node, not in the document in general.
Anyway OP if you change this line :
var liOfTravels =
doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']").SelectNodes("//li");
TO:
var liOfTravels =
doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']").SelectNodes("li");
I think you'll be OK; I've just had the same issue and that fixed it for me. I'm not sure, though, whether the li has to be a direct child of the node.
