HTMLNode null when selecting nodes from htmldocument - c#

I'm trying to get a SoundCloud track ID. I'm not sure how to go about this, but so far I've figured out that I should be able to read a meta tag from the song's page on SoundCloud. Here is my code:
string url = "https://soundcloud.com/hardstyle/scarphase-angernoizer-chaos-of-the-mayans-feat-tha-watcher-bkjn-vs-partyraiser-2017-anthem";
HtmlWeb w = new HtmlWeb();
HtmlDocument d = w.Load(url);
var x = d.DocumentNode.SelectSingleNode("/html/head/meta[30]");
Console.WriteLine(x.InnerText);
I'm trying to read the following tag:
<meta property="twitter:app:url:googleplay" content="soundcloud://sounds:322162984">
so that I can read its content and extract the track ID.
When I try to display the InnerText of the variable x, nothing is displayed; when I set a breakpoint, the debugger says that x is null. Can anyone explain why this is and how to fix it?

You need to read the attribute "content" of the node selected:
string url = "https://soundcloud.com/hardstyle/scarphase-angernoizer-chaos-of-the-mayans-feat-tha-watcher-bkjn-vs-partyraiser-2017-anthem";
HtmlWeb w = new HtmlWeb();
HtmlDocument d = w.Load(url);
// Select the node first; SelectSingleNode returns null if the path matches nothing
var x = d.DocumentNode.SelectSingleNode("/html/head/meta[30]");
// Read the content attribute directly
Console.WriteLine(x.Attributes["content"].Value);

// Get the property attribute of x
var prop = x.GetAttributeValue("property", "");
Console.WriteLine(prop);
// output: twitter:app:url:googleplay
// Similarly, get the content attribute of x
var content = x.GetAttributeValue("content", "");
Console.WriteLine(content);
// output: soundcloud://sounds:322162984
Hope this helps.

You need to read the attribute; that tag has no inner text.
Use var x = d.DocumentNode.SelectSingleNode("/html/head/meta[30]").GetAttributeValue("content", ""); instead. This reads the content attribute, from which you can extract the soundcloud://... value.
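Selecting by position (meta[30]) is fragile, because the number and order of meta tags in the HTML that HtmlWeb downloads may differ from what the browser's dev tools show. A minimal sketch, assuming the tag is present in the downloaded HTML, selects it by its property attribute and checks for null. The inline HTML string is a stand-in for the real page, and taking the ID as the text after the last colon is an illustrative assumption:

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Stand-in for the downloaded page; in the question this comes from HtmlWeb.Load(url).
        string html = "<html><head>" +
                      "<meta property=\"twitter:app:url:googleplay\" content=\"soundcloud://sounds:322162984\">" +
                      "</head><body></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Match the tag by its property attribute instead of its position.
        HtmlNode meta = doc.DocumentNode.SelectSingleNode(
            "//meta[@property='twitter:app:url:googleplay']");

        if (meta == null)
        {
            Console.WriteLine("Meta tag not found in the returned HTML.");
            return;
        }

        // The value lives in the content attribute, not in the inner text.
        string content = meta.GetAttributeValue("content", "");

        // Assumption: the track ID is whatever follows the last ':' in the content value.
        string trackId = content.Substring(content.LastIndexOf(':') + 1);
        Console.WriteLine(trackId);
    }
}
```

If the null branch is hit on the real page, the tag is probably not in the raw HTML at all (for example, because it is added by script), and no XPath will find it.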

Related

HtmlAgilityPack doesn't get nodes.

I am using HtmlAgilityPack in C#. When I try to select the images inside a div on the page at the URL below, the selector finds nothing, although I think the selector is correct.
The code is in the fiddle. Thanks.
https://dotnetfiddle.net/NNIC3X
var url = "https://fotogaleri.haberler.com/unlu-sarkici-imaj-degistirdi-gorenler-gozlerine/";
//I will get the images src values in .col-big div at this url.
var web = new HtmlWeb();
var doc = web.Load(url);
var htmlNodes = doc.DocumentNode.SelectNodes("//div[@class='col-big']//*/img");
//i am selecting all images in div.col-big. But there is nothing.
foreach (var node in htmlNodes)
{
    Console.WriteLine(node.Attributes["src"].Value);
}
Your XPath is wrong because there is no div tag whose class attribute has the value 'col-big'. There is, however, a div tag whose class attribute has the value 'col-big pull-left'. So try:
var htmlNodes = doc.DocumentNode.SelectNodes("//div[@class='col-big pull-left']//*/img");
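Matching the full attribute string breaks as soon as the page adds or reorders a class. A hedged alternative, using only standard XPath 1.0 functions, matches col-big as one token of the class list. The inline HTML below is an invented stand-in for the real page, whose structure may differ:

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Invented sample markup, not the real page.
        string html = "<div class=\"col-big pull-left\"><p><img src=\"a.jpg\"></p><img src=\"b.jpg\"></div>" +
                      "<div class=\"col-small\"><img src=\"c.jpg\"></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Pad the class list with spaces so that ' col-big ' matches a whole class name only.
        var images = doc.DocumentNode.SelectNodes(
            "//div[contains(concat(' ', normalize-space(@class), ' '), ' col-big ')]//img");

        // By default SelectNodes returns null, not an empty collection, when nothing matches.
        if (images == null)
        {
            Console.WriteLine("No images found.");
            return;
        }

        foreach (var img in images)
        {
            Console.WriteLine(img.GetAttributeValue("src", ""));
        }
    }
}
```

The null check also avoids the NullReferenceException that the foreach in the question would throw when the selector matches nothing.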

Parse Complete Web Page

How can I parse a complete HTML web page, not just specific nodes, using HTML Agility Pack or any other technique?
I am using this code, but it only parses specific nodes; I need to parse the complete page into neat, clean content.
List<string> list = new List<string>();
string url = "https://www.google.com";
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
list.Add(node.InnerText);
}
To get all descendant text nodes use something like
var textNodes = doc.DocumentNode.SelectNodes("//text()").
Select(t=>t.InnerText);
To get all non empty descendant text nodes
var textNodes = doc.DocumentNode.
SelectNodes("//text()[normalize-space()]").
Select(t=>t.InnerText);
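Do SelectNodes("//*"). The '*' (asterisk) is the wildcard and matches any element; with the leading // it gets every element on the page, whereas "*" alone would match only the direct children of the current node.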

Finding node using HTML agility pack

Here is the Google Chrome dev tools view of the element I'm looking for.
Here are all the different ways I have tried to get the nodes:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(webObject.Html);
// HtmlNode footer = doc.DocumentNode.Descendants().SingleOrDefault(y => y. == "boardPickerInner");
// "//div[@class='boardPickerInner']"
//var y = (from HtmlNode node in doc.DocumentNode.SelectNodes("//")
// where node.InnerText == "boardPickerInner"
// select node.InnerHtml);
HtmlAgilityPack.HtmlNode bodyNode = doc.DocumentNode.SelectSingleNode("//nameAndIcons");
var xq = doc.DocumentNode.SelectSingleNode("//td[@class='nameAndIcons']");
var x = doc.DocumentNode.SelectSingleNode("");
HtmlNode nodes = doc.DocumentNode.SelectSingleNode("//[@class='nameAndIcons']");
var boards = nodes.SelectNodes("//*[@class='nameAndIcons']");
Can someone explain what I am doing wrong..?
It looks like you have multiple span elements with class="nameAndIcons". To get them all, you could use the SelectNodes function:
var nodes = doc.DocumentNode.SelectNodes("//span[@class='nameAndIcons']");

extract content from html page

I'm trying to extract the content inside the div tag with id job_title1 in an HTML page. I'm using HtmlAgilityPack to fetch the data. Here is my code:
var obj = new HtmlWeb();
var document = obj.Load("url of website ");
var bold = document.DocumentNode.SelectNodes("//div[@class='job_title1']");
foreach (var i in document.DocumentNode.SelectNodes("//div[@class='job_title1']"))
{
    Response.Write(i.InnerHtml);
}
When I run this code, I get an error at the foreach saying "Object reference not set to an instance of an object". Please help me solve this.
You said "div tag with id job_title1", so shouldn't the XPath be:
document.DocumentNode.SelectNodes("//div[@id='job_title1']")
Also check for null, like this:
var nodes = document.DocumentNode.SelectNodes("//div[@class='job_title1']");
if (nodes != null)
    foreach (var i in nodes)
        ...
Edit: Use \" instead of ':
var obj = new HtmlWeb();
var document = obj.Load("url of website ");
var bold = document.DocumentNode.SelectNodes("//div[@class=\"job_title1\"]");
if (bold != null)
    foreach (var i in bold)
    {
        Response.Write(i.InnerHtml);
    }

Html Agility Pack, SelectNodes from a node

Why does this pick all of my <li> elements in my document?
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
var travelList = new List<Page>();
var liOfTravels = doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']")
    .SelectNodes("//li");
What I want is to get all <li> elements in the <div> with an id of "myTrips".
This is confusing because you expect SelectNodes to search only inside the div with id "myTrips". However, an XPath that starts with // is absolute, so the second call, SelectNodes("//li"), performs another search from the top of the document.
I fixed this by combining the two queries into one, but that only works as intended on a page that has a single div with the id "myTrips". The query looks like this:
doc.DocumentNode.SelectNodes("//div[@id='myTrips']//li");
var liOfTravels = doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']")
    .SelectNodes(".//li");
Note the dot in the second line. In this regard HtmlAgilityPack relies completely on XPath syntax, but the result is non-intuitive, because these two queries are effectively the same:
doc.DocumentNode.SelectNodes("//li");
some_deeper_node.SelectNodes("//li");
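The difference can be seen on a small document. This sketch uses invented markup with one li outside the div and two inside it; in XPath, a leading // always starts from the document root, while .// starts from the context node:

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Invented sample: one li outside the div, two inside it.
        string html = "<ul><li>outside</li></ul>" +
                      "<div id=\"myTrips\"><ul><li>trip 1</li><li>trip 2</li></ul></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode div = doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']");

        // Absolute path: searches the whole document, even though it is called on div.
        var all = div.SelectNodes("//li");

        // Relative path: searches only the descendants of div.
        var scoped = div.SelectNodes(".//li");

        Console.WriteLine("//li matched " + all.Count + " elements");
        Console.WriteLine(".//li matched " + scoped.Count + " elements");
    }
}
```

The absolute query is expected to report the li outside the div as well, while the relative one should report only the two trips.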
Creating a new node can be beneficial in some situations and lets you use the xpaths more intuitively. I've found this useful in a couple of places.
var myTripsDiv = doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']");
var myTripsNode = HtmlNode.CreateNode(myTripsDiv.InnerHtml);
var liOfTravels = myTripsNode.SelectNodes("//li");
You can do this with a Linq query:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
var travelList = new List<HtmlNode>();
foreach (var matchingDiv in doc.DocumentNode.DescendantNodes().Where(n=>n.Name == "div" && n.Id == "myTrips"))
{
travelList.AddRange(matchingDiv.DescendantNodes().Where(n=> n.Name == "li"));
}
I hope it helps
This seems counterintuitive to me as well. If you run SelectNodes on a particular node, I would expect it to search only underneath that node, not in the whole document.
Anyway, OP, if you change this line:
var liOfTravels =
doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']").SelectNodes("//li");
to:
var liOfTravels =
doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']").SelectNodes("li");
I think you'll be OK; I've just had the same issue and that fixed it for me. Note that "li" matches only li elements that are direct children of the selected node; use ".//li" to match descendants at any depth.
