I am just getting into traversing XML documents to learn how to use XPath.
I have stumbled onto an issue: every time I execute my XPath, it returns null, as if it didn't find anything.
I tried the XPath in XMLQuire and it worked there.
class Program
{
    private static string URL = "https://www.kijiji.ca/b-renovation-contracting-handyman/ontario/home-renovations/k0c753l9004";
    private static HtmlWeb client = new HtmlWeb();

    static void Main(string[] args)
    {
        var DOM = client.Load(URL); // //table/tbody/tr/td[@class = 'description']/p
        var Featured = DOM.DocumentNode.SelectNodes("//table[contains(@class,'top-feature')]/tbody/tr/td/a");
        foreach (var Listing in Featured)
        {
        }
    }
}
I commented out the other XPath I tried. Both expressions return null; why is that?
Here is the part of the DOM I want to access:
<table class="top-feature js-hover" data-ad-id="1299717863" data-vip-url="/v-renovation-contracting-handyman/sudbury/c-l-contracting-any-job-big-or-small/1299717863">
<tbody><tr>
<td class="watchlist">
<div class="watch js-hover p-vap-lnk-actn-addwtch" data-action="add" data-adid="1299717863" title="Click to add to My Favourites"><div class="icon"></div></div>
<input id="watchlistXsrf" name="ca.kijiji.xsrf.token" value="1527418405414.9b71d1309fdd8a315258ea5a3dac1a09e4a99ec7f32041df88307c46e26a5b1b" type="hidden">
</td>
<td class="image">
<div class="multiple-images"><img src="https://i.ebayimg.com/00/s/NjAwWDgwMA==/z/fXEAAOSwaZdZxTv~/$_2.JPG" alt="C.L. Contracting. Any job big or small."></div>
</td>
<td class="description">
<a href="/v-renovation-contracting-handyman/sudbury/c-l-contracting-any-job-big-or-small/1299717863" class="title ">
C.L. Contracting. Any job big or small.</a>
<p>
Contractor handyman home renovations and repairs. Contractor for Dollarama, Rexall, LaSenza and more. Fully licensed and insured. Able to do drywall, decks, framing, plumbing, flooring windows, ...</p>
<p class="details">
</p>
</td>
<td class="posted">
</td>
</tr>
</tbody></table>
My solution (I need help turning my XPath into one line instead of traversing with a bunch of loops):
private static string URL = "https://www.kijiji.ca/b-renovation-contracting-handyman/ontario/home-renovations/k0c753l9004";
private static HtmlWeb client = new HtmlWeb();

static void Main(string[] args)
{
    var DOM = client.Load(URL); // //table/tbody/tr/td[@class = 'description']/p
    var Featured = DOM.DocumentNode.SelectNodes("//table[contains(@class,'top-feature')]/tbody/tr/td/a");
    foreach (var table in DOM.DocumentNode.SelectNodes("//table[contains(@class, 'top-feature')]"))
    {
        Console.WriteLine($"Found: {table}");
        foreach (var rows in table.SelectNodes("tr"))
        {
            Console.WriteLine(rows);
            foreach (var cell in rows.SelectNodes("td[@class='description']/a"))
            {
                Console.WriteLine(cell.InnerText.Trim());
            }
        }
    }
    Console.ReadKey();
}
I've managed to fix it; however, I am still curious why this XPath works:
//table[contains(@class, 'top-feature')]/tr/td[@class='description']/a
And this one doesn't:
//table[contains(@class,'top-feature')]/tbody/tr/td/a
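As mentioned in the comments, the <tbody> element is not part of the HTML the server sends. The browser inserts it when it builds the DOM, and that DOM is what the developer tools show.
If you inspect your DOM object in the debugger at runtime, the InnerHtml property shows the markup that was actually parsed: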
<table class="regular-ad js-hover" data-ad-id=".." data-vip-url="..">
<tr>
<td class="watchlist">
...
</td>
<td class="image">
...
</td>
...
</tr>
</table>
There is no <tbody> element, so your XPath has to look like this:
DOM.DocumentNode.SelectNodes("//table[contains(@class,'top-feature')]/tr/td/a");
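Since the question also asked for a single expression instead of nested loops, here is a minimal sketch of one way to do it. It uses the descendant axis (`//`) so the query does not care whether a <tbody> is present, and it guards against the null that SelectNodes returns when nothing matches. The class name FeaturedListings, the method Titles, and the inline markup are made up for illustration; only the XPath pieces come from the thread.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class FeaturedListings
{
    // Returns the trimmed link text of every featured listing in the markup.
    public static List<string> Titles(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//" between the table and the cell matches at any depth, so the
        // expression works whether or not a <tbody> sits in between.
        var links = doc.DocumentNode.SelectNodes(
            "//table[contains(@class,'top-feature')]//td[@class='description']/a");

        var titles = new List<string>();
        if (links == null) return titles; // SelectNodes returns null when nothing matches

        foreach (var link in links)
            titles.Add(link.InnerText.Trim());
        return titles;
    }

    public static void Main()
    {
        // Hypothetical markup standing in for the downloaded page (no <tbody>).
        const string html = @"<table class=""top-feature js-hover""><tr>
            <td class=""image""></td>
            <td class=""description""><a href=""/v-listing/1"" class=""title "">First listing</a><p>Text</p></td>
        </tr></table>";

        foreach (var title in Titles(html))
            Console.WriteLine(title);
    }
}
```

The trade-off is that `//` is less strict than spelling out every step, so it would also match a description cell inside a nested table.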
I am using the HtmlAgilityPack package to find the href of each link inside td tags throughout an entire HTML page. The difficulty is that the tables sit deep in the HTML structure, and as far as I can tell HtmlAgilityPack doesn't let me simply ask for all tds within trs on a page. Each table is wrapped in a parent div with the class "table-group", which is not shown in my sample below; maybe I can use that as a starting point? My biggest problem is that there are several parent elements above everything in the sample, and I want to skip all of them and start here.
Here is a sample of the structure I am trying to navigate:
<table>
<thead>
</thead>
<tbody>
<tr>
<td>Link 1</td>
<td>1</td>
</tr>
<tr>
<td>Link 2</td>
<td>2</td>
</tr>
<tr>
<td>Link 3</td>
<td>3</td>
</tr>
</tbody>
</table>
<table>
<thead>
</thead>
<tbody>
<tr>
<td>Link 4</td>
<td>4</td>
</tr>
<tr>
<td>Link 5</td>
<td>5</td>
</tr>
<tr>
<td>Link 6</td>
<td>6</td>
</tr>
</tbody>
</table>
I would like my end result to be:
https://path-to-pdf1
https://path-to-pdf2
https://path-to-pdf3
https://path-to-pdf4
https://path-to-pdf5
https://path-to-pdf6
Here is what I have tried:
var html = @"https://myurl.com";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(html);
var nodes = htmlDoc.DocumentNode.SelectNodes("//table/tbody/tr/td/a[0]");
foreach (var item in nodes)
{
    Console.WriteLine(item.Attributes["href"].Value);
}
Console.ReadKey();
Change
var nodes = htmlDoc.DocumentNode.SelectNodes("//table/tbody/tr/td/a[0]");
to
var nodes = htmlDoc.DocumentNode.SelectNodes("//table/tbody/tr/td[1]/a");
and you will get the result you want. XPath positions start at 1, so a[0] never matches anything, while td[1] selects the first cell of each row. See the XPath documentation for more details.
I tried this in an MVC project with the same HTML file.
Update:
I copied the HTML into a local page and retrieved the nodes successfully.
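As a rough illustration of the 1-based indexing, and of using the wrapping div mentioned in the question as a starting point, here is a small sketch. The class name PdfLinks, the method Hrefs, and the inline markup (including the anchors, which the sample above does not show) are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class PdfLinks
{
    // Collects the href of the link in the first cell of every table row
    // found anywhere below a div whose class contains "table-group".
    public static List<string> Hrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // td[1] is the FIRST cell: XPath positions are 1-based.
        // The leading "//" skips all the parent elements above the div.
        var anchors = doc.DocumentNode.SelectNodes(
            "//div[contains(@class,'table-group')]//tr/td[1]/a[@href]");

        var hrefs = new List<string>();
        if (anchors == null) return hrefs; // nothing matched

        foreach (var a in anchors)
            hrefs.Add(a.GetAttributeValue("href", string.Empty));
        return hrefs;
    }

    public static void Main()
    {
        // Hypothetical markup; the real page wraps each table in such a div.
        const string html = @"<div class=""table-group""><table><tbody>
            <tr><td><a href=""https://path-to-pdf1"">Link 1</a></td><td>1</td></tr>
            <tr><td><a href=""https://path-to-pdf2"">Link 2</a></td><td>2</td></tr>
        </tbody></table></div>";

        foreach (var href in Hrefs(html))
            Console.WriteLine(href);
    }
}
```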
I have some HTML to parse (see below):
<div id="mailbox" class="div-w div-m-0">
<h2 class="h-line">InBox</h2>
<div id="mailbox-table">
<table id="maillist">
<tr>
<th>From</th>
<th>Subject</th>
<th>Date</th>
</tr>
<tr onclick="location='readmail.html?mid=welcome'" style="font-weight: bold;">
<td>no-reply@somemail.net</td>
<td>
Hi, Welcome
</td>
<td>
<span title="2016-02-16 13:23:50 UTC">just now</span>
</td>
</tr>
<tr onclick="location='readmail.html?mid=T0wM6P'" style="font-weight: bold;">
<td>someone@outlook.com</td>
<td>
sa
</td>
<td>
<span title="2016-02-16 13:24:04">just now</span>
</td>
</tr>
</table>
</div>
</div>
I need to parse the links in the <tr onclick=...> tags and the email addresses in the <td> tags.
So far I have managed to get only the first occurrence of an email/link from my HTML.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(responseFromServer);
Could someone show me how this is done properly? Basically, I want to take all email addresses and links from the HTML that are in those tags.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//tr[@onclick]"))
{
    HtmlAttribute att = link.Attributes["onclick"];
    Console.WriteLine(att.Value);
}
EDIT: I need to store the parsed values in a list of class instances, in pairs: the link to the mail and the sender's email.
public class ClassMailBox
{
    public string From { get; set; }
    public string LinkToMail { get; set; }
}
You can write the following code:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(responseFromServer);
// The list must exist before it is filled (requires using System.Collections.Generic;)
List<ClassMailBox> classMailBoxes = new List<ClassMailBox>();
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//tr[@onclick]"))
{
    HtmlAttribute att = link.Attributes["onclick"];
    ClassMailBox classMailbox = new ClassMailBox() { LinkToMail = att.Value };
    classMailBoxes.Add(classMailbox);
}
// Second pass: fill in the sender for each entry, in document order
int currentPosition = 0;
foreach (HtmlNode tableDef in doc.DocumentNode.SelectNodes("//tr[@onclick]/td[1]"))
{
    classMailBoxes[currentPosition].From = tableDef.InnerText;
    currentPosition++;
}
To keep this code simple, I'm assuming some things:
The email is always in the first td inside the tr that has an onclick attribute
Every tr with an onclick attribute contains an email
If those conditions don't hold, this code won't work: it could throw exceptions (for example, SelectNodes returns null when nothing matches) or it could pair links with the wrong email addresses.
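One way to avoid relying on two lists staying in step is to read both values from the same row in a single loop. The following is only a sketch under that idea: MailboxParser and Parse are hypothetical names, the ClassMailBox shape is copied from the question, and the XPath assumes the table keeps the id "maillist" shown in the sample.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public class ClassMailBox
{
    public string From { get; set; }
    public string LinkToMail { get; set; }
}

public static class MailboxParser
{
    // Builds one ClassMailBox per clickable row, reading both values from that row.
    public static List<ClassMailBox> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<ClassMailBox>();
        var rows = doc.DocumentNode.SelectNodes("//table[@id='maillist']//tr[@onclick]");
        if (rows == null) return result; // no clickable rows at all

        foreach (var row in rows)
        {
            // Relative XPath: the first cell of THIS row only.
            var fromCell = row.SelectSingleNode("./td[1]");
            if (fromCell == null) continue; // skip rows without cells

            result.Add(new ClassMailBox
            {
                From = HtmlEntity.DeEntitize(fromCell.InnerText).Trim(),
                LinkToMail = row.GetAttributeValue("onclick", string.Empty)
            });
        }
        return result;
    }

    public static void Main()
    {
        // Shortened copy of the sample markup from the question.
        const string html = @"<table id=""maillist"">
            <tr><th>From</th><th>Subject</th><th>Date</th></tr>
            <tr onclick=""location='readmail.html?mid=welcome'""><td>no-reply@somemail.net</td><td>Hi, Welcome</td></tr>
        </table>";

        foreach (var mail in Parse(html))
            Console.WriteLine(mail.From + " -> " + mail.LinkToMail);
    }
}
```

Note that LinkToMail still holds the whole onclick value (location='...'); extracting just the URL would need an extra string operation.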
How can I get the information out of all td tags in the table with class "string_14" so that I can store it cleanly, without any HTML markup?
This is the markup I am working with:
<table class="string_14">
<tbody><tr>
<td>Postadr.:</td>
<td class="tab_space">Stenslivegen 67, 2817 Gjøvik</td>
</tr>
<tr>
<td>Telefon:</td>
<td class="tab_space">611 80 710</td>
</tr>
<tr>
<td>Mobil:</td>
<td class="tab_space">957 92 455</td>
</tr>
</tbody>
</table>
This is my code so far. What I need help with is the XPath assigned to name: how should I write it to get a single td?
HtmlDocument doc = new HtmlDocument();
doc.Load(new StringReader(result));
HtmlNode root = doc.DocumentNode;
List<string> list = new List<string>();
foreach (HtmlNode div in root.SelectNodes("//div[@class='biz_list']"))
{
    string name = doc.DocumentNode.SelectNodes("//d[@class='string_14']/@tr");
    list.Add(name);
    string att = div.OuterHtml;
    list.Add(att);
}
My goal is to scrape a page and then, at a later stage, save the result to an XML file.
I think what you want is this:
HtmlNodeCollection collection = doc.DocumentNode.SelectNodes("//table[@class='string_14']//td[@class='tab_space']");
You can consult an XPath tutorial for more on this.
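To get the values out as plain text, each label could be paired with its value row by row. This is a minimal sketch, not a definitive implementation: ContactTable and Read are hypothetical names, and it assumes every row has a label cell followed by a cell with class "tab_space", as in the sample.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ContactTable
{
    // Maps each label (first cell) to its value (the tab_space cell) as plain text.
    public static Dictionary<string, string> Read(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new Dictionary<string, string>();
        var rows = doc.DocumentNode.SelectNodes("//table[@class='string_14']//tr");
        if (rows == null) return result; // table not found

        foreach (var row in rows)
        {
            var label = row.SelectSingleNode("./td[1]");
            var value = row.SelectSingleNode("./td[@class='tab_space']");
            if (label == null || value == null) continue; // row has a different shape

            // InnerText drops the tags; DeEntitize turns entities back into characters.
            result[Clean(label.InnerText)] = Clean(value.InnerText);
        }
        return result;
    }

    private static string Clean(string text)
    {
        return HtmlEntity.DeEntitize(text).Trim();
    }

    public static void Main()
    {
        const string html = @"<table class=""string_14""><tbody>
            <tr><td>Telefon:</td><td class=""tab_space"">611 80 710</td></tr>
            <tr><td>Mobil:</td><td class=""tab_space"">957 92 455</td></tr>
        </tbody></table>";

        foreach (var pair in Read(html))
            Console.WriteLine(pair.Key + " " + pair.Value);
    }
}
```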
Let's say I have this HTML:
<table class="c1">
    <tr>
        <td>Dog</td>
        <td>Dog</td>
    </tr>
    <tr>
        <td>Cat</td>
        <td>Cat</td>
    </tr>
</table>
What I tried:
HtmlNode node = doc.DocumentNode.SelectSingleNode("//table[@class='c1']");
HtmlNodeCollection urls = node.SelectNodes("a");
The node contains the table, but urls is null. Why?
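Use Descendants("a") instead of SelectNodes("a"). The XPath "a" only matches a elements that are direct children of the table, and the links are nested deeper, inside the rows and cells.
This should work:
var node = doc.DocumentNode.SelectSingleNode("//table[@class='c1']");
var urls = node.Descendants("a").ToList(); // ToList() requires using System.Linq;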
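If you would rather stay with XPath, a relative expression starting with ".//" does the same job as Descendants. The sketch below contrasts the two expressions on the same table node. TableAnchors and Count are hypothetical names, and the anchors in the markup are assumed, since the sample in the question does not show them.

```csharp
using System;
using HtmlAgilityPack;

public static class TableAnchors
{
    // Counts the nodes that a relative XPath finds, starting from the c1 table.
    public static int Count(string html, string relativeXPath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var table = doc.DocumentNode.SelectSingleNode("//table[@class='c1']");
        if (table == null) return 0;

        var nodes = table.SelectNodes(relativeXPath);
        return nodes == null ? 0 : nodes.Count; // null means no match
    }

    public static void Main()
    {
        // Hypothetical markup with a link in every cell.
        const string html = @"<table class=""c1"">
            <tr><td><a href=""/dog1"">Dog</a></td><td><a href=""/dog2"">Dog</a></td></tr>
            <tr><td><a href=""/cat1"">Cat</a></td><td><a href=""/cat2"">Cat</a></td></tr>
        </table>";

        // "a" looks only at direct children of the table.
        Console.WriteLine("a    : " + Count(html, "a"));
        // ".//a" looks at every descendant of the table.
        Console.WriteLine(".//a : " + Count(html, ".//a"));
    }
}
```

Keep the leading dot: an expression that starts with "//" is evaluated from the document root even when it is called on a child node.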
The scenario that I am looking at is that we have a table with multiple columns. One of those columns has a name, another has a dropdown list. I need to manipulate the dropdown for a row that contains a particular name. I looked at the source output, and tried getting the element's grandparent (the table row) so that I could search for the list. However, there was no such search functionality when I used the parent object.
It seems like there would be a lot of this kind of scenario in automating/testing a site, but I have not found anything after searching for a couple of hours. Any help would be appreciated.
EDIT: The application in question is an ASP.NET application, and the output HTML is gnarly at best. However, here is a cleaned-up example of what the HTML being searched looks like:
<table class="myGrid" cellspacing="0" cellpadding="3" rules="all" border="1" id="ctl00_content_MyRpt_ctl01_MyGrid" style="border-collapse:collapse;">
<tr align="left" style="color:Black;background-color:#DFDBDB;">
<th scope="col">Name</th><th scope="col">Unit</th><th scope="col">Status</th><th scope="col">Action</th>
</tr>
<tr>
<td>
<span id="ctl00_content_MyRpt_ctl01_MyGrid_ctl02_Name">JOHN DOE</span>
</td>
<td>
<span id="ctl00_content_MyRpt_ctl01_MyGrid_ctl02_UnitType">Region</span>
<span id="ctl00_content_MyRpt_ctl01_MyGrid_ctl02_UnitNum">1</span>
</td>
<td>
<span id="ctl00_content_MyRpt_ctl01_MyGrid_ctl02_Status">Complete</span>
</td>
<td class="dropdown">
<select name="ctl00$content$MyRpt$ctl01$MyGrid$ctl02$ActionDropDown" onchange="javascript:setTimeout('__doPostBack(\'ctl00$content$MyRpt$ctl01$MyGrid$ctl02$ActionDropDown\',\'\')', 0)" id="ctl00_content_MyRpt_ctl01_MyGrid_ctl02_ActionDropDown" class="dropdown">
<option value="123456">I want to...</option>
<option value="Details.aspx">View Details</option>
<option value="Summary.aspx">View Summary</option>
<option value="DirectReports.aspx">View Direct Reports</option>
</select>
</td>
</tr>
<tr>
...
</tr>
</table>
I found a way to do what I wanted. It is probably not the best or most elegant solution, but it works (it is not production code).
private void btnStart_Click(object sender, EventArgs e)
{
    using (var browser = new IE("http://godev/review"))
    {
        browser.Link(Find.ByText("My Direct Reports")).Click();
        // The span's grandparent is the table row; "as" yields null if it is not a TableRow
        TableRow tr = browser.Span(Find.ByText("JOHN DOE")).Parent.Parent as TableRow;
        SelectList objSL = null;
        if (tr != null && tr.Exists)
        {
            // Look through the row's cells for the first one that holds a dropdown
            foreach (var td in tr.TableCells)
            {
                objSL = td.ChildOfType<SelectList>(Find.Any) as SelectList;
                if (objSL.Exists) break;
            }
            if (objSL != null && objSL.Exists)
            {
                Option o = objSL.Option(Find.ByText("View Direct Reports"));
                if (o.Exists) o.Select();
            }
        }
    }
}
Hopefully this saves someone a little time and effort. Also, I would love to see if someone has a better solution.