Let's say I have this HTML:
<table class="c1">
<tr>
<td>Dog</td>
<td>Dog</td>
</tr>
<tr>
<td>Cat</td>
<td>Cat</td>
</tr>
</table>
What I tried:
HtmlNode node = doc.DocumentNode.SelectSingleNode("//table[@class='c1']");
HtmlNodeCollection urls = node.SelectNodes("a");
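node contains the table, but urls is null. Why?
SelectNodes("a") is evaluated relative to the table node and only matches its direct children, while any links sit deeper, inside the td cells. Use Descendants("a") instead of SelectNodes("a"), or the XPath ".//a".
This should work: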
var node = doc.DocumentNode.SelectSingleNode("//table[@class='c1']");
var urls = node.Descendants("a").ToList();
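As a rough illustration of the difference, here is a minimal sketch using top-level statements (C# 9 or later). The markup and its href values are hypothetical placeholders, since the sample in the question does not show the links themselves.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical markup: the hrefs are placeholders, not taken from the question.
var html = @"<table class='c1'>
  <tr><td><a href='/dog'>Dog</a></td></tr>
  <tr><td><a href='/cat'>Cat</a></td></tr>
</table>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

var table = doc.DocumentNode.SelectSingleNode("//table[@class='c1']");

// "a" is relative to the table and matches direct children only.
// HtmlAgilityPack returns null (not an empty collection) when nothing matches.
var direct = table.SelectNodes("a");

// ".//a" searches every level below the table.
var nested = table.SelectNodes(".//a");

// Descendants("a") walks the same subtree without XPath.
var viaLinq = table.Descendants("a").ToList();

Console.WriteLine(direct == null ? "direct: null" : "direct: " + direct.Count);
Console.WriteLine("nested: " + nested.Count);
Console.WriteLine("descendants: " + viaLinq.Count);
```

Because SelectNodes can return null, it is usually worth checking the result before looping over it.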
I am trying to use the HtmlAgilityPack package to find every href link inside td tags across an entire HTML page. The tricky part is that the tables sit deep in the HTML structure, and as far as I can tell HtmlAgilityPack won't let me simply ask for all tds inside trs on a page. Each table is wrapped in a parent div with the class "table-group", which is not shown in my sample below. Maybe I can use that as a starting point? My biggest problem is that there are several parent elements above everything in the sample, and I want to skip all of them and start here.
Here is a sample of the structure I am trying to navigate:
<table>
<thead>
</thead>
<tbody>
<tr>
<td>Link 1</td>
<td>1</td>
</tr>
<tr>
<td>Link 2</td>
<td>2</td>
</tr>
<tr>
<td>Link 3</td>
<td>3</td>
</tr>
</tbody>
</table>
<table>
<thead>
</thead>
<tbody>
<tr>
<td>Link 4</td>
<td>4</td>
</tr>
<tr>
<td>Link 5</td>
<td>5</td>
</tr>
<tr>
<td>Link 6</td>
<td>6</td>
</tr>
</tbody>
</table>
I would like my end result to be:
https://path-to-pdf1
https://path-to-pdf2
https://path-to-pdf3
https://path-to-pdf4
https://path-to-pdf5
https://path-to-pdf6
Here is what I have tried:
var html = @"https://myurl.com";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(html);
var nodes = htmlDoc.DocumentNode.SelectNodes("//table/tbody/tr/td/a[0]");
foreach (var item in nodes)
{
Console.WriteLine(item.Attributes["href"].Value);
}
Console.ReadKey();
Change
var nodes = htmlDoc.DocumentNode.SelectNodes("//table/tbody/tr/td/a[0]");
to
var nodes = htmlDoc.DocumentNode.SelectNodes("//table/tbody/tr/td[1]/a");
and you will get the result you want. XPath positions start at 1, so a[0] never matches anything, while td[1] selects the first cell of each row. See the XPath documentation for more details.
I tried this in an MVC project with the same HTML file.
Update:
I copied the HTML into a local page and retrieved the nodes successfully.
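To illustrate the 1-based indexing behind this fix, here is a minimal sketch. It assumes the links are plain a elements inside the first cell of each row and that the "table-group" div mentioned in the question wraps the table; the markup is a shortened stand-in for the real page.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Shortened stand-in for the page; the structure is assumed from the question.
var html = @"<div class='table-group'><table><tbody>
  <tr><td><a href='https://path-to-pdf1'>Link 1</a></td><td>1</td></tr>
  <tr><td><a href='https://path-to-pdf2'>Link 2</a></td><td>2</td></tr>
</tbody></table></div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// XPath positions start at 1, so a[0] matches nothing and SelectNodes returns null.
var zeroBased = doc.DocumentNode.SelectNodes("//table/tbody/tr/td/a[0]");

// td[1] picks the first cell of each row; the div class narrows the search
// so the elements above it can be ignored.
var links = doc.DocumentNode.SelectNodes("//div[@class='table-group']//tr/td[1]/a");

var hrefs = new List<string>();
if (links != null)
{
    foreach (var link in links)
    {
        // GetAttributeValue returns the fallback instead of throwing when href is missing.
        hrefs.Add(link.GetAttributeValue("href", ""));
    }
}

foreach (var href in hrefs)
{
    Console.WriteLine(href);
}
```

Starting the XPath at the wrapping div with "//" is one way to skip the parent elements without spelling out the full path from the document root.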
I am just getting into traversing XML documents to learn how to use XPath.
I have stumbled onto an issue. Every time I try to execute my XPath it returns null, as if it didn't find anything.
I've tried the XPath in XMLQuire and it worked there.
class Program
{
private static string URL = "https://www.kijiji.ca/b-renovation-contracting-handyman/ontario/home-renovations/k0c753l9004";
private static HtmlWeb client = new HtmlWeb();
static void Main(string[] args)
{
var DOM = client.Load(URL); // //table/tbody/tr/td[@class = 'description']/p
var Featured = DOM.DocumentNode.SelectNodes("//table[contains(@class,'top-feature')]/tbody/tr/td/a");
foreach (var Listing in Featured)
{
}
}
}
I commented out the other XPath I tried. I've tried both and both return null. Why is that?
Here is the part of the DOM I want to access:
<table class="top-feature js-hover" data-ad-id="1299717863" data-vip-url="/v-renovation-contracting-handyman/sudbury/c-l-contracting-any-job-big-or-small/1299717863">
<tbody><tr>
<td class="watchlist">
<div class="watch js-hover p-vap-lnk-actn-addwtch" data-action="add" data-adid="1299717863" title="Click to add to My Favourites"><div class="icon"></div></div>
<input id="watchlistXsrf" name="ca.kijiji.xsrf.token" value="1527418405414.9b71d1309fdd8a315258ea5a3dac1a09e4a99ec7f32041df88307c46e26a5b1b" type="hidden">
</td>
<td class="image">
<div class="multiple-images"><img src="https://i.ebayimg.com/00/s/NjAwWDgwMA==/z/fXEAAOSwaZdZxTv~/$_2.JPG" alt="C.L. Contracting. Any job big or small."></div>
</td>
<td class="description">
<a href="/v-renovation-contracting-handyman/sudbury/c-l-contracting-any-job-big-or-small/1299717863" class="title ">
C.L. Contracting. Any job big or small.</a>
<p>
Contractor handyman home renovations and repairs. Contractor for Dollarama, Rexall, LaSenza and more. Fully licensed and insured. Able to do drywall, decks, framing, plumbing, flooring windows, ...</p>
<p class="details">
</p>
</td>
<td class="posted">
</td>
</tr>
</tbody></table>
My solution (I need help turning this into a single XPath instead of traversing with a bunch of loops):
private static string URL = "https://www.kijiji.ca/b-renovation-contracting-handyman/ontario/home-renovations/k0c753l9004";
private static HtmlWeb client = new HtmlWeb();
static void Main(string[] args)
{
var DOM = client.Load(URL); // //table/tbody/tr/td[@class = 'description']/p
var Featured = DOM.DocumentNode.SelectNodes("//table[contains(@class,'top-feature')]/tbody/tr/td/a");
foreach (var table in DOM.DocumentNode.SelectNodes("//table[contains(@class, 'top-feature')]"))
{
Console.WriteLine($"Found: {table}");
foreach (var rows in table.SelectNodes("tr"))
{
Console.WriteLine(rows);
foreach (var cell in rows.SelectNodes("td[@class='description']/a"))
{
Console.WriteLine(cell.InnerText.Trim());
}
}
}
Console.ReadKey();
}
I've managed to fix it, but I am still curious why this XPath works:
//table[contains(@class, 'top-feature')]/tr/td[@class='description']/a
And this one doesn't:
//table[contains(@class,'top-feature')]/tbody/tr/td/a
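As mentioned in the comment, the <tbody> element is inserted by the browser when it builds the DOM, which is why it shows up in the developer tools. It is not part of the HTML that the server sends and that HtmlAgilityPack parses.
If you inspect your DOM object at runtime with the debugger, you can look at the InnerHtml property of its DocumentNode to see the markup that was actually downloaded: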
<table class="regular-ad js-hover" data-ad-id=".." data-vip-url="..">
<tr>
<td class="watchlist">
...
</td>
<td class="image">
...
</td>
...
</tr>
</table>
There is no <tbody> element, so your XPath has to look like this:
DOM.DocumentNode.SelectNodes("//table[contains(@class,'top-feature')]/tr/td/a");
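If the goal is a single XPath that does not depend on whether <tbody> is present, one option is to use "//" between the table and the cell. The sketch below assumes a trimmed-down copy of the listing markup from the question rather than the live page.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Trimmed-down copy of the listing markup, as served (no <tbody>).
var html = @"<table class='top-feature js-hover'>
  <tr>
    <td class='image'></td>
    <td class='description'>
      <a href='/v-renovation-contracting-handyman/sudbury/c-l-contracting-any-job-big-or-small/1299717863' class='title '>
        C.L. Contracting. Any job big or small.</a>
    </td>
  </tr>
</table>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// "//" between table and td skips over tr, and over tbody if the source has one.
var anchors = doc.DocumentNode.SelectNodes(
    "//table[contains(@class,'top-feature')]//td[@class='description']/a");

// SelectNodes returns null when nothing matches, so fall back to an empty array.
var titles = anchors == null
    ? new string[0]
    : anchors.Select(a => a.InnerText.Trim()).ToArray();

foreach (var title in titles)
{
    Console.WriteLine(title);
}
```

This replaces the three nested loops with one query, at the cost of also matching description cells in any table nested inside a top-feature table.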
Using Windows Forms and C#.
For example...
<table id=tbl>
<tbody>
<tr>
<td>HELLO</td>
<td>YES</td>
<td>TEST</td>
</tr>
<tr>
<td>BLAH BLAH</td>
<td>YES</td>
<td>TEST</td>
</tr>
</tbody>
</table>
I load the page using the WebBrowser Control. The page loads perfectly.
The next thing I want to do is search through all the rows in the table and check whether they contain a specific value, in this instance YES.
If a row contains it, I want that row returned so I can store it as a string.
The row should be in HTML form (including the tags).
How can I accomplish this?
Please help me.
You can use the HtmlAgilityPack to easily parse the html. For example, to get all of the TD elements, you can do this:
string value = @" <table id=tbl>
<tbody>
<tr>
<td>HELLO</td>
<td>YES</td>
<td>TEST</td>
</tr>
<tr>
<td>BLAH BLAH</td>
<td>YES</td>
<td>TEST</td>
</tr>
</tbody>
</table>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(value);
var nodes = doc.GetElementbyId("tbl").SelectNodes("tbody/tr/td");
foreach (var node in nodes)
{
Debug.WriteLine(node.InnerText);
}
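To get whole rows back as HTML rather than individual cell texts, the same idea can be extended with LINQ. This is a minimal sketch, not a drop-in solution: the second row's value is changed to NO (unlike the question's sample) so the filter has something to exclude, and the id attribute is quoted.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Sample adapted from the question; the second row is changed to NO for illustration.
var html = @"<table id='tbl'><tbody>
  <tr><td>HELLO</td><td>YES</td><td>TEST</td></tr>
  <tr><td>BLAH BLAH</td><td>NO</td><td>TEST</td></tr>
</tbody></table>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// "//tr" finds the rows whether or not the source contains a tbody.
var rows = doc.DocumentNode.SelectNodes("//table[@id='tbl']//tr");

var matches = new List<string>();
if (rows != null)
{
    matches = rows
        // Keep rows where at least one cell's text is exactly YES.
        .Where(tr => tr.Elements("td").Any(td => td.InnerText.Trim() == "YES"))
        // OuterHtml returns the row including its tags.
        .Select(tr => tr.OuterHtml)
        .ToList();
}

foreach (var rowHtml in matches)
{
    Console.WriteLine(rowHtml);
}
```

In the Windows Forms case, the HTML string could come from the WebBrowser control's DocumentText property once the page has finished loading.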
You can use this: http://simplehtmldom.sourceforge.net/. It is a really simple way to search HTML files.
Just include simple_html_dom.php in your file and then follow this manual:
http://simplehtmldom.sourceforge.net/manual.htm
Your PHP code will look like this:
$html = file_get_html('File.html');
foreach($html->find('td') as $element)
echo $element->plaintext . '<br>';
How can I get the text out of all td tags in the table with class="string_14" so that I can store it cleanly, without the HTML markup?
I have thought about this:
<table class="string_14">
<tbody><tr>
<td>Postadr.:</td>
<td class="tab_space">Stenslivegen 67, 2817 Gjøvik</td>
</tr>
<tr>
<td>Telefon:</td>
<td class="tab_space">611 80 710</td>
</tr>
<tr>
<td>Mobil:</td>
<td class="tab_space">957 92 455</td>
</tr>
</tbody>
</table>
This is what my code looks like today. What I need help with is the XPath assigned to name: how should I write it to get a single td?
HtmlDocument doc = new HtmlDocument();
doc.Load(new StringReader(result));
HtmlNode root = doc.DocumentNode;
List<string> list = new List<string>();
foreach (HtmlNode div in root.SelectNodes("//div[@class='biz_list']"))
{
string name = doc.DocumentNode.SelectNodes("//d[@class='string_14']/@tr");
list.Add(name);
string att = div.OuterHtml;
list.Add(att);
}
The goal is to scrape a page and, at a later stage, save the results to an XML file.
I think what you want is this:
HtmlNodeCollection collection = doc.DocumentNode.SelectNodes("//table[@class='string_14']//td[@class='tab_space']");
You can consult an XPath tutorial for more on this.
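To turn that collection into clean strings without markup, something along these lines may work. It is a sketch under the assumption that only the tab_space cells are wanted, and it uses a shortened copy of the table from the question.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Shortened copy of the table from the question.
var html = @"<table class='string_14'><tbody>
  <tr><td>Postadr.:</td><td class='tab_space'>Stenslivegen 67, 2817 Gjøvik</td></tr>
  <tr><td>Telefon:</td><td class='tab_space'>611 80 710</td></tr>
</tbody></table>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

var cells = doc.DocumentNode.SelectNodes(
    "//table[@class='string_14']//td[@class='tab_space']");

var values = new List<string>();
if (cells != null) // SelectNodes returns null when nothing matches
{
    foreach (var cell in cells)
    {
        // InnerText drops the tags; DeEntitize decodes entities such as &amp;.
        values.Add(HtmlEntity.DeEntitize(cell.InnerText).Trim());
    }
}

foreach (var value in values)
{
    Console.WriteLine(value);
}
```

To get a single td instead of all of them, SelectSingleNode with a row position such as (//table[@class='string_14']//td[@class='tab_space'])[1] is one possibility.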
Here is the HTML code:
<table style="border:1px solid #000">
<tr style="background:#ddd;">
<td width="150">TableEle1</td>
<td width="150">TableEle2</td>
<td width="150">TableEle3</td>
<td width="150">TableEle4</td>
<td width="150">TableEle5</td>
<td width="150">TableEle6</td>
<td width="150">TableEle7</td>
<td width="150">TableEle8</td>
</tr>
And here is the code I use to extract table element 1 (without success):
htmlHelper.SetNode(@"//td/text()='TableEle1'");
Is there any advice for me?
You can use a blend of HtmlAgilityPack and LINQ to get the desired td node.
HtmlDocument document = new HtmlDocument();
document.LoadHtml("[your HTML string]");
var node = document.DocumentNode.SelectNodes("//td/text()");
// Keep the matching text nodes and map each one to its parent td.
var tdNode = node.Where(s => s.InnerText == "TableEle1").Select(s => s.ParentNode);
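An XPath-only alternative is to move the comparison into a predicate. The expression in the question, //td/text()='TableEle1', evaluates to a boolean rather than a node set, which is a likely reason it did not select anything. A minimal sketch, using a shortened copy of the table:

```csharp
using System;
using HtmlAgilityPack;

// Shortened copy of the table from the question.
var html = @"<table style='border:1px solid #000'>
  <tr style='background:#ddd;'>
    <td width='150'>TableEle1</td>
    <td width='150'>TableEle2</td>
  </tr>
</table>";

var document = new HtmlDocument();
document.LoadHtml(html);

// The predicate [text()='TableEle1'] filters td elements by their text,
// so the result is the td node itself rather than a true/false value.
var td = document.DocumentNode.SelectSingleNode("//td[text()='TableEle1']");

// SelectSingleNode returns null when nothing matches.
Console.WriteLine(td == null ? "not found" : td.OuterHtml);
```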
Hope this helps!