I'm using C# with HtmlAgilityPack. Everything works fine except when the table I'm looking for contains no rows. I'm trying to read data only from the 1st table on the page. The problem is that if the first table contains no rows, HtmlAgilityPack seems to jump down to the 2nd table for some reason.
The html I'm trying to read looks something like this:
<table class='stats'>
<tr>
<td colspan='2'>This is the 1st table</td>
<tr>
<td>Column A</td>
<td>Column B</td>
</tr>
<tr>
<td>Value A</td>
<td>Value B</td>
</tr>
</table>
<table class='stats'>
<tr>
<td colspan='2'>This is the 2nd table</td>
<tr>
<td>Column 1</td>
<td>Column 2</td>
</tr>
<tr>
<td>Value 111</td>
<td>Value 222</td>
</tr>
</table>
I then retrieve the 1st table's values using the following line:
foreach (HtmlNode node in root.SelectNodes("//table[@class='stats']/tr[position() > 2]/td"))
How do I ensure the data I'm grabbing is only from the 1st table?
Thanks.
You can make sure that only the first matching table is selected by adding a position index [1] to the table selector. Wrap the table selector in parentheses so that the index applies to the whole result set; without them, [1] is evaluated relative to each table's parent.
Try the following:
"(//table[@class='stats'])[1]/tr[position()>2]/td"
If the first table has no data rows, SelectNodes returns null, so you should check for that before iterating in the foreach.
For example, you might want to do the following:
var elements = root.SelectNodes("(//table[@class='stats'])[1]/tr[position()>2]/td");
if (elements != null)
{
foreach (HtmlNode node in elements)
{
// process the td node
}
}
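A minimal, self-contained sketch of the null check described above, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names (FirstTableReader, ReadFirstTableCells) are hypothetical and only serve the illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class FirstTableReader
{
    // Returns the trimmed text of the data cells (every row after the first two)
    // of the first table with class 'stats'. Returns an empty list when that
    // table has no such rows.
    public static List<string> ReadFirstTableCells(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The parentheses make [1] pick the first matching table in document order.
        HtmlNodeCollection cells = doc.DocumentNode.SelectNodes(
            "(//table[@class='stats'])[1]/tr[position() > 2]/td");

        var values = new List<string>();
        if (cells == null)
        {
            // SelectNodes returns null, not an empty collection, when nothing matches.
            return values;
        }

        foreach (HtmlNode cell in cells)
        {
            values.Add(cell.InnerText.Trim());
        }
        return values;
    }
}
```

Returning an empty list instead of null keeps the calling foreach free of its own null check.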
You need to have an id on the table or row that uniquely identifies it, and then use that id in the XPath.
I am attempting to use the HtmlAgilityPack package to find each of the href links within td tags throughout an entire HTML page. The trick is that these tables start deep down in the HTML structure. I noticed that with HtmlAgilityPack you can't just say "get all tds that are within trs on a page". Each table is wrapped in a parent div with the class "table-group", which I am not showing in my sample below. Maybe I can use that as a starting point? My biggest trouble is that there are several parent elements above everything in my sample below, but I want to skip all of that and start here.
Here is a sample of the structure I am trying to navigate:
<table>
<thead>
</thead>
<tbody>
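<tr>
<td><a href="https://path-to-pdf1">Link 1</a></td>
<td>1</td>
</tr>
<tr>
<td><a href="https://path-to-pdf2">Link 2</a></td>
<td>2</td>
</tr>
<tr>
<td><a href="https://path-to-pdf3">Link 3</a></td>
<td>3</td>
</tr>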
</tbody>
</table>
<table>
<thead>
</thead>
<tbody>
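<tr>
<td><a href="https://path-to-pdf4">Link 4</a></td>
<td>4</td>
</tr>
<tr>
<td><a href="https://path-to-pdf5">Link 5</a></td>
<td>5</td>
</tr>
<tr>
<td><a href="https://path-to-pdf6">Link 6</a></td>
<td>6</td>
</tr>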
</tbody>
</table>
I would like my end result to be:
https://path-to-pdf1
https://path-to-pdf2
https://path-to-pdf3
https://path-to-pdf4
https://path-to-pdf5
https://path-to-pdf6
Here is what I have tried:
var html = @"https://myurl.com";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(html);
var nodes = htmlDoc.DocumentNode.SelectNodes("//table/tbody/tr/td/a[0]");
foreach (var item in nodes)
{
Console.WriteLine(item.Attributes["href"].Value);
}
Console.ReadKey();
Change
var nodes = htmlDoc.DocumentNode.SelectNodes("//table/tbody/tr/td/a[0]");
to
var nodes = htmlDoc.DocumentNode.SelectNodes("//table/tbody/tr/td[1]/a");
and you will get the result you want. XPath positions are 1-based, so a[0] never matches anything, while td[1]/a selects the link in the first cell of each row. See the XPath documentation for more details.
I tested this in an MVC project with the same HTML file.
Update:
I copied the HTML into a local page and retrieved the nodes successfully.
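As a rough sketch of the same idea with a null check added, assuming the HtmlAgilityPack NuGet package is referenced: the names LinkExtractor and ExtractCellLinks are hypothetical, and the broader expression //td//a[@href] is a substitute for the path above, chosen so that the result does not depend on a tbody element being present in the markup as served.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Collects the href value of every <a> nested inside a <td>, in document order.
    public static List<string> ExtractCellLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // '//' searches the whole document, so the enclosing parent elements
        // can be ignored; [@href] skips anchors without an href attribute.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//td//a[@href]");

        var links = new List<string>();
        if (anchors == null)
        {
            // No match yields null rather than an empty collection.
            return links;
        }

        foreach (HtmlNode anchor in anchors)
        {
            links.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return links;
    }
}
```

To restrict the search to the wrapper mentioned in the question, the expression could be narrowed to //div[@class='table-group']//td//a[@href], assuming the class attribute holds exactly that value.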
Recently I was trying to use the features of "https://datatables.net" on a table rendered by a GridView. It wasn't possible because the GridView always renders a table without the correct structure (without thead). Is there a way to transform the rendered table into the correct format?
Correct format:
<table id="table_id" class="display">
<thead>
<tr>
<th>Column 1</th>
<th>Column 2</th>
</tr>
</thead>
<tbody>
<tr>
<td>Row 1 Data 1</td>
<td>Row 1 Data 2</td>
</tr>
<tr>
<td>Row 2 Data 1</td>
<td>Row 2 Data 2</td>
</tr>
</tbody>
</table>
This code converts one table to the correct format and then runs .DataTable() on the corrected table.
To use it, replace '#gdVscQuote' with the ID of your table.
It will not work if the page has more than one table; that case is not tested.
$(document).ready( function () {
//replace tr
$($('#gdVscQuote')[0].childNodes[1].childNodes[0]).wrap('<thead/>').contents().unwrap();
//replace all td with th inside thead
$('thead td').wrap('<th/>').contents().unwrap();
//get thead
var thead = $("thead").get(0);
//remove saved thead to replace above tbody thead
$("thead").remove();
//add thead correctly
$('#gdVscQuote')[0].prepend(thead);
// replace tds for tr
$($('thead')[0].childNodes).wrapAll("<tr/>")
//add jQuery table functionality
$('#gdVscQuote').DataTable();
});
I have a bit of HTML that looks like this:
<table class="resultsTable">
<tbody>
<tr class="even">
<td width="35%"><strong>Name</strong></td>
<td>ACME ANVILS, INC.</td>
</tr>
</tbody>
</table>
and some C# code that looks like this:
var name = document.DocumentNode
.SelectSingleNode("//*[text()='Name']/following::td").InnerText;
which happily returns
ACME ANVILS, INC.
However, there's a new wrinkle. The page in question now returns multiple results:
<table class="resultsTable">
<tbody>
<tr class="even">
<td width="35%"><strong>Name</strong></td>
<td>ACME ANVILS, INC.</td>
</tr>
</tbody>
</table>
<table class="resultsTable">
<tbody>
<tr class="even">
<td width="35%"><strong>Name</strong></td>
<td>ROAD RUNNER RACES, LLC</td>
</tr>
</tbody>
</table>
So now I'm working with
var tables = document.DocumentNode.SelectNodes("//table/tbody");
foreach (var table in tables)
{
var name = table.SelectSingleNode("//*[text()='Name']/following::td").InnerText;
...
}
Which falls over, because SelectSingleNode returns null.
How do I get my XPath to actually return a result, searching only within the specific table I have selected?
With the addition of a second table, two adjustments are required:
1. Change your absolute XPath,
//*[text()='Name']/following::td
to one relative to the current table or tbody element:
.//*[text()='Name']/following::td
2. Account for there now being more than one td element on the following:: axis. Either just grab the first,
(.//*[text()='Name']/following::td)[1]
or, better, use the following-sibling:: axis instead, in combination with a test on the string value of td rather than a test on a text node, which might be buried beneath intervening formatting elements:
.//td[.='Name']/following-sibling::td
See also Difference between Testing text() nodes vs string values in XPath.
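Put together, the per-table lookup might look like the following sketch, assuming the HtmlAgilityPack NuGet package is referenced. The names NameReader and ReadNames are hypothetical; the XPath is the following-sibling form given above.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class NameReader
{
    // Returns the value cell next to the "Name" label for each results table.
    public static List<string> ReadNames(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var names = new List<string>();
        HtmlNodeCollection tables =
            doc.DocumentNode.SelectNodes("//table[@class='resultsTable']");
        if (tables == null)
        {
            return names; // no results tables on the page
        }

        foreach (HtmlNode table in tables)
        {
            // The leading '.' keeps the search inside the current table;
            // a bare '//' would start again from the document root.
            HtmlNode cell = table.SelectSingleNode(
                ".//td[.='Name']/following-sibling::td");
            if (cell != null)
            {
                names.Add(cell.InnerText.Trim());
            }
        }
        return names;
    }
}
```

The null check on cell covers tables that have no "Name" row, which would otherwise throw when InnerText is read.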
I have HTML which looks basically like the following
....
<div id="a">
<table class="a1">
<tbody>
<tr>
<td><a href="a11.html>a11</a>
</tr>
<tr>
<td><a href="a12.html>a12</a>
</tr>
</tbody>
<table>
</div>
...
I used the following C# code, but I cannot retrieve the URL at this stage:
IWebElement baseTable = driver.FindElement(By.ClassName(TableID));
// gets all table rows
ICollection<IWebElement> rows = baseTable.FindElements(By.TagName("tr"));
// for every row
IWebElement matchedRow = null;
foreach(var row in rows)
{
Console.Write (row.FindElements(By.XPath("td/a")));
}
First of all, you gave us invalid markup. The corrected version:
<div id="a">
<table class="a1">
<tbody>
<tr>
<td>
<a href="a11.html">a11</a>
</td>
</tr>
<tr>
<td>
<a href="a12.html">a12</a>
</td>
</tr>
</tbody>
</table>
</div>
If you have only one anchor in table row, you should use this code to retrieve url:
IWebElement baseTable = driver.FindElement(By.ClassName(TableID));
// gets all table rows
ICollection<IWebElement> rows = baseTable.FindElements(By.TagName("tr"));
// for every row
IWebElement matchedRow = null;
foreach (var row in rows)
{
Console.WriteLine(row.FindElement(By.XPath("td/a")).GetAttribute("href"));
}
You need to get the href attribute of the found element. Otherwise, row.FindElement(By.XPath("td/a")) prints the type name of the class implementing IWebElement, because it is an object, not a string.
This does not look like a valid xpath to me
Console.Write (row.FindElements(By.XPath("td/a")));
try
Console.Write (row.FindElements(By.XPath("/td/a")));
Using Windows Forms and C#.
For example...
<table id=tbl>
<tbody>
<tr>
<td>HELLO</td>
<td>YES</td>
<td>TEST</td>
</tr>
<tr>
<td>BLAH BLAH</td>
<td>YES</td>
<td>TEST</td>
</tr>
</tbody>
</table>
I load the page using the WebBrowser Control. The page loads perfectly.
The next thing I want to do is search through all the rows in the table and check whether they contain a specific value, for example YES in this instance.
If a row contains it, I want the row passed back to me so I can store it as a string.
But I want the row in HTML form (including the tags).
How can I accomplish this?
Please help me.
You can use the HtmlAgilityPack to easily parse the html. For example, to get all of the TD elements, you can do this:
string value = @" <table id=tbl>
<tbody>
<tr>
<td>HELLO</td>
<td>YES</td>
<td>TEST</td>
</tr>
<tr>
<td>BLAH BLAH</td>
<td>YES</td>
<td>TEST</td>
</tr>
</tbody>
</table>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(value);
var nodes = doc.GetElementbyId("tbl").SelectNodes("tbody/tr/td");
foreach (var node in nodes)
{
Debug.WriteLine(node.InnerText);
}
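To get whole rows back as HTML rather than individual cell texts, the same approach can be extended roughly as follows, assuming the HtmlAgilityPack NuGet package is referenced. RowFinder and FindRowsContaining are hypothetical names, and the sketch assumes the search text contains no single quote, since it is placed directly inside the XPath string.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class RowFinder
{
    // Returns the markup (tags included) of every row in the given table
    // that has at least one cell whose text equals cellText.
    public static List<string> FindRowsContaining(string html, string tableId, string cellText)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<string>();
        HtmlNode table = doc.GetElementbyId(tableId);
        if (table == null)
        {
            return rows; // no element with that id
        }

        // tr[td[...]] keeps only rows that own a matching cell.
        // Assumption: cellText contains no single quote.
        HtmlNodeCollection matches = table.SelectNodes(
            ".//tr[td[normalize-space(.)='" + cellText + "']]");
        if (matches == null)
        {
            return rows; // no row contains the value
        }

        foreach (HtmlNode row in matches)
        {
            rows.Add(row.OuterHtml); // OuterHtml includes the <tr> tags themselves
        }
        return rows;
    }
}
```

With the WebBrowser control, the html argument could be taken from its DocumentText property once the page has finished loading.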
You can use this: http://simplehtmldom.sourceforge.net/. It's a really simple way to search in HTML files.
Just include simple_html_dom.php in your file and then follow this manual:
http://simplehtmldom.sourceforge.net/manual.htm
Your PHP code will look like this:
$html = file_get_html('File.html');
foreach($html->find('td') as $element)
echo $element->plaintext . '<br>';