Infinite Loop parsing a table using htmlagilitypack


I am trying to loop through the TRs of a table using HtmlAgilityPack and do something different with the 1st, 2nd, 3rd, etc. TDs.
I am almost there, but my code (below) causes an infinite loop. It just repeats the first row over and over again:
foreach (HtmlNode row in htmlDoc.DocumentNode.SelectNodes("//table//tr"))
{
var node = row.SelectSingleNode("//td[1]");
if (node != null)
{
Console.WriteLine("Node: {0}", node.InnerText);
}
}
The raw HTML returned is correct. The table is also pretty standard:
<table>
<tr>
<th>Header 1</th>
<th>Header 2</th>
<th>Header 3</th>
<th>Header 4</th>
<th>Header 5</th>
</tr>
<tr>
<td>Cell 1</td>
<td>Cell 2</td>
<td>Cell 3</td>
<td>Cell 4</td>
<td>Cell 5</td>
...
</tr>
</table>
The following code works, but the cells are not grouped by row, so they are much harder to manipulate:
foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes("//table//tr//td"))
{
Console.WriteLine("Node: {0}", node.InnerText);
}

This works fine with your sample HTML:
var res = doc.DocumentNode.SelectNodes("//table//tr[td]")
.Select(row => row.Descendants("td")
.Select(td => td.InnerText).ToList())
.ToList();
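
A likely explanation for the repeated first row, offered as a sketch rather than a confirmed diagnosis: an XPath that starts with // is evaluated from the document root even when it is called on row, so row.SelectSingleNode("//td[1]") returns the same cell on every iteration. Making the expression relative to the row (td[1] or .//td[1]) keeps the lookup inside the current row. The inline HTML string and the class name below are made up for illustration, and the code assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using HtmlAgilityPack;

class RelativeXPathSketch
{
    static void Main()
    {
        // Hypothetical inline sample standing in for the real page.
        var html =
            "<table>" +
            "<tr><th>Header 1</th><th>Header 2</th></tr>" +
            "<tr><td>Cell 1</td><td>Cell 2</td></tr>" +
            "<tr><td>Cell 3</td><td>Cell 4</td></tr>" +
            "</table>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        var rows = htmlDoc.DocumentNode.SelectNodes("//table//tr");
        if (rows == null)
        {
            return; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode row in rows)
        {
            // No leading slashes: both lookups are relative to the current row.
            HtmlNode first = row.SelectSingleNode("td[1]");
            HtmlNode second = row.SelectSingleNode("td[2]");

            // The header row has no td children, so it is skipped here.
            if (first != null && second != null)
            {
                Console.WriteLine("First: {0}, Second: {1}", first.InnerText, second.InnerText);
            }
        }
    }
}
```

Each td[n] lookup can then be handled differently per column while the cells stay grouped by row.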

Related

Using HtmlAgilityPack with C# to find all href links within td elements in html page

I am attempting to use HtmlAgilityPack package to find each of the href links within td tags throughout an entire html page. The trick is that these tables start deep down into the html structure. I noticed with HtmlAgilityPack you can't just say get all tds that are within trs on a page. There is a parent div wrapped around each table with a class on it "table-group" that I am not showing in my sample below. Maybe I can use that as a starting point? The biggest trouble that I am dealing with is that there are several parent elements above everything in my sample below, but I want to skip all of that and start here.
Here is a sample of the structure I am trying to navigate:
<table>
<thead>
</thead>
<tbody>
<tr>
<td>Link 1</td>
<td>1</td>
</tr>
<tr>
<td>Link 2</td>
<td>2</td>
</tr>
<tr>
<td>Link 3</td>
<td>3</td>
</tr>
</tbody>
</table>
<table>
<thead>
</thead>
<tbody>
<tr>
<td>Link 4</td>
<td>4</td>
</tr>
<tr>
<td>Link 5</td>
<td>5</td>
</tr>
<tr>
<td>Link 6</td>
<td>6</td>
</tr>
</tbody>
</table>
I would like my end result to be:
https://path-to-pdf1
https://path-to-pdf2
https://path-to-pdf3
https://path-to-pdf4
https://path-to-pdf5
https://path-to-pdf6
Here is what I have tried:
var html = @"https://myurl.com";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(html);
var nodes = htmlDoc.DocumentNode.SelectNodes("//table/tbody/tr/td/a[0]");
foreach (var item in nodes)
{
Console.WriteLine(item.Attributes["href"].Value);
}
Console.ReadKey();

Modify
var nodes = htmlDoc.DocumentNode.SelectNodes("//table/tbody/tr/td/a[0]");
to
var nodes = htmlDoc.DocumentNode.SelectNodes("//table/tbody/tr/td[1]/a");
then you will get the result you want. See the XPath documentation for more details.
I tried this in an MVC project with the same HTML file.
Update:
I copied the HTML to a local page and retrieved the nodes successfully.
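
For context on why the original query fails: XPath positions start at 1, so a[0] can never match, SelectNodes returns null, and the foreach then throws. The sketch below collects every href under the table-group wrapper mentioned in the question. The inline HTML is a stand-in for the real page (its structure is an assumption), and HtmlAgilityPack is assumed to be referenced.

```csharp
using System;
using HtmlAgilityPack;

class HrefSketch
{
    static void Main()
    {
        // Hypothetical markup modelled on the question's description.
        var html =
            "<div class='table-group'><table><tbody>" +
            "<tr><td><a href='https://path-to-pdf1'>Link 1</a></td><td>1</td></tr>" +
            "<tr><td><a href='https://path-to-pdf2'>Link 2</a></td><td>2</td></tr>" +
            "</tbody></table></div>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // Start at the wrapper div and take every anchor with an href inside a td.
        var links = htmlDoc.DocumentNode.SelectNodes(
            "//div[@class='table-group']//td/a[@href]");

        if (links == null)
        {
            Console.WriteLine("No links found.");
            return;
        }

        foreach (HtmlNode link in links)
        {
            // GetAttributeValue returns the fallback instead of throwing if href is missing.
            Console.WriteLine(link.GetAttributeValue("href", string.Empty));
        }
    }
}
```

Using // before td skips the intermediate tbody and tr levels, so the query does not depend on whether the source actually contains a tbody element.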

Retrieve the table data with xpath and Selenium

I have HTML which looks basically like the following:
....
<div id="a">
<table class="a1">
<tbody>
<tr>
<td><a href="a11.html>a11</a>
</tr>
<tr>
<td><a href="a12.html>a12</a>
</tr>
</tbody>
<table>
</div>
...
I used the following C# code, but I cannot retrieve the URL at this stage:
IWebElement baseTable = driver.FindElement(By.ClassName(TableID));
// gets all table rows
ICollection<IWebElement> rows = baseTable.FindElements(By.TagName("tr"));
// for every row
IWebElement matchedRow = null;
foreach(var row in rows)
{
Console.Write (row.FindElements(By.XPath("td/a")));
}

First of all, you gave us invalid markup. The correct version is:
<div id="a">
<table class="a1">
<tbody>
<tr>
<td>
<a href="a11.html">a11</a>
</td>
</tr>
<tr>
<td>
<a href="a12.html">a12</a>
</td>
</tr>
</tbody>
</table>
</div>
If you have only one anchor in table row, you should use this code to retrieve url:
IWebElement baseTable = driver.FindElement(By.ClassName(TableID));
// gets all table rows
ICollection<IWebElement> rows = baseTable.FindElements(By.TagName("tr"));
// for every row
IWebElement matchedRow = null;
foreach (var row in rows)
{
Console.WriteLine(row.FindElement(By.XPath("td/a")).GetAttribute("href"));
}
You need to get the href attribute of the found element. Otherwise, row.FindElement(By.XPath("td/a")) prints the type name of the class implementing IWebElement, because it is an object, not a string.

This does not look like a valid xpath to me
Console.Write (row.FindElements(By.XPath("td/a")));
try
Console.Write (row.FindElements(By.XPath("/td/a")));

Parse table with HTML Agility Pack

In the following HTML, I can parse the table element, but I don't know how to skip the th elements.
I want to get only the td elements, but when I try to use:
foreach (HtmlNode cell in row.SelectNodes("td"))
...I get an exception.
<table class="tab03">
<tbody>
<tr>
<th class="right" rowspan="2">first</th>
</tr>
<tr>
<th class="right">lp</th>
<th class="right">name</th>
</tr>
<tr>
<td class="right">1</td>
<td class="left">house</td>
</tr>
<tr>
<th class="right" rowspan="2">Second</th>
</tr>
<tr>
<td class="right">2</td>
<td class="left">door</td>
</tr>
</tbody>
</table>
My code:
var document = doc.DocumentNode.SelectNodes("//table");
string store = "";
if (document != null)
{
foreach (HtmlNode table in document)
{
if (table != null)
{
foreach (HtmlNode row in table.SelectNodes("tr"))
{
store = "";
foreach (HtmlNode cell in row.SelectNodes("th|td"))
{
store = store + cell.InnerText+"|";
}
sw.Write(store );
sw.WriteLine();
}
}
}
}
sw.Flush();
sw.Close();

This method uses LINQ to query for HtmlNode instances that have the name td.
I also noticed your output appears as val|val| (with a trailing pipe). This sample uses string.Join(pipe, array) as a less hideous way of removing that trailing pipe: val|val.
using System.Linq;
// ...
var tablecollection = doc.DocumentNode.SelectNodes("//table");
string store = string.Empty;
if (tablecollection != null)
{
foreach (HtmlNode table in tablecollection)
{
// For all rows with at least one child with the 'td' tag.
foreach (HtmlNode row in table.DescendantNodes()
.Where(desc =>
desc.Name.Equals("tr", StringComparison.OrdinalIgnoreCase) &&
desc.DescendantNodes().Any(child => child.Name.Equals("td",
StringComparison.OrdinalIgnoreCase))))
{
// Combine the child 'td' elements into an array, join with the pipe
// to create the output in 'val|val|val' format.
store = string.Join("|", row.DescendantNodes().Where(desc =>
desc.Name.Equals("td", StringComparison.OrdinalIgnoreCase))
.Select(desc => desc.InnerText));
// You can probably get rid of the 'store' variable as it's
// no longer necessary to store the value of the table's
// cells over the iteration.
sw.Write(store);
sw.WriteLine();
}
}
}
sw.Flush();
sw.Close();
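
The exception in the question most likely comes from SelectNodes returning null when a row has no td children, which foreach then dereferences. As an alternative to the LINQ filter above, the same condition can be written as an XPath predicate, tr[td], which matches only rows that have at least one td child. This is a minimal sketch: the inline HTML is a shortened, hypothetical version of the question's table, Console output replaces the StreamWriter, and HtmlAgilityPack is assumed to be referenced.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class TdRowsSketch
{
    static void Main()
    {
        // Shortened stand-in for the question's table.
        var html =
            "<table class='tab03'><tbody>" +
            "<tr><th>lp</th><th>name</th></tr>" +
            "<tr><td>1</td><td>house</td></tr>" +
            "<tr><th>Second</th></tr>" +
            "<tr><td>2</td><td>door</td></tr>" +
            "</tbody></table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // //tr also finds rows nested under tbody; [td] drops header-only rows.
        var rows = doc.DocumentNode.SelectNodes("//table//tr[td]");
        if (rows == null)
        {
            return; // no data rows at all
        }

        foreach (HtmlNode row in rows)
        {
            // The predicate guarantees at least one td child, so this is not null.
            var cells = row.SelectNodes("td").Select(td => td.InnerText.Trim());
            Console.WriteLine(string.Join("|", cells));
        }
    }
}
```

With the sample above this prints one line per data row, with cells joined by a pipe and no trailing separator.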

Your XPath syntax is not correct. Please try:
HtmlNode cell in row.SelectNodes("//td")
This will get you the collection of td elements that can be iterated with foreach.

HTMLAgilityPack - Detecting a blank table?

I'm using C# with HtmlAgilityPack. Everything works fine except when the table I'm looking for contains no rows. I'm trying to read only the data from the 1st table on the page. The problem is that if the first table contains no rows, HtmlAgilityPack seems to jump down to the 2nd table for some reason.
The html I'm trying to read looks something like this:
<table class='stats'>
<tr>
<td colspan='2'>This is the 1st table</td>
</tr>
<td>Column A</td>
<td>Column B</td>
</tr>
<tr>
<td>Value A</td>
<td>Value B</td>
</tr>
</table>
<table class='stats'>
<tr>
<td colspan='2'>This is the 2nd table</td>
</tr>
<td>Column 1</td>
<td>Column 2</td>
</tr>
<tr>
<td>Value 111</td>
<td>Value 222</td>
</tr>
</table>
I then retrieve the 1st table's values using the following line:
foreach (HtmlNode node in root.SelectNodes("//table[@class='stats']/tr[position() > 2]/td"))
How do I ensure the data I'm grabbing is only from the 1st table?
Thanks.

You could ensure that you only select the first matching table by using a position index [1] after the table selector.
Try the following:
"//table[@class='stats'][1]/tr[position()>2]/td"
If the first table has no rows, then you will get null back so you should check for that before iterating in the foreach.
For example you might want to do the following:
var elements = root.SelectNodes("//table[@class='stats'][1]/tr[position()>2]/td");
if (elements != null)
{
foreach (HtmlNode node in elements)
{
// process the td node
}
}
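
One caveat, offered as a sketch: in XPath, //table[@class='stats'][1] selects every matching table that is the first such table under its own parent, so it can still match several tables when they sit in different containers. Wrapping the path in parentheses, (//table[@class='stats'])[1], picks the first match in document order instead. The inline HTML below is hypothetical (two tables in separate div wrappers, the first with header rows only), and HtmlAgilityPack is assumed to be referenced.

```csharp
using System;
using HtmlAgilityPack;

class FirstTableSketch
{
    static void Main()
    {
        // The first table has only its two header rows; the second one has data.
        var html =
            "<div><table class='stats'>" +
            "<tr><td colspan='2'>This is the 1st table</td></tr>" +
            "<tr><td>Column A</td><td>Column B</td></tr>" +
            "</table></div>" +
            "<div><table class='stats'>" +
            "<tr><td colspan='2'>This is the 2nd table</td></tr>" +
            "<tr><td>Column 1</td><td>Column 2</td></tr>" +
            "<tr><td>Value 111</td><td>Value 222</td></tr>" +
            "</table></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Parentheses apply [1] to the whole node set, not per parent element.
        var cells = doc.DocumentNode.SelectNodes(
            "(//table[@class='stats'])[1]/tr[position()>2]/td");

        if (cells == null)
        {
            Console.WriteLine("The first table has no data rows.");
            return;
        }

        foreach (HtmlNode cell in cells)
        {
            Console.WriteLine(cell.InnerText);
        }
    }
}
```

When both tables share the same parent, the two forms select the same table, so the difference only matters for pages that wrap each table separately.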

You need to have an id on the table or row which uniquely identifies the table or row, and then use the id in the XPath.

Parse data/numbers from table cells C# or VisualBasic

I have a string which contains html code from a webpage. There's a table in the code I'm interested in. I want to parse the numbers present in the table cells and put them in textboxes, each number in its own textbox. Here's the table:
<table class="tblSkills">
<tr>
<th class="th_first">Strength</th><td class="align_center">15</td>
<th>Passing</th><td class="align_center">17</td>
</tr>
<tr>
<th class="th_first">Stamina</th><td class="align_center">16</td>
<th>Crossing</th><td class="align_center"><img src='/pics/star.png' alt='20' title='20' /></td>
</tr>
<tr>
<th class="th_first">Pace</th><td class="align_center"><img src='/pics/star_silver.png' alt='19' title='19' /></td>
<th>Technique</th><td class="align_center">16</td>
</tr>
<tr>
<th class="th_first">Marking</th><td class="align_center">15</td>
<th>Heading</th><td class="align_center">10</td>
</tr>
<tr>
<th class="th_first">Tackling</th><td class="align_center"><span class='subtle'>5</span></td>
<th>Finishing</th><td class="align_center">15</td>
</tr>
<tr>
<th class="th_first">Workrate</th><td class="align_center">16</td>
<th>Longshots</th><td class="align_center">8</td>
</tr>
<tr>
<th class="th_first">Positioning</th><td class="align_center">18</td>
<th>Set Pieces</th><td class="align_center"><span class='subtle'>2</span></td>
</tr>
</table>
As you can see there are 14 numbers. To make things worse numbers like 19 and 20 are replaced by images and numbers lower than 6 have a span class.
I know I could use HTML Agility Pack or something similar, but I'm not yet good enough to figure out how to do it by myself, so I need your help.

Your HTML sample also happens to be well-formed XML. You could use any of .NET's XML reading/parsing techniques.

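Using LINQ to XML in C#:
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
// ...
var doc = XDocument.Parse(yourHtml);
// Skill names come from the th cells, in document order.
var properties = new List<string>(
from th in doc.Descendants("th")
select th.Value);
// Each value is the td text, or the img alt text when a star image replaces the number.
var values = new List<int>(
from td in doc.Descendants("td")
let img = td.Element("img")
let textValue = img == null ? td.Value : img.Attribute("alt").Value
select int.Parse(textValue));
// Pair names and values by position.
var dict = new Dictionary<string, int>();
for (var i = 0; i < properties.Count; i++)
{
dict[properties[i]] = values[i];
}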
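
If the real page turns out not to be well-formed XML, the same idea can be sketched with HtmlAgilityPack, which the question mentions. This is an illustration under assumptions, not a tested solution for the actual site: the inline HTML is a two-row excerpt of the question's table, the class and variable names are made up, and the HtmlAgilityPack NuGet package is assumed to be referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class SkillsSketch
{
    static void Main()
    {
        // Two rows excerpted from the question's table.
        var html =
            "<table class=\"tblSkills\">" +
            "<tr><th class=\"th_first\">Stamina</th><td class=\"align_center\">16</td>" +
            "<th>Crossing</th><td class=\"align_center\"><img src='/pics/star.png' alt='20' title='20' /></td></tr>" +
            "<tr><th class=\"th_first\">Tackling</th><td class=\"align_center\"><span class='subtle'>5</span></td>" +
            "<th>Finishing</th><td class=\"align_center\">15</td></tr>" +
            "</table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var skills = new Dictionary<string, int>();
        var headers = doc.DocumentNode.SelectNodes("//table[@class='tblSkills']//th");
        if (headers == null)
        {
            return; // table not found
        }

        foreach (HtmlNode th in headers)
        {
            // The value cell is the first td that follows this th in the same row.
            HtmlNode td = th.SelectSingleNode("following-sibling::td[1]");
            if (td == null)
            {
                continue;
            }

            // Star images carry the number in alt; otherwise use the cell text
            // (InnerText also covers the span-wrapped low values).
            HtmlNode img = td.SelectSingleNode("img");
            string text = img != null
                ? img.GetAttributeValue("alt", string.Empty)
                : td.InnerText.Trim();

            if (int.TryParse(text, out int value))
            {
                skills[th.InnerText.Trim()] = value;
            }
        }

        foreach (KeyValuePair<string, int> pair in skills)
        {
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value);
        }
    }
}
```

Each dictionary entry can then be assigned to its own textbox by skill name, which avoids relying on the order of the cells.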
