Parse table with HTML Agility Pack

In the following HTML, I can parse the table element, but I don't know how to skip the th elements.
I want to get only the td elements, but when I try to use:
foreach (HtmlNode cell in row.SelectNodes("td"))
...I get an exception.
<table class="tab03">
<tbody>
<tr>
<th class="right" rowspan="2">first</th>
</tr>
<tr>
<th class="right">lp</th>
<th class="right">name</th>
</tr>
<tr>
<td class="right">1</td>
<td class="left">house</td>
</tr>
<tr>
<th class="right" rowspan="2">Second</th>
</tr>
<tr>
<td class="right">2</td>
<td class="left">door</td>
</tr>
</tbody>
</table>
My code:
var document = doc.DocumentNode.SelectNodes("//table");
string store = "";
if (document != null)
{
foreach (HtmlNode table in document)
{
if (table != null)
{
foreach (HtmlNode row in table.SelectNodes("tr"))
{
store = "";
foreach (HtmlNode cell in row.SelectNodes("th|td"))
{
store = store + cell.InnerText+"|";
}
sw.Write(store );
sw.WriteLine();
}
}
}
}
sw.Flush();
sw.Close();

This method uses LINQ to query for HtmlNode instances that have the name td.
I also noticed that your output appears as val|val| (with a trailing pipe). This sample uses string.Join("|", values) as a cleaner way to drop that trailing pipe: val|val.
using System;      // StringComparison
using System.Linq;
// ...
var tablecollection = doc.DocumentNode.SelectNodes("//table");
string store = string.Empty;
if (tablecollection != null)
{
    foreach (HtmlNode table in tablecollection)
    {
        // For all rows with at least one descendant with the 'td' tag.
        foreach (HtmlNode row in table.DescendantNodes()
            .Where(desc =>
                desc.Name.Equals("tr", StringComparison.OrdinalIgnoreCase) &&
                desc.DescendantNodes().Any(child => child.Name.Equals("td",
                    StringComparison.OrdinalIgnoreCase))))
        {
            // Join the inner text of the 'td' descendants with the pipe
            // to create the output in 'val|val|val' format.
            store = string.Join("|", row.DescendantNodes().Where(desc =>
                desc.Name.Equals("td", StringComparison.OrdinalIgnoreCase))
                .Select(desc => desc.InnerText));
            // You can probably get rid of the 'store' variable as it's
            // no longer necessary to store the value of the table's
            // cells over the iteration.
            sw.Write(store);
            sw.WriteLine();
        }
    }
}
sw.Flush();
sw.Close();

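Your XPath is valid. The exception comes from SelectNodes, which by default returns null when nothing matches, so the foreach throws a NullReferenceException on rows that contain only th elements. Check for null first:
var cells = row.SelectNodes("td");
if (cells == null) continue; // header-only row, nothing to read
Avoid "//td" here: it starts from the document root and returns every td on the page, not just the ones in the current row.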
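As an alternative to filtering in C#, the row filter can live in the XPath itself. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the class and method names (TableRows, DataRows) are made up for illustration and are not part of the library.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TableRows
{
    // Hypothetical helper: returns one "val|val" string per row
    // that has at least one <td> child. Header-only rows are skipped.
    public static List<string> DataRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        var tables = doc.DocumentNode.SelectNodes("//table");
        if (tables == null) return result; // SelectNodes returns null on no match

        foreach (HtmlNode table in tables)
        {
            // ".//tr[td]" is relative to this table and keeps only rows
            // with a td child, whether or not a tbody sits in between.
            var rows = table.SelectNodes(".//tr[td]");
            if (rows == null) continue;

            foreach (HtmlNode row in rows)
            {
                // Elements("td") looks at direct children only.
                result.Add(string.Join("|",
                    row.Elements("td").Select(td => td.InnerText.Trim())));
            }
        }
        return result;
    }
}
```

Each returned string can then be passed to sw.WriteLine. Note that a table nested inside another table would have its rows reported twice with this sketch.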

Related

Get Html elements inside div By ID (ID is not Unique)

I am developing an add-on that reads web browser data and stores it in a dictionary.
During this process I need to access data by ID, but the IDs are not unique on the page. The page looks like this:
<div id="ID1">
<tbody>
<tr>
<td id="1000" data-field="1">
text
</td>
</tr>
</tbody>
<div id="ID2">
<tbody>
<tr>
<td id="1000" data-field="2">
Some other text
</td>
</tr>
</tbody>
Both div elements are on the same page.
When I get the element by ID, it only gives me the first element, not the second one.
Here is my code:
HtmlElement myElements = webBrowser1.Document.GetElementById("ID2");
HtmlElement myElements2 = myElements.Document.GetElementById("1000");
if (myElements2.InnerText != null)
{
//Do something
}
How can I get the inner text of the second element by ID?

This is the simplest answer I came up with.
I figured out that data-field is unique on the page, so I looped through the td elements and compared each one's data-field value:
HtmlElement buildingContacts = webBrowser1.Document.GetElementById("ID2");
// Search only inside the ID2 div rather than the whole document.
HtmlElementCollection cells = buildingContacts.GetElementsByTagName("td");
foreach (HtmlElement element in cells)
{
    string dataField = element.GetAttribute("data-field");
    if (dataField == "2" && element.InnerText != null)
    {
        // Do something
    }
}
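If the page source is available as a string, the same lookup can be scoped with XPath instead of the WebBrowser control. This is a different technique from the answer above: a minimal sketch using HtmlAgilityPack, with hypothetical names (ScopedLookup, CellText), and assuming the ids contain no quote characters.

```csharp
using HtmlAgilityPack;

public static class ScopedLookup
{
    // Hypothetical helper: returns the trimmed text of the td with id
    // cellId that sits inside the div with id divId, or null if absent.
    public static string CellText(string html, string divId, string cellId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The second step ("//td[...]") is evaluated below the matched div,
        // so a td with the same id in another div is ignored.
        HtmlNode cell = doc.DocumentNode.SelectSingleNode(
            $"//div[@id='{divId}']//td[@id='{cellId}']");

        return cell?.InnerText.Trim();
    }
}
```

For example, CellText(pageSource, "ID2", "1000") targets the second cell from the question, where pageSource is whatever string holds the page markup.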

Html Agility Pack Loop Through Table - Get cell value based on previous cell value

I have multiple tables, and the Location value appears at a different index in each one.
How can I get the location value when the previous cell's text is "Location" as I loop through the table? In the example below it is cells[7], but in another table it will be cells[9]. Basically: find the cell "Location" and get the inner text of the next cell.
Html Table:
<table class="tbfix FieldsTable">
<tbody>
<tr>
<td class="name">Last Movement</td>
<td class="value">Port Exit</td>
</tr>
<tr>
<td class="name">Date</td>
<td class="value">26/06/2017 00:00:00</td>
</tr>
<tr>
<td class="name">From</td>
<td class="value">HAMBURGE</td>
</tr>
<tr>
<td class="name">Location</td>
<td class="value">EUROGATE HAMBURG</td>
</tr>
<tr>
<td class="name">E/F</td>
<td class="value">E</td>
</tr>
</tbody>
</table>
Controller Loop Through:
foreach (var eachNode in driver.FindElements(By.XPath("//table[contains(descendant::*, 'Last Movement')]")))
{
var cells = eachNode.FindElements(By.XPath(".//td"));
cd = new Detail();
for (int i = 0; i < cells.Count(); i++)
{
cd.ActionType = cells[1].Text.Trim();
string s = cells[3].Text.Trim();
DateTime dt = Convert.ToDateTime(s);
if (_minDate > dt) _minDate = dt;
cd.ActionDate = dt;
}
}

In your foreach loop you could use this:
var location = eachNode.FindElement(By.XPath(".//td[contains(text(),'Location')]/following-sibling::td"));

Assuming your data is always structured like that, I would loop over all the tr elements and add the data to a dictionary.
Try something like this:
// Case-insensitive keys, so "location" also finds "Location".
var tableData = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
var trNodes = eachNode.FindElements(By.TagName("tr"));
foreach (var trNode in trNodes)
{
    var name = trNode.FindElement(By.CssSelector(".name")).Text.Trim();
    var value = trNode.FindElement(By.CssSelector(".value")).Text.Trim();
    tableData[name] = value; // the indexer overwrites a duplicate key instead of throwing
}
var location = tableData["location"];
You would have to add validation and checks for the dictionary and the structure, but that is the general idea.
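Since the question title mentions Html Agility Pack, here is how the "find the label cell, take the next cell" idea might look there instead of Selenium. A minimal sketch under assumptions: the HTML is already in a string, the label contains no quote characters, and the names LabelLookup and ValueFor are invented for this example.

```csharp
using HtmlAgilityPack;

public static class LabelLookup
{
    // Hypothetical helper: finds the td whose text equals the label and
    // returns the trimmed text of the td right after it, or null.
    public static string ValueFor(string html, string label)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // following-sibling::td[1] is the nearest td after the label cell,
        // so the position of the row inside the table does not matter.
        HtmlNode valueCell = doc.DocumentNode.SelectSingleNode(
            $"//td[normalize-space(text())='{label}']/following-sibling::td[1]");

        return valueCell?.InnerText.Trim();
    }
}
```

Calling ValueFor(html, "Location") on the table from the question should give the text of the cell next to "Location", whichever row it is in.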

c# htmlagility pack conditional select node

I am not sure the title suits my problem.
I have html like below
<table id="searchResultsTable" class="">
<tbody>
<tr class="searchResultsItem even ">
<td class="searchResultsPriceValue">
<div> 26.500 TL</div></td>
<td class="searchResultsTitleValue ">
<a class="classifiedTitle" href="xxxx"> some text</a></td>
</tr>
<tr class="searchResultsItem odd ">
.
//same as "searchResultsItem even "
.
</tr>
</tbody>
</table>
I am new to HtmlAgilityPack. I have succeeded in getting the price value of both "searchResultsItem even" and "searchResultsItem odd".
I want to get the href value if the price is below or above some value. I can get an href, but it is always the one from "searchResultsItem even". I want the href from the even row when the even row's price matches my condition, and the href from the odd row when the odd row's price matches.
Below is my code:
foreach (HtmlNode node1 in doc.DocumentNode.SelectNodes("//table[@id='searchResultsTable']"))
{
    foreach (HtmlNode node2 in node1.SelectNodes("//td[@class='searchResultsPriceValue']"))
    {
        string price = node2.InnerText.ToString();
        price = price.Trim().Replace(".", String.Empty);
        price = price.Replace("TL", String.Empty);
        if (Convert.ToInt32(price) < 28000)
        {
            HtmlNode node3 = node1.SelectSingleNode(".//a[@class='classifiedTitle']");
            listBox1.Items.Add(node3.Attributes["href"].Value);
        }
    }
}
Thanks

Get the tr class name as an attribute value. Loop through rows first, then tds.
foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table[@id='searchResultsTable']"))
{
    // ".//tr" is relative to the current table; "//tr" would search the whole document.
    foreach (HtmlNode tr in table.SelectNodes(".//tr"))
    {
        var @class = tr.GetAttributeValue("class", string.Empty);
        switch (@class) {
            // rest of your parsing
        }
    }
}
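To make the "rows first, then cells" idea concrete, here is a minimal sketch that reads the price and the link from the same row, so an odd row's href is only returned when that row's own price matches. It assumes HtmlAgilityPack is referenced and that prices look like "26.500 TL"; the names CheapListings and LinksBelow are hypothetical.

```csharp
using System.Collections.Generic;
using System.Globalization;
using HtmlAgilityPack;

public static class CheapListings
{
    // Hypothetical helper: returns the href of every result row whose
    // price is strictly below maxPrice.
    public static List<string> LinksBelow(string html, int maxPrice)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();
        var rows = doc.DocumentNode.SelectNodes(
            "//table[@id='searchResultsTable']//tr[contains(@class,'searchResultsItem')]");
        if (rows == null) return links; // SelectNodes returns null on no match

        foreach (HtmlNode row in rows)
        {
            // Both lookups start with "." so they stay inside this row.
            HtmlNode priceCell = row.SelectSingleNode(
                "./td[contains(@class,'searchResultsPriceValue')]");
            HtmlNode anchor = row.SelectSingleNode(
                ".//a[contains(@class,'classifiedTitle')]");
            if (priceCell == null || anchor == null) continue;

            // "26.500 TL" becomes "26500".
            string digits = priceCell.InnerText
                .Replace("TL", string.Empty)
                .Replace(".", string.Empty)
                .Trim();

            if (int.TryParse(digits, NumberStyles.Integer,
                    CultureInfo.InvariantCulture, out int price)
                && price < maxPrice)
            {
                links.Add(anchor.GetAttributeValue("href", string.Empty));
            }
        }
        return links;
    }
}
```

contains(@class, ...) is used because the class attributes in the sample carry trailing spaces, which an exact @class='...' comparison would not match.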

Parse Table with LINQ and HtmlAgilityPack

How can I parse HTML using LINQ on a webpage to get the InnerHtml values from the table?
I am using the HtmlAgilityPack and would like to parse some values as cleanly as possible.
The numbers you see (00000, 00001, 00002, ...) are the agents' unique numbers.
So maybe there is a way to use LINQ to parse those numbers and get the following values from the td elements
(name, 123, state, and info) => 00000, John, 123, IDLE, coffee for each agent,
so I can access them separately and work with them, maybe in an array?
</TH>
</TR>
<TR ALIGN=RIGHT>
<TD ALIGN=LEFT>00000</TD>
<TD ALIGN=LEFT>John</TD>
<TD ALIGN=CENTER>123</TD>
<TD ALIGN=LEFT>IDLE</TD>
<TD ALIGN=LEFT>coffee</TD>
</TR>
<TR ALIGN=RIGHT>
<TD ALIGN=LEFT>00001</TD>
<TD ALIGN=LEFT>Lisa</TD>
<TD ALIGN=CENTER>123</TD>
<TD ALIGN=LEFT>IDLE</TD>
<TD ALIGN=LEFT>coffee</TD>
</TR>
<TR ALIGN=RIGHT>
<TD ALIGN=LEFT>00002</TD>
<TD ALIGN=LEFT>Mary</TD>
<TD ALIGN=CENTER>123</TD>
<TD ALIGN=LEFT>IDLE</TD>
<TD ALIGN=LEFT>coffee</TD>
</TR>
<TR ALIGN=RIGHT>
<TD ALIGN=LEFT>00003</TD>
<TD ALIGN=LEFT>Tim</TD>
<TD ALIGN=CENTER>123</TD>
<TD ALIGN=LEFT>IDLE</TD>
<TD ALIGN=LEFT>coffee</TD>
</TR>
....
Thanks in advance!

This seems a lot like a "please give me the code I need question", which I seriously dislike. Have a look at the following and make sure you understand it:
var doc = ... // Load the document
var trs = doc.DocumentNode.Descendants("TR"); // Gives you all the TRs
foreach (var tr in trs)
{
    var tds = tr.Descendants("TD").ToArray(); // Get all the TDs
    if (tds.Length < 5) continue; // Skip header rows, which have no TD cells
    // Turn them into our data structure (tds[0] is the agent number)
    var data = new {
        Name = tds[1].InnerText,
        Number = tds[2].InnerText,
        State = tds[3].InnerText,
        Info = tds[4].InnerText,
    };
    // Do something with data
}
Doing it with LINQ only:
var data = from tr in doc.DocumentNode.Descendants("TR")
           let tds = tr.Descendants("TD").ToArray()
           where tds.Length >= 5 // Skip header rows
           select new {
               Name = tds[1].InnerText,
               Number = tds[2].InnerText,
               State = tds[3].InnerText,
               Info = tds[4].InnerText,
           };

@flindeberg gives a perfectly reasonable answer (+1), but you could avoid the ToArray like this.
private class Row
{
    public string Name { get; set; }
    public int Number { get; set; }
    public string State { get; set; }
    public string Info { get; set; }
}
...
var mappings = new Action<string, Row>[]
{
    (value, row) => row.Name = value,
    (value, row) => row.Number = int.Parse(value),
    (value, row) => row.State = value,
    (value, row) => row.Info = value
};
var doc = ... // Load the document
var trs = doc.DocumentNode.Descendants("TR"); // Gives you all the TRs
foreach (var tr in trs)
{
    var row = new Row();
    // Skip(1) drops the agent number so the remaining cells line up with
    // the mappings. Zip is lazy, so Count() is what actually runs them.
    int mapped = tr.Descendants("TD").Skip(1).Zip(mappings, (td, map) =>
    {
        map(td.InnerText, row);
        return true;
    }).Count();
    if (mapped < mappings.Length) continue; // Header or incomplete row
    // You now have a populated row.
}
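The question also asks to look agents up separately by their unique number. One possible way, sketched here under the assumption that every data row has exactly five TD cells and that agent numbers really are unique (ToDictionary throws on a duplicate key), is to key a dictionary on the first cell. AgentTable and ByAgentId are made-up names.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class AgentTable
{
    // Hypothetical helper: maps each agent number to the remaining
    // cells of its row, in order: name, number, state, info.
    public static Dictionary<string, string[]> ByAgentId(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode.Descendants("tr")
            // Trimmed text of the td children of each row.
            .Select(tr => tr.Elements("td")
                            .Select(td => td.InnerText.Trim())
                            .ToArray())
            // Header rows have no td cells and are dropped here.
            .Where(cells => cells.Length == 5)
            .ToDictionary(cells => cells[0], cells => cells.Skip(1).ToArray());
    }
}
```

With the sample rows, the entry for "00001" would hold Lisa's name, number, state and info as a four-element array.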

Infinite Loop parsing a table using htmlagilitypack

I am trying to parse through TR's using HtmlAgilityPack and do something different with the 1st, 2nd, 3rd etc TD's.
I am almost there but my code (below) causes an infinite loop. It just repeats the first row over and over again:
foreach (HtmlNode row in htmlDoc.DocumentNode.SelectNodes("//table//tr"))
{
var node = row.SelectSingleNode("//td[1]");
if (node != null)
{
Console.WriteLine("Node: {0}", node.InnerText);
}
}
The raw HTML returned is correct. The table is also pretty standard:
<table>
<tr>
<th>Header 1</th>
<th>Header 2</th>
<th>Header 3</th>
<th>Header 4</th>
<th>Header 5</th>
</tr>
<tr>
<td>Cell 1</td>
<td>Cell 2</td>
<td>Cell 3</td>
<td>Cell 4</td>
<td>Cell 5</td>
...
</tr>
</table>
The following code works but then it is not grouped by row so it is much harder to manipulate:
foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes("//table//tr//td"))
{
Console.WriteLine("Node: {0}", node.InnerText);
}

This works fine with your sample html
var res = doc.DocumentNode.SelectNodes("//table//tr[td]")
.Select(row => row.Descendants("td")
.Select(td => td.InnerText).ToList())
.ToList();
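The reason the original loop keeps printing the same cell is that an XPath starting with "//" is evaluated from the document root even when it is called on a row, so "//td[1]" always finds the same first cell. It is not really an infinite loop. A minimal sketch of the per-row version, with invented names (FirstCells, PerRow) and assuming HtmlAgilityPack is referenced:

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class FirstCells
{
    // Hypothetical helper: returns the text of the first td of every row
    // that has one. Header rows (th only) are skipped.
    public static List<string> PerRow(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var firstCells = new List<string>();
        var rows = doc.DocumentNode.SelectNodes("//table//tr");
        if (rows == null) return firstCells; // SelectNodes returns null on no match

        foreach (HtmlNode row in rows)
        {
            // "./td[1]" is relative to the current row. "//td[1]" would
            // restart from the document root on every iteration.
            HtmlNode cell = row.SelectSingleNode("./td[1]");
            if (cell != null) firstCells.Add(cell.InnerText.Trim());
        }
        return firstCells;
    }
}
```

The same relative form works for the other columns: "./td[2]", "./td[3]", and so on.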
