Parse Table with LINQ and HtmlAgilityPack - c#

How can I parse the HTML of a webpage using LINQ to get the InnerHtml values from a table?
I am using HtmlAgilityPack and would like to parse some values as cleanly as possible.
The numbers you see (00000, 00001, 00002, ...) are unique agent numbers.
So maybe there is a way to use LINQ to parse those numbers and get the following values from the td elements
(Name, 123, state, and info) => 00000, John, 123, IDLE, coffee for each row,
so I can access them separately and work with them, maybe in an array?
</TH>
</TR>
<TR ALIGN=RIGHT>
<TD ALIGN=LEFT>00000</TD>
<TD ALIGN=LEFT>John</TD>
<TD ALIGN=CENTER>123</TD>
<TD ALIGN=LEFT>IDLE</TD>
<TD ALIGN=LEFT>coffee</TD>
</TR>
<TR ALIGN=RIGHT>
<TD ALIGN=LEFT>00001</TD>
<TD ALIGN=LEFT>Lisa</TD>
<TD ALIGN=CENTER>123</TD>
<TD ALIGN=LEFT>IDLE</TD>
<TD ALIGN=LEFT>coffee</TD>
</TR>
<TR ALIGN=RIGHT>
<TD ALIGN=LEFT>00002</TD>
<TD ALIGN=LEFT>Mary</TD>
<TD ALIGN=CENTER>123</TD>
<TD ALIGN=LEFT>IDLE</TD>
<TD ALIGN=LEFT>coffee</TD>
</TR>
<TR ALIGN=RIGHT>
<TD ALIGN=LEFT>00003</TD>
<TD ALIGN=LEFT>Tim</TD>
<TD ALIGN=CENTER>123</TD>
<TD ALIGN=LEFT>IDLE</TD>
<TD ALIGN=LEFT>coffee</TD>
</TR>
....
Thanks in advance!

This seems a lot like a "please give me the code I need" question, which I seriously dislike. Have a look at the following and make sure you understand it:
var doc = ... // Load the document
var trs = doc.DocumentNode.Descendants("tr"); // Gives you all the TRs (node names are lowercase in HtmlAgilityPack)
foreach (var tr in trs)
{
    var tds = tr.Descendants("td").ToArray(); // Get all the TDs
    if (tds.Length < 5) continue; // Skip rows without data cells, such as the TH header row
    // Turn them into our data structure
    var data = new {
        Id = tds[0].InnerText,
        Name = tds[1].InnerText,
        Number = tds[2].InnerText,
        State = tds[3].InnerText,
        Info = tds[4].InnerText,
    };
    // Do something with data
}
Doing it with LINQ only:
var data = from tr in doc.DocumentNode.Descendants("tr")
           let tds = tr.Descendants("td").ToArray()
           where tds.Length >= 5 // Skip rows without data cells, such as the TH header row
           select new {
               Id = tds[0].InnerText,
               Name = tds[1].InnerText,
               Number = tds[2].InnerText,
               State = tds[3].InnerText,
               Info = tds[4].InnerText,
           };

@flindeberg gives a perfectly reasonable answer (+1 to them). You could avoid the ToArray like this:
private class Row
{
    public string Name { get; set; }
    public int Number { get; set; }
    public string State { get; set; }
    public string Info { get; set; }
}
...
var mappings = new Action<string, Row>[]
{
    (value, row) => row.Name = value,
    (value, row) => row.Number = int.Parse(value),
    (value, row) => row.State = value,
    (value, row) => row.Info = value
};
var doc = ... // Load the document
var trs = doc.DocumentNode.Descendants("tr"); // Gives you all the TRs
foreach (var tr in trs)
{
    var row = new Row();
    // Skip(1) passes over the ID cell so the first mapping receives the name.
    // Zip is lazy, so ToList() is needed to actually run the mappings.
    tr.Descendants("td").Skip(1).Zip(mappings, (td, map) =>
    {
        map(td.InnerText, row);
        return true;
    }).ToList();
    // You now have a populated row (rows without TDs, such as the header, stay empty).
}

Related

Read invisible data from table with htmlagilitypack

I have this HTML with a table.
I can get "col1" and "col2", but I don't know how to also get the values of "data-index" and "data-name":
<table class="footable table" id="footable">
<tbody>
<tr class="trclass red" data-index="123" data-name="Apple">
<td class="col1" >Green</td>
<td class="col2" >1.25</td>
</tr>
</tbody>
</table>
What I have tried:
public static void Main()
{
var html =
@"<html>
<table id='footable'>
<tbody>
<tr class='trclass red' data-index='123' data-name='Apple'>
<td class='col1' >Green</td>
<td class='col2' > 1.25</td>
</tr>
</tbody>
</table></html>";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var tbody = htmlDoc.DocumentNode.SelectNodes("//table[contains(@id, 'foo')]//tr//td");
foreach(var nob in tbody)
{
Console.Write(nob.InnerHtml);
}
}
I know that I can use nob.Attributes["data-index"], but that data is on the tr, the parent of the td cells that hold "Green" and "1.25".
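One possible approach, sketched under the assumption that the markup looks like the table above: select the tr nodes instead of the td nodes, read the row attributes with GetAttributeValue, and then query the cells relative to each row. The variable names are illustrative only.

```csharp
using System;
using HtmlAgilityPack;

public static class Program
{
    public static void Main()
    {
        // Sample markup adapted from the question.
        var html = @"<table class='footable table' id='footable'>
<tbody>
<tr class='trclass red' data-index='123' data-name='Apple'>
<td class='col1'>Green</td>
<td class='col2'>1.25</td>
</tr>
</tbody>
</table>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // Select rows rather than cells, so the row attributes are within reach.
        var rows = htmlDoc.DocumentNode.SelectNodes("//table[@id='footable']//tr");
        if (rows == null) return; // SelectNodes returns null when nothing matches

        foreach (var row in rows)
        {
            // The second argument is the fallback used when the attribute is absent.
            var index = row.GetAttributeValue("data-index", "");
            var name = row.GetAttributeValue("data-name", "");

            // "./td" is relative to the current row.
            var col1 = row.SelectSingleNode("./td[@class='col1']")?.InnerText.Trim();
            var col2 = row.SelectSingleNode("./td[@class='col2']")?.InnerText.Trim();

            Console.WriteLine($"{index} {name} {col1} {col2}");
        }
    }
}
```

If you would rather keep iterating over the td nodes, nob.ParentNode should give you the enclosing tr, and the same GetAttributeValue calls can be made on it.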

Html Agility Pack Loop Through Table - Get cell value based on previous cell value

I have multiple tables, and the Location value appears at a different index in each.
How can I get the location value when the previous cell's text is "Location" as I loop through the table? In the example below it is cells[7], but in another table it will be cells[9]. Basically: find the cell containing "Location" and get the inner text of the next cell.
Html Table:
<table class="tbfix FieldsTable">
<tbody>
<tr>
<td class="name">Last Movement</td>
<td class="value">Port Exit</td>
</tr>
<tr>
<td class="name">Date</td>
<td class="value">26/06/2017 00:00:00</td>
</tr>
<tr>
<td class="name">From</td>
<td class="value">HAMBURGE</td>
</tr>
<tr>
<td class="name">Location</td>
<td class="value">EUROGATE HAMBURG</td>
</tr>
<tr>
<td class="name">E/F</td>
<td class="value">E</td>
</tr>
</tbody>
</table>
Controller Loop Through:
foreach (var eachNode in driver.FindElements(By.XPath("//table[contains(descendant::*, 'Last Movement')]")))
{
var cells = eachNode.FindElements(By.XPath(".//td"));
cd = new Detail();
for (int i = 0; i < cells.Count(); i++)
{
cd.ActionType = cells[1].Text.Trim();
string s = cells[3].Text.Trim();
DateTime dt = Convert.ToDateTime(s);
if (_minDate > dt) _minDate = dt;
cd.ActionDate = dt;
}
}
In your foreach loop you could use this:
var location = eachNode.FindElement(By.XPath(".//td[contains(text(),'Location')]/following-sibling::td"));
Assuming your data is always structured like that, I would loop over all the tr tags and add the data to a dictionary.
Try something like this:
Dictionary<string,string> tableData = new Dictionary<string, string>();
var trNodes = eachNode.FindElements(By.TagName("tr"));
foreach (var trNode in trNodes)
{
var name = trNode.FindElement(By.CssSelector(".name")).Text.Trim();
var value = trNode.FindElement(By.CssSelector(".value")).Text.Trim();
tableData.Add(name,value);
}
var location = tableData["Location"]; // keys are case-sensitive by default
You would have to add validation and checks for the dictionary and the structure, but that is the general idea.
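A minimal sketch of what that validation could look like, using only the base class library. The pairs array is a hypothetical stand-in for the name/value texts read from each row; the case-insensitive comparer and the TryGetValue lookup are one choice among several.

```csharp
using System;
using System.Collections.Generic;

public static class Program
{
    public static void Main()
    {
        // Hypothetical stand-in for the name/value texts read from each row.
        var pairs = new[]
        {
            ("Last Movement", "Port Exit"),
            ("Date", "26/06/2017 00:00:00"),
            ("From", "HAMBURGE"),
            ("Location", "EUROGATE HAMBURG"),
            ("E/F", "E"),
        };

        // Case-insensitive keys, so "location" and "Location" hit the same entry.
        var tableData = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (var (name, value) in pairs)
        {
            if (string.IsNullOrWhiteSpace(name)) continue; // skip rows without a label

            // The indexer overwrites an existing key, whereas Add would throw on a duplicate.
            tableData[name.Trim()] = value.Trim();
        }

        // TryGetValue avoids a KeyNotFoundException when a table has no such row.
        if (tableData.TryGetValue("location", out var location))
            Console.WriteLine(location);
        else
            Console.WriteLine("No Location row found");
    }
}
```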

How to scrape values from a web page using Html Agility Pack [closed]

I need some values from a web page, so I am building a scraper using Html Agility Pack.
Here are the page's HTML and my C# code.
Html Website:
<div class="box-overflow">
<div class="box-overflow__in">
<table class="table-main js-tablebanner-t js-tablebanner-ntb">
<tr>
<th class="h-text-left" colspan="2">17. Round</th>
<th class="h-text-center">1</th>
<th class="h-text-center">X</th>
<th class="h-text-center">2</th>
<th> </th>
</tr>
<tr>
<td class="h-text-left"><a href=
"/soccer/poland/ekstraklasa/lechia-gdansk-leczna/Kjnscb6D/" class=
"in-match"><span>Lechia Gdansk</span> - <span>Leczna</span></a></td>
<td class="h-text-center"><a href=
"/soccer/poland/ekstraklasa/lechia-gdansk-leczna/Kjnscb6D/">3:0</a></td>
<td class="table-matches__odds colored"></td>
<td class="table-matches__odds" data-odd="4.04"></td>
<td class="table-matches__odds" data-odd="6.29"></td>
<td class="h-text-right h-text-no-wrap">28.11.2016</td>
</tr>
<tr>
<td class="h-text-left"><a href=
"/soccer/poland/ekstraklasa/plock-piast-gliwice/KrhILsqE/" class=
"in-match"><span>Plock</span> - <span>Piast Gliwice</span></a></td>
<td class="h-text-center"><a href=
"/soccer/poland/ekstraklasa/plock-piast-gliwice/KrhILsqE/">0:0</a></td>
<td class="table-matches__odds" data-odd="2.05"></td>
<td class="table-matches__odds colored"></td>
<td class="table-matches__odds" data-odd="3.50"></td>
<td class="h-text-right h-text-no-wrap">27.11.2016</td>
</tr>
<tr>
<td class="h-text-left"><a href=
"/soccer/poland/ekstraklasa/slask-wroclaw-legia/bZjMK1bK/" class=
"in-match"><span>Slask Wroclaw</span> - <span>Legia</span></a></td>
<td class="h-text-center"><a href=
"/soccer/poland/ekstraklasa/slask-wroclaw-legia/bZjMK1bK/">0:4</a></td>
<td class="table-matches__odds" data-odd="4.53"></td>
<td class="table-matches__odds" data-odd="3.64"></td>
<td class="table-matches__odds colored"></td>
<td class="h-text-right h-text-no-wrap">27.11.2016</td>
</tr>
</table>
</div>
</div>
My C#:
var url = "http://www.betexplorer.com/soccer/poland/ekstraklasa/results/";
var web = new HtmlWeb();
var doc = web.Load(url);
Bets = new List<Bet>();
// Read the rows
var Rows = doc.DocumentNode.SelectNodes("//table");
foreach (var row in Rows)
{
if (!row.GetAttributeValue("class", "").Contains("table-main js-tablebanner-t js-tablebanner-ntb"))
{
if (string.IsNullOrEmpty(row.InnerText))
continue;
var rowBet = new Bet();
foreach (var node in row.ChildNodes)
{
var data_odd = node.GetAttributeValue("data-odd", "");
if (string.IsNullOrEmpty(data_odd))
{
if (node.GetAttributeValue("class", "").Contains("in-match"))
{
rowBet.Match = node.InnerText.Trim();
var matchTeam = rowBet.Match.Split(new[] { " - " }, StringSplitOptions.RemoveEmptyEntries);
rowBet.Home = matchTeam[0];
rowBet.Host = matchTeam[1];
}
if (node.GetAttributeValue("class", "").Contains("h-text-center"))
{
rowBet.Result = node.InnerText.Trim();
var matchPoints = rowBet.Result.Split(new[] { ':' }, StringSplitOptions.RemoveEmptyEntries);
int help;
if (int.TryParse(matchPoints[0], out help))
{
rowBet.HomePoints = help;
}
if (matchPoints.Length == 2 && int.TryParse(matchPoints[1], out help))
{
rowBet.HostPoints = help;
}
}
if (node.GetAttributeValue("class", "").Contains("h-text-right h-text-no-wrap"))
rowBet.Date = node.InnerText.Trim();
}
else
{
rowBet.Odds.Add(data_odd);
}
}
if (!string.IsNullOrEmpty(rowBet.Match))
Bets.Add(rowBet);
}
}
Here is some more information. I need:
the team names (e.g. Lechia Gdansk - Leczna),
the result (e.g. 3:0),
the data-odd values (e.g. 1.49, 4.04, 6.29),
and the match date (e.g. 28.11.2016).
If you need more information, just ask. Thanks.
I would do it like this:
var list = doc.DocumentNode.SelectSingleNode("//table[@class='table-main js-tablebanner-t js-tablebanner-ntb']")
    .Descendants("tr")
    .Select(x => new
    {
        Val1 = x.SelectSingleNode("td[@class='h-text-left']")?.InnerText,
        Val2 = x.SelectSingleNode("td[@class='h-text-center']")?.InnerText
    })
    .Where(x => x.Val1 != null)
    .ToList();
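The same idea can be extended to the odds and the date. The following is a sketch against one row trimmed from the question's markup, not a tested scraper for the live site; the property names (Match, Result, Odds, Date) are made up for illustration. Note that in the posted HTML the "colored" cell carries no data-odd attribute, so only two odds come back for that row.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class Program
{
    public static void Main()
    {
        // Header row plus one data row, trimmed from the question's markup.
        var html = @"<table class='table-main js-tablebanner-t js-tablebanner-ntb'>
<tr><th colspan='2'>17. Round</th><th>1</th><th>X</th><th>2</th><th></th></tr>
<tr>
<td class='h-text-left'><a href='#' class='in-match'><span>Lechia Gdansk</span> - <span>Leczna</span></a></td>
<td class='h-text-center'><a href='#'>3:0</a></td>
<td class='table-matches__odds colored'></td>
<td class='table-matches__odds' data-odd='4.04'></td>
<td class='table-matches__odds' data-odd='6.29'></td>
<td class='h-text-right h-text-no-wrap'>28.11.2016</td>
</tr>
</table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var table = doc.DocumentNode.SelectSingleNode("//table[contains(@class, 'table-main')]");
        if (table == null) return; // SelectSingleNode returns null when nothing matches

        var matches = table.Descendants("tr")
            .Select(tr => new
            {
                Match = tr.SelectSingleNode("./td[@class='h-text-left']")?.InnerText.Trim(),
                Result = tr.SelectSingleNode("./td[@class='h-text-center']")?.InnerText.Trim(),
                // Keep only the cells that actually carry a data-odd attribute.
                Odds = tr.Elements("td")
                    .Select(td => td.GetAttributeValue("data-odd", ""))
                    .Where(odd => odd.Length > 0)
                    .ToList(),
                Date = tr.SelectSingleNode("./td[contains(@class, 'h-text-no-wrap')]")?.InnerText.Trim()
            })
            .Where(m => m.Match != null) // drops the header row, which has th cells only
            .ToList();

        foreach (var m in matches)
            Console.WriteLine($"{m.Match} | {m.Result} | {string.Join(", ", m.Odds)} | {m.Date}");
    }
}
```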

How to get all rows and columns in Selenium?

I have a table like this:
Name Places Sex Score
-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
Ken null Male 9.5
Smith London Male 7.5
Joe null null 8.0
I want to get all the values of a table on a web page using Selenium.
How do I get the values and show the data in columns and rows, as in the table above?
My code to do this:
List<IWebElement> result = new List<IWebElement>();
IList<IWebElement> tableRows = browser.FindElements(By.XPath("id('column2')/tbody/tr"));
foreach (IWebElement rows in tableRows)
{
    if (rows.FindElements(By.XPath("td")).Count == 10)
        result.Add(rows);
}
And I only get the text of the rows, like this:
Ken Male 9.5
Smith London Male 7.5
Joe 8.0
As you can see, I only get the row text, and I can't tell which column each value belongs to.
"Joe 8.0" does not line up with:
Name Places Sex Score.
The HTML Markup of my table:
<div class="tabbox_F" id="oTableContainer_L">
<table id="column2">
<thead>
<tr class="tabthdwn">
<th>Name</th>
<th>Places</th>
<th>Sex</th>
<th>Score</th>
</tr>
</thead>
<tbody>
<tr class="table Alpha">
<td>
<div class="name"><span>Ken</span></div>
<div class= "category"><span>Student</span></div>
</td>
<td><div class="address"></div></td>
<td><div class="sex"><h5>Male</h5></div></td>
<td>
<div class="score_math"><b>9.5</b></div>
<div class="score_bio"><b>7.5</b></div>
</td>
</tr>
<tr class="table Alpha">
<td>
<div class="name"><span>Joe</span></div>
<div class= "category"><span>Teacher</span></div>
</td>
<td><div class="address"></div></td>
<td><div class="sex"></div></td>
<td>
<div class="score_math"><b>8.0</b></div>
<div class="score_bio"><b>5.5</b></div>
</td>
</tr>
</tbody>
</table>
</div>
By looking at only the TDs, you aren't taking advantage of all the info you have in the HTML. Each TD contains a DIV whose class tells you which bit of info it holds, e.g. <div class="name"> contains the name. Use that to your advantage to separate the different bits of data.
I would do something like this. I added the Values class to store the data for the row temporarily. If you don't need to reuse the data other than to just dump the values, you can just remove that bit.
class Program
{
static void Main(string[] args)
{
IWebDriver browser = new FirefoxDriver();
List<IWebElement> result = new List<IWebElement>();
IList<IWebElement> tableRows = browser.FindElements(By.XPath("id('column2')/tbody/tr"));
By nameLocator = By.CssSelector("td > div.name");
By addressLocator = By.CssSelector("td > div.address");
By sexLocator = By.CssSelector("td > div.sex");
By scoretextLocator = By.CssSelector("td > div.score_math"); // the posted HTML has score_math and score_bio, not score_text
// String.Format Method https://msdn.microsoft.com/en-us/library/aa331875(v=vs.71).aspx
Console.WriteLine("{0,10}{1,10}{2,10}{3,10}", "Name", "Address", "Sex", "Score");
foreach (IWebElement rows in tableRows)
{
Values values = new Values();
values.name = rows.FindElement(nameLocator).Text.Trim();
values.address = rows.FindElement(addressLocator).Text.Trim();
values.sex = rows.FindElement(sexLocator).Text.Trim();
values.scoretext = rows.FindElement(scoretextLocator).Text.Trim();
Console.WriteLine("{0,10}{1,10}{2,10}{3,10}", values.name, values.address, values.sex, values.scoretext);
}
}
}
class Values
{
public string name;
public string address;
public string sex;
public string scoretext;
public Values()
{
this.name = "";
this.address = "";
this.sex = "";
this.scoretext = "";
}
}
Why not this way:
List<IWebElement> result = new List<IWebElement>();
IList<IWebElement> tableRows = browser.FindElements(By.XPath("id('column2')/tbody/tr"));
foreach (IWebElement rows in tableRows)
{
IList<IWebElement> allColumns = rows.FindElements(By.TagName("td"));
// Now allColumns[0], allColumns[1], etc. give you each value, including the empty ones
}
I think the only issue is how you're printing out your rows. Notice that some of the columns have no values. If you are not handling that in your output, then it will come out the way you've shown us above. If you use a debugger and look at the row element, you will likely find that there are still 4 td children in each row.
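To illustrate the output side, here is a small sketch that pads every cell to a fixed width and substitutes a placeholder for empty cells, so the columns stay aligned. FormatRow, the width of 10, and the "null" placeholder are assumptions made for this example; the string arrays stand in for the cell texts you would read from allColumns.

```csharp
using System;
using System.Linq;

public static class Program
{
    // Pads each cell to a fixed width and substitutes a placeholder for empty ones.
    public static string FormatRow(string[] cells, int width = 10, string placeholder = "null")
    {
        return string.Concat(cells.Select(c =>
            (string.IsNullOrWhiteSpace(c) ? placeholder : c.Trim()).PadRight(width)));
    }

    public static void Main()
    {
        Console.WriteLine(FormatRow(new[] { "Name", "Places", "Sex", "Score" }));

        // Hypothetical cell texts, as they might come back from allColumns[i].Text.
        Console.WriteLine(FormatRow(new[] { "Ken", "", "Male", "9.5" }));
        Console.WriteLine(FormatRow(new[] { "Joe", "", "", "8.0" }));
    }
}
```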

Parse table with HTML Agility Pack

In the following HTML, I can parse the table element, but I don't know how to skip the th elements.
I want to get only the td elements, but when I try to use:
foreach (HtmlNode cell in row.SelectNodes("td"))
...I get an exception.
<table class="tab03">
<tbody>
<tr>
<th class="right" rowspan="2">first</th>
</tr>
<tr>
<th class="right">lp</th>
<th class="right">name</th>
</tr>
<tr>
<td class="right">1</td>
<td class="left">house</td>
</tr>
<tr>
<th class="right" rowspan="2">Second</th>
</tr>
<tr>
<td class="right">2</td>
<td class="left">door</td>
</tr>
</tbody>
</table>
My code:
var document = doc.DocumentNode.SelectNodes("//table");
string store = "";
if (document != null)
{
foreach (HtmlNode table in document)
{
if (table != null)
{
foreach (HtmlNode row in table.SelectNodes("tr"))
{
store = "";
foreach (HtmlNode cell in row.SelectNodes("th|td"))
{
store = store + cell.InnerText+"|";
}
sw.Write(store );
sw.WriteLine();
}
}
}
}
sw.Flush();
sw.Close();
This method uses LINQ to query for HtmlNode instances that have the name td.
I also noticed that your output appears as val|val| (with a trailing pipe). This sample uses string.Join(pipe, array) as a less hideous way of removing that trailing pipe: val|val.
using System.Linq;
// ...
var tablecollection = doc.DocumentNode.SelectNodes("//table");
string store = string.Empty;
if (tablecollection != null)
{
foreach (HtmlNode table in tablecollection)
{
// For all rows with at least one child with the 'td' tag.
foreach (HtmlNode row in table.DescendantNodes()
.Where(desc =>
desc.Name.Equals("tr", StringComparison.OrdinalIgnoreCase) &&
desc.DescendantNodes().Any(child => child.Name.Equals("td",
StringComparison.OrdinalIgnoreCase))))
{
// Combine the child 'td' elements into an array, join with the pipe
// to create the output in 'val|val|val' format.
store = string.Join("|", row.DescendantNodes().Where(desc =>
desc.Name.Equals("td", StringComparison.OrdinalIgnoreCase))
.Select(desc => desc.InnerText));
// You can probably get rid of the 'store' variable as it's
// no longer necessary to store the value of the table's
// cells over the iteration.
sw.Write(store);
sw.WriteLine();
}
}
}
sw.Flush();
sw.Close();
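The exception comes from the rows that contain only th cells: SelectNodes returns null when nothing matches, and foreach over null throws. Note that "//td" would not help here, because a leading // searches the whole document rather than the current row. Use a path relative to the row and check for null:
var cells = row.SelectNodes("./td");
if (cells == null) continue; // header row, no td cells
This gives you the collection of td elements for that row, which can be iterated with foreach.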
