In the following HTML, I can parse the table element, but I don't know how to skip the th elements.
I want to get only the td elements, but when I try to use:
foreach (HtmlNode cell in row.SelectNodes("td"))
...I get an exception.
<table class="tab03">
<tbody>
<tr>
<th class="right" rowspan="2">first</th>
</tr>
<tr>
<th class="right">lp</th>
<th class="right">name</th>
</tr>
<tr>
<td class="right">1</td>
<td class="left">house</td>
</tr>
<tr>
<th class="right" rowspan="2">Second</th>
</tr>
<tr>
<td class="right">2</td>
<td class="left">door</td>
</tr>
</tbody>
</table>
My code:
var document = doc.DocumentNode.SelectNodes("//table");
string store = "";
if (document != null)
{
foreach (HtmlNode table in document)
{
if (table != null)
{
foreach (HtmlNode row in table.SelectNodes("tr"))
{
store = "";
foreach (HtmlNode cell in row.SelectNodes("th|td"))
{
store = store + cell.InnerText+"|";
}
sw.Write(store );
sw.WriteLine();
}
}
}
}
sw.Flush();
sw.Close();
This method uses LINQ to query for HtmlNode instances whose name is td.
I also noticed that your output appears as val|val| (with a trailing pipe). This sample uses string.Join("|", values) as a cleaner way to avoid that trailing pipe, producing val|val.
using System.Linq;
// ...
var tablecollection = doc.DocumentNode.SelectNodes("//table");
string store = string.Empty;
if (tablecollection != null)
{
foreach (HtmlNode table in tablecollection)
{
// For all rows with at least one child with the 'td' tag.
foreach (HtmlNode row in table.DescendantNodes()
.Where(desc =>
desc.Name.Equals("tr", StringComparison.OrdinalIgnoreCase) &&
desc.DescendantNodes().Any(child => child.Name.Equals("td",
StringComparison.OrdinalIgnoreCase))))
{
// Combine the child 'td' elements into an array, join with the pipe
// to create the output in 'val|val|val' format.
store = string.Join("|", row.DescendantNodes().Where(desc =>
desc.Name.Equals("td", StringComparison.OrdinalIgnoreCase))
.Select(desc => desc.InnerText));
// You can probably get rid of the 'store' variable as it's
// no longer necessary to store the value of the table's
// cells over the iteration.
sw.Write(store);
sw.WriteLine();
}
}
}
sw.Flush();
sw.Close();
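The XPath "td" itself is valid; the exception comes from rows that contain only th cells. In HtmlAgilityPack, SelectNodes returns null when nothing matches, so the foreach throws on those rows. Check for null before iterating:
var cells = row.SelectNodes("td");
if (cells == null) continue;
foreach (HtmlNode cell in cells)
Avoid "//td" here: a leading // searches the whole document rather than the current row, so every row would yield the same collection of td elements.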
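To see how a leading // differs from a path relative to the row, here is a minimal sketch. It uses System.Xml.Linq and System.Xml.XPath from the base class library as a stand-in for HtmlAgilityPack, so it only works on well-formed markup, and the trimmed-down table is an assumption based on the question's HTML. One difference to keep in mind: XPathSelectElements returns an empty sequence when nothing matches, whereas HtmlAgilityPack's SelectNodes returns null.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Simplified, well-formed version of the question's table (assumption).
var doc = XDocument.Parse(@"<table class='tab03'><tbody>
<tr><th>lp</th><th>name</th></tr>
<tr><td>1</td><td>house</td></tr>
<tr><td>2</td><td>door</td></tr>
</tbody></table>");

var rows = doc.XPathSelectElements("//tr").ToList();
var headerRow = rows[0];
var firstDataRow = rows[1];

// "//td" starts at the document root, even when called on a single row.
int absolute = firstDataRow.XPathSelectElements("//td").Count();
// "td" only looks at the direct children of the row it is called on.
int relative = firstDataRow.XPathSelectElements("td").Count();
// A header-only row simply has no td children.
int inHeader = headerRow.XPathSelectElements("td").Count();

Console.WriteLine($"//td: {absolute}, td: {relative}, header td: {inHeader}");
```

The same absolute-versus-relative rule applies to XPath in HtmlAgilityPack; only the "no match" behaviour (null instead of an empty sequence) differs.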
I am developing an add-in that reads web browser data and stores it in a dictionary.
During this process, I need to access data by ID, but the IDs are not unique on the page. The page looks like this:
<div id="ID1">
<tbody>
<tr>
<td id="1000" data-field="1">
text
</td>
</tr>
</tbody>
<div id="ID2">
<tbody>
<tr>
<td id="1000" data-field="2">
Some other text
</td>
</tr>
</tbody>
Both div elements are on the same page.
When I get an element by ID, it only gives me the first element, not the second one.
Here is my code:
HtmlElement myElements = webBrowser1.Document.GetElementById("ID2");
HtmlElement myElements2 = myElements.Document.GetElementById("1000");
if (myElements2.InnerText != null)
{
//Do something
}
How can I get the inner text of the second element by ID?
This is the best and easiest answer I came up with.
I figured out that data-field is a unique value on the page, so I looped through the td elements and compared each one's data-field attribute:
HtmlElement Buildingcontacts = webBrowser1.Document.GetElementById("ID2");
HtmlElementCollection ifiels = Buildingcontacts.Document.GetElementsByTagName("td");
foreach (HtmlElement element in ifiels)
{
string datafieldx = element.GetAttribute("data-field");
if (datafieldx == "2")
{
if (element.InnerText != null)
{
//do something
}
}
}
I have multiple tables, and the Location value appears at a different index in each one.
How can I get the location value when the previous cell's text is "Location" as I loop through a table? In the example below it is cells[7], but in another table it will be cells[9]. Basically, I want to find the cell whose inner text is "Location" and get the inner text of the next cell.
Html Table:
<table class="tbfix FieldsTable">
<tbody>
<tr>
<td class="name">Last Movement</td>
<td class="value">Port Exit</td>
</tr>
<tr>
<td class="name">Date</td>
<td class="value">26/06/2017 00:00:00</td>
</tr>
<tr>
<td class="name">From</td>
<td class="value">HAMBURGE</td>
</tr>
<tr>
<td class="name">Location</td>
<td class="value">EUROGATE HAMBURG</td>
</tr>
<tr>
<td class="name">E/F</td>
<td class="value">E</td>
</tr>
</tbody>
Controller Loop Through:
foreach (var eachNode in driver.FindElements(By.XPath("//table[contains(descendant::*, 'Last Movement')]")))
{
var cells = eachNode.FindElements(By.XPath(".//td"));
cd = new Detail();
for (int i = 0; i < cells.Count(); i++)
{
cd.ActionType = cells[1].Text.Trim();
string s = cells[3].Text.Trim();
DateTime dt = Convert.ToDateTime(s);
if (_minDate > dt) _minDate = dt;
cd.ActionDate = dt;
}
}
In your foreach loop you could use this:
var location = eachNode.FindElement(By.XPath(".//td[contains(text(),'Location')]/following-sibling::td"));
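The following-sibling idea can be tried outside Selenium. This is a minimal sketch that uses System.Xml.XPath on a shortened, well-formed copy of the question's table (an assumption). The normalize-space() test and the [1] index are my additions: the first tolerates surrounding whitespace, the second makes "the very next cell" explicit. Selenium's By.XPath should accept the same expression, since it is plain XPath 1.0.

```csharp
using System;
using System.Xml.Linq;
using System.Xml.XPath;

// Shortened stand-in for the question's table (assumption).
var table = XDocument.Parse(@"<table><tbody>
<tr><td class='name'>From</td><td class='value'>HAMBURGE</td></tr>
<tr><td class='name'>Location</td><td class='value'>EUROGATE HAMBURG</td></tr>
<tr><td class='name'>E/F</td><td class='value'>E</td></tr>
</tbody></table>");

// Find the cell whose text is exactly "Location", then take the next td sibling.
var cell = table.XPathSelectElement(
    "//td[normalize-space(.)='Location']/following-sibling::td[1]");

// XPathSelectElement returns null when nothing matches, so guard against it.
string location = cell?.Value.Trim() ?? "(not found)";
Console.WriteLine(location);
```

Because the lookup is driven by the label rather than a position, it does not matter whether the Location row is the fourth or the fifth row of a given table.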
Assuming your data is always structured like that, I would loop over all the tr tags and add the data to a dictionary.
Try something like this:
Dictionary<string,string> tableData = new Dictionary<string, string>();
var trNodes = eachNode.FindElements(By.TagName("tr"));
foreach (var trNode in trNodes)
{
var name = trNode.FindElement(By.CssSelector(".name")).Text.Trim();
var value = trNode.FindElement(By.CssSelector(".value")).Text.Trim();
tableData.Add(name,value);
}
// Dictionary keys are case-sensitive by default, so match the cell text exactly.
var location = tableData["Location"];
You would have to add validation and checks for the dictionary and the structure but that is the general idea.
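A minimal sketch of what those checks might look like, assuming the name/value text has already been scraped. The pairs array (including its duplicate row) is a hypothetical stand-in for that scraped text, and Dictionary.TryAdd assumes a reasonably recent .NET version.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the (name, value) text scraped from each row.
var pairs = new[]
{
    ("Last Movement", "Port Exit"),
    ("Location", "EUROGATE HAMBURG"),
    ("Location", "DUPLICATE ROW"),
    ("E/F", "E"),
};

// Case-insensitive keys, so "location" and "Location" resolve to the same entry.
var tableData = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
foreach (var (name, value) in pairs)
{
    // TryAdd keeps the first value and skips duplicates instead of throwing.
    tableData.TryAdd(name.Trim(), value.Trim());
}

// TryGetValue avoids a KeyNotFoundException when a table has no such row.
string location = tableData.TryGetValue("location", out var found) ? found : "(missing)";
Console.WriteLine(location);
```

Whether a duplicate label should be skipped, overwritten, or treated as an error depends on the pages being scraped; skipping is just one option.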
I am not sure the title suits my problem.
I have html like below
<table id="searchResultsTable" class="">
<tbody>
<tr class="searchResultsItem even ">
<td class="searchResultsPriceValue">
<div> 26.500 TL</div></td>
<td class="searchResultsTitleValue ">
<a class="classifiedTitle" href="xxxx"> some text</a>
</tr>
<tr class="searchResultsItem odd ">
.
//same as "searchResultsItem even "
.
</tr>
</tbody>
</table>
I am new to HtmlAgilityPack. I have succeeded in getting the price value of both "searchResultsItem even" and "searchResultsItem odd".
I want to get the href value if the price is below or above some value. I can get an href, but it is always the one from "searchResultsItem even". I want the href of the even row if the even row's price matches my condition, and the href of the odd row if the odd row's price matches.
Below is my code:
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//table[@id='searchResultsTable']"))
{
foreach (HtmlNode node2 in node.SelectNodes("//td[@class='searchResultsPriceValue']"))
{
string price = node2.InnerText.ToString();
price = price.Trim().Replace(".", String.Empty);
price = price.Replace("TL", String.Empty);
if (Convert.ToInt32(price) < 28000)
{
HtmlNode node3 = node.SelectSingleNode(".//a[@class='classifiedTitle']");
listBox1.Items.Add(node3.Attributes["href"].Value);
}
}
}
Thanks
Get the tr class name as an attribute value. Loop through the rows first, then the td elements inside each row, using paths relative to the current node.
foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table[@id='searchResultsTable']"))
{
foreach (HtmlNode tr in table.SelectNodes(".//tr"))
{
var @class = tr.GetAttributeValue("class", string.Empty);
switch (@class) {
// rest of your parsing
}
}
}
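As a rough illustration of the row-first approach, the sketch below reads the price and the link from the same row, so a match in the odd row yields the odd row's href. It uses System.Xml.XPath on well-formed stand-in markup rather than HtmlAgilityPack, and the second row's price and both href values are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;
using System.Xml.XPath;

// Well-formed stand-in for the question's table; hrefs and the odd row's price are invented.
var doc = XDocument.Parse(@"<table id='searchResultsTable'><tbody>
<tr class='searchResultsItem even'>
  <td class='searchResultsPriceValue'><div> 26.500 TL</div></td>
  <td class='searchResultsTitleValue'><a class='classifiedTitle' href='/even-item'>some text</a></td>
</tr>
<tr class='searchResultsItem odd'>
  <td class='searchResultsPriceValue'><div> 31.000 TL</div></td>
  <td class='searchResultsTitleValue'><a class='classifiedTitle' href='/odd-item'>other text</a></td>
</tr>
</tbody></table>");

var hrefs = new List<string>();
foreach (var tr in doc.XPathSelectElements("//table[@id='searchResultsTable']//tr"))
{
    // Both lookups are relative to the current row, not to the table or document.
    var priceCell = tr.XPathSelectElement("td[@class='searchResultsPriceValue']");
    var link = tr.XPathSelectElement(".//a[@class='classifiedTitle']");
    if (priceCell == null || link == null) continue;

    // " 26.500 TL" -> "26500": drop the thousands separator and the currency.
    string digits = priceCell.Value.Replace(".", "").Replace("TL", "").Trim();
    if (int.TryParse(digits, out int price) && price < 28000)
    {
        hrefs.Add(link.Attribute("href")?.Value ?? "");
    }
}

Console.WriteLine(string.Join(", ", hrefs));
```

With HtmlAgilityPack the structure would be the same: call SelectSingleNode on the tr node with a relative path instead of on the table.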
How can I parse HTML on a web page using LINQ to get the InnerHtml values from a table?
I am using the HtmlAgilityPack and would like to parse some values as cleanly as possible.
The numbers you see (00000, 00001, 00002, ...) are unique agent numbers.
So maybe there is a way to use LINQ to look up those numbers and get the following values from the td elements
(name, 123, state, and info), e.g. 00000, John, 123, IDLE, coffee for each agent,
so I can access them separately and work with them, maybe in an array?
</TH>
</TR>
<TR ALIGN=RIGHT>
<TD ALIGN=LEFT>00000</TD>
<TD ALIGN=LEFT>John</TD>
<TD ALIGN=CENTER>123</TD>
<TD ALIGN=LEFT>IDLE</TD>
<TD ALIGN=LEFT>coffee</TD>
</TR>
<TR ALIGN=RIGHT>
<TD ALIGN=LEFT>00001</TD>
<TD ALIGN=LEFT>Lisa</TD>
<TD ALIGN=CENTER>123</TD>
<TD ALIGN=LEFT>IDLE</TD>
<TD ALIGN=LEFT>coffee</TD>
</TR>
<TR ALIGN=RIGHT>
<TD ALIGN=LEFT>00002</TD>
<TD ALIGN=LEFT>Mary</TD>
<TD ALIGN=CENTER>123</TD>
<TD ALIGN=LEFT>IDLE</TD>
<TD ALIGN=LEFT>coffee</TD>
</TR>
<TR ALIGN=RIGHT>
<TD ALIGN=LEFT>00003</TD>
<TD ALIGN=LEFT>Tim</TD>
<TD ALIGN=CENTER>123</TD>
<TD ALIGN=LEFT>IDLE</TD>
<TD ALIGN=LEFT>coffee</TD>
</TR>
....
Thanks in advance!
This seems a lot like a "please give me the code I need" question, which I seriously dislike. Have a look at the following and make sure you understand it:
var doc = ... // Load the document
var trs = doc.DocumentNode.Descendants("TR"); // Give you all the TRs
foreach (var tr in trs)
{
var tds = tr.Descendants("TD").ToArray(); // Get all the TDs
// Turn them into our datastructure
var data = new {
Name = tds[1].InnerText,
Number = tds[2].InnerText,
State = tds[3].InnerText,
Info = tds[4].InnerText,
};
// Do something with data
}
Doing it with LINQ only:
var data = from tr in doc.DocumentNode.Descendants("TR")
let tds = tr.Descendants("TD").ToArray()
select new {
Name = tds[1].InnerText,
Number = tds[2].InnerText,
State = tds[3].InnerText,
Info = tds[4].InnerText,
};
@flindeberg gives a perfectly reasonable answer (+1 to them). You could avoid the ToArray like this:
private class Row
{
public string Name { get; set; }
public int Number { get; set; }
public string State { get; set; }
public string Info { get; set; }
}
...
var mappings = new Action<string, Row>[]
{
(value, row) => row.Name = value,
(value, row) => row.Number = int.Parse(value),
(value, row) => row.State = value,
(value, row) => row.Info = value
};
var doc = ... // Load the document
var trs = doc.DocumentNode.Descendants("TR"); // Give you all the TRs
foreach (var tr in trs)
{
var row = new Row();
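// Skip the first TD (the agent number) so the cells line up with the mappings.
// Zip is lazy, so ToList() is needed to actually run the mappings.
tr.Descendants("TD").Skip(1).Zip(mappings, (td, map) =>
{
map(td.InnerText, row);
return true;
}).ToList();
// You now have a populated row.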
}
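One thing to watch with the Zip approach: LINQ's Zip uses deferred execution, so the mapping delegates only run once the result is enumerated. The sketch below shows this with plain strings; the cells array and the seen list are hypothetical stand-ins for the TD texts and the Row object.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var cells = new[] { "John", "123", "IDLE" };   // hypothetical cell texts
var seen = new List<string>();                 // records which mappings ran
var mappings = new Action<string>[]
{
    v => seen.Add("Name=" + v),
    v => seen.Add("Number=" + v),
    v => seen.Add("State=" + v),
};

// Building the query runs nothing yet.
var query = cells.Zip(mappings, (cell, map) => { map(cell); return true; });
int before = seen.Count;

// Enumerating the query (here with ToList) is what invokes each mapping.
query.ToList();
int after = seen.Count;

Console.WriteLine($"{before} -> {after}");
```

A plain for loop over the cells and mappings would avoid the side-effecting lambda altogether, at the cost of a little more code.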
I am trying to parse through TR's using HtmlAgilityPack and do something different with the 1st, 2nd, 3rd etc TD's.
I am almost there but my code (below) causes an infinite loop. It just repeats the first row over and over again:
foreach (HtmlNode row in htmlDoc.DocumentNode.SelectNodes("//table//tr"))
{
var node = row.SelectSingleNode("//td[1]");
if (node != null)
{
Console.WriteLine("Node: {0}", node.InnerText);
}
}
The raw HTML returned is correct. The table is also pretty standard:
<table>
<tr>
<th>Header 1</th>
<th>Header 2</th>
<th>Header 3</th>
<th>Header 4</th>
<th>Header 5</th>
</tr>
<tr>
<td>Cell 1</td>
<td>Cell 2</td>
<td>Cell 3</td>
<td>Cell 4</td>
<td>Cell 5</td>
...
</tr>
</table>
The following code works but then it is not grouped by row so it is much harder to manipulate:
foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes("//table//tr//td"))
{
Console.WriteLine("Node: {0}", node.InnerText);
}
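Your loop is not infinite; it prints the same cell for every row because "//td[1]" starts at the document root instead of the current row. A relative path such as "td[1]" or ".//td[1]" is evaluated against the row. Alternatively, this works fine with your sample HTML and keeps the cells grouped by row: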
var res = doc.DocumentNode.SelectNodes("//table//tr[td]")
.Select(row => row.Descendants("td")
.Select(td => td.InnerText).ToList())
.ToList();