C# and Html Agility Pack

I have multiple files from which I have to extract tables containing data. The problem is that the tables don't have IDs, so I have to search based on the content (which is constant in each file). There are multiple tables in each file, and the table of interest doesn't have a constant XPath.
<table border="0" cellspacing="0" cellpadding="0" style="BORDER-COLLAPSE: collapse" bordercolor="#111111">
<tbody>
<tr>
<td class="s">CONSTANT_TEXT</td>
<td class="l">CHANGING_VALUE</td>
</tr>
<tr>
<td class="s"> </td>
<td class="l"><a style="" id="CONSTANT_ID" href="mailto: XXXX</a>
</td>
</tr>
</tbody>
</table>
How do I:
1. Search based on CONSTANT_TEXT and return the value of the 2nd TD (CHANGING_VALUE) without knowing the path (the table doesn't have an ID and its position changes from file to file)?
2. Search based on CONSTANT_TEXT and return the parent table of that TD?
What I did was search for a constant element with Html Agility Pack, then walk its XPath upwards until the table is reached.
var output = document.DocumentNode.SelectNodes("//a[@id='CONSTANT_ID']");
// output[0].XPath == "/html[1]/body[1]/table[1]/thead[1]/tr[1]/td[1]/table[1]/tbody[1]/tr[2]/td[2]/a[1]"
My plan was to iterate over each result, take its XPath up to the innermost table (the last table[1] in the path), and then extract the data.
Thanks,
Mike

Strictly speaking, you'll need the following XPath expressions:
Search based on CONSTANT_TEXT, return the value of the 2nd TD (CHANGING_VALUE):
//td[.="CONSTANT_TEXT"]/following-sibling::td[1]/text()
Output : CHANGING_VALUE
Search based on CONSTANT_TEXT, return the parent table of that TD:
//td[.="CONSTANT_TEXT"]/ancestor::table[1]
Output : <table> element
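For reference, here is a minimal sketch of how those two expressions might be used from C# with Html Agility Pack. It assumes the HtmlAgilityPack NuGet package is referenced; the class name TableFinder and its method names are made up for illustration. It selects the sibling td element and reads its InnerText instead of selecting the text() node directly, which gives the same value for a cell that contains only text.

```csharp
using HtmlAgilityPack;

// Hypothetical helper; the class and method names are illustrative only.
public static class TableFinder
{
    // Returns the text of the first <td> that follows the <td> whose
    // string value equals constantText, or null if there is no such cell.
    // Assumes constantText contains no double quotes (it is spliced into the XPath).
    public static string GetValueNextTo(string html, string constantText)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode cell = doc.DocumentNode.SelectSingleNode(
            "//td[.=\"" + constantText + "\"]/following-sibling::td[1]");
        return cell?.InnerText.Trim();
    }

    // Returns the nearest enclosing <table> of the matching <td>, or null.
    public static HtmlNode GetParentTable(string html, string constantText)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.SelectSingleNode(
            "//td[.=\"" + constantText + "\"]/ancestor::table[1]");
    }
}
```

Because ancestor::table[1] picks the closest enclosing table, this should also work when the table of interest is nested inside a layout table, as in the path shown in the question.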

Related

How to count the href

How can I count the href attributes of my HTML?
<table>
<tbody>
<tr>
<td align="right" colspan="8">
2
3
4
</td>
</tr>
</tbody>
</table>
Get the matching elements with XPath and take the size of the result:
driver.findElements(By.xpath("//a[@href]")).size()
Whilst I generally avoid XPath, this seems like the time to use it.
If you are simply trying to get the number of links on a page without having to filter on specific links, you can do this in C# by:
int linkCount = _driver.FindElements(By.XPath("//a")).Count;
You can then assert on the returned number (without an assertion, the test will always pass). If you want to filter on specific links, I would use something other than XPath.

HTML table/column width is not working when used as email body in C#

I have an html file with one table and it has around 20 columns. I want to set the table width bigger so that columns with content can be wider.
This is my table structure
<table border="1" cellpadding="2" cellspacing="0" id="tblBody" style="table-layout: fixed;width:5000px">
<tr>
<td style="overflow: hidden;width:500px;">
column width 500px
</td>
<td>
</td>
.
.
.
.
.
.
.
.
.
.
<td>
</td>
</tr>
</table>
When I view it in the browser, the structure is displayed correctly, but when I use this HTML as the email body in a C# application, the widths are not applied.
Can anybody help me on this please?
When I apply margin for each cell and set the table width to auto it works.

XPath drops contents of td column on an HTML page for screen scraping

Below you find an excerpt of code used to screen scrape an economic calendar.
The HTML page that it parses using XPath includes this row as the first row
in a table. (I only pasted this row instead of the entire HTML page.)
<tr class="calendar_row newday singleevent" data-eventid="42064"> <td class="date"><div class="date">Sun<div>Dec 23</div></div></td> <td class="time">All Day</td> <td class="currency">JPY</td> <td class="impact"> <div title="Non-Economic" class="holiday"></div> </td> <td class="event"><div>Bank Holiday</div></td> <td class="detail"><a class="calendar_detail level1" data-level="1"></a></td> <td class="actual"> </td> <td class="forecast"></td> <td class="previous"></td> <td class="graph"></td> </tr>
This code selects the first tr row using XPath:
var doc = new HtmlDocument();
doc.Load(new StringReader(html));
var rows = doc.DocumentNode.SelectNodes("//tr[@class=\"calendar_row\"]");
var rowHtml = rows[0].InnerHtml;
The problem is that rowHtml returns this:
<td class="date"></td> <td class="time">All Day</td> <td class="currency">EUR</td> <td class="impact"> <div title="Non-Economic" class="holiday"></div> </td> <td class="event"> <div>French Bank Holiday</div> </td> <td class="detail"><a class="calendar_detail level2" data-level="2"></a></td> <td class="actual"> </td> <td class="forecast"></td> <td class="previous"></td> <td class="graph"></td>
Now you can see that the contents of the td column for the date vanished! Why?
I've tried many things and am stumped as to why it drops the contents of that column.
The other columns have content that it keeps. So what's wrong with the date column?
Is there some kind of setting or property somewhere to cause or prevent dropping contents?
Even if you don't know what's wrong, suggestions for how to investigate it further would help.
As @AlexeiLevenkov mentioned, you must be selecting a different row than the one you want. You've pruned away too much of the essential problem in an effort to simplify, but it's still clear what's wrong...
Consider that your input document might basically look like this:
<?xml version="1.0" encoding="UTF-8"?>
<table>
<tr class="calendar_row" data-eventid="12345">
<td>This IS NOT the tr you're looking for</td>
</tr>
<tr class="calendar_row newday singleevent" data-eventid="42064">
<td>This IS the tr you're looking for</td>
</tr>
</table>
The test @class="calendar_row" won't match the tr you show, because its class attribute is "calendar_row newday singleevent", but it will match the first row.
You could change your test to contains(@class,'calendar_row') instead, but that would match both rows. You're going to have to identify some content or attribute that's unique to the row you want. Perhaps the @data-eventid attribute would work; I can't tell without seeing your whole input file.
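To make the difference between these tests concrete, here is a small sketch that runs them against the two-row sample above. It uses the XPath support in System.Xml.Linq rather than Html Agility Pack so that it needs no extra package; the class name RowMatcher and its Count helper are invented for this illustration, and the XML declaration is left out of the sample string for brevity.

```csharp
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical helper for comparing XPath class tests on the sample rows.
public static class RowMatcher
{
    public const string SampleXml = @"<table>
  <tr class=""calendar_row"" data-eventid=""12345"">
    <td>This IS NOT the tr you're looking for</td>
  </tr>
  <tr class=""calendar_row newday singleevent"" data-eventid=""42064"">
    <td>This IS the tr you're looking for</td>
  </tr>
</table>";

    // Number of elements selected by the XPath expression.
    public static int Count(string xml, string xpath)
    {
        return XDocument.Parse(xml).XPathSelectElements(xpath).Count();
    }
}
```

With this sample, //tr[@class='calendar_row'] selects only the first row, //tr[contains(@class,'calendar_row')] selects both, and //tr[@data-eventid='42064'] selects only the second. A whole-word class test such as //tr[contains(concat(' ', normalize-space(@class), ' '), ' newday ')] also selects only the second row, which may be useful if the event id is not known in advance.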

Given a web response, how do I extract a specific portion for further processing?

I have some code that gets a web response. How do I take that response and search for a table using its CSS class (class="data")? Once I have the table, I need to extract certain field values. For example, in the sample markup below, I need the values of Field #3 and Field #5, so "85" and "1", respectively.
<table width="570" border="0" cellpadding="1" cellspacing="2" class="data">
<tr>
<td width="158"><strong>Field #1:</strong></td>
<td width="99">1</td>
<td width="119"><strong>Field #2:</strong></td>
<td width="176">110</td>
</tr>
<tr>
<td width="158"><strong>Field #3:</strong></td>
<td width="99">85</td>
<td width="119"><strong>Field #4:</strong></td>
<td width="176">-259.34</td>
</tr>
<tr>
<td width="158"><strong>Field #5:</strong></td>
<td width="99">1</td>
<td width="119"><strong>Field #6:</strong></td>
<td width="176">110</td>
</tr>
<tr>
<td width="158"><strong>Field #7:</strong></td>
<td width="99">12</td>
<td width="119"><strong>Field #8:</strong></td>
<td width="176">123.23</td>
</tr>
</table>
Use the HTML Agility Pack and parse the HTML. If you want to do it the simplest way then go grab its beta (it supports LINQ).
As Randolf suggests, using HTML Agility Pack is a good option.
But, if you have control of the format of the HTML, it is also possible to do string parsing to extract the values you are after.
It is nearly trivial to download the entire HTML as a string and search for the string "<table" followed by the string "class=\"data\"". Then you can easily extract the values you are after by doing similar string manipulations.
I'm not saying you should do this, for the resulting code will be harder to read and maintain than the code using HTML Agility Pack, but it will save you an external dependency and your code will probably perform much better.
In a WP7 app I made, I started using HTML Agility Pack to parse some HTML and extract some values. This worked well, but it was quite slow. Switching to the string parsing regime made my code many times faster while returning the exact same result.
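As a rough sketch of what that string-parsing approach could look like for the sample markup in the question: the code below locates the table carrying class="data", finds the label, and returns the contents of the next td. The class name FieldScraper and the method GetFieldValue are made up for this example, and the code assumes the markup has exactly the shape shown (label cell immediately followed by the value cell, attribute written as class="data" with double quotes). Any change in the page layout would break it, which is the maintenance cost mentioned above.

```csharp
using System;

// Hypothetical helper; relies on the exact markup shape from the question.
public static class FieldScraper
{
    // Returns the text of the <td> that follows the cell containing label,
    // searching only inside the table that has class="data"; null if not found.
    public static string GetFieldValue(string html, string label)
    {
        int classPos = html.IndexOf("class=\"data\"", StringComparison.Ordinal);
        if (classPos < 0) return null;

        // Bounds of the table that owns the class attribute.
        int tableStart = html.LastIndexOf("<table", classPos, StringComparison.Ordinal);
        int tableEnd = html.IndexOf("</table>", classPos, StringComparison.Ordinal);
        if (tableStart < 0 || tableEnd < 0) return null;

        int labelPos = html.IndexOf(label, tableStart, tableEnd - tableStart,
                                    StringComparison.Ordinal);
        if (labelPos < 0) return null;

        // The value lives in the next <td> after the label.
        int cellOpen = html.IndexOf("<td", labelPos, StringComparison.Ordinal);
        if (cellOpen < 0 || cellOpen > tableEnd) return null;

        int tagClose = html.IndexOf('>', cellOpen);
        if (tagClose < 0) return null;
        int valueStart = tagClose + 1;

        int valueEnd = html.IndexOf("</td>", valueStart, StringComparison.Ordinal);
        if (valueEnd < 0) return null;

        return html.Substring(valueStart, valueEnd - valueStart).Trim();
    }
}
```

Passing the label with its trailing colon, for example GetFieldValue(html, "Field #3:"), avoids accidentally matching a longer label that starts with the same characters.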

C# Xpath Tables

I have XML that contains a table with 2 rows.
<table>
<thead>
<tr>
<td class="num">test</td>
<td class="num">test2</td>
</tr>
</thead>
</table>
I am using XPath to grab the data from the row.
How do I retrieve only the first row's data from the table, and not all the data?
The xpath code i am using now is:
/table/thead/tr/th[@class='num']
And my current output is:
test
test2
What do I have to add in the xpath code so I can select the first row only?
Your result is the expected output: the XPath expression asks for all nodes that match, so the two you get are correct.
If you want only the first one, you can do this:
/table/thead/tr/th[@class='num'][1]
Otherwise, post your expected output.
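One detail worth illustrating is that a trailing [1] applies per parent element, so with several rows it returns the first matching cell of each row. The sketch below shows this with System.Xml.Linq. It is based on assumptions: the class name FirstRow is invented, the sample adds a hypothetical second row (test3, test4) that is not in the question, and it uses td because that is what the posted markup contains (swap in th if the real document uses header cells).

```csharp
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical helper showing how [1] interacts with multiple rows.
public static class FirstRow
{
    // The second <tr> is an invented addition to the question's sample.
    public const string SampleXml =
        "<table><thead>" +
        "<tr><td class=\"num\">test</td><td class=\"num\">test2</td></tr>" +
        "<tr><td class=\"num\">test3</td><td class=\"num\">test4</td></tr>" +
        "</thead></table>";

    // Text values of the elements selected by xpath, in document order.
    public static string[] Select(string xml, string xpath)
    {
        return XDocument.Parse(xml)
                        .XPathSelectElements(xpath)
                        .Select(e => e.Value)
                        .ToArray();
    }
}
```

On this sample, /table/thead/tr/td[@class='num'][1] yields test and test3 (one cell per row), (/table/thead/tr/td[@class='num'])[1] yields only test, and /table/thead/tr[1]/td[@class='num'] yields test and test2, that is, every matching cell of the first row.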
