C# XPath Tables - c#

I have XML that contains a table with 2 rows:
<table>
<thead>
<tr>
<th class="num">test</th>
<th class="num">test2</th>
</tr>
</thead>
</table>
I am using XPath to grab the data from the row.
How do I retrieve only the first row's data from the table, and not all of the data?
The XPath expression I am using now is:
/table/thead/tr/th[@class='num']
And my current output is:
test
test2
What do I have to add to the XPath expression so that I select the first row only?

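Your result is the expected output: the XPath expression asks for all matching nodes, so the two you get are correct.
If you want only the first one, you can do this:
/table/thead/tr/th[@class='num'][1]
Note that [1] here applies within each tr, so with several rows it returns the first matching cell of every row. To restrict the result to the first row, use /table/thead/tr[1]/th[@class='num']; for the single first match overall, use (/table/thead/tr/th[@class='num'])[1].
Otherwise, please post your expected output.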
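The difference between the per-row and whole-document forms of [1] can be checked with a small sketch. It uses XmlDocument from System.Xml rather than any scraping library, the sample markup is a hypothetical two-row variant of the table in the question, and it is written with top-level statements (C# 9 or later).

```csharp
using System;
using System.Xml;

// Hypothetical sample: two rows, each with two th cells of class "num".
var doc = new XmlDocument();
doc.LoadXml(
    "<table><thead>" +
    "<tr><th class='num'>test</th><th class='num'>test2</th></tr>" +
    "<tr><th class='num'>test3</th><th class='num'>test4</th></tr>" +
    "</thead></table>");

// Every matching cell in every row.
XmlNodeList all = doc.SelectNodes("/table/thead/tr/th[@class='num']");

// First matching cell of each row: [1] is evaluated per parent tr.
XmlNodeList firstPerRow = doc.SelectNodes("/table/thead/tr/th[@class='num'][1]");

// Every matching cell of the first row only.
XmlNodeList firstRow = doc.SelectNodes("/table/thead/tr[1]/th[@class='num']");

// The single first match overall: parentheses apply [1] to the whole node set.
XmlNode firstOverall = doc.SelectSingleNode("(/table/thead/tr/th[@class='num'])[1]");

Console.WriteLine(all.Count);              // 4
Console.WriteLine(firstPerRow.Count);      // 2
Console.WriteLine(firstRow.Count);         // 2
Console.WriteLine(firstOverall.InnerText); // test
```

With a single row, as in the question, the first three expressions differ only in how many cells of that row they return.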

Related

C# and Html Agility Pack

I have multiple files from which I have to extract tables containing data. The problem is that the tables don't have IDs, so I have to search based on the content (which is constant in each file). There are multiple tables in each file, and the table of interest doesn't have a constant XPath.
<table border="0" cellspacing="0" cellpadding="0" style="BORDER-COLLAPSE: collapse" bordercolor="#111111">
<tbody>
<tr>
<td class="s">CONSTANT_TEXT</td>
<td class="l">CHANGING_VALUE</td>
</tr>
<tr>
<td class="s"> </td>
<td class="l"><a style="" id="CONSTANT_ID" href="mailto: XXXX</a>
</td>
</tr>
</tbody>
</table>
How do I:
1. Search based on CONSTANT_TEXT and return the value of the 2nd TD (CHANGING_VALUE) without knowing the path (the table doesn't have an ID and its position changes from file to file).
2. Search based on CONSTANT_TEXT and return the parent table of that TD.
What I did was search for a constant element (here the link with CONSTANT_ID) with Html Agility Pack, then walk the XPath upwards until the table is reached.
var output = document.DocumentNode.SelectNodes("//a[@id='CONSTANT_ID']");
// output[0].XPath == "/html[1]/body[1]/table[1]/thead[1]/tr[1]/td[1]/table[1]/tbody[1]/tr[2]/td[2]/a[1]"
My plan was to iterate over each result, take the XPath up to the nearest enclosing table (table[1]), and then extract the data.
Thanks,
Mike
Strictly speaking, you'll need the following XPath expressions:
1. Search based on CONSTANT_TEXT and return the value of the 2nd TD (CHANGING_VALUE):
//td[.="CONSTANT_TEXT"]/following-sibling::td[1]/text()
Output: CHANGING_VALUE
2. Search based on CONSTANT_TEXT and return the parent table of that TD:
//td[.="CONSTANT_TEXT"]/ancestor::table[1]
Output: the <table> element
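As a rough check of the two axes, here is a minimal sketch using XmlDocument from System.Xml, written with top-level statements (C# 9 or later). The nested sample markup and the id attributes are assumptions added only so the result is easy to inspect; the tables in the question have no IDs. The same expression strings can also be passed to SelectSingleNode or SelectNodes on an Html Agility Pack DocumentNode.

```csharp
using System;
using System.Xml;

// Hypothetical sample: the table of interest nested inside an outer layout table.
// The id attributes exist only to show which table was selected.
var doc = new XmlDocument();
doc.LoadXml(
    "<html><body><table id='outer'><tr><td>" +
    "<table id='inner'><tbody>" +
    "<tr><td class='s'>CONSTANT_TEXT</td><td class='l'>CHANGING_VALUE</td></tr>" +
    "</tbody></table>" +
    "</td></tr></table></body></html>");

// following-sibling::td[1] is the td immediately after the matching cell.
XmlNode value = doc.SelectSingleNode(
    "//td[.='CONSTANT_TEXT']/following-sibling::td[1]");

// ancestor is a reverse axis, so table[1] is the nearest enclosing table.
XmlNode table = doc.SelectSingleNode(
    "//td[.='CONSTANT_TEXT']/ancestor::table[1]");

Console.WriteLine(value.InnerText);              // CHANGING_VALUE
Console.WriteLine(table.Attributes["id"].Value); // inner
```

If the real cells contain leading or trailing whitespace, the exact comparison .='CONSTANT_TEXT' will not match; normalize-space(.)='CONSTANT_TEXT' is a more tolerant test.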

How to count the href

How can I count the href attributes of my HTML?
<table>
<tbody>
<tr>
<td align="right" colspan="8">
2
3
4
</td>
</tr>
</tbody>
</table>
Get the elements by XPath and take the size of the result:
driver.findElements(By.xpath("//a[@href]")).size()
Whilst I generally avoid XPath, this seems like the time to use it.
If you are simply trying to get the number of links on a page without having to filter on specific links, you can do this in C# by:
int linkCount = _driver.FindElements(By.XPath("//a")).Count;
You can then assert on the number returned. Without an assertion the test will always pass, so it does not actually test anything. If you want to filter on specific links, I would use something other than XPath.
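The two expressions above are not equivalent: //a counts every anchor, while //a[@href] counts only anchors that carry an href attribute. A minimal sketch outside Selenium, using XmlDocument from System.Xml with top-level statements (C# 9 or later); the links and their href values are hypothetical, since the markup in the question lost its anchor tags.

```csharp
using System;
using System.Xml;

// Hypothetical pager markup: three real links plus one anchor without an href.
var doc = new XmlDocument();
doc.LoadXml(
    "<table><tbody><tr><td align='right' colspan='8'>" +
    "<a href='?page=2'>2</a><a href='?page=3'>3</a><a href='?page=4'>4</a>" +
    "<a name='top'>no href here</a>" +
    "</td></tr></tbody></table>");

int anchors = doc.SelectNodes("//a").Count;      // every a element
int links = doc.SelectNodes("//a[@href]").Count; // only a elements with an href

Console.WriteLine(anchors); // 4
Console.WriteLine(links);   // 3
```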

Reorder html elements in a string using C#

I have a string of HTML returned from a service. I need to update this HTML server side (using .NET) and reorder some of the elements before sending it to the client. As a simple example, say the string is a table like the one below. How can I manipulate it to put the "Last name" <th> and <td> into their own <tr>? The real HTML is much larger and more complex, but for one section of it the example below illustrates the change I need. Plain string replacement hasn't worked well because of the complexity of the actual HTML.
Initial String
"<table>
<tbody>
<tr>
<th>First name</th>
<td>some first name</td>
<th>Last name</th>
<td>some last name</td>
</tr>
<tr>
<th>blah</th>
<td>blah blah</td>
</tr>
</tbody>
</table>
"
After Modification
"<table>
<tbody>
<tr>
<th>First name</th>
<td>some first name</td>
</tr>
<tr>
<th>Last name</th>
<td>some last name</td>
</tr>
<tr>
<th>blah</th>
<td>blah blah</td>
</tr>
</tbody>
</table>
"
I know URL answers are frowned upon, but you should look into the HTML Agility Pack. It's designed for this kind of thing.
http://html-agility-pack.net/?z=codeplex
For the purposes of this answer, I will make the simplifying assumption that you have read the file into a list of strings, one line per entry. Let us name this list HTMLLines. Then the following should do what you want:
int length = HTMLLines.Count;
for (int loop = 0; loop < length; loop++)
{
    if (HTMLLines[loop].Equals("<th>Last name</th>"))
    {
        // Close the current row and open a new one before the matching header.
        HTMLLines[loop] = "</tr>\n<tr>\n" + HTMLLines[loop];
        // break; // Uncomment if there is only one occurrence; leave commented to handle every occurrence.
    }
}
If you save the list after this loop, you should have the desired output.
This code assumes that there are no nulls in the list; calling Equals on a null entry throws a NullReferenceException. If there may be nulls, replace HTMLLines[loop].Equals("<th>Last name</th>") with HTMLLines[loop] == "<th>Last name</th>".
If "<th>Last name</th>" is just a sample you used for this question and cannot be matched exactly, then you should place all possible matches in an array and check against them on each iteration. In this case, if we name the array theHeaders, the code will be something like:
int length = HTMLLines.Count;
for (int loop = 0; loop < length; loop++)
{
    for (int loop1 = 0; loop1 < theHeaders.Length; loop1++)
    {
        if (HTMLLines[loop].Equals(theHeaders[loop1]))
        {
            HTMLLines[loop] = "</tr>\n<tr>\n" + HTMLLines[loop];
            break;
        }
    }
}
I hope this helps to point you in the right direction.
A very simple approach could be...
var result = htmlString.Replace("<th>Last name</th>", "</tr><tr><th>Last name</th>");
If you need something more complex than this you'll need to add more detail to your question.
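When the HTML is too irregular for string replacement, the same move can be done on a parsed tree instead. The following is only a sketch under a strong assumption: the fragment is well-formed XML, so LINQ to XML (System.Xml.Linq) can parse it. For real-world HTML a tolerant parser such as the HTML Agility Pack mentioned above would be the safer choice. MoveToOwnRow is a hypothetical helper name, and the code uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Simplified, well-formed version of the table from the question.
string input =
    "<table><tbody>" +
    "<tr><th>First name</th><td>some first name</td>" +
    "<th>Last name</th><td>some last name</td></tr>" +
    "<tr><th>blah</th><td>blah blah</td></tr>" +
    "</tbody></table>";

string output = MoveToOwnRow(input, "Last name");
Console.WriteLine(output);

// Hypothetical helper: moves the th whose text equals headerText, together with
// the first td that follows it, into a new tr placed right after the current row.
static string MoveToOwnRow(string markup, string headerText)
{
    XElement root = XElement.Parse(markup);
    XElement th = root.Descendants("th").FirstOrDefault(e => e.Value == headerText);
    if (th == null || th.Parent == null) return markup; // nothing to move

    XElement oldRow = th.Parent;
    XElement td = th.ElementsAfterSelf("td").FirstOrDefault();

    var newRow = new XElement("tr");
    th.Remove();        // detach from the old row...
    newRow.Add(th);     // ...and attach to the new one
    if (td != null)
    {
        td.Remove();
        newRow.Add(td);
    }
    oldRow.AddAfterSelf(newRow);
    return root.ToString(SaveOptions.DisableFormatting);
}
```

Working on the tree means the result stays balanced regardless of how the original string was broken into lines.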

XPath drops contents of td column on an HTML page for screen scraping

Below is an excerpt of the code I use to screen scrape an economic calendar.
The HTML page that it parses using XPath includes this row as the first row
in a table. (I have pasted only this row instead of the entire HTML page.)
<tr class="calendar_row newday singleevent" data-eventid="42064"> <td class="date"><div class="date">Sun<div>Dec 23</div></div></td> <td class="time">All Day</td> <td class="currency">JPY</td> <td class="impact"> <div title="Non-Economic" class="holiday"></div> </td> <td class="event"><div>Bank Holiday</div></td> <td class="detail"><a class="calendar_detail level1" data-level="1"></a></td> <td class="actual"> </td> <td class="forecast"></td> <td class="previous"></td> <td class="graph"></td> </tr>
This is the code that selects the first tr row using XPath:
var doc = new HtmlDocument();
doc.Load(new StringReader(html));
var rows = doc.DocumentNode.SelectNodes("//tr[@class=\"calendar_row\"]");
var rowHtml = rows[0].InnerHtml;
The problem is that rowHtml returns this:
<td class="date"></td> <td class="time">All Day</td> <td class="currency">EUR</td> <td class="impact"> <div title="Non-Economic" class="holiday"></div> </td> <td class="event"> <div>French Bank Holiday</div> </td> <td class="detail"><a class="calendar_detail level2" data-level="2"></a></td> <td class="actual"> </td> <td class="forecast"></td> <td class="previous"></td> <td class="graph"></td>
Now you can see that the contents of the td column for the date vanished! Why?
I've experimented with many things and am stumped as to why it drops the contents of that column.
The other columns have content that it keeps. So what's wrong with the date column?
Is there some kind of setting or property somewhere that causes or prevents dropping contents?
Even if you haven't got a clue what's wrong, suggestions for how to investigate it further are welcome.
As @AlexeiLevenkov mentioned, you must be selecting a different row from the one you want. You've pruned away too much of the essential problem in an effort to simplify, but it's still clear what's wrong...
Consider that your input document might basically look like this:
<?xml version="1.0" encoding="UTF-8"?>
<table>
<tr class="calendar_row" data-eventid="12345">
<td>This IS NOT the tr you're looking for</td>
</tr>
<tr class="calendar_row newday singleevent" data-eventid="42064">
<td>This IS the tr you're looking for</td>
</tr>
</table>
The test @class="calendar_row" won't match the tr you show, because it compares against the whole attribute value, but it will match the first row.
You could change your test to contains(@class,'calendar_row') instead, but that would match both rows. You're going to have to identify some content or attribute that's unique to the row you want. Perhaps the @data-eventid attribute would work; I can't tell without seeing your whole input file.
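The matching behaviour described above can be reproduced on the two-row sample document. This sketch uses XmlDocument from System.Xml instead of Html Agility Pack, on the assumption that the predicates behave the same way, and is written with top-level statements (C# 9 or later). The token test on "newday" is an extra illustration that is not part of the original answer.

```csharp
using System;
using System.Xml;

var doc = new XmlDocument();
doc.LoadXml(
    "<table>" +
    "<tr class='calendar_row' data-eventid='12345'><td>not wanted</td></tr>" +
    "<tr class='calendar_row newday singleevent' data-eventid='42064'><td>wanted</td></tr>" +
    "</table>");

// Exact comparison against the whole attribute value: only the first row.
int exact = doc.SelectNodes("//tr[@class='calendar_row']").Count;

// Substring test: both rows.
int substring = doc.SelectNodes("//tr[contains(@class,'calendar_row')]").Count;

// Whole-token test for one class name in a space-separated list.
XmlNode byToken = doc.SelectSingleNode(
    "//tr[contains(concat(' ', normalize-space(@class), ' '), ' newday ')]");

// Unique attribute value.
XmlNode byEventId = doc.SelectSingleNode("//tr[@data-eventid='42064']");

Console.WriteLine(exact);               // 1
Console.WriteLine(substring);           // 2
Console.WriteLine(byToken.InnerText);   // wanted
Console.WriteLine(byEventId.InnerText); // wanted
```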

Scraping html tables in .NET and taking care of colspans

I am trying to scrape HTML tables in my .NET application, but I have come across tables that make heavy use of colspan and rowspan attributes on cells, which is giving me a headache. Is there a library that can convert a table into an array of strings and take care of colspan? For example, if a TD element has colspan=5, its value should be repeated for 5 consecutive elements.
<table>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
</tr>
<tr>
<td colspan=4>1</td>
<td>2</td>
</tr></table>
the output would be an array of the following:
[1,2,3,4,5]
[1,1,1,1,2]
You may be able to use ParseControl, which would make the whole thing fairly trivial, since you can access the ColSpan property.
You could load it into an XmlDocument and then loop through it. I'm not sure that's the best solution, but it works.
Maybe LINQ to XML?
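To make the LINQ to XML suggestion concrete, here is a minimal sketch that repeats each cell's text colspan times. It assumes the table is well-formed XML with quoted attribute values (colspan=4 as written in the question would have to be colspan="4"), and it deliberately ignores rowspan. ExpandColspans is a hypothetical helper name, and the code uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

string tableXml =
    "<table>" +
    "<tr><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td></tr>" +
    "<tr><td colspan='4'>1</td><td>2</td></tr>" +
    "</table>";

List<string[]> rows = ExpandColspans(tableXml);
foreach (string[] row in rows)
    Console.WriteLine("[" + string.Join(",", row) + "]");
// [1,2,3,4,5]
// [1,1,1,1,2]

// Hypothetical helper: one string[] per tr, with each cell repeated colspan times.
static List<string[]> ExpandColspans(string markup)
{
    XElement table = XElement.Parse(markup);
    var result = new List<string[]>();

    foreach (XElement tr in table.Descendants("tr"))
    {
        var cells = new List<string>();
        foreach (XElement cell in tr.Elements().Where(e => e.Name == "td" || e.Name == "th"))
        {
            // A missing or invalid colspan is treated as 1.
            int span = 1;
            string raw = (string)cell.Attribute("colspan");
            if (raw != null && int.TryParse(raw, out int parsed) && parsed > 0)
                span = parsed;

            for (int i = 0; i < span; i++)
                cells.Add(cell.Value.Trim());
        }
        result.Add(cells.ToArray());
    }
    return result;
}
```

Handling rowspan as well would need a grid that remembers which column positions are still occupied by cells from earlier rows.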
