Scraping HTML tables in .NET and taking care of colspans - c#

I am trying to scrape HTML tables in my .NET application, but I keep running into tables that use colspan and rowspan attributes heavily, which is giving me a headache. Is there a library that can convert a table into an array of strings while taking colspan into account? For example, if a TD element has colspan=5, its value should be repeated so that it fills five consecutive elements. Given this table:
<table>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
</tr>
<tr>
<td colspan=4>1</td>
<td>2</td>
</tr></table>
The output would be the following arrays:
[1,2,3,4,5]
[1,1,1,1,2]

You may be able to use ParseControl, which would make the whole thing fairly trivial, since you can then access the Colspan property.

You could load it into an XmlDocument and then loop through it. Not sure if that's the best solution, but it works.
Maybe LINQ to XML?
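The expansion the question asks for can be sketched independently of any particular .NET library. The block below is a minimal illustration in Python using only the standard library's html.parser; the names SpanExpander and expand_table are hypothetical, and it assumes a single, non-nested table whose cells all have explicit closing tags. It repeats a cell's text once per spanned column and copies rowspan cells down into the rows below.

```python
from html.parser import HTMLParser


def span(attrs, name):
    """Read colspan/rowspan as a positive integer, defaulting to 1."""
    try:
        return max(1, int(attrs.get(name) or 1))
    except ValueError:
        return 1


class SpanExpander(HTMLParser):
    """Builds a rectangular grid of strings from a table (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of strings
        self._row = None    # row being built, or None outside <tr>
        self._cell = None   # (colspan, rowspan, text fragments) or None
        self._carry = {}    # column index -> (rows still to fill, text)

    def _fill_carried(self):
        # Copy down cells from earlier rows whose rowspan covers the next column.
        while len(self._row) in self._carry:
            col = len(self._row)
            left, text = self._carry.pop(col)
            self._row.append(text)
            if left > 1:
                self._carry[col] = (left - 1, text)

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            a = dict(attrs)
            self._cell = (span(a, "colspan"), span(a, "rowspan"), [])

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[2].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            colspan, rowspan, parts = self._cell
            text = "".join(parts).strip()
            self._fill_carried()
            for _ in range(colspan):
                if rowspan > 1:
                    self._carry[len(self._row)] = (rowspan - 1, text)
                self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._fill_carried()
            self.rows.append(self._row)
            self._row = None


def expand_table(html):
    parser = SpanExpander()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Called on the table from the question, expand_table returns one list per row, with the colspan=4 cell repeated four times. The same bookkeeping (a current row plus a map of carried-down cells) should port to whichever DOM library is used on the .NET side.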

Related

How to count the href

How can I count the href attributes of my HTML?
<table>
<tbody>
<tr>
<td align="right" colspan="8">
2
3
4
</td>
</tr>
</tbody>
</table>
Get the elements by tag name and get the size of the result:
driver.findElements(By.xpath("//a[@href]")).size()
Whilst I generally avoid XPath, this seems like the time to use it.
If you are simply trying to get the number of links on a page without having to filter on specific links, you can do this in C# by:
int linkCount = _driver.FindElements(By.XPath("//a")).Count;
You can then assert on the number returned (if you don't assert, the test will always pass). If you want to filter on specific links, I would use something other than XPath.
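The two XPath expressions above differ in one respect: //a[@href] counts only anchors that carry an href attribute, while //a counts every anchor. As an offline illustration of that difference, here is a small Python sketch that counts anchors in an HTML string without a browser. LinkCounter and count_links are hypothetical names, and the sample markup in the usage note is invented, since the links in the question's table did not survive.

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Counts <a> elements, separately tracking those with an href attribute."""

    def __init__(self):
        super().__init__()
        self.anchors = 0      # every <a>, like //a
        self.with_href = 0    # only <a href=...>, like //a[@href]

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchors += 1
            if any(name == "href" for name, _ in attrs):
                self.with_href += 1


def count_links(html):
    """Return (anchors with href, all anchors) for the given markup."""
    parser = LinkCounter()
    parser.feed(html)
    parser.close()
    return parser.with_href, parser.anchors
```

For markup such as '<td><a href="?p=2">2</a> <a name="top">3</a></td>', the two counts differ, which is the case where choosing between the two expressions matters.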

colspan not expanding the columns length

I am using C# ASP.NET with HTML tables. The problem is with one specific panel, where a column inside the table does not expand for whatever reason, even though colspan works correctly in my other panels.
Here are some screenshots to explain what is happening.
Now, even though I set the colspan to whatever value, the column does not expand. Also, I know there are two tables within this panel; there is a reason for that, so it's not a mistake. Basically, I want the left button to stay at the left of the panel, and I want the right button ("Next") to be as far to the right of the page as possible.
Any ideas why this is happening, or is there a better solution to this problem?
By the way, I am testing in Google Chrome, in case that is relevant.
Not sure why you are using ColSpan when you have only one row in the second table. To achieve what you are expecting, do the following:
set Width="100%" on the second table
in the first "td", for the back button, include align="left"
in the second "td", for the next button, include align="right"
colspan is meant for tables with multiple rows, so it will not do what you expect here. For example:
<table id="tblButtons" runat="server">
<tr>
<td colspan="3">
column that covers three columns
</td>
<td align="right">
right button
</td>
</tr>
<tr>
<td>
column 1
</td>
<td>
column 2
</td>
<td>
column 3
</td>
<td>
column 4
</td>
</tr>
</table>
Columns 1, 2 and 3 will be covered by the td that has colspan="3".
There aren't 100 columns in your page, so that value is relatively worthless. You can (and should) use CSS to achieve your desired width. To have the table itself fill the page, you need to add a style="width:100%;" to it, then your cells will expand to split the difference.
colspan only changes how many columns in a table a cell takes up, not how wide it actually is. Use style="width:..." (or set it in CSS) to set the width. What's happening right now is your table is being divided into one hundred and one imaginary parts (the left side having one hundred parts, the right having one).
An example of using colspan correctly:
<table>
<tr>
<td colspan="2">
Hello world
</td>
</tr>
<tr>
<td>
Left
</td>
<td>
Top
</td>
</tr>
</table>

Given a web response, how do I extract a specific portion for further processing?

I have some code that gets a web response. How do I take that response and search for a table using its CSS class (class="data")? Once I have the table, I need to extract certain field values. For example, in the sample markup below, I need the values of Field #3 and Field #5, so "85" and "1", respectively.
<table width="570" border="0" cellpadding="1" cellspacing="2" class="data">
<tr>
<td width="158"><strong>Field #1:</strong></td>
<td width="99">1</td>
<td width="119"><strong>Field #2:</strong></td>
<td width="176">110</td>
</tr>
<tr>
<td width="158"><strong>Field #3:</strong></td>
<td width="99">85</td>
<td width="119"><strong>Field #4:</strong></td>
<td width="176">-259.34</td>
</tr>
<tr>
<td width="158"><strong>Field #5:</strong></td>
<td width="99">1</td>
<td width="119"><strong>Field #6:</strong></td>
<td width="176">110</td>
</tr>
<tr>
<td width="158"><strong>Field #7:</strong></td>
<td width="99">12</td>
<td width="119"><strong>Field #8:</strong></td>
<td width="176">123.23</td>
</tr>
</table>
Use the HTML Agility Pack and parse the HTML. If you want to do it the simplest way then go grab its beta (it supports LINQ).
As Randolf suggests, using HTML Agility Pack is a good option.
But, if you have control of the format of the HTML, it is also possible to do string parsing to extract the values you are after.
It is nearly trivial to download the entire HTML as a string and search for the string "<table" followed by the string "class=\"data\"". Then you can easily extract the values you are after by doing similar string manipulations.
I'm not saying you should do this, for the resulting code will be harder to read and maintain than the code using HTML Agility Pack, but it will save you an external dependency and your code will probably perform much better.
In a WP7 app I made, I started using HTML Agility Pack to parse some HTML and extract some values. This worked well, but it was quite slow. Switching to the string parsing regime made my code many times faster while returning the exact same result.
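As a point of comparison with both approaches above, here is a minimal Python sketch of the parser-based route, using only the standard library. It looks for a table whose class list contains "data" and pairs each label cell with the value cell that follows it. DataTableFields and read_fields are hypothetical names, and the sketch assumes cells strictly alternate label, value within a row, as they do in the sample markup.

```python
from html.parser import HTMLParser


class DataTableFields(HTMLParser):
    """Collects label/value cell pairs from tables with a given CSS class."""

    def __init__(self, css_class="data"):
        super().__init__()
        self.css_class = css_class
        self.fields = {}
        self._depth = 0      # > 0 while inside a matching table
        self._cell = None    # text fragments of the current cell
        self._label = None   # pending label waiting for its value

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            if self._depth or self.css_class in classes:
                self._depth += 1
        elif self._depth and tag == "tr":
            self._label = None   # labels never carry over between rows
        elif self._depth and tag == "td":
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "table" and self._depth:
            self._depth -= 1
        elif tag == "td" and self._cell is not None:
            text = "".join(self._cell).strip()
            self._cell = None
            if self._label is None:
                self._label = text.rstrip(":")
            else:
                self.fields[self._label] = text
                self._label = None


def read_fields(html, css_class="data"):
    parser = DataTableFields(css_class)
    parser.feed(html)
    parser.close()
    return parser.fields
```

With the sample markup, read_fields(html)["Field #3"] and read_fields(html)["Field #5"] give the two values the question asks for, as strings.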

Extracting data from an XML document without using an XML parser

Here are some lines of the document:
<div class="rowleft">
<h3>Technical Fouls</h3>
<table class="num-left">
<tr class="datahl2b">
<td> </td>
<td>Players</td>
</tr>
<tr>
<td>DAL</td>
<td>
None</td>
</tr>
<tr>
<td>MIA</td>
<td>
Mike Miller</td>
<td>
Mike Miller, Jr.</td>
</tr>
</table>
</div>
I'm interested in extracting the None and Mike Miller and Mike Miller, Jr. from this. I tried using various XML parsers, but 1) the performance is abysmal and 2) the document is apparently not a properly formatted XML document.
One thing I've been thinking about is stripping the document of newlines, splitting it at something like <tr>, seeing which lines contain data (probably using StartsWith()), and extracting it with a regex. That would be efficient enough for my program (doesn't really matter that it takes half a second when downloading the document is five seconds), but I'm interested in alternative solutions.
HTML generally isn't properly formatted XML; I suggest you use something like the HTML Agility Pack.
Trying to parse HTML with string manipulation and regexes is invariably going to be horribly error-prone.
If your document is not well-formed XML, I would recommend using the HTML Agility Pack.
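To illustrate why a tolerant HTML parser suits this document better than an XML parser, here is a small Python sketch using the standard library's html.parser, which does not require well-formed XML. RowCollector and players_by_team are hypothetical names. The sketch assumes header rows are the ones marked class="datahl2b", as in the sample, and that the first cell of each remaining row is the team code.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collects the cell text of every non-header table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the current row, None when skipping
        self._cell = None   # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            classes = (dict(attrs).get("class") or "").split()
            # Skip header rows, which the sample marks with class="datahl2b".
            self._row = None if "datahl2b" in classes else []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Collapse the newlines and indentation inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def players_by_team(html):
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return {row[0]: row[1:] for row in parser.rows if row}
```

On the snippet from the question this maps DAL to a one-element list containing "None" and MIA to the two Mike Miller entries.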

Regex matching tags

I have the following piece of text from which I'd like to extract all the <td ????>???</td> tags
<tr id=row509>
<td id=serv509 align=center class='style1'>Z Deviazione Tecnico Home verso S24 [ NON USATO ]</td>
<td align=center class='style4'>23</td>
<td align=center class='style10'>22</td>
<td align=center class='style6'>0</td>
<td align=center class='style2'>0</td>
<td id=rowtot509 align=center class='style6'>0</td>
<td align=center class='style6'>0</td>
<td align=center class='style2'>0</td>
<td align=center class='style6'>0</td>
</tr>
The expected result would be:
1. <td id=serv509 align=center class='style1'>Z Deviazione Tecnico Home verso S24 [ NON USATO ]</td>
2. <td align=center class='style4'>23</td>
3. <td align=center class='style10'>22</td>
[..]
Any help? Thanks
What's the problem with using an HTML or XML library?
Using XML and XPath, for instance, this would just be a case of doing xml / td, in whatever way the library API supports that.
Regex is a lousy way of doing that, because XML is not a regular language. Specifically, you can nest tags inside other tags, and this is something that can't be represented with regular expressions.
So, while it would be easy to create a regular expression for the simple case (<td.*?</td>), it would easily break if the XML changed just a bit.
Granted, the XML is broken, but you may fix it with a regex. :-) For instance, if you replace the pattern (\w+)=(\w+) with $1='$2' (.NET replacement patterns use $1; some other engines write \1='\2'), you'll get valid XML.
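The attribute-quoting repair described above can be tried out in a few lines. This is a sketch in Python, where the replacement syntax is the \1='\2' form; quote_attributes and cell_texts are hypothetical names. Note the assumption baked into the pattern: it would also rewrite any word=word sequence that appears in cell text, so it is only safe on markup like the sample row.

```python
import re
import xml.etree.ElementTree as ET

# An attribute name, an equals sign, and an unquoted value made of word characters.
UNQUOTED_ATTR = re.compile(r"(\w+)=(\w+)")


def quote_attributes(markup):
    """Wrap unquoted attribute values in single quotes."""
    return UNQUOTED_ATTR.sub(r"\1='\2'", markup)


def cell_texts(row_markup):
    """Repair the row, parse it as XML, and return the text of each <td>."""
    root = ET.fromstring(quote_attributes(row_markup))
    return [td.text for td in root.findall("td")]
```

Values that are already quoted, such as class='style1', are left alone because the character after the equals sign is a quote rather than a word character.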
I would agree with Daniel, but if you really must use a regex - get yourself a copy of RegexBuddy so you can quickly debug your expression. Best $40 I've spent in a long time.
Regular expressions are a pretty fragile tool to use for this kind of problem, especially if there's any risk at all that a table's cell content could be another table. (In that case, the first </td> tag you find after a <td> start tag may not actually be closing that element but a descendant element.)
A much more robust way to tackle problems like these is to parse the HTML into a DOM and then examine the DOM. The HTML Agility Pack is one that people seem to like.
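The nested-table failure mentioned above is easy to reproduce. The Python sketch below runs the simple pattern <td.*?</td> over an invented row whose cell contains another table, then reads the same row with a depth-tracking parser. OuterCellText and outer_cells are hypothetical names, and the sample string is made up for the demonstration.

```python
import re
from html.parser import HTMLParser

TD_PATTERN = re.compile(r"<td.*?</td>", re.S)


class OuterCellText(HTMLParser):
    """Collects the full text of top-level cells, including nested content."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._depth = 0     # how many <td> elements are currently open
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._depth += 1

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._depth:
            self._depth -= 1
            if self._depth == 0:   # the outermost cell just closed
                self.cells.append("".join(self._parts))
                self._parts = []


def outer_cells(html):
    parser = OuterCellText()
    parser.feed(html)
    parser.close()
    return parser.cells


nested = "<tr><td>outer <table><tr><td>inner</td></tr></table> tail</td></tr>"
regex_matches = TD_PATTERN.findall(nested)
parsed_cells = outer_cells(nested)
```

Here the regex stops at the first closing tag it meets, so its single match ends inside the inner table and the trailing text of the outer cell is lost, while the parser reports one outer cell containing all three pieces of text.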
