Regex matching tags - c#

I have the following piece of text from which I'd like to extract all the <td ????>???</td> tags
<tr id=row509>
<td id=serv509 align=center class='style1'>Z Deviazione Tecnico Home verso S24 [ NON USATO ]</td>
<td align=center class='style4'>23</td>
<td align=center class='style10'>22</td>
<td align=center class='style6'>0</td>
<td align=center class='style2'>0</td>
<td id=rowtot509 align=center class='style6'>0</td>
<td align=center class='style6'>0</td>
<td align=center class='style2'>0</td>
<td align=center class='style6'>0</td>
</tr>
The expected result would be:
1. <td id=serv509 align=center class='style1'>Z Deviazione Tecnico Home verso S24 [ NON USATO ]</td>
2. <td align=center class='style4'>23</td>
3. <td align=center class='style10'>22</td>
[..]
Any help? Thanks

What's the problem with using an HTML or XML library?
Using XML and XPath, for instance, this would just be a case of doing xml / td, in whatever way the library API supports that.
Regex is a lousy way of doing that, because XML is not a regular language. Specifically, you can nest tags inside other tags, and this is something that can't be represented with regular expressions.
So, while it would be easy to create a regular expression for the simple case (<td.*?</td>), it would easily break if the XML changed just a bit.
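For the simple, non-nested case only, the pattern mentioned above could be applied in C# roughly as follows. This is a minimal sketch, not a recommendation: the class and method names are made up for illustration, and the `\b` after `td` is an addition to the quoted pattern so that other tag names beginning with "td" are not picked up.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TdRegexSketch
{
    // Lazy match from "<td" to the nearest "</td>".
    // Singleline lets '.' cross newlines in case a cell spans several lines.
    private static readonly Regex TdPattern =
        new Regex(@"<td\b.*?</td>", RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns every <td ...>...</td> fragment found in the input, in document order.
    public static List<string> ExtractTdTags(string html)
    {
        return TdPattern.Matches(html)
                        .Cast<Match>()
                        .Select(m => m.Value)
                        .ToList();
    }
}
```

Calling TdRegexSketch.ExtractTdTags on the row from the question should yield one string per cell; it stops being reliable as soon as a cell contains another table.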
Granted, the markup as posted is not well-formed XML, because some attribute values are unquoted, but you can fix that with a regex. :-) For instance, if you replace the pattern (\w+)=(\w+) with $1='$2' (that is the .NET replacement syntax; some other engines write it as \1='\2'), you'll get valid XML.
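A rough sketch of that idea in C#, assuming the input is a single <tr> fragment like the one in the question: quote the bare attribute values first, then let LINQ to XML do the parsing. The class and method names are hypothetical. Note that the replacement is applied to the whole string, so cell text that happens to look like name=value would be rewritten too.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class QuoteFixSketch
{
    // Turns id=serv509 into id='serv509'. Values that are already quoted
    // (class='style1') are left alone because the quote is not a \w character.
    public static string QuoteAttributes(string markup)
    {
        return Regex.Replace(markup, @"(\w+)=(\w+)", "$1='$2'");
    }

    // Parses the repaired row as XML and returns the text of each direct <td> child.
    public static List<string> CellValues(string rowMarkup)
    {
        XElement row = XElement.Parse(QuoteAttributes(rowMarkup));
        return row.Elements("td").Select(td => td.Value).ToList();
    }
}
```

XElement.Parse throws an XmlException if the repaired text is still not well-formed, which at least makes a broken assumption visible instead of silently returning wrong cells.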

I would agree with Daniel, but if you really must use a regex - get yourself a copy of RegexBuddy so you can quickly debug your expression. Best $40 I've spent in a long time.

Regular expressions are a pretty fragile tool to use for this kind of problem, especially if there's any risk at all that a table's cell content could be another table. (In that case, the first </td> tag you find after a <td> start tag may not actually be closing that element but a descendant element.)
A much more robust way to tackle problems like these is to parse the HTML into a DOM and then examine the DOM. The HTML Agility Pack is one that people seem to like.
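The nested-table failure described in this answer is easy to reproduce. The sketch below (hypothetical names, standard library only) runs a lazy <td ... </td> pattern over a cell that contains another table.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class NestedTableSketch
{
    // Naive extraction: each match runs from a "<td" to the nearest following "</td>".
    public static string[] NaiveCells(string html)
    {
        return Regex.Matches(html, @"<td\b.*?</td>", RegexOptions.Singleline)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

For the input <td><table><tr><td>inner</td></tr></table>outer</td> this returns a single match, <td><table><tr><td>inner</td>, which is the outer cell cut off at the inner cell's closing tag. The text "outer" is lost entirely.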

Related

Regex c# html tags with specific attribute

I am new to regular expressions. :( After a lot of searching I managed to find something close to what I need, but I get extra results, as explained below:
My String
<td valign="top" width="100%">
<td width="100%" valign="top">
<td valign="top" height="100%" width="100%">
<td valign="top">
My Expression
/<td (?=.*valign="top")(?=.*width="100%").*>/gm
My Result
<td valign="top" width="100%">
<td width="100%" valign="top">
<td valign="top" height="100%" width="100%">
Expected result
<td valign="top" width="100%">
<td width="100%" valign="top">
Conclusion: I want to extract only the td tags whose sole attributes are valign and width, with these specific values.
Note: I have to parse a lot of data files, so the HTML Agility Pack would slow down the overall process.
Kindly guide me to the final expression. Cheers
This seems to be doing it for me:
/<td\s+((valign="top"\s+width="100%")|(width="100%"\s+valign="top"))\s*>/gm
Your expression uses lookaheads, which only check that the two attributes appear somewhere after <td, so a tag that also carries other attributes still matches. This one requires whitespace after <td, then matches either valign="top" width="100%" or width="100%" valign="top", followed by optional whitespace before the end of the td tag. This disallows all attributes except for the width and valign attributes.
With that said, there are always unexpected situations when using regex. You can test your regex expressions in real-time here: http://regexr.com/ Just type in your string and the regex expression to see what it selects.
EDIT:
If you want to account for both single quotes and double quotes around the attributes, try this one:
/<td\s+((valign=(["'])top\3\s+width=(["'])100%\4)|(width=(["'])100%\6\s+valign=(["'])top\7))\s*>/gm
Now I'm allowing for either a " or ' at the beginning of the attribute's value, and search for a match of whichever one was found at the end of the attribute's value.
Again, I encourage you to go to the website I linked above and play around with these yourself as well. I almost never use regex, but when I do I can usually find an expression that works for me with that website.
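Since the question is tagged C#, here is a hedged sketch of how the quote-tolerant expression might look in .NET, where the pattern is passed as a string without the /.../gm delimiters. The class and method names are made up, and the groups are renumbered because the outer groups are written as non-capturing here.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TdAttributeSketch
{
    // Groups 1 and 2 capture the opening quote of each attribute in the first
    // ordering, groups 3 and 4 in the second; the backreference after each value
    // requires the same quote character to close it.
    private static readonly Regex Pattern = new Regex(
        @"<td\s+(?:valign=([""'])top\1\s+width=([""'])100%\2" +
        @"|width=([""'])100%\3\s+valign=([""'])top\4)\s*>");

    // Returns the <td ...> opening tags that carry exactly these two attributes.
    public static List<string> MatchingTags(string html)
    {
        return Pattern.Matches(html)
                      .Cast<Match>()
                      .Select(m => m.Value)
                      .ToList();
    }
}
```

Run against the four sample tags from the question, this should return only the first two; the tag with the extra height attribute and the tag with valign alone are skipped.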

Given a web response, how do I extract a specific portion for further processing?

I have some code that gets a web response. How do I take that response and search for a table using its CSS class (class="data")? Once I have the table, I need to extract certain field values. For example, in the sample markup below, I need the values of Field #3 and Field #5, so "85" and "1", respectively.
<table width="570" border="0" cellpadding="1" cellspacing="2" class="data">
<tr>
<td width="158"><strong>Field #1:</strong></td>
<td width="99">1</td>
<td width="119"><strong>Field #2:</strong></td>
<td width="176">110</td>
</tr>
<tr>
<td width="158"><strong>Field #3:</strong></td>
<td width="99">85</td>
<td width="119"><strong>Field #4:</strong></td>
<td width="176">-259.34</td>
</tr>
<tr>
<td width="158"><strong>Field #5:</strong></td>
<td width="99">1</td>
<td width="119"><strong>Field #6:</strong></td>
<td width="176">110</td>
</tr>
<tr>
<td width="158"><strong>Field #7:</strong></td>
<td width="99">12</td>
<td width="119"><strong>Field #8:</strong></td>
<td width="176">123.23</td>
</tr>
</table>
Use the HTML Agility Pack and parse the HTML. If you want to do it the simplest way then go grab its beta (it supports LINQ).
As Randolf suggests, using HTML Agility Pack is a good option.
But, if you have control of the format of the HTML, it is also possible to do string parsing to extract the values you are after.
It is nearly trivial to download the entire HTML as a string and search for the string "<table" followed by the string "class=\"data\"". Then you can easily extract the values you are after by doing similar string manipulations.
I'm not saying you should do this, for the resulting code will be harder to read and maintain than the code using HTML Agility Pack, but it will save you an external dependency and your code will probably perform much better.
In a WP7 app I made, I started using HTML Agility Pack to parse some HTML and extract some values. This worked well, but it was quite slow. Switching to the string parsing regime made my code many times faster while returning the exact same result.
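To make the string-parsing approach concrete, here is a minimal sketch that relies only on IndexOf and Substring. It assumes the markup looks exactly like the sample in the question (class="data" with double quotes, each value in the td right after its label); the class and method names are hypothetical.

```csharp
using System;

public static class StringScanSketch
{
    // Finds the first <table ...> whose opening tag contains class="data",
    // then returns the text of the cell that follows the given label.
    public static string ValueAfterLabel(string html, string label)
    {
        int pos = 0;
        while ((pos = html.IndexOf("<table", pos, StringComparison.OrdinalIgnoreCase)) >= 0)
        {
            int tagEnd = html.IndexOf('>', pos);
            if (tagEnd < 0) return null;

            string openTag = html.Substring(pos, tagEnd - pos);
            if (openTag.Contains("class=\"data\""))
            {
                return ReadCellAfter(html, label, tagEnd);
            }
            pos = tagEnd; // not the table we want, keep scanning
        }
        return null;
    }

    // Skips from the label to the next <td ...> and returns its trimmed content.
    // Note: the search is not bounded by </table>, which is fine for a sketch only.
    private static string ReadCellAfter(string html, string label, int from)
    {
        int labelPos = html.IndexOf(label, from, StringComparison.Ordinal);
        if (labelPos < 0) return null;

        int cellStart = html.IndexOf("<td", labelPos, StringComparison.Ordinal);
        if (cellStart < 0) return null;

        int valueStart = html.IndexOf('>', cellStart);
        if (valueStart < 0) return null;
        valueStart++;

        int valueEnd = html.IndexOf("</td>", valueStart, StringComparison.Ordinal);
        if (valueEnd < 0) return null;

        return html.Substring(valueStart, valueEnd - valueStart).Trim();
    }
}
```

With the sample table, ValueAfterLabel(html, "Field #3:") should give "85" and ValueAfterLabel(html, "Field #5:") should give "1". Any change in attribute quoting or cell order breaks it, which is the maintenance cost mentioned above.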

Extracting data from an XML document without using an XML parser

Here's some lines of the document:
<div class="rowleft">
<h3>Technical Fouls</h3>
<table class="num-left">
<tr class="datahl2b">
<td> </td>
<td>Players</td>
</tr>
<tr>
<td>DAL</td>
<td>
None</td>
</tr>
<tr>
<td>MIA</td>
<td>
Mike Miller</td>
<td>
Mike Miller, Jr.</td>
</tr>
</table>
</div>
I'm interested in extracting the None and Mike Miller and Mike Miller, Jr. from this. I tried using various XML parsers, but 1) the performance is abysmal and 2) the document is apparently not a properly formatted XML document.
One thing I've been thinking about is stripping the document of newlines, splitting it at something like <tr>, seeing which lines contain data (probably using StartsWith()), and extracting it with a regex. That would be efficient enough for my program (it doesn't really matter that it takes half a second when downloading the document takes five seconds), but I'm interested in alternative solutions.
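The split-then-regex idea described above might look roughly like this. It is a sketch under stated assumptions: data rows start with a bare <tr> (the header row has a class attribute and is therefore skipped), the first cell of each row is the team, and cells are plain <td> with no attributes. Instead of stripping newlines up front, the pattern absorbs surrounding whitespace. All names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RowSplitSketch
{
    public static List<string> PlayerNames(string html)
    {
        var names = new List<string>();

        // Everything before the first bare <tr> (including the header row) is skipped.
        foreach (string row in html.Split(new[] { "<tr>" }, StringSplitOptions.None).Skip(1))
        {
            MatchCollection cells =
                Regex.Matches(row, @"<td>\s*(.*?)\s*</td>", RegexOptions.Singleline);

            // Cell 0 is the team abbreviation; the remaining cells are players.
            for (int i = 1; i < cells.Count; i++)
            {
                names.Add(cells[i].Groups[1].Value);
            }
        }
        return names;
    }
}
```

On the fragment shown above this should return None, Mike Miller and Mike Miller, Jr., in that order.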
HTML generally isn't properly formatted XML; I suggest you use something like the HTML Agility Pack.
Trying to parse HTML with string manipulation and regexes is invariably going to be horribly error-prone.
If your document is not well-formed XML, I would recommend using the HTML Agility Pack

Scraping html tables in .NET and taking care of colspans

I am trying to scrape HTML tables in my .NET application, but I have come across tables that make heavy use of the colspan and rowspan attributes on cells, which is giving me a headache. I was wondering if there is a library available that can convert a table into an array of strings while taking care of colspan, e.g. if a TD element has colspan=4, its value is repeated for the 4 columns it spans:
<table>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
</tr>
<tr>
<td colspan=4>1</td>
<td>2</td>
</tr></table>
the output would be an array of the following:
[1,2,3,4,5]
[1,1,1,1,2]
You may be able to use ParseControl, which would make the whole thing fairly trivial, since you can access the Colspan property.
You could put it in an XmlDocument and then loop through it. Not sure if that's the best solution, but it works.
Maybe LINQ to XML?
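A rough sketch of the LINQ to XML suggestion, assuming the table is well-formed XML (so the attribute would have to be written as colspan="4", with quotes) and that only colspan needs handling. rowspan is not covered here, and the names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ColspanSketch
{
    // Returns one string array per <tr>, repeating each cell's text
    // once per column it spans (colspan defaults to 1 when absent).
    public static List<string[]> ExpandRows(string tableXml)
    {
        return XElement.Parse(tableXml)
            .Elements("tr")
            .Select(tr => tr.Elements("td")
                .SelectMany(td => Enumerable.Repeat(
                    td.Value,
                    (int?)td.Attribute("colspan") ?? 1))
                .ToArray())
            .ToList();
    }
}
```

For the two-row table in the question (with the attribute quoted) this should produce [1,2,3,4,5] and [1,1,1,1,2].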

Regex Problem(C#)

HTML:
<TD style="DISPLAY: none">999999999</TD>
<TD class=CLS1 >Name</TD>
<TD class=BLACA>271229</TD>
<TD>220</TD>
<TD>343,23</TD>
<TD>23,0</TD>
<TD>222,00</TD>
<TD>33222,8</TD>
<TD class=blacl>0</TD>
<TD class=black>0</TD>
<TD>3433</TD>
<TD>40</TD>
I need the value inside each td. How can I do this in C#? I want a string array:
999999999
Name
271229
220
Do not use regular expressions to parse html - see this for why.
Use the HTML Agility Pack to parse the HTML and extract the data in it.
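A minimal sketch of what that could look like, assuming the HtmlAgilityPack NuGet package is referenced (it is not part of the .NET base library). The wrapper class and method name are made up; the calls used are HtmlDocument.LoadHtml, DocumentNode.SelectNodes and HtmlNode.InnerText.

```csharp
using System.Linq;
using HtmlAgilityPack; // external NuGet package, assumed to be installed

public static class AgilityPackSketch
{
    // Returns the text of every td element in the fragment, in document order.
    public static string[] CellTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection cells = doc.DocumentNode.SelectNodes("//td");
        if (cells == null) return new string[0];

        return cells.Select(td => td.InnerText.Trim()).ToArray();
    }
}
```

The parser lower-cases element names, so the //td query should also find the upper-case <TD> tags in the question, and unquoted attributes such as class=CLS1 are tolerated.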
