Regex Problem(C#) - c#

HTML:
<TD style="DISPLAY: none">999999999</TD>
<TD class=CLS1 >Name</TD>
<TD class=BLACA>271229</TD>
<TD>220</TD>
<TD>343,23</TD>
<TD>23,0</TD>
<TD>222,00</TD>
<TD>33222,8</TD>
<TD class=blacl>0</TD>
<TD class=black>0</TD>
<TD>3433</TD>
<TD>40</TD>
I need the value inside each td. How can I do this in C#? I want a string array:
999999999
Name
271229
220

Do not use regular expressions to parse HTML.
Use the HTML Agility Pack to parse the HTML and extract the data in it.
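A minimal sketch of that approach, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names (TdValues, Extract) are made up for illustration; only HtmlDocument, LoadHtml, SelectNodes, InnerText and HtmlEntity.DeEntitize come from the library.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class TdValues
{
    // Returns the trimmed text of every <td> in the markup, in document order.
    public static string[] Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var cells = doc.DocumentNode.SelectNodes("//td");
        if (cells == null)
        {
            return Array.Empty<string>();
        }

        // DeEntitize turns entities such as &amp; back into plain characters.
        return cells
            .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
            .ToArray();
    }
}
```

Calling TdValues.Extract(html) on the markup from the question should give an array starting with 999999999, Name, 271229, 220.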

Related

Regex c# html tags with specific attribute

I am new to regular expressions :( After a lot of searching I managed to get an answer for my requirement, but I get extra results, as explained below:
My String
<td valign="top" width="100%">
<td width="100%" valign="top">
<td valign="top" height="100%" width="100%">
<td valign="top">
My Expression
/<td (?=.*valign="top")(?=.*width="100%").*>/gm
My Result
<td valign="top" width="100%">
<td width="100%" valign="top">
<td valign="top" height="100%" width="100%">
Expected result
<td valign="top" width="100%">
<td width="100%" valign="top">
Conclusion: I want to extract TD tags that have only the valign and width attributes, with those specific values.
Note: I have to parse lots of data files, so HTML Agility Pack would slow down the overall process.
Kindly guide me to the final expression. Cheers
This seems to be doing it for me:
/<td\s+((valign="top"\s+width="100%")|(width="100%"\s+valign="top"))\s*>/gm
Your expression uses lookaheads that only check whether the two attributes appear somewhere after <td, so a tag with extra attributes still matches. This one requires whitespace after <td, then either valign="top" width="100%" or width="100%" valign="top", followed by optional whitespace before the closing >. That disallows every attribute other than width and valign.
With that said, there are always unexpected situations when using regex. You can test your regular expressions in real time here: http://regexr.com/ Just type in your string and the expression to see what it selects.
EDIT:
If you want to account for both single quotes and double quotes around the attributes, try this one:
/<td\s+((valign=(["'])top\3\s+width=(["'])100%\4)|(width=(["'])100%\6\s+valign=(["'])top\7))\s*>/gm
Now I'm allowing for either a " or ' at the beginning of the attribute's value, and search for a match of whichever one was found at the end of the attribute's value.
Again, I encourage you to go to the website I linked above and play around with these yourself as well. I almost never use regex, but when I do I can usually find an expression that works for me with that website.
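Since the question is tagged C#, here is a sketch of how the same idea might look with System.Text.RegularExpressions. .NET patterns take no /.../gm delimiters; Regex.Matches already finds every match. The class and method names are hypothetical, and the pattern uses non-capturing groups, so the backreferences are renumbered \1 to \4.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TdAttributeFilter
{
    // Matches <td> tags whose only attributes are valign="top" and width="100%",
    // in either order, with matching single or double quotes around each value.
    private static readonly Regex ExactTd = new Regex(
        @"<td\s+(?:valign=([""'])top\1\s+width=([""'])100%\2|width=([""'])100%\3\s+valign=([""'])top\4)\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns every matching opening tag found in the input, in order.
    public static List<string> FindExact(string input)
    {
        var tags = new List<string>();
        foreach (Match match in ExactTd.Matches(input))
        {
            tags.Add(match.Value);
        }
        return tags;
    }
}
```

On the four sample lines from the question, FindExact should return only the first two tags, because the third has an extra height attribute and the fourth has no width.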

Get a data with HTMLagilitypack from a table with same class

I want to scrape data from a website using HtmlAgilityPack. The data is stored in a table, but several TD tags share the same class and I don't know how to separate them into distinct fields.
Here is what I am talking about:
<td class="first even">
Phone number:
</td>
<td class="even">
06522366154
</td>
<td class="first even">
Mobile Number:
</td>
<td class="even">
09163524712
</td>
<td class="first even">
Email:
</td>
<td class="even">
h.ghaletaki@gmail.com
</td>
In this HTML, the mobile number starts with "09xxxx", the phone number starts with "0xxx", and the email has the usual form. I used the code below in C#, but it returns all the values mixed together.
HtmlNodeCollection nodes1 = doc.DocumentNode.SelectNodes("//td[@class='even']");
Thanks
Use starts-with to do the prefix check (does the phone number start with zero?), and use contains to check for the @ in the mail address.
//td[@class = 'even' and (starts-with(normalize-space(.), '0') or contains(., '@'))]
XPath 1.0 has no support for regular expressions. It may be better to do the string manipulation and verification outside XPath, in C#.
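Doing the checks in C# could look roughly like the sketch below. It assumes the HtmlAgilityPack NuGet package and the prefix rules stated in the question ("09" for mobile, "0" for phone, "@" for email). The class name and the dictionary keys are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class ContactFields
{
    // Selects the value cells (class exactly "even") and sorts them into
    // named fields by inspecting the text of each cell.
    public static Dictionary<string, string> Classify(string html)
    {
        var result = new Dictionary<string, string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The label cells have class "first even", so this exact comparison skips them.
        var nodes = doc.DocumentNode.SelectNodes("//td[@class='even']");
        if (nodes == null)
        {
            return result;
        }

        foreach (var node in nodes)
        {
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Contains("@"))
            {
                result["Email"] = text;
            }
            else if (text.StartsWith("09", StringComparison.Ordinal))
            {
                result["Mobile"] = text; // must be tested before the plain "0" prefix
            }
            else if (text.StartsWith("0", StringComparison.Ordinal))
            {
                result["Phone"] = text;
            }
        }
        return result;
    }
}
```

The order of the checks matters: a mobile number also starts with "0", so the "09" test has to come first.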

Given a web response, how do I extract a specific portion for further processing?

I have some code that gets a web response. How do I take that response and search for a table using its CSS class (class="data")? Once I have the table, I need to extract certain field values. For example, in the sample markup below, I need the values of Field #3 and Field #5, so "85" and "1", respectively.
<table width="570" border="0" cellpadding="1" cellspacing="2" class="data">
<tr>
<td width="158"><strong>Field #1:</strong></td>
<td width="99">1</td>
<td width="119"><strong>Field #2:</strong></td>
<td width="176">110</td>
</tr>
<tr>
<td width="158"><strong>Field #3:</strong></td>
<td width="99">85</td>
<td width="119"><strong>Field #4:</strong></td>
<td width="176">-259.34</td>
</tr>
<tr>
<td width="158"><strong>Field #5:</strong></td>
<td width="99">1</td>
<td width="119"><strong>Field #6:</strong></td>
<td width="176">110</td>
</tr>
<tr>
<td width="158"><strong>Field #7:</strong></td>
<td width="99">12</td>
<td width="119"><strong>Field #8:</strong></td>
<td width="176">123.23</td>
</tr>
</table>
Use the HTML Agility Pack and parse the HTML. If you want to do it the simplest way then go grab its beta (it supports LINQ).
As Randolf suggests, using HTML Agility Pack is a good option.
But, if you have control of the format of the HTML, it is also possible to do string parsing to extract the values you are after.
It is nearly trivial to download the entire HTML as a string and search for the string "<table" followed by the string "class=\"data\"". Then you can easily extract the values you are after by doing similar string manipulations.
I'm not saying you should do this, for the resulting code will be harder to read and maintain than the code using HTML Agility Pack, but it will save you an external dependency and your code will probably perform much better.
In a WP7 app I made, I started using HTML Agility Pack to parse some HTML and extract some values. This worked well, but it was quite slow. Switching to the string parsing regime made my code many times faster while returning the exact same result.
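The string-parsing route might be sketched as follows, using only String.IndexOf and Substring. It assumes the markup keeps the exact shape shown in the question (a label cell immediately followed by its value cell, inside the table marked class="data"). The class and method names are hypothetical.

```csharp
using System;

public static class TableFieldExtractor
{
    // Finds `label` inside the table marked class="data" and returns the text
    // of the <td> that follows the label's own cell. Returns false if any
    // expected piece of markup is missing.
    public static bool TryGetFieldValue(string html, string label, out string value)
    {
        value = string.Empty;

        // Limit the search to the table with class="data".
        int tableStart = html.IndexOf("class=\"data\"", StringComparison.OrdinalIgnoreCase);
        if (tableStart < 0) return false;
        int tableEnd = html.IndexOf("</table>", tableStart, StringComparison.OrdinalIgnoreCase);
        if (tableEnd < 0) tableEnd = html.Length;

        int labelPos = html.IndexOf(label, tableStart, tableEnd - tableStart, StringComparison.Ordinal);
        if (labelPos < 0) return false;

        // Skip past the label cell, then find the opening tag of the next cell.
        int labelCellEnd = html.IndexOf("</td>", labelPos, StringComparison.OrdinalIgnoreCase);
        if (labelCellEnd < 0) return false;
        int valueCellStart = html.IndexOf("<td", labelCellEnd, StringComparison.OrdinalIgnoreCase);
        if (valueCellStart < 0 || valueCellStart > tableEnd) return false;

        int valueStart = html.IndexOf('>', valueCellStart);
        if (valueStart < 0) return false;
        valueStart++; // move past the '>' that closes the opening tag

        int valueEnd = html.IndexOf("</td>", valueStart, StringComparison.OrdinalIgnoreCase);
        if (valueEnd < 0) return false;

        value = html.Substring(valueStart, valueEnd - valueStart).Trim();
        return true;
    }
}
```

With the sample table, TryGetFieldValue(html, "Field #3:", out var v) should yield "85" and "Field #5:" should yield "1". The trade-off described above applies: this breaks as soon as the page layout changes.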

How to assign HTML value to a asp.net string variable

I am using ASP.NET and C#. I want to send mail to my users in HTML format. I have the content as HTML, for example:
<table style="width:100%;">
<tr>
<td style="width:20%; background-color:Blue;"></td>
<td style="width:80%; background-color:Green;"></td>
</tr>
</table>
Now I am unable to assign this to a string variable so that I can send it as a mail.
Please let me know how I can put this whole HTML content into a variable.
Also, please note that the above code is only a demo; I have around 100 lines of HTML code.
If you want to explicitly declare the string in code:
string html =
@"<table style=""width:100%;"">
<tr>
<td style=""width:20%; background-color:Blue;""></td>
<td style=""width:80%; background-color:Green;""></td>
</tr>
</table>";
In response to your comment: to insert values, it's simple enough to use StringBuilder to build the string in memory, e.g.
var html = new StringBuilder("<table style=\"width:100%;\">");
html.Append("<tr>");
html.Append("<td style=\"width:20%; background-color:Blue;\">");
html.Append(yourAuthorNameString);
//etc...
or move to a proper html builder or template system like the HTML Agility Pack or NVelocity
I would just keep it in an html file that you open and read in as needed. Good old System.IO.File.ReadAllText(). Putting a large string directly in your source is just begging for frequent re-compilation and deployment.
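A sketch of the file-based approach, assuming the HTML lives in a template file and contains a placeholder token. The token {{AuthorName}}, the class name and the method name are all invented for this example; File.ReadAllText and WebUtility.HtmlEncode are standard .NET calls.

```csharp
using System.IO;
using System.Net;

public static class MailTemplate
{
    // Reads the HTML template from disk and substitutes the (hypothetical)
    // {{AuthorName}} placeholder with an HTML-encoded value.
    public static string Load(string templatePath, string authorName)
    {
        string template = File.ReadAllText(templatePath);

        // Encode the inserted value so characters like < or & cannot break the markup.
        return template.Replace("{{AuthorName}}", WebUtility.HtmlEncode(authorName));
    }
}
```

The returned string can then be assigned to the mail body, and the template can be edited without recompiling.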
string myHtml = @"<table style=""width:100%;"">
<tr>
<td style=""width:20%; background-color:Blue;""></td>
<td style=""width:80%; background-color:Green;""></td>
</tr>
</table>";
Or did I misunderstand your question? In that case, what problem do you encounter and at what stage?
Use @" followed by your HTML (remember to replace each " with ""), then close with "; at the end.

Regex matching tags

I have the following piece of text from which I'd like to extract all the <td ????>???</td> tags
<tr id=row509>
<td id=serv509 align=center class='style1'>Z Deviazione Tecnico Home verso S24 [ NON USATO ]</td>
<td align=center class='style4'>23</td>
<td align=center class='style10'>22</td>
<td align=center class='style6'>0</td>
<td align=center class='style2'>0</td>
<td id=rowtot509 align=center class='style6'>0</td>
<td align=center class='style6'>0</td>
<td align=center class='style2'>0</td>
<td align=center class='style6'>0</td>
</tr>
The expected result would be:
1. <td id=serv509 align=center class='style1'>Z Deviazione Tecnico Home verso S24 [ NON USATO ]</td>
2. <td align=center class='style4'>23</td>
3. <td align=center class='style10'>22</td>
[..]
Any help? Thanks
What's the problem with using an HTML or XML library?
Using XML and XPath, for instance, this would just be a case of doing xml / td, in whatever way the library API supports that.
Regex is a lousy way of doing that, because XML is not a regular language. Specifically, you can nest tags inside other tags, and this is something that can't be represented with regular expressions.
So, while it would be easy to create a regular expression for the simple case (<td.*?</td>), it would easily break if the XML changed just a bit.
Granted, the XML here is broken (the attribute values are unquoted), but you can fix that with a regex. :-) For instance, if you replace the pattern (\w+)=(\w+) with $1='$2' (the syntax .NET uses for replacement patterns), you'll get valid XML.
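For the simple, non-nested case only, the two patterns mentioned above might be used in C# like this. It is a sketch with made-up class and method names, and it inherits every limitation just described: nested tables break the extraction, and the replace also rewrites any name=value text that happens to appear in cell content.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TdRegexSketch
{
    // Returns each <td ...>...</td> element as a string. The lazy .*? stops at
    // the first </td>, so this is only correct when cells are not nested.
    public static List<string> ExtractCells(string html)
    {
        var cells = new List<string>();
        var matches = Regex.Matches(
            html,
            @"<td\b.*?</td>",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
        foreach (Match match in matches)
        {
            cells.Add(match.Value);
        }
        return cells;
    }

    // Wraps unquoted attribute values in single quotes: id=serv509 -> id='serv509'.
    // Already quoted values are left alone because a quote is not a \w character.
    public static string QuoteBareAttributes(string html)
    {
        return Regex.Replace(html, @"(\w+)=(\w+)", "$1='$2'");
    }
}
```

RegexOptions.Singleline lets the dot match newlines, so a cell whose content spans several lines is still captured.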
I would agree with Daniel, but if you really must use a regex - get yourself a copy of RegexBuddy so you can quickly debug your expression. Best $40 I've spent in a long time.
Regular expressions are a pretty fragile tool to use for this kind of problem, especially if there's any risk at all that a table's cell content could be another table. (In that case, the first </td> tag you find after a <td> start tag may not actually be closing that element but a descendant element.)
A much more robust way to tackle problems like these is to parse the HTML into a DOM and then examine the DOM. The HTML Agility Pack is one that people seem to like.
