How to get Regex for TDs in a TR - c#

I need help with regular expressions (using C#).
In the HTML source code I have something like this.
[...]
<TR class=tblDataGreyNH>
<TD style="TEXT-ALIGN: right; FONT-WEIGHT: bold" class=tblHeader>Total Time </TD>
<TD>07:47 </TD>
<TD>04:48 </TD>
<TD>00:00 </TD>
<TD>00:00 </TD>
<TD>07:42 </TD>
<TD>00:00 </TD>
<TD>00:00 </TD></TR>
[..]
<TR class=tblDataGreyNH nowrap>
<TD>Total </TD>
<TD>20:17 </TD></TR>
<TR style="FONT-WEIGHT: bold" class=tblDataWhiteNH nowrap>
<TD>Total Time </TD>
<TD width=75>20:17 </TD></TR></TBODY></TABLE></TD>
<TD colSpan=3>
...
The class names are always the same.
I need all the TDs parsed into a string array.
tblDataGreyNH is the important class.
Here is the whole table that contains the TDs (in case you need it).
<table class="tblList">
<form action="/interface/timesheet/ViewUserTimeSheet.php" method="get" name="timesheet"></form>
<tbody>
<tr>
<tr class="tblHeader">
<tr class="tblHeader">
<tr class="tblDataWhiteNH">
<tr class="tblDataWhiteNH">
<tr class="tblHeader">
<tr class="tblDataGreyNH">
<td class="tblHeader" style="font-weight: bold; text-align: right"> Total Time </td>
<td> 07:47 </td>
<td> 04:48 </td>
<td> 00:00 </td>
<td> 00:00 </td>
<td> 07:42 </td>
<td> 00:00 </td>
<td> 00:00 </td>
</tr>
<tr class="tblDataWhiteNH">
<tr class="tblHeader">
<tr valign="top">
</tbody>
</table>
I hope there is someone who can help me with this problem.
Regex seems impossible to understand for me.
I can't grasp the basics with that ReGeX stuff!? :/

Don't use regex for HTML. I would suggest checking out the HtmlAgilityPack.
It is very simple:
var doc = new HtmlDocument();
doc.LoadHtml("...your sample html...");
// all <td> tags in the document
foreach (HtmlNode td in doc.DocumentNode.SelectNodes("//td"))
{
Console.WriteLine(td.InnerText);
}
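The loop above prints every cell in the document. Since the question only cares about rows with class tblDataGreyNH and wants a string array, a narrower XPath may be closer to what was asked. This is a minimal sketch, assuming the HtmlAgilityPack NuGet package is installed; the class name TimesheetCells and the method GetCells are made up for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack" (assumed to be installed)

public static class TimesheetCells
{
    // Hypothetical helper: returns the trimmed text of every <td> that is a
    // direct child of a <tr> whose class attribute contains rowClass.
    public static string[] GetCells(string html, string rowClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var cells = doc.DocumentNode.SelectNodes(
            "//tr[contains(@class,'" + rowClass + "')]/td");
        if (cells == null)
            return new string[0];

        // DeEntitize turns entities such as &nbsp; back into characters before trimming.
        return cells
            .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
            .ToArray();
    }
}
```

Usage would look like `string[] values = TimesheetCells.GetCells(html, "tblDataGreyNH");`. The null check matters because the plain foreach over SelectNodes throws when the page contains no matching rows.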

You should not use regex to parse HTML (one of many refs: link)
There exists a great .NET library named HtmlAgilityPack that I would recommend.

Related

Using HtmlAgilityPack to get a specific row and column data

This is my table
<table class="DataRows" frame="myFrames" rules="Standard" width="100%">
<colgroup><col width="70" align="CENTER">
<col width="200" align="LEFT">
<col width="80" align="LEFT">
<col align="LEFT">
<col align="RIGHT">
</colgroup><thead>
<col width="70" align="CENTER">
<col width="200" align="LEFT">
<col width="80" align="LEFT">
<col align="LEFT">
<col align="RIGHT">
<thead>
<tr>
<td valign="TOP"><span class="classicBold"> 20 </span> Kg.
<td class="BOLD" valign="TOP" nowrap="">
PA Passion Foods Inc.
<td class="BOLD">Fax:
<td>
222-555666
<td class="BOLD">
Processed foods and juices
<tr>
<td><a target="_blank" href="">See on Map </a>
<td>
120 NW 157TH AVE
<td class="BOLD">Warehouse Hours:
<td colspan="2">
<tr>
<td>
<td><span class="BOLD">
Jacksonville,
</span>
FL 300000
<td class="BOLD">Url:
<td colspan="2">
<a target="_blank" href="">PA Passion</a>
&nbsp&nbsp
<span class="BOLD">E-mail:</span>
zoro@xyz.com
<tr>
<td>
<td class="REDBOLD" colspan="4">
<tr>
<td>
<td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
Nutrella
</span>
<tr>
<td>
<td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
APPLE Foods, Constants
</span>
<tr>
<td>
<td colspan="4" align="LEFT"><span class="BOLD">
</span>
<tr>
<td>
<td colspan="4" align="LEFT">We service:<span class="BOLD">
All occasions and hospitality services
</span>
<tr>
<td>
<td colspan="4" align="LEFT">We sell :<span class="BOLD">
----
</span>
</td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></td></td></tr></td></td></td></td></tr></td></td></td></td></td></tr>
</thead>
</table>
I am looping through each node in my Html document using the code below
foreach (HtmlNode node in htmlAgilityPackDoc.DocumentNode.SelectNodes("//table[contains(@class,'DataRows')]"))
{
}
When I use the following
node.SelectSingleNode(".//tr[1]/td[1]").InnerHtml
I get the following html
<span class="classicBold"> 20 </span> Kg.
<td class="BOLD" valign="TOP" nowrap="">
PA Passion Foods Inc.
<td class="BOLD">Fax:
<td>
222-555666
<td class="BOLD">
Processed foods and juices
<tr>
<td><a target="_blank" href="">See on Map </a>
<td>
120 NW 157TH AVE
<td class="BOLD">Warehouse Hours:
<td colspan="2">
<tr>
<td>
<td><span class="BOLD">
Jacksonville,
</span>
FL 300000
<td class="BOLD">Url:
<td colspan="2">
<a target="_blank" href="">PA Passion</a>
&nbsp&nbsp
<span class="BOLD">E-mail:</span>
zoro@xyz.com
<tr>
<td>
<td class="REDBOLD" colspan="4">
<tr>
<td>
<td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
Nutrella
</span>
<tr>
<td>
<td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
APPLE Foods, Constants
</span>
<tr>
<td>
<td colspan="4" align="LEFT"><span class="BOLD">
</span>
<tr>
<td>
<td colspan="4" align="LEFT">We service:<span class="BOLD">
All occasions and hospitality services
</span>
<tr>
<td>
<td colspan="4" align="LEFT">We sell :<span class="BOLD">
----
</span>
</td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></td></td></tr></td></td></td></td></tr></td></td></td></td></td>
How do I extract the address 120 NW 157TH AVE from this ?
When I tried using
node.SelectSingleNode(".//td[@class='BOLD'][4]/preceding-sibling::td").InnerText;
I get an error:
Object reference not set to an instance of an object
Your HTML is a mess: the tags are overlapping. I suggest you use text nodes as your identifiers rather than indices, for example
.//td[./a[contains(text(),'See on Map')]]/td/text()
to get
120 NW 157TH AVE
Here is a full example that gets you everything
var table = doc.DocumentNode.SelectSingleNode("//table[contains(@class,'DataRows')]");
var name = table.SelectSingleNode(".//td[@class='BOLD']/text()").InnerText.Trim();
var fax = table.SelectSingleNode(".//td[contains(text(),'Fax')]/td/text()").InnerText.Trim();
var email = table.SelectSingleNode(".//span[contains(text(),'E-mail')]/following-sibling::text()").InnerText.Trim();
var address = table.SelectSingleNode(".//td[./a[contains(text(),'See on Map')]]/td/text()").InnerText.Trim();
var city = table.SelectSingleNode(".//tr[./td/a[contains(text(),'See on Map')]]//tr/td/td/span").InnerText.Trim(',');
var zip = table.SelectSingleNode(".//tr[./td/a[contains(text(),'See on Map')]]//tr/td/td/span/following-sibling::text()").InnerText.Trim();
Note that because your HTML is so messy, the XPaths have to be just as messy. Trying to access a tr element by index won't work, because each tr element is a child of the previous tr: what would be .//tr[4] in a normal table is .//tr/tr/tr/tr in your table.
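As for the "Object reference not set to an instance of an object" error itself: SelectSingleNode returns null when the XPath matches nothing, so reading .InnerText on the result throws. A small guard makes a failed lookup visible instead of crashing. This is only a sketch, assuming the HtmlAgilityPack NuGet package; XPathText and TextOrNull are hypothetical names.

```csharp
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack" (assumed to be installed)

public static class XPathText
{
    // Hypothetical helper: returns the trimmed inner text of the first node
    // matched by xpath, or null when the expression matches nothing.
    public static string TextOrNull(HtmlNode context, string xpath)
    {
        HtmlNode hit = context.SelectSingleNode(xpath);
        return hit == null ? null : hit.InnerText.Trim();
    }
}
```

For example, `var address = XPathText.TextOrNull(table, ".//td[./a[contains(text(),'See on Map')]]/td/text()") ?? "(not found)";` yields a placeholder instead of an exception when the page layout changes.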

Webrowser manipulate HTML Table

I'm trying to manipulate an HTML table opened in a WebBrowser control. This tool will be used to access a SharePoint page with an auto-login option. So far this is what I have:
HtmlElementCollection htmlcol =
wb.Document.GetElementsByTagName("formTextfield277");
for (int i = 0; i < htmlcol.Count; i++)
{
if (htmlcol[i].Name == "portal_id")
{
htmlcol[i].SetAttribute("VALUE",
Properties.Settings.Default.sharepoint_user);
}
else if (htmlcol[i].Name == "password")
{
htmlcol[i].SetAttribute("VALUE",
Properties.Settings.Default.sharepoint_pw);
}
}
This C# code is meant to manipulate this HTML page:
<TABLE CELLSPACING="0" CELLPADDING="0" WIDTH="100%" BORDER="0">
<TR>
<TD CLASS="txtRedBold10" WIDTH="4"> </TD>
<TD CLASS="txtRedBold10" COLSPAN="2" HEIGHT="30">Please log in</TD>
</TR>
<TR>
<TD CLASS="txtBlackReg10" WIDTH="4"> </TD>
<TD CLASS="txtBlackReg10">Username:</TD>
<TD><INPUT CLASS="formTextfield277" TYPE="text" NAME="portal_id" VALUE="" VCARD_NAME="vCard.Email" SIZE="28"></TD>
</TR>
<TR>
<TD CLASS="txtBlackReg10" COLSPAN="3"> </TD>
</TR>
<TR>
<TD CLASS="txtBlackReg10" COLSPAN="2"> </TD>
<TD CLASS="txtBlackReg10">Please enter your username or E-Mail Address</TD>
</TR>
<TR>
<TD CLASS="txtBlackReg10" COLSPAN="3"> </TD>
</TR>
<TR>
<TD CLASS="txtBlackReg10" WIDTH="4"> </TD>
<TD CLASS="txtBlackReg10">Password:</TD>
<TD><INPUT CLASS="formTextfield277" TYPE="password" NAME="password" SIZE="28" AUTOCOMPLETE="off"></TD>
</TR>
<TR>
<TD CLASS="txtBlackReg10" COLSPAN="3"> </TD>
</TR>
<TR>
<TD CLASS="txtBlackReg10" COLSPAN="2"> </TD>
<TD CLASS="txtBlackReg10">Please enter your network or Intranet password</TD>
</TR>
<TR>
<TD CLASS="txtBlackReg10" COLSPAN="3"> </TD>
</TR>
<TR>
<TD CLASS="txtBlackReg10" COLSPAN="2"> </TD>
<TD CLASS="txtBlackReg10">
<TABLE CELLSPACING="0" CELLPADDING="0" BORDER="0">
<TR>
<TD><INPUT TYPE="image" HEIGHT="24" WIDTH="20" SRC="images/cp_arrow.gif" VALUE="Log In"
BORDER="0"></TD>
<TD><A CLASS="linkTxtRedBold10" HREF="javascript:signin()"
onClick="saveForm()">Login</A>
</TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
<TD CLASS="txtBlackReg10" COLSPAN="3"> </TD>
</TR>
</TABLE>
Any suggestions?
Thanks in advance!
Use wb.Document.GetElementsByTagName("input"), not wb.Document.GetElementsByTagName("formTextfield277"): formTextfield277 is the CSS class of the inputs, not a tag name.
HtmlElementCollection inputHtmlCollection = wb.Document.GetElementsByTagName("input");
foreach (HtmlElement anInputElement in inputHtmlCollection)
{
if (anInputElement.Name.Equals("portal_id"))
{
anInputElement.SetAttribute("VALUE", Properties.Settings.Default.sharepoint_user);
}
if (anInputElement.Name.Equals("password"))
{
anInputElement.SetAttribute("VALUE", Properties.Settings.Default.sharepoint_pw);
}
}
Hope this helps!

how to store html code EMail template in a string variable or textbox c#

In my program user will select template message and it will show up in the body textbox. The template is in HTML Format.
if(templatelistbox.SelectedItem.ToString().Equals("Order Processing"))
{
string m;
i want to store the below html to string.
<body>
<table width="580" align="center" cellpadding="0" cellspacing="0" bgcolor="#FFFFFF">
<tbody>
<tr>
<td colspan="2" align="right" valign="middle" width="580" height="60" bgcolor="00496f"><div wotsearchtarget="flipkart.com"></div></td>
</tr>
<tr>
<td colspan="2" align="left" valign="middle" width="580" bgcolor="3bb1d7"><p><strong>Order Confirmed!</strong></p></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="580" bgcolor="ffffff"><p> </p>
<p>Dear <strong>Praveen,</strong></p>
<p>Greetings from Prostyle PC Kart</p>
<p>We thank you for your order. This email contains your order summary. When the item(s) in your order are ready to ship, you will receive an email with the Courier Tracking ID and the link where you can track your order. You can also check the status of your order on <strong>ebay.in</strong></p>
Please find below the summary of your order <strong> OD3</strong></a>
at Prostyle Pc Kart Seller of eBay.in:
</p></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="580" bgcolor="ffffff"><table border="0" cellspacing="0" cellpadding="0" width="580">
<tbody>
<tr>
<td colspan="7"><p><strong>Order ID: OD3
|  Seller ID : eshop.prostylepc.in</strong></p>
<p> <strong>Item (s) Ordered:</strong></p></td>
</tr>
<tr>
<td width="274" valign="top"><p><strong>Product Details</strong></p></td>
<td width="112" valign="top"><p><strong>Shipping Date</strong></p></td>
<td width="62" valign="top"><p><strong>Ordered Quantity</strong></p></td>
<td width="132" valign="top"><p><center><strong>Price</strong></center></p></td>
</tr>
<tr>
<td width="274" valign="top"><p>XOLO Q800</p>
</td>
<td width="112" valign="top"><p> </p></td>
<td width="62" valign="top"><center><p>1</p></center></td>
<td width="132" valign="top" align="center"><p>Rs. 9700</p></td>
<tr>
<td width="274" valign="top"><p>Samsung BHM1100NBEGINU In-the-ear Headset without Charger</p>
</td>
<td width="112" valign="top"><p> </p></td>
<td width="62" valign="top"><center><p>1</p></center></td>
<td width="132" valign="top" align="center"><p>Rs. 699</p></td>
</tr>
<tr>
<td colspan="2"><p>Shipping Charge</p></td>
<td width="62"><center><p>FREE</p></center></td>
</tr>
<tr>
<td valign="top" colspan="3"><p><strong>Total</strong></p></td>
<td width="132" valign="top"><p><strong>Rs. 9700</strong></p></td>
</tr>
<tr>
.......................................................
bodytbox.Text = m;
}
Tell me any way to store this HTML code in a C# string variable, or alternatively how to send this template via email.
I also need to put some parameters in the template, e.g. OrderNo.xxxxxx replaced with pid (a static that contains some value).
Thanks in advance.
Save it to a file and load that file into a string, or store it in the application settings (Project -> Properties -> Settings).
string htmlCode = System.IO.File.ReadAllText("File Path");
With application settings:
string htmlCode = Properties.Settings.Default.HtmlCode;
After that you can do whatever you want with it.
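The question also asks about putting a parameter such as the order number into the template. One way to combine that with the file approach above is sketched below; it assumes the template file contains a marker of your choosing. The token "{OrderNo}" and the names OrderTemplate and Render are hypothetical and not part of the original template.

```csharp
using System;
using System.IO;

public static class OrderTemplate
{
    // Hypothetical helper: loads the HTML template from disk and substitutes
    // one placeholder token. "{OrderNo}" is an assumed marker; use whatever
    // marker your own template file contains.
    public static string Render(string templatePath, string orderNo)
    {
        string html = File.ReadAllText(templatePath);
        return html.Replace("{OrderNo}", orderNo);
    }
}
```

In the handler from the question this could be used as `bodytbox.Text = OrderTemplate.Render(path, pid);`, where path is wherever you saved the template. If the result is later sent with System.Net.Mail, MailMessage.IsBodyHtml needs to be set to true so the body is not delivered as plain text.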
you can use Template parser
Codeplex library and Documentation

Read in HTML file and replace with variables

I have an HTML file that will act as a template for an email that I am going to send out. There are fields in the html that are variable. I was wondering if there is a robust way to replace the placeholders in the HTML file with the variables. I know I could string.Replace all of them, but that isn't ideal since I have a lot of variables. Here is what the html file looks like
<html>
<head>
<title></title>
</head>
<body>
<div>
Please read the Cruise Control Details Below<br>
<br>
<table width='100%'>
<tr>
<td width='100%' colspan='5'>
<font size='4'><b>Release Details</b></font>
</td>
</tr>
<tr>
<td width='20%'>
<b>RFC Ticket #</b>
</td>
<td>
%release.RFCTicket%
</td>
<td>
</td>
<td>
</td>
<td width='10%'>
</td>
<td width='20%'>
<b>Project / Release Name</b>
</td>
<td width='20%'>
%release.ReleaseName%
</td>
</tr>
<tr>
<td width='20%'>
<b>Release Date</b>
</td>
<td width='20%'>
%release.ReleaseDateString%
</td>
<td>
</td>
<td>
</td>
<td width='10%'>
</td>
<td width='20%'>
<b>Release Time</b>
</td>
<td width='20%'>
%release.ReleaseTimeString%
</td>
</tr>
<tr>
<td width='20%'>
<b>CAB Approval Status</b>
</td>
<td width='20%'>
%release.CABApproval%
</td>
</tr>
<tr>
<td width='100%' colspan='5'>
</td>
</tr>
<tr>
<td width='100%' colspan='5'>
<font size='4'><b>Contact Information:</b></font>
</td>
</tr>
<tr>
<td width='20%'>
<b>Project / Team Lead</b>
</td>
<td width='20%'>
%release.TeamLead%
</td>
<td width='10%'>
</td>
<td width='20%'>
<b>On Call DSE</b>
</td>
<td width='20%'>
%release.OnCallDSE%
</td>
</tr>
<tr>
<td width='20%'>
<b>Phone</b>
</td>
<td width='20%'>
%release.ContactInfo%
</td>
<td>
</td>
<td>
</td>
<td>
</td>
<td width='10%'>
</td>
<td width='20%'>
<b>Phone</b>
</td>
<td width='20%'>
%release.OnCallDSEContact%
</td>
</tr>
<tr>
<td>
</td>
</tr>
<tr>
<td width='100%' colspan='5'>
<font size='4'><b>Migration Details:</b></font>
</td>
</tr>
<tr>
<td width='20%'>
<b>Deploy Dashboard</b>
</td>
<td width='20%'>
</td>
<td width='10%'>
</td>
<td width='20%'>
<td>
</td>
<td>
</td>
<b>Deploy Task</b>
</td>
<td width='20%'>
</td>
</tr>
%createTaskTable(ParseSpecialInstuctions().Split('|'))%</table>
</div>
I would like to replace the values in between the "%%" with the variable in code that represents them. I could easily
string.Replace("%release.RFCTicket%",release.RFCTicket);
But that's a bit convoluted in my opinion since I have like 10 or so variables in the file. Are there any built in methods that do what I am asking? Any help would be appreciated, thanks!
Use a regular expression to find your matches. I believe the appropriate regular expression would be along the lines of:
%release\.\S+%
From there, you can examine each match and parse the member name out of it. Then you can get the value of that member from your instance (release in this case) via reflection, and do a string replace.
Something like this. It could use some refactoring to eliminate redundant calls, and I don't know if it fully works, but you get the idea...
var regex = new Regex(@"%release\.\S+%"); // verbatim string so \. and \S reach the regex engine
var match = regex.Match(htmlText);
while (match.Success)
{
var value = match.Value;
var memberName = ParseMemberName(value); //Some code you write to parse out the member name from the match value
var propertyInfo = release.GetType().GetProperty(memberName);
var memberValue = propertyInfo.GetValue(release, null);
htmlText = htmlText.Replace(value, memberValue != null ? memberValue.ToString() : string.Empty);
match = match.NextMatch();
}
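The same idea can be written more compactly with Regex.Replace and a match evaluator, which removes the manual loop and the separate member-name parsing step. This is a sketch rather than a drop-in replacement: it swaps \S+ for a captured \w+ group, and the names ReleaseTemplate and Fill are made up for illustration.

```csharp
using System;
using System.Reflection;
using System.Text.RegularExpressions;

public static class ReleaseTemplate
{
    // Matches %release.SomeProperty% and captures the property name in group 1.
    private static readonly Regex Placeholder = new Regex(@"%release\.(\w+)%");

    // Hypothetical helper: replaces each placeholder with the value of the
    // public property of the same name on the release object.
    public static string Fill(string htmlText, object release)
    {
        return Placeholder.Replace(htmlText, match =>
        {
            PropertyInfo property = release.GetType().GetProperty(match.Groups[1].Value);
            if (property == null)
                return match.Value; // unknown placeholder: leave it untouched

            object value = property.GetValue(release, null);
            return value != null ? value.ToString() : string.Empty;
        });
    }
}
```

Called as `htmlText = ReleaseTemplate.Fill(htmlText, release);`. Placeholders that are not simple property names, such as %createTaskTable(...)% in the template above, do not match the pattern and are left as they are.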
This is a tailor-made problem for a preprocessed T4 template.
You can have your HTML preformatted in the template and let the template engine do the replacement. A small example is below.
<div>
Please read the Cruise Control Details Below<br>
<br>
<table width='100%'>
<tr>
<td width='100%' colspan='5'>
<font size='4'><b>Release Details</b></font>
</td>
</tr>
<tr>
<td width='20%'>
<b>RFC Ticket #</b>
</td>
<td>
<#= RCFTicketVariable #>
</td>
You can use the Apache Velocity Engine port to .Net to do the templating for you
http://velocity.apache.org/engine/
http://velocity.apache.org/engine/devel/user-guide.html
http://nvelocity.sourceforge.net/
I would consider using regex (regular expressions) and giving the placeholders some sort of special tag (ex: ) so that you loop over all the tags that begin with .
Then you fill your data from a list or DataTable and do one single loop for all the replacements.
check these for help:
http://msdn.microsoft.com/en-us/library/twcw2f1c.aspx
http://www.regular-expressions.info/examples.html (your exact case is mentioned under Grabbing HTML Tags)

Advanced HTML Agility Pack usage

I am pretty new to the HTML Agility Pack so I need some help with where to go next. I can do some simple things like pull a value from an href (knowing the URL string I was looking for), and I can pull the value in a span based on a specific class that was being used. But I do not understand how to use the HTML Agility Pack in a situation where there are a ton of or tags and there is not one real solid anchor to tie to?
Here is an actual chunk of code I am scraping through. I placed dummy data in the cells to demonstrate what I am looking for.
What is the best way to extract the following:
1.) Company Name?
2.) Phone Number?
3.) Email Address?
HTML....
<td>
<!-- Company Info -->
<table cellpadding="0" cellspacing="0" border="0">
<tr>
<td class="black">
<table cellspacing="1" cellpadding="0" border="0" width="370">
<tr>
<th>COMPANY NAME</th>
</tr>
<tr>
<td class="search">
<table cellpadding="5" cellspacing="0" border="0" width="100%">
<tr>
<td>
<table cellpadding="1" cellspacing="0" border="0" width="100%">
<tr>
<td colspan="2" align="center">Un-needed Links...</td>
</tr>
<tr>
<td align="center" colspan="2"><hr></td>
</tr>
<tr>
<td align="right" nowrap>
<b>
<font color="FF0000">
Contact Person
<img src="/images/icon_contact.gif" align="absmiddle"> :
</font>
</b>
</td>
<td align="left" width="100%"> Judy Smith</td>
</tr>
<tr>
<td align="right" nowrap>
<b><font color="FF0000">Phone Number <img src="/images/icon_phone.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> 555-555-5555</td>
</tr>
<tr>
<td align="right" nowrap><b><font color="FF0000">E-mail Address <img src="/images/icon_email.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> judy.smith@companyname.com</td>
</tr>
<tr>
<td align="center" colspan="2"><hr></td>
</tr>
<tr>
<td align="right" nowrap><b><font color="FF0000">Home Office Location <img src="/images/icon_home.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> ATLANTA, GA</td>
</tr>
<tr>
<td align="right" nowrap><b><font color="FF0000">Home Office Phone <img src="/images/icon_home.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> 555-555-5555</td>
</tr>
<tr>
<td align="right" nowrap><b><font color="FF0000">Home Office Fax <img src="/images/icon_home.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> 666-666-6666</td>
</tr>
<tr>
<td align="center" colspan="2"><hr></td>
</tr>
<tr>
<td align="right" nowrap><b><font color="FF0000">Broker MC Number <img src="/images/icon_number.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> 123456</td>
</tr>
<tr>
<td align="right" nowrap><b><font color="FF0000">Carrier MC Number <img src="/images/icon_number.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> 654321</td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
</table>
<br>
<!-- Starting Point -->
<table cellpadding="0" cellspacing="0" border="0">
<tr>
<td class="black">
<table cellspacing="1" cellpadding="0" border="0" width="370">
<tr>
<th>Starting Point</th>
<th>Available</th>
</tr>
<tr>
<td class="search" width="270"> <b>ABBEVILLE, GA </b></td>
<td class="search" align="center" width="100"><span style="color: forestgreen"> 1/5/11 </span></td>
</tr>
</table>
</td>
</tr>
</table>
<br>
<!-- Destination Point -->
<table cellpadding="0" cellspacing="0" border="0">
<tr>
<td class="black">
<table cellspacing="1" cellpadding="0" border="0" width="370">
<tr>
<th>Destination Point</th>
<th>Direction</th>
</tr>
<tr>
<td class="search" width="270"> <b>ATLANTA, GA </b></td>
<td class="search" align="center" width="100"><span style="color: FF0000"> </span></td>
</tr>
</table>
</td>
</tr>
</table>
<br>
<!-- Truck Details -->
<table cellpadding="0" cellspacing="0" border="0">
<tr>
<td class="black">
<table cellspacing="1" cellpadding="0" border="0" width="370">
<tr>
<th>Truck Details</th>
</tr>
<tr>
<td class="search">
<table cellpadding="5" cellspacing="0" border="0">
<tr>
<td>
<table cellpadding="0" cellspacing="0" border="0">
<tr>
<td align="right"><b>Date Posted :</b></td>
<td align="left"> 1/5/2011 10:34:48 AM</td>
</tr>
<tr>
<td align="right"><b>Quantity :</b></td>
<td align="left"> 1</td>
</tr>
<tr>
<td align="right"><b>Equipment Type :</b></td>
<td align="left"> FT</td>
</tr>
<tr>
<td align="right"><b>Load Size :</b></td>
<td align="left"> Full</td>
</tr>
<tr>
<td align="right" valign="top"><b>Special Information :</b></td>
<td align="left"> </td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
</table>
<br>
</td>
....More HTML
Well, you have to understand XPath to really take advantage of the HTML Agility Pack's scraping capabilities :-) You can Google XPath examples to start with.
Focusing on the screen-scraping question, the tricky part is to select what you think is the most discriminant xpath expression for the information you want to get. Most of the time, there is not only one solution, and you must be prepared to update your code to stick with the target site HTML evolution.
So it's a trade off between very simple expressions with a risk that they match unwanted texts, and too discriminant expressions, not tolerant with evolutions in the scraped HTML, with a risk that they match nothing.
As for your specific text, this is a good real-world example, and here is some code that does it:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(yourText);
string companyName = doc.DocumentNode.SelectSingleNode("/td/table/tr/td/table/tr/th").InnerText;
Console.WriteLine("company name=" + companyName);
// another way
companyName = doc.DocumentNode.SelectSingleNode("//td[@class='black']/table/tr/th").InnerText;
Console.WriteLine("company name=" + companyName);
// a more advanced XPath expression, which means
// "Select a TD tag anywhere in the doc that has a preceding sibling of TD type with a B child, with a FONT child with inner text starting with 'Phone Number'"
string phoneNumber = doc.DocumentNode.SelectSingleNode("//td[starts-with(preceding-sibling::td/b/font/text(), 'Phone Number')]").InnerText;
Console.WriteLine("phone Number=" + phoneNumber);
// same kind of story but go down the next A tag
string email = doc.DocumentNode.SelectSingleNode("//td[starts-with(preceding-sibling::td/b/font/text(), 'E-mail')]/a").InnerText;
Console.WriteLine("email=" + email);
PS: please note that the HTML Agility Pack always expects tags used in XPath expressions to be lowercase, even if they're not in the original HTML text.
As you see, the company name is retrieved here using two different expressions. They both work on the sample, but the first one will not resist if a new tag is added anywhere in the middle. The second one is more future-proof but is based on a CSS class tag that also may change. It's always a trade-off.
The phone number & email are similar but show the power of XPATH.
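Since every field in the sample follows the same "label cell, then value cell" layout, the label-based lookups can be folded into one helper. This is a rough sketch under a few assumptions: the HtmlAgilityPack NuGet package is installed, labels contain no apostrophes (they are pasted into the XPath as-is), and LabelLookup and ValueFor are invented names.

```csharp
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack" (assumed to be installed)

public static class LabelLookup
{
    // Hypothetical helper: finds the innermost <td> whose text contains label
    // and returns the trimmed text of the <td> right after it, or null.
    // not(.//td) skips outer cells that merely wrap nested tables.
    public static string ValueFor(HtmlDocument doc, string label)
    {
        HtmlNode cell = doc.DocumentNode.SelectSingleNode(
            "//td[not(.//td) and contains(., '" + label + "')]/following-sibling::td[1]");
        return cell == null ? null : HtmlEntity.DeEntitize(cell.InnerText).Trim();
    }
}
```

With the document loaded as above, `LabelLookup.ValueFor(doc, "Phone Number")` and `LabelLookup.ValueFor(doc, "E-mail Address")` would replace the two hand-written expressions. The trade-off is the same as before: matching on label text is loose, so a label that is a substring of another label can pick the wrong row.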
