Read in HTML file and replace with variables - c#

I have an HTML file that will act as a template for an email that I am going to send out. Some fields in the HTML are variable. I was wondering if there is a robust way to replace the placeholders in the HTML file with the variables. I know I could string.Replace all of them, but that isn't ideal since I have a lot of variables. Here is what the HTML file looks like:
<html>
<head>
<title></title>
</head>
<body>
<div>
Please read the Cruise Control Details Below<br>
<br>
<table width='100%'>
<tr>
<td width='100%' colspan='5'>
<font size='4'><b>Release Details</b></font>
</td>
</tr>
<tr>
<td width='20%'>
<b>RFC Ticket #</b>
</td>
<td>
%release.RFCTicket%
</td>
<td>
</td>
<td>
</td>
<td width='10%'>
</td>
<td width='20%'>
<b>Project / Release Name</b>
</td>
<td width='20%'>
%release.ReleaseName%
</td>
</tr>
<tr>
<td width='20%'>
<b>Release Date</b>
</td>
<td width='20%'>
%release.ReleaseDateString%
</td>
<td>
</td>
<td>
</td>
<td width='10%'>
</td>
<td width='20%'>
<b>Release Time</b>
</td>
<td width='20%'>
%release.ReleaseTimeString%
</td>
</tr>
<tr>
<td width='20%'>
<b>CAB Approval Status</b>
</td>
<td width='20%'>
%release.CABApproval%
</td>
</tr>
<tr>
<td width='100%' colspan='5'>
</td>
</tr>
<tr>
<td width='100%' colspan='5'>
<font size='4'><b>Contact Information:</b></font>
</td>
</tr>
<tr>
<td width='20%'>
<b>Project / Team Lead</b>
</td>
<td width='20%'>
%release.TeamLead%
</td>
<td width='10%'>
</td>
<td width='20%'>
<b>On Call DSE</b>
</td>
<td width='20%'>
%release.OnCallDSE%
</td>
</tr>
<tr>
<td width='20%'>
<b>Phone</b>
</td>
<td width='20%'>
%release.ContactInfo%
</td>
<td>
</td>
<td>
</td>
<td>
</td>
<td width='10%'>
</td>
<td width='20%'>
<b>Phone</b>
</td>
<td width='20%'>
%release.OnCallDSEContact%
</td>
</tr>
<tr>
<td>
</td>
</tr>
<tr>
<td width='100%' colspan='5'>
<font size='4'><b>Migration Details:</b></font>
</td>
</tr>
<tr>
<td width='20%'>
<b>Deploy Dashboard</b>
</td>
<td width='20%'>
</td>
<td width='10%'>
</td>
<td width='20%'>
<td>
</td>
<td>
</td>
<b>Deploy Task</b>
</td>
<td width='20%'>
</td>
</tr>
%createTaskTable(ParseSpecialInstuctions().Split('|'))%</table>
</div>
I would like to replace the values between the "%" signs with the variables in code that represent them. I could easily do
htmlText = htmlText.Replace("%release.RFCTicket%", release.RFCTicket);
But that's a bit convoluted in my opinion, since I have 10 or so variables in the file. Are there any built-in methods that do what I am asking? Any help would be appreciated, thanks!

Use a regular expression to find your matches. I believe the appropriate regular expression would be along the lines of:
%release\.\S+%
From there, you can examine each match and parse the member name out of it. Then you can get the value of that member from your instance (release in this case) via reflection, and do a string replace.
Something like this. It could use some refactoring to eliminate redundant calls, and I don't know if it fully works, but you get the idea...
var regex = new Regex(@"%release\.\S+%"); // verbatim string so the backslashes reach the regex engine
var match = regex.Match(htmlText);
while (match.Success)
{
var value = match.Value;
var memberName = ParseMemberName(value); //Some code you write to parse out the member name from the match value
var propertyInfo = release.GetType().GetProperty(memberName);
var memberValue = propertyInfo.GetValue(release, null);
htmlText = htmlText.Replace(value, memberValue != null ? memberValue.ToString() : string.Empty);
match = match.NextMatch();
}
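The same idea can be written as a single pass with Regex.Replace and a match evaluator, which avoids the manual loop and the separate name-parsing helper. This is only a sketch under assumptions: the Release class, its two properties, and the TemplateFiller name are hypothetical stand-ins for your own types, and the pattern uses \w+ rather than \S+ so that two placeholders on the same line are not swallowed as one match.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the real release object.
public class Release
{
    public string RFCTicket { get; set; }
    public string ReleaseName { get; set; }
}

public static class TemplateFiller
{
    // Matches %release.SomeProperty% and captures the property name in group 1.
    private static readonly Regex Placeholder =
        new Regex(@"%release\.(\w+)%", RegexOptions.Compiled);

    public static string Fill(string template, object release)
    {
        return Placeholder.Replace(template, match =>
        {
            // Look the captured name up as a public property via reflection.
            var property = release.GetType().GetProperty(match.Groups[1].Value);
            if (property == null)
            {
                return match.Value; // unknown placeholder: leave it untouched
            }

            var value = property.GetValue(release, null);
            return value != null ? value.ToString() : string.Empty;
        });
    }
}
```

Calling TemplateFiller.Fill(htmlText, release) returns the filled-in template. If the values can contain characters such as < or &, it is probably worth passing them through System.Net.WebUtility.HtmlEncode before they are inserted into the HTML.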

This is a tailor-made problem for a preprocessed T4 template.
You can have your HTML preformatted in the template and let the template engine do the replacement. A small example is below.
<div>
Please read the Cruise Control Details Below<br>
<br>
<table width='100%'>
<tr>
<td width='100%' colspan='5'>
<font size='4'><b>Release Details</b></font>
</td>
</tr>
<tr>
<td width='20%'>
<b>RFC Ticket #</b>
</td>
<td>
<#= RCFTicketVariable #>
</td>

You can use NVelocity, the .NET port of the Apache Velocity engine, to do the templating for you:
http://velocity.apache.org/engine/
http://velocity.apache.org/engine/devel/user-guide.html
http://nvelocity.sourceforge.net/

I would consider using regular expressions and giving the placeholders some sort of special tag, so you can loop over all the tags that begin with it.
Then you load your data into a list or DataTable and do all the replacements in one single loop.
Check these for help:
http://msdn.microsoft.com/en-us/library/twcw2f1c.aspx
http://www.regular-expressions.info/examples.html (your exact case is mentioned under Grabbing HTML Tags)
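One way to picture the "single loop" idea is a lookup table keyed by placeholder name, consulted once per match. The following is a minimal sketch, not a definitive implementation: the PlaceholderReplacer name and the %name% delimiter convention are assumptions taken from the question's template, and unknown placeholders are deliberately left as they are.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PlaceholderReplacer
{
    public static string ReplaceAll(string template, IDictionary<string, string> values)
    {
        // %name% where name consists of letters, digits, underscores and dots.
        // Attribute values such as width='20%' do not match, because the
        // character after the percent sign is a quote.
        return Regex.Replace(template, @"%([\w.]+)%", match =>
        {
            string replacement;
            return values.TryGetValue(match.Groups[1].Value, out replacement)
                ? replacement
                : match.Value; // no entry for this name: keep the placeholder
        });
    }
}
```

The dictionary can be filled from any source (a list, a DataTable row, or an object's properties), and the template is scanned only once no matter how many variables it contains.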

Related

Removing Columns from a HTML Table

I'm trying to delete the 3rd and 4th <td> and <th> from my table using HtmlAgilityPack.
Example table string:
<table>
<thead>
<tr>
<th>Item</th>
<th>Price</th>
<th>Change</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>
<h2>Top Menu Items</h2>
</td>
</tr>
<tr>
<td> Diced Angus Steak <span>(7oz)</span></td>
<td>$13.50</td>
<td>
- -
</td>
<td>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td> Kimchi Cheese Beef Pepper Rice</td>
<td>$15.00</td>
<td>
- -
</td>
<td>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td> Classic Beef Pepper Rice</td>
<td>$13.50</td>
<td>
- -
</td>
<td>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td>
<h2>Steaks</h2>
</td>
</tr>
<tr>
<td> Angus Rib Eye Steak <span>(8oz)</span></td>
<td>$25.50</td>
<td>
- -
</td>
<td>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td> Angus Sirloin Steak <span>(8oz)</span></td>
<td>$22.50</td>
<td>
- -
</td>
<td>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td> Diced Angus Steak <span>(7oz)</span> <span>(Steaks)</span></td>
<td>$13.50</td>
<td>
- -
</td>
<td>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td> Chicken Breast Steak <span>(8oz)</span></td>
<td>$14.00</td>
<td>
- -
</td>
<td>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td> Premium Hamburger Steak <span>(10oz)</span></td>
<td>$16.00</td>
<td>
- -
</td>
<td>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td>
<h2>Pepper Rice</h2>
</td>
</tr>
<tr>
<td> Sambar Pepper Rice</td>
<td>$13.50</td>
<td>
- -
</td>
<td>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td> Kimchi Cheese Beef Pepper Rice <span>(Pepper Rice)</span></td>
<td>$15.00</td>
<td>
- -
</td>
<td>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td> Chicken Pepper Rice</td>
<td>$13.50</td>
<td>
- -
</td>
<td>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td> Salmon Pepper Rice</td>
<td>$15.00</td>
<td>
- -
</td>
<td>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td> Classic Beef Pepper Rice <span>(Pepper Rice)</span></td>
<td>$13.50</td>
<td>
- -
</td>
<td>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td>
<h2>Sides</h2>
</td>
</tr>
<tr>
<td> Rice</td>
<td>$3.00</td>
<td>
- -
</td>
<td>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td> Miso Soup</td>
<td>$3.00</td>
<td>
- -
</td>
<td>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td> Sauteed String Beans</td>
<td>$4.00</td>
<td>
- -
</td>
<td>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td> Sauteed Corn</td>
<td>$4.00</td>
<td>
- -
</td>
<td>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td> Kimchi</td>
<td>$5.00</td>
<td>
- -
</td>
<td>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td> French Fries</td>
<td>$4.00</td>
<td>
- -
</td>
<td>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td> Onion Rings</td>
<td>$5.00</td>
<td>
- -
</td>
<td>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td> Deep Fried Dumpling</td>
<td>$8.00</td>
<td>
- -
</td>
<td>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td> Sausages</td>
<td>$7.50</td>
<td>
- -
</td>
<td>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td>
<h2>Salad</h2>
</td>
</tr>
<tr>
<td> Large Salad</td>
<td>$7.00</td>
<td>
- -
</td>
<td>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td> Small Salad</td>
<td>$3.00</td>
<td>
- -
</td>
<td>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td> Large Seaweed Salad</td>
<td>$9.00</td>
<td>
- -
</td>
<td>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td> Small Seaweed Salad</td>
<td>$5.00</td>
<td>
- -
</td>
<td>
<span>
</span>
<span>
</span>
</td>
</tr>
</tbody>
</table>
I send the string above to this method, to remove the 3rd and 4th <td> and <th>.
public static string deleteCols(string table)
{
var doc = new HtmlDocument();
doc.LoadHtml(table);
bool first = true;
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//tr"))
{
if (first)
{
try
{
var th3 = row.SelectSingleNode("th[3]");
row.RemoveChild(th3);
}
catch
{
}
try
{
var th4 = row.SelectSingleNode("th[4]");
row.RemoveChild(th4);
}
catch
{
}
first = false;
}
else
{
try
{
var td3 = row.SelectSingleNode("td[3]");
row.RemoveChild(td3);
}
catch
{
}
try
{
var td4 = row.SelectSingleNode("th[4]");
row.RemoveChild(td4);
}
catch
{
}
}
}
foreach (HtmlNode row2 in doc.DocumentNode.SelectNodes("//span"))
{
row2.Remove();
}
return doc.DocumentNode.InnerHtml;
}
Which gives me the following result:
<table>
<thead>
<tr>
<th>Item</th>
<th>Price</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>
<h2>Top Menu Items</h2>
</td>
</tr>
<tr>
<td> Diced Angus Steak </td>
<td>$13.50</td>
<td>
</td>
</tr>
<tr>
<td> Kimchi Cheese Beef Pepper Rice</td>
<td>$15.00</td>
<td>
</td>
</tr>
<tr>
<td> Classic Beef Pepper Rice</td>
<td>$13.50</td>
<td>
</td>
</tr>
<tr>
<td>
<h2>Steaks</h2>
</td>
</tr>
<tr>
<td> Angus Rib Eye Steak </td>
<td>$25.50</td>
<td>
</td>
</tr>
<tr>
<td> Angus Sirloin Steak </td>
<td>$22.50</td>
<td>
</td>
</tr>
<tr>
<td> Diced Angus Steak </td>
<td>$13.50</td>
<td>
</td>
</tr>
<tr>
<td> Chicken Breast Steak </td>
<td>$14.00</td>
<td>
</td>
</tr>
<tr>
<td> Premium Hamburger Steak </td>
<td>$16.00</td>
<td>
</td>
</tr>
<tr>
<td>
<h2>Pepper Rice</h2>
</td>
</tr>
<tr>
<td> Sambar Pepper Rice</td>
<td>$13.50</td>
<td>
</td>
</tr>
<tr>
<td> Kimchi Cheese Beef Pepper Rice </td>
<td>$15.00</td>
<td>
</td>
</tr>
<tr>
<td> Chicken Pepper Rice</td>
<td>$13.50</td>
<td>
</td>
</tr>
<tr>
<td> Salmon Pepper Rice</td>
<td>$15.00</td>
<td>
</td>
</tr>
<tr>
<td> Classic Beef Pepper Rice </td>
<td>$13.50</td>
<td>
</td>
</tr>
<tr>
<td>
<h2>Sides</h2>
</td>
</tr>
<tr>
<td> Rice</td>
<td>$3.00</td>
<td>
</td>
</tr>
<tr>
<td> Miso Soup</td>
<td>$3.00</td>
<td>
</td>
</tr>
<tr>
<td> Sauteed String Beans</td>
<td>$4.00</td>
<td>
</td>
</tr>
<tr>
<td> Sauteed Corn</td>
<td>$4.00</td>
<td>
</td>
</tr>
<tr>
<td> Kimchi</td>
<td>$5.00</td>
<td>
</td>
</tr>
<tr>
<td> French Fries</td>
<td>$4.00</td>
<td>
</td>
</tr>
<tr>
<td> Onion Rings</td>
<td>$5.00</td>
<td>
</td>
</tr>
<tr>
<td> Deep Fried Dumpling</td>
<td>$8.00</td>
<td>
</td>
</tr>
<tr>
<td> Sausages</td>
<td>$7.50</td>
<td>
</td>
</tr>
<tr>
<td>
<h2>Salad</h2>
</td>
</tr>
<tr>
<td> Large Salad</td>
<td>$7.00</td>
<td>
</td>
</tr>
<tr>
<td> Small Salad</td>
<td>$3.00</td>
<td>
</td>
</tr>
<tr>
<td> Large Seaweed Salad</td>
<td>$9.00</td>
<td>
</td>
</tr>
<tr>
<td> Small Seaweed Salad</td>
<td>$5.00</td>
<td>
</td>
</tr>
</tbody>
</table>
As you can see, some of the elements I wish to delete are still there. Does anybody know what I'm doing wrong here?!
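When you remove the 3rd th/td from the row's children, the 4th item becomes the 3rd, so the second lookup (th[4] or td[4]) finds nothing and you end up trying to remove a non-existent element. The empty catch blocks hide that failure. Note also that the last lookup in your else branch selects "th[4]" where "td[4]" was intended.
As a solution, you can either store both elements in variables first and then delete them, or remove the 4th element before the 3rd.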
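The "store first, then delete" variant could look roughly like the sketch below. It assumes the HtmlAgilityPack package is referenced; the TableTrimmer and DeleteColumns names are made up for illustration, and the span removal from the original method is left out to keep the example short.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class TableTrimmer
{
    public static string DeleteColumns(string tableHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(tableHtml);

        var rows = doc.DocumentNode.SelectNodes("//tr");
        if (rows == null)
        {
            return tableHtml; // SelectNodes returns null when nothing matches
        }

        foreach (var row in rows)
        {
            // Snapshot the cells first, so removing one does not shift the others.
            var cells = row.ChildNodes
                .Where(n => n.Name == "th" || n.Name == "td")
                .ToList();

            // Skip the first two cells, then remove the 3rd and 4th if present.
            foreach (var cell in cells.Skip(2).Take(2))
            {
                cell.Remove();
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Rows with fewer than three cells (such as the section heading rows) are simply left alone, so no try/catch is needed.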

C# IText7 HTML inside a Table Cell

I am trying to put HTML in a cell, but the border appears as double lines:
cell = new Cell();
var elementsList = HtmlConverter.ConvertToElements(sectioncontent);
foreach (IElement e in elementsList)
{
cell.Add((IBlockElement)e);
}
cell.SetBorder(Border.NO_BORDER);
table.SetTextAlignment(TextAlignment.JUSTIFIED).AddCell(cell);
with HTML like this (captured from CKEditor and stored in a database):
<table border="1" cellpadding="1" cellspacing="1" style="width:500px">
<tbody>
<tr>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
</tr>
</tbody>
</table>
<p> </p>
================================ end
String HTML = "<p>Overview line1</p>"
+ "<p>Overview line2</p><p>Overview line3</p>"
+ "<p>Overview line4</p><p>Overview line4</p>"
+ "<p>Overview line5 </p>";
String CSS = "p { font-family: Cardo; }";
cell = new Cell();
//cell.Add(new Paragraph(s));
ElementList elementsList = XMLWorkerHelper.ParseToElementList(HTML, CSS);
foreach (IElement e in elementsList)
{
cell.Add((IBlockElement)e);
}
cell.SetBorder(Border.NO_BORDER);
table.SetTextAlignment(TextAlignment.JUSTIFIED).AddCell(cell);
document.Add(table);
but it raises an exception:
Unable to cast object of type 'iTextSharp.text.Paragraph' to type 'iText.Layout.Element.IElement'.
How can I work around this?
Regards and thanks

WebBrowser manipulate HTML Table

I'm trying to manipulate an HTML table opened in a WebBrowser control. This tool will be used to access a SharePoint page with an auto-login option. This is what I have so far:
HtmlElementCollection htmlcol =
wb.Document.GetElementsByTagName("formTextfield277");
for (int i = 0; i < htmlcol.Count; i++)
{
if (htmlcol[i].Name == "portal_id")
{
htmlcol[i].SetAttribute("VALUE",
Properties.Settings.Default.sharepoint_user);
}
else if (htmlcol[i].Name == "password")
{
htmlcol[i].SetAttribute("VALUE",
Properties.Settings.Default.sharepoint_pw);
}
}
This C# code is meant to manipulate this HTML page:
<TABLE CELLSPACING="0" CELLPADDING="0" WIDTH="100%" BORDER="0">
<TR>
<TD CLASS="txtRedBold10" WIDTH="4"> </TD>
<TD CLASS="txtRedBold10" COLSPAN="2" HEIGHT="30">Please log in</TD>
</TR>
<TR>
<TD CLASS="txtBlackReg10" WIDTH="4"> </TD>
<TD CLASS="txtBlackReg10">Username:</TD>
<TD><INPUT CLASS="formTextfield277" TYPE="text" NAME="portal_id" VALUE="" VCARD_NAME="vCard.Email" SIZE="28"></TD>
</TR>
<TR>
<TD CLASS="txtBlackReg10" COLSPAN="3"> </TD>
</TR>
<TR>
<TD CLASS="txtBlackReg10" COLSPAN="2"> </TD>
<TD CLASS="txtBlackReg10">Please enter your username or E-Mail Address</TD>
</TR>
<TR>
<TD CLASS="txtBlackReg10" COLSPAN="3"> </TD>
</TR>
<TR>
<TD CLASS="txtBlackReg10" WIDTH="4"> </TD>
<TD CLASS="txtBlackReg10">Password:</TD>
<TD><INPUT CLASS="formTextfield277" TYPE="password" NAME="password" SIZE="28" AUTOCOMPLETE="off"></TD>
</TR>
<TR>
<TD CLASS="txtBlackReg10" COLSPAN="3"> </TD>
</TR>
<TR>
<TD CLASS="txtBlackReg10" COLSPAN="2"> </TD>
<TD CLASS="txtBlackReg10">Please enter your network or Intranet password</TD>
</TR>
<TR>
<TD CLASS="txtBlackReg10" COLSPAN="3"> </TD>
</TR>
<TR>
<TD CLASS="txtBlackReg10" COLSPAN="2"> </TD>
<TD CLASS="txtBlackReg10">
<TABLE CELLSPACING="0" CELLPADDING="0" BORDER="0">
<TR>
<TD><INPUT TYPE="image" HEIGHT="24" WIDTH="20" SRC="images/cp_arrow.gif" VALUE="Log In"
BORDER="0"></TD>
<TD><A CLASS="linkTxtRedBold10" HREF="javascript:signin()"
onClick="saveForm()">Login</A>
</TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
<TD CLASS="txtBlackReg10" COLSPAN="3"> </TD>
</TR>
</TABLE>
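Any suggestions?
Thanks in advance!
Use wb.Document.GetElementsByTagName("input"), not wb.Document.GetElementsByTagName("formTextfield277"). GetElementsByTagName takes a tag name, and formTextfield277 is the CSS class of the inputs, not a tag:
HtmlElementCollection inputHtmlCollection = wb.Document.GetElementsByTagName("input");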
foreach (HtmlElement anInputElement in inputHtmlCollection)
{
if (anInputElement.Name.Equals("portal_id"))
{
anInputElement.SetAttribute("VALUE", Properties.Settings.Default.sharepoint_user);
}
if (anInputElement.Name.Equals("password"))
{
anInputElement.SetAttribute("VALUE", Properties.Settings.Default.sharepoint_pw);
}
}
Hope this helps!

How to get Regex for TDs in a TR

I need help with regular expressions (using C#).
In the HTML source code I have something like this:
[...]
<TR class=tblDataGreyNH>
<TD style="TEXT-ALIGN: right; FONT-WEIGHT: bold" class=tblHeader>Total Time </TD>
<TD>07:47 </TD>
<TD>04:48 </TD>
<TD>00:00 </TD>
<TD>00:00 </TD>
<TD>07:42 </TD>
<TD>00:00 </TD>
<TD>00:00 </TD></TR>
[..]
<TR class=tblDataGreyNH nowrap>
<TD>Total </TD>
<TD>20:17 </TD></TR>
<TR style="FONT-WEIGHT: bold" class=tblDataWhiteNH nowrap>
<TD>Total Time </TD>
<TD width=75>20:17 </TD></TR></TBODY></TABLE></TD>
<TD colSpan=3>
...
The class names are always the same.
I need all the TDs parsed into a string array.
tblDataGreyNH is the important class.
Here is the whole table that contains the TDs (in case you need it):
<table class="tblList">
<form action="/interface/timesheet/ViewUserTimeSheet.php" method="get" name="timesheet"></form>
<tbody>
<tr>
<tr class="tblHeader">
<tr class="tblHeader">
<tr class="tblDataWhiteNH">
<tr class="tblDataWhiteNH">
<tr class="tblHeader">
<tr class="tblDataGreyNH">
<td class="tblHeader" style="font-weight: bold; text-align: right"> Total Time </td>
<td> 07:47 </td>
<td> 04:48 </td>
<td> 00:00 </td>
<td> 00:00 </td>
<td> 07:42 </td>
<td> 00:00 </td>
<td> 00:00 </td>
</tr>
<tr class="tblDataWhiteNH">
<tr class="tblHeader">
<tr valign="top">
</tbody>
</table>
I hope there is someone who can help me with this problem.
Regex seems impossible for me to understand.
I can't even grasp the basics of that regex stuff! :/
Don't use regex for HTML. I would suggest checking out the HtmlAgilityPack.
It's very simple:
var doc = new HtmlDocument();
doc.LoadHtml("...your sample html...");
// all <td> tags in the document
foreach (HtmlNode td in doc.DocumentNode.SelectNodes("//td"))
{
Console.WriteLine(td.InnerText);
}
You should not use regex to parse HTML (one of many refs: link)
There exists a great .NET library named HtmlAgilityPack that I would recommend.
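To narrow the selection to the rows the question cares about and end up with a string array, the XPath can filter on the row's class attribute. This is a sketch assuming the HtmlAgilityPack package is referenced; the TimeSheetParser and GetCellTexts names are invented for illustration, and the exact-match test on @class assumes the attribute holds only that one class name, as in the sample.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class TimeSheetParser
{
    public static string[] GetCellTexts(string html, string rowClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // All <td> children of <tr> elements whose class attribute equals rowClass.
        var cells = doc.DocumentNode.SelectNodes("//tr[@class='" + rowClass + "']/td");
        if (cells == null)
        {
            return new string[0]; // SelectNodes returns null when nothing matches
        }

        // Decode entities such as &nbsp; and trim the surrounding whitespace.
        return cells
            .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
            .ToArray();
    }
}
```

For the sample above, GetCellTexts(html, "tblDataGreyNH") would collect the cells of every row with that class, in document order.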

Advanced HTML Agility Pack usage

I am pretty new to the HTML Agility Pack, so I need some help with where to go next. I can do some simple things, like pull a value from an href (knowing the URL string I was looking for) or pull the value in a span based on a specific class that was being used. But I do not understand how to use the HTML Agility Pack in a situation where there are a ton of tags and there is no single solid anchor to tie to.
Here is an actual chunk of code I am scraping through. I placed dummy data in the cells to demonstrate what I am looking for.
What is the best way to extract the following:
1.) Company Name?
2.) Phone Number?
3.) Email Address?
HTML....
<td>
<!-- Company Info -->
<table cellpadding="0" cellspacing="0" border="0">
<tr>
<td class="black">
<table cellspacing="1" cellpadding="0" border="0" width="370">
<tr>
<th>COMPANY NAME</th>
</tr>
<tr>
<td class="search">
<table cellpadding="5" cellspacing="0" border="0" width="100%">
<tr>
<td>
<table cellpadding="1" cellspacing="0" border="0" width="100%">
<tr>
<td colspan="2" align="center">Un-needed Links...</td>
</tr>
<tr>
<td align="center" colspan="2"><hr></td>
</tr>
<tr>
<td align="right" nowrap>
<b>
<font color="FF0000">
Contact Person
<img src="/images/icon_contact.gif" align="absmiddle"> :
</font>
</b>
</td>
<td align="left" width="100%"> Judy Smith</td>
</tr>
<tr>
<td align="right" nowrap>
<b><font color="FF0000">Phone Number <img src="/images/icon_phone.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> 555-555-5555</td>
</tr>
<tr>
<td align="right" nowrap><b><font color="FF0000">E-mail Address <img src="/images/icon_email.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> judy.smith@companyname.com</td>
</tr>
<tr>
<td align="center" colspan="2"><hr></td>
</tr>
<tr>
<td align="right" nowrap><b><font color="FF0000">Home Office Location <img src="/images/icon_home.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> ATLANTA, GA</td>
</tr>
<tr>
<td align="right" nowrap><b><font color="FF0000">Home Office Phone <img src="/images/icon_home.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> 555-555-5555</td>
</tr>
<tr>
<td align="right" nowrap><b><font color="FF0000">Home Office Fax <img src="/images/icon_home.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> 666-666-6666</td>
</tr>
<tr>
<td align="center" colspan="2"><hr></td>
</tr>
<tr>
<td align="right" nowrap><b><font color="FF0000">Broker MC Number <img src="/images/icon_number.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> 123456</td>
</tr>
<tr>
<td align="right" nowrap><b><font color="FF0000">Carrier MC Number <img src="/images/icon_number.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> 654321</td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
</table>
<br>
<!-- Starting Point -->
<table cellpadding="0" cellspacing="0" border="0">
<tr>
<td class="black">
<table cellspacing="1" cellpadding="0" border="0" width="370">
<tr>
<th>Starting Point</th>
<th>Available</th>
</tr>
<tr>
<td class="search" width="270"> <b>ABBEVILLE, GA </b></td>
<td class="search" align="center" width="100"><span style="color: forestgreen"> 1/5/11 </span></td>
</tr>
</table>
</td>
</tr>
</table>
<br>
<!-- Destination Point -->
<table cellpadding="0" cellspacing="0" border="0">
<tr>
<td class="black">
<table cellspacing="1" cellpadding="0" border="0" width="370">
<tr>
<th>Destination Point</th>
<th>Direction</th>
</tr>
<tr>
<td class="search" width="270"> <b>ATLANTA, GA </b></td>
<td class="search" align="center" width="100"><span style="color: FF0000"> </span></td>
</tr>
</table>
</td>
</tr>
</table>
<br>
<!-- Truck Details -->
<table cellpadding="0" cellspacing="0" border="0">
<tr>
<td class="black">
<table cellspacing="1" cellpadding="0" border="0" width="370">
<tr>
<th>Truck Details</th>
</tr>
<tr>
<td class="search">
<table cellpadding="5" cellspacing="0" border="0">
<tr>
<td>
<table cellpadding="0" cellspacing="0" border="0">
<tr>
<td align="right"><b>Date Posted :</b></td>
<td align="left"> 1/5/2011 10:34:48 AM</td>
</tr>
<tr>
<td align="right"><b>Quantity :</b></td>
<td align="left"> 1</td>
</tr>
<tr>
<td align="right"><b>Equipment Type :</b></td>
<td align="left"> FT</td>
</tr>
<tr>
<td align="right"><b>Load Size :</b></td>
<td align="left"> Full</td>
</tr>
<tr>
<td align="right" valign="top"><b>Special Information :</b></td>
<td align="left"> </td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
</table>
<br>
</td>
....More HTML
Well, you have to understand XPath to really take advantage of the HTML Agility Pack's scraping capabilities :-) You can search for XPath examples to start with.
Focusing on the screen-scraping question, the tricky part is choosing the XPath expression that best discriminates the information you want to get. Most of the time there is more than one solution, and you must be prepared to update your code as the target site's HTML evolves.
So it's a trade-off between very simple expressions, which risk matching unwanted text, and overly specific expressions, which do not tolerate changes in the scraped HTML and risk matching nothing.
As for your specific text, this is a good real-world example, and here is some code that does it:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(yourText);
string companyName = doc.DocumentNode.SelectSingleNode("/td/table/tr/td/table/tr/th").InnerText;
Console.WriteLine("company name=" + companyName);
// another way
companyName = doc.DocumentNode.SelectSingleNode("//td[@class='black']/table/tr/th").InnerText;
Console.WriteLine("company name=" + companyName);
// a more advanced XPATH expression, means
// "Select a TD tag anywhere in the doc that has a preceding sibling of TD type with a B child, with a FONT child with inner text starting with 'Phone Number'"
string phoneNumber = doc.DocumentNode.SelectSingleNode("//td[starts-with(preceding-sibling::td/b/font/text(), 'Phone Number')]").InnerText;
Console.WriteLine("phone Number=" + phoneNumber);
// same kind of story but go down the next A tag
string email = doc.DocumentNode.SelectSingleNode("//td[starts-with(preceding-sibling::td/b/font/text(), 'E-mail')]/a").InnerText;
Console.WriteLine("email=" + email);
PS: please note the HTML Agility Pack always expects tags used in XPath expressions to be lowercase, even if they're not in the original HTML text.
As you see, the company name is retrieved here using two different expressions. They both work on the sample, but the first one will not resist if a new tag is added anywhere in the middle. The second one is more future-proof but is based on a CSS class tag that also may change. It's always a trade-off.
The phone number & email are similar but show the power of XPATH.
