Parsing HTML using HTML Agility Pack - C#

I have some HTML to parse (see below):
<div id="mailbox" class="div-w div-m-0">
<h2 class="h-line">InBox</h2>
<div id="mailbox-table">
<table id="maillist">
<tr>
<th>From</th>
<th>Subject</th>
<th>Date</th>
</tr>
<tr onclick="location='readmail.html?mid=welcome'" style="font-weight: bold;">
<td>no-reply@somemail.net</td>
<td>
Hi, Welcome
</td>
<td>
<span title="2016-02-16 13:23:50 UTC">just now</span>
</td>
</tr>
<tr onclick="location='readmail.html?mid=T0wM6P'" style="font-weight: bold;">
<td>someone@outlook.com</td>
<td>
sa
</td>
<td>
<span title="2016-02-16 13:24:04">just now</span>
</td>
</tr>
</table>
</div>
</div>
I need to parse the links in the <tr onclick= attributes and the email addresses in the <td> tags.
So far I have managed to get only the first occurrence of an email/link from my HTML.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(responseFromServer);
Could someone show me how this is done properly? Basically, I want to take all the email addresses and links from the HTML that are in those tags.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//tr[@onclick]"))
{
HtmlAttribute att = link.Attributes["onclick"];
Console.WriteLine(att.Value);
}
EDIT: I need to store the parsed values in a list of class instances, in pairs: the link to the email and the sender's email address.
public class ClassMailBox
{
public string From { get; set; }
public string LinkToMail { get; set; }
}

You can write the following code:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(responseFromServer);
// One entry per clickable row (requires using System.Collections.Generic;).
List<ClassMailBox> classMailBoxes = new List<ClassMailBox>();
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//tr[@onclick]"))
{
    HtmlAttribute att = link.Attributes["onclick"];
    ClassMailBox classMailbox = new ClassMailBox() { LinkToMail = att.Value };
    classMailBoxes.Add(classMailbox);
}
// Second pass: fill in the sender from the first cell of each clickable row.
int currentPosition = 0;
foreach (HtmlNode tableDef in doc.DocumentNode.SelectNodes("//tr[@onclick]/td[1]"))
{
    classMailBoxes[currentPosition].From = tableDef.InnerText;
    currentPosition++;
}
To keep this code simple, I'm assuming some things:
The email is always in the first td inside the tr that has an onclick attribute
Every tr with an onclick attribute contains an email
If those conditions don't hold, this code won't work: it could throw exceptions (SelectNodes returns null when nothing matches, and an out-of-range index on the list throws ArgumentOutOfRangeException) or it could pair links with the wrong email addresses.
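The pairing risk described above goes away if each row is read in a single pass, with the cell selected relative to its own row. The following is only a sketch under that idea: MailboxParser and Parse are hypothetical names, and it assumes the HtmlAgilityPack package is referenced.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public class ClassMailBox
{
    public string From { get; set; }
    public string LinkToMail { get; set; }
}

// Hypothetical helper, not part of HtmlAgilityPack.
public static class MailboxParser
{
    // Builds one ClassMailBox per <tr onclick="..."> row.
    public static List<ClassMailBox> Parse(string html)
    {
        var result = new List<ClassMailBox>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//tr[@onclick]");
        if (rows == null)
        {
            return result;
        }

        foreach (HtmlNode row in rows)
        {
            // "./td[1]" is evaluated relative to the current row,
            // so the link and the sender always come from the same <tr>.
            HtmlNode firstCell = row.SelectSingleNode("./td[1]");
            result.Add(new ClassMailBox
            {
                LinkToMail = row.GetAttributeValue("onclick", string.Empty),
                From = firstCell == null ? null : firstCell.InnerText.Trim()
            });
        }
        return result;
    }
}
```

A row without any td simply gets a null From instead of shifting every later sender by one position.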

Related

Get Html elements inside div By ID (ID is not Unique)

I am developing an add-in to read web browser data and store it in a dictionary.
During this process, I need to access data by ID, but the IDs are not unique on the page. The page looks like this:
<div id="ID1">
<tbody>
<tr>
<td id="1000" data-field="1">
text
</td>
</tr>
</tbody>
<div id="ID2">
<tbody>
<tr>
<td id="1000" data-field="2">
Some other text
</td>
</tr>
</tbody>
Both div elements are on the same page.
When I get the element by ID, it only gives me the first element, not the second one.
Here is my code:
HtmlElement myElements = webBrowser1.Document.GetElementById("ID2");
HtmlElement myElements2 = myElements.Document.GetElementById("1000");
if (myElements2.InnerText != null)
{
//Do something
}
How can I get the inner text of the second element by ID?
This is the best and easiest answer I came up with.
I figured out that data-field is a unique value on the page, so I looped through the elements and compared against data-field:
HtmlElement Buildingcontacts = webBrowser1.Document.GetElementById("ID2");
HtmlElementCollection ifiels = Buildingcontacts.Document.GetElementsByTagName("td");
foreach (HtmlElement element in ifiels)
{
string datafieldx = element.GetAttribute("data-field");
if (datafieldx == "2")
{
if (element.InnerText != null)
{
// Do something
}
}
}

Why is my XPath result null? Why is it not finding the nodes?

I am just getting into traversing XML documents to learn how to use XPath.
I have stumbled onto an issue: every time I try to execute my XPath, it returns null, as if it didn't find anything.
I've tried the XPath out in XMLQuire and it worked there.
class Program
{
private static string URL = "https://www.kijiji.ca/b-renovation-contracting-handyman/ontario/home-renovations/k0c753l9004";
private static HtmlWeb client = new HtmlWeb();
static void Main(string[] args)
{
var DOM = client.Load(URL); // //table/tbody/tr/td[@class = 'description']/p
var Featured = DOM.DocumentNode.SelectNodes("//table[contains(@class,'top-feature')]/tbody/tr/td/a");
foreach (var Listing in Featured)
{
}
}
}
I commented out the other XPath I tried. I've tried both, and both return null. Why is that?
Here is the part of the DOM I want to access:
<table class="top-feature js-hover" data-ad-id="1299717863" data-vip-url="/v-renovation-contracting-handyman/sudbury/c-l-contracting-any-job-big-or-small/1299717863">
<tbody><tr>
<td class="watchlist">
<div class="watch js-hover p-vap-lnk-actn-addwtch" data-action="add" data-adid="1299717863" title="Click to add to My Favourites"><div class="icon"></div></div>
<input id="watchlistXsrf" name="ca.kijiji.xsrf.token" value="1527418405414.9b71d1309fdd8a315258ea5a3dac1a09e4a99ec7f32041df88307c46e26a5b1b" type="hidden">
</td>
<td class="image">
<div class="multiple-images"><img src="https://i.ebayimg.com/00/s/NjAwWDgwMA==/z/fXEAAOSwaZdZxTv~/$_2.JPG" alt="C.L. Contracting. Any job big or small."></div>
</td>
<td class="description">
<a href="/v-renovation-contracting-handyman/sudbury/c-l-contracting-any-job-big-or-small/1299717863" class="title ">
C.L. Contracting. Any job big or small.</a>
<p>
Contractor handyman home renovations and repairs. Contractor for Dollarama, Rexall, LaSenza and more. Fully licensed and insured. Able to do drywall, decks, framing, plumbing, flooring windows, ...</p>
<p class="details">
</p>
</td>
<td class="posted">
</td>
</tr>
</tbody></table>
My solution (I still need help turning my XPath into one line instead of traversing with a bunch of loops):
private static string URL = "https://www.kijiji.ca/b-renovation-contracting-handyman/ontario/home-renovations/k0c753l9004";
private static HtmlWeb client = new HtmlWeb();
static void Main(string[] args)
{
    var DOM = client.Load(URL); // //table/tbody/tr/td[@class = 'description']/p
    var Featured = DOM.DocumentNode.SelectNodes("//table[contains(@class,'top-feature')]/tbody/tr/td/a");
    foreach (var table in DOM.DocumentNode.SelectNodes("//table[contains(@class, 'top-feature')]"))
    {
        Console.WriteLine($"Found: {table}");
        // Rows are direct children of the table here; there is no tbody in between.
        foreach (var rows in table.SelectNodes("tr"))
        {
            Console.WriteLine(rows);
            foreach (var cell in rows.SelectNodes("td[@class='description']/a"))
            {
                Console.WriteLine(cell.InnerText.Trim());
            }
        }
    }
    Console.ReadKey();
}
I've managed to fix it; however, I am still curious as to why this XPath works:
//table[contains(@class, 'top-feature')]/tr/td[@class='description']/a
And this one doesn't:
//table[contains(@class,'top-feature')]/tbody/tr/td/a
As mentioned in the comment, the <tbody> element is inserted by the browser when it builds the DOM, so it shows up in the developer tools even though it is not in the page source.
If you inspect your DOM object at runtime with the debugger, you can look at its InnerHtml property, which shows the markup as it was actually downloaded:
<table class="regular-ad js-hover" data-ad-id=".." data-vip-url="..">
<tr>
<td class="watchlist">
...
</td>
<td class="image">
...
</td>
...
</tr>
</table>
There is no <tbody> element, so your XPath has to look like this:
DOM.DocumentNode.SelectNodes("//table[contains(@class,'top-feature')]/tr/td/a");
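For the one-line XPath the asker wanted, one option is to use the descendant axis (//) between the table and the cell, which matches whether or not a tbody sits in between. This is a sketch rather than a definitive answer: FeaturedListings and GetTitles are hypothetical names, and it takes the HTML as a string so it does not depend on the live page.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class FeaturedListings
{
    // Returns the link text of every "description" cell inside a top-feature table.
    public static List<string> GetTitles(string html)
    {
        var titles = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//td" after the table step means "any descendant td",
        // so the expression does not care whether <tbody> is present.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes(
            "//table[contains(@class,'top-feature')]//td[@class='description']/a");
        if (links == null)
        {
            return titles; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode link in links)
        {
            titles.Add(HtmlEntity.DeEntitize(link.InnerText).Trim());
        }
        return titles;
    }
}
```

The trade-off is that the descendant axis would also match cells in nested tables, which the markup shown here does not have.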

How to add whitespace between tags in Selenium?

I was using Selenium to get data from a table on the web page.
I have HTML with structure:
<table>
<tbody>
<tr>
<td>
<span>1</span>
<span>0</span>
<br>
<span>
<span>Good Luck</span>
<img src="/App_Themes/Resources/img/icon_tick.gif" width="3" height="7">
</span>
</td>
</tr>
<tr>
<td>
<b>Nowaday<br></b>
<p>hook<br>zp</p>
</td>
</tr>
</tbody>
</table>
I am using this code to get all the values in this table:
ReadOnlyCollection<IWebElement> lstTable = browser.FindElements(By.XPath("table/tbody/tr"));
foreach (IWebElement val in lstTable)
{
ReadOnlyCollection<IWebElement> lstTDElement = val.FindElements(By.XPath("td"));
}
But it shows a result like this:
10Good LuckNowadayhookzp
I want a result like this:
1 0 Good Luck Nowaday hookzp
That is, with whitespace between the tags.
I think the markup would need extra elements added, like this:
<span>1</span>
<span> </span>
<span>0</span>
And:
<b>Nowaday<br></b>
<p> </p>
<p>hook<br>zp</p>
You could try the following:
ReadOnlyCollection<IWebElement> lstTDElements = browser.FindElements(By.TagName("td"));
var allTextList = lstTDElements.Select(el => el.Text).ToList();
string FinalString = allTextList.Aggregate(new System.Text.StringBuilder(), (sb, s) => sb.Append(" " + s)).ToString().Replace("\n", "");
Console.WriteLine(FinalString);
Edit: You can also select the separate elements together by combining XPath expressions with the | (union) operator, as below:
ReadOnlyCollection<IWebElement> lstTable = browser.FindElements(By.XPath("table/tbody/tr"));
foreach (IWebElement val in lstTable)
{
ReadOnlyCollection<IWebElement> lstTDElement = val.FindElements(By.XPath("./td/span | ./td/b | ./td/p"));
}
Hope it helps...:)

C# HtmlAgilityPack conditional select node

I am not sure the title suits my problem.
I have HTML like the following:
<table id="searchResultsTable" class="">
<tbody>
<tr class="searchResultsItem even ">
<td class="searchResultsPriceValue">
<div> 26.500 TL</div></td>
<td class="searchResultsTitleValue ">
<a class="classifiedTitle" href="xxxx"> some text</a>
</tr>
<tr class="searchResultsItem odd ">
.
//same as "searchResultsItem even "
.
</tr>
</tbody>
</table>
I am new to HtmlAgilityPack. I have succeeded in getting the price value of both "searchResultsItem even" and "searchResultsItem odd".
I want to get the href value if the price is below or above some value. I can get an href, but it is always the one from "searchResultsItem even". I want the href from the even row when the even row's price matches my condition, and the href from the odd row when the odd row's price matches.
Below is my code:
foreach (HtmlNode node1 in doc.DocumentNode.SelectNodes("//table[@id='searchResultsTable']"))
{
    foreach (HtmlNode node2 in node1.SelectNodes("//td[@class='searchResultsPriceValue']"))
    {
        string price = node2.InnerText;
        price = price.Trim().Replace(".", String.Empty);
        price = price.Replace("TL", String.Empty);
        if (Convert.ToInt32(price) < 28000)
        {
            HtmlNode node3 = node1.SelectSingleNode(".//a[@class='classifiedTitle']");
            listBox1.Items.Add(node3.Attributes["href"].Value);
        }
    }
}
Thanks
Get the tr class name as an attribute value. Loop through rows first, then tds.
foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table[@id='searchResultsTable']"))
{
    // ".//tr" is relative to this table; "//tr" would search the whole document.
    foreach (HtmlNode tr in table.SelectNodes(".//tr"))
    {
        var @class = tr.GetAttributeValue("class", string.Empty);
        switch (@class) {
            // rest of your parsing
        }
    }
}
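Inside the row loop, the price text still has to be converted before it can be compared. The conversion from the question can be isolated into a small helper that reports failure instead of throwing, as in this sketch. PriceParser and TryParsePrice are hypothetical names, and it assumes prices always look like " 26.500 TL", with "." as the thousands separator and no decimals.

```csharp
using System.Globalization;

// Hypothetical helper for price text such as " 26.500 TL".
public static class PriceParser
{
    // Returns true and sets price (e.g. 26500) when the text is a plain
    // dot-grouped integer followed by "TL"; returns false otherwise.
    public static bool TryParsePrice(string text, out int price)
    {
        price = 0;
        if (text == null)
        {
            return false;
        }

        // Drop the currency marker and the thousands separators, then trim.
        string cleaned = text.Replace("TL", string.Empty)
                             .Replace(".", string.Empty)
                             .Trim();

        // NumberStyles.None accepts digits only, so stray characters fail cleanly.
        return int.TryParse(cleaned, NumberStyles.None, CultureInfo.InvariantCulture, out price);
    }
}
```

Within each tr, the price cell's InnerText could be passed to TryParsePrice, and the row's own link selected with a relative XPath such as ".//a[@class='classifiedTitle']" only when the condition holds.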

Remove line from string where certain HTML tag is found

So I have this HTML page that is exported to an Excel file through an MVC action. The action actually goes and renders this partial view, and then exports that rendered view with correct formatting to an Excel file. However, the view is rendered exactly how it is seen before I do the export, and that view contains an "Export to Excel" button, so when I export this, the button image appears as a red X in the top left corner of the Excel file.
I can intercept the string containing this HTML to render in the ExcelExport action, and it looks like this for one example:
<div id="summaryInformation" >
<img id="ExportToExcel" style=" cursor: pointer;" src="/Extranet/img/btn_user_export_excel_off.gif" />
<table class="resultsGrid" cellpadding="2" cellspacing="0">
<tr>
<td id="NicknameLabel" class="resultsCell">Nick Name</td>
<td id="NicknameValue" colspan="3">
Swap
</td>
</tr>
<tr>
<td id="EffectiveDateLabel" class="resultsCell">
<label for="EffectiveDate">Effective Date</label>
</td>
<td id="EffectiveDateValue" class="alignRight">
02-Mar-2011
</td>
<td id ="NotionalLabel" class="resultsCell">
<label for="Notional">Notional</label>
</td>
<td id="NotionalValue" class="alignRight">
<span>
USD
</span>
10,000,000.00
</td>
</tr>
<tr>
<td id="MaturityDateLabel" class="resultsCell">
<label for="MaturityDate">Maturity Date</label>
</td>
<td id="MaturityDateValue" class="alignRight">
02-Mar-2016
-
Modified Following
</td>
<td id="TimeStampLabel" class="resultsCell">
Rate Time Stamp
</td>
<td id="Timestamp" class="alignRight">
28-Feb-2011 16:00
</td>
</tr>
<tr >
<td id="HolidatCityLabel" class="resultsCell"> Holiday City</td>
<td id="ddlHolidayCity" colspan="3">
New York,
London
</td>
</tr>
</table>
</div>
<script>
$("#ExportToExcel").click(function () {
// ajax call to do the export
var actionUrl = "/Extranet/mvc/Indications.cfc/ExportToExcel";
var viewName = "/Extranet/Views/Indications/ResultsViews/SummaryInformation.aspx";
var fileName = 'SummaryInfo.xls';
GridExport(actionUrl, viewName, fileName);
});
</script>
That <img id="ExportToExcel" tag at the top is the one I want to remove just for the export. All of what you see is contained within a C# string. How would I go and remove that line from the string so it doesn't try and render the image in Excel?
EDIT: Would probably make sense also that we wouldn't need any of the <script> in the export either, but since that won't show up in Excel anyway I don't think that's a huge deal for now.
Remove all img tags:
string html2 = Regex.Replace( html, @"(<img\/?[^>]+>)", "",
RegexOptions.IgnoreCase );
This requires: using System.Text.RegularExpressions;
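If only the export button should go and other images should stay, the same regex idea can be narrowed to a single id. This is a sketch, not a general HTML parser: ExportHtmlCleaner and RemoveImageById are hypothetical names, and it assumes the id attribute is double-quoted and that no attribute value inside the tag contains a ">" character.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper built on the regex approach above.
public static class ExportHtmlCleaner
{
    // Removes every <img ...> tag whose id attribute equals the given id.
    public static string RemoveImageById(string html, string id)
    {
        // \sid requires whitespace before "id", so attributes such as
        // data-id="..." are not mistaken for the id attribute.
        string pattern = "<img\\b[^>]*\\sid\\s*=\\s*\"" + Regex.Escape(id) + "\"[^>]*>";
        return Regex.Replace(html, pattern, string.Empty, RegexOptions.IgnoreCase);
    }
}
```

For the markup in the question, the call would be RemoveImageById(html, "ExportToExcel").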
If it's in a C# string, then just:
myHTMLString = myHTMLString.Replace(@"<img id=""ExportToExcel"" style="" cursor: pointer;"" src=""/Extranet/img/btn_user_export_excel_off.gif"" />", "");
The safest way to do this will be to use the HTML Agility Pack to read in the HTML and then write code that removes the image node from the HTML.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlString);
HtmlNode image = doc.GetElementbyId("ExportToExcel");
image.Remove();
htmlString = doc.DocumentNode.OuterHtml;
You can use similar code to remove the script tag and other img tags.
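A sketch of that extension, removing every img and script node in one pass, might look like the following. ExportHtmlStripper and StripImagesAndScripts are hypothetical names, and it assumes the HtmlAgilityPack package is referenced.

```csharp
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class ExportHtmlStripper
{
    // Returns the HTML with all <img> and <script> elements removed.
    public static string StripImagesAndScripts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The XPath union operator (|) selects both kinds of node at once.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//img | //script");
        if (nodes != null) // SelectNodes returns null when nothing matches
        {
            // Copy to a list first so nodes are not removed while enumerating.
            foreach (HtmlNode node in nodes.ToList())
            {
                node.Remove();
            }
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

Because the document is parsed rather than pattern-matched, this does not depend on how the tags are laid out across lines.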
I'm just using this:
private string RemoveImages(string html)
{
    StringBuilder retval = new StringBuilder();
    using (StringReader reader = new StringReader(html))
    {
        string line;
        // ReadLine returns null at the end of the input.
        while ((line = reader.ReadLine()) != null)
        {
            // Skip lines that start with an <img tag, ignoring indentation.
            if (!line.TrimStart().StartsWith("<img"))
            {
                // AppendLine keeps the line breaks that ReadLine strips.
                retval.AppendLine(line);
            }
        }
    }
    return retval.ToString();
}
