Scraping With HtmlAgilityPack

I have a huge HTML page that I want to scrape values from.
I tried using Firebug to get the XPath of the element I want, but the XPath is not static; it changes from time to time. How can I get the values I want?
In the following snippet I want to get the production of Lumber per hour, which is 20:
<div class="boxes-contents cf"><table id="production" cellpadding="1" cellspacing="1">
<thead>
<tr>
<th colspan="4">
Production per hour: </th>
</tr>
</thead>
<tbody>
<tr>
<td class="ico">
<img class="r1" src="img/x.gif" alt="Lumber" title="Lumber" />
</td>
<td class="res">
Lumber:
</td>
<td class="num">
20 </td>
</tr>
<tr>
<td class="ico">
<img class="r2" src="img/x.gif" alt="Clay" title="Clay" />
</td>
<td class="res">
Clay:
</td>
<td class="num">
20 </td>
</tr>
<tr>
<td class="ico">
<img class="r3" src="img/x.gif" alt="Iron" title="Iron" />
</td>
<td class="res">
Iron:
</td>
<td class="num">
20 </td>
</tr>
<tr>
<td class="ico">
<img class="r4" src="img/x.gif" alt="Crop" title="Crop" />
</td>
<td class="res">
Crop:
</td>
<td class="num">
59 </td>
</tr>
</tbody>
</table>
</div>

Using HtmlAgilityPack, you will want to do something like the following:
// Requires: using System.IO; using System.Linq; using System.Net;
// "url" is the address of the page to scrape.
var htmlDoc = new HtmlAgilityPack.HtmlDocument();
using (var webclient = new WebClient())
{
    byte[] htmlBytes = webclient.DownloadData(url);
    using (var htmlStreamReader = new StreamReader(new MemoryStream(htmlBytes)))
    {
        htmlDoc.LoadHtml(htmlStreamReader.ReadToEnd());
    }
}
// Pick the production table by its id, since a large page may contain other tables.
var table = htmlDoc.DocumentNode.Descendants("table").FirstOrDefault(t => t.GetAttributeValue("id", "") == "production");
// The first <td class="num"> in that table holds the Lumber value.
var lumberTd = table.Descendants("td").FirstOrDefault(node => node.Attributes["class"] != null && node.Attributes["class"].Value == "num");
string lumberValue = lumberTd.InnerText.Trim();
Note that FirstOrDefault() can return null, so you should add null checks before using table and lumberTd.
Hope that helps.
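
The null checks mentioned above could look roughly like the following. This is a minimal sketch, not the answerer's code: the class name ProductionScraper and the method GetProduction are hypothetical, and it assumes the HtmlAgilityPack NuGet package is referenced and that the table keeps the id "production" shown in the question.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package, not part of the base class library

// Hypothetical helper for illustration only.
public static class ProductionScraper
{
    // Returns the hourly production for the given resource title (e.g. "Lumber"),
    // or null when the table, the row, or the value cell is missing or not a number.
    public static int? GetProduction(string html, string resourceTitle)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Locate the table by its id instead of relying on a position in the page.
        HtmlNode table = doc.GetElementbyId("production");
        if (table == null) return null;

        // Find the row whose icon carries the requested title attribute.
        HtmlNode row = table.Descendants("tr").FirstOrDefault(tr =>
            tr.Descendants("img").Any(img => img.GetAttributeValue("title", "") == resourceTitle));
        if (row == null) return null;

        // The value lives in the cell with class "num".
        HtmlNode numCell = row.Descendants("td")
            .FirstOrDefault(td => td.GetAttributeValue("class", "") == "num");
        if (numCell == null) return null;

        int value;
        return int.TryParse(numCell.InnerText.Trim(), out value) ? value : (int?)null;
    }
}
```

Calling ProductionScraper.GetProduction(html, "Lumber") on the snippet from the question should give 20, and a page without the table gives null instead of throwing a NullReferenceException.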

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(fileName);
var result = doc.DocumentNode.SelectNodes("//div[@class='boxes-contents cf']//tbody/tr")
.First(tr => tr.Element("td").Element("img").Attributes["title"].Value == "Lumber")
.Elements("td")
.First(td=>td.Attributes["class"].Value=="num")
.InnerText
.Trim();

Related

How to check a Condition in .cshtml file

I want to check the FormattedLastFillDate field, but the syntax is throwing an error. Can anyone help me write an if condition in a .cshtml file? Below is the block of code.
@if ( FormattedLastFillDate!= "My logic")
<tr>
<td class="td--numeric">{{OrderNumber}}</td>
<td>
{{DrugName}}
<div class="order-directions">{{Directions}}</div>
<div class="order-message">{{Message}}</div>
</td>
<td>{{DrugStrength}}</td>
<td>{{DrugForm}}</td>
<td class="td--numeric">{{FormattedRefillsLeft}}</td>
<td class="td--numeric">{{Ndc}}</td>
<td class="td--numeric">{{FormattedLastFillDate}}</td>
</tr>

Try this:
@if (FormattedLastFillDate != "My logic")
{
<tr>
<td class="td--numeric">{{OrderNumber}}</td>
<td>
{{DrugName}}
<div class="order-directions">{{Directions}}</div>
<div class="order-message">{{Message}}</div>
</td>
<td>{{DrugStrength}}</td>
<td>{{DrugForm}}</td>
<td class="td--numeric">{{FormattedRefillsLeft}}</td>
<td class="td--numeric">{{Ndc}}</td>
<td class="td--numeric">{{FormattedLastFillDate}}</td>
</tr>
}

The variable must be in an accessible scope.

I think you were pretty close, try this:
@{ string FormattedLastFillDate = "test"; }
@if (FormattedLastFillDate != "test")
{
    <tr>
        <td class="td--numeric">{{OrderNumber}}</td>
        <td>
            {{DrugName}}
            <div class="order-directions">{{Directions}}</div>
            <div class="order-message">{{Message}}</div>
        </td>
        <td>{{DrugStrength}}</td>
        <td>{{DrugForm}}</td>
        <td class="td--numeric">{{FormattedRefillsLeft}}</td>
        <td class="td--numeric">{{Ndc}}</td>
        <td class="td--numeric">{{FormattedLastFillDate}}</td>
    </tr>
}

how to get a text from xpath in c#

I want to show data from my XML file.
This is the file:
<table>
<tr class="even">
<td class="ltid">1</td>
<td class="ltn">لستر سیتی</td>
<td class="ltg">31</td>
<td class="ltw">19</td>
<td class="ltd">9</td>
<td class="ltl">3</td>
<td class="ltgf">54</td>
<td class="ltga">31</td>
<td class="ltgd" dir="ltr">+23</td>
<td class="ltp">66</td>
</tr>
<tr>
<td class="ltid">2</td>
<td class="ltn">تاتنهام</td>
<td class="ltg">31</td>
<td class="ltw">17</td>
<td class="ltd">10</td>
<td class="ltl">4</td>
<td class="ltgf">56</td>
<td class="ltga">24</td>
<td class="ltgd" dir="ltr">+32</td>
<td class="ltp">61</td>
</tr>
<tr>
<td class="ltid">3</td>
<td class="ltn">آرسنال</td>
<td class="ltg">30</td>
<td class="ltw">16</td>
<td class="ltd">7</td>
<td class="ltl">7</td>
<td class="ltgf">48</td>
<td class="ltga">30</td>
<td class="ltgd" dir="ltr">+18</td>
<td class="ltp">55</td>
</tr>
</table>
I want to get the third team, that is,
I want to get '<td class="ltid">3</td>'.
This is the code I tried:
var doc = XDocument.Parse(richTextBox2.Text);
var navigator = doc.CreateNavigator();
var contentCell = navigator.SelectSingleNode("//td[@class='ltid']");
txtTeam.Text = contentCell.Value;
However, I don't know how to get the third td with this class value.
I searched for an answer but couldn't find one.
I also wrote another version before this one, but the first <tr> contains a 3 as well, so it found the value in the first <tr> instead of the third <tr>.
Please help me get the value from the third <tr>.

This is one way:
(//td[@class='ltid'])[3]
The XPath will return the 3rd occurrence of td[@class='ltid'] from the entire XML document.

You can try:
var nav = doc.CreateNavigator();
XPathNodeIterator iterator = nav.Select("//td[@class='ltid']");
while (iterator.MoveNext())
{
// do whatever you want with your item
}

There are 3 ways you could do this:
xpath 1: //tr[3]/td[@class='ltid']
xpath 2: (//td[@class='ltid'])[3]
xpath 3: //td[@class='ltid' and text()='3']
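
To try such expressions against the table, something like the following may help. It is a sketch under assumptions: XPathDemo, SampleXml and SelectText are hypothetical names, the sample is a shortened copy of the question's table, and the queries are written in valid XPath 1.0 syntax (attribute tests use @class, and the text test is combined with the class test so that the "3" in the first row's ltl cell is not matched).

```csharp
using System;
using System.Xml.Linq;
using System.Xml.XPath; // provides the XPathSelectElement extension method

// Hypothetical helper for illustration only.
public static class XPathDemo
{
    // Shortened copy of the question's table: only the ltid and ltl cells are kept.
    public const string SampleXml = @"<table>
  <tr class=""even""><td class=""ltid"">1</td><td class=""ltl"">3</td></tr>
  <tr><td class=""ltid"">2</td><td class=""ltl"">4</td></tr>
  <tr><td class=""ltid"">3</td><td class=""ltl"">7</td></tr>
</table>";

    // Returns the text of the first element matching the XPath, or null if nothing matches.
    public static string SelectText(string xml, string xpath)
    {
        XDocument doc = XDocument.Parse(xml);
        XElement cell = doc.XPathSelectElement(xpath);
        return cell == null ? null : cell.Value;
    }
}
```

For example, XPathDemo.SelectText(XPathDemo.SampleXml, "(//td[@class='ltid'])[3]") returns "3", and so do //tr[3]/td[@class='ltid'] and //td[@class='ltid' and text()='3'] on this sample. The first form depends on the row position, the second on the document-wide order of matching cells, and the third on the cell text, so which one is appropriate depends on how the real table changes.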

c# htmlagilitypack get table values

I have a webpage that needs to be parsed and its values stored in a SQL Server database. I have tried to use HtmlAgilityPack.
HtmlDocument hdoc = new HtmlDocument();
hdoc.LoadHtml(HTML);
var cols = hdoc.DocumentNode.SelectNodes("//table[@id='results']//tr//th//td");
for (int i = 0; i < cols.Count; i = i + 2)
{
DataRow dr = dt.NewRow();
string name = cols[i].InnerText.Trim();
}
This is how my HTML looks:
<table id="results">
<tr>
<th style="white-space: nowrap;">
ID
</th>
<th style="text-align: left;">
Entity Name /<br>
Type
</th>
<th style="white-space: nowrap;">
Registered<br>
Effective Date
</th>
<th>
Status /<br>
Status Date
</th>
</tr>
<tr class="exactMatch" valign="top">
<td class="entityID">
123456
</td>
<td class="nameAndTypeDescription">
<span class="name"><a href="test.aspx?entityID=123456&hash=2055339395&orgTypes=01%2c99">
NAME1 COMPANY </a></span>
<br />
<span class="typeDescription">55 - TRadeUnion Company </span>
</td>
<td class="registeredEffectiveDate">
01/12/1912
</td>
<td class="statusDescriptionAndStatusDate">
<span class="statusDescription">Exists Now </span>
<br>
<span class="statusDate">12/14/1943</span>
</td>
</tr>
<tr class="exactMatch" valign="top">
<td class="entityID">
A23456
</td>
<td class="nameAndTypeDescription">
<span class="name"><a href="test.aspx?entityID=A23456&hash=615278445&orgTypes=01%2c99">
TESTA, INC. </a></span>
<br />
<span class="typeDescription">09 - Domestic Corporation </span>
</td>
<td class="registeredEffectiveDate">
04/29/1926
</td>
<td class="statusDescriptionAndStatusDate">
<span class="statusDescription">Dissolved Company </span>
<br>
<span class="statusDate">06/16/1998</span>
</td>
</tr>
</table>
I need to insert the entity ID, name, hyperlink, type description, registered effective date, status description and status date. Right now they all print on one single line and I do not know how to parse it. Please help.
Thanks
MR

The TDs are not nested under THs.
Try this: SelectNodes("//table[@id='results']/tr/td");
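
To get the individual fields rather than one flat list of cells, it may be easier to select the rows and then query each row by class name. The following is a rough sketch, assuming the HtmlAgilityPack NuGet package and the markup shown in the question; EntityRow and ResultsParser are hypothetical names, and writing the rows to SQL Server is left out.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package, not part of the base class library

// Hypothetical container for one data row of the results table.
public sealed class EntityRow
{
    public string EntityId { get; set; }
    public string Name { get; set; }
    public string Link { get; set; }
    public string TypeDescription { get; set; }
    public string RegisteredEffectiveDate { get; set; }
    public string StatusDescription { get; set; }
    public string StatusDate { get; set; }
}

public static class ResultsParser
{
    // Trimmed text of the first node matching a row-relative XPath, or "" if absent.
    private static string Text(HtmlNode row, string xpath)
    {
        HtmlNode node = row.SelectSingleNode(xpath);
        return node == null ? "" : node.InnerText.Trim();
    }

    public static List<EntityRow> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var rows = new List<EntityRow>();

        // tr[td] keeps only rows that contain td cells, so the th header row is skipped.
        HtmlNodeCollection trs = doc.DocumentNode.SelectNodes("//table[@id='results']/tr[td]");
        if (trs == null) return rows; // SelectNodes returns null when nothing matches

        foreach (HtmlNode tr in trs)
        {
            HtmlNode anchor = tr.SelectSingleNode(".//span[@class='name']/a");
            rows.Add(new EntityRow
            {
                EntityId = Text(tr, "./td[@class='entityID']"),
                Name = Text(tr, ".//span[@class='name']"),
                Link = anchor == null ? "" : anchor.GetAttributeValue("href", ""),
                TypeDescription = Text(tr, ".//span[@class='typeDescription']"),
                RegisteredEffectiveDate = Text(tr, "./td[@class='registeredEffectiveDate']"),
                StatusDescription = Text(tr, ".//span[@class='statusDescription']"),
                StatusDate = Text(tr, ".//span[@class='statusDate']"),
            });
        }
        return rows;
    }
}
```

Each EntityRow can then be mapped to a DataRow or to the parameters of an INSERT statement. The href is relative (test.aspx?...), so it would need to be combined with the page's base address before being stored as a full link.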

Read data in string from begin to end

<td valign="top" class="m92_h_bigimg">
<img border=0 src="http://i2.giatamedia.de/s.php?uid=168846&source=xml&size=320&vea=5vf&cid=2492&file=007399_8790757.jpg" name="bigpic">
</td>
<td valign="top" class="m92_h_bigimg2">
<table border=0 cellpadding=0 cellspacing=0>
<tr>
<td valign="top" class="m92_h_para">Hotel:</td>
<td valign="top" class="m92_h_name">
Melia Tropical <br>
<img src="/images/star.gif" height=13 width=13 alt="*"><img src="/images/star.gif" height=13 width=13 alt="*"><img src="/images/star.gif" height=13 width=13 alt="*"><img src="/images/star.gif" height=13 width=13 alt="*"><img src="/images/star.gif" height=13 width=13 alt="*">
</td>
</tr>
<tr>
<td valign="top" class="m92_h_para">Zimmer:</td>
<td valign="top" class="m92_h_wert"><b>Suite</b></td>
</tr>
<tr>
<td valign="top" class="m92_h_para">Verpflegung:</td>
<td valign="top" class="m92_h_wert"><b>All Inclusive</b></td>
</tr>
<tr>
<td valign="top" class="m92_h_para">Ort:</td>
<td valign="top" class="m92_h_wert">Punta Cana</td>
</tr>
<tr>
<td valign="top" class="m92_h_para">Region:</td>
<td valign="top" class="m92_h_wert">Punta Cana</td>
</tr>
<tr>
<td valign="top" class="m92_h_para">Land:</td>
<td valign="top" class="m92_h_wert">Dom. Republik</td>
</tr>
<tr>
<td valign="top" class="m92_h_para">Anbieter:</td>
<td valign="top" class="m92_h_wert"><img border=0 src="http://www.lmweb.net/lmi/va/gifs/5VF.gif" alt="5 vor Flug" title="5 vor Flug"><br>5 vor Flug</td>
</tr>
</table>
<table border=0 cellpadding=0 cellspacing=0>
<tr>
<td><img src="/images/dropleftw.gif" height="16" width="18"></td>
<td>
<div id="mark" class="m92_notice">
<a target="vakanz" href="siteplus/reminder.php?session_id=rslr1ejntpmj07n0f2smqfhsj5&REC=147203&m_flag=1&m_typ=hotel">Dieses Hotel merken</a>
</div>
</td>
</tr>
<tr>
<td><img src="/images/dropleftw.gif" height="16" width="18"></td>
<td>
<div class="m92_notice">
Hotelbewertung anzeigen
</div>
</td>
</tr>
</table>
</td>
With HtmlAgilityPack, how can I get the data between <td valign="top" class="m92_h_bigimg"> and its closing </td>? I tried the code below, which does not use HtmlAgilityPack. It runs, but it stops at the first </td> it finds, so the result is not correct. I have read that HtmlAgilityPack is the best solution for this kind of problem.
public static string[] GetStringInBetween(string strBegin, string strEnd, string strSource, bool includeBegin, bool includeEnd)
{
string[] result = { "", "" };
int iIndexOfBegin = strSource.IndexOf(strBegin, StringComparison.Ordinal);
if (iIndexOfBegin != -1)
{
int iEnd = strSource.IndexOf(strEnd, iIndexOfBegin, StringComparison.Ordinal);
if (iEnd != -1)
{
result[0] = strSource.Substring(iIndexOfBegin + (includeBegin ? 0 : strBegin.Length), iEnd + (includeEnd ? strEnd.Length : 0) - iIndexOfBegin);
if (iEnd + strEnd.Length < strSource.Length)
result[1] = strSource.Substring(iEnd + strEnd.Length);
}
}
return result;
}
How can I do this?

HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml(html);
var str = htmlDoc.DocumentNode
.Descendants("td")
.Where(x => x.Attributes["class"] != null && x.Attributes["class"].Value == "m92_h_bigimg")
.Select(x => x.InnerHtml)
.First();

HtmlAgilityPack supports standard XPath queries, so I think you could do something like:
foreach (var node in doc.DocumentNode.SelectNodes("//td[@class='m92_h_bigimg']"))
{
// Do work on your node.
}
... where doc is your instance of HtmlDocument

extract data from an html tbody using c#

I am using the C# WebClient to download an HTML string.
A small example of the HTML being returned is:
<tbody class='resultBody ' id='Tbody2'>
<tr id='Tr2' class='firstRow'>
<td class='cbrow tier_Gold' rowspan='4'>
<input type='checkbox' name='listingId' value='452' id='Checkbox2' />
</td>
<td class='resNum' rowspan='4'>
<div class='node'>
B</div>
</td>
<td class='datarow busName' id='Td2'>
</td>
<td rowspan='2' class='resLinks'>
</td>
<td class="hoops" rowspan='2'>
</td>
</tr>
<tr>
<td class="datarow">
<dl class="addrBlock">
<dd class="bizAddr">
123 ABC St</dd>
</dl>
</td>
</tr>
</tbody>
<tbody class='resultBody ' id='Tbody3'>
<tr id='Tr3' class='firstRow'>
<td class='cbrow tier_Gold' rowspan='4'>
<input type='checkbox' name='listingId' value='99' id='Checkbox3' />
</td>
<td class='resNum' rowspan='4'>
<div class='node'>
B</div>
</td>
<td class='datarow busName' id='Td3'>
</td>
<td rowspan='2' class='resLinks'>
</td>
<td class="hoops" rowspan='2'>
</td>
</tr>
<tr>
<td class="datarow">
<dl class="addrBlock">
<dd class="bizAddr">
1111 Some St</dd>
</dl>
</td>
</tr>
</tbody>
I am interested in two elements of the HTML, but I have no idea of the best way to get to them. What would be the best way for me to get the value from and get the inner html from the element
Any suggestions would be great!

Download the HTML Agility Pack (free).
Create a new HtmlDocument.
Call LoadHtml with the downloaded string.
Use DOM navigation or an XPath query (SelectSingleNode etc.) to find the elements.
Access the InnerHtml of the elements you want.
The API is similar to XmlDocument, but it works on HTML that isn't XHTML.
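
The steps above can be sketched as follows. The question does not say which two elements are wanted, so this example assumes, purely for illustration, the value attribute of the listingId checkbox and the inner HTML of the dd with class "bizAddr"; ListingExtractor is a hypothetical name and the HtmlAgilityPack NuGet package is assumed to be referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package, not part of the base class library

// Hypothetical helper for illustration only.
public static class ListingExtractor
{
    // Returns one (listing id, address HTML) pair per result tbody.
    // Either part is null when the corresponding element is missing.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var results = new List<KeyValuePair<string, string>>();

        // The class attribute is 'resultBody ' with a trailing space, so use contains().
        HtmlNodeCollection bodies =
            doc.DocumentNode.SelectNodes("//tbody[contains(@class,'resultBody')]");
        if (bodies == null) return results; // SelectNodes returns null when nothing matches

        foreach (HtmlNode body in bodies)
        {
            // Queries starting with ".//" are evaluated relative to the current tbody.
            HtmlNode input = body.SelectSingleNode(".//input[@name='listingId']");
            HtmlNode addr = body.SelectSingleNode(".//dd[@class='bizAddr']");

            string id = input == null ? null : input.GetAttributeValue("value", "");
            string address = addr == null ? null : addr.InnerHtml.Trim();
            results.Add(new KeyValuePair<string, string>(id, address));
        }
        return results;
    }
}
```

For the sample in the question this should produce the pairs ("452", "123 ABC St") and ("99", "1111 Some St"). If different elements are needed, only the two row-relative XPath expressions have to change.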
