extract data from an html tbody using c#

I am using the C# WebClient to download an HTML string.
A small example of the HTML being returned is:
<tbody class='resultBody ' id='Tbody2'>
<tr id='Tr2' class='firstRow'>
<td class='cbrow tier_Gold' rowspan='4'>
<input type='checkbox' name='listingId' value='452' id='Checkbox2' />
</td>
<td class='resNum' rowspan='4'>
<div class='node'>
B</div>
</td>
<td class='datarow busName' id='Td2'>
</td>
<td rowspan='2' class='resLinks'>
</td>
<td class="hoops" rowspan='2'>
</td>
</tr>
<tr>
<td class="datarow">
<dl class="addrBlock">
<dd class="bizAddr">
123 ABC St</dd>
</dl>
</td>
</tr>
</tbody>
<tbody class='resultBody ' id='Tbody3'>
<tr id='Tr3' class='firstRow'>
<td class='cbrow tier_Gold' rowspan='4'>
<input type='checkbox' name='listingId' value='99' id='Checkbox3' />
</td>
<td class='resNum' rowspan='4'>
<div class='node'>
B</div>
</td>
<td class='datarow busName' id='Td3'>
</td>
<td rowspan='2' class='resLinks'>
</td>
<td class="hoops" rowspan='2'>
</td>
</tr>
<tr>
<td class="datarow">
<dl class="addrBlock">
<dd class="bizAddr">
1111 Some St</dd>
</dl>
</td>
</tr>
</tbody>
I am interested in 2 elements of the HTML, but I have no idea of the best way to get to them. What would be the best way for me to get the value from one element and the inner HTML from the other?
Any suggestions would be great!!!

Download the HTML Agility Pack (free).
Create a new HtmlDocument.
Call LoadHtml with the downloaded string.
Use DOM navigation or an XPath query (SelectSingleNode etc.) to find the elements.
Access the InnerHtml of the elements you want.
The API is similar to XmlDocument, but it works on HTML that isn't XHTML.
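The steps above could be sketched roughly as follows. This is only an illustration, not the answerer's own code: it assumes the two elements of interest are the checkbox input named listingId (for its value attribute) and the dd with class bizAddr (for its inner HTML), and the class name ListingScraper and method name Extract are made up for the example. It needs the HtmlAgilityPack NuGet package.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

// Hypothetical helper; the names here are not from the question.
public static class ListingScraper
{
    // Returns one (ListingId, AddressHtml) pair per tbody whose class contains "resultBody".
    public static List<(string ListingId, string AddressHtml)> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var results = new List<(string ListingId, string AddressHtml)>();

        // contains() is used because the class attribute in the sample has a trailing space.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection bodies =
            doc.DocumentNode.SelectNodes("//tbody[contains(@class, 'resultBody')]");
        if (bodies == null)
        {
            return results;
        }

        foreach (HtmlNode body in bodies)
        {
            // Assumption: these are the two elements the question is after.
            HtmlNode input = body.SelectSingleNode(".//input[@name='listingId']");
            HtmlNode address = body.SelectSingleNode(".//dd[@class='bizAddr']");

            string listingId = input?.GetAttributeValue("value", "") ?? "";
            string addressHtml = address?.InnerHtml.Trim() ?? "";

            results.Add((listingId, addressHtml));
        }

        return results;
    }
}
```

Calling ListingScraper.Extract(downloadedHtml) on the string returned by WebClient would give one pair per tbody. Querying relative to each tbody (the leading "." in the XPath) keeps each listing ID paired with its own address.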

Related

How to check a Condition in .cshtml file

I want to check the FormattedLastFillDate field, but somehow the syntax is throwing an error. Can anyone help me write an if condition in a .cshtml file? Below is the block of code.
@if ( FormattedLastFillDate!= "My logic")
<tr>
<td class="td--numeric">{{OrderNumber}}</td>
<td>
{{DrugName}}
<div class="order-directions">{{Directions}}</div>
<div class="order-message">{{Message}}</div>
</td>
<td>{{DrugStrength}}</td>
<td>{{DrugForm}}</td>
<td class="td--numeric">{{FormattedRefillsLeft}}</td>
<td class="td--numeric">{{Ndc}}</td>
<td class="td--numeric">{{FormattedLastFillDate}}</td>
</tr>

You need to try this one:
@if ( FormattedLastFillDate!= "My logic")
{
<tr>
<td class="td--numeric">{{OrderNumber}}</td>
<td>
{{DrugName}}
<div class="order-directions">{{Directions}}</div>
<div class="order-message">{{Message}}</div>
</td>
<td>{{DrugStrength}}</td>
<td>{{DrugForm}}</td>
<td class="td--numeric">{{FormattedRefillsLeft}}</td>
<td class="td--numeric">{{Ndc}}</td>
<td class="td--numeric">{{FormattedLastFillDate}}</td>
</tr>
}

The variable must be accessible (in scope) in the view.
I think you were pretty close, try this:
@{ string FormattedLastFillDate = "test"; }
@if (FormattedLastFillDate != "test")
{
<tr>
<td class="td--numeric">{{OrderNumber}}</td>
<td>
{{DrugName}}
<div class="order-directions">{{Directions}}</div>
<div class="order-message">{{Message}}</div>
</td>
<td>{{DrugStrength}}</td>
<td>{{DrugForm}}</td>
<td class="td--numeric">{{FormattedRefillsLeft}}</td>
<td class="td--numeric">{{Ndc}}</td>
<td class="td--numeric">{{FormattedLastFillDate}}</td>
</tr>
}

xpath expression not working properly on HtmlAgilityPack

I'm trying to search for an HTML node using XPath expressions.
The objective is to match all tr nodes which have 2 child td nodes with the attribute class="shiftHolder" (the second tr in the example).
<table>
<tr class="staffRow">
<td class="staff" data-staffid="2" data-primaryrole="1">
<div class="sn">Leyla-claire Collins</div>
</td>
<td class="shiftHolder">
</td>
<td class="shiftHolder unavailable">
Holiday
</td>
</tr>
<tr class="staffRow">
<td class="staff" data-staffid="11" data-primaryrole="4">
<div class="sn">Natale Dersley</div>
</td>
<td class="shiftHolder">
</td>
<td class="shiftHolder">
</td>
</tr>
</table>
The following expressions work in three other XPath evaluators I tried, but not in HtmlAgilityPack, where both tr nodes are returned.
//tr[@class='staffRow'][count(td[@class='shiftHolder'])=2][td[@class='staff' and @data-staffid and @data-primaryrole][div[@class='sn']]]
//tr[@class='staffRow' and count(td[@class='shiftHolder'])=2 and td[@class='staff' and @data-staffid and @data-primaryrole and div[@class='sn']]]
Is there any difference in HtmlAgilityPack that I'm not aware of?

c# htmlagilitypack get table values

I have a webpage that needs to be parsed and its values stored in a SQL Server database. I have tried to use the HTML Agility Pack.
HtmlDocument hdoc = new HtmlDocument();
hdoc.LoadHtml(HTML);
var cols = hdoc.DocumentNode.SelectNodes("//table[@id='results']//tr//th//td");
for (int i = 0; i < cols.Count; i = i + 2)
{
DataRow dr = dt.NewRow();
string name = cols[i].InnerText.Trim();
}
This is how my html looks
<table id="results">
<tr>
<th style="white-space: nowrap;">
ID
</th>
<th style="text-align: left;">
Entity Name /<br>
Type
</th>
<th style="white-space: nowrap;">
Registered<br>
Effective Date
</th>
<th>
Status /<br>
Status Date
</th>
</tr>
<tr class="exactMatch" valign="top">
<td class="entityID">
123456
</td>
<td class="nameAndTypeDescription">
<span class="name"><a href="test.aspx?entityID=123456&hash=2055339395&orgTypes=01%2c99">
NAME1 COMPANY </a></span>
<br />
<span class="typeDescription">55 - TRadeUnion Company </span>
</td>
<td class="registeredEffectiveDate">
01/12/1912
</td>
<td class="statusDescriptionAndStatusDate">
<span class="statusDescription">Exists Now </span>
<br>
<span class="statusDate">12/14/1943</span>
</td>
</tr>
<tr class="exactMatch" valign="top">
<td class="entityID">
A23456
</td>
<td class="nameAndTypeDescription">
<span class="name"><a href="test.aspx?entityID=A23456&hash=615278445&orgTypes=01%2c99">
TESTA, INC. </a></span>
<br />
<span class="typeDescription">09 - Domestic Corporation </span>
</td>
<td class="registeredEffectiveDate">
04/29/1926
</td>
<td class="statusDescriptionAndStatusDate">
<span class="statusDescription">Dissolved Company </span>
<br>
<span class="statusDate">06/16/1998</span>
</td>
</tr>
</table>
I need to insert the entity ID, name, hyperlink, type description, registered effective date, status description and status date. Right now they all print on one single line and I do not know how to parse them. Please help.
Thanks
MR

The TDs are not nested under the THs.
Try this: SelectNodes("//table[@id='results']/tr/td");
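Building on that, one way to keep the fields of each row together is to select the rows first and then query each cell relative to its row. The sketch below is only an illustration under assumptions: the EntityRow class, ResultsTableParser and the helper Text are invented names, and the class names in the XPath are taken from the sample HTML in the question. It needs the HtmlAgilityPack NuGet package.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

// Hypothetical container for one result row; not part of the question's code.
public sealed class EntityRow
{
    public string EntityId { get; set; }
    public string Name { get; set; }
    public string Link { get; set; }
    public string TypeDescription { get; set; }
    public string RegisteredEffectiveDate { get; set; }
    public string StatusDescription { get; set; }
    public string StatusDate { get; set; }
}

public static class ResultsTableParser
{
    public static List<EntityRow> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<EntityRow>();

        // tr[td] keeps only rows that have td children, so the th header row is skipped.
        HtmlNodeCollection trs = doc.DocumentNode.SelectNodes("//table[@id='results']//tr[td]");
        if (trs == null)
        {
            return rows; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode tr in trs)
        {
            HtmlNode link = tr.SelectSingleNode(".//span[@class='name']/a");

            rows.Add(new EntityRow
            {
                EntityId = Text(tr, ".//td[@class='entityID']"),
                Name = link?.InnerText.Trim() ?? "",
                Link = link?.GetAttributeValue("href", "") ?? "",
                TypeDescription = Text(tr, ".//span[@class='typeDescription']"),
                RegisteredEffectiveDate = Text(tr, ".//td[@class='registeredEffectiveDate']"),
                StatusDescription = Text(tr, ".//span[@class='statusDescription']"),
                StatusDate = Text(tr, ".//span[@class='statusDate']"),
            });
        }

        return rows;
    }

    // Trimmed inner text of the first match under scope, or "" if there is none.
    private static string Text(HtmlNode scope, string xpath)
    {
        return scope.SelectSingleNode(xpath)?.InnerText.Trim() ?? "";
    }
}
```

Each EntityRow could then be copied into a DataRow or passed as parameters to an INSERT command; that part is left out because the table schema is not shown in the question.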

How can I read the HTML inside a XML tag?

I have an XML document that contains, at some point, this:
<TESTO>
<img src="../path/image.jpg" alt="" />
</TESTO>
Well, if I do:
string TESTO = m_oNode.SelectSingleNode("TESTO").InnerText;
TESTO will be empty. Why? How can I read the whole content? With other tags that contain no HTML, everything works perfectly.
I use XmlDocument.
EDIT - XML that throws an exception with InnerXml:
<TESTO>
<table style="width: 100%;" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td> </td>
<td width="700"><img src="/testata.jpg" alt="mycaf.it" width="700" height="333" border="0" /></td>
<td> </td>
</tr>
<tr>
<td> </td>
<td style="text-align: center; background-color: #f5f5f5;" align="center" bgcolor="#f5f5f5"><br />
<p style="color: #ee2e24; font-style: italic; font-size: 25px; font-family: Arial;">portale<br /> </p>
</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
</tr>
</tbody>
</table>
</TESTO>

InnerText gets only the Text (for mixed content or text content). Use InnerXml instead.
Example:
<A>
Some text in mixed content
<B>OnlyText</B>
</A>
Gives the result:
InnerText = "Some text in mixed content\r\nOnlyText"
InnerXml = "Some text in mixed content\r\n<B>OnlyText</B>";

To read the content of an HTML element you have to use yourElement.InnerXml instead of yourElement.InnerText :)
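The difference can be checked with a small XmlDocument sketch. The class name InnerXmlDemo, the method name Read and the ROOT wrapper element used in the usage note are made up for the illustration; only XmlDocument, SelectSingleNode, InnerText and InnerXml come from the question and answers.

```csharp
using System;
using System.Xml;

// Hypothetical demo class; not part of the original question.
public static class InnerXmlDemo
{
    // Loads the XML and returns both views of the named child of the root element.
    public static (string InnerText, string InnerXml) Read(string xml, string elementName)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        XmlNode node = doc.DocumentElement.SelectSingleNode(elementName);
        if (node == null)
        {
            throw new ArgumentException("Element not found: " + elementName);
        }

        // InnerText concatenates only the text nodes; InnerXml keeps the child markup.
        return (node.InnerText, node.InnerXml);
    }
}
```

For example, calling InnerXmlDemo.Read on a ROOT element that wraps the TESTO element from the question gives an empty InnerText, because the img element contains no text, while InnerXml still holds the img markup.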

Scraping With HtmlAgilityPack

I have a huge HTML page that I want to scrape values from.
I tried to use Firebug to get the XPath of the element I want, but the XPath is not static; it changes from time to time. How can I get the values I want?
In the following snippet I want to get the production of Lumber per hour, which is the value 20:
<div class="boxes-contents cf"><table id="production" cellpadding="1" cellspacing="1">
<thead>
<tr>
<th colspan="4">
Production per hour: </th>
</tr>
</thead>
<tbody>
<tr>
<td class="ico">
<img class="r1" src="img/x.gif" alt="Lumber" title="Lumber" />
</td>
<td class="res">
Lumber:
</td>
<td class="num">
20 </td>
</tr>
<tr>
<td class="ico">
<img class="r2" src="img/x.gif" alt="Clay" title="Clay" />
</td>
<td class="res">
Clay:
</td>
<td class="num">
20 </td>
</tr>
<tr>
<td class="ico">
<img class="r3" src="img/x.gif" alt="Iron" title="Iron" />
</td>
<td class="res">
Iron:
</td>
<td class="num">
20 </td>
</tr>
<tr>
<td class="ico">
<img class="r4" src="img/x.gif" alt="Crop" title="Crop" />
</td>
<td class="res">
Crop:
</td>
<td class="num">
59 </td>
</tr>
</tbody>
</table>
</div>

Using the HTML Agility Pack, you will want to do something like the following.
// Requires: using System.IO; using System.Linq; using System.Net;
// webclient is a System.Net.WebClient and url is the page address; both are defined elsewhere.
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
// Download the raw bytes and decode them through a StreamReader.
byte[] htmlBytes = webclient.DownloadData(url);
using (MemoryStream htmlMemStream = new MemoryStream(htmlBytes))
using (StreamReader htmlStreamReader = new StreamReader(htmlMemStream))
{
    htmlDoc.LoadHtml(htmlStreamReader.ReadToEnd());
}
// Take the first table in the document, then its first td whose class is exactly "num" (the Lumber row in the snippet).
var table = htmlDoc.DocumentNode.Descendants("table").FirstOrDefault();
var lumberTd = table.Descendants("td").FirstOrDefault(node => node.GetAttributeValue("class", "") == "num");
string lumberValue = lumberTd.InnerText.Trim();
Warning: FirstOrDefault() can return null, so you should probably put some checks in there.
Hope that helps.

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(fileName);
var result = doc.DocumentNode.SelectNodes("//div[@class='boxes-contents cf']//tbody/tr")
.First(tr => tr.Element("td").Element("img").Attributes["title"].Value == "Lumber")
.Elements("td")
.First(td=>td.Attributes["class"].Value=="num")
.InnerText
.Trim();
