Read invisible data from table with htmlagilitypack - c#

I have this HTML containing a table.
I can get the "col1" and "col2" values, but I don't know how to also get the values of the "data-index" and "data-name" attributes:
<table class="footable table" id="footable">
<tbody>
<tr class="trclass red" data-index="123" data-name="Apple">
<td class="col1" >Green</td>
<td class="col2" >1.25</td>
</tr>
</tbody>
</table>
What I have tried:
public static void Main()
{
var html =
@"<html>
<table id='footable'>
<tbody>
<tr class='trclass red' data-index='123' data-name='Apple'>
<td class='col1'>Green</td>
<td class='col2'>1.25</td>
</tr>
</tbody>
</table></html>";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var tbody = htmlDoc.DocumentNode.SelectNodes("//table[contains(@id, 'foo')]//tr//td");
foreach(var nob in tbody)
{
Console.Write(nob.InnerHtml);
}
}
I know that I can use nob.Attributes["data-index"], but those attributes are on the parent tr, not on the td elements that hold "Green" and "1.25".
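One possible approach, sketched here under the assumption that the markup looks like the sample above, is to select the tr rows instead of the td cells, read the row attributes with GetAttributeValue, and then query the cells relative to each row. The class name RowAttributesSketch is made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class RowAttributesSketch
{
    public static void Main()
    {
        // Sample markup mirroring the question; single quotes need no escaping in a verbatim string.
        var html = @"<table id='footable'><tbody>
<tr class='trclass red' data-index='123' data-name='Apple'>
<td class='col1'>Green</td><td class='col2'>1.25</td>
</tr></tbody></table>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // Select the rows rather than the cells, so the row attributes are directly available.
        var rows = htmlDoc.DocumentNode.SelectNodes("//table[@id='footable']//tr");
        if (rows == null) return; // SelectNodes returns null when nothing matches

        foreach (var row in rows)
        {
            // GetAttributeValue returns the supplied default if the attribute is missing.
            var index = row.GetAttributeValue("data-index", "");
            var name = row.GetAttributeValue("data-name", "");

            // "./td[...]" is evaluated relative to the current row.
            var col1 = row.SelectSingleNode("./td[@class='col1']")?.InnerText.Trim();
            var col2 = row.SelectSingleNode("./td[@class='col2']")?.InnerText.Trim();

            Console.WriteLine($"{index} {name} {col1} {col2}");
        }
    }
}
```

If the loop has to stay on the td nodes, nob.ParentNode.GetAttributeValue("data-index", "") should reach the same attribute, as long as each td sits directly inside its tr.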

Related

Using HtmlAgilityPack with C# to find all href links within td elements in html page

I am attempting to use the HtmlAgilityPack package to find each of the href links within td tags throughout an entire HTML page. The trick is that these tables start deep down in the HTML structure. I noticed that with HtmlAgilityPack you can't just say "get all tds that are within trs on a page". There is a parent div wrapped around each table, with the class "table-group", that I am not showing in my sample below. Maybe I can use that as a starting point? My biggest problem is that there are several parent elements above everything in my sample below, but I want to skip all of that and start here.
Here is a sample of the structure I am trying to navigate:
<table>
<thead>
</thead>
<tbody>
<tr>
<td>Link 1</td>
<td>1</td>
</tr>
<tr>
<td>Link 2</td>
<td>2</td>
</tr>
<tr>
<td>Link 3</td>
<td>3</td>
</tr>
</tbody>
</table>
<table>
<thead>
</thead>
<tbody>
<tr>
<td>Link 4</td>
<td>4</td>
</tr>
<tr>
<td>Link 5</td>
<td>5</td>
</tr>
<tr>
<td>Link 6</td>
<td>6</td>
</tr>
</tbody>
</table>
I would like my end result to be:
https://path-to-pdf1
https://path-to-pdf2
https://path-to-pdf3
https://path-to-pdf4
https://path-to-pdf5
https://path-to-pdf6
Here is what I have tried:
var html = @"https://myurl.com";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(html);
var nodes = htmlDoc.DocumentNode.SelectNodes("//table/tbody/tr/td/a[0]");
foreach (var item in nodes)
{
Console.WriteLine(item.Attributes["href"].Value);
}
Console.ReadKey();
Modify
var nodes = htmlDoc.DocumentNode.SelectNodes("//table/tbody/tr/td/a[0]");
to
var nodes = htmlDoc.DocumentNode.SelectNodes("//table/tbody/tr/td[1]/a");
Then you will get the result you want. XPath positions are 1-based, so a[0] never matches anything. You can read the XPath documentation for more details.
I tried this in an MVC project with the same HTML file.
Update:
I copied the HTML into a page on my local machine and got the nodes successfully.
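As a rough sketch of a more defensive variant: SelectNodes returns null when nothing matches, so a null check avoids an exception in the loop. The sample markup below is an assumption (the links in the question's sample are not shown), and the class name HrefSketch is made up.

```csharp
using System;
using HtmlAgilityPack;

public static class HrefSketch
{
    public static void Main()
    {
        // Assumed sample: two rows whose first cell wraps a link.
        var html = @"<table><tbody>
<tr><td><a href='https://path-to-pdf1'>Link 1</a></td><td>1</td></tr>
<tr><td><a href='https://path-to-pdf2'>Link 2</a></td><td>2</td></tr>
</tbody></table>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // "//td//a[@href]" finds links at any depth inside any cell,
        // regardless of how many wrapper elements sit above the tables.
        var links = htmlDoc.DocumentNode.SelectNodes("//td//a[@href]");
        if (links == null)
        {
            Console.WriteLine("No links found.");
            return;
        }

        foreach (var link in links)
        {
            Console.WriteLine(link.GetAttributeValue("href", ""));
        }
    }
}
```

To limit the search to the wrapped tables, the expression could start from the wrapper instead, for example "//div[@class='table-group']//td//a[@href]", assuming the class attribute holds exactly that value.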

Using HTMLAgility pack to extract value from a Xpath using c# console app

I have the following HTML, and I used Google Chrome to get the XPath.
<DIV id=TasheelPaymentCtrl1_dvPayment>
<TABLE border=1 cellSpacing=0 borderColor=black cellPadding=7 width=625 align=center>
<TBODY>
<TR>
<TD class=ReceiptHeadArbCenterHead1 width=320>المسمى </TD>
<TD class=ReceiptHeadArbCenterHead1 width=75>دفع إلى</TD>
<TD class=ReceiptHeadArbCenterHead1 width=75>القيمة</TD>
<TD class=ReceiptHeadArbCenterHead1 width=75>الكمية</TD>
<TD class=ReceiptHeadArbCenterHead1 width=75>المجموع</TD></TR>
<TR>
<TD class=ReceiptHeadArbCenterHead>رسوم وزارة العمل</TD>
<TD class=ReceiptValueArbCenter>MOFI</TD>
<TD class=ReceiptValueArbCenter>3</TD>
<TD class=ReceiptValueArbCenter>1</TD>
<TD class=ReceiptValueArbCenter>3</TD>
<TR>
<TD class=ReceiptHeadArbCenterHead>رسوم الدرهم الإلكتروني</TD>
<TD class=ReceiptValueArbCenter>MOFI</TD>
<TD class=ReceiptValueArbCenter>3</TD>
<TD class=ReceiptValueArbCenter>1</TD>
<TD class=ReceiptValueArbCenter>3</TD>
<TR>
<TD class=ReceiptHeadArbCenterHead>رسوم مراكز الخدمة </TD>
<TD class=ReceiptValueArbCenter>MOFI</TD>
<TD class=ReceiptValueArbCenter>47</TD>
<TD class=ReceiptValueArbCenter>1</TD>
<TD class=ReceiptValueArbCenter>47</TD>
<TR>
<TD class=ReceiptHeadArbCenterHead1 colSpan=4>المجموع</TD>
<TD class=ReceiptValueArbCenter>53</TD></TR></TBODY></TABLE></DIV>
I want to extract values 3, 3, 47 and 53
I tried using this xpath
var gf = doc.DocumentNode.SelectNodes("//div[@id='TasheelPaymentCtrl1_dvPayment']/table/tbody/tr[2]/td[5]");
foreach (var node in gf)
{
Console.WriteLine(node.InnerText); //output: "3"
}
var sf = doc.DocumentNode.SelectNodes("//div[@id='TasheelPaymentCtrl1_dvPayment']/table/tbody/tr[3]/td[5]");
foreach (var node in sf)
{
Console.WriteLine(node.InnerText); //output: "3"
}
var tf = doc.DocumentNode.SelectNodes("//div[@id='TasheelPaymentCtrl1_dvPayment']/table/tbody/tr[4]/td[5]");
foreach (var node in tf)
{
Console.WriteLine(node.InnerText); //output: "47"
}
var Allf = doc.DocumentNode.SelectNodes("//div[@id='TasheelPaymentCtrl1_dvPayment']/table/tbody/tr[5]/td[2]");
foreach (var node in Allf )
{
Console.WriteLine(node.InnerText); //output: "53"
}
but I am getting a null reference exception.
I used Google Chrome developer tools to copy the XPath.
My question is: why am I getting a null reference exception? Is there a mistake in the XPath?
Please help me.
As you have discovered, some of your XPath expressions don't work because the <tr> tags are not all closed.
Therefore, you will need to cater for this in your XPath expressions:
//div[@id='TasheelPaymentCtrl1_dvPayment']/table/tbody/tr[2]/td[5] - no change
//div[@id='TasheelPaymentCtrl1_dvPayment']/table/tbody/tr[3]/td[5] - should be //div[@id='TasheelPaymentCtrl1_dvPayment']/table/tbody/tr[2]/tr/td[5]
//div[@id='TasheelPaymentCtrl1_dvPayment']/table/tbody/tr[4]/td[5] - should be //div[@id='TasheelPaymentCtrl1_dvPayment']/table/tbody/tr[2]/tr/tr/td[5]
//div[@id='TasheelPaymentCtrl1_dvPayment']/table/tbody/tr[5]/td[2] - should be //div[@id='TasheelPaymentCtrl1_dvPayment']/table/tbody/tr[2]/tr/tr/tr/td[2]
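An alternative that might avoid depending on how the unclosed rows end up nested is to use the descendant axis and pick the last value cell under each row. This is only a sketch: the markup is a shortened stand-in for the question's table, the class name LastValueCellSketch is made up, and how the parser arranges the unclosed rows has not been verified here.

```csharp
using System;
using HtmlAgilityPack;

public static class LastValueCellSketch
{
    public static void Main()
    {
        // Shortened stand-in for the markup in the question, including unclosed <TR> tags.
        var html = @"<DIV id=TasheelPaymentCtrl1_dvPayment><TABLE><TBODY>
<TR><TD class=ReceiptHeadArbCenterHead>A</TD><TD class=ReceiptValueArbCenter>MOFI</TD><TD class=ReceiptValueArbCenter>3</TD>
<TR><TD class=ReceiptHeadArbCenterHead>B</TD><TD class=ReceiptValueArbCenter>MOFI</TD><TD class=ReceiptValueArbCenter>47</TD>
<TR><TD class=ReceiptHeadArbCenterHead1 colSpan=4>Total</TD><TD class=ReceiptValueArbCenter>53</TD></TR>
</TBODY></TABLE></DIV>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//td" searches at any depth, so nested rows do not matter.
        // "[last()]" is evaluated per parent, keeping the last matching cell of each row.
        var cells = doc.DocumentNode.SelectNodes(
            "//div[@id='TasheelPaymentCtrl1_dvPayment']//td[@class='ReceiptValueArbCenter'][last()]");

        if (cells == null)
        {
            Console.WriteLine("No matching cells.");
            return;
        }

        foreach (var cell in cells)
        {
            Console.WriteLine(cell.InnerText.Trim());
        }
    }
}
```

The null check matters either way: SelectNodes returns null when an expression matches nothing, which is what causes the exception in the foreach loops above.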

HtmlAgilityPack cannot find specific td

I need to extract the value of one specific td from the table using XPath, but the code always returns null. How can I fix this?
var location = GetLocation(document.Result.DocumentNode.SelectSingleNode("//*[@id='detailTabTable']/tbody/tr[3]/td[2]"));
and the code
private string GetLocation(HtmlNode h)
{
try
{
string location = null;
if (h == null)
{
location = "N/A";
}
else
{
location = h.InnerText;
location = location.Substring(0, location.IndexOf(",", StringComparison.InvariantCulture));
}
return location;
}
catch (Exception ex)
{
log.ErrorFormat("Error in Link Data Repository {0} in Parse Links {1}", ex.Message, ex.StackTrace);
throw new Exception(ex.Message);
}
}
And a small, simple table:
<table id="detailTabTable" width="99%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td class="detailTabContentLt">Current List Price:</td>
<td class="detailTabContentPriceRt">
<span class="aiDetailCurrentPrice">AED 6,600,000</span>
</td>
</tr>
<tr>
<td class="detailTabContentLt" style="white-space: nowrap;">Plot size (Sq. Ft.):</td>
<td class="detailTabContentRt">N/A</td>
</tr>
<tr>
<td class="detailTabContentLt" valign="top">Locality</td>
<td class="detailTabContentRt">Dubai, Dubai</td>
</tr>
<tr>
<td colspan="2"></td>
</tr>
</table>
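I have just tested your code. As mentioned in the comments, when you remove tbody from your XPath expression everything works fine. Browsers insert a tbody element into the DOM, but HtmlAgilityPack parses the source as written, and this table has none. This worked fine for me.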
private static void htmlAgilityPackTest()
{
string html = " <table id=\"detailTabTable\" width=\"99%\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\"><tr><td class=\"detailTabContentLt\">Current List Price:</td><td class=\"detailTabContentPriceRt\"><span class=\"aiDetailCurrentPrice\">AED 6,600,000</span></td> </tr><tr> <td class=\"detailTabContentLt\" style=\"white-space: nowrap;\">Plot size (Sq. Ft.):</td><td class=\"detailTabContentRt\">N/A</td></tr> <tr><td class=\"detailTabContentLt\" valign=\"top\">Locality</td> <td class=\"detailTabContentRt\">Dubai, Dubai</td> </tr> <tr><td colspan=\"2\"></td> </tr> </table>";
HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(html);
var node = document.DocumentNode.SelectSingleNode("//*[@id='detailTabTable']/tr[3]/td[2]");
string location = GetLocation(node);
Console.WriteLine("Location: " + location);
}
In case I misunderstood anything please let me know.
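If the same code has to handle pages with and without a tbody, one option is to put a descendant step before the row. The following is a minimal sketch on a shortened copy of the table; the class name OptionalTbodySketch is made up.

```csharp
using System;
using HtmlAgilityPack;

public static class OptionalTbodySketch
{
    public static void Main()
    {
        // Shortened version of the table from the question; the source has no <tbody>.
        var html = @"<table id='detailTabTable'>
<tr><td>Current List Price:</td><td>AED 6,600,000</td></tr>
<tr><td>Plot size (Sq. Ft.):</td><td>N/A</td></tr>
<tr><td>Locality</td><td>Dubai, Dubai</td></tr>
</table>";

        var document = new HtmlDocument();
        document.LoadHtml(html);

        // "//tr[3]" matches the third row under whatever parent holds the rows,
        // so the expression works whether or not a <tbody> is present.
        var node = document.DocumentNode.SelectSingleNode("//*[@id='detailTabTable']//tr[3]/td[2]");

        // Fall back to "N/A" when the cell is not found.
        Console.WriteLine(node?.InnerText.Trim() ?? "N/A");
    }
}
```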
You can use fizzler and select stuff the CSS way :)
http://blog.simontimms.com/2014/02/24/parsing-html-in-c-using-css-selectors/

How to Extract an Html element from a snippet of Html with HtmlAgilityPack?

Here is the Html code:
<table style="border:1px solid #000">
<tr style="background:#ddd;">
<td width="150">TableEle1</td>
<td width="150">TableEle2</td>
<td width="150">TableEle3</td>
<td width="150">TableEle4</td>
<td width="150">TableEle5</td>
<td width="150">TableEle6</td>
<td width="150">TableEle7</td>
<td width="150">TableEle8</td>
</tr>
And here is the code I use to extract the table element 1 (but not successful)
htmlHelper.SetNode(@"//td/text()='TableEle1'");
Is there any advice for me?
You can use a blend of HtmlAgilityPack and Linq to get the desired td node.
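// Requires: using System.Linq;
HtmlDocument document = new HtmlDocument();
document.LoadHtml("[your HTML string]");
// Select the td elements themselves (not their text nodes), then filter by text with LINQ.
var nodes = document.DocumentNode.SelectNodes("//td");
var tdNode = nodes.FirstOrDefault(s => s.InnerText.Trim() == "TableEle1");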
Hope this helps!
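The same lookup can probably be done in XPath alone by moving the text comparison into a predicate. A short sketch, with a trimmed copy of the table and a made-up class name:

```csharp
using System;
using HtmlAgilityPack;

public static class TextPredicateSketch
{
    public static void Main()
    {
        // Trimmed copy of the row from the question.
        var html = @"<table><tr><td width='150'>TableEle1</td><td width='150'>TableEle2</td></tr></table>";

        var document = new HtmlDocument();
        document.LoadHtml(html);

        // The comparison belongs inside square brackets: select the td whose text equals the value.
        var td = document.DocumentNode.SelectSingleNode("//td[text()='TableEle1']");

        Console.WriteLine(td != null ? td.OuterHtml : "not found");
    }
}
```

The original expression //td/text()='TableEle1' is a boolean comparison rather than a node selection, which is likely why it did not return a node.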

Remove line from string where certain HTML tag is found

So I have this HTML page that is exported to an Excel file through an MVC action. The action actually goes and renders this partial view, and then exports that rendered view with correct formatting to an Excel file. However, the view is rendered exactly how it is seen before I do the export, and that view contains an "Export to Excel" button, so when I export this, the button image appears as a red X in the top left corner of the Excel file.
I can intercept the string containing this HTML to render in the ExcelExport action, and it looks like this for one example:
<div id="summaryInformation" >
<img id="ExportToExcel" style=" cursor: pointer;" src="/Extranet/img/btn_user_export_excel_off.gif" />
<table class="resultsGrid" cellpadding="2" cellspacing="0">
<tr>
<td id="NicknameLabel" class="resultsCell">Nick Name</td>
<td id="NicknameValue" colspan="3">
Swap
</td>
</tr>
<tr>
<td id="EffectiveDateLabel" class="resultsCell">
<label for="EffectiveDate">Effective Date</label>
</td>
<td id="EffectiveDateValue" class="alignRight">
02-Mar-2011
</td>
<td id ="NotionalLabel" class="resultsCell">
<label for="Notional">Notional</label>
</td>
<td id="NotionalValue" class="alignRight">
<span>
USD
</span>
10,000,000.00
</td>
</tr>
<tr>
<td id="MaturityDateLabel" class="resultsCell">
<label for="MaturityDate">Maturity Date</label>
</td>
<td id="MaturityDateValue" class="alignRight">
02-Mar-2016
-
Modified Following
</td>
<td id="TimeStampLabel" class="resultsCell">
Rate Time Stamp
</td>
<td id="Timestamp" class="alignRight">
28-Feb-2011 16:00
</td>
</tr>
<tr >
<td id="HolidatCityLabel" class="resultsCell"> Holiday City</td>
<td id="ddlHolidayCity" colspan="3">
New York,
London
</td>
</tr>
</table>
</div>
<script>
$("#ExportToExcel").click(function () {
// ajax call to do the export
var actionUrl = "/Extranet/mvc/Indications.cfc/ExportToExcel";
var viewName = "/Extranet/Views/Indications/ResultsViews/SummaryInformation.aspx";
var fileName = 'SummaryInfo.xls';
GridExport(actionUrl, viewName, fileName);
});
</script>
That <img id="ExportToExcel" tag at the top is the one I want to remove just for the export. All of what you see is contained within a C# string. How would I go and remove that line from the string so it doesn't try and render the image in Excel?
EDIT: Would probably make sense also that we wouldn't need any of the <script> in the export either, but since that won't show up in Excel anyway I don't think that's a huge deal for now.
Remove all img tags:
string html2 = Regex.Replace( html, @"(<img\/?[^>]+>)", @"",
RegexOptions.IgnoreCase );
Include reference: using System.Text.RegularExpressions;
If it's in a C# string then just:
myHTMLString = myHTMLString.Replace(@"<img id=""ExportToExcel"" style="" cursor: pointer;"" src=""/Extranet/img/btn_user_export_excel_off.gif"" />", "");
The safest way to do this will be to use the HTML Agility Pack to read in the HTML and then write code that removes the image node from the HTML.
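HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlString);
// HtmlAgilityPack spells this method GetElementbyId (lowercase "b").
HtmlNode image = doc.GetElementbyId("ExportToExcel");
if (image != null)
{
image.Remove();
}
htmlString = doc.DocumentNode.OuterHtml;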
You can use similar code to remove the script tag and other img tags.
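As a sketch of that idea, the helper below removes every img and script element in one pass. The name StripImagesAndScripts and the sample fragment are hypothetical, not part of the original code.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class StripForExportSketch
{
    // Hypothetical helper: removes every <img> and <script> element from an HTML fragment.
    public static string StripImagesAndScripts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The XPath union operator "|" selects both kinds of element at once.
        var nodes = doc.DocumentNode.SelectNodes("//img | //script");
        if (nodes != null) // SelectNodes returns null when nothing matches
        {
            // Copy to a list first so removal never interferes with enumeration.
            foreach (var node in nodes.ToList())
            {
                node.Remove();
            }
        }

        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        var html = "<div><img id=\"ExportToExcel\" src=\"x.gif\" /><p>Swap</p><script>var a = 1;</script></div>";
        Console.WriteLine(StripImagesAndScripts(html));
    }
}
```

Parsing the fragment this way does not depend on the img tag sitting on its own line or on its exact attribute order, unlike the string-based approaches.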
I'm just using this
private string RemoveImages(string html)
{
StringBuilder retval = new StringBuilder();
using (StringReader reader = new StringReader(html))
{
string line = string.Empty;
do
{
line = reader.ReadLine();
if (line != null)
{
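// Trim leading whitespace so indented <img> lines are also skipped.
if (!line.TrimStart().StartsWith("<img", StringComparison.OrdinalIgnoreCase))
{
// AppendLine keeps the original line breaks in the output.
retval.AppendLine(line);
}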
}
} while (line != null);
}
return retval.ToString();
}
