HTML Agility pack - parsing tables - c#

I want to use the HTML Agility Pack to parse tables from complex web pages, but I am somewhat lost in the object model.
I looked at the linked example, but did not find any table data that way.
Can I use XPath to get the tables? I am basically lost, after having loaded the data, as to how to get the tables. I have done this in Perl before and it was a bit clumsy, but it worked (HTML::TableParser).
I would also be happy if someone could just shed some light on the right object order for the parsing.

How about something like:
// Using HTML Agility Pack
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p><table id=""foo""><tr><th>hello</th></tr><tr><td>world</td></tr></table></body></html>");
foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table")) {
    Console.WriteLine("Found: " + table.Id);
    foreach (HtmlNode row in table.SelectNodes("tr")) {
        Console.WriteLine("row");
        foreach (HtmlNode cell in row.SelectNodes("th|td")) {
            Console.WriteLine("cell: " + cell.InnerText);
        }
    }
}
Note that you can make it prettier with LINQ-to-Objects if you want:
var query = from table in doc.DocumentNode.SelectNodes("//table").Cast<HtmlNode>()
            from row in table.SelectNodes("tr").Cast<HtmlNode>()
            from cell in row.SelectNodes("th|td").Cast<HtmlNode>()
            select new { Table = table.Id, CellText = cell.InnerText };

foreach (var cell in query) {
    Console.WriteLine("{0}: {1}", cell.Table, cell.CellText);
}

The simplest way I've found to get the XPath for a particular element is to install the FireBug extension for Firefox, go to the site/webpage, and press F12 to bring up FireBug. Right-click the element on the page that you want to query and select "Inspect Element"; FireBug will select the element in its panel. Then right-click the element in FireBug and choose "Copy XPath". This gives you the exact XPath query you need to get the element you want with the HTML Agility Library.
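As a minimal sketch of how a copied expression is then used (the XPath and the html variable below are placeholders, not taken from any of the posts here):
// Hypothetical example: paste the XPath copied from FireBug/dev tools into SelectSingleNode.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html); // 'html' is assumed to hold the page source
// "/html/body/div[2]/table[1]" is a placeholder for the copied expression
HtmlNode node = doc.DocumentNode.SelectSingleNode("/html/body/div[2]/table[1]");
if (node != null)
    Console.WriteLine(node.InnerText);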

I know this is a pretty old question, but this is the solution that helped me visualize the table so you can create a class structure from it. It also uses the HTML Agility Pack.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p><table id=""foo""><tr><th>hello</th></tr><tr><td>world</td></tr></table></body></html>");

var table = doc.DocumentNode.SelectSingleNode("//table");
var tableRows = table.SelectNodes("tr");
var columns = tableRows[0].SelectNodes("th/text()");

for (int i = 1; i < tableRows.Count; i++)
{
    for (int e = 0; e < columns.Count; e++)
    {
        var value = tableRows[i].SelectSingleNode($"td[{e + 1}]");
        Console.Write(columns[e].InnerText + ":" + value.InnerText);
    }
    Console.WriteLine();
}
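As a hedged follow-up to the idea of building a class structure (the variable names below are mine, not from the answer above), the same loop can collect each data row into a dictionary keyed by header text, which can then be mapped onto whatever class you define:
// Build one dictionary per data row: header text -> cell text.
var rows = new List<Dictionary<string, string>>();
for (int i = 1; i < tableRows.Count; i++)
{
    var rowValues = new Dictionary<string, string>();
    for (int e = 0; e < columns.Count; e++)
    {
        var cell = tableRows[i].SelectSingleNode($"td[{e + 1}]");
        rowValues[columns[e].InnerText] = cell?.InnerText ?? string.Empty;
    }
    rows.Add(rowValues);
}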

In my case, there is a single table which happens to be a device list from a router. If you wish to read the table using TR/TH/TD (row, header, data) instead of a matrix as mentioned above, you can do something like the following:
List<TableRow> deviceTable = (from table in document.DocumentNode.SelectNodes(XPathQueries.SELECT_TABLE)
                              from row in table?.SelectNodes(HtmlBody.TR)
                              let rows = row.SelectSingleNode(HtmlBody.TR)
                              where row.FirstChild.OriginalName != null && row.FirstChild.OriginalName.Equals(HtmlBody.T_HEADER)
                              select new TableRow
                              {
                                  Header = row.SelectSingleNode(HtmlBody.T_HEADER)?.InnerText,
                                  Data = row.SelectSingleNode(HtmlBody.T_DATA)?.InnerText
                              }).ToList();
TableRow is just a simple object with Header and Data as properties.
The approach takes care of null-ness and this case:
<tr>
<td width="28%"> </td>
</tr>
which is a row without a header. The HtmlBody object with the constants hanging off of it can probably be deduced readily, but I apologize for it all the same. I come from a world where, if you have a string literal in your code, it should either be a constant or be localizable.
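For completeness, a minimal sketch of what TableRow and the constant holders might look like; these definitions are my assumption based on how they are used above, the original answer does not show them:
// Assumed shapes, matching how they are used in the query above.
public class TableRow
{
    public string Header { get; set; }
    public string Data { get; set; }
}

public static class HtmlBody
{
    public const string TR = "tr";
    public const string T_HEADER = "th";
    public const string T_DATA = "td";
}

public static class XPathQueries
{
    public const string SELECT_TABLE = "//table";
}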

Line from above answer:
HtmlDocument doc = new HtmlDocument();
This doesn't work in VS 2015 C#. You cannot construct an HtmlDocument any more.
Another MS "feature" that makes things more difficult to use. Try HtmlAgilityPack.HtmlWeb and check out this link for some sample code.
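For reference, a minimal sketch of loading a page with HtmlAgilityPack.HtmlWeb (the URL is a placeholder, not from the posts above):
// HtmlWeb downloads the page and returns an HtmlDocument you can query as usual.
var web = new HtmlAgilityPack.HtmlWeb();
var doc = web.Load("http://example.com/"); // placeholder URL
var tables = doc.DocumentNode.SelectNodes("//table");
if (tables != null)
{
    foreach (var table in tables)
        Console.WriteLine(table.Id);
}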

Related

crawling price gives null , HtmlAgilityPack (C#)

I'm trying to get stock data from a website with a web crawler as a hobby project. I got the link to work and I got the name of the stock, but I can't get the price... I don't know how to handle the HTML code. Here is my code:
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
var divs = htmlDocument.DocumentNode.Descendants("div")
    .Where(n => n.GetAttributeValue("class", "").Equals("Flexbox__StyledFlexbox-sc-1ob4g1e-0 eYavUv Row__StyledRow-sc-1iamenj-0 foFHXj Rows__AlignedRow-sc-1udgki9-0 dnLFDN"))
    .ToList();

var stocks = new List<Stock>();
foreach (var div in divs)
{
    var stock = new Stock()
    {
        Name = div.Descendants("a").Where(a => a.GetAttributeValue("class", "").Equals("Link__StyledLink-sc-apj04t-0 foCaAq NameCell__StyledLink-sc-qgec4s-0 hZYbiE")).FirstOrDefault().InnerText,
        changeInPercent = div.Descendants("span").Where(a => a.GetAttributeValue("class", "").Equals("Development__StyledDevelopment-sc-hnn1ri-0 kJLDzW")).FirstOrDefault()?.InnerText
    };
    stocks.Add(stock);
}

foreach (var stock in stocks)
{
    Console.WriteLine(stock.Name + " ");
}
I got the name correct, but I don't really know how to get the ChangeInPercent... I will paste in the HTML code below.
The top highlight shows where I got the name from, and the second one is the "span" I want. I want the -4.70.
I'm a little bit confused when it comes to getting the data with my code. I tried everything. My changeInPercent property is a string.
It has to be the code somehow...
There's probably an easier way to select a single attribute/node than the way you're doing it right now:
If you know the exact XPath expression to select the node you're looking for, then you can do the following:
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
var changeInPercent = htmlDocument.DocumentNode
    .SelectSingleNode("//foo/bar")
    .InnerText;
Getting the right XPath expression (the //foo/bar example above) is the tricky part, but it can be found quite easily using your browser's dev tools. You can navigate to the desired element and just copy its XPath expression - simple as that! See here for a sample of how to copy the expression.
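If the copied full XPath turns out to be too brittle, a hedged alternative sketch is to match on part of the span's class attribute; the class prefix below is taken from the question's own code and may change whenever the site redeploys:
// Selects the first span whose class attribute contains the (assumed stable) prefix.
var changeInPercent = htmlDocument.DocumentNode
    .SelectSingleNode("//span[contains(@class, 'Development__StyledDevelopment')]")
    ?.InnerText;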

WebScraper C# + htmlagilitypack

I have started developing an application. I need to take some information from a website, download it into a database, and then process that information.
I don't have much experience yet and would be grateful for any recommendation.
For example, I'll be working with a sports site (https://terrikon.com/football/spain/championship/).
I need to get the information from the table and save this data into the DB.
I tried a few ways of downloading the data and concluded that the best way is to use HtmlAgilityPack.
I read the documentation for this library, and the best I came up with is:
using System;
using System.Xml;
using HtmlAgilityPack;

namespace Parser
{
    class Program
    {
        static void Main(string[] args)
        {
            var html = @"https://terrikon.com/football/spain/championship/";
            HtmlWeb web = new HtmlWeb();
            var htmlDoc = web.Load(html);
            var node = htmlDoc.DocumentNode.SelectSingleNode("//head/title");

            var table = htmlDoc.QuerySelector("#champs-table > table");
            var tableRows = table.QuerySelectorAll("tr");
            foreach (var row in tableRows)
            {
                var team = row.QuerySelector(".team");
                var win = row.QuerySelector(".win");
                var draw = row.QuerySelector(".draw");
                var lose = row.QuerySelector(".lose");
                Console.WriteLine(team.OuterHtml);
            }
        }
    }
}
I can get the title of the website, or everything, if I change this line to:
var node = htmlDoc.DocumentNode.SelectSingleNode("//head");
Could you give me advice on how to get the information only from the table?
Thanks for your attention.
I would recommend also installing a CSS selector extension for HtmlAgilityPack, which can be found here:
https://github.com/hcesar/HtmlAgilityPack.CssSelector
With that you can query your nodes with CSS selectors.
To get the information from that table, you will have to know the CSS selector for it.
In this case it's:
#champs-table > table
So to get the whole table you could do it like this:
var table = htmlDoc.QuerySelector("#champs-table > table");

// then query the rows of that table:
var tableRows = table.QuerySelectorAll("tr");

// Now each element in tableRows is a <tr> from that HTML table,
// so you can access every value in a foreach loop:
foreach (var row in tableRows)
{
    var team = row.QuerySelector(".team"); // "team" is a CSS class applied to the <td> containing the team name
    var win = row.QuerySelector(".win");
    var draw = row.QuerySelector(".draw");
    var lose = row.QuerySelector(".lose");
}
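As a hedged follow-up (not part of the original answer): the first row of such a table is typically the header row, so the class-based queries can come back null there. A sketch that skips those rows and prints the values:
foreach (var row in tableRows)
{
    var team = row.QuerySelector(".team");
    if (team == null)
        continue; // likely the header row, or a row without the expected cells

    Console.WriteLine("{0}: {1}-{2}-{3}",
        team.InnerText.Trim(),
        row.QuerySelector(".win")?.InnerText,
        row.QuerySelector(".draw")?.InnerText,
        row.QuerySelector(".lose")?.InnerText);
}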

Html Agility Pack Xpath not working

What I'm trying to do is parse an HTML document using Html Agility Pack. I load the HTML doc and that works. The issue arises when I try to parse it using XPath: I get a "System.NullReferenceException: 'Object reference not set to an instance of an object.'" error.
To get my XPath I use the Chrome developer tools, highlight the whole table that has the rows containing the data I want to parse, right-click it, and copy the XPath.
Here's my code
string url = "https://www.ctbiglist.com/index.asp";
string myPara = "LastName=Smith&FirstName=James&PropertyID=&Submit=Search+Properties";
string htmlResult;

// Get the raw HTML from the website
using (WebClient client = new WebClient())
{
    client.Headers[HttpRequestHeader.ContentType] = "application/x-www-form-urlencoded";
    // Send in the link along with the FirstName, LastName, and Submit POST request
    htmlResult = client.UploadString(url, myPara);
    //Console.WriteLine(htmlResult);
}

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlResult);

HtmlNodeCollection table = doc.DocumentNode.SelectNodes("//*[@id=\"Table2\"]/tbody/tr[2]/td/table/tbody/tr/td/div[2]/table/tbody/tr[2]/td/table/tbody/tr[2]/td/form/div/table[1]/tbody/tr");
Console.WriteLine(table.Count);
The code below, on the other hand, runs but grabs all the tables in the HTML document:
var query = from table in doc.DocumentNode.SelectNodes("//table").Cast<HtmlNode>()
            from row in table.SelectNodes("//tr").Cast<HtmlNode>()
            from cell in row.SelectNodes("//th|td").Cast<HtmlNode>()
            select new { Table = table.Id, CellText = cell.InnerText };

foreach (var cell in query)
{
    Console.WriteLine("{0}: {1}", cell.Table, cell.CellText);
}
What I want is the one specific table that holds all the table rows containing the data I want to parse into objects.
Thanks for the help!!!
Change the line
from table in doc.DocumentNode.SelectNodes("//table").Cast<HtmlNode>()
to
from table in doc.DocumentNode.SelectNodes("//table[@id=\"Table2\"]").Cast<HtmlNode>()
This will select only the specific table with the given Id. But if you have nested tables, then you have to change your XPath accordingly to get the nested tables' rows.
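A hedged side note on why the original LINQ query grabbed everything: inside the nested query, an XPath that starts with // searches the whole document, not just the current table. Prefixing it with a dot makes it relative to the node it is called on, for example:
// ".//" keeps the row and cell searches scoped to the current table/row.
var query = from table in doc.DocumentNode.SelectNodes("//table[@id=\"Table2\"]").Cast<HtmlNode>()
            from row in table.SelectNodes(".//tr").Cast<HtmlNode>()
            from cell in row.SelectNodes(".//th|.//td").Cast<HtmlNode>()
            select new { Table = table.Id, CellText = cell.InnerText };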

C# to get data from a website

I would like to get the data from this website and put them into a dictionary.
Basically these are prices and quantities for some financial instruments.
I have this source code for the page (here is just an extract of the whole text):
<tr>
<td class="quotesMaxTime1414148558" id="notation115602071"><span>4,000.00</span></td>
<td><span>0</span></td>
<td class="icon red"><span id="domhandler:8.consumer:VALUE-2CCLASS.comp:PREV.gt:green.eq:ZERO.lt:red.resetLt:.resetGt:.resetEq:ZERO.mdgObj:prices-2Fquote-3FVERSION-3D2-26CODE_SELECTOR_PREVIOUS_LAST-3DLATEST-26ID_TYPE_PERFORMANCE-3D7-26ID_TYPE_PRICE-3D1-26ID_QUALITY_PRICE-3D5-26ID_NOTATION-3D115602071.attr:PERFORMANCE_PCT.wtkm:options_options_snapshot_1">-3.87%</span></td>
<td><span id="domhandler:9.consumer:VALUE-2CCLASS.comp:PREV.gt:green.eq:ZERO.lt:red.resetLt:.resetGt:.resetEq:ZERO.mdgObj:prices-2Fquote-3FVERSION-3D2-26CODE_SELECTOR_PREVIOUS_LAST-3DLATEST-26ID_TYPE_PERFORMANCE-3D7-26ID_TYPE_PRICE-3D1-26ID_QUALITY_PRICE-3D5-26ID_NOTATION-3D115602071.attr:PRICE.wtkm:options_options_snapshot_1">960.40</span></td>
</tr>
Now I would like to extract the following information:
The value "4000" from the second line;
The value "-3.87%" from the fourth line;
The value "960.40" from the fifth line.
I have tried to use the following to extract the first piece of information (the value 4000):
string url = "http://www.eurexchange.com/action/exchange-en/4744-19066/19068/quotesSingleViewOption.do?callPut=Put&maturityDate=201411";
var webGet = new HtmlWeb();
var document = webGet.Load(url);
var firstData = from x in document.DocumentNode.Descendants()
                where x.Name == "td" && x.Attributes.Contains("class")
                select x.InnerText;
but firstData doesn't contain the info I want (the value 4000); it contains this:
System.Linq.Enumerable+WhereSelectEnumerableIterator`2[HtmlAgilityPack.HtmlNode,System.String]
How can I get this data? I would also need to repeat this task several times, because the page contains more than one row with similar information. Is HTML Agility Pack useful in this context? Thanks.
This may be somewhat ugly; it was quickly thrown together and could probably be cleaned up greatly, but it returns all of the values you are looking for from the Prices/Quotes table found on that page. Hope it helps.
var url = "http://www.eurexchange.com/action/exchange-en/4744-19066/19068/quotesSingleViewOption.do?callPut=Put&maturityDate=201411";
var webGet = new HtmlWeb();
var document = webGet.Load(url);

var pricesAndQuotesDataTable =
    (from elem in
         document.DocumentNode.Descendants()
             .Where(
                 d =>
                     d.Attributes["class"] != null && d.Attributes["class"].Value == "toggleTitle" &&
                     d.ChildNodes.Any(h => h.InnerText != null && h.InnerText == "Prices/Quotes"))
     select
         elem.Descendants()
             .FirstOrDefault(
                 d => d.Attributes["class"] != null && d.Attributes["class"].Value == "dataTable")).FirstOrDefault();

if (pricesAndQuotesDataTable != null)
{
    var dataRows = from elem in pricesAndQuotesDataTable.Descendants()
                   where elem.Name == "tr" && elem.ParentNode.Name == "tbody"
                   select elem;

    var dataPoints = new List<object>();
    foreach (var row in dataRows)
    {
        var dataColumns = (from col in row.ChildNodes.Where(n => n.Name == "td")
                           select col).ToList();
        dataPoints.Add(
            new
            {
                StrikePrice = dataColumns[0].InnerText,
                DifferenceToPreviousDay = dataColumns[9].InnerText,
                LastPrice = dataColumns[10].InnerText
            });
    }
}
That's because your LINQ query hasn't executed yet. If you check the Results View in the debugger and run the query, you'll get all the items, the first being the value you are looking for.
So, this will get you 4,000.00:
var firstData = (from x in document.DocumentNode.Descendants()
                 where x.Name == "td" && x.Attributes.Contains("class")
                 select x.InnerText).First();
If you want them all, call ToList() instead of First().
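For completeness, a minimal sketch of the ToList() variant (same query as above, just materialized into a list):
var allData = (from x in document.DocumentNode.Descendants()
               where x.Name == "td" && x.Attributes.Contains("class")
               select x.InnerText).ToList();

foreach (var text in allData)
    Console.WriteLine(text);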
If you are open to using CsQuery, then try this one:
static void Main()
{
    CsQuery.CQ cq = CsQuery.CQ.CreateFromUrl("http://www.eurexchange.com/action/exchange-en/4744-19066/19068/quotesSingleViewOption.do?callPut=Put&maturityDate=201411");
    string str = cq["#notation115602071 span"].Text();
}
You could use the Html Agility Pack. Unlike XmlDocument or XDocument, the Html Agility Pack is tolerant of malformed HTML (which exists all over the internet and probably on the site you are trying to parse).
Not all HTML pages can be assumed to be valid XML.
With the Html Agility Pack you can load your page and parse it with XPath or an object model similar to System.Xml.
Html Agility Pack
Optionally, you could use a PDF-to-text converter and parse a text file with much better accuracy, since the website you linked offers a PDF export of the same data:
PDF Export Link
Convert PDF to Text
We did a similar project a few years back to spider all the major online betting websites and create a comparison tool to get the best prices for each type of event, e.g. display all the major bookmakers with betting odds for a particular football game, in order of best return.
It turned out to be a complete nightmare: the rendered HTML output of the websites kept changing almost daily, and it was often poorly formed HTML that could sometimes crash the spider daemon, so we had to constantly maintain the system to keep it working properly.
With these sorts of things it's often more economical to subscribe to a data feed, which requires much less maintenance and allows easier integration.

Can I 'import' this value into a XML structure and parse it? C#

I have a SharePoint server and I am playing around with it programmatically. One of the applications I am playing with is Microsoft's Call Center. I queried the list of customers:
if (cList.Title.ToLower().Equals("service requests"))
{
    textBox1.Text += "> Service Requests" + Environment.NewLine;
    foreach (SPListItem item in cList.Items)
    {
        textBox1.Text += string.Format(">> {0}{1}", item.Title, Environment.NewLine);
    }
}
One of the properties of item is XML. Here is the value of one:
<z:row
xmlns:z='#RowsetSchema' ows_ID='1' ows_ContentTypeId='0x0106006324F8B638865542BE98AD18210EB6F4'
ows_ContentType='Contact' ows_Title='Mouse' ows_Modified='2009-08-12 14:53:50' ows_Created='2009-08-12 14:53:50'
ows_Author='1073741823;#System Account' ows_Editor='1073741823;#System Account'
ows_owshiddenversion='1' ows_WorkflowVersion='1' ows__UIVersion='512' ows__UIVersionString='1.0'
ows_Attachments='0' ows__ModerationStatus='0' ows_LinkTitleNoMenu='Mouse' ows_LinkTitle='Mouse'
ows_SelectTitle='1' ows_Order='100.000000000000' ows_GUID='{37A91B6B-B645-446A-8E8D-DA8250635DE1}'
ows_FileRef='1;#Lists/customersList/1_.000' ows_FileDirRef='1;#Lists/customersList'
ows_Last_x0020_Modified='1;#2009-08-12 14:53:50' ows_Created_x0020_Date='1;#2009-08-12 14:53:50'
ows_FSObjType='1;#0' ows_PermMask='0x7fffffffffffffff' ows_FileLeafRef='1;#1_.000'
ows_UniqueId='1;#{28A223E0-100D-49A6-99DA-7947CFC38B18}' ows_ProgId='1;#'
ows_ScopeId='1;#{79BF21FE-0B9A-43B1-9077-C071B61F5588}' ows__EditMenuTableStart='1_.000'
ows__EditMenuTableEnd='1' ows_LinkFilenameNoMenu='1_.000' ows_LinkFilename='1_.000'
ows_ServerUrl='/Lists/customersList/1_.000' ows_EncodedAbsUrl='http://spvm:3333/Lists/customersList/1_.000'
ows_BaseName='1_' ows_MetaInfo='1;#' ows__Level='1' ows__IsCurrentVersion='1' ows_FirstName='Mickey'
ows_FullName='Mickey Mouse' ows_Comments='<div></div>' ows_ServerRedirected='0'
/>
Can I create an XmlNode or some other sort of XML object so that I can easily parse it and pull out certain values (exactly which values is still unknown, since I am just testing right now)?
Thanks SO!
If the XML is valid you could use XmlDocument.LoadXml like so:
XmlDocument doc = new XmlDocument();
doc.LoadXml(validxmlstring);
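As a hedged sketch of how that might look with the z:row fragment above (the attribute names come straight from the sample; the fragment parses because the z prefix is declared inline on the element):
// 'rowXml' is assumed to hold the <z:row ... /> string shown above.
XmlDocument doc = new XmlDocument();
doc.LoadXml(rowXml);

// Attributes can then be read directly off the root element.
string title = doc.DocumentElement.GetAttribute("ows_Title");       // "Mouse"
string fullName = doc.DocumentElement.GetAttribute("ows_FullName"); // "Mickey Mouse"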
You can do this and it should work fine (although I would use the XML document approach Colin mentions, or even better LINQ). You may also find the LINQ extensions in SharePoint Extensions Lib useful.
However, I'm wondering why you would approach it this way instead of using the SPListItem.Item property? It's much simpler to use and very clear. For example:
var title = listItem["Title"]; // Returns title of item
var desc = listItem["Description"]; // Returns value of description field
The only trap is the unusual case of a list that contains a field with an internal name equal to another field's display name. This will always return the value of the field with the internal name first.
Just curious if you have a requirement to go the XML route.
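A hedged sketch of the LINQ-to-XML variant mentioned above (requires using System.Xml.Linq; attribute names are taken from the sample row):
// XElement.Parse handles the inline xmlns:z declaration on the fragment.
XElement row = XElement.Parse(rowXml); // 'rowXml' assumed to hold the <z:row ... /> string
string title = (string)row.Attribute("ows_Title");       // "Mouse"
string fullName = (string)row.Attribute("ows_FullName"); // "Mickey Mouse"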
I came up with this code, which seems to work OK. Since my XML/C# knowledge is limited, I imagine there is a simpler way:
public void DoParse(string value, string elementname)
{
    // Split the raw string on the apostrophe character ((char)39 == '\''), so attribute
    // values end up at the odd indices and the text ending in "name=" at the even indices.
    var split = value.Split((char)39);

    XmlDocument xDoc = new XmlDocument();
    XmlElement xRoot = xDoc.CreateElement(elementname);
    xDoc.AppendChild(xRoot);

    for (var i = 0; i < split.Length - 1; i += 2)
    {
        var attribName = split[i].Replace("=", "").Trim();
        var xAttrib = xDoc.CreateAttribute(attribName);
        xAttrib.Value = split[i + 1];
        xRoot.Attributes.Append(xAttrib);
    }

    xDoc.Save(string.Format("c:\\xmlout_{0}.xml", elementname));
}
Gives me:
<Customer xmlns:z="#RowsetSchema" ows_ID="1" ows_ContentTypeId="0x0106006324F8B638865542BE98AD18210EB6F4" ows_ContentType="Contact" ows_Title="Mouse" ows_Modified="2009-08-12 14:53:50" ows_Created="2009-08-12 14:53:50" ows_Author="1073741823;#System Account" ows_Editor="1073741823;#System Account" ows_owshiddenversion="1" ows_WorkflowVersion="1" ows__UIVersion="512" ows__UIVersionString="1.0" ows_Attachments="0" ows__ModerationStatus="0" ows_LinkTitleNoMenu="Mouse" ows_LinkTitle="Mouse" ows_SelectTitle="1" ows_Order="100.000000000000" ows_GUID="{37A91B6B-B645-446A-8E8D-DA8250635DE1}" ows_FileRef="1;#Lists/customersList/1_.000" ows_FileDirRef="1;#Lists/customersList" ows_Last_x0020_Modified="1;#2009-08-12 14:53:50" ows_Created_x0020_Date="1;#2009-08-12 14:53:50" ows_FSObjType="1;#0" ows_PermMask="0x7fffffffffffffff" ows_FileLeafRef="1;#1_.000" ows_UniqueId="1;#{28A223E0-100D-49A6-99DA-7947CFC38B18}" ows_ProgId="1;#" ows_ScopeId="1;#{79BF21FE-0B9A-43B1-9077-C071B61F5588}" ows__EditMenuTableStart="1_.000" ows__EditMenuTableEnd="1" ows_LinkFilenameNoMenu="1_.000" ows_LinkFilename="1_.000" ows_ServerUrl="/Lists/customersList/1_.000" ows_EncodedAbsUrl="http://spvm:3333/Lists/customersList/1_.000" ows_BaseName="1_" ows_MetaInfo="1;#" ows__Level="1" ows__IsCurrentVersion="1" ows_FirstName="Mickey" ows_FullName="Mickey Mouse" ows_Comments="&lt;div&gt;&lt;/div&gt;" ows_ServerRedirected="0" />
Does anyone have some input? Thanks.
