Parsing html with the HTML Agility Pack and Linq

I have the following HTML
(..)
<tbody>
<tr>
<td class="name"> Test1 </td>
<td class="data"> Data </td>
<td class="data2"> Data 2 </td>
</tr>
<tr>
<td class="name"> Test2 </td>
<td class="data"> Data2 </td>
<td class="data2"> Data 2 </td>
</tr>
</tbody>
(..)
The only information I have is the name, i.e. "Test1" or "Test2". How can I get the values in the "data" and "data2" cells based on the name I have?
Currently I'm using:
var data =
from
tr in doc.DocumentNode.Descendants("tr")
from
td in tr.ChildNodes.Where(x => x.Attributes["class"].Value == "name")
where
td.InnerText == "Test1"
select tr;
But I get {"Object reference not set to an instance of an object."} when I try to look in data

There are two issues with your code:
ChildNodes also returns the whitespace text nodes between the cells. Those nodes have no class attribute, so Attributes["class"] is null and reading .Value throws the exception.
As James Walford commented, the spaces around the text are significant, so you probably want to trim them.
With these two corrections, the following works:
var data =
from tr in doc.DocumentNode.Descendants("tr")
from td in tr.Descendants("td").Where(x => x.Attributes["class"].Value == "name")
where td.InnerText.Trim() == "Test1"
select tr;
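
The two corrections above can be combined with a null-safe attribute read so that rows containing cells without a class attribute do not throw either. This is only a sketch: it assumes the HtmlAgilityPack NuGet package is referenced, and RowLookup and GetCell are hypothetical names that are not part of the library.

```csharp
using System.Linq;
using HtmlAgilityPack; // NuGet package, assumed to be referenced

public static class RowLookup
{
    // Hypothetical helper: returns the trimmed text of the cell with class
    // `cellClass` in the first row whose "name" cell equals `name`,
    // or null when no such row or cell exists.
    public static string GetCell(HtmlDocument doc, string name, string cellClass)
    {
        HtmlNode row = doc.DocumentNode.Descendants("tr")
            .FirstOrDefault(tr => tr.Descendants("td").Any(td =>
                // GetAttributeValue returns the default ("") when the attribute is missing
                td.GetAttributeValue("class", "") == "name" &&
                td.InnerText.Trim() == name));

        HtmlNode cell = row?.Descendants("td")
            .FirstOrDefault(td => td.GetAttributeValue("class", "") == cellClass);

        return cell?.InnerText.Trim();
    }
}
```

After loading the question's HTML with doc.LoadHtml(...), RowLookup.GetCell(doc, "Test2", "data") should return "Data2".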

Here is the XPath way. Everyone seems to have forgotten about the power of XPath and concentrates exclusively on C# XLinq these days :-)
This function gets all data values associated with a name:
public static IEnumerable<string> GetData(HtmlDocument document, string name)
{
return from HtmlNode node in
document.DocumentNode.SelectNodes("//td[@class='name' and contains(text(), '" + name + "')]/following-sibling::td")
select node.InnerText.Trim();
}
For example, this code will dump all 'Test2' data:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourHtml);
foreach (string data in GetData(doc, "Test2"))
{
Console.WriteLine(data);
}

Here's one approach - first parse all data into a data structure, and then read it. This is a little messy and certainly needs more validation, but here goes:
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load("http://jsbin.com/ezuge4");
HtmlNodeCollection nodes = doc.DocumentNode
.SelectNodes("//table[@id='MyTable']//tr");
var data = nodes.Select(
node => node.Descendants("td")
.ToDictionary(descendant => descendant.Attributes["class"].Value,
descendant => descendant.InnerText.Trim())
).ToDictionary(dict => dict["name"]);
string test1Data = data["Test1"]["data"];
Here I turn every <tr> to a dictionary, where the class of the <td> is a key and the text is a value. Next, I turn the list of dictionaries into a dictionary of dictionaries (tip - abstract that away), where the name of every <tr> is the key.

instead of
td.InnerText == "Test1"
try
td.InnerText == " Test1 "
or
td.InnerText.Trim() == "Test1"

I can recommend one of two ways:
The HTML Agility Pack (http://htmlagilitypack.codeplex.com/), which converts the HTML into a well-formed document that can then be queried with out-of-the-box LINQ.
Or,
Linq to HTML (http://www.superstarcoders.com/linq-to-html.aspx), which, while not maintained on CodePlex (that was a hint, Keith), gives a reasonable working set of features to springboard from.

Related

Correctly use regular expressions to extract word

I've got an ASP.NET Core project that requires me to read the response from a website and extract a certain word.
I tried replacing the tags with white space and then removing them. Unfortunately, I'm not getting anywhere with this. What is a better approach?
I want to extract Toyota from these html tags
<tr>
<td class="text-muted">Car Model</td>
<td><strong>Toyota 2015</strong></td>
</tr>
I've tried:
var documentSource = streamReader.ReadToEnd();
//removes html content
Regex remove = new Regex(@"<[^>].+?>");
var strippedSource = remove.Replace(documentSource.Replace("\n", ""), "");
//convert to array
string[] siteContextArray = strippedSource.Split(',');
//matching string
var match = new Regex("Car Model ([^2015]*)");
List<Model> modelList = new List<Model>();
Model model = new Model();
foreach (var item in siteContextArray)
{
var wordMatch = match.Match(item);
if (wordMatch.Success)
{
model.Add(
new Model
{
CarModel = wordMatch.Groups[1].Value
}
);
}
}
return modelList;

Use NuGet to add the HTML Agility Pack to your solution.
Usage
var html = @"
<tr>
<td class=""text-muted"">Car Model</td>
<td><strong> Toyota 2015 </strong></td>
</tr>
<tr>
<td class=""text-muted"">Car Model</td>
<td><strong> Toyota 2016 </strong></td>
</tr>";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var models = htmlDoc.DocumentNode
.SelectNodes("//tr/td[text()='Car Model']")
.Select(node => node.SelectSingleNode("following-sibling::*[1][self::td]").InnerText);
By the way, I think it would be nice to add a CSS class to the content element, like
<td class="car-model"><strong> Toyota 2016 </strong></td>
which makes the HTML more meaningful and easier to extract from.
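
The InnerText returned above is still the whole cell text, for example " Toyota 2015 ", while the question asks for just "Toyota". One way to split it is sketched below. It assumes the cell always ends in a four-digit year; CarModelParser and TryParse are hypothetical names chosen for illustration.

```csharp
using System.Text.RegularExpressions;

public static class CarModelParser
{
    // Splits cell text such as " Toyota 2015 " into a make and a four-digit year.
    // Returns false when the text does not end in a four-digit year.
    public static bool TryParse(string cellText, out string make, out int year)
    {
        make = string.Empty;
        year = 0;

        // Lazy "make" group, then whitespace, then exactly four ASCII digits at the end.
        Match m = Regex.Match(cellText.Trim(), @"^(?<make>.+?)\s+(?<year>[0-9]{4})$");
        if (!m.Success)
        {
            return false;
        }

        make = m.Groups["make"].Value;
        year = int.Parse(m.Groups["year"].Value);
        return true;
    }
}
```

For " Toyota 2015 " this yields make "Toyota" and year 2015. A cell without a trailing year, such as "Toyota", makes TryParse return false.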

How to check if an XML attribute contains a string?

Here is the XML (I have saved an HTML page in XML form to parse it generically):
<td width="76" class="DataB">2.276</td>
<td width="76" class="DataB">2.289</td>
<td width="76" class="DataB">2.091</td>
<td width="76" class="DataB">1.952</td>
<td width="76" class="DataB">1.936</td>
<td width="76" class="Current2">1.899</td>
Now I am trying to find all of the elements whose class contains the string Current, because the web page changes the number at the end of the class name:
var xElements = xml.Descendants("td").Where(element => ((string) element.Attribute("class")).Contains("Current"));
This returns an object does not exist error here:
((string) element.Attribute("class"))
How can I check an attribute if it contains something?

If you ask me, this is easier to write as an XPath query. That way you don't have to deal with elements that have no class attribute, and other such cases.
var query = xml.XPathSelectElements("//td[contains(@class,'Current')]");
Otherwise, you would have to check for the existence of the attribute before trying to read it.
// query syntax makes this a little nicer
var query =
from td in xml.Descendants("td")
let classStr = (string)td.Attribute("class")
where classStr != null && classStr.Contains("Current")
select td;
// or alternatively, provide a default value
var query =
from td in xml.Descendants("td")
where ((string)td.Attribute("class") ?? "").Contains("Current")
select td;
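
To see why the null check matters, the sketch below runs both variants over input that includes a td with no class attribute at all, which is the case that makes the original query throw. ClassFilter and its method names are hypothetical; only System.Xml.Linq and System.Xml.XPath from the base class library are used.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class ClassFilter
{
    // LINQ variant: a missing class attribute is treated as an empty string.
    public static List<string> ViaLinq(XDocument xml, string fragment)
    {
        return xml.Descendants("td")
            .Where(td => ((string)td.Attribute("class") ?? "").Contains(fragment))
            .Select(td => td.Value)
            .ToList();
    }

    // XPath variant: contains(@class, ...) is simply false when @class is absent.
    public static List<string> ViaXPath(XDocument xml)
    {
        return xml.XPathSelectElements("//td[contains(@class,'Current')]")
            .Select(td => td.Value)
            .ToList();
    }
}
```

With XDocument.Parse("<tr><td>n/a</td><td class=\"DataB\">1.936</td><td class=\"Current2\">1.899</td></tr>"), both methods return a single value, "1.899", and neither throws on the first cell.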

There's probably something wrong with the XML input you're using - trying this code works for me in LINQPad:
XDocument xml = XDocument.Parse(@"<tr><td width=""76"" class=""DataB"">2.276</td>
<td width=""76"" class=""DataB"">2.289</td>
<td width=""76"" class=""DataB"">2.091</td>
<td width=""76"" class=""DataB"">1.952</td>
<td width=""76"" class=""DataB"">1.936</td>
<td width=""76"" class=""Current2"">1.899</td></tr>");
var xElements = xml.Descendants("td").Where(element => ((string) element.Attribute("class")).Contains("Current"));
xElements.Dump();
Are you sure your XML is valid?

How to find nearest match from current context node

I've got a rather large XML file that I'm trying to parse using a C# application and the HtmlAgilityPack. The XML looks something like this:
...
<tr>
<td><b>ABC-123</b></td>
<td>15</td>
<td>4</td>
</tr>
<tr>
<td>AB-4-320</td>
<td>11</td>
<td>2</td>
</tr>
<tr>
<td><b>ABC-123</b></td>
<td>15</td>
<td>4</td>
</tr>
<tr>
<td>AB-4-320</td>
<td>11</td>
<td>2</td>
</tr>
<tr>
<td>CONTROLLER1</td>
<td>4</td>
<td>3</td>
</tr>
<tr>
<td>CONTROLLER2</td>
<td>4</td>
<td>3</td>
</tr>
...
Basically a series of table rows and columns that repeats. I'm first doing a search for a controller by using:
string xPath = @"//tr/td[starts-with(.,'CONTROLLER2')]";
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xPath);
foreach (HtmlNode link in nodes) { ... }
Which returns the correct node. Now I want to search backwards (up) for the first (nearest) matching <td> node that starts with text "ABC":
string xPath = link.XPath + @"/parent::tr/preceding-sibling::tr/td[starts-with(.,'ABC-')]";
This returns all matching nodes, not just the nearest one. When I attempted to add [1] to the end of this XPath string, it didn't seem to work and I've found no examples showing a predicate being used with an axes function like this. Or, more likely, I'm doing it wrong. Any suggestions?

You can use this XPath :
/parent::tr/preceding-sibling::tr[td[starts-with(.,'ABC-')]][1]/td[starts-with(.,'ABC-')]
That searches for the nearest preceding <tr> that has a child <td> starting with 'ABC-', and then gets that particular <td> element.
There are at least two approaches you can pick from when using the HtmlAgilityPack:
foreach (HtmlNode link in nodes)
{
// approach 1: note the dot (.) at the beginning of the XPath
string xPath1 =
@"./parent::tr/preceding-sibling::tr[td[starts-with(.,'ABC-')]][1]/td[starts-with(.,'ABC-')]";
var n1 = link.SelectSingleNode(xPath1);
Console.WriteLine(n1.InnerHtml);
// approach 2: appending to the XPath of the current link
string xPath2 =
@"/parent::tr/preceding-sibling::tr[td[starts-with(.,'ABC-')]][1]/td[starts-with(.,'ABC-')]";
var n2 = link.SelectSingleNode(link.XPath + xPath2);
Console.WriteLine(n2.InnerHtml);
}
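
The effect of [1] on the reverse preceding-sibling axis can be checked without the HtmlAgilityPack, because the sample is well-formed XML. The sketch below uses System.Xml.XPath as a stand-in and gives the two ABC rows different values (ABC-123 and ABC-456 in the usage note are made-up data) so the nearest one is identifiable. NearestMatch and FindNearestAbc are hypothetical names.

```csharp
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class NearestMatch
{
    // Finds the first cell whose text starts with `controllerName`, then walks
    // backwards to the nearest preceding row containing a cell that starts with
    // "ABC-". Returns that cell's trimmed text, or null when there is none.
    public static string FindNearestAbc(XDocument doc, string controllerName)
    {
        XElement cell = doc.Descendants("td")
            .FirstOrDefault(td => td.Value.Trim().StartsWith(controllerName));
        if (cell == null)
        {
            return null;
        }

        // preceding-sibling is a reverse axis, so [1] selects the closest row.
        XElement hit = cell.XPathSelectElement(
            "../preceding-sibling::tr[td[starts-with(normalize-space(.),'ABC-')]][1]" +
            "/td[starts-with(normalize-space(.),'ABC-')]");

        return hit?.Value.Trim();
    }
}
```

Given rows ABC-123, AB-4-320, ABC-456, AB-4-320, CONTROLLER1, CONTROLLER2 in that order, FindNearestAbc(doc, "CONTROLLER2") returns "ABC-456", the closer of the two matches.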

If you're able to use LINQ-to-XML instead of the HAP then this works:
var node = xml.Root.Elements("tr")
.TakeWhile(tr => !tr.Elements("td")
.Any(td => td.Value.StartsWith("CONTROLLER2")))
.SelectMany(tr => tr.Elements("td"))
.Where(td => td.Value.StartsWith("ABC-"))
.Last();
I got this result:
<td>
<b>ABC-123</b>
</td>
(Which I checked was the second matching node in your sample, not the first.)

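You can use
//tr/td[starts-with(.,'CONTROLLER2')]/(parent::tr/preceding-sibling::tr/td[starts-with(normalize-space(.),'ABC-')])[1]
Since the target node may contain unwanted whitespace, normalize-space is needed. Note that a parenthesized step in the middle of a path is XPath 2.0 syntax; the XPath 1.0 engine in .NET, which the HtmlAgilityPack builds on, will reject this expression.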

I think an XPATH like this (from the current CONTROLLER2 node) should do it:
string xPath = "../preceding-sibling::tr[starts-with(td , 'ABC-')][1]/td[starts-with(. , 'ABC-')]";
It means:
go up one level to the parent TR (..)
from there, select all preceding sibling TR elements whose first TD starts with 'ABC-'
take the first of these TRs (in reverse document order, i.e. the nearest one)
from this TR element, get the TD elements that start with 'ABC-'

Converting DataTable to XML using XElement as template

I have an XElement (table template) with next code:
<table border="0" cellpadding="0" cellspacing="0" width="920" align="center">
<tr>
<td valign="top" width="200">
</td>
<td valign="top" width="400">
</td>
</tr>
</table>
I also have a DataTable which contains 2 columns and, for example, 5 rows. I need to put the data from this DataTable into a new XElement, using the table template. I think my function should look like this:
public XElement Filling(XElement htmlTable, DataTable dataTable)
But I don't know how to get a particular part of htmlTable.
For instance, how can I get only the first td with the attribute width="200"?
Any suggestions?
Thanks

I don't understand the question entirely, but this is an example of how to get only the first td with the attribute width="200", assuming that htmlTable contains the HTML shown in the question:
var firstTd = htmlTable.Descendants("td").FirstOrDefault(o => (string)o.Attribute("width") == "200");
htmlTable.Descendants("element_name") returns every element with that name nested anywhere inside the table tag (so "tr" or "td" in this case). Casting the attribute to string instead of reading .Value avoids an exception on td elements that have no width attribute.
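
For the Filling function itself, one possible reading of the question is sketched below: the template's single tr is used as a row pattern, and one copy of it is emitted per DataRow, with column i written into the i-th td. That interpretation is an assumption, as is the requirement that the template has at least one tr directly under table.

```csharp
using System;
using System.Data;
using System.Linq;
using System.Xml.Linq;

public static class TableFiller
{
    // Sketch: returns a copy of the template table with one row per DataRow.
    // Assumes the template has at least one <tr> directly under <table>.
    public static XElement Filling(XElement htmlTable, DataTable dataTable)
    {
        XElement result = new XElement(htmlTable);            // deep copy, template stays untouched
        XElement rowTemplate = result.Elements("tr").First();
        rowTemplate.Remove();                                 // the empty pattern row is not kept

        foreach (DataRow dataRow in dataTable.Rows)
        {
            XElement row = new XElement(rowTemplate);         // copy keeps valign/width attributes
            var cells = row.Elements("td").ToList();
            for (int i = 0; i < cells.Count && i < dataTable.Columns.Count; i++)
            {
                cells[i].Value = Convert.ToString(dataRow[i]);
            }
            result.Add(row);
        }

        return result;
    }
}
```

With the template from the question and a DataTable of 2 columns and 5 rows, the result contains 5 tr elements, each with the two td cells carrying the original width attributes.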

private DataTable ParseHtmlTable( string html )
{
DataTable retVal = new DataTable();
XElement data = XElement.Parse( html );
// Gets rows of values. XML names are case-sensitive, so match the casing used in the markup.
List<string[]> records = ( from row in data.Descendants( "tr" ) select ( from cell in row.Descendants( "td" ) select cell.Value ).ToArray() ).ToList();
if ( records.Count == 0 ) return retVal;
// Create enough columns for the widest row (AddRange with an array of nulls would add none)
int columnCount = records.Max( x => x.Length );
for ( int i = 0; i < columnCount; i++ ) retVal.Columns.Add();
// Add records
records.ForEach( x => retVal.Rows.Add( x ) );
return retVal;
}

htmlagilitypack parse table by th

I am trying to parse the following table using the htmlagilitypack.
<tr>
<th>
Anställda:
</th>
<td>
0 - 4
</td>
</tr>
<tr>
<th>
Oms (tkr):
</th>
<td>
5 409
</td>
</tr>
I'm trying to extract the value for Oms (tkr): (in this case 5 409).
The code below gives me the HTML table above. The problem is that I can't get the Oms (tkr) value out of it. Note that Oms (tkr) is not always in the same place; it can be further down or further up in the table. By this I mean that Oms can sometimes be where Anställda is, and so forth.
foreach (HtmlAgilityPack.HtmlNode graf in (IEnumerable<HtmlAgilityPack.HtmlNode>)doc.DocumentNode.SelectNodes("//div[@id=\"info\"]//table")) {
var tabellHTML = graf.InnerHtml;
MessageBox.Show(tabellHTdML);
}
I've tried to do:
if (tabellHTML.Contains("Oms"))
{
item.OMS = cells.InnerText;
}
But I can't seem to get the correct value. Any ideas what I'm doing wrong?

The following code:
HtmlDocument doc = new HtmlDocument();
doc.Load("test.htm");
Console.WriteLine(doc.DocumentNode.SelectSingleNode("//th[starts-with(normalize-space(text()), 'Oms')]").InnerHtml.Trim());
will dump this:
Oms (tkr)
But you'll have to parse the end manually. The Html Agility Pack only knows about elements and attributes. The XPath expression means: select any TH element whose text content, once whitespace-normalized (normalize-space), starts with 'Oms'.
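
To reach the value cell rather than the header, the same expression can be extended with following-sibling::td[1]. The sketch below checks this with System.Xml.XPath as a stand-in, since the sample table is well-formed XML; passing the same XPath string to the Html Agility Pack's SelectSingleNode should behave the same way, but that is an assumption not tested here. HeaderLookup and ValueForOms are hypothetical names.

```csharp
using System.Xml.Linq;
using System.Xml.XPath;

public static class HeaderLookup
{
    // Returns the trimmed text of the <td> that follows the <th> whose text
    // starts with "Oms", wherever that row sits in the table; null if absent.
    public static string ValueForOms(XDocument doc)
    {
        XElement cell = doc.XPathSelectElement(
            "//th[starts-with(normalize-space(text()), 'Oms')]/following-sibling::td[1]");
        return cell?.Value.Trim();
    }
}
```

For the table in the question this returns "5 409", and it keeps working if the Oms row moves above the Anställda row.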
