How to find nearest match from current context node - c#

I've got a rather large XML file that I'm trying to parse using a C# application and the HtmlAgilityPack. The XML looks something like this:
...
<tr>
<td><b>ABC-123</b></td>
<td>15</td>
<td>4</td>
</tr>
<tr>
<td>AB-4-320</td>
<td>11</td>
<td>2</td>
</tr>
<tr>
<td><b>ABC-123</b></td>
<td>15</td>
<td>4</td>
</tr>
<tr>
<td>AB-4-320</td>
<td>11</td>
<td>2</td>
</tr>
<tr>
<td>CONTROLLER1</td>
<td>4</td>
<td>3</td>
</tr>
<tr>
<td>CONTROLLER2</td>
<td>4</td>
<td>3</td>
</tr>
...
Basically a series of table rows and columns that repeats. I'm first doing a search for a controller by using:
string xPath = @"//tr/td[starts-with(.,'CONTROLLER2')]";
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xPath);
foreach (HtmlNode link in nodes) { ... }
Which returns the correct node. Now I want to search backwards (up) for the first (nearest) matching <td> node that starts with text "ABC":
string xPath = link.XPath + @"/parent::tr/preceding-sibling::tr/td[starts-with(.,'ABC-')]";
This returns all matching nodes, not just the nearest one. When I appended [1] to the end of this XPath string, it didn't seem to work, and I've found no examples showing a positional predicate being used with an axis like this. Or, more likely, I'm doing it wrong. Any suggestions?

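You can use this XPath:
/parent::tr/preceding-sibling::tr[td[starts-with(.,'ABC-')]][1]/td[starts-with(.,'ABC-')]
This finds the nearest preceding <tr> that has a child <td> whose text starts with 'ABC-', then selects that <td> element. The [1] has to sit on the <tr> step: positions on a reverse axis such as preceding-sibling are counted backwards from the context node, so the first matching <tr> is the closest one.
There are at least two ways to apply this with HtmlAgilityPack: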
foreach (HtmlNode link in nodes)
{
    // approach 1: note the dot (.) at the beginning, which makes the XPath relative to link
    string xPath1 =
        @"./parent::tr/preceding-sibling::tr[td[starts-with(.,'ABC-')]][1]/td[starts-with(.,'ABC-')]";
    var n1 = link.SelectSingleNode(xPath1);
    Console.WriteLine(n1.InnerHtml);

    // approach 2: append to the absolute XPath of the current link
    string xPath2 =
        @"/parent::tr/preceding-sibling::tr[td[starts-with(.,'ABC-')]][1]/td[starts-with(.,'ABC-')]";
    var n2 = link.SelectSingleNode(link.XPath + xPath2);
    Console.WriteLine(n2.InnerHtml);
}
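To see the reverse-axis behaviour in isolation, here is a minimal sketch that runs the same relative XPath through LINQ to XML (System.Xml.Linq plus System.Xml.XPath) instead of HtmlAgilityPack. It assumes the rows are well-formed XML; the NearestMatch class and FindNearestPreceding helper are hypothetical names used only for illustration.

```csharp
using System.Xml.Linq;
using System.Xml.XPath;

public static class NearestMatch
{
    // Hypothetical helper (not part of any library).
    // Returns the text of the closest preceding <td> whose text starts with
    // the given prefix, or null when no earlier row matches.
    // Assumption: prefix contains no single quote, since it is pasted into the XPath.
    public static string FindNearestPreceding(XElement cell, string prefix)
    {
        // [1] is applied on the reverse preceding-sibling axis, so it picks
        // the matching <tr> nearest to the context row.
        string xpath =
            "parent::tr/preceding-sibling::tr[td[starts-with(.,'" + prefix + "')]][1]" +
            "/td[starts-with(.,'" + prefix + "')]";

        XElement match = cell.XPathSelectElement(xpath);
        return match?.Value;
    }
}
```

Called with the CONTROLLER2 cell and the prefix "ABC-", this should return the text of the ABC row closest above that cell rather than the first ABC row in the file.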

If you're able to use LINQ-to-XML instead of the HAP then this works:
var node = xml.Root.Elements("tr")
    .TakeWhile(tr => !tr.Elements("td")
        .Any(td => td.Value.StartsWith("CONTROLLER2")))
    .SelectMany(tr => tr.Elements("td"))
    .Where(td => td.Value.StartsWith("ABC-"))
    .Last();
I got this result:
<td>
<b>ABC-123</b>
</td>
(Which I checked was the second matching node in your sample, not the first.)

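You can use:
//tr/td[starts-with(.,'CONTROLLER2')]/(parent::tr/preceding-sibling::tr/td[starts-with(normalize-space(.),'ABC-')])[1]
Since the target node contains unwanted spaces, normalize-space is a must. Note that a parenthesized expression used as a path step is XPath 2.0 syntax, so this expression needs an XPath 2.0 engine; the XPath 1.0 engine in .NET will not evaluate it.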

I think an XPath like this (from the current CONTROLLER2 node) should do it:
string xPath = "../preceding-sibling::tr[starts-with(td , 'ABC-')][1]/td[starts-with(. , 'ABC-')]";
It means:
go up one level to the parent <tr> (..)
from there, select all preceding sibling <tr> elements whose first <td> starts with 'ABC-'
take the first of these <tr> elements in reverse order, i.e. the nearest one
from this <tr> element, get the <td> elements that start with 'ABC-'

Related

How to check if an XML attribute contains a string?

Here is the XML (I have saved an HTML page in XML form to parse it generically):
<td width="76" class="DataB">2.276</td>
<td width="76" class="DataB">2.289</td>
<td width="76" class="DataB">2.091</td>
<td width="76" class="DataB">1.952</td>
<td width="76" class="DataB">1.936</td>
<td width="76" class="Current2">1.899</td>
Now I am trying to find all of the elements whose class contains the string Current, because the web page changes the number at the end of the class name:
var xElements = xml.Descendants("td").Where(element => ((string) element.Attribute("class")).Contains("Current"));
This throws an "object does not exist" error here:
((string) element.Attribute("class"))
How can I check whether an attribute contains something?

If you ask me, this is easier to write as an XPath query. That way you don't have to handle elements that lack a class attribute and other such cases.
var query = xml.XPathSelectElements("//td[contains(@class,'Current')]");
Otherwise, you have to check that the attribute exists before trying to read it.
// query syntax makes this a little nicer
var query =
from td in xml.Descendants("td")
let classStr = (string)td.Attribute("class")
where classStr != null && classStr.Contains("Current")
select td;
// or alternatively, provide a default value
var query =
from td in xml.Descendants("td")
where ((string)td.Attribute("class") ?? "").Contains("Current")
select td;
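A further variant, sketched here on the assumption that C# 6 or later is available, uses the null-conditional operator so that elements without a class attribute are skipped instead of throwing. The ClassFilter class and WithClassContaining method are hypothetical names for illustration only.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ClassFilter
{
    // Hypothetical helper (not part of any library).
    // Returns every <td> under root whose class attribute contains fragment.
    public static List<XElement> WithClassContaining(XContainer root, string fragment)
    {
        return root.Descendants("td")
            // Attribute("class") is null when the attribute is missing;
            // ?. then yields a null bool?, and "== true" treats that as no match.
            .Where(td => td.Attribute("class")?.Value.Contains(fragment) == true)
            .ToList();
    }
}
```

With the sample cells from the question, WithClassContaining(xml, "Current") should keep only the Current2 cell, and a td without any class attribute is ignored rather than causing an exception.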
There's probably something wrong with the XML input you're using - trying this code works for me in LINQPad:
XDocument xml = XDocument.Parse(@"<tr><td width=""76"" class=""DataB"">2.276</td>
<td width=""76"" class=""DataB"">2.289</td>
<td width=""76"" class=""DataB"">2.091</td>
<td width=""76"" class=""DataB"">1.952</td>
<td width=""76"" class=""DataB"">1.936</td>
<td width=""76"" class=""Current2"">1.899</td></tr>");
var xElements = xml.Descendants("td").Where(element => ((string) element.Attribute("class")).Contains("Current"));
xElements.Dump();
Are you sure your XML is valid?

Find all the elements which have the first child of td with inner text of `xxx`?

I'm using Html Agility Pack to parse an HTML file. There is a big table in the HTML file.
....
<tr class="..."><td>xxx</td>.....</tr>
<tr class="..."><td>xxx</td>.....</tr>
<tr class="..."><td>yyy</td>.....</tr>
<tr class="..."><td>zzz</td>.....</tr>
....
I want to select all the trs whose first child td has the inner text xxx. How do I write the XPath?
//tr[....]
Update:
How do I add the additional condition that "the trs must also have exactly five tds"?

Use this XPath:
//tr[td[1][. = 'xxx']]
Update:
//tr[td[1][. = 'xxx']][count(td) = 5]
or, using the and operator:
//tr[td[1][. = 'xxx'] and count(td) = 5]
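As a quick way to try the combined predicate outside Html Agility Pack, the sketch below evaluates it with LINQ to XML and System.Xml.XPath, assuming the table is well-formed XML. RowFilter and MatchingRows are hypothetical names, and the leading dot makes the expression relative to the element passed in.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class RowFilter
{
    // Hypothetical helper (not part of any library).
    // Returns the <tr> elements below table whose first <td> has exactly the
    // given text and which contain exactly cellCount <td> children.
    // Assumption: firstCellText contains no single quote.
    public static List<XElement> MatchingRows(XElement table, string firstCellText, int cellCount)
    {
        string xpath =
            ".//tr[td[1][. = '" + firstCellText + "'] and count(td) = " + cellCount + "]";
        return table.XPathSelectElements(xpath).ToList();
    }
}
```

For the case in the question this would be called as RowFilter.MatchingRows(table, "xxx", 5).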

How do I loop this in XDocument using c#

I have a table and I'm trying to read its td values with the code below:
foreach (var descendant in xmlDoc.Descendants("thead"))
{
var title = descendant.Element("td1 style=background:#cccccc").Value;
}
Assume I have the thead below in the table:
<thead>
<tr align="center" bgcolor="white">
<td1 style="background:#cccccc">Start</td1>
<td1 style="background:#cccccc">A</td1>
<td1 style="background:#cccccc">B</td1>
<td1 style="background:#cccccc">C</td1>
<td1 style="background:#cccccc">D</td1>
<td1 style="background:#cccccc">E</td1>
<td1 style="background:#cccccc">F</td1>
<td1 style="background:#cccccc">G</td1>
</tr>
</thead>
I need to get all the td1 values.

Your use of Element is incorrect: you pass in just a name, not the whole content of an element declaration.
If you want all td1 elements, you want something like:
foreach (var descendant in xmlDoc.Descendants("thead"))
{
    foreach (var title in descendant.Element("tr")
                                    .Elements("td1")
                                    .Select(td1 => td1.Value))
    {
        ...
    }
}
Or if you don't actually need anything from the thead elements:
foreach (var title in xmlDoc.Descendants("thead")
                            .Elements("tr")
                            .Elements("td1")
                            .Select(td1 => td1.Value))
{
    ...
}
(Do you really mean td1 rather than td, by the way?)

If you need the td1 elements, then in this case you can select them directly:
var titles = xdoc.Descendants("td1").Select(td => (string)td);
Or you can use XPath:
var titles = from td in xdoc.XPathSelectElements("//thead/tr/td1")
             select (string)td;
NOTE: if you are going to parse HTML documents, consider using HtmlAgilityPack instead (available from NuGet).

htmlagilitypack parse table by th

I am trying to parse the following table using the htmlagilitypack.
<tr>
<th>
Anställda:
</th>
<td>
0 - 4
</td>
</tr>
<tr>
<th>
Oms (tkr):
</th>
<td>
5 409
</td>
</tr>
I'm trying to extract the value for Oms (tkr): (in this case 5 409).
The code below gives me the above HTML table. The problem is that I can't grab the Oms (tkr) value. It should also be said that Oms (tkr) is not always in the same place; it can be further down or further up in the table. By this I mean that Oms can sometimes be where Anställda is, and so forth.
foreach (HtmlAgilityPack.HtmlNode graf in (IEnumerable<HtmlAgilityPack.HtmlNode>)doc.DocumentNode.SelectNodes("//div[@id=\"info\"]//table")) {
    var tabellHTML = graf.InnerHtml;
    MessageBox.Show(tabellHTML);
}
I've tried to do:
if (tabellHTML.Contains("Oms"))
{
item.OMS = cells.InnerText;
}
But I can't seem to get the correct value. Any ideas what I'm doing wrong?

The following code:
HtmlDocument doc = new HtmlDocument();
doc.Load("test.htm");
Console.WriteLine(doc.DocumentNode.SelectSingleNode("//th[starts-with(normalize-space(text()), 'Oms')]").InnerHtml.Trim());
will dump this:
Oms (tkr)
But you'll have to parse the end manually. The Html Agility Pack only knows about elements and attributes. The XPath expression means: select any TH element whose text content starts with 'Oms' once trimmed (normalize-space).
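To reach the value cell rather than the header, one option is to step from the matched th to its first following-sibling td. The sketch below shows that idea with LINQ to XML and System.Xml.XPath rather than Html Agility Pack, assuming the table fragment is well-formed XML; HeaderLookup and ValueFor are hypothetical names used only for illustration.

```csharp
using System.Xml.Linq;
using System.Xml.XPath;

public static class HeaderLookup
{
    // Hypothetical helper (not part of any library).
    // Finds the <th> whose trimmed text starts with headerPrefix and returns
    // the trimmed text of the <td> next to it, or null when nothing matches.
    // Assumption: headerPrefix contains no single quote.
    public static string ValueFor(XElement table, string headerPrefix)
    {
        string xpath =
            ".//th[starts-with(normalize-space(.), '" + headerPrefix + "')]" +
            "/following-sibling::td[1]";

        XElement cell = table.XPathSelectElement(xpath);
        return cell?.Value.Trim();
    }
}
```

Because the lookup keys on the header text, it does not depend on which row Oms (tkr) appears in. The same XPath string could be tried with SelectSingleNode in Html Agility Pack, though that is untested here.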

Parsing html with the HTML Agility Pack and Linq

I have the following HTML
(..)
<tbody>
<tr>
<td class="name"> Test1 </td>
<td class="data"> Data </td>
<td class="data2"> Data 2 </td>
</tr>
<tr>
<td class="name"> Test2 </td>
<td class="data"> Data2 </td>
<td class="data2"> Data 2 </td>
</tr>
</tbody>
(..)
The information I have is the name => so "Test1" & "Test2". What I want to know is how can I get the data that's in "data" and "data2" based on the Name I have.
Currently I'm using:
var data =
from
tr in doc.DocumentNode.Descendants("tr")
from
td in tr.ChildNodes.Where(x => x.Attributes["class"].Value == "name")
where
td.InnerText == "Test1"
select tr;
But I get {"Object reference not set to an instance of an object."} when I try to look in data.

As for your attempt, you have two issues with your code:
ChildNodes is odd: it also returns whitespace text nodes, which don't have a class attribute (they can't have attributes at all).
As James Walford commented, the spaces around the text are significant; you probably want to trim them.
With these two corrections, the following works:
var data =
    from tr in doc.DocumentNode.Descendants("tr")
    from td in tr.Descendants("td").Where(x => x.Attributes["class"].Value == "name")
    where td.InnerText.Trim() == "Test1"
    select tr;

Here is the XPath way. Everyone seems to have forgotten about the power of XPath and concentrates exclusively on C# XLinq these days :-)
This function gets all data values associated with a name:
public static IEnumerable<string> GetData(HtmlDocument document, string name)
{
    return from HtmlNode node in
               document.DocumentNode.SelectNodes("//td[@class='name' and contains(text(), '" + name + "')]/following-sibling::td")
           select node.InnerText.Trim();
}
For example, this code will dump all 'Test2' data:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourHtml);
foreach (string data in GetData(doc, "Test2"))
{
    Console.WriteLine(data);
}

Here's one approach - first parse all data into a data structure, and then read it. This is a little messy and certainly needs more validation, but here goes:
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load("http://jsbin.com/ezuge4");
HtmlNodeCollection nodes = doc.DocumentNode
    .SelectNodes("//table[@id='MyTable']//tr");
var data = nodes.Select(
    node => node.Descendants("td")
        .ToDictionary(descendant => descendant.Attributes["class"].Value,
                      descendant => descendant.InnerText.Trim())
    ).ToDictionary(dict => dict["name"]);
string test1Data = data["Test1"]["data"];
Here I turn every <tr> into a dictionary, where the class of the <td> is the key and the text is the value. Next, I turn the list of dictionaries into a dictionary of dictionaries (tip - abstract that away), where the name of every <tr> is the key.

Instead of
td.InnerText == "Test1"
try
td.InnerText == " Test1 "
or
td.InnerText.Trim() == "Test1"

I can recommend one of two ways:
http://htmlagilitypack.codeplex.com/, which converts the html to valid xml which can then be queried against with OOTB Linq.
Or,
Linq to HTML (http://www.superstarcoders.com/linq-to-html.aspx), which while not maintained on CodePlex ( that was a hint, Keith ), gives a reasonable working set of features to springboard from.
