I am trying to parse the following table using the Html Agility Pack.
<tr>
<th>
Anställda:
</th>
<td>
0 - 4
</td>
</tr>
<tr>
<th>
Oms (tkr):
</th>
<td>
5 409
</td>
</tr>
I'm trying to extract the value for Oms (tkr): (in this case 5 409).
The code below gives me the HTML table above. The problem is how to grab the Oms (tkr) value out of it. Note that Oms (tkr) is not always in the same position; it can be further down or further up in the table. In other words, Oms can sometimes be where Anställda is, and so forth.
foreach (HtmlAgilityPack.HtmlNode graf in (IEnumerable<HtmlAgilityPack.HtmlNode>)doc.DocumentNode.SelectNodes("//div[@id=\"info\"]//table")) {
var tabellHTML = graf.InnerHtml;
MessageBox.Show(tabellHTML);
}
I've tried to do:
if (tabellHTML.Contains("Oms"))
{
item.OMS = cells.InnerText;
}
But I can't seem to get the correct value. Any ideas what I'm doing wrong?
The following code:
HtmlDocument doc = new HtmlDocument();
doc.Load("test.htm");
Console.WriteLine(doc.DocumentNode.SelectSingleNode("//th[starts-with(normalize-space(text()), 'Oms')]").InnerHtml.Trim());
will dump this:
Oms (tkr):
But you'll have to parse the end manually. The Html Agility Pack only knows about elements and attributes. The XPath expression means: select any TH element whose text content, once trimmed (normalize-space), starts with 'Oms'.
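To get the value itself rather than the header, one option is to step from the matched th to its sibling td. The following is only a minimal sketch, not the answer's code: it assumes the table fragment is well-formed XML and uses the BCL's System.Xml.XPath extensions instead of the Html Agility Pack. OmsLookup and FindValue are hypothetical names.

```csharp
using System;
using System.Xml.Linq;
using System.Xml.XPath;

public static class OmsLookup
{
    // Returns the trimmed text of the <td> that follows the first <th>
    // whose text starts with the given label, or null if there is none.
    // Assumption: the label is inserted into the XPath unescaped, so it
    // must not contain a single quote.
    public static string FindValue(string tableXml, string label)
    {
        XDocument doc = XDocument.Parse(tableXml);
        string xpath = "//th[starts-with(normalize-space(text()), '" + label + "')]"
                     + "/following-sibling::td[1]";
        XElement cell = doc.XPathSelectElement(xpath);
        return cell?.Value.Trim();
    }
}
```

Because the lookup goes through the header text, it does not matter where the Oms row sits in the table. With the Html Agility Pack, the same following-sibling::td step can be appended to the XPath passed to SelectSingleNode.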
I am having a bit of trouble with a program I am trying to write. It will use XML files generated by another program, so the formatting will always be the same, but the number of sections and the data within a section will differ, and I am trying to make the parser universal.
Here is a sample XML:
<?xml version="1.0" encoding="utf-8" ?>
<hcdata>
<docTitle>Test Health check</docTitle>
<sections>
<section id="1" name="server-overview">
<h1>Server Overview</h1>
<table name="server1">
<th>Field</th>
<th>Value</th>
<tr>
<td>Name</td>
<td>TestESXI1</td>
</tr>
<tr>
<td>RAM</td>
<td>24GB</td>
</tr>
</table>
<table name="server2">
<th>Field</th>
<th>Value</th>
<tr>
<td>Name</td>
<td>TestESXI2</td>
</tr>
<tr>
<td>RAM</td>
<td>16GB</td>
</tr>
</table>
</section>
<section id="2" name="vms">
<h1>Virtual Machine Information</h1>
<table name="vminfo">
<th>VM Name</th>
<th>RAM Usage</th>
<tr>
<td>2K8R2</td>
<td>2048MB</td>
</tr>
<tr>
<td>2K12R2</td>
<td>4096Mb</td>
</tr>
</table>
</section>
</sections>
</hcdata>
And here is some C# code I have been messing around with to try and pull values:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Xml.Linq;
namespace XMLParseDev
{
class XMLParseDev
{
static void Main(string[] args)
{
int sectionCount = 0;
Console.WriteLine(sectionCount);
XDocument xDoc = XDocument.Load(@"C:\Users\test.xml");
//XElement xEle = XElement.Load(@"C:\users\test.xml");
//Application winWord = new Application();
IEnumerable<XElement> xElements = xDoc.Elements();
IEnumerable<XElement> xSectionCount = from xSections in xDoc.Descendants("section") select xSections;
IEnumerable<XElement> xthCount = from xth in xDoc.Descendants("th") select xth;
foreach (XElement s in xSectionCount)
{
//This is to count the number of <section> tags, this part works
sectionCount = sectionCount + 1;
//This was trying to write the value of the <h1> tag but does not
IEnumerable<XElement> xH1 = from xH1Field in xDoc.Descendants("h1") select xH1Field;
Console.WriteLine(xH1.Attributes("h1"));
foreach (XElement th in xthCount)
{
//This was supposed to write the <th> value only for <th> within the <section> but writes them all
Console.WriteLine(th.Value);
}
}
Console.WriteLine(sectionCount);
}
}
}
And the output:
0
System.Xml.Linq.Extensions+<GetAttributes>d__1
Field
Value
Field
Value
VM Name
RAM Usage
System.Xml.Linq.Extensions+<GetAttributes>d__1
Field
Value
Field
Value
VM Name
RAM Usage
2
Basically what I want to do, is convert the XML to a Word document (this question isn't about the Word part, just the data getting). I've used tags similar to HTML to assist with ease of design.
I need each <section> tag to be processed as an individual part.
I planned on running through so I can get counts of table rows and columns, so the table can be created and then populated (as the table needs to be made with the right dimensions first).
The section will also have a heading (<h1>).
I planned on this running as a loop that would be a foreach that loops sections and does everything else within this section in the iteration, but I can't figure out how to lock the data selection down to just a specific section.
Hope this makes sense and thanks in advance.
I'm wondering if you might find it easier to let a DataSet parse the data into DataTables then pick which tables you want the data from. Here's a little snippet that will read the xml file and display all the data as tables:
// Requires: using System; using System.Data;
DataSet ds = new DataSet();
ds.ReadXml("xmlfile2.xml");
foreach(DataTable dt in ds.Tables)
{
Console.WriteLine($"Table Name - {dt.TableName}\n");
foreach(DataColumn dc in dt.Columns)
{
Console.Write($"{dc.ColumnName.PadRight(16)}");
}
Console.WriteLine();
foreach(DataRow dr in dt.Rows)
{
foreach(object obj in dr.ItemArray)
{
Console.Write($"{obj.ToString().PadRight(16)}");
}
Console.WriteLine();
}
Console.WriteLine(new string('_', 75));
}
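If the goal is just to keep each query inside its own section, LINQ to XML can do that by querying from the section element rather than from the whole document. A rough sketch, assuming the XML layout shown in the question (th and tr as direct children of table); SectionReader and Describe are made-up names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class SectionReader
{
    // Builds one line per section heading and one line per table,
    // counting only the <th> and <tr> elements inside that table.
    public static List<string> Describe(string xml)
    {
        var lines = new List<string>();
        XDocument doc = XDocument.Parse(xml);

        foreach (XElement section in doc.Descendants("section"))
        {
            // Element("h1") looks only at direct children of this section.
            lines.Add("Section: " + (string)section.Element("h1"));

            foreach (XElement table in section.Elements("table"))
            {
                int columns = table.Elements("th").Count();
                int rows = table.Elements("tr").Count();
                lines.Add($"{(string)table.Attribute("name")}: {columns} columns, {rows} rows");
            }
        }
        return lines;
    }
}
```

The key difference from the code in the question is that every inner query starts from section or table, not from xDoc, so it cannot see elements that belong to another section.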
Here is the XML (I have saved an HTML page in XML form to parse it generically):
<td width="76" class="DataB">2.276</td>
<td width="76" class="DataB">2.289</td>
<td width="76" class="DataB">2.091</td>
<td width="76" class="DataB">1.952</td>
<td width="76" class="DataB">1.936</td>
<td width="76" class="Current2">1.899</td>
Now I am trying to find all of the elements whose class contains the string Current, because the web page changes the number at the end of the class name:
var xElements = xml.Descendants("td").Where(element => ((string) element.Attribute("class")).Contains("Current"));
This throws an "object does not exist" error here:
((string) element.Attribute("class"))
How can I check an attribute if it contains something?
If you ask me, this would be easier to write as an XPath query. That way you don't have to deal with elements that have no class attribute and other such cases.
// requires: using System.Xml.XPath;
var query = xml.XPathSelectElements("//td[contains(@class,'Current')]");
Otherwise, you would have to check for the existence of the attribute before trying to read it.
// query syntax makes this a little nicer
var query =
from td in xml.Descendants("td")
let classStr = (string)td.Attribute("class")
where classStr != null && classStr.Contains("Current")
select td;
// or alternatively, provide a default value
var query =
from td in xml.Descendants("td")
where ((string)td.Attribute("class") ?? "").Contains("Current")
select td;
There's probably something wrong with the XML input you're using - trying this code works for me in LINQPad:
XDocument xml = XDocument.Parse(@"<tr><td width=""76"" class=""DataB"">2.276</td>
<td width=""76"" class=""DataB"">2.289</td>
<td width=""76"" class=""DataB"">2.091</td>
<td width=""76"" class=""DataB"">1.952</td>
<td width=""76"" class=""DataB"">1.936</td>
<td width=""76"" class=""Current2"">1.899</td></tr>");
var xElements = xml.Descendants("td").Where(element => ((string) element.Attribute("class")).Contains("Current"));
xElements.Dump();
Are you sure your XML is valid?
I've got a rather large XML file that I'm trying to parse using a C# application and the HtmlAgilityPack. The XML looks something like this:
...
<tr>
<td><b>ABC-123</b></td>
<td>15</td>
<td>4</td>
</tr>
<tr>
<td>AB-4-320</td>
<td>11</td>
<td>2</td>
</tr>
<tr>
<td><b>ABC-123</b></td>
<td>15</td>
<td>4</td>
</tr>
<tr>
<td>AB-4-320</td>
<td>11</td>
<td>2</td>
</tr>
<tr>
<td>CONTROLLER1</td>
<td>4</td>
<td>3</td>
</tr>
<tr>
<td>CONTROLLER2</td>
<td>4</td>
<td>3</td>
</tr>
...
Basically a series of table rows and columns that repeats. I'm first doing a search for a controller by using:
string xPath = @"//tr/td[starts-with(.,'CONTROLLER2')]";
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xPath);
foreach (HtmlNode link in nodes) { ... }
Which returns the correct node. Now I want to search backwards (up) for the first (nearest) matching <td> node that starts with text "ABC":
string xPath = link.XPath + @"/parent::tr/preceding-sibling::tr/td[starts-with(.,'ABC-')]";
This returns all matching nodes, not just the nearest one. When I attempted to add [1] to the end of this XPath string, it didn't seem to work and I've found no examples showing a predicate being used with an axes function like this. Or, more likely, I'm doing it wrong. Any suggestions?
You can use this XPath:
/parent::tr/preceding-sibling::tr[td[starts-with(.,'ABC-')]][1]/td[starts-with(.,'ABC-')]
That searches for the nearest preceding <tr> that has a child <td> starting with 'ABC-', then gets that particular <td> element.
There are at least two approaches you can pick when using HtmlAgilityPack :
foreach (HtmlNode link in nodes)
{
//approach 1 : notice the dot (.) at the beginning of the XPath, which makes it relative to link
string xPath1 =
@"./parent::tr/preceding-sibling::tr[td[starts-with(.,'ABC-')]][1]/td[starts-with(.,'ABC-')]";
var n1 = link.SelectSingleNode(xPath1);
Console.WriteLine(n1.InnerHtml);
//approach 2 : appending to the absolute XPath of the current link
string xPath2 =
@"/parent::tr/preceding-sibling::tr[td[starts-with(.,'ABC-')]][1]/td[starts-with(.,'ABC-')]";
var n2 = link.SelectSingleNode(link.XPath + xPath2);
Console.WriteLine(n2.InnerHtml);
}
If you're able to use LINQ-to-XML instead of the HAP then this works:
var node = xml.Root.Elements("tr")
.TakeWhile(tr => !tr.Elements("td")
.Any(td => td.Value.StartsWith("CONTROLLER2")))
.SelectMany(tr => tr.Elements("td"))
.Where(td => td.Value.StartsWith("ABC-"))
.Last();
I got this result:
<td>
<b>ABC-123</b>
</td>
(Which I checked was the second matching node in your sample, not the first.)
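You can use
//tr/td[starts-with(.,'CONTROLLER2')]/(parent::tr/preceding-sibling::tr/td[starts-with(normalize-space(.),'ABC-')])[1]
Since the target node contains unwanted spaces, normalize-space is a must. Note that a parenthesized step in the middle of a path is XPath 2.0 syntax, so an XPath 1.0 engine (which is what .NET's built-in XPath support implements) will likely reject this expression.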
I think an XPath like this (evaluated from the current CONTROLLER2 node) should do it:
string xPath = "../preceding-sibling::tr[starts-with(td , 'ABC-')][1]/td[starts-with(. , 'ABC-')]";
It means:
go up one level to the parent (..)
from there, select all preceding sibling TR elements whose TD element starts with 'ABC-'
take the first of these TR elements (preceding-sibling is a reverse axis, so the first is the nearest one)
from this TR element, get the TD elements that start with 'ABC-'
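The steps above can be tried out without the Html Agility Pack. This is a hedged sketch that assumes the rows are well-formed XML and uses XPathSelectElement from System.Xml.XPath; NearestMatch and FindNearestAbc are names invented for the example, and the predicate is written as td[starts-with(...)] so that any td in the row may match.

```csharp
using System;
using System.Xml.Linq;
using System.Xml.XPath;

public static class NearestMatch
{
    // Finds the <td> starting with the controller name, then walks back
    // to the nearest earlier row containing a <td> that starts with "ABC-".
    // Returns that cell's text, or null if either lookup fails.
    // Assumption: controllerName contains no single quote.
    public static string FindNearestAbc(string tableXml, string controllerName)
    {
        XDocument doc = XDocument.Parse(tableXml);
        XElement controller = doc.XPathSelectElement(
            "//tr/td[starts-with(., '" + controllerName + "')]");
        if (controller == null) return null;

        // Relative path: ".." is the controller's own <tr>; [1] on the
        // reverse axis preceding-sibling picks the closest matching row.
        XElement cell = controller.XPathSelectElement(
            "../preceding-sibling::tr[td[starts-with(., 'ABC-')]][1]/td[starts-with(., 'ABC-')]");
        return cell?.Value;
    }
}
```

Placing [1] directly on the preceding-sibling::tr step is what limits the result to the nearest row; placing it after the final td step would instead apply per row and still return a match for every earlier row.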
I'm using the Html Agility Pack to parse an HTML file. There is a big table in the HTML file.
....
<tr class="..."><td>xxx</td>.....</tr>
<tr class="..."><td>xxx</td>.....</tr>
<tr class="..."><td>yyy</td>.....</tr>
<tr class="..."><td>zzz</td>.....</tr>
....
I want to select all the trs whose first child td has the inner text xxx. How do I write the XPath?
//tr[....]
Update:
How to add an additional condition of "the trs must also have exactly five tds"?
Use this XPath:
//tr[td[1][. = 'xxx']]
Update:
//tr[td[1][. = 'xxx']][count(td) = 5]
or, using the and operator:
//tr[td[1][. = 'xxx'] and count(td) = 5]
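As a quick way to check the combined expression, here is a small sketch that runs it against a well-formed fragment with System.Xml.XPath rather than the Html Agility Pack. RowFilter and Select are hypothetical names, and the first-cell text is assumed to contain no single quote.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class RowFilter
{
    // Returns every <tr> whose first <td> has exactly the given text
    // and which has exactly cellCount <td> children.
    public static List<XElement> Select(string tableXml, string firstCellText, int cellCount)
    {
        XDocument doc = XDocument.Parse(tableXml);
        string xpath = "//tr[td[1][. = '" + firstCellText + "'] and count(td) = " + cellCount + "]";
        return doc.XPathSelectElements(xpath).ToList();
    }
}
```

Note that . = 'xxx' is an exact comparison, so a cell containing surrounding whitespace would need normalize-space(.) = 'xxx' instead.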
I have the following HTML
(..)
<tbody>
<tr>
<td class="name"> Test1 </td>
<td class="data"> Data </td>
<td class="data2"> Data 2 </td>
</tr>
<tr>
<td class="name"> Test2 </td>
<td class="data"> Data2 </td>
<td class="data2"> Data 2 </td>
</tr>
</tbody>
(..)
The information I have is the name => so "Test1" & "Test2". What I want to know is how can I get the data that's in "data" and "data2" based on the Name I have.
Currently I'm using:
var data =
from
tr in doc.DocumentNode.Descendants("tr")
from
td in tr.ChildNodes.Where(x => x.Attributes["class"].Value == "name")
where
td.InnerText == "Test1"
select tr;
But I get {"Object reference not set to an instance of an object."} when I try to look in data
As for your attempt, there are two issues with your code:
ChildNodes is weird: it also returns whitespace text nodes, which don't have a class attribute (text nodes can't have attributes at all), so Attributes["class"] is null for them.
As James Walford commented, the spaces around the text are significant, so you probably want to trim them.
With these two corrections, the following works:
var data =
from tr in doc.DocumentNode.Descendants("tr")
from td in tr.Descendants("td").Where(x => x.Attributes["class"].Value == "name")
where td.InnerText.Trim() == "Test1"
select tr;
Here is the XPath way. Hmmm... everyone seems to have forgotten about the power of XPath and to concentrate exclusively on C# XLinq these days :-)
This function gets all data values associated with a name:
public static IEnumerable<string> GetData(HtmlDocument document, string name)
{
return from HtmlNode node in
document.DocumentNode.SelectNodes("//td[@class='name' and contains(text(), '" + name + "')]/following-sibling::td")
select node.InnerText.Trim();
}
For example, this code will dump all 'Test2' data:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourHtml);
foreach (string data in GetData(doc, "Test2"))
{
Console.WriteLine(data);
}
Here's one approach - first parse all data into a data structure, and then read it. This is a little messy and certainly needs more validation, but here goes:
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load("http://jsbin.com/ezuge4");
HtmlNodeCollection nodes = doc.DocumentNode
.SelectNodes("//table[@id='MyTable']//tr");
var data = nodes.Select(
node => node.Descendants("td")
.ToDictionary(descendant => descendant.Attributes["class"].Value,
descendant => descendant.InnerText.Trim())
).ToDictionary(dict => dict["name"]);
string test1Data = data["Test1"]["data"];
Here I turn every <tr> to a dictionary, where the class of the <td> is a key and the text is a value. Next, I turn the list of dictionaries into a dictionary of dictionaries (tip - abstract that away), where the name of every <tr> is the key.
Instead of
td.InnerText == "Test1"
try
td.InnerText == " Test1 "
or
td.InnerText.Trim() == "Test1"
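For comparison, the trimmed lookup can be sketched with plain LINQ to XML. This assumes the rows are well-formed XML and that class values are unique within a row (ToDictionary throws on duplicate keys); RowLookup and GetRow are illustrative names, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class RowLookup
{
    // Finds the first <tr> whose "name" cell equals the given name after
    // trimming, and returns its cells keyed by class attribute.
    // Returns an empty dictionary when no row matches.
    public static Dictionary<string, string> GetRow(string xml, string name)
    {
        XDocument doc = XDocument.Parse(xml);
        XElement row = doc.Descendants("tr").FirstOrDefault(tr =>
            tr.Elements("td").Any(td =>
                (string)td.Attribute("class") == "name" && td.Value.Trim() == name));

        if (row == null) return new Dictionary<string, string>();

        return row.Elements("td")
                  .Where(td => td.Attribute("class") != null) // skip cells without a class
                  .ToDictionary(td => (string)td.Attribute("class"),
                                td => td.Value.Trim());
    }
}
```

Casting the attribute to string yields null when the attribute is missing, which avoids the null reference error from the original query.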
I can recommend one of two ways:
http://htmlagilitypack.codeplex.com/, which converts the html to valid xml which can then be queried against with OOTB Linq.
Or,
Linq to HTML (http://www.superstarcoders.com/linq-to-html.aspx), which while not maintained on CodePlex ( that was a hint, Keith ), gives a reasonable working set of features to springboard from.