How do I loop this in XDocument using C#

I have a table and I am trying to read the td values with the code below:
foreach (var descendant in xmlDoc.Descendants("thead"))
{
var title = descendant.Element("td1 style=background:#cccccc").Value;
}
Assume the table contains the thead below:
<thead>
<tr align="center" bgcolor="white">
<td1 style="background:#cccccc">Start</td1>
<td1 style="background:#cccccc">A</td1>
<td1 style="background:#cccccc">B</td1>
<td1 style="background:#cccccc">C</td1>
<td1 style="background:#cccccc">D</td1>
<td1 style="background:#cccccc">E</td1>
<td1 style="background:#cccccc">F</td1>
<td1 style="background:#cccccc">G</td1>
</tr>
</thead>
I need to get all the td1 values.

Your use of Element is incorrect - you just pass in a name, not the whole content of an element declaration.
If you want all td1 elements, you want something like:
foreach (var descendant in xmlDoc.Descendants("thead"))
{
foreach (var title in descendant.Element("tr")
.Elements("td1")
.Select(td1 => td1.Value))
{
...
}
}
Or if you don't actually need anything from the thead elements:
foreach (var title in xmlDoc.Descendants("thead")
.Elements("tr")
.Elements("td1")
.Select(td1 => td1.Value))
{
...
}
(Do you really mean td1 rather than td by the way?)

If you need td1 elements, then in this case you can select them directly:
var titles = xdoc.Descendants("td1").Select(td => (string)td);
Or you can use XPath
var titles = from td in xdoc.XPathSelectElements("//thead/tr/td1")
select (string)td;
NOTE: if you are going to parse HTML documents, consider using HtmlAgilityPack (available from NuGet) instead.
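To tie the two answers together, here is a minimal, self-contained sketch. It assumes the markup from the question is well-formed XML and parses it from an inline string; the class name TitleReader and the method GetTitles are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TitleReader
{
    // Sample markup from the question, shortened to three cells.
    public const string Sample =
        "<table><thead><tr align=\"center\" bgcolor=\"white\">" +
        "<td1 style=\"background:#cccccc\">Start</td1>" +
        "<td1 style=\"background:#cccccc\">A</td1>" +
        "<td1 style=\"background:#cccccc\">B</td1>" +
        "</tr></thead></table>";

    // Returns the text of every td1 under thead/tr, in document order.
    public static List<string> GetTitles(string xml)
    {
        XDocument xmlDoc = XDocument.Parse(xml);
        return xmlDoc.Descendants("thead")
                     .Elements("tr")
                     .Elements("td1")   // element name only; attributes are not part of the name
                     .Select(td1 => td1.Value)
                     .ToList();
    }

    public static void Main()
    {
        foreach (string title in GetTitles(Sample))
        {
            Console.WriteLine(title);
        }
    }
}
```

If a cell has to be picked by its style attribute, that would be a separate filter on td1.Attribute("style") rather than part of the name passed to Elements.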

Related

C# LINQ and XML Parsing with Separate Sections

I am having a bit of trouble with a program I am trying to write. It is going to be using XML files that are generated by another program, so the formatting will always be the same, but number of sections and data within a section will be different, and I am trying to make it universal.
Here is a sample XML:
<?xml version="1.0" encoding="utf-8" ?>
<hcdata>
<docTitle>Test Health check</docTitle>
<sections>
<section id="1" name="server-overview">
<h1>Server Overview</h1>
<table name="server1">
<th>Field</th>
<th>Value</th>
<tr>
<td>Name</td>
<td>TestESXI1</td>
</tr>
<tr>
<td>RAM</td>
<td>24GB</td>
</tr>
</table>
<table name="server2">
<th>Field</th>
<th>Value</th>
<tr>
<td>Name</td>
<td>TestESXI2</td>
</tr>
<tr>
<td>RAM</td>
<td>16GB</td>
</tr>
</table>
</section>
<section id="2" name="vms">
<h1>Virtual Machine Information</h1>
<table name="vminfo">
<th>VM Name</th>
<th>RAM Usage</th>
<tr>
<td>2K8R2</td>
<td>2048MB</td>
</tr>
<tr>
<td>2K12R2</td>
<td>4096Mb</td>
</tr>
</table>
</section>
</sections>
</hcdata>
And here is some C# code I have been messing around with to try and pull values:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Xml.Linq;
namespace XMLParseDev
{
class XMLParseDev
{
static void Main(string[] args)
{
int sectionCount = 0;
Console.WriteLine(sectionCount);
XDocument xDoc = XDocument.Load(@"C:\Users\test.xml");
//XElement xEle = XElement.Load(@"C:\users\test.xml");
//Application winWord = new Application();
IEnumerable<XElement> xElements = xDoc.Elements();
IEnumerable<XElement> xSectionCount = from xSections in xDoc.Descendants("section") select xSections;
IEnumerable<XElement> xthCount = from xth in xDoc.Descendants("th") select xth;
foreach (XElement s in xSectionCount)
{
//This is to count the number of <section> tags, this part works
sectionCount = sectionCount + 1;
//This was trying to write the value of the <h1> tag but does not
IEnumerable<XElement> xH1 = from xH1Field in xDoc.Descendants("h1") select xH1Field;
Console.WriteLine(xH1.Attributes("h1"));
foreach (XElement th in xthCount)
{
//This was supposed to write the <th> value only for <th> within the <section> but writes them all
Console.WriteLine(th.Value);
}
}
Console.WriteLine(sectionCount);
}
}
}
And the output:
0
System.Xml.Linq.Extensions+<GetAttributes>d__1
Field
Value
Field
Value
VM Name
RAM Usage
System.Xml.Linq.Extensions+<GetAttributes>d__1
Field
Value
Field
Value
VM Name
RAM Usage
2
Basically what I want to do, is convert the XML to a Word document (this question isn't about the Word part, just the data getting). I've used tags similar to HTML to assist with ease of design.
I need each <section> tag to be processed as an individual part.
I planned on running through so I can get counts of table rows and columns, so the table can be created and then populated (as the table needs to be made with the right dimensions first).
The section will also have a heading (<h1>).
I planned on this running as a loop that would be a foreach that loops sections and does everything else within this section in the iteration, but I can't figure out how to lock the data selection down to just a specific section.
Hope this makes sense and thanks in advance.
I'm wondering if you might find it easier to let a DataSet parse the data into DataTables then pick which tables you want the data from. Here's a little snippet that will read the xml file and display all the data as tables:
DataSet ds = new DataSet();
ds.ReadXml("xmlfile2.xml");
foreach(DataTable dt in ds.Tables)
{
Console.WriteLine($"Table Name - {dt.TableName}\n");
foreach(DataColumn dc in dt.Columns)
{
Console.Write($"{dc.ColumnName.PadRight(16)}");
}
Console.WriteLine();
foreach(DataRow dr in dt.Rows)
{
foreach(object obj in dr.ItemArray)
{
Console.Write($"{obj.ToString().PadRight(16)}");
}
Console.WriteLine();
}
Console.WriteLine(new string('_', 75));
}
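If you would rather stay with LINQ to XML, the usual way to restrict a query to one section is to start it from the section element instead of from the document. The sketch below assumes a trimmed-down copy of the sample file held in a string; SectionReader and Summarize are hypothetical names, and the th and tr elements are assumed to be direct children of table, as in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class SectionReader
{
    // Trimmed-down version of the sample file from the question.
    public const string Sample = @"<hcdata><sections>
<section id=""1"" name=""server-overview"">
<h1>Server Overview</h1>
<table name=""server1""><th>Field</th><th>Value</th>
<tr><td>Name</td><td>TestESXI1</td></tr></table>
</section>
<section id=""2"" name=""vms"">
<h1>Virtual Machine Information</h1>
<table name=""vminfo""><th>VM Name</th><th>RAM Usage</th>
<tr><td>2K8R2</td><td>2048MB</td></tr>
<tr><td>2K12R2</td><td>4096Mb</td></tr></table>
</section>
</sections></hcdata>";

    // Builds one line per table: "heading | table name | columns x rows".
    public static List<string> Summarize(string xml)
    {
        var lines = new List<string>();
        XDocument xDoc = XDocument.Parse(xml);
        foreach (XElement section in xDoc.Descendants("section"))
        {
            // Query from the section element, not from xDoc,
            // so only this section's content is searched.
            string heading = (string)section.Element("h1");
            foreach (XElement table in section.Elements("table"))
            {
                string name = (string)table.Attribute("name");
                int columns = table.Elements("th").Count();
                int rows = table.Elements("tr").Count();
                lines.Add(heading + " | " + name + " | " + columns + "x" + rows);
            }
        }
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Summarize(Sample))
        {
            Console.WriteLine(line);
        }
    }
}
```

The column and row counts are available before any cell is read, which matches the plan of creating the Word table with the right dimensions first and populating it afterwards.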

How to find nearest match from current context node

I've got a rather large XML file that I'm trying to parse using a C# application and the HtmlAgilityPack. The XML looks something like this:
...
<tr>
<td><b>ABC-123</b></td>
<td>15</td>
<td>4</td>
</tr>
<tr>
<td>AB-4-320</td>
<td>11</td>
<td>2</td>
</tr>
<tr>
<td><b>ABC-123</b></td>
<td>15</td>
<td>4</td>
</tr>
<tr>
<td>AB-4-320</td>
<td>11</td>
<td>2</td>
</tr>
<tr>
<td>CONTROLLER1</td>
<td>4</td>
<td>3</td>
</tr>
<tr>
<td>CONTROLLER2</td>
<td>4</td>
<td>3</td>
</tr>
...
Basically a series of table rows and columns that repeats. I'm first doing a search for a controller by using:
string xPath = @"//tr/td[starts-with(.,'CONTROLLER2')]";
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xPath);
foreach (HtmlNode link in nodes) { ... }
Which returns the correct node. Now I want to search backwards (up) for the first (nearest) matching <td> node that starts with text "ABC":
string xPath = link.XPath + @"/parent::tr/preceding-sibling::tr/td[starts-with(.,'ABC-')]";
This returns all matching nodes, not just the nearest one. When I added [1] to the end of this XPath string, it didn't seem to work, and I've found no examples showing a predicate being used with an axis like this. Or, more likely, I'm doing it wrong. Any suggestions?
You can use this XPath:
/parent::tr/preceding-sibling::tr[td[starts-with(.,'ABC-')]][1]/td[starts-with(.,'ABC-')]
That searches for the nearest preceding <tr> that has a child <td> starting with 'ABC-', then gets that particular <td> element.
There are at least two approaches you can pick when using HtmlAgilityPack:
foreach (HtmlNode link in nodes)
{
//approach 1 : notice dot(.) at the beginning of the XPath
string xPath1 =
@"./parent::tr/preceding-sibling::tr[td[starts-with(.,'ABC-')]][1]/td[starts-with(.,'ABC-')]";
var n1 = link.SelectSingleNode(xPath1);
Console.WriteLine(n1.InnerHtml);
//approach 2 : appending to XPath of current link
string xPath2 =
@"/parent::tr/preceding-sibling::tr[td[starts-with(.,'ABC-')]][1]/td[starts-with(.,'ABC-')]";
var n2 = doc.DocumentNode.SelectSingleNode(link.XPath + xPath2);
Console.WriteLine(n2.InnerHtml);
}
If you're able to use LINQ-to-XML instead of the HAP then this works:
var node = xml.Root.Elements("tr")
.TakeWhile(tr => !tr.Elements("td")
.Any(td => td.Value.StartsWith("CONTROLLER2")))
.SelectMany(tr => tr.Elements("td"))
.Where(td => td.Value.StartsWith("ABC-"))
.Last();
I got this result:
<td>
<b>ABC-123</b>
</td>
(Which I checked was the second matching node in your sample, not the first.)
You can use:
//tr/td[starts-with(.,'CONTROLLER2')]/(parent::tr/preceding-sibling::tr/td[starts-with(normalize-space(.),'ABC-')])[1]
Since the target node contains unwanted spaces, using normalize-space is a must.
I think an XPath like this (evaluated from the current CONTROLLER2 node) should do it:
string xPath = "../preceding-sibling::tr[starts-with(td , 'ABC-')][1]/td[starts-with(. , 'ABC-')]";
It means:
go up one level to the parent (..)
from there, select all preceding sibling TR elements whose TD starts with 'ABC-'
take the first of these TR elements (counted in reverse document order, so the nearest one)
from this TR element, get the TD elements that start with 'ABC-'
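The steps above can be tried without HtmlAgilityPack by running the same kind of relative XPath through LINQ to XML's XPathSelectElement. This is only a sketch under assumptions: the rows sit in a well-formed table element, the two ABC rows carry different values (ABC-111 and ABC-222, invented here so the nearest one is distinguishable), and NearestMatch and FindNearest are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class NearestMatch
{
    // Invented sample: two ABC rows with different values precede the controllers.
    public const string Sample =
        "<table>" +
        "<tr><td><b>ABC-111</b></td><td>15</td></tr>" +
        "<tr><td>AB-4-320</td><td>11</td></tr>" +
        "<tr><td><b>ABC-222</b></td><td>15</td></tr>" +
        "<tr><td>CONTROLLER1</td><td>4</td></tr>" +
        "<tr><td>CONTROLLER2</td><td>4</td></tr>" +
        "</table>";

    // Returns the text of the nearest preceding 'ABC-' cell, or null if there is none.
    public static string FindNearest(string xml, string controller)
    {
        XElement table = XElement.Parse(xml);

        // Locate the starting <td>, e.g. the one that begins with "CONTROLLER2".
        XElement start = table.Elements("tr").Elements("td")
            .FirstOrDefault(td => td.Value.StartsWith(controller, StringComparison.Ordinal));
        if (start == null)
        {
            return null;
        }

        // preceding-sibling is a reverse axis, so [1] on that step
        // picks the closest matching row, not the first one in the document.
        XElement match = start.XPathSelectElement(
            "../preceding-sibling::tr[td[starts-with(., 'ABC-')]][1]/td[starts-with(., 'ABC-')]");
        return match == null ? null : match.Value;
    }

    public static void Main()
    {
        Console.WriteLine(FindNearest(Sample, "CONTROLLER2"));
    }
}
```

The important detail is where the [1] sits: attached to the preceding-sibling::tr step it counts backwards from the context row, whereas wrapping the whole path in parentheses and then applying [1] would count in document order.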

Converting DataTable to XML using XElement as template

I have an XElement (a table template) with the following content:
<table border="0" cellpadding="0" cellspacing="0" width="920" align="center">
<tr>
<td valign="top" width="200">
</td>
<td valign="top" width="400">
</td>
</tr>
</table>
I also have a DataTable that contains 2 columns and, for example, 5 rows. I need to put the data from this DataTable into a new XElement, using the table template. I think my function should look like this:
public XElement Filling(XElement htmlTable, DataTable dataTable)
But I don't know how to get a specific part of the markup out of htmlTable.
For instance, how can I get only the first td with the attribute width="200"?
Any suggestions?
Thanks
I don't understand the question entirely, but this is an example of how to get only the first td with the attribute width="200", assuming that htmlTable contains the HTML shown in the question:
var firstTd = htmlTable.Descendants("td").FirstOrDefault(o => (string)o.Attribute("width") == "200");
htmlTable.Descendants("element_name") returns every descendant element with that name inside the table (for example "tr" or "td" in this case). Casting the attribute to string instead of reading .Value avoids a NullReferenceException for a td that has no width attribute.
private DataTable ParseHtmlTable( string html )
{
DataTable retVal = new DataTable();
XElement data = XElement.Parse( html );
// Gets rows of values. Element names are case-sensitive, so match the lowercase markup.
List<string[]> records = ( from row in data.Descendants( "tr" ) select ( from cell in row.Elements( "td" ) select cell.Value ).ToArray() ).ToList();
// Ensure maximum columns: add one column per cell of the widest row.
int columnCount = records.Count == 0 ? 0 : records.Max( x => x.Length );
for ( int i = 0; i < columnCount; i++ ) retVal.Columns.Add();
// Add records
records.ForEach( x => retVal.Rows.Add( x ) );
return retVal;
}
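Coming back to the direction the question actually asks about (DataTable into the template), one possible reading is that the single tr in the template describes how every data row should look. Under that assumption, a hypothetical Filling could copy the template row once per DataRow and write the cell values into the copied td elements. The class name TableFiller and the sample data are invented for this sketch.

```csharp
using System;
using System.Data;
using System.Linq;
using System.Xml.Linq;

public static class TableFiller
{
    // Hypothetical implementation of the Filling signature from the question.
    public static XElement Filling(XElement htmlTable, DataTable dataTable)
    {
        // Work on a deep copy so the template itself stays untouched.
        XElement result = new XElement(htmlTable);
        XElement templateRow = result.Element("tr");
        if (templateRow == null)
        {
            throw new ArgumentException("Template has no <tr> element.", nameof(htmlTable));
        }
        templateRow.Remove();

        foreach (DataRow row in dataTable.Rows)
        {
            // Copy the template row, keeping attributes such as valign and width.
            XElement newRow = new XElement(templateRow);
            var cells = newRow.Elements("td").ToList();
            for (int i = 0; i < cells.Count && i < dataTable.Columns.Count; i++)
            {
                cells[i].Value = Convert.ToString(row[i]);
            }
            result.Add(newRow);
        }
        return result;
    }

    public static void Main()
    {
        XElement template = XElement.Parse(
            "<table border=\"0\" width=\"920\"><tr>" +
            "<td valign=\"top\" width=\"200\"></td>" +
            "<td valign=\"top\" width=\"400\"></td>" +
            "</tr></table>");

        var data = new DataTable();
        data.Columns.Add("Name");
        data.Columns.Add("Value");
        data.Rows.Add("RAM", "24GB");
        data.Rows.Add("CPU", "8 cores");

        Console.WriteLine(Filling(template, data));
    }
}
```

Extra DataTable columns beyond the number of td cells in the template are ignored here; whether that is the desired behavior depends on the template, so treat it as a design choice of the sketch rather than a requirement.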

Getting text from elements without id or class name

I am trying to parse HTML code using Html Agility Pack. Is there any tutorial available, or can someone tell me how can I get a text from a <td> that has no Id and no class?
<table id="results-table">
<tr class="row1">
<td>Diode Zener Single 12V 5% 1W 2-Pin DO-41 Bulk</td>
...
Each row contains 10 different <td>. Thanks!
You can try this XPath to query all the td elements within the table that has id="results-table":
//table[@id='results-table']/tr/td
Firepath for Firefox can help you formulate the XPath, and you can adjust it from there.
Sample code below
HtmlDocument doc = new HtmlDocument();
var fileName = #"..\..\..\docs\10960189.htm";
doc.Load(fileName);
var nodes = doc.DocumentNode.SelectNodes("//table[@id='results-table']/tr/td");
foreach (var node in nodes)
{
Debug.WriteLine(node.InnerText);
}
HTH
Here is a link that explains how to use XPath:
http://www.w3schools.com/xpath/
I guess some of your td tags will have a class or id. The following code, which I wrote in LINQPad, selects only the td elements that have neither:
void Main()
{
var webGet = new HtmlAgilityPack.HtmlDocument();
//web page/string that need to be parsed
webGet.LoadHtml(@"<table id='results-table'>" +
"<tr class='row1'>" +
"<td class='testclass'>test td with class</td>" +
"<td id='testid'>test td with id</td>" +
"<td>Diode Zener Single 12V 5% 1W 2-Pin DO-41 Bulk</td>" +
"<td>test td without class or id</td>" +
"</tr>" +
"</table>"
);
var tableOnPage = (from tds in webGet.DocumentNode.Descendants()
where tds.Name == "td" &&
tds.Attributes["class"] == null && tds.Attributes["id"] == null &&
tds.ParentNode.InnerText.Trim().Length > 0 && tds.InnerText.Trim().Length > 0
select new
{
td = tds.DescendantNodes().SingleOrDefault ().InnerHtml.Trim(),
});
//looping through each items
foreach (var item in tableOnPage)
{
Console.WriteLine(item.td);
}
}
Output will be
Diode Zener Single 12V 5% 1W 2-Pin DO-41 Bulk
test td without class or id

Parsing html with the HTML Agility Pack and Linq

I have the following HTML
(..)
<tbody>
<tr>
<td class="name"> Test1 </td>
<td class="data"> Data </td>
<td class="data2"> Data 2 </td>
</tr>
<tr>
<td class="name"> Test2 </td>
<td class="data"> Data2 </td>
<td class="data2"> Data 2 </td>
</tr>
</tbody>
(..)
The information I have is the name => so "Test1" & "Test2". What I want to know is how can I get the data that's in "data" and "data2" based on the Name I have.
Currently I'm using:
var data =
from
tr in doc.DocumentNode.Descendants("tr")
from
td in tr.ChildNodes.Where(x => x.Attributes["class"].Value == "name")
where
td.InnerText == "Test1"
select tr;
But I get {"Object reference not set to an instance of an object."} when I try to look in data.
As for your attempt, you have two issues with your code:
ChildNodes is weird: it also returns whitespace text nodes, which don't have a class attribute (text nodes can't have attributes, of course).
As James Walford commented, the spaces around the text are significant, so you probably want to trim them.
With these two corrections, the following works:
var data =
from tr in doc.DocumentNode.Descendants("tr")
from td in tr.Descendants("td").Where(x => x.Attributes["class"].Value == "name")
where td.InnerText.Trim() == "Test1"
select tr;
Here is the XPath way. Everyone seems to have forgotten about the power of XPath and concentrates exclusively on C# XLinq these days :-)
This function gets all data values associated with a name:
public static IEnumerable<string> GetData(HtmlDocument document, string name)
{
return from HtmlNode node in
document.DocumentNode.SelectNodes("//td[@class='name' and contains(text(), '" + name + "')]/following-sibling::td")
select node.InnerText.Trim();
}
For example, this code will dump all 'Test2' data:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourHtml);
foreach (string data in GetData(doc, "Test2"))
{
Console.WriteLine(data);
}
Here's one approach - first parse all data into a data structure, and then read it. This is a little messy and certainly needs more validation, but here goes:
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load("http://jsbin.com/ezuge4");
HtmlNodeCollection nodes = doc.DocumentNode
.SelectNodes("//table[@id='MyTable']//tr");
var data = nodes.Select(
node => node.Descendants("td")
.ToDictionary(descendant => descendant.Attributes["class"].Value,
descendant => descendant.InnerText.Trim())
).ToDictionary(dict => dict["name"]);
string test1Data = data["Test1"]["data"];
Here I turn every <tr> to a dictionary, where the class of the <td> is a key and the text is a value. Next, I turn the list of dictionaries into a dictionary of dictionaries (tip - abstract that away), where the name of every <tr> is the key.
Instead of
td.InnerText == "Test1"
try
td.InnerText == " Test1 "
or
td.InnerText.Trim() == "Test1"
I can recommend one of two ways:
http://htmlagilitypack.codeplex.com/, which converts the html to valid xml which can then be queried against with OOTB Linq.
Or,
Linq to HTML (http://www.superstarcoders.com/linq-to-html.aspx), which while not maintained on CodePlex ( that was a hint, Keith ), gives a reasonable working set of features to springboard from.
