Selecting all nodes containing text with XPath - c#

I have been struggling with this problem for the past couple of days. I want to get all the text() nodes from an HTML document, but I only want to retrieve the XPath of the element that contains the text data. Example:
foreach (var textNode in node.SelectNodes(".//text()"))
//do stuff here
However, when it comes to retrieving the XPath of the textNode using textNode.XPath, I get the full XPath including the #text node:
/html[1]/body[1]/div[1]/a[1]/#text
Yet I only want the containing node of the text, for example:
/html[1]/body[1]/div[1]/a[1]
Could anyone point me toward a better XPath expression that selects all nodes containing text, so that the XPath I retrieve stops at the containing node?

Instead of:
.//text()
use:
.//*[normalize-space(text())]
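This selects all descendant elements of the context (current) node whose first text node child is not whitespace-only. In XPath 1.0, normalize-space(text()) only looks at the first text node child, so to match elements where any text node child contains non-whitespace characters, use .//*[text()[normalize-space()]] instead.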
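A minimal sketch of the element-selecting expression, using System.Xml as a stand-in so it runs without extra packages (the sample markup and variable names are made up for illustration). With Html Agility Pack the same expression would be passed to SelectNodes, and the XPath property would then be read from each returned element rather than from a #text node.

```csharp
using System;
using System.Linq;
using System.Xml;

// Illustrative, well-formed markup so that XmlDocument can load it.
var doc = new XmlDocument();
doc.LoadXml("<html><body><div><a>link text</a><p></p><span>more text</span></div></body></html>");

// Select elements (not text nodes) that have a non-blank text node child.
var names = doc.DocumentElement
    .SelectNodes(".//*[text()[normalize-space()]]")
    .Cast<XmlNode>()
    .Select(n => n.Name)
    .ToList();

Console.WriteLine(string.Join(", ", names)); // a, span
```

The empty p element and the pure container elements (body, div) are skipped because they have no text node child of their own.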

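Why don't you just strip the last path segment yourself?
// getXPath stands for however the path is obtained, e.g. textNode.XPath
string[] elements = getXPath(textNode).Split(new char[1] { '/' });
// Join everything except the trailing "#text" segment (assumes the path has no trailing slash)
return String.Join("/", elements, 0, elements.Length - 1);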


Wrong xml node accessed when using xpath

I have an xml file generated by Vector CANeds. This file contains information about CANopen Objects I want to read with my tool written in C#.
The (very basic) structure of the xml is as follows:
<ISO15745ProfileContainer xmlns="http://www.canopen.org/xml/1.0">
<ISO15745Profile>
<ProfileHeader></ProfileHeader>
<ProfileBody xsi:type="ProfileBody_Device_CANopen"></ProfileBody>
</ISO15745Profile>
<ISO15745Profile>
<ProfileHeader></ProfileHeader>
<ProfileBody xsi:type="ProfileBody_CommunicationNetwork_CANopen"></ProfileBody>
</ISO15745Profile>
</ISO15745ProfileContainer>
When I create an XmlNodeList containing both ISO15745Profile nodes and loop through it, I get strange behaviour. Accessing the subnodes with explicit indexes works as expected, but when I use XPath, the first node is always used.
Code snippet:
const string filepath = "CANeds1.xdd";
const string s_ns = "//ns:";
var mDataXML = new XmlDocument();
mDataXML.Load(filepath);
var root = mDataXML.DocumentElement;
XmlNamespaceManager nsm = new XmlNamespaceManager(mDataXML.NameTable);
nsm.AddNamespace("ns", root.Attributes["xmlns"].Value);
foreach (XmlNode node in root.ChildNodes) {
Console.WriteLine(" " + node.ChildNodes[1].Attributes["xsi:type"].Value);
Console.WriteLine(" " + node.SelectSingleNode(s_ns + "ProfileBody", nsm).Attributes["xsi:type"].Value);
}
Console output:
ProfileBody_Device_CANopen
ProfileBody_Device_CANopen
ProfileBody_CommunicationNetwork_CANopen
ProfileBody_Device_CANopen
Since node references the 2nd node, the last output should be ProfileBody_CommunicationNetwork_CANopen too.
Does anybody see my mistake? I have already tried renaming one of the "ISO15745Profile" nodes, but this did not change the outcome. I may have messed something up with the namespace...
Some more explanation of the answer given in the comments:
The important point is the // XPath expression. The definition from MSDN says:
Recursive descent; searches for the specified element at any depth. When this path operator appears at the start of the pattern, it indicates recursive descent from the root node.
This means an expression starting with // always searches the entire document for occurrences, even if it is called on a specific child node. That's why SelectSingleNode always returns the first match in the entire document.
To search relative to the node that calls the selection method there is the . operator which indicates the current context.
Put together, an expression starting with .// will search for all occurrences of the following pattern, beginning at the current node.
In the specific case, this means changing //ns: to .//ns: to get the expected result.
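To make the difference concrete, here is a small sketch with a cut-down document. The element names, the kind attribute and the urn:example namespace are invented for illustration; only the // versus .// behaviour is the point.

```csharp
using System;
using System.Xml;

// Two sibling profiles, each with one body element, in a default namespace.
var doc = new XmlDocument();
doc.LoadXml("<c xmlns='urn:example'><profile><body kind='first'/></profile><profile><body kind='second'/></profile></c>");

var nsm = new XmlNamespaceManager(doc.NameTable);
nsm.AddNamespace("ns", "urn:example");

// Context node: the second profile element.
XmlNode second = doc.DocumentElement.ChildNodes[1];

// "//" restarts the search at the document root, ignoring the context node.
string absolute = second.SelectSingleNode("//ns:body", nsm).Attributes["kind"].Value;

// ".//" searches only below the context node.
string relative = second.SelectSingleNode(".//ns:body", nsm).Attributes["kind"].Value;

Console.WriteLine(absolute); // first
Console.WriteLine(relative); // second
```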

Get tags around text in HTML document using C#

I would like to search an HTML file for a certain string and then extract the tags. Given:
<div_outer><div_inner>Happy birthday<div><div>
I would like to search the HTML for "Happy birthday" then have a function return some sort of tag structure: this is the innermost tag, this is the tag outside that one, etc. So, <div_inner></div> then <div_outer></div>.
Any ideas? I am thinking HTMLAgilityPack but I haven't been able to figure out how to do it.
Thanks as always, guys.
The HAP is a good place indeed for this.
You can use the OuterHtml and Parent properties of a Node to get the enclosing elements and markup.
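A rough sketch of the parent-walking idea, written against System.Xml so it runs without extra packages; with Html Agility Pack the equivalent property is ParentNode on HtmlNode. The sample markup here is a well-formed variant of the snippet in the question, which is an assumption on my part.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

var doc = new XmlDocument();
doc.LoadXml("<div_outer><div_inner>Happy birthday</div_inner></div_outer>");

// Find an element whose own text is the search string.
XmlNode hit = doc.SelectSingleNode("//*[text()='Happy birthday']");

// Walk upwards from the hit, collecting tag names from innermost to outermost.
// The loop stops at the document node, which is not an XmlElement.
var tags = new List<string>();
for (XmlNode n = hit; n is XmlElement; n = n.ParentNode)
{
    tags.Add(n.Name);
}

Console.WriteLine(string.Join(", ", tags)); // div_inner, div_outer
```

Walking the parent chain gives the innermost-first order directly, so nothing has to be reversed afterwards.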
You could use XPath for this. The expression //*[text()='Happy birthday'][1]/ancestor-or-self::* finds (for simplicity) a first node whose text content is Happy birthday, then returns that node together with all of its ancestors (parent, grandparent, etc.):
var doc = new HtmlDocument();
doc.LoadHtml("<div_outer><div_inner>Happy birthday<div><div>");
var ancestors = doc.DocumentNode
.SelectNodes("//*[text()='Happy birthday'][1]/ancestor-or-self::*")
.Reverse()
.ToList();
The nodes appear to be returned in document order, so I used the Enumerable.Reverse method to put the innermost one first.
This will return 2 nodes: div_inner and div_outer.

Replace a single node with multiple nodes using HTML Agility Pack

I have some input tags that are placeholders that I am replacing with some HTML. A lot of the time the HTML I'm replacing them with is only one tag, which is easy enough:
HtmlNode node = HtmlNode.CreateNode(sReplacementString);
inputNode.ParentNode.ReplaceChild(node, inputNode);
However if I want to replace inputNode with two or more nodes HtmlNode.CreateNode(sReplacementString) only reads the first node. Is there a way to do a replace where sReplacementString is multiple tags?
As far as I know, there is no direct way to do it. The HtmlNode.CreateNode method creates a single node from the HTML snippet; if the snippet contains several nodes, only the first one is created.
As a workaround you could create a temporary node, create its child nodes from the sReplacementString, and then append these child nodes right after the inputNode node, and, finally, remove the inputNode.
var temp = doc.CreateElement("temp");
temp.InnerHtml = sReplacementString;
var current = inputNode;
foreach (var child in temp.ChildNodes)
{
inputNode.ParentNode.InsertAfter(child, current);
current = child;
}
inputNode.Remove();

How to grab elements by class or id in HTML Source in C#?

I am trying to grab elements from HTML source based on the class or id name, using a C# Windows Forms application. I am putting the source into a string using WebClient and plugging it into the HTMLAgilityPack using HtmlDocument.
However, all the examples I find for the HTMLAgilityPack parse through and find items based on tags. I need to find a specific id, of say a link in the HTML, and retrieve the value inside the tags. Is this possible, and what would be the most efficient way to do it? Everything I have tried for parsing out the ids gives me exceptions. Thanks!
You should be able to do this with XPath:
HtmlDocument doc = new HtmlDocument();
doc.Load(@"file.htm");
HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[@id=\"my_control_id\"]");
string value = (node == null) ? "Error, id not found" : node.InnerHtml;
Quick explanation of the xpath here:
// means search everywhere in the document. Use SelectNodes instead of SelectSingleNode if the expression can match multiple nodes
* means match any element, whatever its tag name
[] defines a "predicate", which is basically a check on properties relative to this node
[@id=\"my_control_id\"] means find nodes that have an attribute named "id" with the value "my_control_id"
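As a rough illustration of the pieces listed above, the sketch below runs the same kind of expressions through System.Xml, which needs no extra package; the sample markup and variable names are made up. With Html Agility Pack the expressions would go to doc.DocumentNode.SelectSingleNode or SelectNodes in the same way.

```csharp
using System;
using System.Xml;

var doc = new XmlDocument();
doc.LoadXml("<html><body><a id='my_control_id' class='nav'>Home</a><a class='nav'>About</a></body></html>");

// Attribute predicate on id: at most one match expected, so SelectSingleNode.
XmlNode byId = doc.SelectSingleNode("//*[@id='my_control_id']");

// Attribute predicate on class: several matches possible, so SelectNodes.
XmlNodeList byClass = doc.SelectNodes("//*[@class='nav']");

// No match: System.Xml returns null rather than throwing, so check before use.
XmlNode missing = doc.SelectSingleNode("//*[@id='no_such_id']");

Console.WriteLine(byId.InnerText);   // Home
Console.WriteLine(byClass.Count);    // 2
Console.WriteLine(missing == null);  // True
```

Note that [@class='nav'] compares the whole attribute value. For elements carrying several classes, a predicate such as contains(concat(' ', normalize-space(@class), ' '), ' nav ') is the usual XPath 1.0 workaround.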

Create XML subtree from string in LINQ?

I want to modify all the text nodes using some functions in C#.
I want to insert another xml subtree created from some string.
For example, I want to change this
<root>
this is a test
</root>
to
<root>
this is <subtree>another</subtree> test
</root>
I have this piece of code, but it inserts a text node. I want to create an XML subtree and insert that instead of a plain text node.
List<XText> textNodes = element.DescendantNodes().OfType<XText>().ToList();
foreach (XText textNode in textNodes)
{
String node = System.Text.RegularExpressions.Regex.Replace(textNode.Value, "a", "<subtree>another</subtree>");
textNode.ReplaceWith(new XText(node));
}
You can split the original XText node into several, and add an XElement in between. Then you replace the original node with the three new nodes.
List<XNode> newNodes = Regex.Split(textNode.Value, "a").Select(p => (XNode) new XText(p)).ToList();
newNodes.Insert(1, new XElement("subtree", "another")); // substitute this with something better
textNode.ReplaceWith(newNodes);
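Inserting at index 1 only works when the pattern matches exactly once. A slightly more general sketch, which interleaves one new element per match, might look like the following; the word-boundary pattern and the variable names are my own assumptions, not part of the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

var root = XElement.Parse("<root>this is a test</root>");

// Snapshot the text nodes first, because the tree is modified inside the loop.
foreach (var textNode in root.DescendantNodes().OfType<XText>().ToList())
{
    // Split on the standalone word "a"; \b leaves an "a" inside other words alone.
    string[] parts = Regex.Split(textNode.Value, @"\ba\b");
    if (parts.Length < 2) continue; // no match in this text node

    var newNodes = new List<XNode>();
    for (int i = 0; i < parts.Length; i++)
    {
        // One new element between each pair of neighbouring text pieces.
        if (i > 0) newNodes.Add(new XElement("subtree", "another"));
        if (parts[i].Length > 0) newNodes.Add(new XText(parts[i]));
    }
    textNode.ReplaceWith(newNodes);
}

string result = root.ToString(SaveOptions.DisableFormatting);
Console.WriteLine(result); // <root>this is <subtree>another</subtree> test</root>
```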
I guess CreateDocumentFragment is much easier. It is not LINQ, but the only reason to use LINQ here is convenience anyway.
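For comparison, a minimal sketch of the document-fragment route with the classic XmlDocument API. The replacement string is hard-coded here for illustration; in practice it would come from whatever produces the new markup, and it has to be well-formed XML.

```csharp
using System;
using System.Xml;

var doc = new XmlDocument();
doc.LoadXml("<root>this is a test</root>");

// The text node to replace.
XmlNode textNode = doc.DocumentElement.FirstChild;

// Parse the replacement markup into a fragment owned by the same document.
XmlDocumentFragment fragment = doc.CreateDocumentFragment();
fragment.InnerXml = "this is <subtree>another</subtree> test";

// Inserting a fragment inserts its children; then drop the original text node.
XmlNode parent = textNode.ParentNode;
parent.InsertBefore(fragment, textNode);
parent.RemoveChild(textNode);

Console.WriteLine(doc.OuterXml); // <root>this is <subtree>another</subtree> test</root>
```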
