Replace a single node with multiple nodes using HTML Agility Pack - c#

I have some input tags that are placeholders that I am replacing with some HTML. A lot of the time the HTML I'm replacing them with is only one tag, which is easy enough:
HtmlNode node = HtmlNode.CreateNode(sReplacementString);
inputNode.ParentNode.ReplaceChild(node, inputNode);
However, if I want to replace inputNode with two or more nodes, HtmlNode.CreateNode(sReplacementString) only reads the first node. Is there a way to do the replace when sReplacementString contains multiple tags?

As far as I know, there is no direct way to do it. The HtmlNode.CreateNode method creates a single node from the HTML snippet; if the snippet contains several nodes, only the first one is created.
As a workaround you could create a temporary node, set its child nodes from sReplacementString, insert those child nodes right after inputNode, and finally remove inputNode.
var temp = doc.CreateElement("temp");
temp.InnerHtml = sReplacementString;
var current = inputNode;
foreach (var child in temp.ChildNodes)
{
    inputNode.ParentNode.InsertAfter(child, current);
    current = child;
}
inputNode.Remove();
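If this comes up a lot, the same idea can be wrapped in a small helper; a minimal sketch (the ReplaceWithHtml name is mine, not part of HtmlAgilityPack):
// Sketch of a reusable helper for the workaround above.
static void ReplaceWithHtml(HtmlNode inputNode, string sReplacementString)
{
    // Parse the replacement markup inside a throwaway element.
    HtmlNode temp = inputNode.OwnerDocument.CreateElement("temp");
    temp.InnerHtml = sReplacementString;

    // Insert each parsed node after the placeholder, then drop the placeholder.
    HtmlNode current = inputNode;
    foreach (HtmlNode child in temp.ChildNodes)
    {
        inputNode.ParentNode.InsertAfter(child, current);
        current = child;
    }
    inputNode.Remove();
}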

Related

Wrong xml node accessed when using xpath

I have an xml file generated by Vector CANeds. This file contains information about CANopen Objects I want to read with my tool written in C#.
The (very basic) structure of the xml is as follows:
<ISO15745ProfileContainer xmlns="http://www.canopen.org/xml/1.0">
  <ISO15745Profile>
    <ProfileHeader></ProfileHeader>
    <ProfileBody xsi:type="ProfileBody_Device_CANopen"></ProfileBody>
  </ISO15745Profile>
  <ISO15745Profile>
    <ProfileHeader></ProfileHeader>
    <ProfileBody xsi:type="ProfileBody_CommunicationNetwork_CANopen"></ProfileBody>
  </ISO15745Profile>
</ISO15745ProfileContainer>
When I create an XmlNodeList containing both ISO15745Profile nodes and loop through them, I get strange behaviour. When I access the subnodes with explicit indexes, everything is as expected; when I use XPath, the first node is always used.
Code snippet:
const string filepath = "CANeds1.xdd";
const string s_ns = "//ns:";
var mDataXML = new XmlDocument();
mDataXML.Load(filepath);
var root = mDataXML.DocumentElement;
XmlNamespaceManager nsm = new XmlNamespaceManager(mDataXML.NameTable);
nsm.AddNamespace("ns", root.Attributes["xmlns"].Value);
foreach (XmlNode node in root.ChildNodes) {
    Console.WriteLine(" " + node.ChildNodes[1].Attributes["xsi:type"].Value);
    Console.WriteLine(" " + node.SelectSingleNode(s_ns + "ProfileBody", nsm).Attributes["xsi:type"].Value);
}
Console output:
ProfileBody_Device_CANopen
ProfileBody_Device_CANopen
ProfileBody_CommunicationNetwork_CANopen
ProfileBody_Device_CANopen
Since node references the second ISO15745Profile element on the second pass, the last output should be ProfileBody_CommunicationNetwork_CANopen too.
Does somebody see my mistake? I have already tried renaming one of the ISO15745Profile nodes, but that did not change the outcome. I may have messed up something with the namespace...
Some more explanation to the answer given in the comments:
The important point is the // XPath expression. The definition from MSDN says:
Recursive descent; searches for the specified element at any depth. When this path operator appears at the start of the pattern, it indicates recursive descent from the root node.
This means an expression starting with // will always search the entire document for occurrences, even when it is called on a specific child node. That is why SelectSingleNode always returns the first match in the whole document.
To search relative to the node that calls the selection method there is the . operator which indicates the current context.
Put together, an expression starting with .// will search for all occurrences of the following pattern, beginning at the current node.
In the specific case, this means changing //ns: to .//ns: to get the expected result.
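A minimal sketch of the corrected loop from the question (only the XPath prefix changes):
const string s_ns = ".//ns:"; // relative to the current node, not the document root

foreach (XmlNode node in root.ChildNodes)
{
    Console.WriteLine(" " + node.SelectSingleNode(s_ns + "ProfileBody", nsm)
                                .Attributes["xsi:type"].Value);
}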

Selecting all nodes containing text with XPath

I have been struggling to resolve this problem over the past couple of days. Say I want to get all the text() from an HTML document, but I only want to retrieve the XPath of the node that contains the text data. Example:
foreach (var textNode in node.SelectNodes(".//text()"))
{
    // do stuff here
}
However, when it comes to retrieving the XPath of the textNode using textNode.XPath, I get the full XPath including the #text node:
/html[1]/body[1]/div[1]/a[1]/#text
Yet I only want the containing node of the text, for example:
/html[1]/body[1]/div[1]/a[1]
Could anyone point me toward a better XPath solution that retrieves all nodes containing text, but returns the XPath only up to the containing node?
Instead of:
.//text()
use:
.//*[normalize-space(text())]
This selects all element descendants of the context (current) node that have at least one non-whitespace-only text node child.
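With HTML Agility Pack the expression can be used directly, and each match already exposes the element's XPath; a short sketch, assuming node is an HtmlNode as in the question:
// Enumerate elements that directly contain non-whitespace text and print
// the XPath of the containing element rather than the #text node.
// Note: SelectNodes returns null when nothing matches.
var matches = node.SelectNodes(".//*[normalize-space(text())]");
if (matches != null)
{
    foreach (HtmlNode element in matches)
    {
        Console.WriteLine(element.XPath); // e.g. /html[1]/body[1]/div[1]/a[1]
    }
}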
Alternatively, why don't you just drop the last step of the path:
string[] elements = textNode.XPath.Split('/');
return String.Join("/", elements, 0, elements.Length - 1);

Get tags around text in HTML document using C#

I would like to search an HTML file for a certain string and then extract the tags. Given:
<div_outer><div_inner>Happy birthday<div><div>
I would like to search the HTML for "Happy birthday" and then have a function return some sort of tag structure: the innermost tag, then the tag outside that one, and so on. So, <div_inner> first, then <div_outer>.
Any ideas? I am thinking HTMLAgilityPack but I haven't been able to figure out how to do it.
Thanks as always, guys.
The HAP is a good fit for this indeed.
You can use the OuterHtml and ParentNode properties of a node to get the enclosing elements and their markup.
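A small sketch of that approach (the search expression and the html variable are assumptions, not from the question):
var doc = new HtmlDocument();
doc.LoadHtml(html);

// Locate the text node, then walk up through its enclosing elements.
HtmlNode textNode = doc.DocumentNode.SelectSingleNode("//text()[contains(., 'Happy birthday')]");
for (HtmlNode current = textNode.ParentNode; current != null; current = current.ParentNode)
{
    Console.WriteLine(current.Name);     // innermost element first, ending at #document
    // current.OuterHtml would give the full markup of each enclosing element
}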
You could use XPath for this. I use the //*[text()='Happy birthday'][1]/ancestor-or-self::* expression, which finds the first (for simplicity) node whose text content is Happy birthday and then returns all the ancestors (parent, grandparent, etc.) of this node plus the node itself:
var doc = new HtmlDocument();
doc.LoadHtml("<div_outer><div_inner>Happy birthday<div><div>");
var ancestors = doc.DocumentNode
    .SelectNodes("//*[text()='Happy birthday'][1]/ancestor-or-self::*")
    .Reverse()
    .ToList();
The nodes appear to be returned in document order, so I used the Enumerable.Reverse method to put the innermost element first.
This will return 2 nodes: div_inner and div_outer.

Problem with XPath

Here's a link:
http://www.covers.com/pageLoader/pageLoader.aspx?page=/data/nba/results/2010-2011/boxscore819588.html
I'm using HTML Agility Pack and I would like to extract, say, the 188 from the 'Odds' column. My editor gives /html/body/form/div/div[2]/div/table/tr/td[2]/div/table/tr[3]/td[7] when asked for the path. I tried that path with various parts omitted (body, html, and so on), but none of the variants return any results when passed to .DocumentNode.SelectNodes(). I also tried starting with // (which, I assume, means the root of the document tree). What gives?
EDIT:
Code:
WebClient client = new WebClient();
string html = client.DownloadString(url);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
foreach(HtmlNode node in doc.DocumentNode.SelectNodes("/some/xpath/expression"))
{
Console.WriteLine("[" + node.InnerText + "]");
}
When scraping sites, you can't safely rely on the exact XPath given by tools; in general it is too restrictive and in fact matches nothing most of the time. The best approach is to look at the HTML and come up with something more resilient to changes.
Here is a piece of code that works with your example:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html); // the HTML string downloaded as in the question
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a[text()='MIA']/ancestor::tr/td[7]"))
{
    Console.WriteLine(node.InnerText.Trim());
}
It outputs 188.
The way it works is:
select an A element with inner text set to "MIA"
find the parent TR element of this A element
get to the seventh TD of this TR element
and then we use InnerText property of that TD element
Try this:
/html/body/form/div/div[2]/div/table/*/tr/td[2]/div/table/*/tr[3]/td[7]
The * catches the implied <tbody> element that is part of the DOM representation of tables even when it is not written out in the HTML.
Other than that, it's more robust to select by ID, CSS class name or some other unique property instead of by hierarchy and document structure:
//table[@class='data']//tr[3]/td[7]
By default HtmlAgilityPack treats the form tag differently (because form tags can overlap), so you need to remove form from the XPath, for example: /html/body//div/div[2]/div/table/tr/td[2]/div/table/tr[3]/td[7]
The other way is to force HtmlAgilityPack to treat the form tag like any other element:
HtmlNode.ElementsFlags.Remove("form");
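A short sketch of that second option; the flag has to be removed before the HTML is parsed, since ElementsFlags is consulted at load time (the html variable is assumed to hold the downloaded page, as in the question):
// Remove the special handling of <form> before loading the document.
HtmlNode.ElementsFlags.Remove("form");

var doc = new HtmlDocument();
doc.LoadHtml(html);

// With <form> treated like any other element, the editor's original path
// (including the form step) can be used, assuming the page structure matches.
var cell = doc.DocumentNode.SelectSingleNode(
    "/html/body/form/div/div[2]/div/table/tr/td[2]/div/table/tr[3]/td[7]");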

Change the node names in an XML file using C#

I have a huge bunch of XML files with the following structure:
<Stuff1>
  <Content>someContent</Content>
  <type>someType</type>
</Stuff1>
<Stuff2>
  <Content>someContent</Content>
  <type>someType</type>
</Stuff2>
<Stuff3>
  <Content>someContent</Content>
  <type>someType</type>
</Stuff3>
...
...
I need to change each of the Content node names to StuffxContent; basically, prepend the parent node name to the Content node's name.
I planned to use the XMLDocument class and figure out a way, but thought I would ask if there were any better ways to do this.
(1.) The [XmlElement / XmlNode].Name property is read-only.
(2.) The XML structure used in the question is crude and could be improved.
(3.) Regardless, here is a code solution to the given question:
String sampleXml =
"<doc>"+
"<Stuff1>"+
"<Content>someContent</Content>"+
"<type>someType</type>"+
"</Stuff1>"+
"<Stuff2>"+
"<Content>someContent</Content>"+
"<type>someType</type>"+
"</Stuff2>"+
"<Stuff3>"+
"<Content>someContent</Content>"+
"<type>someType</type>"+
"</Stuff3>"+
"</doc>";
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.LoadXml(sampleXml);
XmlNodeList stuffNodeList = xmlDoc.SelectNodes("//*[starts-with(name(), 'Stuff')]");
foreach (XmlNode stuffNode in stuffNodeList)
{
    // get existing 'Content' node
    XmlNode contentNode = stuffNode.SelectSingleNode("Content");
    // create new (renamed) Content node, e.g. 'Stuff1Content'
    XmlNode newNode = xmlDoc.CreateElement(stuffNode.Name + contentNode.Name);
    // [if needed] copy existing Content children
    //newNode.InnerXml = contentNode.InnerXml;
    // replace existing Content node with newly renamed Content node
    stuffNode.InsertBefore(newNode, contentNode);
    stuffNode.RemoveChild(contentNode);
}
//xmlDoc.Save
PS: I came here looking for a nicer way of renaming a node/element; I'm still looking.
I used this method to rename the node:
/// <summary>
/// Renames a child element of the given parent node.
/// </summary>
/// <param name="parentnode">Parent of the element to rename</param>
/// <param name="oldChildName">Current element name</param>
/// <param name="newChildName">New element name</param>
private static void RenameNode(XmlNode parentnode, string oldChildName, string newChildName)
{
    var newnode = parentnode.OwnerDocument.CreateNode(XmlNodeType.Element, newChildName, "");
    var oldNode = parentnode.SelectSingleNode(oldChildName);

    // Copy attributes from a snapshot; appending re-parents each attribute,
    // so don't enumerate the live collection while doing it.
    var attributes = new XmlAttribute[oldNode.Attributes.Count];
    oldNode.Attributes.CopyTo(attributes, 0);
    foreach (XmlAttribute att in attributes)
        newnode.Attributes.Append(att);

    // Move children one at a time; AppendChild removes each node from oldNode,
    // so enumerating oldNode.ChildNodes directly would skip nodes.
    while (oldNode.HasChildNodes)
        newnode.AppendChild(oldNode.FirstChild);

    parentnode.ReplaceChild(newnode, oldNode);
}
The easiest way I found to rename a node is:
node.InnerXml = node.InnerXml.Replace("OldName>", "NewName>");
Don't include the opening < so that the closing </OldName> tag is renamed as well.
Perhaps a better solution would be to iterate through each node, and write the information out to a new document. Obviously, this will depend on how you will be using the data in future, but I'd recommend the same reformatting as FlySwat suggested...
<stuff id="1">
<content/>
</stuff>
I'd also suggest that using the XDocument that was recently added would be the best way to go about creating the new document.
I'll answer the higher-level question: why are you doing this with XmlDocument at all?
I think the simplest way to accomplish what you are after is an XSLT stylesheet that matches the Content nodes and outputs the renamed ones; I don't see a reason to bring out such heavy guns.
Either way, if you still wish to do it C#-style, use XmlReader + XmlWriter rather than XmlDocument, for memory and speed reasons.
XmlDocument stores the entire XML in memory, which makes it very heavy when you only traverse the document once.
XmlDocument is a good fit when you access the elements many times (not the situation here).
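A minimal streaming sketch of that reader/writer approach, assuming each file has a single root wrapping the Stuff elements as in the sample above, and copying only elements, attributes and text for brevity (file names are placeholders):
using System.Xml;

static void RenameContentElements(string inputPath, string outputPath)
{
    using (XmlReader reader = XmlReader.Create(inputPath))
    using (XmlWriter writer = XmlWriter.Create(outputPath))
    {
        string currentStuff = null;
        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    bool isEmpty = reader.IsEmptyElement;
                    if (reader.Name.StartsWith("Stuff"))
                        currentStuff = reader.Name;

                    // Rename <Content> to <Stuff1Content>, <Stuff2Content>, ...
                    string name = (reader.Name == "Content" && currentStuff != null)
                        ? currentStuff + "Content"
                        : reader.Name;
                    writer.WriteStartElement(name);

                    // Copy any attributes of the element.
                    while (reader.MoveToNextAttribute())
                        writer.WriteAttributeString(reader.Name, reader.Value);
                    reader.MoveToElement();

                    if (isEmpty)
                        writer.WriteEndElement();
                    break;

                case XmlNodeType.Text:
                    writer.WriteString(reader.Value);
                    break;

                case XmlNodeType.EndElement:
                    writer.WriteEndElement();
                    break;
            }
        }
    }
}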
I am not an expert in XML; in my case I just needed to make all tag names in an HTML file upper case for further manipulation in XmlDocument with GetElementsByTagName. The reason I needed upper case is that XmlDocument tag names are case-sensitive (since it is XML), and I could not guarantee that my HTML file used consistent case in its tag names.
So I solved it like this: I used XDocument as an intermediate step, where you can rename elements (i.e. the tag names), and then loaded that into an XmlDocument. Here is my VB.NET code (the C# version is very similar).
Dim x As XDocument = XDocument.Load("myFile.html")
For Each element In x.Descendants()
    element.Name = element.Name.LocalName.ToUpper()
Next
Dim x2 As XmlDocument = New XmlDocument()
x2.LoadXml(x.ToString())
For my purpose it worked fine, though I understand that in certain cases this might not be a solution if you are dealing with a pure XML-file.
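For reference, a rough C# equivalent of the VB.NET snippet above (same placeholder file name):
using System.Xml;
using System.Xml.Linq;

// Load as XDocument, upper-case every element name, then reload into XmlDocument.
XDocument x = XDocument.Load("myFile.html");
foreach (XElement element in x.Descendants())
{
    element.Name = element.Name.LocalName.ToUpper();
}

XmlDocument x2 = new XmlDocument();
x2.LoadXml(x.ToString());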
Load it in as a string and do a replace on the whole lot:
String sampleXml =
"<doc>"+
"<Stuff1>"+
"<Content>someContent</Content>"+
"<type>someType</type>"+
"</Stuff1>"+
"<Stuff2>"+
"<Content>someContent</Content>"+
"<type>someType</type>"+
"</Stuff2>"+
"<Stuff3>"+
"<Content>someContent</Content>"+
"<type>someType</type>"+
"</Stuff3>"+
"</doc>";
// replace the tags only, so the element text "someContent" is left untouched
sampleXml = sampleXml.Replace("<Content>", "<StuffxContent>").Replace("</Content>", "</StuffxContent>");
The XML you have provided shows that someone completely misses the point of XML.
Instead of having
<stuff1>
<content/>
</stuff1>
You should have:
<stuff id="1">
<content/>
</stuff>
Now you would be able to traverse the document using XPath (i.e., //stuff[@id='1']/content). The names of nodes should not be used to establish identity; you use attributes for that.
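For instance, a quick sketch against the restructured XML (the file name is a placeholder):
XmlDocument doc = new XmlDocument();
doc.Load("stuff.xml");

// Address a single entry by its id attribute rather than by element name.
XmlNode content = doc.SelectSingleNode("//stuff[@id='1']/content");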
To do what you asked, load the XML into an xml document, and simply iterate through the first level of child nodes renaming them.
PseudoCode:
foreach (XmlNode n in YourDoc.ChildNodes)
{
    n.ChildNodes[0].Name = n.Name + n.ChildNodes[0].Name;
}
YourDoc.Save();
However, I'd strongly recommend you actually fix the XML so that it is useful, instead of wreck it further.
