How to delete a node if it has no parent node - c#

I'm using the HTML Agility Pack to clean up input to a WYSIWYG editor. This might not be the best way to do it, but I'm working with developers who explode on contact with regex, so it will have to suffice.
My WYSIWYG content looks something like this (for example):
<p></p>
<p></p>
<p><span><input id="textbox" type="text" /></span></p>
I need to strip the empty paragraph tags. Here's how I'm doing it at the moment:
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//p");
if (nodes == null)
    return;
foreach (HtmlNode node in nodes)
{
    node.InnerHtml = node.InnerHtml.Trim();
    if (node.InnerHtml == string.Empty)
        node.ParentNode.RemoveChild(node);
}
However, because the HTML is not a complete document, the paragraph tags do not have a parent node, so RemoveChild fails because ParentNode is null.
I can't find another way to remove the tag, though. Can anyone point me at an alternative method?

Technically, first-level elements are children of the document root, so the following code should work:
if (node.InnerHtml == String.Empty) {
    HtmlNode parent = node.ParentNode;
    if (parent == null) {
        parent = doc.DocumentNode;
    }
    parent.RemoveChild(node);
}

You want to remove from the collection, right? Iterate backwards so that every node is visited and removing an item does not shift the indexes of the items still to come:
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//p");
if (nodes == null)
    return;
for (int i = nodes.Count - 1; i >= 0; i--)
{
    nodes[i].InnerHtml = nodes[i].InnerHtml.Trim();
    if (nodes[i].InnerHtml == string.Empty)
        nodes.Remove(i);
}

Related

XML Node Attribute returning as NULL when populated?

I have an XML document from which I'm pulling a specific node and all of its attributes. In debug mode I can see that I'm getting the specific nodes and all of their attributes. However, when I try to get the attribute value, it can't be found and I get null back. I've done some searching and looked at some examples, and from what I can tell I should be getting the value, but I'm not, and I don't see what I'm doing wrong.
I'm trying to get the StartTime value.
Here is the XML that is returned.
Here you can see in debug and with the Text Visualizer the value should be there.
The code I'm trying.
XmlNodeList nodes = xmlDoc.GetElementsByTagName("PlannedAbsences");
if (nodes != null && nodes.Count > 0)
{
    foreach (XmlNode node in nodes)
    {
        if (node.Attributes != null)
        {
            var nameAttribute = node.Attributes["StartTime"];
            if (nameAttribute != null)
            {
                //var startDate = nameAttribute.Value;
            }
        }
    }
}
Using the XDocument class in the System.Xml.Linq namespace, grab the child elements of the PlannedAbsences parent, then iterate over them and read the desired attribute from each one.
var xmlDoc = XDocument.Load(@"path to xml file");
var absences = xmlDoc.Element("PlannedAbsences")?.Elements("Absence");
if (absences != null)
{
    foreach (var item in absences)
    {
        // Attribute() returns null when the attribute is missing
        var startTime = item.Attribute("StartTime")?.Value;
        Console.WriteLine(startTime);
    }
}
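The question does not show the XML, so the following is only a sketch under an assumed document shape: a PlannedAbsences element whose Absence children carry the StartTime attribute. The Schedule and Absence element names, the sample values, and the AbsenceReader class are all hypothetical. If the attribute does live on the child elements, reading it from the PlannedAbsences node itself would yield null, while reading it from each child returns the value.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class AbsenceReader
{
    // Hypothetical sample; the real XML from the question is not shown.
    public const string SampleXml =
        "<Schedule>" +
        "<PlannedAbsences>" +
        "<Absence StartTime=\"2020-01-06T08:00:00\" />" +
        "<Absence StartTime=\"2020-01-07T08:00:00\" />" +
        "<Absence />" +
        "</PlannedAbsences>" +
        "</Schedule>";

    // Returns the StartTime of every Absence under any PlannedAbsences element,
    // skipping elements that lack the attribute.
    public static List<string> ReadStartTimes(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.Descendants("PlannedAbsences")
                  .Elements("Absence")
                  .Select(a => (string)a.Attribute("StartTime")) // null when missing
                  .Where(v => v != null)
                  .ToList();
    }

    public static void Main()
    {
        foreach (string startTime in ReadStartTimes(SampleXml))
        {
            Console.WriteLine(startTime);
        }
    }
}
```

Descendants is used instead of Element so the sketch does not depend on PlannedAbsences being the root element; the explicit (string) conversion on XAttribute returns null rather than throwing when the attribute is absent.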

C# HtmlDocument Extract Classes

I am writing some code to loop through every element in an HTML page and extract all IDs and classes.
My current code is able to extract the IDs, but I can't see a way to get the classes. Does anybody know where I can access these?
private void ParseElements()
{
    // GET: Document from Browser
    HtmlDocument ThisDocument = Browser.Document;
    // DECLARE: List of IDs
    List<string> ListIdentifiers = new List<string>();
    // LOOP: Through Each Element
    for (int LoopA = 0; LoopA < ThisDocument.All.Count; LoopA += 1)
    {
        // DETERMINE: Whether ID Exists in Element
        if (ThisDocument.All[LoopA].Id != null)
        {
            // ADD: Identifier to List
            ListIdentifiers.Add(ThisDocument.All[LoopA].Id);
        }
    }
}
You could get the inner HTML of each node and use a regular expression to get the class, or you could try the HTML Agility Pack.
Something like...
HtmlAgilityPack.HtmlDocument AgilePack = new HtmlAgilityPack.HtmlDocument();
AgilePack.LoadHtml(ThisDocument.Body.OuterHtml);
HtmlNodeCollection Nodes = AgilePack.DocumentNode.SelectNodes(@"//*");
foreach (HtmlAgilityPack.HtmlNode Node in Nodes)
{
    if (Node.Attributes["class"] != null)
        MessageBox.Show(Node.Attributes["class"].Value);
}

How can I get this with XPath

I'm writing a crawler for a site and came across this problem.
From this HTML...
<div class="Price">
<span style="font-size: 14px; text-decoration: line-through; color: #444;">195.90 USD</span>
<br />
131.90 USD
</div>
I need to get only 131.90 USD using XPath.
I tried this:
"//div[@class='Price']"
But it returns a different result.
How can I achieve this?
EDIT
I'm using this C# code (simplified for demonstration)
protected override DealDictionary GrabData(HtmlAgilityPack.HtmlDocument html) {
    var price = Helper.GetInnerHtml(html.DocumentNode, "//div[@class='Price']/text()");
}
Helper Class
public static class Helper {
    public static String GetInnerText(HtmlDocument doc, String xpath) {
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes != null && nodes.Count > 0) {
            var node = nodes[0];
            return node.InnerText.TrimHtml();
        }
        return String.Empty;
    }
    public static String GetInnerText(HtmlNode inputNode, String xpath) {
        var nodes = inputNode.SelectNodes(xpath);
        if (nodes != null && nodes.Count > 0) {
            var node = nodes[0];
            var comments = node.ChildNodes.OfType<HtmlCommentNode>().ToList();
            foreach (var comment in comments)
                comment.ParentNode.RemoveChild(comment);
            return node.InnerText.TrimHtml();
        }
        return String.Empty;
    }
    public static String GetInnerHtml(HtmlDocument doc, String xpath) {
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes != null && nodes.Count > 0) {
            var node = nodes[0];
            return node.InnerHtml.TrimHtml();
        }
        return String.Empty;
    }
    public static string GetInnerHtml(HtmlNode inputNode, string xpath) {
        var nodes = inputNode.SelectNodes(xpath);
        if (nodes != null && nodes.Count > 0) {
            var node = nodes[0];
            return node.InnerHtml.TrimHtml();
        }
        return string.Empty;
    }
}
The XPath you tried is a good start:
//div[@class='Price']
The //div part selects any <div> element in the XML document; the [@class='Price'] predicate restricts that selection to <div> elements whose class attribute has the value Price.
So far, so good, but because you select a <div> element, what you get back is the <div> element including all of its contents.
In the XML fragment you show above, you have the following hierarchical structure:
<div> element
  <span> element
    text node
  <br> element
  text node
So, what you are actually interested in is the last text node. You can use text() in XPath to select text nodes. Because in this case you want the text node that is an immediate child of the <div> element you found (the text inside <span> is a child of <span>, not of <div>), your XPath should look like this:
//div[@class='Price']/text()
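As a rough check of that expression, here is a minimal sketch that evaluates it against the fragment from the question. It deliberately uses System.Xml's XmlDocument rather than HtmlAgilityPack so it runs without extra packages; the PriceXPathDemo class and CurrentPrice method are hypothetical names. The extra [normalize-space()][last()] predicates are an assumption added for safety: a parser that keeps whitespace-only text nodes between the tags would otherwise return those first. The same XPath 1.0 expression should carry over to HtmlAgilityPack's SelectSingleNode, but that is not verified here.

```csharp
using System;
using System.Xml;

public static class PriceXPathDemo
{
    // The fragment from the question; it happens to be well-formed XML.
    public const string Fragment =
        "<div class=\"Price\">\n" +
        "<span style=\"font-size: 14px; text-decoration: line-through; color: #444;\">195.90 USD</span>\n" +
        "<br />\n" +
        "131.90 USD\n" +
        "</div>";

    public static string CurrentPrice(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        // text() matches only text nodes that are direct children of the div;
        // [normalize-space()] drops whitespace-only nodes, [last()] keeps the final one.
        XmlNode node = doc.SelectSingleNode(
            "//div[@class='Price']/text()[normalize-space()][last()]");
        return node == null ? string.Empty : node.Value.Trim();
    }

    public static void Main()
    {
        Console.WriteLine(CurrentPrice(Fragment)); // prints "131.90 USD"
    }
}
```

The struck-through price is not returned because it is a child of the <span>, which the child-axis text() step never reaches.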

How do I loop backwards from SiteMap.CurrentNode to SiteMap.RootNode

I have a simple Sitemap like this from asp:SiteMapDataSource:
Page 1 > Page 2 > Page 3
I would like to create a foreach loop in C# that generates it instead of using asp:SiteMapPath, because I need to add some exceptions to it. However, I cannot figure out how to loop backwards from SiteMap.CurrentNode to SiteMap.RootNode.
The property you are looking for is SiteMapNode.ParentNode:
SiteMapNode currentNode = SiteMap.CurrentNode;
SiteMapNode rootNode = SiteMap.RootNode;
Stack<SiteMapNode> nodeStack = new Stack<SiteMapNode>();
// Guard against null in case the current page is not part of the sitemap
while (currentNode != null && currentNode != rootNode)
{
    nodeStack.Push(currentNode);
    currentNode = currentNode.ParentNode;
}
// If you want to include RootNode in your list
nodeStack.Push(rootNode);
// ToArray returns the nodes in pop order: root first, current page last
SiteMapNode[] breadCrumbs = nodeStack.ToArray();
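Since the question asks for a foreach loop, here is a small sketch of the same parent-walking idea that can be run outside ASP.NET. The Node class is a hypothetical stand-in for SiteMapNode that models only Title and ParentNode, and it assumes the root's ParentNode is null; Breadcrumbs.Build is likewise an illustrative name, not part of any framework API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for SiteMapNode: only Title and ParentNode are modelled.
public sealed class Node
{
    public string Title { get; }
    public Node ParentNode { get; }

    public Node(string title, Node parentNode)
    {
        Title = title;
        ParentNode = parentNode;
    }
}

public static class Breadcrumbs
{
    // Walks from the current node up to the root and returns the titles root-first.
    public static List<string> Build(Node current)
    {
        var stack = new Stack<Node>();
        for (Node n = current; n != null; n = n.ParentNode)
        {
            stack.Push(n);
        }

        var titles = new List<string>();
        foreach (Node n in stack) // a Stack enumerates in pop order: root first
        {
            titles.Add(n.Title);
        }
        return titles;
    }

    public static void Main()
    {
        var page1 = new Node("Page 1", null);
        var page2 = new Node("Page 2", page1);
        var page3 = new Node("Page 3", page2);
        Console.WriteLine(string.Join(" > ", Build(page3))); // prints "Page 1 > Page 2 > Page 3"
    }
}
```

Any exceptions to the breadcrumb trail (skipping or renaming a node, for instance) could be applied inside the foreach loop.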

HtmlAgilityPack selecting childNodes not as expected

I am attempting to use the HtmlAgilityPack library to parse some links in a page, but I am not seeing the results I would expect from the methods. In the following I have an HtmlNodeCollection of links. For each link I want to check whether there is an image node and then parse its attributes, but the SelectNodes and SelectSingleNode methods of linkNode seem to be searching the parent document, not the child nodes of linkNode. What gives?
HtmlDocument htmldoc = new HtmlDocument();
htmldoc.LoadHtml(content);
HtmlNodeCollection linkNodes = htmldoc.DocumentNode.SelectNodes("//a[@href]");
foreach(HtmlNode linkNode in linkNodes)
{
    string linkTitle = linkNode.GetAttributeValue("title", string.Empty);
    if (linkTitle == string.Empty)
    {
        HtmlNode imageNode = linkNode.SelectSingleNode("/img[@alt]");
    }
}
Is there any other way I could get the alt attribute of the image childnode of linkNode if it exists?
You should remove the forward slash prefix from "/img[@alt]", as it signifies that you want to start at the root of the document.
HtmlNode imageNode = linkNode.SelectSingleNode("img[@alt]");
With an XPath query you can also use "." to indicate that the search should start at the current node. Note that ".//" searches all descendants of the current node, not only its direct children.
HtmlNode imageNode = linkNode.SelectSingleNode(".//img[@alt]");
Also, watch out for null checks; SelectNodes returns null instead of an empty collection when nothing matches.
HtmlNodeCollection linkNodes = htmldoc.DocumentNode.SelectNodes("//a[@href]");
if (linkNodes != null)
{
    foreach(HtmlNode linkNode in linkNodes)
    {
        string linkTitle = linkNode.GetAttributeValue("title", string.Empty);
        if (linkTitle == string.Empty)
        {
            HtmlNode imageNode = linkNode.SelectSingleNode("img[@alt]");
        }
    }
}
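The difference between the three path forms can be seen in a small sketch. It uses System.Xml's XmlDocument as a stand-in so that it runs without HtmlAgilityPack; the markup, the RelativeXPathDemo class, and the AltOfFirstLink method are hypothetical. XPath 1.0 treats a leading "/" the same way in both libraries as far as I know, but the HtmlAgilityPack behaviour is not verified here.

```csharp
using System;
using System.Xml;

public static class RelativeXPathDemo
{
    // Hypothetical markup: the first link wraps an image, the second has a title.
    public const string Markup =
        "<body>" +
        "<a href=\"/one\"><img src=\"one.png\" alt=\"First\" /></a>" +
        "<a href=\"/two\" title=\"Second\">Two</a>" +
        "</body>";

    // Evaluates the given XPath with the first link as the context node and
    // returns the alt text of the matched <img>, or null when nothing matches.
    public static string AltOfFirstLink(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Markup);
        XmlNode link = doc.SelectSingleNode("//a[@href]");
        XmlNode img = link.SelectSingleNode(xpath);
        return img?.Attributes["alt"].Value;
    }

    public static void Main()
    {
        // Absolute path: starts at the document root, where there is no <img>.
        Console.WriteLine(AltOfFirstLink("/img[@alt]") ?? "(no match)");
        // Relative path: direct children of the link.
        Console.WriteLine(AltOfFirstLink("img[@alt]") ?? "(no match)");
        // Relative path: any descendant of the link.
        Console.WriteLine(AltOfFirstLink(".//img[@alt]") ?? "(no match)");
    }
}
```

Only the first call falls back to "(no match)"; both relative forms find the image inside the link.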
