How can I get this with XPath - c#

I'm writing a crawler for a site and came across this problem.
From this HTML...
<div class="Price">
<span style="font-size: 14px; text-decoration: line-through; color: #444;">195.90 USD</span>
<br />
131.90 USD
</div>
I need to get only 131.90 USD using XPath.
I tried this:
"//div[@class='Price']"
But it returns a different result.
How can I achieve this?
EDIT
I'm using this C# code (simplified for demonstration)
protected override DealDictionary GrabData(HtmlAgilityPack.HtmlDocument html) {
    var price = Helper.GetInnerHtml(html.DocumentNode, "//div[@class='Price']/text()");
}
Helper Class
public static class Helper {
    public static String GetInnerText(HtmlDocument doc, String xpath) {
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes != null && nodes.Count > 0) {
            var node = nodes[0];
            return node.InnerText.TrimHtml();
        }
        return String.Empty;
    }
    public static String GetInnerText(HtmlNode inputNode, String xpath) {
        var nodes = inputNode.SelectNodes(xpath);
        if (nodes != null && nodes.Count > 0) {
            var node = nodes[0];
            var comments = node.ChildNodes.OfType<HtmlCommentNode>().ToList();
            foreach (var comment in comments)
                comment.ParentNode.RemoveChild(comment);
            return node.InnerText.TrimHtml();
        }
        return String.Empty;
    }
    public static String GetInnerHtml(HtmlDocument doc, String xpath) {
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes != null && nodes.Count > 0) {
            var node = nodes[0];
            return node.InnerHtml.TrimHtml();
        }
        return String.Empty;
    }
    public static string GetInnerHtml(HtmlNode inputNode, string xpath) {
        var nodes = inputNode.SelectNodes(xpath);
        if (nodes != null && nodes.Count > 0) {
            var node = nodes[0];
            return node.InnerHtml.TrimHtml();
        }
        return string.Empty;
    }
}

The XPath you tried is a good start:
//div[@class='Price']
This selects every <div> element in the document whose class attribute has the value Price.
So far, so good. But because you select a <div> element, what you get back is the <div> element including all of its contents.
In the fragment you show above, you have the following hierarchical structure:
<div> element
    <span> element
        text node
    <br> element
    text node
So what you are actually interested in is the last text node. You can use text() in XPath to select text nodes. In this case you want the text node that is an immediate child of the <div> element you found (the text inside the <span> is a child of the <span>, so it is not selected), and your XPath should look like this:
//div[@class='Price']/text()
Depending on the parser, whitespace-only text nodes between the child elements may also be returned, so you may need to take the last match or skip the empty ones.
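A minimal sketch of that selection with HtmlAgilityPack, assuming the package is referenced and using the markup from the question. The predicate [normalize-space()] is an addition that is not part of the answer above; it skips text nodes that contain only whitespace, so the first match is the price text.

```csharp
using System;
using HtmlAgilityPack;

// Sample markup copied from the question.
const string html = @"<div class=""Price"">
<span style=""font-size: 14px; text-decoration: line-through; color: #444;"">195.90 USD</span>
<br />
131.90 USD
</div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// text() selects text nodes that are direct children of the <div>;
// [normalize-space()] drops the ones that are only whitespace.
var priceNode = doc.DocumentNode.SelectSingleNode(
    "//div[@class='Price']/text()[normalize-space()]");

// Fall back to an empty string when nothing matched.
string price = priceNode == null ? string.Empty : priceNode.InnerText.Trim();
Console.WriteLine(price);
```

With the sample markup this should give 131.90 USD; the struck-through price is not matched because its text node belongs to the <span>.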

Related

XML Node Attribute returning as NULL when populated?

I have an XML doc from which I'm pulling out a specific node and all of its attributes. In debug mode I can see that I'm getting the specific nodes and all of their attributes. However, when I try to get the attribute value, it can't be found and NULL is returned. I've done some searching and looked at some examples, and from what I can tell I should be getting the value, but I'm not, and I don't see what I'm doing wrong.
I'm trying to get the StartTime value.
Here is the XML that is returned.
Here you can see in debug and with the Text Visualizer the value should be there.
The code I'm trying.
XmlNodeList nodes = xmlDoc.GetElementsByTagName("PlannedAbsences");
if (nodes != null && nodes.Count > 0)
{
    foreach (XmlNode node in nodes)
    {
        if (node.Attributes != null)
        {
            var nameAttribute = node.Attributes["StartTime"];
            if (nameAttribute != null)
            {
                //var startDate = nameAttribute.Value;
            }
        }
    }
}
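Using the XDocument class from the System.Xml.Linq namespace, grab the sub-elements of the PlannedAbsences parent, then iterate over them and read the value of the desired attribute.
// Requires: using System; using System.Linq; using System.Xml.Linq;
var xmlDoc = XDocument.Load(@"path to xml file");
// Element() only matches when PlannedAbsences is the root element; fall back to an empty sequence otherwise.
var absences = xmlDoc.Element("PlannedAbsences")?.Elements("Absence")
               ?? Enumerable.Empty<XElement>();
foreach (var item in absences)
{
    // Attribute() returns null when StartTime is missing, so guard the access.
    var startTime = item.Attribute("StartTime")?.Value;
    Console.WriteLine(startTime);
}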
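Since the actual XML is not shown in the question, here is a small self-contained sketch on a hypothetical document. It assumes StartTime sits on child <Absence> elements rather than on <PlannedAbsences> itself, which would explain why reading the attribute from the parent node gives null. The sample XML and its values are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical sample; the real document from the question is not shown.
const string xml = @"<PlannedAbsences>
  <Absence StartTime=""2020-01-06T08:00:00"" />
  <Absence />
</PlannedAbsences>";

var xmlDoc = XDocument.Parse(xml);

// Descendants() finds <Absence> at any depth, so it does not matter
// whether PlannedAbsences is the root element.
List<string> startTimes = xmlDoc
    .Descendants("Absence")
    .Select(a => (string)a.Attribute("StartTime")) // cast yields null when the attribute is missing
    .Where(v => v != null)
    .ToList();

foreach (var s in startTimes)
    Console.WriteLine(s);
```

If the real document declares a default XML namespace, the element names would need to be qualified with an XNamespace, otherwise nothing matches.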

c# htmlagility select specific xpath

I have this HTML code:
<div>
<time class="departure"><span></span>value1<time class="return">
<span></span>value2</time>
</div>
I'm using the C# code below:
var nodes = doc.DocumentNode.SelectNodes("//time[@class='departure']");
foreach (var node in nodes)
{
    Console.WriteLine(node.InnerHtml);
    if (node.InnerText.Trim() == DepartTime)
    {
        ReturnTime = node.SelectSingleNode("time").InnerText; //null reference here
    }
}
As you can see, I check whether the departure time (DepartTime) exists and, if it does, I want the inner text of the next time element after it. But this doesn't seem to work: I get a null reference exception.
I solved it with:
foreach (var node in nodes)
{
    if (node.InnerText.Trim() == DepartTime)
    {
        ReturnTime = node.ParentNode.SelectNodes("time")[1].InnerText.Trim();
    }
}
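The fix above works because the two time elements end up as siblings under the same parent. As an alternative sketch, the following-sibling axis expresses "the next time element after this one" directly. To keep the example self-contained it uses System.Xml on an assumed well-formed variant of the markup (with the first <time> closed); how HtmlAgilityPack parses the unclosed <time> from the question is not verified here. The departTime value is hypothetical.

```csharp
using System;
using System.Xml;

// Assumed well-formed variant of the question's markup: both <time> elements are siblings.
const string xml = @"<div>
<time class=""departure""><span></span>value1</time>
<time class=""return""><span></span>value2</time>
</div>";

var doc = new XmlDocument();
doc.LoadXml(xml);

string departTime = "value1"; // hypothetical value to look for
string returnTime = null;

foreach (XmlNode node in doc.SelectNodes("//time[@class='departure']"))
{
    if (node.InnerText.Trim() != departTime)
        continue;

    // First <time> element that follows this node under the same parent.
    XmlNode next = node.SelectSingleNode("following-sibling::time[1]");
    if (next != null)
        returnTime = next.InnerText.Trim();
}

Console.WriteLine(returnTime);
```

Compared with indexing SelectNodes("time")[1], this does not depend on the departure element being the first time element of its parent.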

how do i get the text from a nested <p> tag in an external html using html agility pack?

I am trying to get some text from an external site. The text I am trying to get is nested in a paragraph tag. The div has a class value.
html code snippet:
<div class="discription"><p>this is the text I want to grab</p></div>
current c# code:
public String getDiscription(string url)
{
    var web = new HtmlWeb();
    var doc = web.Load(url);
    var nodes = doc.DocumentNode.SelectNodes("//div[@class='discription']");
    if (nodes != null && nodes.Count > 0)
    {
        // Return the markup of the first matching div.
        return nodes[0].InnerHtml;
    }
    // Every code path must return a value, so the error is the fallthrough case.
    return "could not find text";
}
What I don't understand is the syntax of the XPath //div[@class='discription']. I know it is wrong. What should the XPath be?
Use //div[@class='discription']/p.
Breakdown:
//div - All div elements
[@class='discription'] - With a class attribute whose value is discription
/p - Select the child p elements
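A short sketch of the expression from the breakdown in use, assuming HtmlAgilityPack is referenced. It loads the snippet from the question from a string instead of a URL so that it runs without network access.

```csharp
using System;
using HtmlAgilityPack;

// Snippet copied from the question.
const string html = @"<div class=""discription""><p>this is the text I want to grab</p></div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Selects the first <p> that is a direct child of the matching <div>.
var p = doc.DocumentNode.SelectSingleNode("//div[@class='discription']/p");

// InnerText gives the text without markup; fall back to a message when nothing matched.
string description = p != null ? p.InnerText.Trim() : "could not find text";
Console.WriteLine(description);
```

Note that @class='discription' only matches when the attribute value is exactly that string; a div carrying several class names would need something like contains(@class, 'discription') instead.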

How to delete a node if it has no parent node

I'm using the HTML agility pack to clean up input to a WYSIWYG. This might not be the best way to do this but I'm working with developers who explode on contact with regex so it will have to suffice.
My WYSIWYG content looks something like this (for example):
<p></p>
<p></p>
<p><span><input id="textbox" type="text" /></span></p>
I need to strip the empty paragraph tags. Here's how I'm doing it at the moment:
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//p");
if (nodes == null)
    return;
foreach (HtmlNode node in nodes)
{
    node.InnerHtml = node.InnerHtml.Trim();
    if (node.InnerHtml == string.Empty)
        node.ParentNode.RemoveChild(node);
}
However, because the HTML is not a complete document, the paragraph tags do not have a parent node, and RemoveChild will therefore fail since ParentNode is null.
I can't find another way to remove the tag, though. Can anyone point me to an alternative method?
Technically, first-level elements are children of the document root, so the following code should work:
if (node.InnerHtml == String.Empty) {
    HtmlNode parent = node.ParentNode;
    if (parent == null) {
        parent = doc.DocumentNode;
    }
    parent.RemoveChild(node);
}
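You want to remove from the collection, right?
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//p");
if (nodes == null)
    return;
// Iterate backwards so that removing an item does not shift the indexes still to be visited,
// and so that the last element is checked as well.
for (int i = nodes.Count - 1; i >= 0; i--)
{
    nodes[i].InnerHtml = nodes[i].InnerHtml.Trim();
    if (nodes[i].InnerHtml == string.Empty)
        nodes.Remove(i);
}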
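Another option, sketched here under the assumption that your HtmlAgilityPack version provides HtmlNode.Remove(), is to let each node detach itself from whatever parent it has. The matches are copied into a list first so that removing nodes does not interfere with the iteration. The sample markup is the one from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Fragment copied from the question (not a complete document).
const string html = @"<p></p>
<p></p>
<p><span><input id=""textbox"" type=""text"" /></span></p>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// SelectNodes returns null when nothing matches, so fall back to an empty list.
List<HtmlNode> paragraphs = doc.DocumentNode.SelectNodes("//p")?.ToList()
                            ?? new List<HtmlNode>();

foreach (HtmlNode p in paragraphs)
{
    // Treat paragraphs with no content or only whitespace as empty.
    if (string.IsNullOrWhiteSpace(p.InnerHtml))
        p.Remove();
}

int remaining = doc.DocumentNode.SelectNodes("//p")?.Count ?? 0;
Console.WriteLine(remaining);
Console.WriteLine(doc.DocumentNode.OuterHtml);
```

Only the paragraph that wraps the input should remain. The whitespace text nodes between the removed paragraphs stay in the document, so trim the final output if that matters.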

Update or inserting a node in an XML doc

I am a beginner to XML and XPath in C#. Here is an example of my XML doc:
<root>
<folder1>
...
<folderN>
...
<nodeMustExist>...
<nodeToBeUpdated>some value</nodeToBeUpdated>
....
</root>
What I need is to update the value of nodeToBeUpdated if the node exists, or to add this node after nodeMustExist if nodeToBeUpdated is not there. The prototype of the function is something like this:
void UpdateNode(
    XmlDocument xml,
    string nodeMustExist,
    string nodeToBeUpdte,
    string newVal
)
{
    /*
    search for XmlNode with name = nodeToBeUpdte in xml
    to xmlNodeToBeUpdated (XmlNode type?)
    if (xmlNodeToBeUpdated != null)
    {
        xmlNodeToBeUpdated.value(?) = newVal;
    }
    else
    {
        search for nodeMustExist in xml to xmlNodeMustExist obj
        if ( xmlNodeMustExist != null )
        {
            add xmlNodeToBeUpdated as next node
            xmlNodeToBeUpdated.value = newVal;
        }
    }
    */
}
Is there a better or simpler way to do this? Any advice?
By the way, if nodeToBeUpdated appears more than once in other places, I just want to update the first one.
This updates the node in every folder:
public void UpdateNodes(XmlDocument doc, string newVal)
{
    // Select from the root; a bare "folder" would look for a child of the document node and match nothing.
    XmlNodeList folderNodes = doc.SelectNodes("/root/folder");
    foreach (XmlNode folderNode in folderNodes)
    {
        XmlNode updateNode = folderNode.SelectSingleNode("nodeToBeUpdated");
        XmlNode mustExistNode = folderNode.SelectSingleNode("nodeMustExist");
        if (updateNode != null)
        {
            updateNode.InnerText = newVal;
        }
        else if (mustExistNode != null)
        {
            XmlNode node = folderNode.OwnerDocument.CreateNode(XmlNodeType.Element, "nodeToBeUpdated", null);
            node.InnerText = newVal;
            // Insert right after nodeMustExist, as the question asks, instead of appending at the end.
            folderNode.InsertAfter(node, mustExistNode);
        }
    }
}
If you want to update one particular node, passing the string nodeToBeUpdte is not enough; you will have to pass the XmlNode from the XmlDocument.
I have omitted passing the node names to the function, since node names are unlikely to change and can be hardcoded. However, you can pass them to the function and use those strings instead of the hardcoded node names.
The XPath expression that selects all instances of <nodeToBeUpdated> would be this:
/root/folder[nodeMustExist]/nodeToBeUpdated
or, in a more generic form:
/root/folder[*[name() = 'nodeMustExist']]/*[name() = 'nodeToBeUpdated']
suitable for:
void UpdateNode(XmlDocument xml,
                string nodeMustExist,
                string nodeToBeUpdte,
                string newVal)
{
    string xPath = "/root/folder[*[name() = '{0}']]/*[name() = '{1}']";
    xPath = String.Format(xPath, nodeMustExist, nodeToBeUpdte);
    foreach (XmlNode n in xml.SelectNodes(xPath))
    {
        // Setting Value on an element node throws; InnerText replaces the element's text content.
        n.InnerText = newVal;
    }
}
Have a look at the SelectSingleNode method in the MSDN documentation.
Your XPath should be something like "//YourNodeNameHere".
Once you have found that node, you can go back up the tree to get to the 'nodeMustExist' node:
XmlNode nodeMustExistNode = yourNode.ParentNode["nodeMustExist"];
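Putting the pieces from the answers together, here is a rough sketch of the update-or-insert behaviour the question describes, touching only the first occurrence. UpdateOrInsert is a hypothetical helper name, the sample document is made up, and the node names are assumed to contain no quote characters because they are concatenated into the XPath string.

```csharp
using System;
using System.Xml;

// Hypothetical helper: updates the first matching node, or inserts a new one
// right after the first nodeMustExist. Returns false when neither node is found.
static bool UpdateOrInsert(XmlDocument xml, string nodeMustExist, string nodeToBeUpdated, string newVal)
{
    // SelectSingleNode returns the first match in document order.
    XmlNode target = xml.SelectSingleNode("//*[name() = '" + nodeToBeUpdated + "']");
    if (target != null)
    {
        target.InnerText = newVal;
        return true;
    }

    XmlNode anchor = xml.SelectSingleNode("//*[name() = '" + nodeMustExist + "']");
    if (anchor == null || anchor.ParentNode == null)
        return false;

    XmlElement created = xml.CreateElement(nodeToBeUpdated);
    created.InnerText = newVal;
    // InsertAfter places the new element as the next sibling of nodeMustExist.
    anchor.ParentNode.InsertAfter(created, anchor);
    return true;
}

var doc = new XmlDocument();
doc.LoadXml("<root><folder1><nodeMustExist>x</nodeMustExist></folder1></root>");

UpdateOrInsert(doc, "nodeMustExist", "nodeToBeUpdated", "first");  // node is missing, so it is inserted
UpdateOrInsert(doc, "nodeMustExist", "nodeToBeUpdated", "second"); // node now exists, so it is updated
Console.WriteLine(doc.OuterXml);
```

Searching with // keeps the sketch independent of the folder names, at the cost of scanning the whole document; a more specific path would be the better choice when the structure is known.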
