Selecting attribute value using XPath and HtmlAgilityPack

I am trying to get the value of the second attribute of a meta tag using an XPath expression in Html Agility Pack.
The meta tag:
<meta name="pubdate" content="2012-08-30" />
The XPath expression I am using:
//meta[@name='pubdate']/@content
But it does not return anything. I searched around and tried this solution:
//meta[@name='pubdate']/string(@content)
Another way:
string(//meta[@name='pubdate']/@content)
But that throws an XML exception in Html Agility Pack.
Another solution did not work either:
//meta[@name='pubdate']/data(@content)
For my own reasons I want to use only XPath (and not Html Agility Pack functions) to get the attribute value. The function I use is below:
date = TextfromOneNode(document.DocumentNode.SelectSingleNode(".//body"), "meta[@name='pubdate']/@content");
public static string TextfromOneNode(HtmlNode node, string xmlPath)
{
    string toReturn = "";
    if (node.SelectSingleNode(xmlPath) != null)
    {
        toReturn = node.SelectSingleNode(xmlPath).InnerText;
    }
    return toReturn;
}
So far it looks like there is no way to use an XPath expression to get an attribute value directly.
Any ideas?

There is a way, using HtmlNodeNavigator:
public static string TextfromOneNode(HtmlNode node, string xmlPath)
{
    string toReturn = "";
    var navigator = (HtmlAgilityPack.HtmlNodeNavigator)node.CreateNavigator();
    var result = navigator.SelectSingleNode(xmlPath);
    if (result != null)
    {
        toReturn = result.Value;
    }
    return toReturn;
}
The following console app example demonstrates how HtmlNodeNavigator.SelectSingleNode() works both with an XPath that returns an element and with an XPath that returns an attribute:
var raw = @"<div>
<meta name='pubdate' content='2012-08-30' />
<span>foo</span>
</div>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(raw);
var navigator = (HtmlAgilityPack.HtmlNodeNavigator)doc.CreateNavigator();
var xpath1 = "//meta[@name='pubdate']/@content";
var xpath2 = "//span";
var result = navigator.SelectSingleNode(xpath1);
Console.WriteLine(result.Value);
result = navigator.SelectSingleNode(xpath2);
Console.WriteLine(result.Value);
dotnetfiddle demo
Output:
2012-08-30
foo

Using LINQ to XML (System.Xml.Linq):
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            string input = "<meta name=\"pubdate\" content=\"2012-08-30\" />";
            XElement meta = XElement.Parse(input);
            // XAttribute has an explicit conversion to DateTime
            DateTime output = (DateTime)meta.Attribute("content");
        }
    }
}

Related

Extract certain parts from XML string

I have a Windows Forms application in C# that makes an HttpClient GET request. This is the response, in XML format:
<Result><Success>true</Success><Token>MYTOKENHERE</Token><TokenExpirationDate null="1" /><UserName>********</UserName><PersonCode>442078</PersonCode><LoginStatusMessage>LoginOk</LoginStatusMessage></Result>
I want to set the text of a text box to whatever is between the <Token></Token> tags.
What is the best approach to do this?
Thanks
This is my current Form1.cs code:
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;
using System.Net.Http;
using System.Xml.Serialization;
namespace EBS_Token_Form
{
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}
private void button1_Click(object sender, EventArgs e)
{
HttpClient Client = new HttpClient();
Client.BaseAddress = new Uri("PRIVATE_URL");
string Username = username.Text;
string Password = password.Text;
string CredentialsString = $"{Username}:{Password}";
byte[] CredentialsStringByes = System.Text.Encoding.UTF8.GetBytes(CredentialsString);
Client.DefaultRequestHeaders.Authorization = new System.Net.Http.Headers.AuthenticationHeaderValue("basic", Convert.ToBase64String(CredentialsStringByes));
try
{
var Response = Client.GetAsync("Rest/Authentication").Result;
if (!Response.IsSuccessStatusCode)
{
// Something went wrong, is error.
// Put a breakpoint on the line below and we can figure out why.
string x = "";
}
string ServerResponse = Response.Content.ReadAsStringAsync().Result;
}
catch (Exception ex)
{
return;
}
}
}
}
If I set the value of the text box to ServerResponse, it contains the full XML document. I need to extract just the <Token> value.

Usually I'd use XmlSerializer for something like this, but if you really just need this one value, you may want to try XElement:
var root = XElement.Parse(yourResponseString);
var value = root.Element("Token")?.Value;
XElement is great for traversing, reading and manipulating XML.
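As a rough, self-contained sketch of the XElement approach above: the class name TokenReader and the method name ExtractToken are made up for illustration, and the sample string is the response quoted in the question. The method returns null when there is no Token element, so the caller can decide what to show in the text box.

```csharp
using System;
using System.Xml.Linq;

public static class TokenReader
{
    // Hypothetical helper: returns the text of the <Token> child of the root,
    // or null when the element is missing.
    public static string ExtractToken(string responseXml)
    {
        XElement root = XElement.Parse(responseXml);
        return root.Element("Token")?.Value;
    }

    public static void Main()
    {
        string response = "<Result><Success>true</Success><Token>MYTOKENHERE</Token>" +
                          "<TokenExpirationDate null=\"1\" /></Result>";
        Console.WriteLine(ExtractToken(response)); // prints MYTOKENHERE
    }
}
```

In the form, the result could then be assigned to the text box's Text property, with a fallback such as an empty string when the method returns null.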

XPath works for this, although some might consider it overkill. The XPath expression /Result/Token/text() does exactly what it looks like it will.
using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;
string Incoming_XML = @"<Result><Success>true</Success><Token>MYTOKENHERE</Token><TokenExpirationDate null=""1"" /><UserName>********</UserName><PersonCode>442078</PersonCode><LoginStatusMessage>LoginOk</LoginStatusMessage></Result>";
XPathDocument xPathDoc = null;
using (StringReader sr = new StringReader(Incoming_XML))
{
    xPathDoc = new XPathDocument(sr);
    XPathNavigator xPathNav = xPathDoc.CreateNavigator();
    // Select the text node inside <Token> and read its value
    string Result = xPathNav.SelectSingleNode("/Result/Token/text()").Value;
    Console.WriteLine($"Token is: {Result}");
}

The sample code below may not be the best approach, but it is simple to understand and does the job.
using System;
using System.Text.RegularExpressions;
// Define a regular expression pattern
string Regex_syntax = @"<Token>[\s\S]*?</Token>";
Regex rx = new Regex(Regex_syntax, RegexOptions.Compiled | RegexOptions.IgnoreCase);
// Define a raw data string to process; in your case this will be the HTTP response
string inputtext = @"<Result><Success>true</Success><Token>MYTOKENHERE</Token><TokenExpirationDate";
// Find matches.
MatchCollection matches = rx.Matches(inputtext);
// Strip the surrounding tags from the first match
string cleanseString = matches[0].ToString().Replace(@"<Token>", "");
cleanseString = cleanseString.Replace(@"</Token>", "");
// This will print the value between the Token tags
Console.WriteLine("require output = " + cleanseString);

Can I use WebUtility.HtmlDecode to decode XML?

I have an XML-encoded attribute value. This is actually from a processing instruction. So the original data looks something like this:
<root><?pi key="value" data="&lt;foo attr=&quot;bar&quot;&gt;Hello world&lt;/foo&gt;" ?></root>
I can parse it like this:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Xml.Linq;
public class Program
{
    private const string RawData = @"<root><?pi key=""value"" data=""&lt;foo attr=&quot;bar&quot;&gt;Hello world&lt;/foo&gt;"" ?></root>";
    public static void Main()
    {
        XDocument doc = GetXDocumentFromProcessingInstruction();
        IEnumerable<XElement> fooElements = doc.Descendants("foo");
        // ...
    }
    private static XProcessingInstruction LoadProcessingInstruction()
    {
        XDocument doc = XDocument.Parse(RawData);
        return doc
            .DescendantNodes()
            .OfType<XProcessingInstruction>()
            .First();
    }
    private static XDocument GetXDocumentFromProcessingInstruction()
    {
        XProcessingInstruction processingInstruction = LoadProcessingInstruction();
        // QUESTION:
        // Can there ever be a situation where HtmlDecode wouldn't decode the XML correctly?
        string decodedXml = WebUtility.HtmlDecode(processingInstruction.Data);
        // This works well, but it contains the attributes of the processing
        // instruction as text.
        string dummyXml = $"<dummy>{decodedXml}</dummy>";
        return XDocument.Parse(dummyXml);
    }
}
This works absolutely fine, as far as I can tell.
But I am wondering if there might be some edge cases where it may fail, because of differences in how data would be encoded in XML vs. HTML.
Anybody have some more insight?
Edit:
Sorry, I made some incorrect assumptions about XProcessingInstruction.Data, but the code above was still working fine, so my question stands.
I have nonetheless rewritten my code and wrapped everything in an XElement, which (of course) removed the issue altogether:
private static XDocument GetXDocumentFromProcessingInstruction2()
{
    XProcessingInstruction processingInstruction = LoadProcessingInstruction();
    string encodedXml = string.Format("<dummy {0} />", processingInstruction.Data);
    XElement element = XElement.Parse(encodedXml);
    string parsedXml = element.Attribute("data").Value;
    return XDocument.Parse(parsedXml);
}
So this does exactly what I need. But since WebUtility.HtmlDecode worked sufficiently well, I would still like to know if there could have been a situation where the first approach could have failed.

Removing the question marks and adding a forward slash at the end of your input, I got this:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            string input = "<pi data=\"&lt;foo attr=&quot;bar&quot;&gt;Hello world&lt;/foo&gt;\" />";
            XElement pi = XElement.Parse(input);
            // The XML parser decodes the attribute value for us
            string data = (string)pi.Attribute("data");
            XElement foo = XElement.Parse(data);
            string attr = (string)foo.Attribute("attr");
            string innertext = (string)foo;
        }
    }
}
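One way to probe the original question is to compare the two decoders on the same input. The sketch below is only illustrative: EntityDecodingDemo and DecodeAsXml are hypothetical names, and the helper assumes the encoded text contains no raw double quote. It suggests that the two agree on the entities XML predefines, while WebUtility.HtmlDecode also accepts HTML-only named entities such as &eacute; that an XML parser rejects as undeclared. In other words, HtmlDecode is more permissive than XML decoding rather than equivalent to it.

```csharp
using System;
using System.Net;
using System.Xml;
using System.Xml.Linq;

public static class EntityDecodingDemo
{
    // Hypothetical helper: decodes attribute text the XML way by letting the parser do it.
    // Assumes 'encoded' contains no raw double quote.
    public static string DecodeAsXml(string encoded)
    {
        XElement wrapper = XElement.Parse("<dummy data=\"" + encoded + "\" />");
        return (string)wrapper.Attribute("data");
    }

    public static void Main()
    {
        // Entities predefined by XML: both decoders give the same text.
        string predefined = "&lt;foo attr=&quot;bar&quot;&gt;Hello world&lt;/foo&gt;";
        Console.WriteLine(WebUtility.HtmlDecode(predefined) == DecodeAsXml(predefined));

        // HTML-only named entity: HtmlDecode accepts it, the XML parser does not.
        string htmlOnly = "caf&eacute;";
        Console.WriteLine(WebUtility.HtmlDecode(htmlOnly));
        try
        {
            DecodeAsXml(htmlOnly);
        }
        catch (XmlException ex)
        {
            Console.WriteLine("XML parser rejected it: " + ex.Message);
        }
    }
}
```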

How to delimit text inside HTML using html-agility-pack or Regex in C# project?

I'm using the Html Agility Pack in a Windows Forms application with C#, and I get good results when searching an HTML page.
However, the query returns the whole HTML of the page and I only need the contents of the post, since the rest is unnecessary links and text.
The content that matters after reading the HTML is between:
<span class="update-date">23/06/2019 16h17</span></span></p>
and <p class="col-lg-24">.
I tried to use a regex, but I did not succeed.
Am I using the wrong SelectNodes query for this case?
Here's an example (based on https://dotnetfiddle.net/ltDevV):
// @nuget: HtmlAgilityPack
using System;
using System.Xml;
using HtmlAgilityPack;
public class Program
{
    public static void Main()
    {
        var html =
            @"https://economia.uol.com.br/noticias/redacao/2019/06/23/aposentadoria-pensao-camara-deputados.htm";
        HtmlWeb web = new HtmlWeb();
        var htmlDoc = web.Load(html);
        var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//p");
        if (htmlNodes != null)
        {
            foreach (var node in htmlNodes)
            {
                Console.WriteLine(node.OuterHtml);
            }
        }
        else
        {
            Console.WriteLine("Oh OK.");
        }
    }
}
I hope to get as the final result only the content that is between the tags
<span class="update-date">23/06/2019 16:17</span></span></p> and <p class="col-lg-24">.

Xpath in C# : getting element out of a xml "list" (siblings)

My problem is simple. I'm using XPath to retrieve information from an XML file. My goal is to compare two XML files, which should be the same. I have no problem with "single" nodes. My problem comes when there are siblings.
It looks like this:
XPathDocument expectedDocument = new XPathDocument("C:\\expected.xml");
XPathDocument testedDocument = new XPathDocument("C:\\tested.xml");
XPathNavigator expectedNav = expectedDocument.CreateNavigator();
XPathNavigator testedNav = testedDocument.CreateNavigator();
XPathNodeIterator expectedIterator;
XPathNodeIterator testedIterator;
string expectedStr;
string testedStr;
string parameter;
parameter = "/DonneesDepot/Identification/@CoclicoFacturation";
expectedStr = expectedNav.SelectSingleNode(parameter).Value;
testedStr = testedNav.SelectSingleNode(parameter).Value;
CompareValues(expectedStr, testedStr, parameter);
That works perfectly. Now where it gets complicated is for this kind of XML :
<Surtaxe>
<Zone CodeZoneSurtaxe="1" NbPlisZone="0" PoidsZone="0" />
<Zone CodeZoneSurtaxe="2" NbPlisZone="2" PoidsZone="2" />
</Surtaxe>
I want to be able to make sure the content of "Surtaxe" is the same in both files (keep in mind the order is not important), so I tried this :
parameter = "/DonneesDepot/Facturation/Surtaxe/Zone/@PoidsZone";
expectedIterator = expectedNav.Select(parameter);
testedIterator = testedNav.Select(parameter);
while (expectedIterator.MoveNext() && testedIterator.MoveNext())
{
CompareValues(expectedIterator.Current.Value, testedIterator.Current.Value, parameter);
}
But even if the XMLs both contain the two rows, they are sometimes not in the same order, so my while loop doesn't work.
What would be the easiest way to compare the following two documents? (The expected result of this comparison would be "Equality".)
<Surtaxe>
<Zone CodeZoneSurtaxe="1" NbPlisZone="1" PoidsZone="1" />
<Zone CodeZoneSurtaxe="2" NbPlisZone="0" PoidsZone="0" />
</Surtaxe>
and
<Surtaxe>
<Zone CodeZoneSurtaxe="2" NbPlisZone="0" PoidsZone="0" />
<Zone CodeZoneSurtaxe="1" NbPlisZone="1" PoidsZone="1" />
</Surtaxe>
Thank you

This will do what you want. You may need to change it for your real objects, but it should get you started. I provided two solutions: a naive n^2 solution and a sorting method that should be faster.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace XmlCompare
{
using System.IO;
using System.Xml.Linq;
using System.Xml.XPath;
class Program
{
private static string xml1 = "<Surtaxe>" + "<Zone CodeZoneSurtaxe=\"2\" NbPlisZone=\"0\" PoidsZone=\"0\" />"
+ "<Zone CodeZoneSurtaxe=\"1\" NbPlisZone=\"1\" PoidsZone=\"1\" />" + "</Surtaxe>";
private static string xml2 = "<Surtaxe>" + "<Zone CodeZoneSurtaxe=\"1\" NbPlisZone=\"1\" PoidsZone=\"1\" />"
+ "<Zone CodeZoneSurtaxe=\"2\" NbPlisZone=\"0\" PoidsZone=\"0\" />" + "</Surtaxe>";
private static void Main(string[] args)
{
var expectedDoc = XDocument.Load(new StringReader(xml1));
var testedDoc = XDocument.Load(new StringReader(xml2));
var success = true;
//naive
foreach (var node in expectedDoc.Descendants("Surtaxe").First().Descendants())
{
if (testedDoc.Descendants(node.Name).FirstOrDefault(x => x.ToString()== node.ToString()) == null)
{
success = false;
break;
}
}
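//sort: order the serialized child elements, then compare the two sequences
//(sorting the characters of the raw strings would also report anagrams as equal)
var sortedExpected = expectedDoc.Descendants("Surtaxe").First().Descendants().Select(x => x.ToString()).ToList();
sortedExpected.Sort(StringComparer.Ordinal);
var testSorted = testedDoc.Descendants("Surtaxe").First().Descendants().Select(x => x.ToString()).ToList();
testSorted.Sort(StringComparer.Ordinal);
success = sortedExpected.SequenceEqual(testSorted);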
Console.WriteLine("Match? " + success);
Console.ReadKey();
}
}
}
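For comparison, here is a minimal sketch of an order-insensitive check that also ignores the order of attributes within each element. The names UnorderedXmlCompare, Key and SameChildren are hypothetical. The sketch assumes the children are flat elements like Zone, with no nested content, and that attribute values do not contain the separator characters used to build the key.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class UnorderedXmlCompare
{
    // Canonical key for one element: its name plus its attributes sorted by name.
    // Assumes values contain no ';' or '=' (illustrative simplification).
    private static string Key(XElement e)
    {
        var attrs = e.Attributes()
            .OrderBy(a => a.Name.ToString(), StringComparer.Ordinal)
            .Select(a => $"{a.Name}={a.Value}");
        return $"{e.Name}|{string.Join(";", attrs)}";
    }

    // True when both parents have the same child elements, in any order.
    // Sorting the keys keeps duplicates significant (multiset comparison).
    public static bool SameChildren(XElement expected, XElement tested)
    {
        var left = expected.Elements().Select(Key).OrderBy(k => k, StringComparer.Ordinal);
        var right = tested.Elements().Select(Key).OrderBy(k => k, StringComparer.Ordinal);
        return left.SequenceEqual(right);
    }

    public static void Main()
    {
        var a = XElement.Parse("<Surtaxe><Zone CodeZoneSurtaxe=\"1\" NbPlisZone=\"1\" PoidsZone=\"1\" />" +
                               "<Zone CodeZoneSurtaxe=\"2\" NbPlisZone=\"0\" PoidsZone=\"0\" /></Surtaxe>");
        var b = XElement.Parse("<Surtaxe><Zone CodeZoneSurtaxe=\"2\" NbPlisZone=\"0\" PoidsZone=\"0\" />" +
                               "<Zone CodeZoneSurtaxe=\"1\" NbPlisZone=\"1\" PoidsZone=\"1\" /></Surtaxe>");
        Console.WriteLine(SameChildren(a, b)); // prints True
    }
}
```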

accessing descendant elements from an xml returned null using linq to xml in c#

Please help me out. My application needs to consume a web service that returns XML. The code that downloads the XML works fine, but when I try to extract values from it I keep getting a null return value. Specifically, the GetLocationFromXml() method returns null, while the GetLocationAsXMLFromHost() method works fine.
This is the complete class:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using AMIS.Core.DTOs;
using System.Net;
using System.Xml.Linq;
using System.Xml;
public class GeoLocationService
{
private string _hostWebSite = "http://api.hostip.info";
private readonly XNamespace _hostNameSpace = "http://www.hostip.info/api";
private readonly XNamespace _hostGmlNameSpace = "http://www.opengis.net/gml";
public LocationInfo GetLocationInfoFromIPAddress(string userHostIpAddress)
{
IPAddress ipAddress = null;
IPAddress.TryParse(userHostIpAddress, out ipAddress);
string xmlData = GetLocationAsXMLFromHost(ipAddress.ToString());
LocationInfo locationInfo = GetLocationFromXml(xmlData);
return locationInfo;
}
private string GetLocationAsXMLFromHost(string userHostIpAddress)
{
WebClient webClient= new WebClient();
string formattedUrl = string.Format(_hostWebSite + "/?ip={0}", userHostIpAddress);
var xmlData = webClient.DownloadString(formattedUrl);
return xmlData;
}
private LocationInfo GetLocationFromXml(string xmlData)
{
LocationInfo locationInfo = new LocationInfo();
var xmlResponse = XDocument.Parse(xmlData);
var nameSpace = (XNamespace)_hostNameSpace;
var gmlNameSpace = (XNamespace)_hostGmlNameSpace;
try
{
locationInfo = (from x in xmlResponse.Descendants(nameSpace + "Hostip")
select new LocationInfo
{
CountryName = x.Element(nameSpace + "countryName").Value,
CountryAbbreviation = x.Element(nameSpace + "countryAbbrev").Value,
LocationInCountry = x.Element(gmlNameSpace + "name").Value
}).SingleOrDefault();
}
catch (Exception)
{
throw;
}
return locationInfo;
}
}
The XML file is below:
<?xml version="1.0" encoding="iso-8859-1"?>
<HostipLookupResultSet version="1.0.1" xmlns:gml="http://www.opengis.net/gml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://www.hostip.info/api/hostip-1.0.1.xsd">
<gml:description>This is the Hostip Lookup Service</gml:description>
<gml:name>hostip</gml:name>
<gml:boundedBy>
<gml:Null>inapplicable</gml:Null>
</gml:boundedBy>
<gml:featureMember>
<Hostip>
<ip>41.78.8.3</ip>
<gml:name>(Unknown city)</gml:name>
<countryName>NIGERIA</countryName>
<countryAbbrev>NG</countryAbbrev>
<!-- Co-ordinates are unavailable -->
</Hostip>
</gml:featureMember>
</HostipLookupResultSet>

Given the comments, I suspect the problem may be as simple as:
private string _hostNameSpace = "hostip.info/api";
should be:
private string _hostNameSpace = "http://hostip.info/api";
(Ditto for the others.) Personally I'd make them XNamespace values to start with:
private static readonly XNamespace HostNameSpace = "http://hostip.info/api";
EDIT: Okay, after messing around with your example (which could have been a lot shorter and a lot more complete) I've worked out what's wrong: you're looking for elements using the "host namespace" - but the elements in the XML aren't in any namespace. Just get rid of those namespace bits, and it works fine.
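A minimal sketch of what that looks like against a trimmed copy of the response: Hostip, countryName and countryAbbrev are looked up without a namespace, and only gml:name uses the GML namespace declared in the document. The class name HostipParser, the tuple return type and the Sample constant are illustrative choices, not part of the original code.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class HostipParser
{
    // Only the gml-prefixed elements live in a namespace.
    private static readonly XNamespace Gml = "http://www.opengis.net/gml";

    public const string Sample =
        "<HostipLookupResultSet version=\"1.0.1\" xmlns:gml=\"http://www.opengis.net/gml\">" +
        "<gml:featureMember><Hostip><ip>41.78.8.3</ip>" +
        "<gml:name>(Unknown city)</gml:name>" +
        "<countryName>NIGERIA</countryName>" +
        "<countryAbbrev>NG</countryAbbrev>" +
        "</Hostip></gml:featureMember></HostipLookupResultSet>";

    public static (string Country, string Abbrev, string City) Parse(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        // No namespace here: the Hostip element is unqualified in the response.
        XElement hostip = doc.Descendants("Hostip").Single();
        return ((string)hostip.Element("countryName"),
                (string)hostip.Element("countryAbbrev"),
                (string)hostip.Element(Gml + "name"));
    }

    public static void Main()
    {
        var info = Parse(Sample);
        Console.WriteLine(info.Country + " / " + info.Abbrev + " / " + info.City);
    }
}
```

Casting an XElement to string, rather than reading .Value, yields null instead of throwing when an element is missing.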
