Parsing HTML to XML and querying with XPath in C#

I want to parse an HTML page to extract some data.
First, I convert it to an XML document using SgmlReader.
Then I load the result into an XmlDocument and navigate it with XPath:
//contains html document
var loadedFile = LoadWebPage();
...
Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
sgmlReader.DocType = "HTML";
sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
sgmlReader.InputStream = new StringReader(loadedFile);
XmlDocument doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.XmlResolver = null;
doc.Load(sgmlReader);
This code works fine in most cases, except on this site: www.arrow.com (try searching for something like OP295GS). I can get the results table using the following XPath:
var node = doc.SelectSingleNode(".//*[@id='results-table']");
This gives me a node with several child nodes:
[0] {Element, Name="thead"}
[1] {Element, Name="tbody"}
[2] {Element, Name="tbody"}
FirstChild {Element, Name="thead"}
OK, let's try to get some child nodes using XPath. But this doesn't work:
var childNodes = node.SelectNodes("tbody");
// childNodes.Count == 0
Neither does this:
var childNode = node.SelectSingleNode("thead");
// childNode == null
Or even this:
var childNode = doc.SelectSingleNode(".//*[@id='results-table']/thead");
What could be wrong with my XPath queries?
I've just tried parsing that HTML page with Html Agility Pack, and my XPath queries work fine there. But my application uses XmlDocument internally, so Html Agility Pack doesn't suit me.
I even tried the following trick with Html Agility Pack, but the XPath queries don't work there either:
//let's parse and convert HTML document using HTML Agility Pack and then load
//the result to XmlDocument
HtmlDocument xmlDocument = new HtmlDocument();
xmlDocument.OptionOutputAsXml = true;
xmlDocument.Load(new StringReader(webPage));
XmlDocument document = new XmlDocument();
document.LoadXml(xmlDocument.DocumentNode.InnerHtml);
Perhaps the web page contains errors (not all tags are closed, and so on). Even so, I can see the child nodes through Quick Watch in Visual Studio, yet I cannot access them through XPath.
My XPath queries work correctly in Firefox with the FirePath and XPather plugins, but they don't work with .NET's XmlDocument :(

I have not used SgmlReader, but every time I have seen this problem it has been due to namespaces. A quick look at the HTML on www.arrow.com shows that this node declares a namespace (note the xmlns:javaurlencoder):
<form name="CatSearchForm" method="post" action="http://components.arrow.com/part/search/OP295GS" xmlns:javaurlencoder="java.net.URLEncoder">
The code below loops through all nodes in a document to see which ones have namespaces and which don't. If the node you are looking for, or any of its parents, has a namespace, you must create an XmlNamespaceManager and pass it along with your call to SelectNodes().
This is kind of annoying, so another idea might be to strip all the xmlns: attributes out of the XML before loading it into an XmlDocument. Then you won't need to fool with XmlNamespaceManager!
XmlDocument doc = new XmlDocument();
doc.Load(@"C:\temp\X.loadtest.xml");
Dictionary<string, string> namespaces = new Dictionary<string, string>();
XmlNodeList nlAllNodes = doc.SelectNodes("//*");
foreach (XmlNode n in nlAllNodes)
{
if (n.NodeType != XmlNodeType.Element) continue;
if (!String.IsNullOrEmpty(n.NamespaceURI) && !namespaces.ContainsKey(n.Name))
{
namespaces.Add(n.Name, n.NamespaceURI);
}
}
// Inspect the namespaces dictionary to write the code below
XmlNamespaceManager nMgr = new XmlNamespaceManager(doc.NameTable);
// Sometimes this works
nMgr.AddNamespace("ns1", doc.DocumentElement.NamespaceURI);
// You can make the first param whatever you want, it just must match in XPath queries
nMgr.AddNamespace("javaurlencoder", "java.net.URLEncoder");
XmlNodeList iter = doc.SelectNodes("//ns1:TestProfile", nMgr);
foreach (XmlNode n in iter)
{
// Do stuff
}
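When registering prefixes is impractical, one alternative is to match on local names so that the XPath ignores namespaces altogether. The following is only a sketch under stated assumptions: the `NamespaceAgnosticQuery` class and its inline sample are made up for illustration, with a default namespace standing in for whatever the converted arrow.com page actually contains.

```csharp
using System;
using System.Xml;

static class NamespaceAgnosticQuery
{
    // Hypothetical stand-in for the converted page: every element sits in a default namespace.
    public const string Sample =
        "<html xmlns='http://www.w3.org/1999/xhtml'><body>" +
        "<table id='results-table'><thead/><tbody/><tbody/></table>" +
        "</body></html>";

    public static int CountBodies(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);

        // The id attribute has no namespace, so this lookup works without a manager.
        XmlNode table = doc.SelectSingleNode("//*[@id='results-table']");

        // A bare "tbody" would match nothing here: unprefixed names in XPath 1.0
        // mean "no namespace". local-name() compares the name without its namespace.
        XmlNodeList bodies = table.SelectNodes("*[local-name()='tbody']");
        return bodies.Count;
    }

    static void Main()
    {
        Console.WriteLine(CountBodies(Sample));
    }
}
```

The trade-off is that local-name() predicates are more verbose and will also match same-named elements from unrelated namespaces, so an XmlNamespaceManager remains the more precise choice when the namespace URIs are known.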

To be honest, when I am trying to get information from a website I use regex.
Kore Nordmann (on his PHP blog) thinks this is not a good idea, but some of the comments say otherwise.
http://kore-nordmann.de/blog/0081_parse_html_extract_data_from_html.html
http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html
It is about PHP, so sorry for that =) Hope it helps anyway.

Related

Parse Complete Web Page

How can I parse a complete HTML web page, not just specific nodes, using HTML Agility Pack or any other technique?
I am using this code, but it only parses specific nodes. I need to parse the complete page and get neat, clean content:
List<string> list = new List<string>();
string url = "https://www.google.com";
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
list.Add(node.InnerText);
}
To get all descendant text nodes use something like
var textNodes = doc.DocumentNode.SelectNodes("//text()").
Select(t=>t.InnerText);
To get all non empty descendant text nodes
var textNodes = doc.DocumentNode.
SelectNodes("//text()[normalize-space()]").
Select(t=>t.InnerText);
Use SelectNodes("//*"). The asterisk (*) is the wildcard selector: "//*" matches every element on the page, whereas "*" on its own matches only the child elements of the current node.

Yahoo News API In C#

So I'm working on a speech recognition program in C# and while trying to implement the YAHOO News API into the program I am getting no response.
I won't copy/paste my whole code as it would be very long so here are the main bits.
private void GetNews()
{
string query = String.Format("http://news.yahoo.com/rss/");
XmlDocument wData = new XmlDocument();
wData.Load(query);
XmlNamespaceManager manager = new XmlNamespaceManager(wData.NameTable);
manager.AddNamespace("media", "http://search.yahoo.com/mrss/");
XmlNode channel = wData.SelectSingleNode("rss").SelectSingleNode("channel");
XmlNodeList nodes = wData.SelectNodes("rss/channel/item/description", manager);
FirstStory = channel.SelectSingleNode("item").SelectSingleNode("title", manager).Attributes["alt"].Value;
}
I believe I have done something wrong here:
XmlNode channel = wData.SelectSingleNode("rss").SelectSingleNode("channel");
XmlNodeList nodes = wData.SelectNodes("rss/channel/item/description", manager);
FirstStory = channel.SelectSingleNode("item").SelectSingleNode("title", manager).Attributes["alt"].Value;
Here is the full XML Document: http://news.yahoo.com/rss/
If any more info is required let me know.
I have implemented my own code to get news from Yahoo. I read each news title (located at rss/channel/item/title) and short story (located at rss/channel/item/description).
The short story is the tricky part: we need to get the whole inner text of the description node as a string and then parse it as XML. The text has the following format, with the short story right after </p>:
<p><a><img /></a></p>"Short Story"<br clear="all"/>
Since this text has several root nodes (p and br), we wrap it in an extra root element <me>:
string ShStory=null;
string Title = null;
//Creating a XML Document
XmlDocument doc = new XmlDocument();
//Loading rss on it
doc.Load("http://news.yahoo.com/rss/");
//Looping every item in the XML
foreach (XmlNode node in doc.SelectNodes("rss/channel/item"))
{
//Reading Title which is simple
Title = node.SelectSingleNode("title").InnerText;
//Putting all description text in string ndd
string ndd = node.SelectSingleNode("description").InnerText;
XmlDocument xm = new XmlDocument();
//Loading modified string as XML in xm with the root <me>
xm.LoadXml("<me>"+ndd+"</me>");
//Selecting node <p> which has the text
XmlNode nodds = xm.SelectSingleNode("/me/p");
//Putting inner text in the string ShStory
ShStory= nodds.InnerText;
//Showing the message box with the loaded data
MessageBox.Show(Title+ " "+ShStory);
}
Accept this as the right answer or vote it up if the code works for you. If there are any issues, you can ask me. Cheers
It's likely that you are passing that namespace manager to those attributes, but I'm not 100% certain. Those are definitely not in that .../mrss/ namespace, so I would guess that is your problem.
I would try it without passing the namespace (if possible) or using the GetElementsByTagName method to avoid namespace issues.
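As a rough illustration of querying without the namespace manager: the sketch below reads item titles from a small made-up feed fragment (not the real Yahoo feed). The `RssTitles` class and the sample XML are assumptions for illustration. It also assumes the headline is the text of the title element rather than an "alt" attribute, which is what the other answer's code relies on as well.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class RssTitles
{
    // Made-up feed fragment; the real feed carries more elements per item.
    public const string Sample =
        "<rss version='2.0' xmlns:media='http://search.yahoo.com/mrss/'><channel>" +
        "<item><title>First story</title><media:content url='a.jpg'/></item>" +
        "<item><title>Second story</title></item>" +
        "</channel></rss>";

    public static List<string> ReadTitles(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);

        List<string> titles = new List<string>();
        // rss, channel, item and title are in no namespace, so no manager is needed.
        foreach (XmlNode title in doc.SelectNodes("/rss/channel/item/title"))
        {
            // The headline is the element's text, not an attribute.
            titles.Add(title.InnerText);
        }
        return titles;
    }

    static void Main()
    {
        foreach (string title in ReadTitles(Sample))
        {
            Console.WriteLine(title);
        }
    }
}
```

A manager is only needed for the prefixed elements, for example a query such as "media:content" with the "media" prefix registered.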
The description tag contains text rather than XML.
Here is an example to display text news:
foreach (XmlElement node in nodes)
{
Console.WriteLine(Regex.Match(node.InnerXml,
"(?<=(/a>)).+(?=(</p))"));
Console.WriteLine();
}

Read value from HTML node

I'm new to XML/HTML-parsing. Don't even know the right words to do a proper search for duplicates.
I have this HTML file which looks like this:
<body id="s1" style="s1">
<div xml:lang="uk">
<p begin="00:00:00" end="00:00:29">
<span fontFamily="SchoolHouse Cursive B" fontSize="18">I'm great!</span>
</p>
Now I need 00:00:00, 00:00:29 and I'm great! from it. I could read it like this:
XmlTextReader reader = new XmlTextReader(file);
while (reader.Read())
{
if (reader.NodeType != XmlNodeType.Element)
continue;
if (reader.LocalName != "p")
continue;
var a = reader.GetAttribute(0);
var b = reader.GetAttribute(1);
if (reader.LocalName == "span")
{
XmlDocument doc = new XmlDocument();
doc.Load(reader);
XmlNode elem = doc.DocumentElement.FirstChild;
var c = elem.InnerText;
}
}
I get values in variables a, b and c. But there was a slight change in HTML format. Now the HTML looks like this:
<body id="s1" style="s1">
<div xml:lang="uk">
<p begin="00:00:00" end="00:00:29">I'm great! </p>
In this scenario how do I parse out 00:00:00, 00:00:29 and I'm great! ? I tried this:
XmlTextReader reader = new XmlTextReader(file);
while (reader.Read())
{
if (reader.NodeType != XmlNodeType.Element)
continue;
if (reader.LocalName != "p")
continue;
var a = reader.GetAttribute(0);
var b = reader.GetAttribute(1);
XmlDocument doc = new XmlDocument();
doc.Load(reader);
XmlNode elem = doc.DocumentElement.FirstChild;
var c = elem.InnerText;
}
But I get this error: This document already has a 'DocumentElement' node. at the line doc.Load(reader). How do I read this correctly, and what's causing the trouble? I am using .NET 2.0.
It looks like you have HTML that you want to parse with an XML parser. That may also be the reason why you get the This document already has a 'DocumentElement' node. exception: you have more than one root node, which is allowed (or better: tolerated) in HTML, but not in XML.
Use an HTML parser instead. Unfortunately, there is nothing built into the .NET Framework, so you have to use a third-party library. A very good one is the HTML Agility Pack, which oleksii already mentioned in his comment.
Edit:
From your comments, I get the feeling you're not familiar with the fact that there is no direct relation between HTML and XML. The graphic taken from here illustrates this quite well:
XML is not a subset of HTML, nor the other way around. Only if you have strict XHTML (rarely the case) do you have an HTML document that can be parsed with an XML parser. But be aware that if there is a mistake in the code of such an XHTML document, the parser will fail, while a common browser will continue to display the page. Also, the future of XHTML is quite unclear, now that HTML5 is coming to life slowly but steadily...
To sum up: To avoid all those pitfalls, take the easy road and go for an HTML parser.
Since you want to parse HTML, you could use WebClient (or WebBrowser) to load the page and then use the HTML DOM to navigate through it. You need to add a reference to Microsoft HTML Object Library (COM) for the following code example:
string html;
WebClient webClient = new WebClient();
using (Stream stream = webClient.OpenRead(new Uri("http://www.google.com")))
using (StreamReader reader = new StreamReader(stream))
{
html = reader.ReadToEnd();
}
IHTMLDocument2 doc = (IHTMLDocument2)new HTMLDocument();
doc.write(html);
foreach (IHTMLElement el in doc.all)
Console.WriteLine(el.tagName);
I have tried loading HTML into XML before, and it's all too hard: fixing up unclosed tags (like <BR>), putting quotes around attributes, giving attributes without values a value, and so on. Since I wanted to then use XSLT against it, I loaded it into the HTML DOM and navigated through it, creating the relevant XML node for each HTML node. Then I had a proper XML representation of the HTML.
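For the narrower case in the question, if the file does turn out to be well-formed XML, one possible approach is to load it into an XmlDocument and read InnerText, which gives the same result with or without the nested span. This is a sketch under assumptions: the question's fragments are cut off, so the samples below add the missing closing tags, and the `CaptionReader` class is a made-up name. It sticks to APIs available in .NET 2.0.

```csharp
using System;
using System.Xml;

static class CaptionReader
{
    // Assumed complete versions of the two fragments from the question (closing tags added).
    public const string WithSpan =
        "<body id='s1' style='s1'><div xml:lang='uk'>" +
        "<p begin='00:00:00' end='00:00:29'><span fontSize='18'>I'm great!</span></p>" +
        "</div></body>";

    public const string WithoutSpan =
        "<body id='s1' style='s1'><div xml:lang='uk'>" +
        "<p begin='00:00:00' end='00:00:29'>I'm great! </p>" +
        "</div></body>";

    // Returns { begin, end, text } for the first <p> element in the document.
    public static string[] ReadFirstCaption(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);

        XmlElement p = (XmlElement)doc.GetElementsByTagName("p")[0];

        // InnerText concatenates all descendant text, so a nested <span> makes no difference.
        return new string[]
        {
            p.GetAttribute("begin"),
            p.GetAttribute("end"),
            p.InnerText.Trim()
        };
    }

    static void Main()
    {
        Console.WriteLine(string.Join(" | ", ReadFirstCaption(WithSpan)));
        Console.WriteLine(string.Join(" | ", ReadFirstCaption(WithoutSpan)));
    }
}
```

Reading attributes by name rather than by index also keeps the code working if the attribute order changes.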

NodeList.SelectSingleNode() syntax

Having problems getting NodeList.SelectSingleNode() to work properly.
My XML looks like this:
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<inm:Results xmlns:inm="http://www.namespace.com/1.0">
<inm:Recordset setCount="18254">
<inm:Record setEntry="0">
<!-- snip -->
<inm:Image>fileName.jpg</inm:Image>
</inm:Record>
</inm:Recordset>
</inm:Results>
The data is a long series of <inm:Record> entries.
I open the doc and create a NodeList object based on "inm:Record". This works great.
XmlDocument xdoc = new XmlDocument();
xdoc.Load(openFileDialog1.FileName);
XmlNodeList xRecord = xdoc.GetElementsByTagName("inm:Record");
I start looping through the NodeList using a for loop. Before I process a given entry, I want to check and see if the <inm:Image> is set. I thought it would be super easy just to do
string strImage = xRecord[i].SelectSingleNode("inm:Image").InnerText;
My thinking being, "For the xRecord that I'm on, go find the <inm:Image> value." But this doesn't work: I get an exception saying that I need an XmlNamespaceManager. So I tried to set that up, but could never get the syntax right.
Can someone show me the correct XmlNamespaceManager syntax in this case?
I've worked around the issue for now by looping through all of the childNodes for a given xRecord, and checking the tag once I loop around to it. I would like to check that value first to see if I need to loop over that <inm:Record> entry at all.
No need to loop through all the Record elements, just use XPath to specify the subset that you want:
XmlDocument xdoc = new XmlDocument();
xdoc.Load(openFileDialog1.FileName);
XmlNamespaceManager manager = new XmlNamespaceManager(xdoc.NameTable);
manager.AddNamespace("inm", "http://www.namespace.com/1.0"); // must match the xmlns:inm URI declared in the document
XmlNodeList nodes = xdoc.SelectNodes("/inm:Results/inm:Recordset/inm:Record[inm:Image != '']", manager);
Using the LINQ to XML libraries, here's an example for retrieving that node's value:
XDocument doc = XDocument.Load(openFileDialog1.FileName);
// Results is the root element; Record sits below Recordset, so search the descendants
XElement results = doc.Root;
XElement firstRecord = results.Descendants().Where(
ele => ele.Name.LocalName == "Record").First();
XElement recordImage = firstRecord.Elements().Where(
ele => ele.Name.LocalName == "Image").First();
string imageName = recordImage.Value;
Also, by the way, using Hungarian notation for a type-checked language is overkill. You don't need to prepend string variables with str when it will always be a string.
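XmlNamespaceManager nsMgr = new XmlNamespaceManager(xdoc.NameTable);
// The prefix must be registered, otherwise the query throws
nsMgr.AddNamespace("inm", "http://www.namespace.com/1.0");
string strImage = xRecord[i].SelectSingleNode("inm:Image", nsMgr).InnerText;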
Should do it.
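Since the goal is to check whether the Image element is set at all, a null check on the result of SelectSingleNode may be worth adding. A minimal sketch follows, assuming an inline sample modelled on the XML from the question; the `RecordImages` class and the second, image-less record are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class RecordImages
{
    // Sample modelled on the question's XML; the second record has no Image element.
    public const string Sample =
        "<inm:Results xmlns:inm='http://www.namespace.com/1.0'><inm:Recordset setCount='2'>" +
        "<inm:Record setEntry='0'><inm:Image>fileName.jpg</inm:Image></inm:Record>" +
        "<inm:Record setEntry='1'/>" +
        "</inm:Recordset></inm:Results>";

    // Returns one entry per Record: the image name, or null when Image is missing.
    public static List<string> ReadImages(string xml)
    {
        XmlDocument xdoc = new XmlDocument();
        xdoc.LoadXml(xml);

        XmlNamespaceManager manager = new XmlNamespaceManager(xdoc.NameTable);
        manager.AddNamespace("inm", "http://www.namespace.com/1.0");

        List<string> images = new List<string>();
        foreach (XmlNode record in xdoc.SelectNodes("//inm:Record", manager))
        {
            // SelectSingleNode returns null when nothing matches, so check before reading InnerText.
            XmlNode image = record.SelectSingleNode("inm:Image", manager);
            images.Add(image == null ? null : image.InnerText);
        }
        return images;
    }

    static void Main()
    {
        foreach (string image in ReadImages(Sample))
        {
            Console.WriteLine(image ?? "(no image)");
        }
    }
}
```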
Using this Xml library, you can get all the records that have an Image child element with this:
XElement root = XElement.Load(openFileDialog1.FileName);
XElement[] records = root.XPath("//Record[Image]").ToArray();
If you want to be sure that the Image child contains a value, it can be expressed like this:
XElement[] records = root.XPath("//Record[Image != '']").ToArray();

Can I create an XmlNamespaceManager object from the XML file I am about to read?

I have some C# code running on SharePoint that I use to look inside the XML of an InfoPath document to decide whether I should check in the document or discard it.
The code works fine for a couple of different form templates I have created, but it fails on my latest one.
I have discovered that the XmlNamespaceManager I am creating contains the wrong definition for the "my" namespace.
I'll try to explain.
I have this code to declare my XmlNamespaceManager:
NameTable nt = new NameTable();
NamespaceManager = new XmlNamespaceManager(nt);
// Add prefix/namespace pairs to the XmlNamespaceManager.
NamespaceManager.AddNamespace("xsi", "http://www.w3.org/2001/XMLSchema-instance");
NamespaceManager.AddNamespace("xhtml", "http://www.w3.org/1999/xhtml");
NamespaceManager.AddNamespace("dfs", "http://schemas.microsoft.com/office/infopath/2003/dataFormSolution");
NamespaceManager.AddNamespace("my", "http://schemas.microsoft.com/office/infopath/2003/myXSD/2010-07-14T13:45:59");
NamespaceManager.AddNamespace("xd", "http://schemas.microsoft.com/office/infopath/2003");
I can then use the following line of code to search for the xml i am after
XPathNavigator nav = xml.CreateNavigator().SelectSingleNode("//my:HasSaved", NamespaceManager);
nav.Value then gives me the data I want.
This all works fine on a couple of my form templates. I have a new form template and have discovered that I need to use this line instead:
NamespaceManager.AddNamespace("my", "http://schemas.microsoft.com/office/infopath/2003/myXSD/2010-11-30T17:39:37");
The date is different.
My problem is that I cannot add this prefix twice, so only one set of forms will work.
So my question is: is there a way I can generate the NamespaceManager object from the XML file, since this is all contained in the header?
I have not been able to find a simple way around this.
I found a way of doing this. Instead of hard-coding the "my" namespace, it can be pulled from the XmlDocument object. It might just be a bit of luck that it works this way, but I'm happy with it.
NamespaceManager.AddNamespace("my", formXml.DocumentElement.NamespaceURI);
formXml is an XmlDocument created from the InfoPath XML.
One option would be to try to load the xml node using the first namespace, if that doesn't give any results, call PushScope(), override the first namespace definition, select, etc...
var doc = new XmlDocument();
doc.LoadXml(@"<?xml version=""1.0""?>
<root xmlns:my=""http://schemas.microsoft.com/office/infopath/2003/myXSD/2010-11-30T17:39:37"" value=""1"">
<my:item>test</my:item>
</root>");
var nameTable = new NameTable();
var namespaceManager = new XmlNamespaceManager(nameTable);
namespaceManager.AddNamespace("xsi", "http://www.w3.org/2001/XMLSchema-instance");
namespaceManager.AddNamespace("xhtml", "http://www.w3.org/1999/xhtml");
namespaceManager.AddNamespace("dfs", "http://schemas.microsoft.com/office/infopath/2003/dataFormSolution");
namespaceManager.AddNamespace("my", "http://schemas.microsoft.com/office/infopath/2003/myXSD/2010-07-14T13:45:59");
namespaceManager.AddNamespace("xd", "http://schemas.microsoft.com/office/infopath/2003");
// n will be null since the namespace url doesn't match
var n = doc.SelectSingleNode("descendant::my:item", namespaceManager);
namespaceManager.PushScope();
namespaceManager.AddNamespace("my", "http://schemas.microsoft.com/office/infopath/2003/myXSD/2010-11-30T17:39:37");
// will work
n = doc.SelectSingleNode("descendant::my:item", namespaceManager);
namespaceManager.PopScope();
Another option is to parse the attributes in the header and look for any namespaces they declare:
foreach(XmlAttribute attribute in doc.DocumentElement.Attributes)
{
var url = namespaceManager.LookupNamespace(attribute.LocalName);
if(url != null && url != attribute.Value)
{
namespaceManager.RemoveNamespace(attribute.LocalName, url);
namespaceManager.AddNamespace(attribute.LocalName, attribute.Value);
}
}
You don't need to do anything like this. Just use "my2" or something for the second namespace. You then need the new namespace prefix for any nodes that use the new namespace, and use the old "my" namespace prefix for all the nodes that still use the old namespace.
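To address the original question more directly, the manager can also be populated from the namespaces that are in scope on the document element, using XPathNavigator.GetNamespacesInScope. The following is a sketch, not a drop-in solution: the `NamespaceManagerFactory` class and the sample form XML (including the my:myFields root) are assumptions for illustration, and only prefixes declared on the root element are picked up.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;
using System.Xml.XPath;

static class NamespaceManagerFactory
{
    // Hypothetical form XML; a real InfoPath file declares more prefixes on its root.
    public const string Sample =
        "<my:myFields xmlns:my='http://schemas.microsoft.com/office/infopath/2003/myXSD/2010-11-30T17:39:37'>" +
        "<my:HasSaved>true</my:HasSaved>" +
        "</my:myFields>";

    public static XmlNamespaceManager FromDocument(XmlDocument doc)
    {
        XmlNamespaceManager manager = new XmlNamespaceManager(doc.NameTable);
        XPathNavigator root = doc.DocumentElement.CreateNavigator();

        // Copy every prefix/URI pair in scope on the root element into the manager.
        foreach (KeyValuePair<string, string> ns in root.GetNamespacesInScope(XmlNamespaceScope.ExcludeXml))
        {
            // Skip the default namespace: XPath 1.0 cannot address it without a prefix.
            if (ns.Key.Length > 0)
            {
                manager.AddNamespace(ns.Key, ns.Value);
            }
        }
        return manager;
    }

    public static string ReadHasSaved(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        XmlNode node = doc.SelectSingleNode("//my:HasSaved", FromDocument(doc));
        return node == null ? null : node.InnerText;
    }

    static void Main()
    {
        Console.WriteLine(ReadHasSaved(Sample));
    }
}
```

Because the URI for "my" is taken from each file, the same query should work for templates with different dates, as long as they all use the "my" prefix.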
