I want to parse the HTML of a website in my C# program.
First, I use the SGMLReader DLL to convert the HTML to XML. I use the following method for this:
XmlDocument FromHtml(TextReader reader)
{
// setup SGMLReader
Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
sgmlReader.DocType = "HTML";
sgmlReader.WhitespaceHandling = WhitespaceHandling.None;
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
sgmlReader.InputStream = reader;
// create document
XmlDocument doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.XmlResolver = null;
doc.Load(sgmlReader);
return doc;
}
Next, I read a website and try to look for the header node:
var client = new WebClient();
var xmlDoc = FromHtml(new StringReader(client.DownloadString(@"http://www.switchonthecode.com")));
var result = xmlDoc.DocumentElement.SelectNodes("head");
However, this query gives an empty result (count == 0). But when I inspect the results view of xmlDoc.DocumentElement, I see the following:
Any ideas why there are no results? Note that when I try another site, like http://www.google.com, it works.
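You need to select using the namespace explicitly (see this question). The page presumably declares the XHTML default namespace, and an unprefixed name in an XPath expression only matches elements that are in no namespace, so "head" finds nothing: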
XmlNamespaceManager manager = new XmlNamespaceManager(doc.NameTable);
manager.AddNamespace("ns", "http://www.w3.org/1999/xhtml");
doc.DocumentElement.SelectNodes("ns:head", manager);
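To see the effect in isolation, here is a minimal sketch that runs both queries against an inline XHTML string instead of a downloaded page. The class name NamespaceDemo, the method CountHeads, and the prefix "x" are made up for illustration; only the XmlDocument and XmlNamespaceManager calls come from the framework.

```csharp
using System.Xml;

static class NamespaceDemo
{
    // Hypothetical helper: counts <head> children of the root element,
    // either with a plain XPath or with a prefix bound to the XHTML namespace.
    public static int CountHeads(string xml, bool usePrefix)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        if (!usePrefix)
        {
            // Unprefixed name: only matches elements in no namespace.
            return doc.DocumentElement.SelectNodes("head").Count;
        }

        // The prefix "x" is arbitrary; it only has to match the one used in the XPath.
        var manager = new XmlNamespaceManager(doc.NameTable);
        manager.AddNamespace("x", "http://www.w3.org/1999/xhtml");
        return doc.DocumentElement.SelectNodes("x:head", manager).Count;
    }
}
```

For a document whose root is <html xmlns="http://www.w3.org/1999/xhtml"> with one head child, CountHeads(xml, false) returns 0 and CountHeads(xml, true) returns 1. The same document without the xmlns attribute gives the opposite result, which would explain why some sites work with the plain query.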
You can use HTML Agility Pack instead. It's an open source HTML parser.
I am building a Windows 8 app, and I need to extract the whole XML node and its children as string from a large xml document, and the method that does that so far looks like this:
public string GetNodeContent(string path)
{
XmlReaderSettings settings = new XmlReaderSettings();
settings.IgnoreWhitespace = true;
settings.ConformanceLevel = ConformanceLevel.Auto;
settings.IgnoreComments = true;
using (XmlReader reader = XmlReader.Create("something.xml", settings))
{
reader.MoveToContent();
reader.Read();
XmlDocument doc = new XmlDocument();
doc.LoadXml(reader.ReadOuterXml());
IXmlNode node = doc.SelectSingleNode(path);
return node.InnerText;
}
}
Whatever XPath I pass, node ends up null. I'm using the reader to get the first child of the root node, and then using XmlDocument to create a document from that XML. Since it's Windows 8, apparently I can't use the XPathSelectElements method, and this is the only way I can think of. Is there a way to do it using this, or any other logic?
Thank you in advance for your answers.
[UPDATE]
Let's say XML has this general form:
<nodeone attributes...>
<nodetwo attributes...>
<nodethree attributes... />
<nodethree attributes... />
<nodethree attributes... />
</nodetwo>
</nodeone >
I expect to get nodetwo and all of its children as an XML string when I pass "/nodeone/nodetwo" or "//nodetwo".
I've come up with this solution; the whole approach was wrong to start with. The problematic part was the fact that this code
reader.MoveToContent();
reader.Read();
ignores the namespace by itself, because it skips the root tag. This is the new, working code:
public static async Task<string> ReadFileTest(string xpath)
{
StorageFolder folder = await Package.Current.InstalledLocation.GetFolderAsync("NameOfFolderWithXML");
StorageFile xmlFile = await folder.GetFileAsync("filename.xml");
XmlDocument xmldoc = await XmlDocument.LoadFromFileAsync(xmlFile);
var nodes = xmldoc.SelectNodes(xpath);
XmlElement element = (XmlElement)nodes[0];
return element.GetXml();
}
Disclaimer: This issue is happening within a Unity application, but AFAIK, this is more of a C# issue than a Unity issue...
I am trying to use System.Xml.XmlDocument to parse an Amazon S3 bucket listing. Here is my bucket xml. I am using an example that I found in a C# Xml tutorial.
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load("http://rss.cnn.com/rss/edition_world.rss");
XmlNode titleNode = xmlDoc.SelectSingleNode("//rss/channel/title");
if(titleNode != null)
Debug.Log(titleNode.InnerText);
This works fine for that particular XML file, but when I put my stuff in there:
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load("https://s3.amazonaws.com/themall/");
Debug.Log ( xmlDoc.InnerXml );
XmlNode nameNode = xmlDoc.SelectSingleNode("//Name");
if(nameNode != null)
Debug.Log(nameNode.InnerText);
I get the raw XML in the console, so I know it is being downloaded successfully, but even the simplest XPath produces no results!
My only theory is that perhaps it has something to do with the default namespace in my XML? Do I need to tell XmlDocument about that somehow? Here is my root tag:
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
I have tried creating an XmlNamespaceManager and using it with all of my calls to "SelectSingleNode", but that doesn't seem to work either.
XPathNavigator nav = xmlDoc.CreateNavigator();
XmlNamespaceManager ns = new XmlNamespaceManager(nav.NameTable);
ns.AddNamespace(System.String.Empty, "http://s3.amazonaws.com/doc/2006-03-01/");
What am I doing wrong?
Thanks!
When you add the namespace to the namespace manager you need to give it a non-empty prefix. According to MSDN:
If the XmlNamespaceManager will be used for resolving namespaces in an XML Path Language (XPath) expression, a prefix must be specified.
The prefix must then be used in your XPath select statement. Here is the code I used and the output was "themall" as expected:
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load("http://s3.amazonaws.com/themall/");
XmlNamespaceManager namespaceManager =
new XmlNamespaceManager(xmlDoc.NameTable);
namespaceManager.AddNamespace("ns",
"http://s3.amazonaws.com/doc/2006-03-01/");
XmlNode titleNode =
xmlDoc.SelectSingleNode("//ns:Name", namespaceManager);
if (titleNode != null)
Console.WriteLine(titleNode.InnerText);
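If registering a prefix is inconvenient, a different technique is to match on the local name only, using the standard XPath 1.0 function local-name(). This is a sketch of that alternative, not the approach used above; the class LocalNameDemo and the method FirstName are hypothetical names.

```csharp
using System.Xml;

static class LocalNameDemo
{
    // Hypothetical helper: returns the text of the first element whose local
    // name is "Name", regardless of which namespace it is in, or null if none.
    public static string FirstName(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        // local-name() ignores the namespace, so no XmlNamespaceManager is needed.
        XmlNode node = doc.SelectSingleNode("//*[local-name()='Name']");
        return node == null ? null : node.InnerText;
    }
}
```

The trade-off is precision: this expression also matches a Name element from any other namespace, so the prefixed query is the safer choice when a document mixes vocabularies.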
In .NET, what is the best way to scrape HTML web pages?
Is there something open source that runs on .NET Framework 2 and puts all the HTML into objects? I have read about "HTML Agility Pack", but is there anything else?
I think HtmlAgilityPack is the best, but you can also use:
Fizzler : css selector engine for C#
SgmlReader : Convert html to valid xml
SharpQuery : Alternative of fizzler
You might use Tidy.net, which is a C# wrapper for the Tidy library that converts HTML to XHTML, available here: http://sourceforge.net/projects/tidynet/ so you could get valid XML and process it as such.
I'd make it this way:
// don't forget to import TidyNet and System.Xml.Linq
var t = new Tidy();
TidyMessageCollection messages = new TidyMessageCollection();
t.Options.Xhtml = true;
//extra options if you plan to edit the result by hand
t.Options.IndentContent = true;
t.Options.SmartIndent = true;
t.Options.DropEmptyParas = true;
t.Options.DropFontTags = true;
t.Options.BreakBeforeBR = true;
string sInput = "your html code goes here";
var bytes = System.Text.Encoding.UTF8.GetBytes(sInput);
StringBuilder sbOutput = new StringBuilder();
var msIn = new MemoryStream(bytes);
var msOut = new MemoryStream();
t.Parse(msIn, msOut, messages);
var bytesOut = msOut.ToArray();
string sOut = System.Text.Encoding.UTF8.GetString(bytesOut);
XDocument doc = XDocument.Parse(sOut);
//process XML as you like
Otherwise, HTML Agility pack is ok.
I am trying to read OSIS formatted documents. I have cut the document down to a simple fragment:
<?xml version="1.0" encoding="utf-8"?>
<osis xmlns="http://www.bibletechnologies.net/2003/OSIS/namespace">
<osisText osisRefWork="Bible" osisIDWork="kjv" xml:lang="en">
</osisText>
</osis>
I try to read it with this sample code from the MSDN documentation:
XPathDocument document = new XPathDocument("osis.xml");
XPathNavigator navigator = document.CreateNavigator();
XPathNodeIterator nodes = navigator.Select("/osis/osisText");
while (nodes.MoveNext())
{
Console.WriteLine(nodes.Current.Name);
}
The problem is that the selection contains no nodes and throws no exception. Since the code discards the root tag, I can't read the document. If I remove the xmlns="http://www.bibletechnologies.net/2003/OSIS/namespace" from the root osis tag, it works just fine. The offending URL returns a 404 code, but otherwise I see nothing wrong with this XML. Can someone explain why this code won't read the document? What options do I have besides hand-editing every document before trying to load it?
Your XPath expression is missing a namespace prefix.
The element that you're trying to select has a namespace URI of http://www.bibletechnologies.net/2003/OSIS/namespace, and XPath will not match these nodes using paths with an empty namespace URI.
I tested this revision in .NET 2.0 and it found the node as expected.
XPathDocument document = new XPathDocument("osis.xml");
XPathNavigator navigator = document.CreateNavigator();
XmlNamespaceManager xmlns = new XmlNamespaceManager(navigator.NameTable);
xmlns.AddNamespace("osis", "http://www.bibletechnologies.net/2003/OSIS/namespace");
XPathNodeIterator nodes = navigator.Select("/osis:osis/osis:osisText", xmlns);
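For comparison, the same lookup can be sketched with LINQ to XML, where the namespace is combined with the element name directly and no namespace manager is involved. This assumes .NET 3.5 or later (System.Xml.Linq is not available in .NET 2.0); the class OsisLinqDemo and the method FindOsisTextName are hypothetical names.

```csharp
using System.Xml.Linq;

static class OsisLinqDemo
{
    // Hypothetical helper: returns the local name of the osisText child of the
    // root element, or null if the root has no such child.
    public static string FindOsisTextName(string xml)
    {
        // The namespace URI is only an identifier; it is never downloaded,
        // so it does not matter that the URL returns a 404.
        XNamespace osis = "http://www.bibletechnologies.net/2003/OSIS/namespace";

        XElement root = XDocument.Parse(xml).Root;
        XElement osisText = root.Element(osis + "osisText");
        return osisText == null ? null : osisText.Name.LocalName;
    }
}
```

Calling root.Element("osisText") without the namespace would return null for the same reason the unprefixed XPath selects nothing.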
You can read the file to a string, replace the namespace in memory, and then load it using a string stream:
string s;
using(var reader = File.OpenText("osis.xml"))
{
s = reader.ReadToEnd();
}
s = s.Replace("xmlns=\"http://www.bibletechnologies.net/2003/OSIS/namespace\"", "");
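// The file declares encoding="utf-8", so encode the bytes as UTF-8 rather than ASCII.
Stream stream = new MemoryStream(Encoding.UTF8.GetBytes(s));
// Pass the stream itself; the string "stream" would be treated as a file name.
XPathDocument document = new XPathDocument(stream);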
// Rest of the code
I have a string input, and I do not know whether or not it is valid XML.
I think the simplest approach is to wrap
new XmlDocument().LoadXml(strINPUT);
in a try/catch.
The problem I'm facing is that sometimes strINPUT is an HTML file. If the header of this file contains
<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitional//EN"" ""http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"">
<html xml:lang=""en-GB"" xmlns=""http://www.w3.org/1999/xhtml"" lang=""en-GB"">
...like many do, it actually tries to make a connection to the w3.org URL, which I really don't want it doing.
Does anyone know if it's possible to just parse the string without trying to be clever and checking external URLs? Failing that, is there an alternative to XmlDocument?
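Try the following:
XmlDocument doc = new XmlDocument();
var settings = new XmlReaderSettings
{
    // Skip the DOCTYPE instead of resolving it (.NET 4.0 and later).
    // The older ProhibitDtd = true makes the reader throw on any DOCTYPE.
    DtdProcessing = DtdProcessing.Ignore,
    XmlResolver = null,
    ValidationType = ValidationType.None
};
using (var reader = XmlReader.Create(new StringReader(strINPUT), settings))
{
    doc.Load(reader);
}
The code creates a reader that turns off DTD processing and validation, so no external URL is contacted. Checking for well-formedness will still apply.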
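Wrapped up as the try/catch check the question asks for, this could look like the sketch below. It assumes .NET 4.0 or later, where XmlReaderSettings.DtdProcessing is available; the class XmlCheck and the method IsWellFormedXml are hypothetical names. It only reads through the input, so no XmlDocument is built.

```csharp
using System.IO;
using System.Xml;

static class XmlCheck
{
    // Hypothetical helper: returns true if the input parses as well-formed XML.
    // The input must not be null (StringReader would throw ArgumentNullException).
    public static bool IsWellFormedXml(string input)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Ignore, // skip any DOCTYPE, never fetch it
            XmlResolver = null                    // no external resources at all
        };

        try
        {
            using (var reader = XmlReader.Create(new StringReader(input), settings))
            {
                // Reading to the end forces the parser to check the whole input.
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

Because the DOCTYPE is ignored, entities that only the DTD defines (such as &nbsp; in XHTML) are reported as errors, so some XHTML pages will be classified as not well-formed by this check.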
Alternatively you can use XDocument.Parse if you can switch to using XDocument instead of XmlDocument.
I am not sure about the reason behind the problem, but have you tried the XDocument and XElement classes in System.Xml.Linq?
XDocument document = XDocument.Load(strINPUT , LoadOptions.None);
XElement element = XElement.Load(strINPUT );
EDIT: for xml as string try following
XDocument document = XDocument.Parse(strINPUT , LoadOptions.None );
Use XmlDocument's Load method to load the XML document, use XmlNodeList to get at the elements, then retrieve the data.
Try the following:
XmlDocument xmlDoc = new XmlDocument();
// Use the Load method to load the XML document from the specified file path.
xmlDoc.Load("myXMLDoc.xml");
// Use GetElementsByTagName() to get the elements that match the specified name.
XmlNodeList item = xmlDoc.GetElementsByTagName("item");
XmlNodeList url = xmlDoc.GetElementsByTagName("url");
Console.WriteLine("The item is: " + item[0].InnerText);
Add a try/catch block around the above code, see what you catch, and modify your code to address that situation.