I'm new to XML/HTML-parsing. Don't even know the right words to do a proper search for duplicates.
I have this HTML file which looks like this:
<body id="s1" style="s1">
<div xml:lang="uk">
<p begin="00:00:00" end="00:00:29">
<span fontFamily="SchoolHouse Cursive B" fontSize="18">I'm great!</span>
</p>
Now I need 00:00:00, 00:00:29 and I'm great! from it. I could read it like this:
XmlTextReader reader = new XmlTextReader(file);
while (reader.Read())
{
if (reader.NodeType != XmlNodeType.Element)
continue;
if (reader.LocalName != "p")
continue;
var a = reader.GetAttribute(0);
var b = reader.GetAttribute(1);
if (reader.LocalName == "span")
{
XmlDocument doc = new XmlDocument();
doc.Load(reader);
XmlNode elem = doc.DocumentElement.FirstChild;
var c = elem.InnerText;
}
}
I get values in variables a, b and c. But there was a slight change in HTML format. Now the HTML looks like this:
<body id="s1" style="s1">
<div xml:lang="uk">
<p begin="00:00:00" end="00:00:29">I'm great! </p>
In this scenario how do I parse out 00:00:00, 00:00:29 and I'm great! ? I tried this:
XmlTextReader reader = new XmlTextReader(file);
while (reader.Read())
{
if (reader.NodeType != XmlNodeType.Element)
continue;
if (reader.LocalName != "p")
continue;
var a = reader.GetAttribute(0);
var b = reader.GetAttribute(1);
XmlDocument doc = new XmlDocument();
doc.Load(reader);
XmlNode elem = doc.DocumentElement.FirstChild;
var c = elem.InnerText;
}
But I get this error: This document already has a 'DocumentElement' node. at line doc.Load(reader). How to read correctly and what's causing the trouble? I am using .NET 2.0
It looks like you have HTML that you want to parse with an XML parser. That may also be the reason you get the "This document already has a 'DocumentElement' node." exception: you have more than one root node, which is allowed (or rather, tolerated) in HTML but not in XML.
Use an HTML parser instead. Unfortunately, nothing is built into the .NET Framework, so you need a third-party library. A very good one is the HTML Agility Pack, which oleksii already mentioned in his comment.
Edit:
From your comments, I get the feeling you're not familiar with the fact that there is no direct relation between HTML and XML. The graphic taken from here illustrates this quite well:
XML is not a subset of HTML, nor the other way around. Only strict XHTML (rarely the case) gives you an HTML document that can be parsed with an XML parser. But be aware that if there is a mistake in the markup of such an XHTML document, the parser will fail, while a common browser will continue to display the page. Also, the future of XHTML is quite unclear now that HTML5 is coming to life slowly but steadily.
To sum up: To avoid all those pitfalls, take the easy road and go for an HTML parser.
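As a rough illustration of the HTML-parser route, here is a minimal sketch that uses the HTML Agility Pack (a third-party package, so it has to be referenced separately). The class and method names (CaptionExtractor, ReadCaptions) are made up for this example, and it assumes the begin and end attributes always sit on the p element, as in the question.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class CaptionExtractor
{
    // Returns one "begin end text" line for every <p> that has a begin attribute.
    public static List<string> ReadCaptions(string html)
    {
        List<string> captions = new List<string>();

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes("//p[@begin]");
        if (paragraphs == null)
            return captions;

        foreach (HtmlNode p in paragraphs)
        {
            string begin = p.GetAttributeValue("begin", "");
            string end = p.GetAttributeValue("end", "");
            // InnerText concatenates all descendant text, so it covers both
            // <p><span>text</span></p> and <p>text</p>.
            string text = p.InnerText.Trim();
            captions.Add(begin + " " + end + " " + text);
        }
        return captions;
    }
}
```

Because InnerText does not care whether the text is wrapped in a span, the same call should handle both versions of the markup from the question.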
Since you want to parse HTML, you could use WebClient (or WebBrowser) to load the page and then use the HTML DOM to navigate through it. You need to add a reference to the Microsoft HTML Object Library (COM) for the following code example:
string html;
WebClient webClient = new WebClient();
using (Stream stream = webClient.OpenRead(new Uri("http://www.google.com")))
using (StreamReader reader = new StreamReader(stream))
{
html = reader.ReadToEnd();
}
IHTMLDocument2 doc = (IHTMLDocument2)new HTMLDocument();
doc.write(html);
foreach (IHTMLElement el in doc.all)
Console.WriteLine(el.tagName);
I have tried loading HTML into XML before, and it's all too hard: fixing up unclosed tags (like <BR>), putting quotes around attributes, giving attributes without values a value, and so on. Since I wanted to run XSLT against the result, I loaded the page into the HTML DOM and navigated through it, creating the corresponding XML node for each HTML node. That gave me a proper XML representation of the HTML.
What I need to do : Extract (Information of From, To, Cc and Subject ) and remove them from HTML file. Without the use of any 3rd party ( HTMLAgilityPack, etc)
What I am having trouble with: What approach should I take to get From, To, Subject and Cc out of the HTML tags?
Steps I tried: I tried to get the index of <p class=MsoNormal> and the last index of the email address (#sampleemail.com), but I think that is a bad approach, since some HTML files contain many <p class=MsoNormal> tags. To remove From, To, Cc and Subject, I just used string.Remove(indexOf, count), where count is the number of characters from indexOf to lastIndexOf, and that worked.
Sample tag containing information of from:
<p class=MsoNormal style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234#sampleemail.com<o:p></o:p></span></p>
HTML FILE output:
HtmlAgilityPack is your friend. Simply use an XPath expression such as //p[@class='MsoNormal'] to get the content of the matching tags in the HTML:
public static void Main()
{
var html =
@"<p class=MsoNormal style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234#sampleemail.com<o:p></o:p></span></p> ";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var nodes = htmlDoc.DocumentNode.SelectNodes("//p[@class='MsoNormal']");
foreach(var node in nodes)
Console.WriteLine(node.InnerText);
}
Result:
From:1234#sampleemail.com
Update
We can also use Regex to write this simple parser. But remember that a regular expression cannot handle every case in a complicated HTML document.
public static void MainFunc()
{
string str = @"<p class='MsoNormal' style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234#sampleemail.com<o:p></o:p></span></p> ";
var result = Regex.Replace(str, "<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>", "");
Console.WriteLine(result);
}
I'm currently using the following method to read in rss feeds:
if (!String.IsNullOrEmpty(rawxml) && rawxml.Contains("<rss"))//RSS Feeds
{
using (StringReader sr = new StringReader(rawxml))
{
XmlReader xmlReader = XmlReader.Create(sr);
SyndicationFeed rssfeed = SyndicationFeed.Load(xmlReader);
xmlReader.Close();
//do stuff with the SyndicationFeed rssfeed
}
}
This code will handle several different news sources. Given all the errors that the varying RSS feeds can trigger during SyndicationFeed.Load, I want to simplify the feed (held as a string named rawxml in the code) before loading it into a SyndicationFeed, so that each item in the feed contains ONLY these child elements:
<item>
<title>*</title>
<link>*</link>
<description>*</description>
<pubDate>*</pubDate>
</item>
I am currently looking at using a regex pattern to strip out all the children elements under the <item> elements that aren't titles, links, descriptions or pubDates. I would do this using the following additional code:
string pattern = @"some pattern here";
Regex rgx = new Regex(pattern);
string result = rgx.Replace(rawxml, "");
The problem is that I am not sure how to write a pattern that removes those unnecessary elements without destroying the child elements I want to keep. Is there a way to select those nested elements? A second strategy I have been looking at is using XPath to select those nodes, but I'm not sure how to remove child nodes from an XmlReader.
UPDATE:
I have decided to pull away from REGEX for the time and I'm looking at using XDocument and XPath to select all the nodes I don't want and to remove them from the feed. The following is what I have so far:
if (!String.IsNullOrEmpty(rawxml) && rawxml.Contains("<rss"))//RSS Feeds
{
//Create XML and remove unneeded xml nodes
var xdoc = XDocument.Parse(rawxml);
foreach (var item in xdoc.XPathSelectElements("//item/??some/xpath/here/to/get/unwanted/children"))
{
item.RemoveNodes();
item.RemoveAll();
}
//Feed in the cleaned up xml into SyndicationFeed
using (XmlReader r = xdoc.CreateReader())
{
SyndicationFeed rssfeed = SyndicationFeed.Load(r);
//Do stuff
}
}
RegEx is not a suitable tool for modifying XML documents. What you're trying to do is a transformation, and there is a standardised technology for transforming XML documents: XSLT. All required types are in the System.Xml.Xsl namespace, and there's also a guide describing how to do an XSL transformation in .NET.
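A minimal sketch of what such a transformation could look like, assuming the feed is held in a string as in the question. The stylesheet is an identity transform plus one empty template that drops every child of item other than title, link, description and pubDate. The class and method names (RssCleaner, Clean) are invented for this example.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class RssCleaner
{
    // Identity transform, plus an empty template that swallows unwanted <item> children.
    // Single quotes are used inside the XML so the verbatim string needs no escaping.
    private const string Stylesheet = @"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:template match='@*|node()'>
    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>
  </xsl:template>
  <xsl:template match='item/*[not(self::title or self::link or self::description or self::pubDate)]'/>
</xsl:stylesheet>";

    public static string Clean(string rawxml)
    {
        XslCompiledTransform transform = new XslCompiledTransform();
        using (XmlReader styleReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            transform.Load(styleReader);
        }

        using (XmlReader input = XmlReader.Create(new StringReader(rawxml)))
        using (StringWriter output = new StringWriter())
        {
            // Passing null means the stylesheet takes no parameters.
            transform.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

Namespaced extension elements such as dc:creator are removed as well, because self::title and the other tests only match elements without a namespace.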
LINQ and XDocument were more straightforward to use and solved the problem. Here is the solution I used, for anyone coming here who is trying to limit the number of errors they get while reading RSS feeds. I ended up not using SyndicationFeed at all, but those who still want to use it can call .RemoveAll() on the unwanted elements.
if (!String.IsNullOrEmpty(rawxml) && rawxml.Contains("<rss"))
{
//Create XML
XDocument xdoc = XDocument.Parse(rawxml);
foreach (var item in xdoc.Descendants("item")) {
//set temporary variables
foreach(var child in item.Descendants().Where(x =>
x.Name.ToString().ToLower() == "description" ||
x.Name.ToString().ToLower() == "link" ||
x.Name.ToString().ToLower() == "title" ||
x.Name.ToString().ToLower() == "pubdate"
)){
//grab elements with a switch statement
//do your operations
}
}
}
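For readers who do want to strip the unwanted children before handing the feed to something else, the following is a sketch under the assumption that only the four element names from the question should survive inside each item. RssTrimmer and Trim are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class RssTrimmer
{
    // Element names to keep inside each <item>, compared case-insensitively.
    private static readonly HashSet<string> Keep =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "title", "link", "description", "pubDate" };

    public static XDocument Trim(string rawxml)
    {
        XDocument xdoc = XDocument.Parse(rawxml);

        // Select every direct child of <item> that is namespaced or not on the keep list,
        // then detach those elements from the tree in one call.
        xdoc.Descendants("item")
            .Elements()
            .Where(e => e.Name.Namespace != XNamespace.None || !Keep.Contains(e.Name.LocalName))
            .Remove();

        return xdoc;
    }
}
```

The resulting document can then be passed on with xdoc.CreateReader(), as in the question's update.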
I have a variable in my program that contains HTML data as a string. The variable, htmlText, contains something like the following:
<ul><li><u>Mode selector </u></li><li><u>LAND ALT</u></li>
I'd like to iterate through this HTML, using the HtmlAgilityPack, but every example I see tries to load the HTML as a document. I already have the HTML that I want to parse within the variable htmlText. Can someone show me how to parse this, without loading it as a document?
The example I'm looking at right now looks like this:
static void Main(string[] args)
{
var web = new HtmlWeb();
var doc = web.Load("http://www.stackoverflow.com");
var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
foreach (var node in nodes)
{
Console.WriteLine(node.InnerHtml);
}
}
I want to convert this to use my htmlText and find all underline elements within. I just don't want to load this as a document since I already have the HTML that I want to parse stored in a variable.
You can use the LoadHtml method of the HtmlDocument class.
"Document" is simply a name; the input doesn't have to be a complete document.
var doc = new HtmlAgilityPack.HtmlDocument();
string myHTML = "<ul><li><u>Mode selector </u></li><li><u>LAND ALT</u></li>";
doc.LoadHtml(myHTML);
// Select the underline elements; SelectNodes returns null when nothing matches.
var nodes = doc.DocumentNode.SelectNodes("//u");
if (nodes != null)
{
foreach (var node in nodes)
Console.WriteLine(node.InnerHtml);
}
I've used this exact same thing to parse html chunks in variables.
So I have an HTML snippet that I want to modify using C#.
<div>
This is a specialSearchWord that I want to link to
<img src="anImage.jpg" />
A hyperlink
Some more text and that specialSearchWord again.
</div>
and I want to transform it to this:
<div>
This is a <a class="special" href="http://mysite.com/search/specialSearchWord">specialSearchWord</a> that I want to link to
<img src="anImage.jpg" />
A hyperlink
Some more text and that <a class="special" href="http://mysite.com/search/specialSearchWord">specialSearchWord</a> again.
</div>
I'm going to use HTML Agility Pack based on the many recommendations here, but I don't know where I'm going. In particular,
How do I load a partial snippet as a string, instead of a full HTML document?
How do I edit it?
How do I then return the text string of the edited object?
The same as a full HTML document. It doesn't matter.
There are two options: you can edit the InnerHtml property directly (or Text on text nodes), or modify the DOM tree using methods such as AppendChild and PrependChild.
You may use HtmlDocument.DocumentNode.OuterHtml property or use HtmlDocument.Save method (personally I prefer the second option).
As to parsing, I select the text nodes which contain the search term inside your div, and then just use string.Replace method to replace it:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var textNodes = doc.DocumentNode.SelectNodes("/div/text()[contains(.,'specialSearchWord')]");
if (textNodes != null)
foreach (HtmlTextNode node in textNodes)
node.Text = node.Text.Replace("specialSearchWord", "<a class='special' href='http://mysite.com/search/specialSearchWord'>specialSearchWord</a>");
And saving the result to a string:
string result = null;
using (StringWriter writer = new StringWriter())
{
doc.Save(writer);
result = writer.ToString();
}
Answers:
There may be a way to do this but I don't know how. I suggest loading the entire document.
Use a combination of XPath and regular expressions.
See the code below for a contrived example. You may have other constraints not mentioned but this code sample should get you started.
Note that your Xpath expression may need to be more complex to find the div that you want.
HtmlDocument doc = new HtmlDocument();
doc.Load(yourHtmlFile);
HtmlNode divNode = doc.DocumentNode.SelectSingleNode("//div[2]");
string newDiv = Regex.Replace(divNode.InnerHtml, @"specialSearchWord",
"<a class='special' href='http://etc'>specialSearchWord</a>");
divNode.InnerHtml = newDiv;
Console.WriteLine(doc.DocumentNode.OuterHtml);
I want to parse a html page to get some data.
First, I convert it to XML document using SgmlReader.
Then, I load the result to XMLDocument and then navigate through XPath:
//contains html document
var loadedFile = LoadWebPage();
...
Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
sgmlReader.DocType = "HTML";
sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
sgmlReader.InputStream = new StringReader(loadedFile);
XmlDocument doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.XmlResolver = null;
doc.Load(sgmlReader);
This code works fine for most cases, except on this site - www.arrow.com (try to search something like OP295GS). I can get a table with result using the following XPath:
var node = doc.SelectSingleNode(".//*[@id='results-table']");
This gives me a node with several child nodes:
[0] {Element, Name="thead"}
[1] {Element, Name="tbody"}
[2] {Element, Name="tbody"}
FirstChild {Element, Name="thead"}
Ok, let's try to get some child nodes using XPath. But this doesn't work:
var childNodes = node.SelectNodes("tbody");
//childnodes.Count = 0
This also:
var childNode = node.SelectSingleNode("thead");
// childNode = null
And even this:
var childNode = doc.SelectSingleNode(".//*[@id='results-table']/thead");
What can be wrong with the XPath queries?
I've just tried to parse that HTML page with Html Agility Pack, and my XPath queries work well there. But my application uses XmlDocument internally, so Html Agility Pack doesn't suit me.
I even tried the following trick with Html Agility Pack, but the XPath queries don't work there either:
//let's parse and convert HTML document using HTML Agility Pack and then load
//the result to XmlDocument
HtmlDocument xmlDocument = new HtmlDocument();
xmlDocument.OptionOutputAsXml = true;
xmlDocument.Load(new StringReader(webPage));
XmlDocument document = new XmlDocument();
document.LoadXml(xmlDocument.DocumentNode.InnerHtml);
Perhaps the web page contains errors (not all tags are closed, and so on), but in spite of this I can see the child nodes (through Quick Watch in Visual Studio) and yet cannot access them through XPath.
My XPath queries work correctly in Firefox + FirePath + XPather plugins, but don't work in .NET XmlDocument :(
I have not used SgmlReader, but every time I have seen this problem it has been due to namespaces. A quick look at the HTML on www.arrow.com shows that this node has a namespace (note the xmlns:javaurlencoder):
<form name="CatSearchForm" method="post" action="http://components.arrow.com/part/search/OP295GS" xmlns:javaurlencoder="java.net.URLEncoder">
The code below is how I loop through all nodes in a document to see which ones have namespaces and which don't. If the node you are looking for, or any of its parents, has a namespace, you must create an XmlNamespaceManager and pass it along with your call to SelectNodes().
This is kind of annoying, so another idea might be to strip all the xmlns: attributes out of the XML before loading it into an XmlDocument. Then you won't need to fool with XmlNamespaceManager!
XmlDocument doc = new XmlDocument();
doc.Load(@"C:\temp\X.loadtest.xml");
Dictionary<string, string> namespaces = new Dictionary<string, string>();
XmlNodeList nlAllNodes = doc.SelectNodes("//*");
foreach (XmlNode n in nlAllNodes)
{
if (n.NodeType != XmlNodeType.Element) continue;
if (!String.IsNullOrEmpty(n.NamespaceURI) && !namespaces.ContainsKey(n.Name))
{
namespaces.Add(n.Name, n.NamespaceURI);
}
}
// Inspect the namespaces dictionary to write the code below
XmlNamespaceManager nMgr = new XmlNamespaceManager(doc.NameTable);
// Sometimes this works
nMgr.AddNamespace("ns1", doc.DocumentElement.NamespaceURI);
// You can make the first param whatever you want, it just must match in XPath queries
nMgr.AddNamespace("javaurlencoder", "java.net.URLEncoder");
XmlNodeList iter = doc.SelectNodes("//ns1:TestProfile", nMgr);
foreach (XmlNode n in iter)
{
// Do stuff
}
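If namespaces do turn out to be the cause, another option that avoids XmlNamespaceManager altogether is to match on local-name() in the XPath expression. The sketch below uses a small made-up XHTML fragment rather than the real page, and the class and method names are hypothetical.

```csharp
using System;
using System.Xml;

public static class NamespaceAgnosticQuery
{
    // Finds the thead child of the results table regardless of which namespace it is in.
    public static XmlNode FindTableHead(XmlDocument doc)
    {
        return doc.SelectSingleNode("//*[@id='results-table']/*[local-name()='thead']");
    }

    public static void Main()
    {
        // Hypothetical sample: every element inherits the default XHTML namespace.
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<html xmlns='http://www.w3.org/1999/xhtml'><body>" +
                    "<table id='results-table'><thead/><tbody/></table></body></html>");

        // A plain name test only matches elements with no namespace, so this finds nothing.
        Console.WriteLine(doc.SelectSingleNode("//*[@id='results-table']/thead") == null);

        // The local-name() test ignores the namespace and finds the element.
        Console.WriteLine(FindTableHead(doc) != null);
    }
}
```

This is more verbose than prefixed names, but it can be handy for one-off queries against documents whose namespaces you do not control.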
To be honest, when I am trying to get information from a website, I use regex.
OK, Kore Nordmann (in his PHP blog) thinks this is not a good idea, but some of the comments there say otherwise.
http://kore-nordmann.de/blog/0081_parse_html_extract_data_from_html.html
http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html
It is about PHP, so sorry for that =) Hope it helps anyway.