Traversing the DOM with HTML Agility Pack - c#

I'm having trouble figuring out how to traverse the DOM with HTML Agility Pack.
For example let's say that I wanted to find an element with id="gbqfsa".
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(Url);
var foo = from bar in doc.DocumentNode.DescendantNodes()
where bar.Attributes["id"].Value == "gbqfsa"
select bar.InnerText;
Right now I'm doing this (above), but foo is coming out as null. What am I doing wrong?
EDIT: This is the if statement I was using. I was just testing to see if the element's InnerText equaled "Google Search".
if (foo.Equals("Google Search"))
{
HasSucceeded = 1;
MessageBox.Show(yay);
}
else
{
MessageBox.Show("kms");
}
return HasSucceeded;

What you should do is:
var foo = (from bar in doc.DocumentNode.DescendantNodes()
where bar.GetAttributeValue("id", null) == "gbqfsa"
select bar.InnerText).FirstOrDefault();
You forgot FirstOrDefault(), which selects the first element that satisfies the condition in the where clause.
I also replaced Attributes["id"].Value with GetAttributeValue("id", null) so that no exception is thrown when an element does not have an id attribute.

I don't think foo is coming out as null. More likely, bar.Attributes["id"] is null for some of the elements in the tree, since not all descendant nodes have an "id" attribute. I would recommend using the GetAttributeValue method, which returns a default value if the attribute is not found.
var foo = from bar in doc.DocumentNode.DescendantNodes()
where bar.GetAttributeValue("id", null) == "gbqfsa"
select bar.InnerText;
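As a minimal sketch of the same lookup, the snippet below assumes the HtmlAgilityPack NuGet package is referenced and uses a made-up markup string in place of the real page. It relies on HtmlDocument.GetElementbyId, which returns null when no element matches, so no attribute indexer is touched. Note that LoadHtml expects markup rather than a URL; to download a page first, something like new HtmlWeb().Load(url) would be needed instead.

```csharp
using System;
using HtmlAgilityPack;

public static class FindByIdDemo
{
    // Returns the inner text of the element with the given id,
    // or null when no such element exists.
    public static string InnerTextById(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // LoadHtml parses markup; it does not fetch a URL
        HtmlNode node = doc.GetElementbyId(id);
        return node?.InnerText;
    }

    public static void Main()
    {
        // Hypothetical markup standing in for the downloaded page.
        string html = "<html><body><span id='gbqfsa'>Google Search</span></body></html>";
        Console.WriteLine(InnerTextById(html, "gbqfsa"));
        Console.WriteLine(InnerTextById(html, "missing") == null);
    }
}
```

The helper name InnerTextById is only illustrative; the LINQ queries from the answers above work just as well once GetAttributeValue is used.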

Related

Selenium C# - Web element attribute is present but cannot be found?

Running into an issue with a function I wrote for a Selenium test case... When I run a jQuery query on a web element ID (#AmountToggle), it displays all the attributes. I want to verify this one ("lastChild"):
but when I run this code:
driver.FindElement(By.CssSelector("#AmountToggle")).GetAttribute("lastChild")
it returns null?!
Why is this and how can I get the correct value of this attribute?
So it looks like "lastChild" is a property, not an attribute. Although I cannot assert that for sure with what you have above.
The difference being, an attribute would appear as something in the direct html, such as:
<a href="..." id="...">Link</a>
Where href and id are attributes. If lastChild doesn't appear in the html like the above examples, it won't be considered an attribute.
First try comparing these two in the javascript console:
$("#AmountToggle").attr("lastChild")
$("#AmountToggle").prop("lastChild")
The following is a workaround when you have issues with Selenium finding things. This logic will allow you to find things inside of iframes with ease, and also will allow you to use pseudo-selectors to find elements:
public static string GetFullyQualifiedXPathToElement(string cssSelector, bool isFullJQuery = false, bool noWarn = false)
{
if (cssSelector.Contains("$(") && !isFullJQuery) {
isFullJQuery = true;
}
string finder_method = @"
function getPathTo(element) {
if(typeof element == 'undefined') return '';
if (element.tagName == 'HTML')
return '/HTML[1]';
if (element===document.body)
return '/HTML[1]/BODY[1]';
var ix= 0;
var siblings = element.parentNode.childNodes;
for (var i= 0; i< siblings.length; i++) {
var sibling= siblings[i];
if (sibling===element)
return getPathTo(element.parentNode)+'/'+element.tagName+'['+(ix+1)+']';
if (sibling.nodeType===1 && sibling.tagName===element.tagName)
ix++;
}
}
";
if(isFullJQuery) {
cssSelector = cssSelector.TrimEnd(';');
}
string executable = isFullJQuery ? string.Format("{0} return getPathTo({1}[0]);", finder_method, cssSelector) : string.Format("{0} return getPathTo($('{1}')[0]);", finder_method, cssSelector.Replace("'", "\""));
string xpath = string.Empty;
try {
xpath = BaseTest.Driver.ExecuteJavaScript<string>(executable);
} catch (Exception e) {
if (!noWarn) {
//Warn about failure with custom message.
}
}
if (!noWarn && string.IsNullOrEmpty(xpath)) {
//Warn about failure with custom message.
//string.Format("Supplied cssSelector did not point to an element. Selector is \"{0}\".", cssSelector);
}
return xpath;
}
This method uses Jquery, which has more extensive search options using CssSelectors (such as pseudo selectors), and finds things 100% of the time given a good search query. This method uses JQuery to find the element, and then generates an explicit XPath to that element in the DOM, returning that XPath. With the explicit XPath, you can then tell Selenium to find the element using XPath.
It looks like the value of last-child is an element itself. If that is true, here is how you might use this in your example:
driver.FindElement(By.XPath(GetFullyQualifiedXPathToElement("$('#AmountToggle').prop('lastChild')[0]", true)));
Note three things here. The first is that I used "prop" in JQuery. Change that to "attr" if that was the correct call. Also, note the [0] index. This will return the JQuery element value as a regular javascript DOM element, which is what the method above uses. The final thing to note is the cssSelector value passed in. You can pass in just a selector to this method, such as "#SomeElementId > div", or you can pass in full JQuery, such as "$('#SomeElementId > div')".

Get HtmlAgilityPack Node using exact HTML search or Converting HTMLElement to HTMLNode

I have created a HTMLElement picker (DOM) by using the default .net WebBrowser.
The user can pick (select) a HTMLElement by clicking on it.
I want to get the HtmlAgilityPack.HTMLNode corresponding to the HTMLElement.
The easiest way (in my mind) is to use doc.DocumentNode.SelectSingleNode(EXACTHTMLTEXT) but it does not really work (because the function only accepts xpath code).
How can I do this?
A sample HTMLElement selected by a user looks like this (the OuterHtml code):
<a onmousedown="return wow" class="l" href="http://site.com"><em>Great!!!</em> <b>come and see more</b></a>
Of course, any element can be selected, that's why I need a way to get the HTMLNode.
Same concept, but a bit simpler because you don't have to know the element type:
HtmlNode node = doc.DocumentNode.Descendants().FirstOrDefault(n => n.OuterHtml.Equals(text, StringComparison.InvariantCultureIgnoreCase));
I came up with a solution. Don't know if it's the best (I would appreciate if somebody knows a better way to achieve this to let me know).
Here is the method that will get the HtmlNode:
public HtmlNode GetNode(string text)
{
if (text.StartsWith("<")) //get the type of the element (a, p, div etc..)
{
string type = "";
for (int i = 1; i < text.Length; i++)
{
if (text[i] == ' ')
{
type = text.Substring(1, i - 1);
break;
}
}
try // check whether any node of the element's type has an OuterHtml equal to the HtmlElement's OuterHtml; if such a node exists, then that's the node we want to use
{
HtmlNode n = doc.DocumentNode.SelectNodes("//" + type).Where(x => x.OuterHtml == text).First();
return n;
}
catch (Exception)
{
throw new Exception("Cannot find the HTML element in the HTML Page");
}
}
else
{
throw new Exception("Invalid HTML Element supplied. The selected HTML element must start with <");
}
}
The idea is that you pass the OuterHtml of the HtmlElement. Example:
HtmlElement el=....
HtmlNode N = GetNode(el.OuterHtml);

Remove element from XML based on attribute value?

I was trying to remove a descendant element from an XElement (using .Remove()) and I seem to get a null object reference, and I'm not sure why.
Having looked at the previous question with this title (see here), I found a way to remove it, but I still don't see why the way I tried 1st didn't work.
Can someone enlighten me ?
String xml = "<things>"
+ "<type t='a'>"
+ "<thing id='100'/>"
+ "<thing id='200'/>"
+ "<thing id='300'/>"
+ "</type>"
+ "</things>";
XElement bob = XElement.Parse(xml);
// this doesn't work...
var qry = from element in bob.Descendants()
where element.Attribute("id").Value == "200"
select element;
if (qry.Count() > 0)
qry.First().Remove();
// ...but this does
bob.XPathSelectElement("//thing[@id = '200']").Remove();
Thanks,
Ross
The problem is that the collection you are iterating contains some elements that don't have the id attribute. For them, element.Attribute("id") is null, and so trying to access the Value property throws a NullReferenceException.
One way to solve this is to use a cast instead of Value:
var qry = from element in bob.Descendants()
where (string)element.Attribute("id") == "200"
select element;
If an element doesn't have the id attribute, the cast returns null, which works fine here.
And if you're doing a cast, you can just as well cast to an int?, if you want.
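To illustrate the cast to int?, here is a minimal sketch built on the XML from the question. The class and method names (AttributeCastDemo, RemoveById) are made up for the example. The explicit conversion from XAttribute to int? yields null for a missing attribute, though it would still throw a FormatException if an id held non-numeric text.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class AttributeCastDemo
{
    // Removes the first descendant whose id attribute equals the given number.
    // Returns true if an element was removed, false if nothing matched.
    public static bool RemoveById(XElement root, int id)
    {
        // (int?) gives null when the attribute is absent,
        // so elements without an id are simply skipped.
        XElement match = root.Descendants()
                             .FirstOrDefault(e => (int?)e.Attribute("id") == id);
        if (match == null)
            return false;
        match.Remove();
        return true;
    }

    public static void Main()
    {
        XElement bob = XElement.Parse(
            "<things><type t='a'><thing id='100'/><thing id='200'/><thing id='300'/></type></things>");
        Console.WriteLine(RemoveById(bob, 200));                 // True
        Console.WriteLine(bob.Descendants("thing").Count());     // 2
    }
}
```

Using FirstOrDefault also avoids enumerating the query twice, which the Count() followed by First() pattern does.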
Try the following:
var qry = bob.Descendants()
.Where(el => el.Attribute("id") != null)
.Where(el => el.Attribute("id").Value == "200");
if (qry.Count() > 0)
qry.First().Remove();
You need to test for the presence of the id attribute before getting its value.

LINQ to XML C# get root element attribute

Let's say I have an XElement object doc:
<parameters mode="solve">
<inputs>
<a>value_a</a>
...
...
How do I get the value of the attribute of the first element (parameters)? In other words, how do I check which mode it is in?
If I write
if ((string)doc.Element("parameters").Attribute("mode").Value == "solve") { mode = 1; }
it gives me a null object reference error.
If doc is an XElement, as you say in your question, then you probably don't need to match it again:
if (doc.Attribute("mode").Value.ToString() == "solve") {
mode = 1;
}
If it is an XDocument, then you can use its Root property to refer to the document element:
if (doc.Root.Attribute("mode").Value.ToString() == "solve") {
mode = 1;
}
When you are calling doc.Element("parameters"), you are trying to look at the elements below the root element (in this case, the elements at the same level as <inputs>). You want to do this instead:
if (doc.Attribute("mode").Value == "solve") { mode = 1; }
Just use the Root
if (doc.Root.Attribute("mode").Value.Equals("solve"))
{
mode = 1;
}
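The answers above can be combined into a small null-safe sketch. It assumes the document is an XDocument and completes the truncated sample XML with closing tags purely for illustration; the names RootAttributeDemo and GetMode are hypothetical. The (string) cast returns null instead of throwing when the attribute, or the root itself, is missing.

```csharp
using System;
using System.Xml.Linq;

public static class RootAttributeDemo
{
    // Returns the value of the root element's "mode" attribute,
    // or null when the document has no root or no such attribute.
    public static string GetMode(XDocument doc)
    {
        return (string)doc.Root?.Attribute("mode");
    }

    public static void Main()
    {
        // Illustrative document; the closing tags are assumed, not from the question.
        XDocument doc = XDocument.Parse(
            "<parameters mode=\"solve\"><inputs><a>value_a</a></inputs></parameters>");

        int mode = GetMode(doc) == "solve" ? 1 : 0;
        Console.WriteLine(mode);
    }
}
```

If doc is already an XElement for the parameters element, the same cast works directly: (string)doc.Attribute("mode").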

Looking for C# HTML parser [duplicate]

I'm looking for a library/method to parse an html file with more html specific features than generic xml parsing libraries.
Html Agility Pack
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
You could use TidyNet.Tidy to convert the HTML to XHTML, and then use an XML parser.
Another alternative would be to use the builtin engine mshtml:
using mshtml;
...
object[] oPageText = { html };
HTMLDocument doc = new HTMLDocumentClass();
IHTMLDocument2 doc2 = (IHTMLDocument2)doc;
doc2.write(oPageText);
This allows you to use javascript-like functions like getElementById()
I found a project called Fizzler that takes a jQuery/Sizzler approach to selecting HTML elements. It's based on HTML Agility Pack. It's currently in beta and only supports a subset of CSS selectors, but it's pretty damn cool and refreshing to use CSS selectors over nasty XPath.
http://code.google.com/p/fizzler/
You can do a lot without going nuts on 3rd-party products and mshtml (i.e. interop). Use the System.Windows.Forms.WebBrowser. From there, you can do such things as "GetElementById" on an HtmlDocument or "GetElementsByTagName" on HtmlElements. If you want to actually interface with the browser (simulate button clicks, for example), you can use a little reflection (imo a lesser evil than interop) to do it:
var wb = new WebBrowser();
... tell the browser to navigate (tangential to this question). Then, in the DocumentCompleted event handler, you can simulate clicks like this:
var doc = wb.Document;
var elem = doc.GetElementById(elementId);
object obj = elem.DomElement;
System.Reflection.MethodInfo mi = obj.GetType().GetMethod("click");
mi.Invoke(obj, new object[0]);
You can do similar reflection stuff to submit forms, etc.
Enjoy.
I've written some code that provides "LINQ to HTML" functionality. I thought I would share it here. It is based on Majestic 12. It takes the Majestic-12 results and produces LINQ XML elements. At that point you can use all your LINQ to XML tools against the HTML. As an example:
IEnumerable<XNode> auctionNodes = Majestic12ToXml.Majestic12ToXml.ConvertNodesToXml(byteArrayOfAuctionHtml);
foreach (XElement anchorTag in auctionNodes.OfType<XElement>().DescendantsAndSelf("a")) {
if (anchorTag.Attribute("href") == null)
continue;
Console.WriteLine(anchorTag.Attribute("href").Value);
}
I wanted to use Majestic-12 because I know it has a lot of built-in knowledge with regards to HTML that is found in the wild. What I've found though is that to map the Majestic-12 results to something that LINQ will accept as XML requires additional work. The code I'm including does a lot of this cleansing, but as you use this you will find pages that are rejected. You'll need to fix up the code to address that. When an exception is thrown, check exception.Data["source"] as it is likely set to the HTML tag that caused the exception. Handling the HTML in a nice manner is at times not trivial...
So now that expectations are realistically low, here's the code :)
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Majestic12;
using System.IO;
using System.Xml.Linq;
using System.Diagnostics;
using System.Text.RegularExpressions;
namespace Majestic12ToXml {
public class Majestic12ToXml {
static public IEnumerable<XNode> ConvertNodesToXml(byte[] htmlAsBytes) {
HTMLparser parser = OpenParser();
parser.Init(htmlAsBytes);
XElement currentNode = new XElement("document");
HTMLchunk m12chunk = null;
int xmlnsAttributeIndex = 0;
string originalHtml = "";
while ((m12chunk = parser.ParseNext()) != null) {
try {
Debug.Assert(!m12chunk.bHashMode); // popular default for Majestic-12 setting
XNode newNode = null;
XElement newNodesParent = null;
switch (m12chunk.oType) {
case HTMLchunkType.OpenTag:
// Tags are added as a child to the current tag,
// except when the new tag implies the closure of
// some number of ancestor tags.
newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);
if (newNode != null) {
currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);
newNodesParent = currentNode;
newNodesParent.Add(newNode);
currentNode = newNode as XElement;
}
break;
case HTMLchunkType.CloseTag:
if (m12chunk.bEndClosure) {
newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);
if (newNode != null) {
currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);
newNodesParent = currentNode;
newNodesParent.Add(newNode);
}
}
else {
XElement nodeToClose = currentNode;
string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);
while (nodeToClose != null && nodeToClose.Name.LocalName != m12chunkCleanedTag)
nodeToClose = nodeToClose.Parent;
if (nodeToClose != null)
currentNode = nodeToClose.Parent;
Debug.Assert(currentNode != null);
}
break;
case HTMLchunkType.Script:
newNode = new XElement("script", "REMOVED");
newNodesParent = currentNode;
newNodesParent.Add(newNode);
break;
case HTMLchunkType.Comment:
newNodesParent = currentNode;
if (m12chunk.sTag == "!--")
newNode = new XComment(m12chunk.oHTML);
else if (m12chunk.sTag == "![CDATA[")
newNode = new XCData(m12chunk.oHTML);
else
throw new Exception("Unrecognized comment sTag");
newNodesParent.Add(newNode);
break;
case HTMLchunkType.Text:
currentNode.Add(m12chunk.oHTML);
break;
default:
break;
}
}
catch (Exception e) {
var wrappedE = new Exception("Error using Majestic12.HTMLChunk, reason: " + e.Message, e);
// the original html is copied for tracing/debugging purposes
originalHtml = new string(htmlAsBytes.Skip(m12chunk.iChunkOffset)
.Take(m12chunk.iChunkLength)
.Select(B => (char)B).ToArray());
wrappedE.Data.Add("source", originalHtml);
throw wrappedE;
}
}
while (currentNode.Parent != null)
currentNode = currentNode.Parent;
return currentNode.Nodes();
}
static XElement FindParentOfNewNode(Majestic12.HTMLchunk m12chunk, string originalHtml, XElement nextPotentialParent) {
string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);
XElement discoveredParent = null;
// Get a list of all ancestors
List<XElement> ancestors = new List<XElement>();
XElement ancestor = nextPotentialParent;
while (ancestor != null) {
ancestors.Add(ancestor);
ancestor = ancestor.Parent;
}
// Check if the new tag implies a previous tag was closed.
if ("form" == m12chunkCleanedTag) {
discoveredParent = ancestors
.Where(XE => m12chunkCleanedTag == XE.Name)
.Take(1)
.Select(XE => XE.Parent)
.FirstOrDefault();
}
else if ("td" == m12chunkCleanedTag) {
discoveredParent = ancestors
.TakeWhile(XE => "tr" != XE.Name)
.Where(XE => m12chunkCleanedTag == XE.Name)
.Take(1)
.Select(XE => XE.Parent)
.FirstOrDefault();
}
else if ("tr" == m12chunkCleanedTag) {
discoveredParent = ancestors
.TakeWhile(XE => !("table" == XE.Name
|| "thead" == XE.Name
|| "tbody" == XE.Name
|| "tfoot" == XE.Name))
.Where(XE => m12chunkCleanedTag == XE.Name)
.Take(1)
.Select(XE => XE.Parent)
.FirstOrDefault();
}
else if ("thead" == m12chunkCleanedTag
|| "tbody" == m12chunkCleanedTag
|| "tfoot" == m12chunkCleanedTag) {
discoveredParent = ancestors
.TakeWhile(XE => "table" != XE.Name)
.Where(XE => m12chunkCleanedTag == XE.Name)
.Take(1)
.Select(XE => XE.Parent)
.FirstOrDefault();
}
return discoveredParent ?? nextPotentialParent;
}
static string CleanupTagName(string originalName, string originalHtml) {
string tagName = originalName;
tagName = tagName.TrimStart(new char[] { '?' }); // for nodes <?xml >
if (tagName.Contains(':'))
tagName = tagName.Substring(tagName.LastIndexOf(':') + 1);
return tagName;
}
static readonly Regex _startsAsNumeric = new Regex(@"^[0-9]", RegexOptions.Compiled);
static bool TryCleanupAttributeName(string originalName, ref int xmlnsIndex, out string result) {
result = null;
string attributeName = originalName;
if (string.IsNullOrEmpty(originalName))
return false;
if (_startsAsNumeric.IsMatch(originalName))
return false;
//
// transform xmlns attributes so they don't actually create any XML namespaces
//
if (attributeName.ToLower().Equals("xmlns")) {
attributeName = "xmlns_" + xmlnsIndex.ToString();
xmlnsIndex++;
}
else {
if (attributeName.ToLower().StartsWith("xmlns:")) {
attributeName = "xmlns_" + attributeName.Substring("xmlns:".Length);
}
//
// trim trailing \"
//
attributeName = attributeName.TrimEnd(new char[] { '\"' });
attributeName = attributeName.Replace(":", "_");
}
result = attributeName;
return true;
}
static Regex _weirdTag = new Regex(@"^<!\[.*\]>$"); // matches "<![if !supportEmptyParas]>"
static Regex _aspnetPrecompiled = new Regex(@"^<%.*%>$"); // matches "<%@ ... %>"
static Regex _shortHtmlComment = new Regex(@"^<!-.*->$"); // matches "<!-Extra_Images->"
static XElement ParseTagNode(Majestic12.HTMLchunk m12chunk, string originalHtml, ref int xmlnsIndex) {
if (string.IsNullOrEmpty(m12chunk.sTag)) {
if (m12chunk.sParams.Length > 0 && m12chunk.sParams[0].ToLower().Equals("doctype"))
return new XElement("doctype");
if (_weirdTag.IsMatch(originalHtml))
return new XElement("REMOVED_weirdBlockParenthesisTag");
if (_aspnetPrecompiled.IsMatch(originalHtml))
return new XElement("REMOVED_ASPNET_PrecompiledDirective");
if (_shortHtmlComment.IsMatch(originalHtml))
return new XElement("REMOVED_ShortHtmlComment");
// Nodes like "<br <br>" will end up with a m12chunk.sTag==""... We discard these nodes.
return null;
}
string tagName = CleanupTagName(m12chunk.sTag, originalHtml);
XElement result = new XElement(tagName);
List<XAttribute> attributes = new List<XAttribute>();
for (int i = 0; i < m12chunk.iParams; i++) {
if (m12chunk.sParams[i] == "<!--") {
// an HTML comment was embedded within a tag. This comment and its contents
// will be interpreted as attributes by Majestic-12... skip this attributes
for (; i < m12chunk.iParams; i++) {
if (m12chunk.sParams[i] == "--" || m12chunk.sParams[i] == "-->")
break;
}
continue;
}
if (m12chunk.sParams[i] == "?" && string.IsNullOrEmpty(m12chunk.sValues[i]))
continue;
string attributeName = m12chunk.sParams[i];
if (!TryCleanupAttributeName(attributeName, ref xmlnsIndex, out attributeName))
continue;
attributes.Add(new XAttribute(attributeName, m12chunk.sValues[i]));
}
// If attributes are duplicated with different values, we complain.
// If attributes are duplicated with the same value, we remove all but 1.
var duplicatedAttributes = attributes.GroupBy(A => A.Name).Where(G => G.Count() > 1);
foreach (var duplicatedAttribute in duplicatedAttributes) {
if (duplicatedAttribute.GroupBy(DA => DA.Value).Count() > 1)
throw new Exception("Attribute value was given different values");
attributes.RemoveAll(A => A.Name == duplicatedAttribute.Key);
attributes.Add(duplicatedAttribute.First());
}
result.Add(attributes);
return result;
}
static HTMLparser OpenParser() {
HTMLparser oP = new HTMLparser();
// The code+comments in this function are from the Majestic-12 sample documentation.
// ...
// This is optional, but if you want high performance then you may
// want to set chunk hash mode to FALSE. This would result in tag params
// being added to string arrays in HTMLchunk object called sParams and sValues, with number
// of actual params being in iParams. See code below for details.
//
// When TRUE (and its default) tag params will be added to hashtable HTMLchunk (object).oParams
oP.SetChunkHashMode(false);
// if you set this to true then original parsed HTML for given chunk will be kept -
// this will reduce performance somewhat, but may be desireable in some cases where
// reconstruction of HTML may be necessary
oP.bKeepRawHTML = false;
// if set to true (it is false by default), then entities will be decoded: this is essential
// if you want to get strings that contain final representation of the data in HTML, however
// you should be aware that if you want to use such strings into output HTML string then you will
// need to do Entity encoding or same string may fail later
oP.bDecodeEntities = true;
// we have option to keep most entities as is - only replace stuff like
// this is called Mini Entities mode - it is handy when HTML will need
// to be re-created after it was parsed, though in this case really
// entities should not be parsed at all
oP.bDecodeMiniEntities = true;
if (!oP.bDecodeEntities && oP.bDecodeMiniEntities)
oP.InitMiniEntities();
// if set to true, then in case of Comments and SCRIPT tags the data set to oHTML will be
// extracted BETWEEN those tags, rather than include complete RAW HTML that includes tags too
// this only works if auto extraction is enabled
oP.bAutoExtractBetweenTagsOnly = true;
// if true then comments will be extracted automatically
oP.bAutoKeepComments = true;
// if true then scripts will be extracted automatically:
oP.bAutoKeepScripts = true;
// if this option is true then whitespace before start of tag will be compressed to single
// space character in string: " ", if false then full whitespace before tag will be returned (slower)
// you may only want to set it to false if you want exact whitespace between tags, otherwise it is just
// a waste of CPU cycles
oP.bCompressWhiteSpaceBeforeTag = true;
// if true (default) then tags with attributes marked as CLOSED (/ at the end) will be automatically
// forced to be considered as open tags - this is no good for XML parsing, but I keep it for backwards
// compatibility for my stuff as it makes it easier to avoid checking for same tag which is both closed
// or open
oP.bAutoMarkClosedTagsWithParamsAsOpen = false;
return oP;
}
}
}
The Html Agility Pack has been mentioned before - if you are going for speed, you might also want to check out the Majestic-12 HTML parser. Its handling is rather clunky, but it delivers a really fast parsing experience.
I think @Erlend's use of HTMLDocument is the best way to go. However, I have also had good luck using this simple library:
SgmlReader
No 3rd party lib, WebBrowser class solution that can run on Console, and Asp.net
using System;
using System.Collections.Generic;
using System.Text;
using System.Windows.Forms;
using System.Threading;
class ParseHTML
{
public ParseHTML() { }
private string ReturnString;
public string doParsing(string html)
{
Thread t = new Thread(TParseMain);
t.SetApartmentState(ApartmentState.STA); // the ApartmentState property setter is obsolete
t.Start((object)html);
t.Join();
return ReturnString;
}
private void TParseMain(object html)
{
WebBrowser wbc = new WebBrowser();
wbc.DocumentText = "feces of a dummy"; //;magic words
HtmlDocument doc = wbc.Document.OpenNew(true);
doc.Write((string)html);
this.ReturnString = doc.Body.InnerHtml + " do here something";
return;
}
}
usage:
string myhtml = "<HTML><BODY>This is a new HTML document.</BODY></HTML>";
Console.WriteLine("before:" + myhtml);
myhtml = (new ParseHTML()).doParsing(myhtml);
Console.WriteLine("after:" + myhtml);
The trouble with parsing HTML is that it isn't an exact science. If it was XHTML that you were parsing, then things would be a lot easier (as you mention you could use a general XML parser). Because HTML isn't necessarily well-formed XML you will come into lots of problems trying to parse it. It almost needs to be done on a site-by-site basis.
I've used ZetaHtmlTidy in the past to load random websites and then hit against various parts of the content with xpath (eg /html/body//p[@class='textblock']). It worked well but there were some exceptional sites that it had problems with, so I don't know if it's the absolute best solution.
You could use a HTML DTD, and the generic XML parsing libraries.
Use WatiN if you need to see the impact of JS on the page [and you're prepared to start a browser]
Depending on your needs you might go for the more feature-rich libraries. I tried most/all of the solutions suggested, but what stood out head & shoulders was Html Agility Pack. It is a very forgiving and flexible parser.
Try this script.
http://www.biterscripting.com/SS_URLs.html
When I use it with this url,
script SS_URLs.txt URL("http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c")
It shows me all the links on the page for this thread.
http://sstatic.net/so/all.css
http://sstatic.net/so/favicon.ico
http://sstatic.net/so/apple-touch-icon.png
.
.
.
You can modify that script to check for images, variables, whatever.
I wrote some classes for parsing HTML tags in C#. They are nice and simple if they meet your particular needs.
You can read an article about them and download the source code at http://www.blackbeltcoder.com/Articles/strings/parsing-html-tags-in-c.
There's also an article about a generic parsing helper class at http://www.blackbeltcoder.com/Articles/strings/a-text-parsing-helper-class.
