HtmlAgilityPack throws stackoverflow exception when calling OuterHtml

var doc = new HtmlDocument();
var table = HtmlNode.CreateNode("table");
var tbody = HtmlNode.CreateNode("tbody");
table.AppendChild(tbody);
doc.DocumentNode.AppendChild(table);
var s = doc.DocumentNode.OuterHtml; //exception is thrown
I am unsure why this throws an exception. I have created 2 nodes and appended the child (tbody) to the parent (table).
In my mind, this should return
<table><tbody></tbody></table>
I'm not sure what I've done wrong.

HtmlNode.CreateNode expects literal HTML markup, not an element name.
From the method definition:
// Creates an HTML node from a string representing literal HTML.
//
// Parameters:
// html:
// The HTML text.
//
// Returns:
// The newly created node instance.
public static HtmlNode CreateNode(string html)
You are looking for the CreateElement method on the HtmlDocument instance.
var doc = new HtmlDocument();
var table = doc.CreateElement("table");
var tbody = doc.CreateElement("tbody");
table.AppendChild(tbody);
doc.DocumentNode.AppendChild(table);
If you do want to use HtmlNode.CreateNode, pass the corresponding HTML markup for the table and tbody:
var table = HtmlNode.CreateNode("<table></table>");
var tbody = HtmlNode.CreateNode("<tbody></tbody>");

Related

Store links into variable instead of text file

I am very early on the C# learning curve. I have code that stores web links in a text file. How can I store them in a variable instead, so I can loop through them later in the code and access each one separately?
string pdfLinksUrl = "https://www.nordicwater.com/products/waste-water/";
// Load HTML content
var webGet = new HtmlAgilityPack.HtmlWeb();
var doc = webGet.Load(pdfLinksUrl);
// select all <A> nodes from the document using XPath
// (unfortunately we can't select attribute nodes directly as
// it is not yet supported by HAP)
var linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");
// select all href attribute values starting with the product URL prefix (case-insensitive)
var pdfUrls = from linkNode in linkNodes
let href = linkNode.Attributes["href"].Value
where href.ToLower().StartsWith("https://www.nordicwater.com/product/")
select href;
// write all PDF links to file
System.IO.File.WriteAllLines(@"c:\temp\pdflinks.txt", pdfUrls.ToArray());

pdfUrls already holds all of your URLs; it is the same variable you pass when writing them to the file.
You can use a foreach loop to go through the URLs:
foreach (string url in pdfUrls.ToArray()) {
Console.WriteLine($"PDF URL: {url}");
}
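One way to keep the links in memory for later use is to materialize the query into a List<string>. The following is a minimal sketch, assuming the href values have already been extracted; the LinkFilter class, the FilterProductUrls helper, and the sample URLs are hypothetical and not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LinkFilter
{
    // Prefix taken from the question's where clause
    private const string Prefix = "https://www.nordicwater.com/product/";

    // Hypothetical helper: keeps only hrefs that start with the prefix
    // (case-insensitive) and materializes them into a reusable list.
    public static List<string> FilterProductUrls(IEnumerable<string> hrefs)
    {
        return hrefs
            .Where(h => h.StartsWith(Prefix, StringComparison.OrdinalIgnoreCase))
            .ToList();
    }

    public static void Main()
    {
        // Made-up sample values standing in for linkNode.Attributes["href"].Value
        var sample = new[]
        {
            "https://www.nordicwater.com/product/example-a/",
            "https://www.nordicwater.com/contact/"
        };

        List<string> urls = FilterProductUrls(sample);

        // Loop over every stored link
        foreach (string url in urls)
        {
            Console.WriteLine($"URL: {url}");
        }

        // Access a single link by index
        if (urls.Count > 0)
        {
            Console.WriteLine(urls[0]);
        }
    }
}
```

Calling ToList() runs the query once and stores the results, so later loops do not re-evaluate the LINQ expression.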

HTMLAgilityPack - Get element in class by class

I wish to get the value from the H2 (highlighted) element within the 'listicle-page' class shown below. Currently the code gets all values in the DIV element, whereas I only need the value of the H2 contained within that class.
Consider the following HTML:
Please see the code below:
private void getFact()
{
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("https://www.rd.com/culture/interesting-facts/");
var headerNames = doc.DocumentNode.SelectNodes("//div[@class='listicle-page']").ToList();
foreach(var item in headerNames)
{
MessageBox.Show(item.InnerText);
}
}

Your XPath //div[@class='listicle-page'] matches the div node itself, so InnerText returns the text of all of its descendants. To select only the child h2 node, specify it explicitly by appending /h2:
//div[@class='listicle-page']/h2

MVC StackOverflowException with larger html data

I have the following method (I'm using the HtmlAgilityPack):
public DataTable tableIntoTable(HtmlDocument doc)
{
var nodes = doc.DocumentNode.SelectNodes("//table");
var table = new DataTable("MyTable");
table.Columns.Add("raw", typeof(string));
foreach (var node in nodes)
{
if (
(!node.InnerHtml.Contains("pldefault"))
&& (!node.InnerHtml.Contains("ntdefault"))
&& (!node.InnerHtml.Contains("bgtabon"))
)
{
table.Rows.Add(node.InnerHtml);
}
}
return table;
}
It accepts html grabbed using this:
public HtmlDocument getDataWithGet(string url)
{
using (var wb = new WebClient())
{
string response = wb.DownloadString(url);
var doc = new HtmlDocument();
doc.LoadHtml(response);
return doc;
}
}
All works fine with an html document that is 3294 lines long.
When I feed it some html that is 33960 lines long I get:
StackOverflowException was unhandled at the IF statement in the tableIntoTable method as seen in this image:
http://imgur.com/Q2FnIgb
I thought it might be related to the MaxHttpCollectionKeys limit of 1000 so I tried putting this in my Web.config and it still doesn't work:
add key="aspnet:MaxHttpCollectionKeys" value="9999"
I'm not really sure where to go from here, it only breaks with larger html documents.

Assuming the values in your if statement are contained in an attribute value of some descendant of a table, you can filter the tables directly in XPath:
var xpath = @"//table[not(.//*[contains(@*,'pldefault') or
contains(@*,'ntdefault') or
contains(@*,'bgtabon')])]";
var tables = doc.DocumentNode.SelectNodes(xpath);
Update: More accurately, based on your comments:
@"//table[not(.//td[contains(@class,'pldefault') or
contains(@class,'ntdefault') or
contains(@class,'bgtabon')])]";

How to get div by class in HtmlAgilityPack?

I'm following this tutorial, but I have a problem: I don't know how to get an HtmlNode by class name.
HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.LoadHtml(e.Result);
HtmlNode divContainer = htmlDoc.GetElementbyId("directoryItems"); // My problem is here: I want to get the node by class name instead
if (divContainer != null)
{
HtmlNodeCollection nodes = divContainer.SelectNodes("//table/tr");
....
}

Try this:
HtmlNodeCollection divContainer = htmlDoc.DocumentNode.SelectNodes("//div[@class='myClass']");
This will return a collection of div nodes with class="myClass".

Assuming that you want to select a <div> element whose class attribute value equals "directoryItems", and you know that only one element meets the criteria (or you simply want the first occurrence if there is more than one), you can use the .SelectSingleNode() method with the following XPath query:
HtmlNode divContainer = htmlDoc.DocumentNode
.SelectSingleNode("//div[@class='directoryItems']");
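Note that an exact comparison on the class attribute only matches when the attribute value is exactly that string. If the element may carry several classes, a common XPath 1.0 idiom is to match the class as a whitespace-delimited token. The sketch below assumes the HtmlAgilityPack package is referenced; the ClassLookup type, the FindDivByClass helper, and the sample markup are hypothetical.

```csharp
using System;
using HtmlAgilityPack;

public static class ClassLookup
{
    // Hypothetical helper: returns the first <div> whose class attribute
    // contains className as a separate token, or null if none matches.
    // Assumes className contains no quote characters.
    public static HtmlNode FindDivByClass(HtmlDocument doc, string className)
    {
        string xpath =
            "//div[contains(concat(' ', normalize-space(@class), ' '), ' "
            + className + " ')]";
        return doc.DocumentNode.SelectSingleNode(xpath);
    }

    public static void Main()
    {
        var htmlDoc = new HtmlDocument();
        // Made-up markup for illustration only
        htmlDoc.LoadHtml(
            "<div class='panel directoryItems'><table><tr><td>x</td></tr></table></div>");

        HtmlNode divContainer = FindDivByClass(htmlDoc, "directoryItems");
        if (divContainer != null)
        {
            // The leading '.' keeps the search relative to divContainer;
            // a path starting with '//' would search the whole document.
            HtmlNodeCollection rows = divContainer.SelectNodes(".//tr");
            Console.WriteLine(rows == null ? 0 : rows.Count);
        }
    }
}
```

SelectNodes returns null rather than an empty collection when nothing matches, which is why the sketch checks for null before reading Count.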

Htmlagilitypack: create html text node

In HtmlAgilityPack, I want to create an HtmlTextNode, which is an HtmlNode (it inherits from HtmlNode) that has a custom InnerText.
HtmlTextNode CreateHtmlTextNode(string name, string text)
{
HtmlDocument doc = new HtmlDocument();
HtmlTextNode textNode = doc.CreateTextNode(text);
textNode.Name = name;
return textNode;
}
The problem is that the textNode.OuterHtml and textNode.InnerHtml will be equal to "text" after the method above.
e.g. CreateHtmlTextNode("title", "blabla") will generate:
textNode.OuterHtml = "blabla" instead of <Title>blabla</Title>
Is there any better way to create HtmlTextNode?

The following lines create an element whose outer HTML includes its content:
var doc = new HtmlDocument();
// create html document
var html = HtmlNode.CreateNode("<html><head></head><body></body></html>");
doc.DocumentNode.AppendChild(html);
// select the <head>
var head = doc.DocumentNode.SelectSingleNode("/html/head");
// create a <title> element
var title = HtmlNode.CreateNode("<title>Hello world</title>");
// append <title> to <head>
head.AppendChild(title);
// returns Hello world
var inner = title.InnerHtml;
// returns <title>Hello world</title>
var outer = title.OuterHtml;
Hope it helps.

An HtmlTextNode contains just text, no tags.
It's like the following:
<div> - HTML Node
<span>text</span> - HTML Node
This is the Text Node - Text Node
<span>text</span> - HTML Node
</div>
You're looking for a standard HtmlNode.
HtmlDocument doc = new HtmlDocument();
HtmlNode textNode = doc.CreateElement("title");
textNode.InnerHtml = HtmlDocument.HtmlEncode(text);
Be sure to call HtmlDocument.HtmlEncode() on the text you're adding. That ensures that special characters are properly encoded.
