Htmlagilitypack: create html text node - c#

In HtmlAgilityPack, I want to create HtmlTextNode, which is a HtmlNode (inherts from HtmlNode) that has a custom InnerText.
HtmlTextNode CreateHtmlTextNode(string name, string text)
{
HtmlDocument doc = new HtmlDocument();
HtmlTextNode textNode = doc.CreateTextNode(text);
textNode.Name = name;
return textNode;
}
The problem is that the textNode.OuterHtml and textNode.InnerHtml will be equal to "text" after the method above.
e.g. CreateHtmlTextNode("title", "blabla") will generate:
textNode.OuterHtml = "blabla" instead of <Title>blabla</Title>
Is there any better way to create HtmlTextNode?

The following lines creates a outer html with content
var doc = new HtmlDocument();
// create html document
var html = HtmlNode.CreateNode("<html><head></head><body></body></html>");
doc.DocumentNode.AppendChild(html);
// select the <head>
var head = doc.DocumentNode.SelectSingleNode("/html/head");
// create a <title> element
var title = HtmlNode.CreateNode("<title>Hello world</title>");
// append <title> to <head>
head.AppendChild(title);
// returns Hello world!
var inner = title.InnerHtml;
// returns <title>Hello world!</title>
var outer = title.OuterHtml;
Hope it helps.

A HTMLTextNode contains just Text, no tags.
It's like the following:
<div> - HTML Node
<span>text</span> - HTML Node
This is the Text Node - Text Node
<span>text</span> - HTML Node
</div>
You're looking for a standard HtmlNode.
HtmlDocument doc = new HtmlDocument();
HtmlNode textNode = doc.CreateElement("title");
textNode.InnerHtml = HtmlDocument.HtmlEncode(text);
Be sure to call HtmlDocument.HtmlEncode() on the text you're adding. That ensures that special characters are properly encoded.

Related

HtmlAgilityPack throws stackoverflow exception when calling OuterHtml

var doc = new HtmlDocument();
var table = HtmlNode.CreateNode("table");
var tbody = HtmlNode.CreateNode("tbody");
table.AppendChild(tbody);
doc.DocumentNode.AppendChild(table);
var s = doc.DocumentNode.OuterHtml; //exception is thrown
I am unsure why this throws an exception. I have created 2 nodes and appended the child (tbody) to the parent (table).
In my mind, this should return
<table><tbody></tbody></table>
I'm not sure what I've done wrong
That HtmlNode.CreateNode expects an HTML structure instead of an element name.
From the method definition:
// Creates an HTML node from a string representing literal HTML.
//
// Parameters:
// html:
// The HTML text.
//
// Returns:
// The newly created node instance.
public static HtmlNode CreateNode(string html)
You are looking for the CreateElement method on the HtmlDocument instance.
var doc = new HtmlDocument();
var table = doc.CreateElement("table");
var tbody = doc.CreateElement("tbody");
table.AppendChild(tbody);
doc.DocumentNode.AppendChild(table);
In case you do want to go for HtmlNode.CreateNode then pass the corresponding html tags for the table and tbody.
var table = HtmlNode.CreateNode("<table></table");
var tbody = HtmlNode.CreateNode("<tbody></tbody>");

Parse Compelete Web Page

How to parse complete HTML web page not specific nodes using HTML Agility Pack or any other technique?
I am using this code, but this code only parse specific node, but I need complete page to parse with neat and clear contents
List<string> list = new List<string>();
string url = "https://www.google.com";
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
list.Add(node.InnerText);
}
To get all descendant text nodes use something like
var textNodes = doc.DocumentNode.SelectNodes("//text()").
Select(t=>t.InnerText);
To get all non empty descendant text nodes
var textNodes = doc.DocumentNode.
SelectNodes("//text()[normalize-space()]").
Select(t=>t.InnerText);
Do SelectNodes("*") . '*' (asterisk) Is the wild card selector and will get every node on the page.

Load p tag from form using HtmlAgilityPack

I'm trying to grap p tag from form tag but it is null:
string html = "<form id='foo123'> <p> loll </p> </form>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var node = doc.DocumentNode.SelectNodes("//form[contains(#id, 'foo')]"); //.Count = 1
var p = node[0].SelectSingleNode("./p"); // p is null
How do I fix this?
This is a known issue where the Agility Pack is incorrectly fixing the nesting of tags. You can work around it by calling:
HtmlNode.ElementsFlags.Remove("form");
See: http://htmlagilitypack.codeplex.com/workitem/23074

How to get text between two div tags with some class attribute with HTMLagility

I want to get some text from two html div from HTML file.
After some searches i decided to use HTMLAgility Pack for doing this.
I wrote this code :
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(result);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//*div[#class='item']");
string value = node.InnerText;
'result' is my content of the File.
But i get this exception : 'Expression must evaluate to a node-set'
And this is some of mt file's content :
<div class="Clear" style="height:15px;"></div>
<div class='Container Select' id="Container_1">
<div class='Item'><div class='Part Lable'>موضوع : </div><div class='Part ...
try either
"//*/div[#class='item']"
or simply
"//div[#class='item']"
have you tried using XPath
for example if I wanated to find a if a node is selected in my example I would do the following
string xpath = null;
XmlNode configNode = configDom.DocumentElement;
// collect selected nodes in node list
XmlNodeList nodeList =
configNode.SelectNodes(#"//*[#status='checked']");
in your case you would do the following
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(result);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//*/div[#class='item']");
string value = node.InnerText;

C#, Html Agility, Selecting every paragraph within a div tag

How can I select every paragraph in a div tag for example.
<div id="body_text">
<p>Hi</p>
<p>Help Me Please</P>
<p>Thankyou</P>
I have got Html Agility downloaded and referenced in my program, All I need is the paragraphs. There may be a variable number of paragraphs and there are loads of different div tags but I only need the content within the body_text. Then I assume this can be stored as a string which I then want to write to a .txt file for later reference. Thankyou.
The valid XPATH for your case is //div[#id='body_text']/p
foreach(HtmlNode node in yourHTMLAgilityPackDocument.DocumentNode.SelectNodes("//div[#id='body_text']/p")
{
string text = node.InnerText; //that's the text you are looking for
}
Here's a solution that grabs the paragraphs as an enumeration of HtmlNodes:
HtmlDocument doc = new HtmlDocument();
doc.Load("your.html");
var div = doc.GetElementbyId("body_text");
var paragraphs = div.ChildNodes.Where(item => item.Name == "p");
Without explicit Linq:
var paragraphs = doc.GetElementbyId("body_text").Elements("p");

Categories

Resources