Use HtmlAgilityPack to parse HTML variable, not HTML document? - c#

I have a variable in my program that contains HTML data as a string. The variable, htmlText, contains something like the following:
<ul><li><u>Mode selector </u></li><li><u>LAND ALT</u></li>
I'd like to iterate through this HTML, using the HtmlAgilityPack, but every example I see tries to load the HTML as a document. I already have the HTML that I want to parse within the variable htmlText. Can someone show me how to parse this, without loading it as a document?
The example I'm looking at right now looks like this:
static void Main(string[] args)
{
var web = new HtmlWeb();
var doc = web.Load("http://www.stackoverflow.com");
var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
foreach (var node in nodes)
{
Console.WriteLine(node.InnerHtml);
}
}
I want to convert this to use my htmlText and find all underline elements within. I just don't want to load this as a document since I already have the HTML that I want to parse stored in a variable.

You can use the LoadHtml method of the HtmlDocument class.

"Document" is just a name here; the input doesn't have to be a complete document.
var doc = new HtmlAgilityPack.HtmlDocument();
string myHTML = "<ul><li><u>Mode selector </u></li><li><u>LAND ALT</u></li>";
doc.LoadHtml(myHTML);
// "//u" selects every underline element; SelectNodes returns null when nothing matches.
foreach (var node in doc.DocumentNode.SelectNodes("//u")) {
    Console.WriteLine(node.InnerHtml);
}
I've used this exact same thing to parse html chunks in variables.
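If the fragment might not contain any matching elements, a slightly more defensive version helps, because SelectNodes returns null rather than an empty collection. The following is only a minimal sketch: the class and method names (UnderlineExtractor, GetUnderlinedText) are made up for illustration, and it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class UnderlineExtractor
{
    // Hypothetical helper: returns the trimmed text of every <u> element
    // found in an HTML fragment held in a string.
    public static List<string> GetUnderlinedText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parses the string directly, no file or URL involved

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//u");
        if (nodes == null)
            return new List<string>();

        return nodes.Select(n => n.InnerText.Trim()).ToList();
    }
}
```

Calling UnderlineExtractor.GetUnderlinedText(htmlText) then gives you a plain list of strings to loop over.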

Related

Extracting data from HTML file using c# script

What I need to do: Extract the From, To, Cc and Subject information from an HTML file and then remove it, without using any third-party library (HtmlAgilityPack, etc.).
What I am having trouble with: How should I approach getting these fields (From, To, Cc, Subject) out of the HTML tags?
What I tried: I took the index of <p class=MsoNormal> and the last index of the email domain @sampleemail.com, but I think that is a bad approach, since some HTML files contain many "<p class=MsoNormal>" tags.
To remove the From, To, Cc and Subject fields I used string.Remove(startIndex, count), with the count being the number of characters between the indexOf and lastIndexOf positions, and that worked.
Sample tag containing the From information:
<p class=MsoNormal style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p>
HTML FILE output:
HtmlAgilityPack is your friend. Use an XPath expression such as //p[@class ='MsoNormal'] to get the content of those tags:
public static void Main()
{
    // Verbatim string literal (@"...") so the markup needs no escaping.
    var html =
        @"<p class=MsoNormal style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p> ";
    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);
    var nodes = htmlDoc.DocumentNode.SelectNodes("//p[@class ='MsoNormal']");
    foreach (var node in nodes)
        Console.WriteLine(node.InnerText);
}
Result:
From:1234@sampleemail.com
Update
We can use Regex to write a simple tag stripper, but keep in mind that it will not handle every case in a complicated HTML document.
public static void MainFunc()
{
    string str = @"<p class='MsoNormal' style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p> ";
    // Remove every tag, including ones whose quoted attribute values contain '>'.
    var result = Regex.Replace(str, "<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>", "");
    Console.WriteLine(result);
}

How to read the content of a span tag using HtmlAgilityPack?

I'm using HtmlAgilityPack to scrape data from a link (site). The site has many p, header and span tags. I need to scrape data from one particular span tag.
var webGet = new HtmlWeb();
var document = webGet.Load(URL);
foreach (HtmlNode node in document.DocumentNode.SelectNodes("\\span"))
{
string strData = node.InnerText.Trim();
}
I tried matching on a keyword in the parent tag, but that did not work for every kind of URL.
Please help me fix it.
What is the error?
You can start by fixing this:
foreach (HtmlNode node in document.DocumentNode.SelectNodes("\\span"))
it should be:
foreach (HtmlNode node in document.DocumentNode.SelectNodes("//span"))
But I want one specific value. For example, the source has many span tags, such as <span>abc</span>, <span>def</span>, <span>pqr</span>, <span>xyz</span>. I want the result "pqr". Is there a way to get it by the position or index of a particular tag?
If you want the third span tag in the whole document, wrap the path in parentheses, because //span[3] only matches a span that is the third span child of its own parent:
doc.DocumentNode.SelectSingleNode("(//span)[3]")
If you want to get the node containing the text "pqr":
doc.DocumentNode.SelectSingleNode("//span[contains(text(),'pqr')]");
You can use SelectNodes for the latter to get all span tags containing "pqr" in the text.
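To make the by-index lookup concrete, here is a minimal sketch. In XPath, //span[3] matches spans that are the third span child of their own parent, while (//span)[3] picks the third span in document order. The names SpanPicker and NthSpanText are hypothetical, and the HtmlAgilityPack package is assumed to be referenced.

```csharp
using HtmlAgilityPack;

public static class SpanPicker
{
    // Hypothetical helper: returns the text of the n-th <span> in document
    // order (1-based, as in XPath), or null when there are fewer than n spans.
    public static string NthSpanText(string html, int n)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The parentheses apply the index to the whole result set,
        // not to each parent's children separately.
        HtmlNode node = doc.DocumentNode.SelectSingleNode("(//span)[" + n + "]");
        return node == null ? null : node.InnerText.Trim();
    }
}
```

SelectSingleNode returns null when nothing matches, which is why the helper checks before reading InnerText.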

Parse Complete Web Page

How can I parse a complete HTML web page, not just specific nodes, using HTML Agility Pack or any other technique?
I am using this code, but it only parses one kind of node. I need the whole page parsed into neat, clean content.
List<string> list = new List<string>();
string url = "https://www.google.com";
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
list.Add(node.InnerText);
}
To get all descendant text nodes, use something like this (Select needs using System.Linq):
var textNodes = doc.DocumentNode
    .SelectNodes("//text()")
    .Select(t => t.InnerText);
To get all non-empty descendant text nodes:
var textNodes = doc.DocumentNode
    .SelectNodes("//text()[normalize-space()]")
    .Select(t => t.InnerText);
Use SelectNodes("//*"). The asterisk is the wildcard, so //* matches every element on the page; SelectNodes("*") on its own returns only the direct child elements of the node it is called on.
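For "neat and clean" page content you usually also want to skip script and style blocks and decode entities. The following is a rough sketch under those assumptions, not a complete text extractor; PageText and GetVisibleText are made-up names, and the XPath filter is one possible choice.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class PageText
{
    // Hypothetical helper: returns every non-empty text segment on a page,
    // skipping text that sits inside <script> or <style> elements.
    public static List<string> GetVisibleText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(
            "//text()[normalize-space()][not(ancestor::script or ancestor::style)]");
        if (nodes == null) // SelectNodes returns null when nothing matches
            return new List<string>();

        // DeEntitize turns entities such as &amp; back into plain characters.
        return nodes.Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim()).ToList();
    }
}
```

With HtmlWeb you could pass web.Load(url).DocumentNode.OuterHtml to this helper, or adapt it to take an HtmlDocument directly.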

get html node inner text segmented?

I am trying to parse an HTML page and I want to get the inner text of a node in segments, i.e. iterate over the node's children and treat each loose piece of text as if it were a child of its own:
<node1>
This text I WANT on iterate#1
<innernode>This text I WANT on iterate#2</innernode>
This text I WANT on iterate#3
<innernode>This text I WANT on iterate#4</innernode>
This text I WANT on iterate#5
</node1>
I am using htmlagilitypack as a parser but I think that I will face this problem with any other html parser
Depending on your .NET version, you could use an extension method that works on the node you want.
I haven't used the HTML Agility Pack, so treat this as an untested sketch.
For example:
// Extension methods must live in a static class; needs System.Collections.Generic.
public static IEnumerable<string> GetTextSegments(this HtmlNode node)
{
    // Direct children come back in document order, so loose text and
    // <innernode> elements alternate exactly as they appear in the source.
    foreach (HtmlNode child in node.ChildNodes)
    {
        string text = child.InnerText.Trim();
        if (text.Length > 0) // skip whitespace-only segments
            yield return text;
    }
}
You could then call this like so:
HtmlNode nodeOfTypeNode1 = ... // obtain the <node1> element
foreach (string text in nodeOfTypeNode1.GetTextSegments())
{
    Console.WriteLine(text);
}
To achieve this, use SelectNodes with an XPath expression that selects the text nodes:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(content);//content is the variable containing your html.
var items = doc.DocumentNode.SelectNodes("/node1//text()");
foreach (var item in items)
{
Console.WriteLine(item.OuterHtml.Replace("\r\n",""));
}

C#, Html Agility, Selecting every paragraph within a div tag

How can I select every paragraph in a div tag? For example:
<div id="body_text">
<p>Hi</p>
<p>Help Me Please</p>
<p>Thankyou</p>
</div>
I have Html Agility Pack downloaded and referenced in my program. All I need is the paragraphs. There may be a variable number of paragraphs, and there are lots of different div tags, but I only need the content within body_text. I assume this can be stored as a string, which I then want to write to a .txt file for later reference. Thank you.
The valid XPath for your case is //div[@id='body_text']/p
foreach (HtmlNode node in yourHTMLAgilityPackDocument.DocumentNode.SelectNodes("//div[@id='body_text']/p"))
{
    string text = node.InnerText; // that's the text you are looking for
}
Here's a solution that grabs the paragraphs as an enumeration of HtmlNodes:
HtmlDocument doc = new HtmlDocument();
doc.Load("your.html");
var div = doc.GetElementbyId("body_text");
var paragraphs = div.ChildNodes.Where(item => item.Name == "p");
Without explicit Linq:
var paragraphs = doc.GetElementbyId("body_text").Elements("p");
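Putting the pieces together for the "write it to a .txt file" part of the question, here is a minimal sketch. ParagraphExtractor and GetParagraphs are hypothetical names, and the output file name in the usage note is just a placeholder.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ParagraphExtractor
{
    // Hypothetical helper: returns the text of each <p> that is a direct
    // child of the element with the given id (empty list if the id is absent).
    public static List<string> GetParagraphs(string html, string divId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode div = doc.GetElementbyId(divId);
        if (div == null)
            return new List<string>();

        return div.Elements("p")
                  .Select(p => HtmlEntity.DeEntitize(p.InnerText).Trim())
                  .ToList();
    }
}
```

The result can then be saved with System.IO.File.WriteAllLines("body_text.txt", ParagraphExtractor.GetParagraphs(html, "body_text")), which writes one paragraph per line.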
