How to extract data from webpage using C# - c#

I'm trying to extract text from this HTML tag
<span id="example1">sometext</span>
And I have this code:
using System;
using System.Net;
using HtmlAgilityPack;
namespace GC_data_console
{
class Program
{
public static void Main(string[] args)
{
using (var client = new WebClient())
{
// Download the HTML
string html =
client.DownloadString("https://www.requestedwebsite.com");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
foreach(HtmlNode link in
doc.DocumentNode.SelectNodes("//span"))
{
HtmlAttribute href = link.Attributes["id='example1'"];
if (href != null)
{
Console.WriteLine(href.Value.ToString());
Console.ReadLine();
}
}
}
}
}
}
But I am still not getting the text sometext.
But if I insert:
HtmlAttribute href = link.Attributes["id"];
I'll get all the IDs names. What am I doing wrong?

You need to first understand difference between HTML Node and HTMLAttribute. You code is nowhere near to solve the problem.
HTMLNode represents the tags used in HTML such as span,div,p,a and lot other. HTMLAttribute represents attribute which are used for the HTMLNodes such as href attribute is used for a, and style,class, id, name etc. attributes are used for almost all the HTML tags.
In below HTML
<span id="firstName" style="color:#232323">Some Firstname</span>
span is HTMLNode while id and style are the HTMLAttributes. and you can get value Some FirstName by using HtmlNode.InnerText property.
Also selecting HTMLNodes from HtmlDocument is not that straight forward. You need to provide proper XPath to select node you want.
Now in your code if you want to get the text written in <span id="ctl00_ContentBody_CacheName">SliverCup Studios East</span>, which is part of HTML of someurl.com, you need to write following code.
using (var client = new WebClient())
{
string html = client.DownloadString("https://www.someurl.com");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
//Selecting all the nodes with tagname `span` having "id=ctl00_ContentBody_CacheName".
var nodes = doc.DocumentNode.SelectNodes("//span")
.Where(d => d.Attributes.Contains("id"))
.Where(d => d.Attributes["id"].Value == "ctl00_ContentBody_CacheName");
foreach (HtmlNode node in nodes)
{
Console.WriteLine(node.InnerText);
}
}
The above code will select all the span tags which are directly under the document node of the HTML. Tags which are located deep inside the hierarchy you need to use different XPath.
This should help you resolve your issue.

Related

Parse Compelete Web Page

How to parse complete HTML web page not specific nodes using HTML Agility Pack or any other technique?
I am using this code, but this code only parse specific node, but I need complete page to parse with neat and clear contents
List<string> list = new List<string>();
string url = "https://www.google.com";
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
list.Add(node.InnerText);
}
To get all descendant text nodes use something like
var textNodes = doc.DocumentNode.SelectNodes("//text()").
Select(t=>t.InnerText);
To get all non empty descendant text nodes
var textNodes = doc.DocumentNode.
SelectNodes("//text()[normalize-space()]").
Select(t=>t.InnerText);
Do SelectNodes("*") . '*' (asterisk) Is the wild card selector and will get every node on the page.

How to get text between two div tags with some class attribute with HTMLagility

I want to get some text from two html div from HTML file.
After some searches i decided to use HTMLAgility Pack for doing this.
I wrote this code :
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(result);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//*div[#class='item']");
string value = node.InnerText;
'result' is my content of the File.
But i get this exception : 'Expression must evaluate to a node-set'
And this is some of mt file's content :
<div class="Clear" style="height:15px;"></div>
<div class='Container Select' id="Container_1">
<div class='Item'><div class='Part Lable'>موضوع : </div><div class='Part ...
try either
"//*/div[#class='item']"
or simply
"//div[#class='item']"
have you tried using XPath
for example if I wanated to find a if a node is selected in my example I would do the following
string xpath = null;
XmlNode configNode = configDom.DocumentElement;
// collect selected nodes in node list
XmlNodeList nodeList =
configNode.SelectNodes(#"//*[#status='checked']");
in your case you would do the following
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(result);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//*/div[#class='item']");
string value = node.InnerText;

how do i get the text from a nested <p> tag in an external html using html agility pack?

I am trying to get some text from an external site. The text I am trying to get is nested in a paragraph tag. The div has has a class value
html code snippet:
<div class="discription"><p>this is the text I want to grab</p></div>
current c# code:
public String getDiscription(string url)
{
var web = new HtmlWeb();
var doc = web.Load(url);
var nodes = doc.DocumentNode.SelectNodes("//div[#class='discription']");
if (nodes != null)
{
foreach (var node in nodes)
{
string Description = node.InnerHtml;
return Description;
}
} else
{
string error = "could not find text";
return error;
}
}
what I dont understand is the syntax of the xpath //div[#class='discription'] I know it is wrong what should the xpath be?
use //div[#class='discription']/p.
Breakdown:
//div - All div elements
[#class='discription'] - With a class attribute whose value is discription
/p - Select the child p elements

get iframe source using HtmlAgilityPack

I am trying to get all iFrame source urls on an html doc. I tried using HtmlAgilityPack with xpath - but I don't seem to be getting a list of sources.
HtmlAgilityPack.HtmlDocument myHtml= new HtmlDocument();
myHtml.LoadHtml(htmlString);
foreach (HtmlNode framesrc) in myHtml.DocumentNode.SelectNodes("//iframe/src"))
{
srcCollection.add(framesrc);
}
Is my xpath wrong?
ifarme has attribute #src. So your XPath should be //iframe/#src. It will select #src of all iframe.
Actually this opensource html parser uses query look like following query:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//iframe[#src]");
foreach(var node in nodes){
HtmlAttribute attr = node.Attributes["src"];
Console.WriteLine(attr.Value);
}

Grab all text from html with Html Agility Pack

Input
<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>
Output
foo
bar
baz
I know of htmldoc.DocumentNode.InnerText, but it will give foobarbaz - I want to get each text, not all at a time.
XPATH is your friend :)
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(#"<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>");
foreach(HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
{
Console.WriteLine("text=" + node.InnerText);
}
var root = doc.DocumentNode;
var sb = new StringBuilder();
foreach (var node in root.DescendantNodesAndSelf())
{
if (!node.HasChildNodes)
{
string text = node.InnerText;
if (!string.IsNullOrEmpty(text))
sb.AppendLine(text.Trim());
}
}
This does what you need, but I am not sure if this is the best way. Maybe you should iterate through something other than DescendantNodesAndSelf for optimal performance.
I was in the need of a solution that extracts all text but discards the content of script and style tags. I could not find it anywhere, but I came up with the following which suits my own needs:
StringBuilder sb = new StringBuilder();
IEnumerable<HtmlNode> nodes = doc.DocumentNode.Descendants().Where( n =>
n.NodeType == HtmlNodeType.Text &&
n.ParentNode.Name != "script" &&
n.ParentNode.Name != "style");
foreach (HtmlNode node in nodes) {
Console.WriteLine(node.InnerText);
var pageContent = "{html content goes here}";
var pageDoc = new HtmlDocument();
pageDoc.LoadHtml(pageContent);
var pageText = pageDoc.DocumentNode.InnerText;
The specified example for html content:
<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>
will produce the following output:
foo bar baz
public string html2text(string html) {
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(#"<html><body>" + html + "</body></html>");
return doc.DocumentNode.SelectSingleNode("//body").InnerText;
}
This workaround is based on Html Agility Pack. You can also install it via NuGet (package name: HtmlAgilityPack).
https://github.com/jamietre/CsQuery
have you tried CsQuery? Though not being maintained actively - it's still my favorite for parsing HTML to Text. Here's a one liner of how simple it is to get the Text from HTML.
var text = CQ.CreateDocument(htmlText).Text();
Here's a complete console application:
using System;
using CsQuery;
public class Program
{
public static void Main()
{
var html = "<div><h1>Hello World <p> some text inside h1 tag under p tag </p> </h1></div>";
var text = CQ.CreateDocument(html).Text();
Console.WriteLine(text); // Output: Hello World some text inside h1 tag under p tag
}
}
I understand that OP has asked for HtmlAgilityPack only but CsQuery is another unpopular and one of the best solutions I've found and wanted to share if someone finds this helpful. Cheers!
I just changed and fixed some people's answers to work better:
var document = new HtmlDocument();
document.LoadHtml(result);
var sb = new StringBuilder();
foreach (var node in document.DocumentNode.DescendantsAndSelf())
{
if (!node.HasChildNodes && node.Name == "#text" && node.ParentNode.Name != "script" && node.ParentNode.Name != "style")
{
string text = node.InnerText?.Trim();
if (text.HasValue() && !text.StartsWith('<') && !text.EndsWith('>'))
sb.AppendLine(System.Web.HttpUtility.HtmlDecode(text.Trim()));
}
}

Categories

Resources