get iframe source using HtmlAgilityPack - c#

I am trying to get all iframe source URLs in an HTML document. I tried using HtmlAgilityPack with XPath, but I don't seem to be getting a list of sources.
HtmlAgilityPack.HtmlDocument myHtml = new HtmlAgilityPack.HtmlDocument();
myHtml.LoadHtml(htmlString);
foreach (HtmlNode framesrc in myHtml.DocumentNode.SelectNodes("//iframe/src"))
{
srcCollection.Add(framesrc);
}
Is my xpath wrong?

iframe has a src attribute, so your XPath should be //iframe/@src. It selects the src attribute of every iframe.

Actually, with this open-source HTML parser the query looks like the following:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlString);
// Select every iframe element that has a src attribute.
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//iframe[@src]");
// SelectNodes returns null when nothing matches.
if (nodes != null)
foreach(var node in nodes){
HtmlAttribute attr = node.Attributes["src"];
Console.WriteLine(attr.Value);
}

Related

How to extract data from webpage using C#

I'm trying to extract text from this HTML tag
<span id="example1">sometext</span>
And I have this code:
using System;
using System.Net;
using HtmlAgilityPack;
namespace GC_data_console
{
class Program
{
public static void Main(string[] args)
{
using (var client = new WebClient())
{
// Download the HTML
string html =
client.DownloadString("https://www.requestedwebsite.com");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
foreach(HtmlNode link in
doc.DocumentNode.SelectNodes("//span"))
{
HtmlAttribute href = link.Attributes["id='example1'"];
if (href != null)
{
Console.WriteLine(href.Value.ToString());
Console.ReadLine();
}
}
}
}
}
}
But I am still not getting the text sometext.
But if I insert:
HtmlAttribute href = link.Attributes["id"];
I'll get all the IDs names. What am I doing wrong?
You first need to understand the difference between an HtmlNode and an HtmlAttribute. Your code is nowhere near solving the problem.
HtmlNode represents the tags used in HTML, such as span, div, p, a and many others. HtmlAttribute represents the attributes attached to those nodes: for example, href is used on a, while style, class, id, name and so on can appear on almost any HTML tag.
In the HTML below
<span id="firstName" style="color:#232323">Some Firstname</span>
span is the HtmlNode, while id and style are HtmlAttributes. You can get the value Some Firstname through the HtmlNode.InnerText property.
Also, selecting HtmlNodes from an HtmlDocument is not that straightforward. You need to provide a proper XPath to select the node you want.
Now, if you want to get the text inside <span id="ctl00_ContentBody_CacheName">SliverCup Studios East</span>, which is part of the HTML of someurl.com, you need to write the following code.
using (var client = new WebClient())
{
string html = client.DownloadString("https://www.someurl.com");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// Select all span nodes whose id attribute equals "ctl00_ContentBody_CacheName" (Where requires using System.Linq).
var nodes = doc.DocumentNode.SelectNodes("//span")
.Where(d => d.Attributes.Contains("id"))
.Where(d => d.Attributes["id"].Value == "ctl00_ContentBody_CacheName");
foreach (HtmlNode node in nodes)
{
Console.WriteLine(node.InnerText);
}
}
The XPath //span selects every span element in the document, at any depth; the Where clauses then keep only the ones whose id matches. To target elements at a specific place in the hierarchy, you need a more specific XPath.
This should help you resolve your issue.
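As an alternative to filtering with LINQ, the id test can be written directly into the XPath as a predicate. The following is a minimal sketch, assuming the HtmlAgilityPack package is referenced; the inline HTML string and the class name SpanTextExample are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

class SpanTextExample
{
    static void Main()
    {
        // Hypothetical inline HTML standing in for the downloaded page.
        string html = "<div><p><span id=\"example1\">sometext</span></p></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The [@id='example1'] predicate keeps only spans whose id attribute matches.
        HtmlNode span = doc.DocumentNode.SelectSingleNode("//span[@id='example1']");

        // SelectSingleNode returns null when nothing matches, so guard before use.
        if (span != null)
        {
            Console.WriteLine(span.InnerText);
        }
    }
}
```

Putting the condition in the XPath avoids walking every span in C# and makes the intent visible in one expression.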

Parse Complete Web Page

How can I parse a complete HTML web page, not just specific nodes, using HTML Agility Pack or any other technique?
I am using this code, but it only parses specific nodes; I need to parse the complete page into neat, clean content.
List<string> list = new List<string>();
string url = "https://www.google.com";
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
list.Add(node.InnerText);
}
To get all descendant text nodes use something like
var textNodes = doc.DocumentNode.SelectNodes("//text()").
Select(t=>t.InnerText);
To get all non empty descendant text nodes
var textNodes = doc.DocumentNode.
SelectNodes("//text()[normalize-space()]").
Select(t=>t.InnerText);
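Alternatively, use SelectNodes("//*"). The asterisk is the wildcard and matches any element: "//*" selects every element on the page, whereas a bare "*" selects only the child elements of the current node.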

c# HtmlAgilityPack read non html tags

I am trying to read MYOWNTAG using HtmlAgilityPack, but I get "Object reference not set to an instance of an object".
How can I print the name ahmed? This is my C# code:
string html = "<p>HELLO <MYOWNTAG> ahmed </MYOWNTAG> again</p>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//MYOWNTAG"))
MessageBox.Show(node.InnerText);
string html = "<p>HELLO <MYOWNTAG> ahmed </MYOWNTAG> again</p>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//myowntag"))
MessageBox.Show(node.InnerText);
It works with the lowercase tag name: HtmlAgilityPack lowercases all tag names when parsing, so the XPath must use //myowntag.

HtmlAgilityPack HtmlNodeCollection returning NULL , shouldn't

I made a simple program for fetching youtube users in comments.
This is the code
string html;
using (var client = new WebClient())
{
html = client.DownloadString("http://www.youtube.com/watch?v=ER5EnjskCvE");
}
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
List<string> data = new List<string>();
HtmlNodeCollection nodeCollection = doc.DocumentNode.SelectNodes("//*[@id='comments-view']/ul[1]/li[1]/a/span/span/span/span/img");
foreach (HtmlNode node in nodeCollection)
{
data.Add(node.GetAttributeValue("alt",null));
}
But I have a problem: my nodeCollection is null.
For the XPath, I used the Copy XPath option in Chrome's developer tools (F12).
Try replacing "*" with "div":
"/html/body//div[@id='comments-view']/ul[1]/li[1]/a/span/span/span/span/img"
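Whatever XPath is used, SelectNodes returns null rather than an empty collection when nothing matches, so the foreach in the question throws if the expression misses. Below is a minimal sketch of a guard, assuming the HtmlAgilityPack package is referenced; the HTML fragment, the shortened XPath, and the class name NullGuardExample are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class NullGuardExample
{
    static void Main()
    {
        // Hypothetical fragment; it contains no element with id 'comments-view'.
        string html = "<html><body><div id=\"other\"><img alt=\"user1\"></div></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var data = new List<string>();
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//div[@id='comments-view']//img");

        // A null result means the XPath matched nothing in the parsed HTML.
        if (nodes == null)
        {
            Console.WriteLine("XPath matched nothing; check the HTML that was actually downloaded.");
            return;
        }

        foreach (HtmlNode node in nodes)
        {
            // Fall back to an empty string when the alt attribute is missing.
            data.Add(node.GetAttributeValue("alt", string.Empty));
        }
        Console.WriteLine(string.Join(", ", data));
    }
}
```

A null result is also a hint to compare the downloaded HTML string with what the browser shows, since an XPath copied from the browser describes the browser's DOM rather than the raw response.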

Parsing html -> xml and querying with Xpath

I want to parse a html page to get some data.
First, I convert it to XML document using SgmlReader.
Then, I load the result to XMLDocument and then navigate through XPath:
//contains html document
var loadedFile = LoadWebPage();
...
Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
sgmlReader.DocType = "HTML";
sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
sgmlReader.InputStream = new StringReader(loadedFile);
XmlDocument doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.XmlResolver = null;
doc.Load(sgmlReader);
This code works fine for most cases, except on this site - www.arrow.com (try to search something like OP295GS). I can get a table with result using the following XPath:
var node = doc.SelectSingleNode(".//*[@id='results-table']");
This gives me a node with several child nodes:
[0] {Element, Name="thead"}
[1] {Element, Name="tbody"}
[2] {Element, Name="tbody"}
FirstChild {Element, Name="thead"}
Ok, let's try to get some child nodes using XPath. But this doesn't work:
var childNodes = node.SelectNodes("tbody");
//childnodes.Count = 0
This also:
var childNode = node.SelectSingleNode("thead");
// childNode = null
And even this:
var childNode = doc.SelectSingleNode(".//*[@id='results-table']/thead");
What can be wrong in Xpath queries?
I've just tried to parse that HTML page with Html Agility Pack, and my XPath queries work fine there. But my application uses XmlDocument internally, so Html Agility Pack doesn't suit me.
I even tried the following trick with Html Agility Pack, but the XPath queries don't work either:
//let's parse and convert HTML document using HTML Agility Pack and then load
//the result to XmlDocument
HtmlDocument xmlDocument = new HtmlDocument();
xmlDocument.OptionOutputAsXml = true;
xmlDocument.Load(new StringReader(webPage));
XmlDocument document = new XmlDocument();
document.LoadXml(xmlDocument.DocumentNode.InnerHtml);
Perhaps the web page contains errors (not all tags are closed, and so on), but in spite of this I can see the child nodes (through Quick Watch in Visual Studio) and still cannot access them through XPath.
My XPath queries work correctly in Firefox + FirePath + XPather plugins, but don't work in .NET XmlDocument :(
I have not used SgmlReader, but every time I have seen this problem it has been due to namespaces. A quick look at the HTML on www.arrow.com shows that this node has a namespace (note the xmlns:javaurlencoder):
<form name="CatSearchForm" method="post" action="http://components.arrow.com/part/search/OP295GS" xmlns:javaurlencoder="java.net.URLEncoder">
This code is how I loop through all nodes in a document to see which ones have namespaces and which don't. If the node you are looking for or any of its parents have namespaces, you must create a XmlNamespaceManager and pass it along with your call to SelectNodes().
This is kind of annoying, so another idea might be to strip all the xmlns: attributes out of the XML before loading it into a XmlDocument. Then, you won't need to fool with XmlNamespaceManager!
XmlDocument doc = new XmlDocument();
doc.Load(@"C:\temp\X.loadtest.xml");
Dictionary<string, string> namespaces = new Dictionary<string, string>();
XmlNodeList nlAllNodes = doc.SelectNodes("//*");
foreach (XmlNode n in nlAllNodes)
{
if (n.NodeType != XmlNodeType.Element) continue;
if (!String.IsNullOrEmpty(n.NamespaceURI) && !namespaces.ContainsKey(n.Name))
{
namespaces.Add(n.Name, n.NamespaceURI);
}
}
// Inspect the namespaces dictionary to write the code below
XmlNamespaceManager nMgr = new XmlNamespaceManager(doc.NameTable);
// Sometimes this works
nMgr.AddNamespace("ns1", doc.DocumentElement.NamespaceURI);
// You can make the first param whatever you want, it just must match in XPath queries
nMgr.AddNamespace("javaurlencoder", "java.net.URLEncoder");
XmlNodeList iter = doc.SelectNodes("//ns1:TestProfile", nMgr);
foreach (XmlNode n in iter)
{
// Do stuff
}
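To show the namespace effect in isolation, here is a small self-contained sketch that uses only System.Xml. The XML string, the namespace URI, and the prefix "x" are made up for illustration; whether the table on www.arrow.com really ends up in a default namespace after conversion is an assumption that would need checking against the actual document.

```csharp
using System;
using System.Xml;

class NamespaceExample
{
    static void Main()
    {
        // Hypothetical document: the table sits in a default namespace.
        string xml =
            "<table id='results-table' xmlns='http://www.w3.org/1999/xhtml'>" +
            "<thead/><tbody/><tbody/></table>";

        var doc = new XmlDocument();
        doc.LoadXml(xml);
        XmlNode table = doc.DocumentElement;

        // Unprefixed names in XPath 1.0 match only elements in no namespace,
        // so this finds nothing even though the children are visible in the debugger.
        Console.WriteLine(table.SelectNodes("tbody").Count);          // prints 0

        // Bind any prefix to the namespace URI and use that prefix in the query.
        var nsMgr = new XmlNamespaceManager(doc.NameTable);
        nsMgr.AddNamespace("x", "http://www.w3.org/1999/xhtml");
        Console.WriteLine(table.SelectNodes("x:tbody", nsMgr).Count); // prints 2
    }
}
```

The prefix chosen in AddNamespace does not have to match anything in the document; only the namespace URI has to match.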
To be honest, when I am trying to get information from a website I use regex.
OK, Kore Nordmann (in his PHP blog) thinks this is not a good idea, but some of the comments say otherwise.
http://kore-nordmann.de/blog/0081_parse_html_extract_data_from_html.html
http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html
But it is in php, so sorry for this =) Hope it helps anyway.
