I heard good things about the HtmlAgilityPack library, so I thought I'd give it a try, but I have had absolutely zero success with it, and I've been trying to figure this out for months. No matter what I do, I cannot get this code to give me anything other than null. I tried following this example (http://www.c-sharpcorner.com/uploadfile/9b86d4/getting-started-with-html-agility-pack/), but I do not get the same results and I cannot explain why.
I try loading the file and then run SelectNodes to select all hyperlinks, but it always returns an empty list. I've tried selecting all kinds of nodes (divs, p, a, anything and everything) and it always returns an empty list. I've tried using doc.Descendants, and I've tried using different source files, both locally and on the web, and nothing I do ever returns an actual result.
I must have overlooked something important, but I cannot figure out what it is. What could I be missing?
Code:
public string GetSource()
{
    try
    {
        string result = "";
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        if (!System.IO.File.Exists("htmldoc.html"))
            throw new Exception("Unable to load doc");
        doc.LoadHtml("htmldoc.html"); // copied locally to bin folder, confirmed it found the file and loaded it
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a"); // Always returns null, regardless of what I put in here
        if (nodes != null)
        {
            foreach (HtmlNode item in nodes)
            {
                result += item.InnerText;
            }
        }
        else
        {
            // Every. Single. Time.
            throw new Exception("No matching nodes found in document");
        }
        return result;
    }
    catch (Exception ex)
    {
        return ex.ToString();
    }
}
The source HTML file 'htmldoc.html' I'm using looks like this:
<html>
<head>
    <title>Testing HTML Agility Pack</title>
</head>
<body>
    <div id="div1">
        <a href="#">Link 1 inside div1</a>
        <a href="#">Link 2 inside div1</a>
    </div>
    <a href="#">Link 3 outside all divs</a>
    <div id="div2">
        <a href="#">Link 1 inside div2</a>
        <a href="#">Link 2 inside div2</a>
    </div>
</body>
</html>
To load a file, you should use the Load method. LoadHtml is for strings containing HTML:
doc.Load("htmldoc.html");
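A quick sketch of the difference, using the same file:

var doc = new HtmlAgilityPack.HtmlDocument();

// Load takes a path (or stream) and reads the file's contents.
doc.Load("htmldoc.html");

// LoadHtml takes the markup itself as a string. The original call parsed the
// literal eleven characters "htmldoc.html" as a document with no <a> elements,
// which is why SelectNodes("//a") returned null.
// doc.LoadHtml("<html><body><a href='#'>a link</a></body></html>");

var nodes = doc.DocumentNode.SelectNodes("//a"); // now finds the links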
I have code that crawls through all the PDF files on a web page and downloads them to a folder. However, it has now started to throw an error:
System.NullReferenceException
  HResult=0x80004003
  Message=Object reference not set to an instance of an object.
  Source=NW Crawler
  StackTrace:
   at NW_Crawler.Program.Main(String[] args) in C:\Users\PC\source\repos\NW Crawler\NW Crawler\Program.cs:line 16
It points to ProductListPage in foreach (HtmlNode src in ProductListPage).
Is there any hint on how to fix this issue? I have tried to implement async/await with no success. Maybe I was doing something wrong, though...
Here is the process to be done:
Go to https://www.nordicwater.com/products/waste-water/
List all the links in the related-products section. They look like this: <a class="ap-area-link" href="https://www.nordicwater.com/product/mrs-meva-multi-rake-screen/">MRS MEVA multi rake screen</a>
Proceed to each link and search for PDF files. PDF files are in:
<div class="dl-items">
<a href="https://www.nordicwater.com/wp-content/uploads/2016/04/S1126-MRS-brochure-EN.pdf" download="">
Here is my full code for testing:
using HtmlAgilityPack;
using System;
using System.Net;

namespace NW_Crawler
{
    class Program
    {
        static void Main(string[] args)
        {
            HtmlDocument htmlDoc = new HtmlWeb().Load("https://www.nordicwater.com/products/waste-water/");
            HtmlNodeCollection ProductListPage = htmlDoc.DocumentNode.SelectNodes("//a[@class='ap-area-link']//a");
            Console.WriteLine("Here are the links:" + ProductListPage);
            foreach (HtmlNode src in ProductListPage)
            {
                htmlDoc = new HtmlWeb().Load(src.Attributes["href"].Value);
                // Thread.Sleep(5000); // wait some time
                HtmlNodeCollection LinkTester = htmlDoc.DocumentNode.SelectNodes("//div[@class='dl-items']//a");
                if (LinkTester != null)
                {
                    foreach (var dllink in LinkTester)
                    {
                        string LinkURL = dllink.Attributes["href"].Value;
                        Console.WriteLine(LinkURL);
                        string ExtractFilename = LinkURL.Substring(LinkURL.LastIndexOf("/"));
                        var DLClient = new WebClient();
                        // Thread.Sleep(5000); // wait some time
                        DLClient.DownloadFileAsync(new Uri(LinkURL), @"C:\temp\" + ExtractFilename);
                    }
                }
            }
        }
    }
}
Made a couple of changes to cover the errors you might be seeing.
Changes
Use src.GetAttributeValue("href", string.Empty) instead of src.Attributes["href"].Value. If the href attribute is not present, the latter throws "Object reference not set to an instance of an object".
Check that ProductListPage is valid and not null before iterating.
ExtractFilename included the leading / in the file name. Add + 1 to LastIndexOf("/") in the Substring call so the name starts after the last slash.
Move on to the next iteration if the href is empty in either of the loops.
Changed the product-list query to //a[@class='ap-area-link'] from //a[@class='ap-area-link']//a. The original searched for an <a> within the <a> tag, which matches nothing. Still, if you want to query it that way, the first if statement checking ProductListPage != null will take care of the errors.
HtmlDocument htmlDoc = new HtmlWeb().Load("https://www.nordicwater.com/products/waste-water/");
HtmlNodeCollection ProductListPage = htmlDoc.DocumentNode.SelectNodes("//a[@class='ap-area-link']");
if (ProductListPage != null)
{
    foreach (HtmlNode src in ProductListPage)
    {
        string href = src.GetAttributeValue("href", string.Empty);
        if (string.IsNullOrEmpty(href))
            continue;
        htmlDoc = new HtmlWeb().Load(href);
        HtmlNodeCollection LinkTester = htmlDoc.DocumentNode.SelectNodes("//div[@class='dl-items']//a");
        if (LinkTester != null)
        {
            foreach (var dllink in LinkTester)
            {
                string LinkURL = dllink.GetAttributeValue("href", string.Empty);
                if (string.IsNullOrEmpty(LinkURL))
                    continue;
                string ExtractFilename = LinkURL.Substring(LinkURL.LastIndexOf("/") + 1);
                new WebClient().DownloadFileAsync(new Uri(LinkURL), @"C:\temp\" + ExtractFilename);
            }
        }
    }
}
The XPath that you used seems to be incorrect. I tried loading the web page in a browser, searched for that XPath, and got no results. I replaced it with //a[@class='ap-area-link'] and was able to find matching elements.
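One more caveat worth knowing (my note, not part of the fix above): WebClient.DownloadFileAsync returns immediately, so a console app can reach the end of Main and exit before the downloads finish. The synchronous overload avoids that, at the cost of downloading one file at a time:

using (var client = new WebClient())
{
    // Blocks until the file is completely written, so the loop
    // only ends after every download has finished.
    client.DownloadFile(new Uri(LinkURL), @"C:\temp\" + ExtractFilename);
}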
I am trying to parse the number shown in this page:
https://www.edf.org/embed/methane-counters
I have tried WebBrowser, WebClient, etc., with no good result. Every time I try something new, the HTML returned contains this (the area where the number is shown):
<strong id="methane"></strong>
... as you can see, there is no number between the <strong> tags. Just in case, this is the latest code I have tried, which still does not work:
using (WebBrowser myWebBrowser = new WebBrowser())
{
    myWebBrowser.ScriptErrorsSuppressed = true;
    myWebBrowser.Navigate(myURL);
    while (myWebBrowser.ReadyState != WebBrowserReadyState.Complete)
        Application.DoEvents();
    myContent = myWebBrowser.Document.Body.InnerHtml;
    myContent = myWebBrowser.DocumentText;
}
... neither of the last two calls returns the HTML with the number in it.
Any ideas on how to get the proper content of this page?
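For what it's worth, the counter appears to be filled into <strong id="methane"> by the page's JavaScript after load, so a raw download (WebClient) will never contain it, and even WebBrowser has to wait for the script to run. A rough sketch of polling the element after the document completes (the retry count and delay are arbitrary, and the WebBrowser control's legacy IE engine may still fail to run the site's scripts at all):

myWebBrowser.Navigate(myURL);
while (myWebBrowser.ReadyState != WebBrowserReadyState.Complete)
    Application.DoEvents();

// ReadyState.Complete only means the document loaded; the script that
// fills in the counter may run later, so poll the element for a while.
string number = null;
for (int i = 0; i < 50 && string.IsNullOrEmpty(number); i++)
{
    var el = myWebBrowser.Document.GetElementById("methane");
    if (el != null)
        number = el.InnerText;
    System.Threading.Thread.Sleep(100);
    Application.DoEvents();
}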
I've been trying, without luck, to use IJavaScriptExecutor to find a specific header string in a page. Here's the HTML code from the page:
<div class="wrap">
<h2>Edit Page <a href="http://www.webtest.bugrit.net/wordpress/wp-admin/post-new.php?post_type=page" class="add-new-h2">Add New</a></h2>
<div id...
The text I need to check for is the "Edit Page" string.
This is the closest I've come, which isn't very close:
var element = FFDriver.Instance.FindElements(By.ClassName("add-new-h2"));
IJavaScriptExecutor js = FFDriver.Instance as IJavaScriptExecutor;
if (js != null) {
    string innerHtml = (string)js.ExecuteScript("return arguments[0].innerHTML;", element);
    //System.Windows.Forms.MessageBox.Show(innerHtml);
    if (innerHtml.Equals("Edit Page")) {
        return true;
    } else {
        return false;
    }
}
Now, I realize that the text I should expect to get from that code isn't the exact string "Edit Page". But shouldn't it return something? When I enable the MessageBox line, the innerHtml string is empty.
Or, of course, if someone knows another, possibly easier, way to check for the existence of a specific string inside a specific HTML tag, I'm all ears.
Your search returns the <a> element, not the <h2>, and the <a> doesn't contain the "Edit Page" string.
Try finding the parent <h2> element like this (only if the class name add-new-h2 is unique; otherwise you will get the first matching one):
var element = FFDriver.Instance.FindElement(By.XPath(".//a[@class='add-new-h2']/.."));
var containsText = element.Text.Contains("Edit Page");
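Alternatively, to answer the last part of the question, you can match on the header text directly. A sketch (not from the original answer; contains(., ...) tests the element's whole string value, including text of nested elements like the link):

var headers = FFDriver.Instance.FindElements(By.XPath("//h2[contains(., 'Edit Page')]"));
bool hasEditPage = headers.Count > 0; // true if any <h2> contains the text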
I am using the GeckoFX 22 C# web browser control but cannot manage to access tags within an iframe. When I check the Gecko InnerHtml, it seems that although the iframe tag shows in the HTML, its contents do not.
This is the code I used to get the inner html of the browser control which just shows the iframe tag as empty (when it should have another doc inside of it):
GeckoHtmlElement element = null;
var geckoDomElement = webBrowser.Document.DocumentElement;
if (geckoDomElement is GeckoHtmlElement)
{
    element = (GeckoHtmlElement)geckoDomElement;
    var innerHtml = element.InnerHtml;
}
Previously I used code similar to the code below to access individual elements which works fine:
GeckoDocument checkDoc = (GeckoDocument)webBrowser.Window.Document;
var x = (checkDoc.GetElementsByTagName("a").Where(b => b.Id == "ipt-form-format-aside").First());
I am able to get individual elements and change their values/trigger events etc. without problems in the main HTML document, but it is impossible to get at the elements of anything in an iframe. I think perhaps the iframe has not been loaded yet, or something like that. Is there a way to force the control to wait for the iframe to load before attempting to access its elements?
string content = null;
var iframe = webBrowser.Document.GetElementsByTagName("iframe").FirstOrDefault() as Gecko.DOM.GeckoIFrameElement;
if (iframe != null)
{
    var html = iframe.ContentDocument.DocumentElement as GeckoHtmlElement;
    if (html != null)
        content = html.OuterHtml;
}
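To address the timing concern in the question, one option (a sketch of my own, assuming GeckoFX's DocumentCompleted event; the inner frame can still finish after the outer document, so keep the null checks) is to delay the lookup until navigation finishes:

webBrowser.DocumentCompleted += (s, e) =>
{
    var iframe = webBrowser.Document.GetElementsByTagName("iframe")
        .FirstOrDefault() as Gecko.DOM.GeckoIFrameElement;
    // ContentDocument may still be null if the frame loads later than
    // the top document, in which case a re-check is needed.
    if (iframe != null && iframe.ContentDocument != null)
    {
        var html = iframe.ContentDocument.DocumentElement as GeckoHtmlElement;
        if (html != null)
            Console.WriteLine(html.OuterHtml);
    }
};
webBrowser.Navigate("http://example.com"); // hypothetical URL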
I'm just posting this for anyone else who might run into this problem:
foreach (GeckoIFrameElement _E in geckoWebBrowser1.Document.GetElementsByTagName("iframe"))
{
    if (_E.GetAttribute("class") == "testClass")
    {
        var innerHTML = _E.ContentDocument;
        foreach (GeckoHtmlElement _A in innerHTML.GetElementsByTagName("input"))
        {
            _A.SetAttribute("value", "Test");
        }
    }
}
I had a similar problem, so I did this:
checkDoc.Window.Frames(1)
instead of
checkDoc.GetElementsByTagName("iframe")
The value within the parentheses (i.e. 1 here) depends on your index.
I need to change the CSS of lots of pages, so I took the chance to play with the HTML Agility Pack. I can read the CSS entries I have to change just fine, but I have no idea how to change their href.
Here is an example of what I want to change:
<link rel="stylesheet" type="text/css" href="http://cdn.mysite.com/master/public.css?rev=012010">
More specifically, the href:
http://cdn.mysite.com/master/public.css?rev=012010
I've looked around but haven't found the answer yet.
var nodes = doc.DocumentNode.SelectNodes("//link[@type=\"text/css\"]");
if (nodes != null)
{
    foreach (HtmlNode data in nodes)
    {
        if (data.Attributes["href"] == null)
            continue;
        //data.Attributes["href"].Value;
    }
}
To sum up: how can I change the href and save it back?
data.Attributes["href"].Value = "Whatever you want";
...
...
doc.Save(stream);
// or:
string content = doc.DocumentNode.OuterHtml;
Try the following:
var nodes = doc.DocumentNode.SelectNodes("//link[@type='text/css']");
It will select the nodes correctly.
There is a method on the HtmlNode class called SetAttributeValue that you can use to set the new value. Once you have set the value, you can access the changed HTML content using doc.DocumentNode.OuterHtml.
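Putting the pieces together, a minimal sketch (the file name and the new rev value are made up for illustration):

var doc = new HtmlAgilityPack.HtmlDocument();
doc.Load("page.html"); // hypothetical input file

var nodes = doc.DocumentNode.SelectNodes("//link[@type='text/css']");
if (nodes != null)
{
    foreach (var node in nodes)
    {
        // SetAttributeValue updates the attribute, creating it if missing.
        node.SetAttributeValue("href", "http://cdn.mysite.com/master/public.css?rev=022010");
    }
}

doc.Save("page.html"); // write the modified document back to disk
string updated = doc.DocumentNode.OuterHtml; // or grab it as a string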