Iterate through web pages and download PDFs - c#

I have code that crawls through all the PDF files on a web page and downloads them to a folder. However, it has now started to throw an error:
System.NullReferenceException HResult=0x80004003 Message=Object
reference not set to an instance of an object. Source=NW Crawler
StackTrace: at NW_Crawler.Program.Main(String[] args) in
C:\Users\PC\source\repos\NW Crawler\NW Crawler\Program.cs:line 16
It points to ProductListPage in foreach (HtmlNode src in ProductListPage).
Is there any hint on how to fix this issue? I have tried to implement async/await with no success. Maybe I was doing something wrong, though...
Here is the process to be done:
Go to https://www.nordicwater.com/products/waste-water/
List all links in the related-products section. They look like: <a class="ap-area-link" href="https://www.nordicwater.com/product/mrs-meva-multi-rake-screen/">MRS MEVA multi rake screen</a>
Proceed to each link and search for PDF files. PDF files are in:
<div class="dl-items">
<a href="https://www.nordicwater.com/wp-content/uploads/2016/04/S1126-MRS-brochure-EN.pdf" download="">
Here is my full code for testing:
using HtmlAgilityPack;
using System;
using System.Net;

namespace NW_Crawler
{
    class Program
    {
        static void Main(string[] args)
        {
            HtmlDocument htmlDoc = new HtmlWeb().Load("https://www.nordicwater.com/products/waste-water/");
            HtmlNodeCollection ProductListPage = htmlDoc.DocumentNode.SelectNodes("//a[@class='ap-area-link']//a");
            Console.WriteLine("Here are the links:" + ProductListPage);
            foreach (HtmlNode src in ProductListPage)
            {
                htmlDoc = new HtmlWeb().Load(src.Attributes["href"].Value);
                // Thread.Sleep(5000); // wait some time
                HtmlNodeCollection LinkTester = htmlDoc.DocumentNode.SelectNodes("//div[@class='dl-items']//a");
                if (LinkTester != null)
                {
                    foreach (var dllink in LinkTester)
                    {
                        string LinkURL = dllink.Attributes["href"].Value;
                        Console.WriteLine(LinkURL);
                        string ExtractFilename = LinkURL.Substring(LinkURL.LastIndexOf("/"));
                        var DLClient = new WebClient();
                        // Thread.Sleep(5000); // wait some time
                        DLClient.DownloadFileAsync(new Uri(LinkURL), @"C:\temp\" + ExtractFilename);
                    }
                }
            }
        }
    }
}

Made a couple of changes to cover the errors you might be seeing.
Changes
Use src.GetAttributeValue("href", string.Empty) instead of src.Attributes["href"].Value. If the href attribute is not present, the latter throws "Object reference not set to an instance of an object".
Check that ProductListPage is valid and not null.
ExtractFilename includes a leading / in the name. Add + 1 to the LastIndexOf("/") result in the Substring call so the slash is skipped.
Move on to the next iteration if the href is empty in either of the loops.
Changed the product list query from //a[@class='ap-area-link']//a to //a[@class='ap-area-link']. You were searching for an <a> inside the <a> tag, which returns null. Still, if you want to query it that way, the first if statement checking ProductListPage != null will take care of the error.
HtmlDocument htmlDoc = new HtmlWeb().Load("https://www.nordicwater.com/products/waste-water/");
HtmlNodeCollection ProductListPage = htmlDoc.DocumentNode.SelectNodes("//a[@class='ap-area-link']");

if (ProductListPage != null)
    foreach (HtmlNode src in ProductListPage)
    {
        string href = src.GetAttributeValue("href", string.Empty);
        if (string.IsNullOrEmpty(href))
            continue;

        htmlDoc = new HtmlWeb().Load(href);
        HtmlNodeCollection LinkTester = htmlDoc.DocumentNode.SelectNodes("//div[@class='dl-items']//a");
        if (LinkTester != null)
            foreach (var dllink in LinkTester)
            {
                string LinkURL = dllink.GetAttributeValue("href", string.Empty);
                if (string.IsNullOrEmpty(LinkURL))
                    continue;

                string ExtractFilename = LinkURL.Substring(LinkURL.LastIndexOf("/") + 1);
                new WebClient().DownloadFileAsync(new Uri(LinkURL), @"C:\temp\" + ExtractFilename);
            }
    }

The XPath that you used seems to be incorrect. I tried loading the web page in a browser and searched for that XPath, and got no results. I replaced it with //a[@class='ap-area-link'] and was able to find matching elements.
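Since the question mentions trying async/await: WebClient.DownloadFileAsync only queues the download and returns immediately, so the process can exit before the files are written. Below is a minimal sketch of an awaited download loop using HttpClient. It is an illustration, not the fix above: the C:\temp\ path comes from the question, while DownloadPdfsAsync and pdfLinks are hypothetical names, and the caller needs an async Main (C# 7.1+) or an explicit Wait().

// Minimal sketch of an awaited download loop (illustration only).
// Requires: using System.Collections.Generic; using System.IO;
//           using System.Net.Http; using System.Threading.Tasks;
static async Task DownloadPdfsAsync(IEnumerable<string> pdfLinks)
{
    using (var client = new HttpClient())
    {
        foreach (string linkUrl in pdfLinks)
        {
            // Same file-name extraction as above, skipping the leading slash.
            string fileName = linkUrl.Substring(linkUrl.LastIndexOf("/") + 1);
            byte[] data = await client.GetByteArrayAsync(linkUrl);
            File.WriteAllBytes(Path.Combine(@"C:\temp", fileName), data);
            Console.WriteLine("Saved " + fileName);
        }
    }
}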


How to prevent "stale element" inside a foreach loop?

I'm using Selenium to retrieve data from this site, and I've run into a small problem when I try to click an element inside a foreach loop.
What I'm trying to do
I'm trying to get the table associated with a specific category of odds; at the link above there are several categories.
For example, when I click on Asian handicap -1.75, the site generates a table through JavaScript, so in my code I try to get that table by finding the corresponding element and clicking it.
Code
I have two methods. The first, GetAsianHandicap, iterates over all categories of odds:
public List<T> GetAsianHandicap(Uri fixtureLink)
{
    //Contains all the categories displayed on the page
    string[] categories = new string[] { "-1.75", "-1.5", "-1.25", "-1", "-0.75", "-0.5", "-0.25", "0", "+0.25", "+0.5", "+0.75", "+1", "+1.25", "+1.5", "+1.75" };
    foreach (string cat in categories)
    {
        //Get the html of the table for the current category
        string html = GetSelector("Asian handicap " + cat);
        if (html == string.Empty)
            continue;
        //other code
    }
}
and then the method GetSelector, which clicks on the searched element. This is the design:
public string GetSelector(string selector)
{
    //Get the available table containers (the categories).
    var containers = driver.FindElements(By.XPath("//div[@class='table-container']"));
    //Store the html to return.
    string html = string.Empty;
    foreach (IWebElement container in containers)
    {
        //Container not available for click.
        if (container.GetAttribute("style") == "display: none;")
            continue;
        //Get the container header (contains the description).
        IWebElement header = container.FindElement(By.XPath(".//div[starts-with(@class, 'table-header')]"));
        //Store the table description.
        string description = header.FindElement(By.TagName("a")).Text;
        //The container contains the searched category.
        if (description.Trim() == selector)
        {
            //Get the available links.
            var listItems = driver.FindElement(By.Id("odds-data-table")).FindElements(By.TagName("a"));
            //Get the element to click.
            IWebElement element = listItems.Where(li => li.Text == selector).FirstOrDefault();
            //The element exists.
            if (element != null)
            {
                //Click on the container to load the table.
                element.Click();
                //Wait a few seconds on ChromeDriver for the table to load.
                driver.Manage().Timeouts().ImplicitWait = TimeSpan.FromSeconds(20);
                //Get the new html of the page.
                html = driver.PageSource;
            }
            return html;
        }
    }
    return string.Empty;
}
Problem and exception details
When the foreach reaches this line:
var listItems = driver.FindElement(By.Id("odds-data-table")).FindElements(By.TagName("a"));
I get this exception:
'OpenQA.Selenium.StaleElementReferenceException' in WebDriver.dll
stale element reference: element is not attached to the page document
Searching for this error suggests that the HTML page source has changed, but in this case I store the element to click in one variable and the HTML itself in another, so I can't figure out how to patch this issue.
Could someone help me?
Thanks in advance.
I looked at your code and I think you're making it more complicated than it needs to be. I'm assuming you want to scrape the table that is exposed when you click one of the handicap links. Here's some simple code to do this. It dumps the text of the elements, which ends up unformatted, but you can use it as a starting point and add functionality as you like. I didn't run into any StaleElementReferenceExceptions when running this code, and I never saw the page refresh, so I'm not sure what other people were seeing.
string url = "http://www.oddsportal.com/soccer/europe/champions-league/paok-spartak-moscow-pIXFEt8o/#ah;2";
driver.Url = url;

// get all the (visible) handicap links and click them to open the page and display the table with odds
IReadOnlyCollection<IWebElement> links = driver.FindElements(By.XPath("//a[contains(.,'Asian handicap')]")).Where(e => e.Displayed).ToList();
foreach (var link in links)
{
    link.Click();
}

// print all the odds tables
foreach (var item in driver.FindElements(By.XPath("//div[@class='table-container']")))
{
    Console.WriteLine(item.Text);
    Console.WriteLine("====================================");
}
I would suggest that you spend some more time learning locators. Locators are very powerful and can save you from having to stack nested loops looking for one thing... and then children of that thing... and then children of that thing... and so on. The right locator can find all of that in one scrape of the page, which saves a lot of code and time.
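As a rough illustration of that point, one compound XPath can pick out the visible handicap header links directly, without nesting loops. This is a sketch only: the class names are copied from the question's snippets and should be verified against the live page.

// Sketch: one compound locator instead of nested container/header loops.
// Class names come from the question; verify them against the live page.
var headerLinks = driver.FindElements(By.XPath(
    "//div[@class='table-container'][not(contains(@style,'display: none'))]" +
    "//div[starts-with(@class,'table-header')]//a"));

foreach (IWebElement link in headerLinks)
{
    Console.WriteLine(link.Text); // e.g. "Asian handicap -1.75"
}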
As mentioned in the related post, this issue occurs because the site performs an auto refresh.
Solution 1:
If there is an explicit way to refresh, I would suggest performing that refresh on a periodic basis, or only at the points where you know a refresh is needed.
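A minimal sketch of that idea, assuming a plain page reload is acceptable; note that every element reference obtained before the refresh is stale afterwards and has to be located again.

// Sketch: reload the page explicitly, then re-locate elements,
// since any IWebElement obtained before the refresh is now stale.
driver.Navigate().Refresh();
var containers = driver.FindElements(By.XPath("//div[@class='table-container']"));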
Solution 2:
Create extension methods for FindElement and FindElements so that they try to get the element within a given timeout.
// These must live in a static class to work as extension methods.
// Requires: using OpenQA.Selenium.Support.UI; in newer Selenium versions
// ExpectedConditions may live in the SeleniumExtras.WaitHelpers package instead.
public static IWebElement FindElement(this IWebDriver driver, By by, int timeout)
{
    if (timeout > 0)
    {
        return new WebDriverWait(driver, TimeSpan.FromSeconds(timeout)).Until(ExpectedConditions.ElementToBeClickable(by));
    }
    return driver.FindElement(by);
}

public static IReadOnlyCollection<IWebElement> FindElements(this IWebDriver driver, By by, int timeout)
{
    if (timeout > 0)
    {
        return new WebDriverWait(driver, TimeSpan.FromSeconds(timeout)).Until(ExpectedConditions.PresenceOfAllElementsLocatedBy(by));
    }
    return driver.FindElements(by);
}
Your code would then use these like this:
var listItems = driver.FindElement(By.Id("odds-data-table"), 30).FindElements(By.TagName("a"),30);
Solution 3:
Handle StaleElementReferenceException using an extension method:
public static IWebElement FindElement(this IWebDriver driver, By by, int maxAttempt)
{
    for (int attempt = 0; attempt < maxAttempt; attempt++)
    {
        try
        {
            return driver.FindElement(by);
        }
        catch (StaleElementReferenceException)
        {
            // Element went stale; try again.
        }
    }
    return null;
}

public static IReadOnlyCollection<IWebElement> FindElements(this IWebDriver driver, By by, int maxAttempt)
{
    for (int attempt = 0; attempt < maxAttempt; attempt++)
    {
        try
        {
            return driver.FindElements(by);
        }
        catch (StaleElementReferenceException)
        {
            // Elements went stale; try again.
        }
    }
    return null;
}
Your code will use these like this:
var listItems = driver.FindElement(By.Id("odds-data-table"), 2).FindElements(By.TagName("a"),2);
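Another hedged variant of the same idea: instead of retrying each call separately, wrap the whole lookup-and-read in one retry loop so that a refresh between the FindElement and the click simply triggers a fresh attempt. The locators and the selector variable come from the question's GetSelector method; the retry count is arbitrary.

// Sketch: redo the whole lookup when any step hits a stale reference.
IWebElement element = null;
for (int attempt = 0; attempt < 3 && element == null; attempt++)
{
    try
    {
        var listItems = driver.FindElement(By.Id("odds-data-table"))
                              .FindElements(By.TagName("a"));
        element = listItems.FirstOrDefault(li => li.Text == selector);
    }
    catch (StaleElementReferenceException)
    {
        // The DOM changed under us; loop and locate everything again.
    }
}

if (element != null)
    element.Click();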
Use this:
string description = header.FindElement(By.XPath("strong/a")).Text;
instead of your:
string description = header.FindElement(By.TagName("a")).Text;

HtmlAgilityPack search url link

I'm creating a Windows Forms application for a group of friends. I'm using HtmlAgilityPack in my application.
I need to find all versions of the TacO addon, like this:
<li><a href='https://www.dropbox.com/s/nks140nf794tx77/GW2TacO_034r.zip?dl=0'>Download Build 034.1866r</a></li>
Additionally, I need to check for the latest version so the file can be downloaded from its URL, as in the code below:
public static bool Tacoisuptodate(string Version)
{
    // Load the HtmlDocument
    var doc = new HtmlWeb().Load("http://www.gw2taco.com/");
    var body = doc.DocumentNode.SelectNodes("//body").Single();

    // Narrow the document down to the part that interests us
    foreach (var node in doc.DocumentNode.SelectNodes("//div"))
    {
        // Check for null values
        var classeValue = node.Attributes["class"]?.Value;
        var idValue = node.Attributes["id"]?.Value;
        var hrefValue = node.Attributes["href"]?.Value;

        // We are looking for <div class="widget LinkList" id="LinkList1"> on the home page
        if (classeValue == "widget LinkList" && idValue == "LinkList1")
        {
            foreach (HtmlNode content in node.SelectNodes("//li"))
            {
                Debug.Write(content.GetAttributeValue("href", false));
            }
        }
    }
    return false;
}
If somebody could help me, I would really appreciate it.
A single XPath is enough.
var xpath = "//h2[text()='Downloads']/following-sibling::div[#class='widget-content']/ul/li/a";
var doc = new HtmlAgilityPack.HtmlWeb().Load("http://www.gw2taco.com/");
var downloads = doc.DocumentNode.SelectNodes(xpath)
    .Select(li => new
    {
        href = li.Attributes["href"].Value,
        name = li.InnerText
    })
    .ToList();
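Dropped into the question's Tacoisuptodate method, a hedged follow-up might look like the sketch below. The "Download Build ..." text format and the newest-first ordering of the Downloads widget are assumptions based on the sample <li> in the question, not something verified against the live site.

// Sketch: print the builds and check whether the requested version appears.
// Assumes anchor texts look like "Download Build 034.1866r" and the first entry is the newest.
foreach (var d in downloads)
    Debug.WriteLine(d.name + " -> " + d.href);

bool upToDate = downloads.Count > 0 && downloads[0].name.Contains(Version);
return upToDate;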

Gets link with searching phrase

I am using HtmlAgilityPack, and for a university assignment I was given the task of getting all links located next to the word "source" and related terms. I tried the following code:
foreach (HtmlNode link in document.DocumentNode.SelectNodes(".//a[@href]"))
{
    if (document.DocumentNode.InnerHtml.ToString().Contains(sourcesDictionary[i]))
    {
        string hrefValue = link.GetAttributeValue("href", string.Empty);
        Console.WriteLine(hrefValue);
    }
}
But it just prints every link in the HTML document. What can I change to get it working properly?
Adding the highlighted check may help:
foreach (HtmlNode link in document.DocumentNode.SelectNodes(".//a[@href]"))
{
    if (link.ParentNode.InnerHtml.Contains("source")) // <-- the added check
    {
        if (document.DocumentNode.InnerHtml.ToString().Contains(sourcesDictionary[i]))
        {
            string hrefValue = link.GetAttributeValue("href", string.Empty);
            Console.WriteLine(hrefValue);
        }
    }
}
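If the nested checks feel heavy, the same filter can be expressed with HtmlAgilityPack's LINQ helpers. A minimal sketch, assuming "source" is the phrase being matched (swap in sourcesDictionary[i] as needed) and that System.Linq is imported:

// Sketch: anchors whose parent element mentions the search phrase.
var sourceLinks = document.DocumentNode
    .Descendants("a")
    .Where(a => a.Attributes["href"] != null
             && a.ParentNode.InnerHtml.Contains("source"));

foreach (HtmlNode link in sourceLinks)
    Console.WriteLine(link.GetAttributeValue("href", string.Empty));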

HtmlAgilityPack ArgumentOutOfRangeException

I'm trying to parse a website's content on Windows Phone using the HtmlAgilityPack. My current code is:
HtmlWeb.LoadAsync(url, DownloadCompleted);
...
void DownloadCompleted(object sender, HtmlDocumentLoadCompleted e)
{
    if (e.Error == null)
    {
        HtmlDocument doc = e.Document;
        if (doc != null)
        {
            string test = doc.DocumentNode.Element("html").Element("body").Element("form").Elements("div").ElementAt(2).Element("table").Element("tbody").Elements("tr").ElementAt(4).Element("td").Element("center").Element("div").InnerText.ToString();
            System.Diagnostics.Debug.WriteLine(test);
        }
    }
}
Currently, when I run the above, I get an ArgumentOutOfRangeException on the line that builds string test from that long Element(...) chain.
doc.DocumentNode.Element("html").InnerText.ToString() seems to give me the source code for the entire page.
The URL of the website I'm trying to parse is: http://polyclinic.singhealth.com.sg/Webcams/QimgPage.aspx?Loc_Code=BDP
It looks like you're after a specific div; if I'm not mistaken, the one you want has a unique identifier: <td class="queueNo"><center><div id="divRegPtwVal">0</div></center></td>.
Why not simply use doc.DocumentNode.SelectSingleNode("//div[@id='divRegPtwVal']") or doc.DocumentNode.Descendants("div").Where(div => div.Id == "divRegPtwVal").FirstOrDefault()?
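Dropped into the existing DownloadCompleted handler, that suggestion would look roughly like this (a sketch, not tested against the live page):

void DownloadCompleted(object sender, HtmlDocumentLoadCompleted e)
{
    if (e.Error != null || e.Document == null)
        return;

    // Target the div by its id instead of walking the whole element chain.
    var queueNode = e.Document.DocumentNode.SelectSingleNode("//div[@id='divRegPtwVal']");
    if (queueNode != null)
        System.Diagnostics.Debug.WriteLine(queueNode.InnerText);
}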
To select the image source for a specific image by id:
var attrib = doc.DocumentNode.SelectSingleNode("//img[@id='imgCam2']/@src");
//I suspect this might need a slightly different property, I can't check right now
string src = attrib.InnerText;
Or:
var img = doc.DocumentNode.Descendants("img").FirstOrDefault(i => i.Id == "imgCam2");
string src = img.Attributes["src"].Value;
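If the goal is to fetch the webcam frame itself once src is known, here is a minimal sketch. It assumes a desktop WebClient is available (on Windows Phone you would use the async download APIs instead), that the relative src must be resolved against the page URL, and the destination path is just an example.

// Sketch: resolve the relative src against the page and download the frame.
var baseUri = new Uri("http://polyclinic.singhealth.com.sg/Webcams/QimgPage.aspx?Loc_Code=BDP");
var imageUri = new Uri(baseUri, src); // 'src' from the snippet above

using (var client = new WebClient())
{
    client.DownloadFile(imageUri, @"C:\temp\cam2.jpg"); // example destination path
}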

Reading links in header using WebKit.NET

I am trying to figure out how to read header links using C#.NET. I want to get the edit link from browser 1 and put it in browser 2. My problem is that I can't figure out how to get at the attributes, or even the link tags, for that matter. Below is what I have now.
using System.Xml.Linq;
...
string source = webKitBrowser1.DocumentText.ToString();
XDocument doc = new XDocument(XDocument.Parse(source));
webKitBrowser2.Navigate(doc.Element("link").Attribute("href").Value.ToString());
This would work except that XML is different from HTML, and right off the bat it complains that it expected "DOCTYPE" to be uppercase.
I finally figured it out, so I will post it for anyone who has the same question.
string site = webKitBrowser1.Url.Scheme + "://" + webKitBrowser1.Url.Authority;
WebKit.DOM.Document doc = webKitBrowser1.Document;
WebKit.DOM.NodeList links = doc.GetElementsByTagName("link");
WebKit.DOM.Element link;
string editlink = "none";
foreach (var item in links)
{
link = (WebKit.DOM.Element)item;
if (link.Attributes["rel"].NodeValue == "edit") { editlink = link.Attributes["href"].NodeValue; }
}
if (editlink != "none") { webKitBrowser2.Navigate(site + editlink); }
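For what it's worth, the original XDocument failure can also be sidestepped with an HTML-aware parser. Below is a hedged sketch using HtmlAgilityPack, assuming the edit link carries rel="edit" as in the solution above and that a relative href still needs the site prefix computed there.

// Sketch: parse the browser's HTML with HtmlAgilityPack instead of XDocument.
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(webKitBrowser1.DocumentText.ToString());

var editLink = doc.DocumentNode.SelectSingleNode("//link[@rel='edit']");
if (editLink != null)
    webKitBrowser2.Navigate(site + editLink.GetAttributeValue("href", string.Empty));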
