Get links with a search phrase - C#

I am using HtmlAgilityPack, and at university I was given a task to get all links located next to the word "source" and related words. I tried the following code:
foreach (HtmlNode link in document.DocumentNode.SelectNodes(".//a[@href]"))
{
    if (document.DocumentNode.InnerHtml.ToString().Contains(sourcesDictionary[i]))
    {
        string hrefValue = link.GetAttributeValue("href", string.Empty);
        Console.WriteLine(hrefValue);
    }
}
But it just prints all the links in the HTML document. What can I change to get it working properly?

Adding a check on the link's parent node may help:
foreach (HtmlNode link in document.DocumentNode.SelectNodes(".//a[@href]"))
{
    if (link.ParentNode.InnerHtml.Contains("source"))
    {
        if (document.DocumentNode.InnerHtml.ToString().Contains(sourcesDictionary[i]))
        {
            string hrefValue = link.GetAttributeValue("href", string.Empty);
            Console.WriteLine(hrefValue);
        }
    }
}
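A variation on the same idea (a sketch assuming the same HtmlAgilityPack document and loop; searchWord is an illustrative stand-in for sourcesDictionary[i]) is to push the text check into the XPath itself:

```csharp
// Sketch: select only <a href> elements whose parent element's text
// contains the search word, instead of filtering after selection.
// searchWord is an illustrative stand-in for sourcesDictionary[i].
string searchWord = "source";
HtmlNodeCollection nodes = document.DocumentNode.SelectNodes(
    ".//a[@href][contains(parent::*, '" + searchWord + "')]");
if (nodes != null) // SelectNodes returns null when nothing matches
{
    foreach (HtmlNode link in nodes)
    {
        Console.WriteLine(link.GetAttributeValue("href", string.Empty));
    }
}
```

Note that contains(parent::*, ...) tests the parent's text content, which is close to, but not identical to, checking ParentNode.InnerHtml.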


Iterate through web pages and download PDFs

I have code that crawls through all PDF files on a web page and downloads them to a folder. However, it recently started throwing an error:
System.NullReferenceException HResult=0x80004003 Message=Object
reference not set to an instance of an object. Source=NW Crawler
StackTrace: at NW_Crawler.Program.Main(String[] args) in
C:\Users\PC\source\repos\NW Crawler\NW Crawler\Program.cs:line 16
It points to ProductListPage in foreach (HtmlNode src in ProductListPage).
Is there any hint on how to fix this issue? I have tried to implement async/await with no success. Maybe I was doing something wrong, though...
Here is the process to be done:
Go to https://www.nordicwater.com/products/waste-water/
List all links in section (related products). They are: <a class="ap-area-link" href="https://www.nordicwater.com/product/mrs-meva-multi-rake-screen/">MRS MEVA multi rake screen</a>
Proceed to each link and search for PDF files. PDF files are in:
<div class="dl-items">
<a href="https://www.nordicwater.com/wp-content/uploads/2016/04/S1126-MRS-brochure-EN.pdf" download="">
Here is my full code for testing:
using HtmlAgilityPack;
using System;
using System.Net;

namespace NW_Crawler
{
    class Program
    {
        static void Main(string[] args)
        {
            HtmlDocument htmlDoc = new HtmlWeb().Load("https://www.nordicwater.com/products/waste-water/");
            HtmlNodeCollection ProductListPage = htmlDoc.DocumentNode.SelectNodes("//a[@class='ap-area-link']//a");
            Console.WriteLine("Here are the links:" + ProductListPage);
            foreach (HtmlNode src in ProductListPage)
            {
                htmlDoc = new HtmlWeb().Load(src.Attributes["href"].Value);
                // Thread.Sleep(5000); // wait some time
                HtmlNodeCollection LinkTester = htmlDoc.DocumentNode.SelectNodes("//div[@class='dl-items']//a");
                if (LinkTester != null)
                {
                    foreach (var dllink in LinkTester)
                    {
                        string LinkURL = dllink.Attributes["href"].Value;
                        Console.WriteLine(LinkURL);
                        string ExtractFilename = LinkURL.Substring(LinkURL.LastIndexOf("/"));
                        var DLClient = new WebClient();
                        // Thread.Sleep(5000); // wait some time
                        DLClient.DownloadFileAsync(new Uri(LinkURL), @"C:\temp\" + ExtractFilename);
                    }
                }
            }
        }
    }
}
Made a couple of changes to cover the errors you might be seeing.
Changes
Use src.GetAttributeValue("href", string.Empty) instead of src.Attributes["href"].Value. If the href attribute is not present, the latter throws Object reference not set to an instance of an object.
Check that ProductListPage is valid and not null.
ExtractFilename includes a leading / in the name. Use + 1 in the Substring call so the / returned by LastIndexOf is skipped.
Move on to the next iteration if the href is null in either of the loops.
Changed the product list query to //a[@class='ap-area-link'] from //a[@class='ap-area-link']//a. You were searching for an <a> within the <a> tag, which is null. Still, if you want to query it this way, the first IF statement checking ProductListPage != null will take care of the errors.
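The Substring fix can be checked in isolation with plain C# (the URL below is just an illustrative example):

```csharp
using System;

class FilenameDemo
{
    static void Main()
    {
        string linkUrl = "https://example.com/uploads/S1126-MRS-brochure-EN.pdf";

        // Without + 1 the leading slash is kept in the file name.
        Console.WriteLine(linkUrl.Substring(linkUrl.LastIndexOf("/")));     // /S1126-MRS-brochure-EN.pdf
        // With + 1 only the file name itself remains.
        Console.WriteLine(linkUrl.Substring(linkUrl.LastIndexOf("/") + 1)); // S1126-MRS-brochure-EN.pdf
    }
}
```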
HtmlDocument htmlDoc = new HtmlWeb().Load("https://www.nordicwater.com/products/waste-water/");
HtmlNodeCollection ProductListPage = htmlDoc.DocumentNode.SelectNodes("//a[@class='ap-area-link']");
if (ProductListPage != null)
    foreach (HtmlNode src in ProductListPage)
    {
        string href = src.GetAttributeValue("href", string.Empty);
        if (string.IsNullOrEmpty(href))
            continue;
        htmlDoc = new HtmlWeb().Load(href);
        HtmlNodeCollection LinkTester = htmlDoc.DocumentNode.SelectNodes("//div[@class='dl-items']//a");
        if (LinkTester != null)
            foreach (var dllink in LinkTester)
            {
                string LinkURL = dllink.GetAttributeValue("href", string.Empty);
                if (string.IsNullOrEmpty(LinkURL))
                    continue;
                string ExtractFilename = LinkURL.Substring(LinkURL.LastIndexOf("/") + 1);
                new WebClient().DownloadFileAsync(new Uri(LinkURL), @"C:\temp\" + ExtractFilename);
            }
    }
The XPath that you used seems to be incorrect. I loaded the web page in a browser and searched for that XPath with no results. I replaced it with //a[@class='ap-area-link'] and was able to find matching elements.
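One caveat worth flagging (an observation about WebClient, not something claimed above): DownloadFileAsync returns immediately, so a console app can reach the end of Main before the downloads finish. A minimal sketch that blocks until each file is written, with a hypothetical URL and target path:

```csharp
using System;
using System.Net;

class DownloadSketch
{
    static void Main()
    {
        // Hypothetical URL and target folder, for illustration only.
        string linkUrl = "https://example.com/files/brochure.pdf";
        string target = @"C:\temp\" + linkUrl.Substring(linkUrl.LastIndexOf("/") + 1);

        using (var client = new WebClient())
        {
            // Blocks until the file is fully written, so the calling
            // loop cannot outrun the downloads.
            client.DownloadFile(new Uri(linkUrl), target);
        }
        Console.WriteLine("Saved " + target);
    }
}
```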

How to prevent "stale element" inside a foreach loop?

I'm using Selenium to retrieve data from this site, and I encountered a little problem when I try to click an element within a foreach loop.
What I'm trying to do
I'm trying to get the table associated with a specific category of odds; the page offers several categories. When I click on Asian handicap -1.75, the site generates a table through JavaScript, so in my code I try to find the corresponding element and click it.
Code
I have two methods. The first, GetAsianHandicap, iterates over all categories of odds:
public List<T> GetAsianHandicap(Uri fixtureLink)
{
    // Contains all the categories displayed on the page.
    string[] categories = new string[] { "-1.75", "-1.5", "-1.25", "-1", "-0.75", "-0.5", "-0.25", "0", "+0.25", "+0.5", "+0.75", "+1", "+1.25", "+1.5", "+1.75" };
    foreach (string cat in categories)
    {
        // Get the html of the table for the current category.
        string html = GetSelector("Asian handicap " + cat);
        if (html == string.Empty)
            continue;
        // other code
    }
}
The second, GetSelector, clicks on the searched element. This is the design:
public string GetSelector(string selector)
{
    // Get the available table containers (the categories).
    var containers = driver.FindElements(By.XPath("//div[@class='table-container']"));
    // Store the html to return.
    string html = string.Empty;
    foreach (IWebElement container in containers)
    {
        // Container not available for click.
        if (container.GetAttribute("style") == "display: none;")
            continue;
        // Get container header (contains the description).
        IWebElement header = container.FindElement(By.XPath(".//div[starts-with(@class, 'table-header')]"));
        // Store the table description.
        string description = header.FindElement(By.TagName("a")).Text;
        // The container contains the searched category.
        if (description.Trim() == selector)
        {
            // Get the available links.
            var listItems = driver.FindElement(By.Id("odds-data-table")).FindElements(By.TagName("a"));
            // Get the element to click.
            IWebElement element = listItems.Where(li => li.Text == selector).FirstOrDefault();
            // The element exists.
            if (element != null)
            {
                // Click on the container to load the table.
                element.Click();
                // Wait a few seconds on ChromeDriver for table loading.
                driver.Manage().Timeouts().ImplicitWait = TimeSpan.FromSeconds(20);
                // Get the new html of the page.
                html = driver.PageSource;
            }
            return html;
        }
    }
    return string.Empty;
}
Problem and exception details
When the foreach reaches this line:
var listItems = driver.FindElement(By.Id("odds-data-table")).FindElements(By.TagName("a"));
I get this exception:
'OpenQA.Selenium.StaleElementReferenceException' in WebDriver.dll
stale element reference: element is not attached to the page document
Searching for the error suggests that the HTML page source changed, but in this case I store the element to click in one variable and the html itself in another, so I can't figure out how to patch this issue.
Could someone help me?
Thanks in advance.
I looked at your code and I think you're making it more complicated than it needs to be. I'm assuming you want to scrape the table that is exposed when you click one of the handicap links. Here's some simple code to do this. It dumps the text of the elements, which ends up unformatted, but you can use it as a starting point and add functionality as you want. I didn't run into any StaleElementReferenceExceptions when running this code, and I never saw the page refresh, so I'm not sure what other people were seeing.
string url = "http://www.oddsportal.com/soccer/europe/champions-league/paok-spartak-moscow-pIXFEt8o/#ah;2";
driver.Url = url;
// get all the (visible) handicap links and click them to open the page and display the table with odds
IReadOnlyCollection<IWebElement> links = driver.FindElements(By.XPath("//a[contains(.,'Asian handicap')]")).Where(e => e.Displayed).ToList();
foreach (var link in links)
{
    link.Click();
}
// print all the odds tables
foreach (var item in driver.FindElements(By.XPath("//div[@class='table-container']")))
{
    Console.WriteLine(item.Text);
    Console.WriteLine("====================================");
}
I would suggest that you spend some more time learning locators. Locators are very powerful and can save you having to stack nested loops looking for one thing... and then children of that thing... and then children of that thing... and so on. The right locator can find all that in one scrape of the page which saves a lot of code and time.
As you mentioned in the related post, this issue occurs because the site executes an auto refresh.
Solution 1:
I would suggest that, if there is an explicit way to do the refresh, you perform it on a periodic basis, or whenever you know you need it.
Solution 2:
Create extension methods for FindElement and FindElements so that they try to get the element within a given timeout:
public static IWebElement FindElement(this IWebDriver driver, By by, int timeout)
{
    if (timeout > 0)
    {
        return new WebDriverWait(driver, TimeSpan.FromSeconds(timeout)).Until(ExpectedConditions.ElementToBeClickable(by));
    }
    return driver.FindElement(by);
}
public static IReadOnlyCollection<IWebElement> FindElements(this IWebDriver driver, By by, int timeout)
{
    if (timeout > 0)
    {
        return new WebDriverWait(driver, TimeSpan.FromSeconds(timeout)).Until(ExpectedConditions.PresenceOfAllElementsLocatedBy(by));
    }
    return driver.FindElements(by);
}
so your code will use these like this:
var listItems = driver.FindElement(By.Id("odds-data-table"), 30).FindElements(By.TagName("a"));
Solution 3:
Handle StaleElementReferenceException using extension methods:
public static IWebElement FindElement(this ISearchContext context, By by, int maxAttempt)
{
    for (int attempt = 0; attempt < maxAttempt - 1; attempt++)
    {
        try
        {
            return context.FindElement(by);
        }
        catch (StaleElementReferenceException)
        {
            // The DOM changed under us; try again.
        }
    }
    return context.FindElement(by); // last attempt: let any exception propagate
}
public static IReadOnlyCollection<IWebElement> FindElements(this ISearchContext context, By by, int maxAttempt)
{
    for (int attempt = 0; attempt < maxAttempt - 1; attempt++)
    {
        try
        {
            return context.FindElements(by);
        }
        catch (StaleElementReferenceException)
        {
            // The DOM changed under us; try again.
        }
    }
    return context.FindElements(by); // last attempt: let any exception propagate
}
Your code will use these like this:
var listItems = driver.FindElement(By.Id("odds-data-table"), 2).FindElements(By.TagName("a"), 2);
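The retry idea behind Solution 3 can be sketched generically, independent of Selenium (the Retry helper and the simulated failure below are illustrative, not Selenium APIs):

```csharp
using System;

public class RetryDemo
{
    // Generic retry: re-run the action until it succeeds or attempts run out.
    public static T Retry<T>(Func<T> action, int maxAttempts)
    {
        for (int attempt = 1; attempt < maxAttempts; attempt++)
        {
            try { return action(); }
            catch (Exception) { /* swallow and retry */ }
        }
        return action(); // final attempt: let any exception propagate
    }

    public static void Main()
    {
        int calls = 0;
        // Fails twice (simulating a stale element), then succeeds.
        string result = Retry(() =>
        {
            calls++;
            if (calls < 3) throw new InvalidOperationException("stale");
            return "found";
        }, 5);
        Console.WriteLine(result + " after " + calls + " calls"); // found after 3 calls
    }
}
```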
Use this:
string description = header.FindElement(By.XPath("strong/a")).Text;
instead of your:
string description = header.FindElement(By.TagName("a")).Text;

How to get URL from the XPATH?

I've tried to check other answers on this site, but none of them worked for me. I have the following HTML code:
<h3 class="x-large lheight20 margintop5">
<a href="#"><strong>some textstring</strong></a>
</h3>
I am trying to get the # (the href value) from this document with the following code:
string adUrl = Doc.DocumentNode.SelectSingleNode("//*[@id=\"offers_table\"]/tbody/tr[" + i + "]/td/table/tbody/tr[1]/td[2]/div/h3/a/@href").InnerText;
I've also tried without @href, and with a[contains(@href, 'searchString')]. But all of these lines gave me just the text of the link - some textstring
Attributes don't have InnerText. You have to use the Attributes collection instead:
string adUrl = Doc.DocumentNode.SelectSingleNode("//*[@id=\"offers_table\"]/tbody/tr[" + i + "]/td/table/tbody/tr[1]/td[2]/div/h3/a")
    .Attributes["href"].Value;
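As a side note (a sketch, not part of the answer above): combining SelectSingleNode with the C# 6 null-conditional operator and HtmlAgilityPack's GetAttributeValue avoids a NullReferenceException when the node or the attribute is missing:

```csharp
// Null-safe variant of the same lookup: adUrl is null when the node
// is missing, and string.Empty when the href attribute is missing.
HtmlNode node = Doc.DocumentNode.SelectSingleNode(
    "//*[@id=\"offers_table\"]/tbody/tr[" + i + "]/td/table/tbody/tr[1]/td[2]/div/h3/a");
string adUrl = node?.GetAttributeValue("href", string.Empty);
```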
Why not just use the XDocument class?
private string GetUrl(string filename)
{
    var doc = XDocument.Load(filename);
    foreach (var h3Element in doc.Descendants("h3").Where(e => e.Attribute("class") != null))
    {
        var classAtt = h3Element.Attribute("class").Value;
        if (classAtt == "x-large lheight20 margintop5")
        {
            return h3Element.Element("a").Attribute("href").Value;
        }
    }
    return null;
}
The code is not tested, so use with caution.

Find a word in PDF using PDFSharp

I am using PDFSharp and need help. I need to check whether the document contains the word "abc". Example:
11abcee = true
444abcggw = true
778ab = false
I wrote this code, but it does not work as expected:
PdfDocument document = PdfReader.Open("c:\\abc.pdf");
PdfDictionary dictionary = new PdfDictionary(document);
string a = dictionary.Elements.GetString("MTZ");
if (a.Equals("MTZ"))
{
    MessageBox.Show("OK", "");
}
else
{
    MessageBox.Show("NO", "");
}
Am I missing something?
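The desired check itself is just a substring test, which can be seen in isolation with plain C#, independent of PDFSharp:

```csharp
using System;

class ContainsDemo
{
    static void Main()
    {
        string[] samples = { "11abcee", "444abcggw", "778ab" };
        foreach (string s in samples)
        {
            // True when "abc" occurs anywhere in the string.
            Console.WriteLine(s + " = " + s.Contains("abc"));
        }
        // Prints:
        // 11abcee = True
        // 444abcggw = True
        // 778ab = False
    }
}
```

The hard part of the question is therefore not the comparison but extracting the text from the PDF, which the answers below address.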
Maybe this SO entry will help you: PDFSharp alter Text repositioning.
It links to here - text extraction example with PDFSharp.
Old question, but here is an example.
Note: C# 7.0+ is required for the "is" pattern-matching variable declarations used below.
Note: This example uses PDFSharp installed from Package Manager.
"Install-Package PdfSharp -Version 1.50.5147"
Note: For my requirements, I only needed to search the first page of my PDFs; update if needed.
using (PdfDocument inputDocument = PdfReader.Open(filePath, PdfDocumentOpenMode.Import))
{
    if (searchPDFPage(ContentReader.ReadContent(inputDocument.Pages[0]), searchText))
    {
        // match found.
    }
}
This code looks for a CString that starts with a pound sign; the OP would need to use a Contains string check instead.
private bool searchPDFPage(CObject cObject, string searchText)
{
    if (cObject is COperator cOperator)
    {
        if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
            cOperator.OpCode.Name == OpCodeName.TJ.ToString())
        {
            foreach (var cOperand in cOperator.Operands)
            {
                if (searchPDFPage(cOperand, searchText))
                {
                    return true;
                }
            }
        }
    }
    else if (cObject is CSequence cSequence)
    {
        foreach (var element in cSequence)
        {
            if (searchPDFPage(element, searchText))
            {
                return true;
            }
        }
    }
    else if (cObject is CString cString)
    {
        if (cString.Value.StartsWith("#"))
        {
            if (cString.Value.Substring(2) == searchText)
            {
                return true;
            }
        }
    }
    return false;
}
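For the OP's requirement (a substring match anywhere in the string), the CString branch above could be reduced to a plain Contains check; this is a sketch, not tested against real PDFs:

```csharp
else if (cObject is CString cString)
{
    // Match the search text anywhere in the extracted string,
    // e.g. "11abcee" contains "abc".
    if (cString.Value.Contains(searchText))
    {
        return true;
    }
}
```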
Credit: This example was modified based on this answer:
C# Extract text from PDF using PdfSharp

Reading links in header using WebKit.NET

I am trying to figure out how to read header links using C#. I want to get the edit link from browser 1 and put it in browser 2. My problem is that I can't figure out how to get at the attributes, or even the link tags for that matter. Below is what I have now:
using System.Xml.Linq;
...
string source = webKitBrowser1.DocumentText.ToString();
XDocument doc = new XDocument(XDocument.Parse(source));
webKitBrowser2.Navigate(doc.Element("link").Attribute("href").Value.ToString());
This would work except that XML is different from HTML; right off the bat, it complains that it expected "DOCTYPE" to be uppercase.
I finally figured it out, so I will post it for anyone who has the same question:
string site = webKitBrowser1.Url.Scheme + "://" + webKitBrowser1.Url.Authority;
WebKit.DOM.Document doc = webKitBrowser1.Document;
WebKit.DOM.NodeList links = doc.GetElementsByTagName("link");
WebKit.DOM.Element link;
string editlink = "none";
foreach (var item in links)
{
    link = (WebKit.DOM.Element)item;
    if (link.Attributes["rel"].NodeValue == "edit")
    {
        editlink = link.Attributes["href"].NodeValue;
    }
}
if (editlink != "none")
{
    webKitBrowser2.Navigate(site + editlink);
}
