Click on HTML elements with Scrapy (WebScraping) - c#

I'm writing a program in C# using ScrapySharp and HtmlAgilityPack. The problem is that part of the information I need only appears after clicking an HTML element (a button or a link).
Several forums mentioned that Selenium can manipulate HTML elements, so I tried the following:
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

// Defines the interface with the Chrome browser
IWebDriver driver = new ChromeDriver();
// Auxiliary variable to hold the element containing the href
IWebElement element;
// Go to the website
driver.Url = url;
// Click on the download button
driver.FindElement(By.Id("Download button")).Click();
But since this is browser automation, it opens a browser and the website to perform the clicks, which is not what I need: I have to run the inspection on several websites internally.
Although I could keep using Selenium, I'm looking for a way to perform the clicks without using the browser.
Does anyone know how to click a link or button for web scraping without opening a browser?

Hope this is helpful to anyone who has the same requirements.
If you want to avoid opening the browser, you can use the settings below with ChromeDriver.
// settings to avoid opening a browser window
var options = new ChromeOptions();
options.AddArgument("--headless");
var service = ChromeDriverService.CreateDefaultService();
service.HideCommandPromptWindow = true;
// url to access and scrape
var url = "https://example.com";
using (var driver = new ChromeDriver(service, options))
{
// access the url
driver.Navigate().GoToUrl(url);
// Click on the download button - copied from your code above
driver.FindElement(By.Id("Download button")).Click();
}
In addition to the above, you may also find the links below useful:
can-selenium-webdriver-open-browser-windows-silently-in-background
running-webdriver-without-opening-actual-browser-window

Related

How do I download a page with Selenium

I couldn't find a way to download a whole webpage.
All I want is to navigate to https://api.tracker.gg/api/v2/rocket-league/standard/profile/epic/ManuelNotManni?
and download it. Is it possible to download the page with Selenium?
I used the following code to navigate to the page:
var options = new ChromeOptions();
using (var driver = new ChromeDriver(".", options))
{
driver.Navigate().GoToUrl("https://api.tracker.gg/api/v2/rocket-league/standard/profile/epic/ManuelNotManni?");
}
You can retrieve the page source content with the driver.PageSource property and save it to a file.
var options = new ChromeOptions();
using (var driver = new ChromeDriver(".", options))
{
driver.Navigate().GoToUrl("https://api.tracker.gg/api/v2/rocket-league/standard/profile/epic/ManuelNotManni?");
await File.WriteAllTextAsync("PageSource.html", driver.PageSource);
}
For downloading JSON this works well.
But for HTML pages, note:
If the page has been modified after loading (for example, by JavaScript) there is no guarantee that the returned text is that of the modified page. Please consult the documentation of the particular driver being used to determine whether the returned text reflects the current state of the page or the text last sent by the web server. The page source returned is a representation of the underlying DOM: do not expect it to be formatted or escaped in the same way as the response sent from the web server.
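To reduce the risk of reading the source before the page has settled, one common workaround is to wait until the document reports it is fully loaded before reading PageSource. The sketch below assumes headless Chrome and the WebDriverWait class from the Selenium.Support package; the readyState check only covers the initial load, not later JavaScript modifications:

```csharp
using System;
using System.IO;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

var options = new ChromeOptions();
options.AddArgument("--headless");
using (var driver = new ChromeDriver(options))
{
    driver.Navigate().GoToUrl("https://example.com");
    // Wait (up to 10 s) until the browser reports the document is fully loaded.
    var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
    wait.Until(d => ((IJavaScriptExecutor)d)
        .ExecuteScript("return document.readyState").Equals("complete"));
    File.WriteAllText("PageSource.html", driver.PageSource);
}
```

For content injected after load, you would additionally wait on a concrete element or count rather than readyState.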
References
https://www.selenium.dev/selenium/docs/api/dotnet/html/P_OpenQA_Selenium_IWebDriver_PageSource.htm
https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/file-system/how-to-write-to-a-text-file

How do I right-click on an image and Copy image address using selenium C#?

How do I right-click on an image and Copy image address using selenium C#?
I used this code:
var productimgs = driver.FindElement(By.XPath("//*[@id='coconut-baby-organic']/div[1]/div[1]/div/a/div/img"));
Actions action = new Actions(driver);
action.ContextClick(productimgs).Build().Perform();
action.SendKeys(Keys.ArrowDown).Build().Perform();
action.SendKeys(Keys.ArrowDown).Build().Perform();
action.SendKeys(Keys.ArrowDown).Build().Perform();
action.SendKeys(Keys.ArrowDown).Build().Perform();
action.SendKeys(Keys.Enter).Build().Perform();
I expect it to right-click the image and keep pressing Down until it reaches "Copy image address", then press Enter, but it doesn't.
This is a known issue in the Chrome Selenium Web driver.
Alternatives:
Use the Firefox web driver.
You can achieve similar functionality using the InputSimulator library. Note: the Chrome window must be in focus.
// find the element and click on it.
IWebElement element = driver.FindElement(By.XPath("some_xpath"));
Actions action = new Actions(driver);
action.ContextClick(element).Build().Perform();
// navigate in menu
var input = new InputSimulator();
input.Keyboard.KeyPress(VirtualKeyCode.DOWN);
input.Keyboard.KeyPress(VirtualKeyCode.DOWN);
input.Keyboard.KeyPress(VirtualKeyCode.DOWN);
input.Keyboard.KeyPress(VirtualKeyCode.DOWN);
input.Keyboard.KeyPress(VirtualKeyCode.RETURN);
Why on earth do you want to do this with a context click? The approach requires the browser to stay in focus, which means you will not be able to do anything else with your computer while the test is running, nor will you be able to run your Selenium tests in parallel.
Instead, I recommend fetching the src attribute of the <img> tag - that is the URL you're looking for. It can be done via the IWebElement.GetAttribute() function.
Example code:
var productimgs = driver.FindElement(By.XPath("//*[@id='coconut-baby-organic']/div[1]/div[1]/div/a/div/img"));
var src = productimgs.GetAttribute("src");
Console.WriteLine("Image URL is: " + src);
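Once you have the URL, you can download the image itself without touching the clipboard at all. The sketch below shows that follow-up step with HttpClient; the URL is a placeholder standing in for the src value obtained above, and the output file name is an assumption:

```csharp
using System;
using System.IO;
using System.Net.Http;

// Placeholder for the value returned by productimgs.GetAttribute("src") above.
var src = "https://example.com/image.jpg";
using (var http = new HttpClient())
{
    // Fetch the image bytes directly from the URL and write them to disk.
    var bytes = await http.GetByteArrayAsync(src);
    await File.WriteAllBytesAsync("product.jpg", bytes);
}
```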

Chromedriver: How to disable PDF plugin

To mimic how my Firefox profile is set up, I need to ensure that the PDF viewer in Chrome is disabled. After searching across the internet, the closest answer I found is here:
https://code.google.com/p/chromium/issues/detail?id=528436
However, none of the suggestions on that page have worked for me.
Here is a snippet of code I expected to work:
Dictionary<String, Object> plugin = new Dictionary<String, Object>();
plugin.Add("enabled", false );
plugin.Add("name", "Chrome PDF Viewer");
var options = new ChromeOptions();
options.AddUserProfilePreference("plugins.plugins_list", plugin);
driver = new ChromeDriver(options);
Can anyone see what exactly I am doing wrong? This is starting to become a really frustrating issue!
For Chrome 57, I had to use this line instead:
options.AddUserProfilePreference("plugins.always_open_pdf_externally", true);
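For context, this preference makes Chrome download PDFs instead of rendering them in the built-in viewer. Combined with a download directory it might look like the sketch below; the directory path is an example, not a required value:

```csharp
using OpenQA.Selenium.Chrome;

var options = new ChromeOptions();
// Download PDFs instead of opening them in Chrome's built-in viewer.
options.AddUserProfilePreference("plugins.always_open_pdf_externally", true);
// Send downloads to a known folder (path is an example).
options.AddUserProfilePreference("download.default_directory", @"C:\Temp\Downloads");
var driver = new ChromeDriver(options);
```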
Additionally, if you ever need to set a plugin yourself you can find it like this:
Navigate to chrome://settings/content (Menu > Settings, Show advanced settings..., Content settings...).
Locate the specific preference (checkboxes only).
Right-click and Inspect the checkbox.
The preference name is the entire 'pref' attribute's value.
I found that this works for Selenium.WebDriver 2.53m, ChromeDriver 2.25.426923, and Chrome v55.0.2883.87 m.
var options = new ChromeOptions();
options.AddUserProfilePreference("plugins.plugins_disabled", new []{"Chrome PDF Viewer"});
driver = new ChromeDriver(options);
This works for me (if the first "Disable" link belongs to Chrome PDF Viewer):
driver.Navigate().GoToUrl("chrome://plugins/");
Thread.Sleep(4000);
driver.FindElement(By.LinkText("Disable")).Click();

Scraping product page with HttpAgilityPack - Not getting all products

Context:
I'm developing a desktop application in C# to scrape / analyse product information from individual web pages in a small number of domains. I use HtmlAgilityPack to capture and parse pages to fetch the data needed. I code different parsing rules for different domains.
Issue:
Pages from one particular domain, when displayed through a browser, can show perhaps 60-80 products. However when I parse through HtmlAgilityPack I only get 20 products maximum. Looking at the raw html in Firefox "View Page Source" there also appears to be only 20 of the relevant product divs present. I conclude that the remaining products must be loaded in via a script, perhaps to ease the load on the server. Indeed I can sometimes see this happening in the browser as there is a short pause while 20 more products load, then another 20 etc.
Question:
How can I access, through HtmlAgilityPack or otherwise, the full set of product divs present once all the scripting is complete?
You could use the WebBrowser in System.Windows.Forms to load the data, and agility pack to parse it. It would look something like this :
var browser = new WebBrowser();
browser.Navigate("http://whatever.com");
while (true)
{
if(browser.ReadyState == WebBrowserReadyState.Complete && browser.IsBusy != true)
{
break;
}
//not for production
Thread.Sleep(1000);
}
var doc = new HtmlAgilityPack.HtmlDocument();
// requires a reference to the mshtml COM library (using mshtml;)
var dom = (IHTMLDocument3)browser.Document.DomDocument;
StringReader reader = new StringReader(dom.documentElement.outerHTML);
doc.Load(reader);
see here for more details
Ok, I've got something working using the Selenium package (available via NuGet). The code looks like this:
private HtmlDocument FetchPageWithSelenium(string url)
{
IWebDriver driver = new FirefoxDriver();
IJavaScriptExecutor js = (IJavaScriptExecutor)driver;
driver.Navigate().GoToUrl(url);
// Scroll to the bottom of the page and pause for more products to load.
// Do it four times as there may be 4x20 products to retrieve.
js.ExecuteScript("window.scrollTo(0, document.body.scrollHeight);");
Thread.Sleep(2000);
js.ExecuteScript("window.scrollTo(0, document.body.scrollHeight);");
Thread.Sleep(2000);
js.ExecuteScript("window.scrollTo(0, document.body.scrollHeight);");
Thread.Sleep(2000);
js.ExecuteScript("window.scrollTo(0, document.body.scrollHeight);");
HtmlDocument webPage = new HtmlDocument();
webPage.LoadHtml(driver.PageSource.ToString());
driver.Quit();
return webPage;
}
This returns an HtmlAgilityPack HtmlDocument ready for further analysis having first forced the page to fully load by repeatedly scrolling to the bottom. Two issues outstanding:
The code launches Firefox and then stops it again when complete. That's a bit clumsy, and I'd rather it all happened invisibly. It's been suggested that you can avoid this by using the PhantomJS driver instead of the Firefox driver. That didn't help, though, as it just pops up a Windows console window instead.
It's a bit slow due to the time taken to load the browser and pause while the scripting loads the supplementary content. I can probably live with it though.
I'll also try to rework @swestner's code to get it running in a WPF app and see which is the tidier solution.
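Both outstanding issues can be softened with headless Chrome and a loop that stops scrolling once the page height stops growing. The sketch below reworks the method under those assumptions; the retry limit and sleep interval are guesses that would need tuning per site:

```csharp
using System.Threading;
using HtmlAgilityPack;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

private HtmlDocument FetchPageHeadless(string url)
{
    var options = new ChromeOptions();
    options.AddArgument("--headless"); // no visible browser window
    using (IWebDriver driver = new ChromeDriver(options))
    {
        var js = (IJavaScriptExecutor)driver;
        driver.Navigate().GoToUrl(url);
        // Scroll until the page height stops growing, i.e. no more products load.
        long lastHeight = 0;
        for (int i = 0; i < 10; i++)
        {
            js.ExecuteScript("window.scrollTo(0, document.body.scrollHeight);");
            Thread.Sleep(1000); // crude wait; a WebDriverWait on the product count is better
            long newHeight = (long)js.ExecuteScript("return document.body.scrollHeight;");
            if (newHeight == lastHeight) break;
            lastHeight = newHeight;
        }
        var webPage = new HtmlDocument();
        webPage.LoadHtml(driver.PageSource);
        return webPage;
    }
}
```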

Selenium WebDriver and browsers select file dialog

I'm using selenium webdriver, C#.
Is it possible to make WebDriver work with Firefox's file selection dialog?
Or must I use something like AutoIt?
If you are trying to select a file for upload, Selenium 2 supports HTML file inputs. For example:
HTML
<input type="file" id="uploadhere" />
Selenium Code
IWebElement element = driver.FindElement(By.Id("uploadhere"));
element.SendKeys("C:\\Some_Folder\\MyFile.txt");
Basically you "type" (with SendKeys) the full file path to the file input element. Selenium handles the file selection dialog for you.
However, if you want to manipulate an arbitrary file selection dialog, then, like Anders said, you have to go outside of Selenium.
No, WebDriver cannot interact with dialogs - this is because dialogs are the domain of the operating system and not the webpage.
I know people who have had luck with AutoIt as well as the Automation API provided by .NET.
Another option would be to skip the file dialog entirely and issue a POST or a GET, but this requires more advanced knowledge of the website as well as an understanding of how to construct the POST/GET.
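As a sketch of the POST approach: the endpoint URL, the form field name "file", and the file path below are all assumptions - inspect the site's actual upload form to find the real values:

```csharp
using System;
using System.IO;
using System.Net.Http;

using (var http = new HttpClient())
using (var form = new MultipartFormDataContent())
{
    // Attach the file under the form field name the site expects.
    var fileContent = new ByteArrayContent(File.ReadAllBytes(@"C:\Some_Folder\MyFile.txt"));
    form.Add(fileContent, "file", "MyFile.txt");
    // Post the multipart form to the site's upload endpoint.
    var response = await http.PostAsync("https://example.com/upload", form);
    Console.WriteLine(response.StatusCode);
}
```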
You could try Webinator, it is similar to Selenium in the sense that it is powered by WebDriver. It provides file dialog capabilities and I've had great success with it.
Here is another solution using RemoteWebDriver; it works like magic and I loved it.
Here is the code I have (note: this example is in Java):
driver.findElementByLinkText("Upload Files").click();
driver.setLogLevel(Level.ALL);
System.out.println(driver.getCurrentUrl());
WebElement element = driver.findElement(By.xpath("//input[@name='file_1']"));
LocalFileDetector detector = new LocalFileDetector();
//Now, give the file path and see the magic :)
String path = "D://test66T.txt";
File f = detector.getLocalFile(path);
((RemoteWebElement)element).setFileDetector(detector);
element.sendKeys(f.getAbsolutePath());
//now click the button to finish
driver.findElementByXPath("//html/body/div[9]/div[1]/a/span").click();
You asked about using AutoIt for the file dialog. This is easy, and you can do it from C#:
Install the NuGet package AutoItX.Net.
Use the demo code below.
Change the dialog title string as needed.
public static void InsertIntoFileDialog(string file, int timeout = 10)
{
int aiDialogHandle = AutoItX.WinWaitActive("Save As", "", timeout); // adjust string as you need
if (aiDialogHandle <= 0)
{
Assert.Fail("Can't find file dialog.");
}
AutoItX.Send(file);
Thread.Sleep(500);
AutoItX.Send("{ENTER}");
Thread.Sleep(500);
}
This helped me after I had trouble with Appium/Selenium related to file dialogs.
According to Nadim Saker
.Net has a library to handle file upload dialog. It has a SendKeys class that has a method SendWait(string keys). It sends the given key on the active application and waits for the message to be processed. It does not return any value.
This can be done as follows, tested and working with the Internet Explorer and Chrome drivers:
var allowsDetection = this.Driver as IAllowsFileDetection;
if (allowsDetection != null)
{
allowsDetection.FileDetector = new LocalFileDetector();
}
Driver.FindElement(By.Id("your-upload-input")).SendKeys(@"C:\PathToYourFile");
Reference https://groups.google.com/forum/#!msg/webdriver/KxmRZ8MkM4M/45CT4ID_WjQJ
If you want to upload a file, and not use the WebDriver, the only solution I've come across is AutoIt. It allows you to write a script and convert it to an executable which you can then call from within your code. I've used it successfully while working with an ActiveX control.
Another approach is to use System.Windows.Forms.SendKeys.SendWait("pathToFile"). I use it successfully everywhere I can't just send keys to the element as described by @prestomanifesto.
I used this to solve the problem... try it if none of the above works:
Actions action = new Actions(driver);
action.SendKeys(pObjElement, Keys.Space).Build().Perform();
Thread.Sleep(TimeSpan.FromSeconds(2));
var dialogHWnd = FindWindow(null, "Elegir archivos para cargar"); // Here goes the title of the dialog window
var setFocus = SetForegroundWindow(dialogHWnd);
if (setFocus)
{
Thread.Sleep(TimeSpan.FromSeconds(2));
System.Windows.Forms.SendKeys.SendWait(pFile);
System.Windows.Forms.SendKeys.SendWait("{DOWN}");
System.Windows.Forms.SendKeys.SendWait("{TAB}");
System.Windows.Forms.SendKeys.SendWait("{TAB}");
System.Windows.Forms.SendKeys.SendWait("{ENTER}");
}
Thread.Sleep(TimeSpan.FromSeconds(2));
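The snippet above assumes FindWindow and SetForegroundWindow have already been imported from user32.dll via P/Invoke, something like:

```csharp
using System;
using System.Runtime.InteropServices;

// Win32 API: find a top-level window by class name and/or window title.
[DllImport("user32.dll", SetLastError = true)]
static extern IntPtr FindWindow(string lpClassName, string lpWindowName);

// Win32 API: bring the given window to the foreground.
[DllImport("user32.dll")]
static extern bool SetForegroundWindow(IntPtr hWnd);
```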
