Web Scraping C# with (possibly) a masked IP

I've gone away and tried my hand at some web scraping in C#. I found a video by Blake B on YouTube which shows the process of getting eBay listings into a list, which is pretty cool. I'm trying to do something similar, but I'm struggling with the HTML part and what to substitute where. This is what I have so far...
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Net.Http;

namespace CarRegRetrieve
{
    class Program
    {
        static void Main(string[] args)
        {
            GetHTMLAsync();
            Console.ReadLine();
        }

        private static async void GetHTMLAsync()
        {
            var url = "https://www.rapidcarcheck.co.uk/results?RegPlate=LN52dmv";
            var httpClient = new HttpClient();
            var html = await httpClient.GetStringAsync(url);
            var htmlDoc = new HtmlAgilityPack.HtmlDocument();
            htmlDoc.LoadHtml(html);
            var ProductsHTML = htmlDoc.DocumentNode.Descendants("body")
                .Where(node => node.GetAttributeValue("wpb_text_column wpb_content_element", "")
                .Equals("wpb_wrapper")).ToList();
            var productLists = ProductsHTML[0].Descendants();
            Console.WriteLine(productLists);
        }
    }
}
This is the beginning of what should be a scrape that gets car information from an entered registration. As you can see, the website I am using is RapidCarCheck; I have entered my registration and would like to get the engine size, top speed, etc. I have had a look at the page with Inspect Element, but I have no idea what I am looking for.
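As a side note on the selector: HtmlAgilityPack's GetAttributeValue expects an attribute name (such as "class"), not the class string itself. A minimal sketch of filtering by CSS class, assuming the data sits in a div carrying the wpb_text_column class from the snippet above (not verified against the live page):
// Hypothetical: select every div whose class attribute contains "wpb_text_column".
var nodes = htmlDoc.DocumentNode.SelectNodes("//div[contains(@class, 'wpb_text_column')]");
if (nodes != null)
{
    foreach (var node in nodes)
        Console.WriteLine(node.InnerText.Trim());
}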
A little side note, not sure if it's an easy fix, but the site has a Cloudflare anti-request blocker of some kind, and without a proxy to hide or change my IP it is hard to work on the project, as I keep needing to change my IP manually.
Thanks!
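On the Cloudflare side note: HttpClient can at least be routed through a proxy via HttpClientHandler, as in the minimal sketch below (the proxy address is a placeholder, and a proxy alone will not necessarily get past Cloudflare's checks):
// Placeholder proxy endpoint; substitute a real one.
var handler = new HttpClientHandler
{
    Proxy = new System.Net.WebProxy("http://127.0.0.1:8888"),
    UseProxy = true
};
var httpClient = new HttpClient(handler);
// Some anti-bot layers also reject requests without a browser-like User-Agent.
httpClient.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0");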

Related

How to Find the reCAPTCHA element and click on it in C# Selenium

Hello everyone, I need some help. There is a URL: http://lisans.epdk.org.tr/epvys-web/faces/pages/lisans/petrolBayilik/petrolBayilikOzetSorgula.xhtml. As you can see in the screenshot, I need to click the CAPTCHA checkbox.
https://i.stack.imgur.com/xjXaA.png
Here is my code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

namespace AkaryakitSelenium
{
    class Program
    {
        private static string AkaryakitLink = "http://lisans.epdk.org.tr/epvys-web/faces/pages/lisans/petrolBayilik/petrolBayilikOzetSorgula.xhtml";

        static void Main(string[] args)
        {
            IWebDriver driver = new ChromeDriver();
            IJavaScriptExecutor js = driver as IJavaScriptExecutor;
            driver.Navigate().GoToUrl(AkaryakitLink);
            var kategoriCol = driver.FindElements(By.CssSelector(".ui-selectonemenu-trigger.ui-state-default.ui-corner-right"));
            var x = kategoriCol[3];
            x.Click();
            var deneme = driver.FindElement(By.Id("petrolBayilikOzetSorguKriterleriForm:j_idt52_1"));
            deneme.Click();
            var check = driver.FindElement(By.Id("recaptcha-anchor"));
            check.Click();
        }
    }
}
And lastly, this is the error that I am facing:
"OpenQA.Selenium.NoSuchElementException: 'no such element: Unable to
locate element: {"method":"css
selector","selector":"#recaptcha-anchor"}"
Thank you for your help.
The element you are looking for is inside an iframe:
//iframe[@title='reCAPTCHA']
First you need to switch to the iframe, like this:
new WebDriverWait(driver, TimeSpan.FromSeconds(3)).Until(ExpectedConditions.FrameToBeAvailableAndSwitchToIt(By.XPath("//iframe[@title='reCAPTCHA']")));
Then you can perform a click on it:
var check = driver.FindElement(By.Id("recaptcha-anchor"));
check.Click();
PS: CAPTCHAs are not meant to be automated, since CAPTCHA stands for "Completely Automated Public Turing test to tell Computers and Humans Apart".
You cannot bypass a CAPTCHA with Selenium.
It is designed to prevent automated access to web pages, as described here and in many other places.

C# ScrapySharp 'System.Net.CookieException: 'The 'Name'='HttpOnly, NID' part of the cookie is invalid.'

So I'm facing an unexpected issue with my code. For some reason, I am unable to download and print the links from my Google search... Help is much appreciated, as I'm really not sure what is going on here. I am also using the .NET SDK.
using System;
using System.Threading.Tasks;
using ScrapySharp;
using ScrapySharp.Extensions;
using ScrapySharp.Network;
using static System.Console;

namespace Test
{
    class Program
    {
        static async Task Main(string[] args)
        {
            var query = "scrapysharp";
            Console.WriteLine($"Searching '{query}' on google");
            var browser = new ScrapingBrowser();
            browser.UseDefaultCookiesParser = false;
            var resultsPage = await browser.NavigateToPageAsync(new Uri($"https://www.google.fr/search?q={query}"));
            Console.WriteLine($"Results");
            foreach (var link in resultsPage.Html.CssSelect("h3.r a"))
            {
                Console.WriteLine($"- {link.InnerText}");
            }
        }
    }
}
Error:
System.Net.CookieException: 'The 'Name'='HttpOnly, NID' part of the cookie is invalid.'
I was facing the same issue; the quick workaround for me was the one-line code below.
browser.IgnoreCookies = true;
Leave everything else as is, add this line after the line where you create the browser object, and try it out.
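In context, the setup from the question would look something like this with the workaround applied (a sketch, not a full rewrite):
var browser = new ScrapingBrowser();
browser.UseDefaultCookiesParser = false;
// Workaround: skip cookie handling so the malformed 'HttpOnly, NID'
// header no longer throws a CookieException.
browser.IgnoreCookies = true;
var resultsPage = await browser.NavigateToPageAsync(new Uri($"https://www.google.fr/search?q={query}"));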

Webscraper C#. Perhaps a more precise webscraper than this

I am trying to scrape http://gameinfo.na.leagueoflegends.com/en/game-info/champions/ but I can't find where the images of those champions are in my scrape. The problem is that it doesn't scrape every single thing... My script is ...
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using System.Net;

namespace WebScraping
{
    class Program
    {
        static void Main(string[] args)
        {
            WebScraping wb = new WebScraping();
            wb.Scraping();
        }

        class WebScraping
        {
            public void Scraping()
            {
                Console.WriteLine("Type in the webpage you want to scrape : \n");
                string WebPage = Console.ReadLine();
                WebClient webc = new WebClient();
                string url = webc.DownloadString(WebPage);
                Console.WriteLine(url += "\n \t Done");
                Console.ReadLine();
            }
        }
    }
}
The thing I'm trying to find is the <a href="amumu"/></a>
You're right: the data is not in the original HTML. Instead, the champions grid is populated via JavaScript. This actually works in your favor; it means you'll probably be able to get your hero information in JSON format, which is much easier to parse. The only trick is finding where that JavaScript is loaded from.
In order to do that, load the page in your browser and use the developer tools. I'll use Google Chrome as an example. Hit F12 to open the developer tools, then go to the Network tab. Now hit Shift+F5 to reload the page and record the requests. With this done, you can look through every individual item that was downloaded to render this page. I saw a full 238 requests (that's a lot!), but if you scan through the list for JSON items you'll eventually see a champion.json file. Right-click on that, and you can get this URL:
http://ddragon.leagueoflegends.com/cdn/6.24.1/data/en_US/champion.json
Look at the data in that file, and you'll find this:
"Amumu":
{
"version":"6.24.1",
"id":"Amumu",
"key":"32",
"name":"Amumu",
"title":"the Sad Mummy",
"blurb":"''Solitude can be lonelier than death.''<br><br>A lonely and melancholy soul from ancient Shurima, Amumu roams the world in search of a friend. Cursed by an ancient spell, he is doomed to remain alone forever, as his touch is death and his affection ...",
"info":
{
"attack":2,
"defense":6,
"magic":8,
"difficulty":3
},
"image":
{
"full":"Amumu.png",
"sprite":"champion0.png",
"group":"champion",
"x":192,
"y":0,
"w":48,
"h":48
},
"tags":["Tank","Mage"],
"partype":"MP",
"stats":
{
"hp":613.12,
"hpperlevel":84.0,
"mp":287.2,
"mpperlevel":40.0,
"movespeed":335.0,
"armor":23.544,
"armorperlevel":3.8,
"spellblock":32.1,
"spellblockperlevel":1.25,
"attackrange":125.0,
"hpregen":8.875,
"hpregenperlevel":0.85,
"mpregen":7.38,
"mpregenperlevel":0.525,
"crit":0.0,
"critperlevel":0.0,
"attackdamage":53.384,
"attackdamageperlevel":3.8,
"attackspeedoffset":-0.02,
"attackspeedperlevel":2.18
}
}
Use NuGet to pull in a JSON parser and you can quickly get structured data from this.
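For instance, a minimal sketch using Newtonsoft.Json (one common choice; it assumes the champions sit under a top-level "data" object, as in Riot's Data Dragon files, which the excerpt above does not show):
using System;
using System.Net.Http;
using Newtonsoft.Json.Linq;

class ChampionDump
{
    static void Main()
    {
        var url = "http://ddragon.leagueoflegends.com/cdn/6.24.1/data/en_US/champion.json";
        var json = new HttpClient().GetStringAsync(url).Result;
        // Assumed: each champion is keyed by name under the "data" object.
        var data = (JObject)JObject.Parse(json)["data"];
        foreach (var champ in data.Properties())
        {
            // e.g. "Amumu: Amumu.png"
            Console.WriteLine($"{champ.Name}: {champ.Value["image"]["full"]}");
        }
    }
}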
Regex helped me match the information that I needed:
MatchCollection m1 = Regex.Matches(html, "\"id\":\"(.+?)\",\"", RegexOptions.Singleline);
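For completeness, enumerating those matches (with using System.Text.RegularExpressions; in scope) prints each captured id:
// Group 1 holds whatever (.+?) captured between "id":" and ","
foreach (Match m in m1)
{
    Console.WriteLine(m.Groups[1].Value);
}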

OpenQA.Selenium.NoSuchElementException was unhandled + C# + Another Website

I am new to Selenium and currently exploring how it works. I started using it for an ASP.NET application; I am using the C# Selenium driver and IE Driver server (32-bit, as it's faster than 64-bit).
I navigate to an application and click a link, which should take me to ANOTHER WEBSITE where I have to find a textbox, clear it, enter some text (SendKeys), and then click a button.
When it goes to the other website from the main website, it's unable to find the element (I tried using By.Id and By.Name). I made sure the element is available on the webpage. As recommended, I used ImplicitlyWait, but no luck; I tried Thread.Sleep(), no luck. Does the test need to stay on the same website it launched on initially? Below is my code snippet. Please help me.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using OpenQA.Selenium;
using OpenQA.Selenium.IE;
using OpenQA.Selenium.Support.UI;
using System.Threading;

namespace mySelenium
{
    class Program
    {
        private static void Main(string[] args)
        {
            IWebDriver driver = new InternetExplorerDriver(@"C:\Users\msbyuva\Downloads\IEDriverServer_Win32_2.45.0\");
            driver.Manage().Timeouts().ImplicitlyWait(TimeSpan.FromSeconds(10));
            driver.Navigate().GoToUrl("http://MyorgName.org/Apps/Sites/2015/login.aspx");
            IWebElement userNameTxtBox = driver.FindElement(By.Id("ContentPlaceHolder1_Login1_UserName"));
            userNameTxtBox.SendKeys("MSBYUVA");
            IWebElement passwordTxtBox = driver.FindElement(By.Id("ContentPlaceHolder1_Login1_Password"));
            passwordTxtBox.SendKeys("1234");
            var myButton = driver.FindElement(By.Id("ContentPlaceHolder1_Login1_LoginButton"));
            myButton.Click();
            var EMailLink = driver.FindElement(By.LinkText("Email Testing Link"));
            EMailLink.Click();
            //Thread.Sleep(10000);
            // -- HERE IT IS THROWING ERROR (ANOTHER WEBSITE AFTER CLICKING HYPERLINK)
            var toEmailAddress = driver.FindElement(By.Name("ctl00$ContentPlaceHolder1$txtTo"));
            toEmailAddress.Clear();
            toEmailAddress.SendKeys("msbyuva@gmail.com");
            var chkEmailAttachment = driver.FindElement(By.Name("ctl00$ContentPlaceHolder1$ChkAttachMent"));
            chkEmailAttachment.Click();
            var sendEmailButton = driver.FindElement(By.Id("ctl00_ContentPlaceHolder1_BtnSend"));
            sendEmailButton.Click();
        }
    }
}
You need to SwitchTo the newly opened window and set focus to it in order to send any commands to it:
string currentHandle = driver.CurrentWindowHandle;
driver.SwitchTo().Window(driver.WindowHandles.ToList().Last());
After you are done with the newly opened window, do (as needed):
driver.Close();
driver.SwitchTo().Window(currentHandle);
Better still, use the PopupWindowFinder class:
string currentHandle = driver.CurrentWindowHandle;
PopupWindowFinder popUpWindow = new PopupWindowFinder(driver);
string popupWindowHandle = popUpWindow.Click(EMailLink);
driver.SwitchTo().Window(popupWindowHandle);
//then do the email stuff
var toEmailAddress = driver.FindElement(By.Name("ctl00$ContentPlaceHolder1$txtTo"));
toEmailAddress.Clear();
toEmailAddress.SendKeys("msbyuva@gmail.com");
var chkEmailAttachment = driver.FindElement(By.Name("ctl00$ContentPlaceHolder1$ChkAttachMent"));
chkEmailAttachment.Click();
var sendEmailButton = driver.FindElement(By.Id("ctl00_ContentPlaceHolder1_BtnSend"));
sendEmailButton.Click();
//closing pop up window
driver.Close();
driver.SwitchTo().Window(currentHandle);

Screen Capture in C# using HtmlAgilityPack

Due to the lack of proper documentation, I'm not sure if HtmlAgilityPack supports screen capture in C# after it loads the HTML contents.
So is there a way I can more or less grab a screenshot using (or along with) HtmlAgilityPack so I can have a visual clue as to what happens every time I do page manipulations?
Here is my working code so far:
using HtmlAgilityPack;
using System;

namespace ConsoleApplication4
{
    class Program
    {
        static void Main(string[] args)
        {
            string urlDemo = "https://htmlagilitypack.codeplex.com/";
            HtmlWeb getHtmlWeb = new HtmlWeb();
            var doc = getHtmlWeb.Load(urlDemo);
            var sentence = doc.DocumentNode.SelectNodes("//p");
            int counter = 1;
            try
            {
                foreach (var p in sentence)
                {
                    Console.WriteLine(counter + ". " + p.InnerText);
                    counter++;
                }
            }
            catch (Exception e)
            {
                Console.WriteLine(e);
            }
            Console.ReadLine();
        }
    }
}
Currently, it scrapes and outputs all the p elements of the page to the console, but at the same time I want to get a screen grab of the scraped contents, and I don't know how or where to begin.
Any help is greatly appreciated. TIA
You can't do this with HTML Agility Pack. Use a different tool such as Selenium WebDriver. Here is how to do it: Take a screenshot with Selenium WebDriver
Could you use Selenium WebDriver instead?
You'll need to add the following NuGet packages to your project first:
Selenium.WebDriver
Selenium.Support
Loading a page and taking a screenshot is then as simple as...
using System;
using System.Drawing.Imaging;
using System.IO;
using OpenQA.Selenium;
using OpenQA.Selenium.Firefox;
using OpenQA.Selenium.Support.UI;

namespace SeleniumTest
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a web driver that uses Firefox
            var driver = new FirefoxDriver(
                new FirefoxBinary(), new FirefoxProfile(), TimeSpan.FromSeconds(120));

            // Load your page
            driver.Navigate().GoToUrl("http://google.com");

            // Wait until the page has actually loaded
            var wait = new WebDriverWait(driver, new TimeSpan(0, 0, 10));
            wait.Until(d => d.Title.Contains("Google"));

            // Take a screenshot and save it to a file (you must have full access rights to the save location).
            var myDesktop = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
            ((ITakesScreenshot)driver).GetScreenshot().SaveAsFile(Path.Combine(myDesktop, "google-screenshot.png"), ImageFormat.Png);
            driver.Close();
        }
    }
}
