Web scraper in C#: perhaps a more precise web scraper than this

I am trying to scrape http://gameinfo.na.leagueoflegends.com/en/game-info/champions/ but I can't find the champion images in what my scraper downloads. The problem is that it doesn't scrape everything on the page. My script is:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using System.Net;
namespace WebScraping
{
class Program
{
static void Main(string[] args) {
WebScraping wb = new WebScraping();
wb.Scraping();
}
class WebScraping
{
public void Scraping()
{
Console.WriteLine("Type in the webpage you want to scrape : \n");
string WebPage = Console.ReadLine();
WebClient webc = new WebClient();
string url = webc.DownloadString(WebPage);
Console.WriteLine(url += "\n \t Done");
Console.ReadLine();
}
}
}
}
The thing I'm trying to find is the <a href="amumu"/></a>

You're right: the data is not in the original HTML. Instead, the Champions Grid is populated via JavaScript. This actually works in your favor; it means you'll probably be able to get your hero information in JSON format, which is much easier to parse. The only trick is finding where that JavaScript loads its data from.
To do that, load the page in your browser and use the developer tools. I'll use Google Chrome as an example. Hit F12 to open the developer tools, and then go to the Network tab. Now hit Shift+F5 to reload the page and record the requests. With this done, you can look through every individual item that was downloaded to render this page. I saw a full 238 requests (that's a lot!), but if you scan through the list for JSON items you'll eventually see a champion.json file. Right-click on it, and you can copy this URL:
http://ddragon.leagueoflegends.com/cdn/6.24.1/data/en_US/champion.json
Look at the data in that file, and you'll find this:
"Amumu":
{
"version":"6.24.1",
"id":"Amumu",
"key":"32",
"name":"Amumu",
"title":"the Sad Mummy",
"blurb":"''Solitude can be lonelier than death.''<br><br>A lonely and melancholy soul from ancient Shurima, Amumu roams the world in search of a friend. Cursed by an ancient spell, he is doomed to remain alone forever, as his touch is death and his affection ...",
"info":
{
"attack":2,
"defense":6,
"magic":8,
"difficulty":3
},
"image":
{
"full":"Amumu.png",
"sprite":"champion0.png",
"group":"champion",
"x":192,
"y":0,
"w":48,
"h":48
},
"tags":["Tank","Mage"],
"partype":"MP",
"stats":
{
"hp":613.12,
"hpperlevel":84.0,
"mp":287.2,
"mpperlevel":40.0,
"movespeed":335.0,
"armor":23.544,
"armorperlevel":3.8,
"spellblock":32.1,
"spellblockperlevel":1.25,
"attackrange":125.0,
"hpregen":8.875,
"hpregenperlevel":0.85,
"mpregen":7.38,
"mpregenperlevel":0.525,
"crit":0.0,
"critperlevel":0.0,
"attackdamage":53.384,
"attackdamageperlevel":3.8,
"attackspeedoffset":-0.02,
"attackspeedperlevel":2.18
}
}
Use NuGet to pull in a JSON parser and you can quickly get structured data from this.
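As a minimal sketch of that last step, here is one way the parsing might look. It uses System.Text.Json rather than whichever NuGet parser you prefer (it ships with recent .NET versions and is available as a NuGet package for older ones). Two things are assumptions on my part and not shown in the excerpt above: that the champion entries sit under a top-level "data" object, and that portraits are served from an img/champion/ folder next to the JSON. The ChampionParser class and the trimmed sample string are made up for illustration; in your program the string would come from DownloadString.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

static class ChampionParser
{
    // ASSUMPTION: portrait location on the CDN; verify it in the Network tab.
    const string ImageBase = "http://ddragon.leagueoflegends.com/cdn/6.24.1/img/champion/";

    // Returns champion name -> image URL for every entry under "data".
    // ASSUMPTION: the champions are wrapped in a top-level "data" object.
    public static Dictionary<string, string> ImageUrls(string json)
    {
        var result = new Dictionary<string, string>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            JsonElement data = doc.RootElement.GetProperty("data");
            foreach (JsonProperty champion in data.EnumerateObject())
            {
                string name = champion.Value.GetProperty("name").GetString();
                string file = champion.Value.GetProperty("image").GetProperty("full").GetString();
                result[name] = ImageBase + file;
            }
        }
        return result;
    }
}

class Program
{
    static void Main()
    {
        // Trimmed sample in the same shape as the Amumu entry above.
        string json = "{\"data\":{\"Amumu\":{\"id\":\"Amumu\",\"name\":\"Amumu\",\"image\":{\"full\":\"Amumu.png\"}}}}";
        foreach (var pair in ChampionParser.ImageUrls(json))
            Console.WriteLine(pair.Key + " -> " + pair.Value);
    }
}
```

GetProperty throws if a key is missing, so TryGetProperty may be the safer choice if the file layout differs from what is assumed here.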

Regex helped me match the information that I needed:
MatchCollection m1 = Regex.Matches(html, "\"id\":\"(.+?)\",\"", RegexOptions.Singleline);
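For completeness, a small sketch of how the captured values could be read back out of the match collection. The IdExtractor class and the sample string are hypothetical; the pattern is the one above. Note that it relies on the JSON being compact (no whitespace around the colon or comma), so it would probably stop matching if the file were ever pretty-printed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class IdExtractor
{
    // Collects the first capture group of every "id":"..." match.
    public static List<string> Ids(string html)
    {
        var ids = new List<string>();
        MatchCollection matches = Regex.Matches(html, "\"id\":\"(.+?)\",\"", RegexOptions.Singleline);
        foreach (Match m in matches)
            ids.Add(m.Groups[1].Value);
        return ids;
    }
}

class Program
{
    static void Main()
    {
        // Made-up compact sample, shaped like the champion data.
        string html = "{\"id\":\"Amumu\",\"key\":\"32\"},{\"id\":\"Annie\",\"key\":\"1\"}";
        foreach (string id in IdExtractor.Ids(html))
            Console.WriteLine(id);
    }
}
```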

Related

How to Find the reCAPTCHA element and click on it in c# Selenium

Hello everyone, I need some help. The URL is http://lisans.epdk.org.tr/epvys-web/faces/pages/lisans/petrolBayilik/petrolBayilikOzetSorgula.xhtml. As you can see in the screenshot, I need to click the CAPTCHA checkbox.
https://i.stack.imgur.com/xjXaA.png
Here is my code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;
namespace AkaryakitSelenium
{
class Program
{
private static string AkaryakitLink = "http://lisans.epdk.org.tr/epvys-web/faces/pages/lisans/petrolBayilik/petrolBayilikOzetSorgula.xhtml";
static void Main(string[] args)
{
IWebDriver driver = new ChromeDriver();
IJavaScriptExecutor js = driver as IJavaScriptExecutor;
driver.Navigate().GoToUrl(AkaryakitLink);
var kategoriCol = driver.FindElements(By.CssSelector(".ui-selectonemenu-trigger.ui-state-default.ui-corner-right"));
var x = kategoriCol[3];
x.Click();
var deneme = driver.FindElement(By.Id("petrolBayilikOzetSorguKriterleriForm:j_idt52_1"));
deneme.Click();
var check = driver.FindElement(By.Id("recaptcha-anchor"));
check.Click();
}
}
}
And this is the error I am getting:
OpenQA.Selenium.NoSuchElementException: 'no such element: Unable to locate element: {"method":"css selector","selector":"#recaptcha-anchor"}'
Thank you for your help.

The element you are looking for is inside an iframe:
//iframe[@title='reCAPTCHA']
First you need to switch to the iframe, like this:
new WebDriverWait(driver, TimeSpan.FromSeconds(3)).Until(ExpectedConditions.FrameToBeAvailableAndSwitchToIt(By.XPath("//iframe[@title='reCAPTCHA']")));
Then you can click on it:
var check = driver.FindElement(By.Id("recaptcha-anchor"));
check.Click();
PS: CAPTCHAs are not meant to be automated. CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart.

You cannot bypass a CAPTCHA with Selenium.
It is designed to prevent automated access to web pages.

Web Scraping C# with (possibly) a masked IP

I've gone away and tried my hand at some web scraping in C#. I found a video by Blake B on YouTube which shows the process of getting eBay listings into a list, pretty cool. I am trying to do something similar, but I'm struggling with the HTML part and what to substitute where. This is what I have so far:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Net.Http;
namespace CarRegRetrieve
{
class Program
{
static void Main(string[] args)
{
GetHTMLAsync();
Console.ReadLine();
}
private static async void GetHTMLAsync()
{
var url = "https://www.rapidcarcheck.co.uk/results?RegPlate=LN52dmv";
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);
var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml(html);
var ProductsHTML = htmlDoc.DocumentNode.Descendants("body")
.Where(node => node.GetAttributeValue("wpb_text_column wpb_content_element", "")
.Equals("wpb_wrapper")).ToList();
var productLists = ProductsHTML[0].Descendants();
Console.WriteLine(productLists);
}
}
}
This is the beginning of what should be a scraper that gets car information for an entered registration. As you can see, the website I am using is RapidCarCheck; I have entered my registration and would like to get the engine size, top speed, etc. I have looked at the page with Inspect Element, but I have no idea what I am looking for.
A little side note, not sure if it's an easy fix: the site has a Cloudflare anti-request blocker of some kind, and without a proxy to hide or change my IP it is hard to work on the project, as I keep needing to change my IP manually.
Thanks!

C# : Selenium : How to find an element with respect to the innertext of another element

I just started learning Selenium WebDriver and have run into a few issues.
I googled a lot, without success.
I am going to write a parser for a website.
The HTML looks like this:
(image: browser view and HTML)
<div class="view-wrapper"> contains
<ul class="sport--list">, which contains a list of <li class="sport--block">...</li>
I am trying to check each sport--block in a loop and find the section that includes a keyword like "Футбол" (football).
(image: Футбол)
Then, once I have found the right section, I want to get the value of the non-static timer and write it to a file. That is my next step; first I have to solve the first problem.
(image: timer)
The main issue is that there are a lot of such divs.
How can I find the right one? I wrote this code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;
namespace Parser
{
class Program
{
static void Main(string[] args)
{
using (var driver = new ChromeDriver())
{
// Go to the home page
driver.Navigate().GoToUrl("https://www.favorit.com.ua/ru/live/");
// Get the page elements
IList<IWebElement> ClassNamesElements = driver.FindElements(By.ClassName("sport--block"));
for (int i = 0; i < ClassNamesElements.Count; i++)
{
Console.WriteLine(ClassNamesElements[i]);
Console.ReadLine();
}
}
}
}
}
But I don't know how to set up the next condition for the selection, something like "where it includes Футбол".
After that, I want to work only within the piece of HTML that corresponds to the right sport--block.
I am not able to use a fixed XPath for the elements, because the website is not static and the right sport block can appear at a random position.
I don't need you to write the code for me. I just need some direction to continue my googling.
Did I choose the right way to solve this task (C# + Selenium)?
Please give me a few clues or hints. Thank you in advance.

To retrieve the value of the non-static timer with respect to several keys within the <li class="sport--block">...</li> tags, as there are multiple such <li> tags, you can write a function which will accept the key value as a string argument and print the relevant time.
Function :
public void print_key_timer(string myKey)
{
string myTime = driver.FindElement(By.XPath("//ul[@class='sport--list']//li[@class='sport--block']/div[contains(@class,'sport--head')]//span[.='" + myKey + "']//following-sibling::ul[1]//ul[@class='events--list']//div[@class='event--head']//div[@class='time--block']/div[@class='event--timer']")).GetAttribute("innerHTML");
Console.WriteLine(myTime);
}
Now you can call the function as many times you wish from anywhere within your program as :
print_key_timer("Футбол");

Try this code:
for (int i = 0; i < ClassNamesElements.Count; i++){
// In the C# bindings the visible text is the Text property (GetText() is the Java API)
if(ClassNamesElements[i].Text.Contains("Футбол")){
Console.WriteLine(ClassNamesElements[i].Text);
Console.ReadLine();
}
}

Need to pass key-value pairs from C# to Javascript

I need to pass a variable from C# to javascript in the form { 'key':'value', ..., }. I tried passing it as a string and hoping javascript would parse it (because the C# on cshtml pages is evaluated server side and js is client side) but unfortunately the quotes were formatted as &whateverthecodeis; so it didn't work. I think JSON might be what I'm looking for, but I have no idea how to use it.

Here is what I might do...
Run this console app and see what it does:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
// note: you will have to include a reference to "System.Web.Extensions" in your project to be able to use this...
using System.Web.Script.Serialization;
namespace KeyValuePairTestApp
{
class Program
{
static void Main(string[] args)
{
List<KeyValuePair<string, string>> pairs = new List<KeyValuePair<string, string>>()
{
new KeyValuePair<string, string>("MyFirstKey", "MyFirstValue"),
new KeyValuePair<string, string>("MySecondKey", "MySecondValue")
};
string json = new JavaScriptSerializer().Serialize(pairs);
Console.WriteLine(json);
}
}
}
For the "pass to javascript" part, please see "Making a Simple Ajax call to controller in asp.net mvc" for practical examples with MVC and jQuery.

Yes, you can use JSON.
Perhaps you should try using escape characters so the quotes are not misinterpreted.
Or, as in the answer above by @user1477388, serialize the key-value pairs to JSON and return them as follows:
public ActionResult ReturnJsonObject()
{
//Your code goes here
return Json(json);
}
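If the goal is specifically the { 'key':'value', ... } object shape from the question, serializing a dictionary may be a closer fit, since a list of KeyValuePair typically comes out as an array of Key/Value objects instead. The sketch below uses System.Text.Json (a swap for JavaScriptSerializer, not what the answers above use); the PairSerializer class is hypothetical. In a Razor view, emitting the resulting string through Html.Raw is the usual way to keep the quotes from being HTML-encoded, though whether that is appropriate depends on where the values come from.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

static class PairSerializer
{
    // Serializes a string dictionary to a JSON object: each key becomes a property name.
    public static string ToJsonObject(Dictionary<string, string> pairs)
    {
        return JsonSerializer.Serialize(pairs);
    }
}

class Program
{
    static void Main()
    {
        var pairs = new Dictionary<string, string>
        {
            { "MyFirstKey", "MyFirstValue" },
            { "MySecondKey", "MySecondValue" }
        };
        // Prints a single JSON object with one property per dictionary entry.
        Console.WriteLine(PairSerializer.ToJsonObject(pairs));
    }
}
```

On the JavaScript side the output can be assigned directly to a variable or passed through JSON.parse, because a JSON object is also a valid JavaScript object literal.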

Info from javascript text?

I'm trying to get some information from a website (http://wowhead.com). Most of it is easy enough with HTML Agility Pack, but I'm stuck on how to get information that sits inside a JavaScript tag.
The source I'm trying to get info from is
<script type="text/javascript">//<![CDATA[
Markup.printHtml("[ul][li]Level: 49[/li][li]Requires level 47[/li][li]Loremaster: [url=/achievement=4931]Felwood[/url][/li][li]Side: [span class=icon-horde]Horde[/span][/li] [li][icon name=quest_start]Start: [url=/npc=48127]Darla Drilldozer[/url][/icon][/li][li] [icon name=quest_end]End: [url=/npc=48127]Darla Drilldozer[/url][/icon][/li] [li]Sharable[/li][li]Difficulty: [color=r2]47[/color][small] [/small][color=r3]52[/color][small] [/small][color=r4]59[/color][/li][li]Added in patch 4.0.3[/li][/ul]", "sdhafcuvh0", { allow: Markup.CLASS_STAFF, dbpage: true });
//]]></script>
Now from all that, the only thing I'm interested in is the info from
[url=/npc=48127]Darla Drilldozer[/url]
From which I only want to display 48127 and Darla Drilldozer.
Is there any way to do this?
Here is an example of my current code in a console app, to show the kind of thing I'm after:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Net;
using System.IO;
using HtmlAgilityPack;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
//Enter the Quest ID and set it as a WoWHead link
Console.WriteLine("Enter quest ID");
string ID = Console.ReadLine();
Console.WriteLine("Gathering Quest information from: http://www.wowhead.com/quest=" + ID);
//Load WoWHead and search for the quest name in <h1>
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://wowhead.com/quest=" + ID);
HtmlNodeCollection Qname = doc.DocumentNode.SelectNodes("//h1");
//Set QuestName as the second <h1> tag
string QuestName = (Qname[1].InnerText);
//Display information recivied
Console.WriteLine("Quest ID: " + ID);
Console.WriteLine("Quest Name: " + QuestName);
Console.WriteLine("Quest Giver: " );
Console.WriteLine("Quest Giver ID: ");
Console.ReadLine();
}
}
}
So the information needed for Quest giver and Quest giver ID are from the above Javascript.
Is there any way to get this information?

There are many ways to skin a cat. One of them, in this case, is to find the position of the text you are looking for with IndexOf and then cut it out with a simple Substring. Would that work for you?
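A rough sketch of that position-based approach, assuming the script text is already in a string (for example from the InnerText of the script node found with HTML Agility Pack). The QuestGiverParser class and the "Start:" label lookup are my own illustration, and the sample is a fragment of the Markup string from the question.

```csharp
using System;

static class QuestGiverParser
{
    // Finds the first [url=/npc=ID]NAME[/url] after the given label.
    // Returns null if any piece of the pattern is missing.
    public static (string Id, string Name)? Extract(string script, string label)
    {
        const string open = "[url=/npc=";
        const string close = "[/url]";

        int labelPos = script.IndexOf(label, StringComparison.Ordinal);
        if (labelPos < 0) return null;

        int idStart = script.IndexOf(open, labelPos, StringComparison.Ordinal);
        if (idStart < 0) return null;
        idStart += open.Length;              // first character of the ID

        int idEnd = script.IndexOf(']', idStart);
        if (idEnd < 0) return null;          // the ']' that closes the opening tag

        int nameEnd = script.IndexOf(close, idEnd, StringComparison.Ordinal);
        if (nameEnd < 0) return null;

        string id = script.Substring(idStart, idEnd - idStart);
        string name = script.Substring(idEnd + 1, nameEnd - idEnd - 1);
        return (id, name);
    }
}

class Program
{
    static void Main()
    {
        string script = "[li][icon name=quest_start]Start: [url=/npc=48127]Darla Drilldozer[/url][/icon][/li]";
        var giver = QuestGiverParser.Extract(script, "Start:");
        if (giver != null)
        {
            Console.WriteLine("Quest Giver: " + giver.Value.Name);
            Console.WriteLine("Quest Giver ID: " + giver.Value.Id);
        }
    }
}
```

Searching from the "Start:" label rather than from the beginning of the string keeps the lookup from picking up the achievement link or the "End:" entry by mistake.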
