I'm trying to get some information from a website (http://wowhead.com). Most of it is easy enough with Html Agility Pack, but I'm stuck on how to get information from inside a JavaScript tag.
The source I'm trying to get the info from is:
<script type="text/javascript">//<![CDATA[
Markup.printHtml("[ul][li]Level: 49[/li][li]Requires level 47[/li][li]Loremaster: [url=/achievement=4931]Felwood[/url][/li][li]Side: [span class=icon-horde]Horde[/span][/li] [li][icon name=quest_start]Start: [url=/npc=48127]Darla Drilldozer[/url][/icon][/li][li] [icon name=quest_end]End: [url=/npc=48127]Darla Drilldozer[/url][/icon][/li] [li]Sharable[/li][li]Difficulty: [color=r2]47[/color][small] [/small][color=r3]52[/color][small] [/small][color=r4]59[/color][/li][li]Added in patch 4.0.3[/li][/ul]", "sdhafcuvh0", { allow: Markup.CLASS_STAFF, dbpage: true });
//]]></script>
Now, from all of that, the only thing I'm interested in is this part:
[url=/npc=48127]Darla Drilldozer[/url]
from which I only want to display 48127 and Darla Drilldozer.
Is there any way to do this?
Here is an example of my current code in a console app to show the kind of thing I'm after:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Net;
using System.IO;
using HtmlAgilityPack;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            // Enter the quest ID and build the Wowhead link
            Console.WriteLine("Enter quest ID");
            string ID = Console.ReadLine();
            Console.WriteLine("Gathering Quest information from: http://www.wowhead.com/quest=" + ID);

            // Load Wowhead and search for the quest name in <h1>
            HtmlWeb web = new HtmlWeb();
            HtmlDocument doc = web.Load("http://wowhead.com/quest=" + ID);
            HtmlNodeCollection Qname = doc.DocumentNode.SelectNodes("//h1");

            // Set QuestName as the second <h1> tag
            string QuestName = Qname[1].InnerText;

            // Display the information received
            Console.WriteLine("Quest ID: " + ID);
            Console.WriteLine("Quest Name: " + QuestName);
            Console.WriteLine("Quest Giver: ");
            Console.WriteLine("Quest Giver ID: ");
            Console.ReadLine();
        }
    }
}
So the information needed for Quest Giver and Quest Giver ID comes from the JavaScript above.
Is there any way to get this information?
There are many ways to skin a cat, and one of them in this case is to find the position of the text you are looking for and use a simple string.Substring. Would that work for you?
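For example, here is a minimal sketch of that idea; it uses a regular expression rather than IndexOf/Substring, but the principle (locate the [url=/npc=...] token in the raw page text and pull out the pieces) is the same. The quest ID below is just a placeholder:

using System;
using System.Net;
using System.Text.RegularExpressions;

class QuestGiverSketch
{
    static void Main()
    {
        string ID = "12345"; // placeholder - use the quest ID the user typed in

        // Download the raw page source; the Markup.printHtml(...) script block is part of it
        string pageHtml = new WebClient().DownloadString("http://www.wowhead.com/quest=" + ID);

        // Capture the NPC id and name from the first [url=/npc=48127]Darla Drilldozer[/url] style token.
        // In your snippet the "Start:" NPC appears before the "End:" NPC, so the first match is the quest giver.
        Match m = Regex.Match(pageHtml, @"\[url=/npc=(\d+)\]([^\[]+)\[/url\]");
        if (m.Success)
        {
            Console.WriteLine("Quest Giver ID: " + m.Groups[1].Value); // 48127
            Console.WriteLine("Quest Giver: " + m.Groups[2].Value);    // Darla Drilldozer
        }
    }
}

Using Regex.Matches instead of Regex.Match would give you both the start and end NPCs if they ever differ.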
Related
Hello everyone, I need some help. Here is the URL: http://lisans.epdk.org.tr/epvys-web/faces/pages/lisans/petrolBayilik/petrolBayilikOzetSorgula.xhtml. As you can see in the screenshot, I need to click the CAPTCHA checkbox.
https://i.stack.imgur.com/xjXaA.png
Here is my code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

namespace AkaryakitSelenium
{
    class Program
    {
        private static string AkaryakitLink = "http://lisans.epdk.org.tr/epvys-web/faces/pages/lisans/petrolBayilik/petrolBayilikOzetSorgula.xhtml";

        static void Main(string[] args)
        {
            IWebDriver driver = new ChromeDriver();
            IJavaScriptExecutor js = driver as IJavaScriptExecutor;
            driver.Navigate().GoToUrl(AkaryakitLink);

            var kategoriCol = driver.FindElements(By.CssSelector(".ui-selectonemenu-trigger.ui-state-default.ui-corner-right"));
            var x = kategoriCol[3];
            x.Click();

            var deneme = driver.FindElement(By.Id("petrolBayilikOzetSorguKriterleriForm:j_idt52_1"));
            deneme.Click();

            var check = driver.FindElement(By.Id("recaptcha-anchor"));
            check.Click();
        }
    }
}
And lastly, this is the error that I am facing:
"OpenQA.Selenium.NoSuchElementException: 'no such element: Unable to locate element: {"method":"css selector","selector":"#recaptcha-anchor"}'"
Thank you for your help.
The element you are looking for is inside an iframe:
//iframe[@title='reCAPTCHA']
First you need to switch to the iframe, like this:
new WebDriverWait(driver, TimeSpan.FromSeconds(3)).Until(ExpectedConditions.FrameToBeAvailableAndSwitchToIt(By.XPath("//iframe[@title='reCAPTCHA']")));
Then you can perform a click on it:
var check = driver.FindElement(By.Id("recaptcha-anchor"));
check.Click();
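Put together, a rough sketch of the whole sequence might look like this (using the same Selenium.Support ExpectedConditions helpers as above; switching back to the main document afterwards is easy to forget):

// Wait for the reCAPTCHA iframe and switch into it
new WebDriverWait(driver, TimeSpan.FromSeconds(10))
    .Until(ExpectedConditions.FrameToBeAvailableAndSwitchToIt(By.XPath("//iframe[@title='reCAPTCHA']")));

// Click the checkbox once it is clickable
var check = new WebDriverWait(driver, TimeSpan.FromSeconds(10))
    .Until(ExpectedConditions.ElementToBeClickable(By.Id("recaptcha-anchor")));
check.Click();

// Switch back to the main page before touching any other elements
driver.SwitchTo().DefaultContent();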
PS: CAPTCHAs are not meant to be automated, since CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart.
You cannot bypass a CAPTCHA with Selenium.
It is designed to prevent automated access to web pages, as described here and in many other places.
I've gone away and tried my hand at some web scraping in C#. I found a video by Blake B on YouTube which shows the process of getting eBay listings into a list, which is pretty cool. I am trying to do something similar, but I'm struggling with the HTML part and what to substitute where. This is what I have so far...
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Net.Http;

namespace CarRegRetrieve
{
    class Program
    {
        static void Main(string[] args)
        {
            GetHTMLAsync();
            Console.ReadLine();
        }

        private static async void GetHTMLAsync()
        {
            var url = "https://www.rapidcarcheck.co.uk/results?RegPlate=LN52dmv";
            var httpClient = new HttpClient();
            var html = await httpClient.GetStringAsync(url);

            var htmlDoc = new HtmlAgilityPack.HtmlDocument();
            htmlDoc.LoadHtml(html);

            var ProductsHTML = htmlDoc.DocumentNode.Descendants("body")
                .Where(node => node.GetAttributeValue("wpb_text_column wpb_content_element", "")
                .Equals("wpb_wrapper")).ToList();

            var productLists = ProductsHTML[0].Descendants();
            Console.WriteLine(productLists);
        }
    }
}
This is the beginning of what should be a scrape to get car information from an entered registration. As you can see, the website I am using is RapidCarCheck; I have entered my registration and would like to get the engine size, top speed, etc. I have had a look at the page with Inspect Element, but have no idea what I am looking for.
A little side note, not sure if it's an easy fix, but the site has a Cloudflare anti-request blocker, and without a proxy to hide or change my IP it is hard to work on the project, as I keep needing to change my IP manually.
Thanks!
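As a general illustration (I don't know RapidCarCheck's actual markup, so the tag and class names below are placeholders you would still need to confirm via Inspect Element): HtmlAgilityPack's GetAttributeValue takes the attribute name first and a default value second, so selecting elements by CSS class usually looks more like this:

// Hypothetical: collect <div class="wpb_wrapper"> blocks and dump their text
var wrappers = htmlDoc.DocumentNode
    .Descendants("div")
    .Where(node => node.GetAttributeValue("class", "")
        .Contains("wpb_wrapper"))
    .ToList();

foreach (var wrapper in wrappers)
{
    Console.WriteLine(wrapper.InnerText.Trim());
}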
I am trying to scrape http://gameinfo.na.leagueoflegends.com/en/game-info/champions/ but I can't find where the images of those champions are in my web scraping. The problem is that it doesn't scrape every single thing... My script is...
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using System.Net;

namespace WebScraping
{
    class Program
    {
        static void Main(string[] args)
        {
            WebScraping wb = new WebScraping();
            wb.Scraping();
        }

        class WebScraping
        {
            public void Scraping()
            {
                Console.WriteLine("Type in the webpage you want to scrape : \n");
                string WebPage = Console.ReadLine();

                WebClient webc = new WebClient();
                string url = webc.DownloadString(WebPage);

                Console.WriteLine(url += "\n \t Done");
                Console.ReadLine();
            }
        }
    }
}
The thing I'm trying to find is the <a href="amumu">...</a> element.
You're right: the data is not in the original HTML. Instead, the champions grid is populated via JavaScript. This actually works in your favor; it means you'll probably be able to get your hero information in JSON format, which is much easier to parse. The only trick is finding where that JavaScript data is loaded from.
In order to do that, load the page in your browser and use the developer tools. I'll use Google Chrome as an example. Hit F12 to open the developer tools, then go to the Network tab. Now hit Shift+F5 to reload the page and record the requests. With this done, you can look through every individual item that was downloaded to render this page. I saw a full 238 requests (that's a lot!), but if you scan through the list for JSON items you'll eventually see a champion.json file. Right-click on that, and you can get this URL:
http://ddragon.leagueoflegends.com/cdn/6.24.1/data/en_US/champion.json
Look at the data in that file, and you'll find this:
"Amumu":
{
"version":"6.24.1",
"id":"Amumu",
"key":"32",
"name":"Amumu",
"title":"the Sad Mummy",
"blurb":"''Solitude can be lonelier than death.''<br><br>A lonely and melancholy soul from ancient Shurima, Amumu roams the world in search of a friend. Cursed by an ancient spell, he is doomed to remain alone forever, as his touch is death and his affection ...",
"info":
{
"attack":2,
"defense":6,
"magic":8,
"difficulty":3
},
"image":
{
"full":"Amumu.png",
"sprite":"champion0.png",
"group":"champion",
"x":192,
"y":0,
"w":48,
"h":48
},
"tags":["Tank","Mage"],
"partype":"MP",
"stats":
{
"hp":613.12,
"hpperlevel":84.0,
"mp":287.2,
"mpperlevel":40.0,
"movespeed":335.0,
"armor":23.544,
"armorperlevel":3.8,
"spellblock":32.1,
"spellblockperlevel":1.25,
"attackrange":125.0,
"hpregen":8.875,
"hpregenperlevel":0.85,
"mpregen":7.38,
"mpregenperlevel":0.525,
"crit":0.0,
"critperlevel":0.0,
"attackdamage":53.384,
"attackdamageperlevel":3.8,
"attackspeedoffset":-0.02,
"attackspeedperlevel":2.18
}
}
Use NuGet to pull in a JSON parser and you can quickly get structured data from this.
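For example, here is a minimal sketch using Json.NET (the Newtonsoft.Json NuGet package), assuming the champion entries sit under the top-level "data" object as they do in the ddragon files:

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

class ChampionList
{
    static void Main()
    {
        Run().GetAwaiter().GetResult();
    }

    static async Task Run()
    {
        var url = "http://ddragon.leagueoflegends.com/cdn/6.24.1/data/en_US/champion.json";
        using (var client = new HttpClient())
        {
            var json = await client.GetStringAsync(url);
            var root = JObject.Parse(json);

            // Each champion is a property of the "data" object, keyed by its id (e.g. "Amumu")
            foreach (JProperty champ in root["data"].Children<JProperty>())
            {
                var info = champ.Value;
                Console.WriteLine("{0} - {1} (image: {2})",
                    (string)info["name"],
                    (string)info["title"],
                    (string)info["image"]["full"]);
            }
        }
    }
}

The portrait files themselves can then typically be pulled from the matching img/champion/ path on the same CDN, though it is worth confirming that path in the Network tab as well.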
Regex helped me match the information that I needed:
MatchCollection m1 = Regex.Matches(html, "\"id\":\"(.+?)\",\"", RegexOptions.Singleline);
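To pull the captured ids out of that MatchCollection, something like this should do it:

foreach (Match m in m1)
{
    Console.WriteLine(m.Groups[1].Value); // e.g. "Amumu"
}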
Okay, so this is very basic, but I've literally started learning how to read an XML document today and I usually find answers more comprehensive on here than in online guides. Essentially I'm coding a Pokemon game which uses an XML file to load all the stats (it's one I swiped from someone else). The user will input a Pokemon and I then want to read the base stats of that Pokemon from the XML file. To give a template, this would be one of the Pokemon:
<Pokemon>
    <Name>Bulbasaur</Name>
    <BaseStats>
        <Health>5</Health>
        <Attack>5</Attack>
        <Defense>5</Defense>
        <SpecialAttack>7</SpecialAttack>
        <SpecialDefense>7</SpecialDefense>
        <Speed>5</Speed>
    </BaseStats>
</Pokemon>
The code I've tried to use is:
XDocument pokemonDoc = XDocument.Load(@"File Path Here");
while (pokemonDoc.Descendants("Pokemon").Elements("Name").ToString() == cbSpecies.SelectedText)
{
    var Stats = pokemonDoc.Descendants("Pokemon").Elements("BaseStats");
}
but this just returns pokemonDoc as null. Any idea where I'm going wrong?
NOTE:
cbSpeciesSelect is where the user selects which species of pokemon they want.
The file path definitely works, as I've used it already in my program
The while loop never actually starts
Try XML LINQ:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;

namespace ConsoleApplication1
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.xml";

        static void Main(string[] args)
        {
            XDocument doc = XDocument.Load(FILENAME);

            var pokemon = doc.Descendants("Pokemon").Select(x => new {
                name = (string)x.Element("Name"),
                health = (int)x.Element("BaseStats").Element("Health"),
                attack = (int)x.Element("BaseStats").Element("Attack"),
                defense = (int)x.Element("BaseStats").Element("Defense"),
                specialAttack = (int)x.Element("BaseStats").Element("SpecialAttack"),
                specialDefense = (int)x.Element("BaseStats").Element("SpecialDefense"),
                speed = (int)x.Element("BaseStats").Element("Speed"),
            }).FirstOrDefault();
        }
    }
}
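If you want the entry for the species the user picked rather than just the first Pokemon in the file, you could filter before FirstOrDefault, e.g. doc.Descendants("Pokemon").Where(x => (string)x.Element("Name") == cbSpecies.SelectedText).Select(...) (assuming cbSpecies is the combo box from the question).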
Can you try the code below:
foreach (var e in pokemonDoc.Descendants("Pokemon").Elements("Name"))
{
    if (e.Value == cbSpecies.SelectedText)
    {
        // e.Parent is the matching <Pokemon> element, so take its BaseStats rather than every Pokemon's
        var Stats = e.Parent.Element("BaseStats");
    }
}
Due to the lack of proper documentation, I'm not sure if HtmlAgilityPack supports screen capture in C# after it loads the HTML contents.
So is there a way I can more or less grab a screenshot using (or along with) HtmlAgilityPack so I can have a visual clue as to what happens every time I do page manipulations?
Here is my working code so far:
using HtmlAgilityPack;
using System;

namespace ConsoleApplication4
{
    class Program
    {
        static void Main(string[] args)
        {
            string urlDemo = "https://htmlagilitypack.codeplex.com/";
            HtmlWeb getHtmlWeb = new HtmlWeb();
            var doc = getHtmlWeb.Load(urlDemo);
            var sentence = doc.DocumentNode.SelectNodes("//p");
            int counter = 1;
            try
            {
                foreach (var p in sentence)
                {
                    Console.WriteLine(counter + ". " + p.InnerText);
                    counter++;
                }
            }
            catch (Exception e)
            {
                Console.WriteLine(e);
            }
            Console.ReadLine();
        }
    }
}
Currently, it scrapes and outputs all the <p> elements of the page to the console, but at the same time I want to get a screen grab of the scraped contents, and I don't know how or where to begin.
Any help is greatly appreciated. TIA
You can't do this with HTML Agility Pack. Use a different tool such as Selenium WebDriver. Here is how to do it: Take a screenshot with Selenium WebDriver
Could you use Selenium WebDriver instead?
You'll need to add the following NuGet packages to your project first:
Selenium.WebDriver
Selenium.Support
Loading a page and taking a screenshot is then as simple as...
using System;
using System.Drawing.Imaging;
using System.IO;
using OpenQA.Selenium;
using OpenQA.Selenium.Firefox;
using OpenQA.Selenium.Support.UI;

namespace SeleniumTest
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a web driver that uses Firefox
            var driver = new FirefoxDriver(
                new FirefoxBinary(), new FirefoxProfile(), TimeSpan.FromSeconds(120));

            // Load your page
            driver.Navigate().GoToUrl("http://google.com");

            // Wait until the page has actually loaded
            var wait = new WebDriverWait(driver, new TimeSpan(0, 0, 10));
            wait.Until(d => d.Title.Contains("Google"));

            // Take a screenshot and save it to a file (you must have full access rights to the save location)
            var myDesktop = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
            ((ITakesScreenshot)driver).GetScreenshot().SaveAsFile(Path.Combine(myDesktop, "google-screenshot.png"), ImageFormat.Png);

            driver.Close();
        }
    }
}
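Note: if you are on a newer Selenium .NET release, SaveAsFile takes Selenium's own ScreenshotImageFormat enum rather than System.Drawing's ImageFormat, so that line would become something like:

((ITakesScreenshot)driver).GetScreenshot().SaveAsFile(Path.Combine(myDesktop, "google-screenshot.png"), ScreenshotImageFormat.Png);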