I have a question about how to press Ctrl+S on any page in Chrome WebDriver using C#. I have been searching for a solution for the past two days and have found nothing. I would appreciate any help.
Here is the code I have written:
string CaptchaSrc = driver.FindElement(By.XPath("//img[@class='captchaImage']")).GetAttribute("src");
Thread.Sleep(2000);
driver.Navigate().GoToUrl(CaptchaSrc);
driver.FindElement(By.TagName("body")).SendKeys(OpenQA.Selenium.Keys.Control + '\u0053');
Actions action = new Actions(driver);
char S = '\u0053';
action.SendKeys(System.Windows.Forms.Keys.Control + Convert.ToString(S)).Build().Perform();
Thread.Sleep(4000);
System.Windows.Forms.SendKeys.SendWait(@"C:\Users\Blue\Downloads\" + captchaNumbering.ToString());
System.Windows.Forms.SendKeys.SendWait(@"{Enter}");
I have tried almost every approach I could find on Stack Overflow, but nothing works for me. I just want to press Ctrl+S after navigating to the src URL of the image I scraped.
Maybe you can use Html Agility Pack to save the page, or try this:
.SendKeys(Keys.Control + "a")
.SendKeys(Keys.Control + "s")
If the picture appears, you can use code like this, for example:
string pagehtml = element.GetAttribute("innerHTML");
// HtmlWeb.Load expects a URL, so parse the HTML string with HtmlDocument.LoadHtml instead.
HtmlDocument document = new HtmlDocument();
document.LoadHtml(pagehtml);
document.Save("name.html");
I hope this example works. I didn't test it, but it may point you in the right direction.
Following is my code
IWebDriver driver = new ChromeDriver();
driver.Url = "https://www.google.com/";
driver.Url = "https://login.yahoo.com/";
I want both links to open in different tabs of the same browser window.
How can I achieve this?
TIA
Try this:
((IJavaScriptExecutor)driver).ExecuteScript("window.open();");
instead of
IWebElement body = driver.FindElement(By.TagName("body"));
body.SendKeys(Keys.Control + "t");
If you need to continue working in the first window, follow the steps described here; just remember to replace Ctrl+T with the JavaScript above.
You should be able to do that as follows:
driver.findElement(By.cssSelector("body")).sendKeys(Keys.CONTROL+"t");
driver.get("https://login.yahoo.com/");
Good luck
Andreas
I want to write a scraper program that searches for a keyword on Google, but I am having trouble getting started.
My problem is:
Suppose a Windows application (C#) has two textboxes and a button. The first textbox contains "www.google.com" and the second contains the keyword, for example:
textbox1: www.google.com
textbox2: "cricket"
I want code for the button click event that will search for "cricket" on Google. If anyone has an idea how to program this in C#, please help me.
Best regards
I googled my problem and found a solution.
We can use the Google API for this purpose. After adding a reference to the Google API, add the following namespace to the program:
using Google.API.Search;
Write the following code in the button click event:
var client = new GwebSearchClient("http://www.google.com");
var results = client.Search("google api for .NET", 100);
foreach (var webResult in results)
{
//Console.WriteLine("{0}, {1}, {2}", webResult.Title, webResult.Url, webResult.Content);
listBox1.Items.Add(webResult.ToString ());
}
Test my solution and leave comments. Thanks, everybody.
I agree with Paqogomez that you don't appear to have put much work into this but I also understand that it can be hard to get started. Here is some sample code that should get you on the right path.
// Requires: using System.Net; using System.Collections.Specialized;
private void button1_Click(object sender, EventArgs e)
{
string uriString = "http://www.google.com/search";
string keywordString = "Test Keyword";
WebClient webClient = new WebClient();
NameValueCollection nameValueCollection = new NameValueCollection();
nameValueCollection.Add("q", keywordString);
webClient.QueryString.Add(nameValueCollection);
textBox1.Text = webClient.DownloadString(uriString);
}
This code will search for "Test Keyword" on Google and return the results as a string.
The problem with what you are asking is that Google will return the results as HTML, which you will need to parse. I really think you need to do some research on the Google API and what is needed to request data from Google programmatically. Start your search here: Google Developers.
Hope this helps get you started on the right path.
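If you would rather build the request URL yourself than go through webClient.QueryString, a minimal sketch could look like the following. The class and method names (SearchUrl, Build) are made up for illustration; the only framework call is Uri.EscapeDataString, which percent-encodes the keyword so that spaces and special characters survive in the query string.

```csharp
using System;

public static class SearchUrl
{
    // Hypothetical helper: appends the keyword as a percent-encoded "q" parameter.
    public static string Build(string baseUrl, string keyword)
    {
        return baseUrl + "?q=" + Uri.EscapeDataString(keyword);
    }
}
```

For example, SearchUrl.Build("http://www.google.com/search", "Test Keyword") yields "http://www.google.com/search?q=Test%20Keyword", which can then be passed to webClient.DownloadString.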
You can use the WebClient class and its DownloadString method for searches. Use a regex to match URLs in the result string.
For example:
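WebClient Web = new WebClient();
// Pass the escaped keyword in the "q" parameter.
string Source = Web.DownloadString("https://www.google.com/search?q=" + Uri.EscapeDataString(textBox2.Text));
// No ^ or $ anchors, so URLs are matched anywhere in the page source.
Regex regex = new Regex(@"https?://([\w-]+\.)+[\w-]+(/[\w\-./?%&=]*)?");
MatchCollection Collection = regex.Matches(Source);
List<string> Urls = new List<string>();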
foreach (Match match in Collection)
{
Urls.Add(match.ToString());
}
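To try the matching step without a network call, here is a minimal sketch that runs an unanchored URL pattern over an in-memory string. The class name UrlExtractor and the exact pattern are assumptions for illustration; the pattern is deliberately simple and will miss or truncate some valid URLs.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlExtractor
{
    // Unanchored pattern: scheme, dotted host, optional path and query characters.
    private static readonly Regex UrlPattern =
        new Regex(@"https?://([\w-]+\.)+[\w-]+(/[\w\-./?%&=]*)?");

    // Returns every substring of the input that looks like an http(s) URL.
    public static List<string> Extract(string source)
    {
        var urls = new List<string>();
        foreach (Match match in UrlPattern.Matches(source))
        {
            urls.Add(match.Value);
        }
        return urls;
    }
}
```

Calling UrlExtractor.Extract on a downloaded page returns the matches in document order, duplicates included.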
I am using Selenium webdriver for UI automation purpose. Below is my sample code
IWebDriver driver = new OpenQA.Selenium.IE.InternetExplorerDriver();
string url ="http://stackoverflow.com";
driver.Navigate().GoToUrl(url);
string pagesource = driver.PageSource;
pagesource variable does not have the doctype. I need to know the DOCTYPE for W3C validation. Is there any way to get DOCTYPE of html source through selenium?
This thread shows there is no way to get the DOCTYPE of the HTML source through Selenium; instead, you can make an HTTP request from .NET and get the DOCTYPE. I don't want to make a separate HTTP request just to get the DOCTYPE.
Using FirefoxDriver instead of InternetExplorerDriver will get you the DOCTYPE. Unfortunately this won't solve your problem - the source you're getting with driver.PageSource is already preprocessed by the browser, so trying to validate that code won't give reliable results.
Unfortunately there are no easy solutions.
If your page is not password protected you can use "validate by uri" method.
Otherwise you need to obtain the page source. I know two ways of doing it (I implemented both in my project). One is to use a proxy. If you are using C#, take a look at FiddlerCore. The other way is to make another request using JavaScript and XMLHttpRequest. You can find an example here (search the page for XMLHttpRequest).
For W3C validation, we basically have three issues when automating through Selenium WebDriver:
1. Getting the proper page source, since driver.PageSource is not reliable.
2. Getting the DOCTYPE of the HTML source.
3. Dealing with controls rendered through AJAX calls. Since we cannot access these controls in the page source, how do we get the exact 'generated source' of the page?
All of the above can be done by executing JavaScript through Selenium WebDriver.
Store the code snippet below in a text file called 'htmlsource.txt'.
function outerHTML(node){
    // In IE and Chrome use the built-in property; otherwise build one, as older versions of Firefox
    // do not support the element.outerHTML property.
    return node.outerHTML || (
        function(n){
            var div = document.createElement('div'), h;
            div.appendChild( n.cloneNode(true) );
            h = div.innerHTML;
            div = null;
            return h;
        })(node);
}
var outerhtml = outerHTML(document.getElementsByTagName('html')[0]);
var node = document.doctype;
var doctypestring = "";
if (node)
{
    // IE8 and below do not have document.doctype; accessing it returns null.
    doctypestring = "<!DOCTYPE "
        + node.name
        + (node.publicId ? ' PUBLIC "' + node.publicId + '"' : '')
        + (!node.publicId && node.systemId ? ' SYSTEM' : '')
        + (node.systemId ? ' "' + node.systemId + '"' : '')
        + '>';
}
else
{
    // For IE8 and below you can access the doctype like this.
    doctypestring = document.all[0].text;
}
return doctypestring + outerhtml;
And now the C# code to access the complete AJAX rendered HTML source with doctype
IJavaScriptExecutor js = (IJavaScriptExecutor)driver;
string jsToexecute = File.ReadAllText("htmlsource.txt");
string completeHTMLGeneratedSourceWithDoctype = (string)js.ExecuteScript(jsToexecute);
I'm using Response.Redirect() to pass data (containing HTML) from one page to another. This works fine in Google Chrome but in Internet Explorer it said: "Couldn't find page!"
Does someone know what this is?
Thank you in advance
This is the URL:
string url = "Detailscherm.aspx?"
+ "melder=" + Server.UrlEncode(gv.SelectedRow.Cells[1].Text)
+ "&onderwerp=" + Server.UrlEncode(gv.SelectedRow.Cells[2].Text)
+ "&omschrijving=" + Server.UrlEncode(lblOmschrijving.Text)
+ "&fasedatum=" + Server.UrlEncode(gv.SelectedRow.Cells[4].Text)
+ "&outlookid=" + Server.UrlEncode(lblOutlookID.Text)
+ "&status=" + Server.UrlEncode(status)
+ "&niv1=" + Server.UrlEncode("")
+ "&niv2=" + Server.UrlEncode("");
Response.Redirect(url);
lblOmschrijving is a label which contains HTML-code
this is the value of URL right before Redirect:
"Detailscherm.aspx?melder=EBE&onderwerp=Test+feedback&omschrijving=%3chtml+xmlns%3ao%3d%22urn%3aschemas-microsoft-com%3aoffice%3aoffice%22+xmlns%3aw%3d%22urn%3aschemas-microsoft-com%3aoffice%3aword%22+xmlns%3d%22http%3a%2f%2fwww.w3.org%2fTR%2fREC-html40%22%3e%0d%0a%3chead%3e%0d%0a%3cmeta+http-equiv%3d%22Content-Type%22+content%3d%22text%2fhtml%3b+charset%3dutf-8%22%3e%0d%0a%3cmeta+name%3d%22Generator%22+content%3d%22Microsoft+Word+11+(filtered+medium)%22%3e%0d%0a%3cstyle%3e%0d%0a%3c!--%0d%0a+%2f*+Style+Definitions+*%2f%0d%0a+p.MsoNormal%2c+li.MsoNormal%2c+div.MsoNormal%0d%0a%09%7bmargin%3a0cm%3b%0d%0a%09margin-bottom%3a.0001pt%3b%0d%0a%09font-size%3a12.0pt%3b%0d%0a%09font-family%3a%22Times+New+Roman%22%3b%7d%0d%0aa%3alink%2c+span.MsoHyperlink%0d%0a%09%7bcolor%3ablue%3b%0d%0a%09text-decoration%3aunderline%3b%7d%0d%0aa%3avisited%2c+span.MsoHyperlinkFollowed%0d%0a%09%7bcolor%3apurple%3b%0d%0a%09text-decoration%3aunderline%3b%7d%0d%0aspan.E-mailStijl17%0d%0a%09%7bmso-style-type%3apersonal-compose%3b%0d%0a%09font-family%3aArial%3b%0d%0a%09color%3awindowtext%3b%7d%0d%0a%40page+Section1%0d%0a%09%7bsize%3a595.3pt+841.9pt%3b%0d%0a%09margin%3a70.85pt+70.85pt+70.85pt+70.85pt%3b%7d%0d%0adiv.Section1%0d%0a%09%7bpage%3aSection1%3b%7d%0d%0a--%3e%0d%0a%3c%2fstyle%3e%0d%0a%3c%2fhead%3e%0d%0a%3cbody+lang%3d%22NL%22+link%3d%22blue%22+vlink%3d%22purple%22%3e%0d%0a%3cdiv+class%3d%22Section1%22%3e%0d%0a%3cp+class%3d%22MsoNormal%22%3e%3cfont+size%3d%222%22+face%3d%22Arial%22%3e%3cspan+style%3d%22font-size%3a10.0pt%3b%0d%0afont-family%3aArial%22%3eMohamed%2c%3co%3ap%3e%3c%2fo%3ap%3e%3c%2fspan%3e%3c%2ffont%3e%3c%2fp%3e%0d%0a%3cp+class%3d%22MsoNormal%22%3e%3cfont+size%3d%222%22+face%3d%22Arial%22%3e%3cspan+style%3d%22font-size%3a10.0pt%3b%0d%0afont-family%3aArial%22%3e%3co%3ap%3e%26nbsp%3b%3c%2fo%3ap%3e%3c%2fspan%3e%3c%2ffont%3e%3c%2fp%3e%0d%0a%3cp+class%3d%22MsoNormal%22%3e%3cfont+size%3d%222%22+face%3d%22Arial%22%3e%3cspan+style%3d%22font-size%3a10.0pt%3b%0d%0afont-family%3aArial%22%3eIk+h
eb+zonet+enkele+zaken+getest.+De+testfeedback+is+opgenomen+in+de+bijlage.%3co%3ap%3e%3c%2fo%3ap%3e%3c%2fspan%3e%3c%2ffont%3e%3c%2fp%3e%0d%0a%3cp+class%3d%22MsoNormal%22%3e%3cfont+size%3d%222%22+face%3d%22Arial%22%3e%3cspan+style%3d%22font-size%3a10.0pt%3b%0d%0afont-family%3aArial%22%3e%3ca+href%3d%22file%3a%2f%2f%2f%5c%5cJUPITER%5cInformatica%5cProjecten%5cIntegratie%2520SLA%2520rapportering%2520op%2520IT%2520Helpdesk%2520mailbox%5c6.%2520Test%2520en%2520Training%5cTesten%2520Integratie%2520helpdesk%2520sla-%2520Opmerkingen.xls%22%3eO%3a%5cProjecten%5cIntegratie%0d%0a+SLA+rapportering+op+IT+Helpdesk+mailbox%5c6.+Test+en+Training%5cTesten+Integratie+helpdesk+sla-+Opmerkingen.xls%3c%2fa%3e%3co%3ap%3e%3c%2fo%3ap%3e%3c%2fspan%3e%3c%2ffont%3e%3c%2fp%3e%0d%0a%3cp+class%3d%22MsoNormal%22%3e%3cfont+size%3d%222%22+face%3d%22Arial%22%3e%3cspan+style%3d%22font-size%3a10.0pt%3b%0d%0afont-family%3aArial%22%3e%3co%3ap%3e%26nbsp%3b%3c%2fo%3ap%3e%3c%2fspan%3e%3c%2ffont%3e%3c%2fp%3e%0d%0a%3cp+class%3d%22MsoNormal%22%3e%3cfont+size%3d%222%22+face%3d%22Arial%22%3e%3cspan+style%3d%22font-size%3a10.0pt%3b%0d%0afont-family%3aArial%22%3eWe+zullen+hier+vanmiddag+samen+naar+kijken.%3co%3ap%3e%3c%2fo%3ap%3e%3c%2fspan%3e%3c%2ffont%3e%3c%2fp%3e%0d%0a%3cp+class%3d%22MsoNormal%22%3e%3cfont+size%3d%222%22+face%3d%22Arial%22%3e%3cspan+style%3d%22font-size%3a10.0pt%3b%0d%0afont-family%3aArial%22%3e%3co%3ap%3e%26nbsp%3b%3c%2fo%3ap%3e%3c%2fspan%3e%3c%2ffont%3e%3c%2fp%3e%0d%0a%3cp+class%3d%22MsoNormal%22%3e%3cfont+size%3d%222%22+face%3d%22Arial%22%3e%3cspan+style%3d%22font-size%3a10.0pt%3b%0d%0afont-family%3aArial%22%3eGroeten%2c%3co%3ap%3e%3c%2fo%3ap%3e%3c%2fspan%3e%3c%2ffont%3e%3c%2fp%3e%0d%0a%3cp+class%3d%22MsoNormal%22%3e%3cfont+size%3d%222%22+face%3d%22Arial%22%3e%3cspan+style%3d%22font-size%3a10.0pt%3b%0d%0afont-family%3aArial%22%3e%3co%3ap%3e%26nbsp%3b%3c%2fo%3ap%3e%3c%2fspan%3e%3c%2ffont%3e%3c%2fp%3e%0d%0a%3cp+class%3d%22MsoNormal%22%3e%3cfont+size%3d%222%22+face%3d%22Arial%22%3e%3cspan+style%
3d%22font-size%3a10.0pt%3b%0d%0afont-family%3aArial%22%3eEric%3co%3ap%3e%3c%2fo%3ap%3e%3c%2fspan%3e%3c%2ffont%3e%3c%2fp%3e%0d%0a%3cp+class%3d%22MsoNormal%22%3e%3cfont+size%3d%222%22+face%3d%22Arial%22%3e%3cspan+style%3d%22font-size%3a10.0pt%3b%0d%0afont-family%3aArial%22%3e%3co%3ap%3e%26nbsp%3b%3c%2fo%3ap%3e%3c%2fspan%3e%3c%2ffont%3e%3c%2fp%3e%0d%0a%3cp+class%3d%22MsoNormal%22%3e%3cfont+size%3d%221%22+face%3d%22Arial%22%3e%3cspan+style%3d%22font-size%3a9.0pt%3b%0d%0afont-family%3aArial%3blayout-grid-mode%3aline%22%3e__________________________________________%3co%3ap%3e%3c%2fo%3ap%3e%3c%2fspan%3e%3c%2ffont%3e%3c%2fp%3e%0d%0a%3cp+class%3d%22MsoNormal%22%3e%3cfont+size%3d%221%22+face%3d%22Times+New+Roman%22%3e%3cspan+style%3d%22font-size%3a%0d%0a9.0pt%3blayout-grid-mode%3aline%22%3e%3co%3ap%3e%26nbsp%3b%3c%2fo%3ap%3e%3c%2fspan%3e%3c%2ffont%3e%3c%2fp%3e%0d%0a%3cp+class%3d%22MsoNormal%22%3e%3cfont+size%3d%221%22+face%3d%22Arial%22%3e%3cspan+style%3d%22font-size%3a9.0pt%3b%0d%0afont-family%3aArial%3blayout-grid-mode%3aline%22%3eEric+Op+de+Beeck%3c%2fspan%3e%3c%2ffont%3e%3cspan+style%3d%22layout-grid-mode%3aline%22%3e%3co%3ap%3e%3c%2fo%3ap%3e%3c%2fspan%3e%3c%2fp%3e%0d%0a%3cp+class%3d%22MsoNormal%22%3e%3cfont+size%3d%221%22+face%3d%22Arial%22%3e%3cspan+style%3d%22font-size%3a9.0pt%3b%0d%0afont-family%3aArial%3blayout-grid-mode%3aline%22%3eAfdelingshoofd+Informatica%3c%2fspan%3e%3c%2ffont%3e%3cfont+size%3d%222%22%3e%3cspan+style%3d%22font-size%3a10.0pt%3blayout-grid-mode%3aline%22%3e%3co%3ap%3e%3c%2fo%3ap%3e%3c%2fspan%3e%3c%2ffont%3e%3c%2fp%3e%0d%0a%3cp+class%3d%22MsoNormal%22%3e%3cfont+size%3d%221%22+face%3d%22Arial%22%3e%3cspan+style%3d%22font-size%3a9.0pt%3b%0d%0afont-family%3aArial%3blayout-grid-mode%3aline%22%3e%3ca+href%3d%22mailto%3aEric.Op.de.Beeck%40etaplighting.com%22+title%3d%22mailto%3aEric.Op.de.Beeck%40etaplighting.com%22%3eEric.Op.de.Beeck%40etaplighting.com%3c%2fa%3e%3co%3ap%3e%3c%2fo%3ap%3e%3c%2fspan%3e%3c%2ffont%3e%3c%2fp%3e%0d%0a%3cp+class%3d%22MsoNormal%2
2%3e%3cfont+size%3d%221%22+face%3d%22Times+New+Roman%22%3e%3cspan+style%3d%22font-size%3a%0d%0a9.0pt%3blayout-grid-mode%3aline%22%3e%3co%3ap%3e%26nbsp%3b%3c%2fo%3ap%3e%3c%2fspan%3e%3c%2ffont%3e%3c%2fp%3e%0d%0a%3cp+class%3d%22MsoNormal%22%3e%3cfont+size%3d%221%22+face%3d%22Arial%22%3e%3cspan+style%3d%22font-size%3a9.0pt%3b%0d%0afont-family%3aArial%3blayout-grid-mode%3aline%22%3eAntwerpsesteenweg+130+-+B-2390+Malle%3co%3ap%3e%3c%2fo%3ap%3e%3c%2fspan%3e%3c%2ffont%3e%3c%2fp%3e%0d%0a%3cp+class%3d%22MsoNormal%22%3e%3cfont+size%3d%221%22+face%3d%22Arial%22%3e%3cspan+style%3d%22font-size%3a9.0pt%3b%0d%0afont-family%3aArial%3blayout-grid-mode%3aline%22%3eTel.+03+310+02+11+-+Fax+03+311+61+42%3co%3ap%3e%3c%2fo%3ap%3e%3c%2fspan%3e%3c%2ffont%3e%3c%2fp%3e%0d%0a%3cp+class%3d%22MsoNormal%22%3e%3cfont+size%3d%221%22+face%3d%22Arial%22%3e%3cspan+lang%3d%22EN-GB%22+style%3d%22font-size%3a%0d%0a9.0pt%3bfont-family%3aArial%3bletter-spacing%3a.5pt%3blayout-grid-mode%3aline%22%3eBTW+BE+0424+980+655+RPR+Antwerpen%3c%2fspan%3e%3c%2ffont%3e%3cfont+size%3d%221%22+face%3d%22Arial%22%3e%3cspan+lang%3d%22EN-GB%22+style%3d%22font-size%3a9.0pt%3bfont-family%3aArial%3blayout-grid-mode%3aline%22%3e%3co%3ap%3e%3c%2fo%3ap%3e%3c%2fspan%3e%3c%2ffont%3e%3c%2fp%3e%0d%0a%3cp+class%3d%22MsoNormal%22%3e%3cu%3e%3cfont+size%3d%221%22+color%3d%22blue%22+face%3d%22Arial%22%3e%3cspan+lang%3d%22EN-GB%22+style%3d%22font-size%3a9.0pt%3bfont-family%3aArial%3bcolor%3ablue%3blayout-grid-mode%3aline%22%3ewww.etaplighting.com%3c%2fspan%3e%3c%2ffont%3e%3c%2fu%3e%3cfont+size%3d%221%22+face%3d%22Arial%22%3e%3cspan+lang%3d%22EN-GB%22+style%3d%22font-size%3a9.0pt%3bfont-family%3aArial%3b%0d%0alayout-grid-mode%3aline%22%3e%3co%3ap%3e%3c%2fo%3ap%3e%3c%2fspan%3e%3c%2ffont%3e%3c%2fp%3e%0d%0a%3cp+class%3d%22MsoNormal%22%3e%3cfont+size%3d%221%22+face%3d%22Arial%22%3e%3cspan+style%3d%22font-size%3a9.0pt%3b%0d%0afont-family%3aArial%3blayout-grid-mode%3aline%22%3e__________________________________________%3c%2fspan%3e%3c%2ffont%3e%3cf
ont+size%3d%221%22+face%3d%22Arial%22%3e%3cspan+style%3d%22font-size%3a9.0pt%3bfont-family%3aArial%3blayout-grid-mode%3a%0d%0aline%22%3e%3co%3ap%3e%3c%2fo%3ap%3e%3c%2fspan%3e%3c%2ffont%3e%3c%2fp%3e%0d%0a%3cp+class%3d%22MsoNormal%22%3e%3cfont+size%3d%223%22+face%3d%22Times+New+Roman%22%3e%3cspan+style%3d%22font-size%3a%0d%0a12.0pt%22%3e%3co%3ap%3e%26nbsp%3b%3c%2fo%3ap%3e%3c%2fspan%3e%3c%2ffont%3e%3c%2fp%3e%0d%0a%3c%2fdiv%3e%0d%0a%3c%2fbody%3e%0d%0a%3c%2fhtml%3e%0d%0a&fasedatum=21%2f03%2f2011+12%3a08%3a13&outlookid=AAMkAGI2MGM0NjY2LTI5MGYtNGVmMC1iMTg2LThlZDNmODFhZDIwNQBGAAAAAAC5W4YdHHPkSL1VgU1WnUztBwD2It7i8bOLTI4%2fH%2bc6MwEsAC0BCIilAAD2It7i8bOLTI4%2fH%2bc6MwEsAC0M%2b0T9AAA%3d&status=0&niv1=&niv2="
The query string is too long. Internet Explorer only accepts URLs of up to 2083 characters; Chrome and other browsers accept longer ones. I have had a similar problem.
Try using Server.Transfer(), or put the variables in session or post a form.
// No URL encoding is needed here: session values never travel through the URL.
Session["melder"] = gv.SelectedRow.Cells[1].Text;
Session["onderwerp"] = gv.SelectedRow.Cells[2].Text;
...
Response.Redirect("Detailscherm.aspx");
You can then fetch these values back on that page
string melder = (string)Session["melder"]; // Session values are stored as object, so cast them back
Session["melder"] = "";
In any case, it does not seem like a very good idea to put all that data in a querystring. If anyone changes the values in the address bar, it could make your pages show incorrect data.
Try using sessions, or Post to carry large amounts of data across pages.
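As a rough guard, you could check the URL length before redirecting and fall back to session state when it is too long. The sketch below assumes the 2083-character Internet Explorer limit mentioned above and that it applies to the full absolute URL; the class and member names are made up for illustration.

```csharp
public static class UrlLimits
{
    // Maximum URL length Internet Explorer accepts, as cited in this thread.
    public const int InternetExplorerMaxUrlLength = 2083;

    // Returns true when the URL is short enough for Internet Explorer.
    public static bool FitsInInternetExplorer(string url)
    {
        return url.Length <= InternetExplorerMaxUrlLength;
    }
}
```

In the page code this could gate the redirect: call Response.Redirect(url) only when the absolute URL passes the check, and otherwise store the values in Session and redirect to "Detailscherm.aspx" without a query string.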
try this
string value = "../containing html";
Response.Redirect("http://www.mysite.com/?Value=" + Server.UrlEncode(value));
The maximum URL length for an HTTP GET differs between browsers:
IE supports only 2083 characters.
Google Chrome supports 8182 characters.
Safari supports 80,000.
Opera supports 190,000.
I have just begun scraping basic text off web pages, and am currently using the HTMLAgilityPack C# library. I had some success with boxscores off rivals.yahoo.com (sports is my thing so why not scrape something interesting?) but I am stuck on NHL's game summary pages. I think this is kind of an interesting problem so I would post it here.
The page I am testing is:
http://www.nhl.com/scores/htmlreports/20102011/GS020079.HTM
Upon first glance, it seems like basic text with no ajax or stuff to mess up a basic scraper. Then I realize I can't right click due to some javascript, so I work around that. I right click in firefox and get the xpath of the home team using XPather and I get:
/html/body/table[@id='MainTable']/tbody/tr[1]/td/table[@id='StdHeader']/tbody/tr/td/table/tbody/tr/td[3]/table[@id='Home']/tbody/tr[3]/td
When I try to grab that node / inner text, htmlagilitypack won't find it. Does anyone see anything strange in the page's source code that might be stopping me?
I am new to this and still learning how people might stop me from scraping, any tips or tricks are gladly appreciated!
p.s. I observe all site rules regarding bots, etc, but I noticed this strange behavior and saw it as a challenge.
Ok so it appears that my xpaths have tbody's in them. When I remove these tbodys manually from the xpath, HTMLAgilityPack can handle it fine.
I'd still like to know why I am getting invalid xpaths, but for now I have answered my question.
I think unless my xpath knowledge is heaps flawed(probably) the problem is with the /tbody node in your xpath expression.
When I do
string test = string.Empty;
StreamReader sr = new StreamReader(@"C:\gs.htm");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(sr);
sr.Close();
sr = null;
string xpath = @"//table[@id='Home']/tr[3]/td";
test = doc.DocumentNode.SelectSingleNode(xpath).InnerText;
That works fine.. returns a
"COLUMBUS BLUE JACKETSGame 5 Home Game 3"
which I hope is the string you wanted.
Examining the html I couldn't find a /tbody.
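The likely reason for the extra steps is that browsers insert an implied tbody element into every table when they build the DOM, so tools that read the XPath from the live DOM report it, while HtmlAgilityPack parses the raw markup, where it is absent. If you copy XPaths from a browser tool often, a small helper can strip those steps. This is a sketch with a made-up class name, and it assumes the page's markup really contains no tbody tags.

```csharp
using System.Text.RegularExpressions;

public static class XPathCleaner
{
    // Removes "/tbody" steps, with or without an index such as tbody[1].
    public static string RemoveTbody(string xpath)
    {
        return Regex.Replace(xpath, @"/tbody(\[\d+\])?(?=/|$)", "");
    }
}
```

The cleaned string can be passed straight to doc.DocumentNode.SelectSingleNode.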