How can I select a text box that is available on a webpage so that my program can add data to the selected text box?
I am trying to set up a C# program that will automatically log in to a series of websites.
Example website:
http://what.cd/login.php
Current Code:
private void login()
{
    // Note: this request object is created but never used.
    System.Net.HttpWebRequest whatCDReq = (System.Net.HttpWebRequest)System.Net.WebRequest.Create("http://what.cd/login.php");

    // Reuse the document already loaded in the WebBrowser control
    // instead of allocating a separate HTMLDocumentClass.
    HTMLDocument htmlDoc = (HTMLDocument)webBrowser1.Document;
    HTMLInputElement username = (HTMLInputElement)htmlDoc.all.item("p", 0);
    username.value = "Test";
}
What you want to do is send form requests to the server directly. Parse the webpage for the text-box form controls and submit the data in the format the server expects (the data handling is usually done in PHP on the server end).
Look in the page source for the JavaScript function that performs the submission itself (it formats the data and sends it to the server). I'd recommend translating that function into your language of choice, or you could run the JavaScript directly through some third-party library. Despite what you may think, I find the first option is ultimately easier for small tasks like this.
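As a rough sketch of that first option, assuming the login form posts `username` and `password` fields back to `login.php` (the field names here are assumptions; check the page's actual `<form>` markup for the real ones), a direct POST might look like this:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;

class LoginPoster
{
    // Builds an application/x-www-form-urlencoded body from
    // alternating key/value arguments.
    public static string BuildFormBody(params string[] kv)
    {
        var parts = new List<string>();
        for (int i = 0; i + 1 < kv.Length; i += 2)
            parts.Add(Uri.EscapeDataString(kv[i]) + "=" + Uri.EscapeDataString(kv[i + 1]));
        return string.Join("&", parts);
    }

    // Posts the form fields the way the browser's own submit would.
    // The "username"/"password" field names are assumptions.
    public static string PostLogin(string url, string user, string pass)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(
            BuildFormBody("username", user, "password", pass));

        var req = (HttpWebRequest)WebRequest.Create(url);
        req.Method = "POST";
        req.ContentType = "application/x-www-form-urlencoded";
        req.ContentLength = bytes.Length;
        using (Stream s = req.GetRequestStream())
            s.Write(bytes, 0, bytes.Length);

        using (var resp = (HttpWebResponse)req.GetResponse())
        using (var reader = new StreamReader(resp.GetResponseStream()))
            return reader.ReadToEnd();
    }
}
```

You will usually also want to attach a `CookieContainer` to the request so the session cookie from the login response carries over to later requests.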
I am trying to parse a Google Play Store HTML page in C# / .NET Core. Unfortunately, Google does not provide an API to get mobile application info (such as version, last update date, and so on), while Apple does. This is why I am trying to parse the HTML page to get the info I need.
However, it seems they recently published a new version where the user has to press an arrow button to see the app's info, which is then displayed in a popup.
To see what I mean, consider the example of the WhatsApp application: https://play.google.com/store/apps/details?id=com.whatsapp&hl=en
To get the info for this app (like release date, version, and so on), the user now has to press the arrow near "About this app".
Previously, the below code was working perfectly:
var id = "com.whatsapp";
var language = "en";
var url = string.Format("https://play.google.com/store/apps/details?id={0}&hl={1}", id, language);
string result;
WebClient client = new WebClient();
client.Encoding = System.Text.UTF8Encoding.UTF8;
result = client.DownloadString(url);
MatchCollection matches = Regex.Matches(result, "<div class=\"hAyfc\">.*?<span class=\"htlgb\"><div class=\"IQ1z0d\"><span class=\"htlgb\">(?<content>.*?)</span></div></span></div>");
objAndroidDetails.updated = matches[0].Groups["content"].Value;
objAndroidDetails.version = matches[3].Groups["content"].Value;
...
But now this no longer works, for two reasons:
The regular expression is no longer valid.
client.DownloadString(url) downloads only the markup served before the button is triggered, so I cannot extract the info because it simply isn't in the response.
So, can anybody help me solve issue #2? I need to somehow trigger the button so that the HTML I need is present and can be matched.
Thanks
Context:
I'm developing a desktop application in C# to scrape / analyse product information from individual web pages in a small number of domains. I use HtmlAgilityPack to capture and parse pages to fetch the data needed. I code different parsing rules for different domains.
Issue:
Pages from one particular domain, when displayed in a browser, can show perhaps 60-80 products. However, when I parse them through HtmlAgilityPack I only get 20 products at most. Looking at the raw HTML in Firefox's "View Page Source", there also appear to be only 20 of the relevant product divs present. I conclude that the remaining products must be loaded in via a script, perhaps to ease the load on the server. Indeed, I can sometimes see this happening in the browser: there is a short pause while 20 more products load, then another 20, and so on.
Question:
How can I access, through HtmlAgilityPack or otherwise, the full set of product divs present once all the scripting is complete?
You could use the WebBrowser control in System.Windows.Forms to load the data, and Agility Pack to parse it. It would look something like this:
var browser = new WebBrowser();
browser.Navigate("http://whatever.com");
while (true)
{
    if (browser.ReadyState == WebBrowserReadyState.Complete && !browser.IsBusy)
    {
        break;
    }
    // Not for production: pump the message loop so navigation can
    // actually progress while we wait.
    Application.DoEvents();
    Thread.Sleep(1000);
}

var doc = new HtmlAgilityPack.HtmlDocument();
var dom = (IHTMLDocument3)browser.Document.DomDocument;
StringReader reader = new StringReader(dom.documentElement.outerHTML);
doc.Load(reader);
Ok, I've got something working using the Selenium package (available via NuGet). The code looks like this:
private HtmlDocument FetchPageWithSelenium(string url)
{
    IWebDriver driver = new FirefoxDriver();
    IJavaScriptExecutor js = (IJavaScriptExecutor)driver;
    driver.Navigate().GoToUrl(url);

    // Scroll to the bottom of the page and pause for more products to load.
    // Do it four times as there may be 4x20 products to retrieve.
    for (int i = 0; i < 4; i++)
    {
        js.ExecuteScript("window.scrollTo(0, document.body.scrollHeight);");
        Thread.Sleep(2000);
    }

    HtmlDocument webPage = new HtmlDocument();
    webPage.LoadHtml(driver.PageSource);  // PageSource is already a string
    driver.Quit();
    return webPage;
}
This returns an HtmlAgilityPack HtmlDocument ready for further analysis having first forced the page to fully load by repeatedly scrolling to the bottom. Two issues outstanding:
The code launches Firefox and then stops it again when complete. That's a bit clumsy and I'd rather all that happened invisibly. It's suggested that you can avoid this by using a PhantomJS driver instead of the Firefox driver. This didn't help though as it just pops up a Windows console window instead.
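For that console-window issue, the Selenium .NET bindings expose a driver-service flag that suppresses it. A sketch, assuming the PhantomJS driver package is installed (the same `HideCommandPromptWindow` property exists on the other driver services too):

```csharp
using OpenQA.Selenium;
using OpenQA.Selenium.PhantomJS;

class HeadlessFetcher
{
    // Creates a PhantomJS driver without the console window popping up.
    public static IWebDriver CreateHiddenDriver()
    {
        var service = PhantomJSDriverService.CreateDefaultService();
        service.HideCommandPromptWindow = true;  // suppress the popup console
        return new PhantomJSDriver(service);
    }
}
```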
It's a bit slow due to the time taken to load the browser and pause while the scripting loads the supplementary content. I can probably live with it though.
I'll try to rework the #swestner code as well to get it running in a WPF app and see which is the tidier solution.
I have a problem with a browse button and switching to its file dialog. I cannot simply send my file path string to the file path control, as it is read-only; the real input holding the file path sits behind that control.
Here's my code:
driver.FindElement(By.Id("browseButton")).Click();
driver.SwitchTo().ActiveElement().SendKeys(filepath);
The code above fills the file path control, as I can see in the UI. But the open file dialog is still open, and I do not know how to close it and submit my upload.
Uploading files in Selenium can be a pain, to say the least. The real problem comes from the fact that it does not support dialog boxes such as file upload and download.
I go over this in an answer to another question, so I will just copy my answer from there below. The code examples should actually be relevant in your case, since you are using C#.
Selenium Webdriver doesn't really support this. Interacting with non-browser windows (such as native file upload dialogs and basic auth dialogs) has been a topic of much discussion on the WebDriver discussion board, but there has been little to no progress on the subject.
I have, in the past, been able to work around this by capturing the underlying request with a tool such as Fiddler2, and then just sending the request with the specified file attached as a byte blob.
If you need cookies from an authenticated session, WebDriver.manage().getCookies() should help you in that aspect.
Edit: I have working code for this somewhere; I'll see if I can get hold of something you can use.
public RosterPage UploadRosterFile(String filePath){
    Face().Log("Importing Roster...");
    LoginRequest login = new LoginRequest();
    login.username = Prefs.EmailLogin;
    login.password = Prefs.PasswordLogin;
    login.rememberMe = false;
    login.forward = "";
    login.schoolId = "";

    //Set up request data
    String url = "http://www.foo.bar.com" + "/ManageRoster/UploadRoster";
    String javaScript = "return $('#seasons li.selected').attr('data-season-id');";
    String seasonId = (String)((IJavaScriptExecutor)Driver().GetBaseDriver()).ExecuteScript(javaScript);
    javaScript = "return Foo.Bar.data.selectedTeamId;";
    String teamId = (String)((IJavaScriptExecutor)Driver().GetBaseDriver()).ExecuteScript(javaScript);

    //Send request and parse the response into the new driver URL
    MultipartForm form = new MultipartForm(url);
    form.SetField("teamId", teamId);
    form.SetField("seasonId", seasonId);
    form.SendFile(filePath, LoginRequest.sendLoginRequest(login));
    String response = form.ResponseText.ToString();
    String newURL = StaticBaseTestObjs.RemoveStringSubString("http://www.foo.bar.com" + response.Split('"')[1].Split('"')[0], "amp;");

    Face().Log("Navigating to URL: " + newURL);
    Driver().GoTo(new Uri(newURL));
    return this;
}
The MultipartForm, LoginRequest, and LoginResponse helper classes are linked as separate code listings in the original answer.
The code above is in C#, but there are equivalent base classes in Java that will let you mimic this functionality.
The most important part of all of that code is the MultiPartForm.SendFile method, which is where the magic happens.
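The MultipartForm helper itself isn't shown here, but the request it builds can be sketched with the framework's own MultipartFormDataContent (System.Net.Http, .NET 4.5+). The field names mirror the example above; the file part name "file" and the cookie handling are assumptions:

```csharp
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

class RosterUploader
{
    // Sends teamId/seasonId plus the file as multipart/form-data,
    // mimicking what MultipartForm.SendFile does in the answer above.
    public static async Task<string> UploadAsync(string url, string teamId,
        string seasonId, string filePath, string cookieHeader)
    {
        using (var client = new HttpClient())
        using (var form = new MultipartFormDataContent())
        {
            form.Add(new StringContent(teamId), "teamId");
            form.Add(new StringContent(seasonId), "seasonId");
            form.Add(new ByteArrayContent(File.ReadAllBytes(filePath)),
                     "file", Path.GetFileName(filePath));  // part name is an assumption

            // Forward the authenticated session cookie captured from WebDriver.
            if (!string.IsNullOrEmpty(cookieHeader))
                client.DefaultRequestHeaders.Add("Cookie", cookieHeader);

            HttpResponseMessage response = await client.PostAsync(url, form);
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```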
One of the many ways to do this is to remove the disabled attribute and then use a typical Selenium SendKeys() call:
public void Test(string path)
{
    // Remove the disabled attribute so the input will accept keystrokes.
    const string removeAttribute = @"document.getElementById('browseButton').removeAttribute('disabled');";
    ((IJavaScriptExecutor)driver).ExecuteScript(removeAttribute);

    By byId = By.Id("browseButton");
    driver.FindElement(byId).Clear();
    driver.FindElement(byId).SendKeys(path);
}
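For reference, when the page exposes a real `<input type="file">` (even a hidden one), the usual pattern is to skip the native dialog entirely and send the path straight to that input. The selector and submit-button id below are assumptions for illustration:

```csharp
using OpenQA.Selenium;

class FileUploadHelper
{
    // Types the local path into the file input directly; the native
    // dialog never opens, so there is nothing to dismiss afterwards.
    public static void Upload(IWebDriver driver, string path)
    {
        IWebElement input = driver.FindElement(By.CssSelector("input[type='file']"));
        input.SendKeys(path);

        // Then submit the surrounding form as the page expects, e.g.:
        // driver.FindElement(By.Id("uploadSubmit")).Click();  // id is an assumption
    }
}
```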
You can use this AutoIt script to handle the file upload dialog.
AutoIt script for file upload:
AutoItSetOption("WinTitleMatchMode","2") ; set the select mode to
Do
    Sleep(1000)
Until WinExists("File Upload")
WinWait("File Upload")
WinActivate("File Upload")
ControlFocus("File Upload","","Edit1")
Sleep(2000)
ControlSetText("File Upload", "", "Edit1", $CmdLineRaw)
Sleep(2000)
ControlClick("File Upload", "", "Button1")
Build and compile the above script, place the EXE at a known path, and call it when you need it.
Call it once you click the Browse button:
// Launch the compiled AutoIt EXE, passing the document file name,
// which the script reads via $CmdLineRaw.
Process p = System.Diagnostics.Process.Start(txt_Browse.Text + "\\File Upload", DocFileName);
p.WaitForExit();
I'm trying to scrape a webpage using HtmlAgilityPack in a C# WebForms project.
All the solutions I've seen for doing this use a WebBrowser control. However, from what I can determine, that is only available in WinForms projects.
At present I'm fetching the required page via this code:
var getHtmlWeb = new HtmlWeb();
var document = getHtmlWeb.Load(inputUri);
HtmlAgilityPack.HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//div[@class=\"nav\"]");
An example bit of code that I've seen recommending the WebBrowser control:
if (this.webBrowser1.Document.GetElementsByTagName("html")[0] != null)
    _htmlAgilityPackDocument.LoadHtml(this.webBrowser1.Document.GetElementsByTagName("html")[0].OuterHtml);
Any suggestions or pointers on how to grab the page once the AJAX content has loaded would be appreciated.
It seems that HtmlAgilityPack can only scrape content that is present in the HTML itself. Thus anything loaded via AJAX will not be visible to it.
Perhaps the easiest option, where feasible, is to use a browser-based tool such as Firebug to determine the source of the data loaded by AJAX, then request that source data directly. An added advantage of this is that you can often scrape a larger dataset.
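Once the underlying endpoint is visible in Firebug's Net panel, the data can usually be fetched directly, skipping the HTML entirely. A sketch, where the URL shape and parameter names are placeholders rather than a real API:

```csharp
using System;
using System.Net;

class AjaxSource
{
    // Builds the endpoint URL the page's script would call; the path
    // and parameter names here are placeholders, not a real API.
    public static string BuildEndpoint(string baseUrl, int page, int pageSize)
    {
        return string.Format("{0}?page={1}&size={2}", baseUrl, page, pageSize);
    }

    // Fetches the raw JSON the page's script consumes.
    public static string Fetch(string endpoint)
    {
        using (var client = new WebClient())
        {
            client.Encoding = System.Text.Encoding.UTF8;
            return client.DownloadString(endpoint);
        }
    }
}
```

Paging through the endpoint this way is also how you would get at the larger dataset mentioned above.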
I struggled all day to get this right, so here is a FedEx tracking example of what the accepted answer is referring to (I think):
Dim body As String
body = "data={""TrackPackagesRequest"":{""appType"":""WTRK"",""appDeviceType"":""DESKTOP"",""supportHTML"":true,""supportCurrentLocation"":true,""uniqueKey"":"""",""processingParameters"":{},""trackingInfoList"":[{""trackNumberInfo"":{""trackingNumber"":" & Chr(34) & "YOUR TRACKING NUMBER HERE" & Chr(34) & ",""trackingQualifier"":"""",""trackingCarrier"":""""}}]}}"
body = body & "&action=trackpackages&locale=en_US&version=1&format=json"
With CreateObject("MSXML2.XMLHTTP")
    .Open("POST", "https://www.fedex.com/trackingCal/track", False)
    .setRequestHeader("Referer", "https://www.fedex.com/apps/fedextrack/?tracknumbers=YOUR TRACKING NUMBER HERE")
    .setRequestHeader("User-Agent", "Mozilla/5.0")
    .setRequestHeader("X-Requested-With", "XMLHttpRequest")
    .setRequestHeader("Content-Type", "application/x-www-form-urlencoded; charset=UTF-8")
    .send(body)
    Dim Reply = .responseText
End With
Alternatively, have you considered building a browser into your application using CefSharp (.NET) and then using DevTools through the .NET interface?
You may have noticed that even dynamically AJAX/JS-generated HTML can be seen using, for example, the Inspect Element option in Firefox. So that markup is sitting on your computer even if you can't scrape it using traditional HTML scraping methods.
Another option to consider.
https://cefsharp.github.io/
I'm trying to get the FINAL source of a webpage. I am using the WebClient OpenRead method, but it only returns the initial page source. After the source downloads, JavaScript runs and transforms the data I need into a different format, so my method ends up looking for something that has been completely changed.
What I am talking about is exactly like the difference between:
right-click on a webpage -> select view source
access the developer tools
Look at this site to see what I am talking about: http://www.augsburg.edu/history/fac_listing.html and watch how the email addresses are displayed using each option. I think what's happening is that the first shows the initial load of the page, while the second shows the final page HTML. WebClient only lets me do option #1.
Here is the code that only returns option #1. (I need to do this from a console application.) Thank you!
// Requires: using System.IO; using System.Net;
private static string GetReader(string site)
{
    WebClient client = new WebClient();
    Stream data;
    StreamReader reader;
    try
    {
        data = client.OpenRead(site);
        reader = new StreamReader(data);
    }
    catch
    {
        return "";
    }
    return reader.ReadToEnd();
}
I've found a solution to my problem.
I ended up using Selenium-WebDriver PageSource property. It worked beautifully!
Learn about Selenium and WebDriver. It is an easy thing to learn, and it helps both with testing and with tasks like this!
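For completeness, a minimal sketch of that PageSource approach combined with HtmlAgilityPack (FirefoxDriver is just one browser choice, and the fixed sleep is a crude wait; a WebDriverWait on a known element would be more robust):

```csharp
using HtmlAgilityPack;
using OpenQA.Selenium;
using OpenQA.Selenium.Firefox;

class FinalSourceFetcher
{
    // Loads the page in a real browser, lets its scripts run,
    // and returns the post-JavaScript HTML via PageSource.
    public static HtmlDocument GetFinalSource(string url)
    {
        IWebDriver driver = new FirefoxDriver();
        try
        {
            driver.Navigate().GoToUrl(url);
            System.Threading.Thread.Sleep(2000); // crude wait for scripts

            var doc = new HtmlDocument();
            doc.LoadHtml(driver.PageSource);
            return doc;
        }
        finally
        {
            driver.Quit(); // always close the browser, even on failure
        }
    }
}
```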