HTMLAgilityPack: set input into a text field, then scrape results? (C#)

What I am trying to do: I start my program and, in a textbox, enter something I would like to search for. The program then loads a web page I have defined, finds the search field on that page, and enters the string from my textbox into it. From there, it hits the "search" button on the web page, or simulates pressing Enter. That loads a new web page which, let's say, contains a specific XPath I have set the program to search for. It then pulls the string at that XPath and loads it into a DataGridView.
Now, I have already figured out how to scrape results from a single web page (i.e., if I just load a specific web page and scrape the XPath).
The issue, however, is that I want to search for different results automatically and pull the strings from each web page that contains them.
(For example: searching for 2+2 returns 4, and searching for 1+2 returns 3, but "4" and "3" are on separate web pages. Instead of copying each link to a web page myself, I want the tool to take care of that for me.)
This is what I have so far. It doesn't give me any errors, but I am at a loss trying to figure out how one would hit search or press Enter.
I would prefer the pressing-Enter method, if possible.
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(urlField.Text);
HtmlNode node = doc.DocumentNode.SelectSingleNode(pathSfield.Text); // first node matching the XPath
if (node != null)
{
    node.SetAttributeValue("value", searchFpath.Text); // attribute name first, then the value
}
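Note that Html Agility Pack is only a parser: setting an attribute changes your local copy of the DOM, but nothing is ever sent to the server, so there is no search button to hit and no Enter key to press. The usual workaround is to reproduce the HTTP request the form would make when submitted; you can find the real action URL and field name by inspecting the form element in the page source. A minimal sketch, assuming (hypothetically) that the form submits via GET to /search with a field named q, and that the answer sits at a known XPath:
using System;
using HtmlAgilityPack;

// Hypothetical form target: GET <baseUrl>/search?q=<term>.
// resultXPath points at the element holding the answer on the results page.
static string ScrapeSearchResult(string baseUrl, string term, string resultXPath)
{
    // Build the URL the browser would request when the form is submitted.
    string url = baseUrl + "/search?q=" + Uri.EscapeDataString(term);
    HtmlDocument doc = new HtmlWeb().Load(url);
    HtmlNode node = doc.DocumentNode.SelectSingleNode(resultXPath);
    return node?.InnerText.Trim();
}
From there, loop over your search terms, call this once per term, and add each returned string as a row in the DataGridView.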

Related

Submitting Form with Selenium by Clicking on the Send Button puts all Settings back in the DOM Tree

I am using Selenium with .NET.
The WebDriver I am using is ChromeDriver.
I wrote code which fills out a simple form (a survey from surveymonkey.com). The DOM tree gets changed (manipulated) exactly as I want it to.
But as soon as I click the button to submit the data (the survey/form), I get an error from the survey page, i.e. "please answer the question(s) x, y, ..., n." I inspected the DOM tree after getting this error and, amazingly, it had changed back to its default state, as if there had never been a change, except for the input and textarea fields: the manipulations Selenium made to those two kinds of nodes were accepted by the page. So all manipulation of the radio buttons and checkboxes is reverted to unchecked when clicking the send button.
What I did:
I set a breakpoint before clicking the button, copied the DOM tree, and compared it with the DOM tree of a real client: there was no difference except on the meta node (some query parameters of a script).
So I was really confused as to why the survey won't work with Selenium but does with a real client, even though the DOM trees are the same.
Therefore I tried it with other drivers (IE and Firefox) and had exactly the same problem.
Since the DOM trees are the same (Selenium and real client), I think there is a problem with the Selenium Click event? By the way, instead of Click(), I also tried to submit the form with Keys.Enter; that did not work either.
Here is the link of the survey (to get the DOM Tree): https://de.surveymonkey.com/r/KZDWJD2?pharmacy=test
Example: for the radio buttons, what I did (and the only thing that is necessary) was changing the attribute aria-checked from false to true. As soon as I submitted the survey, this attribute changed back to false, which seems really strange to me, and the answers therefore cannot be sent to the backend (message in the frontend: "please answer the question 1, 2, ..., n").
Here is some code from my C#:
Changing the above attribute:
public static void SetAttribute(this IWebElement webElement, string name, string value)
{
    // Reach the underlying driver and set the attribute via JavaScript,
    // since WebDriver offers no direct way to write attributes.
    var driver = ((IWrapsDriver)webElement).WrappedDriver;
    var jsExecutor = (IJavaScriptExecutor)driver;
    jsExecutor.ExecuteScript("arguments[0].setAttribute(arguments[1], arguments[2]);", webElement, name, value);
}
As already mentioned, the attribute gets changed perfectly by the above method. But as soon as Selenium submits the survey (by clicking the button), the attribute changes back to false and the survey therefore cannot be sent.
Any help or advice would be appreciated.
Thanks in advance.
After some more digging I found out that SurveyMonkey adds another attribute to every input field: checked="checked". Adding this attribute to the specific node elements was the key to the solution; the survey can now be sent.
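In code, that amounts to one more call to the SetAttribute helper per input before submitting (a sketch; radioInput stands for whichever element Selenium located):
// Mark the radio button both ways before clicking the send button.
radioInput.SetAttribute("aria-checked", "true"); // what the frontend shows
radioInput.SetAttribute("checked", "checked");   // what SurveyMonkey actually reads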

C# Find text inside html class on website and click href link

I'm making an application that can find an item on a website when the item name and colour match the ones set inside the application.
If the item name is set to "backpack" and the colour to "green", the application should find a match on the page and click the link. The website is http://www.supremenewyork.com/shop/all/bags.
I would prefer doing this in C# with HTTP requests or something similar. I would also consider PhantomJS if anyone has a better solution using it.
You can use Selenium; it basically allows you to act like a user with an actual web browser: http://www.seleniumhq.org/
You can do something like the following with the help of XPath:
driver.Navigate().GoToUrl("http://www.supremenewyork.com/shop/all/bags");
var backpack = driver.FindElement(By.XPath("//*[contains(@class,'inner-article')]//h1//a[contains(., 'Backpack')]"));
var colorGreen = driver.FindElement(By.XPath("//*[contains(@class,'inner-article')]//p//a[contains(., 'Acid Green')]"));
if (backpack.Text == "Backpack" && colorGreen.Text == "Acid Green")
    colorGreen.Click();
This is tested code; it successfully finds the required values within the tags, clicks, and moves to that page.
Hope it helps.

Action on SharePoint New Item form in selenium 2 using C# is not working

I am trying to automate a SharePoint site's New Item form, but whatever method I try, the elements show up as not found.
I tried SwitchTo() with a new iframe, window...
I tried this code, which finds the outer content:
IWebElement table1 = driver.FindElement(By.XPath("//table[@class=\"s4-wpTopTable\"]"));
int table1count = driver.FindElements(By.XPath("//table[@class=\"s4-wpTopTable\"]")).Count;
MessageBox.Show(table1count.ToString());
The above code displays the table count as 2; going beyond this element does not find anything.
I am using IE as the browser.
Using an XPath I could identify elements up to a certain point (the red mark in my screenshot), but nothing beyond it; I am trying to identify the elements marked in green.
Here is the XPath I used, taken from Firebug:
var iframecount = driver.FindElement(By.XPath("//html/body/form/div[8]/div/div[4]/div[2]/div[2]/div/div/table/tbody/tr/td/div/span/table/tbody/tr/td[2]/span/span/input"));
I have found the answer to this...
The SharePoint New Item form (i.e., the modal pop-up) has 3 iframes without an id or name, so switching to the iframe using the code below works:
driver.SwitchTo().Frame(2);
i.e., frames start from index 0.
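Putting it together, a minimal sketch (the input locator is hypothetical; substitute whatever identifies your form field):
// The New Item dialog nests three unnamed iframes; switch by index (0-based).
driver.SwitchTo().Frame(2);
var titleField = driver.FindElement(By.XPath("//input[@title='Title']")); // hypothetical locator
titleField.SendKeys("My new item");
driver.SwitchTo().DefaultContent(); // return to the top-level page when done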

URL and Query management Asp.Net C#

OK, so a while back I asked the question "Beginner ASP.net question handling url link".
I wanted to handle a case like this: www.blah.com/blah.aspx?day=12&flow=true
I got my answer: string r_flag = Request.QueryString["day"];
Then I placed code in Page_Load() that takes these parameters and, if they are not null (meaning they were part of the URL), filters the results based on them.
It works GREAT, happy times... except it no longer works once you arrive with a query in the URL and then try to apply a different filter.
I have a drop-down box that allows you to select filters, and a button that, once clicked, should apply the selections.
The problem is that Page_Load is called prior to the Button_Clicked handler, and therefore I stay on the same page.
Any ideas how to handle this case?
Once again, in case the above was confusing: I can control the behavior of my website via the URL, which I parse in Page_Load(), and via the controls on the page. If there is no query in the URL, the controls work great; if there is, it overrides the controls.
Essentially, I am trying to find a way to skip the URL parsing when the request comes from clicking the Generate button on the page.
Maybe you can put your query-string parsing code inside an IsPostBack check, if the Generate button is the only control that posts back on your page:
if (!IsPostBack)
{
    string r_flag = Request.QueryString["day"];
}
As an alternative, on the client side you can set a hidden field whenever the user clicks the Generate button; then you can read its value to determine whether the user clicked Generate and put your logic behind that check.
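A minimal sketch of that alternative (the field name generateClicked is hypothetical): the button's client-side onclick sets the hidden field, and Page_Load checks it before parsing the URL.
// Markup (hypothetical):
// <input type="hidden" id="generateClicked" name="generateClicked" value="0" />
// <asp:Button ID="btnGenerate" runat="server" OnClick="Button_Clicked"
//     OnClientClick="document.getElementById('generateClicked').value = '1';" Text="Generate" />
protected void Page_Load(object sender, EventArgs e)
{
    // Only honor URL filters when the request did not come from the Generate button.
    if (Request.Form["generateClicked"] != "1")
    {
        string r_flag = Request.QueryString["day"];
        // ... filter results from the query string as before ...
    }
}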

Logic for Implementing a Dynamic Web Scraper in C#

I am looking to develop a web scraper as a C# Windows Forms application. What I am trying to accomplish is as follows:
Get the URL from the user.
Load the web page in the IE UI control (embedded browser) in WinForms.
Allow the user to select a piece of text (contiguous, small, not exceeding 50 chars) from the loaded web page.
When the user wishes to persist the location (the HTML DOM location), it has to be saved to the DB, so that the user may use it to fetch the data at that location on subsequent visits.
Assume the loaded website is a price-listing site where the quoted rate keeps changing; the idea is to persist the DOM hierarchy so that I can traverse it next time.
I would be able to do this if all the HTML elements had id attributes; where the id is null, I am not able to accomplish it.
Could someone suggest a workable approach (with a bare-minimum code snippet, if possible)?
It would be helpful even if you can just share some online resources.
Thanks,
Vijay
One approach is to build a stack of tags/styles/ids down to the element you want to select.
From the element you want, traverse up to the nearest element with an id. This way you get rid of most of the top header etc. Then build a sequence to look for.
Example:
<html>
<body>
<!-- lots of html -->
<div id="main">
  <div>
    <span>
      <div class="pricearea">
        <table> <!-- with price data -->
For the example you would store in your DB a sequence like [id=main],div,span,div,table, or perhaps div[class=pricearea],table.
Styles/classes can also be used to build your path. It's your choice whether to match a tag, an attribute of a tag, or a combination; you want it as accurate as possible with as few elements as possible, to make it robust.
If the layout seldom changes, this lets you navigate to the same location each time.
I would also suggest using Html Agility Pack or something similar for the DOM parsing, as the IE control is slow.
Screen scraping is fun, but it's difficult to get it 100% right for all pages. Good luck!
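As a sketch of that approach with Html Agility Pack (the stored sequence expressed as an XPath; the names here are illustrative):
using HtmlAgilityPack;

// Hypothetical stored path for the example above: nearest id, then the class hop.
const string storedPath = "//div[@id='main']//div[@class='pricearea']/table";

static string FetchPrice(string url)
{
    HtmlDocument doc = new HtmlWeb().Load(url);
    HtmlNode table = doc.DocumentNode.SelectSingleNode(storedPath);
    return table?.InnerText.Trim(); // null if the layout has drifted
}
On each visit you re-run the stored XPath; a null result means the layout changed and the path needs re-recording.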
After a bit of googling, I came across a fairly simple solution. Below is a sample snippet.
if (webBrowser.Document != null)
{
    IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document.DomDocument; // the page's HTML DOM
    IHTMLSelectionObject selection = htmlDoc.selection; // the currently selected HTML element
    IHTMLTxtRange range = (IHTMLTxtRange)selection.createRange();
    IHTMLElement parentElement = range.parentElement(); // the selection's parent element
    targetSourceIndex = parentElement.sourceIndex;
    //dataLocation = range.parentElement().id;
    MessageBox.Show(range.text);
}
I used an embedded WebBrowser control in a WinForms application, which loads the HTML DOM of the current web page.
The IHTMLElement instance exposes a property named sourceIndex, which gives each HTML element a unique index within the document.
One can store this sourceIndex in the DB and query for the content at that location using the following code:
if (webBrowser.Document != null)
{
    IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document.DomDocument;
    IHTMLElement targetElement = null;
    foreach (IHTMLElement domElement in htmlDoc.all)
    {
        if (domElement.sourceIndex == int.Parse(node.InnerText)) // the persisted index, read back from the XML file
        {
            targetElement = domElement;
            break;
        }
    }
    MessageBox.Show(targetElement.innerText);
}
