C# Selenium load HTML

I want to load HTML from a WebClient() into a Selenium driver.
I have:
WebClient glavniklijent = new WebClient();
string HTML = glavniklijent.DownloadString("http://www.bodum.com/gb/en-us/shop/detail/10948-01/");
If I save it as a local HTML file and then navigate to it with
driver.Navigate().GoToUrl(localfile);
it doesn't help, because the browser then requests the page's online resources, which takes too long.
I also tried the JavaScript executor:
IJavaScriptExecutor js = driver as IJavaScriptExecutor;
string title = (string)js.ExecuteScript("document.write('" + HTML + "')");
But that doesn't work.
The reason I'm doing this is that, for me, the easiest way to parse HTML is with the Selenium driver. I tried HtmlAgilityPack, but I had never used it before and it seems much more complicated than Selenium's select by id, select by class name, and so on.
Can I load this with Selenium locally?
Is there an HTML parser similar to Selenium?

Try CsQuery
https://github.com/jamietre/CsQuery
https://www.nuget.org/packages/CsQuery/
It makes parsing HTML pretty easy, in a way very similar to jQuery:
var document = CsQuery.CQ.CreateDocument(html);
foreach (var element in document.Select("ul.somelist > li.thread"))
{
    // do something with each matched element
}
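Tying that back to the original WebClient code, a minimal sketch (the #product-title and .price selectors are made-up examples, not taken from the actual page):
using System;
using System.Net;
using CsQuery;

class Program
{
    static void Main()
    {
        // Download the raw HTML once; no browser and no extra resource requests.
        var client = new WebClient();
        string html = client.DownloadString("http://www.bodum.com/gb/en-us/shop/detail/10948-01/");

        // Parse the string in memory with CsQuery.
        var document = CQ.CreateDocument(html);

        // Roughly equivalent to Selenium's By.Id / By.ClassName lookups.
        var title = document.Select("#product-title").Text();   // hypothetical id
        var prices = document.Select(".price");                 // hypothetical class
        Console.WriteLine(title);
    }
}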

Related

Extracting string from Html page using C#

I have a source HTML page and I want to do the following:
1. extract a specific string from the whole HTML page and save the chosen string in a new HTML page,
2. create a database in MySQL with 4 columns,
3. import the data from the HTML page into the table in MySQL.
I would be pretty thankful and grateful if someone could help me with that, because my knowledge of C# is far from perfect.
You could use this code:
using System.Net;
using System.Net.Http;
using System.Text;
using HtmlAgilityPack;

HttpClient http = new HttpClient();
// I have put ebay.com here; you could use any site.
var response = await http.GetByteArrayAsync("http://www.ebay.com");
string source = Encoding.UTF8.GetString(response, 0, response.Length);
source = WebUtility.HtmlDecode(source);
HtmlDocument Nodes = new HtmlDocument();
Nodes.LoadHtml(source);
In the Nodes object you will have all the DOM elements of the HTML page.
You could use LINQ to filter out whatever you need. Example:
List<HtmlNode> RequiredNodes = Nodes.DocumentNode.Descendants()
    .Where(x => x.Attributes["class"] != null
             && x.Attributes["class"].Value.Contains("List-Item"))
    .ToList();
You will probably need to install the Html Agility Pack NuGet package or download it from the project page.
Hope this helps.
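For the MySQL part of the question, a minimal sketch using the MySql.Data connector (the table and column names here are made up; adjust them to your own schema):
using MySql.Data.MySqlClient; // from the MySql.Data NuGet package

// Hypothetical table: CREATE TABLE pages (col1 TEXT, col2 TEXT, col3 TEXT, col4 TEXT);
string connStr = "server=localhost;user=root;password=secret;database=scrapedb";
using (var conn = new MySqlConnection(connStr))
{
    conn.Open();
    var cmd = new MySqlCommand(
        "INSERT INTO pages (col1, col2, col3, col4) VALUES (@a, @b, @c, @d)", conn);
    // The values would come from the nodes you extracted above.
    cmd.Parameters.AddWithValue("@a", "value1");
    cmd.Parameters.AddWithValue("@b", "value2");
    cmd.Parameters.AddWithValue("@c", "value3");
    cmd.Parameters.AddWithValue("@d", "value4");
    cmd.ExecuteNonQuery();
}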

C# htmlAgility Webscrape html node inside the first Html node

I am trying to access these nodes on this website:
http://bcres.paragonrels.com/publink/default.aspx?GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&Report=Yes
However, they appear to be in a secondary HTML document within the initial one, and I am confused about how to access that secondary HTML and then parse through it for the nodes I need.
This is an example of one of the nodes:
<div style="top:219px;left:555px;width:45px;height:14px;" id="" class="mls29">2</div>
I am using the HtmlAgilityPack and I receive null whenever I try to access the div. I tried working my way down the nodes, but it didn't work.
Any help, or a pointer to where I can look up the necessary information to figure this out, would be appreciated.
var webGet = new HtmlWeb();
var document = webGet.Load("http://bcres.paragonrels.com/publink/default.aspx?GUID=d27a1d95-623d-4f6a-9e49-e2e46ede136c&Report=Yes");
var divTags = document.DocumentNode.SelectNodes("/html");
var text = document.DocumentNode.InnerText;
MessageBox.Show(text);
You will be able to scrape the data if you access the following url:
http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML&GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&ListingID=262103824:0&Report=Yes&view=29&layout_id=63
As an example, scraping the "Presented By" field:
HtmlWeb w = new HtmlWeb();
var hd = w.Load("http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML&GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&ListingID=262103824:0&Report=Yes&view=29&layout_id=63");
var presentedBy = hd.DocumentNode.CssSelect(".mls23.at-phone-link").FirstOrDefault();
if (presentedBy != null)
{
    Console.WriteLine(presentedBy.InnerText);
}
Remarks:
I use the ScrapySharp NuGet package along with HtmlAgilityPack, so I can scrape using CSS selectors instead of XPath expressions - something I find easier to do. (CssSelect is an extension method from the ScrapySharp.Extensions namespace.)
The url you are scraping from is your problem. I am scraping from the last GET request that is performed after the page is loaded, which you can find by using the Firefox developer tools to analyze the site's traffic/network requests/responses.
I could not yet identify what triggers this HTTP request in the end (it may be JavaScript code, or it may be one of the frame HTMLs requested in the main, frame-enabled document).
If you only have a couple of urls like this to scrape, then even manually extracting the correct url is an option.
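If you want to discover that url programmatically rather than by hand, one hedged approach is to load the outer frameset document and list the frame sources (the //frame|//iframe XPath is generic, not verified against this particular site):
// List candidate frame URLs from the outer document.
var outerDoc = new HtmlWeb().Load("http://bcres.paragonrels.com/publink/default.aspx?GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&Report=Yes");
var frames = outerDoc.DocumentNode.SelectNodes("//frame|//iframe");
if (frames != null)
{
    foreach (var frame in frames)
    {
        Console.WriteLine(frame.GetAttributeValue("src", ""));
    }
}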

text returning as NULL using htmlagility pack + xpath

I'm currently playing around with htmlagility pack, however, I don't seem to be getting any data back from the following url:
http://cloud.tfl.gov.uk/TrackerNet/LineStatus
This is the code i'm using:
var url = @"http://cloud.tfl.gov.uk/TrackerNet/LineStatus";
var webGet = new HtmlWeb();
var doc = webGet.Load(url);
However, when I check the contents of doc, the text value is null. I've tried other urls and I receive the HTML used on those sites. Is it just this particular url, or am I doing something wrong? Any help would be appreciated.
HtmlAgilityPack is an HTML parser, so you won't have much success trying to parse a non-HTML document such as the XML this url returns.
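A minimal sketch of parsing that feed with LINQ to XML instead (it just dumps element names, since I have not verified the feed's exact schema):
using System;
using System.Xml.Linq;

class Program
{
    static void Main()
    {
        // XDocument.Load performs the HTTP request and parses the XML in one step.
        var doc = XDocument.Load("http://cloud.tfl.gov.uk/TrackerNet/LineStatus");

        // Dump the structure first; query specific elements once you know the names.
        foreach (var element in doc.Root.Elements())
        {
            Console.WriteLine(element.Name.LocalName);
        }
    }
}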

Getting DOCTYPE using selenium webdriver C#

I am using Selenium WebDriver for UI automation purposes. Below is my sample code:
IWebDriver driver = new OpenQA.Selenium.IE.InternetExplorerDriver();
string url ="http://stackoverflow.com";
driver.Navigate().GoToUrl(url);
string pagesource = driver.PageSource;
The pagesource variable does not contain the doctype. I need to know the DOCTYPE for W3C validation. Is there any way to get the DOCTYPE of the HTML source through Selenium?
This thread shows there is no way to get the doctype of the HTML source through Selenium; instead, you can make an HTTP request from .NET and read the DOCTYPE from the response. I don't want to make a separate HTTP request just to get the DOCTYPE.
Using FirefoxDriver instead of InternetExplorerDriver will get you the DOCTYPE. Unfortunately this won't solve your problem - the source you're getting with driver.PageSource is already preprocessed by the browser, so trying to validate that code won't give reliable results.
Unfortunately there are no easy solutions.
If your page is not password protected, you can use the "validate by URI" method.
Otherwise you need to obtain the page source yourself. I know two ways of doing it (I implemented both in my project). One is to use a proxy; if you are using C#, take a look at FiddlerCore. The other is to make another request using JavaScript and XMLHttpRequest. You can find an example here (search the page for XMLHttpRequest).
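As a rough sketch of the XMLHttpRequest option (a synchronous request, deprecated in modern browsers but adequate here; it re-fetches the current page's raw source from inside the browser session):
IJavaScriptExecutor js = (IJavaScriptExecutor)driver;
string rawSource = (string)js.ExecuteScript(
    "var xhr = new XMLHttpRequest();" +
    "xhr.open('GET', window.location.href, false);" + // false = synchronous
    "xhr.send(null);" +
    "return xhr.responseText;");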
For W3C validation there are basically three issues when automating through Selenium WebDriver:
1. Getting the proper page source, since driver.PageSource is not reliable.
2. Getting the doctype of the HTML source.
3. Dealing with controls rendered through AJAX calls; since we cannot access these controls in the page source, how do we get the exact 'generated source' of the page?
All of the above can be done by executing JavaScript through the Selenium WebDriver.
In a text file called 'htmlsource.txt', store the code snippet below.
function outerHTML(node) {
    // If IE or Chrome, use the built-in property; otherwise build one,
    // since lower versions of Firefox do not support element.outerHTML.
    return node.outerHTML || (
        function(n) {
            var div = document.createElement('div'), h;
            div.appendChild(n.cloneNode(true));
            h = div.innerHTML;
            div = null;
            return h;
        })(node);
}

var outerhtml = outerHTML(document.getElementsByTagName('html')[0]);
var node = document.doctype;
var doctypestring = "";
if (node) {
    doctypestring = "<!DOCTYPE "
        + node.name
        + (node.publicId ? ' PUBLIC "' + node.publicId + '"' : '')
        + (!node.publicId && node.systemId ? ' SYSTEM' : '')
        + (node.systemId ? ' "' + node.systemId + '"' : '')
        + '>';
} else {
    // IE8 and below do not have document.doctype (it is null),
    // but the doctype can be read like this instead.
    doctypestring = document.all[0].text;
}
return doctypestring + outerhtml;
And now the C# code to access the complete AJAX-rendered HTML source, including the doctype:
IJavaScriptExecutor js = (IJavaScriptExecutor)driver;
string jsToExecute = File.ReadAllText("htmlsource.txt");
string completeHTMLGeneratedSourceWithDoctype = (string)js.ExecuteScript(jsToExecute);

How to scrape a page generated with a script in C#?

Simple example: Google search page.
http://www.google.com/search?q=foobar
When I get the source of the page, I get the underlying JavaScript. I want the resulting page. What do I do?
Even though it looks as if it is only JavaScript, it really is the full HTML, which you can easily confirm with HtmlAgilityPack:
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("http://www.google.com/search?q=foobar");
string html = doc.DocumentNode.OuterHtml;
var nodes = doc.DocumentNode.SelectNodes("//div"); //returns 85 nodes
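And if you want the text inside those nodes, a small follow-up (purely illustrative; Google's markup changes often):
// Print the visible text of each matched <div>.
foreach (var node in nodes)
{
    Console.WriteLine(node.InnerText.Trim());
}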
