XPath, HtmlAgilityPack and the WebBrowser control - C#

I can load a URL into a WebBrowser control and perform a login (forms based), and I see what I need to see. Great, now I want to use XPath to get the data I need.
You can't do that with a WebBrowser control (unless you disagree?), so I use the Agility Pack to kick off a new session as per below:
var wc = new WebClient();
wc.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(wc.OpenRead(url), Encoding.UTF8);
var value = doc.DocumentNode.SelectSingleNode("//li[@data-section='currentPositionsDetails']//*[@class='description']");
My value is not retrievable because the website doesn't expose it to the public (it wants a logged-in session). How can I "pass on" my WebBrowser control session to my WebClient? Looking into the methods for POSTing my login information, it all seems awfully complicated.
Any ideas? - Thanks

You can retrieve the body HTML string with webBrowser1.Document.Body.OuterHtml and load it with HtmlAgilityPack:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(new StringReader(this.webBrowser1.Document.Body.OuterHtml));
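From there the usual XPath queries run against the parsed document. A minimal sketch, reusing the selector from the question (the data-section and class names are assumptions about the target page):
// Query the document parsed from the WebBrowser's rendered HTML.
// SelectSingleNode returns null when nothing matches, so check before use.
var node = doc.DocumentNode.SelectSingleNode("//li[@data-section='currentPositionsDetails']//*[@class='description']");
if (node != null)
    Console.WriteLine(node.InnerText);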

OK, posting this as an answer as it seems to be answered/discussed elsewhere. It's not going to be easy for an amateur like me!
How to pass cookies to HtmlAgilityPack or WebClient?
HtmlAgilityPack.HtmlDocument Cookies
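For the WebClient route, one approach discussed in those threads is to copy the WebBrowser control's cookie string into the WebClient's headers. A rough sketch, assuming the login cookies are visible to Document.Cookie (HttpOnly cookies are not, so this can fail on some sites):
// Reuse the WebBrowser control's session cookies for the WebClient request.
// Document.Cookie does NOT include HttpOnly cookies.
var wc = new WebClient();
wc.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0";
wc.Headers[HttpRequestHeader.Cookie] = webBrowser1.Document.Cookie;
var doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(wc.OpenRead(url), Encoding.UTF8);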

Related

Parsing web site using HtmlAgilityPack does not return values as seen on browser

When parsing the site https://holfuy.com/en/weather/1284, HtmlAgilityPack returns "-" for the relevant data.
string url = "https://holfuy.com/en/weather/1284";
var web = new HtmlWeb();
web.PreRequest += request =>
{
    request.CookieContainer = new System.Net.CookieContainer();
    return true;
};
HtmlDocument doc = web.Load(url);
string data = doc.DocumentNode.SelectNodes("//*[@id=\"j_pressure\"]")[0].InnerText;
Console.WriteLine(data);
What is the reason behind this?
It seems that the data is loaded into the page dynamically. If you need to parse it, you have to drive a real browser, for example through Selenium with one of its available drivers, or, if you don't want to pull in all of Selenium, hook a headless browser such as PhantomJS directly. Once you do, set a small delay so the data can render, then load the page and parse it.
You can see more information here:
Running Scripts in HtmlAgilityPack
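A minimal sketch of the Selenium route, assuming the Selenium.WebDriver package plus a chromedriver, and that the wanted value lands in the j_pressure element from the question:
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

// Drive a real browser so the page's JavaScript runs and fills in the values.
using (IWebDriver driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl("https://holfuy.com/en/weather/1284");

    // Small fixed delay for the dynamic data to render; a WebDriverWait
    // on the element would be more robust.
    System.Threading.Thread.Sleep(2000);

    string data = driver.FindElement(By.Id("j_pressure")).Text;
    Console.WriteLine(data);
}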

How to scrape data from another website which is built in AngularJS?

I have to get some specific data from another web page which is built in AngularJS.
What I have done so far:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Method = "GET";
WebResponse response = request.GetResponse();
StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8);
string result = reader.ReadToEnd();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(result);
It's not returning the proper HTML, and I suppose (after searching) that the site returns 4 items, but the page source shows only one item with this {{item.name}} type of syntax.
How can I solve this issue?
If you use HttpWebRequest, it will just return the HTML template; it will not contain any data. Due to the nature of Angular, the data binding happens later on, using JavaScript.
I suggest you use the WebBrowser control instead of HttpWebRequest for data scraping. Using WebBrowser you should be able to get the complete HTML after the $scope is initialized and the data has been added to the DOM.
To know more about how to use WebBrowser in ASP.NET you can check this link
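A rough sketch of that idea in a WinForms context (the 2-second delay is an assumption; Angular needs a moment after DocumentCompleted to run its data binding):
var browser = new WebBrowser { ScriptErrorsSuppressed = true };
browser.DocumentCompleted += async (s, e) =>
{
    // DocumentCompleted fires when the template has loaded; give Angular
    // a moment to render before reading the DOM.
    await Task.Delay(2000);

    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(browser.Document.Body.OuterHtml);
    // ... run XPath queries against doc here ...
};
browser.Navigate(url);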

ASP Classic VBScript to ASP.NET C# Conversion

I am familiar with ASP.NET, but not with Visual Basic.
Here is the Visual Basic code:
myxml = "http://api.ipinfodb.com/v3/ip-city/?key=" & api_key & "&ip=" & UserIPAddress & "&format=xml"
set xml = server.CreateObject("MSXML2.DOMDocument.6.0")
xml.async = "false"
xml.resolveExternals = "false"
xml.setProperty "ServerHTTPRequest", true
xml.load(myxml)
response.write "<p><strong>First result</strong><br />"
for i=0 to 10
response.write xml.documentElement.childNodes(i).nodename & " : "
response.write xml.documentElement.childNodes(i).text & "<br/>"
NEXT
response.write "</p>"
What is going on in this code?
How can I convert this to ASP.NET (C#)?
Based on a quick glance at the site you linked to in a comment, it looks like the intended functionality is to make a request to a URL and receive the response. The first example given on that site is:
http://api.ipinfodb.com/v3/ip-city/?key=<your_api_key>&ip=74.125.45.100
You can probably use something like the System.Net.WebClient object to make an HTTP request and receive the response. The example on MSDN can be modified for your URL. Maybe something like this:
var client = new WebClient();
client.Headers.Add ("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
var data = client.OpenRead(@"http://api.ipinfodb.com/v3/ip-city/?key=<your_api_key>&ip=74.125.45.100");
var reader = new StreamReader(data);
var result = reader.ReadToEnd();
data.Close();
reader.Close();
(There's also the WebRequest class, which appears to share roughly the same functionality.)
At that point the result variable contains the response from the API, which you can handle however you need to.
From the looks of the Visual Basic code, I think you should create two methods to "convert" this to an ASP.NET C# web page:
LoadXmlData method - use an XmlDocument to load from the URL via the XmlDocument's Load function. Read ASP.net load XML file from URL for an example; see also the sketch after this list.
BuildDisplay method - use an ASP.NET PlaceHolder or Panel to create a container to inject the paragraph tag and individual results into.
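A rough sketch of the LoadXmlData half, mirroring the VBScript loop with System.Xml (api_key and userIpAddress are placeholders carried over from the original code; LoadXmlData and BuildDisplay are just the suggested method names):
using System.Xml;

// api_key and userIpAddress are assumed to be defined elsewhere, as in the original.
string myxml = "http://api.ipinfodb.com/v3/ip-city/?key=" + api_key + "&ip=" + userIpAddress + "&format=xml";

// XmlDocument.Load can fetch straight from a URL, which covers the
// VBScript ServerHTTPRequest + load combination.
var xml = new XmlDocument();
xml.Load(myxml);

// Equivalent of the VBScript "for i = 0 to 10" over the child nodes.
var sb = new System.Text.StringBuilder("<p><strong>First result</strong><br />");
for (int i = 0; i <= 10 && i < xml.DocumentElement.ChildNodes.Count; i++)
{
    XmlNode node = xml.DocumentElement.ChildNodes[i];
    sb.Append(node.Name + " : " + node.InnerText + "<br/>");
}
sb.Append("</p>");
The BuildDisplay half would then drop sb.ToString() into the PlaceHolder or Panel container.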

Load a page with JavaScript disabled using HtmlAgilityPack/HttpWebRequest

I was wondering if there was a way to load a page with JavaScript disabled (i.e. emulate a browser accessing a page with JavaScript disallowed).
I'm exploring a promising method using WebRequest and UserAgent:
HttpWebRequest Req = (HttpWebRequest)WebRequest.Create(url);
Req.UserAgent = @"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5";
HttpWebResponse resp = (HttpWebResponse)Req.GetResponse();
HtmlDocument doc = new HtmlDocument();
var resultStream = resp.GetResponseStream();
doc.Load(resultStream);
And I want to say there is a way to initialize the user agent (in this case Firefox) with JavaScript disabled, but I'm not quite sure how.
If anyone knows how to do this just using HtmlAgilityPack, that would be extremely helpful as well.
Also, on a side note: to fill in a textbox using HtmlAgilityPack, is it just:
HtmlNode textbox = doc.DocumentNode.SelectSingleNode("//text[@id='box']");
textbox.SetAttributeValue("value to put in textbox");
?
Thank you very much!
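For reference: neither HttpWebRequest nor HtmlAgilityPack ever executes JavaScript, so a plain HTTP fetch like the one above already behaves like a browser with scripts disabled. On the textbox side note, HtmlAgilityPack's SetAttributeValue takes an attribute name and a value; a hedged sketch, assuming the box is an input element with id 'box':
// Note: this only edits the in-memory DOM that HtmlAgilityPack parsed;
// nothing is typed into or submitted to the live site.
HtmlNode textbox = doc.DocumentNode.SelectSingleNode("//input[@id='box']");
if (textbox != null)
{
    // SetAttributeValue(name, value) adds the attribute if it is missing.
    textbox.SetAttributeValue("value", "text to put in the textbox");
}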

How to get the content of web page? [duplicate]

Possible Duplicate:
Reading web page by sending username & password?
My problem is this: there is a site with data that is frequently updated, which I would like to fetch at regular intervals for later reporting.
To get that data I have to provide the user ID & password.
I have used HttpWebRequest to get the data, but the problem is that the response text returns "Your browser doesn't support frame" instead of the data I want.
How can I get it?
Most likely you are having this problem because you are not setting the user-agent in your request, e.g. with a WebClient:
using (WebClient wc = new WebClient())
{
    wc.Headers.Add("user-agent", "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
    string htmlResult = wc.DownloadString(someUrl);
}
You can make use of the WebBrowser control to solve your problem. The approach works like this: first, load the specific web page into the WebBrowser control and wait for the document to finish loading. Once it has loaded, you can retrieve the web page stream using the DocumentStream property.
Hope this helps.
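A minimal sketch of that approach (a WinForms context is assumed; DocumentStream is only meaningful after DocumentCompleted has fired):
webBrowser1.DocumentCompleted += (s, e) =>
{
    // The control has rendered the page, frames included, so the stream
    // now reflects what a real browser received.
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.Load(webBrowser1.DocumentStream);
    // ... extract the frequently updated values from doc here ...
};
webBrowser1.Navigate(someUrl);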
