I am trying to extract some information from a website. But when I navigate to it, it uses JavaScript to connect to a server before dynamically loading a PHP page. I can follow the sequence in Chrome with the developer tools. I figured it would be easiest to reproduce this in C# with the WebBrowser control and simply navigate to the website. The WebBrowser control should then contain all the JavaScript files, the text from the dynamically loaded PHP page, and so on. But is this true, and where in the control are they stored? I can't seem to find them.
Recreating the whole sequence you observed in Chrome would be a lot of work. However, "extract some information from a website" is something that can be done quite easily.
Disclaimer: I assumed this question is about WPF's WebBrowser control (it would be almost the same for WinForms).
You can get the HTMLDocument once the page is loaded, using:
using mshtml; // <- don't forget to add the reference to Microsoft.mshtml
using System.Windows.Navigation; // for NavigationEventArgs

public partial class MainWindow : Window
{
    public MainWindow()
    {
        InitializeComponent();
        // subscribe before navigating so the event cannot be missed
        browser.LoadCompleted += browser_LoadCompleted;
        browser.Navigate("http://google.com/");
    }

    void browser_LoadCompleted(object sender, NavigationEventArgs e)
    {
        HTMLDocument doc = (HTMLDocument)browser.Document;
        string html = doc.documentElement.innerHTML;
        // from here, you should be able to parse the HTML
        // or sniff the HTMLDocument (using HTML Agility Pack for instance)
    }
}
From this HTMLDocument, you have access to a lot of properties, including HTML elements, CSS styles, and scripts. I invite you to set a breakpoint and explore what best fits your needs.
Nevertheless, since the page you want to load uses JavaScript to fill in its content, the HTMLDocument will probably not be complete at the time LoadCompleted is raised.
In that case, I suggest using a timer to poll until the content is stable.
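For instance, the LoadCompleted handler above could start a timer like this; a minimal sketch, assuming an arbitrary 500 ms interval and a naive "unchanged once" stability rule (DispatcherTimer comes from System.Windows.Threading):

    // a minimal polling sketch: re-read the document until its HTML stops changing
    // (the 500 ms interval and the "unchanged once" rule are arbitrary assumptions)
    private string _lastHtml;
    private DispatcherTimer _pollTimer; // System.Windows.Threading

    void browser_LoadCompleted(object sender, NavigationEventArgs e)
    {
        _pollTimer = new DispatcherTimer { Interval = TimeSpan.FromMilliseconds(500) };
        _pollTimer.Tick += (s, args) =>
        {
            var doc = (HTMLDocument)browser.Document;
            string html = doc.documentElement.innerHTML;
            if (html == _lastHtml)
            {
                _pollTimer.Stop();
                // content looks stable: parse 'html' here
            }
            _lastHtml = html;
        };
        _pollTimer.Start();
    }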
You could also use the HTMLDocument to inject your own JavaScript code, and call C# methods through WebBrowser.ObjectForScripting, but this is going to be much more complicated and harder to maintain.
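For completeness, a rough sketch of that approach; the class and method names here are made up for illustration:

    using System.Runtime.InteropServices;

    [ComVisible(true)] // required so the page's script can see this object
    public class ScriptBridge // hypothetical bridge class
    {
        public void Notify(string data) // hypothetical callback name
        {
            // receives values pushed from the page's JavaScript
        }
    }

    // in the window's constructor:
    browser.ObjectForScripting = new ScriptBridge();
    // page-side JavaScript can then call: window.external.Notify('...');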
Related
How can I scrape data that are dynamically generated by JavaScript in an HTML document using C#?
Using WebRequest and HttpWebResponse in the C# library, I'm able to get the whole HTML source code as a string, but the difficulty is that the data I want aren't contained in the source code; the data are generated dynamically by JavaScript.
On the other hand, if the data I want are already in the source code, then I'm able to get them easily using regular expressions.
I have downloaded HtmlAgilityPack, but I don't know if it would take care of the case where items are generated dynamically by JavaScript...
Thank you very much!
When you make the WebRequest, you're asking the server to give you the page file. That file's content hasn't yet been parsed/executed by a web browser, so the JavaScript on it hasn't done anything yet.
You need to use a tool that executes the JavaScript on the page if you want to see what the page looks like after being parsed by a browser. One option you have is the built-in .NET WebBrowser control: http://msdn.microsoft.com/en-au/library/aa752040(v=vs.85).aspx
The WebBrowser control can navigate to and load the page, and then you can query its DOM, which will have been altered by the JavaScript on the page.
EDIT (example):
Uri uri = new Uri("http://www.somewebsite.com/somepage.htm");

webBrowserControl.AllowNavigation = true;
// optional, but this stops JavaScript errors from breaking your scraper
webBrowserControl.ScriptErrorsSuppressed = true;
// you want to start scraping after the document has finished loading,
// so do it in the function you pass to this handler
webBrowserControl.DocumentCompleted += webBrowserControl_DocumentCompleted;
webBrowserControl.Navigate(uri);

private void webBrowserControl_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    HtmlElementCollection divs = webBrowserControl.Document.GetElementsByTagName("div");
    foreach (HtmlElement div in divs)
    {
        // do something with each div
    }
}
You could take a look at a tool like Selenium for scraping pages that use JavaScript.
http://www.andykelk.net/tech/headless-browser-testing-with-phantomjs-selenium-webdriver-c-nunit-and-mono
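As a rough sketch of the Selenium route, assuming the Selenium WebDriver NuGet package and a Chrome driver installed on the machine (the URL is a placeholder):

    using OpenQA.Selenium;
    using OpenQA.Selenium.Chrome;

    class Program
    {
        static void Main()
        {
            // drives a real browser, so the page's JavaScript runs before we read it
            using (IWebDriver driver = new ChromeDriver())
            {
                driver.Navigate().GoToUrl("http://www.somewebsite.com/somepage.htm");
                string renderedHtml = driver.PageSource; // DOM after scripts have run
                // parse renderedHtml (e.g. with HTML Agility Pack)
            }
        }
    }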
I'm using C#, and I've been struggling for a few days to grab the final rendered HTML from a URL.
I've tried using several browser engines, Awesomium, WebBrowser, and so on, but none of them returns the actual rendered HTML of the page, as if I right-clicked in Chrome and chose "Inspect element".
What I do is roughly the following (using the WebBrowser WinForms control):
public static string GetDomSource(WebBrowser wb)
{
    var dd = wb.Document.DomDocument as IHTMLDocument2;
    return dd.body.parentElement.outerHTML;
}
(Though I don't know whether you already tried this or whether you are using WinForms at all).
To make the IHTMLDocument2 interface available, I've added a reference to the "Microsoft.mshtml" assembly.
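A hypothetical way to wire this up, waiting for DocumentCompleted so the page's scripts have had a chance to run (the URL is a placeholder):

    var wb = new WebBrowser { ScriptErrorsSuppressed = true };
    wb.DocumentCompleted += (s, e) =>
    {
        // the DOM now reflects whatever the page's JavaScript has rendered so far
        string renderedHtml = GetDomSource(wb);
    };
    wb.Navigate("http://www.somewebsite.com/somepage.htm");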
I'm trying to retrieve data from a webpage, but I cannot do it by making a web request and parsing the resulting HTML file, because the actual text that I'm trying to retrieve is not in the HTML file! I imagine this text is pulled in by some script, and for that reason it's not in the HTML file. For all I know I'm looking at the wrong data, but assuming my theory is correct, is there a straightforward way to retrieve whatever text is displayed by the browser (Firefox or IE) rather than attempting to fetch the text from the HTML file?
I assume you are referring to text that has been generated by JavaScript in the browser.
You can use PhantomJS to achieve this: http://phantomjs.org/
It is essentially a headless browser that will process JavaScript.
You may need to run it as an external program, but I'm sure you can do that from C#.
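A minimal sketch of that, assuming phantomjs is on the PATH and a dump.js script that prints the rendered HTML to standard output (both names are assumptions, not part of the original answer):

    using System.Diagnostics;

    // runs PhantomJS as an external process and captures what it prints
    var psi = new ProcessStartInfo
    {
        FileName = "phantomjs",                       // assumed to be on the PATH
        Arguments = "dump.js http://www.somewebsite.com/somepage.htm",
        RedirectStandardOutput = true,
        UseShellExecute = false
    };
    using (var process = Process.Start(psi))
    {
        string renderedHtml = process.StandardOutput.ReadToEnd();
        process.WaitForExit();
    }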
Your other option would be to open the web page in a WebBrowser object, which should execute the scripts, and then you can get the HtmlDocument object and go from there.
Take a look at this example...
private WebBrowser wBrowser1; // kept as a field so it isn't garbage collected

private void test()
{
    wBrowser1 = new WebBrowser();
    wBrowser1.DocumentCompleted += wBrowser1_DocumentCompleted;
    wBrowser1.Url = new Uri("Web Page URL");
}

void wBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    HtmlDocument document = (sender as WebBrowser).Document;
    // get elements and values accordingly.
}
I have a situation where a rather clever website updates the latest information on the site via Shockwave Flash through a TCP connection. The data received is then updated onto the page via JavaScript so in order to get the latest data a browser is required. If attempts are made to hit the website with continual requests then a) you get banned and b) you're not actually getting the latest data, only the last updated base framework.
So I need to run a browser with scripts enabled.
My first question is: using the standard WPF WebBrowser in .NET, I get the following warnings, which I don't get in standard IE, Chrome, or Firefox. What is causing this, and how do I suppress/allow it while still letting the site's scripts run?
My second question is whether there is a better way to do this, or whether there are better alternatives to the WebBrowser control that:
allow scripts to run
can access the DOM, or the HTML and scripts returned, in at least text format
are compatible with WPF
can hide the browser, as I don't actually want it displayed.
So far I've looked into WebKit.NET, which doesn't seem to allow access to the DOM and didn't play well with WPF windows when I tested it, and also Awesomium, which again didn't appear to allow direct access to the DOM without JavaScript.
Are there any other options (apart from hacking their scripts)?
Thank you
set WebBrowser.ScriptErrorsSuppressed = true;
Ultimately I ended up keeping the WPF control and used the code below to inject a JavaScript snippet that disables JavaScript errors. A reference to the Microsoft HTML Object Library needs to be added.
private const string DisableScriptError = @"function noError() { return true; } window.onerror = noError;";

private void webBrowser1_Navigated(object sender, System.Windows.Navigation.NavigationEventArgs e)
{
    InjectDisableScript();
}

private void InjectDisableScript()
{
    // cast to the COM interface; casting to HTMLDocumentClass can fail at runtime
    HTMLDocument doc = webBrowser1.Document as HTMLDocument;
    IHTMLScriptElement scriptErrorSuppressed = (IHTMLScriptElement)doc.createElement("SCRIPT");
    scriptErrorSuppressed.type = "text/javascript";
    scriptErrorSuppressed.text = DisableScriptError;
    IHTMLElementCollection nodes = doc.getElementsByTagName("head");
    foreach (IHTMLElement elem in nodes)
    {
        var head = (IHTMLDOMNode)elem;
        head.appendChild((IHTMLDOMNode)scriptErrorSuppressed);
    }
}
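This assumes the Navigated handler is actually hooked up, for example in the window's constructor:

    // subscribe in the constructor so the script is injected on every navigation
    webBrowser1.Navigated += webBrowser1_Navigated;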
The WPF WebBrowser does not expose this property the way the WinForms control does.
You'd be better off using a WindowsFormsHost in your WPF application and using the WinForms WebBrowser (so that you can use ScriptErrorsSuppressed). Make sure you run in full trust.
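A minimal sketch of that hosting approach; 'mainGrid' is an assumed panel in the window's XAML, and references to WindowsFormsIntegration and System.Windows.Forms are required:

    using System.Windows.Forms.Integration;

    // host the WinForms WebBrowser inside a WPF window
    var winFormsBrowser = new System.Windows.Forms.WebBrowser
    {
        ScriptErrorsSuppressed = true // the property the WPF control lacks
    };
    var host = new WindowsFormsHost { Child = winFormsBrowser };
    mainGrid.Children.Add(host); // 'mainGrid' is an assumed container, not from the original
    winFormsBrowser.Navigate("http://www.somewebsite.com/");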