Scraping data dynamically generated by JavaScript in html document using C# - c#

How can I scrape data that are dynamically generated by JavaScript in an HTML document using C#?
Using WebRequest and HttpWebResponse in the C# library, I'm able to get the whole HTML source code as a string, but the difficulty is that the data I want isn't contained in the source code; the data are generated dynamically by JavaScript.
On the other hand, if the data I want are already in the source code, then I'm able to get them easily using Regular Expressions.
I have downloaded HtmlAgilityPack, but I don't know if it would take care of the case where items are generated dynamically by JavaScript...
Thank you very much!

When you make the WebRequest, you're asking the server to give you the page file. This file's content hasn't yet been parsed or executed by a web browser, so the JavaScript in it hasn't done anything yet.
You need a tool that executes the JavaScript on the page if you want to see what the page looks like after a browser has processed it. One option is the built-in .NET WebBrowser control: http://msdn.microsoft.com/en-au/library/aa752040(v=vs.85).aspx
The WebBrowser control can navigate to and load the page, and then you can query its DOM, which will have been altered by the JavaScript on the page.
EDIT (example):
Uri uri = new Uri("http://www.somewebsite.com/somepage.htm");
webBrowserControl.AllowNavigation = true;
// optional, but I use this because it stops JavaScript errors breaking your scraper
webBrowserControl.ScriptErrorsSuppressed = true;
// you want to start scraping after the document has finished loading, so do it in the function you pass to this handler
webBrowserControl.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowserControl_DocumentCompleted);
webBrowserControl.Navigate(uri);
private void webBrowserControl_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    HtmlElementCollection divs = webBrowserControl.Document.GetElementsByTagName("div");
    foreach (HtmlElement div in divs)
    {
        // do something
    }
}

You could take a look at a tool like Selenium for scraping pages that rely on JavaScript.
http://www.andykelk.net/tech/headless-browser-testing-with-phantomjs-selenium-webdriver-c-nunit-and-mono
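A rough sketch of the Selenium route (not taken from the linked article). It assumes the Selenium.WebDriver and Selenium.Support NuGet packages and a local Chrome installation. The URL is the placeholder from the example above, and the "div.result" selector is hypothetical; substitute something that only appears once the page's script has run.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

public static class SeleniumScrapeSketch
{
    public static void Main()
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless"); // run Chrome without a visible window

        using (IWebDriver driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl("http://www.somewebsite.com/somepage.htm");

            // Wait up to 10 seconds for the script-generated elements to appear.
            // "div.result" is a hypothetical selector, not one from the original post.
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
            wait.Until(d => d.FindElements(By.CssSelector("div.result")).Count > 0);

            // At this point the DOM reflects what the JavaScript has generated.
            foreach (IWebElement div in driver.FindElements(By.CssSelector("div.result")))
            {
                Console.WriteLine(div.Text);
            }
        }
    }
}
```

If the element never shows up, wait.Until throws a WebDriverTimeoutException, so a real scraper would probably want to catch that and decide whether to retry.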

Related

Replicate Webbrowser Save Page As function in C#

I'm looking for a method that replicates a web browser's Save Page As function (Save as Type = Text Files) in C#.
Dilemma: I've attempted to use WebClient and HttpWebRequest to download all the text from a web page. Both methods only return the HTML of the web page, which does not include the dynamic content.
Sample code:
string url = @"https://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?trackingNumber=" + package.Item2 + "&LOCALE=en";
try
{
    System.Net.ServicePointManager.SecurityProtocol = System.Net.SecurityProtocolType.Tls11 | System.Net.SecurityProtocolType.Tls12;
    using (WebClient client = new WebClient())
    {
        // downloads only the static HTML; scripts on the page are not executed
        string content = client.DownloadString(url);
    }
}
catch (WebException ex)
{
    // handle or log the download failure here
    Console.WriteLine(ex.Message);
}
The above example returns the HTML without the tracking events from the page.
When I display the page in Firefox, right-click on the page, select Save Page As, and save it as a Text File, all of the raw text is saved in the file. I would like to mimic this feature.
If you are scraping a web page that shows dynamic content, then you basically have two options:
1. Use something to render the page first. The simplest in C# would be to use a WebBrowser control and listen for the DocumentCompleted event. Note that there is some nuance to this, because the event can fire for multiple documents on one page.
2. Figure out what service the page is calling to get the extra data, and see if you can access that directly. It may well be the case that the Canada Post website is accessing an API that you can also call directly.
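A minimal sketch of the second option, assuming you have already found the request in the Network tab of your browser's developer tools. The endpoint below is purely hypothetical (it is not a real Canada Post API), and the class and method names are made up for illustration.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class DirectServiceCallSketch
{
    private static readonly HttpClient Http = new HttpClient();

    // Fetches the raw response body from the service the page itself calls.
    public static async Task<string> FetchTrackingDataAsync(string trackingNumber)
    {
        // Hypothetical endpoint: replace it with the URL seen in the browser's Network tab.
        string url = "https://example.com/api/track/" + Uri.EscapeDataString(trackingNumber);

        using (var request = new HttpRequestMessage(HttpMethod.Get, url))
        {
            // Many such services return JSON when asked for it.
            request.Headers.Accept.ParseAdd("application/json");

            using (HttpResponseMessage response = await Http.SendAsync(request))
            {
                response.EnsureSuccessStatusCode(); // throws on 4xx/5xx responses
                return await response.Content.ReadAsStringAsync();
            }
        }
    }

    public static async Task Main()
    {
        string body = await FetchTrackingDataAsync("1234567890");
        Console.WriteLine(body);
    }
}
```

Some services also expect cookies or extra headers that the page sends along, so it helps to compare the request you build with the one the browser makes.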

Parsing HTML in C# that is updating constantly

I have a webpage that displays some data using AJAX queries. I need to parse some of this data in a C# program.
The problem is that the data doesn't show up when I look at the source code of my webpage, because it is generated by an AJAX script that modifies the DOM.
If I select everything on the webpage and do "Inspect Element" in Chrome, I get the full HTML code, including the data I want to extract, which is in various tables.
What I've tried is calling webBrowser1.Navigate("www.site.com") and then doing this in my webBrowser1_DocumentCompleted() event handler:
var name = webBrowser1.Document.GetElementById("table_1_r_7_c_2");
The problem is that webBrowser1 is not returning the full HTML code, as some of it is generated by the AJAX queries.
Does anyone know how I could achieve this behavior in C#?
The DocumentCompleted event is a bit misleading because it can fire more than once per page: it also fires for each frame (for example, an iframe) that finishes loading. You can do something like this to check whether it's the actual page that has loaded, or use some other variant to look for specific requests.
private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    if (e.Url.AbsolutePath == webBrowser1.Url.AbsolutePath)
    {
        // page loaded
    }
}
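Content that arrives through AJAX after the page has loaded may still be missing at that point, so one approach is to poll the DOM until the element shows up. A minimal sketch, assuming a WinForms application; the element id is the one from the question, while the class name, the 500 ms interval, and the limit of 20 attempts are arbitrary choices for illustration.

```csharp
using System;
using System.Windows.Forms;

public sealed class PollingScraperForm : Form
{
    private readonly WebBrowser browser = new WebBrowser { Dock = DockStyle.Fill, ScriptErrorsSuppressed = true };
    private readonly System.Windows.Forms.Timer pollTimer = new System.Windows.Forms.Timer { Interval = 500 };
    private int attempts;

    public PollingScraperForm()
    {
        Controls.Add(browser);
        browser.DocumentCompleted += OnDocumentCompleted;
        pollTimer.Tick += OnPollTick;
        browser.Navigate("http://www.site.com/");
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // Skip documents whose URL differs from the browser's current URL (typically frames).
        if (e.Url != browser.Url) return;

        attempts = 0;
        pollTimer.Start();
    }

    private void OnPollTick(object sender, EventArgs e)
    {
        // The element only exists once the AJAX callback has inserted it into the DOM.
        HtmlElement cell = browser.Document?.GetElementById("table_1_r_7_c_2");
        if (cell != null)
        {
            pollTimer.Stop();
            Console.WriteLine(cell.InnerText);
        }
        else if (++attempts >= 20)
        {
            pollTimer.Stop(); // give up after roughly 10 seconds
        }
    }

    [STAThread]
    public static void Main()
    {
        Application.Run(new PollingScraperForm());
    }
}
```

The timer runs on the UI thread, which matters because the WebBrowser control and its Document should only be touched from the thread that created them.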

is there a straightforward way to retrieve text that is rendered by the browser but is not hard-coded in the actual html file?

I'm trying to retrieve data from a webpage but I cannot do it by making a web request and parsing the resulting html file because the actual text that I'm trying to retrieve is not in the html file! I imagine that this text is pulled using some script and for that reason it's not in the html file. For all I know I'm looking at the wrong data, but assuming that my theory is correct, is there a straightforward way to retrieve whatever text is displayed by the browser (Firefox or IE) rather than attempt to fetch the text from the html file?
I'm assuming you are referring to text that has been generated by JavaScript in the browser.
You can use PhantomJS to achieve this: http://phantomjs.org/
It is essentially a headless browser that will process JavaScript.
You may need to run it as an external program, but I'm sure you can do that from C#.
Your other option would be to open the web page in a WebBrowser object, which should execute the scripts; you can then get the HtmlDocument object and go from there.
Take a look at this example...
private void test()
{
    WebBrowser wBrowser1 = new WebBrowser();
    wBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(wBrowser1_DocumentCompleted);
    wBrowser1.Url = new Uri("Web Page URL");
}
void wBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    HtmlDocument document = (sender as WebBrowser).Document;
    // get elements and values accordingly.
}

get the web page source with the rendered html from javascript

If I use this
WebClient client = new WebClient();
String htmlCode = client.DownloadString("http://test.net");
I am able to use the Agility Pack to scan the HTML and get most of the tags that I need, but it's missing the HTML that is rendered by the JavaScript.
My question is: how do I get the final rendered page source using C#? Is there something more to the WebClient to get the final rendered source after the JavaScript has run?
The HTML Agility Pack alone is not enough to do what you want; you need a JavaScript engine as well. To do that, you may want to check out something like Geckofx, which will allow you to embed a fully functional web browser into your application and then programmatically access the contents of the DOM after the page has rendered.
http://code.google.com/p/geckofx/
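Once some browser engine has produced the rendered markup, the Agility Pack can still do the parsing. A minimal sketch, assuming the HtmlAgilityPack NuGet package; the sample markup stands in for whatever string the embedded browser hands back, and the class and method names are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class RenderedHtmlParsingSketch
{
    // Returns the trimmed text of every table cell in the given markup.
    public static string[] ExtractCells(string renderedHtml)
    {
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(renderedHtml);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection cells = doc.DocumentNode.SelectNodes("//td");
        if (cells == null) return new string[0];

        var result = new string[cells.Count];
        for (int i = 0; i < cells.Count; i++)
        {
            // Decode entities such as &amp; and strip surrounding whitespace.
            result[i] = HtmlEntity.DeEntitize(cells[i].InnerText).Trim();
        }
        return result;
    }

    public static void Main()
    {
        // Stand-in for the markup obtained after the page's JavaScript has run.
        string rendered = "<table><tr><td>Item 1</td><td> 42 </td></tr></table>";
        foreach (string cell in ExtractCells(rendered))
        {
            Console.WriteLine(cell);
        }
    }
}
```

The document type is fully qualified here because System.Windows.Forms also defines an HtmlDocument class, and the two names clash if both namespaces are imported in the same file.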
You need to wrap a browser in your application.
You are in luck! There is a .NET wrapper for WebKit. http://webkitdotnet.sourceforge.net/
You can use the WebBrowser Class from System.Windows.Forms.
using (WebBrowser wb = new WebBrowser())
{
    // Code here
}
https://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser(v=vs.110).aspx
