I want to download a webpage after its JavaScript has run, i.e. a page whose data is loaded by JavaScript. I'm using the WebClient class.
My code:
WebClient wb = new WebClient();
string htmlDoc = wb.DownloadString(link);
My problem is that I get the HTML as it is before the data has been loaded by the JavaScript.
What can I do?
Here is my link: http://www.select-test.com/HTML/frmDisplayGrid.aspx?Type=cat&Data=OS
thanks,
Chani
Related
I'm looking for a method that replicates a web browser's Save Page As function (Save as type = Text Files) in C#.
Dilemma: I've attempted to use WebClient and HttpWebRequest to download all the text from a web page. Both methods return only the HTML of the page, which does not include dynamic content.
Sample code:
string url = @"https://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?trackingNumber=" + package.Item2 + "&LOCALE=en";
try
{
    System.Net.ServicePointManager.SecurityProtocol = System.Net.SecurityProtocolType.Tls11 | System.Net.SecurityProtocolType.Tls12;
    using (WebClient client = new WebClient())
    {
        string content = client.DownloadString(url);
    }
}
catch (WebException ex)
{
    // handle network errors here
}
The above example returns the HTML without the tracking events from the page.
When I display the page in Firefox, right-click on the page, select Save Page As, and save as a text file, all of the raw text is saved to the file. I would like to mimic this feature.
If you are scraping a web page that shows dynamic content, you basically have two options:
1. Use something to render the page first. The simplest option in C# is the WebBrowser control: load the page and listen for the DocumentCompleted event. Note that there is some nuance to this event, since it fires once per document when a page contains multiple frames.
2. Figure out which service the page is calling to get the extra data, and see if you can access it directly. It may well be that the Canada Post website is calling an API that you can also call yourself, as the sketch below illustrates.
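As a rough sketch of option 2 (the endpoint URL below is hypothetical; find the real one by watching the network tab in your browser's dev tools while the page loads), you could fetch the data feed directly:
using System;
using System.Net.Http;
using System.Threading.Tasks;

class DirectApiSketch
{
    static async Task Main()
    {
        // Hypothetical endpoint; inspect the network tab to find the real
        // URL the page requests its tracking data from.
        string apiUrl = "https://www.canadapost.ca/track/api/findByTrackNumber?trackingNumber=1234567890";

        using (var http = new HttpClient())
        {
            // Such endpoints usually return JSON, so no JavaScript rendering is needed.
            string json = await http.GetStringAsync(apiUrl);
            Console.WriteLine(json);
        }
    }
}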
How can I scrape data that are dynamically generated by JavaScript in html document using C#?
Using WebRequest and HttpWebResponse from the .NET Framework, I'm able to get the whole HTML source code as a string, but the difficulty is that the data I want isn't contained in that source code; it is generated dynamically by JavaScript.
On the other hand, if the data I want is already in the source code, I can extract it easily using regular expressions.
I have downloaded the HtmlAgilityPack, but I don't know whether it handles the case where items are generated dynamically by JavaScript...
Thank you very much!
When you make the WebRequest, you're asking the server to give you the page file. That file's content hasn't yet been parsed or executed by a web browser, so the JavaScript in it hasn't done anything.
You need a tool that executes the JavaScript on the page if you want to see what the page looks like after a browser has processed it. One option is the built-in .NET WebBrowser control: http://msdn.microsoft.com/en-au/library/aa752040(v=vs.85).aspx
The WebBrowser control can navigate to and load the page, and you can then query its DOM, which will have been altered by the JavaScript on the page.
EDIT (example):
Uri uri = new Uri("http://www.somewebsite.com/somepage.htm");

webBrowserControl.AllowNavigation = true;
// optional, but it stops JavaScript errors from breaking your scraper
webBrowserControl.ScriptErrorsSuppressed = true;
// you want to start scraping after the document has finished loading,
// so do it in the handler you attach here
webBrowserControl.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowserControl_DocumentCompleted);
webBrowserControl.Navigate(uri);

private void webBrowserControl_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    HtmlElementCollection divs = webBrowserControl.Document.GetElementsByTagName("div");
    foreach (HtmlElement div in divs)
    {
        // do something with each div
    }
}
You could also take a look at a tool like Selenium for scraping pages that use JavaScript; a sketch follows the link below.
http://www.andykelk.net/tech/headless-browser-testing-with-phantomjs-selenium-webdriver-c-nunit-and-mono
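As a rough sketch of the Selenium route (assuming the Selenium.WebDriver and ChromeDriver NuGet packages are installed; the URL is a placeholder), the driver runs a real browser, executes the page's JavaScript, and hands you the final DOM:
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class SeleniumSketch
{
    static void Main()
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless"); // no visible browser window

        using (IWebDriver driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl("http://www.somewebsite.com/somepage.htm");

            // PageSource is the DOM after JavaScript has run (you may still
            // need an explicit wait if the page loads data asynchronously).
            Console.WriteLine(driver.PageSource);

            foreach (IWebElement div in driver.FindElements(By.TagName("div")))
            {
                Console.WriteLine(div.Text);
            }
        }
    }
}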
If I use this
WebClient client = new WebClient();
String htmlCode = client.DownloadString("http://test.net");
I am able to use the Agility Pack to scan the HTML and get most of the tags that I need, but it misses the HTML that is rendered by the JavaScript.
My question is: how do I get the final rendered page source using C#? Is there something more in WebClient that can get the final rendered source after the JavaScript has run?
The HTML Agility Pack alone is not enough to do what you want; you need a JavaScript engine as well. For that, you may want to check out something like GeckoFX, which will allow you to embed a fully functional web browser into your application and then programmatically access the contents of the DOM after the page has rendered.
http://code.google.com/p/geckofx/
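A minimal sketch of that idea, assuming the GeckoFX package plus a matching XULRunner/Firefox runtime (the path and URL below are placeholders, and API names vary somewhat between GeckoFX versions). Like the WebBrowser control, it is a WinForms control and needs a message loop:
using System.Windows.Forms;
using Gecko;

public class GeckoScraperForm : Form
{
    private readonly GeckoWebBrowser browser;

    public GeckoScraperForm()
    {
        // Placeholder path: point this at a XULRunner/Firefox runtime
        // that matches your GeckoFX version.
        Xpcom.Initialize(@"C:\xulrunner");

        browser = new GeckoWebBrowser { Dock = DockStyle.Fill };
        Controls.Add(browser);

        browser.DocumentCompleted += (s, e) =>
        {
            // The DOM here reflects whatever the page's JavaScript has built.
            foreach (GeckoElement div in browser.Document.GetElementsByTagName("div"))
            {
                // do something with div.TextContent
            }
        };

        browser.Navigate("http://www.somewebsite.com/somepage.htm");
    }
}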
You need to wrap a browser in your application.
You are in luck! There is a .NET wrapper for WebKit. http://webkitdotnet.sourceforge.net/
You can use the WebBrowser Class from System.Windows.Forms.
using (WebBrowser wb = new WebBrowser())
{
    // navigate and scrape here; see the sketch below
}
https://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser(v=vs.110).aspx
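One caveat: the WebBrowser control only raises DocumentCompleted when a Windows message loop is running on an STA thread, so a bare using block in a console app won't get you there. A rough console-hosted sketch (the URL is a placeholder):
using System;
using System.Windows.Forms;

class Program
{
    [STAThread] // WebBrowser wraps an ActiveX control and requires an STA thread
    static void Main()
    {
        var wb = new WebBrowser { ScriptErrorsSuppressed = true };
        wb.DocumentCompleted += (s, e) =>
        {
            // The DOM is now the post-JavaScript version of the page.
            Console.WriteLine(wb.Document.Body.InnerHtml);
            Application.ExitThread(); // stop the message loop when done
        };
        wb.Navigate("http://test.net");
        Application.Run(); // pump messages until ExitThread is called
    }
}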
In a Winforms app I have a webbrowser control that is logged in to a site.
Now I want to download an image (that can only be downloaded when logged in to that site) programmatically.
So how do I tell my WebBrowser control to download an image, e.g. http://www.example.com/image.jpg, and save it somewhere?
If you don't want to save the file directly to your hard drive, you can download it into a stream instead, e.g.:
WebClient wc = new WebClient();
byte[] bytes = wc.DownloadData("http://www.example.com/image.jpg");
Bitmap b = new Bitmap(new MemoryStream(bytes));
If you then wish to save it to your hard drive, you can call the Bitmap.Save() method. e.g.
b.Save("bitmap.jpg");
I guess you can't do that silently using a WebBrowser; don't forget that it's an IE instance under the hood. What you can do is navigate to the image URL and then invoke ShowSaveAsDialog(), which shows a Save As dialog so the user can save the image:
WebBrowser wb = new WebBrowser();
wb.Navigate("ImageURL");
wb.ShowSaveAsDialog();
A better solution is to get the image using a WebClient:
System.Net.WebClient wc = new System.Net.WebClient();
wc.Credentials = new System.Net.NetworkCredential("username", "password"); // only needed if the image URL requires HTTP authentication
wc.DownloadFile("imageURL", "downloadedImage.jpg"); // downloads the image URL to the local file downloadedImage.jpg
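Note that NetworkCredential only covers HTTP authentication (Basic/NTLM/Digest). If the site uses a login form with session cookies, which is what your logged-in WebBrowser control is holding, a common workaround is to forward the browser's cookies to the WebClient. This is a sketch, not guaranteed for every site: HttpOnly cookies are not visible through the document and won't be copied.
using System.Net;

// 'webBrowser' is assumed to be the already-logged-in control on your form.
using (var wc = new WebClient())
{
    // Forward the browser session's cookies so the request is authenticated.
    wc.Headers.Add(HttpRequestHeader.Cookie, webBrowser.Document.Cookie);
    wc.DownloadFile("http://www.example.com/image.jpg", "downloadedImage.jpg");
}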