I'm trying to retrieve the HTML of a page that uses some Ajax.
The problem is that WebClient.DownloadString() returns too fast, so the Ajax on the page hasn't finished loading and I'm not getting the right HTML :(
Is it possible to call another function or something similar, so that I can request the page, wait a few seconds and then read the response? (That would allow the Ajax to finish loading before I retrieve the HTML.)
Thanks,
Louisa
The WebClient by default only fetches the (HTML) contents of a single URL. It does not parse HTML and thus does not know about any CSS, images or JavaScript used on the page. You are trying to emulate the functionality of a full-blown browser, for which the WebClient alone is insufficient.
To achieve your desired behaviour, you will have to not only retrieve the HTML, but also parse it, retrieve and execute the JavaScript on the page, and then get the resulting DOM. This is most easily achieved through a library that provides the functionality of a web browser to your application. Examples include System.Windows.Forms.WebBrowser (WinForms), System.Windows.Controls.WebBrowser (WPF) or Awesomium.
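As a rough illustration, here is a minimal sketch using the WinForms WebBrowser control from a console application on an STA thread; the URL and the five-second grace period for the Ajax calls are placeholder assumptions you would tune for the actual page.

using System;
using System.Threading;
using System.Windows.Forms;

class AjaxHtmlFetcher
{
    [STAThread]
    static void Main()
    {
        Console.WriteLine(GetRenderedHtml("http://example.com/ajax-page"));
    }

    static string GetRenderedHtml(string url)
    {
        bool loaded = false;
        using (var browser = new WebBrowser { ScriptErrorsSuppressed = true })
        {
            browser.DocumentCompleted += (s, e) => loaded = true;
            browser.Navigate(url);

            // Pump messages until the initial document has finished loading.
            while (!loaded)
                Application.DoEvents();

            // Give the page's Ajax calls a few seconds to update the DOM (adjust as needed).
            var deadline = DateTime.Now.AddSeconds(5);
            while (DateTime.Now < deadline)
            {
                Application.DoEvents();
                Thread.Sleep(50);
            }

            // Read the current DOM rather than the originally downloaded source.
            var root = browser.Document != null
                ? browser.Document.GetElementsByTagName("html")
                : null;
            return (root != null && root.Count > 0) ? root[0].OuterHtml : string.Empty;
        }
    }
}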
Related
I'm looking for a simple way to get a string from a URL that contains all the text actually displayed to the user.
I.e. anything loaded with a delay (using JavaScript) should be contained. Also, the result should ideally be free from HTML tags etc.
A straightforward approach with WebClient.DownloadString() and a subsequent HTML regex is pretty much pointless, because most content in modern web apps is not contained in the initial HTML document.
Most probably you can use Selenium WebDriver to fully load the page and then dump the full DOM.
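Something along these lines, assuming the Selenium.WebDriver package and a browser driver such as ChromeDriver are installed; the URL and the fixed five-second wait are placeholders (a WebDriverWait on a known element would be more robust):

using System;
using System.Threading;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class DomDumper
{
    static void Main()
    {
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl("http://example.com/ajax-page");

            // Crude wait for client-side rendering to finish.
            Thread.Sleep(5000);

            // PageSource returns the current DOM, including JavaScript-inserted content.
            string fullDom = driver.PageSource;

            // The body text approximates "all text actually displayed to the user",
            // already free of HTML tags.
            string visibleText = driver.FindElement(By.TagName("body")).Text;

            Console.WriteLine("DOM length: " + fullDom.Length);
            Console.WriteLine(visibleText);
        }
    }
}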
There are lots of sites that use this (imo) annoying "infinite scrolling" style.
Examples of this are sites like tumblr, twitter, 9gag, etc..
I recently tried to scrape some pics off of these sites programmatically with HtmlAgilityPack.
like this:
// Load the page with HtmlAgilityPack and grab the first image with the given class.
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
// Note: XPath attribute tests use '@', not '#'.
var primary = doc.DocumentNode.SelectNodes("//img[@class='badge-item-img']");
var picstring = primary.Select(r => r.GetAttributeValue("src", null)).FirstOrDefault();
This works fine, but when I tried to load in the HTML from certain sites, I noticed that I only got back a small amount of content (let's say the first 10 "posts" or "pictures", or whatever).
This made me wonder if it would be possible to simulate "scrolling down to the bottom" of the page in C#.
This isn't just the case when I load the HTML programmatically: when I simply go to sites like tumblr and check Firebug or just "view source", I expected all the content to be in there somewhere, but a lot of it seems to be hidden/inserted with JavaScript. Only the content that is actually visible on my screen is present in the HTML source.
So my question is: is it possible to simulate infinitely scrolling down a page, and load in that HTML with C# (preferably)?
(I know that I can use APIs for tumblr and twitter, but I'm just trying to have some fun hacking stuff together with HtmlAgilityPack)
There is no way to reliably do this for all such websites in one shot, short of embedding a web browser (which typically won't work in headless environments).
What you should consider doing instead is looking at the site's JavaScript in order to see what AJAX queries are used to fetch content as the user scrolls down.
Alternatively, use a web debugger in your browser (such as the one included in Chrome). These debuggers usually have a "network" pane you can use to inspect AJAX requests performed by the page. Looking at these requests as you scroll down should give you enough information to write C# code that simulates those requests.
You will then have to parse the response from those requests as whatever type of content that particular API delivers, which will probably be JSON or XML, but almost certainly not HTML. (This may be better for you anyway, since it will save you having to parse out display-oriented HTML, whereas the AJAX API will give you data objects that should be much easier to use.)
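As a purely hypothetical sketch: the endpoint URL, its offset parameter and the JSON shape below are invented for illustration; the real ones come from watching the network pane while you scroll. It assumes Json.NET (Newtonsoft.Json) for the parsing.

using System;
using System.Net;
using Newtonsoft.Json.Linq;

class InfiniteScrollScraper
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // Pretend the site serves each batch of posts from an endpoint like this.
            for (int page = 0; page < 5; page++)
            {
                string url = "http://example.com/api/posts?offset=" + (page * 10);
                string json = client.DownloadString(url);

                // Assumed response shape: { "posts": [ { "image": "..." }, ... ] }
                JObject data = JObject.Parse(json);
                foreach (JToken post in data["posts"])
                {
                    Console.WriteLine((string)post["image"]);
                }
            }
        }
    }
}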
Those sites are making asynchronous HTTP requests to load the subsequent page contents. Since the HTML Agility Pack doesn't have a JavaScript interpreter (thank heavens for that), you will need to make those requests yourself. Most sites will most likely not return HTML fragments, but rather JSON. For that, you'll need to use a JSON parser, not the HTML Agility Pack.
I am trying to create a form which allows async file uploading with ASP.NET. I realize you cannot upload a file with Ajax per se, so I am examining alternatives.
What is the best way to do this? Create an iframe on the page with the entire form, including the file input? Can I put the submit button on the parent of the frame, have it force the frame to submit, and then display some sort of spinner to indicate the file is uploading? Ideally, upon completion I'd like to redirect the user to another page. Is there a somewhat easy way to do this?
Have you tried using one of the jQuery plugins instead of doing it by hand?
http://aquantum-demo.appspot.com/file-upload
Why not use the ASP.NET AJAX Control Toolkit's AsyncFileUpload control? It's free and works pretty well.
You could use http://jquery.malsup.com/form/#file-upload
Have it post to a page that will handle a file upload on the server side in your usual way.
I like to have the page return JSON with a success/failure flag and message, then parse the response to determine if the upload succeeded.
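For the server side, a rough sketch of such a page as an ASP.NET generic handler (the handler name, the "file" field name and the ~/Uploads folder are assumptions); it saves the posted file and writes back the small JSON success/failure payload the client then parses:

using System;
using System.IO;
using System.Web;

public class UploadHandler : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        context.Response.ContentType = "application/json";
        try
        {
            HttpPostedFile file = context.Request.Files["file"];
            if (file == null || file.ContentLength == 0)
            {
                context.Response.Write("{\"success\":false,\"message\":\"No file received\"}");
                return;
            }

            string path = context.Server.MapPath("~/Uploads/" + Path.GetFileName(file.FileName));
            file.SaveAs(path);
            context.Response.Write("{\"success\":true,\"message\":\"Upload complete\"}");
        }
        catch (Exception ex)
        {
            // Note: a real handler should JSON-encode the message properly.
            context.Response.Write("{\"success\":false,\"message\":\"" + ex.Message + "\"}");
        }
    }

    public bool IsReusable { get { return false; } }
}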
I would like to know how the HTML source of Ajax-based sites can be read using HttpWebRequest / HttpWebResponse (that is, reading the contents of a website on the server side). The problem I'm facing is that I'm unable to read the parts of the webpage that use Ajax or things like UpdatePanel.
My application is in ASP.NET / C#, so I can't consider things like the Browser control or mshtml.dll, since I would not be able to serve multiple requests.
Thanks in advance.
This is going to be difficult.
I know you said you don't want to use a Browser control, but I'm going to say it anyway: you will most probably be better off using a Browser control. The reasons are as follows:
AJAX sites make multiple calls from the browser to the server to obtain the required view.
The multiple calls are being performed via JavaScript.
The data returned from the server may be reformatted by JavaScript before being updated onto the view.
If you are going to do this using HttpWebXyz functions, you will have to do the following:
Make the relevant calls to get the initial page source.
Parse the page for JavaScript.
Evaluate/execute the JavaScript. This may include providing the relevant implementation for functions such as alert and making subsequent calls to the server.
Depending on the complexity of the AJAX site, you may want to reconsider using the browser control. Complex sites are easier to process with the control. If the site is simple enough, you may survive parsing and executing the required JavaScript.
This example uses a deprecated class to parse JavaScript.
You may want to explore ICodeCompiler and its relevant classes for the new approach.
Good luck.
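To make the first two steps concrete, here is a sketch (the URL is a placeholder, and HtmlAgilityPack is assumed just for convenience) that fetches the initial source with HttpWebRequest and lists the script blocks it finds; actually evaluating that JavaScript and replaying its server calls is the hard part and is not shown:

using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class ScriptExtractor
{
    static void Main()
    {
        // Step 1: get the initial page source.
        var request = (HttpWebRequest)WebRequest.Create("http://example.com/ajax-page");
        string html;
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            html = reader.ReadToEnd();
        }

        // Step 2: parse the page for JavaScript.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var scripts = doc.DocumentNode.SelectNodes("//script");
        if (scripts != null)
        {
            foreach (var script in scripts)
            {
                string src = script.GetAttributeValue("src", null);
                Console.WriteLine(src ?? "(inline script, " + script.InnerText.Length + " chars)");
            }
        }
    }
}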
I've been struggling to find an example of some C# code (I'm using C# Visual Studio 2008 Express) that can programmatically save an entire web page (given a URL) including the images and formatting (e.g. CSS). The intention is that in a subsequent phase I'd ship this off (not sure how yet) so it could be viewed later via a browser.
Is there an example of the most simple approach (leveraging the .NET Framework methods) to save an entire web page? Saving as one page with a subdirectory for images, or otherwise. Basically the same as what you get with browsers when you say "save entire web page".
The simplest way is probably to add a WebBrowser Control to your application and point it at the page you want to save using the Navigate() method.
Then, when the document has loaded, call the ShowSaveAsDialog method. The user can then save the page as a single file, or a file with images in a subdirectory.
[Update]
Having now noticed "programmatically" in your question, the above approach is not ideal as it requires either user involvement or delving into the Windows API to send input using SendKeys or similar.
There is nothing built-in to the .NET Framework that does all of what you ask.
So my approach revised would be:
Use System.Net.HttpWebRequest to get the main HTML document as a string or stream (easy).
Load this into an HtmlAgilityPack document, where you can now easily query the document to get lists of all image elements, stylesheet links, etc.
Then make a separate web request for each of these files and save them to a subdirectory.
Finally, update all relevant links in the main page to point to the items in the subdirectory.
In effect you would be implementing a very simple web browser. You may run into issues with pages that use JavaScript to dynamically alter or request page content, but for most pages this should give acceptable results.
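A simplified sketch of those steps (the URL and output folder are placeholders; it only handles <img src> and <link href>, and skips error handling and trickier relative-URL cases, so treat it as a starting point rather than a complete page saver):

using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class PageSaver
{
    static void Main()
    {
        var pageUri = new Uri("http://example.com/");
        string outputDir = @"C:\SavedPage";
        string assetDir = Path.Combine(outputDir, "assets");
        Directory.CreateDirectory(assetDir);

        // 1. Download the main HTML document.
        var client = new WebClient();
        string html = client.DownloadString(pageUri);

        // 2. Load it into HtmlAgilityPack and find referenced resources.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var nodes = doc.DocumentNode.SelectNodes("//img[@src] | //link[@href]");

        if (nodes != null)
        {
            foreach (var node in nodes)
            {
                string attr = node.Name == "img" ? "src" : "href";
                var resourceUri = new Uri(pageUri, node.GetAttributeValue(attr, ""));

                // 3. Download each resource into the subdirectory.
                string fileName = Path.GetFileName(resourceUri.LocalPath);
                if (string.IsNullOrEmpty(fileName)) continue;
                client.DownloadFile(resourceUri, Path.Combine(assetDir, fileName));

                // 4. Point the page at the local copy.
                node.SetAttributeValue(attr, "assets/" + fileName);
            }
        }

        doc.Save(Path.Combine(outputDir, "index.html"));
    }
}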
From CodeProject: ZetaWebSpider
It's definitely not elegant, but you could navigate a System.Windows.Forms.WebBrowser to the URL and then call its ShowSaveAsDialog() method to save the page.