How to read html content of Ajax based site - c#

I would like to know how the HTML source of ajax based sites can be read using HttpWebRequest / HttpWebResponse (That is reading the contents of a website at server side). The problem that I'm facing is that I'm unable to read parts of the webpage which uses Ajax or stuffs like UpdatePanel.
My application is in ASP.NET / C#, so can't think of using stuffs like Browser control or mshtml.dll since I would not be able to serve multiple requests.
Thanks in advance.

this is going to be difficult.
I know you said you don't want to use Browser control, but I'm going to say it anyway. You will most probably be better off using a Browser control. The reasons are as follow:
AJAX sites make multiple calls from the browser to the server to obtain the required view.
The multiple calls are being performed via JavaScript
The data returned from the server may be reformatted by JavaScript before being updated onto the view.
If you are going to do this using HttpWebXyz functions, you will have to do the following:
Make the relevant calls to get the initial page source.
Parse the page for JavaScript.
Evaluate/execute the JavaScript. This may include providing the relevant implementation for functions such as alert and making subsequent calls to the server.
Depending on the complexity of the AJAX site, you may want to reconsider using the browser control. Complex sites are easier process by the control. If the site is simple enough, you may survive parsing and executing the required JavaScript.
This example uses a deprecated class to parse JavaScript.
You may want to explore ICodeCompiler and its relevant classes for the new approach.
Good luck.

Related

Post data to server and then parse the HTML-code C#

I'm trying to parse a website. The only problem is that the site dosen't use a specific URL to the site I wan't to parse. The content is being displayed to the site using JavaScript on the same webpage so the content is different depending on the searchquery.
Is it possible to choose a value from a dropdown-menu and then post that to the server and then parse the HTML-code in C#?
Clarification:The code is returned in HTML.
I know the name of the option from the dropdown i want to post, but how do I do that from code-behind?
Most sites do not really generate HTML in Javascript. Much more often you see Asp.Net sites where Javascript is used for a postback (and name of the dropdown is posted back in __EVENTTARGET field)
Then you can do the same in your application - you have to imitate filling the form - pass all the fields to the server including VIEWSTATE and EVENTTARGET.
Having said that, it might be against the site's terms of use.
You definitely need to checkout Selenium, it does exactly what you need. It is commonly used as a testing framework. However you can use it to manipulate HTML tags even when the website uses javascript.
Note: Selenium allows you to open and manipulate a website using a browser such as FireFox, Chrome, IE, etc. However, I think what you need here is to use the WebDriver, which manipulates the website without opening a browser. Most of my experience using Selenium is with Java, but I found multiple tutorials online for .net too.

Using JavaScript template with ASP.NET

I ran into this problem multiple times in my career, and never was able to find a elegant solution for it. Imagine you have a simple page, that has a repeater. You populate that repeater on the server-side through the databinding. That's great, works fast and does what it's supposed to. But now you want to add paginator to that repeater, or otherwise change the output. Doing it through Ajax is a preferred way to enable rich client interaction.
So you create a web-service that serves you the data as JSON, but now you are stuck... Either you have to write complicated client-side code to find each field that you need to modify in each repeater-item, or you have to blow away the whole server-side output of the repeater and construct new HTML from the scratch, or, the method that I've been using lately, take the first repeated item, blow away everything else and clone the first item as many time as you need to and modify it's fields.
All of the described methods are not optimal, because no matter what, they require quite a bit of repeated logic on the server-side (i.e. template in repeater) and on the client-side (javascript to display JSON data). There's got to be a better, easier way to do this. First thing that comes to mind, is instead of returning JSON from the web-server, return HTML of the pre-populated repeater. But for something like that, I might as well use ASP.NET AJAX Update panel. The output isn't going to be any smaller with a stand-alone web-service.
Next thing that I thought of, is JavaScript templates. What if there would be some way to take unprocessed repeater template on the server-side, and convert it to JavaScript template that could be either embedded on the page at load, or served as part of the web-service response. However, I couldn't find any existing solutions for something like this. And I can't think of a simple way to do that myself. Any ideas?
P.S. Rendering JavaScript template to the client-side on page load, and using JavaScript to populate it without the initial view being rendered on the server (no repeater and databinding) is out of the question. I care too much about performance.
Firstly, I don't believe that using client template with JSON data even on first load will adversely affect the performance unless we are talking about devices with different form factors such as phones etc.
However, if you must use server side templating/rendering then why not make server return the html for the repeater. This can be done by putting repeater logic into a different user control/page and processing only that page on ajax request. And this is not at all equivalent to using UpdatePanel (as stated by you) - UpdatePanel posts entire page data (including view-state) having more request size. The response size is also larges because it must contain the view-state. On server side also, use of UpdatePanel results in loading complete control tree with state data and post-back event processing. Sending only the requisite html is much better approach and will fit your needs perfectly - only issue is the html would be larger in size as compared to JSON.
Lastly, there are some interesting projects such as Script# - Script# converts C# code into java-script. You may build something similar (using script# itself) to convert the server side templating code into eqivalent JS code. More viable approach on similar lines could be use T4 templating to convert a technology-agnostic template into both server side code (markup + code or pure code) and equivalent JS-code.
After thinking about all pros and cons of different approaches, I stopped on the following method. I created a custom ASP.NET databound control, that can render HTML, however, when the page is requested with query string parameters, instead of just doing standard rendering, it will use Response.Clear() and Response.End() and in between of those two commands output JSON version of data based on the query string parameters. Also on the first rendering of the page, it will also output JavaScript template using reflections to read names of the variables from the control's template area.
This method works great, all I have to do, is drop my control on the page, data bind it, and it works as a true AJAX grid that supports pagination, sorting and filtering. However it does have limitation. In the control's template you can only specify variables, not expressions. Otherwise reflections can't convert it to a JavaScript variable. But I can live with that.
Other possibilities that I considered is a separate web-service that takes a type of the page as parameter and uses reflection to get data bound object as well as create template for the grid. I also though about writting my own version of update panel, that would not use view state and only send in part of the page.

run serverside method from jquery

I know the method that, from jQuery, using ajax I can invoke WebMethod from a aspx or asmx file. That's ok, but I only can place my project logic in ascx.cs files. It is a specific CMS and I can't do anything about it.
So my question is based on example described below:
Lets say users is logged in and is viewing an article. One user wants to mark it as a favourite, so clicks a button. On the server side without, refreshing the page, an appropriate method should run which adds this article to his favourites and then in client side there is an alert - 'Success'.
I dont want anyone to write the code for me for that it is just an example for desribing what functionality I would like to be able to achieve and which technology to use for that.
Thanks for the help.
P.S. I'm using ASP.Net 2.0
ASCX files are not directly accessible from the client (and as such, cannot be targeted via AJAX calls).
If your logic really has to be encapsulated in the ASCX file, you can add in an entrypoint WebMethod in your ASPX that calls the respective ASCX method instead. You'll probably encounter some difficulties related to the WebMethods being static though, so you may eventually need to refactor a bit, depending on how your code is structured now.
You can make ajax call to remote page (with ascx control with your server side method) and then parse output (for example look for world "SUCCESS") to verify, that your method is executed. Not very elegant, but it will work.

Web page crawling in C#

I have been given a task to crawl / parse and index available books on many library web page. I usually use HTML Agility Pack and C# to parse web site content. One of them is the following:
http://bibliotek.kristianstad.se/pls/bookit/pkg_www_misc.print_index?in_language_id=en_GB
If you search for a * (all books) it will return many lists of books, paginated by 10 books per page.
Typical web crawlers that I have found fail on this website. I have also tried to write my own crawler, which would go through all links on the page and generate post/get variables to dynamically generate results. I havent been able to do this as well, mostly due to some 404 errors that I get (although I am certain that the links generated are correct).
The site relies on javascript to generate content, and uses a mixed mode of GET and POST variable submission.
I'm going out on a limb, but try observing the JavaScript GETs and POSTs with Fiddler and then you can base your crawling off of those requests. Fiddler has FiddlerCore, which you can put in your own C# project. Using this, you could monitor requests made in the WebBrowser control and then save them for crawling or whatever, later.
Going down the C# JavaScript interpreter route sounds like the 'more correct' way of doing this, but I wager it will be much harder and frought with errors and bugs unless you have the simplest of cases.
Good luck.
FWIW, the C# WebBrowser control is very, very slow. It also doesn't support more than two simultaneous requests.
Using SHDocVw is faster, but is also semaphore limited.
Faster still is using MSHTML. Working code here: https://svn.arachnode.net/svn/arachnodenet/trunk/Renderer/HtmlRenderer.cs Username/Password: Public (doesn't have the request/rendering limitations that the other two have when run out of process...)
This is headless, so none of the controls are rendered. (Faster).
Thanks,
Mike
If you use the WebBrowser control in a Windows Forms application to open the page then you should be able to access the DOM through the HtmlDocument. That would work for the HTML links.
As for the links that are generated through Javascript, you might look at the ObjectForScripting property which should allow you to interface with the HTML page through Javascript. The rest then becomes a Javascript problem, but it should (in theory) be solvable. I haven't tried this so I can't say.
If the site generates content with JavaScript, then you are out of luck. You need a full JavaScript engine usable in C# so that you can actually execute the scripts and capture the output they generate.
Take a look at this question: Embedding JavaScript engine into .NET -- but know that it will take "serious" effort to do what you need.
AbotX does javascript rendering for you. Its not free though.

Implementing UpdatePanel manually

I was reading an article that shows how bad CodePlex uses UpdatePanels and how nice is StackOverflow on this matter when, for example, a user upvotes an answer/question.
I wonder if someone can point a tutorial on how to do such action.
I know some points:
Create a Web Service that gets the action value and ouputs a JSON string
Build the javascript inside <ajax:ScripManager> control to replace the correct value on the page with the new value
But, even in the first I have difficulties, I can send a JSON string, but it will always be surrounded with XML information!
Can anyone (or maybe Jeff) point to a nice "how-to" since scratch? Thank you.
Well, I doubt StackOverflow uses UpdatePanel - more likely it uses jQuery / load to simply update a div, using ASP.NET MVC as the source (rather than ASP.NET vanilla, which has a more complex page cycle).
With this approach, it is trivial... the jQuery examples tab largely says it all.
Re returning the Json - that is simply return Json(obj); from the controller in ASP.NET MVC - but personally I'd return the html (simpler).
Before you dismiss the UpdatePanel I suggest you have a read of this post I did - http://www.aaron-powell.com/blog/august-2008/optimising-updatepanels.aspx. It looks at how to optimise UpdatePanels and it can lead to some performance increases if done well.
I also did a post - http://www.aaron-powell.com/blog/august-2008/paging-data-client-side.aspx which looks at doing client-side templating with jQuery and MS AJAX. I look at how to read a web service with JavaScript and if you download the sample you'll see how to send data client-side to a web service.
But this video cast on the ASP.NET website may also be of use - http://www.asp.net/learn/ajax-videos/video-82.aspx. It's on how to extend web services for script service capabilities.

Categories

Resources