I'd like to create a web page where you can enter a domain name and have it fetch the page and show you all the resources, their download times, etc. -- similar to Firefox's Net tab.
Here's the page I'd like to emulate: http://tools.pingdom.com/
Now, I know this is a complex feature, but I'd like to hear general ideas. I know I could easily fetch the HTML via a WebClient, but that's the easy part. I need to fetch and time all the resources too, and not all at the same time. I want to mimic a browser. So, I thought about using something like System.Windows.Forms.WebBrowser, but that will only really give me the page load time.
Anyone have any thoughts / tips?
Using the Html Agility Pack you can easily find which external resources are referenced from an HTML page.
This won't tell you exactly when they would be loaded by the browser, and also won't help you with dynamically loaded resources, but is a good start.
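As a rough illustration of that first step (a sketch only, not a full browser simulation), the Html Agility Pack can pull out the referenced resources, and .NET's HttpClient plus a Stopwatch can time each download; the target URL and the XPath expression here are just examples:

```csharp
// Sketch: list the external resources a page references and time each download.
// Assumes the HtmlAgilityPack NuGet package; downloads are sequential here,
// whereas a real browser would fetch several resources in parallel.
using System;
using System.Diagnostics;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class ResourceTimer
{
    static async Task Main()
    {
        var pageUrl = new Uri("http://example.com/");            // hypothetical target
        using var http = new HttpClient();

        var html = await http.GetStringAsync(pageUrl);
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Collect src/href attributes from the usual resource-bearing tags.
        var nodes = doc.DocumentNode.SelectNodes("//img[@src]|//script[@src]|//link[@href]")
                    ?? Enumerable.Empty<HtmlNode>();
        var urls = nodes
            .Select(n => n.GetAttributeValue("src", null) ?? n.GetAttributeValue("href", null))
            .Where(u => !string.IsNullOrEmpty(u))
            .Select(u => new Uri(pageUrl, u))                     // resolve relative URLs
            .Distinct();

        foreach (var url in urls)
        {
            var sw = Stopwatch.StartNew();
            var bytes = await http.GetByteArrayAsync(url);
            sw.Stop();
            Console.WriteLine($"{url}  {bytes.Length} bytes  {sw.ElapsedMilliseconds} ms");
        }
    }
}
```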
I'm afraid the only way to be sure is to instantiate an entire browser. You could use a plug in for the Fiddler HTTP debugging proxy to intercept requests from the WebBrowser control to determine which resources are actually loaded in this case.
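If you do go the proxy route, FiddlerCore (the embeddable version of Fiddler) can log every request the WebBrowser control makes, together with per-resource timings. A rough outline, assuming the classic FiddlerCore API (details differ between versions, so treat it as a sketch rather than a drop-in solution):

```csharp
// Outline: run FiddlerCore as a local proxy and log URL, status and elapsed time
// for every request that flows through it (including the WebBrowser control's,
// since the default startup flags register Fiddler as the system proxy).
using System;
using Fiddler;

class ProxyCapture
{
    static void Main()
    {
        FiddlerApplication.AfterSessionComplete += session =>
        {
            var timers = session.Timers;
            var elapsed = timers.ServerDoneResponse - timers.ClientBeginRequest;
            Console.WriteLine($"{session.responseCode} {session.fullUrl} {elapsed.TotalMilliseconds:F0} ms");
        };

        FiddlerApplication.Startup(8877, FiddlerCoreStartupFlags.Default);

        // ... navigate a WebBrowser control to the target page here ...

        Console.ReadLine();            // keep capturing until Enter is pressed
        FiddlerApplication.Shutdown();
    }
}
```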
So I am just beginning to learn C#, and one of my main goals is to be able to 'navigate' a website. I have done minimal research and have found that the two primary ways to do this would be HttpClient and Requests, and I would like to learn this through HttpClient.
Now what I mean by navigate is to essentially bot a website for practice. This is like clicking buttons, putting text into fields, etc.
If anyone can give me an idea on where to start with this it would be much appreciated! Not looking for code specifically, just looking for what I should learn in HTTPClient to make this happen. Thanks!
I think you are a little confused about the concepts. HttpClient sends requests to a site, but you cannot click buttons or "navigate" inside the site with it.
If you're looking for a way to test a site, I recommend you learn about cypress.io. You can type text into textboxes, click buttons, or navigate on any site, all with a few lines of JavaScript. It's free.
Otherwise, if you need to save values to a database depending on your "navigation", you should research scraping tools. I recommend Selenium or any other scraping tool.
Usually HttpClient is used when you have to consume a REST API.
Basically, you have to think about how a program could 'see' a website. You cannot expect to tell HttpClient: 'Open the page www.google.com and search for something.' If you want to do this programmatically, you have to specify exactly what your program should do.
For your purpose I recommend the HTML Agility Pack. It can be used to get the navigation elements of an HTML document. This way you can parse the HTML delivered by a website in your program and do further work with it.
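A minimal sketch of that combination, assuming the Html Agility Pack NuGet package; the URL and form field names are made up for illustration. With HttpClient, "clicking a button" really means reproducing the HTTP request the button would trigger:

```csharp
// Sketch: fetch a page, parse it, then "fill in the form and click the button"
// by posting the same name/value pairs a browser would send. The URL and the
// field names (username/password) are hypothetical - inspect the real page
// with your browser's dev tools to find the actual ones.
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class FormBot
{
    static async Task Main()
    {
        using var http = new HttpClient();

        // 1. Fetch the page containing the form.
        var baseUri = new Uri("https://example.com/");
        var html = await http.GetStringAsync(new Uri(baseUri, "login"));

        // 2. Parse it and locate the form's action attribute.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var form = doc.DocumentNode.SelectSingleNode("//form");
        var action = form?.GetAttributeValue("action", "/login") ?? "/login";

        // 3. Submit the fields the way the browser's "click" would.
        var fields = new Dictionary<string, string>
        {
            ["username"] = "myuser",
            ["password"] = "mypassword"
        };
        var response = await http.PostAsync(new Uri(baseUri, action),
                                            new FormUrlEncodedContent(fields));
        Console.WriteLine($"Server answered {(int)response.StatusCode}");
    }
}
```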
Kind regards :)
I'm trying to scrape a particular webpage which works as follows.
First the page loads, then it runs some sort of javascript to fetch the data it needs to populate the page. I'm interested in that data.
If I GET the page with HtmlAgilityPack, the script doesn't run, so I get what is essentially a mostly blank page.
Is there a way to force it to run a script, so I can get the data?
You are getting what the server returns - the same as a web browser would. A web browser, of course, then runs the scripts. Html Agility Pack is an HTML parser only - it has no way to interpret the JavaScript or bind it to its internal representation of the document. If you wanted to run the script, you would need a web browser.
The perfect answer to your problem would be a complete "headless" web browser: something that incorporates an HTML parser, a JavaScript interpreter, and a model that simulates the browser DOM, all working together. Basically, that's a web browser, minus the rendering part. At this time there isn't such a thing that works entirely within the .NET environment.
Your best bet is to use a WebBrowser control and actually load and run the page in Internet Explorer under programmatic control. This won't be fast or pretty, but it will do what you need to do.
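A bare-bones sketch of that approach (assuming a Windows Forms project and an STA thread; the fixed three-second delay is a crude allowance for scripts that fetch data after the page loads, and the URL is a placeholder):

```csharp
// Sketch: let the WebBrowser control (Internet Explorer) load the page and run
// its scripts, then read the resulting DOM instead of the raw server response.
using System;
using System.Windows.Forms;

class RenderedPageDump
{
    [STAThread]
    static void Main()
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };

        browser.DocumentCompleted += (s, e) =>
        {
            // Give late-running scripts a moment, then grab the rendered markup.
            var timer = new Timer { Interval = 3000 };
            timer.Tick += (ts, te) =>
            {
                timer.Stop();
                Console.WriteLine(browser.Document.Body.InnerHtml);
                Application.ExitThread();
            };
            timer.Start();
        };

        browser.Navigate("https://example.com/page-with-scripts");  // placeholder URL
        Application.Run();   // message loop so the control can actually work
    }
}
```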
Also see my answer to a similar question: Load a DOM and Execute javascript, server side, with .Net which discusses the available technology in .NET to do this. Most of the pieces exist right now but just aren't quite there yet or haven't been integrated in the right way, unfortunately.
You can use Awesomium for this: http://www.awesomium.com/. It works fairly well, but it has no x64 support and is not thread-safe. I'm using it to scan some web sites 24x7; it runs fine for at least a couple of days in a row, but then it usually crashes.
There is a website, "www.localbanya.com", from which I want to grab the HTML. They list products, and the structure of their display is:
First they display around 8-10 products on page load, and
later, when the user scrolls down, it generates more products.
Now, since this happens via JavaScript, I am not able to get the whole page source using WebClient.
I wanted to know: is there any way I can update the page source while using the WebClient class in .NET to retrieve the whole page, or any other alternative I can use to get the whole page's HTML at once?
For reference, see the localbanya product page.
Any help will be appreciated.
WebClient obviously doesn't run the JavaScript,
so you are going to need some sort of headless browser to do it.
There are many options, though I don't know of a native C# or .NET implementation.
You may look into PhantomJS and other headless browsers, which replicate what a normal browser does and can be driven by scripts.
Also refer to this question
Headless browser for C# (.NET)?
You can also run something like Fiddler to see what requests were made from the page when scrolling down, to reverse engineer how the data is retrieved, and replicate that with a WebClient if possible.
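To illustrate the replay idea, a sketch along these lines - the endpoint and parameter names below are placeholders, to be replaced with whatever Fiddler actually shows when the page loads more products:

```csharp
// Sketch: once Fiddler reveals the request the page makes while scrolling,
// call that endpoint directly with WebClient. The URL, "page" and "pageSize"
// parameters are hypothetical - substitute the ones you observe.
using System;
using System.Collections.Specialized;
using System.Net;
using System.Text;

class ScrollReplay
{
    static void Main()
    {
        using var client = new WebClient();
        client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0";

        for (int page = 1; page <= 3; page++)
        {
            var form = new NameValueCollection
            {
                ["page"] = page.ToString(),
                ["pageSize"] = "10"
            };
            // Placeholder endpoint; use the real URL seen in Fiddler.
            byte[] reply = client.UploadValues("http://example.com/products/load", form);
            Console.WriteLine(Encoding.UTF8.GetString(reply));
        }
    }
}
```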
Hope this helps.
I have been given a task to crawl/parse and index the available books on many library web pages. I usually use the HTML Agility Pack and C# to parse web site content. One of the pages is the following:
http://bibliotek.kristianstad.se/pls/bookit/pkg_www_misc.print_index?in_language_id=en_GB
If you search for a * (all books) it will return many lists of books, paginated by 10 books per page.
Typical web crawlers that I have found fail on this website. I have also tried to write my own crawler, which would go through all links on the page and generate POST/GET variables to dynamically generate results. I haven't been able to do this either, mostly due to some 404 errors I get (although I am certain that the generated links are correct).
The site relies on JavaScript to generate content, and uses a mixed mode of GET and POST variable submission.
I'm going out on a limb, but try observing the JavaScript GETs and POSTs with Fiddler and then you can base your crawling off of those requests. Fiddler has FiddlerCore, which you can put in your own C# project. Using this, you could monitor requests made in the WebBrowser control and then save them for crawling or whatever, later.
Going down the C# JavaScript interpreter route sounds like the 'more correct' way of doing this, but I wager it will be much harder and fraught with errors and bugs unless you have the simplest of cases.
Good luck.
FWIW, the C# WebBrowser control is very, very slow. It also doesn't support more than two simultaneous requests.
Using SHDocVw is faster, but is also semaphore limited.
Faster still is using MSHTML. Working code here: https://svn.arachnode.net/svn/arachnodenet/trunk/Renderer/HtmlRenderer.cs Username/Password: Public (doesn't have the request/rendering limitations that the other two have when run out of process...)
This is headless, so none of the controls are rendered. (Faster).
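For what it's worth, a very rough sketch of the MSHTML route - it needs a COM reference to the "Microsoft HTML Object Library" and an STA thread, and on its own it only parses the markup you feed it (whether scripts run depends on how the document is hosted), so treat it purely as an outline:

```csharp
// Outline: load raw HTML into an in-memory MSHTML document (no visible control)
// and walk its DOM. Interop details can be fiddly and vary with how the
// mshtml reference is added to the project.
using System;
using mshtml;

class MshtmlParse
{
    [STAThread]
    static void Main()
    {
        string html = "<html><body><a href='http://example.com/'>example</a></body></html>";

        var doc = (IHTMLDocument2)new HTMLDocument();
        doc.write(html);
        doc.close();

        foreach (IHTMLAnchorElement link in doc.links)
            Console.WriteLine(link.href);
    }
}
```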
Thanks,
Mike
If you use the WebBrowser control in a Windows Forms application to open the page then you should be able to access the DOM through the HtmlDocument. That would work for the HTML links.
As for the links that are generated through JavaScript, you might look at the ObjectForScripting property, which should allow you to interface with the HTML page through JavaScript. The rest then becomes a JavaScript problem, but it should (in theory) be solvable. I haven't tried this, so I can't say.
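A sketch of both ideas (assuming a Windows Forms project; the ScriptBridge class, its Report method and the injected snippet are made up for illustration - page JavaScript reaches the ObjectForScripting instance through window.external):

```csharp
// Sketch: read plain HTML links through the WebBrowser's DOM wrapper, and expose
// a [ComVisible] object so the page's JavaScript can hand data back to C#.
using System;
using System.Runtime.InteropServices;
using System.Windows.Forms;

[ComVisible(true)]
public class ScriptBridge
{
    // Page script can call window.external.Report("...").
    public void Report(string message) => Console.WriteLine("From page script: " + message);
}

class DomWalker
{
    [STAThread]
    static void Main()
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };
        browser.ObjectForScripting = new ScriptBridge();

        browser.DocumentCompleted += (s, e) =>
        {
            // Static HTML links come straight from the DOM wrapper.
            foreach (HtmlElement a in browser.Document.Links)
                Console.WriteLine(a.GetAttribute("href"));

            // Script-generated content has to be handed over by the page itself,
            // here by injecting a snippet that calls back through window.external.
            browser.Document.InvokeScript("eval",
                new object[] { "window.external.Report(document.links.length + ' links in the live DOM');" });

            Application.ExitThread();
        };

        browser.Navigate("http://example.com/");   // placeholder URL
        Application.Run();
    }
}
```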
If the site generates content with JavaScript, then you are out of luck. You need a full JavaScript engine usable in C# so that you can actually execute the scripts and capture the output they generate.
Take a look at this question: Embedding JavaScript engine into .NET -- but know that it will take "serious" effort to do what you need.
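As one concrete illustration of that idea (the engine is my example, not something named in the linked question): a pure-.NET JavaScript engine such as Jint can execute script text you extract from a page, although it has no DOM or XMLHttpRequest of its own, so those pieces would still have to be simulated:

```csharp
// Illustration: run a snippet of extracted JavaScript inside the Jint engine
// and read the result back into C#. The snippet is invented for the example.
using System;
using Jint;

class RunExtractedScript
{
    static void Main()
    {
        var engine = new Engine();

        // Pretend this function was pulled out of the target page's <script> block.
        engine.Execute("function buildLink(id) { return '/book/' + id; }");

        var link = engine.Invoke("buildLink", 42).AsString();
        Console.WriteLine(link);   // prints "/book/42"
    }
}
```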
AbotX does JavaScript rendering for you. It's not free, though.
I have a web-based application. The content for the Home page is currently written directly in the HTML markup of the Home page. To change the content at any time in the future, it has to be changed in the HTML code. :(
Is there a way we can pick up the content from some external place and have it reflected on the website? That way, any required change can be made at the external location without touching the application's code.
Please advise if there is any solution for it.
Thanks.
You can:
- use a database,
- include external files using Server Side Includes, or
- read external files and write out their contents (there are alternative methods along the same lines; a minimal sketch of this option follows below).
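A minimal sketch of the "read external files" option, assuming an ASP.NET Web Forms page; the Literal control name and the file path are hypothetical:

```csharp
// Sketch: pull the Home page content from an external HTML snippet file, so
// editors can change home-content.html without touching the application code.
using System;
using System.IO;
using System.Web.UI;
using System.Web.UI.WebControls;

public partial class Home : Page
{
    // Bound to <asp:Literal ID="HomeContent" runat="server" /> in the .aspx markup.
    protected Literal HomeContent;

    protected void Page_Load(object sender, EventArgs e)
    {
        string path = Server.MapPath("~/content/home-content.html");   // hypothetical path
        HomeContent.Text = File.Exists(path)
            ? File.ReadAllText(path)
            : "<p>Welcome!</p>";   // fallback if the file is missing
    }
}
```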
Sounds like you're looking for a Content Management System (CMS), which will allow your content editors access to modify only specific blocks of a page that you specify.
There are a ton out there to do what you want, so you don't have to start from scratch. Just Google 'CMS'.
Although I haven't used it myself, DotNetNuke is a popular one these days and has a free version.