I'm trying to scrape a particular webpage which works as follows.
First the page loads, then it runs some sort of javascript to fetch the data it needs to populate the page. I'm interested in that data.
If I GET the page with Html Agility Pack, the script doesn't run, so I get what is essentially a mostly blank page.
Is there a way to force it to run a script, so I can get the data?
You are getting what the server returns - the same as a web browser would. A web browser, of course, then runs the scripts. Html Agility Pack is an HTML parser only - it has no way to interpret the JavaScript or bind it to its internal representation of the document. If you want to run the script, you need a web browser. The perfect answer to your problem would be a complete "headless" web browser: something that incorporates an HTML parser, a JavaScript interpreter, and a model that simulates the browser DOM, all working together. Basically, that's a web browser without the rendering part. At this time there isn't such a thing that works entirely within the .NET environment.
Your best bet is to use a WebBrowser control and actually load and run the page in Internet Explorer under programmatic control. This won't be fast or pretty, but it will do what you need to do.
Also see my answer to a similar question: Load a DOM and Execute javascript, server side, with .Net which discusses the available technology in .NET to do this. Most of the pieces exist right now but just aren't quite there yet or haven't been integrated in the right way, unfortunately.
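A minimal sketch of the WebBrowser-control approach described above (the URL is a placeholder; note that DocumentCompleted can fire before the page's asynchronous fetches finish, so in practice you may still need to poll for the data you want):

    using System;
    using System.Windows.Forms;

    class Scraper
    {
        [STAThread] // WebBrowser needs an STA thread and a message loop
        static void Main()
        {
            var browser = new WebBrowser { ScriptErrorsSuppressed = true };
            browser.DocumentCompleted += (s, e) =>
            {
                // The page's scripts have run inside the control by now,
                // so the Document reflects the populated DOM.
                Console.WriteLine(browser.Document.Body.InnerHtml);
                Application.ExitThread();
            };
            browser.Navigate("http://example.com/"); // placeholder URL
            Application.Run();                       // pump messages until done
        }
    }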
You can use Awesomium for this: http://www.awesomium.com/. It works fairly well but has no x64 support and is not thread safe. I'm using it to scan some web sites 24x7; it runs fine for a couple of days in a row, but then it usually crashes.
Related
Is there an example of adding plugins to the Google Chrome or IE browsers that support C# or .NET? I want to run js / json / html popup files.
Microsoft just released Blazor, which allows browsers to run Razor / C# code on the client side. You could research and experiment with that.
However, you don't need that to run js / json / html popups. Use something like ASP.NET MVC to run any C# code on the server. The server generates your HTML and JavaScript, and browsers know how to work with all of that without any plugins.
This site works better when you have specific questions and can show your code and what you have tried. For a question like this, Google will work better for you! Good luck.
I have been using Selenium along side C# in Visual Studio 2013. I will make a call to:
driver.Navigate().GoToUrl("http://<insert webpage>");
...which will open whichever WebDriver I choose to use.
From here, I will make calls to links/text boxes/menus as I need to.
However, I was wondering if there is a way to get the information from webpages without having to actually open a browser; if so, could someone explain or point me in the right direction? It would save time and speed up a lot of my programs. I know applications can get information remotely without actually opening a browser; I just do not know how the process works, or whether Selenium alone provides that ability.
I apologize if this is the wrong place to ask this question.
It is not clear whether you need to interact with the web page (like clicking links or editing text), but here are two options:
You can use PhantomJS. It is a headless browser, and since there is no UI, execution may be faster. There is a Selenium driver for it.
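A short sketch, assuming the Selenium.WebDriver and PhantomJS NuGet packages are installed and phantomjs.exe is on the path (the URL is a placeholder):

    using System;
    using OpenQA.Selenium;
    using OpenQA.Selenium.PhantomJS;

    class HeadlessFetch
    {
        static void Main()
        {
            // No visible browser window; PhantomJS executes the page's JavaScript.
            using (IWebDriver driver = new PhantomJSDriver())
            {
                driver.Navigate().GoToUrl("http://example.com/"); // placeholder URL
                // PageSource reflects the DOM after the scripts have run.
                Console.WriteLine(driver.PageSource);
            }
        }
    }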
You can use Html Agility Pack to parse the page and WebClient to download it; no Selenium is required in that case. Html Agility Pack lets you make XPath queries and find elements by class name or ID. But: you won't be able to manipulate the DOM as you can with a real browser. It only parses and navigates a static HTML page.
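A sketch of that second option (the URL and XPath are placeholders):

    using System;
    using System.Net;
    using HtmlAgilityPack;

    class StaticParser
    {
        static void Main()
        {
            var client = new WebClient();
            string html = client.DownloadString("http://example.com/"); // placeholder URL

            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // XPath over the static HTML - the page's scripts have NOT run here.
            var titles = doc.DocumentNode.SelectNodes("//div[@class='title']"); // placeholder XPath
            if (titles != null)
                foreach (var node in titles)
                    Console.WriteLine(node.InnerText.Trim());
        }
    }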
I'm trying to parse a website. The only problem is that the site doesn't use a specific URL for the page I want to parse. The content is displayed on the same page using JavaScript, so it differs depending on the search query.
Is it possible to choose a value from a dropdown-menu and then post that to the server and then parse the HTML-code in C#?
Clarification: the content is returned as HTML.
I know the name of the option from the dropdown I want to post, but how do I do that from code-behind?
Most sites do not really generate HTML in JavaScript. Much more often you see ASP.NET sites where JavaScript is used for a postback (and the name of the dropdown is posted back in the __EVENTTARGET field).
Then you can do the same in your application - you have to imitate filling in the form and pass all the fields to the server, including __VIEWSTATE and __EVENTTARGET.
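A sketch of imitating such a postback (all field names and values are placeholders - the real ones come from inspecting the form returned by the initial GET):

    using System;
    using System.Collections.Specialized;
    using System.Net;
    using System.Text;

    class PostbackImitator
    {
        static void Main()
        {
            // These two values must be scraped from the hidden inputs of the initial GET.
            string viewState = "<__VIEWSTATE scraped from the initial GET>";
            string eventValidation = "<__EVENTVALIDATION scraped from the initial GET>";

            var fields = new NameValueCollection
            {
                { "__EVENTTARGET",     "ddlSearch" },             // hypothetical dropdown name
                { "__EVENTARGUMENT",   "" },
                { "__VIEWSTATE",       viewState },
                { "__EVENTVALIDATION", eventValidation },
                { "ddlSearch",         "selected option value" }  // the option you chose
            };

            using (var client = new WebClient())
            {
                byte[] response = client.UploadValues("http://example.com/page.aspx", fields); // placeholder URL
                string html = Encoding.UTF8.GetString(response);
                Console.WriteLine(html.Length); // parse this HTML next
            }
        }
    }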
Having said that, it might be against the site's terms of use.
You definitely need to check out Selenium; it does exactly what you need. It is commonly used as a testing framework, but you can also use it to manipulate HTML tags even when the website uses JavaScript.
Note: Selenium lets you open and manipulate a website using a browser such as Firefox, Chrome, IE, etc. It drives a real browser through WebDriver, so if you want to avoid a visible window you need a headless driver such as PhantomJS. Most of my experience using Selenium is with Java, but I found multiple tutorials online for .NET too.
I have been given a task to crawl, parse, and index the available books on a number of library web pages. I usually use Html Agility Pack and C# to parse website content. One of them is the following:
http://bibliotek.kristianstad.se/pls/bookit/pkg_www_misc.print_index?in_language_id=en_GB
If you search for * (all books), it returns many lists of books, paginated at 10 books per page.
Typical web crawlers that I have found fail on this website. I have also tried to write my own crawler, which would go through all the links on the page and generate the POST/GET variables needed to produce results dynamically. I haven't been able to make this work either, mostly due to 404 errors (although I am certain the generated links are correct).
The site relies on JavaScript to generate content and uses a mixed mode of GET and POST variable submission.
I'm going out on a limb, but try observing the JavaScript GETs and POSTs with Fiddler; you can then base your crawling on those requests. Fiddler has FiddlerCore, which you can embed in your own C# project. Using it, you could monitor the requests made in the WebBrowser control and then save them for crawling later.
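Something like this minimal sketch, based on the classic FiddlerCore API (the port is arbitrary), would let you log the requests the page makes, including the XHR GETs/POSTs issued by its JavaScript:

    using System;
    using Fiddler; // FiddlerCore package

    class ProxyLogger
    {
        static void Main()
        {
            // Log every request that goes through the proxy.
            FiddlerApplication.BeforeRequest += session =>
            {
                Console.WriteLine("{0} {1}", session.RequestMethod, session.fullUrl);
            };

            FiddlerApplication.Startup(8877, FiddlerCoreStartupFlags.Default); // arbitrary port
            Console.WriteLine("Capturing; press Enter to stop.");
            Console.ReadLine();
            FiddlerApplication.Shutdown();
        }
    }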
Going down the C# JavaScript-interpreter route sounds like the 'more correct' way of doing this, but I wager it will be much harder and fraught with errors and bugs unless you have the simplest of cases.
Good luck.
FWIW, the C# WebBrowser control is very, very slow. It also doesn't support more than two simultaneous requests.
Using SHDocVw is faster, but is also semaphore limited.
Faster still is using MSHTML. Working code is here: https://svn.arachnode.net/svn/arachnodenet/trunk/Renderer/HtmlRenderer.cs (username/password: Public). It doesn't have the request/rendering limitations that the other two have when run out of process.
This approach is headless, so none of the controls are rendered, which makes it faster.
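For reference, a rough sketch of the SHDocVw route (requires COM references to "Microsoft Internet Controls" and "Microsoft HTML Object Library"; the URL is a placeholder and the busy-wait is crude):

    using System;
    using System.Threading;
    using mshtml;   // COM reference: Microsoft HTML Object Library
    using SHDocVw;  // COM reference: Microsoft Internet Controls

    class IeScraper
    {
        [STAThread]
        static void Main()
        {
            var ie = new InternetExplorer { Visible = false };
            ie.Navigate("http://example.com/"); // placeholder URL
            while (ie.Busy || ie.ReadyState != tagREADYSTATE.READYSTATE_COMPLETE)
                Thread.Sleep(100);              // crude wait for the page and its scripts

            var doc = (HTMLDocument)ie.Document;
            Console.WriteLine(doc.body.innerHTML); // the post-script DOM
            ie.Quit();
        }
    }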
Thanks,
Mike
If you use the WebBrowser control in a Windows Forms application to open the page, you should be able to access the DOM through its HtmlDocument. That would work for the plain HTML links.
As for the links that are generated through JavaScript, you might look at the ObjectForScripting property, which should allow you to interface with the HTML page through JavaScript. The rest then becomes a JavaScript problem, but it should (in theory) be solvable. I haven't tried this, so I can't say.
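A sketch of both ideas in one place (the bridge class and the page function name are hypothetical; the commented lines show how it would be wired up in a form):

    using System;
    using System.Runtime.InteropServices;
    using System.Windows.Forms;

    [ComVisible(true)] // required so the page's JavaScript can see this object
    public class ScriptBridge
    {
        // Page script can call window.external.Report("...") to hand data back to C#.
        public void Report(string data)
        {
            Console.WriteLine(data);
        }
    }

    // Inside the form, roughly:
    //   webBrowser1.ObjectForScripting = new ScriptBridge();
    //   webBrowser1.DocumentCompleted += (s, e) =>
    //   {
    //       foreach (HtmlElement link in webBrowser1.Document.Links)
    //           Console.WriteLine(link.GetAttribute("href"));    // plain HTML links
    //       webBrowser1.Document.InvokeScript("buildLinks");     // hypothetical page function
    //   };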
If the site generates content with JavaScript, then you are out of luck. You need a full JavaScript engine usable in C# so that you can actually execute the scripts and capture the output they generate.
Take a look at this question: Embedding JavaScript engine into .NET -- but know that it will take "serious" effort to do what you need.
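To make that concrete, here is a tiny sketch using Jint, an open-source JavaScript interpreter for .NET (my choice of engine, not one the answer above prescribes). Note that executing real page scripts would also require simulating the DOM they expect, which is where the "serious" effort comes in:

    using System;
    using Jint;

    class ScriptRunner
    {
        static void Main()
        {
            // Run a standalone script and pull a value back into C#.
            var engine = new Engine();
            engine.Execute("var total = 0; for (var i = 1; i <= 10; i++) total += i;");
            Console.WriteLine(engine.GetValue("total").AsNumber()); // 55
        }
    }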
AbotX does JavaScript rendering for you. It's not free, though.
I need to display HTML in my Silverlight application and cannot find a way of doing it. I cannot use the WebBrowser control, as the application needs to be able to run either in or out of the browser.
Does anyone know of a good way to do this? All I can think of at the moment is running replace methods on the text to turn the tags into C# equivalents, e.g. <br /> into \n.
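(For what it's worth, a minimal sketch of that naive fallback; it assumes only a handful of simple tags matter, since regex is not a real HTML parser:)

    using System.Text.RegularExpressions;

    static class HtmlText
    {
        // Naive tag translation - fine for trivial markup, but this breaks
        // on anything complicated; consider HtmlDecode for entities too.
        public static string ToPlainText(string html)
        {
            string text = Regex.Replace(html, @"<br\s*/?>", "\n", RegexOptions.IgnoreCase);
            text = Regex.Replace(text, @"</p>", "\n\n", RegexOptions.IgnoreCase);
            return Regex.Replace(text, @"<[^>]+>", string.Empty); // strip remaining tags
        }
    }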
The way I do it is to check if the application is running inside the browser and change the means of display accordingly. If running inside the browser, I overlay the application with an IFrame, as I describe in this article: http://www.silverlightshow.net/items/Building-a-Silverlight-Line-Of-Business-Application-Part-6.aspx. Otherwise, I use the WebBrowser control. I have a control which does this all for you in the source code that accompanies my book, which is downloadable from the Apress website here: http://www.apress.com/book/downloadfile/4638.
Hope this helps...
Chris
I believe what you are looking for is HTML Bridge.
Edit: I'm actually now unsure whether you'll still have access to JavaScript if you're running this out of browser (OOB). I'm going to look into this some more and will update further. I'll still leave the answer up for reference, though.
Second edit: here is what I've found. HTML Bridge is disabled when you run Silverlight out of browser, which removes access to the HTML DOM as well as JavaScript. However, according to a comment on this site:
HTML Bridge is not available when you first install an OOB app. But you CAN force it if you modify the index.html in the folder where the app is installed by just adding the enablehtmlaccess parameter.
It works!
You can even create dynamic HTML elements using the well-known methods of the HtmlPage class. You can even open a new browser window with the Navigate() method and its "_blank" parameter.
Keep in mind this information was posted about Silverlight 3. It's possible that this has changed, but I doubt it. So it seems that what you may want to do is build a check into the startup of your SL app that detects whether or not it is running out of browser. If it is, you may want some script that can modify this file for you.
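A sketch of the detection side, using standard Silverlight APIs (the fallback logic is a placeholder; the actual file patching would have to happen outside the sandbox, e.g. via your installer):

    using System.Windows;
    using System.Windows.Browser;

    // e.g. in Application_Startup:
    if (Application.Current.IsRunningOutOfBrowser)
    {
        // HtmlPage.IsEnabled reports whether the HTML Bridge is available.
        // OOB it stays false unless the installed index.html has been patched
        // with the enablehtmlaccess parameter described above.
        if (!HtmlPage.IsEnabled)
        {
            // fall back to a non-HTML-Bridge display strategy here
        }
    }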
There recently was a similar question.
I posted a link there to an implementation that parses and displays HTML inline in Silverlight. Of course, it will work only with simple HTML, but maybe you can expand it to your needs.