Any way to tell a WebBrowser in C# to show the pages in HTML only? I'm trying to make a web scraper and I don't need pictures that make the process way slower than necessary.
Why are you using a WebBrowser control for page scraping? If you just want the core HTML of a page, just do a WebRequest and read the response.
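A minimal sketch of that approach (the URL is a placeholder): only the HTML document itself is ever requested, so no time is spent downloading images, scripts, or stylesheets.

using System;
using System.IO;
using System.Net;

class Program
{
    static void Main()
    {
        // Fetch just the raw HTML document - no images, CSS, or scripts.
        var request = (HttpWebRequest)WebRequest.Create("http://example.com"); // placeholder URL
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            string html = reader.ReadToEnd();
            Console.WriteLine(html);
        }
    }
}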
You're going to have to roll your own, basically.
One way would be to build your application in WPF and use an HTML-to-XAML conversion process, and just leave the <img> tag out of the conversion.
I'm looking for a simple way to get a string from a URL that contains all the text actually displayed to the user.
That is, anything loaded with a delay (using JavaScript) should be included. Ideally, the result should also be free of HTML tags and the like.
A straightforward approach with WebClient.DownloadString() followed by regexing the HTML is pretty much pointless, because most content in modern web apps is not contained in the initial HTML document.
You can most likely use Selenium WebDriver to fully load the page and then dump the full DOM.
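A minimal sketch with Selenium's C# bindings (assuming the Selenium.WebDriver and ChromeDriver packages; the URL and the fixed wait are placeholders):

using System;
using System.Threading;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class Program
{
    static void Main()
    {
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl("http://example.com"); // placeholder URL
            Thread.Sleep(5000); // crude wait for delayed JavaScript; a WebDriverWait would be more robust

            // The visible text of the fully rendered page, free of HTML tags:
            string visibleText = driver.FindElement(By.TagName("body")).Text;
            // Or the full DOM as HTML:
            string dom = driver.PageSource;

            Console.WriteLine(visibleText);
        }
    }
}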
I'm trying to scrape a particular webpage which works as follows.
First the page loads, then it runs some JavaScript to fetch the data it needs to populate the page. I'm interested in that data.
If I GET the page with HtmlAgilityPack, the script doesn't run, so I get what is essentially a mostly blank page.
Is there a way to force it to run a script, so I can get the data?
You are getting what the server is returning - the same as a web browser. A web browser, of course, then runs the scripts. Html Agility Pack is an HTML parser only - it has no way to interpret the JavaScript or bind it to its internal representation of the document. If you wanted to run the script, you would need a web browser.

The perfect answer to your problem would be a complete "headless" web browser: something that incorporates an HTML parser, a JavaScript interpreter, and a model that simulates the browser DOM, all working together. Basically, that's a web browser, except without the rendering part. At this time there isn't such a thing that works entirely within the .NET environment.
Your best bet is to use a WebBrowser control and actually load and run the page in Internet Explorer under programmatic control. This won't be fast or pretty, but it will do what you need to do.
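A minimal sketch of that approach (the URL is a placeholder; the control needs an STA thread and a running message loop, and DocumentCompleted can fire more than once for pages with frames):

using System;
using System.Windows.Forms;

class Scraper
{
    [STAThread]
    static void Main()
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };
        browser.DocumentCompleted += (s, e) =>
        {
            // For simple pages, the scripts have had a chance to run by now.
            string renderedHtml = browser.Document.Body.InnerHtml;
            Console.WriteLine(renderedHtml);
            Application.ExitThread();
        };
        browser.Navigate("http://example.com"); // placeholder URL
        Application.Run(); // pump the message loop so the control can load the page
    }
}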
Also see my answer to a similar question: Load a DOM and Execute javascript, server side, with .Net which discusses the available technology in .NET to do this. Most of the pieces exist right now but just aren't quite there yet or haven't been integrated in the right way, unfortunately.
You can use Awesomium for this, http://www.awesomium.com/. It works fairly well, but it has no x64 support and is not thread-safe. I'm using it to scan some websites 24x7; it runs fine for a couple of days in a row, but then it usually crashes.
Currently I am working on a project which involves using the WebBrowser control in C#. After many struggles I successfully integrated WebKit into a WinForms application and ran a website with CKEditor in it, but that left me with two issues.
1. The image uploader works fine, but it doesn't send a callback, or WebKit cannot process it. Is there any way to make it work?
2. When I try to scrape the HTML document to get the iframe with webKitBrowser1.Document.GetElementById("cke_1_contents").LastChild, I get the iframe element, but I have no idea how to get its content, because it reports that it doesn't have any children.
Can anyone suggest what to do next, or offer an alternative?
I use VS2008 and .NET 3.5.
I can't answer this question in the context of the WebKit-based control, but I suggest that you try the native WinForms WebBrowser control. It works great as the host for CKEditor, once the WebBrowser Feature Control has been implemented.
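The Feature Control piece boils down to one per-application registry value; a minimal sketch (assuming IE11 is installed - the value 11001 requests IE11 edge document mode, and writing under HKEY_CURRENT_USER avoids needing administrator rights):

using Microsoft.Win32;

// Opt this executable's embedded WebBrowser control into IE11 document mode.
string exeName = System.Diagnostics.Process.GetCurrentProcess().ProcessName + ".exe";
Registry.SetValue(
    @"HKEY_CURRENT_USER\SOFTWARE\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION",
    exeName,
    11001,
    RegistryValueKind.DWord);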
Then, if I was to do web-scraping on a page with CKEditor, I'd try something like this to get the current editor content (from C#):
// Get the raw MSHTML document via late binding (note: 'dynamic' requires C# 4 / .NET 4)
dynamic pageDocument = webBrowser.Document.DomDocument;
// CKEditor hosts its editable area in an iframe with class "cke_wysiwyg_frame"
var ckeDocument = pageDocument.getElementsByClassName("cke_wysiwyg_frame").item(0).contentDocument;
// Dump the current editor content
MessageBox.Show((string)ckeDocument.documentElement.outerHTML);
I want to add an HTML control in C# which will selectively display all the text from an HTML page, together with the title given in my HTML page.
Don't forget that when you are programming in ASP.NET, you are really programming in HTML. ASP.NET controls have their effect by generating HTML, which is then sent to the browser.
This changes your question. Your question is really, "how can I use HTML to display the contents of another web site, and how can I make ASP.NET generate the HTML that I need".
You can display the contents of another site by using an iframe:
<iframe id="myOtherSite" src="other site url"></iframe>
You can simply place that on your ASP.NET page. However, it doesn't solve your problem with the title. I expect you can do that with some JavaScript, as your main window can access the DOM of the iframe to pick up the title and put it where you want it.
You could always use a string reader to effectively 'scrape' the page content from the 3rd party site. Then use a simple regular expression check to grab the page title. You could then do with it as you want.
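A minimal sketch of that idea (the URL is a placeholder; a regex is adequate for just the <title> element, though a real HTML parser is safer for anything more):

using System;
using System.Net;
using System.Text.RegularExpressions;

class TitleScraper
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // 'Scrape' the page content from the third-party site.
            string html = client.DownloadString("http://example.com"); // placeholder URL

            // Grab the page title with a simple regular expression check.
            Match m = Regex.Match(html, @"<title[^>]*>\s*(.*?)\s*</title>",
                                  RegexOptions.IgnoreCase | RegexOptions.Singleline);
            Console.WriteLine(m.Success ? m.Groups[1].Value : "(no title found)");
        }
    }
}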
I've been struggling to find an example of some C# code (I'm using C# Visual Studio 2008 Express) that can programmatically save an entire web page (given a URL) including the images and formatting (e.g. CSS). The intention is that in a subsequent phase I'd ship this off (not sure how yet) so it could be viewed later via a browser.
Is there an example of the most simple approach (leveraging the .NET Framework methods) to save an entire web page? Saving as one page with a subdirectory for images, or otherwise. Basically the same as what you get with browsers when you say "save entire web page".
The simplest way is probably to add a WebBrowser Control to your application and point it at the page you want to save using the Navigate() method.
Then, when the document has loaded, call the ShowSaveAsDialog method. The user can then save the page as a single file, or a file with images in a subdirectory.
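In code, that looks roughly like this (the URL is a placeholder):

var browser = new System.Windows.Forms.WebBrowser();
browser.DocumentCompleted += (s, e) =>
{
    // The page (including images and CSS) has loaded; let the user pick a location.
    browser.ShowSaveAsDialog();
};
browser.Navigate("http://example.com"); // placeholder URL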
[Update]
Having now noticed "programmatically" in your question, the above approach is not ideal, as it requires either user involvement or delving into the Windows API to send input using SendKeys or similar.
There is nothing built-in to the .NET Framework that does all of what you ask.
So my approach revised would be:
Use System.Net.HttpWebRequest to get the main HTML document as a string or stream (easy).
Load this into an HtmlAgilityPack document, where you can now easily query the document to get lists of all image elements, stylesheet links, etc.
Then make a separate web request for each of these files and save them to a subdirectory.
Finally, update all relevant links in the main page to point to the items in the subdirectory.
In effect you would be implementing a very simple web browser. You may run into issues with pages that use JavaScript to dynamically alter or request page content, but for most pages this should give acceptable results.
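A rough sketch of the steps above, handling images only (stylesheets would follow the same pattern); it assumes the HtmlAgilityPack package, and the URL and folder names are placeholders:

using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class PageSaver
{
    static void Main()
    {
        string url = "http://example.com/page.html"; // placeholder URL
        string outDir = "saved_page";                // placeholder output folder
        Directory.CreateDirectory(Path.Combine(outDir, "files"));

        // Step 1: fetch the main HTML document as a string.
        var request = (HttpWebRequest)WebRequest.Create(url);
        string html;
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            html = reader.ReadToEnd();
        }

        // Step 2: load it into an HtmlAgilityPack document.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Steps 3-4: download each image and repoint its src to the local copy.
        var images = doc.DocumentNode.SelectNodes("//img[@src]");
        if (images != null)
        {
            using (var client = new WebClient())
            {
                int i = 0;
                foreach (HtmlNode img in images)
                {
                    var src = new Uri(new Uri(url), img.GetAttributeValue("src", ""));
                    string local = Path.Combine("files", "img" + (i++) + Path.GetExtension(src.LocalPath));
                    client.DownloadFile(src, Path.Combine(outDir, local));
                    img.SetAttributeValue("src", local.Replace('\\', '/'));
                }
            }
        }

        // Save the rewritten main document alongside the subdirectory.
        doc.Save(Path.Combine(outDir, "index.html"));
    }
}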
From Code Project: ZetaWebSpider.
It's definitely not elegant, but you could navigate a System.Windows.Forms.WebBrowser to the URL and then call its ShowSaveAsDialog() method to save the page.