I'm using C#, and I've been struggling for a few days to grab the final rendered HTML from a URL.
I've tried several browser engines (Awesomium, WebBrowser and so on), but none of them returns the actual rendered HTML of the page, the way it appears when I right-click in Chrome and choose "Inspect element".
What I do is roughly the following (using the WebBrowser WinForms control):
public static string GetDomSource(WebBrowser wb)
{
    var dd = wb.Document.DomDocument as IHTMLDocument2;
    return dd.body.parentElement.outerHTML;
}
(Though I don't know whether you already tried this or whether you are using WinForms at all).
To use the IHTMLDocument2 interface, I've added a reference to the "Microsoft.mshtml" assembly.
I am trying to get the HTML from a page after a portion of JavaScript executes and updates the HTML. (I know that the JavaScript continues to run while the page is open, so there is no way to get the code "after" it's finished.) I'm trying to get the HTML from this page, XBowling.com; you can see that there is a splash message before the lanes load. I need to get the HTML after the lanes load so I can then look through the data to get to the lane, and then look through the lane's page data to get scores and whatnot.
I have been messing around with headless browsers. I'm currently playing around with Awesomium with little success: I can't get it to give me the updated version of the HTML, just the original from when the page first loads.
(I don't have any code because I don't have anything to show other than failed attempts to get the damn thing to work.)
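Install Selenium.Webdriver.Domify, Selenium.WebDriverBackedSelenium and Selenium.WebDriver.ChromeDriver using NuGet and write something like:
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using (var driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl(url);
    // Divs(...) is not part of core Selenium; it is assumed to come from the Domify package installed above
    var columns = driver.Divs(By.ClassName("col-md-6"));
    // here you access the elements using the driver object
}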
I am trying to extract some information from a website. But when I navigate to it, it uses JavaScript to connect me to a server before dynamically loading a PHP page. I can follow the sequence in Chrome with the developer tools. I figured it would be easiest to reproduce it in C# with the WebBrowser control and simply navigate to the website. Then the WebBrowser control must contain all the JavaScript files, the text from the dynamically loaded PHP page and so on. But is this true, and where in the control are they stored? I can't seem to find them.
Recreating the whole sequence you observed in Chrome would be a lot of work. However, "extract some information from a website" is something that can be done quite easily.
Disclaimer: I assumed this question was about the WPF WebBrowser control (it would be almost the same for WinForms).
You can get the HTMLDocument once the page is loaded, using:
using System.Windows;
using System.Windows.Navigation;
using mshtml; // <- don't forget to add the reference
public partial class MainWindow : Window
{
    public MainWindow()
    {
        InitializeComponent();
        // subscribe before navigating so the event cannot be missed
        browser.LoadCompleted += browser_LoadCompleted;
        browser.Navigate("http://google.com/");
    }
    void browser_LoadCompleted(object sender, NavigationEventArgs e)
    {
        HTMLDocument doc = (HTMLDocument)browser.Document;
        string html = doc.documentElement.innerHTML; // innerHTML is already a string
        // from here, you should be able to parse the HTML
        // or sniff the HTMLDocument (using HTML Agility Pack for instance)
    }
}
From this HTMLDocument, you have access to a lot of properties, including HTML elements, CSS styles and scripts. I invite you to set a breakpoint and check out what best fits your needs.
Nevertheless, since the page you want to load uses JavaScript to fill its content, the HTMLDocument will probably not be complete at the time LoadCompleted is raised.
In that case, I suggest using a timer to poll until the content is stable.
You could also use HTMLDocument to inject your own JavaScript code and call C# methods through WebBrowser.ObjectForScripting, but this is going to be much more complicated and harder to maintain.
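The polling idea mentioned above could look roughly like the following. This is only a sketch under assumptions: the class name DomPolling, the method name WaitForStableHtmlAsync and all of its parameters are made up for illustration, and "stable" is taken to mean that several consecutive snapshots are identical, which is a heuristic rather than a guarantee that the page has finished updating.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper, not part of WPF or mshtml.
public static class DomPolling
{
    // Calls getHtml repeatedly, waiting `interval` between calls, until it has
    // returned the same string `requiredStableReads` times in a row or until
    // `timeout` has elapsed. Returns the last snapshot in either case.
    public static async Task<string> WaitForStableHtmlAsync(
        Func<string> getHtml,
        TimeSpan interval,
        int requiredStableReads,
        TimeSpan timeout)
    {
        DateTime deadline = DateTime.UtcNow + timeout;
        string last = getHtml();
        int stableReads = 1; // the first snapshot counts as one read

        while (stableReads < requiredStableReads && DateTime.UtcNow < deadline)
        {
            await Task.Delay(interval);
            string current = getHtml();
            // Reset the counter whenever the snapshot changed.
            stableReads = current == last ? stableReads + 1 : 1;
            last = current;
        }

        return last;
    }
}
```

One way to use it would be to mark the LoadCompleted handler as async and await the helper there, passing a delegate that reads the document's HTML from the browser control. When awaited on the UI thread, the continuation normally resumes on the UI thread, so the delegate touches the control from the thread that owns it. The interval, the number of identical reads and the timeout have to be tuned per site.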
I have a webkit-sharp WebView which I am using to display HTML via the
LoadString method.
The WebView is placed in a ScrolledWindow, and the ScrolledWindow is placed
in a Gtk Window.
I want to be able to tell the WebView to scroll to a specific part
of the HTML. Normally one would do this using an anchor.
I have defined an anchor and some JavaScript to jump to that anchor, I
call the JavaScript via the ExecuteScript method. This does nothing at
all.
I have also tried adding a button to the HTML that calls the
JavaScript. This also does nothing.
Is there something I can do to make this work, to make it so I can
scroll to a known location in the page?
Update: I can make this work by saving the HTML to a file and then loading from there using a URL which tells it to scroll. However I would like to avoid doing that because of the performance hit of writing the page to disk before displaying it.
How are you adding the WebView to the ScrolledWindow? I am not a GTK expert, but I have discovered that it works if you call the add method and does not work if you call add_with_viewport.
Also, you may not need to execute the JavaScript. It looks like you control the HTML page and can easily access the DOM element and call the click method on it. The following sample Python code looks for the anchor with id "one" and scrolls to the desired location without calling JavaScript.
from gi.repository import WebKit
from gi.repository import Gtk
from gi.repository import GLib, GObject
class WebkitApp:
    def exit(self, arg, a1):
        Gtk.main_quit()
    def onLoad(self, view, frame):
        doc = view.get_dom_document()
        a = doc.get_element_by_id("one")
        a.click()
    def __init__(self):
        win = Gtk.Window()
        self.view = WebKit.WebView()
        # Signal Connections
        self.view.connect("onload-event", self.onLoad)
        file = open("/tmp/epl-v10.html")
        html = file.read()
        self.view.load_string(html, "text/html", "UTF-8", "file:///tmp")
        sw = Gtk.ScrolledWindow()
        # sw.add_with_viewport(self.view)
        sw.add(self.view)
        win.add(sw)
        win.maximize()
        win.connect("delete-event", self.exit)
        win.show_all()
app = WebkitApp()
Gtk.main()
The JavaScript to jump to that anchor doesn't work while the page is still loading. By waiting for the page load to finish (there is an event for it) and then running the script, I made it work.
Given an element:
<a id='myElement'>I want to scroll to this line</a>
This can be scrolled to with this JavaScript:
document.getElementById('myElement').scrollIntoView();
Which can be executed in a WebView with the ExecuteScript method, e.g.:
_webView.ExecuteScript("document.getElementById('myElement').scrollIntoView();");
I have a situation where a rather clever website updates the latest information on the site via Shockwave Flash through a TCP connection. The data received is then updated onto the page via JavaScript so in order to get the latest data a browser is required. If attempts are made to hit the website with continual requests then a) you get banned and b) you're not actually getting the latest data, only the last updated base framework.
So I need to run a browser with scripts enabled.
My first question is: using the standard WPF WebBrowser in .NET I get the following warnings, which I don't get in standard IE, Chrome or Firefox. What is causing this, and how do I suppress or allow it while still allowing scripts for the site to run?
My second question is whether there is a better way to do this, or whether there are any better alternatives to the WebBrowser control that:
allow scripts to run
can access the DOM, or the HTML and scripts returned, in at least text format
are compatible with WPF
can hide the browser, as I don't actually want it displayed.
So far I've looked into WebKit.NET, which doesn't seem to allow access to the DOM and didn't like WPF windows when I tested it, and also Awesomium, which again didn't appear to allow direct access to the DOM without JavaScript.
Are there any other options (apart from hacking their scripts)?
Thank you
set WebBrowser.ScriptErrorsSuppressed = true;
Ultimately I ended up keeping the WPF control and used this code to inject a JavaScript script to disable JavaScript errors. The Microsoft HTML Object Library needs to be added.
// verbatim string literal: the prefix must be @, not #
private const string DisableScriptError = @"function noError() { return true;} window.onerror = noError;";
private void webBrowser1_Navigated(object sender, System.Windows.Navigation.NavigationEventArgs e)
{
    InjectDisableScript();
}
private void InjectDisableScript()
{
    HTMLDocumentClass doc = webBrowser1.Document as HTMLDocumentClass;
    HTMLDocument doc2 = webBrowser1.Document as HTMLDocument;
    // build a <script> element that installs the error-swallowing handler
    IHTMLScriptElement scriptErrorSuppressed = (IHTMLScriptElement)doc2.createElement("SCRIPT");
    scriptErrorSuppressed.type = "text/javascript";
    scriptErrorSuppressed.text = DisableScriptError;
    // append it to every <head> element of the document
    IHTMLElementCollection nodes = doc.getElementsByTagName("head");
    foreach (IHTMLElement elem in nodes)
    {
        HTMLHeadElementClass head = (HTMLHeadElementClass)elem;
        head.appendChild((IHTMLDOMNode)scriptErrorSuppressed);
    }
}
The WPF WebBrowser does not have this property; only the WinForms control does.
You'd be better off using a WindowsFormsHost in your WPF application and hosting the WinForms WebBrowser in it (so that you can use ScriptErrorsSuppressed). Make sure you run in full trust.
I'm trying to inject some CSS that accompanies some other HTML into a C# managed WebBrowser control. I am trying to do this via the underlying MSHTML (DomDocument property) control, as this code is serving as a prototype of sorts for a full IE8 BHO.
The problem is, while I can inject HTML (via mydomdocument.body.insertAdjacentHTML) and Javascript (via mydomdocument.parentWindow.execScript), it is flat-out rejecting my CSS code.
If I compare the string containing the HTML I want to insert with the destination page source after injection, the MSHTML's source will literally contain everything except for the <style> element and its underlying source.
The CSS passes W3C validation for CSS 2.1. It doesn't do anything too tricky, with the exception that some background-image properties have the image directly embedded into the CSS (e.g. background-image: url("data:image/png;base64 ...), and commenting out those lines doesn't change the result.
More strangely (and I am not sure if this is relevant), I was having no problems with this last week. I came back to it this week and, after switching around some of the code that handles the to-be-injected HTML before actual injection, it no longer worked. Naturally I thought that one of my changes might somehow be the problem, but after commenting all that logic out and feeding it a straight string, the HTML is still appearing unformatted.
At the moment I'm injecting into the <body> tag, though I've attempted to inject into <head> and that's met with similar results.
Thanks in advance for your help!
tom
Ended up solving this myself:
mshtml.HTMLDocument test = (mshtml.HTMLDocument)webBrowser1.Document.DomDocument;
// inject CSS
if (test.styleSheets.length < 31) { // createStyleSheet throws "Invalid Argument" once the page already has 31 stylesheets
    mshtml.IHTMLStyleSheet css = (mshtml.IHTMLStyleSheet)test.createStyleSheet("", 0);
    css.cssText = myDataClass.returnInjectionCSS(); // String containing CSS to inject into the page
    // CSS should now affect page
} else {
    System.Console.WriteLine("Could not inject CSS due to styleSheets.length >= 31");
    return;
}
What I didn't realize is that createStyleSheet creates a pointer that is still 'live' in the document's DOM... therefore you don't need to append your created stylesheet back to its parent. I ended up figuring this out by studying dynamic CSS code for Javascript as the implementations are pretty much identical.