Get HTML source code from browser control embedded in C# - c#

I have a browser control embeded in a C# windows app. I want to grab the rendered HTML (which could have been modified by javascript) not the original one.
Any suggestions?

You can get the HTML, and indeed set it, with WebBrowser.DocumentText.
Sheng is correct, DocumentText returns the streamed document before scripts run. His code doesn't compile, but it's essentially correct. I found that you need:
mshtml.HTMLDocument doc = webBrowser1.Document.DomDocument as mshtml.HTMLDocument;
string html = doc.documentElement.outerHTML;

DocumentText internally use the document's IPersistStream interface which returns the original HTML. Use webBrowser1.Document.DocumentElement.OuterHTML instead.

Add a Navigated event to your WebBrowser. Only then will your document be filled.
private void webBrowser1_Navigated(object sender, WebBrowserNavigatedEventArgs e)
{
Console.WriteLine(webBrowser1.DocumentText);
}

Related

How do I return all visible text on a web page as a big, unparsed string?

I am looking for a simple script that would basically do the equivalent of a user pressing Ctrl+A (select all) on a web page and then copying the text to the clipboard so I can pull it into a string from there.
The reason I want to emulate a user selecting all and then copy and pasting is because some pages are generated with Javascript and do not have the visible text in the HTML.
In any case, I am just looking for the raw unparsed text. I do not care if the spacing/line breaks are messed up, etc. I just want a quick and dirty snapshot of all the selectable text on the page into a string.
I have tried doing below as an example:
private void button3_Click(object sender, EventArgs e)
{
HAP.HtmlWeb web = new HAP.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.
Load(#"https://mywebsite");
string str = doc.DocumentNode.InnerText;
MessageBox.Show(str);
}
but if the page has javascript it does not return the text displayed by it.
Instead of
doc.DocumentNode.InnerText;
Use this
doc.DocumentNode.InnerHtml;
It will get you the whole HTML including JS and CSS. Hope it helps.
With jQuery: $(document).text() or $('body').text()

Display portion of a Website into a Web Browser

i want to navigate to a specific website, and i want then to be displayed in the web browser only a portion of the website, which starts with:
<div id="dex1" ...... </div>
I know i need to get the element by id, but firstly i tried writing this:
string data = webBorwser.Document.Body.OuterHtml;
So from data i need to grab that content "id" and display it and the rest to be deleted.
Any idea on this?
webBrowser1.DocumentCompleted += (sender, e) =>
{
webBrowser1.DocumentText = webBrowser1.Document.GetElementById("dex1").OuterHtml;
};
On second thoughts, don't do that, setting the DocumentText property causes the DocumentCompleted event to fire again. So maybe do:
webBrowser1.DocumentCompleted += webBrowser1_DocumentCompleted;
void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
webBrowser1.DocumentCompleted -= webBrowser1_DocumentCompleted;
webBrowser1.DocumentText = webBrowser1.Document.GetElementById("dex1").OuterHtml;
}
Although in most real world cases I'd expect you'd get better results injecting some javascript to do the DOM manipulation, a la Andrei's answer.
Edit: to just replace everything inside the body tag which might if you're lucky maintain all the required styling and scripts if they're all in the head don't reference any discarded context, you may have some joy with:
webBrowser1.Document.Body.InnerHtml = webBrowser1.Document.GetElementById("dex1").OuterHtml;
So, as you probably need a lot of external resources like scripts and images. You can add some custom javascript to modify the DOM however you like after you have loaded the document from your website. From How to update DOM content inside WebBrowser Control in C#? it would look something like this:
HtmlElement headElement = webBrowser1.Document.GetElementsByTagName("head")[0];
HtmlElement scriptElement = webBrowser1.Document.CreateElement("script");
IHTMLScriptElement domScriptElement = (IHTMLScriptElement)scriptElement.DomElement;
domScriptElement.text = "function applyChanges(){ $('body >').hide(); $('#dex1').show().prependTo('body');}";
headElement.AppendChild(scriptElement);
// Call the nextline whenever you want to execute your code
webBrowser1.Document.InvokeScript("applyChanges");
This is also assuming that jquery is available so you can do simple DOM manipulation.
The javascript code is just hiding all children on the body and then prepending the '#dex' div to the body so that it's at the top and visible.

WebBrowser control - see files loaded when navigating to a website

I am trying to extract some information from a website. But when I navigate to it, it uses javascript to connect me to a server before dynamically loading a php-page. I can follow the sequence in Chrome with the developer tools. I figured it would be easiest to reproduce it in C# with the Webbrowser control and simply navigate to the website. Then the webbrowser control must contain all the javascript files, the text from the dynamically loaded php page and so on. But is this true and where in the control are they stored? I can't seem to find them.
Recreate the whole sequence diagram implemented in Chrome would be a lot of work. However, "extract some information from a website" is something that can be done quite easily.
Disclaimer: I assumed this question was for the WPF's WebBrower control (it would be almost the same for WinForms)
You can get the HTMLDocument once the page is loaded, using:
using mshtml; // <- don't forget to add the reference
public partial class MainWindow : Window
{
public MainWindow()
{
InitializeComponent();
browser.Navigate("http://google.com/");
browser.LoadCompleted += browser_LoadCompleted;
}
void browser_LoadCompleted(object sender, NavigationEventArgs e)
{
HTMLDocument doc = (HTMLDocument)browser.Document;
string html = doc.documentElement.innerHTML.ToString();
// from here, you should be able to parse the HTML
// or sniff the HTMLDocument (using HTML Agility Pack for instance)
}
}
From this HTMLDocument, you have access to a lot of properties, including HTML elements, CSS styles and scripts. I invite you to put a break-point and check out what best fits your needs.
Nevertheless, since the page you want to load uses JavaScript to fill its content, the HTMLDocument will probably not be complete a the time the LoadCompleted is raise.
In that case, I suggest to use a timer to poll until the content is stable.
You could also use HTMLDocument to inject your own JavaScript code, and call C# methods througth WebBrowser.ObjectForScripting, but this is gonna be much more complicated and harder to maintain.

is there a straightforward way to retrieve text that is rendered by the browser but is not hard-coded in the actual html file?

I'm trying to retrieve data from a webpage but I cannot do it by making a web request and parsing the resulting html file because the actual text that I'm trying to retrieve is not in the html file! I imagine that this text is pulled using some script and for that reason it's not in the html file. For all I know I'm looking at the wrong data, but assuming that my theory is correct, is there a straightforward way to retrieve whatever text is displayed by the browser (Firefox or IE) rather than attempt to fetch the text from the html file?
Assuming you are referring to text that has been generated using Javascript in the browser.
You can use PhantomJS to achieve this: http://phantomjs.org/
It is essentially a headless browser that will process Javascript.
You may need to run this as ane xternal program but Im sure you can do that through C#
Your other option would be to open the web page in a WebBrowser object which should execute the scripts, and then you can get the HtmlDocument object and go from there.
Take a look at this example...
private void test()
{
WebBrowser wBrowser1 = new WebBrowser();
wBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(wBrowser1_DocumentCompleted);
wBrowser1.Url = new Uri("Web Page URL");
}
void wBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
HtmlDocument document = (sender as WebBrowser).Document;
// get elements and values accordingly.
}

Page is cut while printing with the Windows.Forms.WebBrowser control

I have trouble with the printing function of the WebBrowser control.
First i load my page and it is rendered correct.
Then i set the header/footer/margins and the right printer:
webbrowser printing;
How do I programatically change printer settings with the WebBrowser control?
That works so far. Then i use myBrowser.Print();
But my website does not print right. Only the upper left corner is printed, a few centimeters and then there are scrollbars.
I printed my website with the IE9 and everything was alright. Also tried different browsermodes and documentmodes. No problem.
And i thought the control and the IE are technically the same...
Are there any parameters i have forgotten?
The website i want to print is old and has no doctype. But since the control displays it correct, I expect it to print correct, too.
Edit:
Found out that it has to do with javascript on the website, which does not run for printing.
Is there a way to get the HTML from the manipulated DOM?
For the print function no external documents are processed. All documents for a website have to be merged into one printable site. To get the other documents you have to cast the myBrowser.Document.DomDocument to an IHTMLDocument2. From that IHTMLDocument2 you can extract the CSS or JS to put it into the html.
For Example:
void myBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
myBrowser.DocumentCompleted -= new WebBrowserDocumentCompletedEventHandler(myBrowser_DocumentCompleted);
myBrowser.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(myBrowser_DocumentPrintable);
String mySource = myBrowser.DocumentText;
// Get the CSS
IHTMLDocument2 doc = (myBrowser.Document.DomDocument) as IHTMLDocument2;
myCSS = doc.styleSheets.item(0).cssText;
mySource = mySource.Replace("<link rel=\"stylesheet\" type=\"text/css\" href=\"/css/style.css\">", "<style type=\"text/css\">"+myCSS+"</style>");
// Reload
myBrowser.DocumentText = mySource;
}
void myBrowser_DocumentPrintable(object sender, WebBrowserDocumentCompletedEventArgs e)
{
myBrowser.DocumentCompleted -= new WebBrowserDocumentCompletedEventHandler(myBrowser_DocumentPrintable);
myBrowser.Print();
}

Categories

Resources