Get HtmlDocument after javascript manipulations - c#

In C#, using the System.Windows.Forms.HtmlDocument class (or another class that allows DOM parsing), is it possible to wait until a webpage finishes its javascript manipulations of the HTML before retrieving that HTML? Certain sites add innerhtml to pages through javascript, but those changes do not show up when I parse the HtmlElements of the HtmlDocument.
One possibility would be to update the HtmlDocument of the page after a second. Does anybody know how to do this?

Someone revived this question by posting what I think is an incorrect answer. So, here are my thoughts to address it.
It's possible to get close to detecting, heuristically, when the page has finished its AJAX activity. However, this depends entirely on the logic of that particular page: some pages are perpetually dynamic.
To approach this, handle the DocumentCompleted event first, then asynchronously poll the WebBrowser.IsBusy property and monitor the current HTML snapshot of the page for changes, as below.
The complete sample can be found here.
// get the root element
var documentElement = this.webBrowser.Document.GetElementsByTagName("html")[0];
// poll the current HTML for changes asynchronously
var html = documentElement.OuterHtml;
while (true)
{
// wait asynchronously, this will throw if cancellation requested
await Task.Delay(500, token);
// continue polling if the WebBrowser is still busy
if (this.webBrowser.IsBusy)
continue;
var htmlNow = documentElement.OuterHtml;
if (html == htmlNow)
break; // no changes detected, end the poll loop
html = htmlNow;
}
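The same idea can be factored into a small reusable helper. This is only a sketch: the names StablePoller and WaitUntilStableAsync are made up for illustration, and the snapshot and busy checks are passed in as delegates, so the helper itself does not depend on WebBrowser.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of any library.
public static class StablePoller
{
    // Polls snapshot() until two consecutive reads are equal while isBusy() is false.
    // Throws OperationCanceledException if the token is cancelled during a delay.
    public static async Task<string> WaitUntilStableAsync(
        Func<string> snapshot,
        Func<bool> isBusy,
        TimeSpan interval,
        CancellationToken token)
    {
        var previous = snapshot();
        while (true)
        {
            // wait asynchronously between reads
            await Task.Delay(interval, token);

            // keep polling while the source reports it is still busy
            if (isBusy())
                continue;

            var current = snapshot();
            if (current == previous)
                return current; // no change since the last read

            previous = current;
        }
    }
}
```

With the WebBrowser control this would be called from the UI thread with () => documentElement.OuterHtml as the snapshot and () => webBrowser.IsBusy as the busy check. A page that never stops changing will only leave the loop through cancellation, so the token should carry a timeout.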

In general the answer is "no": unless a script on the page notifies your code in some way, you simply have to wait some time and then grab the HTML. Waiting a second after the document-ready notification will likely cover most sites (i.e. jQuery's $(code) cases).

You need to give the application a second to process the JavaScript. Simply halting the current thread will delay the JavaScript processing as well, so your document will still come up outdated.
WebBrowserDocumentCompletedEventArgs cachedLoadArgs;
private void TimerDone(object sender, EventArgs e)
{
((Timer)sender).Stop();
respondToPageLoaded(cachedLoadArgs);
}
void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
cachedLoadArgs = e;
System.Windows.Forms.Timer timer = new Timer();
int interval = 1000;
timer.Interval = interval;
timer.Tick += new EventHandler(TimerDone);
timer.Start();
}

What about using the 'WebBrowser.Navigated' event?

I did this with WebBrowser; take a look at my class:
public class MYCLASSProduct: IProduct
{
public string Name { get; set; }
public double Price { get; set; }
public string Url { get; set; }
private WebBrowser _WebBrowser;
private AutoResetEvent _lock;
public void Load(string url)
{
_lock = new AutoResetEvent(false);
this.Url = url;
browserInitializeBecauseJavascriptLoadThePage();
}
private void browserInitializeBecauseJavascriptLoadThePage()
{
_WebBrowser = new WebBrowser();
_WebBrowser.DocumentCompleted += webBrowser_DocumentCompleted;
_WebBrowser.Dock = DockStyle.Fill;
_WebBrowser.Name = "webBrowser";
_WebBrowser.ScrollBarsEnabled = false;
_WebBrowser.TabIndex = 0;
_WebBrowser.Navigate(Url);
Form form = new Form();
form.Hide();
form.Controls.Add(_WebBrowser);
Application.Run(form);
_lock.WaitOne();
}
private void webBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
HtmlAgilityPack.HtmlDocument hDocument = new HtmlAgilityPack.HtmlDocument();
hDocument.LoadHtml(_WebBrowser.Document.Body.OuterHtml);
this.Price = Convert.ToDouble(hDocument.DocumentNode.SelectNodes("//td[@class='ask']").FirstOrDefault().InnerText.Trim());
_WebBrowser.FindForm().Close();
_lock.Set();
}
If you're trying to do this in a console application, you need to put this attribute above your Main method, because Windows needs to communicate with COM components:
[STAThread]
static void Main(string[] args)
I don't like this solution, but I don't think there is a better one!


Get HTML source code from CefSharp web browser

I am using a CefSharp.Wpf.ChromiumWebBrowser (Version 47.0.3.0) to load a web page. At some point after the page has loaded I want to get the source code.
I have called:
wb.GetBrowser().MainFrame.GetSourceAsync()
however it does not appear to be returning all the source code (I believe this is because there are child frames).
If I call:
wb.GetBrowser().MainFrame.ViewSource()
I can see it lists all the source code (including the inner frames).
I would like to get the same result as ViewSource(). Could someone point me in the right direction please?
Update – Added Code example
Note: The address the web browser is pointing to will only work up to and including 10/03/2016. After that it may display different data, which is not what I would be looking at.
In the frmSelection.xaml file
<cefSharp:ChromiumWebBrowser Name="wb" Grid.Column="1" Grid.Row="0" />
In the frmSelection.xaml.cs file
public partial class frmSelection : UserControl
{
private System.Windows.Threading.DispatcherTimer wbTimer = new System.Windows.Threading.DispatcherTimer();
public frmSelection()
{
InitializeComponent();
// This timer will start when a web page has been loaded.
// It will wait 4 seconds and then call wbTimer_Tick which
// will then see if data can be extracted from the web page.
wbTimer.Interval = new TimeSpan(0, 0, 4);
wbTimer.Tick += new EventHandler(wbTimer_Tick);
wb.Address = "http://www.racingpost.com/horses2/cards/card.sd?race_id=644222&r_date=2016-03-10#raceTabs=sc_";
wb.FrameLoadEnd += new EventHandler<CefSharp.FrameLoadEndEventArgs>(wb_FrameLoadEnd);
}
void wb_FrameLoadEnd(object sender, CefSharp.FrameLoadEndEventArgs e)
{
if (wbTimer.IsEnabled)
wbTimer.Stop();
wbTimer.Start();
}
void wbTimer_Tick(object sender, EventArgs e)
{
wbTimer.Stop();
string html = GetHTMLFromWebBrowser();
}
private string GetHTMLFromWebBrowser()
{
// call the ViewSource method which will open up notepad and display the html.
// this is just so I can compare it to the html returned in GetSourceAsync()
// This is displaying all the html code (including child frames)
wb.GetBrowser().MainFrame.ViewSource();
// Get the html source code from the main Frame.
// This is displaying only code in the main frame and not any child frames of it.
Task<String> taskHtml = wb.GetBrowser().MainFrame.GetSourceAsync();
string response = taskHtml.Result;
return response;
}
}
I don't think I quite get this DispatcherTimer solution. I would do it like this:
public frmSelection()
{
InitializeComponent();
wb.FrameLoadEnd += WebBrowserFrameLoadEnded;
wb.Address = "http://www.racingpost.com/horses2/cards/card.sd?race_id=644222&r_date=2016-03-10#raceTabs=sc_";
}
private void WebBrowserFrameLoadEnded(object sender, FrameLoadEndEventArgs e)
{
if (e.Frame.IsMain)
{
wb.ViewSource();
wb.GetSourceAsync().ContinueWith(taskHtml =>
{
var html = taskHtml.Result;
});
}
}
I did a diff on the output of ViewSource and the text in the html variable and they are the same, so I can't reproduce your problem here.
This said, I noticed that the main frame gets loaded pretty late, so you have to wait quite a while until the notepad pops up with the source.
I was having the same issue trying to click on an item located in a frame and not on the main frame. Using the example in your answer, I wrote the following extension method:
public static IFrame GetFrame(this ChromiumWebBrowser browser, string FrameName)
{
IFrame frame = null;
var identifiers = browser.GetBrowser().GetFrameIdentifiers();
foreach (var i in identifiers)
{
frame = browser.GetBrowser().GetFrame(i);
if (frame.Name == FrameName)
return frame;
}
return null;
}
If you have a "using" on your form for the module that contains this method you can do something like:
var frame = browser.GetFrame("nameofframe");
if (frame != null)
{
string HTML = await frame.GetSourceAsync();
}
Of course you need to make sure the page load is complete before using this, but I plan to use it a lot. Hope it helps!
Jim

WebBrowser.Document is null on return from thread, not updating in new thread

public static User registerUser()
{
Uri test = new Uri("https://www.example.com/signup");
HtmlDocument testdoc = runBrowserThread(test);
string tosend = "test";
User user = new User();
user.apikey = tosend;
return user;
}
public static HtmlDocument runBrowserThread(Uri url)
{
HtmlDocument value = null;
var th = new Thread(() =>
{
var br = new WebBrowser();
br.DocumentCompleted += browser_DocumentCompleted;
br.Navigate(url);
value = br.Document;
Application.Run();
});
th.SetApartmentState(ApartmentState.STA);
th.Start();
th.Join(8000);
return value;
}
static void browser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
var br = sender as WebBrowser;
if (br.Url == e.Url)
{
Console.WriteLine("Navigated to {0}", e.Url);
Console.WriteLine(br.Document.Body.InnerHtml);
System.Console.ReadLine();
Application.ExitThread(); // Stops the thread
}
}
I am trying to scan this page, and while it does get the HTML it does not pass it back in to the function call, but instead sends back null (I presume that is post processing).
How can I make it so that the new thread passes back its result?
There are several problems with your approach.
You're not waiting until the page has been navigated (the Navigated event), so the document can still be null at that point.
You're giving up after 8 seconds; if the page takes longer than 8 seconds to load, you won't get the document.
If the document isn't properly loaded, you're leaving the thread alive.
I guess the WebBrowser control will not work as expected unless you add it to a form and show it (it needs to be visible on screen).
Etc.
Don't mix things up. Using WebBrowser is not a goal in itself. If you just need to download the page as a string, use HttpClient.GetStringAsync.
Once you have the page as a string, if you want to manipulate the HTML, use HtmlAgilityPack.
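A minimal sketch of that combination, assuming the HtmlAgilityPack package is referenced. The class and method names (PageFetcher, GetTitleAsync) are invented for the example, and reading the title element is just a stand-in for whatever you actually need from the page. Note that HttpClient only downloads the markup the server sends; it does not run the page's JavaScript.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

// Hypothetical helper for illustration only.
public static class PageFetcher
{
    // One shared HttpClient instance is enough for the whole application.
    private static readonly HttpClient Client = new HttpClient();

    // Downloads the page as a string and returns the text of its <title>,
    // or null if the page has no title element.
    public static async Task<string> GetTitleAsync(Uri url)
    {
        string html = await Client.GetStringAsync(url);

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode titleNode = doc.DocumentNode.SelectSingleNode("//title");
        return titleNode == null ? null : titleNode.InnerText.Trim();
    }
}
```

Because the result comes back through a Task, there is no extra thread or shared variable to hand the value across.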
I moved over to using WatiN instead of the default browser model. It's a bit buggy, but it now works as it should.
using (var browser = new FireFox("https://www.example.com/signup"))
{
browser.GoTo("https://example.com/signup");
browser.WaitForComplete();
}

Winform WebBrowser Pass Cookie Then Process Links?

I asked this question a while ago but it seems there are no answers, so I tried an alternative solution, and now I am stuck. Please see the following code:
WebBrowser objWebBrowser = new WebBrowser();
objWebBrowser.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(objWebBrowser_DocumentCompleted);
objWebBrowser.Navigate("http://www.website.com/login.php?user=xxx&pass=xxx");
objWebBrowser.Navigate("http://www.website.com/page.php?link=url");
And here is the event code:
WebBrowser objWebBrowser = (WebBrowser)sender;
String data = new StreamReader(objWebBrowser.DocumentStream).ReadToEnd();
Since it's impossible for me to use WebBrowser.Document.Cookies before a document is loaded, I first have to navigate to the login page, which stores a cookie automatically, and after that I want to call the other Navigate in order to get a result. The code above doesn't work because only the second Navigate ever takes effect, and putting it in the event handler won't work for me either, because what I want is this:
Navigate with the login page and store cookie for one time only.
Pass a different url each time i want to get some results.
Can anybody suggest a solution?
Edit:
Maybe the code sample I provided was misleading. What I want is:
foreach (var url in urls)
{
WebBrowser1.Navigate(url);
//Then wait for the above to complete and get the result from the event, then continue
}
I think you want to simulate a blocking call to Navigate if you are not authorized. There are probably many ways to accomplish this and other approaches to get what you want, but here's some code I wrote up quickly that might help you get started.
If you have any questions about what I'm trying to do here, let me know. I admit it feels like "a hack" which makes me think there's a smarter solution, but anyway....
bool authorized = false;
bool navigated;
WebBrowser objWebBrowser = new WebBrowser();
void GetResults(string url)
{
if(!authorized)
{
NavigateAndBlockWithSpinLock("http://www.website.com/login.php?user=xxx&pass=xxx");
authorized = true;
}
objWebBrowser.Navigate(url);
}
void NavigateAndBlockWithSpinLock(string url)
{
navigated = false;
objWebBrowser.DocumentCompleted += NavigateDone;
objWebBrowser.Navigate(url);
int count = 0;
while(!navigated && count++ < 10)
Thread.Sleep(1000);
objWebBrowser.DocumentCompleted -= NavigateDone;
if(!navigated)
throw new Exception("fail");
}
void NavigateDone(object sender, WebBrowserDocumentCompletedEventArgs e)
{
navigated = true;
}
void objWebBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
if(authorized)
{
WebBrowser objWebBrowser = (WebBrowser)sender;
String data = new StreamReader(objWebBrowser.DocumentStream).ReadToEnd();
}
}
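An alternative to blocking is to wrap the DocumentCompleted event in a Task and await it, which keeps the message loop running while the page loads. The following is a sketch under assumptions rather than a drop-in fix: WebBrowserExtensions and NavigateAsync are names invented here, and it must be called on the UI thread that owns the control.

```csharp
using System;
using System.Threading.Tasks;
using System.Windows.Forms;

// Hypothetical extension method, not part of Windows Forms.
public static class WebBrowserExtensions
{
    // Starts a navigation and returns a Task that completes when the
    // control raises DocumentCompleted with ReadyState == Complete.
    public static Task NavigateAsync(this WebBrowser browser, string url)
    {
        var tcs = new TaskCompletionSource<bool>();
        WebBrowserDocumentCompletedEventHandler handler = null;

        handler = (sender, e) =>
        {
            // DocumentCompleted can fire once per frame; ignore the early ones
            if (browser.ReadyState != WebBrowserReadyState.Complete)
                return;

            browser.DocumentCompleted -= handler; // one-shot subscription
            tcs.TrySetResult(true);
        };

        browser.DocumentCompleted += handler;
        browser.Navigate(url);
        return tcs.Task;
    }
}
```

Inside an async method on the form, the login page and each result page can then be awaited in turn, for example await objWebBrowser.NavigateAsync(url); followed by reading objWebBrowser.DocumentText. There is no timeout in this sketch, so a page that never completes leaves the task pending.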

How to make WebBrowser wait till it loads fully?

I have a C# form with a web browser control on it.
I am trying to visit different websites in a loop.
However, I can not control URL address to load into my form web browser element.
This is the function I am using for navigating through URL addresses:
public String WebNavigateBrowser(String urlString, WebBrowser wb)
{
string data = "";
wb.Navigate(urlString);
while (wb.ReadyState != WebBrowserReadyState.Complete)
{
Application.DoEvents();
}
data = wb.DocumentText;
return data;
}
How can I make my loop wait until it fully loads?
My loop is something like this:
foreach (string urlAddresses in urls)
{
WebNavigateBrowser(urlAddresses, webBrowser1);
// I need to add a code to make webbrowser in Form to wait till it loads
}
Add this to your code:
webBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowser1_DocumentCompleted);
Fill in this function
private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e) {
//This line is so you only do the event once
if (e.Url != webBrowser1.Url)
return;
//do your actual work here
}
After some time spent angry at the crappy IE functionality, I came up with what I think is the most accurate way to judge when a page has finished loading.
Never use the WebBrowserDocumentCompletedEventHandler event;
use WebBrowserProgressChangedEventHandler with the modifications shown below.
//"ie" is our web browser object
ie.ProgressChanged += new WebBrowserProgressChangedEventHandler(_ie);
private void _ie(object sender, WebBrowserProgressChangedEventArgs e)
{
int max = (int)Math.Max(e.MaximumProgress, e.CurrentProgress);
int min = (int)Math.Min(e.MaximumProgress, e.CurrentProgress);
if (min.Equals(max))
{
//Run your code here when page is actually 100% complete
}
}
Simple genius method of going about this, I found this question googling "How to sleep web browser or put to pause"
According to MSDN (contains sample source) you can use the DocumentCompleted event for that. Additional very helpful information and source that shows how to differentiate between event invocations can be found here.
What you experienced happened to me too. ReadyState.Complete doesn't work in some cases. Here I used a bool set in DocumentCompleted to check the state:
button1_click(){
//go site1
wb.Navigate("site1.com");
//wait for documentCompleted before continue to execute any further
waitWebBrowserToComplete(wb);
// set some values in html page
wb.Document.GetElementById("input1").SetAttribute("Value", "hello");
// then click submit. (submit does navigation)
wb.Document.GetElementById("formid").InvokeMember("submit");
// then wait for doc complete
waitWebBrowserToComplete(wb);
var processedHtml = wb.Document.GetElementsByTagName("HTML")[0].OuterHtml;
var rawHtml = wb.DocumentText;
}
// helpers
// instead of checking ReadyState, we get the state from the DocumentCompleted event via a bool value
bool webbrowserDocumentCompleted = false;
// not static: it reads the instance field above
public void waitWebBrowserToComplete(WebBrowser wb)
{
while (!webbrowserDocumentCompleted )
Application.DoEvents();
webbrowserDocumentCompleted = false;
}
form_load(){
wb.DocumentCompleted += (o, e) => {
webbrowserDocumentCompleted = true;
};
}

How can I display multiple images in a loop in a WP7 app?

In my (Silverlight) weather app I am downloading up to 6 separate weather radar images (each one taken about 20 mins apart) from a web site, and what I need to do is display each image for a second, then at the end of the loop pause for 2 seconds and start the loop again. (This means the loop of images will play until the user clicks the back or home button, which is what I want.)
So, I have a RadarImage class as follows, and each image is getting downloaded (via WebClient) and then loaded into a instance of RadarImage which is then added to a collection (ie: List<RadarImage>)...
//Following code is in my radar.xaml.cs to download the images....
int imagesToDownload = 6;
int imagesDownloaded = 0;
RadarImage rdr = new RadarImage(<image url>); //this happens in a loop of image URLs
rdr.FileCompleteEvent += ImageDownloadedEventHandler;
//This code in a class library.
public class RadarImage
{
public int ImageIndex;
public string ImageURL;
public DateTime ImageTime;
public Boolean Downloaded;
public BitmapImage Bitmap;
private WebClient client;
public delegate void FileCompleteHandler(object sender);
public event FileCompleteHandler FileCompleteEvent;
public RadarImage(int index, string imageURL)
{
this.ImageIndex = index;
this.ImageURL = imageURL;
//...other code here to load in datetime properties etc...
client = new WebClient();
client.OpenReadCompleted += new OpenReadCompletedEventHandler(wc_OpenReadCompleted);
client.OpenReadAsync(new Uri(this.ImageURL, UriKind.Absolute));
}
private void wc_OpenReadCompleted(object sender, OpenReadCompletedEventArgs e)
{
if (e.Error == null)
{
StreamResourceInfo sri = new StreamResourceInfo(e.Result as Stream, null);
this.Bitmap = new BitmapImage();
this.Bitmap.SetSource(sri.Stream);
this.Downloaded = true;
FileCompleteEvent(this); //Fire the event to let the app page know to add it to its List<RadarImage> collection
}
}
}
As you can see, in the class above I have exposed an event handler to let my app page know when each image has downloaded. When they have all downloaded I then run the following code in my xaml page - but only the last image ever shows up and I can't work out why!
private void ImageDownloadedEventHandler(object sender)
{
imagesDownloaded++;
if (imagesDownloaded == imagesToDownload)
{
AllImagesDownloaded = true;
DisplayRadarImages();
}
}
private void DisplayRadarImages()
{
TimerSingleton.Timer.Stop();
foreach (RadarImage img in radarImages)
{
imgRadar.Source = img.Bitmap;
Thread.Sleep(1000);
}
TimerSingleton.Timer.Start(); //Timer interval is set to 2000 milliseconds
}
private void SingleTimer_Tick(object sender, EventArgs e)
{
DisplayRadarImages();
}
So you can see that I have a static instance of a timer class which is stopped (if running), then the loop should show each image for a second. When all 6 have been displayed then it pauses, the timer starts and after two seconds DisplayRadarImages() gets called again.
But as I said before, I can only ever get the last image to show for some reason and I can't seem to get this working properly.
I'm fairly new to WP7 development (though not to .Net) so just wondering how best to do this - I was thinking of trying this with a web browser control but surely there must be a more elegant way to loop through a bunch of images!
Sorry this is so long but any help or suggestions would be really appreciated.
Mike
You can use a background thread with either a Timer or Sleep to periodically update your image control.
Phạm Tiểu Giao - Threads in WP7
You'll need to dispatch updates to the UI with
Dispatcher.BeginInvoke( () => { /* your UI code */ } );
Why don't you add the last image twice to radarImages, set the Timer to 1000 and display just one image on each tick?
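The reason only the last image appears is that Thread.Sleep inside DisplayRadarImages blocks the UI thread, so the control cannot redraw until the whole loop has finished. The one-image-per-tick suggestion avoids that. Here is a rough sketch of the bookkeeping, kept independent of the UI; FrameCycler is a made-up name, and duplicating the last frame is just one way to make it stay on screen for an extra tick.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: hands out one frame per call and wraps around.
public sealed class FrameCycler<T>
{
    private readonly List<T> frames;
    private int next;

    public FrameCycler(IList<T> source)
    {
        if (source == null || source.Count == 0)
            throw new ArgumentException("At least one frame is required.", "source");

        frames = new List<T>(source);
        // repeat the last frame so it is shown for two ticks instead of one
        frames.Add(source[source.Count - 1]);
    }

    // Returns the frame to show on this tick and advances to the next one.
    public T Next()
    {
        T frame = frames[next];
        next = (next + 1) % frames.Count;
        return frame;
    }
}
```

In the page, the timer's Tick handler would then do nothing more than imgRadar.Source = cycler.Next().Bitmap; with the timer interval set to 1000 milliseconds, and the handler returns immediately so the UI can repaint.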
