C# screen scraping project - WebBrowser not changing URL

I'm working on a small automation project at the moment and have hit a brick wall. First, I'd like to state that the only reason I'm using WebBrowser for this component is that the site being scraped has obfuscated code and requires a Java-enabled browser to display it. I have another app using WebClient which works fine for other test sites, but unfortunately it can't be used on this target.
My problem arises when trying to programmatically configure the WebBrowser control.
The first problem I've discovered: if I manually set the URL in the control's properties, it loads page 1 and the scraper works for that page. However, when I cleared the URL in the properties and set it in the Form1_Load method instead, the control reports about:blank as the URL, even though I've verified that the parameter being pulled in is fine and should be getting set without issue.
Here's what I'm using:
Note:
collection refers to an XML-serialized array of definitions
definition refers to the active definition for this target, the idea being to configure this for multiple targets
private void Form1_Load(object sender, EventArgs e)
{
PopulateScraperCollection();
webBrowser1.Url = new Uri(collection.ElementAt(b).AccessUrl);
NavigateToUrl(collection.ElementAt(b).AccessUrl);
}
public void PopulateScraperCollection()
{
string[] xmlFiles = Directory.GetFiles(@"E:\DealerConfigs\");
foreach (string xmlFile in xmlFiles)
{
collection.Add(ScraperDefinition.Deserialize(xmlFile));
}
}
public void NavigateToUrl(string url)
{
Console.WriteLine(collection.ElementAt(b).AccessUrl);
webBrowser1.DocumentCompleted += webBrowser1_DocumentCompleted;
webBrowser1.Navigate(webBrowser1.Url);
}
private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
WebBrowser wb = sender as WebBrowser;
Process(collection.ElementAt(b), 0);
b++;
}
This in turn causes another issue when using DocumentCompleted to navigate to the paginated results. On the first page load I use a DocumentCompleted event to trigger the link extraction. When I attempt to set the URL for the next page (which is picked out fine using XPath, and again verified), stepping over with F10 in the debugger shows that the URL hasn't changed, and the DocumentCompleted event isn't triggered.
My code to change the URL is:
string nextPageUrl = string.Format(definition.NextPageUrlFormat, WebUtility.HtmlDecode(relativeUrl));
webBrowser1.Url = new Uri(nextPageUrl);
webBrowser1.DocumentCompleted += webBrowser1_DocumentCompleted;
webBrowser1.Navigate(webBrowser1.Url);
Any help, as always, is greatly appreciated. This is proving to be a nightmare to automate, not only because WebBrowser is so much slower than WebClient, but because it's proving a pain to alter on the fly.
Regards
Barry

You should never really set webBrowser1.Url. You should just use the Navigate method, so:
private void Form1_Load(object sender, EventArgs e)
{
PopulateScraperCollection();
NavigateToUrl(collection.ElementAt(b).AccessUrl);
}
My guess as to why it isn't navigating is that collection.ElementAt(b).AccessUrl is null or about:blank.
I'm not really sure how to answer the rest of your question, but the Navigate method should change the URL.
NB: The WebBrowser control is quite poor; you could try another browser control such as Awesomium or GeckoFX.
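A minimal sketch of the suggestion above, not the asker's actual project. It assumes a plain list of URL strings in place of the question's collection of definitions, and the class name, `accessUrls` field and placeholder URL are hypothetical. Two details beyond the answer are assumptions on my part: the handler is subscribed once in the constructor, because the question's code adds it again on every navigation, and the handler ignores frame-level events by comparing `e.Url` with the control's current `Url`. Reading the target back from `webBrowser1.Url` right after assigning it, as the question does, may return the old document's address, which would explain the about:blank result.

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Forms;

public class ScraperForm : Form
{
    // Hypothetical stand-in for the question's collection of definitions.
    private readonly List<string> accessUrls = new List<string>();
    private int b; // index of the current target, as in the question
    private readonly WebBrowser webBrowser1 = new WebBrowser { Dock = DockStyle.Fill };

    public ScraperForm()
    {
        Controls.Add(webBrowser1);
        // Subscribe exactly once; subscribing before every Navigate call
        // would run the handler several times per page.
        webBrowser1.DocumentCompleted += webBrowser1_DocumentCompleted;
        Load += ScraperForm_Load;
    }

    private void ScraperForm_Load(object sender, EventArgs e)
    {
        accessUrls.Add("http://example.com/"); // placeholder URL
        NavigateToUrl(accessUrls[b]);
    }

    private void NavigateToUrl(string url)
    {
        // Pass the target itself instead of reading it back from webBrowser1.Url.
        webBrowser1.Navigate(url);
    }

    private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // DocumentCompleted can also fire for frames; act only on the top-level document.
        if (e.Url != webBrowser1.Url) return;
        Console.WriteLine("Loaded " + e.Url);
        // Link extraction and the Navigate call for the next page would go here.
    }
}
```

The next-page step from the question would then reduce to a single `webBrowser1.Navigate(nextPageUrl);` call inside the handler, with no second `+=`.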

Related

C# - Wait for a WebBrowser to completely finish navigating and loading a website/webpage

I'm trying to do web automation by creating a Windows Form application in C# using WebBrowser. I currently have the code below that navigates to Youtube and inputs a string in Youtube's search bar.
website.Navigate("www.youtube.com");
website.Document.GetElementById("search").InnerText = "Cavaliers vs Boston highlights";
However, I get a NullReferenceException in the line
website.Document.GetElementById("search").InnerText = "Cavaliers vs Boston highlights";
I searched several websites for how to determine when a WebBrowser has completely finished loading the page passed to the Navigate method, but so far I haven't found anything.
What I have found online are methods that check the WebBrowser's ready state, but when I tried them the Form I created doesn't even load, yet execution still proceeds to the GetElementById call.
Hoping someone can help me with this, been trying to find a solution since morning.
Try to add an event listener to WebBrowser. The WebBrowser has a WebBrowser.DocumentCompleted event that occurs when the web page has been fully loaded.
Something like
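public frmMain()
{
    InitializeComponent(); // creates the website control before it is used
    website.DocumentCompleted += website_DocumentCompleted;
}
public void website_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // runs once the page has loaded, so Document is no longer null
    website.Document.GetElementById("search").InnerText = "Cavaliers vs Boston highlights";
}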
where frmMain is your form. It could be of course added in somewhere else too.

Use asynchrony with GeckoFX web browser

I have a problem, I am trying to work with a custom webbrowser for a specific website.
The problem I am having with GeckoFX is that some times I need to wait for DocumentCompleted to continue with execution of particular methods.
I don't want to put all my code into one large DocumentCompleted event, as that seems silly and wrong.
I got the code to work by using Application.DoEvents() as follows, but I read that this is not the right way to go, and that the web browser is best run asynchronously.
private void AddNewTab(string tmsAddress) // add a new browser to my form
{
    TabPage tab = new TabPage();
    browserTabControl.TabPages.Insert(browserTabControl.TabCount - 1, tab);
    GeckoWebBrowser browser = new GeckoWebBrowser();
    tab.Controls.Add(browser);
    browser.Dock = DockStyle.Fill;
    // subscribe before navigating so the completion event cannot be missed
    browser.DocumentCompleted += new EventHandler<Gecko.Events.GeckoDocumentCompletedEventArgs>(tmsBrowser_DocumentCompleted);
    browser.Navigate(tmsAddress);
}
Navigation on the page is manual, i.e. users work with the page normally, but from time to time they can use a shortcut button to get somewhere.
private void someButton_MouseUp(object sender, MouseEventArgs e)
{
GeckoWebBrowser browser = getCurrentBrowser();
//get some data from the page here
OpenPageInContentFrame(address + parameter1 + parameter2); //I need to wait for this page to load and then do HighlightItemRow()
while (!eventHandled) //documentCompleted events sets the bool as 'true'
{
Application.DoEvents();
}
HighlightItemRow(browser, parameter1);
}
}
I wanted to use a ManualResetEvent instead of while() and Application.DoEvents(), but calling manResEvent.WaitOne() freezes the whole application, including the navigation, so the page never actually loads. I think this must be because everything runs on a single thread. I don't know how to make it work, as I have never used anything async.
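One possible way around the blocking wait, offered as a sketch and not as a confirmed GeckoFX recipe: wrap the DocumentCompleted event in a Task using TaskCompletionSource and await it, so the UI thread keeps pumping messages while the page loads. The event, its argument type and `Navigate(string)` are taken from the question's code; the `GeckoNavigation` class, the `NavigateAsync` name and the assumption that `GeckoWebBrowser` lives in the `Gecko` namespace are mine.

```csharp
using System;
using System.Threading.Tasks;
using Gecko;          // assumed namespace of GeckoWebBrowser
using Gecko.Events;   // GeckoDocumentCompletedEventArgs, as used in the question

public static class GeckoNavigation
{
    // Hypothetical helper: starts a navigation and returns a Task that
    // completes the first time DocumentCompleted fires afterwards.
    public static Task NavigateAsync(GeckoWebBrowser browser, string url)
    {
        var tcs = new TaskCompletionSource<bool>();
        EventHandler<GeckoDocumentCompletedEventArgs> handler = null;
        handler = (sender, e) =>
        {
            browser.DocumentCompleted -= handler; // one-shot subscription
            tcs.TrySetResult(true);
        };
        browser.DocumentCompleted += handler;     // subscribe before navigating
        browser.Navigate(url);
        return tcs.Task;
    }
}
```

With this in place, the button handler could be declared `private async void someButton_MouseUp(...)`, call `await GeckoNavigation.NavigateAsync(browser, url);` and then `HighlightItemRow(browser, parameter1);`, with no DoEvents loop and no WaitOne. If the page is loaded into a content frame, as OpenPageInContentFrame suggests, the completion event may behave differently and the helper would need adapting.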

Stop Action on Navigating event of WebBrowser Control

I have created a small WinForms app in C# and added the WebBrowser component to it. What I am trying to achieve is an app that can load a local HTML page from a file which has "custom" protocols in it, and can of course also navigate to a web address.
For example, I would have an entry like the following in my web page
'Close Company</TD></TR>' which would open a task in a program.
The way I tried to achieve this was via the Navigating event, as shown below
private void webBrowser_Navigating(object sender, WebBrowserNavigatingEventArgs e)
{
if ((webBrowser.StatusText.Contains("Special")))
{
//For some reason the stop doesn't do much; it still tries to proceed to special:123,
//displaying "cannot load page"
webBrowser.Stop();
//Launch program here.
MessageBox.Show("Special Command Found");
}
}
The problem is that it still navigates and then, of course, says it can't find the page.
I swapped Stop with GoBack, which for some reason has the same issue the first time I run it; when I then select Back in the browser, it works from then on.
I also tried the Navigated event with GoBack. Besides the flashing in the app from going back, the event no longer fires after the first time.
Does anyone have any idea how to solve this, or what I am doing wrong here?
Instead of using WebBrowser.Stop();
just set e.Cancel = true; in the Navigating handler.
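A short sketch of that approach, under a few assumptions: the custom scheme is `special` (taken from the comment in the question's code), the handler tests `e.Url.Scheme` instead of the status text, and the form class, the `IsSpecialCommand` helper and the sample page markup are hypothetical.

```csharp
using System;
using System.Windows.Forms;

public class SpecialLinkForm : Form
{
    private readonly WebBrowser webBrowser = new WebBrowser { Dock = DockStyle.Fill };

    public SpecialLinkForm()
    {
        Controls.Add(webBrowser);
        webBrowser.Navigating += webBrowser_Navigating;
        // Hypothetical page content with one custom-protocol link.
        webBrowser.DocumentText =
            "<html><body><a href=\"special:123\">Close Company</a></body></html>";
    }

    // True when the target uses the custom "special" scheme.
    public static bool IsSpecialCommand(Uri url)
    {
        return url != null
            && string.Equals(url.Scheme, "special", StringComparison.OrdinalIgnoreCase);
    }

    private void webBrowser_Navigating(object sender, WebBrowserNavigatingEventArgs e)
    {
        if (!IsSpecialCommand(e.Url)) return; // ordinary links navigate as usual
        e.Cancel = true;                      // cancel before the browser tries to load it
        MessageBox.Show("Special Command Found: " + e.Url);
    }
}
```

Cancelling in Navigating means the browser never leaves the current page, so there is nothing to undo with Stop or GoBack.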

How do I make the Delete key work in the WebBrowser control

I have a .NET Windows Forms project that includes the System.Windows.Forms.WebBrowser control to allow the user to do some editing of HTML content. When this control is in edit mode, elements such as div or span can be edited by drag and drop, but selecting an element and pressing Delete does nothing.
I have seen a few posts that talk about making this work in C++ but they are not very detailed. Example http://social.msdn.microsoft.com/Forums/en-US/ieextensiondevelopment/thread/1f485dc6-e8b2-4da7-983f-ca431f96021f/
This next post talks about using a function called TranslateAccelerator method to solve similar problems in MFC projects. http://vbyte.com/iReader/Reader.asp?ISBN=0735607818&URI=/HTML/chaab.htm
Does anyone have a suggestion on how to make the delete key work in C# or VB for a windows forms project?
Here is my code to create the WebBrowser content:
WebBrowser1.Navigate("about:blank") ' Initializes the control
Application.DoEvents
WebBrowser1.Document.OpenNew(False).Write("<html><body><span>Project Title</span><input type='text' value='' /></body></html>")
WebBrowser1.ActiveXInstance.Document.DesignMode = "On" ' Option Explicit must be set to off
WebBrowser1.Document.Body.SetAttribute("contenteditable", "true")
Thanks
Well the problem was that one of the control properties, "WebBrowserShortcutsEnabled" was set to false. Thanks everyone for your help, there is no way anyone could have guessed that so I get a big "DUH!". I did find a way to make this work in c# where the code would look like this:
public Form1() {
InitializeComponent();
webBrowser1.Navigate("about:blank"); // Initializes the webbrowser control
}
private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e) {
mshtml.IHTMLDocument2 doc = webBrowser1.Document.DomDocument as mshtml.IHTMLDocument2;
doc.designMode = "On";
webBrowser1.Document.OpenNew(false).Write(@"<html><body><span>Project Title</span><input type=""text"" value="""" /></body></html>");
}
...assuming that a reference had been added to MSHTML.
The DocumentCompleted event accomplishes the same thing as the Application.DoEvents in my first code example, so that could go either way.
I just tried this method:
webBrowser1.Navigate(@"javascript:document.body.contentEditable='true'; document.designMode='on'; void 0");
Elements can be dragged and deleted, you can also edit text with a double click.

C# WebBrowser Control: Navigating to a List of URLs

I am working on a web crawler. I am using the Webbrowser control for this purpose. I have got the list of urls stored in database and I want to traverse all those URLs one by one and parse the HTML.
I used the following logic
foreach (string href in hrefs)
{
webBrowser1.Url = new Uri(href);
webBrowser1.Navigate(href);
}
I want to do some work in the "webBrowser1_DocumentCompleted" event once the page has loaded completely. But "webBrowser1_DocumentCompleted" does not get control because I am using the loop here. It only gets control when the last URL in "hrefs" has been navigated to and execution exits the loop.
What's the best way to handle this problem?
Store the list somewhere in your state, as well as the index of where you've got to. Then in the DocumentCompleted event, parse the HTML and then navigate to the next page.
(Personally I wouldn't use the WebBrowser control for web crawling... I know it means it'll handle the JavaScript for you, but it'll be a lot harder to parallelize nicely than using multiple WebRequest or WebClient objects.)
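The first suggestion above (keep the list and the current index as state, and move on from inside DocumentCompleted) could look roughly like this. It is a sketch under assumptions: the `CrawlerForm` class, the constructor parameter and the `NavigateNext` helper are hypothetical, and the frame check comparing `e.Url` with the control's `Url` is my addition.

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Forms;

public class CrawlerForm : Form
{
    private readonly WebBrowser webBrowser1 = new WebBrowser { Dock = DockStyle.Fill };
    private readonly List<string> hrefs;   // the URLs to visit, e.g. loaded from the database
    private int index = -1;                // position of the page currently loading

    public CrawlerForm(IEnumerable<string> urls)
    {
        hrefs = new List<string>(urls);
        Controls.Add(webBrowser1);
        webBrowser1.DocumentCompleted += webBrowser1_DocumentCompleted;
        Load += (s, e) => NavigateNext();  // start with the first URL
    }

    // Advances the index and starts loading the next URL, if any remain.
    private void NavigateNext()
    {
        index++;
        if (index < hrefs.Count)
        {
            webBrowser1.Navigate(hrefs[index]);
        }
    }

    private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // Ignore completion events raised for frames inside the page.
        if (e.Url != webBrowser1.Url) return;

        string html = webBrowser1.DocumentText; // parse the loaded page here
        Console.WriteLine(e.Url + ": " + html.Length + " characters");

        NavigateNext(); // only now move on to the next URL
    }
}
```

Only one navigation is in flight at a time, so each page gets its own DocumentCompleted call before the next URL is requested.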
First of all, you are setting a new URL on the same WebBrowser control before it has loaded anything, so you will simply see the last URL in your browser. The browser certainly takes some time to load a URL, so I guess each navigation is cancelled well before DocumentCompleted can fire.
There is only one way to do this simultaneously:
you have to use a tab control and open a new tab page for every URL. Each tab page has its own WebBrowser control, which you can point at its URL.
foreach (string href in hrefs)
{
    TabPage page = new TabPage();
    WebBrowser wb = new WebBrowser();
    wb.Dock = DockStyle.Fill;
    wb.DocumentCompleted += wb_DocumentCompleted;
    page.Controls.Add(wb);          // each tab page hosts its own browser
    tabControl1.TabPages.Add(page);
    wb.Navigate(href);
}
private void wb_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // do your stuff...
}
To improve on the above method, you should look at how to create multiple tab pages on different UI threads. It's a pretty long topic to discuss here, but it is possible.
Another method is to use a queue...
private static Queue<string> queue = new ...
foreach(string href in hrefs){
queue.Enqueue(href);
}
private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // parse the loaded page here, then move on to the next URL
    if (queue.Count > 0)
    {
        webBrowser1.Navigate(queue.Dequeue());
    }
}
