Web scraping - can't get a table present in the webpage - c#

I'm parsing the HTML of a webpage for getting some information. In my webpage, I have a <table> which I'm trying to access. But when I write the following code, 0 elements are returned:
WebBrowser csexBrowser = new WebBrowser();
HtmlElementCollection table2 = this.csexBrowser.Document.GetElementsByTagName("table");
Here, table2 has nothing. 0 elements.I'm using winforms.
EDIT: This is the link. If you search for a name, then it will show you some results in a table.

There is a verification step which precedes access to the link you provided. In the http://www.nsopw.gov/en-US/Search/Verification document, there are no tables.
Are you sure you pass the verification URL first?
[EDIT]
Please try this:
public Form1()
{
InitializeComponent();
WebBrowser csexBrowser = new WebBrowser();
//here we say what we want to do when the Navigated event occurs
csexBrowser.Navigated += csexBrowser_Navigated;
//this takes some time
csexBrowser.Navigate("http://www.nsopw.gov/en-US/Search");
}
void csexBrowser_Navigated(object sender, WebBrowserNavigatedEventArgs e)
{
//here the document is loaded and we will find the table
HtmlElementCollection table2 = ((WebBrowser)sender).Document.GetElementsByTagName("table");
}

If you insist on navigating with a browser then you must wait for the navigation to finish. Personally I hate this method plus the event fire most people go for can multi-trigger I have found.
Do this:
csexBrowser.Navigate(Url);
while (csexBrowser.ReadyState != WebBrowserReadyState.Complete)
{
Application.DoEvents();
}
Simply navigates to the given url and doesn't continue until the page has finished loading.
Done and done.

Related

How to have C# Webbrowser handle webpage login popup for webscraping

I'm trying to programmatically login to a site like espn.com. The way the site is setup is once I click on the Log In button located on the homepage, a Log In popup window is displayed in the middle of the screen with the background slightly tinted. My goal is to programmatically obtain that popup box, supply the username and password, and submit it -- hoping that a cookie is returned to me to use as authentication. However, because Javascript is used to display the form, I don't necessarily have easy access to the form's input tags via the main page's HTML.
I've tried researching various solutions such as HttpClient and HttpWebRequest, however it appears that a Webbrowser is best since the login form is displayed using Javascript. Since I don't necessarily have easy access to the form's input tags, a Webbrowser seems the best alternative to capturing the popup's input elements.
class ESPNLoginViewModel
{
private string Url;
private WebBrowser webBrowser1 = new WebBrowser();
private SHDocVw.WebBrowser_V1 Web_V1;
public ESPNLoginViewModel()
{
Initialize();
}
private void Initialize()
{
Url = "http://www.espn.com/";
Login();
}
private void Login()
{
webBrowser1.Navigate(Url);
webBrowser1.DocumentCompleted +=
new WebBrowserDocumentCompletedEventHandler(webpage_DocumentCompleted);
Web_V1 = (SHDocVw.WebBrowser_V1)this.webBrowser1.ActiveXInstance;
Web_V1.NewWindow += new SHDocVw.DWebBrowserEvents_NewWindowEventHandler(Web_V1_NewWindow);
}
//This never gets executed
private void Web_V1_NewWindow(string URL, int Flags, string TargetFrameName, ref object PostData, string Headers, ref bool Processed)
{
//I'll start determing how to code this once I'm able to get this invoked
}
private void webpage_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
HtmlElement loginButton = webBrowser1.Document.GetElementsByTagName("button")[5];
loginButton.InvokeMember("click");
//I've also tried the below InvokeScript method to see if executing the javascript that
//is called when the Log In button is clicked, however Web_V1_NewWindow still wasn't called.
//webBrowser1.Document.InvokeScript("buildOverlay");
}
}
I'm expecting the Web_V1_NewWindow handler to be invoked when the InvokeMember("click") method is called. However, code execution only runs through the webpage_DocumentCompleted handler without any calls to Web_V1_NewWindow. It might be that I need to use a different method than InvokeMember("click") to invoke the Log In button's click event handler. Or I might need to try something completely different altogether. I'm not 100% sure the Web_V1.NewWindow is the correct approach for my needs, but I've seen NewWindow used often when dealing with popups so I figured I should give it a try.
Any help would be greatly appreciated as I've spent a significant amount of time on this.
I know it is the late answer. But it will help someone else.
You can extract the value from FRAME element by following
// Get frame using frame ID
HtmlWindow frameWindow = (from HtmlWindow win
in WbBrowser.Document.Window.Frames select win)
.Where(x => string.Compare(x.WindowFrameElement.Id, "frm1") == 0)
.FirstOrDefault();
// Get first frame textbox with ID
HtmlElement txtElement = (from HtmlElement element
in frameWindow.Document.GetElementsByTagName("input")
select element)
.Where(x => string.Compare(x.Id, "txt") == 0).FirstOrDefault();
// Check txtElement is nul or not
if(txtElement != null)
{
Label1.Text = txtElement.GetAttribute("value");
}
For more details check
this article

DotNetBrowser FinishLoadingFrameEvent multiple use

im trying to load multiple pages using DotNetBrowser , and i need to know each time when the new url is loaded,
myBro.FinishLoadingFrameEvent += delegate (object send, FinishLoadingEventArgs es)
{
if (es.IsMainFrame && es.ValidatedURL.Contains("login"))
{
DOMDocument document = myBro.GetDocument();
DOMElement user = document.GetElementById("LoginForm_login");
user.SetAttribute("value", "email");
DOMElement pass = document.GetElementById("LoginForm_password");
pass.SetAttribute("value", "pass");
DOMElement loginbtn = document.GetElementByTagName("button");
loginbtn.Click();
// can't add nothing more here //
};
but this code does inform me only if the first page is loaded
The FinishLoadingFrameEvent is fired for each frame loaded on the web page, even after the page is reloaded. You can use it multiple times to be notified when a browser has loaded the web page completely after the LoadURL method is called.
Here is a sample code based on the documentation article https://dotnetbrowser.support.teamdev.com/support/solutions/articles/9000110055-loading-url-synchronously :
ManualResetEvent waitEvent = new ManualResetEvent(false);
browser.FinishLoadingFrameEvent += delegate(object sender, FinishLoadingEventArgs e)
{
// Wait until main document of the web page is loaded completely.
if (e.IsMainFrame)
{
waitEvent.Set();
}
};
//Load URL
browser.LoadURL("http://www.google.com");
waitEvent.WaitOne();
//The page http://www.google.com is now loaded completely
//Then, reset the event and load the next URL
waitEvent.Reset();
browser.LoadURL("http://www.microsoft.com");
waitEvent.WaitOne();
//The page http://www.microsoft.com is now loaded completely

Winform WebBrowser Pass Cookie Then Process Links?

I asked this question a while ago but seems that there are no answers, so i tried to go with an alternative solution but i am stuck now, please see the following code:
WebBrowser objWebBrowser = new WebBrowser();
objWebBrowser.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(objWebBrowser_DocumentCompleted);
objWebBrowser.Navigate("http://www.website.com/login.php?user=xxx&pass=xxx");
objWebBrowser.Navigate("http://www.website.com/page.php?link=url");
And here is the event code:
WebBrowser objWebBrowser = (WebBrowser)sender;
String data = new StreamReader(objWebBrowser.DocumentStream).ReadToEnd();
Since it's impossible for me to use the WebBrowser.Document.Cookies before a document is loaded, i have first to navigate the login page, that will store a cookie automatically, but after that i want to call the other navigate in order to get a result. Now using the above code it doesn't work cause it always takes the second one, and it won't work for me to put it in the event cause what i want is like this:
Navigate with the login page and store cookie for one time only.
Pass a different url each time i want to get some results.
Can anybody give a solution ?
Edit:
Maybe the sample of code i provided was misleading, what i want is:
foreach(url in urls)
{
Webborwser1.Navigate(url);
//Then wait for the above to complete and get the result from the event, then continue
}
I think you want to simulate a blocking call to Navigate if you are not authorized. There are probably many ways to accomplish this and other approaches to get what you want, but here's some code I wrote up quickly that might help you get started.
If you have any questions about what I'm trying to do here, let me know. I admit it feels like "a hack" which makes me think there's a smarter solution, but anyway....
bool authorized = false;
bool navigated;
WebBrowser objWebBrowser = new WebBrowser();
void GetResults(string url)
{
if(!authorized)
{
NavigateAndBlockWithSpinLock("http://www.website.com/login.php?user=xxx&pass=xxx");
authorized = true;
}
objWebBrowser.Navigate(url);
}
void NavigateAndBlockWithSpinLock(string url)
{
navigated = false;
objWebBrowser.DocumentCompleted += NavigateDone;
objWebBrowser.Navigate(url);
int count = 0;
while(!navigated && count++ < 10)
Thread.Sleep(1000);
objWebBrowser.DocumentCompleted -= NavigateDone;
if(!navigated)
throw new Exception("fail");
}
void NavigateDone(object sender, WebBrowserDocumentCompletedEventArgs e)
{
navigated = true;
}
void objWebBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
if(authorized)
{
WebBrowser objWebBrowser = (WebBrowser)sender;
String data = new StreamReader(objWebBrowser.DocumentStream).ReadToEnd();
}
}

How to make WebBrowser wait till it loads fully?

I have a C# form with a web browser control on it.
I am trying to visit different websites in a loop.
However, I can not control URL address to load into my form web browser element.
This is the function I am using for navigating through URL addresses:
public String WebNavigateBrowser(String urlString, WebBrowser wb)
{
string data = "";
wb.Navigate(urlString);
while (wb.ReadyState != WebBrowserReadyState.Complete)
{
Application.DoEvents();
}
data = wb.DocumentText;
return data;
}
How can I make my loop wait until it fully loads?
My loop is something like this:
foreach (string urlAddresses in urls)
{
WebNavigateBrowser(urlAddresses, webBrowser1);
// I need to add a code to make webbrowser in Form to wait till it loads
}
Add This to your code:
webBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowser1_DocumentCompleted);
Fill in this function
private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e) {
//This line is so you only do the event once
if (e.Url != webBrowser1.Url)
return;
//do you actual code
}
After some time of anger of the crappy IE functionality I've came across making something which is the most accurate way to judge page loaded complete.
Never use the WebBrowserDocumentCompletedEventHandler event
use WebBrowserProgressChangedEventHandler with some modifections seen below.
//"ie" is our web browser object
ie.ProgressChanged += new WebBrowserProgressChangedEventHandler(_ie);
private void _ie(object sender, WebBrowserProgressChangedEventArgs e)
{
int max = (int)Math.Max(e.MaximumProgress, e.CurrentProgress);
int min = (int)Math.Min(e.MaximumProgress, e.CurrentProgress);
if (min.Equals(max))
{
//Run your code here when page is actually 100% complete
}
}
Simple genius method of going about this, I found this question googling "How to sleep web browser or put to pause"
According to MSDN (contains sample source) you can use the DocumentCompleted event for that. Additional very helpful information and source that shows how to differentiate between event invocations can be found here.
what you experiencend happened to me . readyStete.complete doesnt work in some cases. here i used bool in document_completed to check state
button1_click(){
//go site1
wb.Navigate("site1.com");
//wait for documentCompleted before continue to execute any further
waitWebBrowserToComplete(wb);
// set some values in html page
wb.Document.GetElementById("input1").SetAttribute("Value", "hello");
// then click submit. (submit does navigation)
wb.Document.GetElementById("formid").InvokeMember("submit");
// then wait for doc complete
waitWebBrowserToComplete(wb);
var processedHtml = wb.Document.GetElementsByTagName("HTML")[0].OuterHtml;
var rawHtml = wb.DocumentText;
}
// helpers
//instead of checking readState . we get state from DocumentCompleted Event via bool value
bool webbrowserDocumentCompleted = false;
public static void waitWebBrowserToComplete(WebBrowser wb)
{
while (!webbrowserDocumentCompleted )
Application.DoEvents();
webbrowserDocumentCompleted = false;
}
form_load(){
wb.DocumentCompleted += (o, e) => {
webbrowserDocumentCompleted = true;
};
}

Is is possible to know for sure if a WebBrowser is navigating or not?

I'm trying to find a way for my program to know when a WebBrowser is navigating and when is not. This is because the program will interact with the loaded document via JavaScript that will be injected in the document. I don't have any other way to know when it starts navigating than handling the Navigating event since is not my program but the user who will navigate by interacting with the document. But then, when DocumentCompleted occurs doesn't necessarily mean that it have finished navigating. I've been googling a lot and found two pseudo-solutions:
Check for WebBrowser's ReadyState property in the DocumentCompleted event. The problem with this is that if not the document but a frame in the document loads, the ReadyState will be Completed even if the main document is not completed.
To prevent this, they advise to see if the Url parameter passed to DocumentCompleted matches the Url of the WebBrowser. This way I would know that DocumentCompleted is not being invoked by some other frame in the document.
The problem with 2 is that, as I said, the only way I have to know when a page is navigating is by handling the Navigating (or Navigated) event. So if, for instance, I'm in Google Maps and click Search, Navigating will be called, but just a frame is navigating; not the whole page (on the specific Google case, I could use the TargetFrameName property of WebBrowserNavigatingEventArgs to check if it's a frame the one that is navigating, but frames doesn't always have names). So after that, DocumentCompleted will be called, but not with the same Url as my WebBrowsers Url property because it was just a frame the one that navigated, so my program would thing that it's still navigating, forever.
Adding up calls to Navigating and subtracting calls to DocumentCompleted wont work either. They are not always the same. I haven't find a solution to this problem for months already; I've been using solutions 1 and 2 and hoping they will work for most cases. My plan was to use a timer in case some web page has errors or something but I don't think Google Maps has any errors. I could still use it but the only uglier solution would be to burn up my PC.
Edit: So far, this is the closest I've got to a solution:
partial class SafeWebBrowser
{
private class SafeNavigationManager : INotifyPropertyChanged
{
private SafeWebBrowser Parent;
private bool _IsSafeNavigating = false;
private int AccumulatedNavigations = 0;
private bool NavigatingCalled = false;
public event PropertyChangedEventHandler PropertyChanged;
public bool IsSafeNavigating
{
get { return _IsSafeNavigating; }
private set { SetIsSafeNavigating(value); }
}
public SafeNavigationManager(SafeWebBrowser parent)
{
Parent = parent;
}
private void SetIsSafeNavigating(bool value)
{
if (_IsSafeNavigating != value)
{
_IsSafeNavigating = value;
OnPropertyChanged(new PropertyChangedEventArgs("IsSafeNavigating"));
}
}
private void UpdateIsSafeNavigating()
{
IsSafeNavigating = (AccumulatedNavigations != 0) || (NavigatingCalled == true);
}
private bool IsMainFrameCompleted(WebBrowserDocumentCompletedEventArgs e)
{
return Parent.ReadyState == WebBrowserReadyState.Complete && e.Url == Parent.Url;
}
protected void OnPropertyChanged(PropertyChangedEventArgs e)
{
if (PropertyChanged != null) PropertyChanged(this, e);
}
public void OnNavigating(WebBrowserNavigatingEventArgs e)
{
if (!e.Cancel) NavigatingCalled = true;
UpdateIsSafeNavigating();
}
public void OnNavigated(WebBrowserNavigatedEventArgs e)
{
NavigatingCalled = false;
AccumulatedNavigations++;
UpdateIsSafeNavigating();
}
public void OnDocumentCompleted(WebBrowserDocumentCompletedEventArgs e)
{
NavigatingCalled = false;
AccumulatedNavigations--;
if (AccumulatedNavigations < 0) AccumulatedNavigations = 0;
if (IsMainFrameCompleted(e)) AccumulatedNavigations = 0;
UpdateIsSafeNavigating();
}
}
}
SafeWebBrowser inherits WebBrowser. The methods OnNavigating, OnNavigated and OnDocumentCompleted are called on the corresponding WebBrowser overridden methods. The property IsSafeNavigating is the one that would let me know if it's navigating or not.
Waiting till the document has loaded is a difficult problem, but you want to continually check for .ReadyState and .Busy (dont forget that). I will give you some general information you will need, then I will answer your specific question at the end.
BTW, NC = NavigateComplete and DC = DocumentComplete.
Also, if the page you are waiting for has frames, you need to get a ref to them and check their .busy and .readystate as well, and if the frames are nested, the nested frames .readystate and .busy as well, so you need to write a function that recursively retreives those references.
Now, regardless of how many frames it has, first fired NC event is always the top document, and last fired DC event is always that of the top (parent) document as well.
So you should check to see if its the first call and the pDisp Is WebBrowser1.object (literally thats what you type in your if statement) then you know its the NC for top level document, then you wait for this same object to appear in a DC event, so save the pDisp to a global variable, and wait until a DC is run and that DC's pDisp is equal to the Global pDisp you've saved during the first NC event (as in, the pDisp that you saved globally in the first NC event that fired). So once you know that pDisp was returned in a DC, you know the entire document is finished loading.
This will improve your currect method, however, to make it more fool proof, you need to do the frames checking as well, since even if you did all of the above, it's over 90% good but not 100% fool proof, need to do more for this.
In order to do successful NC/DC counting in a meaningful way (it is possible, trust me) you need to save the pDisp of each NC in an array or a collection, if and only if it doesn't already exist in that array/collection. The key to making this work is checking for the duplicate NC pDisp, and not adding it if it exists. Because what happens is, NC fires with a particular URL, then a server-side redirect or URL change occurs and when this happens, the NC is fired again, BUT it happens with the same pDisp object that was used for the old URL. So the same pDisp object is sent to the second NC event now occuring for the second time with a new URL but still all being done with the exact same pDisp object.
Now, because you have a count of all unique NC pDisp objects, you can (one by one) remove them as each DC event occurs, by doing the typical If pDisp Is pDispArray(i) Then (this is in VB) comparison wrapped in a For Loop, and for each one taken off, your array count will get closer to 0. This is the accurate way to do it, however, this alone is not enough, as another NC/DC pair can occur after your count reaches 0. Also, you got to remember to do the exact same For Loop pDisp checking in the NavigateError event as you do in the DC event, because when a navigation error occurs, a NavigateError event is fired INSTEAD of the DC event.
I know this was a lot to take, but it took me years of having to deal with this dreaded control to figure these things out, I've got other code & methods if you need, but some of the stuff I mentioned here in relation to WB Navigation being truely ready, haven't been published online before, so I really hope you find them useful and let me know how you go. Also, if you want/need clarification on some of this let me know, unfortunately, the above isn't everything if you want to be 100% sure that the webpage is done loading, cheers.
PS: Also, forgot to mention, relying on URL's to do any sort of counting is inaccurate and a very bad idea because several frames can have the same URL - as an example, the www.microsoft.com website does this, there are like 3 frames or so calling MS's main site that you see in the address bar. Don't use URL's for any counting method.
First I've converted the document to XML and then used my magic method:
nodeXML = HtmlToXml.ConvertToXmlDocument((IHTMLDocument2)htmlDoc.DomDocument);
if (ExitWait(false))
return false;
conversion:
public static XmlNode ConvertToXmlDocument(IHTMLDocument2 doc2)
{
XmlDocument xmlDoc = new XmlDocument();
IHTMLDOMNode htmlNodeHTML = null;
XmlNode xmlNodeHTML = null;
try
{
htmlNodeHTML = (IHTMLDOMNode)((IHTMLDocument3)doc2).documentElement;
xmlDoc.AppendChild(xmlDoc.CreateXmlDeclaration("1.0", ""/*((IHTMLDocument2)htmlDoc.DomDocument).charset*/, ""));
xmlNodeHTML = xmlDoc.CreateElement("html"); // create root node
xmlDoc.AppendChild(xmlNodeHTML);
CopyNodes(xmlDoc, xmlNodeHTML, htmlNodeHTML);
}
catch (Exception err)
{
Utils.WriteLog(err, "Html2Xml.ConvertToXmlDocument");
}
magic method:
private bool ExitWait(bool bDelay)
{
if (m_bStopped)
return true;
if (bDelay)
{
DateTime now = DateTime.Now;
DateTime later = DateTime.Now;
TimeSpan difT = (later - now);
while (difT.TotalMilliseconds < MainDef.IE_PARSER_DELAY)
{
Application.DoEvents();
System.Threading.Thread.Sleep(10);
later = DateTime.Now;
difT = later - now;
if (m_bStopped)
return true;
}
}
return m_bStopped;
}
where m_bStopped is false by default, IE_PARSER_DELAY is a timeout value.
I hope this helps.

Categories

Resources