I am trying to access the webpage http://www.pof.com with C# code.
After successfully logging in as a user, I found that the document element is stored in an iframe, and I am not familiar with how to access it.
All I want is to get the HTML of the page that is loaded in the iframe and follow some of the links on that site.
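Use the following code to get the document inside the iframe:
var iframeDocument = document.getElementById('iframe1').contentWindow.document;
and then look up elements inside it (note that getElementById takes the id without a leading '#'):
var elemVal;
if (iframeDocument) {
    // 'someElementId' is a placeholder for the id of an element inside the iframe
    elemVal = iframeDocument.getElementById('someElementId');
}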
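Since the question asks about C#, here is a minimal sketch of the same idea using the WinForms WebBrowser control. It assumes the page is hosted in a WebBrowser, that the iframe is on the same domain as the outer page, and that the top-level DocumentCompleted event fires after the frames have loaded. The class name IframeReaderForm and the helper IsTopLevelCompletion are made up for illustration.

```csharp
using System;
using System.Windows.Forms;

public class IframeReaderForm : Form
{
    private readonly WebBrowser browser = new WebBrowser
    {
        Dock = DockStyle.Fill,
        ScriptErrorsSuppressed = true
    };

    public IframeReaderForm(string url)
    {
        Controls.Add(browser);
        browser.DocumentCompleted += OnDocumentCompleted;
        browser.Navigate(url);
    }

    // DocumentCompleted also fires for each frame, so compare the event URL
    // with the browser's current URL to detect the outer page finishing.
    public static bool IsTopLevelCompletion(Uri eventUrl, Uri browserUrl)
    {
        return eventUrl != null && eventUrl.Equals(browserUrl);
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        if (!IsTopLevelCompletion(e.Url, browser.Url)) return;

        foreach (HtmlWindow frame in browser.Document.Window.Frames)
        {
            try
            {
                HtmlDocument frameDoc = frame.Document;
                if (frameDoc == null || frameDoc.Body == null) continue;

                // HTML of the page loaded inside the iframe
                string html = frameDoc.Body.OuterHtml;
                Console.WriteLine("Frame {0}: {1} characters of HTML", frame.Url, html.Length);

                // Links found inside the iframe, which could be passed to Navigate
                foreach (HtmlElement link in frameDoc.Links)
                {
                    Console.WriteLine(link.GetAttribute("href"));
                }
            }
            catch (UnauthorizedAccessException)
            {
                // Thrown when the frame comes from another domain; skip it here.
            }
        }
    }
}
```

To try it, call Application.Run(new IframeReaderForm("http://www.pof.com")) from a Main method marked [STAThread]. Logging in is not covered by this sketch.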
How can I scrape data that are dynamically generated by JavaScript in html document using C#?
Using WebRequest and HttpWebResponse in the C# library, I'm able to get the whole html source code as a string, but the difficulty is that the data I want isn't contained in the source code; the data are generated dynamically by JavaScript.
On the other hand, if the data I want are already in the source code, then I'm able to get them easily using Regular Expressions.
I have downloaded HtmlAgilityPack, but I don't know if it would take care of the case where items are generated dynamically by JavaScript...
Thank you very much!
When you make the WebRequest, you're asking the server to give you the page file. That file's content hasn't yet been parsed or executed by a web browser, so the JavaScript on it hasn't done anything yet.
You need to use a tool to execute the JavaScript on the page if you want to see what the page looks like after being parsed by a browser. One option you have is using the built-in .NET WebBrowser control: http://msdn.microsoft.com/en-au/library/aa752040(v=vs.85).aspx
The WebBrowser control can navigate to and load the page, and then you can query its DOM, which will have been altered by the JavaScript on the page.
EDIT (example):
Uri uri = new Uri("http://www.somewebsite.com/somepage.htm");
webBrowserControl.AllowNavigation = true;
// optional but I use this because it stops javascript errors breaking your scraper
webBrowserControl.ScriptErrorsSuppressed = true;
// you want to start scraping after the document is finished loading so do it in the function you pass to this handler
webBrowserControl.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowserControl_DocumentCompleted);
webBrowserControl.Navigate(uri);
private void webBrowserControl_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    HtmlElementCollection divs = webBrowserControl.Document.GetElementsByTagName("div");
    foreach (HtmlElement div in divs)
    {
        // do something
    }
}
You could take a look at a tool like Selenium for scraping pages that rely on JavaScript.
http://www.andykelk.net/tech/headless-browser-testing-with-phantomjs-selenium-webdriver-c-nunit-and-mono
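As a rough sketch of the Selenium route, the snippet below loads a page in a headless browser, waits until an element matching a CSS selector exists, and returns the HTML after the scripts have run. It uses headless Chrome rather than the PhantomJS setup described in the linked article, and it assumes the Selenium WebDriver NuGet packages and a matching chromedriver are installed. The class name RenderedPage, the method GetRenderedHtml, and the 10-second timeout are choices made for illustration only.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

public static class RenderedPage
{
    // Returns the page's HTML after JavaScript has produced at least one
    // element matching cssSelector. Throws WebDriverTimeoutException if no
    // such element appears within 10 seconds.
    public static string GetRenderedHtml(string url, string cssSelector)
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless");

        // Disposing the driver closes the browser.
        using (IWebDriver driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl(url);

            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
            wait.Until(d => d.FindElements(By.CssSelector(cssSelector)).Count > 0);

            return driver.PageSource;
        }
    }
}
```

The returned string can then be handed to HtmlAgilityPack or any other parser, since at that point the dynamically generated items are part of the markup.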
I need to read this page in a WCF service:
http://bvmf.bmfbovespa.com.br/cias-listadas/empresas-listadas/ResumoEmpresaPrincipal.aspx?codigoCvm=9512&idioma=pt-br
But I want to read this node, which is generated dynamically by the server: class="ficha responsive"
When I use a method like
HtmlDocument doc = web.Load("http://bvmf.bmfbovespa.com.br/cias-listadas/empresas-listadas/ResumoEmpresaPrincipal.aspx?codigoCvm=9512&idioma=pt-br");
I do not get the full page, because the page dynamically calls this form:
<form name="aspnetForm"
method="post"
action="ResumoEmpresaPrincipal.aspx?codigoCvm=9512&idioma+=+pt+-+br&idioma=pt-br"
id="aspnetForm">
How can I load the FULL page, or post data to this web form, in C#? Or how can I load the full HTML content?
ResumoEmpresaPrincipal.aspx?codigoCvm=9512
The solution for reading the full page content is in this post:
Scraping webpage generated by javascript with C#
I know that I can access all iframes using the following properties of webbrowser:
string html = webBrowser1.Document.Window.Frames[0].WindowFrameElement.InnerText;
But I'm struggling with the cross-domain restriction.
My document URL is like www.subdomain1.sport.com/...
And the iframe's URL is like www.subdomain2.sport.com/...
How can I access the iframe's content and put some text into an input tag there?
I think you should refer to the following URL to get the content of an iframe that is on a different domain.
http://codecentrix.blogspot.com/2008/02/when-ihtmlwindow2document-throws.html
I'm having problems picking out data I need that's inside an iframe form. Is it even possible using HtmlAgilityPack? Here's a screenshot using Firebug so it's easier for you guys to see.
http://i.stack.imgur.com/ftt84.jpg
I need to parse out the post_form_id. I've tried
var value = doc.DocumentNode.SelectSingleNode("//input[@type='hidden' and @name='post_form_id']")
    .Attributes["value"].Value;
but that obviously won't work because the input is inside the iframe's form. I'd appreciate any help.
I would:
1. Use HtmlAgilityPack to find the iframe's location (its src attribute)
2. Use the System.Uri class to find the absolute link of the iframe page
3. Open this iframe page
4. Use HtmlAgilityPack again on the iframe page to find the required information
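The steps above can be sketched roughly as follows. This assumes the iframe tag is present in the HTML the server returns (not injected later by JavaScript) and that the iframe page can be fetched without cookies or a login, which may not hold for every site. The class and method names are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class IframeScraper
{
    // Steps 1 and 2: find the first iframe with a src attribute and resolve
    // that src against the URL of the outer page. Returns null if none exists.
    public static Uri FindIframeUri(HtmlDocument outerDoc, Uri pageUri)
    {
        HtmlNode iframe = outerDoc.DocumentNode.SelectSingleNode("//iframe[@src]");
        if (iframe == null) return null;

        // Attribute values keep HTML entities such as &amp;, so decode them first.
        string src = HtmlEntity.DeEntitize(iframe.GetAttributeValue("src", ""));
        return new Uri(pageUri, src);
    }

    // Step 4: read the value of a hidden input by name, or null if it is missing.
    public static string FindHiddenInputValue(HtmlDocument doc, string name)
    {
        HtmlNode input = doc.DocumentNode.SelectSingleNode(
            "//input[@type='hidden' and @name='" + name + "']");
        return input == null ? null : input.GetAttributeValue("value", null);
    }

    // All four steps together, fetching both pages over HTTP.
    public static string GetHiddenValueFromIframe(string pageUrl, string inputName)
    {
        var web = new HtmlWeb();
        HtmlDocument outerDoc = web.Load(pageUrl);

        Uri iframeUri = FindIframeUri(outerDoc, new Uri(pageUrl));
        if (iframeUri == null) return null;

        // Step 3: open the iframe page itself.
        HtmlDocument iframeDoc = web.Load(iframeUri.AbsoluteUri);
        return FindHiddenInputValue(iframeDoc, inputName);
    }
}
```

For the question above, the call would look like IframeScraper.GetHiddenValueFromIframe(pageUrl, "post_form_id"), where pageUrl is the address of the outer page.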