Accessing content inside #document of a website


I would like to access the content of a web page using C#. The content is inside an iframe in the body of the page, under a #document node. I am using this to read the page:
WebClient wbClient = new WebClient();
wbClient.UseDefaultCredentials = true;
byte[] raw = wbClient.DownloadData(stWebPage);
stWebPageContent = System.Text.Encoding.UTF8.GetString(raw);
However, the relevant information inside the #document is ignored.
Can anybody explain what I have to do to access the needed info? It is nested under body/div/iframe/#document/html/body/div/..... Thanks!

Note: I am assuming stWebPage points to an HTTP URL.
The iframe content is not downloaded by this single call. You need to find the iframe element in stWebPageContent (for example with a Regex), read the value of its src attribute, and make a second call to that URL to download the content.
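The two-step approach described above could be sketched roughly as follows. This is only a minimal sketch: the class and method names (IframeFetcher, FindIframeUrls, DownloadIframeContents) are made up for illustration, and the regular expression is a simplification that assumes the src attribute is written in quotes in the static HTML. It will not help if the iframe's src is set by script after the page loads.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

static class IframeFetcher
{
    // Matches <iframe ... src="..."> or src='...'. Good enough for simple
    // pages, but it is not a full HTML parser.
    static readonly Regex IframeSrc = new Regex(
        "<iframe\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']",
        RegexOptions.IgnoreCase);

    // Returns the absolute URL of every iframe found in the outer page's HTML.
    public static List<Uri> FindIframeUrls(string html, Uri pageUrl)
    {
        var urls = new List<Uri>();
        foreach (Match m in IframeSrc.Matches(html))
        {
            // Attribute values are HTML-encoded, so &amp; becomes & again here.
            string src = WebUtility.HtmlDecode(m.Groups[1].Value);
            // Resolves relative src values against the outer page's URL.
            urls.Add(new Uri(pageUrl, src));
        }
        return urls;
    }

    // Downloads the outer page, then each iframe document it references.
    public static List<string> DownloadIframeContents(string stWebPage)
    {
        var contents = new List<string>();
        using (var wbClient = new WebClient())
        {
            wbClient.UseDefaultCredentials = true;
            string outer = wbClient.DownloadString(stWebPage);
            foreach (Uri frameUrl in FindIframeUrls(outer, new Uri(stWebPage)))
                contents.Add(wbClient.DownloadString(frameUrl));
        }
        return contents;
    }
}
```

Calling IframeFetcher.DownloadIframeContents(stWebPage) would then return one string per iframe. For anything beyond simple markup, an HTML parser is likely a safer choice than a Regex.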

Related

C# asp.net Using WebClient, is there a way to get a web page's rendered Html?

Is there a way to get the fully rendered html of a web page using WebClient instead of the page source? I'm trying to scrape some data from the page's html. My current code is like this:
WebClient client = new WebClient();
var result = client.DownloadString("https://somepageoutthere.com/");
//using CsQuery
CQ dom = result;
var someElementHtml = dom["body > main"];

WebClient will only return the response for the URL you requested. It will not run any JavaScript on the page (which runs on the client), so if JavaScript changes the page DOM in any way, you will not see that through WebClient.
You are better off using some other tools. Look for those that will render the HTML and javascript in the page.

I don't know what you mean by "fully rendered", but if you mean "with all data loaded by ajax calls", the answer is: no, you can't.
Data that is not present in the initial HTML page is loaded through JavaScript in the browser. WebClient has no idea what JavaScript is and cannot interpret it; only browsers do.
To get this kind of data, you need to identify these calls (if you don't know the URL of the data web service, you can use tools like Fiddler), simulate/replay them from your application, and then, if successful, extract the data from the response. This is easy if the data comes as JSON and trickier if it comes as HTML.
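Replaying such a call might look something like the sketch below. Everything specific here is an assumption: dataUrl stands for whatever endpoint you find in Fiddler, fieldName for a property in its JSON response, and the AjaxReplay class is just a name for the example. It also assumes System.Text.Json is available (built into .NET Core 3.0 and later, a NuGet package on .NET Framework).

```csharp
using System;
using System.Net;
using System.Text.Json;

static class AjaxReplay
{
    // Pulls one string property out of a JSON object body.
    public static string ReadStringField(string json, string fieldName)
    {
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            return doc.RootElement.GetProperty(fieldName).GetString();
        }
    }

    // Requests the data endpoint directly, the way the page's script would.
    public static string FetchField(string dataUrl, string fieldName)
    {
        using (var client = new WebClient())
        {
            // Some endpoints only answer when these headers look like an
            // XMLHttpRequest call; whether yours needs them is an assumption.
            client.Headers[HttpRequestHeader.Accept] = "application/json";
            client.Headers["X-Requested-With"] = "XMLHttpRequest";
            string json = client.DownloadString(dataUrl);
            return ReadStringField(json, fieldName);
        }
    }
}
```

If the endpoint also expects cookies or a token issued by the original page, those would have to be captured and sent as well.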

Better to use http://html-agility-pack.net
It has the functionality needed to scrape web data, and the site has good documentation.

How can I get the source of an iframe that only works on specified domains?

So I'm trying to read the source of an url, let's say domain.xyz. No problem, I can simply get it work using HttpWebRequest.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddress);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
My problem is that it will return the page source, but without the source of the iframe inside this page. I only get something like this:
<iframe src="http://anotherdomain.xyz/frame_that_only_works_on_domain_xyz"></iframe>
I figured out that I can easily get the src of the iframe with WebBrowser or basic string functions (the results are the same) and create another HttpWebRequest using that address. The problem is that if I view the full page (where the frame is embedded) in a browser (Chrome), I get the expected results. But if I copy the src into another tab, the contents are not the same. It says that the content I want to view is blocked because it is only allowed through domain.xyz.
So my final question is:
How can I simulate the request through a specified domain, or get the full, rendered page source?

That's likely the Referer property of the web request: typically a browser tells the web server where it found the link to the page it is requesting.
That means that when you create the web request for the iframe, you set the Referer property of that request to the URL of the page containing the iframe.
If that doesn't work, cookies may be another option, i.e. you have to collect the cookies sent in response to the first request and send them with the second request.
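Both suggestions can be combined in one sketch. FramedPageFetcher and its methods are hypothetical names, and HttpWebRequest is used only because the question already uses it. Note that a CookieContainer only sends a cookie back to the domain that set it, so sharing the container helps only if the frame's server itself issued the cookie.

```csharp
using System;
using System.IO;
using System.Net;

static class FramedPageFetcher
{
    // Builds (but does not send) a GET request that shares a cookie jar and,
    // optionally, announces the page it was linked from.
    public static HttpWebRequest BuildRequest(string url, CookieContainer cookies, string referer = null)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = cookies; // cookies from responses are stored here
        if (referer != null)
            request.Referer = referer;     // sent as the Referer HTTP header
        return request;
    }

    // Sends the request and returns the response body as text.
    static string ReadBody(HttpWebRequest request)
    {
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
            return reader.ReadToEnd();
    }

    // Loads the outer page first, then the frame with the outer page as Referer.
    public static string FetchFrame(string pageUrl, string frameUrl)
    {
        var cookies = new CookieContainer();
        ReadBody(BuildRequest(pageUrl, cookies));
        return ReadBody(BuildRequest(frameUrl, cookies, referer: pageUrl));
    }
}
```

With the URLs from the question, the call would be FramedPageFetcher.FetchFrame("http://domain.xyz", "http://anotherdomain.xyz/frame_that_only_works_on_domain_xyz"). Whether the server is satisfied by the Referer alone is something you would have to try.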

Unable to obtain html content from a given url using method downloadstring(), how to resolve this?

I am trying to read the HTML content from a given URL. This is my simple C# code:
WebClient client = new WebClient();
var downloadString = client.DownloadString("https://www.yahoo.com"); // suppose
The problem is that I cannot obtain the HTML content. After executing the code, my IDE shows: "before you can move on please activate javascript", but JavaScript is enabled in all my browsers (Firefox/Explorer/Chrome).
How can I resolve this?

In your sample you misspelled the domain name. Otherwise it works just fine. http://screencast.com/t/3YibceMtSFu

Get response from php page

I need to call a .php page from my aspx.cs page, but I don't want to load the page. I just want to call it and have it return the XML response that I need to store in the DB. I am trying this with Ajax, but according to this link we are not able to call a cross-domain page from Ajax.
In short, I want to read the data from the PHP page using ASP.NET code.
Can anybody please help me out?
Update: Would a P3P policy be useful for cross-domain page calls?

I got the solution, thanks for your help.
// Create a new WebClient object
WebClient client = new WebClient();
string url = "http://testurl.com/test.php";
// Create a byte array to hold the returned data
byte[] html = client.DownloadData(url);
// Use the UTF8Encoding object to convert the byte array into a string
UTF8Encoding utf = new UTF8Encoding();
// Get the converted string
string mystring = utf.GetString(html);

If I understand you correctly, your problem is that you want to do a cross-domain Ajax call, which the browser blocks by default. The way around this is to make a call to your own backend, which then fetches the data from the other site and sends it back to the browser. Remember to do any needed safety checks in the backend, depending on how much you trust the other domain, of course (but even if you trust it 100%, it may be hacked or have some other problem that makes it return something other than what you think it returns).
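One possible form of such a safety check, for an XML response, is sketched below. The RemoteXml class and the expectedRoot parameter are assumptions made for this example. The idea is simply to refuse DTDs and external resources and to confirm the document has the expected shape before anything is stored.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

static class RemoteXml
{
    // Parses XML from an untrusted source with DTD processing disabled and
    // checks the root element name before the caller stores anything.
    public static XDocument ParseChecked(string xml, string expectedRoot)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Prohibit, // reject DOCTYPE declarations
            XmlResolver = null                      // never fetch external resources
        };
        using (var reader = XmlReader.Create(new StringReader(xml), settings))
        {
            // Throws XmlException if the text is not well-formed XML.
            XDocument doc = XDocument.Load(reader);
            if (doc.Root == null || doc.Root.Name.LocalName != expectedRoot)
                throw new InvalidDataException("Unexpected root element.");
            return doc;
        }
    }
}
```

The string returned by the PHP page would be passed through ParseChecked before being written to the DB. Real validation would likely go further, for example checking individual fields.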

how do you pull in the html from a URL?

I have one site that is displaying html content that needs to be displayed on another site (first site is front end to content management system).
So site1 page = http://site1/pagecontent
and on site 2 (http://site2/pagecontent) I need to be able to display the contents shown on site 1 page.
I am using .Net C# (MVC), and do not want to just use an iframe.
Is there a way to suck in the html into a string variable?

Yes, you can use the WebClient class: http://msdn.microsoft.com/en-us/library/system.net.webclient.aspx
WebClient wClient = new WebClient();
string s = wClient.DownloadString("http://site1/pagecontent"); // the site1 page whose HTML you want

Sure. See the System.Net.WebClient class, specifically the DownloadString() method.
