I have one site that is displaying html content that needs to be displayed on another site (first site is front end to content management system).
So site1 page = http://site1/pagecontent
and on site 2 (http://site2/pagecontent) I need to be able to display the contents shown on site 1 page.
I am using .Net C# (MVC), and do not want to just use an iframe.
Is there a way to suck in the html into a string variable?
Yes, you can use the WebClient class: http://msdn.microsoft.com/en-us/library/system.net.webclient.aspx
WebClient wClient = new WebClient();
string s = wClient.DownloadString("http://site1/pagecontent");
Sure. See the System.Net.WebClient class, specifically the DownloadString() method.
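For what it's worth, here is a slightly fuller sketch of the same idea. The class and method names (PageFetcher, FetchHtml) are made up for illustration, and it assumes the source page is served as UTF-8:

```csharp
using System;
using System.Net;
using System.Text;

public static class PageFetcher
{
    // Downloads the raw HTML at the given URL and returns it as a string.
    public static string FetchHtml(string url)
    {
        // WebClient is IDisposable, so release it once the download completes.
        using (var client = new WebClient())
        {
            // Assumption: the source page is UTF-8 encoded.
            client.Encoding = Encoding.UTF8;
            return client.DownloadString(url);
        }
    }
}
```

In an MVC action on site 2 you could then probably do something like return Content(PageFetcher.FetchHtml("http://site1/pagecontent"), "text/html"); keep in mind that relative links, images and scripts inside the fetched HTML will still point at paths that only exist on site 1.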
Related
Is there a way to get the fully rendered html of a web page using WebClient instead of the page source? I'm trying to scrape some data from the page's html. My current code is like this:
WebClient client = new WebClient();
var result = client.DownloadString("https://somepageoutthere.com/");
//using CsQuery
CQ dom = result;
var someElementHtml = dom["body > main"];
WebClient only returns the response for the URL you requested. It does not run any JavaScript on the page (that runs on the client), so if JavaScript changes the page DOM in any way, you will not see those changes through WebClient.
You are better off using some other tools. Look for ones that render the HTML and execute the JavaScript in the page.
I don't know what you mean by "fully rendered", but if you mean "with all data loaded by ajax calls", the answer is: no, you can't.
The data which is not present in the initial html page is loaded through javascript in the browser, and WebClient has no idea what javascript is, and cannot interpret it, only browsers do.
To get this kind of data, you need to identify these calls (if you don't know the URL of the data web service, a tool like Fiddler will show it), replay them from your application, and then extract the data from the response. That is easy if the data comes back as JSON and trickier if it comes back as HTML.
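A rough sketch of what replaying such a call could look like. Everything specific here is an assumption: the endpoint URL you pass in, the X-Requested-With header (some endpoints check it, many don't), and the shape of the JSON (an array of objects with a "title" property). It uses System.Text.Json, so it needs .NET Core 3.0 or later:

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.Json;

public static class AjaxReplay
{
    // Downloads the JSON that the page's script would have requested.
    public static string FetchJson(string endpointUrl)
    {
        using (var client = new WebClient())
        {
            // Assumption: the endpoint expects the header browsers' AJAX libraries often send.
            client.Headers["X-Requested-With"] = "XMLHttpRequest";
            return client.DownloadString(endpointUrl);
        }
    }

    // Collects the "title" string of every object in a JSON array.
    // The array-of-objects shape and the "title" name are hypothetical.
    public static List<string> ExtractTitles(string json)
    {
        var titles = new List<string>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            foreach (JsonElement item in doc.RootElement.EnumerateArray())
            {
                if (item.ValueKind == JsonValueKind.Object
                    && item.TryGetProperty("title", out JsonElement title)
                    && title.ValueKind == JsonValueKind.String)
                {
                    titles.Add(title.GetString());
                }
            }
        }
        return titles;
    }
}
```

You would call FetchJson with whatever URL Fiddler shows the page requesting, then adapt ExtractTitles to the real response structure.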
Better to use http://html-agility-pack.net
It has all the functionality needed to scrape web data, and there is good help on the site.
I would like to access content of a web-page using C#. The content is inside an i-Frame of the Body of the website, underlying an #document object. I am using this to read the page:
WebClient wbClient = new WebClient();
wbClient.UseDefaultCredentials = true;
byte[] raw = wbClient.DownloadData(stWebPage);
stWebPageContent = System.Text.Encoding.UTF8.GetString(raw);
However, the relevant information inside the #document is ignored.
Can anybody explain what I have to do to access the needed info? It is nested under body/div/iframe/#document/html/body/div/..... Thanks!
Note: I am assuming stWebPage points to an HTTP URL.
The iframe content is not downloaded by this one call. You need to find the iframe in stWebPageContent (using a regex, for example), pull the value of its 'src' attribute, and make another call to that URL to download the content. More details can be found at this link.
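A minimal sketch of that two-step approach, assuming the iframe has a quoted src attribute in the downloaded markup. IframeFinder and FindIframeUrls are names invented for this example, and a regex like this is fragile compared with a real HTML parser:

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class IframeFinder
{
    // Matches <iframe ... src="..."> or <iframe ... src='...'> and captures the src value.
    private static readonly Regex IframeSrc = new Regex(
        "<iframe\\b[^>]*?\\ssrc\\s*=\\s*[\"']([^\"']*)[\"']",
        RegexOptions.IgnoreCase);

    // Returns the absolute URL of every iframe found in the page's HTML.
    public static List<Uri> FindIframeUrls(string html, Uri pageUri)
    {
        var urls = new List<Uri>();
        foreach (Match match in IframeSrc.Matches(html))
        {
            // Attribute values may contain entities such as &amp;.
            string src = WebUtility.HtmlDecode(match.Groups[1].Value);
            // Resolve relative src values against the URL of the outer page.
            if (src.Length > 0 && Uri.TryCreate(pageUri, src, out Uri absolute))
            {
                urls.Add(absolute);
            }
        }
        return urls;
    }
}
```

Each returned Uri can then be passed to a second wbClient.DownloadData or DownloadString call. If the iframe's src is set by script after the page loads, it will not be in the downloaded markup and this will find nothing.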
Can I use something like WebClient or WebRequest (I don't know which) to do the following:
click a button, and once it's clicked, send a string, let's say "Hello", to a website's textbox,
for example: on button1 click, write "Hello" in textbox1 on the website http://Test.com
There are two general ways to deal with websites: either you speak straight HTTP in your C# program (this is WebRequest), or you use COM/Interop to control a browser.
If you need to work with the page's HTML the way a user sees it, you need to use Interop to remote-control a browser. Another alternative to look at is Selenium.
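As an illustration of the "straight HTTP" route, the sketch below posts a form value with WebClient.UploadValues instead of typing into a textbox. It assumes the page's form submits by POST to a URL you have looked up in its HTML, and that the field is named textbox1; both are guesses about the target site, and FormPoster is a made-up name:

```csharp
using System.Collections.Specialized;
using System.Net;
using System.Text;

public static class FormPoster
{
    // Builds the name/value pairs that the browser would send for the form.
    public static NameValueCollection BuildForm(string fieldName, string value)
    {
        return new NameValueCollection { { fieldName, value } };
    }

    // Posts the form to the given action URL and returns the response body.
    public static string Submit(string actionUrl, NameValueCollection form)
    {
        using (var client = new WebClient())
        {
            // UploadValues sends the pairs as application/x-www-form-urlencoded.
            byte[] response = client.UploadValues(actionUrl, "POST", form);
            // Assumption: the response is UTF-8 text.
            return Encoding.UTF8.GetString(response);
        }
    }
}
```

From the button1 click handler you would call something like FormPoster.Submit(actionUrl, FormPoster.BuildForm("textbox1", "Hello")). Nothing appears in a visible textbox this way; the server simply receives the value as if the form had been submitted.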
WebClient is a class in the System.Net namespace.
You can download the content through WebClient by writing a method like this:
public static string downloadcontent(string urlofpage)
{
    // Dispose the WebClient once the download has finished.
    using (WebClient client = new WebClient())
    {
        return client.DownloadString(urlofpage);
    }
}
This method returns the content of the page you want to download.
If you need something else, let me know in a comment.
I am using the code,
string loadFile = HttpContext.Current.Request.Url.AbsoluteUri;
// this.Response.ClearContent();
// this.Response.ClearHeaders();
this.Response.AppendHeader("content-disposition", "attachment; filename=" + filename);
this.Response.ContentType = "text/html";
this.Response.WriteFile("C:\\Users\\Desktop\\Jobspoint Website\\jobpoint3.0\\print.aspx");
this.Response.Flush();
this.Response.Close();
this.Response.End();
to download an aspx page in ASP.NET C#. But it's only showing the HTML tags and static values. How can I save the entire page, without the HTML tags and with the values that are retrieved from the database?
Thanks...
Leema
Use WebClient for this. It will download your file.
If I have understood correctly, one option would be to actually make a request to the web server, using WebClient for example, and then write the response to that request to Response.OutputStream. This means that the server will actually make a second request to itself and then send the response to that second request back to the client.
This way you will have the web server actually process the request and return the resulting HTML back to you rather than just the raw aspx page.
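Roughly, that could look like the sketch below. RenderedPageRelay and CopyRenderedPage are invented names, and the helper takes a plain Stream so that it does not depend on System.Web; in the page you would pass this.Response.OutputStream:

```csharp
using System.IO;
using System.Net;

public static class RenderedPageRelay
{
    // Requests the page over HTTP so the server renders it,
    // then copies the rendered bytes to the given output stream.
    public static void CopyRenderedPage(string pageUrl, Stream output)
    {
        using (var client = new WebClient())
        {
            byte[] rendered = client.DownloadData(pageUrl);
            output.Write(rendered, 0, rendered.Length);
        }
    }
}
```

The call would be something like RenderedPageRelay.CopyRenderedPage(urlOfPrintPage, this.Response.OutputStream), where urlOfPrintPage is the http URL of print.aspx rather than its path on disk. The second request is a fresh one, so it will not carry the user's cookies or session unless you add them yourself.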
I'm downloading a web site using WebClient
public void download()
{
client = new WebClient();
client.DownloadStringCompleted += new DownloadStringCompletedEventHandler(client_DownloadStringCompleted);
client.Encoding = Encoding.UTF8;
client.DownloadStringAsync(new Uri(eUrl.Text));
}
void client_DownloadStringCompleted(object sender, DownloadStringCompletedEventArgs e)
{
SaveFileDialog sd = new SaveFileDialog();
if (sd.ShowDialog() == DialogResult.OK)
{
StreamWriter writer = new StreamWriter(sd.FileName,false,Encoding.Unicode);
writer.Write(e.Result);
writer.Close();
}
}
This works fine. But I am unable to read content that is loaded using ajax. Like this:
<div class="center-box-body" id="boxnews" style="width:768px;height:1167px; ">
loading .... </div>
<script language="javascript">
ajax_function('boxnews',"ajax/category/personal_notes/",'');
</script>
This "ajax_function" downloads data from server on the client side.
How can I download the full web html data?
To do so, you would need to host a Javascript runtime inside of a full-blown web browser. Unfortunately, WebClient isn't capable of doing this.
Your only option would be automation of a WebBrowser control. You would need to send it to the URL, wait until both the main page and any AJAX content has been loaded (including triggering that load if user action is required to do so), then scrape the entire DOM.
If you are only scraping a particular site, you are probably better off just pulling the AJAX URL yourself (simulating all required parameters), rather than pulling the web page that calls for it.
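For the page in the question, pulling the AJAX URL yourself might look like this sketch. It assumes ajax_function requests its second argument as a URL relative to the page and sends no extra parameters, which you would need to confirm by inspecting the script or the traffic; AjaxUrl and its methods are hypothetical names:

```csharp
using System;
using System.Net;
using System.Text;

public static class AjaxUrl
{
    // Resolves the relative path passed to ajax_function against the page URL.
    public static Uri Resolve(string pageUrl, string relativeAjaxPath)
    {
        return new Uri(new Uri(pageUrl), relativeAjaxPath);
    }

    // Downloads the fragment that the script would have loaded into the div.
    public static string DownloadFragment(string pageUrl, string relativeAjaxPath)
    {
        using (var client = new WebClient())
        {
            // Matches the UTF-8 setting used in the question's download() method.
            client.Encoding = Encoding.UTF8;
            return client.DownloadString(Resolve(pageUrl, relativeAjaxPath));
        }
    }
}
```

Usage would be along the lines of AjaxUrl.DownloadFragment(eUrl.Text, "ajax/category/personal_notes/"), after which you can save the fragment or splice it into the "boxnews" div of the page you already downloaded.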
I think you'd need to use a WebBrowser control to do this since you actually need the javascript on the page to run to complete the page load. Depending on your application this may or may not be possible for you -- note it's a Windows.Forms control.
When you visit a page in a browser, it
1. downloads a document from the requested url
2. downloads anything referenced by an img, link, script, etc. tag (anything that references an external file)
3. executes javascript where applicable.
The WebClient class only performs step 1. It encapsulates a single http request and response. It does not contain a script engine, and does not, as far as I know, find image tags, etc that reference other files and initiate further requests to obtain those files.
If you want to get a page once it's been modified by an AJAX call and handler, you'll need to use a class that has the full capabilities of a web browser, which pretty much means using a web browser that you can somehow automate server-side. The WebBrowser control does this, but it's for WinForms only, I think. I shudder to think of the security issues here, or the demand that would be placed on the server if multiple users are taking advantage of this facility simultaneously.
A better question to ask yourself is: why are you doing this? If the data you're really interested in is being obtained via AJAX (probably through a web service), why not skip the WebClient step and just go straight to the source?