How to get HTML code from a webpage? - C#

I'm trying to get HTML code from a specific webpage, but when I do it using
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(pageURL);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader streamReader = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding("windows-1251"));
string htmlCode = streamReader.ReadToEnd();
streamReader.Close();
or using WebClient, I get redirected to a login page and I get its code.
Is there any other way to get HTML code?
I read some information here: How to get HTML from a current request, in a postback, but I didn't understand what I should do, or how and where to specify the URL.
P.S.:
I'm logged in in a browser, and "right click - view source code" (which opens in Notepad++) shows exactly what I need.
Thanks.

If you get redirected to a login page, then presumably you must be logged in before you can get the content.
So you need to make a request, with suitable credentials, to the login page. Get whatever tokens are sent (usually in the form of cookies) to maintain the login. Then request the page you want (sending the cookies with the request).
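As a rough sketch of that flow with HttpWebRequest, assuming a form-based login (the URLs and field names below are placeholders; check the real ones in your browser's dev tools):
using System;
using System.IO;
using System.Net;
using System.Text;

// One container shared by both requests, so the login cookies carry over.
CookieContainer cookies = new CookieContainer();

// 1. POST the credentials to the login form (placeholder URL and field names).
byte[] formData = Encoding.UTF8.GetBytes("username=me&password=secret");
HttpWebRequest login = (HttpWebRequest)WebRequest.Create("https://example.com/login");
login.Method = "POST";
login.ContentType = "application/x-www-form-urlencoded";
login.ContentLength = formData.Length;
login.CookieContainer = cookies;
using (Stream body = login.GetRequestStream())
    body.Write(formData, 0, formData.Length);
login.GetResponse().Close();   // session cookies are now stored in `cookies`

// 2. Request the page you actually want, with the same cookie container.
HttpWebRequest page = (HttpWebRequest)WebRequest.Create("https://example.com/protected");
page.CookieContainer = cookies;
using (HttpWebResponse response = (HttpWebResponse)page.GetResponse())
using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding("windows-1251")))
{
    string htmlCode = reader.ReadToEnd();
    Console.WriteLine(htmlCode);
}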
Alternatively (and this is the preferred approach), most major sites that expect automated systems to interact with them provide an API (often using OAuth for authentication). Consult their documentation to see how their API works.

If the page you want to get to is behind a login screen, you're going to need to do the login mechanism through code, and add a CookieContainer to hold the login cookie that the website will try to drop on your request.
Alternatively, if you have a user who can help the program along, you could try listing the cookies for the site once they've logged in through their browser. Copy that cookie across and add it to the CookieContainer.
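For instance (the cookie name, value and domain here are placeholders for whatever the browser actually shows):
using System.Net;

CookieContainer cookies = new CookieContainer();
// Values copied out of the browser's cookie list for the site.
cookies.Add(new Cookie("SESSIONID", "value-from-browser", "/", "example.com"));

HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://example.com/protected");
request.CookieContainer = cookies;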
Cheers
Simon

If you want to scrape an HTML page that requires authentication, I suggest you use WatiN
to fill in the proper fields and navigate to the pages you want to download.
It may seem a little like overkill at first glance, but it will save a lot of trouble later.
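A minimal WatiN sketch (it drives a real Internet Explorer instance and needs an STA thread; the URL and element names below are assumptions about the target form):
using WatiN.Core;

using (var browser = new IE("https://example.com/login"))
{
    // Fill the login form and submit it, just as a user would.
    browser.TextField(Find.ByName("username")).TypeText("me");
    browser.TextField(Find.ByName("password")).TypeText("secret");
    browser.Button(Find.ByValue("Login")).Click();

    // Navigate to the page behind the login and grab its source.
    browser.GoTo("https://example.com/protected");
    string html = browser.Html;
}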

Related

How to post to a sign-in form with Flurl

I want to perform a sign-in on a URL (i.e. eclass.aueb.gr) to get the source code of the next page (the portfolio of the user).
What I have now is the code from the documentation...
var response = await "https://eclass.aueb.gr/index.php".PostUrlEncodedAsync(new
{
uname = "name",
pass = "pass"
});
Currently the response is the HTML of that URL itself.
The page is most likely following the PRG pattern (as it should, per best practices), so the fact that the response you are getting is the original page just means you didn't code the redirect part. Do you need to? Not if all you need to do is log in.
Are you successfully logged in? Hard to say for certain. response.StatusCode might be "Unauthorized" (401) if it failed (you could test it against invalid credentials), but since you're scraping a site designed for browsers rather than automation (like an API), you might have to pick through that big string of HTML you got back and look for error messages. Then again, see what happens in a browser when you try to log in with invalid credentials.
And remember: Chrome DevTools are your friend, particularly the Network tab in this case.
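One way to carry the login over to the next request is Flurl's CookieSession (available in recent Flurl.Http versions), which replays the cookies from the login response on every subsequent request in the session. The portfolio path below is a guess; check the real URL in the Network tab:
using Flurl.Http;

using var session = new CookieSession("https://eclass.aueb.gr");

// Same login POST as before, but inside the session so its cookies are kept.
var loginResponse = await session.Request("index.php")
    .PostUrlEncodedAsync(new { uname = "name", pass = "pass" });

// If the login succeeded, this should return the logged-in page's source
// ("portfolio.php" is a placeholder path).
string html = await session.Request("portfolio.php").GetStringAsync();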

WebRequest class to post data to login form

I want to use the WebRequest class to post data to a website. This works fine, however the website I'm posting to requires cookies/sessions (it's a login form). After logging in I need to retrieve some account information (this is information on a specific page).
How can I make sure the login information is being stored? In AutoIt I did this using a hidden web browser, but I want to use a console application for it.
My current code (to login) is too long to post here, so it can be found here.
Take a look at my aspx sessions scraper on Bitbucket. It does exactly what you are asking for, including some ASPX WebForms-specific extensions, like sending postbacks etc.
You need to store the cookie that you get after logging in and then send that cookie when you request pages containing personal information.
Here is an example of using cookies with WebRequest
It is possible that you can't connect because the session has ended, in which case you need to log in again.
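A sketch of that idea (the URLs and the expiry check are assumptions; many sites answer an expired session with a redirect back to the login page):
using System.Net;

CookieContainer cookies = new CookieContainer();   // lives across requests = your "session"

// ...log in once with a POST that uses this same CookieContainer...

HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://example.com/account");
request.CookieContainer = cookies;

using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    // If we were bounced back to the login page, the session has expired:
    // repeat the login request and then retry this one.
    bool sessionExpired = response.ResponseUri.AbsolutePath.Contains("login");
}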

Set Referer header in asp.net

This should be an easy question, but I've been unable to solve it. I'm trying to change the Referer header prior to redirecting the page of an HttpResponse object. I know this can be done on an HttpWebRequest, but I can't get this to work for a standard Page.Response.
I'm trying to just set the referer header to look like it originated from a temp page on my site (this is for analytics tracking for an external system).
Is this possible to do?
I've tried the code below (as well as variations such as Response.AppendHeader and Response.AddHeader), however the Referer always shows as the page that the request originated from.
Response.Headers.Add("Referer", "http://test.local/fromA");
Response.Redirect(HttpContext.Current.Request.Url.AbsoluteUri);
If not via .net can this be accomplished via js?
Thanks!
Referer is controlled (and sent) by the client. You can't affect it server-side. There may be some JavaScript you could emit that would get the client to do it, but it would probably be considered a security flaw, so I wouldn't count on it.
The referrer is set by the client, not the server. It belongs in a request, not a response, since it points to the URL the request came from.
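Where you do control the client, e.g. if your own code reports the page view instead of the browser, the Referer can be set on the outgoing request (the Referer value is from the question; the reporting endpoint is a placeholder):
using System.Net;

HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://analytics.example.com/track");
// Referer is a request header, so this is the side where it can legitimately be set.
request.Referer = "http://test.local/fromA";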

Getting data from a webpage

I have an idea for an App that would really help me out in work but I'm not sure if it's possible.
I want to run a C# desktop application that will ask for a value. When a value is supplied, the application will open a browser, go to a webpage and add the value into a form on an online website. The form is then submitted, and a new page is loaded that contains a table of results. I then want to extract the table of results from the page source and write code to parse the result values.
It is not important that the user sees this happen in an actual browser. In other words, if there's a way to do it by reading HTTP requests then that's great.
The biggest problem I have is getting the values into the form and then retrieving the page source after the form is submitted and the next page loads.
Any help really appreciated.
Thanks
Provided that you're only using this in a legal context:
Usually, web forms are sent via POST request to the web server, specifically some script that handles it. You can look at the HTML code for the form's page and find out the destination for the form (form's action).
You can then use an HttpWebRequest in C# to "pretend to be the form", sending a POST request with all the required parameters in the request body.
As a result you will get the source code of the destination page as it would be sent to the browser. You can parse this.
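For the parsing step, one option is an HTML parser such as HtmlAgilityPack rather than string hacking (the table id below is made up; inspect the real page for the right XPath):
using System;
using HtmlAgilityPack;

string html = "...";   // the response body returned by the POST described above

var doc = new HtmlDocument();
doc.LoadHtml(html);

// "results" is a placeholder id - adjust the XPath to the actual table.
var rows = doc.DocumentNode.SelectNodes("//table[@id='results']//tr");
if (rows != null)
{
    foreach (HtmlNode row in rows)
    {
        var cells = row.SelectNodes("td");
        if (cells == null) continue;   // e.g. a header row containing <th> cells
        foreach (HtmlNode cell in cells)
            Console.Write(cell.InnerText.Trim() + "\t");
        Console.WriteLine();
    }
}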
This is definitely possible and you don't need to use an actual web browser for this. You can simply use a System.Net.WebClient to send your HTTP request and get an HTTP response.
I suggest using Wireshark (or Firefox + Firebug); it allows you to see HTTP requests and responses. By looking at the HTTP traffic you can see exactly how you should form your HTTP request and which parameters you should be setting.
You don't need to involve the browser with this. WebClient should do all that you require. You'll need to see what's actually being posted when you submit the form with the browser, and then you should be able to make a POST request using the WebClient and retrieve the resulting page as a string.
The docs for the WebClient constructor have a nice example.
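As a minimal sketch of that (the URL and field name are placeholders; copy the real ones from what the browser actually posts):
using System.Collections.Specialized;
using System.Net;
using System.Text;

using (var client = new WebClient())
{
    // Field names must match the <input name="..."> values in the form.
    var form = new NameValueCollection
    {
        { "searchValue", "12345" }   // placeholder field and value
    };

    // UploadValues sends an application/x-www-form-urlencoded POST and
    // returns the body of the page the server responds with.
    byte[] responseBytes = client.UploadValues("https://example.com/search", form);
    string resultPage = Encoding.UTF8.GetString(responseBytes);
}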
See e.g. this question for some pointers on at least the data-retrieval side. You're going to know a lot more about the HTTP protocol before you're done with this...
Why would you do this through web pages if you don't even want the user to do anything?
Web pages are purely for interaction with users, if you simply want data transfer, use WCF.
@Brian: using Wireshark will result in a very angry network manager; make sure you are actually allowed to use it.

Download a file over HTTPS C# - Cookie and Header Prob?

I am trying to download a file over HTTPS and I just keep running into a brick wall with correctly setting Cookies and Headers.
Does anyone have/know of any code that I can review for doing this correctly? i.e. download a file over HTTPS and set cookies/headers?
Thanks!
I did this the other day. In summary, you need to create an HttpWebRequest and HttpWebResponse to submit/receive data. Since you need to maintain cookies across multiple requests, you need to create a cookie container to hold your cookies. You can set header properties on the request/response if needed as well...
Basic Concept:
using System.Net;

// Create cookie container (a place to store cookies across multiple requests)
CookieContainer cookies = new CookieContainer();

// Request page
HttpWebRequest req = (HttpWebRequest)WebRequest.Create("https://www.amazon.com");
req.CookieContainer = cookies;

// Response output (could be a page, PDF, CSV, etc...)
HttpWebResponse resp = (HttpWebResponse)req.GetResponse();

// Add response cookies to the cookie container
// (I only had to do this for the first "login" request)
cookies.Add(resp.Cookies);
The key to figuring this out is capturing the traffic for a real request. I did this using Fiddler, and over the course of a few captures (almost 10) I figured out what I needed to do to reproduce the login to a site where I needed to run some reports based on different selection criteria (date range, parts, etc.) and download the results into CSV files. It's working perfectly, but Fiddler was the key to figuring it out.
http://www.fiddler2.com/fiddler2/
Good Luck.
Zach
This fellow wrote an application to download files using HTTP:
http://www.codeproject.com/KB/IP/DownloadDemo.aspx
Not quite sure what you mean by setting cookies and headers. Is that required by the site you are downloading from? If it is, what cookies and headers need to be set?
I've had good luck with the WebClient class. It's a wrapper for HttpWebRequest that can save a few lines of code: http://msdn.microsoft.com/en-us/library/system.net.webclient.aspx
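One caveat: WebClient has no built-in cookie support, so for the cookie side of this question a common trick (not an official API) is a small subclass that attaches a shared CookieContainer to every underlying request:
using System;
using System.Net;

// A WebClient that carries cookies across requests.
class CookieAwareWebClient : WebClient
{
    public CookieContainer Cookies { get; } = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        HttpWebRequest http = request as HttpWebRequest;
        if (http != null)
            http.CookieContainer = Cookies;   // attach the shared cookie jar
        return request;
    }
}

// Usage: perform the login request first (so the cookies are captured),
// then download the file; both go through the same CookieContainer.
var client = new CookieAwareWebClient();
client.DownloadFile("https://example.com/report.csv", "report.csv");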
