HTML Agility Pack Problems with W3C tools - c#

I'm trying to access the HTML result of the w3C mobileOK Checker by passing a url such as
http://validator.w3.org/mobile/check?async=false&docAddr=http%3A%2F%2Fwww.google.com/%2Ftv%2F
The URL works if you put it in a browser, but I can't seem to access it via the Html Agility Pack. The reason is probably that the URL needs to send a number of requests to its server, since it's an online testing tool, so it's not just a "static" URL. I have accessed other URLs without any problems. Below is my code:
HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument webGet = hw.Load("http://validator.w3.org/mobile/check?async=false&docAddr=http%3A%2F%2Fwww.google.com/%2Ftv%2F");
HtmlNodeCollection nodes = webGet.DocumentNode.SelectNodes("//head");
if (nodes != null)
{
    foreach (HtmlNode n in nodes)
    {
        string x = n.InnerHtml;
    }
}
Edit: I tried to access it via Stream Reader and the website returns the following error: The remote server returned an error: (403) Forbidden.
I'm guessing that it's related.

I checked your example and was able to reproduce the described behaviour. It seems to me that w3.org checks whether the requesting program is a browser or something else.
I created an extended WebClient class for another project of my own, and was able to access the given URL successfully.
Program.cs
WebClientExtended client = new WebClientExtended();
string exportPath = @"e:\temp"; // adapt to your own needs
string url = "http://validator.w3.org/mobile/check?async=false&docAddr=http%3A%2F%2Fwww.google.com/%2Ftv%2F";
// load the HTML using the custom WebClient class,
// but use HtmlAgilityPack for parsing, manipulation and so on
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(System.Text.Encoding.UTF8.GetString(client.DownloadData(url)));
doc.Save(Path.Combine(exportPath, "check.html"));
WebClientExtended
public class WebClientExtended : WebClient
{
#region Fields
private CookieContainer container = new CookieContainer();
#endregion
#region Properties
public CookieContainer CookieContainer
{
get { return container; }
set { container = value; }
}
#endregion
#region Constructors
public WebClientExtended()
{
this.container = new CookieContainer();
}
#endregion
#region Methods
protected override WebRequest GetWebRequest(Uri address)
{
WebRequest r = base.GetWebRequest(address);
var request = r as HttpWebRequest;
if (request != null) // do the null check BEFORE dereferencing the cast result
{
    request.AllowAutoRedirect = false;
    request.ServicePoint.Expect100Continue = false;
    request.CookieContainer = container;
    request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
    request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"; // IE 11
    request.KeepAlive = true; // Keep-Alive is a restricted header and must be set via the property
    request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
}
r.Headers.Set("Accept-Encoding", "gzip, deflate, sdch");
r.Headers.Set("Accept-Language", "de-AT,de;q=0.8,en;q=0.6,en-US;q=0.4,fr;q=0.2");
return r;
}
protected override WebResponse GetWebResponse(WebRequest request)
{
WebResponse response = base.GetWebResponse(request);
if (!string.IsNullOrEmpty(response.Headers["Location"]))
{
request = GetWebRequest(new Uri(response.Headers["Location"]));
request.ContentLength = 0;
response = GetWebResponse(request);
}
return response;
}
#endregion
}
I think the crucial point is the addition/manipulation of the User-Agent, Accept-Encoding and Accept-Language headers. The result of my code is the downloaded page check.html.
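For completeness, a minimal sketch that stays entirely within the Html Agility Pack: HtmlWeb exposes a PreRequest hook that lets you adjust the outgoing HttpWebRequest, so presenting browser-like headers may be enough on its own for servers that only sniff the User-Agent (untested against the validator itself):

```csharp
using System;
using System.Net;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var hw = new HtmlWeb();
        // Present ourselves as a regular browser before each request.
        hw.PreRequest = request =>
        {
            request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko";
            request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
            request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
            return true; // proceed with the request
        };
        HtmlDocument doc = hw.Load("http://validator.w3.org/mobile/check?async=false&docAddr=http%3A%2F%2Fwww.google.com/%2Ftv%2F");
        HtmlNode title = doc.DocumentNode.SelectSingleNode("//head/title");
        Console.WriteLine(title != null ? title.InnerText : "(no title)");
    }
}
```

If this alone doesn't get past the 403, the WebClientExtended approach above with cookies and Accept-Language is the fallback.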

Related

How can I get html from page with cloudflare ddos portection?

I use HtmlAgilityPack to get webpage data, but I've tried everything with pages using www.cloudflare.com protection for DDoS. The redirect page is impossible to handle in HtmlAgilityPack because they don't redirect with meta nor JS, I guess; they check whether you have already been verified via a cookie, which I failed to simulate with C#. When I get the page, the HTML code is from the Cloudflare landing page.
I also encountered this problem some time ago. The real solution would be to solve the challenge the Cloudflare website gives you (you need to compute a correct answer using JavaScript, send it back, and then you receive a cookie / your token with which you can continue to view the website). Otherwise, all you get is the Cloudflare landing page.
In the end, I just called a Python script with a shell-execute. I used the modules provided within this GitHub fork. This could serve as a starting point to implement the circumvention of the Cloudflare anti-DDoS page in C# as well.
FYI, the python script I wrote for my personal usage just wrote the cookie in a file. I read that later again using C# and store it in a CookieJar to continue browsing the page within C#.
#!/usr/bin/env python
import cfscrape
import sys
scraper = cfscrape.create_scraper() # returns a requests.Session object
fd = open("cookie.txt", "w")
c = cfscrape.get_cookie_string(sys.argv[1])
fd.write(str(c))
fd.close()
print(c)
EDIT: To repeat, this has only LITTLE to do with cookies! Cloudflare forces you to solve a REAL challenge using JavaScript. It's not as easy as accepting a cookie and using it later on. Look at https://github.com/Anorov/cloudflare-scrape/blob/master/cfscrape/__init__.py and the ~40 lines of JavaScript emulation used to solve the challenge.
Edit 2: Instead of writing something to circumvent the protection, I've also seen people using a fully-fledged browser object (not a headless browser) to go to the website and subscribe to certain events when the page is loaded. Use the WebBrowser class to create an infinitely small browser window and subscribe to the appropriate events.
Edit 3:
Alright, I actually implemented the C# way to do this. It uses the Jint JavaScript engine for .NET, available via https://www.nuget.org/packages/Jint
The cookie-handling code is ugly because sometimes the HttpWebResponse class won't pick up the cookies, although the header contains a Set-Cookie section.
using System;
using System.Net;
using System.IO;
using System.Text.RegularExpressions;
using System.Web;
using System.Collections;
using System.Threading;
namespace Cloudflare_Evader
{
public class CloudflareEvader
{
/// <summary>
/// Tries to return a WebClient with the necessary cookies installed to make requests to a Cloudflare-protected website.
/// </summary>
/// <param name="url">The page which is behind Cloudflare's anti-DDoS protection</param>
/// <returns>A WebClient object or null on failure</returns>
public static WebClient CreateBypassedWebClient(string url)
{
var JSEngine = new Jint.Engine(); //Use this JavaScript engine to compute the result.
//Download the original page
var uri = new Uri(url);
HttpWebRequest req =(HttpWebRequest) WebRequest.Create(url);
req.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0";
//Try to make the usual request first. If this fails with a 503, the page is behind cloudflare.
try
{
var res = req.GetResponse();
string html = "";
using (var reader = new StreamReader(res.GetResponseStream()))
html = reader.ReadToEnd();
return new WebClient();
}
catch (WebException ex) //We usually get this because of a 503 service not available.
{
string html = "";
using (var reader = new StreamReader(ex.Response.GetResponseStream()))
html = reader.ReadToEnd();
//If we land on the challenge page, Cloudflare gives us a user-ID token in a cookie. We need to save that and use it in the next request.
var cookie_container = new CookieContainer();
//using a custom function because ex.Response.Cookies returns an empty set ALTHOUGH cookies were sent back.
var initial_cookies = GetAllCookiesFromHeader(ex.Response.Headers["Set-Cookie"], uri.Host);
foreach (Cookie init_cookie in initial_cookies)
cookie_container.Add(init_cookie);
/* Solve the actual challenge with a bunch of regexes. Copy-pasted from the Python scraper version. */
var challenge = Regex.Match(html, "name=\"jschl_vc\" value=\"(\\w+)\"").Groups[1].Value;
var challenge_pass = Regex.Match(html, "name=\"pass\" value=\"(.+?)\"").Groups[1].Value;
var builder = Regex.Match(html, @"setTimeout\(function\(\){\s+(var t,r,a,f.+?\r?\n[\s\S]+?a\.value =.+?)\r?\n").Groups[1].Value;
builder = Regex.Replace(builder, @"a\.value =(.+?) \+ .+?;", "$1");
builder = Regex.Replace(builder, @"\s{3,}[a-z](?: = |\.).+", "");
//Format the javascript..
builder = Regex.Replace(builder, @"[\n\\']", "");
//Execute it.
long solved = long.Parse(JSEngine.Execute(builder).GetCompletionValue().ToObject().ToString());
solved += uri.Host.Length; //add the length of the domain to it.
Console.WriteLine("***** SOLVED CHALLENGE ******: " + solved);
Thread.Sleep(3000); //This sleep IS required or Cloudflare will not give you the token!
//Retrieve the cookies. Prepare the URL for cookie exfiltration.
string cookie_url = string.Format("{0}://{1}/cdn-cgi/l/chk_jschl", uri.Scheme, uri.Host);
var uri_builder = new UriBuilder(cookie_url);
var query = HttpUtility.ParseQueryString(uri_builder.Query);
//Add our answers to the GET query
query["jschl_vc"] = challenge;
query["jschl_answer"] = solved.ToString();
query["pass"] = challenge_pass;
uri_builder.Query = query.ToString();
//Create the actual request to get the security clearance cookie
HttpWebRequest cookie_req = (HttpWebRequest) WebRequest.Create(uri_builder.Uri);
cookie_req.AllowAutoRedirect = false;
cookie_req.CookieContainer = cookie_container;
cookie_req.Referer = url;
cookie_req.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0";
//We assume that this request goes through well, so no try-catch
var cookie_resp = (HttpWebResponse)cookie_req.GetResponse();
//The response *should* contain the security clearance cookie!
if (cookie_resp.Cookies.Count != 0) //first check if the HttpWebResponse has picked up the cookie.
foreach (Cookie cookie in cookie_resp.Cookies)
cookie_container.Add(cookie);
else //otherwise, use the custom function again
{
//the cookie we *hopefully* received here is the cloudflare security clearance token.
if (cookie_resp.Headers["Set-Cookie"] != null)
{
var cookies_parsed = GetAllCookiesFromHeader(cookie_resp.Headers["Set-Cookie"], uri.Host);
foreach (Cookie cookie in cookies_parsed)
cookie_container.Add(cookie);
}
else
{
//No security clearance? Something went wrong... return null.
//Console.WriteLine("MASSIVE ERROR: COULDN'T GET CLOUDFLARE CLEARANCE!");
return null;
}
}
//Create a custom webclient with the two cookies we already acquired.
WebClient modedWebClient = new WebClientEx(cookie_container);
modedWebClient.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0");
modedWebClient.Headers.Add("Referer", url);
return modedWebClient;
}
}
/* Credit goes to https://stackoverflow.com/questions/15103513/httpwebresponse-cookies-empty-despite-set-cookie-header-no-redirect
(user https://stackoverflow.com/users/541404/cameron-tinker) for these functions
*/
public static CookieCollection GetAllCookiesFromHeader(string strHeader, string strHost)
{
ArrayList al = new ArrayList();
CookieCollection cc = new CookieCollection();
if (strHeader != string.Empty)
{
al = ConvertCookieHeaderToArrayList(strHeader);
cc = ConvertCookieArraysToCookieCollection(al, strHost);
}
return cc;
}
private static ArrayList ConvertCookieHeaderToArrayList(string strCookHeader)
{
strCookHeader = strCookHeader.Replace("\r", "");
strCookHeader = strCookHeader.Replace("\n", "");
string[] strCookTemp = strCookHeader.Split(',');
ArrayList al = new ArrayList();
int i = 0;
int n = strCookTemp.Length;
while (i < n)
{
if (strCookTemp[i].IndexOf("expires=", StringComparison.OrdinalIgnoreCase) > 0)
{
al.Add(strCookTemp[i] + "," + strCookTemp[i + 1]);
i = i + 1;
}
else
al.Add(strCookTemp[i]);
i = i + 1;
}
return al;
}
private static CookieCollection ConvertCookieArraysToCookieCollection(ArrayList al, string strHost)
{
CookieCollection cc = new CookieCollection();
int alcount = al.Count;
string strEachCook;
string[] strEachCookParts;
for (int i = 0; i < alcount; i++)
{
strEachCook = al[i].ToString();
strEachCookParts = strEachCook.Split(';');
int intEachCookPartsCount = strEachCookParts.Length;
string strCNameAndCValue = string.Empty;
string strPNameAndPValue = string.Empty;
string strDNameAndDValue = string.Empty;
string[] NameValuePairTemp;
Cookie cookTemp = new Cookie();
for (int j = 0; j < intEachCookPartsCount; j++)
{
if (j == 0)
{
strCNameAndCValue = strEachCookParts[j];
if (strCNameAndCValue != string.Empty)
{
int firstEqual = strCNameAndCValue.IndexOf("=");
string firstName = strCNameAndCValue.Substring(0, firstEqual);
string allValue = strCNameAndCValue.Substring(firstEqual + 1, strCNameAndCValue.Length - (firstEqual + 1));
cookTemp.Name = firstName;
cookTemp.Value = allValue;
}
continue;
}
if (strEachCookParts[j].IndexOf("path", StringComparison.OrdinalIgnoreCase) >= 0)
{
strPNameAndPValue = strEachCookParts[j];
if (strPNameAndPValue != string.Empty)
{
NameValuePairTemp = strPNameAndPValue.Split('=');
if (NameValuePairTemp[1] != string.Empty)
cookTemp.Path = NameValuePairTemp[1];
else
cookTemp.Path = "/";
}
continue;
}
if (strEachCookParts[j].IndexOf("domain", StringComparison.OrdinalIgnoreCase) >= 0)
{
strPNameAndPValue = strEachCookParts[j];
if (strPNameAndPValue != string.Empty)
{
NameValuePairTemp = strPNameAndPValue.Split('=');
if (NameValuePairTemp[1] != string.Empty)
cookTemp.Domain = NameValuePairTemp[1];
else
cookTemp.Domain = strHost;
}
continue;
}
}
if (cookTemp.Path == string.Empty)
cookTemp.Path = "/";
if (cookTemp.Domain == string.Empty)
cookTemp.Domain = strHost;
cc.Add(cookTemp);
}
return cc;
}
}
/*Credit goes to https://stackoverflow.com/questions/1777221/using-cookiecontainer-with-webclient-class
(user https://stackoverflow.com/users/129124/pavel-savara) */
public class WebClientEx : WebClient
{
public WebClientEx(CookieContainer container)
{
this.container = container;
}
public CookieContainer CookieContainer
{
get { return container; }
set { container = value; }
}
private CookieContainer container = new CookieContainer();
protected override WebRequest GetWebRequest(Uri address)
{
WebRequest r = base.GetWebRequest(address);
var request = r as HttpWebRequest;
if (request != null)
{
request.CookieContainer = container;
}
return r;
}
protected override WebResponse GetWebResponse(WebRequest request, IAsyncResult result)
{
WebResponse response = base.GetWebResponse(request, result);
ReadCookies(response);
return response;
}
protected override WebResponse GetWebResponse(WebRequest request)
{
WebResponse response = base.GetWebResponse(request);
ReadCookies(response);
return response;
}
private void ReadCookies(WebResponse r)
{
var response = r as HttpWebResponse;
if (response != null)
{
CookieCollection cookies = response.Cookies;
container.Add(cookies);
}
}
}
}
The function returns a WebClient with the solved challenge and cookies inside. You can use it as follows:
static void Main(string[] args)
{
WebClient client = null;
while (client == null)
{
Console.WriteLine("Trying..");
client = CloudflareEvader.CreateBypassedWebClient("http://anilinkz.tv");
}
Console.WriteLine("Solved! We're clear to go");
Console.WriteLine(client.DownloadString("http://anilinkz.tv/anime-list"));
Console.ReadLine();
}
A "simple" working method to bypass Cloudflare if you don't want to use libraries (which sometimes do not work):
Open a "hidden" WebBrowser (size 1,1 or so).
Open the root of your target Cloudflare site.
Get the cookies from WebBrowser.
Use these cookies in WebClient.
Make sure the UserAgent for both WebBrowser and WebClient is identical. Cloudflare will give you a 503 if there is a mismatch on the WebClient afterwards.
You will need to search here on Stack Overflow for how to get cookies from WebBrowser and how to modify WebClient so you can set its CookieContainer, plus modify the UserAgent on one or both so they are identical.
Since the cookies from Cloudflare seem to never expire, you can serialize the cookies somewhere temporary and load them each time you run your app, perhaps with a verification and a refetch on failure.
I've been doing this for a while and it works quite well. I could not get the C# libs to work for one specific Cloudflare site while they worked on others; no clue why yet.
This also works behind the scenes on an IIS server, but you will have to use "frowned upon" settings: run the app pool as SYSTEM or ADMIN and set it to Classic mode.
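Step 3 of the list above (getting the cookies out of the WebBrowser) can be sketched roughly like this. Note the caveats: this is an untested sketch, it assumes the page has finished loading, and WebBrowser.Document.Cookie only exposes non-HttpOnly cookies, so an HttpOnly clearance cookie would have to be pulled via the WinInet InternetGetCookieEx API instead:

```csharp
using System;
using System.Net;
using System.Windows.Forms;

static class BrowserCookies
{
    // Copy the WebBrowser's visible cookies into a CookieContainer for reuse in a WebClient.
    public static CookieContainer FromBrowser(WebBrowser browser, Uri uri)
    {
        var container = new CookieContainer();
        // Document.Cookie returns "name1=value1; name2=value2; ..."
        string raw = browser.Document.Cookie ?? "";
        foreach (string pair in raw.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
        {
            int eq = pair.IndexOf('=');
            if (eq <= 0) continue; // skip malformed fragments
            string name = pair.Substring(0, eq).Trim();
            string value = pair.Substring(eq + 1).Trim();
            container.Add(new Cookie(name, value, "/", uri.Host));
        }
        return container;
    }
}
```

The resulting container can then be fed into a cookie-aware WebClient such as the WebClientEx class shown earlier, remembering to keep the UserAgent identical.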
A present-day answer should mention the FlareSolverr project.
It is meant to be deployed as a container using Docker, so you only have to give it a port and it's running.
It doesn't weigh on your project, since you don't import a library, and it is currently maintained. The only downside I see is that you need to install Docker to make it work.
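Calling a FlareSolverr instance from C# is then an ordinary HTTP call. A minimal sketch, assuming the default port 8191 and the documented request.get command (check the project's README for the current API shape):

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class FlareSolverrDemo
{
    static async Task Main()
    {
        using (var http = new HttpClient())
        {
            // FlareSolverr listens on port 8191 by default when run via Docker.
            string payload = "{\"cmd\": \"request.get\", \"url\": \"https://example.com\", \"maxTimeout\": 60000}";
            var content = new StringContent(payload, Encoding.UTF8, "application/json");
            HttpResponseMessage resp = await http.PostAsync("http://localhost:8191/v1", content);
            // The JSON response contains the solved page HTML plus the clearance cookies,
            // which you can extract and reuse in your own requests.
            string json = await resp.Content.ReadAsStringAsync();
            Console.WriteLine(json);
        }
    }
}
```

Parsing the returned JSON (for the solution.response HTML and solution.cookies array) is left to whatever JSON library the project already uses.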
Use WebClient to get the HTML of the page.
I wrote the following class, which handles cookies too;
just pass a CookieContainer instance to the constructor.
using System;
using System.Collections.Generic;
using System.Configuration;
using System.Linq;
using System.Net;
using System.Text;
namespace NitinJS
{
public class SmsWebClient : WebClient
{
public SmsWebClient(CookieContainer container, Dictionary<string, string> Headers)
: this(container)
{
foreach (var keyVal in Headers)
{
this.Headers[keyVal.Key] = keyVal.Value;
}
}
public SmsWebClient(bool flgAddContentType = true)
: this(new CookieContainer(), flgAddContentType)
{
}
public SmsWebClient(CookieContainer container, bool flgAddContentType = true)
{
this.Encoding = Encoding.UTF8;
System.Net.ServicePointManager.Expect100Continue = false;
ServicePointManager.MaxServicePointIdleTime = 2000;
this.container = container;
if (flgAddContentType)
this.Headers["Content-Type"] = "application/json";//"application/x-www-form-urlencoded";
this.Headers["Accept"] = "application/json, text/javascript, */*; q=0.01";// "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
//this.Headers["Accept-Encoding"] = "gzip, deflate";
this.Headers["Accept-Language"] = "en-US,en;q=0.5";
this.Headers["User-Agent"] = "Mozilla/5.0 (Windows NT 6.1; rv:23.0) Gecko/20100101 Firefox/23.0";
this.Headers["X-Requested-With"] = "XMLHttpRequest";
//this.Headers["Connection"] = "keep-alive";
}
private readonly CookieContainer container = new CookieContainer();
protected override WebRequest GetWebRequest(Uri address)
{
WebRequest r = base.GetWebRequest(address);
var request = r as HttpWebRequest;
if (request != null)
{
request.CookieContainer = container;
request.Timeout = 3600000; // 60 * 60 * 1000 = one hour
}
return r;
}
protected override WebResponse GetWebResponse(WebRequest request, IAsyncResult result)
{
WebResponse response = base.GetWebResponse(request, result);
ReadCookies(response);
return response;
}
protected override WebResponse GetWebResponse(WebRequest request)
{
WebResponse response = base.GetWebResponse(request);
ReadCookies(response);
return response;
}
private void ReadCookies(WebResponse r)
{
var response = r as HttpWebResponse;
if (response != null)
{
CookieCollection cookies = response.Cookies;
container.Add(cookies);
}
}
}
}
USAGE:
CookieContainer cookies = new CookieContainer();
SmsWebClient client = new SmsWebClient(cookies);
string html = client.DownloadString("http://www.google.com");

Using HttpWebRequest to login to instagram

Hey guys, so I'm trying to write a C# application in which the user can log in to their Instagram account from a WPF app. The problem I'm having is getting the authorization code. When I use this code, I keep getting the login page URL, not the successful login page.
Help please!
Any feedback is appreciated! I've been stuck on this a while.
private static AuthInfo GetInstagramAuth(string oAuthUri, string clientId, string redirectUri, InstagramConfig config,
string login, string password)
{
List<Auth.Scope> scopes = new List<Auth.Scope>();
scopes.Add(Auth.Scope.basic);
var link = InstaSharp.Auth.AuthLink(oAuthUri, clientId, redirectUri, scopes);
// Log in at the specified endpoint
CookieAwareWebClient client = new CookieAwareWebClient();
// Load the login page
var result = client.DownloadData(link);
var html = System.Text.Encoding.Default.GetString(result);
// Grab the CSRF token
string csr = "";
string pattern = @"csrfmiddlewaretoken""\svalue=""(.+)""";
var r = new System.Text.RegularExpressions.Regex(pattern);
var m = r.Match(html);
csr = m.Groups[1].Value;
// Log in
string loginLink = string.Format(
"https://instagram.com/accounts/login/?next=/oauth/authorize/%3Fclient_id%3D{0}%26redirect_uri%3Dhttp%3A//kakveselo.ru%26response_type%3Dcode%26scope%3Dbasic", clientId);
NameValueCollection parameters = new NameValueCollection();
parameters.Add("csrfmiddlewaretoken", csr);
parameters.Add("username", login);
parameters.Add("password", password);
// We need to add the secret cookies obtained before login
// Apparently headers are needed too
string agent = "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)";
client.Headers["Referer"] = loginLink;
client.Headers["Host"] = "instagram.com";
//client.Headers["Connection"] = "Keep-Alive";
client.Headers["Content-Type"] = "application/x-www-form-urlencoded";
//client.Headers["Content-Length"] = "88";
client.Headers["User-Agent"] = agent;
// client.Headers["Accept-Language"] = "ru-RU";
//client.Headers["Accept-Encoding"] = "gzip, deflate";
client.Headers["Accept"] = "text/html, application/xhtml+xml, */*";
client.Headers["Cache-Control"] = "no-cache";
// The request
var result2 = client.UploadValues(loginLink, "POST", parameters);
// Post the data, get the code
// The new link points to instagram.com, not the API
string newPostLink = string.Format(
"https://instagram.com/oauth/authorize/?client_id={0}&redirect_uri=http://kakveselo.ru&response_type=code&scope=basic", clientId);
HttpWebRequest request =
(HttpWebRequest) WebRequest.Create(newPostLink);
request.AllowAutoRedirect = false;
request.CookieContainer = client.Cookies;
request.Referer = newPostLink;
request.Method = "POST";
request.ContentType = "application/x-www-form-urlencoded";
request.UserAgent = agent;
string postData = String.Format("csrfmiddlewaretoken={0}&allow=Authorize", csr);
request.ContentLength = postData.Length;
ASCIIEncoding encoding = new ASCIIEncoding();
byte[] loginDataBytes = encoding.GetBytes(postData);
request.ContentLength = loginDataBytes.Length;
Stream stream = request.GetRequestStream();
stream.Write(loginDataBytes, 0, loginDataBytes.Length);
// send the request
var response = request.GetResponse();
string location = response.Headers["Location"];
Console.ForegroundColor = ConsoleColor.Red;
Console.WriteLine("--Response from the webrequest--");
Console.ResetColor();
Console.WriteLine(((HttpWebResponse)response).ResponseUri+"\n\n");
// Now extract the code and authenticate
pattern = @"kakveselo.ru\?code=(.+)";
r = new System.Text.RegularExpressions.Regex(pattern);
m = r.Match(location);
string code = m.Groups[1].Value;
// Finally, get the authentication token
var auth = new InstaSharp.Auth(config); //.OAuth(InstaSharpConfig.config);
// now we have to call back to instagram and include the code they gave us
// along with our client secret
var oauthResponse = auth.RequestToken(code);
return oauthResponse;
}
}
I was using this website as an example, and CookieAwareWebClient is just a WebClient that handles cookies. I'll post it below:
using System;
/// <summary>
/// A Cookie-aware WebClient that will store authentication cookie information and persist it through subsequent requests.
/// </summary>
using System.Net;
public class CookieAwareWebClient : WebClient
{
//Properties to handle implementing a timeout
private int? _timeout = null;
public int? Timeout
{
get
{
return _timeout;
}
set
{
_timeout = value;
}
}
//A CookieContainer class to house the Cookie once it is contained within one of the Requests
public CookieContainer Cookies { get; private set; }
//Constructor
public CookieAwareWebClient()
{
Cookies = new CookieContainer();
}
//Method to handle setting the optional timeout (in milliseconds)
public void SetTimeout(int timeout)
{
_timeout = timeout;
}
//This handles using and storing the Cookie information as well as managing the Request timeout
protected override WebRequest GetWebRequest(Uri address)
{
//Handles the CookieContainer
var request = (HttpWebRequest)base.GetWebRequest(address);
request.CookieContainer = Cookies;
//Sets the Timeout if it exists
if (_timeout.HasValue)
{
request.Timeout = _timeout.Value;
}
return request;
}
}
Are you sure the login process on the website doesn't use JavaScript in some step(s)?
As far as I'm aware, if that's the case, web requests won't do the job.
All data/actions that are JavaScript-related will be non-existent through mere web requests.
I've noticed that, for security reasons, websites with personal accounts tend to mix JavaScript into their login process now, to avoid bot requests.
Okay, so I figured out the issue. If you want to use web requests and responses, you need to make sure the header information is correct. The issue with mine was that I wasn't passing enough information from the browser. To see this information I used Tamper Data.
It's an add-on for Firefox and allows you to look at everything you are sending to or receiving from the server.

How to pass cookies to HtmlAgilityPack or WebClient?

I use this code to login:
CookieCollection cookies = new CookieCollection();
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("example.com");
request.CookieContainer = new CookieContainer();
request.CookieContainer.Add(cookies);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
cookies = response.Cookies;
string getUrl = "example.com";
string postData = String.Format("my parameters");
HttpWebRequest getRequest = (HttpWebRequest)WebRequest.Create(getUrl);
getRequest.CookieContainer = new CookieContainer();
getRequest.CookieContainer.Add(cookies);
getRequest.Method = WebRequestMethods.Http.Post;
getRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0";
getRequest.AllowWriteStreamBuffering = true;
getRequest.ProtocolVersion = HttpVersion.Version11;
getRequest.AllowAutoRedirect = true;
getRequest.ContentType = "application/x-www-form-urlencoded";
byte[] byteArray = Encoding.ASCII.GetBytes(postData);
getRequest.ContentLength = byteArray.Length;
Stream newStream = getRequest.GetRequestStream();
newStream.Write(byteArray, 0, byteArray.Length);
newStream.Close();
HttpWebResponse getResponse = (HttpWebResponse)getRequest.GetResponse();
using (StreamReader sr = new StreamReader(getResponse.GetResponseStream(), Encoding.GetEncoding("windows-1251")))
{
doc.LoadHtml(sr.ReadToEnd());
webBrowser1.DocumentText = doc.DocumentNode.OuterHtml;
}
then I want to use HtmlWeb (HtmlAgilityPack) or Webclient to parse the HTML to HtmlDocument(HtmlAgilityPack).
My problem is that when I use:
WebClient wc = new WebClient();
webBrowser1.DocumentText = wc.DownloadString(site);
or
doc = web.Load(site);
webBrowser1.DocumentText = doc.DocumentNode.OuterHtml;
The login disappears, so I think I must somehow pass the cookies. Any suggestions?
Check HtmlAgilityPack.HtmlDocument Cookies
Here is an example of what you're looking for (syntax not 100% tested, I just modified some class I usually use):
public class MyWebClient
{
//The cookies will be here.
private CookieContainer _cookies = new CookieContainer();
//In case you need to clear the cookies
public void ClearCookies() {
_cookies = new CookieContainer();
}
public HtmlDocument GetPage(string url) {
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Method = "GET";
//Set more parameters here...
//...
//This is the important part.
request.CookieContainer = _cookies;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
var stream = response.GetResponseStream();
//When you get the response from the website, the cookies will be stored
//automatically in "_cookies".
using (var reader = new StreamReader(stream)) {
string html = reader.ReadToEnd();
var doc = new HtmlDocument();
doc.LoadHtml(html);
return doc;
}
}
}
Here is how you use it:
var client = new MyWebClient();
HtmlDocument doc = client.GetPage("http://somepage.com");
//This request will be sent with the cookies obtained from the page
doc = client.GetPage("http://somepage.com/another-page");
Note: If you also want to use POST method, just create a method similar to GetPage with the POST logic, refactor the class, etc.
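A sketch of what that suggested POST counterpart might look like, sharing the same _cookies field as GetPage (untested; assumes using directives for System.IO, System.Net and System.Text, and a urlencoded body passed in by the caller):

```csharp
public HtmlDocument PostPage(string url, string postData)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Method = "POST";
    request.ContentType = "application/x-www-form-urlencoded";
    //Reuse the same container, so cookies from GetPage carry over.
    request.CookieContainer = _cookies;
    byte[] body = Encoding.UTF8.GetBytes(postData);
    request.ContentLength = body.Length;
    using (var stream = request.GetRequestStream())
        stream.Write(body, 0, body.Length);
    using (var response = (HttpWebResponse)request.GetResponse())
    using (var reader = new StreamReader(response.GetResponseStream()))
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(reader.ReadToEnd());
        return doc;
    }
}
```

As with GetPage, any Set-Cookie headers in the response are captured automatically into _cookies via the shared CookieContainer.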
There are some recommendations here: Using CookieContainer with WebClient class
However, it's probably just easier to keep using the HttpWebRequest and set the cookie in the CookieContainer:
HTTPWebRequest and CookieContainer
http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.cookiecontainer.aspx
The code looks something like this:
// Create a HttpWebRequest
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(getUrl);
// Create the cookie container and add a cookie
request.CookieContainer = new CookieContainer();
// Add all the cookies
foreach (Cookie cookie in response.Cookies)
{
request.CookieContainer.Add(cookie);
}
The second thing is that you don't need to download the site again, since you already have it from your web response and you're saving it here:
HttpWebResponse getResponse = (HttpWebResponse)getRequest.GetResponse();
using (StreamReader sr = new StreamReader(getResponse.GetResponseStream(), Encoding.GetEncoding("windows-1251")))
{
webBrowser1.DocumentText = doc.DocumentNode.OuterHtml;
}
You should be able to just take the HTML and parse it with the HTML Agility Pack:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(webBrowser1.DocumentText);
And that should do it... :)
Try caching cookies from previous response locally and resend them each web request as follows:
private CookieCollection cookieCollection;
...
parserObject = new HtmlWeb
{
AutoDetectEncoding = true,
PreRequest = request =>
{
if (cookieCollection != null)
    foreach (Cookie cookie in cookieCollection)
        request.CookieContainer.Add(cookie);
return true;
},
PostResponse = (request, response) => { cookieCollection = response.Cookies; }
};

C# WebClient login to accounts.google.com

I'm having a very difficult time trying to authenticate to accounts.google.com using WebClient.
I'm using the C# WebClient object to achieve the following:
I'm submitting form fields to https://accounts.google.com/ServiceLoginAuth?service=oz
Here is POST Fields:
service=oz
dsh=-8355435623354577691
GALX=33xq1Ma_CKI
timeStmp=
secTok=
Email=test@test.xom
Passwd=password
signIn=Sign in
PersistentCookie=yes
rmShown=1
Now when login page loads before I submit data it has following headers:
Content-Type text/html; charset=UTF-8
Strict-Transport-Security max-age=2592000; includeSubDomains
Set-Cookie GAPS=1:QClFh_dKle5DhcdGwmU3m6FiPqPoqw:SqdLB2u4P2oGjt_x;Path=/;Expires=Sat, 21-Dec-2013 07:31:40 GMT;Secure;HttpOnly
Cache-Control no-cache, no-store
Pragma no-cache
Expires Mon, 01-Jan-1990 00:00:00 GMT
X-Frame-Options Deny
X-Auto-Login realm=com.google&args=service%3Doz%26continue%3Dhttps%253A%252F%252Faccounts.google.com%252FManageAccount
Content-Encoding gzip
Transfer-Encoding chunked
Date Thu, 22 Dec 2011 07:31:40 GMT
X-Content-Type-Options nosniff
X-XSS-Protection 1; mode=block
Server GSE
OK, now how do I use the WebClient class to include those headers?
I have tried webClient_.Headers.Add(), but it has limited effect and always returns the login page.
Below is the class that I use. Would appreciate any help.
Getting login page
public void LoginPageRequest(Account acc)
{
var rparams = new RequestParams();
rparams.URL = @"https://accounts.google.com/ServiceLoginAuth?service=oz";
rparams.RequestName = "LoginPage";
rparams.Account = acc;
webClient_.DownloadDataAsync(new Uri(rparams.URL), rparams);
}
void webClient__DownloadDataCompleted(object sender, DownloadDataCompletedEventArgs e)
{
RequestParams rparams = (RequestParams)e.UserState;
if (rparams.RequestName == "LoginPage")
{
ParseLoginRequest(e.Result, e.UserState);
}
}
Now I get the form fields using HtmlAgilityPack and add them to the parameters collection:
public void ParseLoginRequest(byte[] data, object UserState)
{
RequestParams rparams = (RequestParams)UserState;
rparams.ClearParams();
ASCIIEncoding encoder = new ASCIIEncoding();
string html = encoder.GetString(data);
HtmlNode.ElementsFlags.Remove("form");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
HtmlNode form = doc.GetElementbyId("gaia_loginform");
rparams.URL = form.GetAttributeValue("action", string.Empty);
rparams.RequestName = "LoginPost";
var inputs = form.Descendants("input");
foreach (var element in inputs)
{
string name = element.GetAttributeValue("name", "undefined");
string value = element.GetAttributeValue("value", "");
if (!name.Equals("undefined")) {
if (name.ToLower().Equals("email"))
{
value = rparams.Account.Email;
}
else if (name.ToLower().Equals("passwd"))
{
value = rparams.Account.Password;
}
rparams.AddParam(name,value);
Console.WriteLine(name + "-" + value);
}
}
webClient_.UploadValuesAsync(new Uri(rparams.URL), "POST", rparams.GetParams, rparams);
}
After I post the data I get the login page back, rather than a redirect or a success message.
What am I doing wrong?
After some fiddling around, it looks like the WebClient class is not the best approach to this particular problem.
To achieve the goal I had to drop down one level, to WebRequest.
When making a WebRequest (HttpWebRequest) and using HttpWebResponse, it is possible to set a CookieContainer:
webRequest_ = (HttpWebRequest)HttpWebRequest.Create(rparams.URL);
webRequest_.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)";
CookieContainer cookieJar = new CookieContainer();
webRequest_.CookieContainer = cookieJar;
string html = string.Empty;
try
{
using (WebResponse response = webRequest_.GetResponse())
{
using (var streamReader = new StreamReader(response.GetResponseStream()))
{
html = streamReader.ReadToEnd();
ParseLoginRequest(html, response,cookieJar);
}
}
}
catch (WebException e)
{
using (WebResponse response = e.Response)
{
HttpWebResponse httpResponse = (HttpWebResponse)response;
Console.WriteLine("Error code: {0}", httpResponse.StatusCode);
using (var streamReader = new StreamReader(response.GetResponseStream()))
Console.WriteLine(html = streamReader.ReadToEnd());
}
}
and then, when making the POST, use the same CookieContainer in the following manner:
webRequest_ = (HttpWebRequest)HttpWebRequest.Create(rparams.URL);
webRequest_.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)";
webRequest_.Method = "POST";
webRequest_.ContentType = "application/x-www-form-urlencoded";
webRequest_.CookieContainer = cookieJar;
var parameters = new StringBuilder();
foreach (var key in rparams.Params)
{
parameters.AppendFormat("{0}={1}&",HttpUtility.UrlEncode(key.ToString()),
HttpUtility.UrlEncode(rparams.Params[key.ToString()]));
}
parameters.Length -= 1; // trim the trailing '&'
using (var writer = new StreamWriter(webRequest_.GetRequestStream()))
{
writer.Write(parameters.ToString());
}
string html = string.Empty;
using (WebResponse response = webRequest_.GetResponse())
{
using (var streamReader = new StreamReader(response.GetResponseStream()))
{
html = streamReader.ReadToEnd();
}
}
So this works. The code is not production-ready and can/should be optimized; treat it just as an example.
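The manual body-building in the POST above (UrlEncode each name and value, join with `&`, trim the trailing `&`) can be written more compactly. A minimal, self-contained sketch of just that step, using `string.Join` and `WebUtility.UrlEncode` instead of the `RequestParams` helper from the answer (which is not shown in full):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;

class FormBody
{
    // Builds an application/x-www-form-urlencoded body from name/value pairs,
    // equivalent to the AppendFormat loop plus the trailing-'&' trim above.
    public static string Build(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return string.Join("&", fields.Select(kv =>
            WebUtility.UrlEncode(kv.Key) + "=" + WebUtility.UrlEncode(kv.Value)));
    }

    static void Main()
    {
        var fields = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("Email", "test@test.xom"),
            new KeyValuePair<string, string>("signIn", "Sign in"),
        };
        // Email=test%40test.xom&signIn=Sign+in
        Console.WriteLine(Build(fields));
    }
}
```

`string.Join` avoids the mutable `StringBuilder` and the off-by-one trim entirely, which is the usual source of a stray `&` in hand-built form bodies.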
This is a quick example written in the answer pane and untested. You will probably need to parse some values out of an initial request for some form values to go into formData. A lot of my code is based on this type of process, unless we need to scrape Facebook/Spokeo-type sites, in which case the AJAX makes us use a different approach.
using System;
using System.Collections.Generic;
using System.Collections.Specialized;
using System.Linq;
using System.Net;
using System.Text;
namespace GMailTest
{
class Program
{
private static NameValueCollection formData = new NameValueCollection();
private static CookieAwareWebClient webClient = new CookieAwareWebClient();
static void Main(string[] args)
{
formData.Clear();
formData["service"] = "oz";
formData["dsh"] = "-8355435623354577691";
formData["GALX"] = "33xq1Ma_CKI";
formData["timeStmp"] = "";
formData["secTok"] = "";
formData["Email"] = "test@test.xom";
formData["Passwd"] = "password";
formData["signIn"] = "Sign in";
formData["PersistentCookie"] = "yes";
formData["rmShown"] = "1";
byte[] responseBytes = webClient.UploadValues("https://accounts.google.com/ServiceLoginAuth?service=oz", "POST", formData);
string responseHTML = Encoding.UTF8.GetString(responseBytes);
}
}
public class CookieAwareWebClient : WebClient
{
public CookieAwareWebClient() : this(new CookieContainer())
{ }
public CookieAwareWebClient(CookieContainer c)
{
this.CookieContainer = c;
this.Headers.Add("User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.52 Safari/536.5");
}
public CookieContainer CookieContainer { get; set; }
protected override WebRequest GetWebRequest(Uri address)
{
WebRequest request = base.GetWebRequest(address);
if (request is HttpWebRequest)
{
(request as HttpWebRequest).CookieContainer = this.CookieContainer;
}
return request;
}
}
}

how to login in https sites with the help of webrequest and response

How do I log in to HTTPS sites with the help of WebRequest and WebResponse in C#?
Here is the code:
public string postFormData(Uri formActionUrl, string postData)
{
gRequest = (HttpWebRequest)WebRequest.Create(formActionUrl);
gRequest.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4";
gRequest.CookieContainer = new CookieContainer();
gRequest.Method = "POST";
gRequest.Accept = " text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, */*";
gRequest.KeepAlive = true;
gRequest.ContentType = #"text/html; charset=iso-8859-1";
#region CookieManagement
if (this.gCookies != null && this.gCookies.Count > 0)
{
gRequest.CookieContainer.Add(gCookies);
}
//logic to post data to the form
byte[] postBuffer = System.Text.Encoding.GetEncoding(1252).GetBytes(postData);
gRequest.ContentLength = postBuffer.Length;
Stream postDataStream = gRequest.GetRequestStream();
postDataStream.Write(postBuffer, 0, postBuffer.Length);
postDataStream.Close();
//post data logic ends
//Get Response for this request url
gResponse = (HttpWebResponse)gRequest.GetResponse();
//check if the status code is http 200 or http ok
if (gResponse.StatusCode == HttpStatusCode.OK)
{
//get all the cookies from the current request and add them to the response object cookies
gResponse.Cookies = gRequest.CookieContainer.GetCookies(gRequest.RequestUri);
//check if response object has any cookies or not
if (gResponse.Cookies.Count > 0)
{
//check if this is the first request/response, if this is the response of first request gCookies
//will be null
if (this.gCookies == null)
{
gCookies = gResponse.Cookies;
}
else
{
foreach (Cookie oRespCookie in gResponse.Cookies)
{
bool bMatch = false;
foreach (Cookie oReqCookie in this.gCookies)
{
if (oReqCookie.Name == oRespCookie.Name)
{
oReqCookie.Value = oRespCookie.Value; // copy the new value, not the name
bMatch = true;
break;
}
}
if (!bMatch)
this.gCookies.Add(oRespCookie);
}
}
}
#endregion
StreamReader reader = new StreamReader(gResponse.GetResponseStream());
string responseString = reader.ReadToEnd();
reader.Close();
//Console.Write("Response String:" + responseString);
return responseString;
}
else
{
return "Error in posting data";
}
}
// calling the above function
httphelper.postFormData(new Uri("https://login.yahoo.com/config/login?.done=http://answers.yahoo.com%2f&.src=knowsrch&.intl=us"), ".tries=1&.src=knowsrch&.md5=&.hash=&.js=&.last=&promo=&.intl=us&.bypass=&.partner=&.u=0b440p15q1nmb&.v=0&.challenge=Rt_fM1duQiNDnI5SrzAY_GETpNTL&.yplus=&.emailCode=&pkg=&stepid=&.ev=&hasMsgr=0&.chkP=Y&.done=http%3A%2F%2Fanswers.yahoo.com%2F&.pd=knowsrch_ver%3D0%26c%3D%26ivt%3D%26sg%3D&login=xyz&passwd=xyz&.save=Sign+In");
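The cookie bookkeeping in the loop above (update an existing cookie by name, otherwise add it) is worth isolating. A hedged, self-contained sketch of just that merge step, with the Value/Name mix-up corrected; note that in practice, reusing one CookieContainer across requests does this bookkeeping for you automatically:

```csharp
using System;
using System.Net;

class CookieMerge
{
    // Merge fresh response cookies into the session collection:
    // an existing cookie with the same name is updated in place,
    // anything new is appended.
    public static void Merge(CookieCollection session, CookieCollection fresh)
    {
        foreach (Cookie incoming in fresh)
        {
            bool matched = false;
            foreach (Cookie existing in session)
            {
                if (existing.Name == incoming.Name)
                {
                    existing.Value = incoming.Value; // value, not name
                    matched = true;
                    break;
                }
            }
            if (!matched)
                session.Add(incoming);
        }
    }

    static void Main()
    {
        var session = new CookieCollection { new Cookie("sid", "old") };
        var fresh = new CookieCollection { new Cookie("sid", "new"), new Cookie("lang", "en") };
        Merge(session, fresh);
        // 2 new
        Console.WriteLine(session.Count + " " + session["sid"].Value);
    }
}
```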
You need to see how authentication works for the site you are working with.
This may be through cookies, special headers, a hidden field, or something else.
Fire up a tool like Fiddler and see what the network traffic looks like when logging in, and how it differs from not being logged in.
Then recreate this logic with WebRequest and WebResponse.
See the answers to this SO question (HttpRequest: pass through AuthLogin).
What for? WatiN is good for testing and such, and it's easy to do basic screen scraping with it. Why reinvent the wheel if you don't have to?
You can set the WebRequest.Credentials property. For an example and documentation, see:
http://msdn.microsoft.com/en-us/library/system.net.networkcredential.aspx
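Note that Credentials only helps when the site uses HTTP authentication (Basic/NTLM/Digest), not a form-based login like the Yahoo example above. A minimal sketch of the wiring (the URL is a placeholder, and no request is actually sent):

```csharp
using System;
using System.Net;

class CredentialExample
{
    static void Main()
    {
        // NetworkCredential carries the user name/password pair that
        // WebRequest.Credentials expects for HTTP-authenticated sites.
        var credential = new NetworkCredential("user", "s3cret", "DOMAIN");

        // Attach it to a request; the placeholder request is never sent here.
        var request = (HttpWebRequest)WebRequest.Create("https://example.com/protected");
        request.Credentials = credential;

        // GetCredential returns the same pair regardless of URI and auth scheme.
        var resolved = credential.GetCredential(new Uri("https://example.com/"), "Basic");
        // user
        Console.WriteLine(resolved.UserName);
    }
}
```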
