Unable to fetch data using HttpWebRequest or HtmlAgilityPack - C#

I am trying to make a web scraper in C# for NSE. The code works with other sites, but when run against https://www.nseindia.com/ it fails with: An error occurred while sending the request. Unable to read data from the transport connection: Operation timed out.
I have tried two different approaches, Try1() and Try2().
Can anyone please tell me what I am missing in my code?
class Program
{
    public void Try1() {
        HtmlWeb web = new HtmlWeb();
        HttpStatusCode statusCode = HttpStatusCode.OK;
        web.UserAgent = GetUserAgent();
        web.PostResponse = (request, response) =>
        {
            if (response != null)
            {
                statusCode = response.StatusCode;
                Console.WriteLine("Status Code: " + statusCode);
            }
        };
        Task<HtmlDocument> task = web.LoadFromWebAsync(GetURL());
        HtmlDocument document = task.Result;
    }

    public void Try2() {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(GetURL());
        request.UserAgent = GetUserAgent();
        request.Accept = "*/*;";
        using (var response = (HttpWebResponse)(request.GetResponse()))
        {
            HttpStatusCode code = response.StatusCode;
            if (code == HttpStatusCode.OK)
            {
                using (StreamReader streamReader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
                {
                    HtmlDocument htmlDoc = new HtmlDocument();
                    htmlDoc.OptionFixNestedTags = true;
                    htmlDoc.Load(streamReader);
                    Console.WriteLine("Document Loaded.");
                }
            }
        }
    }

    private string GetURL() {
        // return "https://html-agility-pack.net/";
        return "https://www.nseindia.com/";
    }

    private string GetUserAgent() {
        return "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36";
    }
}

You are missing the Accept and other request headers, so the server never sends a response back.
Besides that, I would recommend using HttpClient instead of HttpWebRequest:
public static async Task GetHtmlData(string url)
{
    HttpClient httpClient = new HttpClient();
    using (var request = new HttpRequestMessage(HttpMethod.Get, new Uri(url)))
    {
        request.Headers.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml,application/xml, charset=UTF-8, text/javascript, */*; q=0.01");
        request.Headers.TryAddWithoutValidation("Accept-Encoding", "gzip, deflate, br");
        request.Headers.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36 OPR/67.0.3575.137");
        request.Headers.TryAddWithoutValidation("Accept-Charset", "ISO-8859-1");
        request.Headers.TryAddWithoutValidation("X-Requested-With", "XMLHttpRequest");
        using (var response = await httpClient.SendAsync(request).ConfigureAwait(false))
        {
            response.EnsureSuccessStatusCode();
            // The response body is gzip-compressed (per Accept-Encoding), so decompress it manually.
            using (var responseStream = await response.Content.ReadAsStreamAsync().ConfigureAwait(false))
            using (var decompressedStream = new GZipStream(responseStream, CompressionMode.Decompress))
            using (var streamReader = new StreamReader(decompressedStream))
            {
                var result = await streamReader.ReadToEndAsync().ConfigureAwait(false);
                HtmlDocument htmlDoc = new HtmlDocument();
                htmlDoc.OptionFixNestedTags = true;
                htmlDoc.LoadHtml(result);
                Console.WriteLine(result);
                Console.WriteLine("Document Loaded.");
            }
        }
    }
}
Use it like this:
await GetHtmlData("https://www.nseindia.com/");
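As a variation (not from the original answer), you can also let HttpClientHandler decompress the response for you instead of wrapping the stream in GZipStream by hand. A minimal sketch, assuming the same HtmlAgilityPack usage; the method name GetHtmlDataAuto is just illustrative:
// Sketch only: the handler decompresses gzip/deflate responses automatically.
public static async Task<HtmlDocument> GetHtmlDataAuto(string url)
{
    var handler = new HttpClientHandler
    {
        AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
    };
    using (var httpClient = new HttpClient(handler))
    {
        httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        httpClient.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36");
        var html = await httpClient.GetStringAsync(url).ConfigureAwait(false);
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);
        return htmlDoc;
    }
}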

Related

Web Response 'An error occurred while sending the request' while trying to log in to the Rockstar Social Club page

I want to log in to the Rockstar Social Club page https://pl.socialclub.rockstargames.com
I have this script:
public static void Login()
{
string firstUrl = "https://pl.socialclub.rockstargames.com/profile/signin";
string formParams = string.Format("login-field={0}&password-field={1}", "mynickname", "mypassword");
string cookieHeader;
WebRequest req = WebRequest.Create(firstUrl);
req.ContentType = "application/x-www-form-urlencoded";
req.Method = "POST";
byte[] bytes = Encoding.ASCII.GetBytes(formParams);
req.ContentLength = bytes.Length;
using (Stream os = req.GetRequestStream())
{
os.Write(bytes, 0, bytes.Length);
}
WebResponse resp = req.GetResponse();
cookieHeader = resp.Headers["Set-cookie"];
string pageSource;
string getUrl = "https://pl.socialclub.rockstargames.com/games/gtav/pc/career/overview/gtaonline";
WebRequest getRequest = WebRequest.Create(getUrl);
getRequest.Headers.Add("Cookie", cookieHeader);
WebResponse getResponse = getRequest.GetResponse(); // Here it throws: System.Net.WebException: 'An error occurred while sending the request'
using (StreamReader sr = new StreamReader(getResponse.GetResponseStream()))
{
pageSource = sr.ReadToEnd();
}
}
The error occurs at WebResponse getResponse = getRequest.GetResponse();
System.Net.WebException: 'An error occurred while sending the request'
I don't know how to fix this and successfully log in to this website.
I have accomplished what you are attempting to do, but on a different website.
Basically - a few years ago, I wanted to create a website that would track my Guild/Company details on Final Fantasy XIV.
They didn't have an API, so I made one.
In order to get the information I required, I needed to use a mix of HtmlAgilityPack along with the C# WebBrowser control.
In order to pass the verification token stage above, you need to run the page source in a Web Browser control.
This will allow dynamic fields and data to be generated.
You then need to take that data, and submit it with your post data.
This is to fool it into thinking the request is coming from the page.
Be warned, when doing your posts - you may need to allow for redirects and you may need to mirror the referrer and host fields to match the website you are emulating.
The specific process I followed was:
1. Navigate to the login page in a WebBrowser control.
2. Get the page source.
3. Load it into the HtmlAgilityPack HtmlDocument class.
4. Use XPath to scrape the login form.
5. Take the _verification tokens, csrf tokens, etc. and make a note of them (see the sketch below).
6. Post a web request with the necessary data to the form's target destination URL.
7. Read the response.
Be aware that sometimes the response will actually be HTML that performs a JavaScript redirect; in my case with Final Fantasy XIV, it loaded another form and auto-posted it on page load.
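For the token-scraping steps above, a minimal sketch with HtmlAgilityPack (the form name and the pageSource variable are illustrative placeholders, not taken from a real site):
// Sketch: collect hidden inputs (verification/CSRF tokens) from a login form.
// pageSource is the HTML obtained from the WebBrowser control.
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(pageSource);
var hiddenFields = new Dictionary<string, string>();
var inputs = doc.DocumentNode.SelectNodes("//form[@name='loginForm']//input[@type='hidden']");
if (inputs != null)
{
    foreach (var input in inputs)
    {
        // e.g. __RequestVerificationToken, csrf_token, _STORED_ ...
        hiddenFields[input.GetAttributeValue("name", "")] = input.GetAttributeValue("value", "");
    }
}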
You will also want to use
LoggedInCookies = new CookieContainer();
in your first HttpWebRequest, followed by:
request.CookieContainer = LoggedInCookies;
for each subsequent request.
The cookie container will trap and persist the authentication related cookies, while the WebBrowser control and HtmlAgilityPack will allow you to scrape the fields from the web forms that you need to break through.
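A minimal sketch of that cookie handling (the URLs are placeholders):
// Sketch: share one CookieContainer so authentication cookies persist between requests.
var loggedInCookies = new CookieContainer();

var loginRequest = (HttpWebRequest)WebRequest.Create("https://example.com/login");
loginRequest.CookieContainer = loggedInCookies;   // cookies set by the server are captured here
// ... POST the login form and read the response ...

var nextRequest = (HttpWebRequest)WebRequest.Create("https://example.com/members/profile");
nextRequest.CookieContainer = loggedInCookies;    // same container, so the session cookies are sent back
// ... GET the authenticated page ...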
Adding some code from way back when I solved this for Final Fantasy XIV's Lodestone website.
This code is very old and may not work anymore, but the process it follows could be adapted for sites that do not use JavaScript as part of the login process.
Pay attention to the areas where the request is allowed to be redirected; this is because the server endpoint you are calling may perform action redirects, etc.
If your request does not allow those redirects, then it will not be emulating the login process.
class LoggedInClient
{
public static CookieContainer LoginCookie(string user, string pass)
{
string sStored = "";
string url = "http://eu.finalfantasyxiv.com/lodestone/account/login/";
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
CookieContainer cookies = new CookieContainer();
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36";
request.CookieContainer = cookies;
HttpWebResponse response1 = (HttpWebResponse)request.GetResponse();
Console.WriteLine(cookies.Count.ToString());
string sPage = "";
using (var vPage = new StreamReader(response1.GetResponseStream()))
{
sPage = vPage.ReadToEnd();
}
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(sPage);
sStored = doc.DocumentNode.SelectSingleNode("//input[@type='hidden' and @name='_STORED_']").Attributes["value"].Value;
string param = "sqexid=" + user + "&password=" + pass + "&_STORED_=" + sStored;
string postURL = doc.DocumentNode.SelectSingleNode("//form[@name='mainForm']").Attributes["action"].Value;
//Console.WriteLine(sStored);
postURL = "https://secure.square-enix.com/oauth/oa/" + postURL;
request.Method = "POST";
byte[] paramAsBytes = Encoding.Default.GetBytes(param);
request = (HttpWebRequest)WebRequest.Create(postURL);
request.ContentType = "application/x-www-form-urlencoded";
request.Method = "POST";
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36";
request.CookieContainer = cookies;
request.AllowAutoRedirect = false;
try
{
using (Stream stream = request.GetRequestStream())
{
stream.Write(paramAsBytes, 0, paramAsBytes.Length);
}
}
catch (Exception ee)
{
Console.WriteLine(ee.ToString());
}
string sGETPage = "";
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
using (var vPage = new StreamReader(response.GetResponseStream()))
{
sPage = vPage.ReadToEnd();
sGETPage = response.Headers["Location"];
}
}
// Console.WriteLine(sPage);
request = (HttpWebRequest)WebRequest.Create(sGETPage);
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36";
request.CookieContainer = cookies;
HttpWebResponse response2 = (HttpWebResponse)request.GetResponse();
Console.WriteLine(cookies.Count.ToString());
sPage = "";
using (var vPage = new StreamReader(response2.GetResponseStream()))
{
sPage = vPage.ReadToEnd();
}
// Console.WriteLine(sPage);
doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(sPage);
string _c = doc.DocumentNode.SelectSingleNode("//input[@type='hidden' and @name='_c']").Attributes["value"].Value;
string cis_sessid = doc.DocumentNode.SelectSingleNode("//input[@type='hidden' and @name='cis_sessid']").Attributes["value"].Value;
string action = doc.DocumentNode.SelectSingleNode("//form[@name='mainForm']").Attributes["action"].Value;
string sParams = "_c=" + _c + "&cis_sessid=" + cis_sessid;
byte[] bData = Encoding.Default.GetBytes(sParams);
// Console.WriteLine(sStored);
request = (HttpWebRequest)WebRequest.Create(action);
request.ContentType = "application/x-www-form-urlencoded";
request.Method = "POST";
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36";
request.CookieContainer = cookies;
request.AllowAutoRedirect = true;
try
{
using (Stream stream = request.GetRequestStream())
{
stream.Write(bData, 0, bData.Length);
}
}
catch (Exception ee)
{
Console.WriteLine(ee.ToString());
}
string nextPage = "";
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
using (var vPage = new StreamReader(response.GetResponseStream()))
{
nextPage = vPage.ReadToEnd();
}
}
// Console.WriteLine(nextPage);
doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(nextPage);
string csrf_token = doc.DocumentNode.SelectSingleNode("//input[@type='hidden' and @name='csrf_token']").Attributes["value"].Value;
string cicuid = "51624738";
string timestamp = Convert.ToInt32(DateTime.UtcNow.Subtract(new DateTime(1970, 1, 1)).TotalSeconds).ToString() + "100";
action = "http://eu.finalfantasyxiv.com/lodestone/api/account/select_character/";
sParams = "csrf_token=" + csrf_token + "&cicuid=" + cicuid + "&timestamp=" + timestamp;
bData = Encoding.Default.GetBytes(sParams);
request = (HttpWebRequest)WebRequest.Create(action);
request.ContentType = "application/x-www-form-urlencoded";
request.Method = "POST";
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36";
request.CookieContainer = cookies;
request.AllowAutoRedirect = true;
try
{
using (Stream stream = request.GetRequestStream())
{
stream.Write(bData, 0, bData.Length);
}
}
catch (Exception ee)
{
Console.WriteLine(ee.ToString());
}
nextPage = "";
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
using (var vPage = new StreamReader(response.GetResponseStream()))
{
nextPage = vPage.ReadToEnd();
}
}
return cookies;
}
}
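Used roughly like this (a sketch, not part of the original code; the landing-page URL is illustrative):
// Sketch: log in once, then reuse the authenticated cookies for scraping.
CookieContainer session = LoggedInClient.LoginCookie("myUser", "myPass");

var request = (HttpWebRequest)WebRequest.Create("http://eu.finalfantasyxiv.com/lodestone/");
request.CookieContainer = session; // cookies captured during the login flow
using (var response = (HttpWebResponse)request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    string html = reader.ReadToEnd();
    // load into HtmlAgilityPack and scrape as needed
}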

Why does my web request keep returning the wrong response?

I am trying to connect to http://cmyip.org because its content shows information about the proxy you are connecting through. That is perfect because I am learning about making web requests with proxies, so it will show me whether I actually connected through the proxy or not.
However, instead of getting the content of the website I am requesting, I am getting the proxy's own HTML:
Before start thread
<html><head><title>Wowza Streaming Engine 4 Subscription Edition 4.7.2.01 build2
1094</title></head><body>Wowza Streaming Engine 4 Subscription Edition 4.7.2.01
build21094</body></html>
Working.
Why is it showing the contents of the proxy and not cmyip.org? I'm sure I am just too close to the code to see it, but I could really use some help.
If you try to connect to 192.99.46.182:1935 in your web browser, you will see the same thing.
public static bool TestProxy(string proxyIP, int proxyPort)
{
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.cmyip.org/");
WebProxy myproxy = new WebProxy(proxyIP, proxyPort);
myproxy.BypassProxyOnLocal = false;
request.Proxy = myproxy;
request.Method = "GET";
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36";
request.Timeout = 2000;
try
{
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
using (StreamReader sr = new StreamReader(response.GetResponseStream()))
{
string Content = sr.ReadToEnd();
Console.WriteLine(Content);
}
}
catch (Exception)
{
return false;
}
return true;
}
private static void SimpleTest()
{
var aBool = TestProxy("192.99.46.182", 1935);
if (aBool == true)
{
Console.WriteLine("Working.");
}
Console.ReadLine();
}

Some websites rejecting HttpClient requests even with headers set

I have written some code to check whether all the websites in my database are still hosted and online.
The problem is that some of these sites seem to have bot protection, and whenever I try to request them via HttpClient they raise an error instead of returning the page.
I have seen other similar questions that suggest adding browser headers, so I have done this, but it does not help. The same sites still reject the HttpClient connection but are perfectly fine when I view them in the browser.
Have I done something wrong with my code or do I need some additional steps?
Here is my code:
public static async Task CheckSite(string url, int id)
{
try
{
using(var db = new PlaceDBContext())
using (HttpClient client = new HttpClient(new HttpClientHandler()
{
AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip
}))
using (HttpResponseMessage response = await client.GetAsync(url))
using (HttpContent content = response.Content)
{
client.DefaultRequestHeaders.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml,application/xml");
client.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Encoding", "gzip, deflate");
client.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0");
client.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Charset", "ISO-8859-1");
var rd = db.RootDomains.Find(id);
string result = await content.ReadAsStringAsync();
if (result != null && result.Length >= 50)
{
Console.WriteLine("fine");
rd.LastCheckOnline = true;
}
else
{
Console.WriteLine("There was empty or short result");
rd.LastCheckOnline = false;
}
db.SaveChanges();
semaphore.Release();
}
}
catch(Exception ex)
{
Console.WriteLine(ex.Message);
using(var db = new PlaceDBContext())
{
var rd = db.RootDomains.Find(id);
rd.LastCheckOnline = false;
db.SaveChanges();
semaphore.Release();
}
}
}
Set the headers before sending the request. You are currently setting them after you have already received the response:
public static async Task CheckSite(string url, int id) {
try {
using (var db = new PlaceDBContext())
using (var client = new HttpClient(new HttpClientHandler() {
AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip
})) {
client.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Encoding", "gzip, deflate");
client.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0");
client.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Charset", "ISO-8859-1");
using (var response = await client.GetAsync(url))
using (var content = response.Content) {
client.DefaultRequestHeaders.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml,application/xml");
var rd = db.RootDomains.Find(id);
string result = await content.ReadAsStringAsync();
if (result != null && result.Length >= 50) {
Console.WriteLine("fine");
rd.LastCheckOnline = true;
} else {
Console.WriteLine("There was empty or short result");
rd.LastCheckOnline = false;
}
db.SaveChanges();
semaphore.Release();
}
}
} catch (Exception ex) {
Console.WriteLine(ex.Message);
using (var db = new PlaceDBContext()) {
var rd = db.RootDomains.Find(id);
rd.LastCheckOnline = false;
db.SaveChanges();
semaphore.Release();
}
}
}
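As a side note (not part of the original answer), if you would rather not mutate DefaultRequestHeaders on a shared client, the same headers can be attached per request. A minimal sketch, reusing the client and url from the code above:
// Sketch: per-request headers via HttpRequestMessage instead of DefaultRequestHeaders.
using (var request = new HttpRequestMessage(HttpMethod.Get, url))
{
    request.Headers.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml,application/xml");
    request.Headers.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0");
    using (var response = await client.SendAsync(request))
    {
        string result = await response.Content.ReadAsStringAsync();
        // check result length / status as before
    }
}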

Get all images from website has block calls

I'm trying to get all images from a page:
public async Task<PhotoURL> GetImagePortal()
{
strLinkPage = "http://www.propertyguru.com.sg/listing/19077438";
var lstString = new List<string>();
int itotal = default(int);
HttpClient client = new HttpClient();
var doc = new HtmlAgilityPack.HtmlDocument();
string strHtml = await client.GetStringAsync(strLinkPage);
doc.LoadHtml(strHtml);
var pageHtml = doc.DocumentNode;
if (pageHtml != null)
{
var projectRoot = pageHtml.SelectSingleNode("//*[contains(@class,'submain')]");
//var projectChild = projectRoot.SelectSingleNode("div/div[2]");
var imgRoot = projectRoot.SelectSingleNode("//*[contains(@class,'white-bg-padding')]");
var imgChilds = imgRoot.SelectNodes("div[1]/div[1]/ul[1]/li");
itotal = imgChilds.Count();
foreach (var imgItem in imgChilds)
{
string linkImage = imgItem.SelectSingleNode("img").Attributes["src"].Value;
lstString.Add(linkImage);
}
}
return await Task.Run(() => new PhotoURL { total = itotal, URL = lstString });
}
At the line
string strHtml = await client.GetStringAsync(strLinkPage);
I get a 405 Method Not Allowed error.
I have also tried WebClient and HttpWebRequest.
Help me, please!
The site requires a User-Agent header, and since you are using an HttpClient without any options, the site does not treat the request as valid (without the user agent it does not look like it is coming from a browser).
Try this:
HttpClient client = new HttpClient();
client.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36");
Or use any other user-agent string you prefer.
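Putting it together with the XPath from the question, a rough sketch (inside an async method; the selectors mirror the question and may need adjusting for the live page):
// Sketch: fetch the page with a User-Agent set, then collect the <img> src values.
HttpClient client = new HttpClient();
client.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36");

string html = await client.GetStringAsync("http://www.propertyguru.com.sg/listing/19077438");

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

var images = new List<string>();
var imgNodes = doc.DocumentNode.SelectNodes("//*[contains(@class,'submain')]//img");
if (imgNodes != null)
{
    foreach (var img in imgNodes)
    {
        images.Add(img.GetAttributeValue("src", ""));
    }
}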

Server returning 500 error when using WebRequest to get XML document

Here is my code to get the XML document from a URL that is passed in:
var request = WebRequest.Create(url);
request.Method = "GET";
request.ContentType = "application/x-www-form-urlencoded";
request.ContentLength = 0;
var response = request.GetResponse(); // Error is thrown here
When I copy and paste the URL into my browser it works just fine.
Here is the complete XML that is returned:
<Model>
<Item>
<Id>7908</Id>
</Item>
</Model>
Is the XML in an incorrect format? I have tried changing the content type to application/xml but I still get this error.
EDIT:
I am now trying to use WebClient with this code:
using (var wc = new System.Net.WebClient())
{
wc.Headers["Method"] = "GET";
wc.Headers["ContentType"] = "text/xml;charset=\"utf-8\"";
wc.Headers["Accept"] = "text/xml, */*";
wc.Headers["User-Agent"] = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; .NET CLR 3.5.30729;)";
wc.Headers[HttpRequestHeader.AcceptLanguage] = "en-us";
wc.Headers["KeepAlive"] = "true";
wc.Headers["AutomaticDecompression"] = (DecompressionMethods.Deflate | DecompressionMethods.GZip).ToString();
var response = wc.DownloadString(url);
}
The response string is empty! Any ideas why this returns no result, when pasting the URL into the browser returns the XML?
I finally got it working. I had to use this code:
using (var wc = new System.Net.WebClient())
{
wc.Headers["Method"] = "GET";
wc.Headers["Accept"] = "application/xml";
var response = wc.DownloadString(url);
}
The key was using an Accept header of "application/xml"; otherwise the response would come back empty.
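Once the string comes back, parsing the sample XML above is straightforward; a small sketch using System.Xml.Linq:
// Sketch: parse the returned XML and read the Id value.
// response is the string returned by wc.DownloadString(url).
var doc = System.Xml.Linq.XDocument.Parse(response);
string id = doc.Root.Element("Item").Element("Id").Value; // "7908" in the sample above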
This should hopefully do the trick:
try
{
using(var response = (HttpWebResponse)request.GetResponse())
{
// Do things
}
}
catch(WebException e)
{
// Handled!...
}
Try what Joel Lee suggested if this fails.
Why not use a WebClient instead?
public class MyWebClient : WebClient
{
protected override WebRequest GetWebRequest(Uri address)
{
var request = base.GetWebRequest(address);
if (request.GetType() == typeof(HttpWebRequest)){
((HttpWebRequest)request).UserAgent = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.17 Safari/537.36";
}
return request;
}
}
using(var wc = new MyWebClient()){
var response = wc.DownloadString(url);
//do stuff with response
}
