Can't Download HTML of a Specific Website - C#

I am doing web parsing in a C# console application.
My code is:
var req = WebRequest.Create("http://watch.squidtv.net/");
req.BeginGetResponse(r =>
{
    var response = req.EndGetResponse(r);
    var stream = response.GetResponseStream();
    var reader = new StreamReader(stream, true);
    var str = reader.ReadToEnd();
    Console.WriteLine(str);
}, null);
This code runs fine with other URLs, but when I changed the URL to http://watch.squidtv.net/, two problems occurred.
First: it does not download the site's HTML.
Second: the CPU fan spins up loudly, as if the process is working very hard.
Then I changed the code to use WebClient, like this:
var client = new WebClient();
string htmlCode = client.DownloadString("http://watch.squidtv.net");
Console.WriteLine(htmlCode);
But the problem is the same. What could the problem be?

I found the solution.
The problem was the content encoding: the server returns gzip-compressed HTML, and the HttpWebRequest was not decompressing it, which caused the problem. The problem was solved when I used this code:
HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create("http://watch.squidtv.net/");
req.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate";
req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
req.Method = "GET";
req.UserAgent = "Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))";
string htmlCode;
using (StreamReader reader = new StreamReader(req.GetResponse().GetResponseStream()))
{
htmlCode = reader.ReadToEnd();
}
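The same fix works with the WebClient approach from the question, too. A sketch (the subclass is my own illustration, not from the original post): WebClient does not expose AutomaticDecompression, but you can override GetWebRequest to set it on the underlying HttpWebRequest.

// Requires System and System.Net.
// Illustrative subclass: enables gzip/deflate decompression for WebClient.
class DecompressingWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        var request = (HttpWebRequest)base.GetWebRequest(address);
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
        return request;
    }
}

// Usage:
using (var client = new DecompressingWebClient())
{
    Console.WriteLine(client.DownloadString("http://watch.squidtv.net/"));
}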

One idea: possibly you'll have to specify more in your WebRequest so that the SquidTV server knows to send you back the HTML.
Consider that a browser sends lots of headers to the server. If you want to take a look, use Fiddler or Wireshark to see all the extra data that gets sent.
A firewall could be another issue: you may be sending out a request that is not allowed, and thus nothing comes back. This is again where intermediate tools like Wireshark or Fiddler are useful, for seeing whether the request is at least getting out.

Related

How to log in to a website using HttpWebRequest via my web app or generic handler and access the content?

Basically, I am making a chat app for my university's students only, and to make sure they are genuine I have to check their details on the UMS (university management system) and fetch their basic details. I am nearly done with the chat app; only the login is left.
So I want to log in to my UMS page via my website from a generic handler, and then navigate to another page in it to access their basic info, keeping the session alive.
I researched HttpWebRequest and failed to log in with my credentials.
https://ums.lpu.in/lpuums
(made in ASP.NET)
I did try code from other posts for the login.
I am a novice at this part, so bear with me; any help will be appreciated.
Without an actual handshake with UMS via a defined API, you would end up scraping the UMS HTML, which is bad for various reasons.
I would suggest you read up on Single Sign-On (SSO).
A few articles on SSO and ASP.NET:
1. CodeProject
2. MSDN
3. ASP.NET forums
Edit 1
Although I think this is a bad idea, since you say you are out of options, here is a link that shows how Html Agility Pack can help in scraping web pages.
Beware of the drawbacks of screen scraping: changes to UMS will not be communicated to you, and you will see your application stop working all of a sudden.
public string Scrap(string Username, string Password)
{
    string Url1 = "https://www.example.com";            // first url
    string Url2 = "https://www.example.com/login.aspx"; // url to post the login request to

    // first request
    CookieContainer jar = new CookieContainer();
    HttpWebRequest request1 = (HttpWebRequest)WebRequest.Create(Url1);
    request1.CookieContainer = jar;

    // Get the response from the server and save the cookies from the first request.
    HttpWebResponse response1 = (HttpWebResponse)request1.GetResponse();
    response1.Close();

    // second request
    string postData = "***viewstate here***"; // VIEWSTATE
    HttpWebRequest request2 = (HttpWebRequest)WebRequest.Create(Url2);
    request2.CookieContainer = jar;
    request2.KeepAlive = true;
    request2.Referer = Url2;
    request2.Method = WebRequestMethods.Http.Post;
    request2.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    request2.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2";
    request2.ContentType = "application/x-www-form-urlencoded";
    request2.AllowWriteStreamBuffering = true;
    request2.ProtocolVersion = HttpVersion.Version11;
    request2.AllowAutoRedirect = true;

    byte[] byteArray = Encoding.ASCII.GetBytes(postData);
    request2.ContentLength = byteArray.Length;
    Stream newStream = request2.GetRequestStream(); // open connection
    newStream.Write(byteArray, 0, byteArray.Length); // send the data
    newStream.Close();

    string responseData;
    HttpWebResponse response2 = (HttpWebResponse)request2.GetResponse();
    using (StreamReader sr = new StreamReader(response2.GetResponseStream()))
    {
        responseData = sr.ReadToEnd();
    }
    return responseData;
}
This is the code that works for me. Anyone can plug in their own URLs and viewstate to scrape ASP.NET websites, and you need to take care of the cookies too. Other (non-ASP.NET) websites don't require the viewstate.
Use Fiddler to find what needs to go into the headers, the viewstate, and the cookies.
Hope this helps if someone is having the same problem. :)
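Since the answer mentions Html Agility Pack, here is a hedged sketch of how the __VIEWSTATE and __EVENTVALIDATION fields could be read out of the first response instead of being pasted in by hand. The helper below is my own illustration; the field ids are the standard ASP.NET WebForms ones, so verify them in Fiddler for your site.

// Requires the HtmlAgilityPack package and System.Net.
// Illustrative helper: reads a hidden form field from the login page HTML.
static string GetHiddenField(HtmlAgilityPack.HtmlDocument doc, string id)
{
    var node = doc.DocumentNode.SelectSingleNode("//input[@id='" + id + "']");
    return node == null ? "" : node.GetAttributeValue("value", "");
}

// Usage: load the HTML of the first response, then build postData from it.
// var doc = new HtmlAgilityPack.HtmlDocument();
// doc.LoadHtml(htmlFromResponse1); // htmlFromResponse1: the first page's HTML
// string postData =
//     "__VIEWSTATE=" + WebUtility.UrlEncode(GetHiddenField(doc, "__VIEWSTATE")) +
//     "&__EVENTVALIDATION=" + WebUtility.UrlEncode(GetHiddenField(doc, "__EVENTVALIDATION"));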

Login and retrieve data from embedded form page using C#

I'm a bit confused about how to go about this, as I'm not really conversant with web stuff. I'm using a C# console application to try to retrieve a value from a page linked inside a password-protected homepage.
Here's the code I'm trying:
HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create("");
req.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705;)";
req.Method = "POST";
req.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
req.Headers.Add("Accept-Language: en-us,en;q=0.5");
req.Headers.Add("Accept-Encoding: gzip,deflate");
req.Headers.Add("Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7");
req.KeepAlive = true;
req.Headers.Add("Keep-Alive: 300");
req.Referer = "copy from url";
req.ContentType = "application/x-www-form-urlencoded";
String Username = copy from url;
String PassWord = copy from url;
StreamWriter sw = new StreamWriter(req.GetRequestStream());
sw.Write(string.Format("&loginname={0}&password={1}&btnSubmit=Log In&institutioncode=H4V9KLUT45AV&version=2", Username, PassWord));
sw.Close();
HttpWebResponse response = (HttpWebResponse)req.GetResponse();
StreamReader reader = new StreamReader(response.GetResponseStream());
string tmp = reader.ReadToEnd();
However, when I inspect the data retrieved from the web page, it shows something like this:
'...Your Session has timed out due to inactivity. Please logout and relogin. return to login page>'
I'm guessing this is due to some VIEWSTATE stuff in ASP.NET.
I'm also guessing I might have a problem retrieving the data from the link I'll extract from the homepage, because it seems the link simply loads data into a frame rather than reloading the webpage.
Anyone, please?
Your form data is incorrect. After removing the & at the beginning it worked for me:
sw.Write(string.Format("loginname={0}&password={1}&btnSubmit=Log In&institutioncode=H4V9KLUT45AV&version=2", Username, PassWord));
Additionally, as already mentioned in the other answer, you need to add the returned ASPSESSIONIDSSRRDRST cookie in further requests to the site.
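A sketch of how that can look with a shared CookieContainer (the URLs are placeholders): assign one container to the login request and reuse it for every later request, and the ASPSESSIONIDSSRRDRST cookie is re-sent automatically.

CookieContainer jar = new CookieContainer();

// Login request: the server's Set-Cookie lands in the container.
HttpWebRequest loginReq = (HttpWebRequest)WebRequest.Create("https://www.example.com/login.aspx"); // placeholder
loginReq.CookieContainer = jar;
// ... POST the form data as shown above and read the response ...

// Follow-up request: same container, so the session cookie is sent along.
HttpWebRequest dataReq = (HttpWebRequest)WebRequest.Create("https://www.example.com/data.aspx"); // placeholder
dataReq.CookieContainer = jar;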
OK, the website is using cookies, so after you have logged in you need to retrieve the cookies first in order to make another WebRequest:
CookieCollection cookiesResponse = new CookieCollection();
if (response != null)
{
    // path and domain must match the site you are logging in to;
    // the values here are placeholders.
    string path = "/";
    string domain = "www.example.com";
    foreach (string cookie in response.Headers["Set-Cookie"].Split(';'))
    {
        string name = cookie.Split('=')[0];
        string value = cookie.Substring(name.Length + 1);
        cookiesResponse.Add(new Cookie(name.Trim(), value.Trim(), path, domain));
    }
}
In your example the cookie contains: ASPSESSIONIDSSRRDRST=FEKODBMDBEIPCLLENCFLFBEA
You must send those cookies with every request to the site. Note that HttpWebRequest.CookieContainer takes a CookieContainer, not a CookieCollection, so wrap the collection first:
CookieContainer container = new CookieContainer();
container.Add(cookiesResponse);
request.CookieContainer = container;
And finally, you can parse the response. You can use an HTML parser, or parse the plain text, as sketched below.
I hope this is helpful.
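For the plain-text route, a minimal illustrative sketch (tmp is the string read with ReadToEnd above; for anything structural, an HTML parser such as Html Agility Pack is more robust):

// Requires System.Text.RegularExpressions.
// Quick one-off extraction: pull the page <title> with a regex.
Match m = Regex.Match(tmp, @"<title>\s*(.*?)\s*</title>",
                      RegexOptions.IgnoreCase | RegexOptions.Singleline);
string title = m.Success ? m.Groups[1].Value : "";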

matweb.com: How to get source of page?

I have a URL like:
http://www.matweb.com/search/DataSheet.aspx?MatGUID=849e2916ab1541be9ff6a17b78f95c82
I want to download the source of that page using this code:
private static string urlTemplate = @"http://www.matweb.com/search/DataSheet.aspx?MatGUID=";

static string GetSource(string guid)
{
    try
    {
        Uri url = new Uri(urlTemplate + guid);
        HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
        webRequest.Method = "GET";
        HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse();
        Stream responseStream = webResponse.GetResponseStream();
        StreamReader responseStreamReader = new StreamReader(responseStream);
        String result = responseStreamReader.ReadToEnd();
        return result;
    }
    catch (Exception ex)
    {
        return null;
    }
}
When I do so I get:
You do not seem to have cookies enabled. MatWeb Requires cookies to be enabled.
OK, that I understand, so I added these lines:
CookieContainer cc = new CookieContainer();
webRequest.CookieContainer = cc;
I got:
Your IP Address has been restricted due to excessive use. The problem may be compounded when an IP address may be shared by many people in a company or through an internet service provider. We apologize for any inconvenience.
I can understand this, but I don't get this message when I visit the page in a web browser. What can I do to get the source code? Some cookies or HTTP headers?
It probably doesn't like your UserAgent. Try this:
webRequest.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)"; //maybe substitute your own in here
It looks like you're doing something the company doesn't like, if you got an "excessive use" response.
You are downloading pages too fast.
When you use a browser you might fetch at most one page per second; an application can fetch several pages per second, and that's probably what their web server is detecting. Hence the excessive usage.
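If that is the cause, the usual fix is to throttle your own requests. A sketch (the one-second delay is a guess, and guids stands in for your own list of MatGUID values; GetSource is the method from the question):

// Requires System.Threading.
// Pause between downloads so the server sees browser-like pacing.
foreach (string guid in guids)
{
    string source = GetSource(guid);
    // ... process source ...
    Thread.Sleep(1000); // at most one request per second
}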

C# Not getting proper response from HttpWebResponse. Encoding?

I'm trying to fetch some webpages using the code below:
public static string FetchPage(string url)
{
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
    req.Method = "GET";
    req.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.0; sv-SE; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12 (.NET CLR 3.5.30729)";
    req.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    req.Headers.Add("Accept-Language", "sv-se,sv;q=0.8,en-us;q=0.5,en;q=0.3");
    req.Headers.Add("Accept-Encoding", "gzip,deflate");
    req.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");
    req.Headers.Add("Keep-Alive", "115");
    req.Headers.Add("Cache-Control", "max-age=0");
    req.AllowAutoRedirect = true;
    req.IfModifiedSince = DateTime.Now;
    using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
    {
        using (Stream resStream = resp.GetResponseStream())
        {
            StreamReader reader = new StreamReader(resStream);
            return reader.ReadToEnd();
        }
    }
}
Some pages work (W3C, example.com) while most others I've tried do not (BBC.co.uk, CNN.com, etc.). Wireshark shows that I'm getting a proper response.
I've tried setting the reader's encoding to the expected encoding of the response (UTF-8 for CNN), as well as every possible combination, but I have had no luck.
What am I missing here?
The first bytes of my response are always "1f ef bf bd", if you can tell something based on that.
I suspect the most likely explanation is that you are getting compressed data and not decompressing it. Try using a stream filter to deflate/unzip it. See Rick Strahl's blog article for more info.
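Alternatively (a sketch, not from the linked article), HttpWebRequest can do the decompression for you, which avoids wrapping the stream by hand:

// With this set, the response stream is decompressed transparently and
// ReadToEnd returns the page text even when the server sends gzip/deflate.
req.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;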
Loading http://bbc.co.uk worked for me when leaving out the "Accept-Encoding" header, i.e. removing this line:
req.Headers.Add("Accept-Encoding", "gzip,deflate");

HttpWebRequest has empty response requesting a search from Bing

I have the following code that sends an HttpWebRequest to Bing. When I request the URL below, though, it returns what appears to be an empty response when it should return a list of results.
var response = string.Empty;
var httpWebRequest = WebRequest.Create("http://www.bing.com/search?q=stackoverflow&count=100") as HttpWebRequest;
httpWebRequest.Method = WebRequestMethods.Http.Get;
httpWebRequest.Headers.Add("Accept-Language", "en-US");
httpWebRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Win32)";
httpWebRequest.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");
using (var httpWebResponse = httpWebRequest.GetResponse() as HttpWebResponse)
using (var stream = httpWebResponse.GetResponseStream())
{
    Stream decoded = stream;
    if (httpWebResponse.ContentEncoding.ToLower().Contains("gzip"))
        decoded = new GZipStream(stream, CompressionMode.Decompress);
    else if (httpWebResponse.ContentEncoding.ToLower().Contains("deflate"))
        decoded = new DeflateStream(stream, CompressionMode.Decompress);
    var streamReader = new StreamReader(decoded, Encoding.UTF8);
    response = streamReader.ReadToEnd();
}
It's pretty standard code for requesting and receiving a web page. Any ideas why the response is empty? Thanks in advance.
EDIT: I had left a query-string parameter out of the URL; I also had &count=100, which I have now corrected. It seems to work for values of 50 and below but returns nothing for anything larger. This works in the browser, but not for this web request.
It makes me think the response is large and HttpWebResponse is not handling that the way I have it set up. Just a guess, though.
This works just fine on my machine. Perhaps you are IP-banned from Bing?
Your code works fine on my machine, too.
I suggest you get yourself a copy of Fiddler and examine the actual HTTP session as it occurs. It may be a proxy or firewall thing.
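If the session does not show up in Fiddler, one common trick (a sketch; 127.0.0.1:8888 is Fiddler's default listening port) is to point the request at it explicitly:

// Route the request through Fiddler so the full HTTP session is visible there.
httpWebRequest.Proxy = new WebProxy("127.0.0.1", 8888);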
