I'm working on a web scraping project in ASP.NET. The target website requires a CAPTCHA, so I need to fetch the CAPTCHA image and show it to users to key in before they continue.
So far the project works fine. The only problem is that sometimes the CAPTCHA response is not captured completely, so converting the response stream to an Image fails with the following error:
"Parameter is invalid."
Web browsers do not have this problem; they always display the CAPTCHA correctly as long as the server is up.
HttpWebRequest, however, behaves inconsistently: sometimes it gets the image and sometimes it does not. Is there a way to ensure that the response stream is complete?
My code snippet is as follows:
public Image GetCaptchaCode()
{
Image returnVal = null;
Uri uri = new Uri(URL_CAPTCHA);
HttpWebRequest request = null;
HttpWebResponse response = null;
try
{
// Get Cookies
CookieCollection cookies = this.GetCookies();
foreach (Cookie cookie in cookies)
{
Console.WriteLine(cookie.Name + ": " + cookie.Value);
}
// Get Captcha
request = (HttpWebRequest)HttpWebRequest.Create(uri);
request.ProtocolVersion = HttpVersion.Version11;
request.Method = WebRequestMethods.Http.Get; // use GET for loading Captcha
request.CookieContainer = this._cookies; // Store Cookies Info
System.Net.ServicePointManager.Expect100Continue = false;
// Add more cookies
if (cookies != null)
{
request.CookieContainer.Add(cookies);
}
// Handle Gzip Compression
request.Headers.Add(HttpRequestHeader.AcceptEncoding, HEADER_TYPE);
request.AutomaticDecompression = DecompressionMethods.GZip;
request.Referer = URL_REFERER;
request.UserAgent = USER_AGENT;
// Get Response
response = (HttpWebResponse)request.GetResponse();
returnVal = Image.FromStream(response.GetResponseStream());
}
catch (Exception ex)
{
string errMsg = ex.Message;
}
finally
{
if (uri != null) uri = null;
if (request != null) request = null;
if (response != null)
{
response.Close();
response = null;
}
}
return returnVal;
}
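One thing that may be worth trying, offered as a sketch rather than a confirmed fix: read the whole response body into memory first and only then hand it to Image.FromStream, so the image decoder never reads from a network stream that is still arriving. The class and method names below (StreamBuffering, ReadFully) are hypothetical and not part of the original code.

```csharp
using System.IO;

public static class StreamBuffering
{
    // Hypothetical helper: copies everything from source into memory and
    // returns a stream positioned at the start of the copied data.
    public static MemoryStream ReadFully(Stream source)
    {
        var buffer = new MemoryStream();
        source.CopyTo(buffer);   // keeps reading until the source reports end of stream
        buffer.Position = 0;     // rewind so the next reader starts at byte 0
        return buffer;
    }
}

// Possible use inside GetCaptchaCode (assumption: not tested against the real site):
//   using (Stream body = response.GetResponseStream())
//   {
//       MemoryStream buffered = StreamBuffering.ReadFully(body);
//       returnVal = Image.FromStream(buffered);
//   }
```

The MemoryStream is deliberately not disposed here, because Image.FromStream expects its stream to stay open for the lifetime of the Image. If the error persists, it may also help to log response.ContentType and ex.Message in the catch block, since the server could be returning an HTML error page instead of an image.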
I'm trying to check whether a .txt file exists at a web URL. This is my code:
static public bool URLExists(string url)
{
bool result = false;
WebRequest webRequest = WebRequest.Create(url);
webRequest.Timeout = 1200; // milliseconds
webRequest.Method = "HEAD";
HttpWebResponse response = null;
try
{
response = (HttpWebResponse)webRequest.GetResponse();
result = true;
}
catch (WebException webException)
{
//(url + " doesn't exist: " + webException.Message);
}
finally
{
if (response != null)
{
response.Close();
}
}
return result;
}
If I enter "http://www.example.com/demo.txt", which is not a valid file path and for which the website shows a 404 error page, this code still returns true. How can I solve this problem? Thanks in advance.
Use the StatusCode property of the HttpWebResponse object.
response = (HttpWebResponse)webRequest.GetResponse();
if(response.StatusCode == HttpStatusCode.NotFound)
{
result = false;
}
else
{
result = true;
}
Look through the list of possible status codes to see which ones you want to interpret as the file not existing.
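As a rough sketch of that idea, the check below treats only 2xx status codes as "the file exists". That rule, and the names UrlCheck, IsSuccess and UrlExists, are assumptions made for illustration; you may want to accept or reject other codes depending on the site.

```csharp
using System.Net;

public static class UrlCheck
{
    // Assumed rule: only 2xx status codes count as "the file exists".
    public static bool IsSuccess(HttpStatusCode code)
    {
        int value = (int)code;
        return value >= 200 && value <= 299;
    }

    public static bool UrlExists(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "HEAD";
        request.Timeout = 1200; // milliseconds, as in the question
        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                return IsSuccess(response.StatusCode);
            }
        }
        catch (WebException)
        {
            // Timeouts, DNS failures and error statuses that HttpWebRequest
            // reports as exceptions all end up here.
            return false;
        }
    }
}
```

Keeping the status rule in its own method makes it easy to adjust later, for example to decide how a redirect status that was not followed automatically should be treated.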
I am trying to consume a WCF web service that is protected by SiteMinder. When I browse to the web service URL in a browser, it works fine with the credentials I supply.
But when I try to do the same programmatically, it throws an error:
error #401 unauthorized.
For reference:
http://www.codeproject.com/Articles/80314/How-to-Connect-to-a-SiteMinder-Protected-Resource
CookieContainer cookies = null;
HttpWebRequest request = null;
HttpWebResponse response = null;
string responseString = null;
NameValueCollection tags = null;
string url = null;
url = PROTECTED_URL;
Debug.WriteLine("Step 1: Requesting page #" + url);
request = (HttpWebRequest)WebRequest.Create(url);
request.AllowAutoRedirect = false;
response = (HttpWebResponse)request.GetResponse();
ShowResponse(response);
// Step 2: Get the redirection location
// make sure we have a valid response
if (response.StatusCode != HttpStatusCode.Found)
{
throw new ApplicationException();
}
url = response.Headers["Location"];
// Step 3: Open a connection to the redirect and load the login form,
// from this screen we will capture the required form fields.
Debug.WriteLine("Step 3: Requesting page #" + url);
request = (HttpWebRequest)WebRequest.Create(url);
request.AllowAutoRedirect = false;
try
{
response = (HttpWebResponse)request.GetResponse();
}
catch (Exception ex)
{
string str = ex.Message.ToString();
}
This is the HttpClient code I use to call WCF.
public async Task<string> webClient(string method, string uri)
{
try
{
var client = new HttpClient();
client.Timeout =new TimeSpan(0,0,0,10);
client.BaseAddress = new Uri(uri);
client.DefaultRequestHeaders.Accept.Add(
new MediaTypeWithQualityHeaderValue("application/json"));
var response = await client.GetAsync(method); // await instead of blocking on .Result inside an async method
string content = await response.Content.ReadAsStringAsync();
return content;
}
catch (Exception ex)
{
return "Error";
}
}
Here uri is the base address and method is your method name.
string response = webClient(uri + "/GetSomething/", uri).Result;
I am having an issue with my code. I need to verify the status of some websites in order to report whether they are working. I can check this successfully for a number of websites, but for one particular site it always returns 404 Not Found, even though the site is up and I can view it when I open the page in a browser. This is my code:
public HttpStatusCode GetHeaders(string url, bool proxyNeeded)
{
HttpStatusCode result = default(HttpStatusCode);
var request = (HttpWebRequest)WebRequest.Create(url);
request.Method = "HEAD";
request.Credentials = new NetworkCredential(username, password, domain);
if (proxyNeeded)
{
IWebProxy proxy = new WebProxy("127.0.0.1", port_number);
proxy.Credentials = new NetworkCredential(username, password, domain);
request.Proxy = proxy;
}
try
{
var response = (HttpWebResponse) request.GetResponse();
result = response.StatusCode;
response.Close();
}
catch (WebException e)
{
if (e.Status == WebExceptionStatus.ProtocolError)
{
WebResponse resp = e.Response;
using (StreamReader sr = new StreamReader(resp.GetResponseStream()))
{
//TODO: capture the exception!!
}
}
}
return result;
}
The exception does not tell me anything beyond "Not Found", which really confuses me, as I don't know where else to look. The failing URL is: http://cbprod-app/InterAction/home
If I request the page without setting the proxy, it comes back with "401 Not authorized", as it requires proxy authentication to be displayed. I am running out of ideas. Any suggestions about what I am doing wrong?
Thank you very much in advance.
I am attempting to load a page I've received from an RSS feed and I receive the following WebException:
Cannot handle redirect from HTTP/HTTPS protocols to other dissimilar ones.
with an inner exception:
Invalid URI: The hostname could not be parsed.
I wrote code that attempts to load the URL via an HttpWebRequest. Following some suggestions I received, when the HttpWebRequest fails I set AllowAutoRedirect to false and manually loop through the redirects until I find out what ultimately fails. Here's the code I'm using; please forgive the gratuitous Console.Write/WriteLine calls:
Uri url = new Uri(val);
bool result = true;
System.Net.HttpWebRequest req = (System.Net.HttpWebRequest)System.Net.HttpWebRequest.Create(url);
string source = String.Empty;
Uri responseURI;
try
{
using (System.Net.WebResponse webResponse = req.GetResponse())
{
using (HttpWebResponse httpWebResponse = webResponse as HttpWebResponse)
{
responseURI = httpWebResponse.ResponseUri;
StreamReader reader;
if (httpWebResponse.ContentEncoding.ToLower().Contains("gzip"))
{
reader = new StreamReader(new GZipStream(httpWebResponse.GetResponseStream(), CompressionMode.Decompress));
}
else if (httpWebResponse.ContentEncoding.ToLower().Contains("deflate"))
{
reader = new StreamReader(new DeflateStream(httpWebResponse.GetResponseStream(), CompressionMode.Decompress));
}
else
{
reader = new StreamReader(httpWebResponse.GetResponseStream());
}
source = reader.ReadToEnd();
reader.Close();
}
}
req.Abort();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(source);
result = true;
}
catch (ArgumentException ae)
{
Console.WriteLine(url + "\n--\n" + ae.Message);
result = false;
}
catch (WebException we)
{
Console.WriteLine(url + "\n--\n" + we.Message);
result = false;
string urlValue = url.ToString();
try
{
bool cont = true;
int count = 0;
do
{
req = (System.Net.HttpWebRequest)System.Net.HttpWebRequest.Create(urlValue);
req.Headers.Add("Accept-Language", "en-us,en;q=0.5");
req.AllowAutoRedirect = false;
using (System.Net.WebResponse webResponse = req.GetResponse())
{
using (HttpWebResponse httpWebResponse = webResponse as HttpWebResponse)
{
responseURI = httpWebResponse.ResponseUri;
StreamReader reader;
if (httpWebResponse.ContentEncoding.ToLower().Contains("gzip"))
{
reader = new StreamReader(new GZipStream(httpWebResponse.GetResponseStream(), CompressionMode.Decompress));
}
else if (httpWebResponse.ContentEncoding.ToLower().Contains("deflate"))
{
reader = new StreamReader(new DeflateStream(httpWebResponse.GetResponseStream(), CompressionMode.Decompress));
}
else
{
reader = new StreamReader(httpWebResponse.GetResponseStream());
}
source = reader.ReadToEnd();
if (string.IsNullOrEmpty(source))
{
urlValue = httpWebResponse.Headers["Location"].ToString();
count++;
reader.Close();
}
else
{
cont = false;
}
}
}
} while (cont);
}
catch (UriFormatException uriEx)
{
Console.WriteLine(urlValue + "\n--\n" + uriEx.Message + "\r\n");
result = false;
}
catch (WebException innerWE)
{
Console.WriteLine(urlValue + "\n--\n" + innerWE.Message+"\r\n");
result = false;
}
}
if (result)
Console.WriteLine("testing successful");
else
Console.WriteLine("testing unsuccessful");
Since this is currently just test code I hardcode val as http://rss.nytimes.com/c/34625/f/642557/s/3d072012/sc/38/l/0Lartsbeat0Bblogs0Bnytimes0N0C20A140C0A70C30A0Csarah0Ekane0Eplay0Eamong0Eofferings0Eat0Est0Eanns0Ewarehouse0C0Dpartner0Frss0Gemc0Frss/story01.htm
the ending url that gives the UriFormatException is: http:////www-nc.nytimes.com/2014/07/30/sarah-kane-play-among-offerings-at-st-anns-warehouse/?=_php=true&_type=blogs&_php=true&_type=blogs&_php=true&_type=blogs&_php=true&_type=blogs&_php=true&_type=blogs&_php=true&_type=blogs&_php=true&_type=blogs&partner=rss&emc=rss&_r=6&
Now I'm not sure if I'm missing something or doing the looping wrong, but if I take val and put it into a browser the page loads fine, and if I take the URL that causes the exception and put it in a browser I am taken to an account login for nytimes.
I have a number of these rss feed urls that are resulting in this problem. I also have a large number of these rss feed urls that have no problem loading at all. Let me know if there is any more information needed to help resolve this. Any help with this would be greatly appreciated.
Could it be that I need to have some sort of cookie capability enabled?
You need to keep track of the cookies while doing all your requests. You can use an instance of the CookieContainer class to achieve that.
At the top of your method I made the following changes:
Uri url = new Uri(val);
bool result = true;
// keep all our cookies for the duration of our calls
var cookies = new CookieContainer();
System.Net.HttpWebRequest req = (System.Net.HttpWebRequest)System.Net.HttpWebRequest.Create(url);
// assign our CookieContainer to the new request
req.CookieContainer = cookies;
string source = String.Empty;
Uri responseURI;
try
{
And in the exception handler where you create a new HttpWebRequest, you do the assignment from our CookieContainer again:
do
{
req = (System.Net.HttpWebRequest)System.Net.HttpWebRequest.Create(urlValue);
// reuse our cookies!
req.CookieContainer = cookies;
req.Headers.Add("Accept-Language", "en-us,en;q=0.5");
req.AllowAutoRedirect = false;
using (System.Net.WebResponse webResponse = req.GetResponse())
{
This makes sure that on each successive call the cookies already present are sent again with the next request. If you leave this out, no cookies are sent, so the site you are trying to visit assumes you are a new, unseen user and sends you down an authentication path.
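To see this behaviour without touching the network, here is a small sketch. The class name, cookie name and value, and the example.com URLs are made up for illustration. It stores a cookie as if a first response had set it, then asks the container which Cookie header a later request would carry.

```csharp
using System;
using System.Net;

public static class CookieReuseDemo
{
    // Simulates a cookie set by a response from `first`, then returns the
    // Cookie header that a later request to `next` would send.
    public static string CookieHeaderForNextRequest(Uri first, Uri next)
    {
        var cookies = new CookieContainer();
        cookies.Add(first, new Cookie("session", "abc123")); // hypothetical cookie
        return cookies.GetCookieHeader(next);
    }
}
```

Calling it with two URLs on the same host returns a header containing the stored cookie, while a URL on a different host gets nothing. A fresh CookieContainer per request therefore looks like a brand-new visitor to the site.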
If you want to store/keep cookies beyond this method you could move the cookie instance variable to a static public property so you can use all those cookies program-wide like so:
public static class Cookies
{
static readonly CookieContainer _cookies = new CookieContainer();
public static CookieContainer All
{
get
{
return _cookies;
}
}
}
And to use it in a WebRequest:
var req = (System.Net.HttpWebRequest) WebRequest.Create(url);
req.CookieContainer = Cookies.All;
How can I log in to HTTPS sites using WebRequest and WebResponse in C#?
Here is the code:
public string postFormData(Uri formActionUrl, string postData)
{
gRequest = (HttpWebRequest)WebRequest.Create(formActionUrl);
gRequest.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4";
gRequest.CookieContainer = new CookieContainer();
gRequest.Method = "POST";
gRequest.Accept = " text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, */*";
gRequest.KeepAlive = true;
gRequest.ContentType = @"text/html; charset=iso-8859-1";
#region CookieManagement
if (this.gCookies != null && this.gCookies.Count > 0)
{
gRequest.CookieContainer.Add(gCookies);
}
//logic to postdata to the form
string postdata = string.Format(postData);
byte[] postBuffer = System.Text.Encoding.GetEncoding(1252).GetBytes(postData);
gRequest.ContentLength = postBuffer.Length;
Stream postDataStream = gRequest.GetRequestStream();
postDataStream.Write(postBuffer, 0, postBuffer.Length);
postDataStream.Close();
//post data logic ends
//Get Response for this request url
gResponse = (HttpWebResponse)gRequest.GetResponse();
//check if the status code is http 200 or http ok
if (gResponse.StatusCode == HttpStatusCode.OK)
{
//get all the cookies from the current request and add them to the response object cookies
gResponse.Cookies = gRequest.CookieContainer.GetCookies(gRequest.RequestUri);
//check if response object has any cookies or not
if (gResponse.Cookies.Count > 0)
{
//check if this is the first request/response, if this is the response of first request gCookies
//will be null
if (this.gCookies == null)
{
gCookies = gResponse.Cookies;
}
else
{
foreach (Cookie oRespCookie in gResponse.Cookies)
{
bool bMatch = false;
foreach (Cookie oReqCookie in this.gCookies)
{
if (oReqCookie.Name == oRespCookie.Name)
{
oReqCookie.Value = oRespCookie.Value; // copy the new value, not the cookie name
bMatch = true;
break; //
}
}
if (!bMatch)
this.gCookies.Add(oRespCookie);
}
}
}
#endregion
StreamReader reader = new StreamReader(gResponse.GetResponseStream());
string responseString = reader.ReadToEnd();
reader.Close();
//Console.Write("Response String:" + responseString);
return responseString;
}
else
{
return "Error in posting data";
}
}
// calling the above function
httphelper.postFormData(new Uri("https://login.yahoo.com/config/login?.done=http://answers.yahoo.com%2f&.src=knowsrch&.intl=us"), ".tries=1&.src=knowsrch&.md5=&.hash=&.js=&.last=&promo=&.intl=us&.bypass=&.partner=&.u=0b440p15q1nmb&.v=0&.challenge=Rt_fM1duQiNDnI5SrzAY_GETpNTL&.yplus=&.emailCode=&pkg=&stepid=&.ev=&hasMsgr=0&.chkP=Y&.done=http%3A%2F%2Fanswers.yahoo.com%2F&.pd=knowsrch_ver%3D0%26c%3D%26ivt%3D%26sg%3D&login=xyz&passwd=xyz&.save=Sign+In");
You need to see how authentication works for the site you are working with.
This may be through cookies, special headers, hidden field or something else.
Fire up a tool like Fiddler and see what the network traffic looks like when logging in and how it differs from not being logged in.
Recreate this logic with WebRequest and WebResponse.
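If the captured traffic turns out to be an ordinary HTML form post, recreating it might look roughly like the sketch below. This rests on assumptions: that the site expects application/x-www-form-urlencoded data and tracks the session with cookies. The names FormLogin, BuildFormBody and CreateLoginRequest are hypothetical, and the real field names must come from what you see in Fiddler.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

public static class FormLogin
{
    // URL-encodes each name and value and joins them as name=value&name=value.
    public static string BuildFormBody(IEnumerable<KeyValuePair<string, string>> fields)
    {
        var parts = new List<string>();
        foreach (var field in fields)
        {
            parts.Add(Uri.EscapeDataString(field.Key) + "=" + Uri.EscapeDataString(field.Value));
        }
        return string.Join("&", parts);
    }

    // Prepares the POST request; the caller still has to write the body to
    // request.GetRequestStream() and then call GetResponse().
    public static HttpWebRequest CreateLoginRequest(Uri formAction, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(formAction);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded"; // assumed form encoding
        request.CookieContainer = cookies; // reuse the same container on later requests
        return request;
    }
}
```

Note that Uri.EscapeDataString encodes a space as %20 rather than +. Passing the same CookieContainer to every later request is what keeps the logged-in session alive.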
See the answers to this SO question (HttpRequest: pass through AuthLogin).
What for? Watin is good for testing and such, and it's easy to do basic screen scraping with it. Why reinvent the wheel if you don't have to?
You can set the WebRequest.Credentials property. For an example and documentation, see:
http://msdn.microsoft.com/en-us/library/system.net.networkcredential.aspx
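A minimal sketch of that suggestion follows. The class and method names are made up, and the URL, user name and password are placeholders to be supplied by the caller.

```csharp
using System.Net;

public static class CredentialExample
{
    // Creates a request that will answer an HTTP authentication challenge
    // with the given user name and password.
    public static HttpWebRequest CreateRequest(string url, string user, string password)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Credentials = new NetworkCredential(user, password);
        return request;
    }
}
```

Keep in mind that Credentials only comes into play when the server issues an HTTP-level authentication challenge such as Basic, Digest or NTLM. A login page built as an HTML form, like the one in the question, will ignore it.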