I am trying to fetch the HTML of the page below, but the request fails with a 405 error. Fetching the home page works fine, but this specific page returns "Method Not Allowed". Any thoughts?
WebRequest request = WebRequest.Create("https://www.realtor.com/realestateandhomes-search/California/counties");
request.UseDefaultCredentials = true;
request.Credentials = CredentialCache.DefaultCredentials;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Stream dataStream = response.GetResponseStream();
StreamReader reader = new StreamReader(dataStream);
string responseFromServer = reader.ReadToEnd();
Console.WriteLine(responseFromServer);
The site thinks you are a bot.
Details:
I tried it with HttpClient (recommended, since it doesn't throw an exception on a non-200 response code) and inspected the response HTML. Here is the important snippet:
<p>
As you were browsing, something about your browser made us think you might be a bot. There are a few reasons this might happen, including:
</p>
<ul>
<li>You're a power user moving through this website with super-human speed</li>
<li>You've disabled JavaScript and/or cookies in your web browser</li>
<li>A third-party browser plugin is preventing JavaScript from running. Additional information is available in this
<a title='Third party browser plugins that block javascript' href='http://ds.tl/help-third-party-plugins' target='_blank'>
support article
</a>.
</li>
</ul>
If you want the full response, try running this:
// Returning Task (rather than async void) lets the caller await the call and observe exceptions.
async System.Threading.Tasks.Task LogResponse()
{
using System.Net.Http.HttpClient client = new System.Net.Http.HttpClient();
var response = await client.GetAsync("https://www.realtor.com/realestateandhomes-search/California/counties");
Console.WriteLine(await response.Content.ReadAsStringAsync());
}
Side complaint against realtor.com: 405 (the method specified in the Request-Line is not allowed) is a rather poor response code for this; a 403 (the server understood the request but is refusing to fulfill it) seems better suited.
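For comparison, the original WebRequest code can surface the same HTML too. GetResponse() throws a WebException on a 4xx/5xx status, but the response is still attached to the exception. Below is a minimal sketch of that idea; the class and method names (ErrorBodyReader, ReadAll, Fetch) are made up for illustration and are not from the question.

```csharp
using System;
using System.IO;
using System.Net;

public static class ErrorBodyReader
{
    // Reads a stream to the end as text and disposes it.
    public static string ReadAll(Stream stream)
    {
        using (var reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }

    // Returns the response body whether the server answers with success or an error status.
    public static string Fetch(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                return ReadAll(response.GetResponseStream());
            }
        }
        catch (WebException ex) when (ex.Response is HttpWebResponse errorResponse)
        {
            // The error response (for example the 405 page) travels with the exception.
            using (errorResponse)
            {
                Console.WriteLine("Status: " + (int)errorResponse.StatusCode);
                return ReadAll(errorResponse.GetResponseStream());
            }
        }
    }
}
```

Calling ErrorBodyReader.Fetch(...) with the URL from the question should then print the status code and return the "you might be a bot" page instead of only raising an exception.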
My application is sending some data to some government's service.
The workflow is to first authenticate on their REST(JSON) service to get an authentication token, and then send the actual data+token to their SOAP service.
The problem is that if I call the authentication service in quick succession after the last SOAP request, their REST service returns a "404 – Not Found" HTML page instead of a JSON response.
This is the code for sending authentication requests:
RestClient client = new RestClient(ret.Url);
AuthRequestToken requestToken = new AuthRequestToken();
requestToken.userLoginDetails.organisationCode = _organizationCode;
requestToken.userLoginDetails.userId = _username;
requestToken.userLoginDetails.password = _password;
ret.RequestJson = requestToken.ToString();
var request = new RestRequest(Method.POST);
request.AddHeader("Content-Type", "application/json");
request.AddHeader("cache-control", "no-cache");
request.AddParameter("application/json", ret.RequestJson, ParameterType.RequestBody);
IRestResponse response = client.Execute(request);
This is the code for sending SOAP requests:
HttpWebRequest webRequest = CreateWebRequest(envelope);
using (WebResponse webResponse = webRequest.GetResponse())
{
using (Stream responseStream = webResponse.GetResponseStream())
{
using (StreamReader rd = new StreamReader(responseStream))
{
ret.ResponseXML = rd.ReadToEnd();
}
responseStream.Close();
}
}
This is the CreateWebRequest() method:
private HttpWebRequest CreateWebRequest(XElement content)
{
HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(_url);
//webRequest.Headers.Add("SOAPAction", action);
webRequest.ContentType = "text/xml;charset=\"utf-8\"";
webRequest.Accept = "text/xml";
webRequest.Method = "POST";
using (Stream stream = webRequest.GetRequestStream())
{
content.Save(stream);
}
return webRequest;
}
RestClient is a class in the RestSharp library downloaded from https://restsharp.dev/
Using TcpView or netstat -abn I can see that after any request (either RestClient or HttpWebRequest), the connection stays in ESTABLISHED state for up to 5-30 seconds.
Everything works fine 99% of the time, except in a specific scenario when I make a RestClient request within 5-30 seconds after the last HttpWebRequest, before the connection switches from ESTABLISHED to CLOSE_WAIT.
I should mention that this code was working perfectly up to a couple of days ago. Before then, their authentication service was on a different IP address from their SOAP service. Now they are on the same IP address, and probably even on the same physical server.
Before they switched the servers, I used to send an authentication request before each and every SOAP request, and it worked. Since this error started happening, I have modified my code to authenticate only occasionally and reuse the same token for a batch of SOAP requests. This considerably reduced the chance of the error, but I still occasionally get it when traffic is high.
It seems to me that RestClient and HttpWebRequest are using the same socket under the hood and one of them is not cleaning up properly. It seems that RestClient inherits some junk from the HttpWebRequest because the "404 - Not Found" returned by the service looks the same as when I deliberately navigate to the wrong URL of the authentication service.
It is also possible that I'm not disposing or closing something properly, but I tried closing every stream, client or connection I could find, and injected 'using' everywhere, but nothing seems to help.
I tried contacting the government's tech support, but judging by my prior experience, it will take weeks before they even bother to connect me to someone who can understand the problem.
This is the 404 HTML I get:
<!doctype html>
<html lang="en">
<head>
<title>HTTP Status 404 – Not Found</title>
<style type="text/css">body {font-family:Tahoma,Arial,sans-serif;} h1, h2, h3, b {color:white;background-color:#525D76;} h1 {font-size:22px;} h2 {font-size:16px;} h3 {font-size:14px;} p {font-size:12px;} a {color:black;} .line {height:1px;background-color:#525D76;border:none;}</style>
</head>
<body>
<h1>HTTP Status 404 – Not Found</h1>
<hr class="line" />
<p>
<b>Type</b> Status Report</p>
<p>
<b>Description</b> The origin server did not find a current representation for the target resource or is not willing to disclose that one exists.</p>
<hr class="line" />
<h3>Apache Tomcat/9.0.35</h3>
</body>
</html>
Do you have any suggestion on what I could try to prevent this from happening?
As I said, I currently have a workaround that tries to refresh the token when it gets the chance, and even delays regular requests if necessary, but I'd like to avoid workarounds if possible, especially since I don't know what the socket timeout is. It is 5 seconds on most computers, but on some wireless networks the connection stays ESTABLISHED for almost a minute.
If it matters, both services are on HTTPS.
Thank you!
I solved it by making a small console application that receives the credentials through command-line parameters, connects to the REST service, and writes the token to standard output.
The parent application periodically calls this exe in the background and reads the new token from its standard output.
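The parent side of such a setup could look roughly like the sketch below. It is only an illustration: the class and method names (TokenHelper, FetchToken, ParseToken), the exe path, and the argument layout are assumptions, not the actual code. Presumably this helps because the helper runs in its own process and so does not share HTTP connections with the parent's HttpWebRequest calls, but that is a guess.

```csharp
using System;
using System.Diagnostics;

public static class TokenHelper
{
    // Trims the helper's standard output; returns null if nothing was printed.
    public static string ParseToken(string standardOutput)
    {
        string token = (standardOutput ?? string.Empty).Trim();
        return token.Length == 0 ? null : token;
    }

    // Runs the (hypothetical) token helper exe and returns the token it prints.
    public static string FetchToken(string exePath, string arguments)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = exePath,
            Arguments = arguments,
            UseShellExecute = false,       // required for output redirection
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };
        using (Process process = Process.Start(startInfo))
        {
            // Read the output before waiting, so a full pipe buffer cannot block the child.
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return process.ExitCode == 0 ? ParseToken(output) : null;
        }
    }
}
```

A caller would pass the path of the helper exe and its arguments, and treat a null result as "keep using the old token and retry later".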
I have an external URL, such as http://a.com/?id=5 (not part of my project), and I want my website to show its contents.
For example, my website (http://MyWebsite.com/?id=123) shows the contents of the third-party URL (http://a.com/?id=5).
However, I don't want the client side to see the real URL (http://a.com/?id=5). I will check authorization first and then show the page.
I assume that you do not have control over the server behind "http://a.com/?id=5". If the browser loads that URL directly, I think there is no way to completely hide it from users: they can always look at the HTML source and the HTTP requests and trace back the original location.
One possible way to partially hide the external site is to use the MVC equivalent of curl in your controller: after the user is authenticated, your server requests the page from "http://a.com/?id=5" and returns the result to the user:
ASP.NET MVC - Using cURL or similar to perform requests in application:
I assume the request to "http://a.com/?id=5" is in GET method:
public string GetResponseText(string userAgent) {
string url = "http://a.com/?id=5";
string responseText = String.Empty;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Method = "GET";
request.UserAgent = userAgent;
// Dispose the response as well as the reader so the connection is released.
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (StreamReader sr = new StreamReader(response.GetResponseStream())) {
responseText = sr.ReadToEnd();
}
return responseText;
}
Then you just need to call this in your controller. Pass along the client's user agent so that the page is served the same way as if they had opened it in their own browser:
return GetResponseText(request.UserAgent);
// request is the request passed to the controller for http://MyWebsite.com/?id=123
PS: I may not be using the correct MVC API, but the idea is there. Look up the MVC documentation on HttpWebRequest to make it work correctly.
I would like to grab some content from a website that is built with Drupal.
The challenge is that I need to log in to the site before I can access the page I want to scrape. Is there a way to automate the login process in my C# code so I can grab the secured content?
To access the secured content, you'll need to store and send cookies with every request to your server, starting with the request that sends your log in info and then saving the session cookie that the server gives you (which is your proof that you are who you say you are).
You can use System.Windows.Forms.WebBrowser for an out-of-the-box solution that handles cookies for you, at the cost of less control.
My preferred method is to use System.Net.HttpWebRequest to send and receive all web data and then use the HtmlAgilityPack to parse the returned data into a Document Object Model (DOM) which can be easily read from.
The trick to getting System.Net.HttpWebRequest to work is that you must create a long-lived System.Net.CookieContainer that will keep track of your log in info (and other things the server expects you to keep track of). The good news is that the HttpWebRequest will take care of all of this for you if you provide the container.
You need a new HttpWebRequest for each call you make, so you must set each request's .CookieContainer to the same object every time. Here is an example:
UNTESTED
using System.Net;
public void TestConnect()
{
CookieContainer cookieJar = new CookieContainer();
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.mysite.com/login.htm");
request.CookieContainer = cookieJar;
HttpWebResponse response = (HttpWebResponse) request.GetResponse();
// do page parsing and request setting here
request = (HttpWebRequest)WebRequest.Create("http://www.mysite.com/submit_login.htm");
// add specific page parameters here
request.CookieContainer = cookieJar;
response = (HttpWebResponse) request.GetResponse();
request = (HttpWebRequest)WebRequest.Create("http://www.mysite.com/secured_page.htm");
request.CookieContainer = cookieJar;
// this will now work since you have saved your authentication cookies in 'cookieJar'
response = (HttpWebResponse) request.GetResponse();
}
http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser.aspx
HttpWebRequest Class
http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.cookiecontainer.aspx
You'll have to use the Services module to do that. Also check out this link for a bit of explanation.
I'm trying to get a stream from a URL, http://actueel.nl.pwc.com/site/syndicate.jsp, but I get a 403 error. It doesn't require a login. I used Fiddler to check why IE can open it while my code can't. What I found was that two connections were made when opening the link in IE: one succeeded while the other got a 403. The 403 was for a sublink to a GIF image. It seems the XML is a public file, but the image it references is located in an inaccessible folder.
I need to know how to ignore the image so I can still get the rest of the stream. This is my test code (by the way, I also tried WebClient and setting headers):
try
{
WebRequest request = WebRequest.Create("http://actueel.nl.pwc.com/site/syndicate.jsp");
request.Credentials = CredentialCache.DefaultCredentials;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Stream dataStream = response.GetResponseStream();
StreamReader reader = new StreamReader(dataStream);
MessageBox.Show(reader.ReadToEnd());
}
catch(Exception ex){
MessageBox.Show(ex.Message);
}
Thanks for your reactions ;)
I agree with Dmytro. The WebRequest is NOT attempting to download the GIF image referenced in the JSP file; only the contents of the JSP itself are being downloaded. Try looking carefully (in Fiddler) at the IE request compared to yours, not only the URL but also all the request/response headers, and see if anything else is missing, such as cookies or Accept headers.
Comparing the requests with Wireshark and wget, the only differences were in the headers.
The remote server requires both a User-Agent and an Accept header.
For example:
WebRequest request = WebRequest.Create("http://actueel.nl.pwc.com/site/syndicate.jsp");
((HttpWebRequest)request).UserAgent = "stackoverflow.com/q/4233673/111013";
((HttpWebRequest) request).Accept = "*/*";
Basically, I'm trying to grab an EXE from CNet's Download.com.
I created a web parser, and so far all is going well.
Here is a sample link pulled directly from their site:
http://dw.com.com/redir?edId=3&siteId=4&oId=3001-20_4-10308491&ontId=20_4&spi=e6323e8d83a8b4374d43d519f1bd6757&lop=txt&tag=idl2&pid=10566981&mfgId=6250549&merId=6250549&pguid=PlvcGQoPjAEAAH5rQL0AAABv&destUrl=ftp%3A%2F%2F202.190.201.108%2Fpub%2Fryl2%2Fclient%2Finstaller-ryl2_v1673.exe
Here is the problem: when you attempt to download, the request begins over HTTP and then redirects to an FTP site. I have tried .NET's WebClient and HttpWebRequest objects, and it looks like neither can follow this redirect.
This code fails at GetResponse():
HttpWebRequest req = (HttpWebRequest)WebRequest.Create("http://dw.com.com/redir");
WebResponse response = req.GetResponse();
Now, I also tried this:
HttpWebRequest req = (HttpWebRequest)WebRequest.Create("http://dw.com.com/redir");
req.AllowAutoRedirect = false;
WebResponse response = req.GetResponse();
string s = new StreamReader(response.GetResponseStream()).ReadToEnd();
And it does not throw the error anymore, however variable s turns out to be an empty string.
I'm at a loss! Can anyone help out?
You can get the value of the "Location" header from response.Headers, and then create a new FtpWebRequest to download that resource.
In your first code snippet you are redirected to a link that uses a different protocol (i.e. it is no longer HTTP, as HttpWebRequest expects), so it fails due to a malformed HTTP response.
In the second snippet you are no longer redirected, so you never receive the FTP response (which would be malformed when interpreted as an HTTP response). You only get the HTTP redirect response itself, which is why the body comes back empty.
You need to obtain the FTP link. As ferozo wrote, you can do this by reading the value of the "Location" header, and then use an FtpWebRequest to access the file.
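Put together, the two steps could be sketched as follows. This is an untested illustration, not a definitive implementation: the names (RedirectDownloader, ResolveLocation, Download) are made up, it assumes the server sends the FTP address in the Location header and allows anonymous FTP access, and on recent .NET versions WebRequest and FtpWebRequest are marked obsolete.

```csharp
using System;
using System.IO;
using System.Net;

public static class RedirectDownloader
{
    // Resolves a Location header value against the URL that produced it.
    public static Uri ResolveLocation(Uri requestUri, string location)
    {
        return new Uri(requestUri, location);
    }

    // Reads the redirect target over HTTP, then downloads it over FTP.
    public static void Download(string url, string targetPath)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.AllowAutoRedirect = false; // we want the redirect itself, not its target

        Uri target;
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            string location = response.Headers["Location"];
            if (string.IsNullOrEmpty(location))
                throw new InvalidOperationException("No Location header in the response.");
            target = ResolveLocation(request.RequestUri, location);
        }

        if (target.Scheme != Uri.UriSchemeFtp)
            throw new NotSupportedException("Expected an FTP redirect, got: " + target);

        var ftpRequest = (FtpWebRequest)WebRequest.Create(target);
        ftpRequest.Method = WebRequestMethods.Ftp.DownloadFile;
        using (var ftpResponse = (FtpWebResponse)ftpRequest.GetResponse())
        using (Stream source = ftpResponse.GetResponseStream())
        using (FileStream destination = File.Create(targetPath))
        {
            source.CopyTo(destination); // stream the file to disk
        }
    }
}
```

The scheme check is there so that an unexpected HTTP-to-HTTP redirect fails loudly instead of being cast to FtpWebRequest.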