I am creating an application that checks for broken links in content.
Everything works apart from YouTube links, where I get mixed results: broken links (or video codes I have just made up) sometimes come back as 200 OK and sometimes come back as broken.
Is there a different way of checking for broken links on YouTube?
I'm using standard .NET/C# code:
try
{
    HttpWebRequest request = WebRequest.Create(match.Groups[1].ToString()) as HttpWebRequest;
    // Set the request method to HEAD; GET works too.
    request.Method = "HEAD";
    // Get the web response.
    HttpWebResponse response = request.GetResponse() as HttpWebResponse;
    // Returns TRUE if the Status code == 200
    // result = "true";
    result = response.StatusDescription;
    response.Close();
    // return (response.StatusCode == HttpStatusCode.OK);
}
catch
{
    // Any exception returns false.
    result = "false";
}
if (match.Groups[1].ToString().Contains(@"\n"))
{
    //....
}
"sometime come up with 200 ok and sometimes they come up as broken."
What you are doing is similar to what a web crawler does, and YouTube almost certainly has an anti-crawler mechanism, so if you request links continuously your access may be blocked. To make that less likely, reduce the request frequency and make your requests resemble those of a real user as closely as possible.
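A minimal sketch of that advice, assuming HttpClient is available. The class name `ThrottledLinkChecker`, the User-Agent string, and the delay value are illustrative assumptions, not values known to satisfy YouTube. It checks URLs one at a time, sends a browser-like User-Agent, and pauses between requests.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

static class ThrottledLinkChecker
{
    // One shared client for all requests.
    private static readonly HttpClient Client = CreateClient();

    private static HttpClient CreateClient()
    {
        var client = new HttpClient();
        // Example User-Agent only; pick whatever identifies your tool honestly.
        client.DefaultRequestHeaders.TryAddWithoutValidation(
            "User-Agent", "Mozilla/5.0 (compatible; LinkChecker/1.0)");
        return client;
    }

    // Checks the URLs sequentially, waiting delayBetweenRequests after each one.
    // The result maps each URL to true when a 2xx status came back.
    public static async Task<Dictionary<string, bool>> CheckAllAsync(
        IEnumerable<string> urls, TimeSpan delayBetweenRequests)
    {
        var results = new Dictionary<string, bool>();
        foreach (string url in urls)
        {
            try
            {
                // Read only the headers; the body is not needed for a link check.
                using (HttpResponseMessage response = await Client.GetAsync(
                    url, HttpCompletionOption.ResponseHeadersRead))
                {
                    results[url] = response.IsSuccessStatusCode;
                }
            }
            catch (HttpRequestException)
            {
                results[url] = false; // DNS failure, refused connection, etc.
            }
            catch (TaskCanceledException)
            {
                results[url] = false; // request timed out
            }
            await Task.Delay(delayBetweenRequests);
        }
        return results;
    }
}
```

Usage might look like `var results = await ThrottledLinkChecker.CheckAllAsync(urls, TimeSpan.FromSeconds(2));`. The two-second pause is a guess; tune it to whatever rate stops the inconsistent answers.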
I want to check if a given URL is a link to http://youtube.com.
I know there are lots of shortened versions of the links (e.g. http://youtu.be), so what I am after is a way to resolve the URL and see if it ends up at http://youtube.com.
A couple of example inputs are:
http://www.youtube.com/v/[videoid]
http://www.youtu.be/watch?v=[videoid]
Does anyone know of a way to do this?
You could perform a HEAD request:
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create("http://www.youtu.be/Ddn4MGaS3N4");
request.Method = "HEAD";
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    Console.WriteLine("Does this resolve to youtube?: {0}", response.ResponseUri.ToString().Contains("youtube.com") ? "Yes" : "No");
}
Appears to work fine. Unsure of edge cases but seems to do the job.
(Note: No error checking here such as 404 errors, etc).
string host = new Uri(url).Host;
// Accept the bare domains as well as subdomains such as www. or m.
bool isYoutube = host == "youtube.com" || host.EndsWith(".youtube.com")
    || host == "youtu.be" || host.EndsWith(".youtu.be");
First you may have to check what the host name is for YouTube (I'm just assuming it is youtube.com), but once you have that, the following code will do what you want:
using System.Net;
// Dns.Resolve is obsolete; Dns.GetHostEntry is its replacement.
IPHostEntry host = Dns.GetHostEntry(theInputHostName);
// Host names carry no scheme, so compare against the bare name.
if (host.HostName == "youtube.com")
{
    // it resolves to youtube, do something.
}
If you want to know whether a given URL redirects (using status codes 301/302) to a YouTube URL, you may either use WebClient/HttpWebRequest/whatever directly and check the response, or disable HttpWebRequest.AllowAutoRedirect and traverse all redirects manually (checking the status code and then the Location HTTP header).
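The manual traversal could be sketched roughly as follows. The names `RedirectTracer`, `IsYouTubeHost`, `ResolveFinalUri`, and the `maxHops` limit are hypothetical, introduced only for illustration. The sketch assumes the server answers HEAD requests; 4xx/5xx replies still surface as a WebException from GetResponse and are not handled here.

```csharp
using System;
using System.Net;

static class RedirectTracer
{
    // Pure helper: true for youtube.com, youtu.be and their subdomains.
    public static bool IsYouTubeHost(string host)
    {
        host = host.ToLowerInvariant();
        return host == "youtube.com" || host.EndsWith(".youtube.com")
            || host == "youtu.be" || host.EndsWith(".youtu.be");
    }

    // Follows Location headers by hand, giving up after maxHops redirects.
    public static Uri ResolveFinalUri(Uri start, int maxHops)
    {
        Uri current = start;
        for (int hop = 0; hop < maxHops; hop++)
        {
            var request = (HttpWebRequest)WebRequest.Create(current);
            request.Method = "HEAD";
            request.AllowAutoRedirect = false; // we inspect each redirect ourselves

            using (var response = (HttpWebResponse)request.GetResponse())
            {
                int code = (int)response.StatusCode;
                string location = response.Headers["Location"];
                if (code < 300 || code >= 400 || string.IsNullOrEmpty(location))
                {
                    return current; // not a redirect: this is the final URL
                }
                // Location may be relative, so resolve it against the current URL.
                current = new Uri(current, location);
            }
        }
        return current;
    }
}
```

A caller would then combine the two, for example `RedirectTracer.IsYouTubeHost(RedirectTracer.ResolveFinalUri(new Uri(url), 5).Host)`.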
I'm sending an HTTPWebRequest to a 3rd party with the code below. The response takes between 2 and 22 seconds to come back. The 3rd party claims that once they receive it, they are sending back a response immediately, and that none of their other partners are reporting any delays (but I'm not sure I believe them -- they've lied before).
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://www.example.com");
request.Timeout = 38000;
request.Method = "POST";
request.ContentType = "text/xml";
StreamWriter streamOut = new StreamWriter(request.GetRequestStream(), System.Text.Encoding.ASCII);
streamOut.Write(XMLToSend); // XMLToSend is just a string that is maybe 1kb in size
streamOut.Close();
HttpWebResponse resp = null;
resp = (HttpWebResponse)request.GetResponse(); // This line takes between 2 and 22 seconds to return.
StreamReader responseReader = new StreamReader(resp.GetResponseStream(), Encoding.UTF8);
Response = responseReader.ReadToEnd(); // Response is merely a string to hold the response.
Is there any reason that the code above would just...pause? The code is running in a very solid hosting provider (Rackspace Intensive Segment), and the machine it is on isn't being used for anything else. I'm merely testing some code that we are about to put into production. So, it's not that the machine is taxed, and given that it is Rackspace and we are paying a boatload, I doubt it is their network either.
I'm just trying to make sure that my code is as fast as possible, and that I'm not doing anything stupid, because in a few weeks, this code will be ramped up to run 20,000 requests to this 3rd party every hour.
Try doing a flush before you close.
streamOut.Flush();
streamOut.Close();
Also download Microsoft Network Monitor to see for certain whether the holdup is on your side or theirs. You can download it here:
http://www.microsoft.com/downloads/en/details.aspx?FamilyID=983b941d-06cb-4658-b7f6-3088333d062f&displaylang=en
There are a few things I would do:
Profile the code above and get some definitive timings.
Use using statements so that resources are disposed of correctly.
Write the code in an async style; there is going to be an awful lot of I/O wait once it's ramped up.
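The three suggestions can be combined in one small sketch. This swaps in HttpClient for the question's HttpWebRequest, which is my substitution rather than anything the answer prescribes, and the names `TimedPoster` and `PostXmlAsync` are made up for the example. It times each call with a Stopwatch, disposes everything through using, and awaits the I/O instead of blocking.

```csharp
using System;
using System.Diagnostics;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

static class TimedPoster
{
    // A single shared client avoids opening a new connection pool per request.
    private static readonly HttpClient Client = new HttpClient();

    // Posts the XML and returns the response body together with the elapsed time.
    public static async Task<(string Body, TimeSpan Elapsed)> PostXmlAsync(
        string url, string xml)
    {
        var stopwatch = Stopwatch.StartNew();
        // ASCII and text/xml mirror the encoding and content type in the question.
        using (var content = new StringContent(xml, Encoding.ASCII, "text/xml"))
        using (HttpResponseMessage response = await Client.PostAsync(url, content))
        {
            string body = await response.Content.ReadAsStringAsync();
            stopwatch.Stop();
            return (body, stopwatch.Elapsed);
        }
    }
}
```

Logging `Elapsed` for every call gives the definitive timings to compare against what the network monitor shows, which should indicate whether the 2 to 22 seconds are spent locally or waiting on the third party.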
Can you hit the URL in a regular ole browser? How fast is that?
Can you hit other URLs (not your partner's) in this code? How fast is that?
It is entirely possible you're getting bitten by the 'latency bug' where even an instant response from your partner results in unpredictable delays from your perspective.
Another thought: I noticed the https in your URL. Is it any faster with http?
My app currently uses OAuth to communicate with the Twitter API. Back in December, Twitter upped the rate limit for OAuth to 350 requests per hour. However, I am not seeing this. I am still getting 150 from the account/rate_limit_status method.
I was told that I needed to use the X-RateLimit-Limit HTTP header to get the new rate limit. However, in my code, I do not see that header.
Here is my code...
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(newURL);
request.Method = "GET";
request.ServicePoint.Expect100Continue = false;
request.ContentType = "application/x-www-form-urlencoded";
using (WebResponse response = request.GetResponse())
{
    using (StreamReader reader = new StreamReader(response.GetResponseStream()))
    {
        responseString = reader.ReadToEnd();
    }
}
If I inspect the response, I can see that it has a property for Headers, and that there are 16 headers. However, I do not have X-RateLimit-Limit in the list.
Any idea what I am doing wrong?
You should simply be able to use:
using (WebResponse response = request.GetResponse())
{
    string limit = response.Headers["X-RateLimit-Limit"];
    ...
}
If that doesn't work as expected, you can do a watch on response.Headers and see what's in there.
Look at the raw response text (e.g., with Fiddler). If the header isn't there, no amount of C# code is going to make it appear. :) From what you've shown, it seems the header isn't in the response.
Update: When I go to: http://twitter.com/account/rate_limit_status.xml there is no X-RateLimit-Limit header. But when I go to http://twitter.com/statuses/public_timeline.xml, it's there. So I think you just need to use a different method.
It still says 150, though!
How can I check whether a file exists at a web location in ASP.NET (in a different web application, but on the same web server)? Currently I am doing it like this. Is there a better way?
using (WebClient client = new WebClient())
{
    try
    {
        Stream stream = client.OpenRead("http://localhost/images/myimage.jpg");
        if (stream != null)
        {
            //exists
        }
    }
    catch
    {
        //Not exists
    }
}
Remember that you are never going to get a 100% definitive response on the existence of a file, but the way I do it would be pretty similar to yours...
bool remoteFileExists(string addressOfFile)
{
    try
    {
        HttpWebRequest request = WebRequest.Create(addressOfFile) as HttpWebRequest;
        request.Method = "HEAD";
        request.CachePolicy = new RequestCachePolicy(RequestCacheLevel.NoCacheNoStore);
        // Dispose the response so the connection is released.
        using (var response = request.GetResponse() as HttpWebResponse)
        {
            return (response.StatusCode == HttpStatusCode.OK);
        }
    }
    catch (WebException)
    {
        // GetResponse throws for 4xx/5xx statuses and for network failures.
        return false;
    }
}
EDIT :: looking at the edit by Anton Gogolev above (How can one check to see if a remote file exists using C#) I should have cast the response to a HttpWebResponse object and checked the status code. Edited the code to reflect that
If a file is accessible via HTTP, you can issue an HTTP HEAD request for that particular URL using HttpWebRequest. If HttpWebResponse.StatusCode is 200, then the file is there.
EDIT: See this on why GetResponse throws stupid exceptions when it actually should not do that.
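Because GetResponse throws a WebException for 4xx and 5xx replies, the real status code has to be read from the exception. A minimal sketch of that pattern follows; the `HeadStatus` class and `GetStatus` method are hypothetical names, and the null return for "no HTTP response at all" is my own convention.

```csharp
using System;
using System.Net;

static class HeadStatus
{
    // Returns the HTTP status code for a HEAD request, or null when no HTTP
    // response was received at all (DNS failure, timeout, refused connection).
    public static HttpStatusCode? GetStatus(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "HEAD";
        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                return response.StatusCode;
            }
        }
        catch (WebException ex)
        {
            // For 4xx/5xx replies the server's response travels on the exception.
            using (var errorResponse = ex.Response as HttpWebResponse)
            {
                return errorResponse?.StatusCode;
            }
        }
    }
}
```

With this, `HeadStatus.GetStatus(url) == HttpStatusCode.NotFound` separates a missing file from a network problem, which a bare `catch { return false; }` cannot do.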
You can use Server.MapPath to get the directory and then check whether the file exists using standard IO methods like File.Exists.
The 404 or Not Found error message is a HTTP standard response code indicating that the client was able to communicate with the server but the server could not find what was requested. A 404 error indicates that the requested resource may be available in the future.
You can use a HEAD request (HttpWebRequest.Method = "HEAD")
In our application we have a kind of online help. It works really simply: if the user clicks on the help button, a URL is built depending on the current language and help context (e.g. "http://example.com/help/" + [LANG_ID] + [HELP_CONTEXT]) and opened in the browser.
So my question is: how can I check if a file exists on the web server without loading the complete file content?
Thanks for your Help!
Update: Thanks for your help. My question has been answered.
Now we have proxy authentication problems and cannot send the HTTP request ;)
You can use .NET to do a HEAD request and then look at the status of the response.
Your code would look something like this (adapted from The Lowly HTTP HEAD Request):
// create the request
HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
// instruct the server to return headers only
request.Method = "HEAD";
// make the connection
HttpWebResponse response = request.GetResponse() as HttpWebResponse;
// get the status code
HttpStatusCode status = response.StatusCode;
Here's a list detailing the status codes defined by the HttpStatusCode enumeration.
Can we assume that you are running your web application on the same web server as you are retrieving your help pages from? If yes, then you can use the Server.MapPath method to find a path to the file on the server combined with the File.Exists method from the System.IO namespace to confirm that the file exists.
Had the same problem myself and found this question and the answers here really useful.
But the answers here use the old WebRequest class, which is a bit outdated; it has no async support, for starters. So I wanted to use the more modern way of doing it with HttpClient. Here is an example with a little helper class to check if the file exists:
using System.Net.Http;
using System.Threading.Tasks;

class HttpClientHelper
{
    private static HttpClient _httpClient;

    public static async Task<bool> DoesFileExist(string url)
    {
        if (_httpClient == null)
        {
            _httpClient = new HttpClient();
        }
        using (HttpRequestMessage request = new HttpRequestMessage(HttpMethod.Head, url))
        {
            using (HttpResponseMessage response = await _httpClient.SendAsync(request))
            {
                return response.StatusCode == System.Net.HttpStatusCode.OK;
            }
        }
    }
}
Usage:
if (await HttpClientHelper.DoesFileExist("https://www.google.com/favicon.ico"))
{
    // Yes it does!
}
else
{
    // No it doesn't!
}
Send a HEAD request for the URL (instead of a GET). The server will return a 404 if it doesn't exist.
Take a look at the HttpWebResponse class. You could do something like this:
string url = "http://example.com/help/" + LANG_ID + HELP_CONTEXT;
WebRequest request = WebRequest.Create(url);
// GetResponse throws a WebException for 404 and other error statuses.
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    if (response.StatusCode == HttpStatusCode.OK)
    {
        // worked
    }
}
If you want to check the status of a document on the server:
function fetchStatus(address) {
    var client = new XMLHttpRequest();
    client.onreadystatechange = function() {
        // in case of network errors this might not give reliable results
        if (this.readyState == 4)
            returnStatus(this.status);
    };
    client.open("HEAD", address);
    client.send();
}
Thank you.
EDIT: Apparently a good method to do this would be a HEAD request.
You could also create a server-side application that stores the name of every available web page on the server. Your client application could then query this application and respond a little bit quicker than a full page request, and without throwing a 404 error every time the file doesn't exist.