How to find the content of a URL in c#? - c#

I have a URL. Now I want to find out the content of the URL. By content of the URL I mean whether the URL contains a html page, video or an image/photo. How can I do this in asp.net with c#.

The easiest way would be to do a HEAD request with HttpWebRequest:
var req = (HttpWebRequest)WebRequest.Create(url);
req.Method = "HEAD";
using (var response = (HttpWebResponse)req.GetResponse())
{
// Here, examine the response headers.
// In particular response.ContentType
}
In some cases, HEAD might give you a 405 error, meaning that the server doesn't support HEAD.
In that case, just do a GET request (change req.Method = "GET"). That will start to download the page, but you can still view the content type header.

Probably start off using a WebClient and visit/download the page. Then use an HTML parser, and whatever method you deem best, to determine what kind of content is on the page.

Except from following the link, fetching the result and figuring out from the file content what file it is (which is rather tricky), there is no fool proof way.
You can try to determine from the file extension or the returned content-type header (you can issue a HEAD request) what the type should be. This will tell you what the server claims the file type to be.

For easier testing, this is a console application, but it should work with ASP.NET all the same:
namespace ConsoleApplication1
{
using System;
using System.Net;
class Program
{
static void Main()
{
//var request = WebRequest.Create("https://www.google.com"); // page will result in html/text
var request = WebRequest.Create(#"https://www.google.de/logos/2013/douglas_adams_61st_birthday-1062005.2-res.png");
request.Method = "HEAD"; // only request header information, don't download the whole file
var response = request.GetResponse();
Console.WriteLine(response.ContentType);
Console.WriteLine("Done.");
Console.ReadLine();
}
}
}

Related

.net MVC:How to hide the true URL?

I have an external URL, like http://a.com/?id=5 (not in my project)
and I want my website to show this URL's contents,
ex.
My website(http://MyWebsite.com/?id=123) shows 3rd party's url (http://a.com/?id=5) contents
but I don't want the client side to get a real URL(http://a.com/?id=5), I'll check the AUTH first and then shows the page.
I assume that you do not have control over the server of "http://a.com/?id=5". I think there's no way to completely hide the external link to users. They can always look at the HTML source code and http requests & trace back the original location.
One possible solution to partially hide that external site is using curl equivalent of MVC, on your controller: after auth-ed, you request the website from "http://a.com/?id=5" and then return that to your user:
ASP.NET MVC - Using cURL or similar to perform requests in application:
I assume the request to "http://a.com/?id=5" is in GET method:
public string GetResponseText(string userAgent) {
string url = "http://a.com/?id=5";
string responseText = String.Empty;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Method = "GET";
request.UserAgent = userAgent;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
using (StreamReader sr = new StreamReader(response.GetResponseStream())) {
responseText = sr.ReadToEnd();
}
return responseText;
}
then, you just need to call this in your controller. Pass the same userAgent from client so that they can view the website exactly like they open it with their web browsers:
return GetResponseText( request.UserAgent);
//request is the request passed to the controller for http://MyWebsite.com/?id=123
PS: I may not using the correct MVC API, but the idea is there. Just need to look up MVC document on HttpWebRequest to make it work correctly.

WebRequest response content

I'm trying to find out response content of the given url using HttpWebRequest
var targetUri = new Uri("http://www.foo.com/Message/CheckMsg?msg=test");
var webRequest = (HttpWebRequest)WebRequest.Create(targetUri);
var webRequestResponse = webRequest.GetResponse();
The above code always returns the home page (http://www.foo.com) content. I was expecting http://www.foo.com/Message page content. something wrong or am I missing something?
Is the CheckMsg is an html or php file? When I'm accessing websites using webrequest I always have to use the extension. Otherwise the website will think it's a folder. I would recommend trying to add that.
var targetUri = new Uri("http://www.foo.com/Message/CheckMsg.html?msg=test");

in C#, how can I get the HTML content of a website before displaying it?

I have a web browser project in C#, I am thinking such system; when user writes the url then clicks "go" button, my browser get content of written web site ( it shouldn't visit that page, I mean it shouldn't display anything), then I want look for a specific "keyword" for ex; "violence", if there exists, I can navigate that browser to a local page that has a warning. Shortly, in C#, How can I get content of a web site before visiting?...
Sorry for my english,
Thanks in advance!
System.Net.WebClient:
string url = "http://www.google.com";
System.Net.WebClient wc = new System.Net.WebClient();
string html = wc.DownloadString(url);
You have to use WebRequest and WebResponse to load a site:
example:
string GetPageSource (string url)
{
HttpWebRequest webrequest = (HttpWebRequest)WebRequest.Create(url);
webrequest.Method = "GET";
HttpWebResponse webResponse = (HttpWebResponse)webrequest.GetResponse();
string responseHtml;
using (StreamReader responseStream = new StreamReader(webResponse.GetResponseStream()))
{
responseHtml = responseStream.ReadToEnd().Trim();
}
return responseHtml;
}
After that you can check the responseHtml for some Keywords... for example with RegEx.
You can make an HTTP request (via HttpClient to the site) and parse the results looking for the various keywords. Then you can make the decision whether or not to visibly 'navigate' the user there.
There's an HTTP client sample on Dev Center that may help.

How to scrape data

I am trying scrape data from this url: http://icecat.biz/en/p/Coby/DP102/desc.htm
I want to scrape that specs table from that url.
But I checked source code of url that spec table is not displaying because i think that table is loading using Ajax.
How can I get that table.Whats needs to be done?
I used the following code:
string Strproducturl = "http://icecat.biz/en/p/Coby/DP102/desc.htm";
System.Net.ServicePointManager.Expect100Continue = false;
HttpWebRequest httpWebRequest = (HttpWebRequest)WebRequest.Create(Strproducturl);
httpWebRequest.KeepAlive = true;
ASCIIEncoding encoding = new ASCIIEncoding();
HttpWebResponse httpWebResponse = (HttpWebResponse)httpWebRequest.GetResponse();
Stream responseStream = httpWebResponse.GetResponseStream();
StreamReader streamReader = new StreamReader(responseStream);
string response = streamReader.ReadToEnd();
As IanNorton mentioned, you'll need to make your request to the URL that Icecat use to load the specs using AJAX. For the example link you provided, the specs details URL you'll need to request will be:
http://icecat.biz/index.cgi?ajax=productPage;product_id=1091664;language=en;request=feature
You can then work your way through the HTML response to get the spec details you require.
You mentioned in your comment that the scraping process is automated. The specs URL is in a basic format, you just need the product ID. However, if you don't have the IDs, just a series of URLs like the example on in your original question, you'll need to get the product ID from the URL you have.
For example, the URL example you gave redirects to a different URL:
http://icecat.biz/p/coby/dp102/digital-photo-frames-0716829961025-dp-102-digital-photo-frame-1091664.html
This URL contains the product ID, right at the end.
You could do a HttpWebRequest to your original URL, stop before it does the redirect and catch the redirecting URL:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://icecat.biz/en/p/Coby/DP102/desc.htm");
request.AllowAutoRedirect = false;
request.KeepAlive = true;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
if(response.StatusCode == HttpStatusCode.Redirect){
string redirectUrl = response.GetResponseHeader("Location");
}
Once you've got the redirectUrl variable, you can use Regex to get the ID then do another HttpWebRequest to the specs detail URL.
I would suggest that you use a library like HtmlAgilityPack to select various elements from the html document.
I took a quick look at the link and noticed that the data is actually loaded using an addtional ajax request. You can use the following url to get the ajax data
http://icecat.biz/index.cgi?ajax=productPage;product_id=1091664;language=en;request=feature
The use HtmlAgilityPack to parse that data.
I know this is very old but you could more easily just retrieve the XML from
https://openIcecat-xml:freeaccess#data.icecat.biz/export/freexml.int/EN/1091664.xml
You will also get all images and descriptions as well :-)

How to check if a file exists on an webserver by its URL?

in our application we have some kind of online help. It works really simple: If the user clicks on the help button a URL is build depending on the current language and help context (e.g. "http://example.com/help/" + [LANG_ID] + "[HELP_CONTEXT]) and called within the browser.
So my question is: How can i check if a file exists on the web server without loading the complete file content?
Thanks for your Help!
Update: Thanks for your help. My question has been answered.
Now we have proxy authentication problems an cannot send the HTTP request ;)
You can use .NET to do a HEAD request and then look at the status of the response.
Your code would look something like this (adapted from The Lowly HTTP HEAD Request):
// create the request
HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
// instruct the server to return headers only
request.Method = "HEAD";
// make the connection
HttpWebResponse response = request.GetResponse() as HttpWebResponse;
// get the status code
HttpStatusCode status = response.StatusCode;
Here's a list detailing the status codes that can be returned by the StatusCode enumerator.
Can we assume that you are running your web application on the same web server as you are retrieving your help pages from? If yes, then you can use the Server.MapPath method to find a path to the file on the server combined with the File.Exists method from the System.IO namespace to confirm that the file exists.
Had the same problem myself and found this question and the answers here really useful.
But the answers here use the old WebRequest-class which is a bit outdated, it has no async support for starters. So I wanted to use the more modern way of doing it with HttpClient. Here is an example with a little helper class to check if the file exist:
using System.Net.Http;
using System.Threading.Tasks;
class HttpClientHelper
{
private static HttpClient _httpClient;
public static async Task<bool> DoesFileExist(string url)
{
if (_httpClient == null)
{
_httpClient = new HttpClient();
}
using (HttpRequestMessage request = new HttpRequestMessage(HttpMethod.Head, url))
{
using (HttpResponseMessage response = await _httpClient.SendAsync(request))
{
return response.StatusCode == System.Net.HttpStatusCode.OK;
}
}
}
}
Usage:
if (await HttpClientHelper.DoesFileExist("https://www.google.com/favicon.ico"))
{
// Yes it does!
}
else
{
// No it doesn't!
}
Send a HEAD request for the URL (instead of a GET). The server will return a 404 if it doesn't exist.
Take a look at the HttpWebResponse class. You could do something like this:
string url = "http://example.com/help/" + LANG_ID + HELP_CONTEXT;
WebRequest request=WebRequest.Create(URL);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
if (response.StatusDescription=="OK")
{
// worked
}
If you want to check the status of a document on the server:
function fetchStatus(address) {
var client = new XMLHttpRequest();
client.onreadystatechange = function() {
// in case of network errors this might not give reliable results
if(this.readyState == 4)
returnStatus(this.status);
}
client.open("HEAD", address);
client.send();
}
Thank you.
EDIT: Apparently a good method to do this would be a HEAD request.
You could also create a server-side application that stores the name of every available web page on the server. Your client application could then query this application and respond a little bit quicker than a full page request, and without throwing a 404 error every time the file doesn't exist.

Categories

Resources