All,
In HTML, it is my understanding that a URL that starts with // (e.g. //www.google.com) is a protocol-relative URL that should be requested using the same scheme as the page in which it appears.
However, the following C# code fails:
var uri = new Uri("//www.google.com", UriKind.RelativeOrAbsolute);
Assert.IsTrue(uri.IsAbsoluteUri);
Am I missing something here? At the moment I am rolling my own regex to find out if a URI is absolute:
return Regex.IsMatch(url, @"^(https?:)?//");
It's not absolute. It is a scheme-relative reference: the scheme is taken from the page it is resolved against, which may have been served over HTTP, HTTPS, or something else.
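As a minimal sketch of that point, a scheme-relative reference can be turned into an absolute URI by resolving it against the URI of the page it came from. The class name, method name and the example.com page address below are made up for illustration; only the Uri(Uri, string) constructor is the framework's.

```csharp
using System;

public static class SchemeRelative
{
    // Resolves a reference such as "//www.google.com" against the page that contains it.
    // The scheme of pageUri (http or https) is carried over to the result.
    public static Uri Resolve(Uri pageUri, string reference)
    {
        return new Uri(pageUri, reference);
    }
}
```

For example, SchemeRelative.Resolve(new Uri("https://example.com/index.html"), "//www.google.com") should give an https URI for www.google.com, while the same reference resolved against an http page should give an http one.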
I wrote an application that parses an HTML document and extracts some URLs. The problem is that these URLs can only be downloaded directly from the browser.
In VB.NET or C#, how can I follow the redirect for such a URL to obtain a direct link that I can later paste into a download manager?
dim url as string = "http://m.mrtzcmp3.net/get.php?singer=Madonna&song=Like%20A%20Virgin%20&size=5242104&ids=687474703a2h2h63733434303876342g766s2g6f652h75323237363831362h617564696h732h3132323564303466333839622g6f7033"
I should say that I don't have much experience with HTTP, so maybe I'm wrong and the URL doesn't redirect anywhere, or I've made some similar mistake. Please tell me how I can follow redirects for this kind of URL, or whether I'm wrong.
UPDATE:
Tried this, but I get the same url without any changes:
Dim url As String = _
"http://m.mrtzcmp3.net/get.php?singer=Madonna&song=Like%20A%20Virgin%20&size=5242104&ids=687474703a2h2h63733434303876342g766s2g6f652h75323237363831362h617564696h732h3132323564303466333839622g6f7033"
Dim request As HttpWebRequest = DirectCast(HttpWebRequest.Create(url), HttpWebRequest)
request.AllowAutoRedirect = True
Dim response As HttpWebResponse
Dim resUri As String
response = request.GetResponse
resUri = response.ResponseUri.AbsoluteUri
MsgBox(resUri)
UPDATE 2:
In the answer to the question "HttpWebRequest Login data Then Redirect", the author says:
If the redirect is handled transparently, the _response.ResponseURI
will contain the address it redirected to. If not, you have to read
the redirect header and decide yourself whether or not to request the
new page.
So, if I need to do that, how can I do it?
UPDATE 3:
The DownloadThemAll plugin for Firefox can obtain the direct URLs. As you can see, all of the URLs end with an .mp3 file extension, which is what I need.
To my knowledge, the url
http://m.mrtzcmp3.net/get.php?singer=Madonna&song=Like%20A%20Virgin%20&size=5242104&ids=687474703a2h2h63733434303876342g766s2g6f652h75323237363831362h617564696h732h3132323564303466333839622g6f7033
is the direct URL; a direct file URL does not need to contain the file type.
You can download the file using:
string url = "http://m.mrtzcmp3.net/get.php?singer=Madonna&song=Like%20A%20Virgin%20&size=5242104&ids=687474703a2h2h63733434303876342g766s2g6f652h75323237363831362h617564696h732h3132323564303466333839622g6f7033";
WebClient wc = new WebClient();
wc.DownloadFile(url, fileName);
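You can get the fileName (Madonna-Like A Virgin -www.mrtzcmp3.net.mp3) from the response headers:
HttpWebRequest myHttpWebRequest = (HttpWebRequest)WebRequest.Create(url);
// Send the request so that the response headers are available.
HttpWebResponse myHttpWebResponse = (HttpWebResponse)myHttpWebRequest.GetResponse();
string header = myHttpWebResponse.Headers.ToString();
// Skip past filename=" (nine characters plus the opening quote), then cut at the closing quote.
string fileName = header.Remove(0, header.IndexOf("filename=") + 10);
fileName = fileName.Remove(fileName.IndexOf('"'));
This is untested, but it should work.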
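A somewhat sturdier variant of the same idea, offered only as a sketch: instead of slicing the raw header text, hand the value of the Content-Disposition header to System.Net.Mime.ContentDisposition and read its FileName. The helper name below is hypothetical, and it assumes the server sends a well-formed Content-Disposition header, which this thread suggests but does not prove.

```csharp
using System;
using System.Net.Mime;

public static class DownloadNames
{
    // Returns the file name carried by a Content-Disposition header value,
    // or null when the header is missing or cannot be parsed.
    public static string FileNameFromContentDisposition(string headerValue)
    {
        if (string.IsNullOrWhiteSpace(headerValue))
        {
            return null;
        }

        try
        {
            return new ContentDisposition(headerValue).FileName;
        }
        catch (FormatException)
        {
            // The header did not follow the expected "type; name=value" shape.
            return null;
        }
    }
}
```

With an HttpWebResponse in hand, the header value would come from response.Headers["Content-Disposition"].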
edit: I think this does what you want, but I may have misunderstood your question
You can perform a web request using WebClient to get the content (the URL) from that URL; then you just need to follow the redirect.
Use an HttpWebRequest with AllowAutoRedirect = true to get the direct link and download the file.
Can you try pasting the URL into a URL shortener like TinyURL or Bitly? Maybe there is a shortener service that provides an API.
The file then will be downloaded at: http://tinyurl.com/phzhxsr
You will never get a direct URL from the site owner, because the URL is parsed dynamically and the file is sent in the returned data stream rather than being downloaded from a specific URL.
This is me publicly documenting my mistake so that if I or anyone does it again, they don't have to spend 3 hours tearing their hair out trying to fix such a simple thing.
Context
I was sending an HttpRequest from one C# MVC ASP.NET application to another.
The applications require an HTTPS connection, and we are using URLRewrite to redirect an HTTP request to an HTTPS url.
One application was sending a POST request with some JSON data in the body, pretty standard stuff. The other application was set up to receive this data with an MVC controller class (CollectionAction and Insert methods for GET and POST respectively).
Symptoms of the problem
The receiving application was running the GET method (CollectionAction) instead of the POST action (ItemAction). The reason for this was that the request coming in to the application was in fact a GET request, and to top it off the JSON data was missing too.
I sent the header "x-http-method" to override the request method from GET to POST (I was already setting the request's HTTP method to POST, but this was being ignored). This worked, but I still had no data being sent.
So now I was stuck pulling my hair out, because I could see a POST request with a content-length and data being sent out, and a GET request with no data or content-length coming in (although the headers were preserved).
Turns out I was using UriBuilder to take a base URL and apply a resource path to it. For example I would have "google.com" in my web.config and then the UriBuilder would take a resource like Pages and construct the url "google.com/Pages". Unfortunately, I was not initializing the UriBuilder with the base URL, and instead was using a second UriBuilder to extract the host and add that to the path like so:
public Uri GetResourceUri(string resourceName)
{
    var domain = new UriBuilder(GetBaseUrl());
    var uribuilder = new UriBuilder()
    {
        Path = domain.Path.TrimEnd('/') + "/" + resourceName.TrimStart('/'),
        Host = domain.Host
    };
    var resourceUri = uribuilder.Uri;
    return resourceUri;
}
The problem with this code is that the scheme (http:// vs https://) is ignored, so it defaults to HTTP. My client was therefore sending the request to an HTTP URL instead of the required HTTPS URL. This is the interesting part: URLRewrite kicked in, decided that we needed an HTTPS URL, and redirected us there. In doing so, the HTTP method and the POST data were dropped and reset to the defaults, GET and null. This is what the second application saw at the receiving end.
So the function had to be rewritten as follows, which fixed the problem:
public Uri GetResourceUri(string resourceName)
{
    var baseUrl = GetBaseUrl();
    var domain = new UriBuilder(baseUrl);
    var uribuilder = new UriBuilder(baseUrl)
    {
        Path = domain.Path.TrimEnd('/') + "/" + resourceName.TrimStart('/'),
    };
    var resourceUri = uribuilder.Uri;
    return resourceUri;
}
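To make the difference visible without any web traffic, here is a small self-contained sketch that puts the two approaches side by side. The class and method names are invented for this illustration, and the base URL is passed in as a parameter instead of coming from GetBaseUrl().

```csharp
using System;

public static class UriBuilderSchemeDemo
{
    // Mirrors the first version: only Host and Path are copied,
    // so the scheme falls back to UriBuilder's default of "http".
    public static Uri WithoutScheme(string baseUrl, string resourceName)
    {
        var domain = new UriBuilder(baseUrl);
        var builder = new UriBuilder
        {
            Host = domain.Host,
            Path = domain.Path.TrimEnd('/') + "/" + resourceName.TrimStart('/')
        };
        return builder.Uri;
    }

    // Mirrors the corrected version: the builder starts from the base URL,
    // so the scheme (and port) are preserved and only the path is extended.
    public static Uri WithScheme(string baseUrl, string resourceName)
    {
        var builder = new UriBuilder(baseUrl);
        builder.Path = builder.Path.TrimEnd('/') + "/" + resourceName.TrimStart('/');
        return builder.Uri;
    }
}
```

Calling both with "https://example.com/api" and "Pages" should show the first one quietly producing an http URL and the second one keeping https.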
The goal of my program is to grab a web page and then generate a list of absolute links to the pages it links to.
The problem I am having is that when a page redirects to another page without the program knowing, all the relative links come out wrong.
For example:
I give my program this link: moodle.pgmb.si/moodle/course/view.php?id=1
On this page, if it finds the link href="signup.php" meaning signup.php in the current directory, it errors because there is no directory above the root.
However this error is invalid because the page's real location is:
moodle.pgmb.si/moodle/login/index.php
Meaning that "signup.php" is linking to moodle.pgmb.si/signup.php which is a valid page, not moodle.pgmb.si/moodle/course/signup.php like my program thinks.
So my question is how is my program supposed to know that the page it received is at another location?
I am doing this in C#, using the following code to get the HTML:
WebRequest wrq = WebRequest.Create(address);
WebResponse wrs = wrq.GetResponse();
StreamReader strdr = new StreamReader(wrs.GetResponseStream());
string html = strdr.ReadToEnd();
strdr.Close();
wrs.Close();
You should be able to use the ResponseUri property of the WebResponse class. It contains the URI of the internet resource that actually provided the response data, as opposed to the resource that was requested. You can then use this URI to build correct links.
http://msdn.microsoft.com/en-us/library/system.net.webresponse.responseuri.aspx
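A rough sketch of how that could look, reusing the WebRequest calls from the question. PageFetcher, Fetch and ResolveLink are names made up for this example; the point is only that links are resolved against ResponseUri and not against the address that was originally requested.

```csharp
using System;
using System.IO;
using System.Net;

public static class PageFetcher
{
    // Downloads the page and reports the URI that actually served it,
    // which differs from the requested address when a redirect was followed.
    public static string Fetch(string address, out Uri finalUri)
    {
        WebRequest request = WebRequest.Create(address);
        using (WebResponse response = request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            finalUri = response.ResponseUri;
            return reader.ReadToEnd();
        }
    }

    // Resolves an href found in the page against the page's real location.
    public static Uri ResolveLink(Uri pageUri, string href)
    {
        return new Uri(pageUri, href);
    }
}
```

After calling Fetch, each href extracted from the HTML would be passed to ResolveLink together with finalUri.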
What I would do is first check if each link is absolute or relative by searching for an "http://" within it. If it's absolute, you're done. If it's relative, then you need to append the path to the page you're scanning in front of it.
There are a number of ways you could get the current path: you could Split() it on the slashes ("/"), then recombine all but the last one. Or you could search for the last occurrence of a slash and then take a substring of up to and including that position.
Edit: Re-reading the question, I'm not sure I am understanding. href="signup.php" is a relative link, which should go to the /signup.php. So the current behavior you mentioned is correct "moodle.pgmb.si/moodle/course/signup.php."
The problem is that, if the URL isn't a relative or absolute URL, then you have no way of knowing where it goes unless you request it. Even then, it might not actually be being served from where you think it is located. This is because it might actually be implemented as a HTTP Redirect or similar server side.
So if you want to be exhaustive, what you can do is:
Use your current technique to grab a list of all links on the page.
Attempt to request each of those pages. Then if you:
Get a 200 response code, all is good: the page is there.
Get a 404 response code, you know the page does not exist.
Get a 3XX response code, you know where the web server expects that content to actually originate from.
Your HttpWebResponse object has a StatusCode property. Note that you should also handle any possible WebException errors; these too can carry a WebResponse with a status code (4xx or 5xx).
You can also look at the HttpWebResponse Headers property - the Location header.
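The following is a minimal sketch of reading the redirect by hand, assuming HttpWebRequest as used elsewhere in this thread. RedirectProbe and its methods are hypothetical names. Automatic redirects are switched off so that the 3XX response and its Location header can be inspected directly; error responses such as 404 are not caught here and would surface as a WebException.

```csharp
using System;
using System.Net;

public static class RedirectProbe
{
    // True for any 3XX status code.
    public static bool IsRedirect(HttpStatusCode status)
    {
        int code = (int)status;
        return code >= 300 && code < 400;
    }

    // A Location header may be relative, so resolve it against the requested URI.
    public static Uri ResolveLocation(Uri requested, string location)
    {
        return new Uri(requested, location);
    }

    // Returns the redirect target, or null when the server answers without redirecting.
    public static Uri GetRedirectTarget(Uri requested)
    {
        var request = (HttpWebRequest)WebRequest.Create(requested);
        request.AllowAutoRedirect = false; // keep the 3XX response instead of following it

        using (var response = (HttpWebResponse)request.GetResponse())
        {
            string location = response.Headers["Location"];
            if (IsRedirect(response.StatusCode) && !string.IsNullOrEmpty(location))
            {
                return ResolveLocation(requested, location);
            }
            return null;
        }
    }
}
```

Calling GetRedirectTarget repeatedly on its own result, until it returns null, would walk a chain of redirects one hop at a time.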
I have a complete URL like: A: http://www.domain.com/aaa/bbb/ccc/ddd/eee.ext.
I have a relative URL like: B: ../../fff.ext
I’m looking for the easiest way in .NET C# to combine these two URLs and get:
C: http://www.domain.com/aaa/bbb/fff.ext
This is what browsers do: when you are browsing URL A and the page's HTML has a hyperlink B, the resulting URL is C.
You'd probably have better luck looking up "PathCanonicalize".
Also, from what I have found, one of the overloaded Uri constructors can handle this:
Uri combined = new Uri(
new Uri("http://www.domain.com/aaa/bbb/ccc/ddd/eee.ext", UriKind.Absolute),
"../../fff.ext"
);
Proof is in the pudding
I want to check if a URI returns a valid result.
Example:
String path = String.Format("{0}/agreements/{1}.gif", PicRoot, languageTwoLetterCode);
WebRequest request = WebRequest.Create(new Uri(path, UriKind.Relative));
...
This throws a NotSupportedException, so I figure I should be providing the absolute URI.
All the examples I can find use a hardcoded root (like www.example.com). This is of course unacceptable, because it is uncertain what the actual root of the website will be.
How can I either check the result from a relative URI or find the current root?
Are there better ways to check if say "/content/pics/agreements/en.gif" returns a gif or a 404?
You can get the root of the web site from the server object.
You could use Server.MapPath and then check whether the physical file exists on the server (using File.Exists).
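If an absolute URI is still needed for WebRequest.Create, one possible sketch is to derive the root from the URL of the request currently being served and combine it with the site-relative path. SiteRoot and ToAbsolute are invented names; in ASP.NET the current URL would typically come from Request.Url, and the sketch assumes the site lives at the root of the host and not in a virtual directory.

```csharp
using System;

public static class SiteRoot
{
    // Builds an absolute URI from the scheme, host and port of the current request
    // plus a root-relative path such as "/content/pics/agreements/en.gif".
    public static Uri ToAbsolute(Uri currentRequestUrl, string rootRelativePath)
    {
        var root = new Uri(currentRequestUrl.GetLeftPart(UriPartial.Authority));
        return new Uri(root, rootRelativePath);
    }
}
```

The resulting Uri can then be passed to WebRequest.Create in place of the relative one.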