How can I combine two urls the same way a browser does? - c#

I'm writing a page scraper, and one of the things I need to do is combine the current URL with a URL fragment extracted from the current page.
Like this:
if (WebPath.IsAbsolute(urlFragment))
    links.Add(new Uri(urlFragment));
else
    links.Add(new Uri(currentUrl, urlFragment));
Easy peasy - this approach works most of the time, for both relative and absolute Uris.
However, some pages look like http://example.com/couple/of/folders/, with the url fragment couple/of/otherfolders/. And every single browser out there interprets that as http://example.com/couple/of/otherfolders.
Of course, my code yields http://example.com/couple/of/folders/couple/of/otherfolders. Which totally looks correct from the Uri's point of view - but I don't get how a browser can interpret this otherwise.
Now, I've searched for a solution to this problem, but I only found people who didn't know how to combine two urls, so that didn't get me very far. Closest thing I found was this question: How do you combine URL fragments in Java the same way browsers do? , but the answer doesn't tackle my particular problem.
Does anybody know what I'm missing?
Edit - this is the IsAbsolute method (I know I should replace it with new Uri(link).IsAbsoluteUri):
public static bool IsAbsolute(string path)
{
    // Case-insensitive scheme check, without allocating an upper-cased copy
    return path.StartsWith("http://", StringComparison.OrdinalIgnoreCase)
        || path.StartsWith("https://", StringComparison.OrdinalIgnoreCase);
}

Normally, browsers wouldn’t do that. But when there’s a <base> element, its href replaces the current page’s URL for the page’s URL-resolving purposes.
Check for a <base> and use it in place of currentUrl if it exists.
Also, thanks for reminding me to fix all my scrapers!
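A minimal sketch of that check, assuming the page's HTML is already available as a string. The LinkResolver class and its Resolve method are hypothetical names, and the regex lookup is a simplification; a real scraper would probably read the <base> element through an HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkResolver
{
    // Naive <base href="..."> lookup (an assumption for this sketch, not a full HTML parser).
    private static readonly Regex BaseHref = new Regex(
        @"<base\b[^>]*?\bhref\s*=\s*[""']([^""']*)[""']",
        RegexOptions.IgnoreCase);

    public static Uri Resolve(Uri pageUrl, string html, string link)
    {
        Uri baseUrl = pageUrl;
        Match m = BaseHref.Match(html);
        if (m.Success)
        {
            // The base href may itself be relative, so resolve it against the page URL.
            baseUrl = new Uri(pageUrl, m.Groups[1].Value);
        }

        // new Uri(Uri, string) keeps the link as-is when it is already absolute.
        return new Uri(baseUrl, link);
    }
}
```

With a page at http://example.com/couple/of/folders/ that declares a base of http://example.com/, the fragment couple/of/otherfolders/ resolves the way the browser does it.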

Related

URL ending with '/Whatever'

I hope you can help me with this one.
Is it possible to have a URL like this: http://example.com/xxxyyy
When users access the above link, I'd like to extract the xxxyyy part of the URL for further use.
I'd like to do this WITHOUT subdomains, as I don't know how many different 'xxxyyy's I'll have to accept. (eg http://example.com/europe, http://example.com/spam and so on)
Regards,
Morten
It depends on exactly what you're trying to do, but you should find what you need here: Making Sense of ASP.NET Paths
For example:
string path = Request.ApplicationPath;
Check the documentation here.
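If the goal is just the xxxyyy part, one option is to read it from the request URL itself. This is a sketch under assumptions: PathHelper and FirstSegment are hypothetical names, the Uri passed in would typically be Request.Url, and the server still has to be set up so that such paths reach your code at all.

```csharp
using System;

public static class PathHelper
{
    // Returns the first path segment of the URL, or an empty string if there is none.
    public static string FirstSegment(Uri url)
    {
        string[] parts = url.AbsolutePath.Split(
            new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
        return parts.Length > 0 ? Uri.UnescapeDataString(parts[0]) : string.Empty;
    }
}
```

For http://example.com/europe this gives "europe".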

Regular expression to get the Name from URL link

I have a Hyperlink field (aka column) in SharePoint 2010.
Say it's called SalesReportUrl. The url looks like:
http://portal.cab.com/SalessiteCollection/October2012Library/Forms/customview.aspx
The hyperlink field stores values in two fields (the link and description).
What would be the RegEx if I want to get the October2012Library out of the Url?
I tried this but it's definitely not working:
@"<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>";
I also tried:
^(.*?/)?Forms/$
but no luck.
I think sharepoint stores hyperlink like this:
http://portal.cab.com/SalessiteCollection/October2012Library/Forms/customview.aspx, some description
It looks like this has a solution, but what is the substring syntax to get the list or library name? https://sharepoint.stackexchange.com/questions/40712/get-list-title-in-sharepoint-designer-workflow
How about this (as Daniel suggested) :
string url = @"http://portal.cab.com/SalessiteCollection/October2012Library/Forms/customview.aspx";
Uri uri = new Uri(url);
// Segments[0] is "/", Segments[1] is "SalessiteCollection/", Segments[2] is the library
if (uri.Segments.Length > 2)
    Console.WriteLine(uri.Segments[2]); // will output "October2012Library/"
You can add .Replace("/", string.Empty) if you want to get rid of the "/":
Console.WriteLine(uri.Segments[2].Replace("/", string.Empty));
http://[^/]+/[^/]+/([^/]+)/
The match's group[1] is the value you need. It captures the third part of the URL (parts being divided by /). If you need to make sure it is followed by other parts, e.g. Forms, add them at the end of the pattern.
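A small sketch of how that pattern could be applied in C#. The LibraryNameParser class and GetLibraryName method are hypothetical names, and the "url, description" storage format is taken from the question's guess rather than from SharePoint documentation.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LibraryNameParser
{
    public static string GetLibraryName(string fieldValue)
    {
        // Assumption: the field is stored as "url, description"; keep only the URL part.
        string url = fieldValue.Split(new[] { ", " }, 2, StringSplitOptions.None)[0];

        // Host, then the first path part, then capture the second path part.
        Match m = Regex.Match(url, @"^https?://[^/]+/[^/]+/([^/]+)/");
        return m.Success ? m.Groups[1].Value : null;
    }
}
```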
Try using this: new Regex("SalessiteCollection/(.+?)/Forms").Match(urlString).Groups[1].Value
Though it is a rough answer, you might have to make few corrections but I hope you understand what I am trying to explain.
maybe this?
http:\/\/([a-zA-Z0-9]([a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}\/[a-zA-Z]*\/([a-zA-Z0-9]*)\/
http://rubular.com/r/LuuuORPRXt

get only website name

I have, for example, the following URL:
http://www.beta.microsoft.com/path/page.htm
and I need to retrieve the name from it, which in this case is:
microsoft
I need to get the name of the website - without the sub-domain, www, .com extension and other stuff - only the name.
How do I get it in the fastest and most convenient way?
Din.
It sounds like you mean the domain name:
new Uri(url).Host
You could make an array of all the domain extensions, replace any match with String.Empty to remove it, and then pick the last item from Split('.'). This gives you what you want most of the time; otherwise there is no way to know which part is the right one.
UPDATE:
This code does what you wanted, but I'm guessing there is a better way to do this, maybe a regex or something in that direction.
http://pastebin.com/SVkiJ1Vq
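The suffix-stripping idea described above might look roughly like this. It is only a sketch: DomainHelper, GetSiteName and the short Suffixes list are made up for illustration (they are not the pastebin code), and a real list of extensions would need to be far longer.

```csharp
using System;
using System.Linq;

public static class DomainHelper
{
    // Hypothetical, deliberately incomplete suffix list; longer entries come first.
    private static readonly string[] Suffixes = { ".co.uk", ".com", ".net", ".org" };

    public static string GetSiteName(string url)
    {
        string host = new Uri(url).Host;

        // Strip the first known suffix that the host ends with.
        string suffix = Suffixes.FirstOrDefault(
            s => host.EndsWith(s, StringComparison.OrdinalIgnoreCase));
        if (suffix != null)
            host = host.Substring(0, host.Length - suffix.Length);

        // What remains is "sub.sub.name"; the last label is the site name.
        return host.Split('.').Last();
    }
}
```

For http://www.beta.microsoft.com/path/page.htm this returns "microsoft". Hosts whose extension is not in the list fall through to the last label of the whole host, which is the "most of the time" caveat from the answer.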

How do I use a pattern Url to extract a segment from an actual Url?

If I have a series of "pattern" Urls of the form:
http://{username}.sitename.com/
http://{username}.othersite.net/
http://mysite.com/{username}
and I have an actual Url of the form:
http://joesmith.sitename.com/
Is there any way that I can match a pattern Url and in turn use it to extract the username portion out the actual Url? I've thought of nasty ways to do it, but it just seems like there should be a more intuitive way to accomplish this.
ASP.NET MVC uses a similar approach to extract the various segments of the URL when it is building its routes. Given the example:
{controller}/{action}
So given the Url of the form, Home/Index, it knows that it is the Home controller calling the Index action method.
Not sure I understand this question correctly but you can just use a regular expression to match anything between 'http://' and the first dot.
A very simple regex will do:
':https?://([a-z0-9\.-]*[a-z0-9])\.sitename\.com'
This will allow any subdomain that only contains valid subdomain characters. Example of allowed subdomains:
joesmith.sitename.com
joe.smith.sitename.com
joe-smith.sitename.com
a-very-long-subdomain.sitename.com
As you can see, you might want to complicate the regex slightly. For instance, you could limit it to only allow a certain amount of characters in the subdomain.
It seems the quickest and easiest solution is to build on Machine's answer.
var givenUri = "http://joesmith.sitename.com/";
var patternUri = "http://{username}.sitename.com/";
// Swap the placeholder for a capturing group (note the closing parenthesis)
patternUri = patternUri.Replace("{username}", @"([a-z0-9\.-]*[a-z0-9])");
var result = Regex.Match(givenUri, patternUri, RegexOptions.IgnoreCase).Groups;
if (!String.IsNullOrEmpty(result[1].Value))
    return result[1].Value;
Seems to work great.
Well, this "pattern URL" is a format you've made up, right? Basically, you'll just need to process it yourself.
If the format of it is:
anything inside "{ }" is a thing to capture, everything else must be as is
Then you'd just find the start/end index of those brackets, and match everything else. Then when you get to a place where one is, make sure you only look for chars such that they don't match whatever 'token' comes after the next ending '}'.
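One way to sketch that processing in C# is to turn the pattern URL into a regular expression: each {name} becomes a named capture group and everything else is escaped so it must match literally. UrlPattern, ToRegex and Extract are hypothetical names, and restricting a captured value to characters other than '.' and '/' is an assumption of this sketch.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlPattern
{
    // Converts e.g. "http://{username}.sitename.com/" into an anchored regex.
    public static Regex ToRegex(string pattern)
    {
        string body = Regex.Replace(
            pattern,
            @"\{(\w+)\}|[^{]+|\{",
            m => m.Groups[1].Success
                // A {name} token: capture one label (no dots or slashes).
                ? "(?<" + m.Groups[1].Value + ">[^./]+)"
                // Literal text: escape it so it matches as-is.
                : Regex.Escape(m.Value));
        return new Regex("^" + body + "$", RegexOptions.IgnoreCase);
    }

    // Returns the captured value, or null when the URL does not fit the pattern.
    public static string Extract(string pattern, string url, string name)
    {
        Match m = ToRegex(pattern).Match(url);
        return m.Success ? m.Groups[name].Value : null;
    }
}
```

Trying each pattern URL in turn and taking the first non-null result would cover the list of patterns from the question.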
There are definitely different ways - ultimately though your server must be configured to handle (and possibly route) these different subdomain requests.
What I would do would be to answer all subdomain requests (except maybe some reserved words, like 'www', 'mail', etc.) on sitename.com with a single handler or page (I'm assuming ASP.NET here based on your C# tag).
I'd use the request path, which is easy enough to get, with some simple string parsing/regex routines (remove the 'http://', grab the first token up until '.' or '/' or '\', etc.) and then use that in a session, making sure to observe URL changes.
Alternately, you could map certain virtual paths to request urls ('joesmith.sitename.com' => 'sitename.com/index.aspx?username=joesmith') via IIS but that's kind of nasty too.
Hope this helps!

MonoRail redirect to # anchor

I'm using Castle Monorail with jQuery tabbed navigation.
When handling a controller action, I would like to redirect to a view, and control which tab is visible. Therefore, I'd like to have my controller redirecting to a specific anchor in a view, something along the lines of:
RedirectToAction("Edit", "id=1", "#roles");
Resulting in the url:
http://localhost/MyApp/User/edit.rails?id=1#roles
However, the actual result encodes the # sign to %23
http://localhost/MyApp/User/edit.rails?id=1&%23roles=&
I'm surely missing a basic concept here. What do I need to do to solve this?
It does not just encode the '#' sign; it treats it as another query string parameter (hence the added '&' and '=').
I'd advise you to post this question to the Castle Project users group or, even better, open an issue on Castle's issue tracker.
Not the best solution, but I used RedirectToUrl() and used a static url.
Another solution would be to use the Routing-engine and create the url yourself, and then add the actual hash.
check
RoutingModuleEx.Engine.CreateUrl()
Or something like that.
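For the RedirectToUrl() route, the URL with its anchor can be assembled with plain .NET rather than typed out by hand. This is only a sketch: RedirectUrls and WithAnchor are hypothetical names, and it uses System.UriBuilder, not any MonoRail API.

```csharp
using System;

public static class RedirectUrls
{
    // query and anchor are passed without their leading '?' and '#'.
    public static string WithAnchor(string baseUrl, string query, string anchor)
    {
        var builder = new UriBuilder(baseUrl)
        {
            Query = query,
            Fragment = anchor
        };
        // Going through Uri normalizes the result (e.g. drops a default port).
        return builder.Uri.AbsoluteUri;
    }
}
```

The resulting string could then be handed to RedirectToUrl(), for example WithAnchor("http://localhost/MyApp/User/edit.rails", "id=1", "roles").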
