How to match URL in c#?


I have found many examples of how to match particular types of URLs in PHP and other languages, but I need to match any URL from my C# application. How can I do this? By URL I mean a link to any site, or to a file on a site, including subdirectories and so on.
I have text like this: "Go to my awsome website http:\www.google.pl\something\blah\?lang=5" and I need to extract the link from the message. Links may also start with just www.

If you need to test your regex for finding URLs, you can try this resource:
http://gskinner.com/RegExr/
It tests your regex as you write it.
In C# you can use a regex as in the example below:
Regex r = new Regex(@"(?<Protocol>\w+):\/\/(?<Domain>[\w@][\w.:@]+)\/?[\w\.?=%&=\-@/$,]*");
// Match the regular expression pattern against a text string.
Match m = r.Match(text);
while (m.Success)
{
    // Do things with the matched text, e.g. m.Value or m.Groups["Domain"].Value.
    m = m.NextMatch();
}
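The question also asks for links that start with just www., which the pattern above does not cover because it requires a protocol. Here is a minimal sketch of one way to handle both cases. The class name UrlFinder, the pattern, and the list of trailing punctuation to trim are illustrative assumptions, not part of the answer above, and the pattern is deliberately loose rather than a full URL grammar.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: pulls URL-like substrings out of free text.
public static class UrlFinder
{
    // Matches "http://", "https://", "ftp://" or a bare "www." prefix,
    // followed by everything up to the next whitespace, quote or angle bracket.
    private static readonly Regex UrlPattern = new Regex(
        @"\b(?:(?:https?|ftp)://|www\.)[^\s<>""]+",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static List<string> FindUrls(string text)
    {
        var urls = new List<string>();
        foreach (Match m in UrlPattern.Matches(text))
        {
            // Sentence punctuation right after a link is usually not part of it.
            string url = m.Value.TrimEnd('.', ',', ';', ':', '!', ')');
            if (url.Length > 0)
            {
                urls.Add(url);
            }
        }
        return urls;
    }
}
```

Calling UrlFinder.FindUrls on a message returns the candidate links in order of appearance. Each candidate can then be validated more strictly, for example with the Uri class.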

Microsoft has a nice page of regular expressions. This is the one they give for URLs (it works pretty well too):
^(ht|f)tp(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?$
http://msdn.microsoft.com/en-us/library/ff650303.aspx#paght000001_commonregularexpressions

I am not sure exactly what you are asking, but a good start would be the Uri class, which will parse the url for you.
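As a rough illustration of the Uri approach, the sketch below uses Uri.TryCreate to check whether a candidate string is an absolute URL and then reads its parts. The helper name TryParseUrl, the choice to prepend "http://" to text starting with "www.", and the list of accepted schemes are all assumptions made for this example.

```csharp
using System;

// Hypothetical helper built on System.Uri.
public static class UrlParsing
{
    // Returns true and the parsed Uri when the text is an absolute
    // http, https or ftp URL. Text starting with "www." is assumed
    // to mean "http://www...." (an assumption of this sketch).
    public static bool TryParseUrl(string text, out Uri uri)
    {
        uri = null;
        if (string.IsNullOrWhiteSpace(text))
        {
            return false;
        }

        string candidate = text.StartsWith("www.", StringComparison.OrdinalIgnoreCase)
            ? "http://" + text
            : text;

        if (!Uri.TryCreate(candidate, UriKind.Absolute, out Uri parsed))
        {
            return false;
        }

        // Reject schemes such as mailto: or file: that are not web links.
        if (parsed.Scheme != Uri.UriSchemeHttp &&
            parsed.Scheme != Uri.UriSchemeHttps &&
            parsed.Scheme != Uri.UriSchemeFtp)
        {
            return false;
        }

        uri = parsed;
        return true;
    }
}
```

Once parsing succeeds, properties such as Host, Port, AbsolutePath and Query give the individual parts without any further regex work.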

Here's one defined for URL's.
^http(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?$
http://msdn.microsoft.com/en-us/library/ms998267.aspx

Regex regx = new Regex("http(s)?://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);

This returns a collection of all matches found within "yourStringThatHasUrlsInIt":
var pattern = @"((ht|f)tp(s?)\:\/\/|~/|/)?([w]{2}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?";
var regex = new Regex(pattern);
var matches = regex.Matches(yourStringThatHasUrlsInIt);
The return value is a "MatchCollection", which you can read more about here:
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.matchcollection.aspx

// This code returns (protocol://)host:port from a URL.
// URLs with different protocols are commented out below; uncomment one to test it.
//string url = "http://www.contoso.com:8080/letters/readme.html";
//string url = "ftp://www.contoso.com:8080/letters/readme.html";
string url = "l2tp://1.5.8.6:8080/letters/readme.html";
string host = ""; // receives the host extracted from the URL
// Named groups: protocol, host (IP or domain), optional port.
host = Regex.Match(url, @"^(?<proto>\w+)://+?(?<host>[A-Za-z0-9\-\.]+)+?(?<port>:\d+)?/", RegexOptions.None, TimeSpan.FromMilliseconds(150)).Result("${proto}://${host}${port}");
// (ip/domain):port without the protocol, e.g. for an HTTPS board loading images from an HTTP host.
//host = Regex.Match(url, @"^(?<proto>\w+)://+?(?<host>[A-Za-z0-9\-\.]+)+?(?<port>:\d+)?/", RegexOptions.None, TimeSpan.FromMilliseconds(150)).Result("${host}${port}");
Console.WriteLine("url: " + url + "\nhost: " + host); // display host
See https://rextester.com/PVSO54371

You can also use https://github.com/d-kistanov-parc/DotNetUrlPatternMatching
The library allows you to match a URL against a pattern.
How it works:
A URL pattern is split into parts.
Each non-empty part is matched against the corresponding part of the URL.
You can specify a wildcard, * or ~:
* matches any set of characters within a group (scheme, host, port, path, parameter, fragment).
~ matches any set of characters within a group segment (host, path).
Only supply the parts of the URL you care about. Parts that are left out will match anything. E.g. if you don't care about the host, leave it out.

Related

Characters after "#" is not recognized by system.net.webclient.DownloadFile method [duplicate]

How do you properly encode a path that includes a hash (#) in it? Note the hash is not the fragment (bookmark?) indicator but part of the path name.
For example, if there is a path like this:
http://www.contoso.com/code/c#/somecode.cs
It causes problems when you try, for example, to do this:
Uri myUri = new Uri("http://www.contoso.com/code/c#/somecode.cs");
It would seem that it interprets the hash as the fragment indicator.
It feels wrong to manually replace # with %23. Are there other characters that should be replaced?
There are some escaping methods in Uri and HttpUtility but none seem to do the trick.

There are a few characters you are not supposed to use. You can try to work your way through this very dry documentation, or refer to this handy URL summary on Stack Overflow.
If you check out this very website, you'll see that their C# questions are encoded %23.
Stack Overflow C# Questions
You can do this using either (for ASP.NET):
string url = string.Format("http://www.contoso.com/code/{0}/somecode.cs",
    Server.UrlEncode("c#")
);
Or for class libraries / desktop:
string url = string.Format("http://www.contoso.com/code/{0}/somecode.cs",
    HttpUtility.UrlEncode("c#")
);
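If referencing System.Web is not an option, Uri.EscapeDataString from the core System namespace can be used to escape a single path segment in the same way. The sketch below is one possible arrangement; the helper name BuildUrl and its parameters are made up for this example.

```csharp
using System;

// Hypothetical helper: joins path segments onto a base URL,
// escaping each segment separately so that characters such as '#'
// cannot be mistaken for URL syntax.
public static class PathSegments
{
    public static string BuildUrl(string baseUrl, params string[] segments)
    {
        var escaped = new string[segments.Length];
        for (int i = 0; i < segments.Length; i++)
        {
            // EscapeDataString percent-encodes reserved characters,
            // including '#', '?', '/' and spaces.
            escaped[i] = Uri.EscapeDataString(segments[i]);
        }
        return baseUrl.TrimEnd('/') + "/" + string.Join("/", escaped);
    }
}
```

For instance, PathSegments.BuildUrl("http://www.contoso.com/code", "c#", "somecode.cs") yields http://www.contoso.com/code/c%23/somecode.cs. Escaping each segment on its own is what keeps the separating slashes intact.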

I did some more digging, friends, and found a duplicate question for Java:
HTTP URL Address Encoding in Java
However, the .Net Uri class does not offer the constructor we need, but the UriBuilder does.
So, in order to construct a proper URI where the path contains illegal characters, do this:
// Build Uri by explicitly specifying the constituent parts. This way, the hash is not confused with fragment identifier
UriBuilder uriBuilder = new UriBuilder("http", "www.contoso.com", 80, "/code/c#/somecode.cs");
Debug.WriteLine(uriBuilder.Uri);
// This outputs: http://www.contoso.com/code/c%23/somecode.cs
Notice how it does not unnecessarily escape parts of the URI that do not need escaping (like the :// part), which is what happens with HttpUtility.UrlEncode. It would seem that the purpose of HttpUtility.UrlEncode is actually to encode the query string or fragment part of the URL, not the scheme or hostname.

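Use UrlEncode: System.Web.HttpUtility.UrlEncode(string)
using System;
using System.Web; // requires a reference to System.Web.dll

class Program
{
    static void Main(string[] args)
    {
        string url = "http://www.contoso.com/code/c#/somecode.cs";
        // Note: this encodes the whole string, including the "://" and "/" characters.
        string enc = HttpUtility.UrlEncode(url);
        Console.WriteLine("Original: {0} ... Encoded {1}", url, enc);
        Console.ReadLine();
    }
}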

Unable to encode Url properly using HttpUtility.UrlEncode() method

I have created an application in which I need to encode and decode special characters in a URL entered by the user.
For example: if the user enters http://en.wikipedia.org/wiki/Å, then the resulting URL should be http://en.wikipedia.org/wiki/%C3%85.
I made a console application with the following code.
string value = "http://en.wikipedia.org/wiki/Å";
Console.WriteLine(System.Web.HttpUtility.UrlEncode(value));
It encodes the character Å successfully, but it also encodes the :// characters. After running the code I get output like http%3a%2f%2fen.wikipedia.org%2fwiki%2f%c3%85, but I want http://en.wikipedia.org/wiki/%C3%85
What should I do?

Uri.EscapeUriString(value) returns the value that you expect. But it has its own problems: it leaves reserved characters such as /, ?, # and & unescaped, so it is not suitable for encoding individual values.
There are a few URL encoding functions in the .NET Framework, which all behave differently and are useful in different situations:
Uri.EscapeUriString
Uri.EscapeDataString
WebUtility.UrlEncode (available from .NET 4.5)
HttpUtility.UrlEncode (in System.Web.dll, so intended for web applications, not desktop)
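To make the differences concrete, here is a small sketch that wraps two of the functions listed above, plus the Uri class itself, so their results can be compared side by side. The class and method names are invented for this example, and HttpUtility is left out so that the code does not need a System.Web reference.

```csharp
using System;
using System.Net;

// Hypothetical wrappers for comparing the built-in encoders.
public static class EncodingComparison
{
    // Percent-encodes everything except unreserved characters;
    // suited to a single path segment or query value.
    public static string EscapeData(string value)
    {
        return Uri.EscapeDataString(value);
    }

    // Form-style encoding: like the above, but a space becomes '+'.
    public static string FormEncode(string value)
    {
        return WebUtility.UrlEncode(value);
    }

    // Parses a whole URL and returns its escaped form, leaving the
    // scheme, host and path separators untouched.
    public static string Canonical(string url)
    {
        return new Uri(url).AbsoluteUri;
    }
}
```

With the input "a b/Å", EscapeData returns a%20b%2F%C3%85 while FormEncode returns a+b%2F%C3%85. Canonical("http://en.wikipedia.org/wiki/Å") returns http://en.wikipedia.org/wiki/%C3%85, which is the form the question asks for.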

You could use a regular expression to separate the host from the rest of the string, and then URL-encode only the path:
var inputString = "http://en.wikipedia.org/wiki/Å";
string encodedString = inputString; // fall back to the input if the pattern does not match
var regex = new Regex("^(?<host>https?://.+?/)(?<path>.*)$");
var match = regex.Match(inputString);
if (match.Success)
    encodedString = match.Groups["host"].Value + System.Web.HttpUtility.UrlEncode(match.Groups["path"].Value);
Console.WriteLine(encodedString);

detecting emails inside an http header

I am building a proxy in C#, and one of my tasks is to find email addresses inside an HTTP header. The problem is that in the data I receive, the @ is replaced with %40. Could anyone please tell me how I can detect email addresses when the @ inside the address has been replaced with %40?
Here is my code for getting email addresses from a given string (with @, not %40):
Code:
string regexPattern = @"[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}";
Regex regex = new Regex(regexPattern);
MatchCollection matches = regex.Matches(this._context.Request.Headers[i]);
foreach (Match match in matches)
{
    // Any email address found is printed.
    Console.WriteLine(match.Value);
}

Not sure if I understood your question. Why don't you just replace @ with %40 in the regex you provided?
So:
string regexPattern = @"[A-Za-z0-9._%-]+%40[A-Za-z0-9.-]+\.[A-Za-z]{2,4}";

URL decode your data before running it through the regex:
regex.Matches(this._context.Server.UrlDecode(this._context.Request.Headers[i]))
Or:
regex.Matches(HttpUtility.UrlDecode(this._context.Request.Headers[i]))
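The decode-then-match idea can be sketched as a standalone helper. This version uses WebUtility.UrlDecode from System.Net rather than HttpUtility, which is an assumption made so the example does not depend on System.Web. The class name EmailFinder is also made up, and the pattern is the one from the question with @ as the separator.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: finds email addresses in a header value
// that may be percent-encoded.
public static class EmailFinder
{
    private static readonly Regex EmailPattern =
        new Regex(@"[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}");

    public static List<string> FindEmails(string headerValue)
    {
        // Turn %40 back into '@' (and decode any other escapes) first.
        string decoded = WebUtility.UrlDecode(headerValue);

        var emails = new List<string>();
        foreach (Match m in EmailPattern.Matches(decoded))
        {
            emails.Add(m.Value);
        }
        return emails;
    }
}
```

Because decoding happens before matching, the same pattern handles both plain and encoded addresses. Note that UrlDecode also turns '+' into a space, which matters if the addresses you expect can contain a plus sign.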

Use OR (|) to allow either @ or "%40":
[A-Za-z0-9._%-]+(@|%40)[A-Za-z0-9.-]+\.[A-Za-z]{2,4}

How to replace %20 with - in Url Rewriting

I am using UrlRewritingNet.UrlRewriter.dll for URL rewriting and, frankly, I am new to this stuff. My problem is that I want to replace %20 in my URL with -.

HttpUtility.UrlDecode() does what you need.

If you need a custom replacement beyond what HttpUtility gives you (in this case it will convert %20 to a space!), then just use string replacement.
Uri myuri = new Uri(myolduri.ToString().Replace("%20","-"));

Or you can put the URL in a string and then use:
string urla = "your url";
string urlb = urla.Replace("%20", "-");

Remove URl from a string in c#

I have a string I have parsed from a RSS feed
thumbnail url='http://photos3.media.pix.ie/11/C5/11C5B77C92204ADBBD0CF5FDF4BA351B-0000314357- 0002211156-00240L-00000000000000000000000000000000.jpg' height='240' width='226'
I need to extract just the URL from the string to form the basis of an image in a Windows Phone 7 application.
What would you suggest as the best way of doing this?
The code from the phone is here:
FeedItems.ItemsSource = from imageFeed in xmlImageEntries.Descendants("item")
select new PixIEPanoramaTest.Data.FeedItem
{
ThumbSource = imageFeed.Element("thumbnail").Value,
};
FeedItems is just a bound list box. The ThumbSource property just needs the URL from the string.
Any thoughts greatly appreciated,
Rob

You can just access the attribute value for the url attribute:
ThumbSource = imageFeed.Element("thumbnail").Attribute("url").Value,
It would probably be worth using some kind of extension method that returns string.Empty if the attribute is missing, though.
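One possible shape for such an extension method is sketched below. The names XmlAttributes and AttributeOrEmpty are invented for this example. It relies on the explicit string conversion that LINQ to XML defines for XAttribute, which returns null instead of throwing when the attribute is absent.

```csharp
using System.Xml.Linq;

// Hypothetical extension method for reading optional attributes.
public static class XmlAttributes
{
    // Returns the attribute's value, or string.Empty when the element
    // is null or does not carry the attribute.
    public static string AttributeOrEmpty(this XElement element, string name)
    {
        if (element == null)
        {
            return string.Empty;
        }
        // Casting a null XAttribute to string yields null, not an exception.
        return (string)element.Attribute(name) ?? string.Empty;
    }
}
```

With this in place, the query could read ThumbSource = imageFeed.Element("thumbnail").AttributeOrEmpty("url"), which also tolerates items that have no thumbnail element at all.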

There are libraries available to help parse RSS feeds, e.g.
working with rss + c#
on CodePlex: http://rssr7.codeplex.com/releases/view/43832
If you want to do a quick "hacky" parse of the string yourself, you could use regular expressions. A regex like thumbnail url='(?<imageurl>http.*?)' height='240' width='226' would pull out your URL as the imageurl group:
Match match = Regex.Match(input, @"thumbnail url='(?<imageurl>http.*?)' height='240' width='226'");
if (match.Success)
{
    string url = match.Groups["imageurl"].Value;
}
