Regular expression to match URLs, except a certain domain - C#

I have the following regular expression that matches URLs. What I want to do is make it not match when a URL belongs to a certain domain, let's say google.com.
How can I do that? I've been reading other questions and regular expression references, but so far I couldn't achieve it. My regular expression:
^(https?:\/\/)?([\da-zA-Z\.-]+)\.([a-zA-Z\.]{2,6})([\/\w \.-]*)*\/?$
I use this to filter messages in a chat, I'm using C# to do so. Here's a tool in case you want to dig further: http://regexr.com/3faji
C# extension method:
using System.Text.RegularExpressions;

static class StringExtensions
{
    // Replaces the text with "*" when the whole string is a URL.
    public static string ClearUrl(this string text)
    {
        Regex regx = new Regex(@"^(https?:\/\/)?([\da-zA-Z\.-]+)\.([a-zA-Z\.]{2,6})([\/\w \.-]*)*\/?$",
            RegexOptions.IgnoreCase);
        string output = regx.Replace(text, "*");
        return output;
    }
}
Thanks for any help

You can use negative lookahead in your regex to avoid matching certain domains:
^(https?:\/\/)?(?!(?:www\.)?google\.com)([\da-zA-Z.-]+)\.([a-zA-Z\.]{2,6})([\/\w .-]*)*\/?$
Or else:
^(https?:\/\/)?(?!.*google\.com)([\da-zA-Z.-]+)\.([a-zA-Z\.]{2,6})([\/\w .-]*)*\/?$
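(?!(?:www\.)?google\.com) is a negative lookahead that makes the match fail when www.google.com or google.com comes next. The second variant, (?!.*google\.com), fails when google.com appears anywhere ahead, so it also rejects subdomains such as mail.google.com.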
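As a rough sketch of how the first pattern could be plugged into the asker's method: the class name UrlFilter is made up for illustration, and the trailing path group is simplified to a single quantifier, which is my own assumption to avoid the nested repetition in the original.

```csharp
using System.Text.RegularExpressions;

static class UrlFilter
{
    // Hypothetical helper. The (?!...) lookahead rejects google.com and www.google.com.
    // Assumption: the path part is written as [\/\w .-]* instead of ([\/\w .-]*)*.
    private static readonly Regex UrlExceptGoogle = new Regex(
        @"^(https?:\/\/)?(?!(?:www\.)?google\.com)([\da-zA-Z.-]+)\.([a-zA-Z.]{2,6})[\/\w .-]*\/?$",
        RegexOptions.IgnoreCase);

    // Replaces the message with "*" when it is a URL outside google.com.
    public static string ClearUrl(string text)
    {
        return UrlExceptGoogle.Replace(text, "*");
    }
}
```

Because the pattern is anchored with ^ and $, this only masks messages that consist of a URL and nothing else, the same as the original expression.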

This should work using a negative lookahead. It also matches URLs that start with www instead of a protocol, and URLs that do not begin at the start of a line:
((http|ftp|https):\/\/|www\.)(?!google|www\.google)[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?


What will be regular expression of domain URL of "www.google.com"?

What will be the regex for RegularExpressionValidator in asp.net for domain name like "www.google.com"?
Valid Cases:
www.google.com
www.youwebsite.com
Invalid Cases:
http://www.google.com
https://www.google.com
google.com
www.google
Currently I use (?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z0-9][a-z0-9-]{0,61}[a-z0-9] but it fails for invalid cases 3 and 4.
The pattern you tried accepts the third and fourth invalid cases because it only requires one or more labels made of a-z0-9 (with inner hyphens), each followed by a dot, and then a final label. Nothing forces the string to begin with www. or constrains the final label, so both google.com and www.google match.
If you want to keep your pattern, you should make sure that it starts with www.
^www\.(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z0-9][a-z0-9-]{0,61}[a-z0-9]$
You might shorten your pattern and make the match a bit broader:
^www\.[a-z0-9-]+(?:\.[a-z0-9-]+)*\.com$
You can always extend the character class if you want to allow matching more characters.
Assuming that we would have valid URLs as listed, we can start with a simple expression such as:
^www\..+\.com
Then, we can add additional boundaries, if desired. For instance, we could add char class and end anchor, such as:
^www\..+\.com$
^www\.[A-Za-z_]+\.com$
If necessary, we would continue adding more constraints and test:
using System;
using System.Text.RegularExpressions;
public class Example
{
    public static void Main()
    {
        string pattern = @"^www\.[A-Za-z_]+\.com";
        string input = @"www.google.com
www.youwebsite.com
http://www.google.com
https://www.google.com
google.com
www.google";
        RegexOptions options = RegexOptions.Multiline;
        foreach (Match m in Regex.Matches(input, pattern, options))
        {
            Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
        }
    }
}
This matches the part you want.
\bwww\.[a-zA-Z0-9]{2,256}\.com\b
But an easier way to handle such a simple pattern is to use StartsWith and EndsWith, and then check what is in between.
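A minimal sketch of that StartsWith/EndsWith idea follows. The class and method names are hypothetical, and treating the middle part as "letters and digits only" is an assumption borrowed from the [a-zA-Z0-9] class in the regex above (char.IsLetterOrDigit also accepts non-ASCII letters, so tighten it if that matters).

```csharp
using System;
using System.Linq;

static class DomainCheck
{
    // Hypothetical helper: true for names shaped like "www.<label>.com".
    public static bool IsWwwDotCom(string host)
    {
        const string prefix = "www.";
        const string suffix = ".com";

        // The length check rules out inputs such as "www.com",
        // where the prefix and the suffix would overlap.
        if (!host.StartsWith(prefix, StringComparison.OrdinalIgnoreCase) ||
            !host.EndsWith(suffix, StringComparison.OrdinalIgnoreCase) ||
            host.Length <= prefix.Length + suffix.Length)
        {
            return false;
        }

        // Assumption: the part in between may only contain letters and digits.
        string middle = host.Substring(prefix.Length, host.Length - prefix.Length - suffix.Length);
        return middle.All(c => char.IsLetterOrDigit(c));
    }
}
```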

Find exact url match

I want to find an exact URL match in a list of URLs using a regular expression.
string url = @"http://web/P02/Draw/V/Service.svc";
string myword = @"http://web/P02/Draw/V/Service.svc http://web/P02/Draw/V/Service.svc?wsdl";
string pattern = @"(^|\s)" + url + @"(\s|$)";
Match match = Regex.Match(pattern, myword);
if (match.Success)
{
myword = Regex.Replace(myword, pattern, "pattern");
}
But the pattern returns no result.
What do you think is the problem ?
Strange formatting aside, here is a pattern to match each individual URL in your list.
string pattern = @"http://([a-zA-Z]|/|[0-9])*\.svc";
Frankly, I don't think you're having issues with syntax or implementation. If you want to tweak the expression I wrote above, this is the place to do it: Online RegEx Tool
You're passing the wrong arguments to the Regex.Match method. You need to swap the arguments like this:
Match match = Regex.Match(myword, pattern);
Why not use LINQ on the string collection (after splitting on spaces)?
myword.Split(' ').Where(x => x.Equals(url)).Single().Replace(url, "pattern");
You've got your arguments the wrong way around, as has been pointed out
. in a regular expression pattern is a special character, so you need to escape url when you use it to build pattern - you can use Regex.Escape(url)
You don't need to check the match is a success before performing the replacement, unless you have other logic that depends on whether the match was a success.
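Putting those three points together, one possible version looks like this. The helper name ReplaceExact is invented for the example, and I swapped (^|\s) and (\s|$) for lookarounds so that the surrounding whitespace is not swallowed by the replacement; that swap is my own choice, not something from the question.

```csharp
using System.Text.RegularExpressions;

static class ExactUrl
{
    // Hypothetical helper: replaces url only where it appears as a whole
    // whitespace-delimited token inside list.
    public static string ReplaceExact(string list, string url, string replacement)
    {
        // Regex.Escape makes '.', '?' and other special characters literal.
        // (?<!\S) and (?!\S) assert "no non-whitespace character next to the match".
        string pattern = @"(?<!\S)" + Regex.Escape(url) + @"(?!\S)";

        // Input first, pattern second.
        return Regex.Replace(list, pattern, replacement);
    }
}
```

With the strings from the question, only the first URL is replaced, because the second one continues with ?wsdl.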

Regular Expression Not working in .net

I'm using the following expression.
\W[A-C]{3}
The objective is to match 3 characters of anything between A and C that don't have any characters before them. So with input "ABC" it matches but "DABC" does not.
When I try this expression using various online regex tools (e.g. http://gskinner.com/RegExr/), it works perfectly. When I try to use it in an ASP.NET RegularExpressionValidator or with the Regex class, it never matches anything.
I've tried various methods of not allowing a character before the match, e.g.
[^\w] and [^a-zA-Z0-9]
They all work in the online tools, but not in .NET.
This test fails, but I'm not sure why:
[Test]
public void RegExWorks()
{
var regex = new Regex("\\W[A-C]{3}");
Match match = regex.Match("ABC");
Assert.IsTrue(match.Success);
}
How about something like this:
^[A-C]{3}
It is simple, but seems to fit what you are asking, and I tested it in rubular.com and .NET
The problem is that you require a \W character to be present. Use alternation to fix that, or a lookbehind to make sure there are no invalid characters.
Alternation:
(?:\W|^)[A-C]{3}
But I'd prefer a negative lookbehind:
(?<!\w)[A-C]{3}
\b (as in gymbrall's answer) is short for (?<!\w)(?=\w)|(?<=\w)(?!\w), which in this case would just mean (?<!\w), thus being equivalent.
Also, in C# you can use @ quoting (verbatim strings) so you don't have to double-escape things, e.g.:
var regex = new Regex(@"(?<!\w)[A-C]{3}");
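To see the difference side by side, here is a small sketch (class and method names are made up) that runs the original \W version and the lookbehind version against the same input:

```csharp
using System.Text.RegularExpressions;

static class LookbehindDemo
{
    // \W must consume an actual non-word character before the letters.
    public static bool ConsumingMatch(string input)
    {
        return Regex.IsMatch(input, @"\W[A-C]{3}");
    }

    // (?<!\w) only asserts that no word character precedes the letters,
    // which is also true at the very start of the string.
    public static bool LookbehindMatch(string input)
    {
        return Regex.IsMatch(input, @"(?<!\w)[A-C]{3}");
    }
}
```

For "ABC" the first method returns false and the second returns true. For "DABC" both return false.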
You should consider trying:
[Test]
public void RegExWorks()
{
var regex = new Regex("\\b[A-C]{3}");
Match match = regex.Match("ABC");
Assert.IsTrue(match.Success);
}
The \\b matches a word boundary, which means it will match "ABC" as well as " ABC" and "$ABC". Using \\W requires there to be a non-word character, which doesn't sound like it is what you want.
Let me know if I'm missing something.
It can be as simple as "[A-C]{3}".
OK, so you can try the following expression:
"[A-C][A-C]{2}"

Regex for a specific url pattern

In C#, how would I capture the integer value in the URL like:
/someBlah/a/3434/b/232/999.aspx
I need to get the 999 value from the above url.
The url HAS to have the /someBlah/ in it.
All other values like a/3434/b/232/ can be any character/number.
Do I have to escape the '/'?
Try the following
var match = Regex.Match(url, @"^http://.*someBlah.*/(\w+)\.aspx$");
if ( match.Success ) {
string name = match.Groups[1].Value;
}
You didn't specify what names could appear in front of the ASPX file. I took the simple approach of using the \w regex character which matches letters and digits. You can modify it as necessary to include other items.
You are effectively getting the file name without an extension.
Although you specifically asked for a regular expression, unless you are in a scenario where you really need to use one, I'd recommend that you use System.IO.Path.GetFileNameWithoutExtension:
Path.GetFileNameWithoutExtension(Context.Request.FilePath)
^(?:.+/)*(?:.+)?/someBlah/(?:.+/)*(.+)\.aspx$
This is a bit exhaustive, but it can handle scenarios where /someBlah/ does not have to be at the beginning of the string.
The ?: operator implies a non-capturing group, which may or may not be supported by your RegEx flavor.
Regex regex = new Regex("^http://.*someBlah.*/(\\d+)\\.aspx$");
Match match = regex.Match(url);
int result;
if (match.Success)
{
int.TryParse(match.Groups[1].Value, out result);
}
Using \d rather than \w ensures that you only match digits, and unless the ignore case flag is set the capitalisation of someBlah must be correct.
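For the path-only form shown in the question (no http:// prefix), something along these lines might do. The class and method names are hypothetical, and the pattern assumes the number is always the file name directly before .aspx. Note that '/' needs no escaping in a .NET pattern, since .NET regexes have no delimiters.

```csharp
using System.Text.RegularExpressions;

static class PageIdParser
{
    // Hypothetical helper: returns the trailing integer of
    // ".../someBlah/<any segments>/<digits>.aspx", or null when there is no match.
    public static int? TryGetPageId(string url)
    {
        // (?:[^/]+/)* skips any number of intermediate path segments.
        Match match = Regex.Match(url, @"/someBlah/(?:[^/]+/)*(\d+)\.aspx$");
        int id;
        if (match.Success && int.TryParse(match.Groups[1].Value, out id))
        {
            return id;
        }
        return null;
    }
}
```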

How can I group multiple e-mail addresses and user names using a regular expression

I have the following text that I am trying to parse:
"user1@emailaddy1.com" <user1@emailaddy1.com>, "Jane Doe" <jane.doe@ addyB.org>,
"joe@company.net" <joe@company.net>
I am using the following code to try and split up the string:
Dim groups As GroupCollection
Dim matches As MatchCollection
Dim regexp1 As New Regex("""(.*)"" <(.*)>")
matches = regexp1 .Matches(toNode.InnerText)
For Each match As Match In matches
groups = match.Groups
message.CompanyName = groups(1).Value
message.CompanyEmail = groups(2).Value
Next
But this regular expression is greedy and is grabbing the entire string up to the last quote after "joe@company.net". I'm having a hard time putting together an expression that will group this string into the two groups I'm looking for: Name (in the quotes) and E-Mail (in the angle brackets). Does anybody have any advice or suggestions for altering the regexp to get what I need?
Rather than rolling your own regular expression, I would do this:
string[] addresses = toNode.InnerText.Split(',');
foreach (string textAddress in addresses)
{
    // MailAddress (System.Net.Mail) parses the display name and the address.
    MailAddress address = new MailAddress(textAddress.Trim());
    message.CompanyName = address.DisplayName;
    message.CompanyEmail = address.Address;
}
While your regular expression may work for the few test cases that you have shown, using the MailAddress class will probably be much more reliable in the long run.
How about """([^""]*)"" <([^>]*)>" for the regex? I.e. make it explicit that the matched parts won't include a quote or a closing angle bracket. You may also want to use a more restrictive character range instead.
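Here is roughly what that looks like in C# (the question uses VB, but the pattern is the same). The class name and the key/value pair return type are just choices made for this sketch.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class AddressListParser
{
    // Hypothetical helper: returns (name, email) pairs from text such as
    // "Jane Doe" <jane@example.org>, "Joe" <joe@example.org>
    public static List<KeyValuePair<string, string>> Parse(string text)
    {
        var result = new List<KeyValuePair<string, string>>();

        // [^""]* cannot run past the closing quote and [^>]* cannot run past
        // the closing angle bracket, so each match covers exactly one entry.
        foreach (Match m in Regex.Matches(text, @"""([^""]*)"" <([^>]*)>"))
        {
            result.Add(new KeyValuePair<string, string>(m.Groups[1].Value, m.Groups[2].Value));
        }
        return result;
    }
}
```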
Not sure what regexp engine ASP.net is running but try the non-greedy variant by adding a ? in the regex.
Example regex
""(.*?)"" <(.*?)>
You need to specify that you want the minimal matched expression.
You can also replace (.*) pattern by more precise ones:
For example you could exclude the comma and the space...
Usually it's better to avoid using .* in a regular expression, because it can cause excessive backtracking and hurt performance.
For example, for the email you can use a pattern like [\w-]+@([\w-]+\.)+[\w-]+ or a more complex one.
You can find some good patterns at: http://regexlib.com/
