Regex for google referrer validation - c#

I am totally new to Regex and have been trying to do this with little success.
Basically what I want to do is to create a regex that matches any google domain such as Google.com, Google.co.uk, etc.
So far I have ^http://www.google\.com/.*$, but this only matches Google.com. How can I modify it to allow any extension besides com?
Thanks!

You could use alternation, but then you would have to supply all TLDs you want to allow:
^http://www\.google\.(?:com|co\.uk|de|es)/.*$
Add more options separated by pipes. Alternatively, you could allow any TLD (whether valid or not) with this:
^http://www\.google\.[a-z.]+/.*$
However this would also match something like http://www.google.myowndomain.com/. I don't think there would be any approach that allows only valid domains without listing them all.
By the way, if you want to make that slash and the path/query at the end optional, change that to one of the following:
^http://www\.google\.(?:com|co\.uk|de|es)(?:/.*)?$
^http://www\.google\.[a-z.]+(?:/.*)?$
And then you could go another step further and make the www. optional:
^http://(?:www\.)?google\.(?:com|co\.uk|de|es)(?:/.*)?$
^http://(?:www\.)?google\.[a-z.]+(?:/.*)?$
You see, matching all possible but valid URLs for a given problem is not an easy task, but one that needs careful consideration ;).
Depending on the language you are using there might be better options with built-in URL-parsing functions. In PHP for instance, this would be a much easier approach:
$domain = parse_url($urlStr, PHP_URL_HOST);
$isGoogle = preg_match('/^(?:www\.)?google\.[a-z.]+/', $domain);
Or (since this is not perfect anyway, as outlined above) you could abandon regex altogether and do the check like this:
$isGoogle = strpos($domain, 'google.') !== false;

Related

Regex expression to validate a user input

I am building a system where the user builds a query by selecting his operands from a combobox(names of operands are then put between $ sign).
eg. $TotalPresent$+56
eg. $Total$*100
eg 100*($TotalRegistered$-$NumberPresent$)
Things like that,
However since the user is allowed to enter brackets and the +,-,* and /.
Thus he can also make mistakes like
eg. $Total$+1a
eg. 78iu+$NumberPresent$
ETC...
I need a way to validate the query built by the user.
How can I do that ?
A regex will never be able to properly validate a query like that. Either your validation would be incomplete, or you would reject valid input.
As you're building a query, you must already have a way parse and execute it. Why not use your parsing code to validate the user input? If you want to have client-side validation you could use an ajax call to the server.
I need a way to validate the query built by the user.
Personally, I don't think it is a good idea to use regex here. It can be possible with help of some extensions (see here, for example), but original Kleene expressions aren't fit for checking whether unlimited number of parentheses is balanced. Even worse, too difficult expression may result in significant time and memory spent, opening doors to denial-of-service attacks (if your service is public).
You can make use of a weak expression, though: one which is easy to write and match with and forbids most obvious mistakes. Some inputs will still be illegal, but you will discover that on parsing, as Menno van den Heuvel offered. Something like this should do:
^(?:[-]?\(*)?(?:\$[A-Za-z][A-Za-z0-9_]*\$|\d+)(?:\)*[+/*-]\(*(?:\$[A-Za-z][A-Za-z0-9_]*\$|\d+))*(?:\)*)$
Hey guys I managed to get to my ends(Thanks to Anirudh(Validating a String using regex))
I am posting my answer as it may help further visitors.
string UserFedData = ttextBox1.Text.Trim().ToString();
//this is a regex to detect conflicting user built queries.
var troublePattern = new Regex(#"^(\(?\d+\)?|\(?[$][^$]+[$]\)?)([+*/-](\(?\d+\)?|\(?[$][^$]+[$]\)?))*$");
//var troublePattern = new Regex(#"var troublePattern = new Regex(#"^(?:[-]?\(*)?(?:\$[A-Za-z][A-Za-z0-9_]*\$|\d+)(?:\)*[+/*-]\(*(?:\$[A-Za-z][A-Za-z0-9_]*\$|\d+))*(?:\)*)$");
string TroublePattern = troublePattern.ToString();
//readyToGo is the boolean that indicates if further processing of data is safe or not
bool readyToGo = Regex.IsMatch(UserFedData, TroublePattern, RegexOptions.None);

Regex to match subdomain?

I have the following so far:
^((http[s]?|ftp):\/\/)(([^.:\/\s]*)[\.]([^:\/\s]+))(:([^\/]*))?(((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?)?$
Been testing against these:
https://www.google.com.ar:8080/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash
https://google.com.ar:8080/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash
https://google.com:8080/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash
http://www.foo.com
http://www.foo.com/
http://blog.foo.com/
http://blog.foo.com.ar/
http://foo.com
http://blog.foo.com
http://foo.com.ar
I'm using the following tool to test the regexes: regex tester
So far I've been able to yield the following groups:
full protocol
reduced protocol
full domain name
subdomain?
top level domain
port
port number
rest of the url
rest of the "directory"
no idea how to drop this group
page name
argument string
argument string
hash tag
hash tag
I will be using this regex to change the subdomain for my application for cross-domain redirect hyperlinks.
Using Request.Url as a parameter, I want to redirect from
http://example.com or http://www.example.com to http://blog.example.com
How can I achieve this?
I can't really tell what, if any, the current subdomain ( either nothing, www, blog, or forum, for instance) actually is...
What would be the best way to make this replacement?
What I actually need is some way to find out what the top level domain is. in either http://www.example.com, http://blog.example.com, or http://example.com I want to get example.com.
What would be the best way to make this replacement?
This may not be the answer you're looking for... but IMO the best way would be to make use of the System.Uri class.
The Uri class will easily extract the Host for you - and you can then split the host on "." delimiter - that should easily give you access to the current subdomain.
This is just my opinion - and its especially formed because I find it hard to maintain regex code like ^((http[s]?|ftp):\/\/)(([^.:\/\s]*)[\.]([^:\/\s]+))(:([^\/]*))?(((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?)?$
You can use the Uri class to parse the strings. There are many properties available in addition to Segments:
Uri MyUri = new Uri("https://www.google.com.ar:8080/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash");
foreach (String Segment in MyUri.Segments)
Response.Write(Segment + "<br />");
I think you should reconsider whether usage of a RegEx is really needed in this case;
I think extracting the top level domain from an URL is quite simple; in case of "http://www.example.com/?blah=111" you can simply take the part before the 3rd slash and perform a String.Split('.') and concat the last two array items. In case of "http://www.example.com", even easier.
Regex-patterns are very error-prone and quite hard to maintain and according to me you won't get any advantage of it. I recommend you to get rid off the Regex. Perhaps the result will be 2 - 3 more lines of code, but it will work, your code will be much better readable and easier to understand.

Maybe I need a Regex?

I am making a simple console application for a home project. Basically, it monitors a folder for any files being added.
FileSystemWatcher fsw = new FileSystemWatcher(#"c:\temp");
fsw.Created += new FileSystemEventHandler(fsw_Created);
bool monitor = true;
while (monitor)
{
fsw.WaitForChanged(WatcherChangeTypes.Created, 1000);
if(Console.KeyAvailable)
{
monitor = false;
}
}
Show("User has quit the process...", ConsoleColor.Yellow);
When a new files arrives, 'WaitForChanges' gets called, and I can then start the work.
What I need to do is check the filename for patterns. In real life, I am putting video files into this folder. Based on the filename, I will have rules, which move the files into specific directories. So for now, I'll have a list of KeyValue pairs... holding a RegEx (I think?), and a folder. So, if the filename matches a regex, it moves it into the related folder.
An example of a filename is:
CSI- NY.S07E01.The 34th Floor.avi
So, my Regex needs to look at it, and see if the words CSI "AND" (NY "OR" NewYork "OR" New York) exist. If they do, I will then move them to a \Series\CSI\NY\ folder.
I need the AND, because another file example for a different series is:
CSI- Crime Scene Investigation.S11E16.Turn On, Tune In, Drop Dead
So, for this one, I would need to have some NOTs. So, I need to check if the filename has CSI, but NOT ("New York" or "NY" or "NewYork")
Could someone assist me with these RegExs? Or maybe, there's a better method?
You can try to store conditions in Func<string,bool>
Dictionary<Func<string,bool>,string> dic = new Dictionary<Func<string, bool>, string>();
Func<string, bool> f1 = x => x.Contains("CSI") && ((x.Contains("NY") || x.Contains("New York"));
dic.Add(f1,"C://CSI/");
foreach (var pair in dic)
{
if(pair.Key.Invoke("CSI- NY.S07E01.The 34th Floor.avi"))
{
// copy
return;
}
}
I think you have the right idea. The nice thing about this approach is that you can add/remove/edit regular expressions to a config file or some other approach which means you don't have to recompile the project every time you want to keep track of a new show.
A regular expression for CSI AND NY would look something like this.
First if you want to check if CSI exists in the filename the regex is simply "CSI". Keep in mind it's case sensitive by default.
If you want to check if NY, New York or NewYork exist in the file name the regex is "((NY)|(New York)|(NewYork))" The bars indicate OR and the parenthesis are used to designate groups.
In order to combine the two you could run both regexes and in some cases (where perhaps order is unimportant) this might be easier. However if you always expect the show type to come after the syntax would be "(CSI).*((NY)|(New York)|(NewYork))" The period means "any character" and the asterisk means zero or more.
This does not look as one regex, even if you succeed with tossing the whole thing into one. Regexes which match "anything without a given word" are a pain. I'd better stick with two regexes for each rule: one which should match, and the other which should NOT match for this rule to be triggered. If you need your "CSI" and "NY" but don't like fixing any particular order within the filename, you as well may switch from a pair of regexes to a pair of lists of regexes. In general it's better to put this logic into code and configuration and keep regexes as simple as possible. And yes, you're quite likely to get away with simple substring search, no explicit need for regexes as long as you keep your code smart enough.
Well, people already gave you some advises about doing this using:
Regular expressions
Func and storing exactly the C# code that will be executed against the file
so I'm just give you a different one.
I disagree with using Regular Expressions for this purpose. I agree with #Anton S. Kraievoy: I don't like regexes to match anything without a given word. It is easier to check: !text.Contains(word)
The second option looks perfect if you are looking for a fast solution, but...
If that is a more complex application, and you want to design it correctly, I think you should:
Define how you will store those patterns (in a class with members, or in a string, etc). An string example could be:
"CSI" & ("NY" || "Las Vegas")
Then write a module that will match a filename with that pattern.
You're creating kind of a DSL.
Why is it better than just paste directly the C# code?
Well, because:
You can easily change the semantic of your patterns
You can generate the validation code in any language you want, because you're storing patterns in a generic way.
The thing is how to match a pattern against a filename.
You have some options:
Write the grammar of your pattern and write a parser by yourself
Generate (I'm not 100% sure if it is possible, that depends on the grammar) the write a regex that will convert your grammar into C# code.
Like: "A" & "B" to string.Contains("A") && string.Contains("B") or something like that.
Use a tool to do that, like ANTLR.

How do I use a pattern Url to extract a segment from an actual Url?

If I have a series of "pattern" Urls of the form:
http://{username}.sitename.com/
http://{username}.othersite.net/
http://mysite.com/{username}
and I have an actual Url of the form:
http://joesmith.sitename.com/
Is there any way that I can match a pattern Url and in turn use it to extract the username portion out the actual Url? I've thought of nasty ways to do it, but it just seems like there should be a more intuitive way to accomplish this.
ASP.NET MVC uses a similar approach to extract the various segments of the URL when it is building its routes. Given the example:
{controller}/{action}
So given the Url of the form, Home/Index, it knows that it is the Home controller calling the Index action method.
Not sure I understand this question correctly but you can just use a regular expression to match anything between 'http://' and the first dot.
A very simple regex will do:
':https?://([a-z0-9\.-]*[a-z0-9])\.sitename\.com'
This will allow any subdomain that only contains valid subdomain characters. Example of allowed subdomains:
joesmith.sitename.com
joe.smith.sitename.com
joe-smith.sitename.com
a-very-long-subdomain.sitename.com
As you can see, you might want to complicate the regex slightly. For instance, you could limit it to only allow a certain amount of characters in the subdomain.
It seems the the quickest and easiest solution is going off of Machine's answer.
var givenUri = "http://joesmith.sitename.com/";
var patternUri = "http://{username}.sitename.com/";
patternUri = patternUri.Replace("{username}", #"([a-z0-9\.-]*[a-z0-9]");
var result = Regex.Match(givenUri, patternUri, RegexOptions.IgnoreCase).Groups;
if(!String.IsNullOrEmpty(result[1].Value))
return result[1].Value;
Seems to work great.
Well, this "pattern URL" is a format you've made up, right? You basically you'll just need to process it.
If the format of it is:
anything inside "{ }" is a thing to capture, everything else must be as is
Then you'd just find the start/end index of those brackets, and match everything else. Then when you get to a place where one is, make sure you only look for chars such that they don't match whatever 'token' comes after the next ending '}'.
There are definitely different ways - ultimately though your server must be configured to handle (and possibly route) these different subdomain requests.
What I would do would be to answer all subdomain requests (except maybe some reserved words, like 'www', 'mail', etc.) on sitename.com with a single handler or page (I'm assuming ASP.NET here based on your C# tag).
I'd use the request path, which is easy enough to get, with some simple string parsing/regex routines (remove the 'http://', grab the first token up until '.' or '/' or '\', etc.) and then use that in a session, making sure to observe URL changes.
Alternately, you could map certain virtual paths to request urls ('joesmith.sitename.com' => 'sitename.com/index.aspx?username=joesmith') via IIS but that's kind of nasty too.
Hope this helps!

Regular expression to extract domain name from any domain

I'm trying to extract the domain name from a string in C#. You don't necessarily have to use a RegEx but we should be able to extract yourdomain.com from all of the following:
yourdomain.com
www.yourdomain.com
http://www.yourdomain.com
http://www.yourdomain.com/
store.yourdomain.com
http://store.yourdomain.com
whatever.youdomain.com
*.yourdomain.com
Also, any TLD is acceptable, so replace all the above with .net, .org, 'co'uk, etc.
If no scheme present (no colon in string), prepend "http://" to make it a valid URL.
Pass string to Uri constructor.
Access the Uri's Host property.
Now you have the hostname. What exactly you consider the ‘domain name’ of a given hostname is a debatable point. I'm guessing you don't simply mean everything after the first dot.
It's not possible to distinguish hostnames like ‘whatever.youdomain.com’ from domains-in-an-SLD like ‘warwick.ac.uk’ from just the strings. Indeed, there is even a bit of grey area about what is and isn't a public SLD, given the efforts of some registrars to carve out their own niches.
A common approach is to maintain a big list of SLDs and other suffixes used by unrelated entities. This is what web browsers do to stop unwanted public cookie sharing. Once you've found a public suffix, you could add the one nearest prefix in the host name split by dots to get the highest-level entity responsible for the given hostname, if that's what you want. Suffix lists are hell to maintain, but you can piggy-back on someone else's efforts.
Alternatively, if your app has the time and network connection to do it, it could start sniffing for information on the hostname. eg. it could do a whois query for the hostname, and keep looking at each parent until it got a result and that would be the domain name of the lowest-level entity responsible for the given hostname.
Or, if all that's too much work, you could try just chopping off any leading ‘www.’ present!
I would recommend trying this yourself. Using regulator and a regex cheat sheet.
http://sourceforge.net/projects/regulator/
http://regexlib.com/CheatSheet.aspx
Also find some good info on Regular Expressions at coding horror.
Have a look at this other answer. It was for PHP but you'll easily get the regex out of the 4-5 lines of PHP and you can benefit from the discussion that followed (see Alnitak's answer).
A regex doesn't really fit your requirement of "any TLD", since the format and number of TLDs is quite large and continually in flux. If you limited your scope to:
(?<domain>[^\.]+\.([A-Z]+$|co\.[A-Z]$))
You would catch .anything and .co.anything, which I imagine covers most realistic cases...

Categories

Resources