EmailAddressAttribute in .NET 4.6.1 allows a dot at the end.
That means the following email address is considered valid: someone@google.com.
For Microsoft this email is valid.
But for PayPal, for example, it is not.
So does anybody know whether a dot at the end of an email address is valid or not?
There is a lot of contending information about whether this is legal, or valid. Those are two different views, and I'm going to try and explain a bit why.
Email addresses are described in part by RFC 5322 - Internet Message Format which explains email formats in excruciating detail.
In section 3.4.1 - Addr-spec, the email address format is explained. I'm paraphrasing for brevity but the general format is:
local-part@domain
With local-part described as one of dot-atom / quoted-string / obs-local-part, and domain described as one of dot-atom / domain-literal / obs-domain.
So it's a domain name, which is described in RFC 1034 - Domain Names - Concepts And Facilities.
A domain name can be ambiguous or unambiguous, which is defined by the absence or presence of the trailing dot. Ambiguous domain names are not guaranteed to resolve to a location, although most (if not all at this point) DNS search lists append a period behind the scenes if one is not present; that is purely a quality-of-life improvement. Unambiguous domain names must contain a trailing period, which is basically a terminating character in DNS.
Thomas Flinkow already mentioned what the source looks like, I just wanted to give some context as to why - historically - the regex might be the way it is. A trailing period is legal, but validity is defined by the mail providers.
Well, since I did not find any documentation on that, I checked the source of the EmailAddressAttribute to see if any comments explained whether or not
someone@google.com.
is considered valid, but I did not find comments regarding that.
What I did find is this regular expression, which is used to determine whether or not an address is valid:
^((([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|
[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+(\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|
[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*)|((\x22)((((\x20|\x09)*
(\x0d\x0a))?(\x20|\x09)+)?(([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|
[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(\\([\x01-\x09
\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))))*(((\x20|\x09)*
(\x0d\x0a))?(\x20|\x09)+)?(\x22)))@((([a-z]|\d|
[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|
[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|
[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|
[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|
[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|
[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|
[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|
[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?$
which is obviously quite long. The interesting part however is this little part at the very end:
\.?
which means match between 0 and 1 "." characters.
Therefore I feel like it is intentionally built in that email addresses ending with a period are considered valid, although I have not found any external resources on whether email addresses ending with a period are actually allowed by any email providers.
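To see the effect of that optional trailing \.? in isolation, here is a small sketch. The pattern is a heavily simplified stand-in made up for illustration, not the attribute's real expression, and the class name TrailingDotDemo is hypothetical:

```csharp
using System.Text.RegularExpressions;

public static class TrailingDotDemo
{
    // Simplified stand-in (an assumption, NOT the real EmailAddressAttribute pattern):
    // local part, '@', one or more dot-terminated labels, a final label,
    // and then the same optional trailing period as in the real expression.
    private static readonly Regex Simplified =
        new Regex(@"^[a-z0-9]+@([a-z0-9]+\.)+[a-z]+\.?$", RegexOptions.IgnoreCase);

    public static bool Matches(string input) => Simplified.IsMatch(input);
}
```

With this pattern both "someone@google.com" and "someone@google.com." match, while two trailing periods do not, since \.? consumes at most one.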
For validation, I would advise not relying solely on the EmailAddressAttribute but writing your own validator (since EmailAddressAttribute is sealed, you cannot derive your own attribute from it), which could look somewhat like this:
public bool IsValidEmailAddress(string email)
{
    var emailValidator = new EmailAddressAttribute();
    // The attribute treats null as valid, so guard against it before calling EndsWith.
    return email != null && emailValidator.IsValid(email) && !email.EndsWith(".");
}
In the code above, the attribute provides the basic check, and !email.EndsWith(".") rejects addresses with a trailing period that the attribute would otherwise accept.
TL;DR: The definite answer seems to be what Yannick Meeus has written:
A trailing period is legal, but validity is defined by the mail
providers.
and therefore Microsoft seems to have conformed to the rules, even though in practice only a few (none?) mail providers allow a trailing period. So you have to decide whether you also conform to the formal rules and allow the trailing "." or whether you want to exclude it (as demonstrated in the sample code above).
I have a user input for a website address. Obviously most users have no idea what a well-formatted URL is, so I am looking for a website address regex that follows these rules:
1) www.someaddress.com - True
2) someaddress.com - True
3) http://someaddress.com - True
4) https://someaddress.com - True
5) https://www.someaddress.co.il - True
6) http://www.someaddress.com - True
I use this Regex:
[RegularExpression(@"^((http|ftp|https|www)://)?([\w+?\.\w+])+([a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?$", ErrorMessage = "Not a valid website address")]
public string SiteUrl { get; set; }
But it's useless because it allows almost every string to pass.
Please supply a data annotation answer and not answers such as:
Uri.IsWellFormedUriString
Because .NET doesn't support client-side validation for custom attributes.
There is a UrlAttribute to validate URLs, but it does enforce the protocol being there, which it appears you don't want.
However, the source code is available and it does use a regular expression that you can steal and modify. Modifying just the protocol portion to be optional the way you want, you get this:
^((http|ftp|https)://)?(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*#)?(((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?)(:\d*)?)(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)*)*)?)?(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|\/|\?)*)?$
(Side note: I noticed that your regex allowed www://, which is suspicious. I took it out in this, but if you truly do need that, then you can add it.)
These are values I tested with:
www.someaddress.com Yes
someaddress.com Yes
http://someaddress.com Yes
https://someaddress.com Yes
https://www.someaddress.co.il Yes
cow No
hi hello.com No
this/that.com No
In the comments of the source code it does say:
This attribute provides server-side url validation equivalent to jquery validate, and therefore shares the same regular expression. See unit tests for examples.
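If the full expression is more than you need, the same idea (an optional protocol in front of a host name) can be sketched with a much shorter pattern. Everything below is an assumption for illustration: the pattern is deliberately simplified and far less thorough than the UrlAttribute one, and SiteModel and SiteUrlCheck are made-up names.

```csharp
using System.ComponentModel.DataAnnotations;

public class SiteModel
{
    // Simplified, hypothetical pattern: optional protocol, dot-separated labels,
    // an alphabetic TLD, then an optional port and path. Lowercase hosts only.
    public const string UrlPattern =
        @"^((http|ftp|https)://)?([a-z0-9-]+\.)+[a-z]{2,}(:\d+)?(/\S*)?$";

    [RegularExpression(UrlPattern, ErrorMessage = "Not a valid website address")]
    public string SiteUrl { get; set; } = "";
}

public static class SiteUrlCheck
{
    // Runs the same attribute outside of model binding, which is handy for quick tests.
    public static bool IsValid(string value) =>
        new RegularExpressionAttribute(SiteModel.UrlPattern).IsValid(value);
}
```

Because it is still a plain RegularExpression data annotation, the usual client-side validation applies to it.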
I'm writing a very basic web server that has to support an extremely limited special server side scripting language. Basically all I need to support is "echo", addition/subtraction/multiplication (no division) with only 2 operands, a simple "date()" function that outputs the date and the use of the "&" operator to concatenate strings.
An example could be:
echo "Here is the date: " & date();
echo "9 x 15 = : " & 9*15;
I've gone through and created the code necessary to generate tokens, but I'm not sure I'm using the right tokens.
I created tokens for the following:
ECHO - The echo command
WHITESPACE - Any whitespace
STRING - A string inside quotations
DATE - The date() function
CONCAT - the & operator for concatenation
MATH - Any instance of binary operation (5+4, 9*2, 8-2, etc)
TERM - The terminal character (;)
The MATH one I am particularly unsure about. Typically I see people create a token specifically for integers and then for each operator as well, but since I ONLY want to allow binary operations, I thought it made sense to group it into one token. If I were to do everything separately, I would have to do some extra work to ensure that I never accepted "5+4+1".
So question 1 is am I on the right track with which tokens to use?
My next question is what do I do with these tokens next to ensure correct syntax? The approach that I had thought of was to basically say, "Okay I know I have this token, here is a list of tokens that are allowed to come next based on the current token. Is the next token in the list?"
Based on that, I made a list of all of my tokens as well as what tokens are valid to appear directly after them (didn't include whitespace for simplicity).
ECHO -> STRING|MATH|DATE
STRING -> TERM|CONCAT
MATH -> TERM|CONCAT
DATE -> TERM|CONCAT
CONCAT -> STRING|MATH|DATE
The problem is I'm not sure at all how to best implement this. Really I need to keep track of whitespace as well to make sure there are spaces between the tokens. But that means I have to look ahead two tokens at a time which is getting even more intimidating. I also am not sure how to manage the "valid next tokens" stuff without just some disgusting section of if blocks. Should I be checking for valid syntax before trying to actually execute the script, or should I do it all at once and just throw an error when I reach an unexpected token? In this simple example, everything will always work just fine parsing left to right, there's no real precedence rules (except the MATH thing, but that's part of why I combined it into one token even though it feels wrong.) Even so, I wouldn't mind designing a more scalable and elegant solution.
In my research about writing parsers, I see a lot of references to creating "accept()" and "expect()" functions but I can't find any clear description of what they are supposed to do or how they are supposed to work.
I guess I'm just not sure how to implement this, and then how to actually come up with a resulting string at the end of the day.
Am I heading in the right direction and does anybody know of a resource that might help me understand how to best implement something simple like this? I am required to do it by hand and cannot use a tool like ANTLR.
Thanks in advance for any help.
The first thing that you need to do is to discard all the white-spaces (except for the ones in strings). This way, when you add tokens to the list of tokens, you are sure that the list contains only valid tokens. For example, consider this statement:
echo "Here is the date: " & date();
I will start tokenizing and first separate echo based on the white-space (yes, white-space is needed here to separate it but isn't useful after that). The tokenizer then encounters a double quote and continues reading everything until the closing double quote is found. Similarly, I create separate tokens for &, date and ().
My token list now contains the following tokens:
echo
"Here is the date: "
&
date
()
Now, in the parsing stage, we read these tokens. The parser loops through every token in the token list. It reads echo and checks if it is valid (based on the rules / functions you have for the language). It advances to the next token and sees if it is either of the date, string or math. Similarly, it checks the rest of the tokens. If at any point, a token is not supposed to be there, you can throw an error indicating syntax error or something.
For the math statement tokenization, only combine the expression that is contained in a bracket and rest of operands and operators separately. For example: 9/3 + (7-3+1) would have the tokens 9, /, 3, +, and (7-3+1). As every token has its own priority (that you define in the token struct), you can start evaluating from the highest priority token down to the lowest token priority. This way you can have prioritized expressions. If you still have confusion, let me know. I'll write you some example code.
expect is what your parser does to get the next token, and fails if the token isn't a proper following token. To begin with, your parser expects ECHO or WHITESPACE. Those are the only valid starting terms. Having seen "ECHO", your parser expects one of WHITESPACE|STRING|MATH|DATE; anything else is an error. And so on.
accept is when your parser has seen a complete "statement" - ECHO, followed by a valid sequence of tokens, followed by TERM. Your parser now has enough information to process your ECHO command.
Oh, and hand-written parsers (especially simple ones) are very often disgusting collections of if blocks (or moral equivalents like switch statements) :) Further up the line of elegant-ness would be some kind of state machine, and further up from that is a grammar generator like yacc or GOLD Parser Generator (which in turn churn out ugly if, switch, and state machines for you).
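For what it's worth, the "valid next tokens" list from the question maps fairly directly onto a lookup table, which is one way to keep the if blocks under control. A minimal sketch, assuming whitespace has already been discarded and only a single statement is checked; the names TokenKind and FollowTable are made up for illustration:

```csharp
using System.Collections.Generic;

public enum TokenKind { Echo, String, Math, Date, Concat, Term }

public static class FollowTable
{
    // Which tokens may directly follow each token (taken from the list in the question).
    // TERM has no entry: nothing may follow it within one statement.
    private static readonly Dictionary<TokenKind, HashSet<TokenKind>> ValidNext =
        new Dictionary<TokenKind, HashSet<TokenKind>>
        {
            [TokenKind.Echo] = new HashSet<TokenKind> { TokenKind.String, TokenKind.Math, TokenKind.Date },
            [TokenKind.String] = new HashSet<TokenKind> { TokenKind.Term, TokenKind.Concat },
            [TokenKind.Math] = new HashSet<TokenKind> { TokenKind.Term, TokenKind.Concat },
            [TokenKind.Date] = new HashSet<TokenKind> { TokenKind.Term, TokenKind.Concat },
            [TokenKind.Concat] = new HashSet<TokenKind> { TokenKind.String, TokenKind.Math, TokenKind.Date },
        };

    public static bool IsValidStatement(IReadOnlyList<TokenKind> tokens)
    {
        // A statement must start with ECHO and end with TERM.
        if (tokens.Count == 0 || tokens[0] != TokenKind.Echo) return false;
        for (int i = 0; i + 1 < tokens.Count; i++)
        {
            // Each adjacent pair must appear in the table.
            if (!ValidNext.TryGetValue(tokens[i], out HashSet<TokenKind> allowed)
                || !allowed.Contains(tokens[i + 1]))
                return false;
        }
        return tokens[tokens.Count - 1] == TokenKind.Term;
    }
}
```

Adding a new token type then means adding a row to the table rather than another branch.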
EDIT to provide more details.
To help sort out responsibilities, create a "lexer" whose job is to read the input and produce tokens. This involves deciding what tokens look like. An easy token is the word "echo". A less easy token is a math operation; the token would consist of one or more digits, an operator, and one or more digits, with no whitespace between. The lexer would take care of skipping whitespace, as well as understanding a quoted string and the characters that form the date() function. The lexer would return two things - the type of token read and the value of the token (e.g., "MATH" and "9*15").
With a lexer in hand to read your input, the parser consumes the tokens and ensures they're in a proper order. First you have to see the ECHO token. If not, fail with an error message. After that, you have to see STRING, DATE, or MATH. If not, fail with an error message. After that, you loop, watching for either TERM, or else CONCAT followed by another STRING, DATE, or MATH. If you see TERM, break the loop. If you see neither TERM nor CONCAT, fail with an error message.
You can process the ECHO command as you're parsing, since it's a simple grammar. Each time you find a STRING, DATE or MATH, evaluate it and concatenate it to what you already have. When you find TERM, exit the function and return the built-up string.
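Putting the lexer and the parsing loop described above together, a rough sketch might look like the following. It handles a single statement only. The names (TokenType, Token, EchoScript), the yyyy-MM-dd date format and the error messages are my own assumptions, not part of the question.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public enum TokenType { Echo, String, Date, Math, Concat, Term }

public sealed class Token
{
    public TokenType Type { get; }
    public string Value { get; }
    public Token(TokenType type, string value) { Type = type; Value = value; }
}

public static class EchoScript
{
    // Lexer: skips whitespace and returns (type, value) tokens.
    public static List<Token> Lex(string src)
    {
        var tokens = new List<Token>();
        int i = 0;
        while (i < src.Length)
        {
            char c = src[i];
            if (char.IsWhiteSpace(c)) { i++; }
            else if (c == ';') { tokens.Add(new Token(TokenType.Term, ";")); i++; }
            else if (c == '&') { tokens.Add(new Token(TokenType.Concat, "&")); i++; }
            else if (c == '"')
            {
                int end = src.IndexOf('"', i + 1);
                if (end < 0) throw new FormatException("Unterminated string");
                tokens.Add(new Token(TokenType.String, src.Substring(i + 1, end - i - 1)));
                i = end + 1;
            }
            else if (string.CompareOrdinal(src, i, "echo", 0, 4) == 0)
            { tokens.Add(new Token(TokenType.Echo, "echo")); i += 4; }
            else if (string.CompareOrdinal(src, i, "date()", 0, 6) == 0)
            { tokens.Add(new Token(TokenType.Date, "date()")); i += 6; }
            else if (char.IsDigit(c))
            {
                // Grab digits and operators greedily; EvalMath enforces the two-operand rule.
                int start = i;
                while (i < src.Length && (char.IsDigit(src[i]) || "+-*".IndexOf(src[i]) >= 0)) i++;
                tokens.Add(new Token(TokenType.Math, src.Substring(start, i - start)));
            }
            else throw new FormatException("Unexpected character '" + c + "' at " + i);
        }
        return tokens;
    }

    // Parser and evaluator for: ECHO operand (CONCAT operand)* TERM
    public static string Run(string src)
    {
        List<Token> tokens = Lex(src);
        int pos = 0;
        var output = new StringBuilder();

        // Returns the next token if its type is one of the allowed ones, otherwise fails.
        Token Expect(params TokenType[] allowed)
        {
            if (pos < tokens.Count && Array.IndexOf(allowed, tokens[pos].Type) >= 0)
                return tokens[pos++];
            throw new FormatException("Expected " + string.Join("|", allowed) + " at token " + pos);
        }

        Expect(TokenType.Echo);
        while (true)
        {
            output.Append(Evaluate(Expect(TokenType.String, TokenType.Date, TokenType.Math)));
            if (Expect(TokenType.Term, TokenType.Concat).Type == TokenType.Term) break;
        }
        return output.ToString();
    }

    private static string Evaluate(Token t)
    {
        if (t.Type == TokenType.String) return t.Value;
        if (t.Type == TokenType.Date) return DateTime.Now.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);
        return EvalMath(t.Value);
    }

    private static string EvalMath(string expr)
    {
        int op = expr.IndexOfAny(new[] { '+', '-', '*' });
        if (op <= 0
            || !int.TryParse(expr.Substring(0, op), NumberStyles.None, CultureInfo.InvariantCulture, out int left)
            || !int.TryParse(expr.Substring(op + 1), NumberStyles.None, CultureInfo.InvariantCulture, out int right))
            throw new FormatException("MATH must be <int><op><int>: " + expr);
        int result = expr[op] == '+' ? left + right : expr[op] == '-' ? left - right : left * right;
        return result.ToString(CultureInfo.InvariantCulture);
    }
}
```

For example, EchoScript.Run("echo \"9 x 15 = \" & 9*15;") builds up the string piece by piece, and an input such as echo 5+4+1; is rejected when the MATH token is evaluated.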
Questions? Comments? Omelets? :)
How can I validate the following incorrect email addresses with regex in C#?
eg:
test@test.com.net.com
or
test@test.net.net.org
These are being validated as correct email addresses. Any thoughts? Thanks
While both test@test.com.net.com and test@test.net.net.org are valid from a syntactic point of view, their domain parts do not point to existing domains.
For this kind of test, you may want to extract the domain part you are interested in and query the DNS (see RFC 2821 and RFC 2822) to see if it exists.
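A minimal sketch of that idea using only the base class library. Note the assumptions: Dns.GetHostAddresses only tells you whether the name resolves to an address, it does not look up MX records (that would need a separate DNS library), and the class and method names are made up.

```csharp
using System;
using System.Net;
using System.Net.Mail;
using System.Net.Sockets;

public static class DomainCheck
{
    // MailAddress parses the address; Host is the part after the '@'.
    public static string GetDomain(string email) => new MailAddress(email).Host;

    // Returns true if the domain resolves to at least one address.
    // This is only an approximation of "can receive mail" (no MX lookup).
    public static bool DomainResolves(string email)
    {
        try
        {
            return Dns.GetHostAddresses(GetDomain(email)).Length > 0;
        }
        catch (SocketException) { return false; }  // name could not be resolved
        catch (FormatException) { return false; }  // address could not be parsed
    }
}
```

DomainResolves needs network access, so its result depends on the environment it runs in.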
Since you are using .NET, by the way, I would suggest taking a look at our EmailVerify.NET, an email validation library which can validate the syntax (according to the latest IETF standards), the domain parts and the presence of a mailbox for your email addresses.
You may want to just use something like:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
For a list, please see this page.
Please consider this regex:
([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})
It is meant to match email addresses in this common format; if an address does not match, it is not in that format. Hope it helps :)
I am writing a code generator in which the variable names are given by the user.
Previous answers have suggested using Regex or CodeDomProvider. The former will tell you if the identifier is valid but doesn't check keywords; the latter checks keywords but doesn't appear to check all Types known to the code.
How to determine if a string is a valid variable name?
For instance, a user could name a variable List, or Type, but that is not desirable. How would I prevent this?
The easiest way is to add a list of C# keywords to your application. MSDN has a complete list here.
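The keyword-list approach could look roughly like this. It is a sketch under a few assumptions: the set holds the reserved keywords only (contextual ones such as var or async are left out), the identifier pattern is a simplification of the real C# rules, and ReservedTypeNames is a hypothetical extra blocklist you would fill with whatever type names your generated code depends on.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class IdentifierCheck
{
    // Reserved C# keywords (contextual keywords are intentionally not included).
    private static readonly HashSet<string> Keywords = new HashSet<string>(StringComparer.Ordinal)
    {
        "abstract", "as", "base", "bool", "break", "byte", "case", "catch", "char", "checked",
        "class", "const", "continue", "decimal", "default", "delegate", "do", "double", "else",
        "enum", "event", "explicit", "extern", "false", "finally", "fixed", "float", "for",
        "foreach", "goto", "if", "implicit", "in", "int", "interface", "internal", "is", "lock",
        "long", "namespace", "new", "null", "object", "operator", "out", "override", "params",
        "private", "protected", "public", "readonly", "ref", "return", "sbyte", "sealed", "short",
        "sizeof", "stackalloc", "static", "string", "struct", "switch", "this", "throw", "true",
        "try", "typeof", "uint", "ulong", "unchecked", "unsafe", "ushort", "using", "virtual",
        "void", "volatile", "while"
    };

    // Hypothetical blocklist of type names that the generated code relies on.
    private static readonly HashSet<string> ReservedTypeNames =
        new HashSet<string>(StringComparer.Ordinal) { "List", "Type" };

    // Simplified identifier shape: a letter or underscore, then letters, digits or underscores.
    private static readonly Regex Shape = new Regex(@"^[\p{L}_][\p{L}\p{Nd}_]*\z");

    public static bool IsUsableName(string name) =>
        !string.IsNullOrEmpty(name)
        && Shape.IsMatch(name)
        && !Keywords.Contains(name)
        && !ReservedTypeNames.Contains(name);
}
```

The comparison is case-sensitive on purpose, since C# keywords are: "Class" passes while "class" does not.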
If you really want to get fancy, you could dynamically compile your generated code and check for the specific errors that you're concerned about. In this case, you're specifically looking for error CS1041:
error CS1041: Identifier expected; '**' is a keyword
You'll probably want to ignore any errors regarding unresolved references, undeclared identifiers, etc.
As others have suggested, you could just prepend your identifiers with @, which is fine if you don't want the user to examine the generated code. If it's something they're going to have to maintain, however, I'd avoid that as (in my opinion) it makes the code noisy, just like $ all over the place in PHP or guys that insist on putting this. in front of every freaking field reference.
I'm not sure there is a full API available which will give you what you're looking for. However the end result you seem to be looking for is the generation of code which will not cause conflicts with reserved C# keywords or existing types. If that is the case one approach you can take is to escape all identifiers given by the user with the @ symbol. This allows even reserved keywords in C# to be treated as identifiers.
For example, the following is a completely valid C# program:
class Program
{
    static void Main(string[] args)
    {
        int @byte = 42;
        int @string = @byte;
        int @Program = 0;
    }
}
One option here would be to have your code generator prefix the user-specified name with @. As described in 2.4.2, the @ sign (verbatim identifier):
The prefix "@" enables the use of keywords as identifiers, which is useful when interfacing with other programming languages. The character @ is not actually part of the identifier, so the identifier might be seen in other languages as a normal identifier, without the prefix. An identifier with an @ prefix is called a verbatim identifier. Use of the @ prefix for identifiers that are not keywords is permitted, but strongly discouraged as a matter of style.
This would allow you to check for the main keywords, and deny them as needed, but not worry about all of the conflicting type information, etc.
You could just prepend a @ character to the variable - for instance, @private is a valid variable name.
I'm trying to extract the domain name from a string in C#. You don't necessarily have to use a RegEx but we should be able to extract yourdomain.com from all of the following:
yourdomain.com
www.yourdomain.com
http://www.yourdomain.com
http://www.yourdomain.com/
store.yourdomain.com
http://store.yourdomain.com
whatever.youdomain.com
*.yourdomain.com
Also, any TLD is acceptable, so the same should work with .com replaced by .net, .org, .co.uk, etc.
1) If no scheme is present (no colon in the string), prepend "http://" to make it a valid URL.
2) Pass the string to the Uri constructor.
3) Access the Uri's Host property.
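Those three steps could be sketched like this (HostExtractor and GetHost are made-up names; inputs that Uri cannot parse will throw a UriFormatException):

```csharp
using System;

public static class HostExtractor
{
    public static string GetHost(string input)
    {
        string candidate = input.Trim();

        // Step 1: no colon means no scheme, so add one to get an absolute URL.
        if (!candidate.Contains(":"))
        {
            candidate = "http://" + candidate;
        }

        // Steps 2 and 3: let Uri do the parsing and read the host name back.
        return new Uri(candidate).Host;
    }
}
```

The colon test is as crude as it sounds: something like yourdomain.com:8080 contains a colon but no scheme, so you may prefer to look for "://" instead.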
Now you have the hostname. What exactly you consider the ‘domain name’ of a given hostname is a debatable point. I'm guessing you don't simply mean everything after the first dot.
It's not possible to distinguish hostnames like ‘whatever.youdomain.com’ from domains-in-an-SLD like ‘warwick.ac.uk’ from just the strings. Indeed, there is even a bit of grey area about what is and isn't a public SLD, given the efforts of some registrars to carve out their own niches.
A common approach is to maintain a big list of SLDs and other suffixes used by unrelated entities. This is what web browsers do to stop unwanted public cookie sharing. Once you've found a public suffix, you could add the one nearest prefix in the host name split by dots to get the highest-level entity responsible for the given hostname, if that's what you want. Suffix lists are hell to maintain, but you can piggy-back on someone else's efforts.
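As a sketch of the suffix-list idea, assuming a tiny hard-coded set in place of a real, maintained list (which also has wildcard and exception rules that are ignored here). The class name and the set contents are purely illustrative:

```csharp
using System;
using System.Collections.Generic;

public static class RegistrableDomain
{
    // Hypothetical, tiny stand-in for a real public suffix list.
    private static readonly HashSet<string> Suffixes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "com", "net", "org", "uk", "co.uk", "ac.uk" };

    public static string FromHost(string host)
    {
        string[] labels = host.Split('.');

        // Try the longest candidate suffix first, so "co.uk" wins over "uk".
        for (int i = 1; i < labels.Length; i++)
        {
            string suffix = string.Join(".", labels, i, labels.Length - i);
            if (Suffixes.Contains(suffix))
            {
                // The suffix plus the one label directly in front of it.
                return labels[i - 1] + "." + suffix;
            }
        }

        // No known suffix: return the host unchanged.
        return host;
    }
}
```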
Alternatively, if your app has the time and network connection to do it, it could start sniffing for information on the hostname. eg. it could do a whois query for the hostname, and keep looking at each parent until it got a result and that would be the domain name of the lowest-level entity responsible for the given hostname.
Or, if all that's too much work, you could try just chopping off any leading ‘www.’ present!
I would recommend trying this yourself, using Regulator and a regex cheat sheet.
http://sourceforge.net/projects/regulator/
http://regexlib.com/CheatSheet.aspx
Also find some good info on Regular Expressions at coding horror.
Have a look at this other answer. It was for PHP but you'll easily get the regex out of the 4-5 lines of PHP and you can benefit from the discussion that followed (see Alnitak's answer).
A regex doesn't really fit your requirement of "any TLD", since the format and number of TLDs is quite large and continually in flux. If you limited your scope to:
(?<domain>[^\.]+\.([A-Z]+$|co\.[A-Z]+$))
You would catch .anything and .co.anything, which I imagine covers most realistic cases...
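A quick way to try it out. Two assumptions on my part: the pattern is used with RegexOptions.IgnoreCase (it only lists upper-case letters), and the part after co\. is allowed to be more than one letter so that .co.uk works. DomainRegex and Extract are made-up names.

```csharp
using System.Text.RegularExpressions;

public static class DomainRegex
{
    private static readonly Regex Pattern = new Regex(
        @"(?<domain>[^\.]+\.([A-Z]+$|co\.[A-Z]+$))",
        RegexOptions.IgnoreCase);

    // Returns the captured domain, or an empty string when nothing matches.
    public static string Extract(string input)
    {
        // The pattern is anchored at the end, so drop any trailing slash first.
        Match m = Pattern.Match(input.TrimEnd('/'));
        return m.Success ? m.Groups["domain"].Value : "";
    }
}
```

Anything with a path after the host will not match, so it is best fed a bare host name.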