Just I want to clarify one thing. Per client request we have to create a regular expression in such a way that it should allow apostrophe in email address.
My Question according to RFC standard will an email address contain aportrophe? If so how to recreate regular expression to allow apostrophe?
The regular expression below implements the official RFC 2822 standard for email addresses. Using this regular expression in actual applications is NOT recommended. It is shown to illustrate that with regular expressions there's always a trade-off between what's exact and what's practical.
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")#(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
You could use the simplified one:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
And yes, apostrophe is allowed in the email, as long as it is not in domain name.
Here's the validation attribute I wrote. It validates pretty much every "raw" email address, that is those of the form local-part#*domain*. It doesn't support any of the other, more...creative constructs that the RFCs allow (this list is not comprehensive by any means):
comments (e.g., jsmith#whizbang.com (work))
quoted strings (escaped text, to allow characters not allowed in an atom)
domain literals (e.g. foo#[123.45.67.012])
bang-paths (aka source routing)
angle addresses (e.g. John Smith <jsmith#whizbang.com>)
folding whitespace
double-byte characters in either local-part or domain (7-bit ASCII only).
etc.
It should accept almost any email address that can be expressed thusly
foo.bar#bazbat.com
without requiring the use of quotes ("), angle brackets ('<>') or square brackets ([]).
No attempt is made to validate that the rightmost dns label in the domain is a valid TLD (top-level domain). That is because the list of TLDs is far larger now than the "big 6" (.com, .edu, .gov, .mil, .net, .org) plus 2-letter ISO country codes. ICANN actually updates the TLD list daily, though I suspect that the list doesn't actually change daily. Further, ICANN just approved a big expansion of the generic TLD namespace). And some email addresses don't have what you're recognize as a TLD (did you know that postmaster#. is theoretically valid and mailable? Mail to that address should get delivered to the postmaster of the DNS root zone.)
Extending the regular expression to support domain literals, it shouldn't be too difficult.
Here you go. Use it in good health:
using System;
using System.ComponentModel.DataAnnotations;
using System.Text.RegularExpressions;
namespace ValidationHelpers
{
[AttributeUsage( AttributeTargets.Property | AttributeTargets.Field , AllowMultiple = false )]
sealed public class EmailAddressValidationAttribute : ValidationAttribute
{
static EmailAddressValidationAttribute()
{
RxEmailAddress = CreateEmailAddressRegex();
return;
}
private static Regex CreateEmailAddressRegex()
{
// references: RFC 5321, RFC 5322, RFC 1035, plus errata.
string atom = #"([A-Z0-9!#$%&'*+\-/=?^_`{|}~]+)" ;
string dot = #"(\.)" ;
string dotAtom = "(" + atom + "(" + dot + atom + ")*" + ")" ;
string dnsLabel = "([A-Z]([A-Z0-9-]{0,61}[A-Z0-9])?)" ;
string fqdn = "(" + dnsLabel + "(" + dot + dnsLabel + ")*" + ")" ;
string localPart = "(?<localpart>" + dotAtom + ")" ;
string domain = "(?<domain>" + fqdn + ")" ;
string emailAddrPattern = "^" + localPart + "#" + domain + "$" ;
Regex instance = new Regex( emailAddrPattern , RegexOptions.Singleline | RegexOptions.IgnoreCase );
return instance;
}
private static Regex RxEmailAddress;
public override bool IsValid( object value )
{
string s = Convert.ToString( value ) ;
bool fValid = string.IsNullOrEmpty( s ) ;
// we'll take an empty field as valid and leave it to the [Required] attribute to enforce that it's been supplied.
if ( !fValid )
{
Match m = RxEmailAddress.Match( s ) ;
if ( m.Success )
{
string emailAddr = m.Value ;
string localPart = m.Groups[ "localpart" ].Value ;
string domain = m.Groups[ "domain" ].Value ;
bool fLocalPartLengthValid = localPart.Length >= 1 && localPart.Length <= 64 ;
bool fDomainLengthValid = domain.Length >= 1 && domain.Length <= 255 ;
bool fEmailAddrLengthValid = emailAddr.Length >= 1 && emailAddr.Length <= 256 ; // might be 254 in practice -- the RFCs are a little fuzzy here.
fValid = fLocalPartLengthValid && fDomainLengthValid && fEmailAddrLengthValid ;
}
}
return fValid ;
}
}
}
Cheers!
Related
I have to write a function that looks up for a string and check if is followed/preceded by a blank space, and if not add it here is my try :
public string AddSpaceIfNeeded(string originalValue, string targetValue)
{
if (originalValue.Contains(targetValue))
{
if (!originalValue.StartsWith(targetValue))
{
int targetValueIndex = originalValue.IndexOf(targetValue);
if (!char.IsWhiteSpace(originalValue[targetValueIndex - 1]))
originalValue.Insert(targetValueIndex - 1, " ");
}
if (!originalValue.EndsWith(targetValue))
{
int targetValueIndex = originalValue.IndexOf(targetValue);
if (!char.IsWhiteSpace(originalValue[targetValueIndex + targetValue.Length + 1]) && !originalValue[targetValueIndex + targetValue.Length + 1].Equals("(s)"))
originalValue.Insert(targetValueIndex + targetValue.Length + 1, " ");
}
}
return originalValue;
}
I want to try with Regex :
I tried like this for adding spaces after the targetValue :
Regex spaceRegex = new Regex("(" + targetValue + ")(?!,)(?!!)(?!(s))(?= )");
originalValue = spaceRegex.Replace(originalValue, (Match m) => m.ToString() + " ");
But not working, and I don't really know for adding space before the word.
Example adding space after:
AddSpaceIfNeeded(Hello my nameis ElBarto, name)
=> Output Hello my name is ElBarto
Example adding space before:
AddSpaceIfNeeded(Hello myname is ElBarto, name)
=> Output Hello my name is ElBarto
You may match your word in all three context while capturing them in separate groups and test for a match later in the match evaluator:
public static string AddSpaceIfNeeded(string originalValue, string targetValue)
{
return Regex.Replace(originalValue,
$#"(?<=\S)({targetValue})(?=\S)|(?<=\S)({targetValue})(?!\S)|(?<!\S){targetValue}(?=\S)", m =>
m.Groups[1].Success ? $" {targetValue} " :
m.Groups[2].Success ? $" {targetValue}" :
$"{targetValue} ");
}
See the C# demo
Note you may need to use Regex.Escape(targetValue) to escape any sepcial chars in the string used as a dynamic pattern.
Pattern details
(?<=\S)({targetValue})(?=\S) - a targetValue that is preceded with a non-whitespace ((?<=\S)) and followed with a non-whitespace ((?=\S))
| - or
(?<=\S)({targetValue})(?!\S) - a targetValue that is preceded with a non-whitespace ((?<=\S)) and not followed with a non-whitespace ((?!\S))
| - or
(?<!\S){targetValue}(?=\S) - a targetValue that is not preceded with a non-whitespace ((?<!\S)) and followed with a non-whitespace ((?!\S))
When m.Groups[1].Success is true, the whole value should be enclosed with spaces. When m.Groups[2].Success is true, we need to add a space before the value. Else, we add a space after the value.
I don't know Regex,
But I need to have regex expression for evaluation of ClassName.PropertyName?
Need to validate some values from appSettings for being compliant with ClassName.PropertyName convention
"ClassName.PropertyName" - this is the only format that is valid, the rest below is invalid:
"Personnel.FirstName1" <- the only string that should match
"2Personnel.FirstName1"
"Personnel.33FirstName"
"Personnel..FirstName"
"Personnel.;FirstName"
"Personnel.FirstName."
"Personnel.FirstName "
" Personnel.FirstName"
" Personnel. FirstName"
" 23Personnel.3FirstName"
I have tried this (from the link posted as duplicate):
^\w+(.\w+)*$
but it doesn't work: I have false positives, e.g. 2Personnel.FirstName1 as well as Personnel.33FirstName passes the check when both should have been rejected.
Can someone help me with that?
Let's start from single identifier:
Its first character must be letter or underscope
It can contain letters, underscopes and digits
So the regular expression for an identifier is
[A-Za-z_][A-Za-z0-9_]*
Next, we should chain identifier with . (do not forget to escape .) an indentifier followed by zero or more . + identifier:
^[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*$
In case it must be exactly two identifiers (and not, say abc.def.hi - three ones)
^[A-Za-z_][A-Za-z0-9_]*\.[A-Za-z_][A-Za-z0-9_]*$
Tests:
string[] tests = new string[] {
"Personnel.FirstName1", // the only string that should be matched
"2Personnel.FirstName1",
"Personnel.33FirstName",
"Personnel..FirstName",
"Personnel.;FirstName",
"Personnel.FirstName.",
"Personnel.FirstName ",
" Personnel.FirstName",
" Personnel. FirstName",
" 23Personnel.3FirstName",
} ;
string pattern = #"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*$";
var results = tests
.Select(test =>
$"{"\"" + test + "\"",-25} : {(Regex.IsMatch(test, pattern) ? "matched" : "failed")}"");
Console.WriteLine(String.Join(Environment.NewLine, results));
Outcome:
"Personnel.FirstName1" : matched
"2Personnel.FirstName1" : failed
"Personnel.33FirstName" : failed
"Personnel..FirstName" : failed
"Personnel.;FirstName" : failed
"Personnel.FirstName." : failed
"Personnel.FirstName " : failed
" Personnel.FirstName" : failed
" Personnel. FirstName" : failed
" 23Personnel.3FirstName" : failed
Edit: In case culture specific names (like äöü.FirstName) should be accepted (see Rand Random's comments) then [A-Za-z] range should be changed into \p{L} - any letter. Exotic possibility - culture specific digits (e.g. Persian ones - ۰۱۲۳۴۵۶۷۸۹) can be solved by changing 0-9 into \d
// culture specific letters, but not digits
string pattern = #"^[\p{L}_][\p{L}0-9_]*(?:\.[\p{L}_][\p{L}0-9_]*)*$";
If each identifier should not exceed sertain length (say, 16) we should redesign initial identifier pattern: mandatory letter or underscope followed by [0..16-1] == {0,15} letters, digits or underscopes
[A-Za-z_][A-Za-z0-9_]{0,15}
And we have
string pattern = #"^[A-Za-z_][A-Za-z0-9_]{0,15}(?:\.[A-Za-z_][A-Za-z0-9_]{0,15})*$";
^[A-Za-z]*\.[A-Za-z]*[0-9]$
or
^[A-Za-z]*\.[A-Za-z]*[0-9]+$
if you need more than one numerical character in the number suffix
I am using the following regex to tokenize:
reg = new Regex("([ \\t{}%$^&*():;_–`,\\-\\d!\"?\n])");
The regex is supposed to filter out everything later, however the input string format that i am having problem with is in the following form:
; "string1"; "string2"; "string...n";
the result of the string: ; "social life"; "city life"; "real life" as I know should be like the following:
; White " social White life " ; White " city White life " ; White " real White life "
However there is a problem such that, I get the output in the following form
; empty White empty " social White life " empty ; empty White empty " city White life " empty ; empty White empty " real White life " empty
White: means White-Space,
empty: means empty entry in the split array.
My code for split is as following:
string[] ret = reg.Split(input);
for (int i = 0; i < ret.Length; i++)
{
if (ret[i] == "")
Response.Write("empty<br>");
else
if (ret[i] == " ")
Response.Write("White<br>");
else
Response.Write(ret[i] + "<br>");
}
Why I get these empty entries ? and especially when there is ; followed by space followed by " then the result looks like the following:
; empty White empty "
can I get explanation of why the command adds empty entries ? and how to remove them without any additional O(n) complexity or using another data structure as ret
In my experience, splitting at regex matches is almost always not the best idea. You'll get much better results through plain matching.
And regexes are very well suited for tokenization purposes, as they let you implement a state machine really easily, just take a look at that:
\G(?:
(?<string> "(?>[^"\\]+|\\.)*" )
| (?<separator> ; )
| (?<whitespace> \s+ )
| (?<invalid> . )
)
Demo - use this with RegexOptions.IgnorePatternWhitespace of course.
Here, each match will have the following properties:
It will start at the end of the previous match, so there will be no unmatched text
It will contain exactly one matching group
The name of the group tells you the token type
You can ignore the whitespace group, and you should raise an error if you ever encounter a matching invalid group.
The string group will match an entire quoted string, it can handle escapes such as \" inside the string.
The invalid group should always be last in the pattern. You may add rules for other other types.
Some example code:
var regex = new Regex(#"
\G(?:
(?<string> ""(?>[^""\\]+|\\.)*"" )
| (?<separator> ; )
| (?<whitespace> \s+ )
| (?<invalid> . )
)
", RegexOptions.IgnorePatternWhitespace);
var input = "; \"social life\"; \"city life\"; \"real life\"";
var groupNames = regex.GetGroupNames().Skip(1).ToList();
foreach (Match match in regex.Matches(input))
{
var groupName = groupNames.Single(name => match.Groups[name].Success);
var group = match.Groups[groupName];
Console.WriteLine("{0}: {1}", groupName, group.Value);
}
This produces the following:
separator: ;
whitespace:
string: "social life"
separator: ;
whitespace:
string: "city life"
separator: ;
whitespace:
string: "real life"
See how much easier it is to deal with these results rather than using split?
Why does EscapeDataString behave differently between .NET 4 and 4.5? The outputs are
Uri.EscapeDataString("-_.!~*'()") => "-_.!~*'()"
Uri.EscapeDataString("-_.!~*'()") => "-_.%21~%2A%27%28%29"
The documentation
By default, the EscapeDataString method converts all characters except
for RFC 2396 unreserved characters to their hexadecimal
representation. If International Resource Identifiers (IRIs) or
Internationalized Domain Name (IDN) parsing is enabled, the
EscapeDataString method converts all characters, except for RFC 3986
unreserved characters, to their hexadecimal representation. All
Unicode characters are converted to UTF-8 format before being escaped.
For reference, unreserved characters are defined as follows in RFC 2396:
unreserved = alphanum | mark
mark = "-" | "_" | "." | "!" | "~" | "*" | "'" |
(" | ")"
And in RFC 3986:
ALPHA / DIGIT / "-" / "." / "_" / "~"
The source code
It looks like whether each character of EscapeDataString is escaped is determined roughly like this
is unicode above \x7F
? PERCENT ENCODE
: is a percent symbol
? is an escape char
? LEAVE ALONE
: PERCENT ENCODE
: is a forced character
? PERCENT ENCODE
: is an unreserved character
? PERCENT ENCODE
It's at that final check "is an unreserved character" where the choice between RFC2396 and RFC3986 is made. The source code of the method verbatim is
internal static unsafe bool IsUnreserved(char c)
{
if (Uri.IsAsciiLetterOrDigit(c))
{
return true;
}
if (UriParser.ShouldUseLegacyV2Quirks)
{
return (RFC2396UnreservedMarks.IndexOf(c) >= 0);
}
return (RFC3986UnreservedMarks.IndexOf(c) >= 0);
}
And that code refers to
private static readonly UriQuirksVersion s_QuirksVersion =
(BinaryCompatibility.TargetsAtLeast_Desktop_V4_5
// || BinaryCompatibility.TargetsAtLeast_Silverlight_V6
// || BinaryCompatibility.TargetsAtLeast_Phone_V8_0
) ? UriQuirksVersion.V3 : UriQuirksVersion.V2;
internal static bool ShouldUseLegacyV2Quirks {
get {
return s_QuirksVersion <= UriQuirksVersion.V2;
}
}
Confusion
It seems contradictory that the documentation says the output of EscapeDataString depends on whether IRI/IDN parsing is enabled, whereas the source code says the output is determined by the value of TargetsAtLeast_Desktop_V4_5. Could someone clear this up?
A lot of changes has been done in 4.5 comparing to 4.0 in terms of system functions and how it behaves.
U can have a look at this thread
Why does Uri.EscapeDataString return a different result on my CI server compared to my development machine?
or
U can directly go to the following link
http://msdn.microsoft.com/en-us/library/hh367887(v=vs.110).aspx
All this has been with the input from the users around the world.
Given the following scenario, I am wondering if a better solution could be written with Regular Expressions for which I am not very familiar with yet. I am seeing holes in my basic c# string manipulation even though it somewhat works. Your thoughts and ideas are most appreciated.
Thanks much,
Craig
Given the string "story" below, write a script to do the following:
Variable text is enclosed by { }.
If the variable text is blank, remove any other text enclosed in [ ].
Text to be removed can be nested deep with [ ].
Format:
XYZ Company [- Phone: [({404}) ]{321-4321} [Ext: {6789}]]
Examples:
All variable text filled in.
XYZ Company - Phone: (404) 321-4321 Ext: 6789
No Extension entered, remove "Ext:".
XYZ Company - Phone: (404) 321-4321
No Extension and no area code entered, remove "Ext:" and "( ) ".
XYZ Company - Phone: 321-4321
No extension, no phone number, and no area code, remove "Ext:" and "( ) " and "- Phone: ".
XYZ Company
Here is my solution with plain string manipulation.
private string StoryManipulation(string theStory)
{
// Loop through story while there are still curly brackets
while (theStory.IndexOf("{") > 0)
{
// Extract the first curly text area
string lcCurlyText = StringUtils.ExtractString(theStory, "{", "}");
// Look for surrounding brackets and blank all text between
if (String.IsNullOrWhiteSpace(lcCurlyText))
{
for (int lnCounter = theStory.IndexOf("{"); lnCounter >= 0; lnCounter--)
{
if (theStory.Substring(lnCounter - 1, 1) == "[")
{
string lcSquareText = StringUtils.ExtractString(theStory.Substring(lnCounter - 1), "[", "]");
theStory = StringUtils.ReplaceString(theStory, ("[" + lcSquareText + "]"), "", false);
break;
}
}
}
else
{
// Replace current curly brackets surrounding the text
theStory = StringUtils.ReplaceString(theStory, ("{" + lcCurlyText + "}"), lcCurlyText, false);
}
}
// Replace all brackets with blank (-1 all instances)
theStory = StringUtils.ReplaceStringInstance(theStory, "[", "", -1, false);
theStory = StringUtils.ReplaceStringInstance(theStory, "]", "", -1, false);
return theStory.Trim();
}
Dealing with nested structures is generally beyond the scope of regular expressions. But I think there is a solution, if you run the regex replacement in a loop, starting from the inside out. You will need a callback-function though (a MatchEvaluator):
string ReplaceCallback(Match match)
{
if(String.IsNullOrWhiteSpace(match.Groups[2])
return "";
else
return match.Groups[1]+match.Groups[2]+match.Groups[3];
}
Then you can create the evaluator:
MatchEvaluator evaluator = new MatchEvaluator(ReplaceCallback);
And then you can call this in a loop until the replacement does not change anything any more:
newString = Regex.Replace(
oldString,
#"
\[ # a literal [
( # start a capturing group. this is what we access with "match.Groups[1]"
[^{}[\]]
# a negated character class, that matches anything except {, }, [ and ]
* # arbitrarily many of those
) # end of the capturing group
\{ # a literal {
([^{}[\]]*)
# the same thing as before, we will access this with "match.Groups[2]"
} # a literal }
([^{}[\]]*)
# "match.Groups[3]"
] # a literal ]
",
evaluator,
RegexOptions.IgnorePatternWhitespace
);
Here is the whitespace-free version of the regex:
\[([^{}[\]]*)\{([^{}[\]]*)}([^{}[\]]*)]