I have an application where I need to parse a string to find all the e-mail addresses in that string. I am not a regular expression guru by any means, and I am not sure what the difference is between some expressions. I have found two expressions that, apparently, will match all of the e-mail addresses in a string, but I cannot get either to work in my C# application. Here are the expressions:
/\b([A-Z0-9._%-]+)@([A-Z0-9.-]+\.[A-Z]{2,4})\b/i
^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$
Here is an example string:
Adam
<mailto:aedwards@domain.com?subject=Your%20prospect%20Search%20-%20ID:
%2011111> Edwards - Prospect ID: 11111, Ph: (555) 555-5555
Al
<mailto:Al@anotherdomain.com?subject=Your%20prospect%20Search%20-%20
ID:%20222222> Grayson - Prospect ID: 222222, Ph:
Angie
Here is the code in C#:
var mailReg = new Regex(EmailMatch, RegexOptions.IgnoreCase | RegexOptions.Multiline);
var matches = mailReg.Matches(theString);
The first regex is a Perl object (delimited by slashes). Drop the slashes and the mode modifier (i), and it should work:
EmailMatch = @"\b([A-Z0-9._%-]+)@([A-Z0-9.-]+\.[A-Z]{2,6})\b";
Also, .museum is a valid domain, so {2,6} is a bit better.
The second regex only matches entire strings that consist of nothing but an email address.
I would leave the \b intact.
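To make the difference between the two expressions concrete, here is a minimal sketch. It assumes a shortened copy of the sample text with `@` in the addresses; the variable names (`embedded`, `wholeLine`, `found`) are only illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample text modelled on the question; the addresses are placeholders.
string text =
    "Adam\n<mailto:aedwards@domain.com?subject=Your%20prospect%20Search%20-%20ID:\n" +
    "%2011111> Edwards - Prospect ID: 11111, Ph: (555) 555-5555\n" +
    "Al\n<mailto:Al@anotherdomain.com?subject=Your%20prospect%20Search%20-%20\n" +
    "ID:%20222222> Grayson - Prospect ID: 222222, Ph:";

// Unanchored pattern with word boundaries: finds addresses embedded in text.
var embedded = new Regex(@"\b([A-Z0-9._%-]+)@([A-Z0-9.-]+\.[A-Z]{2,6})\b",
                         RegexOptions.IgnoreCase);

// Anchored pattern: only matches when a whole line is nothing but an address.
var wholeLine = new Regex(@"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$",
                          RegexOptions.IgnoreCase | RegexOptions.Multiline);

string[] found = embedded.Matches(text).Cast<Match>().Select(m => m.Value).ToArray();
int wholeLineCount = wholeLine.Matches(text).Count;

foreach (string address in found)
    Console.WriteLine(address);
Console.WriteLine("Anchored matches: " + wholeLineCount);
```

On this input the unanchored pattern should report both addresses, while the anchored one finds nothing because no line consists solely of an address.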
The first of your two examples should work if you remove the \b from both ends. The \b means that it expects a word boundary (a space, end of line, &c.) before and after the email address and this is not present in your case.
(Please do not use your new found powers for evil.)
This expression worked: ([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})
Thanks for looking!
this is my pattern:
-{8}\s+((?:[0-9]{2}\/[0-9]{2}\/[0-9]{2})\s+(?:[0-9]{2}:[0-9]{2}:[0-9]{2}))\s+(?:LINE)\s+=\s+([0-9]{0,9})\s+(?:STN)\s+=\s+([0-9]{0,9})[ ]*(?:\n[ \S]*)*INCOMING CALL(?:[\S\s][^-])*
and this is my string:
-------- 02/16/18 13:50:39 LINE = 0248 STN = 629
CALLING NUMBER 252
NAME Mar Ant
UNKNOWN
DNIS NUMBER 255
BC = SPEECH
VOIP CALL
00:00:00 INCOMING CALL RINGING 0:09
LINE = 0004
00:00:25 CALL RELEASED
It matches in several online regex testers, but not in C# testers like http://regexstorm.net/tester, since
.NET does not use POSIX syntax.
I noticed that if I remove the last part of the pattern
INCOMING CALL(?:[\S\s][^-])*
it does match, but the result is still incorrect, or at least not what I expect.
What should I change to make this pattern match the string?
The problem is not related to the pattern type, the pattern is not actually POSIX compliant as regex escapes cannot be used inside bracket expressions (see [\S\s] in your pattern, which can be written as . together with RegexOptions.Singleline option, or as (?s:.)).
All you need to do here is to replace \n with \r?\n or (?:\r\n?|\n) to match Windows line break style.
Use
-{8}\s+((?:[0-9]{2}/[0-9]{2}/[0-9]{2})\s+(?:[0-9]{2}:[0-9]{2}:[0-9]{2}))\s+(?:LINE)\s+=\s+([0-9]{0,9})\s+(?:STN)\s+=\s+([0-9]{0,9})[ ]*(?:\r?\n[ \S]*)*INCOMING CALL(?:[\S\s][^-])*
See this regex demo.
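A small sketch of why the line-break change matters, assuming the log arrives with Windows (CRLF) line endings. The log is shortened here, and the pattern is split into `head` and `tail` strings purely for readability; those names are not from the original post.

```csharp
using System;
using System.Text.RegularExpressions;

// Shortened version of the call log, deliberately using CRLF line breaks.
string log =
    "-------- 02/16/18 13:50:39 LINE = 0248 STN = 629\r\n" +
    "CALLING NUMBER 252\r\n" +
    "00:00:00 INCOMING CALL RINGING 0:09\r\n" +
    "00:00:25 CALL RELEASED";

string head = @"-{8}\s+((?:[0-9]{2}/[0-9]{2}/[0-9]{2})\s+(?:[0-9]{2}:[0-9]{2}:[0-9]{2}))" +
              @"\s+(?:LINE)\s+=\s+([0-9]{0,9})\s+(?:STN)\s+=\s+([0-9]{0,9})[ ]*";
string tail = @"INCOMING CALL(?:[\S\s][^-])*";

// Only the line-break part differs between the two patterns.
var lfOnly = new Regex(head + @"(?:\n[ \S]*)*" + tail);
var crlfAware = new Regex(head + @"(?:\r?\n[ \S]*)*" + tail);

bool lfOnlyMatches = lfOnly.IsMatch(log);
Match m = crlfAware.Match(log);

Console.WriteLine("LF-only pattern matches: " + lfOnlyMatches);
Console.WriteLine("CRLF-aware pattern matches: " + m.Success);
Console.WriteLine("Timestamp: " + m.Groups[1].Value);
Console.WriteLine("LINE: " + m.Groups[2].Value + ", STN: " + m.Groups[3].Value);
```

The LF-only variant stalls at the first `\r`, because `[ \S]` cannot consume a carriage return and `\n` does not match it either.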
I am extracting all numbers used in an XML file. The numbers are written in the following two patterns:
<Environment Id="11" StringId="8407" DescriptionId="5014" RemoteControlAppStringId="8119; 8118" EnvironmentType="BlueToothBridge" AlternateId="1" XML_NAME_ID="BTBSpeechPlusM" FactoryGainType="LIN18">
<Offsets />
</Environment>
I am using the regexes "\"\d*;\"" and "\"\d*\"" to extract all numbers.
When I ran the regex "\"\d*\"" on the above using
Regex.Match(myString, "\"\\d*\"")
it returned 8407, 11 and 5014, but not 8119 and 8118.
Your regex will fail to match 8119; 8118 because your pattern is finding quoted numbers.
try with
\b\d+\b
\b specifies that \d+ matches only between word boundaries, so the 18 in LIN18 will not match.
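As a quick check of the word-boundary approach, here is a minimal sketch run against the element from the question; the variable names are illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string xml =
    "<Environment Id=\"11\" StringId=\"8407\" DescriptionId=\"5014\" " +
    "RemoteControlAppStringId=\"8119; 8118\" EnvironmentType=\"BlueToothBridge\" " +
    "AlternateId=\"1\" XML_NAME_ID=\"BTBSpeechPlusM\" FactoryGainType=\"LIN18\">";

// \b\d+\b keeps only digit runs that are not glued to a letter or underscore.
string[] numbers = Regex.Matches(xml, @"\b\d+\b")
                        .Cast<Match>()
                        .Select(m => m.Value)
                        .ToArray();

Console.WriteLine(string.Join(", ", numbers));
```

Note that this also picks up the 1 in AlternateId, which may or may not be wanted.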
Depending on whether you can assume that the provided input is valid XML, you could use the following regular expression:
Regex.Matches(myString, "(?<=\")\\d+(?=\")|(?<=\")\\d+(?=; ?\\d+\")|(?<=\"\\d+; ?)\\d+(?=\")")
The main idea behind this is that it takes the three possible situations into account:
"[number]"
"[number]; [other_number]" (With or without a space before [other_number])
"[other_number]; [number]" (With or without a space before [number])
There are two new concepts I included in the regular expression:
Positive lookahead: (?=[regex])
Positive lookbehind: (?<=[regex])
These concepts allow the regular expression to check if something specific is before or after it, without putting it in the match.
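The lookaround idea can be tried out with a short sketch like the following. The input is a trimmed-down attribute list, and the pattern is the three-way alternation from this answer written as a verbatim string.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "Id=\"11\" RemoteControlAppStringId=\"8119; 8118\" FactoryGainType=\"LIN18\"";

// Three alternatives: "n", the first number in "n; m", the second number in "n; m".
// The quotes and the "; " separator are only asserted, never consumed.
string pattern = @"(?<="")\d+(?="")|(?<="")\d+(?=; ?\d+"")|(?<=""\d+; ?)\d+(?="")";

string[] values = Regex.Matches(input, pattern)
                       .Cast<Match>()
                       .Select(m => m.Value)
                       .ToArray();

Console.WriteLine(string.Join(", ", values));
```

The third alternative relies on a variable-length lookbehind, which .NET supports but many other regex engines do not.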
This regular expression could easily be optimised, but this is meant as an example of a basic approach.
One good tip for developing a regular expression like this is to use a tool (online or offline) to test your regular expression. The tool I used was .NET Regex Tester.
As @poke stated in the comment, it's because your regex doesn't match the string. Change your regex to capture specific matches and account for the possibility of the ';'.
Something like below should probably do the trick.
EDIT: (\b\d+\b)|(\b\d+[;*]\d+\b)
Can someone write a regex in C# for me to validate the following emails?
aa@bb.com;cc@dfs.com;asdf@fasdf.com;sdfsdf@fsaf.com;
The email addresses are separated with ";", and I have written the following regex:
^(([0-9a-zA-Z]([-\.\w]*[0-9a-zA-Z])*@([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9});)*$
When I use this regex to match a string, it may hang in what looks like an endless loop. Why?
Thanks in advance!
I think you should split the email addresses and match each one against a regular expression for matching email.
Split the email addresses using ';'
Match each email address against a validation expression.
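The two steps above can be sketched as follows. The single-address pattern is a deliberately simple, hypothetical one (not a full RFC check), and `AllValid` is an illustrative helper name.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical, deliberately simple single-address pattern (an assumption).
var single = new Regex(@"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,9}$",
                       RegexOptions.IgnoreCase);

bool AllValid(string list)
{
    // Split on ';' and drop the empty entry left by a trailing separator.
    string[] parts = list.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries);
    return parts.Length > 0 && parts.All(p => single.IsMatch(p.Trim()));
}

Console.WriteLine(AllValid("aa@bb.com;cc@dfs.com;asdf@fasdf.com;sdfsdf@fsaf.com;"));
Console.WriteLine(AllValid("aa@bb.com;not-an-address;"));
```

Because each address is checked on its own, no outer repetition wraps the address pattern, which removes the main source of runaway backtracking.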
Your regular expression suffers from catastrophic backtracking. I added atomic groups to your regular expression to create this:
^(([0-9a-zA-Z](?>(?>[-\.\w]*[0-9a-zA-Z])*@)(?>(?>[0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9}));)*$
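For readers new to atomic groups, here is a minimal sketch of what `(?>...)` changes. The toy patterns are illustrative and unrelated to the email expression.

```csharp
using System;
using System.Text.RegularExpressions;

// Ordinary group: after "bc" is tried and the final "c" fails, the engine
// backtracks into the group and retries with the shorter alternative "b".
bool plain = Regex.IsMatch("abc", @"^a(?:bc|b)c$");

// Atomic group: once the group has matched "bc", that choice is final,
// so the engine never goes back to try "b".
bool atomic = Regex.IsMatch("abc", @"^a(?>bc|b)c$");

// The atomic version still matches when its first choice leads to success.
bool atomicOk = Regex.IsMatch("abcc", @"^a(?>bc|b)c$");

Console.WriteLine(plain + " " + atomic + " " + atomicOk);
```

Discarding those saved backtracking positions is what keeps a failing match from exploring an exponential number of alternatives.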
Your specific example works:
string s = "aa@bb.com;cc@dfs.com;asdf@fasdf.com;sdfsdf@fsaf.com;";
Regex re = new Regex(@"^(([0-9a-zA-Z]([-\.\w]*[0-9a-zA-Z])*@([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9});)*$");
Console.WriteLine(re.IsMatch(s));
prints "True"
I could not reproduce the infinite loop on my computer with your example (I am using .NET 3.5). Here is the code I used:
Regex rex = new Regex(@"^(([0-9a-zA-Z]([-\.\w]*[0-9a-zA-Z])*"+
@"@([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9});)*$");
String emails = "aa@bb.com;cc@dfs.com;asdf@fasdf.com;sdfsdf@fsaf.com;";
Boolean ismatch = rex.IsMatch(emails);
Match match = rex.Match(emails);
ismatch result is true and match contains data.
This other question, How to avoid infinite loops in the .NET RegEx class?, might be of interest to you.
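One safeguard worth knowing about is the match timeout, sketched below. It assumes .NET Framework 4.5 or later (it is not available on 3.5), where the Regex constructor accepts a TimeSpan. The pattern and input are a textbook backtracking case chosen for illustration, not the expression from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Nested quantifiers like (a+)+ are a classic source of heavy backtracking.
var risky = new Regex(@"^(a+)+$", RegexOptions.None, TimeSpan.FromMilliseconds(200));

// The trailing '!' guarantees the overall match must fail.
string input = new string('a', 40) + "!";

string outcome;
try
{
    outcome = risky.IsMatch(input) ? "match" : "no match";
}
catch (RegexMatchTimeoutException)
{
    // Thrown when matching exceeds the timeout passed to the constructor.
    outcome = "timed out";
}

Console.WriteLine(outcome);
```

Depending on the runtime version, this ends either with a timeout or with a quick "no match"; in both cases the call returns instead of hanging.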
You can try using this
\s*[;,]\s*(?!(?<=(?:^|[;,])\s*"(?:[^"]|""|\\")*[;,]\s*)(?:[^"]|""|\\")*"\s*(?:[;,]|$))
This regex splits comma or semicolon separated lists of optionally quoted strings. It also handles quoted delimiters and escaped quotes. Whitespace inside quotes is preserved, outside the quotes is removed.
Hope this will meet your requirement
Thanks
~ Aamod
I am updating some code that I didn't write and part of it is a regex as follows:
\[url(?:\s*)\]www\.(.*?)\[/url(?:\s*)\]
I understand that .*? does a non-greedy match of everything in the second register.
What does ?:\s* in the first and third registers do?
Update: As requested, language is C# on .NET 3.5
The syntax (?:) is a way of putting parentheses around a subexpression without separately extracting that part of the string.
The author wanted to capture the (.*?) part in the middle and didn't want the spaces at the beginning or the end to get in the way. Now you can use \1 or $1 (or whatever the appropriate method is in your particular language) to refer to the domain name instead of the first chunk of spaces at the beginning of the string.
?: makes the parentheses non-capturing. In that regex, you'll only pull out one piece of information, $1, which contains the middle (.*?) expression.
What does ?:\s* in the first and third registers do?
It's matching zero or more whitespace characters, without capturing them.
The regex author intends to allow trailing whitespace in the square-bracket-tags, matching all DNS labels following the "www." like so:
[url]www.foo.com[/url] # foo.com
[url ]www.foo.com[/url ] # same
[url ]www.foo.com[/url] # same
[url]www.foo.com[/url ] # same
Note that the regex also matches:
[url]www.[/url] # empty string!
and fails to match
[url]stackoverflow.com[/url] # no match, bummer
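These cases can be checked with a short sketch like this one; the variable names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

var re = new Regex(@"\[url(?:\s*)\]www\.(.*?)\[/url(?:\s*)\]");

Match spaced = re.Match("[url ]www.foo.com[/url ]");
Match empty = re.Match("[url]www.[/url]");
Match noWww = re.Match("[url]stackoverflow.com[/url]");

// Groups[0] is the whole match; the (?:...) groups add no further entries,
// so Groups[1] is the text captured by (.*?).
Console.WriteLine(spaced.Groups.Count);
Console.WriteLine(spaced.Groups[1].Value);
Console.WriteLine(empty.Success + " / '" + empty.Groups[1].Value + "'");
Console.WriteLine(noWww.Success);
```

If the two `(?:\s*)` groups were written as plain `(\s*)`, the domain would move to group 2 and the whitespace would occupy groups 1 and 3.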
You may find this Regular Expressions Cheat Sheet very helpful (hopefully). I spent ages trying to learn Regex with no luck. And once I read this cheat-sheet - I immediately understood what I previously failed to learn.
http://krijnhoetmer.nl/stuff/regex/cheat-sheet/
I have the following text that I am trying to parse:
"user1@emailaddy1.com" <user1@emailaddy1.com>, "Jane Doe" <jane.doe@ addyB.org>,
"joe@company.net" <joe@company.net>
I am using the following code to try and split up the string:
Dim groups As GroupCollection
Dim matches As MatchCollection
Dim regexp1 As New Regex("""(.*)"" <(.*)>")
matches = regexp1.Matches(toNode.InnerText)
For Each match As Match In matches
groups = match.Groups
message.CompanyName = groups(1).Value
message.CompanyEmail = groups(2).Value
Next
But this regular expression is greedy and is grabbing the entire string up to the last quote after "joe@company.net". I'm having a hard time putting together an expression that will group this string into the two groups I'm looking for: Name (in the quotes) and E-Mail (in the angle brackets). Does anybody have any advice or suggestions for altering the regexp to get what I need?
Rather than rolling your own regular expression, I would do this:
string[] addresses = toNode.InnerText.Split(',');
foreach (string textAddress in addresses)
{
    // MailAddress (System.Net.Mail) parses the "Display Name" <user@host> form.
    MailAddress address = new MailAddress(textAddress.Trim());
    message.CompanyName = address.DisplayName;
    message.CompanyEmail = address.Address;
}
While your regular expression may work for the few test cases that you have shown, using the MailAddress class will probably be much more reliable in the long run.
How about """([^""]*)"" <([^>]*)>" for the regex? That is, make explicit that the matched part won't include a quote or a closing angle bracket. You may also want to use a more restrictive character range instead.
Not sure what regexp engine ASP.net is running but try the non-greedy variant by adding a ? in the regex.
Example regex
""(.*?)"" <(.*?)>
You need to specify that you want the minimal matched expression.
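To see the greedy and non-greedy variants side by side, here is a minimal sketch. It is written in C# rather than the VB.NET of the question, and it uses a single-line sample with two entries.

```csharp
using System;
using System.Text.RegularExpressions;

string input =
    "\"user1@emailaddy1.com\" <user1@emailaddy1.com>, \"Jane Doe\" <jane.doe@addyB.org>";

// Greedy: .* runs to the end of the line and backtracks to the LAST quote.
MatchCollection greedy = Regex.Matches(input, "\"(.*)\" <(.*)>");

// Lazy: .*? stops at the FIRST position where the rest of the pattern fits.
MatchCollection lazy = Regex.Matches(input, "\"(.*?)\" <(.*?)>");

Console.WriteLine("Greedy matches: " + greedy.Count);
Console.WriteLine("Greedy name: " + greedy[0].Groups[1].Value);

foreach (Match entry in lazy)
    Console.WriteLine(entry.Groups[1].Value + " -> " + entry.Groups[2].Value);
```

The greedy version yields one match whose name group swallows the first entry and part of the second, while the lazy version yields one match per entry.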
You can also replace (.*) pattern by more precise ones:
For example you could exclude the comma and the space...
Usually it's better to avoid .* in a regular expression, because it can reduce performance.
For example, for the email you can use a pattern like [\w-]+@([\w-]+\.)+[\w-]+ or a more complex one.
You can find some good patterns on : http://regexlib.com/