How do you get an email address within a string - c#

I am pulling many emails from an Exchange 2003 server and from those emails, trying to determine which are bounce-backs (invalid) so I can remove them from our contacts.
What would be the most efficient method of searching the email bodies to find email addresses on the bounce backs?

You might want to look at this page, which has several variants of regexes for matching email addresses and explains the trade-offs for selecting each. You should definitely read it before picking one here.

Just use a regex.
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b

This is the regex that we use in a lot of our applications for email validation:
public static bool CheckEmail(string email)
{
    // validate email
    Regex regex = new Regex(@"^([a-zA-Z0-9_\-\.\']+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})$", RegexOptions.IgnoreCase);
    Match match = regex.Match(email);
    return match.Success;
}
The actual process for correctly identifying a bounced email, rather than an auto-reply or genuine message is a little more complicated, but this will at least give you the email address.

I pulled a few of the answers here into something like this. It returns every email address in the string (sometimes there are several, from the mail host and the target address). I can then match each of the email addresses against the outbound addresses we sent, to verify. I used the article from @plinth to get a better understanding of the regular expression and modified the code from @Chris Bint.
However, I'm still wondering whether this is the fastest way to process 10,000+ emails. Are there any more efficient methods (while still using C#)? The live code won't recreate the Regex object on every iteration of the loop.
public static MatchCollection CheckEmail(string email)
{
    Regex regex = new Regex(@"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", RegexOptions.IgnoreCase);
    MatchCollection matches = regex.Matches(email);
    return matches;
}
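Since the live code reuses one Regex instance, a minimal sketch of that arrangement might look like the following. The names BounceAddressExtractor and ExtractAddresses are hypothetical, not from the posts above. RegexOptions.Compiled trades start-up cost for faster matching; whether that pays off for 10,000+ messages is something to measure rather than assume.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: the names are illustrative only.
public static class BounceAddressExtractor
{
    // Built once and reused for every message body.
    private static readonly Regex EmailPattern = new Regex(
        @"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the distinct addresses found in one body,
    // compared case-insensitively so duplicates collapse.
    public static HashSet<string> ExtractAddresses(string body)
    {
        var found = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match match in EmailPattern.Matches(body ?? string.Empty))
        {
            found.Add(match.Value);
        }
        return found;
    }
}
```

The returned set can then be intersected with the list of outbound addresses to decide which contacts to remove.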

Related

Why does EmailAddressAttribute.IsValid and MailAddress think that emails which contain "ª" are valid? [duplicate]

I have this C# code:
void Main()
{
    // method 1 - using MailAddress
    var email = "fooªbar@cander.com";
    Console.WriteLine(IsValidEmail(email));
    // method 2 - using EmailAddressAttribute
    var validator = new System.ComponentModel.DataAnnotations.EmailAddressAttribute();
    Console.WriteLine(validator.IsValid(email));
}
bool IsValidEmail(string email)
{
    try
    {
        var addr = new System.Net.Mail.MailAddress(email);
        return addr.Address == email;
    }
    catch
    {
        return false;
    }
}
That code reports the fooªbar@cander.com email address as valid, even though it contains the "ª" symbol. Why? According to "What characters are allowed in an email address?", it shouldn't be valid.
It validates it although it has the "ª" symbol. Why?
Because your Regex allows "one or more \w (word) characters" before the @, and ª is a word character:
RegexStorm, which uses the .NET engine, shows that the \w pattern (a single word character) successfully matches an ª (one match).
According to: What characters are allowed in an email address? it shouldn't be valid
Alas, the regular expression you have used does not accurately implement the specification given in the linked question.
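The \w behaviour can be checked directly. The sketch below is illustrative (the class and method names are made up): by default .NET's \w follows Unicode categories, while RegexOptions.ECMAScript restricts it to ASCII letters, digits and underscore.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class: names are illustrative only.
public static class WordCharacterDemo
{
    // Default .NET behaviour: \w follows Unicode categories, so letters
    // outside ASCII (such as ª) count as word characters.
    public static bool MatchesUnicodeWord(string text)
    {
        return Regex.IsMatch(text, @"^\w+$");
    }

    // With RegexOptions.ECMAScript, \w is restricted to [a-zA-Z0-9_].
    public static bool MatchesAsciiWord(string text)
    {
        return Regex.IsMatch(text, @"^\w+$", RegexOptions.ECMAScript);
    }
}
```

For the input "fooªbar", MatchesUnicodeWord returns true and MatchesAsciiWord returns false.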
When it comes to validating email addresses, I genuinely don't think you should try to control it to a very fine degree. It's a headache to write and maintain a complex Regex that considers every variation, and it doesn't bring much benefit: it just creates a pain point for users whose valid emails don't validate because of a bug in your Regex.
When we test for email validity, we basically only check that it contains an @. What's the worst that can happen if a user types it in wrong?

Regular expression that matches all valid format IPv6 addresses

At first glance, I concede that this question looks like a duplicate of this question and any other related to it:
Regular expression that matches valid IPv6 addresses
That question in fact has an answer that nearly answers my question, but not fully.
The code from that question which I have issues with, yet had the most success with, is as shown below:
private string RemoveIPv6(string sInput)
{
    string pattern = @"(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))";
    //That is one looooong regex! From: https://stackoverflow.com/a/17871737/3472690
    //if (IsCompressedIPv6(sInput))
    //    sInput = UncompressIPv6(sInput);
    string output = Regex.Replace(sInput, pattern, "");
    if (output.Contains("Addresses"))
        output = output.Substring(0, "Addresses: ".Length);
    return output;
}
The issue I have with the regex pattern from David M. Syzdek's answer is that it doesn't match, and therefore doesn't remove, the full form of the IPv6 addresses I'm throwing at it.
I'm mainly using the regex pattern to replace IPv6 addresses in strings with an empty string.
For instance,
Addresses: 2404:6800:4003:c02::8a
As well as...
Addresses: 2404:6800:4003:804::200e
And finally...
Addresses: 2001:4998:c:a06::2:4008
None of these is fully matched by the regex.
After the replacement, the remaining parts of the string are as shown below:
Addresses: 8a
Addresses: 200e
Addresses: 2:4008
As can be seen, it has left remnants of the IPv6 addresses, which is hard to detect and remove, due to the varying formats that the remnants take on. Below is the regex pattern by itself for better analysis:
(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))
Therefore, my question is, how can this regex pattern be corrected so it can match, and therefore allow the complete removal of any IPv6 addresses, from a string that doesn't solely contain the IPv6 address(es) itself?
Alternatively, how can the code snippet I provided above be corrected to provide the required outcome?
For those who may be wondering, I am getting the string from the StandardOutput of nslookup commands, and the IPv6 addresses will always differ. For the examples above, I got those IPv6 addresses from "google.com" and "yahoo.com".
I am not using the built-in function to resolve DNS entries for a good reason, which I don't think will matter for the moment, therefore I am using nslookup.
The code that calls that function, in case it is needed, is shown below (it is itself part of another method):
string output = "";
string garbagecan = "";
string tempRead = "";
string lastRead = "";
using (StreamReader reader = nslookup.StandardOutput)
{
    while (reader.Peek() != -1)
    {
        if (LinesRead > 3)
        {
            tempRead = reader.ReadLine();
            tempRead = RemoveIPv6(tempRead);
            if (tempRead.Contains("Addresses"))
                output += tempRead;
            else if (lastRead.Contains("Addresses"))
                output += tempRead.Trim() + Environment.NewLine;
            else
                output += tempRead + Environment.NewLine;
            lastRead = tempRead;
        }
        else
            garbagecan = reader.ReadLine();
        LinesRead++;
    }
}
return output;
The corrected regex should remove only IPv6 addresses and leave IPv4 addresses untouched. The string passed to the regex will not contain the IPv6 address(es) alone; it will almost always contain other details, so the index at which the addresses appear is unpredictable. It should also be noted that, for some reason, the regex skips every IPv6 address after the first one.
Apologies if there are any missing details, I will try my best to include them in when alerted. I would also prefer working code samples, if possible, as I have almost zero knowledge regarding regex.
(?:^|(?<=\s))(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))(?=\s|$)
Using lookarounds, you can enforce a complete match rather than a partial match. See demo.
https://regex101.com/r/cT0hV4/5
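Why the lookarounds help can be seen with a reduced pattern. The sketch below is an illustration only, not a complete IPv6 matcher: it keeps just two alternatives, in the same order as in the long pattern, and the class and method names are hypothetical. It uses (?<!\S) and (?!\S), which are equivalent spellings of the boundaries in the pattern above.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class; Core is a deliberately reduced pattern.
public static class AlternationDemo
{
    // Two alternatives in the same order as the long pattern:
    // "groups followed by ::" is tried before "groups, ::, more groups".
    private const string Core =
        @"(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}";

    // No boundaries: the first alternative that matches wins,
    // even if it covers only part of the address.
    public static string RemoveLoose(string input)
    {
        return Regex.Replace(input, Core, "");
    }

    // Lookarounds require whitespace (or the string edge) on both sides,
    // so the engine must try later alternatives until the whole token fits.
    public static string RemoveStrict(string input)
    {
        return Regex.Replace(input, @"(?<!\S)(?:" + Core + @")(?!\S)", "");
    }
}
```

For "Addresses: 2404:6800:4003:c02::8a", RemoveLoose leaves "Addresses: 8a" (the remnant described in the question), while RemoveStrict leaves "Addresses: ".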
(?i)(?<ipv6>(?:[\da-f]{0,4}:){1,7}(?:(?<ipv4>(?:(?:25[0-5]|2[0-4]\d|1?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|1?\d\d?))|[\da-f]{0,4}))
Demo: Regex101
Github Repository

Regex to replace email addresses

I wish to replace email addresses in a string with something else, but my code does not work.
string body = "this is a test abc@emailadx.com";
string pattern = @"\b[!#$%&'*+./0-9=?_`a-z{|}~^-]+@[.0-9a-z-]+\.[a-z]{2,6}\b";
Regex.Replace(body, pattern, "Hidden Email Address");
return body;
Any hints would be helpful please.
You want to do this:
return Regex.Replace(body, pattern, "Hidden Email Address");
If you look at the documentation for Regex.Replace, you'll see that it returns the newly replaced string. It does not affect the string that was passed in.
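A minimal sketch of the corrected call, wrapped in a hypothetical helper (EmailMasker and HideEmails are made-up names, and the IgnoreCase option is an addition so that upper-case addresses are also covered):

```csharp
using System.Text.RegularExpressions;

// Hypothetical wrapper around the question's pattern.
public static class EmailMasker
{
    public static string HideEmails(string body)
    {
        string pattern = @"\b[!#$%&'*+./0-9=?_`a-z{|}~^-]+@[.0-9a-z-]+\.[a-z]{2,6}\b";
        // Strings are immutable: Replace returns a new string and
        // leaves 'body' untouched, so the result must be returned or assigned.
        return Regex.Replace(body, pattern, "Hidden Email Address", RegexOptions.IgnoreCase);
    }
}
```

Calling HideEmails("this is a test abc@emailadx.com") returns "this is a test Hidden Email Address".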
NOTE: this is assuming you're using C#. But I'm guessing you are, from the syntax.
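FURTHERMORE: If your regex still isn't working well, try this one from the Regular Expressions Cookbook (by Goyvaerts & Levithan). It is meant to be used case-insensitively, and because it is anchored with ^ and $ it only matches when the whole string is an email address, so drop the anchors to find addresses inside longer text:
@"^[\w!#$%&'*+/=?`{|}~^.-]+@[A-Z0-9.-]+$"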

How can I use RegEx to make sure a valid email is written in my TextBox?

I'm a complete newbie to RegEx and I'm sure it'll be brilliant to use once I know how to use it. :P
I have a couple of TextBoxes and I was wondering if anyone could help me accomplish what I need.
In the email TextBox, I'd like to make sure the user writes in a valid email: xxx@yyy.zzz
Is there a way for RegEx to help me out?
I'd also really like a way to format the name the user writes down. So if a user writes in "SerGIo TAPIA gutTIerrez", I want to format that string (behind the scenes, before saving it) to "Sergio Tapia Gutierrez". Can RegEx do this?
Thanks so much SO.
(inb4 Rex :P )
A complete and accurate regex for email validation is surprisingly difficult, I trust you can use google to find some examples.
The general rule for email validation is to actually try to send an email.
Well, this is an easy one! :)
no, there exists no regex that can validate* e-mail addresses;
no, regex cannot transform "SerGIo TAPIA gutTIerrez" into "Sergio Tapia Gutierrez". Sure, some languages like Perl (and perhaps others) can mix some fancy stuff into regexes to do this, but it is not the regex that actually performs the transformation. Regex only matches text, plain and simple.
* by 'valid' I mean see if the address actually exists.
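For the name formatting, .NET has a non-regex route: TextInfo.ToTitleCase. The sketch below is one way to use it (NameFormatter and ToDisplayName are hypothetical names); it only fixes casing and will not correct spelling.

```csharp
using System.Globalization;

// Hypothetical helper: names are illustrative only.
public static class NameFormatter
{
    // Lower-case first: ToTitleCase leaves all-uppercase words
    // (e.g. "TAPIA") unchanged because it treats them as acronyms.
    public static string ToDisplayName(string rawName)
    {
        TextInfo textInfo = CultureInfo.InvariantCulture.TextInfo;
        return textInfo.ToTitleCase(rawName.ToLowerInvariant());
    }
}
```

ToDisplayName("SerGIo TAPIA gutTIerrez") returns "Sergio Tapia Guttierrez"; the doubled "t" is kept because it is present in the input.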
This is one way, but there are many others.
public static bool isEmail(string emailAddress)
{
    if (string.IsNullOrEmpty(emailAddress))
        return false;
    Regex EmailAddress = new Regex(@"^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*@([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9})$");
    return EmailAddress.IsMatch(emailAddress);
}
http://www.cambiaresearch.com/c4/bf974b23-484b-41c3-b331-0bd8121d5177/Parsing-Email-Addresses-with-Regular-Expressions.aspx
public bool TestEmailRegex(string emailAddress)
{
    // string patternLenient = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";
    // Regex reLenient = new Regex(patternLenient);
    string patternStrict = @"^(([^<>()[\]\\.,;:\s@""]+"
        + @"(\.[^<>()[\]\\.,;:\s@""]+)*)|("".+""))@"
        + @"((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"
        + @"\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+"
        + @"[a-zA-Z]{2,}))$";
    Regex reStrict = new Regex(patternStrict);
    // bool isLenientMatch = reLenient.IsMatch(emailAddress);
    // return isLenientMatch;
    bool isStrictMatch = reStrict.IsMatch(emailAddress);
    return isStrictMatch;
}

Parse email content with Regular Expressions

Every day I receive thousands of emails, and I want to parse the body of these emails to load them into a database.
My problem is that I currently parse the email body manually, and I would like to change the logic to a regular expression in C#.
Here is the body of the emails:
Gentilissima Agenzia Nexity Residenziale
il nostro utente:
Sig./Sig.ra :Pablo Azorin
Email: pabloazorin@gmail.com
Tel.: 02322-498900
sta cercando un immobile con le seguenti caratteristiche:
Categoria: Residenziale
Tipologia: Villa
Tipo di contratto: Vendita
Comune: Assago Prov. Milano
Zona: non specificata
Fascia di prezzo: non specificata
I need to extract the text in bold and I thought a RegEx is what I need for this...
I look forward to your suggestions on how to make this work.
Thanks!
--Pablo
Assuming that the parts in your email that are not bold always occur like that in all your emails, you can easily grab all the parts from your email with the regex:
Sig\./Sig\.ra :(.*)
Email: (.*)
Tel\.: (.*)
sta cercando un immobile con le seguenti caratteristiche:
Categoria: (.*)
Tipologia: (.*)
Tipo di contratto: (.*)
Comune: (.*)
Zona: (.*)
Fascia di prezzo: (.*)
In C#
Regex regexObj = new Regex(@"Sig\./Sig\.ra :(.*)
Email: (.*)
Tel\.: (.*)
sta cercando un immobile con le seguenti caratteristiche:
Categoria: (.*)
Tipologia: (.*)
Tipo di contratto: (.*)
Comune: (.*)
Zona: (.*)
Fascia di prezzo: (.*)");
Match matchObj = regexObj.Match(subjectString);
string Sig = matchObj.Groups[1].Value;
string Email = matchObj.Groups[2].Value;
// and so on to get all the other parts
Read Mastering Regular Expressions. It will teach you everything you need to know to complete this and other similar regex problems, and will give you enough understanding and insight to get you started writing much more complicated regular expressions.
For email downloading I used Mailbee .Net objects. This library is quite easy to use and is well documented. But if you want to avoid programming you can also use an email parser like EmailParser2Database.
If the emails are in the same format always, you can do this a number of different ways. A simple way of doing it would be to split on the newline and take a substring on each line, starting after the label.
With regexes, you'd probably create a regex that creates a number of named captures. You can then index into the Groups property of the match on the name of each named group in order to get the value out of it. This is a little more complex, of course.
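A rough sketch of the named-capture idea, limited to two of the fields and assuming the "Email:" and "Tel.:" lines always appear in that order (LeadParser, TryParseContact and the group names are illustrative choices, not part of any answer above):

```csharp
using System.Text.RegularExpressions;

// Hypothetical parser covering only the Email and Tel. fields.
public static class LeadParser
{
    // (?<name>...) defines a named group; \s+ between the two fields
    // absorbs the line break, whether it is \n or \r\n.
    private static readonly Regex ContactPattern = new Regex(
        @"Email:\s*(?<email>\S+)\s+Tel\.:\s*(?<phone>\S+)");

    // Returns true and fills the out parameters when both fields are found;
    // otherwise both are set to the empty string.
    public static bool TryParseContact(string body, out string email, out string phone)
    {
        Match match = ContactPattern.Match(body);
        email = match.Success ? match.Groups["email"].Value : string.Empty;
        phone = match.Success ? match.Groups["phone"].Value : string.Empty;
        return match.Success;
    }
}
```

Indexing Groups by name keeps the calling code readable and avoids renumbering when a field is added to the pattern.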
I think it would be much better to split this string into an array of lines.
You can initialize a dictionary with all the titles as keys.
Then search each line for a title from the dictionary ("Email:" for example) and put the result back into the dictionary as the value.
At the end you will have a dictionary with all the titles and values.
I don't think you need a regex for that.
In fact, that way the order of the titles won't matter.
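A sketch of this line-splitting approach is below. It is a variation rather than an exact rendering: instead of pre-filling the dictionary with known titles, it treats the text before the first colon on each line as the title. The names LabelledLineParser and Parse are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: names are illustrative only.
public static class LabelledLineParser
{
    // Splits the body into lines and maps each "Label: value" line to an entry.
    public static Dictionary<string, string> Parse(string body)
    {
        var fields = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        string[] lines = body.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string line in lines)
        {
            int colon = line.IndexOf(':');
            if (colon <= 0) continue;                 // no label on this line
            string label = line.Substring(0, colon).Trim();
            string value = line.Substring(colon + 1).Trim();
            if (value.Length > 0) fields[label] = value;  // skip lines such as "il nostro utente:"
        }
        return fields;
    }
}
```

With the sample body, fields["Email"] holds the address and fields["Tel."] the phone number, regardless of the order in which the lines arrive.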
We found that for spam filtering and other high-volume applications, regular expressions are a bit slow for parsing MIME headers, which is what you want to do. The code is somewhat specialized, but I wrote a C state machine for doing the parsing which is as fast as you'll get without going to something like re2c. The code is not for the faint of heart, but it is blindingly fast.
For emails I think you'll find an explicit state machine is easier to work with than regular expressions. It's also the last refuge of the goto statement!
You really don't want to do this manually, or with regular expressions. There are many different ways to encode data in an email, and many emails that don't strictly conform to the spec that can still be parsed. I have had success with AnPOP in a .NET environment.
