Parse email with regex in C#

I need to parse email files with regex in C#. That is, I need to take a file that contains several emails and split each one into its constituents, e.g. From, To, Bcc, etc.
The regex I am using for email addresses is
"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*"
The problem I am having is that To, Cc and Bcc sometimes contain more than one address and can span more than one line:
To: Me meagain <me@me.com>,
Me1 meagain <me1@me.com>,Me3 meagain <me1@me.com>
Also, which regex will match the message?

Parsing an email message with regular expressions is a terrible idea. You might be able to parse the constituent parts with regular expressions, but finding the constituent parts with regular expressions is going to give you fits.
The normal case, of course, is pretty easy. But then you run across something like a message that has an embedded message within it. That is, the content includes a full email message with From:, To:, Bcc:, etc. And your naive regex parser thinks, "Oh, boy! I found a new message!"
You're better off reading and understanding the Internet Message Format and writing a real parser, or using something already written like OpenPop.NET.
Also, check out the suggestions in Reading Email using Pop3 in C# and https://stackoverflow.com/questions/26606/free-pop3-net-library, among others.
A good example of the difficulty you'll face is that your regular expression for matching email addresses is inadequate. According to section 3.2.4 of RFC2822 (linked above), the following characters are allowed in the "local-part" of the email address:
atext = ALPHA / DIGIT / ; Any character except controls,
        "!" / "#" /     ;  SP, and specials.
        "$" / "%" /     ;  Used for atoms
        "&" / "'" /
        "*" / "+" /
        "-" / "/" /
        "=" / "?" /
        "^" / "_" /
        "`" / "{" /
        "|" / "}" /
        "~"
The domain name can contain any ASCII except whitespace and the "\" character, and has to meet some format requirements. Then there's the "obsolete" stuff that, although deprecated, is still in use. And that's just in parsing email addresses. If you look at the stuff that can be included in the other fields, I think you'll agree that trying to parse it with regular expressions is going to be frustrating at best.
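For the specific multi-line To/Cc/Bcc problem in the question, here is a minimal sketch that avoids a hand-written address regex. It assumes the continuation lines start with a space or tab, as the Internet Message Format requires for folded headers, and it leans on System.Net.Mail.MailAddressCollection to split a comma-separated list. The class and method names (HeaderSketch, Unfold, ParseAddressList) are made up for illustration, and this does not solve the harder problem of finding where each message starts and ends.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Mail;
using System.Text.RegularExpressions;

// Hypothetical helper names; only the System.* calls are real APIs.
public static class HeaderSketch
{
    // A line break followed by a space or tab continues the previous header line.
    // Unfolding removes the line break and keeps the whitespace.
    public static string Unfold(string rawHeaders)
    {
        return Regex.Replace(rawHeaders, @"\r?\n(?=[ \t])", string.Empty);
    }

    // headerValue is the text after "To:", "Cc:" or "Bcc:", already unfolded.
    // MailAddressCollection.Add(string) accepts a comma-separated address list
    // and throws FormatException if an entry cannot be parsed.
    public static List<MailAddress> ParseAddressList(string headerValue)
    {
        var collection = new MailAddressCollection();
        collection.Add(headerValue);
        return new List<MailAddress>(collection);
    }
}
```

Calling HeaderSketch.ParseAddressList on the unfolded value of the To header from the question should give three MailAddress objects, each with a DisplayName and an Address. Addresses that use the obsolete syntax mentioned above may still be rejected, so treat this as a starting point rather than a full parser.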

http://www.codeproject.com/KB/office/reading_an_outlook_msg.aspx
The above tutorial will give you a decent idea of how to read *.msg files from the file system. If you consider using the System.Net.Mail.MailMessage object you can get all info such as:
senders,
recipients,
attachments,
html email template,
text email template,
etc...
Thanks,

I created an API called SigParser which does this for you. It breaks reply chain emails into their parts and handles these sorts of problems where lines are splitting. You get a nice array of the email response bodies with who each section of the email was to if that data was in the reply chain header.

Related

Separate email content from signature

Is there a way to separate email content (body text) from an added signature using IMap packages?
IEnumerable<uint> MailList = Client.Search(SearchCondition.Unseen());
var email = Client.GetMessage(MailList.First()); // First() requires using System.Linq
string body = email.Body;
Thanks

This is a rather difficult problem.
For text/plain, you can look for the line "-- " (three characters, including the trailing space). For text/html, you can look for the CSS classes gmail_signature and moz-signature. For all mail, you can look for trailing text that matches the trailing text of the previous message from the same address.
However, none of this is foolproof. Lots of HTML sigs don't use those CSS classes (Outlook, for example, uses no relevant CSS), lots of plaintext sigs don't use "-- ", and lots of middlecrapware inserts text after the signature, so the "trailing text" may not be at the very end.
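As a rough illustration of the text/plain case only, the "-- " rule could be sketched like this. The names (SignatureSketch, SplitPlainText) are invented for the example, and splitting at the last delimiter line rather than the first is an assumption, not something any standard mandates.

```csharp
using System;

// Hypothetical helper; it only covers the plain-text "-- " convention.
public static class SignatureSketch
{
    public static (string Body, string Signature) SplitPlainText(string text)
    {
        // Normalise line endings so the delimiter comparison is exact.
        string[] lines = text.Replace("\r\n", "\n").Split('\n');

        // The delimiter is a line that is exactly dash, dash, space.
        int index = Array.LastIndexOf(lines, "-- ");
        if (index < 0)
        {
            // No delimiter found: treat the whole text as body.
            return (text, string.Empty);
        }

        string body = string.Join("\n", lines, 0, index);
        string signature = string.Join("\n", lines, index + 1, lines.Length - index - 1);
        return (body, signature);
    }
}
```

With the code from the question you would pass email.Body to SplitPlainText. Many mail clients strip trailing spaces or never emit the delimiter at all, so an empty Signature does not mean the message has no signature.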

Whitespace terminators & MakePlusRule in Irony

I'm trying to create a fairly simple parser using Irony, but am coming to the conclusion that Irony may not be suitable in this particular case.
This is an example of what I'm trying to parse:
server_name example.com *.example.com www.example.*;
server_name www.example.com ~^www\d+\.example\.com$;
server_name ~^(?<subdomain>.+?)\.(?<domain>.+)$;
I'm using FreeTextLiterals with either a space or semi-colon as a terminator
var serverNamevalue = new FreeTextLiteral("serverNameValue", FreeTextOptions.None, " ", ";");
I'm then using the MakePlusRule to pick up one or more server_name values:
httpCoreServerName.Rule = "server_name" + httpCoreServerNameItems + semicolon;
httpCoreServerNameItems.Rule = MakePlusRule(httpCoreServerNameItems, serverNamevalue);
However, I think there's a problem with having whitespace as a terminator for the FreeTextLiteral in this case. When I run this, I get a parser error. If I substitute another specific character for the whitespace to act as the terminator (and also add this as a delimiter in the call to MakePlusRule), it works fine.
Does anyone have any ideas as to how I could deal with this in Irony?

I posted this question over at the Irony project on Codeplex where Roman Ivantsov - the developer of Irony - confirmed there was an issue with the parser when using semi-colons with FreeTextLiterals.
Roman has helpfully patched this issue. I've downloaded the latest source and can confirm the fix works.

Semicolon in url as a separator for query strings

I keep hearing that W3C recommends to use ";" instead of "&" as a query string separator.
We recommend that HTTP server implementors, and in particular, CGI
implementors support the use of ";" in place of "&" to save authors
the trouble of escaping "&" characters in this manner.
Can somebody please explain why ";" is recommended instead of "&"?
Also, I tried using ";" instead of "&" (example: .com?str1=val1;str2=val2). When reading Request.QueryString["str1"] I get "val1;str2=val2". So if ";" is recommended, how do we read the query strings?

As the linked document says, ; is recommended over & because
the use of the "&" character to separate form fields interacts with its use in SGML attribute values to delimit character entity references.
For example, say you want your URL to be ...?q1=v1&q2=v2
There's nothing wrong with & there. But if you want to put that query into an HTML attribute, <a href="...?q1=v1&q2=v2">, it breaks because, inside an HTML attribute, & represents the start of a character entity. You have to escape the & as &amp;, giving <a href="...?q1=v1&amp;q2=v2">, and it'd be easier if you didn't have to.
; isn't overloaded like this at all; you can put one in an HTML attribute and not worry about it. Thus it'd be much simpler if servers recognised ; as a query parameter separator.
However, by the look of things (based on your experiment), ASP.Net doesn't recognise it as such. How to get it to? I'm not sure you can.

In short, HTML is a big mess (due to its leniency), and using semicolons helps to simplify this a LOT.
In order to use semicolons as the separator, I don't know whether .NET allows this customization or whether we developers need to write our own methods to process the QueryString. .NET does give us access to the raw QueryString, and we can run with it from there. This is what I did. I wrote my own methods, which wasn't too hard, but it took a lot of testing and debugging time, some of which was Microsoft's fault for not conforming to web standards when dealing with surrogate pairs. I made sure my implementation works with the full range of Unicode characters, including the supplementary planes beyond the Basic Multilingual Plane (thus for the Chinese and Japanese characters encoded there, etc.).
Before adding my own findings, I also want to confirm and include the great info that Rawling, Jeevan, and BeniBela have pointed out in Rawling's answer and the comments on it: it is incorrect in HTML to leave ampersands unescaped, but it usually works, and only because parsers are so tolerant. With that, I also explain why such improper encoding can lead to bugs (which most developers probably fall victim to).
One cannot depend on this leniency toward improperly encoded ampersands in QueryStrings, and sometimes this leniency leads to nasty bugs. Let's say, for instance, that a QueryString passes a random ASCII string (or user input) that is not properly encoded. Then an 'amp;' which follows '&' gets decoded, and the unexpected consequence is that 'amp;' is essentially 'swallowed'. (By swallowed, I mean it gets 'eaten' or goes missing.) A practical scenario is when the user is asked for input that goes into a database and the user inputs HTML (like here at StackOverflow), but because it is not posted correctly, nasty bugs develop.
The real advantage of the ';' separator is simplicity: properly encoding ampersand-separated QueryStrings takes two steps for URL strings in an HTML page (and in XML too). First the keys and values should be URL encoded and concatenated, and then the whole QueryString or URL should be HTML encoded (or, for XML, encoded with a very similar encoding). Also don't forget that HTML encoding and URL encoding are different processes, and it's important that they are different. A developer needs to be careful to keep the two apart. And since they are similar, it's not uncommon to see novice programmers mix them up.
A good example of a potential problematic URL is when passing two name/values in a QueryString:
a = 'me & you', and
b = 'you & me'.
Here, using '&' as a separator, '?a=me+%26+you&b=you+%26+me' is a proper querystring BUT it should also be HTML encoded before being written to HTML source code. This is important to be bug free. Most developers aren't careful to do this two-step process of first URL encoding the keys and values and then HTML encoding the full URL in the HTML source. It's no wonder why: I had to sit down, seriously think this process through, and test my conclusions thoroughly. Imagine when the name/value is 'year=año', or the far more complex case of Chinese or Japanese characters that need surrogate pairs to represent them!
For the same key/value pairs a and b, the process is MUCH simpler when using ';' as the separator. As a matter of fact, the ampersand separator makes the process more than twice as complex as the semicolon separator! Here's the same info represented using ';' as the separator: '?a=me+%26+you;b=you+%26+me'. The only difference is that there's no unescaped '&' in the string. But using the ';' separator means that no second step of HTML encoding the URL or QueryString is needed. Now imagine if I were writing HTML, wanted correct HTML, and needed to write the HTML to explain all this! All this HTML encoding with '&' really adds a lot of complication (and for many developers, quite a lot of confusion too).
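To make the two steps concrete, here is a small sketch using only Uri.EscapeDataString and WebUtility.HtmlEncode. The class and method names are invented for the example. Note that Uri.EscapeDataString writes a space as %20 rather than +, so the strings differ slightly from the ones quoted above, but the point about the separator is the same.

```csharp
using System;
using System.Net;

// Hypothetical helper names; the encoding calls are standard .NET APIs.
public static class QueryEncodingSketch
{
    // Step 1: URL-encode each key and value, then join with the separator.
    public static string BuildQuery(char separator, params (string Key, string Value)[] pairs)
    {
        var parts = new string[pairs.Length];
        for (int i = 0; i < pairs.Length; i++)
        {
            parts[i] = Uri.EscapeDataString(pairs[i].Key) + "=" + Uri.EscapeDataString(pairs[i].Value);
        }
        return "?" + string.Join(separator.ToString(), parts);
    }

    // Step 2: HTML-encode the finished URL before writing it into an attribute.
    public static string ForHtmlAttribute(string url)
    {
        return WebUtility.HtmlEncode(url);
    }
}
```

With '&' as the separator, ForHtmlAttribute changes the string (the separator becomes &amp;). With ';' as the separator, the query built in step 1 contains nothing that HTML encoding needs to touch, so step 2 returns it unchanged.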
Novice developers would simply not HTML encode the QueryString or URL, which is CORRECT when ';' is the separator. But it leaves room for bugs when an ampersand is improperly encoded. So '?someText=blah&blah' would need proper encoding.
Also in .NET, we can write XML documentation for our methods. Well, just today, I wrote a little explanation that used the above 'a=me+%26+you&b=you+%26+me' example. And in my XML, I had to manually type all those &amp; character entities. XML documentation is picky, so one must correctly encode ampersands. But the leniency in HTML adds to the ambiguity.
Perhaps this wasn't too confusing. But all the confusion or difficulty is due to using as the separator a character which should be HTML encoded; thus '&' is the culprit. And the semicolon relieves all that complication.
One last consideration: with how much more complicated the '&' separator makes this process, it's no wonder to me that the Microsoft implementation of surrogate pairs in QueryStrings still does not follow the official specifications. And if you write your own methods, you MUST account for Microsoft's incorrect percent-encoding of surrogate pairs. The official specs forbid percent-encoding of surrogate pairs in UTF-8. So anyone who writes their own methods which also handle the full range of Unicode characters, beware of this.
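For anyone wanting a starting point for the "write your own methods" route, here is a minimal sketch of a parser that works on the raw query string and accepts both ';' and '&' as separators. It is not my production code: the names are invented for the example, repeated keys simply overwrite each other, and it does nothing about the surrogate-pair quirk just mentioned.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; pass it the raw query, e.g. Request.Url.Query in classic ASP.NET.
public static class SemicolonQuerySketch
{
    public static Dictionary<string, string> Parse(string rawQuery)
    {
        var result = new Dictionary<string, string>();
        string query = rawQuery.TrimStart('?');

        // Accept both separators so existing '&' links keep working.
        foreach (string pair in query.Split(new[] { ';', '&' }, StringSplitOptions.RemoveEmptyEntries))
        {
            int eq = pair.IndexOf('=');
            string key = eq < 0 ? pair : pair.Substring(0, eq);
            string value = eq < 0 ? string.Empty : pair.Substring(eq + 1);

            // Last occurrence of a key wins in this sketch.
            result[Decode(key)] = Decode(value);
        }
        return result;
    }

    // '+' means space in form-style query strings; then undo percent-encoding.
    private static string Decode(string s)
    {
        return Uri.UnescapeDataString(s.Replace('+', ' '));
    }
}
```

With the URL from the question, Parse("?str1=val1;str2=val2")["str1"] gives "val1" instead of "val1;str2=val2".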

How can I deal with ampersands in a mail client's mailto links?

I have an ASP.NET/C# application, part of which converts WWW links to mailto links in an HTML email.
For example, if I have a link such as:
www.site.com
It gets rewritten as:
mailto:my@address.com?Subject=www.site.com
This works extremely well, until I run into URLs with ampersands, which then causes the subject to be truncated.
For example the link:
www.site.com?val1=a&val2=b
Shows up as:
mailto:my@address.com?Subject=www.site.com?val1=a&val2=b
Which is exactly what I want, but then when clicked, it creates a message with:
subject=www.site.com?val1=a
Which has dropped the &val2, which makes sense as & is the delimiter in a mailto command.
So, I have tried various other ways to work around this, with no success.
I have tried quoting the subject='' part, and that did nothing.
I (in C#) replaced '&' with '&amp;', which Live Mail and Thunderbird just turn back into:
www.site.com?val1=a&val2=b
I replaced '&' with '%26' which resulted in:
mailto:my@address.com?Subject=www.site.com?val1=a%26amp;val2=b
In the mail with the subject:
www.site.com?val1=a&val2=b
EDIT:
In response to how the URL is being built: this is much trimmed down, but it is the gist of it. In place of the att.Value.Replace I have tried System.Web.HttpUtility.UrlEncode calls, which also fail.
HtmlAgilityPack.HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");
foreach (HtmlAgilityPack.HtmlNode link in nodes)
{
    HtmlAgilityPack.HtmlAttribute att = link.Attributes["href"];
    att.Value = att.Value.Replace("&", "%26");
}

Try mailto:my@address.com?Subject=www.site.com?val1=a%26val2=b
&amp; is an HTML escape code, whereas %26 is a URL escape code. Since it's a URL, that's all you need.
EDIT: I figured that's how you were building your URL. Don't build URLs that way! You need to get the %26 in there before you let anything else parse or escape it. If you really must do it this way (which you really should try to avoid), then you should search for "&amp;" instead of just "&", because the string has already been HTML escaped at this point.
So, ideally, you build your URL properly before it's HTML escaped. If you can't do it properly, at least search for the right string instead of the wrong one. "&" is the wrong one.
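A rough sketch of that ordering, assuming you have the plain (not yet HTML-escaped) link text in hand: URL-encode the subject first, then HTML-encode the whole href last. MailtoSketch and BuildMailtoHref are names invented for the example.

```csharp
using System;
using System.Net;

// Hypothetical helper; only Uri.EscapeDataString and WebUtility.HtmlEncode are real APIs.
public static class MailtoSketch
{
    public static string BuildMailtoHref(string address, string subject)
    {
        // URL-encode the subject so '&', '?' and '=' inside it cannot be
        // mistaken for mailto delimiters.
        string url = "mailto:" + address + "?Subject=" + Uri.EscapeDataString(subject);

        // HTML-encode last, when the value is written into the href attribute.
        return WebUtility.HtmlEncode(url);
    }
}
```

If the original href has already been HTML-escaped by the time your code sees it, run it through WebUtility.HtmlDecode before passing it in as the subject, otherwise the "amp;" ends up inside the encoded value.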

You can't put just any character in the subject. You could try using the System.Web.HttpUtility.UrlEncode function on the subject's value...

Using the URL escape code %26 is the right way.
Sadly this is still not working on the Android OS because of bug 8023

What I ended up doing for my case was eliminating the &.
www.site.com/mytest.php?val1=a=b=c. Where the 2nd and 3rd = would be equivalent to www.site.com?val1=a&val2=b&val3=c
In mytest.php I explode on ? and then explode again on =.
A total hack I know but it does work for me.

Removing <div>'s from text file?

I've made a small program in C#/.NET which doesn't really serve much of a purpose: it tells you the chance of your DOOM based on today's news, lol. It takes an RSS feed from the BBC website on load and then looks for keywords which either increment or decrease the percentage chance of DOOM.
A crazy little project, but maybe one day the classes will come in handy for something more important.
I receive the RSS in XML format, but it contains a lot of div tags and formatting characters which I don't really want in the database of keywords.
What is the best way of removing these unwanted characters and divs?
Thanks,
Ash

If you want to remove the DIV tags WITH content as well:
string start = "<div>";
string end = "</div>";
// Lazy match between the tags; Singleline lets '.' span line breaks. Nested divs are not handled.
string txt = Regex.Replace(htmlString, Regex.Escape(start) + "(?<data>.*?)" + Regex.Escape(end), string.Empty, RegexOptions.Singleline);
Input: <xml><div>junk</div>XXX<div>junk2</div></xml>
Output: <xml>XXX</xml>

IMHO the easiest way is to use regular expressions. Something like:
string txt = Regex.Replace(htmlString, @"<(.|\n)*?>", string.Empty);
Depending on which tags and characters you want to remove you will modify the regex, of course. You will find a lot of material on this and other methods if you do a web search for 'strip html C#'.
SO question Render or convert Html to ‘formatted’ Text (.NET) might help you, too.

Stripping HTML tags from a given string is a common requirement and you can probably find many resources online that do it for you.
The accepted method, however, is to use a Regular expression based Search and Replace. This article provides a good sample along with benchmarks. Another point worth mentioning is that you would require separate Regex based lookups for the different kinds of unwanted characters you are seeing. (Perhaps showing us an example of the HTML you receive would help)
Note that your requirements may vary based on which tags you want to remove. In your question, you only mention DIV tags. If that is the only tag you need to replace, a simple string search and replace should suffice.

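A regular expression such as this:
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
would match each HTML element: an opening tag, its content, and the matching closing tag (use RegexOptions.IgnoreCase so that lowercase tags match too).
Use this to remove them from your data.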
