Detecting emails inside an HTTP header (C#)

I am building a proxy in C#, and one of my tasks is to find email addresses inside
an HTTP header. The problem is that in the data I receive, the @ arrives as %40.
Could anyone tell me how I can detect email addresses when the @ inside the address has been replaced with %40?
Here is my code for finding email addresses in a given string (it matches a literal @, not %40):
Code:
string regexPattern = @"[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}";
Regex regex = new Regex(regexPattern);
MatchCollection matches = regex.Matches(this._context.Request.Headers[i]);
foreach (Match match in matches)
{
    // print every email address that was found
    Console.WriteLine(match.Value);
}

Not sure if I understood your question. Why don't you just replace @ with %40 in the regex you provided?
So:
string regexPattern = @"[A-Za-z0-9._%-]+%40[A-Za-z0-9.-]+\.[A-Za-z]{2,4}";

URL decode your data before running it through the regex:
regex.Matches(this._context.Server.UrlDecode(this._context.Request.Headers[i]))
Or:
regex.Matches(HttpUtility.UrlDecode(this._context.Request.Headers[i]))

Use OR (|) to allow @ or "%40":
[A-Za-z0-9._%-]+(@|%40)[A-Za-z0-9.-]+\.[A-Za-z]{2,4}
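A minimal sketch that combines the suggestions above: decode the header value once, then run a single regex that only needs to know about a literal @. The names HeaderEmailFinder and FindEmails are made up for illustration. Uri.UnescapeDataString stands in for HttpUtility.UrlDecode here so the snippet does not depend on System.Web; unlike UrlDecode it leaves '+' untouched, which matters for addresses such as first+tag@example.com. The pattern is the one from the question, slightly loosened, and is not a full address validator.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HeaderEmailFinder
{
    // Pattern from the question, with '+' allowed in the local part and no
    // upper limit on the top-level domain length (both are assumptions).
    private static readonly Regex EmailPattern =
        new Regex(@"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}");

    // Returns every email-like substring of a (possibly percent-encoded) header value.
    public static List<string> FindEmails(string headerValue)
    {
        // Turn %40 back into '@' (and any other %XX escape into its character).
        string decoded = Uri.UnescapeDataString(headerValue);

        var emails = new List<string>();
        foreach (Match match in EmailPattern.Matches(decoded))
        {
            emails.Add(match.Value);
        }
        return emails;
    }
}
```

Calling HeaderEmailFinder.FindEmails(this._context.Request.Headers[i]) inside the existing loop would then cover both the encoded and the plain form.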

Related

Unable to encode Url properly using HttpUtility.UrlEncode() method

I have created an application in which I need to encode/decode special characters in a URL entered by the user.
For example: if the user enters http://en.wikipedia.org/wiki/Å then the resulting URL should be http://en.wikipedia.org/wiki/%C3%85.
I made a console application with the following code.
string value = "http://en.wikipedia.org/wiki/Å";
Console.WriteLine(System.Web.HttpUtility.UrlEncode(value));
It encodes the character Å successfully, but it also encodes the :// and / characters. After running the code I get http%3a%2f%2fen.wikipedia.org%2fwiki%2f%c3%85, but I want http://en.wikipedia.org/wiki/%C3%85
What should I do?

Uri.EscapeUriString(value) returns the value that you expect. But it leaves reserved characters such as ?, & and = unescaped, so it is only suitable for a whole URL, not for individual values, and it is marked obsolete in .NET 6 and later.
There are a few URL encoding functions in the .NET Framework which all behave differently and are useful in different situations:
Uri.EscapeUriString
Uri.EscapeDataString
WebUtility.UrlEncode (.NET 4.5 and later)
HttpUtility.UrlEncode (in System.Web.dll, so intended for web applications, not desktop)
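To make the difference between two of the functions listed above concrete, here is a small sketch. The wrapper class and method names are invented for this example; only Uri.EscapeDataString and WebUtility.UrlEncode are real framework calls. Both are meant for a single component (a path segment or a query value), not for a complete URL.

```csharp
using System;
using System.Net;

public static class UrlEncodingComparison
{
    // Percent-encodes everything except unreserved characters.
    // A space becomes %20 and '/' becomes %2F.
    public static string EncodeComponent(string value)
    {
        return Uri.EscapeDataString(value);
    }

    // Form-style encoding as used in query strings and POST bodies.
    // A space becomes '+'.
    public static string EncodeFormValue(string value)
    {
        return WebUtility.UrlEncode(value);
    }
}
```

For the Wikipedia example, EncodeComponent("Å") gives %C3%85, so encoding only the last path segment and appending it to http://en.wikipedia.org/wiki/ yields the URL the question asks for.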

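You could use a regular expression to separate the host from the path and then URL-encode only the path, one segment at a time so that the / separators survive:
var inputString = "http://en.wikipedia.org/wiki/Å";
var encodedString = inputString; // fall back to the input if the pattern does not match
var regex = new Regex("^(?<host>https?://.+?/)(?<path>.*)$");
var match = regex.Match(inputString);
if (match.Success)
{
    // Encode each segment separately; encoding the whole path would turn "/" into %2f.
    var segments = match.Groups["path"].Value.Split('/');
    for (var i = 0; i < segments.Length; i++)
        segments[i] = System.Web.HttpUtility.UrlEncode(segments[i]);
    encodedString = match.Groups["host"].Value + string.Join("/", segments);
}
Console.WriteLine(encodedString); // HttpUtility emits lowercase hex: .../wiki/%c3%85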

Split string with response from gmail

After I retrieve messages from the mailbox I want to separate the message body from the subject, date and other information, but I can't find the right algorithm. Here is my code:
// create an instance of TcpClient
TcpClient tcpclient = new TcpClient();
// host name of the POP server; Gmail uses port 995 for POP over SSL
tcpclient.Connect("pop.gmail.com", 995);
// secure stream on top of the connection between client and POP server
System.Net.Security.SslStream sslstream = new SslStream(tcpclient.GetStream());
// authenticate as client
sslstream.AuthenticateAsClient("pop.gmail.com");
//bool flag = sslstream.IsAuthenticated; // check flag
// writer for sending commands
System.IO.StreamWriter sw = new StreamWriter(sslstream);
// reader for the server responses
System.IO.StreamReader reader = new StreamReader(sslstream);
// see the POP3 RFC for the commands; there are only a handful
sw.WriteLine("USER my_login");
// send to server
sw.Flush();
sw.WriteLine("PASS my_pass");
sw.Flush();
// this retrieves the first email
sw.WriteLine("RETR 1");
sw.Flush();
string str = string.Empty;
string strTemp = string.Empty;
while ((strTemp = reader.ReadLine()) != null)
{
    // a line consisting of a single "." ends the message
    if (strTemp == ".")
    {
        break;
    }
    if (strTemp.IndexOf("-ERR") != -1)
    {
        break;
    }
    // note: the line break removed by ReadLine is not added back here
    str += strTemp;
}
// close the connection
sw.WriteLine("QUIT");
sw.Flush();
richTextBox2.Text = str;
I have to extract:
The subject of message
The author
The date
The message body
Can anyone tell me how to do this?
The string that I receive (str) contains the subject "Test message" and the body "This is the text of test message". It looks like:
+OK Gpop ready for requests from 46.55.3.85 s42mb37199022eev+OK send PASS+OK Welcome.+OK message followsReturn-Path:
Received: from TMD-I31S3H51L29
(host-static-46-55-3-85.moldtelecom.md. [46.55.3.85]) by
mx.google.com with ESMTPSA id o5sm61119999eeg.8.2014.04.16.13.48.20
for (version=TLSv1
cipher=ECDHE-RSA-AES128-SHA bits=128/128); Wed, 16 Apr 2014
13:48:21 -0700 (PDT)Message-ID:
<534eec95.856b0e0a.55e1.6612@mx.google.com>MIME-Version: 1.0From:
mail_address@gmail.comTo: mail_address@gmail.comDate: Wed, 16 Apr 2014
13:48:21 -0700 (PDT)Subject: Test messageContent-Type: text/plain;
charset=us-asciiContent-Transfer-Encoding: quoted-printableThis is the
text of test message
Thank you very much!

What you first need to do is read rfc1939 to get an idea of the POP3 protocol. But immediately after reading that, you'll need to read the following list of RFCs... actually, screw it, I'm not going to paste the long list of them here, I'll just link you to the website of my MimeKit library which already has a fairly comprehensible list of them.
As your original code correctly did, it needs to keep reading from the socket until the termination sequence (".\r\n") is encountered, thus terminating the message stream.
The way you are doing it is really inefficient, but whatever, it'll (mostly) work except for the fact that you need to undo any/all byte-stuffing that is done by the POP3 server to munge lines beginning with a period ('.'). For more details, read the POP3 specification I linked above.
To parse the headers, you'll need to read rfc822. Suffice it to say, Olivier's approach will fall flat on its face, most likely the second it tries to 'split' any real-world messages... unless it gets extremely lucky.
As a hint, the message body is separated from the headers by a blank line.
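A rough sketch of the two mechanical steps mentioned so far: undoing the POP3 byte-stuffing and splitting headers from body at the first blank line. It assumes the caller has already read the response line by line, kept the lines separate, and stopped at the terminating "." line. The class and method names are hypothetical, and this is not a substitute for a real MIME parser: it does no charset handling and no decoding of encoded headers.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class Pop3MessageSketch
{
    // POP3 byte-stuffing: the server prepends '.' to any message line that
    // starts with '.', so the client strips one leading '.' again.
    public static string UnstuffLine(string line)
    {
        return line.StartsWith(".", StringComparison.Ordinal) ? line.Substring(1) : line;
    }

    // Splits message lines into unfolded header lines and the body text.
    public static void SplitMessage(IEnumerable<string> lines,
                                    out List<string> headers, out string body)
    {
        headers = new List<string>();
        var bodyBuilder = new StringBuilder();
        bool inBody = false;

        foreach (string raw in lines)
        {
            string line = UnstuffLine(raw);
            if (inBody)
            {
                bodyBuilder.Append(line).Append("\r\n");
            }
            else if (line.Length == 0)
            {
                inBody = true; // the first blank line ends the header section
            }
            else if ((line[0] == ' ' || line[0] == '\t') && headers.Count > 0)
            {
                // A line starting with whitespace continues the previous header;
                // it is joined here with a single space.
                headers[headers.Count - 1] += " " + line.Trim();
            }
            else
            {
                headers.Add(line);
            }
        }
        body = bodyBuilder.ToString();
    }
}
```

In the question's loop this would mean collecting each strTemp into a List<string> instead of concatenating it into str, and skipping the +OK status lines that precede the message.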
Here's a few other problems you are likely to eventually run into:
Header values are supposed to be encoded if they contain non-ASCII text (see rfc2047 and rfc2231 for details).
Some header values in the wild are not properly encoded, and sometimes, even though they are not supposed to, include undeclared 8-bit text. Dealing with this is non-trivial. This also means that you cannot really use a StreamReader to read lines as you'll lose the original byte sequences.
If you actually want to do anything with the body of the message, you'll have to write a MIME parser.
I'd highly recommend using MimeKit and my other library, MailKit, for POP3 support.
Trust me, you are in for a world of pain trying to do this the way you are trying to do it.

String.Split is not powerful enough for this task. You will have to use Regex. The pattern that I suggest is:
^(?<name>\w+): (?<value>.*?)$
The meaning is:
^ Beginning of line (if you use the multiline option).
(?<name>pattern) Capturing group where the group name is "name".
\w+ A word.
.*? Any sequence of characters (for the value)
$ End of line
This code ...
MatchCollection matches =
    Regex.Matches(text, @"^(?<name>\w+): (?<value>.*?)$", RegexOptions.Multiline);
foreach (Match match in matches) {
    Console.WriteLine("{0} = {1}",
        match.Groups["name"].Value,
        match.Groups["value"].Value
    );
}
... produces this output:
Received = from TMD-I31S3H51L29 (host-static-46-55-3-85.m ...
From = mail_address@gmail.com
To = mail_address@gmail.com
Date = Wed, 16 Apr 2014 13:48:21 -0700 (PDT)
Subject = Test message
The body seems to start after the "Content-Transfer-Encoding:" line and run to the end of the string. You can find the body like this:
Match body =
    Regex.Match(text, @"^Content-Transfer-Encoding: .*?$", RegexOptions.Multiline);
if (body.Success) {
    Console.WriteLine(text.Substring(body.Index + body.Length + 1));
}
Note that with RegexOptions.Multiline, ^ and $ only treat \n as a line break. If the lines are separated by \r\n, each captured value ends with a stray \r, so either trim it or end the pattern with \r?$ instead of $.
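A hedged variant of the regex approach that addresses two limitations visible in the output above: \w+ does not match hyphenated names such as Content-Type or Message-ID, and a body line that happens to look like "word: text" would be reported as a header. The sketch cuts the text at the first blank line before matching and tolerates both \r\n and \n line endings. The class and method names are illustrative. Folded continuation lines are ignored, and when a header such as Received occurs several times only the last value is kept.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HeaderRegexSketch
{
    // name: letters, digits, '_' or '-'; value: rest of the line without the line break.
    private static readonly Regex HeaderLine = new Regex(
        @"^(?<name>[\w-]+):[ \t]*(?<value>[^\r\n]*)\r?$", RegexOptions.Multiline);

    public static Dictionary<string, string> ParseHeaders(string message)
    {
        // The header section ends at the first blank line.
        int end = message.IndexOf("\r\n\r\n", StringComparison.Ordinal);
        if (end < 0) end = message.IndexOf("\n\n", StringComparison.Ordinal);
        string headerBlock = end < 0 ? message : message.Substring(0, end);

        // Header names are case-insensitive.
        var headers = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match match in HeaderLine.Matches(headerBlock))
        {
            headers[match.Groups["name"].Value] = match.Groups["value"].Value;
        }
        return headers;
    }
}
```

This only works if the line breaks were preserved while reading; the string shown in the question has them stripped, so no line-based pattern can split it reliably.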

Accented characters displayed as hex values in mail source file

I have to convert the content of a mail message to XML format but I am facing some encoding problems. Indeed, all my accented characters and some others are displayed in the message file with their hex value.
Ex :
é is displayed =E9,
ô is displayed =F4,
= is displayed =3D...
The mail is configured to be sent with iso-8859-1 coding and I can see these parameters in the file :
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Notepad++ detects the file as "ANSI as UTF-8".
I need to convert it in C# (I am in a script task in an SSIS project) to be readable and I can not manage to do that.
I tried encoding it in UTF-8 in my StreamReader but it does nothing. Despite my readings on the topic, I still do not really understand the steps that lead to my problem and the means to solve it.
I point out that Outlook decodes the message well and the accented characters are displayed correctly.
Thanks in advance.

OK, I was looking in the wrong direction. The keyword here is "Quoted-Printable". This is where my issue comes from and this is what I really have to decode.
In order to do that, I followed the example posted by Martin Murphy in this thread :
C#: Class for decoding Quoted-Printable encoding?
The method described is :
public static string DecodeQuotedPrintables(string input)
{
    // Remove soft line breaks ("=" at the end of a line) first.
    input = input.Replace("=\r\n", "");
    // Decode every =XX in a single pass, so that a decoded "=" (from =3D)
    // is never mistaken for the start of another escape sequence.
    // Casting the byte value to char is only correct for ISO-8859-1 content.
    return Regex.Replace(input, @"=([0-9A-Fa-f]{2})",
        m => ((char)Convert.ToInt32(m.Groups[1].Value, 16)).ToString());
}
To summarize, I open a StreamReader in UTF-8 and append each line I read to a string like this:
myString += line + "\r\n";
I then open my StreamWriter in UTF-8 as well and write the decoded myString variable to it:
myStreamWriter.WriteLine(DecodeQuotedPrintables(myString));
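The method above maps each =XX directly to a character, which happens to work for ISO-8859-1 but not for multi-byte charsets such as UTF-8. A sketch of a charset-aware alternative follows: it collects the decoded bytes first and only then converts them with the encoding named in the Content-Type header. The class name is made up, and the sketch assumes that everything outside the =XX escapes is plain ASCII, as quoted-printable requires.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QuotedPrintableSketch
{
    public static string Decode(string input, Encoding charset)
    {
        var bytes = new List<byte>(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c != '=')
            {
                bytes.Add((byte)c); // assumed to be ASCII
                continue;
            }
            // Soft line break: "=" directly followed by a line ending.
            if (i + 1 < input.Length && input[i + 1] == '\n') { i += 1; continue; }
            if (i + 2 < input.Length && input[i + 1] == '\r' && input[i + 2] == '\n') { i += 2; continue; }
            // Escape sequence: "=" followed by two hex digits.
            if (i + 2 < input.Length && Uri.IsHexDigit(input[i + 1]) && Uri.IsHexDigit(input[i + 2]))
            {
                bytes.Add(Convert.ToByte(input.Substring(i + 1, 2), 16));
                i += 2;
                continue;
            }
            bytes.Add((byte)'='); // malformed escape, kept literally
        }
        return charset.GetString(bytes.ToArray());
    }
}
```

For the mail in the question the call would be QuotedPrintableSketch.Decode(myString, Encoding.GetEncoding("ISO-8859-1")).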

C# Imap search command with special characters like á,é

I'm working on an imap client search function.
I use this command: UID SEARCH FROM PÉTER
When I run this command I get the following error:
Error in IMAP command UID SEARCH: 8bit data in atom
I get this error whenever my pattern string (for example PÉTER) contains an accented character.
What is the solution? What should I do?
Edit:
I tried with a UTF-8 encoded string (UID SEARCH FROM PÉTER). It runs without error, but it doesn't return any results.
I checked the test email account, and there are many mails from this sender.

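In IMAP you need to send 8-bit data as a string literal.
Literal syntax:
{byte_count} CRLF followed by exactly byte_count bytes of data
Example search:
cmdTag SEARCH CHARSET UTF-8 SUBJECT {4} CRLF test CRLF
Unless the server supports LITERAL+, wait for its continuation response (a line starting with +) after the {4} CRLF before sending the literal bytes.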
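Applied to the search from the question, the literal could be built roughly as follows. This is only a sketch of the framing: the class and method names are invented, the tag A001 is arbitrary, and the actual socket I/O is left out.

```csharp
using System;
using System.Text;

public static class ImapLiteralSketch
{
    // The literal payload: the search value as UTF-8 bytes.
    public static byte[] EncodeLiteral(string value)
    {
        return Encoding.UTF8.GetBytes(value);
    }

    // First line of the command. {n} announces the number of BYTES
    // (not characters) that follow after the CRLF.
    public static string BuildCommandLine(string tag, string searchKey, byte[] literal)
    {
        return tag + " UID SEARCH CHARSET UTF-8 " + searchKey + " {" + literal.Length + "}\r\n";
    }
}
```

The exchange would then be: send BuildCommandLine("A001", "FROM", literal), read the server's "+" continuation line, send the literal bytes, and finish the command with CRLF. For PÉTER the announced length is 6, because É takes two bytes in UTF-8.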

How to match URL in c#?

I have found many examples of how to match particular types of URLs in PHP and other languages. I need to match any URL from my C# application. How can I do this? By URL I mean links to any site, or to files, subdirectories and so on within a site.
I have text like this: "Go to my awsome website http:\www.google.pl\something\blah\?lang=5" (or similar) and I need to extract the link from the message. Links may also start with just www.

If you need to test your regex to find URLs you can try this resource
http://gskinner.com/RegExr/
It will test your regex while you're writing it.
In C# you can use regex for example as below:
Regex r = new Regex(@"(?<Protocol>\w+):\/\/(?<Domain>[\w@][\w.:@]+)\/?[\w\.?=%&=\-@/$,]*");
// Match the regular expression pattern against a text string.
Match m = r.Match(text);
while (m.Success)
{
    // do things with your matching text
    m = m.NextMatch();
}

Microsoft has a nice page of some regular expressions...this is what they say (works pretty good too)
^(ht|f)tp(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?$
http://msdn.microsoft.com/en-us/library/ff650303.aspx#paght000001_commonregularexpressions

I am not sure exactly what you are asking, but a good start would be the Uri class, which will parse the url for you.
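As a sketch of how the Uri class could be used for the task in the question: split the text on whitespace, let Uri.TryCreate decide whether each token is an absolute URL, and prefix bare www. links with http:// first. The class name, the punctuation that is trimmed, and the list of accepted schemes are assumptions for this example. Malformed links with backslashes, like the one in the question, are not repaired.

```csharp
using System;
using System.Collections.Generic;

public static class UrlFinderSketch
{
    public static List<Uri> FindUrls(string text)
    {
        var found = new List<Uri>();
        var separators = new[] { ' ', '\t', '\r', '\n' };
        foreach (string token in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // Strip punctuation that commonly surrounds a link in running text.
            string candidate = token.Trim('"', '\'', ',', ';', '(', ')', '<', '>', '.');

            // Treat "www.example.com" as an http link.
            if (candidate.StartsWith("www.", StringComparison.OrdinalIgnoreCase))
            {
                candidate = "http://" + candidate;
            }

            Uri uri;
            bool isWebLink = Uri.TryCreate(candidate, UriKind.Absolute, out uri)
                && (uri.Scheme == Uri.UriSchemeHttp
                    || uri.Scheme == Uri.UriSchemeHttps
                    || uri.Scheme == Uri.UriSchemeFtp);
            if (isWebLink)
            {
                found.Add(uri);
            }
        }
        return found;
    }
}
```

Each returned Uri then exposes Host, Port, AbsolutePath and Query without any further regex work.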

Here's one defined for URL's.
^http(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?$
http://msdn.microsoft.com/en-us/library/ms998267.aspx

Regex regx = new Regex("http(s)?://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);

This will return a match collection of all matches found within "yourStringThatHasUrlsInIt":
var pattern = @"((ht|f)tp(s?)\:\/\/|~/|/)?([w]{2}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?";
var regex = new Regex(pattern);
var matches = regex.Matches(yourStringThatHasUrlsInIt);
The return will be a "MatchCollection" which you can read more about here:
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.matchcollection.aspx

//This code returns (protocol://)host:port from a URL
//Commented URL's with different protocols. Just uncomment to test.
//string url = "http://www.contoso.com:8080/letters/readme.html";
//string url = "ftp://www.contoso.com:8080/letters/readme.html";
//string url = "l2tp://1.5.8.6:8080/letters/readme.html";
string url = "l2tp://1.5.8.6:8080/letters/readme.html";
string host = "";//empty string with host from url
//protocol, (ip/domain), port
host = Regex.Match(url, @"^(?<proto>\w+)://+?(?<host>[A-Za-z0-9\-\.]+)+?(?<port>:\d+)?/", RegexOptions.None, TimeSpan.FromMilliseconds(150)).Result("${proto}://${host}${port}");
//(ip/domain):port without protocol. If HTTPS board loading images from HTTP host.
//host = Regex.Match(url, @"^(?<proto>\w+)://+?(?<host>[A-Za-z0-9\-\.]+)+?(?<port>:\d+)?/", RegexOptions.None, TimeSpan.FromMilliseconds(150)).Result("${host}${port}");
Console.WriteLine("url: "+url+"\nhost: "+host); //display host
see https://rextester.com/PVSO54371
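For comparison, the same scheme://host:port prefix can be obtained without a regex through Uri.GetLeftPart. This is a sketch with a made-up wrapper name. I have only reasoned it through for http URLs; how the Uri class treats an unregistered scheme such as l2tp is not something this example verifies.

```csharp
using System;

public static class UrlAuthoritySketch
{
    // Returns "scheme://host[:port]" or null if the input is not an absolute URL.
    public static string GetSchemeHostPort(string url)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
        {
            return null;
        }
        return uri.GetLeftPart(UriPartial.Authority);
    }
}
```

Unlike Match(...).Result(...), which throws when the pattern does not match, this returns null for input that cannot be parsed.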

You can also use https://github.com/d-kistanov-parc/DotNetUrlPatternMatching
The library allows you to match a URL to a pattern.
How it works:
an url pattern is split into parts
each non-empty part is matched with a similar one from the URL.
You can specify a Wildcard * or ~
Where * is any character set within the group (scheme, host, port, path, parameter, fragment)
Where ~ any character set within a group segment (host, path)
Only supply parts of the URL you care about. Parts which are left out will match anything. E.g. if you don’t care about the host, then leave it out.
