Accented characters displayed as hex values in mail source file - c#

I have to convert the content of a mail message to XML, but I am running into encoding problems: all my accented characters, and some others, appear in the message file as their hex values.
For example:
é is displayed as =E9,
ô is displayed as =F4,
= is displayed as =3D...
The mail is configured to be sent with ISO-8859-1 encoding, and I can see these parameters in the file:
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Notepad++ detects the file as "ANSI as UTF-8".
I need to convert it in C# (in a script task in an SSIS project) so that it is readable, and I cannot manage to do that.
I tried specifying UTF-8 in my StreamReader, but it changes nothing. Despite my reading on the topic, I still do not really understand what causes my problem or how to solve it.
I point out that Outlook decodes the message well and the accented characters are displayed correctly.
Thanks in advance.

OK, I was looking in the wrong direction. The keyword here is "Quoted-Printable". This is where my issue comes from, and this is what I really have to decode.
To do that, I followed the example posted by Martin Murphy in this thread:
C#: Class for decoding Quoted-Printable encoding?
The method described is :
    public static string DecodeQuotedPrintables(string input)
    {
        var occurences = new Regex(@"=[0-9A-F]{2}", RegexOptions.Multiline);
        var matches = occurences.Matches(input);
        foreach (Match match in matches)
        {
            char hexChar = (char) Convert.ToInt32(match.Groups[0].Value.Substring(1), 16);
            input = input.Replace(match.Groups[0].Value, hexChar.ToString());
        }
        return input.Replace("=\r\n", "");
    }
To summarize, I open a StreamReader in UTF-8 and append each line I read to a string, like this:
    myString += line + "\r\n";
I then open my StreamWriter, also in UTF-8, and write the decoded myString variable to it:
    myStreamWriter.WriteLine(DecodeQuotedPrintables(myString));
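As a complement to the regex approach above, here is a minimal sketch of a decoder that collects bytes first and only then applies the charset named in the Content-Type header. The class name QuotedPrintable and the method Decode are made up for this illustration and are not part of the thread. It assumes the encoded text itself is plain ASCII, which is what quoted-printable is meant to produce.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; not taken from the original answer.
public static class QuotedPrintable
{
    // Decodes a quoted-printable body into text using the given charset.
    // Bytes are collected first so that multi-byte charsets also decode correctly.
    public static string Decode(string input, Encoding charset)
    {
        var bytes = new List<byte>();
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c != '=')
            {
                // Assumption: literal characters in the encoded text are ASCII.
                bytes.Add((byte)c);
                continue;
            }
            // Soft line break: "=" directly followed by CRLF or a bare LF.
            if (i + 2 < input.Length && input[i + 1] == '\r' && input[i + 2] == '\n') { i += 2; continue; }
            if (i + 1 < input.Length && input[i + 1] == '\n') { i += 1; continue; }
            // Escaped byte: "=" followed by two hex digits.
            if (i + 2 < input.Length && Uri.IsHexDigit(input[i + 1]) && Uri.IsHexDigit(input[i + 2]))
            {
                bytes.Add(Convert.ToByte(input.Substring(i + 1, 2), 16));
                i += 2;
                continue;
            }
            // Anything else is malformed; keep the "=" as it is.
            bytes.Add((byte)'=');
        }
        return charset.GetString(bytes.ToArray());
    }
}
```

For the mail in the question, the call would be QuotedPrintable.Decode(myString, Encoding.GetEncoding("ISO-8859-1")). Because each escape is decoded exactly once, a sequence such as =3DE9 stays "=E9" instead of being decoded a second time.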

Related

OpenXml SDK excel accented French Chars (éèçà) [duplicate]

I've been trying this for quite a while now, but can't figure it out. I'm trying to export data to Excel via a *.csv file. It works great so far, but I have some encoding problems when opening the files in Excel.
(original string on the left, EXCEL result on the right):
Messwert(µm / m) ==> Messwert(Âµm / m)
Dümme Mässöng ==> DÃ¼mme MÃ¤ssÃ¶ng
Notepad++ tells me that the file is encoded "ANSI as UTF8"(WTF?)
So here are different ways I tried to get a valid result:
The obvious implementation:
    tWriter.Write(";Messwert(µm /m)");
More sophisticated attempts (I tried probably a dozen or more encoding combinations):
    tWriter.Write(Encoding.Default.GetString(Encoding.Unicode.GetBytes(";Messwert(µm /m)")));
    tWriter.Write(Encoding.ASCII.GetString(Encoding.Unicode.GetBytes(";Messwert(µm /m)")));
and so on.
Whole source code for the method creating the data:
    MemoryStream tStream = new MemoryStream();
    StreamWriter tWriter = new StreamWriter(tStream);
    tWriter.Write("\uFEFF");
    tWriter.WriteLine(string.Format("{0}", aMeasurement.Name));
    tWriter.WriteLine(aMeasurement.Comment);
    tWriter.WriteLine();
    tWriter.WriteLine("Zeit in Minuten;Messwert(µm / m)");
    TimeSpan tSpan;
    foreach (IMeasuringPoint tPoint in aMeasurement)
    {
        tSpan = new TimeSpan(tPoint.Time - aMeasurement[0].Time);
        tWriter.WriteLine(string.Format("{0};{1};", (int)tSpan.TotalMinutes, getMPString(tPoint)));
    }
    tWriter.Flush();
    return tStream;
Generated CSV file:
Dümme Mössäng
Testmessung die erste
Zeit in Minuten;Messwert(µm / m)
0;-703;
0;-381;
1;1039;
1;1045;
2;1457;
2;1045;
This worked perfectly for me:
    private const int WIN_1252_CP = 1252; // Windows ANSI codepage 1252
    this._writer = new StreamWriter(fileName, false, Encoding.GetEncoding(WIN_1252_CP));
CSV encoding issues (Microsoft Excel)
Try the following:
    using (var sw = File.Create(Path.Combine(txtPath.Text, "UTF8.csv")))
    {
        var preamble = Encoding.UTF8.GetPreamble();
        sw.Write(preamble, 0, preamble.Length);
        var data = Encoding.UTF8.GetBytes("懘荧,\"Hello\",text");
        sw.Write(data, 0, data.Length);
    }
It writes the proper UTF8 preamble to the file before writing the UTF8 encoded CSV.
This solution is written up as a fix for a Java application however you should be able to do something similar in C#. You may also want to look at the documentation on the StreamWriter class, in the remarks it refers to the Byte Order Mark (BOM).
"ANSI as UTF8"(WTF?)
Notepad++ is probably correct. The encoding is UTF8 (i.e., correct Unicode header), but the file only contains ANSI data (i.e., é is not encoded the correct UTF8 way, which would take two bytes).
Or it is the other way around: the file is ANSI (no BOM in the file header), but the encoding of the individual characters is, or looks like, UTF8. This would explain why ü and other characters each expand into more than one character (ü showing up as Ã¼, for instance). You can fix this by forcing the file to be read as Unicode.
If it's possible to post (part of) your CSV, we may be able to help fixing it at the source.
Edit
Now that we've seen your code: remove the hand-encoding of the BOM; it is not necessary. Instead, specify the encoding when you create the writer. TextWriter itself is abstract, so in practice that means using the StreamWriter constructor overload that takes an Encoding (don't use ASCII, try UTF8).
Trevor Germain's answer helped me save the file in the correct encoding: write the UTF-8 preamble from Encoding.UTF8.GetPreamble() to the file before the UTF-8 encoded data, exactly as in the File.Create snippet above.
I'd suggest you open up the text file in a hex editor, and see what it really is. The BOM for UTF-16 is 0xFEFF, which the writing code is apparently writing to the stream - but the rest of the writing doesn't specify an encoding to use - it would use the default encoding of the StreamWriter, which is UTF-8. There appears to be a mix up of encodings.
When you pop open the file in hex view, if you see lots of 0x00 between the characters, you're working with UTF-16, which is Encoding.Unicode in C#. If there are no 0x00 between chars, the encoding is probably UTF-8.
In the latter case, just fix up the BOM to be EF BB BF rather than FE FF, and read the file normally with UTF-8 encoding.
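If no hex editor is at hand, the same check can be done in code. The sketch below is only an illustration of the byte patterns mentioned above; the class name BomSniffer and its methods are invented for this example. It looks at UTF-8 and UTF-16 marks only, so a UTF-32 little-endian file (FF FE 00 00) would be reported as UTF-16.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class BomSniffer
{
    // Returns a label for the byte order mark at the start of the data, if any.
    // Only UTF-8 and UTF-16 marks are checked in this sketch.
    public static string Describe(byte[] head)
    {
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return "UTF-8 BOM (EF BB BF)";
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return "UTF-16 little-endian BOM (FF FE)";
        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return "UTF-16 big-endian BOM (FE FF)";
        return "no BOM";
    }

    // Formats the first bytes as hex, similar to the first row of a hex editor.
    public static string HexHead(byte[] data, int count = 16)
    {
        return BitConverter.ToString(data, 0, Math.Min(count, data.Length));
    }
}
```

A typical call would pass the result of File.ReadAllBytes(path) to both methods and print the two strings side by side.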
For my scenario using StreamWriter, I found that explicitly passing a UTF8 encoding to the StreamWriter enabled Excel to read the file with the correct encoding.
See this answer for more details:
https://stackoverflow.com/a/22306937/999048
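A minimal sketch of that idea, assuming the CSV is first built in memory as in the question: when a StreamWriter is created with new UTF8Encoding(true), it writes the EF BB BF preamble itself, so no manual "\uFEFF" is needed. The class and method names here are made up for the example.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper for illustration only.
public static class CsvExport
{
    // Writes the lines as text and returns the raw bytes.
    // UTF8Encoding(true) asks the StreamWriter to emit the UTF-8 preamble itself.
    public static byte[] ToUtf8WithBom(string[] lines)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new StreamWriter(stream, new UTF8Encoding(true)))
            {
                foreach (string line in lines)
                    writer.WriteLine(line);
            }
            // ToArray still works after the writer has closed the stream.
            return stream.ToArray();
        }
    }
}
```

The returned bytes can be written to disk or sent as a download; whether Excel then picks the right encoding still depends on the Excel version and on how the file is opened.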

MimeKit Character Encoding/Decoding Issue

While using MimeKit to convert .eml files to .msg files, I'm running into an issue that appears to be related to encoding.
With an EML file containing the following, for instance:
--__NEXTPART_20160610_5EF5CF91_471687D
Content-Type: text/plain; charset=iso-2022-jp
Content-Transfer-Encoding: 7bit
添付ファイル名テスト
The result is garbage in the body content:
・Y・t・t・#・C・・・シ・e・X・g
Additionally, base-64 encoded ü characters are showing up as ?? when the EML file is read. I've downloaded the latest release of MimeKit, but it doesn't seem to make a difference.
The .eml files open properly with Outlook 2016, but using MimeKit does not appear to be able to read and decode the files properly.
There are a few problems with your above MIME snippet :(
Content-Transfer-Encoding: 7bit is obviously not true, although that's not likely to be the problem (MimeKit ignores values of 7bit and 8bit for this very reason).
Most importantly, however, is the fact that the charset parameter is iso-2022-jp but the content itself is very clearly not iso-2022-jp (it looks like utf-8).
When you get the TextPart.Text value, MimeKit gets that string by converting the raw stream content using the charset specified in the Content-Type header. If that is wrong, then the Text property will also have the wrong value.
The good news is that TextPart has GetText methods that allow you to specify a charset override.
I would recommend trying:
var text = part.GetText (Encoding.UTF8);
See if that works.
FWIW, iso-2022-jp is an encoding that forces Japanese characters into a 7bit ascii form that looks like complete gibberish. This is what your Japanese text would look like if it was actually in iso-2022-jp:
BE:IU%U%!%$%kL>%F%9%H
That's how I know it's not iso-2022-jp :)
Update:
Ultimately, the solution will probably be something like this:
    var encodings = new List<Encoding> ();
    string text = null;
    try {
        var encoding = Encoding.GetEncoding (part.ContentType.Charset,
            new EncoderExceptionFallback (),
            new DecoderExceptionFallback ());
        encodings.Add (encoding);
    } catch (ArgumentException) {
    } catch (NotSupportedException) {
    }
    // add utf-8 as our first fallback
    encodings.Add (Encoding.GetEncoding (65001,
        new EncoderExceptionFallback (),
        new DecoderExceptionFallback ()));
    // add iso-8859-1 as our final fallback
    encodings.Add (Encoding.GetEncoding (28591,
        new EncoderExceptionFallback (),
        new DecoderExceptionFallback ()));
    for (int i = 0; i < encodings.Count; i++) {
        try {
            text = part.GetText (encodings[i]);
            break;
        } catch (DecoderFallbackException) {
            // this means that the content did not convert cleanly
        }
    }

Split string with response from gmail

After I retrieve messages from the mailbox, I want to separate the message body from the subject, date and other information, but I can't find the right algorithm. Here is my code:
    // create an instance of TcpClient
    TcpClient tcpclient = new TcpClient();
    // host name of the POP server; Gmail uses port 995 for POP
    tcpclient.Connect("pop.gmail.com", 995);
    // secure stream: opens the connection between the client and the POP server
    System.Net.Security.SslStream sslstream = new SslStream(tcpclient.GetStream());
    // authenticate as client
    sslstream.AuthenticateAsClient("pop.gmail.com");
    //bool flag = sslstream.IsAuthenticated; // check flag
    // assign the writer to the stream
    System.IO.StreamWriter sw = new StreamWriter(sslstream);
    // assign the reader to the stream
    System.IO.StreamReader reader = new StreamReader(sslstream);
    // see the POP RFC for the commands; there are very few, around 6-9
    sw.WriteLine("USER my_login");
    // send to the server
    sw.Flush();
    sw.WriteLine("PASS my_pass");
    sw.Flush();
    // this will retrieve your first email
    sw.WriteLine("RETR 1");
    sw.Flush();
    string str = string.Empty;
    string strTemp = string.Empty;
    while ((strTemp = reader.ReadLine()) != null)
    {
        // a line containing only "." ends the message
        if (strTemp == ".")
        {
            break;
        }
        if (strTemp.IndexOf("-ERR") != -1)
        {
            break;
        }
        str += strTemp;
    }
    // close the connection
    sw.WriteLine("Quit ");
    sw.Flush();
    richTextBox2.Text = str;
I have to extract:
The subject of message
The author
The date
The message body
Can anyone tell me how to do this?
String which I receive (str) contains the subject Test message and the body This is the text of test message. It looks like:
+OK Gpop ready for requests from 46.55.3.85 s42mb37199022eev+OK send PASS+OK Welcome.+OK message followsReturn-Path:
Received: from TMD-I31S3H51L29
(host-static-46-55-3-85.moldtelecom.md. [46.55.3.85]) by
mx.google.com with ESMTPSA id o5sm61119999eeg.8.2014.04.16.13.48.20
for (version=TLSv1
cipher=ECDHE-RSA-AES128-SHA bits=128/128); Wed, 16 Apr 2014
13:48:21 -0700 (PDT)Message-ID:
<534eec95.856b0e0a.55e1.6612@mx.google.com>MIME-Version: 1.0From:
mail_address@gmail.comTo: mail_address@gmail.comDate: Wed, 16 Apr 2014
13:48:21 -0700 (PDT)Subject: Test messageContent-Type: text/plain;
charset=us-asciiContent-Transfer-Encoding: quoted-printableThis is the
text of test message
Thank you very much!
What you first need to do is read rfc1939 to get an idea of the POP3 protocol. But immediately after reading that, you'll need to read the following list of RFCs... actually, screw it, I'm not going to paste the long list of them here, I'll just link you to the website of my MimeKit library which already has a fairly comprehensive list of them.
As your original code correctly did, it needs to keep reading from the socket until the termination sequence (".\r\n") is encountered, thus terminating the message stream.
The way you are doing it is really inefficient, but whatever, it'll (mostly) work except for the fact that you need to undo any/all byte-stuffing that is done by the POP3 server to munge lines beginning with a period ('.'). For more details, read the POP3 specification I linked above.
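To make the byte-stuffing point concrete, here is a minimal sketch of the un-stuffing step described in rfc1939: a line consisting of a single "." ends the response, and any other line that starts with "." has had that first "." added by the server. The class and method names are invented for this illustration, and the sketch assumes the lines have already been read without their CRLF terminators and that the "+OK" status line has been consumed separately.

```csharp
using System.Collections.Generic;

// Hypothetical helper for illustration only.
public static class Pop3Lines
{
    // Takes the lines of a multi-line POP3 response (without line terminators),
    // stops at the "." termination line and removes the byte-stuffing.
    public static List<string> Unstuff(IEnumerable<string> responseLines)
    {
        var message = new List<string>();
        foreach (string line in responseLines)
        {
            if (line == ".")
                break; // termination line: end of the message
            // A leading "." on any other line was added by the server; drop it.
            bool stuffed = line.Length > 0 && line[0] == '.';
            message.Add(stuffed ? line.Substring(1) : line);
        }
        return message;
    }
}
```

Keeping the lines in a list, rather than concatenating them into one string, also preserves the line breaks that header parsing depends on.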
To parse the headers, you'll need to read rfc822. Suffice it to say, Olivier's approach will fall flat on its face, most likely the second it tries to 'split' any real-world messages... unless it gets extremely lucky.
As a hint, the message body is separated from the headers by a blank line.
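A rough sketch of that hint, under heavy assumptions: the message is already available as a list of lines, header values are plain ASCII, and nothing is MIME-decoded. It splits at the first blank line and joins folded header lines (continuation lines that start with a space or tab). The names MessageSplitter and Split are made up for this example; this is nowhere near a real rfc822 parser.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not a full rfc822 parser.
public static class MessageSplitter
{
    // Splits message lines at the first blank line and unfolds continuation lines.
    public static List<KeyValuePair<string, string>> Split(List<string> lines, out string body)
    {
        var headers = new List<KeyValuePair<string, string>>();
        int i = 0;
        for (; i < lines.Count && lines[i].Length > 0; i++)
        {
            string line = lines[i];
            if ((line[0] == ' ' || line[0] == '\t') && headers.Count > 0)
            {
                // Folded line: it continues the value of the previous header.
                var last = headers[headers.Count - 1];
                headers[headers.Count - 1] =
                    new KeyValuePair<string, string>(last.Key, last.Value + " " + line.Trim());
                continue;
            }
            int colon = line.IndexOf(':');
            if (colon > 0)
                headers.Add(new KeyValuePair<string, string>(
                    line.Substring(0, colon), line.Substring(colon + 1).Trim()));
        }
        // Skip the blank separator line; everything after it is the body.
        int bodyStart = Math.Min(i + 1, lines.Count);
        body = string.Join("\r\n", lines.GetRange(bodyStart, lines.Count - bodyStart));
        return headers;
    }
}
```

Headers are kept in a list rather than a dictionary because names such as Received can legitimately appear more than once.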
Here's a few other problems you are likely to eventually run into:
Header values are supposed to be encoded if they contain non-ASCII text (see rfc2047 and rfc2231 for details).
Some header values in the wild are not properly encoded, and sometimes, even though they are not supposed to, include undeclared 8-bit text. Dealing with this is non-trivial. This also means that you cannot really use a StreamReader to read lines as you'll lose the original byte sequences.
If you actually want to do anything with the body of the message, you'll have to write a MIME parser.
I'd highly recommend using MimeKit and my other library, MailKit, for POP3 support.
Trust me, you are in for a world of pain trying to do this the way you are trying to do it.
String.Split is not powerful enough for this task. You will have to use Regex. The pattern that I suggest is:
^(?<name>\w+): (?<value>.*?)$
The meaning is:
^ Beginning of line (if you use the multiline option).
(?<name>pattern) Capturing group where the group name is "name".
\w+ A word.
.*? Any sequence of characters (for the value)
$ End of line
This code ...
    MatchCollection matches =
        Regex.Matches(text, @"^(?<name>\w+): (?<value>.*?)$", RegexOptions.Multiline);
    foreach (Match match in matches) {
        Console.WriteLine("{0} = {1}",
            match.Groups["name"].Value,
            match.Groups["value"].Value
        );
    }
... produces this output:
Received = from TMD-I31S3H51L29 (host-static-46-55-3-85.m ...
From = mail_address@gmail.com
To = mail_address@gmail.com
Date = Wed, 16 Apr 2014 13:48:21 -0700 (PDT)
Subject = Test message
The body seems to start after the "Content-Transfer-Encoding:" line and to run to the end of the string. You can find the body like this:
    Match body =
        Regex.Match(text, @"^Content-Transfer-Encoding: .*?$", RegexOptions.Multiline);
    if (body.Success) {
        Console.WriteLine(text.Substring(body.Index + body.Length + 1));
    }
If the lines are separated by line feeds only, RegexOptions.Multiline might not work. In that case you would have to replace the beginning-of-line and end-of-line symbols (^ and $) with \n in the regex expressions.

A socket message from Python to C# comes through garbled

I'm trying to set up a very basic ZeroMQ-based socket link between Python server and C# client using simplejson and Json.NET.
I try to send a dict from Python and read it into an object in C#. Python code:
message = {'MessageType':"None", 'ContentType':"None", 'Content':"OK"}
message_blob = simplejson.dumps(message).encode(encoding = "UTF-8")
alive_socket.send(message_blob)
The message is sent as normal UTF-8 string or, if I use UTF-16, as "'\xff\xfe{\x00"\x00..." etc.
Code in C# is where my problem is:
string reply = client.Receive(Encoding.UTF8);
The UTF-8 message is received as "≻潃瑮湥≴›..." etc.
I tried to use UTF-16 and the message comes through OK, but the first symbols are still the little-endian \xFF \xFE BOM so when I try to feed it to the deserializer,
PythonMessage replyMessage = JsonConvert.DeserializeObject<PythonMessage>(reply);
//PythonMessage is just a very simple class with properties,
//not relevant to the problem
I get an error (obviously occurring at the first symbol, \xFF):
Unexpected character encountered while parsing value: .
Something is obviously wrong in the way I'm using encoding. Can you please show me the right way to do this?
The byte-order-mark is obligatory in UTF-16. You can use UTF-16LE or UTF-16BE to assume a particular byte order and the BOM will not be generated. That is, use:
message_blob = simplejson.dumps(message).encode(encoding = "UTF-16le")
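If changing the Python side is not an option, the mark can also be dropped on the C# side before deserializing. The sketch below assumes the raw payload is available as a byte[] (how to get that from the ZeroMQ binding in use is not shown here), and the class name ReplyDecoder is made up for the example.

```csharp
using System.Text;

// Hypothetical helper for illustration only.
public static class ReplyDecoder
{
    // Decodes UTF-16 little-endian bytes and drops a leading byte order mark if present.
    public static string DecodeUtf16(byte[] payload)
    {
        string text = Encoding.Unicode.GetString(payload);
        // GetString keeps the BOM as the character U+FEFF, which a JSON parser may reject.
        return text.Length > 0 && text[0] == '\uFEFF' ? text.Substring(1) : text;
    }
}
```

The returned string can then be handed to JsonConvert.DeserializeObject as in the question.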

Parsing emails for TIFF attachments in C#

I built an email parser that extracts TIFF attachments from emails sent by 2 different fax providers, RingCentral and eFax.
The application uses Pop3 to retrieve the email as a text stream and then parse the text to identify the section that represents the Tiff image.
By converting that section of text to a byte array and using a BinaryWriter, I'm able to create the TIFF file on my local hard drive.
    public void SaveToFile(string filepath)
    {
        // The using block flushes and closes the writer (and its stream) even if Write throws.
        using (BinaryWriter bw = new BinaryWriter(new FileStream(filepath, FileMode.Create)))
        {
            bw.Write(this.Data);
        }
    }
The issue is that the eFax email attachments cause runtime errors when converting the text to a byte array.
//_data is a byte array
//RawData is a string
_data = Convert.FromBase64String(RawData); //fails on this line
I get the following error:
The input is not a valid Base-64 string as it contains a non-base 64 character, more than two padding characters, or a non-white space character among the padding characters.
I assume it has something to do with the encoding/decoding of the string, but I've tried various encoding types and still get the error.
Some additional information:
Programming Language: C#
Email Host: GMail
If I manually forward the email back to myself, the parser works, but will not work against the original.
I even tried auto-forwarding in GMail but this did not work.
I'm responding here to the first comment below and thanks for your response.
The TIFF file is created by taking the section of text from the email that is associated to the TIFF file attachment, converting it to a byte array, and saving the file with a .tiff file extension. This works fine for all RingCentral emails. For example, the RingCentral email section header looks like this:
------=_NextPart_3327195283162919167883
Content-Type: image/tiff; name="18307730038-0803-141603-326.tif"
Content-Transfer-Encoding: base64
Content-Description: 18307730038-0803-141603-326.tif
Content-Disposition: attachment; filename="18307730038-0803-141603-326.tif"
Please note the Content-Transfer-Encoding value of base64. This explains why I use the following C# conversion code:
_data = Convert.FromBase64String(tiffEmailString);
_data is the private field behind the this.Data property, which is what the SaveToFile method above writes to disk.
Now for the eFax section header (from the email that fails):
Content-Type: image/tiff; name=FAX_20130802_1375447833_61.tif
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="FAX_20130802_1375447833_61.tif"
Content-MD5: 1B2M2Y8AsgTpgAmY7PhCfg==
It too shows base64. So shouldn't Convert.FromBase64String() method call work?
I'm also going to check whether my parser is grabbing additional text. But if I'm missing something, please point it out. Thanks.
Latest update:
As it turns out the issue was not the encoding but my parser! I was inadvertently including an additional header value in the attachment text. It's working now. Thanks.
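For anyone hitting the same error, here is a minimal sketch of one way to keep header lines out of the base64 text. It is not the parser from this question; the class and method names are invented, and it assumes the input is the text of a single MIME part: header lines, a blank line, then the base64 lines, optionally followed by the next boundary line.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only.
public static class AttachmentDecoder
{
    // Decodes the base64 body of one MIME part, ignoring the part headers.
    public static byte[] DecodeBase64Part(string partText)
    {
        string normalized = partText.Replace("\r\n", "\n");
        int blank = normalized.IndexOf("\n\n", StringComparison.Ordinal);
        // Everything after the first blank line is the body; headers such as Content-MD5 are dropped.
        string body = blank >= 0 ? normalized.Substring(blank + 2) : normalized;
        var base64 = new StringBuilder();
        foreach (string line in body.Split('\n'))
        {
            string trimmed = line.Trim();
            if (trimmed.StartsWith("--", StringComparison.Ordinal))
                break; // next MIME boundary: the attachment data has ended
            base64.Append(trimmed);
        }
        return Convert.FromBase64String(base64.ToString());
    }
}
```

Splitting on the blank line, rather than on a fixed list of known header names, means an extra header like the Content-MD5 line in the eFax part cannot leak into the data.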
