MimeKit Character Encoding/Decoding Issue - C#

While using MimeKit to convert .eml files to .msg files, I'm running into an issue that appears to be related to encoding.
With an EML file containing the following, for instance:
--__NEXTPART_20160610_5EF5CF91_471687D
Content-Type: text/plain; charset=iso-2022-jp
Content-Transfer-Encoding: 7bit
添付ファイル名テスト
The result is garbage in the body content:
・Y・t・t・#・C・・・シ・e・X・g
Additionally, base64-encoded ü characters show up as ?? when the EML file is read. I've downloaded the latest release of MimeKit, but it doesn't seem to make a difference.
The .eml files open properly in Outlook 2016, but MimeKit does not appear to read and decode them correctly.

There are a few problems with your above MIME snippet :(
Content-Transfer-Encoding: 7bit is obviously not true, although that's not likely to be the problem (MimeKit ignores values of 7bit and 8bit for this very reason).
Most important, however, is the fact that the charset parameter is iso-2022-jp while the content itself is very clearly not iso-2022-jp (it looks like utf-8).
When you get the TextPart.Text value, MimeKit gets that string by converting the raw stream content using the charset specified in the Content-Type header. If that is wrong, then the Text property will also have the wrong value.
The good news is that TextPart has GetText methods that allow you to specify a charset override.
I would recommend trying:
var text = part.GetText (Encoding.UTF8);
See if that works.
FWIW, iso-2022-jp is an encoding that forces Japanese characters into a 7-bit ASCII form that looks like complete gibberish. This is what your Japanese text would look like if it were actually in iso-2022-jp:
BE:IU%U%!%$%kL>%F%9%H
That's how I know it's not iso-2022-jp :)
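If you want to verify that yourself, something along these lines should reproduce it (just a sketch; on .NET Core/.NET 5+ the iso-2022-jp code page requires registering the System.Text.Encoding.CodePages provider first):
// Encoding.RegisterProvider (CodePagesEncodingProvider.Instance); // only needed on .NET Core/5+
var jis = Encoding.GetEncoding ("iso-2022-jp");
var bytes = jis.GetBytes ("添付ファイル名テスト");
// Every byte is 7-bit ASCII, so printing it as ASCII shows the escape-sequence form.
Console.WriteLine (Encoding.ASCII.GetString (bytes));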
Update:
Ultimately, the solution will probably be something like this:
var encodings = new List<Encoding> ();
string text = null;

try {
    var encoding = Encoding.GetEncoding (part.ContentType.Charset,
        new EncoderExceptionFallback (),
        new DecoderExceptionFallback ());
    encodings.Add (encoding);
} catch (ArgumentException) {
} catch (NotSupportedException) {
}

// add utf-8 as our first fallback
encodings.Add (Encoding.GetEncoding (65001,
    new EncoderExceptionFallback (),
    new DecoderExceptionFallback ()));

// add iso-8859-1 as our final fallback
encodings.Add (Encoding.GetEncoding (28591,
    new EncoderExceptionFallback (),
    new DecoderExceptionFallback ()));

for (int i = 0; i < encodings.Count; i++) {
    try {
        text = part.GetText (encodings[i]);
        break;
    } catch (DecoderFallbackException) {
        // this means that the content did not convert cleanly
    }
}
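For completeness, here is roughly where part would come from when walking a parsed message (a sketch; the file name is hypothetical and assumes MimeKit plus LINQ):
using System.Linq;
using MimeKit;

// Load the .eml file and grab the first text/plain body part.
var message = MimeMessage.Load ("message.eml");
var part = message.BodyParts.OfType<TextPart> ()
                  .FirstOrDefault (p => p.ContentType.IsMimeType ("text", "plain"));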

Related

OpenXml SDK excel accented French Chars (éèçà) [duplicate]

I've been trying this for quite a while now, but can't figure it out. I'm trying to export data to Excel via a *.csv file. It works great so far, but I have some encoding problems when opening the files in Excel.
(original string on the left, EXCEL result on the right):
Messwert(µm / m) ==> Messwert(Âµm / m)
Dümme Mässöng ==> DÃ¼mme MÃ¤ssÃ¶ng
Notepad++ tells me that the file is encoded "ANSI as UTF8"(WTF?)
So here are different ways I tried to get a valid result:
obvious implementation:
tWriter.Write(";Messwert(µm /m)");
a more sophisticated one (I tried probably a dozen or more encoding combinations):
tWriter.Write(Encoding.Default.GetString(Encoding.Unicode.GetBytes(";Messwert(µm /m)")));
tWriter.Write(Encoding.ASCII.GetString(Encoding.Unicode.GetBytes(";Messwert(µm /m)")));
and so on
Whole source code for the method creating the data:
MemoryStream tStream = new MemoryStream();
StreamWriter tWriter = new StreamWriter(tStream);
tWriter.Write("\uFEFF");
tWriter.WriteLine(string.Format("{0}", aMeasurement.Name));
tWriter.WriteLine(aMeasurement.Comment);
tWriter.WriteLine();
tWriter.WriteLine("Zeit in Minuten;Messwert(µm / m)");
TimeSpan tSpan;
foreach (IMeasuringPoint tPoint in aMeasurement)
{
    tSpan = new TimeSpan(tPoint.Time - aMeasurement[0].Time);
    tWriter.WriteLine(string.Format("{0};{1};", (int)tSpan.TotalMinutes, getMPString(tPoint)));
}
tWriter.Flush();
return tStream;
Generated CSV file:
Dümme Mössäng
Testmessung die erste
Zeit in Minuten;Messwert(µm / m)
0;-703;
0;-381;
1;1039;
1;1045;
2;1457;
2;1045;
This worked perfectly for me:
private const int WIN_1252_CP = 1252; // Windows ANSI codepage 1252
this._writer = new StreamWriter(fileName, false, Encoding.GetEncoding(WIN_1252_CP));
CSV encoding issues (Microsoft Excel)
try the following:
using (var sw = File.Create(Path.Combine(txtPath.Text, "UTF8.csv")))
{
    var preamble = Encoding.UTF8.GetPreamble();
    sw.Write(preamble, 0, preamble.Length);
    var data = Encoding.UTF8.GetBytes("懘荧,\"Hello\",text");
    sw.Write(data, 0, data.Length);
}
It writes the proper UTF-8 preamble to the file before writing the UTF-8 encoded CSV.
This solution is written up as a fix for a Java application; however, you should be able to do something similar in C#. You may also want to look at the documentation on the StreamWriter class; in the remarks it refers to the Byte Order Mark (BOM).
"ANSI as UTF8"(WTF?)
Notepad++ is probably correct. The file is marked as UTF-8 (i.e., it has the correct Unicode header), but it only contains ANSI data (i.e., é is not encoded the correct UTF-8 way, which would take two bytes).
Or it is the other way around: the file is ANSI (no BOM in the file header), but the individual characters are encoded as, or look like, UTF-8. That would explain ü and other characters expanding into more than one character. You can fix this by forcing the file to be read as Unicode.
If it's possible to post (part of) your CSV, we may be able to help fixing it at the source.
Edit
Now that we've seen your code: can you remove the StreamWriter and replace it with a TextWriter? Also, remove the hand-encoding of the BOM; it is not necessary. When you create a TextWriter, you can specify the encoding (don't use ASCII, try UTF8).
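A minimal sketch of that suggestion, assuming the data goes straight to a file (the path and sample lines are placeholders); a UTF8Encoding that emits its preamble replaces the hand-written "\uFEFF":
// StreamWriter is a TextWriter; UTF8Encoding(true) writes the EF BB BF preamble for us.
using (var writer = new StreamWriter("export.csv", false, new UTF8Encoding(true)))
{
    writer.WriteLine("Zeit in Minuten;Messwert(µm / m)");
    writer.WriteLine("0;-703;");
}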
Trevor Germain's answer helped me save in the correctly encoded format:
using (var sw = File.Create(Path.Combine(txtPath.Text, "UTF8.csv")))
{
    var preamble = Encoding.UTF8.GetPreamble();
    sw.Write(preamble, 0, preamble.Length);
    var data = Encoding.UTF8.GetBytes("懘荧,\"Hello\",text");
    sw.Write(data, 0, data.Length);
}
I'd suggest you open the text file in a hex editor and see what it really is. The BOM for UTF-16 is 0xFEFF, which the writing code is apparently writing to the stream, but the rest of the writing doesn't specify an encoding to use, so it would use the default encoding of the StreamWriter, which is UTF-8. There appears to be a mix-up of encodings.
When you pop open the file in hex view, if you see lots of 0x00 between the characters, you're working with UTF-16, which is Encoding.Unicode in C#. If there are no 0x00 between chars, the encoding is probably UTF-8.
If it's the latter, just fix up the BOM to be EF BB BF rather than FE FF and read normally with UTF-8 encoding.
For my scenario using StreamWriter, I found that explicitly passing UTF-8 encoding to the StreamWriter enabled Excel to read the file using the correct encoding.
See this answer for more details:
https://stackoverflow.com/a/22306937/999048

Read txt files (in unicode and utf8) by means of C#

I created two txt files (Windows Notepad) with the same content, "thank you - спасибо", and saved them as utf8 and unicode. In Notepad they look fine. Then I tried to read them using .NET:
...File.ReadAllText(utf8FileFullName, Encoding.UTF8);
and
...File.ReadAllText(unicodeFileFullName, Encoding.Unicode);
But in both cases I got this "thank you - ???????". What's wrong?
Upd:
code for utf8
static void Main(string[] args)
{
    var encoding = Encoding.UTF8;
    var file = new FileInfo(@"D:\encodes\enc.txt");
    Console.OutputEncoding = encoding;
    var content = File.ReadAllText(file.FullName, encoding);
    Console.WriteLine("encoding: " + encoding);
    Console.WriteLine("content: " + content);
    Console.ReadLine();
}
Result:
thanks ÑпаÑибо
Edited, as UTF-8 should support those characters. It seems that you're outputting to a console or another location that hasn't had its encoding set. If so, you need to set the encoding. For the console you can do this:
string allText = File.ReadAllText(unicodeFileFullName, Encoding.UTF8);
Console.OutputEncoding = Encoding.UTF8;
Console.WriteLine(allText);
Use the Encoding type Default:
File.ReadAllText(unicodeFileFullName, Encoding.Default);
It will fix the ???? characters.
When outputting Unicode or UTF-8 encoded multi-byte characters to the console, you will need to set the encoding and also ensure the console is using a font that supports those multi-byte characters, so the corresponding glyphs can be displayed. With your existing code, a MessageBox.Show(content) or display on a Windows or web form would appear correctly.
Have a look at http://msdn.microsoft.com/en-us/library/system.console.aspx for an explanation on setting fonts within the console window.
"Support for Unicode characters requires the encoder to recognize a particular Unicode character, and also requires a font that has the glyphs needed to render that character. To successfully display Unicode characters to the console, the console font must be set to a non-raster or TrueType font such as Consolas or Lucida Console."
As a side note, you can use the FileStream class to read the first three bytes of the file and look for the byte order mark indicator to automatically set the encoding when reading the file. For example, if byte[0] == 0xEF && byte[1] == 0xBB && byte[2] == 0xBF then you have a UTF-8 encoded file. Refer to http://en.wikipedia.org/wiki/Byte_order_mark for more information.
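A rough sketch of that side note (the helper name and the no-BOM fallback are my own choices; it assumes System.IO and System.Text):
static Encoding DetectEncoding(string path)
{
    // Read up to three bytes and inspect them for a byte order mark.
    var bom = new byte[3];
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        fs.Read(bom, 0, 3);

    if (bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF)
        return Encoding.UTF8;              // UTF-8 with BOM
    if (bom[0] == 0xFF && bom[1] == 0xFE)
        return Encoding.Unicode;           // UTF-16 little endian
    if (bom[0] == 0xFE && bom[1] == 0xFF)
        return Encoding.BigEndianUnicode;  // UTF-16 big endian

    return Encoding.UTF8;                  // no BOM found; assume UTF-8
}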

Parsing emails for TIFF attachments in C#

I built an email parser that extracts TIFF attachments from emails sent by 2 different fax providers, RingCentral and eFax.
The application uses POP3 to retrieve the email as a text stream and then parses the text to identify the section that represents the TIFF image.
By converting that section of text to a byte array and using a BinaryWriter, I'm able to create the TIFF file on my local hard drive.
public void SaveToFile(string filepath)
{
    BinaryWriter bw = new BinaryWriter(new FileStream(filepath, FileMode.Create));
    bw.Write(this.Data);
    bw.Flush();
    bw.Close();
}
The issue is that the eFax email attachments cause runtime errors when converting the text to a byte array.
//_data is a byte array
//RawData is a string
_data = Convert.FromBase64String(RawData); //fails on this line
I get the following error:
The input is not a valid Base-64 string as it contains a non-base 64 character, more than two padding characters, or a non-white space character among the padding characters.
I assume it has something to do with the encoding/decoding of the string, but I've tried various encoding types and still get the error.
Some additional information:
Programming Language: C#
Email Host: GMail
If I manually forward the email back to myself, the parser works, but will not work against the original.
I even tried auto-forwarding in GMail but this did not work.
I'm responding here to the first comment below and thanks for your response.
The TIFF file is created by taking the section of text from the email that is associated with the TIFF attachment, converting it to a byte array, and saving the file with a .tiff file extension. This works fine for all RingCentral emails. For example, the RingCentral email section header looks like this:
------=_NextPart_3327195283162919167883
Content-Type: image/tiff; name="18307730038-0803-141603-326.tif"
Content-Transfer-Encoding: base64
Content-Description: 18307730038-0803-141603-326.tif
Content-Disposition: attachment; filename="18307730038-0803-141603-326.tif"
Please note the Content-Transfer-Encoding value of base64. This explains why I use the following C# conversion code:
_data = Convert.FromBase64String(tiffEmailString);
_data is the private field behind the this.Data property used in the SaveToFile method above (i.e., this.Data returns _data).
Now for the eFax (the email that fails) section header:
Content-Type: image/tiff; name=FAX_20130802_1375447833_61.tif
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="FAX_20130802_1375447833_61.tif"
Content-MD5: 1B2M2Y8AsgTpgAmY7PhCfg==
It too shows base64, so shouldn't the Convert.FromBase64String() call work?
I'm also going to check whether my parser is grabbing additional text, but if I'm missing something, please point it out. Thanks.
Latest update:
As it turns out, the issue was not the encoding but my parser! I was inadvertently including an additional header value in the attachment text. It's working now. Thanks.
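For anyone hitting the same error, a hedged sketch of the kind of guard that would have caught this (rawPartText is a hypothetical variable holding the text of one MIME part, headers included):
// The base64 payload starts after the blank line that terminates the part headers;
// any header text left in the string makes Convert.FromBase64String throw.
int bodyStart = rawPartText.IndexOf("\r\n\r\n", StringComparison.Ordinal);
string base64Body = bodyStart >= 0 ? rawPartText.Substring(bodyStart + 4) : rawPartText;

// Convert.FromBase64String tolerates the line breaks inside the body itself.
byte[] data = Convert.FromBase64String(base64Body);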

Accented characters displayed as hex values in mail source file

I have to convert the content of a mail message to XML format, but I am facing some encoding problems: all my accented characters, and some others, appear in the message file as their hex values.
For example:
é is displayed as =E9,
ô is displayed as =F4,
= is displayed as =3D...
The mail is configured to be sent with iso-8859-1 encoding, and I can see these parameters in the file:
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Notepad++ detects the file as "ANSI as UTF-8".
I need to convert it in C# (I am in a script task in an SSIS project) so that it is readable, and I cannot manage to do that.
I tried reading it as UTF-8 in my StreamReader, but that does nothing. Despite my reading on the topic, I still do not really understand the steps that lead to my problem or how to solve it.
Note that Outlook decodes the message fine and the accented characters are displayed correctly.
Thanks in advance.
OK, I was looking in the wrong direction. The keyword here is "Quoted-Printable". This is where my issue comes from and this is what I really have to decode.
In order to do that, I followed the example posted by Martin Murphy in this thread:
C#: Class for decoding Quoted-Printable encoding?
The method described is:
public static string DecodeQuotedPrintables(string input)
{
    var occurences = new Regex(@"=[0-9A-F]{2}", RegexOptions.Multiline);
    var matches = occurences.Matches(input);
    foreach (Match match in matches)
    {
        char hexChar = (char) Convert.ToInt32(match.Groups[0].Value.Substring(1), 16);
        input = input.Replace(match.Groups[0].Value, hexChar.ToString());
    }
    return input.Replace("=\r\n", "");
}
To summarize, I open a StreamReader in UTF-8 and append each line read to a string, like this:
myString += line + "\r\n";
I then open my StreamWriter in UTF-8 too and write the decoded myString variable to it:
myStreamWriter.WriteLine(DecodeQuotedPrintables(myString));
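As a quick sanity check against the sequences quoted above (the input string here is made up):
// "=E9" decodes to 'é', "=F4" to 'ô', and "=3D" back to '='.
string decoded = DecodeQuotedPrintables("d=E9p=F4t =3D depot");
Console.WriteLine(decoded); // prints: dépôt = depot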

