Hebrew in Html email body is not readable - c#

I am trying to parse a Gmail email. I am using IMAP methods and so far so good.
My problem is with HTML emails. I searched everywhere for a way to convert an HTML body to plain text, but nothing works for me, so I am trying to do it myself. I take the HTML, strip all the attributes, and now I have an encoding issue.
Some of my emails are in Hebrew, and the Hebrew in the HTML looks like this:
=F0=E0 =F6=F8=E5 =E0=E9=FA=E9 =F7=F9=F8 =E1=E1=F7=F9=E4 =E1=E8=EC=F4=
=E5=EF
I tried converting it from hex to a string, but the result wasn't perfect: some words were missing.
How can I convert it to Hebrew characters?
Thanks a lot,
Elad

It seems you have some encoding issues with the HTML you receive.
You're going to need to convert it to the correct encoding.
This works:
Encoding latinEncoding = Encoding.GetEncoding("Windows-1252");
Encoding hebrewEncoding = Encoding.GetEncoding("Windows-1255");
string msys = "=F0=E0 =F6=F8=E5 =E0=E9=FA=E9 =F7=F9=F8 =E1=E1=F7=F9=E4 =E1=E8=EC=F4=E5=EF";
msys = System.Web.HttpUtility.UrlDecode(msys.Replace('=', '%').Replace(" ", "%20"), latinEncoding);
byte[] latinBytes = latinEncoding.GetBytes(msys);
string hebrewString = hebrewEncoding.GetString(latinBytes);
The first part of your problem is that the =F0=E0... text is quoted-printable encoded: each =XX is one byte written as two hex digits. That is the same idea as URL encoding, with a = instead of a % at the beginning, so as a shortcut we replace the problematic characters and UrlDecode it.
Afterwards, we convert it from the Windows-1252 encoding to the Windows-1255 encoding.
As a side note: the = at the end of the first line of your example (=F4= followed by =E5=EF on the next line) is a quoted-printable soft line break, not part of the text. Remove it together with the line break before decoding, as the string above does.
I tested it and it works fine on your string... בהצלחה
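For mail bodies in general, a small quoted-printable decoder avoids the UrlDecode detour. The following is a minimal sketch, not a full RFC 2045 implementation: the class and method names are hypothetical, the encoded text is assumed to be plain ASCII, and the charset is assumed to be Windows-1255 (in a real message it should come from the Content-Type header). On .NET Core / .NET 5+ the code-pages provider has to be registered before Windows-1255 is available.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class QuotedPrintable
{
    // Hypothetical helper: decodes quoted-printable text to bytes,
    // then interprets those bytes in the given charset.
    public static string Decode(string input, Encoding charset)
    {
        var bytes = new List<byte>(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c != '=')
            {
                bytes.Add((byte)c); // assumes the encoded text itself is ASCII
                continue;
            }
            // Soft line break: "=" directly before LF or CRLF is dropped.
            if (i + 1 < input.Length && input[i + 1] == '\n') { i += 1; continue; }
            if (i + 2 < input.Length && input[i + 1] == '\r' && input[i + 2] == '\n') { i += 2; continue; }
            // "=XX": one byte written as two hex digits.
            if (i + 2 < input.Length &&
                byte.TryParse(input.Substring(i + 1, 2), NumberStyles.AllowHexSpecifier,
                              CultureInfo.InvariantCulture, out byte b))
            {
                bytes.Add(b);
                i += 2;
                continue;
            }
            bytes.Add((byte)'='); // malformed escape: keep the "=" literally
        }
        return charset.GetString(bytes.ToArray());
    }

    // Assumption: .NET Core / .NET 5+, where Windows-1255 needs the code-pages provider.
    public static Encoding Hebrew()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1255);
    }
}
```

Calling QuotedPrintable.Decode(body, QuotedPrintable.Hebrew()) on the sample from the question, with its line break left in place, should give the Hebrew text without losing the word that was split across lines.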

Related

Why is UTF-7 interpreting umlauts correct and UTF-8 not?

I have a Base64 string which I want to convert and decode to UTF-8 like this:
byte[] encodedDataAsBytes = System.Convert.FromBase64String(vcard);
return Encoding.UTF8.GetString(encodedDataAsBytes);
This is because umlauts in the string need to be displayed correctly. The problem I face is that when I use UTF-8 as the encoding, the umlauts are NOT handled correctly. But when I use UTF-7
return Encoding.UTF7.GetString(encodedDataAsBytes);
everything works fine.
Why's that? Shouldn't UTF-8 be able to handle umlauts?
Your vcard is UTF-7 encoded.
This is why Encoding.UTF7.GetString(encodedDataAsBytes); gives you the right result.
Once the data has been encoded, you cannot choose a different encoding when decoding it.
To use UTF-8, you would need access to the string before the variable vcard got its value.
I had a similar problem. In my case, I used JavaScript btoa() to encode a filename to Base64 within the web UI and sent it over to the server. On the server side (.NET Core), I used the code below to decode it back to a string filename.
// Note: encodedFilename is the result of btoa() from the client web UI.
var raw = Convert.FromBase64String(encodedFilename);
var filename = Encoding.UTF8.GetString(raw);
It failed to decode ä. It worked when I used Encoding.UTF7, but I don't think that is the right solution. I believe this is due to mismatched encode/decode types: btoa() is binary to ASCII and writes each character as a single byte, so ä is not sent as UTF-8. What I really need is b64EncodeUnicode().
function b64EncodeUnicode(str) {
    return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g, function(match, p1) {
        return String.fromCharCode('0x' + p1);
    }));
}
Code Reference: https://developer.mozilla.org/en-US/docs/Glossary/Base64
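On the server, the decoding has to match whichever client function produced the Base64. A minimal sketch of both cases, with hypothetical class and method names, assuming the client used either plain btoa() (one byte per character code, which lines up with ISO-8859-1) or b64EncodeUnicode() (UTF-8 bytes):

```csharp
using System;
using System.Text;

public static class Base64Text
{
    // For plain btoa() output: each decoded byte is one character code (0-255).
    public static string FromBtoa(string base64)
    {
        byte[] raw = Convert.FromBase64String(base64);
        return Encoding.GetEncoding("iso-8859-1").GetString(raw);
    }

    // For b64EncodeUnicode() output: the decoded bytes are UTF-8.
    public static string FromUtf8Base64(string base64)
    {
        byte[] raw = Convert.FromBase64String(base64);
        return Encoding.UTF8.GetString(raw);
    }
}
```

For the single character ä, btoa() produces "5A==" (the byte E4) while b64EncodeUnicode() produces "w6Q=" (the bytes C3 A4), which is why decoding the btoa() output as UTF-8 fails.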

Generic solution needed for decoding Cyrillic string encoded in UTF-8 in C#

I am getting ÐиÑилл ÐаÑанник from a C++ component and I need to decode it. The string is always UTF-8 encoded. After much R&D, I figured out the following way to decode it.
String text = Encoding.UTF8
.GetString(Encoding.GetEncoding("iso-8859-1")
.GetBytes("ÐиÑилл ÐаÑанник"));
But isn't this hardcoding "iso-8859-1"? What if characters other than Cyrillic come up? I want a generic method for decoding a UTF-8 string.
Thanks in advance.
When you type text, the computer sees only bytes. In this case, when you type Cyrillic characters into your C++ program, the computer converts each character to its corresponding UTF-8 encoded bytes.
string typedByUser = "Привет мир!";
byte[] input = Encoding.UTF8.GetBytes(typedByUser);
Then your C++ program comes along, looks at the bytes and thinks it is ISO-8859-1 encoded.
string cppString = Encoding.GetEncoding("iso-8859-1").GetString(input);
// ÐÑÐ¸Ð²ÐµÑ Ð¼Ð¸Ñ!
Not much you can do about that. Then you get the wrongly decoded string and have to assume it is UTF-8 that was incorrectly decoded as ISO-8859-1. This assumption proves to be correct, but there is no way to determine that from the string itself.
byte[] decoded = Encoding.GetEncoding("iso-8859-1").GetBytes(cppString);
string text = Encoding.UTF8.GetString(decoded);
// Привет мир!
Note that ISO-8859-1 is the ISO West-European encoding, and has nothing to do with the fact that the original input was Cyrillic. For example, if the input was Japanese UTF-8 encoded, your C++ program would still interpret it as ISO-8859-1:
string typedByUser = "こんにちは、世界!";
byte[] input = Encoding.UTF8.GetBytes(typedByUser);
string cppString = Encoding.GetEncoding("iso-8859-1").GetString(input);
// ããã«ã¡ã¯ãä¸çï¼
byte[] decoded = Encoding.GetEncoding("iso-8859-1").GetBytes(cppString);
string text = Encoding.UTF8.GetString(decoded);
// こんにちは、世界!
The C++ program will always interpret the input as ISO-8859-1, regardless of whether it is Cyrillic, Japanese or plain English. So that assumption is always correct.
However, you have an additional assumption that the original input is UTF-8 encoded. I'm not sure whether that is always correct. It may depend on the program, the input mechanism it uses and the default encoding used by the Operating System. For example, the C++ program made the assumption that the original input is ISO-8859-1 encoded, which was wrong.
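Because the UTF-8 assumption cannot be proven, one option is to attempt the repair with a strict decoder and fall back to the original string when the bytes are not valid UTF-8. This is only a sketch under that assumption; the class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class Mojibake
{
    private static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    // Strict UTF-8: throws on invalid byte sequences instead of inserting U+FFFD.
    private static readonly Encoding StrictUtf8 = new UTF8Encoding(false, true);

    // Tries to undo "UTF-8 bytes decoded as ISO-8859-1".
    // Returns false and leaves the text unchanged when that assumption does not hold.
    public static bool TryRepair(string garbled, out string repaired)
    {
        repaired = garbled;
        foreach (char c in garbled)
        {
            // A character above U+00FF cannot have come from an ISO-8859-1 decode.
            if (c > '\u00FF') return false;
        }
        try
        {
            repaired = StrictUtf8.GetString(Latin1.GetBytes(garbled));
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false; // the bytes are not valid UTF-8
        }
    }
}
```

Nothing in it is specific to Cyrillic: the same call handles Japanese or any other UTF-8 text, and pure ASCII passes through unchanged.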
By the way, character encodings have always been problematic. A great example is a letter from a French student to his Russian friend where the Cyrillic address was incorrectly written as ISO-8859-1 on the envelope, and decoded by the postal employees.
A source of characters should only be transferred in one encoding: it is either ISO-8859-1 or something else, but not both at the same time
(which means you might be wrong about the reverse-engineered Cyrillic in the first place).
Could you post the expected UTF-8 output of your input?

using the correct encoding for sending special characters to people using outlook

I was wondering what encoding I should be using to send special characters in emails to people using Outlook. I have done my own research and came across several approaches, but none worked for me. I checked Outlook and it seems that by default it uses the Western European (Windows) format, and therefore the Windows-1252 encoding (if what I found and understood is correct). However, when I tried to convert from Unicode in C# to the Windows-1252 encoding, my Outlook still did not display the special characters correctly. E.g. below, some random person's name:
expected result: Mr Moné Rêve
actual result (wrong): Mr Moné Rêve
Can anyone help me with what approach I should take to make the above correct?
My code:
string Fullname = "Mr Moné Rêve";
Encoding unicode = new UnicodeEncoding();
Encoding win1252 = Encoding.GetEncoding(1252);
byte[] input = unicode.GetBytes(Fullname);
byte[] output = Encoding.Convert(unicode, win1252, input);
string result = win1252.GetString(output);
There is no "Correct" encoding. You should specify the charset in the HTML.
This is taken from an email that I have received (you can get the source of an email using, for instance, Outlook):
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
When you encode the document, you have to declare which encoding you used; otherwise the receiver won't know which one it was. The declaration itself consists only of ASCII characters, which read the same in practically every encoding, so the receiver can read it before knowing the encoding.
Read this about encodings and how they work: http://www.joelonsoftware.com/printerFriendly/articles/Unicode.html
In the end, I went for checking my string for special characters and changing them into their character code equivalent, e.g. é becomes &#233;
The best way is to convert them to their HTML entities. There is a tool called HTML Special Character Converter that will help you convert special characters to their HTML entities with just one click.
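The conversion to character references can also be done in code before the HTML body is sent. A minimal sketch with hypothetical names: it rewrites every non-ASCII character as a numeric reference, assumes well-formed input, and leaves ASCII (including any HTML markup) untouched.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class HtmlEntities
{
    // Replaces each non-ASCII character with a numeric character reference (&#NNN;).
    // Assumes well-formed input: a lone surrogate makes ConvertToUtf32 throw.
    public static string ToNumericReferences(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] < 128)
            {
                sb.Append(text[i]); // ASCII passes through unchanged
                continue;
            }
            int codePoint = char.ConvertToUtf32(text, i);
            if (char.IsHighSurrogate(text[i])) i++; // skip the second half of a surrogate pair
            sb.Append("&#")
              .Append(codePoint.ToString(CultureInfo.InvariantCulture))
              .Append(';');
        }
        return sb.ToString();
    }
}
```

Because the result is pure ASCII, it displays the same whichever charset the mail client assumes.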

Read a file with unicode characters

I have an ASP.NET C# page and am trying to read a file that has the following character ’ and convert it to '. (From slanted apostrophe to apostrophe).
FileInfo fileinfo = new FileInfo(FileLocation);
string content = File.ReadAllText(fileinfo.FullName);
//strip out bad characters
content = content.Replace("’", "'");
This doesn't work and it changes the slanted apostrophes into ? marks.
I suspect that the problem is not with the replacement, but rather with the reading of the file itself. When I tried this the naive way (using Word and copy-paste) I ended up with the same results as you; however, examining content showed that the .NET Framework believed the character was Unicode character 65533 (U+FFFD, the replacement character) before the string replacement. You can check this yourself by examining the relevant character in the Visual Studio debugger, where it should show the character code:
content[0]; // 65533 '�'
The reason why the replace isn't working is simple - content doesn't contain the string you gave it:
content.IndexOf("’"); // -1
As for why the file reading isn't working properly: you are probably using the wrong encoding when reading the file. (If no encoding is specified, File.ReadAllText looks for a byte order mark and otherwise assumes UTF-8; there is no 100% reliable way to detect an encoding, so it can often get it wrong.) The exact encoding you need depends on the file itself; in my case the file used an extended ASCII encoding, so to read it I just needed to specify the correct encoding:
string content = File.ReadAllText(fileinfo.FullName, Encoding.GetEncoding("iso-8859-1"));
(See this question).
You also need to make sure that you specify the correct character in your replacement string - when using "odd" characters in code you may find it more reliable to specify the character by its character code, rather than as a string literal (which may cause problems if the encoding of the source file changes), for example the following worked for me:
content = content.Replace("\u0092", "'");
My bet is the file is encoded in Windows-1252. This is almost the same as ISO 8859-1. The difference is Windows-1252 uses "displayable characters rather than control characters in the 0x80 to 0x9F range". (Which is where the slanted apostrophe is located. i.e. 0x92)
//Specify Windows-1252 here
string content = File.ReadAllText(fileinfo.FullName, Encoding.GetEncoding(1252));
//Your replace code will then work as is
content = content.Replace("’", "'");
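Put together, the read-then-replace approach might look like the sketch below. It assumes the file really is Windows-1252, which is a guess about the file and not something the code can verify, and the class and method names are hypothetical. The quotes are written as \u escapes so the code does not depend on the encoding of the source file, and on .NET Core / .NET 5+ the code-pages provider has to be registered first.

```csharp
using System.IO;
using System.Text;

public static class QuoteFixer
{
    // Reads a file assumed to be Windows-1252 and replaces curly single quotes
    // (U+2018, U+2019) with a straight apostrophe.
    public static string ReadAndStraightenQuotes(string path)
    {
        // Needed on .NET Core / .NET 5+, where code page 1252 is not built in.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        string content = File.ReadAllText(path, Encoding.GetEncoding(1252));
        return content.Replace('\u2018', '\'').Replace('\u2019', '\'');
    }
}
```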
// This should replace smart single quotes with a straight single quote
content = Regex.Replace(content, @"(\u2018|\u2019)", "'");
// However, the better approach seems to be to read the page with the proper encoding and leave the quotes alone
var sreader = new StreamReader(fileinfo.OpenRead(), Encoding.GetEncoding(1252));
Note that String (capitalized) and string are the same type in C#, and both can hold any Unicode text, so switching between them will not change the result. What matters is the encoding used when the file is read.

UTF-8 to Unicode using C#

Help me please. I have a problem with the encoding of the response string after a GET request:
var m_refWebClient = new WebClient();
var m_refStream = m_refWebClient.OpenRead(this.m_refUri);
var m_refStreamReader = new StreamReader(this.m_refStream, Encoding.UTF8);
var m_refResponse = m_refStreamReader.ReadToEnd();
After calling this code, my string m_refResponse is JSON source with substrings like \u041c\u043e\u0439. What is it? How do I decode it to Cyrillic? I am very tired after a lot of attempts.
corrected
Am I missing something here?
What is it?
"\u041c\u043e\u0439" is the String literal representation of Мой. You don't have to do anything more, Strings are Unicode, you've got your Cyrillic already.
(Unless you mean you literally have the sequence \u041c\u043e\u0439, ie. the value "\\u041c\\u043e\\u0439". That wouldn't be the result of an encoding error, that would be something happening at the server, for example it returning a JSON string, since JSON and C# use the same \u escapes. If that's what's happening use a JSON parser.)
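If the response really does contain literal \u escapes, a JSON parser resolves them while reading the string. A minimal sketch using System.Text.Json (assuming .NET Core 3.0 or later; the class and method names are hypothetical, and a real response would normally be an object whose properties are read the same way):

```csharp
using System.Text.Json;

public static class JsonEscapes
{
    // Parses a JSON document whose root is a string and returns its value,
    // with \uXXXX escapes resolved by the parser.
    public static string ReadJsonString(string json)
    {
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            return doc.RootElement.GetString();
        }
    }
}
```

Passing the JSON text "\u041c\u043e\u0439", including its surrounding double quotes, returns the three-character string Мой.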
I'm not 100% on this, but I would assume you'd have to pass Encoding.Unicode to StreamReader.
