Please help me. I have a problem with the encoding of a response string after a GET request:
var m_refWebClient = new WebClient();
var m_refStream = m_refWebClient.OpenRead(this.m_refUri);
var m_refStreamReader = new StreamReader(m_refStream, Encoding.UTF8);
var m_refResponse = m_refStreamReader.ReadToEnd();
After this code runs, my string m_refResponse is JSON source containing substrings like \u041c\u043e\u0439. What are these? How do I decode them into Cyrillic? I am very tired after many attempts.
Am I missing something here?
What is it?
"\u041c\u043e\u0439" is the String literal representation of Мой. You don't have to do anything more, Strings are Unicode, you've got your Cyrillic already.
(Unless you mean you literally have the sequence \u041c\u043e\u0439, ie. the value "\\u041c\\u043e\\u0439". That wouldn't be the result of an encoding error, that would be something happening at the server, for example it returning a JSON string, since JSON and C# use the same \u escapes. If that's what's happening use a JSON parser.)
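For instance, here is a minimal sketch of that case, assuming Newtonsoft.Json as the parser (any JSON library will do; the payload is made up):
using System;
using Newtonsoft.Json.Linq;

// Raw JSON as it arrives off the wire: the \u escapes are part of the
// JSON text, and the parser unescapes them while parsing.
string rawJson = "{\"name\":\"\\u041c\\u043e\\u0439\"}";
JObject obj = JObject.Parse(rawJson);
Console.WriteLine((string)obj["name"]); // prints: Мой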
I'm not 100% on this, but I would assume you'd have to pass Encoding.Unicode to StreamReader.
Related
I have a Base64 string which I want to convert and decode to UTF-8 like this:
byte[] encodedDataAsBytes = System.Convert.FromBase64String(vcard);
return Encoding.UTF8.GetString(encodedDataAsBytes);
This is because umlauts in the string need to be displayed correctly. The problem I face is that when I use UTF-8 as the encoding, the umlauts are NOT handled correctly. But when I use UTF-7
return Encoding.UTF7.GetString(encodedDataAsBytes);
everything works fine.
Why's that? Shouldn't UTF-8 be able to handle umlauts?
Your vcard is UTF-7 encoded.
This is why Encoding.UTF7.GetString(encodedDataAsBytes); gives you the right result.
Once the data has been encoded, you can't simply pick a different encoding to decode it with.
To use UTF-8 you would need access to the string before the variable vcard got its value.
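You can see the mismatch with a small sketch (this assumes a framework where Encoding.UTF7 is still available; it is marked obsolete in recent .NET versions):
using System;
using System.Text;

// "ü" encoded with UTF-7 becomes the ASCII sequence "+APw-".
byte[] utf7Bytes = Encoding.UTF7.GetBytes("ü");

// Decoding with the matching encoding recovers the umlaut,
// while decoding the same bytes as UTF-8 does not.
Console.WriteLine(Encoding.UTF7.GetString(utf7Bytes)); // ü
Console.WriteLine(Encoding.UTF8.GetString(utf7Bytes)); // +APw-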
I had a similar problem. In my case, I used JavaScript's btoa() to encode a filename to Base64 within the web UI and sent it over to the server. On the .NET Core server side, I used the code below to decode it back to a filename string.
// Note: encodedFilename is the result of btoa() from the client web UI.
var raw = Convert.FromBase64String(encodedFilename);
var filename = Encoding.UTF8.GetString(raw);
It failed to decode ä. It did work when I used Encoding.UTF7, but I don't think that is the right solution. I believe this is due to mismatched encode/decode types: btoa() is binary-to-ASCII and does not produce UTF-8. What I really need is b64EncodeUnicode():
function b64EncodeUnicode(str) {
    return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g, function(match, p1) {
        return String.fromCharCode('0x' + p1);
    }));
}
Code Reference: https://developer.mozilla.org/en-US/docs/Glossary/Base64
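With b64EncodeUnicode() on the client the payload really is UTF-8, so the original server-side code works unchanged. A quick sketch (the Base64 literal is simply "ä" encoded that way):
using System;
using System.Text;

// "w6Q=" is b64EncodeUnicode("ä"): Base64 over the UTF-8 bytes C3 A4,
// not over the single Latin-1 byte E4 that plain btoa("ä") produces.
var raw = Convert.FromBase64String("w6Q=");
var filename = Encoding.UTF8.GetString(raw);
Console.WriteLine(filename); // ä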
OK, I am using System.Runtime.Serialization and the DataContractJsonSerializer.
The problem is that in the request I send a property value containing the & character, say AT&T, and I get a response with the error: Invalid JSON Data.
I thought the escaping would be done inside the library, but now I see that serialization leaves the ampersand & character untouched.
Yes, for JSON this is valid.
But it is a problem for my POST request, since the server responds with an error whenever a value contains an ampersand; hence here I am.
HttpUtility.HtmlEncode lives in the System.Web assembly, so the way to go here seems to be Uri.EscapeUriString. I tried that, but with or without it all requests work fine except when an ampersand is in a value.
EDIT: the HttpUtility class is ported to the Windows Phone SDK, but the preferred way to encode a string should still be Uri.EscapeUriString.
My first thought was to get my hands dirty and start replacing the special characters that would cause a problem on the server, but I wonder: is there another solution that would be efficient and 'proper'?
I should mention that I use
// Convert the string into a byte array.
byte[] postBytes = Encoding.UTF8.GetBytes(data);
To convert the JSON to a byte[] and write to the Stream.
And,
request.ContentType = "application/x-www-form-urlencoded";
As the WebRequest.ContentType.
So, have I messed something up, or is there something I'm missing?
Thank you.
The problem was that I was encoding the whole request string, including the key.
I had a request data={JSON} and I was encoding all of it, but only the {JSON} part should be encoded.
string requestData = "data=" + Uri.EscapeDataString(json); // worked perfectly!
Stupid hole to step into.
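A sketch of the fixed request body (the AT&T value comes from the question; the exact JSON shape is made up):
using System;
using System.Text;

// Only the value half of the key=value pair is percent-encoded;
// the "data=" key and its '=' separator are left as-is.
string json = "{\"company\":\"AT&T\"}";
string requestData = "data=" + Uri.EscapeDataString(json);
Console.WriteLine(requestData); // data=%7B%22company%22%3A%22AT%26T%22%7D

// As in the question, the encoded string then goes on the wire as UTF-8.
byte[] postBytes = Encoding.UTF8.GetBytes(requestData);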
Have you tried replacing the ampersand with &amp; for the POST?
I am trying to parse a Gmail email. I am using IMAP methods and so far so good.
My problem is with HTML emails. I searched everywhere for a way to convert an HTML body to plain text, but nothing worked for me, so I am trying to do it myself. I am taking the HTML and clearing all the attributes, and now I have an encoding issue.
Some of my emails are in Hebrew, and the Hebrew in the HTML looks like this:
=F0=E0 =F6=F8=E5 =E0=E9=FA=E9 =F7=F9=F8 =E1=E1=F7=F9=E4 =E1=E8=EC=F4=
=E5=EF
I tried converting it from hex to a string, but the result wasn't perfect; some words were missing.
How can I convert this to Hebrew characters?
Thanks a lot,
Elad
It seems you have some encoding issues with the HTML you receive.
You're going to need to convert it to the correct encoding.
This works:
Encoding latinEncoding = Encoding.GetEncoding("Windows-1252");
Encoding hebrewEncoding = Encoding.GetEncoding("Windows-1255");
string msys = "=F0=E0 =F6=F8=E5 =E0=E9=FA=E9 =F7=F9=F8 =E1=E1=F7=F9=E4 =E1=E8=EC=F4=E5=EF";
msys = System.Web.HttpUtility.UrlDecode(msys.Replace('=', '%').Replace(" ", "%20"), latinEncoding);
byte[] latinBytes = latinEncoding.GetBytes(msys);
string hebrewString = hebrewEncoding.GetString(latinBytes);
The first part of your problem is that the =F0=E0... sequences are effectively URL-encoded with a = instead of a % at the beginning (this is MIME quoted-printable encoding). So we replace the problematic characters and UrlDecode the result.
Afterwards, we convert it from the Windows-1252 encoding to the Windows-1255 encoding.
As a side note: the trailing = at the end of the first line of your example is a quoted-printable soft line break; remove it and join the lines (so =F4= / =E5=EF becomes =F4=E5=EF, as in the msys string above) before decoding.
I tested it and it works fine on your string... good luck!
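Alternatively, since this is MIME quoted-printable, you can decode the =XX escapes straight to bytes and read those bytes as Windows-1255, skipping the UrlDecode detour. A minimal sketch; DecodeQuotedPrintable is a hypothetical helper, not a framework method:
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: decodes quoted-printable text into bytes and
// interprets them with the given charset.
static string DecodeQuotedPrintable(string input, Encoding charset)
{
    // A trailing '=' is a soft line break: drop it along with the newline.
    input = input.Replace("=\r\n", "").Replace("=\n", "");

    var bytes = new List<byte>();
    for (int i = 0; i < input.Length; i++)
    {
        if (input[i] == '=' && i + 2 < input.Length)
        {
            // "=XX" encodes one byte as two hex digits.
            bytes.Add(Convert.ToByte(input.Substring(i + 1, 2), 16));
            i += 2;
        }
        else
        {
            bytes.Add((byte)input[i]);
        }
    }
    return charset.GetString(bytes.ToArray());
}

// Usage:
// string hebrew = DecodeQuotedPrintable(rawBody, Encoding.GetEncoding("Windows-1255"));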
XDocument xd = XDocument.Load("http://www.google.com/ig/api?weather=vilnius&hl=lt");
The ampersand & isn't a supported character in a string containing a URL when calling the Load() method. This error occurs:
XmlException was unhandled: Invalid character in the given encoding
How can you load XML from a URL into an XDocument where the URL has an ampersand in the querystring?
You need to HTML-encode it as &amp;:
XDocument xd = XDocument.Load(
    "http://www.google.com/ig/api?weather=vilnius&amp;hl=lt");
You might be able to get away with using WebUtility.HtmlEncode to perform this conversion automatically; however, be careful that this is not the intended use of that method.
Edit: The real issue here has nothing to do with the ampersand, but with the way Google is encoding the XML document using a custom encoding and failing to declare it. (Ampersands only need to be encoded when they occur within special contexts, such as the <a href="…" /> element of (X)HTML. Read Ampersands (&'s) in URLs for a quick explanation.)
Since the XML declaration does not specify the encoding, XDocument.Load internally falls back to the default UTF-8 encoding, as required by the XML specification, and that is incompatible with the actual data.
To circumvent this issue, you can fetch the raw data and decode it manually using the sample below. I don’t know whether the encoding really is Windows-1252, so you might need to experiment a bit with other encodings.
string url = "http://www.google.com/ig/api?weather=vilnius&hl=lt";
byte[] data;
using (WebClient webClient = new WebClient())
    data = webClient.DownloadData(url);
string str = Encoding.GetEncoding("Windows-1252").GetString(data);
XDocument xd = XDocument.Parse(str);
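An equivalent sketch that skips the intermediate string copy: wrap the response stream in a StreamReader with the explicit encoding and let XDocument read characters instead of raw bytes (Windows-1252 is still a guess here, as above):
using System.IO;
using System.Net;
using System.Text;
using System.Xml.Linq;

static XDocument LoadWithEncoding(string url, string charset)
{
    using (var webClient = new WebClient())
    using (var stream = webClient.OpenRead(url))
    using (var reader = new StreamReader(stream, Encoding.GetEncoding(charset)))
    {
        // XDocument.Load(TextReader) trusts the reader's decoding and
        // bypasses the byte-level detection that failed here.
        return XDocument.Load(reader);
    }
}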
There is nothing wrong with your code; it is perfectly OK to have & in a query string, and that is how separate parameters are delimited.
When you look at the error, you'll see that it fails to load the XML, not to request it from the URL:
XmlException: Invalid character in the given encoding. Line 1, position 473
which clearly points outside of your query string.
The problem could be "Apsiniaukę" (notice last character) in the XML response...
instead of "&" use "&" or "&" . and it will work fine .
In summary: I retrieve an HTTP web response containing JSON-formatted data with Unicode characters such as "\u00c3\u00b1", which should translate to "ñ". Instead these characters are converted to "Ã±" by the JSON parser I am using. The behavior I'm looking for is for those characters to be translated to "ñ".
Taking the following code and executing it...
string nWithAccent = "\u00c3\u00b1";
Encoding iso = Encoding.GetEncoding("iso8859-1");
byte[] isoBytes = iso.GetBytes(nWithAccent);
nWithAccent = Encoding.UTF8.GetString(isoBytes);
nWithAccent outputs "ñ". This is the result I am looking for. I took the above code and used it on the "response_body" variable below which contained the same characters as above (from what I could see using the Visual Studio 2008 Text Analyzer) and did not get the same result... it does nothing with the characters "\u00c3\u00b1".
In my application I execute the following code against an external system retrieving data in JSON format. Upon examining the "response_body" variable using the text analyzer in Visual Studio I see "\u00c3\u00b1" instead of ñ. E.g. the word "niño" would be seen in the Text Analyzer as "ni\u00c3\u00b1o".
using (HttpWResponse = (HttpWebResponse)this.HttpWRequest.GetResponse())
{
    using (StreamReader reader = new StreamReader(HttpWResponse.GetResponseStream(), Encoding.UTF8))
    {
        // token will expire 60 min from now.
        this.TimeTillTokenExpiration = DateTime.Now.AddMinutes(60);

        // read response data
        response_body = reader.ReadToEnd();
    }
}
I then use an open source JSON parser which replaces "\u00c3" with "Ã" and "\u00b1" with "±", with an end result of "Ã±" instead of "ñ". Is something wrong with the JSON parser, or am I applying the wrong encoding to the response stream? The headers in the response indicate the charset as UTF-8. Thanks for any replies!
The JSON response you are receiving is invalid. "\u00c3\u00b1" isn't the correct encoding for ñ.
Instead it's a sort of double encoding: the text has first been encoded as a UTF-8 byte sequence, and then the non-ASCII bytes have been escaped with \u sequences.
Since a JSON response is usually UTF-8 anyway, there's no need to escape the two-byte sequence for ñ at all. If escaping is used, it must be applied to the single Unicode character itself, not to its UTF-8 bytes; that would result in "\u00f1".
You can test it with an online JSON validator (such as JSONLint or JSON Format) by pasting the following JSON data:
{
    "unescaped": "ñ",
    "escaped": "\u00f1",
    "wrong": "\u00c3\u00b1"
}
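For comparison, a serializer that escapes correctly emits the single-character escape. A sketch assuming Newtonsoft.Json, whose StringEscapeHandling.EscapeNonAscii setting forces \uXXXX escapes for non-ASCII characters:
using System;
using Newtonsoft.Json;

string json = JsonConvert.SerializeObject("ñ",
    new JsonSerializerSettings
    {
        StringEscapeHandling = StringEscapeHandling.EscapeNonAscii
    });
Console.WriteLine(json); // "\u00f1"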
Replace
new StreamReader(HttpWResponse.GetResponseStream(), Encoding.UTF8))
with
new StreamReader(HttpWResponse.GetResponseStream(), Encoding.GetEncoding("iso8859-1")))
What happens if you pass this string to the JSON parser?
string s = "\\u00c3\\u00b1";
I suspect you'll get "Ã±".
Is there a way you can tell your JSON parser to interpret characters in the string as though they're UTF-8 bytes?
You're probably better off reading raw bytes from the response stream and passing that to the JSON parser.
I think the problem is that you're converting the raw bytes to a string, which contains the encoded characters. The JSON parser doesn't know if you want that "\u00c3\u00b1" converted to a single UTF-8 character, or two characters.
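If you'd rather let the parser do its thing and repair the values afterwards, the Latin-1 round trip from the top of the question can be wrapped in a helper (RepairDoubleEncoded is a hypothetical name):
using System.Text;

// Treat each char of the mis-decoded string as a raw Latin-1 byte,
// then re-decode that byte sequence as UTF-8.
static string RepairDoubleEncoded(string s)
{
    byte[] bytes = Encoding.GetEncoding("iso8859-1").GetBytes(s);
    return Encoding.UTF8.GetString(bytes);
}

// Usage: after parsing, "niño" comes back as "niño".
// string fixedValue = RepairDoubleEncoded(parsedValue);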