c# - How to convert a converted UTF-8 string to UTF-16?

I'm trying to convert a converted UTF-8 string to UTF-16, because I'm going to read a file and the text comes in like the string str below.
For example, the input would be the string "NÃ£o Ã© possÃ­vel equipar" and the result I need is "Não é possível equipar".
static void Main(string[] args)
{
    test3();
    Console.ReadKey();
}

static void test3()
{
    string str = "NÃ£o Ã© possÃ­vel equipar";
    string strUTF16 = Utf8ToUtf16(str);
    Console.WriteLine(str);
    Console.WriteLine(strUTF16);
}

static string Utf8ToUtf16(string utf8String)
{
    byte[] utf8Bytes = Encoding.UTF8.GetBytes(utf8String);
    byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);
    return Encoding.Unicode.GetString(unicodeBytes);
}
I really don't know how to solve this. Any tips?

If you want to read a file, then you should read a file. When you read the file, specify the encoding of that file. If I'm not mistaken, UTF-8 is the default, so reading files encoded as UTF-8 doesn't require the encoding to be specified. If you want to save that text to a file with a specific encoding, specify that encoding when saving the file.
var text = File.ReadAllText(filePath, Encoding.UTF8);
File.WriteAllText(filePath, text, Encoding.Unicode);
That will effectively convert a file from UTF-8 encoding to UTF-16. A more verbose version would be:
var data = File.ReadAllBytes(filePath);
var text = Encoding.UTF8.GetString(data);
data = Encoding.Unicode.GetBytes(text);
File.WriteAllBytes(filePath, data);

Your Utf8ToUtf16() function is effectively a no-op. You are taking an arbitrary UTF-16 string as input, encoding it into UTF-8 bytes, then decoding those bytes as UTF-8 back into UTF-16. So, you effectively end up with the same string value you started with. You may as well have just written the following; the result would be the same:
static string Utf8ToUtf16(string utf8String)
{
    return utf8String;
}
That being said, NÃ£o Ã© possÃ­vel equipar is what you get when the UTF-8 encoded form of Não é possível equipar is mis-interpreted as Latin (probably ISO-8859-1) or Windows-125x, etc., instead of being properly interpreted as UTF-8 to begin with.
If you have a C# string that contains such UTF-8 bytes which were up-scaled as-is to UTF-16 (why???), then you need to down-scale those characters as-is back into 8-bit bytes, and then you can decode those bytes as UTF-8, e.g.:
static void test3()
{
    string str = "NÃ£o Ã© possÃ­vel equipar";
    string strUTF16 = Utf8ToUtf16(str);
    Console.WriteLine(str);
    Console.WriteLine(strUTF16);
}

static string Utf8ToUtf16(string utf8String)
{
    byte[] utf8Bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(utf8String); // or: GetEncoding(28591)
    return Encoding.UTF8.GetString(utf8Bytes);
}
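With that change, test3() should print the garbled input on the first line and the repaired string, Não é possível equipar, on the second (console encoding permitting).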

Related

Writing to file a UNICODE character from HEX input

I have a hex input (e.g. 394A) and I need to encode it as Unicode, then save the resulting character(s) to a file. How do I go about that?
I've tried this, but it doesn't seem to work.
fsDest.Write(StrToUni(uni.ToString()), 0, 2);
private static byte[] StrToUni(string str)
{
    Encoding unicode = Encoding.Unicode;
    byte[] unicodeBytes = unicode.GetBytes(str);
    return unicodeBytes;
}
I should see this in my file: 9J
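One possible approach (a sketch, not from the original thread, assuming the hex value is meant as a single Unicode code point such as U+394A) is to parse the hex, turn the code point into a string, and let the file API apply the encoding:
// Sketch: write the character for a hex code point (e.g. "394A") to a file as UTF-16.
string hex = "394A";
int codePoint = Convert.ToInt32(hex, 16);                 // 0x394A
string text = char.ConvertFromUtf32(codePoint);           // also handles code points above U+FFFF
File.WriteAllText("output.txt", text, Encoding.Unicode);  // UTF-16 LE; the path is hypothetical
If the intent is instead to write the two raw bytes 0x39 and 0x4A (which an ASCII viewer would display as 9J), write the byte array directly with File.WriteAllBytes.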

Compare Windows-1252 string to UTF-8 string

My goal is to convert a .NET string (Unicode) into Windows-1252 and, if necessary, store the original UTF-8 string in a Base64 entity.
For example, the string "DJ Doena" converted to 1252 is still "DJ Doena".
However, if you convert the Japanese kanji for tree (木) into 1252, you end up with a question mark.
These are my test strings:
String doena = "DJ Doena";
String umlaut = "äöüßéèâ";
String allIn = "< ä ß á â & 木 >";
This is how I convert the string in the first place:
using (MemoryStream ms = new MemoryStream())
{
    using (StreamWriter sw = new StreamWriter(ms, Encoding.UTF8))
    {
        sw.Write(decoded);
        sw.Flush();
        ms.Seek(0, SeekOrigin.Begin);
        using (StreamReader sr = new StreamReader(ms, Encoding.GetEncoding(1252)))
        {
            encoded = sr.ReadToEnd();
        }
    }
}
The problem is that, while debugging, string comparison claims that both are indeed identical, so a simple == or .Equals() doesn't suffice.
This is how I try to find out if I need base64 and produce it:
private static String GetBase64Alternate(String utf8Text, String windows1252Text)
{
    Byte[] utf8Bytes;
    Byte[] windows1252Bytes;
    String base64;

    utf8Bytes = Encoding.UTF8.GetBytes(utf8Text);
    windows1252Bytes = Encoding.GetEncoding(1252).GetBytes(windows1252Text);

    base64 = null;

    if (utf8Bytes.Length != windows1252Bytes.Length)
    {
        base64 = Convert.ToBase64String(utf8Bytes);
    }
    else
    {
        for (Int32 i = 0; i < utf8Bytes.Length; i++)
        {
            if (utf8Bytes[i] != windows1252Bytes[i])
            {
                base64 = Convert.ToBase64String(utf8Bytes);
                break;
            }
        }
    }

    return (base64);
}
The first string, doena, is completely identical and doesn't produce a Base64 result.
Console.WriteLine(String.Format("{0} / {1}", windows1252Text, base64Text));
results in
DJ Doena /
But the second string, umlaut, already has twice as many bytes in UTF-8 as in 1252 and thus produces a Base64 string even though it does not appear to be necessary:
äöüßéèâ / w6TDtsO8w5/DqcOow6I=
And the third one does what it's supposed to do (no more "木" but a "?", thus base64 needed):
< ä ß á â & ? > / PCDDpCDDnyDDoSDDoiAmIOacqCA+
Any clues as to how my Base64 getter could be enhanced, (a) for performance and (b) for better results?
Thank you in advance. :-)
I'm not sure I completely understood the question. But I tried. :) If I do understand correctly, this code does what you want:
static void Main(string[] args)
{
    string[] testStrings = { "DJ Doena", "äöüßéèâ", "< ä ß á â & 木 >" };
    foreach (string text in testStrings)
    {
        Console.WriteLine(ReencodeText(text));
    }
}

private static string ReencodeText(string text)
{
    Encoding encoding = Encoding.GetEncoding(1252);
    string text1252 = encoding.GetString(encoding.GetBytes(text));
    return text.Equals(text1252, StringComparison.Ordinal) ?
        text : Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
}
I.e. it encodes the text to Windows-1252, then decodes back to a string object, which it then compares with the original. If the comparison succeeds, it returns the original string, otherwise it encodes it to UTF8, and then to base64.
It produces the following output:
DJ Doena
äöüßéèâ
PCDDpCDDnyDDoSDDoiAmIOacqCA+
In other words, the first two strings are left intact, while the third is encoded as base64.
In your first code snippet you are encoding the string using one encoding, then decoding it using a different encoding. That doesn't give you any reliable result at all; it's the equivalent of writing out a number in octal, then reading it as if it were in decimal. It seems to work just fine for numbers up to 7, but after that you get useless results.
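For example (a small illustrative sketch of that point, not from the original answer, and assuming Windows-1252 is available as in the code above), encoding text as UTF-8 and then decoding those bytes as Windows-1252 silently mangles everything outside ASCII:
// Illustration: encode as UTF-8, then decode the same bytes as Windows-1252.
string original = "äöüßéèâ";
byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);
string decodedAs1252 = Encoding.GetEncoding(1252).GetString(utf8Bytes);
Console.WriteLine(original);      // äöüßéèâ
Console.WriteLine(decodedAs1252); // Ã¤Ã¶Ã¼ÃŸÃ©Ã¨Ã¢ - mojibake, not the original text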
The problem with the GetBase64Alternate method is that it encodes the string with two different encodings and assumes that one of the encodings doesn't support some of the characters whenever the two resulting byte sequences differ.
Comparing the byte sequences doesn't tell you whether either encoding failed. The sequences will be different if an encoding failed, but they will also be different if any characters are simply encoded differently by the two encodings.
What you want to do is to determine if the encoding actually worked for all characters. You can do that by creating an Encoding instance with a fallback for unsupported characters. There is an EncoderExceptionFallback class that you can use for that, which throws an EncoderFallbackException if it's called.
This code will try to use the Windows-1252 encoding on a string (here the allIn test string from the question), and sets the ok variable to false if the encoding doesn't support all the characters in the string:
Encoding e = Encoding.GetEncoding(1252, new EncoderExceptionFallback(), new DecoderExceptionFallback());
bool ok = true;
try {
    e.GetByteCount(allIn);
} catch (EncoderFallbackException) {
    ok = false;
}
As you are not actually going to use the encoded result for anything, you can use the GetByteCount method. It will check how all the characters would be encoded without producing the encoded result.
Used in your method it would be:
private static String GetBase64Alternate(string text) {
    Encoding e = Encoding.GetEncoding(1252, new EncoderExceptionFallback(), new DecoderExceptionFallback());
    bool ok = true;
    try {
        e.GetByteCount(text);
    } catch (EncoderFallbackException) {
        ok = false;
    }
    return ok ? null : Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
}

Convert UTF-8 to Chinese Simplified (GB2312)

Is there a way to convert a UTF-8 string to Chinese Simplified (GB2312) in C#? Any help is greatly appreciated.
Regards
Jyothish George
The first thing to be aware of is that there's no such thing as a "UTF-8 string" in .NET. All strings in .NET are effectively UTF-16. However, .NET provides the Encoding class to allow you to decode binary data into strings, and re-encode it later.
Encoding.Convert can convert a byte array representing text encoded with one encoding into a byte array with the same text encoded with a different encoding. Is that what you want?
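As a sketch (not from the original answer), Encoding.Convert could be used like this to turn UTF-8 bytes into GB2312 bytes; note that on .NET Core / .NET 5+ the GB2312 code page is only available after registering the System.Text.Encoding.CodePages provider:
// Hypothetical example: convert UTF-8 encoded bytes into GB2312 encoded bytes.
// On .NET Core / .NET 5+ you may first need:
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
byte[] utf8Bytes = Encoding.UTF8.GetBytes("简体中文");
byte[] gb2312Bytes = Encoding.Convert(Encoding.UTF8, Encoding.GetEncoding("gb2312"), utf8Bytes);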
Alternatively, if you already have a string, you can use:
byte[] bytes = Encoding.GetEncoding("gb2312").GetBytes(text);
If you can provide more information, that would be helpful.
Try this:
public string GB2312ToUtf8(string gb2312String)
{
    Encoding fromEncoding = Encoding.GetEncoding("gb2312");
    Encoding toEncoding = Encoding.UTF8;
    return EncodingConvert(gb2312String, fromEncoding, toEncoding);
}

public string Utf8ToGB2312(string utf8String)
{
    Encoding fromEncoding = Encoding.UTF8;
    Encoding toEncoding = Encoding.GetEncoding("gb2312");
    return EncodingConvert(utf8String, fromEncoding, toEncoding);
}

public string EncodingConvert(string fromString, Encoding fromEncoding, Encoding toEncoding)
{
    byte[] fromBytes = fromEncoding.GetBytes(fromString);
    byte[] toBytes = Encoding.Convert(fromEncoding, toEncoding, fromBytes);
    string toString = toEncoding.GetString(toBytes);
    return toString;
}
source here

Conversion of text to unicode strings

I have to process JSON files that look like this:
\u0432\u043b\u0430\u0434\u043e\u043c <b>\u043f\u0443\u0442\u0438\u043c<\/b> \u043d\u0430\u0447
Unfortunately, I'm not sure what this encoding is called.
I would like to convert it to .NET Unicode strings. What's the easiest way to do it?
These are Unicode escape sequences for characters of the Russian alphabet.
Try simply putting this line in Visual Studio and the compiler will parse it:
string unicodeString = "\u0432\u043b\u0430\u0434\u043e\u043c";
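Note that this only works when the \u sequences appear in a C# string literal, where the compiler decodes them. If they arrive as literal backslash-u text read from a JSON file, they still have to be unescaped at runtime; one way to do that (a sketch, not from the original answer; a proper JSON parser would be more robust) is Regex.Unescape:
// Sketch: decode literal \uXXXX escape sequences that were read from a file as plain text.
// Requires: using System.IO; using System.Text.RegularExpressions;
string raw = File.ReadAllText("data.json");  // hypothetical file path
string decoded = Regex.Unescape(raw);        // "\u0432\u043b\u0430..." becomes "вла..."
Console.WriteLine(decoded);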
Or, if you want to convert the original string to another encoding, for example UTF-8, try this code:
static void Main()
{
    string unicodeString = "\u0432\u043b\u0430\u0434\u043e\u043c <b>\u043f\u0443\u0442\u0438\u043c<\b> \u043d\u0430\u0447";

    // Create two different encodings.
    Encoding utf8 = Encoding.UTF8;
    Encoding unicode = Encoding.Unicode;

    // Convert the string into a byte[].
    byte[] unicodeBytes = unicode.GetBytes(unicodeString);

    // Perform the conversion from one encoding to the other.
    byte[] utf8Bytes = Encoding.Convert(unicode, utf8, unicodeBytes);

    // Convert the new byte[] into a char[] and then into a string.
    // This is a slightly different approach to converting to illustrate
    // the use of GetCharCount/GetChars.
    char[] asciiChars = new char[utf8.GetCharCount(utf8Bytes, 0, utf8Bytes.Length)];
    utf8.GetChars(utf8Bytes, 0, utf8Bytes.Length, asciiChars, 0);
    string asciiString = new string(asciiChars);

    // Display the strings created before and after the conversion.
    Console.WriteLine("Original string: {0}", unicodeString);
    Console.WriteLine("Ascii converted string: {0}", asciiString);
    Console.ReadKey();
}
code taken from the Encoding.Convert documentation

C# and utf8_decode

Is there a C# utf8_decode equivalent?
Use the Encoding class.
For example:
byte[] bytes = something;
string str = Encoding.UTF8.GetString(bytes);
Yes. You can use the System.Text.Encoding class to convert the encoding.
string source = "Déjà vu";
Encoding unicode = Encoding.Unicode;
// iso-8859-1 <- codepage 28591
Encoding latin1 = Encoding.GetEncoding(28591);
Byte[] result = Encoding.Convert(unicode, latin1, unicode.GetBytes(source));
// result contains the byte sequence for the latin1 encoded string
edit: or simply
string source = "Déjà vu";
Byte[] latin1 = Encoding.GetEncoding(28591).GetBytes(source);
string (System.String) is always Unicode encoded, i.e. if you convert the byte sequence back to a string (Encoding.GetString()), your data will be stored as UTF-16 code units again.
If your input is a string, here is a method that would probably work (assuming you're from Western Europe :)
public string Utf8Decode(string inputData)
{
    return Encoding.GetEncoding("iso-8859-1").GetString(Encoding.UTF8.GetBytes(inputData));
}
Of course, if the current encoding of the inputData is not latin1, change the "iso-8859-1" to the correct encoding.
I tried to make this implementation in Xamarin C#.
The code below worked for me:
public static string Utf8Encode(string inputData)
{
    byte[] bytes = Encoding.UTF8.GetBytes(inputData);
    return Encoding.GetEncoding("iso-8859-1").GetString(bytes, 0, bytes.Length);
}

public static string Utf8Decode(string inputData)
{
    byte[] bytes = Encoding.GetEncoding("iso-8859-1").GetBytes(inputData);
    return Encoding.UTF8.GetString(bytes, 0, bytes.Length);
}
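As a quick usage sketch (not part of the original answer), Utf8Decode repairs exactly the kind of mojibake from the question at the top of this page:
// Usage sketch: repair a UTF-8 string that was mis-decoded as Latin-1 / Windows-1252.
string broken = "NÃ£o";               // mojibake for "Não"
string repaired = Utf8Decode(broken);
Console.WriteLine(repaired);          // Não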
