A few days ago I asked a question about German special characters.
I can encode and decode characters like ö, ä or ü now. But some characters are still left, and I need to encode/decode them too.
For example, these are the characters that fail: ² ³ € µ Ü Ö Ä ~ ´ §
Here is the code:
private static byte[] MyGetBytesArray(string data)
{
    Encoding enc = new UTF8Encoding(true, true);
    return enc.GetBytes(data);
}

private static string MyGetString(byte[] data)
{
    Encoding enc = new UTF8Encoding(true, true);
    return enc.GetString(data);
}
I'm looking for a solution to encode/decode all characters. I'm writing an encrypt/decrypt algorithm, and I don't know what the user will paste into the program. I need to give back exactly the same text.
Thanks for the help, again.
EDIT:
OK, UnicodeEncoding works (I think). It is in my encrypt/decrypt algorithm now. I'm still not sure what is going on (I think it has something to do with zeros: when encoding with Unicode, a zero byte follows every character), but encoding the special characters works. At least this test was successful:
string text = File.ReadAllText(opd.FileName, Encoding.Default);
byte[] byt = getBytesArray(text);
string text2 = getString(byt);

if (text2 == text)
{
    MessageBox.Show("OK");
}
else
{
    MessageBox.Show("FAIL");
}
BTW, is Encoding.Default correct here?
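For what it's worth, the zeros appear to be the high byte of each UTF-16 code unit; a quick check along these lines (my own small sketch, not part of the original code) makes them visible:
byte[] b = new UnicodeEncoding().GetBytes("AB");
// b is { 0x41, 0x00, 0x42, 0x00 }: UTF-16 stores two bytes per code unit,
// and for characters below U+0100 the second (high) byte is zero.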
Try UnicodeEncoding instead.
var encoding = new UnicodeEncoding();
return Write(encoding.GetBytes(s));
Unfortunately those characters are Unicode, so you may run into trouble with the UTF8Encoding class here.
Try using the UnicodeEncoding class instead.
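For example, a minimal round-trip sketch along those lines (only an illustration of the idea, mirroring the method shape from the question) could be:
private static byte[] MyGetBytesArray(string data)
{
    // UTF-16 can represent every character the user might paste in
    Encoding enc = new UnicodeEncoding();
    return enc.GetBytes(data);
}

private static string MyGetString(byte[] data)
{
    Encoding enc = new UnicodeEncoding();
    return enc.GetString(data);
}

// Quick check with the characters from the question:
// string s = "² ³ € µ Ü Ö Ä ~ ´ §";
// MyGetString(MyGetBytesArray(s)) == s   // expected: true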
I need the C# equivalent of Java's Base64.getDecoder().decode.
I have tried something like the following in C#:
byte[] decodedBytes = Convert.FromBase64String(embedCode);
string decodedText = Encoding.UTF8.GetString(decodedBytes);
byte[] bytes = Encoding.ASCII.GetBytes(decodedText);
But the string has some special characters like 0��\u0002B\0�*-\u0017���c\u001e�aֺ]���qr����`. How can I achieve this in C#?
'\u0002' is the 'start of text' (STX) control character in Unicode.
So use:
byte[] decodedBytes = Convert.FromBase64String(embedCode);
string decodedText = Encoding.Unicode.GetString(decodedBytes);
And please don't try to encode Unicode text as ASCII; Unicode has a much wider character range than ASCII can represent. Use a Unicode encoding again when writing the string back to bytes.
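For example (a small sketch assuming you later need the bytes back, e.g. to re-encode to Base64):
byte[] decodedBytes = Convert.FromBase64String(embedCode);
string decodedText = Encoding.Unicode.GetString(decodedBytes);

// If you need bytes again later, stay with the same Unicode encoding
// instead of Encoding.ASCII:
byte[] bytes = Encoding.Unicode.GetBytes(decodedText);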
I use the code below to convert from Unicode to UTF-8:
Encoding unicode = Encoding.Unicode;
Encoding UTF8 = Encoding.UTF8;
byte[] unicodeBytes = unicode.GetBytes(stringResp);
byte[] UTF8Bytes = Encoding.Convert(unicode, UTF8, unicodeBytes);
stringResp = UTF8.GetString(UTF8Bytes, 0, UTF8Bytes.Length);
But the special characters don't show, only their Unicode escape codes (\u00c1 for Á, for example). If I manually look for the string "\\u00c1" and replace it with Á, it works. Does anyone know why this happens and how I can make the conversion automatic?
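My guess is that the response actually contains the six literal characters \u00c1 rather than the character Á itself, in which case no encoding conversion will change anything. If that is the case, something like Regex.Unescape might do the replacement automatically (just a sketch based on that assumption):
using System.Text.RegularExpressions;

// If stringResp literally contains "\u00c1" (backslash, 'u', four hex digits),
// Regex.Unescape turns such escape sequences back into the real characters.
string unescaped = Regex.Unescape(stringResp);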
My goal is to convert a .NET string (Unicode) into Windows-1252 and, if necessary, store the original UTF-8 string in a Base64 entity.
For example, the string "DJ Doena" converted to 1252 is still "DJ Doena".
However, if you convert the Japanese kanji for tree (木) into 1252, you end up with a question mark.
These are my test strings:
String doena = "DJ Doena";
String umlaut = "äöüßéèâ";
String allIn = "< ä ß á â & 木 >";
This is how I convert the string in the first place:
using (MemoryStream ms = new MemoryStream())
{
    using (StreamWriter sw = new StreamWriter(ms, Encoding.UTF8))
    {
        sw.Write(decoded);
        sw.Flush();

        ms.Seek(0, SeekOrigin.Begin);

        using (StreamReader sr = new StreamReader(ms, Encoding.GetEncoding(1252)))
        {
            encoded = sr.ReadToEnd();
        }
    }
}
The problem is that, while debugging, string comparison claims that both are indeed identical, so a simple == or .Equals() doesn't suffice.
This is how I try to find out if I need base64 and produce it:
private static String GetBase64Alternate(String utf8Text, String windows1252Text)
{
    Byte[] utf8Bytes;
    Byte[] windows1252Bytes;
    String base64;

    utf8Bytes = Encoding.UTF8.GetBytes(utf8Text);
    windows1252Bytes = Encoding.GetEncoding(1252).GetBytes(windows1252Text);

    base64 = null;

    if (utf8Bytes.Length != windows1252Bytes.Length)
    {
        base64 = Convert.ToBase64String(utf8Bytes);
    }
    else
    {
        for (Int32 i = 0; i < utf8Bytes.Length; i++)
        {
            if (utf8Bytes[i] != windows1252Bytes[i])
            {
                base64 = Convert.ToBase64String(utf8Bytes);
                break;
            }
        }
    }

    return (base64);
}
The first string, doena, is completely identical in both encodings and doesn't produce a Base64 result.
Console.WriteLine(String.Format("{0} / {1}", windows1252Text, base64Text));
results in
DJ Doena /
But the second string, umlaut, already has twice as many bytes in UTF-8 as in 1252 and thus produces a Base64 string even though it doesn't appear to be necessary:
äöüßéèâ / w6TDtsO8w5/DqcOow6I=
And the third one does what it's supposed to do (no more "木" but a "?", so Base64 is needed):
< ä ß á â & ? > / PCDDpCDDnyDDoSDDoiAmIOacqCA+
Any clues as to how my Base64 getter could be improved (a) for performance and (b) for better results?
Thank you in advance. :-)
I'm not sure I completely understood the question. But I tried. :) If I do understand correctly, this code does what you want:
static void Main(string[] args)
{
    string[] testStrings = { "DJ Doena", "äöüßéèâ", "< ä ß á â & 木 >" };

    foreach (string text in testStrings)
    {
        Console.WriteLine(ReencodeText(text));
    }
}

private static string ReencodeText(string text)
{
    Encoding encoding = Encoding.GetEncoding(1252);
    string text1252 = encoding.GetString(encoding.GetBytes(text));

    return text.Equals(text1252, StringComparison.Ordinal) ?
        text : Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
}
That is, it encodes the text to Windows-1252, then decodes it back to a string object, which it then compares with the original. If the comparison succeeds, it returns the original string; otherwise it encodes it to UTF-8 and then to Base64.
It produces the following output:
DJ Doena
äöüßéèâ
PCDDpCDDnyDDoSDDoiAmIOacqCA+
In other words, the first two strings are left intact, while the third is encoded as base64.
In your first code snippet you are encoding the string using one encoding, then decoding it using a different one. That doesn't give you any reliable result at all; it's the equivalent of writing out a number in octal and then reading it as if it were decimal. It seems to work just fine for numbers up to 7, but after that you get useless results.
The problem with the GetBase64Alternate method is that it encodes the string with two different encodings and assumes that one of them doesn't support some of the characters whenever the two byte sequences differ.
Comparing the byte sequences doesn't tell you whether either encoding failed. The sequences will be different if one failed, but they will also be different whenever any character is simply encoded differently by the two encodings.
What you want to do is to determine if the encoding actually worked for all characters. You can do that by creating an Encoding instance with a fallback for unsupported characters. There is an EncoderExceptionFallback class that you can use for that, which throws an EncoderFallbackException if it's called.
This code will try to use the Windows-1252 encoding on a string, and sets the ok variable to false if the encoding doesn't support all the characters in the string:
Encoding e = Encoding.GetEncoding(1252, new EncoderExceptionFallback(), new DecoderExceptionFallback());
bool ok = true;
try {
    e.GetByteCount(allIn);
} catch (EncoderFallbackException) {
    ok = false;
}
As you are not actually going to use the encoded result for anything, you can use the GetByteCount method. It checks how all the characters would be encoded without producing the encoded output.
Used in your method, it would be:
private static String GetBase64Alternate(string text) {
    Encoding e = Encoding.GetEncoding(1252, new EncoderExceptionFallback(), new DecoderExceptionFallback());
    bool ok = true;
    try {
        e.GetByteCount(text);
    } catch (EncoderFallbackException) {
        ok = false;
    }
    return ok ? null : Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
}
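A quick usage sketch with the test strings from the question (assuming the corrected method above):
foreach (string s in new[] { doena, umlaut, allIn })
{
    string base64 = GetBase64Alternate(s);
    // null means Windows-1252 could represent every character, so no Base64 is needed
    Console.WriteLine(base64 ?? s);
}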
I have a Unicode string, let's say "U+660E", and I want to display the corresponding character, which in this case is 明. See this page (Ctrl-F to find 明).
My code so far:
string unicodeString = reader.GetString(0);
unicodeString = unicodeString.Trim();
Encoding codepage = Encoding.GetEncoding(950);
Encoding unicode = Encoding.Unicode;
byte[] encodedBytes = codepage.GetBytes(unicodeString);
//unicodeString = Encoding.Convert(codepage, unicode, encodedBytes).ToString();
unicodeString = unicode.GetString(encodedBytes);
richTextBox1.Text = unicodeString;
My output is "⭕㘶䔰�".
Any idea where I went wrong?
.NET deals directly with Unicode. You do not have to play encoding games. Just tell the reader whether the input is UTF-8 or UTF-16 and then deal with it as a normal string.
richTextBox1.Text = reader.GetString(0);
There's no need to convert to CP-950; C# is Unicode through-and-through. Just input and print as Unicode unless you're outputting to a file that you know has to be CP-950.
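If, on the other hand, the column literally contains the text "U+660E" rather than the character itself, the code point has to be parsed first. A small sketch under that assumption:
using System.Globalization;

string unicodeString = reader.GetString(0).Trim();                              // e.g. "U+660E"
int codePoint = int.Parse(unicodeString.Substring(2), NumberStyles.HexNumber);  // strip "U+", parse hex
richTextBox1.Text = char.ConvertFromUtf32(codePoint);                           // "明"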
I am retrieving ASCII strings encoded with code page 437 from another system, which I need to transform to Unicode so they can be mixed with other Unicode strings.
This is what I am working with:
var asciiString = "\u0094"; // 0x94 represents 'ö' in code page 437.
var asciiEncoding = Encoding.GetEncoding(437);
var unicodeEncoding = Encoding.Unicode;

// This is what I attempted to do, but it does not seem to handle the eighth bit.
// Characters that use the eighth bit are replaced with '?' (0x3F).
var asciiBytes = asciiEncoding.GetBytes(asciiString);

// This work-around does the job, but surely there is built-in functionality for this?
//var asciiBytes = asciiString.Select(c => (byte)c).ToArray();

// This piece of code happily converts the character correctly to Unicode: { 0x94 } => { 0xF6, 0x00 }.
var unicodeBytes = Encoding.Convert(asciiEncoding, unicodeEncoding, asciiBytes);
var unicodeString = unicodeEncoding.GetString(unicodeBytes); // I want this to be 'ö'.
What I am struggling with is that I cannot find a suitable method in the .NET Framework to transform a string with character codes above 127 into a byte array. This seems strange, since there is support for transforming a byte array with values above 127 into a Unicode string.
So my question is: is there any built-in method to do this conversion properly, or is my work-around the proper way to do it?
var asciiString = "\u0094";
Whatever you name it, this will always be a Unicode string. .NET only has Unicode strings.
I am retrieving ASCII strings encoded with code page 437 from another system
Treat the incoming data as byte[], not as string.
var asciiBytes = new byte[] { 0x94 }; // 94 corresponds represents 'ö' in code page 437.
var asciiEncoding = Encoding.GetEncoding(437);
var unicodeString = asciiEncoding.GetString(asciiBytes);
\u0094 is Unicode code-point 0094, which is a control character; it is not ö. If you wanted ö, the correct string is
string s = "ö";
which is LATIN SMALL LETTER O WITH DIAERESIS, aka code-point 00F6.
So:
var s = "\u00F6"; // Identical to "ö"
Now we get our encoding:
var enc = Encoding.GetEncoding(437);
var bytes = enc.GetBytes(s);
And we find that it is a single-byte decimal 148, which is hex 94 - i.e. what you were after.
The significance here is that in C# when you use the "\uXXXX" syntax, the XXXX is always referring to Unicode code-points, not the encoded value in some particular encoding.
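And decoding those bytes again with the same encoding object closes the loop (a tiny check under the same setup):
var roundTrip = enc.GetString(bytes);   // decodes 0x94 back to "ö"
Console.WriteLine(roundTrip == s);      // True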
You have to look earlier in the code. Once you have the data as a string, it has already been decoded. Any characters lost in that decoding are impossible to get back.
You need the input as bytes, so that you can use your encoding object for code page 437 to decode it into a string.
byte[] asciiData = new byte[] { 0x94 }; // character ö in codepage 437
Encoding asciiEncoding = Encoding.GetEncoding(437);
string unicodeString = asciiEncoding.GetString(asciiData);
Console.WriteLine(unicodeString);
Output:
ö