Compare Windows-1252 string to UTF-8 string - c#

My goal is to convert a .NET string (Unicode) into Windows-1252 and, if necessary, store the original UTF-8 string in a Base64 entity.
For example, the string "DJ Doena" converted to 1252 is still "DJ Doena".
However, if you convert the Japanese kanji for tree (木) into 1252, you end up with a question mark.
These are my test strings:
String doena = "DJ Doena";
String umlaut = "äöüßéèâ";
String allIn = "< ä ß á â & 木 >";
This is how I convert the string in the first place:
using (MemoryStream ms = new MemoryStream())
{
    using (StreamWriter sw = new StreamWriter(ms, Encoding.UTF8))
    {
        sw.Write(decoded);
        sw.Flush();
        ms.Seek(0, SeekOrigin.Begin);

        using (StreamReader sr = new StreamReader(ms, Encoding.GetEncoding(1252)))
        {
            encoded = sr.ReadToEnd();
        }
    }
}
The problem is that, while debugging, string comparison claims that both are indeed identical, so a simple == or .Equals() doesn't suffice.
This is how I try to find out if I need base64 and produce it:
private static String GetBase64Alternate(String utf8Text, String windows1252Text)
{
    Byte[] utf8Bytes;
    Byte[] windows1252Bytes;
    String base64;

    utf8Bytes = Encoding.UTF8.GetBytes(utf8Text);
    windows1252Bytes = Encoding.GetEncoding(1252).GetBytes(windows1252Text);

    base64 = null;

    if (utf8Bytes.Length != windows1252Bytes.Length)
    {
        base64 = Convert.ToBase64String(utf8Bytes);
    }
    else
    {
        for (Int32 i = 0; i < utf8Bytes.Length; i++)
        {
            if (utf8Bytes[i] != windows1252Bytes[i])
            {
                base64 = Convert.ToBase64String(utf8Bytes);
                break;
            }
        }
    }

    return (base64);
}
The first string, doena, is completely identical in both encodings and doesn't produce a Base64 result.
Console.WriteLine(String.Format("{0} / {1}", windows1252Text, base64Text));
results in
DJ Doena /
But the second string, umlaut, already has twice as many bytes in UTF-8 as in 1252 and thus produces a Base64 string even though that does not appear to be necessary:
äöüßéèâ / w6TDtsO8w5/DqcOow6I=
And the third one does what it's supposed to do (no more "木" but a "?", thus base64 needed):
< ä ß á â & ? > / PCDDpCDDnyDDoSDDoiAmIOacqCA+
Any clues how my Base64 getter could be enhanced a) for performance b) for better results?
Thank you in advance. :-)

I'm not sure I completely understood the question. But I tried. :) If I do understand correctly, this code does what you want:
static void Main(string[] args)
{
    string[] testStrings = { "DJ Doena", "äöüßéèâ", "< ä ß á â & 木 >" };

    foreach (string text in testStrings)
    {
        Console.WriteLine(ReencodeText(text));
    }
}

private static string ReencodeText(string text)
{
    Encoding encoding = Encoding.GetEncoding(1252);

    string text1252 = encoding.GetString(encoding.GetBytes(text));

    return text.Equals(text1252, StringComparison.Ordinal) ?
        text : Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
}
I.e. it encodes the text to Windows-1252, then decodes back to a string object, which it then compares with the original. If the comparison succeeds, it returns the original string, otherwise it encodes it to UTF8, and then to base64.
It produces the following output:
DJ Doena
äöüßéèâ
PCDDpCDDnyDDoSDDoiAmIOacqCA+
In other words, the first two strings are left intact, while the third is encoded as base64.

In your first code snippet you are encoding the string using one encoding, then decoding it using a different encoding. That doesn't give you any reliable result at all; it's the equivalent of writing out a number in octal, then reading it as if it were decimal. It seems to work just fine for numbers up to 7, but after that you get useless results.
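To illustrate (a small sketch of my own, not from your code): writing the UTF-8 bytes of a single 'ä' and reading them back as Windows-1252 gives you two different characters:
// Encode with UTF-8, then (incorrectly) decode the very same bytes as Windows-1252.
byte[] utf8Bytes = Encoding.UTF8.GetBytes("ä");                    // C3 A4
string misread = Encoding.GetEncoding(1252).GetString(utf8Bytes);
Console.WriteLine(misread);                                        // prints "Ã¤"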
The problem with the GetBase64Alternate method is that it's encoding a string to two different encodings, and assumes that the first encoding doesn't support some of the characters if the second encoding resulted in a different set of bytes.
Comparing the byte sequences doesn't tell you whether any of the encodings failed. The sequences will be different if it failed, but it will also be different if there are any characters that are encoded differently between the encodings.
What you want to do is to determine if the encoding actually worked for all characters. You can do that by creating an Encoding instance with a fallback for unsupported characters. There is an EncoderExceptionFallback class that you can use for that, which throws an EncoderFallbackException if it's called.
This code will try to use the Windows-1252 encoding on a string, and sets the ok variable to false if the encoding doesn't support all characters in the string:
Encoding e = Encoding.GetEncoding(1252, new EncoderExceptionFallback(), new DecoderExceptionFallback());
bool ok = true;
try {
    e.GetByteCount(allIn);
} catch (EncoderFallbackException) {
    ok = false;
}
As you are not actually going to use the encoded result for anything, you can use the GetByteCount method. It will check how all characters would be encoded without producing the encoded result.
Used in your method it would be:
private static String GetBase64Alternate(string text) {
    Encoding e = Encoding.GetEncoding(1252, new EncoderExceptionFallback(), new DecoderExceptionFallback());
    bool ok = true;
    try {
        e.GetByteCount(text);
    } catch (EncoderFallbackException) {
        ok = false;
    }
    return ok ? null : Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
}
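For reference, a quick usage sketch of my own, showing the expected behaviour with the test strings from the question:
Console.WriteLine(GetBase64Alternate("DJ Doena") ?? "(no Base64 needed)");        // plain ASCII
Console.WriteLine(GetBase64Alternate("äöüßéèâ") ?? "(no Base64 needed)");          // all characters exist in Windows-1252
Console.WriteLine(GetBase64Alternate("< ä ß á â & 木 >") ?? "(no Base64 needed)"); // 木 is not in 1252, so Base64 is returned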


Encoding UTF-16 to UTF-8 C#

Hello everyone, I have a problem with encoding.
I want to convert UTF-16 to UTF-8; I found a lot of code, but none of it worked.
I hope you can help me. Thanks.
This text =>
'\x04\x1a\x040\x04#\x04B\x040\x00 \x00*\x003\x003\x000\x001\x00:\x00 \x000\x001\x00.\x001\x001\x00.\x002\x000\x002\x002\x00 \x001\x004\x00:\x001\x000\x00,\x00 \x04?\x04>\x04?\x04>\x04;\x04=\x045\x04=\x048\x045\x00 \x003\x003\x00.\x003\x003\x00 \x00T\x00J\x00S\x00.\x00 \x00 \x04\x14\x04>\x04A\x04B\x04C\x04?\x04=\x04>\x00 \x003\x002\x002\x003'
I tried this:
string v = Regex.Unescape(text);
and got a result like this:
♦→♦0♦#♦B♦0 *3301: 01.11.2022 14:10, ♦?♦>♦?♦>♦;♦=♦5♦=♦8♦5 33.33 TJS. ♦¶♦>♦A♦B♦C♦?♦=♦> 3223
I also tried this method:
public static string Utf16ToUtf8(string utf16String)
{
    // Get UTF16 bytes and convert UTF16 bytes to UTF8 bytes
    byte[] utf16Bytes = Encoding.Unicode.GetBytes(utf16String);
    byte[] utf8Bytes = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Bytes);

    // Return UTF8 bytes as ANSI string
    return Encoding.Default.GetString(utf8Bytes);
}
but it didn't work.
I need a result like this:
Карта *4411: 01.11.2022 14:10, пополнение 33.33 TJS. Доступно 3223
The code below decodes the text to what you want, but it would be much better to avoid getting into this situation in the first place. If the data is fundamentally text, store it as text in your log files without the extra "convert to UTF-16 then encode that binary data" aspect - that's just causing problems.
The code below "decodes" the text log data into a byte array by treating each \x escape sequence as a single byte (assuming \\ is used to encode backslashes) and treating any other character as a single byte - effectively ISO-8859-1.
It then converts the byte array to a string using big-endian UTF-16. The output is as desired:
Карта *3301: 01.11.2022 14:10, пополнение 33.33 TJS. Доступно 3223
The code is really inefficient - it's effectively a proof of concept to validate the text format you've got. Don't use it as-is; instead, use this as a starting point for improving your storage representation.
using System;
using System.Collections.Generic;
using System.Text;

class Program
{
    static void Main()
    {
        string logText = @"\x04\x1a\x040\x04#\x04B\x040\x00 \x00*\x003\x003\x000\x001\x00:\x00 \x000\x001\x00.\x001\x001\x00.\x002\x000\x002\x002\x00 \x001\x004\x00:\x001\x000\x00,\x00 \x04?\x04>\x04?\x04>\x04;\x04=\x045\x04=\x048\x045\x00 \x003\x003\x00.\x003\x003\x00 \x00T\x00J\x00S\x00.\x00 \x00 \x04\x14\x04>\x04A\x04B\x04C\x04?\x04=\x04>\x00 \x003\x002\x002\x003";
        byte[] utf16 = DecodeLogText(logText);
        string text = Encoding.BigEndianUnicode.GetString(utf16);
        Console.WriteLine(text);
    }

    static byte[] DecodeLogText(string logText)
    {
        List<byte> bytes = new List<byte>();
        for (int i = 0; i < logText.Length; i++)
        {
            if (logText[i] == '\\')
            {
                if (i == logText.Length - 1)
                {
                    throw new Exception("Trailing backslash");
                }
                switch (logText[i + 1])
                {
                    case 'x':
                        if (i >= logText.Length - 3)
                        {
                            throw new Exception("Not enough data for \\x escape sequence");
                        }
                        // This is horribly inefficient, but never mind.
                        bytes.Add(Convert.ToByte(logText.Substring(i + 2, 2), 16));
                        // Consume the x and hex
                        i += 3;
                        break;
                    case '\\':
                        bytes.Add((byte) '\\');
                        // Consume the extra backslash
                        i++;
                        break;
                    // TODO: Any other escape sequences?
                    default:
                        throw new Exception("Unknown escape sequence");
                }
            }
            else
            {
                bytes.Add((byte) logText[i]);
            }
        }
        return bytes.ToArray();
    }
}
This also helped me:
string reg = Regex.Unescape(text2);
byte[] ascii = Encoding.BigEndianUnicode.GetBytes(reg);
byte[] utf8 = Encoding.Convert(Encoding.BigEndianUnicode, Encoding.UTF8, ascii);
Console.WriteLine(Encoding.BigEndianUnicode.GetString(utf8));

Is byte[] to UTF8 to JSON string a safe encoding for binary data?

I'm looking to log the body of HTTP POST requests to my web API in a text file. I want human-readable characters to be easily viewable, so I don't want to encode the body to Base64, but I also need all the non-human-readable bytes to be safely encoded so that they can always be restored to a byte[] which is exactly the same as the input array.
Will the following code always safely store all non-readable bytes as JSON escape sequences, while storing readable characters as themselves?
byte[] bodyBytes = GetBodyBytes(ctx);
var bodyString = System.Text.Encoding.UTF8.GetString(bodyBytes);
var safeString = Newtonsoft.Json.JsonConvert.SerializeObject(new { bodyString });
One problem I can see immediately is that this will store newline sequences simply as \n, losing the data about whether they used the Windows- or Unix-style newline bytes. How can I avoid losing any such data while keeping things human readable where possible?
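A quick sketch of my own showing the underlying issue: arbitrary bytes are not necessarily valid UTF-8, so Encoding.UTF8.GetString replaces invalid sequences with U+FFFD and the original byte[] cannot be restored:
// "Hi" followed by two bytes that are never valid in UTF-8.
byte[] input = { 0x48, 0x69, 0xFF, 0xFE };
string asText = Encoding.UTF8.GetString(input);        // invalid bytes become U+FFFD
byte[] roundTripped = Encoding.UTF8.GetBytes(asText);
Console.WriteLine(input.Length);                        // 4
Console.WriteLine(roundTripped.Length);                 // 8 - the original bytes are gone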
In the end, I rolled my own encoding, as my requirements are probably too specific for an existing encoding scheme to implement completely. Here's the method I created:
private string UltraSafeEncode(byte[] bytes) {
    if (bytes == null) {
        throw new ArgumentNullException(nameof(bytes));
    }

    var sb = new System.Text.StringBuilder();

    foreach (var thisByte in bytes) {
        if (thisByte < 32 || thisByte == 92 || thisByte > 126) {
            sb.Append('\\').AppendFormat("{0:x2}", thisByte);
            continue;
        }

        sb.Append((char)thisByte);
    }

    return sb.ToString();
}
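A matching decoder might look something like this (my own sketch, assuming the exact escaping rules above: a backslash is always followed by exactly two hex digits, and a literal backslash byte is always escaped):
private byte[] UltraSafeDecode(string text) {
    if (text == null) {
        throw new ArgumentNullException(nameof(text));
    }

    var bytes = new System.Collections.Generic.List<byte>();

    for (var i = 0; i < text.Length; i++) {
        if (text[i] == '\\') {
            // In this scheme a backslash always introduces two hex digits.
            bytes.Add(Convert.ToByte(text.Substring(i + 1, 2), 16));
            i += 2;
            continue;
        }

        bytes.Add((byte)text[i]);
    }

    return bytes.ToArray();
}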

Check if string consists only of valid ISO 8859-1 characters

How to check if a string consists only of chars, which can be successfully encoded in ISO 8859-1? Or in other words - how to find "illegal"/"not ISO 8859-1 compatible" chars in a string?
Try this:
private static bool IsValidISO(string input)
{
    byte[] bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(input);
    String result = Encoding.GetEncoding("ISO-8859-1").GetString(bytes);
    return String.Equals(input, result);
}
This answer is based on an answer to this Java question (my code is the C# equivalent):
http://www.velocityreviews.com/forums/t137810-checking-whether-a-string-contains-only-iso-8859-1-chars.html
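An alternative of my own, along the lines of the EncoderExceptionFallback approach shown for Windows-1252 further up, is to let the encoder throw on unsupported characters instead of round-tripping:
private static bool IsValidISO(string input)
{
    Encoding iso = Encoding.GetEncoding("ISO-8859-1", new EncoderExceptionFallback(), new DecoderExceptionFallback());
    try
    {
        iso.GetByteCount(input);   // encodes nothing, just checks every character
        return true;
    }
    catch (EncoderFallbackException)
    {
        return false;
    }
}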
You could set up an array or list of valid characters and then iterate through your string to check whether each of its characters exists in your list of valid characters. The list can be created by adding all valid Latin-1 characters to it.
I came up with this idea. Might this be possible?
private static bool IsValidISO(string input)
{
    foreach (char c in input)
    {
        Encoding iso = Encoding.GetEncoding("ISO-8859-1");
        Encoding utf8 = Encoding.UTF8;
        byte[] isoBytes = iso.GetBytes(c.ToString());
        byte[] utfBytes = Encoding.Convert(iso, utf8, isoBytes);
        string convertedC = utf8.GetString(utfBytes);
        if (c != '?' && convertedC == "?")
            return false;
    }
    return true;
}

Encoding special characters into Byte array

A few days ago I asked a question about German special characters.
I can encode and decode characters like ö, ä or ü now. But some characters are still left, and I need to encode/decode them too.
For example, these characters fail: ² ³ € µ Ü Ö Ä ~ ´ §
Here is the code:
private static byte[] MyGetBytesArray(string data)
{
    Encoding enc = new UTF8Encoding(true, true);
    return enc.GetBytes(data);
}

private static string MyGetString(byte[] data)
{
    Encoding enc = new UTF8Encoding(true, true);
    return enc.GetString(data);
}
I'm looking for a solution to encode/decode all characters. I'm writing an encryption/decryption algorithm, and I don't know what the user will paste into the program. I need to give back exactly the same thing.
Thanks for the help, again.
EDIT:
OK, UnicodeEncoding works (I think). It is in my encryption/decryption algorithm now. I'm still not sure what is going on (I think it has something to do with the zeros; when encoding with Unicode there is a zero after every character), but encoding special characters works. At least this test was successful:
string text = File.ReadAllText(opd.FileName, Encoding.Default);
byte[] byt = getBytesArray(text);
string text2 = getString(byt);

if (text2 == text)
{
    MessageBox.Show("OK");
}
else
{
    MessageBox.Show("FAIL");
}
By the way, is Encoding.Default correct here?
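As a side note on those zeros, a small sketch of my own: Encoding.Unicode / UnicodeEncoding is UTF-16 (little-endian), which uses two bytes per character, so every character below U+0100 gets a zero high byte:
byte[] bytes = new UnicodeEncoding().GetBytes("Aä§");
Console.WriteLine(BitConverter.ToString(bytes));   // 41-00-E4-00-A7-00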
Try UnicodeEncoding instead.
var encoding = new UnicodeEncoding();
return Write(encoding.GetBytes(s));
Unfortunately those characters are Unicode so you won't be able to use the UTF8Encoding class.
Try using the UnicodeEncoding class instead.
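For instance, a quick round-trip sketch of my own with the characters from the question:
var encoding = new UnicodeEncoding();
string original = "² ³ € µ Ü Ö Ä ~ ´ §";
byte[] bytes = encoding.GetBytes(original);
string restored = encoding.GetString(bytes);
Console.WriteLine(restored == original);   // True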

How can I convert a string of characters into a binary string and back again?

I need to convert a string into its binary equivalent and keep it in a string, then turn it back into its ASCII equivalent.
You can encode a string into a byte-wise representation by using an Encoding, e.g. UTF-8:
var str = "Out of cheese error";
var bytes = Encoding.UTF8.GetBytes(str);
To get back a .NET string object:
var strAgain = Encoding.UTF8.GetString(bytes);
// str == strAgain
You seem to want the representation as a series of '1' and '0' characters; I'm not sure why you do, but that's possible too:
var binStr = string.Join("", bytes.Select(b => Convert.ToString(b, 2).PadLeft(8, '0')));
Encodings take an abstract string (in the sense that they're an opaque representation of a series of Unicode code points), and map them into a concrete series of bytes. The bytes are meaningless (again, because they're opaque) without the encoding. But, with the encoding, they can be turned back into a string.
You seem to be mixing up "ASCII" with strings; ASCII is simply an encoding that deals only with code-points up to 128. If you have a string containing an 'é', for example, it has no ASCII representation, and so most definitely cannot be represented using a series of ASCII bytes, even though it can exist peacefully in a .NET string object.
See this article by Joel Spolsky for further reading.
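For instance (a tiny sketch of my own), encoding 'é' with the ASCII encoding just substitutes the fallback character:
byte[] asciiBytes = Encoding.ASCII.GetBytes("café");     // 'é' has no ASCII representation
Console.WriteLine(Encoding.ASCII.GetString(asciiBytes)); // prints "caf?"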
You can use these functions for converting to binary and restoring it back:
public static string BinaryToString(string data)
{
    List<Byte> byteList = new List<Byte>();

    for (int i = 0; i < data.Length; i += 8)
    {
        byteList.Add(Convert.ToByte(data.Substring(i, 8), 2));
    }
    return Encoding.ASCII.GetString(byteList.ToArray());
}
and this one for converting a string to binary:
public static string StringToBinary(string data)
{
    StringBuilder sb = new StringBuilder();

    foreach (char c in data.ToCharArray())
    {
        sb.Append(Convert.ToString(c, 2).PadLeft(8, '0'));
    }
    return sb.ToString();
}
Hope this helps you.
First convert the string into bytes, as described in my comment and in Cameron's answer; then iterate, convert each byte into an 8-digit binary number (possibly with Convert.ToString, padding appropriately), then concatenate. For the reverse direction, split by 8 characters, run through Convert.ToInt16, build up a byte array, then convert back to a string with GetString.
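A minimal sketch of that round trip (my own code, with made-up method names; it uses UTF-8 instead of ASCII so characters outside the ASCII range survive):
using System;
using System.Linq;
using System.Text;

class BinaryRoundTrip
{
    // Convert each UTF-8 byte to an 8-digit binary number and concatenate.
    static string ToBinaryString(string text) =>
        string.Concat(Encoding.UTF8.GetBytes(text).Select(b => Convert.ToString(b, 2).PadLeft(8, '0')));

    // Split into 8-digit groups, parse each group as a byte, then decode the bytes as UTF-8.
    static string FromBinaryString(string bits)
    {
        byte[] bytes = Enumerable.Range(0, bits.Length / 8)
            .Select(i => Convert.ToByte(bits.Substring(i * 8, 8), 2))
            .ToArray();
        return Encoding.UTF8.GetString(bytes);
    }

    static void Main()
    {
        string bits = ToBinaryString("café");
        Console.WriteLine(bits);
        Console.WriteLine(FromBinaryString(bits));   // café
    }
}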
