Main problem and question:
Given a garbled string for which the actual text is known, is it possible to consistently repair the garbled string?
According to Nyerguds comment on this answer:
If the string is an incorrect decoding done with a simple 8-bit
encoding and you have the encoding used to decode it, you can
usually get the bytes back without any corruption, though.
(emphases mine)
Which suggests that there are cases when it is not possible to derive the original bytes back. This leads me to the following question: are there cases when (mis)encoding an array of bytes is a lossy and irreversible operation?
Background:
I am calling an external C++ library that calls a web API somewhere. Sometimes this library gives me slightly garbled text. In my C# project, I am trying to find a way to consistently reverse the miscoding, but I only seem to be able to do so part of the time.
What I've tried:
It seems clear that the C++ library is wrongly encoding the original bytes, which it later passes to me as a string. My approach has been to guess at the encoding that the C++ library used to interpret the original source bytes. Then, I iterate through all possible encodings, reinterpreting the hopefully "original" bytes with another encoding.
class TestCase
{
    public string Original { get; set; }
    public string Actual { get; set; }
    public List<string> Matches { get; } = new List<string>();
}
void Main()
{
    var testCases = new List<TestCase>()
    {
        new TestCase {Original = "窶弑-shaped", Actual = "“U-shaped"},
        new TestCase {Original = "窶廡窶・Type", Actual = "“F” Type"},
        new TestCase {Original = "Ko窶冩lau", Actual = "Ko’olau"},
        new TestCase {Original = "窶彗s is", Actual = "“as is"},
        new TestCase {Original = "窶從ew", Actual = "“new"},
        new TestCase {Original = "faテァade", Actual = "façade"}
    };
    var encodings = Encoding.GetEncodings().Select(x => x.GetEncoding()).ToList();
    foreach (var testCase in testCases)
    {
        foreach (var from in encodings)
        {
            foreach (var to in encodings)
            {
                // Guess the original bytes of the string
                var guessedSourceBytes = from.GetBytes(testCase.Original);
                // Guess what the bytes should have been interpreted as
                var guessedActualString = to.GetString(guessedSourceBytes);
                if (guessedActualString == testCase.Actual)
                {
                    testCase.Matches.Add($"Reversed using \"{from.CodePage} {from.EncodingName}\", reinterpreted as: \"{to.CodePage} {to.EncodingName}\"");
                }
            }
        }
    }
}
As we can see above, five of the six test cases were successful; only 窶廡窶・ failed. In the successful cases, Shift-JIS (code page 932) seemed to produce the correct "original" byte sequence, which then decoded correctly as UTF-8.
Getting the Shift-JIS bytes for 窶廡窶・ yields: E2 80 9C 46 E2 80 81 45.
E2 80 9C coincides with the UTF-8 bytes for the left double quotation mark, which is correct. However, E2 80 81 is an em quad in UTF-8, not the right double quotation mark I am expecting. Reinterpreting the whole byte sequence as UTF-8 results in “F EType.
No matter which encoding I use to derive the "original" bytes, and no matter which encoding I use to reinterpret said bytes, no combination seems able to convert 窶廡窶・ back to “F”.
Interestingly, if I derive the UTF-8 bytes for “F” Type and purposely misinterpret those bytes as Shift-JIS, I get back 窶廡窶・Type:
Encoding.GetEncoding(932).GetString(Encoding.UTF8.GetBytes("“F” Type"))
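This suspicion can be checked at the byte level outside .NET as well. Below is a minimal sketch in Python (assuming its cp932 codec, which follows the same Shift-JIS table as code page 932): strict Shift-JIS decoding of the UTF-8 bytes of “F” Type fails outright, so any decoder that produced 窶廡窶・Type must have substituted a replacement for the unrepresentable bytes, destroying them irreversibly.

```python
# UTF-8 bytes of "“F” Type": E2 80 9C 46 E2 80 9D 20 54 79 70 65
utf8_bytes = "\u201cF\u201d Type".encode("utf-8")

# Strict Shift-JIS (cp932) decoding fails: 9D 20 is not a valid
# lead/trail byte pair, so these bytes have no Shift-JIS reading.
try:
    utf8_bytes.decode("cp932")
    lossless = True
except UnicodeDecodeError:
    lossless = False

print(lossless)  # False

# With a replacement policy the offending byte is discarded; this is the
# irreversible step, since no re-encoding can reconstruct the lost 9D.
print(utf8_bytes.decode("cp932", errors="replace"))
```

This is consistent with the observation above: the ・ in 窶廡窶・ is the C++ library's stand-in for a byte sequence it could not decode.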
This leads me to believe that encoding can actually lead to data loss. I'm not well versed on encoding though, so could someone confirm whether my conclusion is correct, and if so, why this data loss occurs?
Yes, there are encodings that don't support all characters. The most common example is ASCIIEncoding, which replaces every character outside the standard ASCII range with ?. From the documentation:
...Because ASCII is a 7-bit encoding, ASCII characters are limited to the lowest 128 Unicode characters, from U+0000 to U+007F. … characters outside that range are replaced with a question mark (?) before the encoding operation is performed.
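The replacement step is easy to observe directly. A short Python sketch (the behavior mirrors .NET's ASCIIEncoding; the example string is arbitrary):

```python
# Encoding a non-ASCII character with a 7-bit encoding substitutes '?';
# after that, the original character is unrecoverable.
data = "façade".encode("ascii", errors="replace")
print(data)                  # b'fa?ade'
print(data.decode("ascii"))  # fa?ade - the 'ç' is gone for good
```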
Related
I noticed some strange behavior: chars above 127 will not convert properly into a Byte. It's a well-known problem, but I can't understand why it happens. I found out about it while working on a client-server app. The thing is, chars are unsigned and so are Bytes, so where is the loss of data?
public class Constants
{
    public const char TOP3_REQUEST_CODE = (char)148;
}
public void printTopThree()
{
    string request = Constants.TOP3_REQUEST_CODE.ToString();
    string response = SendAndRecive(Constants.PORT, Constants.IP, request, Globals.SOCKET);
    // The rest isn't relevant.
}

public string SendAndRecive(string port, string ip, string request, Socket socket)
{
    Byte[] bytesSend = Encoding.ASCII.GetBytes(request);
    Console.WriteLine(request[0]);
    Console.WriteLine(bytesSend[0]);
    // The program continues, but it's not relevant.
}
The code afterward doesn't change the byte array or string so it can't affect the results.
The output is:
148
63
The first char in the request is the code of the message (happens to be 148) but after converting, the first byte is 63.
My questions are:
1. How can I fix this? Is there some other kind of encoding that may solve my problem?
2. Why does this thing happen anyways?
EDIT: The request looks like this (in general):
1st byte: (char) Code (20,100,148 etc...)
2nd - 4th bytes: (int) Length (the length of the JSON object; it can be 1, 2, or 3 bytes long)
5th - X bytes: (char) JsonObject (It's converted to char[])
Thanks for your time and attention
- Anthon
Strings and characters in C# don't use single-byte encodings. You should use the Unicode (UTF-16) encoding the CLR uses under the hood:
Byte[] bytesSend = Encoding.Unicode.GetBytes(request);
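To see why 148 became 63 in the first place, here is a small Python sketch of the same situation (Python used only for illustration; .NET's Encoding.ASCII behaves the same way): ASCII has no mapping for U+0094, so the encoder substitutes '?' (byte 63), whereas a Unicode encoding, or a single-byte encoding covering 0-255 such as Latin-1, preserves the value.

```python
ch = chr(148)  # U+0094, the request code as a character

print(ch.encode("ascii", errors="replace"))  # b'?'        -> byte 63, as observed
print(ch.encode("utf-16-le"))                # b'\x94\x00' -> preserved in UTF-16
print(ch.encode("latin-1"))                  # b'\x94'     -> single byte 148
```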
Thanks to all of you.
@Lasse Vågsæther Karlsen and @Marc Gravell♦ were right:
Use UTF-8.
Apologies for the abortive first try, particularly to Olivier. Trying again.
Situation is we have a string coming in from a mainframe to a C# app. We understand it needs to be converted to a byte array. However, this data is a mixture of ASCII characters and true binary UINT16 and UINT32 fields, which are not always in the same spot in the data. Later on we will deserialize the data and will know the structure's data alignments, but not at this juncture.
Logic flow briefly is to send a structure with binary embedded, receive a reply with binary embedded, convert string reply to bytes (this is where we have issues), deserialize the bytes based on an embedded structure name, then process the structure. Until we reach deserialize, we don't know where the UINTs are. Bits are bits at this point.
When we have a reply byte which is ultimately part of a UINT16, and that byte has the high-order bit set (making it "extended ascii" or "negative", however you want to say it), that byte is converted to nulls. So any value >= 128 in that byte is lost.
Our code to convert looks like this:
public async Task<byte[]> SendMessage(byte[] sendBytes)
{
    byte[] recvbytes = null;
    var url = new Uri("http://<snipped>");
    WebRequest webRequest = WebRequest.Create(url);
    webRequest.Method = "POST";
    webRequest.ContentType = "application/octet-stream";
    webRequest.Timeout = 10000;
    using (Stream postStream = await webRequest.GetRequestStreamAsync().ConfigureAwait(false))
    {
        await postStream.WriteAsync(sendBytes, 0, sendBytes.Length);
        await postStream.FlushAsync();
    }
    try
    {
        string Response;
        int Res_lenght;
        using (var response = (HttpWebResponse)await webRequest.GetResponseAsync())
        using (Stream streamResponse = response.GetResponseStream())
        using (StreamReader streamReader = new StreamReader(streamResponse))
        {
            Response = await streamReader.ReadToEndAsync();
            Res_lenght = Response.Length;
        }
        if (string.IsNullOrEmpty(Response))
        {
            recvbytes = null;
        }
        else
        {
            recvbytes = ConvertToBytes(Response);
            var table = (Encoding.Default.GetString(
                recvbytes,
                0,
                recvbytes.Length - 1)).Split(new string[] { "\r\n", "\r", "\n" },
                StringSplitOptions.None);
        }
    }
    catch (WebException e)
    {
        //error
    }
    return recvbytes;
}
static byte[] ConvertToBytes(string inputString)
{
    byte[] outputBytes = new byte[inputString.Length * sizeof(byte)];
    String strLocalDate = DateTime.Now.ToString("hh.mm.ss.ffffff");
    String fileName = "c:\\deleteMe\\Test" + strLocalDate;
    fileName = fileName + ".txt";
    StreamWriter writer = new StreamWriter(fileName, true);
    for (int i = 0; i < inputString.Length; i++)
    {
        try
        {
            outputBytes[i] = Convert.ToByte(inputString[i]);
            writer.Write("String in: {0} \t Byte out: {1} \t Index: {2} \n", inputString.Substring(i, 2), outputBytes[i], i);
        }
        catch (Exception ex)
        {
            //error
        }
    }
    writer.Flush();
    return outputBytes;
}
ConvertToBytes has a line in the FOR loop to display the values in and out, plus the index value. Here is one of several spots where we see the conversion error - note indexes 698 and 699 represent a UINT16:
String in: sp Byte out: 32 Index: 696 << sp = space
String in: sp Byte out: 32 Index: 697
String in: \0 Byte out: 0 Index: 698
String in: 2 Byte out: 50 Index: 700 << where is 699?
String in: 0 Byte out: 48 Index: 701
String in: 1 Byte out: 49 Index: 702
String in: 6 Byte out: 54 Index: 703
The expected value for index 699 is decimal 156, which is binary 10011100. The high order bit is on. So the conversion for #698 is correct, and for #700, which is an ascii 2 is correct, but not for #699. Given the UINT16 (0/156) is a component of the key to subsequent records, seeing 0/0 for the values is a show-stopper. We don't have a displacement error for 699, we see nulls in the deserialize. No idea why the .Write didn't report it.
Another example: 2/210 (decimal 722 when seen as a full UINT16) comes out as 2/0 (decimal 512).
Please understand this code as shown above works for everything except the 8-bit reply string fields which have the high-order bit set.
Any suggestions how to convert a string element to byte regardless of the content of the string element would be appreciated. Thanks!
Without a good Minimal, Complete, and Verifiable example that reliably reproduces the problem, it's impossible to state specifically what is wrong. But given what you've posted, some useful observations can be made:
First and foremost, as far as "where is 699?" goes, it's obvious that an exception is being thrown. That's how the Write() call would be skipped, resulting in no output for that index. You have a couple of opportunities in the code you posted for that to happen: the call to Convert.ToByte(), or the following statement (particularly the call to inputString.Substring()).
Unfortunately, without a good MCVE it's hard to understand why you are printing a two-character substring from the input string, or why you say the characters "sp" become the character value 0x20 (i.e. a space character). The output you describe in the question doesn't appear to be self-consistent. But, let's move on…
Assuming for the moment that at least in the specific case you're looking at, there are enough characters in inputString at that point for the call to Substring() to succeed, we're left with the conclusion that the call to Convert.ToByte() is failing.
Given what you wrote, it seems that the main issue here is a misunderstanding on your part about how text is encoded and manipulated in a C# program. In particular, a C# character is in some sense an abstraction and doesn't have an encoding at all. To the extent that you force the encoding to be revealed, i.e. by casting or otherwise converting the raw character value directly, that value is always encoded as UTF16.
Put another way: you are dealing with a C# string object, made of C# char values. That is, by the time you get this text into your program and call the ConvertToBytes() method, it has already been converted to UTF-16, regardless of the encoding used by the sender.
In UTF16, character values that would be greater than 127 (0x7f) in an "extended ASCII" encoding (e.g. any of the various ANSI/OEM/ISO single-byte encodings) are not encoded as their original value. Instead, they will have a 16-bit value greater than 255.
When you ask Convert.ToByte() to convert such a value to a byte, it will throw an exception, because the value is larger than the largest value that can fit in a byte.
It is fairly clear why the code you posted is producing the results you describe (at least, to some extent). But it is not clear at all what you are actually hoping to accomplish here. I can say that attempting to convert char values to/from byte values by straight casting is simply not going to work. The char type isn't a byte; it's two bytes, and any non-ASCII character will use a larger value than can fit in a byte. You should be using one of the several .NET classes that actually do text encoding, such as the Encoding.GetBytes() method.
Of course, to do that you'll have to make sure you first understand precisely why you are trying to convert to bytes and what encoding you want to use. The code you posted seems to be trying to interpret your encoded bytes as the current Encoding.Default encoding, so you should use that encoding to encode the text. But there's not really any value in encoding to that encoding only to decode back to a C# string value. Assuming you've done it correctly, all that will happen is you'll get exactly the same string you started with.
In other words, while I can explain the behavior you're seeing to the extent that you've described it here, that's unlikely to address whatever broader problem you are actually trying to solve. If the above does not get you back on track, please post a new question in which you've included a good MCVE and a clear explanation of what that broader problem you're trying to solve actually is.
I am wondering how I could decode the special character • to HTML?
I have tried using System.Web.HttpUtility.HtmlDecode but not luck yet.
The issue here is not HTML decoding, but rather that the text was encoded in one character set (UTF-8) and then decoded as another (e.g., windows-1252).
In UTF-8, • is encoded as E2 80 A2. When this byte sequence is read using the windows-1252 encoding, E2 80 A2 decodes as â€¢. (Saved again as UTF-8, â€¢ Test becomes C3 A2 E2 82 AC C2 A2 20 54 65 73 74.)
If the file is a windows-1252-encoded file, the file can simply be read with the correct encoding (e.g., as an argument to a StreamReader constructor.):
new StreamReader(..., Encoding.GetEncoding("windows-1252"));
If the file was saved with an incorrect encoding, the encoding can be reversed in some cases. For instance, for the string sequence in your question, you can write:
string s = "â€¢"; // the string sequence that is not properly encoded
var b = Encoding.GetEncoding("windows-1252").GetBytes(s); // b = `E2 80 A2`
string c = Encoding.UTF8.GetString(b); // c = `•`
Note that many common non-ASCII punctuation characters are in the range U+2000 to U+2044 (Reference), such as "smart quotes", bullets, and dashes. Thus, the sequence â€?, where ? is any character, will typically signify this type of encoding error. This allows this type of error to be corrected more broadly:
static string CorrectText(string input)
{
var winencoding = Encoding.GetEncoding("windows-1252");
return Regex.Replace(input, "â€.",
m => Encoding.UTF8.GetString(winencoding.GetBytes(m.Value)));
}
Calling this function with text malformed in this way will correct some (but not all) errors. For instance, CorrectText("â€¢Testâ€“orâ€œ") will return the intended •Test–or“.
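For anyone who wants to verify the byte arithmetic independently, the same repair can be sketched in Python (the function name is illustrative):

```python
import re

def correct_text(s):
    # Re-encode each "â€?" run as Windows-1252 bytes, then decode as UTF-8.
    return re.sub(r"â€.", lambda m: m.group().encode("cp1252").decode("utf-8"), s)

print(correct_text("â€¢Testâ€“orâ€œ"))  # •Test–or“
```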
HtmlDecode is for converting Html-encoded strings into a readable string format. Perhaps HtmlEncode might be what you're actually looking for.
I've written my first COM classes. My unit tests work fine, but my first use of the COM objects has hit a snag.
The COM classes provide methods which accept a string, manipulate it and return a string. The consumer of the COM objects is a dBASE PLUS program.
When the input string contains common keyboard characters (ASCII 127 or lower), the COM methods work fine. However, if the string contains characters beyond the ASCII range, some of them get remapped from Windows-1252 to C#'s Unicode. This table shows the mapping that takes place: http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
For example, if the dBASE program calls the COM object with:
oMyComObject.MyMethod("It will cost€123") where the € is hex 80,
the C# method receives it as Unicode:
public string MyMethod(string source)
{
// source is Unicode and now the Euro symbol is hex 20AC
...
}
I would like to avoid this remapping because I want the original hex content of the string.
I've tried adding the following to MyMethod to convert the string back to Windows-1252, but the Euro symbol gets lost because it becomes a question mark:
byte[] UnicodeBytes = Encoding.Unicode.GetBytes(source.ToString());
byte[] Win1252Bytes = Encoding.Convert(Encoding.Unicode, Encoding.GetEncoding(1252), UnicodeBytes);
string Win1252 = Encoding.GetEncoding(1252).GetString(Win1252Bytes);
Is there a way to prevent this conversion of the "source" parameter to Unicode? Or, is there a way to convert it 100% from Unicode back to Windows-1252?
Yes, I'm answering my own question. The answer by "Jigsore" put me on the right track, but I want to explain more clearly in case someone else makes the same mistake I made.
I eventually figured out that I had misdiagnosed the problem. dBASE was passing the string fine and C# was receiving it fine. It was how I checked the contents of the string that was in error.
This turnkey example builds on Jigsore's answer:
void Main()
{
    string unicodeText = "\u20AC\u0160\u0152\u0161";
    byte[] unicodeBytes = Encoding.Unicode.GetBytes(unicodeText);
    byte[] win1252bytes = Encoding.Convert(Encoding.Unicode, Encoding.GetEncoding(1252), unicodeBytes);

    for (int i = 0; i < win1252bytes.Length; i++)
        Console.Write("0x{0:X2} ", win1252bytes[i]); // output: 0x80 0x8A 0x8C 0x9A

    // win1252String represents the string passed from dBASE to C#
    string win1252String = Encoding.GetEncoding(1252).GetString(win1252bytes);
    Console.WriteLine("\r\nWin1252 string is " + win1252String); // output: Win1252 string is €ŠŒš

    Console.WriteLine("looking at the code of the first character the wrong way: " + (int)win1252String[0]);
    // output: looking at the code of the first character the wrong way: 8364

    byte[] bytes = Encoding.GetEncoding(1252).GetBytes(win1252String[0].ToString());
    Console.WriteLine("looking at the code of the first character the right way: " + bytes[0]);
    // output: looking at the code of the first character the right way: 128

    // Warning: if your input contains character codes larger in value than what a byte
    // can hold (e.g., multi-byte Chinese characters), you will need to look at more than just bytes[0].
}
The reason the first method was wrong is that casting, as in (int)win1252String[0] (or its converse, casting an integer j to a character with (char)j), involves an implicit conversion using the Unicode character set C# uses, not Windows-1252.
I consider this resolved and would like to thank each person who took the time to comment or answer for their time and trouble. It is appreciated!
Actually you're doing the Unicode to Win-1252 conversion correctly, but you're performing an extra step. The original Win1252 codes are in the Win1252Bytes array!
Check the following code:
string unicodeText = "\u20AC\u0160\u0152\u0161";
byte[] unicodeBytes = Encoding.Unicode.GetBytes(unicodeText);
byte[] win1252bytes = Encoding.Convert(Encoding.Unicode, Encoding.GetEncoding(1252), unicodeBytes);
for (int i = 0; i < win1252bytes.Length; i++)
    Console.Write("0x{0:X2} ", win1252bytes[i]);
The output shows the Win-1252 codes for the unicodeText string, you can check this by looking at the CP1252.TXT table.
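As a quick cross-check, the same mapping can be reproduced with Python's cp1252 codec (which follows the same CP1252.TXT table):

```python
unicode_text = "\u20ac\u0160\u0152\u0161"  # the same test string as above
win1252 = unicode_text.encode("cp1252")
print(" ".join(f"0x{b:02X}" for b in win1252))  # 0x80 0x8A 0x8C 0x9A
```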
I am retrieving ASCII strings encoded with code page 437 from another system which I need to transform to Unicode so they can be mixed with other Unicode strings.
This is what I am working with:
var asciiString = "\u0094"; // 0x94 represents 'ö' in code page 437.
var asciiEncoding = Encoding.GetEncoding(437);
var unicodeEncoding = Encoding.Unicode;

// This is what I attempted to do, but it seems unable to handle the eighth bit.
// Characters using the eighth bit are replaced with '?' (0x3F).
var asciiBytes = asciiEncoding.GetBytes(asciiString);

// This work-around does the job, but there must be built-in functionality to do this?
//var asciiBytes = asciiString.Select(c => (byte)c).ToArray();

// This piece of code happily converts the character correctly to Unicode: { 0x94 } => { 0xF6, 0x00 }
var unicodeBytes = Encoding.Convert(asciiEncoding, unicodeEncoding, asciiBytes);
var unicodeString = unicodeEncoding.GetString(unicodeBytes); // I want this to be 'ö'.
What I am struggling with is that I cannot find a suitable method in the .NET Framework to transform a string with character codes above 127 into a byte array. This seems strange, since there is support for transforming a byte array with characters above 127 into Unicode strings.
So my question is, is there any built in method to do this conversion properly or is my work-around the proper way to do it?
var asciiString = "\u0094";
Whatever you name it, this will always be a Unicode string. .NET only has Unicode strings.
I am retrieving ASCII strings encoded with code page 437 from another system
Treat the incoming data as byte[], not as string.
var asciiBytes = new byte[] { 0x94 }; // 0x94 represents 'ö' in code page 437.
var asciiEncoding = Encoding.GetEncoding(437);
var unicodeString = asciiEncoding.GetString(asciiBytes);
\u0094 is Unicode code-point 0094, which is a control character; it is not ö. If you wanted ö, the correct string is
string s = "ö";
which is LATIN SMALL LETTER O WITH DIAERESIS, aka code-point 00F6.
So:
var s = "\u00F6"; // Identical to "ö"
Now we get our encoding:
var enc = Encoding.GetEncoding(437);
var bytes = enc.GetBytes(s);
And we find that it is a single byte, decimal 148, which is hex 94, i.e. what you were after.
The significance here is that in C# when you use the "\uXXXX" syntax, the XXXX is always referring to Unicode code-points, not the encoded value in some particular encoding.
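The same point can be checked in Python, where string escapes likewise denote code points, not encoded bytes:

```python
s = "\u00f6"                # LATIN SMALL LETTER O WITH DIAERESIS, i.e. "ö"
print(s.encode("cp437"))    # b'\x94' -> decimal 148, the code page 437 byte
print("\u0094" == s)        # False: U+0094 is a control character, not 'ö'
```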
You have to look earlier in the code. Once you have the data as a string, it has already been decoded. Any characters lost in that decoding are impossible to get back.
You need the input as bytes, so that you can use your encoding object for code page 437 to decode it into a string.
byte[] asciiData = new byte[] { 0x94 }; // character ö in codepage 437
Encoding asciiEncoding = Encoding.GetEncoding(437);
string unicodeString = asciiEncoding.GetString(asciiData);
Console.WriteLine(unicodeString);
Output:
ö
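The decode step shown above has the same shape in any language; here is a one-line Python check of the CP437 mapping (for verification only):

```python
data = bytes([0x94])         # the CP437 byte for 'ö'
print(data.decode("cp437"))  # ö (U+00F6)
```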