Unexpected conversion of Unicode overline (U+203E) to Shift-JIS


For a customer project, a query is made against a DB and the results are written to a file. The file is required to be in Shift JIS as it is later used as input for another legacy system. The Wikipedia article indicates that:
The single-byte characters 0x00 to 0x7F match the ASCII encoding, except for a yen sign (U+00A5) at 0x5C and an overline (U+203E) at 0x7E in place of the ASCII character set's backslash and tilde respectively.
During some testing, I have verified that while the yen sign (U+00A5) properly becomes 0x5C, the overline (U+203E) becomes 0x3F (question mark) rather than the expected 0x7E.
While I am doing normal output with a StreamWriter to a file, below is minimal code to reproduce:
static void Test()
{
    // Get Shift-JIS encoder.
    var encoding = Encoding.GetEncoding("shift_jis");
    // Declare overline (U+203E).
    char c = (char) 0x203E;
    // Get bytes when encoded as Shift-JIS.
    var bytes = encoding.GetBytes(c.ToString());
    // Expected 0x7E, but the value returned is 0x3F.
}
Is this behavior correct?
I suppose I could subclass EncoderFallback, but this seems like far more work for something that I would have expected to work from the start.

Upon further investigation, I must conclude that Shift JIS is a misnomer here. Rather, this is code page 932. Unicode and Microsoft provide a mapping table between code page 932 and Unicode, and this is apparently what is used to map the characters. Notice that it contains no mapping for either (0x5C, U+00A5) or (0x7E, U+203E).
Note though that I wrote in the original question that "I have verified that while the yen sign (U+00A5) properly becomes 0x5C". Apparently, the Encoding.GetEncoding(String) method returns an encoding whose EncoderFallback is System.Text.InternalEncoderBestFitFallback, which I assume provides additional mappings for some characters that would otherwise fail. It must contain an additional mapping for yen (U+00A5), but unfortunately nothing for overline (U+203E). When I replace it with EncoderExceptionFallback, it fails for both characters.
Hence, I conclude that for Shift JIS, this is an error. But for code page 932, it is the expected result.
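If the legacy system really does treat byte 0x7E as an overline, one possible workaround that avoids subclassing EncoderFallback is to swap U+203E for the tilde before encoding, since code page 932 maps U+007E to 0x7E. This is only a sketch: the class and method names are made up for illustration, and it assumes the downstream consumer interprets 0x7E the way the Wikipedia description says.

```csharp
using System;
using System.Text;

static class OverlineWorkaround
{
    // Hypothetical helper (not part of .NET): replaces U+203E with '~' (U+007E)
    // before encoding, so an ASCII-compatible code page such as 932 emits 0x7E
    // instead of the 0x3F replacement byte.
    public static byte[] EncodeWithOverlineAs7E(string text, Encoding encoding)
    {
        string patched = text.Replace('\u203E', '~');
        return encoding.GetBytes(patched);
    }
}
```

You would call it as OverlineWorkaround.EncodeWithOverlineAs7E(line, Encoding.GetEncoding("shift_jis")). On .NET Core and later, code page 932 is only available after registering CodePagesEncodingProvider.Instance from the System.Text.Encoding.CodePages package; on .NET Framework no registration is needed.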

Related

Is it possible to display (convert?) the unicode hex \u0092 to an unicode html entity in .NET?

I have some string that contains the following code/value:
"You won\u0092t find a ...."
It looks like that string contains the Right Apostrophe special character.
ref1: Unicode control 0092
ref2: ASCII chart (both 127 + extra extended ascii)
I'm not sure how to display this in the web browser. It keeps displaying the tofu square-box character instead. I'm under the impression that the Unicode (hex) value 0092 can be converted to the HTML entity &#146;
Is my understanding correct?
Update 1:
It was suggested by @sam-axe that I HtmlEncode the string. That didn't work. Here it is...
Note the ampersand got correctly encoded....

It looks like there's an encoding mix-up. In .NET, strings are normally encoded as UTF-16, and a right apostrophe should be represented as \u2019. But in your example, the right apostrophe is represented as \x92, which suggests the original encoding was Windows code page 1252. If you include your string in a Unicode document, the character \x92 won't be interpreted properly.
You can fix the problem by re-encoding your string as UTF-16. To do so, treat the string as an array of bytes, and then convert the bytes back to Unicode using the 1252 code page:
string title = "You won\u0092t find a cheaper apartment * Sauna & Spa";
byte[] bytes = title.Select(c => (byte)c).ToArray();
title = Encoding.GetEncoding(1252).GetString(bytes);
// Result: "You won’t find a cheaper apartment * Sauna & Spa"

Note: much of my answer is based on guessing and looking at the decompiled code of System.Web 4.0. The reference source looks very similar (identical?).
You're correct that "&#146;" (6 characters) can be displayed in the browser. Your output string, however, contains "\u0092" (1 character). This is a control character, not an HTML entity.
According to the reference code, WebUtility.HtmlEncode() doesn't transform characters between 128 and 160 - all characters in this range are control characters (ampersand is special-cased in the code as are a few other special HTML symbols).
My guess is that because these are control characters, they're output without transformation because transforming it would change the meaning of the string. (I tried running some examples using LinqPad, this character was not rendered.)
If you really want to transform these characters (or remove them), you'll probably have to write your own function before/after calling HtmlEncode() - there may be something that does this already but I don't know of any.
Hope this helps.
Edit: Michael Liu's answer seems correct. I'm leaving my answer here because it may be useful in cases when the input encoding of a string is not known.
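For the "write your own function" route mentioned above, a minimal sketch might look like the following. It assumes you want the C1 control range (U+0080 to U+009F) emitted as decimal numeric character references after WebUtility.HtmlEncode() has run; the class and method names are hypothetical.

```csharp
using System;
using System.Net;
using System.Text;

static class HtmlC1Encoder
{
    // Hypothetical helper: HTML-encodes the input, then rewrites any remaining
    // C1 control characters (U+0080..U+009F) as decimal character references.
    public static string EncodeC1Controls(string input)
    {
        var sb = new StringBuilder();
        foreach (char c in WebUtility.HtmlEncode(input))
        {
            if (c >= '\u0080' && c <= '\u009F')
                sb.Append("&#").Append((int)c).Append(';');
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

If the source encoding is known to be Windows-1252, re-decoding the string as shown in the other answer is the cleaner fix; this sketch is only for cases where the origin of the string cannot be determined.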

Ã± character in input file being interpreted as ñ in C# console app

I've seen questions where the two characters are the same, but nothing that relates to this specific question, so here goes.
I'm running a C# console app that reads an input file made up of variable-length records. Each record consists of variable-length fields. I've got everything working in terms of parsing out each individual field within each record, not a problem. Except that today I came across the Ã± character in the input file. Now I know this translates to ñ, so I'm OK with it. However, because the input file sees Ã± as 2 characters, the record length changes in the C# app, because the app is interpreting those 2 characters as a single ñ. This is causing my record length to change from 154 characters to 153, which then messes up the individual fields during parsing.
I'm OK with the ñ character getting stored in my DB. But my question is this.
Prior to parsing the fields out of the record, how can I go about easily (without checking every single character) detecting that the ñ exists and trigger it to change the parsing logic? Should I simply do an IndexOf on the character and code it that way? I would think that would add a bit of overhead if I had to put that logic on every single field, although it seems like the easiest way. I would think there's a better way to handle it overall, but I've not encountered this before. Most of the posts I have found are more about handling the ñ character in text, as opposed to text being converted (properly) from Ã± to ñ.
Ideas?
the streamreader open I am using is as follows:
System.IO.StreamReader concatenatedFile = new System.IO.StreamReader(@"c:\Testing\test.txt", System.Text.Encoding.UTF8);
The record length changes from 154 characters on the input to 153 interpreted characters.

You must always read a text file in the encoding it was written in. Of course, sometimes you don't know which encoding that was...
Think of the input file as a stream of bytes. Most are 1-byte-1-ASCII-character, but there are 2 bytes (probably) that can be interpreted differently depending on encoding:
UTF8 - 1 character, ñ
(some other encoding) - 2 characters, Ã±
Since you say "the input file sees Ã± as 2 characters", this would probably be the encoding intended by whoever produces the file.
So, you should find out which encoding was originally meant, and use that - it's probably some ANSI encoding. You could try System.Text.Encoding.Default, but beware that this changes on different machines, so your code will now depend on the machine's default encoding.
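To see why the record length shifts by one, here is a small sketch that decodes the same two bytes (0xC3 0xB1) with two encodings. ISO-8859-1 is used only as a stand-in for "some other encoding"; the file's real encoding is unknown, and the helper name is made up.

```csharp
using System;
using System.Text;

static class EncodingLengthDemo
{
    // Returns how many UTF-16 chars the given bytes decode to under an encoding.
    public static int CountChars(byte[] raw, Encoding encoding)
    {
        return encoding.GetString(raw).Length;
    }
}
```

With raw = new byte[] { 0xC3, 0xB1 }, CountChars(raw, Encoding.UTF8) gives 1 (ñ), while CountChars(raw, Encoding.GetEncoding("iso-8859-1")) gives 2 (Ã±). Whichever count matches the 154-character record layout indicates how the producer intended the file to be read.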

You should set the StreamReader you use to read your input file to UTF-8 encoding. I don't believe for a second the original input was meant to be Ã±, so why do you care how many bytes the original input was - you care about character length, right?
Refer to this article to understand what's what in text encoding: http://www.joelonsoftware.com/articles/Unicode.html .

C# File Method not reading accented character [duplicate]

This question already has answers here:
Using .NET how to convert ISO 8859-1 encoded text files that contain Latin-1 accented characters to UTF-8
(2 answers)
Closed 8 years ago.
I'm using C# to automate an insert into command for a users table, and there's a user whose first name has an accented E, with a grave I believe?
Desirée
Whenever it makes it into the SQL Server table it appears as:
Desir?e
Which data type should I use on this column to ensure that it keeps the accented e?
I've tried varchar and nvarchar, neither seemed to matter.
Code for inserting:
var lines = File.ReadAllLines(users_feed_file);
I believe that there is an encoding issue occurring. When Visual Studio reads my file it reads the name as Desir?e.
So far I've tried to overload the File method, using:
Encoding enc = new UTF8Encoding(true, true);
var lines = File.ReadAllLines(users_feed_file,enc);
But this had no effect.
var lines = File.ReadAllLines(users_feed_file, Encoding.UTF8);
Doesn't work either.

SQL Server stores Unicode text essentially as UCS-2 or UTF-16. That is, it uses two bytes per character for everything in the Basic Multilingual Plane. UTF-8 is a variable-length encoding that uses one to four bytes per character as needed. If the character in question (it would be good to post the actual Unicode value) is translated by UTF-8 into three bytes, then SQL Server reads that back as two two-byte characters, one of which probably is not a valid, displayable character, thus rendering a question mark. Note that SQL Server is not storing a question mark; that is just how whatever text editor you are using renders this garbled character.
Try changing your C# encoding to Encoding.Unicode and see if that helps round-trip the character in question.
The same reasoning applies to characters that ought to fit into one byte but are represented with two by UTF-8. For example, the Unicode hex value for small e with grave is xE8, which could be represented as 00 E8 in two bytes. But UTF-8 renders it as C3 A8. Read those two bytes as a single UTF-16 code unit and you get an unrelated character, not è. So in this case it is not two bytes represented as three, but one byte represented incorrectly as two. This resource is invaluable when trying to debug extended character issues.
Note that for the basic Latin ASCII set, UTF-8 uses the same values as Unicode, and thus those characters round-trip just fine. It's when using extended character sets that compatibility between the two encodings cannot be guaranteed.
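The byte-level difference for è (U+00E8) can be checked directly. This is just an illustrative sketch with a made-up helper name; it dumps the bytes each encoding produces for the same string.

```csharp
using System;
using System.Text;

static class ByteDump
{
    // Returns the encoded bytes as a hyphen-separated hex string.
    public static string Hex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }
}
```

ByteDump.Hex(Encoding.UTF8, "\u00E8") yields "C3-A8", while ByteDump.Hex(Encoding.Unicode, "\u00E8") yields "E8-00" (UTF-16 little-endian). Dumping the first few bytes of the actual feed file the same way is a quick way to tell which encoding it was saved in.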

Try this code:
var lines = File.ReadAllLines(users_feed_file, Encoding.Unicode);
You can also view the file's encoding in Notepad++, so check that first.

Which encoding does Alt+Numpad keys generate?

In short:
For this code:
Encoding.ASCII.GetBytes("‚")
I want the output to be 130, but this gives me 63.
I am typing the string using Alt+0130.

On my setup:
Encoding.ASCII.GetBytes("‚"); // 63
Encoding.Default.GetBytes("‚"); // 130
Of course 'default' could very well be environment-dependent...

When you try to encode the string using the ASCII encoding, it will be converted to a question mark as there is no such character in the ASCII character set. The character code for the question mark is 63.
You need to use an encoding that supports the character to get its actual character code.
One option is to use the Encoding.Default property to get the encoding for the system codepage, as David suggested. However as the system codepage can differ, it's not guaranteed to give the same result on all computers.
The unicode character code is 8218, which you can get by simply converting the character to an int:
int characterCode = (int)'‚';
As this is not depending on any system settings, you should consider if you can use that instead of the encoded byte value.
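The two machine-independent results described above can be reproduced with a short sketch. The class and method names are hypothetical, and the character is written as the escape \u201A so the source file's own encoding does not matter.

```csharp
using System;
using System.Text;

static class AltCodeDemo
{
    // Encodes a single char as ASCII; unsupported chars become '?' (63).
    public static byte AsciiByte(char c)
    {
        return Encoding.ASCII.GetBytes(new[] { c })[0];
    }

    // Returns the UTF-16 code unit value of the char.
    public static int CodePoint(char c)
    {
        return c;
    }
}
```

AltCodeDemo.AsciiByte('\u201A') returns 63 and AltCodeDemo.CodePoint('\u201A') returns 8218. Getting 130 requires an encoding that actually contains the character at that position, such as Windows-1252, which is what Encoding.Default happened to be in the other answer.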

Base64 String throwing invalid character error

I keep getting a Base64 invalid character error even though I shouldn't.
The program takes an XML file and exports it to a document. If the user wants, it will compress the file as well. The compression works fine and returns a Base64 String which is encoded into UTF-8 and written to a file.
When it's time to reload the document into the program I have to check whether it's compressed or not. The code is simply:
byte[] gzBuffer = System.Convert.FromBase64String(text);
return "1F-8B-08" == BitConverter.ToString(new List<Byte>(gzBuffer).GetRange(4, 3).ToArray());
It checks whether the decoded bytes contain the GZip header (1F 8B 08) at offset 4.
Now the thing is, all my tests work. I take a string, compress it, decompress it, and compare it to the original. The problem is when I get the string returned from an ADO Recordset. The string is exactly what was written to the file (with the addition of a "\0" at the end, but I don't think that even does anything, even trimmed off it still throws). I even copy and pasted the entire string into a test method and compress/decompress that. Works fine.
The tests will pass but the code will fail using the exact same string? The only difference is instead of just declaring a regular string and passing it in I'm getting one returned from a recordset.
Any ideas on what am I doing wrong?

You say
The string is exactly what was written to the file (with the addition of a "\0" at the end, but I don't think that even does anything).
In fact, it does do something (it causes your code to throw a FormatException:"Invalid character in a Base-64 string") because the Convert.FromBase64String does not consider "\0" to be a valid Base64 character.
byte[] data1 = Convert.FromBase64String("AAAA\0"); // Throws exception
byte[] data2 = Convert.FromBase64String("AAAA"); // Works
Solution: Get rid of the zero termination. (Maybe call .TrimEnd('\0'))
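A minimal sketch of that fix, assuming the only stray characters are trailing NULs added by the recordset (the helper name is made up):

```csharp
using System;

static class Base64Helper
{
    // Strips trailing '\0' characters before handing the text to the decoder.
    public static byte[] DecodeIgnoringTrailingNulls(string text)
    {
        return Convert.FromBase64String(text.TrimEnd('\0'));
    }
}
```

Base64Helper.DecodeIgnoringTrailingNulls("AAAA\0") returns three zero bytes, the same as decoding "AAAA" directly. If NULs can also appear inside the string, this will not help and the source of the corruption needs to be found instead.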
Notes:
The MSDN docs for Convert.FromBase64String say it will throw a FormatException when
The length of s, ignoring white space characters, is not zero or a multiple of 4.
-or-
The format of s is invalid. s contains a non-base 64 character, more than two padding characters, or a non-white space character among the padding characters.
and that
The base 64 digits in ascending order from zero are the uppercase characters 'A' to 'Z', lowercase characters 'a' to 'z', numerals '0' to '9', and the symbols '+' and '/'.

Whether a null char is allowed or not really depends on the Base64 codec in question.
Given the vagueness of the Base64 standard (there is no authoritative exact specification), many implementations would just ignore it as white space. Others can flag it as a problem. And the buggiest ones wouldn't notice and would happily try decoding it... :-/
But it sounds like the C# implementation does not like it (which is one valid approach), so if removing it helps, that should be done.
One minor additional comment: UTF-8 is not a requirement; ISO-8859-x (aka Latin-x) and 7-bit ASCII would work as well. This is because Base64 was specifically designed to use only a 7-bit subset, which works with all 7-bit ASCII-compatible encodings.

string stringToDecrypt = HttpContext.Current.Request.QueryString.ToString();
//change to
string stringToDecrypt = HttpUtility.UrlDecode(HttpContext.Current.Request.QueryString.ToString());

If removing \0 from the end of string is impossible, you can add your own character for each string you encode, and remove it on decode.

One gotcha to do with converting Base64 from a string is that some conversion functions use the preceding "data:image/jpg;base64," and others only accept the actual data.
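If the input may or may not carry such a prefix, one way to normalize it before decoding is sketched below. The helper name is hypothetical, and it simply assumes everything up to and including the first "base64," marker is prefix.

```csharp
using System;

static class DataUriHelper
{
    // Returns the part after "base64," if present, otherwise the input unchanged.
    public static string StripDataUriPrefix(string value)
    {
        const string marker = "base64,";
        int index = value.IndexOf(marker, StringComparison.Ordinal);
        return index >= 0 ? value.Substring(index + marker.Length) : value;
    }
}
```

For example, DataUriHelper.StripDataUriPrefix("data:image/jpg;base64,AAAA") returns "AAAA", which can then be passed to Convert.FromBase64String.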
