Microsoft IDEs, source file encodings, BOMs and the Unicode character \uFEFF? - c#

We have parsers for various Microsoft languages (VB6, VB.net, C#, MS dialects of C/C++).
They are Unicode enabled to the extent that we all agree on what Unicode is. Where we don't agree, our lexers object.
Recent MS IDEs all seem to read/write their source code files in UTF-8... but I'm not sure this is always true. Is there some reference document that makes it clear how MS will write a source code file? With or without byte order marks? Does it vary from IDE version to version? (I can't imagine that the old VB6 dev environment wrote anything other than an 8-bit character set, and I'd guess it would be in the CP-xxxx encoding established by the locale, right?)
For C# (and I assume other modern language dialects supported by MS), the character code \uFEFF can actually be found in the middle of a file. This code is defined as a zero-width no-break space. It appears to be ignored by VS 2010 when found in the middle of an identifier or in whitespace, but it is significant in keywords and numbers. So, what are the rules? Or does MS have some kind of identifier normalization to handle things like composite characters, which would allow different identifier strings to be treated as identical?

This is in a way a non-answer, because it does not tell what Microsoft says but what the standards say. Hope it will be of assistance anyway.
U+FEFF as a regular character
As you stated, U+FEFF should be treated as a BOM (byte order mark) at the beginning of a file. Theoretically it could also appear in the middle of text, since it actually is a character denoting a zero-width no-break space (ZWNBSP). In some languages/writing systems all words in a line are joined (written together), and in such cases this character can be used as a separator, much like a regular space in English, except that it does not cause a typographically visible gap. I'm not actually familiar with such scripts, so my view might not be fully correct.
U+FEFF should only appear as a BOM
However, the usage of U+FEFF as a ZWNBSP has been deprecated as of Unicode version 3.2, and currently the purpose of U+FEFF is to act as a BOM. Instead of ZWNBSP as a separator, the U+2060 (word joiner) character is strongly preferred by the Unicode consortium. Their FAQ also suggests that any U+FEFF occurring in the middle of a file can be treated as an unsupported character that should be displayed as invisible. Another possible solution that comes to mind would be to replace any U+FEFF occurring in the middle of a file with U+2060, or to just ignore it.
Accidentally added U+FEFF
I guess the most probable reason for U+FEFF to appear in the middle of text is that it is an erroneous result (or side effect) of a string concatenation. RFC 3629, which incorporated the usage of a BOM, notes that stripping the leading U+FEFF is necessary when concatenating strings. This also implies that the character can simply be removed when found in the middle of text.
U+FEFF and UTF-8
U+FEFF as a BOM has no real effect when the text is encoded as UTF-8, since UTF-8 always has the same byte order. A BOM in UTF-8 interferes with systems that rely on the presence of certain leading characters, and with protocols that explicitly mandate the encoding or an encoding identification method. Real-world experience has also shown that some applications choke on UTF-8 with a BOM. Therefore the usage of a BOM is generally discouraged when using UTF-8. Removing the BOM from a UTF-8 encoded file should not cause incorrect interpretation of the file (unless there is some checksum or digital signature related to the byte stream of the file).
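To make that concrete in C#, here is a minimal sketch (my own illustration, not anything specified by Microsoft or the Unicode consortium) of the two remedies described above: dropping a leading U+FEFF and replacing any U+FEFF found mid-text with U+2060:

using System;

class FeffCleaner
{
    // Drop a leading U+FEFF (BOM) and replace any other U+FEFF with U+2060 (word joiner).
    static string CleanFeff(string text)
    {
        if (text.Length > 0 && text[0] == '\uFEFF')
            text = text.Substring(1);            // leading BOM: remove entirely
        return text.Replace('\uFEFF', '\u2060'); // mid-text ZWNBSP: use the word joiner instead
    }

    static void Main()
    {
        string s = "foo" + "\uFEFFbar";          // a concatenation that accidentally kept a BOM
        Console.WriteLine(CleanFeff(s));         // "foo" and "bar" joined by an invisible U+2060
    }
}

Note that a StreamReader constructed with detectEncodingFromByteOrderMarks set to true already consumes a leading UTF-8 BOM for you, so the cleanup above mostly matters for strings assembled by concatenation.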

On "how MS will write a souce code file" : VS can save files with and without BOM, as well in whole bunch of other encodings. The default is UTF-8 with BOM. You can try it yourself by going File -> Save ... as -> click triangle on "Save" button and chose "save with encoding".
On the usage of FEFF in actual code: I've never seen anyone use it in code... Wikipedia suggests that it should be treated as a zero-width no-break space if it occurs anywhere but the first position ( http://en.wikipedia.org/wiki/Byte_order_mark ).
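If you want the same choice programmatically that the Save-with-Encoding dialog gives you, UTF8Encoding lets you control whether the BOM is written. A small sketch (the file names are just examples):

using System.IO;
using System.Text;

class SaveWithOrWithoutBom
{
    static void Main()
    {
        // new UTF8Encoding(true) emits the EF BB BF preamble (VS's default "UTF-8 with signature");
        // new UTF8Encoding(false) omits it.
        File.WriteAllText("WithBom.cs", "class A { }", new UTF8Encoding(true));
        File.WriteAllText("WithoutBom.cs", "class A { }", new UTF8Encoding(false));
    }
}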

For C++, the file is either Unicode with BOM, or will be interpreted as ANSI (meaning the system code page, not necessarily 1252). Yes, you can save with whatever encoding you want, but the compiler will choke if you try to compile a Shift-JIS file (Japanese, code page 932) on an OS with 1252 as system code page.
In fact, even the editor will get it wrong. You can save a file as Shift-JIS on a 1252 system and it will look OK. But close the project and reopen it, and the text looks like junk. So the encoding information is not preserved anywhere.
So that's your best guess: if there is no BOM, assume ANSI. That is what the editor and compiler do.
Also: this applies to VS 2008 and VS 2010; older editors were not so Unicode friendly.
And C++ has different rules than C#: for C++ the files are ANSI by default, for C# they are UTF-8.
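If you have to read source files yourself, that "no BOM means ANSI" heuristic is easy to mimic in C#. A sketch, assuming Encoding.Default (the system code page) is an acceptable stand-in for "ANSI":

using System.IO;
using System.Text;

class BomSniffer
{
    // Honor a BOM if one is present; otherwise fall back to the system ANSI code page.
    static string ReadSourceFile(string path)
    {
        using (var reader = new StreamReader(path, Encoding.Default, detectEncodingFromByteOrderMarks: true))
        {
            return reader.ReadToEnd();
        }
    }
}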

Related

C# windows form cannot display simplified Chinese characters

Somehow my previous question has been marked as duplicate.
Question:
I have a database with records in Chinese characters. I can take them out, and use them in button.Text.
However, when I use
Console.WriteLine(button.Text);
The output displays every Chinese character as a "?"
Now, why is the question NOT duplicate?
I have THOROUGHLY searched for a solution, not just on Stack Overflow but everywhere I can search (with my limited skills), and read all those related posts. I found two potential solutions:
One:
Console.OutputEncoding = Encoding.Unicode;
I tried Encoding.Unicode, UTF8, UTF7, and UTF32.
Two:
Change my computer's locale in Control Panel to a region with Simplified Chinese. Then reboot and run the solution again.
I have tried both these suggested solutions, individually and together. Nothing works. The output changes from "?" to complete gibberish: unrecognizable characters.
Does anyone have any idea what to do here?
This is a more complete version of my comment. The way I was able to display Simplified Chinese characters was by changing the system language for non-Unicode programs to Chinese.
Then, in the cmd window's properties, I set the font to Consolas.
I didn't even need to set Console.OutputEncoding. With that in place, Chinese characters copied and pasted from the internet displayed correctly in the console.
I think this is a duplicate of How to write Unicode characters to the console?, which indicates that although .NET and Unicode support your characters, the font you are using as the output font of the console does not support that Unicode character.
Your post does not indicate that you have tried adjusting the console font.
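For what it's worth, the code side of the answers above is only one line; the console font still has to be changed by hand in the window's properties, since there is no straightforward managed API for it. A sketch:

using System;
using System.Text;

class ChineseConsole
{
    static void Main()
    {
        // Ask the console to emit UTF-8. A font that actually contains CJK glyphs
        // (picked manually in the cmd window's properties) is still required.
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine("\u4F60\u597D"); // two Chinese characters, written as escapes here
    }
}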

How to work with libraries that don't support Unicode characters

In my case, I recently picked up the irrKlang library, which allows me to work with audio files without doing too much work. Then I ran into the issue that Unicode characters in file paths are not supported by the library. It either reads them incorrectly (I would have thought that even if the path were read wrong, it could still find the file) or simply ignores them, leaving me with invalid file paths.
I searched their support forums for a solution to this, but all I got out of it was a "unicode? uhhh, why not just use ASCII?" kind of attitude towards Unicode, which I suppose is not uncommon.
What are some techniques that I could use to reliably pass Unicode strings to libraries that don't have Unicode support?
Simply put, you don't. You can pass data through them as a byte array and then interpret it back as a Unicode string on the other end, but if the library doesn't do Unicode, it doesn't do Unicode.
There is no point in passing a Unicode string to a library that is incapable of interpreting it.
If you need to do something specific (like using a load command on a filesystem with Unicode paths, e.g. HFS+), then don't. Rather, use the system-provided file APIs and push the data into the uncooperative library's constructor.
If you're seriously having problems with this Unicode file path business, because you don't want to pass addresses and bit streams around, then a simple solution is to make your own function:
obj_ptr* loadObjFromUnicodePath(const wchar_t* path)
{
    // create a temporary ASCII-named symlink to the file at 'path'
    // call the load API of irrKlang on the symlink
    // delete the temporary symlink and return the loaded object
}
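The same idea translated to C#, using a temporary copy instead of a symlink (no special privileges needed on Windows); loadFromAsciiPath is a placeholder for whatever non-Unicode load call your library actually exposes:

using System;
using System.IO;

static class UnicodePathWorkaround
{
    // Copy the file to an ASCII-only temp path, hand that path to the
    // non-Unicode-aware library, then clean up.
    public static T LoadViaAsciiCopy<T>(string unicodePath, Func<string, T> loadFromAsciiPath)
    {
        string asciiPath = Path.Combine(Path.GetTempPath(),
            Guid.NewGuid().ToString("N") + Path.GetExtension(unicodePath));
        File.Copy(unicodePath, asciiPath);
        try
        {
            return loadFromAsciiPath(asciiPath);
        }
        finally
        {
            File.Delete(asciiPath);
        }
    }
}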

Encoding text file to appear on IBM Mainframe

I have a C++ program that sends data via FTP in ASCII mode to an IBM mainframe. I am now doing this from C#.
When it gets there and is viewed, the file looks like garbage.
I cannot see anything in the C++ code that does anything special to encode the file into something like EBCDIC. When the files sent by the C++ program are viewed, they look OK. The only thing I see different is \015 & \012 for line feeds, whereas C# is using \r\n.
Would these characters have an effect, and if so, how can I get my C# app to use \015?
Do I have to do any special encoding to make it appear ok?
It sounds like you should indeed be using an EBCDIC encoding, and then probably transferring the text in binary. I have an EBCDIC encoding class you can use, should you wish.
Note that \015\012 is \r\n - they're characters 13 and 10 in decimal, just different ways of representing them. If you think the C++ code really is producing the same files as C#, compare two files which should be the same in a binary file editor.
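If you would rather not pull in an external class, .NET also ships the common EBCDIC code pages; code page 37 (IBM037) below is the US/Canada variant, so substitute whichever one your mainframe expects (and on .NET Core you would first have to register CodePagesEncodingProvider). A sketch of converting before a binary-mode transfer:

using System.IO;
using System.Text;

class EbcdicConversion
{
    static void Main()
    {
        Encoding ebcdic = Encoding.GetEncoding(37); // IBM037 EBCDIC

        byte[] ebcdicBytes = ebcdic.GetBytes("HELLO MAINFRAME");

        // Write the converted bytes out and transfer this file in binary mode,
        // so nothing re-translates it along the way.
        File.WriteAllBytes("upload.dat", ebcdicBytes);
    }
}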
Make sure you have the TYPE TEXT instead of TYPE BINARY command before you transfer the file.
If you are truly sending the files in ASCII mode, then the mainframe itself will convert that to EBCDIC (it's receiver-makes-good).
The fact that you're getting apparent garbage at the mainframe end, and character codes \015 and \012 (which are CR and LF respectively) means that you're not transferring in ASCII mode.
As an aside, the ISPF editor has been able to view ASCII data sets for quite a few versions now. Open up the file and enter the commands source ascii and lf.
The first converts the characters from ASCII to EBCDIC so you can see what they are; the second goes through and pads out "lines" so that line-feed markers are replaced with enough spaces to reach the record length.
Invaluable commands when dealing with mixed-encoding environments, which is where I do a lot of my work.

Strange UTF-8 encoding issues when reading XML, writing results in C#

I'm having an issue with a simple C# program that is meant to read an XML document from the web, pull out some elements, and then write the contents of those elements to an HTML file (in a simple table). Though the XML documents are correctly encoded as UTF-8, in the end all of my generated HTML files fail to correctly transcribe non-Western characters (e.g. "Wingdings"-like output when the input contains Japanese).
Since the XML files are really large, the program works by having an XmlReader yielding matching elements as it encounters them, which are then written to the HTML file using a StreamWriter.
Does anyone have a sense of where in a program like this the UTF-8 encoding might have to be explicitly forced?
The short explanation
I'm going to guess here: Your browser is displaying the page using the wrong character encoding.
You need to answer: What character encoding does your browser think the HTML is? (I bet it's not UTF-8.)
Try to adjust your browser: for example, in Firefox, this is View → Character Encoding, then select the character encoding to match your document.
Since you seem to have a very multilingual document, have your C# output in UTF-8 - which supports every character known to man, including Japanese, Chinese, Latin, etc. Then try to tell Firefox, IE, whatever, to use UTF-8. Your document should display.
If this is the problem, then you need to inform the browser of the encoding of your document. Do so by (see this):
Having your web server return the character encoding in the HTTP headers.
Specifying a character encoding in a <meta> tag.
Specifying a character encoding in the XML preamble for XHTML.
The more of those you do, the merrier.
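On the C# side, that advice boils down to writing the file with an explicit UTF-8 StreamWriter and putting a matching meta declaration in the markup. A sketch (the file name and markup are just illustrative):

using System.IO;
using System.Text;

class WriteUtf8Html
{
    static void Main()
    {
        // UTF8Encoding(false) writes UTF-8 without a BOM; the meta tag tells the browser
        // what it is reading even when no HTTP Content-Type header is available.
        using (var writer = new StreamWriter("report.html", false, new UTF8Encoding(false)))
        {
            writer.WriteLine("<html><head>");
            writer.WriteLine("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">");
            writer.WriteLine("</head><body>");
            writer.WriteLine("<p>\u65E5\u672C\u8A9E</p>"); // Japanese sample text, written as escapes here
            writer.WriteLine("</body></html>");
        }
    }
}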
The long explanation
Let's have a look at a few things you mentioned:
using (StreamWriter sw = new StreamWriter(outputFile,true,System.Text.Encoding.UTF8))
and
found that using Text.Encoding.Default made other Western character sets with accents work (Spanish accents, German umlauts), although Japanese still exhibits problems.
I'm going to go out on a limb and say that you're an American computer user. Thus, for you, the "default" encoding on Windows is probably Windows-1252. The default encoding that a web browser will use, if it can't detect the encoding of an HTML document, is ISO-8859-1. ISO-8859-1 and Windows-1252 are very similar, and they both cover ASCII plus some common Latin characters such as é, è, etc. More importantly, the accented characters are encoded the same, so, for those characters, the two encodings will both decode the same data. Thus, when you switched to "default", the browser was correctly decoding your Latin characters, albeit with the wrong encoding. Japanese doesn't exist in either ISO-8859-1 or Windows-1252, and in both of those Japanese just appears as random characters ("mojibake").
The fact that you noted that switching to "default" fixes some of the accented Latin characters tells me that your browser is using ISO-8859-1, which isn't what we want: we want to encode the text using UTF-8, and we need the browser to read it back as such. See the short explanation for how to do that.

How to convert the encoding of a string to UTF-8 without knowing the original encoding in C#?

I'm reading a CSV file with Fast CSV Reader (on CodeProject). When I print the content of the fields, the console shows the character '?' in some words. How can I fix it?
The short version is that you have to know the encoding of any text file you're going to read up front. You could use things like byte order marks and other heuristics if you really aren't going to know, but you should always allow for the value to be tweaked (in the same way that Excel does if you're importing CSV).
It's also worth double-checking the values in the debugger, as it may be that it is the output that is wrong, as opposed to the reading -- bear in mind that all strings are Unicode internally, and conversion to '?' sounds like it is failing to convert the Unicode to the relevant code page for the console.
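Concretely, both points come down to being explicit about encodings on each side. A sketch, assuming the file happens to be Windows-1252 and that your CSV parser can be handed a TextReader rather than a bare file name:

using System;
using System.IO;
using System.Text;

class CsvEncodingSketch
{
    static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8; // so correctly read strings don't degrade to '?'

        // Be explicit about the file's encoding when reading; a BOM, if present,
        // still overrides the guess because the third argument is true.
        using (var reader = new StreamReader("data.csv", Encoding.GetEncoding(1252), true))
        {
            // pass 'reader' to the CSV parser so it never has to guess the encoding itself
            Console.WriteLine(reader.ReadLine());
        }
    }
}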
