µ and é in namespace - c#

We have developed a C# program. The program is distributed in Europe without problems on miscellaneous hardware configurations. Some of the namespaces in our program contain a 'µ' or an 'é' character. When deploying our program on 'non-European' systems, i.e. in China or on some US systems, a problem occurs: somewhere in the process the 'µ' is changed into 'µ', causing lots of problems. What is causing this problem, and how can we work around it (preferably without changing the name of the namespace)?
edit 2015.08.07
Thanks all for your comments, but to clarify: the source files are not distributed as such. The program is compiled to an exe and then distributed using NSIS. Source control is done using SVN. How can I verify the presence of the BOM in my source files?

Either you or the recipient or both are using a character encoding other than UTF-8.
People shouldn't do that, but they do.
Some tools will default to a legacy encoding unless each file starts with a BOM, so include a BOM at the start of each file.
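To answer the edit's question about verifying the BOM: a quick way is to check whether each source file begins with the three bytes EF BB BF. The sketch below does this in Python (a standalone check, independent of the C# toolchain); the `**/*.cs` glob is just an illustrative pattern for your source tree.

```python
# Check whether source files start with a UTF-8 BOM (bytes EF BB BF).
# The glob pattern is a placeholder; point it at your own source tree.
import glob

UTF8_BOM = b"\xef\xbb\xbf"

def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 byte order mark."""
    with open(path, "rb") as f:
        return f.read(3) == UTF8_BOM

for path in glob.glob("**/*.cs", recursive=True):
    marker = "BOM" if has_utf8_bom(path) else "no BOM"
    print(f"{path}: {marker}")
```

Files reported as "no BOM" are the ones a legacy-encoding tool may misread; re-save those as UTF-8 with BOM.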

You are hitting a difference in the character sets used by the different systems. Your software was probably running on systems assuming ISO-8859, most often used in European languages, while the Chinese and US systems you are encountering are probably using the Universal Character Set (ISO/IEC 10646). The mapping between the two is not a simple 1-to-1, so you run into the problems you are having. W3.org has a good article on this topic at http://www.w3.org/International/articles/definitions-characters/
Pay special attention to the sections on "Character sets, coded character sets, and encodings", and "The Document Character Set". If this is a web app, "The HTTP Header" might be particularly useful.

This isn't an answer, but I want to point out that this can be an encoding problem even on a single system, showing up only when code (my guess here) reads a byte at a time, as opposed to explicitly reading text in a particular encoding.
I have a C program (32-bit, if that matters) which reads in a file using fgetc and saves characters to be used as "illegal" characters in names. It isn't fancy; it just prevents a few ASCII characters from coming in accidentally, like an apostrophe (') in the name of a thing/object/label. Someone asked me to test µ (mu, which appears as a single character in this interface to Stack Overflow). I generated it (without examining the underlying encoding) using Insert-Symbol in MS Word, cut it from Word, and inserted it into a text file using Notepad++. In Notepad++ and MS Word it looks like the same symbol. BUT fgetc, taking one int (or char, however you like to think of it) at a time, sees the following in my debug output for a test case:
About to check for illegal characters in =>NameOfItemµ<=
Illegal character =>Â<= was found. Illegal characters are: '`µ
Illegal character =>µ<= was found. Illegal characters are: '`µ
I am compiling with Visual C++ Express 2013.
I'm happy it catches the illegal characters, and hope this isn't just noise to readers of this topic.
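The two "illegal characters" above are exactly the two UTF-8 bytes of µ seen one at a time. A minimal reproduction (in Python rather than C, since the byte-level behavior is language-independent):

```python
# Why reading UTF-8 text one byte at a time "splits" µ into two characters:
# µ (U+00B5) is encoded in UTF-8 as the two bytes 0xC2 0xB5. A byte-at-a-time
# reader that treats each byte as a Windows-1252 character sees Â, then µ.
data = "NameOfItemµ".encode("utf-8")

chars_seen = [bytes([b]).decode("cp1252") for b in data]
print(chars_seen[-2:])  # the two bytes of µ, read as single characters
```

This matches the debug output: the reader flags Â (0xC2) first, then µ (0xB5).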

Related

How to handle directory separator character in japanese and korean? [duplicate]

tl;dr: How do I ask Windows what the current directory separator character on the system is?
Different versions of Windows seem to behave differently (e.g. \ and / both work on the English versions, ¥ apparently works on the Japanese version, ₩ apparently works on the Korean version, etc.).
Is there any way to avoid hard-coding this, and instead ask Windows at run time?
Note:
Ideally, the solution should not depend on a high-level DLL like ShlWAPI.dll, because lower-level libraries also depend on this. So it should really depend on kernel32.dll or ntdll.dll or the like... although I'm having trouble finding anything at all, whether at a high level or at a low level.
Edit:
A little experimentation told me that it's the Win32 subsystem (i.e. kernel32.dll... or is it perhaps RtlDosPathNameToNtPathName_U in ntdll.dll? not sure, didn't test...) which converts forward slashes to backslashes, not the kernel. (Prefixing \\?\ makes it impossible to use forward slashes later in the path -- and the NT native user-mode API also fails with forward slashes.)
So apparently it's not quite "built into" Windows, but rather just a compatibility feature -- which means you can't just blindly substitute slashes instead of backslashes, because any program which randomly prefixes \\?\ to paths will automatically break on forward slashes.
I have mixed feelings on what conclusions to make regarding this, but I just thought I'd mention it.
(I tagged this as "path separator" even though that's technically incorrect because the path separator is used for separating paths, not directories (; vs. \). Hopefully people get what I meant.)
While the ₩ and ¥ characters are shown as directory separator symbols on the Korean and Japanese versions of Windows respectively, they are only how those versions of Windows render the same Unicode code point, U+005C, as a glyph. The underlying code point for backslash is the same across the English, Japanese, and Korean versions of Windows.
Extra confirmation for this can be found on this page: http://msdn.microsoft.com/en-us/library/dd374047(v=vs.85).aspx
Security Considerations for Character Sets in File Names
Windows code page and OEM character sets used on Japanese-language systems contain the Yen symbol (¥) instead of a backslash (\). Thus, the Yen character is a prohibited character for NTFS and FAT file systems. When mapping Unicode to a Japanese-language code page, conversion functions map both backslash (U+005C) and the normal Unicode Yen symbol (U+00A5) to this same character. For security reasons, your applications should not typically allow the character U+00A5 in a Unicode string that might be converted for use as a FAT file name.
Also, I don't know of any Windows API function that gets you the system's path separator, but you can rely on it being \ in all circumstances.
http://msdn.microsoft.com/en-us/library/aa365247%28VS.85%29.aspx#naming_conventions
The following fundamental rules enable applications to create and process valid names for files and directories, regardless of the file system:
...
Use a backslash (\) to separate the components of a path. The backslash divides the file name from the path to it, and one directory name from another directory name in a path. You cannot use a backslash in the name for the actual file or directory because it is a reserved character that separates the names into components.
...
About /
Windows should support the use of / as a directory separator in the API functions, though not necessarily in the command prompt (command.com).
Note File I/O functions in the Windows API convert "/" to "\" as part of converting the name to an NT-style name, except when using the "\?\" prefix as detailed in the following sections.
It's 'tough' to figure out the truth of all this, but this might be a really helpful link about / in Windows paths: http://bytes.com/topic/python/answers/23123-when-did-windows-start-accepting-forward-slash-path-separator
The original poster added the phrase "kernel-mode" in a comment to someone else's answer.
If the original question intended to ask about kernel mode, then it probably isn't a good idea to depend on / being a path separator. Different file systems allow different character sets on disk. Different file system drivers in Windows can also allow different characters sets, which normally cannot include characters which the underlying file systems don't accept on disk, but sometimes they can behave strangely. For example Posix mode allows a component name to contain some characters in a path name in an NTFS partition, even though NTFS ordinarily doesn't allow those characters. (But obviously / isn't one of them, in Posix.)
In kernel mode in Unicode, U+005C is always a backslash and it is always the path separator. Unicode code points for yen and won are not U+005C and are not path separators.
In kernel mode in ANSI, complications arise depending on which ANSI code page. In code pages that are sufficiently similar to ASCII, 0x5C is a backslash and it is the path separator. In ANSI code pages 932 and 949, 0x5C is not a backslash but 0x5C might be a path separator depending on where it occurs. If 0x5C is the first byte of a multibyte character, then it's a yen sign or won sign and it is a path separator. If 0x5C is the second byte of a multibyte character, then it's not a character by itself, so it's not a yen sign or won sign and it's not a path separator. You have to start parsing from the beginning of the string to figure out if a particular char is actually a whole character or not. Also in Chinese and UTF-8, multibyte characters can be longer than two chars.
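The 0x5C-as-trail-byte hazard described above is easy to demonstrate. In code page 932, the kanji 表 (U+8868) encodes as the byte pair 0x95 0x5C, so a naive byte scan for backslashes misfires. A small illustration (in Python, since its `cp932` codec implements the same code page):

```python
# In code page 932 (Shift-JIS), 0x5C can be the second byte of a multibyte
# character. 表 (U+8868) encodes as 0x95 0x5C, so a naive byte-level scan
# for backslashes would "find" a path separator inside this character.
kanji = "表"
encoded = kanji.encode("cp932")
print(encoded)          # b'\x95\\'
print(0x5C in encoded)  # True, even though the text contains no backslash
```

This is why, as the answer says, you must parse from the beginning of the string to know whether a given 0x5C byte is really a separator.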
The standard forward slash (/) has always worked in all versions of DOS and Windows. If you use it, you don't have to worry about issues with how the backslash is displayed on Japanese and Korean versions of Windows, and you also don't have to special-case the path separator for Windows as opposed to POSIX (including Mac). Just use forward slash everywhere.

Visual C# Character Encoding Mix Up

I've been working on a song Metadata program for a short while now, and have run in to what seems like quite a simple but challenging issue...
As the program reads in some pieces of data from songs, it may come across an unusual symbol, such as ℗. I noticed that some things happened that I didn't expect, so I stepped through each line of my code individually. I noticed that said character was actually being read as "â„—". I tried this with another character, Ü, to see what the result was, and this time I got "Ãœ".
I did a quick Google search which led me to this page:
http://www.i18nqa.com/debug/utf8-debug.html
From this I would assume that my desired characters are actually being interpreted using the mentioned Windows-1252 encoding, and not UTF-8?
I should add that to investigate this further, I tried writing the Windows-1252 versions into the metadata using my program, and these came out in Windows itself as the correct ℗ and Ü characters... Can anybody tell me how to ensure that both the read and write functions use UTF-8, and also why this really happens, if I was not correct? Thanks.
I am using C#.
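The asker's guess is right: "â„—" and "Ãœ" are exactly what you get when UTF-8 bytes are decoded as Windows-1252. A minimal reproduction of the observed corruption (shown in Python; the same round trip happens in any language that mixes the two encodings):

```python
# Reproducing the observed corruption: UTF-8 bytes mis-decoded as Windows-1252.
# ℗ (U+2117) is three bytes in UTF-8; read back as code page 1252 those bytes
# render as the three characters "â„—". The same round trip turns Ü into "Ãœ".
for ch in ("℗", "Ü"):
    garbled = ch.encode("utf-8").decode("cp1252")
    print(f"{ch} -> {garbled}")
```

Writing the pre-garbled text back out (as the asker did) simply reverses the same mistake, which is why it "came out correct" in Windows.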

Windows - Can console output inadvertently cause a system beep?

I have a C# console application that logs a lot to the console (using Trace). Some of the stuff it logs is the compressed representation of a network message (so a lot of that is rendered as funky non-alphabetic characters).
I'm getting system beeps every so often while the application is running. Is it possible that some "text" I am writing to the console is causing them?
(By system beep, I mean from the low-tech speaker inside the PC case, not any kind of Windows sound scheme WAV)
If so, is there any way to disable it for my application? I want to be able to output any possible text without it being interpreted as a sound request.
That's usually caused by outputting character code 7, CTRL-G, which is the BEL (bell) character.
The first thing I normally do when buying a new computer or motherboard is to ensure the wire from the motherboard to the speaker is not connected. I haven't used the speaker since the days of Commander Keen (and removing that wire is the best OS-agnostic way of stopping the sound :-).
In the registry, under HKEY_CURRENT_USER\Control Panel\Sound, set the "Beep" value to "no".
Absolutely: if you output the ASCII control code BEL (0x07) to a console, it beeps.
If you don't want it to beep, you'll either have to replace the 0x07 character before outputting it, or disable the "Beep" device driver, which you'll find in the Non-Plug and Play Drivers section (visible if you turn on the Show Hidden Devices option). Or take the speaker out.
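The first option, filtering BEL before writing, is trivial to do in the logging path. A sketch (in Python; the C# equivalent would be a `String.Replace` on the text before `Trace.WriteLine`):

```python
# Keep console logging from beeping by replacing the BEL control
# character (0x07) with a printable placeholder before writing.
def silence_bel(text):
    """Replace BEL so binary-ish dumps of network messages stay quiet."""
    return text.replace("\x07", ".")

noisy = "header\x07payload\x07"
print(silence_bel(noisy))  # prints "header.payload." with no beep
```

For compressed network data you may want to extend this to strip all control characters below 0x20 except tab and newline, since none of them render usefully anyway.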
Even if you check the input for BEL characters, it may still beep. This is due to font settings and Unicode conversion. The character in question is U+2022, Bullet.
Raymond Chen explains:
In the OEM code page, the bullet character is being converted to a beep. But why is that?
What you're seeing is MB_USEGLYPHCHARS in reverse. Michael Kaplan discussed MB_USEGLYPHCHARS a while ago. It determines whether certain characters should be treated as control characters or as printable characters when converting to Unicode. For example, it controls whether the ASCII bell character 0x07 should be converted to the Unicode bell character U+0007 or to the Unicode bullet U+2022. You need the MB_USEGLYPHCHARS flag to decide which way to go when converting to Unicode, but there is no corresponding ambiguity when converting from Unicode. When converting from Unicode, both U+0007 and U+2022 map to the ASCII bell character.
A \a (BEL, 0x07) in the output string will cause a beep, if not disabled at the OS level.

C# : Characters do not display well when in Console, why?

The picture explains it all (screenshot of accented characters displayed incorrectly in the console):
http://img133.imageshack.us/img133/4206/accentar9.png
The variable textInput comes from File.ReadAllText(path); characters like é and è do not display. When I run my unit test, all is fine! I see them... Why?
The .NET classes (System.IO.StreamReader and the like) use UTF-8 as the default encoding. If you want to read a different encoding you have to pass it explicitly to the appropriate constructor overload.
Also note that there's not one single encoding called “ANSI”. You're probably referring to the Windows codepage 1252 aka “Western European”. Notice that this is different from the Windows default encoding in other countries. This is relevant when you try to use System.Text.Encoding.Default because this actually differs from system to system.
/EDIT: It seems you misunderstood both my answer and my comment:
The problem in your code is that you need to tell .NET what encoding you're using.
The other remark, saying that “ANSI” may refer to different encodings, didn't have anything to do with your problem. It was just a “by the way” remark to prevent misunderstandings (well, that one backfired).
So, finally: The solution to your problem should be the following code:
string text = System.IO.File.ReadAllText("path", Encoding.GetEncoding(1252));
The important part here is the usage of an appropriate System.Text.Encoding instance.
However, this assumes that your encoding is indeed Windows-1252 (but I believe that's what Notepad++ means by “ANSI”). I have no idea why your text gets displayed correctly when read by NUnit. I suppose that NUnit either has some kind of autodiscovery for text encodings or that NUnit uses some weird defaults (i.e. not UTF-8).
Oh, and by the way: “ANSI” really refers to the “American National Standards Institute”. There are a lot of completely different standards that have “ANSI” as part of their names. For example, C++ is (among others) also an ANSI standard.
Only in some contexts it's (imprecisely) used to refer to the Windows encodings. But even there, as I've tried to explain, it usually doesn't refer to a specific encoding but rather to a class of encodings that Windows uses as defaults for different countries. One of these is Windows-1252.
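The failure mode the answer describes, a Windows-1252 ("ANSI") file read with a UTF-8 default, can be reproduced outside .NET too. A sketch in Python (the C# fix is the `ReadAllText` overload shown above; this just illustrates why the explicit encoding matters):

```python
# A file saved as "ANSI" (here, Windows-1252) must be read with that
# encoding; decoding the same bytes as UTF-8 fails outright, because
# 0xE9 (é in cp1252) is not valid UTF-8 in this context.
import tempfile

path = tempfile.mktemp()
with open(path, "w", encoding="cp1252") as f:
    f.write("café élève")

text = open(path, encoding="cp1252").read()  # correct: pass the real encoding
print(text)

try:
    open(path, encoding="utf-8").read()      # wrong encoding for this file
except UnicodeDecodeError as e:
    print("UTF-8 decode failed:", e.reason)
```

Note that Python fails loudly here, while some APIs silently substitute replacement characters; either way the cure is the same, state the encoding explicitly.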
Try setting your console session's output code page using the chcp command. The code pages supported by Windows are listed here, here, and here. Remember, fundamentally the console is pretty simple: it displays Unicode or DBCS characters by using a code page to determine the glyph that will be displayed.
I do not know why it works with NUnit, but when I opened the file with Notepad++ it showed ANSI as the format. I converted it to UTF-8 and now it works.
I am still wondering why it was working with NUnit and not in the console, but at least it works now.
Update
I do not get why I got downvoted on the question and on this answer, because the question is still good: why can I read an ANSI file in NUnit but not in a console?

Localization: How to map culture info to a script name or Unicode character range?

I need some information about localization. I am using .net 2.0 with C# 2.0 which takes care of most of the localization related issues. However, I need to manually draw the alphabets corresponding to the current culture on the screen in one particular screen.
This would be similar to the Contacts screen in Microsoft Outlook (Address Cards or Detailed Address Cards view under Contacts), so it needs a column of buttons at the right edge, one for each letter.
I am trying to emulate that, but I don't want to ask the user to choose the script. If the current culture is say, Chinese, I want to draw Chinese alphabets. When the user changes the culture info to English (and when he restarts the application) I want to draw English alphabets instead. Hope you understand where I am going with this query.
I can determine the culture of the current user (Application.CurrentCulture or System.Globalization.CultureInfo.CurrentCulture will give the culture related information). I also have all the scripts to render the alphabets. However, the problem is that I don't know how to map the culture info to the name of a script.
In other words, is there a way to determine the script name corresponding to a culture? Or is it possible to determine the range of Unicode character values corresponding to a culture? Either of them would allow me to render the alphabets on the button properly.
Any suggestions or guidance regarding this is truly appreciated. If there is something fundamentally wrong with my approach (or with what I am trying to achieve), please point out that as well. Thanks for your time.
PS: I know the easiest solution is to either configure the script name as part of user preferences or display a list of languages for the user to choose from (a la Contact in Outlook 2007). But I am just trying to see whether I can render the alphabets corresponding to the culture without the user having to do anything.
In native code there's LOCALE_SSCRIPTS for GetLocaleInfoEx() (Vista and above) that tells you which scripts are expected for a locale. There isn't a similar concept in .NET at this time.
Chinese has thousands of characters, so it might not be feasible to show all the characters in their character set. There's no native concept of 'alphabet' in Chinese, and I don't think Chinese has a syllabary like Japanese does.
Pinyin (Chinese written in roman alphabet) can be used to represent the Chinese characters, and that might help you index them. I know this doesn't answer your question, but I hope it's helpful.
I fully agree with mikiemacman. In addition, a given language doesn't necessarily use all the letters of a script.
Anyway, the closest I can think of is CultureInfo.TextInfo.ANSICodePage. There are only a handful of ANSI code pages, so you could create a table (or a switch() statement, whatever) that lists the script for each ANSI code page.
Proto, wait! There's a much more accurate solution. It's an unmanaged one, hence you may have to P/Invoke:
GetLocaleInfoW(MAKELCID(wLangId, SORT_DEFAULT), LOCALE_FONTSIGNATURE, wcBuf, MAXWCBUF);
This gives you a LOCALESIGNATURE structure. The answer is in the lsUsb field: the Unicode subsets bitfield. Rats! The MSDN web page for this structure is empty, but look it up in your MSDN copy; it's fully documented there: a whole set of flags that describe which scripts are supported. And yes, there's a flag for Tamil ;-)
HTH.
EDIT: Oops! Hadn't seen Shawne's answer. Wow! Answer from an in-house expert! ;-) Anyway, you may still be interested in a Pre-Vista compatible answer.
Fascinating topic. While it might not answer your question, Omniglot is a good resource.
The correct answer is likely to be complex, and depends on the exact problem you're solving. Assuming your goal is showing only the letters used in a particular language to separate phonebook sections (as in Outlook), a few of the issues are:
People who have contact names spanning several scripts/languages.
2-glyph letters (e.g. 'Lj' in Serbian). It is one phoneme, always treated as a single letter although it has 2 Unicode symbols. It would have its own section in the phonebook (separate from 'L').
Too many glyphs to list (e.g. Chinese)
Unorthodox ordering (e.g. Thai -- a phone book would be separated by consonants only, ignoring the vowels).
Uppercase / lowercase distinction (presumably you'd only want one case for languages that have it, which breaks down in minor ways, e.g. the Turkish 'i').
