Chinese Character Encoding (UTF-8, GBK) - c#

I have a web crawler that runs on different websites (Chinese ones in this case).
When I retrieve the data and display it on my own site, the Chinese characters all end up as garbage. I have read about character encoding and found that UTF-8 is generally the best choice.
Now the problem is when I use UTF-8 - The data crawled from WEBSITE-1 are shown correctly but not for WEBSITE-2.
For WEBSITE-2, the character encoding gb18030 is working correctly.
My question is: is there a way to determine the character encoding of a website, so that I can build a generic solution? That way I could render a page on my local site knowing which character encoding to use, handle everything in the backend, and not have to worry on the front end about which encoding is needed to open a page.
Right now I have two pages: one for UTF-8 Chinese characters, and one for GB18030 Chinese characters.

Use the HTML meta tag "Content-Type" (the http-equiv form) for HTML < 5, or the "charset" meta attribute for HTML5.
W3schools charset
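For a crawler-side generic solution, the declared charset can also be sniffed programmatically. Below is a minimal sketch (the class and method names are mine, not from the question): check the Content-Type response header first, then fall back to the meta tag, then to UTF-8.

using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

static class CharsetSniffer
{
    // Downloads a page and decodes it with whatever encoding the site declares.
    public static string Download(string url)
    {
        using (var client = new WebClient())
        {
            byte[] raw = client.DownloadData(url);

            // 1. charset from the Content-Type response header, if present
            string contentType = client.ResponseHeaders[HttpResponseHeader.ContentType] ?? "";
            string charset = MatchCharset(contentType);

            // 2. otherwise sniff the <meta charset=...> / http-equiv tag from the raw bytes
            if (charset == null)
                charset = MatchCharset(Encoding.ASCII.GetString(raw));

            Encoding enc;
            try { enc = Encoding.GetEncoding(charset ?? "utf-8"); }
            catch (ArgumentException) { enc = Encoding.UTF8; }   // unknown name -> fall back

            return enc.GetString(raw);   // gb18030 pages decode correctly too
        }
    }

    static string MatchCharset(string text)
    {
        var m = Regex.Match(text, @"charset\s*=\s*[""']?([\w-]+)", RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }
}

With something like this, a single backend path can serve both the UTF-8 and the GB18030 sites, and the rendered page can simply be emitted as UTF-8.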

Related

Reading accented characters (á) C# UTF-8 / Windows 1252

I am trying to read a file that has some letters that aren't showing up correctly when I try to convert the file to XML. The letters come up as blocks when I open the file in Notepad++, but in the original document they are correct. An example letter is á.
I am reading the file as UTF-8, so the character should be covered, but for some reason it isn't. If I change the encoding to Windows-1252, it shows the character correctly.
Why is it not available in UTF-8 encoding but is in Windows 1252?
If you need any more information, just ask. Thanks in advance.
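The usual cause is that the file is not UTF-8 at all: in Windows-1252, á is the single byte 0xE1, which is not a valid UTF-8 sequence on its own, so a UTF-8 read shows a replacement block. A minimal sketch, assuming the file really is Windows-1252 and a full .NET Framework runtime (the file names are placeholders, not from the question):

using System.IO;
using System.Text;

class ReadAnsiFile
{
    static void Main()
    {
        // Decode with the file's real encoding instead of letting the reader assume UTF-8.
        string text;
        using (var reader = new StreamReader("input.txt", Encoding.GetEncoding(1252)))
        {
            text = reader.ReadToEnd();   // "á" now decodes correctly
        }

        // Re-save as UTF-8 so downstream XML tooling sees valid UTF-8 bytes.
        File.WriteAllText("output.xml", text, Encoding.UTF8);
    }
}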

Microsoft IDEs, source file encodings, BOMs and the Unicode character \uFEFF?

We have parsers for various Microsoft languages (VB6, VB.net, C#, MS dialects of C/C++).
They are Unicode enabled to the extent that we all agree on what Unicode is. Where we don't agree, our lexers object.
Recent MS IDEs all seem to read/write their source code files in UTF-8... I'm not sure this is always true. Is there some reference document that makes it clear how MS will write a source code file? With or without byte order marks? Does it vary from IDE version to version? (I can't imagine that the old VB6 dev environment wrote anything other than an 8-bit character set, and I'd guess it would be in the CP-xxxx encoding established by the locale, right?)
For C# (and I assume other modern language dialects supported by MS), the character code \uFEFF can actually be found in the middle of a file. This code is defined as a zero-width no-break space. It appears to be ignored by VS 2010 when found in the middle of an identifier, in whitespace, but is significant in keywords and numbers. So, what are the rules? Or does MS have some kind of normalize-identifiers to handle things like composite characters, that allows different identifier strings to be treated as identical?
This is in a way a non-answer, because it does not tell what Microsoft says but what the standards say. Hope it will be of assistance anyway.
U+FEFF as a regular character
As you stated, U+FEFF should be treated as a BOM (byte order mark) at the beginning of a file. Theoretically it could also appear in the middle of text, since it actually is a character denoting a zero-width non-breaking space (ZWNBSP). In some languages/writing systems all words in a line are joined (written together), and in such cases this character could be used as a separator, just like a regular space in English, but without causing a typographically visible gap. I'm not actually familiar with such scripts, so my view might not be fully correct.
U+FEFF should only appear as a BOM
However, the usage of U+FEFF as a ZWNBSP has been deprecated as of Unicode version 3.2, and currently the purpose of U+FEFF is to act as a BOM. Instead of ZWNBSP as a separator, the U+2060 (word joiner) character is strongly preferred by the Unicode consortium. Their FAQ also suggests that any U+FEFF occurring in the middle of a file can be treated as an unsupported character that should be displayed as invisible. Another possible solution that comes to mind would be to replace any U+FEFF occurring in the middle of a file with U+2060, or just ignore it.
Accidentally added U+FEFF
I guess the most probable reason for U+FEFF to appear in the middle of text is that it is an erroneous result (or side effect) of string concatenation. RFC 3629, which incorporated the usage of a BOM, notes that stripping the leading U+FEFF is necessary when concatenating strings. This also implies that the character could simply be removed when found in the middle of text.
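A minimal sketch of that clean-up (the helper name is mine, not from the answer): keep a leading U+FEFF, which is a legitimate BOM, and drop any other occurrence.

using System;

class StripStrayBom
{
    static string RemoveStrayFeff(string s)
    {
        if (string.IsNullOrEmpty(s)) return s;
        string head = s.Substring(0, 1);                               // a leading U+FEFF is a real BOM
        string rest = s.Substring(1).Replace("\uFEFF", string.Empty);  // stray ZWNBSPs are dropped
        return head + rest;
    }

    static void Main()
    {
        string input = "var\uFEFFname = 42;";
        Console.WriteLine(RemoveStrayFeff(input));   // prints: varname = 42;
    }
}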
U+FEFF and UTF-8
U+FEFF as a BOM has no real effect when the text is encoded as UTF-8, since UTF-8 always has the same byte order. A BOM in UTF-8 interferes with systems that rely on the presence of specific leading characters, and with protocols that explicitly mandate the encoding or provide an encoding identification method. Real-world experience has also shown that some applications choke on UTF-8 with a BOM. Therefore the usage of a BOM is generally discouraged when using UTF-8. Removing the BOM from a UTF-8 encoded file should not cause incorrect interpretation of the file (unless there is some checksum or digital signature related to the byte stream of the file).
On "how MS will write a souce code file" : VS can save files with and without BOM, as well in whole bunch of other encodings. The default is UTF-8 with BOM. You can try it yourself by going File -> Save ... as -> click triangle on "Save" button and chose "save with encoding".
On usage of FEFF in actual code - never seen one using it in the code... wikipedia suggests that it should be treated as zero-width space if happened anywhere but first position ( http://en.wikipedia.org/wiki/Byte_order_mark ).
For C++, the file is either Unicode with BOM, or will be interpreted as ANSI (meaning the system code page, not necessarily 1252). Yes, you can save with whatever encoding you want, but the compiler will choke if you try to compile a Shift-JIS file (Japanese, code page 932) on an OS with 1252 as system code page.
In fact, even the editor will get it wrong. You can save the file as Shift-JIS on a 1252 system and it will look OK. But close the project and reopen it, and the text looks like junk. So the information is not preserved anywhere.
So that's your best guess: if there is no BOM, assume ANSI. That is what the editor/compiler do.
Also: this applies to VS 2008 and VS 2010; older editors were not too Unicode friendly.
And C++ has different rules than C# (for C++ the files are ANSI by default; for C# they are UTF-8).

Encoding Of An ASP.NET Web Page - Possible Ways - Difference - web.config (Globalization) vs. Meta Tag vs. etc

What are the possible ways to encode an ASP.NET web page?
What is the difference between web.config (Globalization) in the link below:
How to: Select an Encoding for ASP.NET Web Page Globalization
And a meta tag like below:
http://www.w3schools.com/tags/att_meta_http_equiv.asp
(We can also select the encoding on every page in the @ Page directive, so what is the difference between that and the <meta> http-equiv attribute?)
The encoding you set in web.config allows you to configure the encoding that pages should be sent in.
The encoding you set in a Page directive allows you to override the web.config setting for individual pages (word of advice: don't use it).
The encoding you set in the META tags or the response headers (ASP.NET will set the response headers automagically for you) is a helpful hint to the browser about which encoding the page is sent in, so that it can decode it correctly.
In other words, the encoding in web.config, in the headers, and in the meta tags should all be set to the same encoding for things to work properly. UTF-8 is a good choice - it handles every Unicode character, and it uses a single byte per character for anything below code point 128 (in other words, English text is the same size in UTF-8 as in ASCII - so no excuse for sticking with ASCII!).
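For what it's worth, the web.config <globalization> setting can also be expressed per response in code. A minimal sketch (the page class is illustrative, not from the answer); ASP.NET then emits the matching Content-Type header for you:

using System;
using System.Text;
using System.Web.UI;

public partial class SomePage : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // Programmatic equivalent of <globalization responseEncoding="utf-8" /> in web.config:
        // the response body is encoded as UTF-8 and the header becomes
        // "Content-Type: text/html; charset=utf-8".
        Response.ContentEncoding = Encoding.UTF8;
        Response.ContentType = "text/html";
    }
}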
Link to the obligatory article about encodings - if you haven't yet, read it. It will save you some grief in the future.

Cyrillic NSString to Unicode in objective-c

I want to send a Cyrillic string as a parameter over a web service from an iPhone to a .NET Framework server. How should I encode it correctly? I would like the result to be something like:
"myParam=\U0438\U0422"
If it's doable, would it matter if it is Cyrillic or just Latin letters?
And how should I decode it on the server, where I am using C#?
I would like the result to be something like "myParam=\U0438\U0422"
Really? That's not the standard for URL parameter encoding, which would be:
myParam=%d0%b8%d0%a2
assuming the UTF-8 encoding, which will be the default for an ASP.NET app. You then don't need to manually decode anything; the Request.QueryString/Form collections will give you native Unicode strings.
URL-encoding would normally be done using stringByAddingPercentEscapesUsingEncoding, except that it's a bit broken. See this question for background.
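To illustrate the server side, here is a minimal sketch (purely illustrative; ASP.NET already does this for Request.QueryString) that decodes the percent-encoded UTF-8 value by hand:

using System;
using System.Text;
using System.Web;   // reference System.Web.dll

class DecodeParam
{
    static void Main()
    {
        string raw = "%d0%b8%d0%a2";                          // percent-encoded UTF-8 bytes
        string value = HttpUtility.UrlDecode(raw, Encoding.UTF8);
        Console.WriteLine(value);                             // prints: иТ (U+0438 U+0422)
    }
}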
C# strings are Unicode by default. So for you, it's enough to ensure that your string is encoded as Unicode.
Once it's encoded as Unicode, it makes no difference whether you put Cyrillic, Latin, Arabic, or any other letters in it; it should be enough to use the correct code page.
EDIT
I was searching for one... there is a good article here: Globalization Step by Step.
Correction, per @chibacity's note: even though the default string encoding in C# is Unicode, the web service in your case uses UTF-8 (the more flexible one).

Strange UTF-8 encoding issues when reading XML, writing results in C#

I'm having an issue with a simple C# program that is meant to read an XML document from the web, pull out some elements, and then write the contents of those elements to an HTML file (in a simple table). Though the XML documents are correctly encoded as UTF-8, in the end all of my generated HTML files fail to correctly transcribe non-Western characters (e.g. "Wingdings"-like output when parsing Japanese).
Since the XML files are really large, the program works by having an XmlReader yielding matching elements as it encounters them, which are then written to the HTML file using a StreamWriter.
Does anyone have a sense of where in a program like this the UTF-8 encoding might have to be explicitly forced?
The short explanation
I'm going to guess here: Your browser is displaying the page using the wrong character encoding.
You need to answer: What character encoding does your browser think the HTML is? (I bet it's not UTF-8.)
Try to adjust your browser: for example, in Firefox, this is View → Character Encoding, then select the character encoding to match your document.
Since you seem to have a very multilingual document, have your C# output in UTF-8 - which supports every character known to man, including Japanese, Chinese, Latin, etc. Then try to tell Firefox, IE, whatever, to use UTF-8. Your document should display.
If this is the problem, then you need to inform the browser of the encoding of your document. Do so by (see this):
Having your web server return the character encoding in the HTTP headers.
Specifying a character encoding in a <meta> tag.
Specifying a character encoding in the XML preamble for XHTML.
The more of those you do, the merrier.
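As a minimal sketch of the second option (the file name and markup are illustrative, not from the question): write the HTML as UTF-8 and declare that encoding in the document itself, so the browser doesn't have to guess.

using System.IO;
using System.Text;

class WriteUtf8Html
{
    static void Main()
    {
        using (var sw = new StreamWriter("output.html", false, Encoding.UTF8))
        {
            sw.WriteLine("<!DOCTYPE html>");
            sw.WriteLine("<html><head>");
            sw.WriteLine("  <meta charset=\"utf-8\">");   // tells the browser which encoding to use
            sw.WriteLine("  <title>Extract</title>");
            sw.WriteLine("</head><body>");
            sw.WriteLine("  <table><tr><td>日本語もOK</td></tr></table>");
            sw.WriteLine("</body></html>");
        }
    }
}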
The long explanation
Let's have a look at a few things you mentioned:
using (StreamWriter sw = new StreamWriter(outputFile, true, System.Text.Encoding.UTF8))
and
found that using Text.Encoding.Default made other Western character sets with accents work (Spanish accents, German umlauts), although Japanese still exhibits problems.
I'm going to go out on a limb and say that you're an American computer user. Thus, for you, the "default" encoding on Windows is probably Windows-1252. The default encoding that a web browser will use, if it can't detect the encoding of an HTML document, is ISO-8859-1. ISO-8859-1 and Windows-1252 are very similar, and they both cover ASCII plus some common Latin characters such as é, è, etc. More importantly, the accented characters are encoded the same way, so for those characters the two encodings will both decode the same data. Thus, when you switched to "default", the browser was correctly decoding your Latin characters, albeit under the wrong encoding label. Japanese doesn't exist in either ISO-8859-1 or Windows-1252, and in both of them Japanese just appears as random characters ("mojibake").
The fact that switching to "default" fixed some of the accented Latin characters tells me that your browser is using ISO-8859-1, which isn't what we want: we want to encode the text using UTF-8, and we need the browser to read it back as such. See the short explanation for how to do that.
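To see that effect in isolation, here is a minimal sketch (the sample text is mine, and it assumes a .NET Framework runtime where code-page 1252 is available): encode Japanese as UTF-8, then decode the same bytes as Windows-1252 to reproduce the garbage output.

using System;
using System.Text;

class MojibakeDemo
{
    static void Main()
    {
        string original = "日本語";
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);

        // Decoding UTF-8 bytes with a single-byte Latin encoding produces mojibake.
        string wrong = Encoding.GetEncoding(1252).GetString(utf8Bytes);
        string right = Encoding.UTF8.GetString(utf8Bytes);

        Console.WriteLine(wrong);   // something like: æ—¥æœ¬èªž
        Console.WriteLine(right);   // 日本語
    }
}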
