Cyrillic NSString to Unicode in Objective-C / C#

I want to send a Cyrillic string as a parameter over a web service from an iPhone to a .NET Framework server. How should I encode it correctly? I would like the result to be something like:
"myParam=\U0438\U0422"
If it's doable, would it matter if it is Cyrillic or just Latin letters?
And how should I decode it on the server, where I am using C#?

I would like the result to be something like "myParam=\U0438\U0422"
Really? That's not the standard for URL parameter encoding, which would be:
myParam=%d0%b8%d0%a2
assuming the UTF-8 encoding, which will be the default for an ASP.NET app. You don't need to manually decode anything then, the Request.QueryString/Form collections will give you native Unicode strings.
URL-encoding would normally be done using stringByAddingPercentEscapesUsingEncoding, except that it's a bit broken. See this question for background.
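For concreteness, the percent-encoded form above is just the UTF-8 bytes of each character escaped with `%`. A minimal sketch (illustrated in Python here, since the byte-level result is the same in any language):

```python
from urllib.parse import quote, unquote

# Percent-encode the UTF-8 bytes of a Cyrillic string, as a URL
# parameter value should be. U+0438 becomes %D0%B8, U+0422 becomes %D0%A2.
param = quote("иТ")
print(param)           # %D0%B8%D0%A2
print(unquote(param))  # иТ
```

On the Objective-C side, `stringByAddingPercentEscapesUsingEncoding` (with its caveats, as noted) produces the same bytes; ASP.NET undoes this automatically when populating `Request.QueryString`.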

C# strings are Unicode (UTF-16) by default. So for you, it's enough to ensure that your string is encoded as Unicode.
Once it's encoded as Unicode, it makes no difference whether you put Cyrillic, Latin, Arabic or any other letters in it; it should be enough to use the correct code page.
EDIT
I was searching for a reference; there is a good article here: Globalization Step by Step
Correction per @chibacity's note: even though the default string encoding in C# is Unicode (UTF-16), web services in your case use UTF-8 (the more flexible one).
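To make the in-memory vs. on-the-wire distinction concrete: strings in memory are UTF-16 in both C# and Cocoa, but what travels over the wire from a web service is a byte encoding such as UTF-8. A quick sketch of the difference (Python used here for illustration):

```python
s = "иТ"
# The same two Cyrillic characters in the two encodings involved:
print(s.encode("utf-8"))      # b'\xd0\xb8\xd0\xa2'  - wire format (web service)
print(s.encode("utf-16-le"))  # b'8\x04"\x04'        - in-memory format (C#/NSString)
```

Either way, once the bytes are decoded back with the matching encoding, the Cyrillic characters are recovered intact; which alphabet the characters come from makes no difference.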

Related

C# replaces special characters with question marks

I'm having a problem with encoding in C#.
I'm downloading an XML file encoded in Windows-1250, and then, when it is saved to a file, special characters like Š and Đ are replaced with ?, even though the file is saved correctly using the Windows-1250 encoding.
This is an example of my code (simplified):
var res = Encoding.GetEncoding("Windows-1250").GetBytes(client.DownloadString("http://link/file.xml"));
var result = Encoding.GetEncoding("Windows-1250").GetString(res);
File.AppendAllText("file.xml", result);
The XML file is in fact encoded using Windows-1250, and it reads just fine when I download it using the browser.
Does anyone know what's going on here?
The problem could result from two different sources, one at the beginning and one at the end of your snippet.
As has been pointed out, the encoding and decoding you are doing in your code are actually useless, because the origin (what DownloadString returns) and the target (the variable result) are both C# Unicode strings.
Source 1: DownloadString
DownloadString may not have decoded the Windows-1250 encoded response properly, because either the server did not send the correct charset in the Content-Type header, or DownloadString doesn't support this encoding at all (unlikely, but I'm not familiar with DownloadString's internals).
Source 2: File.AppendAllText
The string was downloaded correctly, then encoded in memory to Windows-1250, then decoded to a Unicode string again and everything worked well.
But then it was written by File.AppendAllText in another default encoding. AppendAllText has an optional, third parameter that you can use to specify the encoding. You should set this to Windows-1250 to actually write a file in Windows-1250 encoding.
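As a sketch of why the Encode/GetString pair in the snippet is a no-op, and of where the encoding actually matters (Python used for illustration; `cp1250` is Python's name for Windows-1250):

```python
text = "Š and Đ"
# Encoding to Windows-1250 and immediately decoding it back is a
# round trip - it returns the original string unchanged:
assert text.encode("cp1250").decode("cp1250") == text

# What matters is the byte encoding used when *writing the file*
# (the third parameter of File.AppendAllText in C#); these are the
# bytes a Windows-1250 reader expects to find on disk:
print(text.encode("cp1250"))  # b'\x8a and \xd0'
```

If the file is instead written with some other default encoding, a reader that assumes Windows-1250 will see the wrong bytes for Š and Đ.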
Also, make sure that whatever editor you use to open the file uses the same encoding - this is often not very easy to guarantee, so I'd suggest you open it in a "developer-friendly" editor that lets you specify the encoding when opening a text file. (Vim, Emacs, Notepad++, Visual Studio, ...).

DynamoDB automatically converting special characters

Working with DynamoDB and AWS (.NET C#). For some reason, strings containing "é" have that character replaced with a question mark when saved.
How can I prevent it from happening?
DynamoDB stores strings in UTF-8 encoding. Somewhere in your application you must be encoding that string as something other than UTF-8.
I'm using Java (which uses UTF-16), and I don't do anything special when storing strings. I just tried storing and retrieving "é" in DynamoDB using the Java SDK and there was no problem.
Just to add to the previous answer: your IDE will generally have encoding settings. You'll want to change the string encoding for your project to UTF-8 to minimize the chances of an encoding error, which can produce what looks like an unknown string.
For example, in Eclipse editors, you can see this answer to change your encoding.
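The "?" substitution is the classic symptom of encoding text to a charset that cannot represent the character, with replacement turned on. A minimal reproduction (Python used for illustration; the same mechanism applies in the .NET and Java SDKs):

```python
# Encoding to a charset that cannot represent "é", with replacement
# enabled, yields the question mark the question describes:
print("é".encode("ascii", errors="replace"))  # b'?'

# Encoded as UTF-8 (what DynamoDB actually stores), the character survives:
print("é".encode("utf-8"))  # b'\xc3\xa9'
```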

Chinese Character Encoding (UTF-8, GBK)

I have a web crawler that is run on different websites (Chinese in this case).
Now when I retrieve the data and display it on my website, the Chinese characters all end up as garbage. I read about character encoding, and I found out that UTF-8 is generally the best encoding.
Now the problem is when I use UTF-8 - The data crawled from WEBSITE-1 are shown correctly but not for WEBSITE-2.
For WEBSITE-2, the character encoding gb18030 is working correctly.
My question is: is there a way to know the character encoding of a website, so that I can build a generic solution? That way I could render a page on my website already knowing which encoding to use, handle it all in the backend, and not worry on the front end about which encoding is required to open a page.
Right now I have two pages, 1 for UTF-8 chinese characters, and one for GB18030 chinese characters.
Use the HTML <meta http-equiv="Content-Type"> tag (with the charset in its content attribute) for HTML < 5, or the <meta charset> tag for HTML5.
W3schools charset
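For a crawler, a generic approach is to check the HTTP Content-Type header first and fall back to the page's meta declaration. A minimal sketch (Python; the function name `detect_charset` and the fallback order are this sketch's own choices, not a standard API):

```python
import re

def detect_charset(content_type, head_bytes):
    """Guess a page's charset: first from the Content-Type header value,
    then from a <meta ... charset=...> declaration in the first bytes of
    the document, and finally default to UTF-8."""
    if content_type:
        m = re.search(r'charset=([\w-]+)', content_type, re.I)
        if m:
            return m.group(1).lower()
    m = re.search(rb'charset=["\']?([\w-]+)', head_bytes, re.I)
    if m:
        return m.group(1).decode("ascii").lower()
    return "utf-8"

print(detect_charset("text/html; charset=GB18030", b""))      # gb18030
print(detect_charset(None, b'<meta charset="utf-8">'))        # utf-8
```

With the charset known up front, the crawled bytes can be decoded in the backend and re-served as UTF-8, so a single front-end page suffices for both websites.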

Microsoft IDEs, source file encodings, BOMs and the Unicode character \uFEFF?

We have parsers for various Microsoft languages (VB6, VB.net, C#, MS dialects of C/C++).
They are Unicode enabled to the extent that we all agree on what Unicode is. Where we don't agree, our lexers object.
Recent MS IDEs all seem to read/write their source code files in UTF-8... I'm not sure this is always true. Is there some reference document that makes it clear how MS will write a source code file? With or without byte order marks? Does it vary from IDE version to version? (I can't imagine that the old VB6 dev environment wrote anything other than an 8-bit character set, and I'd guess it would be in the CP-xxxx encoding established by the locale, right?)
For C# (and I assume other modern language dialects supported by MS), the character code \uFEFF can actually be found in the middle of a file. This code is defined as a zero-width no-break space. It appears to be ignored by VS 2010 when found in the middle of an identifier, in whitespace, but is significant in keywords and numbers. So, what are the rules? Or does MS have some kind of normalize-identifiers to handle things like composite characters, that allows different identifier strings to be treated as identical?
This is in a way a non-answer, because it does not tell what Microsoft says but what the standards say. Hope it will be of assistance anyway.
U+FEFF as a regular character
As you stated, U+FEFF should be treated as a BOM (byte order mark) at the beginning of a file. Theoretically it could also appear in the middle of text, since it is actually a character denoting a zero-width no-break space (ZWNBSP). In some languages/writing systems, all words in a line are joined (= written together), and in such cases this character could be used as a separator, just like a regular space in English, but without causing a typographically visible gap. I'm not actually familiar with such scripts, so my view might not be fully correct.
U+FEFF should only appear as a BOM
However, the usage of U+FEFF as a ZWNBSP has been deprecated as of Unicode version 3.2, and currently the purpose of U+FEFF is to act as a BOM. Instead of ZWNBSP as a separator, the U+2060 (word joiner) character is strongly preferred by the Unicode Consortium. Their FAQ also suggests that any U+FEFF occurring in the middle of a file can be treated as an unsupported character that should be displayed as invisible. Another possible solution that comes to mind would be to replace any U+FEFF occurring in the middle of a file with U+2060, or just to ignore it.
Accidentally added U+FEFF
I guess the most probable reason for U+FEFF to appear in the middle of text is that it is an erroneous result (or side effect) of a string concatenation. RFC 3629, which incorporated the usage of a BOM, notes that stripping the leading U+FEFF is necessary when concatenating strings. This also implies that the character can simply be removed when found in the middle of text.
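Following that advice, a stray U+FEFF left behind by concatenation can simply be stripped. A one-line sketch (Python for illustration):

```python
# A BOM left behind by concatenating a string that began with one;
# per RFC 3629 the leading BOM should have been stripped first:
text = "foo" + "\ufeffbar"
print(text.replace("\ufeff", ""))  # foobar
```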
U+FEFF and UTF-8
U+FEFF as a BOM has no real effect when the text is encoded as UTF-8, since UTF-8 always has the same byte order. A BOM in UTF-8 interferes with systems that rely on the presence of certain leading characters, and with protocols that explicitly mandate the encoding or an encoding-identification method. Real-world experience has also shown that some applications choke on UTF-8 with a BOM. Therefore the usage of a BOM is generally discouraged when using UTF-8. Removing the BOM from a UTF-8 encoded file should not cause incorrect interpretation of the file (unless there is some checksum or digital signature tied to the byte stream of the file).
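A concrete look at the UTF-8 BOM bytes, and one way a reader can tolerate them on input (Python sketch; `utf-8-sig` is Python's name for BOM-aware UTF-8):

```python
data = "\ufeffhello".encode("utf-8")
print(data[:3])                    # b'\xef\xbb\xbf' - the UTF-8 BOM
print(data.decode("utf-8-sig"))    # hello - BOM stripped on decode
print(repr(data.decode("utf-8")))  # '\ufeffhello' - plain UTF-8 keeps it
```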
On "how MS will write a source code file": VS can save files with and without a BOM, as well as in a whole bunch of other encodings. The default is UTF-8 with BOM. You can try it yourself: go to File -> Save ... As -> click the triangle on the "Save" button and choose "Save with encoding".
On the usage of FEFF in actual code: I've never seen anyone use it in code... Wikipedia suggests that it should be treated as a zero-width space if it appears anywhere but the first position ( http://en.wikipedia.org/wiki/Byte_order_mark ).
For C++, the file is either Unicode with BOM, or will be interpreted as ANSI (meaning the system code page, not necessarily 1252). Yes, you can save with whatever encoding you want, but the compiler will choke if you try to compile a Shift-JIS file (Japanese, code page 932) on an OS with 1252 as system code page.
In fact, even the editor will get it wrong. You can save it as Shift-JIS on a 1252 system and it will look OK. But close the project and open it again, and the text looks like junk. So the info is not preserved anywhere.
So that's your best guess: if there is no BOM, assume ANSI. That is what the editor and the compiler do.
Also: this applies to VS 2008 and VS 2010; older editors were not as Unicode-friendly.
And C++ has different rules than C# (for C++ the files are ANSI by default, for C# they are UTF-8).

Strange UTF-8 encoding issues when reading XML, writing results in C#

I'm having an issue with a simple C# program that is meant to read an XML document from the web, pull out some elements, and then write the contents of those elements to an HTML file (in a simple table). Though the XML documents are correctly encoded as UTF-8, in the end, all of my generated HTML files are failing to correctly transcribe non-Western characters (e.g. "Wingdings"-like output when parsing Japanese).
Since the XML files are really large, the program works by having an XmlReader yielding matching elements as it encounters them, which are then written to the HTML file using a StreamWriter.
Does anyone have a sense of where in a program like this the UTF-8 encoding might have to be explicitly forced?
The short explanation
I'm going to guess here: Your browser is displaying the page using the wrong character encoding.
You need to answer: What character encoding does your browser think the HTML is? (I bet it's not UTF-8.)
Try to adjust your browser: for example, in Firefox, this is View → Character Encoding, then select the character encoding to match your document.
Since you seem to have a very multilingual document, have your C# output in UTF-8 - which supports every character known to man, including Japanese, Chinese, Latin, etc. Then try to tell Firefox, IE, whatever, to use UTF-8. Your document should display.
If this is the problem, then you need to inform the browser of the encoding of your document. Do so by (see this):
Having your web server return the character encoding in the HTTP headers.
Specifying a character encoding in a <meta> tag.
Specifying a character encoding in the XML preamble for XHTML.
The more of those you do, the merrier.
The long explanation
Let's have a look at a few things you mentioned:
using (StreamWriter sw = new StreamWriter(outputFile,true,System.Text.Encoding.UTF8))
and
found that using Text.Encoding.Default made other Western character sets with accents work (Spanish accents, German umlauts), although Japanese still exhibits problems.
I'm going to go out on a limb and say that you're an American computer user. Thus, for you, the "default" encoding on Windows is probably Windows-1252. The default encoding that a web browser will use, if it can't detect the encoding of an HTML document, is ISO-8859-1. ISO-8859-1 and Windows-1252 are very similar, and they both display ASCII plus some common Latin characters such as é, è, etc. More importantly, the accented characters are encoded the same, so, for those characters, the two encodings will decode the same data identically. Thus, when you switched to "default", the browser was correctly decoding your Latin characters, albeit under the wrong encoding name. Japanese doesn't exist in either ISO-8859-1 or Windows-1252, and in both of them Japanese just appears as random characters ("mojibake").
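Both failure modes described above can be reproduced directly (Python sketch; `latin-1` is ISO-8859-1):

```python
# UTF-8 bytes decoded as ISO-8859-1 turn accented Latin
# characters into mojibake rather than question marks:
print("café".encode("utf-8").decode("latin-1"))  # cafÃ©

# Japanese simply cannot be represented in ISO-8859-1 at all:
try:
    "日本語".encode("latin-1")
except UnicodeEncodeError:
    print("not representable in ISO-8859-1")
```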
The fact that you noted that switching to "default" fixes some of the accented Latin characters tells me that your browser is using ISO-8859-1, which isn't what we want: we want to encode the text using UTF-8, and we need the browser to read it back as such. See the short explanation for how to do that.

Categories

Resources