String and UTF16 - c#

Is the data stored in String object always encoded with UTF16?
I am asking this because my database does stores non English in non Unicode. and I assumed that the data will not be readable because it is read in wrong encoding.
Thanks

Internally .NET strings are in UTF-16, yes... but what's important is how the data is transferred between .NET and your database.
So long as the characters can be represented in Unicode, and the driver performs the appropriate conversion, you should be fine. If you're trying to represent text which can't be represented in Unicode, you may well run into some interesting behaviour.

Yes, .NET strings are always encoded in UTF-16 - with the exception of surrogate pairs that means 2 byte characters.

.NET Strings are ALWAYS Unicode. If your database is unicode you are fine, otherwise you will need to convert the text from whatever format it is in to unicode.

The internal storage of characters (and therefore strings) in .NET is done in UTF-16.
You will need to re-encode the string to the encoding used by your database.
See the Encoding class - this is what you can use to convert a string from one encoding to another.

If you are using ADO.NET with SqlDataCommands (or other types of DataCommands), any required conversion should be handled for you, and you won't need to worry about it.

Related

Formatting a String using a String variable as the Format string

I have a COM server app (App_A) that only supports native data types. I send the parameters over the COM server to a C# app (App_B) that then sends on the data as a web request.
My problem is that the String data read by App_A is Unicode, but App_A does not support non-UTF-8 encoding for its COM String values, so the data can be sent as a byte array or char array.
If I use the byte array, the generic App_B is now broken as I now have to handle this single data update differently to all the others (and I fear there will be more), so I would like to keep the App_B handling of values generic (obj.ToString).
If I hard code an App_B C# String as a literal, e.g. "\u5f90", the String contains a Unicode character and the HttpUtility.UrlEncode call in App_B works exactly as expected. If the String is passed in as a value (obj.ToString() = "\u5f90") the '\' is escaped and the UrlEncode does not UTF-8-encode a Unicode character as the '\u' escape sequence is lost.
I guess my question comes down to:
So far I have manipulated the byte array in App_A to replace the Unicode values (xxxx) with '\uxxxx': - is there any way I can use a String variable as a format string in the C# App_B?
Alternatively, if I'm going about this the wrong way, what would anyone suggest?
Please bear in mind that I have approx 300 data value updates that all use a generic o.ToString for part of the UrlEncode argument and I would like to keep this if possible.
Is it an option for you to support different encodings in your deserialization of the byte arrays in App_B? I'd suggest modifying App_A so that each sent string has an additional first byte which defines the encoding, which then has to be respected by App_B. That way it doesn't matter which encoding you use, as long as both apps support it.
I'd strongly suggest not modifying the strings as you've described by preceeding it with \u, that's just gonna be a mess of code later on which needs to be documented well and needs to be understood again if you come back to it later etc.

How do i use 32 bit unicode characters in C#?

Maybe i dont need 32bit strings but i need to represent 32bit characters
http://www.fileformat.info/info/unicode/char/1f4a9/index.htm
Now i grabbed the symbola font and can see the character when i paste it (in the url or any text areas) so i know i have the font support for it.
But how do i support it in my C#/.NET app?
-edit- i'll add something. When i pasted the said character in my .NET winform app i DO NOT see the character correctly. When pasting it into firefox i do see it correctly. How do i see the characters correctly in my winform apps?
I am not sure I understand your question:
Strings in .NET are UTF-16 encoded, and there is nothing you can do about this. If you want to get the UTF-32 version of a string, you will have to convert it into a byte array with the UTF32Encoding class.
Characters in .NET are thus 16 bits long, and there is nothing you can do about this either. A UTF-32 encoded character can only be represented by a byte array (with 4 items). You can use the UTF32Encoding class for this purpose.
Every UTF-32 character has an equivalent UTF-16 representation, and vice-versa. So in this context we could only speak of characters, and of their different representations (encodings), UTF-16 being the representation of choice on the .NET platform.
You didn't say what exactly do you mean by “support”. But there is nothing special you need to do to to work with characters that don't fit into one 16-bit char, unless you do string manipulation. They will just be represented as surrogate pairs, but you shouldn't need to know about that if you treat the string as a whole.
One exception is that some string manipulation methods won't work correctly. For example "\U0001F4A9".Substring(1) will return the second half of the surrogate pair, which is not a valid string.

c#, Textbox.Text encoding

I have code:
string text = sampleTextBox.Text;
and I'm wondering in what encoding text is? Is it utf16 (as it is string) or maybe it is my operating system encoding?
It's all Unicode, basically - there's no conversion between the .NET textual types (char/string) and binary going on, so there's no encoding to worry about.
You potentially need to worry about surrogate pairs to get from the UTF-16 textual representation of char and string to full UTF-32, but that's slightly different to the normal encoding issues.
Philosophically, a textbox contains text, not binary data. You should only be thinking about encodings when there's a conversion to a binary format - such as a file.
String variables in .Net are UTF-16 internally. Encoding comes into play when you want to output the string outside your program: a file, a webpage, or over the network in some fashion.

Passing a string from C# to cpp with COM

I have a C# COM server which is consumed by a cpp client.
One of the C# methods returns a string.
In cpp the returned string is represented in Unicode (UTF-16), at least according to the memory view.
Is this always the case with COM strings?
Is there a way to use UTF-8 instead?
I saw some code where strings were passed between cpp and c# as byte arrays. Is there any benefit in this?
Yes. The standard COM string type is BSTR. It is a Unicode string encoded in UTF16, just like Windows' native string type.
No, a COM method isn't going to understand a UTF8 string, it will turn it into Chinese. UTF8 is a good encoding for a text file, not for programs manipulating strings in memory. UTF8 requires anywhere between 1 and 4 bytes to encode a Unicode codepoint. Very incompatible with basic string manipulations like getting the size or indexing a character.
C and C++ programs tend to use 8-bit encodings, compatible with the "char" type. That's an old practice, dating back from an era before Unicode was around. There's nothing attractive about it, there are many 8-bit encodings. The typical problem is that data entered as text can only be interpreted correctly if it is read by a program that uses the same 8-bit encoding. In other words, when the computers are less than 1000 miles apart. Less in Europe.
No.
Yes. Put the attribute [return: MarshalAs(UnmanagedType.LPStr)] before the method definition in C# if you'd like to return the string as an ANSI string instead of Unicode.
Yeah--the author may have done that to maintain very fine-grained control on the encoding of the contents of the string by side-stepping the default marshalling behavior.

Force C# to use ASCII

I'm working on an application in C# and need to read and write from a particular datafile format. The only issue at the moment is that the format uses strictly single byte characters, and C# keeps trying to throw in Unicode when I use a writer and a char array (which doubles filesize, among other serious issues). I've been working on modifying the code to use byte arrays instead, but that causes a few complaints when feeding them into a tree view and datagrid controls, and it involves conversions and whatnot.
I've spent a little time googling, and there doesn't seem to be a simple typedef I can use to force the char type to use byte for my program, at least not without causing extra complications.
Is there a simple way to force a C# .NET program to use ASCII-only and not touch Unicode?
Later, I got this almost working. Using the ASCIIEncoding on the BinaryReader/Writers ended up fixing most of the problems (a few issues with an extra character being prepended to strings occurred, but I fixed that up). I'm having one last issue, which is very small but could be big: In the file, a particular character (prints as the Euro sign) gets converted to a ? when I load/save the files. That's not an issue in texts much, but if it occurred in a record length, it could change the size by kilobytes (not good, obviously). I think it's caused by the encoding, but if it came from the file, why won't it go back?
The precise problem/results are such:
Original file: 0x80 (euro)
Encodings:
** ASCII: 0x3F (?)
** UTF8: 0xC280 (A-hat euro)
Neither of those results will work, since anywhere in the file, it can change (if an 80 changed to 3F in a record length int, it could be a difference of 65*(256^3)). Not good. I tried using a UTF-8 encoding, figuring that would fix the issue pretty well, but it's now adding that second character, which is even worse.
C# (.NET) will always use Unicode for strings. This is by design.
When you read or write to your file, you can, however, use a StreamReader/StreamWriter set to force ASCII Encoding, like so:
StreamReader reader = new StreamReader (fileStream, new ASCIIEncoding());
Then just read using StreamReader.
Writing is the same, just use StreamWriter.
Interally strings in .NET are always Unicode, but that really shouldn't be of much interest to you. If you have a particular format that you need to adhere to, then the route you went down (reading it as bytes) was correct. You simply need to use the System.Encoding.ASCII class to do your conversions from string->byte[] and byte[]->string.
If you have a file format that mixes text in single-byte characters with binary values such as lengths, control characters, a good encoding to use is code page 28591 aka Latin1 aka ISO-8859-1.
You can get this encoding by using whichever of the following is the most readable:
Encoding.GetEncoding(28591)
Encoding.GetEncoding("Latin1")
Encoding.GetEncoding("ISO-8859-1")
This encoding has the useful characteristic that byte values up to 255 are converted to unchanged to the unicode character with the same value (e.g. the byte 0x80 becomes the character 0x0080).
In your scenario, this may be more useful than the ASCII encoding (which converts values in the range 0x80 to 0xFF to '?') or any of the other usual encodings, which will also convert some of the characters in this range.
If you want this in .NET, you could use F# to make a library supporting this. F# supports ASCII strings, with a byte array as the underlying type, see Literals (F#) (MSDN):
let asciiString = "This is a string"B

Categories

Resources