I have a string-handling function in C++ as well as in C#. In C++ the code for the character ˆ is returned as -120, whereas in C# it is 710. When building the C++ project in Visual Studio 2010 I have set the character set to "Not Set" in the project settings. In C# I am using System.Text.Encoding.Default during one of the conversions. Does that make any difference? How can I get the same behavior in C++ as in C#?
The character is U+02C6. The encoding you're using in C++ is probably CP1252, which encodes this character as the byte 0x88 (-120 when shown as a signed char in decimal). C# uses the encoding UTF-16, which encodes this character as 0x02C6 (710 in decimal).
You can use UTF-16 in C++ on Windows by using wchar_t instead of char.
You can't make C# strings use CP1252, but you can get byte arrays in different encodings from a string using the Encoding class.
byte[] in_cp1252 = Encoding.GetEncoding(1252).GetBytes("Your string here");
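For example, a minimal sketch (assuming your machine's ANSI code page is 1252, so that Encoding.Default matches the C++ side) showing the same byte value the C++ code reports:

using System;
using System.Text;

// Encode U+02C6 (ˆ) with code page 1252; this yields the single byte 0x88.
byte[] bytes = Encoding.GetEncoding(1252).GetBytes("\u02C6");
Console.WriteLine(bytes[0]);        // 136 (unsigned)
Console.WriteLine((sbyte)bytes[0]); // -120, matching the C++ signed char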
Related
I have a program coded in Delphi 7 that uses the function char(x) (where x is a variable), and I need to write the same code in C# using (char)x, but I don't always get the same result.
So I printed all the characters for x = [1..255] in both Delphi 7 and C# and found differences at some values; here are some examples:
[![C# vs Delphi 7][1]][1]
So I want to know what exactly the char function of Delphi 7 does, so I can do the same in C#.
This is how I printed the two lists:
in C#:
for (int i = 0; i < 256; i++)
{
    richTextBox1.Text = richTextBox1.Text + " I " + i.ToString() + " " + (char)(i) + Environment.NewLine;
}
In Delphi 7:
for I := 0 to 255 do begin
  Memo5.Text := Memo5.Text + ' I=' + IntToStr(I) + ' char(I) ' + char(I) + #13#10;
end;
The answer was that char in Delphi uses the ANSI code page, and to do the same in C#:
char[] characters = System.Text.Encoding.Default.GetChars(new byte[]{X});
char c = characters[0];
"System.Text.Encoding.Default.GetChars" uses ANSI code page
thank you
[1]: https://i.stack.imgur.com/yGZ8R.jpg
In Delphi, Char() is simply a typecast of an ordinal value to the Delphi Char type. In Delphi 7, Char is an alias for AnsiChar, which is an 8-bit character type.
In C#, the char type is a 16-bit type, typically representing a UTF-16 code unit.
So you might translate the Delphi code to Encoding.Default.GetChars() in C#, but that is speculation at best. For instance, there is an assumption there that the ANSI locale is being used. In my view it's not possible to translate the code you have presented without more information.
I think it is quite likely that the right way to translate your code is not to translate it literally. In other words, you need to look at the broader code to understand the complete task it performs, rather than looking line by line.
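If you do know which ANSI code page the Delphi program ran under, you can make that assumption explicit instead of relying on Encoding.Default. A minimal sketch, assuming Windows-1252:

using System.Text;

// Assumption: the Delphi app ran under the Windows-1252 ANSI code page.
Encoding ansi = Encoding.GetEncoding(1252);
char c = ansi.GetChars(new byte[] { 0x88 })[0]; // U+02C6 (ˆ) under CP1252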
C#'s char type is defined by the language standard to be a UTF-16 code unit, with potentially multiple char instances being necessary to represent one glyph (the thing you see on screen).
A quick Google search shows that the latest Delphi versions define Char to be a wide Unicode character, equivalent to C++'s wchar_t on Windows. That is not the same as C#'s type, even though they use the same amount of space.
Also note that your ancient Delphi version most likely has the old 1-byte AnsiChar, though I couldn't find an authoritative specification of that. For ANSI characters, the mapping to Unicode glyphs is dictated by mapping tables called "code pages", which you can switch at will.
It looks like the C# code is using Windows-1252, while the Delphi code is using some extended ASCII charset, because the output on the right is actually the C1 control characters.
In ASCII, the 32 characters in the range [0, 31] are called control characters. When the high bit is set you get a character in the range [128, 159], as in your image; that range is reserved for the C1 control characters because some old software can't deal with non-ASCII characters.
Unicode's first 256 code points match ISO 8859-1 exactly (the ISO standard itself leaves the control ranges unassigned), and Windows-1252 is a superset of ISO 8859-1 that replaces the C1 control range [128, 159] with printable characters. Characters shown in a white rounded rectangle like that are actually control characters as visualized by Notepad++. You need to use the same code page when outputting.
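A minimal sketch of that difference in the [128, 159] range:

using System;
using System.Text;

byte[] b = { 0x88 };
// ISO 8859-1 maps 0x88 to the C1 control character U+0088 ...
Console.WriteLine((int)Encoding.GetEncoding("iso-8859-1").GetChars(b)[0]); // 136
// ... while Windows-1252 maps it to the printable U+02C6 (ˆ).
Console.WriteLine((int)Encoding.GetEncoding(1252).GetChars(b)[0]);         // 710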
It's also possible that you're choosing the wrong charset in the editor, as Hans Passant said.
I was translating some C++ code to C# when I saw the function below:
myMultiByteToWideChar( encryptedBufUnicode, (char*)encryptedBuf, sizeof(encryptedBufUnicode) );
This basically converts the char array to Unicode.
In C#, aren't strings and char arrays already Unicode? Or do we need to make them Unicode using a System.Text function?
C# strings and characters are UTF-16.
If you have an array of bytes, you can use the Encoding class to read it as a string using a correct encoding.
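For instance, a minimal sketch, assuming the native side produced Windows-1252 bytes (the byte value here is just an example):

using System.Text;

byte[] fromNative = { 0x88 };                                // hypothetical bytes from the C++ side
string s = Encoding.GetEncoding(1252).GetString(fromNative); // "ˆ" (U+02C6)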
I have a legacy C++ COM application which writes a BSTR (UTF-16 on Windows) like this.
Say ☻ (black smiley, i.e. ALT + Numpad 2) is written by the application like this in hex: 060000003B260D000A00. Note that the first 4 bytes are reserved for the BSTR length.
Now, how do I display the black smiley back in C# from this hex string? In the VS debugger, '\u263B' displays the smiley, but here the string is 3B26. This is just one example of the kind of data. Any data can be dumped by that app (like large XSLs, texts, etc., all converted to hex format). The idea is to interpret the hex correctly in C#.
This link talks about something similar, but I'm not very sure. Any pointers?
Use Marshal.PtrToStringBSTR to get an instance of a managed String from your BSTR.
Note that the IntPtr argument should be a pointer to the start of the string characters themselves, not the start of the 4 bytes which encode the length of the string.
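If all you have is the hex dump rather than a live BSTR pointer, here is a minimal sketch that decodes the sample string from the question by hand:

using System;
using System.Text;

string hex = "060000003B260D000A00";
// The first 4 bytes (8 hex digits) are the little-endian length prefix; skip them.
byte[] payload = new byte[(hex.Length - 8) / 2];
for (int i = 0; i < payload.Length; i++)
    payload[i] = Convert.ToByte(hex.Substring(8 + i * 2, 2), 16);
// The BSTR payload is UTF-16LE, which Encoding.Unicode decodes directly.
Console.WriteLine(Encoding.Unicode.GetString(payload)); // ☻ followed by CR LF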
Maybe I don't need 32-bit strings, but I need to represent 32-bit characters:
http://www.fileformat.info/info/unicode/char/1f4a9/index.htm
Now I grabbed the Symbola font and can see the character when I paste it (in the URL bar or any text area), so I know I have font support for it.
But how do I support it in my C#/.NET app?
Edit: I'll add something. When I pasted the said character into my .NET WinForms app I do NOT see the character correctly. When pasting it into Firefox I do see it correctly. How do I see the characters correctly in my WinForms apps?
I am not sure I understand your question:
Strings in .NET are UTF-16 encoded, and there is nothing you can do about this. If you want to get the UTF-32 version of a string, you will have to convert it into a byte array with the UTF32Encoding class.
Characters in .NET are thus 16 bits long, and there is nothing you can do about this either. A UTF-32 encoded character can only be represented by a byte array (with 4 items). You can use the UTF32Encoding class for this purpose.
Every UTF-32 character has an equivalent UTF-16 representation, and vice-versa. So in this context we could only speak of characters, and of their different representations (encodings), UTF-16 being the representation of choice on the .NET platform.
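A minimal sketch of the round trip:

using System;
using System.Text;

string s = "\U0001F4A9";                                 // a code point outside the BMP
Console.WriteLine(s.Length);                             // 2: stored as a UTF-16 surrogate pair
byte[] utf32 = Encoding.UTF32.GetBytes(s);
Console.WriteLine(utf32.Length);                         // 4: one UTF-32 code unit
Console.WriteLine(Encoding.UTF32.GetString(utf32) == s); // True: it round-trips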
You didn't say what exactly you mean by "support". But there is nothing special you need to do to work with characters that don't fit into one 16-bit char, unless you do string manipulation. They will just be represented as surrogate pairs, but you shouldn't need to know about that if you treat the string as a whole.
One exception is that some string manipulation methods won't work correctly. For example "\U0001F4A9".Substring(1) will return the second half of the surrogate pair, which is not a valid string.
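If you do need to walk such a string safely, one option is to iterate by text element rather than by char; a sketch:

using System;
using System.Globalization;

string s = "\U0001F4A9";
// Enumerate whole text elements so surrogate pairs stay intact.
TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
while (e.MoveNext())
    Console.WriteLine(char.ConvertToUtf32((string)e.Current, 0)); // 128169 (0x1F4A9)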
I have a C# COM server which is consumed by a C++ client.
One of the C# methods returns a string.
In C++ the returned string is represented in Unicode (UTF-16), at least according to the memory view.
Is this always the case with COM strings?
Is there a way to use UTF-8 instead?
I saw some code where strings were passed between C++ and C# as byte arrays. Is there any benefit to this?
Yes. The standard COM string type is BSTR. It is a Unicode string encoded in UTF16, just like Windows' native string type.
No, a COM method isn't going to understand a UTF-8 string; it will turn it into Chinese (UTF-8 bytes misread as UTF-16 often land in the CJK range). UTF-8 is a good encoding for a text file, not for programs manipulating strings in memory. UTF-8 requires anywhere between 1 and 4 bytes to encode a Unicode code point, which is very incompatible with basic string manipulations like getting the size or indexing a character.
C and C++ programs tend to use 8-bit encodings, compatible with the char type. That's an old practice, dating back to an era before Unicode existed. There's nothing attractive about it; there are many 8-bit encodings. The typical problem is that text data can only be interpreted correctly if it is read by a program that uses the same 8-bit encoding. In other words, when the computers are less than 1000 miles apart. Less in Europe.
No.
Yes. Put the attribute [return: MarshalAs(UnmanagedType.LPStr)] before the method definition in C# if you'd like to return the string as an ANSI string instead of Unicode.
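For example, a minimal sketch (the interface and method names here are hypothetical):

using System.Runtime.InteropServices;

[ComVisible(true)]
public interface IExample
{
    // Marshal the return value to native callers as an ANSI (code-page)
    // string instead of the default BSTR/UTF-16.
    [return: MarshalAs(UnmanagedType.LPStr)]
    string GetText();
}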
Yeah, the author may have done that to maintain very fine-grained control over the encoding of the string's contents by side-stepping the default marshalling behavior.