Difference between char(x) in Delphi and (char)x in C#

I have a program coded in Delphi 7 that uses the expression char(x) (where x is a variable), and I need to write the same code in C# using (char)x, but I don't always get the same result.
So I printed all the characters for x = 1..255 in both Delphi 7 and C# and found differences for some numbers. Here are some examples:
[![C# vs Delphi 7][1]][1]
So I want to know what exactly the char function of Delphi 7 does, so I can do the same in C#.
This is how I printed the two lists:
In C#:
for (int i = 0; i < 256; i++)
{
    richTextBox1.Text = richTextBox1.Text + " I " + i.ToString() + " " + (char)i + Environment.NewLine;
}
In Delphi 7:
for I := 0 to 255 do begin
  Memo5.Text := Memo5.Text + ' I=' + IntToStr(I) + ' char(I) ' + char(I) + #13#10;
end;
The answer was that char in Delphi 7 uses the ANSI code page, and to do the same in C#:
char[] characters = System.Text.Encoding.Default.GetChars(new byte[]{X});
char c = characters[0];
"System.Text.Encoding.Default.GetChars" uses ANSI code page
thank you
[1]: https://i.stack.imgur.com/yGZ8R.jpg

In Delphi, Char() is simply a typecast of an ordinal value to the Delphi Char type. In Delphi 7, Char is an alias for AnsiChar, which is an 8-bit character type.
In C#, the char type is a 16 bit type, typically representing an element of UTF-16 encoded text.
So you might translate the Delphi code to Encoding.Default.GetChars() in C#, but that is speculation at best. For instance, there is an assumption there that the ANSI locale is being used. In my view it's not possible to translate the code you have presented without more information.
I think it is quite likely that the right way to translate your code is not to translate it literally. In other words, you need to look at the broader code to understand the complete task it performs, rather than looking line by line.

C#'s char type is defined by the language standard to hold (partial) UTF-16 characters, with potentially multiple char instances being necessary to represent one glyph (the thing you see on screen).
A quick Google search shows that the latest Delphi version defines its char as a wide Unicode character, equivalent to C++'s wchar_t. That is not the same as C#'s type, even though they use the same amount of space.
Also note that your ancient Delphi version most likely has the old 1-byte Ansi char, though I couldn't find an authoritative specification of that. For Ansi characters, the mapping to Unicode glyphs is dictated by mapping tables called "code pages", which you can switch at will.

It looks like the C# code is using Windows-1252 while the Delphi code is using some extended ASCII charset, because the output on the right actually shows the C1 control characters.
In ASCII, the 32 characters in the range [0, 31] are called control characters. With the high bit set you get the range [128, 159], as in your image, which ISO 8859-1 reserves for the C1 control characters because some old software can't deal with non-ASCII characters.
Unicode's first 256 code points are exactly the same as ISO 8859-1 (control characters included), while Windows-1252 matches ISO 8859-1 everywhere except [128, 159], where it has printable characters instead of the C1 controls. Characters shown in a white rounded rectangle like that are control characters as visualized by Notepad++. You need to use the same code page when outputting.
It's also possible that you're choosing the wrong charset in the editor, as Hans Passant said.
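To see this concretely, here is a small sketch (assumptions: a console app; on .NET Core/.NET 5+ code page 1252 needs Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first):

using System;
using System.Text;

class CodePageDemo
{
    static void Main()
    {
        byte[] b = { 0x92 };
        char latin1 = Encoding.GetEncoding("ISO-8859-1").GetChars(b)[0]; // U+0092, a C1 control character
        char cp1252 = Encoding.GetEncoding(1252).GetChars(b)[0];         // U+2019, a right single quotation mark
        Console.WriteLine("{0:X4} vs {1:X4}", (int)latin1, (int)cp1252); // prints "0092 vs 2019"
    }
}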

Related

Why does the C# Unicode range cover only a limited range (up to 0xFFFF)?

I'm getting confused about C# UTF8 encoding...
Assuming those "facts" are right:
Unicode is the "protocol" which defines each character.
UTF-8 defines the "implementation" - how to store those characters.
Unicode defines the character range from 0x0000 to 0x10FFFF (source)
According to the C# reference, the accepted range for each char is 0x0000 to 0xFFFF. I don't understand what happens to the other characters, above 0xFFFF, that are defined in the Unicode protocol.
In contrast to C#, when I use Python for writing UTF8 text it covers the whole expected range (0x0000 to 0x10FFFF). For example:
u"\U00010000" #WORKING!!!
which doesn't work in C#. What's more, when I write the string u"\U00010000" (a single character) from Python to a text file and then read it from C#, this single-character document becomes 2 characters in C#!
# Python (write):
import codecs
text = u"\U00010000"
with codecs.open("file.txt", "w+", encoding="utf-8") as f:
    f.write(text)  # len(text) -> 1
// C# (read):
string text = File.ReadAllText("file.txt", Encoding.UTF8); // How I read this text from file.
Console.WriteLine(text.Length); // 2
Why? How to fix?
According to the C# reference, the accepted range for each char is 0x0000 to 0xFFFF. I don't understand what happens to the other characters, above 0xFFFF, that are defined in the Unicode protocol.
Unfortunately, a C#/.NET char does not represent a Unicode character.
A char is a 16-bit value in the range 0x0000 to 0xFFFF which represents one “UTF-16 code unit”. Characters in the ranges U+0000–U+D7FF and U+E000–U+FFFF are represented by the code unit of the same number, so everything's fine there.
The less-often-used other characters, in the range U+010000 to U+10FFFF, are encoded using the remaining space 0xD800–0xDFFF, representing each character as two UTF-16 code units together (a surrogate pair), so the equivalent of the Python string "\U00010000" is C#'s "\uD800\uDC00".
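You can check this directly in C# (a quick sketch; C# accepts the same \U escape syntax as Python):

string s = "\U00010000";                 // one code point outside the BMP
Console.WriteLine(s == "\uD800\uDC00");  // True: the same two UTF-16 code units
Console.WriteLine(s.Length);             // 2, even though it is "one character"
Console.WriteLine(char.ConvertToUtf32(s, 0).ToString("X")); // 10000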
Why?
The reason for this craziness is that the Windows NT series itself uses UTF-16LE as the native string encoding, so for interoperability convenience .NET chose the same. WinNT chose that encoding—at the time thought of as UCS-2 and without any of the pesky surrogate code unit pairs—because in the early days Unicode only had characters up to U+FFFF, and the thinking was that was going to be all anyone was going to need.
How to fix?
There isn't really a good fix. Some other languages that were unfortunate enough to have based their string type on UTF-16 code units (Java, JavaScript) are starting to add methods to their strings to do operations on them counting a code point at a time; but there is no such functionality in .NET at present.
Often you don't actually need to count/find/split/order/etc. strings using proper code point items and indexes. But when you really do, in .NET, you're in for a bad time. You end up having to re-implement each normally-trivial method by manually walking over each char and checking whether it's part of a two-char surrogate pair, or by converting the string to an array of code point ints and back. This isn't a lot of fun, either way.
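As an illustration of that manual walking, one possible sketch (not a built-in method; it assumes well-formed UTF-16 with no lone surrogates, since char.ConvertToUtf32 throws on those):

using System.Collections.Generic;

static IEnumerable<int> CodePoints(string s)
{
    for (int i = 0; i < s.Length; i++)
    {
        int cp = char.ConvertToUtf32(s, i); // combines high+low surrogate when present
        if (cp > 0xFFFF) i++;               // skip the low surrogate just consumed
        yield return cp;
    }
}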
A more elegant and altogether more practical option is to invent a time machine, so we can send the UTF-8 design back to 1988 and prevent UTF-16 from ever having existed.
Unicode has so-called planes (wiki).
As you can see, C#'s char type only supports the first plane, plane 0, the basic multilingual plane.
I know for a fact that C# uses UTF-16 encoding, so I'm a bit surprised to see that it doesn't support code points beyond the first plane in the char datatype. (I haven't run into this issue myself...)
This is an artificial restriction in char's implementation, but one that's understandable. The designers of .NET probably didn't want to tie the abstraction of their own character datatype to the abstraction that Unicode defines, in case that standard would not survive (it has already superseded others). This is just my guess, of course. It just "uses" UTF-16 for memory representation.
UTF-16 uses a trick to squash code points higher than 0xFFFF into pairs of 16-bit code units, as you can read about here. Technically those code points consist of 2 "characters", the so-called surrogate pair. In that sense it breaks the "one code point = one character" abstraction.
You can definitely get around this by working with string and maybe arrays of char. If you have more specific problems, you can find plenty of information on StackOverflow and elsewhere about working with all of Unicode's code points in .NET.
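For counting, at least, there is a built-in helper: System.Globalization.StringInfo counts "text elements" rather than char values. A quick sketch:

using System;
using System.Globalization;

class TextElements
{
    static void Main()
    {
        string s = "\U0001F4A9";          // PILE OF POO, outside plane 0
        Console.WriteLine(s.Length);      // 2 (a surrogate pair of chars)
        Console.WriteLine(new StringInfo(s).LengthInTextElements); // 1
    }
}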

Is it possible to display (convert?) the Unicode hex \u0092 to a Unicode HTML entity in .NET?

I have some string that contains the following code/value:
"You won\u0092t find a ...."
It looks like that string contains the Right Apostrophe special character.
ref1: Unicode control 0092
ref2: ASCII chart (both 127 + extra extended ascii)
I'm not sure how to display this in the web browser. It keeps displaying the TOFU square-box character instead. I'm under the impression that the Unicode (hex) value 0092 can be converted to the Unicode HTML entity &#146;
Is my understanding correct?
Update 1:
It was suggested by @sam-axe that I HtmlEncode the string. That didn't work. Here it is...
Note the ampersand got correctly encoded....
It looks like there's an encoding mix-up. In .NET, strings are normally encoded as UTF-16, and a right apostrophe should be represented as \u2019. But in your example, the right apostrophe is represented as \x92, which suggests the original encoding was Windows code page 1252. If you include your string in a Unicode document, the character \x92 won't be interpreted properly.
You can fix the problem by re-encoding your string as UTF-16. To do so, treat the string as an array of bytes, and then convert the bytes back to Unicode using the 1252 code page:
string title = "You won\u0092t find a cheaper apartment * Sauna & Spa";
byte[] bytes = title.Select(c => (byte)c).ToArray();
title = Encoding.GetEncoding(1252).GetString(bytes);
// Result: "You won’t find a cheaper apartment * Sauna & Spa"
Note: much of my answer is based on guessing and looking at the decompiled code of System.Web 4.0. The reference source looks very similar (identical?).
You're correct that "’" (6 characters) can be displayed in the browser. Your output string, however, contains "\u0092" (1 character). This is a control character, not an HTML entity.
According to the reference code, WebUtility.HtmlEncode() doesn't transform characters between 128 and 160 - all characters in this range are control characters (ampersand is special-cased in the code as are a few other special HTML symbols).
My guess is that, because these are control characters, they're output without transformation, since transforming them could change the meaning of the string. (I tried running some examples using LINQPad; this character was not rendered.)
If you really want to transform these characters (or remove them), you'll probably have to write your own function before/after calling HtmlEncode() - there may be something that does this already but I don't know of any.
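For example, a hypothetical helper along those lines (the name StripC1 is my own; it simply drops the [128, 159] range before encoding):

using System.Linq;
using System.Net;

static string StripC1(string s)
{
    // drop the C1 control characters that HtmlEncode() passes through untouched
    return new string(s.Where(c => c < '\u0080' || c > '\u009F').ToArray());
}

// usage: string html = WebUtility.HtmlEncode(StripC1(title));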
Hope this helps.
Edit: Michael Liu's answer seems correct. I'm leaving my answer here because it may be useful in cases when the input encoding of a string is not known.

ASCII.GetString() stops on null character

I have a big problem...
My piece of code:
string doc = System.Text.Encoding.ASCII.GetString(stream);
The variable doc ends at the first null (\0) character (a lot of data is missing from that point on). I want to get the whole string.
What's more, when I copied this piece of code and ran it in the Immediate window in Visual Studio, everything was fine...
What am I doing wrong?
No, it doesn't:
string doc = System.Text.Encoding.ASCII.GetString(new byte[] { 65, 0, 65 }); // A\0A
int len = doc.Length; //3
But WinForms (and the Windows API) truncate (when showing) at the first \0.
Example: https://dotnetfiddle.net/yjwO4Y
I'll add that (in Visual Studio 2013) the \0 is shown correctly except in a single place: the Text Visualizer (the magnifying glass) doesn't support the \0 and truncates at it.
Why does this happen? Because historically there were two "models" for strings: C strings, which are NUL (\0) terminated (and so can't use \0 as a character), and Pascal strings, which have the length prepended and so can contain \0 as a character. From the wiki:
Null-terminated strings were produced by the .ASCIZ directive of the PDP-11 assembly languages and the ASCIZ directive of the MACRO-10 macro assembly language for the PDP-10. These predate the development of the C programming language, but other forms of strings were often used.
Now, Windows is written in C and uses null-terminated strings (but then Microsoft changed its mind, and COM strings are more similar to Pascal strings and can contain the NUL character). So the Windows API can't use the \0 character (unless it is COM based, and even then the COM-based parts could quite often be buggy, because they aren't fully tested with \0). For .NET, Microsoft decided to use something similar to Pascal strings and COM strings, so .NET strings can contain \0.
WinForms is built directly on top of the Windows API, so it can't show the \0. WPF is instead built "from the ground up" in .NET, so in general it can show the \0 character.
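If you just need to display such a string in WinForms, one workaround (a sketch; textBox1 is a hypothetical control) is to replace the NULs before assigning the text:

string doc = System.Text.Encoding.ASCII.GetString(stream);
textBox1.Text = doc.Replace("\0", " ");   // or Replace("\0", "\\0") to make them visible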

Reading & displaying Unicode from HEX string (written by a legacy Microsoft COM application)

I have a legacy C++ COM application which writes a BSTR (UTF-16 on Windows) like this.
Say, ☻ (Black Smiley, i.e. ALT + Numpad 2) is written by the application in hex as 060000003B260D000A00. Note that the first 4 bytes are reserved for the BSTR length.
Now, how do I display the black smiley back in C# from this hex string? In the VS debugger, '\u263B' displays the smiley, but here the string is 3B26. This is just one example of the kind of data. Any data can be dumped by that app (like large XSLs, texts, etc., all converted to hex format). The idea is to interpret the hex correctly in C#.
This link talks about something similar, but I'm not very sure. Any pointers?
Use Marshal.PtrToStringBSTR to get an instance of a managed String from your BSTR.
Note that the IntPtr argument should be a pointer to the start of the string characters themselves, not the start of the 4 bytes which encode the length of the string.
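If all you actually have is the hex dump as text (not a live BSTR pointer), a sketch like this should work, assuming the layout described above: a 4-byte little-endian byte-length prefix followed by UTF-16LE code units:

using System;
using System.Text;

class BstrHexDemo
{
    static void Main()
    {
        string hex = "060000003B260D000A00";
        byte[] bytes = new byte[hex.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);

        int byteLen = BitConverter.ToInt32(bytes, 0);                 // 6: payload length in bytes
        string text = Encoding.Unicode.GetString(bytes, 4, byteLen);  // "☻\r\n"
        Console.WriteLine(text);
    }
}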

How do I use 32-bit Unicode characters in C#?

Maybe I don't need 32-bit strings, but I do need to represent 32-bit characters:
http://www.fileformat.info/info/unicode/char/1f4a9/index.htm
I grabbed the Symbola font and can see the character when I paste it (into the URL bar or any text area), so I know I have font support for it.
But how do I support it in my C#/.NET app?
-edit- I'll add something: when I pasted the said character into my .NET WinForms app I do NOT see the character correctly. When pasting it into Firefox I do see it correctly. How do I see the characters correctly in my WinForms apps?
I am not sure I understand your question:
Strings in .NET are UTF-16 encoded, and there is nothing you can do about this. If you want to get the UTF-32 version of a string, you will have to convert it into a byte array with the UTF32Encoding class.
Characters in .NET are thus 16 bits long, and there is nothing you can do about this either. A UTF-32 encoded character can only be represented by a byte array (with 4 items). You can use the UTF32Encoding class for this purpose.
Every UTF-32 character has an equivalent UTF-16 representation, and vice-versa. So in this context we could only speak of characters, and of their different representations (encodings), UTF-16 being the representation of choice on the .NET platform.
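A short sketch of that conversion using the built-in Encoding.UTF32 (an instance of the UTF32Encoding class mentioned above):

using System;
using System.Text;

class Utf32Demo
{
    static void Main()
    {
        string s = "\U0001F4A9";                      // stored as 2 UTF-16 code units
        byte[] utf32 = Encoding.UTF32.GetBytes(s);    // one 4-byte UTF-32 code unit
        Console.WriteLine(utf32.Length);              // 4
        Console.WriteLine(Encoding.UTF32.GetString(utf32) == s); // True
    }
}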
You didn't say what exactly you mean by "support". But there is nothing special you need to do to work with characters that don't fit into one 16-bit char, unless you do string manipulation. They will just be represented as surrogate pairs, but you shouldn't need to know about that if you treat the string as a whole.
One exception is that some string manipulation methods won't work correctly. For example "\U0001F4A9".Substring(1) will return the second half of the surrogate pair, which is not a valid string.
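If you do need to manipulate such a string, char.ConvertToUtf32 and char.ConvertFromUtf32 are the surrogate-aware tools; a minimal sketch:

using System;

class SurrogateAware
{
    static void Main()
    {
        string s = "\U0001F4A9";
        // s.Substring(1) would return a lone low surrogate, which is invalid;
        // work with whole code points instead:
        int cp = char.ConvertToUtf32(s, 0);                // 0x1F4A9, the full code point
        Console.WriteLine(cp.ToString("X"));               // 1F4A9
        Console.WriteLine(char.ConvertFromUtf32(cp) == s); // True
    }
}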
