ASCII.GetString() stops at null character - C#

I have a big problem...
My piece of code:
string doc = System.Text.Encoding.ASCII.GetString(stream);
Variable doc ends at the first null (\0) character (a lot of data is missing at that point). I want to get the whole string.
What's more, when I copied this piece of code and ran it in the Immediate Window in Visual Studio - everything was fine...
What am I doing wrong?

No, it doesn't:
string doc = System.Text.Encoding.ASCII.GetString(new byte[] { 65, 0, 65 }); // A\0A
int len = doc.Length; //3
But WinForms (and the Windows API) truncate at the first \0 when displaying.
Example: https://dotnetfiddle.net/yjwO4Y
I'll add that (in Visual Studio 2013) the \0 is correctly shown, except in a single place: if you activate the Text Visualizer (the magnifying glass), it doesn't support the \0 and truncates at it.
Why does this happen? Because historically there were two "models" for strings: C strings, which are NUL (\0) terminated (and so can't use \0 as a character), and Pascal strings, which have the length prepended and so can contain \0 as a character. From the wiki:
Null-terminated strings were produced by the .ASCIZ directive of the PDP-11 assembly languages and the ASCIZ directive of the MACRO-10 macro assembly language for the PDP-10. These predate the development of the C programming language, but other forms of strings were often used.
Now, Windows is written in C and uses null-terminated strings (though Microsoft later changed its mind: COM strings are more similar to Pascal strings and can contain the NUL character). So the Windows API can't use the \0 character (unless it is COM based, and even the COM-based parts could quite often be buggy, because they aren't fully tested with \0). For .NET, Microsoft decided to use something similar to Pascal strings and COM strings, so .NET strings can contain \0.
Winforms is built directly on top of Windows API, so it can't show the \0. WPF is instead built "from the ground up" in .NET, so in general it can show the \0 character.
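If the goal is simply to show the data in a WinForms control, a minimal sketch of stripping the nulls first (assuming `stream` is the byte[] from the question; whether you drop the nulls or split on them depends on what they mean in your data):
using System.Text;

static class NullSafeDisplay
{
    // The decoded string keeps every \0; only the display layer stops at the
    // first one, so remove (or split on) the nulls before showing the text.
    public static string ForDisplay(byte[] stream)
    {
        string doc = Encoding.ASCII.GetString(stream);
        return doc.Replace("\0", string.Empty);   // or doc.Split('\0') to keep the pieces
    }
}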

Related

How to map C# compiler error location (line, column) onto the SyntaxTree produced by Roslyn API?

So:
The C# compiler outputs the (line,column) style location.
The Roslyn API expects a sequential text position.
How to map the former to the latter?
The C# code could be UTF-8 with or without the BOM, or even UTF-16. It could contain all kinds of characters in the form of comments or embedded strings.
Let us assume we know the encoding and have the respective Encoding object handy. I can convert the file bytes to char[]. The problem is that some chars may contribute zero to the final sequential position. I know that the BOM character does. I have no idea whether others do too.
Now, if we know for sure that BOM is the only character that contributes 0 to the length, then I can skip it and count the characters and my question becomes trivial. This is what I do today - I just assume that the BOM is the only "bad" player.
But maybe there is a better way? Maybe the Roslyn API contains some hidden gem that knows how to accept (line, column) and spit out the sequential position? Or maybe some of the Microsoft.Build libraries?
EDIT 1
As per the accepted answer the following gives the location:
var srcText = SourceText.From(File.ReadAllText(err.FilePath));
int location = srcText.Lines[err.Line - 1].Start + err.Column - 1;
You have uncovered the reason that the SourceText type exists in the Roslyn APIs. Its entire purpose is to handle encoding of strings and perform calculations of lines, columns, and spans.
Due to the way .NET handles Unicode, and depending on which code pages are installed in your OS, there could be cases where SourceText does not do what you need. It has generally proven "good enough" for our purposes though.
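For reference, a minimal sketch of the mapping in both directions using SourceText (the line/column parameters are assumed to be 1-based, matching the compiler output in the question's EDIT; LinePosition itself is 0-based):
using System.IO;
using Microsoft.CodeAnalysis.Text;

static class ErrorLocations
{
    // (line, column) -> sequential position; line and column are 1-based.
    public static int ToPosition(string filePath, int line, int column)
    {
        SourceText srcText = SourceText.From(File.ReadAllText(filePath));
        return srcText.Lines[line - 1].Start + column - 1;
    }

    // sequential position -> (line, character), returned 0-based by Roslyn.
    public static LinePosition ToLineColumn(string filePath, int position)
    {
        SourceText srcText = SourceText.From(File.ReadAllText(filePath));
        return srcText.Lines.GetLinePosition(position);
    }
}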

Why does the C# Unicode range cover a limited range (up to 0xFFFF)?

I'm getting confused about C# UTF-8 encoding...
Assuming these "facts" are right:
Unicode is the "protocol" which defines each character.
UTF-8 defines the "implementation" - how to store those characters.
Unicode defines the character range from 0x0000 to 0x10FFFF (source)
According to the C# reference, the accepted range for a char is 0x0000 to 0xFFFF. I don't understand what happens to the other characters, above 0xFFFF, which are defined in the Unicode protocol.
In contrast to C#, when I use Python to write UTF-8 text, it covers all the expected range (0x0000 to 0x10FFFF). For example:
u"\U00010000" #WORKING!!!
which isn't working in C#. What's more, when I write the string u"\U00010000" (a single character) from Python to a text file and then read it from C#, this single-character document becomes 2 characters in C#!
# Python (write):
import codecs
with codecs.open("file.txt", "w+", encoding="utf-8") as f:
    f.write(text)  # len(text) -> 1
// C# (read):
string text = File.ReadAllText("file.txt", Encoding.UTF8); // How I read this text from file.
Console.WriteLine(text.Length); // 2
Why? How to fix?
According to the C# reference, the accepted range for a char is 0x0000 to 0xFFFF. I don't understand what happens to the other characters, above 0xFFFF, which are defined in the Unicode protocol.
Unfortunately, a C#/.NET char does not represent a Unicode character.
A char is a 16-bit value in the range 0x0000 to 0xFFFF which represents one “UTF-16 code unit”. Characters in the ranges U+0000–U+D7FF and U+E000–U+FFFF are represented by the code unit of the same number, so everything's fine there.
The less-often-used characters, in the range U+010000 to U+10FFFF, are encoded using the reserved space 0xD800–0xDFFF: each such character is represented by two UTF-16 code units together (a surrogate pair), so the equivalent of the Python string "\U00010000" is the C# string "\uD800\uDC00".
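A quick sketch to see this in action (the \U escape is valid in a C# string literal; it simply expands to two code units):
using System;

class SurrogatePairDemo
{
    static void Main()
    {
        string s = "\U00010000";                                 // one code point beyond the BMP
        Console.WriteLine(s.Length);                             // 2 (two UTF-16 code units)
        Console.WriteLine(s == "\uD800\uDC00");                  // True (the surrogate pair)
        Console.WriteLine(s == char.ConvertFromUtf32(0x10000));  // True
    }
}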
Why?
The reason for this craziness is that the Windows NT series itself uses UTF-16LE as the native string encoding, so for interoperability convenience .NET chose the same. WinNT chose that encoding—at the time thought of as UCS-2 and without any of the pesky surrogate code unit pairs—because in the early days Unicode only had characters up to U+FFFF, and the thinking was that was going to be all anyone was going to need.
How to fix?
There isn't really a good fix. Some other languages that were unfortunate enough to have based their string type on UTF-16 code units (Java, JavaScript) are starting to add methods to their strings to do operations on them counting a code point at a time; but there is no such functionality in .NET at present.
Often you don't actually need to consistently count/find/split/order/etc. strings using proper code point items and indexes. But when you really, really do, in .NET, you're in for a bad time. You end up having to re-implement each normally-trivial method by manually walking over each char and checking whether it is part of a two-char surrogate pair, or by converting the string to an array of codepoint ints and back. This isn't a lot of fun, either way.
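For illustration, here is roughly what such a manual walk might look like, a minimal sketch (the helper name is made up):
using System.Collections.Generic;

static class CodePoints
{
    // Enumerate Unicode code points by pairing up surrogate chars manually.
    public static IEnumerable<int> Enumerate(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
            {
                yield return char.ConvertToUtf32(s[i], s[i + 1]);
                i++;                       // skip the low surrogate we just consumed
            }
            else
            {
                yield return s[i];         // BMP char (or a lone surrogate left as-is)
            }
        }
    }
}
For example, CodePoints.Enumerate("a\U00010000b") yields three code points even though the string's Length is 4.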
A more elegant and altogether more practical option is to invent a time machine, so we can send the UTF-8 design back to 1988 and prevent UTF-16 from ever having existed.
Unicode has so-called planes (wiki).
As you can see, C#'s char type only supports the first plane, plane 0, the basic multilingual plane.
I know for a fact that C# uses UTF-16 encoding, so I'm a bit surprised to see that it doesn't support code points beyond the first plane in the char datatype. (I haven't run into this issue myself...)
This is an artificial restriction in char's implementation, but one that's understandable. The designers of .NET probably didn't want to tie the abstraction of their own character datatype to the abstraction that Unicode defines, in case that standard would not survive (it already superseded others). This is just my guess of course. It just "uses" UTF-16 for memory representation.
UTF-16 uses a trick to encode code points higher than 0xFFFF as pairs of 16-bit code units, as you can read about here. Technically those code points are represented by 2 "characters", the so-called surrogate pair. In that sense it breaks the "one code point = one character" abstraction.
You can definitely get around this by working with string and maybe arrays of char. If you have more specific problems, you can find plenty of information on StackOverflow and elsewhere about working with all of Unicode's code points in .NET.
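As a small illustration of the plane arithmetic (the plane number is just the code point shifted right by 16 bits):
using System;

class PlaneDemo
{
    static void Main()
    {
        string s = char.ConvertFromUtf32(0x1F4A9);  // a character from outside the BMP
        int codePoint = char.ConvertToUtf32(s, 0);

        Console.WriteLine(codePoint >> 16);         // 1 -> Supplementary Multilingual Plane
        Console.WriteLine('A' >> 16);               // 0 -> Basic Multilingual Plane (what char covers)
    }
}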

Caveats Encoding a C# string to a Javascript string

I'm trying to write a custom JavaScript MVC3 helper class for my project, and one of the methods is supposed to escape C# strings to JavaScript strings.
I know C# strings are UTF-16 encoded, and Javascript strings also seem to be UTF-16. No problem here.
I know some characters like backslash, single quotes or double quotes must be backslash-escaped in JavaScript, so:
\ becomes \\
' becomes \'
" becomes \"
Is there any other caveat I must be aware of before writing my conversion method?
EDIT:
Great answers so far, I'm adding some references from the answers in the question to help others in the future.
Alex K. suggested using System.Web.HttpUtility.JavaScriptStringEncode, which I marked as the right answer for me because I'm using .NET 4. But this function is not available in previous .NET versions, so I'm adding some other resources here:
CR becomes \r // a JavaScript string cannot be broken across more than one line
LF becomes \n // a JavaScript string cannot be broken across more than one line
TAB becomes \t
Control characters must be Hex-Escaped
JP Richardson gave an interesting link informing that Javascript uses UCS-2, which is a subset of UTF-16, but how to encode this correctly is an entirely new question.
LukeH in the comments below mentioned the CR, LF and TAB chars, and that reminded me of the control chars (BEEP, NULL, ACK, etc.).
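For older .NET versions, a hand-rolled escaper covering the rules listed above might look like this (a minimal sketch, not a hardened implementation; the class and method names are made up):
using System.Text;

static class JsEscape
{
    // Escapes a C# string for embedding in a single- or double-quoted
    // JavaScript string literal: backslash, quotes, CR/LF/TAB, and any
    // other control character (hex-escaped as \uXXXX).
    public static string Escape(string value)
    {
        var sb = new StringBuilder(value.Length);
        foreach (char c in value)
        {
            switch (c)
            {
                case '\\': sb.Append("\\\\"); break;
                case '\'': sb.Append("\\'");  break;
                case '"':  sb.Append("\\\""); break;
                case '\r': sb.Append("\\r");  break;
                case '\n': sb.Append("\\n");  break;
                case '\t': sb.Append("\\t");  break;
                default:
                    if (char.IsControl(c))
                        sb.Append("\\u").Append(((int)c).ToString("x4"));
                    else
                        sb.Append(c);
                    break;
            }
        }
        return sb.ToString();
    }
}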
(.NET 4) You can:
System.Web.HttpUtility.JavaScriptStringEncode(@"aa\bb ""cc"" dd\tee", true);
==
"aa\\bb \"cc\" dd\\tee"
It's my understanding that you do have to be careful, as JavaScript is not UTF-16; rather, it's UCS-2, which I believe is a subset of UTF-16. What this means for you is that any character whose code point is higher than what fits in 2 bytes (above 0xFFFF) could give you problems in JavaScript.
In summary, under the covers the engine may use UTF-16, but it only exposes UCS-2-like methods.
Great article on the issue:
http://mathiasbynens.be/notes/javascript-encoding
Just use Microsoft.JScript.GlobalObject.escape
Found it here: http://forums.asp.net/p/1308104/4468088.aspx/1?Re+C+equivalent+of+JavaScript+escape+
Instead of using the JavaScriptStringEncode() method, you can encode server side using:
HttpUtility.UrlEncode()
When you need to read the encoded string client side, you have to call the unescape() JavaScript function before using the string.

How do I use 32-bit Unicode characters in C#?

Maybe I don't need 32-bit strings, but I need to represent 32-bit characters.
http://www.fileformat.info/info/unicode/char/1f4a9/index.htm
Now, I grabbed the Symbola font and can see the character when I paste it (in the URL or any text areas), so I know I have font support for it.
But how do I support it in my C#/.NET app?
-edit- I'll add something. When I paste the said character into my .NET WinForms app, I DO NOT see the character correctly. When pasting it into Firefox, I do see it correctly. How do I see the characters correctly in my WinForms apps?
I am not sure I understand your question:
Strings in .NET are UTF-16 encoded, and there is nothing you can do about this. If you want to get the UTF-32 version of a string, you will have to convert it into a byte array with the UTF32Encoding class.
Characters in .NET are thus 16 bits long, and there is nothing you can do about this either. A UTF-32 encoded character can only be represented by a byte array (with 4 items). You can use the UTF32Encoding class for this purpose.
Every UTF-32 character has an equivalent UTF-16 representation, and vice-versa. So in this context we could only speak of characters, and of their different representations (encodings), UTF-16 being the representation of choice on the .NET platform.
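A small sketch of what that conversion looks like with the built-in Encoding.UTF32 (an instance of the UTF32Encoding class mentioned above):
using System;
using System.Text;

class Utf32Demo
{
    static void Main()
    {
        string s = "A\U0001F4A9";                   // one BMP char + one character beyond the BMP

        byte[] utf32 = Encoding.UTF32.GetBytes(s);  // four bytes per code point
        Console.WriteLine(utf32.Length);            // 8

        string back = Encoding.UTF32.GetString(utf32);
        Console.WriteLine(back == s);               // True: the UTF-16 string round-trips
    }
}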
You didn't say what exactly you mean by “support”. But there is nothing special you need to do to work with characters that don't fit into one 16-bit char, unless you do string manipulation. They will just be represented as surrogate pairs, but you shouldn't need to know about that if you treat the string as a whole.
One exception is that some string manipulation methods won't work correctly. For example "\U0001F4A9".Substring(1) will return the second half of the surrogate pair, which is not a valid string.
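A sketch of that pitfall, and of StringInfo as one way to slice by text elements instead of chars:
using System;
using System.Globalization;

class SubstringPitfall
{
    static void Main()
    {
        string s = "a\U0001F4A9b";                          // 3 characters, 4 chars

        string broken = s.Substring(2);                     // starts in the middle of the surrogate pair
        Console.WriteLine(char.IsLowSurrogate(broken[0]));  // True: not valid on its own

        var info = new StringInfo(s);
        Console.WriteLine(info.LengthInTextElements);       // 3
        Console.WriteLine(info.SubstringByTextElements(1)); // the non-BMP character plus "b"
    }
}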

How do I get \0 off my string from C++ when read in C#

I'm kind of stuck here. I'm developing a custom Pipeline component for Commerce Server 2009, but that has little to do with my problem.
In the setup of the pipe, I give the user a windows form to enter some values for configuration. One of those values is a URL for a SharePoint site. Commerce Server uses C++ components behind all this pipeline stuff, so the entered values are put into an IDictionary and eventually persisted to the DB via the C++ component from Microsoft.
When I read the string in during pipeline execution, it is handed to me in an IDictionary object from C++. My C# code sees that URL suffixed with \0\0. I'm not sure where those are coming from, but my code blows up because it's not a valid URI. I am trimming the string before I save it and trimming it when I read it and still can't get rid of those.
Any ideas what is causing this and how I can get rid of it? I'd prefer not to have a hack like substringing it, but something that gets at the root cause.
Thanks,
Corey
Would this help:
string sFixedUrl = "hello\0\0".Trim('\0');
As the others' posts explained, strings in C are null-terminated. (Notice that C++, however, already provides a string type which doesn't depend on that.)
Your case is just a bit different because you're getting a double-null-terminated string. I'm not an expert here, so anyone should feel free to correct me if I'm wrong. But this looks like the typical string representation for Unicode/i18n-aware applications on Windows, which use wide characters. Please take a look at this.
One guess is that the application which is persisting the string into the database is not using a "portable" strategy. For example, it might be persisting the string buffer considering its size in raw bytes instead of its actual length. The former would be counting the extra two zeros in the end (and, consequently, persisting them too) while the latter would discard them.
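For example, if the raw wide-character buffer, terminators included, is what gets persisted, the C# side would see something like this (a sketch with made-up bytes, not Commerce Server's actual storage format):
using System;
using System.Text;

class TrailingNullDemo
{
    static void Main()
    {
        // "hi" as UTF-16LE followed by a double wide-char terminator (\0\0)
        byte[] persisted = { 0x68, 0x00, 0x69, 0x00, 0x00, 0x00, 0x00, 0x00 };

        string raw = Encoding.Unicode.GetString(persisted);
        Console.WriteLine(raw.Length);          // 4: the two trailing \0 chars came along

        string url = raw.TrimEnd('\0');         // note: a plain Trim() only removes whitespace
        Console.WriteLine(url.Length);          // 2
    }
}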
From this site:
A string in C is simply an array of characters, with the final character set to the NUL character (ASCII/Unicode code point 0). This null terminator is required; a string is ill-formed if it isn't there. The string literal token in C/C++ ("string") guarantees this.
const char *str = "foo";
is the same as
const char *str = {'f', 'o', 'o', 0};
So as soon as the C++ component gets your IDictionary, it will add the null terminator to the end of the string. If you want to remove it, you will have to strip the null characters from the end before sending back the dictionary. See this post on how to remove a null-terminated character. Basically you need to know the exact size and trim it off.
Another technique you can use is an array of characters and the length of the array. An array of characters does not need a terminating null character.
When you pass this data structure, you must pass the length also. The convention for C-style strings is to determine the end of the string by searching for a '\0' (or, for wide-character strings, a two-byte zero). Since the array doesn't have the terminating characters, the length is always needed.
A much better solution is to use the std::string. It doesn't append null characters. When you need compatibility, or the C-style format, use the c_str() method. I have to use this technique with my program because the GUI framework has its own string data type that is incompatible with std::string.
