Encoding of a string in C#

I was translating some C++ code to C# and I saw the below function
myMultiByteToWideChar( encryptedBufUnicode, (char*)encryptedBuf, sizeof(encryptedBufUnicode) );
This basically converts the char array to Unicode.
In C#, aren't strings and char arrays already Unicode? Or do we need to make them Unicode using a System.Text function?

C# strings and characters are UTF-16.
If you have an array of bytes, you can use the Encoding class to read it as a string using the correct encoding.
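For example, a minimal sketch (assuming the bytes happen to be UTF-8; substitute Encoding.Unicode, Encoding.ASCII, etc. to match whatever actually produced them):
using System;
using System.Text;

class DecodeDemo
{
    static void Main()
    {
        // Bytes received from native code or a file, assumed here to be UTF-8.
        byte[] utf8Bytes = { 0x48, 0x69, 0x20, 0xF0, 0x9F, 0xA4, 0xA0 }; // "Hi 🤠"

        // Decode the bytes into a (UTF-16) System.String.
        string text = Encoding.UTF8.GetString(utf8Bytes);
        Console.WriteLine(text);

        // Going the other way: encode a string back into bytes.
        byte[] roundTrip = Encoding.UTF8.GetBytes(text);
        Console.WriteLine(roundTrip.Length); // 7
    }
}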

Related

C# LPUTF8Str string marshaling does not appear to read string correctly from memory

LPUTF8Str string marshaling in C# simply does not work for me. I feel like I must be misunderstanding its use case, but after poring over the documentation and doing various other tests, I'm not sure what I'm doing wrong.
Context
First of all, to state my base (possibly incorrect) understanding of character encodings and why C# needs to convert them, in case something is wrong here:
Standard C/C++ strings (const char* and std::string respectively) use single-byte characters by default, on Windows and elsewhere. You can have strings with two-byte characters, but these are only used if you choose to use std::wstring (which I am not doing).
The default Windows single-byte character encoding is ANSI (7-bit ASCII + an extra set of characters that uses the 8th bit).
Unicode is the mapping of printable characters to code points (ie. to unique numbers). Strings of Unicode code points are commonly encoded using conventions such as:
UTF-8: mostly one byte per character for English, with special bytes specifying where a chain of more than one byte should form a single character (for the more funky ones). 7-bit ASCII is a subset of the UTF-8 encoding.
UTF-16: two bytes per character, with similar (but rarer) continuation patterns for the really funky characters.
UTF-32: four bytes per character, which is basically never used for English and adjacent languages because it's not a very memory-efficient encoding.
To write non-ASCII characters in C/C++ strings, you can encode the literal UTF-8 bytes using \xhh, where hh is the hex encoding of the bytes. Eg. "\xF0\x9F\xA4\xA0" equates to "🤠" (see the short check after this list).
C# encodes all managed strings using two-byte characters - I'm unsure if this is explicitly UTF-16, or some other Microsoft encoding. When a C/C++ string is passed to C#, it needs to be converted from single-byte (narrow) characters to two-byte (wide) characters.
Microsoft abuses the term "Unicode". They refer to two-byte character strings as "Unicode strings" in the C# documentation, thereby implying (incorrectly) that any strings that aren't two bytes per character are not Unicode. As we know from the UTF-8 encoding, this is not necessarily true - just because a string is represented as a const char* does not mean that it is not formed of Unicode characters. Colour me "\xF0\x9F\x98\x92" => "😒"
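As a quick check of the \xhh point above, the same bytes can be decoded from C# (a minimal sketch using the literals from this question):
using System;
using System.Text;

class Utf8LiteralCheck
{
    static void Main()
    {
        // The same bytes written as \xF0\x9F\xA4\xA0 in the C++ literal.
        byte[] cowboy = { 0xF0, 0x9F, 0xA4, 0xA0 };

        // Decoding them as UTF-8 yields the emoji, confirming that a plain
        // const char* can carry Unicode text even though its characters are "narrow".
        Console.WriteLine(Encoding.UTF8.GetString(cowboy) == "🤠"); // True
    }
}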
The actual issue
So a C++ program must expose strings to C# using const char* pointers, and a C# application must marshal these strings by converting them to wide characters. Let's say I have the following C++ function, which, for the sake of demonstrating C# marshaling, passes data out via a struct:
// Header:
extern "C"
{
    struct Library_Output
    {
        const char* str;
    };

    API_FUNC void Library_GetString(Library_Output* out);
}

// Source:
extern "C"
{
    void Library_GetString(Library_Output* out)
    {
        if ( out )
        {
            // Static string literal:
            out->str = "This is a UTF-8 string. \xF0\x9F\xA4\xA0";
        }
    }
}
In C#, I call the function like so:
using System;
using System.Runtime.InteropServices;

public class Program
{
    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
    struct Library_Output
    {
        // This is where the marshaling type is defined.
        // C# will convert the const char* pointer to
        // a string automatically.
        [MarshalAs(UnmanagedType.LPUTF8Str)]
        public string str;
    }

    [DllImport("Library.dll")]
    static extern void Library_GetString(IntPtr output);

    private static void Main()
    {
        int structSize = Marshal.SizeOf(typeof(Library_Output));
        IntPtr structPtr = Marshal.AllocHGlobal(structSize);

        Library_GetString(structPtr);

        // Tell C# to convert the data in the unmanaged memory
        // buffer to a managed object.
        Library_Output outputStruct =
            (Library_Output)Marshal.PtrToStructure(structPtr, typeof(Library_Output));

        Console.WriteLine(outputStruct.str);

        Marshal.FreeHGlobal(structPtr);
    }
}
Instead of printing the string to the console, what the application actually prints out is:
���n�
However, if I change the marshaling type to be UnmanagedType.LPStr rather than UnmanagedType.LPUTF8Str, I get:
This is a UTF-8 string. 🤠
This is confusing to me, because the documentation for string marshaling of structure members states:
UnmanagedType.LPStr: A pointer to a null-terminated array of ANSI characters.
UnmanagedType.LPUTF8Str: A pointer to a null-terminated array of UTF-8 encoded characters.
So ANSI string marshaling prints a UTF-8 (non-ANSI) string, but UTF-8 string marshaling prints garbage? To work out where the garbage was coming from, I had a look at what the data being printed actually was, and it appeared to be the value of the pointer itself.
Either the UTF-8 marshaling routine is treating the memory where the string pointer value resides as the string itself, or I'm misunderstanding something crucial about this process. My question, fundamentally, is twofold: firstly, why does the UTF-8 marshaling process not follow the string pointer properly, and secondly, what is actually the proper way to marshal UTF-8 strings from C++ to C#? Is it to use LPUTF8Str, or something else?
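One hedged workaround, regardless of which attribute is misbehaving here, is to leave the field as a raw pointer and decode it manually with Marshal.PtrToStringUTF8 (available in .NET Core / .NET 5+). This is only a sketch of the idea, not an explanation of the behaviour above; the Library_Output_Raw name is made up for illustration:
using System;
using System.Runtime.InteropServices;

public class ProgramManualDecode
{
    // Same native layout, but the string member is kept as an IntPtr,
    // so no automatic string marshaling happens at all.
    [StructLayout(LayoutKind.Sequential)]
    struct Library_Output_Raw
    {
        public IntPtr str;
    }

    [DllImport("Library.dll")]
    static extern void Library_GetString(IntPtr output);

    private static void Main()
    {
        IntPtr structPtr = Marshal.AllocHGlobal(Marshal.SizeOf<Library_Output_Raw>());
        Library_GetString(structPtr);

        var raw = Marshal.PtrToStructure<Library_Output_Raw>(structPtr);

        // Decode the null-terminated UTF-8 buffer ourselves (.NET Core / .NET 5+).
        string text = Marshal.PtrToStringUTF8(raw.str);
        Console.WriteLine(text);

        Marshal.FreeHGlobal(structPtr);
    }
}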

Serialize a string in binary with C# and deserialize it with C++

I'm struggling to find an effective way to serialize a string that could contain both Unicode and non-Unicode characters into a binary array, which I then write to a file that I have to deserialize using C++.
I have already implemented a serializer/deserializer in C++ which I use for most of my serialization, and which can handle both Unicode and non-Unicode characters (basically I convert non-Unicode characters into their Unicode equivalents and serialize everything as a Unicode string; not the most efficient approach, since every string now takes 2 bytes per character, but it works).
What I'm trying to achieve is to transform an arbitrary string into a 2-byte-per-character string that I can then deserialize from C++.
What would be the most effective way to achieve this?
Also, any suggestion regarding the way I'm serializing strings is of course welcome.
Encoding.Unicode.GetBytes("my string") encodes the string as UTF-16, which uses 2 bytes for each character, matching the 2-byte-per-character format you describe. If you are still searching for an alternative, consider the encoding you choose.
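A minimal sketch of the C# side (the length prefix is an assumed convention; the C++ reader has to agree on it and on the little-endian byte order BinaryWriter uses):
using System;
using System.IO;
using System.Text;

class StringSerializer
{
    static void Main()
    {
        string value = "héllo wörld";

        using (var stream = File.Create("strings.bin"))
        using (var writer = new BinaryWriter(stream))
        {
            // UTF-16LE bytes: 2 bytes per char, matching a wchar_t buffer
            // on the Windows/C++ side.
            byte[] utf16 = Encoding.Unicode.GetBytes(value);

            // Write an explicit byte-length prefix so the C++ reader knows
            // how much to read (assumed convention, not a requirement).
            writer.Write(utf16.Length);
            writer.Write(utf16);
        }
    }
}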

Cryptotext in Unicode from Byte Array Seems to Contain Invalid Characters

My encryption application (written in C# & GTK# and using Rijndeal) takes a string from a textview to encrypt, and returns the result in a Byte array. I then use Encoding.Unicode.GetString() to convert it to a string, but my output doesn't look right, it seems to contain invalid characters: `zźr[� ��ā�֖�Z�_����
W��h�.
I'm assuming that the encoding for the textview is not Unicode, but ASCII doesn't work either. How can I ensure that the output is not invalid? Or is my approach wrong to begin with?
I'm new to C# and not very experienced with programming in general (I have decent skill in PHP and know a little JavaScript, but that's about it) so if you could baby-down your answers it would be much appreciated.
Thank you in advance for taking the time to assist me.
While every string can be represented as a sequence of bytes using UTF-16, not every sequence of bytes represents a UTF-16 encoded string - especially if the sequence of bytes is the result of an encryption process.
You can use the Convert.ToBase64String method to convert the sequence of bytes to a Base64 string.
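A minimal sketch of that approach (the encryptedBytes array is just a stand-in for whatever your Rijndael code returns):
using System;

class CipherTextDisplay
{
    static void Main()
    {
        // Stand-in for the byte[] produced by the encryption step.
        byte[] encryptedBytes = { 0x12, 0xFE, 0x00, 0x9A, 0xC3, 0x7B };

        // Base64 turns arbitrary bytes into a printable, reversible string.
        string printable = Convert.ToBase64String(encryptedBytes);
        Console.WriteLine(printable);

        // When decrypting later, convert the Base64 text back to the same bytes.
        byte[] roundTrip = Convert.FromBase64String(printable);
        Console.WriteLine(roundTrip.Length); // 6
    }
}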

Character code different between C++ and C#

I have a string handling function in C++ as well as in C#. In C++ the code for the character ˆ is returned as -120, whereas in C# it is 710. While building in C++ with Visual Studio 2010 I have set the character set to "Not Set" in the project settings. In C# I am using System.Text.Encoding.Default during one of the conversions. Does that make any difference? How can I get the same behavior in C++ as in C#?
The character is U+02C6. The encoding you're using in C++ is probably CP 1252, which encodes this character as the byte 0x88 (which is -120 when shown as a signed char in decimal). C# uses UTF-16, which encodes this character as 0x02C6 (710 in decimal).
You can use UTF-16 in C++ on Windows by using wchar_t instead of char.
You can't make C# strings use CP1252, but you can get byte arrays in different encodings from a String using Encodings.
byte[] in_cp1252 = Encoding.GetEncoding(1252).GetBytes("Your string here");
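A short check of the numbers above (a sketch; on .NET Core / .NET 5+, code page 1252 additionally requires registering the System.Text.Encoding.CodePages provider):
using System;
using System.Text;

class CharCodeCheck
{
    static void Main()
    {
        // On .NET Core / .NET 5+ only, register code pages first:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        string s = "ˆ"; // U+02C6 MODIFIER LETTER CIRCUMFLEX ACCENT

        // UTF-16 value seen from C#: 710 (0x02C6).
        Console.WriteLine((int)s[0]);

        // CP 1252 byte seen from C++: 0x88, i.e. -120 as a signed char.
        byte[] cp1252 = Encoding.GetEncoding(1252).GetBytes(s);
        Console.WriteLine(cp1252[0]);        // 136 (0x88)
        Console.WriteLine((sbyte)cp1252[0]); // -120
    }
}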

Are there any well-known REGEX libraries for .NET specifically for byte[] arrays?

I understand that .NET's regex works with strings, but I need an implementation for byte[] arrays. Are there any open source implementations in .NET? Does a byte[] regex exist for any programming language other than C# which I could use to build a wrapper for it in C#?
My limitation is that I have to stay within byte arrays. So cannot do any conversions to strings.
Thanks for the advice.
Regular expressions work with strings. A byte array can contain just about any data, so if you want to use regular expressions, convert this byte array into a string using the encoding that was used to encode it. For example, if your byte array represents a UTF-8 encoded string:
byte[] buffer = ...
string foo = Encoding.UTF8.GetString(buffer);
// Go ahead and use regexes on foo
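Extending that snippet with an actual match (a sketch; the buffer contents and the pattern are made up for illustration):
using System;
using System.Text;
using System.Text.RegularExpressions;

class ByteRegexDemo
{
    static void Main()
    {
        // Pretend this buffer arrived from a file or socket and is known to be UTF-8 text.
        byte[] buffer = Encoding.UTF8.GetBytes("order-42;order-7");

        string foo = Encoding.UTF8.GetString(buffer);

        // Ordinary .NET regexes now apply to the decoded string.
        foreach (Match m in Regex.Matches(foo, @"order-(\d+)"))
        {
            Console.WriteLine(m.Groups[1].Value); // 42, then 7
        }
    }
}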
