I work with SQLite from C. I'm trying to send UTF-8 characters to a .dll from a C# app, but every run behaves differently. For example, sometimes it inserts "değirmenci" and another time, with the same code, it inserts "değirmencil", even though I don't change the word. And sometimes it inserts an apparent duplicate into the UNIQUE column (I think there is an invisible character in there, something like 0x01 in ASCII).
Sorry about my English.
This is my C# code:
[DllImport("dllfile.dll", CharSet = CharSet.Unicode)]
static extern IntPtr Tahmin_Baslat();

[DllImport("dllfile.dll", CharSet = CharSet.Unicode)]
static extern int Sozcuk_Ekle(IntPtr kok);

static void Main()
{
    byte[] bytes = System.Text.Encoding.UTF8.GetBytes("değirmenci");
    IntPtr unmanagedPointer = Marshal.AllocHGlobal(bytes.Length);
    Marshal.Copy(bytes, 0, unmanagedPointer, bytes.Length);
    IntPtr ch = Tahmin_Baslat();
    int r = Sozcuk_Ekle(unmanagedPointer);
    Console.WriteLine(r);
    Console.Read();
}
And this is my C code (compiled as C++, since it uses default arguments):
int Sozcuk_Ekle(const char* kok, int tip_1 = 1, int tip_2 = 0, int tip_3 = 0)
{
    sqlite3 *ch;
    int rc;
    char *HataMsj = 0;
    rc = sqlite3_open(veritabani, &ch); // Open the database
    if (rc)
    {
        return HATA_DEGERI;
    }
    char buff[strlen(kok) + 64];
    sprintf(buff, "INSERT INTO kokler (kok,tip_1,tip_2,tip_3) VALUES('%s',%d,%d,%d)", kok, tip_1, tip_2, tip_3); // Build the SQL statement
    sqlite3_exec(ch, buff, GeriBildirim, 0, &HataMsj); // Execute the statement
    sqlite3_close(ch); // Release the database resources
    return DOGRU_DEGERI; // Return success
}
(header files etc. included)
And this is what happens:
http://i.stack.imgur.com/BJ4fE.png
Solution
Adding a NULL terminator to the end of the bytes:
byte[] bytes = System.Text.Encoding.UTF8.GetBytes("değirmenci\0");
Check the calling convention in the DllImport attribute (it should be Cdecl), and add a NULL terminator to your UTF-8 string:
byte[] bytes = System.Text.Encoding.UTF8.GetBytes("değirmenci" + '\0');
This adds a NULL terminator to the resulting UTF-8 byte array (native .NET strings don't need one, but C strings do).
Related
I'm marshaling some Chinese characters whose decimal representation in UTF-8 is
228,184,145,230,161,148
However, when I receive them in C++ I end up with the bytes
-77,-13,-67,-37
I can solve this by using sbyte[] instead of string in C#, but now I'm trying to marshal a string[], so I can't use that method. Does anyone have an idea why this is happening?
EDIT: more detailed code:
C#
[DllImport("mydll.dll",CallingConvention=CallingConvention.Cdecl)]
static extern IntPtr inputFiles(IntPtr pAlzObj, string[] filePaths, int fileNum);
string[] allfiles = Directory.GetFiles("myfolder", "*.jpg", SearchOption.AllDirectories);
string[] allFilesutf8 = allfiles.Select(i => Encoding.UTF8.GetString(Encoding.Default.GetBytes(i))).ToArray();
IntPtr pRet = inputFiles(pObj, allfiles, allfiles.Length);
C++
extern __declspec(dllexport) char* inputFiles(Alz* pObj, char** filePaths, int fileNum);
char* inputFiles(Alz* pObj, char** filePaths, int fileNum)
{
if (pObj != NULL) {
try{
std::vector<const char*> imgPaths;
for (int i = 0; i < fileNum; i++)
{
char* s = *(filePaths + i);
//Here I would print out the string and the result in bytes (decimals representation) are already different.
imgPaths.push_back(s);
}
string ret = pObj->myfunc(imgPaths);
const char* retTemp = ret.c_str();
char* retChar = _strdup(retTemp);
return retChar;
}
catch (const std::runtime_error& e) {
cout << "some runtime error " << e.what() << endl;
}
}
}
Also, something I found: if I change the Windows system-wide encoding (in language settings) to Unicode UTF-8, it works fine. Not sure why, though.
When marshaling to unsigned char* (or unsigned char**, as it's an array) I end up with another output, which is literally just 256 plus the numbers shown for char: 179,243,189,219. This leads me to believe something is happening during marshaling rather than a conversion mistake on the C++ side of things.
That is because C++ strings use plain char for storage. The char type is signed there, and that makes those byte values get interpreted as negative ones.
I guess the traits are handled inside the <xstring> header on Windows (as far as I know), specifically in:
_STD_BEGIN
template <class _Elem, class _Int_type>
struct _Char_traits { // properties of a string or stream element
using char_type = _Elem;
using int_type = _Int_type;
using pos_type = streampos;
using off_type = streamoff;
using state_type = _Mbstatet;
#if _HAS_CXX20
using comparison_category = strong_ordering;
#endif // _HAS_CXX20
I have some ideas: you solved the problem by using sbyte[] instead of string in C#, and now you are trying to marshal a string[]; just use a List<sbyte[]> for the string array.
I am not experienced with C++, but I guess there are other options for strings; this link shows which string types can be marshaled from C#: https://learn.microsoft.com/en-us/dotnet/api/system.runtime.interopservices.unmanagedtype?view=net-7.0
The issue was in the marshaling. I think it was because, as the data was transferred, the locale in the C++ dll was set to GBK (or at least not UTF-8). The trick was to convert the incoming strings from GBK into UTF-8, which I was able to do with the following function:
std::string gb_to_utf8(char* src)
{
    // Required size of the wide buffer, in wchar_t units, including the terminator
    int i = MultiByteToWideChar(CP_ACP, 0, src, -1, NULL, 0);
    wchar_t* strA = (wchar_t*)malloc(i * sizeof(wchar_t));
    MultiByteToWideChar(CP_ACP, 0, src, -1, strA, i);
    if (!wcslen(strA)) {
        throw std::runtime_error("error converting");
    }
    char utf8[1024]; // Unsure how long the converted string could be; sized generously
    int n = wcstombs(utf8, strA, sizeof(utf8));
    std::string resStr = utf8;
    free(strA);
    return resStr;
}
I also needed to set setlocale(LC_ALL, "en_US.UTF-8"); for wcstombs in the function above to produce UTF-8.
I am trying to read in a structure that was written to a file by a C++ program (I don't have the source code). I have been trying to read this structure into C# and marshal it so far without success.
The structure is basically a set of fixed-length strings, two bytes per character. In C++, they can be declared as TCHAR[8].
The data on disk looks as follows:
I have tried the following C# code that I know can successfully read in the values as a string:
public void ReadTwoStringsOfFixedLength()
{
string field1 = string.Empty;
string field2 = string.Empty;
FileReadString(handle, out field1, 16);
FileReadString(handle, out field2, 16);
}
public static void FileReadString(BinaryReader reader, out string outVal, int length)
{
var mem = new MemoryStream();
outVal = string.Empty;
byte b = 0;
for (int i = 0; i < length; i++)
{
b = reader.ReadByte();
if (b != 0) mem.WriteByte(b);
}
outVal = Encoding.GetEncoding(1252).GetString(mem.ToArray());
}
However, what I really would like to do is use C# structs, since this data is represented as a struct in C++ (and contains other fields which I have not depicted here).
I have tried various methods of marshaling this data based on answers I have read on Stack Overflow, but none have yielded the result I wanted. In most cases, either the string encoding was incorrect, I ended up with a memory exception, or I got only the first character of each field (probably due to null termination?).
Here is my code:
void Main()
{
byte[] abBuffer = handle.ReadBytes(Marshal.SizeOf(typeof(MyStruct)));
//Access data
GCHandle pinnedPacket = GCHandle.Alloc(abBuffer, GCHandleType.Pinned);
var atTestStruct = (MyStruct)Marshal.PtrToStructure(pinnedPacket.AddrOfPinnedObject(), typeof(MyStruct));
}
[StructLayout(LayoutKind.Sequential, CharSet = CharSet.Ansi), Serializable]
struct MyStruct
{
[MarshalAs(UnmanagedType.ByValTStr, SizeConst = 16)]
string Field1; // Resulting value = "F"
[MarshalAs(UnmanagedType.ByValTStr, SizeConst = 16)]
string Field2; // Resulting value = "F"
}
Note that I have also attempted to use CharSet.Unicode, however the resulting strings are garbled.
Any help to fix the code would be much appreciated.
I think you need to set CharSet = CharSet.Unicode on your StructLayout.
46 00 69 00 read as ASCII/ANSI is considered a single character ('F') followed by a null terminator. The documentation shows that CharSet.Unicode is needed for two-byte characters such as those you're showing.
The SizeConst value must also be the number of characters, not bytes.
I've written a basic C++ library that gets data from an OPC UA server and formats it into an array of strings (char **). I've confirmed that it works standalone, but now I'm trying to call it from a C# program using DLLs/pInvoke and running into serious memory errors.
My C# main:
List<String> resultList = new List<string>();
IntPtr inArr = new IntPtr();
inArr = Marshal.AllocHGlobal(inArr);
resultList = Utilities.ReturnStringArray(/*data*/,inArr);
C# Helper functions:
public class Utilities{
[DllImport(//DllArgs- confirmed to be correct)]
private static extern void getTopLevelNodes(/*data*/, IntPtr inArr);
public static List<String> ReturnStringArray(/*data*/,IntPtr inArr)
{
getTopLevelNodes(/*data*/,inArr); // <- this is where the AccessViolationException is thrown
//functions that convert char ** to List<String>
//return list
}
And finally, my C++ DLL implementation:
extern "C" EXPORT void getTopLevelNodes(*/data*/,char **ret){
std::vector<std::string> results = std::vector<std::string>();
//code that fills vector with strings from server
ret = (char **)realloc(ret, sizeof(char *));
ret[0] = (char *)malloc(sizeof(char));
strcpy(ret[0], "");
int count = 0;
int capacity = 1;
for (auto string : results){
ret[count] = (char*)malloc(sizeof(char) * 2048);
strcpy(ret[count++], string.c_str());
if (count == capacity){
capacity *= 2;
ret = (char **)realloc(ret, sizeof(char *)*capacity + 1);
}
}
What this should do is initialize a List to hold the final result and an IntPtr to be populated as a char ** by the C++ DLL, which is then processed back into a List in C#. However, an AccessViolationException is thrown every time I call getTopLevelNodes from C#. What can I do to fix this memory issue? Is this the best way to pass an array of strings via interop?
Thank you in advance
Edit:
I'm still looking for more answers, if there's a simpler way to implement string array interop between C# and a DLL, please, let me know!
METHOD 1 - Advanced struct marshaling.
As opposed to marshaling a list, try creating a C# struct like this:
[StructLayout(LayoutKind.Sequential, Pack = 2)]
public struct StringData
{
public string [] mylist; /* maybe better yet byte[][] (never tried)*/
};
Now marshal it in C# like this:
IntPtr pnt = Marshal.AllocHGlobal(Marshal.SizeOf(typeof(StringData))); // into unmanaged space
Get a pointer to the structure:
StringData theStringData = /*get the data*/;
Marshal.StructureToPtr(theStringData, pnt, false);
// Place structure into unmanaged space.
getTopLevelNodes(/* data */, pnt); // call dll
theStringData =(StringData)Marshal.PtrToStructure(pnt,typeof(StringData));
//get structure back from unmanaged space.
Marshal.FreeHGlobal(pnt); // Free shared mem
Now in CPP:
#pragma pack(2)
/************CPP STRUCT**************/
struct StringDataCpp
{
char* strings[];
};
And the function:
extern "C" EXPORT void getTopLevelNodes(/*data*/,char *ret){ //just a byte pointer.
struct StringDataCpp *m = reinterpret_cast<struct StringDataCpp*>(ret);
//..do ur thing ..//
}
I have used this pattern with much more complicated structs as well. The key is that you're just copying byte by byte from C# and interpreting byte by byte in C++.
The 'pack' setting is key here; it ensures the structs align the same way in memory.
METHOD 2 - Simple byte array with fixed
// Use your list, except as List<byte>
unsafe
{
    fixed (byte* cp = theStringData.ToArray())
    {
        getTopLevelNodes(/* data */, cp);
    }
}
// Snippet to convert the string array into one delimited byte array
string[] stringlist = /* get your strings */;
// Join with a delimiter the C++ side can split on, then take the raw bytes
byte[] theStringData = Encoding.UTF8.GetBytes(string.Join(";", stringlist));
Now CPP just receives a char*. You'll need a delimiter to separate the strings.
Note that your strings are '\0'-terminated already; replace that terminator with a ';' (or something similar) so you can tokenize easily in a loop in CPP using strtok with ';' as the delimiter, or use Boost.
Or, try making an array of byte pointers if possible:
byte*[] theStringStartPointers = /* &stringList[i] in a for loop */;
fixed (byte** cp = theStringStartPointers) /* continue */
This way is much simpler. The unsafe block allows the fixed block, and fixed ensures that the C# memory manager does not move the data.
I tried to send an array from C# to C++ via COM interop.
Here is the C# code:
public void SendArraytoCPlusPlus()
{
GGXForVBA.GeoAtlas GA = new GGXForVBA.GeoAtlas();
string[] arr = new string[3];
arr[0] = "One";
arr[1] = "Two";
arr[2] = "Five";
GA.GetArrayVar(arr);
}
And here is the C++ code:
void GeoAtlas::GetArrayVar(VARIANT& arr)
{
AFX_MANAGE_STATE(AfxGetStaticModuleState());
SAFEARRAY* pSafeArray = arr.parray;
long lStartBound = 0;
long lEndBound = 0;
SafeArrayGetLBound(pSafeArray,1,&lStartBound);
SafeArrayGetUBound(pSafeArray,1,&lEndBound);
LPCSTR * arrayAccess = NULL;
SafeArrayAccessData( pSafeArray , (void**)&arrayAccess);
for(int iIndex = lStartBound; iIndex <= lEndBound; iIndex ++)
{
LPCTSTR myString = (LPCTSTR)arrayAccess[iIndex];
AfxMessageBox(myString);
}
}
This is the IDL:
[id(23)] void GetArrayVar(VARIANT arr);
The problem is, the message box only shows the FIRST letter of each string, i.e. 'O', 'T', 'F'. I want to read the whole string. Any suggestions?
This is based on the fact that you're sending a Unicode/UTF-8 string but the message box expects an ANSI C string. Try converting the strings to ASCII before sending them; look at the Encoding class here: http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx
You may want to try sending them as a byte array:
var bytes = Encoding.ASCII.GetBytes("one");
You should pass a SAFEARRAY instead of a VARIANT:
[id(23)] void GetArrayVar([in] SAFEARRAY(BSTR) arr);
And in the implementation, access it like this:
void GeoAtlas::GetArrayVar(SAFEARRAY* arr)
{
CComSafeArray<BSTR> sa;
sa.Attach(arr);
for (int i = sa.GetLowerBound(); i <= sa.GetUpperBound(); ++i)
{
AfxMessageBox(sa[i]);
}
sa.Detach();
}
The sending side looks fine. On the receiving side, try this:
[id(1), helpstring("method SetArrayVar")] HRESULT SetArrayVar([in] VARIANT varMachineList);
STDMETHODIMP GeoAtlas::SetArrayVar(VARIANT arr)
{
    SAFEARRAY* pSafeArray = arr.parray;
    long lower, upper;
    SafeArrayGetLBound(pSafeArray, 1, &lower);
    SafeArrayGetUBound(pSafeArray, 1, &upper);
    DWORD dwItems = upper - lower + 1;
    resize(dwItems);
    BSTR* pBstr;
    if ( SUCCEEDED(SafeArrayAccessData(pSafeArray, (LPVOID*) &pBstr)) )
    {
        iterator it = begin();
        for ( long i = lower; i <= upper; i++, pBstr++, it++ )
        {
            USES_CONVERSION;
            *it = OLE2CT(*pBstr);
            // Here you get the array elements and can use them
        }
        // release the data block
        SafeArrayUnaccessData(pSafeArray);
    }
    return S_OK;
}
Not sure, but it looks like you have a Unicode/MBCS mismatch.
The managed part creates an array of UTF-16 encoded strings, while the unmanaged part seems to be compiled with the multi-byte character set. Therefore, upon seeing the first null byte (which is actually just the second byte of a UTF-16 character), your MBCS code thinks it has reached the end of the string.
One solution is to change the character set setting in your C++ project to Unicode and rebuild.
Another one is to change the code:
...
LPCWSTR * arrayAccess = NULL; // change LPCSTR -> LPCWSTR
...
for(int iIndex = lStartBound; iIndex <= lEndBound; iIndex ++)
{
bstr_t myString = arrayAccess[iIndex];
AfxMessageBox(myString);
}
bstr_t will be correctly initialized with BSTR, but it also provides conversion to const char* so call to AfxMessageBox will be OK.
I was searching for a way to check whether I've reached the end of the file with my BinaryReader, and one suggestion was to use PeekChar, as such:
while (inFile.PeekChar() > 0)
{
...
}
However, it looks like I've run into an issue:
Unhandled Exception: System.ArgumentException: The output char buffer is too small to contain the decoded characters, encoding 'Unicode (UTF-8)' fallback 'System.Text.DecoderReplacementFallback'.
Parameter name: chars
at System.Text.Encoding.ThrowCharsOverflow()
at System.Text.Encoding.ThrowCharsOverflow(DecoderNLS decoder, Boolean nothingDecoded)
at System.Text.UTF8Encoding.GetChars(Byte* bytes, Int32 byteCount, Char* chars, Int32 charCount, DecoderNLS baseDecoder)
at System.Text.DecoderNLS.GetChars(Byte* bytes, Int32 byteCount, Char* chars, Int32 charCount, Boolean flush)
at System.Text.DecoderNLS.GetChars(Byte[] bytes, Int32 byteIndex, Int32 byteCount, Char[] chars, Int32 charIndex, Boolean flush)
at System.Text.DecoderNLS.GetChars(Byte[] bytes, Int32 byteIndex, Int32 byteCount, Char[] chars, Int32 charIndex)
at System.IO.BinaryReader.InternalReadOneChar()
at System.IO.BinaryReader.PeekChar()
So maybe PeekChar isn't the best way to do it, and I don't think it should even be used that way, because what I want to check is the current position of my reader, not what the next character is supposed to be.
There is a more accurate way to check for EOF when working with binary data. It avoids all the encoding issues that come with the PeekChar approach and does exactly what is needed: check whether the reader's position is at the end of the file.
while (inFile.BaseStream.Position != inFile.BaseStream.Length)
{
...
}
Wrap it into a custom extension method that extends the BinaryReader class with the missing EOF method:
public static class StreamEOF {
public static bool EOF( this BinaryReader binaryReader ) {
var bs = binaryReader.BaseStream;
return ( bs.Position == bs.Length);
}
}
So now you can just write:
while (!infile.EOF()) {
// Read....
}
:)
... assuming you have created infile somewhere like this:
var infile = new BinaryReader(File.OpenRead(fileName));
Note: var is implicit typing.
Happy to have found it; it's another puzzle piece for well-styled C# code. :D
I suggest something very similar to @MxLDevs' answer, but with a '<' operator rather than a '!=' operator. Since Position can be set to anything you want (within long confines), this will stop the loop from attempting to access an invalid file Position.
while (inFile.BaseStream.Position < inFile.BaseStream.Length)
{
...
}
This works for me:
using (BinaryReader br = new BinaryReader(File.Open(fileName,
FileMode.Open))) {
//int pos = 0;
//int length = (int)br.BaseStream.Length;
while (br.BaseStream.Position != br.BaseStream.Length) {
string nume = br.ReadString ();
string prenume = br.ReadString ();
Persoana p = new Persoana (nume, prenume);
myArrayList.Add (p);
Console.WriteLine ("ADAUGAT XXX: "+ p.ToString());
//pos++;
}
}
I'll add my suggestion: if you don't need the "encoding" part of the BinaryReader (i.e. you never use the various ReadChar/ReadChars/ReadString methods), then you can use an encoding that won't ever throw and that is always one byte per char. Encoding.GetEncoding("iso-8859-1") is perfect for this; you pass it as a parameter of the BinaryReader constructor. The iso-8859-1 encoding is a one-byte-per-character encoding that maps all of the first 256 characters of Unicode 1:1 (so byte 254 is char 254, for example).