I'm working on a project that uses both .NET and Java, using ZeroMQ to communicate between them.
I can connect to the .NET server; however, when I try to convert the byte array to a string, strange things happen. In the Eclipse debugger I can see the string and its length, but when I click on the string its value changes to only the first letter, and the length changes to 1. In the Eclipse console, when I try to copy and paste the output, I only get the first letter. I also tried running it in NetBeans and get the same issue.
I thought it might be due to endianness, so I have tried both
BIG_ENDIAN
LITTLE_ENDIAN
Anyone know how I can get the full string, and not just the first letter?
import java.io.UnsupportedEncodingException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import org.zeromq.ZMQ;
class local_thr
{
    private static final String ENDPOINT = "tcp://127.0.0.1:8000";
    static String[] myargs = { ENDPOINT, "1000", "100" };

    public static void main(String[] args) {
        args = myargs;
        ZMQ.Context ctx = ZMQ.context(1);
        ZMQ.Socket s = ctx.socket(ZMQ.SUB);
        s.subscribe("".getBytes());
        s.connect(ENDPOINT);
        while (true) {
            byte[] data = s.recv(0);
            ByteBuffer buf = ByteBuffer.wrap(data);
            buf.order(ByteOrder.nativeOrder());
            byte[] bytes = new byte[buf.remaining()];
            buf.get(bytes, 0, bytes.length);
            String quote;
            quote = new String(bytes);
            String myQuote;
            myQuote = new String();
            System.out.println(quote);
        }
    }
}
1 char suggests that the data is being encoded as little-endian UTF-16 and decoded as nul-terminated (could be expecting single-byte, could be expecting UTF-8).
Make sure you are familiar with encodings, and ensure that both ends of the pipe are using the same encoding.
The Java String(byte[]) constructor uses the default system charset, so I would start by investigating how to read UTF-16 from Java, or maybe use UTF-8 from both ends. Using a default charset is never robust.
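For illustration, here is a minimal C# sketch of the .NET side (independent of ZeroMQ, so no socket code) showing why a UTF-16LE payload decoded with a single-byte or null-terminated assumption appears truncated to its first letter, and how agreeing on UTF-8 at both ends avoids the problem:
using System;
using System.Text;
class EncodingMismatchDemo
{
    static void Main()
    {
        // .NET strings are UTF-16; Encoding.Unicode emits UTF-16LE bytes.
        byte[] utf16Bytes = Encoding.Unicode.GetBytes("Hello");
        // Prints 48-00-65-00-6C-00-6C-00-6F-00: every second byte is 0x00,
        // so a reader that stops at the first 0x00 shows only "H".
        Console.WriteLine(BitConverter.ToString(utf16Bytes));

        // If both ends agree on UTF-8 instead, the round trip is unambiguous.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("Hello");
        Console.WriteLine(Encoding.UTF8.GetString(utf8Bytes)); // Hello
    }
}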
I've been trying to create a consistent method to take the bytes of characters and display the byte representation in alternative text codepages. For example, hex D1 in Windows-1251, KOI8-U, etc. The idea is to take text that appears scrambled, because it is being interpreted and displayed in the wrong character set, and transform it to the correct display. Below is a shortened portion of the code I've used. I've gotten it to work on ideone, but can't get it to work with Add-Type in PowerShell or when compiling with csc; I just get question marks or incorrect characters.
The output of the below code from ideone, which is the correct transformation, is:
D1-00-C1-00
СБ
windows-1251
When compiled with PowerShell or csc, the output is (incorrect):
D1-00-C1-00
?A
windows-1251
Is there a way to make this work in the Windows environment?
using System;
using System.Text;

public class Test
{
    public static void Main()
    {
        string str = "ÑÁ";
        byte[] bytes = new byte[str.Length * sizeof(char)];
        System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
        Encoding enc = Encoding.GetEncoding(1251);
        char[] ca = enc.GetChars(bytes);
        Console.WriteLine(BitConverter.ToString(bytes));
        Console.WriteLine(ca);
        Console.WriteLine(enc.HeaderName);
    }
}
First of all, the best way to solve this problem is to avoid it—make sure that when you have bytes, you always know which character set was used to encode these bytes.
To answer the question: you can't. There is no consistent method to make this work anywhere. This will always involve guesswork.
What you see is a string which was encoded to bytes with some encoding, and then decoded using a different encoding. Here is how you fix these strings:
figure out (or guess) what encoding was originally used to encode the string to bytes,
figure out (or guess) what encoding was used when displaying the string,
reverse the operations: encode the mojibake using the encoding from step (2) and decode the resulting bytes with the encoding from step (1).
If you already have bytes, you only do step (1) and use that encoding to decode the bytes to a string.
A program doing that would look like this:
using System;
using System.Text;

public class Test
{
    public static void Main()
    {
        // our corrupted string
        string str = "ÑÁ";
        // encoding from step (2)
        Encoding enc1 = Encoding.GetEncoding(1252);
        byte[] bytes = enc1.GetBytes(str);
        // encoding from step (1)
        Encoding enc2 = Encoding.GetEncoding(1251);
        string originalStr = enc2.GetString(bytes);
        Console.WriteLine(originalStr);
    }
}
UPDATE/SOLUTION
As roeland notes, there's quite a bit of guesswork involved with this. On Windows, the C# solution also has two parts. It appears that the console's default display encoding doesn't change automatically with the encoding object (it seems to on Mac with the Mono Framework), so the console display has to be set manually with SetConsoleCP and SetConsoleOutputCP. I also had to create multiple encodings and use an inner loop to get the proper intersection of codepages. The link below pointed towards the display issue's resolution.
UTF-8 output from PowerShell
The example below is focused on the scenario where Russian is the suspected language.
CODE
using System;
using System.Text;
using System.Runtime.InteropServices;

namespace Language
{
    public class Test
    {
        // Imports dll functions to set the console display codepage
        [DllImport("kernel32.dll")]
        public static extern bool SetConsoleCP(int codepage);
        [DllImport("kernel32.dll")]
        public static extern bool SetConsoleOutputCP(int codepage);

        public static void Main()
        {
            string s = "ÑÁÅ";
            byte[] bytes = new byte[s.Length * sizeof(char)];
            System.Buffer.BlockCopy(s.ToCharArray(), 0, bytes, 0, bytes.Length);
            Console.WriteLine(BitConverter.ToString(bytes));
            // produce possible combinations
            foreach (Encoding encw in Russian.GetCps())
            {
                bool cp = SetConsoleOutputCP(encw.CodePage);
                bool cp2 = SetConsoleCP(encw.CodePage);
                foreach (Encoding enc in Russian.GetCps())
                {
                    char[] ca = enc.GetChars(bytes);
                    Console.WriteLine(ca);
                }
            }
        }
    }

    public class Russian
    {
        public static Encoding[] GetCps()
        {
            // get applicable Cyrillic code pages
            Encoding[] russian = new Encoding[8];
            russian[0] = Encoding.GetEncoding(855);
            russian[1] = Encoding.GetEncoding(866);
            russian[2] = Encoding.GetEncoding(1251);
            russian[3] = Encoding.GetEncoding(10007);
            russian[4] = Encoding.GetEncoding(20866);
            russian[5] = Encoding.GetEncoding(21866);
            russian[6] = Encoding.GetEncoding(20880);
            russian[7] = Encoding.GetEncoding(28595);
            return russian;
        }
    }
}
The output is lengthy, but gives a string with the correct output as one member of a list.
I made a shorter version in PowerShell, which appears to change the display codepage automatically and requires fewer iterations:
function Get-Language ([string]$source) {
$encodings = [System.Text.Encoding]::GetEncoding(855),[System.Text.Encoding]::GetEncoding(866),[System.Text.Encoding]::GetEncoding(1251),[System.Text.Encoding]::GetEncoding(10007),[System.Text.Encoding]::GetEncoding(20866),[System.Text.Encoding]::GetEncoding(21866),[System.Text.Encoding]::GetEncoding(20880),[System.Text.Encoding]::GetEncoding(28595)
$C = ""
$bytes = gc $source -encoding byte
for ($i=0; $i -le $encodings.Length - 1; $i++) {
$bytes | %{$C = $C + $encodings[$i].GetChars($_)}
Write-Host $C
$C = ""
}
}
Currently I am using something like this:
private static ASCIIEncoding ASCEncoding = new ASCIIEncoding();
...
...
and my method:
...
public object some_method(object BinaryRequest)
{
byte[] byteRequest = (byte[])BinaryRequest;
string strRequest = ASCEncoding.GetString(byteRequest);
...
}
Some characters, when checked under Windows, are different from when checked under Linux:
9I9T (win)
98T (linux)
When you are communicating between systems, it's a good idea to use a specific, documented encoding for your text. For text written in the English language (including programming languages that use English for keywords, etc.), the UTF-8 encoding is likely to use the fewest bytes overall in the encoded representation.
byte[] byteRequest = (byte[])BinaryRequest;
string strRequest = Encoding.UTF8.GetString(byteRequest);
Obviously to use this, you are expected to produce your requests using the same encoding.
string strRequest = ...
byte[] byteRequest = Encoding.UTF8.GetBytes(strRequest);
string stringValue = Encoding.Default.GetString(byteArray);
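For comparison, here is a small self-contained sketch (variable names are illustrative) of the same conversion done as an explicit UTF-8 round trip instead of relying on Encoding.Default, whose value depends on the platform and runtime, which is why decoding with it can give different characters under Windows and Linux:
using System;
using System.Text;
class RequestEncodingDemo
{
    static void Main()
    {
        string strRequest = "café";
        // Sender: encode with an explicit, documented encoding.
        byte[] byteRequest = Encoding.UTF8.GetBytes(strRequest);
        // Receiver: decode with the same explicit encoding -> "café" everywhere.
        Console.WriteLine(Encoding.UTF8.GetString(byteRequest));
        // Encoding.Default is the ANSI codepage on .NET Framework/Windows but
        // typically UTF-8 on Linux/Mono, so this line prints "cafÃ©" on one
        // platform and "café" on the other.
        Console.WriteLine(Encoding.Default.GetString(byteRequest));
    }
}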
I'm trying to send a string containing special characters through a TcpClient (byte[]). Here's an example:
Client enters "amé" in a textbox
Client converts string to byte[] using a certain encoding (I've tried all the predefined ones plus some like "iso-8859-1")
Client sends byte[] through TCP
Server receives and outputs the string reconverted with the same encoding (to a listbox)
Edit :
I forgot to mention that the resulting string was "am?".
Edit-2 (as requested, here's some code):
@DJKRAZE here's a bit of code:
byte[] buffer = Encoding.ASCII.GetBytes("amé");
((TcpClient)server).Client.Send(buffer);
On the server side:
byte[] buffer = new byte[1024];
Client.Receive(buffer);
string message = Encoding.ASCII.GetString(buffer);
ListBox1.Items.Add(message);
The string that appears in the listbox is "am?"
=== Solution ===
Encoding encoding = Encoding.GetEncoding("iso-8859-1");
byte[] message = encoding.GetBytes("babé");
Update:
Simply using Encoding.UTF8.GetBytes("ééé"); works like a charm.
It's never too late to answer a question, I think; hope someone will find answers here.
C# uses 16-bit chars, and ASCII truncates them to 8 bits to fit in a byte. After some research, I found UTF-8 to be the best encoding for special characters.
//data to send via TCP or any stream/file
byte[] string_to_send = UTF8Encoding.UTF8.GetBytes("amé");
//when receiving, pass the received byte array into this to get the string back
string received_string = UTF8Encoding.UTF8.GetString(string_to_send);
Your problem appears to be the Encoding.ASCII.GetBytes("amé"); and Encoding.ASCII.GetString(buffer); calls, as hinted at by '500 - Internal Server Error' in his comments.
The é character is a multi-byte character which is encoded in UTF-8 as the byte sequence C3 A9. When you use the Encoding.ASCII class to encode and decode, the é character is converted to a question mark since it does not have a direct ASCII encoding. This is true of any character that has no direct encoding in ASCII.
Change your code to use Encoding.UTF8.GetBytes() and Encoding.UTF8.GetString() and it should work for you.
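To see the difference concretely, here is a minimal sketch comparing the two encodings on the same string (the byte values in the comments are what these framework calls produce):
using System;
using System.Text;
class AsciiVsUtf8Demo
{
    static void Main()
    {
        string original = "amé";
        // ASCII has no mapping for 'é', so it becomes 0x3F ('?'): 61-6D-3F
        byte[] asciiBytes = Encoding.ASCII.GetBytes(original);
        Console.WriteLine(BitConverter.ToString(asciiBytes));
        Console.WriteLine(Encoding.ASCII.GetString(asciiBytes)); // am?

        // UTF-8 encodes 'é' as the two bytes C3 A9: 61-6D-C3-A9
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);
        Console.WriteLine(BitConverter.ToString(utf8Bytes));
        Console.WriteLine(Encoding.UTF8.GetString(utf8Bytes)); // amé
    }
}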
Your question and your error are not clear to me, but using a Base64 string may solve the problem.
Something like this
static public string EncodeTo64(string toEncode)
{
    byte[] toEncodeAsBytes
        = System.Text.ASCIIEncoding.ASCII.GetBytes(toEncode);
    string returnValue
        = System.Convert.ToBase64String(toEncodeAsBytes);
    return returnValue;
}

static public string DecodeFrom64(string encodedData)
{
    byte[] encodedDataAsBytes
        = System.Convert.FromBase64String(encodedData);
    string returnValue =
        System.Text.ASCIIEncoding.ASCII.GetString(encodedDataAsBytes);
    return returnValue;
}
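These helpers still go through ASCII internally, so a character such as 'é' would be lost before the Base64 step. A minimal sketch of the same idea using UTF-8 for the string/byte conversion, so that such characters survive the round trip (the method names here are just illustrative):
static public string EncodeTo64Utf8(string toEncode)
{
    // UTF-8 can represent any string, so nothing is lost before Base64.
    byte[] toEncodeAsBytes = System.Text.Encoding.UTF8.GetBytes(toEncode);
    return System.Convert.ToBase64String(toEncodeAsBytes);
}

static public string DecodeFrom64Utf8(string encodedData)
{
    byte[] encodedDataAsBytes = System.Convert.FromBase64String(encodedData);
    return System.Text.Encoding.UTF8.GetString(encodedDataAsBytes);
}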
Just a question of curiosity here.
When you write plugins for Unity on the iOS platform, the plugins have a limited native-to-managed callback functionality (from the plugin and then to Unity). Basically this documentation:
iOS plugin Unity documentation
states that the function signature you are able to call back to is this:
Only script methods that correspond to the following signature can be called from native code: function MethodName(message:string)
The signature defined in C looks like this:
void UnitySendMessage(const char* obj, const char* method, const char* msg);
So this pretty much means I can only send strings back to Unity.
Now in my plugin I'm using protobuf-net to serialize objects and send them back to unity to be deserialized. I have gotten this to work, but by a solution I feel is quite ugly and not very elegant at all:
Person* person = [[[[[Person builder] setId:123]
    setName:@"Bob"]
    setEmail:@"bob@example.com"] build];
NSData* data = [person data];
NSString *rawTest = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
UnitySendMessage("GameObject", "ReceiveProductRequestResponse", [rawTest cStringUsingEncoding:NSUTF8StringEncoding]);
Basically I simply encode the bytestream into a string. In Unity I then get the bytes of the string and deserialize from there:
System.Text.UTF8Encoding encoding=new System.Text.UTF8Encoding();
Byte[] bytes = encoding.GetBytes(message);
This does work. But is there really no other way of doing it? Perhaps someone have an idea of how it could be done in some alternative way?
Base-64 (or another similar base) is the correct way to do this; you cannot use an encoding here (such as UTF8) - an encoding is intended to transform:
arbitrary string <===encoding===> structured bytes
i.e. where the bytes have a defined structure; this is not the case with protobuf; what you want is:
arbitrary bytes <===transform===> structured string
and base-64 is the most convenient implementation of that in most cases. Strictly speaking, you can sometimes go a bit higher than 64, but you'd probably have to roll it manually - not pretty. Base-64 is well-understood and well-supported, making it a good choice. I don't know how you do that in C, but in Unity it should be just:
string s = Convert.ToBase64String(bytes);
Often, you can also avoid an extra buffer here, assuming you are serializing in-memory to a MemoryStream:
string s;
using(var ms = new MemoryStream()) {
// not shown: serialization steps
s = Convert.ToBase64String(ms.GetBuffer(), 0, (int)ms.Length);
}
Example based on Marc Gravell's answer:
On the ios side:
-(void)sendData:(NSData*)data
{
NSString* base64String = [data base64Encoding];
const char* utf8String = [base64String cStringUsingEncoding:NSUTF8StringEncoding];
UnitySendMessage("iOSNativeCommunicationManager", "dataReceived", utf8String);
}
and on the unity side:
public delegate void didReceivedData( byte[] data );
public static event didReceivedData didReceivedDataEvent;
public void dataReceived( string bytesString )
{
byte[] data = System.Convert.FromBase64String(bytesString);
if( didReceivedDataEvent != null )
didReceivedDataEvent(data);
}
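To close the loop on the Unity side, here is a hedged sketch of deserializing the received bytes with protobuf-net (the Person contract and the class/method names here are illustrative, not taken from the original code):
using System;
using System.IO;
using ProtoBuf;

[ProtoContract]
public class Person
{
    [ProtoMember(1)] public int Id { get; set; }
    [ProtoMember(2)] public string Name { get; set; }
    [ProtoMember(3)] public string Email { get; set; }
}

public static class ProductRequestDecoder
{
    // Takes the Base64 string delivered via UnitySendMessage and returns the object.
    public static Person Decode(string base64Message)
    {
        byte[] data = Convert.FromBase64String(base64Message);
        using (var ms = new MemoryStream(data))
        {
            return Serializer.Deserialize<Person>(ms);
        }
    }
}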
I converted some code from a C++ application I wrote a long time ago to C#. In C++ I had a library I used that was a bit buffer, but my lack of C# knowledge has somewhat complicated the conversion.
When I query my application and simply use a BinaryWriter without casting any values properly (just bf.Write(-1) and bf.Write("stringhere")), the query programs at least query it, they just get the wrong information. When I cast the values properly (to long, byte, short, etc.) it completely breaks, and the query application doesn't even see it anymore.
C++ Code Snippet
void PlayerManager::BuildReplyInfo()
{
    // Delete the old packet
    g_ReplyInfo.Reset();

    g_ReplyInfo.WriteLong(-1);
    g_ReplyInfo.WriteByte(73);
    g_ReplyInfo.WriteByte(g_ProtocolVersion.GetInt());
    g_ReplyInfo.WriteString(iserver->GetName());
    g_ReplyInfo.WriteString(iserver->GetMapName());
    g_ReplyInfo.WriteString(gGameType);
}
C# Code
public static byte[] ConvertStringToByteArray(string str)
{
    System.Text.UTF8Encoding encoding = new System.Text.UTF8Encoding();
    return encoding.GetBytes(str);
}

//-----------------------------------

while (true)
{
    data = new byte[1024];
    recv = socket.ReceiveFrom(data, ref Remote);
    Console.WriteLine("Message length is " + recv);
    // If the length is 25 and the 5th byte is 'T' it is an A2S_INFO query
    if (recv == 25 && data[4] == 84)
    {
        Console.WriteLine("Source Engine Query!");
        data = BuildReplyInformation();
        socket.SendTo(data, 0, data.Length, SocketFlags.None, Remote);
    }
}
}

public static byte[] BuildReplyInformation()
{
    MemoryStream stream = new MemoryStream();
    BinaryWriter writer = new BinaryWriter(stream);

    writer.Write((long)(-1));
    writer.Write((byte)(73));  // Steam Version
    writer.Write((byte)(15));  // Protocol
    writer.Write(ConvertStringToByteArray("Minecraft Server\0")); // Hostname
    writer.Write(ConvertStringToByteArray("Map Name\0"));         // Map Name
    writer.Write(ConvertStringToByteArray("tf\0"));               // Game Directory
    writer.Write(ConvertStringToByteArray("Minecraft Server\0")); // Game Description
    writer.Write((short)(440));
    writer.Write((byte)(15));  // Players
    writer.Write((byte)(32));  // Max Players
    writer.Write((byte)(0));   // Bots
    writer.Write((byte)(100));
    writer.Write((byte)(119)); // 108 Linux, 119 Windows
    writer.Write((byte)(0));   // Password Boolean
    writer.Write((byte)(01));  // Vac Secured
    writer.Write(ConvertStringToByteArray("1.1.3.7\0"));
    return stream.ToArray();
}
A couple of ideas that might get you on track:
Are you sure you need UTF8 as string encoding?
When you look at the array and compare it to the intended structure, are you able to find out at what point the array does not comply to the standard?
Just a few things to keep in mind:
UTF-8 strings sometimes start with a BOM (byte order mark), sometimes not.
Strings sometimes are serialized length prefixed, sometimes null-terminated.
My suggestion is to double-check the original C++ method WriteString(...) to find out how it behaves with respect to #1 and #2, and then to double-check the C# method GetBytes(...) for the same. If I recall, the .NET binary serializer writes length-prefixed strings for each string written, but the UTF8 encoder does not (and does not output a null character either). The UTF8 encoder may also (depending on how you use it?) output a BOM.
Also, I'm suspicious of how \0 might be written out when passing through the UTF8 encoder. You might (for kicks) try outputting the null marker separately from the string content, as just a 0-valued byte.
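As a concrete illustration of the length-prefix versus null-terminator point, here is a minimal sketch comparing BinaryWriter.Write(string) with writing raw UTF-8 bytes plus an explicit terminating zero byte (which is what the manual "\0" in the code above is aiming for):
using System;
using System.IO;
using System.Text;

class StringWritingDemo
{
    static void Main()
    {
        using (var ms = new MemoryStream())
        using (var writer = new BinaryWriter(ms))
        {
            // BinaryWriter.Write(string) emits a 7-bit-encoded length prefix
            // followed by the UTF-8 bytes, and no trailing null.
            writer.Write("tf");

            // Writing the raw UTF-8 bytes plus an explicit 0 byte gives a
            // null-terminated string with no length prefix.
            writer.Write(Encoding.UTF8.GetBytes("tf"));
            writer.Write((byte)0);

            writer.Flush();
            Console.WriteLine(BitConverter.ToString(ms.ToArray()));
            // Prints: 02-74-66-74-66-00
        }
    }
}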
The size of long was the issue: long in C# is 64 bits, while the long written by the C++ code was 32 bits; fixing that resolved the problem.
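A minimal sketch of the difference (the int cast here is the assumed fix, matching a 32-bit C++ long):
using System;
using System.IO;

class LongSizeDemo
{
    static void Main()
    {
        using (var ms = new MemoryStream())
        using (var writer = new BinaryWriter(ms))
        {
            writer.Write((long)(-1)); // 8 bytes in C#: FF FF FF FF FF FF FF FF
            writer.Write((int)(-1));  // 4 bytes, matching a 32-bit C++ long: FF FF FF FF
            writer.Flush();
            Console.WriteLine(ms.Length); // 12
            Console.WriteLine(BitConverter.ToString(ms.ToArray()));
        }
    }
}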