When sending information from a java application to a C# application through sockets, is the byte order different? Or can I just send an integer from C# to a java application and read it as integer?
(And do the OS matter, or is the same for java/.net no matter how the actual OS handles it?)
It all comes down to how you encode the data. If you are treating it only as a raw sequence of bytes, there is no conflict; the sequence is the same. Endianness only matters when you interpret chunks of the data as (for example) integers.
Any serializer written with portability in mind will have defined endianness - for example, in protocol buffers (available for both Java and C#) little-endian is always used regardless of your local hardware.
If you are doing manual writing to the stream, using things like shift-based encoding (rather than direct memory copying) will give you defined endianness.
If you use pre-canned platform serializers, you are at the mercy of the implementation. It might be endian-safe, it might not be (i.e. it might depend on the platform at both ends). For example, the .NET BitConverter class is not safe - it is usually assumed (incorrectly) to be little-endian, but on some platforms (and particularly in Mono on some hardware) it could be big-endian; hence the .IsLittleEndian property.
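For example, here is a minimal C# sketch (hypothetical helper names) of the shift-based approach: because the shifts operate on the value rather than on its in-memory representation, the output is big-endian/network order no matter what BitConverter.IsLittleEndian reports on the local machine.

using System.IO;

static class EndianHelpers // hypothetical helper, for illustration only
{
    // Writes an int in big-endian (network) order on any platform.
    public static void WriteInt32BigEndian(Stream stream, int value)
    {
        stream.WriteByte((byte)(value >> 24));
        stream.WriteByte((byte)(value >> 16));
        stream.WriteByte((byte)(value >> 8));
        stream.WriteByte((byte)value);
    }

    // Reads it back; note there is no BitConverter and no dependency on IsLittleEndian.
    public static int ReadInt32BigEndian(Stream stream)
    {
        return (stream.ReadByte() << 24) | (stream.ReadByte() << 16)
             | (stream.ReadByte() << 8) | stream.ReadByte();
    }
}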
My advice would be to use a serializer that handles it all for you ;p
In Java, you can use a DataInputStream or DataOutputStream which read and write the high-byte first, as documented:
http://download.oracle.com/javase/6/docs/api/java/io/DataOutputStream.html#writeInt%28int%29
You should check corresponding C# documentation to see what it does (or maybe someone here can tell you).
You also have, in Java, the option of using ByteBuffer:
http://download.oracle.com/javase/6/docs/api/java/nio/ByteBuffer.html
... which has the "order" method to allow you to specify a byte order for operations reading multi-byte primitive types.
Java uses big-endian in some libraries, such as DataInputStream/DataOutputStream. The IP protocols all use big-endian, which leads people to use big-endian as the default for network protocols ("network order").
However, NIO's ByteBuffer allows you to specify big-endian, little-endian, or native endianness (whatever the system uses by default).
x86 systems tend to use little endian and so many Microsoft/Linux applications use little endian by default but can support big-endian.
Yes, the byte order may be different. C# (for example, BitConverter) uses the platform's byte ordering, which is little-endian on virtually every platform you will encounter, while Java's standard stream classes use big-endian. This has been discussed before on SO. See for example C# little endian or big endian?
Related
I'm working on a Client app, written in C#, which should communicate with a legacy app (let's call it a server). The problem is that the server's API is represented as a bunch of plain old C structs. Every struct has a 4-byte header and the data that follows it. It's just a byte stream.
I understand that this particular binary format is unique (it is dictated by the legacy server app). Because of that, it's not possible to use SerDes libraries like Protocol Buffers, which use their own way of encoding binary data.
Is there any project/library for binary serialization that allows me to specify the type of message (like protobuf does) and its binary format? Every library I've seen was based either on JSON, XML, or a proprietary binary format.
Suppose I would decide to write my own SerDes library (in C#). What would be the best/recommended strategy for doing this? I want to do it the professional way, at least once in my life. Thanks!
PS: We're talking about little-endian only.
This is how the server defines a message:
struct Message1
{
byte Size; // Header 1st byte
byte Type;
byte ReqI;
byte Zero; //Header 4th byte.
word UDPPort; // Actual data starts here.
word Flags;
byte Sp0;
byte Prefix;
word Interval;
char Admin[16];
char IName[16];
};
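For concreteness, here is a minimal C# sketch of decoding this layout with BinaryReader (which always reads little-endian, matching the PS above). It assumes word is a 16-bit unsigned value and that the char arrays are NUL-padded ASCII; the remaining fields of the real API would follow the same pattern.

using System.IO;
using System.Text;

// Sketch only: mirrors the C struct above field by field.
class Message1
{
    public byte Size, Type, ReqI, Zero;   // 4-byte header
    public ushort UDPPort, Flags;         // 'word' assumed to be 16-bit unsigned
    public byte Sp0, Prefix;
    public ushort Interval;
    public string Admin, IName;

    public static Message1 Read(BinaryReader r) // BinaryReader is little-endian on all platforms
    {
        return new Message1
        {
            Size = r.ReadByte(), Type = r.ReadByte(), ReqI = r.ReadByte(), Zero = r.ReadByte(),
            UDPPort = r.ReadUInt16(), Flags = r.ReadUInt16(),
            Sp0 = r.ReadByte(), Prefix = r.ReadByte(),
            Interval = r.ReadUInt16(),
            // Fixed-size char arrays: decode and trim the trailing NUL padding.
            Admin = Encoding.ASCII.GetString(r.ReadBytes(16)).TrimEnd('\0'),
            IName = Encoding.ASCII.GetString(r.ReadBytes(16)).TrimEnd('\0')
        };
    }
}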
It sounds like you have fixed-size C structs being sent via a socket connection and you need to interpret those into handy C# classes.
The easiest way may be to do all the message handling in code written in managed C++. In that you'd have a structure that, possibly with a bunch of pragmas, could be made to have the same memory layout as the structure being sent through the socket. You would then also define a similar managed C++ class (e.g. containing managed strings instead of char arrays). You would also write code that converts the struct, field by field, into one of these managed classes. Wrap the whole thing up in a DLL, and include it in your C# project as a dependency.
The reason for this is that managed C++, weird though it is as a language, is a far easier bridge between unmanaged and managed code and data structures. There's no need to marshal anything; this is done for you. I've used this route to create libraries that make calls into Windows' hardware discovery facilities, for which there isn't (or wasn't) any pre-existing C# library. Using managed C++ code to call the necessary Win32 functions was far easier than doing the same thing from C#.
Good luck!
CRC32 is calculated as a uint32, while HashAlgorithm in .NET by convention returns byte[]. I can, of course, easily convert it with bytes = BitConverter.GetBytes(hash), but this is affected by the "endianness" of the system (almost no chance of big-endian in practice, of course).
Anyway, I've been thinking is there any convention to follow? I have a feeling that it should be big-endian to make hash.ToString("X") and bytes.ToHex() (assuming .ToHex() exists) look the same.
I've checked https://github.com/vurdalakov/crc32/wiki/CRC32 and it does not do that. Any thoughts?
I can only cite examples, where the zip and gzip file formats store the CRC in little-endian order. I'm sure someone can find examples where a 32-bit CRC is stored in big-endian order, sometimes called "network order" as big-endian is intended to be a convention for network communications.
If you are defining your own protocol, then you can pick whichever you like. For the code to be portable, you would need to use shift operators to pick apart the bytes so that the endianness of the machine does not affect the result.
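For instance, a small C# sketch that extracts the CRC bytes with shifts in big-endian order, so the resulting byte[] renders as the same hex digits as crc.ToString("X8") on any machine:

static class Crc32Bytes // illustrative helper
{
    // Most significant byte first, independent of host endianness.
    public static byte[] ToBigEndianBytes(uint crc)
    {
        return new byte[]
        {
            (byte)(crc >> 24),
            (byte)(crc >> 16),
            (byte)(crc >> 8),
            (byte)crc
        };
    }
}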
We want to pass a forest - a dictionary with values which can be: dictionaries, arrays, sets, numbers, strings, byte buffers - between Objective C and C# efficiently (time-wise, space is a lesser concern). Google's Protocol Buffers looked good, but they seem to handle only structured data, while ours is arbitrary. Ultimately we can write a binary (de)serialiser ourselves, but surely this was done before and released as FOSS somewhere?
Have you considered using ASN.1? Since ASN.1 is independent of programming language and system architecture, it can be used efficiently regardless of whether you need C, C#, C++, or Java.
You create a description of the information you wish to exchange, and then use an ASN.1 tool to generate an encoder/decoder for your target programming language. ASN.1 also supports several different rules for transmitting the data, which range from the efficient PER (Packed Encoding Rules) to the verbose, but flexible, XER (XML Encoding Rules).
To see whether this might work for you, try the free online ASN.1 compiler and encoder/decoder at http://asn1-playground.oss.com.
I am working on a system that has components written in the following languages:
C
C++
C#
PHP
Python
These components all use (infrequently changing) data that comes from the same source and can be cached in and accessed from memcache for performance reasons.
Because different data types may be stored differently by the different languages' memcache APIs, I am wondering if it would be better to store ALL data as strings (objects would be stored as JSON strings).
However, this in itself may pose problems, as strings will (almost surely) have different internal representations across the different languages, so I'm wondering how wise that decision is.
As an aside, I am using the 1 writer, multiple readers 'pattern' so concurrency is not an issue.
Can anyone (preferably with ACTUAL experience of doing something similar) advise on the best format/way to store data in memcache so that it may be consumed by different programming languages?
memcached, I think, primarily understands only byte[], and the representation of a byte is the same in all languages. You can serialize your objects using protocol buffers or a similar library and consume them in any other language. I've done this in my projects.
Regardless of the back-end chosen, (memcached, mongodb, redis, mysql, carrier pigeon) the most speed-efficient way to store data in it would be a simple block of data (so the back-end has no knowledge of it.) Whether that's string, byte[], BLOB, is really all the same.
Each language will need an agreed mechanism to convert objects to a storable data format and back. You:
Shouldn't build your own mechanism, that's just reinventing the wheel.
Should think about whether 'invalid' objects might end up in the back-end. (either because of a bug in a writer, or because objects from a previous revision are still present)
When it comes to choosing a format, I'd recommend two: JSON or Protocol Buffers. This is because their encoded size and encode/decode speed is among the smallest/fastest of all the available encodings.
Comparison
JSON:
Libraries available for dozens of languages, sometimes part of the standard library.
Very simple format - Human-readable when stored, human-writable!
No coordination required between different systems, just agreement on object structure.
No set-up needed in many languages, eg PHP: $data = json_encode($object); $object = json_decode($data);
No inherent schema, so readers need to validate decoded messages manually.
Takes more space than Protocol Buffers.
Protocol Buffers:
Generating tools provided for several languages.
Minimal size - difficult to beat.
Defined schema (externally) through .proto files.
Auto-generated interface objects for encoding/decoding, eg C++: person.SerializeToOstream(&output);
Support for differing versions of object schemas to add new optional members, so that existing objects aren't necessarily invalidated.
Not human-readable or writable, so possibly harder to debug.
Defined schema introduces some configuration management overhead.
Unicode
When it comes to Unicode support, both handle it without issues:
JSON: Will typically escape non-ASCII characters inside strings as \uXXXX, so no compatibility problem there. Depending on the library, it may also be possible to force UTF-8 encoding.
Protocol Buffers: Seem to use UTF-8, though I haven't found info in Google's documentation in 3-foot-high letters to that effect.
Summary
Which one you go with will depend on how exactly your system will behave, how often changes to the data structure occur, and how all the above points will affect you.
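To make the JSON side concrete for the C# component, here is the counterpart of the PHP one-liner above. It assumes the System.Text.Json serializer (Newtonsoft.Json is very similar), and CacheEntry is just a hypothetical payload type:

using System;
using System.Text.Json;

record CacheEntry(string Name, int Count);   // hypothetical payload type

class JsonRoundTrip
{
    static void Main()
    {
        string data = JsonSerializer.Serialize(new CacheEntry("widgets", 42)); // encode
        CacheEntry obj = JsonSerializer.Deserialize<CacheEntry>(data);         // decode
        Console.WriteLine(data); // {"Name":"widgets","Count":42}
    }
}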
Not going to lie, you could do it in Redis. Redis is a key-value database written for high performance, and it allows the transfer of data between languages using a number of different client libraries (see the client libraries list on the Redis site). Here is an example in Java and Python.
Edit 1: Code is untested. If you spot an error please let me know :)
Edit 2: I know I didn't use the preferred Redis client for Java, but the point still stands.
Python
import redis
r = redis.Redis()
r.set('test','123')
Java
import org.jredis.RedisException;
import org.jredis.ri.alphazero.JRedisClient;
import static org.jredis.ri.alphazero.support.DefaultCodec.*;
class ExampleCode {
    // Reads back the value stored by the Python snippet above.
    private static final JRedisClient client = new JRedisClient();

    public static void main(String[] args) throws RedisException {
        System.out.println(toStr(client.get("test")));
    }
}
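And, since the question also lists C#, roughly the same round trip, assuming the StackExchange.Redis client:

using System;
using StackExchange.Redis;

class ExampleCodeCs
{
    static void Main()
    {
        var redis = ConnectionMultiplexer.Connect("localhost"); // default port 6379
        IDatabase db = redis.GetDatabase();
        db.StringSet("test", "123");
        Console.WriteLine(db.StringGet("test")); // prints 123
    }
}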
We are beginning to roll out more and more WAN deployments of our product (.NET fat client with an IIS hosted Remoting backend). Because of this we are trying to reduce the size of the data on the wire.
We have overridden the default serialization by implementing ISerializable (similar to this), and we are seeing anywhere from 12% to 50% gains. Most of our efforts focus on optimizing arrays of primitive types. Is there a fancy way of serializing primitive types, beyond the obvious?
For example, today we serialize an array of ints as follows:
[4-bytes (array length)][4-bytes][4-bytes]
Can anyone do significantly better?
The most obvious example of a significant improvement, for boolean arrays, is putting 8 bools in each byte, which we already do.
Note: Saving 7 bits per bool may seem like a waste of time, but when you are dealing with large magnitudes of data (which we are), it adds up very fast.
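For reference, the 8-bools-per-byte packing described above is just a shift-and-OR loop; a minimal sketch:

static class BoolPacker // illustrative helper
{
    // Packs a bool[] into a byte[] at 8 bools per byte; the last byte is padded with zero bits.
    public static byte[] Pack(bool[] values)
    {
        var bytes = new byte[(values.Length + 7) / 8];
        for (int i = 0; i < values.Length; i++)
            if (values[i])
                bytes[i / 8] |= (byte)(1 << (i % 8));
        return bytes;
    }
}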
Note: We want to avoid general compression algorithms because of the latency associated with them. Remoting only supports buffered requests/responses (no chunked encoding). I realize there is a fine line between compression and optimal serialization, but our tests indicate we can afford very specific serialization optimizations at very little cost in latency, whereas reprocessing the entire buffered response into a new compressed buffer is too expensive.
(relates to messages/classes, not just primitives)
Google designed "protocol buffers" for this type of scenario (they shift a huge amount of data around) - their format is compact (using things like base-128 encoding) but extensible and version tolerant (so clients and servers can upgrade easily).
In the .NET world, I can recommend 2 protocol buffers implementations:
protobuf-net (by me)
dotnet-protobufs (by Jon Skeet)
For info, protobuf-net has direct support for ISerializable and remoting (it is part of the unit tests). There are performance/size metrics here.
And best of all, all you do is add a few attributes to your classes.
Caveat: it doesn't claim to be the theoretical best - but pragmatic and easy to get right - a compromise between performance, portability and simplicity.
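For example, a minimal protobuf-net sketch (Order is a hypothetical class; the attributes are the only decoration needed):

using ProtoBuf;                       // protobuf-net

[ProtoContract]
class Order                           // hypothetical example class
{
    [ProtoMember(1)] public int Id { get; set; }
    [ProtoMember(2)] public string Customer { get; set; }
    [ProtoMember(3)] public int[] Quantities { get; set; }
}

// Usage: Serializer.Serialize(stream, order) writes the compact wire format,
// and Serializer.Deserialize<Order>(stream) reads it back.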
Check out the base-128 varint type used in Google's protocol buffers; that might be what you're looking for.
(There are a number of .NET implementations of protocol buffers available if you search the web which, depending on their license, you might be able to grovel some code from!)
Yes, there is a fancy way of serialising primitive types. As a bonus it is also much faster (typically 20-40 times).
Simon Hewitt's open source library (see Optimizing Serialization in .NET - part 2) uses various tricks. For example, if it is known that an array contains small integers, then less goes to the serialised output. This is described in detail in part 1 of the article. For example:
...So, an Int32 that is less than 128 can be stored in a single byte (by using 7-bit encoding) ....
Full-size and size-optimised integers can be mixed and matched. This may seem obvious, but there are other things too; for example, special things happen for the integer value 0 - an optimisation to store the numeric type and a zero value together.
Part 1 states:
... If you've ever used .NET remoting for large amounts of data, you will have found that there are problems with scalability. For small amounts of data, it works well enough, but larger amounts take a lot of CPU and memory, generate massive amounts of data for transmission, and can fail with Out Of Memory exceptions. There is also a big problem with the time taken to actually perform the serialisation - large amounts of data can make it unfeasible for use in apps ....
I have used this library with great success in my application.
To make sure .NET serialisation is never used, put an ASSERT 0, Debug.WriteLine() or similar into the place in the library code where it falls back on .NET serialisation. That's at the end of the WriteObject() function in file FastSerializer.cs, near createBinaryFormatter().Serialize(BaseStream, value);.
If your arrays can be sorted, you can perform a simple RLE to save space. Even if they aren't sorted, RLE can still be beneficial. It is fast to implement for both writing and reading.
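A minimal sketch of such an RLE pass over an int array, emitting (value, run length) pairs:

using System.Collections.Generic;

static class RunLength // illustrative helper
{
    // Encodes consecutive repeats as (value, run length) pairs; most effective
    // when runs are common, e.g. after sorting the array.
    public static List<(int Value, int Count)> Encode(int[] data)
    {
        var runs = new List<(int Value, int Count)>();
        for (int i = 0; i < data.Length; )
        {
            int j = i;
            while (j < data.Length && data[j] == data[i]) j++;
            runs.Add((data[i], j - i));
            i = j;
        }
        return runs;
    }
}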
Here's a trick I used once for encoding arrays of integers:
Group array elements in groups of 4.
Precede each group with a byte (let's call it length mask) that indicates the length of the following 4 elements. The length mask is a bitmask composed of dibits which indicate how long the corresponding element is (00 - 1 byte, 01 - 2 bytes, 10 - 3 bytes, 11 - 4 bytes).
Write out the elements as short as you can.
For example, to represent unsigned integers 0x0000017B, 0x000000A9, 0xC247E8AD and 0x00032A64, you would write (assuming little-endian): B1, 7B, 01, A9, AD, E8, 47, C2, 64, 2A, 03.
It can save you up to 68.75% (11/16) of space in the best case. In the worst case, you would actually waste additional 6.25% (1/16).
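Here is a sketch of the encoder for one group of four values, following the layout in the example above (element i's dibit occupies bits 2i and 2i+1 of the mask, and each value is written low byte first):

using System.IO;

static class GroupEncoder // illustrative helper
{
    // Writes four unsigned ints preceded by a length-mask byte; each dibit
    // holds (byte length - 1) for the corresponding element.
    public static void WriteGroup(Stream output, uint[] group) // group.Length == 4
    {
        int mask = 0;
        var lengths = new int[4];
        for (int i = 0; i < 4; i++)
        {
            uint v = group[i];
            lengths[i] = v <= 0xFF ? 1 : v <= 0xFFFF ? 2 : v <= 0xFFFFFF ? 3 : 4;
            mask |= (lengths[i] - 1) << (2 * i);
        }
        output.WriteByte((byte)mask);                          // 0xB1 for the example above
        for (int i = 0; i < 4; i++)
            for (int b = 0; b < lengths[i]; b++)
                output.WriteByte((byte)(group[i] >> (8 * b))); // low byte first (little-endian)
    }
}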
If you know which int values are more common, you can encode those values in fewer bits (and encode the less-common values using correspondingly more bits): this is called "Huffman" coding/encoding/compression.
In general though I'd suggest that one of the easiest things you could do would be to run a standard 'zip' or 'compression' utility over your data.
For integers, if you usually have small numbers (under 127 or 32768), you can encode the number using the MSB as a flag to indicate whether it is the last byte or not. It is a little similar to UTF-8, but here the flag bit is actually wasted (which is not the case with UTF-8).
Example (big-endian):
125, which is usually encoded as 00 00 00 7D, could be encoded as 7D
270, which is usually encoded as 00 00 01 0E, could be encoded as 82 0E
The main limitation is that the effective range of a 32 bit value is reduced to 28 bits. But for small values you will usually gain a lot.
This method is actually used in old formats such as MIDI because old electronics needed very efficient and simple encoding techniques.
If you want to control the serialization format yourself, with just library help for compact integer storage, derive a class from BinaryWriter that uses Write7BitEncodedInt. Do likewise for BinaryReader.Read7BitEncodedInt.
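A minimal sketch of that approach (Write7BitEncodedInt and Read7BitEncodedInt are protected on older framework versions, hence the thin subclasses):

using System.IO;

// Exposes the framework's 7-bit (varint-style) integer encoding: values under
// 128 take one byte, larger values take up to five bytes.
class CompactWriter : BinaryWriter
{
    public CompactWriter(Stream output) : base(output) { }
    public void WriteCompact(int value) => Write7BitEncodedInt(value);
}

class CompactReader : BinaryReader
{
    public CompactReader(Stream input) : base(input) { }
    public int ReadCompact() => Read7BitEncodedInt();
}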
Before implementing ISerializable yourself, you were probably using XmlSerializer or the SOAP formatter in a web service. Given that all your fat clients are running .NET, you could try using the BinaryFormatter.