Binary encoding for low bandwidth connections? - c#

In my application I have a simple XML formatted file containing structured data. Each data entry has a data type and a value. Something like
<entry>
  <field type="integer">5265</field>
  <field type="float">34.23</field>
  <field type="string">Jorge</field>
</entry>
Now, this formatting allows us to keep the data in a human-readable form, so we can check the various values, and it makes transforming and reading the file easy for interoperability.
The problem is that we have a very low bandwidth connection (about 1000 bps; yes, that's bits per second), so XML is not exactly the best format in which to transmit the data. I'm looking for ways to encode the XML file into a binary equivalent that is more suitable for transmission.
Do you know of any good tutorial on the matter?
Additionally, we compress the data before sending it (simple GZIP), so I'm a little concerned about losing compression ratio if I go binary. Would the size (after compression) be affected so badly that it would be a bad idea to try to optimize it in the first place?
Note: This is not premature optimization; it's a requirement. 1000 bps is a really low bandwidth, so every byte counts.
Note 2: The application is written in C#, but any tutorial will do.

Try using ASN.1. The Packed Encoding Rules should yield a pretty decently compressed form on their own, and the XML Encoding Rules should yield something equivalent to your existing XML.
Also, consider using 7zip instead of gzip.

You may want to investigate Google Protocol Buffers. They produce far smaller payloads than XML, though not necessarily the smallest payloads possible; whether they're acceptable for your use depends on a lot of factors. They're certainly easier than devising your own scheme from scratch, though.
They've been ported to C#/.NET and seem to work quite well there in my (thus far, somewhat limited) experience. There's a package at that link to integrate somewhat with VS and automatically create C# classes from the .proto files, which is very nice.

I'd dump the XML, for transmission anyway; you could deconstruct it at the sender and reconstruct it at the receiver (in Java you could use a custom Input/OutputStream to do the work neatly). Go binary with fixed fields: data type, length, data.
Say if you have 8 or fewer datatypes, encode that in three bits. Then the length, e.g., as an 8-bit value (0..255).
Then for each datatype, encode differently.
Integer/Float: BCD - 4 bits per digit, use 15 as decimal point. Or just the raw bits themselves (might want different datatypes for 8-bit int, 16-bit int, 32-bit int, 64-bit long, 32-bit float, 64-bit double).
String - can you get away with 7-bit ASCII instead of 8? Etc. All upper-case letters + digits and some punctuation could get you down to 6-bits per character.
You might want to prefix it all with the total number of fields to transmit. And perform a CRC or 8/10 encoding if the transport is lossy, but hopefully that's already handled by the system.
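A minimal sketch of such a field writer, byte-aligned for simplicity rather than packing the type into 3 bits as described above (the type codes are made up for illustration):

static class FieldWriter
{
    // Illustrative type codes; pick a stable numbering of your own.
    const byte TypeInt32 = 0x01, TypeFloat = 0x02, TypeString = 0x03;

    public static void WriteInt32(BinaryWriter w, int value)
    {
        w.Write(TypeInt32);
        w.Write((byte)4);  // length
        w.Write(value);    // 4 bytes, little-endian
    }

    public static void WriteFloat(BinaryWriter w, float value)
    {
        w.Write(TypeFloat);
        w.Write((byte)4);
        w.Write(value);
    }

    public static void WriteString(BinaryWriter w, string value)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(value); // 7-bit ASCII assumed
        w.Write(TypeString);
        w.Write((byte)bytes.Length); // length 0..255
        w.Write(bytes);
    }
}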
However don't underestimate how well XML text can be compressed. I would certainly do some calculations to check how much compression is being achieved.

Anything which is efficient at converting the plaintext form to binary is likely to make the compression ratio much worse, yes.
However, it could well be that an XML-optimised binary format will be better than the compressed text anyway. Have a look at the various XML Binary formats listed on the Wikipedia page. I have a bit of experience with WBXML, but that's all.
As JeeBee says, a custom binary format is likely to be the most efficient approach, to be honest. You can still try gzipping it, but the results of that will depend on what the data is like in the first place.
And yes, as Skirwan says, Protocol Buffers are a fairly obvious candidate here - but you may want to think about custom floating-point representations, depending on what your actual requirements are. If you only need 4 significant figures (and you know the scale) then sending a two-byte integer may well be the best bet.
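For example, a sketch of that fixed-point trick, assuming two decimal places and a known range of 0..655.35 (both assumptions):

static ushort EncodeFixed(float value) => (ushort)Math.Round(value * 100f); // 2 bytes on the wire
static float DecodeFixed(ushort encoded) => encoded / 100f;
// EncodeFixed(34.23f) == 3423; DecodeFixed(3423) == 34.23f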

The first thing to try is gzip; beyond that, I would try protobuf-net - I can think of a few ways of encoding that quite easily, but it depends on how you are building the XML, and whether you mind a bit of code to shim between the two formats. In particular, I can imagine representing the different data types as either 3 optional fields on the same type, or 3 different subclasses of an abstract contract:
[ProtoContract]
class EntryItem {
    [ProtoMember(1)]
    public int? Int32Value { get; set; }
    [ProtoMember(2)]
    public float? SingleValue { get; set; }
    [ProtoMember(3)]
    public string StringValue { get; set; }
}
[ProtoContract]
class Entry {
    [ProtoMember(1)]
    public List<EntryItem> Items { get; set; }
}
With test:
[TestFixture]
public class TestEntries {
    [Test]
    public void ShowSize() {
        Entry e = new Entry {
            Items = new List<EntryItem> {
                new EntryItem { Int32Value = 5265 },
                new EntryItem { SingleValue = 34.23F },
                new EntryItem { StringValue = "Jorge" }
            }
        };
        var ms = new MemoryStream();
        Serializer.Serialize(ms, e);
        Console.WriteLine(ms.Length);
        Console.WriteLine(BitConverter.ToString(ms.ToArray()));
    }
}
Results (21 bytes)
0A-03-08-91-29-0A-05-15-85-EB-08-42-0A-07-1A-05-4A-6F-72-67-65

I would look into configuring your app to be responsive to smaller XML fragments; in particular ones which are small enough to fit in a single network packet.
Then arrange your data to be transmitted in order of importance to the user so that they can see useful stuff and maybe even start working on it before all the data arrives.

Late response -- at least it comes before year's end ;-)
You mentioned Fast Infoset. Did you try it? It should give you the best results in terms of both compactness and performance. Add GZIP compression and the final size will be really small, and you will have avoided the processing penalties of compressing XML. WCF-Xtensions offers a Fast Infoset message encoding and GZIP/DEFLATE/LZMA/PPMs compression too (works on .NET/CF/SL/Azure).

Here's the pickle you're in, though: you're compressing things with GZip. GZip is horrible on plain text until you hit roughly the length of the total concatenated works of Dickens, or about 1200 lines of code, because of the overhead of the dictionary and the other structures GZip uses for compression.
1 Kbps is fine for the task of 7500 chars (it'll take about a minute given optimal conditions; for <300 chars, you should be fine!). However, if you're really that concerned, you're going to want to compress this down for brevity. Here's how I do things at this scale:
T[ype]L[ength][data data data]+
That is, T represents the TYPE: say 0x01 for INT, 0x02 for STRING, etc. LENGTH is a single byte, so 0xFF = 255 bytes long, etc. An example data packet would look like:
0x01 0x01 0x3F 0x01 0x01 0x2D 0x02 0x06 H E L L O 0x00
This says I have an INT of length 1 with value 0x3F, an INT of length 1 with value 0x2D, then a STRING of length 6 holding the null-terminated "HELLO" (ASCII assumed). Learn the wonders that are System.Text.Encoding.UTF8.GetBytes, BitConverter, and ByteConverter.
For reference, see this page to get a sense of just how much 1 Kbps is. Really, for the size you're dealing with, you should be fine.
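A minimal sketch of reading that packet format back (the type codes and trailing NUL come from the example above; everything else is an assumption):

static void ReadPacket(BinaryReader r)
{
    while (r.BaseStream.Position < r.BaseStream.Length)
    {
        byte type = r.ReadByte();
        byte length = r.ReadByte();
        byte[] data = r.ReadBytes(length);
        switch (type)
        {
            case 0x01: // INT; length 1 in the example, 4 for a full Int32
                Console.WriteLine(data.Length == 4 ? BitConverter.ToInt32(data, 0) : data[0]);
                break;
            case 0x02: // STRING, null-terminated ASCII
                Console.WriteLine(Encoding.ASCII.GetString(data).TrimEnd('\0'));
                break;
        }
    }
}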

Related

Implementing DbDataReader.GetChars() efficiently when underlying data is not UTF-16

I need to implement DbDataReader.GetChars() for an ADO.NET provider, with the caveat that the data in the cell may not be UTF-16, in fact may be any one of a number of different encodings.
The implementation is specifically for 'long data', and the source data is on the server. The interface I have to the server (which cannot realistically be changed) is to request a range of bytes for the cell. The server does not interpret these bytes in any way, it is simply binary data for it.
I can special-case UTF-16LE and UTF-16BE with obvious implementations, but for other encodings, there is no direct way to translate the request "get me UTF-16 codeunits X to X + Y" into the request "get me bytes X' to X' + Y' in encoding Z".
Some 'requirements' that eliminate obvious implementations:
I do not wish to retrieve all the data for a given cell to the client at any one time unless it is necessary. The cells may be very large, and an application asking for a few kilobytes shouldn't cause hundreds of megabytes of memory to be allocated to satisfy the request.
I wish to support the random-access exposed by GetChars() relatively efficiently. In the case of the first request asking for codeunits 1 billion to 1 billion + 10, I don't see any way of avoiding retrieving all data in the cell from the server up until the requested codepoints, but subsequently asking for codeunits 1 billion + 10 to 1 billion + 20, or even codepoints 999 million 999 thousand to 1 billion should not imply re-retrieving all that data.
I'm guessing that the great majority of applications won't actually access long-data cells 'randomly', but it would be nice to avoid horrible performance if one did, so if I can't find a relatively easy way to support it, I suppose I'll have to give it up.
My idea was to keep a mapping of #{UTF-16 code units} -> #{bytes of data in server encoding}, updating it as I retrieved data from the cell, and using it to find a 'close' place to start requesting data from the server (rather than retrieving from the beginning every time). On a side note, the lack of something similar to C++'s std::map::lower_bound in the .NET framework frustrates me quite a bit. Unfortunately, I found it very difficult to generate this mapping!
I've been trying to use the Decoder class, specifically Decoder.Convert(), to convert the data piecemeal, but I can't figure out how to reliably tell that a given number of bytes of the source data maps to exactly X UTF-16 code units, as the 'bytesUsed' parameter seems to include source bytes which were just stashed into the object's internal state rather than output as chars. This causes problems when I try to decode starting from, or ending in, the middle of a partial code point, giving me garbage.
So, my question is, is there some 'trick' I can use to accomplish this (figuring out the exact mapping of #bytes to #codeunits, without resorting to something like converting in a loop, decreasing the size of the source byte-by-byte)?
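To make the failure mode concrete, here is a minimal sketch of the piecemeal conversion described above, recording byte-offset -> char-offset checkpoints as it goes (shift_jis and sourceStream are illustrative assumptions). The caveat is exactly the one stated: bytesUsed also counts bytes absorbed into the decoder's internal state, so a checkpoint is only safe to resume from if the matching decoder state is kept with it.

var decoder = Encoding.GetEncoding("shift_jis").GetDecoder();
var bytes = new byte[4096];
var chars = new char[4096]; // enough for typical encodings, where #chars <= #bytes
long byteOffset = 0, charOffset = 0;
var checkpoints = new List<(long Bytes, long Chars)>();

int read;
while ((read = sourceStream.Read(bytes, 0, bytes.Length)) > 0)
{
    int consumed = 0;
    while (consumed < read)
    {
        decoder.Convert(bytes, consumed, read - consumed,
                        chars, 0, chars.Length, false /* flush */,
                        out int bytesUsed, out int charsUsed, out bool completed);
        consumed += bytesUsed;  // may include bytes merely buffered inside the decoder
        byteOffset += bytesUsed;
        charOffset += charsUsed;
    }
    checkpoints.Add((byteOffset, charOffset));
}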
Do you know which encodings may be supplied by your server? I ask because some encodings are "stateful", which means that the meaning of a given byte may depend on the sequence of bytes that precede it. For instance (source), in the encoding standard ISO 2022-JP, the two bytes 0x24 0x2c may mean the Japanese Hiragana character 'GA' (が) or the two ASCII characters '$' and ',' according to the "shift state" -- the presence of a preceding control sequence. In several pre-Unicode "Shift-JIS" Japanese encodings, these shift states can appear anywhere in the string and will apply to all subsequent characters until a new shift control sequence is encountered. In the worst case, according to this site, "Often, character boundaries can be detected reliably only by reading the non-Unicode text linearly from the beginning".
Even the UTF-16 encoding used by C#, which is notionally stateless, is more complicated than is generally realized, due to the presence of surrogate pairs and combining characters. Surrogate pairs are pairs of chars that together specify a given character such as 𩸽; these are required because there are more than ushort.MaxValue Unicode code points. Combining characters are sequences of diacritical marks applied to preceding characters, such as in the string "Ĥ=T̂+V̂". And of course these can coexist, albeit unbeautifully: 𩸽̂ , which means that a single abstract UTF-16 "text element" can be made up of one or two "base" characters plus some number of diacriticals or other combining characters. All of these make up just one single character from the point of view of the user, and so should never be split or orphaned.
So the general algorithm would be, when you want to fetch N characters from the server starting at offset K, to fetch N+E starting at K-E for some "large enough" E, then scan backwards until the first text-element boundary is found. Sadly, for UTF-16, Microsoft doesn't provide an API to do this directly; one would need to reverse-engineer their method
internal static int GetCurrentTextElementLen(String str, int index, int len, ref UnicodeCategory ucCurrent, ref int currentCharCount)
In StringInfo.cs.
A bit of a nuisance, but doable.
For other, stateful, encodings, I would not know how to do this, and the logic of scanning backwards to find the first character boundary would be specific to each encoding. For encodings like those in the Shift-JIS family you may well need to scan back arbitrarily far.
Not really an answer but way too long for a comment.
Update
You might try your algorithm for all single-byte encodings. There are 95 such encodings on my computer:
var singleByteEncodings = Encoding.GetEncodings()
    .Where(enc => enc.GetEncoding().IsSingleByte)
    .ToList(); // 95 found.
var singleByteEncodingNames = Encoding.GetEncodings()
    .Where(enc => enc.GetEncoding().IsSingleByte)
    .Select(enc => enc.DisplayName)
    .ToList(); // 95 names displayed.
Encoding.GetEncoding("iso-8859-1").IsSingleByte // returns true.
This might be useful in practice because a lot of older databases only support single-byte character encodings, or do not have multibyte characters enabled for their tables. The default character encoding for a SQL Server database is iso_1, a.k.a. ISO 8859-1, for instance. But see this caution from a Microsoft blogger:
Use IsSingleByte() to try to figure out if an encoding is a single byte code page, however I'd really recommend that you don't make too many assumptions about encodings. Code that assumes a 1 to 1 relationship and then tries to seek or back up or something is likely to get confused, encodings aren't conducive to that kind of behavior. Fallbacks, decoders and encoders can change the byte count behavior for individual calls and encodings can sometimes do unexpected things.
I figured out how to deal with potentially losing conversion state: I keep a copy of the Decoder around in my mapping to use when restarting from the associated offset. This way I don't lose any partial codepoints it was keeping around in its internal buffers. This also lets me avoid adding encoding-specific code, and deals with potential problems with encodings such as Shift-JIS that dbc brought up.
Decoder is not cloneable, so I use serialization + deserialization to make the copy.
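A sketch of that copy, assuming the concrete Decoder type is marked [Serializable] (which held for the built-in .NET Framework decoders in play here; BinaryFormatter is obsolete on modern .NET):

static Decoder CloneDecoder(Decoder decoder)
{
    // The BinaryFormatter round-trip preserves the decoder's buffered partial code points.
    var formatter = new BinaryFormatter();
    using (var ms = new MemoryStream())
    {
        formatter.Serialize(ms, decoder);
        ms.Position = 0;
        return (Decoder)formatter.Deserialize(ms);
    }
}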

amount of memory allocated when double is converted to string

I have a data structure that I am passing from server to client using a data contract.
The data structure is:
class datatransfer
{
    double m_value1;
    double m_value2;
    double m_value3;
}
On the client I want to write it into a file.
One idea is to convert the values of the data transfer into a string using a StringBuilder,
then transfer the string to the client.
Or:
send the data structure and write the file using a StreamWriter.
Which is the best approach: converting to a string, or sending the data structure and writing it to the file?
The reason for the question: to avoid generating strings.
A double is 8 bytes in size. If I convert it to a string, how much memory will be allocated?
It depends entirely on how you format the String representation of the Double. There is no pre-defined sizing guideline.
For example:
var myDouble = 5.183498029834092834D;
var shortString = myDouble.ToString("#.00");       // "5.18"      -> 8 bytes
var longerString = myDouble.ToString("#.0000000"); // "5.1834980" -> 18 bytes
Note that the sizes are the result of a Char being 2 bytes on my system.
What you're really trying to do is called serialization. There's a number of ways to do this. The simplest of which may be to just decorate your class with the [Serializable] attribute:
[Serializable]
public class datatransfer {
    double m_value1;
    double m_value2;
    double m_value3;
}
You'll need to make your member variables public, or provide publicly accessible properties for setting the member variables. Otherwise they will not be serialized using this approach.
The size of a double converted to a string depends on the number itself and on the encoding of the string.
Just as an example, assuming an ANSI encoding, the number 1 will need 1 byte and the number 1.123 will need 5 bytes. Moreover, if you transmit the values as text, you need to account for extra bytes used as delimiters (for example, separating N values with spaces costs N - 1 extra bytes). You should always transfer data in binary if possible, though it may depend on the type of connection and the protocol you have to use.
As a general rule, binary data is smaller (and therefore faster to transfer), but it's not easy to debug and you may have problems with a firewall. Text data is larger and more verbose, you have to validate it on the client side (your structure, XML for example, may be corrupted), and it is slower to transfer. The big advantages are that it's easier to debug and that, whatever connection/protocol you have, you can usually transfer text (and don't forget you can transfer binary data encoded as text).
So this is not a definitive answer. What kind of data transfer method/framework/protocol are you using? A WCF service? .NET Remoting? A custom TCP/IP connection? If the data structure is not too big, you may find that binary serialization is a very good solution.
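For a sense of scale, a minimal sketch of the binary option: three doubles written with a BinaryWriter are always 24 bytes, regardless of their values.

using (var ms = new MemoryStream())
using (var writer = new BinaryWriter(ms))
{
    writer.Write(1.5);        // 8 bytes each
    writer.Write(-0.25);
    writer.Write(1234.5678);
    byte[] payload = ms.ToArray(); // 24 bytes total
}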

Manipulating very large binary strings in c#

I'm working on a genetic algorithm project where I encode my chromosome in a binary string, where each 32 bits represents a value. The problem is that the function I'm optimizing has over 3000 parameters, which means I have over 96000 bits in my bit string, and the manipulations I do on it are simply too slow...
I have proceeded as follows: I have a binary class in which I create a boolean array that represents my big binary string. Then I manipulate this binary string with various shifts, moves, and so on.
My question is: is there a better way to do this? The speed is just killing me. I'm sure it would be fine if I only did this on one bit string, but I have to do the manipulations on 25 bit strings for well over 1000 generations.
What I would do is take a step back. Your analysis seems to be wedded to an implementation detail, namely that you have chosen bool[] as how you represent a string of bits.
Clear your mind of bools and arrays and make a complete list of the operations you actually need to perform, how frequently they happen, and how fast they have to be. Ideally consider whether your speed requirement is average speed or worst case speed. (There are many data structures that attain high average speed by having one expensive operation for every thousand cheap operations; if having any expensive operations is unacceptable then you need to know that up front.)
Once you have that list, you can then do research on what data structures work well.
For example, suppose your list of operations is:
construct bit sequences on the order of 32 bits
concatenate on the order of 3000 bit sequences together to form new bit sequences
insert new bit sequences into existing long bit sequences at specific locations, quickly
Given just that list of operations, I'd think that the data structure you want is a catenable deque. Catenable deques support fast insertion on either end, and can be broken up into two deques efficiently. Inserting stuff into the middle of a deque is then easily done: break the deque up, insert the item into the end of one half, and join them back up again.
However, if you then add another operation to the problem, say, "search for a particular bit string anywhere in the 90000-bit sequence, and find the result in sublinear time" then just a catenable deque isn't going to do it. Searching a deque is slow. There are other data structures that support that operation.
If I understood correctly, you are encoding the bit string in a bool[]. The first obvious optimisation would be to change this to int[] (or even long[]) and take advantage of bitwise operations on a whole machine word where possible.
For example, this would make shifts more efficient by roughly a factor of 4.
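A minimal sketch of what that looks like, assuming the bit string lives in a ulong[] with word 0 holding the least-significant bits (the convention is an arbitrary choice; just keep it consistent across all operations):

static void ShiftLeftOne(ulong[] words)
{
    ulong carry = 0;
    for (int i = 0; i < words.Length; i++)
    {
        ulong overflow = words[i] >> 63;      // bit that spills into the next word
        words[i] = (words[i] << 1) | carry;
        carry = overflow;
    }
}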
Is the BitArray class no help?
A BitArray would probably be faster than a boolean array but you would still not get built-in support to shift 96000 bits.
GPUs are extremely good at massive bit operations. Maybe Brahma, CUDA.NET, or Accelerator could be of service?
Let me understand; currently, you're using a sequence of 32-bit values for a "chromosome". Are we talking about DNA chromosomes or neuroevolutionary algorithmic chromosomes?
If it's DNA, you deal with 4 values; A,C,G,T. That can be coded in 2 bits, making a byte able to hold 4 values. Your 3000-element chromosome sequence can be stored in a 750-element byte array; that's nothing, really.
Your two most expensive operations are converting to and from the compressed bitstream. I would recommend a byte-backed enum:
public enum DnaMarker : byte { A, C, G, T };
Then, you go from 4 of these to a byte with one operation:
public static byte ToByteCode(this DnaMarker[] markers)
{
    byte output = 0;
    for (int i = 0; i < 4; i++)
        output = (byte)((output << 2) | (byte)markers[i]); // first marker lands in the high bits
    return output;
}
... and parse them back out with something like this:
public static DnaMarker[] ToMarkers(this byte input)
{
    var result = new DnaMarker[4];
    for (int i = 0; i < 4; i++)
        result[i] = (DnaMarker)((input >> (2 * (3 - i))) & 0x3); // undo the shift order above
    return result;
}
You might see a slight performance increase using four parameters (out parameters if necessary) instead of allocating and using an array on the heap, but you lose the iteration that keeps the code compact.
Now, because you're packing markers into blocks of four per byte, if your sequence length isn't always an exact multiple of four, you'll end up "padding" the end of your block with zero values (A). Working around this is messy, but if you prefix the stream with a 32-bit integer giving the exact number of markers, you can simply discard anything extra you find in the byte stream.
From here, possibilities are endless; you can convert the enum array to a string by simply calling ToString() on each one, and likewise you can feed in a string and get an enum array by iterating through using Enum.Parse().
And always remember, unless memory is at a premium (usually it isn't), it's almost always faster to deal with the data in an easily-usable format instead of the most compact format. The one big exception is in network transmission; if you had to send 750 bytes vs 12KB over the Internet, there's an obvious advantage in the smaller size.

Compression of small string

I have 340 bytes of data in a string, mostly consisting of signs and numbers, like "føàA¹º#ƒUë5§Ž§".
I want to compress it into 250 bytes or less, to save it on my RFID card.
As this data is related to a fingerprint template, I want lossless compression.
So is there any algorithm I can implement in C# to compress it?
If the data is strictly numbers and signs, I highly recommend changing the numbers into int-based values, e.g.:
+12939272-23923+927392
can be compressed into three 32-bit integers, going from 22 bytes down to 12 bytes. Picking the right integer size (32-bit, 24-bit, or 16-bit) should help.
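A minimal sketch of that idea, assuming the numbers can be pulled out with a regular expression (names are illustrative; uses System.IO and System.Text.RegularExpressions):

static byte[] PackNumbers(string text)
{
    using (var ms = new MemoryStream())
    using (var writer = new BinaryWriter(ms))
    {
        // "+12939272-23923+927392" -> 12939272, -23923, 927392
        foreach (Match m in Regex.Matches(text, @"[+-]?\d+"))
            writer.Write(int.Parse(m.Value)); // 4 bytes per number
        return ms.ToArray();                  // 12 bytes instead of 22
    }
}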
If the integer sizes vary greatly, you could start with 8 bits and use the value 255 as an escape meaning that the next 8 bits become the more-significant bits of the integer, giving you roughly a 15-bit value in two bytes.
Alternatively, you could identify the most common character and assign 0 to it; the second most common character gets 10, and the third 110. This is a very crude compression, but if your data is very limited it might just do the job for you.
Is there any other information you know about your string? For instance does it contain certain characters more often than others? Does it contain all 255 characters or just a subset of them?
If so, Huffman encoding may help you; see this or this other link for implementations in C#.
To be honest, it just depends on what your input string looks like. What I'd do is try rar, zip, and 7zip (LZMA) with very small dictionary sizes (otherwise they'll just use up too much space for preprocessed information) and see how big the raw compressed file each produces is (you'll probably have to use their libraries in order to strip the headers and conserve space). If any of them produces a file under 250 bytes, then find the C# library for it and there you go.

Fixed length strings or structures in C#

I need to create a structure or a series of strings that are fixed length for a project I am working on. Currently it is written in COBOL and is a communication application. It sends a fixed-length record via the web and receives a fixed-length record back. I would like to write it as a structure for simplicity, but so far the best thing I have found is a method that uses string.PadRight to put the string terminator in the correct place.
I could write a class that encapsulates this and returns a fixed length string, but I'm hoping to find a simple way to fill a structure and use it as a fixed length record.
Edit:
The fixed-length record is used as a parameter in a URL, so it's http://somewebsite.com/parseme?record="firstname lastname address city state zip". I'm pretty sure I won't have to worry about ASCII-to-Unicode conversions since it's in a URL. It's a little larger than that, and more information is passed than just the address; there are about 30 or 35 fields.
Add the MarshalAs tag to your structure. Here is an example:
<StructLayout(LayoutKind.Sequential, CharSet:=CharSet.Auto)> _
Public Structure OSVERSIONINFO
    Public dwOSVersionInfoSize As Integer
    Public dwMajorVersion As Integer
    Public dwMinorVersion As Integer
    Public dwBuildNumber As Integer
    Public dwPlatformId As Integer
    <MarshalAs(UnmanagedType.ByValTStr, SizeConst:=128)> _
    Public szCSDVersion As String
End Structure
http://bytes.com/groups/net-vb/369711-defined-fixed-length-string-structure
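Since the question is in C#, here is the same declaration translated from the VB.NET example above (requires using System.Runtime.InteropServices):

[StructLayout(LayoutKind.Sequential, CharSet = CharSet.Auto)]
public struct OSVERSIONINFO
{
    public int dwOSVersionInfoSize;
    public int dwMajorVersion;
    public int dwMinorVersion;
    public int dwBuildNumber;
    public int dwPlatformId;
    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 128)]
    public string szCSDVersion;
}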
You could use the VB6 compat FixedLengthString class, but it's pretty trivial to write your own.
If you need parsing and validation and all that fun stuff, you may want to take a look at FileHelpers which uses attributes to annotate and parse fixed length records.
FWIW, on anything relatively trivial (data processing, for instance) I'd probably just use a ToFixedLength() extension method that took care of padding or truncating as needed when writing out the records. For anything more complicated (like validation or parsing), I'd turn to FileHelpers.
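A sketch of what that ToFixedLength() extension method might look like (the name comes from this answer; the body is an assumption):

public static class StringExtensions
{
    // Pads or truncates so the result is always exactly `length` characters.
    public static string ToFixedLength(this string value, int length, char pad = ' ')
    {
        value = value ?? string.Empty;
        return value.Length > length
            ? value.Substring(0, length)   // truncate
            : value.PadRight(length, pad); // pad
    }
}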
byte[] arrayByte = new byte[35];
// Fill arrayByte with the other function...
// Convert the byte array to a string:
string arrayStringConvert = Encoding.ASCII.GetString(arrayByte, 0, arrayByte.Length);
As an answer, you should be able to use char arrays of the correct size, without having to marshal.
Also, the difference between a class and a struct in .NET is minimal. A struct cannot be null while a class reference can; otherwise their use and capabilities are pretty much identical.
Finally, it sounds like you should be mindful of the size of the characters being sent. I'm assuming (I know, I know) that COBOL uses 8-bit ASCII characters, while .NET strings use a UTF-16 encoded character set. This means a 10-character string in COBOL is 10 bytes, but in .NET the same string is 20 bytes.
