I'm planning to use sockets to stream a sequence of bytes across a network. I'm using a BinaryFormatter to serialize the objects I wish to send, then I beam them between machines and deserialize them when they reach their destination. The question "how do I detect the end of a byte stream" has been asked before, but my predicament is a little different: I want to be able to send multiple types of objects across this network. I would simply use a marker or an agreed-upon sequence of bytes to communicate the end of a message, but because I have no idea what the BinaryFormatter will produce, I don't know what sequence of bytes would be unique. If I decided to use the sequence 255, 128, 0, 255 to denote the end of one object stream, how do I ensure that that sequence doesn't also occur within the BinaryFormatter output? What else can I do to signify the end of one object and the beginning of another?
Thanks for any help.
Normally this is done with a length indication at the start of the message.
Let's say you compress your data (which can be an image, text, etc.) into a byte array. You take the length of this byte array, say 500, and write this number at the start of your stream. The width of this length field must always be the same, say 32 bits (4 bytes). In the case of the number 500 it would look something like this:
00000000 00000000 00000001 11110100, and after this come the 500 bytes of your actual message.
This way you always know how long your message is from reading the first 4 bytes, and with that you know where the message ends.
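Here is a minimal sketch of that idea in C#, assuming a plain Stream (the helper names are illustrative). The 4-byte length prefix is sent in network byte order, and the reader loops because Stream.Read may return fewer bytes than requested:

using System;
using System.IO;
using System.Net;

static class Framing
{
    public static void WriteMessage(Stream stream, byte[] payload)
    {
        // 4-byte big-endian length prefix, then the payload itself
        byte[] prefix = BitConverter.GetBytes(IPAddress.HostToNetworkOrder(payload.Length));
        stream.Write(prefix, 0, 4);
        stream.Write(payload, 0, payload.Length);
    }

    public static byte[] ReadMessage(Stream stream)
    {
        byte[] header = ReadExactly(stream, 4);
        int length = IPAddress.NetworkToHostOrder(BitConverter.ToInt32(header, 0));
        return ReadExactly(stream, length);
    }

    static byte[] ReadExactly(Stream stream, int count)
    {
        // Stream.Read may return fewer bytes than asked for, so loop until done
        byte[] buffer = new byte[count];
        int offset = 0;
        while (offset < count)
        {
            int read = stream.Read(buffer, offset, count - offset);
            if (read == 0) throw new EndOfStreamException();
            offset += read;
        }
        return buffer;
    }
}

Because each BinaryFormatter payload is framed this way, it does not matter what byte sequences the serializer produces; you never have to scan the data itself for a terminator.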
What's going on? I do this on the server:
var msg = Server.Api.CreateMessage();
msg.Write(2);
msg.Write(FreshChunks.Count());
Server.Api.SendMessage(msg, peer.Connection, NetDeliveryMethod.ReliableUnordered);
then on the client it successfully reads the byte = 2, and the switch then routes to a function which reads an Int32 (FreshChunks.Count). The value sent was 4, but when received it equals 67108864. I've tried Int16 through Int64 and UInt16 through UInt64; none of them work out to the correct value.
Given that:
In your usage of msg.Write(2), the compiler resolves the 2 as an int (Int32)
You mentioned that you "successfully read the byte = 2".
It seems that one of these options is happening:
msg.Write is writing only the bytes that have at least one bit set (= 1) in them, to save space
msg.Write is always casting the given argument to a byte.
When asking for 4 bytes (an Int32), you got:
0x04 00 00 00. The first byte is exactly the 4 you passed.
It seems that when you ask msg.Read for more bytes than it has (you requested 4 bytes and it had only 1, due to the msg.Write logic), it does one of these:
Pads the remaining bytes with zeros
Keeps on reading, and in your case there were three 0 bytes in the message's metadata that were returned to you.
To solve your problem, you should read the documentation of the Write and Read methods and understand exactly how they behave.
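As a minimal sketch (assuming a Lidgren-style NetOutgoingMessage/NetIncomingMessage API, with incoming standing in for the hypothetical received message): make every read mirror the corresponding write in both order and width, and cast literals so the width is explicit:

// sender: one explicit byte, then a full Int32
var msg = Server.Api.CreateMessage();
msg.Write((byte)2);               // message type, exactly 1 byte
msg.Write(FreshChunks.Count());   // chunk count, exactly 4 bytes
Server.Api.SendMessage(msg, peer.Connection, NetDeliveryMethod.ReliableUnordered);

// receiver: read back in the same order and with the same widths
byte type = incoming.ReadByte();
int count = incoming.ReadInt32();

Note that 67108864 is 0x04000000, i.e. the expected value 4 shifted by three bytes, which is exactly the symptom of the reader and writer disagreeing about how many bytes the first field occupies.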
So I've been working on this project for a while now, involving LSB steganography. Really fun stuff. Anyway, I just finished writing the code for embedding and extracting files from an image (instead of just plain text), and I'm running into a problem. I can recognize the MIME type and extension of the bytes, but because the embedded file doesn't usually take up all of the LSBs of the image, there's a lot of garbage data, so I have the extracted file plus some garbage in the byte array right after it. I need to figure out how to cut the garbage so that the exported file is the correct, smaller size.
TLDR: I have a byte array with a recognized file in it, with some additional random bytes. How do I find out where the file ends and the random bytes begin?
Remember this is all in C#.
Any advice is appreciated.
Link to my project for reference: https://github.com/nicosogangstar/Steg
Generally you have two options.
End of stream marker
This is the more direct approach of the two, but it may lack some versatility depending on what data you want to hide. After you embed your data, continue with embedding a unique sequence of bits/bytes such that you know it cannot be prematurely encountered in the data before it. As you extract the bits, you can stop reading once you encounter this sequence. If you expect to hide only readable text, i.e. bytes with ASCII codes between 32 and 127, your marker can be as short as eight 0s or eight 1s. However, if you intend to hide any sort of binary data, where any byte value can appear, you may accidentally encounter the marker while extracting legitimate data and thus halt the process prematurely.
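A minimal sketch of that stop-at-marker extraction in C# (the method name is illustrative), using a full zero byte as the marker, which is only safe when the payload is restricted to ASCII 32-127:

using System.Collections.Generic;

static class MarkerScan
{
    // Collect extracted bytes until the end-of-stream marker (0x00) appears
    public static byte[] TakeUntilZeroByte(IEnumerable<byte> extractedBytes)
    {
        var result = new List<byte>();
        foreach (byte b in extractedBytes)
        {
            if (b == 0) break;   // marker reached: stop extracting
            result.Add(b);
        }
        return result.ToArray();
    }
}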
Header information
You can add a header preceding the data, e.g. another 16-24 bits (or any other amount), which can be translated to a number that tells you how many bits/bytes/pixels to read before stopping. For example, if you want to hide a byte array of size 1000, first embed 2 bytes encoding the length of the secret and then follow with the actual data. More specifically, split the length across 2 bytes, where the first byte holds bits 8 to 15 and the second byte holds bits 0 to 7 of the number 1000 in binary:
00000011 11101000    (the number 1000 in binary)
       3      232    (the two byte values; 232 reads as -24 if treated as a signed byte)
You can embed all sorts of information in a header, such as whether the data is encrypted or compressed with some algorithm, the original filename of the data, how many LSBs to read when extracting the information, etc.
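A minimal sketch of this 2-byte length header in C# (the names are illustrative; 2 bytes caps the payload at 65535 bytes), to run before embedding and after extraction:

using System;

static class SecretHeader
{
    public static byte[] AddHeader(byte[] secret)
    {
        if (secret.Length > ushort.MaxValue)
            throw new ArgumentException("Payload too large for a 2-byte header.");
        byte[] framed = new byte[secret.Length + 2];
        framed[0] = (byte)(secret.Length >> 8);    // bits 8..15 of the length
        framed[1] = (byte)(secret.Length & 0xFF);  // bits 0..7 of the length
        Buffer.BlockCopy(secret, 0, framed, 2, secret.Length);
        return framed;
    }

    public static byte[] StripHeader(byte[] extracted)
    {
        int length = (extracted[0] << 8) | extracted[1];
        byte[] secret = new byte[length];
        Buffer.BlockCopy(extracted, 2, secret, 0, length);
        return secret;   // everything past 'length' bytes is LSB garbage, discarded
    }
}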
I'm working on a protocol which will transfer blocks of XML data via a TCP socket. Say I need to read all the bytes from an XML file and build a memory buffer. Then, before sending the actual data bytes, I need to send a header to the other peer. Say my protocol uses the header type below.
MessageID=100,Size=232,CRC=190
string strHeader = "100,232,190"
Now I would like to know how I can make this header length fixed (a fixed header length is required for the other peer to identify it as a header) for any amount of XML data. Say I currently have an XML file sized 283,637 bytes; the message header will then look like:
string strHeader = "100,283637,190";
How can I make it generic for any size of data? The code is being written in both C++ and C#.
There are a number of ways to do it.
Fixed Length
You can pad the numbers with leading zeroes so you know exactly what length of text you need to work with: 000100,000232,000190
Use Bytes instead of strings
If you are using integers, you can read the bytes as integers instead of manipulating the string. Look into the BinaryReader class. If you need to do this on the C++ side, the concept is still the same; there are many ways to convert 4 bytes into an int.
Specify the length at the beginning
Usually when working with dynamic-length strings, there is an indicator of how many bytes need to be read in order to get the entire string. You could specify the first 4 bytes of your message as the length of the string and then read up to that point.
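On the C# side, a minimal sketch of the binary approach (the helper names are illustrative): three Int32 fields give a header that is always 12 bytes, regardless of how large the XML payload is. Note that BinaryWriter/BinaryReader use little-endian byte order, so the C++ peer must agree on that:

using System.IO;

static class HeaderIo
{
    public static void WriteHeader(Stream s, int messageId, int size, int crc)
    {
        var writer = new BinaryWriter(s);
        writer.Write(messageId);  // 4 bytes
        writer.Write(size);       // 4 bytes
        writer.Write(crc);        // 4 bytes: always 12 bytes total
        writer.Flush();
    }

    public static (int messageId, int size, int crc) ReadHeader(Stream s)
    {
        var reader = new BinaryReader(s);
        return (reader.ReadInt32(), reader.ReadInt32(), reader.ReadInt32());
    }
}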
The best approach for you is to implement this as a struct like
typedef struct _msg_hdr {
    int messageID;
    int size;
    int crc;
} msg_hdr;
This will always be 12 bytes long (assuming 4-byte ints and no struct padding). Now when sending your message, first send the header to the receiver. The receiver should receive it into the same structure. This is the best and easiest way.
I have a protocol buffer setup like this:
[ProtoContract]
class Foo
{
    [ProtoMember(1)]
    public Bar[] Bars;
}
A single Bar gets encoded to a 67-byte protocol buffer. This sounds about right because I know that a Bar is pretty much just a 64-byte array, and then there are 3 bytes of overhead for length prefixing.
However, when I encode a Foo with an array of 20 Bars it takes 1362 bytes. 20 * 67 is 1340, so there are 22 bytes of overhead just for encoding an array!
Why does this take up so much space? And is there anything I can do to reduce it?
This overhead is quite simply the information it needs to know where each of the 20 objects starts and ends. There is nothing I can do differently here without breaking the format (i.e. doing something contrary to the spec).
If you really want the gory details:
An array or list is (if we exclude "packed", which doesn't apply here) simply a repeated block of sub-messages. There are two layouts available for sub-messages; strings and groups. With a string, the layout is:
[header][length][data]
where header is the varint-encoded mash of the wire-type and field-number ((1 << 3) | 2 = hex 0A in this case, for field 1 with the length-delimited wire-type), length is the varint-encoded size of data, and data is the sub-object itself. For small objects (data less than 128 bytes) this often means 2 bytes of overhead per object, depending on a: the field number (fields above 15 take more space), and b: the size of the data.
With a group, the layout is:
[header][data][footer]
where header is the varint-encoded mash of the wire-type and field-number (hex 0B in this case with field 1), data is the sub-object, and footer is another varint mash to indicate the end of the object (hex 0C in this case with field 1).
Groups are less favored generally, but they have the advantage that they don't incur any overhead as data grows in size. For small field-numbers (less than 16) again the overhead is 2 bytes per object. Of course, you pay double for large field-numbers, instead.
By default, arrays aren't actually passed as arrays, but as repeated members, which have a little more overhead.
So I'd guess you actually have 1 byte of overhead for each repeated array element, plus 2 extra bytes overhead on top.
You can lose the overhead by using a "packed" array. protobuf-net supports this: http://code.google.com/p/protobuf-net/
The documentation for the binary format is here: http://code.google.com/apis/protocolbuffers/docs/encoding.html
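To see these numbers for yourself, here is a minimal measurement sketch using protobuf-net (the Bar definition here is an assumption based on its description as roughly a 64-byte array, so the exact sizes may differ from yours):

using System;
using System.IO;
using System.Linq;
using ProtoBuf;

[ProtoContract]
class Bar
{
    [ProtoMember(1)]
    public byte[] Payload = new byte[64];   // assumed shape of Bar
}

[ProtoContract]
class Foo
{
    [ProtoMember(1)]
    public Bar[] Bars;
}

static class SizeCheck
{
    static long Measure<T>(T obj)
    {
        using (var ms = new MemoryStream())
        {
            Serializer.Serialize(ms, obj);
            return ms.Length;
        }
    }

    static void Main()
    {
        long one = Measure(new Bar());
        long twenty = Measure(new Foo { Bars = Enumerable.Range(0, 20).Select(_ => new Bar()).ToArray() });
        // the difference beyond 20 standalone Bars is the repeated-field framing
        Console.WriteLine($"one Bar: {one} bytes, Foo with 20 Bars: {twenty} bytes, framing: {twenty - 20 * one} bytes");
    }
}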
My understanding is that reads from a UTF-8 or UTF-16 encoded file can't necessarily start at a random position, because a character may span several bytes (multi-byte sequences in UTF-8, surrogate pairs in UTF-16, used for East Asian languages for example).
How can I use .NET to skip to an approximate position within the file, and read the unicode text from a semi-random position?
Do I discard surrogate bytes and wait for a word break to continue reading? If so, what are the valid word breaks I should wait for until I start the decoding?
Easy, UTF-8 is self-synchronizing.
Simply jump to a random byte in the file and skip-read all bytes with leading bits 10 (continuation bytes). The first byte that does not start with 10 is the starting byte of a proper UTF-8 character, and you can read the following bytes using a regular UTF-8 decoder.
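A minimal sketch of that resynchronization in C# (the file name is illustrative); continuation bytes are exactly those matching the bit test (b & 0xC0) == 0x80:

using System;
using System.IO;
using System.Text;

class Utf8RandomRead
{
    static void Main()
    {
        using (var fs = File.OpenRead("big.txt"))   // hypothetical input file
        {
            var rng = new Random();
            fs.Position = (long)(rng.NextDouble() * fs.Length);

            // Skip continuation bytes, which have the bit pattern 10xxxxxx
            int b;
            while ((b = fs.ReadByte()) != -1 && (b & 0xC0) == 0x80)
            {
                // keep scanning forward
            }
            if (b == -1) return;   // hit end of file while scanning

            fs.Position -= 1;      // step back onto the lead byte we found

            // From here a StreamReader decodes well-formed UTF-8
            using (var reader = new StreamReader(fs, Encoding.UTF8))
            {
                char[] chars = new char[100];
                int n = reader.Read(chars, 0, chars.Length);
                Console.WriteLine(new string(chars, 0, n));
            }
        }
    }
}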
Assuming that you're looking to extract a pseudo-random character from a UTF-8 file, I personally would lean away from trying to work out how to jump into a random place and then scroll forwards to a guaranteed 'start of character' position (my feeling was that would be a tricky proposition). Edit: this is wrong, see the note at the end. How about something like:
Establish the length of the file in bytes
Heuristically guess the number of characters - for example, by scaling by a constant established from some suitable corpus; or by examining the first n bytes and seeing how many characters they describe, in order to get a scaling constant that might be more representative of this file
Pick a pseudo-random number in 1..<guessed number of characters in file>
If the file is very big (which I'm guessing it must be, else you wouldn't be asking this), use a buffered read to:
Read the file's bytes, decoding from UTF-8, until you reach the desired character. If you fall off the end of the file, use the last character.
A buffered read here will need to use two buffers which are alternately 'first', to avoid losing context when a character's bytes are split across two reads, e.g.:
Read Buffer A : bytes 1000-1999
Read Buffer B : bytes 2000-2999
If a character occupies bytes 1998-2001, using a single buffer would lose context.
Read Buffer A : bytes 3000-3999
Now in effect buffer A follows buffer B when we convert the byte stream into characters.
As noted by @jleedev, and as seen in the other answer, it is actually easy and safe to 'scroll forward' to a guaranteed character start. But the character-count estimation above might still prove useful.
For UTF-16, you always have to jump to an even byte position. Then you can check whether the code unit there is a trailing surrogate (0xDC00-0xDFFF). If so, skip it; otherwise you are at the start of a well-formed UTF-16 code unit sequence (always assuming that the file is well-formed, of course).
The Unicode encodings UTF-8 and UTF-16 were specifically designed to be self-synchronizing, and there are strong guarantees that you only have to skip at most a small number of code units.
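A minimal sketch of the UTF-16 case in C# (assuming UTF-16LE and a well-formed file; the method name is illustrative):

using System.IO;

static class Utf16Seek
{
    // Align an arbitrary byte offset to the start of a UTF-16LE character
    public static long AlignToCharStart(Stream s, long approx)
    {
        long pos = approx & ~1L;                 // force an even byte position
        s.Position = pos;
        int lo = s.ReadByte();
        int hi = s.ReadByte();
        if (lo == -1 || hi == -1) return pos;    // at or past end of file
        ushort unit = (ushort)(lo | (hi << 8));  // little-endian code unit
        if (unit >= 0xDC00 && unit <= 0xDFFF)
            pos += 2;                            // trailing surrogate: skip it
        s.Position = pos;
        return pos;
    }
}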