Issue with BinaryReader.ReadChars() - c#

I've run into what I believe is an issue with the BinaryReader.ReadChars() method. When I wrap a BinaryReader around a raw socket NetworkStream occasionally I get a stream corruption where the stream being read gets out of sync. The stream in question contains messages in a binary serialisation protocol.
I've tracked this down to the following
It only happens when reading a unicode string (encoded using the Encoding.BigEndian)
It only happens when the string in question is split across two tcp packets (confirmed using wireshark)
I think what is happening is the following (in the context of the example below)
BinaryReader.ReadChars() is called asking it to read 3 characters (string lengths are encoded before the string itself)
First loop internally requests a read of 6 bytes (3 remaining characters * 2 bytes/char) off the network stream
Network stream only has 3 bytes available
3 bytes read into local buffer
Buffer handed to Decoder
Decoder decodes 1 char, and keeps the other byte in it's own internal buffer
Second loop internally requests a read of 4 bytes! (2 remaining characters * 2 bytes/char)
Network stream has all 4 bytes available
4 bytes read into local buffer
Buffer handed to Decoder
Decoder decodes 2 char, and keeps the remaining 4th bytes internally
String decode is complete
Serialisation code attempts to unmarshal the next item and croaks because of stream corruption.
char[] buffer = new char[3];
int charIndex = 0;
Decoder decoder = Encoding.BigEndianUnicode.GetDecoder();
// pretend 3 of the 6 bytes arrives in one packet
byte[] b1 = new byte[] { 0, 83, 0 };
int charsRead = decoder.GetChars(b1, 0, 3, buffer, charIndex);
charIndex += charsRead;
// pretend the remaining 3 bytes plus a final byte, for something unrelated,
// arrive next
byte[] b2 = new byte[] { 71, 0, 114, 3 };
charsRead = decoder.GetChars(b2, 0, 4, buffer, charIndex);
charIndex += charsRead;
I think the root is a bug in the .NET code which uses charsRemaining * bytes/char each loop to calculate the remaining bytes required. Because of the extra byte hidden in the Decoder this calculation can be off by one causing an extra byte to be consumed off the input stream.
Here's the .NET framework code in question
while (charsRemaining>0) {
// We really want to know what the minimum number of bytes per char
// is for our encoding. Otherwise for UnicodeEncoding we'd have to
// do ~1+log(n) reads to read n characters.
numBytes = charsRemaining;
if (m_2BytesPerChar)
numBytes <<= 1;
numBytes = m_stream.Read(m_charBytes, 0, numBytes);
if (numBytes==0) {
return (count - charsRemaining);
}
charsRead = m_decoder.GetChars(m_charBytes, 0, numBytes, buffer, index);
charsRemaining -= charsRead;
index+=charsRead;
}
I'm not entirely sure if this is a bug or just a misuse of the API. To work round this issue I'm just calculating the bytes required myself, reading them, and then running the byte[] through the relevant Encoding.GetString(). However this wouldn't work for something like UTF-8.
Be interested to hear people's thoughts on this and whether I'm doing something wrong or not. And maybe it will save the next person a few hours/days of tedious debugging.
EDIT: posted to connect Connect tracking item

I have reproduced the problem you mentioned with BinaryReader.ReadChars.
Although the developer always needs to account for lookahead when composing things like streams and decoders, this seems like a fairly significant bug in BinaryReader because that class is intended for reading data structures composed of various types of data. In this case, I agree that ReadChars should have been more conservative in what it read to avoid losing that byte.
There is nothing wrong with your workaround of using the Decoder directly, after all that is what ReadChars does behind the scenes.
Unicode is a simple case. If you think about an arbitrary encoding, there really is no general purpose way to ensure that the correct number of bytes are consumed when you pass in a character count instead of a byte count (think about varying length characters and cases involving malformed input). For this reason, avoiding BinaryReader.ReadChars in favor of reading the specific number of bytes provides a more robust, general solution.
I would suggest that you bring this to Microsoft's attention via http://connect.microsoft.com/visualstudio.

Interesting; you could report this on "connect". As a stop-gap, you could also try wrapping with BufferredStream, but I expect this is papering over a crack (it may still happen, but less frequently).
The other approach, of course, is to pre-buffer an entire message (but not the entire stream); then read from something like MemoryStream - assuming your network protocol has logical (and ideally length-prefixed, and not too big) messages. Then when it is decoding all the data is available.

This reminds of one of my own questions (Reading from a HttpResponseStream fails) where I had an issue that when reading from a HTTP response stream the StreamReader would think it had hit the end of the stream prematurely so my parsers would bomb out unexpectedly.
Like Marc suggested for your problem I first tried pre-buffering in a MemoryStream which works well but means you may have to wait a long time if you have a large file to read (especially from the network/web) before you can do anything useful with it. I eventually settled on creating my own extension of TextReader which overrides the Read methods and defines them using the ReadBlock method (which does a blocking read i.e. it waits until it can get exactly the number of characters you ask for)
Your problem is probably due like mine to the fact that Read methods aren't guarenteed to return the number of characters you ask for, for example if you look at the documentation for the BinaryReader.Read (http://msdn.microsoft.com/en-us/library/ms143295.aspx) method you'll see that it states:
Return Value
Type: System..::.Int32
The number of characters read into buffer. This might be less than the number of bytes requested if that many bytes are not available, or it might be zero if the end of the stream is reached.
Since BinaryReader has no ReadBlock methods like a TextReader all you can do is take your own approach of monitoring the position yourself or Marc's of pre-caching.

I'm working with Unity3D/Mono atm and the ReadChars-method might even contain more errors. I made a string like this:
mat.name = new string(binaryReader.ReadChars(64));
mat.name even contained the correct string, but I could just add strings before it. Everything after the string just disappered. Even with String.Format. My solution so far is not using the ReadChars-method, but read the data as byte array and convert it to a string:
byte[] str = binaryReader.ReadBytes(64);
int lengthOfStr = Array.IndexOf(str, (byte)0); // e.g. 4 for "clip\0"
mat.name = System.Text.ASCIIEncoding.Default.GetString(str, 0, lengthOfStr);

Related

Best way to read a stream to a delimiter and no farther

My code has to consume data from a NetworkStream, and the data read from the stream will contain three parts: metadata, a well-known delimiter, and data.
I'm trying to determine the most efficient way of reading from the NetworkStream, up to the end of the delimiter. The metadata portion is generally measured in hundreds of bytes (but could be as small as 32 bytes), the delimiter is a specific 2-byte sequence, and the data could range from zero bytes to several gigabytes in size (the metadata provides information on the data length). I should only read up to the delimiter, because the rest of the stream (containing payload data) needs to be used elsewhere, and NetworkStream doesn't support seek and the data may be so large that I can't dump it all into a MemoryStream.
I've been using the following, and it works, but it seems there could be a more efficient way of reading up to the delimiter. Since the minimum metadata size is 32 bytes, I start with a 34-byte buffer (32 bytes of metadata + 2 bytes delimiter), read from the stream, and check for the delimiter. If the delimiter is found (smallest possible metadata), the code then breaks and the balance of the stream contains the data. If the delimiter is not found, the code then loops reading a single byte at a time, checking the last two bytes of the StringBuilder used to hold what has been read from the stream, until the delimiter is found at the end.
(code reduced for brevity, removed checking of negative cases, etc)
string delim = "__";
StringBuilder sb = new StringBuilder();
byte[] buffer = new byte[1];
byte[] initialBuffer = new byte[34];
int bytesRead = stream.Read(initialBuffer, 0, 34); // yes I check bytesRead in the actual code
sb.Append(Encoding.UTF8.GetString(initialBuffer);
while (true)
{
string delimCheck = sb.ToString((sb.Length - 2), 2);
if (delimCheck.Equals(delim)) break;
else
{
buffer = new byte[1];
bytesRead = stream.Read(buffer, 0, 1); // yes I check bytesRead in the actual code
sb.Append(Encoding.UTF8.GetString(buffer));
}
}
The code works, but it seems really inefficient and slow to read one byte at a time to reach the end of the delimiter. Is anything readily apparent that might better optimize this code?
Thanks!
Do you see those Read(array, offset, count) return values you are putting into a variable bytesRead and then happily ignoring?
Those (along with setting the socket in non-blocking mode) are the solution to your problem. Then you can access "everything received so far" without getting stuck waiting for enough extra data to arrive to fill your array.
Even in blocking mode, ignoring that return value is a bug, because when the socket is gracefully shut down, you will get a partial read where bytesRead < bytesRequested
Regarding your concerns about how to save the extra data for later, Microsoft provided a class for that. See System.IO.BufferedStream and the example:
The following code examples show how to use the BufferedStream class over the NetworkStream class to increase the performance of certain I/O operations. Start the server on a remote computer before starting the client. Specify the remote computer name as a command-line argument when starting the client. Vary the dataArraySize and streamBufferSize constants to view their effect on performance.
Source: https://learn.microsoft.com/en-us/dotnet/api/system.io.bufferedstream
Not shown in the example is that you still need to put the socket into non-blocking mode to avoid having the BufferedStream block until an entire buffer chunk is received. The Socket class provides the Blocking property to make that easy.
https://learn.microsoft.com/en-us/dotnet/api/system.net.sockets.socket.blocking

NetworkStream.Length substitute

I am using a networkstream to pass short strings around the network.
Now, on the receiving side I have encountered an issue:
Normally I would do the reading like this
see if data is available at all
get count of data available
read that many bytes into a buffer
convert buffer content to string.
In code that assumes all offered methods work as probably intended, that would look something like this:
NetworkStream stream = someTcpClient.GetStream();
while(!stream.DataAvailable)
;
byte[] bufferByte;
stream.Read(bufferByte, 0, stream.Lenght);
AsciiEncoding enc = new AsciiEncoding();
string result = enc.GetString(bufferByte);
However, MSDN says that NetworkStream.Length is not really implemented and will always throw an Exception when called.
Since the incoming data are of varying length I cannot hard-code the count of bytes to expect (which would also be a case of the magic-number antipattern).
Question:
If I cannot get an accurate count of the number of bytes available for reading, then how can I read from the stream properly, without risking all sorts of exceptions within NetworkStream.Read?
EDIT:
Although the provided answer leads to a better overall code I still want to share another option that I came across:
TCPClient.Available gives the bytes available to read. I knew there had to be a way to count the bytes in one's own inbox.
There's no guarantee that calls to Read on one side of the connection will match up 1-1 with calls to Write from the other side. If you're dealing with variable length messages, it's up to you to provide the receiving side with this information.
One common way to do this is to first work out the length of the message you're going to send and then send that length information first. On the receiving side, you then obtain the length first and then you know how big a buffer to allocate. You then call Read in a loop until you've read the correct number of bytes. Note that, in your original code, you're currently ignoring the return value from Read, which tells you how many bytes were actually read. In a single call and return, this could be as low as 1, even if you're asking for more than 1 byte.
Another common way is to decide on message "formats" - where e.g. message number 1 is always 32 bytes in length and has X structure, and message number 2 is 51 bytes in length and has Y structure. With this approach, rather than you sending the message length before sending the message, you send the format information instead - first you send "here comes a message of type 1" and then you send the message.
A further common way, if applicable, is to use some form of sentinels - if your messages will never contain, say, a byte with value 0xff then you scan the received bytes until you've received an 0xff byte, and then everything before that byte was the message you wanted to receive.
But, whatever you want to do, whether its one of the above approaches, or something else, it's up to you to have your sending and receiving sides work together to allow the receiver to discover each message.
I forgot to say but a further way to change everything around is - if you want to exchange messages, and don't want to do any of the above fiddling around, then switch to something that works at a higher level - e.g. WCF, or HTTP, or something else, where those systems already take care of message framing and you can, then, just concentrate on what to do with your messages.
You could use StreamReader to read stream to the end
var streamReader = new StreamReader(someTcpClient.GetStream(), Encoding.ASCII);
string result = streamReader.ReadToEnd();

What does BinaryReader do if the bytes I am reading aren't present yet?

I am using a BinaryReader on top of a NetworkStream to read data off of a network. This has worked really well for me, but I want to understand what's going on behind the scenes, so I took a look at the documentation for BinaryReader and found it to be extremely sparse.
My question is this: What will BinaryReader.ReadBytes(bufferSize) do if bufferSize bytes are not present on the network stream when I call ReadBytes?
In my mind there are a few options:
1) Read any bytes that are present on the network stream and return only that many
2) Wait until bufferSize bytes are present on the stream, then read
3) Throw an exception
I assume option 2 is happening, since I've never received any exceptions and all my data is received whole, not in pieces. However, I would like to know for sure what is going on. If someone could enlighten me, I would be grateful.
I believe it actually goes for hidden option 4:
Read the data as it becomes available, looping round in the same way that you normally would do manually. It will only return a value less than the number of bytes you asked for if it reaches the end of the stream while reading.
This is subtly different from your option 2 as it does drain the stream as data becomes available - it doesn't wait until it could read all of the data in one go.
It's easy to show that it does return a lower number of bytes than you asked for if it reaches the end:
var ms = new MemoryStream(new byte[10]);
var readData = new BinaryReader(ms).ReadBytes(100);
Console.WriteLine(readData.Length); // 10
It's harder to prove the looping part, without a custom stream which would explicitly require multiple Read calls to return all the data.
The documentation isn't as clear as it might be, but the return value part is at least somewhat helpful:
A byte array containing data read from the underlying stream. This might be less than the number of bytes requested if the end of the stream is reached.
Note the final part that I've highlighted, and compare that with Stream.Read:
The total number of bytes read into the buffer. This can be less than the number of bytes requested if that many bytes are not currently available, or zero (0) if the end of the stream has been reached.
If you're expecting an exact amount of data and only that amount will be useful, I suggest you write a ReadExactly method which calls Read and throws EndOfStreamException if you need more data than the stream provided before it was closed.
If, by “present on the stream”, you’re asking whether the method would block until the specified number of bytes are available, then it is option 2. It would only return a smaller amount of bytes if the end of the stream is reached.
Here is some sample code on how BinaryReader.ReadBytes(int) may be implemented:
byte[] ReadBytes(int count)
{
byte[] buffer = new byte[count];
int total = 0;
int read = 0;
do
{
read = stream.Read(buffer, read, count - total);
total += read;
}
while (read > 0 && total < count);
// Resize buffer if smaller than count (code not shown).
return buffer;
}

how to convert Image to string the most efficient way?

I want to convert an image file to a string. The following works:
MemoryStream ms = new MemoryStream();
Image1.Save(ms, ImageFormat.Jpeg);
byte[] picture = ms.ToArray();
string formmattedPic = Convert.ToBase64String(picture);
However, when saving this to a XmlWriter, it takes ages before it's saved(20secs for a 26k image file). Is there a way to speed this action up?
Thanks,
Raks
There are three points where you are doing large operations needlessly:
Getting the stream's bytes
Converting it to Base64
Writing it to the XmlWriter.
Instead. First call Length and GetBuffer. This let's you operate upon the stream's buffer directly. (Do flush it first though).
Then, implement base-64 yourself. It's relatively simple as you take groups of 3 bytes, do some bit-twiddling to get the index into the character it'll be converted to, and then output that character. At the very end you add some = symbols according to how many bytes where in the last block sent (= for one remainder byte, == for two remainder bytes and none if there were no partial blocks).
Do this writting into a char buffer (a char[]). The most efficient size is a matter for experimentation but I'd start with 2048 characters. When you've filled the buffer, call XmlWriter.WriteRaw on it, and then start writing back at index 0 again.
This way, you're doing less allocations, and you're started on the output from the moment you've got your image loaded into the memory stream. Generally, this should result in better throughput.

Why I need to read file piece by piece to buffer?

I have seen following code for getting the file into array, which is in turn used as a parameter for SQL command inserting it into a blob column:
using (FileStream fs = new FileStream(soubor,FileMode.Open,FileAccess.Read))
int length = (int)fs.Length;
buffer = new byte[length];
int count;
int sum = 0;
while ((count = fs.Read(buffer, sum, length - sum)) > 0)
sum += count;
Why I cannot simply do that:
fs.Read(buffer, 0, length) in order to just copy content of file to the buffer?
Thanks
There's more to it than just "the file may not fit in memory". The contract for Stream.Read explicitly says:
Implementations of this method read a
maximum of count bytes from the
current stream and store them in
buffer beginning at offset. The
current position within the stream is
advanced by the number of bytes read;
however, if an exception occurs, the
current position within the stream
remains unchanged. Implementations
return the number of bytes read. The
return value is zero only if the
position is currently at the end of
the stream. The implementation will
block until at least one byte of data
can be read, in the event that no data
is available. Read returns 0 only when
there is no more data in the stream
and no more is expected (such as a
closed socket or end of file). An
implementation is free to return fewer
bytes than requested even if the end
of the stream has not been reached.
Note the last sentence - you can't rely on a single call to Stream.Read to read everything.
The docs for FileStream.Read have a similar warning:
The total number of bytes read into
the buffer. This might be less than
the number of bytes requested if that
number of bytes are not currently
available, or zero if the end of the
stream is reached.
For a local file system I don't know for sure whether this will ever actually happen - but it could do for a network mounted file. Do you want your app to brittle in that way?
Reading in a loop is the robust way to do things. Personally I prefer not to require the stream to support the Length property, either:
public static byte[] ReadFully(Stream stream)
{
byte[] buffer = new byte[8192];
using (MemoryStream tmpStream = new MemoryStream())
{
int bytesRead;
while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
{
tmpStream.Write(buffer, 0, bytesRead);
}
return tmpStream.ToArray();
}
}
That is slightly less efficient when the length is known beforehand, but it's nice and simple. You only need to implement it once, put it in a utility library, and call it whenever you need to. If you really mind the efficiency loss, you could use CanSeek to test whether the Length property is supported, and repeatedly read into a single buffer in that case. Be aware of the possibility that the length of the stream could change while you're reading though...
Of course, File.ReadAllBytes will do the trick even more simply when you only need to deal with a file rather than a general stream.
Because your file could be very large and the buffer has usually a fixed size of 4-32 KB. This way you know you're not filling your memory unnessecarily.
Of course, if you KNOW the size of your file is not too large or if you store the contents in memory anyways, there is no reason not to read it all in one shot.
Although, if you want to read the contents of your file directly into a variable, you don't need the Stream API. Rather use
File.ReadAllText(...)
or
File.ReadAllBytes(...)
A simple fs.Read(buffer, 0, length) will probably work, and it will even be hard to find a test to break it. But it simply is not guaranteed, and it might break in the future.
The best answer here is to use a specialized method from the library. In this case
byte[] buffer = System.IO.File.ReadAllBytes(fileName);
A quick look with Reflector confirms that this will get you the partial-buffer logic and the exception-safe Dispose() of your stream.
And when future versions of the Framework allow for better ways to do this your code will automatically profit.

Categories

Resources