C# - Switching System.Text.Decoder mid-stream - buffered data

I have a Decoder instance, and am using its Convert method in conjunction with a file reader to read in data with the appropriate encoding.
I'd like to switch the Decoder instance I'm using midway through my read. However, I'm conscious that the original decoder may have some bytes buffered internally (incomplete chars) despite bytesUsed equalling byteCount, and the switch sounds as though it could then result in data loss.
Can I retrieve the internal byte buffer so I might pass it through? Additionally, the switch is only to occur when a fallback takes place. I have considered using the invalid-bytes position made available by the fallback exception to 'split' the current read buffer (as presumably at that point any previously buffered bytes have been used), but perhaps there's a better way?
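To illustrate the splitting approach I mean, here's a rough, untested sketch of what I'm considering; the choice of UTF-8 falling back to ISO-8859-1, and all the names, are just for the example:

using System;
using System.Text;

class DecoderSwitchSketch
{
    static void Main()
    {
        // Strict UTF-8: throw a DecoderFallbackException instead of substituting '?'
        Encoding strictUtf8 = Encoding.GetEncoding(
            "utf-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

        byte[] buffer = { 0x48, 0x69, 0xE9, 0x21 }; // "Hi", then a byte invalid in UTF-8
        char[] chars = new char[buffer.Length];
        try
        {
            int n = strictUtf8.GetDecoder().GetChars(buffer, 0, buffer.Length, chars, 0);
            Console.WriteLine(new string(chars, 0, n));
        }
        catch (DecoderFallbackException ex)
        {
            // ex.Index is the offset of the offending bytes within the input passed
            // to this call; it can apparently be negative if the bad sequence began
            // in bytes buffered from an earlier call, which real code must handle.
            int splitAt = ex.Index;

            // Re-decode the good prefix, then hand the rest to the replacement decoder.
            int n = strictUtf8.GetDecoder().GetChars(buffer, 0, splitAt, chars, 0);
            Decoder replacement = Encoding.GetEncoding("ISO-8859-1").GetDecoder();
            n += replacement.GetChars(buffer, splitAt, buffer.Length - splitAt, chars, n);
            Console.WriteLine(new string(chars, 0, n)); // "Hié!"
        }
    }
}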
Thanks in advance,
James

Related

Adding Int32 to List<byte> using AddRange throwing exception despite being successful

I'm creating an application that will take an image in a certain format from one of a video game's files and convert it to a DDS. This requires me to build the DDS in a buffer and then write it out to a DDS file. This buffer is of type List<byte>.
I first write the magic number, which is just the text "DDS ", with this code:
ddsFile.AddRange(Encoding.ASCII.GetBytes("DDS "));
I then need to write the header size, which is always 124 (0x7C, stored on disk as the little-endian bytes 7C 00 00 00), and this is where I've hit a wall. I used this code to write it to the buffer:
ddsFile.AddRange(BitConverter.GetBytes(0x0000007C));
This made sense to me because BitConverter.GetBytes() is documented as returning a byte[], and it accepts an int as a parameter, no problem. And additionally, this was what I saw recommended when looking for a method for adding multi-byte values to a byte list. But for whatever reason, when the program tries to execute that line, this exception is thrown:
Unable to cast object of type 'System.Byte[]' to type 'System.IConvertible'.
But what's even more strange to the point of being ridiculous is that, upon seeing what did make it into the buffer, I see that the int actually was being written to the buffer, but the exception was still occurring for who knows what reason.
Bizarrely, even writing a single byte to the list after writing the magic number, e.g. ddsFile.Add((byte)0x00);, results in the same thing.
Any help in figuring out why this exception occurs and/or a solution would be greatly appreciated.
This is not an answer to the question but a suggestion to do it differently.
Instead of using a List<byte> and manually doing all the conversions (while certainly possible, it's cumbersome), use a stream and a BinaryWriter - the stream can be a memory stream if you want to buffer the image in memory or a file stream if you want to write it to disk right away.
Using a BinaryWriter against the stream makes the conversions a lot simpler (and you can still manually convert parts of the data easily, if you need to do so).
Here's a short example:
var ms = new MemoryStream();
var bw = new BinaryWriter(ms, Encoding.ASCII);
bw.Write("DDS ".ToCharArray()); // Write(char[]) emits the raw characters; Write(string) would add a length prefix
bw.Write(124);                  // writes 4 bytes (little-endian int)
bw.Write((byte)124);            // writes 1 byte
// ...
Use whichever overload of Write() you need to output the right bytes. (This short example omits cleanup, but if you use a file stream you'll need to make sure it is properly closed.)
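To get the finished image out of the memory stream afterwards, something along these lines should work (continuing the sketch above; the file name is only a placeholder):

bw.Flush();                             // push anything still buffered into the stream
byte[] dds = ms.ToArray();              // the complete DDS file as a byte[]
File.WriteAllBytes("output.dds", dds);  // or hand the array to whatever needs it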

Weird behavior with a BinaryReader

I have a socket-based application that exposes received data with a BinaryReader object on the client side. I've been trying to debug an issue where the data contained in the reader is not clean... i.e. the buffer that I'm reading contains old data past the size of the new data.
In the code below:
System.Diagnostics.Debug.WriteLine("Stream length: {0}", _binaryReader.BaseStream.Length);
byte[] buffer = _binaryReader.ReadBytes((int)_binaryReader.BaseStream.Length);
When I comment out the first line, the data doesn't end up being dirty (or, doesn't end up being dirty as regularly) as when I have that print line statement. As far as I can tell, from the server side the data is coming in cleanly, so it's possible that my socket implementation has some issues. But does anyone have any idea why adding that print line would cause the data to be dirty more often?
Your binary reader looks like it is a private member variable (the leading underscore is a telltale sign).
Is your application multithreaded? You could be experiencing a race condition if another thread is also attempting to use your BinaryReader while you are reading from it. The fact that you experience issues even without that line seems quite suspect to me.
Are you sure that your reading logic is correct? Stream.Length indicates the length of the entire stream, not of the remaining data to be read.
Suppose that, initially, 100 bytes were available. Length is 100, and the BinaryReader correctly reads 100 bytes, advancing the stream position by 100. Then another 20 bytes arrive. Length is now 120; however, your BinaryReader should only be reading the 20 new bytes, not 120. The ‘extra’ 100 bytes requested in the second read would either cause it to block or (if the stream is not implemented correctly) break.
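In other words, on a seekable stream you would read only the bytes that haven't been consumed yet; a minimal sketch using the _binaryReader field from the question:

// Only valid if the underlying stream supports Length and Position
long remaining = _binaryReader.BaseStream.Length - _binaryReader.BaseStream.Position;
byte[] buffer = _binaryReader.ReadBytes((int)remaining);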
The problem was silly and unrelated. I believe my reading logic above is correct, however. The issue was that the _binaryReader I was using was a reference that was not owned by my class and hence the underlying stream was being rewritten with bad data.

How many bits does BinaryReader.PeekChar() read?

I am working on improving a stream reader class that uses a BinaryReader. It consists of a while loop that uses .PeekChar() to check if more data exists to continue processing.
The very first operation is a .ReadInt32() which reads 4 bytes. What if PeekChar only "saw" one byte (or one bit)? This doesn't seem like a reliable way of checking for EOF.
The BinaryReader is constructed using its default parameters, which as I understand it, uses UTF8 as the default encoding. I assume that .PeekChar() checks for 8 bits but I really am not sure.
How many bits does .PeekChar() look for? (and what are some alternate methods to checking for EOF?)
From the documentation for BinaryReader.PeekChar, I read:
ArgumentException: The current character cannot be decoded into the internal character buffer by using the Encoding selected for the stream.
This makes it clear that the amount of data read depends on the Encoding applied to that stream.
EDIT
Actually, the definition according to MSDN is:
Returns the next available character and does not advance the byte or character position.
In fact, whether that "character" is one byte or more depends on the encoding.
Hope this helps.
Making your Read*() calls blindly and handling any exceptions that are thrown is the normal method. I don't believe that the stream position is moved if anything goes wrong.
The PeekChar() method of BinaryReader is very buggy. Even when trying to read from a MemoryStream with UTF-8 encoded data, PeekChar() throws an exception after reading a particular length of the stream. The BCL team has acknowledged the issue, but they have not committed to resolving it. Their only response is to avoid using PeekChar() if you can.
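If the underlying stream is seekable, a common alternative to PeekChar() as an EOF check is to compare the stream's position against its length; a sketch, where reader stands in for your BinaryReader:

while (reader.BaseStream.Position < reader.BaseStream.Length)
{
    int value = reader.ReadInt32(); // read whole records rather than peeking at chars
    // ... process value ...
}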

C# performance methods of receiving data from a socket?

Let's assume we have a simple internet socket, and it's going to send 10 megabytes (because I want to ignore memory issues) of random data through.
Is there any performance difference or a best practice method that one should use for receiving data? The final output data should be represented by a byte[]. Yes I know writing an arbitrary amount of data to memory is bad, and if I was downloading a large file I wouldn't be doing it like this. But for argument's sake let's ignore that and assume it's a smallish amount of data. I also realise that the bottleneck here is probably not the memory management but rather the socket receiving. I just want to know what would be the most efficient method of receiving data.
A few dodgy ways I can think of are:
Have a List<byte> and a buffer; after the buffer is full, add its contents to the list, and at the end call list.ToArray() to get the byte[].
Write each buffer to a memory stream; after it's complete, construct a byte[] of stream.Length and read the stream into it to get the byte[] output.
Is there a more efficient/better way of doing this?
Just write to a MemoryStream and then call ToArray - that does the business of constructing an appropriately-sized byte array for you. That's effectively what a List<byte> would be like anyway, but using a MemoryStream will be a lot simpler.
Well, Jon Skeet's answer is great (as usual), but there's no code, so here's my interpretation. (Worked fine for me.)
using (var mem = new MemoryStream())
{
    using (var tcp = new TcpClient())
    {
        tcp.Connect(new IPEndPoint(IPAddress.Parse("192.0.0.192"), 8880));
        tcp.GetStream().CopyTo(mem);
    }
    var bytes = mem.ToArray();
}
(Why not combine the two usings? Well, if you want to debug, you might want to release the tcp connection before taking your time looking at the bytes received.)
This code will receive multiple packets and aggregate their data, FYI. So it's a great way to simply receive all tcp data sent during a connection.
What is the encoding of your data? Is it plain ASCII, or something else, like UTF-8/Unicode?
If it is plain ASCII, you could just allocate a StringBuilder of the required size (get the size from the Content-Length header of the response) and keep appending your data to the builder, after converting it into a string using Encoding.ASCII.
If it is Unicode/UTF-8 then you have an issue - you cannot just call the encoding's GetString(buffer, 0, bytesRead) on each read, because the bytes read might not constitute a logical string fragment in that encoding. For that case you will need to buffer the entire entity body into memory (or a file), then read it back and decode it using the encoding.
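That said, the Decoder class exists precisely to carry a partial multi-byte sequence from one read to the next, so you can also decode chunk by chunk without buffering the whole body. A minimal sketch, assuming stream is the socket stream and the data is UTF-8:

Decoder decoder = Encoding.UTF8.GetDecoder();
var text = new StringBuilder();
byte[] buffer = new byte[4096];
char[] chars = new char[Encoding.UTF8.GetMaxCharCount(buffer.Length)];
int bytesRead;
while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
{
    // The decoder keeps any trailing partial sequence and completes it
    // on the next call, so characters split across reads are handled correctly.
    int charCount = decoder.GetChars(buffer, 0, bytesRead, chars, 0);
    text.Append(chars, 0, charCount);
}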
You could write to a memory stream, then use a StreamReader or something like that to get the data. What are you doing with the data? I ask because it would be more efficient from a memory standpoint to write the incoming data to a file or database table as it is received, rather than storing the entire contents in memory.

Issue with BinaryReader.ReadChars()

I've run into what I believe is an issue with the BinaryReader.ReadChars() method. When I wrap a BinaryReader around a raw socket NetworkStream occasionally I get a stream corruption where the stream being read gets out of sync. The stream in question contains messages in a binary serialisation protocol.
I've tracked this down to the following
It only happens when reading a Unicode string (encoded using Encoding.BigEndianUnicode)
It only happens when the string in question is split across two TCP packets (confirmed using Wireshark)
I think what is happening is the following (in the context of the example below)
BinaryReader.ReadChars() is called asking it to read 3 characters (string lengths are encoded before the string itself)
First loop internally requests a read of 6 bytes (3 remaining characters * 2 bytes/char) off the network stream
Network stream only has 3 bytes available
3 bytes read into local buffer
Buffer handed to Decoder
Decoder decodes 1 char, and keeps the other byte in its own internal buffer
Second loop internally requests a read of 4 bytes! (2 remaining characters * 2 bytes/char)
Network stream has all 4 bytes available
4 bytes read into local buffer
Buffer handed to Decoder
Decoder decodes 2 chars, and keeps the remaining 4th byte internally
String decode is complete
Serialisation code attempts to unmarshal the next item and croaks because of stream corruption.
char[] buffer = new char[3];
int charIndex = 0;
Decoder decoder = Encoding.BigEndianUnicode.GetDecoder();

// pretend 3 of the 6 bytes arrive in one packet
byte[] b1 = new byte[] { 0, 83, 0 };
int charsRead = decoder.GetChars(b1, 0, 3, buffer, charIndex); // decodes 'S'; the trailing 0 stays buffered
charIndex += charsRead;

// pretend the remaining 3 bytes, plus a final byte belonging to
// something unrelated, arrive next
byte[] b2 = new byte[] { 71, 0, 114, 3 };
charsRead = decoder.GetChars(b2, 0, 4, buffer, charIndex); // decodes 'G' and 'r'; the 3 stays buffered
charIndex += charsRead;
I think the root cause is a bug in the .NET code, which uses charsRemaining * bytes/char on each loop iteration to calculate the remaining bytes required. Because of the extra byte hidden in the Decoder, this calculation can be off by one, causing an extra byte to be consumed from the input stream.
Here's the .NET framework code in question
while (charsRemaining > 0) {
    // We really want to know what the minimum number of bytes per char
    // is for our encoding. Otherwise for UnicodeEncoding we'd have to
    // do ~1+log(n) reads to read n characters.
    numBytes = charsRemaining;
    if (m_2BytesPerChar)
        numBytes <<= 1;
    numBytes = m_stream.Read(m_charBytes, 0, numBytes);
    if (numBytes == 0) {
        return (count - charsRemaining);
    }
    charsRead = m_decoder.GetChars(m_charBytes, 0, numBytes, buffer, index);
    charsRemaining -= charsRead;
    index += charsRead;
}
I'm not entirely sure whether this is a bug or just a misuse of the API. To work around the issue I'm calculating the required byte count myself, reading exactly that many bytes, and then running the byte[] through the relevant Encoding.GetString(). However, this wouldn't work for a variable-length encoding like UTF-8.
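For reference, the workaround looks roughly like this (a sketch with an illustrative reader; it assumes the length prefix counts UTF-16 code units):

int charCount = reader.ReadInt32();                      // length prefix, sent before the string
byte[] raw = reader.ReadBytes(charCount * 2);            // UTF-16 is a fixed 2 bytes per code unit
string value = Encoding.BigEndianUnicode.GetString(raw);
// Unlike ReadChars, ReadBytes loops internally until it has the requested
// count (or hits end of stream), so it never consumes an extra byte.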
I'd be interested to hear people's thoughts on this, and whether I'm doing something wrong or not. Maybe it will save the next person a few hours/days of tedious debugging.
EDIT: posted to Microsoft Connect - Connect tracking item
I have reproduced the problem you mentioned with BinaryReader.ReadChars.
Although the developer always needs to account for lookahead when composing things like streams and decoders, this seems like a fairly significant bug in BinaryReader because that class is intended for reading data structures composed of various types of data. In this case, I agree that ReadChars should have been more conservative in what it read to avoid losing that byte.
There is nothing wrong with your workaround of using the Decoder directly, after all that is what ReadChars does behind the scenes.
Unicode is a simple case. If you think about an arbitrary encoding, there really is no general purpose way to ensure that the correct number of bytes are consumed when you pass in a character count instead of a byte count (think about varying length characters and cases involving malformed input). For this reason, avoiding BinaryReader.ReadChars in favor of reading the specific number of bytes provides a more robust, general solution.
I would suggest that you bring this to Microsoft's attention via http://connect.microsoft.com/visualstudio.
Interesting; you could report this on Connect. As a stop-gap, you could also try wrapping with a BufferedStream, but I expect this is papering over a crack (it may still happen, but less frequently).
The other approach, of course, is to pre-buffer an entire message (but not the entire stream); then read from something like MemoryStream - assuming your network protocol has logical (and ideally length-prefixed, and not too big) messages. Then when it is decoding all the data is available.
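A sketch of that idea, assuming each message is prefixed with its byte length (all names are illustrative):

using System.IO;
using System.Text;

static BinaryReader BufferMessage(Stream networkStream)
{
    var prefixReader = new BinaryReader(networkStream);
    int length = prefixReader.ReadInt32(); // message byte length, sent first

    byte[] payload = new byte[length];
    int offset = 0;
    while (offset < length)
    {
        int read = networkStream.Read(payload, offset, length - offset);
        if (read == 0) throw new EndOfStreamException();
        offset += read;
    }

    // Every byte of the message is now in memory, so decoding it
    // (even with ReadChars) cannot run past the end of the message.
    return new BinaryReader(new MemoryStream(payload), Encoding.BigEndianUnicode);
}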
This reminds me of one of my own questions (Reading from a HttpResponseStream fails) where I had an issue that when reading from an HTTP response stream, the StreamReader would think it had hit the end of the stream prematurely, so my parsers would bomb out unexpectedly.
Like Marc suggested for your problem, I first tried pre-buffering in a MemoryStream. That works well, but it means you may have to wait a long time before you can do anything useful if you have a large file to read (especially from the network/web). I eventually settled on creating my own extension of TextReader which overrides the Read methods and defines them using the ReadBlock method (which does a blocking read, i.e. it waits until it can get exactly the number of characters you asked for).
Your problem is probably due, like mine, to the fact that Read methods aren't guaranteed to return the number of characters you ask for. For example, if you look at the documentation for the BinaryReader.Read (http://msdn.microsoft.com/en-us/library/ms143295.aspx) method, you'll see that it states:
Return Value
Type: System.Int32
The number of characters read into buffer. This might be less than the number of bytes requested if that many bytes are not available, or it might be zero if the end of the stream is reached.
Since BinaryReader has no ReadBlock method like a TextReader, all you can do is take your own approach of monitoring the position yourself, or Marc's approach of pre-caching.
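The idea looks roughly like this (a hypothetical reconstruction, not my original code):

using System.IO;

// Wraps a TextReader so that Read(buffer, index, count) blocks until it
// has 'count' characters or the stream genuinely ends.
class BlockingTextReader : TextReader
{
    private readonly TextReader _inner;
    public BlockingTextReader(TextReader inner) { _inner = inner; }

    public override int Read(char[] buffer, int index, int count)
    {
        // ReadBlock loops internally until 'count' chars are read or EOF
        return _inner.ReadBlock(buffer, index, count);
    }

    public override int Read() { return _inner.Read(); }
    public override int Peek() { return _inner.Peek(); }
}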
I'm working with Unity3D/Mono at the moment, and the ReadChars method might contain even more errors there. I made a string like this:
mat.name = new string(binaryReader.ReadChars(64));
mat.name even contained the correct string, but I could only prepend strings to it; everything appended after the string just disappeared, even with String.Format. My solution so far is not to use the ReadChars method, but to read the data as a byte array and convert it to a string:
byte[] str = binaryReader.ReadBytes(64);
int lengthOfStr = Array.IndexOf(str, (byte)0); // e.g. 4 for "clip\0"
mat.name = System.Text.ASCIIEncoding.Default.GetString(str, 0, lengthOfStr);
