Need help understanding Stream.Read() - c#

I am a little confused as to the steps of reading a file into buffer gradually.
from the MSDN docs
public abstract int Read(
byte[] buffer,
int offset,
int count
)
source from C# Examples
FileStream fileStream = new FileStream(filePath, FileMode.Open, FileAccess.Read);
try
{
int length = (int)fileStream.Length; // get file length
buffer = new byte[length]; // create buffer
int count; // actual number of bytes read
int sum = 0; // total number of bytes read
// read until Read method returns 0 (end of the stream has been reached)
while ((count = fileStream.Read(buffer, sum, length - sum)) > 0)
sum += count; // sum is a buffer offset for next reading
can I say that the line fileStream.Read(buffer, sum, length - sum) reads as "Read fileStream from sum (offset) to length - sum (total bytes to read) into buffer". OK so at the start, when sum = 0, I will effectively read the whole fileStream into buffer in 1 short, but I think this is not the case. Maybe Read() reads whatever it possibly can into buffer? then returns so that you can Read() it again? I am a little confused.

Read will read whatever's available (blocking until something is available) but there may not be enough data ready to fill the buffer to start with.
Imagine downloading data over the network - there may be megabytes of data to download, but only some of it is available to start with. So you need to keep calling Read() until you've read as much as you want.
Stream.Read will read at most the number of bytes you've asked for, but can easily read less. Admittedly for local file streams I suspect it always reads as much as you've asked for unless the file is shorter, but that's not true of streams in general, and I don't believe it's guaranteed even for local file streams.

The Read method will read at least one byte, and at most the number of bytes specified.
The method will usually return all data that is currently available. If the stream is for example coming over the internet it will usually return what it has recieved to far, and for a file stream it will usually return the entire file.
However, it's up to the implementation to decide what the best behaviour is. It might for example first return what it can get from the file cache, which can be returned immediately, and let you do another call to get the data that requires an actual disk read.
When using the Read method, you should always use a loop so that you are sure to get all the data. It might not seem neccesary if the first call appears to always return all data, but there might be situations where it doesn't.

From MSDN:
When overridden in a derived class, reads a sequence of bytes from the current stream and advances the position within the stream by the number of bytes read.
Return Value
Type: System.Int32
The total number of bytes read into the buffer. This can be less than the number of bytes requested if that many bytes are not currently available, or zero (0) if the end of the stream has been reached.

Related

Best way to read a stream to a delimiter and no farther

My code has to consume data from a NetworkStream, and the data read from the stream will contain three parts: metadata, a well-known delimiter, and data.
I'm trying to determine the most efficient way of reading from the NetworkStream, up to the end of the delimiter. The metadata portion is generally measured in hundreds of bytes (but could be as small as 32 bytes), the delimiter is a specific 2-byte sequence, and the data could range from zero bytes to several gigabytes in size (the metadata provides information on the data length). I should only read up to the delimiter, because the rest of the stream (containing payload data) needs to be used elsewhere, and NetworkStream doesn't support seek and the data may be so large that I can't dump it all into a MemoryStream.
I've been using the following, and it works, but it seems there could be a more efficient way of reading up to the delimiter. Since the minimum metadata size is 32 bytes, I start with a 34-byte buffer (32 bytes of metadata + 2 bytes delimiter), read from the stream, and check for the delimiter. If the delimiter is found (smallest possible metadata), the code then breaks and the balance of the stream contains the data. If the delimiter is not found, the code then loops reading a single byte at a time, checking the last two bytes of the StringBuilder used to hold what has been read from the stream, until the delimiter is found at the end.
(code reduced for brevity, removed checking of negative cases, etc)
string delim = "__";
StringBuilder sb = new StringBuilder();
byte[] buffer = new byte[1];
byte[] initialBuffer = new byte[34];
int bytesRead = stream.Read(initialBuffer, 0, 34); // yes I check bytesRead in the actual code
sb.Append(Encoding.UTF8.GetString(initialBuffer);
while (true)
{
string delimCheck = sb.ToString((sb.Length - 2), 2);
if (delimCheck.Equals(delim)) break;
else
{
buffer = new byte[1];
bytesRead = stream.Read(buffer, 0, 1); // yes I check bytesRead in the actual code
sb.Append(Encoding.UTF8.GetString(buffer));
}
}
The code works, but it seems really inefficient and slow to read one byte at a time to reach the end of the delimiter. Is anything readily apparent that might better optimize this code?
Thanks!
Do you see those Read(array, offset, count) return values you are putting into a variable bytesRead and then happily ignoring?
Those (along with setting the socket in non-blocking mode) are the solution to your problem. Then you can access "everything received so far" without getting stuck waiting for enough extra data to arrive to fill your array.
Even in blocking mode, ignoring that return value is a bug, because when the socket is gracefully shut down, you will get a partial read where bytesRead < bytesRequested
Regarding your concerns about how to save the extra data for later, Microsoft provided a class for that. See System.IO.BufferedStream and the example:
The following code examples show how to use the BufferedStream class over the NetworkStream class to increase the performance of certain I/O operations. Start the server on a remote computer before starting the client. Specify the remote computer name as a command-line argument when starting the client. Vary the dataArraySize and streamBufferSize constants to view their effect on performance.
Source: https://learn.microsoft.com/en-us/dotnet/api/system.io.bufferedstream
Not shown in the example is that you still need to put the socket into non-blocking mode to avoid having the BufferedStream block until an entire buffer chunk is received. The Socket class provides the Blocking property to make that easy.
https://learn.microsoft.com/en-us/dotnet/api/system.net.sockets.socket.blocking

When does Stream.Read returns?

Consider this simple code:
public void TransferStream(Stream source, Stream target)
{
Int32 read = -1;
Byte[] buffer = new Byte[4096];
do
{
read = source.Read(buffer, 0, buffer.Length);
target.Write(buffer, 0, read);
}
while (read != 0);
}
Imagine that source is a NetworkStream, and the information is arriving little by little.
source.Read can return with a full buffer, or with a partial one, the read will tell how much was read.
When does source.Read return if there are no enough data to fill the buffer? When it is enough to return with a partial buffer?
The documentation for `Stream.Read()' states:
Implementations of this method read a maximum of count bytes from the
current stream and store them in buffer beginning at offset. The
current position within the stream is advanced by the number of bytes
read; however, if an exception occurs, the current position within the
stream remains unchanged.
Implementations return the number of bytes
read. The implementation will block until at least one byte of data
can be read, in the event that no data is available. Read returns 0
only when there is no more data in the stream and no more is expected
(such as a closed socket or end of file). An implementation is free to
return fewer bytes than requested even if the end of the stream has
not been reached.
So according to that, the NetworkStream should block until either it is closed or until at least one byte is available. Whether it actually does or not is a different matter - but I would hope it does.
It can definitely return a partial buffer.
However more generally, I did discover in the past that if you have a file opened in shared mode, and something else is writing to that file while you are reading from it then the File.Read() can return zero even though data is still being written to it (and a subsequent read returned non-zero) - so watch out for that kind of thing. I think NetworkStream won't suffer from that problem though. Nevertheless, it might be a good idea to write a test program to determine that for sure!
[EDIT]
I looked at the source code for it, and ultimately it ends up calling the Windows API recv() function. I know that can return partial buffers, so we must assume that NetworkStream.Read() can also return partial buffers.

What does BinaryReader do if the bytes I am reading aren't present yet?

I am using a BinaryReader on top of a NetworkStream to read data off of a network. This has worked really well for me, but I want to understand what's going on behind the scenes, so I took a look at the documentation for BinaryReader and found it to be extremely sparse.
My question is this: What will BinaryReader.ReadBytes(bufferSize) do if bufferSize bytes are not present on the network stream when I call ReadBytes?
In my mind there are a few options:
1) Read any bytes that are present on the network stream and return only that many
2) Wait until bufferSize bytes are present on the stream, then read
3) Throw an exception
I assume option 2 is happening, since I've never received any exceptions and all my data is received whole, not in pieces. However, I would like to know for sure what is going on. If someone could enlighten me, I would be grateful.
I believe it actually goes for hidden option 4:
Read the data as it becomes available, looping round in the same way that you normally would do manually. It will only return a value less than the number of bytes you asked for if it reaches the end of the stream while reading.
This is subtly different from your option 2 as it does drain the stream as data becomes available - it doesn't wait until it could read all of the data in one go.
It's easy to show that it does return a lower number of bytes than you asked for if it reaches the end:
var ms = new MemoryStream(new byte[10]);
var readData = new BinaryReader(ms).ReadBytes(100);
Console.WriteLine(readData.Length); // 10
It's harder to prove the looping part, without a custom stream which would explicitly require multiple Read calls to return all the data.
The documentation isn't as clear as it might be, but the return value part is at least somewhat helpful:
A byte array containing data read from the underlying stream. This might be less than the number of bytes requested if the end of the stream is reached.
Note the final part that I've highlighted, and compare that with Stream.Read:
The total number of bytes read into the buffer. This can be less than the number of bytes requested if that many bytes are not currently available, or zero (0) if the end of the stream has been reached.
If you're expecting an exact amount of data and only that amount will be useful, I suggest you write a ReadExactly method which calls Read and throws EndOfStreamException if you need more data than the stream provided before it was closed.
If, by “present on the stream”, you’re asking whether the method would block until the specified number of bytes are available, then it is option 2. It would only return a smaller amount of bytes if the end of the stream is reached.
Here is some sample code on how BinaryReader.ReadBytes(int) may be implemented:
byte[] ReadBytes(int count)
{
byte[] buffer = new byte[count];
int total = 0;
int read = 0;
do
{
read = stream.Read(buffer, read, count - total);
total += read;
}
while (read > 0 && total < count);
// Resize buffer if smaller than count (code not shown).
return buffer;
}

Is there any reason GZipStream.Read would return less than the number of requested bytes?

The docs say:
Return Value:
The total number of bytes read into the buffer. This can be less than the number of bytes requested if that many bytes are not currently available, or zero (0) if the end of the stream has been reached.
But why would the bytes "not be available" when reading from disk?
Let me clarify a bit:
I'm reading from disk (underlying type is FileStream)
There are at least N bytes left to be read (before EOF)
I request to read N bytes
Will the return value/number of bytes read ever be less than N in this scenario?
In answer to your edited question:
Will the return value/number of bytes read ever be less than N in this scenario?
I think you need to ask a hardware expert, and I suppose this isn't the right forum to attract the attention of such a person.
Disclaimer: I'm not a hardware expert, and I don't play one on TV. This is just speculation:
I think that when you're reading from disk, the only reason you'd get fewer bytes than you request is because the stream has run out of bytes to give you. However, it's conceivable that you might have a situation similar to a network stream, where your program is reading bytes faster than the hardware can provide them. In that case, the Read method would presumably populate the buffer only partially and then return.
Obviously, the answer to the question depends on whether such a situation could occur. I think the answer is "no". I have certainly never seen a counterexample. But it would be a mistake to write code depending on that assumption.
Consider: even if you could examime the specifications of all the hardware your code will run on, and prove that the buffer will always be completely filled until the end of the stream is reached, there's no saying what new disk drive somebody might install on the machine in the future, that might behave differently. It's much simpler just to treat all streams the same, and undertake the modest amount of work required to handle the possibility that the buffer comes back incompletely filled.
My solution:
public static class StreamExt
{
public static void ReadBytes(this Stream stream, byte[] buffer, int offset, int count)
{
int totalBytesRead = 0;
while (totalBytesRead < count)
{
int bytesRead = stream.Read(buffer, offset + totalBytesRead, count - totalBytesRead);
if (bytesRead == 0) throw new IOException("Premature end of stream");
totalBytesRead += bytesRead;
}
}
}
Using this method should safely read in all the bytes you requested.
Compression streams are wrapper stream - so behavior depends on behavior of underlying stream. As result it has the same restriction as generic stream "can return less bytes than requested".
As for potential reasons (see also provided by Lloyd)
EOF reached on last read
No data in network stream yet
Custom stram decided to return data in fixed size chunks (perfectly ok).
Stream.Read requires that you pre-allocate the space to read into, and if you allocate more space then can be read from the stream, then the return value is its way of telling you how much of it has been used.
Consider the following:
If you allocated a buffer of say 4096 bytes and the EOF is reached at 2046, then the return value would only be 2046. This allows you to know how much of your buffer is full on return.

Why I need to read file piece by piece to buffer?

I have seen following code for getting the file into array, which is in turn used as a parameter for SQL command inserting it into a blob column:
using (FileStream fs = new FileStream(soubor,FileMode.Open,FileAccess.Read))
int length = (int)fs.Length;
buffer = new byte[length];
int count;
int sum = 0;
while ((count = fs.Read(buffer, sum, length - sum)) > 0)
sum += count;
Why I cannot simply do that:
fs.Read(buffer, 0, length) in order to just copy content of file to the buffer?
Thanks
There's more to it than just "the file may not fit in memory". The contract for Stream.Read explicitly says:
Implementations of this method read a
maximum of count bytes from the
current stream and store them in
buffer beginning at offset. The
current position within the stream is
advanced by the number of bytes read;
however, if an exception occurs, the
current position within the stream
remains unchanged. Implementations
return the number of bytes read. The
return value is zero only if the
position is currently at the end of
the stream. The implementation will
block until at least one byte of data
can be read, in the event that no data
is available. Read returns 0 only when
there is no more data in the stream
and no more is expected (such as a
closed socket or end of file). An
implementation is free to return fewer
bytes than requested even if the end
of the stream has not been reached.
Note the last sentence - you can't rely on a single call to Stream.Read to read everything.
The docs for FileStream.Read have a similar warning:
The total number of bytes read into
the buffer. This might be less than
the number of bytes requested if that
number of bytes are not currently
available, or zero if the end of the
stream is reached.
For a local file system I don't know for sure whether this will ever actually happen - but it could do for a network mounted file. Do you want your app to brittle in that way?
Reading in a loop is the robust way to do things. Personally I prefer not to require the stream to support the Length property, either:
public static byte[] ReadFully(Stream stream)
{
byte[] buffer = new byte[8192];
using (MemoryStream tmpStream = new MemoryStream())
{
int bytesRead;
while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
{
tmpStream.Write(buffer, 0, bytesRead);
}
return tmpStream.ToArray();
}
}
That is slightly less efficient when the length is known beforehand, but it's nice and simple. You only need to implement it once, put it in a utility library, and call it whenever you need to. If you really mind the efficiency loss, you could use CanSeek to test whether the Length property is supported, and repeatedly read into a single buffer in that case. Be aware of the possibility that the length of the stream could change while you're reading though...
Of course, File.ReadAllBytes will do the trick even more simply when you only need to deal with a file rather than a general stream.
Because your file could be very large and the buffer has usually a fixed size of 4-32 KB. This way you know you're not filling your memory unnessecarily.
Of course, if you KNOW the size of your file is not too large or if you store the contents in memory anyways, there is no reason not to read it all in one shot.
Although, if you want to read the contents of your file directly into a variable, you don't need the Stream API. Rather use
File.ReadAllText(...)
or
File.ReadAllBytes(...)
A simple fs.Read(buffer, 0, length) will probably work, and it will even be hard to find a test to break it. But it simply is not guaranteed, and it might break in the future.
The best answer here is to use a specialized method from the library. In this case
byte[] buffer = System.IO.File.ReadAllBytes(fileName);
A quick look with Reflector confirms that this will get you the partial-buffer logic and the exception-safe Dispose() of your stream.
And when future versions of the Framework allow for better ways to do this your code will automatically profit.

Categories

Resources