So I'm reading a file which could be encoded in any encoding. But for this example lets say UTF-16. I need to read the file in BYTES (So using FileStream, not StreamReader), AND in chunks of 1MB, and then convert the UTF-16 byte buffer into a UTF8 byte buffer.
What I'm doing right now:
char[] charBuffer = new char[bufferSize];
Encoding.Unicode.GetChars(utf16Buffer, 0, read, charBuffer, 0);
byte[] utf8Array = new byte[Encoding.UTF8.GetByteCount(charBuffer, 0, charsRead)];
int numBytes = Encoding.UTF8.GetBytes(charBuffer, 0, charsRead, utf8Array, 0);
//Do something with utf8Array
//This is what Encoding.Convert does in the background.
This isn't actually that slow, but I was wondering if there was a faster way. Thanks.
Yes, there is a faster way. You can use multiple threads to process your chunks. To avoid ruining the order for the buffers, you need to pass on the index of the buffer to each thread, and have them edit the same collection using that thread.
Related
When working with binary streams (i.e. byte[] arrays), the main point of using BinaryReader or BinaryWriter seems to be simplified reading/writing of primitive data types from a stream, using methods such as ReadBoolean() and taking encoding into account. Is that the whole story? Is there an inherent advantage or disadvantage if one works directly with a Stream, without using BinaryReader/BinaryWriter? Most methods, such as Read(), seem to be the same in both classes, and my guess is that they work identically underneath.
Consider a simple example of processing a binary file in two different ways (edit: I realize this way is ineffective and that a buffer can be used, it's just a sample):
// Using FileStream directly
using (FileStream stream = new FileStream("file.dat", FileMode.Open))
{
// Read bytes from stream and interpret them as ints
int value = 0;
while ((value = stream.ReadByte()) != -1)
{
Console.WriteLine(value);
}
}
// Using BinaryReader
using (BinaryReader reader = new BinaryReader(FileStream fs = new FileStream("file.dat", FileMode.Open)))
{
// Read bytes and interpret them as ints
byte value = 0;
while (reader.BaseStream.Position < reader.BaseStream.Length)
{
value = reader.ReadByte();
Console.WriteLine(Convert.ToInt32(value));
}
}
The output will be the same, but what's happening internally (e.g. from OS perspective)? Is it - generally speaking - important which implementation is used? Is there any purpose to using BinaryReader/BinaryWriter if you don't need the extra methods that they provide?
For this specific case, MSDN says this in regard to Stream.ReadByte():
The default implementation on Stream creates a new single-byte array
and then calls Read. While this is formally correct, it is
inefficient.
Using GC.GetTotalMemory(), this first approach does seem to allocate 2x as much space as the second one, but AFAIK this shouldn't be the case if a more general Stream.Read() method is used (e.g. for reading in chunks using a buffer). Still, it seems to me that these methods/interfaces could be unified easily...
No, there is no principal difference between the two approaches. The extra Reader adds some buffering so you shouldn't mix them. But don't expect any significant performance differences, it's all dominated by the actual I/O.
So,
use a stream when you have (only) byte[] to move. As is common in a lot of streaming scenarios.
use BinaryWriter and BinaryReader when you have any other basic type (including simple byte) of data to process. Their main purpose is conversion of the built-in framework types to byte[].
One big difference is how you can buffer the I/O. If you are writing/reading only a few bytes here or there, BinaryWriter/BinaryReader will work well. But if you have to read MBs of data, then reading one byte, Int32, etc... at a time will be a bit slow. You could instead read larger chunks and parse from there.
Example:
// Using FileStream directly with a buffer
using (FileStream stream = new FileStream("file.dat", FileMode.Open))
{
// Read bytes from stream and interpret them as ints
byte[] buffer = new byte[1024];
int count;
// Read from the IO stream fewer times.
while((count = stream.Read(buffer, 0, buffer.Length)) > 0)
for(int i=0; i<count; i++)
Console.WriteLine(Convert.ToInt32(buffer[i]));
}
Now this is a bit off topic... but I'll throw it out there:
If you wanted to get VERY crafty... and really give yourself a performance boost... (Albeit, it might be considered dangerous) Instead of parsing EACH Int32, you could do them all at once using Buffer.BlockCopy()
Another example:
// Using FileStream directly with a buffer and BlockCopy
using (FileStream stream = new FileStream("file.dat", FileMode.Open))
{
// Read bytes from stream and interpret them as ints
byte[] buffer = new byte[1024];
int[] intArray = new int[buffer.Length >> 2]; // Each int is 4 bytes
int count;
// Read from the IO stream fewer times.
while((count = stream.Read(buffer, 0, buffer.Length)) > 0)
{
// Copy the bytes into the memory space of the Int32 array in one big swoop
Buffer.BlockCopy(buffer, 0, intArray, count);
for(int i=0; i<count; i+=4)
Console.WriteLine(intArray[i]);
}
}
A few things to note about this example: This one takes 4 bytes per Int32 instead of one... So it will yield different results. You can also do this for other data types other than Int32, but many would argue that marshalling should be on your mind. (I just wanted to present something to think about...)
Both your codes are doing the same ie. ReadByte(), which ends you up with a byte array, so the result from either method is same (from the same file).
The OS implementation (internal difference) is that streams are buffered in virtual memory eg. if you were transporting a file over a network via a stream, you will still have remaining system memory left for other (multi)task(ing).
In case of byte arrays, the whole file will be stored in memory before being transferred to disk (file create) or another stream, hence is not recommended for large files.
There's some discussion of binary data transfer over a network here:
When to use byte array, and when to use stream?
#Jason C & #Jon Skeet makes a good point here:
Why do most serializers use a stream instead of a byte array?
I have noticed my Win 10 machine (4GB RAM) sometimes skips files over 5MB if I transport a file via System.Net.Http.Httpclient GetByteArrayAsync method (vs GetStreamAsync) when I continue working on it, without waiting for transfer to complete.
PS: .Net 4.0 byte array is limited to 2GB
For work, the specification on my project is to use .Net 2.0 so I don't get the handy CopyTo function brought about later on.
I need to copy the response stream from an HttpWebResponse to another stream (most likely a MemoryStream, but it could be any subclass of Stream). My normal tactic has been something along the lines of:
BufferedStream bufferedresponse = new BufferedStream(HttpResponse.GetResponseStream());
int count = 0;
byte[] buffer = new byte[1024];
do {
count = bufferedresponse.Read(buffer, 0, buffer.Length);
target.Write(buffer, 0, count);
} while (count > 0);
bufferedresponse.Close();
Are there more efficient ways to do this? Does the size of the buffer really matter? What is the best way to copy from one stream to another in .Net 2.0?
P.S. This is for downloading large, 200+ MB GIS tif images. Of course reliability is paramount.
This is a handy function. And yes, the buffer size matters. Increasing it might give you better performance on large files.
public static void WriteTo(Stream sourceStream, Stream targetStream)
{
byte[] buffer = new byte[0x10000];
int n;
while ((n = sourceStream.Read(buffer, 0, buffer.Length)) != 0)
targetStream.Write(buffer, 0, n);
}
The size of the buffer does matter. If you're copying a megabyte of data one byte at a time, for example, you're going to make a 2^20 iterations through the loop. If you're copying 1 kilobyte at a time, you'll only make 2^10 iterations through the loop. There is significant overhead in the calls to Read and Write when you're making a million of them.
For reading FileStream, I typically use a buffer that's between 64K and 256K. Anything less than 32K shows a marked decrease in performance, as does anything above 256K. The difference between using a 64K buffer and a 256K buffer is not worth the extra memory. Be aware, though, that those numbers are on my system and across my network. Your numbers will vary depending on hardware and operating system.
For network streams, you should select a buffer size that will keep up with the incoming data stream. I'd suggest at least 4 kilobytes, which will give you some buffer if the write stalls for any reason.
You can get rid of the BufferedStream, it's only useful if you are reading small chunks of the stream. Just get the response stream in a variable and use that:
Stream response = HttpResponse.GetResponseStream();
A buffer that is too small will reduce the performance. You can use a larger buffer, so that at least the data from an entire IP packet fits. I looked around a bit, and 4096 bytes should be enough for that. You can really use any size up to 85 kb, after that it's allocated in the large objects heap, which you should avoid to do when there is no reason for it.
Other than that, it's about as efficient as it gets.
I can think of two ways.
Check if Stream.MemberWiseClone() fits your need. It gets you a shallow copy of your object.
Check if this works, if both the ends are of the Stream type.
BufferedStream bs = new BufferedStream((Stream)memoryStreamObject);
I have seen following code for getting the file into array, which is in turn used as a parameter for SQL command inserting it into a blob column:
using (FileStream fs = new FileStream(soubor,FileMode.Open,FileAccess.Read))
int length = (int)fs.Length;
buffer = new byte[length];
int count;
int sum = 0;
while ((count = fs.Read(buffer, sum, length - sum)) > 0)
sum += count;
Why I cannot simply do that:
fs.Read(buffer, 0, length) in order to just copy content of file to the buffer?
Thanks
There's more to it than just "the file may not fit in memory". The contract for Stream.Read explicitly says:
Implementations of this method read a
maximum of count bytes from the
current stream and store them in
buffer beginning at offset. The
current position within the stream is
advanced by the number of bytes read;
however, if an exception occurs, the
current position within the stream
remains unchanged. Implementations
return the number of bytes read. The
return value is zero only if the
position is currently at the end of
the stream. The implementation will
block until at least one byte of data
can be read, in the event that no data
is available. Read returns 0 only when
there is no more data in the stream
and no more is expected (such as a
closed socket or end of file). An
implementation is free to return fewer
bytes than requested even if the end
of the stream has not been reached.
Note the last sentence - you can't rely on a single call to Stream.Read to read everything.
The docs for FileStream.Read have a similar warning:
The total number of bytes read into
the buffer. This might be less than
the number of bytes requested if that
number of bytes are not currently
available, or zero if the end of the
stream is reached.
For a local file system I don't know for sure whether this will ever actually happen - but it could do for a network mounted file. Do you want your app to brittle in that way?
Reading in a loop is the robust way to do things. Personally I prefer not to require the stream to support the Length property, either:
public static byte[] ReadFully(Stream stream)
{
byte[] buffer = new byte[8192];
using (MemoryStream tmpStream = new MemoryStream())
{
int bytesRead;
while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
{
tmpStream.Write(buffer, 0, bytesRead);
}
return tmpStream.ToArray();
}
}
That is slightly less efficient when the length is known beforehand, but it's nice and simple. You only need to implement it once, put it in a utility library, and call it whenever you need to. If you really mind the efficiency loss, you could use CanSeek to test whether the Length property is supported, and repeatedly read into a single buffer in that case. Be aware of the possibility that the length of the stream could change while you're reading though...
Of course, File.ReadAllBytes will do the trick even more simply when you only need to deal with a file rather than a general stream.
Because your file could be very large and the buffer has usually a fixed size of 4-32 KB. This way you know you're not filling your memory unnessecarily.
Of course, if you KNOW the size of your file is not too large or if you store the contents in memory anyways, there is no reason not to read it all in one shot.
Although, if you want to read the contents of your file directly into a variable, you don't need the Stream API. Rather use
File.ReadAllText(...)
or
File.ReadAllBytes(...)
A simple fs.Read(buffer, 0, length) will probably work, and it will even be hard to find a test to break it. But it simply is not guaranteed, and it might break in the future.
The best answer here is to use a specialized method from the library. In this case
byte[] buffer = System.IO.File.ReadAllBytes(fileName);
A quick look with Reflector confirms that this will get you the partial-buffer logic and the exception-safe Dispose() of your stream.
And when future versions of the Framework allow for better ways to do this your code will automatically profit.
I've run into what I believe is an issue with the BinaryReader.ReadChars() method. When I wrap a BinaryReader around a raw socket NetworkStream occasionally I get a stream corruption where the stream being read gets out of sync. The stream in question contains messages in a binary serialisation protocol.
I've tracked this down to the following
It only happens when reading a unicode string (encoded using the Encoding.BigEndian)
It only happens when the string in question is split across two tcp packets (confirmed using wireshark)
I think what is happening is the following (in the context of the example below)
BinaryReader.ReadChars() is called asking it to read 3 characters (string lengths are encoded before the string itself)
First loop internally requests a read of 6 bytes (3 remaining characters * 2 bytes/char) off the network stream
Network stream only has 3 bytes available
3 bytes read into local buffer
Buffer handed to Decoder
Decoder decodes 1 char, and keeps the other byte in it's own internal buffer
Second loop internally requests a read of 4 bytes! (2 remaining characters * 2 bytes/char)
Network stream has all 4 bytes available
4 bytes read into local buffer
Buffer handed to Decoder
Decoder decodes 2 char, and keeps the remaining 4th bytes internally
String decode is complete
Serialisation code attempts to unmarshal the next item and croaks because of stream corruption.
char[] buffer = new char[3];
int charIndex = 0;
Decoder decoder = Encoding.BigEndianUnicode.GetDecoder();
// pretend 3 of the 6 bytes arrives in one packet
byte[] b1 = new byte[] { 0, 83, 0 };
int charsRead = decoder.GetChars(b1, 0, 3, buffer, charIndex);
charIndex += charsRead;
// pretend the remaining 3 bytes plus a final byte, for something unrelated,
// arrive next
byte[] b2 = new byte[] { 71, 0, 114, 3 };
charsRead = decoder.GetChars(b2, 0, 4, buffer, charIndex);
charIndex += charsRead;
I think the root is a bug in the .NET code which uses charsRemaining * bytes/char each loop to calculate the remaining bytes required. Because of the extra byte hidden in the Decoder this calculation can be off by one causing an extra byte to be consumed off the input stream.
Here's the .NET framework code in question
while (charsRemaining>0) {
// We really want to know what the minimum number of bytes per char
// is for our encoding. Otherwise for UnicodeEncoding we'd have to
// do ~1+log(n) reads to read n characters.
numBytes = charsRemaining;
if (m_2BytesPerChar)
numBytes <<= 1;
numBytes = m_stream.Read(m_charBytes, 0, numBytes);
if (numBytes==0) {
return (count - charsRemaining);
}
charsRead = m_decoder.GetChars(m_charBytes, 0, numBytes, buffer, index);
charsRemaining -= charsRead;
index+=charsRead;
}
I'm not entirely sure if this is a bug or just a misuse of the API. To work round this issue I'm just calculating the bytes required myself, reading them, and then running the byte[] through the relevant Encoding.GetString(). However this wouldn't work for something like UTF-8.
Be interested to hear people's thoughts on this and whether I'm doing something wrong or not. And maybe it will save the next person a few hours/days of tedious debugging.
EDIT: posted to connect Connect tracking item
I have reproduced the problem you mentioned with BinaryReader.ReadChars.
Although the developer always needs to account for lookahead when composing things like streams and decoders, this seems like a fairly significant bug in BinaryReader because that class is intended for reading data structures composed of various types of data. In this case, I agree that ReadChars should have been more conservative in what it read to avoid losing that byte.
There is nothing wrong with your workaround of using the Decoder directly, after all that is what ReadChars does behind the scenes.
Unicode is a simple case. If you think about an arbitrary encoding, there really is no general purpose way to ensure that the correct number of bytes are consumed when you pass in a character count instead of a byte count (think about varying length characters and cases involving malformed input). For this reason, avoiding BinaryReader.ReadChars in favor of reading the specific number of bytes provides a more robust, general solution.
I would suggest that you bring this to Microsoft's attention via http://connect.microsoft.com/visualstudio.
Interesting; you could report this on "connect". As a stop-gap, you could also try wrapping with BufferredStream, but I expect this is papering over a crack (it may still happen, but less frequently).
The other approach, of course, is to pre-buffer an entire message (but not the entire stream); then read from something like MemoryStream - assuming your network protocol has logical (and ideally length-prefixed, and not too big) messages. Then when it is decoding all the data is available.
This reminds of one of my own questions (Reading from a HttpResponseStream fails) where I had an issue that when reading from a HTTP response stream the StreamReader would think it had hit the end of the stream prematurely so my parsers would bomb out unexpectedly.
Like Marc suggested for your problem I first tried pre-buffering in a MemoryStream which works well but means you may have to wait a long time if you have a large file to read (especially from the network/web) before you can do anything useful with it. I eventually settled on creating my own extension of TextReader which overrides the Read methods and defines them using the ReadBlock method (which does a blocking read i.e. it waits until it can get exactly the number of characters you ask for)
Your problem is probably due like mine to the fact that Read methods aren't guarenteed to return the number of characters you ask for, for example if you look at the documentation for the BinaryReader.Read (http://msdn.microsoft.com/en-us/library/ms143295.aspx) method you'll see that it states:
Return Value
Type: System..::.Int32
The number of characters read into buffer. This might be less than the number of bytes requested if that many bytes are not available, or it might be zero if the end of the stream is reached.
Since BinaryReader has no ReadBlock methods like a TextReader all you can do is take your own approach of monitoring the position yourself or Marc's of pre-caching.
I'm working with Unity3D/Mono atm and the ReadChars-method might even contain more errors. I made a string like this:
mat.name = new string(binaryReader.ReadChars(64));
mat.name even contained the correct string, but I could just add strings before it. Everything after the string just disappered. Even with String.Format. My solution so far is not using the ReadChars-method, but read the data as byte array and convert it to a string:
byte[] str = binaryReader.ReadBytes(64);
int lengthOfStr = Array.IndexOf(str, (byte)0); // e.g. 4 for "clip\0"
mat.name = System.Text.ASCIIEncoding.Default.GetString(str, 0, lengthOfStr);
I am trying to write an audio file using C#'s System.IO.BinaryWriter class.
It works fine when I perform this action:
File.WriteAllBytes(#"C:\TestFile\File1", br.ReadBytes((int)br.BaseStream.Length));
However it does not work when I do the following:
Encoding e = Encoding.ASCII;
char[] reslts = e.GetChars(br.ReadBytes((int)br.BaseStream.Length));
br.BaseStream.Position = 0;
File.WriteAllBytes((#"C:\TestFile\File2",e.GetBytes(reslts));
I checked out the byte arrays, in both cases the array contained the same values ( I copied their value into a separate excel file and checked to see if there are any differences, there are none).
In both cases the base stream position is set to 0.
The code itself works, the file is created successfully. The problem occurs when I open the file, the first file works fine, the second file is not recognized by Windows Media Player.
Does anyone have any ideas of what I could be doing wrong?
Encoding e = Encoding.ASCII;
That looks suspicious. I know not of any audio format that is encoded using ASCII.
Never, ever read arbitrary binary data as if it's text. It always goes wrong, because they're just not the same thing.
Also, don't rely on a stream's length - it could change, or it might not support fetching it anyway.
You want something like this:
public static byte[] CopyStream(Stream input, Stream output)
{
byte[] buffer = new byte[16*1024];
int read;
while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
{
output.Write(buffer, 0, read);
}
}
Then just open a FileStream to wherever you want to write the data, and call
CopyStream(inputStream, outputFile);
You haven't shown where you're getting the input from, but basically you don't really need the BinaryReader. It's not like you're using BinaryReader for its normal purpose, which is to read individual primitive values, length-prefixed strings etc. You just want to copy data verbatim - and a Stream is fine for that.
With most encodings, including Encoding.ASCII, converting bytes to chars and back again will not give you back the same data. For example, IIRC Encoding.ASCII maps all bytes in the range 128-255 to the '?' character. Then when you convert back to bytes, they will all convert back to the same value.
An exception is the Latin1 aka ISO-8859-1 aka Code Page 28591 encoding, which will map all bytes with hex values are in the range 0-255 to the Unicode character with the same hex value. The following should give the result you want:
Encoding e = Encoding.GetEncoding(28591);
// or either of the following is equivalent
// Encoding.GetEncoding("Latin1");
// Encoding.GetEncoding("ISO-8859-1");
char[] reslts = e.GetChars(br.ReadBytes((int)br.BaseStream.Length));
br.BaseStream.Position = 0;
File.WriteAllBytes((#"C:\TestFile\File2",e.GetBytes(reslts));
Nevertheless, unless you have a good reason to do so, I wouldn't convert bytes to chars in this way.
Encoding is used to encode a string into bytes, then back into a string.
You are using it backwards, decoding something that wasn't a string to begin with.