Writing File using System.IO.BinaryWriter - c#

I am trying to write an audio file using C#'s System.IO.BinaryWriter class.
It works fine when I perform this action:
File.WriteAllBytes(@"C:\TestFile\File1", br.ReadBytes((int)br.BaseStream.Length));
However it does not work when I do the following:
Encoding e = Encoding.ASCII;
char[] reslts = e.GetChars(br.ReadBytes((int)br.BaseStream.Length));
br.BaseStream.Position = 0;
File.WriteAllBytes(@"C:\TestFile\File2", e.GetBytes(reslts));
I checked the byte arrays; in both cases the array contained the same values (I copied the values into a separate Excel file and compared them, and there were no differences).
In both cases the base stream position is set to 0.
The code itself runs and the file is created successfully. The problem occurs when I open the file: the first file plays fine, but the second file is not recognized by Windows Media Player.
Does anyone have any ideas of what I could be doing wrong?

Encoding e = Encoding.ASCII;
That looks suspicious. I know not of any audio format that is encoded using ASCII.

Never, ever read arbitrary binary data as if it's text. It always goes wrong, because they're just not the same thing.
Also, don't rely on a stream's length - it could change, or it might not support fetching it anyway.
You want something like this:
public static void CopyStream(Stream input, Stream output)
{
    // Copy in 16 KB chunks until Read reports end of stream (0 bytes).
    byte[] buffer = new byte[16 * 1024];
    int read;
    while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        output.Write(buffer, 0, read);
    }
}
Then just open a FileStream to wherever you want to write the data, and call
CopyStream(inputStream, outputFile);
You haven't shown where you're getting the input from, but basically you don't really need the BinaryReader. It's not like you're using BinaryReader for its normal purpose, which is to read individual primitive values, length-prefixed strings etc. You just want to copy data verbatim - and a Stream is fine for that.
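As a minimal sketch of the same idea, assuming .NET Framework 4 or later, the built-in Stream.CopyTo method performs this read/write loop for you. The class name FileCopyExample and the method CopyVerbatim are hypothetical names chosen for illustration; the paths are supplied by the caller.

```csharp
using System.IO;

static class FileCopyExample
{
    // Hypothetical helper: copies sourcePath to destinationPath byte for byte,
    // without ever treating the data as text.
    public static void CopyVerbatim(string sourcePath, string destinationPath)
    {
        using (FileStream input = File.OpenRead(sourcePath))
        using (FileStream output = File.Create(destinationPath))
        {
            // Stream.CopyTo reads chunks from input and writes them to output
            // until the end of the input stream is reached.
            input.CopyTo(output);
        }
    }
}
```

Neither stream's Length is consulted here, so this also works for streams that cannot report a length.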

With most encodings, including Encoding.ASCII, converting bytes to chars and back again will not give you back the same data. For example, IIRC Encoding.ASCII maps all bytes in the range 128-255 to the '?' character. Then when you convert back to bytes, they will all convert back to the same value.
An exception is the Latin1 encoding (also known as ISO-8859-1 or code page 28591), which maps every byte value in the range 0-255 to the Unicode character with the same value. The following should give the result you want:
Encoding e = Encoding.GetEncoding(28591);
// or either of the following is equivalent
// Encoding.GetEncoding("Latin1");
// Encoding.GetEncoding("ISO-8859-1");
char[] reslts = e.GetChars(br.ReadBytes((int)br.BaseStream.Length));
br.BaseStream.Position = 0;
File.WriteAllBytes(@"C:\TestFile\File2", e.GetBytes(reslts));
Nevertheless, unless you have a good reason to do so, I wouldn't convert bytes to chars in this way.
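To see the difference concretely, here is a small sketch that pushes the same bytes through a bytes-to-chars-to-bytes round trip with a given encoding. RoundTripExample and RoundTrip are hypothetical names used only for this illustration.

```csharp
using System.Text;

static class RoundTripExample
{
    // Decodes the bytes to chars and encodes them straight back,
    // mirroring what the code in the question does.
    public static byte[] RoundTrip(Encoding encoding, byte[] data)
    {
        char[] chars = encoding.GetChars(data);
        return encoding.GetBytes(chars);
    }
}
```

For the sample input { 0x52, 0x49, 0x84, 0xFF }, RoundTrip(Encoding.ASCII, data) comes back as { 0x52, 0x49, 0x3F, 0x3F } because the two bytes above 127 are replaced, while RoundTrip(Encoding.GetEncoding(28591), data) returns the original four bytes unchanged.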

Encoding is used to encode a string into bytes, then back into a string.
You are using it backwards, decoding something that wasn't a string to begin with.

Related

Fastest way to convert one encoding to another

So I'm reading a file which could be encoded in any encoding. But for this example lets say UTF-16. I need to read the file in BYTES (So using FileStream, not StreamReader), AND in chunks of 1MB, and then convert the UTF-16 byte buffer into a UTF8 byte buffer.
What I'm doing right now:
char[] charBuffer = new char[bufferSize];
int charsRead = Encoding.Unicode.GetChars(utf16Buffer, 0, read, charBuffer, 0);
byte[] utf8Array = new byte[Encoding.UTF8.GetByteCount(charBuffer, 0, charsRead)];
int numBytes = Encoding.UTF8.GetBytes(charBuffer, 0, charsRead, utf8Array, 0);
//Do something with utf8Array
//This is what Encoding.Convert does in the background.
This isn't actually that slow, but I was wondering if there was a faster way. Thanks.
Yes, there is a faster way. You can use multiple threads to process your chunks. To keep the buffers in order, pass the index of each buffer to the thread that processes it, and have every thread write its result into the same collection at that index.
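However the chunks are scheduled, a fixed-size chunk can end in the middle of a UTF-16 code unit or between the two halves of a surrogate pair, and the stateless Encoding.GetChars call in the question would garble the text at that boundary. The following is a single-threaded sketch, not a tuned implementation, that uses a stateful Decoder and Encoder so that incomplete sequences carry over to the next chunk. ChunkedTranscoder and Utf16ToUtf8 are hypothetical names, and the input is assumed to be little-endian UTF-16 without any special BOM handling.

```csharp
using System;
using System.IO;
using System.Text;

static class ChunkedTranscoder
{
    // Reads UTF-16 (little-endian) bytes from input in chunks of bufferSize
    // and writes the equivalent UTF-8 bytes to output.
    public static void Utf16ToUtf8(Stream input, Stream output, int bufferSize = 1024 * 1024)
    {
        if (bufferSize <= 0) throw new ArgumentOutOfRangeException("bufferSize");

        // The Decoder and Encoder objects remember incomplete sequences
        // between calls, unlike Encoding.GetChars and Encoding.GetBytes.
        Decoder decoder = Encoding.Unicode.GetDecoder();
        Encoder encoder = Encoding.UTF8.GetEncoder();

        byte[] inBuffer = new byte[bufferSize];
        char[] chars = new char[Encoding.Unicode.GetMaxCharCount(bufferSize)];
        byte[] outBuffer = new byte[Encoding.UTF8.GetMaxByteCount(chars.Length)];

        int read;
        while ((read = input.Read(inBuffer, 0, inBuffer.Length)) > 0)
        {
            // flush: false keeps trailing partial data for the next chunk.
            int charCount = decoder.GetChars(inBuffer, 0, read, chars, 0, false);
            int byteCount = encoder.GetBytes(chars, 0, charCount, outBuffer, 0, false);
            output.Write(outBuffer, 0, byteCount);
        }

        // Final flush: emit whatever the decoder and encoder are still holding.
        int tailChars = decoder.GetChars(inBuffer, 0, 0, chars, 0, true);
        int tailBytes = encoder.GetBytes(chars, 0, tailChars, outBuffer, 0, true);
        output.Write(outBuffer, 0, tailBytes);
    }
}
```

The three buffers are allocated once and reused for every chunk, which avoids the per-chunk GetByteCount call and array allocation in the original snippet.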

Why does whitespace appear at the end of my C# TextWriter file?

I have created a text file using TextWriter C#, on final creation the text file often has various rows of whitespace at the end of the file. The whitespace is not included in any of the string objects that make up the file and I don’t know what is causing it. The larger the file the more whitespace there is.
I've tried various tests to see if the whitespace occurs based upon the content on the string, but this is not the case. i.e. I have identified the number of rows where the whitespace starts and changed the string for something completely different but the whitespace still occurs.
//To start:
MemoryStream memoryStream = new MemoryStream();
TextWriter tw = new StreamWriter(memoryStream);
//Loop through records & create a concatenated string object
string strUTL1 = string.Format("{0}{1}{2}{3}{4}{5}{6}{7}", strUTL1_1, strUTL1_2, strUTL1_3, strUTL1_4, strUTL1_5, strUTL1_6, strUTL1_7, strUTL1_8);
//Add the line to the text file
tw.WriteLine(strUTL1);
//Once all rows are added I complete the file
tw.Flush();
tw.Close();
//Then return the file
return File(memoryStream.GetBuffer(), "text/plain", txtFileName);
I don't want to manipulate the file after completion (e.g. replace blank spaces), as this could lead to other problems. The file will be exchanged with a third party and needs to be formatted exactly.
Thank you for your assistance.
As the doc for MemoryStream.GetBuffer explains:
Note that the buffer contains allocated bytes which might be unused. For example, if the string "test" is written into the MemoryStream object, the length of the buffer returned from GetBuffer is 256, not 4, with 252 bytes unused. To obtain only the data in the buffer, use the ToArray method; however, ToArray creates a copy of the data in memory.
Use .ToArray() (which will allocate a new array of the right size), or you can use the buffer returned from .GetBuffer() but you'll need to check the .Length to see how many valid bytes are in it.
GetBuffer() returns all the memory that was allocated, which is almost always more bytes than what you actually wrote into it.
Might I suggest using Encoding.UTF8.GetBytes(...) instead:
string strUTL1 = string.Format("{0}{1}{2}{3}{4}{5}{6}{7}", strUTL1_1, strUTL1_2, strUTL1_3, strUTL1_4, strUTL1_5, strUTL1_6, strUTL1_7, strUTL1_8);
var bytes = Encoding.UTF8.GetBytes(strUTL1);
return File(bytes, "text/plain", txtFileName);
Use ToArray() instead of GetBuffer(), since the buffer is larger than needed.
That's often the case. Classes or functions that work with buffers usually reserve a certain amount of memory to hold the data. The function then returns a value telling you how many bytes have been written to the buffer, and you should use only the first n bytes of the buffer.
Citation of MSDN:
For example, if the string "test" is written into the MemoryStream object, the length of the buffer returned from GetBuffer() is 256, not 4, with 252 bytes unused. To obtain only the data in the buffer, use the ToArray() method; however, ToArray() creates a copy of the data in memory.
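A minimal sketch of the fix, assuming the lines have already been built as strings: write them through a StreamWriter and return ToArray() rather than GetBuffer(). MemoryStreamExample and BuildFile are hypothetical names for illustration only.

```csharp
using System.IO;

static class MemoryStreamExample
{
    // Writes each line to an in-memory stream and returns exactly the bytes
    // that were written, with no unused buffer capacity appended.
    public static byte[] BuildFile(string[] lines)
    {
        var memoryStream = new MemoryStream();
        using (var writer = new StreamWriter(memoryStream))
        {
            foreach (string line in lines)
            {
                writer.WriteLine(line);
            }
        } // Disposing the writer flushes it and closes the stream.

        // ToArray still works on a closed MemoryStream and copies only the
        // bytes actually written, unlike GetBuffer.
        return memoryStream.ToArray();
    }
}
```

In the controller action from the question, the result of BuildFile could then be passed to File(...) in place of memoryStream.GetBuffer().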

Extracting a binary file from other file encoding\conversion mistake

I have two binary files, "bigFile.bin" and "smallFile.bin".
The "bigFile.bin" contains "smallFile.bin".
Opening it in Beyond Compare confirms that.
I want to extract the smaller file from the bigger one into a "result.bin" that equals "smallFile.bin".
I have two keywords- one for the start position ("Section") and one for the end position ("Man");
I tried the following:
byte[] bigFile = File.ReadAllBytes("bigFile.bin");
UTF8Encoding enc = new UTF8Encoding();
string text = enc.GetString(bigFile);
int startIndex = text.IndexOf("Section");
int endIndex = text.IndexOf("Man");
string smallFile = text.Substring(startIndex, endIndex - startIndex);
File.WriteAllBytes("result.bin",enc.GetBytes(smallFile));
I compared the result file with the original small file in Beyond Compare, which shows a hex comparison.
Most of the bytes are equal, but some are not.
For example in the new file I have 84 but in the old file I have EF BF BD sequence instead.
What can cause those differences? Where am I mistaken?
Since you are working with binary files, you should not use text-related functionality (which includes encodings etc). Work with byte-related methods instead.
Your current code could be converted to work by making it into something like this:
byte[] bigFile = File.ReadAllBytes("bigFile.bin");
int startIndex = /* assume we somehow know this */
int endIndex = /* assume we somehow know this */
var length = endIndex - startIndex;
var smallFile = new byte[length];
Array.Copy(bigFile, startIndex, smallFile, 0, length);
File.WriteAllBytes("result.bin", smallFile);
To find startIndex and endIndex you could even use your previous technique, but something like this would be more appropriate.
However this would still be problematic because:
Stuffing both binary data and "text" into the same file is going to complicate matters
There is still a lot of unnecessary copying going on here; you should work with your input as a Stream rather than an array of bytes
Even worse than the unnecessary copying, any non-stream solution would either need to load all of your input file in memory as happens above (wasteful), or be exceedingly complicated to code
So, what to do?
Don't read file contents in memory as byte arrays. Work with FileStream instead.
Wrap a StreamReader around the FileStream and use it to find the markers for the start and end indexes. Even better, change your file format so that you don't need to search for text.
After you know startIndex and length, use stream functions to seek to the relevant part of your input stream and copy length bytes to the output stream.
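For files small enough to hold in memory, the marker search itself can also be done on bytes, so that no binary data ever passes through a text decoder. The sketch below assumes the markers are plain ASCII text and, unlike the code in the question, looks for the end marker only after the start marker. BinaryExtract, IndexOf and Extract are hypothetical names, and the search is a simple linear scan rather than an optimized algorithm.

```csharp
using System;
using System.Text;

static class BinaryExtract
{
    // Returns the index of the first occurrence of needle in haystack at or
    // after start, or -1 if it does not occur.
    public static int IndexOf(byte[] haystack, byte[] needle, int start)
    {
        for (int i = start; i <= haystack.Length - needle.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j])
            {
                j++;
            }
            if (j == needle.Length) return i;
        }
        return -1;
    }

    // Copies the bytes from the start marker (inclusive) up to the end
    // marker (exclusive), matching Substring(startIndex, endIndex - startIndex).
    public static byte[] Extract(byte[] data, string startMarker, string endMarker)
    {
        // Converting text to bytes is safe; converting arbitrary bytes to text is not.
        byte[] startBytes = Encoding.ASCII.GetBytes(startMarker);
        byte[] endBytes = Encoding.ASCII.GetBytes(endMarker);

        int startIndex = IndexOf(data, startBytes, 0);
        if (startIndex < 0) throw new InvalidOperationException("Start marker not found.");

        int endIndex = IndexOf(data, endBytes, startIndex + startBytes.Length);
        if (endIndex < 0) throw new InvalidOperationException("End marker not found.");

        byte[] result = new byte[endIndex - startIndex];
        Array.Copy(data, startIndex, result, 0, result.Length);
        return result;
    }
}
```

With the files from the question this would be called as File.WriteAllBytes("result.bin", BinaryExtract.Extract(File.ReadAllBytes("bigFile.bin"), "Section", "Man")).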

Issue with BinaryReader.ReadChars()

I've run into what I believe is an issue with the BinaryReader.ReadChars() method. When I wrap a BinaryReader around a raw socket NetworkStream occasionally I get a stream corruption where the stream being read gets out of sync. The stream in question contains messages in a binary serialisation protocol.
I've tracked this down to the following
It only happens when reading a Unicode string (encoded using Encoding.BigEndianUnicode)
It only happens when the string in question is split across two tcp packets (confirmed using wireshark)
I think what is happening is the following (in the context of the example below)
BinaryReader.ReadChars() is called asking it to read 3 characters (string lengths are encoded before the string itself)
First loop internally requests a read of 6 bytes (3 remaining characters * 2 bytes/char) off the network stream
Network stream only has 3 bytes available
3 bytes read into local buffer
Buffer handed to Decoder
Decoder decodes 1 char, and keeps the other byte in its own internal buffer
Second loop internally requests a read of 4 bytes! (2 remaining characters * 2 bytes/char)
Network stream has all 4 bytes available
4 bytes read into local buffer
Buffer handed to Decoder
Decoder decodes 2 chars, and keeps the remaining 4th byte internally
String decode is complete
Serialisation code attempts to unmarshal the next item and croaks because of stream corruption.
char[] buffer = new char[3];
int charIndex = 0;
Decoder decoder = Encoding.BigEndianUnicode.GetDecoder();
// pretend 3 of the 6 bytes arrives in one packet
byte[] b1 = new byte[] { 0, 83, 0 };
int charsRead = decoder.GetChars(b1, 0, 3, buffer, charIndex);
charIndex += charsRead;
// pretend the remaining 3 bytes plus a final byte, for something unrelated,
// arrive next
byte[] b2 = new byte[] { 71, 0, 114, 3 };
charsRead = decoder.GetChars(b2, 0, 4, buffer, charIndex);
charIndex += charsRead;
I think the root is a bug in the .NET code which uses charsRemaining * bytes/char each loop to calculate the remaining bytes required. Because of the extra byte hidden in the Decoder this calculation can be off by one causing an extra byte to be consumed off the input stream.
Here's the .NET framework code in question
while (charsRemaining>0) {
// We really want to know what the minimum number of bytes per char
// is for our encoding. Otherwise for UnicodeEncoding we'd have to
// do ~1+log(n) reads to read n characters.
numBytes = charsRemaining;
if (m_2BytesPerChar)
numBytes <<= 1;
numBytes = m_stream.Read(m_charBytes, 0, numBytes);
if (numBytes==0) {
return (count - charsRemaining);
}
charsRead = m_decoder.GetChars(m_charBytes, 0, numBytes, buffer, index);
charsRemaining -= charsRead;
index+=charsRead;
}
I'm not entirely sure if this is a bug or just a misuse of the API. To work round this issue I'm just calculating the bytes required myself, reading them, and then running the byte[] through the relevant Encoding.GetString(). However this wouldn't work for something like UTF-8.
Be interested to hear people's thoughts on this and whether I'm doing something wrong or not. And maybe it will save the next person a few hours/days of tedious debugging.
EDIT: Posted to Connect (Connect tracking item).
I have reproduced the problem you mentioned with BinaryReader.ReadChars.
Although the developer always needs to account for lookahead when composing things like streams and decoders, this seems like a fairly significant bug in BinaryReader because that class is intended for reading data structures composed of various types of data. In this case, I agree that ReadChars should have been more conservative in what it read to avoid losing that byte.
There is nothing wrong with your workaround of using the Decoder directly, after all that is what ReadChars does behind the scenes.
Unicode is a simple case. If you think about an arbitrary encoding, there really is no general purpose way to ensure that the correct number of bytes are consumed when you pass in a character count instead of a byte count (think about varying length characters and cases involving malformed input). For this reason, avoiding BinaryReader.ReadChars in favor of reading the specific number of bytes provides a more robust, general solution.
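A minimal sketch of that byte-count approach for the fixed-width case discussed here: loop until exactly the required number of bytes has arrived, then decode them in one call. ExactReader, ReadExactly and ReadUtf16BigEndianString are hypothetical names, and charCount is assumed to be the number of UTF-16 code units sent in the length prefix.

```csharp
using System.IO;
using System.Text;

static class ExactReader
{
    // Keeps calling Stream.Read until count bytes have arrived, because a
    // single Read on a network stream may return fewer bytes than requested.
    public static byte[] ReadExactly(Stream stream, int count)
    {
        byte[] buffer = new byte[count];
        int offset = 0;
        while (offset < count)
        {
            int read = stream.Read(buffer, offset, count - offset);
            if (read == 0)
            {
                throw new EndOfStreamException("Stream ended before all bytes were read.");
            }
            offset += read;
        }
        return buffer;
    }

    // Reads a string of charCount UTF-16 code units (2 bytes each, big-endian)
    // and consumes exactly charCount * 2 bytes from the stream.
    public static string ReadUtf16BigEndianString(Stream stream, int charCount)
    {
        byte[] bytes = ReadExactly(stream, charCount * 2);
        return Encoding.BigEndianUnicode.GetString(bytes);
    }
}
```

Because the byte count is computed up front, no decoder ever holds a byte that belongs to the next item in the stream. For a variable-width encoding such as UTF-8, the protocol would need to send a byte length rather than a character count.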
I would suggest that you bring this to Microsoft's attention via http://connect.microsoft.com/visualstudio.
Interesting; you could report this on "connect". As a stop-gap, you could also try wrapping with BufferedStream, but I expect this is papering over a crack (it may still happen, but less frequently).
The other approach, of course, is to pre-buffer an entire message (but not the entire stream); then read from something like MemoryStream - assuming your network protocol has logical (and ideally length-prefixed, and not too big) messages. Then when it is decoding all the data is available.
This reminds me of one of my own questions (Reading from a HttpResponseStream fails) where I had an issue that when reading from an HTTP response stream the StreamReader would think it had hit the end of the stream prematurely, so my parsers would bomb out unexpectedly.
Like Marc suggested for your problem I first tried pre-buffering in a MemoryStream which works well but means you may have to wait a long time if you have a large file to read (especially from the network/web) before you can do anything useful with it. I eventually settled on creating my own extension of TextReader which overrides the Read methods and defines them using the ReadBlock method (which does a blocking read i.e. it waits until it can get exactly the number of characters you ask for)
Your problem, like mine, is probably due to the fact that Read methods aren't guaranteed to return the number of characters you ask for. For example, if you look at the documentation for the BinaryReader.Read method (http://msdn.microsoft.com/en-us/library/ms143295.aspx), you'll see that it states:
Return Value
Type: System.Int32
The number of characters read into buffer. This might be less than the number of bytes requested if that many bytes are not available, or it might be zero if the end of the stream is reached.
Since BinaryReader has no ReadBlock methods like a TextReader all you can do is take your own approach of monitoring the position yourself or Marc's of pre-caching.
I'm working with Unity3D/Mono atm and the ReadChars-method might even contain more errors. I made a string like this:
mat.name = new string(binaryReader.ReadChars(64));
mat.name even contained the correct string, but I could only add strings before it. Everything after the string just disappeared, even with String.Format. My solution so far is not to use the ReadChars method, but to read the data as a byte array and convert it to a string:
byte[] str = binaryReader.ReadBytes(64);
int lengthOfStr = Array.IndexOf(str, (byte)0); // e.g. 4 for "clip\0"
mat.name = System.Text.Encoding.ASCII.GetString(str, 0, lengthOfStr);

Is there a better way to convert to ASCII from an arbitrary input?

I need to be able to take an arbitrary text input that may have a byte order marker (BOM) on it to mark its encoding, and output it as ASCII. We have some old tools that don't understand BOM's and I need to send them ASCII-only data.
Now, I just got done writing this code and I just can't quite believe the inefficiency here. Four copies of the data, not to mention any intermediate buffers internally in StreamReader. Is there a better way to do this?
// i_fileBytes is an incoming byte[]
string unicodeString = new StreamReader(new MemoryStream(i_fileBytes)).ReadToEnd();
byte[] unicodeBytes = Encoding.Unicode.GetBytes(unicodeString.ToCharArray());
byte[] ansiBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, unicodeBytes);
string ansiString = Encoding.ASCII.GetString(ansiBytes);
I need the StreamReader() because it has an internal BOM detector to choose the encoding to read the rest of the file. Then the rest is just to make it convert into the final ASCII string.
Is there a better way to do this?
If you've got i_fileBytes in memory already, you can just check whether or not it starts with a BOM, and then convert either the whole of it or just the bit after the BOM using Encoding.Unicode.GetString. (Use the overload which lets you specify an index and length.)
So as code:
int start = (i_fileBytes[0] == 0xff && i_fileBytes[1] == 0xfe) ? 2 : 0;
string text = Encoding.Unicode.GetString(i_fileBytes, start, i_fileBytes.Length-start);
Note that that assumes a genuinely little-endian UTF-16 encoding, however. If you really need to detect the encoding first, you could either reimplement what StreamReader does, or perhaps just build a StreamReader from the first (say) 10 bytes, and use the CurrentEncoding property to work out what you should use for the encoding.
EDIT: Now, as for the conversion to ASCII - if you really only need it as a .NET string, then presumably all you want to do is replace any non-ASCII characters with "?" or something similar. (Alternatively it might be better to throw an exception... that's up to you, of course.)
EDIT: Note that when detecting the encoding, it would be a good idea to just call Read() a single time to read one character. Don't call ReadToEnd() as by picking 10 bytes as an arbitrary amount of data, it might end mid-character. I don't know offhand whether that would throw an exception, but it has no benefits anyway...
System.Text.Encoding.ASCII.GetBytes(new StreamReader(new MemoryStream(i_fileBytes)).ReadToEnd())
That should save a few round-trips.
