Extracting a binary file from another file, encoding/conversion mistake - C#

I have two binary files, "bigFile.bin" and "smallFile.bin".
The "bigFile.bin" contains "smallFile.bin".
Opening them in Beyond Compare confirms that.
I want to extract the smaller file from the bigger one into a "result.bin" that equals "smallFile.bin".
I have two keywords: one for the start position ("Section") and one for the end position ("Man").
I tried the following:
byte[] bigFile = File.ReadAllBytes("bigFile.bin");
UTF8Encoding enc = new UTF8Encoding();
string text = enc.GetString(bigFile);
int startIndex = text.IndexOf("Section");
int endIndex = text.IndexOf("Man");
string smallFile = text.Substring(startIndex, endIndex - startIndex);
File.WriteAllBytes("result.bin",enc.GetBytes(smallFile));
I compared the result file with the original small file in Beyond Compare, which shows a hex representation of both.
Most of the bytes are equal, but some are not.
For example, in the new file I have 84, but in the old file I have the sequence EF BF BD instead.
What can cause those differences? Where am I mistaken?

Since you are working with binary files, you should not use text-related functionality (which includes encodings etc). Work with byte-related methods instead.
Your current code could be converted to work by making it into something like this:
byte[] bigFile = File.ReadAllBytes("bigFile.bin");
int startIndex = /* assume we somehow know this */
int endIndex = /* assume we somehow know this */
var length = endIndex - startIndex;
var smallFile = new byte[length];
Array.Copy(bigFile, startIndex, smallFile, 0, length);
File.WriteAllBytes("result.bin", smallFile);
To find startIndex and endIndex you could even use your previous technique, but searching for the marker bytes directly in the byte array would be more appropriate.
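For illustration, a minimal sketch of such a byte search (FindPattern is a hypothetical helper, the markers are assumed to be plain ASCII, and a real implementation might use a faster algorithm such as Boyer-Moore):
static int FindPattern(byte[] data, byte[] pattern, int startAt)
{
    for (int i = startAt; i <= data.Length - pattern.Length; i++)
    {
        int j = 0;
        while (j < pattern.Length && data[i + j] == pattern[j])
            j++;
        if (j == pattern.Length)
            return i;   // pattern found at offset i
    }
    return -1;          // not found
}
// usage, with the markers from the question:
byte[] startMarker = Encoding.ASCII.GetBytes("Section");
byte[] endMarker = Encoding.ASCII.GetBytes("Man");
int startIndex = FindPattern(bigFile, startMarker, 0);
int endIndex = FindPattern(bigFile, endMarker, startIndex + startMarker.Length);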
However this would still be problematic because:
Stuffing both binary data and "text" into the same file is going to complicate matters
There is still a lot of unnecessary copying going on here; you should work with your input as a Stream rather than an array of bytes
Even worse than the unnecessary copying, any non-stream solution would either need to load all of your input file in memory as happens above (wasteful), or be exceedingly complicated to code
So, what to do?
Don't read file contents in memory as byte arrays. Work with FileStream instead.
Wrap a StreamReader around the FileStream and use it to find the markers for the start and end indexes. Even better, change your file format so that you don't need to search for text.
After you know startIndex and length, use stream functions to seek to the relevant part of your input stream and copy length bytes to the output stream.
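A rough sketch of those steps (untested; the file names and the 16 KB buffer are placeholders, and startIndex/length are assumed to have been determined already):
using (FileStream input = File.OpenRead("bigFile.bin"))
using (FileStream output = File.Create("result.bin"))
{
    input.Seek(startIndex, SeekOrigin.Begin);   // jump to the start of the embedded file
    byte[] buffer = new byte[16 * 1024];
    long remaining = length;
    while (remaining > 0)
    {
        int read = input.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
        if (read <= 0)
            break;                              // unexpected end of input
        output.Write(buffer, 0, read);
        remaining -= read;
    }
}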

Related

Difficulty reading large file into byte array

I have a very large BMP file that I have to read in all at once because I need to reverse the bytes when writing it to a temp file. This BMP is 1.28 GB, and I'm getting an "Out of memory" error. I can't read it completely (using ReadAllBytes), nor read it via a buffer into a byte array, because I can't initialize an array of that size. I also can't read it into a List (which I could then Reverse()) using a buffer, because halfway through it runs out of memory.
So basically the question is: how do I read a very large file backwards (i.e. starting at the last byte and ending at the first byte) and then write that to disk?
Bonus: when writing the reversed file to disk, do not write the last 54 bytes.
With a StreamReader object you can Seek its BaseStream (place the "cursor") at any particular byte, so you can use that to go over the entire file's contents in reverse.
Example:
const int bufferSize = 1024;
string fileName = "yourfile.txt";
StreamReader myStream = new StreamReader(fileName);
Stream baseStream = myStream.BaseStream;
byte[] bytes = new byte[bufferSize];
long position = baseStream.Length;
while (position > 0)
{
    // step back one chunk, then read it
    int toRead = (int)Math.Min(bufferSize, position);
    position -= toRead;
    baseStream.Seek(position, SeekOrigin.Begin);
    int bytesRead = baseStream.Read(bytes, 0, toRead);
    // process bytes[0..bytesRead) here
}
You cannot normally handle files this big in .NET, because of the memory limits the CLR imposes on applications and on any single object or collection, on both 32-bit and 64-bit platforms.
For this you can use a memory-mapped file to read the file directly from disk without loading it into memory. Once the memory mapping has been created, move the read pointer to the end of the file and read backwards.
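A minimal sketch of that idea (System.IO.MemoryMappedFiles, available from .NET 4.0; the file names are placeholders, and reading one byte at a time is only for illustration):
using (var mmf = MemoryMappedFile.CreateFromFile(@"C:\big.bmp", FileMode.Open))
using (var accessor = mmf.CreateViewAccessor())
using (var output = File.Create(@"C:\reversed.bin"))
{
    long length = new FileInfo(@"C:\big.bmp").Length;
    for (long pos = length - 1; pos >= 0; pos--)
    {
        output.WriteByte(accessor.ReadByte(pos));   // copy the file back to front
    }
}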
Hope this helps.
You can use Memory Mapped Files.
http://msdn.microsoft.com/en-us/library/vstudio/dd997372%28v=vs.100%29.aspx
Also, you can use a FileStream and move to the position you need with stream.Seek(offset, SeekOrigin.Begin) or with the Position property.
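A rough sketch of that FileStream approach, reading backwards from the end in chunks and reversing each chunk before writing it out (untested; the file names are placeholders, and stopping 54 bytes short of the start is one reading of the "bonus" requirement):
const int chunkSize = 64 * 1024;
const int skipAtStart = 54;            // assumed to be the BMP header the question wants omitted
using (FileStream input = File.OpenRead(@"C:\big.bmp"))
using (FileStream output = File.Create(@"C:\reversed.bin"))
{
    byte[] buffer = new byte[chunkSize];
    long position = input.Length;
    while (position > skipAtStart)
    {
        int toRead = (int)Math.Min(chunkSize, position - skipAtStart);
        position -= toRead;
        input.Seek(position, SeekOrigin.Begin);
        int read = 0;
        while (read < toRead)          // Read may return fewer bytes than asked for
        {
            int n = input.Read(buffer, read, toRead - read);
            if (n <= 0) break;
            read += n;
        }
        Array.Reverse(buffer, 0, read);
        output.Write(buffer, 0, read);
    }
}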

Remove item from binary file

What's the best and the fastest method to remove an item from a binary file?
I have a binary file and I know that I need to remove B number of bytes from a position A, how to do it?
Thanks
You might want to consider working in batches to prevent allocations on the large object heap (LOH), but that depends on the size of your file and the frequency with which you call this logic.
long skipIndex = 100;
int skipLength = 40;
using (FileStream fileStream = File.Open("file.dat", FileMode.Open))
{
    int bufferSize;
    checked
    {
        bufferSize = (int)(fileStream.Length - (skipLength + skipIndex));
    }
    byte[] buffer = new byte[bufferSize];
    // read all data after
    fileStream.Position = skipIndex + skipLength;
    fileStream.Read(buffer, 0, bufferSize);
    // write to displacement
    fileStream.Position = skipIndex;
    fileStream.Write(buffer, 0, bufferSize);
    fileStream.SetLength(fileStream.Position); // trim the file
}
Depends... There are a few ways to do this, depending on your requirements.
The basic solution is to read chunks of data from the source file into a target file, skipping over the bits that must be removed (is it always only one segment to remove, or multiple segments?). After you're done, delete the original file and rename the temp file to the original's name.
Things to keep in mind here are that you should tend towards larger chunks rather than smaller. The size of your files will determine a suitable value. 1MB is a good 'default'.
The simple approach assumes that deleting and renaming a new file is not a problem. If you have specific permissions attached to the file, or used NTFS streams or some-such, this approach won't work.
In that case, make a copy of the original file. Then, skip to the first byte after the segment to ignore in the copied file, skip to the start of the segment in the source file, and transfer bytes from copy to original. If you're using Streams, you'll want to call Stream.SetLength to truncate the original to the correct size.
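A rough sketch of the copy-and-skip approach described above (untested; the paths, the assumption of a single segment to remove, and the CopyBytes helper are all placeholders):
static void CopyBytes(Stream input, Stream output, long count)
{
    byte[] buffer = new byte[1024 * 1024];   // 1 MB chunks, as suggested above
    int read;
    while (count > 0 &&
           (read = input.Read(buffer, 0, (int)Math.Min(buffer.Length, count))) > 0)
    {
        output.Write(buffer, 0, read);
        count -= read;
    }
}
// remove B bytes starting at position A (example values)
long A = 100, B = 40;
using (FileStream input = File.OpenRead("file.dat"))
using (FileStream output = File.Create("file.tmp"))
{
    CopyBytes(input, output, A);             // everything before the segment
    input.Seek(B, SeekOrigin.Current);       // skip over the segment to remove
    CopyBytes(input, output, long.MaxValue); // everything after it
}
File.Delete("file.dat");
File.Move("file.tmp", "file.dat");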
If you just want to rewrite the original file and remove a sequence from it, the best way is to "rearrange" the file.
The idea is:
for i = A+1 to file.length - B
    file[i] = file[i+B]
For better performance it's best to read and write in chunks rather than single bytes. Test with different chunk sizes to see what works best on your target system.
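The same idea in C#, working in chunks through a single FileStream (a rough, untested sketch; the file name and the A/B values are examples, and positions here are 0-based rather than 1-based as in the pseudocode above):
long A = 100;                              // start of the segment to remove
int B = 40;                                // number of bytes to remove
byte[] buffer = new byte[1024 * 1024];     // chunk size, tune per target system
using (FileStream fs = File.Open("file.dat", FileMode.Open))
{
    long readPos = A + B;                  // where the data to keep begins
    long writePos = A;                     // where it has to move to
    while (true)
    {
        fs.Position = readPos;
        int read = fs.Read(buffer, 0, buffer.Length);
        if (read <= 0)
            break;
        fs.Position = writePos;
        fs.Write(buffer, 0, read);
        readPos += read;
        writePos += read;
    }
    fs.SetLength(fs.Length - B);           // trim the now-duplicated tail
}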

How to split a wmv file in bytes implemented in C# language?

I have a wmv file whose size is 300 bytes. I want to split it into several parts (for example, two parts of 150 bytes each, or three parts of 100 bytes). How do I implement this in C#?
It really depends on whether you want the resulting files to work or not. Splitting them into chunks is easy: read the file into a byte array, then have a for loop that copies part of the array to a file of size CHUNK, without forgetting to copy the final bytes of the file. Splitting them into working files is different.
I would try to just stream it without explicit splitting (the TCP stack will split it as it likes anyway). If you have a good codec it will play it regardless. (VLC can always play videos while they are downloading.)
The real answer is: just use a streaming server and forget about writing a streaming protocol. That's crazy. To split a file into byte segments you could use something like the code below. Note that it's untested, but it should be about 95% complete.
You should take a look at the protocol spec if you haven't already: http://msdn.microsoft.com/en-us/library/cc251059(v=PROT.10).aspx. And if you have, and you are still asking this question, you don't stand an ice cube's chance in hell of making it work.
int chunkSize = 300;
var file = File.Open(@"c:\file.wmv", FileMode.Open);
int numberOfChunks = (int)((file.Length + chunkSize - 1) / chunkSize); // round up
byte[][] fileBytes = new byte[numberOfChunks][];
for (int i = 0; i < numberOfChunks; i++)
{
    int bytesToRead = chunkSize;
    if (i == numberOfChunks - 1)
    {
        // the last chunk holds whatever is left over
        bytesToRead = (int)(file.Length - (i * chunkSize));
    }
    fileBytes[i] = new byte[bytesToRead];
    file.Read(fileBytes[i], 0, bytesToRead);
}
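If the goal is separate files on disk rather than an in-memory jagged array, a streaming variant avoids holding the whole file in memory; a rough sketch (untested, the output file names are placeholders):
int chunkSize = 150;
byte[] buffer = new byte[chunkSize];
int partNumber = 0;
using (FileStream input = File.OpenRead(@"c:\file.wmv"))
{
    int read;
    while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        using (FileStream output = File.Create("part" + partNumber + ".bin"))
        {
            output.Write(buffer, 0, read);   // write only the bytes actually read
        }
        partNumber++;
    }
}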

Issue with BinaryReader.ReadChars()

I've run into what I believe is an issue with the BinaryReader.ReadChars() method. When I wrap a BinaryReader around a raw socket NetworkStream, I occasionally get stream corruption where the stream being read gets out of sync. The stream in question contains messages in a binary serialisation protocol.
I've tracked this down to the following
It only happens when reading a Unicode string (encoded using Encoding.BigEndianUnicode)
It only happens when the string in question is split across two tcp packets (confirmed using wireshark)
I think what is happening is the following (in the context of the example below)
BinaryReader.ReadChars() is called asking it to read 3 characters (string lengths are encoded before the string itself)
First loop internally requests a read of 6 bytes (3 remaining characters * 2 bytes/char) off the network stream
Network stream only has 3 bytes available
3 bytes read into local buffer
Buffer handed to Decoder
Decoder decodes 1 char, and keeps the other byte in its own internal buffer
Second loop internally requests a read of 4 bytes! (2 remaining characters * 2 bytes/char)
Network stream has all 4 bytes available
4 bytes read into local buffer
Buffer handed to Decoder
Decoder decodes 2 chars, and keeps the remaining 4th byte internally
String decode is complete
Serialisation code attempts to unmarshal the next item and croaks because of stream corruption.
char[] buffer = new char[3];
int charIndex = 0;
Decoder decoder = Encoding.BigEndianUnicode.GetDecoder();
// pretend 3 of the 6 bytes arrives in one packet
byte[] b1 = new byte[] { 0, 83, 0 };
int charsRead = decoder.GetChars(b1, 0, 3, buffer, charIndex);
charIndex += charsRead;
// pretend the remaining 3 bytes plus a final byte, for something unrelated,
// arrive next
byte[] b2 = new byte[] { 71, 0, 114, 3 };
charsRead = decoder.GetChars(b2, 0, 4, buffer, charIndex);
charIndex += charsRead;
I think the root cause is a bug in the .NET code, which uses charsRemaining * bytes/char on each loop iteration to calculate the remaining bytes required. Because of the extra byte hidden in the Decoder, this calculation can be off by one, causing an extra byte to be consumed from the input stream.
Here's the .NET framework code in question
while (charsRemaining > 0) {
    // We really want to know what the minimum number of bytes per char
    // is for our encoding. Otherwise for UnicodeEncoding we'd have to
    // do ~1+log(n) reads to read n characters.
    numBytes = charsRemaining;
    if (m_2BytesPerChar)
        numBytes <<= 1;
    numBytes = m_stream.Read(m_charBytes, 0, numBytes);
    if (numBytes == 0) {
        return (count - charsRemaining);
    }
    charsRead = m_decoder.GetChars(m_charBytes, 0, numBytes, buffer, index);
    charsRemaining -= charsRead;
    index += charsRead;
}
I'm not entirely sure whether this is a bug or just a misuse of the API. To work around this issue I'm simply calculating the bytes required myself, reading them, and then running the byte[] through the relevant Encoding.GetString(). However, this wouldn't work for something like UTF-8.
I'd be interested to hear people's thoughts on this, and whether I'm doing something wrong or not. And maybe it will save the next person a few hours/days of tedious debugging.
EDIT: posted to Microsoft Connect (tracking item).
I have reproduced the problem you mentioned with BinaryReader.ReadChars.
Although the developer always needs to account for lookahead when composing things like streams and decoders, this seems like a fairly significant bug in BinaryReader because that class is intended for reading data structures composed of various types of data. In this case, I agree that ReadChars should have been more conservative in what it read to avoid losing that byte.
There is nothing wrong with your workaround of using the Decoder directly, after all that is what ReadChars does behind the scenes.
Unicode is a simple case. If you think about an arbitrary encoding, there really is no general purpose way to ensure that the correct number of bytes are consumed when you pass in a character count instead of a byte count (think about varying length characters and cases involving malformed input). For this reason, avoiding BinaryReader.ReadChars in favor of reading the specific number of bytes provides a more robust, general solution.
I would suggest that you bring this to Microsoft's attention via http://connect.microsoft.com/visualstudio.
Interesting; you could report this on Connect. As a stop-gap, you could also try wrapping with a BufferedStream, but I expect this is papering over a crack (it may still happen, but less frequently).
The other approach, of course, is to pre-buffer an entire message (but not the entire stream); then read from something like MemoryStream - assuming your network protocol has logical (and ideally length-prefixed, and not too big) messages. Then when it is decoding all the data is available.
This reminds me of one of my own questions (Reading from a HttpResponseStream fails), where I had an issue that, when reading from an HTTP response stream, the StreamReader would think it had hit the end of the stream prematurely, so my parsers would bomb out unexpectedly.
Like Marc suggested for your problem, I first tried pre-buffering in a MemoryStream. That works well, but it means you may have to wait a long time if you have a large file to read (especially from the network/web) before you can do anything useful with it. I eventually settled on creating my own extension of TextReader which overrides the Read methods and defines them using the ReadBlock method (which does a blocking read, i.e. it waits until it can get exactly the number of characters you ask for).
Your problem is probably due, like mine, to the fact that Read methods aren't guaranteed to return the number of characters you ask for. For example, if you look at the documentation for the BinaryReader.Read method (http://msdn.microsoft.com/en-us/library/ms143295.aspx) you'll see that it states:
Return Value
Type: System.Int32
The number of characters read into buffer. This might be less than the number of bytes requested if that many bytes are not available, or it might be zero if the end of the stream is reached.
Since BinaryReader has no ReadBlock method like a TextReader, all you can do is take your own approach of monitoring the position yourself, or Marc's approach of pre-caching.
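For what it's worth, a minimal sketch of the "read the exact number of bytes yourself" workaround mentioned above (ReadExactly is a hypothetical helper, networkStream stands for the socket NetworkStream from the question, and it assumes you know the byte length of the string rather than the character count):
static byte[] ReadExactly(Stream stream, int count)
{
    byte[] buffer = new byte[count];
    int offset = 0;
    while (offset < count)
    {
        int read = stream.Read(buffer, offset, count - offset);
        if (read == 0)
            throw new EndOfStreamException();   // the stream ended mid-message
        offset += read;
    }
    return buffer;
}
// usage: read the 3-character big-endian Unicode string from the example (6 bytes)
byte[] raw = ReadExactly(networkStream, 6);
string s = Encoding.BigEndianUnicode.GetString(raw);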
I'm working with Unity3D/Mono at the moment, and the ReadChars method might even contain more errors. I made a string like this:
mat.name = new string(binaryReader.ReadChars(64));
mat.name even contained the correct string, but I could only add strings before it; everything appended after the string just disappeared, even with String.Format. My solution so far is not to use the ReadChars method, but to read the data as a byte array and convert it to a string:
byte[] str = binaryReader.ReadBytes(64);
int lengthOfStr = Array.IndexOf(str, (byte)0); // e.g. 4 for "clip\0"
mat.name = System.Text.ASCIIEncoding.Default.GetString(str, 0, lengthOfStr);

Writing File using System.IO.BinaryWriter

I am trying to write an audio file using C#'s System.IO.BinaryWriter class.
It works fine when I perform this action:
File.WriteAllBytes(@"C:\TestFile\File1", br.ReadBytes((int)br.BaseStream.Length));
However it does not work when I do the following:
Encoding e = Encoding.ASCII;
char[] reslts = e.GetChars(br.ReadBytes((int)br.BaseStream.Length));
br.BaseStream.Position = 0;
File.WriteAllBytes(@"C:\TestFile\File2", e.GetBytes(reslts));
I checked the byte arrays; in both cases the array contained the same values (I copied their values into a separate Excel file and checked for differences; there are none).
In both cases the base stream position is set to 0.
The code itself works, the file is created successfully. The problem occurs when I open the file, the first file works fine, the second file is not recognized by Windows Media Player.
Does anyone have any ideas of what I could be doing wrong?
Encoding e = Encoding.ASCII;
That looks suspicious. I know not of any audio format that is encoded using ASCII.
Never, ever read arbitrary binary data as if it's text. It always goes wrong, because they're just not the same thing.
Also, don't rely on a stream's length - it could change, or it might not support fetching it anyway.
You want something like this:
public static void CopyStream(Stream input, Stream output)
{
    byte[] buffer = new byte[16 * 1024];
    int read;
    while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        output.Write(buffer, 0, read);
    }
}
Then just open a FileStream to wherever you want to write the data, and call
CopyStream(inputStream, outputFile);
You haven't shown where you're getting the input from, but basically you don't really need the BinaryReader. It's not like you're using BinaryReader for its normal purpose, which is to read individual primitive values, length-prefixed strings etc. You just want to copy data verbatim - and a Stream is fine for that.
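For example (the paths are placeholders, and the input could just as well be a NetworkStream or any other readable Stream):
using (FileStream inputStream = File.OpenRead(@"C:\TestFile\Source"))
using (FileStream outputFile = File.Create(@"C:\TestFile\File2"))
{
    CopyStream(inputStream, outputFile);
}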
With most encodings, including Encoding.ASCII, converting bytes to chars and back again will not give you back the same data. For example, IIRC Encoding.ASCII maps all bytes in the range 128-255 to the '?' character, so when you convert back to bytes they all come back as that same '?' byte.
An exception is the Latin1 aka ISO-8859-1 aka Code Page 28591 encoding, which maps every byte value in the range 0-255 to the Unicode character with the same value. The following should give the result you want:
Encoding e = Encoding.GetEncoding(28591);
// or either of the following is equivalent
// Encoding.GetEncoding("Latin1");
// Encoding.GetEncoding("ISO-8859-1");
char[] reslts = e.GetChars(br.ReadBytes((int)br.BaseStream.Length));
br.BaseStream.Position = 0;
File.WriteAllBytes(@"C:\TestFile\File2", e.GetBytes(reslts));
Nevertheless, unless you have a good reason to do so, I wouldn't convert bytes to chars in this way.
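A quick illustration of the difference (the byte values are just an example):
byte[] original = { 0x41, 0x84, 0xFF };
// ASCII: every byte above 127 decodes to '?' and stays '?' on the way back
byte[] viaAscii = Encoding.ASCII.GetBytes(Encoding.ASCII.GetString(original));
// viaAscii is { 0x41, 0x3F, 0x3F } - the original data is gone
// Latin1 / code page 28591: every byte value round-trips unchanged
Encoding latin1 = Encoding.GetEncoding(28591);
byte[] viaLatin1 = latin1.GetBytes(latin1.GetString(original));
// viaLatin1 is { 0x41, 0x84, 0xFF } - identical to the original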
Encoding is used to encode a string into bytes and then decode those bytes back into a string.
You are using it backwards, decoding something that wasn't a string to begin with.
