I have a very large BMP file that I have to read in all at once, because I need to reverse its bytes when writing it to a temp file. The BMP is 1.28GB, and I'm getting an "Out of memory" error. I can't read it completely (using ReadAllBytes), nor buffer it into a byte array, because I can't initialize an array of that size. I also can't read it into a List (which I could then Reverse()) using a buffer, because halfway through it runs out of memory.
So basically the question is: how do I read a very large file backwards (i.e., starting at the last byte and ending at the first) and then write that to disk?
Bonus: when writing the reversed file to disk, do not write the last 54 bytes.
With a FileStream you can Seek (place the "cursor") at any particular byte, so you can use that to go over the entire file's contents in reverse. (Don't use a StreamReader for this: it is meant for decoding text, not for raw byte access.)
Example:
const int bufferSize = 1024;
string fileName = "yourfile.bmp";
using (FileStream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read))
{
    byte[] bytes = new byte[bufferSize];
    long position = stream.Length;
    while (position > 0)
    {
        // step back one chunk (or whatever remains) and read it
        int chunkSize = (int)Math.Min(bufferSize, position);
        position -= chunkSize;
        stream.Seek(position, SeekOrigin.Begin);
        int bytesRead = stream.Read(bytes, 0, chunkSize);
        // bytes[0..bytesRead-1] now holds this chunk; process it here
    }
}
You cannot normally handle files this big in .NET: the CLR caps any single object (such as an array or a collection's backing store) at 2GB, on both 32-bit and 64-bit platforms.
For this you can use a memory-mapped file to read the file directly from disk, without loading it into memory. Once the memory mapping has been created, move the read pointer to the end of the file and read backwards.
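Here is a minimal sketch of that idea. The file names and chunk size are placeholders, and the 54-byte skip (the question's bonus) assumes those bytes are the ones that would otherwise land at the very end of the reversed output:
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

const int chunkSize = 1 << 20;  // 1MB chunks (arbitrary)
const int skip = 54;            // bonus: never write the output's last 54 bytes
long fileLength = new FileInfo("big.bmp").Length;

using (var mmf = MemoryMappedFile.CreateFromFile("big.bmp", FileMode.Open))
using (var accessor = mmf.CreateViewAccessor(0, fileLength, MemoryMappedFileAccess.Read))
using (var output = File.Create("reversed.tmp"))
{
    byte[] buffer = new byte[chunkSize];
    long position = fileLength;  // walk the source back to front
    while (position > skip)      // the first 54 source bytes would become
    {                            // the output's last 54, so stop there
        int count = (int)Math.Min(chunkSize, position - skip);
        position -= count;
        accessor.ReadArray(position, buffer, 0, count);
        Array.Reverse(buffer, 0, count);  // reverse within the chunk
        output.Write(buffer, 0, count);
    }
}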
Hope this helps.
You can use Memory Mapped Files.
http://msdn.microsoft.com/en-us/library/vstudio/dd997372%28v=vs.100%29.aspx
Also, you can use FileStream and move to the necessary position with stream.Seek(offset, SeekOrigin.Begin) (or SeekOrigin.Current / SeekOrigin.End for relative positioning), or with the Position property (absolute position).
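For example (the file name and offsets are arbitrary):
using (FileStream fs = File.OpenRead("file.dat"))
{
    fs.Seek(100, SeekOrigin.Begin);  // absolute: byte 100 from the start
    fs.Seek(-54, SeekOrigin.End);    // relative to the end: 54 bytes back
    fs.Position = 100;               // absolute; same effect as the first Seek
}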
Related
I'm loading a binary file into a memory stream, modifying the bytes, and then storing the file to disk. However, to save time, I retain the modified byte array to calculate a checksum. When I load the saved file from disk and calculate the checksum, the file length is about 150 bytes different from the byte array's length when it was saved, and obviously the checksum doesn't match the one taken before saving. Any ideas as to why this happens? I've searched and searched for clues, but it looks like I'd have to reload the file after it was saved to calculate an accurate checksum.
Also note that the shorter byte array renders its contents correctly, and so does the longer one; in fact, the two render identically!
Here's the code that collects the modified bytes from the memory stream:
writerStream.Flush();
storedFile = new Byte[writerStream.Length];
writerStream.Position = 0;
writerStream.Read(storedFile, 0, Convert.ToInt32(writerStream.Length));
And here's how I read the file:
using (BinaryReader readFile = new BinaryReader(Delimon.Win32.IO.File.Open(filePath, Delimon.Win32.IO.FileMode.Open)))
{
byte[] cgmBytes = readFile.ReadBytes(Convert.ToInt32(readFile.BaseStream.Length));
hash = fileCheck.ComputeHash(cgmBytes);
}
And here's how the file is saved:
using (BinaryWriter aWriter = new BinaryWriter(Delimon.Win32.IO.File.Create(filePath)))
{
aWriter.Write(storedFile);
}
Any suggestions would be much appreciated.
Thx
The problem seems to have resolved itself by simply changing the point where the stream position is set:
writerStream.Flush();
writerStream.Position = 0;
storedFile = new Byte[writerStream.Length];
writerStream.Read(storedFile, 0, Convert.ToInt32(writerStream.Length));
In the previous code the Position was set after reading the stream length; now the position is set before reading the stream length. In either case the byte length DOES NOT CHANGE, yet the saved file now comes back with the identical byte length. Why? Not sure; setting the stream position should not affect the stream length, nor should it affect how a newly instantiated writer saves the byte array. Gremlins?...
When working with binary streams (i.e. byte[] arrays), the main point of using BinaryReader or BinaryWriter seems to be simplified reading/writing of primitive data types from a stream, using methods such as ReadBoolean() and taking encoding into account. Is that the whole story? Is there an inherent advantage or disadvantage if one works directly with a Stream, without using BinaryReader/BinaryWriter? Most methods, such as Read(), seem to be the same in both classes, and my guess is that they work identically underneath.
Consider a simple example of processing a binary file in two different ways (edit: I realize this approach is inefficient and that a buffer could be used; it's just a sample):
// Using FileStream directly
using (FileStream stream = new FileStream("file.dat", FileMode.Open))
{
// Read bytes from stream and interpret them as ints
int value = 0;
while ((value = stream.ReadByte()) != -1)
{
Console.WriteLine(value);
}
}
// Using BinaryReader
using (BinaryReader reader = new BinaryReader(new FileStream("file.dat", FileMode.Open)))
{
// Read bytes and interpret them as ints
byte value = 0;
while (reader.BaseStream.Position < reader.BaseStream.Length)
{
value = reader.ReadByte();
Console.WriteLine(Convert.ToInt32(value));
}
}
The output will be the same, but what's happening internally (e.g. from OS perspective)? Is it - generally speaking - important which implementation is used? Is there any purpose to using BinaryReader/BinaryWriter if you don't need the extra methods that they provide?
For this specific case, MSDN says this in regard to Stream.ReadByte():
The default implementation on Stream creates a new single-byte array and then calls Read. While this is formally correct, it is inefficient.
Using GC.GetTotalMemory(), this first approach does seem to allocate 2x as much space as the second one, but AFAIK this shouldn't be the case if a more general Stream.Read() method is used (e.g. for reading in chunks using a buffer). Still, it seems to me that these methods/interfaces could be unified easily...
No, there is no principal difference between the two approaches. The extra reader adds some buffering, so you shouldn't mix them. But don't expect any significant performance differences; it's all dominated by the actual I/O.
So,
use a stream when you have (only) byte[] to move. As is common in a lot of streaming scenarios.
use BinaryWriter and BinaryReader when you have any other basic type of data (including simple byte) to process. Their main purpose is converting the built-in framework types to and from byte[]; see the round-trip sketch below.
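A minimal round-trip sketch of that conversion role (the values and the MemoryStream are just for demonstration; the leaveOpen overload requires .NET 4.5+):
using System;
using System.IO;
using System.Text;

using (var ms = new MemoryStream())
{
    using (var writer = new BinaryWriter(ms, Encoding.UTF8, leaveOpen: true))
    {
        writer.Write(42);       // Int32 -> 4 bytes, little-endian
        writer.Write(3.14);     // Double -> 8 bytes
        writer.Write("hello");  // length-prefixed UTF-8 string
    }
    ms.Position = 0;
    using (var reader = new BinaryReader(ms))
    {
        Console.WriteLine(reader.ReadInt32());  // 42
        Console.WriteLine(reader.ReadDouble()); // 3.14
        Console.WriteLine(reader.ReadString()); // hello
    }
}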
One big difference is how you can buffer the I/O. If you are writing/reading only a few bytes here or there, BinaryWriter/BinaryReader will work well. But if you have to read MBs of data, then reading one byte, Int32, etc... at a time will be a bit slow. You could instead read larger chunks and parse from there.
Example:
// Using FileStream directly with a buffer
using (FileStream stream = new FileStream("file.dat", FileMode.Open))
{
// Read bytes from stream and interpret them as ints
byte[] buffer = new byte[1024];
int count;
// Read from the IO stream fewer times.
while((count = stream.Read(buffer, 0, buffer.Length)) > 0)
for(int i=0; i<count; i++)
Console.WriteLine(Convert.ToInt32(buffer[i]));
}
Now this is a bit off topic... but I'll throw it out there:
If you wanted to get VERY crafty... and really give yourself a performance boost... (albeit it might be considered dangerous), then instead of parsing EACH Int32 you could decode them all at once using Buffer.BlockCopy():
Another example:
// Using FileStream directly with a buffer and BlockCopy
using (FileStream stream = new FileStream("file.dat", FileMode.Open))
{
// Read bytes from stream and interpret them as ints
byte[] buffer = new byte[1024];
int[] intArray = new int[buffer.Length >> 2]; // Each int is 4 bytes
int count;
// Read from the IO stream fewer times.
while((count = stream.Read(buffer, 0, buffer.Length)) > 0)
{
// Copy the bytes into the memory space of the Int32 array in one big swoop
    Buffer.BlockCopy(buffer, 0, intArray, 0, count);
    // count is in bytes, so there are count / 4 complete ints here
    // (assumes count is a multiple of 4; the last chunk of a file may not be)
    for (int i = 0; i < count / 4; i++)
        Console.WriteLine(intArray[i]);
}
}
A few things to note about this example: it consumes 4 bytes per Int32 instead of one, so it will yield different results. You can also do this for data types other than Int32, but many would argue that marshalling should be on your mind then. (I just wanted to present something to think about...)
Both your code samples do the same thing (ReadByte()), so the result from either method is the same (for the same file).
The internal (OS-level) difference is that streams are buffered in virtual memory; e.g. if you were transporting a file over a network via a stream, you would still have system memory left for other (multi)tasking.
With byte arrays, the whole file is held in memory before being transferred to disk (on file create) or to another stream, so they are not recommended for large files.
There's some discussion of binary data transfer over a network here:
When to use byte array, and when to use stream?
#Jason C & #Jon Skeet makes a good point here:
Why do most serializers use a stream instead of a byte array?
I have noticed my Win 10 machine (4GB RAM) sometimes skips files over 5MB if I transport a file via the System.Net.Http.HttpClient GetByteArrayAsync method (vs GetStreamAsync) and keep working without waiting for the transfer to complete.
PS: In .NET 4.0, a byte array is limited to 2GB.
I have two binary files, "bigFile.bin" and "smallFile.bin".
The "bigFile.bin" contains "smallFile.bin".
Opening it in Beyond Compare confirms that.
I want to extract the smaller file from the bigger one into a "result.bin" that equals "smallFile.bin".
I have two keywords: one for the start position ("Section") and one for the end position ("Man").
I tried the following:
byte[] bigFile = File.ReadAllBytes("bigFile.bin");
UTF8Encoding enc = new UTF8Encoding();
string text = enc.GetString(bigFile);
int startIndex = text.IndexOf("Section");
int endIndex = text.IndexOf("Man");
string smallFile = text.Substring(startIndex, endIndex - startIndex);
File.WriteAllBytes("result.bin",enc.GetBytes(smallFile));
I compared the result file with the original small file in Beyond Compare, which shows a hex representation comparison.
Most of the bytes are equal, but some are not.
For example, in the new file I have 84, but in the old file I have the sequence EF BF BD instead.
What can cause those differences? Where am I mistaken?
Since you are working with binary files, you should not use text-related functionality (which includes encodings etc). Work with byte-related methods instead.
Your current code could be converted to work by making it into something like this:
byte[] bigFile = File.ReadAllBytes("bigFile.bin");
int startIndex = /* assume we somehow know this */;
int endIndex = /* assume we somehow know this */;
var length = endIndex - startIndex;
var smallFile = new byte[length];
Array.Copy(bigFile, startIndex, smallFile, 0, length);
File.WriteAllBytes("result.bin", smallFile);
To find startIndex and endIndex you could even use your previous technique, but something like the byte-level search sketched below would be more appropriate.
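For illustration, a naive byte-level search could look like this (a stand-in for a more sophisticated technique; fine for a one-off job):
// Returns the index of the first occurrence of needle in haystack, or -1.
static int IndexOf(byte[] haystack, byte[] needle)
{
    for (int i = 0; i <= haystack.Length - needle.Length; i++)
    {
        int j = 0;
        while (j < needle.Length && haystack[i + j] == needle[j])
            j++;
        if (j == needle.Length)
            return i;
    }
    return -1;
}
startIndex would then be IndexOf(bigFile, Encoding.ASCII.GetBytes("Section")), and the end index likewise, assuming the keywords really are plain ASCII.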
However this would still be problematic because:
Stuffing both binary data and "text" into the same file is going to complicate matters
There is still a lot of unnecessary copying going on here; you should work with your input as a Stream rather than an array of bytes
Even worse than the unnecessary copying, any non-stream solution would either need to load all of your input file in memory as happens above (wasteful), or be exceedingly complicated to code
So, what to do?
Don't read file contents in memory as byte arrays. Work with FileStream instead.
Wrap a StreamReader around the FileStream and use it to find the markers for the start and end indexes. Even better, change your file format so that you don't need to search for text.
After you know startIndex and length, use stream functions to seek to the relevant part of your input stream and copy length bytes to the output stream (see the sketch below).
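A minimal sketch of that last step, assuming startIndex and length have already been determined (file names and buffer size are placeholders):
using (FileStream input = File.OpenRead("bigFile.bin"))
using (FileStream output = File.Create("result.bin"))
{
    input.Seek(startIndex, SeekOrigin.Begin);
    byte[] buffer = new byte[81920];
    long remaining = length;
    while (remaining > 0)
    {
        int read = input.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
        if (read == 0)
            throw new EndOfStreamException();  // input shorter than expected
        output.Write(buffer, 0, read);
        remaining -= read;
    }
}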
What's the best and the fastest method to remove an item from a binary file?
I have a binary file and I know that I need to remove B number of bytes from a position A, how to do it?
Thanks
You might want to consider working in batches to prevent allocation on the LOH (Large Object Heap), but that depends on the size of your file and how frequently you call this logic.
long skipIndex = 100;
int skipLength = 40;
using (FileStream fileStream = File.Open("file.dat", FileMode.Open))
{
int bufferSize;
checked
{
bufferSize = (int)(fileStream.Length - (skipLength + skipIndex));
}
byte[] buffer = new byte[bufferSize];
// read all data after the removed segment; Read may return fewer
// bytes than requested, so loop until the buffer is full
fileStream.Position = skipIndex + skipLength;
int totalRead = 0;
while (totalRead < bufferSize)
{
    int read = fileStream.Read(buffer, totalRead, bufferSize - totalRead);
    if (read == 0)
        break;
    totalRead += read;
}
// write it back at the displacement
fileStream.Position = skipIndex;
fileStream.Write(buffer, 0, totalRead);
fileStream.SetLength(fileStream.Position); // trim the file
}
Depends... There are a few ways to do this, depending on your requirements.
The basic solution is to read chunks of data from the source file into a target file, skipping over the bits that must be removed (is it always only one segment to remove, or multiple segments?). After you're done, delete the original file and rename the temp file to the original's name.
Things to keep in mind here are that you should tend towards larger chunks rather than smaller. The size of your files will determine a suitable value. 1MB is a good 'default'.
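A minimal sketch of that basic solution for a single segment, assuming b bytes are to be removed starting at offset a (the method name and temp-file naming are placeholders):
static void RemoveSegment(string path, long a, long b)
{
    string tmp = path + ".tmp";
    byte[] buffer = new byte[1024 * 1024];  // 1MB chunks, as suggested above
    using (FileStream src = File.OpenRead(path))
    using (FileStream dst = File.Create(tmp))
    {
        // copy the part before the segment
        long remaining = a;
        while (remaining > 0)
        {
            int read = src.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
            if (read == 0)
                throw new EndOfStreamException();
            dst.Write(buffer, 0, read);
            remaining -= read;
        }
        // skip the segment, then copy everything after it
        src.Seek(b, SeekOrigin.Current);
        src.CopyTo(dst);
    }
    File.Delete(path);
    File.Move(tmp, path);
}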
The simple approach assumes that deleting and renaming a new file is not a problem. If you have specific permissions attached to the file, or used NTFS streams or some-such, this approach won't work.
In that case, make a copy of the original file. Then, skip to the first byte after the segment to ignore in the copied file, skip to the start of the segment in the source file, and transfer the bytes from the copy back to the original. If you're using Streams, you'll want to call Stream.SetLength to truncate the original to the correct size.
If you want to just rewrite the original file and remove a sequence from it, the best way is to "rearrange" the file.
The idea is:
for i = A+1 to file.length - B
file[i] = file[i+B]
For better performance it's best to read and write in chunks rather than single bytes; a sketch follows. Test with different chunk sizes to see what works best for your target system.
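A chunked version of that rearrangement might look like this (the method name, zero-based offsets, and chunk size are assumptions):
static void RemoveBytes(string path, long a, long b)
{
    byte[] chunk = new byte[1024 * 1024];
    using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
    {
        long readPos = a + b;  // first byte after the removed segment
        long writePos = a;     // where that byte should end up
        while (true)
        {
            fs.Position = readPos;
            int read = fs.Read(chunk, 0, chunk.Length);
            if (read == 0)
                break;
            fs.Position = writePos;
            fs.Write(chunk, 0, read);
            readPos += read;
            writePos += read;
        }
        fs.SetLength(fs.Length - b);  // trim the leftover tail
    }
}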
I have seen following code for getting the file into array, which is in turn used as a parameter for SQL command inserting it into a blob column:
byte[] buffer;
using (FileStream fs = new FileStream(soubor, FileMode.Open, FileAccess.Read))
{
    int length = (int)fs.Length;
    buffer = new byte[length];
    int count;
    int sum = 0;
    while ((count = fs.Read(buffer, sum, length - sum)) > 0)
        sum += count;
}
Why can't I simply do fs.Read(buffer, 0, length) in order to just copy the content of the file to the buffer?
Thanks
There's more to it than just "the file may not fit in memory". The contract for Stream.Read explicitly says:
Implementations of this method read a maximum of count bytes from the current stream and store them in buffer beginning at offset. The current position within the stream is advanced by the number of bytes read; however, if an exception occurs, the current position within the stream remains unchanged. Implementations return the number of bytes read. The return value is zero only if the position is currently at the end of the stream. The implementation will block until at least one byte of data can be read, in the event that no data is available. Read returns 0 only when there is no more data in the stream and no more is expected (such as a closed socket or end of file). An implementation is free to return fewer bytes than requested even if the end of the stream has not been reached.
Note the last sentence - you can't rely on a single call to Stream.Read to read everything.
The docs for FileStream.Read have a similar warning:
The total number of bytes read into the buffer. This might be less than the number of bytes requested if that number of bytes are not currently available, or zero if the end of the stream is reached.
For a local file system I don't know for sure whether this will ever actually happen, but it certainly could for a network-mounted file. Do you want your app to be brittle in that way?
Reading in a loop is the robust way to do things. Personally I prefer not to require the stream to support the Length property, either:
public static byte[] ReadFully(Stream stream)
{
byte[] buffer = new byte[8192];
using (MemoryStream tmpStream = new MemoryStream())
{
int bytesRead;
while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
{
tmpStream.Write(buffer, 0, bytesRead);
}
return tmpStream.ToArray();
}
}
That is slightly less efficient when the length is known beforehand, but it's nice and simple. You only need to implement it once, put it in a utility library, and call it whenever you need to. If you really mind the efficiency loss, you could use CanSeek to test whether the Length property is supported, and repeatedly read into a single buffer in that case. Be aware of the possibility that the length of the stream could change while you're reading though...
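For completeness, a length-aware variant along those lines might look like this (a sketch; it falls back to the ReadFully above for non-seekable streams and ignores the mid-read length-change caveat):
public static byte[] ReadFullyKnownLength(Stream stream)
{
    if (!stream.CanSeek)
        return ReadFully(stream);  // general version above

    byte[] buffer = new byte[stream.Length];
    int offset = 0;
    while (offset < buffer.Length)
    {
        int read = stream.Read(buffer, offset, buffer.Length - offset);
        if (read == 0)
            throw new EndOfStreamException("Stream ended before its reported length");
        offset += read;
    }
    return buffer;
}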
Of course, File.ReadAllBytes will do the trick even more simply when you only need to deal with a file rather than a general stream.
Because your file could be very large and the buffer usually has a fixed size of 4-32KB. This way you know you're not filling your memory unnecessarily.
Of course, if you KNOW the size of your file is not too large, or if you store the contents in memory anyway, there is no reason not to read it all in one shot.
Although, if you want to read the contents of your file directly into a variable, you don't need the Stream API. Rather use
File.ReadAllText(...)
or
File.ReadAllBytes(...)
A simple fs.Read(buffer, 0, length) will probably work, and it will even be hard to construct a test that breaks it. But it simply is not guaranteed, and it might break in the future.
The best answer here is to use a specialized method from the library. In this case
byte[] buffer = System.IO.File.ReadAllBytes(fileName);
A quick look with Reflector confirms that this will get you the partial-buffer logic and the exception-safe Dispose() of your stream.
And when future versions of the Framework allow for better ways to do this your code will automatically profit.