What's the best and fastest method to remove an item from a binary file?
I have a binary file and I know that I need to remove B bytes starting at position A. How do I do it?
Thanks
You might want to consider working in batches to prevent allocations on the large object heap (LOH), but that depends on the size of your file and how frequently you call this logic.
long skipIndex = 100;
int skipLength = 40;

using (FileStream fileStream = File.Open("file.dat", FileMode.Open))
{
    int bufferSize;
    checked
    {
        bufferSize = (int)(fileStream.Length - (skipLength + skipIndex));
    }
    byte[] buffer = new byte[bufferSize];

    // read all data after the removed segment
    // (Read may return fewer bytes than requested, so loop until done)
    fileStream.Position = skipIndex + skipLength;
    int totalRead = 0;
    while (totalRead < bufferSize)
    {
        int read = fileStream.Read(buffer, totalRead, bufferSize - totalRead);
        if (read == 0) break;
        totalRead += read;
    }

    // write it back at the start of the removed segment
    fileStream.Position = skipIndex;
    fileStream.Write(buffer, 0, totalRead);
    fileStream.SetLength(fileStream.Position); // trim the file
}
Depends... There are a few ways to do this, depending on your requirements.
The basic solution is to read chunks of data from the source file into a target file, skipping over the bits that must be removed (is it always only one segment to remove, or multiple segments?). After you're done, delete the original file and rename the temp file to the original's name.
Things to keep in mind here are that you should tend towards larger chunks rather than smaller. The size of your files will determine a suitable value. 1MB is a good 'default'.
The simple approach assumes that deleting the original and renaming the new file is not a problem. If you have specific permissions attached to the file, or used NTFS alternate streams or some such, this approach won't work.
In that case, make a copy of the original file. Then skip to the first byte after the segment to ignore in the copied file, skip to the start of the segment in the original file, and transfer bytes from copy to original. If you're using Streams, you'll want to call Stream.SetLength to truncate the original to the correct size.
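A minimal sketch of the temp-file approach, assuming a single segment to remove (the method and parameter names are illustrative, not from the original; assumes System and System.IO are imported):
static void RemoveSegment(string path, long segmentStart, long segmentLength)
{
    string tempPath = path + ".tmp";
    byte[] buffer = new byte[1024 * 1024]; // 1MB chunks, as suggested above

    using (FileStream source = File.OpenRead(path))
    using (FileStream target = File.Create(tempPath))
    {
        // copy everything before the segment, then everything after it
        CopyRange(source, target, 0, segmentStart, buffer);
        CopyRange(source, target, segmentStart + segmentLength,
                  source.Length - (segmentStart + segmentLength), buffer);
    }

    File.Delete(path);
    File.Move(tempPath, path);
}

static void CopyRange(FileStream source, FileStream target,
                      long start, long count, byte[] buffer)
{
    source.Position = start;
    while (count > 0)
    {
        int read = source.Read(buffer, 0, (int)Math.Min(buffer.Length, count));
        if (read == 0) break; // unexpected end of file
        target.Write(buffer, 0, read);
        count -= read;
    }
}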
If you want to rewrite the original file in place and remove a sequence from it, the best way is to "rearrange" the file.
The idea is:
for i = A to file.length - B - 1
    file[i] = file[i + B]
then truncate the file to file.length - B
For better performance, it's best to read and write in chunks rather than single bytes. Test different chunk sizes to see what works best on your target system.
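A sketch of that in-place rearrangement done in chunks (A and B stand for the position and byte count from the question; the 64 KB chunk size is just an example):
long A = 100; // position of the segment to remove
long B = 40;  // number of bytes to remove

using (FileStream fs = File.Open("file.dat", FileMode.Open))
{
    byte[] chunk = new byte[64 * 1024];
    long readPos = A + B;
    long writePos = A;

    while (readPos < fs.Length)
    {
        // read a chunk from after the segment...
        fs.Position = readPos;
        int read = fs.Read(chunk, 0, chunk.Length);
        if (read == 0) break;

        // ...and write it B bytes earlier
        fs.Position = writePos;
        fs.Write(chunk, 0, read);

        readPos += read;
        writePos += read;
    }

    fs.SetLength(writePos); // drop the trailing B bytes
}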
I have a very large BMP file that I have to read in all at once, because I need to reverse the bytes when writing it to a temp file. The BMP is 1.28 GB, and I'm getting an "Out of memory" error. I can't read it completely (using ReadAllBytes), nor into a byte array via a buffer, because I can't initialize an array of that size. I also can't read it into a List (which I could then Reverse()) using a buffer, because it runs out of memory halfway through.
So basically the question is, how do I read a very large file backwards (ie, starting at LastByte and ending at FirstByte) and then write that to disk?
Bonus: when writing the reversed file to disk, do not write the last 54 bytes.
With a FileStream, you can Seek (place the "cursor") at any particular byte, so you can use that to go over the entire file's contents in reverse. (A StreamReader is a poor fit here: it decodes text characters, while a BMP is binary data.)
Example:
const int bufferSize = 1024;
string fileName = "yourfile.dat";

using (FileStream stream = File.OpenRead(fileName))
{
    byte[] bytes = new byte[bufferSize];
    long position = stream.Length;

    while (position > 0)
    {
        // step backwards one chunk at a time
        int toRead = (int)Math.Min(bufferSize, position);
        position -= toRead;
        stream.Seek(position, SeekOrigin.Begin);
        int bytesRead = stream.Read(bytes, 0, toRead);
        // process bytes[0 .. bytesRead) here, reversing within the chunk as needed
    }
}
You cannot normally handle files this big in .NET, due to the memory limit the CLR imposes on single objects and collections, on both 32-bit and 64-bit platforms.
For this you can use a memory-mapped file, which reads a file directly from disk without loading it into memory all at once. Once the memory mapping is created, move the read pointer to the end of the file and read backwards.
Hope this helps.
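For example, a minimal sketch using MemoryMappedFile (available in .NET 4 and later via System.IO.MemoryMappedFiles; the file path and chunk size here are placeholders):
using (var mmf = MemoryMappedFile.CreateFromFile(@"c:\temp\big.bmp", FileMode.Open))
using (var accessor = mmf.CreateViewAccessor())
{
    long length = new FileInfo(@"c:\temp\big.bmp").Length;
    byte[] chunk = new byte[64 * 1024];
    long position = length;

    while (position > 0)
    {
        int toRead = (int)Math.Min(chunk.Length, position);
        position -= toRead;
        // read a chunk directly from the mapping, walking backwards
        accessor.ReadArray(position, chunk, 0, toRead);
        // reverse the chunk and append it to the output file here
    }
}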
You can use Memory Mapped Files.
http://msdn.microsoft.com/en-us/library/vstudio/dd997372%28v=vs.100%29.aspx
Also, you can use a FileStream and position it where needed via stream.Seek(xxx, SeekOrigin.Current) (relative to the current position), or via the Position property or SeekOrigin.Begin (absolute position).
I conceived this idea to merge an arbitrary number of small text files into one single zip file with the GZipStream class. I spent several nights making it work, but the outcome is that the final zip file ended up being bigger than if the text files had been concatenated together. I vaguely know how Huffman coding works, so I don't know if it's practical to do this, or if there's a better alternative. Ultimately, I want an external sorted index file to map out each blob for fast access. What do you think?
// keep track of the index's current position
long indexByteOffset = 0;
// in reality the blobs vary in size from 1k to 300k bytes
string[] originalData = { "data blob1", "data blob2", "data blob3", "data blob4" /* etc etc etc */ };

// merged compressed file
using (BinaryWriter zipWriter = new BinaryWriter(File.Create(@"c:\temp\merged.gz")))
// keep track of beginning position and size of each blob
using (StreamWriter indexWriter = new StreamWriter(File.Create(@"c:\temp\index.txt")))
{
    foreach (var blob in originalData)
    {
        byte[] compressedData;
        using (MemoryStream ms = new MemoryStream())
        {
            using (GZipStream zipper = new GZipStream(ms, CompressionMode.Compress))
            {
                byte[] encodeBuffer = Encoding.UTF8.GetBytes(blob);
                zipper.Write(encodeBuffer, 0, encodeBuffer.Length);
            }
            compressedData = ms.ToArray();
        }
        zipWriter.Write(compressedData);
        // "\t" must be a string here; adding the char '\t' would do integer arithmetic
        indexWriter.WriteLine(indexByteOffset + "\t" + (indexByteOffset + compressedData.Length));
        indexByteOffset += compressedData.Length;
    }
}
Different data can compress with different effectiveness. Small data usually isn't worth trying to compress. One common approach is to allow for an "is it compressed?" flag - do a speculative compress, but if the result is larger, store the original. That information could be included in the index.
Personally, though, I'd probably be tempted to go for a single file - either a .zip, or just including the length of each fragment as a 4-byte chunk (or maybe a "varint") before each - then seeking to the n-th fragment is just a case of "read length prefix, decode as int, seek that many bytes, repeat". You could also reserve one bit of that for "is it compressed".
But as for "is it worth compression": that depends on your data.
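A sketch of that single-file, length-prefixed layout, reserving the prefix's top bit as the "is it compressed?" flag (the names and layout are illustrative, not from the answer; assumes each fragment is well under 2 GB so 31 bits suffice for the length):
static void WriteFragment(BinaryWriter writer, byte[] raw)
{
    byte[] compressed = Compress(raw);
    bool useCompressed = compressed.Length < raw.Length; // speculative compress
    byte[] payload = useCompressed ? compressed : raw;

    // 4-byte length prefix, top bit flags compression
    int prefix = payload.Length | (useCompressed ? unchecked((int)0x80000000) : 0);
    writer.Write(prefix);
    writer.Write(payload);
}

static byte[] ReadFragment(BinaryReader reader, int index)
{
    // skip the first `index` fragments: read prefix, seek that many bytes, repeat
    for (int i = 0; i < index; i++)
    {
        int prefix = reader.ReadInt32();
        reader.BaseStream.Seek(prefix & 0x7FFFFFFF, SeekOrigin.Current);
    }
    int finalPrefix = reader.ReadInt32();
    byte[] payload = reader.ReadBytes(finalPrefix & 0x7FFFFFFF);
    return (finalPrefix & unchecked((int)0x80000000)) != 0 ? Decompress(payload) : payload;
}

static byte[] Compress(byte[] data)
{
    using (var ms = new MemoryStream())
    {
        using (var gz = new GZipStream(ms, CompressionMode.Compress))
            gz.Write(data, 0, data.Length);
        return ms.ToArray();
    }
}

static byte[] Decompress(byte[] data)
{
    using (var gz = new GZipStream(new MemoryStream(data), CompressionMode.Decompress))
    using (var ms = new MemoryStream())
    {
        gz.CopyTo(ms); // Stream.CopyTo requires .NET 4+
        return ms.ToArray();
    }
}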
I am using a Windows Mobile Compact Edition 6.5 phone and am writing binary data to a file from Bluetooth. These files get quite large, 16 MB+, and once the file is written I need to search it for a start character and then delete everything before it, thus eliminating garbage. I cannot do this inline as the data comes in, due to graphing issues and speed, since I get a lot of data coming in and there are already too many if conditions on the incoming data. I figured it was best to post-process. Anyway, here is my dilemma: searching for the start bytes and rewriting the file sometimes takes 5 minutes or more... I basically move the file over to a temp file, parse through it, and rewrite a whole new file. I have to do this byte by byte.
private void closeFiles() {
    try {
        // Close file stream for raw data.
        if (this.fsRaw != null) {
            this.fsRaw.Flush();
            this.fsRaw.Close();

            // Move file, seek the first sync bytes,
            // write to fsRaw stream with sync byte and rest of data after it
            File.Move(this.s_fileNameRaw, this.s_fileNameRaw + ".old");
            FileStream fsRaw_Copy = File.Open(this.s_fileNameRaw + ".old", FileMode.Open);
            this.fsRaw = File.Create(this.s_fileNameRaw);

            int x = 0;
            bool syncFound = false;
            // search for sync byte algorithm
            while (x != -1) {
                ... logic to search for sync byte
                if (x != -1 && syncFound) {
                    this.fsPatientRaw.WriteByte((byte)x);
                }
            }

            this.fsRaw.Close();
            fsRaw_Copy.Close();
            File.Delete(this.s_fileNameRaw + ".old");
        }
    } catch (IOException e) {
        CLogger.WriteLog(ELogLevel.ERROR, "Exception in writing: " + e.Message);
    }
}
There has got to be a faster way than this!
------------ Testing times using the answer below ------------
Initial test, my way, with one-byte reads and one-byte writes:
27 KB/sec
Using the answer below and a 32768-byte buffer:
321 KB/sec
Using the answer below and a 65536-byte buffer:
501 KB/sec
You're doing a byte-wise copy of the entire file. That can't be efficient for a load of reasons. Search for the start offset (and end offset if you need both), then copy from one stream to another the entire contents between the two offsets (or the start offset and end of file).
EDIT
You don't have to read the entire contents to make the copy. Something like this (untested, but you get the idea) would work.
private void CopyPartial(string sourceName, byte syncByte, string destName)
{
    using (var input = File.OpenRead(sourceName))
    using (var reader = new BinaryReader(input))
    using (var output = File.Create(destName))
    {
        // seek to the sync byte (throws EndOfStreamException if it's never found)
        while (reader.ReadByte() != syncByte)
        {
        }
        output.WriteByte(syncByte); // keep the sync byte itself

        var buffer = new byte[4096]; // 4k page - adjust as you see fit
        do
        {
            var actual = reader.Read(buffer, 0, buffer.Length);
            output.Write(buffer, 0, actual);
        } while (reader.PeekChar() >= 0); // note: PeekChar can misbehave on binary data - see EDIT 2
    }
}
EDIT 2
I actually needed something similar to this today, so I decided to write it without the PeekChar() call. Here's the kernel of what I did - feel free to integrate it in place of the do...while copy loop above.
var buffer = new byte[1024];
var total = 0;
do
{
    var actual = reader.Read(buffer, 0, buffer.Length);
    writer.Write(buffer, 0, actual);
    total += actual;
} while (total < reader.BaseStream.Length); // note: assumes copying from the start of the stream
Don't discount an approach because you're afraid it will be too slow. Try it! It'll only take 5-10 minutes to give it a try and may result in a much better solution.
If the detection process for the start of the data is not too complex/slow, then avoiding writing data until you hit the start may actually make the program skip past the junk data more efficiently.
How to do this:
Use a simple bool to know whether or not you have detected the start of the data. If you are reading junk, then don't waste time writing it to the output, just scan it to detect the start of the data. Once you find the start, then stop scanning for the start and just copy the data to the output. Just copying the good data will incur no more than an if (found) check, which really won't make any noticeable difference to your performance.
You may find that in itself solves the problem. But you can optimise it if you need more performance:
What can you do to minimise the work you do to detect the start of the data? Perhaps if you are looking for a complex sequence, you only need to check for the one particular byte value that starts the sequence, and it's only if you find that start byte that you need to do any more complex checking. There are some very simple but efficient string-searching algorithms that may help in this sort of case too.
Or perhaps you can allocate a buffer (e.g. 4 kB) and gradually fill it with bytes from your incoming stream. When the buffer is filled, then and only then search for the end of the "junk" in your buffer. By batching the work you can make use of memory/cache coherence to make the processing considerably more efficient than it would be if you did the same work byte by byte.
Do all the other "conditions on the incoming data" need to be continually checked? How can you minimise the amount of work you need to do but still achieve the required results? Perhaps some of the ideas above might help here too?
Do you actually need to do any processing on the data while you are skipping junk? If not, then you can break the whole thing into two phases (skip junk, copy data), and skipping the junk won't cost you anything when it actually matters.
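A sketch of the bool-flag and batched-scan ideas combined (the method name, buffer size, and single start byte are illustrative assumptions):
static void SkipJunkThenCopy(Stream input, Stream output, byte startByte)
{
    byte[] buffer = new byte[4096];
    bool found = false;
    int read;

    while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        int offset = 0;
        if (!found)
        {
            // scan a whole buffer at once rather than byte by byte
            int index = Array.IndexOf(buffer, startByte, 0, read);
            if (index < 0) continue; // still junk - write nothing
            found = true;
            offset = index;
        }
        output.Write(buffer, offset, read - offset);
    }
}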
Hey there! I'm trying to read a 150 MB file with a FileStream, but every time I do, all I get is |zl instead of the whole stream. Note that the file has some special characters in it.
Does anybody know what the problem could be? Here is my code:
using (FileStream fs = File.OpenRead(path))
{
    byte[] buffer = new byte[fs.Length];
    fs.Read(buffer, 0, buffer.Length);
    extract = Encoding.Default.GetString(buffer);
}
Edit:
I tried ReadAllText, but it still returned the same four characters. It works fine on any other file, just not these few. When I use ReadAllLines it only gets the first line.
fs.Read() does not read the whole smash of bytes all at once; it reads some number of bytes and returns the number of bytes read. MSDN has an excellent example of how to use it to get the whole file:
http://msdn.microsoft.com/en-us/library/system.io.filestream.read.aspx
For what it's worth, reading the entire 150MB of data into memory is really going to put a drain on your client's system -- the preferred option would be to optimize it so that you don't need the whole file all at once.
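If you do need all of it at once, a read loop along the lines of that MSDN example might look like this (reusing the path and extract variables from your snippet):
using (FileStream fs = File.OpenRead(path))
{
    byte[] buffer = new byte[fs.Length];
    int offset = 0;
    while (offset < buffer.Length)
    {
        int read = fs.Read(buffer, offset, buffer.Length - offset);
        if (read == 0) break; // end of file reached early
        offset += read;
    }
    extract = Encoding.Default.GetString(buffer, 0, offset);
}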
If you want to read text this way, File.ReadAllLines (or ReadAllText) - http://msdn.microsoft.com/en-us/library/s2tte0y1.aspx - is a better option.
My guess is that the file is not a text file to start with, and that the way you display the resulting string stops at NUL (0) characters.
As debracey pointed out, Read returns the number of bytes read - check that. Although for file operations it is unlikely to stop after just 4 characters...
I have seen the following code for reading a file into an array, which is in turn used as a parameter for a SQL command that inserts it into a blob column:
using (FileStream fs = new FileStream(soubor, FileMode.Open, FileAccess.Read))
{
    int length = (int)fs.Length;
    byte[] buffer = new byte[length];
    int count;
    int sum = 0;
    while ((count = fs.Read(buffer, sum, length - sum)) > 0)
        sum += count;
}
Why can't I simply call fs.Read(buffer, 0, length) once to copy the content of the file into the buffer?
Thanks
There's more to it than just "the file may not fit in memory". The contract for Stream.Read explicitly says:
Implementations of this method read a maximum of count bytes from the current stream and store them in buffer beginning at offset. The current position within the stream is advanced by the number of bytes read; however, if an exception occurs, the current position within the stream remains unchanged. Implementations return the number of bytes read. The return value is zero only if the position is currently at the end of the stream. The implementation will block until at least one byte of data can be read, in the event that no data is available. Read returns 0 only when there is no more data in the stream and no more is expected (such as a closed socket or end of file). An implementation is free to return fewer bytes than requested even if the end of the stream has not been reached.
Note the last sentence - you can't rely on a single call to Stream.Read to read everything.
The docs for FileStream.Read have a similar warning:
The total number of bytes read into the buffer. This might be less than the number of bytes requested if that number of bytes are not currently available, or zero if the end of the stream is reached.
For a local file system I don't know for sure whether this will ever actually happen - but it could for a network-mounted file. Do you want your app to be brittle in that way?
Reading in a loop is the robust way to do things. Personally I prefer not to require the stream to support the Length property, either:
public static byte[] ReadFully(Stream stream)
{
    byte[] buffer = new byte[8192];
    using (MemoryStream tmpStream = new MemoryStream())
    {
        int bytesRead;
        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            tmpStream.Write(buffer, 0, bytesRead);
        }
        return tmpStream.ToArray();
    }
}
That is slightly less efficient when the length is known beforehand, but it's nice and simple. You only need to implement it once, put it in a utility library, and call it whenever you need to. If you really mind the efficiency loss, you could use CanSeek to test whether the Length property is supported, and repeatedly read into a single buffer in that case. Be aware of the possibility that the length of the stream could change while you're reading though...
Of course, File.ReadAllBytes will do the trick even more simply when you only need to deal with a file rather than a general stream.
Because your file could be very large, and the buffer usually has a fixed size of 4-32 KB. This way you know you're not filling your memory unnecessarily.
Of course, if you KNOW the size of your file is not too large, or if you store the contents in memory anyway, there is no reason not to read it all in one shot.
Although, if you want to read the contents of your file directly into a variable, you don't need the Stream API. Rather use
File.ReadAllText(...)
or
File.ReadAllBytes(...)
A simple fs.Read(buffer, 0, length) will probably work, and it will even be hard to find a test to break it. But it simply is not guaranteed, and it might break in the future.
The best answer here is to use a specialized method from the library. In this case
byte[] buffer = System.IO.File.ReadAllBytes(fileName);
A quick look with Reflector confirms that this will get you the partial-buffer logic and the exception-safe Dispose() of your stream.
And when future versions of the Framework provide better ways to do this, your code will automatically benefit.