I am using a windows mobile compact edition 6.5 phone and am writing out binary data to a file from bluetooth. These files get quite large, 16M+ and what I need to do is to once the file is written then I need to search the file for a start character and then delete everything before, thus eliminating garbage. I cannot do this inline when the data comes in due to graphing issues and speed as I get alot of data coming in and there is already too many if conditions on the incoming data. I figured it was best to post process. Anyway here is my dilemma, speed of search for the start bytes and the rewrite of the file takes sometimes 5mins or more...I basically move the file over to a temp file parse through it and rewrite a whole new file. I have to do this byte by byte.
private void closeFiles() {
try {
// Close file stream for raw data.
if (this.fsRaw != null) {
this.fsRaw.Flush();
this.fsRaw.Close();
// Move file, seek the first sync bytes,
// write to fsRaw stream with sync byte and rest of data after it
File.Move(this.s_fileNameRaw, this.s_fileNameRaw + ".old");
FileStream fsRaw_Copy = File.Open(this.s_fileNameRaw + ".old", FileMode.Open);
this.fsRaw = File.Create(this.s_fileNameRaw);
int x = 0;
bool syncFound = false;
// search for sync byte algorithm
while (x != -1) {
... logic to search for sync byte
if (x != -1 && syncFound) {
this.fsPatientRaw.WriteByte((byte)x);
}
}
this.fsRaw.Close();
fsRaw_Copy.Close();
File.Delete(this.s_fileNameRaw + ".old");
}
} catch(IOException e) {
CLogger.WriteLog(ELogLevel.ERROR,"Exception in writing: " + e.Message);
}
}
There has got to be a faster way than this!
------------Testing times using answer -------------
Initial Test my way with one byte read and and one byte write:
27 Kb/sec
using a answer below and a 32768 byte buffer:
321 Kb/sec
using a answer below and a 65536 byte buffer:
501 Kb/sec
You're doing a byte-wise copy of the entire file. That can't be efficient for a load of reasons. Search for the start offset (and end offset if you need both), then copy from one stream to another the entire contents between the two offsets (or the start offset and end of file).
EDIT
You don't have to read the entire contents to make the copy. Something like this (untested, but you get the idea) would work.
private void CopyPartial(string sourceName, byte syncByte, string destName)
{
using (var input = File.OpenRead(sourceName))
using (var reader = new BinaryReader(input))
using (var output = File.Create(destName))
{
var start = 0;
// seek to sync byte
while (reader.ReadByte() != syncByte)
{
start++;
}
var buffer = new byte[4096]; // 4k page - adjust as you see fit
do
{
var actual = reader.Read(buffer, 0, buffer.Length);
output.Write(buffer, 0, actual);
} while (reader.PeekChar() >= 0);
}
}
EDIT 2
I actually needed something similar to this today, so I decided to write it without the PeekChar() call. Here's the kernel of what I did - feel free to integrate it with the second do...while loop above.
var buffer = new byte[1024];
var total = 0;
do
{
var actual = reader.Read(buffer, 0, buffer.Length);
writer.Write(buffer, 0, actual);
total += actual;
} while (total < reader.BaseStream.Length);
Don't discount an approach because you're afraid it will be too slow. Try it! It'll only take 5-10 minutes to give it a try and may result in a much better solution.
If the detection process for the start of the data is not too complex/slow, then avoiding writing data until you hit the start may actually make the program skip past the junk data more efficiently.
How to do this:
Use a simple bool to know whether or not you have detected the start of the data. If you are reading junk, then don't waste time writing it to the output, just scan it to detect the start of the data. Once you find the start, then stop scanning for the start and just copy the data to the output. Just copying the good data will incur no more than an if (found) check, which really won't make any noticeable difference to your performance.
You may find that in itself solves the problem. But you can optimise it if you need more performance:
What can you do to minimise the work you do to detect the start of the data? Perhaps if you are looking for a complex sequence you only need to check for one particular byte value that starts the sequence, and it's only if you find that start byte that you need to do any more complex checking. There are some very simple but efficient string searching algorithms that may help in this sort of case too. Or perhaps you can allocate a buffer (e.g. 4kB) and gradually fill it with bytes from your incoming stream. When the buffer is filled, then and only then search for the end of the "junk" in your buffer. By batching the work you can make use of memory/cache coherence to make the processing considerably more efficient than it would be if you did the same work byte by byte.
Do all the other "conditions on the incoming data" need to be continually checked? How can you minimise the amount of work you need to do but still achieve the required results? Perhaps some of the ideas above might help here too?
Do you actually need to do any processing on the data while you are skipping junk? If not, then you can break the whole thing into two phases (skip junk, copy data), and skipping the junk won't cost you anything when it actually matters.
Related
This question already has answers here:
What's the fastest way to read a text file line-by-line?
(9 answers)
Closed 4 years ago.
I want to loop on all the lines of a very large file (10GB for example) using foreach
I am currently using File.ReadLines like that:
var lines = File.ReadLines(fileName);
foreach (var line in lines) {
// Process line
}
But this is very slow if the file is larger than 2MB and it will do the loop very slowly.
How can I loop on very large files?
Any help would be appreciated.
Thanks!
The way you do it is the best way available given that
you don't want to read your whole file into RAM at once
your line processing is independent of previous lines
Sorry, reading stuff from a hard disk is just slow.
Improvements will likely come from other sources:
store your file on a faster device (SSD?)
get more RAM to read your file into memory to at least speed up processing
First of all do you need to read the whole file or only the section of the file.
If you only need to read the section of the file
const int chunkSize = 1024; // read the file by chunks of 1KB
using (var file = File.OpenRead("yourfile"))
{
int bytesRead;
var buffer = new byte[chunkSize];
while ((bytesRead = file.Read(buffer, 0 /* start offset */, buffer.Length)) > 0)
{
// TODO: Process bytesRead number of bytes from the buffer
// not the entire buffer as the size of the buffer is 1KB
// whereas the actual number of bytes that are read are
// stored in the bytesRead integer.
}
}
If you need to load the whole file to the memory.
Use this method repeatedly instead of directly loading to the memory since you have control over what you are doing and at any time you can stop the process.
Or you can use MemoryMappedFile https://msdn.microsoft.com/en-us/library/system.io.memorymappedfiles.memorymappedfile.aspx?f=255&MSPPError=-2147217396
Memory mapped files will give a view to the program as beign accessed from the Memory, but it will load from the Disk for the first time only.
long offset = 0x10000000; // 256 megabytes
long length = 0x20000000; // 512 megabytes
// Create the memory-mapped file.
using (var mmf = MemoryMappedFile.CreateFromFile(#"c:\ExtremelyLargeImage.data", FileMode.Open,"ImgA"))
{
// Create a random access view, from the 256th megabyte (the offset)
// to the 768th megabyte (the offset plus length).
using (var accessor = mmf.CreateViewAccessor(offset, length))
{
//Your process
}
}
The looping will always be slow because of the sheer number of items that you have to loop through. Im pretty sure that its not the looping but the actual work you are doing on each one of those lines that slows it down. A file with 10GB of lines could literally have trillions of lines and anything but the most simple of tasks will take a lot of time.
You could always try making the job threaded so that a different thread is working on a different line. That way at least you have more cores working on the problem.
Set up a for loop and have them increment at different amounts.
Also, im not 100% but I think that you could get a huge increase in speed by splitting the whole thing into an array of string by splitting based on new lines and then working through those since everything is stored in the memory.
string lines = "your huge text";
string[] words = lines.Split('\n');
foreach(string singleLine in lines)
{
}
** Added based on comments **
So there's massive downsides and will take a huge amount of memory. At least the amount that the original file used but this gets round the problem of a slow hard drive and all the data will be read directly into the RAM of the machine, which will be far far faster than reading from the hard drive in small chunks.
There is also an issue here of having a limit of about 2 billion lines, since that the is the maximum number of entries in an array that you can have.
I need to compare the contents of very large files. Speed of the program is important. I need 100% match.I read a lot of information but did not find the optimal solution. I am haveconsidering two choices and both problems.
Compare whole file byte by byte - not fast enough for large files.
File Comparison using Hashes - not 100% match the two files with the same hash.
What would you suggest? Maybe I could make use of threads? Could MemoryMappedFile be helpful?
If you really need to to guarantee 100% that the files are 100% identical, then you need to do a byte-to-byte comparison. That's just entailed in the problem - the only hashing method with 0% risk of false matching is the identity function!
What we're left with is short-cuts that can quickly give us quick answers to let us skip the byte-for-byte comparison some of the time.
As a rule, the only short-cut on proving equality is proving identity. In OO code that would be showing two objects where in fact the same object. The closest thing in files is if a binding or NTFS junction meant two paths were to the same file. This happens so rarely that unless the nature of the work made it more usual than normal, it's not going to be a net-gain to check on.
So we're left short-cutting on finding mis-matches. Does nothing to increase our passes, but makes our fails faster:
Different size, not byte-for-byte equal. Simples!
If you will examine the same file more than once, then hash it and record the hash. Different hash, guaranteed not equal. The reduction in files that need a one-to-one comparison is massive.
Many file formats are likely to have some areas in common. Particularly the first bytes for many formats tend to be "magic numbers", headers etc. Either skip them, or skip then and then check last (if there is a chance of them being different but it's low).
Then there's the matter of making the actual comparison as fast as possible. Loading batches of 4 octets at a time into an integer and doing integer comparison will often be faster than octet-per-octet.
Threading can help. One way is to split the actual comparison of the file into more than one operation, but if possible a bigger gain will be found by doing completely different comparisons in different threads. I'd need to know a bit more about just what you are doing to advise much, but the main thing is to make sure the output of the tests is thread-safe.
If you do have more than one thread examining the same files, have them work far from each other. E.g. if you have four threads, you could split the file in four, or you could have one take byte 0, 4, 8 while another takes byte 1, 5, 9, etc. (or 4-octet group 0, 4, 8 etc). The latter is much more likely to have false sharing issues than the former, so don't do that.
Edit:
It also depends on just what you're doing with the files. You say you need 100% certainty, so this bit doesn't apply to you, but it's worth adding for the more general problem that if the cost of a false-positive is a waste of resources, time or memory rather than an actual failure, then reducing it through a fuzzy short-cut could be a net-win and it can be worth profiling to see if this is the case.
If you are using a hash to speed things (it can at least find some definite mis-matches faster), then Bob Jenkins' Spooky Hash is a good choice; it's not cryptographically secure, but if that's not your purpose it creates as 128-bit hash very quickly (much faster than a cryptographic hash, or even than the approaches taken with many GetHashCode() implementations) that are extremely good at not having accidental collisions (the sort of deliberate collisions cryptographic hashes avoid is another matter). I implemented it for .Net and put it on nuget because nobody else had when I found myself wanting to use it.
Serial Compare
Test File Size(s): 118 MB
Duration: 579 ms
Equal? true
static bool Compare(string filePath1, string filePath2)
{
using (FileStream file = File.OpenRead(filePath1))
{
using (FileStream file2 = File.OpenRead(filePath2))
{
if (file.Length != file2.Length)
{
return false;
}
int count;
const int size = 0x1000000;
var buffer = new byte[size];
var buffer2 = new byte[size];
while ((count = file.Read(buffer, 0, buffer.Length)) > 0)
{
file2.Read(buffer2, 0, buffer2.Length);
for (int i = 0; i < count; i++)
{
if (buffer[i] != buffer2[i])
{
return false;
}
}
}
}
}
return true;
}
Parallel Compare
Test File Size(s): 118 MB
Duration: 340 ms
Equal? true
static bool Compare2(string filePath1, string filePath2)
{
bool success = true;
var info = new FileInfo(filePath1);
var info2 = new FileInfo(filePath2);
if (info.Length != info2.Length)
{
return false;
}
long fileLength = info.Length;
const int size = 0x1000000;
Parallel.For(0, fileLength / size, x =>
{
var start = (int)x * size;
if (start >= fileLength)
{
return;
}
using (FileStream file = File.OpenRead(filePath1))
{
using (FileStream file2 = File.OpenRead(filePath2))
{
var buffer = new byte[size];
var buffer2 = new byte[size];
file.Position = start;
file2.Position = start;
int count = file.Read(buffer, 0, size);
file2.Read(buffer2, 0, size);
for (int i = 0; i < count; i++)
{
if (buffer[i] != buffer2[i])
{
success = false;
return;
}
}
}
}
});
return success;
}
MD5 Compare
Test File Size(s): 118 MB
Duration: 702 ms
Equal? true
static bool Compare3(string filePath1, string filePath2)
{
byte[] hash1 = GenerateHash(filePath1);
byte[] hash2 = GenerateHash(filePath2);
if (hash1.Length != hash2.Length)
{
return false;
}
for (int i = 0; i < hash1.Length; i++)
{
if (hash1[i] != hash2[i])
{
return false;
}
}
return true;
}
static byte[] GenerateHash(string filePath)
{
MD5 crypto = MD5.Create();
using (FileStream stream = File.OpenRead(filePath))
{
return crypto.ComputeHash(stream);
}
}
tl;dr Compare byte segments in parallel to determine if two files are equal.
Why not both?
Compare with hashes for the first pass, then return to conflicts and perform the byte-by-byte comparison. This allows maximal speed with guaranteed 100% match confidence.
There's no avoiding doing byte-for-byte comparisons if you want perfect comparisons (The file still has to be read byte-for-byte to do any hashing), so the issue is how you're reading and comparing the data.
So a there are two things you'll want to address:
Concurrency - Make sure you're reading data at the same time you're checking it.
Buffer Size - Reading the file 1 byte at a time is going to be slow, make sure you're reading it into a decent size buffer (about 8MB should do nicely on very large files)
The objective is to make sure you can do your comparison as fast as the hard disk can read the data, and that you're always reading data with no delays. If you're doing everything as fast as the data can be read from the drive then that's as fast as it is possible to do it since the hard disk read speed becomes the bottleneck.
Ultimately a hash is going to read the file byte by byte anyway ... so if you are looking for an accurate comparison then you might as well do the comparison. Can you give some more background on what you are trying to accomplish? How big are the 'big' files? How often do you have to compare them?
If you have a large set of files and you are trying to identify duplicates, I would try to break down the work by order of expense.
I might try something like the following:
1) group files by size. Files with different sizes clearly can't be identical. This information is very inexpensive to retrieve. If each group only contains 1 file, you are done, no dupes, otherwise proceed to step 2.
2) Within each size group generate a hash of the first n bytes of the file. Identify a reasonable n that will likely detect differences. Many files have identical headers, so you wan't to make sure n is greater that that header length. Group by the hashes, if each group contains 1 file, you are done (no dupes within this group), otherwise proceed to step 3.
3) At this point you are likely going to have to do more expensive work like generate a hash of the whole file, or do a byte by byte comparison. Depending on the number of files, and the nature of the file contents, you might try different approaches. Hopefully, the previous groupings will have narrowed down likely duplicates so that the number of files that you actually have to fully scan will be very small.
To calculate a hash, the entire file needs to be read.
How about opening both files together, and comparing them chunk by chunk?
Pseudo code:
open file A
open file B
while file A has more data
{
if next chunk of A != next chunk of B return false
}
return true
This way you are not loading too much together, and not reading in the entire file if you find a mismatch earlier. You should set up a test that varies the chunk size to determine the right size for optimal performance.
I'm parsing a 40MB CSV file.
It works nicely right now, and it's rather easy to parse, the only problem I have is performance, which of course is rather slow.
I'd like to know if there is a way I can improve this, as I only need to find by key I find and then stop looping, so if the entry is at the beginning of the file it finishes quickly, but if it's at the end it takes a while.
I could balance this by giving it a random start line, but the algorithm would still be O(n)... So I'm not sure if it's really worth it.
Is there a way I can improve my sequential parsing algorithm?
First: "Reading Huge CSV File" and "So I'm parsing a 40MB CSV file.". Ihave space delimited files here of 10+ GIGAbyte - what would you call those?
Also: the size of the file is irrelevant, you process them normally anyway line by line.
the only problem I have is performance, which of course is rather slow
Define. What do you think is slow? Parsing them is quite fast when done properly.
I'd like to know if there is a way I can improve this, as I only need to find by key I find and
then stop looping, so if the entry is at the beggining of the file it
finishes quickly, but if it's at the end it takes a while.
Do NOT use a CSV file? More than 60 years ago people invented databases for this.
Is there a way I can improve my secuential parsing algorithm?
YOu mean except pulling the parsing into a separate thread, and using an efficient code (which you may not have - noone knows).
Theoretically you could:
Read on one thread, with a decent buffer (less IO = faster)
Move field split into thread 2 (optional)
Use tasks to parse the fields (one per field per line) so you use all processors).
I am currently processing some (around 10.000) files (with sizes in double digit gigabte sadly) and... I go this way (Have to process them in a specific order) to use my computer fully.
That should give you a lot - and seriously, a 40mb file should load in 0.x seconds (0.5 - 0.6).
STILL that is very inefficient. Any reason you do not load the file into a database like all people do? CSV is good as some transport format, it sucks as a database.
Why don't you convert your csv to a normal database. Even sqlexpress will be fine.
Of course.
Say you order it alphabetically.
Then, start in the middle.
Each iteration, move to the middle of the top or bottom; whichever has the appropriate key.
This algorithm has O(log n).
This is called a "binary search," and is what "Mike Christianson" suggests in his comment.
Will suggest you to break one 40Mb File into smaller size few files.
And using Parallel.ForEach you could improve file processing performace
You can load the CSV into DataTable and use available operations that could be faster than looping through
Loading it to database and perform your operation on that is another option
This, I believe, is the fastest way to read a CSV file sequentially. There may be other ways to extract data from CSV, but if you limited to this approach, then this solution might work for you.
const int BUFFER_SIZE = 0x8000; //represents 32768 bytes
public unsafe void parseCSV(string filePath)
{
byte[] buffer = new byte[BUFFER_SIZE];
int workingSize = 0; //store how many bytes left in buffer
int bufferSize = 0; //how many bytes were read by the file stream
StringBuilder builder = new StringBuilder();
char cByte; //character representation of byte
using (FileStream fs = new FileStream(filePath, FileMode.Open, FileAccess.Read))
{
do
{
bufferSize = fs.Read(buffer, 0, BUFFER_SIZE);
workingSize = bufferSize;
fixed (byte* bufferPtr = buffer)
{
byte* workingBufferPtr = bufferptr;
while (workingSize-- > 0)
{
switch (cByte = (char)*workingBufferPtr++)
{
case '\n':
break;
case '\r':
case ',':
builder.ToString();
builder.Clear();
break;
default:
builder.Append(cByte);
break;
}
}
}
} while (bufferSize != 0);
}
}
Explanation:
Reading the file into a byte buffer. This will be done using the basic Filestream class, which gives access to the always fast Read()
Unsafe code. While I generally recommend not using unsafe code, when traversing any kind of buffer, using pointers can bring a speedup.
StringBuilder since we will be concatenating bytes into workable strings to test againt the key. StringBuilder is by far the fastest way to append bytes together and get a workable string out them.
Note that this method fairly complaint with RFC 4180, but if you deal with quotes, you can easily modify the code I posted to handle trimming.
What's the best and the fastest method to remove an item from a binary file?
I have a binary file and I know that I need to remove B number of bytes from a position A, how to do it?
Thanks
You might want to consider working in batches to prevent allocation on the LOH but that depends on the size of your file and the frequency in which you call this logic.
long skipIndex = 100;
int skipLength = 40;
using (FileStream fileStream = File.Open("file.dat", FileMode.Open))
{
int bufferSize;
checked
{
bufferSize = (int)(fileStream.Length - (skipLength + skipIndex));
}
byte[] buffer = new byte[bufferSize];
// read all data after
fileStream.Position = skipIndex + skipLength;
fileStream.Read(buffer, 0, bufferSize);
// write to displacement
fileStream.Position = skipIndex;
fileStream.Write(buffer, 0, bufferSize);
fileStream.SetLength(fileStream.Position); // trim the file
}
Depends... There are a few ways to do this, depending on your requirements.
The basic solution is to read chunks of data from the source file into a target file, skipping over the bits that must be removed (is it always only one segment to remove, or multiple segments?). After you're done, delete the original file and rename the temp file to the original's name.
Things to keep in mind here are that you should tend towards larger chunks rather than smaller. The size of your files will determine a suitable value. 1MB is a good 'default'.
The simple approach assumes that deleting and renaming a new file is not a problem. If you have specific permissions attached to the file, or used NTFS streams or some-such, this approach won't work.
In that case, make a copy of the original file. Then, skip to the first byte after the segment to ignore in the copied file, skip to the start of the segment in the source file, and transfer bytes from copy to original. If you're using Streams, you'll want to call Stream.SetLength to truncate the original to the correct size
If you want to just rewrite the original file, and remove a sequence from it the best way is to "rearrange" the file.
The idea is:
for i = A+1 to file.length - B
file[i] = file[i+B]
For better performance it's best to read and write in chunks and not single bytes. Test with different chunk sizes to see what best for your target system.
I am creating a downloading application and I wish to preallocate room on the harddrive for the files before they are actually downloaded as they could potentially be rather large, and noone likes to see "This drive is full, please delete some files and try again." So, in that light, I wrote this.
// Quick, and very dirty
System.IO.File.WriteAllBytes(filename, new byte[f.Length]);
It works, atleast until you download a file that is several hundred MB's, or potentially even GB's and you throw Windows into a thrashing frenzy if not totally wipe out the pagefile and kill your systems memory altogether. Oops.
So, with a little more enlightenment, I set out with the following algorithm.
using (FileStream outFile = System.IO.File.Create(filename))
{
// 4194304 = 4MB; loops from 1 block in so that we leave the loop one
// block short
byte[] buff = new byte[4194304];
for (int i = buff.Length; i < f.Length; i += buff.Length)
{
outFile.Write(buff, 0, buff.Length);
}
outFile.Write(buff, 0, f.Length % buff.Length);
}
This works, well even, and doesn't suffer the crippling memory problem of the last solution. It's still slow though, especially on older hardware since it writes out (potentially GB's worth of) data out to the disk.
The question is this: Is there a better way of accomplishing the same thing? Is there a way of telling Windows to create a file of x size and simply allocate the space on the filesystem rather than actually write out a tonne of data. I don't care about initialising the data in the file at all (the protocol I'm using - bittorrent - provides hashes for the files it sends, hence worst case for random uninitialised data is I get a lucky coincidence and part of the file is correct).
FileStream.SetLength is the one you want. The syntax:
public override void SetLength(
long value
)
If you have to create the file, I think that you can probably do something like this:
using (FileStream outFile = System.IO.File.Create(filename))
{
outFile.Seek(<length_to_write>-1, SeekOrigin.Begin);
OutFile.WriteByte(0);
}
Where length_to_write would be the size in bytes of the file to write. I'm not sure that I have the C# syntax correct (not on a computer to test), but I've done similar things in C++ in the past and it's worked.
Unfortunately, you can't really do this just by seeking to the end. That will set the file length to something huge, but may not actually allocate disk blocks for storage. So when you go to write the file, it will still fail.