How to efficiently merge gigantic files with C# - c#

I have over 125 TSV files of ~100 MB each that I want to merge. The merge operation is allowed to destroy the 125 files, but not the data. What matters is that at the end, I end up with one big file containing the content of all the files, one after the other (no specific order).
Is there an efficient way to do that? I was wondering if Windows provides an API to simply make a big "Union" of all those files? Otherwise, I will have to read all the files and write a big one.
Thanks!

So "merging" is really just writing the files one after the other? That's pretty straightforward - just open one output stream, and then repeatedly open an input stream, copy the data, close. For example:
static void ConcatenateFiles(string outputFile, params string[] inputFiles)
{
    using (Stream output = File.OpenWrite(outputFile))
    {
        foreach (string inputFile in inputFiles)
        {
            using (Stream input = File.OpenRead(inputFile))
            {
                input.CopyTo(output);
            }
        }
    }
}
That's using the Stream.CopyTo method which is new in .NET 4. If you're not using .NET 4, another helper method would come in handy:
private static void CopyStream(Stream input, Stream output)
{
    byte[] buffer = new byte[8192];
    int bytesRead;
    while ((bytesRead = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        output.Write(buffer, 0, bytesRead);
    }
}
There's nothing that I'm aware of that is more efficient than this... but importantly, this won't take up much memory on your system at all. It's not like it's repeatedly reading the whole file into memory then writing it all out again.
EDIT: As pointed out in the comments, there are ways you can fiddle with file options to potentially make it slightly more efficient in terms of what the file system does with the data. But fundamentally you're going to be reading the data and writing it, a buffer at a time, either way.
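For example (my own sketch, not something the answer above prescribes), you can open both streams with explicit buffer sizes and FileOptions.SequentialScan, which hints to the OS that each input is read front to back:
static void ConcatenateFilesTuned(string outputFile, params string[] inputFiles)
{
    // Note: FileMode.Create truncates an existing output file, unlike File.OpenWrite.
    using (Stream output = new FileStream(outputFile, FileMode.Create, FileAccess.Write,
                                          FileShare.None, 1 << 16))
    {
        foreach (string inputFile in inputFiles)
        {
            // SequentialScan is only a hint; measure before assuming it helps.
            using (Stream input = new FileStream(inputFile, FileMode.Open, FileAccess.Read,
                                                 FileShare.Read, 1 << 16, FileOptions.SequentialScan))
            {
                input.CopyTo(output);
            }
        }
    }
}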

Do it from the command line:
copy 1.txt+2.txt+3.txt combined.txt
or
copy *.txt combined.txt

Do you mean by merge that you want to decide with some custom logic which lines go where? Or do you mean that you mainly want to concatenate the files into one big one?
In the case of the latter, it is possible that you don't need to do this programmatically at all; just generate one batch file with this (/b is for binary, remove if not needed):
copy /b "file 1.tsv" + "file 2.tsv" "destination file.tsv"
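If you'd rather generate that batch file from C# than write it by hand, a rough sketch (the folder, pattern and file names here are placeholders, not from the question) could be:
// Requires System.IO and System.Linq; builds a one-line "copy /b" batch file.
// Beware: with very many files the command line can exceed cmd.exe's length limit.
var files = Directory.GetFiles(@"C:\data", "*.tsv");
string command = "copy /b " +
                 string.Join(" + ", files.Select(f => "\"" + f + "\"").ToArray()) +
                 " \"combined.tsv\"";
File.WriteAllText(@"C:\data\merge.bat", command);
// then run it, e.g. with Process.Start(@"C:\data\merge.bat");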
Using C#, I'd take the following approach. Write a simple function that copies two streams:
void CopyStreamToStream(Stream dest, Stream src)
{
    int bytesRead;
    // experiment with the best buffer size; often 65536 is very performant
    byte[] buffer = new byte[GOOD_BUFFER_SIZE];
    // copy everything
    while ((bytesRead = src.Read(buffer, 0, buffer.Length)) > 0)
    {
        dest.Write(buffer, 0, bytesRead);
    }
}
// then use as follows (do it in a loop, and don't forget the using-blocks)
CopyStreamToStream(yourOutputStream, yourInputStream);
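Putting that together, the calling loop might look like this (a sketch; the file names are placeholders):
using (var output = File.OpenWrite("merged.tsv"))
{
    foreach (var path in new[] { "file 1.tsv", "file 2.tsv", "file 3.tsv" })
    {
        using (var input = File.OpenRead(path))
        {
            // destination first, source second, matching the signature above
            CopyStreamToStream(output, input);
        }
    }
}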

Using a folder of 100MB text files totalling ~12GB, I found that a small time saving could be made over the accepted answer by using File.ReadAllBytes and then writing that out to the stream.
[Test]
public void RaceFileMerges()
{
    var inputFilesPath = @"D:\InputFiles";
    var inputFiles = Directory.EnumerateFiles(inputFilesPath).ToArray();

    var sw = new Stopwatch();
    sw.Start();

    ConcatenateFilesUsingReadAllBytes(@"D:\ReadAllBytesResult", inputFiles);
    Console.WriteLine($"ReadAllBytes method in {sw.Elapsed}");

    sw.Reset();
    sw.Start();

    ConcatenateFiles(@"D:\CopyToResult", inputFiles);
    Console.WriteLine($"CopyTo method in {sw.Elapsed}");
}

private static void ConcatenateFiles(string outputFile, params string[] inputFiles)
{
    using (var output = File.OpenWrite(outputFile))
    {
        foreach (var inputFile in inputFiles)
        {
            using (var input = File.OpenRead(inputFile))
            {
                input.CopyTo(output);
            }
        }
    }
}

private static void ConcatenateFilesUsingReadAllBytes(string outputFile, params string[] inputFiles)
{
    using (var stream = File.OpenWrite(outputFile))
    {
        foreach (var inputFile in inputFiles)
        {
            var currentBytes = File.ReadAllBytes(inputFile);
            stream.Write(currentBytes, 0, currentBytes.Length);
        }
    }
}
ReadAllBytes method in 00:01:22.2753300
CopyTo method in 00:01:30.3122215
I repeated this a number of times with similar results.

Related

Can I get a GZipStream for a file without writing to intermediate temporary storage?

Can I get a GZipStream for a file on disk without writing the entire compressed content to temporary storage? I'm currently using a temporary file on disk in order to avoid possible memory exhaustion using MemoryStream on very large files (this is working fine).
public void UploadFile(string filename)
{
    using (var temporaryFileStream = File.Open("tempfile.tmp", FileMode.CreateNew, FileAccess.ReadWrite))
    {
        using (var fileStream = File.OpenRead(filename))
        using (var compressedStream = new GZipStream(temporaryFileStream, CompressionMode.Compress, true))
        {
            fileStream.CopyTo(compressedStream);
        }

        temporaryFileStream.Position = 0;
        Uploader.Upload(temporaryFileStream);
    }
}
What I'd like to do is eliminate the temporary storage by creating a GZipStream and having it read from the original file only as the Uploader class requests bytes from it. Is such a thing possible? How might such an implementation be structured?
Note that Upload is a static method with signature static void Upload(Stream stream).
Edit: The full code is here if it's useful. I hope I've included all the relevant context in my sample above however.
Yes, this is possible, but not easily with any of the standard .NET stream classes. When I needed to do something like this, I created a new type of stream.
It's basically a circular buffer that allows one producer (writer) and one consumer (reader). It's pretty easy to use. Let me whip up an example. In the meantime, you can adapt the example in the article.
Later: Here's an example that should come close to what you're asking for.
using (var pcStream = new ProducerConsumerStream(BufferSize))
{
    // start the upload in its own thread, passing the stream as the thread argument
    var uploadThread = new Thread(UploadThreadProc);
    uploadThread.Start(pcStream);

    // Open the input file and attach the gzip stream to the pcStream
    using (var inputFile = File.OpenRead("inputFilename"))
    {
        // create gzip stream
        using (var gz = new GZipStream(pcStream, CompressionMode.Compress, true))
        {
            var bytesRead = 0;
            var buff = new byte[65536]; // 64K buffer
            while ((bytesRead = inputFile.Read(buff, 0, buff.Length)) != 0)
            {
                gz.Write(buff, 0, bytesRead);
            }
        }
    }

    // The entire file has been compressed and copied to the buffer.
    // Mark the stream as "input complete".
    pcStream.CompleteAdding();

    // wait for the upload thread to complete.
    uploadThread.Join();

    // It's very important that you don't close the pcStream before
    // the uploader is done!
}
The upload thread should be pretty simple:
void UploadThreadProc(object state)
{
    var pcStream = (ProducerConsumerStream)state;
    Uploader.Upload(pcStream);
}
You could, of course, put the producer on a background thread and have the upload be done on the main thread. Or have them both on background threads. I'm not familiar with the semantics of your uploader, so I'll leave that decision to you.
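If you don't have the ProducerConsumerStream class from the article to hand, here is a minimal sketch of the same idea built on BlockingCollection<byte[]> (my own illustration, not the class the answer refers to; it supports exactly one writer and one reader and is far less robust than a proper circular-buffer stream):
// Requires System, System.Collections.Concurrent, System.IO and System.Threading (.NET 4+).
public sealed class SimpleProducerConsumerStream : Stream
{
    private readonly BlockingCollection<byte[]> _chunks;
    private byte[] _current;
    private int _currentOffset;

    public SimpleProducerConsumerStream(int boundedCapacity)
    {
        _chunks = new BlockingCollection<byte[]>(boundedCapacity);
    }

    // Producer side: each Write copies the bytes and queues them,
    // blocking while the bounded queue is full.
    public override void Write(byte[] buffer, int offset, int count)
    {
        var copy = new byte[count];
        Buffer.BlockCopy(buffer, offset, copy, 0, count);
        _chunks.Add(copy);
    }

    // Call once the producer has written everything.
    public void CompleteAdding()
    {
        _chunks.CompleteAdding();
    }

    // Consumer side: Read hands out queued bytes, returning 0 only after
    // CompleteAdding has been called and the queue has drained.
    public override int Read(byte[] buffer, int offset, int count)
    {
        while (_current == null || _currentOffset == _current.Length)
        {
            if (!_chunks.TryTake(out _current, Timeout.Infinite))
                return 0;
            _currentOffset = 0;
        }
        int n = Math.Min(count, _current.Length - _currentOffset);
        Buffer.BlockCopy(_current, _currentOffset, buffer, offset, n);
        _currentOffset += n;
        return n;
    }

    public override bool CanRead { get { return true; } }
    public override bool CanWrite { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override void Flush() { }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
}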

Decompress tar files using C#

I'm searching for a way to add embedded resources to my solution. These resources will be folders with a lot of files in them. On user demand they need to be decompressed.
I'm searching for a way to store such folders in the executable without involving third-party libraries (looks rather stupid, but this is the task).
I have found that I can GZip and UnGZip them using the standard libraries. But GZip handles a single file only. In such cases TAR should come onto the scene. But I haven't found a TAR implementation among the standard classes.
Maybe it is possible to decompress TAR with bare C#?
While looking for a quick answer to the same question, I came across this thread, and was not entirely satisfied with the current answers, as they all point to using third-party dependencies to much larger libraries, all just to achieve simple extraction of a tar.gz file to disk.
While the gz format could be considered rather complicated, tar on the other hand is quite simple. At its core, it just takes a bunch of files, prepends a 500-byte header (padded to 512 bytes) to each one describing the file, and writes them all to a single archive on a 512-byte alignment. There is no compression; that is typically handled by compressing the created file into a gz archive, which .NET conveniently supports out of the box, taking care of the hard part.
Having looked at the spec for the tar format, there are really only 2 values (especially on Windows) we need to pick out from the header in order to extract the file from a stream. The first is the name, and the second is the size. Using those two values, we need only seek to the appropriate position in the stream and copy the bytes to a file.
I made a very rudimentary, down-and-dirty method to extract a tar archive to a directory, and added some helper functions for opening from a stream or filename, and decompressing the gz file first using built-in functions.
The primary method is this:
public static void ExtractTar(Stream stream, string outputDir)
{
    var buffer = new byte[100];
    while (true)
    {
        // The file name occupies the first 100 bytes of each 512-byte header.
        stream.Read(buffer, 0, 100);
        var name = Encoding.ASCII.GetString(buffer).Trim('\0');
        if (String.IsNullOrWhiteSpace(name))
            break;

        // Skip to offset 124, where the 12-byte octal file size lives.
        stream.Seek(24, SeekOrigin.Current);
        stream.Read(buffer, 0, 12);
        var size = Convert.ToInt64(Encoding.ASCII.GetString(buffer, 0, 12).Trim(), 8);

        // Skip the remainder of the header (position 136 -> 512).
        stream.Seek(376L, SeekOrigin.Current);

        var output = Path.Combine(outputDir, name);
        if (!Directory.Exists(Path.GetDirectoryName(output)))
            Directory.CreateDirectory(Path.GetDirectoryName(output));
        using (var str = File.Open(output, FileMode.OpenOrCreate, FileAccess.Write))
        {
            var buf = new byte[size];
            stream.Read(buf, 0, buf.Length);
            str.Write(buf, 0, buf.Length);
        }

        // Advance to the next 512-byte boundary.
        var pos = stream.Position;
        var offset = 512 - (pos % 512);
        if (offset == 512)
            offset = 0;
        stream.Seek(offset, SeekOrigin.Current);
    }
}
And here are a few helper functions for opening from a file, and for first decompressing a tar.gz file or stream before extracting.
public static void ExtractTarGz(string filename, string outputDir)
{
    using (var stream = File.OpenRead(filename))
        ExtractTarGz(stream, outputDir);
}

public static void ExtractTarGz(Stream stream, string outputDir)
{
    // A GZipStream is not seekable, so copy it first to a MemoryStream
    using (var gzip = new GZipStream(stream, CompressionMode.Decompress))
    {
        const int chunk = 4096;
        using (var memStr = new MemoryStream())
        {
            int read;
            var buffer = new byte[chunk];
            do
            {
                read = gzip.Read(buffer, 0, chunk);
                memStr.Write(buffer, 0, read);
            } while (read == chunk);

            memStr.Seek(0, SeekOrigin.Begin);
            ExtractTar(memStr, outputDir);
        }
    }
}

public static void ExtractTar(string filename, string outputDir)
{
    using (var stream = File.OpenRead(filename))
        ExtractTar(stream, outputDir);
}
Here is a gist of the full file with some comments.
Tar-cs will do the job, but it is quite slow. I would recommend using SharpCompress which is significantly quicker. It also supports other compression types and it has been updated recently.
using System;
using System.IO;
using SharpCompress.Common;
using SharpCompress.Reader;

private static String directoryPath = @"C:\Temp";

public static void unTAR(String tarFilePath)
{
    using (Stream stream = File.OpenRead(tarFilePath))
    {
        var reader = ReaderFactory.Open(stream);
        while (reader.MoveToNextEntry())
        {
            if (!reader.Entry.IsDirectory)
            {
                ExtractionOptions opt = new ExtractionOptions
                {
                    ExtractFullPath = true,
                    Overwrite = true
                };
                reader.WriteEntryToDirectory(directoryPath, opt);
            }
        }
    }
}
See tar-cs
using (FileStream unarchFile = File.OpenRead(tarfile))
{
    TarReader reader = new TarReader(unarchFile);
    reader.ReadToEnd("out_dir");
}
Since you are not allowed to use outside libraries, you are not restricted to a specific format of the tar file either. In fact, it doesn't even need to be all in one file.
You can write your own tar-like utility in C# that walks a directory tree and produces two files: a "header" file consisting of a serialized dictionary mapping relative paths to offset/length pairs, and a big file containing the content of the individual files concatenated into one giant blob. This is not a trivial task, but it's not overly complicated either.
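A rough sketch of the "pack" half of that idea (the pipe-separated index format is something I made up purely for illustration):
// Requires System.IO: append every file under sourceDir to one blob and record
// "relativePath|offset|length" lines in a separate index file.
static void PackDirectory(string sourceDir, string blobFile, string indexFile)
{
    using (var blob = File.Create(blobFile))
    using (var index = File.CreateText(indexFile))
    {
        foreach (var path in Directory.GetFiles(sourceDir, "*", SearchOption.AllDirectories))
        {
            long offset = blob.Position;
            using (var input = File.OpenRead(path))
            {
                input.CopyTo(blob); // Stream.CopyTo needs .NET 4; use a manual loop on older versions
            }
            string relative = path.Substring(sourceDir.Length).TrimStart('\\', '/');
            index.WriteLine("{0}|{1}|{2}", relative, offset, blob.Position - offset);
        }
    }
}
Unpacking is the reverse: read the index, seek to each recorded offset in the blob, and copy that many bytes out to the relative path.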
There are 2 ways to compress/decompress in .NET. First, you can use the GZipStream and DeflateStream classes; both compress your data, but only GZipStream writes the .gz format, so a file compressed with GZipStream can be opened with any popular compression application such as WinZip, WinRAR or 7-Zip, while you can't open a file compressed with DeflateStream. These two classes date from .NET 2.0.
The other way is the Package class. It's much the same as GZipStream and DeflateStream; the only difference is that you can compress multiple files, which can then be opened with WinZip, WinRAR or 7-Zip. So that's all .NET has. But it's not even a generic .zip file;
it's something Microsoft uses to compress their *x-extension Office files. If you decompress any docx file with the Package class, you can see everything stored in it. So don't use the .NET libraries for compressing or even decompressing, because you can't make a generic compressed file or decompress a generic zip file. You'll have to consider a third-party library such as
http://www.icsharpcode.net/OpenSource/SharpZipLib/
or implement everything from the ground up.

writing a large stream to a file in C#.Net 64K by 64K

Please suggest a good method that can be used to write a stream into a file.
I just need a simple C# function that can take a stream as input and do the job.
I need to do this for very large files, i.e. files > 4 GB.
Can this be done better using LINQ, extension methods, etc.?
Please provide a good utility function that can also return the progress in percentage through yield.
Edit: I know about looping through a byte[] and writing it to a file. I've tried the File.WriteAllBytes method. But I'm just looking for a very nice way of doing it using LINQ, yield and extension methods.
Edit: Here is a utility function that should do the trick:
Update: Changed second parameter to file name
public delegate void ProgressCallback(long position, long total);

public void Copy(Stream inputStream, string outputFile, ProgressCallback progressCallback)
{
    using (var outputStream = File.OpenWrite(outputFile))
    {
        const int bufferSize = 4096;
        while (inputStream.Position < inputStream.Length)
        {
            byte[] data = new byte[bufferSize];
            int amountRead = inputStream.Read(data, 0, bufferSize);
            outputStream.Write(data, 0, amountRead);

            if (progressCallback != null)
                progressCallback(inputStream.Position, inputStream.Length);
        }
        outputStream.Flush();
    }
}
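A possible call site, printing a percentage from the callback (a sketch; the file names are placeholders, and the input must be seekable because Copy uses Position and Length):
using (var input = File.OpenRead("huge-input.bin"))
{
    Copy(input, "huge-output.bin", (position, total) =>
        Console.WriteLine("{0:0.0}%", position * 100.0 / total));
}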

How to write super-fast file-streaming code in C#?

I have to split a huge file into many smaller files. Each of the destination files is defined by an offset and length as the number of bytes. I'm using the following code:
private void copy(string srcFile, string dstFile, int offset, int length)
{
    BinaryReader reader = new BinaryReader(File.OpenRead(srcFile));
    reader.BaseStream.Seek(offset, SeekOrigin.Begin);
    byte[] buffer = reader.ReadBytes(length);

    BinaryWriter writer = new BinaryWriter(File.OpenWrite(dstFile));
    writer.Write(buffer);
}
Considering that I have to call this function about 100,000 times, it is remarkably slow.
Is there a way to make the Writer connected directly to the Reader? (That is, without actually loading the contents into the Buffer in memory.)
I don't believe there's anything within .NET to allow copying a section of a file without buffering it in memory. However, it strikes me that this is inefficient anyway, as it needs to open the input file and seek many times. If you're just splitting up the file, why not open the input file once, and then just write something like:
public static void CopySection(Stream input, string targetFile, int length)
{
    byte[] buffer = new byte[8192];
    using (Stream output = File.OpenWrite(targetFile))
    {
        int bytesRead = 1;
        // This will finish silently if we couldn't read "length" bytes.
        // An alternative would be to throw an exception
        while (length > 0 && bytesRead > 0)
        {
            bytesRead = input.Read(buffer, 0, Math.Min(length, buffer.Length));
            output.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
    }
}
This has a minor inefficiency in creating a buffer on each invocation - you might want to create the buffer once and pass that into the method as well:
public static void CopySection(Stream input, string targetFile,
                               int length, byte[] buffer)
{
    using (Stream output = File.OpenWrite(targetFile))
    {
        int bytesRead = 1;
        // This will finish silently if we couldn't read "length" bytes.
        // An alternative would be to throw an exception
        while (length > 0 && bytesRead > 0)
        {
            bytesRead = input.Read(buffer, 0, Math.Min(length, buffer.Length));
            output.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
    }
}
Note that this also closes the output stream (due to the using statement) which your original code didn't.
The important point is that this will use the operating system file buffering more efficiently, because you reuse the same input stream, instead of reopening the file at the beginning and then seeking.
I think it'll be significantly faster, but obviously you'll need to try it to see...
This assumes contiguous chunks, of course. If you need to skip bits of the file, you can do that from outside the method. Also, if you're writing very small files, you may want to optimise for that situation too - the easiest way to do that would probably be to introduce a BufferedStream wrapping the input stream.
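That last suggestion would look roughly like this (a sketch; srcFile and the 1 MB buffer size are placeholders to experiment with):
using (Stream raw = File.OpenRead(srcFile))
using (Stream input = new BufferedStream(raw, 1024 * 1024))
{
    // ... call CopySection(input, targetFile, length, buffer) for each small file ...
}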
The fastest way to do file I/O from C# is to use the Windows ReadFile and WriteFile functions. I have written a C# class that encapsulates this capability as well as a benchmarking program that looks at different I/O methods, including BinaryReader and BinaryWriter. See my blog post at:
http://designingefficientsoftware.wordpress.com/2011/03/03/efficient-file-io-from-csharp/
How large is length? You may do better to re-use a fixed sized (moderately large, but not obscene) buffer, and forget BinaryReader... just use Stream.Read and Stream.Write.
(edit) something like:
private static void copy(string srcFile, string dstFile, int offset,
                         int length, byte[] buffer)
{
    using (Stream inStream = File.OpenRead(srcFile))
    using (Stream outStream = File.OpenWrite(dstFile))
    {
        inStream.Seek(offset, SeekOrigin.Begin);
        int bufferLength = buffer.Length, bytesRead;
        while (length > bufferLength &&
               (bytesRead = inStream.Read(buffer, 0, bufferLength)) > 0)
        {
            outStream.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
        while (length > 0 &&
               (bytesRead = inStream.Read(buffer, 0, length)) > 0)
        {
            outStream.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
    }
}
You shouldn't re-open the source file each time you do a copy; better to open it once and pass the resulting BinaryReader to the copy function. Also, it might help if you order your seeks, so you don't make big jumps inside the file.
If the lengths aren't too big, you can also try to group several copy calls by grouping offsets that are near to each other and reading the whole block you need for them, for example:
offset = 1234, length = 34
offset = 1300, length = 40
offset = 1350, length = 1000
can be grouped to one read:
offset = 1234, length = 1116
Then you only have to "seek" in your buffer and can write the three new files from there without having to read again.
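As an illustration of that grouping (a sketch; inStream and the output names are placeholders, and the numbers are the ones from the example above):
// One read covering bytes 1234..2349 of the source, then three writes from the buffer.
byte[] block = new byte[1116];
inStream.Seek(1234, SeekOrigin.Begin);
int filled = 0;
while (filled < block.Length)
{
    int n = inStream.Read(block, filled, block.Length - filled);
    if (n == 0) break; // unexpected end of file
    filled += n;
}
using (var out1 = File.OpenWrite("out1.bin")) out1.Write(block, 1234 - 1234, 34);
using (var out2 = File.OpenWrite("out2.bin")) out2.Write(block, 1300 - 1234, 40);
using (var out3 = File.OpenWrite("out3.bin")) out3.Write(block, 1350 - 1234, 1000);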
Have you considered using the CCR? Since you are writing to separate files, you can do everything in parallel (read and write), and the CCR makes it very easy to do this.
static void Main(string[] args)
{
    Dispatcher dp = new Dispatcher();
    DispatcherQueue dq = new DispatcherQueue("DQ", dp);

    Port<long> offsetPort = new Port<long>();
    Arbiter.Activate(dq, Arbiter.Receive<long>(true, offsetPort,
        new Handler<long>(Split)));

    FileStream fs = File.Open(file_path, FileMode.Open);
    long size = fs.Length;
    fs.Dispose();

    for (long i = 0; i < size; i += split_size)
    {
        offsetPort.Post(i);
    }
}

private static void Split(long offset)
{
    FileStream reader = new FileStream(file_path, FileMode.Open,
        FileAccess.Read);
    reader.Seek(offset, SeekOrigin.Begin);

    long toRead = 0;
    if (offset + split_size <= reader.Length)
        toRead = split_size;
    else
        toRead = reader.Length - offset;

    byte[] buff = new byte[toRead];
    reader.Read(buff, 0, (int)toRead);
    reader.Dispose();

    File.WriteAllBytes("c:\\out" + offset + ".txt", buff);
}
This code posts offsets to a CCR port which causes a Thread to be created to execute the code in the Split method. This causes you to open the file multiple times but gets rid of the need for synchronization. You can make it more memory efficient but you'll have to sacrifice speed.
The first thing I would recommend is to take measurements. Where are you losing your time? Is it in the read, or the write?
Over 100,000 accesses (sum the times):
How much time is spent allocating the buffer array?
How much time is spent opening the file for read (is it the same file every time?)
How much time is spent in read and write operations?
If you aren't doing any type of transformation on the file, do you need a BinaryWriter, or can you use a FileStream for writes? (Try it: do you get identical output? Does it save time?)
Using FileStream + StreamWriter I know it's possible to create massive files in little time (less than 1 min 30 seconds). I generate three files totaling 700+ megabytes from one file using that technique.
Your primary problem with the code you're using is that you are opening a file every time. That is creating file I/O overhead.
If you knew the names of the files you would be generating ahead of time, you could extract the File.OpenWrite into a separate method; it will increase the speed. Without seeing the code that determines how you are splitting the files, I don't think you can get much faster.
No one suggests threading? Writing the smaller files looks like a textbook example of where threads are useful. Set up a bunch of threads to create the smaller files. That way, you can create them all in parallel and you don't need to wait for each one to finish. My assumption is that creating the files (a disk operation) will take WAY longer than splitting up the data. And of course you should verify first that a sequential approach is not adequate.
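A sketch of that idea on the thread pool (Tasks are a .NET 4 convenience here; "chunks" and its FileName/Data members are placeholders for however you split the data):
// Requires System.IO, System.Linq and System.Threading.Tasks.
var tasks = chunks
    .Select(chunk => Task.Factory.StartNew(
        () => File.WriteAllBytes(chunk.FileName, chunk.Data)))
    .ToArray();
Task.WaitAll(tasks); // wait for all of the smaller files to be written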
(For future reference.)
Quite possibly the fastest way to do this would be to use memory mapped files (so primarily copying memory, and the OS handling the file reads/writes via its paging/memory management).
Memory Mapped files are supported in managed code in .NET 4.0.
But as noted, you need to profile, and expect to switch to native code for maximum performance.
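For example, on .NET 4.0 copying one section of the big file through a memory-mapped view might look like this (a sketch; the paths, offset and size are placeholders):
// Requires System.IO and System.IO.MemoryMappedFiles.
using (var mmf = MemoryMappedFile.CreateFromFile(@"D:\huge.bin", FileMode.Open))
using (var view = mmf.CreateViewStream(1024, 4096, MemoryMappedFileAccess.Read)) // offset, size
using (var output = File.OpenWrite(@"D:\part.bin"))
{
    view.CopyTo(output);
}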

How to read multiple text files and save them into one text file?

In my case I have five huge text files, which I have to combine into one text file.
I tried with StreamReader(), but I don't know how to make it read one more file; do I have to assign another variable?
An example would be greatly appreciated.
New answer
(See explanation for junking original answer below.)
static void CopyFiles(string dest, params string[] sources)
{
    using (TextWriter writer = File.CreateText(dest))
    {
        // Somewhat arbitrary limit, but it won't go on the large object heap
        char[] buffer = new char[16 * 1024];
        foreach (string source in sources)
        {
            using (TextReader reader = File.OpenText(source))
            {
                int charsRead;
                while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
                {
                    writer.Write(buffer, 0, charsRead);
                }
            }
        }
    }
}
This new answer is quite like Martin's approach, except:
It reads into a smaller buffer; 16K is going to be acceptable in almost all situations, and won't end up on the large object heap (which doesn't get compacted)
It reads text data instead of binary data, for two reasons:
The code can easily be modified to convert from one encoding to another (see the sketch after this list)
If each input file contains a byte-order mark, that will be skipped by the reader, instead of ending up with byte-order marks scattered through the output file at input file boundaries
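On the first point, a minimal sketch of that modification (the encodings chosen here are just an example, not anything the question requires):
// Requires System.Text: decode the inputs as Windows-1252 and write the output as UTF-8.
static void CopyFilesConvertingEncoding(string dest, params string[] sources)
{
    using (var writer = new StreamWriter(dest, false, Encoding.UTF8))
    {
        char[] buffer = new char[16 * 1024];
        foreach (string source in sources)
        {
            using (var reader = new StreamReader(source, Encoding.GetEncoding(1252)))
            {
                int charsRead;
                while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
                {
                    writer.Write(buffer, 0, charsRead);
                }
            }
        }
    }
}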
Original answer
Martin Stettner pointed out an issue in the answer below: if the first file ends without a newline, it will still create a newline in the output file. Also, it will translate newlines into "\r\n" even if they were previously just "\r" or "\n". Finally, it pointlessly risks using large amounts of memory for long lines.
Something like:
static void CopyFiles(string dest, params string[] sources)
{
    using (TextWriter writer = File.CreateText(dest))
    {
        foreach (string source in sources)
        {
            using (TextReader reader = File.OpenText(source))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    writer.WriteLine(line);
                }
            }
        }
    }
}
Note that this reads line by line to avoid reading too much into memory at a time. You could make it simpler if you're happy to read each file completely into memory (still one at a time):
static void CopyFiles(string dest, params string[] sources)
{
    using (TextWriter writer = File.CreateText(dest))
    {
        foreach (string source in sources)
        {
            string text = File.ReadAllText(source);
            writer.Write(text);
        }
    }
}
Edit:
As Jon Skeet pointed out, text files usually should be handled differently than binary files.
I just leave this answer since it might be more performant if you have really big files and aren't concerned by encoding issues (such as different input files having different encodings, or multiple byte order marks in the output file):
public void CopyFiles(string destPath, string[] sourcePaths)
{
    byte[] buffer = new byte[10 * 1024 * 1024]; // Just allocate a buffer as big as you can afford
    using (var destStream = new FileStream(destPath, FileMode.Create))
    {
        foreach (var sourcePath in sourcePaths)
        {
            int read;
            using (var sourceStream = new FileStream(sourcePath, FileMode.Open))
            {
                while ((read = sourceStream.Read(buffer, 0, buffer.Length)) != 0)
                    destStream.Write(buffer, 0, read);
            }
        }
    }
}
