Read the first n characters of a big text file - C#

I have a very big text file, for example about 1 GB. I need to read just the first 100 characters and nothing more.
I searched Stack Overflow and other forums, but all the solutions I found first read the whole file and then return the first n characters.
I do not want to read and load the whole file into memory; I just need the first characters.

You can use StreamReader.ReadBlock() to read a specified number of characters from a file:
public static char[] ReadChars(string filename, int count)
{
    using (var stream = File.OpenRead(filename))
    using (var reader = new StreamReader(stream, Encoding.UTF8))
    {
        char[] buffer = new char[count];
        int n = reader.ReadBlock(buffer, 0, count);
        char[] result = new char[n];
        Array.Copy(buffer, result, n);
        return result;
    }
}
Note that this assumes that your file has UTF8 encoding. If it doesn't, you'll need to specify the correct encoding (in which case you could add an encoding parameter to ReadChars() rather than hard-coding it).
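For example, a minimal sketch of such an overload (this is not part of the original answer; the UTF-8 version simply delegates to it):
public static char[] ReadChars(string filename, int count)
{
    // Default to UTF-8, matching the original version
    return ReadChars(filename, count, Encoding.UTF8);
}

public static char[] ReadChars(string filename, int count, Encoding encoding)
{
    using (var stream = File.OpenRead(filename))
    using (var reader = new StreamReader(stream, encoding))
    {
        char[] buffer = new char[count];
        int n = reader.ReadBlock(buffer, 0, count);
        char[] result = new char[n];
        Array.Copy(buffer, result, n);
        return result;
    }
}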
The advantage of using ReadBlock() rather than Read() is that it blocks until either all the requested characters have been read or the end of the file has been reached. For a FileStream this makes little practical difference; just be aware that in the general case Read() can return fewer characters than asked for even when the end of the stream has not been reached.
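If you did use Read() against a stream where partial reads matter, you would have to loop yourself. A rough sketch (assuming the reader and count from ReadChars() above), which is essentially what ReadBlock() does internally:
char[] buffer = new char[count];
int total = 0;
while (total < count)
{
    // Read() may return fewer characters than requested even before end-of-stream
    int read = reader.Read(buffer, total, count - total);
    if (read == 0)
        break; // end of stream reached
    total += read;
}
// only the first 'total' characters of 'buffer' are valid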
If you want an async version you can just call ReadBlockAsync() like so:
public static async Task<char[]> ReadCharsAsync(string filename, int count)
{
    using (var stream = File.OpenRead(filename))
    using (var reader = new StreamReader(stream, Encoding.UTF8))
    {
        char[] buffer = new char[count];
        int n = await reader.ReadBlockAsync(buffer, 0, count);
        char[] result = new char[n];
        Array.Copy(buffer, result, n);
        return result;
    }
}
Which you might call like so:
using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;

namespace Demo
{
    static class Program
    {
        static async Task Main()
        {
            string filename = "Your filename here";
            Console.WriteLine(await ReadCharsAsync(filename, 100));
        }
    }
}

Let's read with StreamReader:
char[] buffer = new char[100];
using (StreamReader reader = new StreamReader(@"c:\MyFile.txt")) {
    // Technically, StreamReader can read fewer than buffer.Length characters
    // if the file is too short; in that case reader.Read returns the number
    // of characters actually read
    reader.Read(buffer, 0, buffer.Length);
}
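If you want the result as a string and the file might be shorter than 100 characters, a small follow-up sketch that uses the returned count so unused trailing buffer elements are excluded:
char[] buffer = new char[100];
int charsRead;
using (StreamReader reader = new StreamReader(@"c:\MyFile.txt"))
{
    charsRead = reader.Read(buffer, 0, buffer.Length);
}
// build the string from only the characters that were actually read
string first100 = new string(buffer, 0, charsRead);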

FileStream.Read() does not read the whole file all at once; it reads some number of bytes and returns how many were actually read. MSDN has a good example of how to use it:
http://msdn.microsoft.com/en-us/library/system.io.filestream.read.aspx
Reading the entire 1 GB of data into memory is really going to put a strain on your client's system -- the preferred option would be to optimize it so that you don't need the whole file all at once.
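For example, a minimal sketch that reads only the first 100 bytes (note this reads bytes, not characters; for a multi-byte encoding you would still want a StreamReader as in the accepted answer):
byte[] buffer = new byte[100];
int bytesRead;
using (FileStream fs = File.OpenRead(@"c:\MyFile.txt"))
{
    // Read may return fewer bytes than requested; loop if you need exactly 100
    bytesRead = fs.Read(buffer, 0, buffer.Length);
}
// only the first 'bytesRead' bytes of 'buffer' are valid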

Related

ToArray() function limitation

I am using the .ToArray() method to convert my string to a char array, whose size I have declared as char[] buffer = new char[1000000];. But when I use the following code:
using (StreamReader streamReader = new StreamReader(path1))
{
    buffer = streamReader.ReadToEnd().ToCharArray();
}
// buffer = result.ToArray();
threadfunc(data_path1);
the size of buffer ends up fixed at 8190; it is not reading the whole file after using .ToCharArray() or .ToArray().
What is the reason for this? Do .ToCharArray() or .ToArray() have size limitations? If I do not use these functions I am able to read the whole file as a string, but when I try to convert it to a char array with them I hit this size limit.
My guess is that the problem is that ReadToEnd() should finish before you call ToCharArray(). This might help you. You don't need to pre-define buffer, since ToCharArray() creates a new char[] instance itself:
string content;
using (StreamReader streamReader = new StreamReader(path1))
{
    content = streamReader.ReadToEnd();
}
var buffer = content.ToCharArray();
ToCharArray() returns a new array instance, so buffer will refer to that new instance, whose size matches the data returned by ReadToEnd().
If you want to keep buffer the same size, just copy the new array into the existing one:
char[] buffer = new char[1000000];
using (StreamReader streamReader = new StreamReader(path1))
{
    var tempArray = streamReader.ReadToEnd().ToCharArray();
    tempArray.CopyTo(buffer, 0);
}
If you just want to use the resulting array, you don't need to "predict" its size - just use the one that is returned:
public char[] GetArrayFromFile(string pathToFile)
{
    using (StreamReader streamReader = new StreamReader(pathToFile))
    {
        var data = streamReader.ReadToEnd();
        return data.ToCharArray();
    }
}
var arrayFromFile = GetArrayFromFile(@"..\path.file");
You are probably using an incorrect encoding. By default StreamReader(String) uses UTF-8 encoding:
The complete file path is specified by the path parameter. This constructor initializes the encoding to UTF8Encoding and the buffer size to 1024 bytes.
Don't pre-allocate the buffer size, unless you have a specific need.
If your file is in ASCII format, you need to update your StreamReader constructor:
char[] buffer = null;
using (StreamReader streamReader = new StreamReader(path1, Encoding.ASCII))
{
    buffer = streamReader.ReadToEnd().ToCharArray();
}
// buffer = result.ToArray();
threadfunc(data_path1);
Does your file contain binary data? If it contains an EOF character and the stream is opened in text mode (which is how StreamReader behaves), that character will signal end of file even if it is not actually the end of the file.
I can reproduce this by reading random .exe files in text mode.

How memory management works between bytes and string?

In C#, if I have 4-5 GB of data that is currently held as bytes and I convert it into a string, what will be the impact on memory, and how can I manage memory better when working with such a large string?
Code
public byte[] ExtractMessage(int start, int end)
{
    if (end <= start)
        return null;
    byte[] message = new byte[end - start];
    int remaining = Size - end;
    Array.Copy(Frame, start, message, 0, message.Length);
    // Shift any remaining bytes to the front of the buffer
    if (remaining > 0)
        Array.Copy(Frame, end, Frame, 0, remaining);
    Size = remaining;
    ScanPosition = 0;
    return message;
}
byte[] rawMessage = Buffer.ExtractMessage(messageStart, messageEnd);
// Once the bytes are received, I want to create an XML file which will be used for further work
string msg = Encoding.UTF8.GetString(rawMessage);
CreateXMLFile(msg);
public void CreateXMLFile(string msg)
{
    string fileName = "msg.xml";
    if (File.Exists(fileName))
    {
        File.Delete(fileName);
    }
    using (File.Create(fileName)) { };
    TextWriter tw = new StreamWriter(fileName, true);
    tw.Write(msg);
    tw.Close();
}
.NET strings are stored as UTF-16, which means two bytes per character. As you are using UTF-8, you'll roughly double the memory usage when converting to a string.
Once you've converted the text to a string, nothing more will happen unless you try to modify it. string objects are immutable, which means that a new copy of the string is created each time you modify it using one of the methods such as Remove().
You can read more here: How are strings passed in .NET?
A byte array, however, is passed by reference; each change affects every variable holding a reference to it, so changes will not hurt performance or memory consumption.
You can get a byte[] from a string by using var buffer = yourEncoding.GetBytes(yourString);. Common encodings can be accessed through static properties: var buffer = Encoding.UTF8.GetBytes(yourString);
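As a side note (a sketch based on the code in the question, not part of the original answer): since rawMessage already contains UTF-8 encoded text, you could write those bytes straight to the file and avoid materialising the large string at all. The helper name here is just illustrative:
public void CreateXmlFileFromBytes(byte[] rawMessage)
{
    const string fileName = "msg.xml";
    // WriteAllBytes overwrites any existing file, so no explicit Delete is needed
    File.WriteAllBytes(fileName, rawMessage);
}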

HttpWebResponse + Stream.Read, adding null chars at the end

I'm trying to get a byte[] array filled with request response, without any extra garbage data.
This is how I fetch the data:
using (Stream MyResponseStream = hwresponse.GetResponseStream())
{
    byte[] MyBuffer = new byte[4096];
    int BytesRead;
    while (0 < (BytesRead = MyResponseStream.Read(MyBuffer, 0, MyBuffer.Length)))
    {
        ByteArrayToFile("request.txt", MyBuffer);
    }
}
I use the function ByteArrayToFile to see what data has been received.
public void ByteArrayToFile(string _FileName, byte[] _ByteArray)
{
    System.IO.FileStream _FileStream = new System.IO.FileStream(_FileName, System.IO.FileMode.Append, System.IO.FileAccess.Write);
    _FileStream.Write(_ByteArray, 0, _ByteArray.Length);
    _FileStream.Close();
}
The request gets written to the file, but a lot of null characters are added at the end. How do I trim them? Since I'm going to need this to handle binary files, how can I safely trim off the ending and keep just the pure response data? Thanks!
You need to use the value of BytesRead; it indicates exactly how many bytes were received:
public void ByteArrayToFile(string _FileName, byte[] _ByteArray, int _BytesRead)
{
    using (var _FileStream = new FileStream(
        _FileName, FileMode.Append, FileAccess.Write))
    {
        _FileStream.Write(_ByteArray, 0, _BytesRead);
    }
}
Otherwise you're writing out an array of length X that has only been populated with Y elements, causing a number of 'unused' elements in the array to also be written out. There is also the possibility of stale data from a previous pass remaining in the buffer, meaning misinformation could end up being written out with the next write.
You should also dispose of FileStream instances when you're done with them (although Close does this for a Stream, I'd recommend the consistency of calling Dispose in one of two ways: explicitly, or, as illustrated in the code above, implicitly via the using construct).
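With that change, the read loop from the question would pass the count through on each iteration; roughly (a sketch based on the question's code):
using (Stream MyResponseStream = hwresponse.GetResponseStream())
{
    byte[] MyBuffer = new byte[4096];
    int BytesRead;
    while (0 < (BytesRead = MyResponseStream.Read(MyBuffer, 0, MyBuffer.Length)))
    {
        // write only the bytes actually read on this pass
        ByteArrayToFile("request.txt", MyBuffer, BytesRead);
    }
}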

How to merge efficiently gigantic files with C#

I have over 125 TSV files of ~100 MB each that I want to merge. The merge operation is allowed to destroy the 125 files, but not the data. What matters is that at the end, I end up with one big file containing the content of all the files, one after the other (no specific order).
Is there an efficient way to do that? I was wondering if Windows provides an API to simply make a big "union" of all those files. Otherwise, I will have to read all the files and write one big one.
Thanks!
So "merging" is really just writing the files one after the other? That's pretty straightforward - just open one output stream, and then repeatedly open an input stream, copy the data, close. For example:
static void ConcatenateFiles(string outputFile, params string[] inputFiles)
{
    using (Stream output = File.OpenWrite(outputFile))
    {
        foreach (string inputFile in inputFiles)
        {
            using (Stream input = File.OpenRead(inputFile))
            {
                input.CopyTo(output);
            }
        }
    }
}
That's using the Stream.CopyTo method which is new in .NET 4. If you're not using .NET 4, another helper method would come in handy:
private static void CopyStream(Stream input, Stream output)
{
    byte[] buffer = new byte[8192];
    int bytesRead;
    while ((bytesRead = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        output.Write(buffer, 0, bytesRead);
    }
}
There's nothing that I'm aware of that is more efficient than this... but importantly, this won't take up much memory on your system at all. It's not like it's repeatedly reading the whole file into memory then writing it all out again.
EDIT: As pointed out in the comments, there are ways you can fiddle with file options to potentially make it slightly more efficient in terms of what the file system does with the data. But fundamentally you're going to be reading the data and writing it, a buffer at a time, either way.
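For instance, one such tweak (a sketch; the 64K buffer size is just something to experiment with, not a recommendation from the original answer) is to open the streams with a larger buffer and hint sequential access to the file system:
static void ConcatenateFilesSequential(string outputFile, params string[] inputFiles)
{
    const int bufferSize = 64 * 1024; // worth experimenting with

    using (Stream output = new FileStream(outputFile, FileMode.Create, FileAccess.Write,
                                          FileShare.None, bufferSize))
    {
        foreach (string inputFile in inputFiles)
        {
            // FileOptions.SequentialScan hints that the file is read from start to end
            using (Stream input = new FileStream(inputFile, FileMode.Open, FileAccess.Read,
                                                 FileShare.Read, bufferSize,
                                                 FileOptions.SequentialScan))
            {
                input.CopyTo(output);
            }
        }
    }
}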
Do it from the command line:
copy 1.txt+2.txt+3.txt combined.txt
or
copy *.txt combined.txt
Do you mean by "merge" that you want to decide with some custom logic which lines go where? Or do you mean that you mainly want to concatenate the files into one big one?
In the case of the latter, it is possible that you don't need to do this programmatically at all; just generate one batch file with this (/b is for binary, remove if not needed):
copy /b "file 1.tsv" + "file 2.tsv" "destination file.tsv"
Using C#, I'd take the following approach. Write a simple function that copies two streams:
void CopyStreamToStream(Stream dest, Stream src)
{
    // experiment with the best buffer size; often 65536 is very performant
    byte[] buffer = new byte[65536];

    // copy everything
    int bytesRead;
    while ((bytesRead = src.Read(buffer, 0, buffer.Length)) > 0)
    {
        dest.Write(buffer, 0, bytesRead);
    }
}
// then use as follows (do it in a loop, and don't forget the using-blocks)
CopyStreamToStream(yourOutputStream, yourInputStream);
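Spelled out, that calling loop might look like this (a sketch; the file names are placeholders):
string[] sourceFiles = { "file 1.tsv", "file 2.tsv" }; // your 125 files here
using (var output = new FileStream("destination file.tsv", FileMode.Create))
{
    foreach (string sourceFile in sourceFiles)
    {
        using (var input = new FileStream(sourceFile, FileMode.Open, FileAccess.Read))
        {
            // note the parameter order: destination first, then source
            CopyStreamToStream(output, input);
        }
    }
}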
Using a folder of 100MB text files totalling ~12GB, I found that a small time saving could be made over the accepted answer by using File.ReadAllBytes and then writing that out to the stream.
[Test]
public void RaceFileMerges()
{
    var inputFilesPath = @"D:\InputFiles";
    var inputFiles = Directory.EnumerateFiles(inputFilesPath).ToArray();

    var sw = new Stopwatch();
    sw.Start();
    ConcatenateFilesUsingReadAllBytes(@"D:\ReadAllBytesResult", inputFiles);
    Console.WriteLine($"ReadAllBytes method in {sw.Elapsed}");

    sw.Reset();
    sw.Start();
    ConcatenateFiles(@"D:\CopyToResult", inputFiles);
    Console.WriteLine($"CopyTo method in {sw.Elapsed}");
}
private static void ConcatenateFiles(string outputFile, params string[] inputFiles)
{
    using (var output = File.OpenWrite(outputFile))
    {
        foreach (var inputFile in inputFiles)
        {
            using (var input = File.OpenRead(inputFile))
            {
                input.CopyTo(output);
            }
        }
    }
}

private static void ConcatenateFilesUsingReadAllBytes(string outputFile, params string[] inputFiles)
{
    using (var stream = File.OpenWrite(outputFile))
    {
        foreach (var inputFile in inputFiles)
        {
            var currentBytes = File.ReadAllBytes(inputFile);
            stream.Write(currentBytes, 0, currentBytes.Length);
        }
    }
}
ReadAllBytes method in 00:01:22.2753300
CopyTo method in 00:01:30.3122215
I repeated this a number of times with similar results.

How to read multiple text files and save them into one text file?

In my case I have five huge text files which I have to merge into one text file.
I tried with StreamReader(), but I don't know how to make it read more than one file; do I have to assign another variable?
An example would be greatly appreciated.
New answer
(See explanation for junking original answer below.)
static void CopyFiles(string dest, params string[] sources)
{
    using (TextWriter writer = File.CreateText(dest))
    {
        // Somewhat arbitrary limit, but it won't go on the large object heap
        char[] buffer = new char[16 * 1024];
        foreach (string source in sources)
        {
            using (TextReader reader = File.OpenText(source))
            {
                int charsRead;
                while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
                {
                    writer.Write(buffer, 0, charsRead);
                }
            }
        }
    }
}
This new answer is quite like Martin's approach, except:
- It reads into a smaller buffer; 16K is going to be acceptable in almost all situations, and won't end up on the large object heap (which doesn't get compacted).
- It reads text data instead of binary data, for two reasons:
  - The code can easily be modified to convert from one encoding to another (see the sketch after this list).
  - If each input file contains a byte order mark, it will be skipped by the reader, instead of byte order marks ending up scattered through the output file at input-file boundaries.
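A sketch of that encoding-conversion variant (the encoding parameters are illustrative and not part of the original answer; everything else mirrors the method above):
static void CopyFiles(string dest, Encoding destEncoding,
                      Encoding sourceEncoding, params string[] sources)
{
    using (TextWriter writer = new StreamWriter(dest, false, destEncoding))
    {
        char[] buffer = new char[16 * 1024];
        foreach (string source in sources)
        {
            // the reader decodes from sourceEncoding; the writer re-encodes to destEncoding
            using (TextReader reader = new StreamReader(source, sourceEncoding))
            {
                int charsRead;
                while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
                {
                    writer.Write(buffer, 0, charsRead);
                }
            }
        }
    }
}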
Original answer
Martin Stettner pointed out an issue in the answer below - if the first file ends without a newline, it will still create a newline in the output file. Also, it will translate newlines into "\r\n" even if they were previously just "\r" or "\n". Finally, it pointlessly risks using large amounts of memory for long lines.
Something like:
static void CopyFiles(string dest, params string[] sources)
{
    using (TextWriter writer = File.CreateText(dest))
    {
        foreach (string source in sources)
        {
            using (TextReader reader = File.OpenText(source))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    writer.WriteLine(line);
                }
            }
        }
    }
}
Note that this reads line by line to avoid reading too much into memory at a time. You could make it simpler if you're happy to read each file completely into memory (still one at a time):
static void CopyFiles(string dest, params string[] sources)
{
    using (TextWriter writer = File.CreateText(dest))
    {
        foreach (string source in sources)
        {
            string text = File.ReadAllText(source);
            writer.Write(text);
        }
    }
}
Edit:
As Jon Skeet pointed out, text files usually should be handled differently from binary files.
I'll just leave this answer since it might be more performant if you have really big files and aren't concerned about encoding issues (such as different input files having different encodings, or multiple byte order marks in the output file):
public void CopyFiles(string destPath, string[] sourcePaths) {
    // Just allocate a buffer as big as you can afford
    byte[] buffer = new byte[10 * 1024 * 1024];
    using (var destStream = new FileStream(destPath, FileMode.Create)) {
        foreach (var sourcePath in sourcePaths) {
            int read;
            using (var sourceStream = new FileStream(sourcePath, FileMode.Open)) {
                while ((read = sourceStream.Read(buffer, 0, buffer.Length)) != 0)
                    destStream.Write(buffer, 0, read);
            }
        }
    }
}
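Called, for example, like this (the paths are placeholders, and the call assumes an instance of the containing class is in scope):
CopyFiles(@"c:\temp\combined.txt",
          new[] { @"c:\temp\file1.txt", @"c:\temp\file2.txt", @"c:\temp\file3.txt" });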

Categories

Resources