How to read a large (1 GB) txt file in .NET? - c#

I have a 1 GB text file which I need to read line by line. What is the best and fastest way to do this?
private void ReadTxtFile()
{
    string filePath = openFileDialog1.FileName;
    if (!string.IsNullOrEmpty(filePath))
    {
        using (StreamReader sr = new StreamReader(filePath))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                FormatData(line);
            }
        }
    }
}
In FormatData() I check whether the line starts with a particular word and, if so, increment an integer variable.
void FormatData(string line)
{
    if (line.StartsWith(word))
    {
        globalIntVariable++;
    }
}

If you are using .NET 4.0, try MemoryMappedFile, which is a class designed for this scenario.
Otherwise, you can use StreamReader.ReadLine.

Using StreamReader is probably the way to go, since you don't want the whole file in memory at once. MemoryMappedFile is more suited to random access than sequential reading (sequential reading is roughly ten times as fast as random access, while memory mapping is roughly ten times as fast for random access).
You might also try creating your StreamReader from a FileStream with FileOptions set to SequentialScan (see the FileOptions enumeration), but I doubt it will make much of a difference.
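For example, a minimal sketch of what that could look like (the buffer size is an arbitrary choice; filePath and FormatData are from the question):
using (var fs = new FileStream(filePath, FileMode.Open, FileAccess.Read,
                               FileShare.Read, 4096, FileOptions.SequentialScan))
using (var sr = new StreamReader(fs))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        FormatData(line);
    }
}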
There are, however, ways to make your example more efficient, since you do the formatting in the same loop as the reading. You're wasting clock cycles, so if you want even more performance, it would be better to use a multithreaded asynchronous solution where one thread reads data and another formats it as it becomes available. Check out BlockingCollection, which might fit your needs:
Blocking Collection and the Producer-Consumer Problem
If you want the fastest possible performance, in my experience the only way is to read in as large a chunk of binary data as possible sequentially and deserialize it into text in parallel, but the code starts to get complicated at that point.
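As a rough sketch of the producer-consumer idea with BlockingCollection (this assumes the FormatData method from the question; the bounded capacity is an arbitrary choice, and it needs System.Collections.Concurrent and System.Threading.Tasks):
var lines = new BlockingCollection<string>(boundedCapacity: 1000);
// Producer: one thread reads lines from disk.
var producer = Task.Factory.StartNew(() =>
{
    foreach (string line in File.ReadLines(filePath))
        lines.Add(line);
    lines.CompleteAdding();
});
// Consumer: another thread formats lines as they become available.
var consumer = Task.Factory.StartNew(() =>
{
    foreach (string line in lines.GetConsumingEnumerable())
        FormatData(line);
});
Task.WaitAll(producer, consumer);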

You can use LINQ:
int result = File.ReadLines(filePath).Count(line => line.StartsWith(word));
File.ReadLines returns an IEnumerable<String> that lazily reads each line from the file without loading the whole file into memory.
Enumerable.Count counts the lines that start with the word.
If you are calling this from a UI thread, use a BackgroundWorker.
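For example, a minimal BackgroundWorker sketch (resultLabel is a hypothetical UI control; word and filePath are from the question, and System.ComponentModel is required):
var worker = new BackgroundWorker();
worker.DoWork += (s, e) =>
{
    // Runs on a thread-pool thread, so the UI stays responsive.
    e.Result = File.ReadLines(filePath).Count(line => line.StartsWith(word));
};
worker.RunWorkerCompleted += (s, e) =>
{
    // Runs back on the UI thread.
    resultLabel.Text = e.Result.ToString(); // resultLabel is hypothetical
};
worker.RunWorkerAsync();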

Probably best to read it line by line.
You should not try to force the whole file into memory by reading to the end and then processing it.

StreamReader.ReadLine should work fine. Let the framework choose the buffering, unless you know by profiling you can do better.

TextReader.ReadLine()

I was facing the same problem on our production server at Agenty, where we see large files (sometimes 10-25 GB tab-delimited txt files). After lots of testing and research, I found the best way is to read large files in small chunks using a for/foreach loop with offset and limit logic based on File.ReadLines().
int TotalRows = File.ReadLines(Path).Count(); // Count the number of rows in the file (lazy read)
int Limit = 100000; // 100,000 rows per batch
for (int Offset = 0; Offset < TotalRows; Offset += Limit)
{
    var table = Path.FileToTable(heading: true, delimiter: '\t', offset: Offset, limit: Limit);
    // Do all your processing here for the current offset/limit batch and save to disk in append mode.
    // Append mode writes the output of each processed batch into the same file.
    table.TableToFile(@"C:\output.txt");
}
See the complete code in my GitHub library: https://github.com/Agenty/FileReader/
Full disclosure: I work for Agenty, the company that owns this library and website.

My file is over 13 GB. You can use my class:
public static string Read(int length)
{
    StringBuilder resultAsString = new StringBuilder();
    using (MemoryMappedFile memoryMappedFile = MemoryMappedFile.CreateFromFile(@"D:\_Profession\Projects\Parto\HotelDataManagement\_Document\Expedia_Rapid.jsonl\Expedia_Rapi.json"))
    using (MemoryMappedViewStream memoryMappedViewStream = memoryMappedFile.CreateViewStream(0, length))
    {
        for (int i = 0; i < length; i++)
        {
            // Reads a byte from the stream and advances the position by one byte,
            // or returns -1 if at the end of the stream.
            int result = memoryMappedViewStream.ReadByte();
            if (result == -1)
            {
                break;
            }
            char letter = (char)result;
            resultAsString.Append(letter);
        }
    }
    return resultAsString.ToString();
}
This code reads the file from the beginning up to the length that you pass to Read(int length), collects the text in the resultAsString variable and returns it.

I'd read the file 10,000 bytes at a time. Then I'd analyse those 10,000 bytes, chop them into lines and feed them to the FormatData function.
Bonus points for splitting the reading and line analysis across multiple threads.
I'd definitely use a StringBuilder to collect all strings, and might build a string buffer to keep about 100 strings in memory at all times.
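A rough sketch of that chunked approach, reusing FormatData from the question (the chunk size is arbitrary, and a single-byte encoding such as ASCII is assumed so lines can be split safely at chunk boundaries):
byte[] chunk = new byte[10000];
string carryOver = string.Empty;
using (var fs = File.OpenRead(filePath))
{
    int read;
    while ((read = fs.Read(chunk, 0, chunk.Length)) > 0)
    {
        string text = carryOver + Encoding.ASCII.GetString(chunk, 0, read);
        string[] lines = text.Split('\n');
        // The last element may be an incomplete line; keep it for the next chunk.
        for (int i = 0; i < lines.Length - 1; i++)
            FormatData(lines[i].TrimEnd('\r'));
        carryOver = lines[lines.Length - 1];
    }
    if (carryOver.Length > 0)
        FormatData(carryOver);
}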

Related

How to read from a text file then calculate an average

I plan on reading the marks from a text file and then calculating what the average mark is, based upon data written in previous code. I haven't been able to read the marks though, or calculate how many marks there are, as BinaryReader doesn't let you use .Length.
I have tried using an array to hold each mark, but it doesn't like each mark being an integer.
public static int CalculateAverage()
{
    int count = 0;
    int total = 0;
    float average;
    BinaryReader markFile;
    markFile = new BinaryReader(new FileStream("studentMarks.txt", FileMode.Open));
    //A loop to read each line of the file and add it to the total
    {
        //total = total + eachMark;
        //count++;
    }
    //average = total / count;
    //markFile.Close();
    //Console.WriteLine("Average mark:", average);
    return 0;
}
First of all, don't use BinaryReader; you can use StreamReader, for example.
Also, with a using statement it is not necessary to call Close().
There is already an answer using a while loop, so with LINQ you can do it in one line:
var avg = File.ReadAllLines("file.txt").Average(a => Int32.Parse(a));
Console.WriteLine("avg = " + avg); //5
Also, with File.ReadAllLines(), according to the docs, the file is loaded into memory and then closed, so there is no memory-leak problem or anything like that:
Opens a text file, reads all lines of the file into a string array, and then closes the file.
Edit: adding the way to read using BinaryReader.
The first thing to know is that you are reading a .txt file. Unless you created the file using BinaryWriter, BinaryReader will not work. And if you are creating a binary file, it is not good practice to name it .txt.
So, assuming your file is binary, you need to loop and read every integer, and this code should work:
var fileName = "file.txt";
if (File.Exists(fileName))
{
using (BinaryReader reader = new BinaryReader(File.Open(fileName, FileMode.Open)))
{
while (reader.BaseStream.Position < reader.BaseStream.Length)
{
total +=reader.ReadInt32();
count++;
}
}
average = total/count;
Console.WriteLine("Average = "+average); // 5
}
I've used using to ensure the file is closed at the end.
If your file only contains numbers, you only have to use ReadInt32() and it will work.
Also, if your file is not binary, obviously BinaryReader will not work. Judging by what my own binary file.txt created with BinaryWriter looks like, I'm assuming you don't have a binary file...
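For reference, a minimal sketch of how such a binary marks file could be created with BinaryWriter (the file name and values are just examples):
using (var writer = new BinaryWriter(File.Open("file.txt", FileMode.Create)))
{
    int[] marks = { 3, 4, 5, 6, 7 };
    foreach (int mark in marks)
    {
        writer.Write(mark); // each mark is written as a 4-byte Int32
    }
}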

Performance using Span<T> to parse a text file

I am trying to take advantage of Span<T>, using .NET Core 2.2, to improve the performance of parsing text from a text file. The text file contains multiple consecutive rows of data which will each be split into fields that are then each mapped to a data class.
Initially, the parsing routine uses a traditional approach of using StreamReader to read each row, and then using Substring to copy the individual fields from that row.
From what I have read (on MSDN, amongst other places), using Span<T> with Slice should perform more efficiently, as fewer allocations are made; instead, a pointer into the byte[] array is passed around and acted upon.
After some experimentation I have compared 3 approaches to parsing the file and used BenchmarkDotNet to compare the results. What I found was that, when parsing a single row from the text file using Span, both mean execution time and allocated memory are indeed significantly less. So far so good. However, when parsing more than one row from the file, the performance gain quickly disappears to the point that it is almost insignificant, even from as little as 50 rows.
I am sure I must be missing something. Something seems to be outweighing the performance gain of Span.
The best performing approach WithSpan_StringFirst looks like this:
private static byte[] _filecontent;
private const int ROWSIZE = 252;
private readonly Encoding _encoding = Encoding.ASCII;

public void WithSpan_StringFirst()
{
    var buffer1 = new Span<byte>(_filecontent).Slice(0, RowCount * ROWSIZE);
    var buffer = _encoding.GetString(buffer1).AsSpan();
    int cursor = 0;
    for (int i = 0; i < RowCount; i++)
    {
        var row = buffer.Slice(cursor, ROWSIZE);
        cursor += ROWSIZE;
        Foo.ReadWithSpan(row);
    }
}

[Params(1, 50)]
public int RowCount { get; set; }
Implementation of Foo.ReadWithSpan:
public static Foo ReadWithSpan(ReadOnlySpan<char> buffer) => new Foo
{
    Field1 = buffer.Read(0, 2),
    Field2 = buffer.Read(3, 4),
    Field3 = buffer.Read(5, 6),
    // ...
    Field30 = buffer.Read(246, 249)
};

public static string Read(this ReadOnlySpan<char> input, int startIndex, int endIndex)
{
    return new string(input.Slice(startIndex, endIndex - startIndex));
}
Any feedback would be appreciated. I have posted a full working sample on github.
For small files (fewer than 10,000 lines) with a simple line structure to parse, most any .NET Core method will perform about the same.
For large, multi-gigabyte files and millions of lines of data, optimizations matter more.
If file processing time is measured in hours or even tens of minutes, getting all the C# code together in the same class can drastically speed up processing, since the compiler can optimize the code better. Inlining the methods called from the main processing loop can also help.
It's the same answer as in the 1960s: changing the processing algorithm and how it chunks input and output data is worth an order of magnitude more than small code optimizations.
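As an illustration of that chunking idea (not the poster's benchmark code), a rough sketch that reads the fixed-width file in large blocks and slices each 252-byte row straight from the byte buffer, materializing only the fields that are needed; the path, chunk size and field offsets are placeholders:
const int RowSize = 252;
byte[] chunk = new byte[RowSize * 4096];
int leftover = 0;
using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                               FileShare.Read, 1 << 16, FileOptions.SequentialScan))
{
    int read;
    while ((read = fs.Read(chunk, leftover, chunk.Length - leftover)) > 0)
    {
        int available = leftover + read;
        int offset = 0;
        for (; offset + RowSize <= available; offset += RowSize)
        {
            ReadOnlySpan<byte> row = new ReadOnlySpan<byte>(chunk, offset, RowSize);
            string field1 = Encoding.ASCII.GetString(row.Slice(0, 2));
            // ... map the remaining fields the same way
        }
        // Carry any partial row over to the next read.
        leftover = available - offset;
        Array.Copy(chunk, offset, chunk, 0, leftover);
    }
}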

Read random line from a large text file

I have a file with 5000+ lines. I want to find the most efficient way to choose one of those lines each time I run my program. I had originally intended to use the random method to choose one (that was before I knew there were 5000 lines). I thought that might be inefficient, so I thought I'd look at reading the first line, then deleting it from the top and appending it to the bottom. But it seems that I have to read the whole file and create a new file to delete from the top.
What is the most efficient way: the random method or the new file method?
The program will be run every 5 mins and I'm using c# 4.5
In .NET 4.*, it is possible to access a single line of a file directly. For example, to get line X:
string line = File.ReadLines(FileName).Skip(X).First();
Full example:
var fileName = @"C:\text.txt";
var file = File.ReadLines(fileName).ToList();
int count = file.Count();
Random rnd = new Random();
int skip = rnd.Next(0, count);
string line = file.Skip(skip).First();
Console.WriteLine(line);
Let's assume the file is so large that you cannot afford to fit it into RAM. Then you would want to use reservoir sampling, an algorithm designed for picking randomly from lists of unknown, arbitrary length that might not fit into memory:
Random r = new Random();
int currentLine = 1;
string pick = null;
foreach (string line in File.ReadLines(filename))
{
    if (r.Next(currentLine) == 0)
    {
        pick = line;
    }
    ++currentLine;
}
return pick;
This algorithm is slightly unintuitive. At a high level, it works by having line N have a 1/N chance of replacing the currently selected line. Thus, line 1 has a 100% chance of being selected, but a 50% chance of later being replaced by line 2.
I've found understanding this algorithm to be easiest in the form of a proof of correctness. So, a simple proof by induction:
1) Base case: By inspection, the algorithm works if there is 1 line.
2) If the algorithm works for N-1 lines, processing N lines works because:
3) After processing N-1 iterations of an N line file, all N-1 lines are equally likely (probability 1/(N-1)).
4) The next iteration ensures that line N has a probability of 1/N (because that's what the algorithm explicitly assigns it, and it is the final iteration), reducing the probability of all previous lines to:
1/(N-1) * (1 - 1/N)
= 1/(N-1) * (N/N - 1/N)
= 1/(N-1) * (N-1)/N
= (1 * (N-1)) / (N * (N-1))
= 1/N
If you know how many lines are in the file in advance, this algorithm is more expensive than necessary, as it always reads the entire file.
I assume that the goal is to randomly choose one line from a file of 5000+ lines.
Try this:
Get the line count using File.ReadLines(file).Count().
Generate a random number, using the line count as an upper limit.
Do a lazy read of the file with File.ReadLines(file).
Choose a line from this enumeration using the random number (for example with Skip and First).
EDIT: as pointed out, doing File.ReadLines(file).ToArray() is pretty inefficient.
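A minimal sketch of that two-pass approach (the file name is a placeholder):
string file = "yourfile.txt";
int lineCount = File.ReadLines(file).Count();            // first pass: count the lines lazily
int chosen = new Random().Next(lineCount);               // zero-based index of the line to pick
string line = File.ReadLines(file).Skip(chosen).First(); // second pass: stop at that line
Console.WriteLine(line);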
Here's a quick implementation of @LucasTrzesniewski's proposed method from the comments to the question:
// open the file
using (FileStream stream = File.OpenRead("yourfile.dat"))
{
    // 1. index all offsets that are the beginning of a line
    List<long> lineOffsets = new List<long>();
    lineOffsets.Add(stream.Position); // the very first offset is the beginning of a line!
    int ch;
    while ((ch = stream.ReadByte()) != -1) // "-1" denotes the end of the file
    {
        if (ch == '\n')
            lineOffsets.Add(stream.Position);
    }
    // 2. read a random line
    // set the position of the stream to one of the previously saved offsets
    stream.Position = lineOffsets[new Random().Next(lineOffsets.Count)];
    // read the whole line from the specified offset
    using (StreamReader reader = new StreamReader(stream))
    {
        Console.WriteLine(reader.ReadLine());
    }
}
I don't have any VS near me at the moment, so this is untested.

Need a fast method of deserializing 1 million Strings & Guids in c#

I want to deserialize a list of 1 million pairs of (String,Guid) for a performance critical app. The format can be anything I choose, and serialization does not have the same performance requirements.
What sort of approach is best? Text or binary? Write each pair (string,guid) consecutively, or write all strings followed by all guids?
I started playing with LINQPad (and the simpler example of deserializing strings only) and found that, slightly counter-intuitively, using a TextReader and ReadLine() was a fair bit faster than using a BinaryReader and ReadString(). (Is the file system cache playing tricks on me?)
public string[] DeSerializeBinary()
{
    var tmr = System.Diagnostics.Stopwatch.StartNew();
    long ms = 0;
    string[] arr = null;
    using (var rdr = new BinaryReader(new FileStream(file, FileMode.Open, FileAccess.Read)))
    {
        var num = rdr.ReadInt32();
        arr = new String[num];
        for (int i = 0; i < num; i++)
        {
            arr[i] = rdr.ReadString();
        }
        tmr.Stop();
        ms = tmr.ElapsedMilliseconds;
        Console.WriteLine("DeSerializeBinary took {0}ms", ms);
    }
    return arr;
}

public string[] DeserializeText()
{
    var tmr = System.Diagnostics.Stopwatch.StartNew();
    long ms = 0;
    string[] arr = null;
    using (var rdr = File.OpenText(file))
    {
        var num = Int32.Parse(rdr.ReadLine());
        arr = new String[num];
        for (int i = 0; i < num; i++)
        {
            arr[i] = rdr.ReadLine();
        }
        tmr.Stop();
        ms = tmr.ElapsedMilliseconds;
        Console.WriteLine("DeserializeText took {0}ms", ms);
    }
    return arr;
}
Some Edits:
I used RamMap to clear the file system cache, and it turns out there was very little difference to Text & Binary reader for strings only.
I have a fairly simple class that holds the string and guid. It also holds an int index which corresponds to its position in the list. Obviously there's no need to include this in serialization.
In a test for (binary) deSerializing Strings and Guids alternately, I get around 500ms.
Ideal timing is 50ms, or as close as I can get. However, a simple experiment showed it takes at least 120ms to read the (compressed) file into memory from a reasonably fast SSD drive, without any sort of parsing at all. So 50ms seems unlikely.
Our strings have no theoretical length restrictions. However, we can assume that the performance target only applies if they are all 20 characters or less.
Timings include opening the file.
Reading the strings is the clear bottleneck now (hence my experiments with serializing strings only). JIT_NewFast took 30% of the time before I preallocated a 16-byte array for reading GUIDs.
It's not surprising that reading a bunch of strings is faster with StreamReader than with BinaryReader. StreamReader reads in blocks from the underlying stream, and parses the strings from that buffer. BinaryReader doesn't have a buffer like that. It reads the string length from the underlying stream, and then reads that many characters. So BinaryReader makes more calls to the base stream's Read method.
But there's more to deserializing a (String, Guid) pair than just reading. You also have to parse the Guid. If you write the file in binary then the Guid is written in binary, which makes it much easier and faster to create a Guid structure. If it's a string, then you have to call new Guid(string) to parse the text and create a Guid, after you split the line into its two fields.
Hard to say which of those will be faster.
I can't imagine that we're talking about a whole lot of time here. Certainly reading a file with a million lines will take around a second. Unless the string is really long. A GUID is only 36 characters if you count the separators, right?
With BinaryWriter, you can write the file like this:
writer.Write(count); // integer number of records
foreach (var pair in pairs)
{
    writer.Write(pair.theString);
    writer.Write(pair.theGuid.ToByteArray());
}
And to read it, you have:
count = reader.ReadInt32();
byte[] guidBytes = new byte[16];
for (int i = 0; i < count; ++i)
{
    string s = reader.ReadString();
    reader.Read(guidBytes, 0, guidBytes.Length);
    pairs.Add(new Pair(s, new Guid(guidBytes)));
}
Whether that's faster than splitting a string and calling the Guid constructor that takes a string parameter, I don't know.
I suspect that any difference is going to be pretty slight. I'd probably go with the simplest method: a text file.
If you want to get really crazy, you can write a custom format that you can easily slurp up in just a couple of large reads (a header, an index, and two arrays for strings and GUIDs), and do everything else in memory. That would almost certainly be faster. But faster enough to warrant the extra work? Doubtful.
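For reference, a rough sketch of the simple text approach (this assumes one record per line after a count header, with the string and the GUID separated by a tab; the exact layout is an assumption):
using (var rdr = File.OpenText(file))
{
    int num = Int32.Parse(rdr.ReadLine());
    var pairs = new List<KeyValuePair<string, Guid>>(num);
    for (int i = 0; i < num; i++)
    {
        string[] fields = rdr.ReadLine().Split('\t');
        pairs.Add(new KeyValuePair<string, Guid>(fields[0], new Guid(fields[1])));
    }
}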
Update
Or maybe not doubtful. Here's some code that writes and reads a custom binary format. The format is:
count (int32)
guids (count * 16 bytes)
strings (one big concatenated string)
index (index of each string's starting character in the big string)
I assume you're using a Dictionary<string, Guid> to hold these things. But your data structure doesn't really matter. The code would be substantially the same.
Note that I tested this very briefly. I won't say that the code is 100% bug free, but I think you can get the idea of what I'm doing.
private void WriteGuidFile(string filename, Dictionary<string, Guid> guids)
{
    using (var fs = File.Create(filename))
    {
        using (var writer = new BinaryWriter(fs, Encoding.UTF8))
        {
            List<int> stringIndex = new List<int>(guids.Count);
            StringBuilder bigString = new StringBuilder();

            // write count
            writer.Write(guids.Count);

            // Write the GUIDs and build the string index
            foreach (var pair in guids)
            {
                writer.Write(pair.Value.ToByteArray(), 0, 16);
                stringIndex.Add(bigString.Length);
                bigString.Append(pair.Key);
            }

            // Add one more entry to the string index.
            // It makes deserializing easier.
            stringIndex.Add(bigString.Length);

            // Write the string that contains all of the strings, combined
            writer.Write(bigString.ToString());

            // write the index
            foreach (var ix in stringIndex)
            {
                writer.Write(ix);
            }
        }
    }
}
Reading is just slightly more involved:
private Dictionary<string, Guid> ReadGuidFile(string filename)
{
    using (var fs = File.OpenRead(filename))
    {
        using (var reader = new BinaryReader(fs, Encoding.UTF8))
        {
            // read the count
            int count = reader.ReadInt32();

            // The guids are in a huge byte array sized 16*count
            byte[] guidsBuffer = new byte[16 * count];
            reader.Read(guidsBuffer, 0, guidsBuffer.Length);

            // Strings are all concatenated into one
            var bigString = reader.ReadString();

            // Index is an array of int. We can read it as an array of
            // ((count+1) * 4) bytes.
            byte[] indexBuffer = new byte[4 * (count + 1)];
            reader.Read(indexBuffer, 0, indexBuffer.Length);

            var guids = new Dictionary<string, Guid>(count);
            byte[] guidBytes = new byte[16];
            int startix = 0;
            int endix = 0;
            for (int i = 0; i < count; ++i)
            {
                endix = BitConverter.ToInt32(indexBuffer, 4 * (i + 1));
                string key = bigString.Substring(startix, endix - startix);
                Buffer.BlockCopy(guidsBuffer, (i * 16), guidBytes, 0, 16);
                guids.Add(key, new Guid(guidBytes));
                startix = endix;
            }
            return guids;
        }
    }
}
A couple of notes here. First, I'm using BitConverter to convert the data in the byte arrays to integers. It would be faster to use unsafe code and just index into the arrays using an int32*.
You might gain some speed by using pointers to index into guidsBuffer and calling the Guid(Int32, Int16, Int16, Byte, Byte, Byte, Byte, Byte, Byte, Byte, Byte) constructor rather than using Buffer.BlockCopy to copy the GUID into the temporary array.
You could make the string index an index of lengths rather than the starting positions. That would eliminate the need for the extra value at the end of the array, but it's unlikely that it'd make any difference in the speed.
There might be other optimization opportunities, but I think you get the general idea here.

How to use Stream.Write Method to overwrite existing text

I am using StreamWriter to write records into a file. Now I want to overwrite a specific record.
string file = "c:\\......";
StreamWriter sw = new StreamWriter(new FileStream(file, FileMode.Open, FileAccess.Write));
sw.Write(...);
sw.Close();
I read somewhere here that I can use the Stream.Write method to do that, but I have no previous experience or knowledge of how to deal with bytes.
public override void Write(
byte[] array,
int offset,
int count
)
So how do I use this method?
I need someone to explain what exactly byte[] array and int count are in this method, and to show some simple sample code for using this method to overwrite an existing record in a file,
e.g. change a record like Mark1287,11100,25| to Bill9654,22100,30|.
If you want to overwrite a particular record, you must use the FileStream.Seek method to put your stream in position.
Example for Seek:
using System;
using System.IO;

class FStream
{
    static void Main()
    {
        const string fileName = "Test####.dat";

        // Create random data to write to the file.
        byte[] dataArray = new byte[100000];
        new Random().NextBytes(dataArray);

        using (FileStream fileStream = new FileStream(fileName, FileMode.Create))
        {
            // Write the data to the file, byte by byte.
            for (int i = 0; i < dataArray.Length; i++)
            {
                fileStream.WriteByte(dataArray[i]);
            }

            // Set the stream position to the beginning of the file.
            fileStream.Seek(0, SeekOrigin.Begin);

            // Read and verify the data.
            for (int i = 0; i < fileStream.Length; i++)
            {
                if (dataArray[i] != fileStream.ReadByte())
                {
                    Console.WriteLine("Error writing data.");
                    return;
                }
            }

            Console.WriteLine("The data was written to {0} " +
                "and verified.", fileStream.Name);
        }
    }
}
After seeking to the position, use Write, whose parameters are:
public override void Write(
byte[] array,
int offset,
int count
)
Parameters
array
Type: System.Byte[]
The buffer containing data to write to the stream.
offset
Type: System.Int32
The zero-based byte offset in array from which to begin copying bytes to the stream.
count
Type: System.Int32
The maximum number of bytes to write.
And most important of all: always consult the documentation when unsure!
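For example, a minimal sketch of overwriting bytes at a known position (the offset is a placeholder, and the new bytes must be exactly as long as the old ones or the rest of the file will be left inconsistent):
long recordOffset = 0; // position of the record you want to replace
byte[] newRecord = Encoding.ASCII.GetBytes("Bill9654,22100,30|");
using (var fs = new FileStream(file, FileMode.Open, FileAccess.ReadWrite))
{
    fs.Seek(recordOffset, SeekOrigin.Begin);
    fs.Write(newRecord, 0, newRecord.Length);
}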
So... in short:
Your file is text based (but is allowed to become binary based).
Your records have various sizes.
This means that, without analyzing your file, there is no way to know where a given record starts and ends. If you want to overwrite a record, the new record can be larger than the old one, so all records further on in the file will have to be moved.
This requires a complex management system. Options could be:
When your application starts, it analyzes your file and stores in memory the start and length of each record.
There is a separate (binary) file which holds the start and length of each record. This will cost an additional 8 bytes per record (an Int32 each for start and length; perhaps you want to consider Int64).
If you want to rewrite a record, you can use this record/start/length system to know where to start writing your record. But before you do that, you have to make room, which means moving all records after the record being rewritten. Of course, you then have to update your management system with the new positions and lengths.
Another option is to do as a database does: every record consists of fixed-width columns. Even text columns have a maximum length. Because of this, you can calculate very easily where each record starts in the file. For example, if each record has a size of 200 bytes, then record #0 starts at position 0, the next record at position 200, the one after that at 400, and so on. You do not have to move records when a record is rewritten (see the sketch after these suggestions).
Another suggestion is to create a management system like the way memory is managed. Once a record is written, it stays there. The management system keeps a list of allocated portions and free portions of the file. When a new record is written, a free and fitting portion is found by the management system and the record is written at that position (optionally leaving a smaller free portion). When a record is deleted, that space is freed up. When you rewrite a record, you actually delete the old record and write a new record (possibly at a totally different location).
My last suggestion: Use a database :)
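A rough sketch of the fixed-width-record option described above, assuming every record is padded to 200 bytes (the record size, encoding and record number are placeholders):
const int RecordSize = 200;
int recordIndex = 5; // the record you want to replace
// Pad the new record to the fixed width so it overwrites the old one exactly.
byte[] bytes = Encoding.ASCII.GetBytes("Bill9654,22100,30|".PadRight(RecordSize));
using (var fs = new FileStream(file, FileMode.Open, FileAccess.ReadWrite))
{
    fs.Seek((long)recordIndex * RecordSize, SeekOrigin.Begin);
    fs.Write(bytes, 0, bytes.Length);
}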
