Read random line from a large text file

Read random line from a large text file - c#

I have a file with 5000+ lines. I want to find the most efficient way to choose one of those lines each time I run my program. I had originally intended to use the random method to choose one (that was before I knew there were 5000 lines). Thought that might be inefficient so I thought I'd look at reading the first line, then deleting it from the top and appending it to the bottom. But it seems that I have to read the whole file and create a new file to delete from the top.
What is the most efficient way: the random method or the new file method?
The program will be run every 5 mins and I'm using c# 4.5

In .NET 4.*, it is possible to access a single line of a file directly. For example, to get line X:
string line = File.ReadLines(FileName).Skip(X).First();
Full example:
var fileName = #"C:\text.txt"
var file = File.ReadLines(fileName).ToList();
int count = file.Count();
Random rnd = new Random();
int skip = rnd.Next(0, count);
string line = file.Skip(skip).First();
Console.WriteLine(line);

Lets assume file is so large that you cannot afford to fit it into RAM. Then, you would want to use Reservoir Sampling, an algorithm designed to handle picking randomly from lists of unknown, arbitrary length that might not fit into memory:
Random r = new Random();
int currentLine = 1;
string pick = null;
foreach (string line in File.ReadLines(filename))
{
if (r.Next(currentLine) == 0) {
pick = line;
}
++currentLine;
}
return pick;
At a high level, reservoir sampling follows a basic rule: Each further line has a 1/N chance of replacing all previous lines.
This algorithm is slightly unintuitive. At a high level, it works by having line N have a 1/N chance of replacing the currently selected line. Thus, line 1 has a 100% chance of being selected, but a 50% chance of later being replaced by line 2.
I've found understanding this algorithm to be easiest in the form of a proof of correctness. So, a simple proof by induction:
1) Base case: By inspection, the algorithm works if there is 1 line.
2) If the algorithm works for N-1 lines, processing N lines works because:
3) After processing N-1 iterations of an N line file, all N-1 lines are equally likely (probability 1/(N-1)).
4) The next iteration insures that line N has a probability of 1/N (because that's what the algorithm explicitly assigns it, and it is the final iteration), reducing the probability of all previous lines to:
1/(N-1) * (1-(1/N))
1/(N-1) * (N/N-(1/N))
1/(N-1) * (N-1)/N
(1*(N-1)) / (N*(N-1))
1/N
If you know how many lines are in the file in advance, this algorithm is more expensive than necessary, as it always reads the entire file.

I assume that the goal is to randomly choose one line from a file of 5000+ lines.
Try this:
Get the line count using File.ReadLines(file).Count().
Generate a random number, using the line count as an upper limit.
Do a lazy read of the file with File.ReadLines(file).
Choose a line from this array using the random number.
EDIT: as pointed out, doing File.ReadLines(file).toArray() is pretty inefficient.

Here's a quick implementation of #LucasTrzesniewskis proposed method in the comments to the question:
// open the file
using(FileStream stream = File.OpenRead("yourfile.dat"))
{
// 1. index all offsets that are the beginning of a line
List<Long> lineOffsets = new List<Long>();
lineOffsets.Add(stream.Position); //the very first offset is a beginning of a line!
int ch;
while((ch = stream.ReadByte()) != -1) // "-1" denotes the end of the file
{
if(ch == '\n')
lineOffsets.Add(stream.Position);
}
// 2. read a random line
stream.Seek(0, SeekOrigin.Begin); // go back to the beginning of the file
// set the position of the stream to one the previously saved offsets
stream.Position = lineOffsets[new Random().Next(lineOffsets.Count)];
// read the whole line from the specified offset
using(StreamReader reader = new StreamReader(stream))
{
Console.WriteLine(reader.ReadLine());
}
}
I don't have any VS near me at the moment, so this is untested.

Related

Read multiple lines from a large file in non-ascending order

I have a very large text file, over 1GB, and I have a list of integers that represent line numbers, and the need is to produce another file containing the text of the original files line numbers in the new file.
Example of original large file:
ogfile line 1
some text here
another line
blah blah
So when I get a List of "2,4,4,1" the output file should read:
some text here
blah blah
blah blah
ogfile line 1
I have tried
string lineString = File.ReadLines(filename).Skip(lineNumList[i]-1).Take(1).First();
but this takes way to long as the file has to be read in, skipped to the line in question, then reread the next time... and we are talking millions of lines in the 1GB file and my List<int> is thousands of line numbers.
Is there a better/faster way to read a single line, or have the reader skip to a specific line number without "skipping" line by line?

The high-order bit here is: you are trying to solve a database problem using text files. Databases are designed to solve big data problems; text files, as you've discovered, are terrible at random access. Use a database, not a text file.
If you are hell-bent upon using a text file, what you have to do is take advantage of stuff you know about the likely problem parameters. For example, if you know that, as you imply, there are ~1M lines, each line is ~1KB, and the set of lines to extract is ~0.1% of the total lines, then you can come up with an efficient solution like this:
Make a set containing the line numbers to be read. The set must be fast to check for membership.
Make a dictionary that maps from line numbers to line contents. This must be fast to look up by key and fast to add new key/value pairs.
Read each line of the file one at a time; if the line number is in the set, add the contents to the dictionary.
Now iterate the list of line numbers and map the dictionary contents; now we have a sequence of strings.
Dump that sequence to the destination file.
We have five operations, so hopefully it is around five lines of code.
void DoIt(string pathIn, IEnumerable<int> lineNumbers, string pathOut)
{
var lines = new HashSet<int>(lineNumbers);
var dict = File.ReadLines(pathIn)
.Select((lineText, index) => new KeyValuePair<int, string>(index, lineText))
.Where(p => lines.Contains(p.Key))
.ToDictionary(p => p.Key, p => p.Value);
File.WriteAllLines(pathOut, lineNumbers.Select(i => dict[i]));
}
OK, got it in six. Pretty good.
Notice that I made use of all those assumptions; if the assumptions are violated then this stops being a good solution. In particular we assume that the dictionary is going to be small compared to the size of the input file. If that is not true, then you'll need a more sophisticated technique to get efficiencies.
Conversely, can we extract additional efficiencies? Yes, provided we know facts about likely inputs. Suppose for example we know that the same file will be iterated several times but with different line number sets, but those sets are likely to have overlap. In that case we can re-use dictionaries instead of rebuilding them. That is, suppose a previous operation has left a Dictionary<int, string> computed for lines (10, 20, 30, 40) and file X. If a request then comes in for lines (30, 20, 10) for file X, we already have the dictionary in memory.
The key thing I want to get across in this answer is that you must know something about the inputs in order to build an efficient solution; the more restrictions you can articulate on the inputs, the more efficient a solution you can build. Take advantage of all the knowledge you have about the problem domain.

Use a StreamReader, so you don't have to read the entire file, just until the last desired line, and store them in a Dictionary, for later fast search.
Edit: Thanks to Erick Lippert, I included a HashSet for fast lookup.
List<int> lineNumbers = new List<int>{2,4,4,1};
HashSet<int> lookUp = new HashSet<int>(lineNumbers);
Dictionary<int,string> lines = new Dictionary<int,string>();
using(StreamReader sr = new StreamReader(inputFile)){
int lastLine = lookUp.Max();
for(int currentLine=1;currentLine<=lastLine;currentLine++){
if(lookUp.Contains(currentLine)){
lines[currentLine]=sr.ReadLine();
}
else{
sr.ReadLine();
}
}
}
using(StreamWriter sw = new StreamWriter(outputFile)){
foreach(var line in lineNumbers){
sw.WriteLine(lines[line]);
}
}

You may use a StreamReader and ReadLine method to read line by line without shocking the memory:
var lines = new Dictionary<int, string>();
var indexesProcessed = new HashSet<int>();
var indexesNew = new List<int> { 2, 4, 4, 1 };
using ( var reader = new StreamReader(#"c:\\file.txt") )
for ( int index = 1; index <= indexesNew.Count; index++ )
if ( reader.Peek() >= 0 )
{
string line = reader.ReadLine();
if ( indexesNew.Contains(index) && !indexesProcessed.Contains(index) )
{
lines[index] = line;
indexesProcessed.Add(index);
}
}
using ( var writer = new StreamWriter(#"c:\\file-new.txt", false) )
foreach ( int index in indexesNew )
if ( indexesProcessed.Contains(index) )
writer.WriteLine(lines[index]);
It reads the file and select the desired indexes then save them in the desired order.
We use a HashSet to store processed indexes to speedup Contains calls as you indicate the file can be over 1GB.
The code is made to avoid index out of bound in case of mismatches between the source file and the desired indexes, but it slows down the process. You can optimize if you are sure that there will be no problem. In this case you can remove all usage of indexesProcessed.
Output:
some text here
blah blah
blah blah
ogfile line 1

One way to do this would be to simply read the input file once (and store the result in a variable), and then grab the lines you need and write them to the output file.
Since the line number is 1-based and arrays are 0-based (i.e. line number 1 is array index 0), we subtract 1 from the line number when specifying the array index:
static void Main(string[] args)
{
var inputFile = #"f:\private\temp\temp.txt";
var outputFile = #"f:\private\temp\temp2.txt";
var fileLines = File.ReadAllLines(inputFile);
var linesToDisplay = new[] {2, 4, 4, 1};
// Write each specified line in linesToDisplay from fileLines to the outputFile
File.WriteAllLines(outputFile,
linesToDisplay.Select(lineNumber => fileLines[lineNumber - 1]));
GetKeyFromUser("\n\nDone! Press any key to exit...");
}
Another way to do this that should be more efficient is to only read the file up to the maximum line number (using the ReadLines method), rather than reading the whole file (using the ReadAllLines method), and save just the lines we care about in a dictionary that maps the line number to the line text:
static void Main(string[] args)
{
var inputFile = #"f:\private\temp\temp.txt";
var outputFile = #"f:\private\temp\temp2.txt";
var linesToDisplay = new[] {2, 4, 4, 1};
var maxLineNumber = linesToDisplay.Max();
var fileLines = new Dictionary<int, string>(linesToDisplay.Distinct().Count());
// Start lineNumber at 1 instead of 0
int lineNumber = 1;
// Just read up to the largest line number we need
// and save the lines we care about in our dictionary
foreach (var line in File.ReadLines(inputFile))
{
if (linesToDisplay.Contains(lineNumber))
{
fileLines[lineNumber] = line;
}
// Increment our lineNumber and break if we're done
if (++lineNumber > maxLineNumber) break;
}
// Write the output to our file
File.WriteAllLines(outputFile, linesToDisplay.Select(line => fileLines[line]));
GetKeyFromUser("\n\nDone! Press any key to exit...");
}

Randomly select a specific quantity of indices from an array?

I have an array of boolean values and need to randomly select a specific quantity of indices for values which are true.
What is the most efficient way to generate the array of indices?
For instance,
BitArray mask = GenerateSomeMask(length: 100000);
int[] randomIndices = RandomIndicesForTrue(mask, quantity: 10);
In this case the length of randomIndices would be 10.

There's a faster way to do this that requires only a single scan of the list.
Consider picking a line at random from a text file when you don't know how many lines are in the file, and the file is too large to fit in memory. The obvious solution is to read the file once to count the lines, pick a random number in the range of 0 to Count-1, and then read the file again up to the chosen line number. That works, but requires you to read the file twice.
A faster solution is to read the first line and save it as the selected line. You replace the selected line with the next line with probability 1/2. When you read the third line, you replace with probability 1/3, etc. When you've read the entire file, you have selected a line at random, and every line had equal probability of being selected. The code looks something like this:
string selectedLine = null;
int numLines = 0;
Random rnd = new Random();
foreach (var line in File.ReadLines(filename))
{
++numLines;
double prob = 1.0/numLines;
if (rnd.Next() >= prob)
selectedLine = line;
}
Now, what if you want to select 2 lines? You select the first two. Then, as each line is read the probability that it will replace one of the two lines is 2/n, where n is the number of lines already read. If you determine that you need to replace a line, you randomly select the line to be replaced. You can follow that same basic idea to select any number of lines at random. For example:
string[] selectedLines = new int[M];
int numLines = 0;
Random rnd = new Random();
foreach (var line in File.ReadLines(filename))
{
++numLines;
if (numLines <= M)
{
selectedLines[numLines-1] = line;
}
else
{
double prob = (double)M/numLines;
if (rnd.Next() >= prob)
{
int ix = rnd.Next(M);
selectedLines[ix] = line;
}
}
}
You can apply that to your BitArray quite easily:
int[] selected = new int[quantity];
int num = 0; // number of True items seen
Random rnd = new Random();
for (int i = 0; i < items.Length; ++i)
{
if (items[i])
{
++num;
if (num <= quantity)
{
selected[num-1] = i;
}
else
{
double prob = (double)quantity/num;
if (rnd.Next() > prob)
{
int ix = rnd.Next(quantity);
selected[ix] = i;
}
}
}
}
You'll need some special case code at the end to handle the case where there aren't quantity set bits in the array, but you'll need that with any solution.
This makes a single pass over the BitArray, and the only extra memory it uses is for the list of selected indexes. I'd be surprised if it wasn't significantly faster than the LINQ version.
Note that I used the probability calculation to illustrate the math. You can change the inner loop code in the first example to:
if (rnd.Next(numLines+1) == numLines)
{
selectedLine = line;
}
++numLines;
You can make a similar change to the other examples. That does the same thing as the probability calculation, and should execute a little faster because it eliminates a floating point divide for each item.

There are two families of approaches you can use: deterministic and non-deterministic. The first one involves finding all the eligible elements in the collection and then picking N at random; the second involves randomly reaching into the collection until you have found N eligible items.
Since the size of your collection is not negligible at 100K and you only want to pick a few out of those, at first sight non-deterministic sounds like it should be considered because it can give very good results in practice. However, since there is no guarantee that N true values even exist in the collection, going non-deterministic could put your program into an infinite loop (less catastrophically, it could just take a very long time to produce results).
Therefore I am going to suggest going for a deterministic approach, even though you are going to pay for the guarantees you need through the nose with resource usage. In particular, the operation will involve in-place sorting of an auxiliary collection; this will practically undo the nice space savings you got by using BitArray.
Theory aside, let's get to work. The standard way to handle this is:
Filter all eligible indices into an auxiliary collection.
Randomly shuffle the collection with Fisher-Yates (there's a convenient implementation on StackOverflow).
Pick the N first items of the shuffled collection. If there are less than N then your input cannot satisfy your requirements.
Translated into LINQ:
var results = mask
.Select((i, f) => Tuple.Create) // project into index/bool pairs
.Where(t => t.Item2) // keep only those where bool == true
.Select(t => t.Item1) // extract indices
.ToList() // prerequisite for next step
.Shuffle() // Fisher-Yates
.Take(quantity) // pick N
.ToArray(); // into an int[]
if (results.Length < quantity)
{
// not enough true values in input
}

If you have 10 indices to choose from, you could generate a random number from 0 to 2^10 - 1, and use that as you mask.

file handling in C# .net

There is a list of things I want to do. I have a forms application.
Go to a particular line. I know how to go in a serial manner, but is there any way by which I can jump to a particular line no.
To find out total no of line.

If the file is not too big, you can try the ReadAllLines.
This reads the whole file, into a string array, where every line is an element of the array.
Example:
var fileName = #"C:\MyFolder\MyFileName.txt";
var contents = System.IO.File.ReadAllLines(fileName);
Console.WriteLine("Line: 10: " + contents[9]);
Console.WriteLine("Number of lines:");
Console.WriteLine(contents.Lenght);
But be aware: This reads in the whole file into memory.
If the file is too big:
Open the file (OpenText), and create a Dictionary to store the offset of every line. Scan every line, and store the offset. Now you can go to every line, and you have the number of lines.
var lineOffset = new Dictionary<int, long>();
using (var rdr = System.IO.File.OpenText(fileName)) {
int lineNr = 0;
lineOffset.Add(0,0);
while (rdr.ReadLine() != null)) {
lineNr++;
lineOffset.Add(lineNr, rdr.BaseStream.Position);
}
// Goto line 10
rdr.BaseStream.Position = lineOffset[10];
var line10 = rdr.ReadLine();
}

This would help for your first point: jump into file line c#

How to get String Line number in Foreach loop from reading array?

The program helps users to parse a text file by grouping certain part of the text files into "sections" array.
So the question is "Are there any methods to find out the line numbers/position within the array?" The program utilizes a foreach loop to read the "sections" array.
May someone please advise on the codes? Thanks!
namespace Testing
{
class Program
{
static void Main(string[] args)
{
TextReader tr = new StreamReader(#"C:\Test\new.txt");
String SplitBy = "----------------------------------------";
// Skip 5 lines of the original text file
for(var i = 0; i < 5; i++)
{
tr.ReadLine();
}
// Read the reststring
String fullLog = tr.ReadToEnd();
String[] sections = fullLog.Split(new string[] { SplitBy }, StringSplitOptions.None);
//String[] lines = sections.Skip(5).ToArray();
int t = 0;
// Tried using foreach (String r in sections.skip(4)) but skips sections instead of the Text lines found within each sections
foreach (String r in sections)
{
Console.WriteLine("The times are : " + t);
// Is there a way to know or get the "r" line number?
Console.WriteLine(r);
Console.WriteLine("============================================================");
t++;
}
}
}
}

A foreach loop doesn't have a loop counter of any kind. You can keep your own counter:
int number = 1;
foreach (var element in collection) {
// Do something with element and number,
number++;
}
or, perhaps easier, make use of LINQ's Enumerable.Select that gives you the current index:
var numberedElements = collection.Select((element, index) => new { element, index });
with numberedElements being a collection of anonymous type instances with properties element and index. In the case a file you can do this:
var numberedLines = File.ReadLines(filename)
.Select((Line,Number) => new { Line, Number });
with the advantage that the whole thing is processed lazily, so it will only read the parts of the file into memory that you actually use.

As far as I know, there is not a way to know which line number you are at within the file. You'd either have to keep track of the lines yourself, or read the file again until you get to that line and count along the way.
Edit:
So you're trying to get the line number of a string inside the array after the master string's been split by the SplitBy?
If there's a specific delimiter in that sub string, you could split it again - although, this might not give you what you're looking for, except...
You're essentially back at square one.
What you could do is try splitting the section string by newline characters. This should spit it out into an array that corresponds with line numbers inside the string.

Yes, you can use a for loop instead of foreach. Also, if you know the file isn't going to be too large, you can read all of the lines into an array with:
string[] lines = File.ReadAllLines(#"C:\Test\new.txt");

Well, don't use a foreach, use a for loop
for( int i = 0; i < sections.Length; ++ )
{
string section = sections[i];
int lineNum = i + 1;
}
You can of course maintain a counter when using a foreach loop as well, but there is no reason to since you have the standard for loop at your disposal which is made for this sort of thing.
Of course, this won't necessarily give you the line number of the string in the text file unless you split on Environment.NewLine. You are splitting on a large number of '-' characters and I have no idea how your file is structured. You'll likely end up underestimating the line number because all of the '---' bits will be discarded.

Not as your code is written. You must track the line number for yourself. Problematic areas of your code:
You skip 5 lines at the beginning of your code, you must track this.
Using the Split method, you are potentially "removing" lines from the original collection of lines. You must find away to know how many splits you have made, because they are an original part of the line count.
Rather than taking the approach you have, I suggest doing the parsing and searching within a classic indexed for-loop that visits each line of the file. This probably means giving up conveniences like Split, and rather looking for markers in the file manually with e.g. IndexOf.

I've got a much simpler solution to the questions after reading through all the answers yesterday.
As the string had a newline after each line, it is possible to split the strings and convert it into a new array which then is possible to find out the line number according to the array position.
The Codes:
foreach (String r in sections)
{
Console.WriteLine("The times are : " + t);
IList<String> names = r.Split('\n').ToList<String>();
}

Read random line from a file? c#

I have a text file with few hundred lines, the structure is pretty simple.
firstname lastname
I need to pick out a random firstname & listname from the file.

string[] lines = File.ReadAllLines(...); //i hope that the file is not too big
Random rand = new Random();
return lines[rand.Next(lines.Length)];
Another (and maybe better) option is to have the first line of the file contain the number of records in it and then you don't have to read all the file.

Read each line keeping a count, N, of the lines you have seen so far. Select each line with probability 1/N, i.e., the first line will always be chosen, the second line will be chosen 1/2 times to replace the first, the third 1/3 times, ... This way each line has a 1/N probability of being the selected line, you only have to read the file once, and you don't need to store all of the file in memory at any given time.
Here's an implementation that can be adapted for your needs.
public string RandomLine( StreamReader reader )
{
string chosen = null;
int numberSeen = 0;
var rng = new Random();
while ((string line = reader.ReadLine()) != null)
{
if (rng.NextInt(++numberSeen) == 0)
{
chosen = line;
}
}
return chosen;
}
Based on a C implementation for selecting a node from an arbitrarily long linked list.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Read random line from a large text file - c#

Related

Read multiple lines from a large file in non-ascending order

Randomly select a specific quantity of indices from an array?

file handling in C# .net

How to get String Line number in Foreach loop from reading array?

Read random line from a file? c#

Categories

Resources