Read multiple lines from a large file in non-ascending order - c#

I have a very large text file, over 1GB, and I have a list of integers that represent line numbers. I need to produce another file containing the text of the original file's lines at those line numbers, in the order given.
Example of original large file:
ogfile line 1
some text here
another line
blah blah
So when I get a List of "2,4,4,1" the output file should read:
some text here
blah blah
blah blah
ogfile line 1
I have tried
string lineString = File.ReadLines(filename).Skip(lineNumList[i]-1).Take(1).First();
but this takes way too long, as the file has to be read from the beginning, skipped line by line to the line in question, then read again from the start the next time... and we are talking millions of lines in the 1GB file, while my List<int> holds thousands of line numbers.
Is there a better/faster way to read a single line, or have the reader skip to a specific line number without "skipping" line by line?

The high-order bit here is: you are trying to solve a database problem using text files. Databases are designed to solve big data problems; text files, as you've discovered, are terrible at random access. Use a database, not a text file.
If you are hell-bent upon using a text file, what you have to do is take advantage of stuff you know about the likely problem parameters. For example, if you know that, as you imply, there are ~1M lines, each line is ~1KB, and the set of lines to extract is ~0.1% of the total lines, then you can come up with an efficient solution like this:
Make a set containing the line numbers to be read. The set must be fast to check for membership.
Make a dictionary that maps from line numbers to line contents. This must be fast to look up by key and fast to add new key/value pairs.
Read each line of the file one at a time; if the line number is in the set, add the contents to the dictionary.
Now iterate the list of line numbers and map the dictionary contents; now we have a sequence of strings.
Dump that sequence to the destination file.
We have five operations, so hopefully it is around five lines of code.
void DoIt(string pathIn, IEnumerable<int> lineNumbers, string pathOut)
{
    var lines = new HashSet<int>(lineNumbers);
    var dict = File.ReadLines(pathIn)
        .Select((lineText, index) => new KeyValuePair<int, string>(index + 1, lineText)) // line numbers are 1-based
        .Where(p => lines.Contains(p.Key))
        .ToDictionary(p => p.Key, p => p.Value);
    File.WriteAllLines(pathOut, lineNumbers.Select(i => dict[i]));
}
OK, got it in six. Pretty good.
Notice that I made use of all those assumptions; if the assumptions are violated then this stops being a good solution. In particular we assume that the dictionary is going to be small compared to the size of the input file. If that is not true, then you'll need a more sophisticated technique to get efficiencies.
Conversely, can we extract additional efficiencies? Yes, provided we know facts about likely inputs. Suppose for example we know that the same file will be iterated several times but with different line number sets, but those sets are likely to have overlap. In that case we can re-use dictionaries instead of rebuilding them. That is, suppose a previous operation has left a Dictionary<int, string> computed for lines (10, 20, 30, 40) and file X. If a request then comes in for lines (30, 20, 10) for file X, we already have the dictionary in memory.
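For instance, a minimal sketch of that caching idea (the cache shape and the GetCachedLines name are assumptions for illustration, not part of the technique above):

// Hypothetical per-file cache of already-extracted lines; assumes the
// files do not change between requests.
static readonly Dictionary<string, Dictionary<int, string>> cache =
    new Dictionary<string, Dictionary<int, string>>();

static Dictionary<int, string> GetCachedLines(string path, IEnumerable<int> lineNumbers)
{
    Dictionary<int, string> dict;
    if (!cache.TryGetValue(path, out dict))
        cache[path] = dict = new Dictionary<int, string>();
    // Only re-read the file for line numbers we have not seen before.
    var missing = new HashSet<int>(lineNumbers.Where(n => !dict.ContainsKey(n)));
    if (missing.Count > 0)
        foreach (var pair in File.ReadLines(path)
            .Select((text, index) => new KeyValuePair<int, string>(index + 1, text))
            .Where(p => missing.Contains(p.Key)))
            dict[pair.Key] = pair.Value;
    return dict;
}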
The key thing I want to get across in this answer is that you must know something about the inputs in order to build an efficient solution; the more restrictions you can articulate on the inputs, the more efficient a solution you can build. Take advantage of all the knowledge you have about the problem domain.

Use a StreamReader so you don't have to read the entire file, just up to the last desired line, and store the wanted lines in a Dictionary for fast lookup later.
Edit: Thanks to Eric Lippert, I included a HashSet for fast lookups.
List<int> lineNumbers = new List<int> { 2, 4, 4, 1 };
HashSet<int> lookUp = new HashSet<int>(lineNumbers);
Dictionary<int, string> lines = new Dictionary<int, string>();
using (StreamReader sr = new StreamReader(inputFile))
{
    int lastLine = lookUp.Max();
    for (int currentLine = 1; currentLine <= lastLine; currentLine++)
    {
        if (lookUp.Contains(currentLine))
        {
            lines[currentLine] = sr.ReadLine();
        }
        else
        {
            sr.ReadLine();
        }
    }
}
using (StreamWriter sw = new StreamWriter(outputFile))
{
    foreach (var line in lineNumbers)
    {
        sw.WriteLine(lines[line]);
    }
}

You may use a StreamReader and its ReadLine method to read line by line without loading the whole file into memory:
var lines = new Dictionary<int, string>();
var indexesProcessed = new HashSet<int>();
var indexesNew = new List<int> { 2, 4, 4, 1 };
int lastIndex = indexesNew.Max(); // read only as far as the largest requested line
using ( var reader = new StreamReader(@"C:\file.txt") )
  for ( int index = 1; index <= lastIndex; index++ )
    if ( reader.Peek() >= 0 )
    {
      string line = reader.ReadLine();
      if ( indexesNew.Contains(index) && !indexesProcessed.Contains(index) )
      {
        lines[index] = line;
        indexesProcessed.Add(index);
      }
    }
using ( var writer = new StreamWriter(@"C:\file-new.txt", false) )
  foreach ( int index in indexesNew )
    if ( indexesProcessed.Contains(index) )
      writer.WriteLine(lines[index]);
It reads the file, selects the desired indexes, and saves them in the desired order.
We use a HashSet to store processed indexes to speed up the Contains calls, since you indicate the file can be over 1GB.
The code is written to avoid an index-out-of-bounds in case of a mismatch between the source file and the desired indexes, but this slows down the process. If you are sure there will be no such mismatch, you can optimize by removing all usage of indexesProcessed; see the sketch after the output below.
Output:
some text here
blah blah
blah blah
ogfile line 1
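For example, a simplified variant under that assumption might look like this (a hedged sketch; it presumes every requested index actually exists in the file):

var lines = new Dictionary<int, string>();
var indexesNew = new List<int> { 2, 4, 4, 1 };
var wanted = new HashSet<int>(indexesNew); // O(1) membership tests
int lastIndex = indexesNew.Max();
using ( var reader = new StreamReader(@"C:\file.txt") )
  for ( int index = 1; index <= lastIndex; index++ )
  {
    string line = reader.ReadLine(); // never null if the assumption holds
    if ( wanted.Contains(index) )
      lines[index] = line;
  }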

One way to do this would be to simply read the input file once (and store the result in a variable), and then grab the lines you need and write them to the output file.
Since the line number is 1-based and arrays are 0-based (i.e. line number 1 is array index 0), we subtract 1 from the line number when specifying the array index:
static void Main(string[] args)
{
    var inputFile = @"f:\private\temp\temp.txt";
    var outputFile = @"f:\private\temp\temp2.txt";
    var fileLines = File.ReadAllLines(inputFile);
    var linesToDisplay = new[] { 2, 4, 4, 1 };
    // Write each specified line in linesToDisplay from fileLines to the outputFile
    File.WriteAllLines(outputFile,
        linesToDisplay.Select(lineNumber => fileLines[lineNumber - 1]));
    GetKeyFromUser("\n\nDone! Press any key to exit..."); // helper from the original post (not shown) that waits for a key press
}
Another way to do this that should be more efficient is to only read the file up to the maximum line number (using the ReadLines method), rather than reading the whole file (using the ReadAllLines method), and save just the lines we care about in a dictionary that maps the line number to the line text:
static void Main(string[] args)
{
    var inputFile = @"f:\private\temp\temp.txt";
    var outputFile = @"f:\private\temp\temp2.txt";
    var linesToDisplay = new[] { 2, 4, 4, 1 };
    var maxLineNumber = linesToDisplay.Max();
    var fileLines = new Dictionary<int, string>(linesToDisplay.Distinct().Count());
    // Start lineNumber at 1 instead of 0
    int lineNumber = 1;
    // Just read up to the largest line number we need
    // and save the lines we care about in our dictionary
    foreach (var line in File.ReadLines(inputFile))
    {
        if (linesToDisplay.Contains(lineNumber))
        {
            fileLines[lineNumber] = line;
        }
        // Increment our lineNumber and break if we're done
        if (++lineNumber > maxLineNumber) break;
    }
    // Write the output to our file
    File.WriteAllLines(outputFile, linesToDisplay.Select(line => fileLines[line]));
    GetKeyFromUser("\n\nDone! Press any key to exit..."); // same helper as above
}

Related

Loading data from text file into a dictionary

I have a file consisting of lines of text which look as follows:
ABC Abbey something
ABD Aasdasd
This is the text file
The first string will always be of length 3. So I want to loop through the file contents and store the first 3 letters as the key and the remainder as the value. I am removing the whitespace between them and using Substring as follows. The key works out fine, but the line where I store the value throws an ArgumentOutOfRangeException.
This is the exact code causing the problem.
line.Substring(4, line.Length)
If I call Substring between 0 and line.Length it works fine. As soon as I start at 1 or higher while still passing line.Length, I get the error. I honestly don't get it and have been at it for hours. Some assistance please.
class Program
{
    static string line;
    static Dictionary<string, string> stations = new Dictionary<string, string>();

    static void Main(string[] args)
    {
        var lines = File.ReadLines("C:\\Users\\username\\Desktop\\a.txt");
        foreach (var l in lines)
        {
            line = l.Replace("\t", "");
            stations.Add(line.Substring(0, 3), line.Substring(4, line.Length)); // error caused by this line
        }
        foreach (KeyValuePair<string, string> item in stations)
        {
            //Console.WriteLine(item.Key);
            Console.WriteLine(item.Value);
        }
        Console.ReadLine();
    }
}
This is because the documentation specifies it will throw an ArgumentOutOfRangeException if:
startIndex plus length indicates a position not within this instance.
With the signature:
public string Substring(int startIndex, int length)
Since you pass line.Length as the length, startIndex plus length will be 4 + line.Length, which is definitely not a position within this instance.
I recommend using the one parameter version:
public string Substring(int startIndex)
Thus line.Substring(3) (credit to @adv12 for spotting that), since here you only provide the startIndex. Of course you could use line.Substring(3, line.Length - 3), but as always, better to use the simpler overload; libraries are made to make programs fool-proof (no offense intended, the point is simply to reduce the number of brain cycles spent on this task). Mind however that it can still throw an error if:
startIndex is less than zero or greater than the length of this instance.
So better provide a check that 3 is less than or equal to line.Length...
Additional advice
Perhaps you should take a look at regex capturing. Right now each key in your file contains three characters, but it is possible that in the (near) future four characters will be allowed. With a regex capture you can specify a pattern such that errors during parsing become less likely.
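For example, a hedged sketch of that idea (the pattern is an assumption about the format: a run of word characters, whitespace, then the rest of the line):

using System.Text.RegularExpressions;

// Capture the key (one or more word characters) and the value (the rest of the line).
var match = Regex.Match(line, @"^(\w+)\s+(.+)$");
if (match.Success)
    stations.Add(match.Groups[1].Value, match.Groups[2].Value);

If the key later grows to four characters, the same pattern keeps working.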
You need to request fewer characters than the total length of the line:
line.Substring(4, line.Length - 4) // subtract the chars which you're skipping
Your string:
ABC Abbey something
Length = 19
Start = 4
Remaining chars = 19 - 4 = 15 // but you asked for 19, and that is the error
I know this is a late answer that doesn't address what's wrong with your code, but I feel that has already been done by other people. Instead I have a different way to make the dictionary that doesn't involve Substring at all, so it's a little more robust, IMHO.
As long as you can guarantee that the two values are always separated by a tab, this works even if there are more or fewer characters in the key. It uses LINQ, which is fine from .NET 3.5 on.
// LINQ
using System.Linq;

// Creates a string[][] array where each element holds the key in the
// first position and the value in the second
var lines = File.ReadAllLines(@"path/to/file.txt")
    .Select(s => s.Split('\t'))
    .ToArray();

// Your dictionary
Dictionary<string, string> stations = new Dictionary<string, string>();

// Loop through the array and add the key/value pairs to the dictionary
for (int i = 0; i < lines.Length; i++)
{
    // For example lines[i][0] = ABW, lines[i][1] = Abbey Wood
    stations[lines[i][0]] = lines[i][1];
}

// Prove it works
foreach (KeyValuePair<string, string> entry in stations)
{
    MessageBox.Show(entry.Key + " - " + entry.Value);
}
Hope this makes sense and gives you an alternate to consider ;-)

Read random line from a large text file

I have a file with 5000+ lines. I want to find the most efficient way to choose one of those lines each time I run my program. I had originally intended to use the Random class to choose one (that was before I knew there were 5000 lines). I thought that might be inefficient, so I considered reading the first line, then deleting it from the top and appending it to the bottom. But it seems I would have to read the whole file and create a new file to delete from the top.
What is the most efficient way: the random method or the new file method?
The program will be run every 5 mins and I'm using c# 4.5
In .NET 4.*, it is possible to read a single line of a file lazily, without loading the rest. For example, to get line X (0-based):
string line = File.ReadLines(FileName).Skip(X).First();
Full example:
var fileName = @"C:\text.txt";
var file = File.ReadLines(fileName).ToList();
int count = file.Count;
Random rnd = new Random();
int skip = rnd.Next(0, count);
string line = file.Skip(skip).First();
Console.WriteLine(line);
Let's assume the file is so large that you cannot afford to fit it into RAM. Then you would want to use reservoir sampling, an algorithm designed to pick randomly from a list of unknown, arbitrary length that might not fit into memory:
Random r = new Random();
int currentLine = 1;
string pick = null;

foreach (string line in File.ReadLines(filename))
{
    if (r.Next(currentLine) == 0)
    {
        pick = line;
    }
    ++currentLine;
}

return pick;
At a high level, reservoir sampling follows a basic rule: line N has a 1/N chance of replacing the currently selected line. The algorithm is slightly unintuitive: line 1 has a 100% chance of being selected initially, but a 50% chance of later being replaced by line 2, and so on.
I've found understanding this algorithm to be easiest in the form of a proof of correctness. So, a simple proof by induction:
1) Base case: By inspection, the algorithm works if there is 1 line.
2) If the algorithm works for N-1 lines, processing N lines works because:
3) After processing N-1 iterations of an N line file, all N-1 lines are equally likely (probability 1/(N-1)).
4) The next iteration ensures that line N has a probability of 1/N (because that is what the algorithm explicitly assigns it, and it is the final iteration), reducing the probability of all previous lines to:

1/(N-1) * (1 - 1/N)
= 1/(N-1) * ((N-1)/N)
= 1/N
If you know how many lines are in the file in advance, this algorithm is more expensive than necessary, as it always reads the entire file.
I assume that the goal is to randomly choose one line from a file of 5000+ lines.
Try this:
Get the line count using File.ReadLines(file).Count().
Generate a random number, using the line count as an upper limit.
Do a lazy read of the file with File.ReadLines(file).
Skip to the chosen line in that lazy sequence and take it.
EDIT: as pointed out, doing File.ReadLines(file).ToArray() is pretty inefficient; the sketch below avoids materializing the file.
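A minimal sketch of those steps without materializing an array (two lazy passes over the file; file is the path from the steps above):

// Pass 1: count the lines without keeping them in memory.
int count = File.ReadLines(file).Count();

// Pick a random 0-based line index.
int pick = new Random().Next(count);

// Pass 2: lazily skip to the chosen line and stop there.
string line = File.ReadLines(file).Skip(pick).First();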
Here's a quick implementation of @Lucas Trzesniewski's method proposed in the comments on the question:
// open the file
using (FileStream stream = File.OpenRead("yourfile.dat"))
{
    // 1. index all offsets that are the beginning of a line
    List<long> lineOffsets = new List<long>();
    lineOffsets.Add(stream.Position); // the very first offset is the beginning of a line!
    int ch;
    while ((ch = stream.ReadByte()) != -1) // -1 denotes the end of the file
    {
        if (ch == '\n')
            lineOffsets.Add(stream.Position);
    }

    // 2. read a random line: set the position of the stream
    // to one of the previously saved offsets
    stream.Position = lineOffsets[new Random().Next(lineOffsets.Count)];

    // read the whole line from the specified offset
    using (StreamReader reader = new StreamReader(stream))
    {
        Console.WriteLine(reader.ReadLine());
    }
}
I don't have any VS near me at the moment, so this is untested.

Is there a more efficient way to iterate a collection of files and build a dictionary of file contents?

I have the following code. Is there a more efficient way to accomplish the same tasks?
Given a folder, loop over the files within the folder.
Within each file, skip the first four header lines,
After splitting the row based on a space, if the resulting array contains less than 7 elements, skip it,
Check if the specified element is already in the dictionary. If it is, increment the count. If not, add it.
It's not a complicated process. Is there a better way to do this? LINQ?
string sourceDirectory = @"d:\TESTDATA\";
string[] files = Directory.GetFiles(sourceDirectory, "*.log",
    SearchOption.TopDirectoryOnly);
var dictionary = new Dictionary<string, int>();

foreach (var file in files)
{
    string[] lines = System.IO.File.ReadLines(file).Skip(4).ToArray();
    foreach (var line in lines)
    {
        var elements = line.Split(' ');
        if (elements.Length > 6)
        {
            if (dictionary.ContainsKey(elements[9]))
            {
                dictionary[elements[9]]++;
            }
            else
            {
                dictionary.Add(elements[9], 1);
            }
        }
    }
}
Something Linqy should do you. Doubt it's any more efficient, and it's almost certainly more of a hassle to debug. But it is very trendy these days:
static Dictionary<string, int> Slurp(string rootDirectory)
{
    Dictionary<string, int> instance = Directory
        .EnumerateFiles(rootDirectory, "*.log", SearchOption.TopDirectoryOnly)
        .SelectMany(fn => File.ReadAllLines(fn)
            .Skip(4)
            .Select(txt => txt.Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries))
            .Where(x => x.Length > 9)
            .Select(x => x[9]))
        .GroupBy(x => x)
        .ToDictionary(x => x.Key, x => x.Count());
    return instance;
}
A more efficient (performance-wise) way to do this would be to parallelize your outer foreach with the Parallel.ForEach method. You'd also need a ConcurrentDictionary instead of a standard Dictionary; a sketch follows.
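A minimal sketch of that idea (hedged: it reuses the question's files array and header skip, and adds a > 9 length guard so elements[9] is always valid):

using System.Collections.Concurrent;
using System.Threading.Tasks;

var dictionary = new ConcurrentDictionary<string, int>();
Parallel.ForEach(files, file =>
{
    foreach (var line in File.ReadLines(file).Skip(4))
    {
        var elements = line.Split(' ');
        if (elements.Length > 9) // guard so elements[9] exists
            dictionary.AddOrUpdate(elements[9], 1, (key, count) => count + 1);
    }
});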
Not sure if you are looking for better performance or more elegant code.
If you prefer functional-style LINQ, maybe something like this:

var query = from element in
            (
                // go through all file names
                from fileName in files
                // read all lines from every file and skip the first 4
                from line in File.ReadAllLines(fileName).Skip(4)
                // split every line into words
                let lineData = line.Split(new[] { ' ' })
                // select only lines with more than 6 words
                where lineData.Length > 6
                // take the element at index 6 from the line
                select lineData.ElementAt(6)
            )
            // the outer query groups by element
            group element by element into g
            select new
            {
                Key = g.Key,
                Count = g.Count()
            };

var dictionary = query.ToDictionary(e => e.Key, e => e.Count);

The result is a dictionary with each word as the Key and its number of occurrences as the Value.
I would expect that reading the files will be the most time-consuming part of the operation. In many cases, trying to read multiple files at once on different threads will hurt rather than help performance, but it may be helpful to have a thread which does nothing but read files, so that it can keep the drive as busy as possible.
If the files could get big (as seems likely) and if no line will exceed 32K bytes (8,000-32,000 Unicode characters), I'd suggest that you read them in chunks of about 32K or 64K bytes (not characters). Reading a file as bytes and subdividing into lines yourself may be faster than reading it as lines, since the subdivision can happen on a different thread from the physical disk access.
I'd suggest starting with one thread for disk access and one thread for parsing and counting, with a blocking queue between them. The disk-access thread should put on the queue data items which contain a 32K array of bytes, an indication of how many bytes are valid [may be less than 32K at the end of a file], and an indicator of whether it's the last record of a file. The parsing thread should read those items, parse them into lines, and update the appropriate counts.
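As a rough sketch of that pipeline (the Chunk type, the 32K buffer, and the queue capacity are illustrative assumptions; files is the string[] from the question, and the actual line-splitting is elided):

// Requires System.Collections.Concurrent and System.Threading.Tasks.
class Chunk
{
    public byte[] Buffer;   // raw bytes read from disk
    public int Length;      // number of valid bytes (may be short at end of file)
    public bool LastInFile; // marks the final chunk of a file
}

var queue = new BlockingCollection<Chunk>(boundedCapacity: 8);

// Disk-access thread: does nothing but read and enqueue.
var readerTask = Task.Run(() =>
{
    foreach (var file in files)
        using (var fs = File.OpenRead(file))
        {
            int read;
            var buffer = new byte[32 * 1024];
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
                queue.Add(new Chunk
                {
                    Buffer = (byte[])buffer.Clone(),
                    Length = read,
                    LastInFile = fs.Position == fs.Length
                });
        }
    queue.CompleteAdding();
});

// Parsing thread: splits chunks into lines and updates the counts.
var parserTask = Task.Run(() =>
{
    foreach (var chunk in queue.GetConsumingEnumerable())
    {
        // Split chunk.Buffer[0 .. chunk.Length] into lines here, carrying any
        // partial trailing line over to the next chunk of the same file.
    }
});

Task.WaitAll(readerTask, parserTask);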
To improve counting performance, it may be helpful to define
class ExposedFieldHolder<T> { public T Value; }
and then use a Dictionary<string, ExposedFieldHolder<int>>. One would have to create a new ExposedFieldHolder<int> for each dictionary slot, but dictionary[elements[9]].Value++; would likely be faster than dictionary[elements[9]]++;, since the latter statement translates as dictionary[elements[9]] = dictionary[elements[9]] + 1; and has to look up the element once when reading and again when writing.
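A short usage sketch of that holder idea (the counts name is hypothetical):

var counts = new Dictionary<string, ExposedFieldHolder<int>>();
// ... inside the parsing loop:
ExposedFieldHolder<int> holder;
if (!counts.TryGetValue(elements[9], out holder))
    counts[elements[9]] = holder = new ExposedFieldHolder<int>();
holder.Value++; // one dictionary lookup instead of a read followed by a write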
If it's necessary to do the parsing and counting on multiple threads, I would suggest that each thread have its own queue, and the disk-reading thread switch queues after each file [all blocks of a file should be handled by the same thread, since a text line might span two blocks]. Additionally, while it would be possible to use a ConcurrentDictionary, it might be more efficient to have each thread have its own independent Dictionary and consolidate the results at the end.

file handling in C# .net

There is a list of things I want to do. I have a forms application.
Go to a particular line. I know how to read the file serially, but is there any way I can jump to a particular line number?
Find out the total number of lines.
If the file is not too big, you can try ReadAllLines.
This reads the whole file into a string array, where every line is an element of the array.
Example:
var fileName = @"C:\MyFolder\MyFileName.txt";
var contents = System.IO.File.ReadAllLines(fileName);
Console.WriteLine("Line 10: " + contents[9]);
Console.WriteLine("Number of lines:");
Console.WriteLine(contents.Length);
But be aware: This reads in the whole file into memory.
If the file is too big:
Open the file (OpenText) and create a Dictionary to store the offset of every line. Scan every line and store its offset. Now you can jump to any line, and you have the number of lines.
var lineOffset = new Dictionary<int, long>();
using (var rdr = System.IO.File.OpenText(fileName))
{
    int lineNr = 0;
    lineOffset.Add(0, 0);
    while (rdr.ReadLine() != null)
    {
        lineNr++;
        // Note: BaseStream.Position reflects the reader's internal buffering,
        // so these offsets can overshoot the true line starts; for exact
        // offsets you would need to count the bytes yourself.
        lineOffset.Add(lineNr, rdr.BaseStream.Position);
    }

    // Go to line 10 (lineOffset[9] is the offset right after line 9)
    rdr.BaseStream.Position = lineOffset[9];
    rdr.DiscardBufferedData(); // required after seeking, or stale buffered data is returned
    var line10 = rdr.ReadLine();
}
This would help for your first point: jump into file line c#

How to get String Line number in Foreach loop from reading array?

The program helps users parse a text file by grouping certain parts of it into a "sections" array.
So the question is: is there any method to find out the line numbers/positions within the array? The program uses a foreach loop to read the "sections" array.
Can someone please advise on the code? Thanks!
namespace Testing
{
    class Program
    {
        static void Main(string[] args)
        {
            TextReader tr = new StreamReader(@"C:\Test\new.txt");
            String SplitBy = "----------------------------------------";

            // Skip 5 lines of the original text file
            for (var i = 0; i < 5; i++)
            {
                tr.ReadLine();
            }

            // Read the rest
            String fullLog = tr.ReadToEnd();
            String[] sections = fullLog.Split(new string[] { SplitBy }, StringSplitOptions.None);
            //String[] lines = sections.Skip(5).ToArray();
            int t = 0;

            // Tried using foreach (String r in sections.Skip(4)) but that skips sections
            // instead of the text lines found within each section
            foreach (String r in sections)
            {
                Console.WriteLine("The times are : " + t);
                // Is there a way to know or get the "r" line number?
                Console.WriteLine(r);
                Console.WriteLine("============================================================");
                t++;
            }
        }
    }
}
A foreach loop doesn't have a loop counter of any kind. You can keep your own counter:
int number = 1;
foreach (var element in collection)
{
    // Do something with element and number
    number++;
}
or, perhaps easier, make use of LINQ's Enumerable.Select that gives you the current index:
var numberedElements = collection.Select((element, index) => new { element, index });
with numberedElements being a collection of anonymous type instances with properties element and index. In the case of a file you can do this:
var numberedLines = File.ReadLines(filename)
.Select((Line,Number) => new { Line, Number });
with the advantage that the whole thing is processed lazily, so it will only read the parts of the file into memory that you actually use.
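For example, a small usage sketch (Number here is the 0-based index, so add 1 for a 1-based line number):

foreach (var x in numberedLines)
    Console.WriteLine("line " + (x.Number + 1) + ": " + x.Line);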
As far as I know, there is not a way to know which line number you are at within the file. You'd either have to keep track of the lines yourself, or read the file again until you get to that line and count along the way.
Edit:
So you're trying to get the line number of a string inside the array, after the master string has been split by SplitBy?
If there were another specific delimiter in that substring you could split it again, but that might not give you what you're looking for; you'd essentially be back at square one.
What you can do instead is split each section string on newline characters. That should produce an array whose positions correspond to the line numbers inside the section.
Yes, you can use a for loop instead of foreach. Also, if you know the file isn't going to be too large, you can read all of the lines into an array with:
string[] lines = File.ReadAllLines(#"C:\Test\new.txt");
Well, don't use a foreach, use a for loop
for (int i = 0; i < sections.Length; ++i)
{
    string section = sections[i];
    int lineNum = i + 1;
}
You can of course maintain a counter when using a foreach loop as well, but there is no reason to since you have the standard for loop at your disposal which is made for this sort of thing.
Of course, this won't necessarily give you the line number of the string in the text file unless you split on Environment.NewLine. You are splitting on a large number of '-' characters and I have no idea how your file is structured. You'll likely end up underestimating the line number because all of the '---' bits will be discarded.
Not as your code is written. You must track the line number for yourself. Problematic areas of your code:
You skip 5 lines at the beginning of your code; you must track this.
Using the Split method, you are potentially "removing" lines from the original collection of lines. You must find a way to know how many splits you have made, because they were part of the original line count.
Rather than taking the approach you have, I suggest doing the parsing and searching within a classic indexed for-loop that visits each line of the file. This probably means giving up conveniences like Split, and rather looking for markers in the file manually with e.g. IndexOf.
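A hedged sketch of that approach, reusing the separator and the 5 skipped header lines from the question (the section counter is illustrative):

string separator = "----------------------------------------";
int lineNumber = 0, section = 0;
foreach (string line in File.ReadLines(@"C:\Test\new.txt"))
{
    lineNumber++;
    if (lineNumber <= 5) continue; // the 5 header lines the question skips
    if (line.IndexOf(separator, StringComparison.Ordinal) >= 0)
    {
        section++; // a new section starts after this marker line
        continue;
    }
    // Here both the true file line number and the current section are known.
}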
I've got a much simpler solution to the question after reading through all the answers yesterday.
Since the string has a newline after each line, it is possible to split each section into a new array of lines; the line number then corresponds to the array position.
The code:
foreach (String r in sections)
{
    Console.WriteLine("The times are : " + t);
    IList<String> names = r.Split('\n').ToList<String>();
}
