There is a list of things I want to do. I have a forms application.
Go to a particular line. I know how to read through the file line by line, but is there any way to jump straight to a particular line number?
Find out the total number of lines.
If the file is not too big, you can try the ReadAllLines.
This reads the whole file into a string array, where every line is an element of the array.
Example:
var fileName = @"C:\MyFolder\MyFileName.txt";
var contents = System.IO.File.ReadAllLines(fileName);
Console.WriteLine("Line 10: " + contents[9]); // the array is 0-based
Console.WriteLine("Number of lines:");
Console.WriteLine(contents.Length);
But be aware: this reads the whole file into memory.
If the file is too big:
Open the file (OpenText) and create a Dictionary to store the offset of every line. Scan every line and store its offset. Now you can jump to any line, and you also have the number of lines.
var lineOffset = new Dictionary<int, long>();
using (var rdr = System.IO.File.OpenText(fileName)) {
    int lineNr = 0;
    lineOffset.Add(0, 0);
    while (rdr.ReadLine() != null) {
        lineNr++;
        // Caveat: StreamReader buffers internally, so BaseStream.Position points
        // past the buffered data rather than at the next unread line. For exact
        // offsets, scan the underlying stream byte by byte (see below).
        lineOffset.Add(lineNr, rdr.BaseStream.Position);
    }
    // Go to line 10
    rdr.BaseStream.Position = lineOffset[10];
    rdr.DiscardBufferedData(); // required after repositioning the underlying stream
    var line10 = rdr.ReadLine();
}
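Because of that buffering caveat, a byte-exact alternative (a minimal sketch, assuming '\n' or "\r\n" line endings; the name offsets is illustrative) is to scan the raw stream for newline bytes:

var offsets = new List<long> { 0 }; // line 0 starts at byte 0
using (var fs = System.IO.File.OpenRead(fileName)) {
    int b;
    while ((b = fs.ReadByte()) != -1)
        if (b == '\n')
            offsets.Add(fs.Position); // the next line starts right after '\n'
}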
This would help with your first point: jumping to a particular line.
I have a very large text file, over 1GB, and I have a list of integers that represent line numbers. I need to produce another file containing the text of those line numbers from the original file, in the order given.
Example of original large file:
ogfile line 1
some text here
another line
blah blah
So when I get a List of "2,4,4,1" the output file should read:
some text here
blah blah
blah blah
ogfile line 1
I have tried
string lineString = File.ReadLines(filename).Skip(lineNumList[i]-1).Take(1).First();
but this takes way too long because the file has to be read from the beginning and skipped line by line to reach the line in question, then read again from the start the next time... and we are talking millions of lines in the 1GB file, and my List<int> is thousands of line numbers.
Is there a better/faster way to read a single line, or have the reader skip to a specific line number without "skipping" line by line?
The high-order bit here is: you are trying to solve a database problem using text files. Databases are designed to solve big data problems; text files, as you've discovered, are terrible at random access. Use a database, not a text file.
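For illustration only (this code is not from the original answer), here is a minimal sketch of the database route using Microsoft.Data.Sqlite; the table layout, file names, and line number are all made up:

using Microsoft.Data.Sqlite;

using (var conn = new SqliteConnection("Data Source=lines.db"))
{
    conn.Open();
    var create = conn.CreateCommand();
    create.CommandText =
        "CREATE TABLE IF NOT EXISTS Lines (LineNo INTEGER PRIMARY KEY, Text TEXT)";
    create.ExecuteNonQuery();

    // One-time load: key each line by its 1-based line number.
    // A transaction makes loading millions of rows vastly faster.
    using (var tx = conn.BeginTransaction())
    {
        int n = 1;
        foreach (var line in File.ReadLines(@"C:\bigfile.txt"))
        {
            var insert = conn.CreateCommand();
            insert.Transaction = tx;
            insert.CommandText = "INSERT INTO Lines (LineNo, Text) VALUES ($n, $t)";
            insert.Parameters.AddWithValue("$n", n++);
            insert.Parameters.AddWithValue("$t", line);
            insert.ExecuteNonQuery();
        }
        tx.Commit();
    }

    // After loading, any line is a single indexed lookup.
    var query = conn.CreateCommand();
    query.CommandText = "SELECT Text FROM Lines WHERE LineNo = $n";
    query.Parameters.AddWithValue("$n", 12345);
    Console.WriteLine((string)query.ExecuteScalar());
}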
If you are hell-bent upon using a text file, what you have to do is take advantage of stuff you know about the likely problem parameters. For example, if you know that, as you imply, there are ~1M lines, each line is ~1KB, and the set of lines to extract is ~0.1% of the total lines, then you can come up with an efficient solution like this:
Make a set containing the line numbers to be read. The set must be fast to check for membership.
Make a dictionary that maps from line numbers to line contents. This must be fast to look up by key and fast to add new key/value pairs.
Read each line of the file one at a time; if the line number is in the set, add the contents to the dictionary.
Iterate the list of line numbers, mapping each to its dictionary entry; now we have a sequence of strings.
Dump that sequence to the destination file.
We have five operations, so hopefully it is around five lines of code.
void DoIt(string pathIn, IEnumerable<int> lineNumbers, string pathOut)
{
    var lines = new HashSet<int>(lineNumbers);
    var dict = File.ReadLines(pathIn)
        // line numbers in the question are 1-based, so offset the 0-based index
        .Select((lineText, index) => new KeyValuePair<int, string>(index + 1, lineText))
        .Where(p => lines.Contains(p.Key))
        .ToDictionary(p => p.Key, p => p.Value);
    File.WriteAllLines(pathOut, lineNumbers.Select(i => dict[i]));
}
OK, got it in six. Pretty good.
Notice that I made use of all those assumptions; if the assumptions are violated then this stops being a good solution. In particular we assume that the dictionary is going to be small compared to the size of the input file. If that is not true, then you'll need a more sophisticated technique to get efficiencies.
Conversely, can we extract additional efficiencies? Yes, provided we know facts about likely inputs. Suppose for example we know that the same file will be iterated several times but with different line number sets, but those sets are likely to have overlap. In that case we can re-use dictionaries instead of rebuilding them. That is, suppose a previous operation has left a Dictionary<int, string> computed for lines (10, 20, 30, 40) and file X. If a request then comes in for lines (30, 20, 10) for file X, we already have the dictionary in memory.
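A rough sketch of that reuse, under the same assumptions (the names lineCache and GetLines are illustrative, not from the original answer):

static readonly Dictionary<string, Dictionary<int, string>> lineCache =
    new Dictionary<string, Dictionary<int, string>>();

static IEnumerable<string> GetLines(string path, IList<int> lineNumbers)
{
    Dictionary<int, string> cached;
    if (!lineCache.TryGetValue(path, out cached))
        lineCache[path] = cached = new Dictionary<int, string>();

    // Only rescan the file if some requested line is not cached yet.
    var missing = new HashSet<int>(lineNumbers.Where(n => !cached.ContainsKey(n)));
    if (missing.Count > 0)
    {
        int lineNumber = 1; // the question's line numbers are 1-based
        foreach (var line in File.ReadLines(path))
        {
            if (missing.Contains(lineNumber))
                cached[lineNumber] = line;
            lineNumber++;
        }
    }
    return lineNumbers.Select(n => cached[n]);
}

The cache trades memory for speed; if many distinct files or line sets are involved, you would also want an eviction policy.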
The key thing I want to get across in this answer is that you must know something about the inputs in order to build an efficient solution; the more restrictions you can articulate on the inputs, the more efficient a solution you can build. Take advantage of all the knowledge you have about the problem domain.
Use a StreamReader so you don't have to read the entire file, just up to the last desired line, and store the lines in a Dictionary for fast lookup later.
Edit: Thanks to Eric Lippert, I included a HashSet for fast lookup.
List<int> lineNumbers = new List<int> { 2, 4, 4, 1 };
HashSet<int> lookUp = new HashSet<int>(lineNumbers);
Dictionary<int, string> lines = new Dictionary<int, string>();
using (StreamReader sr = new StreamReader(inputFile))
{
    int lastLine = lookUp.Max();
    for (int currentLine = 1; currentLine <= lastLine; currentLine++)
    {
        if (lookUp.Contains(currentLine))
            lines[currentLine] = sr.ReadLine();
        else
            sr.ReadLine();
    }
}
using (StreamWriter sw = new StreamWriter(outputFile))
{
    foreach (var line in lineNumbers)
        sw.WriteLine(lines[line]);
}
You may use a StreamReader and its ReadLine method to read line by line without loading the whole file into memory:
var lines = new Dictionary<int, string>();
var indexesProcessed = new HashSet<int>();
var indexesNew = new List<int> { 2, 4, 4, 1 };
int lastIndex = indexesNew.Max(); // read only as far as the largest requested line
using (var reader = new StreamReader(@"c:\file.txt"))
{
    for (int index = 1; index <= lastIndex; index++)
    {
        if (reader.Peek() >= 0)
        {
            string line = reader.ReadLine();
            if (indexesNew.Contains(index) && !indexesProcessed.Contains(index))
            {
                lines[index] = line;
                indexesProcessed.Add(index);
            }
        }
    }
}
using (var writer = new StreamWriter(@"c:\file-new.txt", false))
{
    foreach (int index in indexesNew)
        if (indexesProcessed.Contains(index))
            writer.WriteLine(lines[index]);
}
It reads the file, selects the desired indexes, then saves them in the desired order.
We use a HashSet to store the processed indexes to speed up the Contains calls, since you indicate the file can be over 1GB.
The code is written to avoid an index-out-of-bounds error in case of mismatches between the source file and the desired indexes, but that slows down the process. If you are sure there will be no mismatch, you can optimize by removing all usage of indexesProcessed.
Output:
some text here
blah blah
blah blah
ogfile line 1
One way to do this would be to simply read the input file once (and store the result in a variable), and then grab the lines you need and write them to the output file.
Since the line number is 1-based and arrays are 0-based (i.e. line number 1 is array index 0), we subtract 1 from the line number when specifying the array index:
static void Main(string[] args)
{
    var inputFile = @"f:\private\temp\temp.txt";
    var outputFile = @"f:\private\temp\temp2.txt";
    var fileLines = File.ReadAllLines(inputFile);
    var linesToDisplay = new[] { 2, 4, 4, 1 };

    // Write each specified line in linesToDisplay from fileLines to the outputFile
    File.WriteAllLines(outputFile,
        linesToDisplay.Select(lineNumber => fileLines[lineNumber - 1]));

    GetKeyFromUser("\n\nDone! Press any key to exit..."); // helper: prompt and wait for a key
}
Another way to do this that should be more efficient is to only read the file up to the maximum line number (using the ReadLines method), rather than reading the whole file (using the ReadAllLines method), and save just the lines we care about in a dictionary that maps the line number to the line text:
static void Main(string[] args)
{
    var inputFile = @"f:\private\temp\temp.txt";
    var outputFile = @"f:\private\temp\temp2.txt";
    var linesToDisplay = new[] { 2, 4, 4, 1 };
    var maxLineNumber = linesToDisplay.Max();
    var fileLines = new Dictionary<int, string>(linesToDisplay.Distinct().Count());

    // Start lineNumber at 1 instead of 0
    int lineNumber = 1;

    // Just read up to the largest line number we need
    // and save the lines we care about in our dictionary
    foreach (var line in File.ReadLines(inputFile))
    {
        if (linesToDisplay.Contains(lineNumber))
        {
            fileLines[lineNumber] = line;
        }
        // Increment our lineNumber and break if we're done
        if (++lineNumber > maxLineNumber) break;
    }

    // Write the output to our file
    File.WriteAllLines(outputFile, linesToDisplay.Select(line => fileLines[line]));

    GetKeyFromUser("\n\nDone! Press any key to exit..."); // helper: prompt and wait for a key
}
I'm making a console app to navigate my PC.
I have a function called Askforcmd() which lets you write a command. It tests whether you wrote a specific thing with ifs and else ifs (what you write is stored in the string "commands").
I'm trying to write all my games in a .txt, separated by newlines, with each game's location after its name (separated by "^"),
(example:
portal 2^C:/PathOfGame
portal^C:/path
)
and have the code know that if you write the name of a game, it should open the file at the path after (I know how to open the file).
I know how to read from a txt and put that in an array, but how do I make it stop reading the lines after a certain character and store that in a different array?
What I have so far:
else if (lines.Any(commands.Contains))
{
    /* Code to check what game to open
       and at what path */
    Askforcmd();
}
else if (commands == "games")
{
    Console.Write("\n");
    int count = lines.Length;
    int numsss = 0;
    int ds;
    while (numsss != count)
    {
        ds = numsss + 1;
        Console.WriteLine(ds + ": " + lines[numsss]);
        numsss++;
    }
    Askforcmd();
}
When I run the code and write "games", it lists the games with a number before them.
1: Portal 2
2: Portal
etc
You can take a line you've read and call the Split method on it:
string[] newArray = lines[numsss].Split('^');
You get a new array where the first element is the game name and the second element is the path.
Edit: As per your comment, you have unusual requirements. You could do something like this:
// assume your previous array is lines
List<string> temp = new List<string>();
foreach (string line in lines)
{
    var parts = line.Split('^');
    temp.Add(parts[0]); // game name
    temp.Add(parts[1]); // path
}
string[] outArray = temp.ToArray();
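Alternatively (a hedged sketch, not from the original answer): since the goal is "type a game name, open its path", a Dictionary keyed by name is a more direct fit. The Process.Start call assumes each stored path is an executable or a file with an associated program:

var games = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
foreach (string line in lines)
{
    var parts = line.Split('^');
    if (parts.Length == 2)
        games[parts[0]] = parts[1]; // name -> path
}

// Later, when the user has typed a command:
string path;
if (games.TryGetValue(commands, out path))
    System.Diagnostics.Process.Start(path);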
I have a file with 5000+ lines. I want to find the most efficient way to choose one of those lines each time I run my program. I had originally intended to use the random method to choose one (that was before I knew there were 5000 lines). I thought that might be inefficient, so I considered reading the first line, then deleting it from the top and appending it to the bottom. But it seems that I would have to read the whole file and create a new file in order to delete from the top.
What is the most efficient way: the random method or the new file method?
The program will be run every 5 minutes, and I'm using C# 4.5.
In .NET 4.*, it is possible to access a single line of a file directly. For example, to get line X (0-based):
string line = File.ReadLines(FileName).Skip(X).First();
Full example:
var fileName = @"C:\text.txt";
var file = File.ReadLines(fileName).ToList();
int count = file.Count;
Random rnd = new Random();
int skip = rnd.Next(0, count);
string line = file[skip]; // the list is already in memory, so index it directly
Console.WriteLine(line);
Let's assume the file is so large that you cannot afford to fit it into RAM. Then you would want to use reservoir sampling, an algorithm designed to pick randomly from a list of unknown, arbitrary length that might not fit into memory:
Random r = new Random();
int currentLine = 1;
string pick = null;

foreach (string line in File.ReadLines(filename))
{
    // Line N replaces the current pick with probability 1/N:
    // r.Next(currentLine) is 0 with probability 1/currentLine.
    if (r.Next(currentLine) == 0)
    {
        pick = line;
    }
    ++currentLine;
}

return pick;
This algorithm is slightly unintuitive. At a high level, it works by having line N replace the currently selected line with probability 1/N. Thus, line 1 has a 100% chance of being selected, but a 50% chance of later being replaced by line 2, and so on.
I've found understanding this algorithm to be easiest in the form of a proof of correctness. So, a simple proof by induction:
1) Base case: By inspection, the algorithm works if there is 1 line.
2) Inductive step: if the algorithm works for N-1 lines, processing N lines works because:
3) after processing N-1 iterations of an N-line file, all N-1 lines are equally likely, each with probability 1/(N-1);
4) the next iteration ensures that line N has a probability of 1/N (because that is what the algorithm explicitly assigns it, and it is the final iteration), reducing the probability of all previous lines to:
1/(N-1) * (1 - 1/N)
= 1/(N-1) * (N/N - 1/N)
= 1/(N-1) * (N-1)/N
= 1/N
If you know how many lines are in the file in advance, this algorithm is more expensive than necessary, as it always reads the entire file.
I assume that the goal is to randomly choose one line from a file of 5000+ lines.
Try this (a minimal code sketch follows below):
Get the line count using File.ReadLines(file).Count().
Generate a random number, using the line count as an upper limit.
Do a lazy read of the file with File.ReadLines(file).
Choose the line at that random index from the lazy sequence.
EDIT: as pointed out, doing File.ReadLines(file).ToArray() is pretty inefficient.
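A minimal sketch of those steps, assuming file holds the input path (the file is read twice, once to count and once to pick):

int count = File.ReadLines(file).Count();     // 1. count the lines (first pass)
int lineNumber = new Random().Next(0, count); // 2. random 0-based index
string line = File.ReadLines(file)            // 3. lazy read (second pass)
    .Skip(lineNumber)
    .First();                                 // 4. stops once the chosen line is reached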
Here's a quick implementation of the method @LucasTrzesniewski proposed in the comments to the question:
// open the file
using (FileStream stream = File.OpenRead("yourfile.dat"))
{
    // 1. index all offsets that are the beginning of a line
    List<long> lineOffsets = new List<long>();
    lineOffsets.Add(stream.Position); // the very first offset is the beginning of a line!
    int ch;
    while ((ch = stream.ReadByte()) != -1) // -1 denotes the end of the file
    {
        // guard against a trailing newline, which would register an empty "line" at EOF
        if (ch == '\n' && stream.Position < stream.Length)
            lineOffsets.Add(stream.Position);
    }

    // 2. read a random line:
    // set the position of the stream to one of the previously saved offsets
    stream.Position = lineOffsets[new Random().Next(lineOffsets.Count)];

    // read the whole line from the specified offset
    using (StreamReader reader = new StreamReader(stream))
    {
        Console.WriteLine(reader.ReadLine());
    }
}
I don't have any VS near me at the moment, so this is untested.
I have a file that contains about 2000 lines of text that I need to add a few lines to. My initial solution was to just copy the existing file to a new file and then add a few lines at the end of it. That was until I realized that the last line in the file always had to remain the last line. So now I need to add my new lines before that one line of text. I know that I can just read the entire file, save it in my program, and then write everything to a new file with my extra lines included. But since the file has that many lines, I wanted to know if there is a better way to do it.
You will need to copy it into a new file. There's no way to inject data into the middle of a file, unfortunately.
However, you don't have to load it into memory to do that. You can use a StreamReader and read only one line at a time, or better yet, the System.IO.File.ReadLines method.
int newLineIndex = 100;
string newLineText = "Here's the new line";

using (var writer = new StreamWriter(outputFileName))
{
    int lineNumber = 0;
    foreach (var line in File.ReadLines(inputFileName))
    {
        if (lineNumber == newLineIndex)
        {
            if (lineNumber > 0)
                writer.WriteLine(); // terminate the previous line first
            writer.WriteLine(newLineText);
        }
        else if (lineNumber > 0)
        {
            writer.WriteLine();
        }
        writer.Write(line);
        lineNumber++;
    }
}
Of course, this becomes substantially easier if you're comfortable assuming that the new line will always go at index zero. If that's the case, I'd be tempted to forgo much of this, and just go with a simple Stream.CopyTo after writing the first line. But this should still work.
string newLineText = "Here's the new line";

using (var writer = new StreamWriter(outputFileName))
using (var reader = File.OpenRead(inputFileName))
{
    writer.WriteLine(newLineText);
    writer.Flush(); // make sure the new line hits the stream before copying raw bytes
    reader.CopyTo(writer.BaseStream);
}
Of course, there are any number of ways to perform this, with different trade-offs. This is just one option.
"I know that I can just read the entire file, save it in my program, and then write everything to a new file with my extra lines included."
Not everything needs to be written. Just write the inserted lines, plus the lines after them, back to the original file, starting from the position (byte index) of the insertion point.
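A hedged sketch of that approach, assuming insertOffset (the byte index of the insertion point) is already known, the tail of the file fits in memory, and the file is UTF-8 (Encoding is from System.Text):

byte[] insertion = Encoding.UTF8.GetBytes("Here's the new line" + Environment.NewLine);

using (var stream = new FileStream(fileName, FileMode.Open, FileAccess.ReadWrite))
{
    // Save everything from the insertion point onward.
    stream.Position = insertOffset;
    byte[] tail = new byte[stream.Length - insertOffset];
    int total = 0, read;
    while (total < tail.Length && (read = stream.Read(tail, total, tail.Length - total)) > 0)
        total += read;

    // Rewind to the insertion point, write the new lines, then the saved tail.
    stream.Position = insertOffset;
    stream.Write(insertion, 0, insertion.Length);
    stream.Write(tail, 0, tail.Length);
}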
I have a large text file that contains GUIDs that I will use to load into the custom application that I am trying to create. Since the file is so large (it may contain millions of lines of GUIDs), I want to break it into parts, process each part, and then move on to the next part until the end of the file.
Example of text file
ASDFSADFJO23490234AJSDFKL
JOGIJO349230420GJDGJDO230
BJCIOJDFOBJOD239402390423
JFWEIOJFOWE2390423901230N
3490FJSDOFOIWEMO23MOFI23O
FJWEIOFJWEIOFJOI23J230022
Let's just say the text file has 99,000 lines and I want to process the first 10,000 values (and repeat until the end). I will create a new folder for the first batch of 10,000, using something like DateTime.Now as the folder name. Then each of the 10,000 values will have a file created using its value as the file name. After the first 10,000 values are done, I will create a new folder using DateTime.Now again and move on to the next 10,000 values in the text file. Repeat until the end of the file.
I am able to read the text file, create a folder with DateTime.Now, and create the file with the appropriate name, but I do not know how to batch-process the list of values from the text file.
This is how I am reading the file.
string[] source = new string[] { };
source = File.ReadAllLines(@"C:\guids.txt");
I tried to use the Skip/Take method, and I think it works, but I just do not know how to create a new folder and add the new subset to it. Any help will be greatly appreciated. I am open to suggestions and can help clarify if you need more details. Thanks!!
From the comments, I deduce that your problem is not in fact "how do I batch the reads from guids.txt?", but "how do I process these guids and create files in groups of ten thousand in separate folders".
With this in mind, here's an example of how you could do that.
var batchSize = 10000;
var source = File.ReadLines(@"C:\guids.txt");

var i = 0;
var currentDirPath = "";
foreach (var line in source)
{
    if (i % batchSize == 0)
    {
        currentDirPath = Path.GetRandomFileName();
        Directory.CreateDirectory(currentDirPath);
    }
    var newFile = Path.Combine(currentDirPath, line + ".txt");
    File.WriteAllText(newFile, "Some content");
    i++;
}
Avoid using DateTime for file or folder names. The odds that some unforeseen behavior makes your code try to write to a file that already exists is just too high.
EDIT: About parallelism: use it only if you need it. It is always more complex than it seems, and it has a tendency to introduce hard-to-find bugs. That being said, here is an untested idea.
// Make sure the current folder is empty, otherwise the folders are very likely to already exist.
if (Directory.GetFiles(Directory.GetCurrentDirectory()).Any())
{
    throw new IOException("Current directory is not empty.");
}

var batchSize = 10000;
var source = File.ReadAllLines(@"C:\guids.txt");

// Create the folders synchronously to avoid race conditions.
var batchCount = (source.Length / batchSize) + 1;
for (int i = 0; i < batchCount; i++)
{
    Directory.CreateDirectory(i.ToString());
}

source.AsParallel().ForAll(line =>
{
    // Note: Array.IndexOf is O(n) per line and assumes the lines are unique
    // (true for GUIDs); for very large inputs, carry the index along instead.
    var folder = (Array.IndexOf(source, line) / batchSize).ToString();
    var newFile = Path.Combine(folder, line + ".txt");
    File.WriteAllText(newFile, "Some content");
});