So I've been using the same code for about a year now. Normally I find new ways to do old tasks and slowly improve, but I just seem to have stagnated with this one. I was curious if anyone could provide any insight into how I would do this task differently. I'm loading in a text file, reading all its lines into a string array, and then looping over those entries to perform an operation on each line.
string[] config = File.ReadAllLines("Config.txt");
foreach (string line in config)
{
DoOperations(line);
}
Eventually I'll just be moving to OpenFileDialog, but that's for a time in the future, and using OFD on a console application that's multi-threaded seems like bad practice.
Since you don't act on the whole file at any point, you could read it one line at a time. Given that your file looks like a config, it's probably not a massive file, but if you were trying to read a large file using File.ReadAllLines() you could run into memory issues. Reading one line at a time helps avoid that.
using (StreamReader file = new StreamReader("config.txt"))
{
    string line;
    while ((line = file.ReadLine()) != null)
    {
        DoOperations(line);
    }
}
You could rename config to lines for readability ;)
You could use var
Select? (if DoSomething returns something)
var parsed = File.ReadAllLines("Config.txt").Select(l => Parsed(l));
ForEach?
lines.ToList().ForEach(l => DoSomething(l));
Read line by line with ReadLines?
foreach (var line in File.ReadLines("Config.txt"))
{
(...)
}
The ReadLines and ReadAllLines methods differ as follows: When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned; when you use ReadAllLines, you must wait for the whole array of strings to be returned before you can access the array. Therefore, when you are working with very large files, ReadLines can be more efficient.
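For example, because the enumeration is deferred, you can stop as soon as you have what you need and the rest of the file is never read. A small sketch (the "key=" prefix is just a made-up pattern for illustration):
// Requires: using System.IO; using System.Linq;
// Stops reading the file at the first matching line; returns null if none match.
string firstMatch = File.ReadLines("Config.txt")
                        .FirstOrDefault(l => l.StartsWith("key="));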
Related
So I've decided to create a program that does quite a few things. As a part of this program there's a section called "text tools" which takes a text file (via one button) and then has additional buttons that perform other functions, like removing whitespace and empty lines from the file, removing duplicates, and removing lines that match a certain pattern, e.g. 123 or abc.
I'm able to import the file and print the list using a foreach loop, and I believe I'm along the right lines; however, I need to remove duplicates. I've decided to use a HashSet thanks to this thread, in which it says it's the simplest and fastest method (my file will contain millions of lines).
The problem is that I can't figure out just what I'm doing wrong. I've got the event handler for the button click, created a list of strings in memory, looped through each line in the file (adding it to the list), and then created another variable and set it to a HashSet of the list. (Sorry if that's convoluted; for some reason it doesn't work.)
I've looked at every Stack Overflow question similar to this, but I can't find any solution. I've also looked into HashSet in general, to no avail.
Here's my code so far:
private void btnClearDuplicates_Copy_Click(object sender, RoutedEventArgs e)
{
    List<string> list = new List<string>();
    foreach (string line in File.ReadLines(FilePath, Encoding.UTF8))
    {
        list.Add(line);
    }
    var DuplicatesRemoved = new HashSet<String>(list);
}
To be specific to your question, and to get my last 3 points:
var lines = File.ReadAllLines("somepath");
var hashSet = new HashSet<string>(lines);
File.WriteAllLines("somepath", hashSet.ToList());
Note that there are other, possibly more performant, ways of doing this; it depends on the number of duplicates and the size of the file.
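For example, if the whole file fits in memory anyway, a compact LINQ alternative (just a sketch, not necessarily faster) is Distinct; in practice LINQ to Objects keeps the first occurrence of each line in its original order, though the docs don't guarantee an ordering:
// Requires: using System.IO; using System.Linq;
// ToList() materializes the result before writing, so we are not
// reading and writing the same file at the same time.
var unique = File.ReadLines("somepath").Distinct().ToList();
File.WriteAllLines("somepath", unique);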
It is preferable to process the file as a stream if possible. I would not even call it optimization; I would rather call it not being wasteful. If you can use the stream approach, the ReadAllLines approach is somewhere between almost good and very bad, depending on the situation. It is also a good idea to preserve line order. A HashSet generally does not preserve order; if you store everything into it and then read it back, the lines can come out shuffled.
using (var outFile = new StreamWriter(outFilePath))
{
    HashSet<string> seen = new HashSet<string>();
    foreach (string line in File.ReadLines(FilePath, Encoding.UTF8))
    {
        if (seen.Add(line))
        {
            outFile.WriteLine(line);
        }
    }
}
I am a little new to C# and I'm having performance issues with this.
In my program, people import a .txt list and the program makes a list out of it; the problem is that it's consuming too much RAM, crashing PCs with low memory. I thought of using 'yield' without success. Any ideas?
private List<string> ImportList()
{
    try
    {
        using (var ofd = new OpenFileDialog() { Filter = "Text files (*.txt) | *.txt" })
        {
            if (ofd.ShowDialog() == DialogResult.OK)
            {
                return File.ReadAllLines(ofd.FileName).ToList();
            }
        }
        return null;
    }
    catch (OutOfMemoryException ex)
    {
        MessageBox.Show("The list is too large. Try using a smaller list or dividing it.", "Warning!");
        return null;
    }
}
The method ReadAllLines returns an array of strings, not a List => File.ReadAllLines Method (String)
I think that you should use ReadLines(); check this question about the differences between ReadLines and ReadAllLines:
Is there any performance difference related to these methods? Yes, there is a difference.
The File.ReadAllLines() method reads the whole file at once and returns the string[] array, so it takes time when working with large files and is not recommended, as the user has to wait until the whole array is returned.
File.ReadLines() returns an IEnumerable and does not read the whole file in one go, so it is a better option when working with large files.
From MSDN:
The ReadLines and ReadAllLines methods differ as follows: When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned; when you use ReadAllLines, you must wait for the whole array of strings to be returned before you can access the array. Therefore, when you are working with very large files, ReadLines can be more efficient.
Example 1: File.ReadAllLines()
string[] lines = File.ReadAllLines("C:\\mytxt.txt");
Example 2: File.ReadLines()
foreach (var line in File.ReadLines("C:\\mytxt.txt"))
{
//Do something
}
Response to Sudhakar Tillapudi:
Read Big TXT File, Out of Memory Exception
I just copy-pasted the solution from the other question. See if it works.
foreach (var line in File.ReadLines(_filePath))
{
//Don't put "line" into a list or collection.
//Just make your processing on it.
}
ReadLines returns an IEnumerable<string>; see File.ReadLines.
Concept is to not load all lines into a list at once. Even if you want to process them, process them line by line using IEnumerable instead of List.
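Since you mentioned trying yield, here is a rough sketch of how an iterator method might look; the method name and ProcessLine are placeholders, not from your code:
// Requires: using System.Collections.Generic; using System.IO;
private IEnumerable<string> ReadListLines(string fileName)
{
    using (var reader = new StreamReader(fileName))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;   // hands back one line at a time, nothing is accumulated
        }
    }
}

// The caller consumes lines as they arrive instead of building a List<string>:
foreach (var line in ReadListLines(ofd.FileName))
{
    ProcessLine(line);           // hypothetical per-line processing
}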
If the exception occurs at ReadAllLines, try this:
Use a StreamReader to read the file line by line and add it to the list. Something like this:
using (StreamReader sr = new StreamReader(ofd.FileName))
{
    while (!sr.EndOfStream)
    {
        yourList.Add(sr.ReadLine());
    }
}
If the exception occurs at ToList, try this:
You should get the array returned by ReadAllLines first, and use a foreach loop to add the array elements to the list.
foreach (var str in arrayReturned) {
yourList.Add (str);
}
If this still does not work, use the ReadLines method in the same class. The difference between ReadAllLines and ReadLines is that the latter returns an IEnumerable<string> instead of a string[]. An IEnumerable<string> uses deferred execution: it will only give you one element when you ask for it. Jon Skeet's book, C# in Depth, talks about this in detail.
Here is the docs for ReadLines for more information:
https://msdn.microsoft.com/en-us/library/dd383503(v=vs.110).aspx
I'm trying to locate a line which contains specific text inside a large text file (18 MB). Currently I'm using StreamReader to open the file and read it line by line, checking if it contains the search string:
while ((line = reader.ReadLine()) != null)
{
    if (line.Contains("search string"))
    {
        //Do something with line
    }
}
But unfortunately, because the file I'm using has more than 1 million records, this method is slow. What is the quickest way to achieve this?
In general, disk IO of this nature is just going to be slow. There is likely little you can do to improve over your current version in terms of performance, at least not without dramatically changing the format in which you store your data, or your hardware.
However, you could shorten the code and simplify it in terms of maintenance and readability:
var lines = File.ReadLines(filename).Where(l => l.Contains("search string"));
foreach(var line in lines)
{
// Do something here with line
}
Reading the entire file into memory causes the application to hang and is very slow. Do you think there are any other alternatives?
If the main goal here is to prevent application hangs, you can do this in the background instead of in a UI thread. If you make your method async, this can become:
while ((line = await reader.ReadLineAsync()) != null)
{
    if (line.Contains("search string"))
    {
        //Do something with line
    }
}
This will likely make the total operation take longer, but not block your UI thread while the file access is occurring.
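For context, a rough sketch of the surrounding async method could look like this (the method name and parameters are placeholders):
// Requires: using System.IO; using System.Threading.Tasks;
private async Task SearchFileAsync(string path, string searchString)
{
    using (var reader = new StreamReader(path))
    {
        string line;
        while ((line = await reader.ReadLineAsync()) != null)
        {
            if (line.Contains(searchString))
            {
                // Do something with line
            }
        }
    }
}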
Get a hard drive with a faster read speed (moving to a solid state drive if you aren't already would likely help a lot).
Store the data across several files each on different physical drives. Search through those drives in parallel.
Use a RAID0 hard drive configuration. (This is sort of a special case of the previous approach.)
Create an index of the lines in the file that you can use to search for specific words. (Creating the index will be a lot more expensive than a single search, and will require a lot of disk space, but it will allow subsequent searches at much faster speeds.)
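To make the last idea concrete, here is a very rough sketch of an in-memory index that maps each word to the line numbers it appears on. It only handles single-word lookups, splits on spaces, and would normally be persisted rather than rebuilt every run; all names here are made up:
// Requires: using System.Collections.Generic; using System.IO;
var index = new Dictionary<string, List<int>>();
int lineNumber = 0;
foreach (var line in File.ReadLines(filename))
{
    foreach (var word in line.Split(' '))
    {
        List<int> lineNumbers;
        if (!index.TryGetValue(word, out lineNumbers))
        {
            lineNumbers = new List<int>();
            index[word] = lineNumbers;
        }
        lineNumbers.Add(lineNumber);
    }
    lineNumber++;
}

// A lookup is now a dictionary hit instead of a full file scan:
List<int> hits;
if (index.TryGetValue("searchword", out hits))
{
    // hits contains the line numbers where "searchword" occurs
}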
I have a huge text file which I need to read. Currently I am reading the text file like this:
string[] lines = File.ReadAllLines(FileToCopy);
But here all the lines get stored in the lines array first, and only then are they processed according to the condition. That is not an efficient way, because it first reads irrelevant rows (lines) of the text file into the array and then runs them through the same processing.
So my question is: can I specify the line number to start reading the text file from? Suppose last time it read 10001 lines; next time it should start from line 10002.
How can I achieve this?
Well, you don't have to store all those lines - but you definitely have to read them. Unless the lines are of a fixed length (in bytes, not characters), how would you expect to be able to skip to a particular part of the file?
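For illustration, if the records were fixed length, the skip would be a single seek. This is only a sketch; the record size and line terminator below are assumptions about a format you would have to control yourself:
// Requires: using System.IO;
const int recordLength = 100;   // hypothetical fixed record size in bytes
const int newlineLength = 2;    // assumes "\r\n" line endings
long offset = (long)linesToSkip * (recordLength + newlineLength);

using (var fs = File.OpenRead(FileToCopy))
{
    fs.Seek(offset, SeekOrigin.Begin);
    using (var reader = new StreamReader(fs))
    {
        string firstWantedLine = reader.ReadLine();   // first line after the skipped block
    }
}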
To store only the lines you want in memory though, use:
List<string> lines = File.ReadLines(FileToCopy).Skip(linesToSkip).ToList();
Note that File.ReadLines() was introduced in .NET 4, and reads the lines on-demand with an iterator instead of reading the entire file into memory.
If you only want to process a certain number of lines, you can use Take as well:
List<string> lines = File.ReadLines(FileToCopy)
.Skip(linesToSkip)
.Take(linesToRead)
.ToList();
So for example, linesToSkip=10000 and linesToRead=1000 would give you lines 10001-11000.
Ignore the lines - they're useless. If every line isn't the same length, you're going to have to read them one by one again, and that's a huge waste.
Instead, use the position of the file stream. This way, you can skip right there on the second attempt, no need to read the data all over again. After that, you'll just use ReadLine in a loop until you get to the end, and mark the new end position.
Please, don't use ReadLines().Skip(). If you have a 10 GB file, it will read all the 10 GBs, create the appropriate strings, throw them away, and then, finally, read the 100 bytes you want to read. That's just crazy :) Of course, it's better than using File.ReadAllLines, but only because that doesn't need to keep the whole file in memory at once. Other than that, you're still reading every single byte of the file (you have to find out where the lines end).
Sample code of a method to read from last known location:
string[] ReadAllLinesFromBookmark(string fileName, ref long lastPosition)
{
    using (var fs = File.OpenRead(fileName))
    {
        fs.Position = lastPosition;
        using (var sr = new StreamReader(fs))
        {
            string line = null;
            List<string> lines = new List<string>();
            while ((line = sr.ReadLine()) != null)
            {
                lines.Add(line);
            }
            lastPosition = fs.Position;
            return lines.ToArray();
        }
    }
}
Well, you do have line numbers, in the form of the array index. Keep a note of the previously read line's array index and you can start reading from the next array index.
Use the FileStream.Position property to get the current position in the file, and then set that position next time.
var fileList = Directory.GetFiles("./", "split*.dat");
int fileCount = fileList.Length;
int i = 0;
foreach (string path in fileList)
{
    string[] contents = File.ReadAllLines(path); // OutOfMemoryException
    Array.Sort(contents);
    string newpath = path.Replace("split", "sorted");
    File.WriteAllLines(newpath, contents);
    File.Delete(path);
    contents = null;
    GC.Collect();
    SortChunksProgressChanged(this, (double)i / fileCount);
    i++;
}
And for a file that consists of ~20-30 big lines (every line ~20 MB), I get an OutOfMemoryException when I perform the ReadAllLines method. Why is this exception raised? And how do I fix it?
P.S. I use Mono on MacOS
You should always be very careful about performing operations with potentially unbounded results, in your case reading a file. As you mention, the file size and/or line length is unbounded.
The answer lies in reading 'enough' of a line to sort it, then skipping characters until the next line and reading the next 'enough'. You probably want to build a line index lookup so that when you reach an ambiguous sort order you can go back to get more data from the line (seek to the file position). When you go back, you only need to read the next sortable chunk to disambiguate the conflicting lines.
You may need to think about the file encoding; don't go straight to bytes unless you know it is one byte per char.
The built-in sort is not as fast as you'd like.
Side Note:
If you call GC.* you've probably done it wrong
setting contents = null does not help you
If you are using a foreach and maintaining the index yourself, then you may be better off with a for (int i...) for readability
Okay, let me give you a hint to help you with your homework. Loading the complete file into memory will, as you know, not work, because it is given as a precondition of the assignment. You need to find a way to lazily load the data from disk as you go and throw data away as soon as possible. Because single lines could be too big, you will have to do this one char at a time.
Try creating a class that represents an abstraction over a line, for instance by wrapping the starting index and ending index of that line. When you let this class implement IComparable<T> it allows you to sort that line with other lines. Again, the trick is to be able to read characters from the file one at a time. You will need to work with Streams (File.Open) directly.
When you do this, you will be able to write your application code like this:
List<FileLine> lines = GetLines("fileToSort.dat");
lines.Sort();
foreach (var line in lines)
{
line.AppendToFile("sortedFile.dat");
}
Your task will be to implement GetLines(string path) and create the FileLine class.
Note that I assume that the actual number of lines will be small enough that the List<FileLine> will fit into memory (which means an approximate maximum of 40,000,000 lines). If the number of lines can be higher, you would need an even more flexible approach, but since you are talking about 20 to 30 lines, this should not be a problem.
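To make the shape of that a bit more concrete, here is a very rough sketch of what FileLine and GetLines could look like. It is deliberately naive (it assumes a single-byte encoding such as ASCII, ignores '\r', and opens the file again for every comparison, which is slow); it is meant to show the idea, not to be the finished homework. The application code above would then call FileLine.GetLines(...).
// Requires: using System; using System.Collections.Generic; using System.IO;
class FileLine : IComparable<FileLine>
{
    private readonly string path;
    public long Start { get; private set; }
    public long Length { get; private set; }

    public FileLine(string path, long start, long length)
    {
        this.path = path;
        Start = start;
        Length = length;
    }

    public int CompareTo(FileLine other)
    {
        // Compare the two lines one byte at a time; never holds more
        // than one byte of each line in memory.
        using (var a = File.OpenRead(path))
        using (var b = File.OpenRead(other.path))
        {
            a.Position = Start;
            b.Position = other.Start;
            long shorter = Math.Min(Length, other.Length);
            for (long i = 0; i < shorter; i++)
            {
                int ca = a.ReadByte();
                int cb = b.ReadByte();
                if (ca != cb) return ca.CompareTo(cb);
            }
            return Length.CompareTo(other.Length);
        }
    }

    public void AppendToFile(string target)
    {
        // Copies the line's bytes to the target file, followed by a newline.
        using (var src = File.OpenRead(path))
        using (var dst = new FileStream(target, FileMode.Append, FileAccess.Write))
        {
            src.Position = Start;
            for (long i = 0; i < Length; i++)
            {
                dst.WriteByte((byte)src.ReadByte());
            }
            dst.WriteByte((byte)'\n');
        }
    }

    public static List<FileLine> GetLines(string path)
    {
        // One pass over the file, recording only the start offset and
        // length of each line, never the line contents themselves.
        var lines = new List<FileLine>();
        using (var fs = File.OpenRead(path))
        {
            long start = 0;
            int c;
            while ((c = fs.ReadByte()) != -1)
            {
                if (c == '\n')
                {
                    lines.Add(new FileLine(path, start, fs.Position - start - 1));
                    start = fs.Position;
                }
            }
            if (fs.Position > start)
            {
                // last line without a trailing newline
                lines.Add(new FileLine(path, start, fs.Position - start));
            }
        }
        return lines;
    }
}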
Basically your approach is bull. You are violating a constraint of the homework you were given, and this constraint has been put there to make you think more.
As you said:
I must implement external sort and show my teacher that it works for files bigger than my RAM
OK, so how do you think you will ever read the file in? ;) This is there on purpose. ReadAllLines does NOT implement incremental external sort. As a result, it blows up.