I am a little new to C# and I'm having performance issues with this.
In my program, people import a .txt list and the program makes a list out of it; the problem is it's consuming too much RAM, crashing PCs with low memory. I tried using 'yield' without success. Any ideas?
private List<string> ImportList()
{
    try
    {
        using (var ofd = new OpenFileDialog() { Filter = "Text files (*.txt) | *.txt" })
        {
            if (ofd.ShowDialog() == DialogResult.OK)
            {
                return File.ReadAllLines(ofd.FileName).ToList();
            }
        }
        return null;
    }
    catch (OutOfMemoryException)
    {
        MessageBox.Show("The list is too large. Try using a smaller list or dividing it.", "Warning!");
        return null;
    }
}
The method ReadAllLines returns an array of strings, not a List => File.ReadAllLines Method (String)
I think you should use ReadLines(); check this question about the differences between ReadLines and ReadAllLines:
Is there any performance difference between these methods? Yes, there is a difference.
File.ReadAllLines() reads the whole file at once and returns a string[] array, so it takes time when working with large files and is not recommended, since the user has to wait until the whole array is returned.
File.ReadLines() returns an IEnumerable and does not read the whole file in one go, so it is really a better option when working with large files.
From MSDN:
The ReadLines and ReadAllLines methods differ as follows: When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned; when you use ReadAllLines, you must wait for the whole array of strings to be returned before you can access the array. Therefore, when you are working with very large files, ReadLines can be more efficient.
Example 1: File.ReadAllLines()
string[] lines = File.ReadAllLines("C:\\mytxt.txt");
Example 2: File.ReadLines()
foreach (var line in File.ReadLines("C:\\mytxt.txt"))
{
//Do something
}
Response to Sudhakar Tillapudi:
Read Big TXT File, Out of Memory Exception
I just copy-pasted the solution from another question. See if it works.
foreach (var line in File.ReadLines(_filePath))
{
// Don't put "line" into a list or collection.
// Just do your processing on it directly.
}
ReadLines returns an IEnumerable<string>: File.ReadLines.
The concept is to not load all the lines into a list at once. Even if you want to process them all, process them line by line through the IEnumerable instead of a List.
If the exception occurs at ReadAllLines, try this:
Use a StreamReader to read the file line by line and add each line to the list. Something like this:
using (StreamReader sr = new StreamReader(ofd.FileName))
{
    while (!sr.EndOfStream)
    {
        yourList.Add(sr.ReadLine());
    }
}
If the exception occurs at ToList, try this:
You should get the array returned by ReadAllLines first, and use a foreach loop to add the array elements to the list.
foreach (var str in arrayReturned)
{
    yourList.Add(str);
}
If this still does not work, use the ReadLines method in the same class. The difference between ReadAllLines and ReadLines is that the latter returns an IEnumerable<string> instead of a string[]. An IEnumerable<string> uses deferred execution: it only gives you an element when you ask for it. Jon Skeet's book, C# in Depth, covers this in detail.
Here are the docs for ReadLines for more information:
https://msdn.microsoft.com/en-us/library/dd383503(v=vs.110).aspx
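As a quick illustration of the deferred behavior (hypothetical path; assumes using System, System.IO and System.Linq), only the first ten lines are ever read from disk here, because the enumeration stops as soon as Take(10) is satisfied:
// ReadLines opens the file lazily; Take(10) stops the enumeration early,
// so the rest of the file is never read.
foreach (var line in File.ReadLines("C:\\mytxt.txt").Take(10))
{
    Console.WriteLine(line);
}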
Related
I found this post on selecting a range from an array, and have to use the LINQ option:
Selecting a range of items inside an array in C#
Ultimately, I'm trying to get the last four lines from some text file. After I've read in the lines and cleaned them of unwanted characters and empty lines, I have an array with all of the lines. I'm using the following to do so:
string[] allLines = GetEachLine(results);
string[] lastFourLines = allLines.Skip(allLines.Length - 4).Take(4).ToArray();
This works fine, but I'm wondering if I could somehow skip assigning to the allLines variable altogether. Such as:
string[] lastFourLines = GetEachLine(results).Skip(returnedArrayLength - 4).Take(4).ToArray();
It would be better to change GetEachLine and the code preceding it (however results is computed) to use IEnumerable<T>, and avoid reading the entire file into an array in memory just to get the last four lines (unless you use all of results for something else). Consider using File.ReadLines.
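For example, a lazy version of GetEachLine might look like this (a sketch only, assuming the input is identified by a file path and using a trim/empty-line filter as a stand-in for whatever your real cleaning does; needs using System.Collections.Generic and System.IO):
// Yields cleaned lines one at a time instead of building an array up front.
static IEnumerable<string> GetEachLine(string path)
{
    foreach (var line in File.ReadLines(path))
    {
        var cleaned = line.Trim(); // stand-in for your real cleanup logic
        if (cleaned.Length > 0)    // drop empty lines
            yield return cleaned;
    }
}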
However, if you are using .NET Core 2.0 or greater, you can use Enumerable.TakeLast to efficiently return the last four lines:
var lastFourLines = GetEachLine(results).TakeLast(4);
If GetEachLine() returns string[], that should work fine, though null checking may be needed.
As you chain more you may want to use line breaks to increase readability:
string[] lastFourLines = GetEachLine(results)
.Skip(allLines.Length - 4)
.Take(4)
.ToArray();
allLines.Length won't exist unless you still have line 1 from your question; you can avoid calling GetEachLine() twice by using TakeLast():
string[] lastFourLines = GetEachLine(results)
.TakeLast(4)
.ToArray();
If you are looking to efficiently retrieve the last N (filtered) lines of a large file, you really need to start at the point where you are reading the file contents.
Consider a 1GB log file containing 10M records, where you only want the last few lines. Ideally, you would want to start by reading the last couple KB and then start extracting lines by searching for line breaks from the end, extracting each line and returning them in an iterator yield. If you run out of data, read the preceding block. Continue only as long as the consumer requests more values from the iterator.
Offhand, I don't know of a built-in way to do this, and coding it from scratch could get pretty involved. Luckily, a search turned up this similar question, which has a highly rated answer.
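For a rough idea of the shape of the simple case, here is a minimal sketch (assumptions: UTF-8 text and the last n lines fit inside a single tailBytes buffer; a real implementation would loop backwards over blocks; needs using System, System.IO, System.Linq and System.Text):
// Reads only the tail of the file and extracts the last n lines from it.
static string[] ReadLastLines(string path, int n, int tailBytes = 64 * 1024)
{
    using (var fs = File.OpenRead(path))
    {
        long start = Math.Max(0, fs.Length - tailBytes);
        fs.Position = start;
        var buffer = new byte[fs.Length - start];
        int read = 0;
        while (read < buffer.Length) // FileStream.Read may return fewer bytes than requested
        {
            int r = fs.Read(buffer, read, buffer.Length - read);
            if (r == 0) break;
            read += r;
        }
        var lines = Encoding.UTF8.GetString(buffer, 0, read)
                        .Split('\n')
                        .Select(l => l.TrimEnd('\r'))
                        .ToArray();
        int skip = start > 0 ? 1 : 0; // if we started mid-line, the first entry is partial
        return lines.Skip(Math.Max(skip, lines.Length - n)).ToArray();
    }
}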
So I've been using the same code for about a year now. Normally I find new ways to do old tasks and slowly improve, but I seem to have stagnated with this one. I was curious if anyone could provide any insight on how I might do this task differently. I'm loading a text file, reading all its lines into a string array, and then looping over those entries to perform an operation on each line.
string[] config = File.ReadAllLines("Config.txt");
foreach (string line in config)
{
DoOperations(line);
}
Eventually I'll just be moving to OpenFileDialog, but that's for a time in the future, and using OFD in a multithreaded console application seems like bad practice anyway.
Since you don't act on the whole file at any point, you could read it one line at a time. Given that your file looks like a config, it's probably not a massive file, but if you were trying to read a large file with File.ReadAllLines() you could run into memory issues. Reading one line at a time helps avoid that.
using (StreamReader file = new StreamReader("config.txt"))
{
    string line;
    while ((line = file.ReadLine()) != null)
    {
        DoOperations(line);
    }
}
You could rename config to lines for readability ;)
You could use var
Select? (if DoSomething returns something)
var parsed = File.ReadAllLines("Config.txt").Select(l => Parsed(l));
ForEach?
lines.ToList().ForEach(l => DoSomething(l));
Read line by line with ReadLines?
foreach (var line in File.ReadLines("Config.txt"))
{
(...)
}
The ReadLines and ReadAllLines methods differ as follows: When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned; when you use ReadAllLines, you must wait for the whole array of strings to be returned before you can access the array. Therefore, when you are working with very large files, ReadLines can be more efficient.
So I've decided to create a program that does quite a few things. One part of it is a section called "text tools", which takes a text file (via one button) and then has additional buttons that perform other functions, like removing whitespace and empty lines from the file, removing duplicates, and removing lines that match a certain pattern, e.g. 123 or abc.
I'm able to import the file and print the list using a foreach loop, and I believe I'm along the right lines, but I need to remove duplicates. I've decided to use HashSet thanks to this thread, in which it's said to be the simplest and fastest method (my file will contain millions of lines).
The problem is that I can't figure out just what I'm doing wrong. I've got the event handler for the button click, created a list of strings in memory, looped through each line in the file (adding it to the list), then created another list and set it to be the HashSet of the first list. (Sorry if that's convoluted; it doesn't work, for whatever reason.)
I've looked at every Stack Overflow question similar to this, but I can't find a solution. I've also looked into HashSet in general, to no avail.
Here's my code so far:
private void btnClearDuplicates_Copy_Click(object sender, RoutedEventArgs e)
{
    List<string> list = new List<string>();
    foreach (string line in File.ReadLines(FilePath, Encoding.UTF8))
    {
        list.Add(line);
    }
    var DuplicatesRemoved = new HashSet<String>(list);
}
To be specific to your question (and to get my last 3 points):
var lines = File.ReadAllLines("somepath");
var hashSet = new HashSet<string>(lines);
File.WriteAllLines("somepath", hashSet.ToList());
Note that there are other ways, and maybe more performant ways, of doing this; it depends on the number of duplicates and the size of the file.
It is preferable to process a file as a stream if possible. I would not even call it optimization; I would rather call it not being wasteful. If you can use the stream approach, the ReadAllLines approach is somewhere between almost good and very bad, depending on the situation. It is also a good idea to preserve the order of the lines. HashSet generally does not preserve order; if you store everything into it and read it back, the lines can come out shuffled.
using (var outFile = new StreamWriter(outFilePath))
{
    HashSet<string> seen = new HashSet<string>();
    foreach (string line in File.ReadLines(FilePath, Encoding.UTF8))
    {
        // Add returns false if the line is already in the set,
        // so duplicates are skipped and the original order is kept.
        if (seen.Add(line))
        {
            outFile.WriteLine(line);
        }
    }
}
Let's say I have the following file format (Key value pairs):
Object1Key: Object1Value
Object2Key: Object2Value
Object3Key: Object3Value
Object4Key: Object4Value1
Object4Value2
Object4Value3
Object5Key: Object5Value
Object6Key: Object6Value
I'm reading this line by line with StreamReader. For objects 1, 2, 3, 5 and 6 it wouldn't be a problem, because the whole object is on one line, so it's possible to process the object.
But for object 4 I need to process multiple lines. Can I use Peek for this? (MSDN for Peek: "Returns the next available character but does not consume it.") Is there a method like Peek that returns the next line rather than the next character?
If I can use Peek, then my question is: can I use Peek two times, so I can read the next two (or three) lines, until I know there is a new object (object 5) to be processed?
I would strongly recommend that you separate the IO from the line handling entirely.
Instead of making your processing code use a StreamReader, pass it either an IList<string> or an IEnumerable<string>... if you use IList<string> that will make it really easy to just access the lines by index (so you can easily keep track of "the key I'm processing started at line 5" or whatever), but it would mean either doing something clever or reading the whole file in one go.
If it's not a big file, then just using File.ReadAllLines is going to be the very simplest way of reading a file as a list of lines.
If it is a big file, use File.ReadLines to obtain an IEnumerable<string>, and then your processing code needs to be a bit smarter... for example, it might want to create a List<string> for each key that it processes, containing all the lines for that key - and let that list be garbage collected when you read the next key.
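A sketch of that last idea (the ':' test and the record-as-list shape are assumptions standing in for your real format; needs using System.Collections.Generic):
// Lazily groups a key line plus its continuation lines into one record
// at a time, so only the current record's lines are held in memory.
static IEnumerable<List<string>> ReadRecords(IEnumerable<string> lines)
{
    List<string> current = null;
    foreach (var line in lines)
    {
        if (line.Contains(":")) // assumed marker for a new key
        {
            if (current != null)
                yield return current; // hand off the finished record
            current = new List<string> { line };
        }
        else if (current != null)
        {
            current.Add(line); // continuation value, like Object4Value2
        }
    }
    if (current != null)
        yield return current;
}
// Usage: foreach (var record in ReadRecords(File.ReadLines(path))) { ... }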
There is no way to use Peek multiple times the way you're thinking, because it will always return only the "top" character in the stream. It reads the character but does not tell the stream that it was consumed.
To sum up: the stream's pointer stays in the same place after Peek.
If you use, for example, a FileStream, you can use Seek to go back, but you didn't specify what type of stream you are using.
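If you do want line-level lookahead without seeking, one common workaround is to buffer one line ahead; here is a sketch (the class name is made up):
// One-line lookahead wrapper around StreamReader: PeekLine returns the
// next line without consuming it, ReadLine consumes it. Both return null
// at end of stream.
class LineReader : IDisposable
{
    private readonly StreamReader reader;
    private string buffered;

    public LineReader(string path) { reader = new StreamReader(path); }

    public string PeekLine()
    {
        if (buffered == null)
            buffered = reader.ReadLine();
        return buffered;
    }

    public string ReadLine()
    {
        if (buffered != null)
        {
            string line = buffered;
            buffered = null;
            return line;
        }
        return reader.ReadLine();
    }

    public void Dispose() { reader.Dispose(); }
}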
You could do something like this:
List<MyObject> objects = new List<MyObject>();
using (StreamReader sr = new StreamReader(aPath))
{
    MyObject curObj = null;
    while (!sr.EndOfStream)
    {
        string line = sr.ReadLine();
        if (line.IndexOf(':') >= 0) // or whatever identifies the beginning of a new object
        {
            curObj = new MyObject(line);
            objects.Add(curObj);
        }
        else if (curObj != null) // guard against a continuation line before any key line
        {
            curObj.AddAttribute(line);
        }
    }
}
I have a huge text file which I need to read. Currently I am reading the text file like this:
string[] lines = File.ReadAllLines(FileToCopy);
But this way all the lines get stored in the lines array first, and only then are they processed according to the condition, which is not efficient: it reads irrelevant rows (lines) of the text file into the array as well, and the processing then has to walk through all of them the same way.
So my question is: can I specify the line number from which to start reading the text file? Suppose last time it read 10001 lines; the next time it should start from line 10002.
How do I achieve that?
Well, you don't have to store all those lines, but you definitely have to read them. Unless the lines are of a fixed length (in bytes, not characters), how would you expect to be able to skip to a particular part of the file?
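(As an aside: if every line were exactly lineBytes bytes long, including the line terminator, you could compute the offset and jump straight to a given line. A hypothetical sketch, with lineBytes made up for illustration:)
using (var fs = File.OpenRead(FileToCopy))
{
    fs.Position = (long)linesToSkip * lineBytes; // direct seek, nothing is read before this point
    using (var sr = new StreamReader(fs))
    {
        string line = sr.ReadLine(); // the first line after the skipped ones
    }
}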
To store only the lines you want in memory though, use:
List<string> lines = File.ReadLines(FileToCopy).Skip(linesToSkip).ToList();
Note that File.ReadLines() was introduced in .NET 4 and reads the lines on demand with an iterator instead of reading the entire file into memory.
If you only want to process a certain number of lines, you can use Take as well:
List<string> lines = File.ReadLines(FileToCopy)
.Skip(linesToSkip)
.Take(linesToRead)
.ToList();
So for example, linesToSkip=10000 and linesToRead=1000 would give you lines 10001-11000.
Forget the line numbers; they're useless here. If every line isn't the same length, you're going to have to read them one by one all over again, and that's a huge waste.
Instead, use the position of the file stream. This way you can skip right to the correct place on the second attempt, with no need to read the data all over again. After that, just use ReadLine in a loop until you reach the end, and record the new end position.
Please don't use ReadLines().Skip(). If you have a 10 GB file, it will read all 10 GB, create the corresponding strings, throw them away, and then, finally, read the 100 bytes you want. That's just crazy :) Of course, it's better than File.ReadAllLines, but only because that doesn't keep the whole file in memory at once. Other than that, you're still reading every single byte of the file (you have to find out where the lines end).
Sample code of a method to read from last known location:
string[] ReadAllLinesFromBookmark(string fileName, ref long lastPosition)
{
using (var fs = File.OpenRead(fileName))
{
fs.Position = lastPosition;
using (var sr = new StreamReader(fs))
{
string line = null;
List<string> lines = new List<string>();
while ((line = sr.ReadLine()) != null)
{
lines.Add(line);
}
lastPosition = fs.Position;
return lines.ToArray();
}
}
}
Well, you do have line numbers, in the form of the array index. Keep a note of the previously read line's array index, and you can start reading from the next array index.
Use the FileStream.Position property to get the current position in the file, and set it again later to resume from there.