search in big files performance difference - c#

I checked 2 ways to search in big files.
I tested on 500mb size file.
1st way took 9500ms and 2nd way took 11500ms.
How could it happen?
Buffering is faster than accessing the resources on each iteration.
Linq is more powerfull than foreach search.
Is it trouble with memory allocation?
1:
var __file = new System.IO.StreamReader(file);
var line = "";
while ((line = __file.ReadLine()) != null)
{
var firstOccurrence = line.Contains(contains);
}
__file.Close();
2:
var lines = File.ReadAllLines(_file);
var firstOccurrence = lines.FirstOrDefault(l => l.Contains(contains));

In your first code snippet, you don't stop looping when you find a match. Try something like this:
while ((line = __file.ReadLine()) != null)
{
var firstOccurrence = line.Contains(contains);
if (firstOccurrence)
{
break;
}
}
In your second code snippet, you read the entire file into memory, and then start looking through it line-by-line. This is different to your first code snippet, where you read the file off disk one line at a time.
The equivalent method is File.ReadLines -- this reads the file line-by-line:
var firstOccurrence = File.ReadLines(_file).FirstOrDefault(l => l.Contains(Contains));

Related

Can't find string in input file

I have a text file, which I am trying to insert a line of code into. Using my linked-lists I believe I can avoid having to take all the data out, sort it, and then make it into a new text file.
What I did was come up with the code below. I set my bools, but still it is not working. I went through debugger and what it seems to be going on is that it is going through the entire list (which is about 10,000 lines) and it is not finding anything to be true, so it does not insert my code.
Why or what is wrong with this code?
List<string> lines = new List<string>(File.ReadAllLines("Students.txt"));
using (StreamReader inFile = new StreamReader("Students.txt", true))
{
string newLastName = "'Constant";
string newRecord = "(LIST (LIST 'Constant 'Malachi 'D ) '1234567890 'mdcant#mail.usi.edu 4.000000 )";
string line;
string lastName;
bool insertionPointFound = false;
for (int i = 0; i < lines.Count && !insertionPointFound; i++)
{
line = lines[i];
if (line.StartsWith("(LIST (LIST "))
{
values = line.Split(" ".ToCharArray());
lastName = values[2];
if (newLastName.CompareTo(lastName) < 0)
{
lines.Insert(i, newRecord);
insertionPointFound = true;
}
}
}
if (!insertionPointFound)
{
lines.Add(newRecord);
}
You're just reading the file into memory and not committing it anywhere.
I'm afraid that you're going to have to load and completely re-write the entire file. Files support appending, but they don't support insertions.
you can write to a file the same way that you read from it
string[] lines;
/// instanciate and build `lines`
File.WriteAllLines("path", lines);
WriteAllLines also takes an IEnumerable, so you can past a List of string into there if you want.
one more issue: it appears as though you're reading your file twice. one with ReadAllLines and another with your StreamReader.
There are at least four possible errors.
The opening of the streamreader is not required, you have already read
all the lines. (Well not really an error, but...)
The check for StartsWith can be fooled if you lines starts with blank
space and you will miss the insertionPoint. (Adding a Trim will remove any problem here)
In the CompareTo line you check for < 0 but you should check for == 0. CompareTo returns 0 if the strings are equivalent, however.....
To check if two string are equals you should avoid using CompareTo as
explained in MSDN link above but use string.Equals
List<string> lines = new List<string>(File.ReadAllLines("Students.txt"));
string newLastName = "'Constant";
string newRecord = "(LIST (LIST 'Constant 'Malachi 'D ) '1234567890 'mdcant#mail.usi.edu 4.000000 )";
string line;
string lastName;
bool insertionPointFound = false;
for (int i = 0; i < lines.Count && !insertionPointFound; i++)
{
line = lines[i].Trim();
if (line.StartsWith("(LIST (LIST "))
{
values = line.Split(" ".ToCharArray());
lastName = values[2];
if (newLastName.Equals(lastName))
{
lines.Insert(i, newRecord);
insertionPointFound = true;
}
}
}
if (!insertionPointFound)
lines.Add(newRecord);
I don't list as an error the missing write back to the file. Hope that you have just omitted that part of the code. Otherwise it is a very simple problem.
(However I think that the way in which CompareTo is used is probably the main reason of your problem)
EDIT Looking at your comment below it seems that the answer from Sam I Am is the right one for you. Of course you need to write back the modified array of lines. All the changes are made to an in memory array of lines and nothing is written back to a file if you don't have code that writes a file. However you don't need new file
File.WriteAllLines("Students.txt", lines);

How to delete first and last line from a text file c#?

I found this code on stackoverflow to delete first and last line from a text file.
But I'm not getting how to combine this code into one so that it will delete 1st and
last line from a single file?
What I tried was using streamreader read the file and then skip 1st and last line then
streamwriter to write in new file, but couldn't get the proper structure.
To delete first line.
var lines = System.IO.File.ReadAllLines("test.txt");
System.IO.File.WriteAllLines("test.txt", lines.Skip(1).ToArray());
to delete last line.
var lines = System.IO.File.ReadAllLines("...");
System.IO.File.WriteAllLines("...", lines.Take(lines.Length - 1).ToArray());
You can chain the Skip and Take methods. Remember to subtract the appropriate number of lines in the Take method. The more you skip at the beginning, the less lines remain.
var filename = "test.txt";
var lines = System.IO.File.ReadAllLines(filename);
System.IO.File.WriteAllLines(
filename,
lines.Skip(1).Take(lines.Length - 2)
);
Whilst probably not a major issue in this case, the existing answers all rely on reading the entire contents of the file into memory first. For small files, that's probably fine, but if you're working with very large files, this could prove prohibitive.
It is reasonably trivial to create a SkipLast equivalent of the existing Skip Linq method:
public static class SkipLastExtension
{
public static IEnumerable<T> SkipLast<T>(this IEnumerable<T> source, int count)
{
var queue = new Queue<T>();
foreach (var item in source)
{
queue.Enqueue(item);
if (queue.Count > count)
{
yield return queue.Dequeue();
}
}
}
}
If we also define a method that allows us to enumerate over each line of a file without pre-buffering the whole file (per: https://stackoverflow.com/a/1271236/381588):
static IEnumerable<string> ReadFrom(string filename)
{
using (var reader = File.OpenText(filename))
{
string line;
while ((line = reader.ReadLine()) != null)
{
yield return line;
}
}
}
Then we can use the following the following one-liner to write a new file that contains all the lines from the original file, except the first and last:
File.WriteAllLines("output.txt", ReadFrom("input.txt").Skip(1).SkipLast(1));
This is undoubtedly (considerably) more code than the other answers that have already been posted here, but should work on files of essentially any size, (as well as providing a code for a potentially useful SkipLast extension method).
Here's a different approach that uses ArraySegment<string> instead:
var lines = File.ReadAllLines("test.txt");
File.WriteAllLines("test.txt", new ArraySegment<string>(lines, 1, lines.Length-2));

Most memory efficient way to merge two files

I need to merge two files while also applying a sort. It is important the I keep the task light on memory usage. I need to create a console app in c# for this.
Input File 1:
Some Header
A12345334
A00123445
A44566555
B55677
B55683
B66489
record count: 6
Input File 2:
Some Header
A00123465
B99423445
record count: 2
So, I need to make sure that the third file should have all the "A" records coming first and then the "B" records followed by the Total record count.
Output File:
Some header
A12345334
A00123445
A44566555
A00123465
B99423445
B55677
B55683
B66489
record count: 8
Record sorting within "A" and "B" is not relevant.
Since your source files appear sorted, you can do with with very low memory usage.
Just open both input files as well as a new file for writing. Then compare the next available line from each input file and write the line that comes first to your output file. Each time you write a line to the output file, get the next line from the input file it came from.
Continue until both input files are finished.
If memory is an issue the easiest way to do this is probably going to be to read the records from both files, store them in a SQLite or SQL Server Compact database, and execute a SELECT query that returns a sorted record set. Make sure you have an index on the field you want to sort on.
That way, you don't have to store the records in memory, and you don't need any sorting algorithms; the database will store the records on disk and do your sorting for you.
Quick idea, assuming the records are already sorted in the original files:
Start looping through file 2, collecting all A-records
Once you reach the first B-record, start collecting those in a separate collection.
Read all of File 1.
Write out the content of the A-records collection from file 2, then append the contents read from file 1, followed by the B-records from file 2.
Visualized:
<A-data from file 2>
<A-data, followed by B-data from file 1>
<B-data from file 2>
If you are concerned about memory this is a perfect case for insertion sort and read one line at a time from each file. If that is not an issue read the whole thing into a list and just call sort the write it out.
If you can't even keep the whole sorted list in memory then a database or memory mapped file is you best bet.
Assuming your input files are already ordered:
Open Input files 1 and 2 and create the Output file.
Read the first record from file 1. If it starts with A, write it to the output file. Continue reading from input file 1 until you reach a record that starts with B.
Read the first record from file 2. If it start with A, write it to the output file. Continue reading from input file 2 until you reach a record that starts with B.
Go back to file 1, and write the 'B' record to the output file. Continue reading from input file 1 until you reach the end of the stream.
Go back to file 2, and write the 'B' record to the output file. Continue reading from input file 2 until you reach the end of the stream.
This method will prevent you from ever having to hold more than 2 rows of data in memory at a time.
i would recommend using StreamReader and StreamWriter for this application. So you can open a file using StreamWriter, copy all lines using StreamReader for file #1, then for file #2. This operations are very fast, have integrated buffers and are very lightweight.
if the input files are already sorted by A and B, you can switch between the source readers to make the output sorted.
Since you have two sorted sequences you just need to merge the two sequences into a single sequence, in much the same way the second half of the MergeSort algorithm works.
Unfortunately, given the interface that IEnumerable provides, it ends up a bit mess and copy-pasty, but it should perform quite well and use a very small memory footprint:
public class Wrapper<T>
{
public T Value { get; set; }
}
public static IEnumerable<T> Merge<T>(IEnumerable<T> first, IEnumerable<T> second, IComparer<T> comparer = null)
{
comparer = comparer ?? Comparer<T>.Default;
using (var secondIterator = second.GetEnumerator())
{
Wrapper<T> secondItem = null; //when the wrapper is null there are no more items in the second sequence
if (secondIterator.MoveNext())
secondItem = new Wrapper<T>() { Value = secondIterator.Current };
foreach (var firstItem in first)
{
if (secondItem != null)
{
while (comparer.Compare(firstItem, secondItem.Value) > 0)
{
yield return secondItem.Value;
if (secondIterator.MoveNext())
secondItem.Value = secondIterator.Current;
else
secondItem = null;
}
}
yield return firstItem;
yield return secondItem.Value;
while (secondIterator.MoveNext())
yield return secondIterator.Current;
}
}
}
Once you have a Merge function it's pretty trivial:
File.WriteAllLines("output.txt",
Merge(File.ReadLines("File1.txt"), File.ReadLines("File2.txt")))
The File ReadLines and WriteAllLines here each utilize IEnumerable and will stream the lines accordingly.
Here's the source code for the more generic/boiler plate solution for merge sorting 2 files.
public static void Merge(string inFile1, string inFile2, string outFile)
{
string line1 = null;
string line2 = null;
using (StreamReader sr1 = new StreamReader(inFile1))
{
using (StreamReader sr2 = new StreamReader(inFile2))
{
using (StreamWriter sw = new StreamWriter(outFile))
{
line1 = sr1.ReadLine();
line2 = sr2.ReadLine();
while(line1 != null && line2 != null)
{
// your comparison function here
// ex: (line1[0] < line2[0])
if(line1 < line2)
{
sw.WriteLine(line1);
line1 = sr1.ReadLine();
}
else
{
sw.WriteLine(line2);
line2 = sr2.ReadLine();
}
}
while(line1 != null)
{
sw.WriteLine(line1);
line1 = sr1.ReadLine();
}
while(line2 != null)
{
sw.WriteLine(line2);
line2 = sr2.ReadLine();
}
}
}
}
}
public void merge_click(Object sender, EventArgs e)
{
DataTable dt = new DataTable();
dt.Clear();
dt.Columns.Add("Name");
dt.Columns.Add("designation");
dt.Columns.Add("age");
dt.Columns.Add("year");
string[] lines = File.ReadAllLines(#"C:\Users\user1\Desktop\text1.txt", Encoding.UTF8);
string[] lines1 = File.ReadAllLines(#"C:\Users\user2\Desktop\text1.txt", Encoding.UTF8);
foreach (string line in lines)
{
string[] values = line.Split(',');
DataRow dr = dt.NewRow();
dr["Name"] = values[0].ToString();
dr["designation"] = values[1].ToString();
dr["age"] = values[2].ToString();
dr["year"] = values[3].ToString();
dt.Rows.Add(dr);
}
foreach (string line in lines1)
{
string[] values = line.Split(',');
DataRow dr = dt.NewRow();
dr["Name"] = values[0].ToString();
dr["designation"] = values[1].ToString();
dr["age"] = values[2].ToString();
dr["year"] = values[3].ToString();
dt.Rows.Add(dr);
}
grdstudents.DataSource = dt;
grdstudents.DataBind();
}

Read in a file using a regular expression?

This is tangentially related to an earlier question of mine.
Essentially, the solution in that question worked great, but now I need to adapt it to work in a much larger analysis application. Simply using StreamReader.ReadToEnd() is not acceptable, since some of the files I will be reading in are very, very large. If there's been a mistake and someone forgot to clean up, they can theoretically be gigabytes big. Obviously I can't just read to the end of that.
Unfortunately, the normal read lines is also not acceptable, because some of the rows of data I am reading in contain stack traces - they obviously use /r/n in their formatting. Ideally, I would like to tell the program to read forward until it hits a match for a regex, which it then returns. Is there any functionality to do this in .net? If not, can I get some suggestions for how I'd go about writing it?
Edit: To make it a bit easier to follow my question, here's a paste of some of the important parts of the adapted code:
foreach (var fileString in logpath.Select(log => new StreamReader(log)).Select(fileStream => fileStream.ReadToEnd()))
{
const string junkPattern = #"\[(?<junk>[0-9]*)\] \((?<userid>.{0,32})\)";
const string severityPattern = #"INFO|ERROR|FATAL";
const string datePattern = "^(?=[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3})";
var records = Regex.Split(fileString, datePattern, RegexOptions.Multiline);
foreach (var record in records.Where(x => string.IsNullOrEmpty(x) == false))
......
The problem lies in the Foreach. .Select(fileStream => fileStream.ReadToEnd()) is gonna blow up memory badly, I just know it.
First off all, you should move your const definition to class declaration - the compiler will do that for you, but this should be done by yourself, just for better code readability.
As #Blam mentioned, you should use StringBuilder and StreamReader.ReadLine in pair, something like this:
foreach(var filePath in logpath)
{
var sbRecord = new StringBuilder();
using(var reader = new StreamReader(filePath))
{
do
{
var line = reader.ReadLine();
// check start of the new record lines
if (Regex.Match(line, datePattern) && sbRecord.Length > 0)
{
// your method for log record
HandleRecord(sbRecord.ToString());
sbRecord.Clear();
sbRecord.AppendLine(line);
}
// if no lines were added or datePattern didn't hit
// append info about current record
else
{
sbRecord.AppendLine(line);
}
} while (!reader.EndOfStream)
}
}
If I didn't understand something about your problem, please clarify this in comment.
Also, you can use ThreadPool for schedule the tasks for your lines, just for speed of your application.

Copy line number in text file c#

I have 2 files file A and file B how need to copy line 30 on file A and paste it over the top of line 30 in file B can I do this in C#?
Here's a very simple way, assuming file B is small enough to read into memory:
string lineFromA = File.ReadLines("fileA.txt").Skip(29).First();
string[] linesFromB = File.ReadAllLines("fileB.txt");
linesFromB[29] = lineFromA;
File.WriteAllLines("fileC.txt", linesFromB);
This assumes you're using .NET 4, with its lazy File.ReadLines method. If you're not, the simplest approach would be to read both files into memory completely, using File.ReadAllLines twice:
string[] linesFromA = File.ReadAllLines("fileA.txt");
string[] linesFromB = File.ReadAllLines("fileB.txt");
linesFromB[29] = linesFromA[29];
File.WriteAllLines("fileC.txt", linesFromB);
There are definitely more efficient approaches, but I'd go with the above unless I had any reason to need a more efficient one.
If you use a streamwriter for the writing side you get a routine that does not use a lot of memory and can also be used for larger files.
string lineFromA = File.ReadLines("fileA.txt").Skip(29).First();
using (var fileC = File.AppendText("fileC.txt"))
{
int i = 0;
foreach (var lineFromB in File.ReadLines("fileB.txt"))
{
i++;
fileC.WriteLine(i != 30 ? lineFromB : lineFromA);
}
}

Categories

Resources