I love C#, I love the framework, and I also love to learn as much as possible. Today I began reading articles about LINQ in C#, and I couldn't find anything good for a beginner who has never worked with SQL.
I found this article very helpful and I understood small parts of it, but I'd like to get more examples.
After reading it a couple of times, I tried to use LINQ in a function of mine, but I failed.
private void Filter(string filename)
{
    using (TextWriter writer = File.CreateText(Application.StartupPath + "\\temp\\test.txt"))
    {
        using (TextReader reader = File.OpenText(filename))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string[] items = line.Split('\t');
                int myInteger = int.Parse(items[1]);
                if (myInteger == 24809) writer.WriteLine(line);
            }
        }
    }
}
This is what I did, and it did not work; the result was always false.
private void Filter(string filename)
{
    using (TextWriter writer = File.CreateText(Application.StartupPath + "\\temp\\test.txt"))
    {
        using (TextReader reader = File.OpenText(filename))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string[] items = line.Split('\t');
                var Linqi = from item in items
                            where int.Parse(items[1]) == 24809
                            select true;
                if (Linqi == true) writer.WriteLine(line);
            }
        }
    }
}
I'm asking for two things:
What would the function look like using as much LINQ as possible?
A website/book/article about LINQ, but please note I'm still a beginner in SQL/LINQ.
Thank you in advance!
Well one thing that would make your sample more "LINQy" is an IEnumerable<string> for reading lines from a file. Here's a somewhat simplified version of my LineReader class from MiscUtil:
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;

public sealed class LineReader : IEnumerable<string>
{
    readonly Func<TextReader> dataSource;

    public LineReader(string filename)
        : this(() => File.OpenText(filename))
    {
    }

    public LineReader(Func<TextReader> dataSource)
    {
        this.dataSource = dataSource;
    }

    public IEnumerator<string> GetEnumerator()
    {
        using (TextReader reader = dataSource())
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
Now you can use that:
var query = from line in new LineReader(filename)
            let items = line.Split('\t')
            let myInteger = int.Parse(items[1])
            where myInteger == 24809
            select line;

using (TextWriter writer = File.CreateText(Application.StartupPath
                                           + "\\temp\\test.txt"))
{
    foreach (string line in query)
    {
        writer.WriteLine(line);
    }
}
Note that it would probably be more efficient to not have the let clauses:
var query = from line in new LineReader(filename)
where int.Parse(line.Split('\t')[1]) == 24809
select line;
at which point you could reasonably do it all in "dot notation":
var query = new LineReader(filename)
.Where(line => int.Parse(line.Split('\t')[1]) == 24809);
However, I far prefer the readability of the original query :)
101 LINQ Samples is certainly a good collection of examples. Also LINQPad might be a good way to play around with LINQ.
For a website as a starting point, you can try Hooked on LINQ
Edit:
Original site appears to be dead now (domain is for sale).
Here's the internet archive of the last version: https://web.archive.org/web/20140823041217/http://www.hookedonlinq.com/
If you're after a book, I found LINQ in Action from Manning Publications a good place to start.
MSDN LINQ Examples: http://msdn.microsoft.com/en-us/vcsharp/aa336746.aspx
I got a lot out of the following sites when I started:
http://msdn.microsoft.com/en-us/library/bb425822.aspx
http://weblogs.asp.net/scottgu/archive/2007/05/19/using-linq-to-sql-part-1.aspx
To answer the first question, there frankly isn't too much reason to use LINQ the way you suggest in the above function except as an exercise. In fact, it probably just makes the function harder to read.
LINQ is more useful for operating on a collection than on a single element, and I would use it that way instead. So here's my attempt at using as much LINQ as possible in the function (no claims about efficiency, and I don't suggest reading the whole file into memory like this):
private void Filter(string filename)
{
    using (TextWriter writer = File.CreateText(Application.StartupPath + "\\temp\\test.txt"))
    {
        using (TextReader reader = File.OpenText(filename))
        {
            List<string> lines = new List<string>();
            string line;
            while ((line = reader.ReadLine()) != null)
                lines.Add(line);

            var query = from l in lines
                        let splitLine = l.Split('\t')
                        where int.Parse(splitLine.Skip(1).First()) == 24809
                        select l;

            foreach (var l in query)
                writer.WriteLine(l);
        }
    }
}
First, I would introduce this method:
private IEnumerable<string> ReadLines(StreamReader reader)
{
    while (!reader.EndOfStream)
    {
        yield return reader.ReadLine();
    }
}
Then, I would refactor the main method to use it. I put both using statements above the same block, and also added a range check to ensure items[1] doesn't fail:
private void Filter(string fileName)
{
    using (var writer = File.CreateText(Application.StartupPath + "\\temp\\test.txt"))
    using (var reader = File.OpenText(fileName))
    {
        var myIntegers =
            from line in ReadLines(reader)
            let items = line.Split('\t')
            where items.Length > 1
            let myInteger = Int32.Parse(items[1])
            where myInteger == 24809
            select myInteger;

        foreach (var myInteger in myIntegers)
        {
            writer.WriteLine(myInteger);
        }
    }
}
I found this article extremely helpful for understanding LINQ, which is built on many of the new constructs introduced in .NET 3.0 and 3.5:
I'll warn you it's a long read, but if you really want to understand what LINQ is and does, I believe it is essential.
http://blogs.msdn.com/ericwhite/pages/FP-Tutorial.aspx
Happy reading
If I were to rewrite your filter function using LINQ where possible, it'd look like this:
private void Filter(string filename)
{
    using (TextWriter writer = File.CreateText(Application.StartupPath + "\\temp\\test.txt"))
    {
        var lines = File.ReadAllLines(filename);

        var matches = from line in lines
                      let items = line.Split('\t')
                      let myInteger = int.Parse(items[1])
                      where myInteger == 24809
                      select line;

        foreach (var match in matches)
        {
            writer.WriteLine(match);
        }
    }
}
As for Linq books, I would recommend:
http://www.diesel-ebooks.com/mas_assets/full/0321564189.jpg
Both are excellent books that drill into Linq in detail.
To add yet another variation to the as-much-linq-as-possible topic, here's my take:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace LinqDemo
{
    class Program
    {
        static void Main()
        {
            var baseDir = AppDomain.CurrentDomain.BaseDirectory;
            File.WriteAllLines(
                Path.Combine(baseDir, "out.txt"),
                File.ReadAllLines(Path.Combine(baseDir, "in.txt"))
                    .Select(line => new KeyValuePair<string, string[]>(line, line.Split(','))) // split each line into columns, also carry the original line forward
                    .Where(info => info.Value.Length > 1) // filter out lines that don't have a 2nd column
                    .Select(info => new KeyValuePair<string, int>(info.Key, int.Parse(info.Value[1]))) // convert 2nd column to int, still carrying the original line forward
                    .Where(info => info.Value == 24809) // apply the filtering criteria
                    .Select(info => info.Key) // restore original lines
                    .ToArray());
        }
    }
}
Note that I changed your tab-delimited-columns to comma-delimited columns (easier to author in my editor that converts tabs to spaces ;-) ). When this program is run against an input file:
A1,2
B,24809,C
C
E
G,24809
The output will be:
B,24809,C
G,24809
You could improve memory requirements of this solution by replacing "File.ReadAllLines" and "File.WriteAllLines" with Jon Skeet's LineReader (and LineWriter in a similar vein, taking IEnumerable and writing each returned item to the output file as a new line). This would transform the solution above from "get all lines into memory as an array, filter them down, create another array in memory for result and write this result to output file" to "read lines from input file one by one, and if that line meets our criteria, write it to output file immediately" (pipeline approach).
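A minimal sketch of that pipeline idea, assuming a small hypothetical LineWriter helper alongside Jon Skeet's LineReader (the helper's name and shape are made up here; MiscUtil's actual API may differ):
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class LineWriter
{
    // Consumes the sequence lazily and writes each item as one line,
    // so only the current line needs to be in memory at any time.
    public static void WriteLines(string path, IEnumerable<string> lines)
    {
        using (TextWriter writer = File.CreateText(path))
        {
            foreach (string line in lines)
            {
                writer.WriteLine(line);
            }
        }
    }
}

// Usage (e.g. inside Main): a fully streaming version of the filter above.
var baseDir = AppDomain.CurrentDomain.BaseDirectory;
var matching = new LineReader(Path.Combine(baseDir, "in.txt"))
    .Where(line =>
    {
        var items = line.Split(',');
        return items.Length > 1 && int.Parse(items[1]) == 24809;
    });
LineWriter.WriteLines(Path.Combine(baseDir, "out.txt"), matching);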
You cannot just check whether Linqi is true... Linqi is an IEnumerable<bool> (in this case), so you have to check it like Linqi.First() == true.
Here is a small example:
string[] items = { "12121", "2222", "24809", "23445", "24809" };
var Linqi = from item in items
            where Convert.ToInt32(item) == 24809
            select true;
if (Linqi.First() == true) Console.WriteLine("Got a true");
You could also iterate over Linqi, and in my example there are 2 items in the collection.
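If the sequence could be empty, First() would throw, so a slightly safer check (plain standard LINQ, nothing specific to this answer) is Any():
// Any() returns false for an empty sequence instead of throwing,
// which First() would do if no item matched.
if (Linqi.Any()) Console.WriteLine("Got a true");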
Related
I run into examples like this all the time. In this case I want to populate a StringBuilder with a new line for each FileInfo object in a previously loaded variable called files, which of course contains a bunch of FileInfo objects. For the first object I want to add FIRST after the text; for everything else I want to add NOTFIRST. To do this with a for loop, I have to set up a counter, do an if statement, and increment the counter.
I've learned just enough LINQ that it's on the tip of my fingers, but I know there has to be an elegant LINQ solution.
var mysb = new StringBuilder();
var count = 0;
string extra;
foreach (System.IO.FileInfo fi in files)
{
    var newLine = fi.Name;
    if (count == 0)
        extra = "FIRST";
    else
        extra = "NOTFIRST";
    count++;
    mysb.AppendLine(string.Format("({0} {1})", newLine, extra));
}
Personally, I would forego the LINQ and stick with what you have, just simpler:
var mysb = new StringBuilder();
foreach (FileInfo fi in files)
{
    string extra = mysb.Length == 0 ? "FIRST" : "NOTFIRST";
    mysb.Append(fi.Name);
    mysb.AppendLine(extra);
}
(It's not clear to me why you are treating the file name as a valid format string...of course, if it really is a valid format string, you can change my two calls to Append() and AppendLine() back to the single call with the string.Format())
You may use the overload of Select that gives you the current index: http://msdn.microsoft.com/pl-pl/library/bb534869(v=vs.110).aspx
I also don't like mutating state when using linq so I would use String.Join instead.
mysb.AppendLine(String.Join(Environment.NewLine,
files.Select((fi, i) => String.Format(fi.Name, i == 0 ? "FIRST" : "NOTFIRST"))));
As often happens when I ask questions here, I went with a hybrid of the suggestions:
foreach (var fi in files)
{
    var extra = (fi == files.First() ? "FIRST" : "NOTFIRST");
    sb.AppendLine(fi.Name + extra);
}
I was unwilling to check the length of the stringbuilder, because I have other scenarios where extra pretty much requires using a linq function.
I suppose I could have just as easily done the following (for my stated example):
sb.AppendLine(files.First().Name + " FIRST");
sb.AppendLine(String.Join(Environment.NewLine,
files.Skip(1).Select( fi => fi.Name + " NOTFIRST")));
But to be honest, it's half as readable.
I'm not suggesting that this is the best way to do this, but it was fun to write:
var fi = new [] { new { Name= "A"},
new { Name= "B"},
new { Name= "C"}};
String.Join(Environment.NewLine,
fi.Take(1).Select (f => Tuple.Create(f.Name,"FIRST"))
.Concat(fi.Skip(1).Select (f => Tuple.Create(f.Name,"NONFIRST")))
.Select(t=> String.Format("({0} {1})", t.Item1, t.Item2)))
.Dump();
I am trying to minimize this piece of code
public static void UnfavSong(Song song)
{
    List<string> favorites = FileManagement.GetFileContent_List(FAVS_FILENAME);

    foreach (string s in favorites)
    {
        Song deser = SongSerializer.Deserialize(s);
        if (deser.ID == song.ID)
        {
            favorites.Remove(s);
            break;
        }
    }

    FileManagement.SaveFile(FAVS_FILENAME, favorites);
}
But I feel like the whole foreach part can be made much shorter.
Is there a way in C# to cut this down to the core?
Using LINQ:
favorites.RemoveAll(s => SongSerializer.Deserialize(s).ID == song.ID);
By the way, you normally can't modify a List while iterating over it; your version only avoids an exception because of the break.
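Dropped into the original method, that becomes (a sketch using the FileManagement and SongSerializer helpers from the question):
public static void UnfavSong(Song song)
{
    List<string> favorites = FileManagement.GetFileContent_List(FAVS_FILENAME);

    // Removes every serialized entry whose ID matches, in one pass,
    // so there is no loop to break out of.
    favorites.RemoveAll(s => SongSerializer.Deserialize(s).ID == song.ID);

    FileManagement.SaveFile(FAVS_FILENAME, favorites);
}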
You can use LINQ's Where() to filter them:
List<string> result = favorites.Where(x => SongSerializer.Deserialize(x).ID != song.ID).ToList();
This will give you all elements except the one whose ID matches song.ID.
I am using the CSVHelper library, which can extract a list of objects from a CSV file with just three lines of code:
var streamReader = // Create a reader to your CSV file.
var csvReader = new CsvReader( streamReader );
List<MyCustomType> myData = csvReader.GetRecords<MyCustomType>().ToList();
However, my file has nonsense lines and I need to skip the first ten lines in the file. I thought it would be nice to use LINQ to ensure 'clean' data, and then pass that data to CsvReader, like so:
public TextReader GetTextReader(IEnumerable<string> lines)
{
    // Some magic here. Don't want to return null;
    return TextReader.Null;
}

public IEnumerable<T> ExtractObjectList<T>(string filePath) where T : class
{
    var csvLines = File.ReadLines(filePath)
        .Skip(10)
        .Where(l => !l.StartsWith(",,,"));

    var textReader = GetTextReader(csvLines);
    var csvReader = new CsvReader(textReader);
    csvReader.Configuration.ClassMapping<EventMap, Event>();
    return csvReader.GetRecords<T>();
}
But I'm really stuck on how to push a 'static' collection of strings through something stream-like, such as a TextReader.
My alternative here is to process the CSV file line by line through CsvReader and examine each line before extracting an object, but I find that somewhat clumsy.
The StringReader class provides a TextReader that wraps a string. You could simply join the lines and wrap them in a StringReader:
public TextReader GetTextReader(IEnumerable<string> lines)
{
return new StringReader(string.Join("\r\n", lines));
}
An easier way would be to use CsvHelper to skip the lines.
// Skip rows.
csvReader.Configuration.IgnoreBlankLines = false;
csvReader.Configuration.IgnoreQuotes = true;
for (var i = 0; i < 10; i++)
{
    csvReader.Read();
}
csvReader.Configuration.IgnoreBlankLines = true;
csvReader.Configuration.IgnoreQuotes = false;
// Carry on as normal.
var myData = csvReader.GetRecords<MyCustomType>();
IgnoreBlankLines is turned off (set to false) in case any of those first 10 rows are blank, and IgnoreQuotes is turned on (set to true) so you don't get any BadDataExceptions if those rows contain a ". You can set them back afterwards for normal functionality again.
If you don't know the amount of rows and need to test based on row data, you can just test csvReader.Context.Record and see if you need to stop. In this case, you would probably need to manually call csvReader.ReadHeader() before calling csvReader.GetRecords<MyCustomType>().
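For example, something along these lines (a sketch that assumes a CsvHelper version where csvReader.Read(), csvReader.ReadHeader() and csvReader.Context.Record work as described above; the "Id" marker column is made up for illustration):
// Read raw rows until the current record looks like the real header row.
while (csvReader.Read())
{
    var record = csvReader.Context.Record;        // raw fields of the current row
    if (record.Length > 0 && record[0] == "Id")   // hypothetical first header column
    {
        break;
    }
}
csvReader.ReadHeader();
var myData = csvReader.GetRecords<MyCustomType>().ToList();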
How do I read a CSV file using C#?
One option that doesn't require third-party components is the Microsoft.VisualBasic.FileIO.TextFieldParser class (http://msdn.microsoft.com/en-us/library/microsoft.visualbasic.fileio.textfieldparser.aspx). It provides all the functions for parsing CSV; you just need to reference the Microsoft.VisualBasic assembly.
var parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(file);
parser.TextFieldType = Microsoft.VisualBasic.FileIO.FieldType.Delimited;
parser.SetDelimiters(new string[] { ";" });

while (!parser.EndOfData)
{
    string[] row = parser.ReadFields();
    /* do something */
}
You can use the Microsoft.VisualBasic.FileIO.TextFieldParser class in C#:
using System;
using System.Data;
using Microsoft.VisualBasic.FileIO;

static void Main()
{
    string csv_file_path = @"C:\Users\Administrator\Desktop\test.csv";
    DataTable csvData = GetDataTableFromCSVFile(csv_file_path);
    Console.WriteLine("Rows count:" + csvData.Rows.Count);
    Console.ReadLine();
}

private static DataTable GetDataTableFromCSVFile(string csv_file_path)
{
    DataTable csvData = new DataTable();
    try
    {
        using (TextFieldParser csvReader = new TextFieldParser(csv_file_path))
        {
            csvReader.SetDelimiters(new string[] { "," });
            csvReader.HasFieldsEnclosedInQuotes = true;

            string[] colFields = csvReader.ReadFields();
            foreach (string column in colFields)
            {
                DataColumn datacolumn = new DataColumn(column);
                datacolumn.AllowDBNull = true;
                csvData.Columns.Add(datacolumn);
            }

            while (!csvReader.EndOfData)
            {
                string[] fieldData = csvReader.ReadFields();
                // Making empty value as null
                for (int i = 0; i < fieldData.Length; i++)
                {
                    if (fieldData[i] == "")
                    {
                        fieldData[i] = null;
                    }
                }
                csvData.Rows.Add(fieldData);
            }
        }
    }
    catch (Exception ex)
    {
        // Exceptions are swallowed here; consider logging ex or rethrowing.
    }
    return csvData;
}
You could try CsvHelper, which is a project I work on. Its goal is to make reading and writing CSV files as easy as possible, while being very fast.
Here are a few ways you can read from a CSV file.
// By type
var records = csv.GetRecords<MyClass>();
var records = csv.GetRecords( typeof( MyClass ) );
// Dynamic
var records = csv.GetRecords<dynamic>();
// Using anonymous type for the class definition
var anonymousTypeDefinition = new
{
    Id = default( int ),
    Name = string.Empty,
    MyClass = new MyClass()
};
var records = csv.GetRecords( anonymousTypeDefinition );
I usually use a simplistic approach like this one:
var path = Server.MapPath("~/App_Data/Data.csv");
var csvRows = System.IO.File.ReadAllLines(path, Encoding.Default).ToList();

foreach (var row in csvRows.Skip(1))
{
    var columns = row.Split(';');
    var field1 = columns[0];
    var field2 = columns[1];
    var field3 = columns[2];
}
I just used this library in my application. http://www.codeproject.com/KB/database/CsvReader.aspx. Everything went smoothly using this library, so I'm recommending it. It is free under the MIT License, so just include the notice with your source files.
I didn't display the CSV in a browser, but the author has some samples for Repeaters or DataGrids. I did run one of his test projects to test a Sort operation I have added and it looked pretty good.
You can try Cinchoo ETL - an open source lib for reading and writing CSV files.
There are a couple of ways you can read CSV files with it. Given a file emp.csv like:
Id, Name
1, Tom
2, Mark
This is how you can use this library to read it
using (var reader = new ChoCSVReader("emp.csv").WithFirstLineHeader())
{
    foreach (dynamic item in reader)
    {
        Console.WriteLine(item.Id);
        Console.WriteLine(item.Name);
    }
}
If you have a POCO class defined to match the CSV file, like below
public class Employee
{
    public int Id { get; set; }
    public string Name { get; set; }
}
you can parse the same file using this POCO class as below
using (var reader = new ChoCSVReader<Employee>("emp.csv").WithFirstLineHeader())
{
    foreach (var item in reader)
    {
        Console.WriteLine(item.Id);
        Console.WriteLine(item.Name);
    }
}
Please check out articles at CodeProject on how to use it.
Disclaimer: I'm the author of this library
I recommend Angara.Table, about save/load: http://predictionmachines.github.io/Angara.Table/saveload.html.
It infers column types, can save CSV files, and is much faster than TextFieldParser. It follows RFC 4180 for the CSV format and supports multiline strings, NaNs, and escaped strings containing the delimiter character.
The library is under MIT license. Source code is https://github.com/Microsoft/Angara.Table.
Though its API is focused on F#, it can be used from any .NET language, just not as succinctly as in F#.
Example:
using Angara.Data;
using System.Collections.Immutable;
...
var table = Table.Load("data.csv");

// Print schema:
foreach (Column c in table)
{
    string colType;
    if (c.Rows.IsRealColumn) colType = "double";
    else if (c.Rows.IsStringColumn) colType = "string";
    else if (c.Rows.IsDateColumn) colType = "date";
    else if (c.Rows.IsIntColumn) colType = "int";
    else colType = "bool";
    Console.WriteLine("{0} of type {1}", c.Name, colType);
}

// Get column data:
ImmutableArray<double> a = table["a"].Rows.AsReal;
ImmutableArray<string> b = table["b"].Rows.AsString;

Table.Save(table, "data2.csv");
You might be interested in the Linq2Csv library at CodeProject. One thing you would need to check is whether it reads the data lazily (only when needed), so that you won't need a lot of memory when working with bigger files.
As for displaying the data in the browser, you could do many things to accomplish it; if you were more specific about your requirements, the answer could be more specific, but here are things you could do:
1. Use the HttpListener class to write a simple web server (you can find many samples on the net for hosting a mini HTTP server); a minimal sketch follows after this list.
2. Use Asp.Net or Asp.Net Mvc, create a page, and host it using IIS.
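For option 1, something along these lines could work (a sketch, assuming a comma-delimited file at a hard-coded data.csv path and that http://localhost:8080/ is available):
using System;
using System.IO;
using System.Linq;
using System.Net;
using System.Text;

class CsvHttpServer
{
    static void Main()
    {
        var listener = new HttpListener();
        listener.Prefixes.Add("http://localhost:8080/");
        listener.Start();

        while (true)
        {
            HttpListenerContext context = listener.GetContext();

            // Render the CSV as a crude HTML table, one <tr> per line.
            var rows = File.ReadLines("data.csv")
                .Select(line => "<tr><td>" +
                                string.Join("</td><td>", line.Split(',')) +
                                "</td></tr>");
            string html = "<html><body><table border=\"1\">" +
                          string.Concat(rows) +
                          "</table></body></html>";

            byte[] buffer = Encoding.UTF8.GetBytes(html);
            context.Response.ContentType = "text/html";
            context.Response.OutputStream.Write(buffer, 0, buffer.Length);
            context.Response.Close();
        }
    }
}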
Seems like there are quite a few projects on CodeProject or CodePlex for CSV Parsing.
Here is another CSV Parser on CodePlex
http://commonlibrarynet.codeplex.com/
This library has components for CSV parsing, INI file parsing, Command-Line parsing as well. It's working well for me so far. Only thing is it doesn't have a CSV Writer.
This is just for parsing the CSV. For displaying it in a web page, it is simply a matter of taking the list and rendering it however you want.
Note: This code example does not handle the situation where the input string line contains newlines.
public List<string> SplitCSV(string line)
{
    if (string.IsNullOrEmpty(line))
        throw new ArgumentException();

    List<string> result = new List<string>();
    int index = 0;
    int start = 0;
    bool inQuote = false;
    StringBuilder val = new StringBuilder();

    // parse line
    foreach (char c in line)
    {
        switch (c)
        {
            case '"':
                inQuote = !inQuote;
                break;
            case ',':
                if (!inQuote)
                {
                    result.Add(line.Substring(start, index - start)
                                   .Replace("\"", ""));
                    start = index + 1;
                }
                break;
        }
        index++;
    }

    if (start < index)
    {
        result.Add(line.Substring(start, index - start).Replace("\"", ""));
    }

    return result;
}
I have been maintaining an open source project called FlatFiles for several years now. It's available for .NET Core and .NET 4.5.1.
Unlike most of the alternatives, it allows you to define a schema (similar to the way EF code-first works) with an extreme level of precision, so you aren't fighting conversion issues all the time. You can map directly to your data classes, and there is also support for interfacing with older ADO.NET classes.
Performance-wise, it's been tuned to be one of the fastest parsers for .NET, with a plethora of options for quirky format differences. There's also support for fixed-length files, if you need it.
You can use this library: Sky.Data.Csv
https://www.nuget.org/packages/Sky.Data.Csv/
It's a really fast CSV reader library and it's really easy to use:
using Sky.Data.Csv;

var readerSettings = new CsvReaderSettings { Encoding = Encoding.UTF8 };
using (var reader = CsvReader.Create("path-to-file", readerSettings))
{
    foreach (var row in reader)
    {
        // do something with the data
    }
}
It also supports reading typed objects with the CsvReader<T> class, which has the same interface.
I have a basic C# console application that reads a text file (CSV format) line by line and puts the data into a Hashtable. The first CSV item in the line is the key (id num) and the rest of the line is the value. However, I've discovered that my import file has a few duplicate keys that it shouldn't have. When I try to import the file, the application errors out because you can't have duplicate keys in a Hashtable. I want my program to be able to handle this error, though. When I run into a duplicate key, I would like to put that key into an ArrayList and continue importing the rest of the data into the hashtable. How can I do this in C#?
Here is my code:
private static Hashtable importFile(Hashtable myHashtable, String myFileName)
{
    StreamReader sr = new StreamReader(myFileName);
    CSVReader csvReader = new CSVReader();
    ArrayList tempArray = new ArrayList();
    int count = 0;

    while (!sr.EndOfStream)
    {
        String temp = sr.ReadLine();
        if (temp.StartsWith(" "))
        {
            ServMissing.Add(temp);
        }
        else
        {
            tempArray = csvReader.CSVParser(temp);
            Boolean first = true;
            String key = "";
            String value = "";
            foreach (String x in tempArray)
            {
                if (first)
                {
                    key = x;
                    first = false;
                }
                else
                {
                    value += x + ",";
                }
            }
            myHashtable.Add(key, value);
        }
        count++;
    }

    Console.WriteLine("Import Count: " + count);
    return myHashtable;
}
if (myHashtable.ContainsKey(key))
    duplicates.Add(key);
else
    myHashtable.Add(key, value);
A better solution is to call ContainsKey to check whether the key exists before adding it to the hash table. Throwing an exception on this kind of error is a performance hit and doesn't improve the program flow.
ContainsKey has a constant O(1) overhead for every item, while catching an Exception incurs a performance hit on JUST the duplicate items.
In most situations I'd say check for the key, but in this case it's better to catch the exception.
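A sketch of that exception-based variant, using the key, value and duplicates variables already present in the question and the answer above:
try
{
    myHashtable.Add(key, value);
}
catch (ArgumentException)
{
    // Hashtable.Add throws ArgumentException when the key already exists,
    // so only the duplicate rows pay the cost of an exception.
    duplicates.Add(key);
}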
Here is a solution which avoids multiple hits in the secondary list with a small overhead to all insertions:
Dictionary<T, List<K>> dict = new Dictionary<T, List<K>>();

// Insert item
if (!dict.ContainsKey(key))
    dict[key] = new List<K>();
dict[key].Add(value);
You can wrap the dictionary in a type that hides this or put it in a method or even extension method on dictionary.
If you have more than 4 (for example) CSV values, it might be worth building the value with a StringBuilder as well, since repeated string concatenation is slow.
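A sketch of the extension-method idea mentioned above (the name AddToList is made up for illustration):
public static class DictionaryExtensions
{
    // Adds value to the list stored under key, creating the list on first use.
    public static void AddToList<TKey, TValue>(
        this Dictionary<TKey, List<TValue>> dict, TKey key, TValue value)
    {
        List<TValue> list;
        if (!dict.TryGetValue(key, out list))
        {
            list = new List<TValue>();
            dict[key] = list;
        }
        list.Add(value);
    }
}

// Usage:
// dict.AddToList(key, value);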
Hmm, 1.7 Million lines? I hesitate to offer this for that kind of load.
Here's one way to do this using LINQ.
CSVReader csvReader = new CSVReader();
List<string> source = new List<string>();
using (StreamReader sr = new StreamReader(myFileName))
{
    while (!sr.EndOfStream)
    {
        source.Add(sr.ReadLine());
    }
}

List<string> ServMissing =
    source
    .Where(s => s.StartsWith(" "))
    .ToList();
//--------------------------------------------------
List<IGrouping<string, string>> groupedSource =
(
    from s in source
    where !s.StartsWith(" ")
    let parsed = csvReader.CSVParser(s)
    where parsed.Any()
    let first = parsed.First()
    let rest = String.Join(",", parsed.Skip(1).ToArray())
    select new { first, rest }
)
.GroupBy(x => x.first, x => x.rest) //GroupBy(keySelector, elementSelector)
.ToList();
//--------------------------------------------------
List<string> myExtras = new List<string>();
foreach (IGrouping<string, string> g in groupedSource)
{
    myHashtable.Add(g.Key, g.First());
    if (g.Skip(1).Any())
    {
        myExtras.Add(g.Key);
    }
}
Thank you all.
I ended up using the ContainsKey() method. It takes maybe 30 secs longer, which is fine for my purposes. I'm loading about 1.7 million lines and the program takes about 7 mins total to load up two files, compare them, and write out a few files. It only takes about 2 secs to do the compare and write out the files.