I have a file.txt containing about 200,000 records.
The format of each record is 123456-99-Text. The 123456 is a unique account number, the 99 is a location code that I need (it ranges from 01 to 99), and the text is irrelevant. The account numbers are sorted in ascending order, one record per line in the file (111111, 111112, 111113, etc.).
I made a Visual Studio textbox and search button to have someone search for the account number. The account number is actually 11 digits long but only the first 6 matter. I wrote this as string actnum = textBox1.Text.Substring(0, 6);
I wrote a foreach (string x in File.ReadLines("file.txt")) with an if (x.Contains(actnum)) then string code = x.Substring(8, 2); statement.
The program works well, but because there are so many records, if someone searches for an account number that doesn't exist, or a number at the bottom of the list, the program locks up for a good 10 seconds before going to the "number not found" else statement, or takes forever to find that last record.
My Question:
Reading about binary searches, I have attempted one without much success. I cannot seem to get the array or file to act like a legitimate binary search. Is there a way to take the 6-digit actnum from textBox1, compare it to an array substring of the 6-digit account number, then grab the substring 99 code from that specific line?
A binary search would help greatly! I could take 555-555 and compare it to the top or bottom half of the record file, then keep searching until I find the line I need, grab the entire line, then substring the 99 out. The problem I have is that I can't seem to get a proper integer conversion of the file because it contains both numbers AND text, and therefore I can't properly use <, >, = signs.
Any help on this would be greatly appreciated. The program I currently have actually works but is incredibly slow at times.
As one possible solution (not necessarily the best), you can add your record IDs to a Dictionary<string, int> (or even a Dictionary<long, int> if all record IDs are numeric) where each key is the ID of one line and each value is the line index. When you need to look up a particular record, just look in the dictionary (it'll do an efficient lookup for you) and it gives you the line number. If the item is not there (non-existent ID), you won't find it in the dictionary.
At this point, if the record ID exists in the file, you have a line number - you can either load the entire file into memory (if it's not too big) or just seek to the right line and read in the line with the data.
For this to work, you have to go through the file at least once and collect all the record IDs from all lines and add them to the dictionary. You won't have to implement the binary search - the dictionary will internally perform the lookup for you.
Edit:
If you don't need all the data from a particular line, just one bit (like the location code you mentioned), you don't even need to store the line number (since you won't need to go back to the line in the file) - just store the location data as the value in the dictionary.
I personally would still store the line index because, in my experience, such projects start out small but end up collecting features and there'll be a point where you'll have to have everything from the file. If you expect this to be the case over time, just parse data from each line into a data structure and store that in the dictionary - it'll make your future life simpler. If you're very sure you'll never need more data than the one bit of information, you can just stash the data itself in the dictionary.
Here's a simple example (assuming that your record IDs can be parsed into a long):
public class LineData
{
    public int LineIndex { get; set; }
    public string LocationCode { get; set; }
    // other data from the line that you need
}

// ...
// declare your map
private Dictionary<long, LineData> _dataMap = new Dictionary<long, LineData>();
// ...
// Read file, parse lines into LineData objects and put them in dictionary
// ...
To see if a record ID exists, you just call TryGetValue():
LineData lineData;
if (_dataMap.TryGetValue(recordID, out lineData))
{
    // record ID was found
}
This approach essentially keeps the entire file in memory but all data is parsed only once (at the beginning, during building the dictionary). If this approach uses too much memory, just store the line index in the dictionary and then go back to the file if you find a record and parse the line on the fly.
You cannot really do a binary search against File.ReadLines because you need to be able to access the lines in a different order. Instead you should read the whole file into memory (File.ReadAllLines would be an option).
Assuming your file is sorted by the substring, you can create a new class that implements IComparer<string>:
public class SubstringComparer : IComparer<string>
{
    public int Compare(string x, string y)
    {
        return x.Substring(0, 6).CompareTo(y.Substring(0, 6));
    }
}
and then your binary search would look like:
int returnedValue = foundStrings.BinarySearch(searchValue, new SubstringComparer());
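A minimal sketch of how those pieces could fit together, assuming the file is sorted by the six-digit prefix and that the record layout matches the Substring offsets in your original code:

// Needs using System.Collections.Generic; and using System.IO;
List<string> foundStrings = new List<string>(File.ReadAllLines("file.txt"));

string actnum = textBox1.Text.Substring(0, 6);
int returnedValue = foundStrings.BinarySearch(actnum, new SubstringComparer());

if (returnedValue >= 0)
{
    // Adjust the offset (7 or 8) to whatever matches your actual record layout.
    string code = foundStrings[returnedValue].Substring(8, 2);
    label1.Text = "Found location code " + code;
}
else
{
    label1.Text = "Number not found";
}

BinarySearch returns a non-negative index on a hit and the bitwise complement of the insertion point on a miss, which is why the sign check is enough here.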
Assuming the file doesn't change often, then you can simply load the entire file into memory using a structure that handles the searching in faster time. If the file can change then you will need to decide on a mechanism for reloading the file, be it restarting the program or a more complex process.
It looks like you are looking for exact matches (searching for 123456 yields only one record which is labelled 123456). If that is the case then you can use a Dictionary. Note that to use a Dictionary you need to define key and value types. It looks like in your case they would both be string.
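A hedged sketch of that idea, with the six-digit account prefix as the key and the location code as the value (names and offsets are illustrative):

// Build once at startup. Needs using System.Collections.Generic; and using System.IO;
var locationByAccount = new Dictionary<string, string>();
foreach (string line in File.ReadLines("file.txt"))
{
    string key = line.Substring(0, 6);          // six-digit account number
    string location = line.Substring(8, 2);     // adjust the offset to your record layout
    locationByAccount[key] = location;          // last entry wins if a key repeats
}

// Per search:
string actnum = textBox1.Text.Substring(0, 6);
string code;
if (locationByAccount.TryGetValue(actnum, out code))
    label1.Text = "Found location code " + code;
else
    label1.Text = "Number not found";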
While I did not find a way to do a better type of search, I did manage to learn about embedded resources which considerably sped up the program. Scanning the entire file takes a fraction of a second now, instead of 5-10 seconds. Posting the following code:
// Needs using System.IO; and using System.Reflection;
string searchfor = textBox1.Text;
Assembly assm = Assembly.GetExecutingAssembly();
using (Stream datastream = assm.GetManifestResourceStream("WindowsFormsApplication2.Resources.file1.txt"))
using (StreamReader reader = new StreamReader(datastream))
{
    string line;
    label1.Text = "Not found";
    while ((line = reader.ReadLine()) != null)
    {
        if (line.StartsWith(searchfor))
        {
            label1.Text = "Found";
            break;
        }
    }
}
I'm currently trying to process a number of data feeds that I have no control over, where I am using Regular Expressions in C# to extract information.
The originator of the data feed is extracting basic row data from their database (like a product name, price, etc), and then formatting that data within rows of English text. For each row, some of the text is repeated static text and some is the dynamically generated text from the database.
e.g.
Panasonic TV with FREE Blu-Ray Player
Sony TV with FREE DVD Player + Box Office DVD
Kenwood Hi-Fi Unit with $20 Amazon MP3 Voucher
So the format in this instance is: PRODUCT with FREEGIFT.
PRODUCT and FREEGIFT are dynamic parts of each row, and the "with" text is static. Each feed has about 2000 rows.
Creating a Regular Expression to extract the dynamic parts is trivial.
The problem is that the marketing bods in control of the data feed keep on changing the structure of the static text, usually once a fortnight, so this week I might have:
Brand new Panasonic TV and a FREE Blu-Ray Player if you order today
Brand new Sony TV and a FREE DVD Player + Box Office DVD if you order today
Brand new Kenwood Hi-Fi unit and a $20 Amazon MP3 Voucher if you order today
And next week it will probably be something different, so I have to keep modifying my Regular Expressions...
How would you handle this?
Is there an algorithm to determine static and variable text within repeating rows of strings? If so, what would be the best way to use the output of such an algorithm to programmatically create a dynamic Regular Expression?
Thanks for any help or advice.
This code isn't perfect, it certainly isn't efficient, and it's very likely to be too late to help you, but it does work. If given a set of strings, it will return the common content above a certain length.
However, as others have mentioned, an algorithm can only give you an approximation, as you could hit a bad batch where all products have the same initial word, and then the code would accidentally identify that content as static. It may also produce mismatches when dynamic content shares values with static content, but as the size of samples you feed into it grows, the chance of error will shrink.
I'd recommend running this on a subset of your data (20000 rows would be a bad idea!) with some sort of extra sanity checking (max # of static elements etc)
Final caveat: it may do a perfect job, but even if it does, how do you know which item is the PRODUCT and which one is the FREEGIFT?
The algorithm
1. If all strings in the set start with the same character, add that character to the "current match" set, then remove the leading character from all strings.
2. If not, remove the first character from all strings whose first x (minimum match length) characters aren't contained in all the other strings.
3. As soon as a mismatch is reached (case 2), yield the current match if it meets the length requirement.
4. Continue until all strings are exhausted.
The implementation
// Requires using System.Linq; for All().
private static IEnumerable<string> FindCommonContent(string[] strings, int minimumMatchLength)
{
    string sharedContent = "";
    while (strings.All(x => x.Length > 0))
    {
        var item1FirstCharacter = strings[0][0];
        if (strings.All(x => x[0] == item1FirstCharacter))
        {
            sharedContent += item1FirstCharacter;
            for (int index = 0; index < strings.Length; index++)
                strings[index] = strings[index].Substring(1);
            continue;
        }
        if (sharedContent.Length >= minimumMatchLength)
            yield return sharedContent;
        sharedContent = "";
        // If the first minMatch characters of a string aren't in all the other strings, consume the first character of that string
        for (int index = 0; index < strings.Length; index++)
        {
            string testBlock = strings[index].Substring(0, Math.Min(minimumMatchLength, strings[index].Length));
            if (!strings.All(x => x.Contains(testBlock)))
                strings[index] = strings[index].Substring(1);
        }
    }
    if (sharedContent.Length >= minimumMatchLength)
        yield return sharedContent;
}
Output
Set 1 (from your example):
FindCommonContent(strings, 4);
=> "with "
Set 2 (from your example):
FindCommonContent(strings, 4);
=> "Brand new ", "and a ", "if you order today"
Building the regex
This should be as simple as:
"^(.*)" + string.Join("(.*)", FindCommonContent(strings, 4)) + "(.*)$";
=> "^(.*)Brand new (.*)and a (.*)if you order today(.*)$"
Although you could modify the algorithm to return information about where the matches are (between or outside the static content), this will be fine, as you know some will match zero-length strings anyway.
I think it would be possible with an algorithm, but the time it would take you to code it, versus simply updating the Regular Expression, might not be worth it.
You could, however, make your change process faster. If, instead of keeping your regex string inside your application, you put it in a text file somewhere, you wouldn't have to recompile and redeploy everything every time there's a change; you could simply edit the text file.
Depending on your project size and implementation, this could save you a generous amount of time.
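For example, a minimal sketch of that approach; the file name pattern.txt and the two capture groups are assumptions, not something from your feed:

// Needs using System.IO; and using System.Text.RegularExpressions;
// pattern.txt might contain: ^Brand new (.*) and a (.*) if you order today$
string pattern = File.ReadAllText("pattern.txt").Trim();
Regex rowRegex = new Regex(pattern, RegexOptions.Compiled);

Match m = rowRegex.Match("Brand new Panasonic TV and a FREE Blu-Ray Player if you order today");
if (m.Success)
{
    string product = m.Groups[1].Value;    // "Panasonic TV"
    string freeGift = m.Groups[2].Value;   // "FREE Blu-Ray Player"
}

When the marketing text changes, only pattern.txt needs to be edited.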
I need to write a code in C# that will select a list of file names from a data table and delete every file in a folder that is not in this list.
One possibility would be to have both ordered by name, and then loop through my table results, and for each result, loop through my files and delete them until I find a file that matches the current result or is alphabetically bigger, and then move to the next result without resetting the current file index.
I haven't tried to actually implement this, but it seems to me that this would be O(n) since each list would be looped through just once (ignoring the sorting of both lists). The only thing I'm not sure about is whether I can be 100% sure both the file system and the database engine will sort exactly the same way (will they both consider "_" smaller than "-" and stuff like that). If not, the algorithm above just wouldn't work at all. (By the way, this is a Jet Engine database.)
But since this is probably not such an uncommon problem, you guys might already know a better solution. I tried searching the web but couldn't find anything. Perhaps a more effective solution would be to put each list into a HashSet and find their difference.
Get the folder content into folderFiles (IEnumerable<string>)
Get the files you want to keep in filesToKeep (IEnumerable<string>)
Get a list of "not in list" files.
Delete these files.
Code sample:
IEnumerable<FileInfo> folderFiles = new List<FileInfo>(); // Fill me.
IEnumerable<string> filesToKeep = new List<string>();     // Fill me.

foreach (string fileToDelete in folderFiles.Select(fi => fi.FullName).Except(filesToKeep))
{
    File.Delete(fileToDelete);
}
Here is my suggestion for you. Assuming filesInDatabase contains a list of files which are in the database and pathOfDirectory contains the path of the directory where the files to compare are contained.
foreach (var fileToDelete in Directory.EnumerateFiles(pathOfDirectory).Where(item => !filesInDatabase.Contains(item)))
{
    File.Delete(fileToDelete);
}
EDIT:
This requires using System.Linq; because it uses LINQ.
I think hashing is the way to go, but you don't really need two HashSets. Only one HashSet is needed to store the standardized file names from the datatable; the other container can be any collection data type.
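A rough sketch of that, assuming the data table has a FileName column and that file names are the standardized form on both sides (dataTable, the column name, and pathOfDirectory are placeholders):

// Needs using System.Collections.Generic;, using System.Data; and using System.IO;
var keep = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
foreach (DataRow row in dataTable.Rows)                 // dataTable: your query results
    keep.Add(Path.GetFileName((string)row["FileName"]));

foreach (string path in Directory.EnumerateFiles(pathOfDirectory))
{
    if (!keep.Contains(Path.GetFileName(path)))
        File.Delete(path);
}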
First off, .NET allows you to define cultures that can be used in sorting, but I'm not all that familiar with the mechanism, so I'll let Google give you pointers on the subject.
Second, to avoid all the culture mess, you can use a different algorithm with an idea similar to radix sort (only without the sort) - time complexity is O(n * length_longest_file_name). File name lengths are limited (as far as I know, almost no file system will allow a file name longer than 256 characters), so I'm assuming that n is dramatically larger than the file name lengths; if n is smaller than the max file name length, just use an O(n^2) method and avoid the work (iterating lists this small is near-instant anyway).
Note: This method does not require sorting.
The idea is to create an array of symbols that can be used as file name chars (about 60-70 chars, if this is a case-sensitive search), and another flag array with a flag for each char in the first array.
Now you loop over the character positions of the file names in the list from the DB (from 1 to length_longest_file_name).
In each iteration (i) you go over the i-th char of each file name in the DB list. For every char you see, you set its relevant flag to true.
When all flags are set, you go over the second list and delete every file for which the i-th char of its name is not flagged.
Implementation might be complex, and the overhead of the two arrays might make it slower when n is small, but you can optimize this to make it better (for instance, by not iterating over files whose names are shorter than the current i, removing them from both lists).
Hope this helps
I have another idea that might be faster.
var filesToDelete = new List<string>(Directory.GetFiles(directoryPath));

foreach (var databaseFile in databaseFileList)
{
    filesToDelete.Remove(databaseFile);
}

foreach (var fileToDelete in filesToDelete)
{
    File.Delete(fileToDelete);
}
Explanation: First, get all the files contained in the directory. Then remove from that list every file that is in the database. Finally, delete all the files remaining in the filesToDelete list.
var fileList = Directory.GetFiles("./", "split*.dat");
int fileCount = fileList.Length;
int i = 0;

foreach (string path in fileList)
{
    string[] contents = File.ReadAllLines(path); // OutOfMemoryException
    Array.Sort(contents);
    string newpath = path.Replace("split", "sorted");
    File.WriteAllLines(newpath, contents);
    File.Delete(path);
    contents = null;
    GC.Collect();
    SortChunksProgressChanged(this, (double)i / fileCount);
    i++;
}
For a file that consists of ~20-30 big lines (every line ~20 MB) I get an OutOfMemoryException when I call the ReadAllLines method. Why is this exception raised? And how do I fix it?
P.S. I use Mono on MacOS
You should always be very careful about performing operations with potentially unbounded results. In your case, reading a file: as you mention, the file size and/or line length is unbounded.
The answer lies in reading 'enough' of a line to sort then skipping characters until the next line and reading the next 'enough'. You probably want to aim to create a line index lookup such that when you reach an ambiguous line sorting order you can go back to get more data from the line (Seek to file position). When you go back you only need to read the next sortable chunk to disambiguate the conflicting lines.
You may need to think about the file encoding, don't go straight to bytes unless you know it is one byte per char.
The built-in sort is not as fast as you'd like.
Side Note:
If you call GC.* you've probably done it wrong.
Setting contents = null does not help you.
If you are using a foreach and maintaining the index, then you may be better off with a for (int i...) loop for readability.
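To make the line-index idea above concrete, here is a hedged sketch that records the byte offset of every line start in a single pass; it assumes the line terminator ends with '\n' and a byte-oriented encoding such as ASCII or UTF-8, which is exactly the encoding caveat mentioned above:

// Needs using System.Collections.Generic; and using System.IO;
static List<long> BuildLineIndex(string path)
{
    var offsets = new List<long> { 0L };        // the first line starts at offset 0
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
    {
        long position = 0;
        int b;
        while ((b = fs.ReadByte()) != -1)       // FileStream buffers internally
        {
            position++;
            if (b == '\n')                      // next line starts right after the '\n'
                offsets.Add(position);
        }
    }
    return offsets;
}

With such an index you can Seek back to the start of any line and read just enough of it to break a tie between two otherwise-equal prefixes.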
Okay, let me give you a hint to help you with your homework. Loading the complete file into memory will - as you know - not work, because it is given as a precondition of the assignment. You need to find a way to lazily load the data from disk as you go and throw away as much data as possible as soon as possible. Because single lines could be too big, you will have to do this one char at a time.
Try creating a class that represents an abstraction over a line, for instance by wrapping the starting index and ending index of that line. When you let this class implement IComparable<T>, it allows you to sort that line against other lines. Again, the trick is to be able to read characters from the file one at a time. You will need to work with Streams (File.Open) directly.
When you do this, you will be able to write your application code like this:
List<FileLine> lines = GetLines("fileToSort.dat");
lines.Sort();

foreach (var line in lines)
{
    line.AppendToFile("sortedFile.dat");
}
Your task will be to implement GetLines(string path) and create the FileLine class.
Note that I assume that the actual number of lines will be small enough that the List<FileLine> will fit into memory (which means an approximate maximum of 40,000,000 lines). If the number of lines can be higher, you would need an even more flexible approach, but since you are talking about 20 to 30 lines, this should not be a problem.
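To make the hint a little more tangible without doing the whole assignment, here is a skeletal, hedged sketch of what FileLine could look like. The member names are made up, and CompareTo below reads each full line into a buffer for brevity (only two lines are ever in memory at once), whereas a strict char-at-a-time version would compare in small chunks; GetLines, which scans the file once to record start offsets and lengths, is still left to you:

// Skeleton only - needs using System; using System.IO; and using System.Text;
public class FileLine : IComparable<FileLine>
{
    private readonly string _path;
    public long Start { get; private set; }     // byte offset of the first char of the line
    public int Length { get; private set; }     // number of bytes in the line

    public FileLine(string path, long start, int length)
    {
        _path = path;
        Start = start;
        Length = length;
    }

    public int CompareTo(FileLine other)
    {
        // Simplification: loads the two lines being compared, nothing more.
        return string.CompareOrdinal(ReadText(), other.ReadText());
    }

    public void AppendToFile(string target)
    {
        File.AppendAllText(target, ReadText() + Environment.NewLine);
    }

    private string ReadText()
    {
        using (var fs = File.OpenRead(_path))
        {
            fs.Seek(Start, SeekOrigin.Begin);
            var buffer = new byte[Length];
            int read = 0;
            while (read < Length)
                read += fs.Read(buffer, read, Length - read);
            return Encoding.UTF8.GetString(buffer);
        }
    }
}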
Basically your approach is bull. You are violating a constraint of the homework you were given, and this constraint has been put there to make you think more.
As you said:
I must implement external sort and show my teacher that it works for files bigger than my RAM
OK, so how do you think you will ever read the whole file in? ;) This constraint is there on purpose. ReadAllLines does NOT implement an incremental external sort. As a result, it blows up.
I need help on an algorithm. I have randomly generated numbers with 6 digits. Like;
123654
109431
There are approximately 1 million of them, saved in a file line by line. I have to filter them according to the rule I'll try to describe below.
Take a number and compare it to all the others digit by digit. If another number is identical except that one digit is bigger by exactly one, then delete that number. Let me show it using numbers.
Our number is: 123456
Increase the first digit with 1, so the number becomes: 223456. Delete all the 223456s from the file.
Increase the second digit by 1, the number becomes: 133456. Delete all 133456s from the file, and so on...
I can do it just as I describe but I need it to be "FAST".
So can anyone help me on this?
Thanks.
First of all, since it is around 1 million numbers, you had better perform the algorithm in RAM, not on disk; that is, first load the contents into an array, then modify the array, then write the results back into the file.
I would suggest the following algorithm - a straightforward one. Precalculate all the target numbers, in this case 223456, 133456, 124456, 123556, 123466, 123457. Now pass over the array, and if the number is NOT any of these, write it to another array. Alternatively, if it is one of these numbers, delete it (recommended if your data structure has O(1) removal).
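A hedged sketch of that straightforward approach for a single target number; digits that are already 9 are simply skipped here, because the question does not say how they should wrap:

// Needs using System.Collections.Generic; and using System.Linq;
static List<string> FilterAgainst(string target, IEnumerable<string> numbers)
{
    // Precalculate the "one digit bigger" variants, e.g. 223456, 133456, 124456, ...
    var variants = new HashSet<string>();
    for (int i = 0; i < target.Length; i++)
    {
        if (target[i] == '9')
            continue;                       // wrap-around behaviour is unspecified
        char[] digits = target.ToCharArray();
        digits[i] = (char)(digits[i] + 1);
        variants.Add(new string(digits));
    }

    // Keep every number that is not one of the variants.
    return numbers.Where(n => !variants.Contains(n)).ToList();
}

// e.g. var kept = FilterAgainst("123456", File.ReadLines("numbers.txt"));  (needs using System.IO;)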
This algorithm will keep a lot of numbers around in memory, but it will process the file one number at a time so you don't actually need to read it all in at once. You only need to supply an IEnumerable<int> for it to operate on.
public static IEnumerable<int> FilterInts(IEnumerable<int> ints)
{
    var removed = new HashSet<int>();
    foreach (var i in ints)
    {
        var iStr = i.ToString("000000").ToCharArray();
        for (int j = 0; j < iStr.Length; j++)
        {
            var c = iStr[j];
            if (c == '9')
                iStr[j] = '0';
            else
                iStr[j] = (char)(c + 1);
            removed.Add(int.Parse(new string(iStr)));
            iStr[j] = c;
        }
        if (!removed.Contains(i))
            yield return i;
    }
}
You can use this method to create an IEnumerable<int> from the file:
public static IEnumerable<int> ReadIntsFrom(string path)
{
    using (var reader = File.OpenText(path))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
            yield return int.Parse(line);
    }
}
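Putting the two together might look like this; the file names are placeholders:

// Needs using System.IO; and using System.Linq;
File.WriteAllLines("filtered.txt",
    FilterInts(ReadIntsFrom("numbers.txt"))
        .Select(i => i.ToString("000000"))
        .ToArray());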
Take all the numbers from the file into an ArrayList, then:
take the number of threads to be the number of digits,
increment the first digit of the number in the first thread, the second digit in the second thread, and so on, and then compare the result with the rest of the numbers.
It would be fast, as the work would be done in parallel...
All the suggestions (so far) require six comparisons per input line, which is not necessary. The numbers are coming in as strings, so use string comparisons.
Start with @Armen Tsirunyan's idea:
Precalculate all the target numbers, in this case 223456, 133456, 124456, 123556, 123466, 123457.
But instead of single comparisons, make that into a string:
string arg = "223456 133456 124456 123556 123466 123457";
Then read through the input (either from file or in memory). Pseudocode:
foreach (string s in theBigListOfNumbers)
    if (arg.indexOf(s) == -1)
        print s;
This is just one comparison per input line, no dictionaries, maps, iterators, etc.
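In C# that pseudocode might look roughly like this, writing the surviving numbers to the console for illustration (the file name is a placeholder):

// Needs using System; and using System.IO;
string arg = "223456 133456 124456 123556 123466 123457";
foreach (string s in File.ReadLines("numbers.txt"))     // theBigListOfNumbers
{
    if (arg.IndexOf(s) == -1)
        Console.WriteLine(s);
}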
Edited to add:
In x86 instruction set processors (not just the Intel brand), substring searches like this are very fast. To search for a character within a string, for example, is just one machine instruction.
I'll have to ask others to weigh in on alternate architectures.
For starters, I would just read all the numbers into an array.
When you are finally done, rewrite the file.
It seems like the rule you're describing is: for the target number abcdef you want to find all numbers that contain a+1, b+1, c+1, d+1, e+1, or f+1 in the appropriate place. You can do this in O(n) by looping over the lines in the file and comparing each of the six digits to the corresponding digit in the target number; if no digits match, write the number to an output file.
This sounds like a potential case for a multidimensional array, and possibly also unsafe C# code so that you can use pointer math to iterate through such a large quantity of numbers.
I would have to dig into it further, but I would also probably use a Dictionary for non-linear lookups, if you are comparing numbers that aren't in sequence.
How about this. You process the numbers one by one. Numbers will be stored in the hash tables NumbersOK and NumbersNotOK.
1. Take one number.
2. If it's not in NumbersNotOK, place it in the NumbersOK hash.
3. Add its single-digit-increment variants to the NumbersNotOK hash.
4. Remove any NumbersOK members that match one of those variants.
5. Repeat from 1 until the end of the file.
6. Save NumbersOK to the file.
This way you'll pass the list just once. The hash table is made for exactly this kind of purpose and it'll be very fast (no expensive comparison methods).
This algorithm is not complete, as it doesn't handle repeated numbers, but that can be handled with some tweaking...
Read all your numbers from the file and store them in a map where the number is the key and a boolean is the value signifying that the value hasn't been deleted. (True means exists, false means deleted).
Then iterate through your keys. For each key, set the map to false for the values you would be deleting from the list.
Iterate through your list one more time and get all the keys where the value is true. This is the list of remaining numbers.
public List<int> FilterNumbers(string fileName)
{
    Dictionary<int, bool> numbers = new Dictionary<int, bool>();
    using (StreamReader sr = File.OpenText(fileName))
    {
        string s;
        while ((s = sr.ReadLine()) != null)
        {
            int number = Int32.Parse(s);
            numbers.Add(number, true);
        }
    }

    // Iterate over a copy of the keys so the values can be modified while looping.
    foreach (int number in new List<int>(numbers.Keys))
    {
        if (numbers[number])
        {
            if (numbers.ContainsKey(100000 + number))
                numbers[100000 + number] = false;
            if (numbers.ContainsKey(10000 + number))
                numbers[10000 + number] = false;
            if (numbers.ContainsKey(1000 + number))
                numbers[1000 + number] = false;
            if (numbers.ContainsKey(100 + number))
                numbers[100 + number] = false;
            if (numbers.ContainsKey(10 + number))
                numbers[10 + number] = false;
            if (numbers.ContainsKey(1 + number))
                numbers[1 + number] = false;
        }
    }

    List<int> validNumbers = new List<int>();
    foreach (int number in numbers.Keys)
    {
        if (numbers[number])
            validNumbers.Add(number);
    }
    return validNumbers;
}
This may need to be tested as I don't have a C# compiler on this computer and I'm a bit rusty. The algorithm will take a bit of memory but it runs in linear time.
** EDIT **
This runs into problems whenever one of the digits is a 9. I'll update the code later.
Still sounds like a homework question... the fastest sort on a million numbers will be n log(n), that is 1,000,000 x log(1,000,000) = 6 x 1,000,000 (using log base 10), which is the same as comparing 6 numbers to each of the million numbers. So a direct comparison will be faster than sort-and-remove, because after sorting you still have to compare to remove. Unless, of course, my calculations have entirely missed the target.
Something else comes to mind. When you pick up the number, read it as hex and not base 10; then maybe some bitwise operators may help somehow.
Still thinking about what can be done using this. Will update if it works.
EDIT: currently thinking along the lines of Gray code. 123456 (our original number) and 223456 or 133456 will be off by only one digit, and a Gray code converter will catch it fast. It's late at night here, so if someone else finds this useful and can give a solution...
I am trying to compare two large datasets from a SQL query. Right now the SQL query is done externally and the results from each dataset are saved into their own CSV file. My little C# console application loads up the two text/CSV files and compares them for differences, then saves the differences to a text file.
It's a very simple application that just loads all the data from the first file into an ArrayList and does a .Compare() on the ArrayList as each line is read from the second CSV file, then saves the records that don't match.
The application works but I would like to improve the performance. I figure I can greatly improve performance if I can take advantage of the fact that both files are sorted, but I don't know a data type in C# that keeps order and would allow me to select a specific position. There's a basic array, but I don't know how many items are going to be in each list. I could have over a million records. Is there a data type available that I should be looking at?
If the data in both of your CSV files is already sorted and both files have the same number of records, you could skip the data structure entirely and do the analysis in place.
StreamReader one = new StreamReader(@"C:\file1.csv");
StreamReader two = new StreamReader(@"C:\file2.csv");
String lineOne;
String lineTwo;

StreamWriter differences = new StreamWriter("Output.csv");
while (!one.EndOfStream)
{
    lineOne = one.ReadLine();
    lineTwo = two.ReadLine();

    // do your comparison.
    bool areDifferent = true;

    if (areDifferent)
        differences.WriteLine(lineOne + lineTwo);
}

one.Close();
two.Close();
differences.Close();
System.Collections.Specialized.StringCollection allows you to add a range of values and, using the .IndexOf(string) method, lets you retrieve the index of a given item.
That being said, you could likely just load up a couple of byte[] from a filestream and do a byte comparison... don't even worry about loading that stuff into a formal data structure like StringCollection or string[]; if all you're doing is checking for differences, and you want speed, I would reckon byte differences are where it's at.
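A quick sketch of that byte-level idea; note it only tells you whether the files differ at all, not where:

// Needs using System.IO; and using System.Linq;
byte[] first = File.ReadAllBytes(@"C:\file1.csv");
byte[] second = File.ReadAllBytes(@"C:\file2.csv");
bool identical = first.Length == second.Length && first.SequenceEqual(second);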
This is an adaptation of David Sokol's code to work with a varying number of lines, outputting the lines that are in one file but not the other:
StreamReader one = new StreamReader(@"C:\file1.csv");
StreamReader two = new StreamReader(@"C:\file2.csv");
String lineOne;
String lineTwo;

StreamWriter differences = new StreamWriter("Output.csv");
lineOne = one.ReadLine();
lineTwo = two.ReadLine();

while (lineOne != null || lineTwo != null)
{
    if (lineOne == lineTwo)
    {
        // lines match, read next line from each and continue
        lineOne = one.ReadLine();
        lineTwo = two.ReadLine();
        continue;
    }
    if (lineTwo == null || (lineOne != null && string.CompareOrdinal(lineOne, lineTwo) < 0))
    {
        // line exists only in file 1
        differences.WriteLine(lineOne);
        lineOne = one.ReadLine();
    }
    else
    {
        // line exists only in file 2
        differences.WriteLine(lineTwo);
        lineTwo = two.ReadLine();
    }
}

one.Close();
two.Close();
differences.Close();
Standard caveat about code written off the top of my head applies -- you may need to special-case running out of lines in one while the other still has lines, but I think this basic approach should do what you're looking for.
Well, there are several approaches that would work. You could write your own data structure that did this. Or you can try and use SortedList. You can also return the DataSets in code, and then use .Select() on the table. Granted, you would have to do this on both tables.
You can easily use a SortedList to do fast lookups. If the data you are loading is already sorted, insertions into the SortedList should not be slow.
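For example, a sketch that treats the first comma-separated column as the record key (the column layout is an assumption):

// Needs using System; using System.Collections.Generic; and using System.IO;
var records = new SortedList<string, string>();
foreach (string line in File.ReadAllLines(@"C:\file1.csv"))
    records[line.Split(',')[0]] = line;

foreach (string line in File.ReadAllLines(@"C:\file2.csv"))
{
    string key = line.Split(',')[0];
    string match;
    if (!records.TryGetValue(key, out match) || match != line)
        Console.WriteLine(line);            // missing from file 1 or different
}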
If you are looking simply to see if all lines in FileA are included in FileB you could read it in and just compare streams inside a loop.
File 1
Entry1
Entry2
Entry3
File 2
Entry1
Entry3
You could loop through with two counters and find omissions, going line by line through each file and see if you get what you need.
Maybe I misunderstand, but the ArrayList will maintain its elements in the same order in which you added them. This means you can compare the two ArrayLists in a single pass - just increment the two scanning indices according to the comparison results.
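A sketch of that single-pass idea with two scanning indices, using List<string> instead of ArrayList for type safety and assuming both lists are sorted the same way (the file names are placeholders):

// Needs using System.Collections.Generic; and using System.IO;
List<string> listA = new List<string>(File.ReadAllLines(@"C:\file1.csv"));
List<string> listB = new List<string>(File.ReadAllLines(@"C:\file2.csv"));
var differences = new List<string>();

int i = 0, j = 0;
while (i < listA.Count && j < listB.Count)
{
    int cmp = string.CompareOrdinal(listA[i], listB[j]);
    if (cmp == 0) { i++; j++; }                          // match: advance both
    else if (cmp < 0) differences.Add(listA[i++]);       // only in A
    else differences.Add(listB[j++]);                    // only in B
}
while (i < listA.Count) differences.Add(listA[i++]);     // leftovers are differences
while (j < listB.Count) differences.Add(listB[j++]);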
One question I have: have you considered "outsourcing" your comparison? There are plenty of good diff tools that you could just call out to. I'd be surprised if there wasn't one that let you specify two files and get only the differences. Just a thought.
I think the reason everyone has so many different answers is that you haven't quite specified your problem well enough to be answered. First off, it depends what kind of differences you want to track. Do you want the differences to be output like in WinDiff, where the first file is the "original" and the second file is the "modified", so you can list changes as INSERT, UPDATE or DELETE? Do you have a primary key that will allow you to match up two lines as different versions of the same record (when fields other than the primary key are different)? Or is this some sort of reconciliation where you just want your difference output to say something like "RECORD IN FILE 1 AND NOT FILE 2"?
I think the answers to these questions will help everyone give you a suitable answer to your problem.
If you have two files that are each a million lines as mentioned in your post, you might be using up a lot of memory. Some of the performance problem might be that you are swapping to disk. If you are simply comparing line 1 of file A to line 1 of file B, line 2 of file A to line 2 of file B, etc., I would recommend a technique that does not store so much in memory. You could read from two file streams, as a previous commenter posted, and write out your results "in real time" as you find them; this would not explicitly store anything in memory. You could also dump chunks of each file into memory, say one thousand lines at a time, into something like a List. This could be fine-tuned to meet your needs.
To resolve question #1 I'd recommend looking into creating a hash of each line. That way you can compare hashes quickly and easily using a dictionary.
To resolve question #2, one quick and dirty solution would be to use an IDictionary<string, string>, with the itemId as the key and the rest of the line as the value. You can then quickly find whether an itemId exists and compare the lines. This of course assumes .NET 2.0+.
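A rough sketch of that dictionary idea, assuming the itemId is the first comma-separated field and every line has at least one comma (file names are placeholders):

// Needs using System; using System.Collections.Generic; and using System.IO;
var firstFile = new Dictionary<string, string>();
foreach (string line in File.ReadAllLines(@"C:\file1.csv"))
{
    int comma = line.IndexOf(',');
    firstFile[line.Substring(0, comma)] = line.Substring(comma + 1);
}

foreach (string line in File.ReadAllLines(@"C:\file2.csv"))
{
    int comma = line.IndexOf(',');
    string rest;
    if (!firstFile.TryGetValue(line.Substring(0, comma), out rest))
        Console.WriteLine("Only in file 2: " + line);
    else if (rest != line.Substring(comma + 1))
        Console.WriteLine("Changed: " + line);
}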