Efficient Methods of Comparing Text Files Simultaneously - c#

I did check to see whether any existing questions matched mine, but I didn't see any; if I missed one, my mistake.
I have two text files to compare against each other. One is a temporary log file that is overwritten from time to time, and the other is a permanent log, which collects and appends the contents of the temp log into one file (it picks up the lines added to the temp log since the last check and appends them to the end of the complete log). After a point, however, the complete log may become quite large and therefore inefficient to compare against, so I have been thinking about different ways to approach this.
My first idea is to "buffer" the temp log's lines (since it will normally be the smaller of the two) into a list and then simply loop through the archive log, doing something like:
List<string> bufferedLines = new List<string>();
// ... bufferedLines is filled with the temp log's lines first ...
using (StreamReader archiveStream = new StreamReader(ArchivePath))
{
    string archiveLine;
    while ((archiveLine = archiveStream.ReadLine()) != null)
    {
        if (bufferedLines.Contains(archiveLine))
        {
            // this line is already in the archive
        }
    }
}
Now, there are a couple of ways I could proceed from here. I could create yet another list to store the inconsistencies, close the read stream (I'm not sure you can both read and write at the same time; if you can, that might make things easier for my options), then open a write stream in append mode and write the list to the file. Alternatively, cutting out the buffering of the inconsistencies, I could open a write stream while the files are being compared and write the unmatched lines on the spot.
The other method I could think of was limited by my knowledge of whether it could even be done, which was, rather than buffer either file, to compare the streams side by side as they are read and append the lines on the fly. Something like:
using (StreamReader archiveStream = new StreamReader(ArchivePath))
using (StreamReader tempLogStream = new StreamReader(tempPath))
{
    // note: StreamReader has no ReadAllLines method, so something like
    // File.ReadAllLines(ArchivePath) would be needed here instead
    if (!File.ReadAllLines(ArchivePath).Contains(tempLogStream.ReadLine()))
    {
        // write the line to the file
    }
}
As I said, I'm not sure whether that would work or whether it would be more efficient than the first method, so I figured I'd ask and see if anyone had insight into how this might properly be implemented, and whether it is the most efficient way or there is a better method out there.

Effectively what you want here is all of the items from one set that aren't in another set. This is set subtraction, or in LINQ terms, Except. If your data sets were sufficiently small you could simply do this:
var lines = File.ReadLines(TempPath)
    .Except(File.ReadLines(ArchivePath))
    .ToList(); // can't write to the file while reading from it
File.AppendAllLines(ArchivePath, lines);
Of course, this code requires bringing all of the lines in the temp file into memory, because that's just how Except is implemented. It creates a HashSet of all of the items so that it can efficiently find matches from the other sequence.
Presumably the number of lines that need to be added is pretty small, so the fact that the lines we find all need to be stored in memory isn't a problem. If there could potentially be a lot of them, you'd want to write them out to another file besides the first one (possibly concatenating the two files together when done, if needed).
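As a rough sketch of that streaming variant (assuming the archive's lines fit into a HashSet in memory; NewLinesPath is just an illustrative name for the second file):
var archived = new HashSet<string>(File.ReadLines(ArchivePath));

using (var writer = new StreamWriter(NewLinesPath, append: false))
{
    foreach (var line in File.ReadLines(TempPath))
    {
        if (!archived.Contains(line))
            writer.WriteLine(line); // new lines go to the separate file
    }
}

// if needed, append the collected lines to the archive afterwards
File.AppendAllLines(ArchivePath, File.ReadLines(NewLinesPath));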

Related

Removing redundant data from large file

I have a log file which has single strings on each line. I am trying to remove duplicate data from the file and save it out as a new file. I had first thought of reading the data into a HashSet and then saving its contents out; however, I get an "OutOfMemory" exception when attempting to do this (on the line that adds the string to the HashSet).
There are around 32,000,000 lines in the files. It's not practical to re-read the entire file for each comparison.
Any ideas? My other thought was to output the entire contents into a SQLite database and select DISTINCT values, but I'm not sure that'd work either with that many values.
Thanks for any input!
The first thing you need to think about: is high memory consumption a problem?
If your application will always run on a server with a lot of RAM available, or in any other case where you know you'll have enough memory, you can do a lot of things you can't do if your application will run in a low-memory environment, or in an unknown environment. If memory isn't the problem, then make sure your application is running as a 64-bit application (on a 64-bit OS, of course), otherwise you'll be limited to 2GB of memory (4GB if you use the LARGEADDRESSAWARE flag). I'd guess that in this case this is your problem, and all you've got to do is change it, and it'll work great (assuming you have enough memory).
If memory is a problem and you need to avoid using too much of it, you can, as you suggested, add all the data to a database (I'm more familiar with databases like SQL Server, but I guess SQLite will do), make sure you have the right index on the column, and then select the distinct values.
Another option is to read the file as a stream, line by line; for each line calculate a hash, save the line into the other file, and keep the hash in memory. If the hash already exists, move on to the next line (and, if you wish, increment a counter of the number of lines removed). In that case, you'll keep less data in memory (only a hash per non-duplicated line).
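As a rough sketch of that idea (the file paths are just placeholders, and SHA256 is used here purely as an example of a hash with negligible collision risk):
var seen = new HashSet<string>();

using (var sha = SHA256.Create())               // System.Security.Cryptography
using (var writer = new StreamWriter(@"C:\Temp\BigFile.deduplicated.log"))
{
    foreach (var line in File.ReadLines(@"C:\Temp\BigFile.log"))
    {
        // keep only the hash in memory, not the line itself
        string key = Convert.ToBase64String(sha.ComputeHash(Encoding.UTF8.GetBytes(line)));
        if (seen.Add(key))                      // Add returns false for duplicates
            writer.WriteLine(line);
    }
}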
Best of luck.
Have you tried using an array to initialize the HashSet? I assume that the doubling algorithm of HashSet is the reason for the OutOfMemoryException.
var uniqueLines = new HashSet<string>(File.ReadAllLines(@"C:\Temp\BigFile.log"));
Edit:
I am testing the result of the .Add() method to see if it returns false to count the number of items that are redundant. I'd like to keep this feature if possible.
Then you should try to initialize the HashSet with the correct (maximum) number of lines in the file:
int lineCount = File.ReadLines(path).Count();
// note: on newer frameworks (.NET Framework 4.7.2+ / .NET Core) you can
// presize directly with new HashSet<string>(lineCount)
List<string> fooList = new List<String>(lineCount);
var uniqueLines = new HashSet<string>(fooList);
fooList.Clear();
foreach (var line in File.ReadLines(path))
    uniqueLines.Add(line);
I took a similar approach to Tim, using a HashSet, but I added manual line counting and comparison.
I read the setup log from my Windows 8 install, which was 58MB in size at 312,248 lines, and ran it in LINQPad in 0.993 seconds.
var temp = new List<string>(10000);
var uniqueHash = new HashSet<int>();
int lineCount = 0;
int uniqueLineCount = 0;
using (var fs = new FileStream(@"C:\windows\panther\setupact.log", FileMode.Open, FileAccess.Read))
using (var sr = new StreamReader(fs, true))
{
    while (!sr.EndOfStream)
    {
        lineCount++;
        var line = sr.ReadLine();
        // note: GetHashCode is not collision-free, so in rare cases two
        // different lines could be treated as duplicates
        var key = line.GetHashCode();
        if (!uniqueHash.Contains(key))
        {
            uniqueHash.Add(key);
            temp.Add(line);
            uniqueLineCount++;
            if (temp.Count > 10000)
            {
                File.AppendAllLines(@"c:\temp\output.txt", temp);
                temp.Clear();
            }
        }
    }
}
// flush whatever is still buffered in temp
if (temp.Count > 0)
    File.AppendAllLines(@"c:\temp\output.txt", temp);
Console.WriteLine("Total Lines: " + lineCount.ToString());
Console.WriteLine("Lines Removed: " + (lineCount - uniqueLineCount).ToString());

appending and reading text file

Environment: any .NET Framework version is welcome.
I have a log file that gets written to 24/7.
I am trying to create an application that will read the log file and process the data.
What's the best way to read the log file efficiently? I imagine monitoring the file with something like FileSystemWatcher. But how do I make sure I don't read the same data once it's been processed by my application? Or say the application aborts for some unknown reason, how would it pick up where it left off last?
There's usually a header and footer around the payload that's in the log file. Maybe an id field in the content as well. Not sure yet though about the id field being there.
I also imagined maybe saving the count of lines read somewhere, to use that as a bookmark.
For obvious reasons, reading the whole content of the file, as well as removing lines from the log file (after loading them into your application), is out of the question.
What I can think of as a partial solution is having a small database (probably something much smaller than a full-blown MySQL/MS SQL/PostgreSQL instance) and populating a table with what has been read from the log file. I am pretty sure that even if there is a power cut and the machine is then booted again, most relational databases should be able to restore their state with ease. This solution requires some data that could be used to identify the row from the log file (for example: the exact time of the action logged, the machine on which the action took place, etc.).
Well, you will have to figure out your magic for your particular case yourself. If you are going to use a well-known text encoding it may be pretty simple, though. Look toward System.IO.StreamReader and its ReadLine() and DiscardBufferedData() methods and BaseStream property. You should be able to remember your last position in the file and rewind to that position later and start reading again, given that you are sure the file is only ever appended to. There are other things to consider, though, and there is no single universal answer to this.
Just as a naive example (you may still need to adjust a lot to make it work):
static void Main(string[] args)
{
    string filePath = @"c:\log.txt";
    using (var stream = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        using (var streamReader = new StreamReader(stream, Encoding.Unicode))
        {
            long pos = 0;
            if (File.Exists(@"c:\log.txt.lastposition"))
            {
                string strPos = File.ReadAllText(@"c:\log.txt.lastposition");
                pos = Convert.ToInt64(strPos);
            }
            streamReader.BaseStream.Seek(pos, SeekOrigin.Begin); // rewind to last saved position
            streamReader.DiscardBufferedData(); // clear the reader's buffer
            for (;;)
            {
                string line = streamReader.ReadLine();
                if (line == null) break;
                ProcessLine(line); // ProcessLine is your own handler
            }
            // pretty sure when everything is read the position is at the end of the file
            File.WriteAllText(@"c:\log.txt.lastposition", streamReader.BaseStream.Position.ToString());
        }
    }
}
I think you will find the File.ReadLines(filename) function in conjunction with LINQ very handy for something like this. ReadAllLines() will load the entire text file into memory as a string[] array, but ReadLines allows you to begin enumerating the lines immediately as it traverses through the file. This not only saves you time but keeps the memory usage very low, as it processes each line one at a time. Using statements are important because, if this program is interrupted, they will close the file streams, flushing the writer and saving unwritten content to the file. Then, when it starts up, it will skip all the lines that have already been read.
int readCount = File.ReadLines("readLogs.txt").Count();
using (FileStream readLogs = new FileStream("readLogs.txt", FileMode.Append))
using (StreamWriter writer = new StreamWriter(readLogs))
{
    IEnumerable<string> lines = File.ReadLines("bigLogFile.txt").Skip(readCount);
    foreach (string line in lines)
    {
        // do something with line, or batch them if you need more than one
        writer.WriteLine(line);
    }
}
As MaciekTalaska mentioned, I would strongly recommend using a database if this is something written to 24/7 and will get quite large. File systems are simply not equipped to handle such volume and you will spend a lot of time trying to invent solutions where a database could do it in a breeze.
Is there a reason why it logs to a file? Files are great because they are simple to use and, being the lowest common denominator, there is relatively little that can go wrong. However, files are limited. As you say, there's no guarantee a write to the file will be complete when you read the file. Multiple applications writing to the log can interfere with each other. There is no easy sorting or filtering mechanism. Log files can grow very big very quickly and there's no easy way to move old events (say those more than 24 hours old) into separate files for backup and retention.
Instead, I would consider writing the logs to a database. The table structure can be very simple, but you get the advantage of transactions (so you can extract or back up with ease) and can search, sort and filter using an almost universally understood syntax. If you are worried about load spikes, use a message queue, like http://msdn.microsoft.com/en-us/library/ms190495.aspx for SQL Server.
To make the transition easier, consider using a logging framework like log4net. It abstracts much of this away from your code.
Another alternative is to use a system like syslog or, if you have multiple servers and a large volume of logs, flume. By moving the log files away from the source computer, you can store them or inspect them on a different machine far more effectively. However, these are probably overkill for your current problem.

How to append to large XML files in C# using memory efficiently

Is there some way I can combine two XmlDocuments without holding the first in memory?
I have to cycle through a list of up to a hundred large (~300MB) XML files, appending to each up to 1000 nodes, repeating the whole process several times (as the new node list is cleared to save memory). Currently I load the whole XmlDocument into memory before appending new nodes, which is currently not tenable.
What would you say is the best way to go about this? I have a few ideas but I'm not sure which is best:
Never load the whole XMLDocument, instead using XmlReader and XmlWriter simultaneously to write to a temp file which is subsequently renamed.
Make an XmlDocument for the new nodes only, and then manually write it to the existing file (i.e. file.WriteLine( "<node>\n" ))
Something else?
Any help will be much appreciated.
Edit Some more details in answer to some of the comments:
The program parses several large logs into XML, grouping into different files by source. It only needs to run once a day, and once the XML is written there is a lightweight proprietary reader program which gives reports on the data. Since the program only needs to run once a day it can be slow, but it runs on a server which performs other actions, mainly file compression and transfer, which cannot be affected too much.
A database would probably be easier, but the company isn't going to do this any time soon!
As is, the program runs on the dev machine using a few GB of memory at most, but throws out-of-memory exceptions when run on the server.
Final Edit
The task is quite low-priority, which is why it would only cost extra to get a database (though I will look into mongo).
The file will only be appended to, and won't grow indefinitely - each final file is only for a day's worth of the log, and then new files are generated the following day.
I'll probably use the XmlReader/Writer method since it will be easiest to ensure XML validity, but I have taken all your comments/answers into consideration. I know that having XML files this large is not a particularly good solution, but it's what I'm limited to, so thanks for all the help given.
If you wish to be completely certain of the XML structure, using XmlWriter and XmlReader is the best way to go.
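As a rough illustration of that approach (not the poster's code): stream the existing document to a temporary file, copy every child of the root across unchanged, and inject the new nodes just before the root's end tag. The method and parameter names below are made up, root attributes and namespaces are ignored for brevity, and the new nodes are assumed to already be serialized as strings:
// uses System.Xml
static void AppendNodes(string sourcePath, string tempPath, IEnumerable<string> newNodeFragments)
{
    using (var reader = XmlReader.Create(sourcePath))
    using (var writer = XmlWriter.Create(tempPath, new XmlWriterSettings { Indent = true }))
    {
        reader.MoveToContent();                     // position on the root element
        writer.WriteStartElement(reader.LocalName); // re-open the root in the output

        bool rootIsEmpty = reader.IsEmptyElement;
        reader.Read();                              // step into the root's content

        if (!rootIsEmpty)
        {
            // copy every existing child of the root without loading the whole document
            while (!(reader.NodeType == XmlNodeType.EndElement && reader.Depth == 0))
            {
                writer.WriteNode(reader, true);     // writes the current subtree and advances
            }
        }

        foreach (string fragment in newNodeFragments)
        {
            writer.WriteRaw(fragment);              // append the pre-serialized new nodes
        }

        writer.WriteEndElement();                   // close the root
    }
    // afterwards, swap the temp file in for the original (e.g. File.Replace)
}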
However, for absolutely highest possible performance, you may be able to recreate this code quickly using direct string functions. You could do this, although you'd lose the ability to verify the XML structure - if one file had an error you wouldn't be able to correct it:
using (StreamWriter sw = new StreamWriter("out.xml")) {
    foreach (string filename in files) {
        sw.Write(String.Format(@"<inputfile name=""{0}"">", filename));
        using (StreamReader sr = new StreamReader(filename)) {
            if (max_performance) {
                // bulk-copy the raw bytes (see also http://bit.ly/RiovFX);
                // flush the writer first so the output stays in order
                sw.Flush();
                sr.BaseStream.CopyTo(sw.BaseStream); // .NET 4's Stream.CopyTo()
            } else {
                string line;
                while ((line = sr.ReadLine()) != null) {
                    // parse the line and make any modifications you want
                    sw.Write(line);
                    sw.Write("\n");
                }
            }
        }
        sw.Write("</inputfile>");
    }
}
Depending on the way your input XML files are structured, you might opt to remove the XML headers, maybe the document element, or a few other unnecessary structures. You could do that by parsing the file line by line.

Big strings: System.OutOfMemoryException

var fileList = Directory.GetFiles("./", "split*.dat");
int fileCount = fileList.Length;
int i = 0;
foreach (string path in fileList)
{
    string[] contents = File.ReadAllLines(path); // OutOfMemoryException
    Array.Sort(contents);
    string newpath = path.Replace("split", "sorted");
    File.WriteAllLines(newpath, contents);
    File.Delete(path);
    contents = null;
    GC.Collect();
    SortChunksProgressChanged(this, (double)i / fileCount);
    i++;
}
For a file that consists of ~20-30 big lines (each line ~20MB), I get an OutOfMemoryException when I call the ReadAllLines method. Why is this exception raised, and how do I fix it?
P.S. I use Mono on MacOS
You should always be very careful about performing operations with potentially unbounded results; in your case, reading a file. As you mention, the file size and/or line length is unbounded.
The answer lies in reading 'enough' of a line to sort, then skipping characters until the next line and reading the next 'enough'. You probably want to aim to create a line-index lookup so that when you reach an ambiguous sort order you can go back to get more data from the line (seek to the file position). When you go back, you only need to read the next sortable chunk to disambiguate the conflicting lines.
You may need to think about the file encoding, don't go straight to bytes unless you know it is one byte per char.
The built-in sort is not as fast as you'd like.
Side Note:
If you call GC.* you've probably done it wrong
setting contents = null does not help you
If you are using a foreach and maintaining the index then you may be better with a for(int i...) for readability
Okay, let me give you a hint to help you with your homework. Loading the complete file into memory will, as you know, not work, because it is given as a precondition of the assignment. You need to find a way to lazily load the data from disk as you go and throw away as much data as possible as soon as possible. Because single lines could be too big, you will have to do this one char at a time.
Try creating a class that represents an abstraction over a line, for instance by wrapping the starting index and ending index of that line. When you let this class implement IComparable<T> it allows you to sort that line with other lines. Again, the trick is to be able to read characters from the file one at a time. You will need to work with Streams (File.Open) directly.
When you do this, you will be able to write your application code like this:
List<FileLine> lines = GetLines("fileToSort.dat");
lines.Sort();
foreach (var line in lines)
{
    line.AppendToFile("sortedFile.dat");
}
Your task will be to implement GetLines(string path) and create the FileLine class.
Note that I assume that the actual number of lines will be small enough that the List<FileLine> will fit into memory (which means an approximate maximum of 40,000,000 lines). If the amount of lines can be higher, you would even need a more flexible approach, but since you are talking about 20 to 30 lines, this should not be a problem.
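Purely as an illustration of the shape such a class might take (this is not a full solution; the buffer size and one-byte-per-char assumption are mine, and GetLines would still need to scan the file once to record each line's offset and length):
class FileLine : IComparable<FileLine>
{
    private readonly string path;
    public long Start { get; private set; }   // byte offset of the line's first char
    public long Length { get; private set; }  // line length in bytes, without the newline

    public FileLine(string path, long start, long length)
    {
        this.path = path;
        Start = start;
        Length = length;
    }

    public int CompareTo(FileLine other)
    {
        // compare the two lines byte by byte, reading only what is needed
        using (var a = File.OpenRead(path))
        using (var b = File.OpenRead(other.path))
        {
            a.Position = Start;
            b.Position = other.Start;
            long shared = Math.Min(Length, other.Length);
            for (long i = 0; i < shared; i++)
            {
                int cmp = a.ReadByte().CompareTo(b.ReadByte());
                if (cmp != 0) return cmp;
            }
            return Length.CompareTo(other.Length); // shorter line sorts first
        }
    }

    public void AppendToFile(string target)
    {
        using (var source = File.OpenRead(path))
        using (var output = new FileStream(target, FileMode.Append, FileAccess.Write))
        {
            source.Position = Start;
            var buffer = new byte[8192];
            long remaining = Length;
            while (remaining > 0)
            {
                int read = source.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
                output.Write(buffer, 0, read);
                remaining -= read;
            }
            output.WriteByte((byte)'\n');
        }
    }
}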
Basically your approach is bull. You are violating a constraint of the homework you were given, and this constraint has been put there to make you think more.
As you said:
I must implement external sort and show my teacher that it works for files bigger than my RAM
OK, so how do you think you will ever read the file in? ;) This is there on purpose. ReadAllLines does NOT implement an incremental external sort. As a result, it blows up.

C# Datatype for large sorted collection with position?

I am trying to compare two large datasets from a SQL query. Right now the SQL query is done externally and the results from each dataset are saved into their own CSV file. My little C# console application loads up the two text/CSV files, compares them for differences and saves the differences to a text file.
It's a very simple application that just loads all the data from the first file into an ArrayList and does a .compare() on the ArrayList as each line is read from the second CSV file, then saves the records that don't match.
The application works, but I would like to improve the performance. I figure I can greatly improve performance if I can take advantage of the fact that both files are sorted, but I don't know a datatype in C# that keeps order and would allow me to select a specific position. There's a basic array, but I don't know how many items are going to be in each list. I could have over a million records. Is there a data type available that I should be looking at?
If the data in both of your CSV files is already sorted and the files have the same number of records, you could skip the data structure entirely and do an in-place analysis.
StreamReader one = new StreamReader(@"C:\file1.csv");
StreamReader two = new StreamReader(@"C:\file2.csv");
String lineOne;
String lineTwo;
StreamWriter differences = new StreamWriter("Output.csv");
while (!one.EndOfStream)
{
    lineOne = one.ReadLine();
    lineTwo = two.ReadLine();
    // do your comparison.
    bool areDifferent = true;
    if (areDifferent)
        differences.WriteLine(lineOne + lineTwo);
}
one.Close();
two.Close();
differences.Close();
System.Collections.Specialized.StringCollection allows you to add a range of values and, using the .IndexOf(string) method, allows you to retrieve the index of that item.
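For example (the file name and lookup value are just placeholders):
var lines = new System.Collections.Specialized.StringCollection();
lines.AddRange(File.ReadAllLines(@"C:\file1.csv"));
int index = lines.IndexOf("some,record,value"); // -1 if the line is not present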
That being said, you could likely just load up a couple of byte[] from a filestream and do a byte comparison... don't even worry about loading that stuff into a formal data structure like StringCollection or string[]; if all you're doing is checking for differences, and you want speed, I would reckon byte differences are where it's at.
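A quick sketch of the byte-comparison idea, assuming you only need to know whether the files differ at all (note that this loads both files fully into memory):
byte[] a = File.ReadAllBytes(@"C:\file1.csv");
byte[] b = File.ReadAllBytes(@"C:\file2.csv");
bool identical = a.Length == b.Length && a.SequenceEqual(b); // SequenceEqual needs System.Linq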
This is an adaptation of David Sokol's code to work with a varying number of lines, outputting the lines that are in one file but not the other:
StreamReader one = new StreamReader(@"C:\file1.csv");
StreamReader two = new StreamReader(@"C:\file2.csv");
String lineOne;
String lineTwo;
StreamWriter differences = new StreamWriter("Output.csv");
lineOne = one.ReadLine();
lineTwo = two.ReadLine();
while (!one.EndOfStream || !two.EndOfStream)
{
    if (lineOne == lineTwo)
    {
        // lines match, read next line from each and continue
        lineOne = one.ReadLine();
        lineTwo = two.ReadLine();
        continue;
    }
    // CompareOrdinal is used because < is not defined for strings
    if (two.EndOfStream || string.CompareOrdinal(lineOne, lineTwo) < 0)
    {
        differences.WriteLine(lineOne);
        lineOne = one.ReadLine();
    }
    if (one.EndOfStream || string.CompareOrdinal(lineTwo, lineOne) < 0)
    {
        differences.WriteLine(lineTwo);
        lineTwo = two.ReadLine();
    }
}
Standard caveat about code written off the top of my head applies -- you may need to special-case running out of lines in one while the other still has lines, but I think this basic approach should do what you're looking for.
Well, there are several approaches that would work. You could write your own data structure that did this. Or you can try and use SortedList. You can also return the DataSets in code, and then use .Select() on the table. Granted, you would have to do this on both tables.
You can easily use a SortedList to do fast lookups. If the data you are loading is already sorted, insertions into the SortedList should not be slow.
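As a small illustration of the SortedList idea (the key extraction below is just a placeholder; use whatever uniquely identifies a record in your data):
var sorted = new SortedList<string, string>();

// load the first file, keyed on the record's identifier
foreach (var line in File.ReadLines(@"C:\file1.csv"))
{
    string key = line.Split(',')[0];
    sorted[key] = line;
}

// fast keyed lookups while scanning the second file
foreach (var line in File.ReadLines(@"C:\file2.csv"))
{
    string key = line.Split(',')[0];
    if (!sorted.ContainsKey(key))
    {
        // the record exists only in the second file
    }
}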
If you are looking simply to see if all lines in FileA are included in FileB you could read it in and just compare streams inside a loop.
File 1
Entry1
Entry2
Entry3
File 2
Entry1
Entry3
You could loop through with two counters and find omissions, going line by line through each file and see if you get what you need.
Maybe I misunderstand, but the ArrayList will maintain its elements in the same order by which you added them. This means you can compare the two ArrayLists within one pass only - just increment the two scanning indices according to the comparison results.
One question I have is: have you considered "out-sourcing" your comparison? There are plenty of good diff tools that you could just call out to. I'd be surprised if there wasn't one that let you specify two files and get only the differences. Just a thought.
I think the reason everyone has so many different answers is that you haven't quite specified your problem well enough to be answered. First off, it depends what kind of differences you want to track. Do you want the differences to be output like in WinDiff, where the first file is the "original" and the second file is the "modified", so you can list changes as INSERT, UPDATE or DELETE? Do you have a primary key that will allow you to match up two lines as different versions of the same record (when fields other than the primary key are different)? Or is this some sort of reconciliation where you just want your difference output to say something like "RECORD IN FILE 1 AND NOT FILE 2"?
I think the answers to these questions will help everyone give you a suitable answer to your problem.
If you have two files that are each a million lines, as mentioned in your post, you might be using up a lot of memory. Some of the performance problem might be that you are swapping to disk. If you are simply comparing line 1 of file A to line 1 of file B, line 2 of file A to line 2 of file B, etc., I would recommend a technique that does not store so much in memory. You could read and write off of two file streams, as a previous commenter posted, and write out your results "in real time" as you find them; this would not explicitly store anything in memory. You could also dump chunks of each file into memory, say one thousand lines at a time, into something like a List. This could be fine-tuned to meet your needs.
To resolve question #1, I'd recommend looking into creating a hash of each line. That way you can compare hashes quickly and easily using a dictionary.
To resolve question #2, one quick and dirty solution would be to use an IDictionary, with the itemId as your key string and the rest of the line as your value string. You can then quickly find whether an itemId exists and compare the lines. This of course assumes .NET 2.0+.
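A rough sketch of that dictionary approach (assuming each line starts with the itemId followed by a comma; the field layout and file names are illustrative):
var fileA = new Dictionary<string, string>();

foreach (var line in File.ReadLines(@"C:\file1.csv"))
{
    int comma = line.IndexOf(',');
    fileA[line.Substring(0, comma)] = line.Substring(comma + 1); // itemId -> rest of line
}

using (var differences = new StreamWriter("Output.csv"))
{
    foreach (var line in File.ReadLines(@"C:\file2.csv"))
    {
        int comma = line.IndexOf(',');
        string itemId = line.Substring(0, comma);
        string rest = line.Substring(comma + 1);

        string original;
        if (!fileA.TryGetValue(itemId, out original))
            differences.WriteLine("IN FILE 2 BUT NOT FILE 1: " + line);
        else if (original != rest)
            differences.WriteLine("CHANGED: " + line);
    }
}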
