C# - Comparing two CSV files and giving an output

Need a bit of help. I have two sources of information, and the information is exported to two different CSV files by different programs. They are supposed to contain the same information; however, that is exactly what needs to be checked.
Therefore what I would like to do is as follows:
Take the information from the two files.
Compare it.
Output any differences and which file each difference was in (e.g. file A contained this, but file B did not, and vice versa).
The files are 200,000-odd rows each, so this will need to be as efficient as possible.
I tried doing this with Excel, but it proved too complicated, and I'm really struggling to find a way to do it programmatically.

Assuming that the files are really supposed to be identical, right down to text qualifiers, ordering of rows, and number of rows contained in each file, the simplest approach may be to simply iterate through both files together and compare each line.
using (StreamReader f1 = new StreamReader(path1))
using (StreamReader f2 = new StreamReader(path2))
{
    var differences = new List<string>();
    int lineNumber = 0;
    while (!f1.EndOfStream)
    {
        if (f2.EndOfStream)
        {
            differences.Add("Differing number of lines - file 2 has fewer.");
            break;
        }
        lineNumber++;
        var line1 = f1.ReadLine();
        var line2 = f2.ReadLine();
        if (line1 != line2)
        {
            differences.Add(string.Format("Line {0} differs. File 1: {1}, File 2: {2}", lineNumber, line1, line2));
        }
    }
    if (!f2.EndOfStream)
    {
        differences.Add("Differing number of lines - file 1 has fewer.");
    }
}

Depending on your answers to the comments on your question, if it doesn't really need to be done with code, you could do worse than downloading a compare tool, which is likely to be more sophisticated (WinMerge, for example).

OK, for anyone else who googles this and finds it, here is what my answer was.
I exported the details to a CSV and had them ordered numerically on export, for ease of use. Once they were exported as two CSV files, I then used a program called Beyond Compare, which allows the files to be compared.
At first I used Beyond Compare manually to test that what I was exporting was correct, etc. However, Beyond Compare can also be driven from the command line to compare files. Everything is then done programmatically; all that remains is for a user to view the results in Beyond Compare. You may be able to export them to another CSV; I haven't looked, as the GUI of Beyond Compare is very nice and useful, so it is easier to use that.

Related

What's the fastest way to copy files responsive to multiple search terms?

I'm presently working on an application that allows the user to input a list of names/search terms and a folder path. The application then searches for each phrase and copies any responsive documents to an output path. It's important to note that it is often used on directories containing from 100 GB up to a few TB of data and can sometimes be required to run thousands of search terms.
Initially I simply used the System.IO.Directory.GetFiles() method for this, but I've found I have better results creating a data table of all documents in the input path and running my searches over that data table (see below).
// Constructing a data table of all files in the input path
foreach (var file in fileArray)
{
    System.Data.DataRow row = searchTable.NewRow();
    row[1] = file;
    row[0] = System.IO.Path.GetFileName(file);
    searchTable.Rows.Add(row);
}

// For each line inputted by the user, search the data table to find any responsive file names
foreach (var line in searchArray)
{
    for (int i = 0; i < searchTable.Rows.Count; i++)
    {
        if (searchTable.Rows[i][0].ToString().Contains(line))
        {
            string file = searchTable.Rows[i][1].ToString();
            string output = SwiftBank.CalculateOutputFilePath(outputPath, inputPath, file);
            System.IO.File.Copy(file, output);
        }
    }
}
I've found that while this works, it isn't optimised and runs very slowly for large data sets. It is obviously doing a lot of repeat work, searching the full data table for every search term. I'm wondering if someone on here might have a better idea?
In my experience, doing a handful of Contains queries over a few thousand fairly short strings should take less than a second. If you are searching inside much larger data sets, like searching through 100 GB of content, you should look at a more advanced library, like Lucene.
There are a few things I would suggest changing (see the sketch below):
Use a regular list instead of a DataTable. Something like List<(string filePath, string fileName)> would be much simpler and contain the same information.
Perform all checks for a specific file at once, i.e. reorder your loops so that the file loop is the outer one. This should help cache usage a little bit.
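A hedged sketch of both suggestions together, reusing fileArray, searchArray, inputPath, outputPath, and the SwiftBank.CalculateOutputFilePath helper from the question (assumed here to be a string[] and a caller-supplied helper, respectively):
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Build a plain list of (path, name) tuples instead of a DataTable.
var files = fileArray
    .Select(f => (FilePath: f, FileName: Path.GetFileName(f)))
    .ToList();

// File loop on the outside, so each file name is scanned against all terms in one pass.
foreach (var (filePath, fileName) in files)
{
    foreach (var line in searchArray)
    {
        if (fileName.Contains(line))
        {
            string output = SwiftBank.CalculateOutputFilePath(outputPath, inputPath, filePath);
            File.Copy(filePath, output);
            break; // stop after the first matching term; one copy per file
        }
    }
}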
However, the vast majority of the time will likely be spent on copying files. This is many orders of magnitude slower than doing some simple searching in a few kilobytes of memory. You might gain a little bit by doing more than one copy in parallel, since SSDs may be able to improve throughput at higher loads, but that is likely only true if the files are small. You might consider alternatives to the copying, like adding shortcuts, instead.

Convert text of a C# project into 1 text file

So I'm doing Google Code Jam, and for their new format I have to upload my code as a single text file.
I like writing my code as properly constructed classes across multiple files, even when under time pressure (I find I save more time in clarity and my own debugging speed than I lose in wasted time), and I want to re-use the common code.
Once I've got my code finished I have to convert from a series of classes in multiple files, to a single file.
Currently I'm just manually copying and pasting all the files' text into a single file, and then manually massaging the usings and namespaces to make it all work.
Is there a better option?
Ideally a tool that will JustDoIt for me?
Alternatively, if there were some predictable algorithm that I could implement that wouldn't require any manual tweaks?
Write your classes so that all "using"s are inside "namespace"
Write a script which collects all *.cs files and concatenates them
This is probably not the most optimal way to do this, but here is an algorithm which can do what you need:
loop through every file and grab every line starting with "using" -> write them to a temp file/buffer
check for duplicates and remove them
get the position of the first '{' after the character sequence "namespace"
get the position of the last '}' in the file
append the text in between these two positions to a second temp file/buffer
append the second file/buffer to the first one
write out the merged buffer
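A minimal, runnable sketch of that algorithm, assuming each source file keeps its usings at the top and wraps everything else in a single namespace block; the "src" input folder and "merged.cs" output name are placeholders:
using System;
using System.Collections.Generic;
using System.IO;

class MergeCsFiles
{
    static void Main()
    {
        var usings = new SortedSet<string>();  // de-duplicates the using lines
        var bodies = new List<string>();

        foreach (var path in Directory.EnumerateFiles("src", "*.cs", SearchOption.AllDirectories))
        {
            string text = File.ReadAllText(path);
            foreach (var line in File.ReadLines(path))
                if (line.TrimStart().StartsWith("using "))
                    usings.Add(line.Trim());

            // text between the first '{' after "namespace" and the last '}'
            // (assumes every file actually declares a namespace)
            int open = text.IndexOf('{', text.IndexOf("namespace"));
            int close = text.LastIndexOf('}');
            bodies.Add(text.Substring(open + 1, close - open - 1));
        }

        File.WriteAllText("merged.cs",
            string.Join(Environment.NewLine, usings)
            + Environment.NewLine
            + string.Join(Environment.NewLine, bodies));
    }
}
Note that this strips the namespace wrappers entirely, as the steps describe; that sidesteps duplicate-namespace issues but will break if two files declare classes with the same name.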
It is very subjective. I see the algorithm as the following in pseudo code:
usingsLines = new HashSet<string>();
newFile = new StringBuilder();
foreach (file in listOfFiles)
{
    var textFromFile = file.ReadToEnd();
    var usingOperators = textFromFile.GetUsings();
    usingsLines.UnionWith(usingOperators);
    var fileBody = textFromFile.GetBody();
    newFile += fileBody;
}
newFile = usingsLines.ToString() + newFile;
// As a result we will have something like this:
// using usingsFromFirstFile;
// using usingsFromSecondFile;
//
// namespace FirstFileNamespace
// {
//   ...
// }
//
// namespace SecondFileNamespace
// {
//   ...
// }
But keep in mind this approach can lead to conflicts if two different namespaces contain classes with the same names, etc. To solve that, you either need to fix it manually, or get rid of the using directives and use fully qualified type names instead.
Also these few links may be useful:
Merge files,
Merge file in Java

Efficient Methods of Comparing Text Files Simultaneously

I did check to see if any existing questions matched mine, but I didn't see any; if I missed one, my mistake.
I have two text files to compare against each other. One is a temporary log file that is overwritten sometimes, and the other is a permanent log, which collects and appends the contents of the temp log into one file (it collects whatever lines are new since it last checked and appends them to the end of the complete log). After a point, however, the complete log may become quite large and therefore inefficient to compare against, so I have been thinking about different methods to approach this.
My first idea is to "buffer" the temp log (which will normally be the smaller of the two) into a list of strings and simply loop through the archive log, doing something like:
// buffer the (smaller) temp log into memory
List<String> bufferedLines = new List<string>(File.ReadAllLines(tempPath));
using (StreamReader archiveStream = new StreamReader(ArchivePath))
{
    string line;
    while ((line = archiveStream.ReadLine()) != null)
    {
        if (bufferedLines.Contains(line))
        {
            // this archive line also exists in the temp log
        }
    }
}
Now there are a couple of ways I could proceed from here. I could create yet another list to store the inconsistencies, close the read stream (I'm not sure you can both read and write at the same time; if you can, that might make things easier), then open a write stream in append mode and write the list to the file. Alternatively, cutting out buffering the inconsistencies, I could open a write stream while the files are being compared and write the unmatched lines on the spot.
The other method I could think of was limited by my knowledge of whether it could be done or not, which was, rather than buffer either file, to compare the streams side by side as they are read and append the lines on the fly. Something like:
using (StreamReader templogStream = new StreamReader(tempPath))
{
    string line;
    while ((line = templogStream.ReadLine()) != null)
    {
        if (!File.ReadLines(ArchivePath).Contains(line))
        {
            // write the line to the archive file
        }
    }
}
As I said, I'm not sure whether that would work or whether it would be more efficient than the first method, so I figured I'd ask and see if anyone had insight into how this might properly be implemented, and whether it is the most efficient way or there is a better method out there.
Effectively what you want here is all of the items from one set that aren't in another set. This is set subtraction, or in LINQ terms, Except. If your data sets were sufficiently small you could simply do this:
var lines = File.ReadLines(TempPath)
    .Except(File.ReadLines(ArchivePath))
    .ToList(); // can't write to the file while reading from it
File.AppendAllLines(ArchivePath, lines);
Of course, this code requires bringing all of the lines of the archive file into memory, because that's just how Except is implemented: it builds a HashSet of the second sequence's items so that it can efficiently find matches from the first.
Presumably the number of lines that need to be added here is pretty small, so the fact that the lines we find all need to be stored in memory isn't a problem. If there could potentially be a lot, you'd want to write them out to another file besides the first one (possibly concatenating the two files together when done, if needed).
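A hedged sketch of that last variant, streaming new lines to a separate file so they are never buffered in a list; NewLinesPath is an assumed name for the output file:
using System.Collections.Generic;
using System.IO;

var archived = new HashSet<string>(File.ReadLines(ArchivePath)); // one pass over the archive
using (var writer = new StreamWriter(NewLinesPath))
{
    foreach (var line in File.ReadLines(TempPath))
        if (!archived.Contains(line))
            writer.WriteLine(line); // streamed out as found, never held in memory
}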

Fastest way to delete files that are not in a data table?

I need to write a code in C# that will select a list of file names from a data table and delete every file in a folder that is not in this list.
One possibility would be to have both lists ordered by name, then loop through my table results, and for each result loop through my files and delete them until I find a file that matches the current result or is alphabetically greater, then move to the next result without resetting the current file index.
I haven't tried to actually implement this, but it seems to me that this would be O(n), since each list would be looped through just once (ignoring the sorting of both lists). The only thing I'm not sure about is whether I can be 100% sure the file system and the database engine will sort exactly the same way (will they both consider "_" smaller than "-", and so on). If not, the algorithm above just wouldn't work at all. (By the way, this is a Jet Engine database.)
But since this is probably not such an uncommon problem, you might already know a better solution. I tried searching the web but couldn't find anything. Perhaps a more effective solution would be to put each list into a HashSet and find their difference.
Get the folder content into folderFiles (IEnumerable<FileInfo>)
Get the files you want to keep into filesToKeep (IEnumerable<string>)
Get a list of the "not in list" files.
Delete these files.
Code Sample :
IEnumerable<FileInfo> folderFiles = new List<FileInfo>(); // Fill me.
IEnumerable<string> filesToKeep = new List<string>();     // Fill me.

foreach (string fileToDelete in folderFiles.Select(fi => fi.FullName).Except(filesToKeep))
{
    File.Delete(fileToDelete);
}
Here is my suggestion for you. Assuming filesInDatabase contains a list of files which are in the database and pathOfDirectory contains the path of the directory where the files to compare are contained.
foreach (var fileToDelete in Directory.EnumerateFiles(pathOfDirectory).Where(item => !filesInDatabase.Contains(item)))
{
    File.Delete(fileToDelete);
}
EDIT:
This requires a using System.Linq; directive, because it uses LINQ.
I think hashing is the way to go, but you don't really need two HashSets. Only one HashSet is needed, to store the standardized file names from the data table; the other container can be any collection type.
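A minimal sketch of that single-HashSet idea, assuming the table stores bare file names in a FileName column; dataTable and folderPath are illustrative names:
using System;
using System.Collections.Generic;
using System.Data;
using System.IO;

// standardized names from the database, hashed once
var keep = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
foreach (DataRow row in dataTable.Rows)
    keep.Add(row["FileName"].ToString());

// one pass over the folder; each lookup is O(1)
foreach (var path in Directory.EnumerateFiles(folderPath))
{
    if (!keep.Contains(Path.GetFileName(path)))
        File.Delete(path);
}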
First off, .NET allows you to define cultures that can be used in sorting, but I'm not all that familiar with the mechanism, so I'll let Google give you pointers on the subject.
Second, to avoid all the culture mess, you can use a different algorithm with an idea similar to radix sort (only without the sort). Time complexity is O(n * length_of_longest_file_name). File name lengths are limited (as far as I know, almost no file system allows a file name longer than 256 characters), so I'm assuming that n is dramatically larger than the file name lengths; if n is smaller than the max file name length, just use an O(n^2) method and avoid the work (iterating lists this small is near-instant anyway).
Note: This method does not require sorting.
The idea is to create an array of symbols that can be used as file name chars (about 60-70 chars, if this is a case-sensitive search), and another flag array with a flag for each char in the first array.
Now, you loop over each char position in the file names of the list from the DB (from 1 -> length_of_longest_file_name).
In each iteration (i) you go over the i-th char of each file name in the DB list. Every char you see, you set its relevant flag to true.
When all flags are set, you go over the second list and delete every file for which the i-th char of its name is not flagged.
Implementation might be complex, and the overhead of the two arrays might make it slower when n is small, but you can optimize it (for instance, by not iterating over files whose names are shorter than the current i, removing them from both lists).
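A literal sketch of that idea, using a HashSet<char> per position in place of the parallel symbol/flag arrays. Note it acts as a filter rather than an exact match: a file survives position i if its i-th character appears at that position in some DB name, so survivors still need a final exact check. dbNames, folderFileNames, and folderPath are illustrative names:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

int maxLen = dbNames.Max(n => n.Length);
var remaining = new List<string>(folderFileNames);

for (int i = 0; i < maxLen; i++)
{
    // flag every char that appears at position i in some DB file name
    var flags = new HashSet<char>();
    foreach (var name in dbNames)
        if (i < name.Length)
            flags.Add(name[i]);

    // delete every remaining file whose i-th char was never flagged
    foreach (var name in remaining.Where(f => i < f.Length && !flags.Contains(f[i])).ToList())
    {
        File.Delete(Path.Combine(folderPath, name));
        remaining.Remove(name);
    }
}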
Hope this helps
I have another idea that might be faster.
var filesToDelete = new List<string>(Directory.GetFiles(directoryPath));

foreach (var databaseFile in databaseFileList)
{
    filesToDelete.Remove(databaseFile);
}

foreach (var fileToDelete in filesToDelete)
{
    File.Delete(fileToDelete);
}
Explanation: First get all files contained in the directory. Then remove from that list every file that is in the database. Finally, delete all files remaining in the filesToDelete list.

C# Datatype for large sorted collection with position?

I am trying to compare two large datasets from a SQL query. Right now the SQL query is done externally, and the results from each dataset are saved into their own CSV files. My little C# console application loads the two text/CSV files, compares them for differences, and saves the differences to a text file.
It's a very simple application that just loads all the data from the first file into an ArrayList and compares each line read from the second CSV file against it, then saves the records that don't match.
The application works, but I would like to improve its performance. I figure I can greatly improve performance if I can take advantage of the fact that both files are sorted, but I don't know of a datatype in C# that keeps order and would allow me to select a specific position. There's a basic array, but I don't know how many items are going to be in each list; I could have over a million records. Is there a data type available that I should be looking at?
If the data in both of your CSV files is already sorted and the files have the same number of records, you could skip the data structure entirely and do in-place analysis.
StreamReader one = new StreamReader(@"C:\file1.csv");
StreamReader two = new StreamReader(@"C:\file2.csv");
String lineOne;
String lineTwo;
StreamWriter differences = new StreamWriter("Output.csv");
while (!one.EndOfStream)
{
    lineOne = one.ReadLine();
    lineTwo = two.ReadLine();
    // do your comparison.
    bool areDifferent = true;
    if (areDifferent)
        differences.WriteLine(lineOne + lineTwo);
}
one.Close();
two.Close();
differences.Close();
System.Collections.Specialized.StringCollection allows you to add a range of values and, via the .IndexOf(string) method, to retrieve the index of a given item.
That being said, you could likely just load up a couple of byte[] from a FileStream and do a byte comparison; don't even worry about loading that stuff into a formal data structure like StringCollection or string[]. If all you're doing is checking for differences, and you want speed, I would reckon byte differences are where it's at.
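A minimal sketch of that byte-comparison idea, assuming you only need a yes/no answer on whether the two files are identical (it tells you whether they differ, not where; for per-line differences you still need a line-based approach):
using System;
using System.IO;

byte[] a = File.ReadAllBytes(@"C:\file1.csv");
byte[] b = File.ReadAllBytes(@"C:\file2.csv");

// length check first, then a vectorized span comparison
bool identical = a.Length == b.Length && a.AsSpan().SequenceEqual(b);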
This is an adaptation of David Sokol's code to work with varying number of lines, outputing the lines that are in one file but not the other:
StreamReader one = new StreamReader(@"C:\file1.csv");
StreamReader two = new StreamReader(@"C:\file2.csv");
StreamWriter differences = new StreamWriter("Output.csv");
String lineOne = one.ReadLine();
String lineTwo = two.ReadLine();
while (lineOne != null || lineTwo != null)
{
    if (lineOne == lineTwo)
    {
        // lines match, read next line from each and continue
        lineOne = one.ReadLine();
        lineTwo = two.ReadLine();
        continue;
    }
    if (lineTwo == null || (lineOne != null && String.CompareOrdinal(lineOne, lineTwo) < 0))
    {
        // line exists in file 1 but not in file 2
        differences.WriteLine(lineOne);
        lineOne = one.ReadLine();
    }
    else
    {
        // line exists in file 2 but not in file 1
        differences.WriteLine(lineTwo);
        lineTwo = two.ReadLine();
    }
}
one.Close();
two.Close();
differences.Close();
Standard caveat about code written off the top of my head applies -- you may need to special-case running out of lines in one while the other still has lines, but I think this basic approach should do what you're looking for.
Well, there are several approaches that would work. You could write your own data structure that did this. Or you can try and use SortedList. You can also return the DataSets in code, and then use .Select() on the table. Granted, you would have to do this on both tables.
You can easily use a SortedList to do fast lookups. If the data you are loading is already sorted, insertions into the SortedList should not be slow.
If you are looking simply to see if all lines in FileA are included in FileB, you could read them in and just compare the streams inside a loop.
File 1
Entry1
Entry2
Entry3
File 2
Entry1
Entry3
You could loop through with two counters and find omissions, going line by line through each file to see if you get what you need.
Maybe I misunderstand, but an ArrayList maintains its elements in the order in which you added them. This means you can compare the two ArrayLists in a single pass: just increment the two scanning indices according to the comparison results.
One question I have: have you considered "outsourcing" your comparison? There are plenty of good diff tools that you could just call out to. I'd be surprised if there weren't one that lets you specify two files and get only the differences. Just a thought.
I think the reason everyone has so many different answers is that you haven't quite specified your problem well enough to be answered. First off, it depends what kind of differences you want to track. Do you want the differences output as in WinDiff, where the first file is the "original" and the second file is the "modified", so you can list changes as INSERT, UPDATE or DELETE? Do you have a primary key that will allow you to match up two lines as different versions of the same record (when fields other than the primary key are different)? Or is this some sort of reconciliation where you just want your difference output to say something like "RECORD IN FILE 1 AND NOT FILE 2"?
I think the answers to these questions will help everyone give you a suitable answer to your problem.
If you have two files that are each a million lines, as mentioned in your post, you might be using up a lot of memory, and some of the performance problem might be swapping to disk. If you are simply comparing line 1 of file A to line 1 of file B, line 2 of file A to line 2 of file B, etc., I would recommend a technique that does not store so much in memory. You could either read and write off of two file streams, as a previous answer posted, and write out your results "in real time" as you find them, which would not explicitly store anything in memory; or you could dump chunks of each file into memory, say a thousand lines at a time, into something like a List. This could be fine-tuned to meet your needs.
To resolve question #1, I'd recommend looking into creating a hash of each line. That way you can compare hashes quickly and easily using a dictionary.
To resolve question #2, one quick and dirty solution would be to use an IDictionary<string, string>, with itemId as the key and the rest of the line as the value. You can then quickly find whether an itemId exists and compare the lines. This of course assumes .NET 2.0+.
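A minimal sketch of that dictionary approach, assuming the itemId is the first comma-separated field of every line; the file names are placeholders:
using System;
using System.Collections.Generic;
using System.IO;

var first = new Dictionary<string, string>();
foreach (var line in File.ReadLines("file1.csv"))
{
    int comma = line.IndexOf(','); // assumes every line has at least one comma
    first[line.Substring(0, comma)] = line.Substring(comma + 1); // itemId -> rest of line
}

foreach (var line in File.ReadLines("file2.csv"))
{
    int comma = line.IndexOf(',');
    string id = line.Substring(0, comma);
    string rest = line.Substring(comma + 1);
    string other;
    if (!first.TryGetValue(id, out other))
        Console.WriteLine("RECORD IN FILE 2 AND NOT FILE 1: " + id);
    else if (other != rest)
        Console.WriteLine("RECORD " + id + " DIFFERS");
}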
