Fastest way to delete files that are not in a data table?


I need to write code in C# that selects a list of file names from a data table and deletes every file in a folder that is not in this list.
One possibility would be to have both ordered by name, then loop through my table results and, for each result, loop through my files and delete them until I find a file that matches the current result or is alphabetically bigger, and then move to the next result without resetting the current file index.
I haven't tried to actually implement this, but it seems to me that this would be O(n), since each list would be looped through just once (ignoring the cost of sorting both lists). The only thing I'm not sure about is whether I can be 100% sure that the file system and the database engine will sort in exactly the same way (will they both consider "_" smaller than "-", and so on). If not, the algorithm above just wouldn't work at all. (By the way, this is a Jet Engine database.)
But since this is probably not such an uncommon problem, you might already know a better solution. I tried searching the web but couldn't find anything. Perhaps a more efficient solution would be to put each list into a HashSet and find their difference.

Get the folder content into folderFiles (IEnumerable<FileInfo>).
Get the files you want to keep into filesToKeep (IEnumerable<string>).
Get the list of files that are not in filesToKeep.
Delete these files.
Code sample:
    // Requires: using System.Collections.Generic; using System.IO; using System.Linq;
    IEnumerable<FileInfo> folderFiles = new List<FileInfo>(); // Fill me.
    IEnumerable<string> filesToKeep = new List<string>(); // Fill me with full paths.
    foreach (string fileToDelete in folderFiles.Select(fi => fi.FullName).Except(filesToKeep))
    {
        File.Delete(fileToDelete);
    }

Here is my suggestion. It assumes that filesInDatabase contains the list of files that are in the database and that pathOfDirectory is the path of the directory containing the files to compare.
    // EnumerateFiles returns paths that include pathOfDirectory,
    // so filesInDatabase must hold paths in the same form.
    foreach (var fileToDelete in Directory.EnumerateFiles(pathOfDirectory).Where(item => !filesInDatabase.Contains(item)))
    {
        File.Delete(fileToDelete);
    }
EDIT:
This requires using System.Linq;, because it uses LINQ.

I think hashing is the way to go, but you don't really need two HashSets. Only one HashSet is needed to store the standardized file names from the datatable; the other container can be any collection data type.
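A minimal sketch of the single-HashSet idea, under two assumptions that are not in the question: the table stores bare file names (not full paths), and names should compare case-insensitively, as they do on typical Windows file systems. The names StaleFileCleaner, FindFilesToDelete and DeleteFilesNotIn are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class StaleFileCleaner
{
    // Returns the paths whose file name is not in namesToKeep.
    // Assumption: namesToKeep holds bare file names such as "report.txt".
    public static List<string> FindFilesToDelete(
        IEnumerable<string> folderFiles, IEnumerable<string> namesToKeep)
    {
        // One hash set for the database names gives O(1) average lookups.
        var keep = new HashSet<string>(namesToKeep, StringComparer.OrdinalIgnoreCase);
        var toDelete = new List<string>();
        foreach (string path in folderFiles)
        {
            if (!keep.Contains(Path.GetFileName(path)))
                toDelete.Add(path);
        }
        return toDelete;
    }

    // Lists the directory once and deletes everything that is not in the set.
    public static void DeleteFilesNotIn(string directory, IEnumerable<string> namesToKeep)
    {
        foreach (string path in FindFilesToDelete(Directory.EnumerateFiles(directory), namesToKeep))
            File.Delete(path);
    }
}
```

The folder listing itself never needs to be hashed or sorted; it is only walked once, so the sort-order question from the original post does not come up.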

First off, .NET allows you to define cultures that can be used in sorting, but I'm not all that familiar with the mechanism, so I'll let Google give you pointers on the subject.
Second, to avoid all the culture mess, you can use a different algorithm with an idea similar to radix sort (only without the sort); its time complexity is O(n * length_longest_file_name). File name lengths are limited (as far as I know, almost no file system will allow a file name longer than 256 characters), so I'm assuming that n is dramatically larger than the file name lengths. If n is smaller than the maximum file name length, just use an O(n^2) method and avoid the work (iterating over lists this small is near-instant anyway).
Note: This method does not require sorting.
The idea is to create an array of the symbols that can be used as file name chars (about 60-70 chars, if this is a case-sensitive search), and another flag array with a flag for each char in the first array.
Now, you create a loop over the char positions in the file names of the list from the DB (from 1 to length_longest_file_name).
In each iteration (i) you go over the i-th char of each file name in the DB list. For every char you see, you set its flag to true.
Once the flags for iteration i have been set, you go over the second list and delete every file for which the i-th char of its name is not flagged.
Implementation might be complex, and the overhead of the two arrays might make it slower when n is small, but you can optimize it (for instance, by removing names shorter than the current i from both lists so that you don't iterate over them).
Hope this helps

I have another idea that might be faster.
    var filesToDelete = new List<string>(Directory.GetFiles(directoryPath));
    foreach (var databaseFile in databaseFileList)
    {
        filesToDelete.Remove(databaseFile);
    }
    foreach (var fileToDelete in filesToDelete)
    {
        File.Delete(fileToDelete);
    }
Explanation: First get all the files contained in the directory. Then remove from that list every file that is in the database. Finally, delete all the files remaining in filesToDelete.
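One thing to keep in mind is that List<string>.Remove scans the list linearly, so the loop above does work proportional to the number of files times the number of database entries. A hedged variant of the same idea uses HashSet<string>.ExceptWith instead. It assumes the database entries are paths in the same form that Directory.GetFiles returns and that a case-insensitive comparison is wanted; the class name RemainingFiles is hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class RemainingFiles
{
    // Starts from every file in the folder and removes the ones the database knows.
    // Assumption: databaseFiles holds paths in the same form as folderFiles.
    public static HashSet<string> Compute(
        IEnumerable<string> folderFiles, IEnumerable<string> databaseFiles)
    {
        var filesToDelete = new HashSet<string>(folderFiles, StringComparer.OrdinalIgnoreCase);
        filesToDelete.ExceptWith(databaseFiles); // each removal is O(1) on average
        return filesToDelete;
    }
}
```

Usage would be along the lines of: foreach (var file in RemainingFiles.Compute(Directory.GetFiles(directoryPath), databaseFileList)) File.Delete(file);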

Related

What's the fastest way to copy files responsive to multiple search terms?

I'm presently working on an application that allows the user to input a list of names/search terms and a folder path. The application then searches for each phrase and copies any responsive documents to an output path. It is important to note that it is often used with directories containing from 100 GB up to a few TB, and it can sometimes be required to run thousands of search terms.
Initially I simply used the System.IO.Directory.GetFiles() function for this, but I've found I get better results by creating a data table of all documents in the input path and running my searches over that data table (see below).
    //Constructing a data table of all files in the input path
    foreach (var file in fileArray)
    {
        System.Data.DataRow row = searchTable.NewRow();
        row[1] = file;
        row[0] = System.IO.Path.GetFileName(file);
        searchTable.Rows.Add(row);
    }
    //For each line inputted by the user, search the data table to find any responsive file names
    foreach (var line in searchArray)
    {
        for (int i = 0; i < searchTable.Rows.Count; i++)
        {
            if (searchTable.Rows[i][0].ToString().Contains(line))
            {
                string file = searchTable.Rows[i][1].ToString();
                string output = SwiftBank.CalculateOutputFilePath(outputPath, inputPath, file);
                System.IO.File.Copy(file, output);
            }
        }
    }
I've found that while this works, it isn't optimised and runs very slowly for large data sets. It is obviously doing a lot of repeated work, searching the data table in full for every search term. Does anyone here have a better idea?

In my experience, doing a handful of Contains queries on a few thousand fairly short strings should take less than a second. If you are searching inside much larger data sets, like searching through 100 GB of content, you should look at a more advanced library, like Lucene.
There are a few things I would suggest changing:
Use a regular list instead of a DataTable. Something like List<(string filePath, string fileName)> would be much simpler and contain the same information.
Perform all checks for a specific file at once, i.e. reorder your loops so that the file loop is the outer one. This should help cache usage a little bit.
However, the vast majority of the time will likely be spent on copying files. This is many orders of magnitude slower than doing some simple searching in a few kilobytes of memory. You might gain a little bit by doing more than one copy in parallel, since SSDs may be able to improve throughput at higher loads, but that is likely only true if the files are small. You might consider alternatives to the copying, like adding shortcuts, instead.
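The two suggested changes could look roughly like this. It is a sketch under assumptions: FileMatcher and FindResponsiveFiles are hypothetical names, matching is a plain ordinal Contains on the file name as in the question, and the copying itself is left to the caller.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class FileMatcher
{
    // Returns the paths whose file name contains at least one search term.
    // The file loop is the outer loop, so each file is examined once and
    // reported at most once.
    public static List<string> FindResponsiveFiles(
        IEnumerable<string> filePaths, IReadOnlyList<string> searchTerms)
    {
        // A plain list of tuples replaces the DataTable.
        var files = new List<(string filePath, string fileName)>();
        foreach (string path in filePaths)
            files.Add((path, Path.GetFileName(path)));

        var matches = new List<string>();
        foreach (var (filePath, fileName) in files)
        {
            foreach (string term in searchTerms)
            {
                if (fileName.Contains(term))
                {
                    matches.Add(filePath);
                    break; // one hit is enough; skip the remaining terms
                }
            }
        }
        return matches;
    }
}
```

Reporting each file once also avoids calling File.Copy twice for the same destination, which throws when the destination file already exists.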

Removing redundant data from large file

I have a log file which has single strings on each line. I am trying to remove duplicate data from the file and save the file out as a new file. I had first thought of reading data into a HashSet and then saving the contents of the hashset out, however I get an "OutOfMemory" exception when attempting to do this (on the line that adds the string to the hashset).
There are around 32,000,000 lines in the files. It's not practical to re-read the entire file for each comparison.
Any ideas? My other thought was to output the entire contents into a SQLite database and selecting DISTINCT values, but I'm not sure that'd work either with that many values.
Thanks for any input!

The first thing you need to think about: is high memory consumption a problem?
If your application will always run on a server with a lot of RAM available, or in any other case where you know you'll have enough memory, you can do a lot of things that you can't do if your application runs in a low-memory or unknown environment. If memory isn't the problem, then make sure your application is running as a 64-bit application (on a 64-bit OS, of course); otherwise you'll be limited to 2 GB of memory (4 GB if you use the LARGEADDRESSAWARE flag). I guess that this is your problem, and all you've got to do is change it and it'll work great (assuming you have enough memory).
If memory is a problem and you need to avoid using too much of it, you can, as you suggested, add all the data to a database (I'm more familiar with databases like SQL Server, but I guess SQLite will do), make sure you have the right index on the column, and then select the distinct values.
Another option is to read the file as a stream, line by line. For each line, calculate a hash; if the hash is not yet in memory, save the line to the other file and keep the hash in memory. If the hash already exists, move on to the next line (and, if you wish, increment a counter of removed lines). That way you keep less data in memory (only the hashes of the non-duplicated items).
Best of luck.
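The streaming option could be sketched as follows. The choice of hash is an assumption, since the answer does not name one: here MD5 is used purely as a compact 128-bit fingerprint (not for security) and stored as a Guid, which makes accidental collisions extremely unlikely, though not impossible. LineDeduplicator and CopyDistinctLines are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Text;

public static class LineDeduplicator
{
    // Copies each distinct line from reader to writer, keeping only a 16-byte
    // fingerprint per distinct line in memory. Returns the number of duplicates skipped.
    public static int CopyDistinctLines(TextReader reader, TextWriter writer)
    {
        var seen = new HashSet<Guid>();
        int removed = 0;
        using (var md5 = MD5.Create())
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // An MD5 digest is exactly 16 bytes, which is what Guid(byte[]) expects.
                var fingerprint = new Guid(md5.ComputeHash(Encoding.UTF8.GetBytes(line)));
                if (seen.Add(fingerprint))
                    writer.WriteLine(line); // first occurrence: keep it
                else
                    removed++;              // fingerprint already seen: skip it
            }
        }
        return removed;
    }
}
```

In practice the reader and writer would be a StreamReader over the log file and a StreamWriter over the new file. For 32,000,000 distinct lines the fingerprints alone take at least 512 MB, plus the set's own overhead, so this still assumes a 64-bit process.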

Have you tried using an array to initialize the HashSet? I assume that the doubling algorithm of HashSet is the reason for the OutOfMemoryException.
    var uniqueLines = new HashSet<string>(File.ReadAllLines(@"C:\Temp\BigFile.log"));
Edit:
"I am testing the result of the .Add() method to see if it returns false to count the number of items that are redundant. I'd like to keep this feature if possible."
Then you should try to initialize the HashSet with the correct (maximum) size, the file's line count:
    int lineCount = File.ReadLines(path).Count();
    // HashSet<T>(int capacity) requires .NET Framework 4.7.2 or later (or .NET Core 2.0+).
    var uniqueLines = new HashSet<string>(lineCount);
    foreach (var line in File.ReadLines(path))
        uniqueLines.Add(line);

I took a similar approach to Tim's, using a HashSet. I did add manual line counting and comparison.
I read the setup log from my Windows 8 install, which was 58 MB in size at 312248 lines, and ran it in LINQPad in 0.993 seconds.
    var temp = new List<string>(10000);
    // Stores only the 32-bit hash code of each line. Two different lines with the
    // same hash code are treated as duplicates, so a distinct line can be dropped.
    var uniqueHash = new HashSet<int>();
    int lineCount = 0;
    int uniqueLineCount = 0;
    using (var fs = new FileStream(@"C:\windows\panther\setupact.log", FileMode.Open, FileAccess.Read))
    using (var sr = new StreamReader(fs, true))
    {
        while (!sr.EndOfStream)
        {
            lineCount++;
            var line = sr.ReadLine();
            var key = line.GetHashCode();
            if (uniqueHash.Add(key)) // Add returns false if the key is already present
            {
                temp.Add(line);
                uniqueLineCount++;
                // Write the buffered lines out in batches to keep memory use low.
                if (temp.Count >= 10000)
                {
                    File.AppendAllLines(@"c:\temp\output.txt", temp);
                    temp.Clear();
                }
            }
        }
    }
    // Flush the lines still buffered after the last full batch.
    File.AppendAllLines(@"c:\temp\output.txt", temp);
    Console.WriteLine("Total Lines:" + lineCount);
    Console.WriteLine("Lines Removed:" + (lineCount - uniqueLineCount));

Big strings: System.OutOfMemoryException

    var fileList = Directory.GetFiles("./", "split*.dat");
    int fileCount = fileList.Length;
    int i = 0;
    foreach (string path in fileList)
    {
        string[] contents = File.ReadAllLines(path); // OutOfMemoryException
        Array.Sort(contents);
        string newpath = path.Replace("split", "sorted");
        File.WriteAllLines(newpath, contents);
        File.Delete(path);
        contents = null;
        GC.Collect();
        SortChunksProgressChanged(this, (double)i / fileCount);
        i++;
    }
For a file that consists of ~20-30 big lines (every line ~20 MB), I get an OutOfMemoryException when I call the ReadAllLines method. Why is this exception raised, and how do I fix it?
P.S. I use Mono on macOS.

You should always be very careful about performing operations with potentially unbounded results; in your case, reading a file. As you mention, the file size and/or line length is unbounded.
The answer lies in reading 'enough' of a line to sort on, then skipping characters until the next line and reading the next 'enough'. You probably want to build a line index so that, when two lines cannot yet be ordered from what you have read, you can go back and get more data from those lines (by seeking to the file position). When you go back, you only need to read the next sortable chunk to disambiguate the conflicting lines.
You may need to think about the file encoding; don't go straight to bytes unless you know it is one byte per char.
The built-in sort is not as fast as you'd like.
Side notes:
If you call GC.*, you've probably done it wrong.
Setting contents = null does not help you.
If you are using a foreach and maintaining the index yourself, you may be better off with a for (int i...) loop for readability.

Okay, let me give you a hint to help you with your homework. Loading the complete file into memory will, as you know, not work, because that is given as a precondition of the assignment. You need to find a way to lazily load the data from disk as you go and to throw data away as soon as possible. Because single lines could be too big, you will have to do this one char at a time.
Try creating a class that represents an abstraction over a line, for instance by wrapping the starting index and ending index of that line. When you let this class implement IComparable<T>, you can sort that line against other lines. Again, the trick is to be able to read characters from the file one at a time. You will need to work with Streams (File.Open) directly.
When you do this, you will be able to write your application code like this:
    List<FileLine> lines = GetLines("fileToSort.dat");
    lines.Sort();
    foreach (var line in lines)
    {
        line.AppendToFile("sortedFile.dat");
    }
Your task will be to implement GetLines(string path) and create the FileLine class.
Note that I assume that the actual number of lines will be small enough that the List<FileLine> will fit into memory (which means an approximate maximum of 40,000,000 lines). If the number of lines can be higher, you would need an even more flexible approach, but since you are talking about 20 to 30 lines, this should not be a problem.
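For readers who want to see the shape of such a class, here is one possible sketch. It is not the answerer's implementation and rests on assumptions: lines end with '\n', ordering is by raw byte value (which matches ordinal text order only for single-byte encodings), and GetLines is placed on FileLine itself as a static method. It reopens the file for every comparison, which is simple but slow.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// A line is represented only by where it lives in the file, never by its text.
public sealed class FileLine : IComparable<FileLine>
{
    private readonly string path;
    private readonly long start;   // offset of the first byte of the line
    private readonly long length;  // byte count, excluding the '\n' terminator

    public FileLine(string path, long start, long length)
    {
        this.path = path; this.start = start; this.length = length;
    }

    // Compares two lines byte by byte, reading both straight from disk.
    public int CompareTo(FileLine other)
    {
        if (other == null) return 1;
        using (var a = File.OpenRead(path))
        using (var b = File.OpenRead(other.path))
        {
            a.Seek(start, SeekOrigin.Begin);
            b.Seek(other.start, SeekOrigin.Begin);
            long common = Math.Min(length, other.length);
            for (long i = 0; i < common; i++)
            {
                int diff = a.ReadByte() - b.ReadByte();
                if (diff != 0) return diff;
            }
        }
        return length.CompareTo(other.length); // a prefix sorts first
    }

    // Appends this line plus '\n' to the target file, one buffer at a time.
    public void AppendToFile(string targetPath)
    {
        var buffer = new byte[81920];
        using (var source = File.OpenRead(path))
        using (var target = new FileStream(targetPath, FileMode.Append, FileAccess.Write))
        {
            source.Seek(start, SeekOrigin.Begin);
            long remaining = length;
            while (remaining > 0)
            {
                int read = source.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
                if (read == 0) break; // file is shorter than expected
                target.Write(buffer, 0, read);
                remaining -= read;
            }
            target.WriteByte((byte)'\n');
        }
    }

    // Scans the file once and records the boundaries of every line.
    public static List<FileLine> GetLines(string path)
    {
        var lines = new List<FileLine>();
        using (var stream = File.OpenRead(path))
        {
            long lineStart = 0, position = 0;
            int b;
            while ((b = stream.ReadByte()) != -1)
            {
                position++;
                if (b == '\n')
                {
                    lines.Add(new FileLine(path, lineStart, position - 1 - lineStart));
                    lineStart = position;
                }
            }
            if (position > lineStart) // last line without a trailing '\n'
                lines.Add(new FileLine(path, lineStart, position - lineStart));
        }
        return lines;
    }
}
```

With this sketch, the application code from the answer works once GetLines is called as FileLine.GetLines("fileToSort.dat"). Only the offsets are held in memory, so the size of an individual line no longer matters.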

Basically your approach is bull. You are violating a constraint of the homework you were given, and this constraint has been put there to make you think more.
As you said:
"I must implement external sort and show my teacher that it works for files bigger than my RAM"
OK, so how do you think you will ever read the file in? ;) This is there on purpose. ReadAllLines does NOT implement an incremental external sort. As a result, it blows up.

C# - Comparing two CSV Files and giving an output

I need a bit of help. I have two sources of information, and the information is exported to two different CSV files by different programs. They are supposed to contain the same information; however, this is what needs to be checked.
Therefore what I would like to do is as follows:
Take the information from the two files.
Compare it.
Output any differences and which file each difference was in (e.g. File A contained this but File B did not, and vice versa).
The files are 200,000-odd rows each, so this will need to be as efficient as possible.
I tried doing this with Excel, but that proved too complicated, and I'm really struggling to find a way to do it programmatically.

Assuming that the files are really supposed to be identical, right down to text qualifiers, ordering of rows, and number of rows contained in each file, the simplest approach may be to simply iterate through both files together and compare each line.
    using (StreamReader f1 = new StreamReader(path1))
    using (StreamReader f2 = new StreamReader(path2)) {
        var differences = new List<string>();
        int lineNumber = 0;
        while (!f1.EndOfStream) {
            if (f2.EndOfStream) {
                differences.Add("Differing number of lines - f2 has fewer.");
                break;
            }
            lineNumber++;
            var line1 = f1.ReadLine();
            var line2 = f2.ReadLine();
            if (line1 != line2) {
                differences.Add(string.Format("Line {0} differs. File 1: {1}, File 2: {2}", lineNumber, line1, line2));
            }
        }
        if (!f2.EndOfStream) {
            differences.Add("Differing number of lines - f1 has fewer.");
        }
    }

Depending on your answers to the comments on your question, if it doesn't really need to be done in code, you could do worse than download a compare tool, which is likely to be more sophisticated.
(WinMerge, for example.)

OK, for anyone else who googles this and finds it, here is what my answer was.
I exported the details to CSV and ordered them numerically on export for ease of use. Once they were exported as two CSV files, I used a program called Beyond Compare, which allows the files to be compared.
At first I used Beyond Compare manually to test that what I was exporting was correct, but Beyond Compare can also be driven from the command line. This means everything can be done programmatically; all that is left is for a user to view the results in Beyond Compare. You may be able to export the results to another CSV; I haven't looked, as the GUI of Beyond Compare is very nice and useful, so it is easier to use that.

C# Datatype for large sorted collection with position?

I am trying to compare two large datasets from a SQL query. Right now the SQL query is run externally and the results of each dataset are saved into their own CSV file. My little C# console application loads the two text/CSV files, compares them for differences, and saves the differences to a text file.
It's a very simple application that just loads all the data from the first file into an ArrayList and does a .compare() on the ArrayList as each line is read from the second CSV file. It then saves the records that don't match.
The application works, but I would like to improve the performance. I figure I can greatly improve performance if I take advantage of the fact that both files are sorted, but I don't know of a data type in C# that keeps order and would allow me to select a specific position. There's the basic array, but I don't know how many items are going to be in each list. I could have over a million records. Is there a data type available that I should be looking at?

If the data in both of your CSV files is already sorted and both files have the same number of records, you could skip the data structure entirely and do the analysis in place.
    // Verbatim strings (@"...") keep the backslashes in the paths from being read as escapes.
    using (StreamReader one = new StreamReader(@"C:\file1.csv"))
    using (StreamReader two = new StreamReader(@"C:\file2.csv"))
    using (StreamWriter differences = new StreamWriter("Output.csv"))
    {
        while (!one.EndOfStream)
        {
            string lineOne = one.ReadLine();
            string lineTwo = two.ReadLine();
            // Do your comparison; plain string equality is used here as a placeholder.
            bool areDifferent = lineOne != lineTwo;
            if (areDifferent)
                differences.WriteLine(lineOne + lineTwo);
        }
    }

System.Collections.Specialized.StringCollection allows you to add a range of values and, using the .IndexOf(string) method, to retrieve the index of an item.
That being said, you could likely just load up a couple of byte[] buffers from a FileStream and do a byte comparison. Don't even worry about loading that stuff into a formal data structure like StringCollection or string[]; if all you're doing is checking for differences and you want speed, I would reckon byte differences are where it's at.
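A byte-level check might look like the sketch below. It only answers whether and where the two files first diverge, which is an assumption about what "checking for differences" means here; FileComparer and FirstDifference are hypothetical names.

```csharp
using System.IO;

public static class FileComparer
{
    // Returns the offset of the first differing byte, or -1 if the files are identical.
    // If one file is a prefix of the other, the offset is the length of the shorter file.
    public static long FirstDifference(string pathA, string pathB)
    {
        using (var a = new BufferedStream(File.OpenRead(pathA), 1 << 16))
        using (var b = new BufferedStream(File.OpenRead(pathB), 1 << 16))
        {
            long offset = 0;
            while (true)
            {
                int byteA = a.ReadByte();
                int byteB = b.ReadByte();
                if (byteA != byteB) return offset; // includes one stream ending first (-1)
                if (byteA == -1) return -1;        // both streams ended together
                offset++;
            }
        }
    }
}
```

This never holds more than two small buffers in memory, but it cannot tell which records were inserted or removed; for that, a line-based comparison is still needed.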

This is an adaptation of David Sokol's code to work with a varying number of lines, outputting the lines that are in one file but not the other:
    using (StreamReader one = new StreamReader(@"C:\file1.csv"))
    using (StreamReader two = new StreamReader(@"C:\file2.csv"))
    using (StreamWriter differences = new StreamWriter("Output.csv"))
    {
        string lineOne = one.ReadLine();
        string lineTwo = two.ReadLine();
        // ReadLine returns null at the end of a file, so loop until both are exhausted.
        while (lineOne != null || lineTwo != null)
        {
            // A file that has run out of lines sorts after everything in the other file.
            int order = lineOne == null ? 1
                      : lineTwo == null ? -1
                      : string.CompareOrdinal(lineOne, lineTwo);
            if (order == 0)
            {
                // lines match, read next line from each and continue
                lineOne = one.ReadLine();
                lineTwo = two.ReadLine();
            }
            else if (order < 0)
            {
                // lineOne exists only in the first file
                differences.WriteLine(lineOne);
                lineOne = one.ReadLine();
            }
            else
            {
                // lineTwo exists only in the second file
                differences.WriteLine(lineTwo);
                lineTwo = two.ReadLine();
            }
        }
    }
Standard caveat about code written off the top of my head applies: this assumes both files are sorted in the same ordinal order that string.CompareOrdinal uses, but I think this basic approach should do what you're looking for.

Well, there are several approaches that would work. You could write your own data structure that did this. Or you can try and use SortedList. You can also return the DataSets in code, and then use .Select() on the table. Granted, you would have to do this on both tables.

You can easily use a SortedList to do fast lookups. If the data you are loading is already sorted, insertions into the SortedList should not be slow.
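As a rough illustration of the SortedList suggestion: the sketch below keys the list by the line text and stores the original line number as the value. That layout, the ordinal comparer, and the name SortedLookup are assumptions for the example, not something the answer specifies.

```csharp
using System;
using System.Collections.Generic;

public static class SortedLookup
{
    // Builds a SortedList keyed by the line text; the value is the original line number.
    // Assumption: duplicate lines keep the line number of their first occurrence.
    public static SortedList<string, int> Build(IEnumerable<string> sortedLines)
    {
        var list = new SortedList<string, int>(StringComparer.Ordinal);
        int lineNumber = 0;
        foreach (string line in sortedLines)
        {
            lineNumber++;
            if (!list.ContainsKey(line))
                list.Add(line, lineNumber); // appends at the end when input is already sorted
        }
        return list;
    }
}
```

list.IndexOfKey(line) then returns the position of a line (or -1 if it is absent) using a binary search, and list.Keys[i] returns the line at position i, which covers both "keeps order" and "select a specific position" from the question.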

If you are looking simply to see if all lines in FileA are included in FileB you could read it in and just compare streams inside a loop.
File 1
Entry1
Entry2
Entry3
File 2
Entry1
Entry3
You could loop through with two counters and find omissions, going line by line through each file and seeing if you get what you need.

Maybe I misunderstand, but the ArrayList will maintain its elements in the same order by which you added them. This means you can compare the two ArrayLists within one pass only - just increment the two scanning indices according to the comparison results.

One question I have is have you considered "out-sourcing" your comparison. There are plenty of good diff tools that you could just call out to. I'd be surprised if there wasn't one that let you specify two files and get only the differences. Just a thought.

I think the reason everyone has so many different answers is that you haven't quite specified your problem well enough for it to be answered. First off, it depends on what kind of differences you want to track. Do you want the differences to be output as in WinDiff, where the first file is the "original" and the second file is the "modified", so you can list changes as INSERT, UPDATE or DELETE? Do you have a primary key that will allow you to match up two lines as different versions of the same record (when fields other than the primary key are different)? Or is this some sort of reconciliation where you just want your difference output to say something like "RECORD IN FILE 1 AND NOT FILE 2"?
I think the answers to these questions will help everyone to give you a suitable answer to your problem.

If you have two files that are each a million lines, as mentioned in your post, you might be using up a lot of memory. Some of the performance problem might be that you are swapping to disk. If you are simply comparing line 1 of file A to line 1 of file B, line 2 of file A to line 2 of file B, and so on, I would recommend a technique that does not store so much in memory. You could read from two file streams, as a previous commenter posted, and write out your results "in real time" as you find them. This would not explicitly store anything in memory. You could also load chunks of each file into memory, say one thousand lines at a time, into something like a List. This could be fine-tuned to meet your needs.

To resolve question #1, I'd recommend looking into creating a hash of each line. That way you can compare hashes quickly and easily using a dictionary.
To resolve question #2, one quick and dirty solution would be to use an IDictionary<string, string>, with itemId as the first string type (the key) and the rest of the line as the second string type (the value). You can then quickly find whether an itemId exists and compare the lines. This of course assumes .NET 2.0+.
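A minimal sketch of the dictionary idea, assuming the itemId is everything before the first comma of a line and that fields contain no quoted commas. The dictionary hashes the key for you, so no hash has to be computed by hand. CsvReconciler and the report wording are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class CsvReconciler
{
    // Assumption: the itemId is everything before the first comma of a line.
    private static string KeyOf(string line)
    {
        int comma = line.IndexOf(',');
        return comma < 0 ? line : line.Substring(0, comma);
    }

    // Reports records that are missing from either file or whose contents differ.
    public static List<string> Compare(IEnumerable<string> fileA, IEnumerable<string> fileB)
    {
        var byId = new Dictionary<string, string>();
        foreach (string line in fileA)
            byId[KeyOf(line)] = line;

        var report = new List<string>();
        foreach (string line in fileB)
        {
            string id = KeyOf(line);
            string other;
            if (!byId.TryGetValue(id, out other))
                report.Add("RECORD IN FILE 2 AND NOT FILE 1: " + id);
            else if (other != line)
                report.Add("RECORD DIFFERS: " + id);
            byId.Remove(id); // whatever is left afterwards exists only in file A
        }
        foreach (string id in byId.Keys)
            report.Add("RECORD IN FILE 1 AND NOT FILE 2: " + id);
        return report;
    }
}
```

Only the first file is held in memory; the second can be streamed, for example with File.ReadLines.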
