I have a set of subfolders, three levels deep, with over 20k files in c:\MyData.
There is an almost identical set of subfolders on my E drive at e:\projects\massdata
I want to check C and delete from it anything that already exists in E (same folder name, same file name, same size).
What is my best way of traversing this folder structure?
How about using the Join operator, joining on the relative file name and size like this:
public void cleanUp()
{
var cFiles = Directory.GetFiles(@"c:\MyData", "*.*", SearchOption.AllDirectories);
var fFiles = Directory.GetFiles(@"e:\projects\massdata", "*.*", SearchOption.AllDirectories);
// Key = path relative to the given root plus the file size.
Func<string, string, Tuple<string, long>> keySelector = (path, root) =>
new Tuple<string, long>(path.Replace(root, ""), new FileInfo(path).Length);
// cFiles is the outer sequence, so its key selector must strip the C root.
foreach (var file in cFiles.Join(fFiles, c => keySelector(c, @"c:\MyData"), f => keySelector(f, @"e:\projects\massdata"), (c, f) => c))
{
File.Delete(file);
}
}
Second edit after update:
The key selector should now meet your requirements. If I've misunderstood them, it should be rather easy to see what you need to change. If not, drop a comment :)
Recursively go through all files in each directory.
For each file in E, create a string describing its relative path, file size, etc., and store it in a hash map. Then, when going through C, check whether a file's string exists in the map and delete the file if so.
The string could for instance be [FILENAME]##[FILESIZE]##[LASTEDITER].
Here is one way to search recursively in C#:
http://support.microsoft.com/kb/303974
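The hash-based approach described above could be sketched roughly as follows. This is only a minimal sketch, assuming the key is the relative path plus the file size; the names DuplicateCleaner, MakeKey and FindDuplicates are hypothetical, and the method only returns the candidate files so they can be reviewed before anything is deleted.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class DuplicateCleaner
{
    // Key = path relative to the root plus the file size, e.g. "sub\a.txt##10".
    // Assumes fullPath starts with root, which holds for paths returned by EnumerateFiles(root, ...).
    public static string MakeKey(string root, string fullPath, long size)
    {
        string relative = fullPath.Substring(root.Length).TrimStart('\\', '/');
        return relative + "##" + size;
    }

    // Returns the files under sourceRoot whose key also exists under referenceRoot.
    public static List<string> FindDuplicates(string sourceRoot, string referenceRoot)
    {
        // Case-insensitive comparison is an assumption that fits Windows file names.
        var referenceKeys = new HashSet<string>(
            Directory.EnumerateFiles(referenceRoot, "*", SearchOption.AllDirectories)
                     .Select(f => MakeKey(referenceRoot, f, new FileInfo(f).Length)),
            StringComparer.OrdinalIgnoreCase);

        return Directory.EnumerateFiles(sourceRoot, "*", SearchOption.AllDirectories)
                        .Where(f => referenceKeys.Contains(MakeKey(sourceRoot, f, new FileInfo(f).Length)))
                        .ToList();
    }
}
```

With the paths from the question, DuplicateCleaner.FindDuplicates(@"c:\MyData", @"e:\projects\massdata") would give the list of files in C that could then be passed to File.Delete one by one.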
I'm working on a program that is supposed to scan a specific directory looking for any directories within it that have specific names, and if it finds them, tell the user.
Currently, the way I am loading the names it's searching for is like this:
static string path = Path.Combine(Directory.GetCurrentDirectory(), @"database.txt");
static string[] database = File.ReadAllLines(path);
I am using this as an array of names to look for when looking through a specific directory. I am doing so with a foreach loop.
System.IO.DirectoryInfo di = new DirectoryInfo(@"C:\ExampleDirectory");
foreach (DirectoryInfo dir in di.GetDirectories())
{
}
Is there a way to see if any of the names in the file "database.txt" match any names of directories found within "C:\ExampleDirectory"?
The only way I can think of doing this is:
System.IO.DirectoryInfo di = new DirectoryInfo(versionspath);
foreach (DirectoryInfo dir in di.GetDirectories())
{
if(dir.Name == //Something...) {
Console.WriteLine("Match found!");
break;}
}
But this obviously won't work, and I cannot think of any other way to do this. Any help would be greatly appreciated!
Based on your other questions on Stack Overflow, I presume this is homework or you are a passionate hobby programmer, am I right? So I'll try to explain the principle here, continuing from your almost complete solution.
You will need a nested loop here, a loop in a loop. In the outer loop you iterate through the directories. You already got this one. For each directory you need to loop through the names in database to see if any item in it matches the name of the directory:
System.IO.DirectoryInfo di = new DirectoryInfo(versionspath);
foreach (DirectoryInfo dir in di.GetDirectories())
{
foreach (string name in database)
{
if (dir.Name == name)
{
Console.WriteLine("Match found!");
break;
}
}
}
Depending on your goal, you might want to exit at the first matching directory. The sample code above doesn't. The single break; statement only exits the inner loop, not the outer one, so the code continues to check the next directory. Try to figure out for yourself how to stop at the first match (by exiting the outer loop).
As usual, LINQ is the way to go. Whenever you have to find matches or non-matches between two lists that contain different types, you can use .Join() or .GroupJoin().
.Join() comes into play if you need to find a 1:1 relationship, and .GroupJoin() for any kind of 1-to-n relationship (1:0, 1:many, or also 1:1).
So, if you need the directories that match your list, this sounds like a job for the .Join() operator:
public static void Main(string[] args)
{
// Wherever this normally comes from.
string[] database = new[] { "fOo", "bAr" };
string startDirectory = @"D:\baseFolder";
// A method that returns an IEnumerable<string>
// Using maybe a recursive approach to get all directories and/or files
var candidates = LoadCandidates(startDirectory);
var matches = database.Join(
candidates,
// Simply pick the database entry as is.
dbEntry => dbEntry,
// Only take the last portion of the given path.
fullPath => Path.GetFileName(fullPath),
// Return only the full path from the given matching pair.
(dbEntry, fullPath) => fullPath,
// Ignore case on comparison.
StringComparer.OrdinalIgnoreCase);
foreach (var match in matches)
{
// Shows "D:\baseFolder\foo"
Console.WriteLine(match);
}
Console.ReadKey();
}
private static IEnumerable<string> LoadCandidates(string baseFolder)
{
return new[] { @"D:\baseFolder\foo", @"D:\basefolder\baz" };
//return Directory.EnumerateDirectories(baseFolder, "*", SearchOption.AllDirectories);
}
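For the 1-to-n case mentioned above, a .GroupJoin() version might look like the sketch below. It assumes the database entries are unique (ToDictionary throws on duplicate keys); GroupJoinExample and MatchAll are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class GroupJoinExample
{
    // For every database entry, collect all candidate paths whose last segment matches it.
    // Entries without any match end up with an empty list (the 1:0 case).
    public static Dictionary<string, List<string>> MatchAll(
        IEnumerable<string> database, IEnumerable<string> candidates)
    {
        return database
            .GroupJoin(
                candidates,
                dbEntry => dbEntry,
                fullPath => Path.GetFileName(fullPath),
                (dbEntry, paths) => new { dbEntry, paths = paths.ToList() },
                StringComparer.OrdinalIgnoreCase)
            .ToDictionary(x => x.dbEntry, x => x.paths);
    }
}
```

Unlike .Join(), this also reports the database entries that were not found at all, which can be useful when the user should be told about missing directories as well.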
You can use LINQ to do this
var allDirectoryNames = di.GetDirectories().Select(d => d.Name);
var matches = allDirectoryNames.Intersect(database);
if (matches.Any())
Console.WriteLine("Matches found!");
In the first line we get all the directory names; then we use the Intersect() method to see which ones are present in both allDirectoryNames and database.
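One thing to keep in mind is that Intersect() compares strings case-sensitively by default, while directory names on Windows are not case-sensitive. A small sketch of a case-insensitive variant, with NameMatcher and FindMatches as hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NameMatcher
{
    // Returns the directory names that also appear in the database, ignoring case.
    public static List<string> FindMatches(
        IEnumerable<string> directoryNames, IEnumerable<string> database)
    {
        return directoryNames
            .Intersect(database, StringComparer.OrdinalIgnoreCase)
            .ToList();
    }
}
```

It could be called as NameMatcher.FindMatches(di.GetDirectories().Select(d => d.Name), database).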
I'm developing a program that can find the differences in files between two folders, for instance. I've made a method which traverses the folder structure of a given folder and builds a tree with a node for each subfolder. Each node contains a list of the files in that folder, and each node has a number of children corresponding to the folders in that folder.
Now the problem is to find the files present in one tree but not in the other. I have a method, "private List<MyFile> Diff(Node index1, Node index2)", which should do this. But the problem is the way that I'm comparing the trees. Comparing two trees takes a huge amount of time: when each of the input nodes contains about 70,000 files, the Diff method takes about 3-5 minutes to complete.
I'm currently doing it this way:
private List<MyFile> Diff(Node index1, Node index2)
{
List<MyFile> DifferentFiles = new List<MyFile>();
List<MyFile> Index1Files = FindFiles(index1);
List<MyFile> Index2Files = FindFiles(index2);
List<MyFile> JoinedList = new List<MyFile>();
JoinedList.AddRange(Index1Files);
JoinedList.AddRange(Index2Files);
List<MyFile> JoinedListCopy = new List<MyFile>();
JoinedListCopy.AddRange(JoinedList);
List<string> ChecksumList = new List<string>();
foreach (MyFile m in JoinedList)
{
if (ChecksumList.Contains(m.Checksum))
{
JoinedListCopy.RemoveAll(x => x.Checksum == m.Checksum);
}
else
{
ChecksumList.Add(m.Checksum);
}
}
return JoinedListCopy;
}
And the Node class looks like this:
class Node
{
private string _Dir;
private Node _Parent;
private List<Node> _Children;
private List<MyFile> _Files;
}
Rather than doing lots of searching through List structures (which is quite slow), you can put all of the checksums into a HashSet, which can be searched much more efficiently.
private List<MyFile> Diff(Node index1, Node index2)
{
var Index1Files = FindFiles(index1);
var Index2Files = FindFiles(index2);
// these are the checksums that occur in both lists
var intersection = new HashSet<string>(Index1Files.Select(file => file.Checksum)
.Intersect(Index2Files.Select(file => file.Checksum)));
return Index1Files.Concat(Index2Files)
.Where(file => !intersection.Contains(file.Checksum))
.ToList();
}
How about:
public static IEnumerable<MyFile> FindUniqueFiles(IEnumerable<MyFile> index1, IEnumerable<MyFile> index2)
{
HashSet<string> hash = new HashSet<string>();
foreach (var file in index1.Concat(index2))
{
if (!hash.Add(file.Checksum))
{
hash.Remove(file.Checksum);
}
}
return index1.Concat(index2).Where(file => hash.Contains(file.Checksum));
}
This will work on the assumption that neither tree contains a duplicate within itself. Servy's answer will work in all instances.
Are you keeping the entire FileSystemObject for every element in the tree? If so I would think your memory overhead would be gigantic. Why not just use the filename or checksum and put that into a list, then do comparisons on that?
I can see that this is more than just a "distinct" function, what you are really looking for is all instances that only exist once in the JoinedListCopy collection, not simply a list of all distinct instances in the JoinedListCopy collection.
Servy has a very good answer. I would suggest a different approach, which uses some of LINQ's more interesting features (or at least I find them interesting).
var diff_Files = (from a in Index1Files
join b in Index2Files
on a.Checksum equals b.Checksum
where !(Index2Files.Contains(a) || Index1Files.Contains(b))
select a).ToList();
Another way to structure that "where", which might work better because the file instances might not actually be identical as far as object equality is concerned:
where !(Index2Files.Any(c => c.Checksum == a.Checksum) || Index1Files.Any(c => c.Checksum == b.Checksum))
This looks at the individual checksums rather than the entire file object instance.
The basic strategy is essentially what you are already doing, just a bit more efficient: join the collections and filter them against each other to make sure that you only get entries that are unique.
Another way to do this is to use the counting function in LINQ:
var diff_Files = JoinedListCopy.Where(a => JoinedListCopy.Count(b => b.Checksum == a.Checksum) == 1).ToList();
Nested LINQ isn't always the most efficient thing in the world, but that should work fairly well: it gets all instances that occur only once. I actually like this approach best, since it has the least chance of messing something up, but the join I used first might be more efficient.
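The counting idea can also be expressed with GroupBy, which avoids rescanning the whole list for every element. The following is only a sketch: the MyFile class here is a hypothetical stand-in with just a Name and a Checksum, and ChecksumDiff and UniqueByChecksum are made-up names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the MyFile class from the question.
public class MyFile
{
    public string Name { get; set; }
    public string Checksum { get; set; }
}

public static class ChecksumDiff
{
    // Keeps only the files whose checksum occurs exactly once across both lists.
    public static List<MyFile> UniqueByChecksum(
        IEnumerable<MyFile> first, IEnumerable<MyFile> second)
    {
        return first.Concat(second)
                    .GroupBy(f => f.Checksum)
                    .Where(g => g.Count() == 1)
                    .Select(g => g.First())
                    .ToList();
    }
}
```

Note that a checksum occurring twice inside the same tree is also treated as a duplicate here, which matches the behaviour of the Count(...) == 1 version.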
Hi, I'm a C# beginner and I'd like to write a simple program that goes through a folder and counts how many files are .mp3 files and how many are .flac files.
Like I said the program is very basic. It will ask for the music folder path and will then go through it. I know there will be a lot of subfolders in that main music folder so it will have to open them one at the time and go through them too.
E.g
C:/Music/
will be the given directory.
But it doesn't contain any music in itself.
To get to the music files the program would have to open subfolders like
C:/Music/Electronic/deadmau5/RandomAlbumTitle/
Only then can it count the .mp3 and .flac files and store the totals in two separate counters.
The program will have to do that for at least 2000 folders.
Do you know a good way or method to go through the files and return their names (and extensions)?
You can use System.IO.DirectoryInfo. DirectoryInfo provides a GetFiles method, which also has a recursive option, so if you're not worried about speed, you can do this:
DirectoryInfo di = new DirectoryInfo(@"C:\Music");
int numMP3 = di.GetFiles("*.mp3", SearchOption.AllDirectories).Length;
int numFLAC = di.GetFiles("*.flac", SearchOption.AllDirectories).Length;
Use DirectoryInfo and a grouping by the file extension:
var di = new DirectoryInfo(@"C:/Music/");
var extensionCounts = di.EnumerateFiles("*.*", SearchOption.AllDirectories)
.GroupBy(x => x.Extension)
.Select(g => new { Extension = g.Key, Count = g.Count() })
.ToList();
foreach (var group in extensionCounts)
{
Console.WriteLine("There are {0} files with extension {1}", group.Count,
group.Extension);
}
C# has a built-in method for searching for files in all subdirectories. Make sure you add a using statement for System.IO.
var path = "C:/Music/";
var files = Directory.GetFiles(path, "*.mp3", SearchOption.AllDirectories);
var count = files.Length;
Since you're a beginner you should hold off on the more flexible LINQ method until later.
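For a beginner-friendly version without LINQ, a plain loop with two counters might look like this. It is a sketch only; MusicCounter and Count are hypothetical names, and the extension check ignores case so that ".MP3" is counted as well.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class MusicCounter
{
    // Walks the given paths once and counts .mp3 and .flac files separately.
    public static void Count(IEnumerable<string> paths, out int mp3, out int flac)
    {
        mp3 = 0;
        flac = 0;
        foreach (string path in paths)
        {
            string extension = Path.GetExtension(path);
            if (string.Equals(extension, ".mp3", StringComparison.OrdinalIgnoreCase))
                mp3++;
            else if (string.Equals(extension, ".flac", StringComparison.OrdinalIgnoreCase))
                flac++;
        }
    }
}
```

To run it against the music folder, the paths could come from Directory.EnumerateFiles(@"C:\Music", "*", SearchOption.AllDirectories), which visits every subfolder in a single pass.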
int fileCount = Directory.GetFiles(_Path, "*.*", SearchOption.TopDirectoryOnly).Length;
Duplicate question: How to read File names recursively from subfolder using LINQ
Jon Skeet answered there with:
You don't need to use LINQ to do this - it's built into the framework:
string[] files = Directory.GetFiles(directory, "*.dll",
SearchOption.AllDirectories);
or if you're using .NET 4:
IEnumerable<string> files = Directory.EnumerateFiles(directory, "*.dll",
SearchOption.AllDirectories);
To be honest, LINQ isn't great in terms of recursion. You'd probably want to write your own general-purpose recursive extension method. Given how often this sort of question is asked, I should really do that myself some time...
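A general-purpose traversal extension method of the kind mentioned above could be sketched like this. To be clear, this is not the method from the quoted answer, just one possible shape for it; TraversalExtensions and Traverse are hypothetical names, and it uses an explicit stack instead of recursive calls.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TraversalExtensions
{
    // Depth-first walk: yields the root, then every descendant reachable via getChildren.
    public static IEnumerable<T> Traverse<T>(this T root, Func<T, IEnumerable<T>> getChildren)
    {
        var pending = new Stack<T>();
        pending.Push(root);
        while (pending.Count > 0)
        {
            T current = pending.Pop();
            yield return current;
            foreach (T child in getChildren(current))
                pending.Push(child);
        }
    }
}
```

For directories it might be used as new DirectoryInfo(directory).Traverse(d => d.EnumerateDirectories()).SelectMany(d => d.EnumerateFiles("*.dll")), which keeps the recursion logic in one reusable place.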
Here is the MSDN support page, How to recursively search directories by Visual C#.
Taken directly from that page:
void DirSearch(string sDir)
{
try
{
foreach (string d in Directory.GetDirectories(sDir))
{
foreach (string f in Directory.GetFiles(d, txtFile.Text))
{
lstFilesFound.Items.Add(f);
}
DirSearch(d);
}
}
catch (System.Exception excpt)
{
Console.WriteLine(excpt.Message);
}
}
You can use this code in addition to creating FileInfo objects. Once you have the file info objects you can check the Extension property to see if it matches the ones you care about.
MSDN has lots of information and examples, for example how you can iterate through a directory: http://msdn.microsoft.com/en-us/library/bb513869.aspx
I'm trying to iterate through all the files on a certain level in the folder hierarchy, more specifically, in all the sub-subfolders. Before I do the actual operations on the files, I also want to count them so I can show a progress bar. This means the iterating method must be called twice. This is the relevant code I'm using now:
void Iterate(bool count)
{
foreach (string dir in Directory.GetDirectories(root))
foreach (string subdir in Directory.GetDirectories(dir))
foreach (string file in Directory.GetFiles(subdir))
{
if (count) progressBar.Maximum++;
else
{
//do operations
}
}
}
I'm wondering if there's a better way of doing this. Surely there must be a better way than adding a foreach for every folder level..?
It'd be easier for me to use LINQ here.
var files =
(from dir in Directory.GetDirectories(root)
from subdir in Directory.GetDirectories(dir)
from f in Directory.GetFiles(subdir)
select f).ToList();
var fileCount = files.Count;
foreach (var f in files) {
...
}
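To avoid writing one loop (or one from clause) per folder level, the depth can be made a parameter. This is a sketch under the assumption that only files exactly a given number of levels below the root are wanted; DepthWalker and FilesAtDepth are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class DepthWalker
{
    // Returns the files located exactly `depth` folder levels below root
    // (depth 0 means the files directly inside root).
    public static IEnumerable<string> FilesAtDepth(string root, int depth)
    {
        if (depth == 0)
            return Directory.EnumerateFiles(root);

        return Directory.EnumerateDirectories(root)
                        .SelectMany(dir => FilesAtDepth(dir, depth - 1));
    }
}
```

For the sub-subfolder case in the question that would be DepthWalker.FilesAtDepth(root, 2).ToList(); the Count of that list can set progressBar.Maximum before looping over the same list, so the folders are only walked once.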
Before your GetFiles foreach, try this:
string [] fileEntries = Directory.GetFiles(subdir);
int intFileCount = fileEntries.Length;
Or it can replace it, if the loop only serves to count the files.
The documentation of Directory.GetFiles shows how to recursively iterate a directory tree.
You could download the fluent extensions I wrote for System.IO (see here: http://blog.staticvoid.co.nz/2011/11/staticvoid-io-extentions-nuget.html) and then use this LINQ statement:
var files = from d in di.Directories()
from dir in d.Directories()
from f in dir.Files()
select f;
Write this instead of your code:
string[] files = Directory.GetFiles("yourDirectory", "*.*", SearchOption.AllDirectories);
This returns all files in all subdirectories without you having to write the recursion yourself.
I have a folder with two files:
Awesome.File.20091031_123002.txt
Awesome.File.Summary.20091031_123152.txt
Additionally, a third-party app handles the files as follows:
Reads a folderPath and a searchPattern out of a database
Executes Directory.GetFiles(folderPath, searchPattern), processing whatever files match the filter in bulk, then moving the files to an archive folder.
It turns out that I have to move my two files into different archive folders, so I need to handle them separately by providing different searchPatterns to select them individually. Please note that I can't modify the third-party app, but I can modify the searchPattern and file destinations in my database.
What searchPattern will allow me to select Awesome.File.20091031_123002.txt without including Awesome.File.Summary.20091031_123152.txt?
If you were going to use LINQ then...
var regexTest = new Func<string, bool>(i => Regex.IsMatch(i, @"Awesome.File.(Summary)?.[\d]+_[\d]+.txt", RegexOptions.Compiled | RegexOptions.IgnoreCase));
var files = Directory.GetFiles(@"c:\path\to\folder").Where(regexTest);
Awesome.File.????????_??????.txt
The question mark (?) acts as a single character place holder.
I wanted to try my meager linq skills here... I'm sure there is a more elegant solution, but here's mine:
string pattern = ".SUMMARY.";
string[] awesomeFiles = System.IO.Directory.GetFiles("path\\to\\awesomefiles");
IEnumerable<string> sum_files = from file in awesomeFiles
where file.ToUpper().Contains(pattern)
select file;
IEnumerable<string> other_files = from file in awesomeFiles
where !file.ToUpper().Contains(pattern)
select file;
This assumes there aren't any other files in the directory other than the two, but you can adjust the pattern here to suit your needs (i.e. add "Awesome.File" to the pattern start.)
When you iterate the collection of each, you should get what you need.
According to the documentation, searchPattern only supports the * and ? wildcards. You would need to write your own regex filter that takes the results of Directory.GetFiles and applies further filtering logic.
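If the calling code can be changed (which is not the case for the third-party app in the question), such a regex filter might look like the sketch below. The pattern assumes the file names always follow the 8-digit date, underscore, 6-digit time layout shown in the question; PatternFilter and SelectNonSummary are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class PatternFilter
{
    // Matches "Awesome.File.<8 digits>_<6 digits>.txt" but not the ".Summary." variant.
    private static readonly Regex NonSummary =
        new Regex(@"^Awesome\.File\.\d{8}_\d{6}\.txt$", RegexOptions.IgnoreCase);

    // Applies the regex to the file name part of each path.
    public static IEnumerable<string> SelectNonSummary(IEnumerable<string> paths)
    {
        return paths.Where(p => NonSummary.IsMatch(Path.GetFileName(p)));
    }
}
```

It would be applied to the result of Directory.GetFiles(folderPath), for example PatternFilter.SelectNonSummary(Directory.GetFiles(folderPath)).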
If you don't want to use Linq, here's one way.
public void FileChecker(string filePath)
{
DirectoryInfo di = new DirectoryInfo(filePath);
int _MatchCounter = 0;
string RegexPattern = "^[a-zA-Z_a-zA-Z_a-zA-Z_0-9_0-9_0-9.csv]*$";
Regex RegexPatternMatch = new Regex(RegexPattern, RegexOptions.IgnoreCase);
foreach (FileInfo matchingFile in di.GetFiles())
{
Match m = RegexPatternMatch.Match(matchingFile.Name);
if ((m.Success))
{
MessageBox.Show(matchingFile.Name);
_MatchCounter += 1;
}
}
}