I'm trying to find some directories on a network drive.
I use Directory.EnumerateDirectories for this.
The problem is that it takes very long because there are many subdirectories.
Is there a way to make the function stop searching further down into subdirectories once a match is found, and carry on with the next directory at the same level?
static readonly Regex RegexValidDir = new ("[0-9]{4,}\\.[0-9]+$");
var dirs = Directory.EnumerateDirectories(startDir, "*.*", SearchOption.AllDirectories)
.Where(x => RegexValidDir.IsMatch(x));
The directory structure looks like this:
a\b\20220902.1\c\d\
a\b\20220902.2\c\d\e
a\b\x\20220902.3\
a\b\x\20221004.1\c\
a\b\x\20221004.2\c\
a\b\x\20221004.3\d\e\f\
...
a\v\w\x\20221104.1\c\d
a\v\w\x\20221105.1\c\d
a\v\w\x\20221106.1\c\d
a\v\w\x\20221106.2\c\d
a\v\w\x\20221106.3\c\d
a\v\w\x\20221106.4\
I'm only interested in the directories with a date in the name, and I want to stop searching further down into the subdirectories of a matching directory.
Another thing is I don't know if the search pattern I'm supplying (*.*) is correct for my usage scenario.
The directories are found relatively quickly, but it then takes another 11 minutes for the enumeration to complete.
I don't think it's possible to prune the enumeration efficiently with the built-in Directory.EnumerateDirectories method in SearchOption.AllDirectories mode. My suggestion is to write a custom recursive iterator that lets you select the children of each individual item:
static IEnumerable<T> Traverse<T>(IEnumerable<T> source,
    Func<T, IEnumerable<T>> childrenSelector)
{
    foreach (T item in source)
    {
        IEnumerable<T> children = childrenSelector(item);
        yield return item;
        if (children is null) continue; // null children = don't recurse below this item
        foreach (T child in Traverse(children, childrenSelector))
            yield return child;
    }
}
Then for the directories that match the date pattern, you can just return null children, effectively stopping the recursion for those directories:
IEnumerable<string> query = Traverse(new[] { startDir }, path =>
{
    if (RegexValidDir.IsMatch(path)) return null; // Stop recursion
    return Directory.EnumerateDirectories(path);
}).Where(path => RegexValidDir.IsMatch(path));
This query is slightly inefficient because the RegexValidDir pattern is matched twice for each path: once in the childrenSelector and once in the Where predicate. In case you want to optimize it, you could modify the Traverse method by replacing the childrenSelector with a more complex lambda that returns both the children and whether the item should be yielded by the iterator: Func<T, (IEnumerable<T>, bool)>. Or alternatively use the Traverse as is, with T being (string, bool) instead of string.
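Here is a sketch of that first option: a modified Traverse whose selector returns both the children and a yield flag, so the regex is evaluated only once per path (my own untested variant of the suggestion above):

static IEnumerable<T> Traverse<T>(IEnumerable<T> source,
    Func<T, (IEnumerable<T> Children, bool Yield)> selector)
{
    foreach (T item in source)
    {
        var (children, yieldItem) = selector(item);
        if (yieldItem) yield return item;
        if (children is null) continue; // pruned subtree
        foreach (T child in Traverse(children, selector))
            yield return child;
    }
}

IEnumerable<string> query = Traverse(new[] { startDir }, path =>
{
    if (RegexValidDir.IsMatch(path))
        return (null, true);  // match: yield it, don't descend
    return (Directory.EnumerateDirectories(path), false); // no match: descend, don't yield
});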
Related
I'm developing a program which is able to find the difference in files between two folders, for instance. I've made a method which traverses the folder structure of a given folder and builds a tree for each subfolder. Each node contains a list of files, which are the files in that folder. Each node has a number of children, which correspond to the folders in that folder.
Now the problem is to find the files present in one tree but not in the other. I have a method, private List<MyFile> Diff(Node index1, Node index2), which should do this. But the problem is the way I'm comparing the trees: it takes a huge amount of time. When each of the input nodes contains about 70,000 files, the Diff method takes about 3-5 minutes to complete.
I'm currently doing it this way:
private List<MyFile> Diff(Node index1, Node index2)
{
    List<MyFile> DifferentFiles = new List<MyFile>();
    List<MyFile> Index1Files = FindFiles(index1);
    List<MyFile> Index2Files = FindFiles(index2);
    List<MyFile> JoinedList = new List<MyFile>();
    JoinedList.AddRange(Index1Files);
    JoinedList.AddRange(Index2Files);
    List<MyFile> JoinedListCopy = new List<MyFile>();
    JoinedListCopy.AddRange(JoinedList);
    List<string> ChecksumList = new List<string>();
    foreach (MyFile m in JoinedList)
    {
        if (ChecksumList.Contains(m.Checksum))
        {
            JoinedListCopy.RemoveAll(x => x.Checksum == m.Checksum);
        }
        else
        {
            ChecksumList.Add(m.Checksum);
        }
    }
    return JoinedListCopy;
}
And the Node class looks like this:
class Node
{
    private string _Dir;
    private Node _Parent;
    private List<Node> _Children;
    private List<MyFile> _Files;
}
Rather than doing lots of searching through List structures (which is quite slow), you can put all of the checksums into a HashSet, which can be searched much more efficiently.
private List<MyFile> Diff(Node index1, Node index2)
{
    var Index1Files = FindFiles(index1);
    var Index2Files = FindFiles(index2);
    // these are the checksums present in both trees
    var intersection = new HashSet<string>(Index1Files.Select(file => file.Checksum)
        .Intersect(Index2Files.Select(file => file.Checksum)));
    return Index1Files.Concat(Index2Files)
        .Where(file => !intersection.Contains(file.Checksum))
        .ToList();
}
How about:
public static IEnumerable<MyFile> FindUniqueFiles(IEnumerable<MyFile> index1, IEnumerable<MyFile> index2)
{
    HashSet<string> hash = new HashSet<string>();
    foreach (var file in index1.Concat(index2))
    {
        // Add returns false if the checksum was already seen;
        // removing it then leaves only checksums seen exactly once
        if (!hash.Add(file.Checksum))
        {
            hash.Remove(file.Checksum);
        }
    }
    return index1.Concat(index2).Where(file => hash.Contains(file.Checksum));
}
This will work on the assumption that a single tree will not contain duplicate checksums. Servy's answer will work in all instances.
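For completeness, usage would look something like this (FindFiles being the tree-flattening helper from the question):

var unique = FindUniqueFiles(FindFiles(index1), FindFiles(index2)).ToList();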
Are you keeping the entire FileSystemObject for every element in the tree? If so, I would think your memory overhead would be gigantic. Why not just use the filename or checksum, put that into a list, and then do comparisons on that?
I can see that this is more than just a "distinct" operation: what you are really looking for is all instances that exist only once in the JoinedListCopy collection, not simply a list of all distinct instances in it.
Servy has a very good answer. I would suggest a different approach, which utilizes some of LINQ's more interesting features, or at least I find them interesting.
var diff_Files = (from a in Index1Files
                  join b in Index2Files
                      on a.Checksum equals b.Checksum
                  where !(Index2Files.Contains(a) || Index1Files.Contains(b))
                  select a).ToList();
Here is another way to structure that where clause, which might work better, since the file instances might not actually be identical as far as code equality is concerned:
where !(Index2Files.Any(c => c.Checksum == a.Checksum) || Index1Files.Any(c => c.Checksum == b.Checksum))
This looks at the individual checksums rather than the entire file object instance.
The basic strategy is essentially what you are already doing, just a bit more efficient: join the collections and filter them against each other to make sure that you only get entries that are unique.
Another way to do this is to use LINQ's Count:
var diff_Files = JoinedListCopy.Where(a => JoinedListCopy.Count(b => b.Checksum == a.Checksum) == 1).ToList();
Nested LINQ isn't always the most efficient thing in the world, but that should work fairly well: get all instances that occur only once. I actually like that approach best, since there's the least chance of messing something up, but the join I used first might be more efficient.
I am creating a program that recursively finds all the files and directories in the specified path. So one node may have child nodes if that node happens to be a directory.
Here is my Node class:
class Node
{
    public List<Node> Children = new List<Node>(); // if node is a directory, children will be the files and directories in this directory

    public FileSystemInfo Value { get; set; } // can either be a FileInfo or a DirectoryInfo

    public bool IsDirectory
    {
        get { return Value is DirectoryInfo; }
    }

    // HERE IS WHERE I AM HAVING PROBLEMS! I NEED TO RETRIEVE THE
    // SIZE OF DIRECTORIES AS WELL AS OF FILES.
    public long Size
    {
        get
        {
            long sum = 0;
            if (Value is FileInfo)
                sum += ((FileInfo)Value).Length;
            else
                sum += Children.Sum(x => x.Size);
            return sum;
        }
    }
    // this is the method I use to filter results in the tree
    public Node Search(Func<Node, bool> predicate)
    {
        // if node is a leaf
        if (this.Children.Count == 0)
        {
            if (predicate(this))
                return this;
            else
                return null;
        }
        else // otherwise, if node is not a leaf
        {
            var results = Children.Select(i => i.Search(predicate)).Where(i => i != null).ToList();
            if (results.Any()) // THIS IS HOW I REMOVE AND RECONSTRUCT THE TREE WITH A FILTER
            {
                var result = (Node)MemberwiseClone();
                result.Children = results;
                return result;
            }
            return null;
        }
    }
}
Thanks to that Node class I am able to display the tree: in one column I display the name of the directory or file, and on the right its size. The size is formatted as currency just because the commas help visualize it more clearly.
So now to my problem. The reason I wrote this program was to perform some advanced searches; for example, I may only want to search for files that have the ".txt" extension. But something goes wrong when I perform that filter on my tree.
(Note that I compile the search text to a function that takes a Node and returns a bool, and I pass that function to the Search method on my Node class to filter results. More information on how to dynamically compile code can be found at http://www.codeproject.com/Articles/10324/Compiling-code-during-runtime; anyway, that has nothing to do with this question.) The important part is that I removed all the nodes that did not match the criteria, and because I removed those nodes, the sizes of the directories changed!
So my question is: how can I filter results while maintaining the real size of each directory? I guess I would have to remove the Size property and replace it with a field. The problem with that is that every time I add to the tree I would have to update the sizes of all the parent directories, and that gets complex. Before starting to code it that way, I would appreciate your opinion on how I should implement the class.
Since you're using recursion and your weight is a node-level property, you can't expect it to keep summing correctly after you remove nodes. You can either promote the size to an upper level (the collection) or use an external counter within the recursion (one that counts regardless of the filter; you'll need to carry it through the recursion).
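One way to carry the real size through the filter, as a rough sketch: snapshot the unfiltered Size into a nullable field on the clone that Search produces, and let the Size getter prefer the snapshot. The _cachedSize field is my own addition, not part of the original class:

class Node
{
    public List<Node> Children = new List<Node>();
    public FileSystemInfo Value { get; set; }

    private long? _cachedSize; // set on filtered clones to freeze the pre-filter size

    public long Size
    {
        get
        {
            if (_cachedSize.HasValue) return _cachedSize.Value;
            return Value is FileInfo file ? file.Length : Children.Sum(c => c.Size);
        }
    }

    public Node Search(Func<Node, bool> predicate)
    {
        if (Children.Count == 0)
            return predicate(this) ? this : null;

        var results = Children.Select(c => c.Search(predicate)).Where(c => c != null).ToList();
        if (!results.Any()) return null;

        var clone = (Node)MemberwiseClone();
        clone._cachedSize = this.Size; // compute from the unfiltered children first
        clone.Children = results;
        return clone;
    }
}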
Anyway, why are you reimplementing core .NET functionality? Is there any reason beyond filtering and recursive search? Both are pretty well implemented in the BCL.
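For reference, the BCL pieces that comment alludes to can be combined like this (root is a placeholder path):

// recursive search, built in:
IEnumerable<string> txtFiles =
    Directory.EnumerateFiles(root, "*.txt", SearchOption.AllDirectories);

// filtering with LINQ over FileInfo objects:
IEnumerable<FileInfo> bigFiles = new DirectoryInfo(root)
    .EnumerateFiles("*", SearchOption.AllDirectories)
    .Where(f => f.Length > 1024 * 1024);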
When you want to recursively enumerate a hierarchical object, selecting some elements based on some criteria, there are numerous examples of techniques like "flattening" and then filtering using LINQ, like those found here: link text
But when you are enumerating something like the Controls collection of a Form, or the Nodes collection of a TreeView, I have been unable to use these types of techniques because they seem to require an argument (to the extension method) which is an IEnumerable collection: passing in SomeForm.Controls does not compile.
The most useful thing I found was this: link text
It gives you an extension method for Control.ControlCollection with an IEnumerable result you can then use with LINQ.
I've modified the above example to parse the Nodes of a TreeView with no problem.
public static IEnumerable<TreeNode> GetNodesRecursively(this TreeNodeCollection nodeCollection)
{
    foreach (TreeNode theNode in nodeCollection)
    {
        yield return theNode;
        if (theNode.Nodes.Count > 0)
        {
            foreach (TreeNode subNode in theNode.Nodes.GetNodesRecursively())
            {
                yield return subNode;
            }
        }
    }
}
This is the kind of code I'm writing now using the extension method:
var theNodes = treeView1.Nodes.GetNodesRecursively();

var filteredNodes =
(
    from n in theNodes
    where n.Text.Contains("1")
    select n
).ToList();
And I think there may be a more elegant way to do this where the constraint(s) are passed in.
What I want to know is whether it is possible to define such procedures generically, so that at run-time I can pass in the type of the collection, as well as the actual collection, via a generic parameter, making the code independent of whether it's a TreeNodeCollection or a Control.ControlCollection.
It would also interest me to know if there's any other way (cheaper? faster?) than that shown in the second link (above) to get a TreeNodeCollection or Control.ControlCollection into a form usable by LINQ.
A comment by Leppie about SelectMany in the SO post linked to first (above) seems like a clue.
My experiments with SelectMany have been: well, call them "disasters." :)
Appreciate any pointers. I have spent several hours reading every SO post I could find that touched on these areas, rambling my way into such exotica as the "y-combinator." A "humbling" experience, I might add :)
This code should do the trick
public static class Extensions
{
    public static IEnumerable<T> GetRecursively<T>(this IEnumerable collection,
        Func<T, IEnumerable> selector)
    {
        foreach (var item in collection.OfType<T>())
        {
            yield return item;

            IEnumerable<T> children = selector(item).GetRecursively(selector);
            foreach (var child in children)
            {
                yield return child;
            }
        }
    }
}
Here's an example of how to use it
TreeView view = new TreeView();
// ...
IEnumerable<TreeNode> nodes = view.Nodes
    .GetRecursively<TreeNode>(item => item.Nodes);
Update: In response to Eric Lippert's post.
Here's a much improved version using the technique discussed in All About Iterators.
public static class Extensions
{
    public static IEnumerable<T> GetItems<T>(this IEnumerable collection,
        Func<T, IEnumerable> selector)
    {
        Stack<IEnumerable<T>> stack = new Stack<IEnumerable<T>>();
        stack.Push(collection.OfType<T>());

        while (stack.Count > 0)
        {
            IEnumerable<T> items = stack.Pop();
            foreach (var item in items)
            {
                yield return item;

                IEnumerable<T> children = selector(item).OfType<T>();
                stack.Push(children);
            }
        }
    }
}
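Usage mirrors the earlier example; only the method name differs:

IEnumerable<TreeNode> nodes = view.Nodes.GetItems<TreeNode>(item => item.Nodes);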
I did a simple performance test. The results speak for themselves: the depth of the tree has only marginal impact on the performance of the second solution, whereas the performance of the first decreases rapidly, eventually leading to a StackOverflowException when the depth of the tree becomes too great.
You seem to be on the right track and the answers above have some good ideas. But I note that all these recursive solutions have some deep flaws.
Let's suppose the tree in question has a total of n nodes with a max tree depth of d <= n.
First off, they consume system stack space proportional to the depth of the tree. If the tree structure is very deep, this can blow the stack and crash the program. Tree depth d is O(lg n), depending on the branching factor of the tree. The worst case is no branching at all -- just a linked list -- in which case a tree with only a few hundred nodes will blow the stack.
Second, what you're doing here is building an iterator that calls an iterator that calls an iterator... so that every MoveNext() on the top iterator actually does a chain of calls that is again O(d) in cost. If you do this for every node, then the total cost in calls is O(nd), which is worst case O(n^2) and best case O(n lg n). You can do better than both; there's no reason this cannot be linear in time.
The trick is to stop using the small, fragile system stack to keep track of what to do next, and to start using a heap-allocated stack to explicitly keep track.
You should add to your reading list Wes Dyer's article on this:
https://blogs.msdn.microsoft.com/wesdyer/2007/03/23/all-about-iterators/
He gives some good techniques at the end for writing recursive iterators.
I'm not sure about TreeNodes, but you can make the Controls collection of a form IEnumerable by using System.Linq and, for example
var ts = (from t in this.Controls.OfType<TextBox>()
          where t.Name.Contains("fish")
          select t);
// will get all the textboxes whose Names contain "fish"
Sorry to say I don't know how to make this recursive, off the top of my head.
Based on mrydengren's solution:
public static IEnumerable<T> GetRecursively<T>(this IEnumerable collection,
    Func<T, IEnumerable> selector,
    Func<T, bool> predicate)
{
    foreach (var item in collection.OfType<T>())
    {
        if (!predicate(item)) continue;

        yield return item;

        IEnumerable<T> children = selector(item).GetRecursively(selector, predicate);
        foreach (var child in children)
        {
            yield return child;
        }
    }
}
var theNodes = treeView1.Nodes.GetRecursively<TreeNode>(
    x => x.Nodes,
    n => n.Text.Contains("1")).ToList();
Edit: for BillW
I guess you are asking for something like this.
public static IEnumerable<T> GetNodesRecursively<T, TCollection>(
    this TCollection nodeCollection, Func<T, TCollection> getSub)
    where TCollection : IEnumerable
{
    foreach (T theNode in nodeCollection)
    {
        yield return theNode;
        foreach (T subNode in getSub(theNode).GetNodesRecursively(getSub))
        {
            yield return subNode;
        }
    }
}

var all_controls = control.Controls
    .GetNodesRecursively<Control, Control.ControlCollection>(c => c.Controls)
    .ToList();
I've noticed that in my project, we frequently write recursive functions.
My question is: is there any way to create the recursive function as a generic function for each hierarchical structure that uses recursive iteration?
Maybe I can use a delegate that gets the root and the end flag of the recursion?
Any ideas?
Thanks.
Yes. The only thing you need is a delegate function that computes the list of children for each element. The recursion terminates when no children are returned.
delegate IEnumerable<TNode> ChildSelector<TNode>(TNode Root);

static IEnumerable<TNode> Traverse<TNode>(this TNode Root, ChildSelector<TNode> Children)
{
    // Visit current node (pre-order)
    yield return Root;
    // Visit children
    foreach (var Child in Children(Root))
        foreach (var el in Traverse(Child, Children))
            yield return el;
}
Example:
static void Main(string[] args)
{
    var Init = "..."; // some starting path
    var Data = Init.Traverse(Dir => Directory.GetDirectories(Dir, "*", SearchOption.TopDirectoryOnly));

    foreach (var Dir in Data)
        Console.WriteLine(Dir);

    Console.ReadKey();
}
I think what you want is a way to work with hierarchical structures in a generic way ("generic" as defined in English, not necessarily as defined in .Net). For example, this is something I wrote once when I needed to get all the Controls inside a Windows Form:
public static IEnumerable<T> SelectManyRecursive<T>(this IEnumerable<T> items, Func<T, IEnumerable<T>> selector)
{
    if (items == null)
        throw new ArgumentNullException("items");
    if (selector == null)
        throw new ArgumentNullException("selector");

    return SelectManyRecursiveInternal(items, selector);
}

private static IEnumerable<T> SelectManyRecursiveInternal<T>(this IEnumerable<T> items, Func<T, IEnumerable<T>> selector)
{
    foreach (T item in items)
    {
        yield return item;

        IEnumerable<T> subitems = selector(item);
        if (subitems != null)
        {
            foreach (T subitem in subitems.SelectManyRecursive(selector))
                yield return subitem;
        }
    }
}
// sample use: get the Text from some TextBoxes in the form
// (ControlCollection is non-generic, hence the Cast<Control>() calls)
var strings = form.Controls.Cast<Control>()
    .SelectManyRecursive(c => c.Controls.Cast<Control>()) // all controls, at any depth
    .OfType<TextBox>()                                    // filter by type
    .Where(c => c.Text.StartsWith("P"))                   // filter by text
    .Select(c => c.Text);
Another example: a Category class where each Category could have ChildCategories (same way a Control has a Controls collection) and assuming that rootCategory is directly or indirectly the parent of all categories:
// get all categories that are enabled
// (wrap the root in an array since the extension works on sequences)
var categories = from c in new[] { rootCategory }.SelectManyRecursive(c => c.ChildCategories)
                 where c.Enabled
                 select c;
I'm not sure exactly what your question is asking for, but a recursive function can be generic. There's no limitation on that. For instance:
int CountLinkedListNodes<T>(MyLinkedList<T> input)
{
    if (input == null) return 0;
    return 1 + CountLinkedListNodes<T>(input.Next);
}
An easier and also generic approach might be to cache the results of the function and invoke the "real" function only when the result is not yet known; the effectiveness of this approach depends on how frequently the same set of parameters recurs during your recursion.
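A generic memoization helper along those lines might look like this (a sketch; note the recursive calls must themselves go through the memoized delegate for the caching to pay off):

static Func<T, TResult> Memoize<T, TResult>(Func<T, TResult> f)
{
    var cache = new Dictionary<T, TResult>();
    return x =>
    {
        if (!cache.TryGetValue(x, out TResult result))
        {
            result = f(x);
            cache[x] = result; // remember the answer for this parameter
        }
        return result;
    };
}

// usage: classic Fibonacci, where the recursion re-enters the cached delegate
Func<int, long> fib = null;
fib = Memoize<int, long>(n => n < 2 ? n : fib(n - 1) + fib(n - 2));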
If you know Perl, you should check out the first 4 chapters of Higher-Order Perl, which are available as an ebook; the ideas presented are language-independent.
It sounds like your solution could successfully use the Visitor pattern.
You can create a specific variation of the Visitor pattern by creating a hierarchical visitor pattern.
It is a little too complex to discuss entirely here, but that should get you started on some research. The basic idea is that one class knows how to traverse the structure, while separate Visitor classes know how to process a particular node. That way you separate the traversal of the tree from the processing of its nodes.
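A minimal sketch of that idea (all names here are hypothetical, not from the question):

interface IVisitor<TNode>
{
    // return false to skip the subtree below this node
    bool Visit(TNode node);
}

static class TreeWalker
{
    // the traversal lives here; the processing lives in the visitor
    public static void Walk<TNode>(TNode node, Func<TNode, IEnumerable<TNode>> children,
        IVisitor<TNode> visitor)
    {
        if (!visitor.Visit(node)) return;
        foreach (var child in children(node))
            Walk(child, children, visitor);
    }
}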
I often use this recursive 'visitor' in F#
let rec visitor dir filter =
    seq { yield! Directory.GetFiles(dir, filter)
          for subdir in Directory.GetDirectories(dir) do
              yield! visitor subdir filter }
Recently I've started implementing some F# functionality in C#, and I'm trying to reproduce this as an IEnumerable, but I'm having difficulty getting any further than this:
static IEnumerable<string> Visitor(string root, string filter)
{
    foreach (var file in Directory.GetFiles(root, filter))
        yield return file;

    foreach (var subdir in Directory.GetDirectories(root))
        foreach (var file in Visitor(subdir, filter))
            yield return file;
}
What I don't understand is why I have to do a double foreach in the C# version for the recursion, but not in F#. Does seq {} implicitly do a concat?
yield! does a "flatten" operation: it integrates the sequence you pass it into the outer sequence, implicitly performing a foreach over each element of that sequence and yielding each one.
There is no simple way to do this.
You could work around this by defining a C# type that can store either one value or a sequence of values; using F# notation it would be:
type EnumerationResult<'a> =
| One of 'a
| Seq of seq<'a>
(translate this to C# in any way you like :-))
Now, you could write something like:
static IEnumerable<EnumerationResult<string>> Visitor(string root, string filter)
{
    foreach (var file in Directory.GetFiles(root, filter))
        yield return EnumerationResult.One(file);

    foreach (var subdir in Directory.GetDirectories(root))
        yield return EnumerationResult.Seq(Visitor(subdir, filter));
}
To use it, you'd have to write a function that flattens EnumerationResult, which could be an extension method in C# with the following signature:
IEnumerable<T> Flatten<T>(this IEnumerable<EnumerationResult<T>> res);
Now, this is the part where it gets tricky: if you implemented this in a straightforward way, it would still contain a foreach to iterate over the nested "Seq" results. However, I believe you could write an optimized version that wouldn't have quadratic complexity.
OK, I guess this is a topic for a blog post rather than something that can be fully described here :-), but hopefully it shows an idea that you can try following!
[EDIT: But of course, you can also use a naive implementation of "Flatten" that uses "SelectMany", just to make the syntax of your C# iterator code nicer.]
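For illustration, here is one possible C# translation of EnumerationResult, together with the naive SelectMany-based Flatten the edit mentions. The factory method names match the Visitor code above; everything else is my own guess at a translation:

class EnumerationResult<T>
{
    public T One { get; private set; }
    public IEnumerable<EnumerationResult<T>> Seq { get; private set; }
    public bool IsOne { get; private set; }

    internal EnumerationResult(T one, IEnumerable<EnumerationResult<T>> seq, bool isOne)
    {
        One = one; Seq = seq; IsOne = isOne;
    }
}

static class EnumerationResult
{
    public static EnumerationResult<T> One<T>(T item)
    {
        return new EnumerationResult<T>(item, null, true);
    }

    public static EnumerationResult<T> Seq<T>(IEnumerable<EnumerationResult<T>> items)
    {
        return new EnumerationResult<T>(default(T), items, false);
    }
}

static class EnumerationResultExtensions
{
    // naive Flatten: still recurses for every nested Seq, which is where
    // the quadratic cost the answer mentions comes from
    public static IEnumerable<T> Flatten<T>(this IEnumerable<EnumerationResult<T>> res)
    {
        return res.SelectMany(r => r.IsOne ? new[] { r.One } : r.Seq.Flatten());
    }
}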
In the specific case of retrieving all files under a specific directory, this overload of Directory.GetFiles works best:
static IEnumerable<string> Visitor( string root, string filter ) {
    return Directory.GetFiles( root, filter, SearchOption.AllDirectories );
}
In the general case of traversing a tree of enumerable objects, a nested foreach loop or equivalent is required (see also: All About Iterators).
Edit: Added an example of a function to flatten any tree into an enumeration:
static IEnumerable<T> Flatten<T>( T item, Func<T, IEnumerable<T>> next ) {
    yield return item;
    foreach( T child in next( item ) )
        foreach( T flattenedChild in Flatten( child, next ) )
            yield return flattenedChild;
}
This can be used to select all nested files, as before:
static IEnumerable<string> Visitor( string root, string filter ) {
    return Flatten( root, dir => Directory.GetDirectories( dir ) )
        .SelectMany( dir => Directory.GetFiles( dir, filter ) );
}
In C#, I use the following code for this kind of function:
public static IEnumerable<DirectoryInfo> TryGetDirectories(this DirectoryInfo dir) {
    return F.Swallow(() => dir.GetDirectories(), () => new DirectoryInfo[] { });
}

public static IEnumerable<DirectoryInfo> DescendantDirs(this DirectoryInfo dir) {
    return Enumerable.Repeat(dir, 1).Concat(
        from kid in dir.TryGetDirectories()
        where (kid.Attributes & FileAttributes.ReparsePoint) == 0
        from desc in kid.DescendantDirs()
        select desc);
}
This addresses IO errors (which inevitably happen, unfortunately) and avoids infinite loops due to symbolic links (in particular, you'll run into those when searching some directories on Windows 7).
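F.Swallow isn't defined in the answer; presumably it runs a function and substitutes the fallback when enumeration throws. A minimal sketch under that assumption:

static class F
{
    // run getter, falling back to fallback() if enumeration fails
    // (e.g. access denied, or the directory vanished mid-scan)
    public static T Swallow<T>(Func<T> getter, Func<T> fallback)
    {
        try
        {
            return getter();
        }
        catch (IOException)
        {
            return fallback();
        }
        catch (UnauthorizedAccessException)
        {
            return fallback();
        }
    }
}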