Fastest way to get difference between List<object> - c#

I'm developing a program which can find the difference in files between two folders. I've made a method which traverses the folder structure of a given folder and builds a tree for each subfolder. Each node contains a list of files, which are the files in that folder. Each node has a number of children, which correspond to the folders inside that folder.
Now the problem is to find the files present in one tree but not in the other. I have a method, private List<MyFile> Diff(Node index1, Node index2), which should do this. But the problem is the way I'm comparing the trees: comparing two trees takes a huge amount of time - when each of the input nodes contains about 70,000 files, the Diff method takes about 3-5 minutes to complete.
I'm currently doing it this way:
private List<MyFile> Diff(Node index1, Node index2)
{
    List<MyFile> DifferentFiles = new List<MyFile>();
    List<MyFile> Index1Files = FindFiles(index1);
    List<MyFile> Index2Files = FindFiles(index2);

    List<MyFile> JoinedList = new List<MyFile>();
    JoinedList.AddRange(Index1Files);
    JoinedList.AddRange(Index2Files);
    List<MyFile> JoinedListCopy = new List<MyFile>();
    JoinedListCopy.AddRange(JoinedList);

    List<string> ChecksumList = new List<string>();
    foreach (MyFile m in JoinedList)
    {
        if (ChecksumList.Contains(m.Checksum))
        {
            JoinedListCopy.RemoveAll(x => x.Checksum == m.Checksum);
        }
        else
        {
            ChecksumList.Add(m.Checksum);
        }
    }
    return JoinedListCopy;
}
And the Node class looks like this:
class Node
{
    private string _Dir;
    private Node _Parent;
    private List<Node> _Children;
    private List<MyFile> _Files;
}

Rather than doing lots of searching through List structures (which is quite slow), you can put all of the checksums into a HashSet, which can be searched much more efficiently.
private List<MyFile> Diff(Node index1, Node index2)
{
    var Index1Files = FindFiles(index1);
    var Index2Files = FindFiles(index2);

    // all of the checksums that appear in both trees
    var intersection = new HashSet<string>(Index1Files.Select(file => file.Checksum)
        .Intersect(Index2Files.Select(file => file.Checksum)));

    return Index1Files.Concat(Index2Files)
        .Where(file => !intersection.Contains(file.Checksum))
        .ToList();
}

How about:
public static IEnumerable<MyFile> FindUniqueFiles(IEnumerable<MyFile> index1, IEnumerable<MyFile> index2)
{
    HashSet<string> hash = new HashSet<string>();
    foreach (var file in index1.Concat(index2))
    {
        // Add returns false if the checksum is already in the set,
        // in which case the file exists in both trees and is dropped.
        if (!hash.Add(file.Checksum))
        {
            hash.Remove(file.Checksum);
        }
    }
    return index1.Concat(index2).Where(file => hash.Contains(file.Checksum));
}
This works on the assumption that a single tree will not contain duplicate checksums; Servy's answer will work in all instances.

Are you keeping the entire FileSystemObject for every element in the tree? If so, I would think your memory overhead would be gigantic. Why not just use the filename or checksum, put that into a list, and then do comparisons on that?
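For illustration, a minimal sketch of that idea; the type and member names here are hypothetical, not from the question - only the path and checksum are kept per file rather than the whole file system object:

// Hypothetical lightweight entry: just enough data for the comparison.
class FileEntry
{
    public string RelativePath { get; set; }
    public string Checksum { get; set; }
}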

I can see that this is more than just a "distinct" function: what you are really looking for is all instances that exist only once in the JoinedListCopy collection, not simply a list of all distinct instances.
Servy has a very good answer; I would suggest a different approach which utilizes some of LINQ's more interesting features, or at least ones I find interesting.
var diff_Files = (from a in Index1Files
                  join b in Index2Files
                      on a.Checksum equals b.Checksum
                  where !(Index2Files.Contains(a) || Index1Files.Contains(b))
                  select a).ToList();
Another way to structure that where clause, which might work better since the file instances might not actually be identical as far as reference equality is concerned:

where !(Index2Files.Any(c => c.Checksum == a.Checksum) || Index1Files.Any(c => c.Checksum == b.Checksum))

This looks at the individual checksums rather than the entire file object instance.
The basic strategy is essentially what you are already doing, just a bit more efficient: join the collections and filter them against each other so that you only get entries that are unique.
Another way to do this is to use LINQ's counting ability:

var diff_Files = JoinedListCopy.Where(a => JoinedListCopy.Count(b => b.Checksum == a.Checksum) == 1).ToList();

Nested LINQ isn't always the most efficient thing in the world, but that should work fairly well: it gets all instances that occur only once. I actually like this approach best (least chance of messing something up), though the join I used first might be more efficient.
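If that nested Count does turn out to be a bottleneck, a sketch of an alternative (same JoinedListCopy list and Checksum property as above) is to group by checksum once, which avoids rescanning the whole list for every element:

var diff_Files = JoinedListCopy
    .GroupBy(f => f.Checksum)       // one pass to bucket files by checksum
    .Where(g => g.Count() == 1)     // keep checksums that occur exactly once
    .SelectMany(g => g)             // unwrap the single-element groups
    .ToList();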

Related

Poor performance in tree pruning

I made what I'm calling a TreePruner. Its purpose: given a hierarchy starting at a list of root-level nodes, return a new hierarchy where the new root nodes are the highest-level nodes that meet a certain condition. Here is my class:
public class BreadthFirstPruner<TResource>
{
    private IEnumerable<TResource> originalList;
    private IEnumerable<TResource> prunedList;
    private Func<TResource, ICollection<TResource>> getChildren;

    public BreadthFirstPruner(IEnumerable<TResource> list, Func<TResource, ICollection<TResource>> getChildren)
    {
        this.originalList = list;
        this.getChildren = getChildren;
    }

    public IEnumerable<TResource> GetPrunedTree(Func<TResource, bool> condition)
    {
        this.prunedList = new List<TResource>();
        this.Prune(this.originalList, condition);
        return this.prunedList;
    }

    private void Prune(IEnumerable<TResource> list, Func<TResource, bool> condition)
    {
        if (list.Count() == 0)
        {
            return;
        }

        var included = list.Where(condition);
        this.prunedList = this.prunedList.Union(included);

        var excluded = list.Except(included);
        this.Prune(excluded.SelectMany(this.getChildren), condition);
    }
}
The class does what it's supposed to, but it does so slowly, and I can't figure out why. I've used this on very small hierarchies where the complete hierarchy is already in memory (so there should be no LINQ-to-SQL surprises). But regardless of how eager or lazy I try to make things, the first line of code that actually evaluates the results of a LINQ expression winds up taking 3-4 seconds to execute.
Here is the code that's currently consuming the pruner:
Func<BusinessUnitLabel, ICollection<BusinessUnitLabel>> getChildren = l => l.Children;
var hierarchy = scope.ToList();
var pruner = new BreadthFirstPruner<BusinessUnitLabel>(hierarchy, getChildren);

Func<BusinessUnitLabel, bool> hasBusinessUnitsForUser = l =>
    l.BusinessUnits.SelectMany(bu => bu.Users.Select(u => u.IDGUID)).Contains(userId);

var labels = pruner.GetPrunedTree(hasBusinessUnitsForUser).ToList();
As I stated previously, the dataset I'm working with when this executes is quite small: only a few levels deep, with only one node on most levels. As it's currently written, the slowness occurs on the first recursive call to Prune, when I call list.Count(), because that's when the second level of the hierarchy (excluded.SelectMany(this.getChildren)) is being evaluated.
If, however, I add a .ToList call like so:
var included = list.Where(condition).ToList();
Then the slowness will occur at that point.
What do I need to do to make this thing go fast?
Update
After someone prompted me to reevaluate my condition more carefully, I realized that the associations in hasBusinessUnitsForUser were not being eager loaded. That was the problem.
These calls are all lazily executed and the results are not cached/materialized:
var included = list.Where(condition);
this.prunedList = this.prunedList.Union(included);
var excluded = list.Except(included);
Even in this snippet, included is evaluated twice. Since this is a recursive algorithm, there might be many more invocations.
Add a ToList call to any sequence that might be executed more than once.
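Applied to the Prune method above, a sketch of that advice might look like this (each reused sequence materialized exactly once):

private void Prune(IEnumerable<TResource> list, Func<TResource, bool> condition)
{
    var items = list.ToList();                        // evaluate the incoming sequence once
    if (items.Count == 0)
    {
        return;
    }

    var included = items.Where(condition).ToList();   // cached: used twice below
    this.prunedList = this.prunedList.Union(included).ToList();

    var excluded = items.Except(included).ToList();
    this.Prune(excluded.SelectMany(this.getChildren), condition);
}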

Most appropriate way to construct a File and Directory class in order to easily filter results when placing them on a tree

I am creating a program that recursively finds all the files and directories in the specified path. So one node may have other nodes if that node happens to be a directory.
Here is my Node class:
class Node
{
    // If this node is a directory, then its children are the files
    // and directories in that directory.
    public List<Node> Children = new List<Node>();

    public FileSystemInfo Value { get; set; } // can either be a FileInfo or a DirectoryInfo

    public bool IsDirectory
    {
        get { return Value is DirectoryInfo; }
    }

    // HERE IS WHERE I AM HAVING PROBLEMS! I NEED TO RETRIEVE THE
    // SIZE OF DIRECTORIES AS WELL AS FILES.
    public long Size
    {
        get
        {
            long sum = 0;
            if (Value is FileInfo)
                sum += ((FileInfo)Value).Length;
            else
                sum += Children.Sum(x => x.Size);
            return sum;
        }
    }

    // This is the method I use to filter results in the tree.
    public Node Search(Func<Node, bool> predicate)
    {
        // if node is a leaf
        if (this.Children.Count == 0)
        {
            if (predicate(this))
                return this;
            else
                return null;
        }
        else // otherwise node is not a leaf
        {
            var results = Children.Select(i => i.Search(predicate)).Where(i => i != null).ToList();
            if (results.Any()) // THIS IS HOW I REMOVE AND RECONSTRUCT THE TREE WITH A FILTER
            {
                var result = (Node)MemberwiseClone();
                result.Children = results;
                return result;
            }
            return null;
        }
    }
}
Thanks to that Node class I am able to display the tree: in one column I display the name of the directory or file, and on the right the size. The size is formatted as currency just because the commas help visualize it more clearly.
Now, my problem: the reason I wrote this program was to perform some advanced searches. For example, I may only want to search for files that have the ".txt" extension. If I perform that filter on my tree I will get:
(Note that I compile the text to a function that takes a Node and returns a bool, and I pass that method to the Search method on my Node class in order to filter results. More information on how to dynamically compile code can be found at: http://www.codeproject.com/Articles/10324/Compiling-code-during-runtime) Anyway, that has nothing to do with this question. The important part is that I removed all the nodes that did not match the criteria, and because I removed those nodes the sizes of the directories changed!!!
So my question is: how will I be able to filter results while maintaining the real size of each directory? I guess I will have to remove the Size property and replace it with a field. The problem with that is that every time I add to the tree I will have to update the size of all the parent directories, and that gets complex. Before starting to code it that way, I would appreciate your opinion on how I should implement the class.
Since you're using recursion and your size is a node-level property, you can't expect it to keep summing correctly after you remove nodes. You can either promote it to an upper level (the collection) or use an external counter within the recursion (one that counts independently of the filter; you'll need to carry it through the recursion).
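A sketch of the "promote it" option: compute the size once while the tree is built and store it in a field, so the figure survives filtering. CachedSize and ComputeSize are illustrative additions, not part of the original class:

class Node
{
    public List<Node> Children = new List<Node>();
    public FileSystemInfo Value { get; set; }

    // Set once at build time; MemberwiseClone copies this field,
    // so filtered copies of a node keep the original directory size.
    public long CachedSize { get; private set; }

    public long ComputeSize()
    {
        CachedSize = Value is FileInfo
            ? ((FileInfo)Value).Length
            : Children.Sum(c => c.ComputeSize());
        return CachedSize;
    }
}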
Anyway, why are you reimplementing core .NET functionality? Is there any reason beyond filtering or recursive search? Both are pretty well covered in the BCL.
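For example, a recursive search and a directory size can both be had from the BCL directly (a sketch; the path is hypothetical):

var root = new DirectoryInfo(@"C:\some\path");

// All .txt files below root, recursively, without a custom tree:
var txtFiles = root.EnumerateFiles("*.txt", SearchOption.AllDirectories);

// Total size of everything below root:
long totalSize = root.EnumerateFiles("*", SearchOption.AllDirectories)
                     .Sum(f => f.Length);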

Efficient and Accurate way to add items into a TList

I have two lists whose entities have IDs: for instance Client.ID, where ID is a property of Client, and PopulationClient.ID, where ID is a property of the class PopulationClient. So I have two lists:
TList<Client> clients = clientsHelper.GetAllClients();
TList<PopulationClient> populationClients = populationHelper.GetAllPopulationClients();
So then I have a temp List
TList<Client> temp_list = new TList<Client>();
The problem I am having is doing this efficiently and correctly. This is what I have tried, but I am not getting the correct results:
foreach (PopulationClient pClients in populationClients)
{
    foreach (Client client in clients)
    {
        if (pClients.ID != client.ID && !InTempList(temp_list, pClients.ID))
        {
            temp_list.Add(client);
        }
    }
}

public bool InTempList(TList<Client> c, int id)
{
    bool IsInList = false;
    foreach (Client client in c)
    {
        if (client.ID == id)
        {
            IsInList = true;
        }
    }
    return IsInList;
}
While I am trying to do it right, I cannot come up with a good way of doing it. This is not returning the correct data because of the condition in the first loop: at some point one ID differs from the other, so the client gets added anyway. What constraints do you think I should check here so that I only end up with a list of clients that are in populationClients but not in clients?
For instance, populationClients might have 4 clients and clients 2; those 2 are also in populationClients, but I need to get a list of the population clients not in clients.
Any help or pointers would be appreciated.
First, let's concentrate on getting the right results, and then we'll optimize.
Consider your nested loops: you will get too many positives, because in most (pclient, client) pairs the IDs wouldn't match. I think you wanted to code it like this:
foreach (PopulationClient pClients in populationClients)
{
    if (!InTempList(clients, pClients.ID) && !InTempList(temp_list, pClients.ID))
    {
        // note: temp_list would need to be a TList<PopulationClient> for this
        temp_list.Add(pClients);
    }
}
Now for the efficiency of that code: InTempList uses linear search through lists. This is not efficient - consider using structures that are faster to search, for example, hash sets.
If I understand what you're looking for, here is a way to do it with LINQ...
tempList = populationList.Where(p => !clientList.Any(p2 => p2.ID == p.ID));
Just to offer another LINQ-based answer... I think your intent is to populate tempList based on all the items in 'clients' (returned from GetAllClients) that don't show up (based on 'ID' value) in the populationClients collection.
If that's the case, then I'm going to assume that populationClients is sufficiently large to warrant doing a hash-based lookup (if it's less than 10 items, the linear scan may not be a big deal, for instance).
So we want a fast-lookup version of all the ID values from the populationClients collection:
var populationClientIDs = populationClients.Select(pc => pc.ID);
var populationClientIDHash = new HashSet<int>(populationClientIDs);
Now that we have the ID values we want to ignore in a fast lookup data structure, we can then use that as a filter for the clients:
var filteredClients = clients.Where(c => populationClientIDHash.Contains(c.ID) == false);
Based on the usage/need, you could either populate the tempList from 'filteredClients', or do a ToList, or whatever.

Is there a LINQ method to join/concat an unknown number of lists?

I have an object that contains a list of child objects, each of which in turn contains a list of children, and so on. Using that first generation of children only, I want to combine all those lists as cleanly and cheaply as possible. I know I can do something like
public List<T> UnifiedListOfTChildren<T>()
{
    List<T> newlist = new List<T>();
    foreach (var childThing in myChildren)
    {
        newlist.AddRange(childThing.TChildren);
    }
    return newlist;
}
but is there a more elegant, less expensive LINQ method I'm missing?
EDIT: If you've landed at this question the same way I did and are new to SelectMany, I strongly recommend this visual explanation of how to use it. It comes up near the top of Google results currently, but is worth skipping straight to.
var newList = myChildren.SelectMany(c => c.TChildren);

Build Tree more efficiently?

I was wondering if this code is good enough or if there are glaring newbie no-no's.
Basically I'm populating a TreeView listing all Departments in my database. Here is the Entity Framework model:
Here is the code in question:
private void button1_Click(object sender, EventArgs e)
{
    DepartmentRepository repo = new DepartmentRepository();
    var parentDepartments = repo.FindAllDepartments()
        .Where(d => d.IDParentDepartment == null)
        .ToList();

    foreach (var parent in parentDepartments)
    {
        TreeNode node = new TreeNode(parent.Name);
        treeView1.Nodes.Add(node);

        var children = repo.FindAllDepartments()
            .Where(x => x.IDParentDepartment == parent.ID)
            .ToList();
        foreach (var child in children)
        {
            node.Nodes.Add(child.Name);
        }
    }
}
EDIT:
Good suggestions so far. Working with the entire collection makes sense, I guess. But what happens if the collection is huge, as in 200,000 entries? Wouldn't this break my software?
DepartmentRepository repo = new DepartmentRepository();
var entries = repo.FindAllDepartments();
var parentDepartments = entries
    .Where(d => d.IDParentDepartment == null)
    .ToList();

foreach (var parent in parentDepartments)
{
    TreeNode node = new TreeNode(parent.Name);
    treeView1.Nodes.Add(node);

    var children = entries.Where(x => x.IDParentDepartment == parent.ID)
        .ToList();
    foreach (var child in children)
    {
        node.Nodes.Add(child.Name);
    }
}
Since you are getting all of the departments anyway, why not fetch them in one query and then execute queries against the in-memory collection instead of the database? That would be much more efficient.
In a more general sense, any database model that is recursive can lead to issues, especially if this could end up being a fairly deep structure. One possible thing to consider would be for each department to store all of its ancestors, so that you would be able to get them all at once instead of having to query for them recursively.
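One common shape for that idea is a materialized path: each row stores the chain of ancestor IDs, so a whole subtree comes back from a single predicate. A sketch, assuming a hypothetical Path column holding values such as "1/4/9/":

// All descendants of 'parent' in one query, no recursion:
var descendants = departments
    .Where(d => d.Path.StartsWith(parent.Path) && d.ID != parent.ID)
    .ToList();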
In light of your edit, you might want to consider an alternative database schema that scales to handle very large tree structures.
There's an explanation on the FogBugz blog on how they handle hierarchies. They also link to this article by Joe Celko for more information.
Turns out there's a pretty cool solution for this problem, explained by Joe Celko. Instead of attempting to maintain a bunch of parent/child relationships all over your database - which would necessitate recursive SQL queries to find all the descendants of a node - we mark each case with a "left" and "right" value calculated by traversing the tree depth-first and counting as we go. A node's "left" value is set whenever it is first seen during traversal, and the "right" value is set when walking back up the tree away from the node. A picture probably makes more sense:
The Nested Set SQL model lets us add case hierarchies without sacrificing performance.
How does this help? Now we just ask for all the cases with a "left" value between 2 and 9 to find all of the descendants of B in one fast, indexed query. Ancestors of G are found by asking for nodes with "left" less than 6 (G's own "left") and "right" greater than 6. It works in all databases and greatly increases performance, particularly when querying large hierarchies.
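Expressed in C# against an in-memory collection, the two queries from the quote look roughly like this (a sketch, assuming each node carries the Left and Right values described above, with b and g being the nodes for B and G):

// Descendants of B: every node whose interval lies inside B's.
var descendantsOfB = nodes.Where(n => n.Left > b.Left && n.Right < b.Right);

// Ancestors of G: every node whose interval encloses G's Left value.
var ancestorsOfG = nodes.Where(n => n.Left < g.Left && n.Right > g.Left);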
Assuming that you are getting the data from a database, the first thing that comes to mind is that you are going to be hitting the database n+1 times, once for each parent you have in the database. You should try to get the whole tree structure out in one hit.
Secondly, you seem to have a handle on patterns, seeing as you appear to be using the repository pattern, so you might want to look at IoC. It allows you to inject your dependency on a particular object, such as your repository, into the class where it is going to be used, allowing for easier unit testing.
Thirdly, regardless of where you get your data from, move the structuring of the data into a tree into a service which returns an object containing all your departments already organised (this basically becomes a DTO). This will help you reduce code duplication.
With anything, you need to apply the YAGNI principle. This basically says that you should only do something if you are going to need it, so if the code you have provided above is complete, needs no further work, and is functional, don't touch it. The same goes for the performance issue of select n+1: if you are not seeing any performance hits, don't do anything, as it may be premature optimization.
In your edit
DepartmentRepository repo = new DepartmentRepository();
var entries = repo.FindAllDepartments();
var parentDepartments = entries.Where(d => d.IDParentDepartment == null).ToList();

foreach (var parent in parentDepartments)
{
    TreeNode node = new TreeNode(parent.Name);
    treeView1.Nodes.Add(node);

    var children = entries.Where(x => x.IDParentDepartment == parent.ID).ToList();
    foreach (var child in children)
    {
        node.Nodes.Add(child.Name);
    }
}
You still have an n+1 issue. This is because the data is only retrieved from the database when you call ToList() or when you iterate over the enumeration. This would be better:
var entries = repo.FindAllDepartments().ToList();
var parentDepartments = entries.Where(d => d.IDParentDepartment == null);

foreach (var parent in parentDepartments)
{
    TreeNode node = new TreeNode(parent.Name);
    treeView1.Nodes.Add(node);

    var children = entries.Where(x => x.IDParentDepartment == parent.ID);
    foreach (var child in children)
    {
        node.Nodes.Add(child.Name);
    }
}
That looks OK to me, but think about a collection of hundreds of thousands of nodes. The best way to handle that is asynchronous loading - notice that you don't necessarily have to load all elements at the same time. Your tree view can be collapsed by default, and you can load additional levels as the user expands the tree's nodes. Consider this case: you have a root node containing 100 nodes, and each of these nodes contains at least 1,000 nodes. 100 * 1000 = 100,000 nodes to load - pretty much, isn't it? To reduce the database traffic you can first load your first 100 nodes and then, when the user expands one of them, load its 1,000 nodes. That will save a considerable amount of time.
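A common WinForms sketch of that on-demand pattern: give each unexpanded node a dummy child so the expand glyph appears, then swap in the real children from BeforeExpand. Here LoadChildDepartments is a hypothetical data-access call, and parent stands for a top-level department row:

// At load time: add the node with a placeholder child.
var node = new TreeNode(parent.Name) { Tag = parent.ID };
node.Nodes.Add("loading...");
treeView1.Nodes.Add(node);

// On first expand: replace the placeholder with the real children.
treeView1.BeforeExpand += (s, e) =>
{
    if (e.Node.Nodes.Count == 1 && e.Node.Nodes[0].Text == "loading...")
    {
        e.Node.Nodes.Clear();
        foreach (var child in LoadChildDepartments((int)e.Node.Tag))
            e.Node.Nodes.Add(new TreeNode(child.Name) { Tag = child.ID });
    }
};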
Things that come to mind:
It looks like .ToList() is needless. If you are simply iterating over the returned result, why bother with the extra step?
Move this function into its own thing and out of the event handler.
As others have said, you could get the whole result in one call. Sort by IDParentDepartment so that the null ones are first. That way, you should be able to get the list of departments in one call and iterate over it only once, adding child departments to already-existing parent ones (see the sketch after this answer).
Wrap the TreeView modifications with:
treeView.BeginUpdate();
// modify the tree here.
treeView.EndUpdate();
To get better performance.
Pointed out here by jgauffin
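A sketch of the single-pass idea from the third point, assuming the two-level parent/child shape from the question and an int? IDParentDepartment; the lookup dictionary is the illustrative part:

var all = repo.FindAllDepartments().ToList();

var byId = new Dictionary<int, TreeNode>();
treeView1.BeginUpdate();
// Parents (null IDParentDepartment) sort first, so each child's
// parent node already exists by the time the child is reached.
foreach (var dept in all.OrderBy(d => d.IDParentDepartment.HasValue))
{
    var node = new TreeNode(dept.Name);
    byId[dept.ID] = node;
    if (dept.IDParentDepartment == null)
        treeView1.Nodes.Add(node);
    else
        byId[dept.IDParentDepartment.Value].Nodes.Add(node);
}
treeView1.EndUpdate();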
This should use only one (albeit possibly large) call to the database:
Departments.Join(
        Departments,
        x => x.IDParentDepartment,    // child's parent ID...
        x => (int?)x.ID,              // ...matched to the parent's own ID
        (o, i) => new { Child = o, Parent = i })
    .GroupBy(x => x.Parent)
    .Map(x =>
    {
        var node = new TreeNode(x.Key.Name);
        x.Map(y => node.Nodes.Add(y.Child.Name));
        treeView1.Nodes.Add(node);
    });
Where 'Map' is just a 'ForEach' for IEnumerables:
public static void Map<T>(this IEnumerable<T> source, Action<T> func)
{
    foreach (T i in source)
        func(i);
}
Note: This will still not help if the Departments table is huge, as 'Map' materializes the result of the SQL statement much like 'ToList()' does. You might consider Piotr's answer.
In addition to Bronumski's and Keith Rousseau's answers: also store the DepartmentID in the nodes' Tag, so that you don't have to re-query the database to get the department ID.
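In code, that is just a matter of setting Tag when the node is created, e.g.:

var node = new TreeNode(parent.Name) { Tag = parent.ID };
// Later the ID is recovered without touching the database:
int departmentId = (int)node.Tag;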
