Poor performance in tree pruning - c#

I made what I'm calling a TreePruner. Its purpose: given a hierarchy starting at a list of root level nodes, return a new hierarchy where the new root nodes are the highest level nodes that meet a certain condition. Here is my class.
public class BreadthFirstPruner<TResource>
{
private IEnumerable<TResource> originalList;
private IEnumerable<TResource> prunedList;
private Func<TResource, ICollection<TResource>> getChildren;
public BreadthFirstPruner(IEnumerable<TResource> list, Func<TResource, ICollection<TResource>> getChildren)
{
this.originalList = list;
this.getChildren = getChildren;
}
public IEnumerable<TResource> GetPrunedTree(Func<TResource,bool> condition)
{
this.prunedList = new List<TResource>();
this.Prune(this.originalList, condition);
return this.prunedList;
}
private void Prune(IEnumerable<TResource> list, Func<TResource,bool> condition)
{
if (list.Count() == 0)
{
return;
}
var included = list.Where(condition);
this.prunedList = this.prunedList.Union(included);
var excluded = list.Except(included);
this.Prune(excluded.SelectMany(this.getChildren), condition);
}
}
The class does what it's supposed to, but it does so slowly, and I can't figure out why. I've used this on very small hierarchies where the complete hierarchy is already in memory (so there should be no linq-to-sql surprises). But regardless of how eager or lazy I try to make things, the first line of code to actually evaluate the results of a linq expression winds up taking 3-4 seconds to execute.
Here is the code that's currently consuming the pruner:
Func<BusinessUnitLabel, ICollection<BusinessUnitLabel>> getChildren = l => l.Children;
var hierarchy = scope.ToList();
var pruner = new BreadthFirstPruner<BusinessUnitLabel>(hierarchy, getChildren);
Func<BusinessUnitLabel, bool> hasBusinessUnitsForUser = l =>
l.BusinessUnits.SelectMany(bu => bu.Users.Select(u => u.IDGUID)).Contains(userId);
var labels = pruner.GetPrunedTree(hasBusinessUnitsForUser).ToList();
As I stated previously, the dataset that I'm working with when this executes is quite small. It's only a few levels deep with only one node on most levels. As it's currently written, the slowness will occur on the first recursive call to Prune when I call list.Count(), because that's when the second level of the hierarchy (excluded.SelectMany(this.getChildren)) is being evaluated.
If, however, I add a .ToList call like so:
var included = list.Where(condition).ToList()
Then the slowness will occur at that point.
What do I need to do to make this thing go fast?
Update
After someone prompted me to reevaluate my condition more carefully, I realized that those associations in hasBusinessUnitsForUser were not being eager loaded. That there was the problem.

These calls are all lazily executed and the results are not cached/materialized:
var included = list.Where(condition);
this.prunedList = this.prunedList.Union(included);
var excluded = list.Except(included);
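Even in this snippet, included is enumerated twice (once through Union and once through Except), and each enumeration re-runs condition on every element. Because the algorithm is recursive and each level's list is itself a lazy query over the previous level, the re-evaluations multiply with depth.
Add a ToList call to any sequence that might be enumerated more than once.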
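As a rough illustration of that advice, here is a minimal sketch of a level-by-level prune that materializes each level once, so the condition runs exactly once per visited node. The Node type and the PruneSketch class are hypothetical names made up for this example, not part of the original code. Unlike the Union call in the question, this version does not de-duplicate nodes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical node type, used only for this sketch.
public class Node
{
    public string Name { get; set; }
    public bool Match { get; set; }
    public List<Node> Children { get; } = new List<Node>();
}

public static class PruneSketch
{
    // Returns the highest-level nodes that satisfy condition, walking level by level.
    public static List<T> Prune<T>(
        IEnumerable<T> roots,
        Func<T, IEnumerable<T>> getChildren,
        Func<T, bool> condition)
    {
        var pruned = new List<T>();
        // Materialize once so the level is never re-enumerated.
        var level = roots.ToList();
        while (level.Count > 0)
        {
            var next = new List<T>();
            foreach (var node in level)
            {
                // condition is evaluated exactly once per visited node.
                if (condition(node))
                    pruned.Add(node);
                else
                    next.AddRange(getChildren(node));
            }
            level = next;
        }
        return pruned;
    }
}
```

Calling PruneSketch.Prune(hierarchy, l => l.Children, hasBusinessUnitsForUser) would mirror the usage in the question, assuming Children is already loaded in memory.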

Related

C# Recursive Search an array of Objects.parent_id for value, then search those and so on till none left

Looking for a solution to find an object.id and get all the parent_id's in an array of objects, and then set object.missed = true.
Object.id, and Object parent_id. If the object doesn't have a parent_id, parent_id = id.
I know how to do it for one level of parent_id's. How can I go unlimited levels deep? Below is the code I have for searching the 1 level.
public class EPlan
{
public int id;
public int parent_id;
public bool is_repeatable;
public bool missed;
}
EPlan[] plans = Array.FindAll(eventsPlan, item => item.parent_id == event_id);
foreach (EPlan plan in plans)
{
plan.missed = true;
plan.is_repeatable = false;
}
I'm trying to search for event_id, an int. So I search all of the object.id's for event_id. Once I find object.id == event_id, I need to set object.is_repeatable = false and object.missed = true.
Then I need to search all of the objects' parent_id values for the current object.id (event_id) and change all of those objects in the same way.
Then I need to check all of those object.id's against all of the object.parent_id's and do the same to those, like a tree effect. One event was missed, and any of the events that are parented to that event need to be set as missed as well.
So far, all I can do is get 1 level deep, or code multiple foreach loops in. But it could be 10 or more levels deep, so that doesn't make sense.
Any help is appreciated. There has to be a better way than the multiple loops.
I too was confused by the question, save for the one line you said:
1 event was missed, and any of the events that are parented to that event need to be set as missed as well.
With that in mind, I suggest the following code will do what you're looking for. Each time you call the method, it will find all of the objects in the array that match the ID and set the event as Missed and Is_Repeatable appropriately.
It also keeps a running list of the IDs of the plans it found during this scan. Once the inner loop is finished it calls itself, using that list of IDs instead of the passed-in list of event IDs it just used. That is the trick that makes the recursion work here: each call descends one level, from parents to their children.
To start the process off, you call the method with the single event ID, as you did for the 1-level search.
findEvents(new List<int> { event_id }, eventsPlan);
private void findEvents(List<int> eventIDs, EPlan[] eventsPlan)
{
    foreach (int eventID in eventIDs)
    {
        // Skip self-parented roots (parent_id == id) so they cannot recurse forever.
        EPlan[] plans = Array.FindAll(eventsPlan, item => item.parent_id == eventID && item.id != eventID);
        List<int> childIDs = new List<int>();
        foreach (EPlan plan in plans)
        {
            plan.missed = true;
            plan.is_repeatable = false;
            // Remember this plan's own id so its children are found on the next call.
            childIDs.Add(plan.id);
        }
        if (childIDs.Count > 0)
            findEvents(childIDs, eventsPlan);
    }
}
If you have the chance to reengineer this code, I also recommend using a generic collection (like List<EPlan>) instead of arrays. That avoids the performance penalty this code pays for building a new array in memory on every call to Array.FindAll. Using a generic collection, or even an old-school foreach loop, will be faster when processing a lot of data here.
Update 1:
To answer your question about how you might go about this using a Generic Collection instead:
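private void findEventsAsList(List<int> eventIDs, List<EPlan> eventsPlans)
{
    List<int> childIDs = new List<int>();
    // Skip self-parented roots (parent_id == id) so they are not revisited forever.
    foreach (EPlan plan in eventsPlans.Where(p => p.parent_id != p.id && eventIDs.Contains(p.parent_id)))
    {
        plan.missed = true;
        plan.is_repeatable = false;
        childIDs.Add(plan.id);
    }
    // Stop once a level has no children; otherwise descend one level.
    if (childIDs.Count > 0)
        findEventsAsList(childIDs, eventsPlans);
}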
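For larger data sets, the same parent-to-child walk can also be done without recursion. The following is only a sketch under a few assumptions: MissedEvents and MarkMissed are hypothetical names, the EPlan class is repeated from the question so the block is self-contained, and the starting event itself is marked as well (as the question describes). It indexes the plans by parent_id once with ToLookup and then processes a queue.

```csharp
using System.Collections.Generic;
using System.Linq;

// Same shape as the EPlan class in the question.
public class EPlan
{
    public int id;
    public int parent_id;
    public bool is_repeatable;
    public bool missed;
}

public static class MissedEvents
{
    // Marks the plan with the given id and all of its descendants as missed.
    public static void MarkMissed(int eventId, IEnumerable<EPlan> plans)
    {
        var all = plans.ToList();
        // Index children by parent once; self-parented roots are not their own children.
        var childrenOf = all.Where(p => p.parent_id != p.id)
                            .ToLookup(p => p.parent_id);
        var visited = new HashSet<int>();
        var queue = new Queue<EPlan>(all.Where(p => p.id == eventId));
        while (queue.Count > 0)
        {
            var plan = queue.Dequeue();
            // Guard against cycles or duplicate ids in the data.
            if (!visited.Add(plan.id))
                continue;
            plan.missed = true;
            plan.is_repeatable = false;
            foreach (var child in childrenOf[plan.id])
                queue.Enqueue(child);
        }
    }
}
```

A call such as MissedEvents.MarkMissed(event_id, eventsPlan) would replace the recursive call; each plan is visited at most once, so the cost grows roughly linearly with the number of plans.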

Querying a chain of list of lists with LINQ

I am working with an XML standard called SDMX. It's fairly complicated but I'll make it as short as possible. I am receiving an object called CategoryScheme. This object can contain a number of Category objects, and each Category can contain more Category objects, and so on; the chain can be arbitrarily deep. Every Category has a unique ID.
Usually each Category contains a lot of Categories. Together with this object I am receiving an Array, that contains the list of IDs that indicates where a specific Category is nested, and then I am receiving the ID of that category.
What I need to do is to create an object that maintains the hierarchy of the Category objects, but each Category must have only one child and that child has to be the one of the tree that leads to the specific Category.
So I had an idea, but in order to do this I should generate LINQ queries inside a cycle, and I have no clue how to do this. More information of what I wanted to try is commented inside the code
Let's go to the code:
public void RemoveCategory(ArtefactIdentity ArtIdentity, string CategoryID, string CategoryTree)
{
try
{
WSModel wsModel = new WSModel();
// Prepare Art Identity and Array
ArtIdentity.Version = ArtIdentity.Version.Replace("_", ".");
var CatTree = JArray.Parse(CategoryTree).Reverse();
// Get Category Scheme
ISdmxObjects SdmxObj = wsModel.GetCategoryScheme(ArtIdentity, false, false);
ICategorySchemeMutableObject CatSchemeObj = SdmxObj.CategorySchemes.FirstOrDefault().MutableInstance;
foreach (var Cat in CatTree)
{
// The cycle should work like this.
// At every iteration it must delete all the elements except the correct one
// and on the next iteration it must delete all the elements of the previously selected element
// At the end, I need to have the CatSchemeObj full of the all chains of categories.
// Iteration 1...
//CatSchemeObj.Items.ToList().RemoveAll(x => x.Id != Cat.ToString());
// Iteration 2...
//CatSchemeObj.Items.ToList().SingleOrDefault().Items.ToList().RemoveAll(x => x.Id != Cat.ToString());
// Iteration 3...
//CatSchemeObj.Items.ToList().SingleOrDefault().Items.ToList().SingleOrDefault().Items.ToList().RemoveAll(x => x.Id != Cat.ToString());
// Etc...
}
}
catch (Exception ex)
{
throw ex;
}
}
Thank you for your help.
So, as I already said in my comment, building a recursive function should fix the issue. If you're new to it, you can find some basic information about recursion in C# here.
The method could look something like this:
private void DeleteRecursively(int currentRecursionLevel, string[] catTree, ICategorySchemeMutableObject catSchemeObj)
{
catSchemeObj.Items.ToList().RemoveAll(x => x.Id != catTree[currentRecursionLevel].ToString());
var leftoverObject = catSchemeObj.Items.ToList().SingleOrDefault();
if(leftoverObject != null) DeleteRecursively(++currentRecursionLevel, catTree, leftoverObject);
}
Afterwards you can call this method in your main method, instead of the loop:
DeleteRecursively(0, CatTree, CatSchemeObject);
But as I also said, keep in mind that calling the method inside the loop seems pointless to me: you have already cleared the tree except for the one leftover path, so calling the method with the same tree but another category will result in an empty tree (in CatSchemeObject).
CAUTION! Another thing I noticed just now: calling ToList on your Items property and then deleting entries will NOT affect your source object, because ToList creates a new list. The new list still references the original objects, but a deletion only affects that list. So you must write the resulting list back to your Items property, or find a way to delete directly in the Items object. (If it is an IEnumerable rather than a concrete collection type, you should write it back.)
Just try it out with this simple example, and you will see that the original list is not modified.
IEnumerable<int> test = new List<int>() { 1, 2, 3, 4 , 1 };
test.ToList().RemoveAll(a => a != 1);
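To make the effect visible, here is a small sketch that extends the example with counts. The ToListCopyDemo class and its method names are hypothetical, chosen only for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ToListCopyDemo
{
    // RemoveAll runs on a temporary copy, so the source keeps all five items.
    public static int CountAfterRemovingFromCopy()
    {
        IEnumerable<int> test = new List<int> { 1, 2, 3, 4, 1 };
        test.ToList().RemoveAll(a => a != 1);
        return test.Count(); // 5
    }

    // Keeping the copy and writing it back makes the removal stick.
    public static int CountAfterWritingBack()
    {
        IEnumerable<int> test = new List<int> { 1, 2, 3, 4, 1 };
        var copy = test.ToList();
        copy.RemoveAll(a => a != 1);
        test = copy;
        return test.Count(); // 2
    }
}
```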
Edited:
So here is another possible way of doing it, following the discussion below.
I'm not sure what you really need, so just try it out.
int counter = 0;
var list = CatSchemeObj.Items.ToList();
//check before you call it or you will get an error
if(!list.Equals(default(list)))
{
while(true)
{
var temp = list.Where(x => CatTree[counter++] == x.Id); // or != ? play with it .
list = temp.Items.ToList().SingleOrDefault();
if(list.Equals(default(list)))
{
break;
}
}
}
I just translated your problem into two solutions, but I am not sure whether you will lose data because of the SingleOrDefault call. It returns the only element of a sequence (or the default value if the sequence is empty) and throws an InvalidOperationException if there is more than one element. I know you said you have only one item, which is OK, but still... :)
Let me know in a comment whether this worked for you or not.
//solution 1
// inside of this loop check each child list if empty or not
foreach (var Cat in CatTree)
{
var list = CatSchemeObj.Items.ToList();
//check before you call it or you will get an error
if(!list.Equals(default(list)))
{
while(true)
{
list.RemoveAll(x => x.Id != Cat.ToString());
list = list.ToList().SingleOrDefault();
if(list.Equals(default(list)))
{
break;
}
}
}
}
//solution 2
foreach (var Cat in CatTree)
{
var list = CatSchemeObj.Items.ToList();
//check before you call it or you will get an error
if(!list.Equals(default(list)))
{
CleanTheCat(Cat.ToString(), list);
}
}
//define this recursive function outside of the loop because it will call itself
void CleanTheCat(string cat, List<typeof(ICategorySchemeMutableObject.Items) /*Place here whatever type you have*/> CatSchemeObj)
{
CatSchemeObj.RemoveAll(x => x.Id != cat);
var catObj = CatSchemeObj.Items.ToList().SingleOrDefault();
if (!catObj.Equals(default(catObj))){
CleanTheCat(cat, catObj);
}
}
Thank you to whoever tried to help but I solved it by myself in a much easier way.
I just sent the full CategoryScheme object to the method that converted it in the XML format, then just one line did the trick:
XmlDocument.Descendants("Category").Where(x => !CatList.Contains(x.Attribute("id").Value)).RemoveIfExists();
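RemoveIfExists in the line above is presumably a custom extension method. A comparable sketch using only standard LINQ to XML could look like the following. It assumes un-namespaced Category elements with an id attribute, and CategoryFilter and KeepOnly are hypothetical names; real SDMX documents use XML namespaces, so the element name would need to be adjusted accordingly.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class CategoryFilter
{
    // Removes every <Category> element whose id attribute is not in keepIds.
    public static XDocument KeepOnly(string xml, HashSet<string> keepIds)
    {
        var doc = XDocument.Parse(xml);
        doc.Descendants("Category")
           // The string cast yields null when the attribute is missing.
           .Where(x => !keepIds.Contains((string)x.Attribute("id")))
           // Extensions.Remove snapshots the matches before detaching them.
           .Remove();
        return doc;
    }
}
```

Passing the IDs from the category path as keepIds leaves only the chain of nested Category elements that leads to the target category.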

Why does this LINQ query new up only one instance of the internal List?

Upon request, I have simplified this question. When trying to take two generic List and blend them, I get unexpected results.
private List<ConditionGroup> GetConditionGroupParents()
{
return (from Conditions in dataContext.Conditions
orderby Conditions.Name
select new ConditionGroup
{
GroupID = Conditions.ID,
GroupName = Conditions.Name,
/* PROBLEM */ MemberConditions = new List<Condition>()
}).ToList();
}
private List<ConditionGroup> BuildConditionGroups()
{
var results = GetConditionGroupParents();
// contents of ConditionMaps is irrelevant to this matter
List<ConditionMap> ConditionMaps = GenerateGroupMappings();
// now pair entries from the map into their appropriate group,
// adding them to the proper List<MemberConditions> as appropriate
foreach (var map in ConditionMaps)
{
results.Find(groupId => groupId.GroupID == map.GroupID)
.MemberConditions.Add(new ConditionOrphan(map));
}
return results;
}
I would expect each map in ConditionMaps to be mapped to a single ConditionGroup's MemberConditions in the "results.Find...." statement.
Instead, each map is being added to the list of every group, and that happens simultaneously/concurrently.
[edit] I've since proven that there is only a single instance of
List<Memberconditions>, being referenced by each group.
I unrolled the creation of the groups like so:
.
.
.
/* PROBLEM */ MemberConditions = null }).ToList();
foreach (var result in results)
{
List<Condition> memberConditions = new List<Condition>();
result.MemberConditions = memberConditions;
}
return results;
In that case I was able to watch each instantiation stepping
through the loop, and then it worked as expected. My question
remains, though, why the original code only created a single
instance. Thanks!
.
Why doesn't the LINQ query in GetConditionGroupParents "new up" a unique MemberConditions list for each Group, as indicated in the /* PROBLEM */ comment above?
Any insight is appreciated. Thanks!
Jeff Woods of
Reading, PA
This is a bug. As a workaround you can create a factory function
static List<T> CreateList<T>(int dummy) { ... }
And pass it any dummy value depending on the current row such as Conditions.ID.
This trick works because L2S, unlike EF, is capable of calling non-translatable functions in the last Select of the query. You will not have fun migrating to EF since they have not implemented this (yet).

Fastest way to get difference between List<object>

I'm developing a program which is able to find the difference in files between two folders, for instance. I've made a method which traverses the folder structure of a given folder and builds a tree with a node for each subfolder. Each node contains a list of files, which are the files in that folder. Each node has a number of children, which correspond to the folders in that folder.
Now the problem is to find the files present in one tree, but not in the other. I have a method: "private List<MyFile> Diff(Node index1, Node index2)", which should do this. But the problem is the way that I'm comparing the trees. Comparing two trees takes a huge amount of time: when each of the input nodes contains about 70,000 files, the Diff method takes about 3-5 minutes to complete.
I'm currently doing it this way:
private List<MyFile> Diff(Node index1, Node index2)
{
List<MyFile> DifferentFiles = new List<MyFile>();
List<MyFile> Index1Files = FindFiles(index1);
List<MyFile> Index2Files = FindFiles(index2);
List<MyFile> JoinedList = new List<MyFile>();
JoinedList.AddRange(Index1Files);
JoinedList.AddRange(Index2Files);
List<MyFile> JoinedListCopy = new List<MyFile>();
JoinedListCopy.AddRange(JoinedList);
List<string> ChecksumList = new List<string>();
foreach (MyFile m in JoinedList)
{
if (ChecksumList.Contains(m.Checksum))
{
JoinedListCopy.RemoveAll(x => x.Checksum == m.Checksum);
}
else
{
ChecksumList.Add(m.Checksum);
}
}
return JoinedListCopy;
}
And the Node class looks like this:
class Node
{
private string _Dir;
private Node _Parent;
private List<Node> _Children;
private List<MyFile> _Files;
}
Rather than doing lots of searching through List structures (which is quite slow), you can put all of the checksums into a HashSet, which can be searched much more efficiently.
private List<MyFile> Diff(Node index1, Node index2)
{
var Index1Files = FindFiles(index1);
var Index2Files = FindFiles(index2);
//this is all of the files in both
var intersection = new HashSet<string>(Index1Files.Select(file => file.Checksum)
.Intersect(Index2Files.Select(file => file.Checksum)));
return Index1Files.Concat(Index2Files)
.Where(file => !intersection.Contains(file.Checksum))
.ToList();
}
How about:
public static IEnumerable<MyFile> FindUniqueFiles(IEnumerable<MyFile> index1, IEnumerable<MyFile> index2)
{
HashSet<string> hash = new HashSet<string>();
foreach (var file in index1.Concat(index2))
{
if (!hash.Add(file.Checksum))
{
hash.Remove(file.Checksum);
}
}
return index1.Concat(index2).Where(file => hash.Contains(file.Checksum));
}
This will work on the assumption that one tree will not contain a duplicate. Servy's answer will work in all instances.
Are you keeping the entire FileSystemObject for every element in the tree? If so I would think your memory overhead would be gigantic. Why not just use the filename or checksum and put that into a list, then do comparisons on that?
I can see that this is more than just a "distinct" function, what you are really looking for is all instances that only exist once in the JoinedListCopy collection, not simply a list of all distinct instances in the JoinedListCopy collection.
Servy has a very good answer, I would suggest a different approach, which utilizes some of linq's more interesting features, or at least I find them interesting.
var diff_Files = (from a in Index1Files
join b in Index2Files
on a.CheckSum equals b.CheckSum
where !(Index2Files.Contains(a) || Index1Files.Contains(b))).ToList()
Another way to structure that "where", which might work better since the file instances might not actually be identical as far as code equality is concerned:
where !(Index2Files.Any(c=>c.Checksum == a.Checksum) || Index1Files.Any(c=>c.Checksum == b.Checksum))
This looks at the individual checksums rather than the entire file object instance.
The basic strategy is essentially what you are already doing, just a bit more efficient: join the collections and filter them against each other to make sure that you only get entries that are unique.
Another way to do this is to use the counting function in LINQ:
var diff_Files = JoinedListCopy.Where(a=> JoinedListCopy.Count(b=>b.Checksum == a.Checksum) == 1).ToList();
Nested LINQ isn't always the most efficient thing in the world (this rescans the whole list for every element), but it should work and gets all instances that occur only once. I actually like this approach the best, since it has the least chance of messing something up, but the join I used first might be more efficient.
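The "occurs exactly once" idea can also be expressed in a single pass with GroupBy, which avoids rescanning the list for every element. This is only a sketch: FileDiff and UniqueByChecksum are hypothetical names, and MyFile is reduced to an assumed Path and Checksum pair. Note that, like the counting query, it also drops files whose checksum is duplicated within one tree.

```csharp
using System.Collections.Generic;
using System.Linq;

// Minimal stand-in for the MyFile class from the question (Path is an assumed member).
public class MyFile
{
    public string Path { get; set; }
    public string Checksum { get; set; }
}

public static class FileDiff
{
    // Returns the files whose checksum occurs exactly once across both lists.
    public static List<MyFile> UniqueByChecksum(
        IEnumerable<MyFile> index1Files,
        IEnumerable<MyFile> index2Files)
    {
        return index1Files.Concat(index2Files)
            .GroupBy(file => file.Checksum)     // hash-based grouping, one pass
            .Where(group => group.Count() == 1) // keep checksums seen only once
            .Select(group => group.First())
            .ToList();
    }
}
```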

Linq filtering results with multiple Where clauses

I am trying to use EF 5 to apply multiple search criteria to a result set (in this case, for a library catalog search). Here is the relevant code:
public IQueryable<LibraryResource> GetSearchResults(string SearchCriteria, int? limit = null)
{
List<string> criteria = SearchCriteria.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).ToList();
IQueryable<LibraryResource> allResults = context.LibraryResources.Include("Type").Where(r => r.AuditInfo.DeletedAt == null);
foreach (string criterion in criteria)
{
allResults = allResults.Where(r => (r.Title.Contains(criterion) || r.Keywords.Contains(criterion) || r.Author.Contains(criterion) || r.Comments.Contains(criterion)));
}
allResults = allResults.OrderBy(r => r.Title);
if (limit.HasValue) allResults = allResults.Take(limit.Value);
return allResults;
}
Sample SearchCriteria = "history era"
For some reason, only the last criterion gets applied. For instance, in the sample above, all the books with "era" in the title, author, keywords and comments are returned, without also filtering by "history". I stepped through the code, and the loop executes twice, with the appropriate criterion each time. Can you see something I can't? Thanks!
You have fallen victim to modifying the value of a closed-over variable.
Change the code to this:
foreach (string criterion in criteria)
{
var crit = criterion;
allResults = allResults.Where(/* use crit here, not criterion */);
}
The problem here is that while you are building up the query your filtering expressions close over the variable criterion, in effect pulling it in scope at the point where the query is evaluated. However, at that time criterion will only have one value (the last one it happened to loop over) so all but the last of your filters will in fact be turned into duplicates of the last one.
Creating a local copy of criterion and referencing that inside the expressions corrects the problem because crit is a different local variable each time, with lifetime that does not extend from one iteration of the loop to the next one.
For more details you might want to read Is there a reason for C#'s reuse of the variable in a foreach?, where it is also mentioned that C# 5.0 will take a breaking change that applies to this scenario: the lifetime of the loop variable criterion is going to change, making this code work correctly without an extra local.
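The capture behaviour can be reproduced outside of EF with a plain for loop, whose loop variable is still shared across iterations in current C# versions. The sketch below uses hypothetical names (ClosureDemo, SharedVariable, LocalCopy) and delegates instead of query expressions, but the mechanism is the same as in the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ClosureDemo
{
    // Every lambda captures the same variable i, which is 3 when the lambdas finally run.
    public static List<int> SharedVariable()
    {
        var funcs = new List<Func<int>>();
        for (int i = 0; i < 3; i++)
            funcs.Add(() => i);
        return funcs.Select(f => f()).ToList(); // 3, 3, 3
    }

    // A fresh local per iteration gives each lambda its own captured value.
    public static List<int> LocalCopy()
    {
        var funcs = new List<Func<int>>();
        for (int i = 0; i < 3; i++)
        {
            int copy = i;
            funcs.Add(() => copy);
        }
        return funcs.Select(f => f()).ToList(); // 0, 1, 2
    }
}
```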
