I was wondering if this code is good enough or if there are glaring newbie no-no's.
Basically I'm populating a TreeView listing all Departments in my database. In the Entity Framework model, each Department has an ID, a Name, and a nullable IDParentDepartment pointing at its parent department.
Here is the code in question:
private void button1_Click(object sender, EventArgs e)
{
    DepartmentRepository repo = new DepartmentRepository();
    var parentDepartments = repo.FindAllDepartments()
        .Where(d => d.IDParentDepartment == null)
        .ToList();

    foreach (var parent in parentDepartments)
    {
        TreeNode node = new TreeNode(parent.Name);
        treeView1.Nodes.Add(node);

        var children = repo.FindAllDepartments()
            .Where(x => x.IDParentDepartment == parent.ID)
            .ToList();

        foreach (var child in children)
        {
            node.Nodes.Add(child.Name);
        }
    }
}
EDIT:
Good suggestions so far. Working with the entire collection makes sense I guess. But what happens if the collection is huge as in 200,000 entries? Wouldn't this break my software?
DepartmentRepository repo = new DepartmentRepository();
var entries = repo.FindAllDepartments();
var parentDepartments = entries
    .Where(d => d.IDParentDepartment == null)
    .ToList();

foreach (var parent in parentDepartments)
{
    TreeNode node = new TreeNode(parent.Name);
    treeView1.Nodes.Add(node);

    var children = entries.Where(x => x.IDParentDepartment == parent.ID)
        .ToList();

    foreach (var child in children)
    {
        node.Nodes.Add(child.Name);
    }
}
Since you are getting all of the departments anyway, why don't you do it in one query, where you get all of the departments and then execute queries against the in-memory collection instead of the database? That would be much more efficient.
In a more general sense, any database model that is recursive can lead to issues, especially if this could end up being a fairly deep structure. One possible thing to consider would be for each department to store all of its ancestors, so that you would be able to get them all at once instead of having to query for them recursively.
In light of your edit, you might want to consider an alternative database schema that scales to handle very large tree structures.
There's an explanation on the FogBugz blog of how they handle hierarchies. They also link to this article by Joe Celko for more information.
Turns out there's a pretty cool solution for this problem explained by Joe Celko. Instead of attempting to maintain a bunch of parent/child relationships all over your database -- which would necessitate recursive SQL queries to find all the descendants of a node -- we mark each case with a "left" and "right" value calculated by traversing the tree depth-first and counting as we go. A node's "left" value is set whenever it is first seen during traversal, and the "right" value is set when walking back up the tree away from the node. A picture probably makes more sense:
The Nested Set SQL model lets us add case hierarchies without sacrificing performance.
How does this help? Now we just ask for all the cases with a "left" value between 2 and 9 to find all of the descendants of B in one fast, indexed query. Ancestors of G are found by asking for nodes with "left" less than 6 (G's own "left") and "right" greater than 6. This works in all databases and greatly increases performance -- particularly when querying large hierarchies.
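To make that concrete, here is a hedged LINQ sketch against the question's departments. It assumes the Department entity has been given two extra integer columns, called LeftValue and RightValue here (names of my own choosing, not part of the original model), populated by the depth-first numbering described above; departments stands for the queryable Department set.

// Hypothetical extra columns on Department: LeftValue and RightValue,
// filled in by a depth-first traversal as described above.
var b = departments.Single(d => d.Name == "B");

// All descendants of B in one indexed range query, no recursion needed.
var descendantsOfB = departments
    .Where(d => d.LeftValue > b.LeftValue && d.RightValue < b.RightValue)
    .ToList();

// All ancestors of G: every node whose range encloses G's values.
var g = departments.Single(d => d.Name == "G");
var ancestorsOfG = departments
    .Where(d => d.LeftValue < g.LeftValue && d.RightValue > g.RightValue)
    .ToList();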
Assuming that you are getting the data from a database, the first thing that comes to mind is that you are going to be hitting the database n+1 times, once for each parent that you have in the database. You should try to get the whole tree structure out in one hit.
Secondly, you seem to understand the idea of patterns, seeing as you appear to be using the repository pattern, so you might want to look at IoC. It allows you to inject a dependency on a particular object, such as your repository, into the class where it is going to be used, which allows for easier unit testing.
Thirdly, regardless of where you get your data from, move the structuring of the data into a tree into a service which returns you an object containing all your departments already organised (this basically becomes a DTO). This will help you reduce code duplication.
With anything, you need to apply the YAGNI principle. This basically says that you should only do something if you are going to need it, so if the code you have provided above is complete, needs no further work and is functional, don't touch it. The same goes for the select n+1 performance issue: if you are not seeing any performance hits, don't do anything, as it may be premature optimization.
In your edit
DepartmentRepository repo = new DepartmentRepository();
var entries = repo.FindAllDepartments();
var parentDepartments = entries.Where(d => d.IDParentDepartment == null).ToList();

foreach (var parent in parentDepartments)
{
    TreeNode node = new TreeNode(parent.Name);
    treeView1.Nodes.Add(node);

    var children = entries.Where(x => x.IDParentDepartment == parent.ID).ToList();
    foreach (var child in children)
    {
        node.Nodes.Add(child.Name);
    }
}
You still have an n+1 issue. This is because the data is only retrieved from the database when you call ToList() or when you iterate over the enumeration. This would be better:
var entries = repo.FindAllDepartments().ToList();
var parentDepartments = entries.Where(d => d.IDParentDepartment == null);

foreach (var parent in parentDepartments)
{
    TreeNode node = new TreeNode(parent.Name);
    treeView1.Nodes.Add(node);

    var children = entries.Where(x => x.IDParentDepartment == parent.ID);
    foreach (var child in children)
    {
        node.Nodes.Add(child.Name);
    }
}
That looks OK to me, but think about a collection of hundreds of thousands of nodes. The best way to handle that is asynchronous (lazy) loading - note that you don't necessarily have to load all elements at the same time. Your tree view can be collapsed by default and you can load additional levels as the user expands the tree's nodes. Consider this case: you have a root node containing 100 nodes, and each of these nodes contains at least 1,000 nodes. 100 * 1000 = 100,000 nodes to load - pretty much, isn't it? To reduce the database traffic you can first load your first 100 nodes and then, when the user expands one of them, load its 1,000 nodes. That will save a considerable amount of time.
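A rough sketch of that lazy-loading idea for a WinForms TreeView, assuming the DepartmentRepository from the question and that each node keeps its department ID in Tag (the "loading..." placeholder child and the handler itself are my own additions, not from the original post):

private void treeView1_BeforeExpand(object sender, TreeViewCancelEventArgs e)
{
    // A single placeholder child marks a node whose real children
    // have not been fetched from the database yet.
    if (e.Node.Nodes.Count == 1 && e.Node.Nodes[0].Text == "loading...")
    {
        e.Node.Nodes.Clear();
        var repo = new DepartmentRepository();
        int parentId = (int)e.Node.Tag;

        foreach (var child in repo.FindAllDepartments()
                                  .Where(d => d.IDParentDepartment == parentId)
                                  .ToList())
        {
            var childNode = new TreeNode(child.Name) { Tag = child.ID };
            childNode.Nodes.Add("loading...");   // placeholder so this node can expand too
            e.Node.Nodes.Add(childNode);
        }
    }
}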
Things that come to mind:
It looks like .ToList() is needless. If you are simply iterating over the returned result, why bother with the extra step?
Move this function into its own method and out of the event handler.
As others have said, you could get the whole result in one call. Sort by IDParentDepartment so that the null ones are first. That way, you should be able to get the list of departments in one call and iterate over it only once, adding children departments to already existing parent ones (see the sketch below).
Wrap the TreeView modifications with:
treeView.BeginUpdate();
// modify the tree here.
treeView.EndUpdate();
To get better performance.
Pointed out here by jgauffin
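A hedged sketch of that single-call idea, reusing the repo from the question and assuming the two-level structure shown there, with ID an int and IDParentDepartment a nullable int, so the null-parent rows sort first and every parent node already exists when its children are attached:

var nodesById = new Dictionary<int, TreeNode>();
var all = repo.FindAllDepartments()
              .OrderBy(d => d.IDParentDepartment)   // null parents sort first
              .ToList();                            // single round-trip

treeView1.BeginUpdate();
foreach (var dept in all)
{
    var node = new TreeNode(dept.Name) { Tag = dept.ID };
    nodesById[dept.ID] = node;

    if (dept.IDParentDepartment == null)
        treeView1.Nodes.Add(node);
    else
        nodesById[dept.IDParentDepartment.Value].Nodes.Add(node);
}
treeView1.EndUpdate();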
This should use only one (albeit possibly large) call to the database:
Departments.Join(
        Departments,
        child => child.IDParentDepartment,
        parent => parent.ID,
        (child, parent) => new { Child = child, Parent = parent })
    .GroupBy(x => x.Parent)
    .Map(group =>
    {
        var node = new TreeNode(group.Key.Name);
        group.Map(y => node.Nodes.Add(y.Child.Name));
        treeView1.Nodes.Add(node);
    });
Where 'Map' is just a 'ForEach' for IEnumerables:
public static void Map<T>(this IEnumerable<T> source, Action<T> func)
{
foreach (T i in source)
func(i);
}
Note: This will still not help if the Departments table is huge, as 'Map' materializes the result of the SQL statement much like 'ToList()' does. You might consider Piotr's answer.
In addition to Bronumski's and Keith Rousseau's answers:
Also store the DepartmentID in each node's Tag so that you don't have to re-query the database to get the DepartmentID.
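For example, a small addition to the question's loop (assuming ID is an int):

TreeNode node = new TreeNode(parent.Name) { Tag = parent.ID };
treeView1.Nodes.Add(node);

// Later, e.g. in an AfterSelect handler, the ID comes straight off the node:
int selectedDepartmentId = (int)treeView1.SelectedNode.Tag;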
I have about 100 items (allRights) in the database and about 10 ids to be searched (inputRightsIds). Which one is better - first to get all rights and then search through the items (Variant 1), or to make 10 separate checking requests to the database (Variant 2)?
Here is some example code:
DbContext db = new DbContext();
int[] inputRightsIds = new int[10]{...};
Variant 1
var allRights = db.Rights.ToList();
foreach (var right in allRights)
{
    for (int i = 0; i < inputRightsIds.Length; i++)
    {
        if (inputRightsIds[i] == right.Id)
        {
            // Do something
        }
    }
}
Variant 2
for (int i = 0; i < inputRightsIds.Length; i++)
{
    if (db.Rights.Any(r => r.Id == inputRightsIds[i]))
    {
        // Do something
    }
}
Thanks in advance!
As others have already stated, you should do the following:
var matchingIds = from r in db.Rights
                  where inputRightsIds.Contains(r.Id)
                  select r.Id;

foreach (var id in matchingIds)
{
    // Do something
}
But this is different from both of your approaches. In your first approach you are making one SQL call to the DB that returns more results than you are interested in. The second makes multiple SQL calls, returning part of the information you want with each call. The query above will make one SQL call to the DB and return only the data you are interested in. This is the best approach, as it reduces the two bottlenecks of making multiple calls to the DB and having too much data returned.
You can use the following:
db.Rights.Where(right => inputRightsIds.Contains(right.Id));
They should be very similar speeds since both must enumerate the arrays the same number of times. There might be subtle differences in speed between the two depending on the input data but in general I would go with Variant 2. I think you should almost always prefer LINQ over manual enumeration when possible. Also consider using the following LINQ statement to simplify the whole search to a single line.
var matches = db.Rights.Where(r => inputRightsIds.Contains(r.Id));
// ... do stuff with matches
Don't forget to pull all your items into memory if you want to process the list further:
var itemsFromDatabase = db.Rights.Where(r => inputRightsIds.Contains(r.Id)).ToList();
Or you could even enumerate through the collection and do some work on each item:
db.Rights.Where(r => inputRightsIds.Contains(r.Id)).ToList().ForEach(item =>
{
    // your code here
});
Current Code:
For each element in the MapEntryTable, check the properties IsDisplayedColumn and IsReturnColumn; if they are true, add the element to the corresponding list. Its running time is O(n). Many elements have both properties false, so they will not get added to any of the lists in the loop.
foreach (var mapEntry in MapEntryTable)
{
    if (mapEntry.IsDisplayedColumn)
        Type1.DisplayColumnId.Add(mapEntry.OutputColumnId);

    if (mapEntry.IsReturnColumn)
        Type1.ReturnColumnId.Add(mapEntry.OutputColumnId);
}
The following is the LINQ version of the same thing:
MapEntryTable.Where(x => x.IsDisplayedColumn == true).ToList()
    .ForEach(mapEntry => Type1.DisplayColumnId.Add(mapEntry.OutputColumnId));
MapEntryTable.Where(x => x.IsReturnColumn == true).ToList()
    .ForEach(mapEntry => Type1.ReturnColumnId.Add(mapEntry.OutputColumnId));
I am converting all such foreach code to LINQ as I am learning it, but my questions are:
Do I get any advantage from the LINQ conversion in this case, or is it a disadvantage?
Is there a better way to do the same using LINQ?
UPDATE:
Consider the case where, out of 1000 elements in the list, 80% have both properties false; does Where then give me the benefit of quickly finding elements that match a given condition?
Type1 is a custom type with a set of List<int> members, DisplayColumnId and ReturnColumnId.
ForEach isn't a LINQ method. It's a method on List<T>. And not only is it not a part of LINQ, it's very much against the values and patterns of LINQ. Eric Lippert explains this in a blog post written when he was a principal developer on the C# compiler team.
Your "LINQ" approach also:
Completely unnecessarily copies all of the items to be added into a list, which is both wasteful in time and memory and also conflicts with LINQ's goals of deferred execution when executing queries.
Isn't actually a query with the exception of the Where operator. You're acting on the items in the query, rather than performing a query. LINQ is a querying tool, not a tool for manipulating data sets.
You're iterating the source sequence twice. This may or may not be a problem, depending on what the source sequence actually is and what the costs of iterating it are.
A solution that uses LINQ for what it is designed for would be to use it like so:
foreach (var mapEntry in MapEntryTable.Where(entry => entry.IsDisplayedColumn))
    list1.DisplayColumnId.Add(mapEntry.OutputColumnId);

foreach (var mapEntry in MapEntryTable.Where(entry => entry.IsReturnColumn))
    list2.ReturnColumnId.Add(mapEntry.OutputColumnId);
I would say stick with the original foreach loop, since you only iterate through the list once.
Also, your LINQ should look more like this:
list1.DisplayColumnId.AddRange(
    MapEntryTable.Where(x => x.IsDisplayedColumn).Select(mapEntry => mapEntry.OutputColumnId));
list2.ReturnColumnId.AddRange(
    MapEntryTable.Where(x => x.IsReturnColumn).Select(mapEntry => mapEntry.OutputColumnId));
The performance of foreach vs LINQ's ForEach is almost exactly the same, within nanoseconds of each other, assuming you have the same internal logic in the loop in both versions when testing.
However, a for loop outperforms both by a LARGE margin. for (int i = 0; i < count; ++i) is much faster than both, because a for loop doesn't rely on an IEnumerable implementation (overhead). The for loop compiles to x86 register index/jump code. It maintains an incrementing counter, and then it's up to you to retrieve the item by its index in the loop.
Using a ForEach loop, though, does have a big disadvantage: you cannot break out of the loop. If you need to do that you have to maintain a boolean like breakLoop, set it to true, and have each iteration exit early if breakLoop is true... bad for performance. Secondly, you cannot use continue; instead you use return.
I never use Linq's foreach loop.
If you are dealing with linq, e.g.
List<Thing> things = .....;
var oldThings = things.Where(p => p.DateTime.Year < DateTime.Now.Year);
That will internally foreach with LINQ and give you back only the items with a year less than the current year. Cool.
But if I am doing this:
List<Thing> things = new List<Thing>();
foreach (XElement node in Results)
{
    things.Add(new Thing(node));
}
I don't need to use a LINQ ForEach loop. Even if I did...
foreach (var node in thingNodes.Where(p => p.NodeType == "Thing"))
{
    if (node.Ignore)
    {
        continue;
    }
    things.Add(node);
}
even though I could write that more cleanly like
foreach (var node in thingNodes.Where(p => p.NodeType == "Thing" && !p.Ignore))
{
    things.Add(node);
}
There is no real reason I can think of to do this:
things.ForEach(thing =>
{
    // do something
    // can't break
    // can't continue
    return; // <- acts as continue
});
And if I want the fastest loop possible,
for (int i = 0; i < things.Count; ++i)
{
    var thing = things[i];
    // do something
}
Will be faster.
Your LINQ isn't quite right as you're converting the results of Where to a List and then pseudo-iterating over those results with ForEach to add to another list. Use ToList or AddRange for converting or adding sequences to lists.
Example, where overwriting list1 (if it were actually a List<T>):
list1 = MapEntryTable.Where(x => x.IsDisplayedColumn == true)
.Select(mapEntry => mapEntry.OutputColumnId).ToList();
or to append:
list1.AddRange(MapEntryTable.Where(x => x.IsDisplayedColumn == true)
.Select(mapEntry => mapEntry.OutputColumnId));
In C#, to do what you want functionally in one call, you have to write your own partition method. If you are open to using F#, you can use List.Partition<'T>
https://msdn.microsoft.com/en-us/library/ee353782.aspx
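A minimal C# sketch of such a helper, under the assumption that a simple two-way split on one predicate is what's wanted; the Partition name and the tuple return type are my own choices, not framework APIs:

public static class EnumerableExtensions
{
    // Splits a sequence into the items that match the predicate and those that don't,
    // enumerating the source only once.
    public static Tuple<List<T>, List<T>> Partition<T>(this IEnumerable<T> source, Func<T, bool> predicate)
    {
        var matches = new List<T>();
        var rest = new List<T>();

        foreach (var item in source)
        {
            if (predicate(item))
                matches.Add(item);
            else
                rest.Add(item);
        }
        return Tuple.Create(matches, rest);
    }
}

// Hypothetical usage against the question's data:
// var split = MapEntryTable.Partition(x => x.IsDisplayedColumn);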
To obtain the parent-child relation I use recursion:
public List<MasterDiagnosi> GetAllDiagonsis()
{
    IEnumerable<MasterDiagnosi> list = this.GetMasterDiagnosi();
    List<MasterDiagnosi> tree = new List<MasterDiagnosi>();
    var Nodes = list.Where(x => x.DiagnosisParentID == 0 && x.Status == 2 && x.DiagnosisLanguageID == 2);
    foreach (var node in Nodes)
    {
        SetChildren(node, list);
        tree.Add(node);
    }
    return tree;
}
private void SetChildren(MasterDiagnosi model, IEnumerable<MasterDiagnosi> diagonisList)
{
    var childs = diagonisList.Where(x => x.DiagnosisParentID == model.DiagnosisID && x.Status == 2 && x.DiagnosisLanguageID == 2);
    if (childs.Count() > 0)
    {
        foreach (var child in childs)
        {
            SetChildren(child, diagonisList);
            model.MasterDiagnosis.Add(child);
        }
    }
}
It works great for 0-500 records, but if I have more than 5k or 10k records then I end up with a worst case of 4-5 minutes of processing. I am looking for another way around this, or any optimization. Currently I save my day using a cache, but I need this to work as various CRUD operations may be done. Any help?
I would suggest saving the tree data structure inside the relational database. In this presentation of advanced data structures you can find some structures which can be used to store a tree in a database efficiently. Some of them are quite simple.
I have to warn you that these structures are not always easy to implement. The gain in performance could be great though.
I have used the nested set model, which I found very efficient. This of course depends on your specific tree navigation scenarios. As Evan Petersen suggests, the nested sets hierarchical model is very efficient when someone wants to retrieve all the children of one node, but very inefficient for deleting a node, because the whole tree has to be reorganized. Therefore, you should make your own decision, taking your specific tree navigation and update scenarios into account.
Hope I helped!
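As a side note that is independent of the storage model: the repeated Where scans in the question's SetChildren are what make it quadratic, and they can be replaced by grouping the children once with ToLookup. A hedged sketch reusing the names from the question (it assumes every row starts with an empty MasterDiagnosis collection):

public List<MasterDiagnosi> GetAllDiagonsis()
{
    // Filter once, then group children by parent id so the list
    // is scanned a fixed number of times instead of once per node.
    var list = this.GetMasterDiagnosi()
        .Where(x => x.Status == 2 && x.DiagnosisLanguageID == 2)
        .ToList();

    var childrenByParent = list.ToLookup(x => x.DiagnosisParentID);

    foreach (var node in list)
        foreach (var child in childrenByParent[node.DiagnosisID])
            node.MasterDiagnosis.Add(child);

    // Roots are the nodes with no parent.
    return list.Where(x => x.DiagnosisParentID == 0).ToList();
}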
Try disabling AutoDetectChanges while you are inserting values into the table, to prevent the Entity Framework change tracker from firing every time behind the scenes. It helped me when I loaded tens of thousands of records into the database. Look at the example below:
model.Configuration.AutoDetectChangesEnabled = false;
// insert data
model.DataBaseContext.ChangeTracker.DetectChanges();
model.SubmitChanges();
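With a plain EF6 DbContext the same idea is often written with a try/finally so change tracking is always switched back on; here is a hedged sketch where the context type and the data being inserted are placeholders:

using (var context = new MyDbContext())                 // placeholder context type
{
    context.Configuration.AutoDetectChangesEnabled = false;
    try
    {
        foreach (var department in departmentsToInsert)  // placeholder data source
            context.Departments.Add(department);

        context.ChangeTracker.DetectChanges();            // one explicit detection pass
        context.SaveChanges();
    }
    finally
    {
        context.Configuration.AutoDetectChangesEnabled = true;
    }
}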
I'm developing a program which is able to find the difference in files between two folders, for instance. I've made a method which traverses the folder structure of a given folder and builds a tree for each subfolder. Each node contains a list of files, which are the files in that folder. Each node has a number of children, which correspond to the folders in that folder.
Now the problem is to find the files present in one tree, but not in the other. I have a method, "private List<MyFile> Diff(Node index1, Node index2)", which should do this. But the problem is the way that I'm comparing the trees. Comparing two trees takes a huge amount of time - when each of the input nodes contains about 70,000 files, the Diff method takes about 3-5 minutes to complete.
I'm currently doing it this way:
private List<MyFile> Diff(Node index1, Node index2)
{
    List<MyFile> DifferentFiles = new List<MyFile>();
    List<MyFile> Index1Files = FindFiles(index1);
    List<MyFile> Index2Files = FindFiles(index2);

    List<MyFile> JoinedList = new List<MyFile>();
    JoinedList.AddRange(Index1Files);
    JoinedList.AddRange(Index2Files);

    List<MyFile> JoinedListCopy = new List<MyFile>();
    JoinedListCopy.AddRange(JoinedList);

    List<string> ChecksumList = new List<string>();
    foreach (MyFile m in JoinedList)
    {
        if (ChecksumList.Contains(m.Checksum))
        {
            JoinedListCopy.RemoveAll(x => x.Checksum == m.Checksum);
        }
        else
        {
            ChecksumList.Add(m.Checksum);
        }
    }
    return JoinedListCopy;
}
And the Node class looks like this:
class Node
{
    private string _Dir;
    private Node _Parent;
    private List<Node> _Children;
    private List<MyFile> _Files;
}
Rather than doing lots of searching through List structures (which is quite slow), you can put all of the checksums into a HashSet, which can be searched much more efficiently.
private List<MyFile> Diff(Node index1, Node index2)
{
    var Index1Files = FindFiles(index1);
    var Index2Files = FindFiles(index2);

    // checksums that appear in both trees
    var intersection = new HashSet<string>(Index1Files.Select(file => file.Checksum)
        .Intersect(Index2Files.Select(file => file.Checksum)));

    return Index1Files.Concat(Index2Files)
        .Where(file => !intersection.Contains(file.Checksum))
        .ToList();
}
How about:
public static IEnumerable<MyFile> FindUniqueFiles(IEnumerable<MyFile> index1, IEnumerable<MyFile> index2)
{
    HashSet<string> hash = new HashSet<string>();
    foreach (var file in index1.Concat(index2))
    {
        // Add returns false when the checksum is already in the set,
        // so a checksum seen a second time is removed again.
        if (!hash.Add(file.Checksum))
        {
            hash.Remove(file.Checksum);
        }
    }
    return index1.Concat(index2).Where(file => hash.Contains(file.Checksum));
}
This will work on the assumption that one tree will not contain a duplicate. Servy's answer will work in all instances.
Are you keeping the entire FileSystemObject for every element in the tree? If so I would think your memory overhead would be gigantic. Why not just use the filename or checksum and put that into a list, then do comparisons on that?
I can see that this is more than just a "distinct" function; what you are really looking for is all instances that exist only once in the JoinedListCopy collection, not simply a list of all distinct instances in it.
Servy has a very good answer. I would suggest a different approach, which utilizes some of LINQ's more interesting features, or at least features I find interesting.
var diff_Files = (from a in Index1Files
                  join b in Index2Files
                      on a.CheckSum equals b.CheckSum
                  where !(Index2Files.Contains(a) || Index1Files.Contains(b))
                  select a).ToList();
Another way to structure that where, which might work better since the file instances might not actually be identical as far as equality in code is concerned:
where !(Index2Files.Any(c => c.Checksum == a.Checksum) || Index1Files.Any(c => c.Checksum == b.Checksum))
This looks at the individual checksums rather than the entire file object instance.
The basic strategy is essentially what you are already doing, just a bit more efficient: join the collections and filter them against each other so that you only get entries that are unique.
Another way to do this is to use the Count function in LINQ:
var diff_Files = JoinedListCopy.Where(a => JoinedListCopy.Count(b => b.CheckSum == a.CheckSum) == 1).ToList();
Nested LINQ isn't always the most efficient thing in the world, but that should work fairly well: get all instances that occur only once. I actually like this approach the best, as there is the least chance of messing something up, but the join I used first might be more efficient.
So I have a couple of different lists that I'm trying to process and merge into 1 list.
Below is a snippet of code that I want to see if there is a better way of doing.
The reason why I'm asking is that some of these lists are rather large. I want to see if there is a more efficient way of doing this.
As you can see I'm looping through a list, and the first thing I'm doing is to check whether the CompanyId exists in the list. If it does, then I find the item in the list that I'm going to process.
pList is my processing list. I'm adding the values from my different lists into this list.
I'm wondering if there is a "better way" of accomplishing the Exist and Find.
bool tstFind = false;
foreach (parseAC item in pACList)
{
    tstFind = pList.Exists(x => (x.CompanyId == item.key.ToString()));
    if (tstFind == true)
    {
        pItem = pList.Find(x => (x.CompanyId == item.key.ToString()));
        // Processing done here. pItem gets updated here
        ...
    }
}
Just as a side note, I'm going to be researching a way to use joins to see if that is faster. But I haven't gotten there yet. The above code is my first cut at solving this issue and it appears to work. However, since I have the time I want to see if there is a better way still.
Any input is greatly appreciated.
Time Findings:
My current Find and Exists code takes about 84 minutes to loop through the 5.5M items in the pACList.
Using pList.FirstOrDefault(x => x.CompanyId == item.key.ToString()) takes 54 minutes to loop through the 5.5M items in the pACList.
You can retrieve the item with FirstOrDefault instead of searching for it twice (the first time to determine whether the item exists, and the second time to get the existing item):
var tstFind = pList.FirstOrDefault(x => x.CompanyId == item.key.ToString());
if (tstFind != null)
{
    // Processing done here. pItem gets updated here
}
Yes, use a hashtable so that your algorithm is O(n) instead of O(n*m) which it is right now.
var pListByCompanyId = pList.ToDictionary(x => x.CompanyId);
foreach (parseAC item in pACList)
{
    if (pListByCompanyId.ContainsKey(item.key.ToString()))
    {
        pItem = pListByCompanyId[item.key.ToString()];
        // Processing done here. pItem gets updated here
        ...
    }
}
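A small refinement, not in the original answer: TryGetValue avoids looking the key up twice.

var pListByCompanyId = pList.ToDictionary(x => x.CompanyId);
foreach (parseAC item in pACList)
{
    // One dictionary lookup instead of ContainsKey followed by the indexer.
    if (pListByCompanyId.TryGetValue(item.key.ToString(), out var pItem))
    {
        // Processing done here. pItem gets updated here
    }
}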
You can iterate through the filtered list using LINQ:
foreach (parseAC item in pACList.Where(i => pList.Any(x => x.CompanyId == i.key.ToString())))
{
    pItem = pList.Find(x => x.CompanyId == item.key.ToString());
    // Processing done here. pItem gets updated here
    ...
}
Using lists for this type of operation is O(M*N) (M is the count of pACList, N is the count of pList). Additionally, you are searching pList twice. To avoid that issue, use pList.FirstOrDefault as recommended by @lazyberezovsky.
However, if possible I would avoid using lists. A Dictionary indexed by the key you're searching on would greatly improve the lookup time.
Doing a linear search on the list for each item in another list is not efficient for large data sets. What is preferable is to put the keys into a lookup table or Dictionary that can be searched much more efficiently, allowing you to join the two lists. You don't even need to code this yourself; what you want is a Join operation. You want to get all of the pairs of items from each sequence that map to the same key.
Either pull out the implementation of the method below, or change Foo and Bar to the appropriate types and use it as a method.
public static IEnumerable<Tuple<Bar, Foo>> Merge(IEnumerable<Bar> pACList,
                                                 IEnumerable<Foo> pList)
{
    return pACList.Join(pList,
        item => item.Key.ToString(),
        item => item.CompanyID.ToString(),
        (a, b) => Tuple.Create(a, b));
}
You can use the results of this call to merge the two items together, as they will have the same key.
Internally the method will create a lookup table that allows for efficient searching before actually doing the searching.
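Consuming the result might look like this (the comments are only guesses at what the two sides represent):

foreach (var pair in Merge(pACList, pList))
{
    var acItem = pair.Item1;   // presumably the pACList entry
    var pItem = pair.Item2;    // presumably the matching pList entry
    // Processing done here. pItem gets updated here
}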
Convert pList to a HashSet of company ids, then query pHashSet.Contains(). Complexity: O(N) + O(n).
Sort pList on CompanyId and do Array.BinarySearch(): O(N log N) + O(n log N).
If the max company id is not prohibitively large, simply create an array where the item with company id i sits at the i-th position. Nothing can be faster.
where N is the size of pList and n is the size of pACList
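A hedged sketch of that last, array-indexed idea; PItem stands in for whatever type pList holds, and it assumes CompanyId and item.key are (or parse to) reasonably small non-negative integers:

// PItem is a placeholder for pList's element type.
int maxId = pList.Max(x => int.Parse(x.CompanyId));
var byId = new PItem[maxId + 1];
foreach (var p in pList)
    byId[int.Parse(p.CompanyId)] = p;

foreach (parseAC item in pACList)
{
    int key = item.key;                       // assumes key is already an int
    if (key >= 0 && key <= maxId && byId[key] != null)
    {
        var pItem = byId[key];
        // Processing done here. pItem gets updated here
    }
}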