What's the best collection to use for uniquely identifying nodes? - C#

Currently I am using a Dictionary<int, node> to store around 10,000 nodes. The key is used as an ID number for later lookup and the 'node' is a class that contains some data. Other classes within the program use the ID number as a pointer to the node. (This may sound inefficient; however, explaining my reasoning for using a dictionary for this is beyond the scope of my question.)
However, 20% of the nodes are duplicates.
What I want to do is, when I add a node, check to see if it already exists. If it does, use that ID number. If not, create a new one.
This is my current solution to the problem:
public class nodeDictionary
{
    Dictionary<int, node> dict = new Dictionary<int, node>( );
    public int addNewNode( latLng ll )
    {
        node n = new node( ll );
        if ( dict.ContainsValue( n ) )
        {
            foreach ( KeyValuePair<int, node> kv in dict )
            {
                if ( kv.Value == n )
                {
                    return kv.Key;
                }
            }
        }
        else
        {
            if ( dict.Count != 0 )
            {
                dict.Add( dict.Last( ).Key + 1, n );
                return dict.Last( ).Key + 1;
            }
            else
            {
                dict.Add( 0, n );
                return 0;
            }
        }
        throw new Exception( );
    }//end add new node
}
The problem with this is that when trying to add a new node to a list of 100,000 nodes it takes 78 milliseconds to add the node. This is unacceptable because I could be adding an additional 1,000 nodes at any given time.
So, is there a better way to do this? I am not looking for someone to write the code for me, I am just looking for guidance.

It sounds like you want to
make sure that LatLng overrides Equals/GetHashCode (preferably implement the IEquatable<LatLng> interface) - a sketch follows below
stuff all the items directly into a HashSet<LatLng>
For implementing GetHashCode, see here: Why is it important to override GetHashCode when Equals method is overridden?
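For illustration, here is a minimal sketch of such a LatLng, assuming it simply wraps two double coordinates (the field names here are made up):
public class LatLng : IEquatable<LatLng>
{
    public readonly double Lat;
    public readonly double Lng;

    public LatLng(double lat, double lng) { Lat = lat; Lng = lng; }

    public bool Equals(LatLng other)
    {
        if (ReferenceEquals(other, null)) return false;
        return Lat == other.Lat && Lng == other.Lng;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as LatLng);
    }

    public override int GetHashCode()
    {
        // combine the two coordinate hashes so that equal coordinates hash equally
        unchecked { return (Lat.GetHashCode() * 397) ^ Lng.GetHashCode(); }
    }
}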
If you need to generate 'artificial' unique IDs in some fashion, I suggest you use the dictionary approach again, but 'in reverse':
// uses the same hash function for speedy lookup/insertion
IDictionary<LatLng, int> idMap = new Dictionary<LatLng, int>();
foreach (LatLng latLng in LatLngCoords)
{
if (!idMap.ContainsKey(latLng))
idMap.Add(latLng, idMap.Count+1); // to start with 1
}
You can have the idMap replace the HashSet<>; the implementation (and performance characteristics) is essentially the same but as an associative container.
Here's a lookup function to get from LatLng to Id:
int IdLookup(LatLng latLng)
{
    int id;
    if (idMap.TryGetValue(latLng, out id))
        return id;
    throw new ArgumentException("Coordinate not in idMap");
}
You could just-in-time add it:
int IdFor(LatLng latLng)
{
    int id;
    if (idMap.TryGetValue(latLng, out id))
        return id;
    id = idMap.Count + 1;
    idMap.Add(latLng, id);
    return id;
}

I'd add a second dictionary for the reverse direction, i.e. Dictionary<Node,int>.
Then you either
Are content with reference equality and do nothing.
Create an IEqualityComparer<Node> and supply it to the dictionary
Override Equals and GetHashCode on Node
In the latter two cases a good implementation of the hash code is essential to get good performance; a sketch of such a comparer follows below.
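For example, the comparer could look something like this (a sketch only; it assumes node exposes its latLng in a field called ll and that latLng itself has value equality):
public class NodeEqualityComparer : IEqualityComparer<node>
{
    public bool Equals(node x, node y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.ll.Equals(y.ll);        // compare by coordinate, not by reference
    }

    public int GetHashCode(node obj)
    {
        return obj.ll.GetHashCode();     // equal coordinates must produce equal hashes
    }
}

// supplied when creating the reverse dictionary
Dictionary<node, int> reverse = new Dictionary<node, int>(new NodeEqualityComparer());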

Your solution is not only slow, but also wrong. The order of items in a Dictionary is undefined, so dict.Last() is not guaranteed to return the item that was added last. (Although it may often look that way.)
Using an id to identify an object in your application seems wrong too. You should consider using references to the object directly.
But if you want to use your current design and assuming that you compare nodes based on their latLng, you could create two dictionaries: the one you already have and a second one, Dictionary<latLng, int>, that can be used to efficiently find out whether a certain node already exists. And if it does, it gives you its id.
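A sketch of that shape (hypothetical names; it assumes latLng has value equality and that nodes are never removed, so Count can serve as the next id):
Dictionary<int, node> idToNode = new Dictionary<int, node>();     // the dictionary you already have
Dictionary<latLng, int> llToId = new Dictionary<latLng, int>();   // the reverse index

public int AddNewNode(latLng ll)
{
    int id;
    if (llToId.TryGetValue(ll, out id))
        return id;                       // duplicate: reuse the existing id

    id = idToNode.Count;                 // next free id (valid while nothing is removed)
    idToNode.Add(id, new node(ll));
    llToId.Add(ll, id);
    return id;
}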

What exactly is the purpose of this code?
if ( dict.ContainsValue( n ) )
{
    foreach ( KeyValuePair<int, node> kv in dict )
    {
        if ( kv.Value == n )
        {
            return kv.Key;
        }
    }
}
ContainsValue searches for a value (instead of a key) and is very inefficient (O(n)). Ditto for the foreach. Worse, you do both when only one is necessary (you could completely remove the ContainsValue call by rearranging your ifs a little)!
You should probably maintain an additional dictionary that is the "reverse" of the original one (i.e. values in the old dictionary are keys in the new one and vice versa), to "cover" your search patterns (similarly to how databases can maintain multiple indexes per table to cover the multiple ways a table can be queried).

You could try using HashSet<T>
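For instance, a HashSet of the coordinates makes duplicate detection a single call (a sketch, with ll being the incoming latLng; it relies on latLng having value equality as discussed above):
HashSet<latLng> seen = new HashSet<latLng>();

// Add returns false if an equal coordinate is already present,
// so a duplicate is detected in O(1) without a separate Contains call.
if (seen.Add(ll))
{
    // first time this coordinate has been seen
}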

You might want to consider restructuring this to just use a List (where the 'key' is just the index into the List) instead of a Dictionary. A few advantages:
Looking up an element by integer key is now O(1) (and a very fast O(1) given that it's just an array dereference internally).
When you insert a new element, you perform an O(n) search to see whether it already exists in the list. If it does not, you have also already traversed the list and can have recorded whether you encountered an entry with a null record. If you have, that index is the new key. If not, the new key is the current list Count. You're enumerating the collection once instead of multiple times, and the enumeration itself is much faster than enumerating a Dictionary. (See the sketch below.)
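A sketch of that scheme (it assumes duplicates are detected by comparing the stored latLng, kept here in a hypothetical ll field, and that deleted entries are set to null):
List<node> nodes = new List<node>();

public int AddNewNode(latLng ll)
{
    int firstFreeSlot = -1;
    for (int i = 0; i < nodes.Count; i++)
    {
        if (nodes[i] == null)
        {
            if (firstFreeSlot < 0) firstFreeSlot = i;   // remember a reusable slot
        }
        else if (nodes[i].ll.Equals(ll))
        {
            return i;                                   // duplicate: its index is the id
        }
    }
    if (firstFreeSlot >= 0)
    {
        nodes[firstFreeSlot] = new node(ll);            // reuse a hole left by a deletion
        return firstFreeSlot;
    }
    nodes.Add(new node(ll));                            // append: the id is the new index
    return nodes.Count - 1;
}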

Related

How to properly use SortedDictionary in C#?

I'm trying to do something very simple but it seems that I don't understand SortedDictionary.
What I'm trying to do is the following:
Create a sorted dictionary that sorts my items by some floating number, so I create a dictionary that looks like this
SortedDictionary<float, Node<T>> allNodes = new SortedDictionary<float, Node<T>>();
And now after I add items, I want to remove them one by one (every removal should be at a complexity of O(log(n))), from the smallest to the largest.
How do I do it? I thought that simply allNodes[0] would give me the smallest, but it doesn't.
Moreover, it seems like the dictionary can't handle duplicate keys. I feel like I'm using the wrong data structure...
Should I use something else if I have a bunch of nodes that I want sorted by their distance (floating point)?
allNodes[0] will not give you the first item in the dictionary - it will give you the item with a float key value of 0.
If you want the first item try allNodes.Values.First() instead. Or to find the first key use allNodes.Keys.First()
To remove the items one by one, loop over a copy of the Keys collection and call allNodes.Remove(key);
foreach (var key in allNodes.Keys.ToList())
{
allNodes.Remove(key);
}
To answer the addendum to your question: yes, SortedDictionary (any flavor of Dictionary for that matter) will not handle duplicate keys - calling Add with an existing key throws, and assigning through the indexer overwrites the previous value.
You could use a SortedDictionary<float, List<Node<T>>> but then you have the complexity of extracting collections versus items, needing to initialize each list rather than just adding an item, etc. It's all possible and may still be the fastest structure for adds and gets, but it does add a bit of complexity.
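A sketch of the add path with that structure, assuming this lives in a class that is generic in T:
SortedDictionary<float, List<Node<T>>> allNodes = new SortedDictionary<float, List<Node<T>>>();

void Add(float distance, Node<T> node)
{
    List<Node<T>> bucket;
    if (!allNodes.TryGetValue(distance, out bucket))
    {
        bucket = new List<Node<T>>();
        allNodes.Add(distance, bucket);   // one bucket per distinct distance
    }
    bucket.Add(node);                     // duplicate distances share a bucket
}

// Consuming from smallest to largest (First() needs System.Linq):
while (allNodes.Count > 0)
{
    var smallest = allNodes.First();
    // process smallest.Value here
    allNodes.Remove(smallest.Key);
}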
Yes, you're right about complexity.
In SortedDictionary all the keys are sorted. If you want to iterate from the smallest to the largest, foreach will be enough:
foreach(KeyValuePair<float, Node<T>> kvp in allNodes)
{
// Do Something...
}
You wrote that you want to remove items. It's forbidden to remove from a collection during iteration with foreach, so first create a copy of it to do so.
EDIT:
Yes, if you have duplicate keys you can't use SortedDictionary. Create a Node structure that holds the Node<T> and its float distance, then write a comparer:
public class NodeComparer : IComparer<Node>
{
    public int Compare(Node n1, Node n2)
    {
        // ascending: smallest distance first
        return n1.dist.CompareTo(n2.dist);
    }
}
And then put everything in simple List<Node> allNodes and sort:
allNodes.Sort(new NodeComparer());
As a Dictionary<TKey, TValue> must have unique keys, I'd use List<Node<T>> instead. For instance, if your Node<T> class has a Value property
class Node<T>
{
float Value { get; set; }
// other properties
}
and you want to sort by this property, use LINQ:
var list = new List<Node<T>>();
// populate list
var smallest = list.OrderBy(n => n.Value).FirstOrDefault();
To remove the nodes one by one, just iterate through the list:
while (list.Count > 0)
{
list.RemoveAt(0);
}

A better way to loop through lists

So I have a couple of different lists that I'm trying to process and merge into 1 list.
Below is a snippet of code; I want to see if there is a better way of doing it.
The reason why I'm asking is that some of these lists are rather large. I want to see if there is a more efficient way of doing this.
As you can see I'm looping through a list, and the first thing I'm doing is checking to see if the CompanyId exists in the list. If it does, then I find the item in the list that I'm going to process.
pList is my processing list. I'm adding the values from my different lists into this list.
I'm wondering if there is a "better way" of accomplishing the Exist and Find.
bool tstFind = false;
foreach (parseAC item in pACList)
{
    tstFind = pList.Exists(x => (x.CompanyId == item.key.ToString()));
    if (tstFind == true)
    {
        pItem = pList.Find(x => (x.CompanyId == item.key.ToString()));
        //Processing done here. pItem gets updated here
        ...
    }
}
Just as a side note, I'm going to be researching a way to use joins to see if that is faster. But I haven't gotten there yet. The above code is my first cut at solving this issue and it appears to work. However, since I have the time I want to see if there is a better way still.
Any input is greatly appreciated.
Time Findings:
My current Find and Exists code takes about 84 minutes to loop through the 5.5M items in the pACList.
Using pList.FirstOrDefault(x => x.CompanyId == item.key.ToString()); takes 54 minutes to loop through 5.5M items in the pACList
You can retrieve the item with FirstOrDefault instead of searching for it twice (once to determine whether the item exists, and a second time to get the existing item):
var tstFind = pList.FirstOrDefault(x => x.CompanyId == item.key.ToString());
if (tstFind != null)
{
//Processing done here. pItem gets updated here
}
Yes, use a hashtable so that your algorithm is O(n) instead of O(n*m) which it is right now.
var pListByCompanyId = pList.ToDictionary(x => x.CompanyId);
foreach (parseAC item in pACList)
{
    if (pListByCompanyId.ContainsKey(item.key.ToString()))
    {
        pItem = pListByCompanyId[item.key.ToString()];
        //Processing done here. pItem gets updated here
        ...
    }
}
You can iterate through the filtered list using LINQ:
foreach (parseAC item in pACList.Where(i=>pList.Any(x => (x.CompanyId == i.key.ToString()))))
{
pItem = pList.Find(x => (x.CompanyId == item.key.ToString()));
//Processing done here. pItem gets updated here
...
}
Using lists for this type of operation is O(MxN) (M is the count of pACList, N is the count of pList). Additionally, you are searching pList twice per item. To avoid that issue, use pList.FirstOrDefault as recommended by @lazyberezovsky.
However, if possible I would avoid using lists. A Dictionary indexed by the key you're searching on would greatly improve the lookup time.
Doing a linear search on the list for each item in another list is not efficient for large data sets. What is preferable is to put the keys into a Table or Dictionary that can be searched much more efficiently, allowing you to join the two tables. You don't even need to code this yourself; what you want is a Join operation. You want to get all of the pairs of items from each sequence that map to the same key.
Either pull out the implementation of the method below, or change Foo and Bar to the appropriate types and use it as a method.
public static IEnumerable<Tuple<Bar, Foo>> Merge(IEnumerable<Bar> pACList
, IEnumerable<Foo> pList)
{
return pACList.Join(pList, item => item.Key.ToString()
, item => item.CompanyID.ToString()
, (a, b) => Tuple.Create(a, b));
}
You can use the results of this call to merge the two items together, as they will have the same key.
Internally the method will create a lookup table that allows for efficient searching before actually doing the searching.
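Usage could then look something like this (a sketch; Bar and Foo stand in for your actual item types):
foreach (Tuple<Bar, Foo> pair in Merge(pACList, pList))
{
    Bar item = pair.Item1;    // the pACList entry
    Foo pItem = pair.Item2;   // the matching pList entry
    //Processing done here. pItem gets updated here
}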
Convert pList to a HashSet and then query pHashSet.Contains(). Complexity O(N) + O(n)
Sort pList on CompanyId and do Array.BinarySearch() = O(N Log N) + O(n * Log N)
If the max company id is not prohibitively large, simply create an array of them where the item with company id i sits at the i-th position. Nothing can be faster (see the sketch below).
where N is the size of pList and n is the size of pACList
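A sketch of that array idea, assuming company ids are small non-negative integers (PItemType and maxCompanyId are placeholders for your actual element type and the largest id):
// Build once: index pList items by numeric company id.
var byId = new PItemType[maxCompanyId + 1];
foreach (var p in pList)
    byId[int.Parse(p.CompanyId)] = p;

// Each lookup is then a plain array dereference.
foreach (parseAC item in pACList)
{
    var pItem = byId[item.key];          // assumes item.key is already an int
    if (pItem != null)
    {
        //Processing done here. pItem gets updated here
    }
}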

How to compare two sorted large lists efficiently in C#?

I have got two generic lists with 20,000 and 30,000 objects in each list.
class Employee
{
string name;
double salary;
}
List<Employee> newEmployeeList = new List<Employee>() {....} // contains 20,000 objects
List<Employee> oldEmployeeList = new List<Employee>() {....} // contains 30,000 objects
Lists can also be sorted by name if it improves the speed.
I want to compare these two lists to find out
employees whose name and salary matching
employees whose name is matching but not salary
What is the fastest way to compare such large data lists with above conditions?
I would sort both the newEmployeeList and oldEmployeeList lists by name - O(n*log(n)). And then you can use a linear algorithm to search for matches. So the total would be O(n+n*log(n)) if both lists are about the same size. This should be faster than the O(n^2) "brute force" algorithm.
I'd probably recommend the two lists be stored in a Dictionary<string, Employee> based on the name to begin with; then you can iterate over the keys in one and look up whether they exist and the salaries match in the other. This would also save the cost of sorting them later or putting them in a more efficient structure.
This is pretty much O(n) - linear to build both dictionaries, linear to go through the keys of one and look them up in the other - since O(n + m + n) reduces to O(n).
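A sketch of that dictionary approach (it assumes name is unique within each list and that the name/salary members are accessible):
var newByName = newEmployeeList.ToDictionary(e => e.name);
var oldByName = oldEmployeeList.ToDictionary(e => e.name);

var nameAndSalaryMatch = new List<Employee>();
var nameOnlyMatch = new List<Employee>();

foreach (var pair in newByName)
{
    Employee old;
    if (oldByName.TryGetValue(pair.Key, out old))
    {
        if (old.salary == pair.Value.salary)
            nameAndSalaryMatch.Add(pair.Value);   // name and salary match
        else
            nameOnlyMatch.Add(pair.Value);        // name matches, salary differs
    }
}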
But, if you must use List<T> to hold the lists for other reasons, you could also use the Join() LINQ method, and build a new list with a Match field that tells you whether they were a match or mismatch...
var results = newEmpList.Join(
oldEmpList,
n => n.Name,
o => o.Name,
(n, o) => new
{
Name = n.Name,
Salary = n.Salary,
Match = o.Salary == n.Salary
});
You can then filter this with a Where() clause for Match or !Match.
Update: I assume (by the title of your question) that the 2 lists are already sorted. Perhaps they're stored in a database with a clustered index or something. This answer, therefore, relies on that assumption.
Here is an implementation that has O(n) complexity, and is also very fast, AND is pretty simple too.
I believe this is a variant of the Merge Algorithm.
Here's the idea:
Start enumerating both lists
Compare the 2 current items.
If they match, add to your results.
If the 1st item is "smaller", advance the 1st list.
If the 2nd item is "smaller", advance the 2nd list.
Since both lists are known to be sorted, this will work very well. This implementation assumes that name is unique in each list.
var comparer = StringComparer.OrdinalIgnoreCase;
var namesAndSalaries = new List<Tuple<Employee, Employee>>();
var namesOnly = new List<Tuple<Employee, Employee>>();
// Create 2 iterators; one for old, one for new:
using (IEnumerator<Employee> A = oldEmployeeList.GetEnumerator()) {
using (IEnumerator<Employee> B = newEmployeeList.GetEnumerator()) {
// Start enumerating both:
if (A.MoveNext() && B.MoveNext()) {
while (true) {
int compared = comparer.Compare(A.Current.name, B.Current.name);
if (compared == 0) {
// Names match
if (A.Current.salary == B.Current.salary) {
namesAndSalaries.Add(Tuple.Create(A.Current, B.Current));
} else {
namesOnly.Add(Tuple.Create(A.Current, B.Current));
}
if (!A.MoveNext() || !B.MoveNext()) break;
} else if (compared < 0) {
// Keep searching A
if (!A.MoveNext()) break;
} else {
// Keep searching B
if (!B.MoveNext()) break;
}
}
}
}
}
One of the fastest possible solutions on sorted lists is to use BinarySearch to find an item in the other list.
But as others mentioned, you should measure it against your project requirements, as performance often tends to be a subjective thing.
You could create a Dictionary using
var lookupDictionary = list1.ToDictionary(x=>x.name);
That would give you close to O(1) lookup and close to O(n) behavior if you're looking up values from a loop over the other list.
(I'm assuming here that ToDictionary is O(n), which would make sense with a straightforward implementation, but I have not tested this to be the case.)
This would make for a very straightforward algorithm, and I'm thinking going below O(n) with two unsorted lists is pretty hard.
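For example, the loop over the other list could look like this (a sketch; it assumes name is unique in list1 and that the fields are accessible):
var lookupDictionary = list1.ToDictionary(x => x.name);

foreach (var employee in list2)
{
    Employee match;
    if (lookupDictionary.TryGetValue(employee.name, out match))
    {
        bool salaryMatches = (match.salary == employee.salary);
        // record the name+salary match or the name-only match here
    }
}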

SortedSet and SortedList fail with different enums

The whole story: I have some KeyValuePairs that I need to store in a session, and my primary goal is to keep it small. Therefore I don't have the option of using many different collections. While the key is an enum value of one of several different enum types, the value is always an enum value of the same enum type. I have chosen a HashTable for this approach, whose content looks like this (just with many more entries):
// The Key-Value-Pairs
{ EnumTypA.ValueA1, MyEnum.ValueA },
{ EnumTypB.ValueB1, MyEnum.ValueB },
{ EnumTypC.ValueC1, MyEnum.ValueA },
{ EnumTypA.ValueA2, MyEnum.ValueC },
{ EnumTypB.ValueB1, MyEnum.ValueC }
Mostly I am running Contains on that HashTable, but for sure I also need to fetch the value at some point, and I need to loop through all elements. That all works fine, but now I have a new requirement to keep the order in which I added the entries to the HashTable -> BANG
A HashTable is a map and that is not possible!
Now I thought about using a SortedList<object, MyEnum>, or to go with more data but slightly faster lookups and use a SortedSet<object> in addition to the HashTable.
Content below has been edited
The SortedList is implemented as
SortedList<Enum, MyEnum> mySortedList = new SortedList<Enum, MyEnum>();
the SortedSet is implemented as
SortedSet<Enum> mySortedSet = new SortedSet<Enum>();
The described key-value pairs are added to the SortedList with
void AddPair(Enum key, MyEnum value)
{
mySortedList.Add(key, value);
}
And for the SortedSet like this:
void AddPair(Enum key)
{
mySortedSet.Add(key);
}
Both are failing with the exception:
Object must be the same type as the enum
My question is: What goes wrong and how can I achieve my goal?
Used Solution
I've decided to live with the downside of redundant data against slower lookups and decided to implement a List<Enum> which will retain the insert order parallel to my already existing HashTable.
In my case I just have about 50-150 elements, so I decided to benchmark the Hashtable against the List<KeyValuePair<object,object>>.
Therefore I created the following helper to implement ContainsKey() for the List<KeyValuePair<object,object>>:
static bool ContainsKey(this List<KeyValuePair<object, object>> list, object key)
{
foreach (KeyValuePair<object, object> p in list)
{
if (p.Key.Equals(key))
return true;
}
return false;
}
I inserted the same 100 entries and checked randomly for one of ten different entries in a 300,000-iteration loop. And... the difference was tiny, so I decided to go with the List<KeyValuePair<object,object>>.
I think you should store your data in an instance of List<KeyValuePair<Enum, MyEnum>> or Dictionary<Enum, MyEnum>.
SortedSet and SortedList are generic, but your keys are EnumTypeA/EnumTypeB; you need to specify the generic T with their base class (System.Enum) like:
SortedList<Enum, MyEnum> sorted = new SortedList<Enum, MyEnum>();
EDIT
Why you got this exception
SortedList and SortedSet use a comparer inside to check if two keys are equal. Comparer<Enum>.Default will be used as the comparer if you didn't specify the comparer in the constructor. Unfortunately Comparer<Enum>.Default isn't implemented as you expected. It throws the exception if the two enums are not the same type.
How to resolve the problem
If you don't want to use a List<KeyValuePair<Enum, MyEnum>> and insist on using SortedList, you need to specify a comparer to the constructor like this:
class EnumComparer : IComparer<Enum>
{
public int Compare(Enum x, Enum y)
{
return x.GetHashCode() - y.GetHashCode();
}
}
var sorted = new SortedList<Enum, MyEnum>(new EnumComparer());
Btw, I think you need to retain the "insertion order"? If so, List<KeyValuePair<K,V>> is a better choice, because SortedSet will prevent duplicated items.

Quick mass-updating a Dictionary

I have a Dictionary<int, int> and would like to update certain elements all at once based on their current values, e.g. changing all elements with value 10 to having value 14 or something.
I imagined this would be easy with some LINQ/lambda stuff but it doesn't appear to be as simple as I thought. My current approach is this:
List<KeyValuePair<int, int>> kvps = dictionary.Where(d => d.Value == oldValue).ToList();
foreach (KeyValuePair<int, int> kvp in kvps)
{
dictionary[kvp.Key] = newValue;
}
The problem is that dictionary is pretty big (hundreds of thousands of elements) and I'm running this code in a loop thousands of times, so it's incredibly slow. There must be a better way...
This might be the wrong data structure. You are attempting to look up dictionary entries based on their values which is the reverse of the usual pattern. Maybe you could store Sets of keys that currently map to certain values. Then you could quickly move these sets around instead of updating each entry separately.
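A sketch of that reverse structure (hypothetical names; it has to be kept in sync with the main dictionary on every write):
// value -> set of keys currently holding that value
Dictionary<int, HashSet<int>> keysByValue = new Dictionary<int, HashSet<int>>();

void ChangeAll(int oldValue, int newValue)
{
    HashSet<int> keys;
    if (!keysByValue.TryGetValue(oldValue, out keys))
        return;                                  // nothing currently maps to oldValue

    foreach (int key in keys)
        dictionary[key] = newValue;              // update the forward dictionary

    keysByValue.Remove(oldValue);

    HashSet<int> target;
    if (keysByValue.TryGetValue(newValue, out target))
        target.UnionWith(keys);                  // merge into an existing set
    else
        keysByValue[newValue] = keys;            // or move the whole set in one step
}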
I would consider writing your own collection type to achieve this whereby keys with the same value actually share the same value instance such that changing it in one place changes it for all keys.
Something like the following (obviously, lots of code omitted here - just for illustrative purposes):
public class SharedValueDictionary : IDictionary<int, int>
{
private List<MyValueObject> values;
private Dictionary<int, MyValueObject> keys;
// Now, when you add a new key/value pair, you actually
// look in the values collection to see if that value already
// exists. If it does, you add an entry to keys that points to that existing object
// otherwise you create a new MyValueObject to wrap the value and add entries to
// both collections.
}
This scenario would require multiple versions of Add and Remove to allow for changing all keys with the same value, changing only one key of a set to be a new value, removing all keys with the same value and removing just one key from a value set. It shouldn't be difficult to code for these scenarios as and when needed.
You need to generate a new dictionary:
d = d.ToDictionary(w => w.Key, w => w.Value == 10 ? 14 : w.Value)
I think the thing that everybody must be missing is that it is exceedingly trivial:
List<int> keys = dictionary.Keys.Where(d => d == oldValue).ToList();
You are NOT looking up keys by value (as has been offered by others).
Instead, keys.SingleOrDefault() will now by definition return the single key that equals oldValue if it exists in the dictionary. So the whole code should simplify to
if (dictionary.ContainsKey(oldValue))
    dictionary[oldValue] = newValue;
That is quick. Now I'm a little concerned that this might indeed not be what the OP intended, but it is what he had written. So if the existing code does what he needs, he will now have a highly performant version of the same :)
After the edit, this seems an immediate improvement:
foreach (var key in dictionary.Where(d => d.Value == oldValue).Select(d => d.Key).ToList())
{
    dictionary[key] = newValue;
}
Note that KeyValuePair<TKey, TValue>.Value is read-only and the dictionary can't be modified while it is being enumerated, so the matching keys are materialized with ToList() before the updates.
