Background: In my program I have a list of nodes (a class I have defined). They each have a unique id number and a non-unique "region" number. I want to randomly select a node, record its id number, then remove all nodes of the same region from the list.
Problem: Someone pointed out to me that using a hashset instead of a list would be much faster, as a hashset's "order" is effectively random for my purposes and removing elements from it would be much faster. How would I do this (i.e. how do I access a random element in a hashset? I only know how to check to see if a hashset contains an element I already have)?
Also, I'm not quite sure how to remove all the nodes of a certain region. Do I have to override/define a comparison function to compare node regions? Again, I know how to remove a known element from a hashset, but here I don't know how to remove all nodes of a certain region.
I can post specifics about my code if that would help.
To be clear, the order of items in a HashSet isn't random, it's just not easily determinable. Meaning if you iterate a hash set multiple times, the items will come out in the same order each time, but you have no control over what that order is.
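For example, a quick illustrative sketch (not behaviour you should rely on or try to control):
var set = new HashSet<int> { 5, 1, 9, 3 };
Console.WriteLine(string.Join(",", set)); // some fixed order for this set instance...
Console.WriteLine(string.Join(",", set)); // ...and the same order again on the next iteration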
That said, HashSet<T> implements IEnumerable<T>, so you could just pick a random number n and remove the nth item:
// assuming a Random instance named rand is created once elsewhere (don't declare a new one here)
int n = rand.Next(hashSet.Count);
var item = hashSet.ElementAt(n);
hashSet.Remove(item);
Also, I'm not quite sure how to remove all the nodes of a certain region. Do I have to override/define a comparison function to compare node regions?
Not necessarily - you'll need to scan the hashSet to find matching items (easily done with Linq) and remove each one individually. Whether you do that by just comparing properties or defining an equality comparer is up to you.
foreach (var dupe in hashSet.Where(x => x.Region == item.Region).ToList())
hashSet.Remove(dupe);
Note the ToList which is necessary since you can't modify a collection while iterating over it, so the items to remove need to be stored in a different collection.
Note that you can't override Equals in the Node class for this purpose or you won't be able to put multiple nodes from one region in the hash set.
If you haven't noticed, both of these requirements defeat the purpose of using a HashSet - a HashSet is faster only when looking up a known item; iterating it or searching for items based on properties is no faster than with a regular collection. It would be like looking through the phone book to find all people whose phone numbers start with 5.
If you always want the items organized by region, then perhaps a Dictionary<int, List<Node>> is a better structure.
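For illustration, a rough sketch of that structure (assuming nodes is your original list, Node exposes Id and Region properties, and rand is the Random instance from above; note that picking a random region key is not the same as picking a node uniformly at random when region sizes vary):
var byRegion = nodes.GroupBy(n => n.Region)
                    .ToDictionary(g => g.Key, g => g.ToList());

// Pick a random region, record one of its node ids, then drop the whole region in one call.
var regionKeys = byRegion.Keys.ToList();
var regionKey = regionKeys[rand.Next(regionKeys.Count)];
var pickedId = byRegion[regionKey][rand.Next(byRegion[regionKey].Count)].Id;
byRegion.Remove(regionKey);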
There's an alternative approach that could end up being faster than removing items from a hash set, and that's creating a structure that does the whole job for you in one go.
First up, to give me some sample data I'm running this code:
var rnd = new Random();
var nodes =
    Enumerable
        .Range(0, 10)
        .Select(n => new Node() { id = n, region = rnd.Next(0, 3) })
        .ToList();
That gives me ten nodes with ids 0 through 9, each assigned a random region between 0 and 2.
Now I build up my structure like this:
var pickable =
    nodes
        .OrderBy(n => rnd.Next())
        .ToLookup(n => n.region, n => n.id);
Which gives me a lookup keyed by region, with the node ids grouped under each region - and notice that both the regions and the individual ids come out in randomized order. Now it's possible to iterate over the lookup and take just the first element of each group to get both a random region and a random node id, without the need to remove any items from a hash set.
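A minimal sketch of consuming pickable that way:
foreach (var group in pickable)
{
    var region = group.Key;  // region order was shuffled by the OrderBy above
    var id = group.First();  // likewise the ids within each region
    // ... record the id; the rest of that region's nodes are simply never visited
}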
I wouldn't expect performance to be too much of an issue as I just tried this with 1,000,000 nodes with 1,000 regions and got a result back in just over 600ms.
On a HashSet you can use ElementAt:
notreallrandomObj nrrbase = HS.ElementAt(0);
int region = nrrbase.region;
List<notreallrandomObj> removeItems = new List<notreallrandomObj>();
foreach (notreallrandomObj nrr in HS.Where(x => x.region == region))
removeItems.Add(nrr);
foreach (notreallrandomObj nrr in removeItems)
HS.Remove(nrr);
I'm not sure if you can remove inside the loop, so you may need to build up the remove list first.
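For what it's worth, HashSet<T> also has a RemoveWhere method that removes all matching items in a single call, so the separate remove list isn't strictly needed:
int removed = HS.RemoveWhere(x => x.region == region); // returns how many items were removed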
Yes, Remove is O(1) on a HashSet, but that does not mean it will be faster than a List overall. You don't even have a working solution yet and you're already optimizing; that is premature optimization.
With a List you can just use RemoveAll:
ll.RemoveAll(x => x.region == region);
I'm looking for the most efficient way in C# to store a collection of objects sorted by a comparator on an object attribute.
There are objects with the same value for that attribute, so duplicate keys occur in the collection.
The complexity of inserting and removing elements in the sorted data structure should not be higher than O(log(N)), since the attribute used for sorting will change very often and the collection has to be updated on every change to stay consistent.
The complexity of getting all elements of the data structure as a list in sorted order should not be higher than O(1).
Options might be the built-in SortedSet, SortedDictionary or SortedList. All of them fail when inserting or deleting if duplicate keys are present.
A workaround might be to use nested SortedDictionaries as shown below, aggregating objects with equal keys into separate collections sorted by another, unique key (ID for example):
SortedDictionary<long, SortedDictionary<long, RankedObject>> sPresortedByRank =
    new SortedDictionary<long, SortedDictionary<long, RankedObject>>(new ByRankAscSortedDictionary());
long rank = 52;
sPresortedByRank[rank] = new SortedDictionary<long, RankedObject>(new ByIdAscSortedDictionary());
Inserting and removing elements works in O(log(N)), which is good. Getting all elements of the data structure as a list requires an expensive Queryable.SelectMany, which increases the complexity of this operation to O(N^2), which is not acceptable.
My current best attempt is to use a plain List, and to insert and delete using BinarySearch to identify the indices for insertions and deletions. For inserts this gives me worst-case complexity O(log(N)). For deletes the average case is O(log(N)), since duplicate keys will be rare, but the worst case is O(N) (all objects with the same key). Getting all elements of the sorted list is O(1).
Can someone imagine a better way of managing an object collection in a sorted data structure that fits my needs, or is the current best attempt the best one in general?
Help and well-founded opinions are appreciated. Cookies for good answers, of course.
I think the simplest solution that meets your requirements is to add tie-breaking to your comparator to ensure that there will be no duplicates and a defined total ordering among all objects. Then you can just use SortedSet.
If you can have an ID in every object, for example, then instead of sorting by "rank", you can sort by (rank,ID) to make the comparator a total ordering.
To find all the elements with a specific rank, you would then use SortedSet.GetViewBetween() with the range from (rank, min_id) to (rank, max_id).
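A minimal sketch of that idea, using value tuples as the composite key (assuming a long rank and a long ID per object):
// Value tuples compare component-wise, so (rank, id) gives a total ordering: by rank, then by id.
var set = new SortedSet<(long rank, long id)>();
set.Add((52, 1));
set.Add((52, 2)); // same rank, different id: no longer a duplicate as far as the set is concerned
set.Add((10, 7));

// All elements with rank == 52, located via the tree in O(log N):
var rank52 = set.GetViewBetween((52, long.MinValue), (52, long.MaxValue));
foreach (var (rank, id) in rank52)
    Console.WriteLine($"{rank}: {id}");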
Iliar Turdushev's very useful hint:
Do you consider using the C5 library? It contains a TreeBag collection that satisfies all your requirements. By the way, for a plain List the complexity of inserting and removing elements is not O(log(N)); O(log(N)) is only the complexity of searching for the index where the element must be inserted/deleted. The insertion/deletion itself uses Array.Copy to shift elements, so the complexity would be O(M), where M <= N.
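To make that last point concrete, a rough sketch of the List + BinarySearch insert the question describes, with the shifting cost called out (plain long ranks for illustration):
var ranks = new List<long>(); // kept in sorted order

void InsertSorted(long rank)
{
    int index = ranks.BinarySearch(rank); // O(log N) to locate the slot
    if (index < 0) index = ~index;        // not found: BinarySearch returns the complement of the insertion point
    ranks.Insert(index, rank);            // the insert itself shifts the tail of the backing array: O(M), not O(log N)
}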
When I want to compare a huge list (about 700,000 elements) against a list of strings by a specific property, it takes a long time.
I tried AsParallel but it doesn't help much. I need a List for removedSuccessFromList because I want to use it to start a Parallel.ForEach.
List<string> successStrings = service.GetProperty().Select(q => q.IdString).ToList();
List<Property> removedSuccessFromList = properties.AsParallel()
    .Where(q => !successStrings.Contains(q.IdString)).ToList();
Use a more efficient data structure if you have a lot of strings in successStrings, like a hash set:
var successStrings = new HashSet<string>(service.GetProperty().Select(q => q.IdString));
List<Property> removedSuccessFromList = properties.Where(q => !successStrings.Contains(q.IdString)).ToList();
The List.Contains method has complexity O(N), so it scans all elements to find a match. HashSet.Contains has complexity O(1) - it can check whether an element exists very fast.
If your IdString is unique, maybe you could remove each found item from successStrings inside the Where logic, so the collection gets smaller as you go.
What is the best way to search for a string in a list of lists of strings?
For example, I have many lists of strings, e.g. List1, List2, etc., and all of the lists are collected in a list of lists (e.g. ListofLists<...>).
Now I want to search for the string "foo" in the list of lists. Is there a way to optimize this in C#?
Thanks in advance
Using Linq:
To check if any of the lists contains the string 'foo':
collection.Any(x => x.Contains("foo"));
To get the particular list which contains the string 'foo':
collection.FirstOrDefault(x => x.Contains("foo"));
Does this work?
var listOfLists = new List<List<string>>();
// insert code to populate listOfLists
var containsFoo = listOfLists.SelectMany(x => x).Any(x => x == "foo");
I'll just summarize/optimize other answers:
The best way to do this is using LINQ. There are two approaches:
You could flatten the list and check whether it contains the element
You could check whether any one of the lists contains the element
Flatten the list
This is done with SelectMany:
listOfLists.SelectMany(x => x).Contains("foo");
SelectMany combines all elements of the sublists (and of sublists of the sublists, and so on) into one sequence, so you can check whether it contains the string (https://msdn.microsoft.com/de-de/library/bb534336(v=vs.110).aspx).
Check all lists
This is done with Any:
listOfLists.Any(x => x.Contains("foo"));
Any checks whether any item fulfills the condition (https://msdn.microsoft.com/de-de/library/bb337697(v=vs.110).aspx).
Actually it seems to be more efficient to check all lists (with a randomly generated list of 10,000 entries in total, the first possibility needs 34 ms on average and the second one 31 ms).
In both possibilities I use Contains, which simply checks whether the list contains the element (https://msdn.microsoft.com/de-de/library/bb352880(v=vs.110).aspx).
Of course you could still use a loop as well:
var contains = false;
foreach (var l in listOfLists)
    foreach (var i in l)
        if (i == "foo")
        {
            contains = true;
            goto end;
        }
end: ; // a label must be followed by a statement, hence the empty one
But this is less readable, less efficient, more complicated and less elegant.
However, if you don't just want to check whether the string exists but want to do a bit more with it, the loop is probably the easiest option to extend. If you want an optimized version for another case, feel free to specify your requirements.
I'm consuming a stream of semi-random tokens. For each token, I'm maintaining a lot of data (including some sub-collections).
The number of unique tokens is unbounded but in practice tends to be on the order of 100,000-300,000.
I started with a list and identified the appropriate token object to update using a Linq query.
public class Model {
public List<State> States { get; set; }
...
}
var match = model.States.Where(x => x.Condition == stateText).SingleOrDefault();
Over the first ~30k unique tokens, I was able to find and update ~1,100 tokens/sec.
Performance analysis shows that 85% of the total CPU cycles are being spent on the Where(...).SingleOrDefault() (which makes sense; lists are an inefficient way to search).
So, I switched the list over to a HashSet and profiled again, confident that the HashSet would be able to do random lookups faster. This time, I was only processing ~900 tokens/sec, and a near-identical amount of time was spent on the Linq (89%).
So... First up, am I misusing the HashSet? (Is using Linq forcing a conversion to IEnumerable and then an enumeration, or something similar?)
If not, what's the best pattern to implement myself? I was under the impression that HashSet already does a binary seek, so I assume I'd need to build some sort of tree structure and have smaller sub-sets?
To answer some questions from the comments... The condition is unique (if I get the same token twice, I want to update the same entry), and the HashSet is the stock .NET implementation (System.Collections.Generic.HashSet<T>).
A wider view of the code is...
var state = new RollingList(model.StateDepth); // Tracks the last n items and drops older ones (basically an array and an index that wraps around)
var tokens = tokeniser.Tokenise(contents); // Iterator
foreach (var token in tokens) {
var stateText = StateToString(ref state);
var match = model.States.Where(x => x.Condition == stateText).FirstOrDefault();
// ... update the match as appropriate for the token
}
var match = model.States.Where(x => x.Condition == stateText).SingleOrDefault();
If you're doing that exact same thing with a hash set, that's no savings. Hash sets are optimized for quickly answering the question "is this member in the set?" not "is there a member that makes this predicate true in the set?" The latter is linear time whether it is a hash set or a list.
Possible data structures that meet your needs:
Make a dictionary mapping from text to state, and then do a search in the dictionary on the text key to get the resulting state. That's O(1) for searching and inserting in theory; in practice it depends on the quality of the hash.
Make a sorted dictionary mapping from text to state. Again, search on the text key. Sorted dictionaries keep the keys sorted in a balanced tree, so that's O(log n) for searching and inserting; a brief sketch follows below.
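A minimal sketch of that second option, assuming the State class from the question with a settable string Condition property:
var states = new SortedDictionary<string, State>();

// Insert or replace: O(log n)
states[stateText] = new State { Condition = stateText };

// Lookup: O(log n)
if (states.TryGetValue(stateText, out var match))
{
    // ... update the match as appropriate for the token
}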
30k is not that much, so if the state's Condition is unique you can do something like this.
Dictionary access is much faster.
var statesDic = model.States.ToDictionary(x => x.Condition, x => x);
var match = statesDic.ContainsKey(stateText) ? statesDic[stateText] : default(State);
Quoting MSDN:
The Dictionary generic class provides a mapping from a set of keys to a set of values. Each addition to the dictionary consists of a value and its associated key. Retrieving a value by using its key is very fast, close to O(1), because the Dictionary class is implemented as a hash table.
You can find more info about Dictionaries here.
Also be aware that Dictionaries trade memory for performance; you can do a quick test for 300k items to see what kind of space I'm talking about, like this:
var memoryBeforeDic = GC.GetTotalMemory(true);
var dic = new Dictionary<string,object>(300000);
var memoryAfterDic = GC.GetTotalMemory(true);
Console.WriteLine("Memory: {0}", memoryAfterDic - memoryBeforeDic);
I have the following code :
List<string> Words = item.Split(' ').ToList<string>();
Words.Sort((a, b) => b.Length.CompareTo(a.Length));
Which is supposed to sort a list of words from a line in a file (item) according to their length. However, if two words have the same length, they are supposed to keep their order of appearance in the line.
The problem here is that if the line is, for example "a b c", on my computer, the list will have three sorted items (0 - a, 1 - b, 2 - c), but on another computer, using the same .Net version (4.5), the sorted items will be (0 - c, 1 - b, 2 - a)
Is there a way to enforce the same result throughout different computers?
List.Sort is an unstable sort, meaning in your case that elements of the same length can go in different order.
This implementation performs an unstable sort; that is, if two elements are equal, their order might not be preserved. In contrast, a stable sort preserves the order of elements that are equal.
You can force the same results using LINQ and a chain of OrderBy/ThenBy calls instead of Sort.
var result = Words.Select((v, i) => new { v, i })
                  .OrderByDescending(x => x.v.Length) // same descending-by-length ordering as the original comparison
                  .ThenBy(x => x.i)                   // ties broken by original position
                  .Select(x => x.v)
                  .ToList();
But you should be aware that this creates a new list instead of sorting the existing one in place.
The method List.Sort() is an unstable sort. You cannot predict the order of duplicate keys.
There are 2 generic methods to solve this problem: use a stable sort, or force uniqueness by extending the key to include identity information.
One of the common stable sorts is an Insertion sort. I believe this would be the sort used by SortedList, but it does not allow duplicate keys. Failing that, you can write your own, either in Linq or by hand. Even a bubble sort is stable!
The preferred way is to keep the item identity: create a list of pairs, where each pair consists of the key value and its position in the list. Sort the list of pairs, or insert the pairs directly into a sorted collection, because every pair is unique. The ordering of these pairs, and of the keys they contain, is guaranteed to be the same on all platforms.
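For illustration, a rough sketch of that idea applied to the word list, using (length, position) pairs in a SortedSet so that every entry is unique:
// Comparer: longest first (as in the question), then original position to break ties.
var sorted = new SortedSet<(int length, int position, string word)>(
    Comparer<(int length, int position, string word)>.Create((a, b) =>
    {
        int byLength = b.length.CompareTo(a.length);
        return byLength != 0 ? byLength : a.position.CompareTo(b.position);
    }));

var words = "a bb cc ddd".Split(' ');
for (int i = 0; i < words.Length; i++)
    sorted.Add((words[i].Length, i, words[i]));

// Enumerating 'sorted' now yields: ddd, bb, cc, a - the same order on every platform.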