I'm consuming a stream of semi-random tokens. For each token, I'm maintaining a lot of data (including some sub-collections).
The number of unique tokens is unbounded but in practice tends to be on the order of 100,000-300,000.
I started with a list and identified the appropriate token object to update using a Linq query.
public class Model {
public List<State> States { get; set; }
...
}
var match = model.States.Where(x => x.Condition == stateText).SingleOrDefault();
Over the first ~30k unique tokens, I was able to find and update ~1,100 tokens/sec.
Performance analysis shows that 85% of the total CPU cycles are being spent on the Where(...).SingleOrDefault() (which makes sense, lists are an inefficient way to search).
So, I switched the list over to a HashSet and profiled again, confident that HashSet would be able to random seek faster. This time, I was only processing ~900 tokens/sec. And a near-identical amount of time was spent on the Linq (89%).
So... First up, am I misusing the HashSet? (Is using LINQ forcing a conversion to IEnumerable and then an enumeration / something similar?)
If not, what's the best pattern to implement myself? I was under the impression that HashSet already does a Binary seek so I assume I'd need to build some sort of tree structure and have smaller sub-sets?
To answer some questions from comments... The condition is unique (if I get the same token twice, I want to update the same entry), the HashSet is the stock .Net implementation (System.Collections.Generic.HashSet<T>).
A wider view of the code is...
var state = new RollingList(model.StateDepth); // Tracks last n items and drops older ones. (Basically an array and an index that wraps around.)
var tokens = tokeniser.Tokenise(contents); // Iterator
foreach (var token in tokens) {
var stateText = StateToString(ref state);
var match = model.States.Where(x => x.Condition == stateText).FirstOrDefault();
// ... update the match as appropriate for the token
}
var match = model.States.Where(x => x.Condition == stateText).SingleOrDefault();
If you're doing that exact same thing with a hash set, that's no savings. Hash sets are optimized for quickly answering the question "is this member in the set?" not "is there a member that makes this predicate true in the set?" The latter is linear time whether it is a hash set or a list.
Possible data structures that meet your needs:
Make a dictionary mapping from text to state, and then do a search in the dictionary on the text key to get the resulting state. That's O(1) for searching and inserting in theory; in practice it depends on the quality of the hash.
Make a sorted dictionary mapping from text to state. Again, search on text. Sorted dictionaries keep the keys sorted in a balanced tree, so that's O(log n) for searching and inserting.
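A minimal sketch of the first option, assuming a State class with a Condition string property as in the question. The StateIndex wrapper, its GetOrAdd method, and the Count property are hypothetical names used only for illustration:

```csharp
using System.Collections.Generic;

public class State
{
    public string Condition { get; set; }
    public int Count { get; set; }   // hypothetical payload standing in for the real per-token data
}

public class StateIndex
{
    // Keyed by Condition, so a lookup hashes the string instead of scanning every State.
    private readonly Dictionary<string, State> _byCondition = new Dictionary<string, State>();

    public int Size => _byCondition.Count;

    // Returns the existing State for this text, or creates and stores a new one.
    public State GetOrAdd(string stateText)
    {
        if (!_byCondition.TryGetValue(stateText, out State match))
        {
            match = new State { Condition = stateText };
            _byCondition.Add(stateText, match);
        }
        return match;
    }
}
```

Inside the token loop, `var match = index.GetOrAdd(stateText);` would then replace the Where(...).SingleOrDefault() call. TryGetValue does the lookup once, rather than ContainsKey followed by the indexer.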
30k is not that much so if state is unique you can do something like this.
Dictionary access is much faster.
var statesDic = model.States.ToDictionary(x => x.Condition, x => x);
var match = statesDic.ContainsKey(stateText) ? statesDic[stateText] : default(State);
Quoting MSDN:
The Dictionary generic class provides a mapping from a set of keys to a set of values. Each addition to the dictionary consists of a value and its associated key. Retrieving a value by using its key is very fast, close to O(1), because the Dictionary class is implemented as a hash table.
You can find more info about Dictionaries here.
Also be aware that dictionaries trade memory for speed. You can run a quick test for 300k items to see what kind of space I'm talking about, like this:
var memoryBeforeDic = GC.GetTotalMemory(true);
var dic = new Dictionary<string,object>(300000);
var memoryAfterDic = GC.GetTotalMemory(true);
Console.WriteLine("Memory: {0}", memoryAfterDic - memoryBeforeDic);
Background: In my program I have a list of nodes (a class I have defined). They each have a unique id number and a non-unique "region" number. I want to randomly select a node, record its id number, then remove all nodes of the same region from the list.
Problem: Someone pointed out to me that using a hashset instead of a list would be much faster, as a hashset's "order" is effectively random for my purposes and removing elements from it would be much faster. How would I do this (i.e. how do I access a random element in a hashset? I only know how to check to see if a hashset contains an element I already have)?
Also, I'm not quite sure how to remove all the nodes of a certain region. Do I have to override/define a comparison function to compare node regions? Again, I know how to remove a known element from a hashset, but here I don't know how to remove all nodes of a certain region.
I can post specifics about my code if that would help.
To be clear, the order of items in a HashSet isn't random, it's just not easily determinable. Meaning if you iterate a hash set multiple times the items will be in the same order each time, but you have no control over what order they're in.
That said, HashSet<T> implements IEnumerable<T> so you could just pick a random number n and remove the nth item:
// assuming a Random object is defined somewhere (do not declare it here)
var n = rand.Next(hashSet.Count);
var item = hashSet.ElementAt(n);
hashSet.Remove(item);
Also, I'm not quite sure how to remove all the nodes of a certain region. Do I have to override/define a comparison function to compare node regions?
Not necessarily - you'll need to scan the hashSet to find matching items (easily done with Linq) and remove each one individually. Whether you do that by just comparing properties or defining an equality comparer is up to you.
foreach (var dupe in hashSet.Where(x => x.Region == item.Region).ToList())
hashSet.Remove(dupe);
Note the ToList which is necessary since you can't modify a collection while iterating over it, so the items to remove need to be stored in a different collection.
Note that you can't override Equals in the Node class for this purpose or you won't be able to put multiple nodes from one region in the hash set.
If you haven't noticed, both of these requirements defeat the purpose of using a HashSet: a HashSet is faster only when looking for a known item; iterating or looking for items based on properties is no faster than a regular collection. It would be like looking through the phone book to find all people whose phone numbers start with 5.
If you always want the items organized by region, then perhaps a Dictionary<int, List<Node>> is a better structure.
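A rough sketch of that Dictionary<int, List<Node>> idea, assuming a Node class with id and region fields. The RegionPicker class and its method names are made up for illustration, and picking uniformly over nodes (rather than over regions) is an assumption about what "random" should mean here:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Node
{
    public int id;
    public int region;
}

public static class RegionPicker
{
    // Groups nodes by region so a whole region can be dropped with one Remove call.
    public static Dictionary<int, List<Node>> GroupByRegion(IEnumerable<Node> nodes)
    {
        return nodes.GroupBy(n => n.region)
                    .ToDictionary(g => g.Key, g => g.ToList());
    }

    // Picks a random node, removes its whole region, and returns the node's id.
    public static int PickAndRemoveRegion(Dictionary<int, List<Node>> byRegion, Random rand)
    {
        int total = byRegion.Values.Sum(list => list.Count);
        int n = rand.Next(total);
        foreach (var pair in byRegion)
        {
            if (n < pair.Value.Count)
            {
                int id = pair.Value[n].id;
                byRegion.Remove(pair.Key);   // we return right away, so the enumerator never advances again
                return id;
            }
            n -= pair.Value.Count;           // skip past this region's nodes
        }
        throw new InvalidOperationException("No nodes left.");
    }
}
```

Each pick still walks the regions to find the nth node, so it is linear in the number of regions, but removing a region no longer requires scanning every node.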
There's another alternative approach that you could take that could end up being faster than removals from hash sets, and that's creating a structure that does your job for you in one go.
First up, to give me some sample data I'm running this code:
var rnd = new Random();
var nodes =
Enumerable
.Range(0, 10)
.Select(n => new Node() { id = n, region = rnd.Next(0, 3) })
.ToList();
That gives me this kind of data:
Now I build up my structure like this:
var pickable =
nodes
.OrderBy(n => rnd.Next())
.ToLookup(n => n.region, n => n.id);
Which gives me this:
Notice how the regions and individual ids are randomized in the lookup. Now it's possible to iterate over the lookup and take just the first element of each group to get both a random region and random node id without the need to remove any items from a hash set.
I wouldn't expect performance to be too much of an issue as I just tried this with 1,000,000 nodes with 1,000 regions and got a result back in just over 600ms.
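For illustration, taking the first element of each group might look like the sketch below. It uses (id, region) tuples instead of the Node class to stay self-contained, and LookupPicker / OneIdPerRegion are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LookupPicker
{
    // Shuffles the nodes, groups them by region, and returns one id per region.
    public static List<int> OneIdPerRegion(IEnumerable<(int id, int region)> nodes, Random rnd)
    {
        var pickable = nodes
            .OrderBy(n => rnd.Next())              // shuffle, as in the lookup built above
            .ToLookup(n => n.region, n => n.id);

        // After the shuffle, the first id in each group is a random node from that region.
        return pickable.Select(group => group.First()).ToList();
    }
}
```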
On a hashset you can use ElementAt
notreallrandomObj nrrbase = HS.ElementAt(0);
int region = nrrbase.region;
List<notreallrandomObj> removeItems = new List<notreallrandomObj>();
foreach (notreallrandomObj nrr in HS.Where(x => x.region == region))
removeItems.Add(nrr);
foreach (notreallrandomObj nrr in removeItems)
HS.Remove(nrr);
Not sure if you can remove in the loop.
You may need to build up the remove list.
Yes, Remove is O(1) on a HashSet, but that does not mean it will be faster than a List overall. You don't even have a working solution yet and you are already optimizing. That is premature optimization.
With a List you can just use RemoveAll
ll.RemoveAll(x => x.region == region);
I have the following code. The dictionary "_ItemsDict" contains millions of records. This code takes a very long time to add items to the associatedItemslst list. Is there a way to speed up this process?
foreach (var obj in lst)
{
foreach (var item in _ItemsDict.Where(ikey => ikey.Key.StartsWith(obj))
.Select(ikey => ikey.Value))
{
var aI = new AssociatedItem
{
associatedItemCode = item.ItemCode
};
associatedItemslst.Add(aI);
}
}
Instead of using a Dictionary<TKey, TValue> you may want to implement a Trie/Radix Tree/Prefix Tree.
Quoted from wikipedia:
A common application of a trie is storing a predictive text or autocomplete dictionary, such as found on a mobile telephone.
(snip)
Tries are also well suited for implementing approximate matching algorithms,[6] including those used in spell checking and hyphenation[2] software.
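To make the suggestion concrete, here is a bare-bones trie sketch. It is not tuned for millions of keys (one Dictionary per node is memory hungry), and the Trie<TValue> class and its member names are illustrative, not from any library:

```csharp
using System.Collections.Generic;

// A minimal trie mapping string keys to values.
public class Trie<TValue>
{
    private class TrieNode
    {
        public readonly Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
        public readonly List<TValue> Values = new List<TValue>();
    }

    private readonly TrieNode _root = new TrieNode();

    public void Add(string key, TValue value)
    {
        var node = _root;
        foreach (char c in key)
        {
            if (!node.Children.TryGetValue(c, out var child))
                node.Children[c] = child = new TrieNode();
            node = child;
        }
        node.Values.Add(value);
    }

    // Returns the values of every key that starts with the given prefix.
    public List<TValue> StartsWith(string prefix)
    {
        var results = new List<TValue>();
        var node = _root;
        foreach (char c in prefix)
        {
            if (!node.Children.TryGetValue(c, out node))
                return results;   // no key has this prefix
        }
        // Walk the subtree below the prefix node and collect everything in it.
        var pending = new Stack<TrieNode>();
        pending.Push(node);
        while (pending.Count > 0)
        {
            var current = pending.Pop();
            results.AddRange(current.Values);
            foreach (var child in current.Children.Values)
                pending.Push(child);
        }
        return results;
    }
}
```

With this in place, the inner loop over _ItemsDict.Where(ikey => ikey.Key.StartsWith(obj)) would become a single trie.StartsWith(obj) call, whose cost depends on the prefix length and the number of matches rather than on the total number of keys.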
You can divide the time by a factor of 5 or 6 by using Parallel.ForEach():
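String obj = "42";
var bag = new ConcurrentBag<AssociatedItem>(); // assumed: a thread-safe bag from System.Collections.Concurrent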
Parallel.ForEach(_ItemsDict, new ParallelOptions{ MaxDegreeOfParallelism = Environment.ProcessorCount},
(i) =>
{
if (i.Key.StartsWith(obj))
bag.Add(new AssociatedItem() { associatedItemCode = i.Value });
});
But it seems there's definitely an architectural issue. Trie is one way to go. Or you can use a
Dictionary<String,List<TValue>>
where you store all occurrence of each part of each String, and then reference associated objects.
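One way to read that suggestion, sketched below, is to index every prefix of every key up front. That interpretation is an assumption, and PrefixIndex.Build is a hypothetical helper. It trades a lot of memory for lookups that are a single dictionary access:

```csharp
using System.Collections.Generic;

public static class PrefixIndex
{
    // Maps every prefix of every key to the values stored under keys with that prefix.
    public static Dictionary<string, List<TValue>> Build<TValue>(
        IEnumerable<KeyValuePair<string, TValue>> items)
    {
        var index = new Dictionary<string, List<TValue>>();
        foreach (var item in items)
        {
            for (int len = 1; len <= item.Key.Length; len++)
            {
                string prefix = item.Key.Substring(0, len);
                if (!index.TryGetValue(prefix, out var bucket))
                    index[prefix] = bucket = new List<TValue>();
                bucket.Add(item.Value);
            }
        }
        return index;
    }
}
```

A StartsWith("42") query then becomes index.TryGetValue("42", out var matches). With millions of long keys the number of stored prefixes can get very large, which is where a trie tends to be the better fit.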
Last but not least, if your data comes from a database, SQL Server is very efficient at prefix searches on a varchar column with a clause such as:
WHERE ValueColumn like '42%' (equivalent of StartsWith("42") )
I do not think a dictionary helps make this code faster. Dictionaries are good at matching a complete key, not a partial key; in your case the code actually goes through every key in the dictionary to find the results. I would suggest using some other data structure to get the result faster, one of them being the trie. I have posted a blog post on autocomplete using a trie here: https://devesh4blog.wordpress.com/2013/11/16/real-time-auto-complete-using-trie-in-c/
I have a process I've inherited that I'm converting to C# from another language. Numerous steps in the process loop through what can be a lot of records (100K-200K) to do calculations. As part of those processes it generally does a lookup into another list to retrieve some values. I would normally move this kind of thing into a SQL statement (and we have where we've been able to) but in these cases there isn't really an easy way to do that. In some places we've attempted to convert the code to a stored procedure and decided it wasn't working nearly as well as we had hoped.
Effectively, the code does this:
var match = cost.Where(r => r.ryp.StartsWith(record.form.TrimEnd()) &&
r.year == record.year &&
r.period == record.period).FirstOrDefault();
cost is a local List type. If I was doing a search on only one field I'd probably just move this into a Dictionary. The records aren't always unique either.
Obviously, this is REALLY slow.
I ran across the open source library I4O which can build indexes, however it fails for me in various queries (and I don't really have the time to attempt to debug the source code). It also doesn't work with .StartsWith or .Contains (StartsWith is much more important since a lot of the original queries take advantage of the fact that doing a search for "A" would find a match in "ABC").
Are there any other projects (open source or commercial) that do this sort of thing?
EDIT:
I did some searching based on the feedback and found Power Collections which supports dictionaries that have keys that aren't unique.
I tested ToLookup() which worked great - it's still not quite as fast as the original code, but it's at least acceptable. It's down from 45 seconds to 3-4 seconds. I'll take a look at the Trie structure for the other look ups.
Thanks.
Looping through a list of 100K-200K items doesn't take very long. Finding matching items within the list by using nested loops (n^2) does take long. I infer this is what you're doing (since you have assignment to a local match variable).
If you want to quickly match items together, use .ToLookup.
var lookup = cost.ToLookup(r => new {r.year, r.period, form = r.ryp});
foreach(var group in lookup)
{
// do something with items in group.
}
Your StartsWith criterion is troublesome for key-based matching. One way to approach that problem is to ignore it when generating keys.
var lookup = cost.ToLookup(r => new {r.year, r.period });
var key = new {record.year, record.period};
string lookForThis = record.form.TrimEnd();
var match = lookup[key].FirstOrDefault(r => r.ryp.StartsWith(lookForThis));
Ideally, you would create the lookup once and reuse it for many queries. Even if you didn't... even if you created the lookup each time, it will still be faster than n^2.
Certainly you can do better than this. Let's start by considering that dictionaries are not useful only when you want to query one field; you can very easily have a dictionary where the key is an immutable value that aggregates many fields. So for this particular query, an immediate improvement would be to create a key type:
// should be immutable, GetHashCode and Equals should be implemented, etc etc
struct Key
{
public int year;
public int period;
}
and then package your data into an IDictionary<Key, ICollection<T>> or similar where T is the type of your current list. This way you can cut down heavily on the number of rows considered in each iteration.
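As a sketch of that packaging step: the version below uses a value tuple as the key instead of a hand-written struct (tuples already implement Equals and GetHashCode). The CostRecord shape is guessed from the field names in the question, CostIndex is a hypothetical helper, and the ordinal string comparison is an assumption:

```csharp
using System;
using System.Collections.Generic;

public class CostRecord   // hypothetical shape, field names taken from the question's query
{
    public string ryp;
    public int year;
    public int period;
}

public static class CostIndex
{
    // Buckets the records by (year, period) so each query only scans one small bucket.
    public static Dictionary<(int year, int period), List<CostRecord>> Build(IEnumerable<CostRecord> cost)
    {
        var index = new Dictionary<(int year, int period), List<CostRecord>>();
        foreach (var r in cost)
        {
            var key = (r.year, r.period);
            if (!index.TryGetValue(key, out var bucket))
                index[key] = bucket = new List<CostRecord>();
            bucket.Add(r);
        }
        return index;
    }

    // Returns the first record in the bucket whose ryp starts with the trimmed form, or null.
    public static CostRecord Find(
        Dictionary<(int year, int period), List<CostRecord>> index,
        string form, int year, int period)
    {
        string prefix = form.TrimEnd();
        if (!index.TryGetValue((year, period), out var bucket))
            return null;
        return bucket.Find(r => r.ryp.StartsWith(prefix, StringComparison.Ordinal));
    }
}
```

The index would be built once and reused for every record in the outer loop.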
The next step would be to use not an ICollection<T> as the value type but a trie (this looks promising), which is a data structure tailored to finding strings that have a specified prefix.
Finally, a free micro-optimization would be to take the TrimEnd out of the loop.
Now certainly all of this only applies to the specific example given and may need to be revisited due to other specifics of your situation, but in any case you should be able to extract practical gain from this or something similar.
What is the most efficient way to do look-up table in C#
I have a look-up table. Sort of like
0 "Thing 1"
1 "Thing 2"
2 "Reserved"
3 "Reserved"
4 "Reserved"
5 "Not a Thing"
So if someone wants "Thing 1" or "Thing 2" they pass in 0 or 1. But they may pass in something else also.
I have 256 of these type of things and maybe 200 of them are reserved.
So what is the most efficient way to set this up?
A string Array or dictionary variable that gets all of the values. And then take the integer and return the value at that place.
One problem I have with this solution is all of the "Reserved" values. I don't want to create those redundant "Reserved" values. Alternatively, I could have an if statement against all of the various places that are reserved, but they might not be just 2-3; they might be 2-3, 40-55, and all different places in the byte. This if statement would get unruly quickly.
My other option was a switch statement. I would have all of the 50ish known values and would fall through to the default for the reserved values.
I am wondering if this is a lot more processing than creating a string array or dictionary and just returning the appropriate value.
Something else? Is there another way to consider?
"Retrieving a value by using its key is very fast, close to O(1), because the Dictionary(TKey, TValue) class is implemented as a hash table."
var things = new Dictionary<int, string>();
things[0]="Thing 1";
things[1]="Thing 2";
things[4711]="Carmen Sandiego";
The absolute fastest way to do lookups of integer values in C# is with an array. This will be preferable to using a dictionary, maybe, if you are trying to do tens of thousands of lookups at a time. For most purposes, this is overkill; it's more likely that you need to optimize developer time than processor time.
If the reserved keys are not simply all keys that aren't in the lookup table (i.e. if a lookup for a key can return the found value, a not-found status, or a reserved status), you'll need to save the reserved keys somewhere. Saving them as dictionary entries with magic values (e.g. the key of any dictionary entry whose value is null is reserved) is OK unless you write code that iterates over the dictionary's entries without filtering them.
A way to solve that problem is to use a separate HashSet<int> to store the reserved keys, and maybe bake the whole thing into a class, e.g.:
public class LookupTable
{
public Dictionary<int, string> Table { get; }
public HashSet<int> ReservedKeys { get; }
public LookupTable()
{
Table = new Dictionary<int, string>();
ReservedKeys = new HashSet<int>();
}
public string Lookup(int key)
{
return (ReservedKeys.Contains(key))
? null
: Table[key];
}
}
You'll note that this still has the magic-value issue - Lookup returns null if the key is reserved, and throws an exception if it's not in the table - but at least now you can iterate over Table.Values without filtering magic values.
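If the remaining magic value bothers you, one possible variation (a sketch, not part of the class above; LookupStatus and StatusLookupTable are made-up names) is to return an explicit status and hand the value back through an out parameter:

```csharp
using System.Collections.Generic;

public enum LookupStatus { Found, Reserved, NotFound }

public class StatusLookupTable
{
    private readonly Dictionary<int, string> _table = new Dictionary<int, string>();
    private readonly HashSet<int> _reserved = new HashSet<int>();

    public void Add(int key, string value) => _table[key] = value;
    public void Reserve(int key) => _reserved.Add(key);

    // Reports reserved and missing keys explicitly instead of returning null or throwing.
    public LookupStatus TryLookup(int key, out string value)
    {
        value = null;
        if (_reserved.Contains(key))
            return LookupStatus.Reserved;
        return _table.TryGetValue(key, out value) ? LookupStatus.Found : LookupStatus.NotFound;
    }
}
```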
Check out the HybridDictionary. It automatically adjusts its underlying storage mechanism based on size to get the greatest efficiency.
http://msdn.microsoft.com/en-us/library/system.collections.specialized.hybriddictionary.aspx
If you have lots of reserved (currently unused) values or if the range of the integer values can get very big, then I would use a generic dictionary (Dictionary<TKey, TValue>):
var myDictionary = new Dictionary<int, string>();
myDictionary.Add(0, "Value 1");
myDictionary.Add(200, "Another value");
// and so on
Otherwise, if you have a fixed number of values and only a few of them are currently unused, then I'd use a string array (string[200]) and set/leave the reserved entries to null.
var myArray = new string[200];
myArray[0] = "Value 1";
myArray[2] = "Another value";
//myArray[1] is null
The in-built Dictionary object (preferably a generic dictionary) would be ideal for this, and is specifically designed for fast/efficient retrieval of the values relating to the keys.
From the linked MSDN article:
Retrieving a value by using its key is very fast, close to O(1), because the Dictionary<TKey, TValue> class is implemented as a hash table.
As far as your "reserved" keys go, I wouldn't worry about that at all if we're only talking about a few hundred keys/values. It's only when you reach tens, maybe hundreds of thousands of "reserved" keys/values that you'll want to implement something more efficient.
In those cases, probably the most efficient storage container then would be an implementation of a Sparse Matrix.
I'm not quite sure I understand your problem correctly. You have a collection of strings. Each string is associated with an index. The consumer gives an index and you return the corresponding string, unless the index is reserved. Right?
Can't you simply set reserved items to null in the array?
If not, using a dictionary that doesn't contain the reserved items seems a reasonable solution.
Anyway, you'll probably get better answers if you clarify your problem.
I would use a Dictionary to do the lookups. This is the most efficient way to do look ups by far. Using a string will run somewhere in the region of O(n) to find the object.
It might be useful to have a second Dictionary to allow you to do a reverse lookup if it's needed.
Load all your values into
var dic = new Dictionary<int, string>();
And use this for retrieval:
string GetDescription(int val)
{
    if (0 <= val && val < 256)
    {
        // Keys in range but missing from the dictionary are the reserved values.
        if (!dic.ContainsKey(val))
            return "Reserved";
        return dic[val];
    }
    throw new ApplicationException("Value must be between 0 and 255");
}
Your question seems to imply that the query key is an integer. Since you have at most 256 items, then the query key is in the range 0..255, right? If so, just have a string array of 256 strings, and use the key as an index into the array.
If your query key is a string value, then it's more like a real lookup table. Using a Dictionary object is simple, but if you're after raw speed for a set of as few as 50 or so actual answers, a do-it-yourself approach such as binary search, or a trie, could be quicker. If you use binary search, since the number of items is so small, you could unroll it.
How often does the list of items change? If it only changes very seldom, you can get even better speed by generating code to do the search, which you can then compile and execute to do each query.
On the other hand, I assume you've proven that this lookup is your bottleneck, either by profiling or taking stackshots. If less than 10% of time-when-slow is spent in this query, then it is not your bottleneck so you may as well do the thing that is easiest to code.
Suppose I have a collection (be it an array, generic List, or whatever is the fastest solution to this problem) of a certain class, let's call it ClassFoo:
class ClassFoo
{
public string word;
public float score;
//... etc ...
}
Assume there's going to be like 50.000 items in the collection, all in memory.
Now I want to obtain as fast as possible all the instances in the collection that obey a condition on its word member, for example like this:
List<ClassFoo> result = new List<ClassFoo>();
foreach (ClassFoo cf in collection)
{
if (cf.word.StartsWith(query) || cf.word.EndsWith(query))
result.Add(cf);
}
How do I get the results as fast as possible? Should I consider some advanced indexing techniques and datastructures?
The application domain for this problem is an autocompleter, that gets a query and gives a collection of suggestions as a result. Assume that the condition doesn't get any more complex than this. Assume also that there's going to be a lot of searches.
With the constraint that the condition clause can be "anything", then you're limited to scanning the entire list and applying the condition.
If there are limitations on the condition clause, then you can look at organizing the data to more efficiently handle the queries.
For example, the code sample with the "byFirstLetter" dictionary doesn't help at all with an "endsWith" query.
So, it really comes down to what queries you want to do against that data.
In Databases, this problem is the burden of the "query optimizer". In a typical database, if you have a database with no indexes, obviously every query is going to be a table scan. As you add indexes to the table, the optimizer can use that data to make more sophisticated query plans to better get to the data. That's essentially the problem you're describing.
Once you have a more concrete subset of the types of queries then you can make a better decision as to what structure is best. Also, you need to consider the amount of data. If you have a list of 10 elements, each less than 100 bytes, a scan of everything may well be the fastest thing you can do since you have such a small amount of data. Obviously that doesn't scale to 1M elements, but even clever access techniques carry a cost in setup, maintenance (like index maintenance), and memory.
EDIT, based on the comment
If it's an auto completer, if the data is static, then sort it and use a binary search. You're really not going to get faster than that.
If the data is dynamic, then store it in a balanced tree, and search that. That's effectively a binary search, and it lets you keep adding data in arbitrary order.
Anything else is some specialization on these concepts.
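For the static case, the sort-plus-binary-search approach could look roughly like this. It is a sketch that works on plain strings rather than ClassFoo, assumes ordinal (case-sensitive) comparison, and PrefixSearch is a hypothetical helper name:

```csharp
using System;
using System.Collections.Generic;

public static class PrefixSearch
{
    // sortedWords must already be sorted with StringComparer.Ordinal.
    public static List<string> StartsWith(string[] sortedWords, string prefix)
    {
        // Binary search for the first word that is >= prefix in ordinal order.
        int lo = 0, hi = sortedWords.Length;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (string.CompareOrdinal(sortedWords[mid], prefix) < 0)
                lo = mid + 1;
            else
                hi = mid;
        }

        // All words sharing the prefix sit next to each other, starting at lo.
        var results = new List<string>();
        for (int i = lo; i < sortedWords.Length
                         && sortedWords[i].StartsWith(prefix, StringComparison.Ordinal); i++)
        {
            results.Add(sortedWords[i]);
        }
        return results;
    }
}
```

The array is sorted once with Array.Sort(words, StringComparer.Ordinal); each query then costs O(log n) plus the number of matches.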
var Answers = myList.Where(item => item.word.StartsWith(query) || item.word.EndsWith(query));
That's the easiest in my opinion, and it should execute rather quickly.
Not sure I understand... All you can really do is optimize the rule, that's the part that needs to be fastest. You can't speed up the loop without just throwing more hardware at it.
You could parallelize if you have multiple cores or machines.
I'm not up on my Java right now, but I would think about the following things.
How you are creating your list? Perhaps you can create it already ordered in a way which cuts down on comparison time.
If you are just doing a straight loop through your collection, you won't see much difference between storing it as an array or as a linked list.
For storing the results, depending on how you are collecting them, the structure could make a difference (but assuming Java's generic structures are smart, it won't). As I said, I'm not up on my Java, but I assume that the generic linked list would keep a tail pointer. In this case, it wouldn't really make a difference. Someone with more knowledge of the underlying array vs linked list implementation and how it ends up looking in the byte code could probably tell you whether appending to a linked list with a tail pointer or inserting into an array is faster (my guess would be the array). On the other hand, you would need to know the size of your result set or sacrifice some storage space and make it as big as the whole collection you are iterating through if you wanted to use an array.
Optimizing your comparison query by figuring out which comparison is most likely to be true and doing that one first could also help. ie: If in general 10% of the time a member of the collection starts with your query, and 30% of the time a member ends with the query, you would want to do the end comparison first.
For your particular example, sorting the collection would help as you could binary chop to the first item that starts with query and terminate early when you reach the next one that doesn't; you could also produce a table of pointers to collection items sorted by the reverse of each string for the second clause.
In general, if you know the structure of the query in advance, you can sort your collection (or build several sorted indexes for your collection if there are multiple clauses) appropriately; if you do not, you will not be able to do better than linear search.
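The reversed-string index for the EndsWith clause might be sketched as follows. SuffixIndex is an illustrative name, the comparison is assumed to be ordinal, and the simple char-by-char reversal does not handle surrogate pairs:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class SuffixIndex
{
    private readonly string[] _reversed;   // every word reversed, sorted ordinally

    public SuffixIndex(IEnumerable<string> words)
    {
        _reversed = words.Select(w => Reverse(w)).ToArray();
        Array.Sort(_reversed, StringComparer.Ordinal);
    }

    private static string Reverse(string s)
    {
        char[] chars = s.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }

    // A word ends with suffix exactly when its reversal starts with the reversed suffix.
    public List<string> EndsWith(string suffix)
    {
        string prefix = Reverse(suffix);
        int lo = 0, hi = _reversed.Length;
        while (lo < hi)   // binary search for the first reversed word >= prefix
        {
            int mid = lo + (hi - lo) / 2;
            if (string.CompareOrdinal(_reversed[mid], prefix) < 0)
                lo = mid + 1;
            else
                hi = mid;
        }

        var results = new List<string>();
        for (int i = lo; i < _reversed.Length
                         && _reversed[i].StartsWith(prefix, StringComparison.Ordinal); i++)
        {
            results.Add(Reverse(_reversed[i]));   // undo the reversal before returning
        }
        return results;
    }
}
```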
If it's something where you populate the list once and then do many lookups (thousands or more) then you could create some kind of lookup dictionary that maps starts with/ends with values to their actual values. That would be a fast lookup, but would use much more memory. If you aren't doing that many lookups or know you're going to be repopulating the list at least semi-frequently I'd go with the LINQ query that CQ suggested.
You can create some sort of index and it might get faster.
We can build an index like this:
var indexByFirstLetter = new Dictionary<char, List<ClassFoo>>();
foreach (var cf in collection) {
    // Index each item under both its first and its last letter (assumes non-empty words).
    foreach (char c in new[] { cf.word[0], cf.word[cf.word.Length - 1] }) {
        if (!indexByFirstLetter.TryGetValue(c, out var bucket))
            indexByFirstLetter[c] = bucket = new List<ClassFoo>();
        bucket.Add(cf);
    }
}
Then use it like this:
var candidates = new HashSet<ClassFoo>(); // items starting with query[0] or ending with the query's last letter
foreach (char c in new[] { query[0], query[query.Length - 1] })
    if (indexByFirstLetter.TryGetValue(c, out var bucket))
        candidates.UnionWith(bucket);
foreach (ClassFoo cf in candidates) {
    if (cf.word.StartsWith(query) || cf.word.EndsWith(query))
        result.Add(cf);
}
Now we possibly do not have to loop through as many ClassFoo as in your example, but then again we have to keep the index up to date. There is no guarantee that it is faster, but it is definitely more complicated.
Depends. Are all your objects always going to be loaded in memory? Do you have a finite limit of objects that may be loaded? Will your queries have to consider objects that haven't been loaded yet?
If the collection will get large, I would definitely use an index.
In fact, if the collection can grow to an arbitrary size and you're not sure that you will be able to fit it all in memory, I'd look into an ORM, an in-memory database, or another embedded database. XPO from DevExpress for ORM or SQLite.Net for in-memory database comes to mind.
If you don't want to go this far, make a simple index consisting of the "word" member values mapping to class references.
If the set of possible criteria is fixed and small, you can assign a bitmask to each element in the list. The size of the bitmask is the size of the set of the criteria. When you create an element/add it to the list, you check which criteria it satisfies and then set the corresponding bits in the bitmask of this element. Matching the elements from the list will be as easy as matching their bitmasks with the target bitmask. A more general method is the Bloom filter.
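A small sketch of the bitmask idea. The two criteria below are invented purely for illustration, and MaskedItem / BitmaskFilter are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

[Flags]
public enum Criteria
{
    None = 0,
    StartsWithVowel = 1,   // hypothetical criterion
    LongWord = 2           // hypothetical criterion: more than 5 characters
}

public class MaskedItem
{
    public string Word;
    public Criteria Mask;
}

public static class BitmaskFilter
{
    // Evaluates every criterion once, when the item is created.
    public static MaskedItem Create(string word)
    {
        var mask = Criteria.None;
        if (word.Length > 0 && "aeiou".IndexOf(char.ToLowerInvariant(word[0])) >= 0)
            mask |= Criteria.StartsWithVowel;
        if (word.Length > 5)
            mask |= Criteria.LongWord;
        return new MaskedItem { Word = word, Mask = mask };
    }

    // An item matches when all required bits are set in its mask.
    public static List<MaskedItem> Match(IEnumerable<MaskedItem> items, Criteria required)
    {
        return items.Where(i => (i.Mask & required) == required).ToList();
    }
}
```

Matching is then one AND and one comparison per element. It is still a linear scan, but each test is very cheap.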