When I compare a huge list (about 700,000 elements), by a specific property, against a list of strings, it takes a long time.
I tried AsParallel, but it doesn't help much. I need removedSuccessFromList to be a list because I want to use it to start a Parallel.ForEach.
List<string> successStrings = service.GetProperty().Select(q =>
q.IdString).ToList();
List<Property> removedSuccessFromList = properties.AsParallel().Where(q =>
!successStrings.Contains(q.IdString)).ToList();
Use a more efficient data structure if you have a lot of strings in successStrings, such as a hash set:
var successStrings = new HashSet<string>(service.GetProperty().Select(q => q.IdString));
List<Property> removedSuccessFromList = properties.Where(q => !successStrings.Contains(q.IdString)).ToList();
List.Contains has complexity O(N), so it scans all elements to find a match. HashSet.Contains has complexity O(1): it can check whether an element exists very fast.
If your IdString is unique, maybe you could remove each found item from successStrings in the Where logic so the list gets smaller over time.
I'm consuming a stream of semi-random tokens. For each token, I'm maintaining a lot of data (including some sub-collections).
The number of unique tokens is unbounded but in practice tends to be on the order of 100,000-300,000.
I started with a list and identified the appropriate token object to update using a Linq query.
public class Model {
public List<State> States { get; set; }
...
}
var match = model.States.Where(x => x.Condition == stateText).SingleOrDefault();
Over the first ~30k unique tokens, I was able to find and update ~1,100 tokens/sec.
Performance analysis shows that 85% of the total CPU cycles are being spent on the Where(...).SingleOrDefault() (which makes sense, since lists are an inefficient way to search).
So, I switched the list over to a HashSet and profiled again, confident that HashSet would be able to random seek faster. This time, I was only processing ~900 tokens/sec. And a near-identical amount of time was spent on the Linq (89%).
So... First up, am I misusing the HashSet? (Is using Linq forcing a conversion to IEnumerable and then an enumeration, or something similar?)
If not, what's the best pattern to implement myself? I was under the impression that HashSet already does a Binary seek so I assume I'd need to build some sort of tree structure and have smaller sub-sets?
To answer some questions from comments... The condition is unique (if I get the same token twice, I want to update the same entry), and the HashSet is the stock .NET implementation (System.Collections.Generic.HashSet<T>).
A wider view of the code is...
var state = new RollingList(model.StateDepth); // Tracks last n items and drops older ones. (Basically an array and an index that wraps around.)
var tokens = tokeniser.Tokenise(contents); // Iterator
foreach (var token in tokens) {
var stateText = StateToString(ref state);
var match = model.States.Where(x => x.Condition == stateText).FirstOrDefault();
// ... update the match as appropriate for the token
}
var match = model.States.Where(x => x.Condition == stateText).SingleOrDefault();
If you're doing that exact same thing with a hash set, that's no savings. Hash sets are optimized for quickly answering the question "is this member in the set?" not "is there a member that makes this predicate true in the set?" The latter is linear time whether it is a hash set or a list.
Possible data structures that meet your needs:
Make a dictionary mapping from text to state, and then do a search in the dictionary on the text key to get the resulting state. That's O(1) for searching and inserting in theory; in practice it depends on the quality of the hash.
Make a sorted dictionary mapping from text to state. Again, search on text. Sorted dictionaries keep the keys sorted in a balanced tree, so that's O(log n) for searching and inserting.
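The dictionary option above could look roughly like the following. This is a minimal sketch, assuming a State class with a string Condition property as in the question; the Count property and the StateIndex wrapper are hypothetical names used only for illustration.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for the question's State class.
public class State
{
    public string Condition { get; set; }
    public int Count { get; set; } // illustrative payload, not from the question
}

// Hypothetical wrapper that keeps states indexed by their Condition text.
public class StateIndex
{
    private readonly Dictionary<string, State> _byCondition =
        new Dictionary<string, State>();

    public int Size { get { return _byCondition.Count; } }

    // Returns the existing state for the text, or creates and stores a new one.
    // The dictionary hashes the key, so no scan over all states is needed.
    public State GetOrAdd(string stateText)
    {
        State match;
        if (!_byCondition.TryGetValue(stateText, out match))
        {
            match = new State { Condition = stateText };
            _byCondition.Add(stateText, match);
        }
        return match;
    }
}
```

In the token loop, the Where(...).FirstOrDefault() call would then be replaced by something like index.GetOrAdd(stateText), assuming Condition never changes after a state has been added.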
30k is not that much so if state is unique you can do something like this.
Dictionary access is much faster.
var statesDic = model.States.ToDictionary(x => x.Condition, x => x);
var match = statesDic.ContainsKey(stateText) ? statesDic[stateText] : default(State);
Quoting MSDN:
The Dictionary generic class provides a mapping from a set of keys to a set of values. Each addition to the dictionary consists of a value and its associated key. Retrieving a value by using its key is very fast, close to O(1), because the Dictionary class is implemented as a hash table.
You can find more info about Dictionaries here.
Also be aware that Dictionaries use memory space to improve performance. You can do a quick test for 300k items to see what kind of space I'm talking about, like this:
var memoryBeforeDic = GC.GetTotalMemory(true);
var dic = new Dictionary<string,object>(300000);
var memoryAfterDic = GC.GetTotalMemory(true);
Console.WriteLine("Memory: {0}", memoryAfterDic - memoryBeforeDic);
.NET 4.5.1
I have a ConcurrentBag with 200,000 objects. An object is considered "unique" by two properties of type long.
I need to check the bag for a previous existence of a unique object, and if it does not exist, add it.
I think doing something like the below is not correct -
var foundRef = mybag.Where( r => r.mainid == tempObj.mainid &&
r.subid == tempObj.subid);
What is the right way to search the bag as quickly as possible? I do need the concurrency/safety of the bag.
Thanks.
Why not use ConcurrentDictionary<Tuple<long, long>, Foo>? Your data will be indexed by these two properties, mainid and subid.
The only disadvantage of this approach is that you have to create new Tuple<long, long> each time you want to retrieve a value from the dictionary:
var foundRef = myDict[new Tuple<long, long>(tempObj.mainid, tempObj.subid)];
But it will give you the fastest possible access time close to O(1).
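For the "check, and add if it does not exist" part of the question, a minimal sketch using TryAdd might look like this. The Foo class and the FooRegistry helper are hypothetical stand-ins; only the mainid and subid names come from the question. Note also that the indexer shown above throws KeyNotFoundException when the key is absent, so TryGetValue is the safer call for a pure lookup.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical stand-in for the question's object type.
public class Foo
{
    public long mainid { get; set; }
    public long subid { get; set; }
}

public static class FooRegistry
{
    // Atomically adds the item unless an entry with the same (mainid, subid)
    // already exists. Returns true if added, false if it was already present.
    public static bool AddIfMissing(
        ConcurrentDictionary<Tuple<long, long>, Foo> dict, Foo item)
    {
        return dict.TryAdd(Tuple.Create(item.mainid, item.subid), item);
    }
}
```

Because TryAdd performs the existence check and the insertion as one operation, two threads adding the same key cannot both succeed.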
I've always thought that any index should be unique, but I think that's not true, at least for SQL Server, as shown in the following post:
Do clustered indexes have to be unique?
Recently I had to store a very large amount of data in a collection and thought of using a dictionary, since it's the fastest collection for getting an object by index. But my collection would have to allow duplicated keys. In fact, duplicated keys would not be a problem, since any of the objects returned would meet the requirements (the objects are not exactly unique, but the keys would be).
Some more research led me to the following post:
C# Hashset Contains Non-Unique Objects
Which shows a way to get a HashSet with "duplicated keys". His problem would be my solution, but I wonder if there's any other way to have a list with duplicated keys that lets me search very fast without needing a workaround to get this done.
"duplicated indexes would not be a problem since any of them would meet the requirements"
If by this you mean that obtaining any item stored against the same index value would be satisfactory to you when retrieving an item by index, then a simple Dictionary will suffice.
E.g.
Dictionary<int, string> myData = new Dictionary<int, string>();
myData[1] = "foo";
myData[2] = "bar";
myData[2] = "baz"; // overwrites "bar"
var myDatum = myData[2]; // retrieves "baz" not "bar", but this is satisfactory.
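If you would rather keep every value stored against a duplicated key instead of overwriting, one option is LINQ's ToLookup, which builds a hash-based, read-only index from key to a group of values. The sketch below is only an illustration; DuplicateKeyDemo is a hypothetical helper name.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class DuplicateKeyDemo
{
    // Groups values by key; a key may map to several values.
    public static ILookup<int, string> Build(
        IEnumerable<KeyValuePair<int, string>> pairs)
    {
        return pairs.ToLookup(p => p.Key, p => p.Value);
    }
}
```

Indexing the result with a key yields all values stored under it, and an empty sequence (rather than an exception) for a key that is not present.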
What is the most efficient way to do a look-up table in C#
I have a look-up table. Sort of like
0 "Thing 1"
1 "Thing 2"
2 "Reserved"
3 "Reserved"
4 "Reserved"
5 "Not a Thing"
So if someone wants "Thing 1" or "Thing 2" they pass in 0 or 1. But they may pass in something else also.
I have 256 of these type of things and maybe 200 of them are reserved.
So what is the most efficient way to set this up?
A string Array or dictionary variable that gets all of the values. And then take the integer and return the value at that place.
One problem I have with this solution is all of the "Reserved" values. I don't want to create those redundant "Reserved" values. Alternatively, I could have an if statement covering all of the places that are reserved, but they might not be just 2-3; they might be 2-3, 40-55, and various other places in the byte. This if statement would get unruly quickly.
My other option was a switch statement: I would have all of the 50ish known values and fall through to default for the reserved values.
I am wondering if this is a lot more processing than creating a string array or dictionary and just returning the appropriate value.
Something else? Is there another way to consider?
"Retrieving a value by using its key is very fast, close to O(1), because the Dictionary(TKey, TValue) class is implemented as a hash table."
var things = new Dictionary<int, string>();
things[0]="Thing 1";
things[1]="Thing 2";
things[4711]="Carmen Sandiego";
The absolute fastest way to do lookups of integer values in C# is with an array. This will be preferable to using a dictionary, maybe, if you are trying to do tens of thousands of lookups at a time. For most purposes, this is overkill; it's more likely that you need to optimize developer time than processor time.
If the reserved keys are not simply all keys that aren't in the lookup table (i.e. if a lookup for a key can return the found value, a not-found status, or a reserved status), you'll need to save the reserved keys somewhere. Saving them as dictionary entries with magic values (e.g. the key of any dictionary entry whose value is null is reserved) is OK unless you write code that iterates over the dictionary's entries without filtering them.
A way to solve that problem is to use a separate HashSet<int> to store the reserved keys, and maybe bake the whole thing into a class, e.g.:
public class LookupTable
{
public Dictionary<int, string> Table { get; } // get-only; the readonly modifier is not valid on properties
public HashSet<int> ReservedKeys { get; }
public LookupTable()
{
Table = new Dictionary<int, string>();
ReservedKeys = new HashSet<int>();
}
public string Lookup(int key)
{
return (ReservedKeys.Contains(key))
? null
: Table[key];
}
}
You'll note that this still has the magic-value issue - Lookup returns null if the key is reserved, and throws an exception if it's not in the table - but at least now you can iterate over Table.Values without filtering magic values.
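One way to avoid the magic value altogether is to report the three outcomes explicitly. The following is a rough sketch under that assumption; LookupStatus, StatusLookupTable, and their members are hypothetical names, not part of the class above.

```csharp
using System.Collections.Generic;

public enum LookupStatus { Found, Reserved, NotFound }

public class StatusLookupTable
{
    private readonly Dictionary<int, string> _table = new Dictionary<int, string>();
    private readonly HashSet<int> _reserved = new HashSet<int>();

    public void Add(int key, string value) { _table[key] = value; }
    public void Reserve(int key) { _reserved.Add(key); }

    // Reports which of the three cases applies; value is null unless Found.
    public LookupStatus TryLookup(int key, out string value)
    {
        value = null;
        if (_reserved.Contains(key)) return LookupStatus.Reserved;
        return _table.TryGetValue(key, out value)
            ? LookupStatus.Found
            : LookupStatus.NotFound;
    }
}
```

The caller then switches on the returned status instead of testing for null or catching an exception.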
Check out the HybridDictionary. It automatically adjusts its underlying storage mechanism based on size to get the greatest efficiency.
http://msdn.microsoft.com/en-us/library/system.collections.specialized.hybriddictionary.aspx
If you have lots of reserved (currently unused) values, or if the range of the integer values can get very big, then I would use a generic dictionary (Dictionary<TKey, TValue>):
var myDictionary = new Dictionary<int, string>();
myDictionary.Add(0, "Value 1");
myDictionary.Add(200, "Another value");
// and so on
Otherwise, if you have a fixed number of values and only a few of them are currently unused, then I'd use a string array (string[200]) and set or leave the reserved entries as null.
var myArray = new string[200];
myArray[0] = "Value 1";
myArray[2] = "Another value";
//myArray[1] is null
The in-built Dictionary object (preferably a generic dictionary) would be ideal for this, and is specifically designed for fast/efficient retrieval of the values relating to the keys.
From the linked MSDN article:
Retrieving a value by using its key is very fast, close to O(1), because the Dictionary<TKey, TValue> class is implemented as a hash table.
As far as your "reserved" keys go, I wouldn't worry about that at all if we're only talking about a few hundred keys/values. It's only when you reach tens, maybe hundreds of thousands of "reserved" keys/values that you'll want to implement something more efficient.
In those cases, probably the most efficient storage container then would be an implementation of a Sparse Matrix.
I'm not quite sure I understand your problem correctly. You have a collection of strings. Each string is associated with an index. The consumer gives an index and you return the corresponding string, unless the index is reserved. Right?
Can't you simply set reserved items to null in the array?
If not, using a dictionary that doesn't contain the reserved items seems a reasonable solution.
Anyway, you'll probably get better answers if you clarify your problem.
I would use a Dictionary to do the lookups. This is the most efficient way to do lookups by far. Searching a collection of strings for the object will run somewhere in the region of O(n).
It might be useful to have a second Dictionary to allow you to do a reverse lookup if it's needed.
Load all your values into
var dic = new Dictionary<int, string>();
And use this for retrieval:
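string GetDescription(int val)
{
    // Reject anything outside the byte range first.
    if (val < 0 || val > 255)
        throw new ApplicationException("Value must be between 0 and 255");
    // Keys missing from the dictionary are the reserved ones.
    string description;
    return dic.TryGetValue(val, out description) ? description : "Reserved";
}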
Your question seems to imply that the query key is an integer. Since you have at most 256 items, then the query key is in the range 0..255, right? If so, just have a string array of 256 strings, and use the key as an index into the array.
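A minimal sketch of that array approach, using the sample values from the question; the ByteLookup class name is hypothetical, and treating null slots as "Reserved" is an assumption about the desired behaviour.

```csharp
public static class ByteLookup
{
    // One slot per possible byte value; slots left null are treated as reserved.
    private static readonly string[] Names = BuildTable();

    private static string[] BuildTable()
    {
        var table = new string[256];
        table[0] = "Thing 1";
        table[1] = "Thing 2";
        table[5] = "Not a Thing";
        return table;
    }

    // Taking a byte means the index can never be out of range.
    public static string Describe(byte key)
    {
        return Names[key] ?? "Reserved";
    }
}
```

Only the known values need to be written out; the roughly 200 reserved slots cost one null reference each.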
If your query key is a string value, then it's more like a real lookup table. Using a Dictionary object is simple, but if you're after raw speed for a set of as few as 50 or so actual answers, a do-it-yourself approach such as binary search, or a trie, could be quicker. If you use binary search, since the number of items is so small, you could unroll it.
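For string keys, the binary-search idea could be sketched with Array.BinarySearch over a pre-sorted key array (not unrolled here). The keys and values are made-up placeholders, and SortedLookup is a hypothetical name; whether this actually beats a Dictionary for about 50 entries is something to measure rather than assume.

```csharp
using System;

public static class SortedLookup
{
    // Keys must be sorted with the same comparer that is used for searching.
    private static readonly string[] Keys = { "alpha", "beta", "gamma" };
    private static readonly int[] Values = { 1, 2, 3 };

    // Returns true and the matching value if the key exists; otherwise false and 0.
    public static bool TryFind(string key, out int value)
    {
        int i = Array.BinarySearch(Keys, key, StringComparer.Ordinal);
        value = i >= 0 ? Values[i] : 0;
        return i >= 0;
    }
}
```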
How often does the list of items change? If it only changes very seldom, you can get even better speed by generating code to do the search, which you can then compile and execute to do each query.
On the other hand, I assume you've proven that this lookup is your bottleneck, either by profiling or taking stackshots. If less than 10% of time-when-slow is spent in this query, then it is not your bottleneck so you may as well do the thing that is easiest to code.