I need to modify all of the values in a Dictionary. Typically, modifying a Dictionary while enumerating it throws an exception. There are various ways to work around that, but all of the answers I've seen involve allocating temporary storage. See Editing dictionary values in a foreach loop for an example.
I would like to modify all the values without allocating any memory. Writing a custom struct enumerator for the values that disregarded the dictionary version would be fine, but since all the important members of the dictionary are private, this seems impossible.
You're definitely getting into some nitty-gritty performance optimization here.
Based on the additional information you've given in the comments, it sounds like the best approach (short of upgrading your memory so you can handle a little more allocation) will probably be to take the Dictionary source code and make a new class specifically for this purpose, which doesn't increment the version field if it's only changing a value.
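For illustration, here is a loose sketch of the one change such a copied class would need. The member names below (Insert, FindEntry, entries, version) follow the reference source, but the body is heavily simplified and is only an assumption about how you'd wire it up, not drop-in code:
private void Insert(TKey key, TValue value, bool add)
{
    int i = FindEntry(key);           // existing lookup logic from the copied source
    if (i >= 0 && !add)
    {
        entries[i].value = value;     // overwrite the value in place...
        // version++;                 // ...but skip the version bump so live enumerators stay valid
        return;
    }
    // ... fall through to the unmodified add/resize logic, which still increments version ...
}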
Related
After reading the excellent accepted answer in this question:
How is the c#/.net 3.5 dictionary implemented?
I decided to set my initial capacity to a large guess and then trim it after I read in all values. How can I do this? That is, how can I trim a Dictionary so the gc will collect the unused space later?
My goal with this is optimization. I often have large datasets and the time penalty for small datasets is acceptable. I want to avoid the overhead of reallocating and copying the data that is incurred with small initial capacities on large datasets.
According to Reflector, the Dictionary class never shrinks. void Resize() is hard-coded to always double the size.
You can probably create a new dictionary and use the respective constructor to copy over the items. This will be quite inefficient.
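A minimal sketch of that approach, using the Dictionary(IDictionary<TKey, TValue>) copy constructor, which sizes the new instance from the source's Count (LoadData is a placeholder for your own code):
Dictionary<string, int> oversized = LoadData();        // placeholder for however you populate it
var trimmed = new Dictionary<string, int>(oversized);  // allocates for roughly Count entries
oversized = trimmed;                                   // drop the old, oversized instance for the GC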
Or, implement your own dictionary with the existing one as a blue-print. This is less work than you might think at first.
Be sure to benchmark both approaches.
In .NET 5 there is the TrimExcess method, which does exactly what you're asking:
Sets the capacity of this dictionary to what it would be if it had
been originally initialized with all its entries.
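A short example of how that would look; the initial capacity of 1000000 here is just a stand-in for your own large guess:
var data = new Dictionary<string, double>(1000000);  // large initial guess
// ... read in all values ...
data.TrimExcess();   // shrink the capacity to fit the current number of entries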
You might consider putting your data in a list first. Then you know the list's size, and can create a dictionary with that capacity (now exactly right for the data you want) and populate it.
Allowing the list to dynamically resize (as you add the elements) should be cheaper than allowing a dictionary to resize. (But, as others have noted, test the performance yourself!) Resizing a dictionary involves a rehashing operation, which means every element's GetHashCode will get called again, as well as the reference being copied into the new data structure. Resizing a list just means copying the references, so should be cheaper.
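A rough sketch of the list-first idea; Record, ReadRecords, and record.Key are placeholders for your own types and data source:
var records = new List<Record>();
foreach (var record in ReadRecords())
    records.Add(record);                 // the list resizes cheaply (no rehashing)

var lookup = new Dictionary<string, Record>(records.Count);  // exact capacity, no further resizes
foreach (var record in records)
    lookup[record.Key] = record;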
Say for example I have
Dictionary<string, double> foo;
I can do
foo["hello"] = foo["hello"] + 2.0
Or I could do
foo["hello"] += 2.0
but the compiler just expands this to the code above. I verified that by using JetBrains dotPeek to look at the compiled assemblies.
This seems wasteful as two key lookups are required to update. Is there a dictionary implementation that can do this in one lookup? Note I'm using a dictionary to store 100k items of geometry information from a mesh and the lookups are in an inner loop. Please no "premature optimization is the root of all evil" answers. :)
Yes I have profiled.
Using a class would probably be faster as the comments mention because:
With a struct, you must do a double look-up as mentioned in the comments.
With a class, you simply go to the memory of the class reference and can update it there.
Each Lookup:
GetHashCode
Get the bucket
Iterate through to find the right one
(This all involves reading multiple ref object values)
However, if you use a class and update its value:
Change the value at the correct position relative to that ref.
It's a single change in memory.
@George Duckett's solution should be much faster. Change to a class, get the reference, and update the object's value:
var hello = foo["hello"];
hello.howAreYou += 2.0;
By the way, this is an example case where a mutable class will win in performance over the immutable struct.
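A slightly fuller sketch of that idea, assuming a hypothetical mutable class in place of the struct value (the class and field names are made up for the example):
class GeometryInfo
{
    public double howAreYou;  // hypothetical mutable field
}

var foo = new Dictionary<string, GeometryInfo>();
foo["hello"] = new GeometryInfo();

var hello = foo["hello"];     // one hash lookup...
hello.howAreYou += 2.0;       // ...then an in-place update through the reference, no second lookup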
There's a method in ConcurrentDictionary, ConcurrentDictionary.AddOrUpdate, that does what you want. You can update an existing value in the dictionary based on its previous value in one go.
However, the concurrent dictionary is supposed to be used in multiple thread situations, so I can imagine it does some locking which might defeat your optimization goal. But then again, you can always benchmark and see how it goes.
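For example (a sketch; whether the locking overhead is acceptable is something you'd have to measure):
using System.Collections.Concurrent;

var foo = new ConcurrentDictionary<string, double>();
// Adds 2.0 if "hello" is missing, otherwise updates the existing value based on the old one,
// all in a single call.
foo.AddOrUpdate("hello", 2.0, (key, oldValue) => oldValue + 2.0);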
No, it is not. As noted in the comment by bradgonesurfing, the language lacks a way to return a reference to the stored value, so when it has to change that value, it needs to find it again.
Also, you said you are storing pairs of integers. Have you thought about using an array? Even a 100k-element array is not even 1 MB. And I'm sure it would be the fastest you can get.
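A tiny sketch of the array idea, assuming the keys are (or can be mapped to) small integer indices into your mesh data; vertexCount and the index used here are assumptions:
int vertexCount = 100000;                  // assumed size of the mesh
double[] values = new double[vertexCount];
values[42] += 2.0;                         // a single indexed read-modify-write, no hashing at all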
I'm still quite new to C#, but noticed the advantages through forum postings of using a HashSet instead of a List in specific cases.
My current case isn't that I'm storing a tremendous amount of data in a single List exactly, but rather that I'm having to check for members of it often.
The catch is that I do indeed need to iterate over it as well, but the order they are stored or retrieved doesn't actually matter.
I've read that foreach loops are actually slower than for loops, so how else could I go about this in the fastest way possible?
The number of .Contains() checks I'm doing is definitely hurting my performance with lists, so at least comparing to the performance of a HashSet would be handy.
Edit: I'm currently using lists, iterating through them in numerous locations, and different code is being executed in each location. Most often, the current lists contain point coordinates that I then use to index into a 2-dimensional array, on which I then do some operation or another based on the criteria of the list.
If there's not a direct answer to my question, that's fine, but I assumed there might be other methods of iterating over a HashSet than just a foreach loop. I'm currently in the dark as to what other methods there might even be, what advantages they provide, etc. Assuming there are other methods, I also made the assumption that there would be a typical preferred method of choice that is only ignored when it doesn't suit the needs (my needs are pretty basic).
As far as prematurely optimizing, I already know using the lists as I am is a bottleneck. How to go about helping this issue is where I'm getting stuck. Not even stuck exactly, but I didn't want to re-invent the wheel by testing repeatedly only to find out I'm already doing it the best way I could (this is a large project with over 3 months invested, lists are everywhere, but there are definitely ones that I do not want duplicates, have a lot of data, need not be stored in any specific order, etc).
A foreach loop has a small amount of additional overhead on indexed collections (like an array).
This is mostly because the foreach does a little more bounds checking than a for loop.
HashSet does not have an indexer so you have to use the enumerator.
In this case foreach is efficient as it only calls MoveNext() as it moves through the collection.
Also Parallel.ForEach can dramatically improve your performance, depending on the work you are doing in the loop and the size of your HashSet.
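A sketch of what that might look like; DoExpensiveWork is a placeholder for whatever you do with each item, and this only pays off if that work is heavy enough to cover the scheduling overhead:
using System.Threading.Tasks;

var items = new HashSet<int>();
// ... populate items ...
Parallel.ForEach(items, item => DoExpensiveWork(item));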
As mentioned before, profiling is your best bet.
You shouldn't be iterating over a HashSet in the first place to determine if an item is in it. You should use HashSet's own Contains method (not the LINQ extension). The HashSet is designed so that it won't need to look through every item to see if any given value is inside the set. That is what makes it so much more powerful for searching than a List.
Not strictly answering the question in the header, but more concerning your specific problem:
I would make your own Collection object that uses both a HashSet and a List internally. Iterating is fast as you can use the List, checking for Contains is fast as you can use the HashSet. Just make it an IEnumerable and you can use this Collection in foreach as well.
The downside is more memory, but there are only twice as many references to the objects, not twice as many objects. Worst case it's only twice as much memory, but you seem much more concerned with performance.
Adding, checking, and iterating are fast this way, only removal is still O(N) because of the List.
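A minimal sketch of such a wrapper (the class name is made up; removal is omitted since, as noted, it stays O(N) with a plain List):
using System.Collections;
using System.Collections.Generic;

public class ListWithFastContains<T> : IEnumerable<T>
{
    private readonly List<T> items = new List<T>();        // fast, ordered iteration
    private readonly HashSet<T> lookup = new HashSet<T>(); // fast Contains

    public bool Add(T item)
    {
        if (!lookup.Add(item))
            return false;          // already present; keep the two collections in sync
        items.Add(item);
        return true;
    }

    public bool Contains(T item) => lookup.Contains(item);

    public int Count => items.Count;

    public IEnumerator<T> GetEnumerator() => items.GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}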
EDIT: If removal needs to be O(1) as well, use a doubly linked list instead of a regular list, and make the hashSet a Dictionary<KeyType, Cell> instead. You can check the dictionary for Contains, but also to find the cell with the data in it fast, so removal from the data structure is fast.
I had the same issue, where the HashSet suits very well the addition of unique elements, but is very slow when getting elements in a for loop. I solved it by converting the HashSet to array and then running the for over it.
I have a huge dictionary of blank values in a variable called questions, like so:
struct movieuser {blah blah blah}
Dictionary<movieuser, float> questions = new Dictionary<movieuser, float>();
So I am looping through this dictionary and need to fill in the "answers", like so:
for (var k = questions.Keys.GetEnumerator(); k.MoveNext(); )
{
    questions[k.Current] = retrieveGuess(k.Current.userID, k.Current.movieID);
}
Now, this doesn't work, because I get an InvalidOperationException from trying to modify the dictionary I am looping through. However, you can see that the code should work fine - since I am not adding or deleting any values, just modifying the value. I understand, however, why it is afraid of my attempting this.
What is the preferred way of doing this? I can't figure out a way to loop through a dictionary WITHOUT using iterators.
I don't really want to create a copy of the whole array, since it is a lot of data and will eat up my RAM like it's still Thanksgiving.
Thanks,
Dave
Matt's answer (getting the keys first, separately) is the right way to go. Yes, there'll be some redundancy, but it will work. I'd take a working program which is easy to debug and maintain over an efficient program which either won't work or is hard to maintain any day.
Don't forget that if you make MovieUser a reference type, the array will only be the size of as many references as you've got users - that's pretty small. A million users will only take up 4MB or 8MB on x64. How many users have you really got?
Your code should therefore be something like:
IEnumerable<MovieUser> users = RetrieveUsers();
IDictionary<MovieUser, float> questions = new Dictionary<MovieUser, float>();
foreach (MovieUser user in users)
{
    questions[user] = RetrieveGuess(user);
}
If you're using .NET 3.5 (and can therefore use LINQ), it's even easier:
IDictionary<MovieUser, float> questions =
    RetrieveUsers().ToDictionary(user => user, user => RetrieveGuess(user));
Note that if RetrieveUsers() can stream the list of users from its source (e.g. a file) then it will be efficient anyway, as you never need to know about more than one of them at a time while you're populating the dictionary.
A few comments on the rest of your code:
Code conventions matter. Capitalise the names of your types and methods to fit in with other .NET code.
You're not calling Dispose on the IEnumerator<T> produced by the call to GetEnumerator. If you just use foreach your code will be simpler and safer.
MovieUser should almost certainly be a class. Do you have a genuinely good reason for making it a struct?
Is there any reason you can't just populate the dictionary with both keys and values at the same time?
foreach (var key in someListOfKeys)
{
    questions.Add(key, retrieveGuess(key.userID, key.movieID));
}
Store the dictionary keys in a temporary collection, then loop over the temp collection and use each key as your indexer parameter. This should get you around the exception.
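For example (ToList() from System.Linq copies just the keys into a temporary list, which is far smaller than copying the whole dictionary):
using System.Linq;

foreach (var key in questions.Keys.ToList())
{
    questions[key] = retrieveGuess(key.userID, key.movieID);
}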
I need to enumerate through a generic IList<> of objects. The contents of the list may change, as in items being added or removed by other threads, and this will kill my enumeration with a "Collection was modified; enumeration operation may not execute."
What is a good way of doing a thread-safe foreach on an IList<>, preferably without cloning the entire list? It is not possible to clone the actual objects referenced by the list.
Cloning the list is the easiest and best way, because it ensures your list won't change out from under you. If the list is simply too large to clone, consider putting a lock around it that must be taken before reading/writing to it.
There is no such operation. The best you can do is
lock (collection)
{
    foreach (object o in collection)
    {
        ...
    }
}
Your problem is that an enumeration does not allow the IList to change. This means you have to avoid this while going through the list.
A few possibilities come to mind:
Clone the list. Now each enumerator has its own copy to work on.
Serialize the access to the list. Use a lock to make sure no other thread can modify it while it is being enumerated.
Alternatively, you could write your own implementation of IList and IEnumerator that allows the kind of parallel access you need. However, I'm afraid this won't be simple.
ICollection MyCollection;
// Instantiate and populate the collection
lock (MyCollection.SyncRoot)
{
    // Some operation on the collection, which is now thread safe.
}
From MSDN
You'll find that's a very interesting topic.
The best approach relies on a reader/writer lock, which used to have big performance issues due to the so-called convoy problem.
The best article I've found treating the subject is this one by Jeffrey Richter, which presents his own method for a high-performance solution.
So the requirements are: you need to enumerate through an IList<> without making a copy, while simultaneously adding and removing elements.
Could you clarify a few things? Are insertions and deletions happening only at the beginning or end of the list?
If modifications can occur at any point in the list, how should the enumeration behave when elements are removed or added near or on the location of the enumeration's current element?
This is certainly doable by creating a custom IEnumerable object with perhaps an integer index, but only if you can control all access to your IList<> object (for locking and maintaining the state of your enumeration). But multithreaded programming is a tricky business under the best of circumstances, and this is a complex problem.
Foreach depends on the collection not changing. If you want to iterate over a collection that can change, use the normal for construct and be prepared for nondeterministic behavior. Locking might be a better idea, depending on what you're doing.
Default behavior for a simple indexed data structure like a linked list, b-tree, or hash table is to enumerate in order from the first to the last. It would not cause a problem to insert an element in the data structure after the iterator had already passed that point, or to insert one that the iterator would enumerate once it had arrived; such an event could be detected by the application and handled if the application required it. Detecting a change in the collection and throwing an error during enumeration was, I can only imagine, someone's (bad) idea of doing what they thought the programmer would want. Indeed, Microsoft has fixed their collections to work correctly. They have called their shiny new unbroken collections ConcurrentCollections (System.Collections.Concurrent) in .NET 4.0.
I recently spent some time multi-threading a large application and had a lot of issues with foreach operating on lists of objects shared across threads.
In many cases you can use the good old for loop and immediately assign the object to a copy to use inside the loop. Just keep in mind that all threads writing to the objects in your list should write to different data of the objects. Otherwise, use a lock or a copy as the other contributors suggest.
Example:
foreach (var p in Points)
{
    // work with p...
}
Can be replaced by:
for (int i = 0; i < Points.Count; i++)
{
    Point p = Points[i];
    // work with p...
}
Wrap the list in a locking object for reading and writing. You can even iterate with multiple readers at once if you have a suitable lock that allows multiple concurrent readers but only a single writer (when there are no readers).
This is something that I've recently had to deal with and to me it really depends on what you're doing with the list.
If you need to use the list at a point in time (given the number of elements currently in it) AND another thread can only ADD to the end of the list, then maybe you can just switch to a for loop with a counter. At the point you grab the counter, you're only seeing X number of elements in the list. You can walk through those elements (while others are adding to the end of the list); that should not cause a problem.
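A sketch of that idea; Process is a placeholder for your per-item work, and this is only safe under the stated assumption that other threads append to the end and never remove or clear:
int count = sharedList.Count;       // snapshot the count once
for (int i = 0; i < count; i++)
{
    var item = sharedList[i];       // elements 0..count-1 stay put if others only append
    Process(item);                  // placeholder for whatever you do with each item
}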
Now, if the list needs to have items taken OUT of it by other threads, or CLEARED by other threads, then you'll need to implement one of the locking mechanisms mentioned above. Also, you may want to look at some of the newer "concurrent" collection classes (though I don't believe they implement IList, so you may need to refactor to use a dictionary).