I have a large file which, in essence, contains data like:
Netherlands,Noord-holland,Amsterdam,FooStreet,1,...,...
Netherlands,Noord-holland,Amsterdam,FooStreet,2,...,...
Netherlands,Noord-holland,Amsterdam,FooStreet,3,...,...
Netherlands,Noord-holland,Amsterdam,FooStreet,4,...,...
Netherlands,Noord-holland,Amsterdam,FooStreet,5,...,...
Netherlands,Noord-holland,Amsterdam,BarRoad,1,...,...
Netherlands,Noord-holland,Amsterdam,BarRoad,2,...,...
Netherlands,Noord-holland,Amsterdam,BarRoad,3,...,...
Netherlands,Noord-holland,Amsterdam,BarRoad,4,...,...
Netherlands,Noord-holland,Amstelveen,BazDrive,1,...,...
Netherlands,Noord-holland,Amstelveen,BazDrive,2,...,...
Netherlands,Noord-holland,Amstelveen,BazDrive,3,...,...
Netherlands,Zuid-holland,Rotterdam,LoremAve,1,...,...
Netherlands,Zuid-holland,Rotterdam,LoremAve,2,...,...
Netherlands,Zuid-holland,Rotterdam,LoremAve,3,...,...
...
This is a multi-gigabyte file. I have a class that reads this file and exposes these lines (records) as an IEnumerable<MyObject>. This MyObject has several properties (Country, Province, City, etc.).
As you can see there is a LOT of duplication of data. I want to keep exposing the underlying data as an IEnumerable<MyObject>. However, some other class might (and probably will) make some hierarchical view/structure of this data like:
Netherlands
Noord-holland
Amsterdam
FooStreet [1, 2, 3, 4, 5]
BarRoad [1, 2, 3, 4]
...
Amstelveen
BazDrive [1, 2, 3]
...
...
Zuid-holland
Rotterdam
LoremAve [1, 2, 3]
...
...
...
...
When reading this file, I do, essentially, this:
foreach (var line in myfile)
{
    var fields = line.Split(',');
    yield return new MyObject
    {
        Country = fields[0],
        Province = fields[1],
        City = fields[2],
        Street = fields[3],
        //...other fields
    };
}
Now, to the actual question at hand: I could use string.Intern() to intern the Country, Province, City, and Street strings (those are the main 'villains'; MyObject has several other properties not relevant to the question).
foreach (var line in myfile)
{
    var fields = line.Split(',');
    yield return new MyObject
    {
        Country = string.Intern(fields[0]),
        Province = string.Intern(fields[1]),
        City = string.Intern(fields[2]),
        Street = string.Intern(fields[3]),
        //...other fields
    };
}
This will save about 42% of memory (tested and measured) when holding the entire dataset in memory, since all duplicate strings will be references to the same string. Also, when creating the hierarchical structure with a lot of LINQ .ToDictionary() calls, the keys (Country, Province, etc.) of the respective dictionaries will be much more efficient.
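For illustration, a rough sketch of what such a hierarchical view could look like with nested GroupBy/ToDictionary calls (the HouseNumber property for the fifth field is assumed here; it's not part of the question):
var hierarchy = myObjects
    .GroupBy(o => o.Country)
    .ToDictionary(
        byCountry => byCountry.Key,
        byCountry => byCountry
            .GroupBy(o => o.Province)
            .ToDictionary(
                byProvince => byProvince.Key,
                byProvince => byProvince
                    .GroupBy(o => o.City)
                    .ToDictionary(
                        byCity => byCity.Key,
                        byCity => byCity
                            .GroupBy(o => o.Street)
                            .ToDictionary(
                                byStreet => byStreet.Key,
                                byStreet => byStreet.Select(o => o.HouseNumber).ToList()))));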
However, one of the drawbacks of using string.Intern() (aside from a slight loss of performance, which is not a problem) is that the strings won't be garbage collected anymore. But when I'm done with my data I do want all that stuff garbage collected (eventually).
I could use a Dictionary<string, string> to 'intern' this data, but I don't like the "overhead" of having a key and a value where I am, actually, only interested in the key. I could set the value to null or use the same string as the value (which will result in the same reference in key and value). It's only a small price of a few bytes to pay, but it's still a price.
Something like a HashSet<string> makes more sense to me. However, I cannot get a reference to a string in the HashSet; I can see if the HashSet contains a specific string, but not get a reference to that specific instance of the located string in the HashSet. I could implement my own HashSet for this, but I am wondering what other solutions you kind StackOverflowers may come up with.
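For completeness: newer framework versions (.NET Framework 4.7.2+ / .NET Core 2.0+, if I recall correctly) add HashSet<T>.TryGetValue(T equalValue, out T actualValue), which returns the stored instance directly and is exactly the missing piece. A minimal sketch, assuming that overload is available on the target framework:
public class StringPool
{
    private readonly HashSet<string> _items = new HashSet<string>();

    public string Add(string value)
    {
        // Return the instance already stored in the set if there is one,
        // otherwise store the argument and return it.
        if (_items.TryGetValue(value, out var existing))
            return existing;
        _items.Add(value);
        return value;
    }
}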
Requirements:
My "FileReader" class needs to keep exposing an IEnumerable<MyObject>
My "FileReader" class may do stuff (like string.Intern()) to optimize memory usage
The MyObject class cannot change; I won't make a City class, Country class etc. and have MyObject expose those as properties instead of simple string properties
Goal is to be (more) memory efficient by de-duplicating most of the duplicate strings in Country, Province, City etc.; how this is achieved (e.g. string interning, internal hashset / collection / structure of something) is not important. However:
I know I can stuff the data in a database or use other solutions in such direction; I am not interested in these kind of solutions.
Speed is only of secondary concern; the quicker the better, of course, but a (slight) loss in performance while reading/iterating the objects is no problem
Since this is a long-running process (as in: windows service running 24/7/365) that, occasionally, processes a bulk of this data I want the data to be garbage-collected when I'm done with it; string interning works great but will, in the long run, result in a huge string pool with lots of unused data
I would like any solutions to be "simple"; adding 15 classes with P/Invokes and inline assembly (exaggerated) is not worth the effort. Code maintainability is high on my list.
This is more of a 'theoretical' question; it's purely out of curiosity / interest that I'm asking. There is no "real" problem, but I can see that in similar situations this might be a problem to someone.
For example: I could do something like this:
public class StringInterningObject
{
private HashSet<string> _items;
public StringInterningObject()
{
_items = new HashSet<string>();
}
public string Add(string value)
{
if (_items.Add(value))
return value; //New item added; return value since it wasn't in the HashSet
//MEH... this will quickly go O(n)
return _items.First(i => i.Equals(value)); //Find (and return) actual item from the HashSet and return it
}
}
But with a large set of (to-be-de-duplicated) strings this will quickly bog down. I could have a peek at the reference source for HashSet or Dictionary or... and build a similar class whose Add() method returns the actual string found in the internal bucket instead of a bool.
The best I could come up with until now is something like:
public class StringInterningObject
{
private ConcurrentDictionary<string, string> _items;
public StringInterningObject()
{
_items = new ConcurrentDictionary<string, string>();
}
public string Add(string value)
{
return _items.AddOrUpdate(value, value, (v, i) => i);
}
}
Which has the "penalty" of having a Key and a Value where I'm actually only interested in the Key. Just a few bytes though, a small price to pay. Coincidentally, this also yields 42% less memory usage; the same result as using string.Intern().
tolanj came up with System.Xml.NameTable:
public class StringInterningObject
{
private System.Xml.NameTable nt = new System.Xml.NameTable();
public string Add(string value)
{
return nt.Add(value);
}
}
(I removed the lock and string.Empty check (the latter since the NameTable already does that))
xanatos came up with a CachingEqualityComparer:
public class StringInterningObject
{
private class CachingEqualityComparer<T> : IEqualityComparer<T> where T : class
{
public System.WeakReference X { get; private set; }
public System.WeakReference Y { get; private set; }
private readonly IEqualityComparer<T> Comparer;
public CachingEqualityComparer()
{
Comparer = EqualityComparer<T>.Default;
}
public CachingEqualityComparer(IEqualityComparer<T> comparer)
{
Comparer = comparer;
}
public bool Equals(T x, T y)
{
bool result = Comparer.Equals(x, y);
if (result)
{
X = new System.WeakReference(x);
Y = new System.WeakReference(y);
}
return result;
}
public int GetHashCode(T obj)
{
return Comparer.GetHashCode(obj);
}
public T Other(T one)
{
if (object.ReferenceEquals(one, null))
{
return null;
}
object x = X.Target;
object y = Y.Target;
if (x != null && y != null)
{
if (object.ReferenceEquals(one, x))
{
return (T)y;
}
else if (object.ReferenceEquals(one, y))
{
return (T)x;
}
}
return one;
}
}
private CachingEqualityComparer<string> _cmp;
private HashSet<string> _hs;
public StringInterningObject()
{
_cmp = new CachingEqualityComparer<string>();
_hs = new HashSet<string>(_cmp);
}
public string Add(string item)
{
if (!_hs.Add(item))
item = _cmp.Other(item);
return item;
}
}
(Modified slightly to "fit" my "Add() interface")
As per Henk Holterman's request:
public class StringInterningObject
{
private Dictionary<string, string> _items;
public StringInterningObject()
{
_items = new Dictionary<string, string>();
}
public string Add(string value)
{
string result;
if (!_items.TryGetValue(value, out result))
{
_items.Add(value, value);
return value;
}
return result;
}
}
I'm just wondering if there's maybe a neater/better/cooler way to 'solve' my (not so much of an actual) problem. By now I have enough options, I guess.
Here are some numbers I came up with for some simple, short, preliminary tests:
Non-optimized: Memory: ~4.5 GB, Load time: ~52s
StringInterningObject (see above, the ConcurrentDictionary variant): Memory: ~2.6 GB, Load time: ~49s
string.Intern(): Memory: ~2.3 GB, Load time: ~45s
System.Xml.NameTable: Memory: ~2.3 GB, Load time: ~41s
CachingEqualityComparer: Memory: ~2.3 GB, Load time: ~58s
StringInterningObject (see above, the non-concurrent Dictionary variant, as per Henk Holterman's request): Memory: ~2.3 GB, Load time: ~39s
Although the numbers aren't very definitive, it seems that the many memory allocations in the non-optimized version actually slow it down more than using either string.Intern() or the above StringInterningObjects, resulting in (slightly) longer load times. Also, string.Intern() seems to 'win' over StringInterningObject, but not by a large margin; << See updates.
I've had exactly this requirement and indeed asked about it on SO, but with nothing like the detail of your question, and got no useful responses. One built-in option is (System.Xml.)NameTable, which is basically a string-atomization object and is what you are looking for (we've since actually moved to string.Intern because we do keep these strings for the lifetime of the app).
if (name == null) return null;
if (name == "") return string.Empty;
lock (m_nameTable)
{
return m_nameTable.Add(name);
}
on a private NameTable
http://referencesource.microsoft.com/#System.Xml/System/Xml/NameTable.cs,c71b9d3a7bc2d2af shows it's implemented as a simple hashtable, i.e. only storing one reference per string.
Downside? It's completely string-specific. If you do cross-test for memory/speed, I'd be interested to see the results. We were already using System.Xml heavily; it might of course not seem so natural if you were not.
When in doubt, cheat! :-)
public class CachingEqualityComparer<T> : IEqualityComparer<T> where T : class
{
public T X { get; private set; }
public T Y { get; private set; }
public IEqualityComparer<T> DefaultComparer = EqualityComparer<T>.Default;
public bool Equals(T x, T y)
{
bool result = DefaultComparer.Equals(x, y);
if (result)
{
X = x;
Y = y;
}
return result;
}
public int GetHashCode(T obj)
{
return DefaultComparer.GetHashCode(obj);
}
public T Other(T one)
{
if (object.ReferenceEquals(one, X))
{
return Y;
}
if (object.ReferenceEquals(one, Y))
{
return X;
}
throw new ArgumentException("one");
}
public void Reset()
{
X = default(T);
Y = default(T);
}
}
Example of use:
var comparer = new CachingEqualityComparer<string>();
var hs = new HashSet<string>(comparer);
string str = "Hello";
string st1 = str.Substring(2);
hs.Add(st1);
string st2 = str.Substring(2);
// st1 and st2 are distinct strings!
if (object.ReferenceEquals(st1, st2))
{
throw new Exception();
}
comparer.Reset();
if (hs.Contains(st2))
{
string cached = comparer.Other(st2);
Console.WriteLine("Found!");
// cached is st1
if (!object.ReferenceEquals(cached, st1))
{
throw new Exception();
}
}
I've created an equality comparer that "caches" the last Equal terms it analyzed :-)
Everything could then be encapsulated in a subclass of HashSet<T>
/// <summary>
/// A HashSet&lt;T&gt; that, through a clever use of an internal
/// comparer, can have an AddOrGet and a TryGet
/// </summary>
/// <typeparam name="T"></typeparam>
public class HashSetEx<T> : HashSet<T> where T : class
{
public HashSetEx()
: base(new CachingEqualityComparer<T>())
{
}
public HashSetEx(IEqualityComparer<T> comparer)
: base(new CachingEqualityComparer<T>(comparer))
{
}
public T AddOrGet(T item)
{
if (!Add(item))
{
var comparer = (CachingEqualityComparer<T>)Comparer;
item = comparer.Other(item);
}
return item;
}
public bool TryGet(T item, out T item2)
{
if (Contains(item))
{
var comparer = (CachingEqualityComparer<T>)Comparer;
item2 = comparer.Other(item);
return true;
}
item2 = default(T);
return false;
}
private class CachingEqualityComparer<T> : IEqualityComparer<T> where T : class
{
public WeakReference X { get; private set; }
public WeakReference Y { get; private set; }
private readonly IEqualityComparer<T> Comparer;
public CachingEqualityComparer()
{
Comparer = EqualityComparer<T>.Default;
}
public CachingEqualityComparer(IEqualityComparer<T> comparer)
{
Comparer = comparer;
}
public bool Equals(T x, T y)
{
bool result = Comparer.Equals(x, y);
if (result)
{
X = new WeakReference(x);
Y = new WeakReference(y);
}
return result;
}
public int GetHashCode(T obj)
{
return Comparer.GetHashCode(obj);
}
public T Other(T one)
{
if (object.ReferenceEquals(one, null))
{
return null;
}
object x = X.Target;
object y = Y.Target;
if (x != null && y != null)
{
if (object.ReferenceEquals(one, x))
{
return (T)y;
}
else if (object.ReferenceEquals(one, y))
{
return (T)x;
}
}
return one;
}
}
}
Note the use of WeakReference so that there aren't useless references to objects that could prevent garbage collection.
Example of use:
var hs = new HashSetEx<string>();
string str = "Hello";
string st1 = str.Substring(2);
hs.Add(st1);
string st2 = str.Substring(2);
// st1 and st2 are distinct strings!
if (object.ReferenceEquals(st1, st2))
{
throw new Exception();
}
string stFinal = hs.AddOrGet(st2);
if (!object.ReferenceEquals(stFinal, st1))
{
throw new Exception();
}
string stFinal2;
bool result = hs.TryGet(st1, out stFinal2);
if (!object.ReferenceEquals(stFinal2, st1))
{
throw new Exception();
}
if (!result)
{
throw new Exception();
}
edit3:
Instead of interning the strings, putting them in de-duplicated lists will save much more RAM.
We keep int indexes into those lists in the class MyObjectOptimized; access is instant.
If a list is short (say 1,000 items), the cost of setting a value won't be noticeable.
Assuming every string has at least 5 characters, this reduces memory usage roughly as follows:
per instance: ~112 bytes / 16 bytes ≈ 7x gain
in total: 5 GB / 7 ≈ 0.7 GB + sizeof(Country_li, Province_li, etc.)
An Int16 index would further halve the RAM usage.
Note: Int16 ranges from -32,768 to +32,767, so make sure each list holds no more than 32,767 distinct items.
Usage stays the same, but with the class MyObjectOptimized:
main()
{
    // you can use the same code
    foreach (var line in myfile) {
        var fields = line.Split(',');
        yield return new MyObjectOptimized {
            Country = fields[0],
            Province = fields[1],
            City = fields[2],
            Street = fields[3],
            //...other fields
        };
    }
}
Required classes:
// single string size: 18 bytes (empty string overhead) + 2 bytes per char allocated
// 1 class instance RAM cost: 4 * (18 + 2 * charCount)
// i.e. with char counts of at least 5:
// cost: 4 * (18 + 2*5) = 112 bytes
class MyObject
{
string Country ;
string Province ;
string City ;
string Street ;
}
public static class Exts
{
public static int AddDistinct_and_GetIndex(this List<string> list ,string value)
{
if( !list.Contains(value) ) {
list.Add(value);
}
return list.IndexOf(value);
}
}
// 1 class instance ram cost : 4*4 byte = 16 byte
class MyObjectOptimized
{
//those int's could be int16 depends on your distinct item counts
int Country_index ;
int Province_index ;
int City_index ;
int Street_index ;
// manually implemented properties do not add extra fields to the instance,
// whereas auto-properties (with their backing fields) would
public string Country{
get {return Country_li[Country_index]; }
set { Country_index = Country_li.AddDistinct_and_GetIndex(value); }
}
public string Province{
get {return Province_li[Province_index]; }
set { Province_index = Province_li.AddDistinct_and_GetIndex(value); }
}
public string City{
get {return City_li[City_index]; }
set { City_index = City_li.AddDistinct_and_GetIndex(value); }
}
public string Street{
get {return Street_li[Street_index]; }
set { Street_index = Street_li.AddDistinct_and_GetIndex(value); }
}
// beware: these are static, shared across all instances
static readonly List<string> Country_li = new List<string>();
static readonly List<string> Province_li = new List<string>();
static readonly List<string> City_li = new List<string>();
static readonly List<string> Street_li = new List<string>();
}
Related
We use dictionaries in various places in the existing code to map tags to objects that each contain that tag. This was never a problem before, as these dictionaries "only" managed several thousand objects. However, we are now at a point where the software is more likely to be dealing with tens to hundreds of thousands of objects. Using the tag as a key means we consume a lot of unnecessary memory, because these tags are stored twice and can be more than 150 characters long.
So the obvious idea was to replace the long tags with a fixed-size hash. For this we decided to use the FNV hash algorithm, which calculates an unsigned 64-bit integer from the string. To avoid having to make too many changes to the existing code, we wrapped the dictionary in an object that converts the passed string keys and works on an internal dictionary. This saves us masses of changes in the methods that use the previous implementation. You could call it a decorator in the broadest sense. The following is a brief outline of what we came up with.
[Serializable]
public class SimpleTestObject {
public string Tag { get; set; }
public SimpleTestObject(string tag) {
this.Tag = tag;
}
}
[Serializable]
public class FnvDictionary<T> : IDictionary<string, T> where T : SimpleTestObject {
private ConcurrentDictionary<UInt64, T> _InternalDictionary = new ConcurrentDictionary<UInt64, T>();
public T this[string key] {
get {
return this._InternalDictionary[this.CalculateHash(key)];
}
set {
if (key != value.Tag)
throw new ArgumentException();
_InternalDictionary[this.CalculateHash(key)] = value;
}
}
public ICollection<string> Keys {
get { return this._InternalDictionary.Values.Select(item => item.Tag).ToList(); }
}
public ICollection<T> Values {
get { return this._InternalDictionary.Values; }
}
public int Count {
get { return this._InternalDictionary.Count; }
}
public bool IsReadOnly {
get { return false; }
}
public void Add(string key, T value) {
this._InternalDictionary[this.CalculateHash(key)] = value;
}
public void Add(KeyValuePair<string, T> item) {
this.Add(item.Key, item.Value);
}
public void Clear() {
this._InternalDictionary.Clear();
}
public bool Contains(KeyValuePair<string, T> item) {
if (item.Key != item.Value.Tag)
throw new ArgumentException();
return this.ContainsKey(item.Value.Tag);
}
public bool ContainsKey(string key) {
return this._InternalDictionary.ContainsKey(CalculateHash(key));
}
public void CopyTo(KeyValuePair<string, T>[] array, int arrayIndex) {
KeyValuePair<string, T>[] source = this._InternalDictionary
.Select(data => new KeyValuePair<string, T>(data.Value.Tag, data.Value))
.ToArray();
Array.Copy(source, 0, array, arrayIndex, source.Length);
}
public IEnumerator<KeyValuePair<string, T>> GetEnumerator() {
return new FnvDictionaryEnumerator<T>(this._InternalDictionary);
}
public bool Remove(string key) {
return this._InternalDictionary.TryRemove(this.CalculateHash(key), out _);
}
public bool Remove(KeyValuePair<string, T> item) {
return this.Remove(item.Value.Tag);
}
public bool TryGetValue(string key, out T value) {
return this._InternalDictionary.TryGetValue(this.CalculateHash(key), out value);
}
private UInt64 CalculateHash(string input) {
const UInt64 MAGIC_PRIME = 1099511628211;
UInt64 hash = 14695981039346656037;
for (int i = 0; i < input.Length; i++)
hash = (hash ^ (byte)input[i]) * MAGIC_PRIME;
return hash;
}
IEnumerator IEnumerable.GetEnumerator() {
return this.GetEnumerator();
}
}
public class FnvDictionaryEnumerator<T> : IEnumerator<KeyValuePair<string, T>> where T : SimpleTestObject {
private ConcurrentDictionary<UInt64, T> _InternalDictionary;
private readonly int _KeysCount;
private int _KeyPos;
public FnvDictionaryEnumerator(ConcurrentDictionary<UInt64, T> data) {
_InternalDictionary = data;
_KeysCount = data.Keys.Count;
_KeyPos = -1;
}
public KeyValuePair<string, T> Current {
get {
T currentItem = _InternalDictionary.ElementAt(_KeyPos).Value;
return new KeyValuePair<string, T>(currentItem.Tag, currentItem);
}
}
object System.Collections.IEnumerator.Current => this.Current;
public bool MoveNext() => ++_KeyPos < _KeysCount;
public void Reset() => _KeyPos = -1;
public void Dispose() {
_InternalDictionary = null;
}
}
Now to the problem: The object described above was examined by us with a small test program and compared directly with the ConcurrentDictionary used so far. For this we have built a small function that outputs the size of the respective dictionaries:
public static long GetObjectSize(object source) {
BinaryFormatter formatter = new BinaryFormatter();
using (MemoryStream stream = new MemoryStream()) {
formatter.Serialize(stream, source);
return stream.Length;
}
}
After we had created 250000 data sets on a test basis and packed them into the dictionaries, we were disillusioned. Although our own creation works exclusively with hashes that are each 8 bytes long, the memory consumption is higher than in the ConcurrentDictionary.
const string TAG_BASE = "XXXXX|XXXXXX|XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX|XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX|XXXXXXXXXXX-";
const int TEST_OBJECTS_COUNT = 250000;
SimpleTestObject[] testObjects = new SimpleTestObject[TEST_OBJECTS_COUNT];
for (int index = 0; index < testObjects.Length; index++)
testObjects[index] = new SimpleTestObject($"{TAG_BASE}{index}");
ConcurrentDictionary<string, SimpleTestObject> concurrentDict = new ConcurrentDictionary<string, SimpleTestObject>();
foreach (SimpleTestObject testObject in testObjects)
concurrentDict[testObject.Tag] = testObject;
Console.WriteLine("Size of the ConcurrentDictionary = {0} bytes.", GetObjectSize(concurrentDict));
FnvDictionary<SimpleTestObject> customDict = new FnvDictionary<SimpleTestObject>();
foreach (SimpleTestObject testObject in testObjects)
customDict.Add(testObject.Tag, testObject);
Console.WriteLine("Size of the FnvDictionary = {0} bytes.", GetObjectSize(customDict));
// Output:
// Size of the ConcurrentDictionary = 36140494 bytes.
// Size of the FnvDictionary = 36890908 bytes.
The question that now arises is how a dictionary that supposedly holds less data can have a larger memory consumption. The obvious assumption is that the ConcurrentDictionary also works only on the basis of hashes, but this is contradicted by the fact that the collection of keys can still be retrieved. Is there a design problem in the test scenario described above, or even in the GetObjectSize function? And more important: how can the memory consumption of the dictionary be reduced as much as possible?
Remember that strings in .NET are reference types: the actual string
data is stored in a heap-allocated object, and any type which has a
string field just has a pointer-sized reference to that object.
canton7
I didn't realize that before; I had assumed that strings were value types. To investigate this aspect in more detail, I got myself a Visual Studio Enterprise installation, since it has memory analysis tools. With a small test program, the behavior was easy to observe. In the code in the following example, a value type would cause the dictionary to be twice as large as the list. This is not the case.
const string TAG_BASE = "XXXXX|XXXXXX|XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX|XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX|XXXXXXXXXXX-";
const int TEST_OBJECTS_COUNT = 250000;
List<string> myList = new List<string>();
Dictionary<string, string> myDictionary = new Dictionary<string, string>();
for (int index = 0; index < TEST_OBJECTS_COUNT; index++)
{
string myString = TAG_BASE + index;
myList.Add(myString);
myDictionary.Add(myString, myString);
}
The result was already predicted to me in the comments:
Also remember that dictionaries already organise themselves using a
hash code, but they do so using a 32-bit hash code. You've added an
additional 64-bit hash code which the dictionary needs to store, but
it still condenses it down into a 32-bit hash code internally. You've
just given it another, new hash code to worry about, while also
opening yourself up to hash code collisions.
canton7
When analyzing the memory dump, it became clear quite quickly that the strings are indeed stored as references. Thus, our well-intentioned approach only worsens the problem: it forces the runtime to store an additional numeric value on top of the references that are used anyway, and that value has to be calculated as well.
In fact, in the example described above, both collections consume about the same amount of memory. In the case of the dictionary, there is also some administrative overhead.
Object type                   Size (bytes)
Dictionary<string, string>    65,448,420
List<string>                  60,007,980
To answer my question: Memory consumption is increasing because we are unintentionally forcing it to do so. The reason for this is insufficient knowledge about internal memory management. For the future: Rely on analysis tools instead of self-builds!
Thanks to all the commenters for the important advice!
I have two ArrayLists of products.
public class ProductDetails
{
public string id;
public string description;
public float rate;
}
ArrayList products1 = new ArrayList();
ArrayList products2 = new ArrayList();
ArrayList duplicateProducts = new ArrayList();
Now what I want is to get all the products (with all the fields of ProductDetails class) having duplicate description in both products1 and products2.
I can run two for/while loops as traditional way, but that would be very slow specially if I will be having over 10k elements in both arrays.
So probably something can be done with LINQ.
If you want to use LINQ, you need to write your own equality comparer, implementing both the Equals and GetHashCode() methods:
public class ProductDetails
{
public string id {get; set;}
public string description {get; set;}
public float rate {get; set;}
}
public class ProductComparer : IEqualityComparer<ProductDetails>
{
public bool Equals(ProductDetails x, ProductDetails y)
{
//Check whether the objects are the same object.
if (Object.ReferenceEquals(x, y)) return true;
//Check whether the products' properties are equal.
return x != null && y != null && x.id.Equals(y.id) && x.description.Equals(y.description);
}
public int GetHashCode(ProductDetails obj)
{
//Get hash code for the description field if it is not null.
int hashProductDesc = obj.description == null ? 0 : obj.description.GetHashCode();
//Get hash code for the idfield.
int hashProductId = obj.id.GetHashCode();
//Calculate the hash code for the product.
return hashProductDesc ^ hashProductId ;
}
}
Now, supposing you have these objects:
ProductDetails[] items1 = { new ProductDetails { description = "aa", id = "9", rate = 2.0f },
                            new ProductDetails { description = "b", id = "4", rate = 2.0f } };
ProductDetails[] items2 = { new ProductDetails { description = "aa", id = "9", rate = 1.0f },
                            new ProductDetails { description = "c", id = "12", rate = 2.0f } };
IEnumerable<ProductDetails> duplicates =
items1.Intersect(items2, new ProductComparer());
Consider overriding the System.Object.Equals method.
public class ProductDetails
{
public string id;
public string description;
public float rate;
public override bool Equals(object obj)
{
if (!(obj is ProductDetails))
return false;
if(ReferenceEquals(obj,this))
return true;
ProductDetails p = (ProductDetails)obj;
return description == p.description;
}
}
Filtering would then be as simple as:
var result = products1.Where(product=>products2.Contains(product));
EDIT:
Do consider that this implementation is not optimal.
Moreover, it has been proposed in the comments to your question that you use a database.
That way performance will be optimized as per the database implementation; in any case, the overhead will not be yours.
However, you can optimize this code by using a Dictionary or a HashSet:
Override the System.Object.GetHashCode method:
public override int GetHashCode()
{
return description.GetHashCode();
}
You can now do this:
var hashSet = new HashSet<ProductDetails>(products1);
var result = products2.Where(product=>hashSet.Contains(product));
Which will boost your performance to an extent since lookup will be less costly.
10k elements is nothing; however, make sure you use proper collection types. ArrayList has long been deprecated; use List<ProductDetails>.
Next step is implementing proper Equals and GetHashCode overrides for your class. The assumption here is that description is the key since that's what you care about from a duplication point of view:
public class ProductDetails
{
public string id;
public string description;
public float rate;
public override bool Equals(object obj)
{
var p = obj as ProductDetails;
return ReferenceEquals(p, null) ? false : description == p.description;
}
public override int GetHashCode() => description.GetHashCode();
}
Now we have options. One easy and efficient way of doing this is using a hash set:
var set = new HashSet<ProductDetails>();
var products1 = new List<ProductDetails>(); // fill it
var products2 = new List<ProductDetails>(); // fill it
// shove everything in the first list in the set
foreach(var item in products1)
set.Add(item);
// and simply test the elements in the second set
foreach(var item in products2)
if(set.Contains(item))
{
// item.description was already used in products1, handle it here
}
This gives you linear (O(n)) time-complexity, best you can get.
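With the same Equals/GetHashCode overrides in place, LINQ's set operations pick them up through EqualityComparer<T>.Default, so a one-line sketch also works (it yields the instances from products1):
// Intersect builds a set from the second sequence and returns matching items from the first.
var duplicates = products1.Intersect(products2).ToList();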
I have two lists of objects:
[Serializable]
private class MemorySet
{
public Dictionary<string, object> _Map;
public List<object> _Results;
public List<object> _Storage;
}
MemorySet Memory = new MemorySet();
I can have keys assigned for an object, for example
_Map.Add("someKey", _Results[_Results.Count - 1]);
I have a method
private object Mapper(string key)
{
if (Memory._Map.ContainsKey(key))
{
return Memory._Map[key];
}
else if (key.ToLower() == "result")
{
return Memory._Results[Memory._Results.Count - 1];
}
else if (key.ToLower() == "storage")
{
return Memory._Storage[Memory._Storage.Count - 1];
}
else if (key.ToLower().Contains("result"))
{
int n = Convert.ToInt32(key.ToLower().Split(new string[] { "result" }, StringSplitOptions.None)[1]);
return Memory._Results[n];
}
else if (key.ToLower().Contains("storage"))
{
int n = Convert.ToInt32(key.ToLower().Split(new string[] { "storage" }, StringSplitOptions.None)[1]);
return Memory._Storage[n];
}
else return null;
}
Now I must assign to an object from _Storage or _Results like that:
object obj = key != "" ? Mapper(key) : Memory._Storage[Memory._Storage.Count - 1];
if (obj is string) obj = "test";
This will change obj to reference a new string in memory, but I want to change the object that obj currently references instead.
In other words obj will become "test", but the underlying object won't be changed.
I understand why that happens, though I didn't imagine it that way when writing the whole engine, and now I'm in big trouble with this one. In C++ we have pointers, but in C# I don't want to use GCHandles or unmanaged code for such trivial stuff; it would be extremely ugly.
So, how to assign to the object that object points to, instead assigning to the object itself?
Try this
[Serializable]
private class MemorySet
{
public Dictionary<string, object> _Map = new Dictionary<string,object>();
public List<object> _Results = new List<object>();
public List<object> _Storage = new List<object>();
}
If you don't want to mess with your current design, you could just add a method that would update your data structures on the same template as your Mapper method. This is what it would look like:
private void Update(string key, object value)
{
if (Memory._Map.ContainsKey(key))
{
Memory._Map[key] = value;
}
else if (key.ToLower() == "result")
{
Memory._Results[Memory._Results.Count - 1] = value;
}
else if (key.ToLower() == "storage")
{
Memory._Storage[Memory._Storage.Count - 1] = value;
}
else if (key.ToLower().Contains("result"))
{
int n = Convert.ToInt32(key.ToLower().Split(new string[] { "result" }, StringSplitOptions.None)[1]);
Memory._Results[n] = value;
}
else if (key.ToLower().Contains("storage"))
{
int n = Convert.ToInt32(key.ToLower().Split(new string[] { "storage" }, StringSplitOptions.None)[1]);
Memory._Storage[n] = value;
}
else
{
throw new ArgumentException("Failed to compute valid mapping", nameof(key));
}
}
Maybe you could also add the key == "" pattern in there, I'm not sure to understand how this would be used exactly, but hopefully you get the idea.
EDIT: OK, so references to the same object are used in different structures. You should consider to design a MemorySet that avoids this. If you still think this is the proper design considering your needs, you have a simple solution: wrap your target objects in other objects.
public class ObjectWrapper
{
public object ObjectOfInterest { get; set; }
}
Now you store ObjectWrapper objects. Then you can update the property ObjectOfInterest and this change will be reflected to all structures that contain this ObjectWrapper:
ObjectWrapper wrapper = Mapper(key);
wrapper.ObjectOfInterest = "test";
Given this code:
private static IObservable<Stock> ToStock(this IObservable<decimal> prices, string symbol)
{
return prices.Scan(
default(Stock),
(previous, price) => previous == default(Stock)
? new Stock(symbol, price)
: previous.Change(price));
}
// The signature for Stock.Change() looks like this. Stock is an immutable class.
// public Stock Change(decimal newCurrentPrice)
I would like to eliminate the check previous == default(Stock) that is happening on every call to the accumulator. What I have is behavior that is different for the first item vs the rest. I'm not sure how to express that simply using LINQ for Rx.
EDIT. Here's the code for Stock, which might help explain why I can't give it a sentinel value for price.
public class Stock
{
private readonly decimal _current;
private readonly decimal _dayHigh;
private readonly decimal _dayLow;
private readonly decimal _dayOpen;
private readonly decimal _lastChange;
private readonly string _symbol;
public Stock(string symbol, decimal price)
{
if (symbol == null) throw new ArgumentNullException("symbol");
if (price <= 0) throw new ArgumentOutOfRangeException("price", "Price must be greater than zero.");
_symbol = symbol;
_current = _dayOpen = _dayLow = _dayHigh = price;
}
private Stock(Stock original, decimal newCurrent)
{
if (original == null) throw new ArgumentNullException("original");
_symbol = original.Symbol;
_current = newCurrent;
_dayOpen = original.DayOpen;
_dayHigh = Math.Max(newCurrent, original.DayHigh);
_dayLow = Math.Min(newCurrent, original.DayLow);
_lastChange = newCurrent - original.Current;
}
public string Symbol { get { return _symbol; } }
public decimal Current { get { return _current; } }
public decimal LastChange { get { return _lastChange; } }
public decimal DayOpen { get { return _dayOpen; } }
public decimal DayLow { get { return _dayLow; } }
public decimal DayHigh { get { return _dayHigh; } }
public decimal DayChange { get { return Current - DayOpen; } }
public double DayChangeRatio { get { return (double) Math.Round(DayChange/Current, 4); } }
public Stock Change(decimal newCurrent)
{
return newCurrent == Current
? this
: new Stock(this, newCurrent);
}
}
I came up with this solution:
private static IObservable<Stock> ToStock2(this IObservable<decimal> prices, string symbol)
{
Func<Stock, decimal, Stock> accumulator = (_, firstPrice) =>
{
accumulator = (previous, price) => previous.Change(price);
return new Stock(symbol, firstPrice);
};
return prices.Scan(default(Stock), (previous, price) => accumulator(previous, price));
}
It uses a self-mutating Func variable to change the behavior during its first invocation, but a quick test (run with 0.5 million prices) shows that it performs 2-3% slower than the original method, and the code is much less clear. It seems .NET is more efficient at doing the equality comparison for every item than at calling a second Func for every item. I'm not sure if there's any way to optimize this so that it performs enough better than the original to justify the lessened clarity.
You can do this:
public static partial class ObservableExtensions
{
public static IObservable<Stock> ToStock(this IObservable<decimal> prices, string symbol)
{
return Observable.Create<Stock>(o =>
{
Stock lastStock;
Action<decimal> action = null;
action = price => {
lastStock = new Stock(symbol, price);
action = newPrice =>
{
lastStock = lastStock.Change(newPrice);
o.OnNext(lastStock);
};
o.OnNext(lastStock);
};
return prices.Subscribe(p => action(p), o.OnError, o.OnCompleted);
});
}
}
Compared to Jim's answer, I'm not sure if mine is any better; it's a similar idea but it avoids calling Scan which may avoid some hops.
My flakey performance tests showed this runs no worse than the original - but no better either. I ran it a few times with 100,000,000 prices and they took within 1% of each other with each winning roughly half the time. There was no statistically significant difference.
I would take this with a pinch of salt though, as this is on my home PC and not in a lab environment, not run for very long and with god knows what other services installed.
HOWEVER... I did get a seemingly significant 3% improvement by rewriting the private constructor to not do the Math.Max/Min calculation redundantly, and to bypass the properties and access the fields directly - and I'm sure there's further mileage to be explored, such as removing Change and using public fields:
private Stock(Stock original, decimal newCurrent)
{
if (original == null) throw new ArgumentNullException("original");
_symbol = original._symbol;
_current = newCurrent;
_dayOpen = original._dayOpen;
if(newCurrent > original._dayHigh)
{
_dayHigh = newCurrent;
_dayLow = original._dayLow;
}
else
{
_dayHigh = original._dayHigh;
_dayLow = newCurrent;
}
_lastChange = newCurrent - original._current;
}
On general performance - with a lot of prices, there is going to be a fair amount of GC pressure with this approach. I've had success in the past with using a pool of Stock instances in a ring buffer implemented with an array to reduce garbage collection.
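A rough sketch of that ring-buffer idea, purely illustrative (it assumes the pooled type is made mutable/re-initialisable, which the immutable Stock above is not, and the names are made up):
// Fixed-size pool: NextSlot() hands out pre-allocated instances in a circle,
// overwriting the oldest entry. Only valid when consumers never hold on to an
// item longer than 'capacity' newer items.
public sealed class RingBufferPool<T> where T : class, new()
{
    private readonly T[] _slots;
    private int _next;

    public RingBufferPool(int capacity)
    {
        _slots = new T[capacity];
        for (int i = 0; i < capacity; i++)
            _slots[i] = new T();
    }

    public T NextSlot()
    {
        var item = _slots[_next];
        _next = (_next + 1) % _slots.Length;
        return item; // caller re-initializes the returned instance in place
    }
}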
return prices.Skip(1)
.Scan(new Stock(symbol, prices.First()),
(previous, price) => previous.Change(price));
Does this fix your problem of side-effects?
I would prefer to introduce some kind of polymorphism. You can introduce a special case of Stock just for seed purposes:
public class Stock {
    // same implementation as yours, but make the Change method virtual and add
    // a protected constructor that only sets the symbol (the private readonly
    // _symbol field can't be assigned from a derived class)
    protected Stock(string symbol) {
        _symbol = symbol;
    }

    public static Stock Seed(string symbol) {
        return new StockSeed(symbol);
    }

    class StockSeed : Stock {
        public StockSeed(string symbol) : base(symbol) { }

        public override Stock Change(decimal newCurrent) {
            return new Stock(Symbol, newCurrent);
        }
    }
}
Then you can simplify the reactive code to:
static IObservable<Stock> ToStock(this IObservable<decimal> prices, string symbol)
{
return prices.Scan(Stock.Seed(symbol), (prev, price) => prev.Change(price));
}
Problem: I have 2 kinds of objects, lets call them Building and Improvement. There are roughly 30 Improvement instances, while there can be 1-1000 Buildings. For each combination of Building and Improvement, I have to perform some heavy calculation, and store the result in a Result object.
Both Buildings and Improvements can be represented by an integer ID.
I then need to be able to:
Access the Result for a given Building and Improvement efficiently (EDIT: see comment further down)
Perform aggregations on the Results for all Improvements for a given Building, like .Sum() and .Average()
Perform the same aggregations on the Results for all Buildings for a given Improvement
This will happen on a web-server back-end, so memory may be a concern, but speed is most important.
Thoughts so far:
Use a Dictionary<Tuple<int, int>, Result> with <BuildingID, ImprovementID> as key. This should give me speedy inserts and single lookups, but I am concerned about .Where() and .Sum() performance.
Use a two-dimensional array, with one dimension for BuildingIDs and one for ImprovementIDs, and the Result as value. In addition, build two Dictionary<int, int> that map BuildingIDs and ImprovementIDs to their respective array row/column indexes. This could potentially mean 1,000+ dictionary entries; will this be a problem?
Use a List<Tuple<int, int, Result>>. I think this may be the least efficient, with O(n) inserts, though I could be wrong.
Am I missing an obvious better option here?
EDIT: Turns out it is only the aggregated values (per Building and per Improvement) I am interested in; see my answer.
Generally, the Dictionary is the most lookup-efficient: both lookup and manipulation are O(1) when accessed via the key. That covers your first point, the access.
For the second and third points you need to walk through all of the items, which is O(n); there is no way to speed that up, unless you want to walk them in a specified order (which would otherwise be O(n·n)); then a SortedDictionary gives you ordered enumeration in O(n), but compromises the lookup and manipulation efficiency (O(log n)).
So I would go with the first solution you posted.
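A minimal sketch of that first option (a dictionary keyed by a (BuildingID, ImprovementID) tuple), with the aggregations done via LINQ; the Result.Data field is borrowed from the demo code further down and the IDs are illustrative:
int buildingId = 42, improvementId = 7; // illustrative IDs

var results = new Dictionary<Tuple<int, int>, Result>();

// O(1) insert and single lookup by (BuildingID, ImprovementID):
results[Tuple.Create(buildingId, improvementId)] = new Result { Data = 42.0 };
var single = results[Tuple.Create(buildingId, improvementId)];

// Aggregations have to walk all entries, i.e. O(n):
var sumForBuilding = results.Where(kv => kv.Key.Item1 == buildingId).Sum(kv => kv.Value.Data);
var avgForImprovement = results.Where(kv => kv.Key.Item2 == improvementId).Average(kv => kv.Value.Data);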
You could use a "dictionary of dictionaries" to hold the Result data, for example:
// Building ID ↓ ↓ Improvement ID
var data = new Dictionary<int, Dictionary<int, Result>>();
This would let you quickly find the improvements for a particular building.
However, finding the buildings that contain a particular improvement would require iterating over all the buildings. Here's some sample code:
using System;
using System.Linq;
using System.Collections.Generic;
namespace Demo
{
sealed class Result
{
public double Data;
}
sealed class Building
{
public int Id;
public int Value;
}
sealed class Improvement
{
public int Id;
public int Value;
}
class Program
{
void run()
{
// Building ID ↓ ↓ Improvement ID
var data = new Dictionary<int, Dictionary<int, Result>>();
for (int buildingKey = 1000; buildingKey < 2000; ++buildingKey)
{
var improvements = new Dictionary<int, Result>();
for (int improvementKey = 5000; improvementKey < 5030; ++improvementKey)
improvements.Add(improvementKey, new Result{ Data = buildingKey + improvementKey/1000.0 });
data.Add(buildingKey, improvements);
}
// Aggregate data for all improvements for building with ID == 1500:
int buildingId = 1500;
var sum = data[buildingId].Sum(result => result.Value.Data);
Console.WriteLine(sum);
// Aggregate data for all buildings with a given improvement.
int improvementId = 5010;
sum = data.Sum(improvements =>
{
Result result;
return improvements.Value.TryGetValue(improvementId, out result) ? result.Data : 0.0;
});
Console.WriteLine(sum);
}
static void Main()
{
new Program().run();
}
}
}
To speed up the second aggregation (for summing data for all improvements with a given ID) we can use a second dictionary:
// Improvement ID ↓ ↓ Building ID
var byImprovementId = new Dictionary<int, Dictionary<int, Result>>();
You would have an extra dictionary to maintain, but it's not too complicated. Having a few nested dictionaries like this might take too much memory though - but it's worth considering.
As noted in the comments below, it would be better to define types for the IDs and also for the dictionaries themselves. Putting that together gives:
using System;
using System.Linq;
using System.Collections.Generic;
namespace Demo
{
sealed class Result
{
public double Data;
}
sealed class BuildingId
{
public BuildingId(int id)
{
Id = id;
}
public readonly int Id;
public override int GetHashCode()
{
return Id.GetHashCode();
}
public override bool Equals(object obj)
{
var other = obj as BuildingId;
if (other == null)
return false;
return this.Id == other.Id;
}
}
sealed class ImprovementId
{
public ImprovementId(int id)
{
Id = id;
}
public readonly int Id;
public override int GetHashCode()
{
return Id.GetHashCode();
}
public override bool Equals(object obj)
{
var other = obj as ImprovementId;
if (other == null)
return false;
return this.Id == other.Id;
}
}
sealed class Building
{
public BuildingId Id;
public int Value;
}
sealed class Improvement
{
public ImprovementId Id;
public int Value;
}
sealed class BuildingResults : Dictionary<BuildingId, Result>{}
sealed class ImprovementResults: Dictionary<ImprovementId, Result>{}
sealed class BuildingsById: Dictionary<BuildingId, ImprovementResults>{}
sealed class ImprovementsById: Dictionary<ImprovementId, BuildingResults>{}
class Program
{
void run()
{
var byBuildingId = CreateTestBuildingsById(); // Create some test data.
var byImprovementId = CreateImprovementsById(byBuildingId); // Create the alternative lookup dictionaries.
// Aggregate data for all improvements for building with ID == 1500:
BuildingId buildingId = new BuildingId(1500);
var sum = byBuildingId[buildingId].Sum(result => result.Value.Data);
Console.WriteLine(sum);
// Aggregate data for all buildings with a given improvement.
ImprovementId improvementId = new ImprovementId(5010);
sum = byBuildingId.Sum(improvements =>
{
Result result;
return improvements.Value.TryGetValue(improvementId, out result) ? result.Data : 0.0;
});
Console.WriteLine(sum);
// Aggregate data for all buildings with a given improvement using byImprovementId.
// This will be much faster than the above Linq.
sum = byImprovementId[improvementId].Sum(result => result.Value.Data);
Console.WriteLine(sum);
}
static BuildingsById CreateTestBuildingsById()
{
var byBuildingId = new BuildingsById();
for (int buildingKey = 1000; buildingKey < 2000; ++buildingKey)
{
var improvements = new ImprovementResults();
for (int improvementKey = 5000; improvementKey < 5030; ++improvementKey)
{
improvements.Add
(
new ImprovementId(improvementKey),
new Result
{
Data = buildingKey + improvementKey/1000.0
}
);
}
byBuildingId.Add(new BuildingId(buildingKey), improvements);
}
return byBuildingId;
}
static ImprovementsById CreateImprovementsById(BuildingsById byBuildingId)
{
var byImprovementId = new ImprovementsById();
foreach (var improvements in byBuildingId)
{
foreach (var improvement in improvements.Value)
{
if (!byImprovementId.ContainsKey(improvement.Key))
byImprovementId[improvement.Key] = new BuildingResults();
byImprovementId[improvement.Key].Add(improvements.Key, improvement.Value);
}
}
return byImprovementId;
}
static void Main()
{
new Program().run();
}
}
}
Finally, here's a modified version which determines the time it takes to aggregate data for all instances of a building/improvement combination for a particular improvement and compares the results for dictionary of tuples with dictionary of dictionaries.
My results for a RELEASE build run outside any debugger:
Dictionary of dictionaries took 00:00:00.2967741
Dictionary of tuples took 00:00:07.8164672
It's significantly faster to use a dictionary of dictionaries, but this is only of importance if you intend to do many of these aggregations.
using System;
using System.Diagnostics;
using System.Linq;
using System.Collections.Generic;
namespace Demo
{
sealed class Result
{
public double Data;
}
sealed class BuildingId
{
public BuildingId(int id)
{
Id = id;
}
public readonly int Id;
public override int GetHashCode()
{
return Id.GetHashCode();
}
public override bool Equals(object obj)
{
var other = obj as BuildingId;
if (other == null)
return false;
return this.Id == other.Id;
}
}
sealed class ImprovementId
{
public ImprovementId(int id)
{
Id = id;
}
public readonly int Id;
public override int GetHashCode()
{
return Id.GetHashCode();
}
public override bool Equals(object obj)
{
var other = obj as ImprovementId;
if (other == null)
return false;
return this.Id == other.Id;
}
}
sealed class Building
{
public BuildingId Id;
public int Value;
}
sealed class Improvement
{
public ImprovementId Id;
public int Value;
}
sealed class BuildingResults : Dictionary<BuildingId, Result>{}
sealed class ImprovementResults: Dictionary<ImprovementId, Result>{}
sealed class BuildingsById: Dictionary<BuildingId, ImprovementResults>{}
sealed class ImprovementsById: Dictionary<ImprovementId, BuildingResults>{}
class Program
{
void run()
{
var byBuildingId = CreateTestBuildingsById(); // Create some test data.
var byImprovementId = CreateImprovementsById(byBuildingId); // Create the alternative lookup dictionaries.
var testTuples = CreateTestTuples();
ImprovementId improvementId = new ImprovementId(5010);
int count = 10000;
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < count; ++i)
byImprovementId[improvementId].Sum(result => result.Value.Data);
Console.WriteLine("Dictionary of dictionaries took " + sw.Elapsed);
sw.Restart();
for (int i = 0; i < count; ++i)
testTuples.Where(result => result.Key.Item2.Equals(improvementId)).Sum(item => item.Value.Data);
Console.WriteLine("Dictionary of tuples took " + sw.Elapsed);
}
static Dictionary<Tuple<BuildingId, ImprovementId>, Result> CreateTestTuples()
{
var result = new Dictionary<Tuple<BuildingId, ImprovementId>, Result>();
for (int buildingKey = 1000; buildingKey < 2000; ++buildingKey)
for (int improvementKey = 5000; improvementKey < 5030; ++improvementKey)
result.Add(
new Tuple<BuildingId, ImprovementId>(new BuildingId(buildingKey), new ImprovementId(improvementKey)),
new Result
{
Data = buildingKey + improvementKey/1000.0
});
return result;
}
static BuildingsById CreateTestBuildingsById()
{
var byBuildingId = new BuildingsById();
for (int buildingKey = 1000; buildingKey < 2000; ++buildingKey)
{
var improvements = new ImprovementResults();
for (int improvementKey = 5000; improvementKey < 5030; ++improvementKey)
{
improvements.Add
(
new ImprovementId(improvementKey),
new Result
{
Data = buildingKey + improvementKey/1000.0
}
);
}
byBuildingId.Add(new BuildingId(buildingKey), improvements);
}
return byBuildingId;
}
static ImprovementsById CreateImprovementsById(BuildingsById byBuildingId)
{
var byImprovementId = new ImprovementsById();
foreach (var improvements in byBuildingId)
{
foreach (var improvement in improvements.Value)
{
if (!byImprovementId.ContainsKey(improvement.Key))
byImprovementId[improvement.Key] = new BuildingResults();
byImprovementId[improvement.Key].Add(improvements.Key, improvement.Value);
}
}
return byImprovementId;
}
static void Main()
{
new Program().run();
}
}
}
Thanks for the answers, the test code was really informative :)
The solution for me turned out to be to forgo LINQ, and perform aggregation manually directly after the heavy calculation, as I had to iterate over each combination of Building and Improvement anyway.
Also, I had to use the objects themselves as keys, in order to perform calculations before the objects were persisted to Entity Framework (i.e. their IDs were all 0).
Code:
public class Building {
public int ID { get; set; }
...
}
public class Improvement {
public int ID { get; set; }
...
}
public class Result {
public decimal Foo { get; set; }
public long Bar { get; set; }
...
public void Add(Result result) {
Foo += result.Foo;
Bar += result.Bar;
...
}
}
public class Calculator {
public Dictionary<Building, Result> ResultsByBuilding;
public Dictionary<Improvement, Result> ResultsByImprovement;
public void CalculateAndAggregate(IEnumerable<Building> buildings, IEnumerable<Improvement> improvements) {
ResultsByBuilding = new Dictionary<Building, Result>();
ResultsByImprovement = new Dictionary<Improvement, Result>();
foreach (var building in buildings) {
foreach (var improvement in improvements) {
Result result = DoHeavyCalculation(building, improvement);
if (ResultsByBuilding.ContainsKey(building)) {
ResultsByBuilding[building].Add(result);
} else {
ResultsByBuilding[building] = result;
}
if (ResultsByImprovement.ContainsKey(improvement)) {
ResultsByImprovement[improvement].Add(result);
} else {
ResultsByImprovement[improvement] = result;
}
}
}
}
}
public static void Main() {
var calculator = new Calculator();
IList<Building> buildings = GetBuildingsFromRepository();
IList<Improvement> improvements = GetImprovementsFromRepository();
calculator.CalculateAndAggregate(buildings, improvements);
DoStuffWithResults(calculator);
}
I did it this way because I knew exactly which aggregations I wanted; if I required a more dynamic approach I would probably have gone with something like #MatthewWatson's Dictionary of Dictionaries.