Frequent inserts into sorted collection

Frequent inserts into sorted collection - c#

I have sorted collection (List) and I need to keep it sorted at all times.
I am currently using List.BinarySearch on my collection and then insert element in right place. I have also tried sorting list after every insertion but the performance in unacceptable.
Is there a solution that will give better performance? Maybe I should use other collection.
(I am aware of SortedList but it is restricted to unique keys)

PowerCollections has an OrderedBag type which may be good for what you need. From the docs
Inserting, deleting, and looking up an
an element all are done in log(N) + M
time, where N is the number of keys in
the tree, and M is the current number
of copies of the element being
handled.
However, for the .NET 3.5 built in types, using List.BinarySearch and inserting each item into the correct place is a good start - but that uses an Array internally so your performance will drop due to all the copying you're doing when you insert.
If you can group your inserts into batches that will improve things, but unless you can get down to only a single sort operation after all your inserting you're probably better off using OrderedBag from PowerCollections if you can.

If you're using .Net 4, you can use a SortedSet<T>
http://msdn.microsoft.com/en-us/library/dd412070.aspx
For .Net 3.5 and lower, see if a SortedList<TKey,TValue> works for you.
http://msdn.microsoft.com/en-us/library/ms132319.aspx

Try to aggregate insertions into batches, and sort only at the end of each batch.

Related

C# efficient datastructure to store objects sorted by key with duplicate keys to fit given requirements

I'm looking for the most efficient way to store a collection of objects sorted by Comparator using object attribute in programming language C#.
There are objects with same value for attribute, so duplicate keys occur in the collection.
Complexityclasses for inserting and removing elements to or from sorted datastructure should not be higher than O(log(N)) (noted in Big O notation) since attribute used for sorting will change very often and list has to be updated with every change to stay consistent.
Complexityclasses for getting all elements in datastructure as list in sorted order should not be higher than O(1).
Options might be C# templates SortedSets, SortedDictionaries or SortedLists. All of them fail when inserting or deleting from sorted datastructure if duplicate keys are present.
A workaround might be to use stacked SortedDictionaries as shown below, and aggregate objects with equal key as seperated collections, and sort them by another unique key (ID for example):
SortedDictionary<long, SortedDictionary<long, RankedObject>> sPresortedByRank =
new SortedDictionary<long, SortedDictionary<long, RankedObject>>(new ByRankAscSortedDictionary());
long rank = 52
sPresortedByRank[rank] = new SortedDictionary<long, RankedObject>(new ByIdAscSortedDictionary);
Inserting and removing elements from datastructure will work in O(log(N)), what is good. Getting all elements from datastructe as list requires a expensive Queryable.SelectMany, which increases complexity for this operation to O(N^2), what is not acceptable.
Current best attemp is to use a primitive List and insert and delete using BinarySearch to identify indices for inserations and deletions. For insert this gives me worst case complexity O(log(N)). For delete average case complexity O(log(n)) since it duplicate keys will be rare, but O(N) in worst case (all objects with same key). Getting all elements of sorted list will be O(1).
Can someone imagine a better way of managing object collection in sorted datastructure that fits my needs, or is current best attemp the best one in general.
Help and well founded opinions are appreciated. Cookies for good answers of course.

I think the simplest solution that meets your requirements it to add tie breaking to your comparator to ensure that there will be no duplicates, and a defined total ordering among all objects. Then you can just use SortedSet.
If you can have an ID in every object, for example, then instead of sorting by "rank", you can sort by (rank,ID) to make the comparator a total ordering.
To find all the elements with a specific rank, you would then use SortedSet.GetViewBeteen() with the range of (rank,min_id) and (rank,max_id).

#Iliar Turdushevs very useful hint:
Do you consider using C5 library? It contains TreeBag collection that satisfies all you requirements. By the way, for primitive List the complexity of inserting and removing elements is not O(log(N)). O(log(N)) is only complexity of searching the index where the element must be inserted/deleted. But insertion/deletion itself uses Array.Copy to shift elements to insert/delete the element. Therefore the complexity would be O(M), where M <= N.

What is the fastest way to search a List<T> across multiple properties?

I have a process I've inherited that I'm converting to C# from another language. Numerous steps in the process loop through what can be a lot of records (100K-200K) to do calculations. As part of those processes it generally does a lookup into another list to retrieve some values. I would normally move this kind of thing into a SQL statement (and we have where we've been able to) but in these cases there isn't really an easy way to do that. In some places we've attempted to convert the code to a stored procedure and decided it wasn't working nearly as well as we had hoped.
Effectively, the code does this:
var match = cost.Where(r => r.ryp.StartsWith(record.form.TrimEnd()) &&
r.year == record.year &&
r.period == record.period).FirstOrDefault();
cost is a local List type. If I was doing a search on only one field I'd probably just move this into a Dictionary. The records aren't always unique either.
Obviously, this is REALLY slow.
I ran across the open source library I4O which can build indexes, however it fails for me in various queries (and I don't really have the time to attempt to debug the source code). It also doesn't work with .StartsWith or .Contains (StartsWith is much more important since a lot of the original queries take advantage of the fact that doing a search for "A" would find a match in "ABC").
Are there any other projects (open source or commercial) that do this sort of thing?
EDIT:
I did some searching based on the feedback and found Power Collections which supports dictionaries that have keys that aren't unique.
I tested ToLookup() which worked great - it's still not quite as fast as the original code, but it's at least acceptable. It's down from 45 seconds to 3-4 seconds. I'll take a look at the Trie structure for the other look ups.
Thanks.

Looping through a list of 100K-200K items doesn't take very long. Finding matching items within the list by using nested loops (n^2) does take long. I infer this is what you're doing (since you have assignment to a local match variable).
If you want to quickly match items together, use .ToLookup.
var lookup = cost.ToLookup(r => new {r.year, r.period, form = r.ryp});
foreach(var group in lookup)
{
// do something with items in group.
}
Your startswith criteria is troublesome for key-based matching. One way to approach that problem is to ignore it when generating keys.
var lookup = cost.ToLookup(r => new {r.year, r.period });
var key = new {record.year, record.period};
string lookForThis = record.form.TrimEnd();
var match = lookup[key].FirstOrDefault(r => r.ryp.StartsWith(lookForThis))
Ideally, you would create the lookup once and reuse it for many queries. Even if you didn't... even if you created the lookup each time, it will still be faster than n^2.

Certainly you can do better than this. Let's start by considering that dictionaries are not useful only when you want to query one field; you can very easily have a dictionary where the key is an immutable value that aggregates many fields. So for this particular query, an immediate improvement would be to create a key type:
// should be immutable, GetHashCode and Equals should be implemented, etc etc
struct Key
{
public int year;
public int period;
}
and then package your data into an IDictionary<Key, ICollection<T>> or similar where T is the type of your current list. This way you can cut down heavily on the number of rows considered in each iteration.
The next step would be to use not an ICollection<T> as the value type but a trie (this looks promising), which is a data structure tailored to finding strings that have a specified prefix.
Finally, a free micro-optimization would be to take the TrimEnd out of the loop.
Now certainly all of this only applies to the specific example given and may need to be revisited due to other specifics of your situation, but in any case you should be able to extract practical gain from this or something similar.

List<T> FirstOrDefault() bad performance - is dictionary possible in this case?

I have set of 'codes' Z that are valid in a certain time period.
Since I need them a lot of times in a large loop (million+) and every time I have to lookup the corresponding code I cache them in a List<>. After finding the correct codes, i'm inserting (using SqlBulkCopy) a million rows.
I lookup the id with the following code (l_z is a List<T>)
var z_fk = (from z in l_z
where z.CODE == lookupCode &&
z.VALIDFROM <= lookupDate &&
z.VALIDUNTIL >= lookupDate
select z.id).SingleOrDefault();
In other situations I have used a Dictionary with superb performance, but in those cases I only had to lookup the id based on the code.
But now with searching on the combination of fields, I am stuck.
Any ideas? Thanks in advance.

Create a Dictionary that stores a List of items per lookup code - Dictionary<string, List<Code>> (assuming that lookup code is a string and the objects are of type Code).
Then when you need to query based on lookupDate, you can run your query directly off of dict[lookupCode]:
var z_fk = (from z in dict[lookupCode]
where z.VALIDFROM <= lookupDate &&
z.VALIDUNTIL >= lookupDate
select z.id).SingleOrDefault();
Then just make sure that whenever you have a new Code object, that it gets added to the List<Code> collection in the dict corresponding to the lookupCode (and if one doesn't exist, then create it).

A simple improvement would be to use...
//in initialization somewhere
ILookup<string, T> l_z_lookup = l_z.ToLookup(z=>z.CODE);
//your repeated code:
var z_fk = (from z in lookup[lookupCode]
where z.VALIDFROM <= lookupDate && z.VALIDUNTIL >= lookupDate
select z.id).SingleOrDefault();
You could further use a more complex, smarter data structure storing dates in sorted fashion and use a binary search to find the id, but this may be sufficient. Further, you speak of SqlBulkCopy - if you're dealing with a database, perhaps you can execute the query on the database, and then simply create the appropriate index including columns CODE, VALIDUNTIL and VALIDFROM.
I generally prefer using a Lookup over a Dictionary containing Lists since it's trivial to construct and has a cleaner API (e.g. when a key is not present).

We don't have enough information to give very prescriptive advice - but there are some general things you should be thinking about.
What types are the time values? Are you comparing date times or some primitive value (like a time_t). Think about how your data types affects performance. Choose the best ones.
Should you really be doing this in memory or should you be putting all these rows in to SQL and letting it be queried on there? It's really good at that.
But let's stick with what you asked about - in memory searching.
When searching is taking too long there is only one solution - search fewer things. You do this by partitioning your data in a way that allows you to easily rule out as many nodes as possible with as few operations as possible.
In your case you have two criteria - a code and a date range. Here are some ideas...
You could partition based on code - i.e. Dictionary> - if you have many evenly distributed codes your list sizes will each be about N/M in size (where N = total event count and M = number of events). So a million nodes with ten codes now requires searching 100k items rather than a million. But you could take that a bit further. The List could itself be sorted by starting time allowing a binary search to rule out many other nodes very quickly. (this of course has a trade-off in time building the collection of data). This should provide very quick
You could partition based on date and just store all the data in a single list sorted by start date and use a binary search to find the start date then march forward to find the code. Is there a benefit to this approach over the dictionary? That depends on the rest of your program. Maybe being an IList is important. I don't know. You need to figure that out.
You could flip the dictionary model partition the data by start time rounded to some boundary (depending on the length, granularity and frequency of your events). This is basically bucketing the data in to groups that have similar start times. E.g., all the events that were started between 12:00 and 12:01 might be in one bucket, etc. If you have a very small number of events and a lot of highly frequent (but not pathologically so) events this might give you very good lookup performance.
The point? Think about your data. Consider how expensive it should be to add new data and how expensive it should be to query the data. Think about how your data types affect those characteristics. Make an informed decision based on that data. When in doubt let SQL do it for you.

This to me sounds like a situation where this could all happen on the database via a single statement. Then you can use indexing to keep the query fast and avoid having to push data over the wire to and from your database.

best data structure for sorted time series data that can quickly return subarrays?

I'm in need of a datastructure that is basically a list of data points, where each data point has a timestamp and a double[] of data values. I want to be able to retrieve the closest point to a given timestamp or all points within a specified range of timestamps.
I'm using c#. my thinking was using a regular list would be possible, where "datapoint" is a class that contains the timestamp and double[] fields. then to insert, I'd use the built-in binarysearch() to find where to insert the new data, and I could use it again to find the start/end indexes for a range search.
I first tried sortedlists, but it seems like you can't iterate through indexes i=0,1,2,...,n, just through keys, so I wasn't sure how to do the range search without some convoluted function.
but then I learned that list<>'s insert() is o(n)...couldn't I do better than that without sacrificing elsewhere?
alternatively, is there some nice linq query that will do everything I want in a single line?

If you're willing to use non BCL libraries, the C5.SortedArray<T> has always worked quite well for me.
It has a great method, RangeFromTo, that performs quite well with this sort of problem.

If you have only static data then any structure implementing IList should be fine. Sort it once and then make queries using BinarySearch. This should also work if your inserted timestamps are always increasing, then you can just do List.Add() in O(1) and it will be still sorted.
List<int> x = new List<int>();
x.Add(5);
x.Add(7);
x.Add(3);
x.Sort();
//want to find all elements between 4 and 6
int rangeStart = x.BinarySearch(4);
//since there is no element equal to 4, we'll get the binary complement of an index, where 4 could have possibly been found
//see MSDN for List<T>.BinarySearch
if (rangeStart < 0)
rangeStart = ~rangeStart;
while (x[rangeStart] < 6)
{
//do you business
rangeStart++;
}
If you need to insert data at random points in your structure, keep it sorted and be able to query ranges fast, you need a structure called B+ tree. It's not implemented in the framework, you'll need to get it somewhere by your own.
Inserting a record requires O(log n) operations in the worst case
Finding a record requires O(log n) operations in the worst case
Removing a (previously located) record requires O(log n) operations in the worst case
Performing a range query with k elements occurring within the range requires O((log n) + k) operations in the worst case.
P.S. "is there some nice linq query that will do everything I want in a single line"
I wish I knew such a nice linq query that could do everything I want in one line :-)

You have the choice of cost at insertion, retrieval or removal time. There are various data structures optimized for each of these cases. Before you decide on one, I'd estimate the total size of your structures, how many data points are being generated (and at which frequency) and what will be used more often: insertion or retrieval.
If you insert a lot of new data points at high frequency, I'd suggest looking at a LinkedList<>. If you're retrieving more often, I'd use a List<> even though its insertion time is slower.
Of course you could do that in a LINQ query, but remember this is only sugar coating: The query will execute every time and for every execution search the entire set of data points to find a match. This may be more expensive than using the right collection for the job in the first place.

How about using an actual database to store your data and run queries against that?
Then, you could use LINQ-to-SQL.

Fastest way to find objects from a collection matched by condition on string member

Suppose I have a collection (be it an array, generic List, or whatever is the fastest solution to this problem) of a certain class, let's call it ClassFoo:
class ClassFoo
{
public string word;
public float score;
//... etc ...
}
Assume there's going to be like 50.000 items in the collection, all in memory.
Now I want to obtain as fast as possible all the instances in the collection that obey a condition on its bar member, for example like this:
List<ClassFoo> result = new List<ClassFoo>();
foreach (ClassFoo cf in collection)
{
if (cf.word.StartsWith(query) || cf.word.EndsWith(query))
result.Add(cf);
}
How do I get the results as fast as possible? Should I consider some advanced indexing techniques and datastructures?
The application domain for this problem is an autocompleter, that gets a query and gives a collection of suggestions as a result. Assume that the condition doesn't get any more complex than this. Assume also that there's going to be a lot of searches.

With the constraint that the condition clause can be "anything", then you're limited to scanning the entire list and applying the condition.
If there are limitations on the condition clause, then you can look at organizing the data to more efficiently handle the queries.
For example, the code sample with the "byFirstLetter" dictionary doesn't help at all with an "endsWith" query.
So, it really comes down to what queries you want to do against that data.
In Databases, this problem is the burden of the "query optimizer". In a typical database, if you have a database with no indexes, obviously every query is going to be a table scan. As you add indexes to the table, the optimizer can use that data to make more sophisticated query plans to better get to the data. That's essentially the problem you're describing.
Once you have a more concrete subset of the types of queries then you can make a better decision as to what structure is best. Also, you need to consider the amount of data. If you have a list of 10 elements each less than 100 byte, a scan of everything may well be the fastest thing you can do since you have such a small amount of data. Obviously that doesn't scale to a 1M elements, but even clever access techniques carry a cost in setup, maintenance (like index maintenance), and memory.
EDIT, based on the comment
If it's an auto completer, if the data is static, then sort it and use a binary search. You're really not going to get faster than that.
If the data is dynamic, then store it in a balanced tree, and search that. That's effectively a binary search, and it lets you keep add the data randomly.
Anything else is some specialization on these concepts.

var Answers = myList.Where(item => item.bar.StartsWith(query) || item.bar.EndsWith(query));
that's the easiest in my opinion, should execute rather quickly.

Not sure I understand... All you can really do is optimize the rule, that's the part that needs to be fastest. You can't speed up the loop without just throwing more hardware at it.
You could parallelize if you have multiple cores or machines.

I'm not up on my Java right now, but I would think about the following things.
How you are creating your list? Perhaps you can create it already ordered in a way which cuts down on comparison time.
If you are just doing a straight loop through your collection, you won't see much difference between storing it as an array or as a linked list.
For storing the results, depending on how you are collecting them, the structure could make a difference (but assuming Java's generic structures are smart, it won't). As I said, I'm not up on my Java, but I assume that the generic linked list would keep a tail pointer. In this case, it wouldn't really make a difference. Someone with more knowledge of the underlying array vs linked list implementation and how it ends up looking in the byte code could probably tell you whether appending to a linked list with a tail pointer or inserting into an array is faster (my guess would be the array). On the other hand, you would need to know the size of your result set or sacrifice some storage space and make it as big as the whole collection you are iterating through if you wanted to use an array.
Optimizing your comparison query by figuring out which comparison is most likely to be true and doing that one first could also help. ie: If in general 10% of the time a member of the collection starts with your query, and 30% of the time a member ends with the query, you would want to do the end comparison first.

For your particular example, sorting the collection would help as you could binarychop to the first item that starts with query and terminate early when you reach the next one that doesn't; you could also produce a table of pointers to collection items sorted by the reverse of each string for the second clause.
In general, if you know the structure of the query in advance, you can sort your collection (or build several sorted indexes for your collection if there are multiple clauses) appropriately; if you do not, you will not be able to do better than linear search.

If it's something where you populate the list once and then do many lookups (thousands or more) then you could create some kind of lookup dictionary that maps starts with/ends with values to their actual values. That would be a fast lookup, but would use much more memory. If you aren't doing that many lookups or know you're going to be repopulating the list at least semi-frequently I'd go with the LINQ query that CQ suggested.

You can create some sort of index and it might get faster.
We can build a index like this:
Dictionary<char, List<ClassFoo>> indexByFirstLetter;
foreach (var cf in collection) {
indexByFirstLetter[cf.bar[0]] = indexByFirstLetter[cf.bar[0]] ?? new List<ClassFoo>();
indexByFirstLetter[cf.bar[0]].Add(cf);
indexByFirstLetter[cf.bar[cf.bar.length - 1]] = indexByFirstLetter[cf.bar[cf.bar.Length - 1]] ?? new List<ClassFoo>();
indexByFirstLetter[cf.bar[cf.bar.Length - 1]].Add(cf);
}
Then use the it like this:
foreach (ClasssFoo cf in indexByFirstLetter[query[0]]) {
if (cf.bar.StartsWith(query) || cf.bar.EndsWith(query))
result.Add(cf);
}
Now we possibly do not have to loop through as many ClassFoo as in your example, but then again we have to keep the index up to date. There is no guarantee that it is faster, but it is definately more complicated.

Depends. Are all your objects always going to be loaded in memory? Do you have a finite limit of objects that may be loaded? Will your queries have to consider objects that haven't been loaded yet?
If the collection will get large, I would definitely use an index.
In fact, if the collection can grow to an arbitrary size and you're not sure that you will be able to fit it all in memory, I'd look into an ORM, an in-memory database, or another embedded database. XPO from DevExpress for ORM or SQLite.Net for in-memory database comes to mind.
If you don't want to go this far, make a simple index consisting of the "bar" member references mapping to class references.

If the set of possible criteria is fixed and small, you can assign a bitmask to each element in the list. The size of the bitmask is the size of the set of the criteria. When you create an element/add it to the list, you check which criteria it satisfies and then set the corresponding bits in the bitmask of this element. Matching the elements from the list will be as easy as matching their bitmasks with the target bitmask. A more general method is the Bloom filter.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.