Advanced: How to optimize my complex O(n²) algorithm - c#

I have people and places data as:
Person entity has
IList<DateRangePlaces> each having
IList<Place> of possible places
Schedule day pattern, e.g. 10 days available, 4 unavailable
Within a particular DateRangePlaces date range, the Schedule pattern determines whether the person can go to a particular place on a given day.
Place entity has
IList<DateRangeTiming> each defining opening/closing times within each date range
Overlapping date ranges work as LIFO: for each day that has already been defined previously, the newer timing definition takes precedence.
The problem
Now I need to do something like this (in pseudo code):
for each Place
{
    for each Day between minimum and maximum date in IList<DateRangeTiming>
    {
        get a set of People applicable for Place and on Day
    }
}
This means that the number of steps to execute my task is approximately:
∑(places) ( ∑(days) × ∑(people) )
This to my understanding is
O(x × yₓ × z)
(where x is the number of places, yₓ the number of days within a given place's timing ranges, and z the number of people) and likely approximates to this algorithm complexity:
O(n³)
I'm not an expert in theory, so you can freely correct my assumptions. What is true is that this kind of complexity is definitely not acceptable, especially given that I will be operating over long date ranges with many places and people.
From the formula approximation we can see that the people set would be iterated many times. Hence I would like to optimize at least this part. To ease things a bit I changed
Person.IList<DateRangePlaces>.IList<Place>
to
Person.IList<DateRangePlaces>.IDictionary<int, Place>
which gives me a faster answer to whether a person can go to some place on a particular date, because I only check whether Place.Id is present in the dictionary, instead of an IList.Where() LINQ query that would have to scan the whole list each and every time.
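To illustrate, the lookup side looks roughly like this (a simplified sketch with made-up entity shapes, not my real classes):

using System;
using System.Collections.Generic;

// Sketch only: hypothetical, simplified entity shapes.
class Place
{
    public int Id;
    public string Name;
}

class DateRangePlaces
{
    public DateTime From;
    public DateTime To;

    // Keyed by Place.Id so a membership test is an average O(1) hash lookup
    public IDictionary<int, Place> Places = new Dictionary<int, Place>();

    public bool CanGoTo(int placeId, DateTime day)
    {
        return day >= From && day <= To && Places.ContainsKey(placeId);
    }
}

ContainsKey is an average O(1) hash lookup, whereas a places.Where(p => p.Id == placeId).Any() call on an IList<Place> scans the whole list every time.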
Question
Can you suggest any additional optimizations I could implement into my algorithm to make it faster or even make it less complex in terms of the big O notation?
Which memory structure types would you use where and why (lists, dictionaries, stacks, queues...) to improve performance?
Addendum: The whole problem is even more complex
There are also additional complexities that I didn't mention, since I wanted to simplify my question to make it clearer. There's also:
Place.IList<Permission>
Person.IList<DateRangePermission>
So places require particular permissions, and people have limited-time permission grants that expire.
Additional to that, there's also
Person.IList<DateRangeTimingRestriction>
which restricts the times that a person can go somewhere during a particular date range. And
Person.IList<DateRangePlacePriorities>
Which defines place prioritization for a particular date range.
And during this process of getting applicable people I also have to calculate a certain factor per person per place, related to:
the number of places that a person can visit on a particular day
the person's place priority factor on that particular day
All these are the reasons why I decided to manipulate this data in memory rather than use a very complex stored procedure that would also be doing multiple table scans to get the factors per person, place and day.
I think such a stored procedure would be way too complex to handle and maintain. So I'd rather get all the data first (put it into appropriate memory structures to aid performance) and then work with it in memory.

I suggest using a relational database and writing a stored procedure to retrieve the "set of People applicable for Place and on Day".
The stored procedure approach would be neither complex nor difficult to maintain if the model is architected properly. Additionally, relational databases have primary keys and indexing to avoid table scans.
The only way to speed things up using collections would be:
change the collection type. You could use a KeyedCollection, IDictionary<> or even a disconnected recordset. Disconnected recordsets also give you the ability to set foreign keys to child recordsets; however, I think this would be a fairly complex pattern to use.
maintain a collection within a collection - basically the same concept as a parent/child relationship with a foreign key. The object references will only be pointers to the original object's memory space or, if you're using a keyed collection, you could simply store the index of the other collection.
maintain boolean properties that allow you to skip iterations. For example, as you build your entities, set a boolean such as "HasPlaceXPermission". If the value is false, you know not to retrieve information related to place X.
maintain flags - flags can be a very good optimization technique when used properly. Similar to #3, flags can be used to determine permissions very quickly: for example, if ((person.PlacePermissions & (Place.Colorado | Place.Florida)) > 0), do the date/time scan on Colorado and Florida, otherwise don't (see the sketch after this list).
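A minimal sketch of the flags idea (the PlaceFlags values here are invented purely for illustration):

using System;

[Flags]
enum PlaceFlags
{
    None     = 0,
    Colorado = 1 << 0,
    Florida  = 1 << 1,
    Texas    = 1 << 2
}

class Person
{
    // One integer carries all of this person's place permissions
    public PlaceFlags PlacePermissions;
}

class Demo
{
    static void Main()
    {
        var person = new Person { PlacePermissions = PlaceFlags.Colorado | PlaceFlags.Texas };

        // Cheap bitwise test before doing any expensive date/time scan
        if ((person.PlacePermissions & (PlaceFlags.Colorado | PlaceFlags.Florida)) != 0)
        {
            Console.WriteLine("Scan Colorado/Florida date ranges for this person.");
        }
    }
}

Because the whole permission set fits in one integer, the check costs a couple of CPU instructions, which makes it cheap enough to run inside the inner loops.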
It's difficult to know which collection types I would use based upon the information you have provided; I would need a larger scope of the application to determine that architecturally. For example, where is the data stored, how is it retrieved, how is it prepared and how is it presented? Knowing how the application is architected would help to determine its optimization points.

You can't avoid O(n²), as at minimum you need to iterate over every Place and every Date element to find a match for a given Person.
I think the best way is to use a DB such as SQL Server and run your query in SQL as a stored procedure.

The date range is presumably fairly limited, perhaps never more than a few years. Call it constant. When you say that, for each of those combinations, you need to "get a set of people applicable", then it's pretty clear: if you really do need to get all that data, then you can't improve the complexity of your solution, because you need to return a result for each combination.
Don't worry about complexity unless you're having trouble scaling with large numbers of people. Ordinary profiling is the place to start if you're having performance problems. O(#locations * #people) is not so bad.


Is it better to do an additional query, or to get all the data and filter on the client?

I have a query with EF Core in which I would like to include a navigation property and, since this property is an ICollection, filter which of its items to load.
It is something like this:
myDbContext.MyEntity.Where(x => x.ID == 1).Include(x => x.MyCollection.Where(y => y.isEnabled == true));
However, I get an error because it is not possible to filter the included properties.
In fact, the items in the collection will be few, about 6 or 7, so I was thinking that I could include them all and later filter the data on the client.
Another option would be to get the main entity first and, in a second query, get the children that I really need.
I always read that connections to the database are expensive, so it is better to do as few queries as possible; but I also read that the best practice is to get only the data that I need, filtering in the query rather than on the client.
But in this case, with EF Core, it seems that I can't filter in the query, so I would like to know which is better: two queries that get only the data I need, or one query that gets all the data and filters later on the client.
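For reference, here is roughly what the two options look like (MyCollectionItem, MyEntityId and MyDbContext are placeholder names for the child type, its foreign key and the context; the rest mirrors the snippet above):

using System.Collections.Generic;
using System.Linq;
using Microsoft.EntityFrameworkCore;

public class MyEntity
{
    public int ID { get; set; }
    public List<MyCollectionItem> MyCollection { get; set; }
}

public class MyCollectionItem
{
    public int Id { get; set; }
    public int MyEntityId { get; set; }
    public bool isEnabled { get; set; }
}

public class MyDbContext : DbContext
{
    public DbSet<MyEntity> MyEntity { get; set; }
    public DbSet<MyCollectionItem> MyCollectionItems { get; set; }
}

public static class FilterExamples
{
    // Option A: one round trip; load the whole collection, filter in memory (fine for 6-7 items).
    public static List<MyCollectionItem> LoadAllThenFilter(MyDbContext db)
    {
        var entity = db.MyEntity
            .Include(x => x.MyCollection)
            .Single(x => x.ID == 1);

        return entity.MyCollection.Where(y => y.isEnabled).ToList();
    }

    // Option B: two round trips, but only the enabled children cross the wire.
    public static List<MyCollectionItem> TwoQueries(MyDbContext db)
    {
        var entity = db.MyEntity.Single(x => x.ID == 1);

        return db.MyCollectionItems
            .Where(y => y.MyEntityId == entity.ID && y.isEnabled)
            .ToList();
    }
}

With only 6 or 7 children either way, both options should be quick; the question is which habit scales better.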
Which is longer? One long piece of string, or two shorter pieces of string?
You don't know, because I haven't told you the actual lengths. You don't know if it's a 1m string versus two 5cm strings, or a 10cm string versus two 8cm strings.
And your question here is the same. It's better to do fewer queries than many, and it's better to do short queries than long queries. When a choice is on only one of those metrics (e.g. the shorter query from doing a simple Where on the database vs a simple Where in memory on all results) then we can make sound a priori judgements about which is likely to be the more efficient, and choose accordingly.
When though we have competing factors in play we have to:
Decide whether we even care: If they're going to still be pretty fast either way it might not be worth worrying about; find bigger fish to fry.
Measure.
Make sure what we are measuring is realistic.
The third point is important as one can often create data sets that would make one come out the victor, and other data sets that would make the other win. We need to make sure we're correctly modelling what is encountered in real life.
When the difference is small, or if they are both fast either way (and/or the use is so rare that it's still not a big deal), then just go for whichever is easier to code and maintain.

DocumentDB: How to better structure data for updates

So, for example, I have a collection of documents like this:
{
hotField1 : 0,
hotField2 : "",
coldField1 : 0,
...
coldFieldN : ""
}
In this scope, cold properties are written once and accessed sometimes; hot properties are written and then fairly often accessed/updated (but in different use cases; it is not the same sub-document or parts of the same object).
The number of documents is fairly huge (1M and more), and the size of the hot data is at least ten times smaller than the cold.
Since partial update is still a most-wanted yet not-implemented feature, the only way to update hotField1 is:
Request full document
Change either hotField1 or hotField2
Write back whole document
This is costly in terms of RUs, and doesn't scale so well.
So the question is: how do I organize such data and calls in DocumentDB to minimize costs?
Discovered alternatives:
Obviously best: retrieve one property, change it, update it - not available yet.
Split into two collections, and use stored procedures to retrieve from the main collection and then from the dictionary?
Put hotFields1-2 as subdocument ({ sub: {hf1:0, hf2:""}}) and somehow only update it? (I'm not sure if it is possible)
P.S. C# is in the tags because of the client library we use. If it lacks something, it's OK to use the REST interface instead.
While there's no exact "best" answer:
Your #2 choice will not work with stored procedures, since stored procedures are scoped to a collection.
Updating a subdocument (#3 choice) is no different than updating top-level properties - you are still retrieving, and re-writing, a document (a subdocument is just another property on the document).
While it may or may not reduce RU (you'd need to benchmark, as Larry pointed out in comments), you may choose to store your hot properties in a separate (smaller) document (or multiple smaller documents). With less properties, there would be less bandwidth consumed during updates, and less index updating. However, since you'd now be retrieving more than one document (possibly across multiple calls), you may find that this activity negates any RU savings from storing in a single document.
Note: There's nothing stopping you from storing these separate documents in the same collection (which then lets you approach the problem with a stored procedure, as you suggested in your #2 choice). You'll just need to create some type of property to help you identify different document types.
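A rough sketch of that split (document shapes only; the entityId and type property names are made up, and the SDK calls are omitted):

// Sketch only: hypothetical shapes for a hot/cold split within one collection.
// Both documents share a logical key (entityId) so they can be matched up client-side.
class ColdDocument
{
    public string id { get; set; }          // e.g. "1234-cold"
    public string entityId { get; set; }    // shared logical key, e.g. "1234"
    public string type { get; set; } = "cold";
    public int coldField1 { get; set; }
    // ... the remaining write-once fields
}

class HotDocument
{
    public string id { get; set; }          // e.g. "1234-hot"
    public string entityId { get; set; }    // shared logical key
    public string type { get; set; } = "hot";
    public int hotField1 { get; set; }
    public string hotField2 { get; set; }
}

Updating hotField1 then means re-writing only the small hot document; reads that need everything query on entityId and merge the two document types on the client.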
Document-based NoSQL databases replace the whole document once you change one or all of its properties.
In terms of cost, billing is on a per-collection basis.
So, if you have a DB with two collections in it, each with a performance tier of S1 (i.e., $25/month), you pay:
$25 x 2 = $50
In case you need better performance and change one of them to S2, you'll be charged:
$50 + $25 = $75

API - filter big list with word fragment

I have an ASP.NET Web API application. In the database I have a big list (between 100,000 and 200,000 entries) of pairs like id:name, and this list changes quite rarely. I need to implement filtering like this: /pair/filter?fragment=bla. It should return the first 25 pairs where any word in the name starts with the given word fragment. I see two approaches here.
The 1st approach is to load the data into a cache (HttpRuntimeCache, Redis or something like this) to cut loading time and filter with LINQ. But I think there will be problems with the time required for serializing/deserializing.
Another approach: for instance I have a pair 22:some title here, so I need to provide a separate table like this:
ID | FRAGMENT
22 | some
22 | title
22 | here
with a primary key on both columns and a separate index on the FRAGMENT column to make queries faster. Any suggestions and remarks are welcome.
UPD: I've now rethought this. I don't want to query the database because requests happen quite often. So now I see the best solution is to:
load the entire list into memory
build a trie structure which keeps a hashset of values in each node
in case of one text fragment, just return the hashset from the trie node; in case of several fragments, find all the hashsets and take their intersection
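Roughly, the trie I have in mind looks like this (a simplified sketch, not production code; each node keeps the ids of every pair whose word passes through it):

using System.Collections.Generic;

class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public HashSet<int> Ids = new HashSet<int>();
}

class FragmentIndex
{
    private readonly TrieNode _root = new TrieNode();

    // Called once per word per name at load time
    public void AddWord(string word, int id)
    {
        var node = _root;
        foreach (var ch in word.ToLowerInvariant())
        {
            if (!node.Children.TryGetValue(ch, out var next))
                node.Children[ch] = next = new TrieNode();
            node = next;
            node.Ids.Add(id);   // every prefix of the word now points at this id
        }
    }

    // Ids of pairs having at least one word that starts with the fragment
    public HashSet<int> Find(string fragment)
    {
        var node = _root;
        foreach (var ch in fragment.ToLowerInvariant())
        {
            if (!node.Children.TryGetValue(ch, out node))
                return new HashSet<int>();
        }
        return node.Ids;
    }
}

And at query time, something like:

var index = new FragmentIndex();
index.AddWord("some", 22);
index.AddWord("title", 22);
index.AddWord("here", 22);

var hits = new HashSet<int>(index.Find("tit"));   // copy, so the index itself isn't mutated
hits.IntersectWith(index.Find("so"));
// hits now holds the ids whose names match both fragments; take the first 25 for the response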
You could try a full-text index on your current DB (if it's supported) and the CONTAINS keyword, like so:
SELECT * FROM tableName WHERE CONTAINS(name, 'bla*');
This will look for words starting with "bla" anywhere in the string, so it would also match "Monkeys blabla".
I don't really understand your question, but if you want to query any table you can do so, since you already have the query string. You can try this out:
var res = _repository.Table.Where(c => c.Name.StartsWith("bla")).Take(25);
If it doesn't help, try to restructure your question a little bit.
Is this a case of premature optimization?
How many users will be hitting this service simultaneously? How many will be hitting your database simultaneously? How efficient is your query? How much data will be returned across the wire?
In most cases, you can't outsmart an efficient database for performance. Your row count is too small to create a truly heavy burden on your application's runtime performance when querying. This assumes, of course, that your query is well written and that you're properly opening, closing, and freeing resources in a timely fashion.
Caching the data in memory has its trade-offs that should be considered. It increases the memory footprint of your application, and requires you to write and maintain additional code to maintain that cache. That is by no means prohibitive, but should be considered in view of your overall architecture.
Consider these things carefully. From what I can tell, keeping this data in the database is fine. Deserialization tends to be fast (as most of the data you return is native types), and shouldn't be cost-prohibitive.

Having a table for fixed data, or an Enum?

I have a table that holds constant values. Is it better to keep this table in my database (which is SQL), or to have an Enum in my code and delete the table?
My table has only 2 columns and at most 20 rows; these rows are fixed and get filled once, the first time I run the application.
I would suggest creating an Enum for your case. Since the values are fixed (and I am assuming that the table is not going to change very often), you can use an Enum. Keeping the table in the database will require an unnecessary hit to the database and a database connection, which could be skipped if you use an Enum.
Also, a lot may depend on how many operations you are going to do with your values. For example, it's tedious to query Enum values the way you would get distinct values from your table, whereas with the table approach it would be a simple SELECT DISTINCT. So you may have to look into your needs and the operations you will perform on these values.
As far as the performance is concerned you can look at: Enum Fields VS Varchar VS Int + Joined table: What is Faster?
As you can see, ENUM and VARCHAR results are almost the same, but join query performance is 30% lower. Also note the times themselves – traversing about same amount of rows full table scan performs about 25 times better than accessing rows via index (for the case when data fits in memory!)
So, if you have an application and you need to have some table field with a small set of possible values, I'd still suggest you to use ENUM, but now we can see that performance hit may not be as large as you expect. Though again a lot depends on your data and queries.
That depends on your needs.
You may want to translate the Enum values (if you are showing them in the GUI) and order a set of records based on the translated values. For example: imagine you have an Employees table and a Position column. If the record set is big and you want to sort or order by the translated Position column, then you have to keep the enum values plus their translations in the database.
Otherwise, KISS and keep it in code. You will save the time spent asking the database for values.
It depends on the character of those constants.
If they are low-level system constants that should never change (like pi = 3.1415), then it is better to keep them only on the code side, in some config file. Also, if performance is a critical parameter and you use them very often (on almost every request), it is better to keep them in code.
If they are constants (maybe business constants) that can change in the future, it is OK to put them in a table - then you have more flexibility to change them (for instance from an admin panel).
It really depends on what you actually need.
With Enum
It is faster to access
It is bound to one particular application (although you can share it via an assembly reference, it just does not look as good as using a DB).
You can use it in a switch statement.
An Enum usually does not care about the value itself, and it is limited to integral types (int by default).
With DB
It is slower, because you have to make a connection and run a query.
The data can be shared widely.
You can set the value to be anything (any type, any value).
So, if you will use it only in one application, an Enum is good enough. But if several applications are going to use it, then the DB would be the better option.
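For what it's worth, the Enum version of such a small fixed lookup is very little code (the status names below are invented for the example):

// Sketch: an invented 3-value lookup kept in code instead of a table.
enum OrderStatus
{
    Pending = 1,
    Shipped = 2,
    Cancelled = 3
}

class Example
{
    static string Describe(OrderStatus status)
    {
        switch (status)
        {
            case OrderStatus.Pending:   return "Waiting for payment";
            case OrderStatus.Shipped:   return "On its way";
            case OrderStatus.Cancelled: return "Cancelled by customer";
            default: throw new System.ArgumentOutOfRangeException(nameof(status));
        }
    }

    static void Main()
    {
        System.Console.WriteLine(Describe(OrderStatus.Shipped));   // "On its way"
    }
}

The values live in the compiled assembly, so there is no database round trip, but changing them means recompiling and redeploying.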

Data structure for relationships

I am converting a VB6 application to C# and I want to make my data structure that holds values and relationships more efficient. In VB I have a collection of values and another collection of relationships between those values, with priorities for those relationships. I also have an algorithm that, when a set of values is passed to it, returns all the relationships required to join those values together. For example, say the values collection contains 1-10 and the relationship collection contains
1,2
3,2
5,2
2,8
8,10
9,10
If the input was 1,9,10 the returned relationships would be --
1,2
2,8
8,10
9,10
Since there may be multiple paths, the smallest number of relationships would be returned, but there is a caveat of relationship priorities: if a relationship has a higher priority, then that relationship would be added and the rest of the relationships would be added from there. I am thinking of using a disjoint-set data structure, but I am not sure.
Any ideas?
More information --
The number of values would normally be less than 100 and the relationships fewer than 500. The collections are static, and the algorithm will be used again and again to find paths. Also, I did not ask this before, but would an algorithm based on a disjoint-set data structure be the most efficient?
It sounds like what you have is a Graph. This is a structure with Nodes and Edges. There are many many libraries and tools that deal with Graphs. Microsoft even put out a paper on how to deal with them. I think graphs are great and extremely useful in many situations.
One big benefit with graphs is the ability to assign priorities to the edges between the nodes. Then when you want to find the path between two nodes, boom, the graph can choose the path with the ideal priority.
In your situation, your values are the nodes and your relationships are the edges.
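A minimal sketch of that idea (names are mine; a plain BFS gives the fewest-edges path, and honouring the relationship priorities would call for a weighted search such as Dijkstra instead):

using System.Collections.Generic;

// Values are nodes, relationships are undirected edges in an adjacency list.
class RelationshipGraph
{
    private readonly Dictionary<int, List<int>> _adj = new Dictionary<int, List<int>>();

    public void AddRelationship(int a, int b)
    {
        if (!_adj.ContainsKey(a)) _adj[a] = new List<int>();
        if (!_adj.ContainsKey(b)) _adj[b] = new List<int>();
        _adj[a].Add(b);
        _adj[b].Add(a);
    }

    // Fewest-edges path between two values via breadth-first search; null if disconnected.
    public List<int> ShortestPath(int from, int to)
    {
        if (!_adj.ContainsKey(from) || !_adj.ContainsKey(to)) return null;

        var previous = new Dictionary<int, int> { [from] = from };
        var queue = new Queue<int>();
        queue.Enqueue(from);

        while (queue.Count > 0)
        {
            var current = queue.Dequeue();
            if (current == to) break;

            foreach (var next in _adj[current])
            {
                if (previous.ContainsKey(next)) continue;
                previous[next] = current;
                queue.Enqueue(next);
            }
        }

        if (!previous.ContainsKey(to)) return null;

        var path = new List<int> { to };
        while (path[path.Count - 1] != from)
            path.Add(previous[path[path.Count - 1]]);
        path.Reverse();
        return path;
    }
}

Loaded with the pairs from the question, graph.ShortestPath(1, 10) comes back as 1, 2, 8, 10; connecting a whole input set such as 1, 9, 10 would mean merging such pairwise paths and de-duplicating the shared edges.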
You need to ask yourself (and tell us) what kind of pattern of use you expect. Do these relations get added in order or randomly, do your queries come in order (as you show them) or randomly, and is this essentially a batch process -- load them up, read off the queries -- or do you expect to do it "on line" in the sense that you may add some, then query some, then add some more and query some more?
Will you know how many you want to store beforehand, and how many do you expect to store? Dozens? Thousands? Tens of millions?
Here are some suggestions:
if you know beforehand how many you expect to store, it's not a really big number, you don't expect to add them after first loading up, there aren't any duplicates in the left-hand side of the pair, and they're reasonably "dense" in the sense that there aren't big gaps between numbers in the left-hand one of the pair, then you probably want an array. Insertion is O(1), access is O(1), but you can't have duplicate indices and expanding it after you build it is a pain.
if the number is really large, like > 10^8, you probably want some kind of database. Databases are relatively very slow -- 4 to 5 orders of magnitude slower than in-memory data structures -- but handle really big data.
If you have insertions after the first load, and you care about order, you're going to want some sort of tree, like a 2-3 tree. Insertion and access are both O(lg n). You'd probably find an implementation under a name like "ordered list" (I'm not a C# guy.)
Most any other case, you probably want a hash. Average insertion and access are both O(1), like an array; the worst case [which you won't hit with this data] is O(n).
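For completeness, the "hash" option in C# is just a Dictionary keyed by the left value of each pair (a sketch with invented names):

using System.Collections.Generic;

class PairStore
{
    private readonly Dictionary<int, List<int>> _byLeft = new Dictionary<int, List<int>>();

    public void Add(int left, int right)          // average O(1)
    {
        if (!_byLeft.TryGetValue(left, out var partners))
            _byLeft[left] = partners = new List<int>();
        partners.Add(right);
    }

    public List<int> PartnersOf(int left)         // average O(1)
    {
        return _byLeft.TryGetValue(left, out var partners) ? partners : new List<int>();
    }
}

Lookups stay O(1) on average however the pairs arrive, and the left-hand values don't need to be dense or unique.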
