DocumentDB: How to better structure data for updates - C#

So, for example, I have a collection of documents like this:
{
    hotField1 : 0,
    hotField2 : "",
    coldField1 : 0,
    ...
    coldFieldN : ""
}
In this scope, cold properties are written once and accessed occasionally; hot properties are written and then fairly often accessed/updated (but in different use cases - it is not the same sub-document or parts of the same object).
The number of documents is fairly large (1M and more), and the hot data is at least ten times smaller than the cold data.
Since partial update is still a much-requested but unimplemented feature, the only way to update hotField1 is to:
Request the full document
Change either hotField1 or hotField2
Write back the whole document
This is costly in terms of RUs, and doesn't scale well.
So the question is: how do I organize such data and calls in DocumentDB to minimize costs?
Discovered alternatives:
Obviously best: retrieve one property, change it, update it - not available yet.
Split into two collections, and use stored procedures to retrieve from the main collection and then from the dictionary?
Put hotField1-2 into a subdocument ({ sub: { hf1: 0, hf2: "" } }) and somehow update only that? (I'm not sure whether this is possible.)
P.S. C# is in the tags because of the client library we use. If it lacks something, it's OK to use the REST interface instead.

While there's no exact "best" answer:
Your #2 choice will not work with stored procedures, since stored procedures are scoped to a collection.
Updating a subdocument (#3 choice) is no different from updating top-level properties - you are still retrieving, and re-writing, a document (a subdocument is just another property on the document).
While it may or may not reduce RU (you'd need to benchmark, as Larry pointed out in comments), you may choose to store your hot properties in a separate (smaller) document (or multiple smaller documents). With fewer properties, there would be less bandwidth consumed during updates, and less index updating. However, since you'd now be retrieving more than one document (possibly across multiple calls), you may find that this activity negates any RU savings from storing everything in a single document.
Note: There's nothing stopping you from storing these separate documents in the same collection (which then lets you approach the problem with a stored procedure, as you suggested in your #2 choice). You'll just need to create some type of property to help you identify different document types.
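For illustration, here is a minimal C# sketch of the "separate small hot document" approach, assuming the Microsoft.Azure.Documents.Client SDK; the HotDoc shape, the type discriminator, and the id convention are assumptions of mine, not anything DocumentDB prescribes:

using System.Threading.Tasks;
using Microsoft.Azure.Documents.Client;

// Hypothetical hot-data document: only the frequently updated fields, plus a
// "type" discriminator and a pointer back to the large cold document.
public class HotDoc
{
    public string id { get; set; }              // e.g. "hot-" + coldDocumentId
    public string type { get; set; } = "hot";   // distinguishes hot docs from cold ones
    public string coldDocumentId { get; set; }
    public int hotField1 { get; set; }
    public string hotField2 { get; set; }
}

public static class HotDocWriter
{
    // Rewrites only the small hot document; the big cold document is never touched,
    // so each update sends far fewer bytes and triggers far less index maintenance.
    public static Task UpdateAsync(DocumentClient client, string dbId, string collId, HotDoc hot)
    {
        var uri = UriFactory.CreateDocumentUri(dbId, collId, hot.id);
        return client.ReplaceDocumentAsync(uri, hot);
        // If the collection is partitioned, pass RequestOptions with the partition key as well.
    }
}

Whether this actually saves RUs still needs benchmarking, as noted above; the sketch only shows the mechanics.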

Document-based NoSQL databases replace the entire document when you change one or all of its properties.
In terms of cost, billing is on a per-collection basis.
So, if you have a DB with two collections, each at performance tier S1 (i.e., $25/month):
$25 x 2 = $50
If you need better performance and change one to S2, you'll be charged:
$50 + $25 = $75

Related

API - filter big list with word fragment

I have an ASP.NET Web API application. In the database I have a big list (between 100,000 and 200,000 entries) of pairs like id:name, and this list changes quite rarely. I need to implement filtering like this: /pair/filter?fragment=bla. It should return the first 25 pairs where any word in the name starts with the given word fragment. I see two approaches here. The first approach is to load the data into a cache (HttpRuntime.Cache, Redis, or something like that) to improve loading time and filter with LINQ. But I think there will be problems with the time required for serializing/deserializing. The other approach: for instance, I have a pair 22:some title here, so I need to provide a separate table like this:
ID | FRAGMENT
22 | some
22 | title
22 | here
with a primary key on both columns and a separate index on the FRAGMENT column to make queries faster. Any suggestions and remarks are welcome.
UPD: I've now reconsidered. I don't want to query the database, because requests happen quite often. So now I see the best solution as follows (a sketch is shown after this list):
load the entire list into memory
build a trie structure which keeps a hashset of values in each node
for a single text fragment, just return the hashset from the trie node; for multiple fragments, find all the hashsets and take their intersection
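For illustration, a minimal C# sketch of that trie, where each node keeps a hashset of pair IDs; the class names, the lower-casing, and the single-space word split are my assumptions:

using System;
using System.Collections.Generic;
using System.Linq;

// Each node holds the IDs of all pairs whose name contains a word starting
// with the prefix spelled out on the path from the root to this node.
public class TrieNode
{
    public Dictionary<char, TrieNode> Children { get; } = new Dictionary<char, TrieNode>();
    public HashSet<int> Ids { get; } = new HashSet<int>();
}

public class PairTrie
{
    private readonly TrieNode _root = new TrieNode();

    public void Add(int id, string name)
    {
        foreach (var word in name.ToLowerInvariant().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            var node = _root;
            foreach (var ch in word)
            {
                if (!node.Children.TryGetValue(ch, out var next))
                    node.Children[ch] = next = new TrieNode();
                node = next;
                node.Ids.Add(id);   // register the id under every prefix of the word
            }
        }
    }

    public IEnumerable<int> Find(string fragment, int take = 25)
    {
        var node = _root;
        foreach (var ch in fragment.ToLowerInvariant())
            if (!node.Children.TryGetValue(ch, out node))
                return Enumerable.Empty<int>();   // no word starts with this fragment
        return node.Ids.Take(take);
    }
}

For several fragments you would call Find once per fragment and intersect the resulting sets (for example with HashSet<int>.IntersectWith) before taking the first 25.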
You could try a full-text index on your current DB (if it's supported) and the CONTAINS keyword, like so:
SELECT * FROM tableName WHERE CONTAINS(name, 'bla*');
This will look for words starting with "bla" in the entire string, and will also match the string "Monkeys blabla".
I don't really understand your question, but if you want to query any table you can do so, since you already have the query string. You can try this out:
var res = _repository.Table.Where(c => c.Name.StartsWith("bla")).Take(25);
If that doesn't help, try to restructure your question a little bit.
Is this a case of premature optimization?
How many users will be hitting this service simultaneously? How many will be hitting your database simultaneously? How efficient is your query? How much data will be returned across the wire?
In most cases, you can't outsmart an efficient database for performance. Your row count is too small to create a truly heavy burden on your application's runtime performance when querying. This assumes, of course, that your query is well written and that you're properly opening, closing, and freeing resources in a timely fashion.
Caching the data in memory has its trade-offs that should be considered. It increases the memory footprint of your application, and requires you to write and maintain additional code to maintain that cache. That is by no means prohibitive, but should be considered in view of your overall architecture.
Consider these things carefully. From what I can tell, keeping this data in the database is fine. Deserialization tends to be fast (as most of the data you return is native types), and shouldn't be cost-prohibitive.

NHibernate matching property as a substring

I'm trying to find the "best" way to match, for example, politicians' names in RSS articles. The names will be stored in a database accessed with NHibernate. As an example:
Id Name
--- ---------------
1 David Cameron
2 George Osborne
3 Alistair Darling
At the time of writing, the BBC politics news RSS feed has an item with the description
Backbench Conservative MPs put pressure on Chancellor George Osborne to stop rail firms in England increasing commuter fares by up to 11%.
For this article, I would like to detect that George Osborne is mentioned. I realise that there are several ways that this could be done, e.g. selecting all the politicians' names first, and comparing them in code, or doing the NHibernate equivalent of a LIKE.
The application itself would have a few dozen feeds, which will be queried at most every 15 minutes. Obviously there are speed, memory and scaling concerns, so I would like to ask for a recommended approach (and NHibernate query if relevant).
As we were discussing in the comments, I believe there is a simpler approach to this problem:
Keep a list of the politicians in memory. Because these entities won't be updated often, it's safe to work like this. Just implement expiration logic to refresh it from the database sooner or later.
For each downloaded feed entry, simply loop over the politicians and check FeedEntry.Content.Contains(Name) (or something like it) before saving the entry to the database - see the sketch below.
There you go, no complex query needed and less I/O for your solution.
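A minimal sketch of that in-memory check; Politician, FeedEntry, and MentionedPoliticianIds are illustrative names, not the actual NHibernate entities:

using System;
using System.Collections.Generic;

// Illustrative shapes; the real entities come from the NHibernate mappings.
public class Politician { public int Id { get; set; } public string Name { get; set; } }
public class FeedEntry
{
    public string Content { get; set; }
    public IList<int> MentionedPoliticianIds { get; } = new List<int>();
}

public static class MentionDetector
{
    // politicians: the cached in-memory list, refreshed from the database now and then.
    public static void Tag(FeedEntry entry, IEnumerable<Politician> politicians)
    {
        foreach (var p in politicians)
        {
            if (entry.Content.IndexOf(p.Name, StringComparison.OrdinalIgnoreCase) >= 0)
                entry.MentionedPoliticianIds.Add(p.Id);
        }
        // Save the entry (and its detected mentions) with NHibernate afterwards.
    }
}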
Along the same lines, I would use either a regular expression or a Contains check to get the politicians that match the feed. The politician names and ids can be a simple collection in memory.
Then the feed can be saved in memcached or Redis (even a DB would do) with a GUID. Then save the associated GUID in a table that holds politician_id, feed_guid.
For some statistics you can also have a table that is an aggregate of politician_id, num_articles_mentioned, where num_articles_mentioned is incremented by 1 for each mention.
You can wrap the above in a transaction if needed.

XML-Serialized Object in a Database Field. Is it good design?

Suppose I have one table that holds blogs.
The schema looks like :
ID (int) | Title (varchar 50) | Value (longtext) | Images (longtext) | ...
In the Images field I store an XML-serialized list of images that are associated with the blog.
Should I use another table for this purpose?
Yes, you should put the images in another table. Having several values in the same field indicates denormalized data and makes it hard to work with the database.
As with all rules, there are exceptions where it makes sense to put XML with multiple values into one field in the database. The first rule is that:
The data should always be read/written together; there should be no need to read or update just one of the values.
If that is fulfilled, there can be a number of reasons to put the data together in one field:
Storage efficiency, if space has proved to be a problem.
Retrieval efficiency, if performance has proved to be a problem.
Schema flexibility, where one XML field can eliminate tens or hundreds of different tables.
I would certainly use another table. If you use XML, what happens when you need to go through and update the references to all images? Would you rather just do an UPDATE blog_images SET ..., or parse the XML for each row, make the update, and then re-generate the updated XML for each row?
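To make that trade-off concrete, this is roughly what the "parse, update, re-serialize for each row" path looks like, assuming the Images column holds a List<string> of image URLs serialized with XmlSerializer (the column contents and method names are hypothetical):

using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

public static class BlogImagesXml
{
    private static readonly XmlSerializer Serializer = new XmlSerializer(typeof(List<string>));

    // Rewrites one image reference inside the serialized XML of a single row.
    public static string ReplaceImageUrl(string imagesXml, string oldUrl, string newUrl)
    {
        List<string> images;
        using (var reader = new StringReader(imagesXml))
            images = (List<string>)Serializer.Deserialize(reader);

        for (var i = 0; i < images.Count; i++)
            if (images[i] == oldUrl)
                images[i] = newUrl;

        using (var writer = new StringWriter())
        {
            Serializer.Serialize(writer, images);
            return writer.ToString();   // and this has to be done for every affected row
        }
    }
}

With a separate images table, the same change would be a single UPDATE statement.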
Well, it is a bit "inner platform", but it will work. A separate table would allow better image querying, although on some RDBMS platforms this could also be achieved via an XML-type column and SQL/XML.
If this data only has to be opaque storage, then maybe. However, keep in mind you'll generally have to bring back the entire XML to the app-tier to do anything interesting with it (or: depending on platform, use SQL/XML, but I advise against this, as the DB isn't the place to do such processing in most cases).
My advice in all other cases: separate table.
That depends on whether you'd need to query on the actual image data itself. If you see a possible need to query on certain images, or images with certain attributes, then it would probably be best to store that image data in a different way.
Otherwise, leave it the way it is.
But remember, only include the fields in your SELECT when you need them.
Should I use another table for this purpose?
Not necessarily. You just have to ensure that you are not selecting the Images field in your queries when you don't need it. But if you wanted to normalize your schema, you could use another table and perform a join when you need the images.

How to efficiently store and read back a Hierarchy from cache

My situation is that I'm currently storing a hierarchy in a SQL database that's quickly approaching 15,000 nodes (5,000 edges). This hierarchy defines my security model based on a user's position in the tree, granting access to items below it. So when a user requests a list of all secured items, I use a CTE to recurse through it in the DB (and flatten all items), which is starting to show its age (slow).
The hierarchy does not change often, so I've attempted to move it into RAM (Redis), keeping in mind that I have many subsystems that need this for security calls, and UIs that build the tree for CRUD operations.
First Attempt
My first attempt was to store the relationships as key-value pairs
(this is how it's stored in the database):
        E
      /   \
     F     G
    / \   / \
   H   I J   K
mapped to:
E - [F, G]
F - [H, I]
G - [J, K]
So when I want E and all its descendants, I recursively get its children and their children using the keys, which allows me to start at any node and move down. This solution gave a good speed increase, but with 15,000 nodes it took approximately 5,000 cache hits to rebuild my tree in code (worst-case scenario: starting at E; performance depends on the starting node's location, so super users see the worst performance). This was still pretty fast but seemed too chatty. I like the fact that I can remove a node at any time by popping it out of the key's list without rebuilding my entire cache. It was also lightning fast for building a tree on demand visually in a UI.
Second Attempt
My other idea was to take the hierarchy from the database, build the tree, and store that in RAM (Redis), then pull the entire thing out of memory (it was approximately 2 MB in size, serialized). This gave me a single call (not as chatty) into Redis to pull the entire tree out, locate the user's parent node, and descend to get all child items. These calls are frequent, and passing down 2 MB at the network layer seemed large. This also means I cannot easily add/remove an item without pulling down the tree, editing it, and pushing it all back. Also, building trees on demand via HTTP meant each request had to pull down 2 MB just to get direct children (a very small payload with the first solution).
So which solution do you think is the better approach (long term, as it continues to grow)? Both are definitely faster and take some load off the database. Or is there a better way to accomplish this that I have not thought about?
Thanks
Let me offer an idea...
Use hierarchical versioning. When a node in the graph is modified, increment its version (a simple int field in the database), but also increment versions of all of its ancestors.
When getting a sub-tree from the database for the first time, cache it to RAM. (You can probably optimize this with a recursive CTE and do it in a single database round-trip.)
However, the next time you need to retrieve the same sub-tree, retrieve only the root. Then compare the cached version with the version you just fetched from the database.
If they match, great, you can stop fetching and just reuse the cache.
If they don't, fetch the children and repeat the process, refreshing the cache as you go.
The net result is that more often than not, you'll cull the fetching very early, usually after only one node, and you won't even need to cache the whole graph. Modifications are expensive, but this shouldn't be a problem since they are rare.
BTW, a similar principle would work in the opposite direction - i.e. when you start with a leaf and need to find the path to the root. You'd need to update the versioning hierarchy in the opposite direction, but the rest should work in a very similar manner. You could even have both directions in combination.
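A rough C# sketch of the version check, assuming a Node shape with an int Version, a simple in-memory cache keyed by node id, and a store abstraction for the two kinds of fetch (all names here are illustrative):

using System.Collections.Generic;

public class Node
{
    public int Id { get; set; }
    public int Version { get; set; }
    public List<Node> Children { get; } = new List<Node>();
}

public interface INodeStore
{
    Node FetchNodeOnly(int id);             // cheap: just the node and its version
    IEnumerable<Node> FetchChildren(int id);
}

public class VersionedTreeCache
{
    private readonly Dictionary<int, Node> _cache = new Dictionary<int, Node>();
    private readonly INodeStore _store;

    public VersionedTreeCache(INodeStore store) { _store = store; }

    public Node GetSubTree(int id) => Refresh(_store.FetchNodeOnly(id));

    private Node Refresh(Node fresh)
    {
        // Ancestor versions are bumped on any descendant change, so a matching
        // version means the whole cached sub-tree is still valid.
        if (_cache.TryGetValue(fresh.Id, out var cached) && cached.Version == fresh.Version)
            return cached;

        // Version mismatch: refresh this level and recurse into the children.
        foreach (var child in _store.FetchChildren(fresh.Id))
            fresh.Children.Add(Refresh(child));

        _cache[fresh.Id] = fresh;
        return fresh;
    }
}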
--- EDIT ---
If your database and ADO.NET driver support it, it might be worth looking into server notifications, such as MS SQL Server's SqlDependency or OracleDependency.
Essentially, you instruct the DBMS to monitor changes and notify you when they happen. This is ideal for keeping your client-side cache up-to-date in an efficient way.
If the hierarchy does not change often, you can precalculate the whole list of items below each node (instead of just direct children).
This way you will need significantly more RAM, but it will work lightning fast for any user, because you will be able to read the whole list of descendant nodes in a single read.
For your example (I'll use JSON format):
E - {"direct" : [F, G], "all" : [F, G, H, I, J, K]}
F - {"direct" : [H, I], "all" : [H, I]}
G - {"direct" : [J, K], "all" : [J, K]}
Well, for super users you will still need to transfer a lot of data per request, but I don't see any way to make it smaller.
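A minimal sketch of reading such a precomputed list in C# with StackExchange.Redis and Json.NET; the key naming and the NodeLists shape are assumptions that simply mirror the JSON above:

using System.Collections.Generic;
using Newtonsoft.Json;
using StackExchange.Redis;

// Value stored per node, matching the JSON shown above.
public class NodeLists
{
    [JsonProperty("direct")] public List<string> Direct { get; set; }
    [JsonProperty("all")]    public List<string> All { get; set; }
}

public static class HierarchyCache
{
    private static readonly ConnectionMultiplexer Redis = ConnectionMultiplexer.Connect("localhost");

    // Single round-trip: every descendant of the node comes back in one read.
    public static List<string> GetAllDescendants(string nodeKey)
    {
        IDatabase db = Redis.GetDatabase();
        RedisValue json = db.StringGet(nodeKey);   // e.g. key "E"
        if (json.IsNullOrEmpty)
            return new List<string>();
        return JsonConvert.DeserializeObject<NodeLists>((string)json).All;
    }
}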
We do something like this. We read the tree into memory, store it in the application cache, and access it from memory. Since ours changes almost never, and changes don't have to be immediately reflected in the web app, we don't even bother to detect them; we just let the cache age and get refreshed. It works really well for us.

Advanced: How to optimize my complex O(n²) algorithm

I have people and places data as:
Person entity has
IList<DateRangePlaces> each having
IList<Place> of possible places
a Schedule day pattern, e.g. 10 days available, 4 unavailable
Within a particular DateRangePlaces date range, the Schedule pattern governs whether the person can go to a particular place or not.
Place entity has
IList<DateRangeTiming> each defining opening/closing times within each date range
Overlapping date ranges work as LIFO, so for each day that has already been defined previously, the newer timing definition takes precedence.
The problem
Now I need to do something like this (in pseudo code):
for each Place
{
    for each Day between minimum and maximum date in IList<DateRangeTiming>
    {
        get a set of People applicable for Place and on Day
    }
}
This means that the number of steps to execute my task is approximately:
∑(places) ( ∑(days) × ∑(people) )
To my understanding this is
O(x × y × z)
and likely approximates to this algorithm complexity:
O(n³)
I'm not an expert in theory, so feel free to correct my assumptions. What is true is that this kind of complexity is definitely not acceptable, especially given that I will be operating over long date ranges with many places and people.
From the formula approximation we can see that the people set would be iterated many times. Hence I would like to optimize at least this part. To ease things a bit I changed
Person.IList<DateRangePlaces>.IList<Place>
to
Person.IList<DateRangePlaces>.IDictionary<int, Place>
which gives me a faster answer to whether a person can go to some place on a particular date, because I only check whether the Place.Id key is present in the dictionary, versus an IList.Where() LINQ clause that would have to scan the whole list each and every time.
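For illustration, the difference looks roughly like this (Place and DateRangePlaces mirror the types above; the property names are my assumptions):

using System.Collections.Generic;
using System.Linq;

public class Place { public int Id { get; set; } }

public class DateRangePlaces
{
    public IList<Place> PlacesList { get; set; }               // original: linear scan
    public IDictionary<int, Place> PlacesById { get; set; }    // changed: hash lookup
}

public static class PlaceLookup
{
    // O(n): LINQ Where() has to walk the whole list every time.
    public static bool CanGoList(DateRangePlaces range, int placeId) =>
        range.PlacesList.Where(p => p.Id == placeId).Any();

    // O(1) on average: a single dictionary key check.
    public static bool CanGoDictionary(DateRangePlaces range, int placeId) =>
        range.PlacesById.ContainsKey(placeId);
}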
Question
Can you suggest any additional optimizations I could implement into my algorithm to make it faster or even make it less complex in terms of the big O notation?
Which memory structure types would you use where and why (lists, dictionaries, stacks, queues...) to improve performance?
Addendum: The whole problem is even more complex
There are also additional complexities that I didn't mention, since I wanted to simplify my question to make it clearer. There's also:
Place.IList<Permission>
Person.IList<DateRangePermission>
So places require particular permissions, and people have limited-time permission grants that expire.
In addition to that, there's also
Person.IList<DateRangeTimingRestriction>
which specifies the only times that a person can go somewhere during a particular date range. And
Person.IList<DateRangePlacePriorities>
Which defines place prioritization for a particular date range.
And during this process of getting applicable people, I also have to calculate a certain factor for every person and every place, related to the:
number of places that a person can visit on particular day
person's place priority factor on that particular day
All these are the reasons why I decided to manipulate this data in memory rather than use a very complex stored procedure that would also be doing multiple table scans to get factors per person, place, and day.
I think such a stored procedure would be way too complex to handle and maintain. So I would rather get all the data first (put it into appropriate memory structures to aid performance) and then work with it in memory.
I suggest using a relational database and writing a stored procedure to retrieve the "set of People applicable for Place and on Day".
The stored procedure approach would not be complex nor difficult to maintain if the model is architected properly. Additionally, relational databases have primary keys and indexing to avoid table scans.
The only ways to speed things up using collections would be to:
change the collection type. You could use a KeyedCollection, IDictionary<> or even a disconnected recordset. Disconnected recordsets also give you the ability to set foreign keys to child recordsets, however I think this would be a fairly complex pattern to use.
maintain a collection within a collection - basically the same concept as a parent / child relationship with a foreign key. The object references will only be pointers to the original object's memory space or, if you're using a keyed collection you could simply store the index of the other collection.
maintain boolean properties that allow you to skip iterations. For example, as you build your entities, set a boolean such as "HasPlaceXPermission". If the value is false, you know not to retrieve information related to place X.
maintain flags - flags can be a very good optimization technique when used properly. Similar to #3, flags can be used to determine permissions very quickly, for example if ((person.PlacePermissions & (Place.Colorado | Place.Florida)) > 0) do the date/time scan on Colorado and Florida, else don't (see the sketch below).
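A small sketch of that flags idea; PlaceFlags and the listed places are only an illustration:

using System;

[Flags]
public enum PlaceFlags
{
    None     = 0,
    Colorado = 1 << 0,
    Florida  = 1 << 1,
    Texas    = 1 << 2
}

public class Person
{
    public PlaceFlags PlacePermissions { get; set; }
}

public static class PermissionCheck
{
    // True when the person has permission for at least one of the given places.
    public static bool HasAny(Person person, PlaceFlags places) =>
        (person.PlacePermissions & places) != PlaceFlags.None;
}

// Usage: only run the expensive date/time scan when the cheap bit test passes.
// if (PermissionCheck.HasAny(person, PlaceFlags.Colorado | PlaceFlags.Florida)) { /* scan Colorado and Florida */ }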
It's difficult to know which collection types I would use based upon the information you have provided, I would need a larger scope of the application to determine that architecturally. For example, where is the data stored, how is it retrieved, how is it prepared and how is it presented? Knowing how the application is architected would help to determine its optimization points.
You can't avoid O(n²), as the minimum iteration you need is to visit every Place and every Date element to find a match for a given Person.
I think the best way is to use a DB such as SQL Server and run your query in SQL as a stored procedure.
The date range is presumably fairly limited, perhaps never more than a few years. Call it constant. When you say that, for each of those combinations, you need to "get a set of people applicable", then it's pretty clear: if you really do need to get all that data, then you can't improve the complexity of your solution, because you need to return a result for each combination.
Don't worry about complexity unless you're having trouble scaling with large numbers of people. Ordinary profiling is the place to start if you're having performance problems. O(#locations * #people) is not so bad.
