My situation is that I'm currently storing a hierarchy in a SQL database that's quickly approaching 15000 nodes (5000 edges). This hierarchy defines my security model based on a user's position in the tree, granting access to the items below it. So when a user requests a list of all secured items, I'm using a CTE to recurse it in the db (and flatten all items), which is starting to show its age (it's slow).
The hierarchy does not change often, so I've attempted to move it into RAM (Redis). Keep in mind I have many subsystems that need this for security calls, and UIs that build the tree for CRUD operations.
First Attempt
My first attempt is to store the relationships as key-value pairs
(this is how it's stored in the database):
       E
      / \
     F   G
    / \ / \
   H  I J  K
mapped to:
E - [F, G]
F - [H, I]
G - [J, K]
So when I want E and all its descendants, I recursively get its children and their children using the keys, which lets me start at any node and move down. This solution gave a good speed increase, but with 15,000 nodes it took approximately 5,000 cache hits to rebuild my tree in code (worst-case scenario: starting at E; performance depends on the starting node's location, so super users see the worst performance). This was still pretty fast but seemed too chatty. I like the fact that I can remove a node at any time by popping it out of its key's list without rebuilding my entire cache. It was also lightning fast to build a tree on demand visually in a UI.
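For reference, a minimal sketch of this first attempt with StackExchange.Redis, assuming each node's children live in a Redis list keyed by the node's name (the one-round-trip-per-node recursion is exactly where the chattiness comes from):

using System.Collections.Generic;
using StackExchange.Redis;

class HierarchyCache
{
    private readonly IDatabase _db;

    public HierarchyCache(ConnectionMultiplexer redis) => _db = redis.GetDatabase();

    // One Redis round trip per node: ~5,000 calls when starting at the root.
    public List<string> GetDescendants(string node)
    {
        var result = new List<string>();
        foreach (var child in _db.ListRange(node))
        {
            string name = child; // RedisValue converts implicitly to string
            result.Add(name);
            result.AddRange(GetDescendants(name));
        }
        return result;
    }

    // Removing a node stays cheap: pop it out of its parent's list.
    public void RemoveChild(string parent, string child) =>
        _db.ListRemove(parent, child);
}

Batching or Lua scripting can hide some of the per-call latency, but the round-trip count still grows with the size of the subtree.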
Second Attempt
My other idea is to take the hierarchy from the database, build the tree, and store that in RAM (Redis), then pull the entire thing out of memory (it was approximately 2 MB serialized). This gave me a single call (not as chatty) into Redis to pull the entire tree out, locate the user's parent node, and descend to get all child items. These calls are frequent, and passing 2 MB down at the network layer seemed large. It also means I cannot easily add or remove an item without pulling down the whole tree, editing it, and pushing it all back. And on-demand tree building via HTTP meant each request had to pull down 2 MB just to get the direct children (very small using the first solution).
So which solution do you think is the better approach (long term, as it continues to grow)? Both are definitely faster and take some load off the database. Or is there a better way to accomplish this that I have not thought of?
Thanks
Let me offer an idea...
Use hierarchical versioning. When a node in the graph is modified, increment its version (a simple int field in the database), but also increment versions of all of its ancestors.
When getting a sub-tree from the database for the first time, cache it to RAM. (You can probably optimize this through recursive CTE and do it in a single database round-trip.)
However, the next time you need to retrieve the same sub-tree, retrieve only the root. Then compare the cached version with the version you just fetched from the database.
If they match, great, you can stop fetching and just reuse the cache.
If they don't, fetch the children and repeat the process, refreshing the cache as you go.
The net result is that more often than not, you'll cull the fetching very early, usually after only one node, and you won't even need to cache the whole graph. Modifications are expensive, but this shouldn't be a problem since they are rare.
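A minimal sketch of that check in C#, assuming a hypothetical FetchNodeFromDb helper that returns a single row (id, version, child ids) per round trip:

using System;
using System.Collections.Generic;

class Node
{
    public int Id;
    public int Version;
    public List<int> ChildIds = new List<int>();
    public List<Node> Children = new List<Node>();
}

class VersionedTreeCache
{
    private readonly Dictionary<int, Node> _cache = new Dictionary<int, Node>();

    public Node GetSubtree(int id)
    {
        // Cheap query: just this node's row, to read its current version.
        Node dbNode = FetchNodeFromDb(id);

        // Versions match => every descendant is unchanged too, because
        // modifications bump the versions of all ancestors.
        if (_cache.TryGetValue(id, out Node cached) && cached.Version == dbNode.Version)
            return cached;

        // Mismatch or cache miss: descend, repeating the check per child.
        foreach (int childId in dbNode.ChildIds)
            dbNode.Children.Add(GetSubtree(childId));

        _cache[id] = dbNode;
        return dbNode;
    }

    // Hypothetical data-access helper; implementation depends on your schema.
    private Node FetchNodeFromDb(int id) => throw new NotImplementedException();
}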
BTW, a similar principle would work in the opposite direction - i.e. when you start with a leaf and need to find the path to the root. You'd need to update the versioning hierarchy in the opposite direction, but the rest should work in a very similar manner. You could even have both directions in combination.
--- EDIT ---
If your database and ADO.NET driver support it, it might be worth looking into server notifications, such as MS SQL Server's SqlDependency or OracleDependency.
Essentially, you instruct the DBMS to monitor changes and notify you when they happen. This is ideal for keeping your client-side cache up-to-date in an efficient way.
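If you go that route, a minimal SqlDependency sketch might look like this (table and column names are placeholders; note the feature requires Service Broker to be enabled and restricts the query, e.g. two-part table names and explicit column lists):

using System.Data.SqlClient;

class HierarchyWatcher
{
    private readonly string _connectionString;

    public HierarchyWatcher(string connectionString)
    {
        _connectionString = connectionString;
        SqlDependency.Start(_connectionString); // once per application start
    }

    public void Subscribe()
    {
        using (var conn = new SqlConnection(_connectionString))
        using (var cmd = new SqlCommand(
            "SELECT Id, ParentId FROM dbo.Hierarchy", conn)) // two-part name, no SELECT *
        {
            var dependency = new SqlDependency(cmd);
            dependency.OnChange += (sender, e) =>
            {
                InvalidateCache(); // e.g. flush or rebuild the cached tree
                Subscribe();       // notifications fire once; re-subscribe
            };
            conn.Open();
            using (var reader = cmd.ExecuteReader()) { } // query must run to register
        }
    }

    private void InvalidateCache() { /* application-specific */ }
}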
If the hierarchy does not change often, you can precompute the whole list of items below each node (instead of just the direct children).
This way you will need significantly more RAM, but it will work lightning fast for any user, because you can read the whole list of descendant nodes in a single read.
For your example (I'll use JSON format):
E - {"direct" : [F, G], "all" : [F, G, H, I, J, K]}
F - {"direct" : [H, I], "all" : [H, I]}
G - {"direct" : [J, K], "all" : [J, K]}
Well, for super users you will still need to transfer a lot of data per request, but I don't see any way to reduce that.
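A sketch of how the "all" lists could be precomputed from the adjacency map with one memoized depth-first pass (helper names are illustrative):

using System.Collections.Generic;
using System.Linq;

static class DescendantIndex
{
    // children: node -> direct children, e.g. "E" -> ["F", "G"].
    // Assumes a tree (no cycles); leaf-only nodes simply get empty lists.
    public static Dictionary<string, List<string>> BuildAllLists(
        Dictionary<string, List<string>> children)
    {
        var all = new Dictionary<string, List<string>>();

        List<string> Collect(string node)
        {
            if (all.TryGetValue(node, out var cached)) return cached;
            var result = new List<string>();
            foreach (var child in children.TryGetValue(node, out var kids)
                                  ? kids : Enumerable.Empty<string>())
            {
                result.Add(child);
                result.AddRange(Collect(child)); // child's descendants, memoized
            }
            return all[node] = result;
        }

        foreach (var node in children.Keys) Collect(node);
        return all;
    }
}

Each node's entry can then be serialized into Redis under its key, so even a super user's request is a single read, just a large one.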
We do something like this. We read the tree into memory, store it in the application cache, and access it from memory. Since our data almost never changes, and changes don't have to be immediately reflected in the web app, we don't even bother to detect them; we just let the cache age out and get refreshed. It works really well for us.
Related
So, for example, I have a Collection of Documents like this:
{
    hotField1 : 0,
    hotField2 : "",
    coldField1 : 0,
    ...
    coldFieldN : ""
}
In this scope, cold properties are written once and accessed occasionally; hot properties are written and then fairly often accessed/updated (but in different use cases; it is not the same sub-document or parts of the same object).
The number of documents is fairly large (1M and more), and the hot data is at least ten times smaller than the cold data.
Since partial update is still a most-wanted yet unimplemented feature, the only way to update hotField1 is:
Request full document
Change either hotField1 or hotField2
Write back whole document
This is costly in terms of RUs and doesn't scale well.
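That cycle looks roughly like this with the DocumentDB .NET SDK (Microsoft.Azure.Documents.Client); the database and collection names are placeholders:

using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

static class HotFieldUpdater
{
    public static async Task UpdateHotField1Async(DocumentClient client, string id, int newValue)
    {
        var uri = UriFactory.CreateDocumentUri("myDb", "myColl", id);

        // 1. Request the full document (RU cost grows with its size).
        Document doc = await client.ReadDocumentAsync(uri);

        // 2. Change the one hot property.
        doc.SetPropertyValue("hotField1", newValue);

        // 3. Write the whole document back (again, full-document cost).
        await client.ReplaceDocumentAsync(uri, doc);
    }
}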
So the question is: how do I organize such data and calls in DocumentDB to minimize costs?
Discovered alternatives:
1. Obviously best: retrieve one property, change it, update it. Not available yet.
2. Split into two Collections, and use stored procedures to retrieve from the main Collection and then from the dictionary?
3. Put hotFields 1-2 in a subdocument ({ sub: {hf1:0, hf2:""}}) and somehow update only that? (I'm not sure if it is possible.)
PS. C# is in the tags for the client library we use. If it lacks something, it's OK to use the REST interface instead.
While there's no exact "best" answer:
Your #2 choice will not work with stored procedures, since stored procedures are scoped to a collection.
Updating a subdocument (#3 choice) is no different than updating top-level properties - you are still retrieving, and re-writing, a document (a subdocument is just another property on the document).
While it may or may not reduce RU (you'd need to benchmark, as Larry pointed out in comments), you may choose to store your hot properties in a separate (smaller) document (or multiple smaller documents). With less properties, there would be less bandwidth consumed during updates, and less index updating. However, since you'd now be retrieving more than one document (possibly across multiple calls), you may find that this activity negates any RU savings from storing in a single document.
Note: There's nothing stopping you from storing these separate documents in the same collection (which then lets you approach the problem with a stored procedure, as you suggested in your #2 choice). You'll just need to create some type of property to help you identify different document types.
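A sketch of what that split might look like, with a docType discriminator so both document kinds can share one collection (all class and property names here are illustrative):

using Newtonsoft.Json;

// Rarely written, large.
class ColdDocument
{
    [JsonProperty("id")] public string Id { get; set; }
    [JsonProperty("docType")] public string DocType { get; set; } = "cold";
    [JsonProperty("coldField1")] public int ColdField1 { get; set; }
    // ... coldFieldN
}

// Frequently rewritten, small: cheaper to replace, less index churn.
class HotDocument
{
    [JsonProperty("id")] public string Id { get; set; }           // e.g. "<coldId>-hot"
    [JsonProperty("docType")] public string DocType { get; set; } = "hot";
    [JsonProperty("coldId")] public string ColdId { get; set; }   // back-reference
    [JsonProperty("hotField1")] public int HotField1 { get; set; }
    [JsonProperty("hotField2")] public string HotField2 { get; set; }
}

Queries can then filter on docType, and a stored procedure scoped to the collection can fetch both halves in one round trip, per the #2 idea.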
Document-based NoSQL databases replace the entire document when you change one or all of its properties.
In terms of cost, billing is on a per-collection basis.
So, if you have a DB with two collections in it, each at performance tier S1 (i.e., $25/month):
$25 x 2 = $50
In case you need better performance and move one of them to S2, you'll be charged:
$50 + $25 = $75
I have an ASP.NET Web API application. In the database I have a big list (between 100,000 and 200,000 entries) of pairs like id:name, and this list changes quite rarely. I need to implement filtering like /pair/filter?fragment=bla. It should return the first 25 pairs where any word in the name starts with the word fragment. I see two approaches here. The first approach is to load the data into a cache (HttpRuntimeCache, Redis, or something like this) to cut loading time and filter with LINQ, but I think there will be problems with the time required for serializing/deserializing. The other approach: for instance, for a pair 22:some title here, I provide a separate table like this:
ID | FRAGMENT
22 | some
22 | title
22 | here
with a primary key on both columns and a separate index on the FRAGMENT column to make queries faster. Any suggestions and remarks are welcome.
UPD: I've now reconsidered. I don't want to query the database, because requests happen quite often. So now I think the best solution is to:
load the entire list into memory
build a trie structure that keeps a hashset of values in each node
for a single text fragment, just return the hashset from the trie node; for several fragments, find all the hashsets and take their intersection (see the sketch below)
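A minimal sketch of that trie, where each node along a word's path accumulates the ids of all pairs containing a word with that prefix (class and method names are my own):

using System;
using System.Collections.Generic;
using System.Linq;

class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public HashSet<int> Ids = new HashSet<int>();
}

class PrefixIndex
{
    private readonly TrieNode _root = new TrieNode();

    // Index every word of the name; each prefix node learns the pair's id.
    public void Add(int id, string name)
    {
        foreach (var word in name.ToLowerInvariant()
                                 .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            var node = _root;
            foreach (var ch in word)
            {
                if (!node.Children.TryGetValue(ch, out var next))
                    node.Children[ch] = next = new TrieNode();
                node = next;
                node.Ids.Add(id);
            }
        }
    }

    // One fragment: walk down and return the precomputed set.
    public HashSet<int> Find(string fragment)
    {
        var node = _root;
        foreach (var ch in fragment.ToLowerInvariant())
            if (!node.Children.TryGetValue(ch, out node))
                return new HashSet<int>();
        return node.Ids;
    }

    // Several fragments: intersect the per-fragment sets, then take 25.
    public IEnumerable<int> Filter(IEnumerable<string> fragments) =>
        fragments.Select(Find)
                 .Aggregate((a, b) => { var r = new HashSet<int>(a); r.IntersectWith(b); return r; })
                 .Take(25);
}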
You could try a full-text index on your current DB (if it's supported) and the CONTAINS keyword, like so:
SELECT * FROM tableName WHERE CONTAINS(name, 'bla*');
This will look for words starting with "bla" anywhere in the string, so it will also match "Monkeys blabla".
I don't really understand your question, but if you want to query the table you can do so, since you already have the query string. You can try this out:
var res = _repository.Table.Where(c => c.Name.StartsWith("bla")).Take(25);
If it doesn't help, try to restructure your question a little bit.
Is this a case of premature optimization?
How many users will be hitting this service simultaneously? How many will be hitting your database simultaneously? How efficient is your query? How much data will be returned across the wire?
In most cases, you can't outsmart an efficient database for performance. Your row count is too small to create a truly heavy burden on your application's runtime performance when querying. This assumes, of course, that your query is well written and that you're properly opening, closing, and freeing resources in a timely fashion.
Caching the data in memory has its trade-offs that should be considered. It increases the memory footprint of your application, and requires you to write and maintain additional code to maintain that cache. That is by no means prohibitive, but should be considered in view of your overall architecture.
Consider these things carefully. From what I can tell, keeping this data in the database is fine. Deserialization tends to be fast (as most of the data you return is native types), and shouldn't be cost-prohibitive.
I'm trying to find the "best" way to match, for example, politicians' names in RSS articles. The names will be stored in a database accessed with NHibernate. As an example:
Id Name
--- ---------------
1 David Cameron
2 George Osborne
3 Alistair Darling
At the time of writing, the BBC politics news RSS feed has an item with the description
Backbench Conservative MPs put pressure on Chancellor George Osborne to stop rail firms in England increasing commuter fares by up to 11%.
For this article, I would like to detect that George Osborne is mentioned. I realise that there are several ways that this could be done, e.g. selecting all the politicians' names first, and comparing them in code, or doing the NHibernate equivalent of a LIKE.
The application itself would have a few dozen feeds, which will be queried at most every 15 minutes. Obviously there are speed, memory and scaling concerns, so I would like to ask for a recommended approach (and NHibernate query if relevant).
As we were discussing on the comments, I believe that there is a simpler approach to this problem:
Keep a list of the politicians in memory. Because these entities won't be updated often, it's safe to work like this. Just implement expiration logic to refresh it from the database sooner or later.
For each downloaded feed entry, simply check FeedEntry.Content.Contains(Name) for each Name in Politicians (or something like it) before saving the entry to the database; see the sketch below.
There you go, no complex query needed and less I/O for your solution.
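That loop might look like this, with the cached list held in memory (the case-insensitive comparison is my addition, since feed text rarely matches casing exactly):

using System;
using System.Collections.Generic;
using System.Linq;

class Politician
{
    public int Id { get; set; }
    public string Name { get; set; }
}

static class FeedMatcher
{
    // Returns every cached politician whose name occurs in the entry text.
    public static IEnumerable<Politician> MentionedIn(
        string content, IEnumerable<Politician> politicians) =>
        politicians.Where(p =>
            content.IndexOf(p.Name, StringComparison.OrdinalIgnoreCase) >= 0);
}

For the BBC example, MentionedIn(description, cachedPoliticians) would yield George Osborne.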
Along those lines, I would use either a regex or a Contains call to get the politicians that match the feed. The politician names and ids can be a simple collection in memory.
Then the feed can be saved in memcached or Redis (even a DB would do) under a GUID, with the association saved in a table that holds politician_id, feed_guid.
For statistics you can also have a table that is an aggregate of politician_id, num_articles_mentioned, where num_articles_mentioned is incremented by 1.
You can wrap the above in a transaction if needed.
I read this paper.
But I'd love to avoid a ton of research to solve this problem if someone has already done it. I need this space-efficient tree for a reasonably (conceptually) simple GUI control: a TreeDataGridView with virtual mode.
An example tree might look like:
(RowIndexHierarchy),ColumnIndex
(0),0
(0,0),0
(0,0,0),0
(0,0,0,0),0
(0,0,0,0,0),0
(0,0,0,1),0
(0,0,0,2),0
(0,0,1),0
(0,0,2),0
(0,0,2,0),0
(0,0,2,1),0
(0,0,2,2),0
(0,1),0
(0,2),0
(0,2,0),0
(0,2,0,0),0
(0,2,0,1),0
(0,2,0,2),0
(0,2,1),0
(0,2,2),0
(0,2,2,0),0
(0,2,2,1),0
(0,2,2,2),0
(1),0
I need operations like "find flat row index from row hierarchy" and "find row hierarchy from flat row index". Also, to support expand/collapse, I need "find next node with the same or less depth".
For a read-only tree you can store a sorted array of nodes, ordered by parent index.
0 a
1 (a/)b
1 (a/)c
2 (a/b/)d
2 (a/b/)e
2 (a/b/)f
3 (a/c/)c
Each time you need to find child nodes, you can use binary search to find the upper and lower boundaries of the node range.
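A sketch of that lookup: with the array sorted by ParentIndex, two lower-bound searches bracket the children of a given node (names are illustrative):

struct FlatNode
{
    public int ParentIndex; // the array is sorted on this field
    public string Name;
}

static class FlatTree
{
    // Returns the half-open [First, Last) range of children of parentIndex.
    public static (int First, int Last) ChildRange(FlatNode[] nodes, int parentIndex)
    {
        int first = LowerBound(nodes, parentIndex);
        int last = LowerBound(nodes, parentIndex + 1);
        return (first, last);
    }

    // First position whose ParentIndex >= value (classic lower bound).
    private static int LowerBound(FlatNode[] nodes, int value)
    {
        int lo = 0, hi = nodes.Length;
        while (lo < hi)
        {
            int mid = (lo + hi) / 2;
            if (nodes[mid].ParentIndex < value) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }
}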
I'm not sure I'm following your needs exactly, but we access a database that has a tree-view UI. The tree runs from:
Top Level (the user's company);
Direct Client Company;
Office Location;
Employee;
Indirect Client Company;
Proposal;
Specific Vendor Bid;
Detail Financials (invoices, adjustments, etc)
We have thousands of direct clients and the tree branches pretty heavily at each tier, so we don't load the entire data-set at any time. Instead we only load Type, Guid, DisplayName, and some administrative data for each child, and load a "details" pane for the currently focused item. Unexplored paths through the tree simply don't exist in memory. To avoid loading full lists of names, we have a "rolodex" level that divides the dataset into batches of 100 records or fewer. (So "Z" stands alone, but "Sa-St" subdivides S.) These auto-insert when a subset grows beyond the 100-record threshold.
Periodically (when the system is idle) we check the loaded count and if it exceeds a threshold we drop the least recently used nodes until we are below the threshold.
The actual data access is done when the user navigates: we access the database and refresh the subset they are navigating through. We do have the advantage that Type determines the table to query (both for that level and the children) and thus we can keep the individual record types indexed and accessible.
This has given the user the ability to navigate through the data in any way they want, while keeping the in-memory data minimized. Of course we offer standard search modes and a menu of "recently used" history (for each type) as well, but often the work they do requires moving up and down a narrow chain of nodes, so the tree view keeps it all in front of them while working with a given client and subset.
With that background, I became curious about the nature of data that would have such undifferentiated levels that a tier-by-tier data store wouldn't be appropriate. The advantage tier-by-tier storage has is that all I need is the current node's Guid, and I can search the child table on that as the foreign key (which is indexed, so it's quick to return the subset).
I guess this amounts to "unasking the question", but it seems that most tree structures have distinct data at each level, so it would seem far easier to work with something established (like a table query on an indexed field, which keeps the whole affair out of memory in the first place) than making a custom structure.
For example, I have never asked for the "next node at the current level" except within a given parent (because leaving a given parent takes me to another context). But within a parent I already have the children and their order.
Perhaps it is because of the space I'm in, but I find a tree control that knows how to bind to different tables based on parent->child relationships more useful, which is why I wrote one. I also find that lazy loading of data and aggressive dismissal of data keep the memory footprint minimized. Finally, I find the programming model incredibly simple: just create a new "treenode" subclass for any table I want to access and make the treenode responsible for loading its children.
(Clarifying, due to question below:)
In my implementation each TreeNode is actually a SpecificTreeNode, derived from BaseNode, which in turn is derived from TreeNode. Being inherited from TreeNode, they can be used directly by the tree, but because they override BaseNode members such as LoadChildren and the display properties, the display and retrieval of any given set of data is implied by the type of node (and the Guid of the item it represents).
This means that as the user navigates the tree, the SpecificTreeNode generates the required ORM query on the fly. For performance, child tables have any parent IDs as indexes, so navigating down the tree (even by multiple layers, if using a SpecificTreeNode that does rollups) is just a quick index lookup.
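A rough sketch of that pattern as I read it; the concrete subclass and the query helper are illustrative stand-ins, not the author's actual types:

using System;
using System.Windows.Forms;

// Common plumbing: lazy-load children the first time a node is expanded.
abstract class BaseNode : TreeNode
{
    public Guid Id { get; set; }
    private bool _loaded;

    public void EnsureChildren()
    {
        if (_loaded) return;
        _loaded = true;
        Nodes.AddRange(LoadChildren());
    }

    // Each subclass knows which table to query and how to display its rows.
    protected abstract TreeNode[] LoadChildren();
}

// One subclass per level/table; the ORM query is implied by the node type.
class ClientCompanyNode : BaseNode
{
    protected override TreeNode[] LoadChildren() =>
        // Hypothetical ORM call: fetch offices keyed by the indexed parent id.
        OrmQueries.OfficesForCompany(Id);
}

static class OrmQueries
{
    public static TreeNode[] OfficesForCompany(Guid companyId) =>
        throw new NotImplementedException(); // SELECT ... WHERE CompanyId = @id
}

EnsureChildren would typically be wired to the TreeView's BeforeExpand event, so unexplored paths never touch the database.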
In this way, I keep very little of the data in memory at any time, pulling only what we need from the database. Likewise, queries against the tree are converted to ORM queries against our database, pulling only the results and limiting the amount any query can pull (if you are using a Tree UI and you pull over 100 records at once, the UI isn't really the optimal place for whatever you are doing).
When your dataset is hundreds of GB in size, it seems the only reasonable recourse. The advantage I feel this has is that the Tree itself has no idea that different levels and paths render and query differently... it just asks the BaseNode (from its perspective) to do something, and the overrides on SpecificTreeNode actually do the lifting. Thus, the "data structure" is simply the way the tree works already combined with data queries on my tables and views.
(End of clarification.)
Meanwhile all the tree controls on the market seem to miss that and go with something far more complex.
The most space-efficient way to store a balanced N-ary tree is in an array... zero space overhead! And it's actually very efficient to traverse... just some simple math to compute your parent's index from your own index, or your N children's indices from your own.
To find some code for this, look up heapsort... the heap structure (nothing to do with the memory heap) is a balanced binary tree, and most people implement it in an array. Although it is binary, the same thing can be done for an N-ary tree.
If your N-ary tree is not kept balanced but tends to be fairly dense, the array implementation will still be more space-efficient than most, the empty slots being the only space overhead. However, if your trees are ever highly imbalanced, the array implementation can be very space-inefficient.
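The index math in question, for a complete N-ary tree laid out level by level in a 0-based array:

static class NaryArrayTree
{
    // Parent of node i (i > 0); e.g. for a binary heap (n = 2), Parent(2, 2) == 0.
    public static int Parent(int i, int n) => (i - 1) / n;

    // k-th child (0 <= k < n) of node i; children of i occupy n*i+1 .. n*i+n.
    public static int Child(int i, int n, int k) => n * i + k + 1;
}

With this layout the tree needs no pointers at all; unused slots are the only overhead when the tree isn't complete.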
I am converting a VB6 application to C# and I want to make the data structure that holds values and relationships more efficient. In VB I have a collection of values, and another collection of relationships between those values with priorities for those relationships. I also have an algorithm that, when a set of values is passed to it, returns all relationships required to join those values together. For example, say the values collection contains 1-10 and the relationship collection contains:
1,2
3,2
5,2
2,8
8,10
9,10
If the input was 1,9,10, the returned relationships would be:
1,2
2,8
8,10
9,10
Since there may be multiple paths, the fewest relationships would be returned, but there is a caveat of relationship priorities: if a relationship has a higher priority, then that relationship would be added and the rest of the relationships would be added from there. I am thinking of using a disjoint-set data structure, but I am not sure.
Any ideas?
More information --
The number of values would normally be less than 100 and the relationships less than 500. The collections are static, and the algorithm will be used again and again to find paths. Also (I did not ask this originally), would the disjoint-set data structure's algorithm be the most efficient?
It sounds like what you have is a graph: a structure with nodes and edges. There are many, many libraries and tools that deal with graphs. Microsoft even put out a paper on how to deal with them. I think graphs are great and extremely useful in many situations.
One big benefit with graphs is the ability to assign priorities to the edges between the nodes. Then when you want to find the path between two nodes, boom, the graph can choose the path with the ideal priority.
In your situation, your values are the nodes and your relationships are the edges.
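As a sketch of the fewest-relationships search (priorities are ignored here for brevity; they could become edge weights, swapping the BFS for Dijkstra):

using System.Collections.Generic;

class RelationshipGraph
{
    private readonly Dictionary<int, List<int>> _adj = new Dictionary<int, List<int>>();

    public void AddEdge(int a, int b)
    {
        if (!_adj.TryGetValue(a, out var la)) _adj[a] = la = new List<int>();
        if (!_adj.TryGetValue(b, out var lb)) _adj[b] = lb = new List<int>();
        la.Add(b);
        lb.Add(a); // relationships are treated as undirected
    }

    // Fewest-edges path via breadth-first search; null if disconnected.
    public List<(int From, int To)> Path(int start, int goal)
    {
        var prev = new Dictionary<int, int> { [start] = start };
        var queue = new Queue<int>();
        queue.Enqueue(start);

        while (queue.Count > 0)
        {
            int cur = queue.Dequeue();
            if (cur == goal) break;
            if (!_adj.TryGetValue(cur, out var neighbours)) continue;
            foreach (int next in neighbours)
            {
                if (prev.ContainsKey(next)) continue;
                prev[next] = cur;
                queue.Enqueue(next);
            }
        }

        if (!prev.ContainsKey(goal)) return null;

        // Walk the predecessor chain backwards, then reverse it into edges.
        var edges = new List<(int From, int To)>();
        for (int cur = goal; cur != start; cur = prev[cur])
            edges.Add((prev[cur], cur));
        edges.Reverse();
        return edges;
    }
}

With the six example relationships loaded, Path(1, 10) yields 1-2, 2-8, 8-10; joining a whole input set like 1,9,10 can union the pairwise paths, which reproduces the expected output including 9-10.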
You need to ask yourself (and tell us) what kind of usage pattern you expect. Do these relations get added in order or randomly? Do your queries come in order (as you show them) or randomly? Is this essentially a batch process: load them up, then read off the queries? Or do you expect to do it "online", in the sense that you may add some, then query some, then add some more and query some more?
Will you know how many you want to store beforehand, and how many do you expect to store? Dozens? Thousands? Tens of millions?
Here are some suggestions:
If you know beforehand how many you expect to store, it's not a really big number, you don't expect to add them after first loading up, there aren't any duplicates in the left-hand side of the pair, and they're reasonably "dense" in the sense that there aren't big gaps between numbers in the left-hand one of the pair, then you probably want an array. Insertion is O(1) and access is O(1), but you can't have duplicate indices, and expanding it after you build it is a pain.

If the number is really large, like > 10^8, you probably want some kind of database. Databases are relatively very slow (4 to 5 orders of magnitude slower than in-memory data structures) but handle really big data.

If you have insertions after the first load, and you care about order, you're going to want some sort of tree, like a 2-3 tree. Insertion and access are both O(lg n). You'd probably find an implementation under a name like "ordered list". (I'm not a C# guy.)

In most any other case, you probably want a hash. Average insertion and access are both O(1), like an array; the worst case (which you won't hit with this data) is O(n).