I am converting a VB6 application to C# and I want to make the data structure that holds values and relationships more efficient. In VB I have a collection of values and another collection of relationships between those values, with priorities for those relationships. I also have an algorithm that, when a set of values is passed to it, returns all the relationships required to join those values together. For example, say the values collection contains 1-10 and the relationship collection contains
1,2
3,2
5,2
2,8
8,10
9,10
If the input was 1,9,10 the returned relationships would be --
1,2
2,8
8,10
9,10
Since there may be multiple paths, the fewest relationships would be returned, but there is a caveat of relationship priorities. If a relationship has a higher priority, then that relationship would be added and the rest of the relationships would be added from there. I am thinking of using a disjoint-set data structure, but I am not sure.
Any ideas?
More information --
The number of values would normally be less than 100 and the relationships less than 500. The collections are static and the algorithm will be used again and again to find paths. Also, I did not ask this before, but would the disjoint-set data structure's algorithm be the most efficient?
It sounds like what you have is a Graph. This is a structure with Nodes and Edges. There are many, many libraries and tools that deal with Graphs. Microsoft even put out a paper on how to deal with them. I think graphs are great and extremely useful in many situations.
One big benefit with graphs is the ability to assign priorities to the edges between the nodes. Then when you want to find the path between two nodes, boom, the graph can choose the path with the ideal priority.
In your situation, your values are the nodes and your relationships are the edges.
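To make that concrete, here is a minimal C# sketch (all names invented for illustration, not from the question) of an adjacency-list graph with priority-weighted edges and a Dijkstra-style search. Joining a whole set of values with the fewest relationships is a Steiner-tree-style problem; a common practical approximation is to union the pairwise shortest paths between the requested values.

using System;
using System.Collections.Generic;

// A minimal adjacency-list graph; names here are illustrative, not from the question.
class RelationshipGraph
{
    // value -> list of (neighbor, cost) edges; a higher-priority relationship
    // is assumed to map to a LOWER cost so the search prefers it.
    private readonly Dictionary<int, List<(int To, int Cost)>> _edges = new();

    public void AddRelationship(int a, int b, int cost = 1)
    {
        if (!_edges.TryGetValue(a, out var la)) _edges[a] = la = new();
        if (!_edges.TryGetValue(b, out var lb)) _edges[b] = lb = new();
        la.Add((b, cost));   // relationships are undirected, so store both directions
        lb.Add((a, cost));
    }

    // Dijkstra-style search returning the relationships used between two values.
    public List<(int, int)> ShortestPath(int start, int goal)
    {
        var dist = new Dictionary<int, int> { [start] = 0 };
        var prev = new Dictionary<int, int>();
        var queue = new SortedSet<(int Dist, int Node)> { (0, start) };

        while (queue.Count > 0)
        {
            var (d, node) = queue.Min;
            queue.Remove(queue.Min);
            if (node == goal) break;
            if (!_edges.TryGetValue(node, out var neighbors)) continue;
            foreach (var (to, cost) in neighbors)
            {
                int nd = d + cost;
                if (!dist.TryGetValue(to, out var old) || nd < old)
                {
                    if (dist.ContainsKey(to)) queue.Remove((old, to));
                    dist[to] = nd;
                    prev[to] = node;
                    queue.Add((nd, to));
                }
            }
        }

        var path = new List<(int, int)>();          // walk back from goal to start
        for (int n = goal; prev.ContainsKey(n); n = prev[n])
            path.Add((prev[n], n));
        path.Reverse();
        return path;
    }
}

For the example input 1, 9, 10, unioning ShortestPath(1, 9) and ShortestPath(9, 10) reproduces the four relationships listed above (up to edge direction).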
You need to ask yourself (and tell us) what kind of pattern of use you expect. Do these relations get added in order or randomly? Do your queries come in order (as you show them) or randomly? And is this essentially a batch process -- load them up, read off the queries -- or do you expect to do it "on line", in the sense that you may add some, then query some, then add some more and query some more?
Will you know how many you want to store beforehand, and how many do you expect to store? Dozens? Thousands? Tens of millions?
Here are some suggestions:
If you know beforehand how many you expect to store, it's not a really big number, you don't expect to add them after first loading up, there aren't any duplicates in the left-hand side of the pair, and they're reasonably "dense" in the sense that there aren't big gaps between numbers in the left-hand one of the pair, then you probably want an array. Insertion is O(1), access is O(1), but you can't have duplicate indices and expanding it after you build it is a pain.

If the number is really large, like > 10⁸, you probably want some kind of database. Databases are relatively very slow -- 4 to 5 orders of magnitude slower than in-memory data structures -- but they handle really big data.

If you have insertions after the first load, and you care about order, you're going to want some sort of tree, like a 2-3 tree. Insertion and access are both O(lg n). You'd probably find an implementation under a name like "ordered list". (I'm not a C# guy.)

In most any other case, you probably want a hash. Average insertion and access are both O(1), like an array; the worst case [which you won't hit with this data] is O(n).
So, for example, I have a collection of documents like this:
{
    hotField1 : 0,
    hotField2 : "",
    coldField1 : 0,
    ...
    coldFieldN : ""
}
In this scope, cold properties are written once and accessed occasionally; hot properties are written and then accessed/updated fairly often (but in different use cases; it is not the same sub-document or parts of the same object).
The number of documents is fairly large (1M and more), and the hot data is at least ten times smaller than the cold.
Since partial update is still a much-requested but unimplemented feature, the only way to update hotField1 is:
Request full document
Change either hotField1 or hotField2
Write back whole document
This is costly in terms of RUs, and doesn't scale so well.
So the question is: how do I organize such data and calls in DocumentDB to minimize costs?
Discovered alternatives:
Obviously best: retrieve one property, change it, update it - not available yet.
Split into two collections, and use stored procedures to retrieve from the main collection and then from the dictionary?
Put hotFields1-2 in a subdocument ({ sub: {hf1:0, hf2:""}}) and somehow update only that? (I'm not sure if it is possible.)
P.S. The C# tag refers to the client library we use. If it lacks something, it's OK to use the REST interface instead.
While there's no exact "best" answer:
Your #2 choice will not work with stored procedures, since stored procedures are scoped to a collection.
Updating a subdocument (#3 choice) is no different than updating top-level properties - you are still retrieving, and re-writing, a document (a subdocument is just another property on the document).
While it may or may not reduce RU (you'd need to benchmark, as Larry pointed out in comments), you may choose to store your hot properties in a separate (smaller) document (or multiple smaller documents). With fewer properties, there would be less bandwidth consumed during updates, and less index updating. However, since you'd now be retrieving more than one document (possibly across multiple calls), you may find that this activity negates any RU savings from storing in a single document.
Note: There's nothing stopping you from storing these separate documents in the same collection (which then lets you approach the problem with a stored procedure, as you suggested in your #2 choice). You'll just need to create some type of property to help you identify different document types.
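As a hedged illustration of that separate-hot-document idea (all class, property, and id names here are hypothetical; the calls shown are from the classic DocumentDB .NET SDK):

using System.Threading.Tasks;
using Microsoft.Azure.Documents.Client;   // classic DocumentDB .NET SDK
using Newtonsoft.Json;

// Hypothetical split: one large cold document and one small hot document,
// stored in the same collection and linked by a shared logical id.
class ColdDoc
{
    [JsonProperty("id")] public string Id { get; set; }      // e.g. "item1-cold"
    [JsonProperty("type")] public string Type => "cold";     // discriminator property
    public int ColdField1 { get; set; }
    // ... many more cold fields, written once
}

class HotDoc
{
    [JsonProperty("id")] public string Id { get; set; }      // e.g. "item1-hot"
    [JsonProperty("type")] public string Type => "hot";      // discriminator property
    public int HotField1 { get; set; }
    public string HotField2 { get; set; }
}

class HotUpdater
{
    // Rewrites only the small hot document, so far less bandwidth and index work.
    public static Task UpdateHotAsync(DocumentClient client, string db, string coll, HotDoc hot)
    {
        var uri = UriFactory.CreateDocumentUri(db, coll, hot.Id);
        return client.ReplaceDocumentAsync(uri, hot);
    }
}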
Document-based NoSQL databases replace the whole document when you change one or all of its properties.
In terms of cost, charging is on a per-collection basis.
So, if you have a DB with two collections in it, each at a performance tier of S1, i.e. $25/month:
$25 x 2 = $50
In case you need better performance and change one to S2, you'll be charged:
$50 + $25 = $75
I have an ASP.NET Web API application. In the database I have a big list (between 100,000 and 200,000 entries) of pairs like id:name, and this list changes quite rarely. I need to implement filtering like this: /pair/filter?fragment=bla. It should return the first 25 pairs where any word in the name starts with the word fragment. I see two approaches here. The first approach is to load the data into a cache (HttpRuntimeCache, Redis, or something like that) to improve loading time and filter with LINQ. But I think there will be problems with the time required for serializing/deserializing. The other approach: for instance, I have a pair 22:some title here, so I would provide a separate table like this:
ID | FRAGMENT
22 | some
22 | title
22 | here
with a primary key on both columns and a separate index on the FRAGMENT column to make queries faster. Any suggestions and remarks are welcome.
UPD: I've now reconsidered. I don't want to query the database, because requests happen quite often. So now I see the best solution as:
load the entire list into memory
build a trie structure which keeps a hash set of values in each node
for a single text fragment, just return the hash set from the trie node; for several fragments, find all the hash sets and take their intersection
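A minimal sketch of that plan (class names invented here): each trie node keeps a HashSet of pair ids, so a single-fragment query is just a walk down the trie, and multi-fragment queries intersect the per-fragment sets.

using System.Collections.Generic;

class TrieNode
{
    public Dictionary<char, TrieNode> Children = new();
    public HashSet<int> Ids = new();   // ids of every pair whose name has a word with this prefix
}

class WordTrie
{
    private readonly TrieNode _root = new();

    public void Add(int id, string name)
    {
        foreach (var word in name.ToLowerInvariant().Split(' '))
        {
            var node = _root;
            foreach (var ch in word)
            {
                if (!node.Children.TryGetValue(ch, out var next))
                    node.Children[ch] = next = new TrieNode();
                node = next;
                node.Ids.Add(id);   // every prefix of this word now points at the pair
            }
        }
    }

    // One fragment: the whole result set is already sitting in the node.
    public HashSet<int> Find(string fragment)
    {
        var node = _root;
        foreach (var ch in fragment.ToLowerInvariant())
            if (!node.Children.TryGetValue(ch, out node))
                return new HashSet<int>();
        return node.Ids;
    }

    // Several fragments: intersect the per-fragment sets.
    public HashSet<int> FindAll(params string[] fragments)
    {
        var result = new HashSet<int>(Find(fragments[0]));  // copy so we don't mutate the trie
        for (int i = 1; i < fragments.Length; i++)
            result.IntersectWith(Find(fragments[i]));
        return result;
    }
}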
You could try a full-text index on your current DB (if it's supported) and the CONTAINS keyword, like so:
SELECT * FROM tableName WHERE CONTAINS(name, 'bla*');
This will look for words starting with "bla" anywhere in the string, so it will also match "Monkeys blabla".
I don't really understand your question, but if you want to query any table you can do so, since you already have the query string. You can try this out:
var res = _repository.Table.Where(c => c.Name.StartsWith("bla")).Take(25);
If it doesn't help, try to restructure your question a little bit.
Is this a case of premature optimization?
How many users will be hitting this service simultaneously? How many will be hitting your database simultaneously? How efficient is your query? How much data will be returned across the wire?
In most cases, you can't outsmart an efficient database for performance. Your row count is too small to create a truly heavy burden on your application's runtime performance when querying. This assumes, of course, that your query is well written and that you're properly opening, closing, and freeing resources in a timely fashion.
Caching the data in memory has its trade-offs that should be considered. It increases the memory footprint of your application, and requires you to write and maintain additional code to maintain that cache. That is by no means prohibitive, but should be considered in view of your overall architecture.
Consider these things carefully. From what I can tell, keeping this data in the database is fine. Deserialization tends to be fast (as most of the data you return is native types), and shouldn't be cost-prohibitive.
I have approximately 10,000 records. Each record has 2 fields: one field is a string up to 300 characters in length and the other field is a decimal value. This is like a product catalog with product names and the price of each product.
What I need to do is allow the user to type any word and display all products containing that word together with their prices in a listbox. That's all.
What type of collection is best for this scenario?
If I need to sort based on either product name or price, will the choice still be the same?
Right now I am using an XML file, but I thought using a collection, so that I can embed all the values in the code, would be simpler. Thanks for your suggestions.
A Dictionary will do the job. However, if you are doing rapid partial matches (e.g. search as the user types) you may get better performance by creating multiple keys which point to the same item. For example, the word "Apple" could be located with "Ap", "App", "Appl", and "Apple".
I have used this approach on a similar number of records with very good results. I have turned my 10K source items into about 50K unique keys. Each of these Dictionary entries points to a list containing references to all matches for that term. You can then search this much smaller list more efficiently. Despite the large number of lists this creates, the memory footprint is quite reasonable.
You can also make up your own keys if desired, to redirect common misspellings or point to related items. This also eliminates most of the issues with unique keys, because each key points to a list. A single item may be classified by each of the words in its name; this is extremely useful if you have long product names with multiple words in them. When classifying your items, each word in the name can be mapped to one or more keys.
I should also point out that building and classifying 10K items shouldn't take long if done correctly (couple hundred milliseconds is reasonable). The results can be cached for as long as you want using Application, Cache, or static members.
To summarize, the resulting structure is a Dictionary<string, List<T>> where the string is a short (2-6 characters works well) but unique key. Each key points to a List<T> (or other collection, if you are so inclined) of items which match that key. When a search is performed, you locate the key which matches the term provided by the user. Depending on the length of your keys, you may truncate the user's search to your maximum key length. After locating the correct child collection, you then search that collection for a complete or partial match using whatever methodology you wish.
Lastly, you may wish to create a lightweight structure for each item in the list so that you can store additional information about the item. For example, you might create a small Product class which stores the name, price, department, and popularity of the product. This can help you refine the results you show to the user.
All-in-all, you can perform intelligent, detailed, fuzzy searches in real-time.
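A sketch of the structure described above, under the stated assumptions (Product and the field choices are made up for illustration):

using System;
using System.Collections.Generic;
using System.Linq;

// Product is a made-up example record; use your own item type.
class Product
{
    public string Name;
    public decimal Price;
}

class PrefixIndex
{
    private const int MaxKeyLength = 6;
    private readonly Dictionary<string, List<Product>> _index = new();

    public void Add(Product p)
    {
        // Classify the product under every 2..6 character prefix of every word in its name.
        foreach (var word in p.Name.ToLowerInvariant().Split(' '))
            for (int len = 2; len <= Math.Min(word.Length, MaxKeyLength); len++)
            {
                var key = word.Substring(0, len);
                if (!_index.TryGetValue(key, out var list))
                    _index[key] = list = new List<Product>();
                list.Add(p);
            }
    }

    public IEnumerable<Product> Search(string term)
    {
        term = term.ToLowerInvariant();
        var key = term.Length > MaxKeyLength ? term.Substring(0, MaxKeyLength) : term;
        if (!_index.TryGetValue(key, out var candidates))
            return Enumerable.Empty<Product>();
        // The candidate list is small, so finish with a simple scan of it.
        return candidates.Where(p => p.Name.ToLowerInvariant().Contains(term)).Distinct();
    }
}

Note the Distinct() in Search: a product whose name contains two words sharing the same prefix would otherwise appear twice.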
The aforementioned structures should provide functionality roughly equivalent to a trie.
10K records is not that much.
A Dictionary<string,decimal> would fit the bill. You can sort by key or by value using LINQ, as well as do searches.
This assumes that product names are unique.
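For instance (sample entries invented):

using System;
using System.Collections.Generic;
using System.Linq;

var catalog = new Dictionary<string, decimal>
{
    ["Red Apple"] = 1.25m,
    ["Green Pear"] = 1.10m,
};

// Filter by a word the user typed, then sort by price (or by name) with LINQ.
var matches = catalog
    .Where(kv => kv.Key.IndexOf("apple", StringComparison.OrdinalIgnoreCase) >= 0)
    .OrderBy(kv => kv.Value)          // use kv.Key instead for name order
    .ToList();

foreach (var kv in matches)
    Console.WriteLine($"{kv.Key}: {kv.Value}");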
I have people and places data as:
Person entity has
IList<DateRangePlaces> each having
IList<Place> of possible places
a Schedule day pattern, i.e. 10 days available, 4 unavailable
Within a particular DateRangePlaces date range, one has to obey the Schedule pattern as to whether the person can go to a particular place or not.
Place entity has
IList<DateRangeTiming> each defining opening/closing times within each date range
Overlapping date ranges work as LIFO, so for each day that has already been defined previously, the new timing definition takes precedence.
The problem
Now I need to do something like this (in pseudo code):
for each Place
{
    for each Day between minimum and maximum date in IList<DateRangeTiming>
    {
        get a set of People applicable for Place and on Day
    }
}
This means that the number of steps to execute my task is approximately:
∑(places)( ∑(days) × ∑(people) )
This to my understanding is
O(x × y × z)
and likely approximates to this algorithm complexity:
O(n³)
I'm not an expert in theory, so you can freely correct my assumptions. What is true is that this kind of complexity is definitely not acceptable, especially given the fact that I will be operating over long date ranges with many places and people.
From the approximated formula we can see that the people set would be iterated many times. Hence I would like to optimize at least this part. To ease things a bit, I changed
Person.IList<DateRangePlaces>.IList<Place>
to
Person.IList<DateRangePlaces>.IDictionary<int, Place>
which would give me a faster answer as to whether a person can go to some place on a particular date, because I would only check whether the Place.Id is present in the dictionary, versus an IList.Where() LINQ clause that would have to scan the whole list each and every time.
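The difference in a nutshell, as a simplified sketch (Place is reduced to an Id field):

using System.Collections.Generic;
using System.Linq;

class Place { public int Id; }

class Lookups
{
    // IList: LINQ Where/Any scans the whole list, O(n) per membership test.
    public static bool CanGo(IList<Place> places, int placeId) =>
        places.Any(p => p.Id == placeId);

    // IDictionary keyed by Place.Id: O(1) average per membership test.
    public static bool CanGo(IDictionary<int, Place> places, int placeId) =>
        places.ContainsKey(placeId);
}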
Question
Can you suggest any additional optimizations I could implement into my algorithm to make it faster or even make it less complex in terms of the big O notation?
Which memory structure types would you use where and why (lists, dictionaries, stacks, queues...) to improve performance?
Addendum: The whole problem is even more complex
There are also additional complexities that I didn't mention, since I wanted to simplify my question to make it clearer. There's also:
Place.IList<Permission>
Person.IList<DateRangePermission>
So places require particular permissions, and people have limited-time permission grants that expire.
In addition to that, there's also
Person.IList<DateRangeTimingRestriction>
which tells only particular times that person can go somewhere during particular date range. And
Person.IList<DateRangePlacePriorities>
Which defines place prioritization for a particular date range.
And during this process of getting applicable people, I also have to calculate a certain factor for every person at every place, related to the:
number of places that a person can visit on particular day
person's place priority factor on that particular day
All these are the reasons why I decided to manipulate this data in memory rather than use a very complex stored procedure that would also be doing multiple table scans to get the factors per person, place, and day.
I think such a stored procedure would be way too complex to handle and maintain. So I'd rather get all the data first (putting it in appropriate memory structures to aid performance) and then work through it in memory.
I suggest using a relational database and writing a stored procedure to retrieve the "set of People applicable for Place and on Day".
The stored procedure approach would not be complex nor difficult to maintain if the model is architected properly. Additionally, relational databases have primary keys and indexing to avoid table scans.
The only way to speed things up using collections would be:
change the collection type. You could use a KeyedCollection, an IDictionary<>, or even a disconnected recordset. Disconnected recordsets also give you the ability to set foreign keys to child recordsets; however, I think this would be a fairly complex pattern to use.
maintain a collection within a collection - basically the same concept as a parent / child relationship with a foreign key. The object references will only be pointers to the original object's memory space or, if you're using a keyed collection you could simply store the index of the other collection.
maintain boolean properties that allow you to skip iterations. For example, as you build your entities, set a boolean such as "HasPlaceXPermission". If the value is false, you know not to retrieve information related to place X.
maintain flags - flags can be a very good optimization technique when used properly. Similar to #3, flags can be used to determine permissions very quickly, for example if ((person.PlacePermissions & (Place.Colorado | Place.Florida)) > 0) // do date/time scan on Colorado and Florida, else don't. (See the sketch after this list.)
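Here is what the flags idea might look like in C# (the [Flags] enum values are invented; the question's Place entity would need a corresponding bit per place or region):

using System;

[Flags]
enum PlaceFlags
{
    None     = 0,
    Colorado = 1 << 0,
    Florida  = 1 << 1,
    Texas    = 1 << 2,
}

class Person
{
    public PlaceFlags PlacePermissions;

    // Bitwise test: skips the expensive date/time scan when no bit matches.
    public bool CanVisitAny(PlaceFlags places) => (PlacePermissions & places) != 0;
}

class Demo
{
    static void Main()
    {
        var person = new Person { PlacePermissions = PlaceFlags.Florida };
        if (person.CanVisitAny(PlaceFlags.Colorado | PlaceFlags.Florida))
        {
            // do the date/time scan on Colorado and Florida only
        }
    }
}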
It's difficult to know which collection types I would use based upon the information you have provided, I would need a larger scope of the application to determine that architecturally. For example, where is the data stored, how is it retrieved, how is it prepared and how is it presented? Knowing how the application is architected would help to determine its optimization points.
You can't avoid O(n²), as the minimal iteration you need is to pass every Place and every Date element to find a match for a given Person.
I think the best way is to use a DB such as SQL Server and run your query as a stored procedure.
The date range is presumably fairly limited, perhaps never more than a few years. Call it constant. When you say that, for each of those combinations, you need to "get a set of people applicable", it's pretty clear: if you really do need to get all that data, then you can't improve the complexity of your solution, because you need to return a result for each combination.
Don't worry about complexity unless you're having trouble scaling with large numbers of people. Ordinary profiling is the place to start if you're having performance problems. O(#locations * #people) is not so bad.
I read this paper.
But I'd love to avoid a ton of research to solve this problem if someone has already done it. I need this space-efficient tree for a reasonably (conceptually) simple GUI control: a TreeDataGridView with virtual mode.
An example tree might look like:
(RowIndexHierarchy),ColumnIndex
(0),0
(0,0),0
(0,0,0),0
(0,0,0,0),0
(0,0,0,0,0),0
(0,0,0,1),0
(0,0,0,2),0
(0,0,1),0
(0,0,2),0
(0,0,2,0),0
(0,0,2,1),0
(0,0,2,2),0
(0,1),0
(0,2),0
(0,2,0),0
(0,2,0,0),0
(0,2,0,1),0
(0,2,0,2),0
(0,2,1),0
(0,2,2),0
(0,2,2,0),0
(0,2,2,1),0
(0,2,2,2),0
(1),0
I need operations like "find flat row index from row hierarchy" and "find row hierarchy from flat row index". Also, to support expand/collapse, I need "find next node with the same or less depth".
For a read-only tree you can store a sorted array of nodes, ordered by parent index.
0 a
1 (a/)b
1 (a/)c
2 (a/b/)d
2 (a/b/)e
2 (a/b/)f
3 (a/c/)c
Each time you need to find child nodes, you can use binary search to find the upper and lower boundaries of the node range.
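A minimal sketch of that lookup (field names invented): with the node array sorted by parent index, two lower-bound binary searches bracket the children of any node.

using System;

struct Node
{
    public int ParentIndex;   // index of this node's parent in the same array; -1 for roots
    public string Name;
}

static class ChildLookup
{
    // nodes must be sorted by ParentIndex.
    public static (int First, int Count) ChildRange(Node[] nodes, int parentIndex)
    {
        int first = LowerBound(nodes, parentIndex);
        int end = LowerBound(nodes, parentIndex + 1);
        return (first, end - first);
    }

    // Index of the first node whose ParentIndex >= value (classic lower bound).
    private static int LowerBound(Node[] nodes, int value)
    {
        int lo = 0, hi = nodes.Length;
        while (lo < hi)
        {
            int mid = (lo + hi) / 2;
            if (nodes[mid].ParentIndex < value) lo = mid + 1; else hi = mid;
        }
        return lo;
    }
}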
I'm not sure if I am following your needs exactly, but we access a database that has a tree-view UI. The tree runs from
Top Level (the user's company);
Direct Client Company;
Office Location;
Employee;
Indirect Client Company;
Proposal;
Specific Vendor Bid;
Detail Financials (invoices, adjustments, etc)
As we have thousands of direct clients and the tree branches pretty heavily at each tier, we don't load the entire data-set at any time. Instead we only load Type, Guid, DisplayName and some administrative data for each child, and load a "details" pane for the currently focused item. Unexplored paths through the tree simply don't exist in memory. To avoid loading the full lists of names, we have a "rolodex" level that just divides the dataset into batches of 100 records or less. (So "Z" stands alone, but "Sa-St" subdivides S.) These auto-insert when a subset grows beyond the 100-record threshold.
Periodically (when the system is idle) we check the loaded count and if it exceeds a threshold we drop the least recently used nodes until we are below the threshold.
The actual data access is done when the user navigates: we access the database and refresh the subset they are navigating through. We do have the advantage that Type determines the table to query (both for that level and the children) and thus we can keep the individual record types indexed and accessible.
This has given the user the ability to navigate through the data in any way they want, while keeping the in-memory retained data minimized. Of course we offer standard search modes and a menu of "recently used history" (for each type) as well, but often the work they do requires moving up and down a narrow chain of nodes, so the tree view keeps it all in front of them while working with a given client and the subset.
With that background, I become curious as to the nature of data that would have such undifferentiated levels that such a tier by tier data store wouldn't be appropriate. The advantage that tier by tier storage has is that all I need is the current node's Guid and I can search the child table on that as the foreign key (which is indexed, so quick to return the subset).
I guess this amounts to "unasking the question", but it seems that most tree structures have distinct data at each level, so it would seem far easier to work with something established (like a table query on an indexed field, which keeps the whole affair out of memory in the first place) than making a custom structure.
For example, I have never asked for the "next node at the current level" except within a given parent (because leaving a given parent takes me to another context). But within a parent I already have the children and their order.
Perhaps it is because of the space I'm in, but I find a tree control that knows how to bind to different tables based on parent->child relationships of tables more useful, which is why I wrote one. I also find lazy loading of data and aggressive dismissal of data keep the memory footprint minimized. Finally I find the programming model incredibly simple: just create a new "treenode" subclass for any table I want to access and make the treenode responsible for loading their children.
(Clarifying, due to question below:)
In my implementation each TreeNode is actually a SpecificTreeNode, derived from BaseNode, which in turn is derived from TreeNode. Being inherited from TreeNode, they can be used directly by the tree, but because they have overrides of the BaseNode properties such as LoadChildren and display properties, the display and retrieval of any given set of data are implied by the type of node (and the Guid that the item represents).
This means that as the user navigates the tree, the SpecificTreeNode generates the required ORM query on the fly. For performance, child tables have any parent IDs as indexes, so navigating down the tree (even by multiple layers, if using a SpecificTreeNode that does rollups) is just a quick index lookup.
In this way, I keep very little of the data in memory at any time, pulling only what we need from the database. Likewise, queries against the tree are converted to ORM queries against our database, pulling only the results and limiting the amount any query can pull (if you are using a Tree UI and you pull over 100 records at once, the UI isn't really the optimal place for whatever you are doing).
When your dataset is hundreds of GB in size, it seems the only reasonable recourse. The advantage I feel this has is that the Tree itself has no idea that different levels and paths render and query differently... it just asks the BaseNode (from its perspective) to do something, and the overrides on SpecificTreeNode actually do the lifting. Thus, the "data structure" is simply the way the tree works already combined with data queries on my tables and views.
(End of clarification.)
Meanwhile all the tree controls on the market seem to miss that and go with something far more complex.
The most space-efficient way to store a balanced N-ary tree is in an array... zero space-overhead! And actually very efficient to traverse ... just some simple math required to compute your parent index from your index... or your N children's indices from your index.
To find some code for this, look up heapsort... the heap structure (nothing to do with memory heap) is a balanced binary tree... and most people implement it in an array. Although it is binary, the same thing can be done for N-ary.
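The index math is just this (a small sketch; 0-based array):

using System;

static class NaryHeapIndex
{
    // For a complete N-ary tree packed into a 0-based array:
    public static int Parent(int i, int n) => (i - 1) / n;          // parent of node i
    public static int Child(int i, int n, int k) => n * i + k + 1;  // k-th child, 0 <= k < n

    static void Main()
    {
        // Binary (n = 2): children of node 1 are 3 and 4, and the parent of 4 is 1.
        Console.WriteLine(Child(1, 2, 0));   // 3
        Console.WriteLine(Child(1, 2, 1));   // 4
        Console.WriteLine(Parent(4, 2));     // 1
    }
}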
If your N-ary tree is not kept balanced, but tends to be fairly dense, then the array implementation will still be more space-efficient than most... the empty nodes being the only space overhead. However, if your trees are ever highly imbalanced, then the array implementation can be very space-inefficient.