Indexed databases in Random Access Memory - c#

I'm currently writing a small test web application for a jobs search system.
I have a table Vacancies (the main table to talk about).
I need to rapidly update, via AJAX, a list of vacancies (a suggestion list below the input control) that match the user's query. Various DBMSs provide powerful extensions such as Full-Text Search in Microsoft SQL Server, but I think that scanning a physical file takes too much time. My idea is to load the whole Vacancies table into RAM, which in my view makes sense because retrieving the data then takes less time.
So if a client types in a textbox something like "pro" - the suggest list shows up with suggestions:
-product manager
-professional designer
-programmer
-programmer C#
-programmer Java
-property administrator
-provision expert
When the user types another letter, "g", the textbox value becomes "prog"
and the list is refreshed:
-programmer
-programmer C#
-programmer Java
To make that possible I plan to create a tree index with values stored in its nodes, where a vacancy-name prefix is the index key and the node values are the vacancy names. The index is built and populated only once, from the data table. Here is what I mean:
"pro" -> {
"product manager",
"professional designer",
"programmer",
"programmer C#",
"programmer Java",
"property administrator",
"provision expert"
}
So the index builder must analyze the list of strings and find the shortest prefixes of the vacancy names.
Then, when the builder finds a string with one more letter after a previously found prefix, it creates a child tree node ("prog") and attaches it to the parent node ("pro"). The number of values in a child node shrinks as the prefix grows, because the list is filtered further at each level.
"prog" -> {
"programmer",
"programmer C#",
"programmer Java"
}
Can you advise me on the types of tree indexes that naturally fit this problem?
Which of them has the best seek time?
Thanks

This problem was solved years ago; you are recreating Lucene.
For what it's worth, the type of tree you want is a Patricia tree or a radix tree. Storing all the data in RAM is a bad idea, because other applications need RAM too, not just your index. I am currently ripping out someone's custom database that was implemented this way because they thought it was a good idea, and replacing it with a real database solution.
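As a rough illustration of the prefix-tree idea, here is a minimal sketch of a plain (uncompressed) trie. It is not a Patricia or radix tree, which would additionally merge single-child chains, and it is not how Lucene works internally. The names SuggestionTrie, Add and Suggest are hypothetical. Unlike the layout in the question, it does not store a list of titles at every prefix node; it walks to the prefix node and collects the titles below it.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration: a plain prefix tree used for suggestions.
public sealed class SuggestionTrie
{
    private sealed class Node
    {
        // SortedDictionary keeps children in character order,
        // so suggestions come out alphabetically (by lower-cased text).
        public readonly SortedDictionary<char, Node> Children =
            new SortedDictionary<char, Node>();

        // Original spelling of the title that ends at this node, or null.
        public string Title;
    }

    private readonly Node _root = new Node();

    // Inserts one title; matching is case-insensitive.
    public void Add(string title)
    {
        Node node = _root;
        foreach (char c in title.ToLowerInvariant())
        {
            Node child;
            if (!node.Children.TryGetValue(c, out child))
            {
                child = new Node();
                node.Children.Add(c, child);
            }
            node = child;
        }
        node.Title = title;
    }

    // Returns up to 'limit' titles that start with 'prefix'.
    public List<string> Suggest(string prefix, int limit)
    {
        var results = new List<string>();
        Node node = _root;
        foreach (char c in prefix.ToLowerInvariant())
        {
            if (!node.Children.TryGetValue(c, out node))
                return results; // no title starts with this prefix
        }
        Collect(node, results, limit);
        return results;
    }

    // Depth-first walk of the subtree below the prefix node.
    private static void Collect(Node node, List<string> results, int limit)
    {
        if (results.Count >= limit) return;
        if (node.Title != null) results.Add(node.Title);
        foreach (Node child in node.Children.Values)
        {
            if (results.Count >= limit) return;
            Collect(child, results, limit);
        }
    }
}
```

With the seven titles from the question added, Suggest("prog", 10) returns "programmer", "programmer C#" and "programmer Java", in that order. Lookup cost depends on the prefix length and the number of results returned, not on the total number of vacancies.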

Related

API - filter big list with word fragment

I have an ASP.NET Web API application. In the database I have a big list (between 100,000 and 200,000 items) of id:name pairs, and this list changes quite rarely. I need to implement filtering like this: /pair/filter?fragment=bla. It should return the first 25 pairs where any word in the name starts with the word fragment. I see two approaches here. The first is to load the data into a cache (HttpRuntimeCache, Redis or something similar) to reduce loading time, and to filter with LINQ. But I think there will be problems with the time required for serializing/deserializing. The other approach: say I have a pair 22:some title here, so I need to provide a separate table like this:
ID | FRAGMENT
22 | some
22 | title
22 | here
with a primary key on both columns and a separate index on the FRAGMENT column to make queries faster. Any suggestions and remarks are welcome.
UPD: I've now changed my mind. I don't want to query the database, because requests happen quite often. So the best solution I see now is to:
-load the entire list into memory
-build a trie structure that keeps a hashset of values in each node
-for a single text fragment, just return the hashset from the trie node; for several fragments, find all the hashsets and take their intersection
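The steps in the update could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name FragmentIndex and its members are hypothetical, words are split on spaces only, and matching is case-insensitive. Keeping an id set at every node costs memory in exchange for lookups that need no subtree walk.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration: a trie over words, with a set of ids per node.
public sealed class FragmentIndex
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();

        // Ids of all pairs that have a word starting with this node's prefix.
        public readonly HashSet<int> Ids = new HashSet<int>();
    }

    private static readonly char[] Separators = { ' ' };
    private readonly Node _root = new Node();

    // Indexes every word of 'name' under the given id.
    public void Add(int id, string name)
    {
        foreach (string word in SplitWords(name))
        {
            Node node = _root;
            foreach (char c in word)
            {
                Node child;
                if (!node.Children.TryGetValue(c, out child))
                {
                    child = new Node();
                    node.Children.Add(c, child);
                }
                node = child;
                node.Ids.Add(id);
            }
        }
    }

    // Returns up to 'limit' ids whose name has, for every fragment
    // in the query, at least one word starting with that fragment.
    public IEnumerable<int> Find(string query, int limit)
    {
        HashSet<int> result = null;
        foreach (string fragment in SplitWords(query))
        {
            HashSet<int> ids = Lookup(fragment);
            if (ids == null) return Enumerable.Empty<int>();
            if (result == null) result = new HashSet<int>(ids); // copy, keep index intact
            else result.IntersectWith(ids);
        }
        return result == null
            ? Enumerable.Empty<int>()
            : result.OrderBy(id => id).Take(limit);
    }

    // Walks the trie; returns null when no word starts with the fragment.
    private HashSet<int> Lookup(string fragment)
    {
        Node node = _root;
        foreach (char c in fragment)
        {
            if (!node.Children.TryGetValue(c, out node)) return null;
        }
        return node.Ids;
    }

    private static string[] SplitWords(string text)
    {
        return text.ToLowerInvariant()
                   .Split(Separators, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

For example, after Add(22, "some title here"), a call to Find("ti", 25) yields 22, and so does Find("so ti", 25), while Find("itle", 25) yields nothing because no word starts with "itle".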
You could try a full-text index on your current DB (if it's supported) and the CONTAINS keyword, like so:
SELECT * FROM tableName WHERE CONTAINS(name, 'bla*');
This looks for words starting with "bla" anywhere in the string, so it also matches the string "Monkeys blabla".
I don't really understand your question, but if you want to query any table you can do so, since you already have the query string. You can try this out:
var res = _repository.Table.Where(c => c.Name.StartsWith("bla")).Take(25);
If that doesn't help, try restructuring your question a little.
Is this a case of premature optimization?
How many users will be hitting this service simultaneously? How many will be hitting your database simultaneously? How efficient is your query? How much data will be returned across the wire?
In most cases, you can't outsmart an efficient database for performance. Your row count is too small to create a truly heavy burden on your application's runtime performance when querying. This assumes, of course, that your query is well written and that you're properly opening, closing, and freeing resources in a timely fashion.
Caching the data in memory has its trade-offs that should be considered. It increases the memory footprint of your application, and requires you to write and maintain additional code to maintain that cache. That is by no means prohibitive, but should be considered in view of your overall architecture.
Consider these things carefully. From what I can tell, keeping this data in the database is fine. Deserialization tends to be fast (as most of the data you return is native types), and shouldn't be cost-prohibitive.

How can I properly present and write a checkboxlist to a db? (Theory/Logic help)

Please note that the database design I have now is fully in sandbox mode. Nothing is finalized. Everything (again, this is in sandbox mode) is in one single table. Also, I'm in no way looking for coding help. I'm looking for the right theoretical/logical approach to this puzzle so I can go in and play with the coding myself. I learn better that way. If I need coding help, I'll come back here for further assistance.
I'm in the process of creating the first of a few CheckBoxList for a manually created form submittal. I've been looking over multiple ways to not only create said CheckBoxList but to enter it into the database. However, the challenge I'm facing (mainly because I haven't encountered this scenario before and I'd like to learn this) is that I not only need to worry about entering the items correctly in the database but I will eventually need to produce a report or make a printable form from these entries.
Let's say I have an order form for a specific type of barbecue grill and I need to send this form out to distribution centers across the nation. The distribution centers will need to pull said barbecues if they are highlighted on the form.
Here's what the CheckBoxList for the distribution centers will look like:
All
Dallas
Miami
Los Angeles
Seattle
New York
Chicago
Phoenix
Montreal
If the specific city (or all the cities) are checked, then the distribution center will pull the barbecue grill for shipment.
The added part is that I want to:
be able to create a grid view from this database for reporting to note which distribution center got orders for barbecues and
be able to create reports to tell what distribution center sent out barbecue orders in a given month (among other reporting).
Here's what I'm playing around with right now.
In my aspx page I have a checkboxlist programmed with all the distribution centers entered as a listitem as well as an option for 'ALL' (of the distribution centers).
I also created a dedicated column in this table that holds all the information in the listitem and programmed a sqldataconnection to this table to play with the programmability of leveraging the database for this purpose.
When it comes to writing the selections to the database, I originally created a column for each destination city, including the 'All' option. I was toying around with just putting the selections into one single column, but from what I've been reading today about database normalization, the former option seems to be better than the latter. Is this correct practice for situations such as this, especially if I need to think about reporting? Do I put the CheckBoxList data in one cell in a specific column? Do I create separate columns for each distribution center? Am I even on the right track here?
Depending on the number of cities that you want to store, bitwise operators may work; I've used them to store small amounts of data. Effectively, you would store the cities in a table like this:
CityID Location
2 Dallas
4 Miami
8 New York
16 Chicago
32 Montreal
and keep going in base 2 for additional cities.
When your user selects multiple cities for the order, the value that gets inserted into the database for cities is a bitwise OR calculation. So if they select Dallas, New York, and Chicago, you would be doing the following:
2 OR 8 OR 16
Which would equal 26
Now, you can use bitwise AND on the resulting value. So if checking for Miami the following is the evaluation:
26 AND 4 = 0
which indicates that Miami was not selected. For any city that was selected, the evaluation returns that city's ID, like this:
26 AND 8 = 8
Again, I've only used this for small subsets of data, and to make the data storage as compact as possible. Computationally, it may be a trifle more expensive than some other methods, but I'm not 100% certain.
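In C#, the scheme above maps naturally onto a [Flags] enum. The following is a small sketch using the same numeric values as the table; the type and method names (DistributionCenters, CenterSelection) are hypothetical.

```csharp
using System;

// Hypothetical illustration: one bit per city, values taken from the table above.
[Flags]
public enum DistributionCenters
{
    None = 0,
    Dallas = 2,
    Miami = 4,
    NewYork = 8,
    Chicago = 16,
    Montreal = 32
}

public static class CenterSelection
{
    // The integer that would be written to the single database column.
    public static int ToStoredValue(DistributionCenters selection)
    {
        return (int)selection;
    }

    // True when the stored value has the bit for 'center' set
    // (any of its bits, if 'center' combines several cities).
    public static bool IsSelected(int storedValue, DistributionCenters center)
    {
        return (storedValue & (int)center) != 0;
    }
}
```

Combining Dallas, New York and Chicago with the | operator gives a stored value of 26; IsSelected(26, DistributionCenters.Miami) is false and IsSelected(26, DistributionCenters.NewYork) is true, matching the arithmetic above.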
Note: These might not be the best approaches, but I have seen them used.
1) Having one column containing a comma-delimited string
This should work well if the options don't have IDs in the database (i.e. there is no separate referenced table).
You will need to loop through the checkbox list, obtain the selected options and concatenate them with String.Join().
You will need to split the string upon receiving it from the DB and use it to check the checkboxes whose text is found in the resulting array.
Problem: You might need a split function in the DB that converts the comma-separated string into rows. There are split function implementations on the web/Stack Overflow.
2) You can have a separate table for the locations, e.g. xxxxtable_location, with an FK referencing the main table. This will be a one-to-many table:
ParentID, Location
1 Dallas
2 Miami
2 New York
2 Chicago
3 Miami

Creating a free text-style search using Wpf/C#/EF5

I've created the bulk of a C# application, of which the core is a person database (there's a lot more going on peripherally too). I'm using EF with the CodeFirst/DbContext methodology. For my frontend, I have XAML using a MVVM type approach.
I would now like a search box which "feels" like a free-text search. I currently have an editable combo box, with properties set up to provide the correct feel. I am using EF's "Contains" query method to query the SQL database.
At present, I have something along the following lines:
x.Where(p => p.Forenames.Contains(s) || p.Surname.Contains(s))
Which works well to a limited extent. This is obviously a problem if "s", the search string, contains both first name and surname data. In essence, I want the user to be able to search by typing "Joe Bloggs" or "Bloggs, Joe" or various combinations of middle names etc... I may even want to add address data to the search in future.
My question is: how do I achieve this? The first way that springs to mind is to Split the string and then pass the individual components of the array to each search term, in a foreach loop. This could potentially result in multiple, rather large queries.
Is there a better way to achieve this by using a different query strategy with EF itself? I want to give the user the feeling that any search term they type produces something sensible in the combo box!
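One possible shape for the split-and-filter idea is sketched below. Because Where calls on an IQueryable compose lazily, chaining one Where per term still produces a single query when the result is finally enumerated, not one query per term. The Person class here is a stand-in with only the two properties mentioned in the question, and PersonSearch is a hypothetical helper. Every term must match either Forenames or Surname; whether matching is case-sensitive against SQL depends on the database collation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in entity with only the two properties mentioned in the question.
public class Person
{
    public string Forenames { get; set; }
    public string Surname { get; set; }
}

public static class PersonSearch
{
    // Assumption: terms are separated by spaces and/or commas.
    private static readonly char[] Separators = { ' ', ',' };

    // Adds one Where clause per term; nothing executes until enumeration.
    public static IQueryable<Person> Search(IQueryable<Person> people, string input)
    {
        string[] terms = input.Split(Separators, StringSplitOptions.RemoveEmptyEntries);

        IQueryable<Person> query = people;
        foreach (string term in terms)
        {
            // Local copy so each lambda captures its own value,
            // which matters with compilers older than C# 5.
            string t = term;
            query = query.Where(p => p.Forenames.Contains(t) || p.Surname.Contains(t));
        }
        return query;
    }
}
```

With this sketch, "Joe Bloggs" and "Bloggs, Joe" both split into the same two terms and so return the same people. In-memory (LINQ to Objects) the Contains calls are case-sensitive, so results there can differ from what the database returns.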

What is the most space efficient way to store an N-ary tree while preserving hierarchy traversal?

I read this paper.
But I'd love to avoid a ton of research to solve this problem if someone has already done it. I need this space efficient tree for a reasonably (conceptually) simple GUI control: TreeDataGridView with virtual mode
An example tree might look like:
(RowIndexHierarchy),ColumnIndex
(0),0
(0,0),0
(0,0,0),0
(0,0,0,0),0
(0,0,0,0,0),0
(0,0,0,1),0
(0,0,0,2),0
(0,0,1),0
(0,0,2),0
(0,0,2,0),0
(0,0,2,1),0
(0,0,2,2),0
(0,1),0
(0,2),0
(0,2,0),0
(0,2,0,0),0
(0,2,0,1),0
(0,2,0,2),0
(0,2,1),0
(0,2,2),0
(0,2,2,0),0
(0,2,2,1),0
(0,2,2,2),0
(1),0
I need operations like "find flat row index from row hierarchy" and "find row hierarchy from flat row index". Also, to support expand/collapse, I need "find next node with the same or less depth".
For a read-only tree, you can store the nodes in an array sorted by parent index.
0 a
1 (a/)b
1 (a/)c
2 (a/b/)d
2 (a/b/)e
2 (a/b/)f
3 (a/c/)c
Each time you need to find a node's children, you can use binary search to find the lower and upper boundaries of the range of nodes with that parent index.
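A minimal sketch of that lookup follows. It assumes one particular reading of the example above: nodes are numbered from 1 in array order, the first column is the parent's number, and the root has parent 0. The names ParentSortedTree and FindChildren are hypothetical.

```csharp
using System;

// Hypothetical illustration: children lookup in a parent-sorted node array.
public static class ParentSortedTree
{
    // parents[i] is the 1-based number of the parent of the node stored at
    // array position i (0 for the root). The array must be sorted ascending.
    // Outputs the array position of the first child and the child count.
    public static void FindChildren(int[] parents, int nodeNumber,
                                    out int start, out int count)
    {
        start = LowerBound(parents, nodeNumber);
        count = LowerBound(parents, nodeNumber + 1) - start;
    }

    // First position whose value is >= target (sorted.Length if none).
    private static int LowerBound(int[] sorted, int target)
    {
        int lo = 0, hi = sorted.Length;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (sorted[mid] < target) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }
}
```

For the example array, the parent column is { 0, 1, 1, 2, 2, 2, 3 }. Asking for the children of node 2 (b) gives start 3 and count 3, which are the positions of d, e and f. Each lookup takes two binary searches, so it is logarithmic in the number of nodes and needs no per-node child pointers.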
I'm not sure if I am following your needs exactly, but we access a database that has a tree-view UI. The tree runs from
Top Level (the user's company);
Direct Client Company;
Office Location;
Employee;
Indirect Client Company;
Proposal;
Specific Vendor Bid;
Detail Financials (invoices, adjustments, etc)
As we have thousands of direct clients and the tree branches pretty heavily at each tier, we don't load the entire data set at any time. Instead we only load Type, Guid, DisplayName and some administrative data for each child, and load a "details" pane for the currently focused item. Unexplored paths through the tree simply don't exist in memory. To avoid loading the full lists of names, we have a "rolodex" level that just divides the data set into batches of 100 records or fewer. (So "Z" stands alone, but "Sa-St" subdivides S.) These are inserted automatically when a subset grows beyond the 100-record threshold.
Periodically (when the system is idle) we check the loaded count and if it exceeds a threshold we drop the least recently used nodes until we are below the threshold.
The actual data access is done when the user navigates: we access the database and refresh the subset they are navigating through. We do have the advantage that Type determines the table to query (both for that level and the children) and thus we can keep the individual record types indexed and accessible.
This has given the user the ability to navigate through the data in any way they want, while keeping the in-memory retained data minimized. Of course we offer standard search modes and a menu of "recently used history" (for each type) as well, but often the work they do requires moving up and down a narrow chain of nodes, so the tree view keeps it all in front of them while working with a given client and the subset.
With that background, I become curious as to the nature of data that would have such undifferentiated levels that such a tier by tier data store wouldn't be appropriate. The advantage that tier by tier storage has is that all I need is the current node's Guid and I can search the child table on that as the foreign key (which is indexed, so quick to return the subset).
I guess this amounts to "unasking the question", but it seems that most tree structures have distinct data at each level, so it would seem far easier to work with something established (like a table query on an indexed field, which keeps the whole affair out of memory in the first place) than making a custom structure.
For example, I have never asked for the "next node at the current level" except within a given parent (because leaving a given parent takes me to another context). But within a parent I already have the children and their order.
Perhaps it is because of the space I'm in, but I find a tree control that knows how to bind to different tables based on parent->child relationships of tables more useful, which is why I wrote one. I also find lazy loading of data and aggressive dismissal of data keep the memory footprint minimized. Finally I find the programming model incredibly simple: just create a new "treenode" subclass for any table I want to access and make the treenode responsible for loading their children.
(Clarifying, due to question below:)
In my implementation each TreeNode is actually a SpecificTreeNode, derived from BaseNode, which in turn is derived from TreeNode. Being inherited from TreeNode, they can be used directly by the tree, but because they override BaseNode members such as LoadChildren and the display properties, the display and retrieval of any given set of data is implied by the type of node (and the Guid that the item represents).
This means that as the user navigates the tree, the SpecificTreeNode generates the required ORM query on the fly. For performance, child tables have any parent IDs as indexes, so navigating down the tree (even by multiple layers, if using a SpecificTreeNode that does rollups) is just a quick index lookup.
In this way, I keep very little of the data in memory at any time, pulling only what we need from the database. Likewise, queries against the tree are converted to ORM queries against our database, pulling only the results and limiting the amount any query can pull (if you are using a Tree UI and you pull over 100 records at once, the UI isn't really the optimal place for whatever you are doing).
When your dataset is hundreds of GB in size, it seems the only reasonable recourse. The advantage I feel this has is that the Tree itself has no idea that different levels and paths render and query differently... it just asks the BaseNode (from its perspective) to do something, and the overrides on SpecificTreeNode actually do the lifting. Thus, the "data structure" is simply the way the tree works already combined with data queries on my tables and views.
(End of clarification.)
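The lazy-loading part of this design might be sketched as below. This is not the author's code: it leaves out the WinForms TreeNode base class, and QueryNode with its queryChildren delegate is a hypothetical stand-in for a SpecificTreeNode that would issue an ORM query keyed on the parent's Guid.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration: a node that loads its children on first access.
public abstract class BaseNode
{
    private List<BaseNode> _children; // null until first requested

    protected BaseNode(Guid id, string displayName)
    {
        Id = id;
        DisplayName = displayName;
    }

    public Guid Id { get; private set; }
    public string DisplayName { get; private set; }

    public bool IsLoaded
    {
        get { return _children != null; }
    }

    // Children are fetched once, the first time they are asked for.
    public IList<BaseNode> Children
    {
        get
        {
            if (_children == null)
                _children = new List<BaseNode>(LoadChildren());
            return _children;
        }
    }

    // Drops the cached children, e.g. when trimming least recently used nodes.
    public void Unload()
    {
        _children = null;
    }

    // Each node type decides how its children are retrieved.
    protected abstract IEnumerable<BaseNode> LoadChildren();
}

// Stand-in for a specific node type; the delegate plays the role of
// the database query filtered by the parent's Guid.
public sealed class QueryNode : BaseNode
{
    private readonly Func<Guid, IEnumerable<BaseNode>> _queryChildren;

    public QueryNode(Guid id, string displayName,
                     Func<Guid, IEnumerable<BaseNode>> queryChildren)
        : base(id, displayName)
    {
        _queryChildren = queryChildren;
    }

    protected override IEnumerable<BaseNode> LoadChildren()
    {
        return _queryChildren(Id);
    }
}
```

The tree only ever talks to BaseNode, so unexplored branches cost nothing until the user expands them, and Unload lets an idle-time sweep release branches that have not been used recently.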
Meanwhile all the tree controls on the market seem to miss that and go with something far more complex.
The most space-efficient way to store a balanced N-ary tree is in an array... zero space-overhead! And actually very efficient to traverse ... just some simple math required to compute your parent index from your index... or your N children's indices from your index.
To find some code for this, look up heapsort... the heap structure (nothing to do with memory heap) is a balanced binary tree... and most people implement it in an array. Although it is binary, the same thing can be done for N-ary.
If your N-ary tree is not kept balanced, but tends to be fairly dense, then the array implementation will still be more space-efficient than most... the empty nodes being the only space overhead. However, if your trees are ever highly imbalanced, then the array implementation can be very space-inefficient.
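The "simple math" for a complete N-ary tree stored level by level in an array, with the root at index 0, can be written out as follows. This is the standard implicit heap layout generalized from 2 to N children; the class name ArrayTree is hypothetical.

```csharp
// Hypothetical illustration: index arithmetic for a complete N-ary tree
// stored level by level in an array, root at index 0.
public static class ArrayTree
{
    // Index of the parent of 'index', or -1 for the root.
    public static int Parent(int index, int arity)
    {
        return index == 0 ? -1 : (index - 1) / arity;
    }

    // Index of the k-th child (k = 0 .. arity - 1) of 'index'.
    public static int Child(int index, int arity, int k)
    {
        return arity * index + 1 + k;
    }
}
```

With arity 3, the children of node 0 sit at indices 1, 2 and 3, the children of node 1 at 4, 5 and 6, and Parent(6, 3) is 1. With arity 2 this reduces to the familiar binary heap formulas 2i + 1 and 2i + 2.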

Need design suggestion for storing data for budget keeping application

I'm writing an application that I will use to keep up with my monthly budget. This will be a C# .NET 4.0 Winforms application.
Essentially my budget will be a matrix of data if you look at it visually. The columns are the "dates" at which that budget item will be spent. For example, I have 1 column for every Friday in the month. The Y axis is the name of the budget item (Car payment, house payment, eating out, etc). There are also categories, which are used to group the budget item names that are similar. For example, a category called "Housing" would have budget items called Mortgage, Rent, Electricity, Home Insurance, etc.
I need a good way to store this data from a code design perspective. Basically I've thought of two approaches:
One, I can have a "BudgetItem" class that has a "Category", "Value", and "Date". I would have a linear list of these items and each time I wanted to find a value by either date or category, I iterate this list in some form or fashion to find the value. I could probably use LINQ for this.
Second, I could use a 2D array which is indexed first by column (date) and second by row. I'd have to maintain categories and budget item names in a separate list and join the data together when I do my lookups somehow.
What is the best way to store this data in code? I'm leaning more towards the first solution but I wanted to see what you guys think. Later on when I implement my data persistence, I want to be able to persist this data to SQL server OR to an XML file (one file per monthly budget).
While your first approach looks nicer, obviously the second could be faster (depending on how you implement it). However, for a desktop application that is not performance critical, your first idea is definitely better, especially because it will help you a lot in maintaining your code. Also remember that the Entity Framework could be really nice in this situation.
Finally, if you know how to work with XML, I think it is the better choice for this type of project. A database is only required when you have a fair number of tables; as you explained, you will only have two tables (budgetitem and category), so I don't think you need a database for such a simple thing.
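For illustration, the first approach (a flat list of BudgetItem objects queried with LINQ) might look roughly like this. The Category, Value and Date properties come from the question; the Name property and the BudgetQueries helper are assumptions added for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One cell of the budget matrix: an item's amount on a given date.
public class BudgetItem
{
    public string Name { get; set; }      // assumed: e.g. "Mortgage"
    public string Category { get; set; }  // e.g. "Housing"
    public DateTime Date { get; set; }    // the column (e.g. a Friday)
    public decimal Value { get; set; }
}

// Hypothetical helper showing typical lookups over the flat list.
public static class BudgetQueries
{
    // Sum of all items in one category, across all dates.
    public static decimal TotalForCategory(IEnumerable<BudgetItem> items, string category)
    {
        return items.Where(i => i.Category == category).Sum(i => i.Value);
    }

    // Column totals: one sum per date.
    public static Dictionary<DateTime, decimal> TotalsByDate(IEnumerable<BudgetItem> items)
    {
        return items.GroupBy(i => i.Date.Date)
                    .ToDictionary(g => g.Key, g => g.Sum(i => i.Value));
    }
}
```

A monthly budget holds at most a few hundred such items, so scanning the list on every lookup is unlikely to be noticeable, and the same class can later be persisted to SQL Server or serialized to XML.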
