Are 'heavy' aggregate functions in RavenDB advisable? - c#

I'm working on a proof-of-concept timesheet application in C# that allows users to simply enter lots of timesheet records. The proof-of-concept will use RavenDB as the storage provider; however, the question below is perhaps more related to the NoSQL concept in general.
A user will typically enter between 1 and about 10 records each working day. Let's just say that for the sake of the discussion there will be a lot of records by the end of the year (tens or hundreds of thousands) for this specific collection.
The model for a record will be defined as:
class TimesheetRecord {
    public long Id { get; set; }
    public int UserId { get; set; }
    public bool IsApproved { get; set; }
    public DateTime DateFrom { get; set; }
    public DateTime DateTill { get; set; }
    public int? ProjectId { get; set; }
    public int? CustomerId { get; set; }
    public string Description { get; set; }
}
Logically, the application will allow the users, or project managers, to create reports on the fly. Think of on-the-fly reports like:
Total time spent for a project, customer or user
Time spent for a project, or customer in a certain time span like a week, month or between certain dates
Total number of hours not yet approved, by user - or for all users
Etc.
Of course, it is an option to add additional fields, like integers for week number, month, etc., to decrease the amount of crunching needed to filter on date/period. The idea is basically to use Query<T> functions by preference in order to generate the desired data.
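Roughly, I picture on-the-fly queries along these lines (just an untested sketch; documentStore, projectId, periodStart and periodEnd are placeholders):

// Total hours booked on a project within a period (sketch, untested).
using (var session = documentStore.OpenSession())
{
    var records = session.Query<TimesheetRecord>()
        .Where(r => r.ProjectId == projectId
                 && r.DateFrom >= periodStart
                 && r.DateTill <= periodEnd)
        .ToList();

    // Summed client-side after materializing the matching records.
    var totalHours = records.Sum(r => (r.DateTill - r.DateFrom).TotalHours);
}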
In a 'regular' relational table this would all be no problem. With or without normalization it would be a breeze. The proof-of-concept comes down to: will it blend just as well in a NoSQL variant? I'm asking because I have some doubts after being warned that these 'heavy' aggregate functions (like nested WHERE constraints, SUM, etc.) are not ideal in a document store.
Considering all this, I have two questions:
Is this advisable in a NoSQL variant, specifically RavenDB?
Is the approach correct?
I can imagine that storing all the data redundantly, instead of querying on the fly, would be more performant; for example, adding the hours spent by a certain user to a Project() or Customer() object. This, however, would increase the complexity of updates considerably, not to mention create immense amounts of redundant data all over the collections, which in turn seems like a direct violation of separation of concerns and DRY.
Any advice or thoughts would be great!

I'm a big fan of RavenDB, but it is not a silver bullet or golden hammer. It has scenarios for which it is not the best tool for the job, and this is probably one of them.
Specifically, document databases in general, and RavenDB in particular, aren't very applicable when the specific data access patterns are not known. RavenDB has the ability to create Map/Reduce indexes that can do some amazing things with aggregating data, but you have to know ahead of time how you want to aggregate it.
If you only have need for (let's say) 4 specific views on that data, then you can store that data in Raven, apply Map/Reduce indexes, and you will be able to access those reports with blazing speed because they will be asynchronously updated and always available with great performance, because the data will already be there and nothing has to be crunched at runtime. Of course, then some manager will go "You know what would be really great is if we could also see __." If it's OK that manager's request will require additional development time to create a new Map/Reduce index, UI, etc., then Raven could still be the tool for the job.
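For example, a Map/Reduce index for "total hours per project" might look roughly like this (an untested sketch against the standard AbstractIndexCreationTask API; adjust the names to your model):

// Sketch of a Map/Reduce index aggregating total hours per project.
public class TimesheetRecords_HoursByProject
    : AbstractIndexCreationTask<TimesheetRecord, TimesheetRecords_HoursByProject.Result>
{
    public class Result
    {
        public int? ProjectId { get; set; }
        public double Hours { get; set; }
    }

    public TimesheetRecords_HoursByProject()
    {
        Map = records => from r in records
                         select new
                         {
                             r.ProjectId,
                             Hours = (r.DateTill - r.DateFrom).TotalHours
                         };

        Reduce = results => from r in results
                            group r by r.ProjectId into g
                            select new
                            {
                                ProjectId = g.Key,
                                Hours = g.Sum(x => x.Hours)
                            };
    }
}

Querying it is then just session.Query<TimesheetRecords_HoursByProject.Result, TimesheetRecords_HoursByProject>() filtered by ProjectId, and the sum has already been computed by the index.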
However, it sounds like you have a scenario with a table of data that would essentially fit perfectly in Excel, and you want to be able to query that data in crazy ways that cannot be known until run time. In that case, you are better off going with a relational database. They were created specifically for that task and they're great at it.


What is the best approach to synchronizing large database tables across EF Core contexts?

My Scenario
I have three warehouse databases (Firebird) numbered 1, 2 and 3, each sharing the same schema and the same DbContext class. The following is the model of the Products table:
public class Product
{
    public string Sku { get; }
    public string Barcode { get; }
    public int Quantity { get; }
}
I also have a local "Warehouse Cache" database (MySQL) where I want to periodically download the contents of all three warehouses for caching reasons. The data model of a cached product is similar, with the addition of a number denoting the source warehouse index. This table should contain all product information from all three warehouses. If a product appears in both warehouses 1 and 3 (same Sku), then I want to have two entries in the local Cache table, each with the corresponding warehouse ID:
public class CachedProduct
{
    public int WarehouseId { get; set; } // Can be either 1, 2 or 3
    public string Sku { get; }
    public string Barcode { get; }
    public int Quantity { get; }
}
There are multiple possible solutions to this problem, but given the size of my datasets (~20k entries per warehouse), none of them seem viable or efficient, and I'm hoping that someone could give me a better solution.
The problem
If the local cache database is empty, then it's easy. Just download all products from all three warehouses, and dump them into the cache DB. However, on subsequent synchronizations, the cache DB will no longer be empty. In this case, I don't want to add all 60k products again, because that would be a tremendous waste of storage space. Instead, I would like to "upsert" the incoming data into the cache, so new products would be inserted normally, but if a product already exists in the cache (matching Sku and WarehouseId), then I just want to update the corresponding record (e.g. the Quantity could have changed in one of the warehouses since the last sync). This way, the number of records in the cache DB will always be exactly the sum of the three warehouses; never more and never less.
Things I've tried so far
The greedy method: This one is probably the simplest. For each product in each warehouse, check if a matching record exists in the cache table. If it does then update, otherwise insert. The obvious problem is that there is no way to batch/optimize this, and it would result in tens of thousands of select, insert and update calls being executed on each synchronization.
Clearing the cache: Clear the local cache DB before every synchronization, and re-download all the data. My problem with this one is that it leaves a small window of time when no cache data will be available, which might cause problems with other parts of the application.
Using an EF-Core "Upsert" library: This one seemed the most promising with the FlexLabs.Upsert library, since it seemed to support batched operations. Unfortunately the library seems to be broken, as I couldn't even get their own minimal example to work properly. A new row is inserted on every "upsert", regardless of the matching rule.
Avoiding EF Core completely: I have found a library called Dotmim.Sync that seems to be a DB-to-DB synchronization library. The main issue with this is that the warehouses are running FirebirdDB which doesn't seem to be supported by this library. Also, I'm not sure if I could even do data transformation, since I have to add the WarehouseId column before a row is added to the cache DB.
Is there a way to do this as efficiently as possible in EF Core?
There are a couple of options here. Which ones are viable depends on your staleness constraints for the cache. Must the cache always 100% reflect the warehouse state, or can it get out of sync for a period of time?
First, you absolutely should not use EF Core for this, except possibly as a client library for running raw SQL. EF Core is optimized for many small transactions; it doesn't do great with batch workloads.
The 'best' option is probably an event-based system. Firebird supports emitting events to an event listener, which would then update the cache based on the events. The risk here is that if event processing fails, you could get out of sync. You could mitigate that risk by using an event bus of some sort (Rabbit, Kafka), but Firebird event handling itself would be the weak link.
If the cache can handle some inconsistency, you could attach an expiry timestamp to each cache entry. Your application hits the cache, and if the expiry date is past, it rechecks the warehouse DBs. Depending on the business processes that update the source-of-truth databases, you may also be able to bust cache entries (e.g. if there's an order management system, it can bust the cache for a line item when someone makes an order).
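As a rough sketch (assuming the cache is read through an EF context and CachedProduct gains an ExpiresAtUtc column; the refresh helper is hypothetical):

// Sketch: lazily re-fetch a cache entry once its expiry has passed.
public async Task<CachedProduct> GetProductAsync(int warehouseId, string sku)
{
    var cached = await _cacheDb.CachedProducts
        .SingleOrDefaultAsync(p => p.WarehouseId == warehouseId && p.Sku == sku);

    if (cached == null || cached.ExpiresAtUtc < DateTime.UtcNow)
    {
        // Hypothetical helper that queries the warehouse DB and updates the cache row.
        cached = await RefreshFromWarehouseAsync(warehouseId, sku);
    }

    return cached;
}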
If you have to batch sync, do a swap table. Set up a table with the live cache data, a separate table you load the new cache data into, and a flag in your application that says which you read from. You read from table A while you load into B, then when the load is done, you swap to read from table B.
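A minimal sketch of that swap, with made-up table names and a one-row cache_config table holding the active table name (MySqlConnector client; the bulk-load helper is hypothetical):

// Sketch of the swap-table reload: load into the inactive table, then flip the flag.
public async Task ReloadCacheAsync(string connectionString, IReadOnlyList<CachedProduct> fresh)
{
    await using var conn = new MySqlConnection(connectionString);
    await conn.OpenAsync();

    var cmd = conn.CreateCommand();
    cmd.CommandText = "SELECT active_table FROM cache_config";
    var active = (string)await cmd.ExecuteScalarAsync();
    var inactive = active == "cache_a" ? "cache_b" : "cache_a";

    // Rebuild the inactive table while readers keep using the active one.
    cmd.CommandText = $"TRUNCATE TABLE {inactive}";
    await cmd.ExecuteNonQueryAsync();
    await BulkInsertAsync(conn, inactive, fresh);   // hypothetical bulk-load helper

    // Flip the flag; readers pick up the new table on their next query.
    cmd.CommandText = "UPDATE cache_config SET active_table = @t";
    cmd.Parameters.AddWithValue("@t", inactive);
    await cmd.ExecuteNonQueryAsync();
}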
For now I ended up going with a simple, yet effective solution that is fully within EF Core.
For each cache entry, I also maintain a SyncIndex column. During synchronization, I download all products from all three warehouses, I set SyncIndex to max(cache.SyncIndex) + 1, and I dump them into the cache database. Then I delete all entries from the cache with an older SyncIndex. This way I always have some cache data available, I don't waste a lot of space, and the speed is pretty acceptable too.
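In code it boils down to something like this (a simplified sketch; CacheDbContext, its CachedProducts set and the SyncIndex column are my own additions):

// Simplified sketch of the SyncIndex approach.
public async Task SynchronizeAsync(CacheDbContext cache, IReadOnlyList<CachedProduct> downloaded)
{
    var newIndex = (await cache.CachedProducts.MaxAsync(p => (int?)p.SyncIndex) ?? 0) + 1;

    foreach (var product in downloaded)
        product.SyncIndex = newIndex;

    cache.CachedProducts.AddRange(downloaded);
    await cache.SaveChangesAsync();

    // Only after the new generation is fully in place do the old rows go away,
    // so there is never a window without cache data.
    // (On EF Core 7+ this could be a single ExecuteDeleteAsync call.)
    var stale = cache.CachedProducts.Where(p => p.SyncIndex < newIndex);
    cache.CachedProducts.RemoveRange(stale);
    await cache.SaveChangesAsync();
}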

How to split an entity in Entity Framework and reduce size using link tables?

Imagine you have a simple class
public class Orders
{
    public int OrderId { get; set; }
    public string Note { get; set; }
}
Now imagine that Note is a field where the representative copy-pastes canned responses, or enters their own, so after a good while you have hundreds of thousands of highly repetitive values.
While the rep can enter their own values, most of the data (around 80%) is just repeated.
So we want to move the strings to a separate table, save only the distinct versions in the DB, and then link those distinct versions to the orders.
Once the data is entered there is no editing, so no changes will happen to the note data.
We want to approach this by using another two tables: one that holds the distinct strings, and one that holds the links. But we are struggling with how to configure EF to use this distinct logic.
At first glance we think EF can't do this out of the box. So where can we hook in something to customize how the note is saved?
This is almost certainly something that your database should handle. The only exception I could see is if there is a fixed set of strings that Note could be (i.e. effectively an enum).
If you use SQL Server it already does this for you. The SQL Server Page Compression docs seem to indicate that if you turn on Page Compression for a table it will do exactly what you're trying to do.
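For reference, turning it on is a one-off DDL statement; from EF you could issue it as raw SQL (a sketch; the table name is assumed, and on EF6 the call is ExecuteSqlCommand instead of ExecuteSqlRaw):

// One-off: enable page compression on the Orders table (SQL Server).
context.Database.ExecuteSqlRaw(
    "ALTER TABLE dbo.Orders REBUILD PARTITION = ALL WITH (DATA_COMPRESSION = PAGE)");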

Designing a Persistence Layer

For a project we are starting to look at persistence features and how we want to implement this. Currently we are looking at keeping Clean Architecture in mind, probably going for Onion Architecture. As such, we want to define a new outer layer in which the persistence layer resides.
We're looking at various ORM solutions (we seem to be converging on Entity Framework) using SQLite as the data store, and we are hitting a snag: how should we manage IDs and deal with adding/removing objects in some collection, or moving an instance between different collections?
In the core of our 'onion', we want to keep our POCO objects. As such, we do not want some kind of 'ID' property to be added to our business objects. Only inside the persistence layer do we want to have classes with object IDs. Because of this separation:
How should removing a business object from some collection cause a row to be deleted from the SQLite database?
More complex (at least I think it is): how should moving a POCO instance from one collection to another cause the foreign key of a SQLite database row to be changed (instead of removing the row and recreating it with the same values)?
Looking around the internet, I've yet to find an implementation that demonstrates a persistence layer in a Clean Architecture design. There are plenty of high-level diagrams and "depend only inward" advice, but no source code examples to serve as a demonstration.
Some possible solutions that we have come up with so far:
Have some lookup between POCO instances and their representative 'database model objects' (which have IDs, etc.) within the persistence layer. When saving the project state, business model objects are matched with these database model objects, the state of the matches is updated accordingly, and then the objects are persisted. (A sketch of this follows below.)
When loading a project, the persistence layer returns decorator objects of business objects that add an ID to the business object, which is only visible within the persistence layer by casting the objects to that decorator class. However, this prevents us from defining sealed POCO objects and seems to break the Clean Architecture design philosophy.
Option 1 seems costly in memory due to effectively doubling the business objects in memory. Option 2 seems the most elegant, but as I've written, it feels like it breaks Clean Architecture.
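To make Option 1 a bit more concrete, the lookup I have in mind would be private to the persistence layer, something like this (just a sketch, using types from System.Runtime.CompilerServices):

// Sketch of Option 1: a POCO -> database ID map kept inside the persistence layer.
// ConditionalWeakTable compares keys by reference and does not keep the POCOs alive,
// so entries vanish once the business objects are garbage collected.
internal class IdentityMap
{
    private readonly ConditionalWeakTable<object, StrongBox<long>> _ids =
        new ConditionalWeakTable<object, StrongBox<long>>();

    public void Track(object poco, long databaseId) =>
        _ids.Add(poco, new StrongBox<long>(databaseId));

    public long? TryGetId(object poco) =>
        _ids.TryGetValue(poco, out var box) ? box.Value : (long?)null;
}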
Are there better alternatives? Should we just go for Option 2 and treat Clean Architecture more as guidelines than rules? Can someone point us to a working example in code? (I did find an iOS example at https://github.com/luisobo/clean-architecture, but as I'm not literate in the language, I cannot do much with it.)
As others have mentioned in the comments, IDs are a natural part of applications and are usually required in other parts than persistence. So trying to avoid IDs at all costs is going to produce awkward designs.
Identity Design
However, identity design (where to use which IDs, what information to put in IDs, user defined vs system generated, etc.) is something that is very important and requires thought.
A good starting point to determine what requires an ID and what not is the Value Object / Entity distinction of domain-driven design.
Value objects are things that consist of other values and don't change - so you don't need an ID.
Entities have a lifecycle and change over time. So their value alone is not enough to identify them - they need an explicit ID.
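As a rough illustration (the domain types here are invented):

// Illustration: a value object is defined purely by its values, an entity by its identity.
public record Address(string Street, string City, string PostalCode);  // value object: no ID

public class Customer  // entity: it has a lifecycle, so it needs an explicit ID
{
    public Guid Id { get; } = Guid.NewGuid();
    public string Name { get; set; }
    public Address ShippingAddress { get; set; }  // replaced as a whole, never edited in place
}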
As you can see, this reasoning is very different from the technical point of view that you take in your question. This does not mean you should ignore constraints imposed by frameworks (e.g. Entity Framework), however.
If you want an in-depth discussion about identity design, I can recommend "Implementing DDD" by Vaughn Vernon (Section "Unique Identity" in Chapter 5 - Entities).
Note: I don't mean to recommend that you use DDD because of this. I just think that DDD has some nice guidelines about ID design. Whether or not to use DDD in this project is an entirely different question.
First of all, everything in the real world has an ID. You have your social security number. Cars have their registration numbers. Items in shops have an EAN code (and a production identity). Without IDs nothing in the world would work (a bit exaggerated, but hopefully you get my point).
It's the same with applications.
If your business objects do not have any natural keys (like a social security number) you MUST have a way to identify them. Your application will otherwise fail as soon as you copy your object or transfer it over the process boundary, because then it's a new object. It's like when you cloned the sheep Dolly. Is it the same sheep? No, it's Mini-Dolly.
The other part is that when you build complex structures you are violating the Law of Demeter. For instance:
public class ForumPost
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string Body { get; set; }
    public User Creator { get; set; }
}
public class User
{
    public string Id { get; set; }
    public string FirstName { get; set; }
}
When you use that code and invoke:
post.User.FirstName = "Arnold";
postRepos.Update(post);
what do you expect to happen? Should your forum post repository suddenly be responsible for changes made to the user?
That's why ORMs are so sucky. They violate good architecture.
Back to IDs. A better design is instead to use a user ID, because then we do not break the Law of Demeter and still keep a good separation of concerns.
public class ForumPost
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string Body { get; set; }
    public int CreatorId { get; set; }
}
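Loading the related user then becomes an explicit, separate call (the repository names here are only for illustration):

var post = postRepos.Get(postId);
var creator = userRepos.Get(post.CreatorId);  // the post repository never touches users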
So the conclusion is:
Do not abandon IDs; doing so introduces complexity when trying to identify the real object among all the copies of it that you will get.
Using IDs when referencing other entities helps you keep a good design with distinct responsibilities.

EF denormalize result of each group join

I have a 1-to-many relationship between a user and his/her schools. I often want to get the primary school for the user (the one with the highest "Type"). This results in having to join the primary school for every query I want to run. A user's schools barely ever change. Are there best practices on how to do this to avoid the constant join? Should I denormalize the models and if so, how? Are there other approaches that are better?
Thanks.
public class User
{
    public int Id { get; set; }
    public virtual IList<UserSchool> UserSchools { get; set; }
    ...
}
public class UserSchool
{
    public int UserId { get; set; }
    public string Name { get; set; }
    public int Type { get; set; }
    ...
}
...
var schools = (from r in _dbcontext.UserSchools
               group r by r.UserId into grp
               select grp.OrderByDescending(x => x.Type).FirstOrDefault());
var results = (from u in _dbcontext.Users
               join us in schools on u.Id equals us.UserId
               select new UserContract
               {
                   Id = u.Id,
                   School = us.Name
               });
In past projects, when I opted to denormalize data, I have denormalized it into separate tables which are updated in the background by the database itself, and tried to keep as much of the process contained in the database software, which handles these things much better. Note that any sort of "run every x seconds" solution will cause a lag in how up-to-date your data is. For something like this, it doesn't sound like the data changes that often, so being a few seconds (or minutes, or days, by the sound of it) out of date is not a big concern. If you're considering denormalization, then retrieval speed must be much more important.
I have never had "hard and fast" criteria for when to denormalize, but in general the data must be:
Accessed often. Like multiple times per page load often. Absolutely critical to the application often. Retrieval time must be paramount.
Time insensitive. If the data you need is changing all the time, and it is critical that the data you retrieve is up-to-the-minute, denormalization will have too much overhead to buy you much benefit.
Either an extremely large data set or the result of a relatively complex query. Simple joins can usually be handled by proper indexing, and maybe an indexed view.
Already optimized as much as possible. We've already tried things like indexed views, reorganizing indexes, rewriting underlying queries, and things are still too slow.
Denormalizing can be very helpful, but it introduces its own headaches, so you want to be very sure that you are ready to deal with those before you commit to it as a solution to your problem.

Optimize solution for categories tree search

I'm creating a kind of auction application, and I have to decide on the most optimized approach to this problem. I'm using BL Toolkit as my OR mapper (it has nice LINQ support) and ASP.NET MVC 2.
Background
I've got multiple Category objects that are created dynamically and that are saved in my database as a representation of this class:
class Category
{
    public int Id { get; set; }
    public int ParentId { get; set; }
    public string Name { get; set; }
}
Now every Category object can have multiple associated InformationClass objects, each representing a single piece of information in that category, for example a price or colour. Those classes are also dynamically created by the administrator and stored in the database. They are specific to a group of categories. The class that represents them looks like this:
class InformationClass
{
    public int Id { get; set; }
    public InformationDataType InformationDataType { get; set; }
    public string Name { get; set; }
    public string Label { get; set; }
}
Now I've got a third table that represents the join between them, like this:
class CategoryInformation
{
    public int InformationClassId { get; set; }
    public int AuctionCategoryId { get; set; }
}
Problem
Now the problem is that child categories need to inherit all of their parent category's InformationClass objects. For example, every product will have a price, so I need to add this InformationClass only to my root category. The frequency information can be added to the base CPU category, and it should be available in the AMD and Intel categories that derive from the CPU category.
Very often in my application, I have to know which InformationClass objects are related to a specified Category.
So here is my question: what will be the best-optimized solution to this problem? I've got some ideas, but I can't decide.
Load all categories from the database into an in-application table (cache) and take them from there every time - since the categories will not change too often, this will reduce the number of database requests, but it will still require a tree search using LINQ to Objects.
Invent (I don't know if it's possible) some fancy LINQ query that can do the tree search and get all the information class IDs without stressing the database too much.
Some other nice ideas?
I will be grateful for any answers and ideas. Thank you all in advance.
Sounds like a case for an idea I once had which I blogged about:
Tree structures and DAGs in SQL with efficient querying using transitive closures
The basic idea is this: In addition to the Category table, you also have a CategoryTC table which contains the transitive closure of the parent-child relationship. It allows you to quickly and efficiently retrieve a list of all ancestor or descendant categories of a particular category. The blog post explains how you can keep the transitive closure up-to-date every time a new category is created, deleted, or a parent-child relationship changed (it’s at most two queries each time).
The post uses SQL to express the idea, but I’m sure you can translate it to LINQ.
You didn’t specify in your question how the InformationClass table is linked to the Category table, so I have to assume that you have a CategoryInformation table that looks something like this:
class CategoryInformation
{
    public int CategoryId { get; set; }
    public int InformationClassId { get; set; }
}
Then you can get all the InformationClasses associated with a specific category by using something like this:
var categoryId = ...;
var infoClasses = db.CategoryInformation
    .Where(cinf => db.CategoryTC.Where(tc => tc.Descendant == categoryId)
                                .Any(tc => tc.Ancestor == cinf.CategoryId))
    .Select(cinf => db.InformationClass
                      .FirstOrDefault(ic => ic.Id == cinf.InformationClassId));
Does this make sense? Any questions, please ask.
In the past (pre SQLServer 2005 and pre LINQ) when dealing with this sort of structure (or the more general case of a directed acyclic graph, implemented with a junction table so that items can have more than one "parent"), I've either done this by loading the entire graph into memory, or by creating a trigger-updated lookup table in the database that cached the relationship of ancestor to descendant.
There are advantages to either, and which wins out depends on update frequency and on the complexity of the objects outside of the parent-child relationship itself. In general, loading into memory allows for faster individual look-ups, but with a large graph it doesn't natively scale as well due to the amount of memory used on each webserver ("each" here because web-farm situations are one where having items cached in memory brings extra issues), meaning that you will have to be very careful about how things are kept in sync to counteract that effect.
A third option now available is to do ancestor lookup with a recursive CTE:
CREATE VIEW [dbo].[vwCategoryAncestry]
AS
WITH recurseCategoryParentage (ancestorID, descendantID)
AS
(
    SELECT parentID, id
    FROM Categories
    WHERE parentID IS NOT NULL
    UNION ALL
    SELECT ancestorID, id
    FROM recurseCategoryParentage
    INNER JOIN Categories ON parentID = descendantID
)
SELECT DISTINCT ancestorID, descendantID
FROM recurseCategoryParentage
Assuming that root categories are indicated by having a null parentID.
(We use UNION ALL since we're going to SELECT DISTINCT afterwards anyway, and this way we have a single DISTINCT operation rather than repeating it).
This allows us to do the look-up table approach without the redundancy of that denormalised table. The efficiency trade-off is obviously different and generally poorer than with a table, but not by much (slight hit on select, slight gain on insert and delete, negligible space gain), while the guarantee of correctness is greater.
I've ignored the question of where LINQ fits into this, as the trade-offs are much the same whatever way this is queried. LINQ can play nicer with "tables" that have individual primary keys, so we can change the select clause to SELECT DISTINCT (cast(ancestorID as bigint) * 0x100000000 + descendantID) as id, ancestorID, descendantID and define that as the primary key in the [Column] attribute. Of course all columns should be indicated as DB-generated.
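For example, with that synthetic id column added to the view, the LINQ to SQL mapping could look something like this (untested sketch):

// Untested sketch: mapping the view for LINQ to SQL with the synthetic bigint key.
[Table(Name = "vwCategoryAncestry")]
public class CategoryAncestry
{
    [Column(Name = "id", IsPrimaryKey = true, IsDbGenerated = true)]
    public long Id { get; set; }

    [Column(Name = "ancestorID", IsDbGenerated = true)]
    public int AncestorId { get; set; }

    [Column(Name = "descendantID", IsDbGenerated = true)]
    public int DescendantId { get; set; }
}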
Edit. Some more on the trade-offs involved.
Comparing the CTE approach with look-up maintained in database:
Pro CTE:
The CTE code is simple, the above view is all the extra DB code you need, and the C# needed is identical.
The DB code is all in one place, rather than there being both a table and a trigger on a different table.
Inserts and deletes are faster; this doesn't affect them, while the trigger does.
While semantically recursive, it is so in a way the query planner understands and can deal with, so it's typically (for any depth) implemented in just two index scans (likely clustered), two light-weight spools, a concatenation and a distinct sort, rather than in the many many scans that you might imagine. So while certainly a heavier scan than a simple table lookup, it's nowhere near as bad as one might imagine at first. Indeed, even the nature of those two index scans (same table, different rows) makes it less expensive than you might think when reading that.
It is very very easy to replace this with the table look-up if later experience proves that to be the way to go.
A lookup table will, by its very nature, denormalise the database. Purity issues aside, the "bad smell" involved means that this will have to be explained and justified to any new dev, as until then it may simply "look wrong" and their instincts will send them on a wild-goose chase trying to remove it.
Pro Lookup-Table:
While the CTE is faster to select from than one might imagine, the lookup is still faster, especially when used as part of a more complicated query.
While CTEs (and the WITH keyword used to create them) are part of the SQL 99 standard, they are relatively new and some devs don't know them (though I think this particular CTE is so straightforward to read that it counts as a good learning example anyway, so maybe this is actually pro CTE!)
While CTEs are part of the SQL 99 standard, they aren't implemented by some SQL databases, including older versions of SQLServer (which are still in live use), which may affect any porting efforts. (They are though supported by Oracle and Postgres, among others, so at this point this may not really be an issue.)
It's reasonably easy to replace this with the CTE version later, if later experience suggests you should.
Comparing (both) the db-heavy options with in-memory caching.
Pro In-Memory:
Unless your implementation really sucks, it is going to be much faster than DB lookups.
It makes some secondary optimisations possible on the back of this change.
It is reasonably difficult to change from DB to in-memory if later profiling shows that in-memory is the way to go.
Pro Querying DB:
Start-up time can be very slow with in-memory.
Changes to the data are much much simpler. Most of the points are aspects of this. Really, if you go the in-memory route then the question of how to handle changes invalidating the cached information becomes a whole new ongoing concern for the lifetime of the project, and not a trivial one at all.
If you use in-memory, you are probably going to have to use this in-memory store even for operations where it is not relevant, which may complicate where it fits with the rest of your data-access code.
It is not necessary to track changes and cache freshness.
It is not necessary to ensure that every webserver in a web-farm and/or web-garden solution (a certain level of success will necessitate this) has precisely the same degree of freshness.
Similarly, the degree of scalability across machines (how close to 100% extra performance you get by doubling the number of webservers and DB slaves) is higher.
With in-memory, memory use can become very high if either (a) the number of objects is high or (b) the size of the objects is large (fields, esp. strings, collections and objects which themselves have a string or collection). Possibly "we need a bigger webserver" amounts of memory, and that goes for every machine in the farm.
That heavy memory use is particularly likely to continue to grow as the project evolves.
Unless changes cause an immediate refresh of the in-memory store, the in-memory solution will mean that the view used by the people in charge of administrating these categories will differ from what is seen by customers, until they are re-synchronised.
In-memory resynching can be very expensive. Unless you're very clever with it, it can cause random (to the user) massive performance spikes. If you are clever with it, it can exacerbate the other issues (esp. in terms of keeping different machines at an equivalent level of freshness).
Unless you're clever with in-memory, those spikes can accumulate, putting the machine into a long-term hang. If you are clever with avoiding this, you may exacerbate other issues.
It is very difficult to move from in-memory to hitting the db should that prove the way to go.
None of this leans with 100% certainty to one solution or the other, and I certainly am not going to give a clear answer, as doing so would be premature optimisation. What you can do a priori is make a reasonable decision about which is likely to be the optimal solution. Whichever you go for, you should profile afterwards, esp. if the code does turn out to be a bottleneck, and possibly change. You should also do so over the lifetime of the product, as both changes to the code (fixes and new features) and changes to the dataset can certainly change which option is optimal (indeed, it can change from one to another and then change back to the previous one over the course of the lifetime). This is why I included considerations of the ease of moving from one approach to another in the above list of pros and cons.
