I need to know the tradeoffs of using a denormalized table versus two separate tables accessed via joins. I am using Entity Framework 4.
In my case I have two tables Order and OrderCategoryDetails.
I am wondering whether merging these two tables into a single table would be better.
If denormalized, the added columns (OrderCategory and OrderSubCategory) will be sparse (they could be 100% empty, and will always be at least 50% empty).
On the other hand, if I keep it as it is, I am worried about frequent join operations being executed (i.e. whenever I am querying for a specific Order, I would need information from OrderCategoryDetails too).
At present, I have normalized tables and use navigational properties:
To access Order Category information from an OrderItem instance:
OrderItem orderItem = _context.OrderItems.Where(...).FirstOrDefault();
if(2 == orderItem.SalesOrder.Category.OrderCategory){ ...}
To access Order Category information from an Order instance:
Order order = _context.Orders.Where(...).FirstOrDefault();
if(2 == order.Category.OrderCategory){ ...}
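If the join itself turns out to be the concern, I assume I could also eager-load the category together with the order in a single query rather than relying on lazy loading, something like this (just a sketch, using the same navigation properties as above):
Order order = _context.Orders
                      .Include("Category")   // pulls the OrderCategoryDetails row in the same SQL query
                      .Where(...)
                      .FirstOrDefault();
if(order != null && 2 == order.Category.OrderCategory){ ...}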
This is my schema:
Table : Order
ID (Primary Key)
Date
Amount
ItemCount
OrderCategoryInfo (FK - join with OrderCategoryDetails on OrderCategoryDetails.ID)
Table : OrderCategoryDetails
ID (Primary Key)
OrderCategory
OrderSubCategory
Table : OrderItem
OrderItem ID (Primary key)
Order ID (FK - Join with Order)
Database used: SQL Server 2008 R2
My general advice would be to ask yourself the following question: does every single row from the first table require a row from the second table? If the answer is yes, you might be better off de-normalising the data. If the answer is no, you're probably better off keeping it as a separate table.
As long as you set up your foreign key association between the two tables, you shouldn't concern yourself with the performance implications of performing a join. It will only become an issue in pathological situations.
Based upon your answers in the comments thread, I'd recommend that you should keep the tables separate and set up a foreign key relationship between the two.
If you do get any performance problems further down the line, run a profiler on the problematic SQL and add any indexes that the profiler recommends, but only do this for queries that are used frequently. Indexes are great for speeding up queries but come at the cost of insert performance, so take care with them.
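Purely as an illustration (the index name and columns here are invented, not something a profiler has actually recommended), such an index can be created from the EF 4 ObjectContext if you prefer to keep it in code, or simply run from SSMS:
// Illustrative only: apply the index the tuning run suggested for a frequent query.
_context.ExecuteStoreCommand("CREATE NONCLUSTERED INDEX IX_Order_Date ON [Order] ([Date]) INCLUDE ([Amount], [ItemCount]);");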
I'm a developer who started working with a database architect. He is using a design I've never seen before, which I think will have negative performance implications.
In transaction tables he is using two fields in every table: RowBeginDate and RowEndDate. There is a parent table with just a few fields that never change, called a PersonHeader table. Its key is used as a FK in the child Person table. The Person table's PK is the FK from the PersonHeader table AND the RowBeginDate for that row. To retrieve the current row, I always need to check for the row whose RowEndDate is NULL.
I haven't gotten into the details yet of how this will affect performance of Entity Framework, but I suspect that it will not be efficient.
I've worked on a number of projects and have never seen this approach. What are the performance implications of having this many dead records in a transaction table? I don't think there will be many updates, but I would estimate that the database's Person table could end up having 500,000 rows or more, not to mention the detail tables.
When working with applications that have auditing requirements, it is not uncommon to have to maintain historical versions of records. I have done similar things (for example, when storing a history of changes to employee records) using an EmployeeID and an UpdatedOn field as a key, so I can get the latest version of the record.
Provided that the tables are properly indexed and that the indexes don't end up being too large because of the composite key, I wouldn't worry about performance or the number of records. I've worked with tables that contained half a billion records and performance was still fine (rebuilding the indexes took a while though).
You could even create an interceptor in Entity Framework that would allow you to filter out "dead" records when performing your queries (see "Entity Framework soft delete implementation using database interceptor not working").
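Even without the interceptor, getting the current version is just a filter on RowEndDate. A rough sketch (the entity set and property names are assumed to match the schema below):
// Current (non-expired) Person row for a given PersonHeader key:
// exactly one row per PersonKey has RowEndDate == null.
var currentPerson = context.People
    .Where(p => p.PersonKey == personKey && p.RowEndDate == null)
    .SingleOrDefault();

// All historical versions, newest first:
var history = context.People
    .Where(p => p.PersonKey == personKey)
    .OrderByDescending(p => p.RowBeginDate)
    .ToList();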
PersonHeader
PersonKey (PK)
Person
PersonKey (PK, FK)
RowBeginDate(PK)
RowEndDate
(other columns)
(I added this as an answer, since a comment wouldn't take a new line...)
PersonLocation
LocationTypeKey (PK, FK)
PersonKey (PK, FK)
LocationKey (FK)
RowBeginDate (not a key)
RowEndDate
Location
LocationKey (PK)
StateKey (FK)
RowBeginDate (not a key)
RowEndDate
I have many tables that have the same number of columns and the same column names, because they are all lookup tables.
For example, there are LabelType and TaskType tables. The LabelType and TaskType tables both have TypeID and TypeName columns. They will be used as foreign keys in other tables, such as the LabelType table with the ShippingLog table and the TaskType table with the EmployeeTask table.
LabelType Table
TypeID TypeName
1 Fedex
2 UPS
3 USPS
TaskType Table
TypeID TypeName
1 Receiving
2 Pickup
3 Shipping
So far I have more than 20 such tables, and I am expecting the number to keep increasing.
I have no problem with it, but I am wondering whether there is a better or smarter way of using tables. I was even thinking of consolidating all those tables into one lookup table and differentiating the rows by adding a foreign key to a lookup-category table, which would have data like Label, Task, etc. Then I would just need one or two tables for all that lookup data.
Please, advise me if you have any better or smarter way of data modeling.
Just because data has similar structure doesn't mean it has the same meaning or same constraints. Keep your lookup tables separate. This keeps foreign keys separate, so the database can protect itself from referencing the wrong kind of lookup data.[1]
I wish relational DBMSes supported inheritance, where you could define the basic structure in the parent table and just add specific FKs in the child tables. As it stands now, you'll need to endure some repetition in your DDL...
NOTE: One exception from "keep lookup tables separate" rule might be when your system needs to be dynamic (i.e. be able to add new kinds of lookup data without actually creating new physical tables in the database), but it doesn't look that way from your question.
[1] With one big lookup table, FKs alone won't stop (for example) the ShippingLog table from referencing a row meant for the EmployeeTask table. By using identifying relationships and migrating PKs, you can protect yourself from this, but not without introducing some redundancies and needing some careful constraining. It's cleaner and probably more performant to simply do the right thing and keep lookup tables separate.
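To make the point concrete, here is a small sketch of the application-side view of two separate lookup tables (C# classes standing in for the tables; the names come from your example, the key columns are assumptions):
// Separate lookup tables mean separate, type-safe references:
public class LabelType { public int TypeID { get; set; } public string TypeName { get; set; } }
public class TaskType  { public int TypeID { get; set; } public string TypeName { get; set; } }

public class ShippingLog
{
    public int ShippingLogID { get; set; }
    public int LabelTypeID { get; set; }          // FK can only reference LabelType
    public virtual LabelType LabelType { get; set; }
}

public class EmployeeTask
{
    public int EmployeeTaskID { get; set; }
    public int TaskTypeID { get; set; }           // FK can only reference TaskType
    public virtual TaskType TaskType { get; set; }
}
With one combined lookup table, both FK columns would point at the same table, and nothing in the schema would stop a ShippingLog row from referencing a Task row.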
Keep your lookup tables separate. It's faster at lookup time, and you will do millions of lookups between each time you add a new lookup table.
A lot of tables is not a big problem.
I have two database tables (SQL CE): a Teacher table and a Class table. The two tables have a one-to-many relationship, where one teacher has many classes (i.e. Class has a foreign key teacher_id). The teacher rows are inserted (or generated) through C# code at run time, and so are the classes.
Which of the following is faster in INSERT and SELECT?
Each time a new teacher is INSERTed, a new Class table is created (e.g. Class_teacher001) to store whichever classes the teacher has. In this case, each Class table doesn't have to be so large and a foreign key is not needed because the table name identifies itself. But there will be one Teacher table and many Class_xxx tables.
Only one Teacher table and one Class table. Each class row has a foreign key pointing at the Teacher table. There is only one Class table, but it will get very long; I worry searching and reading will be slow.
Regardless of which is faster, option (2) is the way to go; simply create indexes to support your searches. This is how almost all relational databases are used.
The nightmare of maintaining option (1) makes me shudder.
OK, where to start. First, the relationship between Teacher and Class is potentially many-to-many, but as described by you is at least one-to-many.
The first option is absolutely the wrong way to go. Never dynamically create tables. The second option is how this sort of thing is handled. Databases are powerful, written by very smart people (usually), and can handle many more rows than all the students at a given school.
As long as you properly index your tables, they can easily support hundreds of millions of records.
I also agree with Mitch Wheat. When you create a clustered index, the table is physically sorted according to it, so consider creating a combined index on (Teacher_Id, Class_Id).
That will help make retrieval with SELECT statements fast.
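A sketch of creating that combined index on SQL CE from the C# side (the table and column names are assumptions based on the description):
// One-time setup: SELECTs filtered by teacher_id can then use an index seek.
using (var connection = new SqlCeConnection(connectionString))   // System.Data.SqlServerCe
{
    connection.Open();
    using (var command = new SqlCeCommand(
        "CREATE INDEX IX_Class_Teacher ON Class (teacher_id, class_id)", connection))
    {
        command.ExecuteNonQuery();
    }
}
SQL CE supports plain CREATE INDEX statements, so this can run as part of the same code that generates the rows.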
Unless you are already having performance problems, I would not worry about them. There are many things that can cause performance problems other than the number of rows, and they should be dealt with differently depending on what they are. You have to worry more about the number of columns in a table affecting performance than you do about the number of rows. Also the number of concurrent connections to the database. One million rows in a table is not that many, it is the other two items in conjunction with that many rows that will make a database slow. You should use the second option.
I need to update a bit field in a table and set this field to true for a specific list of Ids in that table.
The Ids are passed in from an external process.
I guess in pure SQL the most efficient way would be to create a temp table and populate it with the Ids, then join the main table with this and set the bit field accordingly.
I could create a SPROC to take the Ids, but there could be 200-300,000 rows involved that need this flag set, so it's probably not the most efficient way. Using an IN statement has limitations with respect to the amount of data that can be passed, and performance.
How can I achieve the above using the Entity Framework?
I guess it's possible to create a SPROC to create a temp table, but this would not exist from the model's perspective.
Is there a way to dynamically add entities at run time? [Or is this approach just going to cause headaches?]
I'm making the assumption above though that populating a temp table with 300,000 rows and doing a join would be quicker than calling a SPROC 300,000 times :)
[The Ids are Guids]
Is there another approach that I should consider?
For data volumes like 300k rows, I would forget EF. I would do this by having a table such as:
BatchId RowId
Where RowId is the PK of the row we want to update, and BatchId just refers to this "run" of 300k rows (to allow multiple at once etc).
I would generate a new BatchId (this could be anything unique - a Guid leaps to mind), and use SqlBulkCopy to insert the records into this table, i.e.
100034 17
100034 22
...
100034 134556
I would then use a single sproc to do the join and update (and delete the batch from the table).
SqlBulkCopy is the fastest way of getting this volume of data to the server; you won't drown in round-trips. EF is object-oriented: nice for lots of scenarios, but not this one.
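A rough sketch of that approach (the staging table, column, flag and variable names are all assumptions):
var batchId = Guid.NewGuid();

// 1. Build the (BatchId, RowId) pairs in memory.
var table = new DataTable();
table.Columns.Add("BatchId", typeof(Guid));
table.Columns.Add("RowId", typeof(Guid));                 // the Ids are Guids
foreach (var id in idsFromExternalProcess)                // assumed IEnumerable<Guid>
    table.Rows.Add(batchId, id);

// 2. One bulk round-trip to the staging table.
using (var connection = new SqlConnection(connectionString))
{
    connection.Open();
    using (var bulkCopy = new SqlBulkCopy(connection) { DestinationTableName = "FlagBatch" })
        bulkCopy.WriteToServer(table);

    // 3. Set-based join + update, then clear the batch (this is the sproc's job).
    using (var command = new SqlCommand(
        @"UPDATE m SET m.Flag = 1
          FROM MainTable m JOIN FlagBatch b ON b.RowId = m.Id
          WHERE b.BatchId = @batchId;
          DELETE FROM FlagBatch WHERE BatchId = @batchId;", connection))
    {
        command.Parameters.AddWithValue("@batchId", batchId);
        command.ExecuteNonQuery();
    }
}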
I'm marking Marc's response as the answer, but I'd just like to give a little detail on how we implemented the requirement.
Marc's response helped greatly in the formulation of our solution.
We had to deal with an aim/guideline to keep within the Entity Framework while not utilising SPROCs, and although our solution may not suit others, it has worked for us.
We created an Item table in the database with BatchId [uniqueidentifier] and ItemId [varchar] columns.
This table was added to the EF model, so we did not use temporary tables.
On upload of these Ids, this table is populated with them [inserts are quick enough, we find, using EF].
We then use context.ExecuteStoreCommand to run the SQL that joins the Item table and the main table and updates the bit field in the main table for records that exist for the batch Id created specifically for that session.
We finally clear this table for that batchId.
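In outline, the update and clean-up steps look something like this (the SQL and names are simplified; the real table and column names differ):
// Set-based update joining the EF-mapped Item table to the main table for this batch.
context.ExecuteStoreCommand(
    @"UPDATE m SET m.Flag = 1
      FROM MainTable m JOIN Item i ON i.ItemId = m.Id
      WHERE i.BatchId = {0}", batchId);

// Clear the batch rows once the flags are set.
context.ExecuteStoreCommand("DELETE FROM Item WHERE BatchId = {0}", batchId);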
We have the performance, while keeping within our no-SPROC goal [which not all of us agree with :) but it's a democracy].
Our exact requirements are a little more complex, but insofar as we needed good update performance from the Entity Framework given our specific restrictions, it works fine.
Liam
The application is single-user, 1-tier (1 PC); the database is SQL CE. The data service layer will be (I think): a repository returning domain objects and querying the database with LINQ to SQL (dbml). There are obviously a lot more columns; this is a simplified view.
LogTime in separate table: http://i53.tinypic.com/9h8cb4.png
LogTime in ItemTimeLog table (as Time): http://i51.tinypic.com/4dvv4.png
This is my first attempt at creating a database with more than two tables. I think the table schema makes sense, but I need some reassurance or criticism, because the table relations look quite scary to be honest. I'm hoping you could:
Look at the table schema and respond if there are clear signs of trouble or errors that you spot right away. And if you have time,
Look at the Program Summary/Questions, and see if the table layout makes sense for those points.
Please be brutal, I will try to defend :)
Program summary:
a) A set of categories, each having a set of strategies (1:m)
b) Each day a number of items will be produced. And each strategy MAY reference it.
(So there can be 50 items, and a strategy may reference 23 of them)
c) An item can be referenced by more than one strategy. So I think it's an m:m relation.
d) Status values will be logged at fixed time-fractions through the day, for:
- each Strategy, each StrategyItem, and each Item
e) An action on an item may be executed by a strategy that reference it.
- This is logged as ItemAction (Could have called it StrategyItemAction)
User Requests
b) -> e) describe the main activity mode of the program: to work with only today's DayLog, for each category. A second-priority activity is retrieval of history, which typically will be: from all categories, from day x to day y, get all StrategyDailyLog rows.
Questions
First, does the overall layout look sound? I'm worried that there are so many relationships in all directions, connecting everything. Is this normal, or does it look like trouble?
StrategyItem is made to represent an m:m relationship. Is it correct as I noted 1:m / 1:1 (marked red)?
StrategyItemTimeLog and ItemTimeLog log values that both need to be retrieved together when retrieving a StrategyItem. The reason I separated them is that the first one is strategy-specific, and several strategies can reference the same item. So I thought not to duplicate those values that do not depend on the strategy, but only on the item. Hence I also dragged out the LogTime, as it seems to be the only parameter to unite the logs. But this all looks quite disturbing with those 3 tables. Does it make sense at all, or do you have a suggestion?
Pink circles show my vague attempt at Aggregate Root paths. I've been thinking in terms of "what entity is responsible for delete". Though I'm unsure about the actual root, I think it's Category. Does it make sense related to the User Requests described above?
EDIT1:
(Updated schema, showing typical number of hierarchy items for the first few relations, for 365 days, and additional explanations)
1:1 relation: Sorry. I made a mistake. The StrategyDailyLog should be 1:m. See updated schema. It is one per Strategy, per day.
DayLog / StrategyDailyLog: I've been pondering whether DayLog should be a part of the hierarchy like this or not. The purpose of the DayLog table is to hold "sum values" derived from all the StrategyDailyLog rows for the same day, like performance values for that day. It also holds the date value, which allows me to omit a date value in the StrategyDailyLog (which I feel would be a kind of duplicate modelling of the date field); instead the reference to DayLog exists to "find" the date. I'm not sure if this is an abuse/misconception of normalization.
Null values: I hadn't thought about this. I believe I found 2, as now marked in StrategyDailyLog and ItemAction. They cannot be null on creation, but they can be set to null if one needs to delete either a Strategy or a StrategyItem. That should not require a delete of the StrategyDailyLog and the ItemAction; hence they can be set to null.
All Id columns: My idea was to have ID (an autogenerated integer) as the PK for all my tables. I believed that would also be sufficient as a candidate key. Is this not a proper way to make PKs? It's the only way any table of mine can be identified. I asked a question before about whether that was OK; maybe I misunderstood, but I thought it was a good approach.
m:m relation: This is what I have attempted to do: StrategyItem is the m:m table of StrategyDailyLog / DailyItem.
Ok. Here is me being brutal. I do not understand the model.
So instead of trying to comment on that so much, here are some thoughts that came to my mind when I looked at it.
I think you should have a look at your 1:1 relationships (all of them). Why are DayLog and StrategyDailyLog separated into two tables? Probably because you will always have at least one DayLog item, but not every DayLog item has a StrategyDailyLog item. If that is the case, you can have a StrategyID FK in the DayLog table with the allow-nulls option.
It would help to understand the model if you could show which fields are required and which fields accept null as a value.
All your tables have their own id column. That can be quite confusing when doing 1:1 relations and m:m relations. For a 1:1 relation, the relation between the two tables is usually made on the primary key in both tables. If you do not do that, you have to create a candidate key on the foreign key column. In your case that means that StrategyDailyLog should have a candidate key on DayLogID.
A m:m relation between two tables is usually solved by adding a new table in between, with the primary keys from both tables. Those fields together form the primary key for the table in the middle.
Let's say, for example, that you should have a m:m relationship between Category and Strategy. You should then create a table called CategoryStrategy with two fields, CategoryID and StrategyID, that together form the primary key for the CategoryStrategy table.
I hope my comments make sense and that they are useful to you.
EDIT 2011-01-17
I do not think that you should make it a principle to use an IDENTITY column as the primary key in every table. A m:m relation does not need it, so you should not do it there. I also think that you have misunderstood what I meant by a candidate key. A candidate key is a key that could have been used as the primary key. In MS SQL Server you define a UNIQUE CONSTRAINT for your candidate key.
Example: table StrategyItem has id as its PK, but the combination of StrategyID and DailyItemID is the candidate key. Better would be to remove id and use StrategyID+DailyItemID as the PK.
Below is the schema that I would have built with your description. I might have missed something important because I do not know everything about what you want to do.
You should not think so much about query performance and building aggregates when designing the schema. That can be handled by creating indexes on columns and using sum, count and group by in your queries. An index on the Created column in the model below would be necessary for your queries on a date or date interval. In MS SQL Server there is something called the clustered index. By default the PK of a table is the clustered index, but in this case I would make the index on the Created column the clustered index.
A Category has 0, 1 or more Strategies.
A LogItem has one Category and optionally one Strategy.
LogItem.Created holds date and time.
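With that schema, the history request ("from all categories, from day x to day y") becomes a plain range query on Created, which is exactly what the index above is for. A sketch in LINQ to SQL (the DataContext and table names are assumptions):
// All log items from fromDate up to and including toDate, oldest first.
DateTime upperBound = toDate.AddDays(1);   // exclusive upper bound covers all of day y
var history = db.LogItems
    .Where(l => l.Created >= fromDate && l.Created < upperBound)
    .OrderBy(l => l.Created)
    .ToList();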