I'm a developer that started working with a database architect. He is using a design I've never seen before, that I think will have negative performance implications.
In transaction tables he is using two fields in every table: RowBeginDate and RowEndDate. There is a parent table with just a few fields that never change, called PersonHeader. Its key is used as an FK in the child Person table. The Person table's PK is the FK to the PersonHeader table AND the RowBeginDate for that row. To retrieve the current row, I always need to look for the row whose RowEndDate is NULL.
I haven't gotten into the details yet of how this will affect performance of Entity Framework, but I suspect that it will not be efficient.
I've worked on a number of projects and have never seen this approach. What are the performance implications of having this many dead records in a transaction table? I don't think there will be many updates, but I would estimate that the Person table could end up having 500,000 rows or more, not to mention the detail tables.
When working with applications that have auditing requirements, it is not uncommon to have to maintain historical versions of records. I have done similar things (for example, when storing a history of changes to employee records), using an EmployeeID and UpdatedOn field as a key so I can get the latest version of the record.
Provided that the tables are properly indexed and that the indexes don't end up being too large because of the composite key, I wouldn't worry about performance or the number of records. I've worked with tables that contained half a billion records and performance was still fine (rebuilding the indexes took a while though).
You could even create an interceptor for Entity Framework that would allow you to filter out "dead" records when performing your queries (see "Entity Framework soft delete implementation using database interceptor not working").
PersonHeader
PersonKey (PK)
Person
PersonKey (PK, FK)
RowBeginDate(PK)
RowEndDate
(other columns)
(I added this as an answer, since a comment wouldn't take a new line...)
PersonLocation
LocationTypeKey (PK, FK)
PersonKey (PK, FK)
LocationKey (FK)
RowBeginDate (not a key)
RowEndDate
Location
LocationKey (PK)
StateKey (FK)
RowBeginDate (not a key)
RowEndDate
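The row-versioning scheme described above (a composite key of PersonKey and RowBeginDate, with a NULL RowEndDate marking the current row) can be sketched as follows. This is only an illustration under stated assumptions: SQLite stands in for the real database, the Name column and the helper functions save_version and current_person are hypothetical, and the partial index is SQLite's counterpart of what SQL Server calls a filtered index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE PersonHeader (PersonKey INTEGER PRIMARY KEY);
CREATE TABLE Person (
    PersonKey    INTEGER NOT NULL REFERENCES PersonHeader(PersonKey),
    RowBeginDate TEXT    NOT NULL,
    RowEndDate   TEXT,              -- NULL marks the current version
    Name         TEXT    NOT NULL,  -- hypothetical "other column"
    PRIMARY KEY (PersonKey, RowBeginDate)
);
-- Index only the current rows, and allow at most one current row per person.
CREATE UNIQUE INDEX IX_Person_Current
    ON Person(PersonKey) WHERE RowEndDate IS NULL;
""")

def save_version(conn, person_key, name, as_of):
    """Close the current version (if any) and insert a new current one."""
    with conn:  # one transaction: both statements commit or neither does
        conn.execute(
            "UPDATE Person SET RowEndDate = ? "
            "WHERE PersonKey = ? AND RowEndDate IS NULL",
            (as_of, person_key),
        )
        conn.execute(
            "INSERT INTO Person (PersonKey, RowBeginDate, Name) VALUES (?, ?, ?)",
            (person_key, as_of, name),
        )

def current_person(conn, person_key):
    """Return (Name, RowBeginDate) of the live row, or None."""
    return conn.execute(
        "SELECT Name, RowBeginDate FROM Person "
        "WHERE PersonKey = ? AND RowEndDate IS NULL",
        (person_key,),
    ).fetchone()

conn.execute("INSERT INTO PersonHeader (PersonKey) VALUES (1)")
save_version(conn, 1, "Ann Smith", "2011-01-01")
save_version(conn, 1, "Ann Jones", "2012-06-15")
total_rows = conn.execute("SELECT COUNT(*) FROM Person").fetchone()[0]
```

With an index restricted to current rows, the "dead" history rows should not need to be scanned when fetching the live version, which is the usual concern with this layout.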
I have two tables
contact table
contactID (PK auto increment)
FirstName
LastName
Address
etc..
Patient table
PatientID
contactID (FK)
How can I add the contact info for Patient first, then link that contactID to Patient table
when the contactID is autoincrement (therefore not known until after the row is created)
I also have other tables
-Doctor, nurse etc
that also links to contact table..
Teacher table
TeacherID
contactID (FK)
So therefore all the contact details are located in one table.
Is this a good database design?
or is it better to put contact info for each entity in its own table..
So like this..
Patient table
PatientID (PK auto increment)
FirstName
LastName
Address
Doctor table
DoctorID (PK auto increment)
FirstName
LastName
Address
In terms of programming, it is easier to just have one insert statement.
eg.
INSERT INTO Patient VALUES(Id, #Firstname,#lastname, #Address)
But I do like the contact table separated (since it normalizes the data), but then there is the issue of not knowing what the contactID is until after the row is inserted, and also probably needing to do two insert statements (which I am not sure how to do)
=======
Reply to EDIT 4
With the login table, would you still have a userid(int PK) column?
E.g
Login table
UserId (int PK), Username, Password..
Username should be unique
You must first create the Contact and then once you know its primary key then create the Patient and reference the contact with the PK you now know. Or if the FK in the Patient table is nullable you can create the Patient first with NULL as the ContactId, create the contact and then update the Patient but I wouldn't do it like this.
The idea of foreign key constraints is that the row being referenced MUST exist therefore the row being referenced must exist BEFORE the row referencing it.
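The ordering described above (create the Contact, read back its generated key, then create the Patient) might look like this. It is a minimal sketch that uses SQLite through Python's sqlite3 module as a stand-in for the real database; add_patient is a hypothetical helper, and cursor.lastrowid plays the role of the "last inserted id" function of the target database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE Contact (
    ContactID INTEGER PRIMARY KEY AUTOINCREMENT,
    FirstName TEXT, LastName TEXT, Address TEXT
);
CREATE TABLE Patient (
    PatientID INTEGER PRIMARY KEY AUTOINCREMENT,
    ContactID INTEGER NOT NULL REFERENCES Contact(ContactID)
);
""")

def add_patient(conn, first, last, address):
    """Insert the Contact first, then the Patient that references it."""
    with conn:  # both inserts succeed or both are rolled back
        cur = conn.execute(
            "INSERT INTO Contact (FirstName, LastName, Address) VALUES (?, ?, ?)",
            (first, last, address),
        )
        contact_id = cur.lastrowid  # key generated by the insert above
        cur = conn.execute(
            "INSERT INTO Patient (ContactID) VALUES (?)", (contact_id,)
        )
        return cur.lastrowid, contact_id

patient_id, contact_id = add_patient(conn, "Jane", "Doe", "1 Main St")

# Referencing a contact that does not exist is rejected by the constraint.
try:
    with conn:
        conn.execute("INSERT INTO Patient (ContactID) VALUES (?)", (9999,))
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```

Wrapping both inserts in one transaction means a failure on the second insert does not leave an orphaned Contact row behind.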
If you really need to be able to have the same Contact for multiple Patients then I think it's good db design. If the relationship is actually one-to-one, then you don't need to separate them into two tables. Given your examples, it might be that what you need is a Person table where you can put all the common properties of Doctors, Teachers and Patients.
EDIT:
I think inheritance is what you are really after. There are a few styles of implementing inheritance in a relational db, but here's one example.
Person database design
PersonId in Nurse and Doctor are foreign keys referencing Person table but they are also the primary keys of those tables.
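To insert a Nurse row, you could do it like this (SQL Server). Both statements must run in the same batch, because SCOPE_IDENTITY() is scoped to the batch and returns NULL once a GO has started a new one:
INSERT INTO Person(FirstName) VALUES('Test nurse')
INSERT INTO Nurse(PersonId, IsRegistered) VALUES(SCOPE_IDENTITY(), 1)
GO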
EDIT2:
Google reveals that the MySQL equivalent of SCOPE_IDENTITY() is LAST_INSERT_ID() [mysql doc]
EDIT3:
I wouldn't separate doctors and nurses into their own tables so that columns are duplicated. Doing a select without inner joins would probably be more efficient, but performance shouldn't be the only criterion, especially if the performance difference isn't that notable. There will be many occasions when you just need the common person data, so you don't always have to do the joins anyway. Having each person in the same table makes it possible to look for a person in a single table. Having common properties in a single table also allows you to have a doctor who is also a patient without duplicating any data. With duplicated columns, if you later want to have more common attributes, you'd need to add them to each "derived" table too, and I assure you that one day you or someone else will forget to add the properties to one of the tables.
If for some reason you are still worried about performance and are willing to sacrifice normalization to gain performance, another possibility is to have all person columns in the same table, maybe with a type column to distinguish them, and just have a lot of null columns, so that all the nurse columns are null for doctors and so on. You can read about inheritance implementation strategies to get an idea of the options, even though you aren't using Entity Framework.
EDIT4:
Even if you don't have any nurse-specific columns at the moment, I would still create a table for them if it's even slightly possible that there will be some in the future. Doing an inner join is a pretty good way to find the nurses, or you could do it in the WHERE clause (there are probably a billion ways to do this). You could have a type column in the Person table, but that would prevent the same person from being a doctor and a patient at the same time. Also, in my opinion, separate tables are more "strict" and clearer for (future) developers.
I would probably make PersonId nullable in the User table, since you might have users that are not actual people in the organization, for example administrators or similar service users. Think about it in terms of real-world entities (forget about foreign keys and nullables): is every user absolutely part of the organization? But all this is up to you and the requirements of the software. Database design should begin with an entity relationship design where you figure out the real-world relationships without considering how they will be mapped to a relational database. This helps you to figure out what the actual requirements are.
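As a rough illustration of the separate-tables approach, here is a sketch in which Doctor, Nurse and Patient each reuse Person's key as their own primary key. SQLite is used as a stand-in database, and the Specialty column, the add_person helper and the sample names are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Person (
    PersonId  INTEGER PRIMARY KEY AUTOINCREMENT,
    FirstName TEXT NOT NULL
);
-- Each "derived" table uses Person's key as both its PK and its FK.
CREATE TABLE Doctor (
    PersonId  INTEGER PRIMARY KEY REFERENCES Person(PersonId),
    Specialty TEXT                      -- hypothetical doctor-only column
);
CREATE TABLE Nurse (
    PersonId     INTEGER PRIMARY KEY REFERENCES Person(PersonId),
    IsRegistered INTEGER NOT NULL
);
CREATE TABLE Patient (
    PersonId INTEGER PRIMARY KEY REFERENCES Person(PersonId)
);
""")

def add_person(conn, first_name):
    """Insert the common row and return its generated key."""
    cur = conn.execute("INSERT INTO Person (FirstName) VALUES (?)", (first_name,))
    return cur.lastrowid

with conn:
    alice = add_person(conn, "Alice")   # a doctor who is also a patient
    conn.execute("INSERT INTO Doctor (PersonId, Specialty) VALUES (?, ?)",
                 (alice, "Cardiology"))
    conn.execute("INSERT INTO Patient (PersonId) VALUES (?)", (alice,))
    bob = add_person(conn, "Bob")       # a nurse
    conn.execute("INSERT INTO Nurse (PersonId, IsRegistered) VALUES (?, 1)", (bob,))

# The inner join keeps only people that have a Nurse row.
nurses = [row[0] for row in conn.execute(
    "SELECT p.FirstName FROM Person p "
    "INNER JOIN Nurse n ON n.PersonId = p.PersonId")]

# One Person row can play two roles without duplicating any data.
doctor_patients = [row[0] for row in conn.execute(
    "SELECT p.FirstName FROM Person p "
    "INNER JOIN Doctor d ON d.PersonId = p.PersonId "
    "INNER JOIN Patient pt ON pt.PersonId = p.PersonId")]
```

A single type column on Person could not represent Alice here, since she is both a doctor and a patient.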
I need to know what are the tradeoffs of using a denormalized table vs using two separate tables and accessing the data using joins. I am using Entity Framework 4.
In my case I have two tables Order and OrderCategoryDetails.
I am thinking whether merging these two tables into one single table is better?
If denormalized, the added columns (OrderCategory and OrderSubcategory) will be sparse (they could be 100% empty, and will always be at least 50% empty).
On the other hand, if I keep it as it is, I am worried about frequent join operations being executed (i.e. whenever I am querying for a specific Order, I would need information from OrderCategoryDetails too).
At present, I have normalized tables and use navigational properties:
To access Order Category information from OrderItem instance
OrderItem orderItem = _context.OrderItems.Where(...).FirstOrDefault();
if(2 == orderItem.SalesOrder.Category.OrderCategory){ ...}
To access Order Category information from Order instance
Order order = _context.Orders.Where(...).FirstOrDefault();
if(2 == order.Category.OrderCategory){ ...}
This is my schema:
Table : Order
ID (Primary Key)
Date
Amount
ItemCount
OrderCategoryInfo (FK - join with OrderCategoryDetails on OrderCategoryDetails.ID)
Table : OrderCategoryDetails
ID (Primary Key)
OrderCategory
OrderSubCategory
Table : OrderItem
OrderItem ID (Primary key)
Order ID (FK - Join with Order)
Database used: SQL Server 2008 R2
My general advice would be to ask yourself the following question: does every single row from the first table require a row from the second table? If the answer is yes, then you might be better off de-normalising the data. If the answer is no, you're probably better off keeping it as a separate table.
As long as you set up your foreign key association between the two tables you shouldn't concern yourself with performance implications of performing a join. It will only become an issue in pathological situations.
Based upon your answers in the comments thread, I'd recommend that you should keep the tables separate and set up a foreign key relationship between the two.
If you do get any performance problems further down the line, run a profiler on the problematic SQL and add any indexes that the profiler recommends, but only do this for queries that are used frequently. Indexes are great for speeding up queries but come at the cost of insert performance, so take care with them.
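To make the "keep the tables separate" option concrete, here is a small sketch of the Order / OrderCategoryDetails schema from the question with a nullable foreign key, an index on it, and a LEFT JOIN that still returns orders without category details. SQLite stands in for SQL Server 2008 R2, and the index name and sample values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE OrderCategoryDetails (
    ID               INTEGER PRIMARY KEY,
    OrderCategory    INTEGER,
    OrderSubCategory INTEGER
);
-- "Order" is a reserved word, so it is bracket-quoted as in SQL Server.
CREATE TABLE [Order] (
    ID                INTEGER PRIMARY KEY,
    Date              TEXT,
    Amount            REAL,
    ItemCount         INTEGER,
    OrderCategoryInfo INTEGER REFERENCES OrderCategoryDetails(ID)  -- nullable
);
-- Index the FK column so the join can seek instead of scan.
CREATE INDEX IX_Order_OrderCategoryInfo ON [Order](OrderCategoryInfo);
""")

with conn:
    conn.execute("INSERT INTO OrderCategoryDetails VALUES (1, 2, 7)")
    conn.execute("INSERT INTO [Order] VALUES (10, '2011-03-01', 99.5, 3, 1)")
    conn.execute("INSERT INTO [Order] VALUES (11, '2011-03-02', 12.0, 1, NULL)")

# LEFT JOIN keeps orders that have no category row; their columns come back NULL.
rows = conn.execute(
    "SELECT o.ID, c.OrderCategory "
    "FROM [Order] o "
    "LEFT JOIN OrderCategoryDetails c ON c.ID = o.OrderCategoryInfo "
    "ORDER BY o.ID"
).fetchall()
```

Because at least half of the orders have no category, the separate table avoids storing two always-empty columns on those rows.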
I need to update a bit field in a table and set this field to true for a specific list of Ids in that table.
The Ids are passed in from an external process.
I guess in pure SQL the most efficient way would be to create a temp table and populate it with the Ids, then join the main table with this and set the bit field accordingly.
I could create a SPROC to take the Ids, but there could be 200 - 300,000 rows involved that need this flag set, so it's probably not the most efficient way. Using the IN statement has limitations with regard to the amount of data that can be passed and performance.
How can I achieve the above using the Entity Framework?
I guess it's possible to create a SPROC to create a temp table, but this would not exist from the model's perspective.
Is there a way to dynamically add entities at run time? [Or is this approach just going to cause headaches?]
I'm making the assumption above though that populating a temp table with 300,000 rows and doing a join would be quicker than calling a SPROC 300,000 times :)
[The Ids are Guids]
Is there another approach that I should consider?
For data volumes like 300k rows, I would forget EF. I would do this by having a table such as:
BatchId RowId
Where RowId is the PK of the row we want to update, and BatchId just refers to this "run" of 300k rows (to allow multiple at once etc).
I would generate a new BatchId (this could be anything unique; a Guid leaps to mind), and use SqlBulkCopy to insert the records into this table, i.e.
100034 17
100034 22
...
100034 134556
I would then use a single sproc to do the join and update (and delete the batch from the table).
SqlBulkCopy is the fastest way of getting this volume of data to the server; you won't drown in round-trips. EF is object-oriented : nice for lots of scenarios - but not this one.
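The batch-table idea can be sketched end to end as below. This is not the SqlBulkCopy / stored-procedure implementation itself: SQLite and executemany stand in for SQL Server and the bulk copy, the table names Main and Batch and the flag_rows helper are hypothetical, and the set-based UPDATE uses an IN subquery where the real sproc would join.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Main (
    Id   TEXT PRIMARY KEY,            -- Guid stored as text in this sketch
    Flag INTEGER NOT NULL DEFAULT 0   -- the bit field to set
);
CREATE TABLE Batch (
    BatchId TEXT NOT NULL,
    RowId   TEXT NOT NULL,
    PRIMARY KEY (BatchId, RowId)
);
""")

def flag_rows(conn, row_ids):
    """Stage the ids under a fresh BatchId, update in one statement, clean up."""
    batch_id = str(uuid.uuid4())
    with conn:
        # Stand-in for the bulk copy: load all ids for this run.
        conn.executemany(
            "INSERT INTO Batch (BatchId, RowId) VALUES (?, ?)",
            ((batch_id, row_id) for row_id in row_ids),
        )
        # One set-based update instead of one call per id.
        cur = conn.execute(
            "UPDATE Main SET Flag = 1 "
            "WHERE Id IN (SELECT RowId FROM Batch WHERE BatchId = ?)",
            (batch_id,),
        )
        updated = cur.rowcount
        conn.execute("DELETE FROM Batch WHERE BatchId = ?", (batch_id,))
    return updated

all_ids = [str(uuid.uuid4()) for _ in range(5)]
with conn:
    conn.executemany("INSERT INTO Main (Id) VALUES (?)", ((i,) for i in all_ids))

updated = flag_rows(conn, all_ids[:3])
flagged = conn.execute("SELECT COUNT(*) FROM Main WHERE Flag = 1").fetchone()[0]
leftover = conn.execute("SELECT COUNT(*) FROM Batch").fetchone()[0]
```

The BatchId column is what lets several runs share the staging table at the same time without touching each other's rows.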
I'm marking Marc's response as the answer, but I'd just like to give a little detail on how we implemented the requirement.
Marc's response helped greatly in the formulation of our solution.
We had to deal with an aim/guideline to keep within the Entity Framework while not utilizing SPROCS, and although our solution may not suit others, it has worked for us.
We created an Item table in the database with BatchId [uniqueidentifier] and ItemId [varchar] columns.
This table was added to the EF model so we did not use temporary tables.
On upload of these Ids this table is populated with the Ids [Inserts are quick enough we find using EF]
We then use context.ExecuteStoreCommand to run the SQL that joins the Item table and the main table and updates the bit field in the main table for records that exist for the batch Id created specifically for that session.
We finally clear this table for that batchId.
We have the performance while keeping within our no-SPROC goal. [Which not all of us agree with :) but it's a democracy]
Our exact requirements are a little more complex but insofar as needing good update performance using the Entity framework given our specific restrictions it works fine.
Liam
The application is single-user, 1-tier (1 PC), with a SqlCE database. The DataService layer will be (I think) a Repository returning domain objects and querying the database with LinqToSql (dbml). There are obviously a lot more columns; this is a simplified view.
LogTime in separate table: http://i53.tinypic.com/9h8cb4.png
LogTime in ItemTimeLog table (as Time): http://i51.tinypic.com/4dvv4.png
This is my first attempt at creating a database with more than 2 tables. I think the table schema makes sense, but I need some reassurance or criticism, because the table relations look quite scary, to be honest. I'm hoping you could:
Look at the table schema and respond if there are clear signs of trouble or errors that you spot right away. And if you have time,
Look at Program Summary/Questions, and see if the table layout makes sense with respect to those points.
Please be brutal, I will try to defend :)
Program summary:
a) A set of categories, each having a set of strategies (1:m)
b) Each day a number of items will be produced. And each strategy MAY reference it.
(So there can be 50 items, and a strategy may reference 23 of them)
c) An item can be referenced by more than one strategy. So I think it's an m:m relation.
d) Status values will be logged at fixed time-fractions through the day, for:
- .... each Strategy.....each StrategyItem....each item
e) An action on an item may be executed by a strategy that reference it.
- This is logged as ItemAction (Could have called it StrategyItemAction)
User Requests
b) -> e) describe the main activity mode of the program: to work with only today's DayLog, for each category. The second-priority activity is retrieval of history, which will typically be: from all categories, from day x to day y, get all StrategyDailyLog.
Questions
First, does the overall layout look sound? I'm worried to see that there are so many relationships in all directions, connecting everything. Is this normal, or does it look like trouble?
StrategyItem is made to represent an m:m relationship. Is it correct as I noted 1:m / 1:1 (marked red) ?
StrategyItemTimeLog and ItemTimeLog: these log values that both need to be retrieved together when retrieving a StrategyItem. The reason I separated them is that the first one is strategy-specific, and several strategies can reference the same item. So I thought not to duplicate those values that are not dependent on the strategy, but only on the item. Hence I also pulled out the LogTime, as it seems to be the only parameter that unites the logs. But this all looks quite disturbing with those 3 tables. Does it make sense at all? Or do you have a suggestion?
Pink circles show my vague attempt at Aggregate Root Paths. I've been thinking in terms of "what entity is responsible for delete", though I'm unsure about the actual root. I think it's Category. Does it make sense in relation to the User Requests described above?
EDIT1:
(Updated schema, showing typical number of hierarchy items for the first few relations, for 365 days, and additional explanations)
1:1 relation: Sorry. I made a mistake. The StrategyDailyLog should be 1:m. See updated schema. It is one per Strategy, per day.
DayLog / StrategyDailyLog: I’ve been pondering over whether DayLog should be a part of the hierarchy like this or not. The purpose of the DayLog table is to hold “sum values” derived from all the StrategyDailyLog rows for the same day, like performance values for this day. It also holds the date value, which allows me to omit a date value in the StrategyDailyLog (which I feel would kind of be a duplicate modeling of the date field); instead, the reference to DayLog exists to “find” the date. I’m not sure if this is an abuse/misconception of normalization.
Null value: I hadn’t thought about this. I believe I found 2, as now marked in StrategyDailyLog and ItemAction. They cannot be null on creation, but they can be set to null if one needs to delete either a Strategy or a StrategyItem. That should not require a delete of the StrategyDailyLog and the ItemAction. Hence they can be set to null.
All Id –columns: My idea was to have ID (autogenerated Integer) as PK for all my tables. I believed that also would be sufficient as candidate key. Is this not a proper way to make PKs? It’s the only way any table of mine can be identified. I asked a question before if that was ok, maybe I misunderstood, but thought that was a good approach.
m:m relation: This is what I have attempted to do: StrategyItem is the m:m table of StrategyDailyLog / DailyItem.
Ok. Here is me being brutal. I do not understand the model.
So instead of trying to comment on that so much, here are some thoughts that came to my mind when I looked at it.
I think you should have a look at your 1:1 relationships (all of them). Why are DayLog and StrategyDailyLog separated into two tables? Probably because you will always have at least one DayLog item, but not all DayLog items have a StrategyDailyLog item. If that is the case, you can have a StrategyID FK in the DayLog table with the allow-nulls option.
It would help to understand the model if you could show which fields are required and which fields accept null as a value.
All your tables have their own id column. That can be quite confusing when doing 1:1 relations and m:m relations. For a 1:1 relation, usually the relation between the two tables is made on the primary key in both tables. If you do not do that, you have to create a candidate key on the foreign key column. In your case that means that StrategyDailyLog should have a candidate key on DayLogID.
An m:m relation between two tables is usually solved by adding a new table in between, with the primary keys from both tables. Those fields together are the primary key for the table in the middle.
Let's say, for example, that you should have an m:m relationship between Category and Strategy. You should then create a table called CategoryStrategy with two fields, CategoryID and StrategyID, that together are the primary key for table CategoryStrategy.
I hope my comments make sense and that they are useful to you.
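A minimal sketch of that CategoryStrategy junction table, assuming SQLite as a stand-in database and made-up Name columns and sample rows, could look like this. The composite primary key is what stops the same pair from being linked twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Category (CategoryID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE Strategy (StrategyID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
-- Junction table: no id column of its own, the two FKs together are the PK.
CREATE TABLE CategoryStrategy (
    CategoryID INTEGER NOT NULL REFERENCES Category(CategoryID),
    StrategyID INTEGER NOT NULL REFERENCES Strategy(StrategyID),
    PRIMARY KEY (CategoryID, StrategyID)
);
""")

with conn:
    conn.executemany("INSERT INTO Category VALUES (?, ?)",
                     [(1, "Equities"), (2, "Bonds")])
    conn.executemany("INSERT INTO Strategy VALUES (?, ?)",
                     [(1, "Momentum"), (2, "Reversal")])
    conn.executemany("INSERT INTO CategoryStrategy VALUES (?, ?)",
                     [(1, 1), (1, 2), (2, 1)])

# Linking the same pair a second time violates the composite primary key.
try:
    with conn:
        conn.execute("INSERT INTO CategoryStrategy VALUES (1, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# All strategies linked to category 1, resolved through the junction table.
strategies_for_category_1 = [row[0] for row in conn.execute(
    "SELECT s.Name FROM Strategy s "
    "INNER JOIN CategoryStrategy cs ON cs.StrategyID = s.StrategyID "
    "WHERE cs.CategoryID = 1 ORDER BY s.Name")]
```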
EDIT 2011-01-17
I do not think that you should have as a principle to use an IDENTITY column as the primary key in all tables. An m:m relation does not need it, so you should not do it. I also think that you have misunderstood what I meant by a candidate key. A candidate key is a key that could have been used as the primary key. In MS SQL Server you define a UNIQUE CONSTRAINT for your candidate key.
Ex: Table StrategyItem has id as PK, but the combination of StrategyID and DailyItemID is the candidate key. It would be better to remove id and use StrategyID+DailyItemID as the PK.
Below is the schema that I would have built with your description. I might have missed something important because I do not know everything about what you want to do.
You should not think so much about query performance and building aggregates when designing the schema. That can be handled by creating indexes on columns and using sum, count and group by in your queries. An index on column Created in the model below would be necessary for your queries on a date or date interval. In MS SQL Server there is something called the clustered index. By default the PK of a table is the clustered index, but in this case I would make the index on the Created column the clustered index.
A Category has 0, 1 or more Strategies.
LogItem has one Category and optionally one Strategy.
LogItem.Created holds date and time.
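The point about leaving aggregates to queries can be illustrated with a small sketch of the Category / Strategy / LogItem model. The assumptions are significant: SQLite is used instead of MS SQL Server (so there is only an ordinary index on Created, not a clustered one), and the Value column, index name and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Category (CategoryID INTEGER PRIMARY KEY);
CREATE TABLE Strategy (
    StrategyID INTEGER PRIMARY KEY,
    CategoryID INTEGER NOT NULL REFERENCES Category(CategoryID)
);
CREATE TABLE LogItem (
    LogItemID  INTEGER PRIMARY KEY,
    CategoryID INTEGER NOT NULL REFERENCES Category(CategoryID),
    StrategyID INTEGER REFERENCES Strategy(StrategyID),   -- optional
    Created    TEXT NOT NULL,                              -- date and time
    Value      REAL NOT NULL                               -- hypothetical measure
);
-- Supports the "from day x to day y" history queries.
CREATE INDEX IX_LogItem_Created ON LogItem(Created);
""")

with conn:
    conn.execute("INSERT INTO Category VALUES (1)")
    conn.execute("INSERT INTO Strategy VALUES (1, 1)")
    conn.executemany(
        "INSERT INTO LogItem (CategoryID, StrategyID, Created, Value) "
        "VALUES (?, ?, ?, ?)",
        [(1, 1,    "2011-01-15 09:00:00", 1.5),
         (1, None, "2011-01-15 17:30:00", 2.5),
         (1, 1,    "2011-01-16 10:00:00", 4.0),
         (1, 1,    "2011-02-01 08:00:00", 9.0)],
    )

# Daily totals are computed at query time instead of being stored in a table.
daily = conn.execute(
    "SELECT date(Created) AS Day, COUNT(*), SUM(Value) "
    "FROM LogItem "
    "WHERE Created >= ? AND Created < ? "
    "GROUP BY date(Created) ORDER BY Day",
    ("2011-01-01", "2011-02-01"),
).fetchall()
```

The half-open range (>= start, < end) keeps the filter directly on the indexed Created column and avoids double-counting rows on the boundary day.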
Suppose a
Table "Person" having
"SSN",
"Name",
"Address"
and another
Table "Contacts" having
"Contact_ID",
"Contact_Type",
"SSN" (primary key of Person)
similarly
Table "Records" having
"Record_ID",
"Record_Type",
"SSN" (primary key of Person)
Now I want changes to the SSN in the Person table to be applied automatically to the other 2 tables.
Can anyone help me with a trigger for that,
or show me how to set up the foreign key constraints on these tables?
Just add ON UPDATE CASCADE to the foreign key constraint.
Preferably the primary key of a table should never change. If you expect the SSN to change you should use a different primary key and have the SSN as a normal data column in the person table. If it's already too late to make this change, you can add ON UPDATE CASCADE to the foreign key constraint.
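For reference, this is roughly what ON UPDATE CASCADE does with the tables from the question. The sketch uses SQLite (which only enforces foreign keys after PRAGMA foreign_keys = ON) rather than the asker's database, and the SSN values are obviously fake placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Person (
    SSN     TEXT PRIMARY KEY,
    Name    TEXT,
    Address TEXT
);
CREATE TABLE Contacts (
    Contact_ID   INTEGER PRIMARY KEY,
    Contact_Type TEXT,
    SSN          TEXT NOT NULL REFERENCES Person(SSN) ON UPDATE CASCADE
);
CREATE TABLE Records (
    Record_ID   INTEGER PRIMARY KEY,
    Record_Type TEXT,
    SSN         TEXT NOT NULL REFERENCES Person(SSN) ON UPDATE CASCADE
);
""")

with conn:
    conn.execute("INSERT INTO Person VALUES ('000-00-0001', 'Test Person', '1 Main St')")
    conn.execute("INSERT INTO Contacts (Contact_Type, SSN) VALUES ('phone', '000-00-0001')")
    conn.execute("INSERT INTO Records (Record_Type, SSN) VALUES ('visit', '000-00-0001')")
    # Only the parent row is updated; the constraint rewrites the child rows.
    conn.execute("UPDATE Person SET SSN = '000-00-0002' WHERE SSN = '000-00-0001'")

contact_ssn = conn.execute("SELECT SSN FROM Contacts").fetchone()[0]
record_ssn = conn.execute("SELECT SSN FROM Records").fetchone()[0]
```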
If you have PKs that change, you need to look at the table design and use a surrogate PK, like an identity.
In your question you have a Person table, which could be referenced by FKs from many, many tables. In that case an ON UPDATE CASCADE could have some serious problems. The database I'm working on has well over 300 references (FK) to our equivalent table; we track all the various work that a person does in each different table. If I insert a row into our Person table and then try to delete it back out again (it will not be used in any other tables, it is new), the delete will fail with a Msg 8621, Level 17, State 2, Line 1 The query processor ran out of stack space during query optimization. Please simplify the query. As a result, I can't imagine an ON UPDATE CASCADE would work either when you get many FKs on your PK.
I would never make sensitive data like an SSN a PK. Health care companies used to do this and had a painful switch because of privacy. I hope you don't have a web app with a GET or POST variable called SSN with the actual value in it!! Or display the SSN on every report; otherwise, will you shred all old printed reports, limit access to who views each report, etc.?
Well, assuming the SSN is the primary key of the Person table, I would just (in a transaction of course):
create a brand new row with the new SSN, copying all other details from the old row.
update the columns in the other tables to point to the new row.
delete the old row.
Now this is actually a good example of why you shouldn't use real data as table cross-references, if that data can change. If you'd used an artificial column to tie them together (and only stored the SSN in one place), you wouldn't have the problem.
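A sketch of the artificial-key alternative, again using SQLite as a stand-in and a hypothetical PersonID column: the SSN is stored in exactly one place under a UNIQUE constraint, so changing it touches a single row and no cross-references. The SSN values are fake placeholders, and how to protect the stored value is left out of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Person (
    PersonID INTEGER PRIMARY KEY AUTOINCREMENT,  -- artificial key, never changes
    SSN      TEXT NOT NULL UNIQUE,               -- stored once, still unique
    Name     TEXT
);
CREATE TABLE Contacts (
    Contact_ID   INTEGER PRIMARY KEY,
    Contact_Type TEXT,
    PersonID     INTEGER NOT NULL REFERENCES Person(PersonID)
);
""")

with conn:
    person_id = conn.execute(
        "INSERT INTO Person (SSN, Name) VALUES ('000-00-0001', 'Test Person')"
    ).lastrowid
    conn.execute(
        "INSERT INTO Contacts (Contact_Type, PersonID) VALUES ('phone', ?)",
        (person_id,),
    )
    # Changing the SSN is a one-row update; no child table is involved.
    conn.execute("UPDATE Person SET SSN = '000-00-0002' WHERE PersonID = ?",
                 (person_id,))

contact_person_id = conn.execute("SELECT PersonID FROM Contacts").fetchone()[0]
ssn_via_join = conn.execute(
    "SELECT p.SSN FROM Contacts c "
    "INNER JOIN Person p ON p.PersonID = c.PersonID"
).fetchone()[0]
```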
Cascade update and delete are very dangerous to use. If you have a million child records, you could end up with a serious locking problem. You should code the updates and deletes instead.
You should never use a PK with the potential to change if it can be avoided. Nor should you ever use SSN as a PK, because it should never be stored unencrypted in your database. Never, unless your company likes to be sued when it is the cause of an identity theft incident. This is not a design flaw to shrug off with "this is legacy, we don't have time to fix it". This is a design flaw that could bankrupt your company if someone steals your backup tapes or gets the SSNs out of the system in another manner (most of these types of thefts are internal, BTW). This is an urgent, must-fix-now design flaw.
SSN is also a bad candidate because it changes (people change them when they are victims of identity theft, for instance). Plus, an integer PK will generally perform better than an SSN stored as a nine-character string.