I have inherited an old, shabby database and would like to add lots of foreign keys on the existing relationship columns, so that I can use things like NHibernate for relationships.
I am a little inexperienced in the realm of keys, and although I feel I understand how they work, some part of me is fearful of corrupting the database somehow.
For example, I've come across the concept of "cascade on delete". I don't think there are currently any foreign keys on the database, so I guess this won't affect me... but how can I check to be sure?
What other risks do I need to be aware of?
In a way I'd love to just use nHibernate without foreign keys, but from what I can see this wouldn't be possible?
The biggest problem with putting foreign keys on a database that was designed without them (which is an indication that the original database designers were incompetent, so there will be many other problems to fix as well) is that there is close to a 100% chance that you have orphaned data without a parent key. You will need to figure out what to do with this data. Some of it can simply be thrown out, as it is no longer usable in any fashion and is just wasting space. However, if any of it relates to orders or anything financial, you need to keep the data, in which case you may need to define a parent record of "unknown" that you can relate the orphaned records to. Find and fix all the bad data first, then add the foreign keys.
Use cascade update and cascade delete sparingly, as they can lock up your database if a large number of records need to be changed. Additionally, in many cases you want the delete to fail if child records exist. You never want to cascade deletes through financial records, for instance: if deleting a user would delete past orders, that is a very bad thing! If you don't use cascading, you are likely to surface the buggy code that let the data go bad in the first place, because deletes and updates will start failing once the key is in place. So test all deleting and updating functionality thoroughly once you have the keys in place.
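To make the orphan hunt concrete: in SQL Server you can confirm there are no existing keys with `SELECT name FROM sys.foreign_keys` (an empty result answers the question above), and an orphan check is just an outer join from child to parent. Below is a minimal sketch using SQLite from Python purely for illustration; the `Users`/`Orders` tables are hypothetical, and the same SELECT/UPDATE pattern carries over to SQL Server.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Users  (UserID  INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, UserID INTEGER);
    INSERT INTO Users  VALUES (1, 'alice');
    INSERT INTO Orders VALUES (10, 1), (11, 2);  -- order 11 has no parent user
""")

# Orphan check: child rows whose parent key matches no parent row.
orphans = con.execute("""
    SELECT o.OrderID
    FROM Orders o
    LEFT JOIN Users u ON u.UserID = o.UserID
    WHERE u.UserID IS NULL
""").fetchall()
print(orphans)  # [(11,)]

# Financial data can't be thrown away, so reparent it to a sentinel
# 'unknown' user before the foreign key is added.
con.execute("INSERT INTO Users VALUES (0, 'unknown')")
con.execute("""
    UPDATE Orders SET UserID = 0
    WHERE UserID NOT IN (SELECT UserID FROM Users)
""")
```

Only after that UPDATE leaves the orphan query empty is it safe to add the constraint.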
NHibernate does not require foreign keys to be present on a database, but I would still recommend adding them whenever possible. Foreign keys are a good thing: they make sure that your database's referential integrity is as it should be.
For example, if I had a User and a Comment table within my database and I were to delete user 1 who happens to have made two comments, without foreign keys I'd now have two comments without an owner! We obviously do not want this situation to ever occur.
This is where foreign keys come in: by declaring that User is a foreign key within the Comment table, our database server will make sure that we can't delete a user until there are no comments associated with him or her (anymore).
Introducing foreign keys into a database is a good thing. It will expose existing invalid data. It will keep existing valid data, valid. You might have to perform some data manipulation on tables that have already gone haywire (i.e. create an 'Unknown user' or something similar and update all non-existing keys to point at it, this is a decision that needs to be made after examining the meaning of the data).
It might even cause a few issues initially, where an existing application crashes if, for example, it doesn't delete all the data it should (such as not deleting all comments in my example). But this is a good thing in the long term, as it exposes where things are going wrong and allows you to fix them, without the data and database getting into an even worse state in the meantime.
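To see that protection in action, here is a small sketch (SQLite via Python, with hypothetical `User`/`Comment` tables mirroring the example above) showing the database refusing the delete while comments still exist:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
con.executescript("""
    CREATE TABLE User (UserID INTEGER PRIMARY KEY);
    CREATE TABLE Comment (
        CommentID INTEGER PRIMARY KEY,
        UserID    INTEGER NOT NULL REFERENCES User(UserID)
    );
    INSERT INTO User VALUES (1);
    INSERT INTO Comment VALUES (100, 1), (101, 1);
""")

blocked = False
try:
    con.execute("DELETE FROM User WHERE UserID = 1")  # user 1 still has comments
except sqlite3.IntegrityError:
    blocked = True  # the FK stopped us from creating ownerless comments

# Delete the children first, and the parent delete is allowed.
con.execute("DELETE FROM Comment WHERE UserID = 1")
con.execute("DELETE FROM User WHERE UserID = 1")
```

The failed delete is exactly the kind of crash described above: annoying at first, but it points straight at the code that forgot to clean up the comments.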
NHibernate cascades are separate from foreign keys; they are NHibernate's way of letting you, for example, make sure all child objects are deleted when you delete a parent. This allows you to make sure that changes to your data model do not violate your foreign key relationships (which would cause a database exception and no changes being applied). Personally I prefer to take care of this myself, but it's up to you whether and how you want to use them.
Foreign keys formalize relationships in a normalized database. The foreign key constraints you are talking about do things like preventing the creation of duplicate keys, or the deletion of a row that defines an entity which is still being used or referenced. This is called "referential integrity."
I suggest using some kind of modelling tool to draw a so-called ERM or entity-relationship model diagram. This will help you to get an overview of how the data is stored and where changes would be useful.
After you have done this, you should consider whether or not the data is at a reasonable degree of normalization (say, second or third normal form). Pay particular attention to whether every entity has a primary key and whether the non-key data completely depends on that key. You should also try to remove redundancy and split non-atomic fields into a new table. "Every non-key attribute must provide a fact about the key, the whole key, and nothing but the key, so help you Codd." If you find the data is not normalized, this is a good time to fix any serious structural problems and/or refactor, if appropriate.
At this point, adding foreign keys and constraints is putting the cart before the horse. Ensure you have data integrity before you try to protect it. You need to do some preparation work first, then constraints will keep your not-so-shabby newly remodeled database in tip-top shape. The constraints will ensure that no one decides to make exceptions to the rules that turn the data into a mess. Take the time to give the data a better organized home now, and put locks on the doors after.
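As a tiny illustration of the "split non-atomic fields" step: a CSV list of phone numbers in one column violates first normal form, and normalizing it means moving each value into its own row in a child table keyed back to the customer. The names here are hypothetical, sketched with SQLite from Python:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Non-atomic field: several phone numbers crammed into one column.
    CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, Phones TEXT);
    INSERT INTO Customer VALUES (1, '555-0100,555-0101');
""")

# Normalized: one phone per row, each a fact about the key (the customer).
con.execute("""
    CREATE TABLE CustomerPhone (
        PhoneID    INTEGER PRIMARY KEY,
        CustomerID INTEGER REFERENCES Customer(CustomerID),
        Phone      TEXT
    )
""")
for cust_id, phones in con.execute(
        "SELECT CustomerID, Phones FROM Customer").fetchall():
    for phone in phones.split(","):
        con.execute("INSERT INTO CustomerPhone (CustomerID, Phone) VALUES (?, ?)",
                    (cust_id, phone))
```

Once the split is done, the CSV column can be dropped and a foreign key on CustomerPhone protects the new structure.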
What is the best practice to handle the following situation?
It is known that many records (thousands) will be inserted with a fair possibility of a primary key exception. In some cases the exception should trigger some alternative queries and logic. In other cases it doesn't matter much and I merely want to log the event.
Should the insert be attempted and the exception caught?
or
Should a query be made to check for the existing record, then attempt the insert if none exists?
or Both?
I have noticed slightly better performance when merely catching the exception, but there's not a significant difference.
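For the "attempt and catch" option, the shape of the code is roughly the following (sketched with SQLite from Python; the `Event` table is hypothetical, and on SQL Server the duplicate-key failure surfaces as a SqlException rather than `IntegrityError`):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Event (EventID INTEGER PRIMARY KEY, Payload TEXT)")
con.execute("INSERT INTO Event VALUES (1, 'first')")

def insert_or_log(event_id, payload):
    """Attempt the insert; on a PK collision, run the fallback logic."""
    try:
        con.execute("INSERT INTO Event VALUES (?, ?)", (event_id, payload))
        return True
    except sqlite3.IntegrityError:
        # Here you would branch to alternative queries, or just log the event.
        print(f"duplicate key {event_id}: logged and skipped")
        return False

ok_new = insert_or_log(2, "second")  # no collision, row inserted
ok_dup = insert_or_log(1, "again")   # collides with the existing row
```

One advantage of this shape is that it is race-free: if two writers try the same key at once, the constraint arbitrates, whereas check-then-insert leaves a window between the SELECT and the INSERT.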
IMO it depends. If the client is responsible for generating the PK (using a UUID, Snowflake ID, etc., where keys are expected to be unique), then the first option is fine. Whether you bother with a retry after generating a new ID, or simply fail the operation and ask the user to try again (it should be a one-in-a-billion exception, not the norm), is up to you. If the data relies on sequences or user-entered meaningful keys, it should be managed on the DB side using DatabaseGenerated.Identity and meaningless keys, with related object graphs created and committed within a single SaveChanges call.
The typical concern around ID generation and EF is usually where developers don't rely on EF/the DB to manage the PK and FKs through navigation properties. They feel they need to know the PK in order to set FKs for related data, either saving the primary entity to get the PK or generating keys client-side. One of the key benefits of using an ORM like EF is giving it the related objects and letting it manage the inserting of PKs and FKs automatically.
There are a couple of things going on here.
First, you must have a primary key constraint on the column at the database level.
At the Entity Framework level, it is good practice to check whether the record exists. You query for the record by its primary key; if it is found, EF returns the entity, you make your changes to it, and finally SaveChanges persists them.
If you cannot find the entity, then you add it instead.
If you insert without querying first, it is problematic for EF, especially if multiple requests try to update the same record.
There is also the case where multiple requests may try to insert the same record; the primary key constraint helps here, as it will not allow duplication if you are generating primary keys manually.
For updates too, there is a possibility of data loss if you are not taking care of concurrency.
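The query-first pattern described above looks roughly like this (a SQLite/Python sketch with a hypothetical `Customer` table; in EF you would use `Find` and `SaveChanges` instead of raw SQL):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, Name TEXT)")

def save(customer_id, name):
    # Query for the record by primary key first.
    row = con.execute("SELECT Name FROM Customer WHERE CustomerID = ?",
                      (customer_id,)).fetchone()
    if row is None:
        con.execute("INSERT INTO Customer VALUES (?, ?)", (customer_id, name))
    else:
        con.execute("UPDATE Customer SET Name = ? WHERE CustomerID = ?",
                    (name, customer_id))

save(1, "Acme")      # not found, so inserted
save(1, "Acme Ltd")  # found, so updated in place
```

Note the race this answer warns about: two concurrent requests can both see "not found" and both insert, and the primary key constraint is what catches the loser; protecting the update path against lost updates additionally needs a concurrency token (e.g. a rowversion column).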
I was given the task of creating a stored procedure to copy every piece of data associated with a given ID in our database. This data spans dozens of tables, and each table may have dozens of matching rows.
example:
Table Account
  PK = AccountID
Table AccountSettings
  FK = AccountID
Table Users
  PK = UserID
  FK = AccountID
Table UserContent
  PK = UserContentID
  FK = UserID
I want to create a copy of everything that is associated with an AccountID (which will traverse nearly every table). The copy will have a new AccountID and new UserContentIDs but will keep the same UserIDs, and the new data needs to go into its respective tables.
:) fun right?
The above is just a sample but I will be doing this for something like 50 or 60 tables.
I have researched using CTEs but am still a bit foggy on them; that may prove to be the best method. My SQL skills are... well, I have worked with it for about 40 logged hours so far :)
Any advice or direction on where to look would be greatly appreciated. In addition, I am not opposed to doing this via C# if that would be possible or better.
Thanks in advance for any help or info.
The simplest way to solve this is the brute force way: write a very long proc that processes each table individually. This will be error-prone and very hard to maintain. But it will have the advantage of not relying on the database or database metadata to be in any particularly consistent state.
If you want something that works based on metadata, things are more interesting. You have three challenges there:
You need to programmatically identify all the related tables.
You need to generate insert statements for all 50 or 60 tables.
You need to capture generated ids for those tables that are more than one or two steps away from the Account table, so that they can in turn be used as foreign keys in yet more copied records.
I've looked at this problem in the past, and while I can't offer you a watertight algorithm, I can give you a general heuristic. In other words: this is how I'd approach it.
Using a later version of MS Entity Framework (you said you'd be open to using C#), build a model of the Account table and all the related tables.
Review the heck out of it. If your database is like many, some of the relationships your application(s) assume will, for whatever reason, not have an actual foreign key relationship set up in the database. Create them in your model anyway.
Write a little recursive routine in C# that can take an Account object and traverse all the related tables. Pick a couple of Account instances and have it dump table name and key information to a file. Review that for completeness and plausibility.
Once you are satisfied you have a good model and a good algorithm that picks up everything, it's time to get cracking on the code. You need to write a more complicated algorithm that can read an Account and recursively clone all the records that reference it. You will probably need reflection in order to do this, but it's not that hard: all the metadata that you need will be in there, somewhere.
Test your code. Allow plenty of time for debugging.
Use your first algorithm, in step 3, to compare results for completeness and accuracy.
The advantage of the EF approach: as the database changes, so can your model, and if your code is metadata-based, it ought to be able to adapt.
The disadvantage: if you have such phenomena as fields that are "really" the same but are different types, or complex three-way relationships that aren't modeled properly, or embedded CSV lists that you'd need to parse out, this won't work. It only works if your database is in good shape and is well-modeled. Otherwise you'll need to resort to brute force.
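Challenge 3, capturing generated ids so they can be reused as foreign keys, is the fiddly part whichever route you take. A minimal sketch of the idea (SQLite from Python, reusing the question's hypothetical Account/AccountSettings tables; in T-SQL you would capture the new key with SCOPE_IDENTITY() or an OUTPUT clause instead):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Account (AccountID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE AccountSettings (
        SettingID INTEGER PRIMARY KEY,
        AccountID INTEGER REFERENCES Account(AccountID),
        Value     TEXT
    );
    INSERT INTO Account VALUES (1, 'acme');
    INSERT INTO AccountSettings VALUES (10, 1, 'a'), (11, 1, 'b');
""")

def clone_account(old_id):
    # Copy the parent row, letting the database generate a fresh key.
    con.execute("INSERT INTO Account (Name) "
                "SELECT Name FROM Account WHERE AccountID = ?", (old_id,))
    new_id = con.execute("SELECT last_insert_rowid()").fetchone()[0]
    # Copy the children, substituting the captured key as their FK.
    con.execute("INSERT INTO AccountSettings (AccountID, Value) "
                "SELECT ?, Value FROM AccountSettings WHERE AccountID = ?",
                (new_id, old_id))
    return new_id

new_account = clone_account(1)
```

A metadata-driven version repeats the capture-and-substitute step per table, maintaining an old-key-to-new-key map as it descends the relationship graph.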
I'm currently working on a sandbox environment based on two databases located on different servers. What I am aiming to do is allow my clients to make changes on a test server and then once approved, I can simply hit a button and import the data across to my live database.
So far, I have managed to port the data across the two databases, but what I would like to do is amend the primary keys on the test server to match those held on the live database (in case I need backups, and so that I can make checks to stop the same information being copied multiple times).
So far I have tried this solution:
DT_SitePage OldPage = new DT_SitePage
{
    PageID = SP.PageID
};

DT_SitePage NewPage = new DT_SitePage
{
    PageID = int.Parse(ViewState["PrimaryKey"].ToString())
};

Sandbox.DT_SitePages.Attach(NewPage, OldPage);
Sandbox.SubmitChanges();
However I keep getting the error:
***Value of member 'PageID' of an object of type 'DT_SitePage' changed.
A member defining the identity of the object cannot be changed.
Consider adding a new object with new identity and deleting the existing one instead.***
Is there any way in LINQ to avoid this error and force the database to update this field?
Many Thanks
Why not use the stock backup/restore functionality supplied by the DB vendor?
It makes perfect logical sense that high-level ORM tools won't allow you to change the primary key of a record, as they identify the record only by its primary key.
You should consider making direct UPDATE queries to DB from your code instead.
And anyway, changing the primary key is a bad idea; what prevents you from INSERTing the record with the needed value in the first place?
As said, modifying primary keys is typically something you don't want to do. If LINQ to SQL didn't give this early warning, your RDBMS would probably complain (SQL Server does!). Especially when records are related by foreign key constraints, updating primary keys is not trivial.
In cross-database scenarios, it is more common to use some "global" unique identification, for which GUIDs may do. Maybe your data can be identified in an alternative way? (Like when two users have the same name, they are deemed identical).
If you don't need to keep identical the database structures, you may consider using an extra field in your test database to store the "live" primary key.
Here is a post with lots of useful thoughts.
We have a text processing application developed in C# using .NET FW 4.0 where the Administrator can define various settings. All this 'settings' data reside in about 50 tables with foreign key relations and Identity primary keys (this one will make it tricky, I think). The entire database is no more than 100K records, with the average table having about 6 short columns. The system is based on MS SQL 2008 R2 Express database.
We face a requirement to create a snapshot of all this data so that the administrator of the system can roll back to one of the snapshots any time he screws something up. We need to keep the last 5 snapshots only. Creation of the snapshot must be initiated from the application GUI, and so must the rollback to any of the snapshots if needed (using SSMS will not be allowed, as direct access to the DB is denied).

The system is still in development (are we ever really finished?), which means that new tables and columns are added all the time. We therefore need a robust method that can handle changes automatically (digging through code after inserting/changing columns is something we want to avoid unless there's no other way). Ideally we would just be able to say "I want to create a snapshot of all tables whose names begin with 'Admin'".

Obviously, this is quite a DB-intensive task, but since it will be used in emergency situations only, that is something I do not mind. I also do not mind if table locks happen, as nothing will try to use these tables while the creation or rollback of a snapshot is in progress.
The problem can be divided into 2 parts:
creating the snapshot
rolling back to the snapshot
Regarding problem #1. we may have two options:
export the data into XML (file or database column)
duplicate the data inside SQL into the same or different tables (like creating the same table structure again with the same names as the original tables prefixed with "Backup").
Regarding problem #2. the biggest issue I see is how to re-import all data into foreign key related tables which use IDENTITY columns for PK generation. I need to delete all data from all affected tables then re-import everything while temporarily relaxing FK constraints and switching off Identity generation. Once data is loaded I should check if FK constraints are still OK.
Or perhaps I should find a logical way to load tables so that constraint checking can remain in place while loading (as we do not have an unmanageable number of tables this could be a viable solution). Of course I need to do all deletion and re-loading in a single transaction, for obvious reasons.
I suspect there may be no pure SQL-based solution for this, although SQL CLR might be of help to avoid moving data out of SQL Server.
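For what it's worth, the "duplicate inside SQL" option can stay metadata-driven by enumerating the tables by prefix at run time, so newly added Admin tables are picked up automatically. Below is a rough proof of concept using SQLite from Python (hypothetical tables; on SQL Server the ingredients would be sys.tables for the enumeration, SELECT INTO for the copy, and IDENTITY_INSERT during the reload):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.isolation_level = None  # autocommit; transactions are managed explicitly below
con.executescript("""
    CREATE TABLE AdminCategory (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE AdminSetting  (ID INTEGER PRIMARY KEY,
                                CategoryID INTEGER REFERENCES AdminCategory(ID),
                                Value TEXT);
    INSERT INTO AdminCategory VALUES (1, 'general');
    INSERT INTO AdminSetting  VALUES (1, 1, 'on');
""")

def admin_tables():
    # Enumerate by prefix so newly added Admin* tables need no code changes.
    return [r[0] for r in con.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name LIKE 'Admin%'")]

def create_snapshot(n):
    for t in admin_tables():
        con.execute(f"DROP TABLE IF EXISTS Snap{n}_{t}")
        con.execute(f"CREATE TABLE Snap{n}_{t} AS SELECT * FROM {t}")

def restore_snapshot(n):
    con.execute("PRAGMA foreign_keys = OFF")  # relax FKs while reloading
    con.execute("BEGIN")                      # delete + reload atomically
    for t in admin_tables():
        con.execute(f"DELETE FROM {t}")
        con.execute(f"INSERT INTO {t} SELECT * FROM Snap{n}_{t}")
    con.execute("COMMIT")
    con.execute("PRAGMA foreign_keys = ON")

create_snapshot(1)
con.execute("UPDATE AdminSetting SET Value = 'off'")  # the admin "screws up"
restore_snapshot(1)
```

Because the PK column values are copied back verbatim, identities and FK relationships come back exactly as they were, which sidesteps the re-keying problem entirely.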
Is there anyone out there facing the same problem we face? Maybe someone who has successfully solved such a problem?
I do not expect a step by step instruction. Any help on where to start, which routes to take (export to RAW XML or keep snapshot inside the DB or both), pros/cons would be really helpful.
Thank you for your help and your time.
Daniel
We don't have this exact problem, but we have a very similar problem in which we provide our customers with a baseline set of configuration data (fairly complex, mostly identity PKs) that needs to be updated when we provide a new release.
Our mechanism is probably overkill for your situation, but I am sure there is a subset of it that is applicable.
The basic approach is this:
First, we execute a script that drops all of the FK constraints and changes the nullability of those FK columns that are currently NOT NULL to NULL. This script also drops all triggers to ensure that any logical constraints implemented in them will not be executed.
Next, we perform the data import, setting IDENTITY_INSERT ON before loading each table, then setting it back OFF after the table's data is loaded (SQL Server only accepts explicit identity values while it is ON).
Next, we execute a script that checks the data integrity of the newly added items with respect to the foreign keys. In our case, we know that items that do not have a corresponding parent record can safely be deleted, but you may choose to take a different approach (report the error and let someone manually handle the issue).
Finally, once we have verified the data, we execute another script that restores the nullability, adds the FKs back, and reinstalls the triggers.
If you have the budget for it, I would strongly recommend that you take a look at the tools that Red Gate provides, specifically SQL Packager and SQL Data Compare (I suspect there may be other tools out there as well, we just don't have any experience with them). These tools have been critical in the successful implementation of our strategy.
Update
We provide the baseline configuration through an SQL Script that is generated by RedGate's SQL Packager.
Because our end-users can modify the database between updates, which causes the identity values in their database to differ from those in ours, we actually store the baseline primary and foreign keys in separate fields within each record.
When we update the customer database and we need to link new records to known configuration information, we can use the baseline fields to find out what the database-specific FKs should be.
In other words, there is always a known set of field IDs for well-known configuration records, regardless of what other data is modified in the database, and we can use this to link records together.
For example, if I have Table1 linked to Table2, Table1 will have a baseline PK and Table2 will have a baseline PK and a baseline FKey containing Table1's baseline PK. When we update records, if we add a new Table2 record, all we have to do is find the Table1 record with the specified baseline PK, then update the actual FKey in Table2 with the actual PK in Table1.
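A sketch of that lookup (SQLite from Python, using the Table1/Table2 names from the example; the baseline columns and values are of course illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Each row carries its database-specific identity PK *and* a stable baseline key.
    CREATE TABLE Table1 (ID INTEGER PRIMARY KEY, BaselineID INTEGER, Name TEXT);
    CREATE TABLE Table2 (ID INTEGER PRIMARY KEY, Table1ID INTEGER,
                         BaselineID INTEGER, BaselineTable1ID INTEGER);
    INSERT INTO Table1 (BaselineID, Name) VALUES (501, 'well-known config');
""")

# Shipping a new Table2 record against baseline parent 501: resolve the
# customer's actual PK from the baseline key, then fill in the real FK.
actual_pk = con.execute("SELECT ID FROM Table1 WHERE BaselineID = ?",
                        (501,)).fetchone()[0]
con.execute("INSERT INTO Table2 (Table1ID, BaselineID, BaselineTable1ID) "
            "VALUES (?, ?, ?)", (actual_pk, 900, 501))
```

The baseline columns never change, so this resolution works no matter how the customer's identity values have drifted.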
A kind of versioning by date ranges is a common method for records in Enterprise applications. As an example we have a table for business entities (us) or companies (uk) and we keep the current official name in another table as follows:
CompanyID  Name          ValidFrom   ValidTo
12         Business Ltd  2000-01-01  2008-09-23
12         Business Inc  2008-09-23  NULL
The NULL in the last record means that it is the current one. You may use the above logic, and possibly add more columns to gain more control. This way there are no duplicates, you can keep the history up to any level, and you can synchronize the current values across tables easily. Finally, the performance will be great.
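The two queries you need with this scheme are "current value" and "value as of a date". A small sketch (SQLite from Python; the table name `CompanyName` is made up for the example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE CompanyName (CompanyID INTEGER, Name TEXT,
                              ValidFrom TEXT, ValidTo TEXT);
    INSERT INTO CompanyName VALUES
        (12, 'Business Ltd', '2000-01-01', '2008-09-23'),
        (12, 'Business Inc', '2008-09-23', NULL);
""")

# Current name: the open-ended row (ValidTo IS NULL).
current = con.execute(
    "SELECT Name FROM CompanyName WHERE CompanyID = 12 AND ValidTo IS NULL"
).fetchone()[0]

# Name as of a given date: the row whose validity range covers that date.
as_of = con.execute("""
    SELECT Name FROM CompanyName
    WHERE CompanyID = 12
      AND ValidFrom <= :d AND (ValidTo IS NULL OR ValidTo > :d)
""", {"d": "2005-06-01"}).fetchone()[0]
print(current, as_of)  # Business Inc Business Ltd
```

The half-open range convention (ValidFrom inclusive, ValidTo exclusive) keeps adjacent rows from overlapping at the changeover date.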
I have a table on which I don't want to specify any primary key. When I insert records into it using LINQ... aahhh... it gives the error
"Can't perform Create, Update or Delete operations on 'Table(abc)' because it has no primary key"
Can anyone tell me how to insert records without setting a primary key?
By the way, I'm not setting any primary key because this table will have a bulk of data to keep.
You can't use Linq-to-SQL directly with tables that don't have primary keys, because it's not supported.
If you're worried about the performance hit of indexing, what you can do is add a stored procedure that does the insert and add that to your data context. It's a bit more work, and it's not really LINQ to SQL; it'll just be a method that you call on your data context.
There probably won't be a noticeable performance hit on an identity primary key field anyway.
Tami - would it not be a good idea to stick to best practices when using LINQ and add an autonumber primary key, even if it isn't going to be used for anything other than inserts or updates? I can think of many instances where the seemingly 'non-requirement' for a primary key later leads to trouble when trying to move to other platforms, etc.
If there's a compelling reason not to add a 'blind' primary key, then it might help to detail this in the question as well. I can't think of any reason not to add it, especially if it means that you don't have to code around the limitation.
jim
[edit]
Tami - I'll be honest with you: you might have to investigate conventions to best satisfy any answer to this question. Basically, although you don't 'need' an index on your records, since they are never edited or deleted, the convention with LINQ is based around the assumption of data integrity. In essence, LINQ (and many other programmatic tools) requires a convention that allows it to succinctly identify a unique key on each object that it brings into scope. By not defining one, you are bypassing this convention, and therefore LINQ flags this up for you. The only way forward is to go with the flow: even if you 'feel' that the index is redundant, LINQ requires it to give you access to the full functionality built into it.