updating primary key of master and child tables for large tables - sql

I have a fairly huge database with a master table that has a single GUID column (generated by a custom GUID-like algorithm) as its primary key, and 8 child tables that have foreign key relationships with this GUID column. All the tables have approximately 3-8 million records. None of these tables have any BLOB/CLOB/TEXT or other fancy data types, just normal numbers, varchars, dates, and timestamps (about 15-45 columns in each table). There are no partitions and no indexes other than the primary and foreign keys.
Now, the custom GUID algorithm has changed and though there are no collisions I would like to migrate all the old data to use GUIDs generated using the new algorithm. No other columns need to be changed. Number one priority is data integrity and performance is secondary.
Some of the possible solutions that I could think of were (as you will probably notice, they all revolve around one idea):
1. add new column ngu_id and populate it with the new gu_id; disable constraints; update child tables with ngu_id as gu_id; rename ngu_id->gu_id; re-enable constraints
2. read one master record and its dependent child records from the child tables; insert into the same tables with the new gu_id; remove all records with old gu_ids
3. drop constraints; add a trigger to the master table such that all the child tables are updated; start updating old gu_ids with new gu_ids; re-enable constraints
4. add a trigger to the master table such that all the child tables are updated; start updating old gu_ids with new gu_ids
5. create new column ngu_id on all master and child tables; create foreign key constraints on the ngu_id columns; add an update trigger to the master table to cascade values to the child tables; insert new gu_id values into the ngu_id column; remove the old foreign key constraints based on gu_id; remove the gu_id column and rename ngu_id to gu_id; recreate constraints if necessary
6. use on update cascade if available?
My questions are:
Is there a better way? (Can't burrow my head in the sand, gotta do this)
What is the most suitable way to do this? (I have to do this in Oracle, SQL Server and MySQL 4, so vendor-specific hacks are welcome)
What are the typical points of failure for such an exercise and how to minimize them?
If you are with me so far, thank you and hope you can help :)

Your ideas should work. The first is probably the way I would use. Some cautions and things to think about when doing this:
Do not do this unless you have a current backup.
I would leave both values in the main table. That way if you ever have to figure out from some old paperwork which record you need to access, you can do it.
Take the database down for maintenance while you do this and put it in single user mode. The very last thing you need while doing something like this is a user attempting to make changes while you are in midstream. Of course, the first action once in single user mode is the above-mentioned backup. You probably should schedule the downtime for some time when the usage is lightest.
Test on dev first! This should also give you an idea as to how long you will need to close production for. Also, you can try several methods to see which is the fastest.
Be sure to communicate in advance to users that the database will be going down at the scheduled time for maintenance and when they can expect to have it be available again. Make sure the timing is ok. It really makes people mad when they plan to stay late to run the quarterly reports and the database is not available and they didn't know it.
Since there are a fairly large number of records, you might want to run the updates of the child tables in batches (one reason not to use cascading updates). This can be faster than trying to update 5 million records with one update. However, don't try to update one record at a time or you will still be here next year doing this task.
Drop indexes on the GUID field in all the tables and recreate after you are done. This should improve the performance of the change.
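The batching advice above can be sketched roughly as follows. This is only an illustration using Python's sqlite3 module, with SQLite standing in for the real databases; the table names child and id_map, the column names and the batch size are assumptions, and the loop relies on a new ID never being equal to any old ID.

```python
import sqlite3

def demo_db():
    """Tiny in-memory stand-in: one child table plus an old -> new key map."""
    conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit transactions
    conn.executescript("""
        CREATE TABLE id_map (old_id TEXT PRIMARY KEY, new_id TEXT NOT NULL UNIQUE);
        CREATE TABLE child  (id INTEGER PRIMARY KEY, gu_id TEXT NOT NULL);
        INSERT INTO id_map VALUES ('old-1', 'new-1'), ('old-2', 'new-2');
        INSERT INTO child (gu_id)
        VALUES ('old-1'), ('old-1'), ('old-2'), ('old-2'), ('old-2');
    """)
    return conn

def update_in_batches(conn, batch_size):
    """Rewrite child.gu_id in bounded batches, one transaction per batch."""
    total = 0
    while True:
        conn.execute("BEGIN")
        cur = conn.execute(
            """
            UPDATE child
               SET gu_id = (SELECT new_id FROM id_map WHERE old_id = child.gu_id)
             WHERE id IN (SELECT c.id
                            FROM child c JOIN id_map m ON m.old_id = c.gu_id
                           LIMIT ?)
            """,
            (batch_size,),
        )
        conn.execute("COMMIT")
        if cur.rowcount == 0:  # no row still carries an old key
            return total
        total += cur.rowcount
```

Each batch commits on its own, so an interrupted run can simply be restarted: rows that already carry a new key no longer match the mapping table and are skipped.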

Create a new table with the old and the new pk values in it. Place unique constraints on both columns to ensure you haven't broken anything so far.
Disable constraints.
Run updates against all the tables to change the old value to the new value.
Enable the PK, then enable the FK's.
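A rough sketch of these four steps, using Python's sqlite3 module with SQLite as a stand-in. The names master, child, gu_id and id_map are assumptions, and switching PRAGMA foreign_keys off is only SQLite's nearest equivalent of disabling constraints; Oracle, SQL Server and MySQL each have their own syntax for that.

```python
import sqlite3

def demo_db():
    conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit transactions
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE master (gu_id TEXT PRIMARY KEY);
        CREATE TABLE child  (id INTEGER PRIMARY KEY,
                             gu_id TEXT NOT NULL REFERENCES master(gu_id));
        INSERT INTO master VALUES ('old-1'), ('old-2');
        INSERT INTO child (gu_id) VALUES ('old-1'), ('old-2'), ('old-2');
        -- mapping table: unique on both columns, so no key is lost or duplicated
        CREATE TABLE id_map (old_id TEXT NOT NULL UNIQUE, new_id TEXT NOT NULL UNIQUE);
        INSERT INTO id_map VALUES ('old-1', 'new-1'), ('old-2', 'new-2');
    """)
    return conn

def migrate_keys(conn, tables=("master", "child")):
    """Swap old keys for new ones in every table, then report FK violations."""
    conn.execute("PRAGMA foreign_keys = OFF")  # stand-in for disabling constraints
    conn.execute("BEGIN")
    try:
        for table in tables:
            conn.execute(f"""
                UPDATE {table}
                   SET gu_id = (SELECT new_id FROM id_map WHERE old_id = {table}.gu_id)
                 WHERE gu_id IN (SELECT old_id FROM id_map)""")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    finally:
        conn.execute("PRAGMA foreign_keys = ON")  # re-enable enforcement
    # SQLite does not re-validate existing rows on re-enable, so check explicitly.
    return conn.execute("PRAGMA foreign_key_check").fetchall()
```

An empty list from migrate_keys means every child row still points at an existing master row.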

It's difficult to say what the "best" or "most suitable" approach is as you have not described what you are looking for in a solution. For example, do the tables need to be available for query while you are migrating to new IDs? Do they need to be available for concurrent modification? Is it important to complete the migration as fast as possible? Is it important to minimize the space used for migration?
Having said that, I would prefer #1 over your other ideas, assuming they all met your requirements.
Anything that involves a trigger to update the child tables seems error-prone and over complicated and likely will not perform as well as #1.
Is it safe to assume that new IDs will never collide with old IDs? If not, solutions based on updating the IDs one at a time will have to worry about collisions -- this will get messy in a hurry.
Have you considered using CREATE TABLE AS SELECT (CTAS) to populate new tables with the new IDs? You'll be making a copy of your existing tables and this will require additional space, however it is likely to be faster than updating the existing tables in place. The idea is: (i) use CTAS to create new tables with new IDs in place of the old, (ii) create indexes and constraints as appropriate on the new tables, (iii) drop the old tables, (iv) rename the new tables to the old names.
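The CTAS steps (i) to (iv) might look roughly like this for one child table. It is a sketch with Python's sqlite3 module and SQLite as a stand-in; the names child, child_new, payload and id_map are assumptions. SQLite lets the DDL run inside one transaction, which is not something to count on elsewhere: in Oracle and MySQL, DDL statements commit implicitly.

```python
import sqlite3

def demo_db():
    conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit transactions
    conn.executescript("""
        CREATE TABLE id_map (old_id TEXT PRIMARY KEY, new_id TEXT NOT NULL UNIQUE);
        CREATE TABLE child  (id INTEGER PRIMARY KEY, gu_id TEXT NOT NULL, payload TEXT);
        INSERT INTO id_map VALUES ('old-1', 'new-1'), ('old-2', 'new-2');
        INSERT INTO child (gu_id, payload)
        VALUES ('old-1', 'a'), ('old-2', 'b'), ('old-2', 'c');
    """)
    return conn

def rebuild_with_ctas(conn):
    conn.execute("BEGIN")
    try:
        # (i) copy the rows, swapping in the new key
        conn.execute("""
            CREATE TABLE child_new AS
            SELECT c.id, m.new_id AS gu_id, c.payload
              FROM child c JOIN id_map m ON m.old_id = c.gu_id""")
        old_n = conn.execute("SELECT COUNT(*) FROM child").fetchone()[0]
        new_n = conn.execute("SELECT COUNT(*) FROM child_new").fetchone()[0]
        if old_n != new_n:
            raise RuntimeError("row count changed; some keys have no mapping")
        # (ii) the copy carries data only, so indexes/constraints are recreated
        conn.execute("CREATE INDEX child_gu_id_idx ON child_new (gu_id)")
        # (iii) and (iv): drop the old table and take over its name
        conn.execute("DROP TABLE child")
        conn.execute("ALTER TABLE child_new RENAME TO child")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

The row-count comparison before the drop is a cheap guard for the stated priority of data integrity: an inner join silently drops any row whose old key is missing from the mapping.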

In fact, it depends on your RDBMS.
Using Oracle, the simplest choice is to make all of the foreign key constraints "deferred" (checked on commit), perform the updates in a single transaction, then commit. Note that a constraint can only be deferred if it was declared DEFERRABLE.
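The deferred-constraint idea can be tried out in miniature as below. SQLite (through Python's sqlite3 module) is used here purely as a stand-in because it also supports DEFERRABLE INITIALLY DEFERRED foreign keys; the table and column names are assumptions, and the Oracle statements themselves are not shown.

```python
import sqlite3

def demo_db():
    conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit transactions
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE master (gu_id TEXT PRIMARY KEY);
        CREATE TABLE child  (id INTEGER PRIMARY KEY,
                             gu_id TEXT NOT NULL
                                 REFERENCES master(gu_id)
                                 DEFERRABLE INITIALLY DEFERRED);
        INSERT INTO master VALUES ('old-1');
        INSERT INTO child (gu_id) VALUES ('old-1'), ('old-1');
    """)
    return conn

def rekey(conn, old_id, new_id):
    """Change one key in master and child inside a single transaction."""
    conn.execute("BEGIN")
    try:
        conn.execute("UPDATE master SET gu_id = ? WHERE gu_id = ?", (new_id, old_id))
        # The child rows dangle at this point; the deferred FK is not checked yet.
        conn.execute("UPDATE child SET gu_id = ? WHERE gu_id = ?", (new_id, old_id))
        conn.execute("COMMIT")  # the foreign key is checked here
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

If the child update were forgotten, the COMMIT would fail and the transaction would be rolled back, so the tables cannot end up half migrated.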


Adding a row to Table A if it has a required foreign key to Table B which has a required foreign key to Table A

This might sound complicated, so I'll give an example.
Say, I have two tables Instructor and Class.
Instructor has a required field called PreferredClassID which has a foreign key against Class.
Class has a required field called CurrentInstructorID which is a foreign key against Instructor
Is it possible to insert a row to either of these tables?
Cause if I insert a row to Instructor, I won't be able to as I'll need to supply a PreferredClassID, but I can't create a Class row either because it needs a CurrentInstructorID.
If I can't do this, how would I solve this problem? Would I just need to make one of those fields non-required (even if business requirements specifies it really should be required?)
If you find yourself here, reevaluate your data relation model.
In this case, you could simply have a lookup table called PreferredCourse with courseId and instructorId.
This will enforce that both the course and instructor exist before adding the row to the PreferredCourse lookup. Maintaining business model requirements without bending the rules of database model requirements.
While it may seem excessive to have another table, it will prevent a whole lot of maintenance overhead in both your database procedures and jobs, and your application code. Circular references create nothing but headaches and are easily solved with small lookup tables and JOINs.
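A minimal sketch of such a lookup table, using Python's sqlite3 module with SQLite as a stand-in. The column names beyond courseId and instructorId are made up, and making instructorId the primary key (one preferred course per instructor) is an assumption about the business rule.

```python
import sqlite3

def demo_db():
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
    conn.executescript("""
        CREATE TABLE Instructor (instructorId INTEGER PRIMARY KEY, name  TEXT NOT NULL);
        CREATE TABLE Course     (courseId     INTEGER PRIMARY KEY, title TEXT NOT NULL);
        -- The preference lives in its own table, so neither side needs the other
        -- to exist at insert time.
        CREATE TABLE PreferredCourse (
            instructorId INTEGER PRIMARY KEY REFERENCES Instructor(instructorId),
            courseId     INTEGER NOT NULL    REFERENCES Course(courseId)
        );
    """)
    return conn

def add_preference(conn, instructor_id, course_id):
    """Fails with IntegrityError unless both referenced rows already exist."""
    conn.execute(
        "INSERT INTO PreferredCourse (instructorId, courseId) VALUES (?, ?)",
        (instructor_id, course_id),
    )
```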
The Impaler gave an example of how to accomplish this with your current data structure. Please note that you have to 1: make a key nullable in at least one of the tables, and then 2: perform INSERTs in a specified order. Or, 3: disable the constraints, 4: perform INSERTs, 5: re-enable constraints, 6: roll back the transaction if constraints are now broken.
There is a whole lot that can go wrong, so simply fix the relation model now before things get out of hand.
As long as one of those foreign keys allows a null value, you're good. So you:
Insert the row that accepts the null value first (say Instructor), with a null value on the FK. Get the ID of the inserted row.
Insert in the other table (say Class). In the FK you use the ID you got from step #1. Once inserted, you get the ID of this new row.
Update the FK on the first row (Instructor) with the ID you got from step #2.
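The three steps above can be sketched as follows, using Python's sqlite3 module with SQLite as a stand-in. The Name and Title columns are assumptions added for illustration; here PreferredClassID is the foreign key that allows NULL.

```python
import sqlite3

def demo_db():
    conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit transactions
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE Instructor (
            InstructorID     INTEGER PRIMARY KEY,
            Name             TEXT NOT NULL,
            PreferredClassID INTEGER REFERENCES Class(ClassID)  -- nullable
        );
        CREATE TABLE Class (
            ClassID             INTEGER PRIMARY KEY,
            Title               TEXT NOT NULL,
            CurrentInstructorID INTEGER NOT NULL REFERENCES Instructor(InstructorID)
        );
    """)
    return conn

def add_instructor_with_class(conn, name, title):
    conn.execute("BEGIN")
    try:
        # Step 1: insert the row whose FK may be NULL and keep its new ID.
        instructor_id = conn.execute(
            "INSERT INTO Instructor (Name, PreferredClassID) VALUES (?, NULL)",
            (name,),
        ).lastrowid
        # Step 2: insert the other row, pointing at the ID from step 1.
        class_id = conn.execute(
            "INSERT INTO Class (Title, CurrentInstructorID) VALUES (?, ?)",
            (title, instructor_id),
        ).lastrowid
        # Step 3: go back and fill in the FK that was left NULL.
        conn.execute(
            "UPDATE Instructor SET PreferredClassID = ? WHERE InstructorID = ?",
            (class_id, instructor_id),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return instructor_id, class_id
```

Wrapping the three statements in one transaction means other sessions never see the instructor in its temporary state without a preferred class.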
Alternatively, if both FKs are NOT NULL then you have a bit of a problem. The options I see for this last case are:
Use deferrable FK integrity check. Some databases do allow you to insert without checking integrity until the COMMIT happens. This is really tricky, and enabling this is looking for trouble.
Disable the FK for a short period of time. Some databases allow you to enable/disable constraints. You are not deleting them, just temporarily disabling them. If you do this, don't forget to enable them back.
Drop the constraint temporarily while you do the insert, and then add it again. This is really a workaround of last resort. Adding and dropping constraints are DDL statements and usually cannot participate in a transaction. Do this at your own peril.
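For the first option, a deferrable check lets both NOT NULL rows go in within one transaction. The sketch below uses Python's sqlite3 module, with SQLite standing in for whichever database supports deferrable foreign keys; the Name and Title columns and the caller-supplied IDs are assumptions.

```python
import sqlite3

def demo_db():
    conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit transactions
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE Instructor (
            InstructorID     INTEGER PRIMARY KEY,
            Name             TEXT NOT NULL,
            PreferredClassID INTEGER NOT NULL
                REFERENCES Class(ClassID) DEFERRABLE INITIALLY DEFERRED
        );
        CREATE TABLE Class (
            ClassID             INTEGER PRIMARY KEY,
            Title               TEXT NOT NULL,
            CurrentInstructorID INTEGER NOT NULL
                REFERENCES Instructor(InstructorID) DEFERRABLE INITIALLY DEFERRED
        );
    """)
    return conn

def add_pair(conn, instructor_id, name, class_id, title):
    """Insert both rows; the circular FKs are only checked at COMMIT."""
    conn.execute("BEGIN")
    try:
        conn.execute("INSERT INTO Instructor VALUES (?, ?, ?)",
                     (instructor_id, name, class_id))
        conn.execute("INSERT INTO Class VALUES (?, ?, ?)",
                     (class_id, title, instructor_id))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Because neither ID can be read back from the other row first, the caller has to know both IDs up front, which is part of what makes this option awkward.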
Something to consider (as per user7396598's answer) is looking at how normal forms apply to your data as it fits within your relational model.
In this case, it might be worth looking at the following:
With your Instructor table, is the PreferredClassID a necessary component? Does an instructor -need- to have a preferred class, or is it okay to say "Hey, I'm creating an entry for a new instructor, I don't know their preferred class."
(if they're new, they might not have a preferred class that your school offers)
This is a case where you definitely want to have a foreign key, but it should be okay to say 'I don't necessarily know the value I want to put there.'
In a similar vein, does a Class need to have an instructor when it's created? Is it possible to create a Class that an instructor has not been assigned to yet?
Again, both of these points are really a case of 'I don't know what I want to put here, but when I do, it should be a specific instance that exists in another table.'

Update and delete records in the fact table

I have a fact table with five dimension tables associated with it. Typically, the fact table contains the surrogate keys of each dimension and has no business or surrogate key of its own. I am trying to load the fact table with data from the staging fact table, i.e. insert new records. However, I notice the fact table can also handle other operations such as Update or Delete on data. A conditional split was used in the SSIS package for this purpose: if all surrogate keys are 0, then make the new insert. My question is: can I use the surrogate keys for Update or Delete?
I made an insert on the fact table just to give an idea of what the data will look like.
The answer is yes, you can. BUT, will there be a situation where one employee sold the same product, from the same supplier, to the same customer, on the same day? Perhaps a different order on the same day? (this is based on the data you present in the question)
If all the surrogate keys together can uniquely identify a record, update fact records to your heart's content. But if that is not the case, you could end up updating records that you did not intend to update.
I tend to include an order number in the fact tables I design to help avoid that situation, but you may not have that in your actual fact tables. Including the order number is a pattern referred to as a degenerate dimension in the fact table. I have found it to be pretty handy.
Anyway, the answer is the same. You can update fact records based on surrogate keys, as long as all of them together can uniquely identify the row(s) you want to update.
Don't throw caution to the wind; be sure your data warehouse is designed such that you can do this if you need to. Being able to do in-place updates of facts can be nice, versus delete and replace, in that there could be fewer steps in the ETL process.
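One way to check whether the surrogate keys are safe to update on is to look for key combinations that occur more than once. The sketch below uses Python's sqlite3 module with SQLite as a stand-in; the table name fact_sales, the five key columns, order_number and the sample rows are all made up for illustration.

```python
import sqlite3

KEYS = ("employee_key", "product_key", "supplier_key", "customer_key", "date_key")

def demo_db():
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.executescript("""
        CREATE TABLE fact_sales (
            employee_key INTEGER, product_key INTEGER, supplier_key INTEGER,
            customer_key INTEGER, date_key INTEGER,
            order_number TEXT,      -- degenerate dimension
            quantity     INTEGER
        );
        INSERT INTO fact_sales VALUES
            (1, 1, 1, 1, 20240101, 'SO-1', 5),
            (1, 1, 1, 1, 20240101, 'SO-2', 3),  -- same keys, different order
            (2, 1, 1, 1, 20240101, 'SO-3', 7);
    """)
    return conn

def ambiguous_groups(conn, columns):
    """Return each combination of `columns` that matches more than one fact row."""
    cols = ", ".join(columns)
    return conn.execute(
        f"SELECT {cols}, COUNT(*) FROM fact_sales GROUP BY {cols} HAVING COUNT(*) > 1"
    ).fetchall()
```

With this sample data, grouping on the five surrogate keys alone reports one ambiguous combination, while adding order_number to the grouping reports none.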

SQL table performance with foreign key

I have a website that needs to do a lot of active searching of users. I have a User table which contains links to all the full user details but that is only really of interest when looking at your own account. When searching for other users, there is very limited information you need so in order to make searches faster and more efficient, every time you update your user details, the code writes an entry to a separate table called UserLight - which only contains about 8 columns and is all pure data - ie no links to other child tables or collection objects, just string data for speed. Each user can only have one UserLight entry at a time which is the summary representation of how their account appears to other users.
My question is: for performance, does it matter that I am making the UserId a foreign key constraint with the User table? That way you cannot create a UserLight entry without the corresponding row in User, and when you delete the User row, it automatically cascades and deletes the UserLight entry. That is ideal and how I would like to have it, but I'm just wondering if having this FK constraint on the UserLight table in any way slows down read or write operations to/from this table. If it does, I am happy to drop the FK constraint and have a completely isolated table with no constraints or external references to other objects to speed up performance, and just manage housekeeping manually, but if the FK constraint doesn't affect performance at all, I would prefer to keep it.
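It will not hamper your read performance: the constraint is only checked when rows are written, and that check is typically a cheap lookup on the User primary key. It's preferred to keep the data constrained so as to avoid insert/delete/update anomalies.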
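For reference, the constraint described in the question might be declared roughly like this. The sketch uses Python's sqlite3 module with SQLite as a stand-in, and the Email and DisplayName columns are assumptions.

```python
import sqlite3

def demo_db():
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
    conn.executescript("""
        CREATE TABLE "User" (UserId INTEGER PRIMARY KEY, Email TEXT NOT NULL);
        -- UserId is both the primary key (one summary row per user) and the FK.
        CREATE TABLE UserLight (
            UserId      INTEGER PRIMARY KEY
                        REFERENCES "User"(UserId) ON DELETE CASCADE,
            DisplayName TEXT NOT NULL
        );
    """)
    return conn

def delete_user(conn, user_id):
    """Deleting the User row removes its UserLight row through the cascade."""
    conn.execute('DELETE FROM "User" WHERE UserId = ?', (user_id,))
```

Searches against UserLight never touch the User table; only inserts, key updates and deletes trigger the check.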

Setting the right foreign key on insert

Morning all,
I'm doing a lot of work to drag a database (SQL Server 2005, in 2000 compatibility mode) kicking and screaming towards having a sane design.
At the moment, all the tables' primary keys are nvarchar(32), and are set using uniqId() (oddly, this gets run through a special hashing function, no idea why)
So in several phases, I'm making some fundamental changes:
Introducing ID_int columns to each table, auto increment and primary key
Adding some extra indexing, removing unused indexes, dropping unused columns
This phase has worked well so far, test db seems a bit faster, total index sizes for each table are MUCH smaller.
My problem is with the next phase: foreign keys. I need to be able to set these INT foreign keys on insert in the other tables.
There are several applications pointing at this DB, only one of which I have much control over. It also contains many stored procs and triggers.
I can't physically make all the changes needed in one go.
So what I'd like to be able to do is add the integer FKs to each table and have them automatically set to the right thing on insert.
To illustrate this with an example:
Two tables, Call and POD, linked pod.Call_ID -> Call.Call_ID. This is an nvarchar(32) field.
I've altered call such that Call_ID_int is identity, auto increment, primary key. I need to add POD.Call_ID_int such that, on insert, it gets the right value from Call.Call_ID_int.
I'm sure I could do this with a BEFORE trigger, but I'd rather avoid this for maintenance and speed reasons.
I thought I could do this with a constraint, but after much research found I can't. I tried this:
alter table POD
add constraint
for Call_ID_int
Where the map_Call_ID_int function takes the Call_ID and returns the right Call_ID_int, but I get this error:
The name "Call_ID" is not permitted in this context. Valid expressions
are constants, constant expressions, and (in some contexts) variables.
Column names are not permitted.
Any ideas how I can achieve this?
Thanks very much in advance!
Triggers are the easiest way.
You'll have odd concurrency issues with defaults based on UDFs too (like you would for CHECK constraints).
Another trick is to use views to hide schema changes, but still with triggers to intercept DML. So your "old" table now exists only as a view on the "new" table. A write to the "old" table/view actually happens on the new table.
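The view-plus-trigger trick can be illustrated in miniature as follows. SQLite (via Python's sqlite3 module) is used as a stand-in because it supports INSTEAD OF triggers on views; the T-SQL trigger syntax on SQL Server differs, and the names POD_new, POD_ID and POD_insert are hypothetical.

```python
import sqlite3

def demo_db():
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE Call (
            Call_ID_int INTEGER PRIMARY KEY AUTOINCREMENT,
            Call_ID     TEXT NOT NULL UNIQUE          -- legacy string key
        );
        CREATE TABLE POD_new (
            POD_ID      INTEGER PRIMARY KEY,
            Call_ID     TEXT NOT NULL,
            Call_ID_int INTEGER NOT NULL REFERENCES Call(Call_ID_int)
        );
        -- Legacy applications keep writing to "POD", which is now only a view.
        CREATE VIEW POD AS SELECT POD_ID, Call_ID FROM POD_new;
        CREATE TRIGGER POD_insert INSTEAD OF INSERT ON POD
        BEGIN
            INSERT INTO POD_new (Call_ID, Call_ID_int)
            VALUES (NEW.Call_ID,
                    (SELECT Call_ID_int FROM Call WHERE Call_ID = NEW.Call_ID));
        END;
    """)
    return conn
```

An insert through the view that names an unknown Call_ID is rejected, because the lookup yields NULL and Call_ID_int is declared NOT NULL.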

Fixing DB Inconsistencies - ID Fields

I've inherited a (Microsoft?) SQL database that wasn't very pristine in its original state. There are still some very strange things in it that I'm trying to fix - one of them is inconsistent ID entries.
In the accounts table, each entry has a number called accountID, which is referenced in several other tables (notes, equipment, etc. ). The problem is that the numbers (for some random reason) - range from about -100000 to +2000000 when there are about only 7000 entries.
Is there any good way to re-number them while changing the corresponding numbers in the other tables? At my disposal I also have ColdFusion, so I'll accept anything that works with SQL and/or ColdFusion.
Surrogate keys are meant to be meaningless, so unless you actually had a database integrity issue (like there being no foreign key constraints properly defined) or your identity was approaching the maximum for its datatype, I would leave them alone and go after some other low-hanging fruit that would have more impact.
In this instance, it sounds like "why" is a better question than "how". The OP notes that there is a strange problem that needs to be fixed but doesn't say why it is a problem. Is it causing problems? What positive impact would changing these numbers have? Unless you originally programmed the system and understand precisely why the number is in its current state, you are taking quite a risk making changes like this.
I would talk to an accountant (or at least your financial people) before messing in any way with the numbers in the accounts tables if this is a financial app. The table of accounts is very critical to how finances are reported. These IDs may have meaning you don't understand. No one puts in a negative ID unless they had a reason. I would under no circumstances change that unless I understood why it was negative to begin with. You could truly screw up your tax reporting or some other thing by making an unneeded change.
You could probably disable the foreign key relationships (if you're able to take it offline temporarily) and then update the primary keys using a script. I've used this update script before to change values, and you could pretty easily wrap this code in a cursor to go through the key values in question, one by one, and update the arbitrary value to an incrementing value you're keeping track of.
Check out the script here: http://vyaskn.tripod.com/sql_server_search_and_replace.htm
If you just have a list of tables that use the primary key, you could set up a series of UPDATE statements that run inside your cursor, and then you wouldn't need to use this script (which can be a little slow).
It's worth asking, though, why these values appear out of whack. Does this database have values added and deleted constantly? Are the primary key values really arbitrary, or do they just appear to be while actually having meaning? Though I'm all for consolidating, you'd have to ensure that there's no purpose to those values.
With ColdFusion this shouldn't be a herculean task, but it will be messy and you'll have to be careful. One method you could use would be to script the database and then generate a brand new, blank table schema. Set the accountID as an identity field in the new database.
Then, using ColdFusion, write a query that will pull all of the old account data and insert the rows into the new database one by one. For each row, let the new database assign a new ID. After each insert, pull the new ID (using either @@IDENTITY or MAX(accountID)) and store the new ID and the old ID together in a temporary table so you know which old IDs belong to which new IDs.
Next, repeat the process with each of the child tables. For each old ID, pull its child entries and re-insert them into the new database using the new IDs. If the primary keys on the child tables are fine, you can insert them as-is or let the server assign new ones if they don't matter.
Assigning new IDs in place by disabling relationships temporarily may work, but you might run into conflicts if one of the entries is assigned an ID that is still being used by the old data.
Create a new column in the accounts table for your new ID, and new column in each of your related tables to reference the new ID column.
ALTER TABLE accounts
ADD new_accountID int IDENTITY
ALTER TABLE notes
ADD new_accountID int
ALTER TABLE equipment
ADD new_accountID int
Then you can map the new_accountID column on each of your referencing tables to the accounts table.
UPDATE notes
SET new_accountID = accounts.new_accountID
FROM accounts
INNER JOIN notes ON (notes.accountID = accounts.accountID)
UPDATE equipment
SET new_accountID = accounts.new_accountID
FROM accounts
INNER JOIN equipment ON (equipment.accountID = accounts.accountID)
At this point, each table has both accountID with the old keys, and new_accountID with the new keys. From here it should be pretty straightforward.
Break all of the foreign keys on accountID.
On each table, UPDATE [table] SET accountID = new_accountID.
Re-add the foreign keys for accountID.
Drop new_accountID from all of the tables, as it's no longer needed.
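The whole procedure can be rehearsed on a toy copy first. The sketch below uses Python's sqlite3 module with SQLite as a stand-in, so several details are assumptions rather than the T-SQL above: a rank query replaces IDENTITY, a unique index named accounts_accountID_uq plays the role of the key that gets broken and re-added, the name, body and label columns are made up, and the final step of dropping new_accountID is left out.

```python
import sqlite3

CHILD_TABLES = ("notes", "equipment")

def demo_db():
    conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit transactions
    conn.executescript("""
        CREATE TABLE accounts  (accountID INTEGER NOT NULL, name TEXT NOT NULL);
        CREATE UNIQUE INDEX accounts_accountID_uq ON accounts (accountID);
        CREATE TABLE notes     (noteID INTEGER PRIMARY KEY,
                                accountID INTEGER NOT NULL, body TEXT);
        CREATE TABLE equipment (equipmentID INTEGER PRIMARY KEY,
                                accountID INTEGER NOT NULL, label TEXT);
        INSERT INTO accounts VALUES (-100000, 'A'), (5, 'B'), (2000000, 'C'), (1, 'D');
        INSERT INTO notes (accountID, body)
        VALUES (5, 'note for B'), (2000000, 'note for C');
        INSERT INTO equipment (accountID, label) VALUES (-100000, 'kit for A');
    """)
    return conn

def renumber_accounts(conn):
    conn.execute("BEGIN")
    try:
        # New key column on the parent, numbered 1..n in old-key order.
        conn.execute("ALTER TABLE accounts ADD COLUMN new_accountID INTEGER")
        conn.execute("""
            UPDATE accounts SET new_accountID =
                (SELECT COUNT(*) FROM accounts a2
                  WHERE a2.accountID <= accounts.accountID)""")
        # Carry the new key into each referencing table.
        for child in CHILD_TABLES:
            conn.execute(f"ALTER TABLE {child} ADD COLUMN new_accountID INTEGER")
            conn.execute(f"""
                UPDATE {child} SET new_accountID =
                    (SELECT a.new_accountID FROM accounts a
                      WHERE a.accountID = {child}.accountID)""")
        # Break the key, overwrite old with new everywhere, restore the key.
        conn.execute("DROP INDEX accounts_accountID_uq")
        for table in ("accounts",) + CHILD_TABLES:
            conn.execute(f"UPDATE {table} SET accountID = new_accountID")
        conn.execute("CREATE UNIQUE INDEX accounts_accountID_uq ON accounts (accountID)")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Dropping the unique index before the swap matters in this sketch: some new numbers coincide with old ones (here the old ID 1), so overwriting row by row while uniqueness is enforced could fail partway through.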