How do I carry old ID keys forward into a new database structure? - sql

Given a project I'm working on, we have an old database structure we're migrating data from into a new database structure, and we need to preserve the old keys for a few tables for backwards compatibility with some existing application functionality.
Currently, there are two approaches we are considering for addressing this need:
1) Create an extra nullable field for each table and insert the old key into that new field
2) Create companion table(s) that contain the old and new key mappings
Note: new data will not generate old ID keys, so with approach #1 the nullable field will hold NULL for every new record going forward.
Which approach is better for a cleaner database design, and data management long-term?
Do you see any issues with either approach, and if so, what issues?
Is there a #3 approach that I haven't thought of yet?
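For concreteness, the two approaches might look something like this in SQL Server syntax (table and column names are illustrative, not taken from the actual schema):

-- Approach #1: a nullable legacy-key column on the migrated table
ALTER TABLE dbo.Customer ADD legacy_customer_id int NULL;

-- Approach #2: a companion mapping table instead
CREATE TABLE dbo.CustomerKeyMap (
    customer_id        int NOT NULL PRIMARY KEY
        REFERENCES dbo.Customer (customer_id),
    legacy_customer_id int NOT NULL UNIQUE
);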

You mention sql, but is it SQL-Server?
If it is SQL Server, look into SET IDENTITY_INSERT. This lets you explicitly insert values into an IDENTITY (auto-increment) column instead of having that column protected from explicit inserts.
With it enabled, if you explicitly include the PK and its value in the INSERT statement, SQL Server will respect that value, so the original key is preserved in the original column you are hoping to retain without having to force yet another column for backward compatibility purposes.
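A minimal sketch, assuming SQL Server and an illustrative table (dbo.Customer with an IDENTITY primary key CustomerID):

SET IDENTITY_INSERT dbo.Customer ON;

INSERT INTO dbo.Customer (CustomerID, Name)   -- explicitly supplying the IDENTITY value
VALUES (12345, 'Migrated customer');          -- 12345 is the key carried over from the old database

SET IDENTITY_INSERT dbo.Customer OFF;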

Related

Enable adding unknown properties to table objects

I am creating a database management webapp that has a strange requirement: provide the user with the ability to add new 'fields' to existing objects. For example, the table 'Employees' has names and IDs of employees. Suddenly the owner of the system wants to know whether his employees have a driver's license. But our table and app did not expect that.
The only options that come to my mind are:
1) Add a big varchar field storing additional properties as JSON or something
2) Add a table 'additional properties' that will allow creating new objects linked by PK to existing users
Then we will have:
TABLE USER          TABLE PROPERTIES
-ID <-------------- -FK
-NAME               -NAME  (driver licence)
                    -VALUE (true)
How bad is the second idea? Are there any better options other than going NoSQL?
There is no issue with either approach. EAV (entity-attribute-value) models have been part of relational databases, probably since the earliest databases were created.
They do have some downsides:
The value attribute is (generally) a string, so values of other types have to be squeezed into text and the column cannot enforce their real types.
The values cannot be part of declared foreign key relationships.
Validating values is tricky -- very complicated check constraints, for instance.
There is no (easy) way to ensure that an entity has a particular value.
But for user-defined or sparsely populated values, EAV is definitely a reasonable choice.
JSON is another reasonable choice. That does, however, require a one-time change to the database to add a JSON column. Some databases offer indexing on JSON values, which can improve performance.
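On SQL Server 2016+, for example, that might look like this (table, column, and property names are illustrative):

ALTER TABLE dbo.Employees
    ADD ExtraProps nvarchar(max)
        CONSTRAINT CK_Employees_ExtraProps CHECK (ISJSON(ExtraProps) = 1);

-- Query a user-defined property
SELECT EmployeeID, Name
FROM dbo.Employees
WHERE JSON_VALUE(ExtraProps, '$.hasDriverLicense') = 'true';

-- JSON_VALUE can also be wrapped in a computed column and indexed if these lookups need to be fast.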
If "has-drivers-license" is a one-time change, then you might just want a separate table with the same primary key. The next time that a new column is needed, you can modify the "options" table, rather than the main table. This allows better support for validating values (all values are unique, for example) or defining foreign key constraints.

Design Pattern to add columns in database table dynamically

The user wants to add new fields in UI dynamically. This new field should get stored in database and they should be allowed to perform CRUD on it.
Now I can do this by specifying XML, but I wanted a better way where these new columns are searchable. Also, the idea of firing an ALTER statement and adding a new column seems wrong.
Can anyone help me with a design pattern on database server side of how to solve this problem?
This can be approached using a key value system. You create a table with the primary key column(s) of the table you want to annotate, a column for the name of the attribute, and a column for its value. When your user wants to add an attribute (say height) to the record of person 123, you add a row to the new table with the values (123, 'HEIGHT', '140.5').
In general you cast the values to TEXT for storage but if you know all the attributes will be numeric you can choose a different type for the value column. You can also (not recommended) use several different value columns depending on the type of the data.
This technique has the advantage that you don't need to modify the database structure to add new attributes and attributes are only stored for those records that have them. The disadvantage is that querying is not as straightforward as if the columns were all in the main data table.
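A minimal sketch of the key/value table and a typical lookup (table and column names are illustrative):

CREATE TABLE person_attribute (
    person_id  int          NOT NULL REFERENCES person (person_id),
    attr_name  varchar(50)  NOT NULL,
    attr_value varchar(255) NOT NULL,
    PRIMARY KEY (person_id, attr_name)
);

-- "add height to person 123"
INSERT INTO person_attribute (person_id, attr_name, attr_value)
VALUES (123, 'HEIGHT', '140.5');

-- pulling an attribute back alongside the main row costs a join per attribute
SELECT p.person_id, p.name, a.attr_value AS height
FROM person p
LEFT JOIN person_attribute a
       ON a.person_id = p.person_id
      AND a.attr_name = 'HEIGHT';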
There is no reason why a qualified business user should not be allowed to add a column to a table. It is less likely to cause a problem than just about anything else you can imagine, including adding a new row to a table or changing the value of a data element.
Using either of the methods described above does not avoid any risk; they are simply throwbacks to COBOL filler fields or unnecessary embellishments of the database function. The result can still be unnormalized and inaccurate.
These same business persons add columns to spreadsheets and tables to Word documents without DBAs getting in their way.
Of course, just adding the column is the smallest part of getting an information system to work, but it is often the case that it is perceived to be an almost insurmountable barrier. It is in fact 5 min worth of work assuming you know where to put it. Adding a column to the proper table with the proper datatype is easy to do, easy to use, and has the best chance of encouraging data quality.
Find out what the maximum number of user-added fields will be and add them beforehand. For example 'User1', 'User2', 'User3', 'User4'... etc. You can then enable the fields on the UI based on some configurable settings.

Fabricate entities with multiple IDENTITY PKs?

Disclosure: I'm a 'natural key' advocate myself and averse to the IDENTITY PK approach. But I do have a 'live and let live' approach to lifestyle choices, so no religious arguments here please :)
I have inherited a table where the only key is the IDENTITY PK column; let's call it ID. There are many tables that reference ID. The intended process of creating a new entity seems to be:
INSERT INTO the table.
Use SCOPE_IDENTITY() to grab the auto-generated ID.
Use the auto-generated ID to INSERT into related tables.
In fact, there is a helper stored proc to create an entity and return the ID. However, I have a couple of issues:
I need to go further than the helper stored proc and create rows in related tables which themselves have IDENTITY PKs, so for each entity I need to grab several auto-generated values along the way.
I need to fabricate several hundred entities and the helper procs are coded to handle one entity at a time.
What is the best way to bulk fabricate entities using the 'IDENTITY PK' design?
When using my own 'natural key' designs, I can generate the key values in advance, so it's simply a case of loading some scratch tables and INSERTing into the tables in the order expected by the foreign keys. I'm therefore tempted to find a sequence of high INTEGER values (to match the type of the IDENTITY columns) which I know aren't being used now and hope won't be in use when the time comes to do the INSERT. Is this a good idea?
Are you talking specifically about MS SQL Server?
It is unfortunate that IDENTITY columns disallow explicit inserts by default. In other DBMSs, being auto-increment wouldn't stop you from inserting an explicit value into that column, which would make it easy to choose the keys in advance. Unfortunately on SQL Server you have the inconvenience of SET IDENTITY_INSERT to worry about.
there is a helper stored proc to create an entity and return the ID.
It seems a little over-the-top to me to use a sproc for that, since it's generally as simple as selecting SCOPE_IDENTITY(). Quite often you can avoid the explicit select by writing each insert such that it can use the last insert's SCOPE_IDENTITY() directly.
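A minimal T-SQL sketch of that, with illustrative table names:

DECLARE @entityId int;

INSERT INTO dbo.Entity (Name) VALUES ('example');
SET @entityId = SCOPE_IDENTITY();          -- ID generated by the insert above

INSERT INTO dbo.EntityDetail (EntityID, Detail) VALUES (@entityId, 'detail row');
INSERT INTO dbo.EntityNote   (EntityID, Note)   VALUES (@entityId, 'note row');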
find a sequence of high value INTEGER values which I know isn't being used now and hope that they won't be being used [...] Is this a good idea?
They don't necessarily have to be very high values; in fact if you did that often you'd be making many huge gaps in the IDENTITY values, which is generally better avoided. You could even use the MAX(column)+1 values as long as you either caught the error where someone else used those values in between times, or, better, do a select-max then insert in a transaction.
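If you do go down that road, something along these lines (table and column names are illustrative) keeps the select-max and the insert in one transaction so nobody can grab the value in between:

BEGIN TRAN;

DECLARE @nextId int;
SELECT @nextId = ISNULL(MAX(ID), 0) + 1
FROM dbo.Widget WITH (TABLOCKX, HOLDLOCK);   -- block other writers until we commit

SET IDENTITY_INSERT dbo.Widget ON;
INSERT INTO dbo.Widget (ID, Name) VALUES (@nextId, 'pre-assigned key');
SET IDENTITY_INSERT dbo.Widget OFF;

COMMIT;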

Fixing DB Inconsistencies - ID Fields

I've inherited a (Microsoft?) SQL database that wasn't very pristine in its original state. There are still some very strange things in it that I'm trying to fix - one of them is inconsistent ID entries.
In the accounts table, each entry has a number called accountID, which is referenced in several other tables (notes, equipment, etc.). The problem is that the numbers (for some random reason) range from about -100000 to +2000000 when there are only about 7000 entries.
Is there any good way to re-number them while changing the corresponding numbers in the other tables? I also have ColdFusion at my disposal, so anything that works with SQL and/or ColdFusion is acceptable.
For surrogate keys, they are meant to be meaningless, so unless you actually had a database integrity issue (like there were no foreign key constraints properly defined) or your identity was approaching the maximum for its datatype, I would leave them alone and go after some other low-hanging fruit that would have more impact.
In this instance, it sounds like "why" is a better question than "how". The OP notes that there is a strange problem that needs to be fixed but doesn't say why it is a problem. Is it causing problems? What positive impact would changing these numbers have? Unless you originally programmed the system and understand precisely why the number is in its current state, you are taking quite a risk making changes like this.
I would talk to an accountant (or at least your financial people) before messing in any way with the numbers in the accounts tables if this is a financial app. The table of accounts is very critical to how finances are reported. These IDs may have meaning you don't understand. No one puts in a negative ID unless they had a reason. I would under no circumstances change that unless I understood why it was negative to begin with. You could truly screw up your tax reporting or some other thing by making an unneeded change.
You could probably disable the foreign key relationships (if you're able to take it offline temporarily) and then update the primary keys using a script. I've used this update script before to change values, and you could pretty easily wrap this code in a cursor to go through the key values in question, one by one, and update the arbitrary value to an incrementing value you're keeping track of.
Check out the script here: http://vyaskn.tripod.com/sql_server_search_and_replace.htm
If you just have a list of tables that use the primary key, you could set up a series of UPDATE statements that run inside your cursor, and then you wouldn't need to use this script (which can be a little slow).
It's worth asking, though, why these values appear out of whack. Does this database have values added and deleted constantly? Are the primary key values really arbitrary, or do they just appear to be, but they really have meaning? Though I'm all for consolidating, you'd have to ensure that there's no purpose to those values.
With ColdFusion this shouldn't be a herculean task, but it will be messy and you'll have to be careful. One method you could use would be to script the database and then generate a brand new, blank table schema. Set the accountID as an identity field in the new database.
Then, using ColdFusion, write a query that will pull all of the old account data and insert them into the new database one by one. For each row, let the new database assign a new ID. After each insert, pull the new ID (using either @@IDENTITY or MAX(accountID)) and store the new ID and the old ID together in a temporary table so you know which old IDs belong to which new IDs.
Next, repeat the process with each of the child tables. For each old ID, pull its child entries and re-insert them into the new database using the new IDs. If the primary keys on the child tables are fine, you can insert them as-is or let the server assign new ones if they don't matter.
Assigning new IDs in place by disabling relationships temporarily may work, but you might run into conflicts if one of the entries gets assigned an ID that is already in use in the old data.
Create a new column in the accounts table for your new ID, and new column in each of your related tables to reference the new ID column.
ALTER TABLE accounts
ADD new_accountID int IDENTITY
ALTER TABLE notes
ADD new_accountID int
ALTER TABLE equipment
ADD new_accountID int
Then you can map the new_accountID column on each of your referencing tables to the accounts table.
UPDATE notes
SET new_accountID = accounts.new_accountID
FROM accounts
INNER JOIN notes ON (notes.accountID = accounts.accountID)
UPDATE equipment
SET new_accountID = accounts.new_accountID
FROM accounts
INNER JOIN equipment ON (equipment.accountID = accounts.accountID)
At this point, each table has both accountID with the old keys, and new_accountID with the new keys. From here it should be pretty straightforward.
Break all of the foreign keys on accountID.
On each table, UPDATE [table] SET accountID = new_accountID.
Re-add the foreign keys for accountID.
Drop new_accountID from all of the tables, as it's no longer needed.
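A rough sketch of those last four steps, using illustrative foreign key names (and assuming accountID is not itself an IDENTITY column, since IDENTITY columns cannot be UPDATEd in place):

ALTER TABLE notes     DROP CONSTRAINT FK_notes_accounts;
ALTER TABLE equipment DROP CONSTRAINT FK_equipment_accounts;

UPDATE accounts  SET accountID = new_accountID;
UPDATE notes     SET accountID = new_accountID;
UPDATE equipment SET accountID = new_accountID;

ALTER TABLE notes     ADD CONSTRAINT FK_notes_accounts
    FOREIGN KEY (accountID) REFERENCES accounts (accountID);
ALTER TABLE equipment ADD CONSTRAINT FK_equipment_accounts
    FOREIGN KEY (accountID) REFERENCES accounts (accountID);

ALTER TABLE accounts  DROP COLUMN new_accountID;
ALTER TABLE notes     DROP COLUMN new_accountID;
ALTER TABLE equipment DROP COLUMN new_accountID;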

updating primary key of master and child tables for large tables

I have a fairly huge database with a master table with a single column GUID (custom GUID-like algorithm) as primary key and 8 child tables that have foreign key relationships with this GUID column. All the tables have approximately 3-8 million records. None of these tables have any BLOB/CLOB/TEXT or any other fancy data types, just normal numbers, varchars, dates, and timestamps (about 15-45 columns in each table). No partitions or other indexes other than the primary and foreign keys.
Now, the custom GUID algorithm has changed and though there are no collisions I would like to migrate all the old data to use GUIDs generated using the new algorithm. No other columns need to be changed. Number one priority is data integrity and performance is secondary.
Some of the possible solutions that I could think of were (as you will probably notice they all revolve around one idea only)
1) add new column ngu_id and populate with new gu_id; disable constraints; update child tables with ngu_id as gu_id; rename ngu_id -> gu_id; re-enable constraints
2) read one master record and its dependent child records from child tables; insert into the same table with new gu_id; remove all records with old gu_ids
3) drop constraints; add a trigger to the master table such that all the child tables are updated; start updating old gu_ids with new gu_ids; re-enable constraints
4) add a trigger to the master table such that all the child tables are updated; start updating old gu_ids with new gu_ids
5) create new column ngu_id on all master and child tables; create foreign key constraints on ngu_id columns; add update trigger to the master table to cascade values to child tables; insert new gu_id values into ngu_id column; remove old foreign key constraints based on gu_id; remove gu_id column and rename ngu_id to gu_id; recreate constraints if necessary
6) use ON UPDATE CASCADE if available?
My questions are:
Is there a better way? (Can't bury my head in the sand, gotta do this)
What is the most suitable way to do this? (I have to do this in Oracle, SQL Server, and MySQL 4, so vendor-specific hacks are welcome)
What are the typical points of failure for such an exercise and how to minimize them?
If you are with me so far, thank you and hope you can help :)
Your ideas should work. The first is probably the one I would use. Some cautions and things to think about when doing this:
Do not do this unless you have a current backup.
I would leave both values in the main table. That way if you ever have to figure out from some old paperwork which record you need to access, you can do it.
Take the database down for maintenance while you do this and put it in single user mode. The very last thing you need while doing something like this is a user attempting to make changes while you are in midstream. Of course, the first action once in single user mode is the above-mentioned backup. You probably should schedule the downtime for some time when the usage is lightest.
Test on dev first! This should also give you an idea as to how long you will need to close production for. Also, you can try several methods to see which is the fastest.
Be sure to communicate in advance to users that the database will be going down at the scheduled time for maintenance and when they can expect to have it be available again. Make sure the timing is ok. It really makes people mad when they plan to stay late to run the quarterly reports and the database is not available and they didn't know it.
Since there are a fairly large number of records, you might want to run the updates of the child tables in batches (one reason not to use cascading updates). This can be faster than trying to update 5 million records with one update. However, don't try to update one record at a time or you will still be here next year doing this task.
Drop indexes on the GUID field in all the tables and recreate after you are done. This should improve the performance of the change.
Create a new table with the old and the new pk values in it. Place unique constraints on both columns to ensure you haven't broken anything so far.
Disable constraints.
Run updates against all the tables to modify the old value to the new value.
Enable the PK, then enable the FK's.
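A rough T-SQL sketch of that mapping table and one batched child-table update (table, column, and constraint names are illustrative; each foreign key would be disabled and re-enabled the same way):

CREATE TABLE guid_map (
    old_guid varchar(64) NOT NULL UNIQUE,
    new_guid varchar(64) NOT NULL UNIQUE
);
-- ... populate guid_map with one row per master record ...

ALTER TABLE child1 NOCHECK CONSTRAINT FK_child1_master;   -- disable the FK

-- update in batches; rows already switched no longer match old_guid,
-- assuming new GUIDs never collide with old ones
WHILE 1 = 1
BEGIN
    UPDATE TOP (50000) c
    SET    c.gu_id = m.new_guid
    FROM   child1 c
    JOIN   guid_map m ON m.old_guid = c.gu_id;

    IF @@ROWCOUNT = 0 BREAK;
END;

ALTER TABLE child1 WITH CHECK CHECK CONSTRAINT FK_child1_master;   -- re-enable and re-validate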
It's difficult to say what the "best" or "most suitable" approach is as you have not described what you are looking for in a solution. For example, do the tables need to be available for query while you are migrating to new IDs? Do they need to be available for concurrent modification? Is it important to complete the migration as fast as possible? Is it important to minimize the space used for migration?
Having said that, I would prefer #1 over your other ideas, assuming they all met your requirements.
Anything that involves a trigger to update the child tables seems error-prone and over complicated and likely will not perform as well as #1.
Is it safe to assume that new IDs will never collide with old IDs? If not, solutions based on updating the IDs one at a time will have to worry about collisions -- this will get messy in a hurry.
Have you considered using CREATE TABLE AS SELECT (CTAS) to populate new tables with the new IDs? You'll be making a copy of your existing tables and this will require additional space, however it is likely to be faster than updating the existing tables in place. The idea is: (i) use CTAS to create new tables with new IDs in place of the old, (ii) create indexes and constraints as appropriate on the new tables, (iii) drop the old tables, (iv) rename the new tables to the old names.
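A rough Oracle-style sketch of that idea, reusing an illustrative guid_map table of (old_guid, new_guid) pairs:

CREATE TABLE master_new AS
SELECT m.new_guid AS gu_id,
       t.col1, t.col2            -- remaining columns carried over unchanged
FROM   master t
JOIN   guid_map m ON m.old_guid = t.gu_id;

CREATE TABLE child1_new AS
SELECT m.new_guid AS gu_id,
       c.col1, c.col2
FROM   child1 c
JOIN   guid_map m ON m.old_guid = c.gu_id;

-- then create the PK/FK constraints and indexes on the new tables,
-- drop the old tables, and rename *_new back to the original names.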
In fact, it depends on your RDBMS.
Using Oracle, the simplest choice is to make all of the foreign key constraints deferrable (checked on commit), perform the updates in a single transaction, then commit.
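An Oracle sketch of that (constraint, table, and column names are illustrative). Note that a constraint's deferrability cannot be altered in place, so a non-deferrable FK has to be dropped and recreated as DEFERRABLE first:

ALTER TABLE child1 DROP CONSTRAINT fk_child1_master;
ALTER TABLE child1 ADD CONSTRAINT fk_child1_master
    FOREIGN KEY (gu_id) REFERENCES master (gu_id)
    DEFERRABLE INITIALLY IMMEDIATE;

SET CONSTRAINTS ALL DEFERRED;   -- within the migrating transaction

UPDATE master m
SET    m.gu_id = (SELECT g.new_guid FROM guid_map g WHERE g.old_guid = m.gu_id)
WHERE  m.gu_id IN (SELECT old_guid FROM guid_map);

UPDATE child1 c
SET    c.gu_id = (SELECT g.new_guid FROM guid_map g WHERE g.old_guid = c.gu_id)
WHERE  c.gu_id IN (SELECT old_guid FROM guid_map);

COMMIT;   -- the deferred constraints are checked here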