What's the best way to deprecate a column in a database schema? - sql

After reading through many of the questions here about DB schema migration and versions, I've come up with a scheme to safely update DB schema during our update process. The basic idea is that during an update, we export the database to file, drop and re-create all tables, and then re-import everything. Nothing too fancy or risky there.
The problem is that this system is somewhat "viral", meaning that it is only safe to add columns or tables, since removing them would cause problems when re-importing the data. Normally, I would be fine just ignoring these columns, but the problem is that many of the removed items have actually been refactored, and the presence of the old ones in the code fools other programmers into thinking that they can use them.
So, I would like to find a way to be able to mark columns or tables as deprecated. In the ideal case, the deprecated objects would be marked while updating the schema, but then during the next update our backup script would simply not SELECT the objects which have been marked in this way, allowing us to eventually phase out these parts of the schema.
I have found that MySQL (and probably other DB platforms too, but this is the one we are using) supports a COMMENT attribute on both columns and tables. This would be perfect, except that I can't figure out how to actually use it in a meaningful way. How would I go about writing an SQL query that returns all column names whose comment does not contain the word "deprecated"? Or am I looking at this problem all wrong, and missing a much better way to do this?
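Something along these lines is what I had in mind, though I'm not sure it's the right approach (the table, column, and schema names here are just placeholders):
-- Mark a column as deprecated (MySQL requires repeating the full column definition):
ALTER TABLE orders MODIFY COLUMN legacy_status VARCHAR(20) COMMENT 'deprecated';
-- Then the backup script could select only the columns not marked that way:
SELECT table_name, column_name
FROM information_schema.columns
WHERE table_schema = 'my_database'
  AND column_comment NOT LIKE '%deprecated%';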

Maybe you should refactor to use views over your tables, where the views never include the deprecated columns.
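For instance, one view per table that simply leaves out the deprecated columns (all names here are invented):
CREATE OR REPLACE VIEW orders_current AS
SELECT id, customer_id, status   -- legacy_status deliberately omitted
FROM orders;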

"Deprecate" usually means (to me at least) that something is marked for removal at some future date, should not used by new functionality and will be removed/changed in existing code.
I don't know of a good way to "mark" a deprecated column, other than to rename it, which is likely to break things! Even if such a facility existed, how much use would it really be?
So do you really want to deprecate or remove? From the content of your question, I'm guessing the latter.
I have the nasty feeling that you may be in one of those "if I wanted to get to there I wouldn't start from here" situations. However, here are some ideas that spring to mind:
Read Recipes for Continuous Database Integration, which seems to address much of your problem area.
Drop the column explicitly. In MySQL 5.0 (and even earlier?) the facility exists as part of the DDL: see the ALTER TABLE syntax (a minimal example follows this list of ideas).
Look at how ActiveRecord::Migration works in Ruby. A migration can include the "remove_column" directive, which will deal with the problem in a platform-appropriate way. It definitely works with MySQL, from personal experience.
Run a script against your export to remove the column from the INSERT statements, both column and values lists. Probably quite viable if your DB is fairly small, which I'm guessing it must be if you export and re-import it as described.
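For the drop-the-column route, the DDL is a one-liner (table and column names are made up):
ALTER TABLE orders DROP COLUMN legacy_status;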

Related

Refactoring Auto-increment ids to GUIDs in PostgreSQL DB

Jeff and others have convinced me that GUIDs are preferable to auto-increment ids. I have a Postgres DB that is indexed by auto-increment ids so I'd like to "refactor" the indexes to UUIDs. Is there some general (or specific) approach to doing this besides writing functions that traverse the tables, and check for index matches across tables?
Update
Note: the database is not currently in production, so performance and transactional integrity are non-issues.
I'm not able to find anything that will do this automatically for you, so it looks like it's up to you to do it. Good thing the world still needs database developers, eh?
The best way, arguably, is to have the entire change scripted out. The best way to create that script is probably with another script or tool (code that writes code), which doesn't seem to be available for this particular scenario. Of course each of these adds another layer of software which must be constructed and tested. If I thought that I would want to repeat this process some time, or needed some level of audit trail (e.g. change scripts), I would probably bite the bullet and write the script that writes this script.
If this really is just a one-shot deal, and you can prevent DB access while you're doing it, then it might save time and effort to just manually make the changes, sort of like when you initially develop a database. By this, I mean adding UUID columns via your preferred method (diagrammer, SQL DDL, etc.), filling them with data (probably with ad-hoc SQL DML), setting keys and constraints, and then eventually removing the old foreign keys and columns (again, using whatever method you like).
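As a rough sketch of that manual route, assuming a hypothetical parent/child pair of tables joined on an integer key (the constraint names and the use of pgcrypto's gen_random_uuid() are assumptions):
-- 1. Add the new UUID columns (gen_random_uuid() needs pgcrypto, or PostgreSQL 13+).
CREATE EXTENSION IF NOT EXISTS pgcrypto;
ALTER TABLE parent ADD COLUMN guid uuid NOT NULL DEFAULT gen_random_uuid();
ALTER TABLE child  ADD COLUMN parent_guid uuid;
-- 2. Carry the existing integer relationship over to the UUIDs.
UPDATE child c
SET    parent_guid = p.guid
FROM   parent p
WHERE  c.parent_id = p.id;
-- 3. Swap the constraints: drop the old FK and PK, promote the UUIDs.
ALTER TABLE child  DROP CONSTRAINT child_parent_id_fkey;  -- hypothetical constraint name
ALTER TABLE parent DROP CONSTRAINT parent_pkey;
ALTER TABLE parent ADD PRIMARY KEY (guid);
ALTER TABLE child  ADD CONSTRAINT child_parent_guid_fkey
       FOREIGN KEY (parent_guid) REFERENCES parent (guid);
-- 4. Remove the old integer columns once everything checks out.
ALTER TABLE child  DROP COLUMN parent_id;
ALTER TABLE parent DROP COLUMN id;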
If you have multiple environments (dev, test, prod), you can potentially do this in dev and then use a DB compare tool to script the changes, though you'll need the new FK values scripted.
An example
Here is a working script example on SQL Fiddle, though it's in SQL Server (my easiest DB), just to give you an idea about what you'll have to script (unfortunately, not how). It's still not completely transactionally consistent, as someone could modify something during one particular operation.
I realize this isn't by any means a complete answer, so feel free to vote me down (and provide a better answer).
Good luck, this is actually a fun problem.

Transferring data from one SQL table layout to a 'new & improved' one

The project I work on has undergone a transformation at the database level. For the better, about 40% of the SQL layout has been changed. Some columns were eliminated, others moved. I am now tasked with developing a data migration strategy.
What migration methods, or even tools, are available, so that I don't have to figure out each and every individual dependency and manually script a key change when, for instance, their IDs change?
I realize this question is a bit obtuse and open ended, but I assume others have had to do this before and I would appreciate any advice.
I'm on MS SQL Server 2008
@OMG Ponies: Not obtuse, but vague:
Great point. I guess this helps me reconsider what I am asking, at least make it more specific. How do you insert from multiple tables to multiple tables keeping the relationships established by the foreign keys intact? I now realize I could drop the ID key constraint during the insert and re-enable it after, but I guess I have to figure out what depends on what myself and make sure it goes smoothly.
I'll start there, but will leave this open in case anyone else has other recommendations.
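For reference, the constraint juggling mentioned above might look something like this in SQL Server (all names are made up):
-- Disable the foreign key while the data is copied across...
ALTER TABLE dbo.OrderLines NOCHECK CONSTRAINT FK_OrderLines_Orders;
-- ...bulk inserts into dbo.Orders and dbo.OrderLines happen here...
-- ...then re-enable it WITH CHECK so existing rows are re-validated.
ALTER TABLE dbo.OrderLines WITH CHECK CHECK CONSTRAINT FK_OrderLines_Orders;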
You should create an upgrade script that morphs the current schema into the next version's schema, applying the appropriate operations (alter table, select into, update, delete, etc.). While this may seem tedious, it is the only process that is testable: start from a backup of the current db, apply the upgrade script, then test the resulting db for conformance with the desired schema. You can test and debug your upgrade script until it is hammered into correctness. You can also test it on a real data size so that you get a correct estimate of the downtime due to size-of-data operations.
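A stripped-down sketch of what one step of such an upgrade script might look like on SQL Server, with entirely made-up table names:
BEGIN TRANSACTION;
-- Create the new structure.
CREATE TABLE dbo.Orders_New (
    OrderId    INT           NOT NULL PRIMARY KEY,
    CustomerId INT           NOT NULL,
    ShipCity   NVARCHAR(100) NULL   -- a column that moved here from another table
);
-- Move the data, joining in whatever used to hold the moved column.
INSERT INTO dbo.Orders_New (OrderId, CustomerId, ShipCity)
SELECT o.OrderId, o.CustomerId, a.City
FROM dbo.Orders o
LEFT JOIN dbo.Addresses a ON a.OrderId = o.OrderId;
-- Swap the tables.
DROP TABLE dbo.Orders;
EXEC sp_rename 'dbo.Orders_New', 'Orders';
COMMIT TRANSACTION;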
While there are tools out there that can copy data or transform schemas (like SQL Compare), I believe that approaching this as a development project, with a script deliverable that can be tested repeatedly and validated, is a much saner approach.
In the future you can account for this upgrade step in your development and start with it, rather than trying to squeeze it in at the end.
There are tons of commercial tools around that claim to solve this - I wouldn't buy that...
I think your best bet is to model domain classes that represent your data and write adapters that read in/serialize to the old/new schemas.
If you haven't got a model of your domain, you should build one now.
IDs will change, so ideally they should not carry any meaning to users of your database.

Is adding a bit mask to all tables in a database useful?

A colleague is adding a bit mask to all our database tables. In theory this is so we can track certain properties of each row across the entire system. For example...
Is the row shipped with the system or added by the client once they've started using the system
Has the row been deleted from the table (soft deletes)
Is the row a default value within a set of rows
Is this a good idea? Are there other uses where this approach would be beneficial?
My view is that these properties are obviously important, and having a dedicated column for each property is justified because it makes what is happening clearer to fellow developers.
Not really, no.
You can only store bits in it, and only so many. So, seems to me like it's asking for a lot of application-level headaches later on keeping track of what each one means and potential abuse later on because "hey they're everywhere". Is every bitmask on every table going to use the same definition for each bit? Will it be different on each table? What happens when you run out of bits? Add another?
There are lots of potential things you could do with it, but it begs the question "why do it that way instead of identifying what we will use those bits for right now and just make them proper columns?" You don't really circumvent the possibility of schema changes this way anyway, so it seems like it's trying to solve a problem that you can't really "solve" and especially not with bitmasks.
Each of the things you mentioned can be (and should be) solved with real columns on the database, and those are far more self-documenting than "bit 5 of the BitMaskOptions field".
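To illustrate the difference, a sketch with invented table and column names:
-- Bitmask: the meaning of each bit lives only in application code.
ALTER TABLE Product ADD BitMaskOptions INT NOT NULL DEFAULT 0;
-- Dedicated columns: self-documenting and individually indexable.
ALTER TABLE Product
    ADD IsShippedWithSystem BIT NOT NULL DEFAULT 0,
        IsDeleted           BIT NOT NULL DEFAULT 0,
        IsDefaultRow        BIT NOT NULL DEFAULT 0;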
A dedicated column is better, because it's undoubtedly more obvious and less error-prone. SQL Server already stores BIT columns efficiently, so performance is not an issue.
The only argument I could see for a bitmask is not having to change the DB schema every time you add a new flag, but really, if you're adding new flags that often then something is not right.
No, it is not even remotely a good idea IMO. Each column should represent a single concept and value. Bit masks have all kinds of performance and maintenance problems. How do new developers understand what each of the bits means? How do you prevent someone from accidentally mixing up which bit means what?
It would be better to have a many-to-many relationship or separate columns rather than a bit mask. You will be able to index on it, enable referential integrity (depending on approach), easily add new items and change the order of the results to fit different reports and so on.

Upgrade strategies for bad DB schema designs

I've shown up at a new job and discovered a database which is in dire need of some help. There are many, many things wrong with it, including:
No foreign keys...anywhere. They're faked by using ints and managing the relationship in code.
Practically every field can be NULL, even though that doesn't reflect the actual data
Naming conventions for tables and columns are practically non-existent
Varchars which are storing concatenated strings of relational information
Folks can argue, "It works", and it does. But moving forward, it's a total pain to manage all of this with code, and it opens us up to bugs IMO. Basically, the DB is being used as a flat file, since it's not doing a whole lot of work.
I want to fix this. The issues I see now are:
We have a lot of data (migration, possibly tricky)
All of the DB logic is in code (with migration comes big code changes)
I'm also tempted to do something "radical" like moving to a schema-free DB.
What are some good strategies when faced with an existing DB built upon a poorly designed schema?
Enforce Foreign Keys: If a relationship exists in the domain, then it should have a Foreign Key.
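For example, retrofitting one of the faked relationships might look like this (names invented):
ALTER TABLE order_item
    ADD CONSTRAINT fk_order_item_order
    FOREIGN KEY (order_id) REFERENCES orders (id);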
Renaming existing tables/columns is fraught with danger, especially if there are many systems accessing the Database directly. Gotchas include tasks that run only periodically; these are often missed.
Of Interest: Scott Ambler's article: Introduction To Database Refactoring
and Catalog of Database Refactorings
Views are commonly used to transition between changing data models because of the encapsulation they provide. A view looks like a table, but does not exist as a physical object in the database - you can change which column is returned for a given column alias as desired. This allows you to set up your codebase to use a view, so you can move from the old table structure to the new one without the application needing to be updated. But it means the view has to return the data in the existing format. For example - your current data model has:
SELECT t.your_column --a list of concatenated strings, assuming comma separated
FROM your_table t
...so the first version of the view would be the query above, but once you create the new table in 3NF, the query for the view would become:
SELECT GROUP_CONCAT(t.your_column SEPARATOR ',') AS your_column
FROM new_table t
GROUP BY t.old_row_id --whatever key identified a row in the old layout
...and the application code would never know that anything changed.
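Wrapped up as an actual view definition, the transition might look roughly like this (all names are placeholders):
-- Version 1: the view simply passes the legacy table through.
CREATE OR REPLACE VIEW legacy_view AS
SELECT t.old_row_id, t.your_column
FROM your_table t;
-- Version 2: after normalizing, the view rebuilds the old comma-separated
-- format from the 3NF table, so callers of legacy_view never notice.
CREATE OR REPLACE VIEW legacy_view AS
SELECT n.old_row_id,
       GROUP_CONCAT(n.your_column SEPARATOR ',') AS your_column
FROM new_table n
GROUP BY n.old_row_id;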
The problem with MySQL is that its view support is limited - you can't use variables within them, nor can they contain subqueries in the FROM clause.
The reality of the changes you wish to make is that you are effectively rewriting the application from the ground up. Moving logic from the codebase into the data model will drastically change how the application gets its data. Model-View-Controller (MVC) is ideal to implement alongside changes like these, to minimize the cost of similar changes in the future.
I'd say leave it alone until you really understand it. Then make sure you don't start with one of the Things You Should Never Do.
Read Scott Ambler's book on Refactoring Databases. It covers a good many techniques for how to go about improving a database - including the transitional measures needed to allow both old and new programs to work with the changing design.
Create a completely new schema and make sure that it is fully normalized and contains any unique, check and not null constraints etc that are required and that appropriate data types are used.
Prepopulate each table that fills the parent role in a foreign key relationship with a single 'Unknown' record.
Create an ETL (Extract, Transform, Load) process (I can recommend SSIS (SQL Server Integration Services), but there are plenty of others) that you can use to refill the new schema from the existing one on a regular basis. Use the 'Unknown' record as the parent of any orphaned records - there will be plenty ;) - as sketched after these steps. You will need to put some thought into how you will consolidate duplicate records - this will probably need to be handled on a case-by-case basis.
Use as many iterations as are necessary to refine your new schema (ensure that the ETL Process is maintained and run regularly).
Create views over the new schema that match the existing schema as closely as possible.
Incrementally modify any clients to use the new schema, making temporary use of the views where necessary. You should be able to gradually turn off parts of the ETL process and eventually disable it completely.
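As a sketch of the 'Unknown' parent trick from the steps above (SQL Server syntax, hypothetical names):
-- Seed each parent table with an 'Unknown' row.
SET IDENTITY_INSERT dbo.Customer ON;
INSERT INTO dbo.Customer (CustomerId, CustomerName) VALUES (-1, 'Unknown');
SET IDENTITY_INSERT dbo.Customer OFF;
-- During the ETL load, re-parent orphaned rows instead of dropping them.
INSERT INTO dbo.CustomerOrder (OrderId, CustomerId, OrderDate)
SELECT o.OrderId,
       COALESCE(c.CustomerId, -1),   -- orphans fall back to the 'Unknown' customer
       o.OrderDate
FROM legacy.Orders o
LEFT JOIN dbo.Customer c ON c.CustomerId = o.CustomerId;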
First, see how bad the code related to the DB is. If it is all mixed in with no DAO layer, you shouldn't think about a rewrite; but if there is a DAO layer, then it would be time to rewrite that layer and the DB along with it. If possible, build the migration tool on top of the two DAOs.
But my guess is there is no DAO, so you need to find what areas of the code you are going to be changing and what parts of the DB they relate to; hopefully you can cut it up into smaller parts that can be updated as you maintain it. The biggest deal is to get FKs in there and to start checking for proper indexes - there is a good chance they aren't being done correctly.
I wouldn't worry too much about naming until the rest of the DB is under control. As for the NULLs: if the program chokes on a value being NULL, don't let it be NULL; but if the program can handle it, I wouldn't worry about it at this point. In the future, if the program is applying a default value, move that into the DB, but that is way down the line from the sound of things.
Do something about the varchars sooner rather than later. If anything, make that the first pure background fix to the program.
The other thing to do is estimate the effort of each area's change and then add that price to the cost of new development on that section of code. That way you can fix the parts as you add new features.

Do you put your indexes in source control?

And how do you keep them in synch between test and production environments?
When it comes to indexes on database tables, my philosophy is that they are an integral part of writing any code that queries the database. You can't introduce new queries or change a query without analyzing the impact to the indexes.
So I do my best to keep my indexes in sync between all of my environments, but to be honest, I'm not doing very well at automating this. It's a sort of haphazard, manual process.
I periodically review index stats and delete unnecessary indexes. I usually do this by creating a delete script that I then copy back to the other environments.
But here and there indexes get created and deleted outside of the normal process and it's really tough to see where the differences are.
I've found one thing that really helps is to go with simple, numeric index names, like
idx_t_01
idx_t_02
where t is a short abbreviation for a table. I find index maintenance impossible when I try to get clever with all the columns involved, like,
idx_c1_c2_c5_c9_c3_c11_5
It's too hard to differentiate indexes like that.
Does anybody have a really good way to integrate index maintenance into source control and the development lifecycle?
Indexes are a part of the database schema and hence should be source controlled along with everything else. Nobody should go around creating indexes on production without going through the normal QA and release process - particularly performance testing.
There have been numerous other threads on schema versioning.
The full schema for your database should be in source control right beside your code. When I say "full schema" I mean table definitions, queries, stored procedures, indexes, the whole lot.
When doing a fresh installation, then you do:
- check out version X of the product.
- from the "database" directory of your checkout, run the database script(s) to create your database.
- use the codebase from your checkout to interact with the database.
When you're developing, every developer should be working against their own private database instance. When they make schema changes, they check in a new set of schema definition files that work against their revised codebase.
With this approach you never have codebase-database sync issues.
Yes, any DML or DDL changes are scripted and checked in to source control, mostly through ActiveRecord migrations in Rails. I hate to continually toot Rails' horn, but in many years of building DB-based systems I find the migration route to be so much better than any home-grown system I've used or built.
However, I do name all my indexes (don't let the DBMS come up with whatever crazy name it picks). Don't prefix them - that's silly (because you have type metadata in sysobjects, or in whatever catalog your db has) - but I do include the table name and columns, e.g. tablename_col1_col2.
That way, if I'm browsing sysobjects I can easily see the indexes for a particular table (it's also a force of habit - wayyyy back in the day, on some DBMS I used, index names were unique across the whole DB, so the only way to ensure that was to use unique names).
I think there are two issues here: the index naming convention, and adding database changes to your source control/lifecycle. I'll tackle the latter issue.
I've been a Java programmer for a long time now, but have recently been introduced to a system that uses Ruby on Rails for database access for part of the system. One thing that I like about RoR is the notion of "migrations". Basically, you have a directory full of files that look like 001_add_foo_table.rb, 002_add_bar_table.rb, 003_add_blah_column_to_foo.rb, etc. These Ruby source files extend a parent class, overriding methods called "up" and "down". The "up" method contains the set of database changes that need to be made to bring the previous version of the database schema to the current version. Similarly, the "down" method reverts the change back to the previous version. When you want to set the schema for a specific version, the Rails migration scripts check the database to see what the current version is, then finds the .rb files that get you from there up (or down) to the desired revision.
To make this part of your development process, you can check these into source control, and season to taste.
There's nothing specific or special about Rails here, just that it's the first time I've seen this technique widely used. You can probably use pairs of SQL DDL files, too, like 001_UP_add_foo_table.sql and 001_DOWN_remove_foo_table.sql. The rest is a small matter of shell scripting, an exercise left to the reader.
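For instance, such a hand-rolled pair might contain something like:
-- 001_UP_add_foo_table.sql
CREATE TABLE foo (
    id   INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);
CREATE INDEX idx_foo_name ON foo (name);
-- 001_DOWN_remove_foo_table.sql
DROP TABLE foo;   -- dropping the table also drops idx_foo_name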
I always source-control SQL (DDL, DML, etc). It's code like any other. It's good practice.
I am not sure indexes should be the same across different environments since they have different data sizes. Unless your test and production environments have the same exact data, the indexes would be different.
As to whether they belong in source control, am not really sure.
I do not put my indexes in source control but the creation script of the indexes. ;-)
Index-naming:
IX_CUSTOMER_NAME for the field "name" in the table "customer",
PK_CUSTOMER_ID for the primary key,
UI_CUSTOMER_GUID for the GUID field of the customer, which is unique (hence the "UI" - unique index).
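Applied to a hypothetical customer table, that convention looks like:
ALTER TABLE customer ADD CONSTRAINT PK_CUSTOMER_ID PRIMARY KEY (id);
CREATE INDEX IX_CUSTOMER_NAME ON customer (name);
CREATE UNIQUE INDEX UI_CUSTOMER_GUID ON customer (guid);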
On my current project, I have two things in source control - a full dump of an empty database (using pg_dump -c so it has all the ddl to create tables and indexes) and a script that determines what version of the database you have, and applies alters/drops/adds to bring it up to the current version. The former is run when we're installing on a new site, and also when QA is starting a new round of testing, and the latter is run at every upgrade. When you make database changes, you're required to update both of those files.
In a Grails app, the indexes are stored in source control by default, since you define the index inside the file that represents your domain object. Just offering the Grails perspective as an FYI.