Upgrade strategies for bad DB schema designs - sql

I've shown up at a new job and discovered a database which is in dire need of some help. There are many, many things wrong with it, including:
No foreign keys...anywhere. They're faked by using ints and managing the relationship in code.
Practically every field is nullable, even though that doesn't reflect the real data
Naming conventions for tables and columns are practically non-existent
Varchars which are storing concatenated strings of relational information
Folks can argue, "It works", and it does. But moving forward, it's a total pain to manage all of this in code, and it opens us up to bugs IMO. Basically, the DB is being used as a flat file, since it's not doing a whole lot of work.
I want to fix this. The issues I see now are:
We have a lot of data (migration, possibly tricky)
All of the DB logic is in code (with migration comes big code changes)
I'm also tempted to do something "radical" like moving to a schema-free DB.
What are some good strategies when faced with an existing DB built upon a poorly designed schema?

Enforce Foreign Keys: If a relationship exists in the domain, then it should have a Foreign Key.
Renaming existing tables/columns is fraught with danger, especially if there are many systems accessing the Database directly. Gotchas include tasks that run only periodically; these are often missed.
Of Interest: Scott Ambler's article: Introduction To Database Refactoring
and Catalog of Database Refactorings

Views are commonly used to transition between changing data models because of the encapsulation. A view looks like a table, but does not exist as a finite object in the database - you can change what column is being returned for a given column alias as desired. This allows you to set up your codebase to use a view, so you can move from the old table structure to the new one without the application needing to be updated. But it means the view has to return the data in the existing format. For example - your current data model has:
SELECT t.column -- a list of concatenated strings, assuming comma-separated
FROM TABLE t
...so the first version of the view would be the query above, but once you created the new table that uses 3NF, the query for the view would use:
SELECT GROUP_CONCAT(t.column SEPARATOR ',')
FROM NEW_TABLE t
GROUP BY t.parent_id -- hypothetical key linking back to the original row, so the view returns one concatenated string per old row
...and the application code would never know that anything changed.
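To make this concrete, here is a minimal sketch of how the same view can be re-pointed from the old table to the new 3NF table; the table, view, and column names are hypothetical and the syntax assumes MySQL:

-- Version 1: the view simply exposes the existing concatenated column
CREATE OR REPLACE VIEW customer_tags_v AS
SELECT c.customer_id,
       c.tags AS tag_list        -- the old comma-separated varchar
FROM old_customer c;

-- Version 2: same view name and columns, now backed by the normalised table
CREATE OR REPLACE VIEW customer_tags_v AS
SELECT t.customer_id,
       GROUP_CONCAT(t.tag ORDER BY t.tag SEPARATOR ',') AS tag_list
FROM customer_tag t
GROUP BY t.customer_id;

Any code that reads customer_tags_v keeps working while the underlying storage is normalised.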
The problem with MySQL is that its view support is limited - you can't use variables within a view, nor can the view contain a subquery in the FROM clause.
The reality is that the changes you wish to make amount to effectively rewriting the application from the ground up. Moving logic from the codebase into the data model will drastically change how the application gets its data. Model-View-Controller (MVC) is ideal to implement alongside changes like these, to minimize the cost of similar changes in the future.

I'd say leave it alone until you really understand it. Then make sure you don't start with one of the Things You Should Never Do.

Read Scott Ambler's book on Refactoring Databases. It covers a good many techniques for how to go about improving a database - including the transitional measures needed to allow both old and new programs to work with the changing design.

Create a completely new schema and make sure that it is fully normalized and contains any unique, check and not null constraints etc that are required and that appropriate data types are used.
Prepopulate each table that fills the parent role in a foreign key relationship with a single 'Unknown' record.
Create an ETL (Extract Transform Load) process (I can recommend SSIS (SQL Server Integration Services) but there are plenty of others) that you can use to refill the new schema from the existing one on a regular basis. Use the 'Unknown' record as the parent of any orphaned records - there will be plenty ;) (see the sketch after this list). You will need to put some thought into how you will consolidate duplicate records - this will probably need to be on a case-by-case basis.
Use as many iterations as are necessary to refine your new schema (ensure that the ETL Process is maintained and run regularly).
Create views over the new schema that match the existing schema as closely as possible.
Incrementally modify any clients to use the new schema making temporary use of the views where necessary. You should be able to gradually turn off parts of the ETL process and eventually disable it completely.
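A minimal sketch of the 'Unknown' parent pattern described above; all table and column names are hypothetical, and the syntax is generic SQL:

-- Seed each parent table with a well-known 'Unknown' row
INSERT INTO new_department (department_id, department_name)
VALUES (0, 'Unknown');

-- Load children from the old schema, re-parenting orphans to 'Unknown'
INSERT INTO new_employee (employee_id, employee_name, department_id)
SELECT o.emp_id,
       o.emp_name,
       COALESCE(d.department_id, 0)     -- orphaned rows fall back to 'Unknown'
FROM old_employee o
LEFT JOIN new_department d
       ON d.legacy_code = o.dept_code;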

First, see how tightly the code is coupled to the DB. If it is all mixed in with no DAO layer, you shouldn't think about a rewrite; but if there is a DAO layer, then it would be time to rewrite that layer and the DB along with it. If possible, build the migration tool around using the two DAOs (old and new).
But my guess is there is no DAO, so you need to find which areas of the code you are going to be changing and which parts of the DB they relate to; hopefully you can cut it up into smaller parts that can be updated as you maintain. The biggest deal is to get FKs in there and to start checking for proper indexes; there is a good chance they aren't being done correctly (a sketch of retrofitting a foreign key follows below).
I wouldn't worry too much about naming until the rest of the DB is under control. As for the NULLs: if the program chokes on a value being NULL, don't let it be NULL; but if the program can handle it, I wouldn't worry about it at this point. In the future, if the code is applying a default value, move that into the DB, but that is way down the line from the sound of things.
Do something about the varchars sooner rather than later. If anything, make that the first pure background fix to the program.
The other thing to do is estimate the effort of each area's change and then add that price to the cost of new development on that section of code. That way you can fix the parts as you add new features.
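A hedged sketch of what retrofitting one foreign key might look like; the orders/customers names and the constraint name are made up, and the syntax is generic SQL (adjust for your DBMS):

-- Find orphaned rows that would block the constraint
SELECT o.order_id, o.customer_id
FROM orders o
LEFT JOIN customers c ON c.customer_id = o.customer_id
WHERE o.customer_id IS NOT NULL
  AND c.customer_id IS NULL;

-- Once the orphans are cleaned up or re-parented, add the key
ALTER TABLE orders
  ADD CONSTRAINT fk_orders_customer
  FOREIGN KEY (customer_id) REFERENCES customers (customer_id);

-- And an index on the FK column, if your engine does not create one automatically
CREATE INDEX ix_orders_customer_id ON orders (customer_id);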

Related

Sql Database structure for housing historical data and display changes

Good morning,
This is more of a concept question than anything.
I am looking to design a database and interface that will track changes to the entries (in this case people) and display those changes readily.
(user experience would look something like this)
for user A
Date     Category            Activity
8/8/14   change position     position 1 -> position 2
8/9/14   change department   department a -> department b
...
...
The visual experience seems like it would benefit from an E-A-V design; however, I am designing the database to be easy to data mine, and from my reading, I think that E-A-V is not the right way to go.
Does it make sense to duplicate data just to display it?
If not, does anyone have a suggestion of how to query the history table and display it? (Currently using jQuery and PHP to leverage the DB... I suppose I could do something interesting from a coding perspective to get it done.)
Thank you for your help,
Travis
Creating an efficient operational database environment and creating an 'easy-to-data-mine' environment are two separate (and often opposing) goals.
Others might disagree with me, but in my opinion it is best to create your database based on operational readiness (this means using your E-A-V design as mentioned above) and then worry about data transformation later. This may make it inconvenient later to transform the data to allow for easy mining, but it will accomplish an incredibly important goal: eliminating the possibility for data error.
Once you have a good system in place where you can collect data appropriately, then you can create a warehouse or datamart environment to more conveniently extract that data.
This may sound like a lot of work but from a data integrity perspective, it is much safer than trying to create some system that is designed entirely for reporting. That's my personal opinion at least.
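To make the operational side concrete, here is one possible shape (purely illustrative, and assuming MySQL, which the question does not actually state) for a change-history table and the query that would drive the display shown in the question:

CREATE TABLE user_change_history (
    history_id  INT AUTO_INCREMENT PRIMARY KEY,
    user_id     INT          NOT NULL,
    changed_on  DATE         NOT NULL,
    category    VARCHAR(50)  NOT NULL,    -- e.g. 'change position'
    old_value   VARCHAR(255) NOT NULL,
    new_value   VARCHAR(255) NOT NULL
);

SELECT changed_on AS `Date`,
       category   AS `Category`,
       CONCAT(old_value, ' -> ', new_value) AS `Activity`
FROM user_change_history
WHERE user_id = 1
ORDER BY changed_on;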
You have to analyse the data you need to persist.
If you have only a couple of tables, with no relationships, you probably don't need the database.
In this case the database solution will probably be slower (connection/transmission/security overhead, ...).
And if it's only a few MBs of data, I would keep everything in one table.
You can easily load the whole data set in memory and do what you need to do.

SQL database checklist

I need to create a SQL database checklist.
I have some basic points, like:
Each table must have a primary key
Normalize data to third normal form
Check identity/auto-increment columns: the column value should be incremented properly.
But can anyone help me to enhance this list ?
Objects conform to a single naming convention
Create foreign key relationships
Apply appropriate index(es)
Use of schema or other mechanisms for controlling read/write access, etc
Consideration given to how long data should be kept before deletion or archive
Version control over scripts for updating the database structure
Mechanism for applications to determine version of database (a sketch follows this list)
Backup and recovery plans in place
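For the point above about applications determining the database version, a minimal sketch of a table they could query; the table name and column types are hypothetical, so substitute your engine's date/time type as needed:

CREATE TABLE schema_version (
    version      INT          NOT NULL,
    applied_on   TIMESTAMP    NOT NULL,   -- use your engine's date/time type (e.g. DATETIME2 on SQL Server)
    description  VARCHAR(200) NOT NULL
);

INSERT INTO schema_version (version, applied_on, description)
VALUES (42, CURRENT_TIMESTAMP, 'add index on orders.customer_id');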
First, it would help to know whether this is supposed to be a recurring checklist or a checklist for each new instance. Also, is there a specific implementation in mind, like SQL Server or MySQL? (This is where the real checklist begins.) For example, you want to keep an eye on the transaction log if it's SQL Server...
If this is a relational DB, ER diagrams go a long way in making sure that you have your problem domain identified and analyzed. You are on the right track using third normal form where practical. I want to emphasize practical, because you also want to try to anticipate and identify which data will be used more than others. If data is highly accessed, you may want to consider indexing more than just the primary key and/or denormalizing to 2nd normal form (uses more space, but gives better performance). Remember that reading data and updating data are inversely related where indexing is concerned. Hope this helps.

MySQL design question - which is better, long tables or multiple databases?

So I have an interesting problem that's been the fruit of lots of good discussion in my group at work.
We have some scientific software producing SQLite files, and this software is basically a black box. We don't control its table designs, formats, etc. It's entirely conceivable that this black box's output could change, and our design needs to be able to handle that.
The SQLite files are entire databases which our user would like to query across. There are two ways (that we see) of implementing this: one, create a single database and a backend in Python that appends tables from each database to the master database; or two, query across the separate databases' tables and unify the results in Python.
Both methods run into trouble when the black box alters its table structures - say, renaming a column, splitting up a table, etc. We have to take this into account, and we've discussed translation tables that translate queries of columns from one table format to another.
We're interested in ease of implementation, how well the design handles a change in database/table layout, and speed. Also, a last dimension is how well it would work with existing Python web frameworks (Django doesn't support cross-database queries, and neither does SQLAlchemy, so we know we are in for a lot of programming.)
If you find yourself querying across databases, you should look into consolidating. Cross-database queries are evil.
If your queries are essentially relegated to individual databases, then you may want to stick with multiple databases, as clearly their separation is necessary.
You cannot accommodate arbitrary changes in a database's schema without categorizing and anticipating that change in some way. In the very best case with nontrivial changes, you can sometimes simply ignore new data or tables, in the worst case, your interpretation of the data will entirely break down.
I've encountered similar issues where users need data pivoted out of a normalized schema. The schema does NOT change. However, their required output format requires a fixed number of hierarchical levels. Thus, although the database design accommodates all the changes they want to make, their chosen view of that data cannot be maintained in the face of their changes. Thus it is impossible to maintain the output schema in the face of data change (not even schema change). This is not to say that it's not a valid output or input schema, but that there are limits beyond which their chosen schema cannot be used. At this point, they have to revise the output contract, the pivoting program (which CAN anticipate this and generate new columns) can then have a place to put the data in the output schema.
My point being: the semantics and interpretation of new columns and new tables (or removal of columns and tables which existing logic may depend on) is nontrivial unless new columns or tables can be anticipated in some way. However, in these cases, there are usually good database designs which eliminate those strategies in the first place:
For instance, a particular database schema can contain any number of tables, all with the same structure (although there is no theoretical reason they could not be consolidated into a single table). A particular kind of table could have a set of columns all similarly named (although this "array" violates normalization principles and could be normalized into a commonkey/code/value schema).
Even in a data warehouse ETL situation, you have to determine whether a new column is a fact or a dimensional attribute, and if it is a dimensional attribute, which dimension table it is best assigned to. This could somewhat be automated for facts (obvious candidates would be scalars like decimal/numeric) by inspecting the metadata for unmapped columns, altering the DW table (yikes) and then loading appropriately. But for dimensions, I would be very leery of automating something like this.
So, in summary, I would say that schema changes in a good normalized database design are the least likely to be able to be accommodated because: 1) the database design already anticipates and accommodates a good deal of change and flexibility, and 2) schema changes to such a database design are unlikely to be easy to anticipate. Conversely, schema changes in a poorly normalized database design are actually easier to anticipate, as shortcomings in the database design are more visible.
So, my question to you is: How well-designed is the database you are working from?
You say that you know that you are in for a lot of programming...
I'm not sure about that. I would go for a quick-and-dirty solution, not a 'generic' solution, because generic solutions like the entity-attribute-value model often have poor performance. Don't do client-side joining (unifying the results) inside your Python code, because that is very slow. Use SQL for joining; it is designed for that purpose. Users can also make their own reports with all kinds of reporting tools that generate SQL statements. You don't have to do everything in your app; just start with solving 80% of the problems, not 100%.
If something breaks because something inside the black box changes, you can define views for backward compatibility that keep your app functioning.
Maybe the scientific software will add a lot of new features, and maybe it will change its data model because of those new features..? That is possible, but then you will have to change your application anyway to take advantage of those new features.
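One way to keep the joins in SQL rather than in Python - not something suggested above, just a sketch with made-up file and table names - is SQLite's ATTACH DATABASE, which lets a single connection query across several database files:

ATTACH DATABASE 'run_2014_08.sqlite' AS run1;
ATTACH DATABASE 'run_2014_09.sqlite' AS run2;

SELECT r1.sample_id,
       r1.value AS value_run1,
       r2.value AS value_run2
FROM run1.measurements r1
JOIN run2.measurements r2 ON r2.sample_id = r1.sample_id;

If the black box renames a column, only the query (or a compatibility view) on top of the attached schemas has to change.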
It sounds to me as if your problem isn't really about MySQL or SQLite. It's about the sharing of data, and the contract that needs to exist between the supplier of data and the user of the same data.
To the extent that databases exist so that data can be shared, that contract is fundamental to everything about databases. When databases were first being built, and database theory was first being solidified, in the 1960s and 1970s, the sharing of data was the central purpose in building databases. Today, databases are frequently used where files would have served equally well. Your situation may be a case in point.
In your situation, you have a beggar's contract with your data suppliers. They can change the format of the data, and maybe even the semantics, and all you can do is suck it up and deal with it. This situation is by no means uncommon.
I don't know the specifics of your situation, so what follows could be way off target.
If it was up to me, I would want to build a database that was as generic, as flexible, and as stable as possible, without losing the essential features of structured and managed data. Maybe, some design like star schema would make sense, but I might adopt a very different design if I were actually in your shoes.
This leaves the problem of extracting the data from the databases you are given, transforming the data into the stable format the central database supports, and loading it into the central database. You are right in guessing that this involves a lot of programming. This process, known as "ETL" in data warehousing texts, is not the simplest of programming challenges.
At least ETL collects all the hard problems in one place. Once you have the data loaded into a database that's built for your needs, and not for the needs of your suppliers, turning the data into valuable information should be relatively easy, at least at the programming or SQL level. There are even OLAP tools that make using the data as simple as a video game. There are challenges at that level, but they aren't the same kind of challenges I'm talking about here.
Read up on data warehousing, and especially data marts. The description may seem daunting to you at first, but it can be scaled down to meet your needs.

SQL Server 2005: Wrapping Tables by Views - Pros and Cons

Background
I am working on a legacy small-business automation system (inventory, sales, procurement, etc.) that has a single database hosted by SQL Server 2005 and a bunch of client applications. The main client (used by all users) is an MS Access 2003 application (ADP), and other clients include various VB/VBA applications like Excel add-ins and command-line utilities.
In addition to 60 or so tables (mostly in 3NF), the database contains about 200 views, about 170 UDFs (mostly scalar and table-valued inline ones), and about 50 stored procedures. As you might have guessed, some portion of so-called "business logic" is encapsulated in this mass of T-SQL code (and thus is shared by all clients).
Overall, the system's code (including the T-SQL code) is not very well organized and is very refactoring-resistant, so to speak. In particular, schemata of most of the tables cry for all kinds of refactorings, small (like column renamings) and large (like normalization).
FWIW, I have pretty long and decent application development experience (C/C++, Java, VB, and whatnot), but I am not a DBA. So, if the question will look silly to you, now you know why it is so. :-)
Question
While thinking about refactoring all this mess (in a piecemeal fashion, of course), I've come up with the following idea:
For each table, create a "wrapper" view that (a) has all the columns that the table has; and (b) in some cases, has some additional computed columns based on the table's "real" columns.
A typical (albeit simplistic) example of such additional computed column would be sale price of a product derived from the product's regular price and discount.
Reorganize all the code (both T-SQL and VB/VBA client code) so that only the "wrapper" views refer to tables directly.
So, for example, even if an application or a stored procedure needed to insert/update/delete records from a table, they'd do that against the corresponding "table wrapper" view, not against the table directly.
So, essentially this is about isolating all the tables by views from the rest of the system.
This approach seems to provide a lot of benefits, especially from maintainability viewpoint. For example:
When a table column is to be renamed, it can be done without rewriting all the affected client code at once.
It is easier to implement derived attributes (easier than using computed columns).
You can effectively have aliases for column names.
Obviously, there must be some price for all these benefits, but I am not sure that I am seeing all the catches lurking out there.
Did anybody try this approach in practice? What are the major pitfalls?
One obvious disadvantage is the cost of maintaining "wrapper" views in sync with their corresponding tables (a new column in a table has to be added to a view too; a column deleted from a table has to be deleted from the view too; etc.). But this price seems to be small and fair for making the overall codebase more resilient.
Does anyone know any other, stronger drawbacks?
For example, usage of all those "wrapper" views instead of tables is very likely to have some adverse performance impact, but is this impact going to be substantial enough to worry about it? Also, while using ADODB, it is very easy to get a recordset that is not updateable even when it is based just on a few joined tables; so, are the "wrapper" views going to make things substantially worse? And so on, and so forth...
Any comments (especially shared real experience) would be greatly appreciated.
Thank you!
P.S. I stepped on the following old article that discusses the idea of "wrapper" views:
The Big View Myth
The article advises to avoid the approach described above. But... I do not really see any good reasons against this idea in the article. Quite the contrary: in its list of good reasons to create a view, almost every item is exactly why it is so tempting to create a "wrapper" view for each and every table (especially in a legacy system, as a part of the refactoring process).
The article is really old (1999), so whatever reasons were good then may be no longer good now (and vice versa). It would be really interesting to hear from someone who considered or even tried this idea recently, with the latest versions of SQL Server and MS Access...
When designing a database, I prefer the following:
no direct table access by the code (but is ok from stored procedures and views and functions)
a base view for each table that includes all columns
an extended view for each table that includes lookup columns (types, statuses, etc.)
stored procedures for all updates
functions for any complex queries
This allows the DBA to work directly with the table (to add columns, clean things up, inject data, etc.) without disturbing the code base, and it insulates the code base from any changes made to the table (temporary or otherwise).
There may be performance penalties for doing things this way, but so far they have not been significant, and the benefits of the layer of insulation have been life-savers several times.
You won't notice any performance impact for one-table views; SQL Server will use the underlying table when building the execution plans for any code using those views. I recommend you schema-bind those views, to avoid accidentally changing the underlying table without changing the view (think of the poor next guy.)
When a table column is to be renamed
In my experience, this rarely happens. Adding columns, removing columns, changing indexes and changing data types are the usual alter table scripts that you'll run.
It is easier to implement derived attributes (easier than using computed columns).
I would dispute that. What's the difference between putting the calculation in a column definition and putting it in a view definition? Also, you'll see a performance hit for moving it into a view instead of a computed column. The only real advantage is that changing the calculation is easier in a view than by altering a table (due to indexes and data pages.)
You can effectively have aliases for column names.
That's the real reason to have views; aliasing tables and columns, and combining multiple tables. Best practice in my past few jobs has been to use views where I needed to denormalise the data (lookups and such, as you've already pointed out.)
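Following up on the schema-binding suggestion, here is a minimal sketch of a schema-bound "wrapper" view using the sale-price example from the question; the table, view, and column names are all hypothetical:

CREATE VIEW dbo.Product_v
WITH SCHEMABINDING
AS
SELECT p.ProductID,
       p.Name,
       p.RegularPrice,
       p.Discount,
       p.RegularPrice * (1 - p.Discount) AS SalePrice   -- derived attribute
FROM dbo.Product AS p;

WITH SCHEMABINDING prevents the referenced columns of dbo.Product from being dropped or altered while the view exists, which enforces the "keep the view in sync with its table" discipline discussed above.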
As usual, the most truthful response to a DBA question is "it depends" - on your situation, skillset, etc. In your case, refactoring "everything" is going to break all the apps anyways. If you do fix the base tables correctly, the indirection you're trying to get from your views won't be required, and will only double your schema maintenance for any future changes. I'd say skip the wrapper views, fix the tables and stored procs (which provide enough information hiding already), and you'll be fine.
I agree with Steven's comment--primarily because you are using Access. It's extremely important to keep the pros/cons of Access in focus when re-designing this database. I've been there, done that with the Access front-end/SQL Server back-end (although it wasn't an ADP project).
I would add that views are nice for ensuring that data is not changed outside of the Access forms in the project. The downside is that stored procedures would be required for all updates - if you don't already have those, they'd have to be created too.

Do you put your indexes in source control?

And how do you keep them in synch between test and production environments?
When it comes to indexes on database tables, my philosophy is that they are an integral part of writing any code that queries the database. You can't introduce new queries or change a query without analyzing the impact to the indexes.
So I do my best to keep my indexes in sync between all of my environments, but to be honest, I'm not doing very well at automating this. It's a sort of haphazard, manual process.
I periodically review index stats and delete unnecessary indexes. I usually do this by creating a delete script that I then copy back to the other environments.
But here and there indexes get created and deleted outside of the normal process and it's really tough to see where the differences are.
I've found one thing that really helps is to go with simple, numeric index names, like
idx_t_01
idx_t_02
where t is a short abbreviation for a table. I find index maintenance impossible when I try to get clever with all the columns involved, like,
idx_c1_c2_c5_c9_c3_c11_5
It's too hard to differentiate indexes like that.
Does anybody have a really good way to integrate index maintenance into source control and the development lifecycle?
Indexes are a part of the database schema and hence should be source controlled along with everything else. Nobody should go around creating indexes on production without going through the normal QA and release process - particularly performance testing.
There have been numerous other threads on schema versioning.
The full schema for your database should be in source control right beside your code. When I say "full schema" I mean table definitions, queries, stored procedures, indexes, the whole lot.
When doing a fresh installation, then you do:
- check out version X of the product.
- from the "database" directory of your checkout, run the database script(s) to create your database.
- use the codebase from your checkout to interact with the database.
When you're developing, every developer should be working against their own private database instance. When they make schema changes they checkin a new set of schema definition files that work against their revised codebase.
With this approach you never have codebase-database sync issues.
Yes, any DML or DDL changes are scripted and checked in to source control, mostly through ActiveRecord migrations in Rails. I hate to continually toot Rails' horn, but in many years of building DB-based systems I find the migration route to be so much better than any home-grown system I've used or built.
However, I do name all my indexes (don't let the DBMS come up with whatever crazy name it picks). Don't prefix them, that's silly (because you have type metadata in sysobjects, or in whatever db you have), but I do include the table name and columns, e.g. tablename_col1_col2.
That way if I'm browsing sysobjects I can easily see the indexes for a particular table (also it's a force of habit; way back in the day on some DBMS I used, index names were unique across the whole DB, so the only way to ensure that was to use unique names).
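Assuming SQL Server (a guess, since sysobjects is mentioned), a sketch of listing the indexes for one table; the table name is just an example:

SELECT o.name AS table_name,
       i.name AS index_name,
       i.type_desc
FROM sys.indexes i
JOIN sys.objects o ON o.object_id = i.object_id
WHERE o.name = 'tablename'
  AND i.name IS NOT NULL;     -- skip heaps, which have no index name

Running the same query in each environment and diffing the output is one crude way to spot where the indexes have drifted apart.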
I think there are two issues here: the index naming convention, and adding database changes to your source control/lifecycle. I'll tackle the latter issue.
I've been a Java programmer for a long time now, but have recently been introduced to a system that uses Ruby on Rails for database access for part of the system. One thing that I like about RoR is the notion of "migrations". Basically, you have a directory full of files that look like 001_add_foo_table.rb, 002_add_bar_table.rb, 003_add_blah_column_to_foo.rb, etc. These Ruby source files extend a parent class, overriding methods called "up" and "down". The "up" method contains the set of database changes that need to be made to bring the previous version of the database schema to the current version. Similarly, the "down" method reverts the change back to the previous version. When you want to set the schema for a specific version, the Rails migration scripts check the database to see what the current version is, then finds the .rb files that get you from there up (or down) to the desired revision.
To make this part of your development process, you can check these into source control, and season to taste.
There's nothing specific or special about Rails here, just that it's the first time I've seen this technique widely used. You can probably use pairs of SQL DDL files, too, like 001_UP_add_foo_table.sql and 001_DOWN_remove_foo_table.sql. The rest is a small matter of shell scripting, an exercise left to the reader.
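As a sketch of what such a hand-rolled pair of DDL files might contain (the table, column, and index names are purely illustrative):

-- 001_UP_add_foo_table.sql
CREATE TABLE foo (
    foo_id  INT PRIMARY KEY,
    name    VARCHAR(100) NOT NULL
);
CREATE INDEX ix_foo_name ON foo (name);

-- 001_DOWN_remove_foo_table.sql
DROP INDEX ix_foo_name;       -- MySQL syntax would be: DROP INDEX ix_foo_name ON foo
DROP TABLE foo;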
I always source-control SQL (DDL, DML, etc). It's code like any other. It's good practice.
I am not sure indexes should be the same across different environments since they have different data sizes. Unless your test and production environments have the same exact data, the indexes would be different.
As to whether they belong in source control, I am not really sure.
I do not put my indexes themselves in source control, but rather the creation scripts for the indexes. ;-)
Index-naming:
IX_CUSTOMER_NAME for the field "name" in the table "customer"
PK_CUSTOMER_ID for the primary key,
UI_CUSTOMER_GUID, for the GUID-field of the customer which is unique (therefore the "UI" - unique index).
On my current project, I have two things in source control - a full dump of an empty database (using pg_dump -c so it has all the ddl to create tables and indexes) and a script that determines what version of the database you have, and applies alters/drops/adds to bring it up to the current version. The former is run when we're installing on a new site, and also when QA is starting a new round of testing, and the latter is run at every upgrade. When you make database changes, you're required to update both of those files.
With a Grails app, the indexes are stored in source control by default, since you define the index inside the file that represents your domain object. Just offering the Grails perspective as an FYI.