How to handle column growth of wide, flat tables - sql-server-2005

How would you DBA's handle this? I have taken ownership of an existing app (VB6) and database that was written in 1999. The database design is fairly 'flat', meaning the main tables are fairly wide (100+ columns) and developers have continued to tack on additional columns to the end of the tables. This has resulted in columns that have a lot of Nulls since they don't directly relate to the primary key.
I am considering splitting the main table out as a way to abstract myself from the years and years of 'column explosion'. I am certain that new fields will continue to be added as new requirements come up.
So the question is, as new fields are needed, do you continue to grow the width of the existing table? Or do you STOP extending an existing table and split it out into a separate supporting table that will house new fields, thereby creating a 1-to-1 relationship? If you were to split the main table, what would your naming scheme be?
Let's assume for this example I have a table called 'Foreclosure' with 150 fields.
What is a good name for the new 1-to-1 table? 'ForeclosureExtended'? ForeclosureOtherInfo'?
By the way, there are Views and Stored Procs that will need to be modified to support any new tables, but that is inevitable anyway when columns are added.
thanks in advance for any thoughts.

80% of the time, your nulls have definite patterns.
These patterns define subclasses of your table. In your case, they will be subclasses of Foreclosure.
Your splitting should be based on these subclass relationships.
Say, for example, some Foreclosure instances have a bunch of fields related to legal proceeding that are nearly all filled in. And other Foreclosure instances have the legal proceeding fields entirely filled with nulls.
You have two classes. You need to work out the relationship between them -- are they superclass-subclass or are they peer subclasses of some other superclass?
This tells you how to partition your table to make useful stuff happen.
You may have proper superclass subclass relationships
You may have found a thing (a LegalProceeding) which should have been a separate table all along. It should not have been permanently joined into Foreclosure. This is remarkably common.
You now have some relational implementation choices.
One common choice is to put all subclasses into a single, massive table with a lot of nulls. This is what you have today, and it isn't working.
One choice is to split the two subclass relationship tables into peers, duplicating the common information.
One choice is to have a superclass table with an optional FK reference to the additional information in the subclass.
One choice is to have a subclass table with a mandatory FK reference to the superclass information.

Unless you are really brave, app is very small/simple, or there are major performance issues do not fix the schema. If it ain't broke, don't fix it.
Just create a new table ForeclosureExtended, as you suggest with the same key and start adding columns. Or, you could make proper tables with grouped columns as new columns appear. Either way, if the schema is this bad, I'll bet the code is very fragile.

Why do you feel that you have a problem? To my mind it's easier to deal with one table that has a lot of columns than it is to deal with a ton of narrower tables and all the associated views you have to maintain.

Related

Is there a database design pattern name for reducing duplicate join table data?

I have two tables with a join table to allow a many-to-many relationship.
It's a very familiar design pattern. It indicates which Branches each Member has access to.
As the number of members and branches increases I end up with a lot of data in the join table that is duplicated across members. Members tend to have access to the same groups of Branches as other Members.
So I'm looking at normalizing my data by creating a MemberProfile table that is effectively immutable. And rather than creating MemberBranch records for every Member I check for a matching MemberProfile, use if it already exists, or create one if it doesn't:
The idea being if I have a million Members with only a hundred access profiles this will save me a lot of space in my database.
I'm happy that it all works and that the development effort is worth is.
My question is "Is this a standard database design pattern, and if so, what is it called?"
EDIT: It's been pointed out that this is compressing the data not normalizing it. Which is the intent behind the design.
Unless your many:many table is always the join of particular other base tables, one is not normalizing. You aren't normalizing here. Normalization does not introduce new column names. It just rearranges the current ones among different base tables.
You are just compressing/encoding your data. There is not necessarily any benefit in this, since now some queries and updates will be slower although your database is smaller. (You have reported that it is worth it in your case.)
I understand you'd like to put a label on that precise transformation, but unfortunately, there aren't many books that discuss database design or refactoring patterns. One of the few is Martin Fowler's Refactoring Databases, which you may know for his work on analysis patterns (he also has a great blog, worth following!). In that book, Martin presents a bunch of refactoring patterns that can be applied to databases and has put a name on common database transformations, including the one you have presented, which he called Split Table.
Split Table. Vertically split (e.g. by columns) an existing table into one or more tables.
A catalog of the database refactorings presented in that book are available here.
Hi I don't know about a pattern name but I've used the same principle before.
To keep this performing well, introduce a checksum to memberProfile based upon the branches for the profile, this way a lookup for an existing profile is plain easy and fast.
But do remember that the checksum is not necessarily unique, in case of collisions you will still have to check the branches, but only for the profiles sharing the same checksum.
Cleanup can be a scheduled task is is nothing more then deleting the profiles without users.

Is it acceptable to have 2 one to many relationships between 2 tables?

I have a database for utility structures (poles, towers) with wire between them. Each span of wire is connected to two structures. I need to define the relationship between the structures and the spans.
I came up with a relationship between structure A to a "from" foreign key field in the span table and another between structure B and a "To" foreign key field in the span table. Is this acceptable database design?
Here's a diagram:
Your model indicates that a span is related to exactly 2 structures. So I think you're good, there. My only comment is that with the names, "fkFromStructureID" and, "fkToStructureID" you imply an orientation that may not exist. It's nit-picky, I admit. But if you do not have a concept of "direction" in your real-world model, you might want to consider different names. If orientation does exist, then this works fine.
Note: If you plan to be able to "walk" the chain of spans and structures using these FK fields, then you'll want to be certain that you manage your orientation on the way into the database. One span that has the From and To backwards will ruin the ability to walk the chain.
Yes, this is quite a common requirement when designing databases and your approach is probably the best way of fulfilling this requirement.

Refactor schema with multiple many-to-many relationships to semantically different but structurally similar entities?

I have a schema with a number of many to many relationships and what I'm seeing is a alot of similar data structure spread out among tables with different names. The intuition I have is that there is a more efficient/desirable way to achieve the same result but I'm not sure what alternative approaches fall into reasonable design/best practices.
Note: Counries, TrafficTypes and People - as they exist now - could all be represented by Id and Name columns, but in the future may have additional fields. Maybe what I'm after is some kind of technique akin to inheritance?
Here's the diagram of what I've got:
Don't lump together things which you think are similar; they may diverge later when you need to store more information about each entity.
Is there a problem with the number of tables you have in your database?
You are probably thinking about the problem from an object oriented design position and thinking you can use some sort of "parent" table to represent the common parts - databases don't work that way.
If you are not careful you will end up with a MUCK or OTLT table.
Off the bat, I would keep seperate enties/objects/(cars vs animals) in seperate tables.
The chances of such enties overlapping properties are slim at best.
Thing is, once those entites start to evolve in your system, you will find that a single table will have hundreds of columns with singles populated per entity.
I don't see how inheritance applies to your case. But if you are interested in inheritance, as it applies to SQL tables or relations, here are some things to look up:
At the level of ER modeling look up "ER model specialization". This is the way the extended ER model diagrams "is A" relationships.
At the level of table design, look up "Class Table Inheritance" and "Shared primary key" for a couple of techniques that, used together, sort of mimic what inheritance does for you in an OOP. You might also want to look up "single table inheritance" for an alternative that's simpler, but can be more wasteful.
Don't worry. You're doing it right.
In a highly normalized schema, you're going to have tons of tables that are nearly identical.
As the others have said, what you're starting to consider (a generic table for multiple things) is a very bad idea and has many drawbacks.
The biggest drawback is that your relationships are made useless by it, in terms of maintaining data integrity. There's nothing stopping someone from assigning a CountryCode where a TrafficTypeId should be, and so on.
Another drawback would be that having one larger table will likely perform worse than many smaller, specialized tables; due to extra, unnecessary blocking.
Your may still want to implement some type of inheritance concept, but that'll be best done in whatever code accesses the database.

Store files and comments for a web application. Table design

I have a web appliaction with several entities (tables).Each one has his CRUD pages.
I'd like to add for some the, the ability to add comments and attach files.
I was thinking of two scenarios.
One table for all comments/files - table would have some id for the entity and the particular record.
For each entity a separate comments/files table.
The files would be stored on the disk in a directory.In the table would be the name of the file and some additional info.
In term of application Design having one unique table for all coments seems to make sense. In term of application code that mean the same SQL will be reused for all entities. It's the 'classical way' used by most applications, extending on having the same acitive records and controllers used to handle comments and attachments for all objects.
In term of SQL thesecond solution could be usefull in some databases like MySQL to get more Memory Cache benefit. Every comment/attachmlent added in the 1st solution would drop from the memory cache all requests impacting the comment table. With individual tables a comment on one entity would not invalidate queries on other entities. But you would alos require more file descriptors and a bigger table cache.... so to choose this solution you would need a decision based on real-life, precise, case, where you would be able to compare the benefits in database access speed. And when you will add new entities you'll certainly find your each-entity-have-a-comment-table solution boring, things could have been automated by using 1st solution.
It's a tradeoff. With a single comments table, you get a simple, DRY (don't repeat yourself) schema, but you don't get foreign key constraints and thus no cascade deletion. Thus, if you delete an entity with comments, you must also remember to delete the comments!
If you go with multiple comment tables, you get FK constraints and cascade deletion, but you have a "wet" schema (you are repeating yourself). For example, each comment table might have a commentbody column. If you change that column definition, you have to change it in every comment table!
One interesting solution for a DRY-er schema could involve table inheritance (see http://www.postgresql.org/docs/9.0/interactive/ddl-inherit.html) but please read section 5.8.1. Caveats, as there are some "gotchas" regarding indexing, at least in postgres.
Either way, kudos to you for thinking carefully about your database design!

Is Entity Inheritance better done thru a single table or multiple?

I have a table that holds agreement information. It works well for 95% of the agreements we record.
But there is a certain type of agreement that would require another 6 or so fields to capture info specific to that type of agreement.
My question is if its better to just add those 6 fields to the existing agreement table knowing that the info is meaningless to many of the agreement records or if its better to create another table w/ a 1:1 relationship w/ the original agreement table to extend it in the case of these special types of agreements.
Neither option is all that attractive to me, but I wanted to know if one was considered a better practice than the other when you have a choice.
Thanks for any help.
Multiple tables are probably your best bet from an extensibility perspective. Although you currently have a single agreement type that requires extra fields, the very existence of such a thing suggests that similar variations in the future are worth accomodating in your design. When other variants arise, a multiple table approach will allow you to include those agreement types and their characteristic attributes gracefully.
Also, it would be worth considering an agreement_type attribute (or something similar) in your superclass table, for cases in which you may want to perform agreement type analysis without incurring the hit of a join to the subclass table(s). The intended usage of the data will be your guide as to whether or not such an attribute makes sense.
My choice is a separated table, because for 95% of agreements its pointless information.
It depends.
A single table is easier to query, and 6 additional fields aren't that much in the way of additional storage. If the number of rows in your existing table is small, or the number of existing columns is already pretty big, then on balance I would add the additional fields to the existing table.
On the other hand, if this change would massively inflate the size of an already-large existing table, it would definitely be worth considering setting up a new table.
Hope this is self-explanatory.