Add/remove columns of a table - code maintenance / optimisation - sql

What is the best way to maintain code of a big project?
Let's say you have 1000 stored procedures, and you have to add a column to a table (or remove one).
Anywhere from 1-2 to 30 stored procedures might be affected.
A single "search" for the table name might not be good enough; let's say you only need to know the places where the table has an insert/update/delete.
Searching for 'insert tablename' might be a good idea, but you might have one space between those 2 words, or 2 spaces, or a TAB ... or maybe the table name is written like '[tablename]'.
The same goes for all 3 (insert/update/delete).
I am basically looking for some kind of 'restricted dependencies'.
What is the best way to handle this?
Keep a database table with this kind of information, and change that table every time you make changes to stored procedures?
Keep some specific code as a comment next to each insert/update/delete, so that you can search for what you need?
Example: 'insert_tablename', 'update_tablename', 'delete_tablename'
Does anyone have a better idea?

Ideally, changes are backward compatible. Not just so that you can change a table without breaking all of the objects that reference it, but also so that you can deploy all of the database changes before you deploy all of the application code (in a distributed architecture, think a downloadable desktop app or an iPhone app, where folks connect to your database remotely, this is crucial).
For example, if you add a new column to a table, it should be NULLable or have a default value so that INSERT statements don't need to be updated immediately to reference it. Stored procedures can be updated gradually to accept a new parameter to represent this column, and it should be nullable / optional so that the application(s) don't need to be aware of this column immediately. Etc.
This also demands that your original insert statements included an explicit column list. If you just say:
INSERT dbo.table VALUES(@p1, @p2, ...);
Then that makes it much tougher to make your changes backward compatible.
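To make the point about explicit column lists concrete, here is a minimal sketch. It uses Python with an in-memory SQLite database rather than SQL Server, purely so that it runs anywhere; the customer table and its columns are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, name TEXT)")

# Both statements work against the original two-column table.
conn.execute("INSERT INTO customer VALUES (?, ?)", (1, "Ann"))
conn.execute("INSERT INTO customer (id, name) VALUES (?, ?)", (2, "Bob"))

# Backward-compatible change: the new column is NULLable and has a default.
conn.execute("ALTER TABLE customer ADD COLUMN region TEXT DEFAULT 'unknown'")

# The statement with an explicit column list keeps working unchanged.
conn.execute("INSERT INTO customer (id, name) VALUES (?, ?)", (3, "Cy"))

# The statement without a column list now supplies too few values.
try:
    conn.execute("INSERT INTO customer VALUES (?, ?)", (4, "Dee"))
    positional_insert_ok = True
except sqlite3.OperationalError:
    positional_insert_ok = False

rows = conn.execute("SELECT id, name, region FROM customer ORDER BY id").fetchall()
print(positional_insert_ok, rows)
```

The same pattern should hold in SQL Server, although the exact error raised for the positional INSERT differs between engines.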
As for removing a column, well, that's a little tougher. Dependencies are not perfect in SQL Server, but you should be able to find a lot of information from these system objects (the first two are dynamic management functions, the third is a catalog view):
sys.dm_sql_referenced_entities
sys.dm_sql_referencing_entities
sys.sql_expression_dependencies
You might also find these articles interesting:
Keeping sysdepends up to date
Make your database changes backward compatible when adding a new column
Make your database changes backward compatible when dropping a column
Make your database changes backward compatible when renaming an entity
Make your database changes backward compatible when changing a relationship
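For the text-search side of the question (variable whitespace, TABs, brackets around the name), a regular expression over the scripted-out procedure bodies can cover the common cases. The following is only a rough sketch in Python; the function names and sample text are hypothetical, and it ignores comments, dynamic SQL, MERGE and aliased UPDATE/DELETE forms, so treat its output as a list of candidates rather than a dependency report.

```python
import re

def dml_pattern(table):
    """Build a regex matching INSERT/UPDATE/DELETE statements that target `table`."""
    name = re.escape(table)
    # Optional schema prefix such as dbo. or [dbo]. and optional brackets around the name.
    target = r"(?:\[?\w+\]?\s*\.\s*)?\[?" + name + r"\]?(?!\w)"
    return re.compile(
        r"\b(insert(?:\s+into)?|update|delete(?:\s+from)?)\s+" + target,
        re.IGNORECASE,
    )

def find_dml(sql_text, table):
    """Return the sorted DML verbs (insert/update/delete) applied to `table`."""
    verbs = set()
    for match in dml_pattern(table).finditer(sql_text):
        verbs.add(match.group(1).split()[0].lower())
    return sorted(verbs)

# The second statement uses a TAB between the verb and the table name.
sample = """
INSERT   INTO [dbo].[Orders] (id) VALUES (1);
update\tOrders set id = 2;
DELETE FROM OrdersArchive WHERE id = 3;
SELECT * FROM Orders;
"""
print(find_dml(sample, "Orders"))  # ['insert', 'update']
```

The DELETE against OrdersArchive and the plain SELECT are not reported, which is the "restricted" behaviour the question asks for.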

Can data and schema be changed with DB2/z load/unload?

I'm trying to find an efficient way to migrate tables with DB2 on the mainframe using JCL. When we update our application such that the schema changes, we need to migrate the database to match.
What we've been doing in the past is basically creating a new table, selecting from the old table into that, deleting the original and renaming the new table to the original name.
Needless to say, that's not a very high-performance solution when the tables are big (and some of them are very big).
With later versions of DB2, I know you can do simple things like alter column types, but we have migration jobs which need to do more complicated things to the data.
Consider for example the case where we want to combine two columns into one (firstname + lastname -> fullname). Never mind that it's not necessarily a good idea to do that, just take it for granted that this is the sort of thing we need to do. There may be arbitrarily complicated transformations to the data, basically anything you can do with a select statement.
My question is this. The DB2 unload utility can be used to pull all of the data out of a table into a couple of data sets (the load JCL used for reloading the data, and the data itself). Is there an easy way (or any way) to massage this output of unload so that these arbitrary changes are made when reloading the data?
I assume that I could modify the load JCL member and the data member somehow to achieve this but I'm not sure how easy that would be.
Or, better yet, can the unload/load process itself do this without having to massage the members directly?
Does anyone have any experience of this, or have pointers to redbooks or redpapers (or any other sources) that describe how to do this?
Is there a different (better, obviously) way of doing this other than unload/load?
As you have noted, SELECTing from the old table into the new table will have very poor performance. Poor performance here is generally due to the relatively high costs of insertion INTO the target table (index building and RI enforcement). The SELECT itself is generally not a performance issue. This is why the LOAD utility is generally preferred when large tables need to be populated from scratch: indices may be built more efficiently and RI may be deferred.
The UNLOAD utility allows unrestricted usage of SELECT. If you can SELECT data using scalar and/or column functions to build a result set that is compatible with your new table column definitions, then UNLOAD can be used to do the data conversion. Specify a SELECT statement in SYSIN for the UNLOAD utility. Something like:
//SYSIN DD *
SELECT CONCAT(FIRST_NAME, LAST_NAME) AS "FULLNAME"
FROM OLD_TABLE
/*
The resulting SYSRECxx file will contain a single column that is a concatenation of the two identified columns (result of the CONCAT function) and SYSPUNCH will contain a compatible column definition for FULLNAME - the converted column name for the new table. All you need to do is edit the new table name in SYSPUNCH (this will have defaulted to TBLNAME) and LOAD it. Try not to fiddle with the SYSRECxx data or the SYSPUNCH column definitions - a goof here could get ugly.
Use the REPLACE option when running the LOAD utility to create the new table (I think the default is LOAD RESUME which won't work here). Often it is a good idea to leave RI off when running the LOAD; this will improve performance and save the headache of figuring out the order in which LOAD jobs need to be run. Once finished you need to verify the RI and build the indices.
The LOAD utility is documented here
"I assume that I could modify the load JCL member and the data member somehow to achieve this but I'm not sure how easy that would be."
I believe you have provided the answer within your question. As to the question of "how easy that would be," it would depend on the nature of your modifications.
SORT utilities (DFSORT, SyncSort, etc.) now have very sophisticated data manipulation functions. We use these to move data around, substitute one value for another, combine fields, split fields, etc. albeit in a different context from what you are describing.
You could do something similar with your load control statements, but that might not be worth the trouble. It will depend on the extent of your changes. It may be worth your time to attempt to automate modification of the load control statements if you have a repetitive modification that is necessary. If the modifications are all "one off" then a manual solution may be more expedient.

netTiers database schema backwards compatibility

I've found that the netTiers generated code relies on an exact database schema and is very unforgiving with variations. For example, adding a column to an existing table - if a column is added somewhere in the middle of the table you will see a cast error at runtime unless netTiers is recompiled. This is because the columns are accessed by ordinal and not by name. (Looking through the change log, I see that this was done as a performance improvement.)
This hasn't been a problem in the past, but on my current project we are trying to build a system with zero downtime upgrades. The challenge we have is database upgrades, and it would be great if we could update the database without affecting the code.
Has anyone using netTiers had similar problems or looked into similar requirements?
Would altering the templates to access the columns by name be more tolerant of previous schema versions? If so, I think this would be worth a minor hit to performance (3% is quoted in "DataReader ordinal-based lookups vs named lookups").
As you have noted, .NetTiers uses ordinal-based lookups in the DataReader. The only way to get .NetTiers to play nicely is to always add new columns to the end of existing tables, and never restructure the table field order.
By doing this, your v1 code will still work against a table that has a new column appended at the end of the table, whilst your v2 code will work with the new addition.
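The difference between ordinal and named lookups can be shown outside of .NET as well. The sketch below uses Python's sqlite3 module as a stand-in for a DataReader; the person_v1 and person_v2 tables are hypothetical and simply model "the same table before and after a column was added in the middle".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows support both row[i] and row["name"]

# v1 schema, and a v2 schema where a column was added in the middle.
conn.execute("CREATE TABLE person_v1 (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE person_v2 (id INTEGER, age INTEGER, name TEXT)")
conn.execute("INSERT INTO person_v1 (id, name) VALUES (1, 'Ann')")
conn.execute("INSERT INTO person_v2 (id, age, name) VALUES (1, 30, 'Ann')")

def name_by_ordinal(table):
    # Code written against v1 assumes the name is the second column.
    row = conn.execute(f"SELECT * FROM {table}").fetchone()
    return row[1]

def name_by_column(table):
    # Lookup by name does not depend on column position.
    row = conn.execute(f"SELECT * FROM {table}").fetchone()
    return row["name"]

print(name_by_ordinal("person_v1"), name_by_ordinal("person_v2"))
print(name_by_column("person_v1"), name_by_column("person_v2"))
```

Against the v2 table the ordinal lookup silently returns the age instead of the name, which is the kind of mismatch that surfaces as a cast error in strongly typed generated code.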

Help updating a column using other columns of the same table

Table: Customer with columns Start_Time and End_Time.
I need to add a new column "Duration" that is End_Time - Start_Time.
However, I need to do this using a trigger or procedure so that immediately after a new record is added to Customer table, the column Duration is updated.
If you are using MS SQL, the ideal answer is probably a computed column.
The less data you actually duplicate, the less opportunity for data inconsistency you will have, therefore the less consistency-ensuring/verification code and fewer maintenance processes will result from your schema.
To set this up (again, if using MS SQL), just add another column using the designer, and expand the "Computed Column Specification" area. (You can refer to other columns from this same table for this calculation.) Then enter "End_Time - Start_Time". Depending on what you are going to do with this data, you may want to use something like DATEDIFF(minute, Start_Time, End_Time) for your formula instead. It's exactly what this feature is for.
If it is a very expensive calculation (which yours is probably not, from the information you've given) you could configure the results to be "persisted" - that's very much like a trigger but clearer to implement and maintain.
Alternately, you could create a new View that does the same calculation, and "project" this first table through it whenever getting information. But you probably already knew that, thus this answer was born! :)
p.s. I personally recommend avoiding triggers like the plague. They cause extra operations that are often not expected by a developer, maintainer, or admin. This can cause operations to fail, return unexpected extra result sets, or modify rows that perhaps an admin was specifically trying to avoid modifying during an administrative (read: unsupported grin) fix.
p.p.s. In this case I'd also recommend against a stored procedure, for the same maintenance reason as triggers. Although you could restrict security such that the only way to update the table was through a stored procedure, this can fail for many of the same reasons triggers can fail. Best to avoid duplicating the data if you can.
p.p.p.s :) This is not to say stored procedures are bad as a whole. On complex transactional operations or tightly integrated procedural filtering of large related tables in order to return a comparatively small result set they are still often the best choice.
As per Shannon, though the term in Oracle is a "Virtual Column".
These were an 11g enhancement. Prior to that, use a view (and that is still a potential answer for 11g).
Do not use a trigger or stored procedure.
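As a rough illustration of the view approach, here is a sketch using Python and SQLite instead of SQL Server or Oracle. The view name Customer_With_Duration and the minute granularity are assumptions made for the example; in SQL Server the expression would more likely be DATEDIFF(minute, Start_Time, End_Time).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (id INTEGER, Start_Time TEXT, End_Time TEXT)")

# Duration is derived on read, so it can never drift out of sync with the
# two stored columns and needs no trigger or procedure to maintain it.
conn.execute("""
    CREATE VIEW Customer_With_Duration AS
    SELECT id, Start_Time, End_Time,
           (strftime('%s', End_Time) - strftime('%s', Start_Time)) / 60 AS Duration
    FROM Customer
""")

conn.execute(
    "INSERT INTO Customer (id, Start_Time, End_Time) VALUES (?, ?, ?)",
    (1, "2024-01-01 10:00:00", "2024-01-01 11:30:00"),
)
duration = conn.execute(
    "SELECT Duration FROM Customer_With_Duration WHERE id = 1"
).fetchone()[0]
print(duration)  # duration in minutes
```

The value is available immediately after the insert, which is what the question asked of the trigger.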

Can select * usage ever be justified?

I've always preached to my developers that SELECT * is evil and should be avoided like the plague.
Are there any cases where it can be justified?
I'm not talking about COUNT(*) - which most optimizers can figure out.
Edit
I'm talking about production code.
And one great example I saw of this bad practice was a legacy asp application that used select * in a stored procedure, and used ADO to loop through the returned records, but got the columns by index. You can imagine what happened when a new field was added somewhere other than the end of the field list.
I'm quite happy using * in audit triggers.
In that case it can actually prove a benefit because it will ensure that if additional columns are added to the base table it will raise an error so it cannot be forgotten to deal with this in the audit trigger and/or audit table structure.
(Like dotjoe) I am also happy using it in derived tables and common table expressions. Though I habitually do it the other way round.
WITH t
     AS (SELECT *,
                ROW_NUMBER() OVER (ORDER BY a) AS RN
         FROM   foo)
SELECT a,
       b,
       c,
       RN
FROM   t;
I'm mostly familiar with SQL Server and there at least the optimiser has no problem recognising that only columns a,b,c will be required and the use of * in the inner table expression does not cause any unnecessary overhead retrieving and discarding unneeded columns.
In principle SELECT * ought to be fine in a view as well, since it is the final SELECT from the view where it ought to be avoided. However, in SQL Server this can cause problems: it stores column metadata for views, which is not automatically updated when the underlying tables change, so the use of * can lead to confusing and incorrect results unless sp_refreshview is run to update this metadata.
There are many scenarios where SELECT * is the optimal solution. Running ad-hoc queries in Management Studio just to get a sense of the data you're working with. Querying tables where you don't know the column names yet because it's the first time you've worked with a new schema. Building disposable quick'n'dirty tools to do a one-time migration or data export.
I'd agree that in "proper" development, you should avoid it - but there's lots of scenarios where "proper" development isn't necessarily the optimum solution to a business problem. Rules and best practices are great, as long as you know when to break them. :)
I'll use it in production when working with CTEs. But, in this case it's not really select *, because I already specified the columns in the CTE. I just don't want to respecify in the final select.
with t as (
select a, b, c from foo
)
select t.* from t;
None that I can think of, if you are talking about live code.
People saying that it makes adding columns easier to develop (so they automatically get returned and can be used without changing the Stored procedure) have no idea about writing optimal code/sql.
I only ever use it when writing ad-hoc queries that will not get reused (finding out the structure of a table, getting some data when I am not sure what the column names are).
I think using select * in an exists clause is appropriate:
select some_field from some_table
where exists
(select * from related_table [join condition...])
Some people like to use select 1 in this case, but it's not elegant, and it doesn't buy any performance improvements (early optimization strikes again).
In production code, I'd tend to agree 100% with you.
However, I think that the * more than justifies its existence when performing ad-hoc queries.
You've gotten a number of answers to your question, but you seem to be dismissing everything that isn't parroting back what you want to hear. Still, here it is for the third (so far) time: sometimes there is no bottleneck. Sometimes performance is way better than fine. Sometimes the tables are in flux, and amending every SELECT query is just one more bit of possible inconsistency to manage. Sometimes you've got to deliver on an impossible schedule and this is the last thing you need to think about.
If you live in bullet time, sure, type in all the column names. But why stop there? Re-write your app in a schema-less dbms. Hell, write your own dbms in assembly. That'd really show 'em.
And remember if you use select * and you have a join at least one field will be sent twice (the join field). This wastes database resources and network resources for no reason.
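The duplicated join column is easy to see by inspecting the result-set metadata. A small sketch with Python and SQLite follows; the customers and orders tables are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (customer_id INTEGER, total REAL)")
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ann')")
conn.execute("INSERT INTO orders (customer_id, total) VALUES (1, 9.5)")

def column_names(sql):
    # cursor.description holds one entry per column in the result set.
    cursor = conn.execute(sql)
    return [entry[0] for entry in cursor.description]

join = "FROM customers JOIN orders ON customers.id = orders.customer_id"
star = column_names("SELECT * " + join)
explicit = column_names(
    "SELECT customers.id AS id, customers.name AS name, orders.total AS total " + join
)
print(star)      # includes both id and customer_id, which always hold the same value
print(explicit)  # the join key is returned once
```

With wide tables or several joins, the redundant columns add up on every row sent over the network.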
As a tool I use it to quickly refresh my memory as to what I can possibly get back from a query. As a production level query itself .. no way.
When creating an application that deals with the database, like phpmyadmin, and you are on a page that displays a full table, using SELECT * can be justified, I guess.
About the only thing that I can think of would be when developing a utility or SQL tool application that is being written to run against any database. Even here though, I would tend to query the system tables to get the table structure and then build any necessary query from that.
There was one recent place where my team used SELECT * and I think that it was ok... we have a database that exists as a facade against another database (call it DB_Data), so it is primarily made up of views against the tables in the other database. When we generate the views we actually generate the column lists, but there is one set of views in the DB_Data database that are automatically generated as rows are added to a generic look-up table (this design was in place before I got here). We wrote a DDL trigger so that when a view is created in DB_Data by this process then another view is automatically created in the facade. Since the view is always generated to exactly match the view in DB_Data and is always refreshed and kept in sync, we just used SELECT * for simplicity.
I wouldn't be surprised if most developers went their entire career without having a legitimate use for SELECT * in production code though.
I've used select * to query tables optimized for reading (denormalized, flat data). Very advantageous, since the purpose of the tables was simply to support various views in the application.
How else do the developers of phpmyadmin ensure they are displaying all the fields of your DB tables?
It is conceivable you'd want to design your DB and application so that you can add a column to a table without needing to rewrite your application. If your application at least checks column names it can safely use SELECT * and treat additional columns with some appropriate default action. Sure the app could consult system catalogs (or app-specific catalogs) for column information, but in some circumstances SELECT * is syntactic sugar for doing that.
There are obvious risks to this, however, and adding the required logic to the app to make it reliable could well simply mean replicating the DB's query checks in a less suitable medium. I am not going to speculate on how the costs and benefits trade off in real life.
In practice, I stick to SELECT * for 3 cases (some mentioned in other answers):
As an ad-hoc query, entered in a SQL GUI or command line.
As the contents of an EXISTS predicate.
In an application that dealt with generic tables without needing to know what they mean (e.g. a dumper, or differ).
Yes, but only in situations where the intention is to actually get all the columns from a table, not because you want all the columns that a table currently has.
For example, in one system that I worked on we had UDFs (User Defined Fields) where the user could pick the fields they wanted on the report, the order as well as filtering. When building a result set it made more sense to simply "select *" from the temporary tables that I was building instead of having to keep track of which columns were active.
I have several times needed to display data from a table whose column names were unknown. So I did SELECT * and got the column names at run time.
I was handed a legacy app where a table had 200 columns and a view had 300. The risk exposure from SELECT * would have been no worse than from listing all 300 columns explicitly.
Depends on the context of the production software.
If you are writing a simple data access layer for a table management tool where the user will be selecting tables and viewing results in a grid, then it would seem SELECT * is fine.
In other words, if you choose to handle "selection of fields" through some other means (as in automatic or user-specified filters after retrieving the resultset) then it seems just fine.
If on the other hand we are talking about some sort of enterprise software with business rules, a defined schema, etc. ... then I agree that SELECT * is a bad idea.
EDIT: Oh and when the source table is a stored procedure for a trigger or view, "SELECT *" should be fine because you're managing the resultset through other means (the view's definition or the stored proc's resultset).
Select * in production code is justifiable any time that:
it isn't a performance bottleneck
development time is critical
Why would I want the overhead of going back and having to worry about changing the relevant stored procedures, every time I add a field to the table?
Why would I even want to have to think about whether or not I've selected the right fields, when the vast majority of the time I want most of them anyway, and the vast majority of the few times I don't, something else is the bottleneck?
If I have a specific performance issue then I'll go back and fix that. Otherwise in my environment, it's just premature (and expensive) optimisation that I can do without.
Edit.. following the discussion, I guess I'd add to this:
... and where people haven't done other undesirable things like tried to access columns(i), which could break in other situations anyway :)
I know I'm very late to the party but I'll chip in that I use select * whenever I know that I'll always want all columns regardless of the column names. This may be a rather fringe case but in data warehousing, I might want to stage an entire table from a 3rd party app. My standard process for this is to drop the staging table and run
select *
into staging.aTable
from remotedb.dbo.aTable
Yes, if the schema on the remote table changes, downstream dependencies may throw errors but that's going to happen regardless.
If you want to find all the columns and want order, you can do the following (at least if you use MySQL):
SHOW COLUMNS FROM mytable FROM mydb; (1)
You can see all the relevant information about all your fields. You can prevent problems with types and you can know for sure all the column names. This command is very quick, because you just ask for the structure of the table. From the results you will select all the names and build a string like this:
"select " + fieldNames[0] + ", " + fieldNames[1] + ", " + fieldNames[2] + " from mytable". (2)
If you don't want to run two separate MySQL commands because a MySQL command is expensive, you can include (1) and (2) into a stored procedure which will have the results as an OUT parameter, that way you will just call a stored procedure and every command and data generation will happen at the database server.
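Steps (1) and (2) can be sketched in application code as follows. This version uses Python with SQLite, so PRAGMA table_info stands in for MySQL's SHOW COLUMNS; the build_select helper is a hypothetical name, and the table name is assumed to come from trusted code, not from user input.

```python
import sqlite3

def build_select(conn, table):
    """Build an explicit SELECT for `table` from the catalog instead of using SELECT *."""
    # PRAGMA table_info returns one row per column:
    # (cid, name, type, notnull, dflt_value, pk). We only need the name.
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    if not columns:
        raise ValueError(f"unknown table: {table}")
    # Quote each identifier so unusual column names still produce valid SQL.
    column_list = ", ".join(f'"{name}"' for name in columns)
    return f'SELECT {column_list} FROM "{table}"'

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER, name TEXT, created TEXT)")
query = build_select(conn, "mytable")
print(query)  # SELECT "id", "name", "created" FROM "mytable"
```

Joining the names with a separator avoids hard-coding a fixed number of fields, which the string in (2) would otherwise require.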

Strategy for identifying unused tables in SQL Server 2000?

I'm working with a SQL Server 2000 database that likely has a few dozen tables that are no longer accessed. I'd like to clear out the data that we no longer need to be maintaining, but I'm not sure how to identify which tables to remove.
The database is shared by several different applications, so I can't be 100% confident that reviewing these will give me a complete list of the objects that are used.
What I'd like to do, if it's possible, is to get a list of tables that haven't been accessed at all for some period of time. No reads, no writes. How should I approach this?
MSSQL2000 won't give you that kind of information. But a way you can identify what tables ARE used (and then deduce which ones are not) is to use the SQL Profiler, to save all the queries that go to a certain database. Configure the profiler to record the results to a new table, and then check the queries saved there to find all the tables (and views, sps, etc) that are used by your applications.
Another way I think you might check if there's any "writes" is to add a new timestamp column to every table, and a trigger that updates that column every time there's an update or an insert. But keep in mind that if your apps do queries of the type
select * from ...
then they will receive a new column and that might cause you some problems.
Another suggestion for tracking tables that have been written to is to use Red Gate SQL Log Rescue (free). This tool dives into the log of the database and will show you all inserts, updates and deletes. The list is fully searchable, too.
It doesn't meet your criteria for researching reads into the database, but I think the SQL Profiler technique will get you a fair idea as far as that goes.
If you have lastupdate columns you can check for the writes, but there is really no easy way to check for reads. You could run Profiler, save the trace to a table and check in there.
What I usually do is rename the table by prefixing it with an underscore; when people start to scream I just rename it back.
If by not used you mean your application has no more references to the tables in question, and you are using dynamic SQL, you could do a search for the table names in your app. If they don't exist, blow them away.
I've also outputted all sprocs, functions, etc. to a text file and done a search for the table names. If not found, or found in procedures that will need to be deleted too, blow them away.
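The text-file search described above could be automated along these lines. This is a hedged sketch in Python: the function name and sample dump are invented, and a table that is only reached through dynamic SQL or application code will be wrongly reported as unreferenced, so the output is a list of candidates to investigate rather than a list of tables to drop.

```python
import re

def unreferenced_tables(table_names, dumped_sql):
    """Return the tables whose names never appear as a whole word in dumped_sql."""
    unused = []
    for table in table_names:
        # The lookarounds keep 'Orders' from matching inside 'OrdersArchive' or 'GetOrders'.
        pattern = re.compile(r"(?<!\w)" + re.escape(table) + r"(?!\w)", re.IGNORECASE)
        if not pattern.search(dumped_sql):
            unused.append(table)
    return unused

dump = """
CREATE PROCEDURE GetOrders AS SELECT id FROM [dbo].[Orders];
CREATE PROCEDURE ArchiveOld AS DELETE FROM OrdersArchive WHERE id < 100;
"""
print(unreferenced_tables(["Orders", "OrdersArchive", "LegacyAudit"], dump))  # ['LegacyAudit']
```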
It looks like using the Profiler is going to work. Once I've let it run for a while, I should have a good list of used tables. Anyone who doesn't use their tables every day can probably wait for them to be restored from backup. Thanks, folks.
Probably too late to help mogrify, but for anybody doing a search: I would search for all objects using this object in my code, then in SQL Server by running this:
select distinct '[' + object_name(id) + ']'
from syscomments
where text like '%MY_TABLE_NAME%'