Netezza - reuse common SQL code across multiple queries in a DRY manner?

Is it possible with Netezza queries to include SQL files (which contain specific SQL code), or is that not the right way to use it?
Here is an example.
I have some common SQL code (let's say common.sql) which creates a temp table and needs to be used across multiple other queries (let's say analysis1.sql, analysis2.sql, etc.). From a code management perspective it is quite overwhelming to maintain if the code in common.sql is repeated across the many other queries. Is there a DRY way to do this - something like #include <common.sql> from the other queries to pull in the reused code in common.sql?

Including SQL files is not the right way to do it. If you wish to persist with this approach, you could use a preprocessor like cpp or even PHP to assemble the files for you, with a build process to generate the finished queries.
However, from a maintainability perspective you are better off creating views and functions for reusable content. Note that these can pose optimization barriers, so a single large query is often still the way to go.
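For instance, if the shared logic in common.sql can be expressed as a single SELECT (rather than procedural temp-table work), it could live in a view; the table and column names below are purely illustrative:
-- common logic, defined once (a sketch; replace with whatever common.sql actually derives)
CREATE VIEW common_base_v AS
SELECT order_id, cust_id, amount
FROM   orders
WHERE  status = 'COMPLETE';

-- analysis1.sql, analysis2.sql, ... then simply build on the view
SELECT cust_id, SUM(amount) AS total_amount
FROM   common_base_v
GROUP BY cust_id;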

I agree: views, functions (table-valued if needed), or more likely stored procedures are the way to go.
We have had a lot of luck letting stored procedures generate complex but repeatable code patterns on the fly based on input parameters and metadata on the tables being processed.
An example: all tables have a 'unique constraint' (which is not really unique, but that doesn't matter since it isn't enforced in Netezza) with a fixed name of UBK_[tablename].
UBK is used as a 'signal' to the stored procedure, identifying the columns of the business key for a classic Kimball-style type 2 dimension table.
The SP can then apply the 'incoming' rows to the target table just by being supplied with the name of the target table and a 'stage' table containing all the same column names and data types.
Other examples could be an SP that takes a table name and three arguments, each a 'string,of,columns': it does an 'Excel-style pivot' that groups by the columns in the first argument, uses a SELECT DISTINCT on the second argument to generate new column names for the pivoted columns, and sums the column in the third argument into a target table whose name you specify.
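To make that concrete, calls to such generic procedures might look roughly like this (procedure and table names here are hypothetical, not our actual ones):
-- apply a stage table to a type 2 dimension, driven by the UBK_ constraint metadata
CALL apply_scd2_dimension('DIM_CUSTOMER', 'STG_CUSTOMER');

-- pivot: group-by columns, distinct column-name source, summed measure, target table
CALL pivot_to_table('SALES_DETAIL', 'REGION,SALES_YEAR', 'PRODUCT_CODE', 'AMOUNT', 'SALES_PIVOT');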
Can you follow me?
I think that the nzsql command-line tool may be able to do an 'include', but a combination of strong 'building block' stored procedures and Perl/Python and/or an ETL tool will most likely prove a better choice.
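For completeness, the nzsql include would look something like the sketch below; this assumes nzsql has inherited the psql-style \i meta-command, which you should verify on your version.
-- analysis1.sql, run through nzsql rather than another client
\i common.sql                          -- pulls in the shared temp-table setup
SELECT COUNT(*)
FROM   my_common_temp;                 -- whatever temp table common.sql creates (name is illustrative)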

Related

Is there a way in SQL to determine all columns accessed by an arbitrary query?

I support a database that contains a schema that has a couple hundred tables containing our most important data.
Our application also offers APIs implemented as queries stored in NVARCHAR(MAX) fields in a Query table which are written against the views as well as the tables in this critical schema.
Over time, columns have been added to the tables, but the APIs haven't always kept up.
I've been asked if I can find a way via SQL to identify, as nearly as possible (some false positives/negatives OK), columns in the tables that are not referenced by either the views or the SQL queries that provide the API output.
Initially this seemed do-able. I've found some similar questions on the topic, such as here and here that sort of give guidance on how to start...although I note that even with these, there's the kind of ugly fallback method that looks like:
OBJECT_DEFINITION(OBJECT_ID('[Schema].[View]')) LIKE '%' + [Column] + '%'
which is likely to generate false positives as well as being super slow when I'm trying to do it for a couple of thousand column names.
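Spelled out against the views, that fallback is roughly the query below (a sketch only; it compares every candidate column against every view definition, which is exactly why it is slow and noisy):
SELECT c.TABLE_NAME, c.COLUMN_NAME, v.name AS view_name
FROM   INFORMATION_SCHEMA.COLUMNS AS c
CROSS JOIN sys.views AS v
WHERE  OBJECT_DEFINITION(v.object_id) LIKE '%' + c.COLUMN_NAME + '%';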
Isn't there anything better/more reliable? Maybe something that could compile a query down to a plan and be able to determine from the plan every column that must be accessed in order to deliver the results?
Our application also offers APIs implemented as queries stored in NVARCHAR(MAX) fields
So you've reimplemented views? :)
If you make them actual views you can look at INFORMATION_SCHEMA - cross reference table/columns to view/columns.
Assuming you don't want to do that, and you're prepared to write a job to run occasionally (rather than real-time) you could do some super-cheesy dynamic SQL.
Loop through your definitions that are stored in NVARCHAR(MAX) with a cursor
Create a temp view or SP from the SQL in your NVARCHAR(MAX)
Examine INFORMATION_SCHEMA from your temp view/SP and put that into a temp holding table.
Do this for all your queries and you've got a list of referenced columns.
Pretty ugly, but it should be workable for a tactical scan of your API vs the database - something like the sketch below.
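A rough sketch of that job, assuming the API queries live in a table like dbo.Query(QueryId, QueryText) - adjust the names to your actual schema, and note that stored queries with parameters or ORDER BY clauses will need extra handling before they can be wrapped in a view:
DECLARE @id INT, @sql NVARCHAR(MAX);

IF OBJECT_ID('tempdb..#referenced') IS NOT NULL DROP TABLE #referenced;
CREATE TABLE #referenced (QueryId INT, TableName SYSNAME, ColumnName SYSNAME);

DECLARE q CURSOR LOCAL FAST_FORWARD FOR
    SELECT QueryId, QueryText FROM dbo.Query;     -- hypothetical table/column names
OPEN q;
FETCH NEXT FROM q INTO @id, @sql;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- wrap the stored query in a temporary view so the engine resolves its column usage
    EXEC ('CREATE VIEW dbo.__api_probe AS ' + @sql);

    INSERT INTO #referenced (QueryId, TableName, ColumnName)
    SELECT @id, cu.TABLE_NAME, cu.COLUMN_NAME
    FROM   INFORMATION_SCHEMA.VIEW_COLUMN_USAGE AS cu
    WHERE  cu.VIEW_NAME = '__api_probe';

    EXEC ('DROP VIEW dbo.__api_probe');
    FETCH NEXT FROM q INTO @id, @sql;
END;
CLOSE q;
DEALLOCATE q;

-- #referenced now holds the table columns each stored API query touches;
-- anything in INFORMATION_SCHEMA.COLUMNS but not in here is a candidate "unused" column.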

Any downside to using a view to make sure stored procedures get all the columns they need?

Let me start by stating that when writing SELECT statements in a stored procedure or elsewhere in application code, I ALWAYS specify columns explicitly rather than using SELECT *.
That said, I have a case where I have many stored procedures that need exactly the same columns back because they are used to hydrate the same models on the client. In an effort to make the code less brittle and less prone to forgetting to update a stored procedure with a new column, I am thinking of creating a view and selecting from that in my stored procedures using SELECT *.
To help clarify, here are examples of what the stored procedures might be named:
Entity_GetById
Entity_GetByX
Entity_GetForY
-- and on and on...
So in each of these stored procedures, I would have
SELECT *
FROM EntityView
WHERE...[criteria based on the stored procedure]
I'm wondering if there is any cost to doing this. I understand the pitfalls of SELECT * FROM Table, but selecting from a view that exists specifically to define the columns needed seems to mitigate them.
Are there any other reasons I wouldn't want to use this approach?
Personally, I don't see a problem with this approach.
However, there is a range of opinions on the use of select * in production code, which generally discourages it for the outermost query (of course, it is fine in subqueries).
You do want to make sure that the stored procedures are recompiled if there is any change to either the view or to the underlying tables supplying the view. You probably want to use schemabinding to ensure that (some) changes to the underlying tables do not inadvertently affect the view(s).
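A sketch of that arrangement - the view's base table and column list are invented here - could look like:
CREATE VIEW dbo.EntityView
WITH SCHEMABINDING
AS
SELECT e.EntityId, e.Name, e.CreatedOn      -- the single place the column list is maintained
FROM   dbo.Entity AS e;
GO

CREATE PROCEDURE dbo.Entity_GetById
    @Id INT
AS
BEGIN
    SELECT * FROM dbo.EntityView WHERE EntityId = @Id;
END;
GO

-- after changing the view or its base table, force dependent procedures to recompile, e.g.:
-- EXEC sys.sp_recompile N'dbo.Entity_GetById';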
I don't know your system, but wouldn't using a view affect performance?
SELECT * from the view makes sense, but does the view just select specific columns from one table?
If not, then look carefully at performance.
If I remember correctly, in MS SQL a stored procedure can return a recordset.
If I'm right, you could try to wrap the various queries into a kind of 'sub-query' stored procedure and have one main procedure which selects the specific columns -- compilation should fail if you miss something.
Even better would be having stored procedures which query by the various parameters and return only primary keys (as a recordset or in a temporary table), and one main procedure which fetches all the required columns based on the returned primary keys.
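For what it's worth, that last pattern might look roughly like this in T-SQL (all object names are illustrative):
-- 'search' procedure: resolves the criteria to primary keys only
CREATE PROCEDURE dbo.Entity_FindByX
    @X INT
AS
BEGIN
    SELECT EntityId FROM dbo.Entity WHERE X = @X;
END;
GO

-- 'main' procedure: turns keys into the full column set the client models need
CREATE PROCEDURE dbo.Entity_GetByKeys
AS
BEGIN
    SELECT e.EntityId, e.Name, e.CreatedOn
    FROM   dbo.Entity AS e
    JOIN   #keys AS k ON k.EntityId = e.EntityId;   -- #keys is created and filled by the caller
END;
GO

-- caller:
-- CREATE TABLE #keys (EntityId INT PRIMARY KEY);
-- INSERT #keys EXEC dbo.Entity_FindByX @X = 42;
-- EXEC dbo.Entity_GetByKeys;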

Finding table and package references in oracle

I have a list of tables and a list of packages. I need to come up with the below two lists
The packages that use the given set of tables
The tables that are referenced by each of the given packages
The packages use dynamic SQL, hence I may not be able to depend only on the dba_dependencies view.
The other way I can think of is using a LIKE clause against the dba_source view, but I would have to write an OR condition for each of the tables I need (or, of course, a function or procedure which can loop through each table).
Is there any better way of doing this?
Any help is greatly appreciated.
Edit: rephrasing the question -
I have a package which selects/inserts/updates several tables. It uses dynamic SQL. One example is provided below.
I want to identify all the tables referred in this package. What is the best way to achieve this?
In the below example I want to capture both table1 and table2.
if flag = 'Y' then
  final_sql := 'insert into table1 (...)';
else
  final_sql := 'insert into table2 (...)';
end if;
execute immediate final_sql;
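For reference, the brute-force scan of dba_source I mentioned would look something like this (a sketch; the package and table lists are placeholders for my real lists, and it will happily return false positives):
SELECT DISTINCT s.owner, s.name AS package_name, t.table_name
FROM   dba_source s
JOIN   dba_tables t
       ON UPPER(s.text) LIKE '%' || t.table_name || '%'
WHERE  s.type IN ('PACKAGE', 'PACKAGE BODY')
AND    s.name IN ('PKG_ONE', 'PKG_TWO')
AND    t.table_name IN ('TABLE1', 'TABLE2');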
For systems using a lot of dynamic SQL I suggest two approaches.
The first one is to apply strict coding standards so you know what to look for and can then reliably parse out the table names from the rest of the code. I mean: always have the table name string written to a known variable name, and search for that variable.
This is not always easy to do, especially if you have mountains of code that do not follow the standard. It only takes a couple of folk not adhering to the standard and it all falls down. However, it can be made to work, though it is probably never going to be 100% reliable.
The other approach is to write test scripts that exercise the whole code base and its logic paths. Write them in such a way that they log the procedure name. Enable SQL Trace and capture the trace files from the tests. With clever scripting you should be able to tie the trace to the procedure. This will give you the "raw" SQL, which you can then grep for matches with your list of tables. You might be able to get the same info by harvesting V$SQL tied to V$SESSION.
This is an old school way of doing this, but one that I have used and works.
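Harvesting the shared pool rather than trace files might look something like the sketch below; it only sees statements still cached in V$SQL, and the table names are placeholders:
SELECT s.sql_id,
       s.parsing_schema_name,
       s.module,
       s.sql_fulltext
FROM   v$sql s
WHERE  UPPER(s.sql_fulltext) LIKE '%TABLE1%'
   OR  UPPER(s.sql_fulltext) LIKE '%TABLE2%';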
On one of the largest systems I worked on I wrote a CRUD parser which tokenised the code and produced a CRUD matrix by source file and table access. For dynamic SQL we processed SQL Trace/tkprof files.
If the code has a good amount of debugging which dumps out the table names, you could again run the test scripts and harvest the debug logs.

Can data and schema be changed with DB2/z load/unload?

I'm trying to find an efficient way to migrate tables with DB2 on the mainframe using JCL. When we update our application such that the schema changes, we need to migrate the database to match.
What we've been doing in the past is basically creating a new table, selecting from the old table into that, deleting the original and renaming the new table to the original name.
Needless to say, that's not a very high-performance solution when the tables are big (and some of them are very big).
With later versions of DB2, I know you can do simple things like altering column types, but we have migration jobs which need to do more complicated things to the data.
Consider for example the case where we want to combine two columns into one (firstname + lastname -> fullname). Never mind that it's not necessarily a good idea to do that, just take it for granted that this is the sort of thing we need to do. There may be arbitrarily complicated transformations to the data, basically anything you can do with a select statement.
My question is this. The DB2 unload utility can be used to pull all of the data out of a table into a couple of data sets (the load JCL used for reloading the data, and the data itself). Is there an easy way (or any way) to massage this output of unload so that these arbitrary changes are made when reloading the data?
I assume that I could modify the load JCL member and the data member somehow to achieve this but I'm not sure how easy that would be.
Or, better yet, can the unload/load process itself do this without having to massage the members directly?
Does anyone have any experience of this, or have pointers to redbooks or redpapers (or any other sources) that describe how to do this?
Is there a different (better, obviously) way of doing this other than unload/load?
As you have noted, SELECTing from the old table into the new table will have very poor performance. Poor performance here is generally due to the relatively high cost of insertion INTO the target table (index building and RI enforcement). The SELECT itself is generally not a performance issue. This is why the LOAD utility is generally preferred when large tables need to be populated from scratch: indices may be built more efficiently and RI may be deferred.
The UNLOAD utility allows unrestricted usage of SELECT. If you can SELECT data using scalar and/or column functions to build a result set that is compatible with your new table column definitions, then UNLOAD can be used to do the data conversion. Specify a SELECT statement in SYSIN for the UNLOAD utility. Something like:
//SYSIN DD *
SELECT CONCAT(FIRST_NAME, LAST_NAME) AS "FULLNAME"
FROM OLD_TABLE
/*
The resulting SYSRECxx file will contain a single column that is a concatenation of the two identified columns (the result of the CONCAT function), and SYSPUNCH will contain a compatible column definition for FULLNAME - the converted column name for the new table. All you need to do is edit the new table name in SYSPUNCH (this will have defaulted to TBLNAME) and LOAD it. Try not to fiddle with the SYSRECxx data or the SYSPUNCH column definitions - a goof here could get ugly.
Use the REPLACE option when running the LOAD utility to create the new table (I think the default is LOAD RESUME, which won't work here). Often it is a good idea to leave RI off when running the LOAD; this will improve performance and save the headache of figuring out the order in which LOAD jobs need to be run. Once finished you need to verify the RI and build the indices.
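To give a feel for the edit, the top of the generated SYSPUNCH might end up looking roughly like this after the change (a sketch only; keep the field specifications exactly as UNLOAD generated them and just adjust the options and table name):
//SYSIN    DD *
  LOAD DATA LOG NO REPLACE
    INTO TABLE MYSCHEMA.NEW_TABLE
      ( ...field specifications exactly as generated into SYSPUNCH... )
/*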
The LOAD utility is documented here
I assume that I could modify the load JCL member and the data member somehow to achieve this but I'm not sure how easy that would be.
I believe you have provided the answer within your question. As to the question of "how easy that would be," it would depend on the nature of your modifications.
SORT utilities (DFSORT, SyncSort, etc.) now have very sophisticated data manipulation functions. We use these to move data around, substitute one value for another, combine fields, split fields, etc. albeit in a different context from what you are describing.
You could do something similar with your load control statements, but that might not be worth the trouble. It will depend on the extent of your changes. It may be worth your time to attempt to automate modification of the load control statements if you have a repetitive modification that is necessary. If the modifications are all "one off" then a manual solution may be more expedient.

Best Practice: One Stored Proc that always returns all fields or different stored procedure for each field set needed?

If I have a table with Field1, Field2, Field3 and Field4, and for one instance I need just Field1 and Field2, but for another I need Field3 and Field4, and yet another needs all of them...
Is it better to have a SP for each combination I need or one SP that always returns them all?
Very important question:
Writing many stored procs that run the same query will make you spend a lot of time documenting and apologising to future maintainers.
Every time anyone wants to introduce a change, they have to consider whether it should apply to all stored procs, to some, or to only one...
I would do only one stored proc.
I would just have one Stored Procedure as it will be easier to maintain.
Does it need to be a Stored Procedure? You could rewrite it as a View then simply select the columns that you need.
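For instance (assuming a table named dbo.MyTable holding those four fields):
CREATE VIEW dbo.MyTableView AS
SELECT Field1, Field2, Field3, Field4 FROM dbo.MyTable;
GO

-- callers then pick only the columns they actually need
SELECT Field1, Field2 FROM dbo.MyTableView;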
If network bandwidth and memory usage are more important than hours of work and project simplicity, then make a separate SP for each task. Otherwise there's no point. (The gains aren't that great, and are noticeable only when the rowset is extremely large or there are a lot of simultaneous requests.)
As a general rule it is good practice to select only the columns we need to serve a particular purpose. This is particularly true for tables which have:
lots of columns
LOB columns
sensitive or restricted data
However, if we have a complicated system with lots of tables it is obviously impractical to build a separate stored procedure for each distinct query. In fact it is probably undesirable to do so. The resultant API would be overwhelming to use and a lot of effort to maintain.
The solutions are several and various, and really depend on the nature of the applications. Views can help, although they share some of the same maintenance issues. Dynamic SQL is another approach. We can write complicated procedures which return many different result sets depending on the input parameters. Heck, sometimes we can even write SQL statements in the actual application.
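As a small illustration of the parameter-driven approach (all names invented), a single procedure can return different column sets based on an input parameter:
CREATE PROCEDURE dbo.MyTable_Get
    @ColumnSet VARCHAR(10)      -- e.g. 'basic' or 'full'
AS
BEGIN
    IF @ColumnSet = 'basic'
        SELECT Field1, Field2 FROM dbo.MyTable;
    ELSE IF @ColumnSet = 'full'
        SELECT Field1, Field2, Field3, Field4 FROM dbo.MyTable;
END;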
Oh, and there is the simple procedure which basically wraps a SELECT * FROM some_table but that comes with its own suite of problems.