Typical Kimball Star-Schema Data Warehouse - Are Model Views Feasible, and How to Code-Gen Them? - sql

I have a data warehouse containing typical star schemas, and a whole bunch of code which does stuff like this (obviously a lot bigger, but this is illustrative):
SELECT cdim.x
,SUM(fact.y) AS y
,dim.z
FROM fact
INNER JOIN conformed_dim AS cdim
ON cdim.cdim_dim_id = fact.cdim_dim_id
INNER JOIN nonconformed_dim AS dim
ON dim.ncdim_dim_id = fact.ncdim_dim_id
INNER JOIN date_dim AS ddim
ON ddim.date_id = fact.date_id
WHERE fact.date_id = #date_id
GROUP BY cdim.x
,dim.z
I'm thinking of replacing it with a view (MODEL_SYSTEM_1, say), so that it becomes:
SELECT m.x
,SUM(m.y) AS y
,m.z
FROM MODEL_SYSTEM_1 AS m
WHERE m.date_id = #date_id
GROUP BY m.x
,m.z
But the view MODEL_SYSTEM_1 would have to contain unique column names, and I'm also concerned about performance with the optimizer if I go ahead and do this: since the view would be across a whole star, I'm worried about whether all the items in the WHERE clause across the different facts and dimensions get optimized properly, and views cannot be parametrized (boy, wouldn't that be cool!)
So my questions are -
Is this approach OK, or is it just going to be an abstraction which hurts performance and doesn't give me anything but a lot nicer syntax?
What's the best way to code-gen these views, eliminating duplicate column names (even if the view later needs to be tweaked by hand), given that all the appropriate PKs and FKs are in place? Should I just write some SQL to pull it out of the INFORMATION_SCHEMA, or is there a good example already available?
Edit: I have tested it, and the performance seems the same, even on the bigger processes, and even when joining multiple stars which each use these views.
The automation is mainly because there are a number of these stars in the data warehouse, and the FK/PK relationships have been set up properly by the designers, but I don't want to have to pick through all the tables or the documentation. I wrote a script to generate the view (it also generates abbreviations for the tables), and it works well to generate the skeleton automagically from INFORMATION_SCHEMA, and then it can be tweaked before committing the creation of the view.
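For reference, a generator like that typically starts from a query along these lines - a sketch against the standard INFORMATION_SCHEMA views that lists each foreign key column on a fact table together with the dimension table and column it references (the table name 'fact' is the illustrative one from above):
SELECT kcu_fk.COLUMN_NAME AS fact_column,
       kcu_pk.TABLE_NAME  AS dim_table,
       kcu_pk.COLUMN_NAME AS dim_column
FROM INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS rc
INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu_fk
    ON kcu_fk.CONSTRAINT_NAME = rc.CONSTRAINT_NAME
INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu_pk
    ON kcu_pk.CONSTRAINT_NAME = rc.UNIQUE_CONSTRAINT_NAME
   AND kcu_pk.ORDINAL_POSITION = kcu_fk.ORDINAL_POSITION
WHERE kcu_fk.TABLE_NAME = 'fact';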
If anyone wants the code, I could probably publish it here.

I’ve used this technique on several data warehouses I look after. I have not noticed any performance degradation when running reports based on the views versus a direct-table approach, but I have never performed a detailed analysis.
I created the views using the designer in SQL Server management studio and did not use any automated approach. I can’t imagine the schema changing often enough that automating it would be worthwhile anyhow. You might spend as long tweaking the results as it would have taken to drag all the tables onto the view in the first place!
To remove ambiguity, a good approach is to prefix the column names with the name of the dimension they belong to. This is helpful to the report writers and to anyone running ad hoc queries.
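For the star in the question, that convention might look like this (a sketch; the prefixed aliases are the point, and calendar_year is a made-up example column):
CREATE VIEW MODEL_SYSTEM_1 AS
SELECT fact.date_id,
       fact.y,
       cdim.x             AS conformed_dim_x,
       dim.z              AS nonconformed_dim_z,
       ddim.calendar_year AS date_dim_calendar_year
FROM fact
INNER JOIN conformed_dim AS cdim ON cdim.cdim_dim_id = fact.cdim_dim_id
INNER JOIN nonconformed_dim AS dim ON dim.ncdim_dim_id = fact.ncdim_dim_id
INNER JOIN date_dim AS ddim ON ddim.date_id = fact.date_id;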

Make the view or views into one or more summary fact tables and materialize them. These only need to be refreshed when the main fact table is refreshed. The materialized views will be faster to query, and this can be a win if you have a lot of queries that can be satisfied by the summary.
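A minimal sketch of such a summary, reusing the illustrative star from the question (summary_fact is a made-up name, refreshed after each fact load):
TRUNCATE TABLE summary_fact;
INSERT INTO summary_fact (date_id, x, z, y)
SELECT f.date_id, cdim.x, dim.z, SUM(f.y)
FROM fact f
INNER JOIN conformed_dim AS cdim ON cdim.cdim_dim_id = f.cdim_dim_id
INNER JOIN nonconformed_dim AS dim ON dim.ncdim_dim_id = f.ncdim_dim_id
GROUP BY f.date_id, cdim.x, dim.z;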
You can use the data dictionary or information schema views to generate SQL to create the tables if you have a large number of these summaries or wish to change them about frequently.
However, I would guess that it's not likely that you would change these very often so auto-generating the view definitions might not be worth the trouble.

If you happen to use MS SQL Server, you could try an Inline UDF which is as close to a parameterized view as it gets.
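For illustration, a minimal sketch of that approach against the star from the question (the function name and the int typing of the date key are assumptions):
CREATE FUNCTION dbo.MODEL_SYSTEM_1_FN (@date_id int)
RETURNS TABLE
AS
RETURN
(
    -- same star as the question's view, but with the date filter pushed in
    SELECT cdim.x, fact.y, dim.z
    FROM fact
    INNER JOIN conformed_dim AS cdim ON cdim.cdim_dim_id = fact.cdim_dim_id
    INNER JOIN nonconformed_dim AS dim ON dim.ncdim_dim_id = fact.ncdim_dim_id
    WHERE fact.date_id = @date_id
);
GO
-- called like a parameterized view (@date_id declared elsewhere,
-- mirroring the #date_id placeholder above):
SELECT m.x, SUM(m.y) AS y, m.z
FROM dbo.MODEL_SYSTEM_1_FN(@date_id) AS m
GROUP BY m.x, m.z;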

Related

Rolling up values from multiple child tables

What is the best way to roll up values from a series of child tables into a parent table in SQL Server?
Let's say we have a contracts table. This table has a series of child tables, such as contract_timesheets, contract_materials, contract_other_expenses - etc. What is the best way to pull costs / hours / etc out of those child tables and make them easily accessible in the parent table?
Option 1: My first thought would be to simply use a view. An example might be something like this:
SELECT
    contract_code,
    caption,
    description,
    (
        SELECT SUM(t.hours * l.rate_hourly)
        FROM timesheets t
        JOIN labor l ON t.hr_code = l.hr_code
        WHERE t.contract_code = c.contract_code
    ) AS labor_cost,
    (
        SELECT ...
    ) AS material_cost,
    ...
FROM contracts c
So we'll have a view that might have a dozen or more subqueries like that, many of which will themselves need joins to pull in all of the info we need.
This works completely fine, until we have hundreds of thousands of rows. Then things start to get noticeably slow. It's still workable, but if the row count gets much higher, or the server gets too much other workload, I'm concerned that this won't be workable.
Is there a more efficient way to structure such a view?
Option 2: The other obvious solution is to roll those numbers up into physical fields in the parent table. The big issue with that is just maintaining the numbers when the data might be accessed from a variety of clients. Maybe it's a report, maybe it's a form, maybe it's some integration service. So trying to use some premade roll-up SQL file that gets run as an event in the front-end prior to displaying the report / chart / whatever isn't an ideal solution.
To ensure that the roll-up numbers stay in sync, we could attach a series of triggers to all of the child tables (and possibly relatives of those, if the numbers in the child tables rely on something else). Every time the source numbers get updated, we roll them up into the parent. This seems like a lot of trouble, but if the triggers are written correctly, I suppose this would work fine.
Option 3: Do everything in the UI. This is also an option, but with a variety of clients accessing the data, it makes things unpleasant.
Option 4(?): Since most of these records are actually completed with no need to add more data, I can also imagine some kind of hybrid system. The base table for the parent contract would have physical columns for the labor costs, material costs, or whatever. When a contract is marked as Closed (or some other status indicating no more data needs to be entered), those physical columns would be filled in (otherwise they're NULL). The view which is accessible to the clients could then decide, based upon the status (or just an ISNULL situation), whether to directly return the data from the physical columns, or whether it needs to calculate it on the fly. I'm not sure how the performance would be with this, but it might be worth a look. This would mean that the roll-up numbers only need to be calculated for a few thousand rows at most - everything else would come from the physical fields.
So, what is the right way to do this? Am I missing other possibilities?
Try using an Indexed View. This "materializes" the view. Creating a clustered index on the view will allow your queries to go directly to the index rather than all of the underlying tables/queries that make up the view.
I think the view is probably the right answer, but the way you have the query written with correlated subqueries in the SELECT list may be what causes the performance degradation when the rows increase. If you write everything out as joins with GROUP BY, it might allow the query optimizer to simplify the plan for the view at execution time and give you better performance.
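A sketch of that rewrite for the labor piece, using the question's tables (one aggregate per derived table, since joining several child tables at once would multiply the sums):
SELECT c.contract_code,
       c.caption,
       c.description,
       lc.labor_cost
FROM contracts c
LEFT JOIN (
    -- pre-aggregate labor per contract to avoid fan-out
    SELECT t.contract_code, SUM(t.hours * l.rate_hourly) AS labor_cost
    FROM timesheets t
    JOIN labor l ON l.hr_code = t.hr_code
    GROUP BY t.contract_code
) lc ON lc.contract_code = c.contract_code;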
Have you also looked into Indexed Views? There are a lot of restrictions to creating indexed views so they may not be a viable option for you, but it's something to consider. Essentially an indexed view is a sort of denormalization. It would allow SQL Server to keep the aggregations updated for you automatically as the underlying data in the tables changes. It may of course degrade performance for inserts, updates and deletes, but it's something to consider if the performance of the aggregations is critical.
To get the best read performance in this case, indexed views are the way to go.
CREATE VIEW dbo.labor_costs
WITH SCHEMABINDING
AS
SELECT t.contract_code,
       t.hr_code,
       SUM(t.hours * l.rate_hourly) AS labor_cost,
       COUNT_BIG(*) AS row_count -- indexed views with GROUP BY require COUNT_BIG(*)
FROM dbo.timesheets t
INNER JOIN dbo.labor l
    ON l.hr_code = t.hr_code
GROUP BY t.contract_code, t.hr_code
GO
CREATE UNIQUE CLUSTERED INDEX UX_LaborCosts
ON dbo.labor_costs (contract_code, hr_code)
Once you have the indexed view you can left join to it. For example:
SELECT
c.contract_code,
c.caption,
c.description,
lb.labor_cost
FROM
dbo.contracts c
LEFT JOIN dbo.labor_costs lb WITH (NOEXPAND)
on c.contract_code = lb.contract_code AND c.hr_code = lb.hr_code

Can select * usage ever be justified?

I've always preached to my developers that SELECT * is evil and should be avoided like the plague.
Are there any cases where it can be justified?
I'm not talking about COUNT(*) - which most optimizers can figure out.
Edit
I'm talking about production code.
And one great example I saw of this bad practice was a legacy ASP application that used SELECT * in a stored procedure and used ADO to loop through the returned records, but got the columns by index. You can imagine what happened when a new field was added somewhere other than the end of the field list.
I'm quite happy using * in audit triggers.
In that case it can actually prove a benefit, because it will ensure that if additional columns are added to the base table, the trigger will raise an error, so dealing with the change in the audit trigger and/or audit table structure cannot be forgotten.
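A minimal sketch of that pattern (SQL Server flavor; the table names are hypothetical, and dbo.foo_audit is assumed to have foo's columns plus a trailing timestamp):
CREATE TRIGGER trg_foo_audit ON dbo.foo
AFTER UPDATE
AS
-- SELECT * makes the column counts significant: if dbo.foo gains a column
-- that dbo.foo_audit lacks, this INSERT fails loudly, which is the point
INSERT INTO dbo.foo_audit
SELECT *, SYSDATETIME()
FROM deleted;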
(Like dotjoe) I am also happy using it in derived tables and common table expressions. Though I habitually do it the other way round.
WITH t
AS (SELECT *,
ROW_NUMBER() OVER (ORDER BY a) AS RN
FROM foo)
SELECT a,
b,
c,
RN
FROM t;
I'm mostly familiar with SQL Server and there at least the optimiser has no problem recognising that only columns a,b,c will be required and the use of * in the inner table expression does not cause any unnecessary overhead retrieving and discarding unneeded columns.
In principle, SELECT * ought to be fine within a view as well; it is the final SELECT from the view where it ought to be avoided. However, in SQL Server this can cause problems, because SQL Server stores column metadata for views which is not automatically updated when the underlying tables change, and the use of * can lead to confusing and incorrect results unless sp_refreshview is run to update this metadata.
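A minimal sketch of that failure mode and the fix (SQL Server; throwaway names):
CREATE TABLE dbo.t (a int, b int);
GO
CREATE VIEW dbo.v AS SELECT * FROM dbo.t;
GO
ALTER TABLE dbo.t ADD c int;
GO
SELECT * FROM dbo.v;          -- still returns only a and b
EXEC sp_refreshview 'dbo.v';  -- rebuild the view's column metadata
SELECT * FROM dbo.v;          -- now returns a, b and c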
There are many scenarios where SELECT * is the optimal solution. Running ad-hoc queries in Management Studio just to get a sense of the data you're working with. Querying tables where you don't know the column names yet because it's the first time you've worked with a new schema. Building disposable quick'n'dirty tools to do a one-time migration or data export.
I'd agree that in "proper" development, you should avoid it - but there's lots of scenarios where "proper" development isn't necessarily the optimum solution to a business problem. Rules and best practices are great, as long as you know when to break them. :)
I'll use it in production when working with CTEs. But, in this case it's not really select *, because I already specified the columns in the CTE. I just don't want to respecify in the final select.
with t as (
select a, b, c from foo
)
select t.* from t;
None that I can think of, if you are talking about live code.
People saying that it makes adding columns easier to develop (so they automatically get returned and can be used without changing the Stored procedure) have no idea about writing optimal code/sql.
I only ever use it when writing ad-hoc queries that will not get reused (finding out the structure of a table, getting some data when I am not sure what the column names are).
I think using select * in an exists clause is appropriate:
select some_field from some_table
where exists
(select * from related_table [join condition...])
Some people like to use select 1 in this case, but it's not elegant, and it doesn't buy any performance improvements (early optimization strikes again).
In production code, I'd tend to agree 100% with you.
However, I think that the * more than justifies its existence when performing ad-hoc queries.
You've gotten a number of answers to your question, but you seem to be dismissing everything that isn't parroting back what you want to hear. Still, here it is for the third (so far) time: sometimes there is no bottleneck. Sometimes performance is way better than fine. Sometimes the tables are in flux, and amending every SELECT query is just one more bit of possible inconsistency to manage. Sometimes you've got to deliver on an impossible schedule and this is the last thing you need to think about.
If you live in bullet time, sure, type in all the column names. But why stop there? Re-write your app in a schema-less dbms. Hell, write your own dbms in assembly. That'd really show 'em.
And remember if you use select * and you have a join at least one field will be sent twice (the join field). This wastes database resources and network resources for no reason.
As a tool I use it to quickly refresh my memory as to what I can possibly get back from a query. As a production level query itself .. no way.
When creating an application that deals directly with the database, like phpMyAdmin, and you are on a page that displays a full table, using SELECT * can be justified, I guess.
About the only thing that I can think of would be when developing a utility or SQL tool application that is being written to run against any database. Even here though, I would tend to query the system tables to get the table structure and then build any necessary query from that.
There was one recent place where my team used SELECT * and I think that it was ok... we have a database that exists as a facade against another database (call it DB_Data), so it is primarily made up of views against the tables in the other database. When we generate the views we actually generate the column lists, but there is one set of views in the DB_Data database that are automatically generated as rows are added to a generic look-up table (this design was in place before I got here). We wrote a DDL trigger so that when a view is created in DB_Data by this process then another view is automatically created in the facade. Since the view is always generated to exactly match the view in DB_Data and is always refreshed and kept in sync, we just used SELECT * for simplicity.
I wouldn't be surprised if most developers went their entire career without having a legitimate use for SELECT * in production code though.
I've used select * to query tables optimized for reading (denormalized, flat data). Very advantageous since the purpose of the tables were simply to support various views in the application.
How else do the developers of phpmyadmin ensure they are displaying all the fields of your DB tables?
It is conceivable you'd want to design your DB and application so that you can add a column to a table without needing to rewrite your application. If your application at least checks column names it can safely use SELECT * and treat additional columns with some appropriate default action. Sure the app could consult system catalogs (or app-specific catalogs) for column information, but in some circumstances SELECT * is syntactic sugar for doing that.
There are obvious risks to this, however, and adding the required logic to the app to make it reliable could well simply mean replicating the DB's query checks in a less suitable medium. I am not going to speculate on how the costs and benefits trade off in real life.
In practice, I stick to SELECT * for 3 cases (some mentioned in other answers):
As an ad-hoc query, entered in a SQL GUI or command line.
As the contents of an EXISTS predicate.
In an application that dealt with generic tables without needing to know what they mean (e.g. a dumper, or differ).
Yes, but only in situations where the intention is to actually get all the columns from a table, not because you want all the columns that the table currently has.
For example, in one system that I worked on we had UDFs (User Defined Fields) where the user could pick the fields they wanted on the report, the order as well as filtering. When building a result set it made more sense to simply "select *" from the temporary tables that I was building instead of having to keep track of which columns were active.
I have several times needed to display data from a table whose column names were unknown. So I did SELECT * and got the column names at run time.
I was handed a legacy app where a table had 200 columns and a view had 300. The risk exposure from SELECT * would have been no worse than from listing all 300 columns explicitly.
Depends on the context of the production software.
If you are writing a simple data access layer for a table management tool where the user will be selecting tables and viewing results in a grid, then it would seem SELECT * is fine.
In other words, if you choose to handle "selection of fields" through some other means (as in automatic or user-specified filters after retrieving the resultset) then it seems just fine.
If on the other hand we are talking about some sort of enterprise software with business rules, a defined schema, etc., then I agree that SELECT * is a bad idea.
EDIT: Oh, and when the source is a stored procedure, trigger, or view, SELECT * should be fine because you're managing the resultset through other means (the view's definition or the stored proc's resultset).
Select * in production code is justifiable any time that:
it isn't a performance bottleneck
development time is critical
Why would I want the overhead of going back and having to worry about changing the relevant stored procedures, every time I add a field to the table?
Why would I even want to have to think about whether or not I've selected the right fields, when the vast majority of the time I want most of them anyway, and the vast majority of the few times I don't, something else is the bottleneck?
If I have a specific performance issue then I'll go back and fix that. Otherwise in my environment, it's just premature (and expensive) optimisation that I can do without.
Edit.. following the discussion, I guess I'd add to this:
... and where people haven't done other undesirable things like tried to access columns(i), which could break in other situations anyway :)
I know I'm very late to the party but I'll chip in that I use select * whenever I know that I'll always want all columns regardless of the column names. This may be a rather fringe case but in data warehousing, I might want to stage an entire table from a 3rd party app. My standard process for this is to drop the staging table and run
select *
into staging.aTable
from remotedb.dbo.aTable
Yes, if the schema on the remote table changes, downstream dependencies may throw errors but that's going to happen regardless.
If you want to find all the columns and their order, you can do the following (at least if you use MySQL):
SHOW COLUMNS FROM mytable FROM mydb; (1)
You can see all the relevant information about all your fields. You can prevent problems with types, and you can know all the column names for sure. This command is very quick, because you just ask for the structure of the table. From the results you will select all the names and build a string like this:
"select " + fieldNames[0] + ", fieldNames[1]" + ", fieldNames[2] from mytable". (2)
If you don't want to run two separate MySQL commands because a MySQL command is expensive, you can include (1) and (2) into a stored procedure which will have the results as an OUT parameter, that way you will just call a stored procedure and every command and data generation will happen at the database server.
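A sketch of that stored-procedure idea, using INFORMATION_SCHEMA and a prepared statement (MySQL; the procedure name is made up, and p_table must come from a trusted source since it is concatenated into SQL):
DELIMITER //
CREATE PROCEDURE select_all_columns(IN p_table VARCHAR(64))
BEGIN
    DECLARE col_list TEXT;

    -- step (1): fetch the column names in order
    -- (note: group_concat_max_len may need raising for very wide tables)
    SELECT GROUP_CONCAT(COLUMN_NAME ORDER BY ORDINAL_POSITION)
      INTO col_list
      FROM INFORMATION_SCHEMA.COLUMNS
     WHERE TABLE_SCHEMA = DATABASE()
       AND TABLE_NAME = p_table;

    -- step (2): build and run the SELECT
    SET @sql = CONCAT('SELECT ', col_list, ' FROM ', p_table);
    PREPARE stmt FROM @sql;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
END//
DELIMITER ;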

A SQL story in 2 parts - Are SQL views always good and how can I solve this example?

I'm building a reporting app, and so I'm crunching an awful lot of data. Part of my approach to creating the app in an agile way is to use SQL views to take the strain off the DB if multiple users are all bashing away.
One example is:
mysql_query("CREATE VIEW view_silverpop_clicks_baby_$email AS SELECT view_email_baby_position.EmailAddress, view_email_baby_position.days, silverpop_campaign_emails.id, silverpop_actions.`Click Name` , silverpop_actions.`Mailing Id`
FROM silverpop_actions
INNER JOIN view_email_baby_position ON (silverpop_actions.Email = view_email_baby_position.EmailAddress ) , silverpop_campaign_emails
WHERE silverpop_campaign_emails.id = $email
AND view_email_baby_position.days
BETWEEN silverpop_campaign_emails.low
AND silverpop_campaign_emails.high
AND silverpop_actions.`Event Type` = 'Click Through'") or die(mysql_error());
And then later in the script this view is used to calculate the number of clicks a particular flavour of this email has had.
$sql = "SELECT count(*) as count FROM `view_silverpop_clicks_baby_$email` WHERE `Click Name` LIKE '$countme%'";
My question is in 2 parts really:
Are views always good? Can you have too many?
Could I create yet another set of views to cache the count variable in the second snippet of code? If so, how could I approach this? I can't quite make this out yet.
Thanks!
To answer your questions.
1.) I don't know that I can think of an instance where views are BAD in and of themselves, but it would be bad to use them unnecessarily. Whether you can have too many really depends on your situation.
2.) Having another set of views will not cache the count variable so it wouldn't be beneficial from that standpoint.
Having said that, I think you have a misunderstanding on what a view actually does. A view is just a definition of a particular SQL statement and it does not cache data. When you execute a SELECT * FROM myView;, the database is still executing the select statement defined in the CREATE VIEW definition just as it would if a user was executing that statement.
Some database vendors offer a different kind of view called a materialized view. In this case the table data needed to create the view is stored/cached and is usually updated based on a refresh rate specified when you create it. This is "heavy" in the sense that your data is stored twice, but can create better execution plans because the data is already joined, aggregated, etc. Note though, you only see the data based on the last refresh of the materialized view, where with a normal view you see the data as it currently exists in the underlying tables. Currently, MySQL does not support materialized views.
Some useful uses of views are to:
Create easier/cleaner SQL statements for complex queries (which is something you are doing)
Security. If you have tables where you want a user to be able to see some columns or rows, but not other columns/rows, you restrict access to the base table and create a view of the base table that only selects the columns/rows that the user should have access to (see the sketch after this list).
Create aggregations of tables
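For instance, a sketch of the security case with made-up names:
-- the view exposes only non-sensitive columns and rows
CREATE VIEW staff_directory AS
SELECT id, name, department   -- salary deliberately left out
FROM employees
WHERE is_active = 1;

-- users get the view, not the base table
GRANT SELECT ON staff_directory TO report_user;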
Views are used by the query optimizer, so they often help in querying for information more efficiently.
Indexed or materialized views, however, create a table with the required information, which can make quite a difference. Think of it as denormalization of your db schema without changing the existing schema. You get the best of both worlds.
Some views are never used, so they represent needless complexity - which is bad.
Indexed views cannot reference other views (mssql), so there's hardly a point in creating such a view.

MySQL Views - When to use & when not to

The MySQL certification guide suggests that views can be used for:
creating a summary that may involve calculations
selecting a set of rows with a WHERE clause, hiding irrelevant information
result of a join or union
allowing changes to the base table via a view that preserves the schema of the original table, to accommodate other applications
But from "how to implement search for 2 different table data?":
And maybe you're right that it doesn't work since mysql views are not good friends with indexing. But still. Is there anything to search for in the shops table?
I've learned that views don't work well with indexing, so will it be a big performance hit for the convenience they may provide?
A view can be simply thought of as a SQL query stored permanently on the server. Whatever indices the underlying query is optimized to use will be used. In that sense, there is no difference between the SQL query and a view. It does not affect performance any more negatively than the actual SQL query would. If anything, since it is stored on the server and does not need to be transmitted and parsed at run time, it is actually faster.
It does afford you these additional advantages
reusability
a single source for optimization
This mysql-forum-thread about indexing views gives a lot of insight into what mysql views actually are.
Some key points:
A view is really nothing more than a stored select statement
The data of a view is the data of tables referenced by the View.
creating an index on a view will not work as of the current version
If the MERGE algorithm is used, then the indexes of the underlying tables will be used.
The underlying indices are not visible, however. DESCRIBE on a view will show no indexed columns.
MySQL views, according to the official MySQL documentation, are stored queries that when invoked produce a result set.
A database view is nothing but a virtual table or logical table (commonly consisting of a SELECT query with joins). Because a database view is similar to a database table, consisting of rows and columns, you can query data against it.
Views should be used when:
Simplifying complex queries (like IF ELSE and JOIN or working with triggers and such)
Putting an extra layer of security to limit or restrict data access (since views are merely virtual tables, they can be set read-only for a specific set of DB users, restricting INSERT)
Backward compatibility and query reusability
Working with computed columns. Computed columns should NOT be stored in DB tables, because keeping them in the table schema would be bad design.
Views should not be used when:
the associated table(s) are tentative or subject to frequent structural change.
According to http://www.mysqltutorial.org/introduction-sql-views.aspx
A database table should not have calculated columns; however, a database view should.
I tend to use a view when I need to calculate totals, counts etc.
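For example, a sketch of that computed-column use, with made-up names:
CREATE VIEW order_totals AS
SELECT order_id,
       quantity,
       unit_price,
       quantity * unit_price AS line_total -- computed in the view, not stored in the table
FROM order_items;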
Hope that helps!
One more downside of views is that they don't work well with MySQL replication, causing the slave to fall a bit behind the master.
http://bugs.mysql.com/bug.php?id=30998

How do Views work in a DBMS?

Say that I have two tables like those:
Employers (id, name, .... , deptId).
Depts(id, deptName, ...).
But that data is not going to be modified very often, and I want a query like this
SELECT name, deptName FROM Employers, Depts
WHERE deptId = Depts.id AND Employers.id="ID"
to be as fast as it can.
Two possible solutions come to my head:
Denormalize the table:
Despite the fact that with this solution I will lose some of the great advantages of normalized databases, here the performance is a MUST.
Create a view for that denormalized data.
I will keep the data normalized and (here is my question) will the performance of a query over that view be faster than without that view?
Or, to ask the same question another way: is the view "interpreted" every time you make a query over it, or how do views work in a DBMS?
Generally, unless you "materialize" a view, which is an option in some software like MS SQL Server, the view is just translated into queries against the base tables, and is therefore no faster or slower than the original (minus the minuscule amount of time it takes to translate the query, which is nothing compared to actually executing the query).
How do you know you've got performance problems? Are you profiling it under load? Have you verified that the performance bottleneck is these two tables? Generally, until you've got hard data, don't assume you know where performance problems come from, and don't spend any time optimizing until you know you're optimizing the right thing - 80% of the performance issues come from 20% of the code.
If Depts.ID is the primary key of that table, and you index the Employers.DeptID field, then this query should remain very fast even over millions of records.
Denormalizing doesn't make sense to me in that scenario.
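Concretely, a sketch of the suggested index (assuming Depts.id is already the primary key):
CREATE INDEX ix_employers_deptid ON Employers (deptId);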
Generally speaking, performance of a view will be almost exactly the same as performance when running the query itself. The advantage of a view is simply to abstract that query away, so you don't have to think about it.
You could use a Materialized View (or "snapshot" as some say), but then your data is only going to be as recent as your last refresh.
In a comment to one of the replies, the author of the question explains that he is looking for a way to create a materialized view in MySQL.
MySQL does not wrap the concept of the materialized view in a nice package for you like other DBMSes, but it does have all the tools you need to create one.
What you need to do is this:
Create the initial materialization of the result of your query.
Create a trigger on insert into the employers table that inserts into the materialized table all rows that match the newly inserted employer.
Create a trigger on delete in the employers table that deletes the corresponding rows from the materialized table.
Create a trigger on update in the employers table that updates the corresponding rows in the materialized table.
Same for the departments table (a sketch of the Employers side follows below).
This may work ok if your underlying tables are not frequently updated; but you need to be aware of the added cost of create/update/delete operations once you do this.
Also you'll want to make sure some DBA who doesn't know about your trickery doesn't go migrating the database without migrating the triggers when the time comes. So document it well.
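A minimal sketch of steps 1 to 4 for the Employers side, using the tables from the question (MySQL; emp_dept_mv is a made-up name, and the Depts triggers are analogous and omitted):
-- step 1: the initial materialization
CREATE TABLE emp_dept_mv AS
SELECT e.id, e.name, d.deptName
FROM Employers e
JOIN Depts d ON e.deptId = d.id;
ALTER TABLE emp_dept_mv ADD PRIMARY KEY (id);

DELIMITER //
-- step 2: keep inserts in sync
CREATE TRIGGER employers_ai AFTER INSERT ON Employers
FOR EACH ROW
    INSERT INTO emp_dept_mv (id, name, deptName)
    SELECT NEW.id, NEW.name, d.deptName
    FROM Depts d
    WHERE d.id = NEW.deptId//

-- step 3: keep deletes in sync
CREATE TRIGGER employers_ad AFTER DELETE ON Employers
FOR EACH ROW
    DELETE FROM emp_dept_mv WHERE id = OLD.id//

-- step 4: treat an update as delete + insert
CREATE TRIGGER employers_au AFTER UPDATE ON Employers
FOR EACH ROW
BEGIN
    DELETE FROM emp_dept_mv WHERE id = OLD.id;
    INSERT INTO emp_dept_mv (id, name, deptName)
    SELECT NEW.id, NEW.name, d.deptName
    FROM Depts d
    WHERE d.id = NEW.deptId;
END//
DELIMITER ;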
Sounds like premature optimisation unless you know it is a clear and present problem.
MySQL does not materialise views; they are no faster than queries against the base tables. Moreover, in some cases they are slower, as they get optimised less well.
But views also "hide" stuff from developers maintaining the code in the future, making them imagine that the query is less complex than it actually is.