Rolling up values from multiple child tables - sql

What is the best way to roll up values from a series of child tables into a parent table in SQL Server?
Let's say we have a contracts table. This table has a series of child tables, such as contract_timesheets, contract_materials, contract_other_expenses - etc. What is the best way to pull costs / hours / etc out of those child tables and make them easily accessible in the parent table?
Option 1: My first thought would be to simply use a view. An example might be something like this:
SELECT
contract_code,
caption,
description,
(
SELECT SUM(t.hours * l.rate_hourly)
FROM timesheets t
JOIN labor l ON t.hr_code = l.hr_code AND t.contract_code = c.contract_code
) AS labor_cost,
(
SELECT ...
) AS material_cost,
...
FROM contracts c
So we'll have a view that might have a dozen or more subqueries like that, many of which will themselves need joins to pull in all of the info we need.
This works completely fine until we have hundreds of thousands of rows; then things start to get noticeably slow. It's still workable, but if the row count gets much higher, or the server takes on much more other workload, I'm concerned this approach won't hold up.
Is there a more efficient way to structure such a view?
Option 2: The other obvious solution is to roll those numbers up into physical fields in the parent table. The big issue with that is just maintaining the numbers when the data might be accessed from a variety of clients. Maybe it's a report, maybe it's a form, maybe it's some integration service. So trying to use some premade roll-up SQL file that gets run as an event in the front-end prior to displaying the report / chart / whatever isn't an ideal solution.
To ensure that the roll-up numbers stay in sync, we could attach a series of triggers to all of the child tables (and possibly relatives of those, if the numbers in the child tables rely on something else). Every time the source numbers get updated, we roll them up into the parent. This seems like a lot of trouble, but if the triggers are written correctly, I suppose this would work fine.
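For illustration, a minimal sketch of what one such trigger might look like, assuming a physical labor_cost column on contracts and the timesheets/labor tables from the view above (the trigger and column names are hypothetical):

CREATE TRIGGER trg_timesheets_rollup ON dbo.timesheets
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- recompute the roll-up only for the contracts touched by this statement
    UPDATE c
    SET labor_cost = (SELECT SUM(t.hours * l.rate_hourly)
                      FROM dbo.timesheets t
                      JOIN dbo.labor l ON l.hr_code = t.hr_code
                      WHERE t.contract_code = c.contract_code)
    FROM dbo.contracts c
    WHERE c.contract_code IN (SELECT contract_code FROM inserted
                              UNION
                              SELECT contract_code FROM deleted);
END

A similar trigger would be needed on each of the other child tables (materials, other expenses, and so on).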
Option 3: Do everything in the UI. This is also an option, but with a variety of clients accessing the data, it makes things unpleasant.
Option 4(?): Since most of these records are actually completed with no need to add more data, I can also imagine some kind of hybrid system. The base table for the parent contract would have physical columns for the labor costs, material costs, or whatever. When a contract is marked as Closed (or some other status indicating no more data needs to be entered), those physical columns would be filled in (otherwise they're NULL). The view which is accessible to the clients could then decide, based upon the status (or just an ISNULL situation), whether to directly return the data from the physical columns, or whether it needs to calculate it on the fly. I'm not sure how the performance would be with this, but it might be worth a look. This would mean that the roll-up numbers only need to be calculated for a few thousand rows at most - everything else would come from the physical fields.
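As a rough sketch of what that hybrid view might look like (the labor_cost_closed column is hypothetical; the other names follow the tables used above):

CREATE VIEW dbo.contract_costs
AS
SELECT c.contract_code,
       c.caption,
       c.description,
       -- stored value for closed contracts, computed on the fly otherwise
       ISNULL(c.labor_cost_closed,
              (SELECT SUM(t.hours * l.rate_hourly)
               FROM dbo.timesheets t
               JOIN dbo.labor l ON l.hr_code = t.hr_code
               WHERE t.contract_code = c.contract_code)) AS labor_cost
FROM dbo.contracts c;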
So, what is the right way to do this? Am I missing other possibilities?

Try using an Indexed View. This "materializes" the view. Creating a clustered index on the view will allow your queries to go directly to the index rather than all of the underlying tables/queries that make up the view.

I think the view is probably the right answer, but the way you have the query written with correlated subqueries in the SELECT list may be what causes the performance degradation when the rows increase. If you write everything out as joins with GROUP BY, it might allow the query optimizer to simplify the plan for the view at execution time and give you better performance.
Have you also looked into Indexed Views? There are a lot of restrictions to creating indexed views so they may not be a viable option for you, but it's something to consider. Essentially an indexed view is a sort of denormalization. It would allow SQL Server to keep the aggregations updated for you automatically as the underlying data in the tables changes. It may of course degrade performance for inserts, updates and deletes, but it's something to consider if the performance of the aggregations is critical.

To get the best read performance in this case, indexed views are the way to go.
CREATE VIEW dbo.labor_costs
WITH SCHEMABINDING
AS
SELECT t.contract_code,
       SUM(t.hours * l.rate_hourly) AS labor_cost,  -- hours and rate_hourly must be NOT NULL for SUM in an indexed view
       COUNT_BIG(*) AS row_count                    -- required when an indexed view uses GROUP BY
FROM dbo.timesheets t
JOIN dbo.labor l ON l.hr_code = t.hr_code
GROUP BY t.contract_code
GO
CREATE UNIQUE CLUSTERED INDEX UX_labor_costs
ON dbo.labor_costs (contract_code)
Once you have the indexed view you can left join to it. For example:
SELECT
c.contract_code,
c.caption,
c.description,
lb.labor_cost
FROM
dbo.contracts c
LEFT JOIN dbo.labor_costs lb WITH (NOEXPAND)
ON c.contract_code = lb.contract_code

Related

Pre-Staging Data Solution

I have been tasked with replacing a costly stored procedure which performs calculations across 10 - 15 tables, some of which contain many millions of rows. The plan is to pre-stage the many computations and store the results in separate tables for speeding reading.
Having quickly created these new tables and inserted all of the necessary pre-staged data as a test case, the execution time of getting the same results is vastly improved, as you would expect.
My question is, what is the best practice for keeping these new separate tables up to date?
A procedure which runs at a specific interval could do it, but there is a requirement for the data to be live.
A trigger on each table could do it, but that seems very costly, and could cause slow-downs everywhere else that uses these tables.
Are there other alternatives?
Have you considered Indexed Views for this? As long as you meet the criteria for creating Indexed Views (no self joins etc) it may well be a good solution.
The downsides of Indexed Views are that when the data in the underlying tables is changed (delete, update, insert), the indexed view has to be recalculated. This can slow down these types of operations in certain circumstances, so you have to be careful. I've put some links to documentation below:
https://www.brentozar.com/archive/2013/11/what-you-can-and-cant-do-with-indexed-views/
https://msdn.microsoft.com/en-GB/library/ms191432.aspx
https://technet.microsoft.com/en-GB/library/ms187864(v=sql.105).aspx
what is the best practice for keeping these new separate tables up to date?
The answer is: it depends. Depends on what?
1. How frequently you will use those computed values
2. What the acceptable data latency is
We have the same kind of reporting, where we store computed values in separate tables and use them in reports. In our case we run the stored procedures that refresh them just before sending the reports out through SQL Server Agent.
Consider using an A/B table solution. Place a generic view over the _A version (CREATE VIEW MY_TABLE AS SELECT * FROM MY_TABLE_A), rebuild the _B version, and then switch the view to point at it (ALTER VIEW MY_TABLE AS SELECT * FROM MY_TABLE_B). It takes twice as much space for processing, but it gives you the opportunity to rebuild your tables without downtime.
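A minimal sketch of the swap, with all names illustrative (the INSERT stands in for whatever costly calculation populates the pre-staged table):

-- rebuild the inactive copy
TRUNCATE TABLE dbo.MY_TABLE_B;
INSERT INTO dbo.MY_TABLE_B (key_col, computed_value)
SELECT key_col, SUM(some_value)
FROM dbo.source_table
GROUP BY key_col;
GO
-- repoint the view; readers switch from _A to _B in one step
ALTER VIEW dbo.MY_TABLE AS SELECT * FROM dbo.MY_TABLE_B;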

What is wrong with using SELECT * FROM sometable [duplicate]

I've heard that SELECT * is generally bad practice to use when writing SQL commands because it is more efficient to SELECT columns you specifically need.
If I need to SELECT every column in a table, should I use
SELECT * FROM TABLE
or
SELECT column1, colum2, column3, etc. FROM TABLE
Does the efficiency really matter in this case? I'd think SELECT * would be more optimal internally if you really need all of the data, but I'm saying this with no real understanding of databases.
I'm curious to know what the best practice is in this case.
UPDATE: I probably should specify that the only situation where I would really want to do a SELECT * is when I'm selecting data from one table where I know all columns will always need to be retrieved, even when new columns are added.
Given the responses I've seen, however, this still seems like a bad idea, and SELECT * should never be used for a lot more technical reasons than I ever thought about.
One reason that selecting specific columns is better is that it raises the probability that SQL Server can access the data from indexes rather than querying the table data.
Here's a post I wrote about it: The real reason select queries are bad index coverage
It's also less fragile to change, since any code that consumes the data will be getting the same data structure regardless of changes you make to the table schema in the future.
Given your specification that you are selecting all columns, there is little difference at this time. Realize, however, that database schemas do change. If you use SELECT * you are going to get any new columns added to the table, even though in all likelihood, your code is not prepared to use or present that new data. This means that you are exposing your system to unexpected performance and functionality changes.
You may be willing to dismiss this as a minor cost, but realize that columns that you don't need still must be:
Read from database
Sent across the network
Marshalled into your process
(for ADO-type technologies) Saved in a data-table in-memory
Ignored and discarded / garbage-collected
Item #1 has many hidden costs including eliminating some potential covering index, causing data-page loads (and server cache thrashing), incurring row / page / table locks that might be otherwise avoided.
Balance this against the potential savings of specifying the columns versus an * and the only potential savings are:
Programmer doesn't need to revisit the SQL to add columns
The network-transport of the SQL is smaller / faster
SQL Server query parse / validation time
SQL Server query plan cache
For item 1, the reality is that you're going to add / change code to use any new column you might add anyway, so it is a wash.
For item 2, the difference is rarely enough to push you into a different packet-size or number of network packets. If you get to the point where SQL statement transmission time is the predominant issue, you probably need to reduce the rate of statements first.
For item 3, there is NO savings as the expansion of the * has to happen anyway, which means consulting the table(s) schema anyway. Realistically, listing the columns will incur the same cost because they have to be validated against the schema. In other words this is a complete wash.
For item 4, when you specify specific columns, your query plan cache could get larger but only if you are dealing with different sets of columns (which is not what you've specified). In this case, you do want different cache entries because you want different plans as needed.
So, this all comes down, because of the way you specified the question, to the issue of resiliency in the face of eventual schema modifications. If you're burning this schema into ROM (it happens), then an * is perfectly acceptable.
However, my general guideline is that you should only select the columns you need, which means that sometimes it will look like you are asking for all of them, but DBAs and schema evolution mean that some new columns might appear that could greatly affect the query.
My advice is that you should ALWAYS SELECT specific columns. Remember that you get good at what you do over and over, so just get in the habit of doing it right.
If you are wondering why a schema might change without code changing, think in terms of audit logging, effective/expiration dates and other similar things that get added by DBAs systematically for compliance issues. Another source of underhanded changes is denormalizations for performance elsewhere in the system, or user-defined fields.
You should only select the columns that you need. Even if you need all columns, it's still better to list the column names so that SQL Server does not have to query the system tables for the column list.
Also, your application might break if someone adds columns to the table. Your program will get columns it didn't expect, and it might not know how to process them.
Apart from this, if the table has a binary column then the query will be much slower and use more network resources.
There are four big reasons that select * is a bad thing:
The most significant practical reason is that it forces the user to magically know the order in which columns will be returned. It's better to be explicit, which also protects you against the table changing, which segues nicely into...
If a column name you're using changes, it's better to catch it early (at the point of the SQL call) rather than when you're trying to use the column that no longer exists (or has had its name changed, etc.)
Listing the column names makes your code far more self-documented, and so probably more readable.
If you're transferring over a network (or even if you aren't), columns you don't need are just waste.
Specifying the column list is usually the best option because your application won't be affected if someone adds/inserts a column to the table.
Specifying column names is definitely faster - for the server. But if
performance is not a big issue (for example, this is a website content database with hundreds, maybe thousands - but not millions - of rows in each table); AND
your job is to create many small, similar applications (e.g. public-facing content-managed websites) using a common framework, rather than creating a complex one-off application; AND
flexibility is important (lots of customization of the db schema for each site);
then you're better off sticking with SELECT *. In our framework, heavy use of SELECT * allows us to introduce a new website managed content field to a table, giving it all of the benefits of the CMS (versioning, workflow/approvals, etc.), while only touching the code at a couple of points, instead of a couple dozen points.
I know the DB gurus are going to hate me for this - go ahead, vote me down - but in my world, developer time is scarce and CPU cycles are abundant, so I adjust accordingly what I conserve and what I waste.
SELECT * is a bad practice even if the query is not sent over a network.
Selecting more data than you need makes the query less efficient - the server has to read and transfer extra data, so it takes time and creates unnecessary load on the system (not only the network, as others mentioned, but also disk, CPU etc.). Additionally, the server is unable to optimize the query as well as it might (for example, use covering index for the query).
After some time your table structure might change, so SELECT * will return a different set of columns. So, your application might get a dataset of unexpected structure and break somewhere downstream. Explicitly stating the columns guarantees that you either get a dataset of known structure, or get a clear error on the database level (like 'column not found').
Of course, all this doesn't matter much for a small and simple system.
Lots of good reasons answered here so far, here's another one that hasn't been mentioned.
Explicitly naming the columns will help you with maintenance down the road. At some point you're going to be making changes or troubleshooting, and find yourself asking "where the heck is that column used".
If you've got the names listed explicitly, then finding every reference to that column -- through all your stored procedures, views, etc -- is simple. Just dump a CREATE script for your DB schema, and text search through it.
Performance wise, SELECT with specific columns can be faster (no need to read in all the data). If your query really does use ALL the columns, SELECT with explicit parameters is still preferred. Any speed difference will be basically unnoticeable and near constant-time. One day your schema will change, and this is good insurance to prevent problems due to this.
Definitely define the columns, because SQL Server will not have to do a lookup on the columns to pull them. If you define the columns, then SQL can skip that step.
It's always better to specify the columns you need, if you think about it one time, SQL doesn't have to think "wtf is *" every time you query. On top of that, someone later may add columns to the table that you actually do not need in your query and you'll be better off in that case by specifying all of your columns.
The problem with "select *" is the possibility of bringing data you don't really need. During the actual database query, the selected columns don't really add to the computation. What's really "heavy" is the data transport back to your client, and any column that you don't really need is just wasting network bandwidth and adding to the time you're waiting for you query to return.
Even if you do use all the columns brought from a "select *...", that's just for now. If in the future you change the table/view layout and add more columns, you'll start bring those in your selects even if you don't need them.
Another point in which a "select *" statement is bad is on view creation. If you create a view using "select *" and later add columns to your table, the view definition and the data returned won't match, and you'll need to recompile your views in order for them to work again.
I know that writing a "select *" is tempting, 'cause I really don't like to manually specify all the fields on my queries, but when your system start to evolve, you'll see that it's worth to spend this extra time/effort in specifying the fields rather than spending much more time and effort removing bugs on your views or optimizing your app.
While explicitly listing columns is good for performance, don't get crazy.
So if you use all the data, try SELECT * for simplicity (imagine having many columns and doing a JOIN... query may get awful). Then - measure. Compare with query with column names listed explicitly.
Don't speculate about performance, measure it!
Explicit listing helps most when you have some column containing big data (like body of a post or article), and don't need it in given query. Then by not returning it in your answer DB server can save time, bandwidth, and disk throughput. Your query result will also be smaller, which is good for any query cache.
You should really be selecting only the fields you need, and only the required number, i.e.
SELECT Field1, Field2 FROM SomeTable WHERE --(constraints)
Outside of the database, dynamic queries run the risk of injection attacks and malformed data. Typically you get round this using stored procedures or parameterised queries. Also (although not really that much of a problem) the server has to generate an execution plan each time a dynamic query is executed.
Using explicit field names is NOT faster than *, if and only if you need to get the data for all fields.
Your client software shouldn't depend on the order of the fields returned, so that argument is a non-issue too.
And it's possible (though unlikely) that you need to get all fields using * because you don't yet know what fields exist (think very dynamic database structure).
Another disadvantage of using explicit field names is that if there are many of them and they're long then it makes reading the code and/or the query log more difficult.
So the rule should be: if you need all the fields, use *, if you need only a subset, name them explicitly.
The result is too huge. It is slow to generate and send the result from the SQL engine to the client.
The client side, being a generic programming environment, is not and should not be designed to filter and process the results (e.g. the WHERE clause, ORDER clause), as the number of rows can be huge (e.g. tens of millions of rows).
Naming each column you expect to get in your application also ensures your application won't break if someone alters the table, as long as your columns are still present (in any order).
Performance-wise I have seen comments that both are equal, but from a usability aspect there are some +'s and -'s.
When you use (select *) in a query and someone alters the table and adds new fields that the previous query does not need, that is unnecessary overhead. And what if the newly added field is a blob or an image field? Your query response time is going to be really slow then.
On the other hand, if you use (select col1, col2, ..) and the table gets altered and the new fields are needed in the result set, you always have to edit your select query after the alteration.
But I suggest always using select col1, col2, ... in your queries, and altering the query if the table is altered later.
This is an old post, but still valid. For reference, I have a very complicated query consisting of:
12 tables
6 Left joins
9 inner joins
108 total columns on all 12 tables
I only need 54 columns
A 4 column Order By clause
When I execute the query using Select *, it takes an average of 2869ms.
When I execute the query using an explicit column list, it takes an average of 1513ms.
Total rows returned is 13,949.
There is no doubt that selecting column names gives faster performance than Select *.
Select is equally efficient (in terms of speed) whether you use * or name the columns.
The difference is about memory, not speed. When you select several columns, SQL Server must allocate memory space to serve you the query, including all data for all the columns that you've requested, even if you're only using one of them.
What does matter in terms of performance is the execution plan, which in turn depends heavily on your WHERE clause and the number of JOINs, OUTER JOINs, etc.
For your question just use SELECT *. If you need all the columns there's no performance difference.
It depends on the version of your DB server, but modern versions of SQL can cache the plan either way. I'd say go with whatever is most maintainable with your data access code.
One reason it's better practice to spell out exactly which columns you want is because of possible future changes in the table structure.
If you are reading in data manually using an index based approach to populate a data structure with the results of your query, then in the future when you add/remove a column you will have headaches trying to figure out what went wrong.
As to what is faster, I'll defer to others for their expertise.
As with most problems, it depends on what you want to achieve. If you want to create a db grid that will allow all columns in any table, then "Select *" is the answer. However, if you will only need certain columns and adding or deleting columns from the query is done infrequently, then specify them individually.
It also depends on the amount of data you want to transfer from the server. If one of the columns is a defined as memo, graphic, blob, etc. and you don't need that column, you'd better not use "Select *" or you'll get a whole bunch of data you don't want and your performance could suffer.
To add on to what everyone else has said, if all of the columns that you are selecting are included in an index, your result set will be pulled from the index instead of looking up the additional data in the table.
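As a hypothetical example, a covering index lets SQL Server answer the query from the index alone (table and index names are illustrative):

CREATE INDEX IX_orders_customer
ON dbo.orders (customer_id)
INCLUDE (order_date, total);

-- every selected column lives in the index, so no lookup into the base table is needed
SELECT customer_id, order_date, total
FROM dbo.orders
WHERE customer_id = 42;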
SELECT * is necessary if one wants to obtain metadata such as the number of columns.
Gonna get slammed for this, but I do a select * because almost all my data is retrieved from SQL Server Views that precombine needed values from multiple tables into a single easy-to-access View.
I do then want all the columns from the view which won't change when new fields are added to underlying tables. This has the added benefit of allowing me to change where data comes from. FieldA in the View may at one time be calculated and then I may change it to be static. Either way the View supplies FieldA to me.
The beauty of this is that it allows my data layer to get datasets. It then passes them to my BL which can then create objects from them. My main app only knows and interacts with the objects. I even allow my objects to self-create when passed a datarow.
Of course, I'm the only developer, so that helps too :)
What everyone above said, plus:
If you're striving for readable maintainable code, doing something like:
SELECT foo, bar FROM widgets;
is instantly readable and shows intent. If you make that call you know what you're getting back. If widgets only has foo and bar columns, then selecting * means you still have to think about what you're getting back, confirm the order is mapped correctly, etc. However, if widgets has more columns but you're only interested in foo and bar, then your code gets messy when you query for a wildcard and then only use some of what's returned.
And remember if you have an inner join by definition you do not need all the columns as the data in the join columns is repeated.
It's not like listing columns in SQL Server is hard or even time-consuming. You just drag them over from the object browser (you can get all in one go by dragging from the word columns). To put a permanent performance hit on your system (because this can reduce the use of indexes and because sending unneeded data over the network is costly) and make it more likely that you will have unexpected problems as the database changes (sometimes columns get added that you do not want the user to see, for instance) just to save less than a minute of development time is short-sighted and unprofessional.
Absolutely define the columns you want to SELECT every time. There is no reason not to and the performance improvement is well worth it.
They should never have given the option to "SELECT *"
If you need every column then just use SELECT * but remember that the order could potentially change so when you are consuming the results access them by name and not by index.
I would ignore comments about how * needs to go get the list - chances are parsing and validating named columns is equal to the processing time if not more. Don't prematurely optimize ;-)

How to tell if a query will scale well?

What are some of the methods/techniques experienced SQL developers use to determine if a particular SQL query will scale well as load increases, rows in associated tables increase etc.
Some rules that I follow that make the most difference.
Don't use per-row functions in your queries like if, case, coalesce and so on. Work around them by putting data in the database in the format you're going to need it, even if that involves duplicate data.
For example, if you need to lookup surnames fast, store them in the entered form and in their lowercase form, and index the lowercase form. Then you don't have to worry about things like select * from tbl where lowercase(surname) = 'smith';
Yes, I know that breaks 3NF but you can still guarantee data integrity by judicious use of triggers or pre-computed columns. For example, an insert/update trigger on the table can force the lower_surname column to be set to the lowercase version of surname.
This moves the cost of conversion to the insert/update (which happens infrequently) and away from the select (which happens quite a lot more). You basically amortise the cost of conversion.
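In SQL Server, for instance, a persisted computed column (or the trigger mentioned above) can keep the lowercase copy in sync automatically - a sketch with hypothetical names:

ALTER TABLE dbo.people
    ADD lower_surname AS LOWER(surname) PERSISTED;

CREATE INDEX IX_people_lower_surname ON dbo.people (lower_surname);

-- the search can now use the index instead of applying a function to every row
SELECT * FROM dbo.people WHERE lower_surname = 'smith';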
Make sure that every column used in a where clause is indexed. Not necessarily on its own but at least as the primary part of a composite key.
Always start off in 3NF and only revert if you have performance problems (in production). 3NF is often the easiest to handle and reverting should only be done when absolutely necessary.
Profile, in production (or elsewhere, as long as you have production data and schemas). Database tuning is not a set-and-forget operation unless the data in your tables never changes (very rare). You should be monitoring, and possibly tuning, periodically to avoid the possibility that changing data will bring down performance.
Don't, unless absolutely necessary, allow naked queries to your database. Try to control what queries can be run. Your job as a DBA will be much harder if some manager can come along and just run:
select * from very_big_table order by column_without_index;
on your database.
If managers want to be able to run ad-hoc queries, give them a cloned DBMS (or replica) so that your real users (the ones that need performance) aren't affected.
Don't use union when union all will suffice. If you know that there can be no duplicates between two selects of a union, there's no point letting the DBMS try to remove them.
Similarly, don't use select distinct on a table if you're retrieving all the primary key columns (or all columns in a unique constraint). There is no possibility of duplicates in those cases so, again, you're asking the DBMS to do unnecessary work.
Example: we had a customer with a view using select distinct * on one of their tables. Querying the view took 50 seconds. When we replaced it with a view starting select *, the time came down to sub-second. Needless to say, I got a good bottle of red wine out of that :-)
Try to avoid select * as much as possible. In other words, only get the columns you need. This makes little difference when you're using MySQL on your local PC but, when you have an app in California querying a database in Inner Mongolia, you want to minimise the amount of traffic being sent across the wire as much as possible.
Don't make tables wide; keep them narrow, and keep the indexes narrow as well. Make sure that queries are fully covered by indexes and that those queries are SARGable.
Test with a ton of data before going in production, take a look at this: Your testbed has to have the same volume of data as on production in order to simulate normal usage
Pull up the execution plan and look for any of the following:
Table Scan
[Clustered] Index Scan
RID Lookup
Bookmark Lookup
Key Lookup
Nested Loops
Any of those things (in descending order from most to least scalable) mean that the database/query likely won't scale to much larger tables. An ideal query will have mostly index seeks, hash or merge joins, the occasional sort, and other low-impact operations (spools and so on).
The only way to prove that it will scale, as other answers have pointed out, is to test it on data of the desired size. The above is just a rule of thumb.
In addition (and along the same lines) to Robert's suggestion, consider the execution plan. Is it utilizing indexes? Are there any scans or such? Can you simplify the query in any way? For example, eliminate IN in favor of EXISTS, and only join to tables you need to join to.
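For example, a typical IN-to-EXISTS rewrite (table names are hypothetical; on recent versions the optimizer often produces the same plan for both, so verify against the actual execution plan):

-- original
SELECT o.order_id, o.total
FROM dbo.orders o
WHERE o.order_id IN (SELECT ol.order_id FROM dbo.order_lines ol WHERE ol.backordered = 1);

-- rewritten with EXISTS
SELECT o.order_id, o.total
FROM dbo.orders o
WHERE EXISTS (SELECT 1
              FROM dbo.order_lines ol
              WHERE ol.order_id = o.order_id AND ol.backordered = 1);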
You don't mention the technology -- keep in mind that different technologies can affect the efficiency of more complex queries.
I strongly recommend reading some reference material on this. This (hyperlink below) is probably a pretty good book to look into. Make sure to look under "Selectivity", among other topics.
SQL Tuning - Dan Tow

How do Views work in a DBMS?

Say that I have two tables like those:
Employers (id, name, .... , deptId).
Depts(id, deptName, ...).
But that data is not going to be modified very often, and I want a query like this
SELECT name, deptName FROM Employers, Depts
WHERE deptId = Depts.id AND Employers.id="ID"
to be as fast as it can.
To my head comes two possible solutions:
Denormalize the table:
Despite that with this solution I will lose some of the great advantages of having a normalized database, here the performance is a MUST.
Create a View for that denormalized data.
I will keep the data normalized and (here is my question) - will the performance of a query over that view be faster than without the view?
Or, to ask the same question another way: is the view "interpreted" every time you query it, or how do views actually work in a DBMS?
Generally, unless you "materialize" a view, which is an option in some software like MS SQL Server, the view is just translated into queries against the base tables, and is therefore no faster or slower than the original (minus the minuscule amount of time it takes to translate the query, which is nothing compared to actually executing the query).
How do you know you've got performance problems? Are you profiling it under load? Have you verified that the performance bottleneck is these two tables? Generally, until you've got hard data, don't assume you know where performance problems come from, and don't spend any time optimizing until you know you're optimizing the right thing - 80% of the performance issues come from 20% of the code.
If Depts.ID is the primary key of that table, and you index the Employers.DeptID field, then this query should remain very fast even over millions of records.
Denormalizing doesn't make sense to me in that scenario.
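A sketch of the indexing described above (assuming Depts.id is already the primary key):

-- index the foreign key column used in the join
CREATE INDEX idx_employers_deptId ON Employers (deptId);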
Generally speaking, performance of a view will be almost exactly the same as performance when running the query itself. The advantage of a view is simply to abstract that query away, so you don't have to think about it.
You could use a Materialized View (or "snapshot" as some say), but then your data is only going to be as recent as your last refresh.
In a comment to one of the replies, the author of the question explains that he is looking for a way to create a materialized view in MySQL.
MySQL does not wrap the concept of the materialized view in a nice package for you like other DBMSes, but it does have all the tools you need to create one.
What you need to do is this:
Create the initial materialization of the result of your query.
Create a trigger on insert into the employers table that inserts into the materialized table all rows that match the newly inserted employer.
Create a trigger on delete in the employers table that deletes the corresponding rows from the materialized table.
Create a trigger on update in the employers table that updates the corresponding rows in the materialized table.
Same for the departments table.
This may work ok if your underlying tables are not frequently updated; but you need to be aware of the added cost of create/update/delete operations once you do this.
Also you'll want to make sure some DBA who doesn't know about your trickery doesn't go migrating the database without migrating the triggers, when time comes. So document it well.
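A minimal sketch of the initial materialization and the insert trigger, using the question's tables (the materialized table name and details are illustrative; the delete and update triggers follow the same pattern):

-- 1. initial materialization
CREATE TABLE emp_dept_mv AS
SELECT e.id AS emp_id, e.name, d.deptName
FROM Employers e
JOIN Depts d ON d.id = e.deptId;

ALTER TABLE emp_dept_mv ADD PRIMARY KEY (emp_id);

-- 2. keep it in sync when an employer is inserted
DELIMITER $$
CREATE TRIGGER employers_ai AFTER INSERT ON Employers
FOR EACH ROW
BEGIN
    INSERT INTO emp_dept_mv (emp_id, name, deptName)
    SELECT NEW.id, NEW.name, d.deptName
    FROM Depts d
    WHERE d.id = NEW.deptId;
END$$
DELIMITER ;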
Sounds like premature optimisation unless you know it is a clear and present problem.
MySQL does not materialise views, they are no faster than queries against the base tables. Moreover, in some cases they are slower as they get optimised less well.
But views also "hide" stuff from developers maintaining the code in the future to make them imagine that the query is less complex than it actually is.

Typical Kimball Star-schema Data Warehouse - Model Views Feasible? and How to Code Gen

I have a data warehouse containing typical star schemas, and a whole bunch of code which does stuff like this (obviously a lot bigger, but this is illustrative):
SELECT cdim.x
,SUM(fact.y) AS y
,dim.z
FROM fact
INNER JOIN conformed_dim AS cdim
ON cdim.cdim_dim_id = fact.cdim_dim_id
INNER JOIN nonconformed_dim AS dim
ON dim.ncdim_dim_id = fact.ncdim_dim_id
INNER JOIN date_dim AS ddim
ON ddim.date_id = fact.date_id
WHERE fact.date_id = #date_id
GROUP BY cdim.x
,dim.z
I'm thinking of replacing it with a view (MODEL_SYSTEM_1, say), so that it becomes:
SELECT m.x
,SUM(m.y) AS y
,m.z
FROM MODEL_SYSTEM_1 AS m
WHERE m.date_id = #date_id
GROUP BY m.x
,m.z
But the view MODEL_SYSTEM_1 would have to contain unique column names, and I'm also concerned about how the optimizer will handle this: will predicates in the WHERE clause against the different facts and dimensions still be optimized properly, given that the view spans a whole star? And views cannot be parametrized (boy, wouldn't that be cool!)
So my questions are -
Is this approach OK, or is it just going to be an abstraction which hurts performance and doesn't give me anything but a lot nicer syntax?
What's the best way to code-gen these views, eliminating duplicate column names (even if the view later needs to be tweaked by hand), given that all the appropriate PK and FKs are in place? Should I just write some SQL to pull it out of the INFORMATION_SCHEMA or is there a good example already available.
Edit: I have tested it, and the performance seems the same, even on the bigger processes - even joining multiple stars which each use these views.
The automation is mainly because there are a number of these stars in the data warehouse, and the FK/PK has been done properly by the designers, but I don't want to have to pick through all the tables or the documentation. I wrote a script to generate the view (it also generates abbreviations for the tables), and it works well to generate the skeleton automagically from INFORMATION_SCHEMA, and then it can be tweaked before committing the creation of the view.
If anyone wants the code, I could probably publish it here.
I’ve used this technique on several data warehouses I look after. I have not noticed any performance degradation when running reports based off of the views versus a table direct approach but have never performed a detailed analysis.
I created the views using the designer in SQL Server management studio and did not use any automated approach. I can’t imagine the schema changing often enough that automating it would be worthwhile anyhow. You might spend as long tweaking the results as it would have taken to drag all the tables onto the view in the first place!
To remove ambiguity a good approach is to preface the column names with the name of the dimension it belongs to. This is helpful to the report writers and to anyone running ad hoc queries.
Make the view or views into one or more summary fact tables and materialize them. These only need to be refreshed when the main fact table is refreshed. The materialized views will be faster to query, and this can be a win if you have a lot of queries that can be satisfied by the summary.
You can use the data dictionary or information schema views to generate SQL to create the tables if you have a large number of these summaries or wish to change them about frequently.
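As a rough starting point for that kind of generation, a query like the following lists every column of the star prefixed with its table name, ready to paste into a view definition (the table names are the question's; adjust the filter to your schema):

SELECT QUOTENAME(c.TABLE_NAME) + '.' + QUOTENAME(c.COLUMN_NAME)
       + ' AS ' + QUOTENAME(c.TABLE_NAME + '_' + c.COLUMN_NAME) + ','
FROM INFORMATION_SCHEMA.COLUMNS c
WHERE c.TABLE_NAME IN ('fact', 'conformed_dim', 'nonconformed_dim', 'date_dim')
ORDER BY c.TABLE_NAME, c.ORDINAL_POSITION;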
However, I would guess that it's not likely that you would change these very often so auto-generating the view definitions might not be worth the trouble.
If you happen to use MS SQL Server, you could try an Inline UDF which is as close to a parameterized view as it gets.
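A sketch of that idea against the question's star (the function name is illustrative, and @date_id mirrors the question's #date_id placeholder):

CREATE FUNCTION dbo.MODEL_SYSTEM_1 (@date_id INT)
RETURNS TABLE
AS
RETURN
(
    SELECT cdim.x, fact.y, dim.z
    FROM fact
    INNER JOIN conformed_dim AS cdim ON cdim.cdim_dim_id = fact.cdim_dim_id
    INNER JOIN nonconformed_dim AS dim ON dim.ncdim_dim_id = fact.ncdim_dim_id
    INNER JOIN date_dim AS ddim ON ddim.date_id = fact.date_id
    WHERE fact.date_id = @date_id
);
GO
-- usage: the parameter is applied before the aggregation
SELECT m.x, SUM(m.y) AS y, m.z
FROM dbo.MODEL_SYSTEM_1(@date_id) AS m
GROUP BY m.x, m.z;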