view over cte or temp table? - sql

I have a table in the staging layer (not indexed) which has close to 100 million rows. In the data warehouse layer, I need to select a certain number of rows from this table and join them with another table of roughly 50 million rows, for which I use a CTE now. From this CTE, some further aggregations are carried out before joining with other tables. So what would happen if I used a view instead of the CTE? I cannot test-run it since it takes a lot of time.
So, generally speaking, which holds the advantage in terms of performance:
CTE, temp table, or view?
Any help is appreciated.

I think you should go with a local (single #) temporary table with indexes, because first you will fetch the data from the main table, then apply some aggregations, looping, and custom logic. There are a few benefits:
Whenever the connection closes, the local temporary table is dropped automatically.
Since you are dealing with millions of records, an index will make the lookups faster.
Running the aggregations against the temporary table does not put load on your main tables.
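A rough sketch of that approach in SQL Server - the table and column names here are invented, so substitute your own:
-- Hypothetical names: staging.BigTable, dim.OtherTable and the key/date columns
-- stand in for your own schema.
SELECT bt.CustomerId,
       bt.OrderDate,
       bt.Amount
INTO   #StagingSubset                  -- local temp table, dropped when the connection closes
FROM   staging.BigTable AS bt
WHERE  bt.OrderDate >= '20230101';     -- restrict to the rows you actually need

CREATE CLUSTERED INDEX IX_StagingSubset_CustomerId
    ON #StagingSubset (CustomerId);    -- index the join key

-- Aggregate against the temp table instead of the 100-million-row staging table
SELECT s.CustomerId,
       SUM(s.Amount) AS TotalAmount
FROM   #StagingSubset AS s
JOIN   dim.OtherTable AS o
  ON   o.CustomerId = s.CustomerId
GROUP BY s.CustomerId;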

From what you describe the CTE is "created" once (when defined) and used once (when aggregated).
In general, this means that you should keep the code as a single query, letting the optimizer find the best execution path.
In general, materializing CTEs is going to be a bigger win when the CTE is referenced multiple times. Often, you can get around multiple references using window functions, but that is a different matter.
That is general advice, but not always true. Materializing a CTE as a temporary table can give two benefits:
The query optimizer has a more accurate estimate of the number of rows for optimization.
You can add indexes to boost performance.
The first is possibly not an issue, because you still have a large percentage of the original rows. The second could possibly help, but it is not a no-brainer.
You might want to create an indexed materialized view instead of a temporary table. This would stay up-to-date and possibly be a big boost to performance.
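If you go the indexed view route in SQL Server, the general shape is something like the sketch below (dbo.Fact and dbo.Dim are placeholder names, not your tables). Note that indexed views require SCHEMABINDING, two-part object names and COUNT_BIG(*) when grouping, and it is the unique clustered index that actually materializes the result set:
CREATE VIEW dbo.vw_FactAggregate
WITH SCHEMABINDING                      -- required for an indexed view
AS
SELECT f.DimKey,
       SUM(f.Amount) AS TotalAmount,
       COUNT_BIG(*)  AS RowCnt          -- COUNT_BIG(*) is required when the view uses GROUP BY
FROM   dbo.Fact AS f
JOIN   dbo.Dim  AS d ON d.DimKey = f.DimKey
GROUP BY f.DimKey;
GO

CREATE UNIQUE CLUSTERED INDEX IX_vwFactAggregate
    ON dbo.vw_FactAggregate (DimKey);   -- this index is what materializes and maintains the view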

Related

Materialized view Vs Temp tables in Oracle

I have a base transaction table. Then I have around 15 intermediate steps where I'm combining dimension tables, performing some aggregation, and implementing business logic. The way I'm handling it currently is creating temporary tables for the intermediate stages and, after these 15 steps, populating the final result into a physical table. Is this a better approach, or is using materialized views instead of these intermediate temp tables better? If using materialized views for the intermediate steps is the better approach, can you kindly let me know why?
I have already tried scripting both approaches: the 15 intermediate steps as global temporary tables as well as materialized views. I found a marginal improvement in performance with the MVs compared to the temp tables, but it comes at the cost of extra physical storage. I'm not sure which is the best practice, and why.
Temporary tables write to disk, so there's I/O costs for both reading and writing. Also most sites don't manage their temporary tables properly and they end up on the default temporary tablespace, which is the same TEMP tablespace everybody uses for sorting, etc. So there's potential for resource contention there.
Materialized views are intended for materializing aspects of our data set which are commonly reused by many different queries. That's why the most common use case is for storing higher level aggregates of low level data. That doesn't sound like the use case you have here. And lo!
I'm doing a complete refresh of MVs and not a incremental refresh
So nope.
Then I have around 15 intermediate steps, where I'm combining dimension tables, performing some aggregation and implementing business logic.
This is a terribly procedural way of querying data. Sometimes there's no way of avoiding this, especially in certain data warehouse scenarios. However, it doesn't follow that we need to materialize the outputs of those queries. An alternative approach is to use WITH clauses. The output from one WITH subquery can feed into lower subqueries.
with sq1 as (
select whatever
, count(*) as t1_tot
from t1
group by whatever
) , sq2 as (
select sq1.whatever
, max(t2.blah) as max_blah
from sq1
join t2 on t2.whatever = sq1.whatever
group by sq1.whatever
) , sq3 as (
select sq2.whatever
,(t3.meh + t3.huh) as qty
from sq2
join t3 on t3.whatever = sq2.whatever
where t3.something >= sq2.max_blah
)
select sq1.whatever
,sq1.t1_tot
,sq2.max_blah
,sq3.qty
from sq1
join sq2 on sq2.whatever = sq1.whatever
join sq3 on sq3.whatever = sq1.whatever
Not saying it won't be a monstrous query, the terror of the department. But it will probably perform way better than your MViews or GTTs. (Oracle may choose to materialize those intermediate result sets, but we can use hints to affect that.)
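For instance, Oracle's MATERIALIZE hint (undocumented but widely used) asks the optimizer to spool a WITH subquery to a temporary segment, while INLINE asks it to merge the subquery into the main query. A minimal sketch against the toy tables above:
with sq1 as (
select /*+ materialize */ whatever   -- force sq1 to be spooled once
, count(*) as t1_tot
from t1
group by whatever
)
select whatever, t1_tot
from sq1;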
You may even find from taking this approach that some of your steps are unnecessary and you can combine several steps into one query. Certainly in real life I would write my toy statement above as one query not a join of three subqueries.
From what you said, I'd say that using (global or private, depending on the database version you use) temporary tables is the better choice. Why? Because you are "calculating" something, storing the results of those calculations in some tables, and reusing them for additional processing. All of that - if it can't be done without temporary tables - is work for tables.
A materialized view is, as its name says, a view. It is the result of some query but - as opposed to "normal" views - it actually takes up space. It can be refreshed (either on demand, when the source data changes, or on a schedule). Yes, it has its advantages, though I can't see any in what you are currently doing.
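For what it's worth, one of those intermediate steps as an Oracle global temporary table might look roughly like this (step1_result and its columns are placeholders for your own step):
create global temporary table step1_result (
whatever number,
t1_tot   number
) on commit preserve rows;   -- rows live for the session; the definition persists

insert into step1_result (whatever, t1_tot)
select whatever, count(*)
from t1
group by whatever;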

Complexity comparison between temporary table + index creation vice multi-table group by without index

I have two potential roads to take on the following problem, the try it and see methodology won't pay off for this solution as the load on the server is constantly in flux. The two approaches I have are as follows:
select *
from
(
select foo.a,bar.b,baz.c
from foo,bar,baz
-- updated for clarity sake
where foo.a = bar.b
and bar.b = baz.c
)
group by a,b,c
vice
create table results as
select foo.a,bar.b,baz.c
from foo,bar,baz
where foo.a = bar.b
and bar.b = baz.c ;
create index results_spanning on results(a,b,c);
select * from results group by a,b,c;
So, in case it isn't clear: the top query performs the GROUP BY outright against the multi-table select, thus preventing me from using an index. The second approach lets me create a new table that stores the results of the query, then create a spanning index, and finally run the GROUP BY query so it can utilize the index.
What is the complexity difference of these two approaches, i.e. how do they scale and which is preferable in the case of large quantities of data. Also, the main issue is the performance of the overall select so that is what I am attempting to fix here.
Comments
Are you really doing a CROSS JOIN on three tables? Are those three columns indexed in their own right? How often do you want to run the query which delivers the end result?
1) No.
2) Yes, where clause omitted for the sake of discussion as this is clearly a super trivial example
3) Doesn't matter.
2nd Update
This is a temporary table, as it is only valid for a brief moment in time - so yes, this table will only be queried against once.
If your query is executed frequently and unacceptably slow, you could look into creating materialized views to pre-compute the results. This gives you the benefit of an indexable "table", without the overhead of creating a table every time.
You'll need to refresh the materialized view (preferably fast if the tables are large) either on commit or on demand. There are some restrictions on how you can create on commit, fast refreshable views, and they will add to your commit time processing slightly, but they will always give the same result as running the base query. On demand MVs will become stale as the underlying data changes until these are refreshed. You'll need to determine whether this is acceptable or not.
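As a rough sketch of the fast-refresh-on-commit setup in Oracle (the orders table and its columns are invented for illustration): the materialized view log is what makes FAST refresh possible, and fast-refreshable aggregate MVs typically need the COUNT columns alongside the SUM:
create materialized view log on orders
with sequence, rowid (customer_id, amount)
including new values;

create materialized view mv_customer_totals
refresh fast on commit
as
select customer_id
, sum(amount)   as total_amount
, count(amount) as cnt_amount   -- needed so SUM stays fast-refreshable
, count(*)      as cnt
from orders
group by customer_id;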
So the question is, which is quicker?
Run a query once and sort the result set?
Run a query once to build a table, then build an index, then run the query again and sort the result set?
Hmmm. Tricky one.
The use cases for temporary tables are pretty rare in Oracle. They normally only apply when we need to freeze a result set which we are then going to query repeatedly. That is apparently not the case here.
So, take the first option and just tune the query if necessary.
The answer is, as is so often the case with tuning questions, it depends.
Why are you doing a GROUP BY in the first place? The query as you posted it doesn't do any aggregation, so the only reason for doing a GROUP BY would be to eliminate duplicate rows, i.e. a DISTINCT operation. If this is actually the case, then you are doing some form of cartesian join, and one way of tuning the query would be to fix the WHERE clause so that it only returns distinct records.

What aspects of a sql query are relatively costly to one another? Joins? Num of records? columns selected?

How costly would SELECT One, Two, Three be compared to SELECT One, Two, Three, ..... N-Column
If you have a SQL query that has two or three tables joined together and is retrieving 100 rows of data, does performance have anything to say about whether I should be selecting only the columns I need? Or should I write a query that just yanks all the columns?
If possible, could you help me understand what aspects of a query would be relatively costly compared to one another? Is it the joins? is it the large number of records pulled? is it the number of columns in the select statement?
Would 1 record vs 10 record vs 100 record matter?
As an extremely generalized version of ranking those factors you mention in terms of performance penalty and occurrence in the queries you write, I would say:
Joins - Especially when joining on tables with no indexes for the fields you're joining on and/or with tables that have a very large amount of data.
# of Rows / Amount of Data - Again, indexes mitigate this quite a bit, just make sure you have the right ones.
# of Fields - I would say the # of fields in the SELECT clause impact performance the least in most situations.
I would say any performance-driving property is always coupled with how much data you have - sure a join might be fast when your tables have 100 rows each, but when millions of rows are in the tables, you have to start thinking about more efficient design.
Several things impact the cost of a query.
First, are there appropriate indexes for it to use? Fields that are used in a join should almost always be indexed, and foreign keys are not indexed by default - the designer of the database must create those indexes. Fields used in the WHERE clauses often need indexes as well.
Next, is the WHERE clause sargable - in other words, can it use the indexes even if you have the correct ones? A bad WHERE clause can hurt a query far more than joins or extra columns. You can't get anything but a table scan if you use syntax that prevents the use of an index, such as:
LIKE '%test'
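For instance, assuming a hypothetical Customers table with an index on LastName:
SELECT CustomerId FROM Customers WHERE LastName LIKE '%test'  -- leading wildcard: the index cannot be used, so you get a scan
SELECT CustomerId FROM Customers WHERE LastName LIKE 'test%'  -- sargable: an index seek on LastName is possible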
Next, are you returning more data than you need? You should never return more columns than you need, and you should not be using SELECT * in production code: it does extra work to look up the columns, and it is fragile and prone to creating bad bugs as the structure changes over time.
Are you joining to tables you don't need to be joining to? If a table returns no columns in the SELECT, is not used in the WHERE, and doesn't filter out any records when the join is removed, then you have an unnecessary join and it can be eliminated. Unnecessary joins are particularly prevalent when you use a lot of views, especially if you make the mistake of calling views from other views (which is a big performance killer for many reasons). Sometimes if you trace through these views that call other views, you will see the same table joined to multiple times when it would not have been necessary if the query had been written from scratch instead of using a view.
Not only does returning more data than you need cause SQL Server to work harder, it causes the query to use up more network resources and more of the memory of the web server if you are holding the results in memory. It is an all-around poor choice.
Finally, are you using known poorly performing techniques when a better one is available? This would include the use of cursors when a set-based alternative is better, the use of correlated subqueries when a join would be better, the use of scalar user-defined functions, and the use of views that call other views (especially if you nest more than one level). Most of these poor techniques involve processing row-by-agonizing-row, which is generally the worst choice in a database. To properly query databases you need to think in terms of data sets, not process one row at a time.
There are plenty more things that affect the performance of queries and the database; to truly get a grip on this subject you need to read some books on it. It is too complex a subject to fully discuss on a message board.
Or should I write a query that just yanks all the columns..
No. Just today there was another question about that.
If possible, could you help me understand what aspects of a query would be relatively costly compared to one another? Is it the joins? is it the large number of records pulled? is it the number of columns in the select statement?
Any useless join or data retrieval costs you time and should be avoided. Retrieving rows from a datastore is costly. Joins can be more or less costly depending on the context, amount of indexes defined... you can examine the query plan of each query to see the estimated cost for each step.
Selecting more columns/rows will have some performance impacts, but honestly why would you want to select more data than you are going to use anyway?
If possible, could you help me understand what aspects of a query would be relatively costly compared to one another?
Build the query you need, THEN worry about optimizing it if the performance doesn't meet your expectations. You are putting the cart before the horse.
To answer the following:
How costly would SELECT One, Two, Three be compared to SELECT One, Two, Three, ..... N-Column
This is not a matter of the SELECT performance but of the amount of time it takes to fetch the data. SELECT * FROM Table and SELECT ID FROM Table perform the same, but fetching the data will take longer. This goes hand in hand with the number of rows returned from a query.
As for understanding performance, here is a good link:
http://www.dotnetheaven.com/UploadFile/skrishnasamy/SQLPerformanceTunning03112005044423AM/SQLPerformanceTunning.aspx
Or google tsql Performance
Joins have the potential to be expensive. In the worst case scenario, when no indexes can be used, they require O(M*N) time, where M and N are the number of records in the tables. To speed things up, you can CREATE INDEX on columns that are part of the join condition.
The number of columns has little effect on the time required to find rows, but slows things down by requiring more data to be sent.
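For example (the orders and customers tables here are hypothetical, used only to illustrate the point):
-- Index the join key so the engine can use an index-based join
-- instead of comparing every row pair.
CREATE INDEX ix_orders_customer_id ON orders (customer_id);

SELECT c.name, o.total
FROM   customers AS c
JOIN   orders    AS o ON o.customer_id = c.id;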
What others are saying is all true.
But typically, if you are working with tables that already have good indexes, what's most important for performance is what goes into the WHERE clause. There you have to worry more about using a field that has no index or using a statement that can't be optimized.
The difference between SELECT One, Two, Three FROM ... and SELECT One,...,N FROM ... could be like the difference between day and night. To understand the problem, you need to understand the concept of a covering index:
A covering index is a special case where the index itself contains the required data field(s) and can return the data.
As you add more unnecessary columns to the projection list you are forcing the query optimizer to look up the newly added columns in the 'table' (really in the clustered index or in the heap). This can change an execution plan from an efficient narrow index range scan or seek into a bloated clustered index scan, which can make the difference between sub-second and multi-hour run times, depending on your data. So projecting unnecessary columns is often the most impactful factor in a query.
The number of records pulled is a more subtle issue. With a large number, a query can hit the index tipping point and choose, again, a clustered index scan over a narrower index range scan and lookup. Now, the fact that lookups into the clustered index are necessary to start with means the narrow index is not covering, which ultimately may be caused by projecting unnecessary columns.
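As an illustration, a covering index in SQL Server might look something like this (dbo.Orders and its columns are placeholders, not from the question):
-- The key column supports the predicate; the INCLUDEd columns let the
-- index alone answer the query, so no clustered-index lookups are needed.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId_Covering
    ON dbo.Orders (CustomerId)
    INCLUDE (OrderDate, Amount);

SELECT CustomerId, OrderDate, Amount
FROM   dbo.Orders
WHERE  CustomerId = 42;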
And finally, joins. The question here is joins, as opposed to what else? If a join is required, there is no alternative, and that's all there is to say about this.
Ultimately, query performance is driven by one factor alone: amount of IO. And the amount of IO is driven ultimately by the access paths available to satisfy the query. In other words, by the indexing of your data. It is impossible to write efficient queries on bad indexes. It is possible to write bad queries on good indexes, but more often than not the optimizer can compensate and come up with a good plan. You should spend all your effort in better understanding index design:
Designing Indexes
SQL Server Optimization
Short answer: don't select more fields than you need - search for "*" in both your source code and your stored procedures ;)
You always have to consider which parts of the query will cause which costs.
If you have a good DB design, joining a few tables is usually not expensive. (Make sure you have correct indices).
The main issue with "select *" is that it causes unpredictable behavior in your results. If you write a query like that AND access the fields by column index, you will be locked into the DB schema forever.
Another thing to consider is the amount of data you have to move. You might think it's trivial, but then version 2.0 of your application suddenly adds a ProfilePicture column to the User table, and now the query that selects 100 users suddenly uses up several megabytes of bandwidth.
The second thing you should consider is the number of rows you return. SQL is very powerful at sorting and grouping, so let SQL do its job and don't move it to the client. Limit the number of records you return; in most applications it makes no sense to return more than 100 rows to a user at once. You might let the user choose to load more, but make it a choice they have to make.
Finally, monitor your SQL Server. Run a profiler against it and try to find your worst queries. A SQL query should not take longer than half a second; if it does, something is most likely messed up. (Yes, there are operations that can take much longer, but those should have a reason.)
Edit:
Once you have found the slow query, look at the execution plan... You will see which parts of the query are expensive and which parts work well... The optimizer is also a tool that can be used.
I suggest you consider your queries in terms of I/O first. Disk I/O on my SATA system is 6 Gb/sec. My DDR3 memory bandwidth is 12 GB/sec. I can move items in memory 16 times faster than I can retrieve them from disk. (Ref Wikipedia and Tom's Hardware)
The difference between getting a few columns and all the columns for your 100 rows could be the difference between getting a single 8K page from disk and getting two or more pages from disk. Once the pages are finally in memory, moving two columns or all columns into a hash table is faster than any measuring tool I have can register.
I value the advice of the others on this topic related to database design. The design of narrow indexes, using included columns to make covering indexes, avoiding table or index scans in favor of seeks by using an appropriate WHERE clause, narrow primary keys, etc. is the difference between having a DBA title and being a DBA.

Which one has better performance: derived tables or temporary tables?

Sometimes we can write a query with both a derived table and a temporary table. My question is: which one is better, and why?
A derived table is a logical construct.
It may be stored in tempdb, built at runtime by re-evaluating the underlying statement each time it is accessed, or even optimized out entirely.
A temporary table is a physical construct. It is a table in tempdb that is created and populated with the values.
Which one is better depends on the query they are used in, the statement that is used to derive a table, and many other factors.
For instance, CTE (common table expressions) in SQL Server can (and most probably will) be reevaluated each time they are used. This query:
WITH q (uuid) AS
(
SELECT NEWID()
)
SELECT *
FROM q
UNION ALL
SELECT *
FROM q
will most probably yield two different NEWID()'s.
In this case, a temporary table should be used since it guarantees that its values persist.
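For instance, a temp-table version of the same toy query evaluates NEWID() once and keeps the value (a minimal sketch):
SELECT NEWID() AS uuid
INTO   #q;             -- NEWID() is evaluated once and the result is stored

SELECT * FROM #q
UNION ALL
SELECT * FROM #q;      -- both rows show the same uuid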
On the other hand, this query:
SELECT *
FROM (
SELECT *, ROW_NUMBER() OVER (ORDER BY id) AS rn
FROM master
) q
WHERE rn BETWEEN 80 AND 100
is better with a derived table, because using a temporary table will require fetching all values from master, while this solution will just scan the first 100 records using the index on id.
It depends on the circumstances.
Advantages of derived tables:
A derived table is part of a larger, single query, and will be optimized in the context of the rest of the query. This can be an advantage, if the query optimization helps performance (it usually does, with some exceptions). Example: if you populate a temp table, then consume the results in a second query, you are in effect tying the database engine to one execution method (run the first query in its entirety, save the whole result, run the second query) where with a derived table the optimizer might be able to find a faster execution method or access path.
A derived table only "exists" in terms of the query execution plan - it's purely a logical construct. There really is no table.
Advantages of temp tables
The table "exists" - that is, it's materialized as a table, at least in memory, which contains the result set and can be reused.
In some cases, performance can be improved or blocking reduced when you have to perform some elaborate transformation on the data - for example, if you want to fetch a 'snapshot' set of rows out of a base table that is busy, and then do some complicated calculation on that set, there can be less contention if you get the rows out of the base table and unlock it as quickly as possible, then do the work independently. In some cases the overhead of a real temp table is small relative to the advantage in concurrency.
I want to add an anecdote here, as it leads me to advise the opposite of the accepted answer. I agree with the thinking presented in the accepted answer, but it is mostly theoretical. My experience has led me to recommend temp tables over derived tables, common table expressions, and table-valued functions. We used derived tables and common table expressions extensively, with much success, based on thinking consistent with the accepted answer - until we started dealing with larger result sets and/or more complex queries. Then we found that the optimizer did not optimize well with the derived table or CTE.
I looked at an example today that ran for 10:15. I inserted the results from the derived table into a temp table, joined the temp table in the main query, and the total time dropped to 0:03. Usually when we see a big performance problem we can quickly address it this way. For this reason I recommend temp tables unless your query is relatively simple and you are certain it will not be processing large data sets.
The big difference is that you can put constraints, including a primary key, on a temporary table. For big data sets (I mean millions of records) you can sometimes get better performance with temporary tables. I have a key query that needs 5 joins (each join happens to be similar). Performance was OK with 2 joins, but on the third the performance went bad and the query plan went crazy. Even with hints I could not correct the query plan. I tried restructuring the joins as derived tables and hit the same performance issues. With temporary tables I can create a primary key (and when I populate the table I first sort on the PK). When SQL could join the 5 tables and use the PK, performance went from minutes to seconds. I wish SQL supported constraints on derived tables and CTEs (even if only a PK).
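A rough sketch of that pattern - the table and column names are invented, and only one of the five joins is shown:
CREATE TABLE #joined (
    order_id    int   NOT NULL PRIMARY KEY,   -- constraints are allowed on temp tables
    customer_id int   NOT NULL,
    amount      money NOT NULL
);

INSERT INTO #joined (order_id, customer_id, amount)
SELECT o.order_id, o.customer_id, o.amount
FROM   dbo.orders AS o
ORDER  BY o.order_id;          -- populate in PK order, as described above

SELECT j.customer_id, SUM(d.line_amount) AS total
FROM   #joined AS j
JOIN   dbo.order_details AS d ON d.order_id = j.order_id   -- the PK on #joined helps this join
GROUP  BY j.customer_id;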

Temp tables and SQL SELECT performance

Why does the use of temp tables with a SELECT statement improve the logical I/O count? Wouldn't it increase the number of hits to the database instead of decreasing it? Is this because the 'problem' is broken down into sections? I'd like to know what's going on behind the scenes.
There's no general answer. It depends on how the temp table is being used.
The temp table may reduce IO by caching rows created after a complex filter/join that are used multiple times later in the batch. This way, the DB can avoid hitting the base tables multiple times when only a subset of the records are needed.
The temp table may increase IO by storing records that are never used later in the query, or by taking up a lot of space in the engine's cache that could have been better used by other data.
Creating a temp table just to use all of its contents once is slower than including the temp table's query in the main query, because the query optimizer can't see past the temp table and it forces a (probably) unnecessary spool of the data instead of allowing it to stream from the source tables.
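To make the IO-reduction case above concrete, here is a sketch of the 'filter once, reuse later' pattern (big_table, other_big_table and their columns are placeholders):
-- Run the expensive filter/join once and cache the result
SELECT t1.id, t1.val
INTO   #expensive
FROM   big_table       AS t1
JOIN   other_big_table AS t2 ON t2.id = t1.id
WHERE  t2.some_flag = 1;

SELECT COUNT(*) FROM #expensive;                 -- first reuse
SELECT MAX(val) FROM #expensive WHERE val > 10;  -- second reuse; the base tables are not touched again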
I'm going to assume by temp tables you mean a sub-select in a WHERE clause. (This is referred to as a semijoin operation and you can usually see that in the text execution plan for your query.)
When the query optimizer encounters a sub-select/temp table, it makes some assumptions about what to do with that data. Essentially, the optimizer will create an execution plan that performs a join on the sub-select's result set, reducing the number of rows that need to be read from the other tables. Since there are fewer rows, the query engine is able to read fewer pages from disk/memory and reduce the amount of I/O required.
AFAIK, at least with MySQL, temp tables are kept in RAM, making SELECTs much faster than anything that hits the disk.
There is a class of problems where building the result in a collection structure on the database side is much preferable to returning the result's parts to the client and round-tripping for each part.
For example: arbitrary depth recursive relationships (boss of)
There's another class of query problems where the data is not and will not be indexed in a manner that makes the query run efficiently. Pulling results into a collection structure, which can be indexed in a custom way, will reduce the logical IO for these queries.