Materialized view Vs Temp tables in Oracle - sql

I have a base transaction table. Then I have around 15 intermediate steps, where I'm combining dimension tables, performing some aggregation and implementing business logic. The way I'm handling it currently is creating temporary tables for the intermediate stages and, after these 15 steps, populating the final result in a physical table. Is this a good approach, or is using materialized views instead of these intermediate temp tables better? If materialized views are the better approach for the intermediate steps, can you kindly let me know why?
I have already tried scripting both approaches: the 15 intermediate steps as global temporary tables as well as materialized views. I found a marginal improvement in performance with MVs compared to temp tables, but it comes at the cost of extra physical storage. I'm not sure which is the best practice, and why.

Temporary tables write to disk, so there are I/O costs for both reading and writing. Also, most sites don't manage their temporary tables properly, and they end up on the default temporary tablespace, which is the same TEMP tablespace everybody uses for sorting, etc. So there's potential for resource contention there.
Materialized views are intended for materializing aspects of our data set which are commonly reused by many different queries. That's why the most common use case is storing higher-level aggregates of low-level data. That doesn't sound like the use case you have here. And lo!
I'm doing a complete refresh of MVs and not an incremental refresh
So nope.
Then I have around 15 intermediate steps, where I'm combining dimension tables, performing some aggregation and implementing business logic.
This is a terribly procedural way of querying data. Sometimes there's no way of avoiding this, especially in certain data warehouse scenarios. However, it doesn't follow that we need to materialize the outputs of those queries. An alternative approach is to use WITH clauses. The output from one WITH subquery can feed into lower subqueries.
with sq1 as (
    select whatever
         , count(*) as t1_tot
    from   t1
    group by whatever
), sq2 as (
    select sq1.whatever
         , max(t2.blah) as max_blah
    from   sq1
           join t2 on t2.whatever = sq1.whatever
    group by sq1.whatever
), sq3 as (
    select sq2.whatever
         , (t3.meh + t3.huh) as qty
    from   sq2
           join t3 on t3.whatever = sq2.whatever
    where  t3.something >= sq2.max_blah
)
select sq1.whatever
     , sq1.t1_tot
     , sq2.max_blah
     , sq3.qty
from   sq1
       join sq2 on sq2.whatever = sq1.whatever
       join sq3 on sq3.whatever = sq1.whatever
Not saying it won't be a monstrous query, the terror of the department. But it will probably perform way better than your MViews or GTTs. (Oracle may choose to materialize those intermediate result sets, but we can use hints to affect that.)
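For illustration, the (undocumented but widely used) MATERIALIZE and INLINE hints are placed inside the WITH subquery; a minimal sketch against the toy tables above:
with sq1 as (
    select /*+ materialize */   -- force Oracle to spool sq1 to a temporary segment
           whatever             -- (use /*+ inline */ to force the opposite)
         , count(*) as t1_tot
    from   t1
    group by whatever
)
select * from sq1;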
You may even find from taking this approach that some of your steps are unnecessary and you can combine several steps into one query. Certainly in real life I would write my toy statement above as one query not a join of three subqueries.

From what you said, I'd say that using (global or private, depending on the database version you use) temporary tables is the better choice. Why? Because you are "calculating" something, storing the results of those calculations in tables, and reusing them for additional processing. All of that - if it can't be done without temporary tables - is a job for tables.
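For reference, a minimal sketch of both flavours (the table names are made up; private temporary tables require Oracle 18c and, by default, the ORA$PTT_ name prefix):
-- Global temporary table: the definition is permanent, the data is private to the session
create global temporary table stage01_gtt
on commit preserve rows
as select * from base_transactions where 1 = 0;   -- hypothetical base table

-- Private temporary table (18c+): definition and data both vanish with the session
create private temporary table ora$ptt_stage01
on commit preserve definition
as select * from base_transactions where 1 = 0;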
A materialized view is, as its name says, a view. It is the result of some query but - as opposed to "normal" views - it actually takes up space. It can be refreshed (on demand, when source data changes, or on a schedule). Yes, it has its advantages, though I can't see any in what you are currently doing.

Related

Self-Joins: is there a way to improve the performance of this query?

The purpose of all this is to create a lookup table to avoid a self join down the road, which would involve joins for the same data against much bigger data sets.
In this instance a sales order may have one or both of bill to and ship to customer ID.
The tables here are aggregates of data from 5 different servers, differentiated by the box_id. The customer table is ~1.7M rows, and sales_order is ~55M. The end result is ~52M records and takes on average about 80 minutes to run.
The query:
SELECT DISTINCT
       sog.box_id,
       sog.sales_order_id,
       cb.cust_id       AS bill_to_customer_id,
       cb.customer_name AS bill_to_customer_name,
       cs.cust_id       AS ship_to_customer_id,
       cs.customer_name AS ship_to_customer_name
FROM   sales_order sog
       LEFT JOIN customer cb ON cb.cust_id = sog.bill_to_id AND cb.box_id = sog.box_id
       LEFT JOIN customer cs ON cs.cust_id = sog.ship_to_id AND cs.box_id = sog.box_id
The execution plan:
https://www.brentozar.com/pastetheplan/?id=SkjhXspEs
All of this is happening on SQL Server.
I've tried reproducing the bill to and ship to customer sets as CTEs and joining to those, but found no performance benefit.
The only indexes on these tables are the primary keys (which are synthetic IDs). Somewhat curiously the execution plan analyzer is not recommending adding any indexes to either table; it usually wants me to slap indexes on almost everything.
I don't know that there necessarily IS a way to make this run faster, but I am trying to improve my query optimization and have hit the limit of my knowledge. Any insight is much appreciated.
When you run queries like yours -- queries with no WHERE filters -- the DBMS often decides it has to scan entire tables. (In SQL Server execution plans, "clustered index scan" means it is scanning the whole table.) It certainly has to wrangle all the data in the tables. The lookup table you want to create is often called a "materialized view." (An online version of SQL Server has built-in support for materialized views, but other versions still don't.)
Depending on how you will use your data, you may be better off avoiding this materialized lookup table. If all your uses of the proposed lookup table involve filtering out a small subset of rows with WHERE clauses, an ordinary non-materialized view may be a good choice. When you issue queries against ordinary views, the query planner folds the view definitions into the query, and may recommend helpful indexes.
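For example, a minimal sketch of that ordinary view, using the tables and columns from the posted query (the view name is invented):
CREATE VIEW sales_order_customers AS
SELECT DISTINCT
       sog.box_id,
       sog.sales_order_id,
       cb.cust_id       AS bill_to_customer_id,
       cb.customer_name AS bill_to_customer_name,
       cs.cust_id       AS ship_to_customer_id,
       cs.customer_name AS ship_to_customer_name
FROM   sales_order sog
       LEFT JOIN customer cb ON cb.cust_id = sog.bill_to_id AND cb.box_id = sog.box_id
       LEFT JOIN customer cs ON cs.cust_id = sog.ship_to_id AND cs.box_id = sog.box_id;

-- A filtered query lets the planner fold the view into the plan:
-- SELECT * FROM sales_order_customers WHERE box_id = 3 AND sales_order_id = 12345;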

view over cte or temp table?

I have a table in the staging layer (not indexed) which has close to 100 million rows. In the data warehouse layer, I need to select a certain number of rows from this table and join them with another table of roughly 50 million rows, for which I use a CTE now. From this CTE, some aggregations are then carried out before joining with some other tables. So here, what will happen if I use a view instead of the CTE? I cannot test-run it, since it takes a lot of time.
So, in general terms, which holds a slight advantage in terms of performance?
CTE, temp table, or view?
Any help is appreciated.
I think you should go with a local (single #) temporary table with indexes, because first you will fetch the data from the main table, and then you will apply some aggregations, looping and custom logic. There will be a few benefits (see the sketch after this list):
First, whenever the connection closes, the local temporary table is dropped automatically.
Second, as you say you will get millions of records, an index will make searching faster.
Third, using a temporary table and applying aggregations there will not put load on your main tables.
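A minimal T-SQL sketch of that approach (the table, column, and filter names are invented for illustration):
-- Materialize the filtered staging rows once, into a local temp table
SELECT s.key_col, s.amount
INTO   #staged
FROM   staging_table s
WHERE  s.load_date >= '20230101';      -- hypothetical filter

-- Index the temp table to speed up the later join and aggregation
CREATE CLUSTERED INDEX ix_staged_key ON #staged (key_col);

SELECT d.dim_col, SUM(s.amount) AS total_amount
FROM   #staged s
       JOIN dw_table d ON d.key_col = s.key_col
GROUP BY d.dim_col;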
From what you describe the CTE is "created" once (when defined) and used once (when aggregated).
In general, this means that you should keep the code as a single query, letting the optimizer find the best execution path.
In general, materializing CTEs is going to be a bigger win when the CTE is referenced multiple times. Often, you can get around multiple references using window functions, but that is a different matter.
That is general advice, but not always true. Materializing a CTE as a temporary table can give two benefits:
The query optimizer has a more accurate estimate of the number of rows for optimization.
You can add indexes to boost performance.
The first is possibly not an issue, because you still have a large percentage of the original rows. The second could possibly help, but it is not a no-brainer.
You might want to create an indexed materialized view instead of a temporary table. This would stay up to date automatically and could be a big boost to performance.
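In SQL Server terms that is an indexed view: a view created WITH SCHEMABINDING plus a unique clustered index. A minimal sketch with invented names (indexed views carry restrictions, e.g. two-part table names and COUNT_BIG(*) when grouping):
CREATE VIEW dbo.v_staged_totals
WITH SCHEMABINDING
AS
SELECT key_col,
       SUM(amount)  AS total_amount,   -- amount must be declared NOT NULL
       COUNT_BIG(*) AS row_cnt         -- required when the view uses GROUP BY
FROM   dbo.staging_table
GROUP BY key_col;
GO

-- The unique clustered index is what actually materializes the view
CREATE UNIQUE CLUSTERED INDEX ix_v_staged_totals
    ON dbo.v_staged_totals (key_col);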

Complexity comparison between temporary table + index creation vice multi-table group by without index

I have two potential roads to take on the following problem; the try-it-and-see methodology won't pay off for this solution, as the load on the server is constantly in flux. The two approaches I have are as follows:
select *
from
(
    select foo.a, bar.b, baz.c
    from   foo, bar, baz
    -- updated for clarity's sake
    where  foo.a = bar.b
    and    bar.b = baz.c
)
group by a, b, c
vice
create table results as
select foo.a, bar.b, baz.c
from   foo, bar, baz
where  foo.a = bar.b
and    bar.b = baz.c;
create index results_spanning on results(a, b, c);
select * from results group by a, b, c;
So in case it isn't clear: the top query performs the GROUP BY outright against the multi-table select, thus preventing me from using an index. The second approach lets me create a new table that stores the results of the query, then create a spanning index, and finally run the GROUP BY query so that it can utilize the index.
What is the complexity difference between these two approaches, i.e. how do they scale, and which is preferable for large quantities of data? Also, the main issue is the performance of the overall select, so that is what I am attempting to fix here.
Comments
Are you really doing a CROSS JOIN on three tables? Are those three columns indexed in their own right? How often do you want to run the query which delivers the end result?
1) No.
2) Yes, where clause omitted for the sake of discussion as this is clearly a super trivial example
3) Doesn't matter.
2nd Update
This is a temporary table as it is only valid for a brief moment in time, so yes this table will only be queried against one time.
If your query is executed frequently and unacceptably slow, you could look into creating materialized views to pre-compute the results. This gives you the benefit of an indexable "table", without the overhead of creating a table every time.
You'll need to refresh the materialized view (preferably fast, if the tables are large) either on commit or on demand. There are some restrictions on how you can create on-commit, fast-refreshable views, and they will add slightly to your commit-time processing, but they will always give the same result as running the base query. On-demand MVs will become stale as the underlying data changes, until they are refreshed. You'll need to determine whether this is acceptable or not.
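A minimal Oracle sketch of the on-commit, fast-refreshable variant, using the toy tables from the question (for a join-only MV, fast refresh needs materialized view logs with rowids, and the rowid of every base table in the select list):
create materialized view log on foo with rowid;
create materialized view log on bar with rowid;
create materialized view log on baz with rowid;

create materialized view results_mv
refresh fast on commit
as
select foo.a, bar.b, baz.c
     , foo.rowid as foo_rid
     , bar.rowid as bar_rid
     , baz.rowid as baz_rid
from   foo, bar, baz
where  foo.a = bar.b
and    bar.b = baz.c;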
So the question is, which is quicker?
Run a query once and sort the result set?
Run a query once to build a table, then build an index, then run the query again and sort the result set?
Hmmm. Tricky one.
The use cases for temporary tables are pretty rare in Oracle. They normally only apply when we need to freeze a result set which we are then going to query repeatedly. That is apparently not the case here.
So, take the first option and just tune the query if necessary.
The answer is, as is so often the case with tuning questions, it depends.
Why are you doing a GROUP BY in the first place? The query as you posted it doesn't do any aggregation, so the only reason for a GROUP BY would be to eliminate duplicate rows, i.e. a DISTINCT operation. If this is actually the case then you are doing some form of cartesian join, and one way of tuning the query would be to fix the WHERE clause so that it only returns discrete records.

Which one have better performance : Derived Tables or Temporary Tables

Sometimes we can write a query with either a derived table or a temporary table. My question is: which one is better, and why?
Derived table is a logical construct.
It may be stored in tempdb, built at runtime by reevaluating the underlying statement each time it is accessed, or even optimized away entirely.
Temporary table is a physical construct. It is a table in tempdb that is created and populated with the values.
Which one is better depends on the query they are used in, the statement that is used to derive a table, and many other factors.
For instance, CTE (common table expressions) in SQL Server can (and most probably will) be reevaluated each time they are used. This query:
WITH q (uuid) AS
(
SELECT NEWID()
)
SELECT *
FROM q
UNION ALL
SELECT *
FROM q
will most probably yield two different NEWID() values.
In this case, a temporary table should be used instead, since it guarantees that its values persist.
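A sketch of the temp table equivalent: NEWID() is evaluated once at insert time, so both branches of the UNION read the same stored value:
-- Evaluate NEWID() once and persist the result
SELECT NEWID() AS uuid
INTO   #q;

SELECT * FROM #q
UNION ALL
SELECT * FROM #q;   -- same uuid in both rows

DROP TABLE #q;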
On the other hand, this query:
SELECT *
FROM (
    SELECT *, ROW_NUMBER() OVER (ORDER BY id) AS rn
    FROM   master
) q
WHERE rn BETWEEN 80 AND 100
is better with a derived table, because using a temporary table will require fetching all values from master, while this solution will just scan the first 100 records using the index on id.
It depends on the circumstances.
Advantages of derived tables:
A derived table is part of a larger, single query, and will be optimized in the context of the rest of the query. This can be an advantage, if the query optimization helps performance (it usually does, with some exceptions). Example: if you populate a temp table, then consume the results in a second query, you are in effect tying the database engine to one execution method (run the first query in its entirety, save the whole result, run the second query) where with a derived table the optimizer might be able to find a faster execution method or access path.
A derived table only "exists" in terms of the query execution plan - it's purely a logical construct. There really is no table.
Advantages of temp tables
The table "exists" - that is, it's materialized as a table, at least in memory, which contains the result set and can be reused.
In some cases, performance can be improved or blocking reduced when you have to perform some elaborate transformation on the data - for example, if you want to fetch a 'snapshot' set of rows out of a base table that is busy, and then do some complicated calculation on that set, there can be less contention if you get the rows out of the base table and unlock it as quickly as possible, then do the work independently. In some cases the overhead of a real temp table is small relative to the advantage in concurrency.
I want to add an anecdote here, as it leads me to advise the opposite of the accepted answer. I agree with the thinking presented in the accepted answer, but it is mostly theoretical. My experience has led me to recommend temp tables over derived tables, common table expressions and table-valued functions. We used derived tables and common table expressions extensively, with much success, based on thoughts consistent with the accepted answer, until we started dealing with larger result sets and/or more complex queries. Then we found that the optimizer did not optimize well with the derived table or CTE.
I looked at an example today that ran for 10:15. I inserted the results from the derived table into a temp table, joined the temp table in the main query, and the total time dropped to 0:03. Usually when we see a big performance problem we can quickly address it this way. For this reason I recommend temp tables, unless your query is relatively simple and you are certain it will not be processing large data sets.
The big difference is that you can put constraints, including a primary key, on a temporary table. For big result sets (I mean millions of records) you can sometimes get better performance with temporary tables. I have a key query that needs 5 joins (each join happens to be similar). Performance was OK with 2 joins, but on the third, performance went bad and the query plan went crazy. Even with hints I could not correct the query plan. I tried restructuring the joins as derived tables, with the same performance issues. With temporary tables I can create a primary key (and sort on the PK when I first populate the table). When SQL could join the 5 tables and use the PK, performance went from minutes to seconds. I wish SQL supported constraints on derived tables and CTEs (even if only a PK).
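A T-SQL sketch of that pattern (the names are invented; the point is declaring the PRIMARY KEY before the load, and loading in key order):
-- Explicit CREATE TABLE so a primary key can be declared
CREATE TABLE #keys
(
    key_col INT NOT NULL PRIMARY KEY
);

-- Populate pre-sorted on the PK to keep the insert cheap
INSERT INTO #keys (key_col)
SELECT DISTINCT key_col
FROM   big_table
ORDER BY key_col;

-- Later joins can use the PK, and the optimizer knows key_col is unique
SELECT b.*
FROM   big_table b
       JOIN #keys k ON k.key_col = b.key_col;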

SQL Server: Inline Table-Value UDF vs. Inline View

I'm working with a medical-record system that stores data in a construct that resembles a spreadsheet--date/time in column headers, measurements (e.g. physician name, Rh, blood type) in first column of each row, and a value in the intersecting cell. Reports that are based on this construct often require 10 or more of these measures to be displayed.
For reporting purposes, the dataset needs to have one row for each patient, the date/time the measurement was taken, and a column for each measurement. In essence, one needs to pivot the construct by 90 degrees.
At one point, I actually used SQL Server's PIVOT functionality to do just that. For a variety of reasons, it became apparent that this approach wouldn't work. I decided that I would use an inline view (IV) to massage the data into the desired format. The simplified query resembles:
SELECT patient_id,
datetime,
m1.value AS physician_name,
m2.value AS blood_type,
m3.value AS rh
FROM patient_table
INNER JOIN ( complex query here
WHERE measure_id=1) m1...
INNER JOIN (complex query here
WHERE measure_id=2) m2...
LEFT OUTER JOIN (complex query here
WHERE measure_id=3) m3...
As you can see, in some cases these IVs are used to restrict the resulting dataset (INNER JOIN), in other cases they do not restrict the dataset (LEFT OUTER JOIN). However, the 'complex query' part is essentially the same for each of these measure, except for the difference in measure_id. While this approach works, it leads to fairly large SQL statements, limits reuse, and exposes the query to errors.
My thought was to replace the 'complex query' and WHERE clause with an inline table-valued UDF. This would simplify the queries quite a bit, reduce errors, and increase code reuse. The only question on my mind is performance. Will the UDF approach lead to significant decreases in performance? Might it improve matters?
Thanks for your time and consideration.
A correctly defined TVF will not introduce any problems. You'll find many claims on the internet blasting TVFs for performance problems compared to views or temp tables and variables. What is usually not understood is that a TVF behaves differently from a view. A view definition is placed into the original query and then the optimizer will rearrange the query tree as it sees fit (unless the NOEXPAND hint is used on indexed views). A TVF has different semantics, and sometimes, especially when updating data, this results in the TVF output being spooled for Halloween protection. It helps to mark the function WITH SCHEMABINDING; see Improving query plans with the SCHEMABINDING option on T-SQL UDFs.
It is also important to understand the concepts of deterministic and precise functions. Although they apply mostly to scalar-valued functions, TVFs can also be affected. See User-Defined Function Design Guidelines.
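A minimal sketch of the inline TVF approach, assuming a hypothetical dbo.measurement table stands in for the 'complex query' (all names here are invented for illustration):
CREATE FUNCTION dbo.fn_measure (@measure_id INT)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
(
    SELECT patient_id, datetime, value
    FROM   dbo.measurement          -- stands in for the "complex query"
    WHERE  measure_id = @measure_id
);
GO

SELECT p.patient_id,
       m1.datetime,
       m1.value AS physician_name,
       m2.value AS blood_type,
       m3.value AS rh
FROM   dbo.patient_table p
       INNER JOIN dbo.fn_measure(1) m1
               ON m1.patient_id = p.patient_id
       INNER JOIN dbo.fn_measure(2) m2
               ON m2.patient_id = p.patient_id AND m2.datetime = m1.datetime
       LEFT OUTER JOIN dbo.fn_measure(3) m3
               ON m3.patient_id = p.patient_id AND m3.datetime = m1.datetime;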
Since you need a SQL string and may not have the ability to add a view or UDF to the system, you may want to use WITH ... AS to limit the complex query to one place (at least for this statement).
WITH complex (patient_id, datetime, measure_id, value) AS
(Select... Complex Query)
SELECT patient_id
     , datetime
     , m1.value AS physician_name
     , m2.value AS blood_type
     , m3.value AS rh
FROM patient_table
     INNER JOIN (Select ... From complex WHERE measure_id = 1) m1 ...
     INNER JOIN (Select ... From complex WHERE measure_id = 2) m2 ...
     LEFT OUTER JOIN (Select ... From complex WHERE measure_id = 3) m3 ...
You also have a third option: a traditional VIEW (assuming that you have a key to join to). In theory, there shouldn't be a performance difference between the three options, because SQL Server should evaluate and optimize the plans accordingly. The reality is that sometimes that doesn't happen as well as we'd like.
The benefit of a traditional view is that you could make it an indexed view, and give SQL Server another performance aid; however, you'll just have to test and see.
SQL Server 2005 answer:
You can reduce the inline views by using temp/var tables. The performance cost is the insert into the temp table required per hit on the query, but if the result sets are small enough, they can help. You can use primary keys on var tables, and primary keys/indexes on temp tables. Contrary to common belief, I have found a couple of articles indicating that both temp and var tables are stored in tempdb.
UDFs, we have found, are less performant when you have multi-layer UDFs in complex queries, but they maintain usability.
Be sure to create the functions correctly for the various conditions specified: those that WILL be used for inner joins, and those that will be used for left joins.
So, in general, we do use UDFs, but when we find that performance degrades, we move the query to insert UDF selections into temp/var tables and join on those.
Create functionality for ease of use/maintenance, and apply performance enhancements where and when required.
EDIT:
If you are required to run this for Crystal and you plan to use stored procedures: yes, you can execute SQL statements inside the SP to populate temp/var tables.
Let me know if you are going to use SPs. SQL will then also cache the SP plans with the given params, as required.
Also, from previous experience with Crystal, things to avoid are: grouping in Crystal that could be done in the SP, page numbers if not required, and function calls, if these can be handled on the server.