Merge into CTE vs Merge into Table - sql

I have a question regarding performance of Merge into CTE against Merge into Table.
My scenario is as below:
I am facing an issue in existing stored procedure which is responsible to merge data into table from staging source table. The target table has around 20 million records.
Merging is performed into CTE which is subset of target table filtered on date range. This date range matches with the date range of source table. However, due to incorrect logic implementation, the source table sometimes contain data out of the date range as well. This results in unwanted & duplicate data insertion into target table.
The logic in the stored procedure is now corrected and this change is expected to fix the issue. But, to be sure that, such kind of issue never occur again, is it advisable to merge into table instead of CTE? Considering the size of the target table (~20M rows), what will be the performance trade-off? What are the pros and cons to merge into table considering performance point of view?

Microsoft advises against using CTEs as targets.
Use the WITH clause to filter out rows from
the source or target tables. This method is similar to specifying
additional search criteria in the ON clause and may produce incorrect
results. We recommend that you avoid using this method or test
thoroughly before implementing it.
See for more details: https://learn.microsoft.com/en-us/sql/t-sql/statements/merge-transact-sql?view=sql-server-ver15#join-best-practices
From performance standpoint I do not think it would make any difference. I have never tested this though.

Related

SQL query timings

Does it take more time to create a table using select as statement or to just run the select statement? Is the time difference too large or can it be neglected?
For example between
create table a
as select * from b
where c = d;
and
select * from b
where c = d;
which one should run faster and can the time difference be neglected?
Creating the table will take more time. There is more overhead. If you look at the metadata for your database, you will find lots of tables or views that contain information about tables. In particular, the table names and column names need to be stored.
That said, the data processing effort is pretty similar. However, there might be overhead in storing the result set in permanent storage rather than in the data structures needed for a result set. In fact, the result set may never need to be stored "on disk" (i.e. permanently). But with a table creation, that is needed.
Depending on the database, the two queries might also be optimized differently. The SELECT query might be optimized to return the first row as fast as possible. The CREATE query might be optimized to return all rows as fast as possible. Also, the SELECT quite might just look faster if your database and interface start returning rows when they first appear.
I should point out that under most circumstances, the overhead might not really be noticeable. But, you can get other errors with the create table statement that you would not get with just the select. For instance, the table might already exist. Or duplicate column names might pose a problem (although some databases don't allow duplicate column names in result sets either).

Pre-Staging Data Solution

I have been tasked with replacing a costly stored procedure which performs calculations across 10 - 15 tables, some of which contain many millions of rows. The plan is to pre-stage the many computations and store the results in separate tables for speeding reading.
Having quickly created these new tables and inserted all of the necessary pre-staged data as a test case, the execution time of getting the same results is vastly improved, as you would expect.
My question is, what is the best practice for keeping these new separate tables up to date?
A procedure which runs at a specific interval could do it, but there
is a requirement for the data to be live.
A trigger on each table could do it, but that seems very costly, and
could cause slow-downs for everywhere else that uses these tables.
Are there other alternatives?
Have you considered Indexed Views for this? As long as you meet the criteria for creating Indexed Views (no self joins etc) it may well be a good solution.
The downsides of Indexed Views are that when the data in underlying tables is changed (delete, update, insert) then it will have to recalculate the indexed view. This can slow down these types of operations in certain circumstances so you have to be careful. I've put some links to documentation below;
https://www.brentozar.com/archive/2013/11/what-you-can-and-cant-do-with-indexed-views/
https://msdn.microsoft.com/en-GB/library/ms191432.aspx
https://technet.microsoft.com/en-GB/library/ms187864(v=sql.105).aspx
what is the best practice for keeping these new separate tables up to date?
Answer is it depends .Depends on what ..?
1.How frequently you will use those computed values
2.what is the acceptable data latency
we to have same kind of reporting where we store computed values in seperate tables and use them in reports.In our case we run this sps before sending the reports through SQL server agent
Consider using an A/B table solution. Place a generic view on over the _A table version (CREATE VIEW MY_TABLE AS SELECT * FROM MY_TABLE_A). And then you rebuild the _B version, and then switch the view to the _B version (CREATE VIEW MY_TABLE AS SELECT * FROM MY_TABLE_B). It takes twice as much space for processing, but it gives you the opportunity to build your tables without down-time.

Complexity comparison between temporary table + index creation vice multi-table group by without index

I have two potential roads to take on the following problem, the try it and see methodology won't pay off for this solution as the load on the server is constantly in flux. The two approaches I have are as follows:
select *
from
(
select foo.a,bar.b,baz.c
from foo,bar,baz
-- updated for clarity sake
where foo.a=b.bar
and b.bar=baz.c
)
group by a,b,c
vice
create table results as
select foo.a,bar.b,baz.c
from foo,bar,baz
where foo.a=b.bar
and b.bar=baz.c ;
create index results_spanning on results(a,b,c);
select * from results group by a,b,c;
So in case it isn't clear. The top query performs the group by outright against the multi-table select thus preventing me from using an index. The second query allows me to create a new table that stores the results of the query, proceeding to create a spanning index, then finishing the group by query to utilize the index.
What is the complexity difference of these two approaches, i.e. how do they scale and which is preferable in the case of large quantities of data. Also, the main issue is the performance of the overall select so that is what I am attempting to fix here.
Comments
Are you really doing a CROSS JOIN on three tables? Are those three
columns indexed in their own right? How often do you want to run the
query which delivers the end result?
1) No.
2) Yes, where clause omitted for the sake of discussion as this is clearly a super trivial example
3) Doesn't matter.
2nd Update
This is a temporary table as it is only valid for a brief moment in time, so yes this table will only be queried against one time.
If your query is executed frequently and unacceptably slow, you could look into creating materialized views to pre-compute the results. This gives you the benefit of an indexable "table", without the overhead of creating a table every time.
You'll need to refresh the materialized view (preferably fast if the tables are large) either on commit or on demand. There are some restrictions on how you can create on commit, fast refreshable views, and they will add to your commit time processing slightly, but they will always give the same result as running the base query. On demand MVs will become stale as the underlying data changes until these are refreshed. You'll need to determine whether this is acceptable or not.
So the question is, which is quicker?
Run a query once and sort the result set?
Run a query once to build a table, then build an index, then run the query again and sort the result set?
Hmmm. Tricky one.
The use cases for temporary tables are pretty rare in Oracle. They normally onlya apply when we need to freeze a result set which we are then going to query repeatedly. That is apparently not the case here.
So, take the first option and just tune the query if necessary.
The answer is, as is so often the case with tuning questions, it depends.
Why are you doing a GROUP BY in the first place. The query as you posted it doesn't do any aggregation so the only reason for doing GROUP BY woudl be to eliminate duplicate rows, i.e. a DISTINCT operation. If this is actually the case then you doing some form of cartesian join and one tuning the query would be to fix the WHERE clause so that it only returns discrete records.

Select * from table vs Select col1,col2,col3 from table [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
select * vs select column
I was just having a discussion with one of my colleague on the SQL Server performance on specifying the query command in the stored procedure.
So I want to know which one is preferred over another and whats the concrete reason behind that.
Suppose, We do have one table called
Employees(EmpName,EmpAddress)
And we want to select all the records from the table. So we can write the query in two ways,
Select * from Employees
Select EmpName, EmpAddress from Employees
So I would like to know is there any specific difference or performance issue in the above queries or are they just equal to the SQL Server Engine.
UPDATE:
Lets say the table schema won't change anymore. So no point for future maintenance.
Performance wise, lets say, the usage is very very high i.e. millions of hits per seconds on the database server. I want a clear and precise performance rating on both approaches.
No Indexing is done on the entire table.
The specific difference would show its ugly head if you add a column to the table.
Suddenly, the query you expected to return two columns now returns three. If you coded specifically for the two columns, the rest of your code is now broken.
Performance-wise, there shouldn't be a difference.
I always take the approach that being as specific as possible is the best when dealing with databases. If the table has two columns and you only need those two columns, be specific. Specify those two columns. It'll save you headaches in the future.
I am an avid avokat of the "be as specific as possible" rule, too. Not following it will hurt you in the long run. However, your question seems to be coming from a different background, so let me attempt to answer it.
When you submit a query to SQL Server it goes through several stages:
transmitting of query string over the network.
parsing of query string, producing a parse-tree
linking the referenced objects in the parse tree to existing objects
optimizing based on statistics and row count/size estimates
executing
transmitting of result data over the network
Let's look at each one:
The * query is a few bytes shorter, so step this will be faster
The * query contains fewer "tokens" so this should(!) be faster
During linking the list of columns need to be puled and compared to the query string. Here the "*" gets resolved to the actual column reference. Without access to the code it is impossible to say which version takes less cycles, however the amount of data accessed is about the same so this should be similar.
-6. In these stages there is no difference between the two example queries, as they will both get compiled to the same execution plan.
Taking all this into account, you will probably save a few nanoseconds when using the * notation. However, you example is very simplistic. In a more complex example it is possible that specifying as subset of columns of a table in a multi table join will lead to a different plan than using a *. If that happens we can be pretty certain that the explicit query will be faster.
The above comparison also assumes that the SQL Server process is running alone on a single processor and no other queries are submitted at the same time. If the process has to yield during the compilation those extra cycles will be far more than the ones we are trying to save.
So, the amont of saving we are talking about is very minute compared to the actual execution time and should not be used as an excuse for a "bad" coding practice.
I hope this answers your question.
You should always reference columns explicitly. This way, if the table structure changes (and such changes are made in an intelligent, backward-compatible way), your queries will continue to work and can be modified over time.
Also, unless you actually need all of the columns from the table (not typical), using SELECT * is bringing more data to your application than is necessary, and potentially forcing a clustered index scan instead of what might have been satisfied by a narrower covering index.
Bad habits to kick : using SELECT * / omitting the column list
Performance wise there are no difference between those 2 i think.But those 2 are used in different cases what may be the difference.
Consider a slightly larger table.If your table(Employees) contains 10 columns,then the 1st query will retain all of the information of the table.But for 2nd query,you may specify which columns information you need.So when you need all of the information of employees no.1 is the best one rather than specifying all of the column names.
Ofcourse,when you need to ALTER a table then those 2 would not be equal.

Which one have better performance : Derived Tables or Temporary Tables

Sometimes we can write a query with both derived table and temporary table. my question is that which one is better? why?
Derived table is a logical construct.
It may be stored in the tempdb, built at runtime by reevaluating the underlying statement each time it is accessed, or even optimized out at all.
Temporary table is a physical construct. It is a table in tempdb that is created and populated with the values.
Which one is better depends on the query they are used in, the statement that is used to derive a table, and many other factors.
For instance, CTE (common table expressions) in SQL Server can (and most probably will) be reevaluated each time they are used. This query:
WITH q (uuid) AS
(
SELECT NEWID()
)
SELECT *
FROM q
UNION ALL
SELECT *
FROM q
will most probably yield two different NEWID()'s.
In this case, a temporary table should be used since it guarantees that its values persist.
On the other hand, this query:
SELECT *
FROM (
SELECT *, ROW_NUMBER() OVER (ORDER BY id) AS rn
FROM master
) q
WHERE rn BETWEEN 80 AND 100
is better with a derived table, because using a temporary table will require fetching all values from master, while this solution will just scan the first 100 records using the index on id.
It depends on the circumstances.
Advantages of derived tables:
A derived table is part of a larger, single query, and will be optimized in the context of the rest of the query. This can be an advantage, if the query optimization helps performance (it usually does, with some exceptions). Example: if you populate a temp table, then consume the results in a second query, you are in effect tying the database engine to one execution method (run the first query in its entirety, save the whole result, run the second query) where with a derived table the optimizer might be able to find a faster execution method or access path.
A derived table only "exists" in terms of the query execution plan - it's purely a logical construct. There really is no table.
Advantages of temp tables
The table "exists" - that is, it's materialized as a table, at least in memory, which contains the result set and can be reused.
In some cases, performance can be improved or blocking reduced when you have to perform some elaborate transformation on the data - for example, if you want to fetch a 'snapshot' set of rows out of a base table that is busy, and then do some complicated calculation on that set, there can be less contention if you get the rows out of the base table and unlock it as quickly as possible, then do the work independently. In some cases the overhead of a real temp table is small relative to the advantage in concurrency.
I want to add an anecdote here as it leads me to advise the opposite of the accepted answer. I agree with the thinking presented in the accepted answer but it is mostly theoretical. My experience has lead me to recommend temp tables over derived tables, common table expressions and table value functions. We used derived tables and common table expressions extensively with much success based on thoughts consistent with the accepted answer until we started dealing with larger result sets and/or more complex queries. Then we found that the optimizer did not optimize well with the derived table or CTE.
I looked at an example today that ran for 10:15. I inserted the results from the derived table into a temp table and joined the temp table in the main query and the total time dropped to 0:03. Usually when we see a big performance problem we can quickly address it this way. For this reason I recommend temp tables unless your query is relatively simple and you are certain it will not be processing large data sets.
The big difference is that you can put constraints including a primary key on a temporary table. For big (I mean millions of records) sometime you can get better performance with temporary. I have the key query that needs 5 joins (each joins happens to be similar). Performance was OK with 2 joins and then on the third performance went bad and query plan went crazy. Even with hints I could not correct the query plan. Tried restructuring the joins as derived tables and still same performance issues. With with temporary tables can create a primary key (then when I populate first sort on PK). When SQL could join the 5 tables and use the PK performance went from minutes to seconds. I wish SQL would support constraints on derived tables and CTE (even if only a PK).