Are sql queries with WITH statement and without equal? - sql

I have two queries:
WITH table1
AS (SELECT id,
first AS table1_first,
second AS table1_second
FROM some_table)
SELECT omt.*,
t1.*
FROM one_more_table omt
INNER JOIN table1 t1
ON omt.id = t1.id;
and
SELECT omt.*,
t1.*
FROM one_more_table omt
INNER JOIN (SELECT id,
first AS table1_first,
second AS table1_second
FROM some_table) AS t1
ON omt.id = t1.id;
Tell me are this two sql queries equal?

From a logical point of view, yes they are identical.
However some DBMS apply different optimization strategies for common table expression (first query) and derived tables (second query).
If you added a where condition in the "outer" query that restricts the rows inside the CTE it might not be pushed down into the CTE and thus might yield a different execution plan.
But this depends on the DBMS being used (the above is at least true for Postgres and I think Oracle. I don't know about e.g. DB2, SQL Server or other DBMS).

Related

How to run the subquery first in presto

I have the following query:
select *
from Table1
where NUMid in (select NUMid
from Table2
where email = 'xyz#gmail.com')
My intention is to get the list of all the NUMids from table2 having an email value equal to xyz#gmail.com and use those list of NUMids to query from Table1.
In presto, the query is running the outer query first. Is there a way to run and store the result of inner query and then use it in the outer query in presto?
The optimizer can do what it likes. In this case, it should be running the inner query once and then essentially doing a JOIN (technically a "semi-join") operation.
In many databases, exists with appropriate indexes solves the performance problem.
If you want to ensure that the subquery is evaluated only once, you can move it to the ON clause. The correct equivalent query looks like:
select t1.*
from Table1 t1 join
(select distinct t2.NUMid
from Table2 t2
where t2.email = 'xyz#gmail.com'
) t2
on t1.NUMid = t2.NUMid;
The select distinct is important for the join code to be equivalent to the in code. However, if you know there are no duplicates, this is more colloquially written without a subquery:
select t1.*
from Table1 t1 join
Table2 t2
on t1.NUMid = t2.NUMid
where t2.email = 'xyz#gmail.com'
Presto and Trino (formerly known as PrestoSQL) execute that query as a "semi join" operation: it builds an in-memory index with the rows coming from the inner query and probes the rows of the outer query against that index. If value is present, the row from the outer query is emitted, otherwise, it's filtered out.
In recent versions of Trino, there's a feature called "dynamic filtering", which allows the query engine to dynamically filter and prune data for the outer query at the source based on information obtained dynamically from the inner query. You can read more about it in these blog posts:
Dynamic filtering for highly-selective join optimization
Dynamic partition pruning

Which is best to use between the IN and JOIN operators in SQL server for the list of values as table two?

I heard that the IN operator is costlier than the JOIN operator.
Is that true?
Example case for IN operator:
SELECT *
FROM table_one
WHERE column_one IN (SELECT column_one FROM table_two)
Example case for JOIN operator:
SELECT *
FROM table_one TOne
JOIN (select column_one from table_two) AS TTwo
ON TOne.column_one = TTwo.column_one
In the above query, which is recommended to use and why?
tl;dr; - once the queries are fixed so that they will yield the same results, the performance is the same.
Both queries are not the same, and will yield different results.
The IN query will return all the columns from table_one,
while the JOIN query will return all the columns from both tables.
That can be solved easily by replacing the * in the second query to table_one.*, or better yet, specify only the columns you want to get back from the query (which is best practice).
However, even if that issue is changed, the queries might still yield different results if the values on table_two.column_one are not unique.
The IN query will yield a single record from table_one even if it fits multiple records in table_two, while the JOIN query will simply duplicate the records as many times as the criteria in the ON clause is met.
Having said all that - if the values in table_two.column_one are guaranteed to be unique, and the join query is changed to select table_one.*... - then, and only then, will both queries yield the same results - and that would be a valid question to compare their performance.
So, in the performance front:
The IN operator has a history of poor performance with a large values list - in earlier versions of SQL Server, if you would have used the IN operator with, say, 10,000 or more values, it would have suffer from a performance issue.
With a small values list (say, up to 5,000, probably even more) there's absolutely no difference in performance.
However, in currently supported versions of SQL Server (that is, 2012 or higher), the query optimizer is smart enough to understand that in the conditions specified above these queries are equivalent and might generate exactly the same execution plan for both queries - so performance will be the same for both queries.
UPDATE: I've done some performance research, on the only available version I have for SQL Server which is 2016 .
First, I've made sure that Column_One in Table_Two is unique by setting it as the primary key of the table.
CREATE TABLE Table_One
(
id int,
CONSTRAINT PK_Table_One PRIMARY KEY(Id)
);
CREATE TABLE Table_Two
(
column_one int,
CONSTRAINT PK_Table_Two PRIMARY KEY(column_one)
);
Then, I've populated both tables with 1,000,000 (one million) rows.
SELECT TOP 1000000 ROW_NUMBER() OVER(ORDER BY ##SPID) As N INTO Tally
FROM sys.objects A
CROSS JOIN sys.objects B
CROSS JOIN sys.objects C;
INSERT INTO Table_One (id)
SELECT N
FROM Tally;
INSERT INTO Table_Two (column_one)
SELECT N
FROM Tally;
Next, I've ran four different ways of getting all the values of table_one that matches values of table_two. - The first two are from the original question (with minor changes), the third is a simplified version of the join query, and the fourth is a query that uses the exists operator with a correlated subquery instead of the in operaor`,
SELECT *
FROM table_one
WHERE Id IN (SELECT column_one FROM table_two);
SELECT TOne.*
FROM table_one TOne
JOIN (select column_one from table_two) AS TTwo
ON TOne.id = TTwo.column_one;
SELECT TOne.*
FROM table_one TOne
JOIN table_two AS TTwo
ON TOne.id = TTwo.column_one;
SELECT *
FROM table_one
WHERE EXISTS
(
SELECT 1
FROM table_two
WHERE column_one = id
);
All four queries yielded the exact same result with the exact same execution plan - so from it's safe to say performance, under these circumstances, are exactly the same.
You can copy the full script (with comments) from Rextester (result is the same with any number of rows in the tally table).
From the point of performance view, mostly, using EXISTS might be a better option rather than using IN operator and JOIN among the tables :
SELECT TOne.*
FROM table_one TOne
WHERE EXISTS ( SELECT 1 FROM table_two TTwo WHERE TOne.column_one = TTwo.column_one )
If you need the columns from both tables, and provided those have indexes on the column column_one used in the join condition, using a JOIN would be better than using an IN operator, since you will be able to benefit from the indexes :
SELECT TOne.*, TTwo.*
FROM table_one TOne
JOIN table_two TTwo
ON TOne.column_one = TTwo.column_one
In the above query, which is recommended to use and why?
The second (JOIN) query cannot be optimal compare to first query unless you put where clause within sub-query as follows:
Select * from table_one TOne
JOIN (select column_one from table_two where column_tow = 'Some Value') AS TTwo
ON TOne.column_one = TTwo.column_one
However, the better decision can be based on execution plan with following points into consideration:
How many tasks the query has to perform to get the result
What is task type and execution time of each task
Variance between Estimated number of row and Actual number of rows in each task - this can be fixed by UPDATED STATISTICS on TABLE if the variance too high.
In general, the Logical Processing Order of the SELECT statement goes as follows, considering that if you manage your query to read the less amount of rows/pages at higher level (as per following order) would make that query less logical I/O cost and eventually query is more optimized. i.e. It's optimal to get rows filtered within From or Where clause rather than filtering it in GROUP BY or HAVING clause.
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP

Performance of join vs pre-select on MsSQL

I can do the same query in two ways as following, will #1 be more efficient as we don't have join?
1
select table1.* from table1
inner join table2 on table1.key = table2.key
where table2.id = 1
2
select * from table1
where key = (select key from table2 where id=1)
These are doing two different things. The second will return an error if more than one row is returned by the subquery.
In practice, my guess is that you have an index on table2(id) or table2(id, key), and that id is unique in table2. In that case, both should be doing index lookups and the performance should be very comparable.
And, the general answer to performance question is: try them on your servers with your data. That is really the only way to know if the performance difference makes a difference in your environment.
I executed these two statements after running set statistics io on (on SQL Server 2008 R2 Enterprise - which supposedly has the best optimization compared to Standard).
select top 5 * from x2 inner join ##prices on
x1.LIST_PRICE = ##prices.i1
and
select top 5 * from x2 where LIST_PRICE in (select i1 from ##prices)
and the statistics matched exactly. I have always preferred the first type of join but the second allows me to select just that part and see what rows are being returned.
I was taught that joins vs subqueries are mostly equivalent when it comes to performance. I would also look at the resulting query plans to see if one is better then the other. The query plans matched exactly.
MS SQL Server is smart enough to understand that it is the same action in such a simple query.
However if you have more than 1 record in subquery then you'll probably use IN. In is slow operation and it will never work faster than JOIN. It can be the same but never faster.
The best option for your case is to use EXISTS. It will be always faster or the same as JOIN or IN operation. Example:
select * from table1 t1
where EXISTS (select * from table2 t2 where id=1 AND t1.key = t2.key)

In T-SQL is it possible to have a result set returned by a select with subselects that cannot be obtained using joins instead?

Given a SQL SELECT expression with arbitrarily nested subselect's, it always possible to rewrite said SQL expression so that it contains no subselect's and returns the same result set?
If so, is there an algorithm for doing so?
If not, is there a characterization of those SELECT expressions that cannot be rewritten?
I'm making an application that will generate SQL SELECT statements. I'm still designing how it will work at this point. Here's the general idea, though:
The user will select what columns are displayed, how the results are sorted and how they are restricted.
The columns will not just be SQL columns but named objects such that the object can contain a SQL expression with column variables from multiple tables. These objects will contain information on how to join to each other.
I want to make the configuration of these expressions to be as flexible as possible; if it's possible to write the SELECT statement that returns some result set S, then I'd like the application to be able to generate a SELECT statement that returns S. One thing that's possible in SQL are sub-selects. I've read that rewriting said sub-select's with JOINS is better performance wise. Therefore I am considering disallowing sub-select's in the configuration. However I do not want to do this unless every sub-select can be rewritten as a join.
Subselects in the WHERE clause can often be impossible to rewrite as JOIN, especially if aggregate functions are in use.
Quoted from here:
Here is an example of a common-form subquery comparison which you can't do with a join: find all the values in table t1 which are equal to a maximum value in table t2.
SELECT column1 FROM t1
WHERE column1 = (SELECT MAX(column2) FROM t2);
Here is another example, which again is impossible with a join because it involves aggregating for one of the tables: find all rows in table t1 which contain a value which occurs twice.
SELECT * FROM t1
WHERE 2 = (SELECT COUNT(column1) FROM t1);
Therefore, if a complex subselect in the SELECT clause itself has subselects in its WHERE clause, that could be impossible to express as a JOIN.
SELECT T2.B, (SELECT A from t1 where t1.ID=T2.ID
and 2=(SELECT COUNT(A) from t1 as TX WHERE TX.A=T1.A))
FROM T2

How does the IN predicate work in SQL?

After prepairing an answer for this question I found I couldn't verify my answer.
In my first programming job I was told that a query within the IN () predicate gets executed for every row contained in the parent query, and therefore using IN should be avoided.
For example, given the query:
SELECT count(*) FROM Table1 WHERE Table1Id NOT IN (
SELECT Table1Id FROM Table2 WHERE id_user = 1)
Table1 Rows | # of "IN" executions
----------------------------------
10 | 10
100 | 100
1000 | 1000
10000 | 10000
Is this correct? How does the IN predicate actually work?
The warning you got about subqueries executing for each row is true -- for correlated subqueries.
SELECT COUNT(*) FROM Table1 a
WHERE a.Table1id NOT IN (
SELECT b.Table1Id FROM Table2 b WHERE b.id_user = a.id_user
);
Note that the subquery references the id_user column of the outer query. The value of id_user on each row of Table1 may be different. So the subquery's result will likely be different, depending on the current row in the outer query. The RDBMS must execute the subquery many times, once for each row in the outer query.
The example you tested is a non-correlated subquery. Most modern RDBMS optimizers worth their salt should be able to tell when the subquery's result doesn't depend on the values in each row of the outer query. In that case, the RDBMS runs the subquery a single time, caches its result, and uses it repeatedly for the predicate in the outer query.
PS: In SQL, IN() is called a "predicate," not a statement. A predicate is a part of the language that evaluates to either true or false, but cannot necessarily be executed independently as a statement. That is, you can't just run this as an SQL query: "2 IN (1,2,3);" Although this is a valid predicate, it's not a valid statement.
It will entirely depend on the database you're using, and the exact query.
Query optimisers are very smart at times - in your sample query, I'd expect the better databases to be able to use the same sort of techniques that they do with a join. More naive databases may just execute the same query many times.
This depends on the RDBMS in question.
See detailed analysis here:
MySQL, part 1
MySQL, part 2
SQL Server
Oracle
PostgreSQL
In short:
MySQL will optimize the query to this:
SELECT COUNT(*)
FROM Table1 t1
WHERE NOT EXISTS
(
SELECT 1
FROM Table2 t2
WHERE t2.id_user = 1
AND t2.Table1ID = t1.Table2ID
)
and run the inner subquery in a loop, using the index lookup each time.
SQL Server will use MERGE ANTI JOIN.
The inner subquery will not be "executed" in a common sense of word, instead, the results from both query and subquery will be fetched concurrently.
See the link above for detailed explanation.
Oracle will use HASH ANTI JOIN.
The inner subquery will be executed once, and a hash table will be built from the resultset.
The values from the outer query will be looked up in the hash table.
PostgreSQL will use NOT (HASHED SUBPLAN).
Much like Oracle.
Note that rewriting the query as this:
SELECT (
SELECT COUNT(*)
FROM Table1
) -
(
SELECT COUNT(*)
FROM Table2 t2
WHERE (t2.id_user, t2.Table1ID) IN
(
SELECT 1, Table1ID
FROM Table1
)
)
will greatly improve the performance in all four systems.
Depends on optimizer. Check exact query plan for each particular query to see how the RDBMS will actually execute that.
In Oracle that'd be:
EXPLAIN PLAN FOR «your query»
In MySQL or PostgreSQL
EXPLAIN «your query»
Most SQL engines nowadays will almost always create the same execution plan for LEFT JOIN, NOT IN and NOT EXISTS
I would say look at your execution plan and find out :-)
Also if you have NULL values for the Table1Id column you will not get any data back
Not really. But it's butter to write such queries using JOIN
Yes, but execution stops as soon as the query processer "finds" the value you are looking for... So if, for example the first row in the outer select has Table1Id = 32, and if Table2 has a record with a TableId = 32, then
as soon as the subquery finds the row in Table2 where TableId = 32, it stops...