Reusing select subquery/result - sql

I am trying to optimize the speed of a query which uses a redundant query block. I am trying to do a row-wise join in SQL Server 2008 using the query below.
Select * from
(<complex subquery>) cq
join table1 t1 on (cq.id=t1.id)
union
Select * from
(<complex subquery>) cq
join table2 t2 on (cq.id=t2.id)
The <complex subquery> is exactly the same on both the union sub query pieces except we need to join it with multiple different tables to obtain the same columnar data.
Is there any way I can rewrite the query to make it faster without using a temporary table to cache results?

Why not use a temporary table and see if that improves the execution stats?
In some circumstances the Query Optimizer will automatically add a spool to the plan that caches subqueries. This is basically just a temporary table, though.
Have you checked your current plan to be sure that the query is actually being evaluated more than once?

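Without a concrete example it's difficult to help, but try a common table expression (a WITH clause), something like:
WITH cq AS (
<complex subquery>
)
Select * from
cq
join table1 t1 on (cq.id=t1.id)
union
Select * from
cq
join table2 t2 on (cq.id=t2.id)
It sometimes speeds things up no end, although SQL Server does not necessarily materialize a CTE, so check the plan to see whether the subquery is still evaluated twice.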

Does the nature of your query allow you to invert it? Instead of "(join, join) union", do "(union) join"? Like:
Select * from
(<complex subquery>) cq
join (
Select * from table1 t1
union
Select * from table2 t2
) ts
on cq.id=ts.id
I'm not really sure that double evaluation of your complex subquery is actually what's wrong. But, as per your question, this is a form of the query that would encourage SQL Server to evaluate <complex subquery> only once.
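The "(union) join" rewrite above can be sanity-checked on toy data before trying it on the real query. The following is only a sketch: it uses Python's built-in sqlite3 module rather than SQL Server, and the `base` table, the `val` and `info` columns, and the sample rows are all made up for illustration, with `base` standing in for <complex subquery>. The two forms agree here because the stand-in subquery returns no duplicate rows; if it did, UNION would remove duplicates that the inverted form keeps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "base" is a hypothetical stand-in for <complex subquery>
    CREATE TABLE base   (id INTEGER PRIMARY KEY, val TEXT);
    CREATE TABLE table1 (id INTEGER, info TEXT);
    CREATE TABLE table2 (id INTEGER, info TEXT);
    INSERT INTO base   VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO table1 VALUES (1, 'x'), (2, 'y');
    INSERT INTO table2 VALUES (2, 'y'), (3, 'z');
""")

# Original shape: the subquery text appears twice, once per UNION branch.
join_then_union = sorted(conn.execute("""
    SELECT cq.id, cq.val, t1.info
    FROM (SELECT * FROM base) cq JOIN table1 t1 ON cq.id = t1.id
    UNION
    SELECT cq.id, cq.val, t2.info
    FROM (SELECT * FROM base) cq JOIN table2 t2 ON cq.id = t2.id
""").fetchall())

# Inverted shape: the subquery appears once and joins to the unioned tables.
union_then_join = sorted(conn.execute("""
    SELECT cq.id, cq.val, ts.info
    FROM (SELECT * FROM base) cq
    JOIN (SELECT id, info FROM table1
          UNION
          SELECT id, info FROM table2) ts ON cq.id = ts.id
""").fetchall())

print(join_then_union == union_then_join)
```

Whether the inverted form is actually faster on SQL Server still has to be confirmed from the execution plan.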

Related

How to run the subquery first in presto

I have the following query:
select *
from Table1
where NUMid in (select NUMid
from Table2
where email = 'xyz#gmail.com')
My intention is to get the list of all the NUMids from Table2 having an email value equal to xyz#gmail.com and use that list of NUMids to query Table1.
In presto, the query is running the outer query first. Is there a way to run and store the result of inner query and then use it in the outer query in presto?
The optimizer can do what it likes. In this case, it should be running the inner query once and then essentially doing a JOIN (technically a "semi-join") operation.
In many databases, exists with appropriate indexes solves the performance problem.
If you want to ensure that the subquery is evaluated only once, you can move it to the FROM clause. The correct equivalent query looks like:
select t1.*
from Table1 t1 join
(select distinct t2.NUMid
from Table2 t2
where t2.email = 'xyz#gmail.com'
) t2
on t1.NUMid = t2.NUMid;
The select distinct is important for the join code to be equivalent to the in code. However, if you know there are no duplicates, this is more colloquially written without a subquery:
select t1.*
from Table1 t1 join
Table2 t2
on t1.NUMid = t2.NUMid
where t2.email = 'xyz#gmail.com'
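Why the select distinct matters can be seen on a tiny data set. This is a sketch using Python's built-in sqlite3 module instead of Presto; the `payload` column and the sample rows are invented for the illustration, and Table2 deliberately contains the same NUMid twice for the same email.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (NUMid INTEGER, payload TEXT);
    CREATE TABLE Table2 (NUMid INTEGER, email TEXT);
    INSERT INTO Table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    -- NUMid 1 appears twice for the same email (hypothetical duplicate)
    INSERT INTO Table2 VALUES (1, 'xyz#gmail.com'), (1, 'xyz#gmail.com'),
                              (2, 'other#gmail.com');
""")

def ids(sql):
    """Run a query and return the first column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))

in_ids = ids("""
    SELECT t1.NUMid FROM Table1 t1
    WHERE t1.NUMid IN (SELECT NUMid FROM Table2 WHERE email = 'xyz#gmail.com')
""")

distinct_join_ids = ids("""
    SELECT t1.NUMid FROM Table1 t1
    JOIN (SELECT DISTINCT NUMid FROM Table2
          WHERE email = 'xyz#gmail.com') t2 ON t1.NUMid = t2.NUMid
""")

plain_join_ids = ids("""
    SELECT t1.NUMid FROM Table1 t1
    JOIN Table2 t2 ON t1.NUMid = t2.NUMid
    WHERE t2.email = 'xyz#gmail.com'
""")

print(in_ids, distinct_join_ids, plain_join_ids)
```

The IN form and the DISTINCT join return each matching Table1 row once, while the plain join repeats it once per matching Table2 row.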
Presto and Trino (formerly known as PrestoSQL) execute that query as a "semi join" operation: the engine builds an in-memory index with the rows coming from the inner query and probes the rows of the outer query against that index. If the value is present, the row from the outer query is emitted; otherwise, it's filtered out.
In recent versions of Trino, there's a feature called "dynamic filtering", which allows the query engine to dynamically filter and prune data for the outer query at the source based on information obtained dynamically from the inner query. You can read more about it in these blog posts:
Dynamic filtering for highly-selective join optimization
Dynamic partition pruning

Convert multiple SQL code with multiple subqueries into a single query

I'm starting to handle an old database that was generated years ago with ACCESS. All the queries have been designed with the ACCESS query wizard and they seem to be very time consuming and I would like to improve their performance.
All queries depend on at least three subqueries and I would like to rewrite the SQL code to convert them into a single query.
Here you have an example of what I'm talking about:
This is the main query:
SELECT Subquery1.pid, Table4.SIB, Subquery1.event,
Subquery1.event_date, Subquery2.GGG, Subquery3.status FROM Subquery1
LEFT JOIN ((Table4 LEFT JOIN Subquery2 ON Table4.SIB =
Subquery2.SIB) LEFT JOIN Subquery3 ON Table4.SIB = Subquery3.SIB)
ON Subquery1.pid = Table4.PID;
This main query depends on three subqueries:
Subquery1
SELECT Table2.id, Table2.pid, Table2.npid, Table3.event_date,
Table3.event, Table3.notes, Table2.other FROM Table2 INNER JOIN Table3
ON Table2.id = Table3.subject_id WHERE (((Table2.pid) Is Not Null) AND
((Table3.event_date)>#XX/XX/XXXX#) AND ((Table3.event) Like "*AAAA" Or
(Table3.event)="BBBB")) ORDER BY Table2.pid, Table3.event_date DESC;
Subquery2
SELECT Table1.SIB, IIf(Table1.GGG Like "AAA","BBB", IIf(Table1.GGG
Like "CCC","BBB", IIf(Table1.GGG Like "DDD","DDD","EEE"))) AS GGG FROM
Table1;
Subquery3
SELECT Table5.SIB, Table5.PID, IIf(Table5.field1 Like
"1","ZZZ",IIf(Table5.field1 Like "2","ZZZ",IIf(Table5.field1 Like
"3","ZZZ",IIf(Table5.field1 Like "4","HHH",IIf(Table5.field1 Like
"5","HHH",IIf(Table5.field1 Like "6","HHH","UUU")))))) AS SSS FROM
Table5;
Which would be the best way of improving the performance of this query and converting all the subqueries into a single statement?
I can handle each subquery, but I'm having a hard time joining them together.
If this:
Table5.field1 Like "3"
is really how some of your subqueries are written (without actual wildcard characters), you can save a lot of time by changing it to
Table5.field1="3"
You can create transient tables for each subquery and then join them in the main query:
-- CREATE TRANSIENT TABLE is Snowflake syntax; in Access, a make-table
-- query (SELECT ... INTO) plays a similar role.
CREATE TRANSIENT TABLE table1 AS
<your subquery goes here>;
CREATE TRANSIENT TABLE table2 AS
<your subquery goes here>;
-- Main query to merge them into one
SELECT <column names>
FROM
table1
LEFT JOIN table2
ON table1.common_column = table2.common_column
LEFT JOIN table3
ON table1.common_column = table3.common_column;
-- Similarly, you can combine all subqueries/transient tables (table3 is created the same way)
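Another way to get a single statement is to inline each saved query as a derived table. The sketch below runs on Python's built-in sqlite3 module, not Access, so several things are assumptions: the nested IIf calls are written as CASE expressions, Like "*AAAA" becomes LIKE '%AAAA', the date cutoff and sample rows are invented, the nested LEFT JOIN is flattened into a chain, and the last column is taken from the SSS alias because the posted Subquery3 does not define a status column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (SIB INTEGER, GGG TEXT);
    CREATE TABLE Table2 (id INTEGER, pid INTEGER);
    CREATE TABLE Table3 (subject_id INTEGER, event_date TEXT, event TEXT);
    CREATE TABLE Table4 (SIB INTEGER, PID INTEGER);
    CREATE TABLE Table5 (SIB INTEGER, field1 TEXT);
    -- Hypothetical sample rows
    INSERT INTO Table2 VALUES (1, 10), (2, NULL), (3, 20);
    INSERT INTO Table3 VALUES (1, '2020-01-01', 'BBBB'),
                              (2, '2020-01-01', 'BBBB'),
                              (3, '2021-05-05', 'xxAAAA');
    INSERT INTO Table4 VALUES (100, 10);
    INSERT INTO Table1 VALUES (100, 'CCC');
    INSERT INTO Table5 VALUES (100, '4');
""")

single_query = """
    SELECT s1.pid, t4.SIB, s1.event, s1.event_date, s2.GGG, s3.SSS
    FROM (SELECT t2.pid, t3.event, t3.event_date          -- was Subquery1
          FROM Table2 t2 JOIN Table3 t3 ON t2.id = t3.subject_id
          WHERE t2.pid IS NOT NULL
            AND t3.event_date > '2019-12-31'              -- assumed cutoff
            AND (t3.event LIKE '%AAAA' OR t3.event = 'BBBB')) s1
    LEFT JOIN Table4 t4 ON s1.pid = t4.PID
    LEFT JOIN (SELECT SIB,                                -- was Subquery2
                      CASE WHEN GGG IN ('AAA', 'CCC') THEN 'BBB'
                           WHEN GGG = 'DDD' THEN 'DDD'
                           ELSE 'EEE' END AS GGG
               FROM Table1) s2 ON t4.SIB = s2.SIB
    LEFT JOIN (SELECT SIB,                                -- was Subquery3
                      CASE WHEN field1 IN ('1', '2', '3') THEN 'ZZZ'
                           WHEN field1 IN ('4', '5', '6') THEN 'HHH'
                           ELSE 'UUU' END AS SSS
               FROM Table5) s3 ON t4.SIB = s3.SIB
    ORDER BY s1.pid
"""
rows = conn.execute(single_query).fetchall()
for row in rows:
    print(row)
```

The ORDER BY inside the original Subquery1 is dropped here, since ordering inside a derived table does not determine the order of the final result. Whether inlining helps in Access itself would need to be measured.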

Performance of join vs pre-select on MsSQL

I can write the same query in the two ways below. Will #2 be more efficient, since it has no join?
1
select table1.* from table1
inner join table2 on table1.key = table2.key
where table2.id = 1
2
select * from table1
where key = (select key from table2 where id=1)
These are doing two different things. The second will return an error if more than one row is returned by the subquery.
In practice, my guess is that you have an index on table2(id) or table2(id, key), and that id is unique in table2. In that case, both should be doing index lookups and the performance should be very comparable.
And, the general answer to performance question is: try them on your servers with your data. That is really the only way to know if the performance difference makes a difference in your environment.
I executed these two statements after running set statistics io on (on SQL Server 2008 R2 Enterprise - which supposedly has the best optimization compared to Standard).
select top 5 * from x2 inner join ##prices on
x2.LIST_PRICE = ##prices.i1
and
select top 5 * from x2 where LIST_PRICE in (select i1 from ##prices)
and the statistics matched exactly. I have always preferred the first type of join but the second allows me to select just that part and see what rows are being returned.
I was taught that joins and subqueries are mostly equivalent when it comes to performance. I would also look at the resulting query plans to see if one is better than the other. The query plans matched exactly.
MS SQL Server is smart enough to understand that it is the same action in such a simple query.
However, if the subquery returns more than one record then you'll probably use IN. IN is a slow operation and it will never work faster than JOIN. It can be the same but never faster.
The best option for your case is to use EXISTS. It will always be as fast as or faster than a JOIN or IN operation. Example:
select * from table1 t1
where EXISTS (select * from table2 t2 where t2.id = 1 AND t1.key = t2.key)

SQL select IN (select) process too long why?

Let's say TABLE1 has 1 million entries in it.
Table2 has 50k entries in it.
SELECT stringVal
FROM TABLE2
WHERE idTable2=5
Result of select:
5
4
That select takes 0.02s to process
But when I use it within IN it takes up to 20.20s
SELECT count(*)
FROM TABLE1
WHERE stringVal IN (
SELECT stringVal FROM TABLE2 where idTable2=5);
If I use it like this, it processes in 0.02s
SELECT count(*)
FROM TABLE1
WHERE stringVal IN (5,4);
Can anyone explain to me how things work here?
I think your RDBMS is doing a poor job of executing your query. Other RDBMSs (e.g. SQL Server) can see that a subquery is not correlated with the outer query; they then materialize its result internally and do not execute the subquery repeatedly, e.g.
select *
, (select count(*) from tbl) -- a smart RDBMS won't execute this repeatedly
from tbl
A good RDBMS would not execute the counting repeatedly, since it is an independent query (not correlated with the outer query).
Try all of the options; there are just a few of them anyway.
1st, try EXISTS. Your RDBMS's EXISTS might be faster than its IN. I have encountered cases where IN is faster than EXISTS, though; for example: Why the most natural query (i.e. using INNER JOIN instead of LEFT JOIN) is very slow. Quassnoi made the same observation (IN is faster than EXISTS): http://explainextended.com/2009/06/16/in-vs-join-vs-exists/
SELECT count(*)
FROM TABLE1
WHERE
-- stringVal IN
EXISTS(
SELECT * -- please, don't bikeshed ;-)
FROM TABLE2
where
table1.stringVal = table2.stringVal -- simulated IN
and table2.idTable2 = 5);
2nd, try INNER JOIN, use this if there's no duplicate, or use DISTINCT to remove duplicates.
SELECT count(*)
FROM TABLE1
JOIN (
SELECT DISTINCT stringVal -- remove duplicates
FROM TABLE2
where table2.idTable2 = 5 ) as x
ON X.stringVal = table1.stringVal
3rd, try to materialize the rows yourself. I encountered the same problem with SQL Server: querying the materialized rows is faster than querying the result of another query.
Check the example of materializing the query result to a table and then using IN on the result. I found it faster than using IN on another query; you can just read the bottom part of the post: http://www.ienablemuch.com/2012/05/recursive-cte-is-evil-and-cursor-is.html
Example:
SELECT distinct stringVal -- remove duplicates
into anotherTable
FROM TABLE2
where idTable2 = 5;
SELECT count(*)
FROM TABLE1 where stringVal in (select stringVal from anotherTable);
The above works on SQL Server and PostgreSQL; on other RDBMSs it might look like this:
create table anotherTable as
SELECT distinct stringVal -- remove duplicates
FROM TABLE2
where table2.idTable2 = 5;
select count(*)
from table1 where stringVal in (select stringVal from anotherTable)
While I love subqueries (they are immensely powerful), they're also quite slow, since the query may have to be completely evaluated at each iteration, ouch! (This depends on the implementation.)
This is why they are mine/our last resort.
Some SQL implementations are quite good and will cache the subquery, though I'm not quite sure how safe that would be. Still, you have to traverse the entire structure, and if the structure isn't properly optimized it can take quadratic or even cubic time if you nest enough of them ...
SELECT stringVal
FROM TABLE2
WHERE idTable2=5
This is linear time, O(n): we will assume the database has no index or statistics to help, so it searches every row and returns all those that match the WHERE clause. (With a suitable index on idTable2 the lookup can be much cheaper.)
SELECT count(*)
FROM TABLE1
WHERE stringVal IN (
SELECT stringVal FROM TABLE2 where idTable2=5);
Assuming the subquery isn't cached, it is evaluated for each row, and if you have a lot of rows that's a lot of evaluations: many, many wasted repeated calculations. Even if it's cached, the structure may not be optimal for searching, not to mention that you are also comparing strings against a list of strings.
SELECT count(*)
FROM TABLE1
WHERE stringVal IN (5,4);
Here the IN list is a constant expression, so there's basically no overhead at all: it doesn't need to do any I/O or deal with locks or anything :)
Try this
SELECT count(*) FROM TABLE1 where EXISTS
(SELECT 1 FROM TABLE2 where idTable2=5 and stringVal = TABLE1.stringVal );
And you should create indexes on stringVal in both TABLE1 and TABLE2.
Here is a simple join that will give you the same kind of result that you were looking for. This can be applied in many different situations and this will avoid having to query against another table.
SELECT COUNT(*)
FROM TABLE1 INNER JOIN TABLE2 ON TABLE1.'COLUMN' = TABLE2.'COLUMN' AND TABLE2.IDTABLE2=5
WHERE 'WHATEVER YOU WANT'
Replace 'COLUMN' with a column that is referenced in both tables, normally an ID or primary key.
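The rewrites suggested in the answers above (IN, EXISTS, and JOIN on a DISTINCT subquery) can be compared side by side, together with the suggested indexes and a look at the plan. This is a minimal sketch on Python's built-in sqlite3 module, not on the asker's RDBMS; the index names and the generated rows are made up, and the plan text SQLite prints will differ from what another engine reports.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE1 (stringVal TEXT);
    CREATE TABLE TABLE2 (idTable2 INTEGER, stringVal TEXT);
    -- Hypothetical index names
    CREATE INDEX idx_table1_stringval ON TABLE1 (stringVal);
    CREATE INDEX idx_table2_id ON TABLE2 (idTable2, stringVal);
""")
# 1000 rows in TABLE1: the values '0'..'9', each repeated 100 times
conn.executemany("INSERT INTO TABLE1 VALUES (?)",
                 [(str(i % 10),) for i in range(1000)])
conn.executemany("INSERT INTO TABLE2 VALUES (?, ?)",
                 [(5, '5'), (5, '4'), (6, '7')])

queries = {
    "in": """SELECT count(*) FROM TABLE1
             WHERE stringVal IN (SELECT stringVal FROM TABLE2
                                 WHERE idTable2 = 5)""",
    "exists": """SELECT count(*) FROM TABLE1
                 WHERE EXISTS (SELECT 1 FROM TABLE2
                               WHERE TABLE2.stringVal = TABLE1.stringVal
                                 AND TABLE2.idTable2 = 5)""",
    "join": """SELECT count(*) FROM TABLE1
               JOIN (SELECT DISTINCT stringVal FROM TABLE2
                     WHERE idTable2 = 5) AS x
                 ON x.stringVal = TABLE1.stringVal""",
}

# All three forms should return the same count
counts = {name: conn.execute(sql).fetchone()[0] for name, sql in queries.items()}
print(counts)

# Ask the engine how it intends to run the IN form
plan = conn.execute("EXPLAIN QUERY PLAN " + queries["in"]).fetchall()
for step in plan:
    print(step[-1])  # last column is the human-readable plan detail
```

On the real database, the equivalent plan output (and the timings) is what shows whether the subquery is being re-evaluated per row.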

SQL: Optimization problem, has rows?

I have a query with five joins on some rather large tables (the largest table has 10 million records), and I want to know if rows exist. So far I've done this to check if rows exist:
SELECT TOP 1 tbl.Id
FROM table tbl
INNER JOIN ... ON ... = ... (x5)
WHERE tbl.xxx = ...
Using this query, in a stored procedure takes 22 seconds and I would like it to be close to "instant". Is this even possible? What can I do to speed it up?
I got indexes on the fields that I'm joining on and the fields in the WHERE clause.
Any ideas?
Switch to the EXISTS predicate. In general I have found it to be faster than selecting TOP 1 etc.
So you could write like this IF EXISTS (SELECT * FROM table tbl INNER JOIN table tbl2 .. do your stuff
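The idea of an existence check can be sketched as follows. This uses Python's built-in sqlite3 module, where the check is written as SELECT EXISTS (...) rather than T-SQL's IF EXISTS; the `parent` and `child` tables, their columns, and the `has_rows` helper are hypothetical stand-ins for the five-table join in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical two-table stand-in for the joined tables
    CREATE TABLE parent (id INTEGER PRIMARY KEY, xxx TEXT);
    CREATE TABLE child  (id INTEGER PRIMARY KEY, parent_id INTEGER);
    INSERT INTO parent VALUES (1, 'wanted'), (2, 'other');
    INSERT INTO child  VALUES (10, 1);
""")

def has_rows(value):
    """Return True if at least one joined row matches the filter."""
    row = conn.execute("""
        SELECT EXISTS (
            SELECT 1
            FROM parent p
            JOIN child c ON c.parent_id = p.id
            WHERE p.xxx = ?
        )
    """, (value,)).fetchone()
    return bool(row[0])

print(has_rows("wanted"), has_rows("other"), has_rows("missing"))
```

EXISTS only needs to find one qualifying row, so the engine is free to stop at the first match; whether that brings the 22 seconds down depends on the plan it picks.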
Depending on your RDBMS you can check what parts of the query are taking a long time and which indexes are being used (so you can know they're being used properly).
In MSSQL, you can see a diagram of the execution plan of any query you submit.
In Oracle (EXPLAIN PLAN FOR) and MySQL (EXPLAIN) you can use the EXPLAIN keyword to get details about how the query is executed.
But it might just be that 22 seconds is the best you can do with your query. We can't answer that, only the execution details provided by your RDBMS can. If you tell us which RDBMS you're using we can tell you how to find the information you need to see what the bottleneck is.
4 options
Try COUNT(*) in place of TOP 1 tbl.id
An index per column may not be good enough: you may need to use composite indexes
Are you on SQL Server 2005? If so, you can find missing indexes. Or try the Database Tuning Advisor
Also, it's possible that you don't need 5 joins.
Assuming parent-child-grandchild etc, then grandchild rows can't exist without the parent rows (assuming you have foreign keys)
So your query could become
SELECT TOP 1
tbl.Id --or count(*)
FROM
grandchildtable tbl
INNER JOIN
anothertable ON ... = ...
WHERE
tbl.xxx = ...
Try EXISTS.
Either for the 5 tables or for the assumed hierarchy:
SELECT TOP 1 --or count(*)
tbl.Id
FROM
grandchildtable tbl
WHERE
tbl.xxx = ...
AND
EXISTS (SELECT *
FROM
anothertable T2
WHERE
tbl.key = T2.key /* AND T2 condition*/)
-- or
SELECT TOP 1 --or count(*)
tbl.Id
FROM
mytable tbl
WHERE
tbl.xxx = ...
AND
EXISTS (SELECT *
FROM
anothertable T2
WHERE
tbl.key = T2.key /* AND T2 condition*/)
AND
EXISTS (SELECT *
FROM
yetanothertable T3
WHERE
tbl.key = T3.key /* AND T3 condition*/)
Doing a filter early on your first select will help if you can do it; as you filter the data in the first instance all the joins will join on reduced data.
Select top 1 tbl1.id
From
(
Select top 1 * from
table tbl1
Where Key = Key
) tbl1
inner join ...
After that you will likely need to provide more of the query to understand how it works.
Maybe you could offload/cache this fact-finding mission. Like if it doesn't need to be done dynamically or at runtime, just cache the result into a much smaller table and then query that. Also, make sure all the tables you're querying to have the appropriate clustered index. Granted you may be using these tables for other types of queries, but for the absolute fastest way to go, you can tune all your clustered indexes for this one query.
Edit: Yes, what other people said. Measure, measure, measure! Your query plan estimate can show you what your bottleneck is.
Put the table with the most rows first in every join. If the WHERE clause has more than one condition, the order of the conditions matters: put first the condition that returns the most rows.
Use filters very carefully when optimizing the query.