How to run the subquery first in presto - sql

I have the following query:
select *
from Table1
where NUMid in (select NUMid
from Table2
where email = 'xyz#gmail.com')
My intention is to get the list of all the NUMids from Table2 that have an email value equal to xyz#gmail.com, and then use that list of NUMids to query Table1.
In Presto, the query runs the outer query first. Is there a way to run and store the result of the inner query and then use it in the outer query?

The optimizer can do what it likes. In this case, it should be running the inner query once and then essentially doing a JOIN (technically a "semi-join") operation.
In many databases, exists with appropriate indexes solves the performance problem.
If you want to ensure that the subquery is evaluated only once, you can move it to the FROM clause as a derived table and join to it. The correct equivalent query looks like:
select t1.*
from Table1 t1 join
(select distinct t2.NUMid
from Table2 t2
where t2.email = 'xyz#gmail.com'
) t2
on t1.NUMid = t2.NUMid;
The select distinct is important for the join code to be equivalent to the in code. However, if you know there are no duplicates, this is more colloquially written without a subquery:
select t1.*
from Table1 t1 join
Table2 t2
on t1.NUMid = t2.NUMid
where t2.email = 'xyz#gmail.com'
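To see why the select distinct matters, here is a small sketch that runs the three forms from Python against an in-memory SQLite database. SQLite is only a stand-in for Presto here, and the sample rows (including the duplicated NUMid) are made up for illustration; the point is the result sets, not the execution plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (NUMid INTEGER, label TEXT);
    CREATE TABLE Table2 (NUMid INTEGER, email TEXT);
    INSERT INTO Table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    -- NUMid 1 appears twice for the same email (hypothetical duplicate)
    INSERT INTO Table2 VALUES (1, 'xyz#gmail.com'), (1, 'xyz#gmail.com'),
                              (2, 'other#gmail.com');
""")

# Original form: IN never duplicates rows of the outer table.
in_rows = conn.execute("""
    SELECT * FROM Table1
    WHERE NUMid IN (SELECT NUMid FROM Table2 WHERE email = 'xyz#gmail.com')
""").fetchall()

# Join against a de-duplicated derived table: same result as IN.
distinct_join_rows = conn.execute("""
    SELECT t1.* FROM Table1 t1
    JOIN (SELECT DISTINCT NUMid FROM Table2
          WHERE email = 'xyz#gmail.com') t2
      ON t1.NUMid = t2.NUMid
""").fetchall()

# Plain join: each matching Table2 row produces its own output row.
plain_join_rows = conn.execute("""
    SELECT t1.* FROM Table1 t1
    JOIN Table2 t2 ON t1.NUMid = t2.NUMid
    WHERE t2.email = 'xyz#gmail.com'
""").fetchall()

print(in_rows)
print(distinct_join_rows)
print(plain_join_rows)
```

With this data, the plain join returns the Table1 row for NUMid 1 twice, while the other two forms return it once.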

Presto and Trino (formerly known as PrestoSQL) execute that query as a "semi join" operation: the engine builds an in-memory index from the rows coming from the inner query and probes the rows of the outer query against that index. If the value is present, the row from the outer query is emitted; otherwise, it is filtered out.
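As a rough conceptual illustration of that build-and-probe idea (not Presto's or Trino's actual implementation), the semi join can be sketched in a few lines of Python. The function name semi_join and the sample rows are hypothetical.

```python
def semi_join(outer_rows, inner_keys, key):
    """Return the outer rows whose key value appears among inner_keys."""
    # Build phase: index the values produced by the inner query.
    index = set(inner_keys)
    # Probe phase: each outer row is emitted at most once if its key is indexed.
    return [row for row in outer_rows if row[key] in index]


table1 = [
    {"NUMid": 1, "label": "a"},
    {"NUMid": 2, "label": "b"},
    {"NUMid": 3, "label": "c"},
]
# Hypothetical NUMids returned by the inner query, including a duplicate.
inner_result = [1, 1, 3]

matched = semi_join(table1, inner_result, "NUMid")
print(matched)
```

Because the index is a set, duplicates in the inner result do not multiply the outer rows, which is the difference between a semi join and an ordinary join.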
In recent versions of Trino, there's a feature called "dynamic filtering", which allows the query engine to dynamically filter and prune data for the outer query at the source based on information obtained dynamically from the inner query. You can read more about it in these blog posts:
Dynamic filtering for highly-selective join optimization
Dynamic partition pruning

Related

Querying a Partitioned table in BigQuery using a reference from a joined table

I would like to run a query that partitions table A using a value from table B.
For example:
#standard SQL
select A.user_id
from my_project.xxx A
inner join my_project.yyy B
on A._partitiontime = timestamp(B.date)
where B.date = '2018-01-01'
This query will scan all the partitions in table A and will not take into consideration the date I specified in the where clause (for partitioning purposes). I have tried running this query in several different ways but all produced the same result - scanning all partitions in table A.
Is there any way around it?
Thanks in advance.
With BigQuery scripting (Beta now), there is a way to prune the partitions.
Basically, a scripting variable is defined to capture the dynamic part of a subquery. Then, in the subsequent query, the scripting variable is used as a filter to prune the partitions to be scanned.
DECLARE date_filter ARRAY<DATETIME>
DEFAULT (SELECT ARRAY_AGG(date) FROM B WHERE ...);
select A.user_id
from my_project.xxx A
inner join my_project.yyy B
on A._partitiontime = timestamp(B.date)
where A._partitiontime IN UNNEST(date_filter)
The doc says this about your use case:
Express the predicate filter as closely as possible to the table
identifier. Complex queries that require the evaluation of multiple
stages of a query in order to resolve the predicate (such as inner
queries or subqueries) will not prune partitions from the query.
The following query does not prune partitions (note the use of a subquery):
#standardSQL
SELECT
t1.name,
t2.category
FROM
table1 t1
INNER JOIN
table2 t2
ON
t1.id_field = t2.field2
WHERE
t1.ts = (SELECT timestamp from table3 where key = 2)

INNER JOIN with complex condition dramatically increases the execution time

I have two tables with several identical fields that need to be linked in the JOIN condition. For example, each table has the fields P1 and P2. I want to write the following join query:
SELECT ... FROM Table1
INNER JOIN
Table2
ON Table1.P1 = Table2.P1
OR Table1.P2 = Table2.P2
OR Table1.P1 = Table2.P2
OR Table1.P2 = Table2.P1
With huge tables, this query takes a very long time to execute.
I tried to test how long a query with only one condition would take. First, I modified the tables so that all data from P1 and P2 were copied as new rows into Table1 and Table2. So my query is simple:
SELECT ... FROM Table1 INNER JOIN Table2 ON Table1.P = Table2.P
The result was more than surprising: the execution time dropped from many hours (the first case) to 2-3 seconds!
Why is it so different? Does it mean that complex conditions always reduce performance? How can I improve this? Would indexing P1 and P2 help? I want to keep the first DB schema and not move to a single field P.
The queries perform differently because of the join strategies used by the optimizer. There are basically four ways that two tables can be joined:
1. "Hash join": Creates a hash table on one of the tables, which it uses to look up the values in the second.
2. "Merge join": Sorts both tables on the key and then reads the results sequentially for the join.
3. "Index lookup": Uses an index to look up values in one table.
4. "Nested loop": Compares each value in each table to all the values in the other table.
(And there are variations on these, such as using an index instead of a table, working with partitions, and handling multiple processors.) Unfortunately, in SQL Server Management Studio both (3) and (4) are shown as nested loop joins. If you look more closely, you can tell the difference from the parameters in the node.
In any case, your original join is one of the first three -- and it goes fast. These joins can basically only be used on "equi-joins". That is, when the condition joining the two tables includes an equality operator.
When you switch from a single equality to an "in" or set of "or" conditions, the join condition has changed from an equijoin to a non-equijoin. My observation is that SQL Server does a lousy job of optimization in this case (and, to be fair, I think other databases do pretty much the same thing). Your performance hit is the hit of going from a good join algorithm to the nested loops algorithm.
Without testing, I might suggest some of the following strategies.
Build an index on P1 and P2 in both tables. SQL Server might use the index even for a non-equijoin.
Use the union query suggested in another solution. Each query should be correctly optimized.
Assuming these are 1-1 joins, you can also do this as a set of multiple joins:
from table1 t1 left outer join
table2 t2_11
on t1.p1 = t2_11.p1 left outer join
table2 t2_12
on t1.p1 = t2_12.p2 left outer join
table2 t2_21
on t1.p2 = t2_21.p1 left outer join
table2 t2_22
on t1.p2 = t2_22.p2
And then use case/coalesce logic in the SELECT to get the value that you actually want. Although this may look more complicated, it should be quite efficient.
You can use four queries and UNION their results:
SELECT ... FROM Table1
INNER JOIN
Table2
ON Table1.P1 = Table2.P1
UNION
SELECT ... FROM Table1
INNER JOIN
Table2
ON Table1.P1 = Table2.P2
UNION
SELECT ... FROM Table1
INNER JOIN
Table2
ON Table1.P2 = Table2.P1
UNION
SELECT ... FROM Table1
INNER JOIN
Table2
ON Table1.P2 = Table2.P2
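As a quick sanity check that the UNION form returns the same pairs as the OR join, here is a sketch using Python's sqlite3 module. The id columns and the sample values are assumptions added for illustration; selecting the key of each table keeps UNION's duplicate removal from hiding legitimately distinct rows. It says nothing about timing on SQL Server, only about the result set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (id INTEGER PRIMARY KEY, P1 INTEGER, P2 INTEGER);
    CREATE TABLE Table2 (id INTEGER PRIMARY KEY, P1 INTEGER, P2 INTEGER);
    INSERT INTO Table1 VALUES (1, 10, 20), (2, 30, 40), (3, 50, 60);
    INSERT INTO Table2 VALUES (1, 10, 20), (2, 40, 99), (3, 77, 88);
""")

# Single join with four OR-ed conditions (the slow form in the question).
or_join = conn.execute("""
    SELECT t1.id, t2.id FROM Table1 t1
    JOIN Table2 t2
      ON t1.P1 = t2.P1 OR t1.P2 = t2.P2
      OR t1.P1 = t2.P2 OR t1.P2 = t2.P1
    ORDER BY 1, 2
""").fetchall()

# Four equi-joins combined with UNION, which removes repeated pairs.
union_join = conn.execute("""
    SELECT t1.id, t2.id FROM Table1 t1 JOIN Table2 t2 ON t1.P1 = t2.P1
    UNION
    SELECT t1.id, t2.id FROM Table1 t1 JOIN Table2 t2 ON t1.P1 = t2.P2
    UNION
    SELECT t1.id, t2.id FROM Table1 t1 JOIN Table2 t2 ON t1.P2 = t2.P1
    UNION
    SELECT t1.id, t2.id FROM Table1 t1 JOIN Table2 t2 ON t1.P2 = t2.P2
    ORDER BY 1, 2
""").fetchall()

print(or_join)
print(union_join)
```

The pair (1, 1) matches two of the four branches here, and UNION (as opposed to UNION ALL) collapses it to a single row.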
Does using CTEs help performance?
;WITH Table1_cte
AS
(
SELECT
...
[P] = P1
FROM Table1
UNION
SELECT
...
[P] = P2
FROM Table1
)
, Table2_cte
AS
(
SELECT
...
[P] = P1
FROM Table2
UNION
SELECT
...
[P] = P2
FROM Table2
)
SELECT ... FROM Table1_cte x
INNER JOIN
Table2_cte y
ON x.P = y.P
I suspect, as far as the processor is concerned, the above is just different syntax for the same complex conditions.

Reusing select subquery/result

I am trying to optimize the speed of a query which uses a redundant query block. I am trying to do a row-wise join in SQL Server 2008 using the query below.
Select * from
(<complex subquery>) cq
join table1 t1 on (cq.id=t1.id)
union
Select * from
(<complex subquery>) cq
join table2 t2 on (cq.id=t2.id)
The <complex subquery> is exactly the same in both halves of the union, except that we need to join it with different tables to obtain the same columnar data.
Is there any way I can rewrite the query to make it faster without using a temporary table to cache results?
Why not use a temporary table and see if that improves the execution stats?
In some circumstances the query optimizer will automatically add a spool to the plan that caches subqueries; this is basically just a temporary table, though.
Have you checked your current plan to be sure that the query is actually being evaluated more than once?
Without a concrete example it's difficult to help, but try a WITH statement, something like:
WITH csq AS (
<complex subquery>
)
Select * from
csq cq
join table1 t1 on (cq.id=t1.id)
union
Select * from
csq cq
join table2 t2 on (cq.id=t2.id)
It sometimes speeds things up no end.
Does the nature of your query allow you to invert it? Instead of "(join, join) union", do "(union) join"? Like:
Select * from
(<complex subquery>) cq
join (
Select * from table1 t1
union
Select * from table2 t2
) ts
on cq.id=ts.id
I'm not really sure if double evaluation of your complex query is actually what's wrong. But, as per your question, this would be a form of the query that would encourage SQL to only evaluate <complex query> once.

Problem with sql query

I'm using MySQL and I'm trying to construct a query to do the following:
I have:
Table1 [ID,...]
Table2 [ID, tID, start_date, end_date,...]
What I want from my query is:
Select all entries from Table2 Where Table1.ID=Table2.tID
**where at least one** end_date<today.
The way I have it working right now is that if Table 2 contains (for example) 5 entries but only 1 of them is end_date< today then that's the only entry that will be returned, whereas I would like to have the other (expired) ones returned as well. I have the actual query and all the joins working well, I just can't figure out the ** part of it.
Any help would be great!
Thank you!
SELECT * FROM Table2
WHERE tID IN
(SELECT Table2.tID FROM Table1
INNER JOIN Table2 ON Table1.ID = Table2.tID
WHERE Table2.end_date < NOW()
)
The subquery will select all tId's that match your where clause. The main query will use this subquery to filter the entries in table 2.
Note: the use of inner join will filter all rows from table 1 with no matching entry in table 2. This is no problem; these entries wouldn't have matched the where clause anyway.
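The behaviour can be checked on a tiny data set. The sketch below uses Python's sqlite3 module as a stand-in for MySQL, so NOW() is replaced by SQLite's date('now'); the rows and date values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (ID INTEGER PRIMARY KEY);
    CREATE TABLE Table2 (ID INTEGER PRIMARY KEY, tID INTEGER,
                         start_date TEXT, end_date TEXT);
    INSERT INTO Table1 VALUES (1), (2);
    -- tID 1 has one expired and one future entry; tID 2 has only a future entry
    INSERT INTO Table2 VALUES
        (10, 1, '2000-01-01', '2000-12-31'),
        (11, 1, '2000-01-01', '9999-12-31'),
        (12, 2, '2000-01-01', '9999-12-31');
""")

rows = conn.execute("""
    SELECT ID, tID FROM Table2
    WHERE tID IN (
        SELECT Table2.tID FROM Table1
        INNER JOIN Table2 ON Table1.ID = Table2.tID
        WHERE Table2.end_date < date('now')  -- SQLite spelling of MySQL's NOW()
    )
    ORDER BY ID
""").fetchall()

print(rows)
```

Both entries for tID 1 come back, including the one that has not expired, because the group has at least one expired entry; tID 2 is excluded entirely.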
Maybe, just maybe, you could create a subquery to join with your actual tables, and in this subquery use a count() that can be referenced later in your WHERE clause.

PostgreSQL - Correlated Sub-Query Fail?

I have a query like this:
SELECT t1.id,
(SELECT COUNT(t2.id)
FROM t2
WHERE t2.id = t1.id
) as num_things
FROM t1
WHERE num_things = 5;
The goal is to get the id of all the elements that appear 5 times in the other table. However, I get this error:
ERROR: column "num_things" does not exist
SQL state: 42703
I'm probably doing something silly here, as I'm somewhat new to databases. Is there a way to fix this query so I can access num_things? Or, if not, is there any other way of achieving this result?
A few important points about using SQL:
You cannot use column aliases in the WHERE clause, but you can in the HAVING clause. That's the cause of the error you got.
You can do your count better using a JOIN and GROUP BY than by using correlated subqueries. It'll be much faster.
Use the HAVING clause to filter groups.
Here's the way I'd write this query:
SELECT t1.id, COUNT(t2.id) AS num_things
FROM t1 JOIN t2 USING (id)
GROUP BY t1.id
HAVING num_things = 5;
I realize this query can skip the JOIN with t1, as in Charles Bretana's solution. But I assume you might want the query to include some other columns from t1.
Re: the question in the comment:
The difference is that the WHERE clause is evaluated on rows, before GROUP BY reduces groups to a single row per group. The HAVING clause is evaluated after groups are formed. So you can't, for example, change the COUNT() of a group by using HAVING; you can only exclude the group itself.
SELECT t1.id, COUNT(t2.id) as num
FROM t1 JOIN t2 USING (id)
WHERE t2.attribute = <value>
GROUP BY t1.id
HAVING num > 5;
In the above query, WHERE filters for rows matching a condition, and HAVING filters for groups that have a count greater than five.
The point that causes most people confusion is when they don't have a GROUP BY clause, so it seems like HAVING and WHERE are interchangeable.
WHERE is evaluated before expressions in the select-list. This may not be obvious because SQL syntax puts the select-list first. So you can save a lot of expensive computation by using WHERE to restrict rows.
SELECT <expensive expressions>
FROM t1
HAVING primaryKey = 1234;
If you use a query like the above, the expressions in the select-list are computed for every row, only to discard most of the results because of the HAVING condition. However, the query below computes the expression only for the single row matching the WHERE condition.
SELECT <expensive expressions>
FROM t1
WHERE primaryKey = 1234;
So to recap, queries are run by the database engine according to a series of steps:
Generate set of rows from table(s), including any rows produced by JOIN.
Evaluate WHERE conditions against the set of rows, filtering out rows that don't match.
Compute expressions in select-list for each in the set of rows.
Apply column aliases (note this is a separate step, which means you can't use aliases in expressions in the select-list).
Condense groups to a single row per group, according to GROUP BY clause.
Evaluate HAVING conditions against groups, filtering out groups that don't match.
Sort result, according to ORDER BY clause.
All the other suggestions would work, but to answer your basic question it would be sufficient to write
SELECT id From T2
Group By Id
Having Count(*) = 5
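If other columns from t1 are needed, the join version can also be written with the aggregate expression repeated in HAVING instead of a select-list alias, which is standard SQL and is accepted by PostgreSQL. The sketch below runs that form from Python on an in-memory SQLite database as a stand-in, with made-up rows, so it only demonstrates the result, not any PostgreSQL-specific behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER PRIMARY KEY);
    CREATE TABLE t2 (id INTEGER);
    INSERT INTO t1 VALUES (1), (2);
    -- id 1 appears five times in t2, id 2 only twice
    INSERT INTO t2 VALUES (1), (1), (1), (1), (1), (2), (2);
""")

rows = conn.execute("""
    SELECT t1.id, COUNT(t2.id) AS num_things
    FROM t1 JOIN t2 ON t2.id = t1.id
    GROUP BY t1.id
    HAVING COUNT(t2.id) = 5   -- aggregate repeated, no alias reference
""").fetchall()

print(rows)
```

Only the id that occurs exactly five times in t2 is returned, together with its count.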
I'd like to mention that in PostgreSQL there is no way to use an aliased column in the HAVING clause.
For example:
SELECT usr_id AS my_id FROM user HAVING my_id = 1
This won't work.
Another example that is not going to work:
SELECT su.usr_id AS my_id, COUNT(*) AS val FROM sys_user AS su GROUP BY su.usr_id HAVING val >= 1
There will be the same error: val column is not known.
I'm highlighting this because Bill Karwin wrote something that is not really true for Postgres:
"You cannot use column aliases in the WHERE clause, but you can in the HAVING clause. That's the cause of the error you got."
I think you could just rewrite your query like so:
SELECT t1.id
FROM t1
WHERE (SELECT COUNT(t2.id)
FROM t2
WHERE t2.id = t1.id
) = 5;
Try this:
SELECT t1.id,
(SELECT COUNT(t2.id) as myCount
FROM t2
WHERE t2.id = t1.id and myCount=5
) as num_things
FROM t1