PostgreSQL - Correlated Sub-Query Fail?

I have a query like this:
SELECT t1.id,
(SELECT COUNT(t2.id)
FROM t2
WHERE t2.id = t1.id
) as num_things
FROM t1
WHERE num_things = 5;
The goal is to get the id of all the elements that appear 5 times in the other table. However, I get this error:
ERROR: column "num_things" does not exist
SQL state: 42703
I'm probably doing something silly here, as I'm somewhat new to databases. Is there a way to fix this query so I can access num_things? Or, if not, is there any other way of achieving this result?

A few important points about using SQL:
You cannot use column aliases in the WHERE clause, but you can in the HAVING clause. That's the cause of the error you got.
You can do your count better using a JOIN and GROUP BY than by using correlated subqueries. It'll be much faster.
Use the HAVING clause to filter groups.
Here's the way I'd write this query:
SELECT t1.id, COUNT(t2.id) AS num_things
FROM t1 JOIN t2 USING (id)
GROUP BY t1.id
HAVING COUNT(t2.id) = 5;
I realize this query can skip the JOIN with t1, as in Charles Bretana's solution. But I assume you might want the query to include some other columns from t1.
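If you would rather keep the correlated subquery and still filter on num_things, one option is to wrap the query in a derived table, so that the alias is an ordinary column by the time the outer WHERE sees it. The following is only a sketch: it uses Python's sqlite3 module with an in-memory SQLite database as a stand-in for PostgreSQL, and the sample rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (an assumption of this demo).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER PRIMARY KEY);
    CREATE TABLE t2 (id INTEGER);
    INSERT INTO t1 (id) VALUES (1), (2), (3);
    -- five rows for id 1, two rows for id 2, none for id 3
    INSERT INTO t2 (id) VALUES (1), (1), (1), (1), (1), (2), (2);
""")

# The inner query defines num_things. Seen from the outer query it is a
# plain column of the derived table s, so WHERE can refer to it.
query = """
    SELECT s.id, s.num_things
    FROM (
        SELECT t1.id,
               (SELECT COUNT(t2.id) FROM t2 WHERE t2.id = t1.id) AS num_things
        FROM t1
    ) AS s
    WHERE s.num_things = 5
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The derived-table form is standard SQL, so the query text itself should not depend on how a particular engine treats aliases in WHERE or HAVING.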
Re: the question in the comment:
The difference is that the WHERE clause is evaluated on rows, before GROUP BY reduces groups to a single row per group. The HAVING clause is evaluated after groups are formed. So you can't, for example, change the COUNT() of a group by using HAVING; you can only exclude the group itself.
SELECT t1.id, COUNT(t2.id) as num
FROM t1 JOIN t2 USING (id)
WHERE t2.attribute = <value>
GROUP BY t1.id
HAVING COUNT(t2.id) > 5;
In the above query, WHERE filters for rows matching a condition, and HAVING filters for groups whose count is greater than five.
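To see the two filters acting at different stages, here is a small sketch using Python's sqlite3 module (SQLite standing in for PostgreSQL; the table contents and the threshold of 2 are made up for the demo). Changing the WHERE value changes the counts themselves, while HAVING only decides which groups survive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER PRIMARY KEY);
    CREATE TABLE t2 (id INTEGER, attribute TEXT);
    INSERT INTO t1 (id) VALUES (1), (2);
    INSERT INTO t2 (id, attribute) VALUES
        (1, 'a'), (1, 'a'), (1, 'a'), (1, 'b'),
        (2, 'a'), (2, 'b'), (2, 'b'), (2, 'b');
""")

sql = """
    SELECT t1.id, COUNT(t2.id) AS num
    FROM t1 JOIN t2 ON t2.id = t1.id
    WHERE t2.attribute = ?       -- row filter, applied before grouping
    GROUP BY t1.id
    HAVING COUNT(t2.id) > ?      -- group filter, applied after grouping
    ORDER BY t1.id
"""

with_a = conn.execute(sql, ("a", 2)).fetchall()
with_b = conn.execute(sql, ("b", 2)).fetchall()
print(with_a)
print(with_b)
```

Each id has four rows in t2, yet neither result reports a count of four, because WHERE removed rows before COUNT ran.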
The point that causes most people confusion is when they don't have a GROUP BY clause, so it seems like HAVING and WHERE are interchangeable.
WHERE is evaluated before expressions in the select-list. This may not be obvious because SQL syntax puts the select-list first. So you can save a lot of expensive computation by using WHERE to restrict rows.
SELECT <expensive expressions>
FROM t1
HAVING primaryKey = 1234;
If you use a query like the above, the expressions in the select-list are computed for every row, only to discard most of the results because of the HAVING condition. However, the query below computes the expression only for the single row matching the WHERE condition.
SELECT <expensive expressions>
FROM t1
WHERE primaryKey = 1234;
So to recap, queries are run by the database engine according to a series of steps:
1) Generate the set of rows from the table(s), including any rows produced by JOIN.
2) Evaluate WHERE conditions against the set of rows, filtering out rows that don't match.
3) Compute expressions in the select-list for each row in the set.
4) Apply column aliases (note this is a separate step, which means you can't use aliases in expressions in the select-list).
5) Condense groups to a single row per group, according to the GROUP BY clause.
6) Evaluate HAVING conditions against groups, filtering out groups that don't match.
7) Sort the result, according to the ORDER BY clause.
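The filtering, grouping and sorting parts of that recap can be mimicked in plain Python. This is a simplified mental model only, not how any real engine executes a query; the rows and the threshold are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows, as if already produced by joining t1 and t2.
joined_rows = [
    {"id": 1, "attribute": "a"},
    {"id": 1, "attribute": "a"},
    {"id": 1, "attribute": "b"},
    {"id": 2, "attribute": "a"},
    {"id": 3, "attribute": "a"},
    {"id": 3, "attribute": "a"},
]

# WHERE: evaluated per row, before any grouping happens.
filtered = [r for r in joined_rows if r["attribute"] == "a"]

# GROUP BY: condense the surviving rows to one entry per id.
counts = defaultdict(int)
for r in filtered:
    counts[r["id"]] += 1

# HAVING: evaluated per group; it can drop a group but cannot change its count.
groups = {key: n for key, n in counts.items() if n >= 2}

# ORDER BY: sort the final result.
result = sorted(groups.items())
print(result)
```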

All the other suggestions would work, but to answer your basic question it would be sufficient to write
SELECT id From T2
Group By Id
Having Count(*) = 5

I'd like to mention that in PostgreSQL there is no way to use an aliased column in the HAVING clause.
i.e.
SELECT usr_id AS my_id FROM user HAVING my_id = 1
Won't work.
Another example that is not going to work:
SELECT su.usr_id AS my_id, COUNT(*) AS val FROM sys_user AS su GROUP BY su.usr_id HAVING val >= 1
There will be the same error: the val column is not known.
I'm highlighting this because Bill Karwin wrote something that is not really true for Postgres:
"You cannot use column aliases in the WHERE clause, but you can in the HAVING clause. That's the cause of the error you got."

I think you could just rewrite your query like so:
SELECT t1.id
FROM t1
WHERE (SELECT COUNT(t2.id)
FROM t2
WHERE t2.id = t1.id
) = 5;

Try this:
SELECT t1.id,
(SELECT COUNT(t2.id)
FROM t2
WHERE t2.id = t1.id
HAVING COUNT(t2.id) = 5
) as num_things
FROM t1
Note that this still returns every row of t1; num_things is NULL where the count is not 5.

Related

How to run the subquery first in presto

I have the following query:
select *
from Table1
where NUMid in (select NUMid
from Table2
where email = 'xyz#gmail.com')
My intention is to get the list of all the NUMids from Table2 that have an email value equal to xyz#gmail.com and use that list of NUMids to query Table1.
In presto, the query is running the outer query first. Is there a way to run and store the result of inner query and then use it in the outer query in presto?
The optimizer can do what it likes. In this case, it should be running the inner query once and then essentially doing a JOIN (technically a "semi-join") operation.
In many databases, exists with appropriate indexes solves the performance problem.
If you want to ensure that the subquery is evaluated only once, you can move it to the FROM clause. The correct equivalent query looks like:
select t1.*
from Table1 t1 join
(select distinct t2.NUMid
from Table2 t2
where t2.email = 'xyz#gmail.com'
) t2
on t1.NUMid = t2.NUMid;
The select distinct is important for the join code to be equivalent to the in code. However, if you know there are no duplicates, this is more colloquially written without a subquery:
select t1.*
from Table1 t1 join
Table2 t2
on t1.NUMid = t2.NUMid
where t2.email = 'xyz#gmail.com'
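A quick way to convince yourself that the select distinct matters is to put a duplicate into Table2 and compare the three forms. This sketch uses Python's sqlite3 module as a stand-in engine (not Presto), and the sample data and email address are invented for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (NUMid INTEGER, label TEXT);
    CREATE TABLE Table2 (NUMid INTEGER, email TEXT);
    INSERT INTO Table1 VALUES (1, 'one'), (2, 'two');
    -- NUMid 1 appears twice for the same email
    INSERT INTO Table2 VALUES (1, 'a@example.com'), (1, 'a@example.com'),
                              (2, 'b@example.com');
""")
email = "a@example.com"  # hypothetical value

# IN: each Table1 row is returned at most once.
in_rows = conn.execute(
    "SELECT t1.NUMid FROM Table1 t1 "
    "WHERE t1.NUMid IN (SELECT NUMid FROM Table2 WHERE email = ?)",
    (email,),
).fetchall()

# Plain join: one output row per matching Table2 row, so duplicates appear.
plain_join_rows = conn.execute(
    "SELECT t1.NUMid FROM Table1 t1 "
    "JOIN Table2 t2 ON t1.NUMid = t2.NUMid WHERE t2.email = ?",
    (email,),
).fetchall()

# Join against a DISTINCT derived table: matches the IN result again.
distinct_join_rows = conn.execute(
    "SELECT t1.NUMid FROM Table1 t1 "
    "JOIN (SELECT DISTINCT NUMid FROM Table2 WHERE email = ?) t2 "
    "ON t1.NUMid = t2.NUMid",
    (email,),
).fetchall()

print(in_rows, plain_join_rows, distinct_join_rows)
```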
Presto and Trino (formerly known as PrestoSQL) execute that query as a "semi join" operation: the engine builds an in-memory index from the rows coming from the inner query and probes the rows of the outer query against that index. If the value is present, the row from the outer query is emitted; otherwise, it's filtered out.
In recent versions of Trino, there's a feature called "dynamic filtering", which allows the query engine to dynamically filter and prune data for the outer query at the source based on information obtained dynamically from the inner query. You can read more about it in these blog posts:
Dynamic filtering for highly-selective join optimization
Dynamic partition pruning

How to use postgres group by statement when joining tables

I have a very simple query that I want to execute in Postgres.
table1 has a one-to-many relation to table2 and table3.
The pseudo-query is as follows:
select * from table1
left join table2 ON table2.table1_id = table1.id
left join table3 ON table3.table1_id = table1.id
group by table1.id
This gives me an error:
"column "table2.id" must appear in the GROUP BY clause or be used in an aggregate function",
same for table3.id
What is the point of GROUP BY if it forces me to add the ids of all the tables to the GROUP BY, thus defeating its purpose (all ids are unique, so no grouping occurs)?
The purpose of the group by is to summarize data. There is one row in the result set for every combination of keys in the group by.
The columns in the result set are either keys in the group by or are aggregations. There is one exception to this rule, involving grouping by unique or primary keys on a table and using other columns.
The use of select * with group by is simply not a correct use of aggregation in SQL.
You seem to be misunderstanding the purpose of this construct. It is possible that you really mean order by -- that will order the result set by the order by keys without changing (i.e. summarizing) the number of rows.
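If the goal really is one row per table1 row, the other tables' columns have to go through aggregate functions. Below is a sketch of that shape, run through Python's sqlite3 module as a stand-in for Postgres; the column choices and sample rows are assumptions, since the question does not say what should be summarized. COUNT(DISTINCT ...) is used because joining two one-to-many tables multiplies the rows within each group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY);
    CREATE TABLE table2 (id INTEGER PRIMARY KEY, table1_id INTEGER);
    CREATE TABLE table3 (id INTEGER PRIMARY KEY, table1_id INTEGER);
    INSERT INTO table1 (id) VALUES (1), (2);
    INSERT INTO table2 (id, table1_id) VALUES (10, 1), (11, 1);
    INSERT INTO table3 (id, table1_id) VALUES (20, 1), (21, 1), (22, 1);
""")

rows = conn.execute("""
    SELECT table1.id,
           COUNT(DISTINCT table2.id) AS n_table2,  -- aggregate, not a bare column
           COUNT(DISTINCT table3.id) AS n_table3,
           COUNT(*)                  AS joined_rows
    FROM table1
    LEFT JOIN table2 ON table2.table1_id = table1.id
    LEFT JOIN table3 ON table3.table1_id = table1.id
    GROUP BY table1.id
    ORDER BY table1.id
""").fetchall()
print(rows)
```

The joined_rows column shows the multiplication: table1 row 1 has 2 x 3 = 6 joined rows, which is what a plain COUNT would report for either child table.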

Correlated-subquery in INSERT statement - PostgreSQL

I am trying to populate a table using a query that contains a subquery.
The format is the following:
INSERT INTO table_C
SELECT columns FROM table_A, table_B
The subquery is present in one of the columns of the select statement and it refers to "table_A" again (there is a join between table_A and table_B).
Here is the code, but before reading it please consider that the select statement works perfectly if run alone (i.e. with no INSERT):
INSERT INTO hypercube_2015 (date, hour, name, rel_val)
SELECT t1.date, t1.hour, t2.name,
CAST(sum(t1.num) as float)/(SELECT sum(t11.num) FROM hc_num t11 WHERE t11.date = t1.date AND t11.hour = t1.hour)
FROM hc_num t1, names t2
WHERE date between '2015-01-01' AND '2015-12-31'
AND t1.id = t2.id
GROUP BY t1.date, t1.hour, t2.name
The issue is related to the subquery in the 3rd line, in particular to the WHERE condition. If I change it into the following it works:
SELECT sum(t11.num) FROM hc_num t11 WHERE t11.date = '2015-01-01' AND t11.hour=0
The error message is (I am working on a Redshift db via DBVis):
[Code: 500310, SQL State: XX000] Amazon Invalid operation:
This type of correlated subquery pattern is not supported due to
internal error;
I've got no solution to propose but an answer that explains why you have this error.
On Redshift there are several cases where the optimiser can't resolve a correlated subquery and triggers this error. One of them is precisely your kind of subquery:
Correlated Subquery Patterns That Are Not Supported
The query planner uses a query rewrite method called subquery
decorrelation to optimize several patterns of correlated subqueries
for execution in an MPP environment. A few types of correlated
subqueries follow patterns that Amazon Redshift cannot decorrelate and
does not support. Queries that contain the following correlation
references return errors:
References in a GROUP BY column to the results of a correlated subquery. For example:
select listing.listid,
(select count (sales.listid) from sales where sales.listid=listing.listid) as list
from listing
group by list, listing.listid;
Source: Amazon Web Services, Correlated Subqueries
In your subquery:
(SELECT sum(t11.num) FROM hc_num t11 WHERE t11.date = t1.date AND t11.hour = t1.hour)
you do make a reference to t1.hour which is present in the final GROUP BY:
GROUP BY t1.date, t1.hour, t2.name
Note that I might have a deeper look at your query later to propose an alternative, if nobody else does. Got no time at the moment.
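One rewrite that is often suggested for this pattern is to compute the per-(date, hour) totals once in a derived table and join to it, so that no correlated subquery is left. The sketch below only checks the logic, using Python's sqlite3 module with made-up rows; it has not been run on Redshift, and the alias names tot and hour_total are my own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hc_num (id INTEGER, date TEXT, hour INTEGER, num INTEGER);
    CREATE TABLE names (id INTEGER, name TEXT);
    INSERT INTO names VALUES (1, 'alpha'), (2, 'beta');
    INSERT INTO hc_num VALUES
        (1, '2015-01-01', 0, 30),
        (2, '2015-01-01', 0, 10),
        (1, '2015-01-01', 1, 5),
        (2, '2015-01-01', 1, 15);
""")

rows = conn.execute("""
    SELECT t1.date, t1.hour, t2.name,
           CAST(SUM(t1.num) AS REAL) / tot.hour_total AS rel_val
    FROM hc_num t1
    JOIN names t2 ON t1.id = t2.id
    JOIN (SELECT date, hour, SUM(num) AS hour_total   -- totals computed once
          FROM hc_num
          GROUP BY date, hour) tot
      ON tot.date = t1.date AND tot.hour = t1.hour
    WHERE t1.date BETWEEN '2015-01-01' AND '2015-12-31'
    GROUP BY t1.date, t1.hour, t2.name, tot.hour_total
    ORDER BY t1.date, t1.hour, t2.name
""").fetchall()
print(rows)
```

The same SELECT could then be placed under the INSERT INTO hypercube_2015 (date, hour, name, rel_val) line.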

Can I modify this query using a WITH clause?

I have written the following query in order to achieve the following:
1) Select all regulatory languages that do not have a specified ID.
2) Link those regulatory languages based on a hierarchy field (RL_ID_DEFINED - this field is the ID of the parent regulatory language).
My first variation used NOT IN, but after looking into it I decided that NOT EXISTS would be a more efficient approach. Additionally, I was thinking that adding a WITH clause might make it run a bit faster, since my current code runs the nested SELECT statement for each ID in the iteration. Would it be possible to rewrite this using a WITH clause for that nested SELECT?
SELECT
T1.ID
FROM
REGULATORY_LANGUAGES T1
WHERE
T1.INACTIVE_DATE IS NULL
AND NOT EXISTS (
SELECT
NULL
FROM
REGULATORY_LANGUAGES T2,
REVIEW_REGULATIONS T3
WHERE
T3.RVWTYPYR_ID = ?
AND T3.RL_ID = T2.ID
AND T1.ID = T2.ID)
START WITH
RL_ID_DEFINED IS NULL
AND INACTIVE_DATE IS NULL
CONNECT BY
PRIOR ID = RL_ID_DEFINED
The problem I'm running into is that when I look at the structure of a WITH clause, I would be creating it prior to my main SELECT. However, that would require me to have defined my T1 table already. Any thoughts?
(Note - this is being called in a java method, hence the ? in the line T3.RVWTYPYR_ID = ?. When I test this in the database editor via Toad, I just hard code a value for the ?).
While speed is important, so is accuracy. You mentioned that you switched from not in to not exists for efficiency. They do different things. There is another way to speed up the logic of not in. Instead of this:
where someField not in
(select someField
from etc
)
Do this
where someField in
(select someField
from etc
where whatever
minus
select someField
from etc
where whatever
and more filters that identify records to exclude
)
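On the point that not in and not exists do different things: the best-known difference shows up when the subquery can return NULL. A small sketch using Python's sqlite3 module (the tables parent and excluded are hypothetical) makes it visible. With a NULL in the list, NOT IN can never evaluate to true, so no rows come back, while NOT EXISTS simply ignores the NULL row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (someField INTEGER);
    CREATE TABLE excluded (someField INTEGER);
    INSERT INTO parent VALUES (1), (2), (3);
    INSERT INTO excluded VALUES (2), (NULL);
""")

# NOT IN: "1 NOT IN (2, NULL)" is unknown rather than true, so the row is dropped.
not_in = conn.execute(
    "SELECT someField FROM parent "
    "WHERE someField NOT IN (SELECT someField FROM excluded) "
    "ORDER BY someField"
).fetchall()

# NOT EXISTS: the NULL row never matches the equality, so it has no effect.
not_exists = conn.execute(
    "SELECT p.someField FROM parent p "
    "WHERE NOT EXISTS (SELECT 1 FROM excluded e WHERE e.someField = p.someField) "
    "ORDER BY p.someField"
).fetchall()

print(not_in)
print(not_exists)
```

If the compared column is declared NOT NULL on both sides, the two forms return the same rows.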
Now for the with keyword. It speeds up performance when you want to run the exact same subquery more than once. So, instead of this:
where field1 in
(sql for subquery)
and field2 in
(exact same sql as above)
you do this:
with temp as (sql for subquery)
select etc
where field1 in
(select something from temp)
and field2 in
(select something from temp)
However, that's not your situation. What you probably want to do is to investigate ways to send a list of parameters from java so that your query looks like this:
T3.RVWTYPYR_ID in (?,?,etc)
Then you wouldn't have to repeat the subquery.
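Building the (?,?,etc) list usually means generating one placeholder per value and still binding the values as parameters. Here is a sketch of the idea in Python with sqlite3 (chosen only to have something runnable; the sample rows and the wanted_ids values are made up). The same placeholder-per-value approach should carry over to a JDBC PreparedStatement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE REVIEW_REGULATIONS (RL_ID INTEGER, RVWTYPYR_ID INTEGER);
    INSERT INTO REVIEW_REGULATIONS VALUES (1, 100), (2, 200), (3, 300);
""")

wanted_ids = [100, 300]  # hypothetical parameter values

# One "?" per value. Only the placeholders are put into the SQL text;
# the values themselves are bound by the driver, not interpolated.
placeholders = ", ".join("?" for _ in wanted_ids)
sql = (
    "SELECT T3.RL_ID FROM REVIEW_REGULATIONS T3 "
    f"WHERE T3.RVWTYPYR_ID IN ({placeholders}) "
    "ORDER BY T3.RL_ID"
)
rows = conn.execute(sql, wanted_ids).fetchall()
print(sql)
print(rows)
```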
Much thanks to Tom H for his insight. I've rewritten the query using JOIN:
SELECT
T1.ID
FROM
REGULATORY_LANGUAGES T1
LEFT JOIN (
SELECT
T2.ID ID
FROM
REGULATORY_LANGUAGES T2
INNER JOIN
REVIEW_REGULATIONS T3
ON
T3.RVWTYPYR_ID = ?
AND T3.RL_ID = T2.ID) T_JOIN
ON T1.ID = T_JOIN.ID
WHERE
T1.INACTIVE_DATE IS NULL
AND T_JOIN.ID IS NULL
START WITH
T1.RL_ID_DEFINED IS NULL
AND T1.INACTIVE_DATE IS NULL
CONNECT BY
PRIOR T1.ID = T1.RL_ID_DEFINED

Problem with sql query

I'm using MySQL and I'm trying to construct a query to do the following:
I have:
Table1 [ID,...]
Table2 [ID, tID, start_date, end_date,...]
What I want from my query is:
Select all entries from Table2 Where Table1.ID=Table2.tID
**where at least one** end_date<today.
The way I have it working right now is that if Table 2 contains (for example) 5 entries but only 1 of them is end_date< today then that's the only entry that will be returned, whereas I would like to have the other (expired) ones returned as well. I have the actual query and all the joins working well, I just can't figure out the ** part of it.
Any help would be great!
Thank you!
SELECT * FROM Table2
WHERE tID IN
(SELECT Table2.tID FROM Table1
INNER JOIN Table2 ON Table1.ID = Table2.tID
WHERE Table2.end_date < NOW()
)
The subquery will select all tId's that match your where clause. The main query will use this subquery to filter the entries in table 2.
Note: the use of inner join will filter all rows from table 1 with no matching entry in table 2. This is no problem; these entries wouldn't have matched the where clause anyway.
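Here is a small check of that query's behaviour, using Python's sqlite3 module rather than MySQL. The rows are invented, and a fixed date string is bound in place of NOW() so the result is reproducible. Both entries for tID 1 come back, because at least one of them has expired; tID 2 has no expired entry and is left out entirely.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (ID INTEGER PRIMARY KEY);
    CREATE TABLE Table2 (ID INTEGER PRIMARY KEY, tID INTEGER, end_date TEXT);
    INSERT INTO Table1 (ID) VALUES (1), (2);
    INSERT INTO Table2 (ID, tID, end_date) VALUES
        (10, 1, '2020-01-01'),   -- expired
        (11, 1, '2030-01-01'),   -- not expired, but shares tID 1
        (12, 2, '2030-06-01');   -- tID 2 has no expired entries
""")
today = "2025-01-01"  # fixed stand-in for NOW(); ISO dates compare correctly as text

rows = conn.execute("""
    SELECT ID, tID FROM Table2
    WHERE tID IN (SELECT Table2.tID
                  FROM Table1
                  INNER JOIN Table2 ON Table1.ID = Table2.tID
                  WHERE Table2.end_date < ?)
    ORDER BY ID
""", (today,)).fetchall()
print(rows)
```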
Maybe, just maybe, you could create a subquery to join with your actual tables, and in this subquery use a count() that you can refer to later in your WHERE clause.