Correlated-subquery in INSERT statement - PostgreSQL - sql

I am trying to populate a table using a query that contains a subquery.
The format is the following:
INSERT INTO table_C
SELECT columns FROM table_A, table_B
The subquery is present in one of the columns of the select statement and it refers to "table_A" again (there is a join between table_A and table_B).
Here is the code, but before reading it please consider that the select statement works perfectly if run alone (i.e. with no INSERT):
INSERT INTO hypercube_2015 (date, hour, name, rel_val)
SELECT t1.date, t1.hour, t2.name,
CAST(sum(t1.num) as float)/(SELECT sum(t11.num) FROM hc_num t11 WHERE t11.date = t1.date AND t11.hour = t1.hour)
FROM hc_num t1, names t2
WHERE date between '2015-01-01' AND '2015-12-31'
AND t1.id = t2.id
GROUP BY t1.date, t1.hour, t2.name
The issue is related to the subquery in the 3rd line, in particular to the WHERE condition. If I change it into the following it works:
SELECT sum(t11.num) FROM hc_num t11 WHERE t11.date = '2015-01-01' AND t11.hour=0
The error message is (I am working on a Redshift db via DBVis):
[Code: 500310, SQL State: XX000] Amazon Invalid operation:
This type of correlated subquery pattern is not supported due to
internal error;

I've got no solution to propose but an answer that explains why you have this error.
On RedShift there are several cases where the optimiser can't resolve a correlated subquery and trigger this error. One of them is precisely your kind of suquery:
Correlated Subquery Patterns That Are Not Supported
The query planner uses a query rewrite method called subquery
decorrelation to optimize several patterns of correlated subqueries
for execution in an MPP environment. A few types of correlated
subqueries follow patterns that Amazon Redshift cannot decorrelate and
does not support. Queries that contain the following correlation
references return errors:
References in a GROUP BY column to the results of a correlated subquery. For example:
select listing.listid,
(select count (sales.listid) from sales where sales.listid=listing.listid) as list
from listing
group by list, listing.listid;
Source : Amazon webservices Correlated Subqueries
In your subquery:
(SELECT sum(t11.num) FROM hc_num t11 WHERE t11.date = t1.date AND t11.hour = t1.hour)
you do make a reference to t1.hour which is present in the final GROUP BY:
GROUP BY t1.date, t1.hour, t2.name
Note that I might have a deeper look at your query later to propose an alternative, if nobody else does. Got no time at the moment.

Related

How to run the subquery first in presto

I have the following query:
select *
from Table1
where NUMid in (select NUMid
from Table2
where email = 'xyz#gmail.com')
My intention is to get the list of all the NUMids from table2 having an email value equal to xyz#gmail.com and use those list of NUMids to query from Table1.
In presto, the query is running the outer query first. Is there a way to run and store the result of inner query and then use it in the outer query in presto?
The optimizer can do what it likes. In this case, it should be running the inner query once and then essentially doing a JOIN (technically a "semi-join") operation.
In many databases, exists with appropriate indexes solves the performance problem.
If you want to ensure that the subquery is evaluated only once, you can move it to the ON clause. The correct equivalent query looks like:
select t1.*
from Table1 t1 join
(select distinct t2.NUMid
from Table2 t2
where t2.email = 'xyz#gmail.com'
) t2
on t1.NUMid = t2.NUMid;
The select distinct is important for the join code to be equivalent to the in code. However, if you know there are no duplicates, this is more colloquially written without a subquery:
select t1.*
from Table1 t1 join
Table2 t2
on t1.NUMid = t2.NUMid
where t2.email = 'xyz#gmail.com'
Presto and Trino (formerly known as PrestoSQL) execute that query as a "semi join" operation: it builds an in-memory index with the rows coming from the inner query and probes the rows of the outer query against that index. If value is present, the row from the outer query is emitted, otherwise, it's filtered out.
In recent versions of Trino, there's a feature called "dynamic filtering", which allows the query engine to dynamically filter and prune data for the outer query at the source based on information obtained dynamically from the inner query. You can read more about it in these blog posts:
Dynamic filtering for highly-selective join optimization
Dynamic partition pruning

Are sql queries with WITH statement and without equal?

I have two queries:
WITH table1
AS (SELECT id,
first AS table1_first,
second AS table1_second
FROM some_table)
SELECT omt.*,
t1.*
FROM one_more_table omt
INNER JOIN table1 t1
ON omt.id = t1.id;
and
SELECT omt.*,
t1.*
FROM one_more_table omt
INNER JOIN (SELECT id,
first AS table1_first,
second AS table1_second
FROM some_table) AS t1
ON omt.id = t1.id;
Tell me are this two sql queries equal?
From a logical point of view, yes they are identical.
However some DBMS apply different optimization strategies for common table expression (first query) and derived tables (second query).
If you added a where condition in the "outer" query that restricts the rows inside the CTE it might not be pushed down into the CTE and thus might yield a different execution plan.
But this depends on the DBMS being used (the above is at least true for Postgres and I think Oracle. I don't know about e.g. DB2, SQL Server or other DBMS).

In T-SQL is it possible to have a result set returned by a select with subselects that cannot be obtained using joins instead?

Given a SQL SELECT expression with arbitrarily nested subselect's, it always possible to rewrite said SQL expression so that it contains no subselect's and returns the same result set?
If so, is there an algorithm for doing so?
If not, is there a characterization of those SELECT expressions that cannot be rewritten?
I'm making an application that will generate SQL SELECT statements. I'm still designing how it will work at this point. Here's the general idea, though:
The user will select what columns are displayed, how the results are sorted and how they are restricted.
The columns will not just be SQL columns but named objects such that the object can contain a SQL expression with column variables from multiple tables. These objects will contain information on how to join to each other.
I want to make the configuration of these expressions to be as flexible as possible; if it's possible to write the SELECT statement that returns some result set S, then I'd like the application to be able to generate a SELECT statement that returns S. One thing that's possible in SQL are sub-selects. I've read that rewriting said sub-select's with JOINS is better performance wise. Therefore I am considering disallowing sub-select's in the configuration. However I do not want to do this unless every sub-select can be rewritten as a join.
Subselects in the WHERE clause can often be impossible to rewrite as JOIN, especially if aggregate functions are in use.
Quoted from here:
Here is an example of a common-form subquery comparison which you can't do with a join: find all the values in table t1 which are equal to a maximum value in table t2.
SELECT column1 FROM t1
WHERE column1 = (SELECT MAX(column2) FROM t2);
Here is another example, which again is impossible with a join because it involves aggregating for one of the tables: find all rows in table t1 which contain a value which occurs twice.
SELECT * FROM t1
WHERE 2 = (SELECT COUNT(column1) FROM t1);
Therefore, if a complex subselect in the SELECT clause itself has subselects in its WHERE clause, that could be impossible to express as a JOIN.
SELECT T2.B, (SELECT A from t1 where t1.ID=T2.ID
and 2=(SELECT COUNT(A) from t1 as TX WHERE TX.A=T1.A))
FROM T2

Sql Server query syntax

I need to perform a query like this:
SELECT *,
(SELECT Table1.Column
FROM Table1
INNER JOIN Table2 ON Table1.Table2Id = Table2.Id
) as tmp
FROM Table2 WHERE tmp = 1
I know I can take a workaround but I would like to know if this syntax is possible as it is (I think) in Mysql.
The query you posted won't work on sql server, because the sub query in your select clause could possibly return more than one row. I don't know how MySQL will treat it, but from what I'm reading MySQL will also yield an error if the sub query returns any duplicates. I do know that SQL Server won't even compile it.
The difference is that MySQL will at least attempt to run the query and if you're very lucky (Table2Id is unique in Table1) it will succeed. More probably is will return an error. SQL Server won't try to run it at all.
Here is a query that should run on either system, and won't cause an error if Table2Id is not unique in Table1. It will return "duplicate" rows in that case, where the only difference is the source of the Table1.Column value:
SELECT Table2.*, Table1.Column AS tmp
FROM Table1
INNER JOIN Table2 ON Table1.Table2Id = Table2.Id
WHERE Table1.Column = 1
Perhaps if you shared what you were trying to accomplish we could help you write a query that does it.
SELECT *
FROM (
SELECT t.*,
(
SELECT Table1.Column
FROM Table1
INNER JOIN
Table2
ON Table1.Table2Id = Table2.Id
) as tmp
FROM Table2 t
) q
WHERE tmp = 1
This is valid syntax, but it will fail (both in MySQL and in SQL Server) if the subquery returns more than 1 row
What exactly are you trying to do?
Please provide some sample data and desired resultset.
I agree with Joel's solution but I want to discuss why your query would be a bad idea to use (even though the syntax is essentially valid). This is a correlated subquery. The first issue with these is that they don't work if the subquery could possibly return more than one value for a record. The second and more critical problem (in my mind) is that they must work row by row rather than on the set of data. This means they will virtually always affect performance. So correlated subqueries should almost never be used in a production system. In this simple case, the join Joel showed is the correct solution.
If the subquery is more complicated, you may want to turn it into a derived table instead (this also fixes the more than one value associated to a record problem). While a derived table looks a lot like a correlated subquery to the uninitated, it does not perform the same way because it acts on the set of data rather than row-by row and thus will often be significantly faster. You are essentially making the query a table in the join.
Below is an example of your query re-written as a derived table. (Of course in production code you would not use select * either especially in a join, spell out the fields you need)
SELECT *
FROM Table2 t2
JOIN
(SELECT Table1.[Column], Table1.Table2Id as tmp
FROM Table1
INNER JOIN Table2 ON Table1.Table2Id = Table2.Id ) as t
ON t.Table2Id = Table2.Id
WHERE tmp = 1
You've already got a variety of answers, some of them more useful than others. But to answer your question directly:
No, SQL Server will not allow you to reference the column alias (defined in the select list) in the predicate (the WHERE clause). I think that is sufficient to answer the question you asked.
Additional details:
(this discussion goes beyond the original question you asked.)
As you noted, there are several workarounds available.
Most problematic with the query you posted (as others have already pointed out) is that we aren't guaranteed that the subquery in the SELECT list returns only one row. If it does return more than one row, SQL Server will throw a "too many rows" exception:
Subquery returned more than 1 value.
This is not permitted when the subquery
follows =, !=, , >= or when the
subquery is used as an expression.
For the following discussion, I'm going to assume that issue is already sufficiently addressed.
Sometimes, the easiest way to make the alias available in the predicate is to use an inline view.
SELECT v.*
FROM ( SELECT *
, (SELECT Table1.Column
FROM Table1
JOIN Table2 ON Table1.Table2Id = Table2.Id
WHERE Table1.Column = 1
) as tmp
FROM Table2
) v
WHERE v.tmp = 1
Note that SQL Server won't push the predicate for the outer query (WHERE v.tmp = 1) into the subquery in the inline view. So you need to push that in yourself, by including the WHERE Table1.Column = 1 predicate in the subquery, particularly if you're depending on that to make the subquery return only one value.
That's just one approach to working around the problem, there are others. I suspect that query plan for this SQL Server query is not going to be optimal, for performance, you probably want to go with a JOIN or an EXISTS predicate.
NOTE: I'm not an expert on using MySQL. I'm not all that familiar with MySQL support for subqueries. I do know (from painful experience) that subqueries weren't supported in MySQL 3.23, which made migrating an application from Oracle 8 to MySQL 3.23 particularly painful.
Oh and btw... of no interest to anyone in particular, the Teradata DBMS engine DOES have an extension that allows for the NAMED keyword in place of the AS keyword, and a NAMED expression CAN be referenced elsewhere in the QUERY, including the WHERE clause, the GROUP BY clause and the ORDER BY clause. Shuh-weeeet
That kind of syntax is basically valid (you need to move the where tmp=... to on outer "select * from (....)", though), although it's ambiguous since you have two sets named "Table2"- you should probably define aliases on at least one of your usages of that table to clear up the ambiguity.
Unless you intended that to return a column from table1 corresponding to columns in table2 ... in which case you might have wanted to simply join the tables?

PostgreSQL - Correlated Sub-Query Fail?

I have a query like this:
SELECT t1.id,
(SELECT COUNT(t2.id)
FROM t2
WHERE t2.id = t1.id
) as num_things
FROM t1
WHERE num_things = 5;
The goal is to get the id of all the elements that appear 5 times in the other table. However, I get this error:
ERROR: column "num_things" does not exist
SQL state: 42703
I'm probably doing something silly here, as I'm somewhat new to databases. Is there a way to fix this query so I can access num_things? Or, if not, is there any other way of achieving this result?
A few important points about using SQL:
You cannot use column aliases in the WHERE clause, but you can in the HAVING clause. That's the cause of the error you got.
You can do your count better using a JOIN and GROUP BY than by using correlated subqueries. It'll be much faster.
Use the HAVING clause to filter groups.
Here's the way I'd write this query:
SELECT t1.id, COUNT(t2.id) AS num_things
FROM t1 JOIN t2 USING (id)
GROUP BY t1.id
HAVING num_things = 5;
I realize this query can skip the JOIN with t1, as in Charles Bretana's solution. But I assume you might want the query to include some other columns from t1.
Re: the question in the comment:
The difference is that the WHERE clause is evaluated on rows, before GROUP BY reduces groups to a single row per group. The HAVING clause is evaluated after groups are formed. So you can't, for example, change the COUNT() of a group by using HAVING; you can only exclude the group itself.
SELECT t1.id, COUNT(t2.id) as num
FROM t1 JOIN t2 USING (id)
WHERE t2.attribute = <value>
GROUP BY t1.id
HAVING num > 5;
In the above query, WHERE filters for rows matching a condition, and HAVING filters for groups that have at least five count.
The point that causes most people confusion is when they don't have a GROUP BY clause, so it seems like HAVING and WHERE are interchangeable.
WHERE is evaluated before expressions in the select-list. This may not be obvious because SQL syntax puts the select-list first. So you can save a lot of expensive computation by using WHERE to restrict rows.
SELECT <expensive expressions>
FROM t1
HAVING primaryKey = 1234;
If you use a query like the above, the expressions in the select-list are computed for every row, only to discard most of the results because of the HAVING condition. However, the query below computes the expression only for the single row matching the WHERE condition.
SELECT <expensive expressions>
FROM t1
WHERE primaryKey = 1234;
So to recap, queries are run by the database engine according to series of steps:
Generate set of rows from table(s), including any rows produced by JOIN.
Evaluate WHERE conditions against the set of rows, filtering out rows that don't match.
Compute expressions in select-list for each in the set of rows.
Apply column aliases (note this is a separate step, which means you can't use aliases in expressions in the select-list).
Condense groups to a single row per group, according to GROUP BY clause.
Evaluate HAVING conditions against groups, filtering out groups that don't match.
Sort result, according to ORDER BY clause.
All the other suggestions would work, but to answer your basic question it would be sufficient to write
SELECT id From T2
Group By Id
Having Count(*) = 5
I'd like to mention that in PostgreSQL there is no way to use aliased column in having clause.
i.e.
SELECT usr_id AS my_id FROM user HAVING my_id = 1
Wont work.
Another example that is not going to work:
SELECT su.usr_id AS my_id, COUNT(*) AS val FROM sys_user AS su GROUP BY su.usr_id HAVING val >= 1
There will be the same error: val column is not known.
Im highliting this because Bill Karwin wrote something not really true for Postgres:
"You cannot use column aliases in the WHERE clause, but you can in the HAVING clause. That's the cause of the error you got."
I think you could just rewrite your query like so:
SELECT t1.id
FROM t1
WHERE (SELECT COUNT(t2.id)
FROM t2
WHERE t2.id = t1.id
) = 5;
try this
SELECT t1.id,
(SELECT COUNT(t2.id) as myCount
FROM t2
WHERE t2.id = t1.id and myCount=5
) as num_things
FROM t1