Snowflake subquery - sql

I have two tables. Transaction(ID, TERMINALID) and Terminal(ID, TERMINALID, EXPORT_DATE). The goal is to obtain for each row from Transaction table newest recored from Terminal table. Snowflake is used as a backend.
I have this SQL query:
SELECT tr.ID,
(SELECT te.ID
FROM "Terminal" te
WHERE te.TERMINALID = tr.TERMINALID
ORDER BY te.EXPORT_DATE DESC
LIMIT 1)
FROM "Transaction" tr;
But I get this error:
SQL compilation error: Unsupported subquery type cannot be evaluated
Error disappears if I replace tr.TERMINALID with a specific value. So I can't reference parent table from nested SELECT. Why this is not possible? Query works in MySQL.

I'm afraid Snowflake doesn't support correlated subqueries of this kind.
You can achieve what you want by using FIRST_VALUE to compute best per-terminalid id :
-- First compute per-terminalid best id
with sub1 as (
select
terminalid,
first_value(id) over (partition by terminalid order by d desc) id
from terminal
),
-- Now, make sure there's only one per terminalid id
sub2 as (
select
terminalid,
any_value(id) id
from sub1
group by terminalid
)
-- Now use that result
select tr.ID, sub2.id
FROM "Transaction" tr
JOIN sub2 ON tr.terminalid = sub2.terminalid
You can run subqueries first to see what they do.
We're working on making our support for subqueries better, and possibly there's a simpler rewrite, but I hope it helps.

SELECT
tr.ID
, (SELECT te.ID
FROM "Terminal" te
WHERE te.TERMINALID = tr.TERMINALID
ORDER BY te.EXPORT_DATE DESC
LIMIT 1
) AS the_id -- <<-- add an alias for the column
FROM "Transaction" tr
;
UPDATE:
length for type varchar cannot exceed 10485760
just use type varchar (or text) instead
Works here (with quoted identifiers):
CREATE TABLE "Transaction" ("ID" VARCHAR(123), "TERMINALID" VARCHAR(123)) ;
CREATE TABLE "Terminal" ( "ID" VARCHAR(123), "TERMINALID" VARCHAR(123), "EXPORT_DATE" DATE);
SELECT tr."ID"
, (SELECT te."ID"
FROM "Terminal" te
WHERE te."TERMINALID" = tr."TERMINALID"
ORDER BY te."EXPORT_DATE" DESC
LIMIT 1) AS meuk
FROM "Transaction" tr
;
BONUS UPDATE: avoid the scalar subquery and use plain old NOT EXISTS(...) to obtain the record with the most recent date:
SELECT tr."ID"
, te."ID" AS meuk
FROM "Transaction" tr
JOIN "Terminal" te ON te."TERMINALID" = tr."TERMINALID"
AND NOT EXISTS ( SELECT *
FROM "Terminal" nx
WHERE nx."TERMINALID" = te."TERMINALID"
AND nx."EXPORT_DATE" > te."EXPORT_DATE"
)
;

So few years later (2022), some correlated subqueries are support, but not this one:
using this data:
WITH transaction(id, terminalid) AS (
SELECT * FROM VALUES
(1,10),
(2,11),
(3,12)
), terminal(id, terminalid, export_date) AS (
SELECT * FROM VALUES
(100, 10, '2022-03-18'::date),
(101, 10, '2022-03-19'::date),
(102, 11, '2022-03-20'::date),
(103, 11, '2022-03-21'::date),
(104, 11, '2022-03-22'::date),
(105, 12, '2022-03-23'::date)
)
So compared to Marcin's we can now use a QUALIFY to select only one value per terminalid in a single step:
WITH last_terminal as (
SELECT id,
terminalid
FROM terminal
QUALIFY row_number() over(PARTITION BY terminalid ORDER BY export_date desc) = 1
)
SELECT tr.ID,
te.id
FROM transaction AS tr
JOIN last_terminal AS te
ON te.TERMINALID = tr.TERMINALID
ORDER BY 1;
giving:
ID
ID
1
101
2
104
3
105
and if you have multiple terminals per day, and terimal.id is incrementing number you could use:
QUALIFY row_number() over(PARTITION BY terminalid ORDER BY export_date desc, id desc) = 1
Now if your table is not that larger, you can do the JOIN then prune via the QUALIFY, and avoid the CTE, but on large tables this is much less performant, so I would only use this form when doing ad-hoc queries, where swapping forms if viable if performance problems occur.
SELECT tr.ID,
te.id
FROM transaction AS tr
JOIN terminal AS te
ON te.TERMINALID = tr.TERMINALID
QUALIFY row_number() over(PARTITION BY tr.id ORDER BY te.export_date desc, te.id desc) = 1
ORDER BY 1;

This kind of subquery is currently not supported.
Working with subqueries - Limitations:
The only type of subquery that allows a LIMIT / FETCH clause is an uncorrelated scalar subquery. Also, because an uncorrelated scalar subquery returns only 1 row, the LIMIT clause has little or no practical value inside a subquery
Query in question is correlated subquery, thus the result.
SELECT tr.ID,
(SELECT te.ID
FROM "Terminal" te
WHERE te.TERMINALID = tr.TERMINALID --correlation
ORDER BY te.EXPORT_DATE DESC
LIMIT 1)
FROM "Transaction" tr;

Related

Using row_number() in subquery results in ORA-00913: too many values

In Oracle, I wish to do something like the SQL below. For each row in "criteria," I want to find the latest row in another table (by last_modified_date) for the same location_id, and use that value to set default_start_interval. Or, if there is no such value, then use 30. However, as you can see, the subquery must have two values in the select statement to use row_number(). That causes an error. How do I reformat it so that it works?
update criteria pc set default_start_interval =
COALESCE(
(SELECT start_interval,
row_number() over(partition by aday.location_id
order by atime.last_modified_date desc
) as rn
FROM available_time atime
JOIN available_day aday ON aday.available_day_id = atime.available_day_id
WHERE aday.location_id = pc.location_id
and rn = 1)
, 30)
There are two issues in your update query:
The update expects only one value per row for default_start_interval, however, you have two columns in the select list.
The row number should be assigned before in the inner query, and then apply filter where rn = 1 in outer query.
Your update query should look like:
UPDATE criteria pc
SET default_start_interval = NVL(
(
SELECT start_interval FROM(
SELECT
start_interval, ROW_NUMBER() OVER(
PARTITION BY aday.location_id
ORDER BY atime.last_modified_date DESC
) AS rn
FROM
available_time atime
JOIN available_day aday ON aday.available_day_id = atime.available_day_id
WHERE
aday.location_id = pc.location_id
)
WHERE rn = 1)
, 30)
Note: You could simply use NVL instead of COALESCE as you only have one value to check for NULL. COALESCE is useful when you have multiple expressions.
I think a simpler method uses aggregation and keep to get the value you want:
update criteria pc
set default_start_interval =
(select coalesce(max(start_interval) keep (dense_rank first order by atime.last_modified_date desc), 30)
from available_time atime join
available_day aday
on aday.available_day_id = atime.available_day_id
where aday.location_id = pc.location_id
);
An aggregation query with no GROUP always returns one row. If no rows match, then the returned value is NULL -- the COALESCE() captures this case.

rownum vs max in oracle

I have a couple of queries to get latest modified parent_id
with max keyword:
select max(parent_id)
from sample_table
where modified_date=(select max(modified_date)
FROM sample_table
where id = 'test') and id = 'test';
with rownum keyword:
select *
from (
select parent_id,modified_date
from sample_table
where id = 'test'
order by modified_date desc)
WHERE rownum <= 1 ;
Both the queries returning same and correct result.
Which one is better and faster query..
Your query is somewhat unpredictable because two records can have the same modified_date. So you have to apply a trick to return a single row only.
The first query is deterministic: It takes the latestd modified_date; if it returns several rows it takes the one with the highest parent_id. The second query is unpredicatable: it depends on how Oracle executes the query.
I would use the second query and modify it slightly to move the two order criteria close to each other:
select *
from (
select parent_id,modified_date
from sample_table
where id = 'test'
order by modified_date desc, parent_id desc)
WHERE rownum <= 1;
This type of query can also be better extended to return more columns, namely by adding it to the inner SELECT clause. In the other query, it's trickier.
You would say the best way is this one:
SELECT
MAX(parent_id) KEEP (DENSE_RANK FIRST ORDER BY modified_date desc, parent_id desc),
MAX(modified_date)
FROM sample_table
WHERE ID = 'test';

optimizing row number query

I am using sql server 2008 r2 and had below query
select * from
(
select d.ID as ID,
....
....
ROW_NUMBER() OVER
(
ORDER BY #some field
) AS RowNum
from
/*some table*/
LEFT join
(select Device_ID,
Level,
ROW_NUMBER() over (partition by Device_ID order by id desc) as rn
from #sometable as de WITH (NOLOCK)
where #some condition
) t
where t.rn = 1)tmp on ID=tmp.Device_ID **/* sort operation 1*/**
/*some more joins */
WHERE /*some condition*/
) as DbD
where RowNum BETWEEN #SkipRowsLocal and (#SkipRowsLocal + #TakeRowsLocal - 1)
order by RowNum
I am trying to implement pagination kind of query from sample
http://blog.sqlauthority.com/2013/04/14/sql-server-tricks-for-row-offset-and-paging-in-various-versions-of-sql-server/
but looks like it's executing very slow, when I looked into query plan sort operation is consuming almost 50% of query time and i guess its the 1st sort operation which I marked as 1, basically in temp table t I want to retrieve latest value and in outer row number I wanted to fetch say only 40 records.
It's basically like sorting 10K rows and then taking 40 out of it, is there is any way we can improve this query?
instead of a left join, try an OUTER APPLY, to get the TOP 1 of the device you are interested in - think of it as 'I get a record from the first table, then for that record, I go and get the TOP 1 for that device ID in the other table, but remembering that in the second table, you need to search by the data from the first table, rather than a join
select * from
(
select d.ID as ID,
....
....
ROW_NUMBER() OVER
(
ORDER BY #some field
) AS RowNum
from
/*some table*/
OUTER APPLY
(select TOP 1 Device_ID,
Level
from #sometable as de WITH (NOLOCK)
where #some condition and Device_ID = [/*some table*/].id ORDER BY id DESC
) t
)tmp
/*some more joins */
WHERE /*some condition*/
) as DbD
where RowNum BETWEEN #SkipRowsLocal and (#SkipRowsLocal + #TakeRowsLocal - 1)
order by RowNum
something on those lines, you might want to build just the relevant parts of the query and compare them, it would seem to avoid sorting and with an index might well be quick

Why no windowed functions in where clauses?

Title says it all, why can't I use a windowed function in a where clause in SQL Server?
This query makes perfect sense:
select id, sales_person_id, product_type, product_id, sale_amount
from Sales_Log
where 1 = row_number() over(partition by sales_person_id, product_type, product_id order by sale_amount desc)
But it doesn't work. Is there a better way than a CTE/Subquery?
EDIT
For what its worth this is the query with a CTE:
with Best_Sales as (
select id, sales_person_id, product_type, product_id, sale_amount, row_number() over (partition by sales_person_id, product_type, product_id order by sales_amount desc) rank
from Sales_log
)
select id, sales_person_id, product_type, product_id, sale_amount
from Best_Sales
where rank = 1
EDIT
+1 for the answers showing with a subquery, but really I'm looking for the reasoning behind not being able to use windowing functions in where clauses.
why can't I use a windowed function in a where clause in SQL Server?
One answer, though not particularly informative, is because the spec says that you can't.
See the article by Itzik Ben Gan - Logical Query Processing: What It Is And What It Means to You and in particular the image here. Window functions are evaluated at the time of the SELECT on the result set remaining after all the WHERE/JOIN/GROUP BY/HAVING clauses have been dealt with (step 5.1).
really I'm looking for the reasoning behind not being able to use
windowing functions in where clauses.
The reason that they are not allowed in the WHERE clause is that it would create ambiguity. Stealing Itzik Ben Gan's example from High-Performance T-SQL Using Window Functions (p.25)
Suppose your table was
CREATE TABLE T1
(
col1 CHAR(1) PRIMARY KEY
)
INSERT INTO T1 VALUES('A'),('B'),('C'),('D'),('E'),('F')
And your query
SELECT col1
FROM T1
WHERE ROW_NUMBER() OVER (ORDER BY col1) <= 3
AND col1 > 'B'
What would be the right result? Would you expect that the col1 > 'B' predicate ran before or after the row numbering?
There is no need for CTE, just use the windowing function in a subquery:
select id, sales_person_id, product_type, product_id, sale_amount
from
(
select id, sales_person_id, product_type, product_id, sale_amount,
row_number() over(partition by sales_person_id, product_type, product_id order by sale_amount desc) rn
from Sales_Log
) sl
where rn = 1
Edit, moving my comment to the answer.
Windowing functions are not performed until the data is actually selected which is after the WHERE clause. So if you try to use a row_number in a WHERE clause the value is not yet assigned.
"All-at-once operation" means that all expressions in the same
logical query process phase are evaluated logically at the same time.
And great chapter Impact on Window Functions:
Suppose you have:
CREATE TABLE #Test ( Id INT) ;
INSERT INTO #Test VALUES ( 1001 ), ( 1002 ) ;
SELECT Id
FROM #Test
WHERE Id = 1002
AND ROW_NUMBER() OVER(ORDER BY Id) = 1;
All-at-Once operations tell us these two conditions evaluated logically at the same point of time. Therefore, SQL Server can
evaluate conditions in WHERE clause in arbitrary order, based on
estimated execution plan. So the main question here is which condition
evaluates first.
Case 1:
If ( Id = 1002 ) is first, then if ( ROW_NUMBER() OVER(ORDER BY Id) = 1 )
Result: 1002
Case 2:
If ( ROW_NUMBER() OVER(ORDER BY Id) = 1 ), then check if ( Id = 1002 )
Result: empty
So we have a paradox.
This example shows why we cannot use Window Functions in WHERE clause.
You can think more about this and find why Window Functions are
allowed to be used just in SELECT and ORDER BY clauses!
Addendum
Teradata supports QUALIFY clause:
Filters results of a previously computed ordered analytical function according to user‑specified search conditions.
SELECT Id
FROM #Test
WHERE Id = 1002
QUALIFY ROW_NUMBER() OVER(ORDER BY Id) = 1;
Snowflake - Qualify
QUALIFY does with window functions what HAVING does with aggregate functions and GROUP BY clauses.
In the execution order of a query, QUALIFY is therefore evaluated after window functions are computed. Typically, a SELECT statement’s clauses are evaluated in the order shown below:
From
Where
Group by
Having
Window
QUALIFY
Distinct
Order by
Limit
Databricks - QUALIFY clasue
Filters the results of window functions. To use QUALIFY, at least one window function is required to be present in the SELECT list or the QUALIFY clause.
You don't necessarily need to use a CTE, you can query the result set after using row_number()
select row, id, sales_person_id, product_type, product_id, sale_amount
from (
select
row_number() over(partition by sales_person_id,
product_type, product_id order by sale_amount desc) AS row,
id, sales_person_id, product_type, product_id, sale_amount
from Sales_Log
) a
where row = 1
It's an old thread, but I'll try to answer specifically the question expressed in the topic.
Why no windowed functions in where clauses?
SELECT statement has following main clauses specified in keyed-in order:
SELECT DISTINCT TOP list
FROM JOIN ON / APPLY / PIVOT / UNPIVOT
WHERE
GROUP BY WITH CUBE / WITH ROLLUP
HAVING
ORDER BY
OFFSET-FETCH
Logical Query Processing Order, or Binding Order, is conceptual interpretation order, it defines the correctness of the query. This order determines when the objects defined in one step are made available to the clauses in subsequent steps.
----- Relational result
1. FROM
1.1. ON JOIN / APPLY / PIVOT / UNPIVOT
2. WHERE
3. GROUP BY
3.1. WITH CUBE / WITH ROLLUP
4. HAVING
---- After the HAVING step the Underlying Query Result is ready
5. SELECT
5.1. SELECT list
5.2. DISTINCT
----- Relational result
----- Non-relational result (a cursor)
6. ORDER BY
7. TOP / OFFSET-FETCH
----- Non-relational result (a cursor)
For example, if the query processor can bind to (access) the tables or views defined in the FROM clause, these objects and their columns are made available to all subsequent steps.
Conversely, all clauses preceding the SELECT clause cannot reference any column aliases or derived columns defined in SELECT clause. However, those columns can be referenced by subsequent clauses such as the ORDER BY clause.
OVER clause determines the partitioning and ordering of a row set before the associated window function is applied. That is, the OVER clause defines a window or user-specified set of rows within an Underlying Query Result set and window function computes result against that window.
Msg 4108, Level 15, State 1, …
Windowed functions can only appear in the SELECT or ORDER BY clauses.
The reason behind is because the way how Logical Query Processing works in T-SQL. Since the underlying query result is established only when logical query processing reaches the SELECT step 5.1. (that is, after processing the FROM, WHERE, GROUP BY and HAVING steps), window functions are allowed only in the SELECT and ORDER BY clauses of the query.
Note to mention, window functions are still part of relational layer even Relational Model doesn't deal with ordered data. The result after the SELECT step 5.1. with any window function is still relational.
Also, speaking strictly, the reason why window function are not allowed in the WHERE clause is not because it would create ambiguity, but because the order how Logical Query Processing processes SELECT statement in T-SQL.
Links: here, here and here
Finally, there's the old-fashioned, pre-SQL Server 2005 way, with a correlated subquery:
select *
from Sales_Log sl
where sl.id = (
Select Top 1 id
from Sales_Log sl2
where sales_person_id = sl.sales_person_id
and product_type = sl.product_type
and product_id = sl.product_id
order by sale_amount desc
)
I give you this for completeness, merely.
Basically first "WHERE" clause condition is read by sql and the same column/value id looked into the table but in table row_num=1 is not there still. Hence it will not work.
Thats the reason we will use parentheses first and after that we will write the WHERE clause.
Yes unfortunately when you do a windowed function SQL gets mad at you even if your where predicate is legitimate. You make a cte or nested select having the value in your select statement, then reference your CTE or nested select with that value later. Simple example that should be self explanatory. If you really HATE cte's for some performance issue on doing a large data set you can always drop to temp table or table variable.
declare #Person table ( PersonID int identity, PersonName varchar(8));
insert into #Person values ('Brett'),('John');
declare #Orders table ( OrderID int identity, PersonID int, OrderName varchar(8));
insert into #Orders values (1, 'Hat'),(1,'Shirt'),(1, 'Shoes'),(2,'Shirt'),(2, 'Shoes');
--Select
-- p.PersonName
--, o.OrderName
--, row_number() over(partition by o.PersonID order by o.OrderID)
--from #Person p
-- join #Orders o on p.PersonID = o.PersonID
--where row_number() over(partition by o.PersonID order by o.orderID) = 2
-- yields:
--Msg 4108, Level 15, State 1, Line 15
--Windowed functions can only appear in the SELECT or ORDER BY clauses.
;
with a as
(
Select
p.PersonName
, o.OrderName
, row_number() over(partition by o.PersonID order by o.OrderID) as rnk
from #Person p
join #Orders o on p.PersonID = o.PersonID
)
select *
from a
where rnk >= 2 -- only orders after the first one.

Oracle SQL order by in subquery problems!

I am trying to run a subquery in Oracle SQL and it will not let me order the subquery columns. Ordering the subquery is important as Oracle seems to choose at will which of the returned columns to return to the main query.
select ps.id, ps.created_date, pst.last_updated, pst.from_state, pst.to_state,
(select last_updated from mwcrm.process_state_transition subpst
where subpst.last_updated > pst.last_updated
and subpst.process_state = ps.id
and rownum = 1) as next_response
from mwcrm.process_state ps, mwcrm.process_state_transition pst
where ps.created_date > sysdate - 1/24
and ps.id=pst.process_state
order by ps.id asc
Really should be:
select ps.id, ps.created_date, pst.last_updated, pst.from_state, pst.to_state,
(select last_updated from mwcrm.process_state_transition subpst
where subpst.last_updated > pst.last_updated
and subpst.process_state = ps.id
and rownum = 1
order by subpst.last_updated asc) as next_response
from mwcrm.process_state ps, mwcrm.process_state_transition pst
where ps.created_date > sysdate - 1/24
and ps.id=pst.process_state
order by ps.id asc
Both dcw and Dems have provided appropriate alternative queries. I just wanted to toss in an explanation of why your query isn't behaving the way you expected it to.
If you have a query that includes a ROWNUM and an ORDER BY, Oracle applies the ROWNUM first and then the ORDER BY. So the query
SELECT *
FROM emp
WHERE rownum <= 5
ORDER BY empno
gets an arbitrary 5 rows from the EMP table and sorts them-- almost certainly not what was intended. If you want to get the "first N" rows using ROWNUM, you would need to nest the query. This query
SELECT *
FROM (SELECT *
FROM emp
ORDER BY empno)
WHERE rownum <= 5
sorts the rows in the EMP table and returns the first 5.
Actually "ordering" only makes sense on the outermost query -- if you order in a subquery, the outer query is permitted to scramble the results at will, so the subquery ordering does essentially nothing.
It looks like you just want to get the minimum last_updated that is greater than pst.last_updated -- its easier when you look at it as the minimum (an aggregate), rather than a first row (which brings about other problems, like what if there are two rows tied for next_response?)
Give this a shot. Fair warning, been a few years since I've had Oracle in front of me, and I'm not used to the subquery-as-a-column syntax; if this blows up I'll make a version with it in the from clause.
select
ps.id, ps.created_date, pst.last_updated, pst.from_state, pst.to_state,
( select min(last_updated)
from mwcrm.process_state_transition subpst
where subpst.last_updated > pst.last_updated
and subpst.process_state = ps.id) as next_response
from <the rest>
I've experienced this myself and you have to use ROW_NUMBER(), and an extra level of subquery, instead of rownum...
Just showing the new subquery, something like...
(
SELECT
last_updated
FROM
(
select
last_updated,
ROW_NUMBER() OVER (ORDER BY last_updated ASC) row_id
from
mwcrm.process_state_transition subpst
where
subpst.last_updated > pst.last_updated
and subpst.process_state = ps.id
)
as ordered_results
WHERE
row_id = 1
)
as next_response
An alternative would be to use MIN instead...
(
select
MIN(last_updated)
from
mwcrm.process_state_transition subpst
where
subpst.last_updated > pst.last_updated
and subpst.process_state = ps.id
)
as next_response
The confirmed answer is plain wrong.
Consider a subquery that generates a unique row index number.
For example ROWNUM in Oracle.
You need the subquery to create the unique record number for paging purposes (see below).
Consider the following example query:
SELECT T0.*, T1.* FROM T0 LEFT JOIN T1 ON T0.Id = T1.Id
JOIN
(
SELECT DISTINCT T0.*, ROWNUM FROM T0 LEFT JOIN T1 ON T0.Id = T1.Id
WHERE (filter...)
)
WHERE (filter...) AND (ROWNUM > 10 AND ROWNUM < 20)
ORDER BY T1.Name DESC
The inner query is the exact same query but DISTINCT on T0.
You can't put the ROWNUM on the outer query since the LEFT JOIN(s) could generate many more results.
If you could order the inner query (T1.Name DESC) the generated ROWNUM in the inner query would match.
Since you cannot use an ORDER BY in the subquery the numbers wont match and will be useless.
Thank god for ROW_NUMBER OVER (ORDER BY ...) which fixes this issue.
Although not supported by all DB engines.
One of the two methods, LIMIT (does not require ORDER) and the ROW_NUMBER() OVER will cover most DB engines.
But still if you don't have one of these options, for example the ROWNUM is your only option then a ORDER BY on the subquery is a must!