Any equivalent in BigQuery to Postgres ANY() array?

Coming to BigQuery somewhat recently from Postgres, I'm used to using the following pattern fairly regularly in certain cases (though not where the CTE result set would be too big, of course).
with cte1 as (
select id from my_table where feature1 = x and feature2 = y
)
select id, detail1, detail2, detail3 from my_table
where id = ANY( select id from cte1 )
The idea is to use the CTE as a place to gather records of particular interest (something in common, something uncommon, outliers, dupes, etc.) and then pass those ids to the main select using the id = ANY( select id from cte ) pattern.
What would be the equivalent in BigQuery?

The equivalent is
where id in ( select id from CTE )
So your query can look like this:
with cte1 as (
select id from my_table where feature1 = x and feature2 = y
)
select id, detail1, detail2, detail3 from my_table
where id in ( select id from cte1 )
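Worth noting: if you ever have an actual array value in hand (for example a query parameter) rather than a subquery, the closest BigQuery analog to Postgres' id = ANY(array) is IN UNNEST. A minimal sketch, where @target_ids is a hypothetical ARRAY<INT64> query parameter:
select id, detail1, detail2, detail3
from my_table
where id in unnest(@target_ids) -- unnest expands the array into a set to match against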

Related

insert into table where not exists (dbt)

I want to insert new data into STG.NEW_TABLE when the id does not already exist in that table.
Since we can't use insert statements in dbt, I am not sure how to convert this query for dbt:
insert into STG.NEW_TABLE
select FILE_NAME::string as file_name,
id::string as id,
XMLGET(f.value,'value'):"$"::string as val
from STG.OLD_TABLE ot,
lateral flatten(input => XMLGET( ot.XML_DATA, 'data' ):"$") f
where not exists (select id
from stg.NEW_TABLE nt
where nt.id=ot.id);
I tried breaking it into two parts, where I first flatten the XML into a TMP view and then do a union in another model:
(SELECT * FROM {{ref('TMP_VIEW')}} where id NOT IN
(SELECT id
FROM STG.NEW_TABLE))
UNION ALL
SELECT * FROM STG.NEW_TABLE
as suggested here: https://docs.getdbt.com/guides/migration/tools/migrating-from-stored-procedures/2-inserts
While it seems to work, it doesn't feel like the most efficient solution. I do not want to split the process into two parts. I am guessing something with CTEs might be better. Any recommendations?
In dbt you can set the materialization of a table to be incremental in the schema.yml file or within the sql file itself with:
{{ config(materialized='incremental', dist='id') }}
https://docs.getdbt.com/docs/build/materializations#incremental
This will only insert records that don't already exist in the table (the new-rows filter is typically written with an is_incremental() block, as sketched below).
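For illustration, a minimal sketch of the model written as an incremental dbt model, assuming id uniquely identifies a record and reusing the flattening select from the question:
{{ config(materialized='incremental', unique_key='id') }}
select FILE_NAME::string as file_name,
id::string as id,
XMLGET(f.value,'value'):"$"::string as val
from STG.OLD_TABLE ot,
lateral flatten(input => XMLGET( ot.XML_DATA, 'data' ):"$") f
{% if is_incremental() %}
-- {{ this }} resolves to the already-built target table; skip ids it already contains
where not exists (select 1 from {{ this }} nt where nt.id = ot.id)
{% endif %}
On the first run dbt builds the table from the full select; on later runs only rows passing the is_incremental() filter are inserted.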
edit based on comments
What you can do is union the new and old tables, then rank the rows within each id to pick a unique one. This allows the model to be incremental, and by flagging which rows come from the old/new table you can ensure only new records are inserted.
{{ config(materialized='incremental', dist='id') }}
with cte as (
select *, 0 oldtable from stg.NEW_TABLE
union all
select FILE_NAME::string as file_name,
id::string as id,
XMLGET(f.value,'value'):"$"::string as val,
1 oldtable
from STG.OLD_TABLE ot,
lateral flatten(input => XMLGET( ot.XML_DATA, 'data' ):"$") f
where not exists (select id
from stg.NEW_TABLE nt
where nt.id=ot.id)
),
final as (
select *,
row_number() OVER ( partition by id order by oldtable ) rn
from cte
)
select
file_name,
id,
val
from final
where rn = 1
and oldtable = 1

sql: first row after the last row with a property

I would like to write a query that returns the first row immediately after the last row with a given property (ordered by id). IDs may not be consecutive.
Ideally it would look something like this:
...
JOIN (select max(id) id from my_table where CONDITION) m
JOIN (select min(id) from my_table where id > m.id) n
However, I cannot reference the identifier m in the second subselect.
It is possible to use nested queries in nested queries, but is there an easier way?
Thank you.
You could use lead() to get the next id before applying the condition:
select t.*
from my_table t join
(select max(next_id) as max_next_id
from (select t.*, lead(id) over (order by id) as next_id
from my_table t
) t
where <condition>
) tt
on t.id = tt.max_next_id;
You could also do:
select t.*
from my_table t
where t.id > (select max(t2.id) from my_table t2 where <condition>)
order by t.id asc
fetch first 1 row only;
I am not sure how this is getting woven into the rest of your query, so I have used a CTE
WITH max_next AS (
SELECT r.id as max_id
,r.next_id
FROM (
SELECT m.id
,m.next_id
,ROW_NUMBER() OVER (ORDER BY m.id DESC) AS rn
FROM (
SELECT n.* -- to provide data to satisfy CONDITIONS
,LEAD(n.id) OVER(ORDER BY n.id) as next_id
FROM my_table AS n
) AS m
WHERE CONDITIONS
) AS r
WHERE r.rn = 1
)
I would also shrink the n.* down to just the columns CONDITIONS needs, rather than being implicit, because SELECT * slows the compile down (or historically has), as all the metadata has to be read to work out which columns the table contains. And while the compiler can also prune unused columns, it's faster to just ask for what you want: in the best case it's only a compile-time saving; in the worst case, * reads all the data when you only needed a few columns.
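For instance, if CONDITIONS only touches con1, the inner select could be narrowed to something like this (the column names are just illustrative):
SELECT n.id -- only the columns CONDITIONS and the output actually need
,n.con1
,LEAD(n.id) OVER(ORDER BY n.id) as next_id
FROM my_table AS n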
And borrowing from Gordon's solution, the ROW_NUMBER part could be simpler:
WITH max_next AS (
SELECT m.id
,m.next_id
--, plus what ever other things you want from m
FROM (
SELECT n.* -- to satisfy CONDITIONS needs
,LEAD(n.id) OVER(ORDER BY n.id) as next_id
FROM my_table AS n
) AS m
WHERE CONDITIONS
ORDER BY m.id DESC LIMIT 1
)
So, as an example for @PIG:
WITH my_table AS (
SELECT column1 AS id
,column2 AS con1
,column3 AS other
FROM VALUES (1,'a',123),(2,'b',234),(3,'a',345),(5,'b',456),(7,'a',567),(10,'c',678)
)
SELECT m.id
,m.next_id
,m.other
FROM (
SELECT n.* -- to satisfy CONDITIONS needs
,LEAD(n.id) OVER(ORDER BY n.id) as next_id
FROM my_table AS n
) AS m
WHERE m.con1 = 'b'
ORDER BY m.id DESC LIMIT 1;
gives 5, 7, 456, which is the last 'b' row and the id of the row after it, with an extra value added to my_table for entertainment purposes (and run on Snowflake too, which means I fixed the prior SQL also).
This should work, it's pretty straightforward, and it's good that you know records may not be stored in an ordered/consecutive fashion.
SELECT *
FROM my_table
WHERE id = (
SELECT min(id)
FROM my_table
WHERE id > (
SELECT max(id)
FROM my_table
WHERE CONDITION));

How to return two values from PostgreSQL subquery?

I have a problem where I need to get the last item across various tables in PostgreSQL.
The following code works and returns me the type of the latest update and when it was last updated.
The problem is that this query needs to be used as a subquery, so I want to select both the type and the last updated value from it, and PostgreSQL does not seem to like this... (subquery must return only one column)
Any suggestions?
SELECT last.type, last.max FROM (
SELECT MAX(a.updated_at), 'a' AS type FROM table_a a WHERE a.ref = 5 UNION
SELECT MAX(b.updated_at), 'b' AS type FROM table_b b WHERE b.ref = 5
) AS last ORDER BY max LIMIT 1
The query is used like this inside a CTE:
WITH sql_query as (
SELECT id, name, address, (...other columns),
last.type, last.max FROM (
SELECT MAX(a.updated_at), 'a' AS type FROM table_a a WHERE a.ref = 5 UNION
SELECT MAX(b.updated_at), 'b' AS type FROM table_b b WHERE b.ref = 5
) AS last ORDER BY max LIMIT 1
FROM table_c
WHERE table_c.fk_id = 1
)
The inherent problem is that SQL (all SQL, not just Postgres) requires that a subquery used within a select clause return a single value. If you think about that restriction for a while, it does make sense: the select clause returns rows and a fixed number of columns, so each row/column location is a single position within a grid. You can bend that rule a bit by putting a concatenation into a single position (or a single "complex type", like a JSON value), but it remains a single position in that grid regardless.
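For illustration, a sketch of both of those bends, assuming a hypothetical last_rows relation holding the type/max_date pairs; each expression still occupies exactly one column position:
SELECT c.id
, (SELECT l.type || ' ' || l.max_date::text FROM last_rows l ORDER BY l.max_date DESC LIMIT 1) AS last_concat -- concatenation into one value
, (SELECT to_jsonb(l) FROM last_rows l ORDER BY l.max_date DESC LIMIT 1) AS last_json -- one complex value
FROM table_c c;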
Here, however, you want 2 separate columns AND you need to return both columns from the same row, so instead of LIMIT 1 I suggest using ROW_NUMBER() to facilitate this:
WITH LastVals as (
SELECT type
, max_date
, row_number() over(order by max_date DESC) as rn
FROM (
SELECT MAX(a.updated_at) AS max_date, 'a' AS type FROM table_a a WHERE a.ref = 5
UNION ALL
SELECT MAX(b.updated_at) AS max_date, 'b' AS type FROM table_b b WHERE b.ref = 5
) AS u
)
, sql_query as (
SELECT id
, name, address, (...other columns)
, (select type from lastVals where rn = 1) as last_type
, (select max_date from lastVals where rn = 1) as last_date
FROM table_c
WHERE table_c.fk_id = 1
)
----
By the way, in your subquery you should use UNION ALL. With type being a constant like 'a' or 'b', even if MAX(a.updated_at) were identical for 2 or more tables, the rows would still be unique because of the difference in type. UNION will attempt to remove duplicate rows, but here it just isn't going to help, so avoid that wasted effort by using UNION ALL.
----
For another way to skin this cat, consider using a LEFT JOIN instead
SELECT id
, name, address, (...other columns)
, LastVals.type
, LastVals.last_date
FROM table_c
LEFT JOIN (
SELECT type
, last_date
, row_number() over(order by last_date DESC) as rn
FROM (
SELECT MAX(a.updated_at) AS last_date, 'a' AS type FROM table_a a WHERE a.ref = 5
UNION ALL
SELECT MAX(b.updated_at) AS last_date, 'b' AS type FROM table_b b WHERE b.ref = 5
) AS u
) LastVals ON LastVals.rn = 1
WHERE table_c.fk_id = 1

Scalable Solution to get latest row for each ID in BigQuery

I have a quite large table with an ID field and another field, collection_time. I want to select the latest record for each ID. Unfortunately, the combination (ID, collection_time) is not unique in my data, and I want just one of the records with the maximum collection time. I have tried two solutions but neither of them has worked for me:
First: using query
SELECT * FROM
(SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY collection_time DESC) as rn
FROM mytable) where rn=1
This results in Resources exceeded error that I guess is because of ORDER BY in the query.
Second
Using join between table and latest time:
(SELECT tab1.*
FROM mytable AS tab1
INNER JOIN EACH
(SELECT ID, MAX(collection_time) AS second_time
FROM mytable GROUP EACH BY ID) AS tab2
ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time)
This solution does not work for me because (ID, collection_time) is not unique, so the JOIN result would contain multiple rows for each ID.
I am wondering if there is a workaround for the resourcesExceeded error, or a different query that would work in my case?
SELECT
agg.table.*
FROM (
SELECT
id,
ARRAY_AGG(STRUCT(table)
ORDER BY
collection_time DESC)[SAFE_OFFSET(0)] agg
FROM
`dataset.table` table
GROUP BY
id)
This will do the job for you, and it is scalable: since the struct captures every column, you won't have to change this query as the schema keeps changing.
Short and scalable version:
select array_agg(t order by collection_time desc limit 1)[offset(0)].*
from mytable t
group by t.id;
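Here t stands for the whole row of mytable: array_agg(t order by collection_time desc limit 1) keeps only the newest row per id as a single struct, and [offset(0)].* expands that struct back into columns.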
Quick and dirty option: combine both of your queries into one. First get all records with the latest collection_time (using your second query) and then dedup them using your first query:
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY ID) AS rn
FROM (
SELECT tab1.*
FROM mytable AS tab1
INNER JOIN (
SELECT ID, MAX(collection_time) AS second_time
FROM mytable GROUP BY ID
) AS tab2
ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time
)
)
WHERE rn = 1
And with Standard SQL (proposed by S.Mohsen sh)
WITH myTable AS (
SELECT 1 AS ID, 1 AS collection_time
),
tab1 AS (
SELECT ID,
MAX(collection_time) AS second_time
FROM myTable GROUP BY ID
),
tab2 AS (
SELECT * FROM myTable
),
joint AS (
SELECT tab2.*
FROM tab2 INNER JOIN tab1
ON tab2.ID=tab1.ID AND tab2.collection_time=tab1.second_time
)
SELECT * EXCEPT(rn)
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY ID) AS rn
FROM joint
)
WHERE rn=1
If you don't care about writing a piece of code for every column:
SELECT ID,
ARRAY_AGG(col1 ORDER BY collection_time DESC)[OFFSET(0)] AS col1,
ARRAY_AGG(col2 ORDER BY collection_time DESC)[OFFSET(0)] AS col2
FROM myTable
GROUP BY ID
I see no one has mentioned window functions with QUALIFY:
SELECT *, MAX(collection_time) OVER (PARTITION BY id) AS max_timestamp
FROM my_table
QUALIFY collection_time = max_timestamp
The window function adds a column max_timestamp that is accessible in the QUALIFY clause to filter on.
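A more compact variant of the same idea, if you want exactly one row per id (ROW_NUMBER breaks ties arbitrarily):
SELECT *
FROM my_table
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY collection_time DESC) = 1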
As per your comment: considering you have a table with unique IDs for which you need to find the latest collection_time, here is another way to do it using a correlated subquery. Give it a try.
SELECT id,
(SELECT Max(collection_time)
FROM mytable B
WHERE A.id = B.id) AS Max_collection_time
FROM id_table A
Another solution, which could be more scalable since it avoids multiple scans of the same table (which happen with both the self-join and the correlated subquery in the answers above). This solution only works with standard SQL (uncheck the "Use Legacy SQL" option):
SELECT
ID,
(SELECT AS STRUCT srow.*
FROM UNNEST(t.srows) srow
ORDER BY srow.collection_time DESC LIMIT 1) latest
FROM
(SELECT ID, ARRAY_AGG(STRUCT(col1, col2, col3, ...)) srows
FROM id_table
GROUP BY ID) t

Oracle SQL -- Analytic functions OVER a group?

My table:
ID NUM VAL
1 1 Hello
1 2 Goodbye
2 2 Hey
2 4 What's up?
3 5 See you
If I want to return the max number for each ID, it's really nice and clean:
SELECT MAX(NUM) FROM table GROUP BY (ID)
But what if I want to grab the value associated with the max of each number for each ID?
Why can't I do:
SELECT MAX(NUM) OVER (ORDER BY NUM) FROM table GROUP BY (ID)
Why is that an error? I'd like to have this select grouped by ID, rather than partitioning separately for each window...
EDIT: The error is "not a GROUP BY expression".
You could probably use the MAX() KEEP(DENSE_RANK LAST...) function:
with sample_data as (
select 1 id, 1 num, 'Hello' val from dual union all
select 1 id, 2 num, 'Goodbye' val from dual union all
select 2 id, 2 num, 'Hey' val from dual union all
select 2 id, 4 num, 'What''s up?' val from dual union all
select 3 id, 5 num, 'See you' val from dual)
select id, max(num), max(val) keep (dense_rank last order by num)
from sample_data
group by id;
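Here KEEP (DENSE_RANK LAST ORDER BY num) restricts the MAX(val) aggregate to the row(s) holding the greatest num within each id group, so each id comes back with the val belonging to its highest num: (1, 2, Goodbye), (2, 4, What's up?), (3, 5, See you).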
When you use a windowing function, you don't need to use GROUP BY anymore; this would suffice:
select id,
max(num) over(partition by id)
from x
Actually, you can get the result without using a windowing function:
select *
from x
where (id,num) in
(
select id, max(num)
from x
group by id
)
Output:
ID NUM VAL
1 2 Goodbye
2 4 What's up?
3 5 See you
http://www.sqlfiddle.com/#!4/a9a07/7
If you want to use a windowing function, you might do this:
select id, val,
case when num = max(num) over(partition by id) then
1
else
0
end as to_select
from x
where to_select = 1
Or this:
select id, val
from x
where num = max(num) over(partition by id)
But since it's not allowed to do those, you have to do this:
with list as
(
select id, val,
case when num = max(num) over(partition by id) then
1
else
0
end as to_select
from x
)
select *
from list
where to_select = 1
http://www.sqlfiddle.com/#!4/a9a07/19
If you're looking to get the rows which contain the values from MAX(num) GROUP BY id, this tends to be a common pattern...
WITH
sequenced_data
AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY id ORDER BY num DESC) AS sequence_id,
yourTable.*
FROM
yourTable
)
SELECT
*
FROM
sequenced_data
WHERE
sequence_id = 1
EDIT
I don't know if Teradata will allow this, but the logic seems to make sense...
SELECT
*
FROM
yourTable
WHERE
num = MAX(num) OVER (PARTITION BY id)
Or maybe...
SELECT
*
FROM
(
SELECT
yourTable.*,
MAX(num) OVER (PARTITION BY id) AS max_num_by_id
FROM
yourTable
)
sub_query
WHERE
num = max_num_by_id
This is slightly different from my previous answer; if multiple records are tied with the same MAX(num), this will return all of them, while the other answer will only ever return one.
EDIT
In your proposed SQL the error relates to the fact that the OVER() clause contains a field not in your GROUP BY. It's like trying to do this...
SELECT id, num FROM yourTable GROUP BY id
num is invalid, because there can be multiple values in that field for each row returned (with the rows returned being defined by GROUP BY id).
In the same way, you can't put num inside the OVER() clause.
SELECT
id,
MAX(num), <-- Valid as it is an aggregate
MAX(num) <-- still valid
OVER(PARTITION BY id), <-- Also valid, as id is in the GROUP BY
MAX(num) <-- still valid
OVER(PARTITION BY num) <-- Not valid, as num is not in the GROUP BY
FROM
yourTable
GROUP BY
id
See this question for when you can't specify something in the OVER() clause, and an answer showing when (I think) you can: over-partition-by-question