Using LEAD in BigQuery

Assume my table structure is this:
I am planning to group it by (USER, SEQUENCE) and get the LEAD timestamp for the next sequence. Here is the output that I am looking for:
Can I solve this with the LEAD function, without a JOIN, if possible?

Below is for BigQuery Standard SQL.
I will present two options: one using JOIN (just to show that I understood/reverse-engineered the expected logic correctly) and then a JOIN-less version. Note that I am using ts as the field name instead of timestamp.
Using JOIN
#standardSQL
SELECT a.user, a.sequence, MIN(b.ts) ts
FROM (
SELECT user, sequence, MAX(ts) AS max_ts
FROM `project.dataset.table`
GROUP BY user, sequence
) a
LEFT JOIN `project.dataset.table` b
ON a.user = b.user AND b.sequence = a.sequence + 1
WHERE a.max_ts <= IFNULL(b.ts, a.max_ts)
GROUP BY user, sequence
-- ORDER BY user, sequence
JOIN-less version
#standardSQL
SELECT
user, sequence,
(
SELECT ts FROM UNNEST(arr_ts) ts
WHERE max_ts < ts ORDER BY ts LIMIT 1
) ts
FROM (
SELECT
user, sequence, max_ts,
LEAD(arr_ts) OVER (PARTITION BY user ORDER BY sequence) arr_ts
FROM (
SELECT
user, sequence, MAX(ts) max_ts,
ARRAY_AGG(ts ORDER BY ts) arr_ts
FROM `project.dataset.table`
GROUP BY user, sequence
)
)
-- ORDER BY user, sequence
Both versions above can be tested with the dummy data below:
WITH `project.dataset.table` AS (
SELECT 'user1' user, 2 sequence, 'T1' ts UNION ALL
SELECT 'user1', 2, 'T2' UNION ALL
SELECT 'user1', 1, 'T3' UNION ALL
SELECT 'user1', 1, 'T4' UNION ALL
SELECT 'user1', 3, 'T5' UNION ALL
SELECT 'user1', 2, 'T6' UNION ALL
SELECT 'user1', 3, 'T7' UNION ALL
SELECT 'user1', 3, 'T8'
)
and both return the result below:
user   sequence  ts
user1  1         T6
user1  2         T7
user1  3         null

Not sure about bigquery, but in general SQL it would be written as:
select user, sequence, LEAD (max_timestamp,1) OVER (PARTITION BY user ORDER BY sequence) as timestamp
from (
select user, sequence, max(timestamp) as max_timestamp
from table
group by user, sequence) q1;
Just be aware of reserved words such as table, user, timestamp, etc.
Edit: Disregard this answer; I wasn't attentive enough to the required output. Mikhail got it right!
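To see concretely why LEAD over the per-group MAX falls short, here is a small sketch that runs it against the dummy data from the answer above. It uses Python's sqlite3 module as a local stand-in for BigQuery (an assumption made only so the example is runnable; window functions need SQLite 3.25 or later), and the table name events is hypothetical. Since SQLite has no arrays, the second query expresses the expected logic with a correlated subquery instead of ARRAY_AGG/UNNEST.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table name; rows copied from the dummy data in the thread.
conn.execute("CREATE TABLE events (user TEXT, sequence INTEGER, ts TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("user1", 2, "T1"), ("user1", 2, "T2"), ("user1", 1, "T3"), ("user1", 1, "T4"),
    ("user1", 3, "T5"), ("user1", 2, "T6"), ("user1", 3, "T7"), ("user1", 3, "T8"),
])

# LEAD over the per-group MAX: returns the latest ts of the next sequence.
naive = conn.execute("""
    SELECT user, sequence,
           LEAD(max_ts) OVER (PARTITION BY user ORDER BY sequence) AS ts
    FROM (SELECT user, sequence, MAX(ts) AS max_ts
          FROM events GROUP BY user, sequence)
    ORDER BY user, sequence
""").fetchall()

# Expected logic: earliest ts of the next sequence that is later than
# the current sequence's latest ts.
expected = conn.execute("""
    SELECT a.user, a.sequence,
           (SELECT MIN(b.ts) FROM events b
             WHERE b.user = a.user
               AND b.sequence = a.sequence + 1
               AND b.ts > a.max_ts) AS ts
    FROM (SELECT user, sequence, MAX(ts) AS max_ts
          FROM events GROUP BY user, sequence) a
    ORDER BY a.user, a.sequence
""").fetchall()

print(naive)     # sequence 2 gets T8
print(expected)  # sequence 2 gets T7
```

The two results differ only for sequence 2: the simple version picks T8 (the maximum of sequence 3), whereas the requested output is T7, the first sequence-3 timestamp after T6.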

Related

How do I complete a cross join with redshift?

I have two tables. One has a user ID and a date and another has a list of dates.
with first_day as (
select '2020-03-01' AS DAY_CREATED, '123' AS USER_ID
),
date_series as (
SELECT ('2020-02-28'::date + x)::date as day_n,
'one' as join_key
FROM generate_series(1, 30, 1) x
)
SELECT * from first_day cross join date_series
I'm getting this error with redshift
Error running query: Specified types or functions (one per INFO message) not supported on Redshift tables.
Can I do a cross join with redshift?

Alas, Redshift supports generate_series(), but only in a very limited way: it runs on the leader node only. That basically renders it useless for queries like this one.
Assuming you have a table with enough rows, you can use that instead:
with first_day as (
select '2020-03-01' AS DAY_CREATED, '123' AS USER_ID
),
date_series as (
select ('2020-02-28'::date + x)::date as day_n,
'one' as join_key
from (select t.*, row_number() over () as x
from t -- big enough table
limit 30
) x
)
select *
from first_day cross join
date_series;
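The row_number() trick can be tried locally with the sketch below. It uses Python's sqlite3 module rather than Redshift (an assumption for runnability), so the date arithmetic uses SQLite's date() function instead of the date + integer form; the helper table t and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical "big enough" table: only its row count matters.
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100)])

rows = conn.execute("""
    WITH first_day AS (
        SELECT '2020-03-01' AS day_created, '123' AS user_id
    ),
    numbers AS (
        -- Number the rows of t and keep the first 30 as a 1..30 series.
        SELECT ROW_NUMBER() OVER () AS x FROM t LIMIT 30
    ),
    date_series AS (
        SELECT date('2020-02-28', '+' || x || ' days') AS day_n FROM numbers
    )
    SELECT user_id, day_created, day_n
    FROM first_day CROSS JOIN date_series
    ORDER BY day_n
""").fetchall()

print(len(rows), rows[0], rows[-1])
```

The cross join itself is unproblematic; only the series generation needed replacing.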

Trying to find the most recent date where a status field has changed in Oracle SQL

I have a table that shows a full history of location ID's (LOCN_ID), which includes an ACTIVE_STATUS field showing A for active, or I for inactive. Each time a location's active status changes, a new record is created with a new OP_DATE. However, any time the EXTERNALLY_VISIBLE field in the table gets changed, another record with a new OP_DATE is also created.
Here is an example of this.
For each LOCN_ID in the table, I need to be able to find the most recent OP_DATE that the ACTIVE_STATUS field changed (to either I or A). I don't care about when the EXTERNALLY_VISIBLE field changed. For the LOCN_ID shown in the example, the result should be:
OP_DATE LOCN_ID ACTIVE_STATUS
12/9/11 7:34 558732 I
There are also some cases where a LOCN_ID's active status will have never changed, in which case the result should be the oldest OP_DATE for that LOCN_ID.
How would I be able to write a query in Oracle SQL to show this desired output for each LOCN_ID?

You have to handle both situations: when there is a row where the status changed, and when there is none. lag() is the obvious choice, as it is designed to find previous values; the alternative is an older, slower self-join. We also need row_number(), because the ordering is conditional. Its first sort key puts rows where the status changed in descending date order; the second sort key, ascending op_date, applies when there were no status changes. It can be done like this:
select op_date, locn_id, active_status
from (
select a.*, row_number()
over (partition by locn_id
order by case when active_status <> las then sysdate-op_date end,
op_date) as rn
from (select t.*, lag(active_status) over (partition by locn_id order by op_date) las
from t) a)
where rn = 1
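The conditional ordering can be checked with the sketch below, which uses Python's sqlite3 module as a stand-in for Oracle (an assumption for runnability). Because SQLite sorts NULLs first in ascending order while Oracle sorts them last, the sort key is written with an explicit changed flag rather than sysdate-op_date. The table name history and the second location (999999, which never changes status) are hypothetical additions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (op_date TEXT, locn_id INTEGER, active_status TEXT)")
conn.executemany("INSERT INTO history VALUES (?, ?, ?)", [
    ("2011-04-09 02:18", 558732, "A"),
    ("2011-10-02 02:59", 558732, "I"),
    ("2011-10-02 03:00", 558732, "I"),
    ("2011-10-02 03:05", 558732, "A"),
    ("2011-12-09 07:34", 558732, "I"),
    ("2013-04-06 02:31", 558732, "I"),
    # Hypothetical location whose status never changes.
    ("2012-01-01 00:00", 999999, "A"),
    ("2012-06-01 00:00", 999999, "A"),
])

rows = conn.execute("""
    SELECT op_date, locn_id, active_status
    FROM (
        SELECT a.*,
               ROW_NUMBER() OVER (
                   PARTITION BY locn_id
                   ORDER BY changed DESC,                              -- changed rows first
                            CASE WHEN changed = 1 THEN op_date END DESC,  -- latest change
                            op_date                                   -- else oldest row
               ) AS rn
        FROM (
            SELECT t.*,
                   CASE WHEN active_status <>
                             LAG(active_status) OVER (PARTITION BY locn_id ORDER BY op_date)
                        THEN 1 ELSE 0 END AS changed
            FROM history t
        ) a
    )
    WHERE rn = 1
    ORDER BY locn_id
""").fetchall()

print(rows)
```

Location 558732 resolves to its latest status change, and the location without any change falls back to its oldest row.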

You can use lag() together with row_number(). Note that this returns only locations whose status has changed at least once:
select t.*
from (select t.*,
row_number() over (partition by locn_id order by op_date desc) as seqnum
from (select t.*,
lag(active_status) over (partition by locn_id order by op_date) as prev_active_status
from t
) t
where prev_active_status <> active_status
) t
where seqnum = 1;

I have created the following query for you using LEFT JOIN:
-- SAMPLE DATA
WITH DATAA (OP_DATE, LOCN_ID, ACTIVE_STATUS, EXTERNALLY_VISIBLE)
AS
(
SELECT TO_DATE('04/06/2013 2:31','MM/DD/RRRR HH24:MI'), 558732, 'I', 'Y' FROM DUAL UNION ALL
SELECT TO_DATE('12/09/2011 7:34','MM/DD/RRRR HH24:MI'), 558732, 'I', 'N' FROM DUAL UNION ALL
SELECT TO_DATE('10/02/2011 3:05','MM/DD/RRRR HH24:MI'), 558732, 'A', 'N' FROM DUAL UNION ALL
SELECT TO_DATE('10/02/2011 2:59','MM/DD/RRRR HH24:MI'), 558732, 'I', 'N' FROM DUAL UNION ALL
SELECT TO_DATE('10/02/2011 3:00','MM/DD/RRRR HH24:MI'), 558732, 'I', 'Y' FROM DUAL UNION ALL
SELECT TO_DATE('04/09/2011 2:18','MM/DD/RRRR HH24:MI'), 558732, 'A', 'Y' FROM DUAL
),
-- ACTUAL QUERY STARTS FROM HERE
CTE(OP_DATE, LOCN_ID, ACTIVE_STATUS, EXTERNALLY_VISIBLE, RN) AS (
SELECT
D.*,
ROW_NUMBER() OVER(
PARTITION BY LOCN_ID
ORDER BY
OP_DATE
) AS RN
FROM
DATAA D
)
SELECT
OP_DATE,
LOCN_ID,
ACTIVE_STATUS
FROM
(
SELECT
A.OP_DATE,
A.LOCN_ID,
A.ACTIVE_STATUS,
ROW_NUMBER() OVER(
PARTITION BY A.LOCN_ID
ORDER BY
A.OP_DATE DESC
) AS RN
FROM
CTE A
LEFT JOIN CTE B ON ( A.LOCN_ID = B.LOCN_ID AND A.RN = B.RN + 1 )
WHERE
( A.ACTIVE_STATUS <> B.ACTIVE_STATUS
OR B.ACTIVE_STATUS IS NULL )
)
WHERE
RN = 1;
-- Output --
OP_DATE LOCN_ID A
------------------- ---------- -
09-12-2011 07:34:00 558732 I

Google big query count

I am trying to pull all the customers having less than 4 orders
in past 3 months in Google BigQuery.
SELECT a.user_id, b.refer_by, FROM water_db.tb_order a INNER JOIN
water_auth.tb_users b ON a.user_id = b.user_id WHERE ( SELECT
user_id FROM
water_db.tb_order GROUP BY
user_id HAVING
COUNT(DISTINCT(a.user_id <= 4))) AND status = 3 AND DATE(a.order_date) >=
'2017-02-15' AND DATE(a.order_date) <= '2017-05-15';

I'm guessing that each record added to the table equates to one order, so something like this:
SELECT
a.user_id, b.refer_by
FROM water_db.tb_order a
INNER JOIN water_auth.tb_users b ON a.user_id = b.user_id
WHERE
DATE(a.order_date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 MONTH)
GROUP BY a.user_id, b.refer_by
HAVING COUNT(*) < 4
The aggregate condition has to go in HAVING rather than WHERE. The DATE() conversion assumes order_date is a timestamp, as in the question.

I think the best way to approach this is by selecting from the users table, so you don't need to deduplicate IDs, and just expressing the condition as part of your WHERE clause. This should help get you started:
#standardSQL
SELECT
user_id,
refer_by
FROM water_auth.tb_users
WHERE (
SELECT COUNT(*)
FROM water_db.tb_order
WHERE tb_users.user_id = tb_order.user_id AND
status = 3 AND
DATE(order_date) BETWEEN '2017-02-15' AND '2017-05-15'
) <= 4;
In this query, the join is expressed as a correlated subquery involving the two tables. You can try it out using sample data with this query:
#standardSQL
WITH tb_users AS (
SELECT 1 AS user_id, 'foo#bar.com' AS refer_by UNION ALL
SELECT 2, 'a#b.com' UNION ALL
SELECT 3, 'baz#baz.com'
),
tb_order AS (
SELECT 1 AS user_id, TIMESTAMP '2017-04-12' AS order_date, 3 AS status UNION ALL
SELECT 2, TIMESTAMP '2017-05-03', 3 UNION ALL
SELECT 1, TIMESTAMP '2017-03-13', 3 UNION ALL
SELECT 1, TIMESTAMP '2017-02-28', 3 UNION ALL
SELECT 2, TIMESTAMP '2017-05-06', 3 UNION ALL
SELECT 1, TIMESTAMP '2017-05-01', 3 UNION ALL
SELECT 1, TIMESTAMP '2017-05-02', 3
)
SELECT
user_id,
refer_by
FROM tb_users
WHERE (
SELECT COUNT(*)
FROM tb_order
WHERE tb_users.user_id = tb_order.user_id AND
status = 3 AND
DATE(order_date) BETWEEN
'2017-02-15' AND '2017-05-15'
) <= 4;
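One reason to prefer the correlated subquery is that it keeps users with no matching orders at all, which an INNER JOIN with HAVING silently drops. The sketch below compares the two on the sample data above, using Python's sqlite3 module in place of BigQuery (an assumption for runnability); dates are stored as ISO strings, so the DATE() conversion is omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tb_users (user_id INTEGER, refer_by TEXT);
    CREATE TABLE tb_order (user_id INTEGER, order_date TEXT, status INTEGER);
    INSERT INTO tb_users VALUES (1, 'foo#bar.com'), (2, 'a#b.com'), (3, 'baz#baz.com');
    INSERT INTO tb_order VALUES
        (1, '2017-04-12', 3), (2, '2017-05-03', 3), (1, '2017-03-13', 3),
        (1, '2017-02-28', 3), (2, '2017-05-06', 3), (1, '2017-05-01', 3),
        (1, '2017-05-02', 3);
""")

# Correlated subquery: every user is evaluated, including those with 0 orders.
correlated = conn.execute("""
    SELECT user_id, refer_by
    FROM tb_users
    WHERE (
        SELECT COUNT(*) FROM tb_order
        WHERE tb_users.user_id = tb_order.user_id
          AND status = 3
          AND order_date BETWEEN '2017-02-15' AND '2017-05-15'
    ) <= 4
    ORDER BY user_id
""").fetchall()

# INNER JOIN + HAVING: users without any order never reach the HAVING clause.
inner_join = conn.execute("""
    SELECT u.user_id, u.refer_by
    FROM tb_order o
    INNER JOIN tb_users u ON o.user_id = u.user_id
    WHERE o.status = 3
      AND o.order_date BETWEEN '2017-02-15' AND '2017-05-15'
    GROUP BY u.user_id, u.refer_by
    HAVING COUNT(*) <= 4
    ORDER BY u.user_id
""").fetchall()

print(correlated)  # users 2 and 3
print(inner_join)  # user 2 only
```

User 3 has no orders in the sample data, so it appears only in the correlated-subquery result.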

Select unique id with a given code, but not if it has a another row with a different code with a later timestamp

I have a table that basically consists of a user id, a code (A or B), and a timestamp.
I need to get a list of unique ids that have a code of A, but only if that same user id does not also have a row with code B with a later timestamp.
Hope that makes sense.

This English query translates into SQL almost verbatim:
get a list of unique ids that have a code of A
SELECT DISTINCT user_id FROM user u WHERE code='A' <...>
but only if that same user id does not also have a row with code B with a later timestamp
<...> AND NOT EXISTS (
SELECT *
FROM user ou
WHERE ou.user_id=u.user_id AND ou.code='B' AND ou.time_stamp > u.time_stamp
)
The trick to the translation is the use of aliases: u stands for "user", while ou stands for "other user". Hence ou.user_id=u.user_id AND ou.code='B' AND ou.time_stamp > u.time_stamp means "another row for the same user id, with a code of 'B' and a later timestamp".
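Here is a minimal runnable sketch of the NOT EXISTS query, using Python's sqlite3 module as a stand-in database (an assumption for runnability). The table name user_codes is hypothetical, chosen to avoid the reserved word user, and the sample rows mirror the Oracle setup shown elsewhere in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_codes (user_id INTEGER, code TEXT, time_stamp TEXT)")
conn.executemany("INSERT INTO user_codes VALUES (?, ?, ?)", [
    (1, "A", "2016-02-01"), (1, "A", "2016-02-02"),
    (2, "A", "2016-02-01"), (2, "B", "2016-02-02"),  # B after A: excluded
    (3, "B", "2016-02-01"), (3, "A", "2016-02-02"),  # B before A: kept
])

rows = conn.execute("""
    SELECT DISTINCT user_id
    FROM user_codes u
    WHERE code = 'A'
      AND NOT EXISTS (
          SELECT 1
          FROM user_codes ou
          WHERE ou.user_id = u.user_id
            AND ou.code = 'B'
            AND ou.time_stamp > u.time_stamp
      )
    ORDER BY user_id
""").fetchall()

print(rows)
```

User 2 is excluded because its B row is later than its A row; user 3 is kept because its B row comes first.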

This will get the result with only a single table scan:
Oracle Setup:
CREATE TABLE table_name ( user_id, code, time ) AS
SELECT 1, 'A', TIMESTAMP '2016-02-01 00:00:00' FROM DUAL UNION ALL
SELECT 1, 'A', TIMESTAMP '2016-02-02 00:00:00' FROM DUAL UNION ALL
SELECT 2, 'A', TIMESTAMP '2016-02-01 00:00:00' FROM DUAL UNION ALL
SELECT 2, 'B', TIMESTAMP '2016-02-02 00:00:00' FROM DUAL UNION ALL
SELECT 3, 'B', TIMESTAMP '2016-02-01 00:00:00' FROM DUAL UNION ALL
SELECT 3, 'A', TIMESTAMP '2016-02-02 00:00:00' FROM DUAL;
Query:
SELECT DISTINCT user_id
FROM (
SELECT user_id,
code,
LEAD( CASE code WHEN 'B' THEN 1 END )
IGNORE NULLS
OVER ( PARTITION BY user_id ORDER BY time ASC )
AS has_b
FROM TABLE_NAME
)
WHERE code = 'A'
AND has_b IS NULL;
Output:
USER_ID
----------
1
3
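Not every database supports IGNORE NULLS on LEAD. As a sketch of the same single-scan idea under that constraint, the query below swaps in a different technique: a MAX over a window frame covering only the following rows. It runs on Python's sqlite3 module (an assumption for runnability; SQLite 3.25 or later), and the column name event_time is a hypothetical replacement for time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (user_id INTEGER, code TEXT, event_time TEXT)")
conn.executemany("INSERT INTO table_name VALUES (?, ?, ?)", [
    (1, "A", "2016-02-01"), (1, "A", "2016-02-02"),
    (2, "A", "2016-02-01"), (2, "B", "2016-02-02"),
    (3, "B", "2016-02-01"), (3, "A", "2016-02-02"),
])

rows = conn.execute("""
    SELECT DISTINCT user_id
    FROM (
        SELECT user_id, code,
               -- 1 if any later row for this user has code 'B', else NULL
               MAX(CASE code WHEN 'B' THEN 1 END) OVER (
                   PARTITION BY user_id
                   ORDER BY event_time
                   ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING
               ) AS has_b
        FROM table_name
    )
    WHERE code = 'A'
      AND has_b IS NULL
    ORDER BY user_id
""").fetchall()

print(rows)
```

The frame is empty for the last row of each partition, so MAX yields NULL there, which plays the same role as LEAD finding no later B.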

SELECT DISTINCT userId FROM tblName WHERE code = 'A' AND userId NOT IN
(SELECT userId FROM tblName WHERE code = 'B')
Note that this excludes every user that has any B row, regardless of its timestamp.

Accessing parent identifier in an ordered nested subquery

I need a query to get the value of an item together with the value of the previous item if exists.
I am using the following query (a simplification of the actual):
select v1.value item_value,
nvl(
(
select * from (
select v2.value
from ITEMS v2
where v2.insert_date<v1.insert_date
order by v2.insert_date desc
) where rownum=1
), 0
) as previous_value
from ITEMS v1
where v1.item_id=1234
This query won't work (ORA-00904) because I am using v1.insert_date in an inner select with two levels of nesting.
How can I achieve this with Oracle 11?

I think you can achieve this with the analytic function LAG.
I created a sample query:
with items as (
select 1 as value, sysdate as insert_date from dual
union all
select 2 as value, sysdate-1 as insert_date from dual
union all
select 3 as value, sysdate+1 as insert_date from dual
)
select v1.value item_value,
lag(v1.value,1,0) over (order by v1.insert_date) as previous_value,insert_date
from ITEMS v1
order by insert_date desc
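A quick way to check which row LAG treats as "previous" is the sketch below, run on Python's sqlite3 module instead of Oracle (an assumption for runnability). The window is ordered by insert_date ascending, so the previous row is the one with the closest earlier date, and the default of 0 covers the oldest item. The fixed dates are hypothetical replacements for the sysdate offsets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (value INTEGER, insert_date TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)", [
    (1, "2024-01-02"),  # stands in for sysdate
    (2, "2024-01-01"),  # stands in for sysdate - 1
    (3, "2024-01-03"),  # stands in for sysdate + 1
])

rows = conn.execute("""
    SELECT value AS item_value,
           LAG(value, 1, 0) OVER (ORDER BY insert_date) AS previous_value,
           insert_date
    FROM items
    ORDER BY insert_date DESC
""").fetchall()

for row in rows:
    print(row)
```

The ordering inside OVER decides what "previous" means; the outer ORDER BY only affects how the result is displayed.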
