Query with conditional lag statement - google-bigquery

I'm trying to find the previous value of a column where the row meets some criteria. Consider the table:
| user_id | session_id | time | referrer |
|---------|------------|------------|------------|
| 1 | 1 | 2018-01-01 | [NULL] |
| 1 | 2 | 2018-02-01 | google.com |
| 1 | 3 | 2018-03-01 | google.com |
I want to find, for each session, the previous value of session_id where the referrer is NULL. So, for the second AND third rows, the value of parent_session_id should be 1.
However, by just using lag(session_id) over (partition by user_id order by time), I will get parent_session_id=2 for the 3rd row.
I suspect it can be done using a combination of window functions, but I just can't figure it out.

I'd use last_value() in combination with if():
WITH t AS (
  SELECT * FROM UNNEST([
    struct<user_id int64, session_id int64, time date, referrer string>(1, 1, date('2018-01-01'), NULL),
    (1, 2, date('2018-02-01'), 'google.com'),
    (1, 3, date('2018-03-01'), 'google.com')
  ])
)
SELECT
  *,
  last_value(IF(referrer is null, session_id, NULL) ignore nulls)
    over (partition by user_id order by time
          rows between unbounded preceding and 1 preceding) AS lastNullrefSession
FROM t

You could even do this via a correlated subquery (note this relies on session_id increasing with time):
SELECT
session_id,
(SELECT MAX(t2.session_id) FROM yourTable t2
WHERE t2.referrer IS NULL AND t2.session_id < t1.session_id) prev_session_id
FROM yourTable t1
ORDER BY
session_id;
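The correlated-subquery idea can be checked quickly with Python's sqlite3; the table name `sessions` is made up for the sketch, and I've also correlated on user_id, which the snippet above omits:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sessions (user_id INT, session_id INT, time TEXT, referrer TEXT);
    INSERT INTO sessions VALUES
        (1, 1, '2018-01-01', NULL),
        (1, 2, '2018-02-01', 'google.com'),
        (1, 3, '2018-03-01', 'google.com');
""")

# For each session, the highest earlier session_id whose referrer was NULL.
rows = con.execute("""
    SELECT t1.session_id,
           (SELECT MAX(t2.session_id) FROM sessions t2
             WHERE t2.user_id = t1.user_id
               AND t2.referrer IS NULL
               AND t2.session_id < t1.session_id) AS parent_session_id
    FROM sessions t1
    ORDER BY t1.session_id
""").fetchall()
print(rows)  # [(1, None), (2, 1), (3, 1)]
```

Session 1 correctly gets NULL, since it has no earlier NULL-referrer session.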
Here is an approach using analytic functions which might work:
WITH cte AS (
SELECT *,
SUM(CASE WHEN referrer IS NULL THEN 1 ELSE 0 END)
OVER (PARTITION BY user_id ORDER BY session_id) cnt
FROM yourTable
)
SELECT
session_id,
CASE WHEN cnt = 0
THEN NULL
ELSE MIN(session_id) OVER (PARTITION BY user_id, cnt) END prev_session_id
FROM cte
ORDER BY
session_id;
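The running-count grouping trick is portable; here is a sketch in Python's sqlite3 (needs SQLite 3.25+ for window functions; the table name `sessions` is an assumption). Note that a session whose own referrer is NULL opens a group and so maps to itself:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sessions (user_id INT, session_id INT, time TEXT, referrer TEXT);
    INSERT INTO sessions VALUES
        (1, 1, '2018-01-01', NULL),
        (1, 2, '2018-02-01', 'google.com'),
        (1, 3, '2018-03-01', 'google.com');
""")

# cnt increments at every NULL-referrer row, so each NULL-referrer session
# and all sessions after it (until the next NULL) share the same group.
rows = con.execute("""
    WITH cte AS (
      SELECT *,
             SUM(CASE WHEN referrer IS NULL THEN 1 ELSE 0 END)
               OVER (PARTITION BY user_id ORDER BY time) AS cnt
      FROM sessions
    )
    SELECT session_id,
           CASE WHEN cnt = 0 THEN NULL
                ELSE MIN(session_id) OVER (PARTITION BY user_id, cnt)
           END AS parent_session_id
    FROM cte
    ORDER BY session_id
""").fetchall()
print(rows)  # [(1, 1), (2, 1), (3, 1)]
```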

Related

Obtain latest NOT NULL values for different columns in a table, grouped by common column

In a PostgreSQL database, I have a table of measurements that looks as follows:
| sensor_group_id | ts | value_1 | value_2 | etc... |
|-----------------|---------------------------|---------|---------|--------|
| 1 | 2021-07-21T00:20:00+00:00 | 15 | NULL | |
| 1 | 2021-07-15T00:20:00+00:00 | NULL | 23 | |
| 2 | 2021-07-17T00:20:00+00:00 | NULL | 11 | |
| 1 | 2021-07-13T00:20:00+00:00 | 9 | 4 | |
| 2 | 2021-07-10T00:20:00+00:00 | 99 | 36 | |
There are many columns with different types of measurements in this table. Each Sensor Group produces measurements of different types at the same time, but not always all types.
So we end up with partly filled rows.
What I want to do:
For each different sensor_group_id
For each different column (measurement type)
Obtain the latest timestamp when that column was NOT NULL and the value for that measurement at that timestamp
The solution I have now seems pretty cumbersome:
WITH
latest_value_1 AS (SELECT DISTINCT ON (sensor_group_id) sensor_group_id, ts, value_1
FROM measurements
WHERE value_1 IS NOT NULL
ORDER BY sensor_group_id, ts DESC),
latest_value_2 AS (SELECT DISTINCT ON (sensor_group_id) sensor_group_id, ts, value_2
FROM measurements
WHERE value_2 IS NOT NULL
ORDER BY sensor_group_id, ts DESC),
latest_value_3 AS (SELECT DISTINCT ON (sensor_group_id) sensor_group_id, ts, value_3
FROM measurements
WHERE value_3 IS NOT NULL
ORDER BY sensor_group_id, ts DESC),
etc...
SELECT latest_value_1.sensor_group_id,
latest_value_1.ts AS latest_value_1_ts,
value_1,
latest_value_2.ts AS latest_value_2_ts,
value_2,
latest_value_3.ts AS latest_value_3_ts,
value_3,
etc...
FROM latest_value_1
JOIN latest_value_2
ON latest_value_1.sensor_group_id = latest_value_2.sensor_group_id
JOIN latest_value_3
ON latest_value_1.sensor_group_id = latest_value_3.sensor_group_id
etc...
This produces the following result:
| sensor_group_id | latest_value_1_ts | value_1 | latest_value_2_ts | value_2 | etc... |
|-----------------|---------------------------|---------|---------------------------|---------|--------|
| 1 | 2021-07-21T00:20:00+00:00 | 15 | 2021-07-15T00:20:00+00:00 | 23 | |
| 2 | 2021-07-10T00:20:00+00:00 | 99 | 2021-07-17T00:20:00+00:00 | 11 | |
This seems outrageously complicated, but I'm not sure if there is a better approach. Help would be much appreciated!
Not sure if it is simpler...
with
sensor_groups(sgr_id) as ( -- Change it to the list of groups if you have it
select distinct sensor_group_id from measurements)
select
*
from
sensor_groups as sg
left join lateral (
select ts, value_1
from measurements
where value_1 is not null and sensor_group_id = sg.sgr_id
order by ts desc limit 1) as v1(ts_1, v_1) on true
left join lateral (
select ts, value_2
from measurements
where value_2 is not null and sensor_group_id = sg.sgr_id
order by ts desc limit 1) as v2(ts_2, v_2) on true
...
PS: Data normalization could help a lot
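For a quick test outside Postgres: SQLite has no LATERAL, but the same per-group "latest non-null" lookup can be approximated with correlated scalar subqueries (one per returned column, so it is chattier than the lateral join). A sketch in Python's sqlite3, using the question's data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE measurements (sensor_group_id INT, ts TEXT, value_1 INT, value_2 INT);
    INSERT INTO measurements VALUES
        (1, '2021-07-21', 15,   NULL),
        (1, '2021-07-15', NULL, 23),
        (2, '2021-07-17', NULL, 11),
        (1, '2021-07-13', 9,    4),
        (2, '2021-07-10', 99,   36);
""")

# For each group, fetch the timestamp and value of the latest non-NULL
# reading of each column via correlated scalar subqueries.
rows = con.execute("""
    SELECT sg.sensor_group_id,
           (SELECT ts FROM measurements m
             WHERE m.value_1 IS NOT NULL AND m.sensor_group_id = sg.sensor_group_id
             ORDER BY ts DESC LIMIT 1) AS ts_1,
           (SELECT value_1 FROM measurements m
             WHERE m.value_1 IS NOT NULL AND m.sensor_group_id = sg.sensor_group_id
             ORDER BY ts DESC LIMIT 1) AS v_1,
           (SELECT ts FROM measurements m
             WHERE m.value_2 IS NOT NULL AND m.sensor_group_id = sg.sensor_group_id
             ORDER BY ts DESC LIMIT 1) AS ts_2,
           (SELECT value_2 FROM measurements m
             WHERE m.value_2 IS NOT NULL AND m.sensor_group_id = sg.sensor_group_id
             ORDER BY ts DESC LIMIT 1) AS v_2
    FROM (SELECT DISTINCT sensor_group_id FROM measurements) sg
    ORDER BY sg.sensor_group_id
""").fetchall()
print(rows)
# [(1, '2021-07-21', 15, '2021-07-15', 23), (2, '2021-07-10', 99, '2021-07-17', 11)]
```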
What you really want is the IGNORE NULLS option on LAG() or LAST_VALUE(). But Postgres does not support this functionality. Instead, you can use a two-level trick, where you assign a grouping for each value, so each NULL value is in the same group as the previous row with a value. Then "schmear" the values through the group:
select t.*,
max(value_1) over (partition by sensor_group_id, grp_1) as imputed_value_1,
max(value_2) over (partition by sensor_group_id, grp_2) as imputed_value_2,
max(value_3) over (partition by sensor_group_id, grp_3) as imputed_value_3
from (select m.*,
             count(value_1) over (partition by sensor_group_id order by ts) as grp_1,
             count(value_2) over (partition by sensor_group_id order by ts) as grp_2,
             count(value_3) over (partition by sensor_group_id order by ts) as grp_3
      from measurements m
     ) t;
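The two-level trick works in SQLite too (3.25+), which makes it easy to verify: `count(value)` only advances on non-NULL rows, so each NULL inherits the group of the last non-NULL row, and `max()` over that group "schmears" the value forward. A sketch with made-up data for one sensor group (two value columns for brevity):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE measurements (sensor_group_id INT, ts TEXT, value_1 INT, value_2 INT);
    INSERT INTO measurements VALUES
        (1, '2021-07-13', 9,    4),
        (1, '2021-07-15', NULL, 23),
        (1, '2021-07-21', 15,   NULL);
""")

rows = con.execute("""
    SELECT ts,
           MAX(value_1) OVER (PARTITION BY sensor_group_id, grp_1) AS imputed_value_1,
           MAX(value_2) OVER (PARTITION BY sensor_group_id, grp_2) AS imputed_value_2
    FROM (SELECT m.*,
                 COUNT(value_1) OVER (PARTITION BY sensor_group_id ORDER BY ts) AS grp_1,
                 COUNT(value_2) OVER (PARTITION BY sensor_group_id ORDER BY ts) AS grp_2
          FROM measurements m) x
    ORDER BY ts
""").fetchall()
print(rows)
# [('2021-07-13', 9, 4), ('2021-07-15', 9, 23), ('2021-07-21', 15, 23)]
```

The row with the latest ts per group then carries the latest non-NULL value of every column, which is what the question asked for.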

Redshift window function for change in column

I have a Redshift table with, amongst other things, an id and a plan_type column, and I would like to group on rows where the plan_type changes. For example, given this data:
| user_id | plan_type | created |
|---------|-----------|------------|
| 1 | A | 2019-01-01 |
| 1 | A | 2019-01-02 |
| 1 | B | 2019-01-05 |
| 2 | A | 2019-01-01 |
| 2 | A | 2019-01-05 |
I would like a result like this where I get the first date that the plan_type was "new":
| user_id | plan_type | created |
|---------|-----------|------------|
| 1 | A | 2019-01-01 |
| 1 | B | 2019-01-05 |
| 2 | A | 2019-01-01 |
Is this possible with window functions?
EDIT
Since I have some garbage in the data (plan_type can sometimes be NULL) and the accepted solution then does not include the first row, I had to make some modifications. Hopefully this will help other people with similar issues. The final query is as follows:
SELECT * FROM
(
SELECT
user_id,
plan_type,
created_at,
lag(plan_type) OVER (PARTITION by user_id ORDER BY created_at) as prev_plan,
row_number() OVER (PARTITION by user_id ORDER BY created_at) as rownum
FROM tablename
WHERE plan_type IS NOT NULL
) userHistory
WHERE
userHistory.plan_type <> userHistory.prev_plan
OR userHistory.rownum = 1
ORDER BY created_at;
The plan_type IS NOT NULL filters out bad data at the source table and the outer where clause gets any changes OR the first row of data that would not be included otherwise.
ALSO BE CAREFUL about the created_at timestamp if you are working off the prev_plan field, since it would of course give you the time of the new value!
This is a gaps-and-islands problem. I think lag() is the simplest approach:
select user_id, plan_type, created
from (select t.*,
lag(plan_type) over (partition by user_id order by created) as prev_plan_type
from t
) t
where prev_plan_type is null or prev_plan_type <> plan_type;
This assumes that a plan type can switch back to an earlier value and that you want each occurrence.
If not, just use aggregation:
select user_id, plan_type, min(created)
from t
group by user_id, plan_type;
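The lag() approach can be verified with Python's sqlite3 (window functions need SQLite 3.25+; the table name `plans` is an assumption):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE plans (user_id INT, plan_type TEXT, created TEXT);
    INSERT INTO plans VALUES
        (1, 'A', '2019-01-01'),
        (1, 'A', '2019-01-02'),
        (1, 'B', '2019-01-05'),
        (2, 'A', '2019-01-01'),
        (2, 'A', '2019-01-05');
""")

# Keep a row when it is the user's first row (prev is NULL)
# or when plan_type differs from the previous row's plan_type.
rows = con.execute("""
    SELECT user_id, plan_type, created
    FROM (SELECT p.*,
                 LAG(plan_type) OVER (PARTITION BY user_id ORDER BY created) AS prev_plan_type
          FROM plans p) x
    WHERE prev_plan_type IS NULL OR prev_plan_type <> plan_type
    ORDER BY user_id, created
""").fetchall()
print(rows)
# [(1, 'A', '2019-01-01'), (1, 'B', '2019-01-05'), (2, 'A', '2019-01-01')]
```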
use row_number() window function
select * from
(select *, row_number() over(partition by user_id, plan_type order by created) rn
from tablename
) a where a.rn=1
use lag()
select * from
(
select user_id, plan_type, lag(plan_type) over (partition by user_id order by created) as changes, created
from tablename
)A where plan_type<>changes and changes is not null

Count values checking if consecutive

This is my table:
| Event | Order | Timestamp |
|-----------|-----------------|--------------------------|
| delFailed | 281475031393706 | 2018-07-24T15:48:08.000Z |
| reopen | 281475031393706 | 2018-07-24T15:54:36.000Z |
| reopen | 281475031393706 | 2018-07-24T15:54:51.000Z |
I need to count the 'delFailed' and 'reopen' events to calculate #delFailed - #reopen.
The difficulty is that two identical consecutive events must be counted only once, so in this case the result should be "0", not "-1".
This is what I have achieved so far (which is wrong, because it gives me -1 instead of 0 due to the two consecutive "reopen" events):
with
events as (
select
event as events,
orders,
"timestamp"
from main_source_execevent
where orders = '281475031393706'
and event in ('reopen', 'delFailed')
order by "timestamp"
),
count_events as (
select
count(events) as CEvents,
events,
orders
from events
group by orders, events
)
select (
(select cevents from count_events where events = 'delFailed') - (select cevents from count_events where events = 'reopen')
) as nAttempts,
orders
from count_events
group by orders
How can I count only once when there are two identical consecutive events?
This is a gaps-and-islands problem; you can use two row numbers to detect identical consecutive events.
Explanation:
one row number is computed over all rows;
another row number is computed per Event value.
SELECT *
FROM (
SELECT *
,ROW_NUMBER() OVER(ORDER BY Timestamp) grp
,ROW_NUMBER() OVER(PARTITION BY Event ORDER BY Timestamp) rn
FROM T
) t1
| event | Order | timestamp | grp | rn |
|-----------|-----------------|----------------------|-----|----|
| delFailed | 281475031393706 | 2018-07-24T15:48:08Z | 1 | 1 |
| reopen | 281475031393706 | 2018-07-24T15:54:36Z | 2 | 1 |
| reopen | 281475031393706 | 2018-07-24T15:54:51Z | 3 | 2 |
With those two row numbers in place you get the result above; then grp - rn tells you whether rows belong to the same consecutive run.
SELECT *,grp-rn
FROM (
SELECT *
,ROW_NUMBER() OVER(ORDER BY Timestamp) grp
,ROW_NUMBER() OVER(PARTITION BY Event ORDER BY Timestamp) rn
FROM T
) t1
| event | Order | timestamp | grp | rn | grp-rn |
|-----------|-----------------|----------------------|-----|----|----------|
| delFailed | 281475031393706 | 2018-07-24T15:48:08Z | 1 | 1 | 0 |
| reopen | 281475031393706 | 2018-07-24T15:54:36Z | 2 | 1 | 1 |
| reopen | 281475031393706 | 2018-07-24T15:54:51Z | 3 | 2 | 1 |
You can see that when there are two identical consecutive events the grp-rn value is the same, so we can group by grp-rn (and Event) to count each run once.
Final query.
CREATE TABLE T(
Event VARCHAR(50),
"Order" VARCHAR(50),
Timestamp Timestamp
);
INSERT INTO T VALUES ('delFailed',281475031393706,'2018-07-24T15:48:08.000Z');
INSERT INTO T VALUES ('reopen',281475031393706,'2018-07-24T15:54:36.000Z');
INSERT INTO T VALUES ('reopen',281475031393706,'2018-07-24T15:54:51.000Z');
Query 1:
SELECT
SUM(CASE WHEN event = 'delFailed' THEN 1 END) -
SUM(CASE WHEN event = 'reopen' THEN 1 END) result
FROM (
SELECT Event,COUNT(distinct Event)
FROM (
SELECT *
,ROW_NUMBER() OVER(ORDER BY Timestamp) grp
,ROW_NUMBER() OVER(PARTITION BY Event ORDER BY Timestamp) rn
FROM T
) t1
group by grp - rn,Event
)t1
Results:
| result |
|--------|
| 0 |
I would just use lag() to get the first event in any sequence of similar values. Then do the calculation:
select sum( (event = 'reopen')::int ) as num_reopens,
sum( (event = 'delFailed')::int ) as num_delFailed
from (select mse.*,
lag(event) over (partition by orders order by "timestamp") as prev_event
from main_source_execevent mse
where orders = '281475031393706' and
event in ('reopen', 'delFailed')
) e
where prev_event <> event or prev_event is null;
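The lag() de-duplication can be checked with Python's sqlite3 (3.25+); SQLite lacks Postgres's `(expr)::int` cast, so the sketch uses SUM(CASE ...) instead, and the column names follow the question:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events (event TEXT, orders TEXT, ts TEXT);
    INSERT INTO events VALUES
        ('delFailed', '281475031393706', '2018-07-24T15:48:08'),
        ('reopen',    '281475031393706', '2018-07-24T15:54:36'),
        ('reopen',    '281475031393706', '2018-07-24T15:54:51');
""")

# Drop any event equal to the previous event, then count the survivors.
row = con.execute("""
    SELECT SUM(CASE WHEN event = 'delFailed' THEN 1 ELSE 0 END)
         - SUM(CASE WHEN event = 'reopen'    THEN 1 ELSE 0 END) AS n_attempts
    FROM (SELECT e.*,
                 LAG(event) OVER (PARTITION BY orders ORDER BY ts) AS prev_event
          FROM events e) x
    WHERE prev_event IS NULL OR prev_event <> event
""").fetchone()
print(row)  # (0,)
```

The second 'reopen' is filtered out, so the count is 1 - 1 = 0 as required.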

Subtracting two column on group by same table SQL

I have this table
create table events
(
event_type integer not null,
value integer not null,
time timestamp not null,
unique(event_type, time)
);
I want to write a SQL query that, for each event_type that has been registered more than once, returns the difference between the latest (i.e. the most recent in terms of time) and the second latest value. The table should be ordered by event_type (in ascending order).
Sample data is:
event_type | value | time
-------------+------------+--------------------
2 | 5 | 2015-05-09 12:42:00
4 | -42 | 2015-05-09 13:19:57
2 | 2 | 2015-05-09 14:48:30
2 | 7 | 2015-05-09 12:54:39
3 | 16 | 2015-05-09 13:19:57
3 | 20 | 2015-05-09 15:01:09
The output should be
event_type | value
------------+-----------
2 | -5
3 | 4
So far I tried doing this
SELECT event_type
FROM events
GROUP BY event_type
HAVING COUNT(event_type) > 1
ORDER BY event_type
I cannot find a way to get the right value for the second column that I've mentioned. I'm using PostgreSQL 9.4.
One way to do it is with lead(), which gets the next value of a given column based on a specified ordering. The second-latest row for a given event_type will then carry the latest value as its next_val, which can be used for the subtraction. (Run the inner query to see how next_val is assigned.)
select event_type, next_val - value as diff
from (select t.*
,lead(value) over(partition by event_type order by time) as next_val
,row_number() over(partition by event_type order by time desc) as rnum
from events t
) t
where next_val is not null and rnum=2
One more option with DISTINCT ON and lead.
select distinct on (event_type) event_type,next_val-value as diff
from (select t.*,lead(value) over(partition by event_type order by time) as next_val
from events t
) t
where next_val is not null
order by event_type,time desc
You can do this using ANSI/ISO standard window functions:
select event_type,
sum(case when seqnum = 1 then value
when seqnum = 2 then - value
end) as diff_latest
from (select e.*,
row_number() over (partition by event_type order by time desc) as seqnum
from events e
) e
where seqnum in (1, 2)
group by event_type
having count(*) = 2;
Here is a SQL Fiddle.
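The row_number() approach is standard SQL and runs as-is in SQLite (3.25+), so it can be verified from Python with the question's sample data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events (event_type INT, value INT, time TEXT);
    INSERT INTO events VALUES
        (2,   5, '2015-05-09 12:42:00'),
        (4, -42, '2015-05-09 13:19:57'),
        (2,   2, '2015-05-09 14:48:30'),
        (2,   7, '2015-05-09 12:54:39'),
        (3,  16, '2015-05-09 13:19:57'),
        (3,  20, '2015-05-09 15:01:09');
""")

# seqnum 1 is the latest row, seqnum 2 the second latest;
# summing +value and -value yields latest - second latest.
rows = con.execute("""
    SELECT event_type,
           SUM(CASE WHEN seqnum = 1 THEN value
                    WHEN seqnum = 2 THEN -value END) AS diff_latest
    FROM (SELECT e.*,
                 ROW_NUMBER() OVER (PARTITION BY event_type ORDER BY time DESC) AS seqnum
          FROM events e) x
    WHERE seqnum IN (1, 2)
    GROUP BY event_type
    HAVING COUNT(*) = 2
    ORDER BY event_type
""").fetchall()
print(rows)  # [(2, -5), (3, 4)]
```

event_type 4 appears only once, so HAVING COUNT(*) = 2 drops it, matching the requested output.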

SQL ORACLE - get min row with sequence equal values

I have a table similar to:
MY_DAT | STATUS
=========|========
1.1.2017 | A
2.1.2017 | A
3.1.2017 | A
4.1.2017 | B
5.1.2017 | B
6.1.2017 | A
7.1.2017 | C
8.1.2017 | A
9.1.2017 | A
10.1.2017| A
I want a SQL query that, for a given date (MY_DAT), returns the earliest date of the uninterrupted run of equal STATUS values containing it.
Example
MY_DAT = '1.1.2017' -> '1.1.2017',A
MY_DAT = '3.1.2017' -> '1.1.2017',A
MY_DAT = '10.1.2017' -> '8.1.2017',A
MY_DAT = '5.1.2017' -> '4.1.2017',B
I don't know what this SQL should look like.
EDIT
I need the result for every date. In this example the result has to be:
MY_DAT | STATUS | BEGIN
=========|========|========
1.1.2017 | A |1.1.2017
2.1.2017 | A |1.1.2017
3.1.2017 | A |1.1.2017
4.1.2017 | B |4.1.2017
5.1.2017 | B |4.1.2017
6.1.2017 | A |6.1.2017
7.1.2017 | C |7.1.2017
8.1.2017 | A |8.1.2017
9.1.2017 | A |8.1.2017
10.1.2017| A |8.1.2017
ANSWER
select my_dat, status,
min(my_dat) over (partition by grp, status) as begin_dat
from (select my_dat, status,
row_number() over(order by my_dat)
- row_number() over(partition by status order by my_dat) as grp
from tbl ) t
Thanks to Vamsi Prabhala
Use a difference of row numbers approach to assign groups to consecutive rows with the same status. (Run the inner query to see this.) After that, it is just a group by operation to get the min date.
select status, min(my_dat)
from (select my_dat, status
,row_number() over(order by my_dat)
- row_number() over(partition by status order by my_dat) as grp
from tbl
) t
group by grp, status
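The difference-of-row-numbers trick is not Oracle-specific, so it can be sketched and tested with Python's sqlite3 (3.25+), using ISO dates so text ordering matches date ordering:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t (my_dat TEXT, status TEXT);
    INSERT INTO t VALUES
        ('2017-01-01','A'),('2017-01-02','A'),('2017-01-03','A'),
        ('2017-01-04','B'),('2017-01-05','B'),('2017-01-06','A'),
        ('2017-01-07','C'),('2017-01-08','A'),('2017-01-09','A'),
        ('2017-01-10','A');
""")

# grp = (row number over all rows) - (row number per status):
# constant within an uninterrupted run of the same status.
rows = con.execute("""
    SELECT my_dat, status,
           MIN(my_dat) OVER (PARTITION BY grp, status) AS begin_dat
    FROM (SELECT my_dat, status,
                 ROW_NUMBER() OVER (ORDER BY my_dat)
               - ROW_NUMBER() OVER (PARTITION BY status ORDER BY my_dat) AS grp
          FROM t) x
    ORDER BY my_dat
""").fetchall()
print(rows[4])   # ('2017-01-05', 'B', '2017-01-04')
print(rows[-1])  # ('2017-01-10', 'A', '2017-01-08')
```

These match the expected BEGIN column from the question's edit.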
Please try this.
SELECT status, min(my_dat)
FROM dates
GROUP BY status
OK, then what about this?
SELECT *
FROM dates
INNER JOIN
(
SELECT status, min(my_dat)
FROM dates
GROUP BY status
) sub
ON dates.status = sub.status