Selecting only first_value by date and ID in BigQuery

I'm trying to get only the first event (row) for each user on each day.
date        userId  event
2018-09-30  1       login
2018-09-30  2       login
2018-09-30  1       next
2018-09-30  1       next
2018-09-30  2       next
2018-09-29  1       login
My goal is to get this:
date        userId  event
2018-09-30  1       login
2018-09-30  2       login
2018-09-29  1       login
This is as far as I got. It returns the first date of each user's activity, but I need only each user's first event per date.
select *, FIRST_VALUE(date) over(partition by userId order by date) AS firstValue
FROM table
date        userId  event  firstValue
2018-09-30  1       login  2018-09-29
2018-09-30  2       login  2018-09-30
2018-09-30  1       next   2018-09-29
2018-09-30  1       next   2018-09-29
2018-09-30  2       next   2018-09-30
2018-09-29  1       login  2018-09-29
So what should I do to get only the first appearance of each user per day?

Your design is missing a column that defines the order of events within a day.
The query below gives you one event per user per day, but which event you get is not defined or guaranteed:
select *
from `project.dataset.table`
where true
qualify row_number() over(partition by userid, date) = 1
If you do have a column that can order events within the day (for example, order_column), you can use this instead:
select *
from `project.dataset.table`
where true
qualify row_number() over(partition by userid, date order by order_column) = 1
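To see how the row_number() filter behaves, here is a minimal sketch that runs the same idea in SQLite through Python's sqlite3 module. It is not BigQuery: SQLite has no QUALIFY clause, so the filter moves to an outer query. The table name events and the order_column values are made up for illustration, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (date TEXT, userId INTEGER, event TEXT, order_column INTEGER)"
)
# order_column is a hypothetical column that orders events within a day.
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("2018-09-30", 1, "login", 1),
        ("2018-09-30", 2, "login", 2),
        ("2018-09-30", 1, "next", 3),
        ("2018-09-30", 1, "next", 4),
        ("2018-09-30", 2, "next", 5),
        ("2018-09-29", 1, "login", 1),
    ],
)

# Number the rows per (userId, date) and keep only the first one.
first_events = conn.execute(
    """
    SELECT date, userId, event
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY userId, date ORDER BY order_column) AS rn
        FROM events
    )
    WHERE rn = 1
    ORDER BY date DESC, userId
    """
).fetchall()

for row in first_events:
    print(row)
```

Each printed row is the earliest event for one user on one day, which matches the result the question asks for.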

First generate a row number. If your dataset has another column that indicates order, e.g. a timestamp or an order number, use it instead, because row_number() over() with no ORDER BY does not guarantee any particular order. Then group by the desired fields, date and userID.
To get the first element, array_agg the event column ordered by that row number and take the first entry with offset(0). This approach is quite flexible; the query below also shows the number of events per user and date.
select date, userID,
       array_agg(event order by row_id limit 1)[offset(0)],
       count(1) as event_per_user_date
from (
    select *, row_number() over() as row_id
    from (
        select "2018-09-30" as date, 1 as userID, "login" as event
        union all select "2018-09-30", 2, "login"
        union all select "2018-09-30", 1, "next"
        union all select "2018-09-30", 1, "next"
        union all select "2018-09-30", 2, "next"
        union all select "2018-09-29", 1, "login"
    )
)
group by 1, 2

Related

How to merge rows startdate enddate based on column values using Lag Lead or window functions?

I have a table with 4 columns: ID, STARTDATE, ENDDATE and BADGE. I want to merge rows based on ID and BADGE values but make sure that only consecutive rows will get merged.
For example, If input is:
Output will be:
I have tried LAG/LEAD with both bounded and unbounded PRECEDING frames, but I could not get the desired output:
SELECT ID,
STARTDATE,
MAX(ENDDATE),
NAME
FROM (SELECT USERID,
IFF(LAG(NAME) over(Partition by USERID Order by STARTDATE) = NAME,
LAG(STARTDATE) over(Partition by USERID Order by STARTDATE),
STARTDATE) AS STARTDATE,
ENDDATE,
NAME
from myTable )
GROUP BY USERID,
STARTDATE,
NAME
We have to make sure that we merge only consecutive rows that have the same ID and Badge.
Help will be appreciated, Thanks.

You can split the problem into two steps:
1. creating the right partitions
2. aggregating on the partitions with direct aggregation functions (MIN and MAX)
You can approach the first step with a flag that is 1 when a row does not start on the day after the previous row's end date (that is, when previous ENDDATE + 1 day <> current STARTDATE) and 0 otherwise. A value of 1 marks the point where a new partition should start, so a running sum of the flag gives you correctly numbered partitions.
WITH cte AS (
SELECT *,
IFF(LAG(ENDDATE) OVER(PARTITION BY ID, Badge ORDER BY STARTDATE) + INTERVAL 1 DAY = STARTDATE , 0, 1) AS boolval
FROM tab
)
SELECT *,
SUM(COALESCE(boolval, 0)) OVER(ORDER BY ID DESC, STARTDATE) AS rn
FROM cte
Then the second step can be summarized in the direct aggregation of "STARTDATE" and "ENDDATE" using the MIN and MAX function respectively, grouping on your ranking value. For syntax correctness, you need to add "ID" and "Badge" too in the GROUP BY clause, even though their range of action is already captured by the computed ranking value.
WITH cte AS (
SELECT *,
IFF(LAG(ENDDATE) OVER(PARTITION BY ID, Badge ORDER BY STARTDATE) + INTERVAL 1 DAY = STARTDATE , 0, 1) AS boolval
FROM tab
), cte2 AS (
SELECT *,
SUM(COALESCE(boolval, 0)) OVER(ORDER BY ID DESC, STARTDATE) AS rn
FROM cte
)
SELECT ID,
MIN(STARTDATE) AS STARTDATE,
MAX(ENDDATE) AS ENDDATE,
Badge
FROM cte2
GROUP BY ID,
Badge,
rn
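As a rough, runnable sketch of the two steps above, the snippet below ports the idea to SQLite via Python's sqlite3 module. This is an adaptation, not the query above verbatim: IFF becomes CASE, the interval arithmetic becomes date(..., '+1 day'), and the sample rows are invented for illustration (dates are assumed to be stored as ISO text).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (id INTEGER, startdate TEXT, enddate TEXT, badge INTEGER)")
# Hypothetical sample rows: badge 1 appears in two separate stretches.
conn.executemany(
    "INSERT INTO tab VALUES (?, ?, ?, ?)",
    [
        (51, "2020-01-01", "2020-01-10", 1),
        (51, "2020-01-11", "2020-01-20", 1),
        (51, "2020-01-21", "2020-01-31", 2),
        (51, "2020-02-01", "2020-02-10", 1),
        (51, "2020-02-11", "2020-02-20", 1),
    ],
)

merged = conn.execute(
    """
    WITH flagged AS (
        -- is_new = 1 when the row does not continue the previous row
        -- of the same id and badge (also 1 for the first row, where LAG is NULL)
        SELECT *,
               CASE WHEN date(LAG(enddate) OVER (PARTITION BY id, badge ORDER BY startdate),
                              '+1 day') = startdate
                    THEN 0 ELSE 1 END AS is_new
        FROM tab
    ), numbered AS (
        -- the running sum of the flag numbers the islands
        SELECT *, SUM(is_new) OVER (ORDER BY id, startdate) AS grp
        FROM flagged
    )
    SELECT id, MIN(startdate) AS island_start, MAX(enddate) AS island_end, badge
    FROM numbered
    GROUP BY id, badge, grp
    ORDER BY id, island_start
    """
).fetchall()

for row in merged:
    print(row)
```

With this sample data the two stretches of badge 1 stay separate, because the badge 2 row in between starts a new island.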

In Snowflake, this kind of gaps-and-islands problem can be solved with the conditional_true_event function, as in the query below.
The first CTE creates a column that flags a change event (true or false) whenever the value of the badge column changes.
The next CTE (cte_1) passes this flag to conditional_true_event, which produces another column that increments each time the flag is TRUE. That column is used for grouping in the final query.
The final query is then just MIN and MAX with a GROUP BY.
with cte as (
    select
        m.*,
        case when badge <> lag(badge) over (partition by id order by startdate)
             then true
             else false end flag
    from merge_tab m
), cte_1 as (
    select c.*,
           conditional_true_event(flag) over (partition by id order by startdate) cn
    from cte c
)
select id, min(startdate) ms, max(enddate) me, badge
from cte_1
group by id, badge, cn
order by id desc, ms asc, me asc, badge asc;
Final output -
ID  MS          ME          BADGE
51  1985-02-01  2019-04-28  1
51  2019-04-29  2020-08-16  2
51  2020-08-17  2021-04-03  3
51  2021-04-04  2021-04-05  1
51  2021-04-06  2022-08-20  2
51  2022-08-21  9999-12-31  3
10  2020-02-06  9999-12-31  3
With data -
select * from merge_tab;
ID  STARTDATE   ENDDATE     BADGE
51  1985-02-01  2019-04-28  1
51  2019-04-29  2019-04-28  2
51  2019-09-16  2019-11-16  2
51  2019-11-17  2020-08-16  2
51  2020-08-17  2021-04-03  3
51  2021-04-04  2021-04-05  1
51  2021-04-06  2022-05-05  2
51  2022-05-06  2022-08-20  2
51  2022-08-21  9999-12-31  3
10  2020-02-06  2019-04-28  3
10  2021-03-21  9999-12-31  3

How to get values from the previous row?

I have a table like this:
ID  NUMBER  TIMESTAMP
1   1       05/28/2020 09:00:00
2   2       05/29/2020 10:00:00
3   1       05/31/2020 21:00:00
4   1       06/01/2020 21:00:00
And I want to show data like this:
ID  NUMBER  TIMESTAMP            RANGE
1   1       05/28/2020 09:00:00  0 Days
2   2       05/29/2020 10:00:00  0 Days
3   1       05/31/2020 21:00:00  3,5 Days
4   1       06/01/2020 21:00:00  1 Days
So, for example, 3,5 days passed between the first and the second row for number 1.
I tried:
select a.id, a.number, a.timestamp, ((a.timestamp-b.timestamp)/24) as days
from my_table a
left join (select number,timestamp from my_table) b
on a.number=b.number
This didn't work as expected. How do I do this properly?

Use the window function lag().
With standard interval output:
SELECT *, timestamp - lag(timestamp) OVER(PARTITION BY number ORDER BY id)
FROM tbl
ORDER BY id;
If you need a decimal number, as in your example:
SELECT *, round((extract(epoch FROM timestamp - lag(timestamp) OVER(PARTITION BY number ORDER BY id)) / 86400)::numeric, 2) || ' days'
FROM tbl
ORDER BY id;
If you also need to display '0 days' instead of NULL like in your example:
SELECT *, COALESCE(round((extract(epoch FROM timestamp - lag(timestamp) OVER(PARTITION BY number ORDER BY id)) / 86400)::numeric, 2), 0) || ' days'
FROM tbl
ORDER BY id;
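For readers who want to try the lag() idea without a PostgreSQL instance, here is a small sketch in SQLite through Python's sqlite3 module. It is an adaptation under a few assumptions: SQLite has no interval type, so the difference is taken with julianday(); the column is named ts and holds ISO-formatted text; window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER, number INTEGER, ts TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [
        (1, 1, "2020-05-28 09:00:00"),
        (2, 2, "2020-05-29 10:00:00"),
        (3, 1, "2020-05-31 21:00:00"),
        (4, 1, "2020-06-01 21:00:00"),
    ],
)

# julianday() returns a day count, so the difference is already in days.
# COALESCE turns the NULL of each number's first row into 0.
gaps = conn.execute(
    """
    SELECT id, number, ts,
           COALESCE(
               ROUND(julianday(ts)
                     - julianday(LAG(ts) OVER (PARTITION BY number ORDER BY id)), 2),
               0) AS days
    FROM tbl
    ORDER BY id
    """
).fetchall()

for row in gaps:
    print(row)
```

The days column comes out as 0, 0, 3.5 and 1.0 for ids 1 to 4, in line with the table in the question.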

Category Entry and Exit Dates per ID AND Category

I have the following table, where ID is the unique identifier. An ID can move from category to category, both up and down. My table records each day an ID stays in a given category. I am trying to identify the start date and the end date of an ID in a given category. The problem is that an ID can move up a category, and move back down to its original category after a certain number of days. Here is my table as an example with only 1 ID:
ID Category Date
1 1 2021-01-01
1 1 2021-01-02
...
1 1 2021-01-24
1 2 2021-01-25
...
1 2 2021-02-15
1 1 2021-02-16
...
1 1 2021-04-20
1 2 2021-04-21
When I take MIN(DATE) and MAX(DATE) grouped by category and ID, the result says the ID was in Category 1 from 2021-01-01 to 2021-04-20 and in Category 2 from 2021-01-25 to 2021-04-21. I want to track each move between categories step by step, so my ideal result would be:
ID Category StartDate EndDate
1 1 2021-01-01 2021-01-24
1 2 2021-01-25 2021-02-15
1 1 2021-02-16 2021-04-20
1 2 2021-04-21 NULL (or GETDATE())
How can I achieve this result? Any help would be appreciated. I tried using the RANK() function but because the table records every single day, it seems useless.

This is a type of gaps-and-islands problem that is most easily solved using the difference of row numbers:
select id, category, min(date), max(date)
from (select t.*,
row_number() over (partition by id order by date) as seqnum,
row_number() over (partition by id, category order by date) as seqnum_2
from t
) t
group by id, category, (seqnum - seqnum_2);
Actually, the difference of row numbers is only simplest because you have not specified the database. You can just subtract a sequence of numbers from the date to get a constant that defines each group. That looks like:
select id, category, min(date), max(date)
from (select t.*,
row_number() over (partition by id, category order by date) as seqnum
from t
) t
group by id, category, date - seqnum * interval '1 day';
However, the date arithmetic varies by database.
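To make the difference-of-row-numbers trick concrete, here is a minimal sketch run in SQLite through Python's sqlite3 module. The rows are a shortened, made-up version of the question's data (a few days per stay instead of weeks), so the dates are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, category INTEGER, date TEXT)")
# Hypothetical data: category 1, then 2, then back to 1, then 2 again.
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, 1, "2021-01-01"),
        (1, 1, "2021-01-02"),
        (1, 1, "2021-01-03"),
        (1, 2, "2021-01-04"),
        (1, 2, "2021-01-05"),
        (1, 1, "2021-01-06"),
        (1, 1, "2021-01-07"),
        (1, 2, "2021-01-08"),
    ],
)

# seqnum - seqnum_2 is constant within each unbroken stay in a category,
# so it can serve as the island key in GROUP BY.
stays = conn.execute(
    """
    SELECT id, category, MIN(date) AS start_date, MAX(date) AS end_date
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) AS seqnum,
               ROW_NUMBER() OVER (PARTITION BY id, category ORDER BY date) AS seqnum_2
        FROM t
    )
    GROUP BY id, category, seqnum - seqnum_2
    ORDER BY id, start_date
    """
).fetchall()

for row in stays:
    print(row)
```

In this data the first stay in category 1 has a difference of 0 and the second stay has a difference of 2, which is why the two stays end up in separate groups even though the category is the same.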

Select start and end dates for changing values in SQL

I have a database with accounts and historical status changes
select Date, Account, OldStatus, NewStatus from HistoricalCodes
order by Account, Date
Date        Account  OldStatus  NewStatus
2020-01-01  12345    1          2
2020-10-01  12345    2          3
2020-11-01  12345    3          2
2020-12-01  12345    2          1
2020-01-01  54321    2          3
2020-09-01  54321    3          2
2020-12-01  54321    2          3
For every account I need to determine the start date and end date of each period when the status was 2. An additional challenge is that the status can change back and forth multiple times. Is there a way in SQL to produce something like this for at least the first two periods in which an account was in status 2? Any ideas?
Account  StartDt_1   EndDt_1     StartDt_2   EndDt_2
12345    2020-01-01  2020-10-01  2020-11-01  2020-12-01
54321    2020-09-01  2020-12-01

I would suggest putting this information in separate rows:
select t.*
from (select account, newstatus, date as startdate,
             lead(date) over (partition by account order by date) as enddate
      from t
     ) t
where newstatus = 2;
This produces a separate row for each period when an account has a status of 2. This is better than putting the dates in separate pairs of columns, because you do not need to know the maximum number of periods of status = 2 when you write the query.
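Here is a small sketch of the one-row-per-period approach, run against the question's sample data in SQLite through Python's sqlite3 module. The table is named HistoricalCodes as in the question; note that the inner query has to expose newstatus so that the outer WHERE can filter on it. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE HistoricalCodes "
    "(date TEXT, account INTEGER, oldstatus INTEGER, newstatus INTEGER)"
)
conn.executemany(
    "INSERT INTO HistoricalCodes VALUES (?, ?, ?, ?)",
    [
        ("2020-01-01", 12345, 1, 2),
        ("2020-10-01", 12345, 2, 3),
        ("2020-11-01", 12345, 3, 2),
        ("2020-12-01", 12345, 2, 1),
        ("2020-01-01", 54321, 2, 3),
        ("2020-09-01", 54321, 3, 2),
        ("2020-12-01", 54321, 2, 3),
    ],
)

# A period in status 2 starts at the change into 2 and ends at the
# account's next status change, which LEAD() looks up.
periods = conn.execute(
    """
    SELECT account, startdate, enddate
    FROM (
        SELECT account, newstatus, date AS startdate,
               LEAD(date) OVER (PARTITION BY account ORDER BY date) AS enddate
        FROM HistoricalCodes
    )
    WHERE newstatus = 2
    ORDER BY account, startdate
    """
).fetchall()

for row in periods:
    print(row)
```

Account 12345 gets two rows and account 54321 gets one, without the query needing to know in advance how many periods there are.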

For a fixed maximum of status changes per account, you can use window functions and conditional aggregation:
select account,
max(case when rn = 1 then date end) as start_dt1,
max(case when rn = 1 then lead_date end) as end_dt1,
max(case when rn = 2 then date end) as start_dt2,
max(case when rn = 2 then lead_date end) as end_dt2
from (
select t.*,
row_number() over(partition by account, newstatus order by date) as rn,
lead(date) over(partition by account order by date) as lead_date
from mytable t
) t
where newstatus = 2
group by account
You can extend the select clause with more conditional expressions to handle more possible ranges per account.
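To check the pivoted shape against the question's expected output, the sketch below runs the same query in SQLite through Python's sqlite3 module. The table name mytable follows the query above, and the rows are the question's sample data; this is an illustration on SQLite 3.25 or newer, not a statement about any particular database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable "
    "(date TEXT, account INTEGER, oldstatus INTEGER, newstatus INTEGER)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [
        ("2020-01-01", 12345, 1, 2),
        ("2020-10-01", 12345, 2, 3),
        ("2020-11-01", 12345, 3, 2),
        ("2020-12-01", 12345, 2, 1),
        ("2020-01-01", 54321, 2, 3),
        ("2020-09-01", 54321, 3, 2),
        ("2020-12-01", 54321, 2, 3),
    ],
)

# rn numbers each account's changes into a given status; after filtering
# on newstatus = 2, rn = 1 is the first period in status 2, rn = 2 the second.
pivoted = conn.execute(
    """
    SELECT account,
           MAX(CASE WHEN rn = 1 THEN date END)      AS start_dt1,
           MAX(CASE WHEN rn = 1 THEN lead_date END) AS end_dt1,
           MAX(CASE WHEN rn = 2 THEN date END)      AS start_dt2,
           MAX(CASE WHEN rn = 2 THEN lead_date END) AS end_dt2
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY account, newstatus ORDER BY date) AS rn,
               LEAD(date) OVER (PARTITION BY account ORDER BY date) AS lead_date
        FROM mytable t
    ) t
    WHERE newstatus = 2
    GROUP BY account
    ORDER BY account
    """
).fetchall()

for row in pivoted:
    print(row)
```

Account 54321 has only one period in status 2, so its second pair of columns comes back as NULL (None in Python).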

Number of user that came back within 3 days after playing at least three sessions?

I have data which contains user, eventdate and sessions. I want to pick out users who had at least 3 sessions and came back for a new session within 3 days.
user eventdate session
A 2018-02-05 1
A 2018-02-05 2
A 2018-02-06 3
A 2018-02-10 4
The output should be the users who had 3 sessions and then came back for a fourth session within 3 days.
I tried the following query but it is not giving me the answer that is needed.
SELECT distinct user, MIN(eventdate) startdate, MAX(eventdate) enddate
FROM (SELECT user, eventdate
FROM (SELECT user, eventdate
FROM tablename
where datediff(startdate,enddate)<=3
ORDER BY user, eventdate) where sessions>=3) t
GROUP BY user
ORDER BY user, startdate;
I know the query has many issues but I am simply unable to figure out how to move forward. Any suggestions?

Below is for BigQuery Standard SQL
#standardSQL
SELECT *
FROM (
SELECT
user, eventdate, sessions_in_a_day,
SUM(sessions_in_a_day) OVER(PARTITION BY user ORDER BY eventdate ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) total_sessions_before,
DATE_DIFF(eventdate, LAG(eventdate) OVER(PARTITION BY user ORDER BY eventdate), DAY) delay
FROM (
SELECT user, eventdate, COUNT(1) sessions_in_a_day
FROM t
GROUP BY user, eventdate
)
)
WHERE total_sessions_before >= 3
AND delay <= 3
-- ORDER BY user, eventdate
You can test and play with the above using dummy data:
#standardSQL
WITH t AS (
SELECT 'A' user, DATE '2018-02-05' eventdate, 1 session UNION ALL
SELECT 'A', DATE '2018-02-05', 2 UNION ALL
SELECT 'A', DATE '2018-02-06', 3 UNION ALL
SELECT 'A', DATE '2018-02-06', 4 UNION ALL
SELECT 'A', DATE '2018-02-09', 5 UNION ALL
SELECT 'A', DATE '2018-02-09', 6 UNION ALL
SELECT 'A', DATE '2018-02-10', 7 UNION ALL
SELECT 'A', DATE '2018-02-13', 8
)
SELECT *
FROM (
SELECT
user, eventdate, sessions_in_a_day,
SUM(sessions_in_a_day) OVER(PARTITION BY user ORDER BY eventdate ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) total_sessions_before,
DATE_DIFF(eventdate, LAG(eventdate) OVER(PARTITION BY user ORDER BY eventdate), DAY) delay
FROM (
SELECT user, eventdate, COUNT(1) sessions_in_a_day
FROM t
GROUP BY user, eventdate
)
)
WHERE total_sessions_before >= 3
AND delay <= 3
ORDER BY user, eventdate
result is
Row user eventdate sessions_in_a_day total_sessions_before delay
1 A 2018-02-09 2 4 3
2 A 2018-02-10 1 6 1
3 A 2018-02-13 1 7 3
By adjusting the WHERE clause you can tune this to whatever case you need.
The example above returns only users who had at least 3 sessions before coming back for their next session within 3 days.
If you are interested only in those who had exactly 3 sessions and then reached their fourth, you can add the corresponding filter.

WITH Sess AS
(
select user
from tablename
group by user
HAVING count(session) >= 3
)
select tablename.user
from tablename join Sess on tablename.user = Sess.user
group by tablename.user
having (datediff(day, min(eventdate), Max(eventdate)) <=3)
and (min(eventdate) <> Max(eventDate))
