How to number rows according to values in columns - sql

Imagine I have an event log (ordered by UserID and Start, Start_of_previous_event is added using LAG(), inactive time = Start - Start_of_previous_event):
UserID Event Start Start_of_previous_event inactive_time
1 Onboarding 2024-01-01 01:00:00 null null
1 Main 2024-01-01 01:01:00 2024-01-01 01:00:00 1
1 Cart 2024-01-01 01:05:00 2024-01-01 01:01:00 4
1 Main 2024-01-01 02:00:00 2024-01-01 01:05:00 55
2 Onboarding 2024-01-01 01:00:00 null null
How can I add a session_id column? A new session starts after 30 minutes of inactive time, and also for each new UserID.
Session_id column for the above example:
1
1
1
2
3
Is there a way to avoid it if I want to group the resulting table like this:
Select Event, Count(distinct session_id)
from sessions
group by Event

You can assign the session with date arithmetic and a cumulative sum. Date arithmetic varies by database, but this should give you the idea:
select el.*,
       sum(case when start_of_previous_event > start - interval '30 minute'
                then 0 else 1
           end) over (order by userid, start) as session_cnt
from eventlog el;
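To make the cumulative-sum idea concrete, here is a minimal Python sketch of the same logic applied to the sample rows from the question. The function name `assign_sessions` and the tuple layout are illustrative assumptions, not part of the original answer; the rows are assumed to be sorted by UserID and then Start.

```python
from datetime import datetime, timedelta

# Sample rows (user_id, event, start), already sorted by user_id, then start.
events = [
    (1, "Onboarding", datetime(2024, 1, 1, 1, 0)),
    (1, "Main", datetime(2024, 1, 1, 1, 1)),
    (1, "Cart", datetime(2024, 1, 1, 1, 5)),
    (1, "Main", datetime(2024, 1, 1, 2, 0)),
    (2, "Onboarding", datetime(2024, 1, 1, 1, 0)),
]


def assign_sessions(rows, gap=timedelta(minutes=30)):
    """Hypothetical helper: running count of 'new session' flags over all rows."""
    session_id = 0
    prev_user = None
    prev_start = None
    out = []
    for user_id, event, start in rows:
        # A session starts on a user's first row, or when the previous event
        # of the same user is at least `gap` in the past.
        if user_id != prev_user or start - prev_start >= gap:
            session_id += 1
        out.append((user_id, event, start, session_id))
        prev_user, prev_start = user_id, start
    return out


session_ids = [row[3] for row in assign_sessions(events)]
print(session_ids)
```

Because the counter is never reset between users, the ids are unique across the whole table, which is what makes a later `Count(distinct session_id)` per Event safe.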

Related

Row number with condition

I want to increase the row number of a partition based on a condition. This question refers to the same problem, but in my case, the column I want to condition on is another window function.
I want to identify the session number of each user (id) depending on how long ago their last recorded action (ts) was.
My table looks as follows:
id ts
1 2022-08-01 09:00:00 -- user 1, first session
1 2022-08-01 09:10:00
1 2022-08-01 09:12:00
1 2022-08-03 12:00:00 -- user 1, second session
1 2022-08-03 12:03:00
2 2022-08-01 11:04:00 -- user 2, first session
2 2022-08-01 11:07:00
2 2022-08-25 10:30:00 -- user 2, second session
2 2022-08-25 10:35:00
2 2022-08-25 10:36:00
I want to assign each user a session identifier based on the following conditions:
If the user's last action was 30 or more minutes ago (or doesn't exist), then increase (or initialize) the row number.
If the user's last action was less than 30 minutes ago, don't increase the row number.
I want to get the following result:
id ts session_id
1 2022-08-01 09:00:00 1
1 2022-08-01 09:10:00 1
1 2022-08-01 09:12:00 1
1 2022-08-03 12:00:00 2
1 2022-08-03 12:03:00 2
2 2022-08-01 11:04:00 1
2 2022-08-01 11:07:00 1
2 2022-08-25 10:30:00 2
2 2022-08-25 10:35:00 2
2 2022-08-25 10:36:00 2
If I had a separate column with the seconds since their last action, I could simply add 1 to each user's partitioned sum. However, this column is a window function itself. Hence, the following query doesn't work:
select
id
,ts
,extract(
epoch from (
ts - lag(ts, 1) over(partition by id order by ts)
)
) as seconds_since -- Number of seconds since last action (works well)
,sum(
case
when coalesce(
extract(
epoch from (
ts - lag(ts, 1) over (partition by id order by ts)
)
), 1800
) >= 1800 then 1
else 0 end
) over (partition by id order by ts) as session_id -- Window inside window (crashes)
from
t
order by
id
,ts
ERROR: Aggregate window functions with an ORDER BY clause require a frame clause
Use the LAG() window function to get the previous ts of each row, and create a flag column indicating whether the gap between the two timestamps is 30 minutes or more (or there is no previous row).
Then use the SUM() window function over that flag:
SELECT
  id
  ,ts
  ,SUM(flag) OVER (
    PARTITION BY id
    ORDER BY ts
    ROWS UNBOUNDED PRECEDING -- necessary in aws-redshift
  ) AS session_id
FROM (
  SELECT
    *
    ,COALESCE((LAG(ts) OVER (PARTITION BY id ORDER BY ts) <= ts - INTERVAL '30 minute')::int, 1) flag
  FROM
    tablename
) t
;
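The two steps (flag each row, then take a running sum per user) can be sketched outside the database as well. The following Python snippet is only an illustration under the assumption that "30 or more minutes" opens a new session, as stated in the question; `sessionize` is a made-up name.

```python
from datetime import datetime, timedelta
from itertools import accumulate, groupby

# Sample rows (id, ts) from the question.
rows = [
    (1, datetime(2022, 8, 1, 9, 0)),
    (1, datetime(2022, 8, 1, 9, 10)),
    (1, datetime(2022, 8, 1, 9, 12)),
    (1, datetime(2022, 8, 3, 12, 0)),
    (1, datetime(2022, 8, 3, 12, 3)),
    (2, datetime(2022, 8, 1, 11, 4)),
    (2, datetime(2022, 8, 1, 11, 7)),
    (2, datetime(2022, 8, 25, 10, 30)),
    (2, datetime(2022, 8, 25, 10, 35)),
    (2, datetime(2022, 8, 25, 10, 36)),
]


def sessionize(rows, gap=timedelta(minutes=30)):
    """Hypothetical helper: per-user session numbers via flag + running sum."""
    result = []
    # groupby plays the role of PARTITION BY id; sorted() of ORDER BY ts.
    for user_id, group in groupby(sorted(rows), key=lambda r: r[0]):
        stamps = [ts for _, ts in group]
        # Flag is 1 for the first row (no previous ts) and for every row
        # whose previous ts is at least `gap` earlier, otherwise 0.
        flags = [1] + [int(cur - prev >= gap) for prev, cur in zip(stamps, stamps[1:])]
        # accumulate() is the running SUM(flag) over the partition.
        for ts, session_id in zip(stamps, accumulate(flags)):
            result.append((user_id, ts, session_id))
    return result


per_user_sessions = [s for _, _, s in sessionize(rows)]
print(per_user_sessions)
```

Unlike a single global counter, the numbering restarts at 1 for every user, matching the expected result in the question.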

How to filter out multiple downtime events in SQL Server?

I need to write a query that filters out multiples of the same downtime event. These records get created at the exact same time with multiple different timestealers, which I don't need. Also, when a downtime event has multiple timestealers, I need the TimeStealer to be NULL instead.
Example table:
Id  TimeStealer  Start                End                  Is_Downtime  Downtime_Event
1   Machine 1    2022-01-01 01:00:00  2022-01-01 01:01:00  1            Malfunction
2   Machine 2    2022-01-01 01:00:00  2022-01-01 01:01:00  1            Malfunction
3   NULL         2022-01-01 00:01:00  2022-01-01 00:59:59  0            Operating
What I need the query to return:
Id  TimeStealer  Start                End                  Is_Downtime  Downtime_Event
1   NULL         2022-01-01 01:00:00  2022-01-01 01:01:00  1            Malfunction
2   NULL         2022-01-01 00:01:00  2022-01-01 00:59:59  0            Operating
Seems like this is a top 1 row of each group, but with the added logic of making a column NULL when there are multiple rows. You can achieve that by also using a windowed COUNT, and then a CASE expression in the outer SELECT to only return the value of TimeStealer when there was 1 event:
WITH CTE AS(
SELECT V.Id,
V.TimeStealer,
V.Start,
V.[End],
V.Is_Downtime,
V.Downtime_Event,
ROW_NUMBER() OVER (PARTITION BY V.Start, V.[End], V.Is_Downtime,V.Downtime_Event ORDER BY ID) AS RN,
COUNT(V.ID) OVER (PARTITION BY V.Start, V.[End], V.Is_Downtime,V.Downtime_Event) AS Events
FROM(VALUES('1','Machine 1',CONVERT(datetime2(0),'2022-01-01 01:00:00'),CONVERT(datetime2(0),'2022-01-01 01:01:00'),'1','Malfunction'),
('2','Machine 2',CONVERT(datetime2(0),'2022-01-01 01:00:00'),CONVERT(datetime2(0),'2022-01-01 01:01:00'),'1','Malfunction'),
('3',NULL,CONVERT(datetime2(0),'2022-01-01 00:01:00'),CONVERT(datetime2(0),'2022-01-01 00:59:59'),'0','Operating'))V(Id,TimeStealer,[Start],[End],Is_Downtime,Downtime_Event))
SELECT ROW_NUMBER() OVER (ORDER BY ID) AS ID,
CASE WHEN C.Events = 1 THEN C.TimeStealer END AS TimeStealer,
C.Start,
C.[End],
C.Is_Downtime,
C.Downtime_Event
FROM CTE C
WHERE C.RN = 1;
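For readers who want to check the logic without a SQL Server instance, here is a rough Python equivalent of "keep the first row per event, blank the TimeStealer when the event has several rows, then renumber". The helper name `collapse_events` and the tuple layout are assumptions made for this sketch.

```python
from itertools import groupby

# Sample rows: (Id, TimeStealer, Start, End, Is_Downtime, Downtime_Event).
rows = [
    (1, "Machine 1", "2022-01-01 01:00:00", "2022-01-01 01:01:00", 1, "Malfunction"),
    (2, "Machine 2", "2022-01-01 01:00:00", "2022-01-01 01:01:00", 1, "Malfunction"),
    (3, None, "2022-01-01 00:01:00", "2022-01-01 00:59:59", 0, "Operating"),
]


def collapse_events(rows):
    """Hypothetical helper mirroring ROW_NUMBER + windowed COUNT."""

    def event_key(row):
        # Start, End, Is_Downtime, Downtime_Event (the PARTITION BY columns).
        return row[2:]

    kept = []
    ordered = sorted(rows, key=lambda r: (event_key(r), r[0]))
    for _, group in groupby(ordered, key=event_key):
        members = list(group)
        first = members[0]  # the row that would get RN = 1
        # Keep TimeStealer only when the event has exactly one row.
        stealer = first[1] if len(members) == 1 else None
        kept.append((first[0], stealer) + first[2:])
    # Renumber by original Id, like ROW_NUMBER() OVER (ORDER BY ID).
    kept.sort(key=lambda r: r[0])
    return [(new_id,) + r[1:] for new_id, r in enumerate(kept, start=1)]


collapsed = collapse_events(rows)
for row in collapsed:
    print(row)
```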

SQL BigQuery - COUNTIF with criteria from current row and partitioned rows

I'm running this line of code:
COUNTIF(
type = "credit"
AND
DATETIME_DIFF(credit_window_end, start_at_local_true_01, DAY) BETWEEN 0 and 5
)
over (partition by case_id order by start_at_local_true_01
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
as credit_count_per_case_id_in_future_and_within_credit_window,
And I'm getting this
Row  case_id  start_at_local_true_01  type    credit_window_end    credit_count_per_case_id_in_future_and_within_credit_window
1    12123    2022-02-01 11:00:00     null    2022-02-06 11:00:00  0
2    12123    2022-02-01 11:15:00     run     null                 0
3    12123    2022-02-01 11:21:00     jump    2022-02-06 11:21:00  0
4    12123    2022-02-04 11:31:00     run     2022-02-09 11:31:00  0
5    12123    2022-02-05 11:34:00     jump    null                 0
6    12123    2022-02-08 12:38:00     credit  null                 0
7    12555    2022-02-01 11:15:00     null    null                 0
But I want this
Row  case_id  start_at_local_true_01  type    credit_window_end    credit_count_per_case_id_in_future_and_within_credit_window
1    12123    2022-02-01 11:00:00     null    2022-02-06 11:00:00  0
2    12123    2022-02-01 11:15:00     run     null                 0
3    12123    2022-02-01 11:21:00     jump    2022-02-06 11:21:00  0
4    12123    2022-02-04 11:31:00     run     2022-02-09 11:31:00  1
5    12123    2022-02-05 11:34:00     jump    null                 0
6    12123    2022-02-08 12:38:00     credit  null                 0
7    12555    2022-02-01 11:15:00     null    null                 0
The 4th row should be 1 because (from the 6th row) credit = credit AND DATETIME_DIFF(2022-02-08T12:38:00, 2022-02-04 11:31:00, DAY) between 0 and 5
The calculation within the cell would look like this:
COUNTIF(
run = credit AND DATETIME_DIFF(2022-02-04 11:31:00, 2022-02-04T11:31:00, DAY ) between 0 and 5
jump = credit AND DATETIME_DIFF(2022-02-04 11:31:00, 2022-02-05T11:34:00, DAY ) between 0 and 5
credit = credit AND DATETIME_DIFF(2022-02-04 11:31:00, 2022-02-08T12:38:00, DAY ) between 0 and 5
)
COUNTIF(
false and false
false and false
true and true
)
COUNTIF(
0
0
1
)
I think I know why, but I don't know how to fix it.
It's because the DATETIME_DIFF function is taking both values from the same row (from each partitioned row). The second element should stay the same (start_at_local_true_01). But I want the first element to be fixed to the CURRENT ROW's credit_window_end (not each partitioned row's credit_window_end).
This is my code so far (including sample table):
with data_table as(
select * FROM UNNEST(ARRAY<STRUCT<
case_id INT64, start_at_local_true_01 DATETIME, type STRING, credit_window_end DATETIME>>
[
(12123, DATETIME("2022-02-01 11:00:00"), null, DATETIME("2022-02-06 11:00:00"))
,(12123, DATETIME("2022-02-01 11:15:00"), 'run', null)
,(12123, DATETIME("2022-02-01 11:21:00"), 'jump', DATETIME("2022-02-06 11:21:00"))
,(12123, DATETIME("2022-02-04 11:31:00"), 'run', DATETIME("2022-02-09 11:31:00"))
,(12123, DATETIME("2022-02-05 11:34:00"), 'jump', null)
,(12123, DATETIME("2022-02-08 12:38:00"), 'credit', null)
,(12555, DATETIME("2022-02-01 11:15:00"), null, null)
]
)
)
select
data_table.*,
COUNTIF(
type = "credit"
)
over (partition by case_id order by start_at_local_true_01
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
as credit_count_per_case_id_in_future,
COUNTIF(
type = "credit"
AND
DATETIME_DIFF(start_at_local_true_01, credit_window_end, DAY) BETWEEN 0 and 5
)
over (partition by case_id order by start_at_local_true_01
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
as credit_count_per_case_id_in_future_and_within_credit_window,
--does not work. does not even run
-- DATETIME_DIFF(
-- credit_window_end,
-- array_agg(
-- IFNULL(start_at_local_true_01,DATETIME("2000-01-01 00:00:00"))
-- )
-- over (partition by case_id order by start_at_local_true_01 asc
-- ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
-- , DAY
-- )
-- as credit_count_per_case_id_in_future_and_within_credit_window_02,
from data_table
Thanks for the help!
As confirmed by @Phil in the comments, this was solved by changing the window to:
over (partition by case_id order by UNIX_MILLIS(TIMESTAMP(start_at_local_true_01)) RANGE BETWEEN CURRENT ROW AND 432000000 FOLLOWING)
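To see what a RANGE frame of 432000000 milliseconds (5 days) following the current row counts, here is a small Python sketch over the sample data. It assumes the COUNTIF condition is reduced to type = "credit", which the answer does not spell out. The `require_window` switch, which zeroes rows without a credit_window_end, is purely my assumption to reproduce the desired table and is not part of the posted fix.

```python
from datetime import datetime as D, timedelta

# Sample rows: (case_id, start_at_local_true_01, type, credit_window_end).
rows = [
    (12123, D(2022, 2, 1, 11, 0), None, D(2022, 2, 6, 11, 0)),
    (12123, D(2022, 2, 1, 11, 15), "run", None),
    (12123, D(2022, 2, 1, 11, 21), "jump", D(2022, 2, 6, 11, 21)),
    (12123, D(2022, 2, 4, 11, 31), "run", D(2022, 2, 9, 11, 31)),
    (12123, D(2022, 2, 5, 11, 34), "jump", None),
    (12123, D(2022, 2, 8, 12, 38), "credit", None),
    (12555, D(2022, 2, 1, 11, 15), None, None),
]


def credits_in_range(rows, span=timedelta(days=5), require_window=False):
    """Hypothetical helper emulating RANGE BETWEEN CURRENT ROW AND <span> FOLLOWING."""
    counts = []
    for case_id, start, _type, window_end in rows:
        if require_window and window_end is None:
            # Assumption: rows without a credit window never count credits.
            counts.append(0)
            continue
        # Count credit rows of the same case whose start lies in
        # [start, start + span], both bounds inclusive like a RANGE frame.
        counts.append(sum(
            1
            for other_case, other_start, other_type, _ in rows
            if other_case == case_id
            and other_type == "credit"
            and start <= other_start <= start + span
        ))
    return counts


plain = credits_in_range(rows)
gated = credits_in_range(rows, require_window=True)
print(plain)
print(gated)
```

Without the gate, rows 5 and 6 also see the credit on 2022-02-08 (row 6 counts itself), so an extra condition on credit_window_end may be needed to get exactly the table shown in the question.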

sql query using time series

I have the below table in bigquery:
Timestamp variant_id activity
2020-04-02 08:50 1 active
2020-04-03 07:39 1 not_active
2020-04-04 07:40 1 active
2020-04-05 10:22 2 active
2020-04-07 07:59 2 not_active
I want to query this subset of data to get the number of active variant per day.
If variant_id 1 is active on 2020-04-04, it remains active on the following dates (2020-04-05, 2020-04-06, and so on) until a row with activity not_active appears. The goal is to count, for each day, the number of variant_ids whose activity is active, taking into account that on any given date each variant_id carries the value of its most recent activity.
for example the result of the desired query in the subset data must be:
Date activity_count
2020-04-02 1
2020-04-03 0
2020-04-04 1
2020-04-05 2
2020-04-06 2
2020-04-07 1
2020-04-08 1
2020-04-09 1
2020-04-10 1
any help please ?
Consider below approach
select date, count(distinct if(activity = 'active', variant_id, null)) activity_count
from (
select date(timestamp) date, variant_id, activity,
lead(date(timestamp)) over(partition by variant_id order by timestamp) next_date
from your_table
), unnest(generate_date_array(date, ifnull(next_date - 1, '2020-04-10'))) date
group by date
If applied to the sample data in your question, the output matches the desired result shown there.
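The query works by stretching each row over the days until the variant's next row, then counting active variants per day. A rough Python rendering of that idea follows; `active_per_day` and its `last_day` parameter (standing in for the hard-coded '2020-04-10') are names invented for this sketch.

```python
from collections import defaultdict
from datetime import date, timedelta

# Sample rows: (date of Timestamp, variant_id, activity).
rows = [
    (date(2020, 4, 2), 1, "active"),
    (date(2020, 4, 3), 1, "not_active"),
    (date(2020, 4, 4), 1, "active"),
    (date(2020, 4, 5), 2, "active"),
    (date(2020, 4, 7), 2, "not_active"),
]


def active_per_day(rows, last_day):
    """Hypothetical helper: number of active variants for each calendar day."""
    by_variant = defaultdict(list)
    for day, variant_id, activity in sorted(rows):
        by_variant[variant_id].append((day, activity))

    first_day = min(day for day, _, _ in rows)
    n_days = (last_day - first_day).days + 1
    active = {first_day + timedelta(days=i): set() for i in range(n_days)}

    for variant_id, changes in by_variant.items():
        for i, (day, activity) in enumerate(changes):
            # A state lasts until the day before the next change (LEAD),
            # or until last_day when there is no later row.
            if i + 1 < len(changes):
                end = changes[i + 1][0] - timedelta(days=1)
            else:
                end = last_day
            end = min(end, last_day)
            current = day
            while current <= end:
                if activity == "active":
                    active[current].add(variant_id)
                current += timedelta(days=1)
    return {day: len(variants) for day, variants in active.items()}


counts = active_per_day(rows, last_day=date(2020, 4, 10))
for day, n in counts.items():
    print(day, n)
```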

Update the list of dates to have the same day

I have this in my table
TempTable
Id Date
1 1-15-2010
2 2-14-2010
3 3-14-2010
4 4-15-2010
I would like to change every record so that they all have the same day, that is, the 15th,
like this:
TempTable
Id Date
1 1-15-2010
2 2-15-2010 <--change to 15
3 3-15-2010 <--change to 15
4 4-15-2010
What if I want the 30th instead?
The records should then be:
TempTable
Id Date
1 1-30-2010
2 2-28-2010 <--change to 28 because feb has 28 days only
3 3-30-2010 <--change to 30
4 4-30-2010
thanks
You can play some fun tricks with DATEADD/DATEDIFF:
create table T (
ID int not null,
DT date not null
)
insert into T (ID,DT)
select 1,'20100115' union all
select 2,'20100214' union all
select 3,'20100314' union all
select 4,'20100415'
SELECT ID,DATEADD(month,DATEDIFF(month,'20100101',DT),'20100115')
from T
SELECT ID,DATEADD(month,DATEDIFF(month,'20100101',DT),'20100130')
from T
Results:
ID
----------- -----------------------
1 2010-01-15 00:00:00.000
2 2010-02-15 00:00:00.000
3 2010-03-15 00:00:00.000
4 2010-04-15 00:00:00.000
ID
----------- -----------------------
1 2010-01-30 00:00:00.000
2 2010-02-28 00:00:00.000
3 2010-03-30 00:00:00.000
4 2010-04-30 00:00:00.000
Basically, in the DATEADD/DATEDIFF, you specify the same component to both (i.e. month). Then, the second date constant (i.e. '20100130') specifies the "offset" you wish to apply from the first date (i.e. '20100101'), which will "overwrite" the portion of the date you're not keeping. My usual example is when wishing to remove the time portion from a datetime value, where both constants are the same date:
SELECT DATEADD(day,DATEDIFF(day,'20010101',<date column>),'20010101')
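The month version of the trick can be traced step by step in Python: count whole months from a base date to the row's date, then add that many months to an anchor carrying the wanted day, clamping to the end of shorter months the way DATEADD does. The names `add_months` and `move_to_day` are hypothetical helpers for this sketch only.

```python
import calendar
from datetime import date


def add_months(anchor, months):
    """Add months to a date, clamping the day to the month's last day."""
    total = anchor.year * 12 + (anchor.month - 1) + months
    year, month0 = divmod(total, 12)
    last = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(anchor.day, last))


def move_to_day(d, anchor):
    """Mimic DATEADD(month, DATEDIFF(month, <first of anchor month>, d), anchor)."""
    base = anchor.replace(day=1)
    # DATEDIFF(month, base, d): number of month boundaries crossed.
    months = (d.year - base.year) * 12 + (d.month - base.month)
    return add_months(anchor, months)


dates = [date(2010, 1, 15), date(2010, 2, 14), date(2010, 3, 14), date(2010, 4, 15)]
on_15th = [move_to_day(d, date(2010, 1, 15)) for d in dates]
on_30th = [move_to_day(d, date(2010, 1, 30)) for d in dates]
print(on_15th)
print(on_30th)
```

The clamp in `add_months` is what turns the 30th into 2010-02-28 for February.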
You can also try something like
UPDATE TempTable
SET [Date] = DATEADD(dd,15-day([Date]), DATEDIFF(dd,0,[Date]))
We have a function that calculates the first day of a month, so I just adapted it to calculate the 15th instead.