Conditional incrementing in BigQuery - google-bigquery

I have a data table like this:
user_id event_time
1 1456812346
1 1456812350
1 1456812446
1 1456812950
1 1456812960
Now, I am trying to define a 'session_id' for the user based on the event_time. If an event comes more than 180 seconds after the previous one, it is considered the start of a new session. So, I would like an output similar to:
user_id event_time session_id
1 1456812346 1
1 1456812350 1
1 1456812446 1
1 1456812950 2
1 1456812960 2
The session is incremented at the 4th row because its time is 504 seconds after the 3rd row, which is more than the threshold of 180 seconds.
In MySQL, I could just declare a variable and then increment it conditionally. As variable creation is not supported in BigQuery, is there an alternate way to achieve this?

SELECT
user_id, event_time, session_id
FROM (
SELECT
user_id, event_time, event_time - last_time > 180 AS new_session,
SUM(IFNULL(new_session, 1))
OVER(PARTITION BY user_id ORDER BY event_time) AS session_id
FROM (
SELECT user_id, event_time,
LAG(event_time) OVER(PARTITION BY user_id ORDER BY event_time) AS last_time
FROM YourTable
)
)
ORDER BY event_time
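The query above pairs LAG (to fetch the previous event_time) with a running SUM (to count session starts). As a rough cross-check of that logic outside BigQuery, here is a minimal Python sketch; the function name `assign_sessions` and the `gap_seconds` parameter are illustrative assumptions, not part of the original answer.

```python
def assign_sessions(events, gap_seconds=180):
    """events: list of (user_id, event_time) tuples.

    Returns (user_id, event_time, session_id) tuples, ordered by user and time.
    """
    result = []
    last_time = {}   # per-user previous event_time (plays the role of LAG)
    session_id = {}  # per-user running counter (plays the role of the running SUM)
    for user_id, event_time in sorted(events):
        prev = last_time.get(user_id)
        # The first event of a user, or a gap above the threshold, starts a new session.
        if prev is None or event_time - prev > gap_seconds:
            session_id[user_id] = session_id.get(user_id, 0) + 1
        last_time[user_id] = event_time
        result.append((user_id, event_time, session_id[user_id]))
    return result


# Sample rows taken from the question.
rows = [(1, 1456812346), (1, 1456812350), (1, 1456812446),
        (1, 1456812950), (1, 1456812960)]
sessions = assign_sessions(rows)
```

On the question's five rows the gaps are 4, 96, 504 and 10 seconds, so only the fourth row opens a new session.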

Related

How to select specific rows in a "group by" groups using conditions on multiple columns?

I have the following table with many userId (in the example only one userId for demo purpose):
For every userId I want to extract two rows:
The first row should have isTransaction = 0 and the earliest date.
The second row should have isTransaction = 1, a device different from that of the first row, and the earliest date after that of the first row.
That is, the output should be:
Time userId device isTransaction
2021-01-27 10187675 mobile 0
2021-01-30 10187675 web 1
I tried to rank rows with partitioning and ordering but it didn't work:
Select * from
(SELECT *, rank() over(partition by userId, device, isTransaction order by isTransaction, Time) as rnk
FROM table 1)
where rnk=1
order by Time
Please help! It would also be good to check that the time difference between these two rows does not exceed 30 days. Otherwise, the userId should be dropped.
You can first identify the earliest time for 0. Then enumerate the rows and take only the first one:
select t.*
from (select t.*,
             -- partition by isTransaction (the column named in the question)
             row_number() over (partition by userid, isTransaction order by time) as seqnum
      from (select t.*,
                   min(case when isTransaction = 0 then time end) over (partition by userid order by time) as time_0
            from t
           ) t
      where time >= time_0  -- >= keeps the earliest isTransaction = 0 row itself
     ) t
where seqnum = 1;
This satisfies the two conditions you enumerated.
Then, buried in the text, you want to eliminate users where the difference is greater than 30 days. That is a little trickier . . . but not too hard:
select t.*
from (select t.*,
             min(case when isTransaction = 1 then time end) over (partition by userid) as time_1,
             row_number() over (partition by userid, isTransaction order by time) as seqnum
      from (select t.*,
                   min(case when isTransaction = 0 then time end) over (partition by userid order by time) as time_0
            from t
           ) t
      where time >= time_0  -- >= keeps the earliest isTransaction = 0 row itself
     ) t
where seqnum = 1 and
      time_1 < timestamp_add(time_0, interval 30 day);
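To make the selection rules from the question concrete, here is a minimal Python sketch rather than a translation of the SQL above. It also applies the "different device" condition from the question, and treats a gap of exactly 30 days as allowed; both are my reading of the question. The function name `first_pair` is an assumption, and every sample row except the two shown in the question's expected output is hypothetical.

```python
from datetime import date, timedelta


def first_pair(rows, max_gap_days=30):
    """rows: list of (time, user_id, device, is_transaction) tuples.

    For each user, returns the earliest non-transaction row followed by the
    earliest later transaction row on a different device, provided the two
    are at most max_gap_days apart. Users without such a pair are dropped.
    """
    by_user = {}
    for row in sorted(rows):  # sorting by time first keeps each user's rows in time order
        by_user.setdefault(row[1], []).append(row)

    out = []
    for user_rows in by_user.values():
        first = next((r for r in user_rows if r[3] == 0), None)
        if first is None:
            continue
        second = next((r for r in user_rows
                       if r[3] == 1 and r[0] > first[0] and r[2] != first[2]), None)
        if second is None:
            continue
        if second[0] - first[0] <= timedelta(days=max_gap_days):
            out.extend([first, second])
    return out


rows = [
    (date(2021, 1, 27), 10187675, "mobile", 0),  # from the question's expected output
    (date(2021, 1, 28), 10187675, "mobile", 1),  # hypothetical: same device, skipped
    (date(2021, 1, 29), 10187675, "web", 0),     # hypothetical: later non-transaction
    (date(2021, 1, 30), 10187675, "web", 1),     # from the question's expected output
    (date(2021, 1, 1), 42, "web", 0),            # hypothetical user, gap over 30 days
    (date(2021, 3, 1), 42, "mobile", 1),
]
pairs = first_pair(rows)
```

With this data only user 10187675 survives, with the 2021-01-27 and 2021-01-30 rows; the hypothetical user 42 is dropped because the two rows are 59 days apart.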

Derive session duration when only timestamp is available in SQL

I want to calculate the session duration for the usage of an app. However, in the provided log, the only relevant information I can obtain is timestamp. Below is a simplified log for a single user.
record_num, user_id, record_ts
-----------------------------
1, uid_1, 12:01am
2, uid_1, 12:02am
3, uid_1, 12:03am
4, uid_1, 12:22am
5, uid_1, 12:22am
6, uid_1, 12:25am
Assuming a session is concluded after 15 minutes of inactivity, the above log would consist of 2 sessions. Now I would like to calculate the average duration of the two sessions.
I can derive the number of sessions by first calculating the time difference between consecutive records; whenever a difference exceeds 15 minutes, a new session is counted.
But to derive the duration, I would need to know the min(record_ts) and max(record_ts) for each session. Without a session_id of some sort, I cannot group the records into their sessions.
Is there any SQL based approach where I can solve this?
Assuming you have the date too (without it, you would have to check whether a session's end time falls before its start time), something like this would work:
WITH CTE AS
(SELECT * FROM
(SELECT 1 record_num, "uid_1" user_id, TIMESTAMP('2018-10-01 12:01:00') record_ts)
UNION ALL
(SELECT 2 record_num, "uid_1" user_id, TIMESTAMP('2018-10-01 12:02:00') record_ts)
UNION ALL
(SELECT 3 record_num, "uid_1" user_id, TIMESTAMP('2018-10-01 12:03:00') record_ts)
UNION ALL
(SELECT 4 record_num, "uid_1" user_id, TIMESTAMP('2018-10-01 12:22:00') record_ts)
UNION ALL
(SELECT 5 record_num, "uid_1" user_id, TIMESTAMP('2018-10-01 12:22:00') record_ts)
UNION ALL
(SELECT 6 record_num, "uid_1" user_id, TIMESTAMP('2018-10-01 12:25:00') record_ts)
UNION ALL
(SELECT 7 record_num, "uid_1" user_id, TIMESTAMP('2018-10-01 12:59:00') record_ts)),
sessions as
(SELECT
if(timestamp_diff(record_ts,lag(record_ts,1) OVER (PARTITION BY user_id ORDER BY
record_ts, record_num),MINUTE) >= 15 OR
lag(record_ts,1) OVER (PARTITION BY user_id ORDER BY record_ts, record_num) IS NULL,1,0)
session, record_num, user_id, record_ts
FROM CTE)
SELECT sum(session) OVER (PARTITION BY user_id ORDER BY record_ts, record_num)
sessionNo, record_num, user_id, record_ts
FROM sessions
The key parameter is the number of minutes of inactivity that separates sessions. In the query above I've put it at 15 minutes (>= 15). It might also be useful to concatenate the session number with the user_id and a session start time to create a unique session identifier.
I would do this in the following steps:
Use lag() and some logic to determine when a session begins.
Use cumulative sum to assign sessions.
Then aggregation to get averages.
So, to get information on each session:
select user_id, session, min(record_ts) as session_start, max(record_ts) as session_end,
       timestamp_diff(max(record_ts), min(record_ts), second) as dur_seconds
from (select l.*,
             -- running count of session starts, so it must be a window function
             countif(record_ts > timestamp_add(prev_record_ts, interval 15 minute)) over (partition by user_id order by record_ts) as session
      from (select l.*,
                   lag(record_ts, 1, record_ts) over (partition by user_id order by record_ts) as prev_record_ts
            from log l
           ) l
     ) l
group by user_id, session;
The average is one further step:
with s as (
      select user_id, session, min(record_ts) as session_start, max(record_ts) as session_end,
             timestamp_diff(max(record_ts), min(record_ts), second) as dur_seconds
      from (select l.*,
                   -- running count of session starts, so it must be a window function
                   countif(record_ts > timestamp_add(prev_record_ts, interval 15 minute)) over (partition by user_id order by record_ts) as session
            from (select l.*,
                         lag(record_ts, 1, record_ts) over (partition by user_id order by record_ts) as prev_record_ts
                  from log l
                 ) l
           ) l
      group by user_id, session
     )
select user_id, avg(dur_seconds)
from s
group by user_id;
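The three steps (find session starts, assign sessions, aggregate) can be checked by hand on the question's six records. Here is a minimal Python sketch for a single user; the function name `session_durations` is an assumption, and the calendar date is borrowed from the first answer's sample data because the question's log shows only times of day. It starts a new session when the gap is strictly greater than 15 minutes.

```python
from datetime import datetime, timedelta


def session_durations(timestamps, gap=timedelta(minutes=15)):
    """Return the duration in seconds of each session in one user's timestamps."""
    durations = []
    start = prev = None
    for ts in sorted(timestamps):
        # A gap above the threshold closes the current session and opens a new one.
        if prev is None or ts - prev > gap:
            if prev is not None:
                durations.append((prev - start).total_seconds())
            start = ts
        prev = ts
    if prev is not None:
        durations.append((prev - start).total_seconds())  # close the last session
    return durations


day = datetime(2018, 10, 1)  # assumed date; the question gives only 12:01am etc.
log = [day + timedelta(minutes=m) for m in (1, 2, 3, 22, 22, 25)]
durations = session_durations(log)
average = sum(durations) / len(durations)
```

For this log the first session runs from 12:01am to 12:03am (120 seconds) and the second from 12:22am to 12:25am (180 seconds), so the average is 150 seconds.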

Grouping data in SQL by difference in column values

I have the following data in my logs table in Postgres:
logid => int (auto increment)
start_time => bigint (stores epoch value)
inserted_value => int
The following data is stored in the table ("start time actual" is not a column; it just shows the start_time value as a UTC timestamp in 24-hour format):
logid user_id start_time inserted_value start time actual
1 1 1518416562 15 12-Feb-2018 06:22:42
2 1 1518416622 8 12-Feb-2018 06:23:42
3 1 1518417342 9 12-Feb-2018 06:35:42
4 1 1518417402 12 12-Feb-2018 06:36:42
5 1 1518417462 18 12-Feb-2018 06:37:42
6 1 1518418757 6 12-Feb-2018 06:59:17
7 1 1518418808 11 12-Feb-2018 07:00:08
I want to group and sum the values according to the difference in start_time.
For the above data, the sum should be calculated in three groups:
user_id sum
1 15 + 8
1 9 + 12 + 18
1 6 + 11
So, consecutive values in each group are at most 1 minute apart. This 1 could be any difference of x minutes.
I also tried the LAG function but could not understand it fully. I hope I have explained my question clearly.
You can use a plain group by to achieve what you want. Just map all start_time values that belong to the same minute to the same value. For example:
select user_id, start_time/60, sum(inserted_value)
from log_table
group by user_id, start_time/60
I assume your start_time column contains integers representing seconds since the epoch (the sample values have ten digits), so integer division by 60 will truncate them to minutes. If the values are floats, you should use floor(start_time/60).
If you also want to select a human-readable timestamp for the minute you're grouping by, you can add to_timestamp((start_time/60)*60) to the select list.
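Note that dividing by 60 buckets rows by calendar minute, not by the gap between consecutive rows. A quick check in Python on the question's sample data (a sketch only; the variable names are mine) suggests that each of the seven rows lands in its own bucket, so this approach fits when fixed clock-minute buckets are what you want rather than the three gap-based groups shown in the question.

```python
from collections import defaultdict

# (user_id, start_time, inserted_value) rows copied from the question.
rows = [
    (1, 1518416562, 15),
    (1, 1518416622, 8),
    (1, 1518417342, 9),
    (1, 1518417402, 12),
    (1, 1518417462, 18),
    (1, 1518418757, 6),
    (1, 1518418808, 11),
]

# Equivalent of GROUP BY user_id, start_time/60 with integer division.
buckets = defaultdict(int)
for user_id, start_time, value in rows:
    buckets[(user_id, start_time // 60)] += value

bucket_count = len(buckets)
```

For example, 06:22:42 and 06:23:42 are exactly 60 seconds apart but fall into different clock minutes, so they end up in separate buckets.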
You can use LAG to check whether the current row is more than 60 seconds after the previous row, and set group_changed (a virtual column) each time this happens.
In the next step, use a running sum over that column. This creates a group_number which you can use to group the results in the third step.
WITH cte1 AS (
SELECT
testdata.*,
CASE WHEN start_time - LAG(start_time, 1, start_time) OVER (PARTITION BY user_id ORDER BY start_time) > 60 THEN 1 ELSE 0 END AS group_changed
FROM testdata
), cte2 AS (
SELECT
cte1.*,
SUM(group_changed) OVER (PARTITION BY user_id ORDER BY start_time) AS group_number
FROM cte1
)
SELECT user_id, SUM(inserted_value)
FROM cte2
GROUP BY user_id, group_number
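The same LAG plus running-sum idea can be traced in a few lines of Python. This is a minimal sketch, not the answer's code: `sum_by_gap` and `max_gap` are illustrative names, and the rows are the sample data from the question.

```python
def sum_by_gap(rows, max_gap=60):
    """rows: list of (user_id, start_time, inserted_value) tuples.

    Returns (user_id, total) per group, where a new group starts whenever a
    row is more than max_gap seconds after the same user's previous row.
    """
    totals = []
    prev = {}  # per-user previous start_time (plays the role of LAG)
    for user_id, start_time, value in sorted(rows):
        last = prev.get(user_id)
        if last is None or start_time - last > max_gap:
            totals.append([user_id, 0])  # group_changed = 1: open a new group
        totals[-1][1] += value
        prev[user_id] = start_time
    return [tuple(t) for t in totals]


rows = [
    (1, 1518416562, 15), (1, 1518416622, 8),
    (1, 1518417342, 9), (1, 1518417402, 12), (1, 1518417462, 18),
    (1, 1518418757, 6), (1, 1518418808, 11),
]
groups = sum_by_gap(rows)
```

On the sample data this yields the three groups from the question: 15 + 8 = 23, 9 + 12 + 18 = 39 and 6 + 11 = 17.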

Active customers for each day who were active in last 30 days

I have a BQ table, user_events that looks like the following:
event_date | user_id | event_type
Data is for Millions of users, for different event dates.
I want to write a query that will give me a list of users for every day who were active in last 30 days.
The query below gives me the total unique users on only that day; I can't get it to give me the last 30 days for each date. Help is appreciated.
SELECT
user_id,
event_date
FROM
[TableA]
WHERE
1=1
AND user_id IS NOT NULL
AND event_date >= DATE_ADD(CURRENT_TIMESTAMP(), -30, 'DAY')
GROUP BY
1,
2
ORDER BY
2 DESC
Below is for BigQuery Standard SQL and makes a few assumptions about your case:
1. there is only one row per date per user
2. a user is considered active in the last 30 days if they have at least 5 entries/rows within those 30 days (this can be any number, even just 1)
If the above makes sense, see below:
#standardSQL
SELECT
user_id, event_date
FROM (
SELECT
user_id, event_date,
(COUNT(1)
OVER(PARTITION BY user_id
ORDER BY UNIX_DATE(event_date)
RANGE BETWEEN 30 PRECEDING AND 1 PRECEDING)
) >= 5 AS activity
FROM `yourTable`
)
WHERE activity
GROUP BY user_id, event_date
-- ORDER BY event_date
If assumption #1 above is not correct, you can simply add pre-grouping as a sub-select:
#standardSQL
SELECT
user_id, event_date
FROM (
SELECT
user_id, event_date,
(COUNT(1)
OVER(PARTITION BY user_id
ORDER BY UNIX_DATE(event_date)
RANGE BETWEEN 30 PRECEDING AND 1 PRECEDING)
) >= 5 AS activity
FROM (
SELECT user_id, event_date
FROM `yourTable`
GROUP BY user_id, event_date
)
)
WHERE activity
GROUP BY user_id, event_date
-- ORDER BY event_date
UPDATE
From comments: If user have any of the event_type IN ('view', 'conversion', 'productDetail', 'search') , they will be considered active. That means any kind of event triggered within the app
So, you can go with below, I think
#standardSQL
SELECT
user_id, event_date
FROM (
SELECT
user_id, event_date,
(COUNT(1)
OVER(PARTITION BY user_id
ORDER BY UNIX_DATE(event_date)
RANGE BETWEEN 30 PRECEDING AND 1 PRECEDING)
) >= 5 AS activity
FROM (
SELECT user_id, event_date
FROM `yourTable`
WHERE event_type IN ('view', 'conversion', 'productDetail', 'search')
GROUP BY user_id, event_date
)
)
WHERE activity
GROUP BY user_id, event_date
-- ORDER BY event_date
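To see what the window `RANGE BETWEEN 30 PRECEDING AND 1 PRECEDING` counts, here is a minimal Python sketch of the same rule: for each (user, date), count the user's distinct active days in the 30 days before that date, excluding the date itself, and keep the pair when the count reaches the threshold. The function name, parameters and sample events are all hypothetical.

```python
from datetime import date, timedelta


def active_days(events, window_days=30, min_days=5):
    """events: iterable of (user_id, event_date) pairs, duplicates allowed.

    Returns sorted (user_id, event_date) pairs where the user had at least
    min_days distinct active days in the window_days before event_date.
    """
    days_by_user = {}
    for user_id, event_date in events:
        days_by_user.setdefault(user_id, set()).add(event_date)  # pre-grouping step

    result = []
    for user_id, days in days_by_user.items():
        for d in days:
            window_start = d - timedelta(days=window_days)
            # Days in [d - 30, d - 1], matching 30 PRECEDING AND 1 PRECEDING.
            prior = sum(1 for other in days if window_start <= other < d)
            if prior >= min_days:
                result.append((user_id, d))
    return sorted(result)


# Hypothetical events: six consecutive days in January, then one isolated day.
events = [("u1", date(2021, 1, n)) for n in range(1, 7)]
events.append(("u1", date(2021, 2, 20)))
active = active_days(events)
```

Only 2021-01-06 qualifies here: it has five earlier active days inside its window, while 2021-02-20 has none because its window starts on 2021-01-21.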

sql - find the number of days a user was using the app

I'd like to write a SQL query that counts the number of days each user used the application and how many of those days were consecutive. A user can enter the app several times a day, but that should count as 1.
My table looks like this:
id | bigint
user_id | bigint
action_date | timestamp without time zone
To count the number of days per user:
SELECT user_id, count(DISTINCT action_date::date) AS days
FROM user_action_tbl
GROUP BY user_id;
One way to do it
SELECT user_id, COUNT(*) days_total, SUM(consecutive) days_consecutive
FROM
(
SELECT user_id,
CASE WHEN LEAD(date, 1) OVER (PARTITION BY user_id ORDER BY date) - date = 1 THEN 1 ELSE 0 END consecutive
FROM
(
SELECT user_id, action_date::date date
FROM table1
GROUP BY user_id, action_date::date
) q
) p
GROUP BY user_id
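The query deduplicates each user's timestamps to calendar days, then flags every day whose next active day is exactly one day later. Here is a minimal Python sketch of that counting rule; `day_counts` and the sample timestamps are hypothetical. As in the query, a run of n consecutive days contributes n - 1 to the consecutive count.

```python
from datetime import datetime, timedelta


def day_counts(actions):
    """actions: iterable of (user_id, action_datetime) pairs.

    Returns {user_id: (days_total, days_consecutive)}.
    """
    days_by_user = {}
    for user_id, ts in actions:
        # Several visits on one day collapse into a single date.
        days_by_user.setdefault(user_id, set()).add(ts.date())

    counts = {}
    for user_id, days in days_by_user.items():
        # A day is flagged when the following calendar day is also present,
        # which is what the LEAD(...) - date = 1 test checks.
        consecutive = sum(1 for d in days if d + timedelta(days=1) in days)
        counts[user_id] = (len(days), consecutive)
    return counts


# Hypothetical sample: two visits on March 1, one on March 2, one on March 4.
actions = [
    (1, datetime(2021, 3, 1, 9, 0)),
    (1, datetime(2021, 3, 1, 18, 0)),
    (1, datetime(2021, 3, 2, 12, 0)),
    (1, datetime(2021, 3, 4, 8, 30)),
]
counts = day_counts(actions)
```

For this sample the user has 3 distinct days, and one of them (March 1) is followed by the next calendar day.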