BigQuery SQL: filter on event sequence

I want to count, for each app_id, how many times the event_type store_app_view was followed by the event_type store_app_download for the same user ("followed" meaning the event_time_utc of the store_app_view is earlier than the event_time_utc of the store_app_download).
Sample data:
WITH `project.dataset.dummy_data_init` AS (
  SELECT event_id FROM UNNEST(GENERATE_ARRAY(1, 10000)) event_id
),
`project.dataset.dummy_data_completed` AS (
  SELECT event_id,
    user_id[OFFSET(CAST(20 * RAND() - 0.5 AS INT64))] user_id,
    app_id[OFFSET(CAST(100 * RAND() - 0.5 AS INT64))] app_id,
    event_type[OFFSET(CAST(6 * RAND() - 0.5 AS INT64))] event_type,
    event_time_utc[OFFSET(CAST(26 * RAND() - 0.5 AS INT64))] event_time_utc
  FROM `project.dataset.dummy_data_init`,
    (SELECT GENERATE_ARRAY(1, 20) user_id),
    (SELECT GENERATE_ARRAY(1, 100) app_id),
    (SELECT ['store_app_view', 'store_app_view', 'store_app_download', 'store_app_install', 'store_app_update', 'store_fetch_manifest'] event_type),
    (SELECT GENERATE_TIMESTAMP_ARRAY('2020-01-01 00:00:00', '2020-01-26 00:00:00', INTERVAL 1 DAY) AS event_time_utc)
)
SELECT * FROM `project.dataset.dummy_data_completed`
Thanks!

I want to count, for each app_id, how many times the event_type: store_app_view was followed by the event_type: store_app_download.
Your provided query seems to have almost no connection to this question, so I'll ignore it.
For each user/app pair, you can get the rows that match your conditions using GROUP BY:
select user_id, app_id
from t
group by user_id, app_id
having min(case when event_type = 'store_app_view' then event_time_utc end) <
       max(case when event_type = 'store_app_download' then event_time_utc end);
To get the total for each app, use a subquery or CTE:
select app_id, count(*)
from (select user_id, app_id
      from t
      group by user_id, app_id
      having min(case when event_type = 'store_app_view' then event_time_utc end) <
             max(case when event_type = 'store_app_download' then event_time_utc end)
     ) ua
group by app_id;
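The conditional-aggregation pattern can be checked end to end with a small, self-contained sketch. This uses SQLite via Python rather than BigQuery, and the table `t` and its rows are invented for illustration:

```python
import sqlite3

# Hypothetical sample events: (user_id, app_id, event_type, event_time_utc)
rows = [
    (1, 10, 'store_app_view',     '2020-01-01'),
    (1, 10, 'store_app_download', '2020-01-02'),  # view, then download -> counts
    (2, 10, 'store_app_download', '2020-01-01'),
    (2, 10, 'store_app_view',     '2020-01-02'),  # download before view -> no match
    (1, 20, 'store_app_view',     '2020-01-03'),  # view but never downloaded
]

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE t (user_id INT, app_id INT, event_type TEXT, event_time_utc TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)

# Same conditional aggregation as above: a user/app pair qualifies when its
# earliest view precedes its latest download (MIN/MAX skip the NULLs the CASE
# produces for other event types, and a missing download makes the HAVING false).
result = con.execute("""
    SELECT app_id, COUNT(*) AS cnt
    FROM (SELECT user_id, app_id
          FROM t
          GROUP BY user_id, app_id
          HAVING MIN(CASE WHEN event_type = 'store_app_view' THEN event_time_utc END)
               < MAX(CASE WHEN event_type = 'store_app_download' THEN event_time_utc END)
         ) ua
    GROUP BY app_id
""").fetchall()
print(result)  # [(10, 1)] -- only user 1 on app 10 qualifies
```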

Related

Double counting problem in Rolling weekly / monthly active endpoints

Here is my current code to calculate DAE, WAE, MAE:
select event_timestamp as day, Section, users as DAE,
  SUM(users) OVER (PARTITION BY Section
                   ORDER BY event_timestamp
                   ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as WAE,
  SUM(users) OVER (PARTITION BY Section
                   ORDER BY event_timestamp
                   ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) as MAE
from (
  select count(distinct user_pseudo_id) as users, Section, event_timestamp
  from (
    select distinct *
    from (
      select *,
        CASE
          WHEN param_value = "Names" or param_value = "SingleName" THEN 'Names'
          ELSE param_value
        END AS Section
      from (
        select user_pseudo_id,
          DATE_TRUNC(EXTRACT(DATE from TIMESTAMP_MICROS(event_timestamp)), DAY) as event_timestamp,
          event_name,
          params.value.string_value as param_value
        from `rayn-deen-app.analytics_317927526.events_*`, unnest(event_params) as params
        where event_name = 'screen_view' and params.key = 'firebase_screen' and (
          # Promises
          params.value.string_value = "Promises"
          # Favourites
          or params.value.string_value = "Favourites"
        )
        group by user_pseudo_id, event_timestamp, event_name, param_value
        order by event_timestamp, user_pseudo_id
      ) raw
    ) base
    order by event_timestamp
  ) as events_table
  group by Section, event_timestamp
)
The problem is that WAE and MAE repeat-count the same users. For example, user A was a "daily active user" on 4 days that week, so the WAE count treats them as 4 users instead of one. I need to remove these repeat counts somehow.
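The over-count is easy to reproduce with a toy example (made-up users and days):

```python
# User 'A' is active on 4 of the days, so summing the daily distinct
# counts over the window counts them 4 times instead of once.
daily_active = {
    'mon': {'A', 'B'},
    'tue': {'A'},
    'wed': {'A', 'C'},
    'thu': {'A'},
}

sum_of_daily = sum(len(users) for users in daily_active.values())
weekly_distinct = len(set().union(*daily_active.values()))

print(sum_of_daily, weekly_distinct)  # 6 vs 3
```

One common fix (sketched, not specific to this query) is to count each user only on their first active day inside the window, so the rolling sum adds every user exactly once.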

Error: Subquery returned more than 1 value. This is not permitted when the subquery follows

I got this kind of table
I already tried the following query, and it works fine
SELECT DISTINCT
USER_ID, SESSION_ID,
(SELECT EVENT_TIME FROM Actividades_Usuarios
WHERE EVENT_ID = 1 AND USER_ID = 407) AS START_TIME_SESSION,
(SELECT EVENT_TIME FROM Actividades_Usuarios
WHERE EVENT_ID = 2 AND USER_ID = 407) AS END_TIME_SESSION,
DATEDIFF(mi, (SELECT EVENT_TIME FROM Actividades_Usuarios
WHERE EVENT_ID = 1 AND USER_ID = 407),
(SELECT EVENT_TIME FROM Actividades_Usuarios
WHERE EVENT_ID = 2 AND USER_ID = 407)) AS TIME_SESSION
FROM
Actividades_Usuarios
WHERE
USER_ID = 407
but for a single user, for example:
Then I try to generalize it, setting EVENT_TIME as START_TIME_SESSION when EVENT_ID is 1, and as END_TIME_SESSION when EVENT_ID is 2, with the following query:
SELECT DISTINCT USER_ID,
SESSION_ID,
(SELECT EVENT_TIME FROM Actividades_Usuarios
WHERE EVENT_ID = 1) as START_TIME_SESSION,
(SELECT EVENT_TIME FROM Actividades_Usuarios
WHERE EVENT_ID = 2) as END_TIME_SESSION
FROM Actividades_Usuarios
but get the common error:
Subquery returned more than 1 value...
even though that is exactly the query result I'm looking for. In fact, if I run one of the columns as a standalone query, it returns the same number of rows no matter which one I select:
SELECT USER_ID,SESSION_ID,
EVENT_TIME as START_TIME_SESSION
FROM Actividades_Usuarios
WHERE EVENT_ID = 1
But, how can I do this adding the other column (END_TIME_SESSION) and the final goal column result (TIME_SESSION) from DATEDIFF between these two?
You need to correlate all of your subqueries, i.e. add to your WHERE clauses in your subqueries something along the lines of "user id of subquery = user id of main query".
Clarifying my answer with an example…
SELECT DISTINCT
A.USER_ID,
A.SESSION_ID,
(SELECT B.EVENT_TIME
FROM Actividades_Usuarios B
WHERE B.EVENT_ID = 1
AND B.USER_ID = A.USER_ID) as START_TIME_SESSION,
(SELECT C.EVENT_TIME
FROM Actividades_Usuarios C
WHERE C.EVENT_ID = 2
AND C.USER_ID = A.USER_ID) as END_TIME_SESSION
FROM Actividades_Usuarios A
What you really want to use is aggregation and avoid all the subqueries in the first place:
select USER_ID, SESSION_ID,
min(case when EVENT_ID = 1 then EVENT_TIME end) as START_TIME_SESSION,
min(case when EVENT_ID = 2 then EVENT_TIME end) as END_TIME_SESSION
from Actividades_Usuarios
where EVENT_ID in (1, 2)
group by USER_ID, SESSION_ID;
Using CTEs will make calculations a breeze:
with data as (
select USER_ID, SESSION_ID,
min(case when EVENT_ID = 1 then EVENT_TIME end) as START_TIME_SESSION,
min(case when EVENT_ID = 2 then EVENT_TIME end) as END_TIME_SESSION
from Actividades_Usuarios
where EVENT_ID in (1, 2)
group by USER_ID, SESSION_ID
)
select *, datediff(minute, START_TIME_SESSION, END_TIME_SESSION) as TIME_SESSION
from data;
You should also look into the way that datediff only counts interval boundaries. I suspect the result you want is more like this:
datediff(millisecond, START_TIME_SESSION, END_TIME_SESSION) / 60000 as TIME_SESSION
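That boundary-counting behavior is easy to see in a quick sketch (plain Python datetimes standing in for the SQL values; the timestamps are invented):

```python
from datetime import datetime

# Two timestamps only 2 seconds apart, but straddling a minute boundary.
start = datetime(2023, 1, 1, 10, 0, 59)
end = datetime(2023, 1, 1, 10, 1, 1)

# DATEDIFF(minute, ...) counts minute-boundary crossings, not elapsed time:
crossings = int((end.replace(second=0, microsecond=0)
                 - start.replace(second=0, microsecond=0)).total_seconds() // 60)

# The millisecond-based variant measures actual elapsed time:
elapsed_min = (end - start).total_seconds() * 1000 / 60000

print(crossings)    # 1 -- reported as a whole minute
print(elapsed_min)  # ~0.033 minutes of real elapsed time
```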

COUNT / MAX (COUNT) not working in BigQuery

I'm not very experienced with SQL, but on my own I've been able to run this code:
SELECT
event_name,
COUNT(event_name) AS count,
COUNT(event_name) / SUM(COUNT(event_name)) OVER () * 100 AS event_percent
FROM `table_1`
WHERE
event_name IN ('session_start', 'view_item', 'select_item', 'add_to_cart', 'remove_from_cart', 'begin_checkout', 'purchase' )
GROUP BY
event_name
ORDER BY
count DESC
What I'd like to achieve is the percentage of each COUNT divided by the MAX COUNT. Example: purchase / session_start (22 / 1258).
If anyone can help... I've tried some things, but none worked.
I guess a CTE would work
WITH prep AS (
SELECT
event_name,
COUNT(event_name) AS cnt,
COUNT(event_name) / SUM(COUNT(event_name)) OVER () * 100 AS event_percent
FROM `table_1`
WHERE
event_name IN ('session_start', 'view_item', 'select_item', 'add_to_cart', 'remove_from_cart', 'begin_checkout', 'purchase' )
GROUP BY
event_name
ORDER BY
cnt DESC
)
SELECT
*,
cnt / max(cnt) over() as percent_of_max
FROM
prep
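A runnable sketch of the same idea (SQLite via Python; the table contents are invented, and the window function needs SQLite 3.25+):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE table_1 (event_name TEXT)")
# Hypothetical event log: 4 session_start, 2 view_item, 1 purchase.
con.executemany("INSERT INTO table_1 VALUES (?)",
                [('session_start',)] * 4 + [('view_item',)] * 2 + [('purchase',)])

# Aggregate first in a CTE, then divide each count by the window maximum.
rows = con.execute("""
    WITH prep AS (
        SELECT event_name, COUNT(event_name) AS cnt
        FROM table_1
        GROUP BY event_name
    )
    SELECT event_name, cnt, cnt * 100.0 / MAX(cnt) OVER () AS pct_of_max
    FROM prep
    ORDER BY cnt DESC
""").fetchall()
print(rows)  # session_start is the max, so it comes out at 100%
```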

How to run SUM() OVER PARTITION BY for COUNT DISTINCT

I'm trying to get the number of distinct users for each event at a daily level while maintaining a running sum for every hour.
I'm using Athena/Presto as the query engine.
I tried the following query:
SELECT
eventname,
date(from_unixtime(time_bucket)) AS date,
(time_bucket % 86400)/3600 as hour,
count,
SUM(count) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum_count
FROM (
SELECT
eventname,
CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
COUNT(DISTINCT moengageuserid) as count
FROM clickstream.moengage
WHERE date = '2020-08-20'
AND eventname IN ('e1', 'e2', 'e3', 'e4')
GROUP BY 1,2
ORDER BY 1,2
);
But on seeing the results I realized that taking SUM of COUNT DISTINCT is not correct as it's not additive.
So, I tried the below query
SELECT
eventname,
date(from_unixtime(time_bucket)) AS date,
(time_bucket % 86400)/3600 as hour,
SUM(COUNT(DISTINCT moengageuserid)) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum
FROM (
SELECT
eventname,
CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
moengageuserid
FROM clickstream.moengage
WHERE date = '2020-08-20'
AND eventname IN ('e1', 'e2', 'e3', 'e4')
);
But this query fails with the following error:
SYNTAX_ERROR: line 5:99: ORDER BY expression '"time_bucket"' must be an aggregate expression or appear in GROUP BY clause
Count the first time a user appears for the running distinct count:
SELECT eventname, date(from_unixtime(time_bucket)) AS date,
       (time_bucket % 86400)/3600 AS hour,
       COUNT(DISTINCT moengageuserid) AS hour_count,
       SUM(SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END)) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY time_bucket) AS running_distinct_count
FROM (SELECT eventname,
             CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
             moengageuserid,
             ROW_NUMBER() OVER (PARTITION BY eventname, moengageuserid ORDER BY eventtimestamp) AS seqnum
      FROM clickstream.moengage
      WHERE date = '2020-08-20' AND
            eventname IN ('e1', 'e2', 'e3', 'e4')
     ) m
GROUP BY 1, 2, 3
ORDER BY 1, 2;
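The first-appearance trick can be verified with a small SQLite sketch (invented hours and user IDs; the per-hour "new user" flags are aggregated in an inner query, then running-summed outside):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE ev (hour INT, user_id TEXT)")
# Invented clickstream: user 'a' shows up in hours 0, 1 and 2, so summing
# hourly COUNT(DISTINCT) values would count them three times.
con.executemany("INSERT INTO ev VALUES (?, ?)",
                [(0, 'a'), (0, 'b'), (1, 'a'), (1, 'c'), (2, 'a')])

# Flag each user's first appearance with ROW_NUMBER, sum the flags per hour,
# then take a running sum of those per-hour "new user" counts.
rows = con.execute("""
    SELECT hour, hour_count,
           SUM(new_users) OVER (ORDER BY hour) AS running_distinct
    FROM (SELECT hour,
                 COUNT(DISTINCT user_id) AS hour_count,
                 SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END) AS new_users
          FROM (SELECT hour, user_id,
                       ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY hour) AS seqnum
                FROM ev)
          GROUP BY hour)
    ORDER BY hour
""").fetchall()
print(rows)  # [(0, 2, 2), (1, 2, 3), (2, 1, 3)] -- hour 1 adds only 'c'
```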
To calculate a running distinct count you can also collect the user IDs into a set (distinct array) and take its size:
cardinality(set_agg(moengageuserid)) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum
This is an analytic function: with the ORDER BY it yields a running set size per row, and you can take max() in an outer subquery to get a single value per (eventname, date) partition.

Hive transformation

I am trying to make a simple hive transformation.
Can someone provide me a way to do this? I have tried collect_set and am currently looking at Klout's open source UDFs.
I think this gives you what you want. I wasn't able to run it and debug it though. Good luck!
select start_points.unit
     , start_time as start
     , start_time + min(stop_time - start_time) as stop
from
  (select * from
    (select date_time as start_time
          , unit
          , last_value(unit) over (order by date_time desc rows between current row and 1 following) as previous_unit
     from table
    ) previous
   where unit <> previous_unit
  ) start_points
left outer join
  (select * from
    (select date_time as stop_time
          , unit
          , last_value(unit) over (order by date_time rows between current row and 1 following) as next_unit
     from table
    ) next
   where unit <> next_unit
  ) stop_points
on start_points.unit = stop_points.unit
where stop_time > start_time
group by start_points.unit, start_time
;
What about using the min and max functions? I think the following will get you what you need:
SELECT
Unit,
MIN(datetime) as start,
MAX(datetime) as stop
from table_name
group by Unit
;
I found it. Thanks for the pointer to use window functions.
select *
from
(select *,
case when lag(unit,1) over (partition by id order by effective_time_ut desc) is NULL THEN 1
when unit<>lag(unit,1) over (partition by id order by effective_time_ut desc) then 1
when lead(unit,1) over (partition by id order by effective_time_ut desc) is NULL then 1
else 0 end as different_loc
from units_we_care) a
where different_loc=1
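A minimal check of the lag/lead trick (SQLite via Python; the table name and columns mirror the query above, but the data is invented, and the WINDOW clause needs SQLite 3.25+):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE units_we_care (id INT, unit TEXT, effective_time_ut INT)")
# Invented history for one id: unit A at times 1-3, then unit B at times 4-5.
con.executemany("INSERT INTO units_we_care VALUES (?, ?, ?)",
                [(1, 'A', 1), (1, 'A', 2), (1, 'A', 3), (1, 'B', 4), (1, 'B', 5)])

# Scanning newest-to-oldest, the CASE flags the newest row, the oldest row,
# and the last row (in time) of each unit run -- the boundary rows.
rows = con.execute("""
    SELECT id, unit, effective_time_ut
    FROM (SELECT *,
                 CASE WHEN LAG(unit, 1) OVER w IS NULL THEN 1
                      WHEN unit <> LAG(unit, 1) OVER w THEN 1
                      WHEN LEAD(unit, 1) OVER w IS NULL THEN 1
                      ELSE 0 END AS different_loc
          FROM units_we_care
          WINDOW w AS (PARTITION BY id ORDER BY effective_time_ut DESC)) a
    WHERE different_loc = 1
    ORDER BY effective_time_ut
""").fetchall()
print(rows)  # [(1, 'A', 1), (1, 'A', 3), (1, 'B', 5)]
```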
create table temptable as
select unit, start_date, end_time, row_number() over () as row_num
from (
  select unit, min(date_time) as start_date, max(date_time) as end_time
  from table
  group by unit
) a;

select a.unit, a.start_date as start_date,
       nvl(b.start_date, a.end_time) as end_time
from temptable a
left outer join temptable b on (a.row_num + 1) = b.row_num;