How to calculate weekly retention in BigQuery using SQL

I have the following table with the week number and the retention rate.
|creation_week |num_engaged_users |num_users_in_cohort |retention_rate|
|:------------:|:-----------------:|:------------------:|:------------:|
|37| 3114 |4604 |67.637|
|38| 1860 |4604 |40.4|
|39| 1233 |4604 |26.781|
|40| 668 |4604 |14.509|
|41| 450 |4604 |9.774|
|42| 463| 4604|10.056|
What I need is to make it look something like this
|week |week0 |week1 |week2|week3|week4|week5|week6|
|:---:|:----:|:----:|:---:|:---:|:---:|:---:|:---:|
|week37|100|ret.rate|ret.rate|ret.rate|ret.rate|ret.rate|ret.rate|
|week38|100|ret.rate|ret.rate|ret.rate|ret.rate|ret.rate|
|week39|100|ret.rate|ret.rate|ret.rate|ret.rate|
|week40|100|ret.rate|ret.rate|ret.rate|
|week41|100|ret.rate|ret.rate|
|week42|100|ret.rate|
how can I do that using BigQuery SQL?
Below is the SQL I used to produce the table above:

WITH
  new_user_cohort AS (
    WITH
      # table with cookie_ids and user_ids for the later matching
      table_1 AS (
        SELECT DISTINCT
          props.value.string_value AS cookie_id,
          user_id
        FROM `stockduel.analytics.events`,
          UNNEST(event_properties) AS props
        WHERE props.key = 'cookie_id'
          AND user_id > 0),
      # second table, which gives access to the sample of users who performed the event
      table_2 AS (
        SELECT DISTINCT
          props.value.string_value AS cookie_id,
          EXTRACT(WEEK FROM creation_date) AS first_week
        FROM `stockduel.analytics.events`,
          UNNEST(event_properties) AS props
        WHERE props.key = 'cookie_id'
          AND event_type = 'launch_first_time'
          # set the date from which the cohort analysis starts
          AND EXTRACT(WEEK FROM creation_date) = EXTRACT(WEEK FROM DATE '2021-09-15'))
    # join user_id with cookie_id; group the results to remove duplicates
    SELECT
      user_id,
      first_week
    FROM table_2
    JOIN table_1
      ON table_1.cookie_id = table_2.cookie_id
    GROUP BY user_id, first_week),
  num_new_users AS (
    SELECT
      COUNT(*) AS num_users_in_cohort,
      first_week
    FROM new_user_cohort
    GROUP BY first_week),
  engaged_users_by_day AS (
    SELECT
      COUNT(DISTINCT s.user_id) AS num_engaged_users,
      EXTRACT(WEEK FROM started_at) AS creation_week
    FROM `stockduel.analytics.ws_raw_sessions_v2` AS s
    JOIN new_user_cohort
      ON new_user_cohort.user_id = s.user_id
    WHERE EXTRACT(WEEK FROM started_at)
      BETWEEN EXTRACT(WEEK FROM DATE '2021-09-15')
          AND EXTRACT(WEEK FROM DATE '2021-10-22')
    GROUP BY creation_week)
SELECT
  creation_week,
  num_engaged_users,
  num_users_in_cohort,
  ROUND(100 * num_engaged_users / num_users_in_cohort, 3) AS retention_rate
FROM engaged_users_by_day
CROSS JOIN num_new_users
ORDER BY creation_week
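
To then pivot that result into the triangle shown above, one option is conditional aggregation on the week offset. A minimal, untested sketch, assuming the query above is generalized so that its final CTE (called `retention` here, an illustrative name) keeps the cohort week (`first_week`) next to `creation_week` and `retention_rate` for each cohort:

SELECT
  CONCAT('week', CAST(first_week AS STRING)) AS week,
  # offset 0 is the cohort week itself, i.e. 100
  MAX(IF(creation_week - first_week = 0, retention_rate, NULL)) AS week0,
  MAX(IF(creation_week - first_week = 1, retention_rate, NULL)) AS week1,
  MAX(IF(creation_week - first_week = 2, retention_rate, NULL)) AS week2,
  MAX(IF(creation_week - first_week = 3, retention_rate, NULL)) AS week3,
  MAX(IF(creation_week - first_week = 4, retention_rate, NULL)) AS week4,
  MAX(IF(creation_week - first_week = 5, retention_rate, NULL)) AS week5,
  MAX(IF(creation_week - first_week = 6, retention_rate, NULL)) AS week6
FROM retention
GROUP BY first_week
ORDER BY first_week

BigQuery also has a PIVOT operator that can replace the MAX(IF(...)) lines, but the week offset still has to be computed first either way.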

Related

ETL query needs some changes to get it right

Hello guys, I have a query which works, but when I remove 2 filters (the 2 WHERE clauses at the end) it doesn't work as expected, yet they still have to be removed from the query.
I have accounts 1000001, 1000002, 1000003, 1000004 and 1000005,
but I only get account 1000005. I'm pretty sure it's about the window MAX function.
I want to get the values for all of the accounts.
SELECT a12.month_id,
a12.populate_id AS account_id,
LAST_VALUE(current_bal IGNORE NULLS) OVER
(PARTITION BY Populate_id ORDER BY date_id ASC ROWS
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS avg_dly_bal
FROM (SELECT TO_CHAR(date_id, 'YYYYMM') AS month_id,
date_id,
account_id AS "account_id",
MAX(account_id) OVER (PARTITION by TO_CHAR(date_id, 'YYYYMM')) as populate_id,
current_bal
FROM (SELECT t.date_id, ad.account_id, ad.current_bal
FROM timedate t
FULL OUTER JOIN (SELECT src_extract_dt, account_id, current_bal
FROM account_dly
WHERE account_id = 1000001) ad
on t.date_id = ad.src_extract_dt
WHERE TO_CHAR(date_id, 'YYYYMM') = '201908'
order by t.date_id)) a12;
https://i.stack.imgur.com/xphVh.png
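
One direction to try (an untested sketch, since the expected output isn't fully specified): build the date spine per account with a CROSS JOIN and partition the running LAST_VALUE by account, instead of collapsing everything to one populate_id with the window MAX. The two filters then simply disappear:

SELECT TO_CHAR(d.date_id, 'YYYYMM') AS month_id,
       acc.account_id,
       LAST_VALUE(ad.current_bal IGNORE NULLS) OVER
         (PARTITION BY acc.account_id ORDER BY d.date_id ASC
          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS avg_dly_bal
FROM timedate d
CROSS JOIN (SELECT DISTINCT account_id FROM account_dly) acc  -- one date spine per account
LEFT JOIN account_dly ad
       ON ad.src_extract_dt = d.date_id
      AND ad.account_id = acc.account_id;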

SQL Server LEAD function

-- FIRST LOGIN DATE
WITH CTE_FIRST_LOGIN AS
(
SELECT
PLAYER_ID, EVENT_DATE,
ROW_NUMBER() OVER (PARTITION BY PLAYER_ID ORDER BY EVENT_DATE ASC) AS RN
FROM
ACTIVITY
),
-- CONSECUTIVE LOGINS
CTE_CONSEC_PLAYERS AS
(
SELECT
PLAYER_ID,
LEAD(EVENT_DATE,1) OVER (PARTITION BY EVENT_DATE ORDER BY EVENT_DATE) NEXT_DATE
FROM
ACTIVITY A
JOIN
CTE_FIRST_LOGIN C ON A.PLAYER_ID = C.PLAYER_ID
WHERE
NEXT_DATE = DATEADD(DAY, 1, A.EVENT_DATE) AND C.RN = 1
GROUP BY
A.PLAYER_ID
)
-- FRACTION
SELECT
NULLIF(ROUND(1.00 * COUNT(CTE_CONSEC.PLAYER_ID) / COUNT(DISTINCT PLAYER_ID), 2), 0) AS FRACTION
FROM
ACTIVITY
JOIN
CTE_CONSEC_PLAYERS CTE_CONSEC ON CTE_CONSEC.PLAYER_ID = ACTIVITY.PLAYER_ID
I am getting the following error when I run this query.
[42S22] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Invalid column name 'NEXT_DATE'. (207) (SQLExecDirectW)
This is LeetCode medium question 550, Game Play Analysis IV. I wanted to know why it can't identify the column NEXT_DATE here and what I am missing. Thanks!
The problem is in this CTE:
-- CONSECUTIVE LOGINS prep
CTE_CONSEC_PLAYERS AS (
SELECT
PLAYER_ID,
LEAD(EVENT_DATE,1) OVER (PARTITION BY EVENT_DATE ORDER BY EVENT_DATE) NEXT_DATE
FROM ACTIVITY A
JOIN CTE_FIRST_LOGIN C ON A.PLAYER_ID = C.PLAYER_ID
WHERE NEXT_DATE = DATEADD(DAY, 1, A.EVENT_DATE) AND C.RN = 1
GROUP BY A.PLAYER_ID
)
Note that you are creating NEXT_DATE as a column alias in this CTE but also referring to it in the WHERE clause. This is invalid: by SQL's clause-evaluation order, a column alias defined in the SELECT list does not exist until the ORDER BY clause, which is the last clause evaluated in a query or subquery. You don't have an ORDER BY clause in this subquery, so the NEXT_DATE alias is only visible to [sub]queries that come after and reference your CTE_CONSEC_PLAYERS CTE.
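The rule is easy to see with a stripped-down example: a SELECT-level alias is not visible in the same level's WHERE clause, but it becomes visible once pushed down a level:

-- fails: 'NEXT_DAY' does not exist yet when WHERE is evaluated
SELECT EVENT_DATE, DATEADD(DAY, 1, EVENT_DATE) AS NEXT_DAY
FROM ACTIVITY
WHERE NEXT_DAY > '2021-01-01';

-- works: the alias is defined one query level down
SELECT *
FROM (SELECT EVENT_DATE, DATEADD(DAY, 1, EVENT_DATE) AS NEXT_DAY
      FROM ACTIVITY) T
WHERE T.NEXT_DAY > '2021-01-01';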
To fix this you'd probably want two CTEs like this (untested):
-- CONSECUTIVE LOGINS prep
CTE_CONSEC_PLAYERS_pre AS (
SELECT
A.PLAYER_ID,
C.RN,
A.EVENT_DATE,
LEAD(A.EVENT_DATE, 1) OVER (PARTITION BY A.PLAYER_ID ORDER BY A.EVENT_DATE) NEXT_DATE
FROM ACTIVITY A
JOIN CTE_FIRST_LOGIN C ON A.PLAYER_ID = C.PLAYER_ID
),
-- CONSECUTIVE LOGINS
CTE_CONSEC_PLAYERS AS (
SELECT
PLAYER_ID,
MAX(NEXT_DATE) AS NEXT_DATE
FROM CTE_CONSEC_PLAYERS_pre
WHERE NEXT_DATE = DATEADD(DAY, 1, EVENT_DATE) AND RN = 1
GROUP BY PLAYER_ID
)
You gave every table an alias (for example, JOIN CTE_FIRST_LOGIN C assigns the alias C), and every column should be accessed via its alias. You need to add the correct alias from the correct table to NEXT_DATE as well.
Your primary issue is that NEXT_DATE is the result of a window function, and therefore cannot be referred to in the WHERE clause because of SQL's order of operations.
But it seems this query is over-complicated.
The problem to be solved appears to be: how many players logged in the day after they first logged in, as a percentage of all players.
This can be done in a single pass (no joins), by using multiple window functions together:
WITH CTE_FIRST_LOGIN AS (
SELECT
PLAYER_ID,
EVENT_DATE,
ROW_NUMBER() OVER (PARTITION BY PLAYER_ID ORDER BY EVENT_DATE) AS RN,
-- if EVENT_DATE is a datetime and can have multiple per day then group by CAST(EVENT_DATE AS date) first
LEAD(EVENT_DATE, 1) OVER (PARTITION BY PLAYER_ID ORDER BY EVENT_DATE) AS NextDate
FROM ACTIVITY
),
BY_PLAYERS AS (
SELECT
c.PLAYER_ID,
SUM(CASE WHEN c.RN = 1 AND c.NextDate = DATEADD(DAY, 1, c.EVENT_DATE)
THEN 1 END) AS IsConsecutive
FROM CTE_FIRST_LOGIN AS c
GROUP BY c.PLAYER_ID
)
SELECT ROUND(
1.00 *
COUNT(c.IsConsecutive) /
NULLIF(COUNT(*), 0)
,2) AS FRACTION
FROM BY_PLAYERS AS c;
You could theoretically merge BY_PLAYERS into the outer query and use COUNT(DISTINCT ...), but splitting them feels cleaner.
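For reference, that merged form would look something like this (untested sketch):

SELECT ROUND(
    1.00 *
    COUNT(DISTINCT CASE WHEN c.RN = 1
                         AND c.NextDate = DATEADD(DAY, 1, c.EVENT_DATE)
                        THEN c.PLAYER_ID END) /
    NULLIF(COUNT(DISTINCT c.PLAYER_ID), 0)
    , 2) AS FRACTION
FROM CTE_FIRST_LOGIN AS c;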

Group by in columns and rows, counts and percentages per day

I have a table that has data like following.
attr |time
----------------|--------------------------
abc |2018-08-06 10:17:25.282546
def |2018-08-06 10:17:25.325676
pqr |2018-08-05 10:17:25.366823
abc |2018-08-06 10:17:25.407941
def |2018-08-05 10:17:25.449249
I want to group them and count by attr column row wise and also create additional columns in to show their counts per day and percentages as shown below.
attr |day1_count| day1_%| day2_count| day2_%
----------------|----------|-------|-----------|-------
abc |2 |66.6% | 0 | 0.0%
def |1 |33.3% | 1 | 50.0%
pqr |0 |0.0% | 1 | 50.0%
I'm able to display one count by using GROUP BY, but I can't figure out how to separate the counts into multiple columns. I tried to generate the day-1 percentage with
SELECT attr, count(attr), count(attr) / sum(sub.day1_count) * 100 as percentage from (
SELECT attr, count(*) as day1_count FROM my_table WHERE DATEPART(week, time) = DATEPART(day, GETDate()) GROUP BY attr) as sub
GROUP BY attr;
But this is not giving me the correct answer either: I'm getting all zeroes for the percentage and 1 for the count. Any help is appreciated. I'm trying to do this in Redshift, which follows PostgreSQL syntax.
Let's nail the logic before presenting:
with CTE1 as
(
select attr, DATEPART(day, time) as theday, count(*) as thecount
from MyTable
group by attr, DATEPART(day, time)
)
, CTE2 as
(
select theday, sum(thecount) as daytotal
from CTE1
group by theday
)
select t1.attr, t1.theday, t1.thecount,
       1.0 * t1.thecount / t2.daytotal as percentofday -- 1.0 * avoids integer division
from CTE1 t1
inner join CTE2 t2
on t1.theday = t2.theday
From here you can pivot to create a day by day if you feel the need
I am trying to enhance @johnHC's query. By the way, if you need all 7 days, you have to add those days in the CASE WHEN expressions:
with CTE1 as
(
select attr, time::date as theday, count(*) as thecount
from t group by attr,time::date
)
, CTE2 as
(
select theday, sum(thecount) as daytotal
from CTE1
group by theday
)
,
CTE3 as
(
select t1.attr, EXTRACT(DOW FROM t1.theday) as day_nmbr, t1.theday, t1.thecount, 1.0 * t1.thecount / t2.daytotal as percentofday
from CTE1 t1
inner join CTE2 t2
on t1.theday = t2.theday
)
select CTE3.attr,
max(case when day_nmbr=0 then CTE3.thecount end) as day1Cnt,
max(case when day_nmbr=0 then percentofday end) as day1,
max(case when day_nmbr=1 then CTE3.thecount end) as day2Cnt,
max( case when day_nmbr=1 then percentofday end) day2
from CTE3 group by CTE3.attr
http://sqlfiddle.com/#!17/54ace/20
In case you have only two days:
http://sqlfiddle.com/#!17/3bdad/3 (days descending as in your example from left to right)
http://sqlfiddle.com/#!17/3bdad/5 (days ascending)
The main idea is already mentioned in the other answers. But instead of joining the CTEs to calculate the values, I am using window functions, which is a bit shorter and, I think, more readable. The pivot is done the same way.
SELECT
attr,
COALESCE(max(count) FILTER (WHERE day_number = 0), 0) as day1_count, -- D
COALESCE(max(percent) FILTER (WHERE day_number = 0), 0) as day1_percent,
COALESCE(max(count) FILTER (WHERE day_number = 1), 0) as day2_count,
COALESCE(max(percent) FILTER (WHERE day_number = 1), 0) as day2_percent
/*
Add more days here
*/
FROM(
SELECT *, (count::float/count_per_day)::decimal(5, 2) as percent -- C
FROM (
SELECT DISTINCT
attr,
MAX(time::date) OVER () - time::date as day_number, -- B
count(*) OVER (partition by time::date, attr) as count, -- A
count(*) OVER (partition by time::date) as count_per_day
FROM test_table
)s
)s
GROUP BY attr
ORDER BY attr
A: counting the rows per day, and per day AND attr
B: for more readability I convert the dates into numbers. Here I take the difference between the current row's date and the maximum date available in the table. This gives a counter from 0 (the most recent day) up to n - 1 (the oldest day)
C: calculating the percentage and rounding
D: pivot by filtering on the day numbers. The COALESCE avoids NULL values by switching them to 0. To add more days, replicate these columns.
Edit: Made the day counter more flexible for more days; new SQL Fiddle
Basically, I see this as conditional aggregation. But you need to get an enumerator for the dates to do the pivoting. So:
SELECT attr,
COUNT(*) FILTER (WHERE day_number = 1) as day1_count,
COUNT(*) FILTER (WHERE day_number = 1) / cnt as day1_percent,
COUNT(*) FILTER (WHERE day_number = 2) as day2_count,
COUNT(*) FILTER (WHERE day_number = 2) / cnt as day2_percent
FROM (SELECT attr,
DENSE_RANK() OVER (ORDER BY time::date DESC) as day_number,
1.0 * COUNT(*) OVER (PARTITION BY attr) as cnt
FROM test_table
) s
GROUP BY attr, cnt
ORDER BY attr;
Here is a SQL Fiddle.

SQL - values from two rows into new two rows

I have a query that gives a sum of the quantity of items on working days. On weekends and holidays the quantity and item values are empty.
I would like empty days to show the last known quantity and item.
My query is like this:
select a.dt, b.zaliha as quantity, b.artikal as item
from
(select to_date('01-01-2017', 'DD-MM-YYYY') + rownum -1 dt
from dual
connect by level <= to_date(sysdate) - to_date('01-01-2017', 'DD-MM-YYYY') + 1
order by 1)a
LEFT OUTER JOIN
(select kolicina,sum(kolicina)over(partition by artikal order by datum_do) as zaliha,datum_do,artikal
from
(select sum(vv.kolicinaulaz-vv.kolicinaizlaz)kolicina,vz.datum as datum_do,vv.artikal
from vlpzaglavlja vz, vlpvarijante vv
where vz.id=vv.vlpzaglavlje
and vz.orgjed='01006'
and vv.skladiste='01006'
and vv.artikal in (3069,6402)
group by vz.datum,vv.artikal
order by vv.artikal,vz.datum asc)
order by artikal,datum_do asc)b
on a.dt=b.datum_do
where a.dt between to_date('12102017','ddmmyyyy') and to_date('16102017','ddmmyyyy')
order by a.dt
My current output and the output I want were attached as screenshots (not reproduced here).
In short, if quantity is null use lag(... ignore nulls) and coalesce or nvl:
select dt, item,
nvl(quantity, lag(quantity ignore nulls) over (partition by item order by dt))
from t
order by dt, item
Here is the full query. I cannot test it, but it is something like this:
with t as (
select a.dt, b.zaliha as quantity, b.artikal as item
from (
select date '2017-10-10' + rownum - 1 dt
from dual
connect by date '2017-10-10' + rownum - 1 <= date '2017-10-16' ) a
left join (
select kolicina, datum_do, artikal,
sum(kolicina) over(partition by artikal order by datum_do) as zaliha
from (
select sum(vv.kolicinaulaz-vv.kolicinaizlaz) kolicina,
vz.datum as datum_do, vv.artikal
from vlpzaglavlja vz
join vlpvarijante vv on vz.id = vv.vlpzaglavlje
where vz.orgjed = '01006' and vv.skladiste='01006'
and vv.artikal in (3069,6402)
group by vz.datum, vv.artikal)) b
on a.dt = b.datum_do)
select *
from (
select dt, item,
nvl(quantity, lag(quantity ignore nulls)
over (partition by item order by dt)) qty
from t)
where dt >= date '2017-10-12'
order by dt, item
There are several issues in your query, major and minor:
- in the date generator (subquery a) you select dates from a long period (January 2017 onwards), then join with the main tables and aggregate, and only at the end select a small part. Why not filter the dates first?
- to_date(sysdate): sysdate is already a date,
- use ANSI joins,
- do not use order by in subqueries; it has no effect there, only the outermost ordering matters,
- use date literals when defining dates; they are more readable.

Filling in missing dates DB2 SQL

My initial query looks like this:
select process_date, count(*) batchCount
from T1.log_comments
group by process_date
order by process_date asc;
I need to be able to do some quick analysis for weekends that are missing, but wanted to know if there was a quick way to fill in the missing dates not present in process_date.
I've seen the solution here but am curious if there's any magic hidden in db2 that could do this with only a minor modification to my original query.
Note: originally not tested, framed based on my exposure to SQL Server/Oracle; now amended and tested on DB2. I guess this gives you the idea:
WITH MaxDateQry(MaxDate) AS
(
SELECT MAX(process_date) FROM T1.log_comments
),
MinDateQry(MinDate) AS
(
SELECT MIN(process_date) FROM T1.log_comments
),
DatesData(ProcessDate) AS
(
SELECT MinDate from MinDateQry
UNION ALL
SELECT (ProcessDate + 1 DAY) FROM DatesData WHERE ProcessDate < (SELECT MaxDate FROM MaxDateQry)
)
SELECT a.ProcessDate, b.batchCount
FROM DatesData a LEFT JOIN
(
SELECT process_date, COUNT(*) batchCount
FROM T1.log_comments
GROUP BY process_date
) b
ON a.ProcessDate = b.process_date
ORDER BY a.ProcessDate ASC;
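
If you want the filled-in dates to show 0 instead of NULL, wrap the joined count in COALESCE in the outer SELECT (only that line changes):

SELECT a.ProcessDate, COALESCE(b.batchCount, 0) AS batchCount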