Calculate 7 Day Retention with SQL

Given the following tables,
users
+--------------+-----------+
| user_id      | varchar   |
| reg_ts       | timestamp |
| reg_device   | varchar   |
| mktg_channel | varchar   |
+--------------+-----------+

page_views
+----------+-----------+
| pv_id    | varchar   |
| pv_ts    | timestamp |
| user_id  | varchar   |   <-- joins to users.user_id
| url      | varchar   |
| device   | varchar   |
+----------+-----------+
Table "users" has one row per registered user.
Table "page_views" has one row per page view event.
What % of users who first visit on a given day came back again 1 week later?
I'm currently using SQLite and created a sample database, but my output is off.
Below is what I have so far:
-- day 1 active users
SELECT *
FROM page_views
LEFT JOIN page_views AS future_page_views
       ON page_views.user_id = future_page_views.user_id
      AND page_views.pv_ts = future_page_views.pv_ts - datetime(future_page_views.pv_ts, '+7 day')
-- day 7 retained users
SELECT
    future_page_views.pv_ts,
    COUNT(DISTINCT page_views.user_id) as active_users,
    COUNT(DISTINCT future_page_views.user_id) as retained_users,
    CAST(COUNT(DISTINCT future_page_views.user_id) / COUNT(DISTINCT page_views.user_id) AS float) retention
FROM page_views
LEFT JOIN page_views AS future_page_views
       ON page_views.user_id = future_page_views.user_id
      AND page_views.pv_ts = future_page_views.pv_ts - datetime(page_views.pv_ts, '+7 day')
GROUP BY 1
Not sure if I should use the strftime function (as a DATEDIFF substitute) in this instance to capture the 7 days. Open to any suggestions and feedback, thanks in advance.
EDIT: Sample data below. Based on that data set, I expect only user_id 8 to show up as 7-day retained (first day 2020-01-02, last day 2020-01-09).
Desired Output:
User_ID
p.pv_ts as First_Day
f.pv_ts as Last_Day
Retention Days (i.e. 1, 2, 3, 4, 5 days...)
% of users who visited and came back on day 7

You can look at just the first two page visits and then aggregate. This gives the first visit and, if there is one, the second visit per user:
select user_id, min(pv_ts) as first_ts,
       nullif(max(pv_ts), min(pv_ts)) as second_ts
from (select pv.*,
             row_number() over (partition by user_id order by pv_ts) as seqnum
      from page_views pv
     ) pv
where seqnum <= 2
group by user_id;
Then to get the totals:
select count(*),
       sum(case when second_ts < datetime(first_ts, '+7 day') then 1 else 0 end)
from (select user_id, min(pv_ts) as first_ts,
             nullif(max(pv_ts), min(pv_ts)) as second_ts
      from (select pv.*,
                   row_number() over (partition by user_id order by pv_ts) as seqnum
            from page_views pv
           ) pv
      where seqnum <= 2
      group by user_id
     ) u;
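If you also want the percentage broken down by each user's first-visit day (as in your desired output), here is a minimal sketch along the same lines. It assumes SQLite and interprets "came back 1 week later" as any page view on or after the 7th day, so user 8 (2020-01-02 to 2020-01-09) counts as retained:
-- sketch: retention % per first-visit day (assumes the page_views table above)
with firsts as (
    select user_id, min(pv_ts) as first_ts
    from page_views
    group by user_id
),
flags as (
    select f.user_id,
           date(f.first_ts) as first_day,
           -- 1 if the user has any page view on or after day 7, else 0
           max(case when pv.pv_ts >= datetime(f.first_ts, '+7 days') then 1 else 0 end) as retained
    from firsts f
    left join page_views pv on pv.user_id = f.user_id
    group by f.user_id, date(f.first_ts)
)
select first_day,
       count(*) as cohort_users,
       sum(retained) as retained_users,
       round(100.0 * sum(retained) / count(*), 1) as retention_pct
from flags
group by first_day;
For the "Retention Days" column, julianday(last_ts) - julianday(first_ts) is the usual SQLite substitute for DATEDIFF.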

Related

How to combine Cross Join and String Agg in Bigquery with date time difference

I am trying to go from the following table
| user_id | touch      | Date       | Purchase Amount
| 1       | Impression | 2020-09-12 | 0
| 1       | Impression | 2020-10-12 | 0
| 1       | Purchase   | 2020-10-13 | 125$
| 1       | Email      | 2020-10-14 | 0
| 1       | Impression | 2020-10-15 | 0
| 1       | Purchase   | 2020-10-30 | 122
| 2       | Impression | 2020-10-15 | 0
| 2       | Impression | 2020-10-16 | 0
| 2       | Email      | 2020-10-17 | 0
to
| user_id | path                             | Number of days between First Touch and Purchase   | Purchase Amount
| 1       | Impression, Impression, Purchase | 2020-10-13 (Purchase) - 2020-09-12 (Impression)   | 125$
| 1       | Email, Impression, Purchase      | 2020-10-30 (Purchase) - 2020-10-14 (Email)        | 122$
| 2       | Impression, Impression, Email    | 2020-12-31 (fixed date) - 2020-10-15 (Impression) | 0$
In essence, I am trying to create a new row for each unique user every time a 'Purchase' is encountered, with the preceding touches as a comma-separated string.
I also want the difference in days between the first touch and the first purchase for each unique user; when a new row is created, we do the same for that user again, as shown in the example above.
From the little I have gathered, I need a mixture of CROSS JOIN and STRING_AGG, but I tried using a CASE statement inside STRING_AGG and was not able to get the required result.
Is there a better way to do this in SQL (BigQuery)?
Thank you
Below is for BigQuery Standard SQL
#standardSQL
select user_id,
       string_agg(touch order by date) path,
       date_diff(max(date), min(date), day) days,
       sum(amount) amount
from (
    select user_id, touch, date, amount,
           countif(touch = 'Purchase') over win grp
    from `project.dataset.table`
    window win as (partition by user_id order by date rows between unbounded preceding and 1 preceding)
)
group by user_id, grp
Applied to the sample data from your question, this produces one row per user per purchase group.
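To see how the grouping works (a walkthrough derived from the query and the sample rows above, not from the original screenshot): for user_id 1 the window counts purchases on strictly earlier rows, so grp stays 0 through the first Purchase and becomes 1 afterwards:
| touch      | date       | grp
| Impression | 2020-09-12 | 0
| Impression | 2020-10-12 | 0
| Purchase   | 2020-10-13 | 0
| Email      | 2020-10-14 | 1
| Impression | 2020-10-15 | 1
| Purchase   | 2020-10-30 | 1
Grouping by user_id, grp then yields the two rows for user 1 in the desired output (31 and 16 days, amounts 125 and 122), plus one row for user 2.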
Another change: in case there is no Purchase among the touches, we calculate the number of days up to a fixed date we have set. How can I add this to the query above?
select user_id,
       string_agg(touch order by date) path,
       date_diff(if(countif(touch = 'Purchase') = 0, '2020-12-31', max(date)), min(date), day) days,
       sum(amount) amount
from (
    select user_id, touch, date, amount,
           countif(touch = 'Purchase') over win grp
    from `project.dataset.table`
    window win as (partition by user_id order by date rows between unbounded preceding and 1 preceding)
)
group by user_id, grp
with the output now using the fixed date for groups that contain no purchase.
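A small caveat of my own, not from the original answer: if the date column is typed as DATE, the string literal should coerce, but an explicit DATE literal removes any ambiguity:
date_diff(if(countif(touch = 'Purchase') = 0, date '2020-12-31', max(date)), min(date), day) days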
This means you need a solution which splits a user's rows into a new group whenever a Purchase is encountered.
Use the following query:
select user_id,
       -- any aggregation functions according to your requirement,
       sum(purchase_amount)
from (select t.*,
             sum(case when touch = 'Purchase' then 1 else 0 end) over (partition by user_id order by date) as sm
      from t) t
group by user_id, sm
We could approach this as a gaps-and-island problem, where every island ends with a purchase. How do we define the groups? By counting how many purchases we have ahead (current row included) - so with a descending sort in the query.
select user_id, string_agg(touch order by date),
       min(date) as first_date, max(date) as max_date,
       date_diff(max(date), min(date), day) as cnt_days
from (
    select t.*,
           countif(touch = 'Purchase') over(partition by user_id order by date desc) as grp
    from mytable t
) t
group by user_id, grp
You can create a value for each row that corresponds to the number of instances where table.touch = 'Purchase', which can then be used to group on:
with r as (select row_number() over(order by t1.user_id) rid, t1.* from table t1)
select t3.user_id, group_concat(t3.touch), sum(t3.amount), date_diff(max(t3.date), min(t3.date))
from (select (select sum(r1.touch = 'Purchase' AND r1.rid < r2.rid) from r r1) c1,
             r2.*
      from r r2
     ) t3
group by t3.c1;

Select most popular hour per country based on number of sales

I want to get the most popular hour for each country, based on the max value of count(id), which tells how many purchases were made.
I've tried getting the max value of purchases and converting the timestamp into hours, but it always returns every hour for each country, when I want only a single hour (the one with the most purchases) per country.
The table is like:
id | country | time
1  | AE      | 19:20:00.00000
1  | AE      | 20:13:00.00000
3  | GB      | 23:17:00.00000
4  | IN      | 10:23:00.00000
6  | IN      | 02:01:00.00000
7  | RU      | 05:54:00.00000
2  | RU      | 16:34:00.00000
SELECT max(purchases), country, tss
FROM (
    SELECT time_trunc(time, hour) AS tss,
           count(id) as purchases,
           country
    FROM spending
    WHERE dt > date_sub(current_date(), interval 30 DAY)
    GROUP BY tss, country
)
GROUP BY tss, country
Expected output:
amount of purchases | Country | Most popular Hour
34                  | GB      | 16:00
445                 | US      | 21:00
You can use window functions along with GROUP BY. Notice that this uses the RANK function, so if a particular country has the same number of sales at, for example, 11 AM and 2 PM, it will return both hours for that country.
WITH cte AS (
    SELECT country
         , time_trunc(time, hour) AS hourofday
         , COUNT(id) AS purchases
         , RANK() OVER(PARTITION BY country ORDER BY COUNT(id) DESC) AS rnk
    FROM t
    GROUP BY country, time_trunc(time, hour)
)
SELECT *
FROM cte
WHERE rnk = 1
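If you want exactly one hour per country even when hours tie, a minimal variation (a sketch assuming the same table t) swaps RANK() for ROW_NUMBER(), which keeps a single, arbitrarily chosen row per country:
WITH cte AS (
    SELECT country
         , time_trunc(time, hour) AS hourofday
         , COUNT(id) AS purchases
         , ROW_NUMBER() OVER(PARTITION BY country ORDER BY COUNT(id) DESC) AS rn
    FROM t
    GROUP BY country, time_trunc(time, hour)
)
SELECT purchases, country, hourofday
FROM cte
WHERE rn = 1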

Get sum over last entries per day per article

Let's say there is a table structured like this:
ID | article_id | article_count | created_at
---|------------|---------------|-----------------------------
1  | 1          | 10            | 2019-03-20T18:20:03.685059Z
2  | 1          | 22            | 2019-03-20T19:20:03.685059Z
3  | 2          | 32            | 2019-03-20T18:20:03.685059Z
4  | 2          | 20            | 2019-03-20T19:20:03.685059Z
5  | 1          | 3             | 2019-03-21T18:20:03.685059Z
6  | 1          | 15            | 2019-03-21T19:20:03.685059Z
7  | 2          | 3             | 2019-03-21T18:20:03.685059Z
8  | 2          | 30            | 2019-03-21T19:20:03.685059Z
The goal is to sum the article_count over all article_ids, taking only the last entry per day for each article, and return this total count per day. So in the case above I'd like to get a result showing:
total | date
------|-----------
42    | 2019-03-20
45    | 2019-03-21
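(For the sample above: 42 = 22 + 20, the last counts for articles 1 and 2 on 2019-03-20, and 45 = 15 + 30 on 2019-03-21.)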
So far, I tried something like:
SELECT SUM(article_count), DATE_TRUNC('day', created_at)
FROM myTable
WHERE created_at IN
(
    SELECT DISTINCT ON (a.created_at::date, article_id::int) created_at
    FROM myTable a
    ORDER BY created_at::date DESC, article_id, created_at DESC
)
GROUP BY DATE_TRUNC('day', created_at)
In the DISTINCT query I tried to pull only the latest entries per day per article_id and then match on created_at to sum up the article_count values.
This does not work: it still outputs the sum of the whole day instead of summing only the last entries.
Besides that, I am quite sure there is a more elegant way than the WHERE condition.
Thanks in advance (as well for any explanation).
I think you just want to filter down to the last entry per day for each article:
SELECT DATE_TRUNC('day', created_at), SUM(article_count)
FROM (SELECT DISTINCT ON (a.created_at::date, article_id::int) a.*
      FROM myTable a
      ORDER BY article_id, created_at::date DESC, created_at DESC
     ) a
GROUP BY DATE_TRUNC('day', created_at);
You are looking for a ranking function:
WITH cte AS (
    SELECT article_id,
           article_count,
           date_trunc('day', created_at) AS some_date,
           row_number() OVER (PARTITION BY article_id, date_trunc('day', created_at)
                              ORDER BY created_at DESC) AS n
    FROM mytable
)
SELECT sum(article_count) AS total,
       some_date
FROM cte
WHERE n = 1
GROUP BY some_date
This adds up only the latest entry (n = 1) of each day per article.
Check it at https://rextester.com/INODNS67085

Getting proper count for longest user streaks

I'm having a difficult time getting the correct counts for longest user streaks. Streaks are consecutive days with check-ins for each user.
Any help would be greatly appreciated. Here's a fiddle with my script and sample data: http://sqlfiddle.com/#!17/d2825/1/0
check_ins table:
user_id | goal_id  | check_in_date
--------|----------|---------------------
colt    | 40365fa0 | 2019-01-07 15:35:53
colt    | d31efe70 | 2019-01-11 15:35:52
berry   | be2fcd50 | 2019-01-12 15:35:51
colt    | e754d050 | 2019-01-13 15:17:16
colt    | 9c87a7f0 | 2019-01-14 15:35:54
colt    | ucgtdes0 | 2019-01-15 12:30:59
PostgreSQL script:
WITH dates(date) AS (
    SELECT DISTINCT CAST(check_in_date AS DATE),
                    user_id
    FROM check_ins
),
groups AS (
    SELECT ROW_NUMBER() OVER (ORDER BY date) AS rn,
           date - (ROW_NUMBER() OVER (ORDER BY date) * interval '1' day) AS grp,
           date,
           user_id
    FROM dates
)
SELECT COUNT(*) AS streak,
       user_id
FROM groups
GROUP BY grp,
         user_id
ORDER BY 1 DESC;
Here's what I get when I run the code above:
streak | user_id
-------|--------
4      | colt
1      | colt
1      | berry
Here's what it should be; I'd also like to get only the longest streak for each user:
streak | user_id
-------|--------
3      | colt
1      | berry
In Postgres, you can write this as:
select distinct on (user_id) user_id, count(distinct check_in_date::date) as num_days
from (select ci.*,
             dense_rank() over (partition by user_id order by check_in_date::date) as seq
      from check_ins ci
     ) ci
group by user_id, check_in_date::date - seq * interval '1 day'
order by user_id, num_days desc;
Here is a db<>fiddle.
This follows similar logic to your approach, but your query seems more complicated than necessary. This does use the Postgres distinct on functionality, which is handy to avoid an additional subquery.
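To see where colt's streak of 3 comes from with the sample data: his distinct check-in days are 2019-01-07, 01-11, 01-13, 01-14 and 01-15, so dense_rank() assigns seq = 1 through 5, and check_in_date::date - seq * interval '1 day' evaluates to 01-06, 01-09, 01-10, 01-10, 01-10. The three consecutive days 01-13 to 01-15 share the constant 01-10, so that group counts 3 days, and distinct on (user_id) ordered by num_days desc keeps it.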
Firstly, thanks for the fiddle script and sample data.
You are not using the right row_number logic to implement the gaps-and-islands problem; it should look like the query below for your data set. On top of that, to get the row with the highest streak, you need DISTINCT ON after grouping by the group number (grp in your query; I called it seq).
I assume you only want to count distinct days per user, and I have reflected that with slight changes in the WITH clause.
SELECT * FROM (
    WITH check_ins_dt AS (
        SELECT DISTINCT check_in_date::date AS check_in_date,
                        user_id
        FROM check_ins
    )
    SELECT DISTINCT ON (user_id) COUNT(*) AS streak, user_id
    FROM (
        SELECT c.*,
               ROW_NUMBER() OVER (ORDER BY check_in_date)
             - ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY check_in_date) AS seq
        FROM check_ins_dt c
    ) s
    GROUP BY user_id, seq
    ORDER BY user_id, COUNT(*) DESC
) q
ORDER BY streak DESC;
Demo

Getting percentage value of records using the max count of records in SQL Server

I'm a newbie SQL learner and I have an issue I'd like your help with. I've got a table User_Activity_Log that contains the names of students with their ID (user_id) and their date of attendance in the year (user_timestamp), stored in the format "February 25, 2015".
Say the User_Activity_Log table contains
| user_id | user_timestamp |
| jude    | February 22    |
| jude    | February 24    |
| annie   | February 1     |
| sam     | January        |
I'd like to know how to get a table showing the user ID, the number of times a student is seen in the month, and a percentage count calculated against the max(count) across students.
Here's what I've done so far; it gives me an error.
USE FinalYearProject
declare #maxval int
select #maxval = (SELECT MAX(fromsubq.SM) as PA
                  FROM (SELECT COUNT(user_Id) as SM
                        FROM dbo.User_Activity_Log
                        WHERE user_Timestamp LIKE 'February%'
                        GROPU BY User_Id) fromsubq
                 )
(SELECT COUNT
 FROM dbo.User_Activity_Log
 WHERE user_Timestamp like 'February%'
 GROUP BY user_Id) * 100.0 / #maxval
Expected output should be
| User_id | Count | PercentageCount |
| Jude    | 2     | 100 %           |
| annie   | 1     | 50 %            |
| sam     | 0     | 0 %             |
Please help me point out the problem and possible solutions
Thanks in advance.
You can do this by using conditional aggregation in a subquery/cte and adding OVER() to an aggregate:
;with cte AS (SELECT User_ID
                    ,SUM(CASE WHEN user_timestamp LIKE 'February%' THEN 1 ELSE 0 END) as CT
              FROM User_Activity_Log
              GROUP BY User_ID
             )
SELECT User_ID
      ,CT
      ,CT*100.0 / MAX(CT) OVER() AS PercentageCount
FROM cte
ORDER BY CT DESC
Demo: SQL Fiddle
Note: It's bad practice to store dates as strings, if you can avoid it at all you should.
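With the sample rows above, the cte yields CT = 2 for jude, 1 for annie and 0 for sam; MAX(CT) OVER() is then 2, so the percentages come out as 100, 50 and 0, matching the expected output.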
Edit: Here's how it would be done with a subquery instead of cte:
SELECT User_ID
      ,CT
      ,CT*100.0 / MAX(CT) OVER() AS PercentageCount
FROM (SELECT User_ID
            ,SUM(CASE WHEN user_timestamp LIKE 'February%' THEN 1 ELSE 0 END) as CT
      FROM User_Activity_Log
      GROUP BY User_ID
     ) AS Sub
ORDER BY CT DESC
UPDATE: To use the PercentageCount in a CASE expression, something like:
;with cte AS (SELECT User_ID
                    ,SUM(CASE WHEN user_timestamp LIKE 'February%' THEN 1 ELSE 0 END) as CT
              FROM User_Activity_Log
              GROUP BY User_ID
             )
,cte2 AS (SELECT User_ID
                ,CT
                ,CT*100.0 / MAX(CT) OVER() AS PercentageCount
          FROM cte
         )
SELECT *, CASE WHEN PercentageCount > .5 THEN 'Qualified' ELSE 'NotQualified' END AS Qualified
FROM cte2
ORDER BY CT DESC
First find the count per user_id in a sub-select, then compute the percentage in the outer query.
Use MAX(...) OVER() to find the maximum count, then divide each count by that max to get the percentage. Try this:
SELECT user_Id,
       Cnt AS [Count],
       Cnt * 100.0 / MAX(Cnt) OVER() AS PercentageCount
FROM (SELECT COUNT(user_Id) AS Cnt,
             user_Id
      FROM dbo.User_Activity_Log
      WHERE user_Timestamp LIKE 'February%'
      GROUP BY User_Id) A
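One difference from the first answer (my own note, not part of the original): because the February filter sits inside the subquery, a user like sam with no February rows drops out entirely instead of showing 0 %. One possible tweak, assuming the same table, is to move the filter into a conditional sum:
SELECT user_Id,
       Cnt AS [Count],
       Cnt * 100.0 / MAX(Cnt) OVER() AS PercentageCount
FROM (SELECT user_Id,
             SUM(CASE WHEN user_Timestamp LIKE 'February%' THEN 1 ELSE 0 END) AS Cnt
      FROM dbo.User_Activity_Log
      GROUP BY user_Id) A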