Counting subscriber numbers given events in SQL

I have a dataset in MySQL in the following format, showing the history of events for some client IDs:
Base Data
Text of the dataset (subscriber_table):
user_id  type        created_at
A        past_due    2021-03-27 10:15:56
A        reactivate  2021-02-06 10:21:35
A        past_due    2021-01-27 10:30:41
A        new         2020-10-28 18:53:07
A        cancel      2020-07-22 9:48:54
A        reactivate  2020-07-22 9:48:53
A        cancel      2020-07-15 2:53:05
A        new         2020-06-20 20:24:18
B        reactivate  2020-06-14 10:57:50
B        past_due    2020-06-14 10:33:21
B        new         2020-06-11 10:21:24
date_table:
full_date
2020-05-01
2020-06-01
2020-07-01
2020-08-01
2020-09-01
2020-10-01
2020-11-01
2020-12-01
2021-01-01
2021-02-01
2021-03-01
I have been struggling to come up with a query to count subscribers over a range of months. Those months are not necessarily present in the event table, either because the client is still subscribed or because they cancelled and later resubscribed. The output I am looking for is this:
Output
date subscriber_count
2020-05-01 0
2020-06-01 2
2020-07-01 2
2020-08-01 1
2020-09-01 1
2020-10-01 2
2020-11-01 2
2020-12-01 2
2021-01-01 2
2021-02-01 2
2021-03-01 2
Reactivation and past_due events do not change the subscription status of the client; only the cancel and new events do. If a client cancels in a month, they should still be counted as active for that month. For example, user A is new on 2020-06-20 and cancels on 2020-07-22, so A counts as active for June and July; only B is active in August and September; A becomes active again with the new event on 2020-10-28.
My initial approach was to get the latest entry per month per subscriber ID and then join it to the premade date table, but when months are missing I am unsure how to fill them with the correct status. Maybe a lag function?
with last_record_per_month as (
    select
        date_trunc('month', created_at)::date as month_year,
        user_id,
        type,
        created_at,
        row_number() over (
            partition by user_id, date_trunc('month', created_at)
            order by created_at desc
        ) as num
    from
        subscriber_table
    where
        user_id in ('A', 'B')
), final as (
    select
        dd.month_year,
        lrpm.created_at,
        lrpm.type
    from
        last_record_per_month lrpm
    right join (
        select
            date_trunc('month', full_date)::date as month_year
        from
            date_table
        where
            full_date between '2020-05-01' and '2021-03-31'
        group by
            1
    ) dd
        on lrpm.month_year = dd.month_year
        and lrpm.num = 1
    order by
        dd.month_year
)
select
    *
from
    final
I do have a premade base table with every single date in many years to use as a joining table
Any help with this is GREATLY appreciated
Thanks!

The approach here is to take the subscriber rows with new events as the base and map them to the cancel rows using a self join. Then take the date table as the base and aggregate by the number of users to get the result.
SELECT full_date, COUNT(DISTINCT user_id) AS subscriber_count
FROM date_table
LEFT JOIN (
    SELECT new.user_id, new.type, new.created_at AS created_at_new,
           IFNULL(cancel.created_at, CURRENT_DATE) AS created_at_cancel
    FROM subscriber_table new
    LEFT JOIN subscriber_table cancel
        ON new.user_id = cancel.user_id
        AND new.type = 'new' AND cancel.type = 'cancel'
        AND new.created_at <= cancel.created_at
    WHERE new.type IN ('new')
) s
    ON DATE_FORMAT(s.created_at_new, '%Y-%m') <= DATE_FORMAT(full_date, '%Y-%m')
    AND DATE_FORMAT(s.created_at_cancel, '%Y-%m') >= DATE_FORMAT(full_date, '%Y-%m')
GROUP BY 1
Let me break down some sections.
First, we self-join the subscriber table on user_id, with the left side restricted to rows of type 'new' and the right side to rows of type 'cancel': new.type='new' AND cancel.type='cancel'
A new event should always precede its matching cancel event, hence the condition new.created_at <= cancel.created_at
Since we only care about the 'new' rows in the base table, we keep only those in the WHERE clause: new.type IN ('new'). Given the sample data, the result of the subquery would look something like this:
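user_id  type  created_at_new       created_at_cancel
A        new   2020-06-20 20:24:18  2020-07-15 2:53:05
A        new   2020-06-20 20:24:18  2020-07-22 9:48:54
A        new   2020-10-28 18:53:07  CURRENT_DATE
B        new   2020-06-11 10:21:24  CURRENT_DATE
(A's first new event matches both of A's cancel events; the new events with no later cancel get CURRENT_DATE from the IFNULL, so those subscriptions are treated as still open.)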
We can then LEFT JOIN the date table to this subquery, such that the year-month of created_at_new is less than or equal to that of full_date, DATE_FORMAT(s.created_at_new, '%Y-%m') <= DATE_FORMAT(full_date, '%Y-%m'), and the year-month of created_at_cancel is greater than or equal to that of full_date. For example, for full_date 2020-08-01 only B's row qualifies (A cancelled in July and does not re-subscribe until October), so the count is 1.
Lastly, we aggregate on full_date and take the distinct count of users.

Related

how to aggregate one record multiple times based on condition

I have a bunch of records in the table below.
product_id produced_date expired_date
123 2010-02-01 2012-05-31
234 2013-03-01 2014-08-04
345 2012-05-01 2018-02-25
... ... ...
I want the output to display how many unexpired products we currently have at the monthly level. (Say, if a product expires on August 04, we still count it in the August stock.)
Month n_products
2010-02-01 10
2010-03-01 12
...
2022-07-01 25
2022-08-01 15
How should I do this in Presto or Hive? Thank you!
You can use the SQL below.
Here we use CASE WHEN to check whether a product is expired (produced_date >= expired_date); if it is, we sum it to get the count of products that have expired, and then group that data by expiry month.
select
TRUNC(expired_date, 'MM') expired_month,
SUM( case when produced_date >= expired_date then 1 else 0 end) n_products
from mytable
group by 1
We can use the unnest and sequence functions to create a derived table; joining our table with this derived table should give us the desired result.
select m.month, count(t.product_id) as n_products
from (
    select month
    from (
        select sequence(min(date_trunc('month', produced_date)),
                        max(date_trunc('month', expired_date)),
                        interval '1' month) as months
        from mytable
    ) seq
    cross join unnest(months) as u(month)
) m
left join mytable t
    on m.month >= date_trunc('month', t.produced_date)
   and m.month <= date_trunc('month', t.expired_date)
group by 1
order by 1
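As a quick sanity check of the derived-table building blocks, here is an illustrative, untested sketch of what sequence and unnest produce in Presto:
select x as month_start
from unnest(sequence(date '2010-02-01', date '2010-05-01', interval '1' month)) as t(x)
-- month_start: 2010-02-01, 2010-03-01, 2010-04-01, 2010-05-01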

redshift cumulative count records via SQL

I've been struggling to find an answer to this question; I found a similar question, but when I tried its approach it didn't work.
Because no new unique user_id is added between 02-20 and 02-27, the cumulative count stays the same. Then on 02-27 a user_id appears (6) which hasn't appeared on any previous date.
Here's my input
date user_id
2020-02-20 1
2020-02-20 2
2020-02-20 3
2020-02-20 4
2020-02-20 4
2020-02-20 5
2020-02-21 1
2020-02-22 2
2020-02-23 3
2020-02-24 4
2020-02-25 4
2020-02-27 6
Output table:
date daily_cumulative_count
2020-02-20 5
2020-02-21 5
2020-02-22 5
2020-02-23 5
2020-02-24 5
2020-02-25 5
2020-02-27 6
This is what I tried, and the result is not quite what I want:
select
stat_date,count(DISTINCT user_id),
sum(count(DISTINCT user_id)) over (order by stat_date rows unbounded preceding) as cumulative_signups
from data_engineer_interview
group by stat_date
order by stat_date
It returns this instead:
date,count,cumulative_sum
2020-02-20,5,5
2020-02-21,1,6
2020-02-22,1,7
2020-02-23,1,8
2020-02-24,1,9
2020-02-25,1,10
2020-02-27,1,11
The problem with this task is that it invites comparing each row with all previous rows to see if there is a match on user_id. Since you are using Redshift, I'll assume that your data table could be very large, so attacking the problem this way will bog down in some form of loop join.
You want to think about the problem differently to avoid this looping issue. If you derive a dataset with each id and the first date on which it appears, you can then just do a cumulative sum sorted by date, like this:
select user_id, min("date") as first_date,
count(user_id) over (order by first_date rows unbounded preceding) as date_out
from data_engineer_interview
group by user_id
order by date_out;
This is untested and won't produce the full list of dates from your example output, only the dates where new ids show up. If that is an issue, it is simple to add the additional dates with no count change.
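For instance, here is a sketch (also untested) of filling in the remaining dates, using the distinct dates already present in the table as the calendar:
with firsts as (
    select user_id, min("date") as first_date
    from data_engineer_interview
    group by user_id
),
daily as (
    select d."date", count(f.user_id) as new_ids
    from (select distinct "date" from data_engineer_interview) d
    left join firsts f on f.first_date = d."date"
    group by d."date"
)
select "date",
       sum(new_ids) over (order by "date" rows unbounded preceding) as daily_cumulative_count
from daily
order by "date";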
We can do this via a correlated subquery that flags each user_id's first appearance, followed by aggregation and a running sum:
WITH cte AS (
    SELECT
        date,
        CASE WHEN EXISTS (
            SELECT 1
            FROM data_engineer_interview d2
            WHERE d2.date < d1.date AND
                  d2.user_id = d1.user_id
        ) THEN 0 ELSE 1 END AS flag
    FROM (SELECT DISTINCT date, user_id FROM data_engineer_interview) d1
)
SELECT date,
       SUM(SUM(flag)) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING) AS daily_cumulative_count
FROM cte
GROUP BY date
ORDER BY date;
With the sample input, the flags sum to 5 on 2020-02-20, 0 on each day through 2020-02-25, and 1 on 2020-02-27, so the running sum reproduces the desired output.

Count distinct customers who bought in previous period and not in next period in BigQuery

I have a dataset in BigQuery which contains order_date (DATE) and customer_id.
order_date | CustomerID
2019-01-01 | 111
2019-02-01 | 112
2020-01-01 | 111
2020-02-01 | 113
2021-01-01 | 115
2021-02-01 | 119
I am trying to count distinct customer_ids between each month of the previous year and the same month of the current year, for example from 2019-01-01 to 2020-01-01, then from 2019-02-01 to 2020-02-01, and then to count those who did not buy in the same period of the next year (2020-01-01 to 2021-01-01, then 2020-02-01 to 2021-02-01).
The output I expect:
order_date | count distinct CustomerID | did not buy in the next period
2020-01-01| 5191 |250
2020-02-01| 4859 |500
2020-03-01| 3567 |349
..........| .... |......
and the next periods shouldn't include the previous.
I tried the code below, but it works differently from what I intend:
with customers as (
select distinct date_trunc(date(order_date),month) as dates,
CUSTOMER_WID
from t
where date(order_date) between '2018-01-01' and current_date()-1
)
select
dates,
customers_previous,
customers_next_period
from
(
select dates,
count(CUSTOMER_WID) as customers_previous,
count(case when customer_wid_next is null then 1 end) as customers_next_period
from (
select prev.dates,
prev.CUSTOMER_WID,
next.dates as next_dates,
next.CUSTOMER_WID as customer_wid_next
from customers as prev
left join customers
as next on next.dates=date_add(prev.dates,interval 1 year)
and prev.CUSTOMER_WID=next.CUSTOMER_WID
) as t2
group by dates
)
order by 1,2
Thanks in advance.
If I understand correctly, you are trying to count values over a window of time, and for that I recommend using window functions (see the BigQuery docs on analytic functions for how they work).
That said, my recommendation would be:
SELECT DISTINCT
    periods,
    COUNT(DISTINCT customer_id) OVER last_12mos AS count_customers_last_12_mos
FROM (
    SELECT
        order_date,
        FORMAT_DATE('%Y%m', order_date) AS periods,
        customer_id
    FROM dataset
)
WINDOW last_12mos AS ( # window of the last 12 months, excluding the current month
    ORDER BY periods
    ROWS BETWEEN 12 PRECEDING AND 1 PRECEDING
)
I believe from this you can build some customizations to improve the aggregations you want.
You can generate the periods using unnest(generate_date_array()). Then use joins to bring in the customers from the previous 12 months and the next 12 months. Finally, aggregate and count the customers:
select period,
       count(distinct c_prev.customer_wid) as customers_previous,
       count(distinct c_next.customer_wid) as customers_next_period
from unnest(generate_date_array(date '2020-01-01', date '2021-01-01', interval 1 month)) period join
     customers c_prev
     on c_prev.order_date <= period and
        c_prev.order_date > date_add(period, interval -12 month) left join
     customers c_next
     on c_next.customer_wid = c_prev.customer_wid and
        c_next.order_date > period and
        c_next.order_date <= date_add(period, interval 12 month)
group by period;
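Note that count(distinct c_next.customer_wid) counts the customers who did come back within the next 12 months. If the goal is the customers who did not buy in the next period, one possible tweak (an untested sketch) is to count the unmatched rows of the left join instead:
select period,
       count(distinct c_prev.customer_wid) as customers_previous,
       count(distinct case when c_next.customer_wid is null
                           then c_prev.customer_wid end) as did_not_buy_next_period
from unnest(generate_date_array(date '2020-01-01', date '2021-01-01', interval 1 month)) period join
     customers c_prev
     on c_prev.order_date <= period and
        c_prev.order_date > date_add(period, interval -12 month) left join
     customers c_next
     on c_next.customer_wid = c_prev.customer_wid and
        c_next.order_date > period and
        c_next.order_date <= date_add(period, interval 12 month)
group by period;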

How to obtain information from 10 dates without using 10+ left joins

I have some information as shown in the simplified table below.
login_date | userid
-------------------------
2020-12-01 | 123
2020-12-01 | 456
2020-12-02 | 123
2020-12-02 | 456
2020-12-02 | 789
2020-12-03 | 123
2020-12-03 | 789
The range of dates found in login_date span from 2020-12-01 to 2020-12-12 and the userid for each day is unique.
What I wish to obtain comes in 2 folds:
The number of users who first logged in on a certain date, excluding users who logged in on preceding day(s).
For users who first logged in on a certain date (e.g. 2020-12-01), how many of them logged in on subsequent days as well? (i.e. of the batch who first logged in on 2020-12-01, how many were found to log in on 2020-12-02, 2020-12-03.. and so on)
For the above table, an example of the desired result may be as follows:
                    |            | 2020-12-01 | 2020-12-02 | 2020-12-03 | ... (users' first login date)
--------------------------------------------------------------------------------------------
users who continued | 2020-12-01 |     2      |     x      |     x      |
to log in on these  | 2020-12-02 |     2      |     1      |     x      |
dates               | 2020-12-03 |     1      |     1      |     0      |
                    | ...        |            |            |            |
Reasoning:
On the first day, two new users logged in, 123 and 456.
On the second day, the same old users, 123 and 456, logged in as well. In addition, a new user (logging in for the first time), 789, was added.
On the third day, only one of the original users, 123, logged in (count of 1). The new user from the second day, 789, logged in as well (count of 1).
My attempt
I actually managed to obtain a (rough) solution in two parts. For the first day, 2020-12-01, I simply filtered users who logged in on the first day and performed left joins for all the remaining dates:
select count(d1.userid) as d1_users, count(d2.userid) as d2_users, ... (repeated for all joined tables)
from table1 d1
left join (
select userid
from table1
where login_date = date('2020-12-02')
) d2
on d1.userid = d2.userid
... -- (10 more left joins, with each filtering by an incremented date value)
where d1.login_date = date('2020-12-01')
For dates following the second day onwards, I did a bit of preprocessing to exclude users who had logged in on preceding day(s):
with d2_users as (
select userid
from table1 a
left join (
select userid
from table1
where login_date = date('2020-12-01')
) b
on a.userid = b.userid
where b.userid is null -- filtering out users who logged in on preceding day(s)
and a.login_date = date('2020-12-02')
)
select count(d2.userid) as d2_users, ... -- (repeated for all joined tables)
from d2_users d2
left join (
select userid
from table1
where login_date = date('2020-12-03')
) d3
on d2.userid = d3.userid
... -- (similar to the query for the 2020-12-01)
Writing and executing this query took a lot of manual editing (deleting unnecessary left joins and counts for the later dates), and the entire query for just two days takes up 300+ lines of SQL. I am not sure whether there is a more efficient approach.
Any advice would be greatly appreciated! I would be happy to provide further clarification if needed as well since the optimization of the solution to this problem has been bugging me for some time.
I apologize for the poor formatting of the desired result; I currently only have a representation of it in a spreadsheet and not an idea of how it may look as SQL output.
Edit:
I realized I may not have communicated the ideal outcomes properly. For each min_login_date identified, what I wish to obtain is the number of users who continue to log in from a preceding date. An example would be:
10 users log in on 2020-12-01. Hence, the count for 2020-12-01 = 10.
Of the 10 previous users, 8 users log in on 2020-12-02. Hence the count for 2020-12-02 = 8.
Of the 8 users (from the previous day), 6 users log in on 2020-12-03. Hence the count for 2020-12-03 = 6.
As such for each min_login_date, the user count for subsequent dates should be <= that of the user count for previous dates. Hope this helps! I apologize for any miscommunication.
You can use window functions to get the earliest date and then aggregate:
select min_login_date, count(*) as num_on_day,
       sum(case when login_date = '2020-12-01' then 1 else 0 end) as login_20201201,
       sum(case when login_date = '2020-12-02' then 1 else 0 end) as login_20201202,
       . . .
from (select t.*,
             min(login_date) over (partition by userid) as min_login_date
      from t
     ) t
group by min_login_date
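For the sample table above, this should produce something like the following (num_on_day counts login rows rather than distinct users; the login_* columns match the desired matrix, transposed):
min_login_date | num_on_day | login_20201201 | login_20201202 | login_20201203
2020-12-01     | 5          | 2              | 2              | 1
2020-12-02     | 2          | 0              | 1              | 1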
I think you need a tweak using an analytic function and an aggregate function, as follows:
select login_date,
       count(case when min_login_date = '2020-12-01' then 1 end) as login_20201201,
       count(case when min_login_date = '2020-12-02' then 1 end) as login_20201202,
       ......
from (select t.*,
             min(login_date) over (partition by userid) as min_login_date,
             lag(login_date) over (partition by userid order by login_date) as lag_login_date
      from your_table t
      where t.login_date between '2020-12-01' and '2020-12-12'
     ) t
where (lag_login_date = login_date - interval '1 day' or lag_login_date is null)
group by login_date

SQL Find Last 30 Days records count grouped by

I am trying to retrieve the daily count of customers per status over a dynamic window: the last 30 days.
For each day, the result of the query should show how many customers there are per customer status (A, B, C) over the last 30 days (i.e. today() - 29 days). Every customer has one status at a time but can change from one status to another over the customer lifetime. The purpose of this query is to show customer 'movement' across their lifetime. I've generated a series of dates ranging from the first date a customer was created until today.
I've put together the following query, but something I'm doing appears to be incorrect, because the results show most days as having the same count across all statuses, which is not possible since new customers are created every day. We checked with another, simpler query and confirmed that the split between statuses is not equal.
Below I depict the data and the SQL I use to reach the desired result.
Starting point (example table customer_statuses):
customer_id | status | created_at
---------------------------------------------------
abcdefg1234 B 2019-08-22
abcdefg1234 C 2019-01-17
...
abcdefg1234 A 2018-01-18
bcdefgh2232 A 2017-09-02
ghijklm4950 B 2018-06-06
statuses - A,B,C
There is no sequential order for the statuses, a customer can have any status at the start of the business relationship and switch between statuses throughout their lifetime.
table customers:
id | f_name | country | created_at |
---------------------------------------------------------------------
abcdefg1234 Michael FR 2018-01-18
bcdefgh2232 Sandy DE 2017-09-02
....
ghijklm4950 Daniel NL 2018-06-06
SQL - current version:
WITH customer_list AS (
SELECT
DISTINCT a.id,
a.created_at
FROM
customers a
),
dates AS (
SELECT
generate_series(
MIN(DATE_TRUNC('day', created_at)::DATE),
MAX(DATE_TRUNC('day', now())::DATE),
'1d'
)::date AS day
FROM customers a
),
customer_statuses AS (
SELECT
customer_id,
status,
created_at,
ROW_NUMBER() OVER
(
PARTITION BY customer_id
ORDER BY created_at DESC
) col
FROM
customer_status
)
SELECT
day,
(
SELECT
COUNT(DISTINCT id) AS accounts
FROM customers
WHERE created_at::date BETWEEN day - 29 AND day
),
status
FROM dates d
LEFT JOIN customer_list cus
ON d.day = cus.created_at
LEFT JOIN customer_statuses cs
ON cus.id = cs.customer_id
WHERE
cs.col = 1
GROUP BY 1,3
ORDER BY 1 DESC,3 ASC
Currently what the results from the query look like:
day | count | status
-------------------------
2020-01-24 1230 C
2020-01-24 1230 B
2020-01-24 1230 A
2020-01-23 1200 C
2020-01-23 1200 B
2020-01-23 1200 A
2020-01-22 1150 C
2020-01-22 1150 B
...
2017-01-01 50 C
2017-01-01 50 B
2017-01-01 50 A
Two things I've noticed from the results above: most of the time the results show the same count across all statuses on a given day. The second observation: there are days where only two statuses appear, which should not be the case. If no new accounts with a certain status are created on a given day, the count of the previous day should be carried over, right? Or is this a problem with the query I created, or with the logic I have in mind?
Perhaps I'm expecting a result that will not happen logically?
Required result:
day | count | status
-------------------------
2020-01-24 1230 C
2020-01-24 1000 B
2020-01-24 2500 A
2020-01-23 1200 C
2020-01-23 1050 B
2020-01-23 2450 A
2020-01-22 1160 C
2020-01-22 1020 B
2020-01-22 2400 A
...
2017-01-01 10 C
2017-01-01 4 B
2017-01-01 50 A
Thank You!
Your query seems overly complicated. Here is another approach:
Use lead() to get when the status ends for each customer status record.
Use generate_series() to generate the days.
The rest is just filtering and aggregation:
select gs.dte, cs.status, count(*)
from (select cs.*,
             lead(cs.created_at, 1, now()::date) over (partition by cs.customer_id order by cs.created_at) as next_ca
      from customer_statuses cs
     ) cs cross join lateral
     generate_series(cs.created_at, cs.next_ca - interval '1 day', interval '1 day') gs(dte)
where gs.dte >= now()::date - interval '30 day'
group by gs.dte, cs.status
I've altered the query a bit because I noticed that I get duplicate records on the days a customer changes status: one record with the old status and one record with the new status.
For example, output with Gordon's query:
dte | status
---------------------------
2020-02-12 B
... ...
2020-02-01 A
2020-02-01 B
2020-01-31 A
2020-01-30 A
I've adapted the query (see below). The results now depict the changes between statuses correctly (no duplicate records on the day of change); however, the records only continue up until now()::date - interval '1 day' and do not include now()::date (as in today). I'm not sure why, and I can't find the logic that gives me both of the things I want: dates that correctly depict the status of each customer, and statuses that run up to and include today.
Adjusted query:
select gs.dte, cs.status, count(*)
from (select cs.*,
             lead(cs.created_at, 1, now()::date) over (partition by cs.customer_id order by cs.created_at) - interval '1 day' as next_ca
      from customer_statuses cs
     ) cs cross join lateral
     generate_series(cs.created_at, cs.next_ca, interval '1 day') gs(dte)
where gs.dte >= now()::date - interval '30 day'
group by gs.dte, cs.status
The two adjustments (they also seem counter-intuitive, as it looks like I'm taking the one-day interval away from one part of the query only to add it to another, which to me should yield the same result):
a - subtracted 1 day inside the lead() expression (line 3):
lead(cs.created_at, 1, now()::date) over (partition by cs.customer_id order by cs.created_at) - interval '1 day' as next_ca
b - removed the 1-day decrease from next_ca in generate_series (line 6), which previously read:
generate_series(cs.created_at, cs.next_ca - interval '1 day', interval '1 day')
Example of the output with the adjusted query:
dte | status
---------------------------
2020-02-11 B
... ...
2020-02-01 B
2020-01-31 A
2020-01-30 A
Thanks for your help!