SQL - Find record counts for the last 30 days, grouped by status

I am trying to retrieve the daily count of customers per status over a dynamic window - the last 30 days.
The result of the query should show, for each day, how many customers there are per customer status (A, B, C) over the last 30 days (i.e. today() - 29 days). Every customer has one status at a time but can change from one status to another over their lifetime. The purpose of this query is to show customer 'movement' across their lifetime. I've generated a series of dates ranging from the first date a customer was created until today.
I've put together the following query, but something I'm doing appears to be incorrect because the results show most days having the same count across all statuses, which is not possible - new customers are created every day. We checked with another, simpler query and confirmed that the split between statuses is not equal.
I've tried to depict below the data and the SQL I use to reach the desired result.
Starting point (example table customer_statuses):
customer_id | status | created_at
---------------------------------------------------
abcdefg1234 B 2019-08-22
abcdefg1234 C 2019-01-17
...
abcdefg1234 A 2018-01-18
bcdefgh2232 A 2017-09-02
ghijklm4950 B 2018-06-06
statuses - A,B,C
There is no sequential order for the statuses; a customer can have any status at the start of the business relationship and switch between statuses throughout their lifetime.
table customers:
id | f_name | country | created_at |
---------------------------------------------------------------------
abcdefg1234 Michael FR 2018-01-18
bcdefgh2232 Sandy DE 2017-09-02
....
ghijklm4950 Daniel NL 2018-06-06
SQL - current version:
WITH customer_list AS (
SELECT
DISTINCT a.id,
a.created_at
FROM
customers a
),
dates AS (
SELECT
generate_series(
MIN(DATE_TRUNC('day', created_at)::DATE),
MAX(DATE_TRUNC('day', now())::DATE),
'1d'
)::date AS day
FROM customers a
),
customer_statuses AS (
SELECT
customer_id,
status,
created_at,
ROW_NUMBER() OVER
(
PARTITION BY customer_id
ORDER BY created_at DESC
) col
FROM
customer_status
)
SELECT
day,
(
SELECT
COUNT(DISTINCT id) AS accounts
FROM customers
WHERE created_at::date BETWEEN day - 29 AND day
),
status
FROM dates d
LEFT JOIN customer_list cus
ON d.day = cus.created_at
LEFT JOIN customer_statuses cs
ON cus.id = cs.customer_id
WHERE
cs.col = 1
GROUP BY 1,3
ORDER BY 1 DESC,3 ASC
Currently what the results from the query look like:
day | count | status
-------------------------
2020-01-24 1230 C
2020-01-24 1230 B
2020-01-24 1230 A
2020-01-23 1200 C
2020-01-23 1200 B
2020-01-23 1200 A
2020-01-22 1150 C
2020-01-22 1150 B
...
2017-01-01 50 C
2017-01-01 50 B
2017-01-01 50 A
Two things I've noticed from the results above: first, most of the time the results show the same count across all statuses on a given day. Second, there are days where only two statuses appear - which should not be the case. If no new accounts with a certain status are created on a given day, the count from the previous day should be carried over - right? Or is the problem with the query I created, or with the logic I have in mind?
Perhaps I'm expecting a result that will not happen logically?
Required result:
day | count | status
-------------------------
2020-01-24 1230 C
2020-01-24 1000 B
2020-01-24 2500 A
2020-01-23 1200 C
2020-01-23 1050 B
2020-01-23 2450 A
2020-01-22 1160 C
2020-01-22 1020 B
2020-01-22 2400 A
...
2017-01-01 10 C
2017-01-01 4 B
2017-01-01 50 A
Thank You!

Your query seems overly complicated. Here is another approach:
Use lead() to get when the status ends for each customer status record.
Use generate_series() to generate the days.
The rest is just filtering and aggregation:
select gs.dte, cs.status, count(*)
from (select cs.*,
             lead(cs.created_at, 1, now()::date) over (partition by cs.customer_id order by cs.created_at) as next_ca
      from customer_statuses cs
     ) cs cross join lateral
     generate_series(cs.created_at, cs.next_ca - interval '1 day', interval '1 day') gs(dte)
where gs.dte >= now()::date - interval '30 day'
group by gs.dte, cs.status

I've altered the query a bit because I noticed that I get duplicate records on the days a customer changes status - one record with the old status and one record with the new status.
For example, output with @Gordon's query:
dte | status
---------------------------
2020-02-12 B
... ...
2020-02-01 A
2020-02-01 B
2020-01-31 A
2020-01-30 A
I've adapted the query (see below). The results now depict the changes between statuses correctly (no duplicate records on the day of a change); however, the records only run up to now()::date - interval '1 day' and do not include now()::date (i.e. today). I'm not sure why, and I can't find the right logic to get everything the way I want it:
dates that correctly depict the status of each customer, with the returned statuses including today.
Adjusted query:
select gs.dte, cs.status, count(*)
from (select cs.*,
             lead(cs.created_at, 1, now()::date) over (partition by cs.customer_id order by cs.created_at) - interval '1 day' as next_ca
      from customer_statuses cs
     ) cs cross join lateral
     generate_series(cs.created_at, cs.next_ca, interval '1 day') gs(dte)
where gs.dte >= now()::date - interval '30 day'
group by gs.dte, cs.status
The two adjustments (they seem counter-intuitive, since I appear to be taking the one-day interval away from one part of the query only to add it back in another, which to me should yield the same result):
a - subtracted 1 day from the lead() result (line 3):
lead(cs.created_at, 1, now()::date) over (partition by cs.customer_id order by cs.created_at) - INTERVAL '1 day' as next_ca
b - removed the 1-day subtraction from the generate_series() upper bound, next_ca (line 6), which previously read:
generate_series(cs.created_at, cs.next_ca - interval '1 day', interval '1 day')
Example of the output with the adjusted query:
dte | status
---------------------------
2020-02-11 B
... ...
2020-02-01 B
2020-01-31 A
2020-01-30 A
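One idea that might get today included as well (an untested sketch): push the lead() default one day forward, so that after the 1-day subtraction the latest status of each customer runs through today:
select gs.dte::date as day, cs.status, count(*)
from (select cs.*,
             -- default one day past today, so next_ca lands on today for each customer's current status
             lead(cs.created_at, 1, (now() + interval '1 day')::date) over (partition by cs.customer_id order by cs.created_at) - interval '1 day' as next_ca
      from customer_statuses cs
     ) cs cross join lateral
     generate_series(cs.created_at, cs.next_ca, interval '1 day') gs(dte)
where gs.dte >= now()::date - interval '30 day'
group by 1, 2
order by 1 desc, 2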
Thanks for your help!

Related

Count distinct customers who bought in the previous period and not in the next period (BigQuery)

I have a dataset in BigQuery which contains order_date (DATE) and customer_id.
order_date | CustomerID
2019-01-01 | 111
2019-02-01 | 112
2020-01-01 | 111
2020-02-01 | 113
2021-01-01 | 115
2021-02-01 | 119
I am trying to count distinct customer_id values between a month of the previous year and the same month of the current year - for example, from 2019-01-01 to 2020-01-01, then from 2019-02-01 to 2020-02-01 - and then count those who did not buy in the same period of the next year (2020-01-01 to 2021-01-01, then 2020-02-01 to 2021-02-01).
The output I expect:
order_date | count distinct CustomerID | who did not buy in the next period
2020-01-01| 5191 |250
2020-02-01| 4859 |500
2020-03-01| 3567 |349
..........| .... |......
and the next periods shouldn't include the previous ones.
I tried the code below, but it works differently than intended:
with customers as (
select distinct date_trunc(date(order_date),month) as dates,
CUSTOMER_WID
from t
where date(order_date) between '2018-01-01' and current_date()-1
)
select
dates,
customers_previous,
customers_next_period
from
(
select dates,
count(CUSTOMER_WID) as customers_previous,
count(case when customer_wid_next is null then 1 end) as customers_next_period,
from (
select prev.dates,
prev.CUSTOMER_WID,
next.dates as next_dates,
next.CUSTOMER_WID as customer_wid_next
from customers as prev
left join customers
as next on next.dates=date_add(prev.dates,interval 1 year)
and prev.CUSTOMER_WID=next.CUSTOMER_WID
) as t2
group by dates
)
order by 1,2
Thanks in advance.
If I understand correctly, you are trying to count values over a window of time, and for that I recommend using window functions - the docs and many articles explain how they work in detail.
That said, my recommendation would be:
SELECT DISTINCT
periods,
COUNT(DISTINCT customer_id) OVER last_12mos AS count_customers_last_12_mos
FROM (
SELECT
order_date,
FORMAT_DATE('%Y%m', order_date) AS periods,
customer_id
FROM dataset
)
WINDOW last_12mos AS ( # window of last 12 months without current month
PARTITION BY periods ORDER BY periods DESC
ROWS BETWEEN 12 PRECEDING AND 1 PRECEDING
)
I believe from this you can build some customizations to improve the aggregations you want.
You can generate the periods using unnest(generate_date_array()). Then use joins to bring in the customers from the previous 12 months and the next 12 months. Finally, aggregate and count the customers:
select period,
count(distinct c_prev.customer_wid),
count(distinct c_next.customer_wid)
from unnest(generate_date_array(date '2020-01-01', date '2021-01-01', interval 1 month)) period join
customers c_prev
on c_prev.order_date <= period and
c_prev.order_date > date_add(period, interval -12 month) left join
customers c_next
on c_next.customer_wid = c_prev.customer_wid and
c_next.order_date > period and
c_next.order_date <= date_add(period, interval 12 month)
group by period;
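If the third column should specifically be the customers who bought in the previous 12 months but not in the following 12 months, the unmatched rows of the left join can be counted directly - an untested sketch using the same table and column names as above:
select period,
count(distinct c_prev.customer_wid) as customers_prev_12_mos,
-- customers with no matching purchase in the following 12 months
count(distinct case when c_next.customer_wid is null then c_prev.customer_wid end) as not_bought_next_12_mos
from unnest(generate_date_array(date '2020-01-01', date '2021-01-01', interval 1 month)) period join
customers c_prev
on c_prev.order_date <= period and
c_prev.order_date > date_add(period, interval -12 month) left join
customers c_next
on c_next.customer_wid = c_prev.customer_wid and
c_next.order_date > period and
c_next.order_date <= date_add(period, interval 12 month)
group by period;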

Snowflake SQL Filter by transactions in the last rolling 30 days

I have a data table with a customer ID and an item purchase date, as shown below. As a filter, I want to return a customer ID if and only if that customer ID has at least 1 purchase in the last 30 rolling days.
Is this something that can be done with a simple WHERE clause? For context, this table has many records, and a customer ID might have hundreds of transactions.
Customer ID Item Date Purchased
233 2021-05-27
111 2021-05-27
111 2021-05-21
23 2021-05-12
412 2021-03-11
111 2021-03-03
Desired output:
Customer ID
233
111
23
I originally thought to use a CTE to first filter out any customers that don't have at least 1 item purchase within the last 30 days. I tried the following two WHERE conditions, but both returned incorrect date ranges.
SELECT *
FROM data d
WHERE 30 <= datediff(days, d.ITEM_PURCHASE_DATE, current_date) X
WHERE t.DATE_CREATED <= current_date + interval '30 days' X
To get the customers that have made at least one purchase in the last 30 days you can do this:
select distinct
customer_id
from sample_table
where item_date_purchased > dateadd(day, -30, current_date());
the dateadd function shown above returns a date 30 days prior to the current date.
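If you then need the full transaction rows for those customers rather than just the IDs, the same condition can drive a semi-join - a sketch with the same assumed table and column names:
select *
from sample_table
where customer_id in (
    -- customers with at least one purchase in the last rolling 30 days
    select customer_id
    from sample_table
    where item_date_purchased > dateadd(day, -30, current_date())
);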
The approach with INTERVAL was almost correct.
SELECT *
FROM data d
WHERE d.DATE_CREATED <= current_date + interval '30 days'
-- this date is from the future instead of the past
=>
SELECT *
FROM data d
WHERE d.DATE_CREATED >= current_date - interval '30 days'

Counting subscriber numbers given events in SQL

I have a dataset in MySQL in the following format, showing the history of events for some client IDs (subscriber_table):
user_id type created_at
A past_due 2021-03-27 10:15:56
A reactivate 2021-02-06 10:21:35
A past_due 2021-01-27 10:30:41
A new 2020-10-28 18:53:07
A cancel 2020-07-22 9:48:54
A reactivate 2020-07-22 9:48:53
A cancel 2020-07-15 2:53:05
A new 2020-06-20 20:24:18
B reactivate 2020-06-14 10:57:50
B past_due 2020-06-14 10:33:21
B new 2020-06-11 10:21:24
date_table:
full_date
2020-05-01
2020-06-01
2020-07-01
2020-08-01
2020-09-01
2020-10-01
2020-11-01
2020-12-01
2021-01-01
2021-02-01
2021-03-01
I have been struggling to come up with a query to count subscribers for a range of months - months which are not necessarily present in the event table, either because the client is still subscribed or because they cancelled and later resubscribed. The output I am looking for is this:
Output
date subscriber_count
2020-05-01 0
2020-06-01 2
2020-07-01 2
2020-08-01 1
2020-09-01 1
2020-10-01 2
2020-11-01 2
2020-12-01 2
2021-01-01 2
2021-02-01 2
2021-03-01 2
Reactivate and past_due events do not change the subscription status of the client; only the cancel and new events do. If the client cancels in a month, they should still be counted as active for that month.
My initial approach was to get the latest entry per month per subscriber ID and then join that to the premade date table, but when months are missing I am unsure how to fill them with the correct status. Maybe a lag function?
with last_record_per_month as (
select
date_trunc('month', created_at)::date as month_year,
row_number() over (partition by user_id, date_trunc('month', created_at) order by created_at desc) as num,
user_id,
type,
created_at as created_at
from
subscriber_table
where
user_id in ('A', 'B')
order by
created_at desc
), final as (
select
month_year,
created_at,
type
from
last_record_per_month lrpm
right join (
select
date_trunc('month', full_date)::date as month_year
from
date_table
where
full_date between '2020-05-01' and '2021-03-31'
group by
1
order by
1
) dd
on lrpm.created_at = dd.month_year
and num = 1
order by
month_year
)
select
*
from
final
I do have a premade base table with every single date in many years to use as a joining table
Any help with this is GREATLY appreciated
Thanks!
The approach here is to take the subscriber rows with new events as the base and map them to the cancel rows using a self join. Then take the date table as the base, join the two, and aggregate on the distinct number of users to get the result.
SELECT full_date, COUNT(DISTINCT user_id) FROM date_tbl
LEFT JOIN(
SELECT new.user_id,new.type,new.created_at created_at_new,
IFNULL(cancel.created_at,CURRENT_DATE) created_at_cancel
FROM subscriber new
LEFT JOIN subscriber cancel
ON new.user_id=cancel.user_id
AND new.type='new' AND cancel.type='cancel'
AND new.created_at<= cancel.created_at
WHERE new.type IN('new'))s
ON DATE_FORMAT(s.created_at_new, '%Y-%m')<=DATE_FORMAT(full_date, '%Y-%m')
AND DATE_FORMAT(s.created_at_cancel, '%Y-%m')>=DATE_FORMAT(full_date, '%Y-%m')
GROUP BY 1
Let me break down some sections.
First, we self join the subscriber table on user_id, with the left side restricted to rows of type 'new' and the right side to rows of type 'cancel': new.type='new' AND cancel.type='cancel'.
The new rows should always precede the cancel rows, so we also add new.created_at <= cancel.created_at.
Since we only care about the rows with new in the base table, we filter them in the WHERE clause with new.type IN ('new'). For the sample data above, the result of the subquery would look something like this:
user_id | type | created_at_new | created_at_cancel
A new 2020-06-20 2020-07-15
A new 2020-06-20 2020-07-22
A new 2020-10-28 (CURRENT_DATE)
B new 2020-06-11 (CURRENT_DATE)
We then LEFT JOIN this subquery to the date table (with the date table as the base) such that the year-month of created_at_new is less than or equal to full_date - DATE_FORMAT(s.created_at_new, '%Y-%m') <= DATE_FORMAT(full_date, '%Y-%m') - and the year-month of created_at_cancel is greater than or equal to full_date.
Lastly, we aggregate on full_date and count the distinct users.

How to create monthly running sum when some months have no records

The goal is to graph the total volume of created orders over time in a monthly digest
WITH monthly_sums AS (
SELECT
date_trunc('month', created_at) AS month,
sum(count(created_at)) OVER (ORDER BY date_trunc('month', created_at)) AS sum
FROM orders
GROUP BY date_trunc('month', created_at)
)
SELECT
to_char(date_range.month, 'Month') AS month,
COALESCE(monthly_sums.sum, 0) AS total
FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 month') date_range(month)
LEFT OUTER JOIN monthly_sums
ON monthly_sums.month = date_range.month;
Which returns:
month | total
-----------+-------
October | 0
November | 0
December | 0
January | 1
February | 0 <-- should be 1
March | 3
(6 rows)
selecting from monthly_sums returns:
month | sum
---------------------+-----
2015-01-01 00:00:00 | 1
| <-- no records created in February
2015-03-01 00:00:00 | 3
| 3
(3 rows)
The problem is there were no orders in February so the total is coalesced into 0. How can I alter or rethink this query in order to get the desired result?
I am unfamiliar with the postgresql syntax for this construct, but the general pattern across all SQL engines for implementing this type of operation efficiently is always the following:
1. Aggregate by month;
2. Left join from the full set of desired periods to the aggregates in (1), coalescing absent sums to zero;
3. Perform the running summation by period.
Use either CTEs or subqueries to build 1, 2, and 3 successively.
Following Pieter's answer, this altered query gives the correct result:
WITH monthly_counts AS (
SELECT
date_trunc('month', created_at) AS month,
count(created_at) AS count
FROM orders
GROUP BY date_trunc('month', created_at)
)
SELECT
to_char(sq.month, 'Month') AS month,
sum(sq.count) OVER (ORDER BY sq.month)
FROM
(SELECT
date_range.month,
COALESCE(monthly_counts.count, 0) AS count
FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 month') date_range(month)
LEFT OUTER JOIN monthly_counts
ON monthly_counts.month = date_range.month) AS sq;

Using outer query result in a subquery in postgresql

I have two tables, points and contacts, and I'm trying to get the average points.score per contact, grouped on a monthly basis. Note that points and contacts aren't related; I just want the sum of points created in a month divided by the number of contacts that existed in that month.
So, I need to sum points grouped by the created_at month, and I need to take the count of contacts FOR THAT MONTH ONLY. It's that last part that's tripping me up. I'm not sure how I can use a column from an outer query in the subquery. I tried something like this:
SELECT SUM(score) AS points_sum,
EXTRACT(month FROM created_at) AS month,
date_trunc('MONTH', created_at) + INTERVAL '1 month' AS next_month,
(SELECT COUNT(id) FROM contacts WHERE contacts.created_at <= next_month) as contact_count
FROM points
GROUP BY month, next_month
ORDER BY month
So, I'm extracting the actual month in which the points are being summed and, at the same time, getting the beginning of the next month so that I can say "Get me the count of contacts whose created_at is < next_month".
But it complains that column next_month doesn't exist. This is understandable, as the subquery knows nothing about the outer query. Qualifying it as points.next_month doesn't work either.
So can someone point me in the right direction of how to achieve this?
Tables:
Points
score | created_at
10 | "2011-11-15 21:44:00.363423"
11 | "2011-10-15 21:44:00.69667"
12 | "2011-09-15 21:44:00.773289"
13 | "2011-08-15 21:44:00.848838"
14 | "2011-07-15 21:44:00.924152"
Contacts
id | created_at
6 | "2011-07-15 21:43:17.534777"
5 | "2011-08-15 21:43:17.520828"
4 | "2011-09-15 21:43:17.506452"
3 | "2011-10-15 21:43:17.491848"
1 | "2011-11-15 21:42:54.759225"
sum, month and next_month (without the subselect)
sum | month | next_month
14 | 7 | "2011-08-01 00:00:00"
13 | 8 | "2011-09-01 00:00:00"
12 | 9 | "2011-10-01 00:00:00"
11 | 10 | "2011-11-01 00:00:00"
10 | 11 | "2011-12-01 00:00:00"
Edit
Now with running sum of contacts. My first draft used new contacts per month, which is obviously not what OP wants.
WITH c AS (
SELECT created_at
,count(id) OVER (ORDER BY created_at) AS ct
FROM contacts
), p AS (
SELECT date_trunc('month', created_at) AS month
,sum(score) AS points_sum
FROM points
GROUP BY 1
)
SELECT p.month
,EXTRACT(month FROM p.month) AS month_nr
,p.points_sum
,( SELECT c.ct
FROM c
WHERE c.created_at < (p.month + interval '1 month')
ORDER BY c.created_at DESC
LIMIT 1) AS contacts
FROM p
ORDER BY 1
This works for any number of months across the years.
It assumes that no month is missing in the table points. If you want all months, including ones missing in points, generate a list of months with generate_series() and LEFT JOIN to it - see the sketch after these notes.
Build the running sum in a CTE with a window function.
Both CTEs are not strictly necessary - they are there for performance and simplification only.
Get contacts_count in a subselect.
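A sketch of that variation (untested; it assumes the same points and contacts tables and starts the month list at the earliest row in points):
SELECT m.month
      ,EXTRACT(month FROM m.month) AS month_nr
      ,COALESCE(p.points_sum, 0) AS points_sum
      ,( SELECT count(*)
         FROM   contacts c
         WHERE  c.created_at < (m.month + interval '1 month')) AS contact_count
FROM   generate_series((SELECT date_trunc('month', min(created_at)) FROM points)
                      ,date_trunc('month', now())
                      ,interval '1 month') AS m(month)  -- one row per month, even for months missing in points
LEFT   JOIN (
   SELECT date_trunc('month', created_at) AS month
         ,sum(score) AS points_sum
   FROM   points
   GROUP  BY 1
   ) p ON p.month = m.month
ORDER  BY 1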
Your original form of the query could work like this:
SELECT month
,EXTRACT(month FROM month) AS month_nr
,points_sum
,(SELECT count(*)
FROM contacts c
WHERE c.created_at < (p.month + interval '1 month')) AS contact_count
FROM (
SELECT date_trunc('MONTH', created_at) AS month
,sum(score) AS points_sum
FROM points p
GROUP BY 1
) p
ORDER BY 1
The fix for the immediate cause of your error is to put the aggregate into a subquery. You were mixing levels in a way that is impossible.
I expect my variant to be slightly faster with big tables. Not sure about smaller tables. Would be great if you'd report back with test results.
Plus a minor fix: < instead of <=.
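And if the end goal is the per-contact average itself rather than two separate columns, an outer query can simply divide the two - a sketch on top of the variant above (the numeric cast avoids integer division, NULLIF avoids division by zero):
SELECT month
      ,points_sum
      ,contact_count
      ,round(points_sum::numeric / NULLIF(contact_count, 0), 2) AS points_per_contact
FROM  (
   SELECT month
         ,points_sum
         ,(SELECT count(*)
           FROM   contacts c
           WHERE  c.created_at < (p.month + interval '1 month')) AS contact_count
   FROM  (
      SELECT date_trunc('MONTH', created_at) AS month
            ,sum(score) AS points_sum
      FROM   points
      GROUP  BY 1
      ) p
   ) sub
ORDER BY month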