I'm trying to figure out some sliding window stats on my users. I have a table with a user, and columns such as created_at and verified_at. For each month, I'd like to find out how many users registered (a simple group by date_trunc of the created_at), and then of those people, how many verified within my sliding window (call it 60 days).
I'd like to do a SQL query that gives me something like:
Month | Registered | Verified in 60 days
Jan 2009 | 1543 | 107
Feb 2009 | 2000 | 250
I'm using PostgreSQL. I started looking at sum(case...), but I don't know if I can make my case depend on the date_trunc somehow.
This doesn't work, of course, but here's the idea:
SELECT DATE_TRUNC('month', created_at) as month,
COUNT(*) as registered,
SUM(CASE WHEN verified_at < month+60 THEN 1 ELSE 0 END) as verified
FROM users
GROUP BY DATE_TRUNC('month', created_at)
SELECT '2009-01-01'::timestamp + (s || ' month')::interval AS month,
COUNT(created_at) AS registered,
SUM(CASE WHEN verified_at <= created_at + '60 day'::INTERVAL THEN 1 ELSE 0 END) AS verified
FROM generate_series(1, 20) s
LEFT JOIN
users
-- PostgreSQL has no datetime type; use timestamp
ON created_at >= '2009-01-01'::timestamp + (s || ' month')::interval
AND created_at < '2009-01-01'::timestamp + (s + 1 || ' month')::interval
GROUP BY
s
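The key idea in the query above is that the 60-day window is measured from each user's own created_at, not from the start of the month. Here is a minimal sketch of that conditional aggregation, run against SQLite through Python's sqlite3 module so it is self-contained. The sample rows are made up, and julianday()/strftime() stand in for PostgreSQL's interval arithmetic and date_trunc.

```python
import sqlite3

# Hypothetical sample data; column names follow the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, created_at TEXT, verified_at TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "2009-01-05", "2009-01-20"),  # verified after 15 days
        (2, "2009-01-10", "2009-06-01"),  # verified, but too late
        (3, "2009-01-15", None),          # never verified
        (4, "2009-02-03", "2009-03-30"),  # verified after 55 days
    ],
)

rows = conn.execute(
    """
    SELECT strftime('%Y-%m', created_at) AS month,
           COUNT(*) AS registered,
           -- a NULL verified_at makes the comparison NULL, so ELSE 0 applies
           SUM(CASE WHEN julianday(verified_at) <= julianday(created_at) + 60
                    THEN 1 ELSE 0 END) AS verified
    FROM users
    GROUP BY month
    ORDER BY month
    """
).fetchall()

for row in rows:
    print(row)
```

Unlike the generate_series version, this sketch only returns months in which at least one user registered.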
Perhaps you could union together the different months:
select sum(whatever), 'january' from user where month = 'january'
union all
select sum(whatever), 'february' from user where month = 'february'
...
SELECT
MONTH,
COUNT(*) AS Registered,
SUM(CASE WHEN datediff(day, reg_date, ver_date) < 60 THEN 1 ELSE 0 END) AS 'Verified in 60 days' -- datediff is an MSSQL function; amend for PostgreSQL
FROM
TABLE
GROUP BY
MONTH
I am trying to build a query to count user reactivations per month, where "reactivation" is defined as (for e.g. March 2021):
Sent activity during, or before, January 2021
Did not send activity during February 2021
Sent activity during March 2021
(so 1 or more full calendar months of no activity as the threshold for inactive).
The source table F_ACTIVITY is a per-user per-day time series with columns:
dt (date), user_id, is_active (boolean).
The desired outcome is a table showing:
month, reactivations_this_month
The closest I can get is a count of reactivations in the current month, or something relative to the current date with more case statements (e.g. repeating for current month -2):
SELECT
COUNT(*) AS reactivations_this_month
FROM
(SELECT
* FROM
(SELECT
user_id,
SUM(current_month_active) AS cma,
SUM(last_month_active) AS lma,
SUM(historical_active) AS h_a
FROM
(SELECT
user_id,
dt,
CASE WHEN DATE_TRUNC(MONTH, DT) = ADD_MONTHS(DATE_TRUNC(MONTH, CURRENT_TIMESTAMP), 0) THEN 1 ELSE 0 END AS current_month_active,
CASE WHEN DATE_TRUNC(MONTH, DT) = ADD_MONTHS(DATE_TRUNC(MONTH, CURRENT_TIMESTAMP), -1) THEN 1 ELSE 0 END AS last_month_active,
CASE WHEN DATE_TRUNC(MONTH, DT) < ADD_MONTHS(DATE_TRUNC(MONTH, CURRENT_TIMESTAMP), -1) THEN 1 ELSE 0 END AS historical_active
FROM F_ACTIVITY
WHERE is_active = 1
) AS x
GROUP BY user_id) AS y
WHERE cma > 0
AND lma = 0
AND h_a > 0) AS z
Any help transforming this into a rolling monthly query greatly appreciated - thanks all!
Final note: I'm trying this in Snowflake, so the dialect is SnowSQL
First summarize by month and user, then use lag():
SELECT yyyymm,
SUM(CASE WHEN prev_yyyymm < yyyymm - INTERVAL '1 month' THEN 1 ELSE 0 END) as num_reactivations
FROM (SELECT user_id, DATE_TRUNC(MONTH, DT) as yyyymm,
LAG(DATE_TRUNC(MONTH, DT)) OVER (PARTITION BY user_id ORDER BY DATE_TRUNC(MONTH, DT)) as prev_yyyymm
FROM F_ACTIVITY
WHERE is_active = 1
GROUP BY user_id, DATE_TRUNC(MONTH, DT)
) um
GROUP BY yyyymm;
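To see the lag() idea on concrete rows, here is a small sketch that runs the same logic against SQLite via Python's sqlite3 module. The sample data is invented, the CTE names user_months and with_prev are hypothetical, and date(dt, 'start of month') stands in for Snowflake's DATE_TRUNC(MONTH, dt). It assumes SQLite 3.25 or later for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE f_activity (dt TEXT, user_id INTEGER, is_active INTEGER)")
conn.executemany(
    "INSERT INTO f_activity VALUES (?, ?, ?)",
    [
        ("2021-01-10", 1, 1),
        ("2021-03-02", 1, 1),  # user 1 skipped February: reactivated in March
        ("2021-03-15", 1, 1),
        ("2021-01-20", 2, 1),
        ("2021-02-11", 2, 1),  # user 2 is active every month: no reactivation
        ("2021-03-05", 2, 1),
        ("2021-02-01", 3, 0),  # inactive rows are filtered out
        ("2021-03-09", 3, 1),  # user 3 is new in March: no reactivation
    ],
)

rows = conn.execute(
    """
    WITH user_months AS (      -- one row per user per active month
        SELECT DISTINCT user_id, date(dt, 'start of month') AS yyyymm
        FROM f_activity
        WHERE is_active = 1
    ),
    with_prev AS (             -- attach each user's previous active month
        SELECT user_id, yyyymm,
               LAG(yyyymm) OVER (PARTITION BY user_id ORDER BY yyyymm) AS prev_yyyymm
        FROM user_months
    )
    SELECT yyyymm,
           SUM(CASE WHEN prev_yyyymm < date(yyyymm, '-1 month')
                    THEN 1 ELSE 0 END) AS num_reactivations
    FROM with_prev
    GROUP BY yyyymm
    ORDER BY yyyymm
    """
).fetchall()

for row in rows:
    print(row)
```

A user's first active month has a NULL previous month, so the comparison is NULL and the row falls into the ELSE branch; new users are therefore not counted as reactivations.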
The CLIENTS table contains a monthly snapshot of the bank's clients
who made any transactions in the given month. Attributes: report_month
and client_id. We say that a client "outflows" from the bank in month N if it is active
in month N (present in the CLIENTS table) and inactive in months N + 1, N + 2, N + 3.
How can I find the share of clients who "outflow" each month?
Table looks like:
report_month client_id
2020-01-01 0023
2020-03-01 0125
...
You can do this with window functions and a window frame. In standard SQL, this would look like:
select report_month, sum(case when cnt = 0 then 1 else 0 end) as outflow
from (
select t.*,
count(*) over(
partition by client_id
order by report_month
range between interval '1' month following and interval '3' month following
) cnt
from mytable t
) t
group by report_month
This assumes that report_month is of a date-like datatype, and that each customer has 0 or 1 record per report_month. If a customer may appear more than once in a month, you would change the outer conditional sum() to:
count(distinct case when cnt = 0 then client_id end) as outflow
In SQLite, which has poor date arithmetic support, it is a bit more complicated. If you can live with an approximation of month periods, you could do something like this:
select report_month, sum(case when cnt = 0 then 1 else 0 end) as outflow
from (
select t.*,
count(*) over(
partition by client_id
order by julianday(report_month)
range between 28 following and 92 following
) cnt
from mytable t
) t
group by report_month
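As a rough check of the SQLite variant, the sketch below runs it through Python's sqlite3 module on a few invented rows and also derives the share asked for in the question. The table name clients, the outflow_share column and the sample data are assumptions. Numeric offsets in a RANGE frame need SQLite 3.28 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (report_month TEXT, client_id TEXT)")
conn.executemany(
    "INSERT INTO clients VALUES (?, ?)",
    [
        ("2020-01-01", "0023"),
        ("2020-02-01", "0023"),  # 0023 is last seen in February
        ("2020-01-01", "0125"),
        ("2020-05-01", "0125"),  # 0125 returns after more than three months
    ],
)

rows = conn.execute(
    """
    SELECT report_month,
           SUM(CASE WHEN cnt = 0 THEN 1 ELSE 0 END) AS outflow,
           1.0 * SUM(CASE WHEN cnt = 0 THEN 1 ELSE 0 END) / COUNT(*) AS outflow_share
    FROM (
        SELECT c.*,
               COUNT(*) OVER (
                   PARTITION BY client_id
                   ORDER BY julianday(report_month)
                   -- 28 to 92 days ahead covers the next three month starts
                   RANGE BETWEEN 28 FOLLOWING AND 92 FOLLOWING
               ) AS cnt
        FROM clients c
    ) t
    GROUP BY report_month
    ORDER BY report_month
    """
).fetchall()

for row in rows:
    print(row)
```

Note that the last three months of data always look like outflow, because the following months are not in the table yet; you may want to exclude them from the report.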
I'm using Redshift (Postgres) and Pandas to do my work. I'm trying to get the number of user actions; let's say purchases, to make it easier to understand. I have a table, purchases, that holds the following data:
user_id, timestamp, price
1, 2015-02-01, 200
1, 2015-02-02, 50
1, 2015-02-10, 75
Ultimately I would like the number of purchases within certain time windows, such as:
userid, 28-14_days, 14-7_days, 7
Here is what I have so far, I'm aware I don't have an upper limit on the dates:
SELECT DISTINCT x_days.user_id, SUM(x_days.purchases) AS x_num, SUM(y_days.purchases) AS y_num,
x_days.x_date, y_days.y_date
FROM
(
SELECT purchases.user_id, COUNT(purchases.user_id) as purchases,
DATE(purchases.timestamp) as x_date
FROM purchases
WHERE purchases.timestamp > (current_date - INTERVAL '%(x_days_ago)s day') AND
purchases.max_value > 200
GROUP BY DATE(purchases.timestamp), purchases.user_id
) AS x_days
JOIN
(
SELECT purchases.user_id, COUNT(purchases.user_id) as purchases,
DATE(purchases.timestamp) as y_date
FROM purchases
WHERE purchases.timestamp > (current_date - INTERVAL '%(y_days_ago)s day') AND
purchases.max_value > 200
GROUP BY DATE(purchases.timestamp), purchases.user_id) AS y_days
ON
x_days.user_id = y_days.user_id
GROUP BY
x_days.user_id, x_days.x_date, y_days.y_date
params={'x_days_ago':x_days_ago, 'y_days_ago':y_days_ago}
where these are set in python/pandas
x_days_ago = 14
y_days_ago = 7
But this didn't work out exactly as planned:
user_id x_num y_num x_date y_date
0 5451772 1 1 2015-02-10 2015-02-10
1 5026678 1 1 2015-02-09 2015-02-09
2 6337993 2 1 2015-02-14 2015-02-13
3 6204432 1 3 2015-02-10 2015-02-11
4 3417539 1 1 2015-02-11 2015-02-11
Even though I don't have an upper date to look between (so x is effectively searching from 14 days to now and y is 7 days to now, meaning overlap), in some cases y is higher.
Can anyone help me either fix this or give me a better way?
Thanks!
It might not be the most efficient answer, but you can generate each sum with a sub-select:
WITH
summed AS (
SELECT user_id, day, COUNT(1) AS purchases
FROM (SELECT user_id, DATE(timestamp) AS day FROM purchases) AS _
GROUP BY user_id, day
),
users AS (SELECT DISTINCT user_id FROM purchases)
SELECT user_id,
(SELECT SUM(purchases) FROM summed
WHERE summed.user_id = users.user_id
AND day >= DATE(NOW() - interval ' 7 days')) AS days_7,
(SELECT SUM(purchases) FROM summed
WHERE summed.user_id = users.user_id
AND day >= DATE(NOW() - interval '14 days')) AS days_14
FROM users;
(This was tested in Postgres, not in Redshift; but the Redshift documentation suggests that both WITH and DISTINCT are supported.) I would have liked to do this with a window, to obtain rolling sums; but it's a little onerous without generate_series().
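If non-overlapping buckets are wanted (28-14 days ago, 14-7 days ago, last 7 days), one alternative to the sub-selects is a single pass with conditional aggregation. The following is only a sketch, run on SQLite through Python's sqlite3 module with invented rows. The fixed as_of date replaces NOW() so the output is reproducible, and the column names days_28_14, days_14_7 and days_7 are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user_id INTEGER, timestamp TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        (1, "2015-02-01", 200),
        (1, "2015-02-02", 50),
        (1, "2015-02-10", 75),
        (2, "2015-01-20", 30),
    ],
)

as_of = "2015-02-14"  # hypothetical reference date standing in for "today"

rows = conn.execute(
    """
    SELECT user_id,
           SUM(CASE WHEN age >= 14 AND age < 28 THEN 1 ELSE 0 END) AS days_28_14,
           SUM(CASE WHEN age >= 7  AND age < 14 THEN 1 ELSE 0 END) AS days_14_7,
           SUM(CASE WHEN age >= 0  AND age < 7  THEN 1 ELSE 0 END) AS days_7
    FROM (SELECT user_id,
                 julianday(?) - julianday(timestamp) AS age  -- age in days
          FROM purchases)
    GROUP BY user_id
    ORDER BY user_id
    """,
    (as_of,),
).fetchall()

for row in rows:
    print(row)
```

Because each purchase falls into at most one bucket, a more recent bucket can never double-count rows from an older one, which was the source of the confusing numbers in the question.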
Trying to get a basic table that shows retention from one month to the next. So if someone buys something last month and they do so the next month it gets counted.
month, num_transactions, repeat_transactions, retention
2012-02, 5, 2, 40%
2012-03, 10, 3, 30%
2012-04, 15, 8, 53%
So if everyone that bought last month bought again the following month you have 100%.
So far I can only calculate stuff manually. This gives me the rows that have been seen in both months:
select count(*) as num_repeat_buyers from
(select distinct
to_char(transaction.timestamp, 'YYYY-MM') as month,
auth_user.email
from
auth_user,
transaction
where
auth_user.id = transaction.buyer_id and
to_char(transaction.timestamp, 'YYYY-MM') = '2012-03'
) as table1,
(select distinct
to_char(transaction.timestamp, 'YYYY-MM') as month,
auth_user.email
from
auth_user,
transaction
where
auth_user.id = transaction.buyer_id and
to_char(transaction.timestamp, 'YYYY-MM') = '2012-04'
) as table2
where table1.email = table2.email
This is not right, but I feel like I can use some of Postgres' windowing functions. Keep in mind the windowing functions don't let you specify WHERE clauses. You mostly have access to the preceding rows and the following rows:
select month, count(*) as num_transactions, count(*) over (PARTITION BY month ORDER BY month)
from
(select distinct
to_char(transaction.timestamp, 'YYYY-MM') as month,
auth_user.email
from
auth_user,
transaction
where
auth_user.id = transaction.buyer_id
order by
month
) as transactions_by_month
group by
month
Given the following test table (which you should have provided):
CREATE TEMP TABLE transaction (buyer_id int, tstamp timestamp);
INSERT INTO transaction VALUES
(1,'2012-01-03 20:00')
,(1,'2012-01-05 20:00')
,(1,'2012-01-07 20:00') -- multiple transactions this month
,(1,'2012-02-03 20:00') -- next month
,(1,'2012-03-05 20:00') -- next month
,(2,'2012-01-07 20:00')
,(2,'2012-03-07 20:00') -- not next month
,(3,'2012-01-07 20:00') -- just once
,(4,'2012-02-07 20:00'); -- just once
Table auth_user is not relevant to the problem.
Using tstamp as column name since I don't use base types as identifiers.
I am going to use the window function lag() to identify repeated buyers. To keep it short I combine aggregate and window functions in one query level. Bear in mind that window functions are applied after aggregate functions.
WITH t AS (
SELECT buyer_id
,date_trunc('month', tstamp) AS month
,count(*) AS item_transactions
,lag(date_trunc('month', tstamp)) OVER (PARTITION BY buyer_id
ORDER BY date_trunc('month', tstamp))
= date_trunc('month', tstamp) - interval '1 month'
OR NULL AS repeat_transaction
FROM transaction
WHERE tstamp >= '2012-01-01'::date
AND tstamp < '2012-05-01'::date -- time range of interest.
GROUP BY 1, 2
)
SELECT month
,sum(item_transactions) AS num_trans
,count(*) AS num_buyers
,count(repeat_transaction) AS repeat_buyers
,round(
CASE WHEN sum(item_transactions) > 0
THEN count(repeat_transaction) / sum(item_transactions) * 100
ELSE 0
END, 2) AS buyer_retention_pct
FROM t
GROUP BY 1
ORDER BY 1;
Result:
month | num_trans | num_buyers | repeat_buyers | buyer_retention_pct
---------+-----------+------------+---------------+--------------------
2012-01 | 5 | 3 | 0 | 0.00
2012-02 | 2 | 2 | 1 | 50.00
2012-03 | 2 | 2 | 1 | 50.00
I extended your question to provide for the difference between the number of transactions and the number of buyers.
The OR NULL for repeat_transaction serves to convert FALSE to NULL, so those values do not get counted by count() in the next step.
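The effect of OR NULL is easy to check in isolation. This is a minimal sketch using SQLite through Python's sqlite3 module; SQLite represents booleans as 0 and 1, and the flags CTE with its is_repeat column is invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Five hypothetical flags: two true (1) and three false (0).
result = conn.execute(
    """
    WITH flags(is_repeat) AS (VALUES (1), (0), (1), (0), (0))
    SELECT count(*),                 -- counts every row
           count(is_repeat),         -- counts non-NULL values, so false still counts
           count(is_repeat OR NULL)  -- false OR NULL is NULL, so only true counts
    FROM flags
    """
).fetchone()

print(result)
```

The same three-valued logic applies in PostgreSQL: TRUE OR NULL is TRUE, while FALSE OR NULL is NULL.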
This uses CASE and EXISTS to get repeated transactions:
SELECT
*,
CASE
WHEN num_transactions = 0
THEN 0
ELSE round(100.0 * repeat_transactions / num_transactions, 2)
END AS retention
FROM
(
SELECT
to_char(timestamp, 'YYYY-MM') AS month,
count(*) AS num_transactions,
sum(CASE
WHEN EXISTS (
SELECT 1
FROM transaction AS t
JOIN auth_user AS u
ON t.buyer_id = u.id
WHERE
date_trunc('month', transaction.timestamp)
+ interval '1 month'
= date_trunc('month', t.timestamp)
AND auth_user.email = u.email
)
THEN 1
ELSE 0
END) AS repeat_transactions
FROM
transaction
JOIN auth_user
ON transaction.buyer_id = auth_user.id
GROUP BY 1
) AS summary
ORDER BY 1;
EDIT: Changed from minus 1 month to plus 1 month after reading the question again. My understanding now is that if someone buys something in 2012-02 and then buys something again in 2012-03, then his or her transactions in 2012-02 are counted as retention for the month.
I have a table of tickets. I am trying to calculate how many tickets were "open" at each month end over the course of the current year. I am also pushing this to a bar chart, so I need to output it into an array through LINQ.
My SQL query to get my calculation is:
SELECT
(SELECT COUNT(*) FROM tblMaintenanceTicket t WHERE (CreateDate < DATEADD(MM, 1, '01/01/2012')))
-
(SELECT COUNT(*) FROM tblMaintenanceTicket t WHERE (CloseDate < DATEADD(MM, 1, '01/01/2012'))) AS 'Open #Month End'
My logic is the following: count all tickets created before the end of the month, then subtract the tickets closed before the end of the month.
UPDATED:
I have updated my query based on the comments below, but it fails with errors in the GROUP BY. I don't think I truly understand the logic; my lack of skill in SQL is to blame.
I have added a SQL Fiddle example to show you my query: http://sqlfiddle.com/#!3/c9b638/1
Desired output:
-----------
| Jan | 3 |
-----------
| Feb | 4 |
-----------
| Mar | 0 |
-----------
Your SQL has several errors. You are grouping by CreateDate, but the subqueries don't return it as a column. You also don't have a column alias on the count(*).
I think this is what you are trying to do:
select DATENAME(MONTH,CreateDate), DATEPART(YEAR,CreateDate),
(sum(case when CreateDate < DATEADD(MM, 1, '01/01/2012') then 1 else 0 end) -
sum(case when CloseDate < DATEADD(MM, 1, '01/01/2012') then 1 else 0 end)
)
from tblMaintenanceTicket
group by DATENAME(MONTH,CreateDate), DATEPART(YEAR,CreateDate)
Your comment seems to explain what you want more clearly than your question (the explanation in the question is a bit buried). What you need is a driver table of months, which you then join to your table. Something like:
select mons.yr, mons.mon, count(mt.CreateDate) as OpenTickets -- count a joined column so months with no open tickets return 0, not 1
from (select month(CreateDate) as mon, year(CreateDate) as yr,
cast(min(CreateDate) as date) as MonthStart,
cast(max(CreateDate) as date) as monthEnd
from tblMaintenanceTicket
group by month(CreateDate), year(CreateDate)
) mons left outer join
tblMaintenanceTicket mt
on mt.CreateDate <= mons.MonthEnd and
(mt.CloseDate > mons.MonthEnd or mt.CloseDate is null)
group by mons.yr, mons.mon
I am assuming records are created on every day. This is a convenience so I don't have to think about getting the first and last day of each month using other SQL functions.
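The driver-table join can be tried out on a small data set. Below is a sketch using SQLite through Python's sqlite3 module; the tickets and months tables, their columns and all rows are invented, and the month ends are listed explicitly instead of being derived from the ticket data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER, create_date TEXT, close_date TEXT)")
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?, ?)",
    [
        (1, "2012-01-05", "2012-02-10"),  # open at the end of January
        (2, "2012-01-20", "2012-01-25"),  # opened and closed within January
        (3, "2012-02-03", "2012-03-15"),  # open at the end of February
        (4, "2012-02-14", "2012-03-01"),  # open at the end of February
    ],
)

# Hypothetical driver table: one row per month end to report on.
conn.execute("CREATE TABLE months (month_end TEXT)")
conn.executemany(
    "INSERT INTO months VALUES (?)",
    [("2012-01-31",), ("2012-02-29",), ("2012-03-31",)],
)

rows = conn.execute(
    """
    SELECT m.month_end,
           COUNT(t.id) AS open_tickets  -- counts matches only, so empty months give 0
    FROM months m
    LEFT JOIN tickets t
      ON t.create_date <= m.month_end
     AND (t.close_date > m.month_end OR t.close_date IS NULL)
    GROUP BY m.month_end
    ORDER BY m.month_end
    """
).fetchall()

for row in rows:
    print(row)
```

Listing the month ends explicitly avoids the assumption that a ticket is created on the last day of every month, at the cost of maintaining the driver table.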
If your query is returning what you need, then simply use DATENAME(MONTH, yourDate) to retrieve the month and group by Month,Year:
SELECT COUNT(*), DATENAME(MONTH,yourDate), DATEPART(YEAR,yourDate)
FROM
(
your actual query here
) AS q -- SQL Server requires an alias on a derived table
GROUP BY DATENAME(MONTH,yourDate), DATEPART(YEAR,yourDate)