Month over Month percent change in user registrations - sql

I am trying to write a query to find the month-over-month percent change in user registrations.
The users table holds the log of user registrations:
user_id - pk, integer
created_at - account created date, varchar
activated_at - account activated date, varchar
state - active or pending, varchar
I found the number of users for each year and month. How do I find month over month percent change in user registration? I think I need a window function?
SELECT
EXTRACT(month from created_at::timestamp) as created_month
,EXTRACT(year from created_at::timestamp) as created_year
,count(distinct user_id) as number_of_registration
FROM users
GROUP BY 1,2
ORDER BY 1,2
This is the output of above query:
Then I wrote this to find the difference in registrations compared with the same month of the previous year.
SELECT
*
,number_of_registration - lag(number_of_registration) over (partition by created_month) as difference_in_previous_year
FROM (
SELECT
EXTRACT(month from created_at::timestamp) as created_month
,EXTRACT(year from created_at::timestamp) as created_year
,count( user_id) as number_of_registration
FROM users as u
GROUP BY 1,2
ORDER BY 1,2) as temp
The output is this:

You want an order by clause that contains created_year.
number_of_registration
- lag(number_of_registration) over (partition by created_month order by created_year) as difference_in_previous_year
Note that you don't actually need a subquery for this. You can do:
select
extract(year from created_at) as created_year,
extract(month from created_at) as created_month,
count(*) as number_of_registration,
count(*) - lag(count(*)) over(partition by extract(month from created_at) order by extract(year from created_at)) as difference_in_previous_year
from users as u
group by created_year, created_month
order by created_year, created_month
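I used count(*) instead of count(user_id) because I assume that user_id is not nullable (in which case count(*) is equivalent and more efficient). The cast to a timestamp is superfluous only if created_at is stored as a date or timestamp; with the varchar column described in the question, keep the ::timestamp cast inside each extract.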
These queries work as long as you have data for every month. If you have gaps, then the problem should be addressed differently - but this is not the question you asked here.
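To make the caveat about gaps concrete, here is a minimal Python sketch of what lag(...) over (partition by month order by year) computes. The function name and the sample counts are made up for illustration. Like lag, it compares each row with the nearest earlier year that exists for that month, which is not necessarily the year before.

```python
def year_over_year_difference(counts):
    """counts: {(year, month): registrations}.

    Returns {(year, month): difference from the previous available year, or None}.
    """
    result = {}
    for (year, month), n in counts.items():
        # Rows in the same "partition" (same month) that sort before this one.
        earlier_years = [y for (y, m) in counts if m == month and y < year]
        if earlier_years:
            result[(year, month)] = n - counts[(max(earlier_years), month)]
        else:
            result[(year, month)] = None  # first row of the partition: lag is NULL
    return result


# Hypothetical counts; February 2014 is missing on purpose.
counts = {(2013, 1): 100, (2014, 1): 130, (2013, 2): 80, (2015, 2): 60}
differences = year_over_year_difference(counts)
print(differences)
```

With the gap in the sample data, February 2015 is compared against February 2013, which is exactly the kind of silent mismatch the answer warns about.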

I can get the registrations for each year as two derived tables and join them, but that is not very efficient:
SELECT
t1.created_year as year_2013
,t2.created_year as year_2014
,t1.created_month as month_of_year
,t1.number_of_registration_2013
,t2.number_of_registration_2014
-- multiply by 100.0 first so the division is not truncated integer division
,100.0 * (t2.number_of_registration_2014 - t1.number_of_registration_2013) / t1.number_of_registration_2013 as percent_change_in_previous_year_month
FROM
(select
extract(year from created_at) as created_year
,extract(month from created_at) as created_month
,count(*) as number_of_registration_2013
from users
where extract(year from created_at) = '2013'
group by 1,2) t1
inner join
(select
extract(year from created_at) as created_year
,extract(month from created_at) as created_month
,count(*) as number_of_registration_2014
from users
where extract(year from created_at) = '2014'
group by 1,2) t2
on t1.created_month = t2.created_month

First off, why are you using strings to hold date/time values? Your first step should be to define created_at and activated_at as proper timestamps. The query below assumes this correction. If you do not correct it, cast the string to a timestamp in the CTE that generates the date range and in the join condition. But keep in mind that if you leave it as text you will at some point get a conversion exception.
To calculate month-over-month change, use the formula "100*(Nt - Nl)/Nl", where Nt is the number of users this month and Nl is the number of users last month. There are 2 potential issues:
There are gaps in the data.
Nl is 0 (which would raise a divide-by-zero exception).
The following handles both by first generating every month between the earliest and the latest date, then outer joining the monthly counts to the generated months. When Nl = 0 the query returns NULL, indicating that the percent change could not be calculated.
with full_range(the_month) as
(select generate_series(low_month, high_month, interval '1 month')
from (select min(date_trunc('month',created_at)) low_month
, max(date_trunc('month',created_at)) high_month
from users
) m
)
select to_char(the_month,'yyyy-mm')
, users_this_month
, case when users_last_month = 0
then null::float
else round((100.00*(users_this_month-users_last_month)/users_last_month),2)
end percent_change
from (
select the_month, users_this_month , lag(users_this_month) over(order by the_month) users_last_month
from ( select f.the_month, count(u.created_at) users_this_month
from full_range f
left join users u on date_trunc('month',u.created_at) = f.the_month
group by f.the_month
) mc
) pc
order by the_month;
NOTE: There are several places where the above can be shortened, but the longer form is intentional to show how the final values are derived.
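For readers who want to check the arithmetic outside the database, the same logic can be sketched in Python. This is only an illustration under assumed inputs: the function name and sample dates are hypothetical, gap months are counted as 0, and the percent change is None when there is no previous month or the previous month is 0.

```python
from collections import Counter
from datetime import date


def month_over_month(signup_dates):
    """Return [(yyyy-mm, count, percent_change_or_None)] for every month in range."""
    if not signup_dates:
        return []
    # Count registrations per (year, month).
    counts = Counter((d.year, d.month) for d in signup_dates)
    first, last = min(counts), max(counts)

    rows = []
    previous = None  # the first month has no "last month"
    year, month = first
    while (year, month) <= last:
        current = counts.get((year, month), 0)  # months with no data count as 0
        if previous is None or previous == 0:
            change = None  # no baseline, or division by zero
        else:
            change = round(100.0 * (current - previous) / previous, 2)
        rows.append((f"{year:04d}-{month:02d}", current, change))
        previous = current
        # Step to the next calendar month.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return rows


# Hypothetical sample: March 2014 has no registrations.
sample = [date(2014, 1, 5), date(2014, 1, 20), date(2014, 2, 1),
          date(2014, 2, 2), date(2014, 2, 3), date(2014, 4, 9)]
for row in month_over_month(sample):
    print(row)
```

The empty March shows both edge cases: it appears with a count of 0 and a change of -100.0, and April then has no computable change because its baseline is 0.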

Related

Day wise Rolling 30 day uniques user count bigquery

I am trying to generate a day-by-day rolling 30-day unique user count using the query below. The problem is that I have to rerun the query for each day. I need the rolling 30-day count for every day of August in one script. Please help.
SELECT max(date),count(DISTINCT user_id) as MAU
FROM user_data
WHERE date between DATE_SUB('2020-08-31' ,INTERVAL 29 DAY) and '2020-08-31';
BigQuery doesn't support rolling windows for count(distinct). So, one approach is a brute force method:
select dte,
(select count(distinct ud.user_id)
from user_data ud
where ud.date between DATE_SUB(dte, INTERVAL 29 DAY) and dte
) as num_users
from unnest(generate_date_array(date('2020-08-01'), date('2020-08-31'))) dte
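The brute-force idea is easy to verify on a small sample outside BigQuery. The Python sketch below is only an illustration with a hypothetical function name and made-up events: for each day it rescans all events and counts the distinct users whose date falls in the 30 days ending on that day, mirroring the correlated subquery.

```python
from datetime import date, timedelta


def rolling_unique_users(events, start, end, window_days=30):
    """events: list of (date, user_id).

    Returns {day: number of distinct users in the window ending on day}.
    """
    result = {}
    day = start
    while day <= end:
        # Same bounds as BETWEEN DATE_SUB(dte, INTERVAL 29 DAY) AND dte.
        window_start = day - timedelta(days=window_days - 1)
        users = {user for when, user in events if window_start <= when <= day}
        result[day] = len(users)
        day += timedelta(days=1)
    return result


# Hypothetical events.
events = [
    (date(2020, 7, 3), "a"),
    (date(2020, 7, 20), "b"),
    (date(2020, 8, 1), "a"),
    (date(2020, 8, 15), "c"),
]
mau = rolling_unique_users(events, date(2020, 8, 1), date(2020, 8, 31))
print(mau[date(2020, 8, 18)])
```

Like the SQL, this does one full scan per output day, so it is simple rather than fast.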
Gordon's approach works great.
If you need to calculate more metrics, cross join the data:
SELECT
date_gen,
COUNT(DISTINCT IF(ud.date BETWEEN DATE_SUB(date_gen ,INTERVAL 29 DAY) AND date_gen,ud.user_id,NULL)) as MAU
FROM
UNNEST(GENERATE_DATE_ARRAY(DATE_SUB('2020-08-31' ,INTERVAL 29 DAY), date('2020-08-31'))) date_gen,
(SELECT * FROM user_data WHERE date BETWEEN DATE_SUB('2020-08-31' ,INTERVAL 60 DAY) AND '2020-08-31') AS ud
GROUP BY 1
ORDER BY 1 DESC
With DECLARE and SET you can avoid repeating the date literal multiple times.
Below is for BigQuery Standard SQL
#standardSQL
SELECT date, (SELECT COUNT(DISTINCT id) FROM t.users AS id) AS MAU
FROM (
SELECT date, ARRAY_AGG(user_id) OVER(mau_win) users
FROM `project.dataset.user_data`
WINDOW mau_win AS (
ORDER BY UNIX_DATE(date) DESC RANGE BETWEEN CURRENT ROW AND 29 FOLLOWING
)
) t
The above assumes you have entries in the project.dataset.user_data table for every day in the period of interest.
If this is not the case and you have gaps in your data, you can use the query below:
#standardSQL
SELECT date, (SELECT COUNT(DISTINCT id) FROM t.users AS id) AS MAU
FROM (
SELECT date, ARRAY_AGG(user_id) OVER(mau_win) users
FROM UNNEST(GENERATE_DATE_ARRAY('2020-08-01', '2020-08-31')) AS date
LEFT JOIN `project.dataset.user_data`
USING(date)
WINDOW mau_win AS (
ORDER BY UNIX_DATE(date) DESC RANGE BETWEEN CURRENT ROW AND 29 FOLLOWING
)
) t

how to find consecutive user login across week

I'm fairly new to SQL & maybe the complexity level for this report is above my pay grade
I need help to figure out the list of users who log in to the app every week, without a break, during a chosen time period (this logic eventually needs to be extended to month, quarter and year, but a week is good for now).
Table structure for ref
events: User_id int, login_date timestamp
The table events can have 1 or more entries for a user. This inherently means that the user can login multiple times to the app. To shed some light, if we focus on Jan 2020- Mar2020 then I need the following in the output
user_id who logged into the app at least once every week from 2020wk1 to 2020Wk14
the week they logged in
number of times they logged in that week
I'm also okay if the output of the query is just the user_id. The thing is I'm unable to make sense out of the output that I'm seeing on my end after trying the following SQL code, perhaps working on this problem for so long might be the reason for that!
SQL code tried so far:
SELECT DISTINCT user_id
,extract('year' FROM timestamp)||'Wk'|| extract('week' FROM timestamp)
,lead(extract('week' FROM timestamp)) over (partition by user_id, extract('week' FROM timestamp) order by extract('week' FROM timestamp))
FROM events
WHERE user_id = 'Anything that u wish to enter'
You can get the summary you want as:
select user_id, date_trunc('week', timestamp) as week, count(*)
from events
group by user_id, week;
But the filtering is trickier. It is better to go with dates rather than week numbers:
select user_id, date_trunc('week', timestamp) as week, count(*) as cnt,
count(*) over (partition by user_id) as num_weeks
from events
where timestamp >= ? and timestamp < ?
group by user_id, week;
Then you can use a subquery:
select uw.*
from (select user_id, date_trunc('week', timestamp) as week, count(*) as cnt,
count(*) over (partition by user_id) as num_weeks
from events
where timestamp >= ? and timestamp < ?
group by user_id, week
) uw
where num_weeks = ? -- 14 in your example
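To sanity-check the "count the distinct weeks per user" idea on a few rows, here is a rough Python equivalent. It is a sketch only: the function names and sample logins are hypothetical, and it assumes the period starts and ends on a Monday so that whole weeks can be counted.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_start(d):
    """Monday of the week containing d (the same bucketing as date_trunc('week', ...))."""
    return d - timedelta(days=d.weekday())


def users_active_every_week(logins, start, end):
    """logins: list of (user_id, date). start is inclusive, end is exclusive.

    Returns {user_id: {week_start: login_count}} for users seen in every week.
    """
    expected_weeks = (end - start).days // 7
    per_user = defaultdict(lambda: defaultdict(int))
    for user_id, day in logins:
        if start <= day < end:
            per_user[user_id][week_start(day)] += 1
    return {
        user_id: dict(weeks)
        for user_id, weeks in per_user.items()
        if len(weeks) == expected_weeks  # same test as num_weeks = ?
    }


# Hypothetical logins over three weeks starting Monday 2020-01-06.
logins = [
    (1, date(2020, 1, 6)), (1, date(2020, 1, 7)),
    (1, date(2020, 1, 15)), (1, date(2020, 1, 26)),
    (2, date(2020, 1, 6)), (2, date(2020, 1, 8)), (2, date(2020, 1, 20)),
]
active = users_active_every_week(logins, date(2020, 1, 6), date(2020, 1, 27))
print(active)
```

User 2 logs in three times but only in two distinct weeks, so only user 1 is returned, together with the per-week login counts.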

Same output in two different lateral joins

I'm working on a bit of PostgreSQL to grab the first 10 and last 10 invoices of every month between certain dates. I am having unexpected output in the lateral joins. Firstly the limit is not working, and each of the array_agg aggregates is returning hundreds of rows instead of limiting to 10. Secondly, the aggregates appear to be the same, even though one is ordered ASC and the other DESC.
How can I retrieve only the first 10 and last 10 invoices of each month group?
SELECT first.invoice_month,
array_agg(first.id) first_ten,
array_agg(last.id) last_ten
FROM public.invoice i
JOIN LATERAL (
SELECT id, to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE id = i.id
ORDER BY invoice_date, id ASC
LIMIT 10
) first ON i.id = first.id
JOIN LATERAL (
SELECT id, to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE id = i.id
ORDER BY invoice_date, id DESC
LIMIT 10
) last on i.id = last.id
WHERE i.invoice_date BETWEEN date '2017-10-01' AND date '2018-09-30'
GROUP BY first.invoice_month, last.invoice_month;
This can be done with a recursive query that generates the range of months for which we need to find the first and last 10 invoices.
WITH RECURSIVE all_months AS (
SELECT date_trunc('month','2018-01-01'::TIMESTAMP) as c_date, date_trunc('month', '2018-05-11'::TIMESTAMP) as end_date, to_char('2018-01-01'::timestamp, 'YYYY-MM') as current_month
UNION
SELECT c_date + interval '1 month' as c_date,
end_date,
to_char(c_date + INTERVAL '1 month', 'YYYY-MM') as current_month
FROM all_months
WHERE c_date + INTERVAL '1 month' <= end_date
),
invocies_with_month as (
SELECT *, to_char(invoice_date::TIMESTAMP, 'YYYY-MM') invoice_month FROM invoice
)
SELECT current_month, array_agg(first_10.id), 'FIRST 10' as type FROM all_months
JOIN LATERAL (
SELECT * FROM invocies_with_month
WHERE all_months.current_month = invoice_month AND invoice_date >= '2018-01-01' AND invoice_date <= '2018-05-11'
ORDER BY invoice_date ASC limit 10
) first_10 ON TRUE
GROUP BY current_month
UNION
SELECT current_month, array_agg(last_10.id), 'LAST 10' as type FROM all_months
JOIN LATERAL (
SELECT * FROM invocies_with_month
WHERE all_months.current_month = invoice_month AND invoice_date >= '2018-01-01' AND invoice_date <= '2018-05-11'
ORDER BY invoice_date DESC limit 10
) last_10 ON TRUE
GROUP BY current_month;
In the code above, '2018-01-01' and '2018-05-11' are the dates between which we want to find the invoices. Based on those dates, we generate the months (2018-01, 2018-02, 2018-03, 2018-04, 2018-05) that we need to find the invoices for.
We store this data in all_months.
After we get the months, we do a lateral join in order to join the invoices for every month. We need 2 lateral joins in order to get the first and last 10 invoices.
Finally, the result is represented as:
current_month - the month
array_agg - ids of all selected invoices for that month
type - type of the selected invoices ('first 10' or 'last 10').
So in the current implementation, you will have 2 rows for each month (if there is at least 1 invoice for that month). You can easily join that in one row if you need to.
LIMIT is working fine. It's your query that's broken. JOIN is just 100% the wrong tool here; it doesn't even do anything close to what you need. By joining up to 10 rows with up to another 10 rows, you get up to 100 rows back. There's also no reason to self join just to combine filters.
Consider window functions instead. In particular, we have the dense_rank function, which can number every row in the result set within its group:
SELECT
invoice_month,
time_of_month,
ARRAY_AGG(id) invoice_ids
FROM (
SELECT
id,
invoice_month,
-- Categorize as end or beginning of month
CASE
WHEN month_rank <= 10 THEN 'beginning'
WHEN month_reverse_rank <= 10 THEN 'end'
ELSE 'bug' -- Should never happen. Just a fall back in case of a bug.
END AS time_of_month
FROM (
SELECT
id,
invoice_month,
dense_rank() OVER (PARTITION BY invoice_month ORDER BY invoice_date) month_rank,
dense_rank() OVER (PARTITION BY invoice_month ORDER BY invoice_date DESC) month_reverse_rank
FROM (
SELECT
id,
invoice_date,
to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE invoice_date BETWEEN date '2017-10-01' AND date '2018-09-30'
) AS fiscal_year_invoices
) ranked_invoices
-- Get first and last 10
WHERE month_rank <= 10 OR month_reverse_rank <= 10
) first_and_last_by_month
GROUP BY
invoice_month,
time_of_month
Don't be intimidated by the length. This query is actually very straightforward; it just needed a few subqueries.
This is what it does logically:
Fetch the rows for the fiscal year in question
Assign a "rank" to the row within its month, both counting from the beginning and from the end
Filter out everything that doesn't rank in the top 10 for its month (counting from either direction)
Add an indicator as to whether it was at the beginning or end of the month. (Note that if there are fewer than 20 rows in a month, it will categorize more of them as "beginning".)
Aggregate the IDs together
This is the tool set designed for the job you're trying to do. If really needed, you can adjust this approach slightly to get them into the same row, but you have to aggregate before joining the results together and then join on the month; you can't join and then aggregate.
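The rank-then-filter idea can also be sketched in Python to check expectations on a small sample. This is an illustration with hypothetical names and data, and it differs from the SQL in two ways that are worth knowing: it uses positional ranks (like row_number) rather than dense_rank, so ties on the date do not widen the result, and an invoice may appear in both lists when a month has fewer than 2n rows.

```python
from collections import defaultdict
from datetime import date


def first_and_last_per_month(invoices, n=10):
    """invoices: list of (invoice_id, invoice_date).

    Returns {(year, month): (first_n_ids, last_n_ids)}.
    """
    by_month = defaultdict(list)
    for invoice_id, invoice_date in invoices:
        key = (invoice_date.year, invoice_date.month)
        by_month[key].append((invoice_date, invoice_id))

    result = {}
    for month, rows in by_month.items():
        rows.sort()  # order by invoice_date, then by id
        ids = [invoice_id for _, invoice_id in rows]
        result[month] = (ids[:n], ids[-n:])
    return result


# Hypothetical data: seven invoices in January 2018, one in February.
invoices = [(i, date(2018, 1, i)) for i in range(1, 8)] + [(100, date(2018, 2, 3))]
grouped = first_and_last_per_month(invoices, n=3)
print(grouped)
```

Each month is grouped and ordered once, and the limits are applied per month, which is the part the original self-join could not express.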

Days Since Last Help Ticket was Filed

I am trying to create a report to show me the last date a customer filed a ticket.
Customers can file dozens of tickets. I want to know when the last ticket was filed and show how many days it's been since they have done so.
The fields I have are:
Customer,
Ticket_id,
Date_Closed
All from the Same table "Tickets"
I'm thinking I want to do a ranking of tickets by min date? I tried this query to grab something but it's giving me all the tickets from the customer. (I'm using SQL in a product called Domo)
select * from (select *, rank() over (partition by "Ticket_id"
order by "Date_Closed" desc) as date_order
from tickets ) zd
where date_order = 1
This should be simple enough,
SELECT customer,
MAX (date_closed) last_date,
ROUND((SYSDATE - MAX (date_closed)),0) days_since_last_ticket_logged
FROM tickets
GROUP BY customer
select zd.Customer, datediff(day, zd.Date_Closed, current_date) as days_since_last_tkt
from
(select *, rank() over (partition by Customer order by "Date_Closed" desc) as date_order
from tickets) zd
where zd.date_order = 1
Or you can simply do
select customer, datediff(day, max(Date_closed), current_date) as days_since_last_tkt
from tickets
group by customer
To select other fields
select t.*
from tickets t
join (select customer, max(Date_closed) as mxdate,
datediff(day, max(Date_closed), current_date) as days_since_last_tkt
from tickets
group by customer) tt
on t.customer = tt.customer and tt.mxdate = t.date_closed
I would do this with a simple sub-query to select the last closed date for the customer. Then compare this to today with datediff() to get the number of days since last closed.
Select
LastTicket.Customer,
LastTicket.LastClosedDate,
DateDiff(day,LastTicket.LastClosedDate,getdate()) as DaysSinceLastClosed
From
(select
tickets.customer,
max(tickets.Date_Closed) as LastClosedDate
from tickets
Group By tickets.Customer) as LastTicket
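The same "latest date per customer, then subtract from today" logic is short enough to check by hand. Here is a minimal Python sketch with a hypothetical function name and sample tickets; today is passed in explicitly so the result is reproducible.

```python
from datetime import date


def days_since_last_ticket(tickets, today):
    """tickets: list of (customer, ticket_id, date_closed).

    Returns {customer: (last_closed_date, days_since_last_closed)}.
    """
    last_closed = {}
    for customer, _ticket_id, closed in tickets:
        # Keep only the most recent close date per customer (the MAX ... GROUP BY step).
        if customer not in last_closed or closed > last_closed[customer]:
            last_closed[customer] = closed
    return {c: (d, (today - d).days) for c, d in last_closed.items()}


# Hypothetical tickets.
tickets = [
    ("Acme", 1, date(2019, 1, 10)),
    ("Acme", 2, date(2019, 3, 1)),
    ("Globex", 3, date(2019, 2, 20)),
]
report = days_since_last_ticket(tickets, today=date(2019, 3, 11))
print(report)
```

Note that the subtraction is today minus the last close date, so the result is a non-negative number of days.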
Based on the responses this is what I did:
select "Customer",
Max("date_closed") "last_date",
round(datediff(DAY, CURRENT_DATE, max("date_closed")), 0) as "Closed_date"
from tickets
group by "Customer"
ORDER BY "Customer"

RedShift: Alternative to 'where in' to compare annual login activity

Here are the two cases:
Members Lost: Get the distinct count of user ids from 365 days ago who haven't had any activity since then
Members Added: Get the distinct count of user ids from today who don't exist in the previous 365 days.
Here are the SQL statements I've been writing. Logically I feel like this should work (and it does for sample data), but the dataset is 5Million+ rows and takes forever! Is there any way to do this more efficiently? (base_date is a calendar that I'm joining on to build out a 2 year trend. I figured this was faster than joining the 5million table on itself...)
-- Members Lost
SELECT
effective_date,
COUNT(DISTINCT dwuserid) as members_lost
FROM base_date
LEFT JOIN site_visit
-- Get Login Activity for 365th day
ON DATEDIFF(day, srclogindate, effective_date) = 365
WHERE dwuserid NOT IN (
-- Get Distinct Login activity for Current Day (PY) + 1 to Current Day (CY) (i.e. 2013-01-02 to 2014-01-01)
SELECT DISTINCT dwuserid
FROM site_visit b
WHERE DATEDIFF(day, b.srclogindate, effective_date) BETWEEN 0 AND 364
)
GROUP BY effective_date
ORDER BY effective_date;
-- Members Added
SELECT
effective_date,
COUNT(DISTINCT dwuserid) as members_added
FROM base_date
LEFT JOIN site_visit ON srclogindate = effective_date
WHERE dwuserid NOT IN (
SELECT DISTINCT dwuserid
FROM site_visit b
WHERE DATEDIFF(day, b.srclogindate, effective_date) BETWEEN 1 AND 365
)
GROUP BY effective_date
ORDER BY effective_date;
Thanks in advance for any help.
UPDATE
Thanks to @JohnR for pointing me in the right direction. I had to tweak your response a bit because I need to know, for any login day, how many were "Member Added" or "Member Lost", so it had to be a 365-day rolling window looking back or looking forward. Finding the IDs that didn't have a match in the LEFT JOIN was much faster.
-- Trim data down to one user login per day
CREATE TABLE base_login AS
SELECT DISTINCT "dwuserid", "srclogindate"
FROM site_visit
-- Members Lost
SELECT
current."srclogindate",
COUNT(DISTINCT current."dwuserid") as "members_lost"
FROM base_login current
LEFT JOIN base_login future
ON current."dwuserid" = future."dwuserid"
AND current."srclogindate" < future."srclogindate"
AND DATEADD(day, 365, current."srclogindate") >= future."srclogindate"
WHERE future."dwuserid" IS NULL
GROUP BY current."srclogindate"
-- Members Added
SELECT
current."srclogindate",
COUNT(DISTINCT current."dwuserid") as "members_added"
FROM base_login current
LEFT JOIN base_login past
ON current."dwuserid" = past."dwuserid"
AND current."srclogindate" > past."srclogindate"
AND DATEADD(day, 365, past."srclogindate") >= current."srclogindate"
WHERE past."dwuserid" IS NULL
GROUP BY current."srclogindate"
NOT IN should generally be avoided because it has to scan all data.
Instead of joining to the site_visit table (which is presumably huge), try joining to a sub-query that selects UserID and the most recent login date -- that way, there is only one row per user instead of one row per visit.
For example:
SELECT dwuserid, min (srclogindate) as first_login, max(srclogindate) as last_login
FROM site_visit
GROUP BY dwuserid
You could then simplify the queries to something like:
-- Members Lost: Last login was between 12 and 13 months ago
SELECT
COUNT(*)
FROM
(
SELECT dwuserid, min(srclogindate) as first_login, max(srclogindate) as last_login
FROM site_visit
GROUP BY dwuserid
) AS user_logins
WHERE
last_login BETWEEN current_date - interval '13 months' and current_date - interval '12 months'
-- Members Added: First visit in last 12 months
SELECT
COUNT(*)
FROM
(
SELECT dwuserid, min(srclogindate) as first_login, max(srclogindate) as last_login
FROM site_visit
GROUP BY dwuserid
) AS user_logins
WHERE
first_login > current_date - interval '12 months'
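The one-row-per-user summary can be tried out on a handful of rows in Python. This is a sketch under stated assumptions, not the Redshift implementation: the function names and sample visits are hypothetical, and 12 and 13 months are approximated as 365 and 395 days.

```python
from datetime import date, timedelta


def summarize_logins(visits):
    """visits: list of (user_id, login_date). Returns {user_id: (first_login, last_login)}."""
    summary = {}
    for user_id, day in visits:
        first, last = summary.get(user_id, (day, day))
        summary[user_id] = (min(first, day), max(last, day))
    return summary


def members_lost(summary, today, near_days=365, far_days=395):
    """Users whose last login falls between far_days and near_days ago (inclusive)."""
    lower = today - timedelta(days=far_days)
    upper = today - timedelta(days=near_days)
    return sum(1 for _, last in summary.values() if lower <= last <= upper)


def members_added(summary, today, days=365):
    """Users whose first login is within the last `days` days."""
    cutoff = today - timedelta(days=days)
    return sum(1 for first, _ in summary.values() if first > cutoff)


# Hypothetical visits.
visits = [
    ("u1", date(2012, 12, 15)),                          # last seen about a year ago
    ("u2", date(2012, 11, 1)), ("u2", date(2013, 6, 1)),  # old user, still active
    ("u3", date(2013, 5, 5)),                            # first seen this year
    ("u4", date(2011, 1, 1)),                            # lost long ago
]
summary = summarize_logins(visits)
print(members_lost(summary, date(2014, 1, 1)), members_added(summary, date(2014, 1, 1)))
```

Because the summary holds one row per user, both counts are a single pass over users rather than a comparison of every visit against every other visit.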