Postgresql query to return number of customers that made more than 1 order in a month - sql

so I have this query in my code:
SELECT month, total_orders, first_orders, first_orders::numeric / total_orders::numeric * 100 AS ratio
FROM (
SELECT month, COUNT(1) AS total_orders
FROM (
SELECT DISTINCT date_trunc('month', date) AS month, email
FROM shop_orders
WHERE status_code = 'CONFIRMED'
) AS customers
GROUP BY month
) AS total_orders_my_month
LEFT JOIN (
SELECT month, COUNT(1) AS first_orders FROM (
SELECT email, date_trunc('month', MIN(date)) AS month
FROM shop_orders
WHERE status_code = 'CONFIRMED'
GROUP BY email
) AS first_orders_by_month
GROUP BY month
) AS first_orders_by_month2 USING (month)
ORDER BY month DESC
What it does is it retursn rows in a form: month | number of customers that month | number of first time customers that month | percentage of first time customers
What I need is find for each month also number of customers that made more then 1 order.
So I added this code:
LEFT JOIN (
SELECT month, COUNT(1) AS multiple_orders FROM (
SELECT email, date_trunc('month', MIN(date)) AS month
FROM shop_orders
WHERE status_code = 'CONFIRMED'
GROUP BY email HAVING COUNT(1) >=2
) AS multiple_orders_by_month
GROUP BY month
) AS multiple_orders_by_month2 USING (month)
This however returns only customers with 2 or more orders out of first-time customers not all customers. Can someone help please?
NOTES: All data is in shop_orders table, unique customer is identified by email field, date field has info about date of order.
I hope it's clear.
I am pretty new to postgresql so can someone point to the right direction? Thanks.

I would first like to point out that I am looking at this from a purely SQL route rather than postsgresql but the theory is pretty much the same. Your sub query SQL is probably overly complicated, here is my working on how I would create the query:
First, get a list of all e-mail addresses and how many orders each made that month; the following query will return the number of orders assigned to each e-mail:
SELECT
month
, email
, COUNT(*)
FROM
shop_orders
WHERE
status_code = 'Confirmed'
GROUP BY
month
, email
This will return for each month the order count for each e-mail address. You can then limit this list to all e-mails which have more than one per month using the HAVING filter method.
SELECT
month
, email
, COUNT(*) as ordersThisMonth
FROM
shop_orders
WHERE
status_code = 'Confirmed'
GROUP BY
month
, email
HAVING
COUNT(*) > 1
This sub-query could then be accessed using either another aggregate such as SUM(CASE WHEN ordersThisMonth > 1 THEN 1 ELSE 0 END) which will only count when more than one order is received. Personally I would try simplifying your query into two sub queries, one containing details which are specific to a month, the other for the month AND email.
(The following code has not been syntax checked and as previously mentioned I am looking at this from a purely SQL point of view.)
SELECT
month
, total_orders
, first_orders
/* Add together all firstMonthCount entries as it should only be 1 for the customers first month */
, SUM(firstMonthCount)::numeric / total_orders::numeric * 100 AS ratio
/* Count only e-mails which have more than one order for the month */
, SUM(CASE WHEN orderCount > 1 THEN 1 ELSE 0 END) AS multipleOrders
FROM
/*The following sub-query gets all the aggregates specific to a month */
( SELECT
date_trunc('month', date) AS month
, COUNT(1) AS total_orders
FROM
shop_orders
WHERE
status_code = 'CONFIRMED'
GROUP BY
month
) AS month_attributes
/*The following sub-query gets all the aggregates specific to a month AND e-mail */
LEFT JOIN (
SELECT
date_trunc('month', date) AS month
, email
, CASE WHEN date_trunc('month', date) = date_trunc('month', MIN(date)) THEN 1 ELSE 0 END AS firstMonthCount
, COUNT(*) AS orderCount
FROM
shop_orders
WHERE
status_code = 'CONFIRMED'
GROUP BY
month
,email
) AS email_month_attributes USING (month)
GROUP BY
month
Hopefully the above query and notes regarding the HAVING will help you find a solution that suits your purposes. I'm sure the above query could be simplified further if necessary for readability/performance.

Related

Month over Month percent change in user registrations

I am trying to write a query to find month over month percent change in user registration. \
Users table has the logs for user registrations
user_id - pk, integer
created_at - account created date, varchar
activated_at - account activated date, varchar
state - active or pending, varchar
I found the number of users for each year and month. How do I find month over month percent change in user registration? I think I need a window function?
SELECT
EXTRACT(month from created_at::timestamp) as created_month
,EXTRACT(year from created_at::timestamp) as created_year
,count(distinct user_id) as number_of_registration
FROM users
GROUP BY 1,2
ORDER BY 1,2
This is the output of above query:
Then I wrote this to find the difference in user registration in the previous year.
SELECT
*
,number_of_registration - lag(number_of_registration) over (partition by created_month) as difference_in_previous_year
FROM (
SELECT
EXTRACT(month from created_at::timestamp) as created_month
,EXTRACT(year from created_at::timestamp) as created_year
,count( user_id) as number_of_registration
FROM users as u
GROUP BY 1,2
ORDER BY 1,2) as temp
The output is this:
You want an order by clause that contains created_year.
number_of_registration
- lag(number_of_registration) over (partition by created_month order by created_year) as difference_in_previous_year
Note that you don't actually need a subquery for this. You can do:
select
extract(year from created_at) as created_year,
extract(month from created_at) as created_year
count(*) as number_of_registration,
count(*) - lag(count(*)) over(partition by extract(month from created_at) order by extract(year from created_at))
from users as u
group by created_year, created_month
order by created_year, created_month
I used count(*) instead of count(user_id), because I assume that user_id is not nullable (in which case count(*) is equivalent, and more efficient). Casting to a timestamp is also probably superfluous.
These queries work as long as you have data for every month. If you have gaps, then the problem should be addressed differently - but this is not the question you asked here.
I can get the registrations from each year as two tables and join them. But it is not that effective
SELECT
t1.created_year as year_2013
,t2.created_year as year_2014
,t1.created_month as month_of_year
,t1.number_of_registration_2013
,t2.number_of_registration_2014
,(t2.number_of_registration_2014 - t1.number_of_registration_2013) / t1.number_of_registration_2013 * 100 as percent_change_in_previous_year_month
FROM
(select
extract(year from created_at) as created_year
,extract(month from created_at) as created_month
,count(*) as number_of_registration_2013
from users
where extract(year from created_at) = '2013'
group by 1,2) t1
inner join
(select
extract(year from created_at) as created_year
,extract(month from created_at) as created_month
,count(*) as number_of_registration_2014
from users
where extract(year from created_at) = '2014'
group by 1,2) t2
on t1.created_month = t2.created_month
First off, Why are you using strings to hold date/time values? Your 1st step should to define created_at, activated_at as a proper timestamps. In the resulting query I assume this correction. If this is faulty (you do not correct it) then cast the string to timestamp in the CTE generating the date range. But keep in mind that if you leave it as text you will at some point get a conversion exception.
To calculate month-over-month use the formula "100*(Nt - Nl)/Nl" where Nt is the number of users this month and Nl is the number of users last month. There are 2 potential issues:
There are gaps in the data.
Nl is 0 (would incur divide by 0 exception)
The following handles this by first generating the months between the earliest date to the latest date then outer joining monthly counts to the generated dates. When Nl = 0 the query returns NULL indication the percent change could not be calculated.
with full_range(the_month) as
(select generate_series(low_month, high_month, interval '1 month')
from (select min(date_trunc('month',created_at)) low_month
, max(date_trunc('month',created_at)) high_month
from users
) m
)
select to_char(the_month,'yyyy-mm')
, users_this_month
, case when users_last_month = 0
then null::float
else round((100.00*(users_this_month-users_last_month)/users_last_month),2)
end percent_change
from (
select the_month, users_this_month , lag(users_this_month) over(order by the_month) users_last_month
from ( select f.the_month, count(u.created_at) users_this_month
from full_range f
left join users u on date_trunc('month',u.created_at) = f.the_month
group by f.the_month
) mc
) pc
order by the_month;
NOTE: There are several places there the above can be shortened. But the longer form is intentional to show how the final vales are derived.

Group By - select by a criteria that is met every month

The below query returns all USERS that have SUM(AMOUNT) > 10 in a given month. It includes Users in a month even if they don't meet the criteria in other months.
But I'd like to transform this query to return all USERS who must meet the criteria SUM(AMOUNT) > 10 every single month (i.e., from the first month in the table to the last one) across the entire data.
Put another way, exclude users who don't meet SUM(AMOUNT) > 10 every single month.
select USERS, to_char(transaction_date, 'YYYY-MM') as month
from Table
GROUP BY USERS, month
HAVING SUM(AMOUNT) > 10;
One approach uses a generated calendar table representing all months in your data set. We can left join this calendar table to your current query, and then aggregate over all months by user:
WITH months AS (
SELECT DISTINCT TO_CHAR(transaction_date, 'YYYY-MM') AS month
FROM yourTable
),
cte AS (
SELECT USERS, TO_CHAR(transaction_date, 'YYYY-MM') AS month
FROM yourTable
GROUP BY USERS, month
HAVING SUM(AMOUNT) > 10
)
SELECT
t.USERS
FROM months m
LEFT JOIN cte t
ON m.month = t.month
GROUP BY
t.USERS
HAVING
COUNT(t.USERS) = (SELECT COUNT(*) FROM months);
The HAVING clause above asserts that the number of months to which a user matches is in fact the total number of months. This would imply that the user meets the sum criteria for every month.
Perhaps you could use a correlated subquery, such as:
select t.*
from (select distinct table.users from table) t
where not exists
(
select to_char(u.transaction_date, 'YYYY-MM') as month
from table u
where u.users = t.users
group by month
having sum(u.amount) <= 10
)
One option would be using sign(amount-10) vs. sign(amount) logic as
SELECT q.users
FROM
(
with tab(users, transaction_date,amount) as
(
select 1,date'2018-11-24',8 union all
select 1,date'2018-11-24',18 union all
select 2,date'2018-10-24',13 union all
select 3,date'2018-11-24',18 union all
select 3,date'2018-10-24',28 union all
select 3,date'2018-09-24', 3 union all
select 4,date'2018-10-24',28
)
SELECT users, to_char(transaction_date, 'YYYY-MM') as month,
sum(sign(amount-10)) as cnt1,
sum(sign(amount)) as cnt2
FROM tab t
GROUP BY users, month
) q
GROUP BY q.users
HAVING sum(q.cnt1) = sum(q.cnt2)
GROUP BY q.users
users
-----
2
4
Rextester Demo
You need to compare the number of months > 10 to the number of months between the min and the max date:
SELECT users, Count(flag) AS months, Min(mth), Max(mth)
FROM
(
SELECT users, date_trunc('month',transaction_date) AS mth,
CASE WHEN Sum(amount) > 10 THEN 1 end AS flag
FROM tab t
GROUP BY users, mth
) AS dt
GROUP BY users
HAVING -- adding the number of months > 10 to the min date and compare to max
Min(mth) + (INTERVAL '1' MONTH * (Count(flag)-1)) = Max(mth)
If missing months don't count it would be a simple count(flag) = count(*)

Same output in two different lateral joins

I'm working on a bit of PostgreSQL to grab the first 10 and last 10 invoices of every month between certain dates. I am having unexpected output in the lateral joins. Firstly the limit is not working, and each of the array_agg aggregates is returning hundreds of rows instead of limiting to 10. Secondly, the aggregates appear to be the same, even though one is ordered ASC and the other DESC.
How can I retrieve only the first 10 and last 10 invoices of each month group?
SELECT first.invoice_month,
array_agg(first.id) first_ten,
array_agg(last.id) last_ten
FROM public.invoice i
JOIN LATERAL (
SELECT id, to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE id = i.id
ORDER BY invoice_date, id ASC
LIMIT 10
) first ON i.id = first.id
JOIN LATERAL (
SELECT id, to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE id = i.id
ORDER BY invoice_date, id DESC
LIMIT 10
) last on i.id = last.id
WHERE i.invoice_date BETWEEN date '2017-10-01' AND date '2018-09-30'
GROUP BY first.invoice_month, last.invoice_month;
This can be done with a recursive query that will generate the interval of months for who we need to find the first and last 10 invoices.
WITH RECURSIVE all_months AS (
SELECT date_trunc('month','2018-01-01'::TIMESTAMP) as c_date, date_trunc('month', '2018-05-11'::TIMESTAMP) as end_date, to_char('2018-01-01'::timestamp, 'YYYY-MM') as current_month
UNION
SELECT c_date + interval '1 month' as c_date,
end_date,
to_char(c_date + INTERVAL '1 month', 'YYYY-MM') as current_month
FROM all_months
WHERE c_date + INTERVAL '1 month' <= end_date
),
invocies_with_month as (
SELECT *, to_char(invoice_date::TIMESTAMP, 'YYYY-MM') invoice_month FROM invoice
)
SELECT current_month, array_agg(first_10.id), 'FIRST 10' as type FROM all_months
JOIN LATERAL (
SELECT * FROM invocies_with_month
WHERE all_months.current_month = invoice_month AND invoice_date >= '2018-01-01' AND invoice_date <= '2018-05-11'
ORDER BY invoice_date ASC limit 10
) first_10 ON TRUE
GROUP BY current_month
UNION
SELECT current_month, array_agg(last_10.id), 'LAST 10' as type FROM all_months
JOIN LATERAL (
SELECT * FROM invocies_with_month
WHERE all_months.current_month = invoice_month AND invoice_date >= '2018-01-01' AND invoice_date <= '2018-05-11'
ORDER BY invoice_date DESC limit 10
) last_10 ON TRUE
GROUP BY current_month;
In the code above, '2018-01-01' and '2018-05-11' represent the dates between we want to find the invoices. Based on those dates, we generate the months (2018-01, 2018-02, 2018-03, 2018-04, 2018-05) that we need to find the invoices for.
We store this data in all_months.
After we get the months, we do a lateral join in order to join the invoices for every month. We need 2 lateral joins in order to get the first and last 10 invoices.
Finally, the result is represented as:
current_month - the month
array_agg - ids of all selected invoices for that month
type - type of the selected invoices ('first 10' or 'last 10').
So in the current implementation, you will have 2 rows for each month (if there is at least 1 invoice for that month). You can easily join that in one row if you need to.
LIMIT is working fine. It's your query that's broken. JOIN is just 100% the wrong tool here; it doesn't even do anything close to what you need. By joining up to 10 rows with up to another 10 rows, you get up to 100 rows back. There's also no reason to self join just to combine filters.
Consider instead window queries. In particular, we have the dense_rank function, which can number every row in the result set according to groups:
SELECT
invoice_month,
time_of_month,
ARRAY_AGG(id) invoice_ids
FROM (
SELECT
id,
invoice_month,
-- Categorize as end or beginning of month
CASE
WHEN month_rank <= 10 THEN 'beginning'
WHEN month_reverse_rank <= 10 THEN 'end'
ELSE 'bug' -- Should never happen. Just a fall back in case of a bug.
END AS time_of_month
FROM (
SELECT
id,
invoice_month,
dense_rank() OVER (PARTITION BY invoice_month ORDER BY invoice_date) month_rank,
dense_rank() OVER (PARTITION BY invoice_month ORDER BY invoice_date DESC) month_rank_reverse
FROM (
SELECT
id,
invoice_date,
to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE invoice_date BETWEEN date '2017-10-01' AND date '2018-09-30'
) AS fiscal_year_invoices
) ranked_invoices
-- Get first and last 10
WHERE month_rank <= 10 OR month_reverse_rank <= 10
) first_and_last_by_month
GROUP BY
invoice_month,
time_of_month
Don't be intimidated by the length. This query is actually very straightforward; it just needed a few subqueries.
This is what it does logically:
Fetch the rows for the fiscal year in question
Assign a "rank" to the row within its month, both counting from the beginning and from the end
Filter out everything that doesn't rank in the 10 top for its month (counting from either direction)
Adds an indicator as to whether it was at the beginning or end of the month. (Note that if there's less than 20 rows in a month, it will categorize more of them as "beginning".)
Aggregate the IDs together
This is the tool set designed for the job you're trying to do. If really needed, you can adjust this approach slightly to get them into the same row, but you have to aggregate before joining the results together and then join on the month; you can't join and then aggregate.

Days Since Last Help Ticket was Filed

I am trying to create a report to show me the last date a customer filed a ticket.
Customers can file dozens of tickets. I want to know when the last ticket was filed and show how many days it's been since they have done so.
The fields I have are:
Customer,
Ticket_id,
Date_Closed
All from the Same table "Tickets"
I'm thinking I want to do a ranking of tickets by min date? I tried this query to grab something but it's giving me all the tickets from the customer. (I'm using SQL in a product called Domo)
select * from (select *, rank() over (partition by "Ticket_id"
order by "Date_Closed" desc) as date_order
from tickets ) zd
where date_order = 1
This should be simple enough,
SELECT customer,
MAX (date_closed) last_date,
ROUND((SYSDATE - MAX (date_closed)),0) days_since_last_ticket_logged
FROM emp
GROUP BY customer
select Customer, datediff(day, date_closed, current_date) as days_since_last_tkt
from
(select *, rank() over (partition by Customer order by "Date_Closed" desc) as date_order
from tickets) zd
join tickets t on zd.date_closed = t.date_closed
where zd.date_order = 1
Or you can simply do
select customer, datediff(day, max(Date_closed), current_date) as days_since_last_tkt
from tickets
group by customer
To select other fields
select t.*
from tickets t
join (select customer, max(Date_closed) as mxdate,
datediff(day, max(Date_closed), current_date) as days_since_last_tkt
from tickets
group by customer) tt
on t.customer = tt.customer and tt.mxdate = t.date_closed
I would do this with a simple sub-query to select the last closed date for the customer. Then compare this to today with datediff() to get the number of days since last closed.
Select
LastTicket.Customer,
LastTicket.LastClosedDate,
DateDiff(day,LastTicket.LastClosedDate,getdate()) as DaysSinceLastClosed
From
(select
tickets.customer
max(tickets.dateClosed) as LastClosedDate
from tickets
Group By tickets.Customer) as LastTicket
Based on the responses this is what I did:
select "Customer",
Max("date_closed") "last_date,
round(datediff(DAY, CURRENT_DATE, max("date_closed")), 0) as "Closed_date"
from tickets
group by "Customer"
ORDER BY "Customer"

SQL: aggregation (group by like) in a column

I have a select that group by customers spending of the past two months by customer id and date. What I need to do is to associate for each row the total amount spent by that customer in the whole first week of the two month time period (of course it would be a repetition for each row of one customer, but for some reason that's ok ). do you know how to do that without using a sub query as a column?
I was thinking using some combination of OVER PARTITION, but could not figure out how...
Thanks a lot in advance.
Raffaele
Query:
select customer_id, date, sum(sales)
from transaction_table
group by customer_id, date
If it's a specific first week (e.g. you always want the first week of the year, and your data set normally includes January and February spending), you could use sum(case...):
select distinct customer_id, date, sum(sales) over (partition by customer_ID, date)
, sum(case when date between '1/1/15' and '1/7/15' then Sales end)
over (partition by customer_id) as FirstWeekSales
from transaction_table
In response to the comments below; I'm not sure if this is what you're looking for, since it involves a subquery, but here's my best shot:
select distinct a.customer_id, date
, sum(sales) over (partition by a.customer_ID, date)
, sum(case when date between mindate and dateadd(DD, 7, mindate)
then Sales end)
over (partition by a.customer_id) as FirstWeekSales
from transaction_table a
left join
(select customer_ID, min(date) as mindate
from transaction_table group by customer_ID) b
on a.customer_ID = b.customer_ID