How to create monthly running sum when some months have no records - sql

The goal is to graph the total volume of created orders over time in a monthly digest:
WITH monthly_sums AS (
SELECT
date_trunc('month', created_at) AS month,
sum(count(created_at)) OVER (ORDER BY date_trunc('month', created_at)) AS sum
FROM orders
GROUP BY date_trunc('month', created_at)
)
SELECT
to_char(date_range.month, 'Month') AS month,
COALESCE(monthly_sums.sum, 0) AS total
FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 month') date_range(month)
LEFT OUTER JOIN monthly_sums
ON monthly_sums.month = date_range.month;
Which returns:
month | total
-----------+-------
October | 0
November | 0
December | 0
January | 1
February | 0 <-- should be 1
March | 3
(6 rows)
selecting from monthly_sums returns:
month | sum
---------------------+-----
2015-01-01 00:00:00 | 1
| <-- no records created in February
2015-03-01 00:00:00 | 3
| 3
(3 rows)
The problem is there were no orders in February so the total is coalesced into 0. How can I alter or rethink this query in order to get the desired result?

I am unfamiliar with the PostgreSQL syntax for this construct, but the general pattern for implementing this type of operation efficiently is the same across SQL engines:
1. Aggregate by month;
2. Left join from the full set of desired periods to the aggregates in (1), coalescing absent sums to zero;
3. Perform the running summation by period.
Use either CTEs or subqueries to build 1, 2, and 3 successively.
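The three steps above can be sketched outside SQL as well. The following is a minimal Python illustration, not the PostgreSQL solution itself; the sample data and the helper name month_range are assumptions made up for the example (one order in January 2015, none in February, two in March).

```python
from itertools import accumulate

# Hypothetical sample data: (year, month) of each order's created_at.
order_months = [(2015, 1), (2015, 3), (2015, 3)]

def month_range(start, end):
    """Yield every (year, month) from start to end, inclusive."""
    y, m = start
    while (y, m) <= end:
        yield (y, m)
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)

# Step 1: aggregate by month.
counts = {}
for ym in order_months:
    counts[ym] = counts.get(ym, 0) + 1

# Step 2: left join the full list of periods to the aggregates,
# treating months without orders as zero.
periods = list(month_range((2014, 10), (2015, 3)))
filled = [counts.get(ym, 0) for ym in periods]

# Step 3: running sum over the gap-free series.
running = list(accumulate(filled))
result = list(zip(periods, running))
```

Because the zero-filling happens before the running sum, February carries January's total forward instead of dropping to 0.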

Following Pieter's answer, this altered query gives the correct result:
WITH monthly_counts AS (
SELECT
date_trunc('month', created_at) AS month,
count(created_at) AS count
FROM orders
GROUP BY date_trunc('month', created_at)
)
SELECT
to_char(sq.month, 'Month') AS month,
sum(sq.count) OVER (ORDER BY sq.month)
FROM
(SELECT
date_range.month,
COALESCE(monthly_counts.count, 0) AS count
FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 month') date_range(month)
LEFT OUTER JOIN monthly_counts
ON monthly_counts.month = date_range.month) AS sq;

Related

Count distinct customers who bought in previous period and not in next period Bigquery

I have a dataset in bigquery which contains order_date: DATE and customer_id.
order_date | CustomerID
2019-01-01 | 111
2019-02-01 | 112
2020-01-01 | 111
2020-02-01 | 113
2021-01-01 | 115
2021-02-01 | 119
I am trying to count distinct customer_id values between a month of the previous year and the same month of the current year: for example, from 2019-01-01 to 2020-01-01, then from 2019-02-01 to 2020-02-01. I also want to count the customers who did not buy in the same period of the next year: 2020-01-01 to 2021-01-01, then 2020-02-01 to 2021-02-01.
The output I expect:
order_date| count distinct CustomerID|who not buy in the next period
2020-01-01| 5191 |250
2020-02-01| 4859 |500
2020-03-01| 3567 |349
..........| .... |......
The next periods shouldn't include the previous ones.
I tried the code below, but it works differently:
with customers as (
select distinct date_trunc(date(order_date),month) as dates,
CUSTOMER_WID
from t
where date(order_date) between '2018-01-01' and current_date()-1
)
select
dates,
customers_previous,
customers_next_period
from
(
select dates,
count(CUSTOMER_WID) as customers_previous,
count(case when customer_wid_next is null then 1 end) as customers_next_period,
from (
select prev.dates,
prev.CUSTOMER_WID,
next.dates as next_dates,
next.CUSTOMER_WID as customer_wid_next
from customers as prev
left join customers
as next on next.dates=date_add(prev.dates,interval 1 year)
and prev.CUSTOMER_WID=next.CUSTOMER_WID
) as t2
group by dates
)
order by 1,2
Thanks in advance.
If I understand correctly, you are trying to count values over a window of time, and for that I recommend using window functions.
That said, my recommendation would be:
SELECT DISTINCT
periods,
COUNT(DISTINCT customer_id) OVER last_12_mos AS count_customers_last_12_mos
FROM (
SELECT
order_date,
FORMAT_DATE('%Y%m', order_date) AS periods,
customer_id
FROM dataset
)
WINDOW last_12_mos AS ( # window of last 12 months without current month
PARTITION BY periods ORDER BY periods DESC
ROWS BETWEEN 12 PRECEDING AND 1 PRECEDING
)
I believe from this you can build some customizations to improve the aggregations you want.
You can generate the periods using unnest(generate_date_array()). Then use joins to bring in the customers from the previous 12 months and the next 12 months. Finally, aggregate and count the customers:
select period,
count(distinct c_prev.customer_wid),
count(distinct c_next.customer_wid)
from unnest(generate_date_array(date '2020-01-01', date '2021-01-01', interval 1 month)) period join
customers c_prev
on c_prev.order_date <= period and
c_prev.order_date > date_add(period, interval -12 month) left join
customers c_next
on c_next.customer_wid = c_prev.customer_wid and
c_next.order_date > period and
c_next.order_date <= date_add(period, interval 12 month)
group by period;
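The join logic above can be checked on a small example. This is a minimal Python sketch under stated assumptions: the orders list is hypothetical (the question's sample rows plus one extra order for customer 112), and the names shift_years and summarize are made up for the illustration. Note that the query counts customers who did buy again in the next 12 months; the customers who did not are the difference between the two counts.

```python
from datetime import date

# Hypothetical orders: (order_date, customer_id).
orders = [
    (date(2019, 1, 1), 111), (date(2019, 2, 1), 112),
    (date(2020, 1, 1), 111), (date(2020, 2, 1), 113),
    (date(2020, 6, 1), 112),
    (date(2021, 1, 1), 115), (date(2021, 2, 1), 119),
]

def shift_years(d, years):
    # Safe here because every period is the first day of a month.
    return d.replace(year=d.year + years)

def summarize(period):
    """Return (customers in previous 12 months, retained, not retained)."""
    # Previous window: period - 12 months < order_date <= period.
    prev = {c for d, c in orders if shift_years(period, -1) < d <= period}
    # Next window: period < order_date <= period + 12 months.
    nxt = {c for d, c in orders if period < d <= shift_years(period, 1)}
    return len(prev), len(prev & nxt), len(prev - nxt)
```

For example, summarize(date(2020, 1, 1)) looks at customers 111 and 112 in the previous window; only 112 orders again in the next window, so one customer is retained and one is not.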

SQL Find Last 30 Days records count grouped by

I am trying to retrieve the count of customers daily per each status in a dynamic window - Last 30 days.
The result of the query should show each day how many customers there are per each customer status (A,B,C) for the Last 30 days (i.e today() - 29 days). Every customer can have one status at a time but change from one status to another within the customer lifetime. The purpose of this query is to show customer 'movement' across their lifetime. I've generated a series of date ranging from the first date a customer was created until today.
I've put together the following query but it appears that something I'm doing is incorrect because the results depict most days as having the same count across all statuses which is not possible, each day new customers are created. We checked with another simple query and confirmed that the split between statuses is not equal.
I tried to depict below the data and the SQL I use to reach the optimal result.
Starting point (example table customer_statuses):
customer_id | status | created_at
---------------------------------------------------
abcdefg1234 B 2019-08-22
abcdefg1234 C 2019-01-17
...
abcdefg1234 A 2018-01-18
bcdefgh2232 A 2017-09-02
ghijklm4950 B 2018-06-06
statuses - A,B,C
There is no sequential order for the statuses, a customer can have any status at the start of the business relationship and switch between statuses throughout their lifetime.
table customers:
id | f_name | country | created_at |
---------------------------------------------------------------------
abcdefg1234 Michael FR 2018-01-18
bcdefgh2232 Sandy DE 2017-09-02
....
ghijklm4950 Daniel NL 2018-06-06
SQL - current version:
WITH customer_list AS (
SELECT
DISTINCT a.id,
a.created_at
FROM
customers a
),
dates AS (
SELECT
generate_series(
MIN(DATE_TRUNC('day', created_at)::DATE),
MAX(DATE_TRUNC('day', now())::DATE),
'1d'
)::date AS day
FROM customers a
),
customer_statuses AS (
SELECT
customer_id,
status,
created_at,
ROW_NUMBER() OVER
(
PARTITION BY customer_id
ORDER BY created_at DESC
) col
FROM
customer_status
)
SELECT
day,
(
SELECT
COUNT(DISTINCT id) AS accounts
FROM customers
WHERE created_at::date BETWEEN day - 29 AND day
),
status
FROM dates d
LEFT JOIN customer_list cus
ON d.day = cus.created_at
LEFT JOIN customer_statuses cs
ON cus.id = cs.customer_id
WHERE
cs.col = 1
GROUP BY 1,3
ORDER BY 1 DESC,3 ASC
Currently what the results from the query look like:
day | count | status
-------------------------
2020-01-24 1230 C
2020-01-24 1230 B
2020-01-24 1230 A
2020-01-23 1200 C
2020-01-23 1200 B
2020-02-23 1200 A
2020-02-22 1150 C
2020-02-22 1150 B
...
2017-01-01 50 C
2017-01-01 50 B
2017-01-01 50 A
Two things I've noticed from the results above. First, most of the time the results show the same count across all statuses on a given day. Second, there are days on which only two statuses appear, which should not be the case. If no new accounts are created on a given day with a certain status, the count of the previous day should be carried over, right? Or is this a problem with the query I created, or with the logic I have in mind?
Perhaps I'm expecting a result that will not happen logically?
Required result:
day | count | status
-------------------------
2020-01-24 1230 C
2020-01-24 1000 B
2020-01-24 2500 A
2020-01-23 1200 C
2020-01-23 1050 B
2020-02-23 2450 A
2020-02-22 1160 C
2020-02-22 1020 B
2020-02-22 2400 A
...
2017-01-01 10 C
2017-01-01 4 B
2017-01-01 50 A
Thank You!
Your query seems overly complicated. Here is another approach:
Use lead() to get when the status ends for each customer status record.
Use generate_series() to generate the days.
The rest is just filtering and aggregation:
select gs.dte, cs.status, count(*)
from (select cs.*,
lead(cs.created_at, 1, now()::date) over (partition by cs.customer_id order by cs.created_at) as next_ca
from customer_statuses cs
) cs cross join lateral
generate_series(cs.created_at, cs.next_ca - interval '1 day', interval '1 day') gs(dte)
where gs.dte >= now()::date - interval '30 day' -- keep only the last 30 days
group by gs.dte, cs.status;
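The lead() plus generate_series() idea can be sketched in Python to show what happens at the interval boundaries. This is an illustration only, with hypothetical sample rows and a fixed today standing in for now()::date. Each status record covers the half-open range from its own created_at up to, but not including, the next record's created_at, which mirrors next_ca - interval '1 day' in the query.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical status history: (customer_id, status, created_at).
rows = [
    ("c1", "A", date(2020, 1, 30)),
    ("c1", "B", date(2020, 2, 1)),
    ("c2", "A", date(2020, 1, 31)),
]
today = date(2020, 2, 3)  # stands in for now()::date

# Group by customer and sort by created_at (PARTITION BY ... ORDER BY ...).
by_customer = {}
for cid, status, created in sorted(rows, key=lambda r: (r[0], r[2])):
    by_customer.setdefault(cid, []).append((status, created))

counts = Counter()
for history in by_customer.values():
    for i, (status, start) in enumerate(history):
        # lead(created_at, 1, today): the next status start,
        # or today for the customer's latest record.
        end = history[i + 1][1] if i + 1 < len(history) else today
        day = start
        while day < end:  # half-open interval [start, end)
            counts[(day, status)] += 1
            day += timedelta(days=1)
```

With a half-open interval, the day of a status change is counted once, under the new status, and the default end of today means the current day itself is not generated for the latest record.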
I've altered the query a bit because I noticed that I get duplicate records on the days a customer changes status: one record with the old status and one record with the new one.
For example, output with @Gordon's query:
dte | status
---------------------------
2020-02-12 B
... ...
01.02.2020 A
01.02.2020 B
31.01.2020 A
30.01.2020 A
I've adapted the query (see below). The results now depict the changes between statuses correctly (no duplicate records on the day of change); however, the records continue only up until now()::date - interval '1day' and do not include now()::date (today). I'm not sure why, and I can't find the correct logic to make all of this work the way I want.
The goal: dates correctly depict the status of each customer, and the statuses returned include today.
Adjusted query:
select gs.dte, cs.status, count(*)
from (select cs.*,
lead(cs.created_at, 1, now()::date) over (partition by cs.customer_id order by cs.created_at) - INTERVAL '1day' as next_ca
from customer_statuses cs
) cs cross join lateral
generate_series(cs.created_at, cs.next_ca, interval '1 day') gs(dte)
where gs.dte >= now()::date - interval '30 day' -- keep only the last 30 days
group by gs.dte, cs.status;
The two adjustments:
The adjustments also seem counter-intuitive, as it seems I'm taking the one-day interval away from one part of the query only to add it to another (which, to me, should yield the same result).
a - added the decrease of 1 day from the lead function (line 3)
lead(cs.created_at, 1, now()::date) over (partition by cs.customer_id order by cs.created_at) - INTERVAL '1 day' as next_ca
b - removed the decrease of 1 day from the next_ca variable (line 6)
generate_series(cs.created_at, cs.next_ca - interval '1 day', interval '1 day')
Example of the output with the adjusted query:
dte | status
---------------------------
2020-02-11 B
... ...
01.02.2020 B
31.01.2020 A
30.01.2020 A
Thanks for your help!

Get historical count and current count on parking data

I previously got very good help here on SO with analyzing parking data. This is my query:
select parking_meter_id, avg(cnt) from
(select parking_meter_id, count(*) as cnt, to_char(start,'YYYYMMDD') as day
from parking_transactions
where start >= now() - interval '3 month' -- last three months
and to_char(start,'YYYYMMDD') < to_char(now(),'YYYYMMDD') -- but not today
and to_char(start,'D') = to_char(now(),'D') -- same weekday
and to_char(now(),'HH24MISS') between to_char(start,'HH24MISS') and to_char(stop,'HH24MISS') -- same time
group by parking_meter_id, to_char(start,'YYYYMMDD') -- group by day
) as parking_transactions group by parking_meter_id
It works and shows the average count of active transactions, because transactions from today (now()) are filtered out.
Is it possible, in same run through, to have the query also return the current active transactions:
select count(*) as cnt from parking_transactions where now() between start and stop
so one can easily compare the current status with the historical?
My table structure is:
parking_meter_id, start, stop
Currently I get the following output:
parking_meter_id, avg(cnt) minus today
I would like to have the following output:
parking_meter_id, avg(cnt) minus today, count(*) for today only
The line commented with -- but not today is the WHERE condition that ignores today's transactions.
An example of output as of now is the following:
parking_meter_id | cnt | day
------------------+-----+----------
4406 | 1 | 20141217
4406 | 5 | 20150107
4406 | 1 | 20150121
4406 | 3 | 20150128
4406 | 3 | 20150114
I would like to have returned:
parking_meter_id | avg(cnt-without-today) | cnt-day
------------------+-----+------------------------------
4406 | 2.6 | 3
Use WITH to create temporary tables for the daily count and for the average count minus today, then join them to get the desired result:
WITH daily_count AS -- temp table to store daily counts
(
SELECT parking_meter_id,
COUNT(*) AS cnt,
to_char(start,'YYYYMMDD') AS day
FROM parking_transactions
WHERE start >= now() - interval '3 month' -- last three months
AND to_char(start,'D') = to_char(now(),'D') -- same weekday
AND to_char(now(),'HH24MISS') BETWEEN to_char(start,'HH24MISS') AND to_char(stop,'HH24MISS') -- same time
GROUP BY parking_meter_id,
to_char(start,'YYYYMMDD') -- group by parking meter id and day
), avg_count_minus_today AS -- temp table to store avg count minus today
(
SELECT parking_meter_id,
AVG(cnt) AS avg_count
FROM daily_count
WHERE day < to_char(now(),'YYYYMMDD') -- but not today
GROUP BY parking_meter_id
)
SELECT a.parking_meter_id,
a.avg_count, --avg count minus today
d.cnt AS today_count
FROM avg_count_minus_today a
INNER JOIN daily_count d
ON a.parking_meter_id= d.parking_meter_id AND d.day=to_char(now(),'YYYYMMDD'); --today in daily count

How to create a pivot table by product by month in SQL

I have 3 tables:
users (id, account_balance)
grocery (user_id, date, amount_paid)
fishmarket (user_id, date, amount_paid)
Both fishmarket and grocery tables may have multiple occurrences for the same user_id with different dates and amounts paid or have nothing at all for any given user. I am trying to develop a pivot table of the following structure:
id | grocery_amount_paid_January | fishmarket_amount_paid_January
1 10 NULL
2 40 71
The only idea I can come up with is to create multiple left joins, but this must be wrong, since there would be 24 joins (one per month for each product). Is there a better way?
I have provided a lot of answers on crosstab queries in PostgreSQL lately. Sometimes a "plain" query like the following does the job:
WITH x AS (SELECT '2012-01-01'::date AS _from
,'2012-12-01'::date As _to) -- provide date range once in CTE
SELECT u.id
,to_char(m.mon, 'MM.YYYY') AS month_year
,g.amount_paid AS grocery_amount_paid
,f.amount_paid AS fishmarket_amount_paid
FROM users u
CROSS JOIN (SELECT generate_series(_from, _to, '1 month') AS mon FROM x) m
LEFT JOIN (
SELECT user_id
,date_trunc('month', date) AS mon
,sum(amount_paid) AS amount_paid
FROM x, grocery -- CROSS JOIN with a single row
WHERE date >= _from
AND date < (_to + interval '1 month')
GROUP BY 1,2
) g ON g.user_id = u.id AND m.mon = g.mon
LEFT JOIN (
SELECT user_id
,date_trunc('month', date) AS mon
,sum(amount_paid) AS amount_paid
FROM x, fishmarket
WHERE date >= _from
AND date < (_to + interval '1 month')
GROUP BY 1,2
) f ON f.user_id = u.id AND m.mon = f.mon
ORDER BY u.id, m.mon;
produces this output:
id | month_year | grocery_amount_paid | fishmarket_amount_paid
---+------------+---------------------+------------------------
1 | 01.2012 | 10 | NULL
1 | 02.2012 | NULL | 65
1 | 03.2012 | 98 | 13
...
2 | 01.2012 | 40 | 71
2 | 02.2012 | NULL | NULL
Major points
The first CTE is for convenience only. So you have to type your date range once only. You can use any date range - as long as it's dates with the first of the month (rest of the month will be included!). You could add date_trunc() to it, but I guess you can keep the urge to use invalid dates in check.
First CROSS JOIN users to the result of generate_series() (m) which provides one row per month in your date range. You have learned in your last question how that results in multiple rows per user.
The two subqueries are identical twins. Use WHERE clauses that operate on the base column, so it can utilize an index - which you should have if your table runs over many years (no use for only one or two years, a sequential scan will be faster):
CREATE INDEX grocery_date ON grocery (date);
Then reduce all dates to the first of the month with date_trunc() and sum amount_paid per user_id and the resulting mon.
LEFT JOIN the result to the base table, again by user_id and the resulting mon. This way, rows are neither multiplied nor dropped. You get one row per user_id and month. Voilà.
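The same shape (cross join users with months, then left join each pre-aggregated table) can be sketched in Python. This is only an illustration of the technique; the sample payments and the helper name monthly_sums are assumptions chosen to reproduce a few rows like those in the output above.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample data: (user_id, date, amount_paid).
users = [1, 2]
months = [date(2012, 1, 1), date(2012, 2, 1)]
grocery = [(1, date(2012, 1, 5), 10),
           (2, date(2012, 1, 9), 25),
           (2, date(2012, 1, 20), 15)]
fishmarket = [(2, date(2012, 1, 11), 71),
              (1, date(2012, 2, 3), 65)]

def monthly_sums(payments):
    """Sum amount_paid per (user_id, first day of month)."""
    sums = defaultdict(int)
    for user_id, paid_on, amount in payments:
        sums[(user_id, paid_on.replace(day=1))] += amount
    return sums

g = monthly_sums(grocery)
f = monthly_sums(fishmarket)

# CROSS JOIN users x months, then LEFT JOIN both aggregates;
# a missing key yields None, the counterpart of NULL.
rows = [(u, m, g.get((u, m)), f.get((u, m)))
        for u in users for m in months]
```

Aggregating each table before joining is what keeps the row count at exactly one per user and month.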
BTW, I'd never use a column name id. Call it user_id in the table users as well.

Using outer query result in a subquery in postgresql

I have two tables points and contacts and I'm trying to get the average points.score per contact grouped on a monthly basis. Note that points and contacts aren't related, I just want the sum of points created in a month divided by the number of contacts that existed in that month.
So, I need to sum points grouped by the created_at month, and I need to take the count of contacts FOR THAT MONTH ONLY. It's that last part that's tripping me up. I'm not sure how I can use a column from an outer query in the subquery. I tried something like this:
SELECT SUM(score) AS points_sum,
EXTRACT(month FROM created_at) AS month,
date_trunc('MONTH', created_at) + INTERVAL '1 month' AS next_month,
(SELECT COUNT(id) FROM contacts WHERE contacts.created_at <= next_month) as contact_count
FROM points
GROUP BY month, next_month
ORDER BY month
So, I'm extracting the actual month in which my points are being summed and, at the same time, getting the beginning of next_month so that I can say "Get me the count of contacts whose created_at is < next_month".
But it complains that column next_month doesn't exist. This is understandable, as the subquery knows nothing about the outer query. Qualifying with points.next_month doesn't work either.
So can someone point me in the right direction of how to achieve this?
Tables:
Points
score | created_at
10 | "2011-11-15 21:44:00.363423"
11 | "2011-10-15 21:44:00.69667"
12 | "2011-09-15 21:44:00.773289"
13 | "2011-08-15 21:44:00.848838"
14 | "2011-07-15 21:44:00.924152"
Contacts
id | created_at
6 | "2011-07-15 21:43:17.534777"
5 | "2011-08-15 21:43:17.520828"
4 | "2011-09-15 21:43:17.506452"
3 | "2011-10-15 21:43:17.491848"
1 | "2011-11-15 21:42:54.759225"
sum, month and next_month (without the subselect)
sum | month | next_month
14 | 7 | "2011-08-01 00:00:00"
13 | 8 | "2011-09-01 00:00:00"
12 | 9 | "2011-10-01 00:00:00"
11 | 10 | "2011-11-01 00:00:00"
10 | 11 | "2011-12-01 00:00:00"
Edit
Now with running sum of contacts. My first draft used new contacts per month, which is obviously not what OP wants.
WITH c AS (
SELECT created_at
,count(id) OVER (order BY created_at) AS ct
FROM contacts
), p AS (
SELECT date_trunc('month', created_at) AS month
,sum(score) AS points_sum
FROM points
GROUP BY 1
)
SELECT p.month
,EXTRACT(month FROM p.month) AS month_nr
,p.points_sum
,( SELECT c.ct
FROM c
WHERE c.created_at < (p.month + interval '1 month')
ORDER BY c.created_at DESC
LIMIT 1) AS contacts
FROM p
ORDER BY 1
This works for any number of months across the years.
Assumes that no month is missing in the table points. If you want all months, including missing ones in points, generate a list of months with generate_series() and LEFT JOIN to it.
Build a running sum in a CTE with a window function.
Neither CTE is strictly necessary - they are for performance and simplification only.
Get the contact count (the contacts column) in a subselect.
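The points above can be illustrated with a small Python sketch. It uses the sample rows from the question with the timestamps shortened to minutes; the helper name next_month_start is an assumption for the example. A binary search over the sorted contact timestamps plays the role of the running count: the number of contacts created strictly before the start of the next month.

```python
from bisect import bisect_left
from collections import defaultdict
from datetime import datetime

# Sample rows from the question, timestamps shortened for readability.
contacts = sorted([
    datetime(2011, 7, 15, 21, 43), datetime(2011, 8, 15, 21, 43),
    datetime(2011, 9, 15, 21, 43), datetime(2011, 10, 15, 21, 43),
    datetime(2011, 11, 15, 21, 42),
])
points = [
    (10, datetime(2011, 11, 15, 21, 44)), (11, datetime(2011, 10, 15, 21, 44)),
    (12, datetime(2011, 9, 15, 21, 44)), (13, datetime(2011, 8, 15, 21, 44)),
    (14, datetime(2011, 7, 15, 21, 44)),
]

def next_month_start(month_start):
    """First instant of the month following month_start."""
    if month_start.month == 12:
        return month_start.replace(year=month_start.year + 1, month=1)
    return month_start.replace(month=month_start.month + 1)

# Sum points per month (the CTE p).
points_sum = defaultdict(int)
for score, created in points:
    month = created.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    points_sum[month] += score

# For each month, count contacts with created_at < start of next month.
report = [(month, total, bisect_left(contacts, next_month_start(month)))
          for month, total in sorted(points_sum.items())]
```

bisect_left returns the number of elements strictly smaller than its argument, which matches the < comparison in the query.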
Your original form of the query could work like this:
SELECT month
,EXTRACT(month FROM month) AS month_nr
,points_sum
,(SELECT count(*)
FROM contacts c
WHERE c.created_at < (p.month + interval '1 month')) AS contact_count
FROM (
SELECT date_trunc('MONTH', created_at) AS month
,sum(score) AS points_sum
FROM points p
GROUP BY 1
) p
ORDER BY 1
The fix for the immediate cause of your error is to put the aggregate into a subquery. You were mixing levels in a way that is impossible.
I expect my variant to be slightly faster with big tables. Not sure about smaller tables. Would be great if you'd report back with test results.
Plus a minor fix: < instead of <=.