I have a table with the following data:
+---------+------------+--------------+
| shop_id | visit_date | visit_reason |
+---------+------------+--------------+
| A       | 2010-06-14 | shopping     |
| A       | 2010-06-15 | browsing     |
| B       | 2010-06-16 | shopping     |
| B       | 2010-06-14 | stealing     |
+---------+------------+--------------+
I need to build an aggregate table that is grouped by shop, year, month, and activity, together with total visit counts for the month and for the year. For example, if Shop A has 10 sales a month and 2 thefts a month and no other types of visit, then the result would look like:
shop_id, year, month, reason, reason_count, month_count, year_count
A, 2010, 06, shopping, 10, 12, 144
A, 2010, 06, stealing, 2, 12, 144
Where month_count is the total number of visits, of any type, to the shop during 2010-06, and year_count is the same but over the whole of 2010.
I can get everything except the month and year counts with:
SELECT
    shop_id,
    extract(year from visit_date) as year,
    extract(month from visit_date) as month,
    visit_reason as reason,
    count(visit_reason) as reason_count
FROM shop_visits
GROUP BY shop_id, year, month, visit_reason
Should I be using some kind of CTE to do a double GROUP BY?
You can use window functions to add up the counts. The following is phrased using date_trunc(), which I find more convenient for aggregating by month:
select shop_id,
       date_trunc('month', visit_date) as yyyymm,
       visit_reason as reason,
       count(*) as reason_count,
       sum(count(*)) over (partition by shop_id, date_trunc('month', visit_date)) as month_count,
       sum(count(*)) over (partition by shop_id, date_trunc('year', min(visit_date))) as year_count
from shop_visits
group by shop_id, date_trunc('month', visit_date), visit_reason;
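If you prefer the two-step CTE approach you asked about, here is a minimal sketch (assuming PostgreSQL and the shop_visits table above, not tested against your data) that aggregates per reason first and then adds the month and year totals with window functions:

-- sketch only: per-reason counts first, then month/year totals via windows
with per_reason as (
    select shop_id,
           extract(year from visit_date)  as year,
           extract(month from visit_date) as month,
           visit_reason                   as reason,
           count(*)                       as reason_count
    from shop_visits
    group by shop_id, year, month, visit_reason
)
select shop_id, year, month, reason, reason_count,
       -- total visits of any type for that shop and month
       sum(reason_count) over (partition by shop_id, year, month) as month_count,
       -- total visits of any type for that shop and year
       sum(reason_count) over (partition by shop_id, year)        as year_count
from per_reason;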
We are trying to write a query that gives the count of unique customers in a specific year-month, plus the count of unique customers in the 364 days before that date.
For example:
Our customer-table looks like this:
| order_date | customer_unique_id |
| -------- | -------------- |
| 2020-01-01 | tom#email.com |
| 2020-01-01 | daisy#email.com |
| 2019-05-02 | tom#email.com |
In this example we have two customers who ordered on 2020-01-01, and one of them had already ordered within the 364-day timeframe.
The desired table should look like this:
| year_month | unique_customers |
| -------- | -------------- |
| 2020-01 | 2 |
We tried multiple solutions, such as partitioning and window functions, but nothing seems to work correctly. The tricky part is the uniqueness: we want to look 364 days back but do a count distinct on the customers over that whole period, not per date/year/month, because then we would get duplicates. For example, if you partition by date, year or month, tom#email.com would be counted twice instead of once.
The goal of this query is to get insight into the order frequency (orders divided by customers) over a 12-month period.
We work with Google BigQuery.
Hope someone can help us out! :)
Here is a way to achieve your desired results. Note that this query computes the year-month distinct counts in a separate subquery and joins them to the rolling 364-day-interval query.
with year_month_distincts as (
select
concat(
cast(extract(year from order_date) as string),
'-',
cast(extract(month from order_date) as string)
) as year_month,
count(distinct customer_unique_id) as ym_distincts
from customer_table
group by 1
)
select x.order_date, x.ytd_distincts, y.ym_distincts from (
select
a.order_date,
(select
count(distinct customer_unique_id)
from customer_table b
where b.order_date between date_sub(a.order_date, interval 364 day) and a.order_date
) as ytd_distincts
from customer_table a
group by 1
) x
join year_month_distincts y on concat(
cast(extract(year from x.order_date) as string),
'-',
cast(extract(month from x.order_date) as string)
) = y.year_month
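One caveat with this approach: concatenating EXTRACT(YEAR ...) and EXTRACT(MONTH ...) produces keys like '2020-1' rather than '2020-01'. If zero-padded keys matter for sorting or joining, a hedged alternative sketch for the year_month_distincts part uses FORMAT_DATE:

select
-- zero-padded year-month key, e.g. '2020-01'
format_date('%Y-%m', order_date) as year_month,
count(distinct customer_unique_id) as ym_distincts
from customer_table
group by 1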
Two options using arrays that may help:
- Look back 364 days, as requested (the uncommented STRING_AGG ... RANGE BETWEEN 364 PRECEDING line below).
- Look back 11 months instead, given reporting is monthly (the commented "option 2" line below).
WITH month_array AS (
SELECT
DATE_TRUNC(order_date,month) AS order_month,
STRING_AGG(DISTINCT customer_unique_id) AS cust_mth
FROM customer_table
GROUP BY 1
),
year_array AS (
SELECT
order_month,
STRING_AGG(cust_mth) OVER(ORDER by UNIX_DATE(order_month) RANGE BETWEEN 364 PRECEDING AND CURRENT ROW) cust_12m
-- (option 2) STRING_AGG(cust_mth) OVER (ORDER by cast(format_date('%Y%m', order_month) as int64) RANGE BETWEEN 99 PRECEDING AND CURRENT ROW) AS cust_12m
FROM month_array
)
SELECT format_date('%Y-%m',order_month) year_month,
(SELECT COUNT(DISTINCT cust_unique_id) FROM UNNEST(SPLIT(cust_12m)) AS cust_unique_id) as unique_12m
FROM year_array
I have a table with the following columns: date, customers_id, and orders_id (unique).
I want to add a column in which, for each orders_id, I can see how many times the given customer has already placed an order during the previous year.
e.g. this is what it would look like:
customers_id | orders_id | date | order_rank
2083 | 4725 | 2018-08-31 | 1
2573 | 4773 | 2018-09-03 | 1
3393 | 3776 | 2017-09-11 | 1
3393 | 4172 | 2018-01-09 | 2
3393 | 4655 | 2018-08-17 | 3
I'm doing this in BigQuery, thank you!
Use count(*) with a window frame. Ideally, you would use an interval. But BigQuery doesn't (yet) support that syntax. So convert to a number:
select t.*,
       count(*) over (partition by customers_id
                      order by unix_date(date)
                      range between 364 preceding and current row
                     ) as order_rank
from t;
This treats a year as 365 days, which seems suitable for most purposes.
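As a self-contained sanity check, here is a sketch using the question's column names and three of the example rows; the inline data is mine, not from the original post:

with t as (
  select date '2017-09-11' as date, 3393 as customers_id, 3776 as orders_id union all
  select date '2018-01-09', 3393, 4172 union all
  select date '2018-08-17', 3393, 4655
)
select t.*,
       count(*) over (partition by customers_id
                      order by unix_date(date)
                      range between 364 preceding and current row
                     ) as order_rank
from t
order by customers_id, date;

Each earlier order falls within 364 days of the later ones, so the ranks come out as 1, 2, 3, matching the expected output above.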
I suggest that you use the over clause and restrict the data in your where clause. You don't really need a window frame for your case. If you consider one year to be the period from 365 days in the past until now, this will work:
select t.*,
       count(*) over (partition by customers_id
                      order by date
                     ) as c
from `your-table` t
where date > DATE_SUB(CURRENT_DATE(), INTERVAL 365 DAY)
order by customers_id, c
If you need some specific year, for example 2019, you can do something like:
select t.*,
       count(*) over (partition by customers_id
                      order by date
                     ) as c
from `your-table` t
where date between cast("2019-01-01" as date) and cast("2019-12-31" as date)
order by customers_id, c
I have the following table that shows every time a car has its tank filled. It returns the date, the car ID, the mileage at that time and the liters filled:
| Date       | Vehicle_ID | Mileage | Liters |
| ---------- | ---------- | ------- | ------ |
| 2016-10-20 | 234        | 123456  | 100    |
| 2016-10-20 | 345        | 458456  | 215    |
| 2016-10-20 | 323        | 756456  | 265    |
| 2016-10-25 | 234        | 123800  | 32     |
| 2016-10-26 | 345        | 459000  | 15     |
| 2016-10-26 | 323        | 756796  | 46     |
The idea is to calculate the average consumption by month (I can't do it by day because not every car fills the tank every day).
To get that, I tried max(mileage)-min(mileage)/sum(liters) grouped by month. But this will only work for one specific car and one specific month.
If I try one specific car over several months, the max and min are not returned properly. If I include all the cars it is even worse, as the max and min are taken as if every car were the same.
select convert(char(7), Date, 127) as year_month,
sum("Liters tanked")/(max("Mileage")-min("Mileage"))*100 as Litres_per_100KM
from Tanking
where convert(varchar(10),"Date",23) >= DATEADD(mm, -5, GETDATE())
group by convert(char(7), Date, 127)
This will not work as it will assume the max and min from all the cars.
The "workflow" shoud be this:
- For each month, get the max and min mileage for each car. Calculate max-min to get the mileage it rode that month. Sum the mileage for each car to get a total mileage driven by all the cars. Sum the liters tanked. Divide the total liters by the total mileage.
How can I get the result:
| YearMonth | Average |
| --------- | ------- |
| 2016-06   | 30      |
| 2016-07   | 32      |
| 2016-08   | 46      |
| 2016-09   | 34      |
This is a more complicated problem than it seems. The problem is that you don't want to lose miles between months. It is tempting to do something like this:
select year(date), month(date),
sum(liters) / (max(mileage) - min(mileage))
from Tanking
where Date >= dateadd(month, -5, getdate())
group by year(date), month(date);
However, this misses miles and liters that span month boundaries. In addition, the liters on the first record of the month belong to the previous mileage difference. Oops! That is not correct.
One way to fix this is to look up the next values. The query looks something like this:
select year(date), month(date),
sum(next_liters) / (max(next_mileage) - min(mileage))
from (select t.*,
lead(date) over (partition by vehicle_id order by date) as next_date,
lead(mileage) over (partition by vehicle_id order by date) as next_mileage,
lead(liters) over (partition by vehicle_id order by date) as next_liters
from Tanking t
) t
where Date >= dateadd(month, -5, getdate())
group by year(date), month(date);
These queries use simplified column names, so escape characters don't interfere with the logic.
EDIT:
Oh, you have multiple cars (probably what vehicle_Id is there for). You want two levels of aggregation. The first query would look like:
select yyyy, mm, sum(liters) as liters, sum(mileage_diff) as mileage_diff,
sum(mileage_diff) / sum(liters) as mileage_per_liter
from (select vehicle_id, year(date) as yyyy, month(date) as mm,
sum(liters) as liters,
(max(mileage) - min(mileage)) as mileage_diff
from Tanking
where Date >= dateadd(month, -5, getdate())
group by vehicle_id, year(date), month(date)
) t
group by yyyy, mm;
Similar changes would adapt the second (lead()-based) query as well, keeping vehicle_id in the partition by clauses and adding the per-month roll-up, as sketched below.
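To make that concrete, here is a hedged sketch (simplified column names, SQL Server syntax as elsewhere in this answer; my adaptation, not the author's original code): pair each fill-up with the next one per vehicle via lead(), attribute the next fill-up's liters to that mileage interval, then roll up by month:

select year(date) as yyyy, month(date) as mm,
       -- liters per 100 km across all vehicles for the month
       sum(next_liters) * 100.0 / nullif(sum(next_mileage - mileage), 0) as liters_per_100km
from (select t.*,
             lead(mileage) over (partition by vehicle_id order by date) as next_mileage,
             lead(liters)  over (partition by vehicle_id order by date) as next_liters
      from Tanking t
     ) t
where next_mileage is not null          -- drop each vehicle's last record (no next fill-up yet)
  and date >= dateadd(month, -5, getdate())
group by year(date), month(date);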
Try to get the sums per car per month in a subquery. Then calculate the average per month in an outer query using the values of the subquery:
select year_month,
(1.0*sum(liters_per_car)/sum(mileage_per_car))*100.0 as Litres_per_100KM
from (
select convert(char(7), [Date], 127) as year_month,
sum(Liters) as liters_per_car,
max(Mileage)-min(Mileage) as mileage_per_car
from Tanking
group by convert(char(7), [Date], 127), Vehicle_ID) as t
group by year_month
You can use a CTE to get the mileage difference between consecutive fill-ups (via lag()) and then calculate consumption:
You can check it here: http://rextester.com/OKZO55169
with cte (car, datec, difm, liters)
as
(
select
car,
datec,
mileage - lag(mileage,1,mileage) over(partition by car order by car, mileage) as difm,
liters
from #consum
)
select
car,
year(datec) as [year],
month(datec) as [month],
((cast(sum(liters) as float)/cast(sum(difm) as float)) * 100.0) as [l_100km]
from
cte
group by
car, year(datec), month(datec)
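If a single figure per month across all cars is what the desired output asks for, the same CTE can feed one more level of aggregation. A sketch (my adaptation, ordering the lag() by date rather than mileage, which should be equivalent as long as mileage only increases):

with cte (car, datec, difm, liters) as
(
    select
        car,
        datec,
        -- distance covered since this car's previous fill-up (0 for its first record)
        mileage - lag(mileage, 1, mileage) over (partition by car order by datec) as difm,
        liters
    from #consum
)
select
    year(datec) as [year],
    month(datec) as [month],
    (cast(sum(liters) as float) / nullif(cast(sum(difm) as float), 0)) * 100.0 as [l_100km]
from cte
group by year(datec), month(datec);

Note that, as in the query above, the liters of each car's very first fill-up are summed while its mileage difference is zero, which slightly overstates consumption for that first month.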
This works like a charm to retrieve all unique (Year, Month) combinations from my table:
SELECT
STRFTIME_UTC_USEC(TIMESTAMP_TO_USEC(Date), '%Y-%m') as month,
FROM
[Table]
It returns 2016-05, 2016-06, 2016-07, etc.
I want to do the same thing for (Year, Quarter) but have found nothing. Any tips? I know quarter handling is tricky in SQL. Thanks!
Combine the QUARTER function with the YEAR function:
select concat(string(year(timestamp)),'-Q',string(quarter(timestamp))) as year_quarter
from [table]
group by 1
For the sake of completeness, using standard SQL (uncheck "Use Legacy SQL" under "Show Options" in the UI) you can do:
SELECT DISTINCT EXTRACT(YEAR FROM t), EXTRACT(QUARTER FROM t)
FROM MyTable;
For example:
WITH MyTable AS (
SELECT t
FROM UNNEST([TIMESTAMP '2016-01-01',
TIMESTAMP '2016-04-03',
TIMESTAMP '2016-02-28',
TIMESTAMP '2017-06-25']) AS t
)
SELECT DISTINCT
EXTRACT(YEAR FROM t) AS year,
EXTRACT(QUARTER FROM t) AS quarter
FROM MyTable;
+------+---------+
| year | quarter |
+------+---------+
| 2016 | 1 |
| 2016 | 2 |
| 2017 | 2 |
+------+---------+
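If a single 'YYYY-Qn' label (like the legacy SQL answer produces) is wanted in standard SQL, a small sketch using FORMAT:

SELECT DISTINCT
  -- e.g. '2016-Q1'
  FORMAT('%d-Q%d', EXTRACT(YEAR FROM t), EXTRACT(QUARTER FROM t)) AS year_quarter
FROM MyTable;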
Using Postgres 9.5. Test data:
create temp table rental (
customer_id smallint
,rental_date timestamp without time zone
,customer_name text
);
insert into rental values
(1, '2006-05-01', 'james'),
(1, '2006-06-01', 'james'),
(1, '2006-07-01', 'james'),
(1, '2006-07-02', 'james'),
(2, '2006-05-02', 'jacinta'),
(2, '2006-05-03', 'jacinta'),
(3, '2006-05-04', 'juliet'),
(3, '2006-07-01', 'juliet'),
(4, '2006-05-03', 'julia'),
(4, '2006-06-01', 'julia'),
(5, '2006-05-05', 'john'),
(5, '2006-06-01', 'john'),
(5, '2006-07-01', 'john'),
(6, '2006-07-01', 'jacob'),
(7, '2006-07-02', 'jasmine'),
(7, '2006-07-04', 'jasmine');
I am trying to understand the behaviour of existing customers. I am trying to answer this question:
What is the likelihood of a customer to order again based on when their last order was (current month, previous month (m-1)...to m-12)?
Likelihood is calculated as:
distinct count of people who ordered in current month /
distinct count of people in their cohort.
Thus, I need to generate a table that lists a count of the people who ordered in the current month, who belong in a given cohort.
Thus, what are the rules for being in a cohort?
- current month cohort: >1 order in month OR (1 order in month given no previous orders)
- m-1 cohort: <=1 order in current month and >=1 order in m-1
- m-2 cohort: <=1 order in current month and 0 orders in m-1 and >=1 order in m-2
- etc
I am using the DVD Store database as sample data to develop the query: http://linux.dell.com/dvdstore/
Here is an example of cohort rules and aggregations, based on July being the
"month's orders being analysed" (please notice: the "month's orders being analysed" column is the first column in the 'Desired output' table below):
customer_id | jul-16| jun-16| may-16|
------------|-------|-------|-------|
james | 1 1 | 1 | 1 | <- member of jul cohort, made order in jul
jasmine | 1 1 | | | <- member of jul cohort, made order in jul
jacob | 1 | | | <- member of jul cohort, did NOT make order in jul
john | 1 | 1 | 1 | <- member of jun cohort, made order in jul
julia | | 1 | 1 | <- member of jun cohort, did NOT make order in jul
juliet | 1 | | 1 | <- member of may cohort, made order in jul
jacinta | | | 1 1 | <- member of may cohort, did NOT make order in jul
This data would output the following table:
--where m = month's orders being analysed
month's orders |how many people |how many people from |how many people |how many people from |how many people |how many people from |
being analysed |are in cohort m |cohort m ordered in m |are in cohort m-1 |cohort m-1 ordered in m |are in cohort m-2 |cohort m-2 ordered in m |...m-12
---------------|----------------|----------------------|------------------|------------------------|------------------|------------------------|
may-16 |5 |1 | | | | |
jun-16 | | |5 |3 | | |
jul-16 |3 |2 |2 |1 |2 |1 |
My attempts so far have been on variations of:
generate_series()
and
row_number() over (partition by customer_id order by rental_id desc)
I haven't been able to get everything to come together yet (I've tried for many hours and haven't yet solved it).
For readability, I think posting my work in parts is better (if anyone wants me to post the sql query in its entirety please comment - and I'll add it).
series query:
(select
generate_series(date_trunc('month', min(rental_date)), date_trunc('month', max(rental_date)), '1 month') as month_being_analysed
from
rental) as series
rank query:
(select
*,
row_number() over (partition by customer_id order by rental_id desc) as rnk
from
rental
where
date_trunc('month',rental_date) <= series.month_being_analysed) as orders_ranked
I want to do something like: run the orders_ranked query for every row returned by the series query, and then base aggregations on each return of orders_ranked.
Something like:
(--this query counts the customers in cohort m-1
select
count(distinct customer_id)
from
(--this query ranks the orders that have occured <= to the date in the row of the 'series' table
select
*,
row_number() over (partition by customer_id order by rental_id desc) as rnk
from
rental
where
date_trunc('month',rental_date)<=series.month_being_analysed) as orders_ranked
where
(rnk=1 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
OR
(rnk=2 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
) as people_2nd_last_booking_in_m_1,
(--this query counts the customers in cohort m-1 who ordered in month m
select
count(distinct customer_id)
from
(--this query returns the orders by customers in cohort m-1
select
count(distinct customer_id)
from
(--this query ranks the orders that have occured <= to the date in the row of the 'series' table
select
*,
row_number() over (partition by customer_id order by rental_id desc) as rnk
from
rental
where
date_trunc('month',rental_date)<=series.month_being_analysed) as orders_ranked
where
(rnk=1 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
OR
(rnk=2 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
where
rnk=1 in series.month_being_analysed
) as people_who_booked_in_m_whose_2nd_last_booking_was_in_m_1,
...
from
(select
generate_series(date_trunc('month', min(rental_date)), date_trunc('month', max(rental_date)), '1 month') as month_being_analysed
from
rental) as series
This query does everything. It operates on the whole table and works for any time range.
Based on some assumptions and assuming current Postgres version 9.5. Should work with pg 9.1 at least. Since your definition of "cohort" is unclear to me, I skipped the "how many people in cohort" columns.
I would expect it to be faster than anything you tried so far. By orders of magnitude.
SELECT *
FROM crosstab (
$$
SELECT mon
, sum(count(*)) OVER (PARTITION BY mon)::int AS m0
, gap -- count of months since last order
, count(*) AS gap_ct
FROM (
SELECT mon
, mon_int - lag(mon_int) OVER (PARTITION BY c_id ORDER BY mon_int) AS gap
FROM (
SELECT DISTINCT ON (1,2)
date_trunc('month', rental_date)::date AS mon
, customer_id AS c_id
, extract(YEAR FROM rental_date)::int * 12
+ extract(MONTH FROM rental_date)::int AS mon_int
FROM rental
) dist_customer
) gap_to_last_month
GROUP BY mon, gap
ORDER BY mon, gap
$$
, 'SELECT generate_series(1,12)'
) ct (mon date, m0 int
, m01 int, m02 int, m03 int, m04 int, m05 int, m06 int
, m07 int, m08 int, m09 int, m10 int, m11 int, m12 int);
Result:
mon | m0 | m01 | m02 | m03 | m04 | m05 | m06 | m07 | m08 | m09 | m10 | m11 | m12
------------+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----
2015-01-01 | 63 | 36 | 15 | 5 | 3 | 3 | | | | | | |
2015-02-01 | 56 | 35 | 9 | 9 | 2 | | 1 | | | | | |
...
m0 .. customers with >= 1 order this month
m01 .. customers with >= 1 order this month and >= 1 order 1 month before (nothing in between)
m02 .. customers with >= 1 order this month and >= 1 order 2 months before and no order in between
etc.
How?
In subquery dist_customer reduce to one row per month and customer_id (mon, c_id) with DISTINCT ON:
Select first row in each GROUP BY group?
To simplify later calculations add a count of months for the date (mon_int). Related:
How do you do date math that ignores the year?
If there are many orders per (month, customer), there are faster query techniques for the first step:
Optimize GROUP BY query to retrieve latest record per user
In subquery gap_to_last_month add the column gap indicating the time gap between this month and the last month with any orders of the same customer. Using the window function lag() for this. Related:
PostgreSQL window function: partition by comparison
In the outer SELECT aggregate per (mon, gap) to get the counts you are after. In addition, get the total count of distinct customers for this month m0.
Feed this query to crosstab() to pivot the result into the desired tabular form for the result. Basics:
PostgreSQL Crosstab Query
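One practical prerequisite (my note, not part of the original answer): crosstab() comes from the additional tablefunc module, so it has to be installed once per database, assuming you have the privileges to do so:

-- install the tablefunc extension that provides crosstab()
CREATE EXTENSION IF NOT EXISTS tablefunc;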
About the "extra" column m0:
Pivot on Multiple Columns using Tablefunc