Can Window Functions be used with Grouping Sets? (SQL)

Introduction
In this Q&A we will explore the use of window functions and grouping sets in the same query using PostgreSQL. The reader should be familiar with both of those concepts to get the most out of this post. This fiddle can be used to follow along.
Data
The following table contains a few transactions. For each transaction we have the customer, the city where it took place and the number of units delivered to the customer.
CREATE TABLE transactions (customer TEXT, city TEXT, units INT);
INSERT INTO transactions VALUES ('Bob', 'Manchester', 10),
('Chuck', 'Manchester', 20),
('Bob', 'Concord', 10),
('Tim', 'Manchester', 15),
('Jane', 'Derry', 10),
('Tim', 'Derry', 15),
('Tim', 'Concord', 20),
('Bob', 'Manchester', 20),
('Chuck', 'Concord', 10);
Desired results
I want to be able to produce a report which answers questions like "what proportion of all transactions does this row represent?" or "what is the ratio of this row to all the transactions by this customer?". I also want to be able to answer every such possible question. With that kind of report we could simply look at each row and extract information about its relation to the whole report or to specific slices of the data.
First attempt
We could attempt to create such a report by using multiple queries with different GROUP BY clauses and then uniting them with UNION ALL, or we can try something like the following:
-- Incorrect results
SELECT customer
, city
, SUM(units) AS "units"
, ROUND( 100 * SUM(units) / SUM(SUM(units)) OVER () , 2) AS "to report"
, ROUND( 100 * SUM(units) / SUM(SUM(units)) OVER (PARTITION BY customer) , 2) AS "to customer"
, ROUND( 100 * SUM(units) / SUM(SUM(units)) OVER (PARTITION BY city) , 2) AS "to city"
FROM transactions
GROUP BY CUBE(customer, city)
Here we use CUBE in the GROUP BY clause which will produce groupings corresponding to all possible combinations of the customer and city columns. The numerator of the ratios is an aggregate that corresponds to the units total for that row. Notice that it is the same for all ratios and that it contains the expression used as the "units" column, i.e. SUM(units). Calculating the denominator is more complicated because we need a window function to calculate the total number of units for the wanted slice, i.e. the total for a particular customer or city or the total for the whole report.
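To see concretely what CUBE produces, here is a small Python sketch (my own illustration, not part of the original post) that rebuilds the four grouping sets from the sample data. Note that every grouping covers the same 130 units:

```python
# Sketch (mine, not part of the post): what CUBE(customer, city) produces.
rows = [("Bob", "Manchester", 10), ("Chuck", "Manchester", 20), ("Bob", "Concord", 10),
        ("Tim", "Manchester", 15), ("Jane", "Derry", 10), ("Tim", "Derry", 15),
        ("Tim", "Concord", 20), ("Bob", "Manchester", 20), ("Chuck", "Concord", 10)]

COLUMNS = {"customer": 0, "city": 1}

def group_by(keys):
    """Aggregate SUM(units) the way GROUP BY over `keys` would."""
    totals = {}
    for row in rows:
        key = tuple(row[COLUMNS[k]] for k in keys)
        totals[key] = totals.get(key, 0) + row[2]
    return totals

# CUBE(customer, city) is shorthand for these four grouping sets
for keys in [("customer", "city"), ("customer",), ("city",), ()]:
    totals = group_by(keys)
    print(keys, "->", len(totals), "rows, units total", sum(totals.values()))
```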
Incorrect Results
Unfortunately the ratios produced by the query above are not correct. For example, the first row has 10 units, which is 7.7% of the total (130), 25% of the total for Bob (40), and 25% of the total for Concord (40), yet the results show less than the correct ratio in all cases. As another example, take the row where both "customer" and "city" are NULL: here the "units" column is 130 and yet the calculated ratio "to report" is 25%. Clearly the denominator in the ratio columns is wrong. How can we get the desired results?
customer  city        units  to report  to customer  to city
Bob       Concord        10       1.92        12.50    12.50
Bob       Manchester     30       5.77        37.50    23.08
Bob       null           40       7.69        50.00    15.38
Chuck     Concord        10       1.92        16.67    12.50
Chuck     Manchester     20       3.85        33.33    15.38
Chuck     null           30       5.77        50.00    11.54
Jane      Derry          10       1.92        50.00    20.00
Jane      null           10       1.92        50.00     3.85
Tim       Concord        20       3.85        20.00    25.00
Tim       Derry          15       2.88        15.00    30.00
Tim       null           50       9.62        50.00    19.23
Tim       Manchester     15       2.88        15.00    11.54
null      null          130      25.00        50.00    50.00
null      Manchester     65      12.50        25.00    50.00
null      Derry          25       4.81         9.62    50.00
null      Concord        40       7.69        15.38    50.00

Why are the results wrong?
Notice that although the results in the question are wrong they are not totally nonsensical. Take, for example, the row in which both "customer" and "city" are NULL. This row has 130 units which is the total number of units in the data, so we should expect the ratio "to report" to be 100% but the result shows 25%, which means that the denominator of the ratio in that case was four times 130, or 520. Take the first row as another example, here we have 10 units and a ratio "to report" of 1.92%, again the denominator is wrong by a factor of four, i.e. the actual ratio should be 7.69%. Clearly the total of the report is taken to be four times what it actually is.
The results for the "to customer" and "to city" columns are wrong as well but by a different factor. Take for example the rows where "customer" is NULL and "city" is not NULL. The ratio "to city" for the three cities is 50% but should be 100%. This is because the denominator of the ratio is twice what it should be. The same thing happens for the "to customer" ratio. The problem lies in the partitioning of the rows. For instance, there is no PARTITION BY clause in the "to report" column, so we are taking into consideration all of the rows of the report, which add up to 520.
Consider that the GROUP BY clause produces four groupings:
(customer, city) - equivalent to GROUP BY customer, city
(customer) - equivalent to GROUP BY customer
(city) - equivalent to GROUP BY city
() - equivalent to using an aggregate without GROUP BY clause
Each of those groupings is a different way to slice the data and for each the units total is 130. In the case of the "to customer" and "to city" columns we have a denominator that is twice as large as it should be. Take for example the cities and notice that the rows where "city" is not NULL contain the units for each city twice: once where "customer" is not NULL and another time where "customer" is NULL. These two categories correspond to the first and third groupings above, respectively. The same can be said for the customers but the rows would correspond to the first and second groupings. Clearly when calculating the denominator of each ratio we need to take into consideration that different rows belong to different slices of the data.
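The double and quadruple counting described above can be checked with a short Python sketch (mine, not from the post) that materializes the 16 report rows and sums the slices the window functions actually see:

```python
# Sketch (mine): rebuild the four groupings and check the window denominators.
rows = [("Bob", "Manchester", 10), ("Chuck", "Manchester", 20), ("Bob", "Concord", 10),
        ("Tim", "Manchester", 15), ("Jane", "Derry", 10), ("Tim", "Derry", 15),
        ("Tim", "Concord", 20), ("Bob", "Manchester", 20), ("Chuck", "Concord", 10)]

def totals(keyfunc):
    out = {}
    for r in rows:
        out[keyfunc(r)] = out.get(keyfunc(r), 0) + r[2]
    return out

# the 16 rows of the CUBE report, one entry per grouping-set result row
report = (list(totals(lambda r: (r[0], r[1])).items())   # (customer, city)
        + list(totals(lambda r: (r[0], None)).items())   # (customer)
        + list(totals(lambda r: (None, r[1])).items())   # (city)
        + [((None, None), sum(r[2] for r in rows))])     # ()

# SUM(...) OVER () adds up every row of the report: four groupings x 130
print(sum(units for _, units in report))                      # 520

# SUM(...) OVER (PARTITION BY city) mixes two groupings per city
print(sum(u for (c, ci), u in report if ci == "Manchester"))  # 130, not 65
```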
The Key
The key to getting the desired results is to use the GROUPING aggregate function. This function "returns a bit mask indicating which GROUP BY expressions are not included in the current grouping set". In other words it can give a different result for each grouping produced by the GROUP BY clause. We can use this function to help calculate the denominator for each of the ratio columns. To get the desired effect we use the GROUPING function inside the PARTITION BY clause of our window functions like so:
SELECT customer
, city
, SUM(units) AS "units"
, ROUND( 100 * SUM(units) / SUM(SUM(units)) OVER (PARTITION BY GROUPING(customer, city)) , 2) AS "to report"
, ROUND( 100 * SUM(units) / SUM(SUM(units)) OVER (PARTITION BY GROUPING(customer, city), customer) , 2) AS "to customer"
, ROUND( 100 * SUM(units) / SUM(SUM(units)) OVER (PARTITION BY GROUPING(customer, city), city) , 2) AS "to city"
FROM transactions
GROUP BY CUBE(customer, city)
Having the GROUPING function inside the PARTITION BY clause ensures that the denominator for each ratio corresponds only to the rows of a particular grouping. Fortunately we do not have to give much thought to the arguments passed to the GROUPING function: we can simply include all the columns that appear in the GROUP BY clause, although only the ones that do not appear in all of the grouping sets are strictly necessary.
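As an aside, the bit mask semantics are easy to emulate outside the database. This Python sketch (my illustration of the documented behavior, not PostgreSQL itself) shows the value GROUPING(customer, city) takes for each of the four grouping sets:

```python
def grouping(*included_flags):
    """Emulate PostgreSQL's GROUPING(): each argument contributes one bit,
    set to 1 when that expression is NOT part of the current grouping set.
    The first argument is the most significant bit."""
    mask = 0
    for included in included_flags:
        mask = (mask << 1) | (0 if included else 1)
    return mask

# grouping sets of CUBE(customer, city): which columns are grouped?
print(grouping(True, True))    # (customer, city) -> 0
print(grouping(True, False))   # (customer)       -> 1
print(grouping(False, True))   # (city)           -> 2
print(grouping(False, False))  # ()               -> 3
```

Because the four grouping sets get four distinct values, partitioning by this value keeps each grouping's rows in their own window partition.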
Desired Result
Now that we are calculating the denominator taking the groupings into consideration we get the correct results.
customer  city        units  to report  to customer  to city
Bob       Manchester     30      23.08        75.00    46.15
Bob       Concord        10       7.69        25.00    25.00
Chuck     Concord        10       7.69        33.33    25.00
Chuck     Manchester     20      15.38        66.67    30.77
Jane      Derry          10       7.69       100.00    40.00
Tim       Concord        20      15.38        40.00    50.00
Tim       Derry          15      11.54        30.00    60.00
Tim       Manchester     15      11.54        30.00    23.08
Bob       null           40      30.77       100.00    30.77
Chuck     null           30      23.08       100.00    23.08
Jane      null           10       7.69       100.00     7.69
Tim       null           50      38.46       100.00    38.46
null      Concord        40      30.77        30.77   100.00
null      Derry          25      19.23        19.23   100.00
null      Manchester     65      50.00        50.00   100.00
null      null          130     100.00       100.00   100.00

Related

What logic should be used to label customers (monthly) based on the categories they bought more often in the preceding 4 calendar months?

I have a table that looks like this:
user  type      quantity  order_id  purchase_date
john  travel          10         1  2022-01-10
john  travel          15         2  2022-01-15
john  books            4         3  2022-01-16
john  music           20         4  2022-02-01
john  travel          90         5  2022-02-15
john  clothing       200         6  2022-03-11
john  travel          70         7  2022-04-13
john  clothing        70         8  2022-05-01
john  travel         200         9  2022-06-15
john  tickets         10        10  2022-07-01
john  services        20        11  2022-07-15
john  services        90        12  2022-07-22
john  travel          10        13  2022-07-29
john  services        25        14  2022-08-01
john  clothing         3        15  2022-08-15
john  music            5        16  2022-08-17
john  music           40        18  2022-10-01
john  music           30        19  2022-11-05
john  services         2        20  2022-11-19
where I have many different users and multiple types, with purchases made daily.
I want to end up with a table of this format
user  label            month
john  travel           2022-01-01
john  travel           2022-02-01
john  clothing         2022-03-01
john  travel-clothing  2022-04-01
john  travel-clothing  2022-05-01
john  travel-clothing  2022-06-01
john  travel           2022-07-01
john  travel           2022-08-01
john  services         2022-10-01
john  music            2022-11-01
where the label would record the most popular type (based on % of quantity sold) for each user over a timeframe of the last 4 months (including the current month). So for instance, for March 2022 john ordered 200/339 clothing (Jan through Mar), so his label is clothing. But for months where two types are almost even I'd want to use a double label, as for April (185 travel, 200 clothing, out of 409). In terms of rules this is not set in stone yet, but it's something like: if two types are around even (e.g. >40% each) then use both types in the label column; if three types are around even (e.g. around 30% each) use three types as the label; if one type is 40% but the rest is made up of many small percentages, keep just the first label; and of course where one type is clearly a majority, use that. One other tricky bit is that there might be missing months for a user.
I think regarding the rules I need to just compare the % of each type, but I don't know how to retrieve the type as a label afterwards. In general, I don't have the SQL/BigQuery logic very clearly in my head. I have done some things, but nothing that comes close to the target table.
Broken down in steps, I think I need 3 things:
group by user, type, month and get the partial and total count (I have done this)
then retrieve the counts for the past 4 months (have done something but it's not exactly accurate yet)
compare the ratios and make the label column
I'm not very clear on the SQL/BigQuery logic here, so please advise me on the correct steps to achieve the above. I'm working on BigQuery, but general SQL logic will also help.
Consider the approach below. It looks a little bit messy and has room for optimization, but I hope it gives you an idea or a direction for addressing your problem.
WITH aggregation AS (
SELECT user, type, DATE_TRUNC(purchase_date, MONTH) AS month, month_no,
SUM(quantity) AS net_qty,
SUM(SUM(quantity)) OVER w1 AS rolling_qty
FROM sample_table, UNNEST([EXTRACT(YEAR FROM purchase_date) * 12 + EXTRACT(MONTH FROM purchase_date)]) month_no
GROUP BY 1, 2, 3, 4
WINDOW w1 AS (
PARTITION BY user ORDER BY month_no RANGE BETWEEN 3 PRECEDING AND CURRENT ROW
)
),
rolling AS (
SELECT user, month, ARRAY_AGG(STRUCT(type, net_qty)) OVER w2 AS agg, rolling_qty
FROM aggregation
QUALIFY ROW_NUMBER() OVER (PARTITION BY user, month) = 1
WINDOW w2 AS (PARTITION BY user ORDER BY month_no RANGE BETWEEN 3 PRECEDING AND CURRENT ROW)
)
SELECT user, month, ARRAY_TO_STRING(ARRAY(
SELECT type FROM (
SELECT type, SUM(net_qty) / SUM(SUM(net_qty)) OVER () AS pct,
FROM r.agg GROUP BY 1
) QUALIFY IFNULL(FIRST_VALUE(pct) OVER (ORDER BY pct DESC) - pct, 0) < 0.10 -- set threshold to 0.1
), '-') AS label
FROM rolling r
ORDER BY month;
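For readers who find the SQL dense, here is a rough Python re-implementation of the rolling 4-month share-and-threshold logic (my sketch with hypothetical names; it covers a single user and only the first four months of the sample, and uses the same 0.10 threshold as the query):

```python
from collections import defaultdict

# subset of the question's sample data: (type, quantity, year, month)
orders = [("travel", 10, 2022, 1), ("travel", 15, 2022, 1), ("books", 4, 2022, 1),
          ("music", 20, 2022, 2), ("travel", 90, 2022, 2),
          ("clothing", 200, 2022, 3), ("travel", 70, 2022, 4)]

# quantity per (month_no, type); month_no = year*12 + month, as in the answer
qty = defaultdict(int)
for typ, q, year, month in orders:
    qty[(year * 12 + month, typ)] += q

def label(month_no, threshold=0.10):
    """Label for one month: every type whose share is within `threshold`
    of the top share over the current month plus the 3 preceding months."""
    window = defaultdict(int)
    for (mn, typ), q in qty.items():
        if month_no - 3 <= mn <= month_no:
            window[typ] += q
    total = sum(window.values())
    shares = sorted(((q / total, typ) for typ, q in window.items()), reverse=True)
    top_share = shares[0][0]
    return "-".join(typ for share, typ in shares if top_share - share < threshold)

for mn in sorted({mn for mn, _ in qty}):
    print(f"2022-{mn - 2022 * 12:02d}", label(mn))
```

Note that types are joined in descending order of share, so April comes out as clothing-travel rather than travel-clothing; the set of types in the label matches the question's expectation.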

Showing Two Fields With Different Timeline in the Same Date Structure

In the project I am currently working on at my company, I would like to show sales-related KPIs together with a Customer Score metric in SQL / Tableau / BigQuery.
The primary key is order ID in both tables. However, the order date and the date we measure the Customer Score may be different. For example, the sales information for an order released in Feb 2020 will be aggregated in Feb 2020; however, if the customer survey is made in March 2020, the Customer Score metric must be aggregated in March 2020. What I would like to achieve in the relational database is as follows:
Sales:
Order ID  Order Date(m/d/yyyy)  Sales ($)
1000      1/1/2021                   1000
1001      2/1/2021                   2000
1002      3/1/2021                   1500
1003      4/1/2021                   1700
1004      5/1/2021                   1800
1005      6/1/2021                    900
1006      7/1/2021                   1600
1007      8/1/2021                   1900
Customer Score Table:
Order ID  Customer Survey Date(m/d/yyyy)  Customer Score
1000      3/1/2021                                     8
1001      3/1/2021                                     7
1002      4/1/2021                                     3
1003      6/1/2021                                     6
1004      6/1/2021                                     5
1005      7/1/2021                                     3
1006      9/1/2021                                     1
1007      8/1/2021                                     7
Expected Output:
KPI                 Jan-21  Feb-21  Mar-21  Apr-21  May-21  June-21  July-21  Aug-21  Sep-21
Sales($)              1000    2000    1500    1700    1800      900     1600    1900
AVG Customer Score                     7.5       3              5.5        3       7       1
I couldn't find a way to do this, because order date and survey date may/may not be the same.
I think what you want to do is aggregate your results to the month (KPI) first before joining, as opposed to joining on the ORDER_ID.
For example:
with order_month as (
select date_trunc(order_date, MONTH) as KPI, sum(sales) as sales
from `testing.sales`
group by 1
),
customer_score_month as (
select date_trunc(customer_survey_date, MONTH) as KPI, avg(customer_score) as avg_customer_score
from `testing.customer_score`
group by 1
)
select coalesce(order_month.KPI,customer_score_month.KPI) as KPI, sales, avg_customer_score
from order_month
full outer join customer_score_month
on order_month.KPI = customer_score_month.KPI
order by 1 asc
Here, we aggregate the total sales for each month based on the order date, then we aggregate the average customer score for each month based on the date the score was submitted. Now we can join these two on the month value.
This results in a table like this:
KPI         sales  avg_customer_score
2021-01-01   1000  null
2021-02-01   2000  null
2021-03-01   1500  7.5
2021-04-01   1700  3.0
2021-05-01   1800  null
2021-06-01    900  5.5
2021-07-01   1600  3.0
2021-08-01   1900  7.0
2021-09-01   null  1.0
You can pivot the results of this table in Tableau, or leverage a case statement to pull out each month into its own column. I can elaborate more if that will be helpful.
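The aggregate-then-full-outer-join idea can also be sketched outside SQL. This Python snippet (my illustration, using a trimmed-down version of the sample data) does the same month truncation, per-month aggregation, and union of month keys:

```python
from datetime import date

# trimmed-down sample: (order_date, sales) and (survey_date, score)
sales = [(date(2021, 1, 1), 1000), (date(2021, 2, 1), 2000), (date(2021, 3, 1), 1500)]
scores = [(date(2021, 3, 1), 8), (date(2021, 3, 1), 7), (date(2021, 4, 1), 3)]

def month(d):
    return d.replace(day=1)  # like DATE_TRUNC(..., MONTH)

# aggregate each source to the month independently
sales_by_month = {}
for d, amount in sales:
    sales_by_month[month(d)] = sales_by_month.get(month(d), 0) + amount

score_lists = {}
for d, score in scores:
    score_lists.setdefault(month(d), []).append(score)
score_by_month = {m: sum(v) / len(v) for m, v in score_lists.items()}

# FULL OUTER JOIN keeps months present in either side
for m in sorted(set(sales_by_month) | set(score_by_month)):
    print(m, sales_by_month.get(m), score_by_month.get(m))
```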

Distinct count for entire dataset, grouped by month

I am dealing with a sales order table (ORDER) that looks roughly like this (updated 2018/12/20 to be closer to my actual data set):
SOID SOLINEID INVOICEDATE SALESAMOUNT AC
5 1 2018-11-30 100.00 01
5 2 2018-12-05 50.00 02
4 1 2018-12-12 25.00 17
3 1 2017-12-31 75.00 03
3 2 2018-01-03 25.00 05
2 1 2017-11-25 100.00 17
2 2 2017-11-27 35.00 03
1 1 2017-11-20 15.00 08
1 2 2018-03-15 30.00 17
1 3 2018-04-03 200.00 05
I'm able to calculate the average sales by SOID and SOLINEID:
SELECT SUM(SALESAMOUNT) / COUNT(DISTINCT SOID) AS 'Total Sales per Order ($)',
SUM(SALESAMOUNT) / COUNT(SOLINEID) AS 'Total Sales per Line ($)'
FROM ORDER
This seems to provide a perfectly good answer, but I was then given an additional constraint, that this count be done by year and month. I thought I could simply add
GROUP BY YEAR(INVOICEDATE), MONTH(INVOICEDATE)
But this aggregates the SOID and then performs the COUNT(DISTINCT SOID). This becomes a problem with SOIDs that appears across multiple months, which is fairly common since we invoice upon shipment.
I want to get something like this:
Year Month Total Sales Per Order Total Sales Per Line
2018 11 0.00
The sore thumb sticking out is that I need some way of defining in which month and year an SOID will be aggregated if it spans across multiple ones; for that purpose, I'd use MAX(INVOICEDATE).
From there, however, I'm just not sure how to tackle this. WITH? A subquery? Something else? I would appreciate any help, even if it's just pointing in the right direction.
You should select YEAR() and MONTH() of INVOICEDATE and group by them:
SELECT YEAR(INVOICEDATE) year
, MONTH(INVOICEDATE) month
, SUM(SALESAMOUNT) / COUNT(DISTINCT SOID) AS 'Total Sales per Order ($)'
, SUM(SALESAMOUNT) / COUNT(SOLINEID) AS 'Total Sales per Line ($)'
FROM ORDER
GROUP BY YEAR(INVOICEDATE), MONTH(INVOICEDATE)
Here are the results, but the data sample does not have enough rows to show multiple months...
SELECT
mDateYYYY,
mDateMM,
SUM(SALESAMOUNT) / COUNT(DISTINCT t1.SOID) AS 'Total Sales per Order ($)',
SUM(SALESAMOUNT) / COUNT(SOLINEID) AS 'Total Sales per Line ($)'
FROM DCORDER as t1
left join
(Select
SOID
,Year(max(INVOICEDATE)) as mDateYYYY
,Month(max(INVOICEDATE)) as mDateMM
From DCOrder
Group By SOID
) as t2
On t1.SOID = t2.SOID
Group by mDateYYYY, mDateMM
mDateYYYY mDateMM Total Sales per Order ($) Total Sales per Line ($)
2018 12 87.50 58.33
Here are the results of a revised query (not shown above) that still uses MAX(INVOICEDATE), run against the new 12/20 data and excluding AC = 17.
YYYY MM Total Sales per Order ($) Total Sales per Line ($)
2017 11 35.00 35.00
2018 1 100.00 50.00
2018 4 215.00 107.50
2018 12 150.00 75.00

Determining if value is between two other values in the same column

I'm currently working on a project that involves using a user-provided charge table to calculate fees.
The table looks like:
MaxAmount Fee
10.00 1.95
20.00 2.95
30.00 3.95
50.00 4.95
As seen in the table above, any MaxAmount up to 10.00 is charged a 1.95 fee. Any MaxAmount between 10.01 and 20.00 is charged a 2.95 fee, etc. Finally, any MaxAmount above 50.00 is charged 4.95.
I'm trying to come up with a sql query that will return the correct fee for a given MaxAmount. However, I'm having trouble doing so. I've tried something similar to the following (assuming a provided MaxAmt of 23.00):
SELECT Fee FROM ChargeTable WHERE 23.00 BETWEEN MaxAmt AND MaxAmt
Of course, this doesn't give me the desired result of 3.95.
I'm having trouble adapting SQL's set-based logic to this type of problem.
Any and all help is greatly appreciated!
If the MaxAmount behaves as the table suggests, then you can use:
select top 1 fee
from ChargeTable ct
where @Price <= MaxAmount
order by MaxAmount asc
As you describe it, you really want another row:
MaxAmount Fee
0.00 1.95
10.00 1.95
20.00 2.95
30.00 3.95
50.00 4.95
Your original table does not have enough values. When you have 4 break points, you actually need 5 values -- to handle the two extremes.
With this structure, then you can do:
select top 1 fee
from ChargeTable ct
where @Price >= MaxAmount
order by MaxAmount desc
You could try something like:
SELECT min(Fee) FROM Fees WHERE 23<=MaxAmount
Have a look here for an example:
http://sqlfiddle.com/#!2/43f2a/5
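Both answers implement the same lookup: find the smallest bracket that covers the amount, falling back to the top fee above the last break point. Here is a Python sketch of that rule (my illustration, using the question's four-row table):

```python
# the question's charge table: (MaxAmount, Fee)
fee_table = [(10.00, 1.95), (20.00, 2.95), (30.00, 3.95), (50.00, 4.95)]

def fee_for(amount):
    # smallest bracket whose MaxAmount covers the amount ...
    for max_amount, fee in sorted(fee_table):
        if amount <= max_amount:
            return fee
    # ... and anything above the last break point gets the top fee
    return fee_table[-1][1]

print(fee_for(23.00))  # 3.95
print(fee_for(99.00))  # 4.95
```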

Counting Instances of Unique Value in Field

Suppose you have a table in SQL:
Prices
------
13.99
14.00
52.00
52.00
52.00
13.99
How would you count the number of times each DISTINCT value has been entered? For example, such a count would output:
13.99 - 2 times.
14.00 - 1 times.
52.00 - 3 times.
OR perhaps:
3 (i.e. 13.99, 14.00, 52.00)
Can anyone advise? Cheers.
How about:
SELECT Prices, COUNT(*) FROM TheTable GROUP BY Prices
Can't say I've tried it on MySql, but I'd expect it to work...
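For comparison, the same grouping is what a Counter does in Python (my illustration, not part of the answer): one entry per distinct price, plus the distinct count as a by-product.

```python
from collections import Counter

prices = [13.99, 14.00, 52.00, 52.00, 52.00, 13.99]

# equivalent of SELECT Prices, COUNT(*) FROM TheTable GROUP BY Prices
counts = Counter(prices)
for price, n in sorted(counts.items()):
    print(price, "-", n, "times.")

# and the equivalent of COUNT(DISTINCT Prices)
print(len(counts))  # 3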