Introduction
In this Q&A we will explore the use of window functions and grouping sets in the same query using PostgreSQL. The reader should be familiar with both of those concepts to get the most out of this post. This fiddle can be used to follow along.
Data
The following table contains a few transactions. For each transaction we have the customer, the city where it took place and the number of units delivered to the customer.
CREATE TABLE transactions (customer TEXT, city TEXT, units INT);
INSERT INTO transactions VALUES ('Bob', 'Manchester', 10),
('Chuck', 'Manchester', 20),
('Bob', 'Concord', 10),
('Tim', 'Manchester', 15),
('Jane', 'Derry', 10),
('Tim', 'Derry', 15),
('Tim', 'Concord', 20),
('Bob', 'Manchester', 20),
('Chuck', 'Concord', 10);
Desired results
I want to be able to produce a report which contains the answers to questions like "what proportion of all transactions is represented by this row" or "what is the ratio of this to all the transactions by this customer". Also I want to be able to answer all such possible questions. With that kind of a report we could simply look at each row and extract information about its relation to the whole report or to specific slices of the data.
First attempt
We could build such a report by running several queries with different GROUP BY clauses and combining them with UNION ALL, or we can try something like the following:
-- Incorrect results
SELECT customer
, city
, SUM(units) AS "units"
, ROUND( 100 * SUM(units) / SUM(SUM(units)) OVER () , 2) AS "to report"
, ROUND( 100 * SUM(units) / SUM(SUM(units)) OVER (PARTITION BY customer) , 2) AS "to customer"
, ROUND( 100 * SUM(units) / SUM(SUM(units)) OVER (PARTITION BY city) , 2) AS "to city"
FROM transactions
GROUP BY CUBE(customer, city)
Here we use CUBE in the GROUP BY clause which will produce groupings corresponding to all possible combinations of the customer and city columns. The numerator of the ratios is an aggregate that corresponds to the units total for that row. Notice that it is the same for all ratios and that it contains the expression used as the "units" column, i.e. SUM(units). Calculating the denominator is more complicated because we need a window function to calculate the total number of units for the wanted slice, i.e. the total for a particular customer or city or the total for the whole report.
Incorrect Results
Unfortunately, the ratios produced by the query above are not correct. For example, the first row has 10 units, which is 7.7% of the total (130), 25% of the total for Bob (40), and 25% of the total for Concord (40), yet the results show less than the correct ratio in all three cases. As another example, take the row where both "customer" and "city" are NULL: here the "units" column is 130, and yet the calculated ratio "to report" is 25%. Clearly the denominator in the ratio columns is wrong. How can we get the desired results?
customer  city        units  to report  to customer  to city
Bob       Concord        10       1.92        12.50    12.50
Bob       Manchester     30       5.77        37.50    23.08
Bob       null           40       7.69        50.00    15.38
Chuck     Concord        10       1.92        16.67    12.50
Chuck     Manchester     20       3.85        33.33    15.38
Chuck     null           30       5.77        50.00    11.54
Jane      Derry          10       1.92        50.00    20.00
Jane      null           10       1.92        50.00     3.85
Tim       Concord        20       3.85        20.00    25.00
Tim       Derry          15       2.88        15.00    30.00
Tim       null           50       9.62        50.00    19.23
Tim       Manchester     15       2.88        15.00    11.54
null      null          130      25.00        50.00    50.00
null      Manchester     65      12.50        25.00    50.00
null      Derry          25       4.81         9.62    50.00
null      Concord        40       7.69        15.38    50.00
Why are the results wrong?
Notice that although the results in the question are wrong, they are not totally nonsensical. Take, for example, the row in which both "customer" and "city" are NULL. This row has 130 units, which is the total number of units in the data, so we should expect the ratio "to report" to be 100%. The result shows 25% instead, which means that the denominator of the ratio in that case was four times 130, or 520. Take the first row as another example: here we have 10 units and a ratio "to report" of 1.92%. Again the denominator is too large by a factor of four, i.e. the actual ratio should be 7.69%. Clearly the total of the report is taken to be four times what it actually is.
The results for the "to customer" and "to city" columns are wrong as well but by a different factor. Take for example the rows where "customer" is NULL and "city" is not NULL. The ratio "to city" for the three cities is 50% but should be 100%. This is because the denominator of the ratio is twice what it should be. The same thing happens for the "to customer" ratio. The problem lies in the partitioning of the rows. For instance, there is no PARTITION BY clause in the "to report" column, so we are taking into consideration all of the rows of the report, which add up to 520.
Consider that the GROUP BY clause produces four groupings:
(customer, city) - equivalent to GROUP BY customer, city
(customer) - equivalent to GROUP BY customer
(city) - equivalent to GROUP BY city
() - equivalent to using an aggregate without GROUP BY clause
Each of those groupings is a different way to slice the data and for each the units total is 130. In the case of the "to customer" and "to city" columns we have a denominator that is twice as large as it should be. Take for example the cities and notice that the rows where "city" is not NULL contain the units for each city twice: once where "customer" is not NULL and another time where "customer" is NULL. These two categories correspond to the first and third groupings above, respectively. The same can be said for the customers but the rows would correspond to the first and second groupings. Clearly when calculating the denominator of each ratio we need to take into consideration that different rows belong to different slices of the data.
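The double and quadruple counting described above can be checked with a small sketch. It emulates CUBE(customer, city) in plain Python rather than running PostgreSQL; the helper name cube and the flag tuples are made up for this illustration.

```python
from itertools import product

# Same rows as the INSERT statement in the Data section.
transactions = [
    ("Bob", "Manchester", 10), ("Chuck", "Manchester", 20),
    ("Bob", "Concord", 10), ("Tim", "Manchester", 15),
    ("Jane", "Derry", 10), ("Tim", "Derry", 15),
    ("Tim", "Concord", 20), ("Bob", "Manchester", 20),
    ("Chuck", "Concord", 10),
]


def cube(rows):
    """Emulate GROUP BY CUBE(customer, city): one dict of totals per grouping set."""
    groupings = {}
    for keep_customer, keep_city in product([True, False], repeat=2):
        totals = {}
        for customer, city, units in rows:
            key = (customer if keep_customer else None,
                   city if keep_city else None)
            totals[key] = totals.get(key, 0) + units
        groupings[(keep_customer, keep_city)] = totals
    return groupings


groupings = cube(transactions)

# Every grouping set slices the same data, so each one adds up to 130.
per_grouping_total = {g: sum(t.values()) for g, t in groupings.items()}

# A window with no PARTITION BY sees the rows of all four grouping sets.
report_total = sum(per_grouping_total.values())

# PARTITION BY city sees Manchester in two grouping sets:
# (customer, city) and (city).
manchester_total = sum(
    units
    for totals in groupings.values()
    for (_customer, city), units in totals.items()
    if city == "Manchester"
)

print(per_grouping_total)
print(report_total)      # 520, four times the real total
print(manchester_total)  # 130, twice the real Manchester total of 65
```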
The Key
The key to getting the desired results is to use the GROUPING aggregate function. This function "returns a bit mask indicating which GROUP BY expressions are not included in the current grouping set". In other words it can give a different result for each grouping produced by the GROUP BY clause. We can use this function to help calculate the denominator for each of the ratio columns. To get the desired effect we use the GROUPING function inside the PARTITION BY clause of our window functions like so:
SELECT customer
, city
, SUM(units) AS "units"
, ROUND( 100 * SUM(units) / SUM(SUM(units)) OVER (PARTITION BY GROUPING(customer, city)) , 2) AS "to report"
, ROUND( 100 * SUM(units) / SUM(SUM(units)) OVER (PARTITION BY GROUPING(customer, city), customer) , 2) AS "to customer"
, ROUND( 100 * SUM(units) / SUM(SUM(units)) OVER (PARTITION BY GROUPING(customer, city), city) , 2) AS "to city"
FROM transactions
GROUP BY CUBE(customer, city)
Having the GROUPING function inside the PARTITION BY clause ensures that the denominator for each ratio corresponds only to the rows of a particular grouping. Fortunately we do not have to give much thought to the arguments that we will pass to the GROUPING function. You can simply include all the columns that appear in the GROUP BY, although only the ones that do not appear in all of the grouping sets are necessary.
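To see why partitioning by the bit mask fixes the denominators, here is a rough Python emulation of the query above. It is a sketch, not PostgreSQL itself: grouping_mask and ratio are hypothetical helpers, and the mask follows the documented convention that a bit is 1 when the column is not part of the grouping set, with the rightmost argument as the lowest bit.

```python
from collections import defaultdict

transactions = [
    ("Bob", "Manchester", 10), ("Chuck", "Manchester", 20),
    ("Bob", "Concord", 10), ("Tim", "Manchester", 15),
    ("Jane", "Derry", 10), ("Tim", "Derry", 15),
    ("Tim", "Concord", 20), ("Bob", "Manchester", 20),
    ("Chuck", "Concord", 10),
]

# Grouping sets of CUBE(customer, city) as (keep_customer, keep_city) flags.
GROUPING_SETS = [(True, True), (True, False), (False, True), (False, False)]


def grouping_mask(keep_customer, keep_city):
    """Mimic GROUPING(customer, city): customer is bit 1, city is bit 0."""
    return (0 if keep_customer else 2) | (0 if keep_city else 1)


# Build the report rows as (mask, customer, city, units).
rows = []
for keep_customer, keep_city in GROUPING_SETS:
    totals = defaultdict(int)
    for customer, city, units in transactions:
        key = (customer if keep_customer else None,
               city if keep_city else None)
        totals[key] += units
    mask = grouping_mask(keep_customer, keep_city)
    rows.extend((mask, c, ci, u) for (c, ci), u in totals.items())


def ratio(row, partition_key):
    """100 * units / SUM(units) over the rows that share the partition key."""
    denominator = sum(r[3] for r in rows
                      if partition_key(r) == partition_key(row))
    return round(100 * row[3] / denominator, 2)


# PARTITION BY GROUPING(customer, city)
to_report = {(r[1], r[2]): ratio(r, lambda x: x[0]) for r in rows}
# PARTITION BY GROUPING(customer, city), customer
to_customer = {(r[1], r[2]): ratio(r, lambda x: (x[0], x[1])) for r in rows}
# PARTITION BY GROUPING(customer, city), city
to_city = {(r[1], r[2]): ratio(r, lambda x: (x[0], x[2])) for r in rows}

print(to_report[("Bob", "Concord")])       # 7.69
print(to_report[(None, None)])             # 100.0
print(to_customer[("Bob", "Manchester")])  # 75.0
print(to_city[("Tim", "Derry")])           # 60.0
```

Because every row in a partition now comes from a single grouping set, each denominator adds up to the real total for that slice.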
Desired Result
Now that we are calculating the denominator taking the groupings into consideration we get the correct results.
customer  city        units  to report  to customer  to city
Bob       Manchester     30      23.08        75.00    46.15
Bob       Concord        10       7.69        25.00    25.00
Chuck     Concord        10       7.69        33.33    25.00
Chuck     Manchester     20      15.38        66.67    30.77
Jane      Derry          10       7.69       100.00    40.00
Tim       Concord        20      15.38        40.00    50.00
Tim       Derry          15      11.54        30.00    60.00
Tim       Manchester     15      11.54        30.00    23.08
Bob       null           40      30.77       100.00    30.77
Chuck     null           30      23.08       100.00    23.08
Jane      null           10       7.69       100.00     7.69
Tim       null           50      38.46       100.00    38.46
null      Concord        40      30.77        30.77   100.00
null      Derry          25      19.23        19.23   100.00
null      Manchester     65      50.00        50.00   100.00
null      null          130     100.00       100.00   100.00
I have a table that holds the monthly sales value for each item. For each item, I need the average sales of the previous 3 months next to the current month's sales.
I need to perform this operation in Hive.
The sample input table looks like this:
Item_ID Sales Month
A 4295 Dec-2018
A 245 Nov-2018
A 1337 Oct-2018
A 3290 Sep-2018
A 2000 Aug-2018
B 856 Dec-2018
B 1694 Nov-2018
B 4286 Oct-2018
B 2780 Sep-2018
B 3100 Aug-2018
The result table should look like this
Item_ID Sales_Current_Month Month Sales_Last_3_months_average
A 4295 Dec-2018 1624
A 245 Nov-2018 2209
B 856 Dec-2018 2920
B 1694 Nov-2018 3388.67
Assuming there are no missing months in the data, you can use the avg window function to do this.
select t.*
,avg(sales) over(partition by item_id order by month rows between 3 preceding and 1 preceding) as avg_sales_prev_3_months
from tbl t
If the month column is in a format other than yyyyMM, convert it first so that the ordering works as expected.
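As a rough illustration of the conversion and the window frame, the sketch below runs the same kind of query in SQLite through Python's sqlite3 module instead of Hive. That swap is an assumption made only so the example is self-contained; it needs a SQLite build with window function support (3.25 or later), and the helper to_yyyymm is made up for this example. Sorting the raw 'Dec-2018' strings would put the months in alphabetical order, which is why they are converted first.

```python
import sqlite3

MONTHS = {"Jan": "01", "Feb": "02", "Mar": "03", "Apr": "04",
          "May": "05", "Jun": "06", "Jul": "07", "Aug": "08",
          "Sep": "09", "Oct": "10", "Nov": "11", "Dec": "12"}


def to_yyyymm(label):
    """Convert a label such as 'Dec-2018' to the sortable string '201812'."""
    mon, year = label.split("-")
    return year + MONTHS[mon]


# Sample rows from the question.
sample = [
    ("A", 4295, "Dec-2018"), ("A", 245, "Nov-2018"), ("A", 1337, "Oct-2018"),
    ("A", 3290, "Sep-2018"), ("A", 2000, "Aug-2018"),
    ("B", 856, "Dec-2018"), ("B", 1694, "Nov-2018"), ("B", 4286, "Oct-2018"),
    ("B", 2780, "Sep-2018"), ("B", 3100, "Aug-2018"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (item_id TEXT, sales INTEGER, month TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [(item, sales, to_yyyymm(month)) for item, sales, month in sample],
)

# Frame: the three rows before the current one, excluding the current row.
query = """
SELECT item_id, month, sales,
       AVG(sales) OVER (PARTITION BY item_id ORDER BY month
                        ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING)
           AS avg_sales_prev_3_months
FROM tbl
"""
averages = {(item, month): avg
            for item, month, _sales, avg in conn.execute(query)}

print(averages[("A", "201812")])            # 1624.0
print(round(averages[("B", "201811")], 2))  # 3388.67
```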
I have two tables containing date ranges that I want to cross multiply in a way to get all distinct ranges. That is, all ranges that have a boundary in one of the tables.
Specifically I have a table with product prices and their validity dates as well as conversion factors with a validity date. I want, as a result, each instance of a specific price/conversion_factor combination and from when to when it was valid:
products:
product_id start_date end_date price_eur
1 2000-01-01 2000-12-31 100
1 2001-01-01 2002-12-31 150
conversion_factors:
start_date end_date dollar_to_eur
1970-01-01 2000-03-31 1.50
2000-04-01 2000-06-30 1.60
2000-07-01 2001-06-30 1.70
2001-07-01 2003-06-30 2.00
result:
product_id start_date end_date price_eur dollar_to_eur
1 2000-01-01 2000-03-31 100 1.50
1 2000-04-01 2000-06-30 100 1.60
1 2000-07-01 2000-12-31 100 1.70
1 2001-01-01 2001-06-30 150 1.70
1 2001-07-01 2002-12-31 150 2.00
So each time one of the tables hits a new date, a new row should be returned. In the result, the first three rows reference the validity of the first product row, split across three intervals of the conversion_factors table. Similarly, the third and fourth rows of the result come from the third conversion factor row, but with different product rows.
Is there any way to do this with a clever join (in PostgreSQL) or do I need to use a PL/pgSQL function?
There are two parts to this: you ask for a smart join, and you ask for the correct result to be displayed. This should solve both:
SELECT p.product_id
,Greatest(p.start_date, cf.start_date) AS start_date
,Least(p.end_date, cf.end_date) AS end_date
,p.price_eur
,cf.dollar_to_eur
FROM products AS p
JOIN conversion_factors AS cf
ON p.start_date <= cf.end_date AND p.end_date >= cf.start_date
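The overlap test and the Greatest/Least trimming can be traced on the sample data with a short sketch. This is plain Python rather than PostgreSQL, and the function name split_ranges is made up for the illustration.

```python
from datetime import date

# (product_id, start_date, end_date, price_eur)
products = [
    (1, date(2000, 1, 1), date(2000, 12, 31), 100),
    (1, date(2001, 1, 1), date(2002, 12, 31), 150),
]

# (start_date, end_date, dollar_to_eur)
conversion_factors = [
    (date(1970, 1, 1), date(2000, 3, 31), 1.50),
    (date(2000, 4, 1), date(2000, 6, 30), 1.60),
    (date(2000, 7, 1), date(2001, 6, 30), 1.70),
    (date(2001, 7, 1), date(2003, 6, 30), 2.00),
]


def split_ranges(products, conversion_factors):
    """Pair every product row with every overlapping conversion factor row."""
    result = []
    for product_id, p_start, p_end, price in products:
        for cf_start, cf_end, factor in conversion_factors:
            # Same test as the join condition: two closed ranges overlap
            # when each one starts before the other one ends.
            if p_start <= cf_end and p_end >= cf_start:
                result.append((
                    product_id,
                    max(p_start, cf_start),  # Greatest(...)
                    min(p_end, cf_end),      # Least(...)
                    price,
                    factor,
                ))
    return sorted(result)


result = split_ranges(products, conversion_factors)
for row in result:
    print(row)
```

The five printed rows match the result table in the question.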
Perhaps my title is misleading, but I am not sure how else to phrase this. I have two tables, tblL and tblDumpER. They are joined based on the field SubjectNumber. This is a one (tblL) to many (tblDumpER) relationship.
I need to write a query that will give me, for all my subjects, a value from tblDumpER associated with a date in tblL. This is to say:
SELECT tblL.SubjectNumber, tblDumpER.ER_Q1
FROM tblL
LEFT JOIN tblDumpER ON tblL.SubjectNumber=tblDumpER.SubjectNumber
WHERE tblL.RandDate=tblDumpER.ER_DATE And tblDumpER.ER_Q1 Is Not Null
This is straightforward enough. My problem is the value RandDate from tblL is different for every subject. However, it needs to be displayed as Day1 so I can have tblDumpER.ER_Q1 as Day1 for every subject. Then I need RandDate+1 As Day2, etc until I hit either null or Day84. The 'dumb' solution is to write 84 queries. This is obviously not practical. Any advice would be greatly appreciated!
I appreciate the responses so far but I don't think that I'm explaining this correctly so here is some example data:
SubjectNumber RandDate
1001 1/1/2013
1002 1/8/2013
1003 1/15/2013
SubjectNumber ER_DATE ER_Q1
1001 1/1/2013 5
1001 1/2/2013 6
1001 1/3/2013 2
1002 1/8/2013 1
1002 1/9/2013 10
1002 1/10/2013 8
1003 1/15/2013 7
1003 1/16/2013 4
1003 1/17/2013 3
Desired outcome:
(Where Day1=RandDate, Day2=RandDate+1, Day3=RandDate+2)
SubjectNumber Day1_ER_Q1 Day2_ER_Q1 Day3_ER_Q1
1001 5 6 2
1002 1 10 8
1003 7 4 3
This data is then going to be plotted on a graph with Day# on the X-axis and ER_Q1 on the Y-axis
I would do this in two steps:
Create a query that gets the MIN date for each SubjectNumber
Join this query to your existing query, so you can perform a DATEDIFF calculation on the MIN date and the date of the current record.
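The two steps can be sketched in plain Python on the sample data from the question. This is only an illustration of the logic, not Access SQL; the names first_date and pivot are made up, and it assumes the earliest ER_DATE of each subject equals its RandDate, as it does in the sample.

```python
from datetime import date

# (SubjectNumber, ER_DATE, ER_Q1) rows from the question's sample data.
er_rows = [
    (1001, date(2013, 1, 1), 5), (1001, date(2013, 1, 2), 6),
    (1001, date(2013, 1, 3), 2),
    (1002, date(2013, 1, 8), 1), (1002, date(2013, 1, 9), 10),
    (1002, date(2013, 1, 10), 8),
    (1003, date(2013, 1, 15), 7), (1003, date(2013, 1, 16), 4),
    (1003, date(2013, 1, 17), 3),
]

# Step 1: the MIN date for each SubjectNumber.
first_date = {}
for subject, er_date, _q1 in er_rows:
    first_date[subject] = min(first_date.get(subject, er_date), er_date)

# Step 2: the day difference to that first date gives the Day number
# (Day1 is the first date itself), capped at Day84.
pivot = {}
for subject, er_date, q1 in er_rows:
    day = (er_date - first_date[subject]).days + 1
    if 1 <= day <= 84:
        pivot.setdefault(subject, {})[day] = q1

for subject in sorted(pivot):
    print(subject, pivot[subject])
```

With a single Day column computed this way, one query or one chart series per subject replaces the 84 separate queries.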
I'm not entirely sure of what it is that you need, but perhaps a calendar table would be of help. Just create a local table that contains all of the days of the year in it, then use that table to JOIN your dates up?
I have a requirement for the table below.
Conditions:
1> I have to take the average of the salaries of clients whose dates of birth are one day apart.
2> If a client has no other client with a date of birth one day away, that client does not need to be taken into consideration.
Please see the results.
Table:
ClientID ClinetDOB's Slaries
1 2012-03-14 300
2 2012-04-11 400
3 2012-05-09 200
4 2012-06-06 400
5 2012-07-30 600
6 2012-08-14 1200
7 2012-08-15 1800
8 2012-08-17 1200
9 2012-08-20 2400
10 2012-08-21 1500
The result should look like this:
ClientID ClinetDOB's AVG(Slaries)
7 2012-08-15 1500 --This is the avg of 1200 and 1800 (because the DOBs of ClientIDs 6 and 7 are one day apart)
10 2012-08-21 1950 --This is the avg of 2400 and 1500 (because the DOBs of ClientIDs 9 and 10 are one day apart)
Please help.
Thank You In advance!
A self-join connects the current record (t1) with all records dated one day earlier (t2). Grouping by the t1 columns lets several t2 records with that date be aggregated together. The t1 record itself is not part of the joined rows, so its salary is added afterwards and count(*) is incremented by one to calculate the average.
Here is Sql Fiddle with example.
select t1.ClientID,
t1.ClinetDOBs,
(t1.Slaries + sum (t2.Slaries)) / (count (*) + 1) Avg_Slaries
from table1 t1
inner join table1 t2
on t1.ClinetDOBs = dateadd(day, 1, t2.ClinetDOBs)
group by t1.ClientID,
t1.ClinetDOBs,
t1.Slaries
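The arithmetic of the query can be traced on the sample table with a short sketch. It is plain Python rather than SQL, and the function name avg_with_previous_day is made up for the illustration.

```python
from datetime import date, timedelta

# (ClientID, ClinetDOBs, Slaries) rows from the question.
clients = [
    (1, date(2012, 3, 14), 300), (2, date(2012, 4, 11), 400),
    (3, date(2012, 5, 9), 200), (4, date(2012, 6, 6), 400),
    (5, date(2012, 7, 30), 600), (6, date(2012, 8, 14), 1200),
    (7, date(2012, 8, 15), 1800), (8, date(2012, 8, 17), 1200),
    (9, date(2012, 8, 20), 2400), (10, date(2012, 8, 21), 1500),
]


def avg_with_previous_day(clients):
    """For each client, average its salary with those born one day earlier."""
    result = []
    for client_id, dob, salary in clients:
        # The t2 side of the self-join: everyone born exactly one day earlier.
        previous = [s for _id, d, s in clients
                    if d + timedelta(days=1) == dob]
        if previous:  # the inner join drops clients without a match
            # (t1.Slaries + sum(t2.Slaries)) / (count(*) + 1)
            average = (salary + sum(previous)) / (len(previous) + 1)
            result.append((client_id, dob, average))
    return result


result = avg_with_previous_day(clients)
for row in result:
    print(row)
```

Only clients 7 and 10 have a neighbour born the day before, so only they appear, with averages 1500 and 1950.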