Understanding the differences between CUBE and ROLLUP - sql

My assignment asked me to find out "how many invoices are written for each date?"
I was a little stuck and asked my professor for help. She emailed me a query that would answer the question, "How many stoves of each type and version have been built?
For a challenge but no extra points, include the total number of stoves."
This was the query she sent me:
SELECT STOVE.Type + STOVE.Version AS 'Type+Version'
, COUNT(*) AS 'The Count'
FROM STOVE
GROUP BY STOVE.Type + STOVE.Version WITH ROLLUP;
So, I tweaked that query until it met my needs. This is what I came up with:
SELECT InvoiceDt
, COUNT(InvoiceNbr) AS 'Number of Invoices'
FROM INVOICE
GROUP BY InvoiceDt WITH ROLLUP
ORDER BY InvoiceDt ASC;
And it returned the following results that I wanted.
Anyway, I decided to read up on the ROLLUP clause and started with an article from Microsoft. It said that the ROLLUP clause was similar to the CUBE clause but that it was distinguished from the CUBE clause in the following way:
CUBE generates a result set that shows aggregates for all combinations of values in the selected columns.
ROLLUP generates a result set that shows aggregates for a hierarchy of values in the selected columns.
So, I decided to replace the ROLLUP in my query with CUBE to see what would happen. They produced the same results. I guess that's where I'm getting confused.
It seems like, if you're using the type of query that I am here, that there isn't any practical difference between the two clauses. Is that right? Or, am I not understanding something? I had thought, when I finished reading the Microsoft article, that my results should've been different using the CUBE clause.

You won't see any difference since you're only rolling up a single column. Consider an example where we do
ROLLUP (YEAR, MONTH, DAY)
With a ROLLUP, it will have the following outputs:
YEAR, MONTH, DAY
YEAR, MONTH
YEAR
()
With CUBE, it will have the following:
YEAR, MONTH, DAY
YEAR, MONTH
YEAR, DAY
YEAR
MONTH, DAY
MONTH
DAY
()
CUBE essentially contains every possible rollup scenario for each node whereas ROLLUP will keep the hierarchy in tact (so it won't skip MONTH and show YEAR/DAY, whereas CUBE will)
This is why you didn't see a difference since you only had a single column you were rolling up.

We can understand the difference between ROLLUP and CUBE with a simple example. Consider we have a table which contains the results of quarterly test of students. In certain cases we need to see the total corresponding to the quarter as well as the students. Here is the sample table
SELECT * INTO #TEMP
FROM
(
SELECT 'Quarter 1' PERIOD,'Amar' NAME ,97 MARKS
UNION ALL
SELECT 'Quarter 1','Ram',88
UNION ALL
SELECT 'Quarter 1','Simi',76
UNION ALL
SELECT 'Quarter 2','Amar',94
UNION ALL
SELECT 'Quarter 2','Ram',82
UNION ALL
SELECT 'Quarter 2','Simi',71
UNION ALL
SELECT 'Quarter 3' ,'Amar',95
UNION ALL
SELECT 'Quarter 3','Ram',83
UNION ALL
SELECT 'Quarter 3','Simi',77
UNION ALL
SELECT 'Quarter 4' ,'Amar',91
UNION ALL
SELECT 'Quarter 4','Ram',84
UNION ALL
SELECT 'Quarter 4','Simi',79
)TAB
1. ROLLUP(Can find total for corresponding to one column)
(a) Get total score of each student in all quarters.
SELECT * FROM #TEMP
UNION ALL
SELECT PERIOD,NAME,SUM(MARKS) TOTAL
FROM #TEMP
GROUP BY NAME,PERIOD
WITH ROLLUP
HAVING PERIOD IS NULL AND NAME IS NOT NULL
// Having is used inorder to emit a row that is the total of all totals of each student
Following is the result of (a)
(b) Incase you need to get total score of each quarter
SELECT * FROM #TEMP
UNION ALL
SELECT PERIOD,NAME,SUM(MARKS) TOTAL
FROM #TEMP
GROUP BY PERIOD,NAME
WITH ROLLUP
HAVING PERIOD IS NOT NULL AND NAME IS NULL
Following is the result of (b)
2. CUBE(Find total for Quarter as well as students in a single shot)
SELECT PERIOD,NAME,SUM(MARKS) TOTAL
FROM #TEMP
GROUP BY NAME,PERIOD
WITH CUBE
HAVING PERIOD IS NOT NULL OR NAME IS NOT NULL
Following is the result of CUBE
Now you may be wondering about the real time use of ROLLUP and CUBE. Sometimes we need a report in which we need to see the total of each quarter and total of each student in a single shot. Here is an example
I am changing the above CUBE query slightly as we need total for both totals.
SELECT CASE WHEN PERIOD IS NULL THEN 'TOTAL' ELSE PERIOD END PERIOD,
CASE WHEN NAME IS NULL THEN 'TOTAL' ELSE NAME END NAME,
SUM(MARKS) MARKS
INTO #TEMP2
FROM #TEMP
GROUP BY NAME,PERIOD
WITH CUBE
DECLARE #cols NVARCHAR (MAX)
SELECT #cols = COALESCE (#cols + ',[' + PERIOD + ']',
'[' + PERIOD + ']')
FROM (SELECT DISTINCT PERIOD FROM #TEMP2) PV
ORDER BY PERIOD
DECLARE #query NVARCHAR(MAX)
SET #query = 'SELECT * FROM
(
SELECT * FROM #TEMP2
) x
PIVOT
(
SUM(MARKS)
FOR [PERIOD] IN (' + #cols + ')
) p;'
EXEC SP_EXECUTESQL #query
Now you will get the following result

This is because you only have one column that you are grouping by.
Add Group by InvoiceDt, InvoiceCountry (or whatever field will give you more data.
With Cube will give you a Sum for each InvoiceDt and you will get a Sum for each InvoiceCountry.

You can find more detail about GROUPING SET, CUBE, ROLL UP. TL;DR they just expand GROUP BY + UNION ALL in some ways to get aggregation :)
https://technet.microsoft.com/en-us/library/bb510427(v=sql.105).aspx

All Voted answers are good.
One Important difference in general is
The N elements of a ROLLUP specification correspond to N+1 GROUPING
SETS.
The N elements of a CUBE specification correspond to 2^N GROUPING
SETS.
Further reading see my article with respect to spark sql
For example :
store_id,product_type
rollup is equivalent to
GROUP BY store_id,product_type
GROUPING SETS (
(store_id,product_type)
,(product_type)
, ())
for 2 (n) group by columns grouping set has (n+1 ) = 3 combinations of columns
Cube is equivalent to
GROUP BY store_id,product_type
GROUPING SETS (
(store_id,product_type)
,(store_id)
,(product_type)
, ())
for 2 (n) group by columns grouping set has (2^n ) = 4 combinations of columns

Related

Is there a way to use DISTINCT and COUNT(*) together to bulletproof your code against DUPLICATE entries?

I got help with a function yesterday to correctly get the count of multiple items in a column based on multiple criteria/columns. However, if there is a way to get the DISTINCT count of all the entries in the table based on aggregated GROUP BY statement.
SELECT TIME = ap.day,
acms.tenantId,
acms.CallingService,
policyList = ltrim(sp.value),
policyInstanceList = ltrim(dp.value),
COUNT(*) AS DISTINCTCount
FROM dbo.acms_data acms
CROSS APPLY string_split(acms.policyList, ',') sp
CROSS APPLY string_split(acms.policyInstanceList, ',') dp
CROSS APPLY (select day = convert(date, acms.[Time])) ap
GROUP BY ap.day, acms.tenantId, sp.value, dp.value, acms.CallingService
I would just like to know if there would be a way to see if there is a workaround for using DISTINCT and Count(*) together and whether or not it would affect my results to make this algorithm potentially invulnerable to duplicate entries.
The reason why I have to use COUNT(*) is because I am aggregating based on every column in the table not just a specific column or multiple.
We can use DISTINCT with COUNT together like this example.
USE AdventureWorks2012
GO
-- This query shows 290 JobTitle
SELECT COUNT(JobTitle) Total_JobTitle
FROM [HumanResources].[Employee]
GO
-- This query shows only 67 JobTitle
SELECT COUNT( DISTINCT JobTitle) Total_Distinct_JobTitle
FROM [HumanResources].[Employee]
GO

Get sum of previous 6 values including the group

I need to sum up the values for the last 7 days,so it should be the current plus the previous 6. This should happen for each row i.e. in each row the column value would be current + previous 6.
The case :-
(Note:- I will calculate the hours,by suming up the seconds).
I tried using the below query :-
select SUM([drivingTime]) OVER(PARTITION BY driverid ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
from [f.DriverHseCan]
The problem I face is I have to do grouping on driver,asset for a date
In the above case,the driving time should be sumed up and then,its previous 6 rows should be taken,
I cant do this using rank() because I need these rows as well as I have to show it in the report.
I tried doing this in SSRS and SQL both.
In short it is adding total driving time for current+ 6 previous days
Try the following query
SELECT
s.date
, s.driverid
, s.assetid
, s.drivingtime
, SUM(s2.drivingtime) AS total_drivingtime
FROM f.DriverHseCan s
JOIN (
SELECT date,driverid, SUM(drivingtime) drivingtime
FROM f.DriverHseCan
GROUP BY date,driverid
) AS s2
ON s.driverid = s2.driverid AND s2.date BETWEEN DATEADD(d,-6,s.date) AND s.date
GROUP BY
s.date
, s.driverid
, s.assetid
, s.drivingtime
If you have week start/end dates, there could be better performing alternatives to solve your problem, e.g. use the week number in SSRS expressions rather than do the self join on SQL server
I think aggregation does what you want:
select sum(sum([drivingTime])) over (partition by driverid
order by date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
)
from [f.DriverHseCan]
group by driverid, date
I guess you need to use CROSS APPLY.
Something like following? :
SELECT driverID,
date,
CA.Last6DayDrivingTime
FROM YourTable YT
CROSS APPLY
(
SELECT SUM(drivingTime) AS Last6DayDrivingTime
FROM YourTable CA ON CA.driverID=YT.driverID
WHERE CA.date BETWEEN DATEADD(DAY,-6,YT.date) AND YT.date)
) CA
Edit:
As you commented that cross apply slow down the performance, other option is to pre calculate the week values in temp table or using CTE and then use them in your main query.

how to perform multiple aggregations on a single SQL query

I have a table with Three columns:
GEOID, ParcelID, and PurchaseDate.
The PKs are GEOID and ParcelID which is formatted as such:
GEOID PARCELID PURCHASEDATE
12345 AB123 1/2/1932
12345 sfw123 2/5/2012
12345 fdf323 4/2/2015
12346 dfefej 2/31/2022 <-New GEOID
What I need is an aggregation based on GEOID.
I need to count the number of ParcelIDs from last month PER GEOID
and I need to provide a percentage of that GEOID of all total sold last month.
I need to produce three columns:
GEOID Nbr_Parcels_Sold Percent_of_total
For each GEOID, I need to know how many Parcels Sold Last month, and with that Number, find out how much percentage that entails for all Solds.
For example: if there was 20 Parcels Sold last month, and 4 of them were sold from GEOID 12345, then the output would be:
GEOID Nbr_Parcels_Sold Perc_Total
12345 4 .2 (or 20%)
I am having issues with the dual aggregation. The concern is that the table in question has over 8 million records.
if there is a SQL Warrior out here who have seen this issue before, Any wisdom would be greatly appreciated.
Thanks.
Hopefully you are using SQL Server 2005 or later version, in which case you can get advantage of windowed aggregation. In this case, windowed aggregation will allow you to get the total sale count alongside counts per GEOID and use the total in calculations. Basically, the following query returns just the counts:
SELECT
GEOID,
Nbr_Parcels_Sold = COUNT(*),
Total_Parcels_Sold = SUM(COUNT(*)) OVER ()
FROM
dbo.atable
GROUP BY
GEOID
;
The COUNT(*) call gives you counts per GEOID, according to the GROUP BY clause. Now, the SUM(...) OVER expression gives you the grand total count in the same row as the detail count. It is the empty OVER clause that tells the SUM function to add up the results of COUNT(*) across the entire result set. You can use that result in calculations just like the result of any other function (or any expression in general).
The above query simply returns the total value. As you actually want not the value itself but a percentage from it for each GEOID, you can just put the SUM(...) OVER call into an expression:
SELECT
GEOID,
Nbr_Parcels_Sold = COUNT(*),
Percent_of_total = COUNT(*) * 100 / SUM(COUNT(*)) OVER ()
FROM
dbo.atable
GROUP BY
GEOID
;
The above will give you integer percentages (truncated). If you want more precision or a different representation, remember to cast either the divisor or the dividend (optionally both) to a non-integer numeric type, since SQL Server always performs integral division when both operands are integers.
How about using sub-query to count the sum
WITH data AS
(
SELECT *
FROM [Table]
WHERE
YEAR(PURCHASEDATE) * 100 + MONTH(PURCHASEDATE) = 201505
)
SELECT
GEOID,
COUNT(*) AS Nbr_Parcels_Sold,
CONVERT(decimal(18,8), COUNT(*)) /
(SELECT COUNT(*) FROM data) AS Perc_Total
FROM
data t
GROUP BY
GEOID
EDIT
To update another table by the result, use UPDATE under WITH()
WITH data AS
(
SELECT *
FROM [Table]
WHERE
YEAR(PURCHASEDATE) * 100 + MONTH(PURCHASEDATE) = 201505
)
UPDATE target SET
Nbr_Parcels_Sold = source.Nbr_Parcels_Sold,
Perc_Total = source.Perc_Total
FROM
[AnotherTable] target
INNER JOIN
(
SELECT
GEOID,
COUNT(*) AS Nbr_Parcels_Sold,
CONVERT(decimal(18,8), COUNT(*)) /
(SELECT COUNT(*) FROM data) AS Perc_Total
FROM
data t
GROUP BY
GEOID
) source ON target.GEOID = source.GEOID
Try the following. It grabs the total sales into a variable then uses it in the subsequent query:
DECLARE #pMonthStartDate DATETIME
DECLARE #MonthEndDate DATETIME
DECLARE #TotalPurchaseCount INT
SET #pMonthStartDate = <EnterFirstDayOfAMonth>
SET #MonthEndDate = DATEADD(MONTH, 1, #pMonthStartDate)
SELECT
#TotalPurchaseCount = COUNT(*)
FROM
GEOIDs
WHERE
PurchaseDate BETWEEN #pMonthStartDate
AND #MonthEndDate
SELECT
GEOID,
COUNT(PARCELID) AS Nbr_Parcels_Sold,
CAST(COUNT(PARCELID) AS FLOAT) / CAST(#TotalPurchaseCount AS FLOAT) * 100.0 AS Perc_Total
FROM
GEOIDs
WHERE
ModifiedDate BETWEEN #pMonthStartDate
AND #MonthEndDate
GROUP BY
GEOID
I'm guessing your table name is GEOIDs. Change the value of #pMonthStartDate to suit yourself. If your PKs are as you say then this will be a quick query.

MDX - Top X Sales People by Total Sales for Each Date

I'm trying to do this, but with MDX in my cube:
select
*
from
(
select
Date, SalesPerson, TotalSales, row_number() over(partition by Date order by TotalSales desc) as Num
from SalesFact as ms
) as x
where
Num < 5
order by
Date, SalesPerson, Num desc
Let's say I have a cube with these dimensions:
Date (Year, Month, Date) - date is always 1st of month
SalesPerson
The fact table has three columns - Date, SalesPerson, TotalSales - ie, the amount that person sold in that month.
I want, for each month, to see the top 5 sales people, and each of their TotalSales. The top 5 sales people can be different from one month to the next.
I am able to get the results for one month, using a query that looks like this:
select
[Measures].[TotalSales] on columns,
(
subset
(
order
(
[SalesPerson].children,
[Measures].[TotalSales],
bdesc
),
0,
5
)
) on rows
from
Hypercube
where
(
[Date].[Date].&[2009-03-01T00:00:00]
)
What I'm after is a query that puts Date and SalesPerson on rows, and TotalSales on columns.
I want to see over time each month, and for each month, the top 5 sales people, and how much they sold.
When I try to do it this way, it doesn't seem to filter / group the sales people by each date (get top 5 for each date). The values returned are all over the place and include very low and null values. Notably, the SalesPerson list is the same for each date, even though TotalSales varies a lot.
select
[Measures].[TotalSales] on columns,
(
[Date].[Hierarchy].[Date].members,
subset
(
order
(
[SalesPerson].children,
[Measures].[TotalSales],
bdesc
),
0,
5
)
) on rows
from
Hypercube
It seems that everything inside "subset" needs to be filtered by the current [Date].[Hierarchy].[Date], but using CurrentMember gives a crossjoin / axis error:
select
[Measures].[TotalSales] on columns,
(
[Date].[Hierarchy].[Date].members,
subset
(
order
(
([SalesPerson].children, [Date].[Hierarchy].CurrentMember),
[Measures].[TotalSales],
bdesc
),
0,
5
)
) on rows
from
Hypercube
Error: Executing the query ... Query (3, 2) The Hierarchy hierarchy
is used more than once in the Crossjoin function.
Execution complete
I've tried several variations of the last query with no luck.
Hopefully the answers will be helpful to others new to MDX as well.
I eventually found out how to do what I was looking for. The solution revolved around using the Generate function, and starting with the basic example on MSDN and modifying the dimensions and measure to be the ones in my cube got me going in the right direction.
From http://msdn.microsoft.com/en-us/library/ms145526.aspx
Is there a better way?
Also, be wary of trying to refactor sets into the with block. This seems to change when the set is evaluated / change its scope and will change the results.
with
set
Dates as
{
[Date].[Hierarchy].[Date].&[2009-02-01T00:00:00],
[Date].[Hierarchy].[Date].&[2009-03-01T00:00:00],
[Date].[Hierarchy].[Date].&[2009-04-01T00:00:00]
}
select
Measures.[TotalSales]
on columns,
generate
(
Dates,
topcount
(
[Date].Hierarchy.CurrentMember
*
[SalesPerson].Children,
5,
Measures.[TotalSales]
)
)
on rows
from
Hypercube

Weighted average in T-SQL (like Excel's SUMPRODUCT)

I am looking for a way to derive a weighted average from two rows of data with the same number of columns, where the average is as follows (borrowing Excel notation):
(A1*B1)+(A2*B2)+...+(An*Bn)/SUM(A1:An)
The first part reflects the same functionality as Excel's SUMPRODUCT() function.
My catch is that I need to dynamically specify which row gets averaged with weights, and which row the weights come from, and a date range.
EDIT: This is easier than I thought, because Excel was making me think I required some kind of pivot. My solution so far is thus:
select sum(baseSeries.Actual * weightSeries.Actual) / sum(weightSeries.Actual)
from (
select RecordDate , Actual
from CalcProductionRecords
where KPI = 'Weighty'
) baseSeries inner join (
select RecordDate , Actual
from CalcProductionRecords
where KPI = 'Tons Milled'
) weightSeries on baseSeries.RecordDate = weightSeries.RecordDate
Quassnoi's answer shows how to do the SumProduct, and using a WHERE clause would allow you to restrict by a Date field...
SELECT
SUM([tbl].data * [tbl].weight) / SUM([tbl].weight)
FROM
[tbl]
WHERE
[tbl].date >= '2009 Jan 01'
AND [tbl].date < '2010 Jan 01'
The more complex part is where you want to "dynamically specify" the what field is [data] and what field is [weight]. The short answer is that realistically you'd have to make use of Dynamic SQL. Something along the lines of:
- Create a string template
- Replace all instances of [tbl].data with the appropriate data field
- Replace all instances of [tbl].weight with the appropriate weight field
- Execute the string
Dynamic SQL, however, carries it's own overhead. Is the queries are relatively infrequent , or the execution time of the query itself is relatively long, this may not matter. If they are common and short, however, you may notice that using dynamic sql introduces a noticable overhead. (Not to mention being careful of SQL injection attacks, etc.)
EDIT:
In your lastest example you highlight three fields:
RecordDate
KPI
Actual
When the [KPI] is "Weight Y", then [Actual] the Weighting Factor to use.
When the [KPI] is "Tons Milled", then [Actual] is the Data you want to aggregate.
Some questions I have are:
Are there any other fields?
Is there only ever ONE actual per date per KPI?
The reason I ask being that you want to ensure the JOIN you do is only ever 1:1. (You don't want 5 Actuals joining with 5 Weights, giving 25 resultsing records)
Regardless, a slight simplification of your query is certainly possible...
SELECT
SUM([baseSeries].Actual * [weightSeries].Actual) / SUM([weightSeries].Actual)
FROM
CalcProductionRecords AS [baseSeries]
INNER JOIN
CalcProductionRecords AS [weightSeries]
ON [weightSeries].RecordDate = [baseSeries].RecordDate
-- AND [weightSeries].someOtherID = [baseSeries].someOtherID
WHERE
[baseSeries].KPI = 'Tons Milled'
AND [weightSeries].KPI = 'Weighty'
The commented out line only needed if you need additional predicates to ensure a 1:1 relationship between your data and the weights.
If you can't guarnatee just One value per date, and don't have any other fields to join on, you can modify your sub_query based version slightly...
SELECT
SUM([baseSeries].Actual * [weightSeries].Actual) / SUM([weightSeries].Actual)
FROM
(
SELECT
RecordDate,
SUM(Actual)
FROM
CalcProductionRecords
WHERE
KPI = 'Tons Milled'
GROUP BY
RecordDate
)
AS [baseSeries]
INNER JOIN
(
SELECT
RecordDate,
AVG(Actual)
FROM
CalcProductionRecords
WHERE
KPI = 'Weighty'
GROUP BY
RecordDate
)
AS [weightSeries]
ON [weightSeries].RecordDate = [baseSeries].RecordDate
This assumes the AVG of the weight is valid if there are multiple weights for the same day.
EDIT : Someone just voted for this so I thought I'd improve the final answer :)
SELECT
SUM(Actual * Weight) / SUM(Weight)
FROM
(
SELECT
RecordDate,
SUM(CASE WHEN KPI = 'Tons Milled' THEN Actual ELSE NULL END) AS Actual,
AVG(CASE WHEN KPI = 'Weighty' THEN Actual ELSE NULL END) AS Weight
FROM
CalcProductionRecords
WHERE
KPI IN ('Tons Milled', 'Weighty')
GROUP BY
RecordDate
)
AS pivotAggregate
This avoids the JOIN and also only scans the table once.
It relies on the fact that NULL values are ignored when calculating the AVG().
SELECT SUM(A * B) / SUM(A)
FROM mytable
If I have understand the problem then try this
SET DATEFORMAT dmy
declare #tbl table(A int, B int,recorddate datetime,KPI varchar(50))
insert into #tbl
select 1,10 ,'21/01/2009', 'Weighty'union all
select 2,20,'10/01/2009', 'Tons Milled' union all
select 3,30 ,'03/02/2009', 'xyz'union all
select 4,40 ,'10/01/2009', 'Weighty'union all
select 5,50 ,'05/01/2009', 'Tons Milled'union all
select 6,60,'04/01/2009', 'abc' union all
select 7,70 ,'05/01/2009', 'Weighty'union all
select 8,80,'09/01/2009', 'xyz' union all
select 9,90 ,'05/01/2009', 'kws' union all
select 10,100,'05/01/2009', 'Tons Milled'
select SUM(t1.A*t2.A)/SUM(t2.A)Result from
(select RecordDate,A,B,KPI from #tbl)t1
inner join(select RecordDate,A,B,KPI from #tbl t)t2
on t1.RecordDate = t2.RecordDate
and t1.KPI = t2.KPI