Best Index for query with window functions - sql

I am having some performance issues with a query in SQL Server 2012.
The query is used to insert data in a table using window functions to aggregate sales data in different ways (Previous month, previous year month, Cycle to date, YTD, MAT).
After researching window functions fairly extensively, I think that an appropriate index on the table from which the data is read would help a lot, but I am struggling to find the correct one (too many columns involved)...
The source table from which the query reads the data has around 50 million rows and is truncated and reloaded on a daily basis by an SSIS package that can be modified to drop and create the indexes in each execution.
Could somebody suggest what index might work (if any) or any other performance improvement method?
The select statement is as follows:
SELECT
PERIOD,
CUENTA_ID,
PROD_ID,
TIPO_VENTA,
VENTA_EUROS,
CICLO,
DELEGADO_B2B,
SUM(VENTA_EUROS) OVER (PARTITION BY CUENTA_ID, PROD_ID, TIPO_VENTA,DELEGADO_B2B ORDER BY PERIOD ROWS BETWEEN 12 PRECEDING AND 12 PRECEDING) AS VENTA_EUROS_PREV,
SUM(VENTA_EUROS) OVER (PARTITION BY CUENTA_ID, PROD_ID, TIPO_VENTA,DELEGADO_B2B,YEAR ORDER BY PERIOD ROWS UNBOUNDED PRECEDING) AS VENTA_EUROS_YTD,
SUM(VENTA_EUROS) OVER (PARTITION BY CUENTA_ID, PROD_ID, TIPO_VENTA,DELEGADO_B2B,YEAR, CICLO ORDER BY PERIOD ROWS UNBOUNDED PRECEDING) AS VENTA_EUROS_CTD,
SUM(VENTA_EUROS) OVER (PARTITION BY CUENTA_ID, PROD_ID, TIPO_VENTA,DELEGADO_B2B ORDER BY PERIOD ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS VENTA_EUROS_MONTH_PREV,
SUM(VENTA_EUROS) OVER (PARTITION BY CUENTA_ID, PROD_ID, TIPO_VENTA,DELEGADO_B2B ORDER BY PERIOD ROWS 11 PRECEDING) AS VENTA_EUROS_MAT
FROM _REPORTING.[dbo].[RPT_VENTA_MENSUAL_STEP_1]
WHERE YEAR>=YEAR(DATEADD(day,-1,GETDATE()))-1
I checked the execution plan, and the parts that take the biggest percentages are the three sorts for the three different "OVER (PARTITION BY ...)" clauses.
Here is the plan:
https://www.brentozar.com/pastetheplan/?id=B1fsgwjBE
Thanks & Regards

The first thing the index needs to resolve is the WHERE clause. Unfortunately, it has an inequality, which pretty much makes it impossible for the optimizer to help with the windowing clauses.
If you had:
WHERE YEAR = YEAR(DATEADD(day, -1, GETDATE())) - 1
Then the optimizer could take advantage of an index on (YEAR, CUENTA_ID, PROD_ID, TIPO_VENTA, DELEGADO_B2B, PERIOD).
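As a rough sketch of that suggestion, the index could be created at the end of the SSIS load, once the table has been repopulated. The index name is made up, and the INCLUDE list (the remaining columns the SELECT reads) is an assumption added so that the index covers the query; neither comes from the answer above.

```sql
-- Hypothetical index name; run in the _REPORTING database after the daily reload.
-- Key columns follow the order suggested above: the WHERE column first,
-- then the PARTITION BY columns, then the ORDER BY column.
CREATE NONCLUSTERED INDEX IX_RPT_VENTA_MENSUAL_STEP_1_WINDOW
    ON dbo.RPT_VENTA_MENSUAL_STEP_1
       ([YEAR], CUENTA_ID, PROD_ID, TIPO_VENTA, DELEGADO_B2B, PERIOD)
    INCLUDE (VENTA_EUROS, CICLO);  -- assumption: covers the remaining selected columns
```

Whether the optimizer actually uses the index to avoid the sorts is something to confirm in the execution plan.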

Related

How to use a moving limited RANGE window to multiple ORDER BYs?

This is my table:
userID
Year
Month
Day
NbOfVisits
I would like to calculate the 7 days moving average, my query is as follows:
select userID,year,month,day, sum(nbofvisits) OVER (Partition by userID order by year,month,day RANGE BETWEEN 7 PRECEDING AND CURRENT ROW) as nbVisits7days
from table
order by userID, year, month, day;
But I keep getting this error: "A range window frame with value boundaries cannot be used in a window specification with multiple order by expressions". I understand I have multiple "Order Bys", but I can't think of a straight way other than this.
Following Jon Armstrong's comment, I managed to run my intended query as follows:
select userID,year,month,day, sum(nbofvisits) OVER (Partition by userID order by TIMESTAMP(concat(year,'-',month,'-',day)) RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW) as nbVisits7days
from table
order by userID, year, month, day;
Thank you!
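One detail worth checking: a frame from 7 days preceding through the current row spans 8 calendar days, so a strict 7-day window would use 6. Here is a minimal self-contained sketch of the same idea, assuming PostgreSQL 11 or later (make_date and the sample rows are not from the original post):

```sql
-- Hypothetical sample data: one user with visits on 1, 2, 8 and 9 January.
WITH visits (user_id, year, month, day, nb_of_visits) AS (
    VALUES (1, 2020, 1, 1, 1),
           (1, 2020, 1, 2, 1),
           (1, 2020, 1, 8, 1),
           (1, 2020, 1, 9, 1)
)
SELECT user_id,
       make_date(year, month, day) AS visit_date,
       -- A single date expression in ORDER BY is what allows a value-based RANGE frame.
       -- 6 days preceding plus the current day gives a 7-day window.
       SUM(nb_of_visits) OVER (
           PARTITION BY user_id
           ORDER BY make_date(year, month, day)
           RANGE BETWEEN INTERVAL '6' DAY PRECEDING AND CURRENT ROW
       ) AS nb_visits_7days
FROM visits
ORDER BY user_id, visit_date;
```

With this data, the row for 8 January sums the visits of 2 and 8 January only; with INTERVAL '7' DAY it would also pick up 1 January.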

Why Window Functions Require My Aggregated Column in Group

I have been working with window functions a fair amount but I don't think I understand enough about how they work to answer why they behave the way they do.
For the query that I was working on (below), why am I required to take my aggregated field and add it to the group by? (In the second half of my query below I am unable to produce a result if I don't include "Events" in my second group by)
With Data as (
Select
CohortDate as month
,datediff(week,CohortDate,EventDate) as EventAge
,count(distinct case when EventDate is not null then GUID end) as Events
From MyTable
where month >= [getdate():month] - interval '12 months'
group by 1, 2
order by 1, 2
)
Select
month
,EventAge
,sum(Events) over (partition by month order by SubAge asc rows between unbounded preceding and current row) as TotEvents
from data
group by 1, 2, Events
order by 1, 2
I have run into this enough that I have just taken it for granted, but would really love some more color as to why this is needed. Is there a way I should be formatting these differently in order to avoid this (somewhat non-intuitive) requirement?
Thanks a ton!
What you are looking for is presumably a cumulative sum. That would be:
select month, EventAge,
sum(sum(Events)) over (partition by month
order by SubAge asc
rows between unbounded preceding and current row
) as TotEvents
from data
group by 1, 2
order by 1, 2 ;
Why? That might be a little hard to explain. Perhaps if you see the equivalent version with a subquery it will be clearer:
select me.*,
sum(sum_events) over (partition by month
order by SubAge asc
rows between unbounded preceding and current row
) as TotEvents
from (select month, EventAge, sum(events) as sum_events
from data
group by 1, 2
) me
order by 1, 2 ;
The first query is pretty much an exact shorthand for this one. The window function is evaluated after aggregation. You want to sum the SUM of the events after the aggregation. Hence, you need sum(sum(events)). After the aggregation, events is no longer available.
The nesting of aggregation functions is awkward at first -- at least it was for me. When I first started using window functions, I think I first spent a few days writing aggregation queries using subqueries and then rewriting without the subqueries. Quickly, I got used to writing them without subqueries.
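To see the evaluation order on something concrete, here is a small self-contained sketch in generic SQL (it should run on PostgreSQL or SQLite). The table contents and column names are invented for illustration; two rows share the same group so that the inner SUM has something to do.

```sql
-- Hypothetical sample data; (1, 1) appears twice on purpose.
WITH data (cohort_month, event_age, events) AS (
    VALUES (1, 0, 5),
           (1, 1, 3),
           (1, 1, 2),
           (2, 0, 4),
           (2, 1, 1)
)
SELECT cohort_month,
       event_age,
       SUM(events) AS events,            -- evaluated first, once per group
       SUM(SUM(events)) OVER (           -- evaluated afterwards, over the grouped rows
           PARTITION BY cohort_month
           ORDER BY event_age
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS tot_events
FROM data
GROUP BY cohort_month, event_age
ORDER BY cohort_month, event_age;
-- cohort_month | event_age | events | tot_events
--            1 |         0 |      5 |          5
--            1 |         1 |      5 |         10
--            2 |         0 |      4 |          4
--            2 |         1 |      1 |          5
```

The inner SUM collapses the two (1, 1) rows into one group, and the outer windowed SUM then accumulates those group totals within each month.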

aggregate multiple rows based on time ranges

I have a customer who uses different devices over a specific period of time, tracked with valid_from and valid_to dates. However, every time something changes for a device, a new row is written with no visible change in the row data other than a new valid_from/valid_to.
What I'm trying to do is aggregate the first two rows into one, and the same for rows 3 and 4, while leaving rows 5 and 6 as they are. All the solutions I have come up with so far work for a usage history where the user does not switch back to device a; as soon as they do, everything fails.
I'd really appreciate some help. Thanks in advance!
If you know that the previous valid_to is the same as the current valid_from, then you can use lag() to identify where a new grouping starts. Then use a cumulative sum to calculate the grouping and finally aggregation:
select cust, act_dev, min(valid_from), max(valid_to)
from (select t.*,
sum(case when prev_valid_to = valid_from then 0 else 1 end) over (partition by cust order by valid_from) as grouping
from (select t.*,
lag(valid_to) over (partition by cust, act_dev order by valid_from) as prev_valid_to
from t
) t
) t
group by cust, act_dev, grouping;
Here is a db<>fiddle.
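For readers without the original data, the following is a self-contained sketch of the same lag-plus-cumulative-sum approach in PostgreSQL-style syntax. The six rows are invented to match the description in the question (rows 1 and 2 merge, rows 3 and 4 merge, rows 5 and 6 stay separate because the customer switches back to device a), and the alias grp replaces grouping, which is a reserved word in some databases.

```sql
-- Hypothetical sample data; dates are ISO strings so MIN/MAX compare correctly.
WITH t (cust, act_dev, valid_from, valid_to) AS (
    VALUES (1, 'a', '2020-01-01', '2020-02-01'),
           (1, 'a', '2020-02-01', '2020-03-01'),
           (1, 'b', '2020-03-01', '2020-04-01'),
           (1, 'b', '2020-04-01', '2020-05-01'),
           (1, 'a', '2020-05-01', '2020-06-01'),
           (1, 'b', '2020-06-01', '2020-07-01')
)
SELECT cust,
       act_dev,
       MIN(valid_from) AS first_valid_from,
       MAX(valid_to)   AS last_valid_to
FROM (
    SELECT s.*,
           -- Running count of "island starts": a row starts a new island when
           -- the previous row for the same device does not end where this one begins.
           SUM(CASE WHEN prev_valid_to = valid_from THEN 0 ELSE 1 END)
               OVER (PARTITION BY cust ORDER BY valid_from) AS grp
    FROM (
        SELECT t.*,
               LAG(valid_to) OVER (PARTITION BY cust, act_dev
                                   ORDER BY valid_from) AS prev_valid_to
        FROM t
    ) s
) g
GROUP BY cust, act_dev, grp
ORDER BY first_valid_from;
-- 1 | a | 2020-01-01 | 2020-03-01
-- 1 | b | 2020-03-01 | 2020-05-01
-- 1 | a | 2020-05-01 | 2020-06-01
-- 1 | b | 2020-06-01 | 2020-07-01
```

The return to device a gets its own island because the previous device-a row ended on 2020-03-01, not on 2020-05-01.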

Netezza SQL: Specify an offset in a window frame

When making the "frame" for a windowed analytic function, one can specify a literal number of rows to "look back" over. E.g., the following will get the trailing 26 weeks of weekly sales for a household.
,sum(sales) over (partition by household_id order by week_id rows 26 preceding) as x26
But... what if you wanted to look back (or forward) with an offset? E.g., if for week n, you wanted the sales for the 26 weeks that ended 8 weeks before week n? As I was typing this, it occurred to me that I could probably do it in parts. I.e.,
,sum(sales) over (partition by household_id order by week_id rows 34 preceding) as x34
,sum(sales) over (partition by household_id order by week_id rows 8 preceding) as x8
...and have trailing26_offset8 = x34 - x8
Hm... Glad I asked. But anyway, do you know if there's a feature that will let me specify the offset right in the partition specification itself?
Thanks!
Try using between in the window frame specification:
sum(sales) over (partition by household_id
order by week_id
rows between 34 preceding and 8 preceding
) as x34
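The two approaches are not quite the same frame: "rows N preceding" includes the current row, so subtracting "rows M preceding" from it leaves rows N through M+1 preceding. The sketch below shows this at a smaller scale, in PostgreSQL-style syntax with invented data (a VALUES-based CTE may need adapting for Netezza).

```sql
-- Hypothetical sample data: six consecutive weeks for one household.
WITH weekly (household_id, week_id, sales) AS (
    VALUES (1, 1, 10), (1, 2, 20), (1, 3, 30),
           (1, 4, 40), (1, 5, 50), (1, 6, 60)
)
SELECT week_id,
       -- Offset frame: the two weeks ending two weeks before the current one.
       SUM(sales) OVER (PARTITION BY household_id ORDER BY week_id
                        ROWS BETWEEN 3 PRECEDING AND 2 PRECEDING) AS framed,
       -- Subtraction: (3 preceding..current) minus (1 preceding..current)
       -- leaves rows 3 preceding through 2 preceding.
       SUM(sales) OVER (PARTITION BY household_id ORDER BY week_id
                        ROWS 3 PRECEDING)
     - SUM(sales) OVER (PARTITION BY household_id ORDER BY week_id
                        ROWS 1 PRECEDING) AS subtracted
FROM weekly
ORDER BY week_id;
-- week_id | framed | subtracted
--       1 |   NULL |          0
--       2 |   NULL |          0
--       3 |     10 |         10
--       4 |     30 |         30
--       5 |     50 |         50
--       6 |     70 |         70
```

By the same reasoning, x34 - x8 from the question corresponds to "rows between 34 preceding and 9 preceding" (26 rows), while "between 34 preceding and 8 preceding" covers 27 rows. The framed version also returns NULL rather than 0 when the frame is empty.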

Oracle Running Total

Looking for advice with 2 different types of sub-totals using PLSQL.
I need to pull a data set with 1) a unique headcount, and 2) a total number of credits, as a running total over time.
Raw Data:
This is the transactional data -- every time a student registers for a course, a record is inserted with the date, student id, and credits (along with course number and a bunch of other relevant data). One record per course per student.
STUDENT_ID CREDITS DATE
1 3 01-JAN-12
1 2 02-JAN-12
57 1 03-JAN-12
1 1 03-JAN-12
Processed Data:
This is what the boss needs to see -- it will be used for trending later (to see, for example, how this year's Jan-01 is measuring up against last year's Jan-01, etc.).
UniqueHeadcount SumCredits Date
1 3 01-JAN-12
1 5 02-JAN-12
2 7 03-JAN-12
The brute approach to this is to write a bunch of separate SELECTS (one for each day), and UNION them together. For example:
SELECT
COUNT(DISTINCT STUDENT_ID) as "UniqueHeadcount",
SUM(CREDIT_HR) as "SumCredits",
'01-JAN-12' as "DATE"
FROM
REGISTRATIONS
WHERE
TO_CHAR(DATE,'yyyymmdd') <= '20120101'
GROUP BY
'01-JAN-12'
UNION
SELECT
COUNT(DISTINCT STUDENT_ID) as "UniqueHeadcount",
SUM(CREDIT_HR) as "SumCredits",
'02-JAN-12' as "DATE"
FROM
REGISTRATIONS
WHERE
TO_CHAR(DATE,'yyyymmdd') <= '20120102'
GROUP BY
'02-JAN-12'
UNION
...
And that works -- the results are accurate -- but as you can see -- this is nowhere near elegant -- and if you have to do it for 365 days, well...it's a beast. There's got to be a better way to do it.
So far in my search, I've learned about an 'OVER' clause that I can use -- like this:
SELECT
COUNT(DISTINCT STUDENT_ID) OVER(ORDER BY TRUNC(RSTS_DATE) ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) "UniqueHeadcount",
SUM(CREDIT_HR) OVER(ORDER BY TRUNC(RSTS_DATE) ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as "SumCredits",
TRUNC(RSTS_DATE) as "DATE"
FROM
REGISTRATIONS
This query is way, way shorter (yay) -- but has two significant problems that I can't yet find my way around. First is that it doesn't work (by design, apparently?) with the COUNT DISTINCT. So I comment that out for a moment, but then run into the second problem: it ignores the TRUNC() function. The RSTS_DATE, though it appears to be just a day/month/year value when you run a SELECT on it, actually holds the time as well, so the result set I get is not summed simply over date, but also over times -- so instead of one record per day, my processed data returns hundreds of records per day (one for each individual course registration). For example:
UniqueHeadcount SumCredits Date
1 3 01-JAN-12
1 5 02-JAN-12
2 6 03-JAN-12 (hidden time: 07:32:27)
2 7 03-JAN-12 (hidden time: 08:01:33)
Not what I'm after.
So I'm looking for expertise -- if what I've explained so far makes sense -- is there another way to use the OVER clause, or perhaps there may be another feature of PLSQL altogether I should be using for this? I'm not strong in PLSQL if you can't tell, but if anyone can give me some direction -- even just words to google, I'd appreciate the help.
Thanks
Try this:
WITH CRdata AS
(
SELECT COUNT(DISTINCT STUDENT_ID) AS UniqueHeadcount,
SUM(CREDIT_HR) AS SumCredits,
TRUNC(RSTS_DATE) RSTS_DATE
FROM REGISTRATIONS
GROUP BY TRUNC(RSTS_DATE)
)
SELECT SUM(UniqueHeadcount) OVER(ORDER BY RSTS_DATE ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS UniqueHeadcount,
SUM(SumCredits) OVER(ORDER BY RSTS_DATE ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS SumCredits,
RSTS_DATE
FROM CRdata
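One caveat with summing the per-day distinct counts: a student who registers on several days is counted once for each of those days, so on the sample data the headcount would come out as 1, 2, 4 rather than 1, 1, 2. A possible variation, offered as a sketch rather than a tested solution, is to count each student only on the day of their first registration. The inline sample rows (with made-up times of day) stand in for the real REGISTRATIONS table, and the names flagged, daily, and is_new_student are invented.

```sql
-- Sample rows from the question stand in for the real table; times are hypothetical.
WITH registrations AS (
    SELECT 1 AS student_id, 3 AS credit_hr,
           TO_DATE('2012-01-01 09:15', 'YYYY-MM-DD HH24:MI') AS rsts_date FROM dual UNION ALL
    SELECT 1,  2, TO_DATE('2012-01-02 10:30', 'YYYY-MM-DD HH24:MI') FROM dual UNION ALL
    SELECT 57, 1, TO_DATE('2012-01-03 07:32', 'YYYY-MM-DD HH24:MI') FROM dual UNION ALL
    SELECT 1,  1, TO_DATE('2012-01-03 08:01', 'YYYY-MM-DD HH24:MI') FROM dual
),
flagged AS (
    -- Mark exactly one row per student: their earliest registration.
    SELECT TRUNC(rsts_date) AS reg_date,
           credit_hr,
           CASE WHEN ROW_NUMBER() OVER (PARTITION BY student_id
                                        ORDER BY rsts_date) = 1
                THEN 1 ELSE 0 END AS is_new_student
    FROM registrations
),
daily AS (
    -- Collapse to one row per calendar day before applying the running totals.
    SELECT reg_date,
           SUM(is_new_student) AS new_students,
           SUM(credit_hr)      AS credits
    FROM flagged
    GROUP BY reg_date
)
SELECT SUM(new_students) OVER (ORDER BY reg_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS "UniqueHeadcount",
       SUM(credits) OVER (ORDER BY reg_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS "SumCredits",
       reg_date AS "DATE"
FROM daily
ORDER BY reg_date;
```

On the four sample rows this yields 1/3, 1/5, and 2/7 for 01-JAN-12 through 03-JAN-12, matching the processed data in the question.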