SQL duration between two dates in different rows - sql

I would really appreciate help constructing a Microsoft SQL Server 2000 query that returns the duration between a customer's A entry and their B entry.
Not all customers are expected to have a B record; for those customers no row should be returned.
Customers Audit
+---+---------------+---+----------------------+
| 1 | Peter Griffin | A | 2013-01-01 15:00:00 |
| 2 | Martin Biggs | A | 2013-01-02 15:00:00 |
| 3 | Peter Griffin | C | 2013-01-05 09:00:00 |
| 4 | Super Mario | A | 2013-01-01 15:00:00 |
| 5 | Martin Biggs | B | 2013-01-03 18:00:00 |
+---+---------------+---+----------------------+
I'm hoping for results similar to:
+--------------+----------------+
| Martin Biggs | 1 day, 3 hours |
+--------------+----------------+

Something like the below (don't know your schema, so you'll need to change names of objects) should suffice.
SELECT ABS(DATEDIFF(HOUR, CA.TheDate, CB.TheDate)) AS HoursBetween
FROM dbo.Customers CA
INNER JOIN dbo.Customers CB
ON CB.Name = CA.Name
AND CB.Code = 'B'
WHERE CA.Code = 'A'
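The hour count still has to be turned into the "1 day, 3 hours" style of text the question asks for. Below is a minimal Python sketch of that arithmetic on the sample rows from the question. The helper name and the tuple layout are made up for illustration and do not correspond to any real schema:

```python
from datetime import datetime

# Sample audit rows from the question: (id, name, code, timestamp).
rows = [
    (1, "Peter Griffin", "A", datetime(2013, 1, 1, 15)),
    (2, "Martin Biggs", "A", datetime(2013, 1, 2, 15)),
    (3, "Peter Griffin", "C", datetime(2013, 1, 5, 9)),
    (4, "Super Mario", "A", datetime(2013, 1, 1, 15)),
    (5, "Martin Biggs", "B", datetime(2013, 1, 3, 18)),
]

def a_to_b_durations(rows):
    """Pair each customer's A row with their B row (hypothetical helper)."""
    a_times = {name: ts for _, name, code, ts in rows if code == "A"}
    result = {}
    for _, name, code, ts in rows:
        # Customers without a B row never enter the result, like the INNER JOIN.
        if code == "B" and name in a_times:
            hours = int(abs((ts - a_times[name]).total_seconds()) // 3600)
            days, rem = divmod(hours, 24)
            result[name] = f"{days} day(s), {rem} hour(s)"
    return result

durations = a_to_b_durations(rows)
print(durations)
```

In SQL the same split would be integer division and modulo by 24 on the hour difference.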

SELECT A.CUSTOMER, DATEDIFF(HOUR, A.ENTRY_DATE, B.ENTRY_DATE) DURATION
FROM CUSTOMERSAUDIT A, CUSTOMERSAUDIT B
WHERE B.CUSTOMER = A.CUSTOMER AND B.ENTRY_DATE > A.ENTRY_DATE

This is an Oracle query, but as far as I know equivalent features are available in SQL Server. I'm sure I don't have to tell you how to concatenate the output to get the desired result. The values come back in separate columns (days, hours, and so on), and it is not always easy to format the output here:
SELECT id, name, grade
, NVL(EXTRACT(DAY FROM day_time_diff), 0) days
, NVL(EXTRACT(HOUR FROM day_time_diff), 0) hours
, NVL(EXTRACT(MINUTE FROM day_time_diff), 0) minutes
, NVL(EXTRACT(SECOND FROM day_time_diff), 0) seconds
FROM
(
SELECT id, name, grade
, (begin_date-end_date) day_time_diff
FROM
(
SELECT id, name, grade
, CAST(start_date AS TIMESTAMP) begin_date
, CAST(end_date AS TIMESTAMP) end_date
FROM
(
SELECT id, name, grade, start_date
, LAG(start_date, 1, to_date(null)) OVER (ORDER BY id) end_date
FROM stack_test
)
)
)
/
Output:
ID NAME GRADE DAYS HOURS MINUTES SECONDS
------------------------------------------------------------
1 Peter Griffin A 0 0 0 0
2 Martin Biggs A 1 1 0 0
3 Peter Griffin C 2 17 0 0
4 Super Mario A -3 -18 0 0
5 Martin Biggs A 2 3 0 0
The table structure and columns I used are below. It would be great if you provided this, along with the data, in advance:
CREATE TABLE stack_test
(
id NUMBER
,name VARCHAR2(50)
,grade VARCHAR2(3)
,start_date DATE
)
/


How to make entries from column appear as row title

I have a hospital database which looks something like this
id | patient_name | admitDate | DischargeDate |RoomCategory
1 | john |3/01/2011 | 5/01/2011 |Category1
2 | lisa |3/01/2011 | 4/01/2011 |Category2
3 | ron |5/01/2011 | 10/01/2011 |Category1
4 | howard |6/01/2012 | 10/01/2012 |Category3
5 | john |6/05/2011 | 7/05/2011 |Category4
6 | rammy |6/02/2011 | 7/03/2011 |Category4
I have to calculate the number of patients in the hospital on each day (both the admit and the discharge date are to be counted) and group them by category.
Suppose on 3/01/2011 we have 2 patients, one in Category1 and one in Category2. On 4/01/2011 we again have the same 2 patients. On 5/01/2011 lisa (id 2) has been discharged, so only one of those patients remains, in Category1, but ron (id 3) is admitted that day, so he has to be counted too.
The output should look something like this
Date | Category1 | Category2 | Category3 |Category4
3/01/2011 | 1 | 1 | 0 | 0
4/01/2011 | 1 | 1 | 0 | 0
5/01/2011 | 2 | 0 | 0 | 0
I am not able to figure out how to list all the dates that might have a patient, because the actual table is huge and a lot of dates don't have any patients. I also can't work out how to count separately to get the count under each category.
I have 15 categories in total in my actual table, so using a separate WHERE for each of them wouldn't be very efficient.
You have two problems here: first you need a calendar table, and second you need a pivot. Honestly, I suggest you invest in creating a permanent calendar table first, but I use an inline one here. You can then pivot the values into columns. I use conditional aggregation for that, as it is more portable and less restrictive.
SELECT *
INTO dbo.YourTable
FROM (VALUES(1,'john ',CONVERT(date,'3/01/2011'),CONVERT(date,'5/01/2011 '),'Category1'),
(2,'lisa ',CONVERT(date,'3/01/2011'),CONVERT(date,'4/01/2011 '),'Category2'),
(3,'ron ',CONVERT(date,'5/01/2011'),CONVERT(date,'10/01/2011'),'Category1'),
(4,'howard',CONVERT(date,'6/01/2012'),CONVERT(date,'10/01/2012'),'Category3'),
(5,'john ',CONVERT(date,'6/05/2011'),CONVERT(date,'7/05/2011 '),'Category4'),
(6,'rammy ',CONVERT(date,'6/02/2011'),CONVERT(date,'7/03/2011 '),'Category4'))V(id,patient_name,admitDate,DischargeDate,RoomCategory)
GO
WITH N AS(
SELECT N
FROM (VALUES(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL))N(N)),
Tally AS(
SELECT 0 AS I
UNION ALL
SELECT TOP (SELECT DATEDIFF(DAY, MIN(admitDate), MAX(DischargeDate)) FROM dbo.YourTable)
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS I
FROM N N1, N N2, N N3), --UP to 1000 days. Add more cross joins for more days
Calendar AS(
SELECT DATEADD(DAY, T.I, YT.MinAdmitDate) AS D
FROM Tally T
CROSS APPLY (SELECT MIN(admitDate) AS MinAdmitDate FROM dbo.YourTable) YT)
SELECT C.D AS [Date],
COUNT(CASE YT.RoomCategory WHEN 'Category1' THEN 1 END) AS Category1,
COUNT(CASE YT.RoomCategory WHEN 'Category2' THEN 1 END) AS Category2,
COUNT(CASE YT.RoomCategory WHEN 'Category3' THEN 1 END) AS Category3,
COUNT(CASE YT.RoomCategory WHEN 'Category4' THEN 1 END) AS Category4
FROM Calendar C
LEFT JOIN dbo.YourTable YT ON C.D >= YT.admitDate
AND C.D <= DischargeDate
GROUP BY C.D;
GO
DROP TABLE dbo.YourTable;
db<>fiddle. Note that the results might not be what you expect, because db<>fiddle defaults to American settings, your date format is ambiguous, and I don't pass an explicit style to the CONVERT functions.
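The two steps (a calendar of every day in range, then one conditional count per category) can be sketched outside SQL as well. The Python below is only an illustration: it assumes the question's dates are day/month/year, uses the first three sample rows, and the helper name is hypothetical:

```python
from collections import Counter
from datetime import date, timedelta

# Sample stays, reading the question's dates as day/month/year (an assumption).
stays = [
    ("john", date(2011, 1, 3), date(2011, 1, 5), "Category1"),
    ("lisa", date(2011, 1, 3), date(2011, 1, 4), "Category2"),
    ("ron", date(2011, 1, 5), date(2011, 1, 10), "Category1"),
]
categories = ["Category1", "Category2", "Category3", "Category4"]

def daily_counts(stays, categories):
    """One row per calendar day, one count per category (hypothetical helper)."""
    first = min(s[1] for s in stays)
    last = max(s[2] for s in stays)
    table = {}
    day = first
    while day <= last:  # the "calendar": every day in range, even empty ones
        present = Counter(cat for _, adm, dis, cat in stays if adm <= day <= dis)
        # Conditional aggregation: one count per category, 0 when absent.
        table[day] = [present[c] for c in categories]
        day += timedelta(days=1)
    return table

counts = daily_counts(stays, categories)
for day in sorted(counts)[:3]:
    print(day, counts[day])
```

The first three days reproduce the expected output in the question (1/1, 1/1, then 2/0 for Category1/Category2).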

In BigQuery, can I add rows of missing data? [duplicate]

I have a table where each row represents the number of transactions a user has per day. If they had no transactions that day, they have no row for that date. How can I add these 'missing rows' and set the number of transactions to 0?
My table:
Date | User | numTransactions
2020-01-01 | anna | 2
2020-01-01 | john | 3
2020-01-02 | anna | 1
2020-01-04 | anna | 1
2020-01-05 | john | 2
Anna had transactions on Jan 1, 2 and 4, but not on Jan 3 and 5.
John had transactions on Jan 1 and 5, but not on Jan 2, 3 and 4.
I want to add rows that show the dates with 0 transactions:
Date | User | numTransactions
2020-01-01 | anna | 2
2020-01-01 | john | 3
2020-01-02 | anna | 1
2020-01-04 | anna | 1
2020-01-05 | john | 2
2020-01-02 | john | 0
2020-01-03 | anna | 0
2020-01-03 | john | 0
2020-01-04 | john | 0
2020-01-05 | anna | 0
You can join with GENERATE_DATE_ARRAY:
WITH test_table AS (
SELECT DATE '2020-01-01' AS Date, 'anna' AS User, 2 AS numTransactions UNION ALL
SELECT '2020-01-01', 'john', 3 UNION ALL
SELECT '2020-01-02', 'anna', 1 UNION ALL
SELECT '2020-01-04', 'anna', 1 UNION ALL
SELECT '2020-01-05', 'john', 2
),
clients_list AS (
SELECT DISTINCT User FROM test_table
)
SELECT
Date,
User,
IFNULL(numTransactions, 0) AS numTransactions
FROM UNNEST(GENERATE_DATE_ARRAY('2020-01-01', '2020-01-05')) AS Date
CROSS JOIN clients_list
LEFT JOIN test_table USING(Date, User)
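The shape of that query (all dates crossed with all users, then a left join that falls back to 0) can be sketched in plain Python. This is an illustration on the question's sample data, not BigQuery code, and `fill_missing` is a hypothetical helper name:

```python
from datetime import date, timedelta
from itertools import product

# Sample rows from the question: (date, user) -> numTransactions.
transactions = {
    (date(2020, 1, 1), "anna"): 2,
    (date(2020, 1, 1), "john"): 3,
    (date(2020, 1, 2), "anna"): 1,
    (date(2020, 1, 4), "anna"): 1,
    (date(2020, 1, 5), "john"): 2,
}

def fill_missing(transactions, start, end):
    """Every (date, user) pair in range, 0 where no row exists."""
    # Counterpart of GENERATE_DATE_ARRAY: one entry per day, inclusive.
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    # Counterpart of SELECT DISTINCT User.
    users = sorted({user for _, user in transactions})
    # CROSS JOIN plus LEFT JOIN with IFNULL(..., 0).
    return [(d, u, transactions.get((d, u), 0)) for d, u in product(days, users)]

filled = fill_missing(transactions, date(2020, 1, 1), date(2020, 1, 5))
for row in filled:
    print(*row)
```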
I recommend writing the code in this fashion:
WITH t AS (
SELECT DATE '2020-01-01' AS Date, 'anna' AS User, 2 AS numTransactions UNION ALL
SELECT '2020-01-01', 'john', 3 UNION ALL
SELECT '2020-01-02', 'anna', 1 UNION ALL
SELECT '2020-01-04', 'anna', 1 UNION ALL
SELECT '2020-01-05', 'john', 2
)
SELECT u.user, COALESCE(dte, u.date) as date,
(CASE WHEN dte = u.date THEN u.numTransactions ELSE 0 END) as numTransactions
FROM (SELECT user, date, numTransactions,
COALESCE(DATE_ADD(LEAD(DATE) OVER (PARTITION BY user ORDER BY date), INTERVAL -1 DAY), DATE '2020-01-05') as end_date
FROM t
) u LEFT JOIN
UNNEST(GENERATE_DATE_ARRAY(date, end_date, INTERVAL 1 DAY)) dte
ON 1=1
ORDER BY user, date;
This is slightly simpler than generating all the dates up-front (not requiring getting the unique names and then re-joining to the same table).
Much more important are the performance characteristics, which have proven very important in my experience in making this scalable. Basically, the CROSS JOIN for generating all user/date combinations uses a lot of resources. This version keeps all the operations "local" to a given user (well, there is some data movement to get all the users co-located on the same node).
Specifically, I have seen queries that run out of resources or literally take hours to complete using the CROSS JOIN method finish within a minute using this method.
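To make the per-user idea concrete, here is a rough Python sketch of the same logic: each row is expanded forward until the day before that user's next row, or until a fixed last day. The helper name is hypothetical. As in the SQL, days before a user's first row are not generated:

```python
from datetime import date, timedelta

rows = [  # (user, date, numTransactions), sample data from the question
    ("anna", date(2020, 1, 1), 2),
    ("john", date(2020, 1, 1), 3),
    ("anna", date(2020, 1, 2), 1),
    ("anna", date(2020, 1, 4), 1),
    ("john", date(2020, 1, 5), 2),
]

def fill_gaps(rows, last_day):
    """Expand each row forward until the user's next row (hypothetical helper)."""
    by_user = {}
    for user, day, n in sorted(rows):
        by_user.setdefault(user, []).append((day, n))
    out = []
    for user, items in by_user.items():
        for i, (day, n) in enumerate(items):
            # Counterpart of LEAD(date) - 1 day, falling back to last_day.
            if i + 1 < len(items):
                end = items[i + 1][0] - timedelta(days=1)
            else:
                end = last_day
            out.append((user, day, n))
            gap = day + timedelta(days=1)
            while gap <= end:
                out.append((user, gap, 0))  # generated day with no transactions
                gap += timedelta(days=1)
    return out

filled = fill_gaps(rows, date(2020, 1, 5))
for row in filled:
    print(*row)
```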

SQL query to find the visitor together with the date time

My visitor log table has id, visitor, department,vtime fields.
id | visitor | Visittime | Department_id
--------------------------------------------------------------
1 1 2019-05-07 13:53:50 1
2 2 2019-05-07 13:56:54 1
3 1 2019-05-07 14:54:10 3
4 2 2019-05-08 13:54:49 1
5 1 2019-05-08 13:58:15 1
6 2 2019-05-08 18:54:30 2
7 1 2019-05-08 18:54:37 2
I already have the following index:
CREATE INDEX Idx_VisitorLog_Visitor_VisitTime_Includes ON VisitorLog
(Visitor, VisitTime) INCLUDE (DepartmentId, ID)
Four filters are passed from the user interface: visitor 1, visitor 2, and the visiting start and end times.
I need the rows for the departments where visitor 1 and visitor 2 were both present with a VisitTime difference within 5 minutes.
The output should be
id | visitor | Visittime | Department_id
--------------------------------------------------------------
1 1 2019-05-07 13:53:50 1
2 2 2019-05-07 13:56:54 1
4 2 2019-05-08 13:54:49 1
5 1 2019-05-08 13:58:15 1
For that I had used the following query,
;with CTE1 AS(
Select id,visitor,Visittime,department_id from visitorlog where visitor=1
)
,CTE2 AS(
Select id,visitor,Visittime,department_id from visitorlog where visitor=2
)
select * from CTE2 V2
Inner join CTE1 V1 on V2.department_id=V1.department_id and DATEDIFF(minute,V2.Visittime,V1.Visittime) between -5 and 5
The above query takes too long to respond, because my table holds almost 20 million records.
Could anyone suggest the correct way to meet my requirement?
Thanks in advance
This is a completely revised answer, based upon your additional information above.
After reviewing the data file above and the results you desire, this seems like the cleanest way to provide your results. First, we need a different index:
create index idx_POC_visitorlog on visitorlog
(visitor, Department_id, Visittime) include(id);
With this index, we can limit the queries to only the two passed in IDs. To simulate that, I created variables to hold their values. This query returns the data you are looking for.
DECLARE @Visitor1 int = 1,
@Visitor2 int = 2
;with t as (
select Department_id,
dateadd(minute, -5, visittime) as EarlyTime,
dateadd(minute, 5, Visittime) as LateTime,
id
from visitorlog
where visitor = @Visitor1
),
v as (
select v.id,
t.id as tid
from visitorlog v
INNER JOIN t
ON v.visitor = @Visitor2
AND v.Department_id = t.Department_id
and v.Visittime BETWEEN t.EarlyTime and t.LateTime
)
SELECT *
FROM visitorlog vl
WHERE ID IN (
SELECT v.id
FROM v
UNION
SELECT v.tid
FROM v
)
ORDER BY visittime;
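The matching rule in that query (same department, second visitor's time inside a window of five minutes either side of the first visitor's time) can be checked on the sample rows with a small Python sketch. The function name is hypothetical, and the nested loop is only for illustration; it has none of the index-driven performance the SQL relies on:

```python
from datetime import datetime, timedelta

# Sample rows from the question: (id, visitor, visit time, department).
log = [
    (1, 1, datetime(2019, 5, 7, 13, 53, 50), 1),
    (2, 2, datetime(2019, 5, 7, 13, 56, 54), 1),
    (3, 1, datetime(2019, 5, 7, 14, 54, 10), 3),
    (4, 2, datetime(2019, 5, 8, 13, 54, 49), 1),
    (5, 1, datetime(2019, 5, 8, 13, 58, 15), 1),
    (6, 2, datetime(2019, 5, 8, 18, 54, 30), 2),
    (7, 1, datetime(2019, 5, 8, 18, 54, 37), 2),
]

def met_within(log, visitor1, visitor2, window=timedelta(minutes=5)):
    """Ids of rows where both visitors were in one department within `window`."""
    first = [r for r in log if r[1] == visitor1]
    second = [r for r in log if r[1] == visitor2]
    ids = set()
    for id1, _, t1, dept1 in first:
        for id2, _, t2, dept2 in second:
            # Same test as BETWEEN EarlyTime AND LateTime in the query.
            if dept1 == dept2 and t1 - window <= t2 <= t1 + window:
                ids.update((id1, id2))
    return sorted(ids)

matched = met_within(log, 1, 2)
print(matched)
```

On the sample data this rule also pairs ids 6 and 7, which are seven seconds apart in department 2.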
If your version of SQL Server supports the LAG and LEAD functions, try rewriting the query as follows:
with t as (
select
*,
dateadd(minute, 5,
lag(Visittime) over(partition by Department_id order by Visittime)) lag_visit_time,
dateadd(minute, -5,
lead(Visittime) over(partition by Department_id order by Visittime)) lead_visit_time
from visitorlog
where visitor in(1, 2)
)
select
id, visitor, visittime, department_id
from t
where lag_visit_time >= Visittime or lead_visit_time <= Visittime;
An index that supports this kind of window-function query is called a POC index (Partitioning, Ordering, Covering).
Results:
+----+---------+----------------------+---------------+
| id | visitor | visittime | department_id |
+----+---------+----------------------+---------------+
| 1 | 1 | 2019-05-07T13:53:50Z | 1 |
| 2 | 2 | 2019-05-07T13:56:54Z | 1 |
| 4 | 2 | 2019-05-08T13:54:49Z | 1 |
| 5 | 1 | 2019-05-08T13:58:15Z | 1 |
| 6 | 2 | 2019-05-08T18:54:30Z | 2 |
| 7 | 1 | 2019-05-08T18:54:37Z | 2 |
+----+---------+----------------------+---------------+
Demo.

Teradata sql query from grouping records using Intervals

In Teradata SQL, how can I assign the same row number to a group of records created within an 8-second interval?
Example:
Customerid Customername Itembought dateandtime
(yyyy-mm-dd hh:mm:ss)
100 ALex Basketball 2017-02-10 10:10:01
100 ALex Circketball 2017-02-10 10:10:06
100 ALex Baseball 2017-02-10 10:10:08
100 ALex volleyball 2017-02-10 10:11:01
100 ALex footbball 2017-02-10 10:11:05
100 ALex ringball 2017-02-10 10:11:08
100 Alex football 2017-02-10 10:12:10
My expected result should have an additional Row_number column that assigns the same number to all of a customer's purchases within 8 seconds. See the expected result below:
Customerid Customername Itembought dateandtime Row_number
(yyyy-mm-dd hh:mm:ss)
100 ALex Basketball 2017-02-10 10:10:01 1
100 ALex Circketball 2017-02-10 10:10:06 1
100 ALex Baseball 2017-02-10 10:10:08 1
100 ALex volleyball 2017-02-10 10:11:01 2
100 ALex footbball 2017-02-10 10:11:05 2
100 ALex ringball 2017-02-10 10:11:08 2
100 Alex football 2017-02-10 10:12:10 3
This is one way to do it with a recursive CTE: keep a running total of the differences from the previous row's timestamp, and when it exceeds 8, reset it to 0 and start a new group.
WITH ROWNUMS AS
(SELECT T.*
,ROW_NUMBER() OVER(PARTITION BY ID ORDER BY TM) AS RNUM
/*Replace DATEDIFF with Teradata specific function*/
,DATEDIFF(SECOND,COALESCE(MIN(TM) OVER(PARTITION BY ID
ORDER BY TM ROWS BETWEEN 1 PRECEDING AND CURRENT ROW), TM),TM) AS DIFF
FROM T --replace this with your tablename and add columns as required
)
,RECURSIVE CTE(ID,TM,DIFF,SUM_DIFF,RNUM,GRP) AS
(SELECT ID,
TM,
DIFF,
DIFF,
RNUM,
CAST(1 AS int)
FROM ROWNUMS
WHERE RNUM=1
UNION ALL
SELECT T.ID,
T.TM,
T.DIFF,
CASE WHEN C.SUM_DIFF+T.DIFF > 8 THEN 0 ELSE C.SUM_DIFF+T.DIFF END,
T.RNUM,
CAST(CASE WHEN C.SUM_DIFF+T.DIFF > 8 THEN T.RNUM ELSE C.GRP END AS int)
FROM CTE C
JOIN ROWNUMS T ON T.RNUM=C.RNUM+1 AND T.ID=C.ID
)
SELECT ID,
TM,
DENSE_RANK() OVER(PARTITION BY ID ORDER BY GRP) AS row_num
FROM CTE
Demo in SQL Server
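The reset rule is easier to follow as a loop. The Python sketch below applies it to the question's sample timestamps for customer 100. It illustrates the logic only and is not Teradata code, and the function name is hypothetical:

```python
from datetime import datetime

# Purchase times for customer 100 from the question.
times = [
    datetime(2017, 2, 10, 10, 10, 1),
    datetime(2017, 2, 10, 10, 10, 6),
    datetime(2017, 2, 10, 10, 10, 8),
    datetime(2017, 2, 10, 10, 11, 1),
    datetime(2017, 2, 10, 10, 11, 5),
    datetime(2017, 2, 10, 10, 11, 8),
    datetime(2017, 2, 10, 10, 12, 10),
]

def group_by_running_total(times, limit=8):
    """Start a new group when the summed gaps exceed `limit` seconds."""
    groups, group, running = [], 1, 0
    prev = None
    for t in sorted(times):
        if prev is not None:
            running += (t - prev).total_seconds()
            if running > limit:
                group += 1   # start a new group ...
                running = 0  # ... and reset the running total
        groups.append(group)
        prev = t
    return groups

groups = group_by_running_total(times)
print(groups)
```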
I am going to interpret the problem differently from vkp. Any row within 8 seconds of another row should be in the same group. Such values can chain together, so the overall span can be more than 8 seconds.
The advantage of this method is that recursive CTEs are not needed, so it should be faster. (Of course, this is not an advantage if the OP does not agree with the definition.)
The basic idea is to look at the previous date/time value; if it is more than 8 seconds away, then add a flag. The cumulative sum of the flag is the row number you are looking for.
select t.*,
sum(case when prev_dt >= dateandtime - interval '8' second
then 0 else 1
end) over (partition by customerid order by dateandtime
) as row_number
from (select t.*,
max(dateandtime) over (partition by customerid order by dateandtime rows between 1 preceding and 1 preceding) as prev_dt
from t
) t;
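As a rough illustration of the flag-and-cumulative-sum idea, here is the same logic in Python on the question's timestamps. The second data set is hypothetical and only shows the chaining behaviour; the function name is made up:

```python
from datetime import datetime, timedelta
from itertools import accumulate

# Purchase times for customer 100 from the question.
times = [
    datetime(2017, 2, 10, 10, 10, 1),
    datetime(2017, 2, 10, 10, 10, 6),
    datetime(2017, 2, 10, 10, 10, 8),
    datetime(2017, 2, 10, 10, 11, 1),
    datetime(2017, 2, 10, 10, 11, 5),
    datetime(2017, 2, 10, 10, 11, 8),
    datetime(2017, 2, 10, 10, 12, 10),
]

def group_by_gap(times, gap=timedelta(seconds=8)):
    """Flag rows more than `gap` after the previous row, then sum the flags."""
    ordered = sorted(times)
    # The first row has no previous value, so it always starts a group.
    flags = [1] + [0 if cur - prev <= gap else 1
                   for prev, cur in zip(ordered, ordered[1:])]
    return list(accumulate(flags))

groups = group_by_gap(times)
print(groups)

# Hypothetical chain: rows 6 seconds apart stay in one group even though
# the first and last rows are 12 seconds apart.
base = datetime(2017, 2, 10, 10, 0, 0)
chained = group_by_gap([base, base + timedelta(seconds=6), base + timedelta(seconds=12)])
print(chained)
```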
Using Teradata's PERIOD data type and the awesome td_normalize_overlap_meet:
Consider table test32:
SELECT * FROM test32
+----+----+------------------------+
| f1 | f2 | f3 |
+----+----+------------------------+
| 1 | 2 | 2017-05-11 03:59:00 PM |
| 1 | 3 | 2017-05-11 03:59:01 PM |
| 1 | 4 | 2017-05-11 03:58:58 PM |
| 1 | 5 | 2017-05-11 03:59:26 PM |
| 1 | 2 | 2017-05-11 03:59:28 PM |
| 1 | 2 | 2017-05-11 03:59:46 PM |
+----+----+------------------------+
The following will group your records:
WITH
normalizedCTE AS
(
SELECT *
FROM TABLE
(
td_normalize_overlap_meet(NEW VARIANT_TYPE(periodCTE.f1), periodCTE.fper)
RETURNS (f1 integer, fper PERIOD(TIMESTAMP(0)), recordCount integer)
HASH BY f1
LOCAL ORDER BY f1, fper
) as output(f1, fper, recordcount)
),
periodCTE AS
(
SELECT f1, f2, f3, PERIOD(f3, f3 + INTERVAL '9' SECOND) as fper FROM test32
)
SELECT t2.f1, t2.f2, t2.f3, t1.fper, DENSE_RANK() OVER (PARTITION BY t2.f1 ORDER BY t1.fper) as fgroup
FROM normalizedCTE t1
INNER JOIN periodCTE t2 ON
t1.fper P_INTERSECT t2.fper IS NOT NULL
Results:
+----+----+------------------------+-------------+
| f1 | f2 | f3 | fgroup |
+----+----+------------------------+-------------+
| 1 | 2 | 2017-05-11 03:59:00 PM | 1 |
| 1 | 3 | 2017-05-11 03:59:01 PM | 1 |
| 1 | 4 | 2017-05-11 03:58:58 PM | 1 |
| 1 | 5 | 2017-05-11 03:59:26 PM | 2 |
| 1 | 2 | 2017-05-11 03:59:28 PM | 2 |
| 1 | 2 | 2017-05-11 03:59:46 PM | 3 |
+----+----+------------------------+-------------+
A PERIOD in Teradata is a special data type that holds a date or datetime range. The first parameter is the start of the range and the second is the end (up to, but not including, which is why it's "+ 9 seconds"). The result is an 8-second "Period" for each record, which may "intersect" with another record's period.
We then use td_normalize_overlap_meet to merge records that intersect, sharing the f1 field's value as the key. In your case that would be customerid. The result is three records for this one customer since we have three groups that "overlap" or "meet" each other's time periods.
We then join the td_normalize_overlap_meet output with the output from when we determined the periods. We use the P_INTERSECT function to see which periods from the normalized CTE INTERSECT with the periods from the initial Period CTE. From the result of that P_INTERSECT join we grab the values we need from each CTE.
Lastly, Dense_Rank() gives us a rank based on the normalized period for each group.
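For readers without a Teradata system at hand, the normalize step can be approximated in Python: build a half-open 9-second period per row, merge periods that overlap or meet, then rank each row by the merged period that contains it. This is a sketch of the idea on the test32 sample values, not a reimplementation of td_normalize_overlap_meet, and the names are hypothetical:

```python
from datetime import datetime, timedelta

# f3 values from the test32 sample above (all rows have f1 = 1).
stamps = [
    datetime(2017, 5, 11, 15, 59, 0),
    datetime(2017, 5, 11, 15, 59, 1),
    datetime(2017, 5, 11, 15, 58, 58),
    datetime(2017, 5, 11, 15, 59, 26),
    datetime(2017, 5, 11, 15, 59, 28),
    datetime(2017, 5, 11, 15, 59, 46),
]

def normalize(stamps, length=timedelta(seconds=9)):
    """Merge half-open periods [t, t + length) that overlap or meet."""
    merged = []
    for start in sorted(stamps):
        end = start + length
        if merged and start <= merged[-1][1]:  # overlaps or meets the last period
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

periods = normalize(stamps)

# Rank each stamp by the merged period that contains it (the DENSE_RANK step).
fgroup = {
    s: next(i for i, (lo, hi) in enumerate(periods, start=1) if lo <= s < hi)
    for s in stamps
}
for s in stamps:
    print(s, fgroup[s])
```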

DB2 query to find average sale for each item 1 year previous

I'm having some trouble figuring out how to write this query.
In general I have a table with
sales_ID
Employee_ID
sale_date
sale_price
What I want is a view that shows, for each sale, how much the employee sold on average during the year before the sale_date.
example: Suppose I have this in the sales table
sales_ID employee_id sale_date sale_price
1 Bob 2016/06/10 100
2 Bob 2016/01/01 75
3 Bob 2014/01/01 475
4 Bob 2015/12/01 100
5 Bob 2016/05/01 200
6 Fred 2016/01/01 30
7 Fred 2015/05/01 50
For the sales_id 1 record I want to pull all of Bob's sales for the year up to the month of the sale (so 2015-05-01 to 2016-05-31, which has 3 sales for 75, 100 and 200), so the final output would be
sales_ID employee_id sale_date sale_price avg_sale
1 Bob 2016/06/10 100 125
2 Bob 2016/01/01 75 275
3 Bob 2014/01/01 475 null
4 Bob 2015/12/01 100 475
5 Bob 2016/05/01 200 87.5
6 Fred 2016/01/01 30 50
7 Fred 2015/05/01 50 null
What I tried doing is something like this
select a.sales_ID, a.sale_price, a.employee_ID, a.sale_date, b.avg_price
from sales a
left join (
select employee_id, avg(sale_price) as avg_price
from sales
where sale_date between Date(VARCHAR(YEAR(a.sale_date)-1) ||'-'|| VARCHAR(MONTH(a.sale_date)-1) || '-01')
and Date(VARCHAR(YEAR(a.sale_date)) ||'-'|| VARCHAR(MONTH(a.sale_date)) || '-01') -1 day
group by employee_id
) b on a.employee_id = b.employee_id
DB2 doesn't like the reference to the parent table a inside the subquery, and I can't think of how to write this query properly. Any thoughts?
OK, I think I figured it out. Please note 3 things.
1. I couldn't test it in DB2, so I used Oracle, but the syntax should be more or less the same.
2. I didn't use your 1 year logic exactly. I am counting back 365 days from the sale date, but you can change the BETWEEN part of the WHERE clause in the inner query to match what you described in the question.
3. The expected output you mentioned is incorrect. For every sales_id, I took the date, found the employee_id, took all of that employee's sales for the preceding year, excluding the current date, and then took the average. If you want to change that, you can change the WHERE clause in the subquery.
select t1.*,t2.avg_sale
from
sales t1
left join
(
select a.sales_id
,avg(b.sale_price) as avg_sale
from sales a
inner join
sales b
on a.employee_id=b.employee_id
where b.sale_date between a.sale_date - 365 and a.sale_date -1
group by a.sales_id
) t2
on t1.sales_id=t2.sales_id
order by t1.sales_id
Output
+----------+-------------+-------------+------------+----------+
| SALES_ID | EMPLOYEE_ID | SALE_DATE | SALE_PRICE | AVG_SALE |
+----------+-------------+-------------+------------+----------+
| 1 | Bob | 10-JUN-2016 | 100 | 125 |
| 2 | Bob | 01-JAN-2016 | 75 | 100 |
| 3 | Bob | 01-JAN-2014 | 475 | |
| 4 | Bob | 01-DEC-2015 | 100 | |
| 5 | Bob | 01-MAY-2016 | 200 | 87.5 |
| 6 | Fred | 01-JAN-2016 | 30 | 50 |
| 7 | Fred | 01-MAY-2015 | 50 | |
+----------+-------------+-------------+------------+----------+
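The window used by that query (from 365 days before the sale up to the day before it, per employee) can be double-checked against the output above with a short Python sketch. The function name is hypothetical, and the nested scan is only meant to mirror the self-join, not to be efficient:

```python
from datetime import date, timedelta

# Sample sales from the question: (sales_id, employee_id, sale_date, sale_price).
sales = [
    (1, "Bob", date(2016, 6, 10), 100),
    (2, "Bob", date(2016, 1, 1), 75),
    (3, "Bob", date(2014, 1, 1), 475),
    (4, "Bob", date(2015, 12, 1), 100),
    (5, "Bob", date(2016, 5, 1), 200),
    (6, "Fred", date(2016, 1, 1), 30),
    (7, "Fred", date(2015, 5, 1), 50),
]

def trailing_average(sales, days=365):
    """Average price of the same employee's sales in the preceding `days` days."""
    result = {}
    for sid, emp, day, _ in sales:
        # Same range as: b.sale_date BETWEEN a.sale_date - 365 AND a.sale_date - 1
        window = [
            price
            for _, other_emp, other_day, price in sales
            if other_emp == emp
            and day - timedelta(days=days) <= other_day <= day - timedelta(days=1)
        ]
        # No earlier sales in range gives None, like the NULL from the LEFT JOIN.
        result[sid] = sum(window) / len(window) if window else None
    return result

averages = trailing_average(sales)
for sid in sorted(averages):
    print(sid, averages[sid])
```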
You can almost fix your original query by doing a LATERAL join. Lateral allows you to reference previously declared tables as in:
select a.sales_ID, a.sale_price, a.employee_ID, a.sale_date, b.avg_price
from sales a
left join LATERAL (
select employee_id, avg(sale_price) as avg_price
from sales
where sale_date between Date(VARCHAR(YEAR(a.sale_date)-1) ||'-'|| VARCHAR(MONTH(a.sale_date)-1) || '-01')
and Date(VARCHAR(YEAR(a.sale_date)) ||'-'|| VARCHAR(MONTH(a.sale_date)) || '-01') -1 day
group by employee_id
) b on a.employee_id = b.employee_id
However, I get a syntax error from your date arithmetic, so using @Utsav's solution for this yields:
select a.sales_ID, a.sale_price, a.employee_ID, a.sale_date, b.avg_price
from sales a
left join lateral (
select employee_id, avg(sale_price) as avg_price
from sales b
where a.employee_id = b.employee_id
and b.sale_date between a.sale_date - 365 and a.sale_date -1
group by employee_id
) b on a.employee_id = b.employee_id
Since we already pushed the predicate inside the LATERAL join, it is strictly speaking not necessary to use the on clause:
select a.sales_ID, a.sale_price, a.employee_ID, a.sale_date, b.avg_price
from sales a
left join lateral (
select employee_id, avg(sale_price) as avg_price
from sales b
where a.employee_id = b.employee_id
and b.sale_date between a.sale_date - 365 and a.sale_date -1
group by employee_id
) b on 1=1
By using a LATERAL join we removed one access against the sales table. A comparison of the plans shows:
No LATERAL Join
Access Plan:
Total Cost: 20,4571
Query Degree: 1
RETURN (1)
  >MSJOIN (2): rows 7, cost 20,4565, I/O 3
    TBSCAN (3): rows 7, cost 6,81572, I/O 1
      SORT (4): rows 7, cost 6,81552, I/O 1
        TBSCAN (5): rows 7, cost 6,81488, I/O 1
          TABLE: LELLE.SALES (Q6): rows 7
    FILTER (6): rows 0,388889, cost 13,6402, I/O 2
      GRPBY (7): rows 2,72222, cost 13,6397, I/O 2
        TBSCAN (8): rows 2,72222, cost 13,6395, I/O 2
          SORT (9): rows 2,72222, cost 13,6391, I/O 2
            HSJOIN (10): rows 2,72222, cost 13,6385, I/O 2
              TBSCAN (11): rows 7, cost 6,81488, I/O 1
                TABLE: LELLE.SALES (Q2): rows 7
              TBSCAN (12): rows 7, cost 6,81488, I/O 1
                TABLE: LELLE.SALES (Q1): rows 7
LATERAL Join
Access Plan:
Total Cost: 13,6565
Query Degree: 1
RETURN (1)
  >^NLJOIN (2): rows 7, cost 13,6559, I/O 2
    TBSCAN (3): rows 7, cost 6,81488, I/O 1
      TABLE: LELLE.SALES (Q5): rows 7
    GRPBY (4): rows 0,35, cost 6,81662, I/O 1
      TBSCAN (5): rows 0,35, cost 6,81656, I/O 1
        TABLE: LELLE.SALES (Q1): rows 7
Window functions with framing
DB2 does not yet support range frames over dates, but by using a clever trick by @mustaccio in:
https://dba.stackexchange.com/questions/141263/what-is-the-meaning-of-order-by-x-range-between-n-preceding-if-x-is-a-dat
we can actually use only one table access and solve the problem:
select a.sales_ID, a.sale_price, a.employee_ID, a.sale_date
, avg(sale_price) over (partition by employee_id
order by julian_day(a.sale_date)
range between 365 preceding
and 1 preceding
) as avg_price
from sales a
Access Plan:
Total Cost: 6.8197
Query Degree: 1
RETURN (1)
  TBSCAN (2): rows 7, cost 6.81753, I/O 1
    SORT (3): rows 7, cost 6.81703, I/O 1
      TBSCAN (4): rows 7, cost 6.81488, I/O 1
        TABLE: LELLE.SALES (Q1): rows 7