Trouble using ROW_NUMBER() OVER (PARTITION BY ...) - sql

I'm using SQL Server 2008 R2. I have a table called EmployeeHistory with the following structure and sample data:
EmployeeID Date DepartmentID SupervisorID
10001 20130101 001 10009
10001 20130909 001 10019
10001 20131201 002 10018
10001 20140501 002 10017
10001 20141001 001 10015
10001 20141201 001 10014
Notice that employee 10001 has moved between 2 departments and had several supervisors over time. What I am trying to do is list the start and end dates of this employee's time in each department, ordered by the Date field. So the output will look like this:
EmployeeID DateStart DateEnd DepartmentID
10001 20130101 20131201 001
10001 20131201 20141001 002
10001 20141001 NULL 001
I intended to partition the data using the following query, but it failed: the department changes from 001 to 002 and then back to 001, so obviously I cannot simply partition by DepartmentID. I'm sure I'm overlooking the obvious. Any help? Thank you in advance.
SELECT * ,ROW_NUMBER() OVER (PARTITION BY EmployeeID, DepartmentID
ORDER BY [Date]) RN FROM EmployeeHistory

I would do something like this:
;WITH x
AS (SELECT *,
Row_number()
OVER(
partition BY employeeid
ORDER BY [Date]) rn
FROM employeehistory)
SELECT *
FROM x x1
LEFT OUTER JOIN x x2
ON x1.employeeid = x2.employeeid
AND x1.rn = x2.rn + 1
With x1.rn = x2.rn + 1, x2 is the row just before x1; use x1.rn = x2.rn - 1 if you want the next row instead. In any case, you get the idea. Once you have the table joined to itself, you can filter, group, sort, etc. to get what you need.

A bit involved. Easiest would be to refer to this SQL Fiddle I created for you that produces the exact result. There are ways you can improve it for performance or other considerations, but this should hopefully at least be clearer than some alternatives.
The gist is, you get a canonical ranking of your data first, then use that to segment the data into groups, then find an end date for each group, then eliminate any intermediate rows. ROW_NUMBER() and CROSS APPLY help a lot in doing it readably.
EDIT 2019:
The SQL Fiddle does in fact seem to be broken, for some reason, but it appears to be a problem on the SQL Fiddle site. Here's a complete version, tested just now on SQL Server 2016:
CREATE TABLE Source
(
EmployeeID int,
DateStarted date,
DepartmentID int
)
INSERT INTO Source
VALUES
(10001,'2013-01-01',001),
(10001,'2013-09-09',001),
(10001,'2013-12-01',002),
(10001,'2014-05-01',002),
(10001,'2014-10-01',001),
(10001,'2014-12-01',001)
SELECT *,
ROW_NUMBER() OVER (PARTITION BY EmployeeID ORDER BY DateStarted) AS EntryRank,
newid() as GroupKey,
CAST(NULL AS date) AS EndDate
INTO #RankedData
FROM Source
;
UPDATE #RankedData
SET GroupKey = beginDate.GroupKey
FROM #RankedData sup
CROSS APPLY
(
SELECT TOP 1 GroupKey
FROM #RankedData sub
WHERE sub.EmployeeID = sup.EmployeeID AND
sub.DepartmentID = sup.DepartmentID AND
NOT EXISTS
(
SELECT *
FROM #RankedData bot
WHERE bot.EmployeeID = sup.EmployeeID AND
bot.EntryRank BETWEEN sub.EntryRank AND sup.EntryRank AND
bot.DepartmentID <> sup.DepartmentID
)
ORDER BY DateStarted ASC
) beginDate (GroupKey);
UPDATE #RankedData
SET EndDate = nextGroup.DateStarted
FROM #RankedData sup
CROSS APPLY
(
SELECT TOP 1 DateStarted
FROM #RankedData sub
WHERE sub.EmployeeID = sup.EmployeeID AND
sub.DepartmentID <> sup.DepartmentID AND
sub.EntryRank > sup.EntryRank
ORDER BY EntryRank ASC
) nextGroup (DateStarted);
SELECT * FROM
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY GroupKey ORDER BY EntryRank ASC) AS GroupRank FROM #RankedData
) FinalRanking
WHERE GroupRank = 1
ORDER BY EntryRank;
DROP TABLE #RankedData
DROP TABLE Source

It looks like a common gaps-and-islands problem. The difference between the two row-number sequences rn1 and rn2 gives the "group" number.
Run this query CTE-by-CTE and examine intermediate results to see how it works.
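To see why the difference rn1 - rn2 identifies each consecutive run, here is a minimal Python sketch that mimics the two ROW_NUMBER() sequences on the question's rows. It is only an illustration of the idea, not part of the SQL in this answer; the function name islands and the tuple layout are made up for this example.

```python
from collections import defaultdict

# Rows from the question's sample data: (EmployeeID, DateStarted, DepartmentID).
rows = [
    (10001, "2013-01-01", 1),
    (10001, "2013-09-09", 1),
    (10001, "2013-12-01", 2),
    (10001, "2014-05-01", 2),
    (10001, "2014-10-01", 1),
    (10001, "2014-12-01", 1),
]


def islands(rows):
    """Return (EmployeeID, DepartmentID, DateStart, DateEnd) per consecutive run."""
    rn1 = defaultdict(int)  # row number per employee
    rn2 = defaultdict(int)  # row number per (employee, department)
    starts = {}             # (employee, department, rn1 - rn2) -> first date of the run

    # ISO date strings sort chronologically, so a plain sort is enough here.
    for emp, date, dept in sorted(rows, key=lambda r: (r[0], r[1])):
        rn1[emp] += 1
        rn2[(emp, dept)] += 1
        key = (emp, dept, rn1[emp] - rn2[(emp, dept)])
        # Rows arrive in date order, so the first date seen is the MIN for the group.
        starts.setdefault(key, date)

    groups = sorted((emp, start, dept) for (emp, dept, _), start in starts.items())

    result = []
    for i, (emp, start, dept) in enumerate(groups):
        nxt = groups[i + 1] if i + 1 < len(groups) else None
        # The end date is the start of the same employee's next group (like LEAD).
        end = nxt[1] if nxt and nxt[0] == emp else None
        result.append((emp, dept, start, end))
    return result


for row in islands(rows):
    print(row)
```

Within one run of the same department both counters advance together, so their difference stays constant; when the department changes, only rn1 keeps counting across the gap, which shifts the difference for any later run of the earlier department.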
Sample data
I expanded sample data from the question a little.
CREATE TABLE #Source
(
EmployeeID int,
DateStarted date,
DepartmentID int
)
INSERT INTO #Source
VALUES
(10001,'2013-01-01',001),
(10001,'2013-09-09',001),
(10001,'2013-12-01',002),
(10001,'2014-05-01',002),
(10001,'2014-10-01',001),
(10001,'2014-12-01',001),
(10005,'2013-05-01',001),
(10005,'2013-11-09',001),
(10005,'2013-12-01',002),
(10005,'2014-10-01',001),
(10005,'2016-12-01',001);
Query for SQL Server 2008
There is no LEAD function in SQL Server 2008, so I had to use a self-join via OUTER APPLY to get the value of the "next" row for DateEnd.
WITH
CTE
AS
(
SELECT
EmployeeID
,DateStarted
,DepartmentID
,ROW_NUMBER() OVER (PARTITION BY EmployeeID ORDER BY DateStarted) AS rn1
,ROW_NUMBER() OVER (PARTITION BY EmployeeID, DepartmentID ORDER BY DateStarted) AS rn2
FROM #Source
)
,CTE_Groups
AS
(
SELECT
EmployeeID
,MIN(DateStarted) AS DateStart
,DepartmentID
FROM CTE
GROUP BY
EmployeeID
,DepartmentID
,rn1 - rn2
)
SELECT
CTE_Groups.EmployeeID
,CTE_Groups.DepartmentID
,CTE_Groups.DateStart
,A.DateEnd
FROM
CTE_Groups
OUTER APPLY
(
SELECT TOP(1) G2.DateStart AS DateEnd
FROM CTE_Groups AS G2
WHERE
G2.EmployeeID = CTE_Groups.EmployeeID
AND G2.DateStart > CTE_Groups.DateStart
ORDER BY G2.DateStart
) AS A
ORDER BY
EmployeeID
,DateStart
;
Query for SQL Server 2012+
Starting with SQL Server 2012 there is a LEAD function that makes this task more efficient.
WITH
CTE
AS
(
SELECT
EmployeeID
,DateStarted
,DepartmentID
,ROW_NUMBER() OVER (PARTITION BY EmployeeID ORDER BY DateStarted) AS rn1
,ROW_NUMBER() OVER (PARTITION BY EmployeeID, DepartmentID ORDER BY DateStarted) AS rn2
FROM #Source
)
,CTE_Groups
AS
(
SELECT
EmployeeID
,MIN(DateStarted) AS DateStart
,DepartmentID
FROM CTE
GROUP BY
EmployeeID
,DepartmentID
,rn1 - rn2
)
SELECT
CTE_Groups.EmployeeID
,CTE_Groups.DepartmentID
,CTE_Groups.DateStart
,LEAD(CTE_Groups.DateStart) OVER (PARTITION BY CTE_Groups.EmployeeID ORDER BY CTE_Groups.DateStart) AS DateEnd
FROM
CTE_Groups
ORDER BY
EmployeeID
,DateStart
;
Result
+------------+--------------+------------+------------+
| EmployeeID | DepartmentID | DateStart | DateEnd |
+------------+--------------+------------+------------+
| 10001 | 1 | 2013-01-01 | 2013-12-01 |
| 10001 | 2 | 2013-12-01 | 2014-10-01 |
| 10001 | 1 | 2014-10-01 | NULL |
| 10005 | 1 | 2013-05-01 | 2013-12-01 |
| 10005 | 2 | 2013-12-01 | 2014-10-01 |
| 10005 | 1 | 2014-10-01 | NULL |
+------------+--------------+------------+------------+

Get the previous amount of accounts?

I have a table like this:
===============================
| ID | Salary | Date |
===============================
| A1 | $1000 | 2020-01-03|
-------------------------------
| A1 | $1300 | 2020-02-03|
-------------------------------
| A1 | $1500 | 2020-03-01|
-------------------------------
| A2 | $1300 | 2020-01-13|
-------------------------------
| A2 | $1500 | 2020-02-11|
-------------------------------
Expected output:
==================================================
| ID | Salary | Previous Salary | Date |
==================================================
| A1 | $1500 | $1300 | 2020-03-01|
--------------------------------------------------
| A2 | $1500 | $1300 | 2020-02-11|
--------------------------------------------------
How can I query this so that I always get the previous salary and show it in another column/table?
You can combine the row_number and lag window functions to locate the latest salary for every id and return both the latest and the previous salary.
with cte as (
select id, salary,
row_number() over (partition by id order by date desc) as position,
lag(salary) over (partition by id order by date) as previous,
date
from payroll
)
select id, salary, previous, date
from cte
where position = 1 -- It's the first one because we ordered by date descendingly
Result :
ID Salary Previous Date
----- --------------------- --------------------- ----------
A1 1500,00 1300,00 2020-03-01
A2 1500,00 1300,00 2020-02-11
Online sample: http://sqlfiddle.com/#!18/770472/15/0
In SQL Server you can use the LAG Window Function to reference a field from the previous record in a partitioned set (within a specified data window)
with [Data] as
(
SELECT ID, Salary, Cast([Date] as Date) [Date]
FROM (VALUES
('A1', 1000, '2020-01-03'),
('A1',1300,'2020-02-03'),
('A1',1500,'2020-03-01'),
('A2',1300,'2020-01-13'),
('A2',1500,'2020-02-11')
) as t(ID,Salary,Date)
)
-- above is a simple demo dataset definition, your actual query is below
SELECT ID, Salary, LAG(Salary, 1)
OVER (
PARTITION BY [ID]
ORDER BY [Date]
) as [Previous_Salary], [Date]
FROM [Data]
ORDER BY [Data].[Date] DESC
Produces the following output:
ID Salary Previous_Salary Date
---- ----------- --------------- ----------
A1 1500 1300 2020-03-01
A2 1500 1300 2020-02-11
A1 1300 1000 2020-02-03
A2 1300 NULL 2020-01-13
A1 1000 NULL 2020-01-03
(5 rows affected)
Experiment with the ordering. Note that the window here uses ascending order, while the displayed result can still be sorted in descending order.
Window functions compute their values over a set of rows related to the current row. You can think of them as a way to run correlated lookups and merge the results back into each row.
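As a quick way to experiment with the two orderings, here is a small sketch that runs the same LAG pattern against an in-memory SQLite database from Python. This is an assumption-laden stand-in, not SQL Server: it needs SQLite 3.25 or later for window functions, and the table name payroll, the lowercase column names, and the helper previous_salaries are chosen just for this example.

```python
import sqlite3


def previous_salaries():
    """Return (id, salary, previous_salary, date) rows, newest first."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payroll (id TEXT, salary INTEGER, date TEXT)")
    conn.executemany(
        "INSERT INTO payroll VALUES (?, ?, ?)",
        [
            ("A1", 1000, "2020-01-03"),
            ("A1", 1300, "2020-02-03"),
            ("A1", 1500, "2020-03-01"),
            ("A2", 1300, "2020-01-13"),
            ("A2", 1500, "2020-02-11"),
        ],
    )
    # The window is ordered ascending so LAG looks back in time;
    # the outer ORDER BY only controls how the result is displayed.
    query = """
        SELECT id, salary,
               LAG(salary) OVER (PARTITION BY id ORDER BY date) AS previous_salary,
               date
        FROM payroll
        ORDER BY date DESC
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


for row in previous_salaries():
    print(row)
```

Changing the outer ORDER BY reorders the output without changing any previous_salary value, whereas changing the ORDER BY inside OVER changes which row counts as "previous".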
In many simple implementations, window functions like this should perform as well as or better than writing your own logic with a self-join or subquery. For example, this is an equivalent query using OUTER APPLY (OUTER rather than CROSS, so that rows without a previous salary are kept with NULL, as LAG does):
SELECT [Data].ID, [Data].Salary, previous.Salary as [Previous_Salary], [Data].[Date]
FROM [Data]
OUTER APPLY (SELECT TOP 1 x.Salary
FROM [Data] x
WHERE x.[ID] = [Data].ID AND x.[Date] < [Data].[Date]
ORDER BY x.[Date] DESC) as previous
ORDER BY [Data].[Date] DESC
The LAG syntax requires less code, clearly defines your intent and allows set-based execution and optimizations.
Other JOIN style queries will still be blocking queries as they will require the reversal of the original data-set (by forcing the entire set to be loaded in reverse order) and so will not offer a truly set-based or optimal approach.
SQL Server's developers realized that there is a genuine need for these types of queries and that, left to our own devices, we generally write inefficient lookup queries. Window functions were designed to offer a best-practice solution for these kinds of analytical queries.
This query could work for you
select *, LAG(salary) OVER (partition by id ORDER BY [date]) as previous from A
Try this: (Assumption: Table name 'PAYROLL' in 'PLS' schema)
SELECT R1.ID,R1.SALARY AS SALARY,R2.SALARY AS PREVIOUS_SALARY
FROM PLS.PAYROLL R1
LEFT OUTER JOIN PLS.PAYROLL R2 ON R1.ID=R2.ID AND R2.SALARY_DATE = (SELECT MAX(SALARY_DATE) FROM PLS.PAYROLL WHERE ID=R1.ID AND SALARY_DATE<R1.SALARY_DATE)
WHERE R1.SALARY_DATE=(SELECT MAX(SALARY_DATE) FROM PLS.PAYROLL WHERE ID=R1.ID)
ORDER BY R1.ID
You can use window functions and PIVOT to do this.
CREATE TABLE #SampleData (ID VARCHAR(5), Salary MONEY ,[Date] DATE)
INSERT INTO #SampleData VALUES
('A1', $1000 , '2020-01-03'),
('A1', $1300 , '2020-02-03'),
('A1', $1500 , '2020-03-01'),
('A2', $1300 , '2020-01-13'),
('A2', $1500 , '2020-02-11')
SELECT ID, [1] Salary, [2] [Previous Salary], [Date]
FROM (SELECT ID, Salary, MAX([Date]) OVER(PARTITION BY ID) AS [Date],
ROW_NUMBER() OVER(PARTITION BY ID ORDER BY [Date] DESC) RN
FROM #SampleData ) SRC
PIVOT(MAX(Salary) FOR RN IN ([1],[2])) PVT
ORDER BY ID
Result:
ID Salary Previous Salary Date
----- --------------------- --------------------- ----------
A1 1500,00 1300,00 2020-03-01
A2 1500,00 1300,00 2020-02-11

SQL Find Nearest Effective Date in Related Table

I'll preface this by stating that this problem is similar to SQL Join on Nearest Less Than Date but that the solution there doesn't work for my problem. Instead of selecting a single column, I need the results of a table that is 'filtered' based on the nearest date.
I have three tables. The main table contains time ticket data in the form:
ticketId
ticketNumber
ticketDate
projectId
A secondary table tracks rate schedules for resources on each daily ticket for the project. It looks something like this:
scheduleId
projectId
effectiveDate
There is also a third table that is related to the second that actually contains the applicable rates. Something like this:
scheduleId
straightTime
overTime
Joining the first two tables on projectId (obviously) replicates data for every record in the rate schedule for the project. If I have 3 rate schedules for project 1, then ticket records result in something like:
ticketNumber | ticketDate | projectId | effectiveDate | scheduleId
------------- | ------------ | ----------- | -------------- | ----------
1234 | 2016-06-18 | 25 | 2016-06-01 | 1
1234 | 2016-06-18 | 25 | 2016-06-15 | 2
1234 | 2016-06-18 | 25 | 2016-06-30 | 3
Selecting the effectiveDate into my results is straightforward with the example:
SELECT *
, (SELECT TOP 1 t1.effectiveFrom
FROM dbo.labourRateSchedule t1
WHERE t1.effectiveFrom <= t2.[date] and t1.projectId = t2.projectId
ORDER BY t1.effectiveFrom desc) as effectiveDate
FROM dbo.timeTicket t2
ORDER BY t2.[date]
However, I need to be able to join the ID of dbo.labourRateSchedule onto the third table to get the actual rates that apply. Adding the t1.ID to the SELECT statement does not make it accessible to JOIN into another related table.
I've been trying to JOIN the SELECT statement in the FROM clause, but the results only contain the last effectiveDate value instead of the one closest to the applicable ticketDate.
I would hugely appreciate any help on this!
You can move your subquery to the FROM clause by using CROSS APPLY:
SELECT *
FROM dbo.timeTicket tt
CROSS APPLY
(
SELECT TOP(1) *
FROM dbo.labourRateSchedule lrs
WHERE lrs.projectId = tt.projectId
AND lrs.effectiveFrom <= tt.date
ORDER BY lrs.effectiveFrom desc
) best_lrs
JOIN dbo.schedule s on s.scheduleId = best_lrs.scheduleId
ORDER BY tt.date
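The lookup that TOP(1) with ORDER BY effectiveFrom DESC performs for each ticket (the latest schedule whose effective date is on or before the ticket date) can be sketched in Python with a binary search. This is only an illustration of the rule; the schedules dictionary, its values, and the function schedule_for are hypothetical and based on the sample rows in the question.

```python
from bisect import bisect_right

# Hypothetical data: projectId -> list of (effectiveDate, scheduleId),
# kept sorted by effectiveDate. ISO date strings compare chronologically.
schedules = {
    25: [("2016-06-01", 1), ("2016-06-15", 2), ("2016-06-30", 3)],
}


def schedule_for(project_id, ticket_date):
    """Return the scheduleId in effect on ticket_date, or None if none applies."""
    entries = schedules.get(project_id, [])
    dates = [effective for effective, _ in entries]
    # bisect_right counts the entries with effectiveDate <= ticket_date.
    pos = bisect_right(dates, ticket_date)
    return entries[pos - 1][1] if pos else None


print(schedule_for(25, "2016-06-18"))
```

A ticket dated before the first effective date gets None here, which corresponds to CROSS APPLY dropping that ticket; OUTER APPLY would keep it with NULL schedule columns.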
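Can you try something like this? (You will need to adapt it, as you didn't post all the information.) The row number is computed per ticket, over only the schedules that are already in effect on the ticket date, so RN = 1 is the nearest earlier schedule rather than the project's latest one.
SELECT A.*, C.*
FROM timeTicket A
INNER JOIN (SELECT T.ticketId, S.scheduleId, ROW_NUMBER() OVER (PARTITION BY T.ticketId ORDER BY S.effectiveFrom DESC) AS RN
FROM timeTicket T
INNER JOIN labourRateSchedule S ON S.projectId = T.projectId AND S.effectiveFrom <= T.[date]) B ON A.ticketId = B.ticketId AND B.RN = 1
INNER JOIN YOUR3TABLE C ON B.scheduleId = C.scheduleId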
You can do this with a CTE and the RANK function:
create table timeTicket (ticketId int,
ticketNumber int ,
ticketDate smalldatetime ,
projectId int )
go
create table labourRateSchedule
(scheduleId int,
projectId int,
effectiveDate smalldatetime )
go
create table ApplicableRates
(scheduleId int,
straightTime smalldatetime ,
overTime smalldatetime)
go
insert into timeTicket
select 1 , 1234 ,'2016-06-18' ,25
go
insert into labourRateSchedule
select 1 , 25 ,'2016-06-01'
union all select 2 , 25 ,'2016-06-15'
union all select 3 , 25 ,'2016-06-30'
go
insert into ApplicableRates
select 1 , '2016-06-07' ,'2016-06-07'
union all select 2 , '2016-06-17' ,'2016-06-17'
union all select 3 , '2016-06-22' ,'2016-06-25'
go
with cte
as (
select t1.ticketNumber ,t1.ticketDate ,t1.projectId ,t2.effectiveDate ,t3.scheduleId ,t3.straightTime
,t3.overTime , rank() over ( partition by t1.ticketNumber order by abs (DATEDIFF(day,t1.ticketDate, t2.effectiveDate) ) ) DyaDiff
from timeTicket t1 join labourRateSchedule t2
on t1.projectId = t2.projectId
join ApplicableRates t3
on t2.scheduleId = t3.scheduleId)
select * from cte where DyaDiff = 1

SQL to get first date and amount per account

I want to get back the date and amount of the first transaction per account in a transaction table. The table (GiftHeader) looks like this:
EntityID Date Amount
1 1/1/2027 00:00:00:00 1.00
1 2/1/2027 00:00:00:00 2.00
2 2/1/2027 00:00:00:00 4.00
2 3/1/2027 00:00:00:00 2.00
In this case, I would expect the following:
EntityID BatchDate Amount
1 1/1/2027 00:00:00:00 1.00
2 2/1/2027 00:00:00:00 4.00
Here's the SQL I'm using which isn't working.
select DISTINCT entityid, min(BatchDate) as FirstGiftDate
from GiftHeader
group by EntityId,BatchDate
order by EntityId
Any help would be appreciated.
Regards,
Joshua Goodwin
You can use top 1 with ties as below
Select top 1 with ties * from GiftHeader
order by row_number() over (partition by entityid order by [BatchDate])
Another, more traditional approach is:
Select * from (
Select *, RowN = row_number() over (partition by entityid order by BatchDate) from GiftHeader ) a
Where a.RowN = 1
Output:
+----------+-------------------------+--------+
| EntityId | BatchDate | Amount |
+----------+-------------------------+--------+
| 1 | 2027-01-01 00:00:00.000 | 1 |
| 2 | 2027-02-01 00:00:00.000 | 4 |
+----------+-------------------------+--------+
You can use ROW_NUMBER as follows:
SELECT EntityID,
Date,
Amount
FROM (SELECT ROW_NUMBER()
OVER (
PARTITION BY EntityID
ORDER BY Date) AS RN,
*
FROM GiftHeader) a
WHERE a.RN = 1

SQL Server : select from duplicate columns where date newest

I have inherited a SQL Server table in the (abbreviated) form of (includes sample data set):
| SID | name | Invite_Date |
|-----|-------|-------------|
| 101 | foo | 2013-01-06 |
| 102 | bar | 2013-04-04 |
| 101 | fubar | 2013-03-06 |
I need to select all SID's and the Invite_date, but if there is a duplicate SID, then just get the latest entry (by date).
So the results from the above would look like:
101 | fubar | 2013-03-06
102 | bar | 2013-04-04
Any ideas please.
N.B the Invite_date column has been declared as a nvarchar, so to get it in a date format I am using CONVERT(DATE, Invite_date)
You can use a ranking function like ROW_NUMBER or DENSE_RANK in a CTE:
WITH CTE AS
(
SELECT SID, name, Invite_Date,
rn = Row_Number() OVER (PARTITION By SID
Order By Invite_Date DESC)
FROM dbo.TableName
)
SELECT SID, name, Invite_Date
FROM CTE
WHERE RN = 1
Use Row_Number if you want exactly one row per group and Dense_Rank if you want all last Invite_Date rows for each group in case of repeating max-Invite_Dates.
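The difference between the two ranking functions only shows up when the latest date is repeated within a group. The sketch below runs the same CTE-style query against an in-memory SQLite database from Python (an assumption: SQLite 3.25 or later, not SQL Server). The table name invites, the helper latest_per_sid, and the extra tied row "baz" are invented for this example.

```python
import sqlite3


def latest_per_sid(rank_function):
    """Return the rows ranked 1 per sid, using ROW_NUMBER or DENSE_RANK."""
    if rank_function not in ("ROW_NUMBER", "DENSE_RANK"):
        raise ValueError("unsupported ranking function")
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invites (sid INTEGER, name TEXT, invite_date TEXT)")
    conn.executemany(
        "INSERT INTO invites VALUES (?, ?, ?)",
        [
            (101, "foo", "2013-01-06"),
            (102, "bar", "2013-04-04"),
            (101, "fubar", "2013-03-06"),
            (101, "baz", "2013-03-06"),  # hypothetical tie, not in the question
        ],
    )
    query = f"""
        SELECT sid, name, invite_date
        FROM (
            SELECT sid, name, invite_date,
                   {rank_function}() OVER (
                       PARTITION BY sid ORDER BY invite_date DESC
                   ) AS rn
            FROM invites
        ) AS ranked
        WHERE rn = 1
        ORDER BY sid, name
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


print(latest_per_sid("ROW_NUMBER"))   # one row per sid; the tie is broken arbitrarily
print(latest_per_sid("DENSE_RANK"))   # both tied rows for sid 101 are returned
```

With ROW_NUMBER, which of the tied rows survives is not defined unless you add a tie-breaker column to the ORDER BY inside OVER.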
select t1.*
from your_table t1
inner join
(
select sid, max(CONVERT(DATE, Invite_date)) mdate
from your_table
group by sid
) t2 on t1.sid = t2.sid and CONVERT(DATE, t1.Invite_date) = t2.mdate
select
SID,name,MAX(Invite_date)
FROM
Table1
GROUP BY
SID
http://sqlfiddle.com/#!2/6b6f66/1

Max (SQL-Server)

I have a table that looks like this:
BARCODE | PRICE | STARTDATE
007023819815 | 159000 | 2008-11-17 00:00:00.000
007023819815 | 319000 | 2009-02-01 00:00:00.000
How can I select so I can get the result like this:
BARCODE | PRICE | STARTDATE
007023819815 | 319000 | 2009-02-01 00:00:00.000
select by using max date.
Thanks in advance.
SELECT TOP 1 barcode, price, startdate
FROM TableName
ORDER BY startdate DESC
Or, if you need the latest row for each barcode:
SELECT barcode, price, startdate
FROM TableName A
WHERE startdate = (SELECT max(startdate) FROM TableName B WHERE B.barcode = A.barcode)
UPDATE
changed second query to view max values per barcode.
An elegant way to do that is using the analytic function row_number:
SELECT barcode, price, startdate
FROM (
SELECT *
, ROW_NUMBER() OVER (PARTITION BY barcode ORDER BY startdate DESC) as rn
FROM YourTable
) subquery
WHERE rn = 1
If performance is an issue, check out some more complex options in this blog post.