I have a table like this:
===============================
| ID | Salary | Date |
===============================
| A1 | $1000 | 2020-01-03|
-------------------------------
| A1 | $1300 | 2020-02-03|
-------------------------------
| A1 | $1500 | 2020-03-01|
-------------------------------
| A2 | $1300 | 2020-01-13|
-------------------------------
| A2 | $1500 | 2020-02-11|
-------------------------------
Expected output:
==================================================
| ID | Salary | Previous Salary | Date |
==================================================
| A1 | $1500 | $1300 | 2020-03-01|
--------------------------------------------------
| A2 | $1500 | $1300 | 2020-02-11|
--------------------------------------------------
How can I write a query that always returns each ID's latest salary together with the previous salary in a separate column?
You can combine the row_number and lag window functions: row_number locates the latest salary row for every id, and lag returns the salary from the row before it.
with cte as (
select id, salary,
row_number() over (partition by id order by date desc) as position,
lag(salary) over (partition by id order by date) as previous,
date
from payroll
)
select id, salary, previous, date
from cte
where position = 1 -- position 1 is the latest row because we ordered by date descending
Result :
ID Salary Previous Date
----- --------------------- --------------------- ----------
A1 1500,00 1300,00 2020-03-01
A2 1500,00 1300,00 2020-02-11
Online sample: http://sqlfiddle.com/#!18/770472/15/0
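To try the row_number plus lag query without a SQL Server instance, here is a minimal sketch that runs it in SQLite through Python's standard sqlite3 module. The column types are assumptions, and it needs a SQLite build with window functions (3.25 or later).

```python
import sqlite3

# Hypothetical in-memory copy of the question's table; column types are assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payroll (id TEXT, salary INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO payroll VALUES (?, ?, ?)",
    [
        ("A1", 1000, "2020-01-03"),
        ("A1", 1300, "2020-02-03"),
        ("A1", 1500, "2020-03-01"),
        ("A2", 1300, "2020-01-13"),
        ("A2", 1500, "2020-02-11"),
    ],
)

query = """
WITH cte AS (
    SELECT id, salary,
           -- newest row per id gets position 1
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY date DESC) AS position,
           -- salary of the chronologically preceding row for the same id
           LAG(salary) OVER (PARTITION BY id ORDER BY date) AS previous,
           date
    FROM payroll
)
SELECT id, salary, previous, date
FROM cte
WHERE position = 1
ORDER BY id
"""
latest = conn.execute(query).fetchall()
for row in latest:
    print(row)
```

Note that the two window functions use opposite sort directions on purpose: descending to find the latest row, ascending so that lag looks backwards in time.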
In SQL Server you can use the LAG window function to reference a field from the previous record within a partition (a specified data window).
with [Data] as
(
SELECT ID, Salary, Cast([Date] as Date) [Date]
FROM (VALUES
('A1', 1000, '2020-01-03'),
('A1',1300,'2020-02-03'),
('A1',1500,'2020-03-01'),
('A2',1300,'2020-01-13'),
('A2',1500,'2020-02-11')
) as t(ID,Salary,Date)
)
-- above is a simple demo dataset definition, your actual query is below
SELECT ID, Salary, LAG(Salary, 1)
OVER (
PARTITION BY [ID]
ORDER BY [Date]
) as [Previous_Salary], [Date]
FROM [Data]
ORDER BY [Data].[Date] DESC
Produces the following output:
ID Salary Previous_Salary Date
---- ----------- --------------- ----------
A1 1500 1300 2020-03-01
A2 1500 1300 2020-02-11
A1 1300 1000 2020-02-03
A2 1300 NULL 2020-01-13
A1 1000 NULL 2020-01-03
(5 rows affected)
Experiment with the ordering. Note that the window uses ascending order, while the final display can still use descending order.
Window functions create a virtual dataset outside of your current query. Think of window functions as a way to execute correlated queries in parallel and merge in the result.
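As a rough mental model only (not how SQL Server executes it), the LAG call above behaves like the following plain Python sketch: sort by partition key and date, walk each partition, and remember the previous salary. The function name `lag_by_partition` is made up for this illustration.

```python
from itertools import groupby
from operator import itemgetter

# (id, salary, date) rows from the demo dataset
rows = [
    ("A1", 1000, "2020-01-03"),
    ("A1", 1300, "2020-02-03"),
    ("A1", 1500, "2020-03-01"),
    ("A2", 1300, "2020-01-13"),
    ("A2", 1500, "2020-02-11"),
]

def lag_by_partition(rows):
    """Emulate LAG(salary) OVER (PARTITION BY id ORDER BY date)."""
    out = []
    ordered = sorted(rows, key=itemgetter(0, 2))  # window order: id, then date ascending
    for _, partition in groupby(ordered, key=itemgetter(0)):
        previous = None  # first row of each partition has no predecessor
        for emp_id, salary, date in partition:
            out.append((emp_id, salary, previous, date))
            previous = salary
    return out

with_previous = lag_by_partition(rows)
# The display order is independent of the window order: newest first here.
display = sorted(with_previous, key=itemgetter(3), reverse=True)
for row in display:
    print(row)
```

The last two lines mirror the point about ordering: the window is evaluated in ascending date order, and the outer sort is free to present rows newest first.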
In many simple implementations, window functions like this should perform as well as or better than writing your own self-join or subquery logic. For example, this is an equivalent query using OUTER APPLY (CROSS APPLY would drop the rows that have no previous salary):
SELECT [Data].ID, [Data].Salary, previous.Salary as [Previous_Salary], [Data].[Date]
FROM [Data]
OUTER APPLY (SELECT TOP 1 x.Salary
FROM [Data] x
WHERE x.[ID] = [Data].ID AND x.[Date] < [Data].[Date]
ORDER BY x.[Date] DESC) as previous
ORDER BY [Data].[Date] DESC
The LAG syntax requires less code, clearly defines your intent and allows set-based execution and optimizations.
Other JOIN style queries will still be blocking queries as they will require the reversal of the original data-set (by forcing the entire set to be loaded in reverse order) and so will not offer a truly set-based or optimal approach.
The SQL Server developers realized that there is a genuine need for these types of queries, and that when left to our own devices we tend to write inefficient lookup queries. Window functions were designed to offer a best-practice solution to these types of analytical queries.
This query could work for you
select *, LAG(salary) OVER (partition by id ORDER BY date) as previous from A
Try this: (Assumption: Table name 'PAYROLL' in 'PLS' schema)
SELECT R1.ID,R1.SALARY AS SALARY,R2.SALARY AS PREVIOUS_SALARY
FROM PLS.PAYROLL R1
LEFT OUTER JOIN PLS.PAYROLL R2 ON R1.ID=R2.ID AND R2.SALARY_DATE = (SELECT MAX(SALARY_DATE) FROM PLS.PAYROLL WHERE ID=R1.ID AND SALARY_DATE<R1.SALARY_DATE)
WHERE R1.SALARY_DATE=(SELECT MAX(SALARY_DATE) FROM PLS.PAYROLL WHERE ID=R1.ID)
ORDER BY R1.ID
You can use window functions and pivot to make this.
CREATE TABLE #SampleData (ID VARCHAR(5), Salary MONEY ,[Date] DATE)
INSERT INTO #SampleData VALUES
('A1', $1000 , '2020-01-03'),
('A1', $1300 , '2020-02-03'),
('A1', $1500 , '2020-03-01'),
('A2', $1300 , '2020-01-13'),
('A2', $1500 , '2020-02-11')
SELECT ID, [1] Salary, [2] [Previous Salary], [Date]
FROM (SELECT ID, Salary, MAX([Date]) OVER(PARTITION BY ID) AS [Date],
ROW_NUMBER() OVER(PARTITION BY ID ORDER BY [Date] DESC) RN
FROM #SampleData ) SRC
PIVOT(MAX(Salary) FOR RN IN ([1],[2])) PVT
ORDER BY ID
Result:
ID Salary Previous Salary Date
----- --------------------- --------------------- ----------
A1 1500,00 1300,00 2020-03-01
A2 1500,00 1300,00 2020-02-11
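PIVOT is specific to SQL Server. As a portable sketch of the same idea, the query below swaps in conditional aggregation (MAX over a CASE expression) and runs it in SQLite via Python's sqlite3 module. The table name `sample_data` and the column types are assumptions, and SQLite 3.25 or later is needed for ROW_NUMBER.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample_data (id TEXT, salary INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO sample_data VALUES (?, ?, ?)",
    [
        ("A1", 1000, "2020-01-03"),
        ("A1", 1300, "2020-02-03"),
        ("A1", 1500, "2020-03-01"),
        ("A2", 1300, "2020-01-13"),
        ("A2", 1500, "2020-02-11"),
    ],
)

query = """
SELECT id,
       MAX(CASE WHEN rn = 1 THEN salary END) AS salary,           -- newest row
       MAX(CASE WHEN rn = 2 THEN salary END) AS previous_salary,  -- second newest
       MAX(date) AS date
FROM (
    SELECT id, salary, date,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY date DESC) AS rn
    FROM sample_data
) AS src
GROUP BY id
ORDER BY id
"""
pivoted = conn.execute(query).fetchall()
for row in pivoted:
    print(row)
```

Each CASE expression plays the role of one pivoted column: it is non-NULL for exactly one row number, so MAX simply picks that value.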
Related
I am working with a table in Databricks Delta lake. It gets new records appended every month. The field insert_dt indicates when the records are inserted.
| ID | Mrc | insert_dt |
|----|-----|------------|
| 1 | 40 | 2022-01-01 |
| 2 | 30 | 2022-01-01 |
| 3 | 50 | 2022-01-01 |
| 4 | 20 | 2022-02-01 |
| 5 | 45 | 2022-02-01 |
| 6 | 55 | 2022-03-01 |
Now I want to aggregate by insert_dt and calculate the average of Mrc. For each date, the average is done not just for the records of that date but all records with date prior to that. In this example, there are 3 rows for 2022-01-01, 5 rows for 2022-02-01 and 6 rows for 2022-03-01. The expected results would look like this:
| Mrc | insert_dt |
|-----|------------|
| 40 | 2022-01-01 |
| 37 | 2022-02-01 |
| 40 | 2022-03-01 |
How do I write a query to do that?
I checked the Databricks documentation for Delta Lake (https://docs.databricks.com/sql/language-manual/sql-ref-window-functions.html) and the syntax looks like T-SQL, so I think this will work for you, but you may need to tweak it slightly.
The approach is to condense each day to a single point and then use window functions to get the running totals. Note that any given day may have a different count, so you can't just average the averages.
--Enter the sample data you gave as a CTE for testing
;with cteSample as (
SELECT * FROM ( VALUES
(1, 40, CONVERT(date,'2022-01-01'))
, ('2', '30', '2022-01-01')
, ('3', '50', '2022-01-01')
, ('4', '20', '2022-02-01')
, ('5', '45', '2022-02-01')
, ('6', '55', '2022-03-01')
) as TabA(ID, Mrc, insert_dt)
)--Solution begins here, find the total and count for each date
--because window can only handle a single "last row"
, cteGrouped as (
SELECT insert_dt, SUM(Mrc) as MRCSum, COUNT(*) as MRCCount
FROM cteSample
GROUP BY insert_dt
)--Now use the window function to get the totals "up to today"
, cteTotals as (
SELECT insert_dt
, SUM(MRCSum) OVER (ORDER BY insert_dt RANGE
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS MrcSum
, SUM(MRCCount) OVER (ORDER BY insert_dt RANGE
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS MrcCount
FROM cteGrouped as G
) --Now divide out to get the average to date
SELECT insert_dt, MrcSum/MrcCount as MRCAverage
FROM cteTotals as T
This gives the following output
| insert_dt  | MRCAverage |
|------------|------------|
| 2022-01-01 | 40         |
| 2022-02-01 | 37         |
| 2022-03-01 | 40         |
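The arithmetic behind the running-total approach can be checked with a small Python sketch. It first condenses each date to a (sum, count) pair, then accumulates both, and also computes the naive "average of the daily averages" to show why that shortcut gives a different number. The variable names are illustrative only.

```python
from itertools import accumulate

# (id, mrc, insert_dt) rows from the question
records = [
    (1, 40, "2022-01-01"),
    (2, 30, "2022-01-01"),
    (3, 50, "2022-01-01"),
    (4, 20, "2022-02-01"),
    (5, 45, "2022-02-01"),
    (6, 55, "2022-03-01"),
]

# Step 1: one (sum, count) pair per date, like the GROUP BY step
per_day = {}
for _id, mrc, insert_dt in records:
    total, count = per_day.get(insert_dt, (0, 0))
    per_day[insert_dt] = (total + mrc, count + 1)

# Step 2: running sum and running count, like the two windowed SUMs
days = sorted(per_day)
running_sum = list(accumulate(per_day[d][0] for d in days))
running_count = list(accumulate(per_day[d][1] for d in days))

# Step 3: divide to get the average to date
average_to_date = {
    d: s / c for d, s, c in zip(days, running_sum, running_count)
}

# Contrast: averaging the daily averages weights every day equally
daily_means = [per_day[d][0] / per_day[d][1] for d in days]
mean_of_means = {
    d: sum(daily_means[: i + 1]) / (i + 1) for i, d in enumerate(days)
}

print(average_to_date)
print(mean_of_means)
```

For 2022-02-01 the cumulative average is 185 / 5 = 37, while the average of the two daily averages is (40 + 32.5) / 2 = 36.25, because the second day has fewer rows than the first.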
Calculate a running average using a window function (the inner subquery), then keep only one row per insert_dt: the one with the highest id. I only tested this on PostgreSQL 13, so I am not sure how closely Delta Lake follows the SQL standard or whether it will work there unchanged.
select mrc, insert_dt from
(
select avg(mrc) over (order by insert_dt, id) mrc, insert_dt,
row_number() over (partition by insert_dt order by id desc) rn
from the_table
) t
where rn = 1
order by insert_dt;
DB-fiddle demo
Update: If the_table has no id column, use a CTE to add one.
with t_id as (select *, row_number() over (order by insert_dt) id from the_table)
select mrc, insert_dt from
(
select avg(mrc) over (order by insert_dt, id) mrc, insert_dt,
row_number() over (partition by insert_dt order by id desc) rn
from t_id
) t
where rn = 1
order by insert_dt;
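For a quick check outside PostgreSQL, this sketch runs the same running-average query in SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or later). The only deliberate change is the alias `running_avg`, used instead of reusing the column name mrc, to keep the inner select unambiguous.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE the_table (id INTEGER, mrc INTEGER, insert_dt TEXT)")
conn.executemany(
    "INSERT INTO the_table VALUES (?, ?, ?)",
    [
        (1, 40, "2022-01-01"),
        (2, 30, "2022-01-01"),
        (3, 50, "2022-01-01"),
        (4, 20, "2022-02-01"),
        (5, 45, "2022-02-01"),
        (6, 55, "2022-03-01"),
    ],
)

query = """
SELECT running_avg, insert_dt
FROM (
    SELECT AVG(mrc) OVER (ORDER BY insert_dt, id) AS running_avg,
           insert_dt,
           -- rn = 1 marks the last row of each date, where the
           -- running average already covers every row of that date
           ROW_NUMBER() OVER (PARTITION BY insert_dt ORDER BY id DESC) AS rn
    FROM the_table
) AS t
WHERE rn = 1
ORDER BY insert_dt
"""
running = conn.execute(query).fetchall()
for row in running:
    print(row)
```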
I'm stumped as to how to do this.
I have 3 columns: the first is a parent company, the second is the child and the third is its revenue. I want to find out which child per parent has the most revenue and what that revenue is.
So like the below
Vodafone. Argentina. 5b
Vodafone. Spain. 4b
Vodafone. England. 10b
So the answer would be
Vodafone. England 10b
Apologies for the formatting, on my phone.
You can use row_number(). Here is the demo.
select
company,
child,
revenue
from
(
select
*,
row_number() over (partition by company order by cast(revenue as int) desc) as rn
from yourTable
) subq
where rn = 1
output:
| company | child | revenue |
| -------- | ------- | ------- |
| Vodafone | England | 10 |
Use dense_rank() instead if several children of the same company can share the top revenue and you want all of them returned.
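The difference only shows up when there is a tie. The sketch below adds a hypothetical fourth row (Germany, also 10) that is not in the question, and runs both ranking functions in SQLite via Python's sqlite3 module (SQLite 3.25 or later). The table name `subsidiaries` and the helper `top_children` are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subsidiaries (company TEXT, child TEXT, revenue INTEGER)")
conn.executemany(
    "INSERT INTO subsidiaries VALUES (?, ?, ?)",
    [
        ("Vodafone", "Argentina", 5),
        ("Vodafone", "Spain", 4),
        ("Vodafone", "England", 10),
        ("Vodafone", "Germany", 10),  # hypothetical tie, not in the question
    ],
)

def top_children(ranking_function):
    """Return the rank-1 rows per company using the given ranking function."""
    # Only fixed, known function names are interpolated into the SQL text.
    if ranking_function not in ("ROW_NUMBER", "DENSE_RANK"):
        raise ValueError("unsupported ranking function")
    query = f"""
        SELECT company, child, revenue
        FROM (
            SELECT *,
                   {ranking_function}() OVER (
                       PARTITION BY company ORDER BY revenue DESC
                   ) AS rn
            FROM subsidiaries
        ) AS ranked
        WHERE rn = 1
        ORDER BY child
    """
    return conn.execute(query).fetchall()

with_row_number = top_children("ROW_NUMBER")  # exactly one row, tie broken arbitrarily
with_dense_rank = top_children("DENSE_RANK")  # every row tied for first place
print(with_row_number)
print(with_dense_rank)
```

With row_number the database picks one of the tied rows arbitrarily, so add a tie-breaking column to the ORDER BY if the choice matters.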
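You can also try the query below, which compares each row with the maximum revenue of its own parent company (assuming the parent column is named company):
select * from tablename t
where cast(t.revenue as int) = (select max(cast(revenue as int)) from tablename where company = t.company)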
I have a table which consists of dates and names. I want to group the result by names and dates with a condition that the resultant dates selected are at least 10 days apart. (starting from first date present in the table for that name)
This is an example:
________________________
Names | Dates
-----------------------
John | 2-2-2000
________________________
John | 5-2-2000
________________________
John | 16-2-2000
________________________
John | 17-2-2000
________________________
John | 20-2-2000
________________________
John | 31-2-2000
________________________
John | 5-3-2000
________________________
John | 14-3-2000
________________________
The output of the query should be the sum of count of these values (John,2-2-2000),(John,16-2-2000),(John,31-2-2000),(John,14-3-2000) That is, 4.
How do I write a query in SQL Server for this?
This is a bit tricky, because you need to keep track of the last row that was "picked" in order to select the next one. This means that you need a kind of iterative process, which in turn suggests a recursive query:
with
data as (
select t.*, row_number() over(partition by names order by dates) rn
from mytable t
),
rcte as (
select d.*, dates dates_base from data d where rn = 1
union all
select
d.*,
case when d.dates >= dateadd(day, 10, r.dates_base) then d.dates else r.dates_base end
from rcte r
inner join data d on d.rn = r.rn + 1 and d.names = r.names
)
select names, count(distinct dates_base) res from rcte group by names
Demo on DB Fiddle:
names | res
:---- | --:
John | 4
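The iterative process that the recursive CTE encodes can be sketched in a few lines of Python: walk the dates in order and pick a date only when it is at least 10 days after the last picked one. The sample dates below are an assumption: they follow the question's list, but 31-2-2000 is not a valid calendar date, so 28-2-2000 is used in its place.

```python
from datetime import date, timedelta

def pick_spaced_dates(dates, min_gap_days=10):
    """Greedily keep dates that are at least min_gap_days after the last kept date."""
    picked = []
    for d in sorted(dates):
        if not picked or d >= picked[-1] + timedelta(days=min_gap_days):
            picked.append(d)  # this date becomes the new base date
    return picked

# Dates for one name, adapted from the question (see the note on 31-2-2000 above)
john = [
    date(2000, 2, 2),
    date(2000, 2, 5),
    date(2000, 2, 16),
    date(2000, 2, 17),
    date(2000, 2, 20),
    date(2000, 2, 28),
    date(2000, 3, 5),
    date(2000, 3, 14),
]

picked = pick_spaced_dates(john)
print(len(picked), picked)
```

The `picked[-1]` value corresponds to dates_base in the recursive query: it only moves forward when a row is far enough from the current base.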
Your question is unclear. Also consistent with your desired results is that you want to count rows where the gap from the previous row is 10+ days. For that, simply use lag():
select count(*)
from (select t.*,
lag(date) over (partition by name order by date) as prev_date
from t
) t
where prev_date is null or prev_date <= dateadd(day, -10, date);
Use select * to get the list of records.
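Note that this reading (gap to the immediately preceding row) is not the same rule as "at least 10 days after the last picked row", and the two can give different counts. A small Python sketch of the lag-based rule, on assumed sample dates (28-2-2000 stands in for the question's invalid 31-2-2000):

```python
from datetime import date

def count_large_gaps(dates, min_gap_days=10):
    """Count rows whose gap to the previous row is at least min_gap_days.

    The first row has no previous row and is always counted,
    matching the `prev_date is null` branch of the query.
    """
    count = 0
    previous = None
    for d in sorted(dates):
        if previous is None or (d - previous).days >= min_gap_days:
            count += 1
        previous = d  # unlike the greedy rule, this advances on every row
    return count

john = [
    date(2000, 2, 2),
    date(2000, 2, 5),
    date(2000, 2, 16),
    date(2000, 2, 17),
    date(2000, 2, 20),
    date(2000, 2, 28),
    date(2000, 3, 5),
    date(2000, 3, 14),
]

gap_count = count_large_gaps(john)
print(gap_count)
```

On these dates only the first row and 16-2-2000 qualify, so it is worth deciding which of the two rules you actually need before choosing between lag() and the recursive query.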
I'm using SQL Server 2008 R2. I have table called EmployeeHistory with the following structure and sample data:
EmployeeID Date DepartmentID SupervisorID
10001 20130101 001 10009
10001 20130909 001 10019
10001 20131201 002 10018
10001 20140501 002 10017
10001 20141001 001 10015
10001 20141201 001 10014
Notice that employee 10001 has moved between 2 departments and had several supervisors over time. What I am trying to do is list the start and end dates of this employee's time in each department, ordered by the Date field. So, the output will look like this:
EmployeeID DateStart DateEnd DepartmentID
10001 20130101 20131201 001
10001 20131201 20141001 002
10001 20141001 NULL 001
I intended to use partitioning the data using the following query but it failed. The Department changes from 001 to 002 and then back to 001. Obviously I cannot partition by DepartmentID... I'm sure I'm overlooking the obvious. Any help? Thank you, in advance.
SELECT * ,ROW_NUMBER() OVER (PARTITION BY EmployeeID, DepartmentID
ORDER BY [Date]) RN FROM EmployeeHistory
I would do something like this:
;WITH x
AS (SELECT *,
Row_number()
OVER(
partition BY employeeid
ORDER BY [Date]) rn
FROM employeehistory)
SELECT *
FROM x x1
LEFT OUTER JOIN x x2
ON x1.employeeid = x2.employeeid
AND x1.rn = x2.rn + 1
Or maybe it would be x2.rn - 1. You'll have to see. In any case, you get the idea. Once you have the table joined on itself, you can filter, group, sort, etc. to get what you need.
A bit involved. Easiest would be to refer to this SQL Fiddle I created for you that produces the exact result. There are ways you can improve it for performance or other considerations, but this should hopefully at least be clearer than some alternatives.
The gist is, you get a canonical ranking of your data first, then use that to segment the data into groups, then find an end date for each group, then eliminate any intermediate rows. ROW_NUMBER() and CROSS APPLY help a lot in doing it readably.
EDIT 2019:
The SQL Fiddle does in fact seem to be broken, for some reason, but it appears to be a problem on the SQL Fiddle site. Here's a complete version, tested just now on SQL Server 2016:
CREATE TABLE Source
(
EmployeeID int,
DateStarted date,
DepartmentID int
)
INSERT INTO Source
VALUES
(10001,'2013-01-01',001),
(10001,'2013-09-09',001),
(10001,'2013-12-01',002),
(10001,'2014-05-01',002),
(10001,'2014-10-01',001),
(10001,'2014-12-01',001)
SELECT *,
ROW_NUMBER() OVER (PARTITION BY EmployeeID ORDER BY DateStarted) AS EntryRank,
newid() as GroupKey,
CAST(NULL AS date) AS EndDate
INTO #RankedData
FROM Source
;
UPDATE #RankedData
SET GroupKey = beginDate.GroupKey
FROM #RankedData sup
CROSS APPLY
(
SELECT TOP 1 GroupKey
FROM #RankedData sub
WHERE sub.EmployeeID = sup.EmployeeID AND
sub.DepartmentID = sup.DepartmentID AND
NOT EXISTS
(
SELECT *
FROM #RankedData bot
WHERE bot.EmployeeID = sup.EmployeeID AND
bot.EntryRank BETWEEN sub.EntryRank AND sup.EntryRank AND
bot.DepartmentID <> sup.DepartmentID
)
ORDER BY DateStarted ASC
) beginDate (GroupKey);
UPDATE #RankedData
SET EndDate = nextGroup.DateStarted
FROM #RankedData sup
CROSS APPLY
(
SELECT TOP 1 DateStarted
FROM #RankedData sub
WHERE sub.EmployeeID = sup.EmployeeID AND
sub.DepartmentID <> sup.DepartmentID AND
sub.EntryRank > sup.EntryRank
ORDER BY EntryRank ASC
) nextGroup (DateStarted);
SELECT * FROM
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY GroupKey ORDER BY EntryRank ASC) AS GroupRank FROM #RankedData
) FinalRanking
WHERE GroupRank = 1
ORDER BY EntryRank;
DROP TABLE #RankedData
DROP TABLE Source
It looks like a common gaps-and-islands problem. The difference between two sequences of row numbers, rn1 and rn2, gives the "group" number.
Run this query CTE-by-CTE and examine intermediate results to see how it works.
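To see why the difference of the two row numbers identifies a group, here is a plain Python sketch for employee 10001 only. rn1 counts rows in date order, rn2 counts rows in date order within each department, and rn1 - rn2 stays constant for as long as the department does not change. Variable names are illustrative.

```python
from collections import defaultdict

# (DateStarted, DepartmentID) rows for employee 10001
rows = [
    ("2013-01-01", 1),
    ("2013-09-09", 1),
    ("2013-12-01", 2),
    ("2014-05-01", 2),
    ("2014-10-01", 1),
    ("2014-12-01", 1),
]

seen_per_department = defaultdict(int)
trace = []    # (DateStarted, DepartmentID, rn1, rn2, rn1 - rn2)
islands = {}  # (DepartmentID, rn1 - rn2) -> earliest DateStarted

for rn1, (started, department) in enumerate(sorted(rows), start=1):
    seen_per_department[department] += 1
    rn2 = seen_per_department[department]
    trace.append((started, department, rn1, rn2, rn1 - rn2))
    # Rows are visited in date order, so the first date seen is the MIN
    islands.setdefault((department, rn1 - rn2), started)

for line in trace:
    print(line)
print(islands)
```

The grouping key needs both the department and the difference: the last dept 1 stretch and the dept 2 stretch both have a difference of 2, and only the department tells them apart.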
Sample data
I expanded sample data from the question a little.
CREATE TABLE #Source
(
EmployeeID int,
DateStarted date,
DepartmentID int
)
INSERT INTO #Source
VALUES
(10001,'2013-01-01',001),
(10001,'2013-09-09',001),
(10001,'2013-12-01',002),
(10001,'2014-05-01',002),
(10001,'2014-10-01',001),
(10001,'2014-12-01',001),
(10005,'2013-05-01',001),
(10005,'2013-11-09',001),
(10005,'2013-12-01',002),
(10005,'2014-10-01',001),
(10005,'2016-12-01',001);
Query for SQL Server 2008
There is no LEAD function in SQL Server 2008, so I had to use self-join via OUTER APPLY to get the value of the "next" row for the DateEnd.
WITH
CTE
AS
(
SELECT
EmployeeID
,DateStarted
,DepartmentID
,ROW_NUMBER() OVER (PARTITION BY EmployeeID ORDER BY DateStarted) AS rn1
,ROW_NUMBER() OVER (PARTITION BY EmployeeID, DepartmentID ORDER BY DateStarted) AS rn2
FROM #Source
)
,CTE_Groups
AS
(
SELECT
EmployeeID
,MIN(DateStarted) AS DateStart
,DepartmentID
FROM CTE
GROUP BY
EmployeeID
,DepartmentID
,rn1 - rn2
)
SELECT
CTE_Groups.EmployeeID
,CTE_Groups.DepartmentID
,CTE_Groups.DateStart
,A.DateEnd
FROM
CTE_Groups
OUTER APPLY
(
SELECT TOP(1) G2.DateStart AS DateEnd
FROM CTE_Groups AS G2
WHERE
G2.EmployeeID = CTE_Groups.EmployeeID
AND G2.DateStart > CTE_Groups.DateStart
ORDER BY G2.DateStart
) AS A
ORDER BY
EmployeeID
,DateStart
;
Query for SQL Server 2012+
Starting with SQL Server 2012 there is a LEAD function that makes this task more efficient.
WITH
CTE
AS
(
SELECT
EmployeeID
,DateStarted
,DepartmentID
,ROW_NUMBER() OVER (PARTITION BY EmployeeID ORDER BY DateStarted) AS rn1
,ROW_NUMBER() OVER (PARTITION BY EmployeeID, DepartmentID ORDER BY DateStarted) AS rn2
FROM #Source
)
,CTE_Groups
AS
(
SELECT
EmployeeID
,MIN(DateStarted) AS DateStart
,DepartmentID
FROM CTE
GROUP BY
EmployeeID
,DepartmentID
,rn1 - rn2
)
SELECT
CTE_Groups.EmployeeID
,CTE_Groups.DepartmentID
,CTE_Groups.DateStart
,LEAD(CTE_Groups.DateStart) OVER (PARTITION BY CTE_Groups.EmployeeID ORDER BY CTE_Groups.DateStart) AS DateEnd
FROM
CTE_Groups
ORDER BY
EmployeeID
,DateStart
;
Result
+------------+--------------+------------+------------+
| EmployeeID | DepartmentID | DateStart | DateEnd |
+------------+--------------+------------+------------+
| 10001 | 1 | 2013-01-01 | 2013-12-01 |
| 10001 | 2 | 2013-12-01 | 2014-10-01 |
| 10001 | 1 | 2014-10-01 | NULL |
| 10005 | 1 | 2013-05-01 | 2013-12-01 |
| 10005 | 2 | 2013-12-01 | 2014-10-01 |
| 10005 | 1 | 2014-10-01 | NULL |
+------------+--------------+------------+------------+
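The LEAD-based query is close to standard SQL, so it can be reproduced outside SQL Server. This sketch runs it in SQLite via Python's sqlite3 module (SQLite 3.25 or later), with dates stored as ISO text, which is an assumption of the sketch, and a regular table named `source` standing in for the sample table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE source (EmployeeID INTEGER, DateStarted TEXT, DepartmentID INTEGER)"
)
conn.executemany(
    "INSERT INTO source VALUES (?, ?, ?)",
    [
        (10001, "2013-01-01", 1), (10001, "2013-09-09", 1),
        (10001, "2013-12-01", 2), (10001, "2014-05-01", 2),
        (10001, "2014-10-01", 1), (10001, "2014-12-01", 1),
        (10005, "2013-05-01", 1), (10005, "2013-11-09", 1),
        (10005, "2013-12-01", 2), (10005, "2014-10-01", 1),
        (10005, "2016-12-01", 1),
    ],
)

query = """
WITH cte AS (
    SELECT EmployeeID, DateStarted, DepartmentID,
           ROW_NUMBER() OVER (PARTITION BY EmployeeID
                              ORDER BY DateStarted) AS rn1,
           ROW_NUMBER() OVER (PARTITION BY EmployeeID, DepartmentID
                              ORDER BY DateStarted) AS rn2
    FROM source
),
cte_groups AS (
    -- one row per island: consecutive rows in the same department
    SELECT EmployeeID, MIN(DateStarted) AS DateStart, DepartmentID
    FROM cte
    GROUP BY EmployeeID, DepartmentID, rn1 - rn2
)
SELECT EmployeeID, DepartmentID, DateStart,
       -- the next island's start date is this island's end date
       LEAD(DateStart) OVER (PARTITION BY EmployeeID
                             ORDER BY DateStart) AS DateEnd
FROM cte_groups
ORDER BY EmployeeID, DateStart
"""
periods = conn.execute(query).fetchall()
for row in periods:
    print(row)
```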
Let's say I have a table:
--------------------------------------
| ID | DATE | GROUP | RESULT |
--------------------------------------
| 1 | 01/06 | Group1 | 12345 |
| 2 | 01/05 | Group2 | 54321 |
| 3 | 01/04 | Group1 | 11111 |
--------------------------------------
I want to order the result by the most recent date at the top but group the "group" column together, but still have distinct entries. The result that I want would be:
1 | 01/06 | Group1 | 12345
3 | 01/04 | Group1 | 11111
2 | 01/05 | Group2 | 54321
What would be a query to get that result?
thank you!
EDIT:
I'm using MSSQL. I'll look into translating the oracle query into MS SQL and report my results.
EDIT
SQL Server 2000, so OVER/PARTITION is not supported =[
Thank you!
You should specify what RDBMS you are using. This answer is for Oracle, may not work in other systems.
SELECT * FROM table
ORDER BY MAX(date) OVER (PARTITION BY group) DESC, group, date DESC
create table #table (
ID int not null,
[DATE] smalldatetime not null,
[GROUP] varchar(10) not null,
[RESULT] varchar(10) not null
)
insert #table values (1, '2009-01-06', 'Group1', '12345')
insert #table values (2, '2009-01-05', 'Group2', '12345')
insert #table values (3, '2009-01-04', 'Group1', '12345')
select t.*
from #table t
inner join (
select
max([date]) as [order-date],
[GROUP]
from #table orderer
group by
[GROUP]
) x
on t.[GROUP] = x.[GROUP]
order by
x.[order-date] desc,
t.[GROUP],
t.[DATE] desc
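The join against a per-group MAX(date) uses no window functions, which is why it suits SQL Server 2000. As a sketch, the same shape can be checked in SQLite via Python's sqlite3 module. The table name `results` and the column `grp` are assumptions chosen to avoid reserved words, and the year 2009 is taken from the sample inserts above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (id INTEGER, date TEXT, grp TEXT, result TEXT)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?)",
    [
        (1, "2009-01-06", "Group1", "12345"),
        (2, "2009-01-05", "Group2", "54321"),
        (3, "2009-01-04", "Group1", "11111"),
    ],
)

query = """
SELECT t.id, t.date, t.grp, t.result
FROM results AS t
JOIN (
    -- most recent date per group decides the order of the groups
    SELECT grp, MAX(date) AS order_date
    FROM results
    GROUP BY grp
) AS x ON t.grp = x.grp
ORDER BY x.order_date DESC,  -- groups with the newest activity first
         t.grp,              -- keep each group's rows together
         t.date DESC         -- newest row first within a group
"""
ordered = conn.execute(query).fetchall()
for row in ordered:
    print(row)
```

The middle sort key matters when two groups share the same latest date: without it their rows could interleave.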
Use an ORDER BY clause with two columns:
...order by group, date desc
This assumes that your date column really holds dates and not varchars.
SELECT table2.myID,
table2.mydate,
table2.mygroup,
table2.myresult
FROM (SELECT DISTINCT mygroup FROM testtable as table1) as grouptable
JOIN testtable as table2
ON grouptable.mygroup = table2.mygroup
ORDER BY grouptable.mygroup, table2.mydate DESC
SORRY, could NOT bring myself to use columns that were reserved names, rename the columns to make it work :)
this is MUCH simpler than the accepted answer btw.