Rank & find difference of value in the same column - sql

I have the below table -
Here, I have created the "Order" column by using the rank function partitioned by case_identifier ordered by audit_date.
Now, I want to create a new column as below -
The logic for the new column would be -
select *,
case when [order] = '1' then [days_diff]
else (val of [days_diff] in rank 2) - (val of [days_diff] in rank 1) ...
end as '[New_Col]'
from TABLE
Can you please help me with the syntax? Thanks.

Take a look at the LAG function. It will provide you with what you want.
something like:
declare #temptable TABLE (case_id varchar(2), row_order int, days_diff float)
INSERT INTO #temptable values ('A',1,5)
INSERT INTO #temptable values ('A',2,3)
INSERT INTO #temptable values ('A',3,2)
INSERT INTO #temptable values ('B',1,5)
INSERT INTO #temptable values ('B',2,1)
--select * from #temptable
SELECT case_id,row_order, LAG(days_diff,1) OVER (PARTITION BY case_id ORDER BY row_order) AS prev_row,days_diff,
CASE
WHEN row_order = 1 THEN days_diff
ELSE LAG(days_diff,1) OVER (PARTITION BY case_id ORDER BY row_order) - days_diff
END AS newcolumn
FROM #temptable
order by case_id,row_order asc
SELECT case_id,row_order,LAG(days_diff,1) OVER (PARTITION BY case_id ORDER BY row_order) AS prev_row, days_diff,
COALESCE(LAG(days_diff,1) OVER (PARTITION BY case_id ORDER BY row_order) - days_diff , days_diff)
FROM #temptable
order by case_id,row_order asc
Other answers will use a coalesce in place of the CASE statement. It's probably faster, but I feel like this is clearer.
If you run both and look at the execution plans they are the same.

I believe the following query gets you what you want.
SELECT a.*,
'NEW DAYS DIFF' =
CASE
WHEN a.[order] = 1 THEN a.days_diff
ELSE a.days_diff - b.days_diff
END
FROM dbo.tblCaseDaysDiff a
INNER JOIN dbo.tblCaseDaysDiff b
ON
(b.CASE_ID = a.CASE_ID AND b.[order] + 1 = a.[order] ) -- Get the current row and compare with the next highest order
OR (b.CASE_ID = a.CASE_ID AND b.[order] = 1 AND a.[order] = 1) --WHEN ORDER = 1 Get days_diff value
ORDER BY a.CASE_ID, a.[order]

As it happens, you're already hip-deep in window functions, and as others have pointed out, LAG will do the trick. In general, though, you can always get the difference of two rows by making one row: by joining the table to itself.
with T (CASE_IDENTIFIER, AUDIT_DATE, order, days_diff)
as (
... your query ...
)
select a.*,
a.days_diff - coalesce(b.days_diff, 0) as delta_days_diff
from T as a left join T as b
on a.CASE_IDENTIFIER = b.CASE_IDENTIFIER
and b.days_diff = a.days_diff - 1

LAG METHOD
SELECT
CASE_IDENTIFIER
,AUDIT_DATE
,[order]
,days_diff
,days_diff - ISNULL(LAG(days_diff,1) OVER (PARTITION BY CASE_IDENTIFIER ORDER BY [order]),0) AS New_Column
FROM #Table
SELF JOIN METHOD
SELECT
t1.CASE_IDENTIFIER
,AUDIT_DATE
,t1.[order]
,t1.days_diff
,t1.days_diff - ISNULL(t2.days_diff,0) AS New_Column
FROM
#Table t1
LEFT JOIN #Table t2
ON t1.CASE_IDENTIFIER = t2.CASE_IDENTIFIER
AND t1.[order] - 1 = t2.[order]
I feel like a lot of the other answers are on the right track but there are some nuances or easier ways of writing some of them. Or also some of the answer provide the write direction but had something wrong with their join or syntax. Anyways, you don't need the CASE STATEMENT whether you use the LAG of SELF JOIN Method. Next COALESCE() is great but you are only comparing 2 values so ISNULL() works fine too for sql-server but either will do.

Related

How to filter out NULLs from a window function

I'm trying to return just non-null diff_review values > 1 and so far this has been my result:
This is the query that generated this:
select review_number
, review_number - lag(review_number, 1) over (partition by organization_id order by review_number) as diff_review from applications
where organization_id = 25144
and kind = 'annual_review'
and review_number is not null
order by diff_review desc
I can't use ...and diff_review is not null since you can't use aliases in where clauses, but I found out today you also can't use windowing functions in where clauses either.
This is the first time I've ever used windowing in SQL (I hadn't even heard of it until an hour ago) so I'm still very green at this. I'd appreciate someone clueing me in thanks!!!
You can use a table expression to "alias" the column. For example:
select *
from (
select review_number
, review_number - lag(review_number, 1)
over (partition by organization_id order by review_number)
as diff_review
from applications
where organization_id = 25144
and kind = 'annual_review'
and review_number is not null
) x
where diff_review is not null -- here you can use the aliased column
order by diff_review desc
Alternative method: a self join on row_number():
with omg AS (
select review_number
, row_number() over (partition by organization_id order by review_number) as rn
from applications
where organization_id = 25144
and kind = 'annual_review'
)
SELECT o2.review_number
, o2.review_number - o1.review_number AS diff_review
FROM omg o2
JOIN omg o1 ON (o2.review_number = o1.review_number AND o2.rn = o1.rn +1)
order by 2 desc
;

Oracle SQL re-use subquery "WITH" clause assistance

I have an SQL query in which I need to take the output of a subquery and use it more than once. My existing query works, but only if I repeat the subquery each time I need it. Unfortunately the subquery is complex, and takes time to execute - meaning that multiple iterations really slow the whole thing down.
I have read that you can use the "WITH" statement to assign a subquery output to a variable, in order to re-use that variable. However the problem I'm having is that within the subquery, I need to reference values from the main query. And it appears that if I use WITH - before the main query SELECT - then those references are not recognised. I'll give you a simplified example:
WITH
DateX AS
(
SELECT
MAX(TableSub.Date)
FROM
TableA TableSub
WHERE
TableSub.ID = TableMain.ID
AND TableSub.Event = 'AnotherEvent'
AND TableSub.Date BETWEEN '01-Jan-2015' AND '31-Dec-2015'
)
SELECT
TableMain.ID
FROM
TableA TableMain
WHERE
TableMain.Event = 'MainEvent'
AND TableMain.Date >= DateX
AND (
SELECT
TableSub2.ID
FROM
TableA TableSub2
WHERE
TableSub2.ID = TableMain.ID
TableSub2.Event = 'ThirdEvent'
AND TableSub2.Date <= DateX
) IS NULL
I hope this is clear. It's a simplified version of what I have, but you can see that DateX is used in more than one place: within the main query, and within a subquery. However the problem is that when DateX is defined by WITH, I need to link the ID back to the ID of the main query. And it's not working...
I would be grateful for any advice on this. Am I doing it wrong? Is there a way, or is it just impossible? If so, then should I be using another approach entirely? Thanks.
A better way:
SELECT ID
FROM (
SELECT ID,
"Date",
Event,
LAST_VALUE( CASE Event WHEN 'AnotherEvent' THEN "Date" END IGNORE NULLS )
OVER ( PARTITION BY ID ORDER BY "Date"
ROWS BETWEEN UNBOUNDED PRECEEDING AND UNBOUNDED FOLLOWING
) AS another_date,
FIRST_VALUE( CASE Event WHEN 'ThirdEvent' THEN "Date" END IGNORE NULLS )
OVER ( PARTITION BY ID ORDER BY "Date"
ROWS BETWEEN UNBOUNDED PRECEEDING AND UNBOUNDED FOLLOWING
) AS third_date
FROM TableA
WHERE Event IN ( 'MainEvent', 'ThirdEvent' )
OR ( Event = 'AnotherEvent' AND EXTRACT( YEAR FROM "Date" ) = 2015 )
)
WHERE Event = 'MainEvent'
AND "Date" >= another_date
AND ( third_date IS NULL OR third_date > another_date );
You need to join your DateX CTE on the ID column. Something like:
WITH
DateX AS
(
SELECT
TableSub.ID,
MAX(TableSub.Date) AS MaxDate
FROM
TableA TableSub
WHERE
AND TableSub.Event = 'AnotherEvent'
AND TableSub.Date >= DATE '2015-01-01'
AND TableSub.Date < DATE '2016-01-01'
GROUP BY
TableSub.ID
)
SELECT
TableMain.ID
FROM
TableA TableMain
JOIN
DateX
ON
DateX.ID = TableMain.ID
WHERE
TableMain.Event = 'MainEvent'
AND TableMain.Date >= DateX.MaxDate
AND (
SELECT
TableSub2.ID
FROM
TableA TableSub2
JOIN
DateX
ON
DateX.ID = TableSub2.ID
WHERE
TableSub2.ID = TableMain.ID
TableSub2.Event = 'ThirdEvent'
AND TableSub2.Date <= DateX.MaxDate
) IS NULL
The CTE also needs a column alias for the aggregate; and as you need to join in the ID, you need to include that and group by it.
The last subquery looks odd; you might want NOT EXISTS rather than IS NULL if you're looking for no record. Perhaps your real query is using an aggregate, but even so that might be quicker.
This still may not be the best approach but it's hard to tell from your example. Hitting the same table three times may be unnecessarily expensive.

ROW_NUMBER() Query Plan SORT Optimization

The query below accesses the Votes table that contains over 30 million rows. The result set is then selected from using WHERE n = 1. In the query plan, the SORT operation in the ROW_NUMBER() windowed function is 95% of the query's cost and it is taking over 6 minutes to complete execution.
I already have an index on same_voter, eid, country include vid, nid, sid, vote, time_stamp, new to cover the where clause.
Is the most efficient way to correct this to add an index on vid, nid, sid, new DESC, time_stamp DESC or is there an alternative to using the ROW_NUMBER() function for this to achieve the same results in a more efficient manner?
SELECT v.vid, v.nid, v.sid, v.vote, v.time_stamp, v.new, v.eid,
ROW_NUMBER() OVER (
PARTITION BY v.vid, v.nid, v.sid ORDER BY v.new DESC, v.time_stamp DESC) AS n
FROM dbo.Votes v
WHERE v.same_voter <> 1
AND v.eid <= #EId
AND v.eid > (#EId - 5)
AND v.country = #Country
One possible alternative to using ROW_NUMBER():
SELECT
V.vid,
V.nid,
V.sid,
V.vote,
V.time_stamp,
V.new,
V.eid
FROM
dbo.Votes V
LEFT OUTER JOIN dbo.Votes V2 ON
V2.vid = V.vid AND
V2.nid = V.nid AND
V2.sid = V.sid AND
V2.same_voter <> 1 AND
V2.eid <= #EId AND
V2.eid > (#EId - 5) AND
V2.country = #Country AND
(V2.new > V.new OR (V2.new = V.new AND V2.time_stamp > V.time_stamp))
WHERE
V.same_voter <> 1 AND
V.eid <= #EId AND
V.eid > (#EId - 5) AND
V.country = #Country AND
V2.vid IS NULL
The query basically says to get all rows matching your criteria, then join to any other rows that match the same criteria, but which would be ranked higher for the partition based on the new and time_stamp columns. If none are found then this must be the row that you want (it's ranked highest) and if none are found that means that V2.vid will be NULL. I'm assuming that vid otherwise can never be NULL. If it's a NULLable column in your table then you'll need to adjust that last line of the query.

Sql Server select and update X rows

I'm using Sql Server 2012.
I need to select rows from a table for processing. The number of rows needs to be variable. I need to update the rows I'm selecting to a "being processed" status - I have a guid to populate for this purpose.
I've encountered several examples of using row_number() and a couple of examples of ways of using CTE's, but I'm not sure on how to combine them (or if that's even the correct strategy). I would appreciate any insight.
Here is what I have so far:
DECLARE #SessionGuid uniqueidentifier, #rowcount bigint
SELECT #rowcount = 1000
SELECT #sessionguid = newid()
DECLARE #myProductChanges table (
ProductChangeId bigint
, ProductTypeId smallint
, SourceSystemId tinyint
, ChangeTypeId tinyint );
WITH NextPage AS
(
SELECT
ProductChangeId, ServiceSessionGuid,
ROW_NUMBER() OVER (ORDER BY ProductChangeId) AS 'RowNum'
FROM dbo.ProductChange
WHERE 'RowNum' < #rowcount
)
UPDATE dbo.ProductChange
SET ServiceSessionGuid = #sessionguid, ProcessingStateId = 2, UpdatedDate = getdate()
OUTPUT
INSERTED.ProductChangeId,
INSERTED.ProductTypeId,
INSERTED.SourceSystemId,
INSERTED.ChangeTypeId
INTO #myProductChanges
FROM dbo.ProductChange as pc join NextPage on pc.ProductChangeId = NextPage.ProductChangeId
From here I will select from my temp table and return the data:
SELECT mpc.ProductChangeId
, pt.ProductName as ProductType
, ss.Name as SourceSystem
, ct.ChangeDescription as ChangeType
FROM #myProductChanges as mpc
join dbo.R_ProductType pt on mpc.ProductTypeId = pt.ProductTypeId
join dbo.R_SourceSystem ss on mpc.SourceSystemId = ss.SourceSystemId
join dbo.R_ChangeType ct on mpc.ChangeTypeId = ct.ChangeTypeId
ORDER BY ProductType asc
So far this doesn't work for me. I get an error when I try to run it:
Msg 8114, Level 16, State 5, Line 20
Error converting data type varchar to bigint.
I'm not clear on what I'm doing wrong - so - any help is appreciated.
Thanks!
BTW, here are some of the questions I've used as reference to try and solve this:
https://stackoverflow.com/questions/9777178
https://stackoverflow.com/questions/3319842
https://stackoverflow.com/questions/6402103
This subquery makes no sense:
SELECT
ProductChangeId, ServiceSessionGuid,
ROW_NUMBER() OVER (ORDER BY ProductChangeId) AS 'RowNum'
FROM dbo.ProductChange
WHERE 'RowNum' < #rowcount
You can't reference the alias RowNum at the same scope (and you are trying to compare a string, not an alias, anyway), because when the WHERE clause is parsed, the SELECT list hasn't been materialized yet. What you need is either another nest:
SELECT ProductChangeId, ServiceSessionGuid, RowNum
FROM (SELECT ProductChangeId, ServiceSessionGuid,
ROW_NUMBER() OVER (ORDER BY ProductChangeId) AS RowNum
FROM dbo.ProductChange
) AS x WHERE RowNum < #rowcount
Or:
SELECT TOP (#rowcount-1) ProductChangeId, ServiceSessionGuid,
ROW_NUMBER() OVER (ORDER BY ProductChangeId) AS RowNum
FROM dbo.ProductChange
ORDER BY ProductChangeId
Also please stop using 'alias' - when you need to delimit aliases (you don't in this case), use [square brackets].
I'm guessing, but I think you want <= rather than < if you want to affect #rowcount rows, not one less.
Another tip is that CTEs can be updated directly*, as shown here:
WITH NextPage AS
(
SELECT TOP(#rowcount) *
FROM dbo.ProductChange
)
UPDATE NextPage
SET ServiceSessionGuid = #sessionguid, ProcessingStateId = 2, UpdatedDate = getdate()
OUTPUT
INSERTED.ProductChangeId,
INSERTED.ProductTypeId,
INSERTED.SourceSystemId,
INSERTED.ChangeTypeId
INTO #myProductChanges
* The updates affect the base table in the CTE, i.e. dbo.ProductChange

Reuse subquery result in WHERE-Clause for INSERT

i am using Microsoft SQL Server 2008
i would like to save the result of a subquery to reuse it in a following subquery.
Is this possible?
What is best practice to do this? (I am very new to SQL)
My query looks like:
INSERT INTO [dbo].[TestTable]
(
[a]
,[b]
)
SELECT
(
SELECT TOP 1 MAT_WS_ID
FROM #TempTableX AS X_ALIAS
WHERE OUTERBASETABLE.LT_ALL_MATERIAL = X_ALIAS.MAT_RM_NAME
)
,(
SELECT TOP 1 MAT_WS_NAME
FROM #TempTableY AS Y_ALIAS
WHERE Y_ALIAS.MAT_WS_ID = MAT_WS_ID
--(
--SELECT TOP 1 MAT_WS_ID
--FROM #TempTableX AS X_ALIAS
--WHERE OUTERBASETABLE.LT_ALL_MATERIAL = X_ALIAS.MAT_RM_NAME
--)
)
FROM [dbo].[LASERTECHNO] AS OUTERBASETABLE
My question is:
Is this correct what i did.
I replaced the second SELECT Statement in the WHERE-Clause for [b] (which is commented out and exactly the same as for [a]), with the result of the first SELECT Statement of [a] (=MAT_WS_ID).
It seems to give the right results.
But i dont understand why!
I mean MAT_WS_ID is part of both temporary tables X_ALIAS and Y_ALIAS.
So in the SELECT statement for [b], in the scope of the [b]-select-query, MAT_WS_ID could only be known from the Y_ALIAS table. (Or am i wrong, i am more a C++, maybe the scope things in SQL and C++ are totally different)
I just wannt to know what is the best way in SQL Server to reuse an scalar select result.
Or should i just dont care and copy the select for every column and the sql server optimizes it by its own?
One approach would be outer apply:
SELECT mat.MAT_WS_ID
, (
SELECT TOP 1 MAT_WS_NAME
FROM #TempTableY AS Y_ALIAS
WHERE Y_ALIAS.MAT_WS_ID = mat.MAT_WS_ID
)
FROM [dbo].[LASERTECHNO] AS OUTERBASETABLE
OUTER APPLY
(
SELECT TOP 1 MAT_WS_ID
FROM #TempTableX AS X_ALIAS
WHERE OUTERBASETABLE.LT_ALL_MATERIAL = X_ALIAS.MAT_RM_NAME
) as mat
You could rank rows in #TempTableX and #TempTableY partitioning them by MAT_RM_NAME in the former and by MAT_WS_ID in the latter, then use normal joins with filtering by rownum = 1 in both tables (rownum being the column containing the ranking numbers in each of the two tables):
WITH x_ranked AS (
SELECT
*,
rownum = ROW_NUMBER() OVER (PARTITION BY MAT_RM_NAME ORDER BY (SELECT 1))
FROM #TempTableX
),
y_ranked AS (
SELECT
*,
rownum = ROW_NUMBER() OVER (PARTITION BY MAT_WS_ID ORDER BY (SELECT 1))
FROM #TempTableY
)
INSERT INTO dbo.TestTable (a, b)
SELECT
x.MAT_WS_ID,
y.MAT_WS_NAME
FROM dbo.LASERTECHNO t
LEFT JOIN x_ranked x ON t.LT_ALL_MATERIAL = x.MAT_RM_NAME AND x.rownum = 1
LEFT JOIN y_ranked y ON x.MAT_WS_ID = y.MAT_WS_ID AND y.rownum = 1
;
The ORDER BY (SELECT 1) bit is a trick to specify an indeterminate ordering, which, accordingly, would result in indeterminate rownum = 1 rows picked by the query. That is to more or less duplicate your TOP 1 without an explicit order, but I would recommend you to specify a more sensible ORDER BY clause to make the results more predictable.