Two virtually calculated columns in view - sql

Initially in a regular table, there are two columns Key and ParentKey. The Key gets its value as an identity column - automatically. ParentKey - by the expression
WITH
latest
as
(
SELECT
ProductID,
Date,
[Key],
ROW_NUMBER() OVER (
PARTITION BY ProductID
ORDER BY [Date] DESC -- order by latest Date
) rn
FROM
[MyTable]
)
UPDATE
u
SET
u.[ParentKey] = latest.[Key]
FROM
[MyTable] u
INNER JOIN
latest
ON u.ProductID = latest.ProductID
WHERE
latest.rn = 1
Does it possible to combine both of these two steps simultaneously when creating the View (Key and ParentKey becomes virtual)? I have a solution for the first part of these task - creating Key column in a view. Could it be combined with the step which then sets ParentKey?
Current Code
CREATE VIEW v_Test
AS
SELECT
ProductID
,Date,
CAST (ROW_NUMBER() OVER( ORDER BY [ProductID] ) AS int) as [Key]
-- some expression for ParentKey?
FROM [MyTable]
Desired View output (Key and ParentKey must be processed during view creation)
+-------------------------------------------------
|ProductID | Date | Key | ParentKey
+-------------------------------------------------
|111 | 2017-12-31 | 1 | 3
|111 | 2018-12-31 | 2 | 3
|111 | 2019-12-31 | 3 | 3
|222 | 2017-12-31 | 4 | 6
|222 | 2018-12-31 | 5 | 6
|222 | 2019-12-31 | 6 | 6
|333 | 2017-12-31 | 7 | 9
|333 | 2018-12-31 | 8 | 9
|333 | 2019-12-31 | 9 | 9

If I have understood what you are trying to do correctly then you can put the query with the row number in a CTE or derived table and then reference that in a windowed aggregate to get the max.
CREATE VIEW v_Test
AS
WITH T
AS (SELECT ProductID,
Date,
ROW_NUMBER() OVER(ORDER BY [ProductID] ASC, [Date] ASC ) AS [Key]
FROM [MyTable])
SELECT ProductID,
Date,
[Key],
MAX([Key]) OVER (PARTITION BY [ProductID]) AS [ParentKey]
FROM T
These "Keys" will not be at all stable over time though as they can change after inserts for unrelated products.

If I understand you correctly, I believe will be much shorter using only MAX with OVER clause. Something like
UPDATE
u
SET
u.[ParentKey] =
FROM
[MyTable] u
CROSS APPLY (SELECT MAX( [Key] OVER ( PARTITION BY ProductID ORDER BY [DATE] DESC ) LatestKey ) ca

Related

Using the last_value function on every column | Downfilling all nulls in a table

I have table an individual level table, ordered by Person_ID and Date, ascending. There are duplicate entries at the Person_ID level. What I would like to do is "downfill" null values across every column -- my impression is that the last_value( | ignore nulls) function will work perfectly for each column.
A major problem is that the table is hundreds of columns wide, and is quite dynamic (feature creation for ML experiments). There has to be a better way than to writing out a last_value statement for each variable, something like this:
SELECT last_value(var1) OVER (PARTITION BY Person_ID ORDER BY Date ASC
RANGE BETWEEN UNBOUNDED PRECEDING) as Var1,
last_value(var2) OVER (PARTITION BY Person_ID ORDER BY Date ASC
RANGE BETWEEN UNBOUNDED PRECEDING) as Var2,
...
last_value(var300) OVER (PARTITION BY Person_ID ORDER BY Date ASC
RANGE BETWEEN UNBOUNDED PRECEDING) as Var3
FROM TABLE
In summmary, I have the following table:
+----------+-----------+------+------+---+------------+
| PersonID | YearMonth | Var1 | Var2 | … | Var300 |
+----------+-----------+------+------+---+------------+
| 1 | 200901 | 2 | null | | null |
| 1 | 200902 | null | 1 | | Category 1 |
| 1 | 201010 | null | 1 | | null |
+----------+-----------+------+------+---+------------+
and desire the following table:
+----------+-----------+------+------+---+------------+
| PersonID | YearMonth | Var1 | Var2 | … | Var300 |
+----------+-----------+------+------+---+------------+
| 1 | 200901 | 2 | null | | null |
| 1 | 200902 | 2 | 1 | | Category 1 |
| 1 | 201010 | 2 | 1 | | Category 1 |
+----------+-----------+------+------+---+------------+
I don't see any great options for you, but here are two approaches you might look into.
OPTION 1 -- Recursive CTE
In this approach, you use a recursive query, where each child value equals itself or, if it is null, its parent's value. Like so:
WITH
ordered AS (
SELECT yt.*
row_number() over ( partition by yt.personid order by yt.yearmonth ) rn
FROM YOUR_TABLE yt),
downfilled ( personid, yearmonth, var1, var2, ..., var300, rn) as (
SELECT o.*
FROM ordered o
WHERE o.rn = 1
UNION ALL
SELECT c.personid, c.yearmonth,
nvl(c.var1, p.var1) var1,
nvl(c.var2, p.var2) var2,
...
nvl(c.var300, p.var300) var300
FROM downfilled p INNER JOIN ordered c ON c.personid = p.personid AND c.rn = p.rn + 1 )
SELECT * FROM downfilled
ORDER BY personid, yearmonth;
This replaces each expression like this:
last_value(var2) OVER (PARTITION BY Person_ID ORDER BY Date ASC
RANGE BETWEEN UNBOUNDED PRECEDING) as Var2
with an expression like this:
NVL(c.var2, p.var2)
One downside, though, is that this makes you repeat the list of 300 columns twice (once for the 300 NVL() expressions and once to specify the output columns of the recursive CTE (downfilled).
OPTION 2 -- UNPIVOT and PIVOT again
In this approach, you UNPIVOT your VARxx columns into rows, so that you only need to write the last_value()... expression one time.
SELECT personid,
yearmonth,
var_column,
last_value(var_value ignore nulls)
over ( partition by personid, var_column order by yearmonth ) var_value
FROM YOUR_TABLE
UNPIVOT INCLUDE NULLS ( var_value FOR var_column IN ("VAR1","VAR2","VAR3") ) )
SELECT * FROM unp
PIVOT ( max(var_value) FOR var_column IN ('VAR1' AS VAR1, 'VAR2' AS VAR, 'VAR3' AS VAR3 ) )
Here you still need to list each column twice. Also, I'm not sure what performance will be like if you have a large data set.

Redshift window function for change in column

I have a redshift table with amongst other things an id and plan_type column and would like a window function group clause where the plan_type changes so that if this is the data for example:
| user_id | plan_type | created |
|---------|-----------|------------|
| 1 | A | 2019-01-01 |
| 1 | A | 2019-01-02 |
| 1 | B | 2019-01-05 |
| 2 | A | 2019-01-01 |
| 2 | A | 2-10-01-05 |
I would like a result like this where I get the first date that the plan_type was "new":
| user_id | plan_type | created |
|---------|-----------|------------|
| 1 | A | 2019-01-01 |
| 1 | B | 2019-01-05 |
| 2 | A | 2019-01-01 |
Is this possible with window functions?
EDIT
Since I have some garbage in the data where plan_type can sometimes be null and the accepted solution does not include the first row (since I can't have the OR is not null I had to make some modifications. Hopefully his will help other people if they have similar issues. The final query is as follows:
SELECT * FROM
(
SELECT
user_id,
plan_type,
created_at,
lag(plan_type) OVER (PARTITION by user_id ORDER BY created_at) as prev_plan,
row_number() OVER (PARTITION by user_id ORDER BY created_at) as rownum
FROM tablename
WHERE plan_type IS NOT NULL
) userHistory
WHERE
userHistory.plan_type <> userHistory.prev_plan
OR userHistory.rownum = 1
ORDER BY created_at;
The plan_type IS NOT NULL filters out bad data at the source table and the outer where clause gets any changes OR the first row of data that would not be included otherwise.
ALSO BE CAREFUL about the created_at timestamp if you are working of your prev_plan field since it would of course give you the time of the new value!!!
This is a gaps-and-islands problem. I think lag() is the simplest approach:
select user_id, plan_type, created
from (select t.*,
lag(plan_type) over (partition by user_id order by created) as prev_plan_type
from t
) t
where prev_plan_type is null or prev_plan_type <> plan_type;
This assumes that plan types can move back to another value and you want each one.
If not, just use aggregation:
select user_id, plan_type, min(created)
from t
group by user_id, plan_type;
use row_number() window function
select * from
(select *,row_number()over(partition by user_id,plan_type order by created) rn
) a where a.rn=1
use lag()
select * from
(
select user_id, plant_type, lag(plan_type) over (partition by user_id order by created) as changes, created
from tablename
)A where plan_type<>changes and changes is not null

How to select rows and nearby rows with specific conditions

I have a table (Trans) of values like
OrderID (unique) | CustID | OrderDate| TimeSinceLast|
------------------------------------------------------
123a | A01 | 20.06.18 | 20 |
123y | B05 | 20.06.18 | 31 |
113k | A01 | 18.05.18 | NULL | <------- need this
168x | C01 | 17.04.18 | 8 |
999y | B05 | 15.04.18 | NULL | <------- need this
188k | A01 | 15.04.18 | 123 |
678a | B05 | 16.03.18 | 45 |
What I need is to select the rows where TimeSinceLast is null, as well as a row preceding and following where TimeSinceLast is not null, grouped by custID
I'd need my final table to look like:
OrderID (unique) | CustID | OrderDate| TimeSinceLast|
------------------------------------------------------
123a | A01 | 20.06.18 | 20 |
113k | A01 | 18.05.18 | NULL |
188k | A01 | 15.04.18 | 123 |
123y | B05 | 20.06.18 | 31 |
999y | B05 | 15.04.18 | NULL |
678a | B05 | 16.03.18 | 45 |
The main problem is that TimeSinceLast is not reliable and for whatsoever reason does not calculate well the days since last order, so I cannot use it in a query for preceding or following row.
I have tried to look for codes and found something like this on this forum
with dt as
(select distinct custID, OrderID,
max (case when timeSinceLast is null then OrderID end)
over(partition by custID order by OrderDate
rows between 1 preceding and 1 following) as NullID
from Trans)
select *
from dt
where request_id between NullID -1 and NullID+1
But does not work well for my purposes. Also it looks like max function cannot work with missing values.
Many thanks
Use lead() and lag().
What I need is to select the rows where TimeSinceLast is null, as well as a row preceding and following where TimeSinceLast is not null.
First, the ordering is a little unclear. Your sample data and code do not match. The following assumes some combination of the date and orderid, but there may be other columns that better capture what you mean by "preceding" and "following".
This is a little tricky, because you don't want to always include the first and last rows -- unless necessary. So, look at two columns:
select t.*
from (select t.*,
lead(TimeSinceLast) over (partition by custid order by orderdate, orderid) as next_tsl,
lag(TimeSinceLast) over (partition by custid order by orderdate, orderid) as prev_tsl,
lead(orderid) over (partition by custid order by orderdate, orderid) as next_orderid,
lag(orderid) over (partition by custid order by orderdate, orderid) as prev_orderid
from t
) t
where TimeSinceLast is not null or
(next_tsl is null and next_orderid is not null) or
(prev_tsl is null and prev_orderid is not null);
USE APPLY
DECLARE #TransTable TABLE (OrderID char(4), CustID char(3), OrderDate date, TimeSinceLast int)
INSERT #TransTable VALUES
('123a', 'A01', '06.20.2018', 20),
('123y', 'B05', '06.20.2018' ,31),
('113k', 'A01', '05.18.2018' ,NULL), ------- need this
('168x', 'C01', '04.17.2018' ,8),
('999y', 'B05', '04.15.2018' ,NULL), ------- need this
('188k', 'A01', '04.15.2018' ,123),
('678a', 'B05', '03.16.2018' ,45)
SELECT B.OrderID, B.CustID, B.OrderDate, B.TimeSinceLast
FROM #TransTable A
CROSS APPLY (
SELECT 0 AS rn, A.OrderID, A.CustID, A.OrderDate, A.TimeSinceLast
UNION ALL
SELECT TOP 2 ROW_NUMBER() OVER (PARTITION BY CASE WHEN T.OrderDate > A.OrderDate THEN 1 ELSE 0 END ORDER BY ABS(DATEDIFF(day, T.OrderDate, A.OrderDate))) rn,
T.OrderID, T.CustID, T.OrderDate, T.TimeSinceLast
FROM #TransTable T
WHERE T.CustID = A.CustID AND T.OrderID <> A.OrderID
ORDER BY rn
) B
WHERE A.TimeSinceLast IS NULL
ORDER BY B.CustID, B.OrderDate DESC

SQL Ranking by blocks

Im sure the answer to this is going to end up being really obvious, but i just cant get this bit of sql to work.
I have a table that has 3 columns in:
User | Date | AchievedTarget
----------------------------------------
1 | 2018-01-01 | 1
1 | 2018-02-01 | 0
1 | 2018-03-01 | 1
1 | 2018-04-01 | 1
1 | 2018-05-01 | 0
I want to add a ranking as follows based on the AchievedTarget column, is this possible with the data in the table above to create the ranking in the table below:
User | Date | AchievedTarget | Rank
----------------------------------------
1 | 2018-01-01 | 1 | 1
1 | 2018-02-01 | 0 | 1
1 | 2018-03-01 | 1 | 1
1 | 2018-04-01 | 1 | 2
1 | 2018-05-01 | 0 | 1
This is a guess, based on that this is actually a gaps and island question. if so, this does result in the second dataset the OP has provided:
CREATE TABLE dbo.TestTable ([User] tinyint, --Avoid using keywords for column names
[date] date, --Avoid using datatypes for column names
AchievedTarget bit);
GO
INSERT INTO dbo.TestTable ([User],[date],AchievedTarget)
VALUES (1,'20180101',1),
(1,'20180201',0),
(1,'20180301',1),
(1,'20180401',1),
(1,'20180501',0);
GO
WITH Grps AS(
SELECT [User],[date],AchievedTarget,
ROW_NUMBER() OVER (ORDER BY [date]) -
ROW_NUMBER() OVER (PARTITION BY AchievedTarget ORDER BY [date]) AS Grp
FROM dbo.TestTable)
SELECT [User],[date],AchievedTarget,
ROW_NUMBER() OVER (PARTITION BY AchievedTarget, Grp ORDER BY [date]) AS [Rank] --Avoid using keywords for column names
FROM Grps
ORDER BY [date]
GO
DROP TABLE dbo.TestTable;
Other method:
with tmp as (
select row_number() over(order by date) ID, *
from dbo.TestTable
)
select f1.*, NbBefore + 1
from tmp f1
outer apply
(
select top 1 f2.ID IDLimit from tmp f2 where f2.ID<f1.ID and f2.AchievedTarget<>f1.AchievedTarget
order by f2.ID desc
) f3
outer apply
(
select count(*) NbBefore from tmp f4 where f4.ID<f1.ID and f4.ID> f3.IDLimit
) f5

Grouping SQL Results based on order

I have table with data something like this:
ID | RowNumber | Data
------------------------------
1 | 1 | Data
2 | 2 | Data
3 | 3 | Data
4 | 1 | Data
5 | 2 | Data
6 | 1 | Data
7 | 2 | Data
8 | 3 | Data
9 | 4 | Data
I want to group each set of RowNumbers So that my result is something like this:
ID | RowNumber | Group | Data
--------------------------------------
1 | 1 | a | Data
2 | 2 | a | Data
3 | 3 | a | Data
4 | 1 | b | Data
5 | 2 | b | Data
6 | 1 | c | Data
7 | 2 | c | Data
8 | 3 | c | Data
9 | 4 | c | Data
The only way I know where each group starts and stops is when the RowNumber starts over. How can I accomplish this? It also needs to be fairly efficient since the table I need to do this on has 52 Million Rows.
Additional Info
ID is truly sequential, but RowNumber may not be. I think RowNumber will always begin with 1 but for example the RowNumbers for group1 could be "1,1,2,2,3,4" and for group2 they could be "1,2,4,6", etc.
For the clarified requirements in the comments
The rownumbers for group1 could be "1,1,2,2,3,4" and for group2 they
could be "1,2,4,6" ... a higher number followed by a lower would be a
new group.
A SQL Server 2012 solution could be as follows.
Use LAG to access the previous row and set a flag to 1 if that row is the start of a new group or 0 otherwise.
Calculate a running sum of these flags to use as the grouping value.
Code
WITH T1 AS
(
SELECT *,
LAG(RowNumber) OVER (ORDER BY ID) AS PrevRowNumber
FROM YourTable
), T2 AS
(
SELECT *,
IIF(PrevRowNumber IS NULL OR PrevRowNumber > RowNumber, 1, 0) AS NewGroup
FROM T1
)
SELECT ID,
RowNumber,
Data,
SUM(NewGroup) OVER (ORDER BY ID
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Grp
FROM T2
SQL Fiddle
Assuming ID is the clustered index the plan for this has one scan against YourTable and avoids any sort operations.
If the ids are truly sequential, you can do:
select t.*,
(id - rowNumber) as grp
from t
Also you can use recursive CTE
;WITH cte AS
(
SELECT ID, RowNumber, Data, 1 AS [Group]
FROM dbo.test1
WHERE ID = 1
UNION ALL
SELECT t.ID, t.RowNumber, t.Data,
CASE WHEN t.RowNumber != 1 THEN c.[Group] ELSE c.[Group] + 1 END
FROM dbo.test1 t JOIN cte c ON t.ID = c.ID + 1
)
SELECT *
FROM cte
Demo on SQLFiddle
How about:
select ID, RowNumber, Data, dense_rank() over (order by grp) as Grp
from (
select *, (select min(ID) from [Your Table] where ID > t.ID and RowNumber = 1) as grp
from [Your Table] t
) t
order by ID
This should work on SQL 2005. You could also use rank() instead if you don't care about consecutive numbers.