SQL query to populate missing values using lead and lag - apache-spark-sql

I'm trying to create a new column that fills in the nulls below. I tried using lead and lag but it isn't turning out right. Basically I'm trying to figure out who is in "possession" of the record, given the TransferFrom and TransferTo columns and the sequence of events. For instance, 35 was in possession of the record until they transferred it to 57. My current query will only populate 35 in record 2 since it only leads by one record. I need the query to populate all records prior to this as well if 35 is the first value found in the TransferFrom column. Spark SQL... Any ideas?
Create table script:
CREATE TABLE results (
OrderID int
,TransferFrom string
,TransferTo string
,ActionTime timestamp)
INSERT INTO results
VALUES
(1,null,null,'2020-01-01 00:00:00'),
(1,null,null,'2020-01-02 00:00:00'),
(1,null,null,'2020-01-03 00:00:00'),
(1,'35','57','2020-01-04 00:00:00'),
(1,null,null,'2020-01-05 00:00:00'),
(1,null,null,'2020-01-06 00:00:00'),
(1,'57','45','2020-01-07 00:00:00'),
(1,null,null,'2020-01-08 00:00:00'),
(1,null,null,'2020-01-09 00:00:00'),
(1,null,null,'2020-01-10 00:00:00')
Current query that doesn't work:
SELECT *
,coalesce(
lead(TransferFrom) over (partition by OrderID order by ActionTime)
,TransferFrom
,lag(TransferTo) over (partition by OrderID order by ActionTime)) as NewColumn
FROM results
Current query result that is incorrect:
Desired query result:

Kind of a funky situation but this works for your sample data. Ideally it would be better to fix the process that is not inserting values consistently so you don't have to jump through these hoops.
select r.*
, NewColumn = coalesce(x.TransferFrom, y.TransferTo)
from results r
outer apply
(
select top 1 TransferFrom
from results r2
where r2.ActionTime >= r.ActionTime
and r2.TransferFrom is not null
order by r2.ActionTime
) x --this will get the values for all the rows that have a preceding NULL
outer apply
(
select top 1 TransferTo
from results r3
where r3.ActionTime <= r.ActionTime
and r3.TransferTo is not null
order by r3.ActionTime desc
) y --this will get the values for the last rows that don't have a value.
order by r.ActionTime
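In Spark SQL itself (where OUTER APPLY and TOP aren't available), the same forward/backward scan can be sketched with the first() and last() window functions and their ignore-nulls flag. Untested against the sample data, but the window frames are meant to mirror the two APPLY subqueries above:
SELECT r.*
    ,coalesce(
        -- first non-null TransferFrom at or after this row: whoever transfers it away next still holds it
        first(TransferFrom, true) over (partition by OrderID order by ActionTime
                                        rows between current row and unbounded following)
        -- otherwise the last non-null TransferTo at or before this row: the final holder
        ,last(TransferTo, true) over (partition by OrderID order by ActionTime
                                      rows between unbounded preceding and current row)
    ) as NewColumn
FROM results r
-- expected NewColumn per the description: 35 for 01-01..01-04, 57 for 01-05..01-07, 45 for 01-08..01-10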

This is likely sub-optimal for spark SQL, but it got me the right answer.
with cte1 as (
select *
from results
)
, cte2 as (
select *
from cte1
where TransferFrom is not null
or TransferTo is not null
order by ActionTime
)
, cte3 as (
select distinct OrderID, TransferFrom as Team from cte2
union
select distinct OrderID, TransferTo as Team from cte2
)
, cte4 as (
select a.*
,ifnull(timestampadd(MICROSECOND, 1, c.ActionTime),'2000-01-01T00:00:00.000+0000') as StartTime
,ifnull(b.ActionTime,'9999-12-31T00:00:00.000+0000') as EndTime
from cte3 as a
left join cte2 as b
on a.OrderID = b.OrderID
and a.Team = b.TransferFrom
left join cte2 as c
on a.OrderID = c.OrderID
and a.Team = c.TransferTo
order by OrderID, StartTime
)
, cte5 as (
select a.*, b.Team
from cte1 as a
join cte4 as b
on a.OrderID = b.OrderID
and a.ActionTime between b.StartTime and b.EndTime
)
select *
from cte5
order by 1, 4

Related

How can I transform my N little queries into one query?

I have a query that gives me the first available value for a given date and pair.
SELECT
TOP 1 value
FROM
my_table
WHERE
date >= 'myinputdate'
AND key = 'myinputkey'
ORDER BY date
I have N pairs of keys and dates, and I'm trying to find out how to avoid querying each pair one by one. The table is rather big, and so is N, so it's currently heavy and slow.
How can I query all the pairs in one query ?
A solution is to use APPLY like a "function" created on the fly with one or many columns from another set:
DECLARE @inputs TABLE (
myinputdate DATE,
myinputkey INT)
INSERT INTO @inputs(
myinputdate,
myinputkey)
VALUES
('2019-06-05', 1),
('2019-06-01', 2)
SELECT
I.myinputdate,
I.myinputkey,
R.value
FROM
@inputs AS I
CROSS APPLY (
SELECT TOP 1
T.value
FROM
my_table AS T
WHERE
T.date >= I.myinputdate AND
T.key = I.myinputkey
ORDER BY
T.date ) AS R
You can use OUTER APPLY if you want NULL result values to be shown as well. This supports fetching multiple columns and using ORDER BY with TOP to control the number of rows.
This solution is without variables. You control your N by setting the right value to the row_num predicate.
There are plenty of ways to do what you want, and it all depends on your specific needs. As already answered, you can use a temp/variable table to store these conditions and then join on the same conditions you would use as predicates. You can also create a user-defined table type and pass it as a parameter to a function/procedure (a sketch follows after the code below). You might also use a CROSS APPLY + VALUES clause to get that list and then join it.
DROP TABLE IF EXISTS #temp;
CREATE TABLE #temp ( d DATE, k VARCHAR(100) );
GO
INSERT INTO #temp
VALUES ( '20180101', 'a' ),
( '20180102', 'b' ),
( '20180103', 'c' ),
( '20180104', 'd' ),
( '20190101', 'a' ),
( '20190102', 'b' ),
( '20180402', 'c' ),
( '20190103', 'c' ),
( '20190104', 'd' );
SELECT a.d ,
a.k
FROM ( SELECT d ,
k ,
ROW_NUMBER() OVER ( PARTITION BY k ORDER BY d DESC ) row_num
FROM #temp
WHERE (d >= '20180401'
AND k = 'a')
OR (d > '20180401'
AND k = 'b')
OR (d > '20180401'
AND k = 'c')
) a
WHERE a.row_num <= 1;
-- VALUES way
SELECT a.d ,
a.k
FROM ( SELECT t.d ,
t.k ,
ROW_NUMBER() OVER ( PARTITION BY t.k ORDER BY t.d DESC ) row_num
FROM #temp t
CROSS APPLY (VALUES('20180401','a'), ('20180401', 'b'), ('20180401', 'c')) f(d,k)
WHERE t.d >= f.d AND f.k = t.k
) a
WHERE a.row_num <= 1;
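The user-defined table type route mentioned above could look roughly like this; the type and procedure names here are made up for illustration, and the [date]/[key] columns are assumed from the question:
CREATE TYPE dbo.DateKeyList AS TABLE (myinputdate DATE, myinputkey INT);
GO
CREATE PROCEDURE dbo.GetFirstValues
    @pairs dbo.DateKeyList READONLY
AS
SELECT I.myinputdate, I.myinputkey, R.value
FROM @pairs AS I
CROSS APPLY ( SELECT TOP 1 T.value
              FROM my_table AS T
              WHERE T.[date] >= I.myinputdate AND T.[key] = I.myinputkey
              ORDER BY T.[date] ) AS R;
GO
-- usage: fill the table-valued parameter and pass it in
DECLARE @pairs dbo.DateKeyList;
INSERT INTO @pairs VALUES ('2019-06-05', 1), ('2019-06-01', 2);
EXEC dbo.GetFirstValues @pairs = @pairs;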
If all the keys are using the same date, then use window functions:
SELECT key, value
FROM (SELECT t.*, ROW_NUMBER() OVER (PARTITION BY key ORDER BY date) as seqnum
FROM my_table t
WHERE date >= @input_date AND
key IN ( . . . )
) t
WHERE seqnum = 1;
SELECT key, date,value
FROM (SELECT ROW_NUMBER() OVER (PARTITION BY key,date ORDER BY date) as rownum,key,date,value
FROM my_table
WHERE
date >= 'myinputdate'
) as d
WHERE d.rownum = 1;

2 rows differences

I would like to get 2 consecutive rows from an SQL table.
One of the columns stores a UNIX timestamp, and between the 2 rows the only difference is in this value.
For example:
id_int dt_int
1. row 8211721 509794233
2. row 8211722 509794233
I need only those rows where dt_int is the same (edited).
Do you want both lines to be shown?
A solution could be this:
with foo as
(
select
*
from (values (8211721),(8211722),(8211728),(8211740),(8211741)) a(id_int)
)
select
id_int
from
(
select
id_int
,id_int-isnull(lag(id_int,1) over (order by id_int) ,id_int-6) prev
,isnull(lead(id_int,1) over (order by id_int) ,id_int+6)-id_int nxt
from foo
) a
where prev<=5 or nxt<=5
We use lead and lag to find the differences between rows, and keep the rows where the difference to the row before or after is less than or equal to 5.
If you are on 2008R2, then lag and lead are not available. You could use row_number() instead:
with foo as
(
select
*
from (values (8211721),(8211722),(8211728),(8211740),(8211741)) a(id_int)
)
, rownums as
(
select
id_int
,row_number() over (order by id_int) rn
from foo
)
select
id_int
from
(
select
cur.id_int
,cur.id_int-prev.id_int prev
,nxt.id_int-cur.id_int nxt
from rownums cur
left join rownums prev
on cur.rn-1=prev.rn
left join rownums nxt
on cur.rn+1=nxt.rn
) a
where isnull(prev,6)<=5 or isnull(nxt,6)<=5
Assuming:
the lead() analytical function is available;
ID_INT is what we need to sort by to determine table order...
you may need to partition by some value, lead(ID_int) over(partition by SomeKeysuchasOrderNumber order by ID_int asc), so that orders and dates don't get mixed together.
WITH CTE AS (
SELECT A.*
, lead(ID_int) over ([missing partition info] ORDER BY id_Int asc) - id_int as ID_INT_DIFF
FROM Table A)
SELECT *
FROM CTE
WHERE ID_INT_DIFF < 5;
You can try this. This version works on SQL Server 2000 and above; today I don't have a more recent SQL Server to test on.
declare @t table (id_int int, dt_int int)
INSERT @t SELECT 8211721 , 509794233
INSERT @t SELECT 8211722 , 509794233
INSERT @t SELECT 8211723 , 509794235
INSERT @t SELECT 8211724 , 509794236
INSERT @t SELECT 8211729 , 509794237
INSERT @t SELECT 8211731 , 509794238
;with cte_t as
(SELECT
ROW_NUMBER() OVER (ORDER BY id_int) id
,id_int
,dt_int
FROM @t),
cte_diff as
( SELECT
id_int
,dt_int
,(SELECT TOP 1 dt_int FROM cte_t b WHERE a.id < b.id ORDER BY b.id) dt_int1
,dt_int - (SELECT TOP 1 dt_int FROM cte_t b WHERE a.id < b.id ORDER BY b.id) Difference
FROM cte_t a
)
SELECT DISTINCT id_int , dt_int FROM @t a
WHERE
EXISTS(SELECT 1 FROM cte_diff b where b.Difference =0 and a.dt_int = b.dt_int)

Find most recent record by date

This is my original data (anonymised):
id usage verified date
1 4000 Y 2015-03-20
2 5000 N 2015-06-20
3 6000 N 2015-07-20
4 7000 Y 2016-09-20
Original query:
SELECT
me.usage,
mes.verified,
mes.date
FROM
Table1 me,
Table2 mes,
Table3 m,
Table4 mp
WHERE
me.theFk=mes.id
AND mes.theFk=m.id
AND m.theFk=mp.id
How would I go about selecting the most recent verified and non-verified?
So I would be left with:
id usage verified date
1 6000 N 2015-07-20
2 7000 Y 2016-09-20
I am using Microsoft SQL Server 2012.
First, do not use implicit joins; explicit JOIN syntax replaced that style decades ago.
Second, embrace the power of the CTE, the IN clause and ROW_NUMBER():
with CTE as
(
select
me.usage,
mes.verified,
mes.date,
row_number() over (partition by Verified order by Date desc) as CTEOrd
from Table1 me
inner join Table2 mes
on me.theFK = mes.id
where mes.theFK in
(
select m.id
from Table3 m
inner join Table4 mp
on mp.id = m.theFK
)
)
select CTE.*
from CTE
where CTEOrd = 1
You can select the TOP 1 ordered by date for verified=N, union'd with the TOP 1 ordered by date for verified=Y.
Or in pseudo SQL:
SELECT TOP 1 ...fields ...
FROM ...tables/joins...
WHERE Verified = 'N'
ORDER BY Date DESC
UNION
SELECT TOP 1 ...fields ...
FROM ...tables/joins...
WHERE Verified = 'Y'
ORDER BY Date DESC
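Note that SQL Server won't accept ORDER BY directly inside the branches of a UNION, so the pseudo SQL above needs each TOP 1 wrapped in a derived table to run. A rough sketch, assuming the Table1..Table4 joins have been collapsed into a hypothetical JoinedHistory view exposing usage, verified and date:
SELECT n.*
FROM (SELECT TOP 1 [usage], verified, [date]
      FROM JoinedHistory   -- hypothetical view wrapping the real tables/joins
      WHERE verified = 'N'
      ORDER BY [date] DESC) AS n
UNION ALL
SELECT y.*
FROM (SELECT TOP 1 [usage], verified, [date]
      FROM JoinedHistory
      WHERE verified = 'Y'
      ORDER BY [date] DESC) AS y;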
drop table #stack2
CREATE TABLE #stack2
([id] int, [usage] int, [verified] varchar(1), [date] datetime)
;
INSERT INTO #stack2
([id], [usage], [verified], [date])
VALUES
(1, 4000, 'Y', '2015-03-20 00:00:00'),
(2, 5000, 'N', '2015-06-20 00:00:00'),
(3, 6000, 'N', '2015-07-20 00:00:00'),
(4, 7000, 'Y', '2016-09-20 00:00:00')
;
;with cte as (select verified,max(date) d from #stack2 group by verified)
select row_number() over( order by s2.[verified]),s2.[usage], s2.[verified], s2.[date] from #stack2 s2 join cte c on c.verified=s2.verified and c.d=s2.date
I wrote the query as per the data shown; for your scenario this will be useful:
WITH cte1
AS (SELECT me.usage,
mes.verified,
mes.date
FROM Table1 me,
Table2 mes,
Table3 m,
Table4 mp
WHERE me.theFk = mes.id
AND mes.theFk = m.id
AND m.theFk = mp.id),
cte
AS (SELECT verified,
Max(date) d
FROM cte1
GROUP BY verified)
SELECT Row_number()
OVER(
ORDER BY s2.[verified]),
s2.[usage],
s2.[verified],
s2.[date]
FROM cte1 s2
JOIN cte c
ON c.verified = s2.verified
AND c.d = s2.date
You can do it as below, without a join.
-- Mock data
DECLARE @Tbl TABLE (id INT, usage INT, verified CHAR(1), date DATETIME)
INSERT INTO @Tbl
VALUES
(1, 4000 ,'Y', '2015-03-20'),
(2, 5000 ,'N', '2015-06-20'),
(3, 6000 ,'N', '2015-07-20'),
(4, 7000 ,'Y', '2016-09-20')
SELECT
A.id ,
A.usage ,
A.verified ,
A.MaxDate
FROM
(
SELECT
id ,
usage ,
verified ,
date,
MAX(date) OVER (PARTITION BY verified) MaxDate
FROM
@Tbl
) A
WHERE
A.date = A.MaxDate
Result:
id usage verified MaxDate
----------- ----------- -------- ----------
3 6000 N 2015-07-20
4 7000 Y 2016-09-20
CREATE TABLE #Table ( ID INT ,usage INT, verified VARCHAR(10), _date DATE)
INSERT INTO #Table ( ID , usage , verified , _date)
SELECT 1,4000 , 'Y','2015-03-20' UNION ALL
SELECT 2, 5000 , 'N' ,'2015-06-20' UNION ALL
SELECT 3, 6000 , 'N' ,'2015-07-20' UNION ALL
SELECT 4, 7000 , 'Y' ,'2016-09-20'
SELECT ROW_NUMBER() OVER(ORDER BY usage) ID,usage , A.verified , A._date
FROM #Table
JOIN
(
SELECT verified , MAX(_date) _date
FROM #Table
GROUP BY verified
) A ON #Table._date = A._date AND #Table.verified = A.verified

Find consecutive free numbers in table

I have a table, containing numbers (phone numbers) and a code (free or not available).
Now, I need to find series of 30 consecutive numbers, like 079xxx100 - 079xxx130, all of them with free status.
Here is an example of what my table looks like:
CREATE TABLE numere
(
value int,
code varchar(10)
);
INSERT INTO numere (value,code)
Values
(123100, 'free'),
(123101, 'free'),
...
(123107, 'booked'),
(123108, 'free'),
(...
(123130, 'free'),
(123131, 'free'),
...
(123200, 'free'),
(123201, 'free'),
...
(123230, 'free'),
(123231, 'free'),
...
I need a SQL query to get me, in this example, the 123200-123230 range (and all subsequent available ranges).
Now, I found an example doing more or less what I need:
select value, code
from numere
where value >= (select a.value
from numere a
left join numere b on a.value < b.value
and b.value < a.value + 30
and b.code = 'free'
where a.code = 'free'
group by a.value
having count(b.value) + 1 = 30)
limit 30
but this returns only the first 30 available numbers, and not within my range (0-30) (and it takes 13 minutes to execute, hehe..).
If anyone has an idea, please let me know (I am using SQL Server).
This seems like it works in my dataset. Modify the select and see if it works with your table name.
DECLARE @numere TABLE
(
value int,
code varchar(10)
);
INSERT INTO @numere (value,code) SELECT 123100, 'free'
WHILE (SELECT COUNT(*) FROM @numere)<=30
BEGIN
INSERT INTO @numere (value,code) SELECT MAX(value)+1, 'free' FROM @numere
END
UPDATE @numere
SET code='booked'
WHERE value=123105
select *
from @numere n1
inner join @numere n2 ON n1.value=n2.value-30
AND n1.code='free'
AND n2.code='free'
LEFT JOIN @numere n3 ON n3.value>=n1.value
AND n3.value<=n2.value
AND n3.code<>'free'
WHERE n3.value IS NULL
This is the usual gaps-and-islands problem.
; with cte as
(
select *, grp = row_number() over (order by value)
- row_number() over (partition by code order by value)
from numere
),
grp as
(
select grp
from cte
group by grp
having count(*) >= 30
)
select c.grp, c.value, c.code
from grp g
inner join cte c on g.grp = c.grp
You can query the table data for gaps between booked numbers using the following SQL query, where the LEAD() analytical function is used:
;with cte as (
select
value, lead(value) over (order by value) nextValue
from numere
where code = 'booked'
), cte2 as (
select
value gapstart, nextValue gapend,
(nextValue - value - 1) [number count in gap] from cte
where value < nextValue - 1
)
select *
from cte2
where [number count in gap] >= 30
You can check the SQL tutorial "Find Missing Numbers and Gaps in a Sequence using SQL".
I hope it helps.
Can't test it at the moment, but this might work:
SELECT a.Value
FROM (SELECT Value
FROM numere
WHERE Code='free'
) a INNER Join
(SELECT Value
FROM numere
WHERE code='free'
) b ON b.Value BETWEEN a.Value+1 AND a.Value+29
GROUP BY a.Value
HAVING COUNT(b.Value) >= 29
ORDER BY a.Value ASC
The output should be all numbers that have 29 free numbers following (so it's 30 consecutive numbers)

Using SQL to get the previous rows data

I have a requirement where I need to get data from the previous row to use in a calculation that gives a status to the current row. It's a history table. The previous row will let me know if data has changed in a date field.
I've looked up using cursors and it seems a little complicated. Is this the best way to go?
I've also tried to assign a value to a new field...
newField = (Select field1 from Table1 where "previous row"), and "previous row" is where I seem to get stuck. I can't figure out how to select the row beneath the current row.
I'm using SQL Server 2005
Thanks in advance.
-- Test data
declare @T table (ProjectNumber int, DateChanged datetime, Value int)
insert into @T
select 1, '2001-01-01', 1 union all
select 1, '2001-01-02', 1 union all
select 1, '2001-01-03', 3 union all
select 1, '2001-01-04', 3 union all
select 1, '2001-01-05', 4 union all
select 2, '2001-01-01', 1 union all
select 2, '2001-01-02', 2
-- Get CurrentValue and PreviousValue with a Changed column
;with cte as
(
select *,
row_number() over(partition by ProjectNumber order by DateChanged) as rn
from @T
)
select
C.ProjectNumber,
C.Value as CurrentValue,
P.Value as PreviousValue,
case C.Value when P.Value then 0 else 1 end as Changed
from cte as C
inner join cte as P
on C.ProjectNumber = P.ProjectNumber and
C.rn = P.rn + 1
-- Count the number of changes per project
;with cte as
(
select *,
row_number() over(partition by ProjectNumber order by DateChanged) as rn
from @T
)
select
C.ProjectNumber,
sum(case C.Value when P.Value then 0 else 1 end) as ChangeCount
from cte as C
inner join cte as P
on C.ProjectNumber = P.ProjectNumber and
C.rn = P.rn + 1
group by C.ProjectNumber
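For reference, on SQL Server 2012 and later (not the asker's 2005) the same previous-row comparison can be written with LAG() directly. A sketch against the same @T test data:
select ProjectNumber,
       DateChanged,
       Value as CurrentValue,
       lag(Value) over (partition by ProjectNumber order by DateChanged) as PreviousValue,
       case Value
            when lag(Value) over (partition by ProjectNumber order by DateChanged) then 0
            else 1
       end as Changed   -- the first row per project has no previous value, so it shows as changed here
from @T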
This really depends on what tells you a row is the "previous row". However, a self join should do what you want:
select *
from Table1 this
join Table1 prev on this.incrementalID = prev.incrementalID+1
If you have the following table
CREATE TABLE MyTable (
Id INT NOT NULL,
ChangeDate DATETIME NOT NULL,
.
.
.
)
The following query will return the previous record for any record from MyTable.
SELECT tbl.Id,
tbl.ChangeDate,
hist.Id,
hist.ChangeDate
FROM MyTable tbl
INNER JOIN MyTable hist
ON hist.Id = tbl.Id
AND hist.ChangeDate = (SELECT MAX(ChangeDate)
FROM MyTable sub
WHERE sub.Id = tbl.Id AND sub.ChangeDate < tbl.ChangeDate)