find start and end dates over a non-contiguous range - sql

I need to find the start and end dates of each range, defined as follows: the start date is the first date in the range, and the end date is the last date before a gap of two months or more to the next date. There can be multiple ranges.
I have a table structure like:
ID int identity(1,1),
fk_ID char(9),
dateField datetime
The data looks like:
1 a 2012-01-01
2 a 2012-01-05
3 a 2012-01-12
4 b 2012-02-01
5 a 2012-04-01
6 b 2012-05-01
7 a 2012-05-30
The expected output would look like:
fk_id startdate enddate
a 2012-01-01 2012-01-12
a 2012-04-01 2012-05-30
b 2012-02-01 2012-02-01
b 2012-05-01 null
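The rule above can also be sketched procedurally. This is a minimal sketch in Python rather than T-SQL, so the logic can be run without a database. The names `add_months` and `find_ranges` are hypothetical, and two details are assumptions: the month arithmetic clamps the day to the end of the target month, and the last range of each key ends on its last date rather than NULL (the expected output above shows both behaviours).

```python
import calendar
from datetime import date
from itertools import groupby


def add_months(d, months):
    """Add calendar months to d, clamping the day to the end of the target month.

    Assumption: this is meant to mimic DATEADD(MONTH, n, d).
    """
    month_index = d.year * 12 + (d.month - 1) + months
    year, month = divmod(month_index, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def find_ranges(rows, gap_months=2):
    """rows: iterable of (key, date). Returns a list of (key, start, end)."""
    ranges = []
    for key, group in groupby(sorted(rows), key=lambda r: r[0]):
        dates = [d for _, d in group]
        start = prev = dates[0]
        for current in dates[1:]:
            # A gap of gap_months or more closes the current range.
            if current >= add_months(prev, gap_months):
                ranges.append((key, start, prev))
                start = current
            prev = current
        # Assumption: the final range ends on the last date seen.
        ranges.append((key, start, prev))
    return ranges


rows = [
    ("a", date(2012, 1, 1)), ("a", date(2012, 1, 5)), ("a", date(2012, 1, 12)),
    ("b", date(2012, 2, 1)), ("a", date(2012, 4, 1)), ("b", date(2012, 5, 1)),
    ("a", date(2012, 5, 30)),
]

for key, start, end in find_ranges(rows):
    print(key, start, end)
```

The scan only ever compares a date with the one immediately before it, which is the comparison a LAG()-based query would make per key.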
EDIT:
By doing the following:
CREATE TABLE #temp
(
autonum int identity(1,1),
id char(9),
sd datetime
)
insert into #temp (id, sd) values ('a', '2012-01-01')
insert into #temp (id, sd) values ('a', '2012-01-05')
insert into #temp (id, sd) values ('a', '2012-01-12')
insert into #temp (id, sd) values ('a', '2012-03-01')
insert into #temp (id, sd) values ('a', '2012-04-03')
insert into #temp (id, sd) values ('a', '2012-06-06')
insert into #temp (id, sd) values ('b', '2012-02-12')
insert into #temp (id, sd) values ('b', '2012-02-15')
insert into #temp (id, sd) values ('b', '2012-03-01')
insert into #temp (id, sd) values ('b', '2012-04-03')
insert into #temp (id, sd) values ('b', '2012-06-01')
select t1.id, null as previousend, min(t1.sd) as nextstart
from #temp t1
group by t1.id
union
select t1.id, t1.sd as enddate, (select min(t2.sd) from #temp t2 where t1.id=t2.id and t2.sd>t1.sd) as nextstart
from #temp t1
where (select min(t2.sd) from #temp t2 where t1.id=t2.id and t2.sd>t1.sd) >= dateadd(month, 2, t1.sd)
union
select t1.id, max(t1.sd), null
from #temp t1
group by t1.id
drop table #temp
I can get output like this:
id previousend nextstart
--------- ----------------------- -----------------------
a NULL 2012-01-01 00:00:00.000
a 2012-04-03 00:00:00.000 2012-06-06 00:00:00.000
a 2012-06-06 00:00:00.000 NULL
b NULL 2012-02-12 00:00:00.000
b 2012-06-01 00:00:00.000 NULL
This is very close, but ideally the start and end date of each range would be on the same row.

Here is my best guess given all the changes to the question. I still find the problem confusing and splintered, and the desired results for the two cases don't seem to match. With this query:
;WITH x AS
(
SELECT a.id, sd = a.sd, ed = b.sd, rn1 = ROW_NUMBER() OVER
(PARTITION BY a.id, a.sd ORDER BY a.sd)
FROM #temp AS a
LEFT OUTER JOIN #temp AS b
ON a.id = b.id
AND b.sd >= a.sd
AND b.sd <= DATEADD(MONTH, 2, a.sd)
),
y AS
(SELECT id, sd,
ed = (SELECT MAX(ed) FROM x AS x2
WHERE x.id = x2.id AND x2.sd <= DATEADD(MONTH, 2, x.sd)
)
FROM x
WHERE rn1 = 1
),
z AS
(
SELECT id, sd = MIN(sd), ed
FROM y GROUP BY id, ed
)
SELECT id, sd, ed /* = CASE
WHEN ed > sd OR (sd = ed AND NOT EXISTS
(SELECT 1 FROM z AS z2 WHERE z2.id = z.id AND z.sd > z2.sd)) THEN ed END
*/
FROM z
ORDER BY id, sd;
The results for your first set of data:
INSERT #temp (id, sd) VALUES
('a','2012-01-01'),
('a','2012-01-05'),
('a','2012-01-12'),
('b','2012-02-01'),
('a','2012-04-01'),
('b','2012-05-01'),
('a','2012-05-30');
are as follows:
id sd ed
a 2012-01-01 2012-01-12
a 2012-04-01 2012-05-30
b 2012-02-01 2012-02-01
b 2012-05-01 2012-05-01
And for the second set:
insert into #temp (id, sd) values ('a', '2012-01-01')
insert into #temp (id, sd) values ('a', '2012-01-05')
insert into #temp (id, sd) values ('a', '2012-01-12')
insert into #temp (id, sd) values ('a', '2012-03-01')
insert into #temp (id, sd) values ('a', '2012-04-03')
insert into #temp (id, sd) values ('a', '2012-06-06')
insert into #temp (id, sd) values ('b', '2012-02-12')
insert into #temp (id, sd) values ('b', '2012-02-15')
insert into #temp (id, sd) values ('b', '2012-03-01')
insert into #temp (id, sd) values ('b', '2012-04-03')
insert into #temp (id, sd) values ('b', '2012-06-01')
the results are as follows:
id sd ed
a 2012-01-01 2012-04-03
a 2012-06-06 2012-06-06
b 2012-02-12 2012-06-01
If you uncomment the CASE block you'll get NULLs for the end date where the start date and end date are the same. As I suggested multiple times, your question is splintered and your desired results don't seem to match, so I'm not sure what the right answer is.

Attempt number two, which is on Fiddle. It is far from elegant but seems to work, apart from the final record not being NULL for the end date:
CREATE TABLE temp
(
id char(9),
d datetime
);
insert into temp (id, d) values ('a', '2012-01-01');
insert into temp (id, d) values ('a', '2012-01-05');
insert into temp (id, d) values ('a', '2012-01-12');
insert into temp (id, d) values ('a', '2012-04-01');
insert into temp (id, d) values ('a', '2012-05-30');
insert into temp (id, d) values ('b', '2012-02-01');
insert into temp (id, d) values ('b', '2012-05-01');
SELECT
x.id ,
min(x.sd) sd ,
x.ed
FROM
(SELECT
a.id ,
a.sd ,
max(a.ed) ed
FROM
(
SELECT
j.id ,
j.d sd ,
q.D ed
FROM temp j
JOIN temp q
ON
j.id = q.id
AND j.d <= q.d
GROUP BY j.id ,
j.d ,
q.d
) a
WHERE datediff(m,a.sd,a.ed)<=2
GROUP BY a.id ,
a.sd
)x
GROUP BY x.id ,
x.ed
ORDER BY x.id ,
min(x.sd) ,
x.ed

SQL query for date intervals comparing non-adjacent rows?

I want to flag the first date in every window of at least 31 days for each ID unit in my data.
ROW ID INDEX_DATE
1 ABC 1/1/2019
2 ABC 1/7/2019
3 ABC 1/21/2019
4 ABC 2/2/2019
5 ABC 2/9/2019
6 ABC 3/6/2019
7 DEF 1/5/2019
8 DEF 2/1/2019
9 DEF 2/8/2019
The desired rows are 1, 4, 6, 7 and 9; these are either the first INDEX_DATE for the given ID, or they occur at least 31 days after the previously flagged INDEX_DATE. Every suggestion I have found uses LAG() or LEAD with window functions, but I could only get these to compare adjacent rows. Row 4, for example, needs to be compared to Row 1 in order to be identified as the first after a 31-day window has completed.
I tried the following:
Data
DROP TABLE tTest IF EXISTS;
CREATE TEMP TABLE tTest
(
ROWN INT,
ID VARCHAR(3),
INDEX_DATE DATE
) ;
GO
INSERT INTO tTEST VALUES (1, 'ABC', '1/1/2019');
INSERT INTO tTEST VALUES (2, 'ABC', '1/7/2019');
INSERT INTO tTEST VALUES (3, 'ABC', '1/21/2019');
INSERT INTO tTEST VALUES (4, 'ABC', '2/2/2019');
INSERT INTO tTEST VALUES (5, 'ABC', '2/9/2019');
INSERT INTO tTEST VALUES (6, 'ABC', '3/6/2019');
INSERT INTO tTEST VALUES (7, 'DEF', '1/5/2019');
INSERT INTO tTEST VALUES (8, 'DEF', '2/1/2019');
INSERT INTO tTEST VALUES (9, 'DEF', '2/8/2019');
GO
Query:
DROP TABLE TTEST2 IF EXISTS;
CREATE TEMP TABLE TTEST2 AS (
WITH
RN_CTE(ROWN, ID, INDEX_DATE, RN) AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY INDEX_DATE)
FROM tTEST),
MIN_CTE(ROWN, ID, INDEX_DATE, RN) AS (SELECT * FROM RN_CTE WHERE RN=1),
DIFF_CTE(ROWN,ID, INDEX_DATE, RN, DAY_DIFF) AS (
SELECT RN.*, DATE(RN.INDEX_DATE + INTERVAL '30 DAYS')
FROM RN_CTE AS RN
JOIN MIN_CTE AS MC ON RN.ID=MC.ID
WHERE RN.RN=1
OR RN.INDEX_DATE > MC.INDEX_DATE + INTERVAL '30 DAYS' ),
MIN_DIFF_CTE AS (
SELECT ID, DAY_DIFF, MIN(ROWN) AS MIN_ROW
FROM DIFF_CTE
GROUP BY ID, DAY_DIFF)
SELECT T.*
FROM MIN_DIFF_CTE AS MDC
JOIN tTEST AS T ON MDC.MIN_ROW = T.ROWN
ORDER BY ID, INDEX_DATE
);
Result:
SELECT * FROM TTEST2 ORDER BY ID, INDEX_DATE;
ROWN ID INDEX_DATE
1 ABC 2019-01-01
4 ABC 2019-02-02
5 ABC 2019-02-09
6 ABC 2019-03-06
7 DEF 2019-01-05
9 DEF 2019-02-08
Row 5 with INDEX_DATE = 2019-02-09 should not be in the output because it is less than 31 days after Row 4's INDEX_DATE.
Something like this. The CTEs number the rows per ID, take each ID's first date, bucket every later row by the number of whole 31-day periods between it and that first date, and keep the minimum ROW value in each bucket.
Data
drop table if exists #tTEST;
go
select * INTO #tTEST from (values
(1, 'abc', '1/1/2019'),
(2, 'abc', '1/7/2019'),
(3, 'abc', '1/21/2019'),
(4, 'abc', '2/2/2019'),
(5, 'abc', '2/9/2019'),
(6, 'abc', '3/6/2019'),
(7, 'def', '1/5/2019'),
(8, 'def', '2/1/2019'),
(9, 'def', '2/8/2019')) V([ROW], ID, INDEX_DATE);
Query
;with
rn_cte([ROW], ID, INDEX_DATE, rn) as (
select *, row_number() over (partition by ID order by INDEX_DATE)
from #tTEST),
min_cte([ROW], ID, INDEX_DATE, rn) as (select * from rn_cte where rn=1),
diff_cte([ROW], ID, INDEX_DATE, rn, day_diff) as (
select rn.*, datediff(d, mc.INDEX_DATE, rn.INDEX_DATE)/31
from rn_cte rn
join min_cte mc on rn.ID=mc.ID
where rn.rn=1
or datediff(d, mc.INDEX_DATE, rn.INDEX_DATE)/31>0),
min_diff_cte as (
select ID, day_diff, min([ROW]) min_row
from diff_cte
group by ID, day_diff)
select t.*
from min_diff_cte mdc
join #tTEST t on mdc.min_row=t.ROW
order by 1;
Output
ROW ID INDEX_DATE
1 abc 1/1/2019
4 abc 2/2/2019
6 abc 3/6/2019
7 def 1/5/2019
9 def 2/8/2019
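For comparison, the rule as the question states it ("at least 31 days after the previously flagged INDEX_DATE") can be written as a sequential scan. This is a minimal sketch in Python, not T-SQL, and `flag_window_starts` is a hypothetical name. It is not the query above: that query buckets rows by whole 31-day periods since each ID's first date, which gives the same rows on this sample but can differ elsewhere. For instance, rows 60 and 63 days after the first date fall into buckets 1 and 2 and would both be returned, although they are only 3 days apart.

```python
from datetime import date


def flag_window_starts(rows, window_days=31):
    """rows: iterable of (row_no, id, index_date). Returns the flagged row numbers."""
    flagged = []
    last_flagged = {}  # id -> date of the most recently flagged row for that id
    for row_no, key, d in sorted(rows, key=lambda r: (r[1], r[2])):
        anchor = last_flagged.get(key)
        # Flag the first row of each id, or any row at least window_days
        # after the previously flagged row of the same id.
        if anchor is None or (d - anchor).days >= window_days:
            flagged.append(row_no)
            last_flagged[key] = d
    return flagged


rows = [
    (1, "ABC", date(2019, 1, 1)), (2, "ABC", date(2019, 1, 7)),
    (3, "ABC", date(2019, 1, 21)), (4, "ABC", date(2019, 2, 2)),
    (5, "ABC", date(2019, 2, 9)), (6, "ABC", date(2019, 3, 6)),
    (7, "DEF", date(2019, 1, 5)), (8, "DEF", date(2019, 2, 1)),
    (9, "DEF", date(2019, 2, 8)),
]

print(flag_window_starts(rows))
```

Because each decision depends on the previous flagged row, a set-based version of this rule typically needs a recursive CTE or a loop rather than LAG() alone.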

how to do partitioning on VARCHAR column

DECLARE @Table1 TABLE
(ID int, STATUS varchar(1))
;
INSERT INTO @Table1
(ID, STATUS)
VALUES
(1, 'A'),
(1, 'A'),
(1, 'A'),
(1, 'B'),
(1, 'A'),
(2, 'C'),
(2, 'C')
;
Script:
SELECT *, ROW_NUMBER() OVER (PARTITION BY STATUS ORDER BY (SELECT NULL)) RN FROM @Table1
Getting Result Set
ID STATUS RN
1 A 1
1 A 2
1 A 3
1 A 4
1 B 1
2 C 1
2 C 2
Need Output
ID STATUS RN
1 A 1
1 A 2
1 A 3
1 B 1
1 A 1
2 C 1
2 C 2
Try this
DECLARE @Table1 TABLE
(ID int, STATUS varchar(1));
INSERT INTO @Table1
(ID, STATUS)
VALUES
(1, 'A'),
(1, 'A'),
(1, 'A'),
(1, 'B'),
(1, 'A'),
(2, 'C'),
(2, 'C');
;WITH Tmp
AS
(
SELECT *, ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS RowNumber FROM @Table1
)
SELECT
A.ID ,
A.STATUS ,
ROW_NUMBER() OVER (PARTITION BY A.STATUS, (A.RowNumber - A.RN) ORDER BY (SELECT NULL)) AS RN
FROM
(
Select *, ROW_NUMBER() OVER(PARTITION BY STATUS ORDER BY RowNumber) AS RN from tmp
) A
ORDER BY
A.RowNumber
Output:
ID STATUS RN
----------- ------ ------
1 A 1
1 A 2
1 A 3
1 B 1
1 A 1
2 C 1
2 C 2
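The idea behind the query above is to restart the numbering whenever STATUS changes from one row to the next. A minimal sketch of that logic in Python (not T-SQL; `number_within_runs` is a hypothetical name) is shown here. It assumes the rows arrive in the intended order, which the table itself does not record: ORDER BY (SELECT NULL) gives no ordering guarantee, so a real table would need an explicit ordering column.

```python
from itertools import groupby


def number_within_runs(rows):
    """rows: list of (id, status) in the intended order.

    Returns (id, status, rn), where rn restarts at 1 whenever status
    differs from the previous row.
    """
    result = []
    # groupby only merges adjacent equal keys, so each run of the same
    # status becomes its own group.
    for _, run in groupby(rows, key=lambda r: r[1]):
        for rn, (row_id, status) in enumerate(run, start=1):
            result.append((row_id, status, rn))
    return result


rows = [(1, "A"), (1, "A"), (1, "A"), (1, "B"), (1, "A"), (2, "C"), (2, "C")]

for row in number_within_runs(rows):
    print(*row)
```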
First, look at the insert statement you posted: how is row 4 different from rows 1, 2 and 3? If the difference is based on another column, include that column in the PARTITION BY clause of ROW_NUMBER(). Otherwise the 'A' in row 4 and the 'A' in rows 1, 2 and 3 are treated as the same value and grouped together.
INSERT INTO @Table1
(ID, STATUS)
VALUES
(1, 'A'), -- 1
(1, 'A'), -- 2
(1, 'A'), -- 3
(1, 'B'),
(1, 'A'), -- 4
(2, 'C'),
(2, 'C')
;

SQL Query for Select Sequence Numbers

In SQL Server, I want to select rows based on sequence numbers. For example, I have the data below:
ID RowNos
A 1
B 2
X NULL
C 4
D 5
Y NULL
E 7
F 8
G 9
H 11
I 13
Query Should return
ID NextID
A B -- Since RowNos 1,2 is in sequence
C D -- Since RowNos 4,5 is in sequence
E G -- Since RowNos 7,8,9 is in sequence
I have no idea how to start this query; otherwise I would post my attempt too.
DECLARE @t TABLE (ID CHAR(1), RowNos INT)
INSERT INTO @t
VALUES
('A', 1), ('B', 2), ('X', NULL),
('C', 4), ('D', 5), ('Y', NULL),
('E', 7), ('F', 8), ('G', 9),
('H', 11), ('I', 13)
SELECT MIN(ID), MAX(ID)
FROM (
SELECT *, rn = ROW_NUMBER() OVER (ORDER BY RowNos)
FROM @t
) t
WHERE RowNos IS NOT NULL
GROUP BY RowNos - rn
HAVING MIN(ID) != MAX(ID)
Output:
---- ----
A B
C D
E G
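The query above relies on the fact that, within a run of consecutive numbers, the value minus its row number stays constant. A minimal sketch of the same trick in Python (not T-SQL; `consecutive_runs` is a hypothetical name) follows. One difference is an assumption on my part: the sketch reports the first and last ID by RowNos order, whereas the query uses MIN(ID) and MAX(ID), which agree here only because the IDs sort in the same order as RowNos.

```python
from itertools import groupby


def consecutive_runs(rows):
    """rows: iterable of (id, row_no or None).

    Returns (first_id, last_id) for every run of two or more consecutive row_no values.
    """
    # Drop NULLs and order by row_no, like ROW_NUMBER() OVER (ORDER BY RowNos).
    numbered = sorted((n, i) for i, n in rows if n is not None)
    # row_no minus its position is constant inside a consecutive run.
    keyed = [(n - rn, i) for rn, (n, i) in enumerate(numbered, start=1)]
    runs = []
    for _, group in groupby(keyed, key=lambda pair: pair[0]):
        ids = [i for _, i in group]
        if len(ids) > 1:  # same role as HAVING MIN(ID) != MAX(ID)
            runs.append((ids[0], ids[-1]))
    return runs


rows = [
    ("A", 1), ("B", 2), ("X", None), ("C", 4), ("D", 5), ("Y", None),
    ("E", 7), ("F", 8), ("G", 9), ("H", 11), ("I", 13),
]

print(consecutive_runs(rows))
```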
To select them in order, something like this should work:
SELECT * FROM table_name WHERE RowNos IS NOT NULL ORDER BY RowNos ASC;

Delete duplicate in sql based on other column value

I want to remove duplicates based on the condition below.
My table contains mirrored pairs: each value in column 1 also exists in column 2, and vice versa.
sample table
id id1
-------------
1 2
2 1
3 4
4 3
5 6
6 5
7 8
8 7
I want to delete one row from the first two rows, one from the third and fourth, one from the fifth and sixth, and so on.
Can anyone please help?
This way you delete just the second row from each group of two rows:
CREATE TABLE [LIST_ID](
[ID] [NUMERIC](4, 0) NOT NULL,
[ID_1] [NUMERIC](4, 0) NOT NULL
);
INSERT INTO LIST_ID (ID, ID_1)
VALUES
(1, 2),
(2, 1),
(3, 4),
(4, 3),
(5, 6),
(6, 5);
WITH First_Row AS
(
SELECT ROW_NUMBER() OVER (ORDER BY ID ASC) AS Row_Number, *
FROM LIST_ID
)
DELETE FROM First_Row WHERE Row_Number % 2 = 0;
SELECT * FROM LIST_ID;
How about this:
DELETE
FROM myTable
WHERE id IN (
SELECT CASE WHEN id < id1 THEN id ELSE id1 END
FROM myTable
)
Where myTable is the sample table with data.
declare @t table (id1 int, id2 int)
insert into @t (id1, id2)
values
(1, 2),
(2, 1),
(2, 1),
(2, 1),
(3, 4),
(3, 4),
(5, 6),
(7, 8),
(7, 6),
(6, 7),
(5, 0)
-- delete the row with the larger id1 from each mirrored pair
delete t2
from @t t1
inner join @t t2 on t2.id1 = t1.id2 and t2.id2 = t1.id1
where t2.id1 > t1.id1
select * from @t order by 1, 2
declare @t table (id1 int, id2 int)
insert into @t (id1, id2)
values
(1, 2),
(2, 1),
(3, 4),
(4, 3),
(5, 6),
(6, 5),
(7, 8),
(8, 7)
;
-- number the rows, then delete each row whose mirror image has a lower row number
;with a as (
select
row_number() over (order by id1) rn
,t.id1
,t.id2
from
@t t
)
delete t from
@t t
join (
select
a.id1
,a.id2
from
a a
where
exists(
select
*
from
a b
where
a.id2 = b.id1 and a.id1 = b.id2 and a.rn > b.rn
)
) c on t.id1 = c.id1 and t.id2 = c.id2
;
select * from @t;
/* OUTPUT
id1 id2
1 2
3 4
5 6
7 8
*/
It'll vary a little based on which row you want to keep, but if you really have simple duplicates as in your example, and every pair exists in both orders, this should do it:
DELETE FROM MyTable
WHERE ID > ID1
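The shortcut above works only when every pair really exists in both orders. A more defensive version of the same idea treats (id, id1) and (id1, id) as one unordered pair and keeps the first row seen for each. This is a minimal sketch in Python rather than T-SQL, with the hypothetical name `drop_mirrored_pairs`. Note that, as written, it also drops exact duplicate rows, which is an assumption about what is wanted.

```python
def drop_mirrored_pairs(rows):
    """rows: list of (id, id1). Keeps the first row of each unordered pair."""
    seen = set()
    kept = []
    for a, b in rows:
        # Normalise so that (1, 2) and (2, 1) map to the same key.
        key = (min(a, b), max(a, b))
        if key not in seen:
            seen.add(key)
            kept.append((a, b))
    return kept


rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5), (7, 8), (8, 7)]

print(drop_mirrored_pairs(rows))
```

A row such as (9, 2) with no mirrored counterpart is kept by this sketch, whereas a plain `ID > ID1` filter would delete it.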
As far as I understand, you want to delete each row whose id appears as id1 in its mirrored row, keeping one row per pair:
delete a
from TableA as a
where exists (select 1 from TableA as b where a.id = b.id1 and a.id1 = b.id and a.id > b.id)

How do I select TOP 5 PERCENT from each group?

I have a sample table like this:
CREATE TABLE #TEMP(Category VARCHAR(100), Name VARCHAR(100))
INSERT INTO #TEMP VALUES('A', 'John')
INSERT INTO #TEMP VALUES('A', 'John')
INSERT INTO #TEMP VALUES('A', 'John')
INSERT INTO #TEMP VALUES('A', 'John')
INSERT INTO #TEMP VALUES('A', 'John')
INSERT INTO #TEMP VALUES('A', 'John')
INSERT INTO #TEMP VALUES('A', 'Adam')
INSERT INTO #TEMP VALUES('A', 'Adam')
INSERT INTO #TEMP VALUES('A', 'Adam')
INSERT INTO #TEMP VALUES('A', 'Adam')
INSERT INTO #TEMP VALUES('A', 'Lisa')
INSERT INTO #TEMP VALUES('A', 'Lisa')
INSERT INTO #TEMP VALUES('A', 'Bucky')
INSERT INTO #TEMP VALUES('B', 'Lily')
INSERT INTO #TEMP VALUES('B', 'Lily')
INSERT INTO #TEMP VALUES('B', 'Lily')
INSERT INTO #TEMP VALUES('B', 'Lily')
INSERT INTO #TEMP VALUES('B', 'Lily')
INSERT INTO #TEMP VALUES('B', 'Tom')
INSERT INTO #TEMP VALUES('B', 'Tom')
INSERT INTO #TEMP VALUES('B', 'Tom')
INSERT INTO #TEMP VALUES('B', 'Tom')
INSERT INTO #TEMP VALUES('B', 'Ross')
INSERT INTO #TEMP VALUES('B', 'Ross')
INSERT INTO #TEMP VALUES('B', 'Ross')
SELECT Category, Name, COUNT(Name) Total
FROM #TEMP
GROUP BY Category, Name
ORDER BY Category, Total DESC
DROP TABLE #TEMP
Gives me the following:
A John 6
A Adam 4
A Lisa 2
A Bucky 1
B Lily 5
B Tom 4
B Ross 3
Now, how do I select the TOP 5 PERCENT records from each category assuming each category has more than 100 records (did not show in sample table here)? For instance, in my actual table, it should remove the John record from A and Lily record from B as appropriate (again, I did not show the full table here) to get:
A Adam 4
A Lisa 2
A Bucky 1
B Tom 4
B Ross 3
I have been trying to use CTEs and PARTITION BY clauses but cannot seem to achieve what I want. It removes the TOP 5 PERCENT from the overall result but not from each category. Any suggestions?
You could use a CTE (Common Table Expression) paired with the NTILE windowing function - this will slice up your data into as many slices as you need, e.g. in your case, into 20 slices (each 5%).
;WITH SlicedData AS
(
SELECT Category, Name, COUNT(Name) Total,
NTILE(20) OVER(PARTITION BY Category ORDER BY COUNT(Name) DESC) AS 'NTile'
FROM #TEMP
GROUP BY Category, Name
)
SELECT *
FROM SlicedData
WHERE NTile > 1
This basically groups your data by Category,Name, orders by something else (not sure if COUNT(Name) is really the thing you want here), and then slices it up into 20 pieces, each representing 5% of your data partition. The slice with NTile = 1 is the top 5% slice - just ignore that when selecting from the CTE.
See:
MSDN docs on NTILE
SQL Server 2005 ranking functions
SQL SERVER – 2005 – Sample Example of RANKING Functions – ROW_NUMBER, RANK, DENSE_RANK, NTILE
for more info
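To see what NTILE(20) does to a partition, here is a minimal sketch in Python (not T-SQL) of the tile assignment, followed by the `NTile > 1` filter applied to the sample counts from the question. The function name `ntile` and the hard-coded counts are illustrative. The sizing rule (when the rows do not divide evenly, the larger groups come first) follows the documented NTILE behaviour.

```python
def ntile(n_rows, tiles):
    """Tile number (1-based) for each of n_rows already-ordered rows."""
    base, extra = divmod(n_rows, tiles)
    result = []
    for tile in range(1, tiles + 1):
        # The first `extra` tiles receive one additional row.
        size = base + (1 if tile <= extra else 0)
        result.extend([tile] * size)
    return result


# Sample counts from the question, already ordered by Total DESC per category.
counts = {
    "A": [("John", 6), ("Adam", 4), ("Lisa", 2), ("Bucky", 1)],
    "B": [("Lily", 5), ("Tom", 4), ("Ross", 3)],
}

for category, names in counts.items():
    tiles = ntile(len(names), 20)
    for (name, total), tile in zip(names, tiles):
        if tile > 1:  # same filter as WHERE NTile > 1
            print(category, name, total)
```

With only four or three names per category, each name lands in its own tile, so dropping tile 1 removes exactly one name per category rather than 5% of them. With 100 names per category, tile 1 would hold 5.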
select Category,name,CountTotal,RankSeq,(50*CountTotal)/100 from (
select Category,name,COUNT(*)
over (partition by Category,name ) as CountTotal,
ROW_NUMBER()
over (partition by Category,name order by Category) RankSeq from #TEMP
--group by Category,Name
) temp
where RankSeq <= ((50*CountTotal)/100)
order by Category,Name,RankSeq
Output:
Category name CountTotal RankSeq (50*CountTotal)/100
A Adam 4 1 2
A Adam 4 2 2
A John 6 1 3
A John 6 2 3
A John 6 3 3
A Lisa 2 1 1
B Lily 5 1 2
B Lily 5 2 2
B Ross 3 1 1
B Tom 4 1 2
B Tom 4 2 2
I hope this helps :)
;WITH SlicedData AS
(
SELECT Category, Name, COUNT(Name) Total,
PERCENT_RANK() OVER(PARTITION BY Category ORDER BY COUNT(Name) DESC) * 100 AS [Percent]
FROM #TEMP
GROUP BY Category, Name
)
SELECT *
FROM SlicedData
WHERE [Percent] < 5
NTILE will not produce 5% slices if a category has fewer rows than the tile count: each row then lands in its own tile.
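The PERCENT_RANK() query can be checked by hand, since PERCENT_RANK is defined as (rank - 1) / (rows in partition - 1). Below is a minimal sketch in Python (not T-SQL; `percent_rank` is a hypothetical name) applied to category A's counts from the question.

```python
def percent_rank(values_desc):
    """values_desc: values already sorted in descending order.

    Returns (rank - 1) / (n - 1) for each value, with ties sharing a rank.
    """
    n = len(values_desc)
    ranks = []
    for i, value in enumerate(values_desc):
        if i > 0 and value == values_desc[i - 1]:
            ranks.append(ranks[-1])  # a tie keeps the previous rank
        else:
            ranks.append(i + 1)
    if n == 1:
        return [0.0]
    return [(rank - 1) / (n - 1) for rank in ranks]


names = ["John", "Adam", "Lisa", "Bucky"]
totals = [6, 4, 2, 1]

for name, pr in zip(names, percent_rank(totals)):
    print(name, round(pr * 100, 1))
```

On this sample only John falls below 5, so the query as written returns the top slice. If the goal is to exclude the top 5 percent, as the question describes, the filter would presumably need to be reversed to `>= 5`.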