How can I add a sequence / order a data in SQL - sql

I am trying to add a sequence order to my raw data and was wondering if there is an efficient way to do this without a while exists loop as I have more than million records to order.
Example dataset:
CustomerID StartDate EndDate EnrollID
-------------------------------------------
1 1/1/1990 1/1/1991 14994
2 1/1/1990 1/1/1992 14995
2 1/1/1993 1/1/1995 14997
1 1/1/1992 1/1/1993 14996
1 1/1/1993 1/1/1994 14997
2 1/1/1995 1/1/1996 14998
3 1/1/1990 1/1/1991 15000
3 1/1/1992 1/1/1993 15001
3 1/1/1995 1/1/1996 15007
Re-ordered data should add a sequence/ order for each customer id based on min(startdate), min(enddate) , min(enrollid)
Final output Dataset should look like below where each customerID records are ordered by min(StartDate), min(EndDate), min(EnrollID)
CustomerID StartDate EndDate EnrollID Sequence_Order
----------------------------------------------------------
1 1/1/1990 1/1/1991 14994 1
1 1/1/1992 1/1/1993 14996 2
1 1/1/1993 1/1/1994 14997 3
2 1/1/1990 1/1/1992 14995 1
2 1/1/1993 1/1/1995 14997 2
2 1/1/1995 1/1/1996 14998 3
3 1/1/1990 1/1/1991 15000 1
3 1/1/1992 1/1/1993 15001 2
3 1/1/1995 1/1/1996 15007 3
Need the fastest way to do this in T-SQL

Use ROW_NUMBER()
SELECT CustomerID, StartDate, EndDate, EnrollID,
ROW_NUMBER() OVER(
PARTITION BY CustomerId
ORDER BY StartDate
,EndDate
,EnrollID
) AS Sequence_Order
FROM Table1
OUTPUT:
CustomerID StartDate EndDate EnrollID Sequence_Order
1 1990-01-01 1991-01-01 14994 1
1 1992-01-01 1993-01-01 14996 2
1 1993-01-01 1994-01-01 14997 3
2 1990-01-01 1992-01-01 14995 1
2 1993-01-01 1995-01-01 14997 2
2 1995-01-01 1996-01-01 14998 3
3 1990-01-01 1991-01-01 15000 1
3 1992-01-01 1993-01-01 15001 2
3 1995-01-01 1996-01-01 15007 3
Follow the link to the demo:
http://sqlfiddle.com/#!18/dbe66/2

Use Row_number
select
*, row_number() over (partition by CustomerID order by StartDate, EndDate, EnrollID) as Sequence_Order
from myTable

The Order= ROW_NUMBER() OVER (ORDER BY column) is uses for. try the below answer
SELECT (Order= ROW_NUMBER() OVER ( ORDER BY CustomerID)) as Sequence ,
CustomerID,
StartDate,
EndDate,
EnrollID
FROM dbo.SomeTable
Reference

Try this:
SELECT customerId, startDate, endDate, enrollID,
ROW_NUMBER() OVER(
PARTITION BY customerId
ORDER BY startDate
,endDate
,enrollID
) AS seq
FROM Table1
ORDER BY customerId
,startDate
,endDate
,enrollID
Order by is required in the end to sort the final output

Related

SQL server group / partition to condense history table

Got a table of dates someone was in a particular category like this:
drop table if exists #category
create table #category (personid int, categoryid int, startdate datetime, enddate datetime)
insert into #category
select * from
(
select 1 Personid, 1 CategoryID, '01/04/2010' StartDate, '31/07/2016' EndDate union
select 1 Personid, 5 CategoryID, '07/08/2016' StartDate, '31/03/2019' EndDate union
select 1 Personid, 5 CategoryID, '01/04/2019' StartDate, '01/04/2019' EndDate union
select 1 Personid, 5 CategoryID, '02/04/2019' StartDate, '11/08/2019' EndDate union
select 1 Personid, 4 CategoryID, '12/08/2019' StartDate, '03/11/2019' EndDate union
select 1 Personid, 5 CategoryID, '04/11/2019' StartDate, '22/03/2020' EndDate union
select 1 Personid, 5 CategoryID, '23/03/2020' StartDate, NULL EndDate union
select 2 Personid, 1 CategoryID, '01/04/2010' StartDate, '09/04/2015' EndDate union
select 2 Personid, 4 CategoryID, '10/04/2015' StartDate, '31/03/2018' EndDate union
select 2 Personid, 4 CategoryID, '01/04/2018' StartDate, '31/03/2019' EndDate union
select 2 Personid, 4 CategoryID, '01/04/2019' StartDate, '23/06/2019' EndDate union
select 2 Personid, 4 CategoryID, '24/06/2019' StartDate, NULL EndDate
) x
order by personid, startdate
I'm trying to condense it so I get this:
PersonID
categoryid
startdate
EndDate
1
1
01/04/2010
31/07/2016
1
5
07/08/2016
11/08/2019
1
4
12/08/2019
03/11/2019
1
5
04/11/2019
NULL
2
1
01/04/2010
09/04/2015
2
4
01/04/2015
NULL
I'm having issues with people like personid 1 where they are in (e.g.) category 5, then go into category 4 and them back into category 5.
So doing something like:
select
personid,
categoryid,
min(startdate) startdate,
max(enddate) enddate
from #category
group by
personid, categoryid
gives me the earliest date from category 5's first period, and the latest date from the second period - and means it creates an overlapping period.
So I tried partitioning it with a rownum or rank, but it still does the same thing - i.e. treats the 'category 5's as the same group:
select
rank() over (partition by personid, categoryid order by personid, startdate) rank,
c.*
from #category c
order by personid, startdate
rank
personid
categoryid
startdate
enddate
1
1
1
2010-04-01 00:00:00.000
2016-07-31 00:00:00.000
1
1
5
2016-08-07 00:00:00.000
2019-03-31 00:00:00.000
2
1
5
2019-04-01 00:00:00.000
2019-04-01 00:00:00.000
3
1
5
2019-04-02 00:00:00.000
2019-08-11 00:00:00.000
1
1
4
2019-08-12 00:00:00.000
2019-11-03 00:00:00.000
4
1
5
2019-11-04 00:00:00.000
2020-03-22 00:00:00.000
5
1
5
2020-03-23 00:00:00.000
NULL
1
2
1
2010-04-01 00:00:00.000
2015-04-09 00:00:00.000
1
2
4
2015-04-10 00:00:00.000
2018-03-31 00:00:00.000
2
2
4
2018-04-01 00:00:00.000
2019-03-31 00:00:00.000
3
2
4
2019-04-01 00:00:00.000
2019-06-23 00:00:00.000
4
2
4
2019-06-24 00:00:00.000
NULL
You can see in the rank column that the category 5's start off 1,2,3, miss a row and carry on 4, 5 so obvs in the same partition - I thought that adding the order by clause would force it to start a new partition when the category changed from 5 to 4 and back again.
Any thoughts?
This is a type of gaps and islands problem. However, if your data tiles perfectly (no gaps) as it does in your example data, then you can do this without any aggregation at all -- which should be the most efficient method:
select personid, categoryid, startdate,
dateadd(day, -1, lead(startdate) over (partition by personid order by startdate)) as enddate
from (select c.*,
lag(categoryid) over (partition by personid order by startdate) as prev_categoryid
from #category c
) c
where prev_categoryid is null or prev_categoryid <> categoryid;
The where clause only selects the rows where the category changes. The lead() then gets the next start date -- and subtracts 1 for your desired enddate.

SQL query to find continuous local max, min of date based on category column

I have the following data set
Customer_ID Category FROM_DATE TO_DATE
1 5 1/1/2000 12/31/2001
1 6 1/1/2002 12/31/2003
1 5 1/1/2004 12/31/2005
2 7 1/1/2010 12/31/2011
2 7 1/1/2012 12/31/2013
2 5 1/1/2014 12/31/2015
3 7 1/1/2010 12/31/2011
3 7 1/5/2012 12/31/2013
3 5 1/1/2014 12/31/2015
The result I want to achieve is to find continuous local min/max date for Customers with the same category and identify any gap in dates:
Customer_ID FROM_Date TO_Date Category
1 1/1/2000 12/31/2001 5
1 1/1/2002 12/31/2003 6
1 1/1/2004 12/31/2005 5
2 1/1/2010 12/31/2013 7
2 1/1/2014 12/31/2015 5
3 1/1/2010 12/31/2011 7
3 1/5/2012 12/31/2013 7
3 1/1/2014 12/31/2015 5
My code works fine for customer 1 (return all 3 rows) and customer 2(return 2 rows with min and max date for each category) but for customer 3, it cannot identify the gap between 12/31/2011 and 1/5/2012 for category 7.
Customer_ID FROM_Date TO_Date Category
3 1/1/2010 12/31/2013 7
3 1/1/2014 12/31/2015 5
Here is my code:
SELECT Customer_ID, Category, min(From_Date), max(To_Date) FROM
(
SELECT Customer_ID, Category, From_Date,To_Date
,row_number() over (order by member_id, To_Date) - row_number() over (partition by Customer_ID order by Category) as p
FROM FFS_SAMP
) X
group by Customer_ID,Category,p
order by Customer_ID,min(From_Date),Max(To_Date)
This is a type of gaps and islands problem. Probably the safest method is to use a cumulative max() to look for overlaps with previous records. Where there is no overlap, then an "island" of records starts. So:
select customer_id, min(from_date), max(to_date), category
from (select t.*,
sum(case when prev_to_date >= from_date then 0 else 1 end) over
(partition by customer_id, category
order by from_date
) as grp
from (select t.*,
max(to_date) over (partition by customer_id, category
order by from_date
rows between unbounded preceding and 1 preceding
) as prev_to_date
from t
) t
) t
group by customer_id, category, grp;
Your attempt is quite close. You just need to fix the over() clause of the window functions:
select customer_id, category, min(from_date), max(to_date)
from (
select
fs.*,
row_number() over (partition by customer_id order from_date)
- row_number() over (partition by customer_id, category order by from_date) as grp
from ffs_samp fs
) x
group by customer_id, category, grp
order by customer_id, min(from_date)
Note that this method assumes no gaps or overlalp in the periods of a given customer, as show in your sample data.

Get max date for each from either of 2 columns

I have a table like below
AID BID CDate
-----------------------------------------------------
1 2 2018-11-01 00:00:00.000
8 1 2018-11-08 00:00:00.000
1 3 2018-11-09 00:00:00.000
7 1 2018-11-15 00:00:00.000
6 1 2018-12-24 00:00:00.000
2 5 2018-11-02 00:00:00.000
2 7 2018-12-15 00:00:00.000
And I am trying to get a result set as follows
ID MaxDate
-------------------
1 2018-12-24 00:00:00.000
2 2018-12-15 00:00:00.000
Each value in the id columns(AID,BID) should return the max of CDate .
ex: in the case of 1, its max CDate is 2018-12-24 00:00:00.000 (here 1 appears under BID)
in the case of 2 , max date is 2018-12-15 00:00:00.000 . (here 2 is under AID)
I tried the following.
1.
select
g.AID,g.BID,
max(g.CDate) as 'LastDate'
from dbo.TT g
inner join
(select AID,BID,max(CDate) as maxdate
from dbo.TT
group by AID,BID)a
on (a.AID=g.AID or a.BID=g.BID)
and a.maxdate=g.CDate
group by g.AID,g.BID
and 2.
SELECT
AID,
CDate
FROM (
SELECT
*,
max_date = MAX(CDate) OVER (PARTITION BY [AID])
FROM dbo.TT
) AS s
WHERE CDate= max_date
Please suggest a 3rd solution.
You can assemble the data in a table expression first, and the compute the max for each value is simple. For example:
select
id, max(cdate)
from (
select aid as id, cdate from t
union all
select bid, cdate from t
) x
group by id
You seem to only care about values that are in both columns. If this interpretation is correct, then:
select id, max(cdate)
from ((select aid as id, cdate, 1 as is_a, 0 as is_b
from t
) union all
(select bid as id, cdate, 1 as is_a, 0 as is_b
from t
)
) ab
group by id
having max(is_a) = 1 and max(is_b) = 1;

Using the earliest date of a partition to determine what other dates belong to that partition

Assume this is my table:
ID DATE
--------------
1 2018-11-12
2 2018-11-13
3 2018-11-14
4 2018-11-15
5 2018-11-16
6 2019-03-05
7 2019-05-07
8 2019-05-08
9 2019-05-08
I need to have partitions be determined by the first date in the partition. Where, any date that is within 2 days of the first date, belongs in the same partition.
The table would end up looking like this if each partition was ranked
PARTITION ID DATE
------------------------
1 1 2018-11-12
1 2 2018-11-13
1 3 2018-11-14
2 4 2018-11-15
2 5 2018-11-16
3 6 2019-03-05
4 7 2019-05-07
4 8 2019-05-08
4 9 2019-05-08
I've tried using datediff with lag to compare to the previous date but that would allow a partition to be inappropriately sized based on spacing, for example all of these dates would be included in the same partition:
ID DATE
--------------
1 2018-11-12
2 2018-11-14
3 2018-11-16
4 2018-11-18
3 2018-11-20
4 2018-11-22
Previous flawed attempt:
Mark when a date is more than 2 days past the previous date:
(case when datediff(day, lag(event_time, 1) over (partition by user_id, stage order by event_time), event_time) > 2 then 1 else 0 end)
You need to use a recursive CTE for this, so the operation is expensive.
with t as (
-- add an incrementing column with no gaps
select t.*, row_number() over (order by date) as seqnum
from t
),
cte as (
select id, date, date as mindate, seqnum
from t
where seqnum = 1
union all
select t.id, t.date,
(case when t.date <= dateadd(day, 2, cte.mindate)
then cte.mindate else t.date
end) as mindate,
t.seqnum
from cte join
t
on t.seqnum = cte.seqnum + 1
)
select cte.*, dense_rank() over (partition by mindate) as partition_num
from cte;

window function in redshift

I have some data that looks like this:
CustID EventID TimeStamp
1 17 1/1/15 13:23
1 17 1/1/15 14:32
1 13 1/1/25 14:54
1 13 1/3/15 1:34
1 17 1/5/15 2:54
1 1 1/5/15 3:00
2 17 2/5/15 9:12
2 17 2/5/15 9:18
2 1 2/5/15 10:02
2 13 2/8/15 7:43
2 13 2/8/15 7:50
2 1 2/8/15 8:00
I'm trying to use the row_number function to get it to look like this:
CustID EventID TimeStamp SeqNum
1 17 1/1/15 13:23 1
1 17 1/1/15 14:32 1
1 13 1/1/25 14:54 2
1 13 1/3/15 1:34 2
1 17 1/5/15 2:54 3
1 1 1/5/15 3:00 4
2 17 2/5/15 9:12 1
2 17 2/5/15 9:18 1
2 1 2/5/15 10:02 2
2 13 2/8/15 7:43 3
2 13 2/8/15 7:50 3
2 1 2/8/15 8:00 4
I tried this:
row_number () over
(partition by custID, EventID
order by custID, TimeStamp asc) SeqNum]
but got this back:
CustID EventID TimeStamp SeqNum
1 17 1/1/15 13:23 1
1 17 1/1/15 14:32 2
1 13 1/1/25 14:54 3
1 13 1/3/15 1:34 4
1 17 1/5/15 2:54 5
1 1 1/5/15 3:00 6
2 17 2/5/15 9:12 1
2 17 2/5/15 9:18 2
2 1 2/5/15 10:02 3
2 13 2/8/15 7:43 4
2 13 2/8/15 7:50 5
2 1 2/8/15 8:00 6
how can I get it to sequence based on the change in the EventID?
This is tricky. You need a multi-step process. You need to identify the groups (a difference of row_number() works for this). Then, assign an increasing constant to each group. And then use dense_rank():
select sd.*, dense_rank() over (partition by custid order by mints) as seqnum
from (select sd.*,
min(timestamp) over (partition by custid, eventid, grp) as mints
from (select sd.*,
(row_number() over (partition by custid order by timestamp) -
row_number() over (partition by custid, eventid order by timestamp)
) as grp
from somedata sd
) sd
) sd;
Another method is to use lag() and a cumulative sum:
select sd.*,
sum(case when prev_eventid is null or prev_eventid <> eventid
then 1 else 0 end) over (partition by custid order by timestamp
) as seqnum
from (select sd.*,
lag(eventid) over (partition by custid order by timestamp) as prev_eventid
from somedata sd
) sd;
EDIT:
The last time I used Amazon Redshift it didn't have row_number(). You can do:
select sd.*, dense_rank() over (partition by custid order by mints) as seqnum
from (select sd.*,
min(timestamp) over (partition by custid, eventid, grp) as mints
from (select sd.*,
(row_number() over (partition by custid order by timestamp rows between unbounded preceding and current row) -
row_number() over (partition by custid, eventid order by timestamp rows between unbounded preceding and current row)
) as grp
from somedata sd
) sd
) sd;
Try this code block:
WITH by_day
AS (SELECT
*,
ts::date AS login_day
FROM table_name)
SELECT
*,
login_day,
FIRST_VALUE(login_day) OVER (PARTITION BY userid ORDER BY login_day , userid rows unbounded preceding) AS first_day
FROM by_day