SQL Server order by closest value to zero - sql

I have some duplicate values in a table and I'm trying to use ROW_NUMBER to filter them out. I want to rank the rows using DATEDIFF, ordering the results by the value closest to zero, but I'm struggling to account for negative values.
Below is a sample of the data and my current Row_Number field (rn) column:
PersonID SurveyDate DischargeDate DaysToSurvey rn
93638 10/02/2015 30/03/2015 -48 1
93638 27/03/2015 30/03/2015 -3 2
250575 23/10/2014 29/10/2014 -6 1
250575 19/11/2014 24/11/2014 -5 2
203312 23/01/2015 26/01/2015 -3 1
203312 26/01/2015 26/01/2015 0 2
387737 19/02/2015 26/02/2015 -7 1
387737 26/02/2015 26/02/2015 0 2
751915 02/04/2015 04/04/2015 -2 1
751915 10/04/2015 25/03/2015 16 2
712364 24/01/2015 30/01/2015 -6 1
712364 26/01/2015 30/01/2015 -4 2
My select statement for the above is:
select
    PersonID, SurveyDate, DischargeDate,
    datediff(dd, dischargedate, surveydate) as days,
    ROW_NUMBER() over (partition by PersonID
                       order by datediff(dd, dischargedate, surveydate) asc) as rn
from
    Table1
order by
    PersonID, rn
What I want to do is change the sort order so it displays like this:
PersonID SurveyDate DischargeDate DaysToSurvey rn
93638 27/03/2015 30/03/2015 -3 1
93638 10/02/2015 30/03/2015 -48 2
250575 19/11/2014 24/11/2014 -5 1
250575 23/10/2014 29/10/2014 -6 2
So the row where the SurveyDate is closest to the DischargeDate (DaysToSurvey closest to zero) is ranked as rn 1.
Is this possible?

You're close. Just add ABS() to calculate absolute values of the differences:
ROW_NUMBER() OVER (
    PARTITION BY PersonID
    ORDER BY abs(datediff(dd, dischargedate, surveydate)) asc
) AS rn
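One caveat: ABS() can produce ties (a survey 3 days before discharge and one 3 days after both yield 3). If you need a deterministic winner, add a tie-breaker to the ORDER BY — a sketch, where preferring the later survey is just an assumed business rule:
ROW_NUMBER() OVER (
    PARTITION BY PersonID
    ORDER BY ABS(DATEDIFF(dd, dischargedate, surveydate)),
             surveydate DESC   -- assumed tie-break rule: prefer the later survey
) AS rn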

You could use abs to get the distance from zero:
select PersonID, SurveyDate, DischargeDate,
       datediff(dd, dischargedate, surveydate) as days,
       ROW_NUMBER() over (partition by PersonID
                          order by abs(datediff(dd, dischargedate, surveydate)) asc) as rn
from Table1
order by PersonID, rn

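Since the goal is to filter the duplicates out entirely, you can wrap the ranking in a CTE and keep only rn = 1 — a minimal sketch, assuming the table is named Table1 as above:
WITH ranked AS (
    SELECT PersonID, SurveyDate, DischargeDate,
           DATEDIFF(dd, DischargeDate, SurveyDate) AS days,
           ROW_NUMBER() OVER (PARTITION BY PersonID
                              ORDER BY ABS(DATEDIFF(dd, DischargeDate, SurveyDate))) AS rn
    FROM Table1
)
SELECT PersonID, SurveyDate, DischargeDate, days
FROM ranked
WHERE rn = 1;   -- one row per person: the survey closest to the discharge date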

Related

How to merge rows startdate enddate based on column values using Lag Lead or window functions?

I have a table with 4 columns: ID, STARTDATE, ENDDATE and BADGE. I want to merge rows based on ID and BADGE values, but make sure that only consecutive rows get merged.
I have tried LAG, LEAD, and bounded/unbounded window frames, but I am unable to achieve the desired output:
SELECT ID,
STARTDATE,
MAX(ENDDATE),
NAME
FROM (SELECT USERID,
IFF(LAG(NAME) over(Partition by USERID Order by STARTDATE) = NAME,
LAG(STARTDATE) over(Partition by USERID Order by STARTDATE),
STARTDATE) AS STARTDATE,
ENDDATE,
NAME
from myTable )
GROUP BY USERID,
STARTDATE,
NAME
We have to make sure that we merge only consecutive rows having the same ID and BADGE.
Help will be appreciated, thanks.
You can split the problem into two steps:
creating the right partitions
aggregating on the partitions with direct aggregation functions (MIN and MAX)
You can approach the first step using a boolean flag that is 1 when the current row does not continue the previous one, i.e. when the consecutive-date match (row1.ENDDATE + 1 day = row2.STARTDATE) fails. This flag indicates where a new partition should start, so computing a running sum of it gives you correctly numbered partitions.
WITH cte AS (
SELECT *,
IFF(LAG(ENDDATE) OVER(PARTITION BY ID, Badge ORDER BY STARTDATE) + INTERVAL 1 DAY = STARTDATE , 0, 1) AS boolval
FROM tab
)
SELECT *,
       SUM(COALESCE(boolval, 0)) OVER(ORDER BY ID DESC, STARTDATE) AS rn
FROM cte
Then the second step reduces to directly aggregating STARTDATE and ENDDATE with the MIN and MAX functions respectively, grouping on the ranking value. For syntactic correctness you need to add ID and Badge to the GROUP BY clause as well, even though their range of action is already captured by the computed ranking value.
WITH cte AS (
SELECT *,
IFF(LAG(ENDDATE) OVER(PARTITION BY ID, Badge ORDER BY STARTDATE) + INTERVAL 1 DAY = STARTDATE , 0, 1) AS boolval
FROM tab
), cte2 AS (
SELECT *,
SUM(COALESCE(boolval, 0)) OVER(ORDER BY ID DESC, STARTDATE) AS rn
FROM cte
)
SELECT ID,
MIN(STARTDATE) AS STARTDATE,
MAX(ENDDATE) AS ENDDATE,
Badge
FROM cte2
GROUP BY ID,
Badge,
rn
In Snowflake, such gaps-and-islands problems can also be solved using the function CONDITIONAL_TRUE_EVENT, as in the query below.
The first CTE creates a column that flags a change event (true or false) whenever the value of the BADGE column changes.
The next CTE (cte_1) passes this flag to CONDITIONAL_TRUE_EVENT, which produces a counter column (incremented whenever the flag is TRUE) to be used for grouping in the final main query.
The final query is then just MIN/MAX with a GROUP BY.
with cte as (
select
m.*,
case when badge <> lag(badge) over (partition by id order by startdate)
then true
else false end flag
from merge_tab m
), cte_1 as (
select c.*,
conditional_true_event(flag) over (partition by id order by startdate) cn
from cte c
)
select id,min(startdate) ms, max(enddate) me, badge
from cte_1
group by id,badge,cn
order by id desc, ms asc, me asc, badge asc;
Final output -
ID    MS            ME            BADGE
---------------------------------------
51    1985-02-01    2019-04-28    1
51    2019-04-29    2020-08-16    2
51    2020-08-17    2021-04-03    3
51    2021-04-04    2021-04-05    1
51    2021-04-06    2022-08-20    2
51    2022-08-21    9999-12-31    3
10    2020-02-06    9999-12-31    3
With data -
select * from merge_tab;
ID    STARTDATE     ENDDATE       BADGE
---------------------------------------
51    1985-02-01    2019-04-28    1
51    2019-04-29    2019-04-28    2
51    2019-09-16    2019-11-16    2
51    2019-11-17    2020-08-16    2
51    2020-08-17    2021-04-03    3
51    2021-04-04    2021-04-05    1
51    2021-04-06    2022-05-05    2
51    2022-05-06    2022-08-20    2
51    2022-08-21    9999-12-31    3
10    2020-02-06    2019-04-28    3
10    2021-03-21    9999-12-31    3
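Note that the CONDITIONAL_TRUE_EVENT query only starts a new group when BADGE changes, while the first answer also breaks groups on date gaps. If you need both behaviours in one flag, the two conditions can be combined — a sketch, assuming the same Snowflake merge_tab table:
with cte as (
    select m.*,
           case when badge <> lag(badge) over (partition by id order by startdate)
                  or startdate <> lag(enddate) over (partition by id order by startdate) + interval '1 day'
                then true
                else false end as flag   -- true when the badge changes or the dates are not consecutive
    from merge_tab m
), cte_1 as (
    select c.*,
           conditional_true_event(flag) over (partition by id order by startdate) as cn
    from cte c
)
select id, min(startdate) as ms, max(enddate) as me, badge
from cte_1
group by id, badge, cn
order by id desc, ms;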

Efficient way to get the average of past x events within d days per each row in SQL (big data)

I want to find the most efficient way to calculate, for each row, the average score of the past 2 events within 7 days.
I already have a query that works on 60M rows, but on 100% of the data (~500M rows) it collapses (maybe it isn't efficient, or maybe it's a lack of resources).
Can you help? If you think my solution is not the best approach, please explain.
Thank you
I have this table:
user_id event_id start end score
---------------------------------------------------
1 7 30/01/2021 30/01/2021 45
1 6 24/01/2021 29/01/2021 25
1 5 22/01/2021 23/01/2021 13
1 4 18/01/2021 21/01/2021 15
1 3 17/01/2021 17/01/2021 52
1 2 08/01/2021 10/01/2021 8
1 1 01/01/2021 02/01/2021 36
Per row (user_id + event_id), I want the average score of the past 2 events in the last 7 days.
Example: for this row:
user_id event_id start end score
---------------------------------------------------
1 6 24/01/2021 29/01/2021 25
user_id event_id start end score past_7_days_from_start event_num
--------------------------------------------------------------------------------------
1 6 24/01/2021 29/01/2021 25 null null
1 5 22/01/2021 23/01/2021 13 yes 1
1 4 18/01/2021 21/01/2021 15 yes 2
1 3 17/01/2021 17/01/2021 52 yes 3
1 2 08/01/2021 10/01/2021 8 no 4
1 1 01/01/2021 02/01/2021 36 no 5
So I would select only these rows for the group by and then AVG(score):
user_id event_id start end score past_7_days_from_start event_num
--------------------------------------------------------------------------------------
1 5 22/01/2021 23/01/2021 13 yes 1
1 4 18/01/2021 21/01/2021 15 yes 2
Result:
user_id event_id start end score avg_score_of_past_2_events_within_7_days
--------------------------------------------------------------------------------------
1 6 24/01/2021 29/01/2021 25 14
My query:
SELECT user_id, event_id, AVG(score) as avg_score_of_past_2_events_within_7_days
FROM (
    SELECT
        B.user_id, B.event_id, A.score,
        ROW_NUMBER() OVER (PARTITION BY B.user_id, B.event_id ORDER BY A.end DESC) AS event_num
    FROM
        "df" A
    INNER JOIN
        (SELECT user_id, event_id, start FROM "df") B
        ON B.user_id = A.user_id
        AND (A.end BETWEEN DATE_SUB(B.start, INTERVAL 7 DAY) AND B.start))
WHERE event_num <= 2
GROUP BY user_id, event_id
Any suggestion for a better way?
I don't believe there is a more efficient query in your case.
I can suggest the following:
Make sure your base table is partitioned by start and clustered by user_id (see the sketch below)
Split the query into 3 parts, each creating a partitioned and clustered table:
first table: only the inner join O(n^2)
second table: add ROW_NUMBER O(n)
third table: group by
If it is still a problem, I would suggest doing batch preprocessing and running the queries by dates.
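For the first suggestion, something like the following DDL — a sketch assuming BigQuery (the DATE_SUB ... INTERVAL and PARSE_DATE syntax elsewhere in this thread is BigQuery's) and a hypothetical dataset.df table name:
-- Rebuild the base table partitioned on the start date and clustered on user_id,
-- so the 7-day range join can prune partitions and colocate each user's rows.
CREATE TABLE dataset.df_partitioned
PARTITION BY start          -- assumes start is a DATE column
CLUSTER BY user_id
AS SELECT * FROM dataset.df;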
I've tried to create a solution using LEAD functions, but I am not able to test whether it works on that large a dataset.
I pull the two preceding rows into prev and ante columns using LEAD.
Then an IF checks the 7-day window: if it matches I populate scorePP and scoreAA, otherwise they are null.
with t as (
select 1 as user_id,7 as event_id,parse_date('%d/%m/%Y','30/01/2021') as start,parse_date('%d/%m/%Y','30/01/2021') as stop, 45 as score union all
select 1 as user_id,6 as event_id,parse_date('%d/%m/%Y','24/01/2021') as start,parse_date('%d/%m/%Y','29/01/2021') as stop, 25 as score union all
select 1 as user_id,5 as event_id,parse_date('%d/%m/%Y','22/01/2021') as start,parse_date('%d/%m/%Y','23/01/2021') as stop, 13 as score union all
select 1 as user_id,4 as event_id,parse_date('%d/%m/%Y','18/01/2021') as start,parse_date('%d/%m/%Y','21/01/2021') as stop, 15 as score union all
select 1 as user_id,3 as event_id,parse_date('%d/%m/%Y','17/01/2021') as start,parse_date('%d/%m/%Y','17/01/2021') as stop, 52 as score union all
select 1 as user_id,2 as event_id,parse_date('%d/%m/%Y','08/01/2021') as start,parse_date('%d/%m/%Y','10/01/2021') as stop, 8 as score union all
select 1 as user_id,1 as event_id,parse_date('%d/%m/%Y','01/01/2021') as start,parse_date('%d/%m/%Y','02/01/2021') as stop, 36 as score union all
select 2 as user_id,3 as event_id,parse_date('%d/%m/%Y','12/01/2021') as start,parse_date('%d/%m/%Y','17/01/2021') as stop, 52 as score union all
select 2 as user_id,2 as event_id,parse_date('%d/%m/%Y','08/01/2021') as start,parse_date('%d/%m/%Y','10/01/2021') as stop, 8 as score union all
select 2 as user_id,1 as event_id,parse_date('%d/%m/%Y','01/01/2021') as start,parse_date('%d/%m/%Y','02/01/2021') as stop, 36 as score
)
select *, (select avg(x) from unnest([scorePP,scoreAA]) as x) as avg_score_7_day from (
SELECT
t.*,
lead(start,1) over(partition by user_id order by event_id desc, t.stop desc) prev_start,
lead(stop,1) over(partition by user_id order by event_id desc, t.stop desc) prev_stop,
lead(score,1) over(partition by user_id order by event_id desc, t.stop desc) prev_score,
if(((lead(start,1) over(partition by user_id order by event_id desc, t.stop desc)) between date_sub(start, interval 7 day) and (lead(stop,1) over(partition by user_id order by event_id desc, t.stop desc))),lead(score,1) over(partition by user_id order by event_id desc, t.stop desc),null) as scorePP,
/**/
lead(start,2) over(partition by user_id order by event_id desc, t.stop desc) ante_start,
lead(stop,2) over(partition by user_id order by event_id desc, t.stop desc) ante_stop,
lead(score,2) over(partition by user_id order by event_id desc, t.stop desc) ante_score,
if(((lead(start,2) over(partition by user_id order by event_id desc, t.stop desc)) between date_sub(start, interval 7 day) and (lead(stop,2) over(partition by user_id order by event_id desc, t.stop desc))),lead(score,2) over(partition by user_id order by event_id desc, t.stop desc),null) as scoreAA
from
t
)
where coalesce(scorePP,scoreAA) is not null
order by user_id,event_id desc
Consider below approach
select * except(candidates1, candidates2),
( select avg(score)
from (
select * from unnest(candidates1) union distinct
select * from unnest(candidates2)
order by event_id desc
limit 2
)
) as avg_score_of_past_2_events_within_7_days
from (
select *,
array_agg(struct(event_id, score)) over(partition by user_id order by unix_date(t.start) range between 7 preceding and 1 preceding) as candidates1,
array_agg(struct(event_id, score)) over(partition by user_id order by unix_date(t.end) range between 7 preceding and 1 preceding) as candidates2
from your_table t
)
If applied to the sample data in your question, it produces the expected output.

How to use SQL to get column count for a previous date?

I have the following table,
id status price date
2 complete 10 2020-01-01 10:10:10
2 complete 20 2020-02-02 10:10:10
2 complete 10 2020-03-03 10:10:10
3 complete 10 2020-04-04 10:10:10
4 complete 10 2020-05-05 10:10:10
Required output,
id status_count price ratio
2 0 0 0
2 1 10 0
2 2 30 0.33
I am looking to sum the prices of the previous rows; row 1 is 0 because it has no previous row.
Then find the ratio of the previous row's value to the current one, i.e. 10/30 = 0.33.
You can use the analytic functions ROW_NUMBER and SUM as follows:
SELECT
id,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) - 1 AS status_count,
COALESCE(SUM(price) OVER (PARTITION BY id ORDER BY date), 0) - price as price
FROM yourTable;
DB<>Fiddle demo
I think you want something like this:
SELECT
id,
COUNT(*) OVER (PARTITION BY id ORDER BY date) - 1 AS status_count,
COALESCE(SUM(price) OVER (PARTITION BY id
ORDER BY date ROWS BETWEEN
UNBOUNDED PRECEDING AND 1 PRECEDING), 0) price
FROM yourTable;
Demo
Please also check another method:
with cte as (
    select *,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) - 1 AS status_count,
           SUM(price) OVER (PARTITION BY id ORDER BY date) ss
    from yourTable
)
select id, status_count, isnull(ss, 0) - price as price
from cte
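None of these answers produces the ratio column from your required output. A sketch of one way to add it, assuming (as your example 10/30 = 0.33 suggests) that ratio is the previous row's cumulative value divided by the current row's:
WITH cte AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) - 1 AS status_count,
           COALESCE(SUM(price) OVER (PARTITION BY id ORDER BY date), 0) - price AS prev_price
    FROM yourTable
)
SELECT id, status_count, prev_price AS price,
       COALESCE(LAG(prev_price) OVER (PARTITION BY id ORDER BY status_count) * 1.0
                / NULLIF(prev_price, 0), 0) AS ratio   -- 10/30 = 0.33 on the third row
FROM cte;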

SQL query to find continuous local max, min of date based on category column

I have the following data set
Customer_ID Category FROM_DATE TO_DATE
1 5 1/1/2000 12/31/2001
1 6 1/1/2002 12/31/2003
1 5 1/1/2004 12/31/2005
2 7 1/1/2010 12/31/2011
2 7 1/1/2012 12/31/2013
2 5 1/1/2014 12/31/2015
3 7 1/1/2010 12/31/2011
3 7 1/5/2012 12/31/2013
3 5 1/1/2014 12/31/2015
The result I want to achieve is to find the continuous local min/max dates for customers within the same category, and to identify any gap in the dates:
Customer_ID FROM_Date TO_Date Category
1 1/1/2000 12/31/2001 5
1 1/1/2002 12/31/2003 6
1 1/1/2004 12/31/2005 5
2 1/1/2010 12/31/2013 7
2 1/1/2014 12/31/2015 5
3 1/1/2010 12/31/2011 7
3 1/5/2012 12/31/2013 7
3 1/1/2014 12/31/2015 5
My code works fine for customer 1 (returns all 3 rows) and customer 2 (returns 2 rows, with the min and max date for each category), but for customer 3 it cannot identify the gap between 12/31/2011 and 1/5/2012 for category 7.
Customer_ID FROM_Date TO_Date Category
3 1/1/2010 12/31/2013 7
3 1/1/2014 12/31/2015 5
Here is my code:
SELECT Customer_ID, Category, min(From_Date), max(To_Date) FROM
(
SELECT Customer_ID, Category, From_Date,To_Date
,row_number() over (order by Customer_ID, To_Date) - row_number() over (partition by Customer_ID order by Category) as p
FROM FFS_SAMP
) X
group by Customer_ID,Category,p
order by Customer_ID,min(From_Date),Max(To_Date)
This is a type of gaps-and-islands problem. Probably the safest method is to use a cumulative max() to look for overlaps with previous records; where there is no overlap, an "island" of records starts. For customer 3, for example, the previous maximum to_date for category 7 is 12/31/2011, which does not reach the next from_date of 1/5/2012, so a new island starts there. So:
select customer_id, min(from_date), max(to_date), category
from (select t.*,
sum(case when prev_to_date >= from_date then 0 else 1 end) over
(partition by customer_id, category
order by from_date
) as grp
from (select t.*,
max(to_date) over (partition by customer_id, category
order by from_date
rows between unbounded preceding and 1 preceding
) as prev_to_date
from t
) t
) t
group by customer_id, category, grp;
Your attempt is quite close. You just need to fix the over() clause of the window functions:
select customer_id, category, min(from_date), max(to_date)
from (
select
fs.*,
row_number() over (partition by customer_id order by from_date)
- row_number() over (partition by customer_id, category order by from_date) as grp
from ffs_samp fs
) x
group by customer_id, category, grp
order by customer_id, min(from_date)
Note that this method assumes no gaps or overlaps in the periods of a given customer, as shown in your sample data.

Teradata SQL - Operation with max-min dates

Suppose I have the following table in Teradata SQL.
How can I get the variation in price between the latest and the earliest date, at user level? Regards
Initial table
user date price
1 1-1 10
1 2-1 20
1 3-1 30
2 1-1 12
2 2-1 22
2 3-1 32
3 1-1 13
3 2-1 23
3 3-1 33
Final table
user var_price
1 30/10-1
2 32/12-1
3 33/13-1
Try this-
SELECT B.[user],
CAST(SUM(B.max_price) AS VARCHAR)+'/'+CAST(SUM(B.min_price) AS VARCHAR)+ '-1' var_price,
SUM(B.max_price)/SUM(B.min_price) -1 calculated_var_price
FROM
(
SELECT * FROM
(
SELECT [user],0 max_price,price min_price,ROW_NUMBER() OVER (PARTITION BY [user] ORDER BY DATE) RN
FROM your_table
)A WHERE RN = 1
UNION ALL
SELECT * FROM
(
SELECT [user],price max_price,0 min_price, ROW_NUMBER() OVER (PARTITION BY [user] ORDER BY DATE DESC) RN
FROM your_table
)A WHERE RN = 1
)B
GROUP BY B.[user]
Output is-
user var_price calculated_var_price
1 30/10-1 2
2 32/12-1 1
3 33/13-1 1
Is this what you want?
select user, max(price) / min(price) - 1
from t
group by user;
Your values are monotonically increasing, so max() and min() seem like the simplest solution.
EDIT:
You can use window functions:
select user, max(last_price) / max(first_price) - 1
from (select t.*,
first_value(price) over (partition by user order by date rows between unbounded preceding and current row) as first_price,
first_value(price) over (partition by user order by date desc rows between unbounded preceding and current row) as last_price
from t
) t
group by user;
select user
       ,price as first_price
       ,last_value(price)
          over (partition by user
                order by date
                rows between unbounded preceding and unbounded following) as last_price
from mytab
qualify
    row_number() -- lowest date only
       over (partition by user
             order by date) = 1
This returns the row with the lowest date and adds the price from the latest date.
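One more caveat: if price is an integer column, the division in these answers truncates; that's why calculated_var_price shows 1 rather than 1.67 for user 2 in the first answer's output. A minimal sketch with an explicit cast, assuming a decimal result is wanted:
select user,
       cast(max(price) as decimal(18,4)) / min(price) - 1 as var_price
from t
group by user;   -- assumes price rises monotonically with date, as in the sample data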