SQL partition by on date range

Assume this is my table:
ID NUMBER DATE
------------------------
1 45 2018-01-01
2 45 2018-01-02
3 45 2018-01-27
I need to number the rows using partition by and row_number, starting a new partition whenever the difference between one date and the next is greater than 5 days. For the example above, the result would look like this:
ROWNUMBER ID NUMBER DATE
-----------------------------
1 1 45 2018-01-01
2 2 45 2018-01-02
1 3 45 2018-01-27
My actual query is something like this:
SELECT ROW_NUMBER() OVER(PARTITION BY NUMBER ORDER BY ID DESC) AS ROWNUMBER, ...
But as you can see, it doesn't take the dates into account. How can I achieve that?

You can use the lag function:
select *, row_number() over (partition by number, grp order by id) as [ROWNUMBER]
from (select *, (case when datediff(day, lag(date,1,date) over (partition by number order by id), date) <= 5
then 1 else 2
end) as grp -- 2 marks a row that follows a gap of more than 5 days
from table
) t;

By using the lag and datediff functions:
select * from
(
select t.*,
datediff(day,
lag(DATE) over (partition by NUMBER order by id),
DATE
) as diff
from t
) as TT where diff>5
http://sqlfiddle.com/#!18/130ae/11

I think you want to identify the groups, using lag() and datediff() and a cumulative sum. Then use row_number():
select t.*,
row_number() over (partition by number, grp order by date) as rownumber
from (select t.*,
sum(grp_start) over (partition by number order by date) as grp
from (select t.*,
(case when lag(date) over (partition by number order by date) < dateadd(day, -5, date)
then 1 else 0
end) as grp_start
from t
) t
) t;
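The lag, cumulative sum, and row_number steps can be tried end to end on a small in-memory table. This is a minimal sketch, not the answerers' exact code: it assumes Python's bundled sqlite3 module with a SQLite build that supports window functions (3.25 or later), uses julianday() as a stand-in for datediff(), and assumes a table named t whose third row has ID 3, as in the expected output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, number INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    # Sample rows; the ID of the last row is assumed to be 3.
    [(1, 45, "2018-01-01"), (2, 45, "2018-01-02"), (3, 45, "2018-01-27")],
)

query = """
WITH flagged AS (
    -- 1 when the gap to the previous date is more than 5 days, else 0.
    -- The first row of each number has no previous date, so it gets 0.
    SELECT t.*,
           CASE WHEN julianday(date)
                     - julianday(LAG(date) OVER (PARTITION BY number ORDER BY date)) > 5
                THEN 1 ELSE 0 END AS grp_start
    FROM t
),
grouped AS (
    -- The running total of the flags gives every island its own group id.
    SELECT flagged.*,
           SUM(grp_start) OVER (PARTITION BY number ORDER BY date) AS grp
    FROM flagged
)
SELECT ROW_NUMBER() OVER (PARTITION BY number, grp ORDER BY date) AS rownumber,
       id, number, date
FROM grouped
ORDER BY id
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because the group id is a running total, the numbering restarts after every gap, however many gaps a given number has.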

Related

How to find the time and step between status change

I'm trying to query a dataset about user status changes, and I want to find out the time it takes for the status to change and the number of steps (rows) in between.
Example data:
user_id Status date
------------------------
1 a 2001-01-01
1 a 2001-01-08
1 b 2001-01-15
1 b 2001-01-28
1 a 2001-01-31
1 b 2001-02-01
2 a 2001-01-08
2 a 2001-01-18
2 a 2001-01-28
3 b 2001-03-08
3 b 2001-03-18
3 b 2001-03-19
3 a 2001-03-20
Desired output:
user_id | From | to | days in between | Steps in between
--------------------------------------------------------
1 | a | b | 14 | 2
1 | b | a | 16 | 2
1 | a | b | 1 | 1
3 | b | a | 12 | 3

You might consider the approach below:
WITH partitions AS (
SELECT *, COUNTIF(flag) OVER w AS part FROM (
SELECT *, ROW_NUMBER() OVER w AS rn, status <> LAG(status) OVER w AS flag,
FROM sample_data
WINDOW w AS (PARTITION BY user_id ORDER BY date)
) WINDOW w AS (PARTITION BY user_id ORDER BY date)
)
SELECT user_id,
LAG(ANY_VALUE(status)) OVER w AS `from`,
ANY_VALUE(status) AS `to`,
EXTRACT(DAY FROM MIN(date) - LAG(MIN(date)) OVER w) AS days_in_between,
MIN(rn) - LAG(MIN(rn)) OVER w AS steps_in_between
FROM partitions
GROUP BY user_id, part
QUALIFY `from` IS NOT NULL
WINDOW w AS (PARTITION BY user_id ORDER BY MIN(date));

with main as (
select
*,
dense_rank() over(partition by user_id order by date) as rank_,
row_number() over(partition by user_id, status order by date) as rank_2,
row_number() over(partition by user_id, status order by date) - dense_rank() over(partition by user_id order by date) as diff,
row_number() over(partition by user_id order by date) as row_num,
lag(status) over(partition by user_id order by date) as prev_status,
concat(lag(status) over(partition by user_id order by date) , ' to ' , status) as status_change
from table
),
new_rank as (
select
*,
row_num - diff as row_num_diff,
-- diff stays constant within a run of consecutive rows with the same status
min(date) over(partition by user_id, status, diff) as min_date
from main
),
prev_date as (
select
*,
lag(min_date) over(partition by user_id order by date) as prev_min_date
from new_rank
)
select
user_id,
prev_status as `from`,
status as `to`,
date_diff(min_date, prev_min_date, DAY) as days_in_between
from prev_date
where status != prev_status and prev_status is not null
Does this seem to work? It is hard to verify without a fiddle. Two notes:
You may remove the extra steps/ranks that I have added; I left them there so you can see what they are doing.
I don't understand your steps logic, so it is missing from the code.
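To make the intended result concrete, here is a minimal sketch in plain Python rather than SQL. It assumes that "days in between" is measured from the first row of one status run to the first row of the next, and that "steps in between" is the number of rows in the run being left; both readings are inferred from the desired output, and the function name status_changes is hypothetical.

```python
from datetime import date
from itertools import groupby

rows = [
    (1, "a", date(2001, 1, 1)), (1, "a", date(2001, 1, 8)),
    (1, "b", date(2001, 1, 15)), (1, "b", date(2001, 1, 28)),
    (1, "a", date(2001, 1, 31)), (1, "b", date(2001, 2, 1)),
    (2, "a", date(2001, 1, 8)), (2, "a", date(2001, 1, 18)),
    (2, "a", date(2001, 1, 28)),
    (3, "b", date(2001, 3, 8)), (3, "b", date(2001, 3, 18)),
    (3, "b", date(2001, 3, 19)), (3, "a", date(2001, 3, 20)),
]


def status_changes(rows):
    """Return (user_id, from, to, days_in_between, steps_in_between) tuples."""
    result = []
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    for user_id, user_rows in groupby(ordered, key=lambda r: r[0]):
        # Collapse consecutive rows with the same status into islands.
        islands = [(status, list(run))
                   for status, run in groupby(user_rows, key=lambda r: r[1])]
        # Each pair of neighbouring islands is one status change.
        for (prev_status, prev_run), (status, run) in zip(islands, islands[1:]):
            days = (run[0][2] - prev_run[0][2]).days
            result.append((user_id, prev_status, status, days, len(prev_run)))
    return result


changes = status_changes(rows)
for change in changes:
    print(change)
```

User 2 never changes status, so it produces no rows, which matches the desired output.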

Need to get maximum date range which is overlapping in SQL

I have a table with 3 columns id, start_date, end_date
Some of the values are as follows:
1 2018-01-01 2030-01-01
1 2017-10-01 2018-10-01
1 2019-01-01 2020-01-01
1 2015-01-01 2016-01-01
2 2010-01-01 2011-02-01
2 2010-10-01 2010-12-01
2 2008-01-01 2009-01-01
With this kind of data set, for each id I have to filter out overlapping date ranges by keeping only the longest range among those that overlap, while also keeping any date range that does not overlap with another.
Hence desired output should be:
1 2018-01-01 2030-01-01
1 2015-01-01 2016-01-01
2 2010-01-01 2011-02-01
2 2008-01-01 2009-01-01
I am unable to find the right way to code this in Impala. Can someone please help me?
I have tried something like:
with cte as(
select a.*, row_number() over(partition by id order by datediff(end_date , start_date) desc) as flag from mytable a) select * from cte where flag=1
but this also removes the other date ranges that do not overlap. Please help.

Use row_number with countItem for each id:
with cte as(
select *,
row_number() over(partition by id order by id) as seq,
count(*) over(partition by id order by id) as countItem
from mytable
)
select id,start_date,end_date
from cte
where seq = 1 or seq = countItem
Or without a CTE:
select id,start_date,end_date
from
(select *,
row_number() over(partition by id order by id) as seq,
count(*) over(partition by id order by id) as countItem
from mytable) t
where seq = 1 or seq = countItem
demo in db<>fiddle

You can use a cumulative max to see if there is any overlap with preceding rows. If there is not, then you have the first row of a new group (row in the result set).
A cumulative sum of the starts assigns each row in the source to a group. Then aggregate:
select id, min(start_date), max(end_date)
from (select t.*,
sum(case when prev_end_date >= start_date then 0 else 1 end) over
(partition by id
order by start_date
rows between unbounded preceding and current row
) as grp
from (select t.*,
max(end_date) over (partition by id
order by start_date
rows between unbounded preceding and 1 preceding
) as prev_end_date
from t
) t
) t
group by id, grp;
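The grouping step can be illustrated outside the database. The sketch below is plain Python, not Impala SQL, and rests on assumptions: it uses the same running-maximum test to decide where a new group starts, but instead of aggregating with min and max it keeps the single longest range in each group, which is how the question's desired output reads. The function name longest_per_overlap_group is hypothetical.

```python
from datetime import date
from itertools import groupby

ranges = [
    (1, date(2018, 1, 1), date(2030, 1, 1)),
    (1, date(2017, 10, 1), date(2018, 10, 1)),
    (1, date(2019, 1, 1), date(2020, 1, 1)),
    (1, date(2015, 1, 1), date(2016, 1, 1)),
    (2, date(2010, 1, 1), date(2011, 2, 1)),
    (2, date(2010, 10, 1), date(2010, 12, 1)),
    (2, date(2008, 1, 1), date(2009, 1, 1)),
]


def longest_per_overlap_group(ranges):
    """Keep the longest (id, start, end) range from each group of overlapping ranges."""
    kept = []
    ordered = sorted(ranges, key=lambda r: (r[0], r[1]))
    for _, id_ranges in groupby(ordered, key=lambda r: r[0]):
        group = []
        running_end = None  # cumulative max of end dates seen so far in the group
        for r in id_ranges:
            if running_end is not None and r[1] <= running_end:
                # Overlaps something earlier: same group.
                group.append(r)
                running_end = max(running_end, r[2])
            else:
                # No overlap with preceding rows: close the old group, start a new one.
                if group:
                    kept.append(max(group, key=lambda g: g[2] - g[1]))
                group = [r]
                running_end = r[2]
        if group:
            kept.append(max(group, key=lambda g: g[2] - g[1]))
    return kept


result = longest_per_overlap_group(ranges)
for row in result:
    print(row)
```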

How to use over partition

I have this table:
ID BS time
1 1 14:10:00
1 1 14:10:05
1 1 15:04:03
1 2 16:18:05
1 2 17:00:09
1 3 18:33:50
1 1 19:03:14
1 1 19:10:23
and expect:
ID BS start_time end_time
1 1 14:10:00 16:18:05
1 2 16:18:05 18:33:50
1 3 18:33:50 19:03:14
1 1 19:03:14 19:10:23
I tried using lead, but I don't know how to handle the case where a BS value appears again after its run has ended:
SELECT id,bs,time,--min(time) time_start,
lead(time,1) over (partition by id order by time) next_time,
FROM `sage-facet-114619.Temp_data.temp_table`
order by id,time
I then thought about applying group by to this result, but I run into a problem with the repeated BS values.

Below is for BigQuery Standard SQL (it actually returns the expected result, which is not the case with the other two answers):
#standardSQL
SELECT id, bs,
MIN(time) AS start_time,
MAX(IFNULL(end_time, time)) AS end_time
FROM (
SELECT id, bs, time, end_time,
COUNTIF(flag) OVER(PARTITION BY id ORDER BY time) AS grp
FROM (
SELECT *,
LEAD(time) OVER win AS end_time,
bs != LAG(bs) OVER win AS flag
FROM `sage-facet-114619.Temp_data.temp_table`
WINDOW win AS (PARTITION BY id ORDER BY time)
)
)
GROUP BY id, bs, grp
If applied to the sample data from your question, the output is:
Row id bs start_time end_time
1 1 1 14:10:00 16:18:05
2 1 2 16:18:05 18:33:50
3 1 3 18:33:50 19:03:14
4 1 1 19:03:14 19:10:23

This is a gaps-and-islands problem. The difference of row numbers is one solution:
select id, min(time) as start_time,
lead(min(time)) over (partition by id order by min(time)) as end_time
from (select t.*,
row_number() over (order by time) as seqnum,
row_number() over (partition by id order by time) as seqnum_2
from `sage-facet-114619.Temp_data.temp_table` t
) t
group by id, (seqnum - seqnum_2);
Another solution in this case is lag():
select id, time as start_time,
lead(time) over (partition by id order by time) as end_time
from (select t.*,
lag(id) over (order by time) as prev_id
from `sage-facet-114619.Temp_data.temp_table` t
) t
where prev_id is null or prev_id <> id

I have fixed a few oversights in Gordon's query. As I have not used BigQuery myself, I can't speak to its "standard" syntax or features, but I trust that the query is reliable apart from the relatively minor changes I made.
select
id, bs, min(time) as start_time,
coalesce(
lead(min(time)) over (partition by id order by min(time)),
max(time) -- corrected: max() rather than min()
) as end_time
from (
select t.*,
row_number() over (order by time) as seqnum,
row_number() over (partition by id, bs order by time) as seqnum_2
from t
) t
group by id, bs, seqnum - seqnum_2;
Please compare results (running against SQL Server): https://rextester.com/WCSL25882
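The difference-of-row-numbers technique can also be checked locally. This is a minimal sketch under stated assumptions: it uses Python's bundled sqlite3 module (SQLite 3.25 or later for window functions), a table named t with times stored as text, and it partitions the first row number by id, which is a choice made here rather than something taken from the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, bs INTEGER, time TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, 1, "14:10:00"), (1, 1, "14:10:05"), (1, 1, "15:04:03"),
        (1, 2, "16:18:05"), (1, 2, "17:00:09"), (1, 3, "18:33:50"),
        (1, 1, "19:03:14"), (1, 1, "19:10:23"),
    ],
)

query = """
WITH numbered AS (
    -- seqnum counts all rows per id, seqnum_2 counts rows per (id, bs).
    -- Their difference is constant within a run of consecutive equal bs values.
    SELECT id, bs, time,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) AS seqnum,
           ROW_NUMBER() OVER (PARTITION BY id, bs ORDER BY time) AS seqnum_2
    FROM t
),
islands AS (
    SELECT id, bs, MIN(time) AS start_time, MAX(time) AS last_time
    FROM numbered
    GROUP BY id, bs, seqnum - seqnum_2
)
-- An island ends where the next one starts; the last island ends at its own last row.
SELECT id, bs, start_time,
       COALESCE(LEAD(start_time) OVER (PARTITION BY id ORDER BY start_time),
                last_time) AS end_time
FROM islands
ORDER BY id, start_time
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Grouping by bs as well as by the difference matters here: the second run of bs = 1 and the run of bs = 2 share the same difference, and only the bs column keeps them apart.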

Minimum and maximum dates within continuous date range grouped by name

I have date ranges with a start and end date for each person, and I want to get only the continuous date ranges per person:
Input:
NAME | STARTDATE | END DATE
--------------------------------------
MIKE | **2019-05-15** | 2019-05-16
MIKE | 2019-05-17 | **2019-05-18**
MIKE | 2020-05-18 | 2020-05-19
Expected output like:
MIKE | **2019-05-15** | **2019-05-18**
MIKE | 2020-05-18 | 2020-05-19
So basically output is MIN and MAX for each continuous period for the person.
Appreciate any help.
I have tried the below query:
WITH N AS (
SELECT Name, StartDate, EndDate
, LastStop = MAX(EndDate)
OVER (PARTITION BY Name ORDER BY StartDate, EndDate
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
FROM Table
), B AS (
SELECT Name, StartDate, EndDate
, Block = SUM(CASE WHEN LastStop Is Null Then 1
WHEN LastStop < StartDate Then 1
ELSE 0
END)
OVER (PARTITION BY Name ORDER BY StartDate, LastStop)
FROM N
)
SELECT Name
, MIN(StartDate) DateFrom
, MAX(EndDate) DateTo
FROM B
GROUP BY Name, Block
ORDER BY Name, Block
But it's not considering the continuous period. It returns the same rows as the input.

This is a type of gap-and-islands problem. There is no need to expand the data out by day! That seems very inefficient.
Instead, determine the "islands". This is where there is no overlap -- in your case lag() is sufficient. Then a cumulative sum and aggregation:
select name, min(startdate), max(enddate)
from (select t.*,
sum(case when prev_enddate >= dateadd(day, -1, startdate) then 0 else 1 end) over
(partition by name order by startdate) as grp
from (select t.*,
lag(enddate) over (partition by name order by startdate) as prev_enddate
from t
) t
) t
group by name, grp;
Here is a db<>fiddle.
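The same island logic can be sketched in plain Python, which makes the "adjacent by one day" rule easy to see. This is an illustrative sketch, not a translation the answer provides: the function name merge_continuous is hypothetical, and, like the lag-based query, it compares each row only with the end date of the row immediately before it.

```python
from datetime import date, timedelta
from itertools import groupby

rows = [
    ("MIKE", date(2019, 5, 15), date(2019, 5, 16)),
    ("MIKE", date(2019, 5, 17), date(2019, 5, 18)),
    ("MIKE", date(2020, 5, 18), date(2020, 5, 19)),
]


def merge_continuous(rows):
    """Merge (name, start, end) ranges that overlap or touch within one day."""
    merged = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for name, person_rows in groupby(ordered, key=lambda r: r[0]):
        island_start = island_end = prev_end = None
        for _, start, end in person_rows:
            # Same test as: prev_enddate >= dateadd(day, -1, startdate)
            if prev_end is not None and prev_end >= start - timedelta(days=1):
                island_end = max(island_end, end)
            else:
                if island_start is not None:
                    merged.append((name, island_start, island_end))
                island_start, island_end = start, end
            prev_end = end
        merged.append((name, island_start, island_end))
    return merged


result = merge_continuous(rows)
for row in result:
    print(row)
```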

Here is an example using an ad-hoc tally table
Example or dbFiddle
;with cte as (
Select A.[Name]
,B.D
,Grp = datediff(day,'1900-01-01',D) - dense_rank() over (partition by [Name] Order by D)
From YourTable A
Cross Apply (
Select Top (DateDiff(DAY,StartDate,EndDate)+1) D=DateAdd(DAY,-1+Row_Number() Over (Order By (Select Null)),StartDate)
From master..spt_values n1,master..spt_values n2
) B
)
Select [Name]
,StartDate= min(D)
,EndDate = max(D)
From cte
Group By [Name],Grp
Returns
Name StartDate EndDate
MIKE 2019-05-15 2019-05-18
MIKE 2020-05-18 2020-05-19

This will give you the same result
SELECT subquery.name,min(subquery.startdate),max(subquery.enddate1)
FROM (SELECT NAME,startdate,
CASE WHEN EXISTS(SELECT yt1.startdate
FROM t yt1
WHERE yt1.startdate = DATEADD(day, 1, yt2.enddate)
) THEN null else yt2.enddate END as enddate1
FROM t yt2) as subquery
GROUP by NAME, CAST(MONTH(subquery.startdate) AS VARCHAR(2)) + '-' + CAST(YEAR(subquery.startdate) AS VARCHAR(4))
For the CASE WHEN EXISTS, I referred to SQL CASE.
For grouping by month and year, see GROUP BY MONTH AND YEAR.
DB_FIDDLE

SQL - unique users who are visiting for the first time

Given the following table visitorLog, write a SQL query to find the following by date:
Total_Visitors
VisitorGain - compare to previous day
VisitorLoss - compare to previous day
Total_New_Visitors - unique users who are visiting for the first time
visitorLog :
*----------------------*
| Date Visitor |
*----------------------*
| 01-Jan-2011 V1 |
| 01-Jan-2011 V2 |
| 01-Jan-2011 V3 |
| 02-Jan-2011 V2 |
| 03-Jan-2011 V2 |
| 03-Jan-2011 V4 |
| 03-Jan-2011 V5 |
*----------------------*
Expected output:
*---------------------------------------------------------------------*
| Date Total_Visitors VisitorGain VisitorLoss Total_New_Visitors |
*---------------------------------------------------------------------*
| 01-Jan-2011 3 3 0 3 |
| 02-Jan-2011 1 0 2 0 |
| 03-Jan-2011 3 2 0 2 |
*---------------------------------------------------------------------*
Here is my SQL and SQL Fiddle.
with cte as
(
select
date,
total_visitors,
lag(total_visitors) over (order by date) as prev_visitors,
row_number() over (order by date ) as rnk
from
(
select
*,
count(visitor) over (partition by date) as total_visitors
from visitorLog
) val
group by
date,
total_visitors
),
cte2 as
(
select
date,
sum(case when rnk = 1 then 1 else 0 end) as total_new_visitors
from
(
select
date,
visitor,
row_number() over (partition BY visitor order by date) as rnk
from visitorLog
) t
group by
date
)
select
c.date,
sum(total_visitors) as total_visitors,
sum(
case
when rnk = 1 then total_visitors
when (rnk > 1 and prev_visitors < total_visitors) then (total_visitors - prev_visitors)
else
0
end
)visitorGain,
sum(
case
when rnk = 1 then 0
when prev_visitors > total_visitors then (prev_visitors - total_visitors)
else
0
end
) as visitorLoss,
sum(total_new_visitors) as total_new_visitors
from cte c
join cte2 c2
on c.date = c2.date
group by
c.date
order by
c.date
My solution is working as expected, but I am wondering if I am missing any edge cases that may break my logic. Any help would be great.

This logic does what you want:
select date, count(*) as num_visitor,
greatest(count(*) - lag(count(*)::int, 1, 0) over (order by date), 0) as visitor_gain,
greatest(lag(count(*)::int, 1, 0) over (order by date) - count(*), 0) as visitor_loss,
count(*) filter (where seqnum = 1) as num_new_visitors
from (select vl.*,
row_number() over (partition by visitor order by date) as seqnum
from visitorLog vl
) vl
group by date
order by date
Here is a db<>fiddle.

I would use window functions and aggregation:
select
date,
count(*) no_visitor,
count(*) - lag(count(*), 1, 0) over(order by date) no_visitor_diff,
count(*) filter(where rn = 1) no_new_visitors
from (
select t.*, row_number() over(partition by visitor order by date) rn
from visitorLog t
) t
group by date
order by date
The subquery ranks the visits of each customer using row_number() (the first visit of each customer gets row number 1). Then, the outer query aggregates by date, and uses lag() to get the visitor count of the "previous" day.
I don't really see the point to have two distinct columns for the difference of visitors compared to the last day, so this gives you a single column, with a value that's either positive or negative depending whether customers were gained or lost.
If you really want two columns, then:
greatest(count(*) - lag(count(*), 1, 0) over(order by date), 0) visitor_gain,
- least(count(*) - lag(count(*), 1, 0) over(order by date), 0) visitor_loss
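As a cross-check on the expected output, the four columns can be computed in a few lines of plain Python. This is a sketch under assumptions: dates are written as ISO strings so that they sort correctly, each visitor appears at most once per day, and the function name daily_visitor_stats is hypothetical.

```python
from collections import Counter

log = [
    ("2011-01-01", "V1"), ("2011-01-01", "V2"), ("2011-01-01", "V3"),
    ("2011-01-02", "V2"),
    ("2011-01-03", "V2"), ("2011-01-03", "V4"), ("2011-01-03", "V5"),
]


def daily_visitor_stats(log):
    """Return (date, total, gain, loss, new_visitors) per day, ordered by date."""
    totals = Counter(day for day, _ in log)

    # First day on which each visitor appears (the rows with row_number() = 1).
    first_seen = {}
    for day, visitor in sorted(log):
        first_seen.setdefault(visitor, day)
    new_counts = Counter(first_seen.values())

    stats = []
    prev_total = 0  # plays the role of lag(count(*), 1, 0)
    for day in sorted(totals):
        diff = totals[day] - prev_total
        stats.append((day, totals[day], max(diff, 0), max(-diff, 0), new_counts[day]))
        prev_total = totals[day]
    return stats


stats = daily_visitor_stats(log)
for row in stats:
    print(row)
```

Like the queries above, this compares each day with the previous day present in the log, so a calendar day with no visits at all is skipped rather than counted as a loss.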

We Keep Coding
