Partitioning on non-unique values - SQL

I have a table that lists events, the operations in each event, and the time of each operation. Event ID is not unique: the same event repeats at different times, and the operations may differ between occurrences of the same event type. (The same event never runs twice in a row.)
I want to populate three new columns as per the given example. This will allow me to run analysis on the separate events, as I'll be able to generate a unique "Event" ID.
Edit:
I already tried partitioning by event and it didn't work: SQL Server sees only two events (A and B) and therefore gives the same start date to all "A" rows, even though I need to treat them as separate events with different start dates.
Thank you!

This is just window functions:
select t.*,
min(operationtime) over (partition by event) as event_start_time,
max(operationtime) over (partition by event) as event_end_time,
concat(event, '-', min(operationtime) over (partition by event)) as event_id
from t;
Actually, for the event id, you probably want something like:
concat(event, '-', convert(varchar(255), min(operationtime) over (partition by event), 101)) as event_id
or whatever date format you really want (style 101, used above, gives MM/DD/YYYY; I recommend YYYY-MM-DD, which is style 23).
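For instance, a minimal sketch of the YYYY-MM-DD variant, assuming the same table and column names as above:
concat(event, '-', convert(varchar(10), min(operationtime) over (partition by event), 23)) as event_id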

I understand this as a gaps-and-islands problem, where you want to build groups of consecutive daily events.
One option uses the difference between row numbers to identify the groups:
select
t.*,
min(operation_time) over(partition by event, rn1 - rn2) event_start_time,
max(operation_time) over(partition by event, rn1 - rn2) event_end_time,
concat(event, '-', min(operation_time) over(partition by event, rn1 - rn2)) event_id
from (
select
t.*,
row_number() over(order by operation_time) rn1,
row_number() over(partition by event order by operation_time) rn2
from mytable t
) t
order by operation_time
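To see why rn1 - rn2 identifies the groups, run the inner subquery on its own: within a run of consecutive rows of the same event, rn1 and rn2 both increase by 1, so their difference stays constant, and it changes whenever another event interrupts the run. A sketch with illustrative values, assuming six daily rows with events A, A, B, B, A, A:
select
t.*,
row_number() over(order by operation_time) rn1,
row_number() over(partition by event order by operation_time) rn2
from mytable t
order by operation_time
-- event: A  A  B  B  A  A
-- rn1:   1  2  3  4  5  6
-- rn2:   1  2  1  2  3  4
-- diff:  0  0  2  2  2  2
Note that the second A run and the B run share the difference 2; this is why the outer query partitions by (event, rn1 - rn2), not by the difference alone.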
If there is always one and only one event per day, as shown in your sample data, then one row_number() is sufficient, along with date arithmetic:
select
t.*,
min(operation_time) over(partition by event, grp) event_start_time,
max(operation_time) over(partition by event, grp) event_end_time,
concat(event, '-', min(operation_time) over(partition by event, grp)) event_id
from (
select
t.*,
dateadd(
day,
- row_number() over(partition by event order by operation_time),
operation_time
) grp
from mytable t
) t
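The dateadd() trick works for the same reason: within a run of one-event-per-day rows, subtracting the row number from the date produces one constant date per run, which serves as the group key. An illustrative trace with hypothetical dates:
-- Event A on 2020-01-01 and 2020-01-02 gets rn = 1, 2:
--   dateadd(day, -1, '2020-01-01') = 2019-12-31
--   dateadd(day, -2, '2020-01-02') = 2019-12-31  -> same grp
-- A later A run on 2020-01-05 and 2020-01-06 gets rn = 3, 4:
--   dateadd(day, -3, '2020-01-05') = 2020-01-02
--   dateadd(day, -4, '2020-01-06') = 2020-01-02  -> new grp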

This approach creates the event group explicitly, then it uses a windowing query very similar to the other answers. I created a simple sample table to show results.
Data
drop table if exists #tTEST;
go
select * INTO #tTEST from (values
('A', 'X', '2020-01-08'),
('A', 'Z', '2020-02-08'),
('B', 'X', '2020-03-08'),
('B', 'Z', '2020-04-08'),
('A', 'X', '2020-05-08'),
('A', 'Z', '2020-06-08')) V([Event], [Operation], operation_time);
Query
;with
grp_cte as (
select t.*, case when lag([Event], 1, 0) over (order by operation_time) != [Event] then 1 else 0 end grp_ind
from #tTEST t),
event_grp_cte as (
select gc.*, sum(grp_ind) over (order by operation_time) EventGroup
from grp_cte gc)
select
t.*,
min(operation_time) over(partition by EventGroup) event_start_time,
max(operation_time) over(partition by EventGroup) event_end_time,
concat(event, '-', min(operation_time) over(partition by EventGroup)) event_id
from event_grp_cte t
order by operation_time;
Results
Event Operation operation_time grp_ind EventGroup event_start_time event_end_time event_id
A X 2020-01-08 1 1 2020-01-08 2020-02-08 A-2020-01-08
A Z 2020-02-08 0 1 2020-01-08 2020-02-08 A-2020-01-08
B X 2020-03-08 1 2 2020-03-08 2020-04-08 B-2020-03-08
B Z 2020-04-08 0 2 2020-03-08 2020-04-08 B-2020-03-08
A X 2020-05-08 1 3 2020-05-08 2020-06-08 A-2020-05-08
A Z 2020-06-08 0 3 2020-05-08 2020-06-08 A-2020-05-08

Related

Get last date of modification in database by value

How is it possible to get the last change (by date) in this table:
id  date        value
1   01.01.2021  0.0
1   02.01.2021  10.0
1   03.01.2021  15.0
1   04.01.2021  25.0
1   05.01.2021  25.0
1   06.01.2021  25.0
Of course I could use a WHERE clause and it would work, but I have a lot of rows and for some I don't know exactly which day the change happened.
The result should be:
id  date        value
1   04.01.2021  25.0
Try this one (BigQuery syntax):
with mytable as (
select 1 as id, date '2021-01-01' as date, 0 as value union all
select 1, date '2021-01-02', 10 union all
select 1, date '2021-01-03', 15 union all
select 1, date '2021-01-04', 25 union all
select 1, date '2021-01-05', 25 union all
select 1, date '2021-01-06', 25
)
select id, array_agg(struct(date, value) order by last_change_date desc limit 1)[offset(0)].*
from (
select *, if(value != lag(value) over (partition by id order by date), date, null) as last_change_date
from mytable
)
group by id
In this scenario I would use two fields in the database, "created_at" and "updated_at", with type "timestamp". You may then simply fetch your records ordering by the "updated_at" field.
See what this gives you:
SELECT MAX(date) OVER (PARTITION BY(value)) AS lastChange
FROM Table
WHERE id = 1
The following query and reproducible example on db-fiddle work. I've also included some additional test records.
CREATE TABLE my_data (
`id` INTEGER,
`date` date,
`value` INTEGER
);
INSERT INTO my_data
(`id`, `date`, `value`)
VALUES
('1', '01.01.2021', '0.0'),
('1', '02.01.2021', '10.0'),
('1', '03.01.2021', '15.0'),
('1', '04.01.2021', '25.0'),
('1', '05.01.2021', '25.0'),
('1', '06.01.2021', '25.0'),
('2', '05.01.2021', '25.0'),
('2', '06.01.2021', '23.0'),
('3', '03.01.2021', '15.0'),
('3', '04.01.2021', '25.0'),
('3', '05.01.2021', '17.0'),
('3', '06.01.2021', '17.0');
Query #1
SELECT
id,
date,
value
FROM (
SELECT
*,
row_number() over (partition by id order by date desc) as id_rank
FROM (
SELECT
id,
m1.date,
m1.value,
rank() over (partition by id,m1.value order by date asc) as id_value_rank,
CASE
WHEN (m1.date = (max(m1.date) over (partition by id,m1.value ))) THEN 1
ELSE 0
END AS is_max_date_for_group,
CASE
WHEN (m1.date = (max(m1.date) over (partition by id ))) THEN 1
ELSE 0
END AS is_max_date_for_id
from
my_data m1
) m2
WHERE (m2.is_max_date_for_group = m2.is_max_date_for_id and is_max_date_for_group <> 0 and id_value_rank=1) or (id_value_rank=1 and is_max_date_for_id=0)
) t
where t.id_rank=1
order by id, date, value;
id  date        value
1   04.01.2021  25
2   06.01.2021  23
3   05.01.2021  17
I actually find that the simplest method is to enumerate the rows by id/date and by id/date/value in descending order. These are the same for the last group . . . and the rest is aggregation:
select id, min(date), value
from (select t.*,
row_number() over (partition by id order by date desc) as seqnum,
row_number() over (partition by id, value order by date desc) as seqnum_2
from t
) t
where seqnum = seqnum_2
group by id, value;
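Tracing the sample for id 1 (values 0, 10, 15, 25, 25, 25) shows why seqnum = seqnum_2 isolates the last run of identical values:
-- date        value  seqnum  seqnum_2
-- 06.01.2021  25     1       1    <- kept
-- 05.01.2021  25     2       2    <- kept
-- 04.01.2021  25     3       3    <- kept
-- 03.01.2021  15     4       1
-- 02.01.2021  10     5       1
-- 01.01.2021  0      6       1
min(date) over the kept rows yields 04.01.2021, the date of the last change.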
If you use lag(), I would recommend using qualify for performance (qualify is available in databases such as Teradata, Snowflake, and BigQuery):
select t.*
from (select t.*
from t
qualify lag(value) over (partition by id order by date) <> value or
lag(value) over (partition by id order by date) is null
) t
qualify row_number() over (partition by id order by date desc) = 1;
Note: Both of these work if the value is the same for all rows. Other methods may not work in that situation.

SQL - Combine two rows if difference is below threshold

I have a table like this in SQL Server:
id start_time end_time
1 10:00:00 10:34:00
2 10:38:00 10:52:00
3 10:53:00 11:23:00
4 11:24:00 11:56:00
5 14:20:00 14:40:00
6 14:41:00 14:59:00
7 15:30:00 15:40:00
What I would like to have is a query that outputs consolidated records based on the time difference between two consecutive records (the end_time of row n and the start_time of row n+1). All records where the time difference is less than 2 minutes should be combined into one time entry, keeping the ID of the first record. This should also combine more than two records if multiple consecutive records each have a time difference of less than 2 minutes.
This would be the expected output:
id start_time end_time
1 10:00:00 10:34:00
2 10:38:00 11:56:00
5 14:20:00 14:59:00
7 15:30:00 15:40:00
Thanks in advance for any tips on how to build the query.
Edit:
I started with the following code to calculate the lead_time and the time difference, but I do not know how to group and consolidate.
WITH rows AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Id) AS rn
FROM #temp
)
SELECT mc.id, mc.start_time, mc.end_time, mp.start_time lead_time, DATEDIFF(MINUTE, mc.[end_time], mp.[start_time]) as DiffToNewSession
FROM rows mc
LEFT JOIN rows mp
ON mc.rn = mp.rn - 1
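As a side note, the rn self-join in this starting code can be written with lead() instead, which avoids scanning the CTE twice; a sketch against the same #temp table:
SELECT id, start_time, end_time,
LEAD(start_time) OVER (ORDER BY id) AS lead_time,
DATEDIFF(MINUTE, end_time, LEAD(start_time) OVER (ORDER BY id)) AS DiffToNewSession
FROM #temp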
Window functions in T-SQL can handle many kinds of data statistics, such as this:
create table #temp(id int identity(1,1), start_time time, end_time time)
insert into #temp(start_time, end_time)
values ('10:00:00', '10:34:00')
, ('10:38:00', '10:52:00')
, ('10:53:00', '11:23:00')
, ('11:24:00', '11:56:00')
, ('14:20:00', '14:40:00')
, ('14:41:00', '14:59:00')
, ('15:30:00', '15:40:00')
;with c0 as(
select *, LAG(end_time,1,'00:00:00') over (order by id) as lag_time
from #temp
), c1 as(
select *, case when DATEDIFF(MI, lag_time, start_time) <= 2 then 1 else 0 end as gflag
from c0
), c2 as(
select *, SUM(case when gflag=0 then 1 else 0 end) over(order by id) as gid
from c1
)
select MIN(id) as id, MIN(start_time) as start_time, MAX(end_time) as end_time
from c2
group by gid
To better describe the process of building up the data, I simply used c0, c1, c2, ... to represent the levels; you can merge some of the levels and optimize. If you can't use id as the sorting condition, you will need to change the ordering part in the statement above.
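Running these CTEs against the #temp data returns exactly the expected output from the question:
id  start_time  end_time
1   10:00:00    10:34:00
2   10:38:00    11:56:00
5   14:20:00    14:59:00
7   15:30:00    15:40:00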
You can use a recursive CTE to get the result that you want. This method simply compares the current end_time with the next start_time; if the difference is below the 2-minute threshold, it carries the same start_time forward as grp_start. At the end, it simply does a GROUP BY on grp_start.
with rcte as
(
-- anchor member
select *, grp_start = start_time
from tbl
where id = 1
union all
-- recursive member
select t.id, t.start_time, t.end_time,
grp_start = case when datediff(second, r.end_time, t.start_time) <= 120
then r.grp_start
else t.start_time
end
from tbl t
inner join rcte r on t.id = r.id + 1
)
select id = min(id), grp_start as start_time, max(end_time) as end_time
from rcte
group by grp_start
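One caveat: SQL Server limits recursive CTEs to 100 recursion levels by default, so on tables with more rows than that, the final query needs a hint, for example:
select id = min(id), grp_start as start_time, max(end_time) as end_time
from rcte
group by grp_start
option (maxrecursion 0)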
I guess this should do the trick without recursion. Again, I used several CTEs to make the solution a bit easier to read; I guess it can be reduced a little...
CREATE TABLE T1 (id int, starttime time, endtime time) -- schema inferred from the columns used below
GO
INSERT INTO T1 VALUES
(1,'10:00:00','10:34:00')
,(2,'10:38:00','10:52:00')
,(3,'10:53:00','11:23:00')
,(4,'11:24:00','11:56:00')
,(5,'14:20:00','14:40:00')
,(6,'14:41:00','14:59:00')
,(7,'15:30:00','15:40:00')
GO
WITH cte AS(
SELECT *
,ROW_NUMBER() OVER (ORDER BY id) AS rn
,DATEDIFF(MINUTE, ISNULL(LAG(endtime) OVER (ORDER BY id), starttime), starttime) AS diffMin
,COUNT(*) OVER (PARTITION BY (SELECT 1)) as maxRn
FROM T1
),
cteFirst AS(
SELECT *
FROM cte
WHERE rn = 1 OR diffMin > 2
),
cteGrp AS(
SELECT *
,ISNULL(LEAD(rn) OVER (ORDER BY id), maxRn+1) AS nextRn
FROM cteFirst
)
SELECT f.id, f.starttime, MAX(ISNULL(n.endtime, f.endtime)) AS endtime
FROM cteGrp f
LEFT JOIN cte n ON n.rn >= f.rn AND n.rn < f.nextRn
GROUP BY f.id, f.starttime

Lag functions and SUM

I need to get the list of users that have been offline for at least 20 min every day. Here's my data
I have this starting query, but I am stuck on how to sum the difference in offline_mins, i.e. I need to add "and sum(offline_mins) >= 20" to the where clause:
SELECT
userid,
connected,
LAG(recordeddt) OVER(PARTITION BY userid
ORDER BY userid,
recordeddt) AS offline_period,
DATEDIFF(minute, LAG(recordeddt) OVER(PARTITION BY userid
ORDER BY userid,
recordeddt),recordeddt) offline_mins
FROM device_data where connected=0;
My expected results :
Thanks in advance.
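For the direct question of applying sum(offline_mins) >= 20: an aggregate cannot appear in a where clause, but the query can be wrapped in a derived table and filtered with group by / having. A minimal sketch, assuming the offline minutes should simply be summed per user (the answers below refine this per day and per offline island):
SELECT userid, SUM(offline_mins) AS total_offline_mins
FROM (
    SELECT userid,
           DATEDIFF(minute,
                    LAG(recordeddt) OVER (PARTITION BY userid ORDER BY recordeddt),
                    recordeddt) AS offline_mins
    FROM device_data
    WHERE connected = 0
) t
GROUP BY userid
HAVING SUM(offline_mins) >= 20;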
This reads like a gaps-and-island problem, where you want to group together adjacent rows having the same userid and status.
As a starter, here is a query that computes the islands:
select userid, connected, min(recordeddt) startdt, max(lead_recordeddt) enddt,
datediff(minute, min(recordeddt), max(lead_recordeddt)) duration
from (
select dd.*,
row_number() over(partition by userid order by recordeddt) rn1,
row_number() over(partition by userid, connected order by recordeddt) rn2,
lead(recordeddt) over(partition by userid order by recordeddt) lead_recordeddt
from device_data dd
) dd
group by userid, connected, rn1 - rn2
Now, say you want users that were offline for at least 20 minutes every day. You can break down the islands per day, and use a having clause for filtering:
select userid
from (
select recordedday, userid, connected,
datediff(minute, min(recordeddt), max(lead_recordeddt)) duration
from (
select dd.*, v.*,
row_number() over(partition by v.recordedday, userid order by recordeddt) rn1,
row_number() over(partition by v.recordedday, userid, connected order by recordeddt) rn2,
lead(recordeddt) over(partition by v.recordedday, userid order by recordeddt) lead_recordeddt
from device_data dd
cross apply (values (convert(date, recordeddt))) v(recordedday)
) dd
group by recordedday, userid, connected, rn1 - rn2
) dd
group by userid
having count(distinct case when connected = 0 and duration >= 20 then recordedday end) = count(distinct recordedday)
As noted, this is a gaps-and-islands problem. This is my take on it, using a simple lag function to create groups, filtering out the connected rows and then working on the date ranges.
CREATE TABLE #tmp(ID int, UserID int, dt datetime, connected int)
INSERT INTO #tmp VALUES
(1,1,'11/2/20 10:00:00',1),
(2,1,'11/2/20 10:05:00',0),
(3,1,'11/2/20 10:10:00',0),
(4,1,'11/2/20 10:15:00',0),
(5,1,'11/2/20 10:20:00',0),
(6,2,'11/2/20 10:00:00',1),
(7,2,'11/2/20 10:05:00',1),
(8,2,'11/2/20 10:10:00',0),
(9,2,'11/2/20 10:15:00',0),
(10,2,'11/2/20 10:20:00',0),
(11,2,'11/2/20 10:25:00',0),
(12,2,'11/2/20 10:30:00',0)
SELECT UserID, connected,DATEDIFF(minute,MIN(DT), MAX(DT)) OFFLINE_MINUTES
FROM
(
SELECT *, SUM(CASE WHEN connected <> LG THEN 1 ELSE 0 END) OVER (ORDER BY UserID,dt) grp
FROM
(
select *, LAG(connected,1,connected) OVER(PARTITION BY UserID ORDER BY UserID,dt) LG
from #tmp
) x
) y
WHERE connected <> 1
GROUP BY UserID,grp,connected
HAVING DATEDIFF(minute,MIN(DT), MAX(DT)) >= 20
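For the sample data above, user 1 is offline for only 15 minutes (10:05 to 10:20), so the HAVING clause keeps only user 2:
UserID connected OFFLINE_MINUTES
2      0         20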

Find the true start/end dates for customers that have multiple accounts in SQL Server 2014

I have a checking account table that contains columns Cust_id (customer id), Open_Date (start date), and Closed_Date (end date). There is one row for each account. A customer can open multiple accounts at any given point. I would like to know how long the person has been a customer.
eg 1:
CREATE TABLE [Cust]
(
[Cust_id] [varchar](10) NULL,
[Open_Date] [date] NULL,
[Closed_Date] [date] NULL
)
insert into [Cust] values ('a123', '10/01/2019', '10/15/2019')
insert into [Cust] values ('a123', '10/12/2019', '11/01/2019')
Ideally I would like to insert this into a table with just one row that says this person has been a customer from 10/01/2019 to 11/01/2019 (as he opened his second account before he closed his previous one).
Similarly eg 2:
insert into [Cust] values ('b245', '07/01/2019', '09/15/2019')
insert into [Cust] values ('b245', '10/12/2019', '12/01/2019')
I would like to see 2 rows in this case- one that shows he was a customer from 07/01 to 09/15 and then again from 10/12 to 12/01.
Can you point me to the best way to get this?
I would approach this as a gaps-and-islands problem. You want to group together adjacent rows whose periods overlap.
Here is one way to solve it using lag() and a cumulative sum(). Every time the open date is greater than the closed date of the previous record, a new group starts.
select
cust_id,
min(open_date) open_date,
max(closed_date) closed_date
from (
select
t.*,
sum(case when not open_date <= lag_closed_date then 1 else 0 end)
over(partition by cust_id order by open_date) grp
from (
select
t.*,
lag(closed_date) over (partition by cust_id order by open_date) lag_closed_date
from cust t
) t
) t
group by cust_id, grp
In this db fiddle with your sample data, the query produces:
cust_id | open_date | closed_date
:------ | :--------- | :----------
a123 | 2019-10-01 | 2019-11-01
b245 | 2019-07-01 | 2019-09-15
b245 | 2019-10-12 | 2019-12-01
I would solve this with recursion. While this is certainly very heavy, it should accommodate even the most complex account timings (assuming your data has such). However, if the sample data provided is as complex as you need to solve for, I highly recommend sticking with the solution provided above. It is much more concise and clear.
WITH x (cust_id, open_date, closed_date, lvl, grp) AS (
SELECT cust_id, open_date, closed_date, 1, 1
FROM (
SELECT cust_id
, open_date
, closed_date
, row_number()
OVER (PARTITION BY cust_id ORDER BY closed_date DESC, open_date) AS rn
FROM cust
) AS t
WHERE rn = 1
UNION ALL
SELECT cust_id, open_date, closed_date, lvl, grp
FROM (
SELECT c.cust_id
, c.open_date
, c.closed_date
, x.lvl + 1 AS lvl
, x.grp + CASE WHEN c.closed_date < x.open_date THEN 1 ELSE 0 END AS grp
, row_number() OVER (PARTITION BY c.cust_id ORDER BY c.closed_date DESC) AS rn
FROM cust c
JOIN x
ON x.cust_id = c.cust_id
AND c.open_date < x.open_date
) AS t
WHERE t.rn = 1
)
SELECT cust_id, min(open_date) AS first_open_date, max(closed_date) AS last_closed_date
FROM x
GROUP BY cust_id, grp
ORDER BY cust_id, grp
I would also add the caveat that I don't run on SQL Server, so there could be syntax differences that I didn't account for. Hopefully they are minor, if present.
You can try something like this:
select distinct
cust_id,
(select min(Open_Date)
from Cust as b
where b.cust_id = a.cust_id and
a.Open_Date <= b.Closed_Date and
a.Closed_Date >= b.Open_Date
),
(select max(Closed_Date)
from Cust as b
where b.cust_id = a.cust_id and
a.Open_Date <= b.Closed_Date and
a.Closed_Date >= b.Open_Date
)
from Cust as a
So, for every row, you select the minimal and maximal dates from all overlapping ranges, and distinct then filters out the duplicates. Note that this only merges ranges that overlap directly: with a chain of overlaps (A overlaps B and B overlaps C, but A and C do not overlap each other), the rows produce different min/max pairs and distinct cannot collapse them into one range, so the gaps-and-islands approaches above are safer in that case.

How to return all the rows in the yellow census blocks?

Hey, the schema is like this: for the whole dataset, we should order by machine_id first, then by ss2k. After that, for each machine, we should find all the rows with at least 5 consecutive flag = 'census' values. In this dataset, the result should be all the yellow rows.
I cannot return the last 4 rows of the yellow blocks by using this:
drop table if exists qz_panel_census_228_rank;
create table qz_panel_census_228_rank as
select t.*
from (select t.*,
count(*) filter (where flag = 'census') over (partition by machine_id, date order by ss2k rows between current row and 4 following) as census_cnt5,
count(*) filter (where flag = 'census') over (partition by machine_id, date) as count_census,
row_number() over (partition by machine_id, date order by ss2k) as seqnum,
count(*) over (partition by machine_id, date) as cnt
from qz_panel_census_228 t
) t
where census_cnt5 = 5
group by 1,2,3,4,5,6,7,8,9,10,11
DISTRIBUTED BY (machine_id);
You were close, but you need to search in both directions:
select t.*
from (select t.*,
case when count(*) filter (where flag = 'census')
over (partition by machine_id, date
order by ss2k
rows between 4 preceding and current row) = 5
or count(*) filter (where flag = 'census')
over (partition by machine_id, date
order by ss2k
rows between current row and 4 following) = 5
then 1
else 0
end as flag
from qz_panel_census_228 t
) t
where flag = 1
Edit:
This approach will not work unless you add an extra count for each possible 5 row window, e.g. 3 preceding and 1 following, 2 preceding and 2 following, etc. This results in ugly code and is not very flexible.
The common way to solve this gaps & islands problem is to assign consecutive rows to a common group first:
select *
from
(
select t2.*,
count(*) over (partition by machine_id, date, grp) as cnt
from
(
select t1.*
from (select t.*,
-- keep the same number for 'census' rows
sum(case when flag = 'census' then 0 else 1 end)
over (partition by machine_id, date
order by ss2k
rows unbounded preceding) as grp
from qz_panel_census_228 t
) t1
where flag = 'census' -- only census rows
) as t2
) t3
where cnt >= 5 -- only groups of at least 5 census rows
Wow, there has to be a better way of doing this, but the only way I could figure out was to create blocks of consecutive 'census' values. This looks awful but might be a catalyst to a better idea.
with q1 as (
select
machine_id, recorded, ss2k, flag, date,
case
when flag = 'census' and
lag (flag) over (order by machine_id, ss2k) != 'census'
then 1
else 0
end as block
from foo
),
q2 as (
select
machine_id, recorded, ss2k, flag, date,
sum (block) over (order by machine_id, ss2k) as group_id,
case when flag = 'census' then 1 else 0 end as census
from q1
),
q3 as (
select
machine_id, recorded, ss2k, flag, date, group_id,
sum (census) over (partition by group_id order by ss2k) as max_count
from q2
),
groups as (
select group_id
from q3
group by group_id
having max (max_count) >= 5
)
select
q2.machine_id, q2.recorded, q2.ss2k, q2.flag, q2.date
from
q2
join groups g on q2.group_id = g.group_id
where
q2.flag = 'census'
If you run each query within the with clauses in isolation, I think you will see how this evolves.
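For example, to inspect the first stage on its own (the body of q1 copied verbatim from above):
with q1 as (
select
machine_id, recorded, ss2k, flag, date,
case
when flag = 'census' and
lag (flag) over (order by machine_id, ss2k) != 'census'
then 1
else 0
end as block
from foo
)
select * from q1 order by machine_id, ss2k;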