How to Update Subsets of a Window (Partition) - sql

My data can be partitioned into an ordered set of events in time for each item of interest.
Looks like this:
item_id     state     time   window
----------  --------  -----  --------
a1          start     t1     w1
a1          stall     t2     w1
a1          restart   t3     w1
a1          stall     t4     w1
a1          restart   t5     w1
a1          stop      t6     w1
a2          start     t9     w2
a2          stop      t10    w2
a2          start     t11    w2
a2          stop      t12    w2
In this example, the times are arranged in chronological order (t(n) is later than t(n-1)). The 'window' column is illustrative only of the partition definition below, and isn't actually in the data. I only know how to partition this as follows:
window w over (partition by item_id order by time)
This window will always be bounded by the earliest 'start' and the latest state in the data, but there may be many 'stop' states that occur before the last recorded state.
What I want to do now is subdivide this window into intervals that begin with the 'start' time and end with the next 'stop' time, if there is one, or the last state following a start if not.
My approach so far: pull out all the 'start' records, order them by time, and generate a sequence number for them. That sequence number (I'm actually just loading those start records into a temp table that has a serial primary key on it) is then unique for every interval I want to identify.
So, I now have data that looks like this:
item_id     state     time   interval_id
----------  --------  -----  ------------
a1          start     t1     1
a1          stall     t2     NULL
a1          restart   t3     NULL
a1          stall     t4     NULL
a1          restart   t5     NULL
a1          stop      t6     NULL
a2          start     t9     2
a2          stop      t10    NULL
a2          start     t11    3
a2          stop      t12    NULL
In other words, sure, I can identify the first member of each of my 'sub-window' intervals, but I still can't find a reasonable way to tag the other members of these sets with the same interval id. I'm getting wrapped around the axle trying to define the other members of each set--the original problem.
The whole reason for tagging the sets with an identifier is really so I can use that identifier for windowing the data based on the real window definition: "starts with 'start', ends with 'stop', is ordered by time".
If the problem is clear, how to approach?

If the 'stop' state is always followed by the 'start' state, you could use a running sum that increases by 1 whenever a 'start' state is found:
select *,
       sum(1) filter (where state = 'start')
           over (order by replace(time, 't', '')::int) as interval_id
from tbl
order by replace(time, 't', '')::int
replace(time, 't','')::int is only used here to get the correct chronological order from the sample values t1, t2, and so on. If you are using real timestamps, just use order by time.
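The running-sum idea can be tried out with a minimal sketch that uses Python's built-in sqlite3 module, with SQLite (3.25 or later, for window functions) standing in for PostgreSQL. Several things here are assumptions rather than part of the answer: the time column is named ts and holds plain integers, the FILTER clause is written as an equivalent CASE expression, and the window is ordered by item_id before ts so that events of different items cannot interleave.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: ts is an integer stand-in for the t1..t12 sample times.
conn.execute("CREATE TABLE tbl (item_id TEXT, state TEXT, ts INTEGER)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [
        ("a1", "start", 1), ("a1", "stall", 2), ("a1", "restart", 3),
        ("a1", "stall", 4), ("a1", "restart", 5), ("a1", "stop", 6),
        ("a2", "start", 9), ("a2", "stop", 10),
        ("a2", "start", 11), ("a2", "stop", 12),
    ],
)

# Running count of 'start' rows: every row inherits the number of the
# most recent 'start' at or before it in (item_id, ts) order.
query = """
SELECT item_id, state, ts,
       SUM(CASE WHEN state = 'start' THEN 1 ELSE 0 END)
           OVER (ORDER BY item_id, ts) AS interval_id
FROM tbl
ORDER BY item_id, ts
"""
interval_ids = [row[3] for row in conn.execute(query)]
print(interval_ids)
```

With the sample data, all six a1 rows share interval 1, and the two a2 intervals get 2 and 3. The resulting interval_id can then serve as a partition key for further windowing.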

Related

SQL Server query order by sequence serie

I am writing a query and I want it to order by a repeating series. The first seven records should be ordered by 1,2,3,4,5,6 and 7, and then it should start all over.
I have tried over partition and last_value, but I can't figure it out.
This is the SQL code:
set language swedish;
select
    tblridgruppevent.id,
    datepart(dw,date) as daynumber,
    tblRidgrupper.name
from
    tblRidgruppEvent
join
    tblRidgrupper on tblRidgrupper.id = tblRidgruppEvent.ridgruppid
where
    ridgruppid in (select id from tblRidgrupper
                   where corporationID = 309 and Removeddate is null)
    and tblridgruppevent.terminID = (select id from tblTermin
                                     where corporationID = 309 and removedDate is null and isActive = 1)
    and tblridgrupper.removeddate is null
order by
    datepart(dw, date)
and this is an example of the result:
5887 1 J2
5916 1 J5
6555 2 Junior nybörjare
6004 2 Morgonridning
5911 3 J2
6467 3 J5
and this is what I would expect:
5887 1 J2
6555 2 Junior nybörjare
5911 3 J2
5916 1 J5
6004 2 Morgonridning
6467 3 J5
You might get some value by zooming out a little further and considering what you're trying to do and how else you might do it. SQL tends to perform very poorly with row-by-row processing, as well as with operations where a row borrows details from the row before it. You could also run into problems if you need to change the range you repeat at (switching from 7 to 10 or 4, etc.).
If you still need a somewhat arbitrary number there, you could combine ROW_NUMBER with a modulo to get a repeating increment, then add it to your select/where criteria. It would look something like this:
((ROW_NUMBER() OVER(ORDER BY column ASC) -1) % 7) + 1 AS Number
The outer +1 is to display the results as 1-7 instead of 0-6, and the inner -1 deals with the off by one issue (the column starting at 2 instead of 1). I feel like there's a better way to deal with that, but it's not coming to me at the moment.
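The arithmetic of that expression can be checked with a small sketch using Python's built-in sqlite3 module (SQLite 3.25 or later stands in for SQL Server here). The table events, its nine ids, and the alias cycle_no are made up purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with nine rows, so the cycle wraps around once.
conn.execute("CREATE TABLE events (id INTEGER)")
conn.executemany("INSERT INTO events VALUES (?)", [(i,) for i in range(101, 110)])

# ROW_NUMBER() is 1-based: subtract 1 before the modulo, add 1 after it,
# so the result runs 1..7 and then restarts at 1.
query = """
SELECT id, ((ROW_NUMBER() OVER (ORDER BY id) - 1) % 7) + 1 AS cycle_no
FROM events
ORDER BY id
"""
numbers = [row[1] for row in conn.execute(query)]
print(numbers)
```

Changing the repeat length only means changing the 7 in the expression.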
edit: Looking over your post again, it looks like you're dealing with days of the week. You can order by Date even if it's not shown in the select statement, that might be all you need to get this working.
The first seven records should be ordered by 1,2,3,4,5,6 and 7. And then it should start all over.
You can use row_number():
order by row_number() over (partition by DATEPART(dw, date) order by tblridgruppevent.id),
datepart(dw, date)
The second key keeps the order within a group.
You don't specify how the rows should be chosen for each group. It is not clear from the question.
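As a rough check of the row_number() ordering, here is a sketch using Python's built-in sqlite3 module (SQLite 3.25 or later in place of SQL Server). It assumes the six sample rows from the question live in a hypothetical table events with a precomputed daynumber column, and it computes the row number in a subquery rather than directly in ORDER BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical flattened table holding the question's sample output.
conn.execute("CREATE TABLE events (id INTEGER, daynumber INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (5887, 1, "J2"), (5916, 1, "J5"),
        (6555, 2, "Junior nybörjare"), (6004, 2, "Morgonridning"),
        (5911, 3, "J2"), (6467, 3, "J5"),
    ],
)

# rn numbers the rows within each day; sorting by rn first and daynumber
# second yields one pass over days 1..3, then the next pass, and so on.
query = """
SELECT id, daynumber, name
FROM (
    SELECT id, daynumber, name,
           ROW_NUMBER() OVER (PARTITION BY daynumber ORDER BY id) AS rn
    FROM events
)
ORDER BY rn, daynumber
"""
ordered_ids = [row[0] for row in conn.execute(query)]
print(ordered_ids)
```

Because rows within a day are numbered by id here, 6004 comes before 6555, which differs from the expected output in the question. Matching it exactly would need whatever rule decides which row of a day goes into which pass.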

Conditional lead/lag function PostgreSQL?

I have a table like this:
Name   activity  time
user1  A1        12:00
user1  E3        12:01
user1  A2        12:02
user2  A1        10:05
user2  A2        10:06
user2  A3        10:07
user2  M6        10:07
user2  B1        10:08
user3  A1        14:15
user3  B2        14:20
user3  D1        14:25
user3  D2        14:30
Now, I need a result like this:
Name   activity  next_activity
user1  A2        NULL
user2  A3        B1
user3  A1        B2
I would like to check for every user the last activity from group A and what type of activity took place next from group B (activity from group B always takes place after activity from group A). Other types of activity are not interesting for me. I've tried to use the lead() function, but it hasn't worked.
How can I solve my problem?
Your definition:
activity from group B always takes place after activity from group A.
... logically implies that there is, per user, 0 or 1 B activity after 1 or more A activities, and never more than one B activity in sequence.
You can make it work with a single window function, DISTINCT ON and CASE, which should be the fastest way for few rows per user (also see below):
SELECT name
     , CASE WHEN a2 LIKE 'B%' THEN a1 ELSE a2 END AS activity
     , CASE WHEN a2 LIKE 'B%' THEN a2 END AS next_activity
FROM  (
   SELECT DISTINCT ON (name)
          name
        , lead(activity) OVER (PARTITION BY name ORDER BY time DESC) AS a1
        , activity AS a2
   FROM   t
   WHERE  (activity LIKE 'A%' OR activity LIKE 'B%')
   ORDER  BY name, time DESC
   ) sub;
An SQL CASE expression defaults to NULL if no ELSE branch is added, so I kept that short.
Assuming time is defined NOT NULL. Else, you might want to add NULLS LAST. Why?
Sort by column ASC, but NULL values first?
(activity LIKE 'A%' OR activity LIKE 'B%') is more verbose than activity ~ '^[AB]', but typically faster in older versions of Postgres. About pattern matching:
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
Conditional window functions?
That's actually possible. You can combine the aggregate FILTER clause with the OVER clause of window functions. However:
The FILTER clause itself can only work with values from the current row.
More importantly, FILTER is not implemented for genuine window functions like lead() or lag() (up to Postgres 13) - only for aggregate functions.
If you try:
lead(activity) FILTER (WHERE activity LIKE 'A%') OVER () AS activity
Postgres will tell you:
FILTER is not implemented for non-aggregate window functions
About FILTER:
Aggregate columns with additional (distinct) filters
Referencing current row in FILTER clause of window function
Performance
For few users with few rows per user, pretty much any query is fast, even without index.
For many users and few rows per user, the first query above should be fastest. See:
Select first row in each GROUP BY group?
For many rows per user, there are (potentially much) faster techniques, depending on details of your setup. See:
Optimize GROUP BY query to retrieve latest row per user
select distinct on (name) name, activity, next_activity
from  (select name, activity, time
            , lead(activity) over (partition by name order by time) as next_activity
       from   t
       where  left(activity, 1) in ('A', 'B')
      ) t
where  left(activity, 1) = 'A'
order  by name, time desc
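The lead-then-filter approach of the last query can be tried with a minimal sketch using Python's built-in sqlite3 module. SQLite (3.25 or later) has no DISTINCT ON, so this version swaps in ROW_NUMBER() to pick the last A row per user; that substitution, the column name ts, and substr() in place of left() are assumptions made for the sketch, not part of either answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# ts is a hypothetical name for the time column; 'HH:MM' text sorts correctly.
conn.execute("CREATE TABLE t (name TEXT, activity TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("user1", "A1", "12:00"), ("user1", "E3", "12:01"), ("user1", "A2", "12:02"),
        ("user2", "A1", "10:05"), ("user2", "A2", "10:06"), ("user2", "A3", "10:07"),
        ("user2", "M6", "10:07"), ("user2", "B1", "10:08"),
        ("user3", "A1", "14:15"), ("user3", "B2", "14:20"),
        ("user3", "D1", "14:25"), ("user3", "D2", "14:30"),
    ],
)

query = """
WITH ab AS (            -- keep only A and B rows, then look one row ahead
    SELECT name, activity, ts,
           LEAD(activity) OVER (PARTITION BY name ORDER BY ts) AS next_activity
    FROM t
    WHERE substr(activity, 1, 1) IN ('A', 'B')
), ranked AS (          -- among the A rows, rank the latest one first
    SELECT name, activity, next_activity,
           ROW_NUMBER() OVER (PARTITION BY name ORDER BY ts DESC) AS rn
    FROM ab
    WHERE substr(activity, 1, 1) = 'A'
)
SELECT name, activity, next_activity
FROM ranked
WHERE rn = 1
ORDER BY name
"""
result = conn.execute(query).fetchall()
print(result)
```

The key point is that lead() runs after the WHERE filter inside the first CTE, so uninteresting activities such as E3 or M6 never appear as the "next" row.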

SQL - Delete value if incremental pattern not met

I have a table with a column of values with the following sample data that has been pulled for 1 user:
ID | Data
 5 | Record1
12 | NULL
13 | NULL
15 | Record1
20 | Record12
28 | NULL
31 | NULL
35 | Record12
37 | Record23
42 | Record34
51 | NULL
53 | Record34
58 | Record5
61 | Record17
63 | NULL
69 | Record17
What I would like to do is delete any values in the Data column where the Data value does not have both a start and a finish record. So in the above, Record23 and Record5 would be deleted.
Please note that a given Record(n) may appear more than once, so it's not as straightforward as doing a count on the Data value. It needs to be incremental: a record should always start and finish before another one starts. If it starts and doesn't finish, I want to remove it.
Sadly SQL Server 2008 does not have LAG or LEAD which would make the operation simpler.
You could use a common table expression to find the non-null values that have no matching neighbour (no adjacent non-null row with the same data), and delete them:
WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (ORDER BY id) rn
    FROM table1
    WHERE data IS NOT NULL
)
DELETE c1
FROM cte c1
LEFT JOIN cte c2 ON (c1.rn = c2.rn+1 OR c1.rn = c2.rn-1) AND c1.data = c2.data
WHERE c2.id IS NULL
If you just want to see which rows would be deleted, replace DELETE c1 with SELECT c1.*.
...and as always, remember to back up before running potentially destructive SQL for random people on the Internet.
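The SELECT variant of the query above can be dry-run with a small sketch using Python's built-in sqlite3 module (SQLite 3.25 or later stands in for SQL Server). It only lists the rows that the DELETE would target and does not modify anything.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, data TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [
        (5, "Record1"), (12, None), (13, None), (15, "Record1"),
        (20, "Record12"), (28, None), (31, None), (35, "Record12"),
        (37, "Record23"), (42, "Record34"), (51, None), (53, "Record34"),
        (58, "Record5"), (61, "Record17"), (63, None), (69, "Record17"),
    ],
)

# Number the non-NULL rows, then keep those with no adjacent row
# (rn - 1 or rn + 1) that carries the same data value.
query = """
WITH cte AS (
    SELECT id, data, ROW_NUMBER() OVER (ORDER BY id) AS rn
    FROM table1
    WHERE data IS NOT NULL
)
SELECT c1.id, c1.data
FROM cte c1
LEFT JOIN cte c2
       ON (c1.rn = c2.rn + 1 OR c1.rn = c2.rn - 1)
      AND c1.data = c2.data
WHERE c2.id IS NULL
ORDER BY c1.id
"""
unpaired = conn.execute(query).fetchall()
print(unpaired)
```

On the sample data this singles out Record23 and Record5, the two values named in the question.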

Convert list of transitions (points in time) to list of states (periods of time)

Did something similar long ago, but when I think I'm doing the same thing now, it doesn't work.
A history table is a list of events happening to accounts. Some of those events are changes in status, in which case a multipurpose Detail column shows the new status. Sample:
... where Event_Type = 'Change_Status';
Acct  Line  Event_Type     Detail
----  ----  -------------  -------
A     1     Change_Status  Created
A     4     Change_Status  Billed
A     7     Change_Status  Paid
A     10    Change_Status  Audited
B     1     Change_Status  Created
B     6     Change_Status  Billed
Now it is easy enough to join this to itself and get a table of time periods WHERE A.Acct = B.Acct and A.Line < B.Line, but I'm failing on two things:
1. I also need to capture the last status, but in that case there is no end (B.*). I thought a left join would get it (B.Line is null), but it doesn't.
2. I need to eliminate periods that span more than one status, such as A-1 to A-7. I tried both conditions below, but either one eliminated everything.
AND A.LINE = (SELECT Max(Line) FROM Events TEMP
              WHERE TEMP.Acct = A.Acct
              AND TEMP.Line < B.Line or B.Line is null);
AND NOT EXISTS (SELECT Line FROM Events TEMP
                WHERE TEMP.Acct = A.Acct
                AND TEMP.Line between A.Line and B.Line);
If any of that is unclear, what I need to create is effectively
      Acct  Line      Acct  Line  Status
      ----  ----      ----  ----  -------
from  A     1     To  A     4     Created
from  A     4     To  A     7     Billed
from  A     7     To  A     10    Paid
from  A     10    To              Audited
from  B     1     To  B     6     Created
I poked around with this on a Postgres 9.1 database (so, YMMV). This is the query I came up with:
select
    x.acct, x.line, y.line, x.status
from
    statchanges x
    left join statchanges y on x.acct = y.acct
                           and y.line > x.line
where
    y.line is null or
    (y.line - x.line =
        (select min(y1.line - x1.line)
         from statchanges x1, statchanges y1
         where x1.acct = x.acct
           and x1.line = x.line
           and x1.acct = y1.acct
           and y1.line > x1.line));
Important differences:
1. In the join clause, I'm joining on b.line > a.line rather than a.line < b.line. This appears to be because (on Postgres 9.1, at least) null is sorted after non-nulls unless otherwise specified.
2. I'm jumping through some hoops to make sure I get the right min in the subquery: making a very similar join (no left join needed, since we don't care about the nulls) and making sure the acct and starting line match those of the outer query.
I'm not sure if this is completely what you're looking for, but it should hopefully give you some directions to explore.
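For anyone who wants to poke at the query, here is a minimal sketch that runs it with Python's built-in sqlite3 module instead of Postgres. The table statchanges(acct, line, status) is an assumed shape for the status-change rows shown in the question, and the ORDER BY is added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed shape: one row per Change_Status event, Detail stored as status.
conn.execute("CREATE TABLE statchanges (acct TEXT, line INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO statchanges VALUES (?, ?, ?)",
    [
        ("A", 1, "Created"), ("A", 4, "Billed"), ("A", 7, "Paid"),
        ("A", 10, "Audited"), ("B", 1, "Created"), ("B", 6, "Billed"),
    ],
)

# Pair each row with every later row of the same account, then keep only
# the nearest one (smallest line difference) or the unmatched last row.
query = """
SELECT x.acct, x.line, y.line, x.status
FROM statchanges x
LEFT JOIN statchanges y ON x.acct = y.acct AND y.line > x.line
WHERE y.line IS NULL
   OR y.line - x.line = (
        SELECT MIN(y1.line - x1.line)
        FROM statchanges x1, statchanges y1
        WHERE x1.acct = x.acct AND x1.line = x.line
          AND x1.acct = y1.acct AND y1.line > x1.line)
ORDER BY x.acct, x.line
"""
periods = conn.execute(query).fetchall()
for period in periods:
    print(period)
```

Each account's final status comes back with an empty end line, which covers the "last status" case from the question. Account B's open-ended Billed period also appears, although the question's sample output does not list it.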

Missing gaps in recurring series within a group

We have a table with following data
Id,ItemId,SeqNumber,DateTimeTrx
1,100,254,2011-12-01 09:00:00
2,100,1,2011-12-01 09:10:00
3,200,7,2011-12-02 11:00:00
4,200,5,2011-12-02 10:00:00
5,100,255,2011-12-01 09:05:00
6,200,3,2011-12-02 09:00:00
7,300,0,2011-12-03 10:00:00
8,300,255,2011-12-03 11:00:00
9,300,1,2011-12-03 10:30:00
Id is an identity column.
The sequence for an ItemId starts from 0 and goes till 255 and then resets to 0. All this information is stored in a table called Item. The order of sequence number is determined by the DateTimeTrx but such data can enter any time into the system. The expected output is as shown below-
ItemId,PrevorNext,SeqNumber,DateTimeTrx,MissingNumber
100,Previous,255,2011-12-01 09:05:00,0
100,Next,1,2011-12-01 09:10:00,0
200,Previous,3,2011-12-02 09:00:00,4
200,Next,5,2011-12-02 10:00:00,4
200,Previous,5,2011-12-02 10:00:00,6
200,Next,7,2011-12-02 11:00:00,6
300,Previous,1,2011-12-03 10:30:00,2
300,Next,255,2011-12-03 16:30:00,2
We need to get those rows one before and one after the missing sequence. In the above example for ItemId 300 - the record with sequence 1 has entered first (2011-12-03 10:30:00) and then 255(2011-12-03 16:30:00), hence the missing number here is 2. So 1 is previous and 255 is next and 2 is the first missing number. Coming to ItemId 100, the record with sequence 255 has entered first (2011-12-02 09:05:00) and then 1 (2011-12-02 09:10:00), hence 255 is previous and then 1, hence 0 is the first missing number.
In the above expected result, the MissingNumber column is the first occurring missing number, included just to illustrate the example.
We will not have a case where we would have a complete series reset at one time i.e. it can be either a series rundown from 255 to 0 as in for itemid 100 or 0 to 255 as in ItemId 300. Hence we need to identify sequence missing when in ascending order (0,1,...255) or either in descending order (254,254,0,2) etc.
How can we accomplish this in T-SQL?
Could work like this:
;WITH b AS (
    SELECT *
          ,row_number() OVER (ORDER BY ItemId, DateTimeTrx, SeqNumber) AS rn
    FROM   tbl
), x AS (
    SELECT
           b.Id
          ,b.ItemId    AS prev_Itm
          ,b.SeqNumber AS prev_Seq
          ,c.ItemId    AS next_Itm
          ,c.SeqNumber AS next_Seq
    FROM   b
    JOIN   b c ON c.rn = b.rn + 1                     -- next row
    WHERE  c.ItemId = b.ItemId                        -- only with same ItemId
    AND    c.SeqNumber <> (b.SeqNumber + 1)%256       -- Seq cycles modulo 256
)
SELECT Id, prev_Itm, 'Previous' AS PrevNext, prev_Seq
FROM   x
UNION  ALL
SELECT Id, next_Itm ,'Next', next_Seq
FROM   x
ORDER  BY Id, PrevNext DESC
Produces exactly the requested result.
See a complete working demo on data.SE.
This solution takes gaps in the Id column into consideration, as there is no mention of a gapless sequence of Ids in the question.
Edit2: Answer to updated question:
I updated the CTE in the query above to match your latest version - or so I think.
Use those columns that define the sequence of rows. Add as many columns to your ORDER BY clause as necessary to break ties.
The explanation to your latest update is not entirely clear to me, but I think you only need to squeeze in DateTimeTrx to achieve what you want. I have SeqNumber in the ORDER BY additionally to break ties left by identical DateTimeTrx. I edited the query above.
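To see what the query returns on the question's sample rows, here is a sketch that runs it through Python's built-in sqlite3 module (SQLite 3.25 or later in place of SQL Server). The table name tbl follows the answer; the leading semicolon is dropped because it is not needed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (Id INTEGER, ItemId INTEGER, SeqNumber INTEGER, DateTimeTrx TEXT)"
)
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?)",
    [
        (1, 100, 254, "2011-12-01 09:00:00"), (2, 100, 1, "2011-12-01 09:10:00"),
        (3, 200, 7, "2011-12-02 11:00:00"), (4, 200, 5, "2011-12-02 10:00:00"),
        (5, 100, 255, "2011-12-01 09:05:00"), (6, 200, 3, "2011-12-02 09:00:00"),
        (7, 300, 0, "2011-12-03 10:00:00"), (8, 300, 255, "2011-12-03 11:00:00"),
        (9, 300, 1, "2011-12-03 10:30:00"),
    ],
)

query = """
WITH b AS (   -- number all rows by item, then time, then sequence number
    SELECT *, ROW_NUMBER() OVER (ORDER BY ItemId, DateTimeTrx, SeqNumber) AS rn
    FROM tbl
), x AS (     -- neighbouring rows of one item whose numbers do not follow on
    SELECT b.Id, b.ItemId AS prev_Itm, b.SeqNumber AS prev_Seq,
           c.ItemId AS next_Itm, c.SeqNumber AS next_Seq
    FROM b
    JOIN b AS c ON c.rn = b.rn + 1
    WHERE c.ItemId = b.ItemId
      AND c.SeqNumber <> (b.SeqNumber + 1) % 256
)
SELECT Id, prev_Itm, 'Previous' AS PrevNext, prev_Seq FROM x
UNION ALL
SELECT Id, next_Itm, 'Next', next_Seq FROM x
ORDER BY Id, PrevNext DESC
"""
gaps = conn.execute(query).fetchall()
for gap in gaps:
    print(gap)
```

On this data the query reports four gaps: 255 to 1 for item 100, 3 to 5 and 5 to 7 for item 200, and 1 to 255 for item 300. Rows are sorted by the Id of the "previous" row, so the pairs do not come out grouped by ItemId.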