PostgreSQL - Detecting patterns in a series

Consider the following table:
id | date | status
1 | 2014-01-10 | 1
1 | 2014-02-10 | 1
1 | 2014-03-10 | 1
1 | 2014-04-10 | 1
1 | 2014-05-10 | 0
1 | 2014-06-10 | 0
------------------------
2 | 2014-01-10 | 1
2 | 2014-02-10 | 1
2 | 2014-03-10 | 0
2 | 2014-04-10 | 1
2 | 2014-05-10 | 0
2 | 2014-06-10 | 0
------------------------
3 | 2014-01-10 | 1
3 | 2014-02-10 | 0
3 | 2014-03-10 | 0
3 | 2014-04-10 | 1
3 | 2014-05-10 | 0
3 | 2014-06-10 | 0
------------------------
4 | 2014-01-10 | 0
4 | 2014-02-10 | 1
4 | 2014-03-10 | 1
4 | 2014-04-10 | 1
4 | 2014-05-10 | 0
4 | 2014-06-10 | 0
------------------------
5 | 2014-01-10 | 0
5 | 2014-02-10 | 1
5 | 2014-03-10 | 0
5 | 2014-04-10 | 1
5 | 2014-05-10 | 0
5 | 2014-06-10 | 0
------------------------
The id field is the user id, the date field is when a certain checkpoint is due, and the status field indicates whether the user accomplished that checkpoint.
I'm having a lot of trouble detecting users that skipped a checkpoint, like the users with ids 2, 3, 4 and 5. What I need is a query that lists the ids that have a missed checkpoint in the middle or at the start of the series, returning only the ids.
I've tried hard to find a way of doing that with queries alone, but I couldn't come up with one. I know I could do it with a script, but the project I'm working on requires that I do it using only SQL.
Does anyone have the slightest idea how to accomplish that?
EDIT: As recommended by the mods here are more details and some things I unsuccessfully tried:
My most successful try was to count how many statuses were registered for each id with this query:
SELECT
id,
SUM(CASE WHEN status = 1 THEN 1 ELSE 0 END) AS check,
SUM(CASE WHEN status = 0 THEN 1 ELSE 0 END) AS non_check
FROM
example_table
GROUP BY
id
ORDER BY
id
Getting the following result:
id | check | non_check
1 | 4 | 2
2 | 3 | 3
3 | 2 | 4
4 | 3 | 3
5 | 2 | 4
With that result I could select each id's entries, limited to its check count, and do a SUM on the status field. If the SUM equals the check count, the checkpoints are contiguous, as in:
WITH tbl AS (
SELECT id, status, SUM(status) AS "sum"
FROM (
SELECT id, status FROM example_table WHERE id = 1 ORDER BY date LIMIT 4
) AS tbl2
GROUP BY
status,id
)
SELECT
id,"sum"
FROM
tbl
WHERE
status = 1
Getting the following result:
id | sum
1 | 4
As the sum result is equal to check in the first query, I can determine that the checkpoints are contiguous. But take id 2 as an example this time; its query is:
WITH tbl AS (
SELECT id, status, SUM(status) AS "sum"
FROM (
SELECT id, status FROM example_table WHERE id = 2 ORDER BY date LIMIT 3
) AS tbl2
GROUP BY
status,id
)
SELECT
id,"sum"
FROM
tbl
WHERE
status = 1
Notice that I changed the id on WHERE and the LIMIT values based on which id I'm working with and its check result on the first query, and I got the following result:
id | sum
2 | 2
As the sum field value for id 2 in that query differs from its check value, I can say it's not contiguous. That pattern can be repeated with every id.
As I said before, to work that problem out that way I would need to do it by code, but in that specific case I need it to be in SQL.
Also I found the following article:
postgres detect repeating patterns of zeros
The problem there resembles mine, but the author wants to detect repeating zeroes. It enlightened me a bit, but not enough to solve my own problem.
Thanks in advance!

The pattern you're looking for is a missed checkpoint followed by an accomplished checkpoint. Join each checkpoint of a user with that user's next checkpoint (by date), then look for status 0 joined to status 1.
Here is an example:
create table tab (id int,date date,status int);
insert into tab values(1 , '2014-01-10' , 1),(1 , '2014-02-10' , 1),(1 , '2014-03-10' , 1),(1 , '2014-04-10' , 1),(1 , '2014-05-10' , 0),(1 , '2014-06-10' , 0),(2 , '2014-01-10' , 1),(2 , '2014-02-10' , 1),(2 , '2014-03-10' , 0),(2 , '2014-04-10' , 1),(2 , '2014-05-10' , 0),(2 , '2014-06-10' , 0),(3 , '2014-01-10' , 1),(3 , '2014-02-10' , 0),(3 , '2014-03-10' , 0),(3 , '2014-04-10' , 1),(3 , '2014-05-10' , 0),(3 , '2014-06-10' , 0),(4 , '2014-01-10' , 0),(4 , '2014-02-10' , 1),(4 , '2014-03-10' , 1),(4 , '2014-04-10' , 1),(4 , '2014-05-10' , 0),(4 , '2014-06-10' , 0),(5 , '2014-01-10' , 0),(5 , '2014-02-10' , 1),(5 , '2014-03-10' , 0),(5 , '2014-04-10' , 1),(5 , '2014-05-10' , 0),(5 , '2014-06-10' , 0);
with tabwithrow as
(select *
, row_number() OVER(PARTITION by id order by date) rnum
from tab)
select *
from tabwithrow a
join tabwithrow b on b.rnum = a.rnum + 1
and a.id = b.id
and a.status = 0
and b.status = 1;
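The same idea can be narrowed down to return only the ids, as the question asks. The following is a minimal sketch, not the answerer's query: it swaps the self-join for LAG and runs against an in-memory SQLite database through Python's sqlite3 module (assuming SQLite 3.25 or later for window functions). The helper name find_skipped_ids and the compact STATUSES encoding of the sample data are made up for illustration; the SQL itself should carry over to PostgreSQL largely unchanged.

```python
import sqlite3

# Sample data from the question, one status string per user id
# (one character per monthly checkpoint, in date order).
DATES = ["2014-01-10", "2014-02-10", "2014-03-10",
         "2014-04-10", "2014-05-10", "2014-06-10"]
STATUSES = {1: "111100", 2: "110100", 3: "100100", 4: "011100", 5: "010100"}


def find_skipped_ids(statuses=STATUSES):
    """Return ids that have a missed checkpoint followed by an accomplished one."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tab (id INT, date TEXT, status INT)")
    conn.executemany(
        "INSERT INTO tab VALUES (?, ?, ?)",
        [(uid, d, int(s))
         for uid, bits in statuses.items()
         for d, s in zip(DATES, bits)],
    )
    rows = conn.execute("""
        WITH marked AS (
            SELECT id, status,
                   -- status of the same user's previous checkpoint
                   LAG(status) OVER (PARTITION BY id ORDER BY date) AS prev_status
            FROM tab
        )
        SELECT DISTINCT id
        FROM marked
        WHERE prev_status = 0 AND status = 1
        ORDER BY id
    """).fetchall()
    conn.close()
    return [r[0] for r in rows]


print(find_skipped_ids())  # [2, 3, 4, 5]
```

User 1 is not returned because its only missed checkpoints come at the end of the series, with no accomplished checkpoint after them.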

Related

How to find next row in ordered table that matches a condition, given an initial match in SQL

I'm querying a table that contains state transitions for a state engine. The table is set up so that it has the previous_state, current_state, and timestamp of the transition, grouped by unique ids.
My goal is to find a sequence of target intervals, defined as the timestamp of the initial state transition (e.g. the timestamp when we shift from 1->2), and the timestamp of the next state transition that matches a specific condition (e.g. the next timestamp where current_state=3 OR current_state=4).
state_transition_table
+------------+---------------+-----------+----+
| prev_state | current_state | timestamp | id |
+------------+---------------+-----------+----+
| 1 | 2 | 4.5 | 1 |
| 2 | 3 | 5.2 | 1 |
| 3 | 1 | 5.4 | 1 |
| 1 | 2 | 10.3 | 1 |
| 2 | 5 | 10.4 | 1 |
| 5 | 4 | 10.8 | 1 |
| 4 | 1 | 11.0 | 1 |
| 1 | 2 | 12.3 | 1 |
| 2 | 3 | 13.5 | 1 |
| 3 | 1 | 13.6 | 1 |
+------------+---------------+-----------+----+
Within a given id, we want to find all intervals that start with 1->2 (easy enough query), and end with either state 3 or 4.
1->2->anything->3 or 4
An example output table given the input above would have the three states and the timestamps for when we transition between the states:
target output
+------------+---------------+------------+-----------+-----------+
| prev_state | current_state | end_state | curr_time | end_time |
+------------+---------------+------------+-----------+-----------+
| 1 | 2 | 3 | 4.5 | 5.2 |
| 1 | 2 | 4 | 10.3 | 10.8 |
| 1 | 2 | 3 | 12.3 | 13.5 |
+------------+---------------+------------+-----------+-----------+
The best query I could come up with uses window functions in a sub-table, and then creates the new columns from that table. But this solution only finds the row immediately following the initial transition, and doesn't allow other states to occur between then and when our target state arrives.
WITH state_transitions as (
SELECT
id,
prev_state, current_state,
LEAD(current_state) OVER ( PARTITION BY id ORDER BY timestamp) AS end_state,
timestamp as curr_time,
LEAD(timestamp) OVER ( PARTITION BY id ORDER BY timestamp) AS end_time
FROM
state_transition_table
)
SELECT
prev_state,
current_state,
end_state,
curr_time,
end_time
FROM state_transitions
WHERE prev_state=1 and current_state=2
ORDER BY curr_time
This query would incorrectly give the second output row end_state==5, which is not what I am looking for.
How can one search a table for the next row that matches my target condition, eg end_state=3 OR end_state=4?
This requires a recursive query that checks each row against its siblings. The query below should also handle intervals that span more than three rows. I assumed Oracle for the seed data, so you may need to adapt the syntax to your database engine. I have commented the query wherever I thought it was needed.
WITH /*SEED DATA*/
state_transition_table(prev_state, current_state, time_stamp, id) as (
SELECT 1 , 2 , 4.5 , 1 --FROM DUAL
UNION ALL SELECT 2 , 3 , 5.2 , 1 --FROM DUAL
UNION ALL SELECT 3 , 1 , 5.4 , 1 --FROM DUAL
UNION ALL SELECT 1 , 2 , 10.3 , 1 --FROM DUAL
UNION ALL SELECT 2 , 5 , 10.4 , 1 --FROM DUAL
UNION ALL SELECT 5 , 4 , 10.8 , 1 --FROM DUAL
UNION ALL SELECT 4 , 1 , 11.0 , 1 --FROM DUAL
UNION ALL SELECT 1 , 2 , 12.3 , 1 --FROM DUAL
UNION ALL SELECT 2 , 3 , 13.5 , 1 --FROM DUAL
UNION ALL SELECT 3 , 1 , 13.6 , 1 --FROM DUAL
)
/*THE END STATES YOU ARE LOOKING FOR*/
, end_states (a_state) as (
select 3 --FROM DUAL
union all select 4 --FROM DUAL
)
/*ORDER THE STEPS TO USE THE order_id COLUMN TO EVALUATE THE NEXT NODE*/
, ordered_states as (
SELECT row_number() OVER (ORDER BY time_stamp) order_id
, prev_state
, current_state
, id
, time_stamp
FROM state_transition_table
)
/*RECURSIVE QUERY WITH ANSI SYNTAX*/
, recursive (
root_order_id
, order_id
, time_stamp
, prev_state
, current_state
--, id
, steps
)
as (
SELECT order_id root_order_id /*THE order_id OF EACH ROOT ROW*/
, order_id
, time_stamp
, prev_state
, current_state
, CAST(order_id as char(100)) as steps /*INITIAL VALIDATION PATH*/
FROM ordered_states
WHERE prev_state = 1 AND current_state = 2 /*INITIAL CONDITION*/
UNION ALL
SELECT prev.root_order_id
, this.order_id
, this.time_stamp
, prev.prev_state
, this.current_state
, CAST(CONCAT(CONCAT(RTRIM(LTRIM(prev.steps)), ', '), RTRIM(LTRIM(CAST(this.order_id as char(3))))) as char(100)) as steps
FROM recursive prev /*ANSI PSEUDO TABLE*/
, ordered_states this /*THE SIBLING ROW TO CHECK*/
WHERE prev.order_id = this.order_id - 1 /*ROW TO PREVIOUS ROW JOIN*/
and prev.current_state not in (select a_state from end_states) /*THE PREVIOUS ROW STATE IS NOT AN END STATE */
)
select init_state.prev_state
, init_state.current_state as mid_state /*this name is better, I think*/
, end_state.current_state
, init_state.time_stamp as initial_time /*initial_time is better, I think*/
, end_state.time_stamp as end_time /*end_time is better, I think*/
, recursive.steps as validation_path_by_order_id
from recursive
inner join ordered_states init_state
on init_state.order_id = recursive.root_order_id
inner join ordered_states end_state
on end_state.order_id = recursive.order_id
where recursive.current_state in (select a_state from end_states)
One final note. The resulting columns only account for 3 rows (prev_state, mid_state and current_state). As I said above, there are cases where a path from (1) to (2) to (3 or 4) has more than three rows, let's say 1 to 2 to 5 to 2 to 3, so the mid_state is really just one of the states in the middle.
Final-final note: Your desired results table was wrong, but you corrected it. 👍
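For comparison, here is a minimal sketch of a non-recursive alternative, which is not the query above: for each 1->2 row, a correlated subquery picks the earliest later row of the same id whose current_state is 3 or 4. It runs against in-memory SQLite via Python's sqlite3 module; the function name find_intervals is hypothetical, and the time column is called time_stamp as in the answer's seed data. One assumed limitation: if a second 1->2 transition occurs before any end state, both start rows pair with the same end row.

```python
import sqlite3

# Seed data from the question: (prev_state, current_state, time_stamp, id).
TRANSITIONS = [
    (1, 2, 4.5, 1), (2, 3, 5.2, 1), (3, 1, 5.4, 1), (1, 2, 10.3, 1),
    (2, 5, 10.4, 1), (5, 4, 10.8, 1), (4, 1, 11.0, 1), (1, 2, 12.3, 1),
    (2, 3, 13.5, 1), (3, 1, 13.6, 1),
]


def find_intervals(rows=TRANSITIONS):
    """Pair every 1->2 transition with the next row that reaches state 3 or 4."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE state_transition_table "
        "(prev_state INT, current_state INT, time_stamp REAL, id INT)"
    )
    conn.executemany(
        "INSERT INTO state_transition_table VALUES (?, ?, ?, ?)", rows
    )
    result = conn.execute("""
        SELECT s.prev_state,
               s.current_state,
               e.current_state AS end_state,
               s.time_stamp    AS curr_time,
               e.time_stamp    AS end_time
        FROM state_transition_table s
        JOIN state_transition_table e
          ON e.id = s.id
         -- earliest later row of the same id that lands in an end state
         AND e.time_stamp = (
               SELECT MIN(x.time_stamp)
               FROM state_transition_table x
               WHERE x.id = s.id
                 AND x.time_stamp > s.time_stamp
                 AND x.current_state IN (3, 4)
             )
        WHERE s.prev_state = 1 AND s.current_state = 2
        ORDER BY s.time_stamp
    """).fetchall()
    conn.close()
    return result


for row in find_intervals():
    print(row)
# (1, 2, 3, 4.5, 5.2)
# (1, 2, 4, 10.3, 10.8)
# (1, 2, 3, 12.3, 13.5)
```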

Vertica dynamic pivot/transform

I have a table in vertica :
id Timestamp Mask1 Mask2
-------------------------------------------
1 11:30 50 100
1 11:35 52 101
2 12:00 53 102
3 09:00 50 100
3 22:10 52 105
. . . .
. . . .
Which I want to transform into :
id rows 09:00 11:30 11:35 12:00 22:10 .......
--------------------------------------------------------------
1 Mask1 Null 50 52 Null Null .......
Mask2 Null 100 101 Null Null .......
2 Mask1 Null Null Null 53 Null .......
Mask2 Null Null Null 102 Null .......
3 Mask1 50 Null Null Null 52 .......
Mask2 100 Null Null Null 105 .......
The dots (...) indicate that I have many records.
Timestamp is for a whole day and is of format hours:minutes:seconds starting from 00:00:00 to 24:00:00 for a day (I have just used hours:minutes for the question).
I have defined just two extra columns Mask1 and Mask2. I have about 200 Mask columns to work with.
I have shown 5 records, but in reality I have about a million records.
What I have tried so far:
Dumping each records based on id in a csv file.
Applying transpose in python pandas.
Joining the transposed tables.
A possible generic solution may be pivoting in Vertica (or a UDTF), but I am fairly new to this database.
I have been struggling with this logic for a couple of days. Can anyone please help me? Thanks a lot.
Below is the solution as I would code it for just the time values that you have in your data examples.
If you really want one column for each of the 86400 values from '00:00:00' through '23:59:59', though, you won't be able to: Vertica's maximum number of columns is 1600.
You could, however, play with the Vertica function TIME_SLICE(timestamp::TIMESTAMP,1,'MINUTE')::TIME
(TIME_SLICE takes a timestamp as input and returns a timestamp, so you have to cast (::) back and forth), to reduce the number of distinct time values, and so of pivoted columns, to at most 1440 ...
In any case, I would start with SELECT DISTINCT timestamp FROM input ORDER BY 1;, and then generate one line per timestamp found (hoping there won't be more than 1598 of them ...) for the final query, like these ones for the timestamps actually used in your data:
, SUM(CASE timestamp WHEN '09:00' THEN val END) AS "09:00"
, SUM(CASE timestamp WHEN '11:30' THEN val END) AS "11:30"
, SUM(CASE timestamp WHEN '11:35' THEN val END) AS "11:35"
, SUM(CASE timestamp WHEN '12:00' THEN val END) AS "12:00"
, SUM(CASE timestamp WHEN '22:10' THEN val END) AS "22:10"
SQL in general has no variable number of output columns from any given query. If the number of final columns varies depending on the data, you will have to generate your final query from the data, and then run it.
Welcome to SQL and relational databases ..
Here's the complete script for your data. I pivot vertically first, along the "Mask-n" column names, and then I re-pivot horizontally, along the timestamps.
\pset null Null
-- ^ this is a vsql command to display nulls with the "Null" string
WITH
-- your input, not in final query
input(id,Timestamp,Mask1,Mask2) AS (
SELECT 1 , TIME '11:30' , 50 , 100
UNION ALL SELECT 1 , TIME '11:35' , 52 , 101
UNION ALL SELECT 2 , TIME '12:00' , 53 , 102
UNION ALL SELECT 3 , TIME '09:00' , 50 , 100
UNION ALL SELECT 3 , TIME '22:10' , 52 , 105
)
,
-- real WITH clause starts here
-- need an index for your 200 masks
i(i) AS (
SELECT MICROSECOND(ts) FROM (
SELECT TIMESTAMPADD(MICROSECOND, 1,TIMESTAMP '2000-01-01') AS tm
UNION ALL SELECT TIMESTAMPADD(MICROSECOND,200,TIMESTAMP '2000-01-01') AS tm
)x
TIMESERIES ts AS '1 MICROSECOND' OVER(ORDER BY tm)
)
,
-- verticalised masks
vertical AS (
SELECT
id
, i
, CASE i
WHEN 1 THEN 'Mask001'
WHEN 2 THEN 'Mask002'
WHEN 200 THEN 'Mask200'
END AS rows
, timestamp
, CASE i
WHEN 1 THEN Mask1
WHEN 2 THEN Mask2
WHEN 200 THEN 0 -- no mask200 present
END AS val
FROM input CROSS JOIN i
WHERE i <=2 -- only 2 masks present currently
)
-- test the vertical CTE ...
-- SELECT * FROM vertical order by id,rows,timestamp;
-- out id | i | rows | timestamp | val
-- out ----+---+---------+-----------+-----
-- out 1 | 1 | Mask001 | 11:30:00 | 50
-- out 1 | 1 | Mask001 | 11:35:00 | 52
-- out 1 | 2 | Mask002 | 11:30:00 | 100
-- out 1 | 2 | Mask002 | 11:35:00 | 101
-- out 2 | 1 | Mask001 | 12:00:00 | 53
-- out 2 | 2 | Mask002 | 12:00:00 | 102
-- out 3 | 1 | Mask001 | 09:00:00 | 50
-- out 3 | 1 | Mask001 | 22:10:00 | 52
-- out 3 | 2 | Mask002 | 09:00:00 | 100
-- out 3 | 2 | Mask002 | 22:10:00 | 105
SELECT
id
, rows
, SUM(CASE timestamp WHEN '09:00' THEN val END) AS "09:00"
, SUM(CASE timestamp WHEN '11:30' THEN val END) AS "11:30"
, SUM(CASE timestamp WHEN '11:35' THEN val END) AS "11:35"
, SUM(CASE timestamp WHEN '12:00' THEN val END) AS "12:00"
, SUM(CASE timestamp WHEN '22:10' THEN val END) AS "22:10"
FROM vertical
GROUP BY
id
, rows
ORDER BY
id
, rows
;
-- out Null display is "Null".
-- out id | rows | 09:00 | 11:30 | 11:35 | 12:00 | 22:10
-- out ----+---------+-------+-------+-------+-------+-------
-- out 1 | Mask001 | Null | 50 | 52 | Null | Null
-- out 1 | Mask002 | Null | 100 | 101 | Null | Null
-- out 2 | Mask001 | Null | Null | Null | 53 | Null
-- out 2 | Mask002 | Null | Null | Null | 102 | Null
-- out 3 | Mask001 | 50 | Null | Null | Null | 52
-- out 3 | Mask002 | 100 | Null | Null | Null | 105
-- out (6 rows)
-- out
-- out Time: First fetch (6 rows): 28.143 ms. All rows formatted: 28.205 ms
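The point made earlier, that the final query has to be generated from the data and then run, can be sketched in a small driver script. This is a minimal illustration and not Vertica-specific: it uses Python's sqlite3 module with an in-memory table, unpivots the two masks with UNION ALL instead of the index CTE, and the names masks, ts, build_pivot_sql and pivot are made up for the example. Against Vertica the same two steps would go through whatever client library you use.

```python
import re
import sqlite3

# Sample rows from the question: (id, time, Mask1, Mask2).
SAMPLE = [
    (1, "11:30", 50, 100),
    (1, "11:35", 52, 101),
    (2, "12:00", 53, 102),
    (3, "09:00", 50, 100),
    (3, "22:10", 52, 105),
]


def build_pivot_sql(times):
    """Build one SUM(CASE ...) output column per distinct time value."""
    cols = []
    for t in times:
        # The values are pasted into SQL text, so accept only HH:MM strings.
        if not re.fullmatch(r"\d{2}:\d{2}", t):
            raise ValueError(f"unexpected time value: {t!r}")
        cols.append(f"SUM(CASE ts WHEN '{t}' THEN val END) AS \"{t}\"")
    return (
        "SELECT id, which, " + ", ".join(cols) + " FROM ("
        " SELECT id, ts, 'Mask1' AS which, mask1 AS val FROM masks"
        " UNION ALL"
        " SELECT id, ts, 'Mask2', mask2 FROM masks"
        ") GROUP BY id, which ORDER BY id, which"
    )


def pivot(rows=SAMPLE):
    """Step 1: read the distinct times. Step 2: generate and run the pivot."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE masks (id INT, ts TEXT, mask1 INT, mask2 INT)")
    conn.executemany("INSERT INTO masks VALUES (?, ?, ?, ?)", rows)
    times = [r[0] for r in conn.execute("SELECT DISTINCT ts FROM masks ORDER BY 1")]
    result = conn.execute(build_pivot_sql(times)).fetchall()
    conn.close()
    return times, result


times, result = pivot()
print(times)
for row in result:
    print(row)
```

With 200 mask columns, the UNION ALL branches (or the CASE branches of the index approach) would be generated in the same way as the time columns.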
You can use union all to unpivot the data and then conditional aggregation:
select id, which,
max(case when timestamp >= '09:00' and timestamp < '09:30' then mask end) as "09:00",
max(case when timestamp >= '09:30' and timestamp < '10:00' then mask end) as "09:30",
max(case when timestamp >= '10:00' and timestamp < '10:30' then mask end) as "10:00",
. . .
from ((select id, timestamp,
'Mask1' as which, Mask1 as mask
from t
) union all
(select id, timestamp, 'Mask2' as which, Mask2 as mask
from t
)
) t
group by t.id, t.which;
Note: This includes the id on each row. I strongly recommend doing that, but you could use:
select (case when which = 'Mask1' then id end) as id
If you really wanted to.

Find all members in a tree structure

I have inherited a tree type table in this format
StatementAreaId | ParentStatementAreaId | SubjectId | Description
-----------------------------------------------------------------
1 | 0 | 100 | Reading
2 | 0 | 110 | Maths
3 | 2 | 0 | Number
4 | 2 | 0 | Shape
5 | 3 | 0 | Addition
6 | 3 | 0 | Subtraction
I want to find all the StatementAreaIds where the ultimate parent subject is, say maths (i.e. SubjectId=110). For instance if the SubjectId was Maths I'd get a list of StatementAreaIds in the tree:
StatementAreaId
---------------
2
3
4
5
6
The tree has a maximum depth of 3, if that helps.
Thanks
Recursive CTE to the rescue:
Create and populate sample table: (Please save us this step in your future questions)
DECLARE @T AS TABLE
(
StatementAreaId int,
ParentStatementAreaId int,
SubjectId int,
Description varchar(20)
)
INSERT INTO @T VALUES
(1 , 0 , 100 , 'Reading'),
(2 , 0 , 110 , 'Maths'),
(3 , 2 , 0 , 'Number'),
(4 , 2 , 0 , 'Shape'),
(5 , 3 , 0 , 'Addition'),
(6 , 3 , 0 , 'Subtraction')
Query:
;WITH CTE AS
(
SELECT StatementAreaId, ParentStatementAreaId
FROM @T
WHERE SubjectId = 110
UNION ALL
SELECT t1.StatementAreaId, t1.ParentStatementAreaId
FROM @T t1
INNER JOIN CTE ON t1.ParentStatementAreaId = CTE.StatementAreaId
)
SELECT StatementAreaId
FROM CTE
Results:
StatementAreaId
2
3
4
5
6
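For readers without SQL Server at hand, the recursive CTE above can be tried in SQLite through Python's sqlite3 module. This is a minimal sketch under that assumption: the table name statement_area and the function descendants_of_subject are invented for the example, SQLite needs the RECURSIVE keyword, and an ORDER BY is added only to make the output order deterministic.

```python
import sqlite3

# Sample rows from the question:
# (StatementAreaId, ParentStatementAreaId, SubjectId, Description).
AREAS = [
    (1, 0, 100, "Reading"),
    (2, 0, 110, "Maths"),
    (3, 2, 0, "Number"),
    (4, 2, 0, "Shape"),
    (5, 3, 0, "Addition"),
    (6, 3, 0, "Subtraction"),
]


def descendants_of_subject(subject_id, rows=AREAS):
    """Return every StatementAreaId in the tree rooted at the given subject."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE statement_area (StatementAreaId INT, "
        "ParentStatementAreaId INT, SubjectId INT, Description TEXT)"
    )
    conn.executemany("INSERT INTO statement_area VALUES (?, ?, ?, ?)", rows)
    found = conn.execute("""
        WITH RECURSIVE cte(StatementAreaId) AS (
            -- anchor: the root row(s) for the subject
            SELECT StatementAreaId
            FROM statement_area
            WHERE SubjectId = ?
            UNION ALL
            -- recursive step: children of rows already found
            SELECT t.StatementAreaId
            FROM statement_area t
            JOIN cte ON t.ParentStatementAreaId = cte.StatementAreaId
        )
        SELECT StatementAreaId FROM cte ORDER BY StatementAreaId
    """, (subject_id,)).fetchall()
    conn.close()
    return [r[0] for r in found]


print(descendants_of_subject(110))  # [2, 3, 4, 5, 6]
```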

Count who paid group by 1, 2 or 3+

I have a payment table like the example below and I need a query that gives me how many IDs paid (AMOUNT > 0) 1 time, 2 times, 3 or more times. Example:
+----+--------+
| ID | AMOUNT |
+----+--------+
| 1 | 50 |
| 1 | 0 |
| 2 | 10 |
| 2 | 20 |
| 2 | 15 |
| 2 | 10 |
| 3 | 80 |
+----+--------+
I expect the result:
+-----------+------------+-------------+
| 1 payment | 2 payments | 3+ payments |
+-----------+------------+-------------+
| 2 | 0 | 1 |
+-----------+------------+-------------+
ID 1: Paid 1 time (50). The other payment is 0, so I did not count. So, 1 person paid 1 time.
ID 2: Paid 4 times (10, 20, 15, 10). So, 1 person paid 3 or more times.
ID 3: Paid 1 time (80). So, 2 persons paid 1 time.
I'm doing this manually in Excel right now, but I'm pretty sure there is a more practical solution. Any ideas?
A little sub-query will do the trick
Declare @YourTable table (ID int, AMOUNT int)
Insert into @YourTable values
( 1 , 50 ),
( 1 , 0) ,
( 2 , 10) ,
( 2 , 20) ,
( 2 , 15) ,
( 2 , 10) ,
( 3 , 80)
Select [1_Payment] = sum(case when Cnt=1 then 1 else 0 end)
,[2_Payment] = sum(case when Cnt=2 then 1 else 0 end)
,[3_Payment] = sum(case when Cnt>2 then 1 else 0 end)
From (
Select id
,Cnt=count(*)
From @YourTable
Where Amount<>0
Group By ID
) A
Returns
1_Payment 2_Payment 3_Payment
2 0 1
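The two-step logic (count the positive payments per ID, then bucket those counts) can be checked end to end with a minimal sketch in Python's sqlite3 module. The table name payments and the function payment_buckets are hypothetical, and the filter uses amount > 0 as stated in the question; COALESCE is added so an empty table yields zeros rather than NULLs.

```python
import sqlite3

# Sample rows from the question: (ID, AMOUNT).
PAYMENTS = [(1, 50), (1, 0), (2, 10), (2, 20), (2, 15), (2, 10), (3, 80)]


def payment_buckets(rows=PAYMENTS):
    """Return (ids with 1 payment, ids with 2 payments, ids with 3+ payments)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payments (id INT, amount INT)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    result = conn.execute("""
        SELECT COALESCE(SUM(CASE WHEN cnt = 1 THEN 1 ELSE 0 END), 0),
               COALESCE(SUM(CASE WHEN cnt = 2 THEN 1 ELSE 0 END), 0),
               COALESCE(SUM(CASE WHEN cnt > 2 THEN 1 ELSE 0 END), 0)
        FROM (
            -- inner step: number of positive payments per id
            SELECT id, COUNT(*) AS cnt
            FROM payments
            WHERE amount > 0
            GROUP BY id
        )
    """).fetchone()
    conn.close()
    return result


print(payment_buckets())  # (2, 0, 1)
```

An ID whose payments are all 0 drops out of the inner query entirely, so it is counted in none of the three buckets.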
To get the output you want, try using a CTE to compute the per-id counts and then SELECT from that:
with c as (
select count(*) count from mytable where amount > 0 group by id)
select
sum(case count when 1 then 1 else 0 end) "1 Payment"
, sum(case count when 2 then 1 else 0 end) "2 Payments"
, sum(case when count > 2 then 1 else 0 end) "3+ Payments"
from c

How to create sql selection based on condition?

I have the following database which shows characteristics of attributes as follows:
attributeId | attributeCode | groupCode
------------+---------------+-----------
1 | 10 | 50
1 | 10 | 50
1 | 12 | 50
My desired result from a select would be:
attributeId | groupcount | code10 | code12
------------+------------+--------+--------
1 | 1 | 2 | 1
Which means: attributeId = 1 has only one groupCode (50), where attributeCode=10 occurs 2 times and attributeCode=12 occurs 1 time.
Of course the following is not valid, but you get the idea of what I'm trying to achieve:
select attributeId,
count(distinct(groupCode)) as groupcount,
attributeCode = 10 as code10,
attributeCode = 12 as code12
from table
group by attributeId;
Try this:
SELECT attributeId, COUNT(DISTINCT groupCode) AS groupcount,
COUNT(CASE WHEN attributeCode = 10 THEN 1 END) AS code10,
COUNT(CASE WHEN attributeCode = 12 THEN 1 END) AS code12
FROM mytable
GROUP BY attributeId
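The conditional-count query above relies on COUNT ignoring NULLs: the CASE expression yields NULL whenever the code does not match, so only matching rows are counted. A minimal sketch that runs it on the sample data with Python's sqlite3 module follows; the table name attributes and the function attribute_summary are made up for the example.

```python
import sqlite3

# Sample rows from the question: (attributeId, attributeCode, groupCode).
ATTRIBUTES = [(1, 10, 50), (1, 10, 50), (1, 12, 50)]


def attribute_summary(rows=ATTRIBUTES):
    """Return (attributeId, groupcount, code10, code12) per attributeId."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE attributes (attributeId INT, attributeCode INT, groupCode INT)"
    )
    conn.executemany("INSERT INTO attributes VALUES (?, ?, ?)", rows)
    result = conn.execute("""
        SELECT attributeId,
               COUNT(DISTINCT groupCode) AS groupcount,
               -- CASE without ELSE yields NULL, which COUNT skips
               COUNT(CASE WHEN attributeCode = 10 THEN 1 END) AS code10,
               COUNT(CASE WHEN attributeCode = 12 THEN 1 END) AS code12
        FROM attributes
        GROUP BY attributeId
        ORDER BY attributeId
    """).fetchall()
    conn.close()
    return result


print(attribute_summary())  # [(1, 1, 2, 1)]
```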