Aggregate function to detect trend in PostgreSQL - sql

I'm using a psql DB to store a data structure like so:
datapoint(userId, rank, timestamp)
where timestamp is the Unix Epoch milliseconds timestamp.
In this structure I store the rank of each user each day, so it's like:
UserId Rank Timestamp
1 1 1435366459
1 2 1435366458
1 3 1435366457
2 8 1435366456
2 6 1435366455
2 7 1435366454
So, in the sample data above, userId 1 its improving it's rank with each measurement, which means it has a positive trend, while userId 2 is dropping in rank, which means it has a negative trend.
What I need to do is to detect all users that have a positive trend based on the last N measurements.

One approach would be to perform a linear regression on the each user's rank, and check if the slope is positive or negative. Luckily, PostgreSQL has a builtin function to do that - regr_slope:
SELECT user_id, regr_slope (rank1, timestamp1) AS slope
FROM my_table
GROUP BY user_id
This query gives you the basic functionality. Now, you can dress it up a bit with case expressions if you like:
SELECT user_id,
CASE WHEN slope > 0 THEN 'positive'
WHEN slope < 0 THEN 'negative'
ELSE 'steady' END AS trend
FROM (SELECT user_id, regr_slope (rank1, timestamp1) AS slope
FROM my_table
GROUP BY user_id) t
Edit:
Unfortunately, regr_slope doesn't have a built in way to handle "top N" type requirements, so this should be handled separately, e.g., by a subquery with row_number:
-- Decoration outer query
SELECT user_id,
CASE WHEN slope > 0 THEN 'positive'
WHEN slope < 0 THEN 'negative'
ELSE 'steady' END AS trend
FROM (-- Inner query to calculate the slope
SELECT user_id, regr_slope (rank1, timestamp1) AS slope
FROM (-- Inner query to get top N
SELECT user_id, rank1,
ROW_NUMER() OVER (PARTITION BY user_id
ORDER BY timestamp1 DESC) AS rn
FROM my_table) t
WHERE rn <= N -- Replace N with the number of rows you need
GROUP BY user_id) t2

You can use analytic functions for this. Overall approach:
compute the previous rank using lag()
use case to decide whether the trend is positive or not (0 or 1)
use min() to get the minimum trend over the preceding N rows; if the trend was positive for N rows, this returns 1, otherwise 0. To limit it to N rows, use the between N preceding and 0 following clause of the windowing function
Code:
select v2.*,
min(positive_trend) over (partition by userid order by timestamp1
rows between 3 preceding and 0 following) as trend_overall
from (
select v1.*,
(case when prev_rank < rank1 then 0 else 1 end) as positive_trend
from (
select userid,
rank1,
timestamp1,
lag(rank1) over (partition by userid order by timestamp1) as prev_rank
from t1
order by userid, timestamp1
) v1
) v2
SQL Fiddle
UPDATE
To only get the userid with the overall trend and the delta for the rank, you'll have to add another call to lag(.., N+1) to get the nth previous rank and row_number() to get a numbering within the same userid:
select v3.userid, v3.trend_overall, delta_rank
from (
select v2.*,
min(positive_trend) over (partition by userid order by timestamp1
rows between 3 preceding and 0 following) as trend_overall,
latest_rank - prev_N_rank as delta_rank
from (
select v1.*,
(case when prev_rank < rank1 then 0 else 1 end) as positive_trend,
max(case when v1.rn = 1 then rank1 else NULL end) over (partition by userid) as latest_rank
from (
select userid,
rank1,
timestamp1,
lag(rank1) over (partition by userid order by timestamp1) as prev_rank,
lag(rank1, 4) over (partition by userid order by timestamp1) as prev_N_rank,
row_number() over (partition by userid order by timestamp1 desc) as rn
from t1
order by userid, timestamp1
) v1
) v2
) v3
where rn = 1
group by userid, trend_overall, delta_rank
order by userid, trend_overall, delta_rank
Updated SQL Fiddle

Related

SUM a specific column in next rows until a condition is true

Here is a table of articles and I want to store sum of Mass Column from next rows in sumNext Column based on a condition.
If next row has same floor (in floorNo column) as current row, then add the mass of next rows until the floor is changed
E.g : Rows three has sumNext = 2. That is computed by adding the mass from row four and row five because both rows has same floor number as row three.
id
mass
symbol
floorNo
sumNext
2891176
1
D
1
0
2891177
1
L
8
0
2891178
1
L
1
2
2891179
1
L
1
1
2891180
1
1
0
2891181
1
5
2
2891182
1
5
1
2891183
1
5
0
Here is the query, that is generating this table, I just want to add sumNext column with the right value inside.
WITH items AS (SELECT
SP.id,
SP.mass,
SP.symbol,
SP.floorNo
FROM articles SP
ORDER BY
DECODE(SP.symbol,
'P',1,
'D',2,
'L',3,
4 ) asc)
SELECT CLS.*
FROM items CLS;
You could use below solution which uses
common table expression (cte) technique to put all consecutive rows with same FLOORNO value in the same group (new grp column).
Then uses the analytic version of SUM function to sum all next MASS per grp column as required.
Items_RowsNumbered (id, mass, symbol, floorNo, rnb) as (
select ID, MASS, SYMBOL, FLOORNO
, row_number()over(
order by DECODE(symbol, 'P',1, 'D',2, 'L',3, 4 ) asc, ID )
/*
You need to add ID column (or any others columns that can identify each row uniquely)
in the "order by" clause to make the result deterministic
*/
from (Your source query)Items
)
, cte(id, mass, symbol, floorNo, rnb, grp) as (
select id, mass, symbol, floorNo, rnb, 1 grp
from Items_RowsNumbered
where rnb = 1
union all
select t.id, t.mass, t.symbol, t.floorNo, t.rnb
, case when t.floorNo = c.floorNo then c.grp else c.grp + 1 end grp
from Items_RowsNumbered t
join cte c on (c.rnb + 1 = t.rnb)
)
select
ID, MASS, SYMBOL, FLOORNO
/*, RNB, GRP*/
, nvl(
sum(MASS)over(
partition by grp
order by rnb
ROWS BETWEEN 1 FOLLOWING and UNBOUNDED FOLLOWING)
, 0
) sumNext
from cte
;
demo on db<>fiddle
This is a typical gaps-and-islands problem. You can use LAG() in order to determine the exact partitions, and then SUM() analytic function such as
WITH ii AS
(
SELECT i.*,
ROW_NUMBER() OVER (ORDER BY id DESC) AS rn2,
ROW_NUMBER() OVER (PARTITION BY floorNo ORDER BY id DESC) AS rn1
FROM items i
)
SELECT id,mass,symbol, floorNo,
SUM(mass) OVER (PARTITION BY rn2-rn1 ORDER BY id DESC)-1 AS sumNext
FROM ii
ORDER BY id
Demo

Complex Ranking in SQL (Teradata)

I have a peculiar problem at hand. I need to rank in the following manner:
Each ID gets a new rank.
rank #1 is assigned to the ID with the lowest date. However, the subsequent dates for that particular ID can be higher but they will get the incremental rank w.r.t other IDs.
(E.g. ADF32 series will be considered to be ranked first as it had the lowest date, although it ends with dates 09-Nov, and RT659 starts with 13-Aug it will be ranked subsequently)
For a particular ID, if the days are consecutive then ranks are same, else they add by 1.
For a particular ID, ranks are given in date ASC.
How to formulate a query?
You need two steps:
select
id_col
,dt_col
,dense_rank()
over (order by min_dt, id_col, dt_col - rnk) as part_col
from
(
select
id_col
,dt_col
,min(dt_col)
over (partition by id_col) as min_dt
,rank()
over (partition by id_col
order by dt_col) as rnk
from tab
) as dt
dt_col - rnk caluclates the same result for consecutives dates -> same rank
Try datediff on lead/lag and then perform partitioned ranking
select t.ID_COL,t.dt_col,
rank() over(partition by t.ID_COL, t.date_diff order by t.dt_col desc) as rankk
from ( SELECT ID_COL,dt_col,
DATEDIFF(day, Lag(dt_col, 1) OVER(ORDER BY dt_col),dt_col) as date_diff FROM table1 ) t
One way to think about this problem is "when to add 1 to the rank". Well, that occurs when the previous value on a row with the same id_col differs by more than one day. Or when the row is the earliest day for an id.
This turns the problem into a cumulative sum:
select t.*,
sum(case when prev_dt_col = dt_col - 1 then 0 else 1
end) over
(order by min_dt_col, id_col, dt_col) as ranking
from (select t.*,
lag(dt_col) over (partition by id_col order by dt_col) as prev_dt_col,
min(dt_col) over (partition by id_col) as min_dt_col
from t
) t;

RANK in SQL but start at 1 again when number is greater than

I need an sql code for the below. I want it to RANK however if DSLR >= 60 then I want the rank to start again like below.
Thanks
Assuming that you have a column that defines the ordering of the rows, say id, you can address this as a gaps-and-islands problem. Islands are group of adjacent record that start with a dslr above 60. We can identify them with a window sum, then rank within each island:
select dslr, rank() over(partition by grp order by id) as rn
from (
select t.*,
sum(case when dslr >= 60 then 1 else 0 end) over(order by id) as grp
from mytable t
) t

Using window function in redshift to aggregate conditionally

I have a table with following data:
Link to test data: http://sqlfiddle.com/#!15/dce01/1/0
I want to aggregate the items column (using listagg) for each group in gid in sequence as specified by seq column based on the condition that aggregation ends when pid becomes 0 again for a group.
i.e.
for group g1, there would be 2 aggregations; 1 for seq 1-3 and another for sequence 4-6; since for group g1, the pid becomes 0 for seq 4.
I expect the result for the given example to be as follows (Please note that seq in result is the min value of seq for the group where the pid becomes 0):
I understand your question as a gaps and island problem, where you want to group together adjacent rows having the same gid untiil a pid having value 0 is met.
Here is one way to solve it using a window sum to define the groups: basically, a new island starts everytime a pid of 0 is met. The rest is just aggregation:
select
gid,
min(seq) seq,
listagg(items, ',') within group(order by seq) items
from (
select
t.*,
sum(case when pid = 0 then 1 else 0 end) over(partition by gid order by seq) grp
from mytable t
) t
group by gid, grp
order by gid, grp
it's gaps and islands problem:
with
subgroup_ids as (
select *, sum(case when pid=0 then 1 else 0 end) over (partition by gid order by seq) as subgroup_id
from tablename
)
select gid, subgroup_id, listagg(items,',')
from subgroup_ids
group by 1,2

Find minimum value in groups of rows

In the SQL space (specifically T-SQL, SQL Server 2008), given this list of values:
Status Date
------ -----------------------
ACT 2012-01-07 11:51:06.060
ACT 2012-01-07 11:51:07.920
ACT 2012-01-08 04:13:29.140
NOS 2012-01-09 04:29:16.873
ACT 2012-01-21 12:39:37.607 <-- THIS
ACT 2012-01-21 12:40:03.840
ACT 2012-05-02 16:27:17.370
GRAD 2012-05-19 13:30:02.503
GRAD 2013-09-03 22:58:48.750
Generated from this query:
SELECT Status, Date
FROM Account_History
WHERE AccountNumber = '1234'
ORDER BY Date
The status for this particular object started at ACT, then changed to NOS, then back to ACT, then to GRAD.
What is the best way to get the minimum date from the latest "group" of records where Status = 'ACT'?
Here is a query that does this, by identifying the groups where the student statuses are the same and then using simple aggregation:
select top 1 StudentStatus, min(WhenLastChanged) as WhenLastChanged
from (SELECT StudentStatus, WhenLastChanged,
(row_number() over (order by "date") -
row_number() over (partition by studentstatus order by "date)
) as grp
FROM Account_History
WHERE AccountNumber = '1234'
) t
where StudentStatus = 'ACT'
group by StudentStatus, grp
order by WhenLastChanged desc;
The row_number() function assigns sequential numbers within groups of rows based on the date. For your data, the two row_numbers() and their difference is:
Status Date
------ -----------------------
ACT 2012-01-07 11:51:06.060 1 1 0
ACT 2012-01-07 11:51:07.920 2 2 0
ACT 2012-01-08 04:13:29.140 3 3 0
NOS 2012-01-09 04:29:16.873 4 1 3
ACT 2012-01-21 12:39:37.607 5 4 1
ACT 2012-01-21 12:40:03.840 6 5 1
ACT 2012-05-02 16:27:17.370 7 6 1
GRAD 2012-05-19 13:30:02.503 8 1 7
GRAD 2013-09-03 22:58:48.750 9 2 7
Notice the last row is constant for rows that have the same status.
The aggregation brings these together and chooses the latest (top 1 . . . order by date desc) of the first dates (min(date)).
EDIT:
The query is easy to tweak for multiple account numbers. I probably should have written that way to begin with, except the final selection is trickier. The results from this has the date for each status and account:
select StudentStatus, min(WhenLastChanged) as WhenLastChanged
from (SELECT StudentStatus, WhenLastChanged, AccountNumber
(row_number() over (partition by AccountNumber order by WhenLastChanged) -
row_number() over (partition by AccountNumber, studentstatus order by WhenLastChanged)
) as grp
FROM Account_History
) t
where StudentStatus = 'ACT'
group by AccountNumber, StudentStatus, grp
order by WhenLastChanged desc;
But you can't get the last one per account quite so easily. Another level of subqueries:
select AccountNumber, StudentStatus, WhenLastChanged
from (select AccountNumber, StudentStatus, min(WhenLastChanged) as WhenLastChanged,
row_number() over (partition by AccountNumber, StudentStatus order by min(WhenLastChanged) desc
) as seqnum
from (SELECT AccountNumber, StudentStatus, WhenLastChanged,
(row_number() over (partition by AccountNumber order by WhenLastChanged) -
row_number() over (partition by AccountNumber, studentstatus order by WhenLastChanged)
) as grp
FROM Account_History
) t
where StudentStatus = 'ACT'
group by AccountNumber, StudentStatus, grp
) t
where seqnum = 1;
This uses aggregation along with the window function row_number(). This is assigning sequential numbers to the groups (after aggregation), with the last date for each account getting a value of 1 (order by min(WhenLastChanged) desc). The outermost select then just chooses that row for each account.
SELECT [Status], MIN([Date])
FROM Table_Name
WHERE [Status] = (SELECT [Status]
FROM Table_Name
WHERE [Date] = (SELECT MAX([Date])
FROM Table_Name)
)
GROUP BY [Status]
Try here Sql Fiddle
Hogan: basically, yes. I just want to know the date/time when the
account was last changed to ACT. The records after the point above
marked THIS are just extra.
Instead of just looking for act we can look for first time status changes and select act (and max) from that.
so... every time a status changes:
with rownumb as
(
select *, row_number() OVER (order by date asc) as rn
)
select status, date
from rownumb A
join rownumb B on A.rn = B.rn-1
where a.status != b.status
now finding the max of the act items.
with rownumb as
(
select *, row_number() OVER (order by date asc) as rn
), statuschange as
(
select status, date
from rownumb A
join rownumb B on A.rn = B.rn-1
where a.status != b.status
)
select max(date)
from satuschange
where status='Act'