Incrementing row numbers by condition in Postgres (SQL)

I have a Postgres table with timestamps and, in difftime, the rounded difference in hours between the current and the previous (lagged) timestamp:
timestamp type difftime
2013-09-14 14:19:46 JPR03 2
2013-09-14 15:11:48 JPR03 1
2013-09-14 16:11:49 JPR03 1
2013-09-14 17:13:45 JPR03 1
2013-09-22 00:08:38 JPR03 175
2013-09-22 00:10:11 JPR03 0
2013-09-22 01:11:36 JPR03 1
2013-09-22 02:16:11 JPR03 1
2013-09-22 03:13:16 JPR03 1
2013-09-22 04:05:38 JPR03 1
2013-09-22 06:10:11 JPR03 2
2013-09-22 07:26:43 JPR03 1
2013-09-22 08:17:35 JPR03 1
2013-09-22 09:16:08 JPR03 1
2013-09-22 10:16:08 JPR03 1
2013-10-01 06:15:07 JPR03 212
2013-10-01 06:15:12 JPR03 0
2013-10-02 07:15:15 JPR03 25
2013-10-02 08:05:09 JPR03 1
My objective is to create an incremental row number sequence that increases by 1 when and only when the value in difftime is above a certain threshold x (ordered by time). If x = 5, then the output would look like this:
timestamp type difftime rownum
2013-09-14 14:19:46 JPR03 2 0
2013-09-14 15:11:48 JPR03 1 0
2013-09-14 16:11:49 JPR03 1 0
2013-09-14 17:13:45 JPR03 1 0
2013-09-22 00:08:38 JPR03 175 1
2013-09-22 00:10:11 JPR03 0 1
2013-09-22 01:11:36 JPR03 1 1
2013-09-22 02:16:11 JPR03 1 1
2013-09-22 03:13:16 JPR03 1 1
2013-09-22 04:05:38 JPR03 1 1
2013-09-22 06:10:11 JPR03 2 1
2013-09-22 07:26:43 JPR03 1 1
2013-09-22 08:17:35 JPR03 1 1
2013-09-22 09:16:08 JPR03 1 1
2013-09-22 10:16:08 JPR03 1 1
2013-10-01 06:15:07 JPR03 212 2
2013-10-01 06:15:12 JPR03 0 2
2013-10-02 07:15:15 JPR03 25 3
2013-10-02 08:05:09 JPR03 1 3
I am familiar with the RANK(), DENSE_RANK(), ROW_NUMBER(), and COALESCE() functions, but none of these would achieve the objective of incrementing a row number by condition (beginning with 0). Any suggestions on how to implement this kind of variable assignment or what functions might be applied here to partition based on a condition?

You can use the cumulative SUM() function with a conditional value: Add 1 if the condition is met, 0 otherwise:
SELECT
    *,
    SUM(
        CASE
            WHEN difftime >= 5 THEN 1
            ELSE 0
        END
    ) OVER (ORDER BY "timestamp") AS rownum
FROM <your query>

In Postgres, I would recommend using the FILTER clause:
select q.*,
       count(*) filter (where difftime > ?) over (order by "timestamp") as rownum
from <your query> q;
The ? is a placeholder for whatever value you have in mind.
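For a fully self-contained illustration, here is a minimal sketch that derives difftime with LAG() and applies the cumulative conditional count in one query. The table and column names (readings, ts, type) are hypothetical stand-ins for the real ones:

-- Hypothetical table: readings(ts timestamp, type text)
WITH diffs AS (
    SELECT
        ts,
        type,
        -- rounded hours between this row and the previous one; 0 for the first row
        COALESCE(ROUND(EXTRACT(EPOCH FROM ts - LAG(ts) OVER (ORDER BY ts)) / 3600), 0) AS difftime
    FROM readings
)
SELECT
    *,
    COUNT(*) FILTER (WHERE difftime >= 5) OVER (ORDER BY ts) AS rownum
FROM diffs;

Because LAG() returns NULL on the first row, COALESCE() maps it to 0, so the sequence starts at 0 as requested.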

Related

Grouping based on start date matching the previous row's end date SQL

Hoping someone can help me out with this problem.
I have the following sample dataset:
MEM_ID  CLM_ID  ADM_DT      DCHG_DT
1       111     01-01-2020  02-01-2020
1       112     03-01-2020  04-01-2020
1       113     04-01-2020  05-01-2020
1       114     06-01-2020  07-01-2020
2       211     01-01-2020  02-01-2020
2       212     05-01-2020  08-01-2020
3       311     02-01-2020  03-01-2020
3       312     03-01-2020  05-01-2020
3       313     05-01-2020  06-01-2020
3       314     07-01-2020  08-01-2020
I am trying to create groupings based on MEM_ID: if an ADM_DT is equal to the previous row's DCHG_DT, then the records should be grouped together.
Below is the expected output:
MEM_ID  CLM_ID  ADM_DT      DCHG_DT     GROUP_ID
1       111     01-01-2020  02-01-2020  1
1       112     03-01-2020  04-01-2020  2
1       113     04-01-2020  05-01-2020  2
1       114     06-01-2020  07-01-2020  3
2       211     01-01-2020  02-01-2020  1
2       212     05-01-2020  08-01-2020  2
3       311     02-01-2020  03-01-2020  1
3       312     03-01-2020  05-01-2020  1
3       313     05-01-2020  06-01-2020  1
3       314     07-01-2020  08-01-2020  2
I have attempted the following:
select DISTINCT MEM_ID
      ,CLM_ID
      ,ADM_DT
      ,DCHG_DT
      ,CASE WHEN ADM_DT = LAG(DCHG_DT) OVER (PARTITION BY MEM_ID ORDER BY ADM_DT, DCHG_DT) THEN 0 ELSE 1 END AS ISSTART
FROM table_name
Which produces something like this:
MEM_ID  CLM_ID  ADM_DT      DCHG_DT     ISSTART
1       111     01-01-2020  02-01-2020  1
1       112     03-01-2020  04-01-2020  1
1       113     04-01-2020  05-01-2020  0
1       114     06-01-2020  07-01-2020  1
2       211     01-01-2020  02-01-2020  1
2       212     05-01-2020  08-01-2020  1
3       311     02-01-2020  03-01-2020  1
3       312     03-01-2020  05-01-2020  0
3       313     05-01-2020  06-01-2020  0
3       314     07-01-2020  08-01-2020  1
I have also looked into other external sources such as https://www.kodyaz.com/t-sql/sql-query-for-overlapping-time-periods-on-sql-server.aspx
This got me pretty close, but I realized that the author was using a recursive CTE, which Netezza does not support.
Ultimately I would like to create these groupings so that I can then merge them back to the original table and sum values based on the assigned group for each MEM_ID.
Thank you in advance for any help provided.
Try this:
select MEM_ID, CLM_ID, ADM_DT, DCHG_DT,
       sum(ISSTART) over (partition by MEM_ID order by ADM_DT, DCHG_DT rows unbounded preceding) as GROUP_ID
from (
    select MEM_ID
          ,CLM_ID
          ,ADM_DT
          ,DCHG_DT
          ,CASE WHEN ADM_DT = LAG(DCHG_DT) OVER (PARTITION BY MEM_ID ORDER BY ADM_DT, DCHG_DT) THEN 0 ELSE 1 END AS ISSTART
    FROM table_name
) t
Basically, this feeds your ISSTART flag into a running sum, so GROUP_ID increments exactly on the rows that start a new group.
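If the eventual goal is to sum values per stay, a follow-up sketch could look like this (AMT is a hypothetical value column, and grouped stands for the query above wrapped in a subquery or CTE):

select MEM_ID, GROUP_ID,
       min(ADM_DT)  as ADM_DT,
       max(DCHG_DT) as DCHG_DT,
       sum(AMT)     as TOTAL_AMT   -- AMT is a made-up column for illustration
from grouped
group by MEM_ID, GROUP_ID

Once GROUP_ID is materialized this way, the merge back to the original table reduces to a plain GROUP BY.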

Finding most recent startdate, and endDate from consecutive dates

I have a table like below:
user_id  store_id  stock  date
116      2         0      2021-10-18
116      2         0      2021-10-19
116      2         0      2021-10-20
116      2         0      2021-08-16
116      2         0      2021-08-15
116      2         0      2021-07-04
116      2         0      2021-07-03
389      2         0      2021-07-02
389      2         0      2021-07-01
389      2         0      2021-10-27
52       6         0      2021-10-28
52       6         0      2021-10-29
52       6         0      2021-10-30
116      38        0      2021-05-02
116      38        0      2021-05-03
116      38        0      2021-05-04
116      38        0      2021-04-06
The table can have multiple consecutive days where a product ran out of stock, so I'd like to create a query with the last startDate and endDate where the product ran out of stock. For the table above, the results have to be:
user_id  store_id  startDate   endDate
116      2         2021-10-18  2021-10-20
116      38        2021-05-02  2021-05-04
389      2         2021-07-01  2021-07-02
52       6         2021-10-28  2021-10-30
I have tried the solution with row_number(), but it didn't work. Does someone have a tip or idea to solve this problem with SQL (PostgreSQL)?
Here is how you can do it, using the gaps-and-islands trick: subtracting row_number() (as days) from each date yields a value (grp) that is constant within every run of consecutive dates:
select user_id, store_id, min(date) as startdate, max(date) as enddate
from (
    select *, rank() over (partition by user_id, store_id order by grp desc) as rn
    from (
        select *, date - row_number() over (partition by user_id, store_id order by date) * interval '1 day' as grp
        from tablename
    ) t
) t
where rn = 1
group by user_id, store_id, grp
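To see why grp identifies runs, here is a minimal standalone demonstration using inline VALUES (no real table needed); every date in a consecutive run maps to the same grp:

-- 2021-08-15 - 1 day  = 2021-08-14  (island of one)
-- 2021-10-18 - 2 days = 2021-10-16
-- 2021-10-19 - 3 days = 2021-10-16
-- 2021-10-20 - 4 days = 2021-10-16
select d::date as date,
       d::date - (row_number() over (order by d::date))::int as grp
from (values ('2021-08-15'), ('2021-10-18'), ('2021-10-19'), ('2021-10-20')) v(d);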

Grouping into series based on days since

I need to create a new grouping every time I have a period of more than 60 days since my previous record.
Basically, I need to take the data I have here:
RowNo StartDate StopDate DaysBetween
1 3/21/2017 3/21/2017 14
2 4/4/2017 4/4/2017 14
3 4/18/2017 4/18/2017 14
4 6/23/2017 6/23/2017 66
5 7/5/2017 7/5/2017 12
6 7/19/2017 7/19/2017 14
7 9/27/2017 9/27/2017 70
8 10/24/2017 10/24/2017 27
9 10/31/2017 10/31/2017 7
10 11/14/2017 11/14/2017 14
And turn it into this:
RowNo StartDate StopDate DaysBetween Series
1 3/21/2017 3/21/2017 14 1
2 4/4/2017 4/4/2017 14 1
3 4/18/2017 4/18/2017 14 1
4 6/23/2017 6/23/2017 66 2
5 7/5/2017 7/5/2017 12 2
6 7/19/2017 7/19/2017 14 2
7 9/27/2017 9/27/2017 70 3
8 10/24/2017 10/24/2017 27 3
9 10/31/2017 10/31/2017 7 3
10 11/14/2017 11/14/2017 14 3
Once I have that I'll group by Series and get the min(StartDate) and max(StopDate) for individual durations.
I could do this using a cursor but I'm sure someone much smarter than me has figured out a more elegant solution. Thanks in advance!
You can use the window function sum() over() with a conditional flag.
Example
Select *,
       Series = 1 + sum(case when [DaysBetween] > 60 then 1 else 0 end) over (Order by RowNo)
From YourTable
Returns
RowNo StartDate StopDate DaysBetween Series
1 2017-03-21 2017-03-21 14 1
2 2017-04-04 2017-04-04 14 1
3 2017-04-18 2017-04-18 14 1
4 2017-06-23 2017-06-23 66 2
5 2017-07-05 2017-07-05 12 2
6 2017-07-19 2017-07-19 14 2
7 2017-09-27 2017-09-27 70 3
8 2017-10-24 2017-10-24 27 3
9 2017-10-31 2017-10-31 7 3
10 2017-11-14 2017-11-14 14 3
EDIT - SQL Server 2008 version (windowed SUM() with ORDER BY requires 2012+):
Select A.*, B.*
From YourTable A
Cross Apply (
    Select Series = 1 + sum(case when [DaysBetween] > 60 then 1 else 0 end)
    From YourTable
    Where RowNo <= A.RowNo
) B
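Since the question plans to group by Series and take min(StartDate)/max(StopDate) afterwards, a sketch of that follow-up step (wrapping the 2012+ windowed query in a derived table) might be:

Select Series,
       min(StartDate) As SeriesStart,
       max(StopDate)  As SeriesStop
From (
    -- same running conditional sum as above, assigned per row
    Select *,
           Series = 1 + sum(case when [DaysBetween] > 60 then 1 else 0 end) over (Order by RowNo)
    From YourTable
) t
Group By Series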

Pandas: Days since last event per id

I want to build a column for my dataframe, df['days_since_last'], that shows the days since the last match for each player_id for each event_id, and NaN if the row is the player's first match in the dataset.
Example of my data:
event_id player_id match_date
0 1470993 227485 2015-11-29
1 1492031 227485 2016-07-23
2 1489240 227485 2016-06-19
3 1495581 227485 2016-09-02
4 1490222 227485 2016-07-03
5 1469624 227485 2015-11-14
6 1493822 227485 2016-08-13
7 1428946 313444 2014-08-10
8 1483245 313444 2016-05-21
9 1472260 313444 2015-12-13
I tried the code in Find days since last event pandas dataframe but got nonsensical results.
It seems you need to sort first:
df['days_since_last_event'] = (df.sort_values(['player_id','match_date'])
                                 .groupby('player_id')['match_date'].diff()
                                 .dt.days)
print (df)
event_id player_id match_date days_since_last_event
0 1470993 227485 2015-11-29 15.0
1 1492031 227485 2016-07-23 20.0
2 1489240 227485 2016-06-19 203.0
3 1495581 227485 2016-09-02 20.0
4 1490222 227485 2016-07-03 14.0
5 1469624 227485 2015-11-14 NaN
6 1493822 227485 2016-08-13 21.0
7 1428946 313444 2014-08-10 NaN
8 1483245 313444 2016-05-21 160.0
9 1472260 313444 2015-12-13 490.0
Demo (note that this variant computes days until each player's most recent match, rather than days since the previous one):
In [174]: df['days_since_last'] = (df.groupby('player_id')['match_date']
.transform(lambda x: (x.max()-x).dt.days))
In [175]: df
Out[175]:
event_id player_id match_date days_since_last
0 1470993 227485 2015-11-29 278
1 1492031 227485 2016-07-23 41
2 1489240 227485 2016-06-19 75
3 1495581 227485 2016-09-02 0
4 1490222 227485 2016-07-03 61
5 1469624 227485 2015-11-14 293
6 1493822 227485 2016-08-13 20
7 1428946 313444 2014-08-10 650
8 1483245 313444 2016-05-21 0
9 1472260 313444 2015-12-13 160

pandas groupby: select columns

I work with Cloudera VM 5.2.0 and pandas 0.18.0.
I have the following data:
adclicksDF = pd.read_csv('/home/cloudera/Eglence/ad-clicks.csv',
parse_dates=['timestamp'],
skipinitialspace=True).assign(adCount=1)
adclicksDF.head(n=5)
Out[65]:
timestamp txId userSessionId teamId userId adId adCategory \
0 2016-05-26 15:13:22 5974 5809 27 611 2 electronics
1 2016-05-26 15:17:24 5976 5705 18 1874 21 movies
2 2016-05-26 15:22:52 5978 5791 53 2139 25 computers
3 2016-05-26 15:22:57 5973 5756 63 212 10 fashion
4 2016-05-26 15:22:58 5980 5920 9 1027 20 clothing
adCount
0 1
1 1
2 1
3 1
4 1
I want to do a group by on the field timestamp:
adCategoryclicks = adclicksDF[['timestamp','adId','adCategory','userId','adCount']]
agrupadoDF = adCategoryclicks.groupby(pd.Grouper(key='timestamp', freq='1H'))['adCount'].agg(['count','sum'])
agrupadoDF.head(n=5)
Out[68]:
count sum
timestamp
2016-05-26 15:00:00 14 14
2016-05-26 16:00:00 24 24
2016-05-26 17:00:00 13 13
2016-05-26 18:00:00 16 16
2016-05-26 19:00:00 16 16
I want to add more columns, adCategory and userId, to agrupadoDF.
How can I do this?
There are multiple values of userId and adCategory in each group, so aggregate them with join.
In this sample the last two datetimes were changed for better output:
print (adclicksDF)
timestamp txId userSessionId teamId userId adId adCategory \
0 2016-05-26 15:13:22 5974 5809 27 611 2 electronics
1 2016-05-26 15:17:24 5976 5705 18 1874 21 movies
2 2016-05-26 15:22:52 5978 5791 53 2139 25 computers
3 2016-05-26 16:22:57 5973 5756 63 212 10 fashion
4 2016-05-26 16:22:58 5980 5920 9 1027 20 clothing
adCount
0 1
1 1
2 1
3 1
4 1
#cast int to string
adclicksDF['userId'] = adclicksDF['userId'].astype(str)
adCategoryclicks = adclicksDF[['timestamp','adId','adCategory','userId','adCount']]
agrupadoDF = (adCategoryclicks.groupby(pd.Grouper(key='timestamp', freq='1H'))
                              .agg({'adCount': ['count','sum'],
                                    'userId': ', '.join,
                                    'adCategory': ', '.join}))
agrupadoDF.columns = ['adCategory','count','sum','userId']
print (agrupadoDF)
adCategory count sum \
timestamp
2016-05-26 15:00:00 electronics, movies, computers 3 3
2016-05-26 16:00:00 fashion, clothing 2 2
userId
timestamp
2016-05-26 15:00:00 611, 1874, 2139
2016-05-26 16:00:00 212, 1027