Incrementing row numbers by condition in Postgres (SQL)

I have a Postgres table with timestamps and, in difftime, the rounded difference in hours between the current and the previous (lagged) timestamp:
timestamp type difftime
2013-09-14 14:19:46 JPR03 2
2013-09-14 15:11:48 JPR03 1
2013-09-14 16:11:49 JPR03 1
2013-09-14 17:13:45 JPR03 1
2013-09-22 00:08:38 JPR03 175
2013-09-22 00:10:11 JPR03 0
2013-09-22 01:11:36 JPR03 1
2013-09-22 02:16:11 JPR03 1
2013-09-22 03:13:16 JPR03 1
2013-09-22 04:05:38 JPR03 1
2013-09-22 06:10:11 JPR03 2
2013-09-22 07:26:43 JPR03 1
2013-09-22 08:17:35 JPR03 1
2013-09-22 09:16:08 JPR03 1
2013-09-22 10:16:08 JPR03 1
2013-10-01 06:15:07 JPR03 212
2013-10-01 06:15:12 JPR03 0
2013-10-02 07:15:15 JPR03 25
2013-10-02 08:05:09 JPR03 1
My objective is to create an incremental row number sequence that increases by 1 when and only when the value in difftime is above a certain threshold x (ordered by time). If x = 5, then the output would look like this:
timestamp type difftime rownum
2013-09-14 14:19:46 JPR03 2 0
2013-09-14 15:11:48 JPR03 1 0
2013-09-14 16:11:49 JPR03 1 0
2013-09-14 17:13:45 JPR03 1 0
2013-09-22 00:08:38 JPR03 175 1
2013-09-22 00:10:11 JPR03 0 1
2013-09-22 01:11:36 JPR03 1 1
2013-09-22 02:16:11 JPR03 1 1
2013-09-22 03:13:16 JPR03 1 1
2013-09-22 04:05:38 JPR03 1 1
2013-09-22 06:10:11 JPR03 2 1
2013-09-22 07:26:43 JPR03 1 1
2013-09-22 08:17:35 JPR03 1 1
2013-09-22 09:16:08 JPR03 1 1
2013-09-22 10:16:08 JPR03 1 1
2013-10-01 06:15:07 JPR03 212 2
2013-10-01 06:15:12 JPR03 0 2
2013-10-02 07:15:15 JPR03 25 3
2013-10-02 08:05:09 JPR03 1 3
I am familiar with the RANK(), DENSE_RANK(), ROW_NUMBER(), and COALESCE() functions, but none of these would achieve the objective of incrementing a row number by condition (beginning with 0). Any suggestions on how to implement this kind of variable assignment or what functions might be applied here to partition based on a condition?

You can use the cumulative SUM() function with a conditional value: Add 1 if the condition is met, 0 otherwise:
SELECT
    *,
    SUM(
        CASE
            WHEN difftime >= 5 THEN 1
            ELSE 0
        END
    ) OVER (ORDER BY "timestamp") AS rownum
FROM <your query>

In Postgres, I would recommend using the FILTER clause:
select q.*,
       count(*) filter (where difftime > ?) over (order by "timestamp") as rownum
from <your query> q;
The ? is a placeholder for whatever value you have in mind.
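For a fully self-contained illustration, here is a minimal sketch that derives difftime with LAG() and applies the cumulative conditional count in one query. The table and column names (readings, ts, type) are hypothetical stand-ins for the real ones:

-- Hypothetical table: readings(ts timestamp, type text)
WITH diffs AS (
    SELECT
        ts,
        type,
        -- rounded hours between this row and the previous one; 0 for the first row
        COALESCE(ROUND(EXTRACT(EPOCH FROM ts - LAG(ts) OVER (ORDER BY ts)) / 3600), 0) AS difftime
    FROM readings
)
SELECT
    *,
    COUNT(*) FILTER (WHERE difftime >= 5) OVER (ORDER BY ts) AS rownum
FROM diffs;

Because LAG() returns NULL on the first row, COALESCE() maps it to 0, so the sequence starts at 0 as requested.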

Related

Grouping based on start date matching the previous row's end date SQL

Hoping someone can help me out with this problem.
I have the following sample dataset:
MEM_ID  CLM_ID  ADM_DT      DCHG_DT
1       111     01-01-2020  02-01-2020
1       112     03-01-2020  04-01-2020
1       113     04-01-2020  05-01-2020
1       114     06-01-2020  07-01-2020
2       211     01-01-2020  02-01-2020
2       212     05-01-2020  08-01-2020
3       311     02-01-2020  03-01-2020
3       312     03-01-2020  05-01-2020
3       313     05-01-2020  06-01-2020
3       314     07-01-2020  08-01-2020
I am trying to create groupings based on MEM_ID: if an ADM_DT is equal to the previous row's DCHG_DT, then the records should be grouped together.
Below is the expected output:
MEM_ID  CLM_ID  ADM_DT      DCHG_DT     GROUP_ID
1       111     01-01-2020  02-01-2020  1
1       112     03-01-2020  04-01-2020  2
1       113     04-01-2020  05-01-2020  2
1       114     06-01-2020  07-01-2020  3
2       211     01-01-2020  02-01-2020  1
2       212     05-01-2020  08-01-2020  2
3       311     02-01-2020  03-01-2020  1
3       312     03-01-2020  05-01-2020  1
3       313     05-01-2020  06-01-2020  1
3       314     07-01-2020  08-01-2020  2
I have attempted the following:
select DISTINCT MEM_ID
      ,CLM_ID
      ,ADM_DT
      ,DCHG_DT
      ,CASE WHEN ADM_DT = LAG(DCHG_DT) OVER (PARTITION BY MEM_ID ORDER BY ADM_DT, DCHG_DT) THEN 0 ELSE 1 END AS ISSTART
FROM table_name
Which produces something like this:
MEM_ID  CLM_ID  ADM_DT      DCHG_DT     ISSTART
1       111     01-01-2020  02-01-2020  1
1       112     03-01-2020  04-01-2020  1
1       113     04-01-2020  05-01-2020  0
1       114     06-01-2020  07-01-2020  1
2       211     01-01-2020  02-01-2020  1
2       212     05-01-2020  08-01-2020  1
3       311     02-01-2020  03-01-2020  1
3       312     03-01-2020  05-01-2020  0
3       313     05-01-2020  06-01-2020  0
3       314     07-01-2020  08-01-2020  1
I have also looked into other external sources such as https://www.kodyaz.com/t-sql/sql-query-for-overlapping-time-periods-on-sql-server.aspx
This got me pretty close, but I realized that the author was using a recursive CTE, which Netezza does not support.
Ultimately I would like to create these groupings so that I can then merge them back to the original table and sum values based on the assigned group for each MEM_ID.
Thank you in advance for any help provided.
Try this:
select MEM_ID, CLM_ID, ADM_DT, DCHG_DT,
       sum(ISSTART) over (partition by MEM_ID order by ADM_DT, DCHG_DT rows unbounded preceding) as GROUP_ID
from (
    select MEM_ID
          ,CLM_ID
          ,ADM_DT
          ,DCHG_DT
          ,CASE WHEN ADM_DT = LAG(DCHG_DT) OVER (PARTITION BY MEM_ID ORDER BY ADM_DT, DCHG_DT) THEN 0 ELSE 1 END AS ISSTART
    FROM table_name
) t
Basically, this feeds your ISSTART flag into a running sum, so GROUP_ID increments exactly on the rows that start a new group.
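If the eventual goal is to sum values per stay, a follow-up sketch could look like this (AMT is a hypothetical value column, and grouped stands for the query above wrapped in a subquery or CTE):

select MEM_ID, GROUP_ID,
       min(ADM_DT)  as ADM_DT,
       max(DCHG_DT) as DCHG_DT,
       sum(AMT)     as TOTAL_AMT   -- AMT is a made-up column for illustration
from grouped
group by MEM_ID, GROUP_ID

Once GROUP_ID is materialized this way, the merge back to the original table reduces to a plain GROUP BY.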

Finding most recent startdate, and endDate from consecutive dates

I have a table like below:
user_id  store_id  stock  date
116      2         0      2021-10-18
116      2         0      2021-10-19
116      2         0      2021-10-20
116      2         0      2021-08-16
116      2         0      2021-08-15
116      2         0      2021-07-04
116      2         0      2021-07-03
389      2         0      2021-07-02
389      2         0      2021-07-01
389      2         0      2021-10-27
52       6         0      2021-10-28
52       6         0      2021-10-29
52       6         0      2021-10-30
116      38        0      2021-05-02
116      38        0      2021-05-03
116      38        0      2021-05-04
116      38        0      2021-04-06
The table can have multiple consecutive days where a product ran out of stock, so I'd like to create a query with the last startDate and endDate where the product ran out of stock. For the table above, the results have to be:
user_id  store_id  startDate   endDate
116      2         2021-10-18  2021-10-20
116      38        2021-05-02  2021-05-04
389      2         2021-07-01  2021-07-02
52       6         2021-10-28  2021-10-30
I have tried the solution with row_number(), but it didn't work. Does someone have a tip or idea to solve this problem with SQL (PostgreSQL)?
Here is how you can do it, using the gaps-and-islands trick: subtracting row_number() (as days) from each date yields a value (grp) that is constant within every run of consecutive dates:
select user_id, store_id, min(date) as startdate, max(date) as enddate
from (
    select *, rank() over (partition by user_id, store_id order by grp desc) as rn
    from (
        select *, date - row_number() over (partition by user_id, store_id order by date) * interval '1 day' as grp
        from tablename
    ) t
) t
where rn = 1
group by user_id, store_id, grp
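To see why grp identifies runs, here is a minimal standalone demonstration using inline VALUES (no real table needed); every date in a consecutive run maps to the same grp:

-- 2021-08-15 - 1 day  = 2021-08-14  (island of one)
-- 2021-10-18 - 2 days = 2021-10-16
-- 2021-10-19 - 3 days = 2021-10-16
-- 2021-10-20 - 4 days = 2021-10-16
select d::date as date,
       d::date - (row_number() over (order by d::date))::int as grp
from (values ('2021-08-15'), ('2021-10-18'), ('2021-10-19'), ('2021-10-20')) v(d);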

Grouping into series based on days since

I need to create a new grouping every time I have a period of more than 60 days since my previous record.
Basically, I need to take the data I have here:
RowNo StartDate StopDate DaysBetween
1 3/21/2017 3/21/2017 14
2 4/4/2017 4/4/2017 14
3 4/18/2017 4/18/2017 14
4 6/23/2017 6/23/2017 66
5 7/5/2017 7/5/2017 12
6 7/19/2017 7/19/2017 14
7 9/27/2017 9/27/2017 70
8 10/24/2017 10/24/2017 27
9 10/31/2017 10/31/2017 7
10 11/14/2017 11/14/2017 14
And turn it into this:
RowNo StartDate StopDate DaysBetween Series
1 3/21/2017 3/21/2017 14 1
2 4/4/2017 4/4/2017 14 1
3 4/18/2017 4/18/2017 14 1
4 6/23/2017 6/23/2017 66 2
5 7/5/2017 7/5/2017 12 2
6 7/19/2017 7/19/2017 14 2
7 9/27/2017 9/27/2017 70 3
8 10/24/2017 10/24/2017 27 3
9 10/31/2017 10/31/2017 7 3
10 11/14/2017 11/14/2017 14 3
Once I have that I'll group by Series and get the min(StartDate) and max(StopDate) for individual durations.
I could do this using a cursor but I'm sure someone much smarter than me has figured out a more elegant solution. Thanks in advance!
You can use the window function sum() over() with a conditional flag.
Example
Select *,
       Series = 1 + sum(case when [DaysBetween] > 60 then 1 else 0 end) over (Order by RowNo)
From YourTable
Returns
RowNo StartDate StopDate DaysBetween Series
1 2017-03-21 2017-03-21 14 1
2 2017-04-04 2017-04-04 14 1
3 2017-04-18 2017-04-18 14 1
4 2017-06-23 2017-06-23 66 2
5 2017-07-05 2017-07-05 12 2
6 2017-07-19 2017-07-19 14 2
7 2017-09-27 2017-09-27 70 3
8 2017-10-24 2017-10-24 27 3
9 2017-10-31 2017-10-31 7 3
10 2017-11-14 2017-11-14 14 3
EDIT - SQL Server 2008 version (windowed SUM() with ORDER BY requires 2012+):
Select A.*, B.*
From YourTable A
Cross Apply (
    Select Series = 1 + sum(case when [DaysBetween] > 60 then 1 else 0 end)
    From YourTable
    Where RowNo <= A.RowNo
) B
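Since the question plans to group by Series and take min(StartDate)/max(StopDate) afterwards, a sketch of that follow-up step (wrapping the 2012+ windowed query in a derived table) might be:

Select Series,
       min(StartDate) As SeriesStart,
       max(StopDate)  As SeriesStop
From (
    -- same running conditional sum as above, assigned per row
    Select *,
           Series = 1 + sum(case when [DaysBetween] > 60 then 1 else 0 end) over (Order by RowNo)
    From YourTable
) t
Group By Series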

Pandas: Days since last event per id

I want to build a column for my dataframe, df['days_since_last'], that shows the days since the last match for each player_id for each event_id, and NaN if the row is the player's first match in the dataset.
Example of my data:
event_id player_id match_date
0 1470993 227485 2015-11-29
1 1492031 227485 2016-07-23
2 1489240 227485 2016-06-19
3 1495581 227485 2016-09-02
4 1490222 227485 2016-07-03
5 1469624 227485 2015-11-14
6 1493822 227485 2016-08-13
7 1428946 313444 2014-08-10
8 1483245 313444 2016-05-21
9 1472260 313444 2015-12-13
I tried the code in Find days since last event pandas dataframe but got nonsensical results.
It seems you need to sort first:
df['days_since_last_event'] = (df.sort_values(['player_id','match_date'])
                                 .groupby('player_id')['match_date'].diff()
                                 .dt.days)
print (df)
event_id player_id match_date days_since_last_event
0 1470993 227485 2015-11-29 15.0
1 1492031 227485 2016-07-23 20.0
2 1489240 227485 2016-06-19 203.0
3 1495581 227485 2016-09-02 20.0
4 1490222 227485 2016-07-03 14.0
5 1469624 227485 2015-11-14 NaN
6 1493822 227485 2016-08-13 21.0
7 1428946 313444 2014-08-10 NaN
8 1483245 313444 2016-05-21 160.0
9 1472260 313444 2015-12-13 490.0
Demo (note that this variant computes days until each player's most recent match, rather than days since the previous one):
In [174]: df['days_since_last'] = (df.groupby('player_id')['match_date']
.transform(lambda x: (x.max()-x).dt.days))
In [175]: df
Out[175]:
event_id player_id match_date days_since_last
0 1470993 227485 2015-11-29 278
1 1492031 227485 2016-07-23 41
2 1489240 227485 2016-06-19 75
3 1495581 227485 2016-09-02 0
4 1490222 227485 2016-07-03 61
5 1469624 227485 2015-11-14 293
6 1493822 227485 2016-08-13 20
7 1428946 313444 2014-08-10 650
8 1483245 313444 2016-05-21 0
9 1472260 313444 2015-12-13 160

pandas groupby: select columns

I work with Cloudera VM 5.2.0 and pandas 0.18.0.
I have the following data:
adclicksDF = pd.read_csv('/home/cloudera/Eglence/ad-clicks.csv',
parse_dates=['timestamp'],
skipinitialspace=True).assign(adCount=1)
adclicksDF.head(n=5)
Out[65]:
timestamp txId userSessionId teamId userId adId adCategory \
0 2016-05-26 15:13:22 5974 5809 27 611 2 electronics
1 2016-05-26 15:17:24 5976 5705 18 1874 21 movies
2 2016-05-26 15:22:52 5978 5791 53 2139 25 computers
3 2016-05-26 15:22:57 5973 5756 63 212 10 fashion
4 2016-05-26 15:22:58 5980 5920 9 1027 20 clothing
adCount
0 1
1 1
2 1
3 1
4 1
I want to do a group by on the field timestamp:
adCategoryclicks = adclicksDF[['timestamp','adId','adCategory','userId','adCount']]
agrupadoDF = adCategoryclicks.groupby(pd.Grouper(key='timestamp', freq='1H'))['adCount'].agg(['count','sum'])
agrupadoDF.head(n=5)
Out[68]:
count sum
timestamp
2016-05-26 15:00:00 14 14
2016-05-26 16:00:00 24 24
2016-05-26 17:00:00 13 13
2016-05-26 18:00:00 16 16
2016-05-26 19:00:00 16 16
I want to add more columns, adCategory and userId, to agrupadoDF.
How can I do this?
There are multiple values of userId and adCategory in each group, so aggregate them with join.
In this sample the last two datetimes were changed for better output:
print (adclicksDF)
timestamp txId userSessionId teamId userId adId adCategory \
0 2016-05-26 15:13:22 5974 5809 27 611 2 electronics
1 2016-05-26 15:17:24 5976 5705 18 1874 21 movies
2 2016-05-26 15:22:52 5978 5791 53 2139 25 computers
3 2016-05-26 16:22:57 5973 5756 63 212 10 fashion
4 2016-05-26 16:22:58 5980 5920 9 1027 20 clothing
adCount
0 1
1 1
2 1
3 1
4 1
#cast int to string
adclicksDF['userId'] = adclicksDF['userId'].astype(str)
adCategoryclicks = adclicksDF[['timestamp','adId','adCategory','userId','adCount']]
agrupadoDF = (adCategoryclicks.groupby(pd.Grouper(key='timestamp', freq='1H'))
                              .agg({'adCount': ['count','sum'],
                                    'userId': ', '.join,
                                    'adCategory': ', '.join}))
agrupadoDF.columns = ['adCategory','count','sum','userId']
print (agrupadoDF)
adCategory count sum \
timestamp
2016-05-26 15:00:00 electronics, movies, computers 3 3
2016-05-26 16:00:00 fashion, clothing 2 2
userId
timestamp
2016-05-26 15:00:00 611, 1874, 2139
2016-05-26 16:00:00 212, 1027