Pandas: filling missing values iterating through a groupby object - pandas

I have the following dataset:
import numpy as np
import pandas as pd
d = {'player': ['1', '1', '1', '1', '1', '1', '1', '1', '1', '2', '2',
'2', '2', '2', '2', '3', '3', '3', '3', '3'],
'session': ['a', 'a', 'b', np.nan, 'b', 'c', 'c', 'c', 'c', 'd', 'd',
'e', 'e', np.nan, 'e', 'f', 'f', 'g', np.nan, 'g'],
'date': ['2018-01-01 00:19:05', '2018-01-01 00:21:07',
'2018-01-01 00:22:07', '2018-01-01 00:22:15','2018-01-01 00:25:09',
'2018-01-01 00:25:11', '2018-01-01 00:27:28', '2018-01-01 00:29:29',
'2018-01-01 00:30:35', '2018-01-01 00:21:16', '2018-01-01 00:35:22',
'2018-01-01 00:38:16', '2018-01-01 00:38:20', '2018-01-01 00:40:35',
'2018-01-01 01:31:16', '2018-01-03 00:55:22', '2018-01-03 00:58:16',
'2018-01-03 00:58:21', '2018-03-01 01:00:35', '2018-03-01 01:31:16']
}
#create dataframe
df = pd.DataFrame(data=d)
#change date to datetime
df['date'] = pd.to_datetime(df['date'])
df.head()
player session date
0 1 a 2018-01-01 00:19:05
1 1 a 2018-01-01 00:21:07
2 1 b 2018-01-01 00:22:07
3 1 NaN 2018-01-01 00:22:15
4 1 b 2018-01-01 00:25:09
So, these are my three columns:
'player' - with three players (1,2,3) - dtype = object
'session' (object). Each session id groups together a set of actions (i.e. the rows in the dataset) that the players have implemented online.
'date' (datetime object) tells us the time at which each action was implemented.
The problem in this dataset is that I have the timestamps for each action, but some of the actions are missing their session id. What I want to do is the following: for each player, I want to give an id label for the missing values, based on the timeline. The actions missing their id can be labeled if they fall within the temporal range (first action - last action) of a certain session.
Let's say I group by player & session, and compute the time range for each session:
my_agg = df.groupby(['player', 'session']).date.agg([min, max])
my_agg
min max
player session
1 a 2018-01-01 00:19:05 2018-01-01 00:21:07
b 2018-01-01 00:22:07 2018-01-01 00:25:09
c 2018-01-01 00:25:11 2018-01-01 00:30:35
2 d 2018-01-01 00:21:16 2018-01-01 00:35:22
e 2018-01-01 00:38:16 2018-01-01 01:31:16
3 f 2018-01-03 00:55:22 2018-01-03 00:58:16
g 2018-01-03 00:58:21 2018-03-01 01:31:16
At this point I would like to iterate through every player and compare the timestamps of my NaN values, session by session, to see where they belong.
Desired output: in the example, the first NaN should be labeled as 'b', the second one as 'e', and the last one as 'g'.
Disclaimer: I asked a similar question a few days ago (see here), and received a very good answer, but this time I must take into account another variable and I am again stuck. Indeed, the first steps in Python are exciting but very challenging.

Your example is already sorted; however, this should produce your desired result even if your inputs are not sorted. If this answer does not satisfy your requirements, please post an additional (or modified) sample dataframe with an expected output where this approach breaks.
df.sort_values(['player','date']).fillna(method='ffill')
Yields:
player session date
0 1 a 2018-01-01 00:19:05
1 1 a 2018-01-01 00:21:07
2 1 b 2018-01-01 00:22:07
3 1 b 2018-01-01 00:22:15
4 1 b 2018-01-01 00:25:09
5 1 c 2018-01-01 00:25:11
6 1 c 2018-01-01 00:27:28
7 1 c 2018-01-01 00:29:29
8 1 c 2018-01-01 00:30:35
9 2 d 2018-01-01 00:21:16
10 2 d 2018-01-01 00:35:22
11 2 e 2018-01-01 00:38:16
12 2 e 2018-01-01 00:38:20
13 2 e 2018-01-01 00:40:35
14 2 e 2018-01-01 01:31:16
15 3 f 2018-01-03 00:55:22
16 3 f 2018-01-03 00:58:16
17 3 g 2018-01-03 00:58:21
18 3 g 2018-03-01 01:00:35
19 3 g 2018-03-01 01:31:16
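A small follow-up sketch on top of that answer (my addition, assuming the df built in the question): forward-filling within each player group guarantees a missing session can never inherit a value from a different player, even on oddly ordered input.
# forward-fill the session label within each player only, so a missing
# value can never pick up a session id from another player
out = df.sort_values(['player', 'date'])
out['session'] = out.groupby('player')['session'].ffill()
print(out)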

Related

Pandas groupby issue after melt bug?

Python version 3.8.12
pandas 1.4.1
Given the following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'id': [1000] * 4,
'date': ['2022-01-01'] * 4,
'ts': pd.date_range('2022-01-01', freq='5min', periods=4),
'A': np.random.randint(1, 6, size=4),
'B': np.random.rand(4)
})
that looks like this:
   id        date                   ts  A          B
0  1000  2022-01-01  2022-01-01 00:00:00  4    0.98019
1  1000  2022-01-01  2022-01-01 00:05:00  3    0.82021
2  1000  2022-01-01  2022-01-01 00:10:00  4   0.549684
3  1000  2022-01-01  2022-01-01 00:15:00  5  0.0818311
I transposed the columns A and B with pandas melt:
melted = df.melt(
id_vars=['id', 'date', 'ts'],
value_vars=['A', 'B'],
var_name='label',
value_name='value',
ignore_index=True
)
that looks like this:
   id        date                   ts  label      value
0  1000  2022-01-01  2022-01-01 00:00:00      A          4
1  1000  2022-01-01  2022-01-01 00:05:00      A          3
2  1000  2022-01-01  2022-01-01 00:10:00      A          4
3  1000  2022-01-01  2022-01-01 00:15:00      A          5
4  1000  2022-01-01  2022-01-01 00:00:00      B    0.98019
5  1000  2022-01-01  2022-01-01 00:05:00      B    0.82021
6  1000  2022-01-01  2022-01-01 00:10:00      B   0.549684
7  1000  2022-01-01  2022-01-01 00:15:00      B  0.0818311
Then I groupby and select the first group:
melted.groupby(['id', 'date']).first()
that gives me this:
ts label value
id date
1000 2022-01-01 2022-01-01 A 4.0
but I would expect this output instead:
ts A B
id date
1000 2022-01-01 2022-01-01 00:00:00 4 0.980190
2022-01-01 2022-01-01 00:05:00 3 0.820210
2022-01-01 2022-01-01 00:10:00 4 0.549684
2022-01-01 2022-01-01 00:15:00 5 0.081831
What am I not getting? Or is this a bug? Also, why is the ts column converted to a date?
My bad!!! I thought first would get the first group, but instead it gets the first element of each group, as stated in the documentation for the pandas aggregation functions. Sorry folks, I was doing this late at night and could not think straight :/
To select the first group, I needed to use the get_group function.
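For reference, a minimal sketch of that fix (my own example, assuming the melted frame from above); get_group takes the group key, which is a tuple here because there are two grouping columns:
# select all rows of a single group instead of the first row of every group
first_group = melted.groupby(['id', 'date']).get_group((1000, '2022-01-01'))
print(first_group)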

How to groupby in Pandas by datetime range from different DF

I'm stuck and can't solve this...
I have 2 dataframes.
One has datetime intervals, the other has datetimes and values.
I need to get MIN() values based on datetime ranges.
import pandas as pd
timeseries = pd.DataFrame(
[
['2018-01-01T00:00:00.000000000', '2018-01-01T03:00:00.000000000'],
['2018-01-02T00:00:00.000000000', '2018-01-02T03:00:00.000000000'],
['2018-01-03T00:00:00.000000000', '2018-01-03T03:00:00.000000000'],
], dtype='datetime64[ns]', columns=['Start DT', 'End DT'])
values = pd.DataFrame(
[
['2018-01-01T00:00:00.000000000', 1],
['2018-01-01T01:00:00.000000000', 2],
['2018-01-01T02:00:00.000000000', 0],
['2018-01-02T00:00:00.000000000', -1],
['2018-01-02T01:00:00.000000000', 3],
['2018-01-02T02:00:00.000000000', 10],
['2018-01-03T00:00:00.000000000', 7],
['2018-01-03T01:00:00.000000000', 11],
['2018-01-03T02:00:00.000000000', 2],
], columns=['DT', 'Value'])
Required output:
Start DT End DT Min
0 2018-01-01 2018-01-01 03:00:00 0
1 2018-01-02 2018-01-02 03:00:00 -1
2 2018-01-03 2018-01-03 03:00:00 2
Any ideas?
Use an IntervalIndex created from the timeseries columns, then get positions with Index.get_indexer, aggregate min, and finally add the result as a column to timeseries:
values['DT'] = pd.to_datetime(values['DT'])
s = pd.IntervalIndex.from_arrays(timeseries['Start DT'],
timeseries['End DT'],
closed='both')
values['new'] = timeseries.index[s.get_indexer(values['DT'])]
print (values)
DT Value new
0 2018-01-01 00:00:00 1 0
1 2018-01-01 01:00:00 2 0
2 2018-01-01 02:00:00 0 0
3 2018-01-02 00:00:00 -1 1
4 2018-01-02 01:00:00 3 1
5 2018-01-02 02:00:00 10 1
6 2018-01-03 00:00:00 7 2
7 2018-01-03 01:00:00 11 2
8 2018-01-03 02:00:00 2 2
df = timeseries.join(values.groupby('new')['Value'].min().rename('Min'))
print (df)
Start DT End DT Min
0 2018-01-01 2018-01-01 03:00:00 0
1 2018-01-02 2018-01-02 03:00:00 -1
2 2018-01-03 2018-01-03 03:00:00 2
EDIT: If there is no match, get_indexer returns -1, which would wrongly pick the last index value (here 2), so convert unmatched positions to missing values instead:
timeseries = pd.DataFrame(
[
['2018-01-01T00:00:00.000000000', '2018-01-01T03:00:00.000000000'],
['2018-01-02T00:00:00.000000000', '2018-01-02T03:00:00.000000000'],
['2018-01-03T00:00:00.000000000', '2018-01-03T03:00:00.000000000'],
], dtype='datetime64[ns]', columns=['Start DT', 'End DT'])
values = pd.DataFrame(
[ ['2017-12-31T00:00:00.000000000', -10],
['2018-01-01T00:00:00.000000000', 1],
['2018-01-01T01:00:00.000000000', 2],
['2018-01-01T02:00:00.000000000', 0],
['2018-01-02T00:00:00.000000000', -1],
['2018-01-02T01:00:00.000000000', 3],
['2018-01-02T02:00:00.000000000', 10],
['2018-01-03T00:00:00.000000000', 7],
['2018-01-03T01:00:00.000000000', 11],
['2018-01-03T02:00:00.000000000', 2],
], columns=['DT', 'Value'])
values['DT'] = pd.to_datetime(values['DT'])
print (values)
DT Value
0 2017-12-31 00:00:00 -10
1 2018-01-01 00:00:00 1
2 2018-01-01 01:00:00 2
3 2018-01-01 02:00:00 0
4 2018-01-02 00:00:00 -1
5 2018-01-02 01:00:00 3
6 2018-01-02 02:00:00 10
7 2018-01-03 00:00:00 7
8 2018-01-03 01:00:00 11
9 2018-01-03 02:00:00 2
s = pd.IntervalIndex.from_arrays(timeseries['Start DT'],
timeseries['End DT'], closed='both')
pos = s.get_indexer(values['DT'])
values['new'] = timeseries.index[pos].where(pos != -1)
print (values)
DT Value new
0 2017-12-31 00:00:00 -10 NaN
1 2018-01-01 00:00:00 1 0.0
2 2018-01-01 01:00:00 2 0.0
3 2018-01-01 02:00:00 0 0.0
4 2018-01-02 00:00:00 -1 1.0
5 2018-01-02 01:00:00 3 1.0
6 2018-01-02 02:00:00 10 1.0
7 2018-01-03 00:00:00 7 2.0
8 2018-01-03 01:00:00 11 2.0
9 2018-01-03 02:00:00 2 2.0
df = timeseries.join(values.dropna(subset=['new']).groupby('new')['Value'].min().rename('Min'))
print (df)
Start DT End DT Min
0 2018-01-01 2018-01-01 03:00:00 0
1 2018-01-02 2018-01-02 03:00:00 -1
2 2018-01-03 2018-01-03 03:00:00 2
One possible solution is to create a variable (key) on which to join the two datasets
# create 'key' variable
timeseries['key'] = timeseries['Start DT'].astype(str)
values['key'] = pd.to_datetime(values['DT'].str.replace('T', ' '), format='%Y-%m-%d %H:%M:%S.%f').dt.date.astype(str)
# create dataset with minima
mins = values.groupby('key').agg({'Value': 'min'}).reset_index()
# join
timeseries.merge(mins, on='key').drop(columns=['key'])
Start DT End DT Value
0 2018-01-01 2018-01-01 03:00:00 0
1 2018-01-02 2018-01-02 03:00:00 -1
2 2018-01-03 2018-01-03 03:00:00 2
values['DT']=values['DT'].astype(str) #convert to string
s=values['DT'].str.split(' ')#split on space
values['day']=s.str[0] #take the day part
df4=values.groupby(by='day').min()#groupby and take min value
df4.reset_index(inplace=True) #reset index
df4['day']=pd.to_datetime(df4['day'])#convert back to datetime for merging
final=pd.merge(timeseries,df4,left_on='Start DT',right_on='day',how='inner') #merge
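One more sketch, not taken from the answers above and only valid when the intervals never overlap: pd.merge_asof can attach each value to the most recent interval start, after which the interval end is checked explicitly (assumes the timeseries and values frames from the question).
# merge_asof needs datetime keys and sorted frames
values['DT'] = pd.to_datetime(values['DT'])
merged = pd.merge_asof(values.sort_values('DT'),
                       timeseries.sort_values('Start DT'),
                       left_on='DT', right_on='Start DT')
# drop values that fall after the end of their candidate interval
merged = merged[merged['DT'] <= merged['End DT']]
out = (merged.groupby(['Start DT', 'End DT'], as_index=False)['Value']
             .min()
             .rename(columns={'Value': 'Min'}))
print(out)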

Rank data with history of recurrence

Let's say I'm tracking a user's location, and I capture the following information:
Date
UserId
CurrentLocation
I can do a fairly easy transformation on this data to form a new table to get their last known position, if one exists, which I'll include below to show us moving from one point to another.
I want to create a grouping for each user's current position, and increment it whenever their position has changed from a previous value. If a user leaves a position, and later comes back to it, I want that to be treated as a new value, and not lump it in with the group when they were first at this location.
The problem with using RANK or DENSE_RANK to do this is that I'm ordering by the currentPos, which obviously won't work.
I thought I could use LAG() to look at the previous data, but this doesn't allow you to aggregate the previous record's LAG() with the current row.
Here's an example using DENSE_RANK():
WITH dummyData(id, occurredOn, userId, currentPos, lastPos) AS (
SELECT 01, '2021-01-01 00:00:00', 23, 'A', null
UNION ALL
SELECT 22, '2021-01-01 01:30:00', 23, 'A', 'A'
UNION ALL
SELECT 43, '2021-01-01 04:00:00', 23, 'B', 'A'
UNION ALL
SELECT 55, '2021-01-02 00:00:00', 23, 'C', 'B'
UNION ALL
SELECT 59, '2021-01-02 04:40:00', 23, 'B', 'C'
UNION ALL
SELECT 68, '2021-01-02 08:00:00', 23, 'C', 'B'
UNION ALL
SELECT 69, '2021-01-02 09:00:00', 23, 'D', 'C'
UNION ALL
SELECT 11, '2021-01-01 01:00:00', 43, 'X', 'X'
UNION ALL
SELECT 18, '2021-01-01 02:00:00', 43, 'Y', 'X'
UNION ALL
SELECT 32, '2021-01-02 00:00:00', 43, 'Z', 'Y'
)
SELECT *
, DENSE_RANK() OVER (PARTITION BY userId ORDER BY currentPos) locationChangeGroup
FROM dummyData
ORDER BY userId ASC, occurredOn ASC
Here's what it outputs
id  occurredOn           userId  currentPos  lastPos  locationChangeGroup
01  2021-01-01 00:00:00  23      A           NULL     1
22  2021-01-01 01:30:00  23      A           A        1
43  2021-01-01 04:00:00  23      B           A        2
55  2021-01-02 00:00:00  23      C           B        3
59  2021-01-02 04:40:00  23      B           C        2
68  2021-01-02 08:00:00  23      C           B        3
69  2021-01-02 09:00:00  23      D           C        4
11  2021-01-01 01:00:00  43      X           X        1
18  2021-01-01 02:00:00  43      Y           X        2
32  2021-01-02 00:00:00  43      Z           Y        3
Here's what I want
id  occurredOn           userId  currentPos  lastPos  locationChangeGroup
01  2021-01-01 00:00:00  23      A           NULL     1
22  2021-01-01 01:30:00  23      A           A        1
43  2021-01-01 04:00:00  23      B           A        2
55  2021-01-02 00:00:00  23      C           B        3
59  2021-01-02 04:40:00  23      B           C        4
68  2021-01-02 08:00:00  23      C           B        5
69  2021-01-02 09:00:00  23      D           C        6
11  2021-01-01 01:00:00  43      X           X        1
18  2021-01-01 02:00:00  43      Y           X        2
32  2021-01-02 00:00:00  43      Z           Y        3
I know I could do this with a CURSOR, but I'd rather not resort to that.
T-SQL is fine, but I'm trying to stay away from any stored procs or functions, as it will require a larger effort of generating database migration scripts and the rigamarole of our processes that entails.
Any suggestions?
I think this is a gap-and-islands problem. For this purpose, you can use lag() and a cumulative sum:
select dd.*,
sum(case when prev_currentpos = currentpos then 0 else 1 end) over
(partition by userid
order by occurredon
) as locationChangeGroup
from (select dd.*,
lag(currentpos) over (partition by userid order by occurredon) as prev_currentpos
from dummydata dd
) dd
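Since the rest of this page is pandas-heavy, here is the same gap-and-islands idea as a pandas sketch (my translation, not part of the answer above), using a hypothetical locs DataFrame built from a few rows of the dummy data:
import pandas as pd

# hypothetical pandas version of part of the dummy data
locs = pd.DataFrame({
    'userId': [23, 23, 23, 23, 23],
    'occurredOn': pd.to_datetime(['2021-01-01 00:00:00', '2021-01-01 01:30:00',
                                  '2021-01-01 04:00:00', '2021-01-02 00:00:00',
                                  '2021-01-02 04:40:00']),
    'currentPos': ['A', 'A', 'B', 'C', 'B'],
})

# flag rows where the position differs from the previous row of the same user,
# then cumulatively sum those flags per user to number each visit separately
locs = locs.sort_values(['userId', 'occurredOn'])
changed = locs.groupby('userId')['currentPos'].shift().ne(locs['currentPos'])
locs['locationChangeGroup'] = changed.groupby(locs['userId']).cumsum()
print(locs)  # 1, 1, 2, 3, 4 - the return to B gets a fresh group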

auto increment inside group

I have a dataframe:
import pandas as pd
df = pd.DataFrame.from_dict({
'product': ('a', 'a', 'a', 'a', 'c', 'b', 'b', 'b'),
'sales': ('-', '-', 'hot_price', 'hot_price', '-', 'min_price', 'min_price', 'min_price'),
'price': (100, 100, 50, 50, 90, 70, 70, 70),
'dt': ('2020-01-01 00:00:00', '2020-01-01 00:05:00', '2020-01-01 00:07:00', '2020-01-01 00:10:00', '2020-01-01 00:13:00', '2020-01-01 00:15:00', '2020-01-01 00:19:00', '2020-01-01 00:21:00')
})
product sales price dt
0 a - 100 2020-01-01 00:00:00
1 a - 100 2020-01-01 00:05:00
2 a hot_price 50 2020-01-01 00:07:00
3 a hot_price 50 2020-01-01 00:10:00
4 c - 90 2020-01-01 00:13:00
5 b min_price 70 2020-01-01 00:15:00
6 b min_price 70 2020-01-01 00:19:00
7 b min_price 70 2020-01-01 00:21:00
I need the next output:
product sales price dt unique_group
0 a - 100 2020-01-01 00:00:00 0
1 a - 100 2020-01-01 00:05:00 0
2 a hot_price 50 2020-01-01 00:07:00 1
3 a hot_price 50 2020-01-01 00:10:00 1
4 c - 90 2020-01-01 00:13:00 2
5 b min_price 70 2020-01-01 00:15:00 3
6 b min_price 70 2020-01-01 00:19:00 3
7 b min_price 70 2020-01-01 00:21:00 3
How I do it:
unique_group = 0
df['unique_group'] = unique_group
for i in range(1, len(df)):
    current, prev = df.loc[i], df.loc[i - 1]
    if not all([
        current['product'] == prev['product'],
        current['sales'] == prev['sales'],
        current['price'] == prev['price'],
    ]):
        unique_group += 1
    df.loc[i, 'unique_group'] = unique_group
Is it possible to do it without iteration? I tried using cumsum(), shift(), ngroup(), drop_duplicates() but unsuccessfully.
IIUC, GroupBy.ngroup:
df['unique_group'] = df.groupby(['product', 'sales', 'price'],sort=False).ngroup()
print(df)
product sales price dt unique_group
0 a - 100 2020-01-01 00:00:00 0
1 a - 100 2020-01-01 00:05:00 0
2 a hot_price 50 2020-01-01 00:07:00 1
3 a hot_price 50 2020-01-01 00:10:00 1
4 c - 90 2020-01-01 00:13:00 2
5 b min_price 70 2020-01-01 00:15:00 3
6 b min_price 70 2020-01-01 00:19:00 3
7 b min_price 70 2020-01-01 00:21:00 3
This works either way, even if the data frame is not ordered.
Another approach - this one relies on the data frame being ordered:
cols = ['product','sales','price']
df['unique_group'] = df[cols].ne(df[cols].shift()).any(axis=1).cumsum().sub(1)
Another option which might be a bit faster than groupby:
df['unique_group'] = (~df.duplicated(['product','sales','price'])).cumsum() - 1
Output:
product sales price dt unique_group
0 a - 100 2020-01-01 00:00:00 0
1 a - 100 2020-01-01 00:05:00 0
2 a hot_price 50 2020-01-01 00:07:00 1
3 a hot_price 50 2020-01-01 00:10:00 1
4 c - 90 2020-01-01 00:13:00 2
5 b min_price 70 2020-01-01 00:15:00 3
6 b min_price 70 2020-01-01 00:19:00 3
7 b min_price 70 2020-01-01 00:21:00 3
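One difference between these answers worth spelling out (my observation, not from the posts above): ngroup and the duplicated variant give every occurrence of a (product, sales, price) combination the same number even if it reappears after other rows, while the shift/cumsum variant numbers each consecutive block separately. A tiny check, assuming the df from the question plus one repeated row:
# append a row that repeats the first combination after other groups
extra = df.iloc[[0]].assign(dt='2020-01-01 00:30:00')
test = pd.concat([df, extra], ignore_index=True)

cols = ['product', 'sales', 'price']
by_ngroup = test.groupby(cols, sort=False).ngroup()                        # last row gets 0 again
by_blocks = test[cols].ne(test[cols].shift()).any(axis=1).cumsum().sub(1)  # last row gets 4
print(pd.DataFrame({'ngroup': by_ngroup, 'blocks': by_blocks}))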

Pandas: filling missing values by the time occurrence of an event

I already asked a similar question (see here), but unfortunately it was not clear enough, so I decided it was better to create a new one with a better example dataset and a new explanation of the desired output - an edit would really have been a major change.
So, I have the following dataset (it's already sorted by date and player):
import numpy as np
import pandas as pd
d = {'player': ['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '3', '3', '3', '3', '3', '3'],
'date': ['2018-01-01 00:17:01', '2018-01-01 00:17:05','2018-01-01 00:19:05', '2018-01-01 00:21:07', '2018-01-01 00:22:09',
'2018-01-01 00:22:17', '2018-01-01 00:25:09', '2018-01-01 00:25:11', '2018-01-01 00:27:28', '2018-01-01 00:29:29',
'2018-01-01 00:30:35', '2018-02-01 00:31:16', '2018-02-01 00:35:22', '2018-02-01 00:38:16',
'2018-02-01 00:38:20', '2018-02-01 00:55:15', '2018-01-03 00:55:22',
'2018-01-03 00:58:16', '2018-01-03 00:58:21', '2018-03-01 01:00:35', '2018-03-01 01:20:16', '2018-03-01 01:31:16'],
'id': [np.nan, np.nan, 'a', 'a', 'b', np.nan, 'b', 'c', 'c', 'c', 'c', 'd', 'd', 'e', 'e', np.nan, 'f', 'f',
'g', np.nan, 'f', 'g']}
#create dataframe
df = pd.DataFrame(data=d)
#change date to datetime
df['date'] = pd.to_datetime(df['date'])
df
player date id
0 1 2018-01-01 00:17:01 NaN
1 1 2018-01-01 00:17:05 NaN
2 1 2018-01-01 00:19:05 a
3 1 2018-01-01 00:21:07 a
4 1 2018-01-01 00:22:09 b
5 1 2018-01-01 00:22:17 NaN
6 1 2018-01-01 00:25:09 b
7 1 2018-01-01 00:25:11 c
8 1 2018-01-01 00:27:28 c
9 1 2018-01-01 00:29:29 c
10 1 2018-01-01 00:30:35 c
11 2 2018-02-01 00:31:16 d
12 2 2018-02-01 00:35:22 d
13 2 2018-02-01 00:38:16 e
14 2 2018-02-01 00:38:20 e
15 2 2018-02-01 00:55:15 NaN
16 3 2018-01-03 00:55:22 f
17 3 2018-01-03 00:58:16 f
18 3 2018-01-03 00:58:21 g
19 3 2018-03-01 01:00:35 NaN
20 3 2018-03-01 01:20:16 f
21 3 2018-03-01 01:31:16 g
So, these are my three columns:
'player' - dtype = object
'id' (object). Each session id groups together a set of actions (i.e. the rows in the dataset) that the players have implemented online.
'date' (datetime object) tells us the time at which each action was implemented.
The problem in this dataset is that I have the timestamps for each action, but some of the actions are missing their session id. What I want to do is the following: for each player, I want to give an id label for the missing values, based on the timeline. The actions missing their id can be labeled if they fall within the temporal range (first action - last action) of a certain session.
Ok, so here I have my missing values:
df.loc[df.id.isnull(),'date']
0 2018-01-01 00:17:01
1 2018-01-01 00:17:05
5 2018-01-01 00:22:17
15 2018-02-01 00:55:15
19 2018-03-01 01:00:35
Please note that I have the player code for each one of them: what I am missing is just the session id. So, I want to compare the timestamp of each missing value with the session time ranges of the corresponding player.
I was thinking of computing with a groupby the first and last action for each session, for each player (but I do not know if it is the best approach).
my_agg = df.groupby(['player', 'id']).date.agg([min, max])
my_agg
min max
player id
1 a 2018-01-01 00:19:05 2018-01-01 00:21:07
b 2018-01-01 00:22:09 2018-01-01 00:25:09
c 2018-01-01 00:25:11 2018-01-01 00:30:35
2 d 2018-02-01 00:31:16 2018-02-01 00:35:22
e 2018-02-01 00:38:16 2018-02-01 00:38:20
3 f 2018-01-03 00:55:22 2018-03-01 01:20:16
g 2018-01-03 00:58:21 2018-03-01 01:31:16
Then I would like to match the NaNs by player, and compare the timestamp of each missing value with the range of each session for that player.
In the dataset I try to illustrate three possible scenarios I am interested in:
the action occurred between the first and last date of a certain session. In this case I would like to fill the missing value with the id of that session, as it clearly belongs to that session. Row 5 of the dataset should therefore be labeled as 'b', as it occurs within the range of b.
I would mark as '0' the actions that occurred outside the range of any session - for example the first two NaNs and row 15.
Finally, mark it as '-99' if it is not possible to associate the action with a single session, because it occurred during the time range of different sessions. This is the case of row 19, the last NaN.
Desired output:
to sum it up, the outcome should look like this df:
player date id
0 1 2018-01-01 00:17:01 0
1 1 2018-01-01 00:17:05 0
2 1 2018-01-01 00:19:05 a
3 1 2018-01-01 00:21:07 a
4 1 2018-01-01 00:22:09 b
5 1 2018-01-01 00:22:17 b
6 1 2018-01-01 00:25:09 b
7 1 2018-01-01 00:25:11 c
8 1 2018-01-01 00:27:28 c
9 1 2018-01-01 00:29:29 c
10 1 2018-01-01 00:30:35 c
11 2 2018-02-01 00:31:16 d
12 2 2018-02-01 00:35:22 d
13 2 2018-02-01 00:38:16 e
14 2 2018-02-01 00:38:20 e
15 2 2018-02-01 00:55:15 0
16 3 2018-01-03 00:55:22 f
17 3 2018-01-03 00:58:16 f
18 3 2018-01-03 00:58:21 g
19 3 2018-03-01 01:00:35 -99
20 3 2018-03-01 01:20:16 f
21 3 2018-03-01 01:31:16 g
This may not be the best approach, but it does work. Basically, I am creating some new columns using shift and then using the conditions you mentioned with np.select:
df['shift'] = df['id'].shift(1)
df['shift-1'] = df['id'].shift(-1)
df['merge'] = df[['shift','shift-1']].values.tolist()
df.drop(columns=['shift','shift-1'], inplace=True)
alpha = {np.nan:0,'a':1,'b':2,'c':3,'d':4,'e':5,'f':6,'g':7,'h':8}
diff = []
for i in range(len(df)):
diff.append(alpha[df['merge'][i][1]] - alpha[df['merge'][i][0]])
df['diff'] = diff
conditions = [(df['id'].shift(1).eq(df['id'].shift(-1)) & (df['id'].isna()) & (df['player'].shift(1).eq(df['player'].shift(-1)))),
(~df['id'].shift(1).eq(df['id'].shift(-1)) & (df['id'].isna()) & (df['player'].shift(1).eq(df['player']) |
df['player'].shift(-1).eq(df['player'])) &
(~df['diff'] < 0)),
(~df['id'].shift(1).eq(df['id'].shift(-1)) & (df['id'].isna()) & (df['player'].shift(1).eq(df['player']) |
df['player'].shift(-1).eq(df['player'])) &
(df['diff'] < 0)),
]
choices = [df['id'].ffill(),
0,
-99
]
df['id'] = np.select(conditions, choices, default = df['id'])
df.drop(columns=['merge','diff'], inplace=True)
df
out:
player date id
0 1 2018-01-01 00:17:01 0
1 1 2018-01-01 00:17:05 0
2 1 2018-01-01 00:19:05 a
3 1 2018-01-01 00:21:07 a
4 1 2018-01-01 00:22:09 b
5 1 2018-01-01 00:22:17 b
6 1 2018-01-01 00:25:09 b
7 1 2018-01-01 00:25:11 c
8 1 2018-01-01 00:27:28 c
9 1 2018-01-01 00:29:29 c
10 1 2018-01-01 00:30:35 c
11 2 2018-02-01 00:31:16 d
12 2 2018-02-01 00:35:22 d
13 2 2018-02-01 00:38:16 e
14 2 2018-02-01 00:38:20 e
15 2 2018-02-01 00:55:15 0
16 3 2018-01-03 00:55:22 f
17 3 2018-01-03 00:58:16 f
18 3 2018-01-03 00:58:21 g
19 3 2018-03-01 01:00:35 -99
20 3 2018-03-01 01:20:16 f
21 3 2018-03-01 01:31:16 g
In my solution I just had to work a bit to correctly apply the function written by @ysearka in a previous stackoverflow question - see here. The basic challenge was to apply his function player by player.
#define a function to sort the missing values (ysearka function from stackoverflow)
def my_custom_function(time):
    #compare every date event with the range of the sessions.
    current_sessions = my_agg.loc[(my_agg['min']<time) & (my_agg['max']>time)]
    #store length, that is the number of matches.
    count = len(current_sessions)
    #How many matches are there for any missing id value?
    # if 0 it means that no matches are found: the event lies outside all the possible ranges
    if count == 0:
        return 0
    #if more than one, it is impossible to say to which session the event belongs
    if count > 1:
        return -99
    #equivalent to if count == 1 return: in this case the event belongs clearly to just one session
    return current_sessions.index[0][1]
#create a list storing all the player ids
plist = list(df.player.unique())
#ignore SettingWithCopyWarning: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
pd.options.mode.chained_assignment = None
# create an empty new dataframe, where to store the results
final = pd.DataFrame()
#with this for loop iterate over the part of the dataset corresponding to one player at a time
for i in plist:
    #slice the dataset by player
    players = df.loc[df['player'] == i]
    #for every player, take the dates where we are missing the id
    mv_per_player = players.loc[players.id.isnull(),'date']
    #for each player, groupby player id, and compute the first and last event
    my_agg = players.groupby(['player', 'id']).date.agg([min, max])
    #apply the function to each chunk of the dataset. You obtain a series, with all the imputed values for the NaNs
    ema = mv_per_player.apply(my_custom_function)
    #now we can substitute the missing id with the new imputed values...
    players.loc[players.id.isnull(),'id'] = ema.values
    #append new values stored in players to the new dataframe
    final = final.append(players)
#...and check the new dataset
final
player date id
0 1 2018-01-01 00:17:01 0
1 1 2018-01-01 00:17:05 0
2 1 2018-01-01 00:19:05 a
3 1 2018-01-01 00:21:07 a
4 1 2018-01-01 00:22:09 b
5 1 2018-01-01 00:22:17 b
6 1 2018-01-01 00:25:09 b
7 1 2018-01-01 00:25:11 c
8 1 2018-01-01 00:27:28 c
9 1 2018-01-01 00:29:29 c
10 1 2018-01-01 00:30:35 c
11 2 2018-02-01 00:31:16 d
12 2 2018-02-01 00:35:22 d
13 2 2018-02-01 00:38:16 e
14 2 2018-02-01 00:38:20 e
15 2 2018-02-01 00:55:15 0
16 3 2018-01-03 00:55:22 f
17 3 2018-01-03 00:58:16 f
18 3 2018-01-03 00:58:21 g
19 3 2018-03-01 01:00:35 -99
20 3 2018-03-01 01:20:16 f
21 3 2018-03-01 01:31:16 g
I do not think my solution is the best, and I would still appreciate other ideas, especially if they are more easily scalable (I have a large dataset).
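For anyone looking for something more scalable, here is a merge-based sketch of the same idea (my own attempt, assuming the original df from the question; not benchmarked): build the per-session ranges once, attach every missing row to all sessions of its player, keep the pairs whose timestamp falls inside the range, and then decide 0 / id / -99 from the number of matches.
#per-player, per-session time ranges, built only from the rows that have an id
ranges = (df.dropna(subset=['id'])
            .groupby(['player', 'id'])['date']
            .agg(['min', 'max'])
            .reset_index())
#missing rows, keeping their original row labels in the 'index' column
missing = df.loc[df['id'].isna(), ['player', 'date']].reset_index()
#pair each missing row with every session of the same player, keep in-range pairs
cand = missing.merge(ranges, on='player', how='left')
cand = cand[(cand['date'] >= cand['min']) & (cand['date'] <= cand['max'])]
#count the matching sessions per missing row
matches = cand.groupby('index')['id'].agg(['size', 'first'])
#0 for no match, the session id for exactly one match, -99 for an ambiguous match
result = df['id'].copy()
result[df['id'].isna()] = 0
one = matches.index[matches['size'] == 1]
result[one] = matches.loc[one, 'first'].values
result[matches.index[matches['size'] > 1]] = -99
df['id'] = result
On the sample data this should reproduce the 0 / 'b' / 0 / -99 labels above, and the merge replaces the per-row apply, which is usually the slow part.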