I already asked a similar question (see here), but unfortunately it was not clear enough, so I decided it was better to create a new one with a better example dataset and a clearer explanation of the desired output - an edit would have really been a major change.
So, I have the following dataset (it's already sorted by date and player):
import numpy as np
import pandas as pd

d = {'player': ['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '3', '3', '3', '3', '3', '3'],
'date': ['2018-01-01 00:17:01', '2018-01-01 00:17:05','2018-01-01 00:19:05', '2018-01-01 00:21:07', '2018-01-01 00:22:09',
'2018-01-01 00:22:17', '2018-01-01 00:25:09', '2018-01-01 00:25:11', '2018-01-01 00:27:28', '2018-01-01 00:29:29',
'2018-01-01 00:30:35', '2018-02-01 00:31:16', '2018-02-01 00:35:22', '2018-02-01 00:38:16',
'2018-02-01 00:38:20', '2018-02-01 00:55:15', '2018-01-03 00:55:22',
'2018-01-03 00:58:16', '2018-01-03 00:58:21', '2018-03-01 01:00:35', '2018-03-01 01:20:16', '2018-03-01 01:31:16'],
'id': [np.nan, np.nan, 'a', 'a', 'b', np.nan, 'b', 'c', 'c', 'c', 'c', 'd', 'd', 'e', 'e', np.nan, 'f', 'f',
'g', np.nan, 'f', 'g']}
#create dataframe
df = pd.DataFrame(data=d)
#change date to datetime
df['date'] = pd.to_datetime(df['date'])
df
player date id
0 1 2018-01-01 00:17:01 NaN
1 1 2018-01-01 00:17:05 NaN
2 1 2018-01-01 00:19:05 a
3 1 2018-01-01 00:21:07 a
4 1 2018-01-01 00:22:09 b
5 1 2018-01-01 00:22:17 NaN
6 1 2018-01-01 00:25:09 b
7 1 2018-01-01 00:25:11 c
8 1 2018-01-01 00:27:28 c
9 1 2018-01-01 00:29:29 c
10 1 2018-01-01 00:30:35 c
11 2 2018-02-01 00:31:16 d
12 2 2018-02-01 00:35:22 d
13 2 2018-02-01 00:38:16 e
14 2 2018-02-01 00:38:20 e
15 2 2018-02-01 00:55:15 NaN
16 3 2018-01-03 00:55:22 f
17 3 2018-01-03 00:58:16 f
18 3 2018-01-03 00:58:21 g
19 3 2018-03-01 01:00:35 NaN
20 3 2018-03-01 01:20:16 f
21 3 2018-03-01 01:31:16 g
So, these are my three columns:
'player' - dtype = object
'id' (object). Each session id groups together a set of actions (i.e. the rows in the dataset) that the players have implemented online.
'date' (datetime object) tells us the time at which each action was implemented.
The problem in this dataset is that I have the timestamps for each action, but some of the actions are missing their session id. What I want to do is the following: for each player, I want to give an id label for the missing values, based on the timeline. The actions missing their id can be labeled if they fall within the temporal range (first action - last action) of a certain session.
Ok, so here I have my missing values:
df.loc[df.id.isnull(),'date']
0 2018-01-01 00:17:01
1 2018-01-01 00:17:05
5 2018-01-01 00:22:17
15 2018-02-01 00:55:15
19 2018-03-01 01:00:35
Please note that I have the player code for each one of them: what is missing is just the session code. So, I want to compare the timestamp of each missing value with the session timestamps of the corresponding player.
I was thinking of using a groupby to compute the first and last action of each session, for each player (though I do not know if it is the best approach):
my_agg = df.groupby(['player', 'id']).date.agg([min, max])
my_agg
min max
player id
1 a 2018-01-01 00:19:05 2018-01-01 00:21:07
b 2018-01-01 00:22:09 2018-01-01 00:25:09
c 2018-01-01 00:25:11 2018-01-01 00:30:35
2 d 2018-02-01 00:31:16 2018-02-01 00:35:22
e 2018-02-01 00:38:16 2018-02-01 00:38:20
3 f 2018-01-03 00:55:22 2018-03-01 01:20:16
g 2018-01-03 00:58:21 2018-03-01 01:31:16
Then I would like to match the NaNs by player, and compare the timestamp of each missing value with the range of each session for that player.
In the dataset I try to illustrate three possible scenarios I am interested in:
the action occurred between the first and last date of a certain session. In this case I would like to fill the missing value with the id of that session, as it clearly belongs to it. Row 5 of the dataset should therefore be labeled 'b', as it occurs within the range of b.
I would mark as '0' the actions that occurred outside the range of any session - for example the first two NaNs and row 15.
Finally, I would mark it as '-99' if it is not possible to associate the action with a single session, because it occurred during the time ranges of different sessions. This is the case of row 19, the last NaN.
Desired output:
To sum it up, the outcome should look like this df:
player date id
0 1 2018-01-01 00:17:01 0
1 1 2018-01-01 00:17:05 0
2 1 2018-01-01 00:19:05 a
3 1 2018-01-01 00:21:07 a
4 1 2018-01-01 00:22:09 b
5 1 2018-01-01 00:22:17 b
6 1 2018-01-01 00:25:09 b
7 1 2018-01-01 00:25:11 c
8 1 2018-01-01 00:27:28 c
9 1 2018-01-01 00:29:29 c
10 1 2018-01-01 00:30:35 c
11 2 2018-02-01 00:31:16 d
12 2 2018-02-01 00:35:22 d
13 2 2018-02-01 00:38:16 e
14 2 2018-02-01 00:38:20 e
15 2 2018-02-01 00:55:15 0
16 3 2018-01-03 00:55:22 f
17 3 2018-01-03 00:58:16 f
18 3 2018-01-03 00:58:21 g
19 3 2018-03-01 01:00:35 -99
20 3 2018-03-01 01:20:16 f
21 3 2018-03-01 01:31:16 g
This may not be the best approach, but it does work. Basically I am creating some new columns using shift, and then use the conditions you mentioned with np.select:
#helper columns: the session id of the previous and the next row
df['shift'] = df['id'].shift(1)
df['shift-1'] = df['id'].shift(-1)
df['merge'] = df[['shift','shift-1']].values.tolist()
df.drop(columns=['shift','shift-1'], inplace=True)
#map each id to an integer (NaN to 0) so the order of neighbouring sessions can be compared
alpha = {np.nan:0,'a':1,'b':2,'c':3,'d':4,'e':5,'f':6,'g':7,'h':8}
diff = []
for i in range(len(df)):
    diff.append(alpha[df['merge'][i][1]] - alpha[df['merge'][i][0]])
df['diff'] = diff
conditions = [
    #previous and next id are equal for the same player: the action lies inside that session
    (df['id'].shift(1).eq(df['id'].shift(-1)) & df['id'].isna() &
     df['player'].shift(1).eq(df['player'].shift(-1))),
    #neighbouring ids differ but follow each other in order: the action lies outside any session
    (~df['id'].shift(1).eq(df['id'].shift(-1)) & df['id'].isna() &
     (df['player'].shift(1).eq(df['player']) | df['player'].shift(-1).eq(df['player'])) &
     (df['diff'] >= 0)),
    #neighbouring ids differ and overlap: the action cannot be attributed to a single session
    (~df['id'].shift(1).eq(df['id'].shift(-1)) & df['id'].isna() &
     (df['player'].shift(1).eq(df['player']) | df['player'].shift(-1).eq(df['player'])) &
     (df['diff'] < 0)),
]
choices = [df['id'].ffill(),
           0,
           -99
          ]
df['id'] = np.select(conditions, choices, default=df['id'])
df.drop(columns=['merge','diff'], inplace=True)
df
out:
player date id
0 1 2018-01-01 00:17:01 0
1 1 2018-01-01 00:17:05 0
2 1 2018-01-01 00:19:05 a
3 1 2018-01-01 00:21:07 a
4 1 2018-01-01 00:22:09 b
5 1 2018-01-01 00:22:17 b
6 1 2018-01-01 00:25:09 b
7 1 2018-01-01 00:25:11 c
8 1 2018-01-01 00:27:28 c
9 1 2018-01-01 00:29:29 c
10 1 2018-01-01 00:30:35 c
11 2 2018-02-01 00:31:16 d
12 2 2018-02-01 00:35:22 d
13 2 2018-02-01 00:38:16 e
14 2 2018-02-01 00:38:20 e
15 2 2018-02-01 00:55:15 0
16 3 2018-01-03 00:55:22 f
17 3 2018-01-03 00:58:16 f
18 3 2018-01-03 00:58:21 g
19 3 2018-03-01 01:00:35 -99
20 3 2018-03-01 01:20:16 f
21 3 2018-03-01 01:31:16 g
In my solution I just had to work a bit to correctly apply the function written by @ysearka in a previous stackoverflow question - see here. The basic challenge was to apply his function player by player.
#define a function to sort the missing values (ysearka's function from stackoverflow)
def my_custom_function(time):
    #compare every date event with the range of the sessions
    #(note: my_agg is looked up globally, so it must be re-assigned before each call)
    current_sessions = my_agg.loc[(my_agg['min'] < time) & (my_agg['max'] > time)]
    #store the length, that is the number of matches
    count = len(current_sessions)
    #if 0, no matches were found: the event lies outside all the possible ranges
    if count == 0:
        return 0
    #if more than one, it is impossible to say to which session the event belongs
    if count > 1:
        return -99
    #count == 1: the event clearly belongs to just one session
    return current_sessions.index[0][1]
#create a list storing all the player ids
plist = list(df.player.unique())
#ignore SettingWithCopyWarning: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
pd.options.mode.chained_assignment = None
#create an empty new dataframe in which to store the results
final = pd.DataFrame()
#with this for loop, iterate over the part of the dataset corresponding to one player at a time
for i in plist:
    #slice the dataset by player
    players = df.loc[df['player'] == i]
    #for every player, take the dates where the id is missing
    mv_per_player = players.loc[players.id.isnull(), 'date']
    #for each player, group by player and id, and compute the first and last event of every session
    my_agg = players.groupby(['player', 'id']).date.agg([min, max])
    #apply the function to each chunk of the dataset; the result is a series
    #with the imputed values for the NaNs
    ema = mv_per_player.apply(my_custom_function)
    #now we can substitute the missing ids with the new imputed values...
    players.loc[players.id.isnull(), 'id'] = ema.values
    #append the new values stored in players to the new dataframe
    #(DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    final = pd.concat([final, players])
#...and check the new dataset
final
player date id
0 1 2018-01-01 00:17:01 0
1 1 2018-01-01 00:17:05 0
2 1 2018-01-01 00:19:05 a
3 1 2018-01-01 00:21:07 a
4 1 2018-01-01 00:22:09 b
5 1 2018-01-01 00:22:17 b
6 1 2018-01-01 00:25:09 b
7 1 2018-01-01 00:25:11 c
8 1 2018-01-01 00:27:28 c
9 1 2018-01-01 00:29:29 c
10 1 2018-01-01 00:30:35 c
11 2 2018-02-01 00:31:16 d
12 2 2018-02-01 00:35:22 d
13 2 2018-02-01 00:38:16 e
14 2 2018-02-01 00:38:20 e
15 2 2018-02-01 00:55:15 0
16 3 2018-01-03 00:55:22 f
17 3 2018-01-03 00:58:16 f
18 3 2018-01-03 00:58:21 g
19 3 2018-03-01 01:00:35 -99
20 3 2018-03-01 01:20:16 f
21 3 2018-03-01 01:31:16 g
I do not think my solution is the best, and I would still appreciate other ideas, especially if they are more easily scalable (I have a large dataset).
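For what it's worth, a more scalable direction would be to replace the per-player loop with a single conditional merge: join the missing rows to the per-player session ranges and count how many ranges contain each timestamp. Below is a minimal sketch of that idea; the helper name label_missing is mine, and it uses inclusive bounds (>=/<=) where ysearka's function used strict ones, so adjust the comparisons to taste:

def label_missing(df):
    #session ranges per player, computed once over the whole frame
    ranges = (df.dropna(subset=['id'])
                .groupby(['player', 'id'])['date']
                .agg(['min', 'max'])
                .reset_index())
    #the actions missing their id, keeping the original row labels in a column
    missing = df[df['id'].isna()][['player', 'date']].reset_index()
    #join every missing action to all the sessions of the same player...
    cand = missing.merge(ranges, on='player')
    #...and keep only the sessions whose range contains the action
    cand = cand[(cand['date'] >= cand['min']) & (cand['date'] <= cand['max'])]
    #count the matches per missing action: 0 -> '0', 1 -> the session id, >1 -> '-99'
    counts = cand.groupby('index')['id'].agg(['count', 'first'])
    out = df.copy()
    out.loc[out['id'].isna(), 'id'] = '0'  #default: outside every session
    one = counts[counts['count'] == 1]
    out.loc[one.index, 'id'] = one['first']
    out.loc[counts[counts['count'] > 1].index, 'id'] = '-99'
    return out

With the example df above, label_missing(df) should reproduce the desired output in a single pass, without slicing by player.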
Related
df = pd.DataFrame([['2018-02-03',42],
['2018-02-03',22],
['2018-02-03',10],
['2018-02-03',32],
['2018-02-03',10],
['2018-02-04',8],
['2018-02-04',2],
['2018-02-04',12],
['2018-02-03',20],
['2018-02-05',30],
['2018-02-05',5],
['2018-02-05',15]])
df.columns = ['date','quantity']
I want to create groups by date and calculate the minimum value of a 'quantity' column for all the groups respectively and subtract the value from all the values of a 'quantity' column of that group. The desired output is:
day value
2018-02-03 32 #(because, 42-10 = 32), 10 is minimum for date 2018-02-03.
2018-02-03 12
2018-02-03 0
2018-02-03 22
2018-02-03 0
2018-02-04 6
2018-02-04 0
2018-02-04 10
2018-02-03 10
2018-02-05 25
2018-02-05 0
2018-02-05 10
Now, this is what I tried:
df = df.groupby('Date', as_index = True)
datamin = df.groupby('Date')['quantity'].min()
But this creates a dataframe with the first quantity by Date, and I also do not know how to proceed after this!
Try via groupby() and transform():
df['value']=df.groupby('date')['quantity'].transform(lambda x:x-x.min())
output of df:
date quantity value
0 2018-02-03 42 32
1 2018-02-03 22 12
2 2018-02-03 10 0
3 2018-02-03 32 22
4 2018-02-03 10 0
5 2018-02-04 8 6
6 2018-02-04 2 0
7 2018-02-04 12 10
8 2018-02-03 20 10
9 2018-02-05 30 25
10 2018-02-05 5 0
11 2018-02-05 15 10
To improve performance, use GroupBy.transform without a lambda function; it is better to subtract all the values of the column at once, like:
df['value'] = df['quantity'].sub(df.groupby('date')['quantity'].transform('min'))
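If you want to check the difference on your own data, a quick comparison with the standard timeit module might look like this (just a sketch; the numbers will vary with the number of groups):

import timeit
#the lambda-based transform does the subtraction group by group
lam = lambda: df.groupby('date')['quantity'].transform(lambda x: x - x.min())
#the vectorized version computes the group minima once and subtracts in a single operation
vec = lambda: df['quantity'].sub(df.groupby('date')['quantity'].transform('min'))
print(timeit.timeit(lam, number=100))
print(timeit.timeit(vec, number=100))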
I'm stuck and can't solve this...
I have 2 dataframes.
One has datetimes intervals, another has datetimes and values.
I need to get MIN() values based on datetime ranges.
import pandas as pd
timeseries = pd.DataFrame(
[
['2018-01-01T00:00:00.000000000', '2018-01-01T03:00:00.000000000'],
['2018-01-02T00:00:00.000000000', '2018-01-02T03:00:00.000000000'],
['2018-01-03T00:00:00.000000000', '2018-01-03T03:00:00.000000000'],
], dtype='datetime64[ns]', columns=['Start DT', 'End DT'])
values = pd.DataFrame(
[
['2018-01-01T00:00:00.000000000', 1],
['2018-01-01T01:00:00.000000000', 2],
['2018-01-01T02:00:00.000000000', 0],
['2018-01-02T00:00:00.000000000', -1],
['2018-01-02T01:00:00.000000000', 3],
['2018-01-02T02:00:00.000000000', 10],
['2018-01-03T00:00:00.000000000', 7],
['2018-01-03T01:00:00.000000000', 11],
['2018-01-03T02:00:00.000000000', 2],
], columns=['DT', 'Value'])
Required output:
Start DT End DT Min
0 2018-01-01 2018-01-01 03:00:00 0
1 2018-01-02 2018-01-02 03:00:00 -1
2 2018-01-03 2018-01-03 03:00:00 2
Any ideas?
Use an IntervalIndex created from the timeseries columns, then get positions with Index.get_indexer, aggregate the min, and finally add the column back to timeseries:
s = pd.IntervalIndex.from_arrays(timeseries['Start DT'],
timeseries['End DT'],
closed='both')
values['new'] = timeseries.index[s.get_indexer(values['DT'])]
print (values)
DT Value new
0 2018-01-01 00:00:00 1 0
1 2018-01-01 01:00:00 2 0
2 2018-01-01 02:00:00 0 0
3 2018-01-02 00:00:00 -1 1
4 2018-01-02 01:00:00 3 1
5 2018-01-02 02:00:00 10 1
6 2018-01-03 00:00:00 7 2
7 2018-01-03 01:00:00 11 2
8 2018-01-03 02:00:00 2 2
df = timeseries.join(values.groupby('new')['Value'].min().rename('Min'))
print (df)
Start DT End DT Min
0 2018-01-01 2018-01-01 03:00:00 0
1 2018-01-02 2018-01-02 03:00:00 -1
2 2018-01-03 2018-01-03 03:00:00 2
EDIT: If there is no match, get_indexer returns -1, which would wrongly select the last index value (here 2); instead, mask those positions so they become missing values:
timeseries = pd.DataFrame(
[
['2018-01-01T00:00:00.000000000', '2018-01-01T03:00:00.000000000'],
['2018-01-02T00:00:00.000000000', '2018-01-02T03:00:00.000000000'],
['2018-01-03T00:00:00.000000000', '2018-01-03T03:00:00.000000000'],
], dtype='datetime64[ns]', columns=['Start DT', 'End DT'])
values = pd.DataFrame(
[ ['2017-12-31T00:00:00.000000000', -10],
['2018-01-01T00:00:00.000000000', 1],
['2018-01-01T01:00:00.000000000', 2],
['2018-01-01T02:00:00.000000000', 0],
['2018-01-02T00:00:00.000000000', -1],
['2018-01-02T01:00:00.000000000', 3],
['2018-01-02T02:00:00.000000000', 10],
['2018-01-03T00:00:00.000000000', 7],
['2018-01-03T01:00:00.000000000', 11],
['2018-01-03T02:00:00.000000000', 2],
], columns=['DT', 'Value'])
values['DT'] = pd.to_datetime(values['DT'])
print (values)
DT Value
0 2017-12-31 00:00:00 -10
1 2018-01-01 00:00:00 1
2 2018-01-01 01:00:00 2
3 2018-01-01 02:00:00 0
4 2018-01-02 00:00:00 -1
5 2018-01-02 01:00:00 3
6 2018-01-02 02:00:00 10
7 2018-01-03 00:00:00 7
8 2018-01-03 01:00:00 11
9 2018-01-03 02:00:00 2
s = pd.IntervalIndex.from_arrays(timeseries['Start DT'],
timeseries['End DT'], closed='both')
pos = s.get_indexer(values['DT'])
values['new'] = timeseries.index[pos].where(pos != -1)
print (values)
DT Value new
0 2017-12-31 00:00:00 -10 NaN
1 2018-01-01 00:00:00 1 0.0
2 2018-01-01 01:00:00 2 0.0
3 2018-01-01 02:00:00 0 0.0
4 2018-01-02 00:00:00 -1 1.0
5 2018-01-02 01:00:00 3 1.0
6 2018-01-02 02:00:00 10 1.0
7 2018-01-03 00:00:00 7 2.0
8 2018-01-03 01:00:00 11 2.0
9 2018-01-03 02:00:00 2 2.0
df = timeseries.join(values.dropna(subset=['new']).groupby('new')['Value'].min().rename('Min'))
print (df)
Start DT End DT Min
0 2018-01-01 2018-01-01 03:00:00 0
1 2018-01-02 2018-01-02 03:00:00 -1
2 2018-01-03 2018-01-03 03:00:00 2
One possible solution is to create a variable (key) on which to join the two datasets
# create 'key' variable
timeseries['key'] = timeseries['Start DT'].astype(str)
values['key'] = pd.to_datetime(values['DT'].str.replace('T', ' '), format='%Y-%m-%d %H:%M:%S.%f').dt.date.astype(str)
# create dataset with minima
mins = values.groupby('key').agg({'Value': 'min'}).reset_index()
# join
timeseries.merge(mins, on='key').drop(columns=['key'])
Start DT End DT Value
0 2018-01-01 2018-01-01 03:00:00 0
1 2018-01-02 2018-01-02 03:00:00 -1
2 2018-01-03 2018-01-03 03:00:00 2
values['DT']=values['DT'].astype(str) #convert to string
s=values['DT'].str.split(' ')#split on space
values['day']=s.str[0] #take the day part
df4=values.groupby(by='day').min()#groupby and take min value
df4.reset_index(inplace=True) #reset index
df4['day']=pd.to_datetime(df4['day'])#convert back to datetime for merging
final=pd.merge(timeseries,df4,left_on='Start DT',right_on='day',how='inner') #merge
I am trying to replicate the Excel INDEX/MATCH in pandas, so as to produce a new column which copies the date of the first occurrence where the value in colB is matched or exceeded by the value in colC:
date colA colB colC colD desired_output
0 2020-04-01 00:00:00 2 1 e 2020-04-02 00:00:00
1 2020-04-02 00:00:00 8 4 4 d 2020-04-02 00:00:00
2 2020-04-03 00:00:00 1 2 a 2020-04-03 00:00:00
3 2020-04-04 00:00:00 4 2 3 b 2020-04-04 00:00:00
4 2020-04-05 00:00:00 5 3 1 c 2020-04-07 00:00:00
5 2020-04-06 00:00:00 9 4 1 m
6 2020-04-07 00:00:00 5 3 3 c 2020-04-07 00:00:00
Here is the code that I have tried so far, unsuccessfully:
col_6 = []
for ind in df3.index:
    if df3['colC'][ind] >= df3['colB']:
        col_6.append(df3['date'][ind])
    else:
        col_6.append('')
df3['desired_output'] = col_6
and have also tried:
col_6 = []
for ind in df3.index:
    if df3['colB'][ind] <= df3['colC']:
        col_6.append(df3['date'][ind])
    else:
        col_6.append('')
df3['desired_output'] = col_6
...this second attempt has come the closest, but it only produces results when the 'if' condition occurs within the same index row of the dataframe. For instance, the value of 'colB' in index row 4 is exceeded by the value of 'colC' in index row 6, but my attempted code fails to capture this sort of occurrence.
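A minimal loop-based sketch of one way to do this (first_match_date is a hypothetical helper name, and I am assuming the requirement is: for each row, take the date of the first row at or after it whose colC meets or exceeds that row's colB):

def first_match_date(df):
    out = []
    for ind in df.index:
        #rows from the current one onward where colC >= this row's colB
        hits = df.loc[ind:, 'colC'] >= df.loc[ind, 'colB']
        #idxmax() on a boolean series returns the label of the first True
        out.append(df.loc[hits.idxmax(), 'date'] if hits.any() else '')
    return out

df3['desired_output'] = first_match_date(df3)

Missing colC values compare as False, so they are simply skipped; row 5 gets an empty string because no later colC reaches 4.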
I have the following dataset:
d = {'player': ['1', '1', '1', '1', '1', '1', '1', '1', '1', '2', '2',
'2', '2', '2', '2', '3', '3', '3', '3', '3'],
'session': ['a', 'a', 'b', np.nan, 'b', 'c', 'c', 'c', 'c', 'd', 'd',
'e', 'e', np.nan, 'e', 'f', 'f', 'g', np.nan, 'g'],
'date': ['2018-01-01 00:19:05', '2018-01-01 00:21:07',
'2018-01-01 00:22:07', '2018-01-01 00:22:15','2018-01-01 00:25:09',
'2018-01-01 00:25:11', '2018-01-01 00:27:28', '2018-01-01 00:29:29',
'2018-01-01 00:30:35', '2018-01-01 00:21:16', '2018-01-01 00:35:22',
'2018-01-01 00:38:16', '2018-01-01 00:38:20', '2018-01-01 00:40:35',
'2018-01-01 01:31:16', '2018-01-03 00:55:22', '2018-01-03 00:58:16',
'2018-01-03 00:58:21', '2018-03-01 01:00:35', '2018-03-01 01:31:16']
}
#create dataframe
df = pd.DataFrame(data=d)
#change date to datetime
df['date'] = pd.to_datetime(df['date'])
df.head()
player session date
0 1 a 2018-01-01 00:19:05
1 1 a 2018-01-01 00:21:07
2 1 b 2018-01-01 00:22:07
3 1 NaN 2018-01-01 00:22:15
4 1 b 2018-01-01 00:25:09
So, these are my three columns:
'player' - with three players (1,2,3) - dtype = object
'session' (object). Each session id groups together a set of actions (i.e. the rows in the dataset) that the players have implemented online.
'date' (datetime object) tells us the time at which each action was implemented.
The problem in this dataset is that I have the timestamps for each action, but some of the actions are missing their session id. What I want to do is the following: for each player, I want to give an id label for the missing values, based on the timeline. The actions missing their id can be labeled if they fall within the temporal range (first action - last action) of a certain session.
Let's say I groupby player & session, and compute the time range for each session:
my_agg = df.groupby(['player', 'session']).date.agg([min, max])
my_agg
min max
player session
1 a 2018-01-01 00:19:05 2018-01-01 00:21:07
b 2018-01-01 00:22:07 2018-01-01 00:25:09
c 2018-01-01 00:25:11 2018-01-01 00:30:35
2 d 2018-01-01 00:21:16 2018-01-01 00:35:22
e 2018-01-01 00:38:16 2018-01-01 01:31:16
3 f 2018-01-03 00:55:22 2018-01-03 00:58:16
g 2018-01-03 00:58:21 2018-03-01 01:31:16
At this point I would like to iterate through every player, and to compare the timestamp of my nan values, session by session, to see where they belong.
Desired output: In the example, the first Nan should be labeled as 'b', the second one as 'e' and the last one as 'g'.
Disclaimer: I asked a similar question a few days ago (see here) and received a very good answer, but this time I must take into account another variable, and I am stuck again. Indeed, the first steps in Python are exciting, but very challenging.
Your example is already sorted; however, this should produce your desired result even in the event that your inputs are not sorted. If this answer does not satisfy your requirements, please post an additional (or modified) sample dataframe with an expected output where this approach violates your requirements.
df.sort_values(['player','date']).fillna(method='ffill')
Yields:
player session date
0 1 a 2018-01-01 00:19:05
1 1 a 2018-01-01 00:21:07
2 1 b 2018-01-01 00:22:07
3 1 b 2018-01-01 00:22:15
4 1 b 2018-01-01 00:25:09
5 1 c 2018-01-01 00:25:11
6 1 c 2018-01-01 00:27:28
7 1 c 2018-01-01 00:29:29
8 1 c 2018-01-01 00:30:35
9 2 d 2018-01-01 00:21:16
10 2 d 2018-01-01 00:35:22
11 2 e 2018-01-01 00:38:16
12 2 e 2018-01-01 00:38:20
13 2 e 2018-01-01 00:40:35
14 2 e 2018-01-01 01:31:16
15 3 f 2018-01-03 00:55:22
16 3 f 2018-01-03 00:58:16
17 3 g 2018-01-03 00:58:21
18 3 g 2018-03-01 01:00:35
19 3 g 2018-03-01 01:31:16
I have a table that looks like this (the ratio column was merged from another table based on the codename and date):
date codename ratio
2018-01-01 A .5
2018-02-01 A
2018-03-01 A
2018-01-01 B
2018-02-01 B
2018-01-01 C .6
2018-02-01 C
2018-03-01 C .7
2018-04-01 C
I need to fill in the empty ratio values with the most recent value given the codename
Output should be:
date codename ratio
2018-01-01 A .5
2018-02-01 A .5
2018-03-01 A .5
2018-01-01 B
2018-02-01 B
2018-01-01 C .6
2018-02-01 C .6
2018-03-01 C .7
2018-04-01 C .7
A got .5 because that's its only value. B remains empty because it has no ratio. C got .6 filled in for February since that was the January value, but its April value is .7 because that was the March value.
You can use .fillna() to fill in NaN values, and its method argument allows you to fill forwards or backwards. In this case, we want to group by codename to ensure we don't fill across different names.
Assuming your dataframe is called df:
df['ratio'] = df.groupby('codename')['ratio'].fillna(method='ffill')
Should do the trick. Printing df after this gets us:
date codename ratio
0 2018-01-01 A 0.5
1 2018-02-01 A 0.5
2 2018-03-01 A 0.5
3 2018-01-01 B NaN
4 2018-02-01 B NaN
5 2018-01-01 C 0.6
6 2018-02-01 C 0.6
7 2018-03-01 C 0.7
8 2018-04-01 C 0.7
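Note: in recent pandas versions (2.x), passing method= to fillna inside a groupby is deprecated; to my knowledge the supported equivalent is the dedicated ffill method, which keeps the same per-group forward fill without crossing codenames:

df['ratio'] = df.groupby('codename')['ratio'].ffill()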