I have a dataframe with a column d1, and I am trying to calculate an 'out' column by ranking the runs of 'nan' values within that column.
import numpy as np
import pandas as pd

data_input = {'Name':['Renault', 'Renault', 'Renault', 'Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault'],
'type':['Duster', 'Duster', 'Duster','Duster','Duster','Duster','Duster','Triber','Triber','Triber','Triber','Triber','Triber','Triber'],
'd1':['nan','10','10','10','nan','nan','20','20','nan','nan','30','30','30','nan']}
df_input = pd.DataFrame(data_input)
data_out = {'Name':['Renault', 'Renault', 'Renault', 'Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault'],
'type':['Duster', 'Duster', 'Duster','Duster','Duster','Duster','Duster','Triber','Triber','Triber','Triber','Triber','Triber','Triber'],
'd1':['nan','10','10','10','nan','nan','20','20','nan','nan','30','30','30','nan'],
'out':[1,np.nan,np.nan,np.nan,2,2,np.nan,np.nan,1,1,np.nan,np.nan,np.nan,2]}
df_out = pd.DataFrame(data_out)
If, in a particular group, nan appears before and after some values, the ranks should be ascending.
ex: the rank for index 0 will be 1, and for indexes 4 & 5 it will be 2 (because there are no values after them in that group)
df_out["out"] = df_out.groupby(["Name","type"])['d1'].rank(method="first")
Use GroupBy.cumsum on the starts of consecutive missing-value runs per group:
df_out['d1'] = pd.to_numeric(df_out['d1'], errors='coerce')
m = df_out['d1'].isna()
df_out["out1"] = (df_out.assign(a = (m & ~m.shift(fill_value=False)))
                        .groupby(["Name","type"])['a']
                        .cumsum()
                        .where(m))
Alternative solution with boolean indexing:
df_out["out1"] = (df_out.assign(a = (m & ~m.shift(fill_value=False)))[m]
                        .groupby(["Name","type"])['a']
                        .cumsum())
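Putting the pieces together, here is a minimal, self-contained sketch of the cumsum approach on the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Renault'] * 14,
    'type': ['Duster'] * 7 + ['Triber'] * 7,
    'd1': ['nan', '10', '10', '10', 'nan', 'nan', '20',
           '20', 'nan', 'nan', '30', '30', '30', 'nan'],
})

df['d1'] = pd.to_numeric(df['d1'], errors='coerce')
m = df['d1'].isna()

# flag the first row of each consecutive NaN run, cumsum per group,
# then keep the counter only on the NaN rows
df['out'] = (df.assign(a=m & ~m.shift(fill_value=False))
               .groupby(['Name', 'type'])['a']
               .cumsum()
               .where(m))
print(df)
```

Each group's first NaN run gets 1, the second run gets 2, and non-NaN rows stay NaN, matching the expected 'out' column above.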
I have the following dataframe created through the following chunk of code:
df = pd.DataFrame(
    [
        (13412339, '07/03/2022', '08/03/2022', '10/03/2022', 1),
        (13412343, '07/03/2022', '07/03/2022', '09/03/2022', 0),
        (13412489, '07/02/2022', '08/02/2022', '07/03/2022', 0),
    ],
    columns=['task_id', 'start_date', 'end_date', 'end_period', 'status']
)
df = df.astype(dtype={'status' : bool})
df.start_date = pd.to_datetime(df.start_date)
df.end_date = pd.to_datetime(df.end_date)
df.end_period = pd.to_datetime(df.end_period)
What I need to do here is to calculate the difference in days between the start_date and end_date columns if the status column is False, else it should do the same but between start_date and end_period columns.
The code that I have implemented to calculate the days differences between the start_date and end_date columns is as follows:
import datetime as dt

new_frame = pd.DataFrame()

for row in range(df.shape[0]):
    # extract the row
    extracted_row = df.loc[row, :]
    # calculate the date difference in days for this row
    diff = extracted_row['end_date'] - extracted_row['start_date']
    diff_days = diff.days
    # iterate over the date difference and repeat the row for each full day
    for i in range(diff_days + 1):
        new_row = extracted_row.copy()
        new_row['date'] = new_row['start_date'] + dt.timedelta(days=i)
        new_row = new_row[['task_id', 'start_date', 'end_date',
                           'end_period', 'date', 'status']]
        # append the created row to the new dataframe
        new_frame = new_frame.append(new_row)

# rearrange columns in the desired order
new_frame = new_frame[['task_id', 'start_date', 'end_date', 'end_period', 'date', 'status']]
# change data types
new_frame = new_frame.astype(dtype={'task_id': int, 'status': bool})
Then, in order to calculate the differences depending on whether the status column is False, I did the following:
new_frame1 = pd.DataFrame()
new_frame2 = pd.DataFrame()

for row in range(df.shape[0]):
    # in this iteration, the status column should equal True
    if df['status'] == False:
        # extract the row
        extracted_row_end = df.loc[row, :]
        # calculate the date difference in days for this row
        diff1 = extracted_row_end['end_date'] - extracted_row_end['start_date']
        diff_days_end = diff1.days
        # iterate over the date difference and repeat the row for each full day
        for i in range(diff_days_end + 1):
            new_row_end = extracted_row_end.copy()
            new_row_end['date'] = new_row_end['start_date'] + dt.timedelta(days=i)
            new_row_end = new_row_end[['task_id', 'start_date', 'end_date',
                                       'end_period', 'date', 'status']]
            # append the created row to the new dataframe
            new_frame1 = new_frame1.append(new_row_end)
        # rearrange columns in the desired order
        new_frame = new_frame[['task_id', 'start_date', 'end_date', 'end_period', 'date', 'status']]
        # change data types
        new_frame = new_frame.astype(dtype={'task_id': int, 'status': bool})
    # in this iteration, the status column should equal False
    else:
        # extract the row
        extracted_row_period = df.loc[row, :]
        # calculate the date difference in days for this row
        diff2 = extracted_row_period['end_period'] - extracted_row_period['start_date']
        diff_days_period = diff2.days
        # iterate over the date difference and repeat the row for each full day
        for i in range(diff_days_period + 1):
            new_row_period = extracted_row_end.copy()
            new_row_period['date'] = new_row_period['start_date'] + dt.timedelta(days=i)
            new_row_period = new_row_period[['task_id', 'start_date', 'end_date',
                                            'end_period', 'date', 'status']]
            # append the created row to the new dataframe
            new_frame2 = new_frame2.append(new_row_period)
        # rearrange columns in the desired order
        new_frame = new_frame[['task_id', 'start_date', 'end_date', 'end_period', 'date', 'status']]
        # change data types
        new_frame = new_frame.astype(dtype={'task_id': int, 'status': bool})

# merge both dataframes
frames = [new_frame1, new_frame2]
df = pd.concat(frames)
It then throws an error when the first for loop starts. This is where I need help: how do I calculate the difference in days between the start_date and end_date columns when the status column is False, and otherwise between start_date and end_period?
The complete error is as follows:
Some part of your code did not work on my machine (so I just took the initial df from your first cell), but from reading what you need, this is what I would do:
import numpy as np
df['dayDiff'] = np.where(df['status'],
                         (df['end_period'] - df['start_date']).dt.days,
                         (df['end_date'] - df['start_date']).dt.days)
df
As you already have booleans in df['status'], I would use that as the np.where condition, then calculate the day difference as (df['end_period'] - df['start_date']).dt.days when True, and as (df['end_date'] - df['start_date']).dt.days when False.
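For completeness, a small self-contained run of the np.where approach (dates written in ISO form here to avoid day/month parsing ambiguity; the values mirror the sample df above, read day-first):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'start_date': pd.to_datetime(['2022-03-07', '2022-03-07', '2022-02-07']),
    'end_date':   pd.to_datetime(['2022-03-08', '2022-03-07', '2022-02-08']),
    'end_period': pd.to_datetime(['2022-03-10', '2022-03-09', '2022-03-07']),
    'status':     [True, False, False],
})

# status True -> measure to end_period, status False -> measure to end_date
df['dayDiff'] = np.where(df['status'],
                         (df['end_period'] - df['start_date']).dt.days,
                         (df['end_date'] - df['start_date']).dt.days)
print(df['dayDiff'].tolist())  # [3, 0, 1]
```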
I have a dataframe (df) as following
id date t_slot dayofweek label
1 2021-01-01 2 0 1
1 2021-01-02 3 1 0
2 2021-01-01 4 6 1
.......
The data frame is very large (6 million rows). t_slot takes values from 1 to 6; dayofweek from 0 to 6.
I want to get these rates:
- for each id, the rate at which label is 1 when t_slot is 1 to 4 and dayofweek is 0-4, over the 3 months before the date in each row.
- for each id, the rate at which label is 1 when t_slot is 5 to 6 and dayofweek is 0-4, over the 3 months before the date in each row.
- for each id, the rate at which label is 1 when t_slot is 1 to 4 and dayofweek is 5-6, over the 3 months before the date in each row.
- for each id, the rate at which label is 1 when t_slot is 5 to 6 and dayofweek is 5-6, over the 3 months before the date in each row.
I have used a loop to compute the rates, but it is very slow. Do you have a faster way to compute it? My code is copied below:
def get_time_slot_rate(df):
    import numpy as np
    if len(df) == 0:
        return np.nan, np.nan, np.nan, np.nan
    else:
        work = df.loc[df['dayofweek'] < 5]
        weekend = df.loc[df['dayofweek'] >= 5]
        if len(work) == 0:
            work_14, work_56 = np.nan, np.nan
        else:
            work_14 = len(work.loc[(work['time_slot'] < 5) * (work['label'] == 1)]) / len(work)
            work_56 = len(work.loc[(work['time_slot'] > 5) * (work['label'] == 1)]) / len(work)
        if len(weekend) == 0:
            weekend_14, weekend_56 = np.nan, np.nan
        else:
            weekend_14 = len(weekend.loc[(weekend['time_slot'] < 5) * (weekend['label'] == 1)]) / len(weekend)
            weekend_56 = len(weekend.loc[(weekend['time_slot'] > 5) * (weekend['label'] == 1)]) / len(weekend)
        return work_14, work_56, weekend_14, weekend_56
import datetime as d_t
from dateutil.relativedelta import relativedelta

lst_id = list(df['id'])
lst_date = list(df['date'])
lst_t14_work = []
lst_t56_work = []
lst_t14_weekend = []
lst_t56_weekend = []

for i in range(len(lst_id)):
    if i % 100 == 0:
        print(i)
    d_date = lst_date[i]
    dt = d_t.datetime.strptime(d_date, '%Y-%m-%d')
    month_step = relativedelta(months=3)
    pre_date = str(dt - month_step).split(' ')[0]
    df_s = df.loc[(df['easy_id'] == lst_easy[i])
                  & ((df['delivery_date'] >= pre_date)
                     & (df['delivery_date'] < d_date))].reset_index(drop=True)
    work_14_rate, work_56_rate, weekend_14_rate, weekend_56_rate = get_time_slot_rate(df_s)
    lst_t14_work.append(work_14_rate)
    lst_t56_work.append(work_56_rate)
    lst_t14_weekend.append(weekend_14_rate)
    lst_t56_weekend.append(weekend_56_rate)
I could only fix your function and it's completely untested, but here we go:
- Import only once, by putting the imports at the top of your .py file.
- try/except (EAFP) can be cheaper than an if/else check when the exception is rare.
- True and False equal 1 and 0 respectively in Python.
- Don't multiply boolean selectors; combine them with & and invert with ~.
- Create as few copies as possible.
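To illustrate the try/except point, here is a hypothetical `safe_rate` helper (not from the original code) that attempts the division and falls back to NaN, instead of pre-checking the denominator:

```python
import numpy as np

def safe_rate(hits, total):
    # EAFP: just divide; handle the rare zero denominator instead of checking first
    try:
        return hits / total
    except ZeroDivisionError:
        return np.nan

print(safe_rate(3, 4))  # 0.75
print(safe_rate(0, 0))  # nan
```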
import numpy as np

def get_time_slot_rate(df):
    # df.empty is much faster than counting rows
    if df.empty:
        return np.nan, np.nan, np.nan, np.nan
    # assuming df['label'] is either 0 or 1
    df = df.loc[df['label'] == 1]
    # boolean selectors to be inverted with '~' (dayofweek 0-4, time_slot 1-4)
    weekdays = df['dayofweek'] < 5
    slot_selector = df['time_slot'] < 5
    # plain int so that a zero denominator raises ZeroDivisionError below
    weekday_count = int(weekdays.sum())
    try:
        work_14 = len(df.loc[weekdays & slot_selector]) / weekday_count
        work_56 = len(df.loc[weekdays & ~slot_selector]) / weekday_count
    except ZeroDivisionError:
        work_14 = work_56 = np.nan
    weekend_count = int((~weekdays).sum())
    try:
        weekend_14 = len(df.loc[~weekdays & slot_selector]) / weekend_count
        weekend_56 = len(df.loc[~weekdays & ~slot_selector]) / weekend_count
    except ZeroDivisionError:
        weekend_14 = weekend_56 = np.nan
    return work_14, work_56, weekend_14, weekend_56
The rest of your script doesn't really make sense; see my comments:
for i in range(len(lst_id)):
    if i % 100 == 0:
        print(i)
    d_date = lst_date[i]
    # what is d_t ?
    dt = d_t.datetime.strptime(d_date, '%Y-%m-%d')
    month_step = relativedelta(months=3)
    pre_date = str(dt - month_step).split(' ')[0]
    df_s = df.loc[(df['easy_id'] == lst_easy[i])
                  & (df['delivery_date'] >= pre_date)
                  & (df['delivery_date'] < d_date)].reset_index(drop=True)
    # is it df or df_s ?
    work_14_rate, work_56_rate, weekend_14_rate, weekend_56_rate = get_time_slot_rate(df)
If your date column is a datetime object, then you can compare dates directly (no need for strings).
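A short sketch of that last point, with made-up dates: the 3-month window can be computed and compared entirely with datetime objects, no strptime/str round-trips:

```python
import pandas as pd

df = pd.DataFrame({'delivery_date': pd.to_datetime(['2021-01-15', '2021-03-01', '2021-04-10'])})
d_date = pd.Timestamp('2021-04-20')
pre_date = d_date - pd.DateOffset(months=3)   # 2021-01-20

# compare datetimes directly instead of formatted strings
window = df.loc[(df['delivery_date'] >= pre_date) & (df['delivery_date'] < d_date)]
print(window['delivery_date'].tolist())
```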
I want to select a row based on a condition and then update it in dataframe.
One solution I found is to update df based on the condition, but I must repeat the condition. What is a better solution, so that I get the desired rows once and change them?
df.loc[condition, "top"] = 1
df.loc[condition, "pred_text1"] = 2
df.loc[condition, "pred1_score"] = 3
something like:
row = df.loc[condition]
row["top"] = 1
row["pred_text1"] = 2
row["pred1_score"] = 3
Extract the boolean mask and set it as a variable.
m = condition
df.loc[m, 'top'] = 1
df.loc[m, 'pred_text1'] = 2
df.loc[m, 'pred1_score'] = 3
but the shortest way is:
df.loc[condition, ['top', 'pred_text1', 'pred1_score']] = [1, 2, 3]
Update
Wasn't it possible to retrieve the index of the rows and then update them by that index?
idx = df[condition].index
df.loc[idx, 'top'] = 1
df.loc[idx, 'pred_text1'] = 2
df.loc[idx, 'pred1_score'] = 3
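Yes. A toy sketch (column names from the question, the 'score' column and its data made up for illustration): compute the index once and reuse it:

```python
import pandas as pd

df = pd.DataFrame({'top': [0, 0, 0],
                   'pred_text1': [0, 0, 0],
                   'pred1_score': [0, 0, 0],
                   'score': [0.2, 0.9, 0.1]})
condition = df['score'] > 0.5

idx = df.index[condition]          # matching row labels, computed once
df.loc[idx, ['top', 'pred_text1', 'pred1_score']] = [1, 2, 3]
print(df)
```

Only the row where the condition holds is updated; the condition itself is evaluated a single time.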
I have the following dataframe
df = pd.DataFrame([{'id':'a', 'val':1}, {'id':'b', 'val':2}, {'id':'c', 'val': 0}, {'id':'d', 'val':0}])
What I want is to replace the 0's with successive values starting at the max value + 1.
The result I want is as follows:
df = pd.DataFrame([{'id':'a', 'val':1}, {'id':'b', 'val':2}, {'id':'c', 'val': 3}, {'id':'d', 'val':4}])
I tried the following:
for _, r in df.iterrows():
if r.val == 0:
r.val = df.val.max()+1
However, is there a one-line way to do the above?
Filter the 0 rows with boolean indexing and DataFrame.loc, and assign a range whose length is the count of True values in the condition, starting from the maximum value plus 1:
df.loc[df['val'].eq(0), 'val'] = np.arange(df['val'].eq(0).sum()) + df['val'].max() + 1
print (df)
id val
0 a 1
1 b 2
2 c 3
3 d 4
I have a df that contains ids and timestamps.
I was looking to group by the id and then apply a condition on the timestamps of the two rows in each group.
Something like: if timestamp_col1 of the first row > timestamp_col1 of the second row, then 1, else 2.
Basically, grouping the ids and an if statement that gives a value of 1 if the first row's timestamp is < the second's, and 2 if the second row's timestamp is < the first's.
Updated output below where last two values should be 2
Use to_timedelta for converting the times, then aggregate per group by comparing the first and last values, and last map with numpy.where to assign the new column:
df = pd.DataFrame({
'ID Code': ['a','a','b','b'],
'Time Created': ['21:25:27','21:12:09','21:12:00','21:12:40']
})
df['Time Created'] = pd.to_timedelta(df['Time Created'])
mask = df.groupby('ID Code')['Time Created'].agg(lambda x: x.iat[0] < x.iat[-1])
print (mask)
ID Code
a    False
b     True
Name: Time Created, dtype: bool
df['new'] = np.where(df['ID Code'].map(mask), 1, 2)
print (df)
ID Code Time Created new
0 a 21:25:27 2
1 a 21:12:09 2
2 b 21:12:00 1
3 b 21:12:40 1
Another solution with transform, which broadcasts the aggregate value back to a new column, here a boolean mask:
df['Time Created'] = pd.to_timedelta(df['Time Created'])
mask = (df.groupby('ID Code')['Time Created'].transform(lambda x: x.iat[0] > x.iat[-1]))
print (mask)
0 True
1 True
2 False
3 False
Name: Time Created, dtype: bool
df['new'] = np.where(mask, 2, 1)
print (df)
ID Code Time Created new
0 a 21:25:27 2
1 a 21:12:09 2
2 b 21:12:00 1
3 b 21:12:40 1