I have a dataframe in long format (panel data). Each person has a start month along with some variables. It looks something like:
person_id  month_start  Var1   Var2
1          1            0.4    1.4
1          2            0.3    0.131
1          3            0.34   0.434
2          2            0.49   0.949
2          3            0.53   1.53
2          5            0.38   0.738
3          1            1.12   1.34
3          4            1.89   1.02
3          5            0.83   0.27
and I need it to look like:
person_id  month_start  month_end  Var1   Var2
1          1            2          0.4    1.4
1          2            3          0.3    0.131
1          3            4          0.34   0.434
2          2            3          0.49   0.949
2          3            5          0.53   1.53
2          5            6          0.38   0.738
3          1            4          1.12   1.34
3          4            5          1.89   1.02
3          5            6          0.83   0.27
where month_end is the month_start of that person's next entry (and month_start + 1 for each person's last entry).
I was able to make this:
a = pd.DataFrame({'person_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                  'var1': [0.4, 0.3, 0.34, 0.49, 0.53, 0.38, 1.12, 1.89, 0.83],
                  'var2': [1.4, 0.131, 0.434, 0.949, 1.53, 0.738, 1.34, 1.02, 0.27],
                  'month_start': [1, 2, 3, 2, 3, 5, 1, 4, 5]})

def add_end_date(df_in, object_id, start_col, end_col):
    df = df_in.copy()
    prev_person_id = -1
    prev_index = -1
    df[end_col] = [-1] * len(df)
    for idx, row in df.iterrows():
        p_id = row[object_id]
        if prev_person_id == p_id:
            # put in this row's start date as the previous entry's end date
            df.loc[prev_index, end_col] = int(row[start_col])
        if row[end_col] == -1:
            df.loc[idx, end_col] = int(row[start_col] + 1)
        prev_person_id = p_id
        prev_index = idx
    return df

add_end_date(a, 'person_id', 'month_start', 'month_end')
Is there a better/optimized way to accomplish this?
Try groupby.shift:
df['month_end'] = df.groupby('person_id').month_start.shift(-1)\
.fillna(df.month_start + 1).astype(int)
df
person_id month_start Var1 Var2 month_end
0 1 1 0.40 1.400 2
1 1 2 0.30 0.131 3
2 1 3 0.34 0.434 4
3 2 2 0.49 0.949 3
4 2 3 0.53 1.530 5
5 2 5 0.38 0.738 6
6 3 1 1.12 1.340 4
7 3 4 1.89 1.020 5
8 3 5 0.83 0.270 6
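For completeness, here is the shift approach as a self-contained sketch, applied to the `a` frame from the question (using the question's lower-case var column names):

```python
import pandas as pd

# Sample frame from the question.
a = pd.DataFrame({'person_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                  'var1': [0.4, 0.3, 0.34, 0.49, 0.53, 0.38, 1.12, 1.89, 0.83],
                  'var2': [1.4, 0.131, 0.434, 0.949, 1.53, 0.738, 1.34, 1.02, 0.27],
                  'month_start': [1, 2, 3, 2, 3, 5, 1, 4, 5]})

# The next month_start within each person becomes this row's month_end;
# each person's last row falls back to month_start + 1.
a['month_end'] = (a.groupby('person_id')['month_start'].shift(-1)
                    .fillna(a['month_start'] + 1)
                    .astype(int))
```

This replaces the row-by-row loop with a single grouped shift, which avoids `iterrows` entirely.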
I have this dataframe:
value limit_1 limit_2 limit_3 limit_4
10 2 3 7 10
11 5 6 11 13
2 0.3 0.9 2.01 2.99
I want to add another column called class that classifies the value column this way:
if value <= limit_1 then 1
if value > limit_1 and value <= limit_2 then 2
if value > limit_2 and value <= limit_3 then 3
if value > limit_3 then 4
to get this result:
value limit_1 limit_2 limit_3 limit_4 CLASS
10 2 3 7 10 4
11 5 6 11 13 3
2 0.3 0.9 2.01 2.99 3
I know I could get these ifs to work, but my dataframe has 2 million rows and I need the fastest way to perform this classification.
I tried the .cut function but the result was not what I expected.
Thanks
We can use the rank method over the column axis (axis=1): with method="first", the rank of value within its row equals the number of limits it exceeds plus one, which is exactly the class:
df["CLASS"] = df.rank(axis=1, method="first").iloc[:, 0].astype(int)
value limit_1 limit_2 limit_3 limit_4 CLASS
0 10 2.0 3.0 7.00 10.00 4
1 11 5.0 6.0 11.00 13.00 3
2 2 0.3 0.9 2.01 2.99 3
We can use np.select:
import numpy as np
conditions = [df["value"] <= df["limit_1"],
              df["value"].between(df["limit_1"], df["limit_2"]),
              df["value"].between(df["limit_2"], df["limit_3"]),
              df["value"] > df["limit_3"]]
df["CLASS"] = np.select(conditions, [1, 2, 3, 4])
>>> df
value limit_1 limit_2 limit_3 limit_4 CLASS
0 10 2.0 3.0 7.00 10.00 4
1 11 5.0 6.0 11.00 13.00 3
2 2 0.3 0.9 2.01 2.99 3
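For reference, a self-contained version of the np.select approach on the question's data. Because np.select picks the first condition that holds, the inclusive `between` boundaries resolve in favor of the lower class, matching the stated rules:

```python
import numpy as np
import pandas as pd

# The question's sample data.
df = pd.DataFrame({'value': [10, 11, 2],
                   'limit_1': [2, 5, 0.3],
                   'limit_2': [3, 6, 0.9],
                   'limit_3': [7, 11, 2.01],
                   'limit_4': [10, 13, 2.99]})

# First matching condition wins, so a value equal to limit_1 is class 1, etc.
conditions = [df['value'] <= df['limit_1'],
              df['value'].between(df['limit_1'], df['limit_2']),
              df['value'].between(df['limit_2'], df['limit_3']),
              df['value'] > df['limit_3']]
df['CLASS'] = np.select(conditions, [1, 2, 3, 4])
```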
I have a dataframe with different ids and possibly overlapping times, with a time step of 0.4 seconds. I would like to resample the average speed for each id with a time step of 0.8 seconds.
time id speed
0 0.0 1 0
1 0.4 1 3
2 0.8 1 6
3 1.2 1 9
4 0.8 2 12
5 1.2 2 15
6 1.6 2 18
An example frame can be created with the following code:
import numpy as np
import pandas as pd

x = np.hstack((np.array([1] * 10), np.array([3] * 15)))
a = np.arange(10) * 0.4
b = np.arange(15) * 0.4 + 2
t = np.hstack((a, b))
df = pd.DataFrame({"time": t, "id": x})
df["speed"] = np.arange(25) * 3
The time column is converted to datetime type by
df["re_time"] = pd.to_datetime(df["time"], unit='s')
Try with groupby:
block_size = int(0.8//0.4)
blocks = df.groupby('id').cumcount() // block_size
df.groupby(['id',blocks]).agg({'time':'first', 'speed':'mean'})
Output:
time speed
id
1 0 0.0 1.5
1 0.8 7.5
2 1.6 13.5
3 2.4 19.5
4 3.2 25.5
3 0 2.0 31.5
1 2.8 37.5
2 3.6 43.5
3 4.4 49.5
4 5.2 55.5
5 6.0 61.5
6 6.8 67.5
7 7.6 72.0
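Putting the example construction and the grouped aggregation together as one runnable sketch:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
x = np.hstack((np.array([1] * 10), np.array([3] * 15)))
t = np.hstack((np.arange(10) * 0.4, np.arange(15) * 0.4 + 2))
df = pd.DataFrame({'time': t, 'id': x, 'speed': np.arange(25) * 3})

# 0.8 s / 0.4 s = 2 consecutive rows per resampled block, counted per id.
block_size = int(0.8 // 0.4)
blocks = df.groupby('id').cumcount() // block_size
out = df.groupby(['id', blocks]).agg({'time': 'first', 'speed': 'mean'})
```

Note that a trailing odd row (like id 3's last sample) forms its own block, so its "average" is just that single speed.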
I have a dataframe as shown below.
B_ID No_Show Session slot_num Patient_count
1 0.2 S1 1 1
2 0.3 S1 2 1
3 0.8 S1 3 1
4 0.3 S1 3 2
5 0.6 S1 4 1
6 0.8 S1 5 1
7 0.9 S1 5 2
8 0.4 S1 5 3
9 0.6 S1 5 4
12 0.9 S2 1 1
13 0.5 S2 1 2
14 0.3 S2 2 1
15 0.7 S2 3 1
20 0.7 S2 4 1
16 0.6 S2 5 1
17 0.8 S2 5 2
19 0.3 S2 5 3
where
No_Show = Probability of no show
Assume that
p = [0.2, 0.4] and Duration for each slot = 30 (minutes)
p = threshold probability
From the above I would like to calculate the data frame below.
Step 1
Sort the dataframe by Session, slot_num and Patient_count:
df = df.sort_values(['Session', 'slot_num', 'Patient_count'], ascending=False)
Step 2
Calculate the cut-off using the conditions below:
if Patient_count = 1, divide No_Show by the threshold probability
Example: for B_ID = 3, Patient_count = 1, cut_off = 0.8/0.2 = 4
else if Patient_count = 2, multiply the previous No_Show with the current No_Show and divide by the threshold
Example: for B_ID = 4, Patient_count = 2, cut_off = (0.3*0.8)/0.2 = 1.2
else if Patient_count = 3, multiply the previous 2 No_Show values with the current No_Show and divide by the threshold
Example: for B_ID = 8, Patient_count = 3, cut_off = (0.4*0.9*0.8)/0.2 = 1.44
And so on.
The Expected Output:
B_ID No_Show Session slot_num Patient_count Cut_off_0.2 Cut_off_0.4
1 0.2 S1 1 1 1 0.5
2 0.3 S1 2 1 1.5 0.75
3 0.8 S1 3 1 4 2
4 0.3 S1 3 2 1.2 0.6
5 0.6 S1 4 1 3 1.5
6 0.8 S1 5 1 4 2
7 0.9 S1 5 2 3.6 1.8
8 0.4 S1 5 3 1.44 0.72
9 0.6 S1 5 4 0.864 0.432
12 0.9 S2 1 1 4.5 2.25
13 0.5 S2 1 2 2.25 1.125
14 0.3 S2 2 1 1.5 0.75
15 0.7 S2 3 1 3.5 1.75
20 0.7 S2 4 1 3.5 1.75
16 0.6 S2 5 1 3 1.5
17 0.8 S2 5 2 2.4 1.2
19 0.3 S2 5 3 0.72 0.36
I tried the code below:
p = [0.2, 0.4]
for i in p:
    df['Cut_off_'+'i'] = df.groupby(['Session','slot_num'])['No_Show'].cumprod().div(i)
Your solution works here with an f-string, using {i} for the new column names:
p = [0.2, 0.4]
for i in p:
    df[f'Cut_off_{i}'] = df.groupby(['Session','slot_num'])['No_Show'].cumprod().div(i)
A numpy solution is also possible: the cumulative product is converted to a numpy array and divided by p, then converted back to a DataFrame and joined to the original.
p = [0.2, 0.4]
arr = df.groupby(['Session','slot_num'])['No_Show'].cumprod().values[:, None] / np.array(p)
df = df.join(pd.DataFrame(arr, columns=p, index=df.index).add_prefix('Cut_off_'))
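As a quick check of the f-string loop, here it is on one slice of the sample data (session S1, slot 5, B_IDs 6 through 9), which reproduces the corresponding rows of the expected output:

```python
import pandas as pd

# Session S1, slot 5 from the question's sample data.
df = pd.DataFrame({'B_ID': [6, 7, 8, 9],
                   'No_Show': [0.8, 0.9, 0.4, 0.6],
                   'Session': ['S1'] * 4,
                   'slot_num': [5] * 4,
                   'Patient_count': [1, 2, 3, 4]})

for i in [0.2, 0.4]:
    # Cumulative product of no-show probabilities within the slot,
    # divided by the threshold i.
    df[f'Cut_off_{i}'] = (df.groupby(['Session', 'slot_num'])['No_Show']
                            .cumprod().div(i))
```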
I have a data frame as shown below
B_ID No_Show Session slot_num Patient_count
1 0.2 S1 1 1
2 0.3 S1 2 1
3 0.8 S1 3 1
4 0.3 S1 3 2
5 0.6 S1 4 1
6 0.8 S1 5 1
7 0.9 S1 5 2
8 0.4 S1 5 3
9 0.6 S1 5 4
12 0.9 S2 1 1
13 0.5 S2 1 2
14 0.3 S2 2 1
15 0.7 S2 3 1
20 0.7 S2 4 1
16 0.6 S2 5 1
17 0.8 S2 5 2
19 0.3 S2 5 3
where
No_Show = Probability of no show
Assume that
threshold probability = 0.2
Duration for each slot = 30 (minutes)
From the above I would like to calculate the data frame below.
Step 1
Sort the dataframe by Session, slot_num and Patient_count:
df = df.sort_values(['Session', 'slot_num', 'Patient_count'], ascending=False)
Step 2
Calculate the cut-off using the conditions below:
if Patient_count = 1, divide No_Show by the threshold probability
Example: for B_ID = 3, Patient_count = 1, cut_off = 0.8/0.2 = 4
else if Patient_count = 2, multiply the previous No_Show with the current No_Show and divide by the threshold
Example: for B_ID = 4, Patient_count = 2, cut_off = (0.3*0.8)/0.2 = 1.2
else if Patient_count = 3, multiply the previous 2 No_Show values with the current No_Show and divide by the threshold
Example: for B_ID = 8, Patient_count = 3, cut_off = (0.4*0.9*0.8)/0.2 = 1.44
And so on.
The Expected Output:
B_ID No_Show Session slot_num Patient_count Cut_off
1 0.2 S1 1 1 1
2 0.3 S1 2 1 1.5
3 0.8 S1 3 1 4
4 0.3 S1 3 2 1.2
5 0.6 S1 4 1 3
6 0.8 S1 5 1 4
7 0.9 S1 5 2 3.6
8 0.4 S1 5 3 1.44
9 0.6 S1 5 4 0.864
12 0.9 S2 1 1 4.5
13 0.5 S2 1 2 2.25
14 0.3 S2 2 1 1.5
15 0.7 S2 3 1 3.5
20 0.7 S2 4 1 3.5
16 0.6 S2 5 1 3
17 0.8 S2 5 2 2.4
19 0.3 S2 5 3 0.72
Use GroupBy.cumprod and divide by the probability with Series.div:
probability = 0.2
df['new'] = df.groupby(['Session','slot_num'])['No_Show'].cumprod().div(probability)
print (df)
B_ID No_Show Session slot_num Patient_count new
0 1 0.2 S1 1 1 1.000
1 2 0.3 S1 2 1 1.500
2 3 0.8 S1 3 1 4.000
3 4 0.3 S1 3 2 1.200
4 5 0.6 S1 4 1 3.000
5 6 0.8 S1 5 1 4.000
6 7 0.9 S1 5 2 3.600
7 8 0.4 S1 5 3 1.440
8 9 0.6 S1 5 4 0.864
9 12 0.9 S2 1 1 4.500
10 13 0.5 S2 1 2 2.250
11 14 0.3 S2 2 1 1.500
12 15 0.7 S2 3 1 3.500
13 20 0.7 S2 4 1 3.500
14 16 0.6 S2 5 1 3.000
15 17 0.8 S2 5 2 2.400
16 19 0.3 S2 5 3 0.720
I have two dataframes: (i) one with a two-level index and a two-level header, and (ii) another with a single index and a single header. The second level of each axis in the first dataframe corresponds to the axis labels of the second dataframe. I need to multiply the dataframes based on that relation between the axes.
Dataframe 1:
Dataframe 2:
Expected result (multiplication by index/header):
Try using pd.DataFrame.mul with the level parameter:
import pandas as pd

df = pd.DataFrame([[9, 10, 2, 1, 6, 5],
                   [4, 0, 3, 4, 6, 6],
                   [9, 3, 9, 1, 2, 3],
                   [3, 5, 9, 3, 9, 0],
                   [4, 4, 8, 5, 10, 5],
                   [5, 3, 1, 8, 5, 6]])
df.columns = pd.MultiIndex.from_arrays([[2020]*3 + [2021]*3, [1, 2, 3, 1, 2, 3]])
df.index = pd.MultiIndex.from_arrays([[1]*3 + [2]*3, [1, 2, 3, 1, 2, 3]])
print(df)
print('\n')
df2 = pd.DataFrame([[.1, .3, .6], [.4, .4, .3], [.5, .4, .1]],
                   index=[1, 2, 3], columns=[1, 2, 3])
print(df2)
print('\n')
df_out = df.mul(df2, level=1)
print(df_out)
Output:
2020 2021
1 2 3 1 2 3
1 1 9 10 2 1 6 5
2 4 0 3 4 6 6
3 9 3 9 1 2 3
2 1 3 5 9 3 9 0
2 4 4 8 5 10 5
3 5 3 1 8 5 6
1 2 3
1 0.1 0.3 0.6
2 0.4 0.4 0.3
3 0.5 0.4 0.1
2020 2021
1 2 3 1 2 3
1 1 0.9 3.0 1.2 0.1 1.8 3.0
2 1.6 0.0 0.9 1.6 2.4 1.8
3 4.5 1.2 0.9 0.5 0.8 0.3
2 1 0.3 1.5 5.4 0.3 2.7 0.0
2 1.6 1.6 2.4 2.0 4.0 1.5
3 2.5 1.2 0.1 4.0 2.0 0.6
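A minimal sketch of the same level-based alignment, on small hypothetical frames rather than the answer's data: with level=1, the flat row/column labels of df2 are matched against level 1 of df's index and columns, so each df cell is scaled by the df2 entry whose labels match:

```python
import pandas as pd

# Outer frame: two-level index and columns; level 1 carries labels 1 and 2.
df = pd.DataFrame([[2, 4], [6, 8]],
                  index=pd.MultiIndex.from_tuples([(10, 1), (10, 2)]),
                  columns=pd.MultiIndex.from_tuples([('a', 1), ('a', 2)]))

# Flat frame keyed by those level-1 labels on both axes.
df2 = pd.DataFrame([[10, 100], [1000, 10000]], index=[1, 2], columns=[1, 2])

# Cell (10, i), ('a', j) of the result is df * df2.loc[i, j].
out = df.mul(df2, level=1)
```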