I have a dataframe with different ids and possibly overlapping times, sampled at a time step of 0.4 seconds. I would like to resample the average speed for each id at a time step of 0.8 seconds.
time id speed
0 0.0 1 0
1 0.4 1 3
2 0.8 1 6
3 1.2 1 9
4 0.8 2 12
5 1.2 2 15
6 1.6 2 18
An example can be created with the following code:
import numpy as np
import pandas as pd

x = np.hstack((np.array([1] * 10), np.array([3] * 15)))
a = np.arange(10) * 0.4
b = np.arange(15) * 0.4 + 2
t = np.hstack((a, b))
df = pd.DataFrame({"time": t, "id": x})
df["speed"] = np.arange(25) * 3
The time column is converted to datetime type by
df["re_time"] = pd.to_datetime(df["time"], unit='s')
Try with groupby:
# each 0.8s block spans two 0.4s rows
block_size = int(0.8 // 0.4)
# number each id's rows, then integer-divide to form consecutive two-row blocks
blocks = df.groupby('id').cumcount() // block_size
# first time and mean speed per (id, block)
df.groupby(['id', blocks]).agg({'time': 'first', 'speed': 'mean'})
Output:
time speed
id
1 0 0.0 1.5
1 0.8 7.5
2 1.6 13.5
3 2.4 19.5
4 3.2 25.5
3 0 2.0 31.5
1 2.8 37.5
2 3.6 43.5
3 4.4 49.5
4 5.2 55.5
5 6.0 61.5
6 6.8 67.5
7 7.6 72.0
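Since the question already builds a datetime column re_time, a resample-based sketch is also possible. Note this bins on fixed 0.8-second wall-clock intervals, so ids whose first sample is not aligned to the 0.8 s grid (id 3 starts at 2.0 s) are grouped differently than with the two-row blocks above:
# per-id resample on the datetime index; bins are fixed 0.8 s intervals
out = (df.set_index("re_time")
         .groupby("id")["speed"]
         .resample("800ms")
         .mean())
print(out)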
I have 2 dataframes
df_1:
Week Day Coeff_1 ... Coeff_n
1 1 12 23
1 2 11 19
1 3 23 68
1 4 57 81
1 5 35 16
1 6 0 0
1 7 0 0
...
50 1 12 23
50 2 11 19
50 3 23 68
50 4 57 81
50 5 35 16
50 6 0 0
50 7 0 0
df_2:
Week Day Coeff_1 ... Coeff_n
1 1 0 0
1 2 0 0
1 3 0 0
1 4 0 0
1 5 0 0
1 6 56 24
1 7 20 10
...
50 1 0 0
50 2 0 0
50 3 0 0
50 4 0 0
50 5 0 0
50 6 10 84
50 7 29 10
In the first dataframe df_1 I have coefficients for Monday to Friday. In the second dataframe df_2 I have coefficients for the weekend. My goal is to merge both dataframes so that the obsolete 0 values are replaced by the real coefficients from the other dataframe.
What is the best approach to do that?
I found that using df.replace seems to be a good approach
Assuming that your dataframes follow the same structure, you can capitalise on pandas functionality to align automatically on indexes. Thus you can replace 0's with np.nan in df1, and then use fillna with df2:
import numpy as np

df1.replace({0: np.nan}, inplace=True)
df1.fillna(df2)
Week Day Coeff_1 Coeff_n
0 1.0 1.0 12.0 23.0
1 1.0 2.0 11.0 19.0
2 1.0 3.0 23.0 68.0
3 1.0 4.0 57.0 81.0
4 1.0 5.0 35.0 16.0
5 1.0 6.0 56.0 24.0
6 1.0 7.0 20.0 10.0
7 50.0 1.0 12.0 23.0
8 50.0 2.0 11.0 19.0
9 50.0 3.0 23.0 68.0
10 50.0 4.0 57.0 81.0
11 50.0 5.0 35.0 16.0
12 50.0 6.0 10.0 84.0
13 50.0 7.0 29.0 10.0
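A non-mutating sketch of the same idea, assuming the frames align on their indexes and 0 never appears as a legitimate coefficient:
# keep df1's values where they are non-zero; take the aligned df2 value elsewhere
df_3 = df1.where(df1 != 0, df2)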
Can't you just take the rows of df_1 where the day is 1-5 and the rows of df_2 where the day is 6-7? Note that DataFrame.append was removed in pandas 2.0, so pd.concat is the way to combine them:
df_3 = pd.concat([df_1[df_1.Day.isin(range(1, 6))], df_2[df_2.Day.isin(range(6, 8))]])
To get a normal sorting, you can sort your values by week and day:
df_3.sort_values(['Week', 'Day'])
I have this dataframe:
value limit_1 limit_2 limit_3 limit_4
10 2 3 7 10
11 5 6 11 13
2 0.3 0.9 2.01 2.99
I want to add another column called class that classifies the value column this way:
if value <= limit_1 then 1
if value > limit_1 and value <= limit_2 then 2
if value > limit_2 and value <= limit_3 then 3
if value > limit_3 then 4
to get this result:
value limit_1 limit_2 limit_3 limit_4 CLASS
10 2 3 7 10 4
11 5 6 11 13 3
2 0.3 0.9 2.01 2.99 3
I know I could get these 'if's to work, but my dataframe has 2 million rows and I need the fastest way to perform such a classification.
I tried to use the pd.cut function but the result was not what I expected/wanted.
Thanks
We can use the rank method over the column axis (axis=1):
# rank each row across all columns with method="first"; since the limits are
# sorted ascending, the rank of the "value" column (the first column) is its class
df["CLASS"] = df.rank(axis=1, method="first").iloc[:, 0].astype(int)
value limit_1 limit_2 limit_3 limit_4 CLASS
0 10 2.0 3.0 7.00 10.00 4
1 11 5.0 6.0 11.00 13.00 3
2 2 0.3 0.9 2.01 2.99 3
We can use np.select:
import numpy as np

# inclusive="right" (pandas >= 1.3) implements "value > limit_k and value <= limit_k+1"
conditions = [
    df["value"] <= df["limit_1"],
    df["value"].between(df["limit_1"], df["limit_2"], inclusive="right"),
    df["value"].between(df["limit_2"], df["limit_3"], inclusive="right"),
    df["value"] > df["limit_3"],
]
df["CLASS"] = np.select(conditions, [1, 2, 3, 4])
>>> df
value limit_1 limit_2 limit_3 limit_4 CLASS
0 10 2.0 3.0 7.00 10.00 4
1 11 5.0 6.0 11.00 13.00 3
2 2 0.3 0.9 2.01 2.99 3
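Since the class is just 1 plus the number of limits strictly below value (the limit columns being sorted ascending within each row), a comparison-and-sum sketch is another fast vectorised option:
import numpy as np

# count how many limits each value strictly exceeds, then add 1
limits = df[["limit_1", "limit_2", "limit_3"]].to_numpy()
df["CLASS"] = (df[["value"]].to_numpy() > limits).sum(axis=1) + 1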
I have a dataframe as shown below.
B_ID No_Show Session slot_num Patient_count
1 0.2 S1 1 1
2 0.3 S1 2 1
3 0.8 S1 3 1
4 0.3 S1 3 2
5 0.6 S1 4 1
6 0.8 S1 5 1
7 0.9 S1 5 2
8 0.4 S1 5 3
9 0.6 S1 5 4
12 0.9 S2 1 1
13 0.5 S2 1 2
14 0.3 S2 2 1
15 0.7 S2 3 1
20 0.7 S2 4 1
16 0.6 S2 5 1
17 0.8 S2 5 2
19 0.3 S2 5 3
where
No_Show = Probability of no show
Assume that
p = [0.2, 0.4] and the duration of each slot = 30 minutes,
where p is the threshold probability.
From the above I would like to calculate the data frame below.
Step 1
Sort the dataframe by Session, slot_num and Patient_count (ascending, so the cumulative product below runs in patient order):
df = df.sort_values(['Session', 'slot_num', 'Patient_count'])
Step 2
Calculate the cut-off using the conditions below:
if Patient_count = 1, divide No_Show by the threshold probability.
Example: for B_ID = 3, Patient_count = 1, cut_off = 0.8/0.2 = 4
else if Patient_count = 2, multiply the previous No_Show by the current No_Show and divide by the threshold.
Example: for B_ID = 4, Patient_count = 2, cut_off = (0.3*0.8)/0.2 = 1.2
else if Patient_count = 3, multiply the previous 2 No_Show values by the current No_Show and divide by the threshold.
Example: for B_ID = 8, Patient_count = 3, cut_off = (0.4*0.9*0.8)/0.2 = 1.44
And so on.
The Expected Output:
B_ID No_Show Session slot_num Patient_count Cut_off_0.2 Cut_off_0.4
1 0.2 S1 1 1 1 0.5
2 0.3 S1 2 1 1.5 0.75
3 0.8 S1 3 1 4 2
4 0.3 S1 3 2 1.2 0.6
5 0.6 S1 4 1 3 1.5
6 0.8 S1 5 1 4 2
7 0.9 S1 5 2 3.6 1.8
8 0.4 S1 5 3 1.44 0.72
9 0.6 S1 5 4 0.864 0.432
12 0.9 S2 1 1 4.5 2.25
13 0.5 S2 1 2 2.25 1.125
14 0.3 S2 2 1 1.5 0.75
15 0.7 S2 3 1 3.5 1.75
20 0.7 S2 4 1 3.5 1.75
16 0.6 S2 5 1 3 1.5
17 0.8 S2 5 2 2.4 1.2
19 0.3 S2 5 3 0.72 0.36
I tried the code below
p = [0.2, 0.4]
for i in p:
    df['Cut_off_'+'i'] = df.groupby(['Session','slot_num'])['No_Show'].cumprod().div(i)
Your solution is possible here with f-strings, using {i} in the new column names:
p = [0.2, 0.4]
for i in p:
    df[f'Cut_off_{i}'] = df.groupby(['Session','slot_num'])['No_Show'].cumprod().div(i)
A solution with numpy is also possible: the grouped cumprod is converted to a numpy array and divided by p, then converted back to a DataFrame and joined to the original.
p = [0.2, 0.4]
arr = df.groupby(['Session','slot_num'])['No_Show'].cumprod().values[:, None] / np.array(p)
df = df.join(pd.DataFrame(arr, columns=p, index=df.index).add_prefix('Cut_off_'))
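To see why cumprod matches the step-by-step rule: within each (Session, slot_num) group, the cut-off for the k-th patient is the product of the first k No_Show values divided by the threshold. A minimal check on the S1, slot 5 group from above:
import pandas as pd

# No_Show values of S1, slot 5 in Patient_count order: B_ID 6, 7, 8, 9
s = pd.Series([0.8, 0.9, 0.4, 0.6])
print(s.cumprod() / 0.2)  # 4.0, 3.6, 1.44, 0.864 -- matches Cut_off_0.2 above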
I have two dataframes: (i) one has a two-level index and a two-level header, and (ii) the other has a single-level index and a single-level header. The second level of each axis in the first dataframe corresponds to the axis of the second dataframe. I need to multiply the dataframes based on that relation between the axes.
Dataframe 1:
Dataframe 2:
Expected result (multiplication by index/header):
Try using pd.DataFrame.mul with the level parameter:
import pandas as pd

df = pd.DataFrame([[9, 10, 2, 1, 6, 5],
                   [4, 0, 3, 4, 6, 6],
                   [9, 3, 9, 1, 2, 3],
                   [3, 5, 9, 3, 9, 0],
                   [4, 4, 8, 5, 10, 5],
                   [5, 3, 1, 8, 5, 6]])
df.columns = pd.MultiIndex.from_arrays([[2020]*3 + [2021]*3, [1, 2, 3, 1, 2, 3]])
df.index = pd.MultiIndex.from_arrays([[1]*3 + [2]*3, [1, 2, 3, 1, 2, 3]])
print(df)

df2 = pd.DataFrame([[.1, .3, .6], [.4, .4, .3], [.5, .4, .1]],
                   index=[1, 2, 3], columns=[1, 2, 3])
print(df2)

# align df2 with the second level (level=1) of df's row and column MultiIndex
df_out = df.mul(df2, level=1)
print(df_out)
Output:
2020 2021
1 2 3 1 2 3
1 1 9 10 2 1 6 5
2 4 0 3 4 6 6
3 9 3 9 1 2 3
2 1 3 5 9 3 9 0
2 4 4 8 5 10 5
3 5 3 1 8 5 6
1 2 3
1 0.1 0.3 0.6
2 0.4 0.4 0.3
3 0.5 0.4 0.1
2020 2021
1 2 3 1 2 3
1 1 0.9 3.0 1.2 0.1 1.8 3.0
2 1.6 0.0 0.9 1.6 2.4 1.8
3 4.5 1.2 0.9 0.5 0.8 0.3
2 1 0.3 1.5 5.4 0.3 2.7 0.0
2 1.6 1.6 2.4 2.0 4.0 1.5
3 2.5 1.2 0.1 4.0 2.0 0.6
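Here level=1 tells pandas to align df2 against the inner level of df's MultiIndex on each axis. Purely as an illustration of that alignment, the same result can be built by expanding df2 to df's shape by hand:
# expand df2 so the cell at row (w, i), column (Y, j) holds df2.loc[i, j]
expanded = pd.DataFrame(
    [[df2.loc[i, j] for (_, j) in df.columns] for (_, i) in df.index],
    index=df.index, columns=df.columns)
print((df * expanded).equals(df_out))  # expected: True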
Is there a way, using pandas functions, to add values/rows at a particular increment?
For example:
This is what I have:
df = pd.DataFrame([1.1,2,2.8])
df
value other1 other2
zebra 0.3 250
bunny 0.7 10
rat 1.0 35
cat 1.1 100
dog 2.0 150
mouse 2.8 125
EDIT 1:
This is what I want, where ideally the inserted rows' indexes are whatever is easiest, but the previous row names are preserved.
df_goal = pd.DataFrame([1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8])
df_goal
value other1 other2
zebra 0.3 250
1 0.4
2 0.5
3 0.6
bunny 0.7 10
5 0.8
6 0.9
rat 1.0 35
cat 1.1 100
1 1.2
2 1.3
3 1.4
4 1.5
5 1.6
6 1.7
7 1.8
8 1.9
dog 2.0 150
10 2.1
11 2.2
12 2.3
13 2.4
14 2.5
15 2.6
16 2.7
mouse 2.8 125
EDIT 2:
Also, I would like to keep the values of the other columns that were there previously; any new rows are simply empty or zero.
I think you can use reindex with an index built by numpy.arange:
# create the index from the value column
df = df.reset_index().set_index('value')

# reindex onto a regular float grid, snapping existing rows to the nearest point
s = 0.1
a = np.arange(df.index.min(), df.index.max() + s, step=s)
df = df.reindex(a, tolerance=s/2., method='nearest')

# replace NaN in the other columns with empty strings
cols = df.columns.difference(['index'])
df[cols] = df[cols].fillna('')

# fill NaN in the 'index' column (the old row labels) with a positional range
s = pd.Series(np.arange(len(df.index)), index=df.index)
df['index'] = df['index'].combine_first(s)

# swap the column back with the index
df = df.reset_index().set_index('index')
print(df)
value other1 other2
index
zebra 0.3 250
1 0.4
2 0.5
3 0.6
bunny 0.7 10
5 0.8
6 0.9
rat 1.0 35
cat 1.1 100
9 1.2
10 1.3
11 1.4
12 1.5
13 1.6
14 1.7
15 1.8
16 1.9
dog 2.0 150
18 2.1
19 2.2
20 2.3
21 2.4
22 2.5
23 2.6
24 2.7
mouse 2.8 125
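One caveat worth noting: np.arange with a float step can drop or duplicate the endpoint because of floating-point accumulation. A safer sketch for building the grid (using the value bounds 0.3 and 2.8 from this example) is np.linspace:
import numpy as np

# float-safe grid 0.3, 0.4, ..., 2.8: compute the point count, then linspace
lo, hi, step = 0.3, 2.8, 0.1
a = np.linspace(lo, hi, int(round((hi - lo) / step)) + 1)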