How to calculate a rolling correlation coefficient between 2 columns in a pandas dataframe with groupby? - pandas

I have a dataframe:
df = pd.DataFrame({'group':['A','A','A','A','A','A','A','A','B','B','B','B','B','B','B','B'],'val1':[100,200,300,400,50,150,250,350,50,150,250,350,100,200,300,475],'val2':[3,5,10,-3,2,-5,89,12,35,5,10,-3,2,-5,89,12]})
I want to calculate the correlation coefficient between columns 'val1' and 'val2' with a rolling window of 5, within each group, and add the result as a column to the dataframe. I'm able to do this without using a groupby:
df['val1'].rolling(5).corr(df['val2'])
But I'm not able to incorporate the same with a groupby.
Output I'm looking for is a column added to the original df like this:
group  Val1  Val2  Correlation
A      100   3     NaN
A      200   5     NaN
A      300   10    NaN
A      400   -3    NaN
A      50    2     0.1
A      150   -5    -0.25
A      250   89    0.8
A      350   12    0.65
B      50    35    NaN
B      150   5     NaN
B      250   10    NaN
B      350   -3    NaN
B      100   2     -0.43
B      200   -5    0.23
B      475   89    0.87
B      100   12    0.65

You can use .groupby() to group by the group column. The result will be 2 groups, each spanning all rows (even rows not belonging to the group). Then combine the results of the different groups by aggregating with GroupBy.max() on the original row index (level 1), as follows:
df['Correlation'] = df.groupby('group')['val1'].rolling(5).corr(df['val2']).groupby(level=1).max()
Result:
print(df)
group val1 val2 Correlation
0 A 100 3 NaN
1 A 200 5 NaN
2 A 300 10 NaN
3 A 400 -3 NaN
4 A 50 2 -0.136808
5 A 150 -5 0.051931
6 A 250 89 0.093510
7 A 350 12 0.079207
8 B 50 35 NaN
9 B 150 5 NaN
10 B 250 10 NaN
11 B 350 -3 NaN
12 B 100 2 -0.652637
13 B 200 -5 -0.210248
14 B 300 89 0.328695
15 B 475 12 0.152914
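As an alternative sketch (not the method used above), the rolling correlation can be computed on each group's rows separately and the pieces concatenated, which avoids relying on how the grouped rolling object aligns the other Series. The name parts is introduced here only for illustration; the data is the dataframe from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A'] * 8 + ['B'] * 8,
    'val1': [100, 200, 300, 400, 50, 150, 250, 350,
             50, 150, 250, 350, 100, 200, 300, 475],
    'val2': [3, 5, 10, -3, 2, -5, 89, 12,
             35, 5, 10, -3, 2, -5, 89, 12],
})

# Compute the rolling correlation inside each group, one group at a time.
# `parts` is a hypothetical helper name, not part of the original answer.
parts = [g['val1'].rolling(5).corr(g['val2']) for _, g in df.groupby('group')]

# Each piece keeps its original row index, so the assignment aligns by index.
df['Correlation'] = pd.concat(parts)
print(df)
```

Because every window only ever sees rows from a single group, the first four rows of each group stay NaN.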

Related

python rolling product on non-adjacent row

I would like to calculate a rolling product over non-adjacent rows, such as the product of the values in every fifth row, as shown in the photo (the result in the blue cell is the product of the data in the blue cells, and so on).
The best I can do for now is the following:
temp = pd.DataFrame([range(20)]).transpose()
df = temp.copy()
df['shift1'] = temp.shift(5)
df['shift2'] = temp.shift(10)
df['shift3'] = temp.shift(15)
result = df.product(axis=1)
However, this looks cumbersome, as I want to change the row step dynamically.
Can anyone tell me if there is a better way to do this?
Thank you
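You can use groupby.cumprod/groupby.prod with the row position modulo 5 as grouper (this assumes the values are in a column named 'data'):
import numpy as np
import pandas as pd
# row positions modulo 5, wrapped in a Series so that .duplicated() is available
m = pd.Series(np.arange(len(df)) % 5, index=df.index)
# option 1: running product within each modulo group
df['result'] = df.groupby(m)['data'].cumprod()
# option 2: keep only the final product, on the last row of each modulo group
df.loc[~m.duplicated(keep='last'), 'result2'] = df.groupby(m)['data'].cumprod()
# or
# df.loc[~m.duplicated(keep='last'),
# 'result2'] = df.groupby(m)['data'].prod().to_numpy()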
output:
data result result2
0 0 0 NaN
1 1 1 NaN
2 2 2 NaN
3 3 3 NaN
4 4 4 NaN
5 5 0 NaN
6 6 6 NaN
7 7 14 NaN
8 8 24 NaN
9 9 36 NaN
10 10 0 NaN
11 11 66 NaN
12 12 168 NaN
13 13 312 NaN
14 14 504 NaN
15 15 0 0.0
16 16 1056 1056.0
17 17 2856 2856.0
18 18 5616 5616.0
19 19 9576 9576.0
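Since the question asks for a step that can be changed dynamically, the same idea can be wrapped in a small function. This is only a sketch: strided_cumprod and its step parameter are hypothetical names, not part of pandas.

```python
import numpy as np
import pandas as pd

def strided_cumprod(s, step):
    """Running product over every `step`-th row of the Series `s`."""
    # Rows whose positions share the same remainder form one group.
    pos = np.arange(len(s)) % step
    return s.groupby(pos).cumprod()

data = pd.Series(range(20))
every_fifth = strided_cumprod(data, 5)
every_tenth = strided_cumprod(data, 10)
print(every_fifth.tail())
```

With step=5 this reproduces the result column above; changing the step only changes the argument.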

Any function/attribute in dataframe similar to attribute 'remove' or 'pop'

Is there any attribute/function for a dataframe, similar to the 'remove' attribute in series, that removes the first occurrence of a repeated index value in a dataframe?
Dataframe:
a b c d
100 1 2 3 NaN
200 4 5 6 NaN
100 7 9 10 NaN
Desired output:(after the desired command)
a b c d
200 4 5 6 NaN
100 7 9 10 NaN
Try boolean indexing with Index.duplicated and keep='last':
>>> df[~df.index.duplicated(keep='last')]
a b c d
200 4 5 6 NaN
100 7 9 10 NaN
>>>
Edit: to select the dropped rows (the earlier occurrences of a repeated index value) instead:
df.iloc[np.where(df.index.duplicated(keep='last'))]
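A self-contained sketch of the approach, rebuilding the dataframe from the question (the variable name deduped is introduced here for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'a': [1, 4, 7], 'b': [2, 5, 9], 'c': [3, 6, 10], 'd': [np.nan] * 3},
    index=[100, 200, 100],
)

# duplicated(keep='last') marks every occurrence of an index value
# except the last one; negating the mask keeps only the last occurrence.
deduped = df[~df.index.duplicated(keep='last')]
print(deduped)
```

Note that this drops every earlier occurrence of a repeated index value, not just the first one, which is the same thing when a value appears at most twice.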

Groupby by sort based on date time, groupby sequence based on 'ID' and Date and then mean by sequence

I am new to pandas.
I have a DF as shown below, which contains repair data for mobile phones.
ID Status Date Cost
0 1 F 22-Jun-17 500
1 1 M 22-Jul-17 100
2 2 M 29-Jun-17 200
3 3 M 20-Mar-17 300
4 4 M 10-Aug-17 800
5 2 F 29-Sep-17 600
6 2 F 29-Jan-18 500
7 1 F 22-Jun-18 600
8 3 F 20-Jun-18 700
9 1 M 22-Aug-18 150
10 1 F 22-Mar-19 750
11 3 M 20-Oct-18 250
12 4 F 10-Jun-18 100
I am trying to find, for each ID, the duration between each status and the previous one,
and then the mean duration for each status sequence for that ID.
My expected output is shown below.
ID S1 S1_Dur S2 S2_dur S3 S3_dur S4 S4_dur Avg_MF Avg_FM
0 1 F-M 30 M-F 335.00 F-M 61.00 M-F 750.00 542.50 45.50
1 2 M-F 92 F-F 122.00 NaN nan NaN nan 92.00 nan
2 3 M-F 457 F-M 122.00 NaN nan NaN nan 457.00 122.00
3 4 M-F 304 NaN nan NaN nan NaN nan 304.00 nan
S1 = first sequence
S1_Dur = S1 Duration
Avg_MF = Average M-F Duration
Avg_FM = Average F-M Duration
I tried the following code:
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
df = df.reset_index().sort_values(['ID', 'Date', 'Status']).set_index(['ID', 'Status'])
df['Difference'] = df.groupby('ID')['Date'].transform(pd.Series.diff)
df.reset_index(inplace=True)
Then I got a DF as shown below
ID Status index Date Cost Difference
0 1 F 0 2017-06-22 500 NaT
1 1 M 1 2017-07-22 100 30 days
2 1 F 7 2018-06-22 600 335 days
3 1 M 9 2018-08-22 150 61 days
4 1 F 10 2019-03-22 750 212 days
5 2 M 2 2017-06-29 200 NaT
6 2 F 5 2017-09-29 600 92 days
7 2 F 6 2018-01-29 500 122 days
8 3 M 3 2017-03-20 300 NaT
9 3 F 8 2018-06-20 700 457 days
10 3 M 11 2018-10-20 250 122 days
11 4 M 4 2017-08-10 800 NaT
12 4 F 12 2018-06-10 100 304 days
After that I am stuck.
The idea is to create a new column for the date differences with DataFrameGroupBy.diff, and a column S by joining the shifted values of Status (DataFrameGroupBy.shift) with the current Status. Remove the rows with missing values in the S column. Then reshape with DataFrame.unstack, using GroupBy.cumcount for a counter column, compute the means per pair in S with DataFrame.pivot_table, and finally use DataFrame.join:
df['Date'] = pd.to_datetime(df['Date'], format='%d-%b-%y')
df = df.sort_values(['ID', 'Date', 'Status'])
df['D'] = df.groupby('ID')['Date'].diff().dt.days
df['S'] = df.groupby('ID')['Status'].shift() + '-'+ df['Status']
df = df.dropna(subset=['S'])
df['g'] = df.groupby('ID').cumcount().add(1).astype(str)
df1 = df.pivot_table(index='ID', columns='S', values='D', aggfunc='mean').add_prefix('Avg_')
df2 = df.set_index(['ID', 'g'])[['S','D']].unstack().sort_index(axis=1, level=1)
df2.columns = df2.columns.map('_'.join)
df3 = df2.join(df1).reset_index()
print (df3)
ID D_1 S_1 D_2 S_2 D_3 S_3 D_4 S_4 Avg_F-F Avg_F-M \
0 1 30.0 F-M 335.0 M-F 61.0 F-M 212.0 M-F NaN 45.5
1 2 92.0 M-F 122.0 F-F NaN NaN NaN NaN 122.0 NaN
2 3 457.0 M-F 122.0 F-M NaN NaN NaN NaN NaN 122.0
3 4 304.0 M-F NaN NaN NaN NaN NaN NaN NaN NaN
Avg_M-F
0 273.5
1 92.0
2 457.0
3 304.0
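The first two steps (per-ID date differences and status transitions) can be checked in isolation on a small subset of the question's data. This is only a sketch using the rows for IDs 1 and 2 from 2017 and early 2018; it assumes an English locale for the %b month names.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 2],
    'Status': ['F', 'M', 'M', 'F', 'F'],
    'Date': ['22-Jun-17', '22-Jul-17', '29-Jun-17', '29-Sep-17', '29-Jan-18'],
})
df['Date'] = pd.to_datetime(df['Date'], format='%d-%b-%y')
df = df.sort_values(['ID', 'Date'])

# Days since the previous record of the same ID (NaN for the first record).
df['D'] = df.groupby('ID')['Date'].diff().dt.days
# Transition label: previous status of the same ID joined to the current one.
df['S'] = df.groupby('ID')['Status'].shift() + '-' + df['Status']
print(df)
```

The first row of each ID has no predecessor, so both D and S are missing there, which is why the answer drops those rows before reshaping.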

How to map missing values of a df's column according to another column's values (of the same df) using a dictionary? Python

I managed to solve this using if statements and for loops, but I'm looking for a less computationally expensive way to do it, e.g. using apply, map, or any other technique.
d = {1:10, 2:20, 3:30}
df
a b
1 35
1 nan
1 nan
2 nan
2 47
2 nan
3 56
3 nan
I want to fill missing values of column b according to dict d, i.e. output should be
a b
1 35
1 10
1 10
2 20
2 47
2 20
3 56
3 30
You can use fillna or combine_first with column a mapped through the dict:
print (df['a'].map(d))
0 10
1 10
2 10
3 20
4 20
5 20
6 30
7 30
Name: a, dtype: int64
df['b'] = df['b'].fillna(df['a'].map(d))
print (df)
a b
0 1 35.0
1 1 10.0
2 1 10.0
3 2 20.0
4 2 47.0
5 2 20.0
6 3 56.0
7 3 30.0
df['b'] = df['b'].combine_first(df['a'].map(d))
print (df)
a b
0 1 35.0
1 1 10.0
2 1 10.0
3 2 20.0
4 2 47.0
5 2 20.0
6 3 56.0
7 3 30.0
And if all values are ints add astype:
df['b'] = df['b'].fillna(df['a'].map(d)).astype(int)
print (df)
a b
0 1 35
1 1 10
2 1 10
3 2 20
4 2 47
5 2 20
6 3 56
7 3 30
If all values in column a are keys of the dict, then it is also possible to use replace:
df['b'] = df['b'].fillna(df['a'].replace(d))
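For reference, a self-contained sketch of the fillna variant, rebuilding the question's data (the name filled is introduced here for illustration):

```python
import numpy as np
import pandas as pd

d = {1: 10, 2: 20, 3: 30}
df = pd.DataFrame({
    'a': [1, 1, 1, 2, 2, 2, 3, 3],
    'b': [35, np.nan, np.nan, np.nan, 47, np.nan, 56, np.nan],
})

# Map column a through the dict, then use the mapped values
# only where b is missing; existing values in b are left untouched.
filled = df['b'].fillna(df['a'].map(d))
df['b'] = filled.astype(int)
print(df)
```

The astype(int) step is safe here only because no missing values remain after the fill.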

based on a value in column A, shift the values in columns C and D to the right in a pandas dataframe

How can I achieve the desired result from the following dataset?
A B C D E
1 apple 5 2 20 NaN
2 orange 2 6 30 NaN
3 apple 6 1 40 NaN
4 apple 10 3 50 NaN
5 banana 8 9 60 NaN
Desired Result :
A B C D E
1 apple 5 NaN 2 20
2 orange 2 6 30 NaN
3 apple 6 NaN 1 40
4 apple 10 NaN 3 50
5 banana 8 9 60 NaN
IIUC you can use np.roll on the rows of interest, here we need to select only the rows where 'A' is 'apple' and then roll these by a single column row-wise and assign back:
In [14]:
df.loc[df['A']=='apple', 'C':] = np.roll(df.loc[df['A']=='apple', 'C':], 1,axis=1)
df
Out[14]:
A B C D E
1 apple 5 NaN 2 20.0
2 orange 2 6.0 30 NaN
3 apple 6 NaN 1 40.0
4 apple 10 NaN 3 50.0
5 banana 8 9.0 60 NaN
Note that because you introduce NaN values into columns C and D, their dtype changes to float to allow this.
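A self-contained sketch of the same technique on a reduced version of the data. It assumes columns C to E are already stored as floats, so the assignment does not have to change any dtype; mask and cols are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['apple', 'orange', 'apple'],
    'B': [5, 2, 6],
    'C': [2.0, 6.0, 1.0],
    'D': [20.0, 30.0, 40.0],
    'E': [np.nan, np.nan, np.nan],
})

mask = df['A'] == 'apple'
cols = ['C', 'D', 'E']

# Shift the selected rows one position to the right; np.roll wraps the
# trailing NaN in E around to the front, so it ends up in C.
df.loc[mask, cols] = np.roll(df.loc[mask, cols].to_numpy(), 1, axis=1)
print(df)
```

This only works as a shift because the last column is empty for the selected rows; a non-missing value in E would wrap around into C.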