I have a DataFrame:
df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
    'val1': [100, 200, 300, 400, 50, 150, 250, 350, 50, 150, 250, 350, 100, 200, 300, 475],
    'val2': [3, 5, 10, -3, 2, -5, 89, 12, 35, 5, 10, -3, 2, -5, 89, 12]
})
I want to calculate the correlation coefficient between columns 'val1' and 'val2' with a rolling window of 5, within each group, and add the result as a column to the DataFrame. I'm able to do this without a groupby:
df['val1'].rolling(5).corr(df['val2'])
But I'm not able to do the same with a groupby.
The output I'm looking for is a column added to the original df, like this:
group  val1  val2  Correlation
A       100     3          NaN
A       200     5          NaN
A       300    10          NaN
A       400    -3          NaN
A        50     2         0.10
A       150    -5        -0.25
A       250    89         0.80
A       350    12         0.65
B        50    35          NaN
B       150     5          NaN
B       250    10          NaN
B       350    -3          NaN
B       100     2        -0.43
B       200    -5         0.23
B       475    89         0.87
B       100    12         0.65
You can use .groupby() to group by column group. The result will contain 2 groups, each with entries for all rows (rows not belonging to the group come out as NaN, since the correlation is taken against the full df['val2']). Then combine the results of the different groups by aggregating with .max() over the original row index, as follows:
df['Correlation'] = (
    df.groupby('group')['val1']
      .rolling(5)
      .corr(df['val2'])
      .groupby(level=1)
      .max()
)
Result:
print(df)
group val1 val2 Correlation
0 A 100 3 NaN
1 A 200 5 NaN
2 A 300 10 NaN
3 A 400 -3 NaN
4 A 50 2 -0.136808
5 A 150 -5 0.051931
6 A 250 89 0.093510
7 A 350 12 0.079207
8 B 50 35 NaN
9 B 150 5 NaN
10 B 250 10 NaN
11 B 350 -3 NaN
12 B 100 2 -0.652637
13 B 200 -5 -0.210248
14 B 300 89 0.328695
15 B 475 12 0.152914
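Alternatively, the same result can be computed entirely within each group using GroupBy.apply. This is a minimal sketch of that approach (not the method above), assuming the df defined in the question:
corr = (
    df.groupby('group')
      .apply(lambda g: g['val1'].rolling(5).corr(g['val2']))
      .reset_index(level=0, drop=True)  # drop the added 'group' level to realign with df's row index
)
df['Correlation'] = corr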
I am new to pandas.
I have a DataFrame, shown below, containing mobile repair data.
ID Status Date Cost
0 1 F 22-Jun-17 500
1 1 M 22-Jul-17 100
2 2 M 29-Jun-17 200
3 3 M 20-Mar-17 300
4 4 M 10-Aug-17 800
5 2 F 29-Sep-17 600
6 2 F 29-Jan-18 500
7 1 F 22-Jun-18 600
8 3 F 20-Jun-18 700
9 1 M 22-Aug-18 150
10 1 F 22-Mar-19 750
11 3 M 20-Oct-18 250
12 4 F 10-Jun-18 100
I am trying to find the duration between consecutive statuses for each ID, and then the mean duration of each status sequence (e.g. F-M, M-F) for that ID.
My expected output is shown below.
ID S1 S1_Dur S2 S2_dur S3 S3_dur S4 S4_dur Avg_MF Avg_FM
0 1 F-M 30 M-F 335.00 F-M 61.00 M-F 750.00 542.50 45.50
1 2 M-F 92 F-F 122.00 NaN nan NaN nan 92.00 nan
2 3 M-F 457 F-M 122.00 NaN nan NaN nan 457.00 122.00
3 4 M-F 304 NaN nan NaN nan NaN nan 304.00 nan
S1 = first sequence
S1_Dur = S1 duration
Avg_MF = average M-F duration
Avg_FM = average F-M duration
I tried the following code:
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
df = df.reset_index().sort_values(['ID', 'Date', 'Status']).set_index(['ID', 'Status'])
df['Difference'] = df.groupby('ID')['Date'].transform(pd.Series.diff)
df.reset_index(inplace=True)
Then I got a DataFrame as shown below:
ID Status index Date Cost Difference
0 1 F 0 2017-06-22 500 NaT
1 1 M 1 2017-07-22 100 30 days
2 1 F 7 2018-06-22 600 335 days
3 1 M 9 2018-08-22 150 61 days
4 1 F 10 2019-03-22 750 212 days
5 2 M 2 2017-06-29 200 NaT
6 2 F 5 2017-09-29 600 92 days
7 2 F 6 2018-01-29 500 122 days
8 3 M 3 2017-03-20 300 NaT
9 3 F 8 2018-06-20 700 457 days
10 3 M 11 2018-10-20 250 122 days
11 4 M 4 2017-08-10 800 NaT
12 4 F 12 2018-06-10 100 304 days
After that I am stuck.
The idea is to create a new duration column with DataFrameGroupBy.diff, and a status-pair column by joining each Status to its shifted value from DataFrameGroupBy.shift. Remove rows with missing values in the S column. Then reshape with DataFrame.unstack, using GroupBy.cumcount for a counter column, create the means per S pair with DataFrame.pivot_table, and finally combine everything with DataFrame.join:
df['Date'] = pd.to_datetime(df['Date'], format='%d-%b-%y')
df = df.sort_values(['ID', 'Date', 'Status'])

# days between consecutive repairs per ID, and the status transition pair
df['D'] = df.groupby('ID')['Date'].diff().dt.days
df['S'] = df.groupby('ID')['Status'].shift() + '-' + df['Status']
df = df.dropna(subset=['S'])

# position of each transition within its ID (1, 2, 3, ...)
df['g'] = df.groupby('ID').cumcount().add(1).astype(str)

# mean duration per status pair and ID
df1 = df.pivot_table(index='ID', columns='S', values='D', aggfunc='mean').add_prefix('Avg_')

# wide table of sequences and durations per ID
df2 = df.set_index(['ID', 'g'])[['S', 'D']].unstack().sort_index(axis=1, level=1)
df2.columns = df2.columns.map('_'.join)

df3 = df2.join(df1).reset_index()
print(df3)
ID D_1 S_1 D_2 S_2 D_3 S_3 D_4 S_4 Avg_F-F Avg_F-M \
0 1 30.0 F-M 335.0 M-F 61.0 F-M 212.0 M-F NaN 45.5
1 2 92.0 M-F 122.0 F-F NaN NaN NaN NaN 122.0 NaN
2 3 457.0 M-F 122.0 F-M NaN NaN NaN NaN NaN 122.0
3 4 304.0 M-F NaN NaN NaN NaN NaN NaN NaN NaN
Avg_M-F
0 273.5
1 92.0
2 457.0
3 304.0
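If the exact column names from the question (S1, S1_Dur, Avg_MF, ...) are wanted, the columns can be renamed afterwards. A small sketch; the to_question_name helper is hypothetical, not part of the solution above:
import re

def to_question_name(c):
    # hypothetical mapping: 'S_1' -> 'S1', 'D_1' -> 'S1_Dur', 'Avg_M-F' -> 'Avg_MF'
    if m := re.fullmatch(r'S_(\d+)', c):
        return f'S{m.group(1)}'
    if m := re.fullmatch(r'D_(\d+)', c):
        return f'S{m.group(1)}_Dur'
    return c.replace('-', '')

df3 = df3.rename(columns=to_question_name)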
The question
Given a Series s and DataFrame df, how do I operate on each column of df with s?
df = pd.DataFrame(
[[1, 2, 3], [4, 5, 6]],
index=[0, 1],
columns=['a', 'b', 'c']
)
s = pd.Series([3, 14], index=[0, 1])
When I attempt to add them, I get all np.nan
df + s
a b c 0 1
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
What I thought I should get is
a b c
0 4 5 6
1 18 19 20
Objective and motivation
I've seen this kind of question several times over and have seen many other questions that involve some element of this. Most recently, I had to spend a bit of time explaining this concept in comments while looking for an appropriate canonical Q&A. I did not find one and so I thought I'd write one.
These questions usually arise with respect to a specific operation, but they apply equally to most arithmetic operations:
How do I subtract a Series from every column in a DataFrame?
How do I add a Series to every column in a DataFrame?
How do I multiply every column in a DataFrame by a Series?
How do I divide every column in a DataFrame by a Series?
It is helpful to create a mental model of what Series and DataFrame objects are.
Anatomy of a Series
A Series should be thought of as an enhanced dictionary. This isn't always a perfect analogy, but we'll start here. Also, there are other analogies that you can make, but I am targeting a dictionary in order to demonstrate the purpose of this post.
index
These are the keys that we can reference to get at the corresponding values. When the elements of the index are unique, the comparison to a dictionary becomes very close.
values
These are the corresponding values that are keyed by the index.
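A tiny illustration of the analogy, using a toy Series built just for demonstration:
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s.index)   # Index(['a', 'b', 'c'], dtype='object') -- the "keys"
print(s.values)  # array([10, 20, 30]) -- the corresponding values
print(s['b'])    # 20 -- lookup by key, much like a dict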
Anatomy of a DataFrame
A DataFrame should be thought of as a dictionary of Series or a Series of Series. In this case the keys are the column names and the values are the columns themselves as Series objects. Each Series agrees to share the same index which is the index of the DataFrame.
columns
These are the keys that we can reference to get at the corresponding Series.
index
This is the index that all of the Series values agree to share.
Note regarding columns and index objects: they are the same kind of object. One DataFrame's index can be used as another DataFrame's columns. In fact, this happens when you do df.T to get a transpose.
values
This is a two-dimensional array that contains the data in a DataFrame. The reality is that values is not what is stored inside the DataFrame object. (Well, sometimes it is, but I'm not about to try to describe the block manager). The point is, it is better to think of this as access to a two-dimensional array of the data.
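A quick sketch of these pieces, again with a toy frame built for demonstration:
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(df.columns)                     # Index(['a', 'b'], dtype='object') -- keys to the column Series
print(df['a'])                        # one column, a Series sharing df.index
print(df.values)                      # 2-D array of the data
print(df.T.index.equals(df.columns))  # True -- a transpose swaps index and columns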
Define Sample Data
These are sample pandas.Index objects that can be used as the index of a Series or DataFrame or can be used as the columns of a DataFrame:
import numpy as np
import pandas as pd

idx_lower = pd.Index([*'abcde'], name='lower')
idx_range = pd.RangeIndex(5, name='range')
These are sample pandas.Series objects that use the pandas.Index objects above:
s0 = pd.Series(range(10, 15), idx_lower)
s1 = pd.Series(range(30, 40, 2), idx_lower)
s2 = pd.Series(range(50, 10, -8), idx_range)
These are sample pandas.DataFrame objects that use the pandas.Index objects above:
df0 = pd.DataFrame(100, index=idx_range, columns=idx_lower)
df1 = pd.DataFrame(
    np.arange(np.prod(df0.shape)).reshape(df0.shape),
    index=idx_range, columns=idx_lower
)
Series on Series
When operating on two Series, the alignment is obvious. You align the index of one Series with the index of the other.
s1 + s0
lower
a 40
b 43
c 46
d 49
e 52
dtype: int64
The result is the same if I randomly shuffle one Series before operating; the indices still align:
s1 + s0.sample(frac=1)
lower
a 40
b 43
c 46
d 49
e 52
dtype: int64
That is not the case when I instead operate with the values of the shuffled Series. Here, Pandas has no index to align with and therefore operates by position:
s1 + s0.sample(frac=1).values
lower
a 42
b 42
c 47
d 50
e 49
dtype: int64
Add a scalar
s1 + 1
lower
a 31
b 33
c 35
d 37
e 39
dtype: int64
DataFrame on DataFrame
The same is true when operating on two DataFrames. The alignment is obvious and does what we think it should do:
df0 + df1
lower a b c d e
range
0 100 101 102 103 104
1 105 106 107 108 109
2 110 111 112 113 114
3 115 116 117 118 119
4 120 121 122 123 124
If I shuffle the second DataFrame on both axes, the index and columns still align and give us the same thing:
df0 + df1.sample(frac=1).sample(frac=1, axis=1)
lower a b c d e
range
0 100 101 102 103 104
1 105 106 107 108 109
2 110 111 112 113 114
3 115 116 117 118 119
4 120 121 122 123 124
With the same shuffling, but adding the underlying array instead of the DataFrame, there is nothing to align, and we get different results:
df0 + df1.sample(frac=1).sample(frac=1, axis=1).values
lower a b c d e
range
0 123 124 121 122 120
1 118 119 116 117 115
2 108 109 106 107 105
3 103 104 101 102 100
4 113 114 111 112 110
Add a one-dimensional array. It will align with columns and broadcast across rows.
df0 + [*range(2, df0.shape[1] + 2)]
lower a b c d e
range
0 102 103 104 105 106
1 102 103 104 105 106
2 102 103 104 105 106
3 102 103 104 105 106
4 102 103 104 105 106
Add a scalar. There isn't anything to align with, so it broadcasts to everything:
df0 + 1
lower a b c d e
range
0 101 101 101 101 101
1 101 101 101 101 101
2 101 101 101 101 101
3 101 101 101 101 101
4 101 101 101 101 101
DataFrame on Series
If DataFrames are to be thought of as dictionaries of Series, and Series are to be thought of as dictionaries of values, then it is natural that, when operating between a DataFrame and a Series, they should be aligned by their "keys".
s0:
lower a b c d e
10 11 12 13 14
df0:
lower a b c d e
range
0 100 100 100 100 100
1 100 100 100 100 100
2 100 100 100 100 100
3 100 100 100 100 100
4 100 100 100 100 100
And when we operate, the 10 in s0['a'] gets added to the entire column of df0['a']:
df0 + s0
lower a b c d e
range
0 110 111 112 113 114
1 110 111 112 113 114
2 110 111 112 113 114
3 110 111 112 113 114
4 110 111 112 113 114
The heart of the issue and point of the post
What if I want to operate on s2 and df0?
s2: df0:
| lower a b c d e
range | range
0 50 | 0 100 100 100 100 100
1 42 | 1 100 100 100 100 100
2 34 | 2 100 100 100 100 100
3 26 | 3 100 100 100 100 100
4 18 | 4 100 100 100 100 100
When I operate, I get all np.nan, as cited in the question:
df0 + s2
a b c d e 0 1 2 3 4
range
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
This does not produce what we wanted, because Pandas is aligning the index of s2 with the columns of df0. The columns of the result are the union of the index of s2 and the columns of df0.
We could fake it out with a tricky transposition:
(df0.T + s2).T
lower a b c d e
range
0 150 150 150 150 150
1 142 142 142 142 142
2 134 134 134 134 134
3 126 126 126 126 126
4 118 118 118 118 118
But it turns out Pandas has a better solution. There are operation methods that allow us to pass an axis argument to specify the axis to align with.
-  ->  sub
+  ->  add
*  ->  mul
/  ->  div
** ->  pow
And so the answer is simply:
df0.add(s2, axis='index')
lower a b c d e
range
0 150 150 150 150 150
1 142 142 142 142 142
2 134 134 134 134 134
3 126 126 126 126 126
4 118 118 118 118 118
It turns out that axis='index' is synonymous with axis=0, just as axis='columns' is synonymous with axis=1:
df0.add(s2, axis=0)
lower a b c d e
range
0 150 150 150 150 150
1 142 142 142 142 142
2 134 134 134 134 134
3 126 126 126 126 126
4 118 118 118 118 118
The rest of the operations
df0.sub(s2, axis=0)
lower a b c d e
range
0 50 50 50 50 50
1 58 58 58 58 58
2 66 66 66 66 66
3 74 74 74 74 74
4 82 82 82 82 82
df0.mul(s2, axis=0)
lower a b c d e
range
0 5000 5000 5000 5000 5000
1 4200 4200 4200 4200 4200
2 3400 3400 3400 3400 3400
3 2600 2600 2600 2600 2600
4 1800 1800 1800 1800 1800
df0.div(s2, axis=0)
lower a b c d e
range
0 2.000000 2.000000 2.000000 2.000000 2.000000
1 2.380952 2.380952 2.380952 2.380952 2.380952
2 2.941176 2.941176 2.941176 2.941176 2.941176
3 3.846154 3.846154 3.846154 3.846154 3.846154
4 5.555556 5.555556 5.555556 5.555556 5.555556
df0.pow(1 / s2, axis=0)
lower a b c d e
range
0 1.096478 1.096478 1.096478 1.096478 1.096478
1 1.115884 1.115884 1.115884 1.115884 1.115884
2 1.145048 1.145048 1.145048 1.145048 1.145048
3 1.193777 1.193777 1.193777 1.193777 1.193777
4 1.291550 1.291550 1.291550 1.291550 1.291550
It was important to address those higher-level concepts first; since my motivation is to share knowledge and teach, I wanted to make this as clear as possible.
I prefer the method mentioned by piRSquared (i.e., df.add(s, axis=0)), but another method uses apply together with a lambda to perform an action on each column in the DataFrame:
>>> df.apply(lambda col: col + s)
a b c
0 4 5 6
1 18 19 20
To apply the lambda function to the rows, use axis=1:
>>> df.T.apply(lambda row: row + s, axis=1)
0 1
a 4 18
b 5 19
c 6 20
This method could be useful when the transformation is more complex, e.g.:
df.apply(lambda col: 0.5 * col ** 2 + 2 * s - 3)
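Note that apply loops over the columns at the Python level; for arithmetic like this, the vectorized method form is usually preferable. A minimal equivalent sketch, restating the question's toy df and s:
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=[0, 1], columns=['a', 'b', 'c'])
s = pd.Series([3, 14], index=[0, 1])

# Same result as df.apply(lambda col: 0.5 * col ** 2 + 2 * s - 3),
# but vectorized: (2 * s - 3) is aligned with the row index via axis=0.
print((0.5 * df ** 2).add(2 * s - 3, axis=0))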
Just to add an extra layer from my own experience, extending what others have done here: this shows how to operate on a DataFrame with a Series when the DataFrame has extra columns whose values you want to keep. Below is a short demonstration of the process.
import pandas as pd

d = [1.056323, 0.126681,
     0.142588, 0.254143,
     0.15561, 0.139571,
     0.102893, 0.052411]

# a Series of multipliers, keyed by column name
s = pd.Series(d, index=['const', '426', '428', '424', '425', '423', '427', '636'])
print(s)
const 1.056323
426 0.126681
428 0.142588
424 0.254143
425 0.155610
423 0.139571
427 0.102893
636 0.052411
d2 = {
    'loc': ['D', 'D', 'E', 'E', 'F', 'F', 'G', 'G', 'E', 'D'],
    '426': [9, 2, 3, 2, 4, 0, 2, 7, 2, 8],
    '428': [2, 4, 1, 0, 2, 1, 3, 0, 7, 8],
    '424': [1, 10, 5, 8, 2, 7, 10, 0, 3, 5],
    '425': [9, 2, 6, 8, 9, 1, 7, 3, 8, 6],
    '423': [2, 7, 3, 10, 8, 1, 2, 9, 3, 9],
    '427': [4, 10, 4, 0, 8, 3, 1, 5, 7, 7],
    '636': [10, 5, 6, 4, 0, 5, 1, 1, 4, 8],
    'seq': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
}
df2 = pd.DataFrame(d2)
print(df2)
loc 426 428 424 425 423 427 636 seq
0 D 9 2 1 9 2 4 10 1
1 D 2 4 10 2 7 10 5 1
2 E 3 1 5 6 3 4 6 1
3 E 2 0 8 8 10 0 4 1
4 F 4 2 2 9 8 8 0 1
5 F 0 1 7 1 1 3 5 1
6 G 2 3 10 7 2 1 1 1
7 G 7 0 0 3 9 5 1 1
8 E 2 7 3 8 3 7 4 1
9 D 8 8 5 6 9 7 8 1
To multiply a DataFrame by a Series and keep dissimilar columns
Create a list of the columns shared by the DataFrame and the Series that you want to operate on:
col = ['426', '428', '424', '425', '423', '427', '636']
Perform your operation using the list and indicate the axis to use:
df2[col] = df2[col].mul(s[col], axis=1)
print(df2)
loc 426 428 424 425 423 427 636 seq
0 D 1.140129 0.285176 0.254143 1.40049 0.279142 0.411572 0.524110 1
1 D 0.253362 0.570352 2.541430 0.31122 0.976997 1.028930 0.262055 1
2 E 0.380043 0.142588 1.270715 0.93366 0.418713 0.411572 0.314466 1
3 E 0.253362 0.000000 2.033144 1.24488 1.395710 0.000000 0.209644 1
4 F 0.506724 0.285176 0.508286 1.40049 1.116568 0.823144 0.000000 1
5 F 0.000000 0.142588 1.779001 0.15561 0.139571 0.308679 0.262055 1
6 G 0.253362 0.427764 2.541430 1.08927 0.279142 0.102893 0.052411 1
7 G 0.886767 0.000000 0.000000 0.46683 1.256139 0.514465 0.052411 1
8 E 0.253362 0.998116 0.762429 1.24488 0.418713 0.720251 0.209644 1
9 D 1.013448 1.140704 1.270715 0.93366 1.256139 0.720251 0.419288 1
I have been struggling with this issue for a while, and even though I assume there are some workarounds, I would love to know if there is an elegant way to achieve this result:
import pandas as pd
import numpy as np
data = np.array([
[1,10],
[2,12],
[4,13],
[5,14],
[8,15]])
df1 = pd.DataFrame(data=data, index=range(0,5), columns=['x','a'])
data = np.array([
[2,100,101],
[3,120,122],
[4,130,132],
[7,140,142],
[9,150,151],
[12,160,152]])
df2 = pd.DataFrame(data=data, index=range(0,6), columns=['x','b','c'])
Now I would like a DataFrame that combines these two and fills the missing values with the previous value, or with the first value otherwise. Both DataFrames can have different sizes; what we are interested in here is the shared column x.
This is my desired output frame df_result, where x is the union of the unique x values from the two frames:
x a b c
0 1 10 100 101
1 2 12 100 101
2 3 12 120 122
3 4 13 130 132
4 5 14 130 132
5 7 14 140 142
6 8 15 140 142
7 9 15 150 151
8 12 15 160 152
Any help or hint would be much appreciated. Thank you very much!
You can simply use an outer merge of the two DataFrames, then sort by x and apply a forward fill followed by a backward fill for the remaining null values:
df1.merge(df2,on='x',how='outer').sort_values('x').ffill().bfill()
Out:
x a b c
0 1 10.0 100.0 101.0
1 2 12.0 100.0 101.0
5 3 12.0 120.0 122.0
2 4 13.0 130.0 132.0
3 5 14.0 130.0 132.0
6 7 14.0 140.0 142.0
4 8 15.0 140.0 142.0
7 9 15.0 150.0 151.0
8 12 15.0 160.0 152.0
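To also match the clean 0..8 index of the desired output, a reset_index can be appended; a minimal sketch using the df1 and df2 from the question:
result = (
    df1.merge(df2, on='x', how='outer')
       .sort_values('x')
       .ffill()
       .bfill()
       .reset_index(drop=True)  # renumber the rows 0..8 after the sort
)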