applying vlookup to every element of pandas dataframe - pandas

I have two dataframes: a source (src) and a destination (dest). dest is shown first, followed by src.
time ...
2020-06-26 3.5 4.5 5.5 7.5 4.5 7.5 7 NaN 7.0 5.5 ... 7 7.5 3.5
2020-06-29 3.5 4.5 5.5 7.5 4.5 7.5 7 NaN 7.0 5.5 ... 7 7.5 3.5
2020-06-30 3.5 4.5 5.5 7.5 4.5 7.5 7 NaN 7.0 5.5 ... 7 7.5 3.5
2020-07-01 3.5 4.5 5.5 1.5 4.5 7.5 7 NaN 2.5 5.5 ... 7 7.5 3.5
2020-07-02 3.5 4.5 5.5 1.5 4.5 7.5 7 NaN 2.5 5.5 ... 7 7.5 3.5
1.00 1.25 1.50 1.75 ... 10.00 10.25
2020-06-29 0.153556 0.159041 0.162370 0.164580 ... 0.643962 0.658646
2020-06-30 0.156180 0.159280 0.161534 0.163746 ... 0.660171 0.675189
2020-07-01 0.156947 0.163433 0.168326 0.171734 ... 0.687046 0.701364
2020-07-02 0.152465 0.153910 0.154862 0.155750 ... 0.676183 0.690475
2020-07-03 0.154169 0.153923 0.154868 0.155751 ... 0.676537 0.690816
For each value in dest, I want to replace it with the value from src that has the same index and whose column name equals that dest value.
e.g. The value for AJ on '2020-06-26' in the dest table is currently 3.5. I want to replace it with the value in the src table at index '2020-06-26' and column 3.5.
I was thinking of using applymap, but it doesn't seem to have any concept of the index.
dest.applymap(lambda x: src.loc[x.index][x]).tail()
AttributeError: ("'numpy.float64' object has no attribute 'index'", u'occurred at index AJ')
I then tried using apply and it worked like this:
dest1 = dest.replace(0,np.nan).fillna(1) # 0 and nan are not in src.columns
df= dest1.apply(lambda x: [src[col].loc[row] for row, col in zip(x.index,x)], axis=0).tail()
2 questions on this:
Is there a better solution to this instead of doing a list comprehension within apply?
Is there a better way of handling values in dest that are not in src.columns (like 0 and nan) so the output is nan when that's the case?
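One possible vectorized approach, sketched on small made-up frames (the column names AJ, BK, CL and all numbers below are illustrative, not the real data): translate the dest values into column positions of src with Index.get_indexer, which returns -1 for anything it cannot find (such as 0 or NaN), then pick the values out of the underlying NumPy array and mask the misses. It assumes src has unique index and column labels and that the dest values match the src column labels exactly as floats.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real frames (values are made up for illustration).
dates = pd.to_datetime(["2020-06-29", "2020-06-30"])
src = pd.DataFrame(
    [[0.10, 0.20, 0.30], [0.11, 0.21, 0.31]],
    index=dates,
    columns=[1.0, 1.25, 1.5],
)
dest = pd.DataFrame(
    {"AJ": [1.0, 1.5], "BK": [np.nan, 1.25], "CL": [0.0, 1.0]},
    index=dates,
)

vals = dest.to_numpy(dtype=float)

# Position of each dest value among src's columns; -1 means "not a column".
col_pos = src.columns.get_indexer(vals.ravel()).reshape(vals.shape)
# Position of each dest row label in src's index, as a column vector.
row_pos = src.index.get_indexer(dest.index)[:, None]

valid = (col_pos >= 0) & (row_pos >= 0)

# Clip the -1 positions to 0 so the fancy indexing is legal, then mask them out.
picked = src.to_numpy()[np.clip(row_pos, 0, None), np.clip(col_pos, 0, None)]
result = pd.DataFrame(
    np.where(valid, picked, np.nan), index=dest.index, columns=dest.columns
)
print(result)
```

Because misses come back as -1, values such as 0 and NaN end up as NaN in the output without the replace/fillna step.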


How to get the index of the last condition and assign it to other columns

The condition is column 'A' > 0.5.
I want to find the index of the most recent row where the condition was met and assign it to column 'cond_index'.
A cond_index
0 0.001566 NaN
1 0.174676 NaN
2 0.553506 2
3 0.583377 3
4 0.418854 3
5 0.836482 5
6 0.927756 6
7 0.800908 7
8 0.277646 7
9 0.388323 7
Use Index.to_series with Series.where to keep the index only where A is greater than 0.5 (compared with Series.gt) and set it to missing elsewhere, then forward fill the missing values:
df['new'] = df.index.to_series().where(df['A'].gt(0.5)).ffill()
print (df)
A cond_index new
0 0.001566 NaN NaN
1 0.174676 NaN NaN
2 0.553506 2.0 2.0
3 0.583377 3.0 3.0
4 0.418854 3.0 3.0
5 0.836482 5.0 5.0
6 0.927756 6.0 6.0
7 0.800908 7.0 7.0
8 0.277646 7.0 7.0
9 0.388323 7.0 7.0
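The new column comes out as float because NaN forces a float dtype. If integer labels are preferred, one option (an addition to the answer, assuming a pandas version that has the nullable Int64 dtype) is to cast the result. The frame below just re-creates column A from the question.

```python
import pandas as pd

df = pd.DataFrame({"A": [0.001566, 0.174676, 0.553506, 0.583377, 0.418854,
                         0.836482, 0.927756, 0.800908, 0.277646, 0.388323]})

# Keep the index label where A > 0.5, NaN elsewhere, then carry the last hit forward.
last_hit = df.index.to_series().where(df["A"].gt(0.5)).ffill()

# The nullable integer dtype keeps the leading missing values as <NA>.
df["cond_index"] = last_hit.astype("Int64")
print(df)
```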

Creating variables and calculating the difference between these variables and selected variable - Pandas

I've got this data frame:
ID Date X 123_P 456_P 789_P choice
A 07/16/2019 . 1.5 1.8 1.6 123
A 07/17/2019 . 2.0 2.1 4.5 789
A 07/18/2019 . 3.0 3.2 NaN 0
A 07/19/2019 . 2.1 2.2 4.5 456
B 07/16/2019 . 1.5 1.8 1.6 789
B 07/17/2019 . 2.0 2.1 4.5 0
B 07/18/2019 . 3.0 3.2 NaN 123
I want to create new variables: 123_PD, 456_PD, 789_PD (I have much more variables than this example, so it shouldn't be done manually).
The new variables will indicate the differences between 123_P, 456_P, 789_P variables and the same variables from the previous row, considering the previous choice.
I mean, if the choice from the previous row was "123", so the differences between these variables will refer to value in "123_P" from the previous row.
Value of 0 means there is no choice, so the differences will refer to the last choice for this ID.
It should be done for each ID separately.
Expected result:
ID Date X 123_P 456_P 789_P choice 123_PD 456_PD 789_PD
A 07/16/2019 . 1.5 1.8 1.6 123 0 0 0
A 07/17/2019 . 2.0 2.1 4.5 789 0.5 0.6 3.0
A 07/18/2019 . 3.0 3.2 NaN 0 -1.5 -1.3 NaN
A 07/19/2019 . 2.1 2.2 4.5 456 -2.4 -2.3 0
B 07/16/2019 . 1.5 1.8 1.6 789 0 0 0
B 07/17/2019 . 2.0 2.1 4.5 0 0.4 0.5 2.9
B 07/18/2019 . 3.0 3.2 NaN 123 1.4 1.6 NaN
First create a helper DataFrame with a new all-NaN column 0_P, so that a choice of 0 yields a missing value, and change the choice values to match the column names:
df1 = (df.join(pd.DataFrame({'0_P':np.nan}, index=df.index))
.assign(choice = df['choice'].astype(str) + '_P'))
print (df1)
ID Date X 123_P 456_P 789_P choice 0_P
0 A 07/16/2019 . 1.5 1.8 1.6 123_P NaN
1 A 07/17/2019 . 2.0 2.1 4.5 789_P NaN
2 A 07/18/2019 . 3.0 3.2 NaN 0_P NaN
3 A 07/19/2019 . 2.1 2.2 4.5 456_P NaN
4 B 07/16/2019 . 1.5 1.8 1.6 789_P NaN
5 B 07/17/2019 . 2.0 2.1 4.5 0_P NaN
6 B 07/18/2019 . 3.0 3.2 NaN 123_P NaN
Then use DataFrame.lookup to get the chosen values as an array, convert it to a Series, and within each ID group apply Series.shift and forward fill the missing values in a lambda function:
s = (pd.Series(df1.lookup(df1.index, df1['choice']), index=df.index)
       .groupby(df['ID'], group_keys=False)
       .apply(lambda x: x.shift().ffill()))
print (s)
0 NaN
1 1.5
2 4.5
3 4.5
4 NaN
5 1.6
6 1.6
dtype: float64
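DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so the step above fails on current versions. A possible substitute, sketched here on a cut-down copy of the example data, is to turn the column labels into positions and index the underlying NumPy array. The helper name lookup_values is hypothetical, and the sketch assumes every choice label exists among the selected columns.

```python
import numpy as np
import pandas as pd

# Cut-down copy of the helper frame from the answer (Date and X omitted).
df1 = pd.DataFrame({
    "ID": list("AAAABBB"),
    "123_P": [1.5, 2.0, 3.0, 2.1, 1.5, 2.0, 3.0],
    "456_P": [1.8, 2.1, 3.2, 2.2, 1.8, 2.1, 3.2],
    "789_P": [1.6, 4.5, np.nan, 4.5, 1.6, 4.5, np.nan],
    "0_P": np.nan,
    "choice": ["123_P", "789_P", "0_P", "456_P", "789_P", "0_P", "123_P"],
})

def lookup_values(frame, col_labels):
    """Hypothetical stand-in for the removed DataFrame.lookup (one label per row)."""
    col_pos = frame.columns.get_indexer(col_labels)
    row_pos = np.arange(len(frame))
    return frame.to_numpy()[row_pos, col_pos]

price_cols = ["123_P", "456_P", "789_P", "0_P"]
chosen = pd.Series(
    lookup_values(df1[price_cols], df1["choice"]), index=df1.index, dtype=float
)

# Previous row's chosen value per ID, carrying the last real choice forward.
s = chosen.groupby(df1["ID"]).transform(lambda x: x.shift().ffill())
print(s)
```

Using transform keeps the original row index, so s can be passed straight to DataFrame.sub with axis=0.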
Then select the necessary columns, subtract with DataFrame.sub, rename with DataFrame.add_suffix, and finally set the first row of each ID to 0 using duplicated on the ID column:
df2 = df.iloc[:, -4:-1].sub(s, axis=0).add_suffix('D')
df2.loc[~df1['ID'].duplicated(), :] = 0
print (df2)
123_PD 456_PD 789_PD
0 0.0 0.0 0.0
1 0.5 0.6 3.0
2 -1.5 -1.3 NaN
3 -2.4 -2.3 0.0
4 0.0 0.0 0.0
5 0.4 0.5 2.9
6 1.4 1.6 NaN
df = df.join(df2)
print (df)
ID Date X 123_P 456_P 789_P choice 123_PD 456_PD 789_PD
0 A 07/16/2019 . 1.5 1.8 1.6 123 0.0 0.0 0.0
1 A 07/17/2019 . 2.0 2.1 4.5 789 0.5 0.6 3.0
2 A 07/18/2019 . 3.0 3.2 NaN 0 -1.5 -1.3 NaN
3 A 07/19/2019 . 2.1 2.2 4.5 456 -2.4 -2.3 0.0
4 B 07/16/2019 . 1.5 1.8 1.6 789 0.0 0.0 0.0
5 B 07/17/2019 . 2.0 2.1 4.5 0 0.4 0.5 2.9
6 B 07/18/2019 . 3.0 3.2 NaN 123 1.4 1.6 NaN
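This should do it (note that it takes plain row-to-row differences and does not take choice or ID into account):
cols = ['123_P', '456_P', '789_P']
new_cols = [c + 'D' for c in cols]
df[new_cols] = (df[cols] - df[cols].shift(1)).to_numpy()
# the first row has no previous row, so set its differences to 0
df.loc[df.index[0], new_cols] = 0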

Reverse Rolling mean for DataFrame

I am trying to create a fixture difficulty grid using a DataFrame. I want the mean for the next 5 fixtures for each team.
I'm currently using df.rolling(5, min_periods=1).mean().shift(-4). This works at the start but produces NaNs at the end. I understand why the NaNs are returned: there is nothing left in the DataFrame to shift up. Ideally I'd like the NaNs to become the mean across the remaining values, with the value against fixture 38 just being its current value.
Fixture difficulties
3 4 3 2
2 2 2 2
5 2 2 4
4 2 5 3
3 2 2 2
Mean of next 5 fixtures
3.4 2.4 2.8 2.6
3.2 2.4 2.8 2.6
3.6 2.4 3.2 2.6
3 2.4 3.6 2.6
2.6 2.4 3 2.4
NAN on last records as nothing to shift up.
3.2 3.6 2.8 3.6
nan nan nan nan
nan nan nan nan
nan nan nan nan
nan nan nan nan
Can I adapt this approach, or do I need a different one altogether to populate the NaNs?
IIUC, you need to reverse the rows by indexing, apply rolling, and then reverse back:
df1 = df.iloc[::-1].rolling(5, min_periods=1).mean().iloc[::-1]
print (df1)
0 3.4 2.4 2.80 2.60
1 3.5 2.0 2.75 2.75
2 4.0 2.0 3.00 3.00
3 3.5 2.0 3.50 2.50
4 3.0 2.0 2.00 2.00
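For reference, a self-contained sketch of the reverse, roll, reverse idea on the five sample rows from the question (columns keep default integer labels since the team names are not given). It also shows an alternative, FixedForwardWindowIndexer from pandas.api.indexers, which as far as I know is available from pandas 1.1 and expresses the forward-looking window directly. The two are expected to agree.

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

df = pd.DataFrame([[3, 4, 3, 2],
                   [2, 2, 2, 2],
                   [5, 2, 2, 4],
                   [4, 2, 5, 3],
                   [3, 2, 2, 2]])

# Reverse, take a trailing window, reverse back: each row now averages
# itself and up to four rows after it.
reversed_roll = df.iloc[::-1].rolling(5, min_periods=1).mean().iloc[::-1]

# Alternative: ask rolling for a forward-looking window directly.
forward = FixedForwardWindowIndexer(window_size=5)
forward_roll = df.rolling(window=forward, min_periods=1).mean()

print(reversed_roll)
```

With min_periods=1 the last row is simply its own value, and the rows before it average over however many fixtures remain.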

Pandas multiply DataFrames with element-wise match of index and column

I have two pandas DataFrames, with one of them having index and columns that are subsets of the other. For example:
DF1 =
date a b c
20170101 1.0 2.2 3
20170102 2.1 5.2 -3.0
20170103 4.2 1.8 10.0
20170331 9.8 5.1 4.5
DF2 =
date a c
20170101 NaN 2.1
20170103 4 NaN
What I want is element-wise multiplication by matching both the index and the column, i.e. only DF1.loc[20170103, 'c'] will be multiplied with DF2.loc[20170103, 'c'], etc.
The resulting DF should have the same dimension as the bigger one (DF1), with missing values in DF2 set as the original DF1 value:
result DF =
date a b c
20170101 1.0 2.2 6.3
20170102 2.1 5.2 -3.0
20170103 16.8 1.8 10.0
20170331 9.8 5.1 4.5
What's the best/fastest way to do this? The real-life matrices are huge, and DF2 is relatively sparse.
I think you need the vectorized function DataFrame.mul with fill_value=1:
df = DF1.mul(DF2, fill_value=1)
print (df)
a b c
20170101 1.0 2.2 6.3
20170102 2.1 5.2 -3.0
20170103 16.8 1.8 10.0
20170331 9.8 5.1 4.5
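A runnable version with the sample frames from the question, as a sketch. The two frames are aligned on the union of their labels, and fill_value=1 stands in for an entry that is missing on one side only, so the DF1 value passes through unchanged. One caveat: the same rule applies in the other direction, so a NaN in DF1 that lines up with a real number in DF2 would come out as that number. The sketch assumes DF1 has no missing values.

```python
import numpy as np
import pandas as pd

DF1 = pd.DataFrame(
    {"a": [1.0, 2.1, 4.2, 9.8], "b": [2.2, 5.2, 1.8, 5.1], "c": [3.0, -3.0, 10.0, 4.5]},
    index=pd.Index([20170101, 20170102, 20170103, 20170331], name="date"),
)
DF2 = pd.DataFrame(
    {"a": [np.nan, 4.0], "c": [2.1, np.nan]},
    index=pd.Index([20170101, 20170103], name="date"),
)

# Labels absent from DF2 (or NaN in it) are treated as 1 before multiplying.
result = DF1.mul(DF2, fill_value=1)
print(result)
```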

Understanding pandas interpolation function

I have seen numerous comparisons like this, but I still don't understand, as I don't have issues related to datetime or floating point precision.
I am having trouble understanding the implementation of the interpolation function in pandas.
I am trying the following:
import pandas as pd
df = pd.DataFrame({'MC':[1,2,4,6,7,8,9,10,12,13,15], 'MW':[1,2,4,6,7,8,9,10,12,13,15]})
df.set_index('MW', inplace=True)
df2 = df.reindex([1,2,4,6,7,8,9,10,12,13,15,3,5,11,13,14,16])
1 1.0
2 2.0
4 4.0
6 6.0
7 7.0
8 8.0
9 9.0
10 10.0
12 12.0
13 13.0
15 15.0
3 14.5
5 14.0
11 13.5
13 13.0
14 13.0
16 13.0
This looks like a very unusual result to me for both interpolation and extrapolation. There is no floating point issue at play here.
When I use scipy interpolation, I get the desired result but will need to implement separate extrapolation:
from scipy import interpolate
f = interpolate.interp1d(df.index, df.values.flatten())
f([3, 5, 11, 13, 14])
array([ 3., 5., 11., 13., 14.])
Edit: I have tried numerous options, as follows:
1 1.0
2 2.0
4 4.0
6 6.0
7 7.0
8 8.0
9 9.0
10 10.0
12 12.0
13 13.0
15 15.0
3 3.0
5 5.0
11 11.0
13 13.0
14 13.0
16 13.0
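The first output in the question is consistent with the default method='linear', which treats the rows as equally spaced and ignores the index, so values appended at the end are interpolated between their positional neighbours (15 and 13 here). A minimal sketch of one way around this, assuming the goal is to interpolate MC against the numeric MW index: keep the index sorted and unique, then use method='index'.

```python
import pandas as pd

df = pd.DataFrame({"MC": [1, 2, 4, 6, 7, 8, 9, 10, 12, 13, 15],
                   "MW": [1, 2, 4, 6, 7, 8, 9, 10, 12, 13, 15]}).set_index("MW")

# Add the missing labels; union returns a sorted index without duplicates.
new_index = df.index.union([3, 5, 11, 14, 16])
df2 = df.reindex(new_index)

# method="linear" (the default) ignores the index; method="index" uses the
# numeric index values as the x coordinates.
by_index = df2["MC"].interpolate(method="index")
print(by_index)
```

This covers the interior points only. pandas does not extend the linear trend past the last known point (label 16 here), so extrapolation still needs separate handling, as noted in the question.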