Applying vlookup to every element of a pandas DataFrame

I have two DataFrames, one of which is the source (src) and the other is the destination (dest).
dest.tail()
Out[166]:
Item AJ AM AO AR BA BO BR BU BY CA ... TJ TK TR
time ...
2020-06-26 3.5 4.5 5.5 7.5 4.5 7.5 7 NaN 7.0 5.5 ... 7 7.5 3.5
2020-06-29 3.5 4.5 5.5 7.5 4.5 7.5 7 NaN 7.0 5.5 ... 7 7.5 3.5
2020-06-30 3.5 4.5 5.5 7.5 4.5 7.5 7 NaN 7.0 5.5 ... 7 7.5 3.5
2020-07-01 3.5 4.5 5.5 1.5 4.5 7.5 7 NaN 2.5 5.5 ... 7 7.5 3.5
2020-07-02 3.5 4.5 5.5 1.5 4.5 7.5 7 NaN 2.5 5.5 ... 7 7.5 3.5
src.tail()
Out[167]:
1.00 1.25 1.50 1.75 ... 10.00 10.25
time
2020-06-29 0.153556 0.159041 0.162370 0.164580 ... 0.643962 0.658646
2020-06-30 0.156180 0.159280 0.161534 0.163746 ... 0.660171 0.675189
2020-07-01 0.156947 0.163433 0.168326 0.171734 ... 0.687046 0.701364
2020-07-02 0.152465 0.153910 0.154862 0.155750 ... 0.676183 0.690475
2020-07-03 0.154169 0.153923 0.154868 0.155751 ... 0.676537 0.690816
For each value in dest, I want to replace it with the value in src that has the same index and whose column name equals that value.
e.g. the value for AJ on '2020-06-26' in dest is currently 3.5; I want to replace it with the value in src at index '2020-06-26' and column 3.5.
I was thinking of using applymap, but it doesn't seem to have any concept of the index.
dest.applymap(lambda x: src.loc[x.index][x]).tail()
AttributeError: ("'numpy.float64' object has no attribute 'index'", u'occurred at index AJ')
I then tried using apply and it worked like this:
dest1 = dest.replace(0,np.nan).fillna(1) # 0 and nan are not in src.columns
df = dest1.apply(lambda x: [src[col].loc[row] for row, col in zip(x.index, x)], axis=0).tail()
Two questions on this:
Is there a better solution than doing a list comprehension inside apply?
Is there a better way of handling values in dest that are not in src.columns (like 0 and NaN), so that the output is NaN in those cases?
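A vectorized alternative that avoids the Python-level loop might be to align src to dest's dates and pick values by position (a sketch, assuming the dest values match src.columns exactly; values not found there, such as 0 or NaN, come back as NaN, which also covers the second question):
import numpy as np
import pandas as pd

# align src rows to dest's dates; dates missing from src become all-NaN rows
src_rows = src.reindex(dest.index).to_numpy()
# positional column lookup: -1 where the dest value is not a src column (e.g. 0 or NaN)
col_pos = src.columns.get_indexer(dest.to_numpy().ravel())
row_pos = np.repeat(np.arange(len(dest)), dest.shape[1])
picked = np.where(col_pos >= 0, src_rows[row_pos, col_pos], np.nan)
result = pd.DataFrame(picked.reshape(dest.shape), index=dest.index, columns=dest.columns)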

Related

How to get the index of the last condition and assign it to other columns

The condition is column 'A' > 0.5.
I want to find the index of the last row where the condition held and assign it to column 'cond_index'.
A cond_index
0 0.001566 NaN
1 0.174676 NaN
2 0.553506 2
3 0.583377 3
4 0.418854 3
5 0.836482 5
6 0.927756 6
7 0.800908 7
8 0.277646 7
9 0.388323 7
Use Index.to_series, mask the values where the condition is not met with Series.where (comparing for greater than 0.5), and finally forward fill the missing values:
df['new'] = df.index.to_series().where(df['A'].gt(0.5)).ffill()
print (df)
A cond_index new
0 0.001566 NaN NaN
1 0.174676 NaN NaN
2 0.553506 2.0 2.0
3 0.583377 3.0 3.0
4 0.418854 3.0 3.0
5 0.836482 5.0 5.0
6 0.927756 6.0 6.0
7 0.800908 7.0 7.0
8 0.277646 7.0 7.0
9 0.388323 7.0 7.0
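If the result should stay integer-like despite the leading NaNs (as in the expected cond_index above), the nullable Int64 dtype might help (a small sketch):
df['cond_index'] = df.index.to_series().where(df['A'].gt(0.5)).ffill().astype('Int64')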

Creating variables and calculating the difference between these variables and a selected variable - Pandas

I've got this data frame:
ID Date X 123_P 456_P 789_P choice
A 07/16/2019 . 1.5 1.8 1.6 123
A 07/17/2019 . 2.0 2.1 4.5 789
A 07/18/2019 . 3.0 3.2 NaN 0
A 07/19/2019 . 2.1 2.2 4.5 456
B 07/16/2019 . 1.5 1.8 1.6 789
B 07/17/2019 . 2.0 2.1 4.5 0
B 07/18/2019 . 3.0 3.2 NaN 123
I want to create new variables: 123_PD, 456_PD, 789_PD (I have many more variables than in this example, so it shouldn't be done manually).
The new variables will hold the differences between the 123_P, 456_P, 789_P variables and the previous row, based on the previous choice.
I mean, if the choice in the previous row was "123", then the differences for these variables are taken against the value of "123_P" from that previous row.
Notes:
Value of 0 means there is no choice, so the differences will refer to the last choice for this ID.
It should be done for each ID separately.
Expected result:
ID Date X 123_P 456_P 789_P choice 123_PD 456_PD 789_PD
A 07/16/2019 . 1.5 1.8 1.6 123 0 0 0
A 07/17/2019 . 2.0 2.1 4.5 789 0.5 0.6 3.0
A 07/18/2019 . 3.0 3.2 NaN 0 -1.5 -1.3 NaN
A 07/19/2019 . 2.1 2.2 4.5 456 -2.4 -2.3 0
B 07/16/2019 . 1.5 1.8 1.6 789 0 0 0
B 07/17/2019 . 2.0 2.1 4.5 0 0.4 0.5 2.9
B 07/18/2019 . 3.0 3.2 NaN 123 1.4 1.6 NaN
First create a helper DataFrame with a new column 0_P (filled with missing values), and change the choice values so they match the column names:
df1 = (df.join(pd.DataFrame({'0_P':np.nan}, index=df.index))
.assign(choice = df['choice'].astype(str) + '_P'))
print (df1)
ID Date X 123_P 456_P 789_P choice 0_P
0 A 07/16/2019 . 1.5 1.8 1.6 123_P NaN
1 A 07/17/2019 . 2.0 2.1 4.5 789_P NaN
2 A 07/18/2019 . 3.0 3.2 NaN 0_P NaN
3 A 07/19/2019 . 2.1 2.2 4.5 456_P NaN
4 B 07/16/2019 . 1.5 1.8 1.6 789_P NaN
5 B 07/17/2019 . 2.0 2.1 4.5 0_P NaN
6 B 07/18/2019 . 3.0 3.2 NaN 123_P NaN
Then use DataFrame.lookup to pull the chosen values into an array, convert it to a Series, and apply Series.shift with forward filling of missing values per group inside a lambda function:
s = (pd.Series(df1.lookup(df1.index, df1['choice']), index=df.index)
.groupby(df['ID'])
.apply(lambda x: x.shift().ffill()))
print (s)
0 NaN
1 1.5
2 4.5
3 4.5
4 NaN
5 1.6
6 1.6
dtype: float64
Then select the necessary columns, subtract the Series with DataFrame.sub, add a suffix with DataFrame.add_suffix, and finally set the first (non-duplicated) row per ID to 0:
df2 = df.iloc[:, -4:-1].sub(s, axis=0).add_suffix('D')
df2.loc[~df1['ID'].duplicated(), :] = 0
print (df2)
123_PD 456_PD 789_PD
0 0.0 0.0 0.0
1 0.5 0.6 3.0
2 -1.5 -1.3 NaN
3 -2.4 -2.3 0.0
4 0.0 0.0 0.0
5 0.4 0.5 2.9
6 1.4 1.6 NaN
df = df.join(df2)
print (df)
ID Date X 123_P 456_P 789_P choice 123_PD 456_PD 789_PD
0 A 07/16/2019 . 1.5 1.8 1.6 123 0.0 0.0 0.0
1 A 07/17/2019 . 2.0 2.1 4.5 789 0.5 0.6 3.0
2 A 07/18/2019 . 3.0 3.2 NaN 0 -1.5 -1.3 NaN
3 A 07/19/2019 . 2.1 2.2 4.5 456 -2.4 -2.3 0.0
4 B 07/16/2019 . 1.5 1.8 1.6 789 0.0 0.0 0.0
5 B 07/17/2019 . 2.0 2.1 4.5 0 0.4 0.5 2.9
6 B 07/18/2019 . 3.0 3.2 NaN 123 1.4 1.6 NaN
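Note that DataFrame.lookup has been deprecated and removed in newer pandas versions (2.0+); on those versions the lookup step above could be replaced with plain numpy indexing (a sketch that reuses df1 and df from the code above):
import numpy as np
import pandas as pd

# positional equivalent of df1.lookup(df1.index, df1['choice'])
rows = np.arange(len(df1))
cols = df1.columns.get_indexer(df1['choice'])
picked = pd.Series(df1.to_numpy()[rows, cols], index=df1.index).astype(float)
s = picked.groupby(df['ID'], group_keys=False).apply(lambda x: x.shift().ffill())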
This should do the needful:
df[['123_PD', '456_PD', '789_PD']] = df[['123_P', '456_P', '789_P']] - df[['123_P', '456_P', '789_P']].shift(1)
# set the first row to 0 without chained assignment
df.loc[df.index[0], ['123_PD', '456_PD', '789_PD']] = 0
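If the per-ID grouping still matters with this simpler approach, a groupby variant might look like this (a sketch; like the snippet above, it differences each column against its own previous value, not against the previously chosen column):
cols = ['123_P', '456_P', '789_P']
# per-ID differences against the previous row, column by column
df[[c + 'D' for c in cols]] = df.groupby('ID')[cols].diff().to_numpy()
# the first row of each ID has no previous row, so set it to 0
df.loc[~df['ID'].duplicated(), [c + 'D' for c in cols]] = 0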

Reverse Rolling mean for DataFrame

I am trying to create a fixture difficulty grid using a DataFrame. I want the mean for the next 5 fixtures for each team.
I’m currently using df.rolling(5, min_periods=1).mean().shift(-4). This works at the start but produces NaNs at the end. I understand why the NaNs are returned – there is nothing left to shift up. Ideally I’d like the NaNs to become the mean across the remaining values, with the value for fixture 38 just being its current value.
Fixture difficulties
ARS AVL BHA BOU
3 4 3 2
2 2 2 2
5 2 2 4
4 2 5 3
3 2 2 2
Mean of next 5 fixtures
ARS AVL BHA BOU
3.4 2.4 2.8 2.6
3.2 2.4 2.8 2.6
3.6 2.4 3.2 2.6
3 2.4 3.6 2.6
2.6 2.4 3 2.4
NaN on the last records, as there is nothing to shift up:
3.2 3.6 2.8 3.6
nan nan nan nan
nan nan nan nan
nan nan nan nan
nan nan nan nan
Can I adapt this approach, or do I need a different one altogether to populate the NaNs?
IIUC you need to reverse the rows by indexing, apply rolling, and reverse back; with min_periods=1 the trailing rows are averaged over whatever fixtures remain, so no NaNs are produced:
df1 = df.iloc[::-1].rolling(5, min_periods=1).mean().iloc[::-1]
print (df1)
ARS AVL BHA BOU
0 3.4 2.4 2.80 2.60
1 3.5 2.0 2.75 2.75
2 4.0 2.0 3.00 3.00
3 3.5 2.0 3.50 2.50
4 3.0 2.0 2.00 2.00

Pandas multiply DataFrames with element-wise match of index and column

I have two pandas DataFrames, with one of them having index and columns that are subsets of the other. For example:
DF1 =
date a b c
20170101 1.0 2.2 3
20170102 2.1 5.2 -3.0
20170103 4.2 1.8 10.0
...
20170331 9.8 5.1 4.5
DF2 =
date a c
20170101 NaN 2.1
20170103 4 NaN
What I want is element-wise multiplication by matching both the index and column, i.e. only DF1[20170103]['c'] will be multiplied by DF2[20170103]['c'], etc.
The resulting DF should have the same dimensions as the bigger one (DF1), with cells missing from DF2 keeping the original DF1 value:
result DF =
date a b c
20170101 1.0 2.2 6.3
20170102 2.1 5.2 -3.0
20170103 16.8 1.8 10.0
...
20170331 9.8 5.1 4.5
What's the best/fastest way to do this? The real-life matrices are huge, and DF2 is relatively sparse.
I think you need the vectorized function DataFrame.mul with fill_value:
df = DF1.mul(DF2, fill_value=1)
print (df)
a b c
date
20170101 1.0 2.2 6.3
20170102 2.1 5.2 -3.0
20170103 16.8 1.8 10.0
20170331 9.8 5.1 4.5
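For completeness, a small runnable reproduction of the rows shown above (the frames are reconstructed from the question's tables):
import numpy as np
import pandas as pd

DF1 = pd.DataFrame(
    {'a': [1.0, 2.1, 4.2, 9.8], 'b': [2.2, 5.2, 1.8, 5.1], 'c': [3.0, -3.0, 10.0, 4.5]},
    index=pd.Index([20170101, 20170102, 20170103, 20170331], name='date'))
DF2 = pd.DataFrame(
    {'a': [np.nan, 4.0], 'c': [2.1, np.nan]},
    index=pd.Index([20170101, 20170103], name='date'))

# cells missing from DF2 (column 'b' and the extra dates) are filled with 1,
# so the corresponding DF1 values pass through unchanged
df = DF1.mul(DF2, fill_value=1)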

Understanding pandas interpolation function

I have seen numerous comparisons like this, but I still don't understand, as I don't have issues related to datetime or floating point precision.
I am having trouble understanding the implementation of the interpolation function in pandas.
I am trying the following:
import pandas as pd
df = pd.DataFrame({'MC':[1,2,4,6,7,8,9,10,12,13,15], 'MW':[1,2,4,6,7,8,9,10,12,13,15]})
df.set_index('MW', inplace=True)
df2 = df.reindex([1,2,4,6,7,8,9,10,12,13,15,3,5,11,13,14,16])
df2.apply(pd.Series.interpolate)
MC
MW
1 1.0
2 2.0
4 4.0
6 6.0
7 7.0
8 8.0
9 9.0
10 10.0
12 12.0
13 13.0
15 15.0
3 14.5
5 14.0
11 13.5
13 13.0
14 13.0
16 13.0
This looks like a very unusual result to me, for both interpolation and extrapolation. There is no floating point issue at play here.
When I use scipy interpolation, I get the desired result, but I would need to implement extrapolation separately:
from scipy import interpolate
f = interpolate.interp1d(df.index, df.values.flatten())
f([3,5,11,13,14])
array([ 3., 5., 11., 13., 14.])
Edit: I have tried numerous options, as follows:
df2.interpolate(method='index')
MC
MW
1 1.0
2 2.0
4 4.0
6 6.0
7 7.0
8 8.0
9 9.0
10 10.0
12 12.0
13 13.0
15 15.0
3 3.0
5 5.0
11 11.0
13 13.0
14 13.0
16 13.0
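As a side note, interp1d can also extrapolate directly via fill_value='extrapolate' (scipy 0.17+), which would avoid a separate extrapolation step (a sketch reusing the original df from above):
from scipy import interpolate

# linear interpolation that is also allowed to extrapolate beyond the known MW range
f = interpolate.interp1d(df.index, df.values.flatten(), fill_value='extrapolate')
f([3, 5, 11, 13, 14, 16])
# array([ 3.,  5., 11., 13., 14., 16.])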