Reverse Rolling mean for DataFrame - pandas

I am trying to create a fixture difficulty grid using a DataFrame. I want the mean for the next 5 fixtures for each team.
I’m currently using df.rolling(5, min_periods=1).mean().shift(-4). This works at the start but produces NaN at the end. I understand why the NaNs appear – there are no rows left to shift up. Ideally I’d like each NaN to become the mean of the remaining values, with the value against gameweek 38 just being its current value.
Fixture difficulties
ARS AVL BHA BOU
3 4 3 2
2 2 2 2
5 2 2 4
4 2 5 3
3 2 2 2
Mean of next 5 fixtures
ARS AVL BHA BOU
3.4 2.4 2.8 2.6
3.2 2.4 2.8 2.6
3.6 2.4 3.2 2.6
3 2.4 3.6 2.6
2.6 2.4 3 2.4
NaN on the last records, as there is nothing to shift up:
3.2 3.6 2.8 3.6
nan nan nan nan
nan nan nan nan
nan nan nan nan
nan nan nan nan
Can I adapt this approach or need a different one altogether to populate the NANs?

IIUC, you need to reverse the rows by indexing, apply rolling, and then reverse back:
df1 = df.iloc[::-1].rolling(5, min_periods=1).mean().iloc[::-1]
print(df1)
ARS AVL BHA BOU
0 3.4 2.4 2.80 2.60
1 3.5 2.0 2.75 2.75
2 4.0 2.0 3.00 3.00
3 3.5 2.0 3.50 2.50
4 3.0 2.0 2.00 2.00
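The reversal trick above can be checked end to end on the question's data; a minimal runnable sketch:

```python
import pandas as pd

# Fixture difficulties from the question, one row per gameweek.
df = pd.DataFrame({'ARS': [3, 2, 5, 4, 3],
                   'AVL': [4, 2, 2, 2, 2],
                   'BHA': [3, 2, 2, 5, 2],
                   'BOU': [2, 2, 4, 3, 2]})

# Reverse the rows, take what is now a forward-looking rolling mean,
# then reverse back. min_periods=1 lets the trailing windows shrink
# instead of producing NaN, so the last row is just its own value.
df1 = df.iloc[::-1].rolling(5, min_periods=1).mean().iloc[::-1]
print(df1)
```

Because min_periods=1 shrinks the window where fewer than five fixtures remain, the final row equals the raw difficulty, which is exactly the behaviour the question asks for.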

Finding the mean of nuisance columns in DataFrame error

id gender status dept var1 var2 salary
0 P001 M FT DS 2.0 8.0 NaN
1 P002 F PT FS 3.0 NaN 54.0
2 P003 M NaN AWS 5.0 5.0 59.0
3 P004 F FT AWS NaN 8.0 120.0
4 P005 M PT DS 7.0 11.0 58.0
5 P006 F PT NaN 1.0 NaN 75.0
6 P007 M FT FS NaN NaN NaN
7 P008 F NaN FS 10.0 2.0 136.0
8 P009 M PT NaN 14.0 3.0 60.0
9 P010 F FT DS NaN 7.0 125.0
10 P011 M NaN AWS 6.0 9.0 NaN
print(df.mean())
FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction.
When I corrected my code to:
print(df.mean(numeric_only=True))
it ran without an error.
Is there any other way to solve it ?
Using numeric_only=True is the right way; the only real alternative is to select the numeric columns yourself first (e.g. df.select_dtypes('number').mean()), which amounts to the same thing.
You're getting that warning because there are some columns that contain strings, but df.mean() only works with columns that contain numbers (floats, ints, nan, etc.).
Using numeric_only=True causes df.mean() to ignore columns that contain non-numbers, and only calculate the mean for columns that only contain numbers.
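A small reproducible sketch of this behaviour (on hypothetical data, not the question's full table):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': ['P001', 'P002', 'P003'],
                   'gender': ['M', 'F', 'M'],
                   'var1': [2.0, 3.0, 5.0],
                   'salary': [np.nan, 54.0, 59.0]})

# Only the numeric columns take part in the mean; the string columns
# are skipped instead of triggering the nuisance-column warning.
means = df.mean(numeric_only=True)
print(means)

# Equivalent: select the numeric columns explicitly first.
means2 = df.select_dtypes('number').mean()
```

Note that NaN entries within a numeric column (like the missing salary) are still skipped by mean() itself; numeric_only only controls which columns participate.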
Try this if you just want to suppress the warning rather than fix it (note the original suggestion of pd.options.mode.chained_assignment = None governs a different warning; this one is a FutureWarning):
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

Conditional aggregation after rolling in pandas

I am trying to calculate a rolling mean of a specific column based on a condition in another column.
The condition is to create three different rolling means for column A, as follows -
The rolling mean of A when column B is less than 2
The rolling mean of A when column B is equal to 2
The rolling mean of A when column B is greater than 2
Consider the following df with a window size of 2
A B
0 1 2
1 2 4
2 3 4
3 4 6
4 5 1
5 6 2
The output will be the following-
rolling less rolling equal rolling greater
0 NaN NaN NaN
1 NaN 1 2
2 NaN NaN 2.5
3 NaN NaN 3.5
4 5 NaN 4
5 5 6 NaN
The main difficulty I encountered is that rolling works column-wise, whereas apply works row-wise, and computing the rolling mean by hand inside apply ends up too hard-coded.
Any ideas?
Thanks a lot.
You can create your 3 columns before rolling, then compute it:
out = df.join(df.assign(rolling_less=df.mask(df['B'] >= 2)['A'],
                        rolling_equal=df.mask(df['B'] != 2)['A'],
                        rolling_greater=df.mask(df['B'] <= 2)['A'])
                .filter(like='rolling').rolling(2, min_periods=1).mean())
print(out)
print(out)
# Output
A B rolling_less rolling_equal rolling_greater
0 1 2 NaN 1.0 NaN
1 2 4 NaN 1.0 2.0
2 3 4 NaN NaN 2.5
3 4 6 NaN NaN 3.5
4 5 1 5.0 NaN 4.0
5 6 2 5.0 6.0 NaN
Another option computes the three conditional means row by row (here df1 is the input DataFrame):
def function1(ss: pd.Series):
    # last 2 rows up to the current label = the rolling window
    df11 = df1.loc[:ss.name].tail(2)
    return pd.Series([
        df11.loc[lambda dd: dd.B < 2, 'A'].mean(),
        df11.loc[lambda dd: dd.B == 2, 'A'].mean(),
        df11.loc[lambda dd: dd.B > 2, 'A'].mean(),
    ], index=['rolling less', 'rolling equal', 'rolling greater'], name=ss.name)

pd.concat([df1.A.shift(i) for i in range(2)], axis=1)\
    .apply(function1, axis=1)
A B rolling less rolling equal rolling greater
0 1 2 NaN 1.0 NaN
1 2 4 NaN 1.0 2.0
2 3 4 NaN NaN 2.5
3 4 6 NaN NaN 3.5
4 5 1 5.0 NaN 4.0
5 6 2 5.0 6.0 NaN
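The mask-then-roll approach from the first answer can be verified end to end; a self-contained version:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6],
                   'B': [2, 4, 4, 6, 1, 2]})

# Split A into three conditional columns (NaN where the condition fails),
# then roll each one; the windowed mean simply skips the NaNs.
out = df.join(df.assign(rolling_less=df.mask(df['B'] >= 2)['A'],
                        rolling_equal=df.mask(df['B'] != 2)['A'],
                        rolling_greater=df.mask(df['B'] <= 2)['A'])
                .filter(like='rolling')
                .rolling(2, min_periods=1).mean())
print(out)
```

With min_periods=1 a window containing a single valid value already yields a mean, which is why row 0 gets rolling_equal=1.0 rather than NaN.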

applying vlookup to every element of pandas dataframe

I have two dataframes, one of which is source(src), and the other one is the destination(dest)
dest.tail()
Out[166]:
Item AJ AM AO AR BA BO BR BU BY CA ... TJ TK TR
time ...
2020-06-26 3.5 4.5 5.5 7.5 4.5 7.5 7 NaN 7.0 5.5 ... 7 7.5 3.5
2020-06-29 3.5 4.5 5.5 7.5 4.5 7.5 7 NaN 7.0 5.5 ... 7 7.5 3.5
2020-06-30 3.5 4.5 5.5 7.5 4.5 7.5 7 NaN 7.0 5.5 ... 7 7.5 3.5
2020-07-01 3.5 4.5 5.5 1.5 4.5 7.5 7 NaN 2.5 5.5 ... 7 7.5 3.5
2020-07-02 3.5 4.5 5.5 1.5 4.5 7.5 7 NaN 2.5 5.5 ... 7 7.5 3.5
src.tail()
Out[167]:
1.00 1.25 1.50 1.75 ... 10.00 10.25
time
2020-06-29 0.153556 0.159041 0.162370 0.164580 ... 0.643962 0.658646
2020-06-30 0.156180 0.159280 0.161534 0.163746 ... 0.660171 0.675189
2020-07-01 0.156947 0.163433 0.168326 0.171734 ... 0.687046 0.701364
2020-07-02 0.152465 0.153910 0.154862 0.155750 ... 0.676183 0.690475
2020-07-03 0.154169 0.153923 0.154868 0.155751 ... 0.676537 0.690816
For each value in dest, i want to replace it with a value in the src table, which has same index, and same column name as itself.
e.g. Value for AJ on '2020-06-26' in the dest table right now is 3.5. I want to replace it with value in src table corresponding to index '2020-06-26' and column = 3.5
I was thinking of using applymap, but it doesn't seem to have a concept of the index.
dest.applymap(lambda x: src.loc[x.index][x]).tail()
AttributeError: ("'numpy.float64' object has no attribute 'index'", u'occurred at index AJ')
I then tried using apply and it worked like this:
dest1 = dest.replace(0,np.nan).fillna(1) # 0 and nan are not in src.columns
df= dest1.apply(lambda x: [src[col].loc[row] for row, col in zip(x.index,x)], axis=0).tail()
2 questions on this:
Is there a better solution to this instead of doing a list comprehension within apply?
Is there a better way of handling values in dest that are not in src.columns (like 0 and nan) so the output is nan when that's the case?
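One vectorised alternative to the apply/zip approach is to reshape both frames to long form and merge; a left merge also returns NaN automatically for dest values that are not src columns. A sketch on tiny hypothetical frames (not the real data):

```python
import numpy as np
import pandas as pd

dest = pd.DataFrame({'AJ': [3.5, 1.5], 'AM': [4.5, np.nan]},
                    index=pd.Index(['2020-07-01', '2020-07-02'], name='time'))
src = pd.DataFrame({3.5: [0.10, 0.11], 4.5: [0.20, 0.21]},
                   index=pd.Index(['2020-07-01', '2020-07-02'], name='time'))

# Long form of dest: one row per (time, team), holding the value to look up.
long = dest.stack().rename('key').rename_axis(['time', 'team']).reset_index()
# Long form of src: one row per (time, key), holding the looked-up value.
src_long = src.stack().rename('val').rename_axis(['time', 'key']).reset_index()

# Left merge: keys missing from src.columns (e.g. 1.5) come back as NaN.
out = (long.merge(src_long, on=['time', 'key'], how='left')
           .pivot(index='time', columns='team', values='val')
           .reindex(columns=dest.columns))
print(out)
```

The merge is on exact float equality, which is fine here because the lookup keys come from a small fixed grid of values.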

How to select NaN values in pandas in specific range

I have a dataframe like this:
df = pd.DataFrame({'col1': [5,6,np.nan, np.nan,np.nan, 4, np.nan, np.nan,np.nan, np.nan,7,8,8, np.nan, 5 , np.nan]})
df:
col1
0 5.0
1 6.0
2 NaN
3 NaN
4 NaN
5 4.0
6 NaN
7 NaN
8 NaN
9 NaN
10 7.0
11 8.0
12 8.0
13 NaN
14 5.0
15 NaN
These NaN values should be replaced in the following way. The first selection should look like this.
2 NaN
3 NaN
4 NaN
5 4.0
6 NaN
7 NaN
8 NaN
9 NaN
And then these Nan values should be replace with the only value in that selection, 4.
The second selection is:
13 NaN
14 5.0
15 NaN
and these NaN values should be replaced with 5.
With isnull() you can select the NaN values in a dataframe but how are able to filter/select these specific ranges in pandas?
Solution for when the missing values surround a single non-missing value: create unique groups, then replace within each group by forward- and back-filling:
#test missing values
s = df['col1'].isna()
#create unique groups of consecutive missing/non-missing runs
v = s.ne(s.shift()).cumsum()
#flag runs of exactly one value (the lone number between NaNs) plus all missing values
mask = v.map(v.value_counts()).eq(1) | s
#build the final replacement groups
g = mask.ne(mask.shift()).cumsum()
#transform keeps the result aligned with the original index
df['col2'] = df.groupby(g)['col1'].transform(lambda x: x.ffill().bfill())
print(df)
col1 col2
0 5.0 5.0
1 6.0 6.0
2 NaN 4.0
3 NaN 4.0
4 NaN 4.0
5 4.0 4.0
6 NaN 4.0
7 NaN 4.0
8 NaN 4.0
9 NaN 4.0
10 7.0 7.0
11 8.0 8.0
12 8.0 8.0
13 NaN 5.0
14 5.0 5.0
15 NaN 5.0
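The whole pipeline, collected into one runnable sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [5, 6, np.nan, np.nan, np.nan, 4, np.nan, np.nan,
                            np.nan, np.nan, 7, 8, 8, np.nan, 5, np.nan]})

s = df['col1'].isna()                 # missing-value mask
v = s.ne(s.shift()).cumsum()          # ids of consecutive missing/non-missing runs
# runs of exactly one value (a lone number between NaNs), plus all missing values
mask = v.map(v.value_counts()).eq(1) | s
g = mask.ne(mask.shift()).cumsum()    # final replacement groups
# within each group, the single non-missing value propagates both ways
df['col2'] = df.groupby(g)['col1'].transform(lambda x: x.ffill().bfill())
print(df)
```

Rows 0–1 and 10–12 form their own all-non-missing groups, so ffill/bfill leaves them unchanged.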

Pandas Pivot Table Sort Index Level 1 Not "Sticking"

I know this is a lot, but I really cannot pinpoint what is causing the problem.
Most of this code is just to demonstrate what I'm doing, but the short end of it is:
After reordering columns in a multi-indexed data frame (via
transposing and other methods), calling columns.levels returns the
original sorted levels instead of the new ones.
Given the following:
#Original data frame
import pandas as pd
df = pd.DataFrame(
{'Year':[2012,2012,2012,2012,2012,2012,2013,2013,2013,2013,2013,2013,2014,2014,2014,2014,2014,2014],
'Type':['A','A','B','B','C','C','A','A','B','B','C','C','A','A','B','B','C','C'],
'Org':['a','c','a','b','a','c','a','b','a','c','a','c','a','b','a','c','a','b'],
'Enr':[3,5,3,6,6,4,7,89,5,3,7,34,4,64,3,6,7,44]
})
df.head()
Enr Org Type Year
0 3 a A 2012
1 5 c A 2012
2 3 a B 2012
3 6 b B 2012
4 6 a C 2012
#Pivoted (sortlevel was removed from pandas; sort_index does the same job)
dfp = df.pivot_table(index=['Year'], columns=['Type', 'Org'], aggfunc='sum')\
        .sort_index().sort_index(axis=1)
dfp
dfp
Enr
Type A B C
Org a b c a b c a b c
Year
2012 3.0 NaN 5.0 3.0 6.0 NaN 6.0 NaN 4.0
2013 7.0 89.0 NaN 5.0 NaN 3.0 7.0 NaN 34.0
2014 4.0 64.0 NaN 3.0 NaN 6.0 7.0 44.0 NaN
#Transposed
f=dfp.T
Year 2012 2013 2014
Type Org
Enr A a 3.0 7.0 4.0
b NaN 89.0 64.0
c 5.0 NaN NaN
B a 3.0 5.0 3.0
b 6.0 NaN NaN
c NaN 3.0 6.0
C a 6.0 7.0 7.0
b NaN NaN 44.0
c 4.0 34.0 NaN
#Sort level 2 by last column and transpose back
ab2=f.groupby(level=1)[f.columns[-1]].transform(sum)
ab3=pd.concat([f,ab2],axis=1)
ab4=ab3.sort_values([ab3.columns[-1]],ascending=[0])
ab4=ab4.drop(ab4.columns[-1],axis=1,inplace=False)
g=ab4.T
g
Enr
Type A C B
Org a b c a b c a b c
Year
2012 3.0 NaN 5.0 6.0 NaN 4.0 3.0 6.0 NaN
2013 7.0 89.0 NaN 7.0 NaN 34.0 5.0 NaN 3.0
2014 4.0 64.0 NaN 7.0 44.0 NaN 3.0 NaN 6.0
If you do:
g.Enr.columns.levels
The result is:
FrozenList([['A', 'B', 'C'], ['a', 'b', 'c']])
My question is: Why is it not:
FrozenList([['A', 'C', 'B'], ['a', 'b', 'c']]) ?
I really need it to be the second one.
Thanks in advance!
A MultiIndex stores itself as a set of levels, which are the distinct possible values, and labels (called codes in current pandas), which are integer positions into those levels for the actual labels used. Changing the column order just reshuffles the codes; it does not change the levels themselves.
If you want the levels by the order in which they first appear you could do something like this.
In [61]: c = g.Enr.columns
In [62]: [c.levels[i].take(pd.unique(c.labels[i]))
...: for i in range(len(c.levels))]
Out[62]:
[Index([u'A', u'C', u'B'], dtype='object', name=u'Type'),
Index([u'a', u'b', u'c'], dtype='object', name=u'Org')]
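In current pandas the labels attribute is called codes; the same appearance-order recovery looks like this (a minimal sketch on a toy index):

```python
import pandas as pd

# Columns whose first level appears in the order A, C, B.
cols = pd.MultiIndex.from_tuples(
    [('A', 'a'), ('C', 'a'), ('B', 'a')], names=['Type', 'Org'])
df = pd.DataFrame([[1, 2, 3]], columns=cols)

# .levels always stores the distinct values in sorted order ...
print(list(df.columns.levels[0]))

# ... while the appearance order lives in the integer codes.
appearance = [lvl.take(pd.unique(codes))
              for lvl, codes in zip(df.columns.levels, df.columns.codes)]
print(list(appearance[0]))
```

This prints the sorted level ['A', 'B', 'C'] first, then the appearance order ['A', 'C', 'B'] that the question was after.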