Using scalar values in a Series as variables in a user-defined function - pandas

I want to define a function that is applied element-wise to each row in a dataframe, comparing each element to a scalar value held in a separate series. I started with the function below.
def greater_than(array, value):
    g = array[array >= value].count(axis=1)
    return g
But it is applying the mask along axis 0, and I need it applied along axis 1. What can I do?
e.g.
In [3]: df = pd.DataFrame(np.arange(16).reshape(4,4))
In [4]: df
Out[4]:
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
In [26]: s
Out[26]: array([ 1, 1000, 1000, 1000])
In [25]: greater_than(df,s)
Out[25]:
0    0
1    1
2    1
3    1
dtype: int64
In [27]: g = df[df >= s]
In [28]: g
Out[28]:
      0   1   2   3
0   NaN NaN NaN NaN
1   4.0 NaN NaN NaN
2   8.0 NaN NaN NaN
3  12.0 NaN NaN NaN
The result should look like:
In [29]: greater_than(df,s)
Out[29]:
0    3
1    0
2    0
3    0
dtype: int64
since 1, 2, and 3 are all >= 1, and none of the remaining values are greater than or equal to 1000.

Your best bet may be to do some transposes (no copies are made, if that's a concern). Comparing a DataFrame with a 1-D array broadcasts the array across the columns, which is why your mask was applied along the wrong axis; transposing first lines s up with the rows instead:
In [164]: df = pd.DataFrame(np.arange(16).reshape(4,4))
In [165]: s = np.array([ 1, 1000, 1000, 1000])
In [171]: df.T[(df.T>=s)].T
Out[171]:
    0    1    2    3
0 NaN  1.0  2.0  3.0
1 NaN  NaN  NaN  NaN
2 NaN  NaN  NaN  NaN
3 NaN  NaN  NaN  NaN
In [172]: df.T[(df.T>=s)].T.count(axis=1)
Out[172]:
0    3
1    0
2    0
3    0
dtype: int64
You can also just sum the mask directly, if the count is all you're after.
In [173]: (df.T>=s).sum(axis=0)
Out[173]:
0    3
1    0
2    0
3    0
dtype: int64
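A transpose-free alternative (a sketch; DataFrame.ge with axis=0 broadcasts s down the rows, so each row i is compared against s[i]):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4))
s = np.array([1, 1000, 1000, 1000])

# ge(..., axis=0) compares row i of df against s[i];
# summing the boolean mask along axis 1 counts matches per row.
print(df.ge(s, axis=0).sum(axis=1))
# 0    3
# 1    0
# 2    0
# 3    0
# dtype: int64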

Related

How to do pd.fillna() with condition

I am trying to do a fillna with an if condition:
import pandas as pd
df = pd.DataFrame(data={'a':[1,None,3,None],'b':[4,None,None,None]})
print(df)
df['b'].fillna(value=0, inplace=True)  # but only where df['a'] is NaN — how?
print(df)
     a    b
0    1    4
1  NaN  NaN
2    3  NaN
3  NaN  NaN
# What I want to achieve:
     a    b
0    1    4
1  NaN    0
2    3  NaN
3  NaN    0
Please help
You can chain both conditions for testing missing values with & for bitwise AND, and then replace those values with 0:
df.loc[df.a.isna() & df.b.isna(), 'b'] = 0
#alternative
df.loc[df[['a', 'b']].isna().all(axis=1), 'b'] = 0
print(df)
     a    b
0  1.0  4.0
1  NaN  0.0
2  3.0  NaN
3  NaN  0.0
Or you can use fillna with one condition:
df.loc[df.a.isna(), 'b'] = df.b.fillna(0)
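A third option (my sketch, not from the original answer) keeps everything in one expression with Series.mask, which replaces values where a condition holds:
import pandas as pd

df = pd.DataFrame(data={'a': [1, None, 3, None], 'b': [4, None, None, None]})

# replace b with 0 wherever both a and b are missing; leave b untouched elsewhere
df['b'] = df['b'].mask(df['a'].isna() & df['b'].isna(), 0)
print(df)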

In pandas replace consecutive 0s with NaN

I want to clean some data by replacing only CONSECUTIVE 0s in a DataFrame with NaN.
Given:
import pandas as pd
import numpy as np

d = [[1, np.nan, 3, 4], [2, 0, 0, np.nan], [3, np.nan, 0, 0], [4, np.nan, 0, 0]]
df = pd.DataFrame(d, columns=['a', 'b', 'c', 'd'])
df
   a    b  c    d
0  1  NaN  3  4.0
1  2  0.0  0  NaN
2  3  NaN  0  0.0
3  4  NaN  0  0.0
The desired result should be:
   a    b    c    d
0  1  NaN    3  4.0
1  2  0.0  NaN  NaN
2  3  NaN  NaN  NaN
3  4  NaN  NaN  NaN
where columns c & d are affected but column b is NOT affected, as it only has one zero (and not consecutive 0s).
I have experimented with this answer:
Replacing more than n consecutive values in Pandas DataFrame column
which is along the right lines, but that solution keeps the first 0 in a given column, which is not desired in my case.
Let us do shift with mask:
df = df.mask((df.shift().eq(df) | df.eq(df.shift(-1))) & (df == 0))
Out[469]:
   a    b    c    d
0  1  NaN  3.0  4.0
1  2  0.0  NaN  NaN
2  3  NaN  NaN  NaN
3  4  NaN  NaN  NaN
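To unpack the condition, here is a step-by-step sketch of the same mask logic (illustrative only; it rebuilds the frame from the question):
import numpy as np
import pandas as pd

d = [[1, np.nan, 3, 4], [2, 0, 0, np.nan], [3, np.nan, 0, 0], [4, np.nan, 0, 0]]
df = pd.DataFrame(d, columns=['a', 'b', 'c', 'd'])

# a zero is "consecutive" if it equals the value directly above or below it
is_zero = df.eq(0)
same_as_prev = df.shift().eq(df)      # equal to the row above
same_as_next = df.eq(df.shift(-1))    # equal to the row below
print(df.mask((same_as_prev | same_as_next) & is_zero))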

How to build a window over positive and negative number ranges in a dataframe column?

I would like to get the average value and the max value of every positive and negative range.
From sample data below:
import pandas as pd
test_list = [-1, -2, -3, -2, -1, 1, 2, 3, 2, 1, -1, -4, -5, 2, 4, 7]
df_test = pd.DataFrame(test_list, columns=['value'])
Which gives me a dataframe like this:
    value
0      -1
1      -2
2      -3
3      -2
4      -1
5       1
6       2
7       3
8       2
9       1
10     -1
11     -4
12     -5
13      2
14      4
15      7
I would like to get something like this:
AVG1 = (-1 + -2 + -3 + -2 + -1) / 5 = -1.8
Max1 = -3
AVG2 = (1 + 2 + 3 + 2 + 1) / 5 = 1.8
Max2 = 3
AVG3 = (2 + 4 + 7) / 3 ≈ 4.3
Max3 = 7
If solution need new column or new dataframe, that is ok for me.
I know that I can use .mean like here:
pandas get column average/mean with round value
But this solution gives me the average of all positive and all negative values together.
How can I build some kind of window so that I can calculate the average of the first negative group, then of the following positive group, and so on?
Regards
You can create a Series with np.sign to distinguish positive and negative values, compare it with its shifted values and take the cumulative sum to label consecutive groups, and then aggregate mean and max:
s = np.sign(df_test['value'])
g = s.ne(s.shift()).cumsum()
df = df_test.groupby(g)['value'].agg(['mean','max'])
print(df)
           mean  max
value
1     -1.800000   -1
2      1.800000    3
3     -3.333333   -1
4      4.333333    7
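To see how the grouping key behaves, this sketch prints the intermediate Series (rebuilding the sample frame so it runs standalone):
import numpy as np
import pandas as pd

test_list = [-1, -2, -3, -2, -1, 1, 2, 3, 2, 1, -1, -4, -5, 2, 4, 7]
df_test = pd.DataFrame(test_list, columns=['value'])

s = np.sign(df_test['value'])    # -1, 0, or 1 per row
g = s.ne(s.shift()).cumsum()     # increments whenever the sign flips
print(g.tolist())
# [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4]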
EDIT:
To find the local extremes, the solution from this answer is used:
test_list = [-1, -2, -3, -2, -1, 1, 2, 3, 2, 1, -1, -4, -5, 2, 4, 7]
df_test = pd.DataFrame(test_list, columns=['value'])
from scipy.signal import argrelextrema
#https://stackoverflow.com/a/50836425
n=2 # number of points to be checked before and after
# Find local peaks
df_test['min'] = df_test.iloc[argrelextrema(df_test.value.values, np.less_equal, order=n)[0]]['value']
df_test['max'] = df_test.iloc[argrelextrema(df_test.value.values, np.greater_equal, order=n)[0]]['value']
Then values after the extremes are replaced with missing values, separately for negative and positive groups:
s = np.sign(df_test['value'])
g = s.ne(s.shift()).cumsum()
df_test[['min1','max1']] = df_test[['min','max']].notna().astype(int).iloc[::-1].groupby(g[::-1]).cumsum()
df_test['min1'] = df_test['min1'].where(s.eq(-1) & df_test['min1'].ne(0))
df_test['max1'] = df_test['max1'].where(s.eq(1) & df_test['max1'].ne(0))
df_test['g'] = g
print(df_test)
    value  min  max  min1  max1  g
0      -1  NaN -1.0   1.0   NaN  1
1      -2  NaN  NaN   1.0   NaN  1
2      -3 -3.0  NaN   1.0   NaN  1
3      -2  NaN  NaN   NaN   NaN  1
4      -1  NaN  NaN   NaN   NaN  1
5       1  NaN  NaN   NaN   1.0  2
6       2  NaN  NaN   NaN   1.0  2
7       3  NaN  3.0   NaN   1.0  2
8       2  NaN  NaN   NaN   NaN  2
9       1  NaN  NaN   NaN   NaN  2
10     -1  NaN  NaN   1.0   NaN  3
11     -4  NaN  NaN   1.0   NaN  3
12     -5 -5.0  NaN   1.0   NaN  3
13      2  NaN  NaN   NaN   1.0  4
14      4  NaN  NaN   NaN   1.0  4
15      7  NaN  7.0   NaN   1.0  4
So it is possible to separately aggregate the last 3 values per group with a lambda function and mean; rows with missing values in min1 or max1 are removed by default by groupby:
df1 = df_test.groupby(['g','min1'])['value'].agg(lambda x: x.tail(3).mean())
print(df1)
g  min1
1  1.0    -2.000000
3  1.0    -3.333333
Name: value, dtype: float64
df2 = df_test.groupby(['g','max1'])['value'].agg(lambda x: x.tail(3).mean())
print(df2)
g  max1
2  1.0    2.000000
4  1.0    4.333333
Name: value, dtype: float64
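If a single Series covering all four groups is wanted, one possible way to combine the two aggregations (my sketch, not part of the original answer):
# drop the helper index level from each result and interleave by group number
out = pd.concat([df1.droplevel('min1'), df2.droplevel('max1')]).sort_index()
print(out)
# g
# 1   -2.000000
# 2    2.000000
# 3   -3.333333
# 4    4.333333
# Name: value, dtype: float64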

pandas how to get row index satisfying certain condition in a vectorized way?

I have a time-series dataframe containing market prices and order information. For every entry there is a corresponding stoploss. I want to find, for each entry order, the index of the bar at which its stoploss is triggered: if the market price >= stoploss, the stop is triggered, and I want to record which entry order that stop belongs to. Each entry is identified by its entry bar index; for example, the order with entry price 99 at bar 1 is entry order 1, entry price 98 at bar 2 is entry order 2, entry price 103 at bar 5 is entry order 5, and so on.
The original dataframe is like:
    entry  price  index  entryprice  stoploss
0       0    100      0         NaN       NaN
1       1     99      1        99.0     102.0
2       1     98      2        98.0     101.0
3       0    100      3         NaN       NaN
4       0    101      4         NaN       NaN
5       1    103      5       103.0     106.0
6       0    105      6         NaN       NaN
7       0    104      7         NaN       NaN
8       0    106      8         NaN       NaN
9       1    103      9       103.0     106.0
10      0    100     10         NaN       NaN
11      0    104     11         NaN       NaN
12      0    108     12         NaN       NaN
13      0    110     13         NaN       NaN
The code is:
import pandas as pd

df = pd.DataFrame(
    {'price': [100, 99, 98, 100, 101, 103, 105, 104, 106, 103, 100, 104, 108, 110],
     'entry': [0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]})
df['index'] = df.index
df['entryprice'] = df['price'].where(df.entry == 1)
df['stoploss'] = df['entryprice'] + 3
In order to find out where the stoploss is triggered for each order, I do it with apply. I define an external variable stoplist that records every stoploss order, together with its entry order index, that has not been triggered yet. I pass each row of df to the function and compare the market price with every stoploss in stoplist; whenever the condition is met, I assign that entry order index to the row and remove the stop from stoplist.
The code is:
def Stop(row, stoplist):
    output = None
    # iterate backwards so pop(i) does not shift the items still to check
    for i in range(len(stoplist) - 1, -1, -1):
        (ix, stop) = stoplist[i]
        if row['price'] >= stop:
            output = ix
            stoplist.pop(i)
    # register a new stop only on entry rows (stoploss is NaN elsewhere)
    if pd.notna(row['stoploss']):
        stoplist.append((row['index'], row['stoploss']))
    return output
import pandas as pd

df = pd.DataFrame(
    {'price': [100, 99, 98, 100, 101, 103, 105, 104, 106, 103, 100, 104, 108, 110],
     'entry': [0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]})
df['index'] = df.index
df['entryprice'] = df['price'].where(df.entry == 1)
df['stoploss'] = df['entryprice'] + 3

stoplist = []
df['stopix'] = df.apply(lambda row: Stop(row, stoplist), axis=1)
print(df)
The final output is:
    entry  price  index  entryprice  stoploss  stopix
0       0    100      0         NaN       NaN     NaN
1       1     99      1        99.0     102.0     NaN
2       1     98      2        98.0     101.0     NaN
3       0    100      3         NaN       NaN     NaN
4       0    101      4         NaN       NaN     2.0
5       1    103      5       103.0     106.0     1.0
6       0    105      6         NaN       NaN     NaN
7       0    104      7         NaN       NaN     NaN
8       0    106      8         NaN       NaN     5.0
9       1    103      9       103.0     106.0     NaN
10      0    100     10         NaN       NaN     NaN
11      0    104     11         NaN       NaN     NaN
12      0    108     12         NaN       NaN     9.0
13      0    110     13         NaN       NaN     NaN
The last column, stopix, is what I want. The only problem with this solution is that apply is not very efficient, and I am wondering whether there is a vectorized way to do this, or any other solution that would boost performance, because efficiency is critical to me.
Thanks
Here's my take:
import numpy as np

# mark the block starting at each entry
blocks = df.stoploss.notna().cumsum()
# mark where the price is higher than or equal to the forward-filled stoploss
higher = df['stoploss'].ffill().le(df.price)
# group the trigger flags by entry blocks
g = higher.groupby(blocks)
# idxmin returns the position of the first False in each block (the entry row here)
idx = g.transform('idxmin')
# write idx only where the running count of triggers within the block is exactly 1
df['stopix'] = np.where(g.cumsum().eq(1), idx, np.nan)
Output:
    entry  price  index  entryprice  stoploss  stopix
0       0    100      0         NaN       NaN     NaN
1       1     99      1        99.0     102.0     NaN
2       1     98      2        98.0     101.0     NaN
3       0    100      3         NaN       NaN     NaN
4       0    101      4         NaN       NaN     2.0
5       1    103      5       103.0     106.0     NaN
6       0    105      6         NaN       NaN     NaN
7       0    104      7         NaN       NaN     NaN
8       0    106      8         NaN       NaN     5.0
9       1    103      9       103.0     106.0     NaN
10      0    100     10         NaN       NaN     NaN
11      0    104     11         NaN       NaN     NaN
12      0    108     12         NaN       NaN     9.0
13      0    110     13         NaN       NaN     NaN
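To make the grouping concrete, here are the intermediate series for the sample frame (a sketch reusing blocks and higher from the snippet above):
print(blocks.tolist())
# [0, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4]
print(higher.tolist())
# [False, False, False, False, True, False, False, False, True,
#  False, False, False, True, True]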

pandas diff within successive groups

d = pd.DataFrame({'a':[7,6,3,4,8], 'b':['c','c','d','d','c']})
d.groupby('b')['a'].diff()
Gives me
0    NaN
1   -1.0
2    NaN
3    1.0
4    2.0
What I'd need
0    NaN
1   -1.0
2    NaN
3    1.0
4    NaN
This is the difference between successive values within a group only, so when a group appears after another group, its previous values are ignored.
In my example, the last c value starts a new c group.
You would need to group by consecutive segments:
In [1055]: d.groupby((d.b != d.b.shift()).cumsum())['a'].diff()
Out[1055]:
0    NaN
1   -1.0
2    NaN
3    1.0
4    NaN
Name: a, dtype: float64
Details
In [1056]: (d.b != d.b.shift()).cumsum()
Out[1056]:
0    1
1    1
2    2
3    2
4    3
Name: b, dtype: int32
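If you want the result as a new column, a minimal usage sketch:
# assign the per-segment diff back to the frame
d['diff'] = d.groupby((d.b != d.b.shift()).cumsum())['a'].diff()
print(d)
#    a  b  diff
# 0  7  c   NaN
# 1  6  c  -1.0
# 2  3  d   NaN
# 3  4  d   1.0
# 4  8  c   NaN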