pandas: how to get the row index satisfying a certain condition in a vectorized way?

I have a time-series dataframe containing market prices and order information. Every entry has a corresponding stoploss. For each entry order, I want to find the index of the bar at which its stoploss is triggered. A stop is triggered when the market price >= stoploss, and I want to record which entry order the triggered stop belongs to. Each entry is identified by its entry bar index: for example, the order with entry price 99 at bar 1 is entry order 1, entry price 98 at bar 2 is entry order 2, entry price 103 at bar 5 is entry order 5, and so on.
The original dataframe looks like this:
entry price index entryprice stoploss
0 0 100 0 NaN NaN
1 1 99 1 99.0 102.0
2 1 98 2 98.0 101.0
3 0 100 3 NaN NaN
4 0 101 4 NaN NaN
5 1 103 5 103.0 106.0
6 0 105 6 NaN NaN
7 0 104 7 NaN NaN
8 0 106 8 NaN NaN
9 1 103 9 103.0 106.0
10 0 100 10 NaN NaN
11 0 104 11 NaN NaN
12 0 108 12 NaN NaN
13 0 110 13 NaN NaN
The code is:
import pandas as pd
df = pd.DataFrame(
{'price':[100,99,98,100,101,103,105,104,106,103,100,104,108,110],
'entry': [0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0],})
df['index'] = df.index
df['entryprice'] = df['price'].where(df.entry==1)
df['stoploss'] = df['entryprice'] + 3
To find where the stoploss is triggered for each order, I use apply. I keep an external list, stoplist, holding every stoploss that has not been triggered yet together with the index of its entry order. I then pass each row of df to the function and compare the market price with the stops in stoplist; whenever the condition is met, I assign the entry order index to that row and remove the stop from stoplist.
The code is:
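def Stop(row, stoplist):
    output = None
    # walk the pending stops backwards so that pop(i) does not shift unvisited items
    for i in range(len(stoplist)-1, -1, -1):
        (ix, stop) = stoplist[i]
        if row['price'] >= stop:
            output = ix
            stoplist.pop(i)
    # NaN != None is always True, so test for a real stoploss with pd.notna
    if pd.notna(row['stoploss']):
        stoplist.append( (row['index'], row['stoploss']) )
    return output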
import pandas as pd
df = pd.DataFrame(
{'price':[100,99,98,100,101,103,105,104,106,103,100,104,108,110],
'entry': [0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0],})
df['index'] = df.index
df['entryprice'] = df['price'].where(df.entry==1)
df['stoploss'] = df['entryprice'] + 3
stoplist = []
df['stopix'] = df.apply(lambda row: Stop(row, stoplist), axis=1)
print(df)
The final output is:
entry price index entryprice stoploss stopix
0 0 100 0 NaN NaN NaN
1 1 99 1 99.0 102.0 NaN
2 1 98 2 98.0 101.0 NaN
3 0 100 3 NaN NaN NaN
4 0 101 4 NaN NaN 2.0
5 1 103 5 103.0 106.0 1.0
6 0 105 6 NaN NaN NaN
7 0 104 7 NaN NaN NaN
8 0 106 8 NaN NaN 5.0
9 1 103 9 103.0 106.0 NaN
10 0 100 10 NaN NaN NaN
11 0 104 11 NaN NaN NaN
12 0 108 12 NaN NaN 9.0
13 0 110 13 NaN NaN NaN
The last column, stopix, is what I want. The only problem with this solution is that apply is not very efficient. Is there a vectorized way to do this, or any other solution that improves performance? Efficiency is critical for me.
Thanks

Here's my take:
import numpy as np
# mark the block started by each entry
blocks = df.stoploss.notna().cumsum()
# mark where the price is greater than or equal to the most recent stoploss
higher = df['stoploss'].ffill().le(df.price)
# group higher by entry blocks
g = higher.groupby(blocks)
# index of the entry row that starts each block
idx = g.transform('idxmin')
# keep the entry index only on the first row of each block where higher is True
df['stopix'] = np.where(g.cumsum().eq(1), idx, np.nan)
Output:
entry price index entryprice stoploss stopix
0 0 100 0 NaN NaN NaN
1 1 99 1 99.0 102.0 NaN
2 1 98 2 98.0 101.0 NaN
3 0 100 3 NaN NaN NaN
4 0 101 4 NaN NaN 2.0
5 1 103 5 103.0 106.0 NaN
6 0 105 6 NaN NaN NaN
7 0 104 7 NaN NaN NaN
8 0 106 8 NaN NaN 5.0
9 1 103 9 103.0 106.0 NaN
10 0 100 10 NaN NaN NaN
11 0 104 11 NaN NaN NaN
12 0 108 12 NaN NaN 9.0
13 0 110 13 NaN NaN NaN
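Comparing the two outputs, the grouped version leaves bar 5 empty where the apply version reports entry 1, because each block only tracks the most recent stoploss. Below is a minimal sketch of an alternative, assuming the number of entries is small enough that an entries x bars boolean matrix fits in memory: compare every stop against every later bar with NumPy broadcasting and take the first hit per entry. The names entry_pos, stops, hit and hits are illustrative, not from the posts above. When several stops trigger on the same bar, the earliest entry is kept, which mirrors what the loop does.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'price': [100, 99, 98, 100, 101, 103, 105, 104, 106, 103, 100, 104, 108, 110],
    'entry': [0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0],
})
df['stoploss'] = df['price'].where(df['entry'].eq(1)) + 3

price = df['price'].to_numpy()
bars = np.arange(len(df))

# positions of the entry bars and their stop levels
entry_pos = np.flatnonzero(df['stoploss'].notna().to_numpy())
stops = df['stoploss'].to_numpy()[entry_pos]

# hit[i, j] is True when bar j comes after entry i and price[j] >= stop of entry i
hit = (price[None, :] >= stops[:, None]) & (bars[None, :] > entry_pos[:, None])

triggered = hit.any(axis=1)     # entries whose stop is reached at all
first_hit = hit.argmax(axis=1)  # first bar where each stop is reached

# map trigger bar -> entry bar; keep the earliest entry if several share a bar
hits = pd.Series(entry_pos[triggered], index=first_hit[triggered])
df['stopix'] = hits.groupby(level=0).min().reindex(bars).to_numpy()

print(df)
```

This trades memory for speed, so for very long series with many entries it may be worth processing the entries in chunks.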

Related

Create a new ID column based on conditions in other column using pandas

I am trying to create a new column 'ID' that gives a unique ID to each run of non-NaN values in the 'Data' column. If non-null values are adjacent to each other, the ID stays the same. The Id column below shows what the final result should look like. Could anyone guide me on this?
Id Data
0 NaN
0 NaN
0 NaN
1 54
1 55
0 NaN
0 NaN
2 67
0 NaN
0 NaN
3 33
3 44
3 22
0 NaN
Group by the cumsum to get the consecutive groups, using where to mask the NaN rows; .ngroup then numbers the groups consecutively. This is also possible with rank.
s = df.Data.isnull().cumsum().where(df.Data.notnull())
df['ID'] = df.groupby(s).ngroup()+1
# df['ID'] = s.rank(method='dense').fillna(0).astype(int)
Output:
Data ID
0 NaN 0
1 NaN 0
2 NaN 0
3 54.0 1
4 55.0 1
5 NaN 0
6 NaN 0
7 67.0 2
8 NaN 0
9 NaN 0
10 33.0 3
11 44.0 3
12 22.0 3
13 NaN 0
Using factorize
v=pd.factorize(df.Data.isnull().cumsum()[df.Data.notnull()])[0]+1
df.loc[df.Data.notnull(),'Newid']=v
df['Newid'] = df['Newid'].fillna(0)
df
Id Data Newid
0 0 NaN 0.0
1 0 NaN 0.0
2 0 NaN 0.0
3 1 54.0 1.0
4 1 55.0 1.0
5 0 NaN 0.0
6 0 NaN 0.0
7 2 67.0 2.0
8 0 NaN 0.0
9 0 NaN 0.0
10 3 33.0 3.0
11 3 44.0 3.0
12 3 22.0 3.0
13 0 NaN 0.0
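For comparison, here is a self-contained sketch of a third variant that does not depend on how groupby treats NaN keys: flag the first row of every non-null run with shift, then take a cumulative sum of those flags. The names valid and run_start are illustrative, and the sample data is reconstructed from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Data': [np.nan, np.nan, np.nan, 54, 55, np.nan, np.nan,
                            67, np.nan, np.nan, 33, 44, 22, np.nan]})

valid = df['Data'].notna()
# a run starts where the current row is non-null and the previous row is not
run_start = valid & ~valid.shift(fill_value=False)
# count the runs seen so far, and reset the NaN rows to 0
df['ID'] = run_start.cumsum().where(valid, 0)

print(df)
```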

Applying multiple functions to a pivot table (grouped) dataframe

I currently have a dataframe which looks like this:
df:
store item sales
0 1 1 10
1 1 2 20
2 2 1 10
3 3 2 20
4 4 3 10
5 3 4 15
...
I wanted to view the total sales of each item for each store, so I used a pivot table to create this:
p_table = pd.pivot_table(df, index='store', values='sales', columns='item', aggfunc=np.sum)
which gives something like:
sales
item 1 2 3 4
store
1 20 30 10 8
2 10 14 12 13
3 1 23 29 10
....
What I want to do now is apply a function so that each item's total sales is expressed as a percentage of the total sales for that store. For example, the value for item 1 at store 1 would become:
1. 20/(20+30+10+8) * 100
I am struggling to do this for a stacked dataframe. Any suggestions would be much appreciated.
Thanks
I think you need to divide with div by a Series created by sum (multiply the result by 100 to get percentages):
print (p_table)
item 1 2 3 4
store
1 10.0 20.0 NaN NaN
2 10.0 NaN NaN NaN
3 NaN 20.0 NaN 15.0
4 NaN NaN 10.0 NaN
print (p_table.sum(axis=1))
store
1 30.0
2 10.0
3 35.0
4 10.0
dtype: float64
out = p_table.div(p_table.sum(axis=1), axis=0)
print (out)
item 1 2 3 4
store
1 0.333333 0.666667 NaN NaN
2 1.000000 NaN NaN NaN
3 NaN 0.571429 NaN 0.428571
4 NaN NaN 1.0 NaN
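Put together as one runnable sketch, using the six sample rows shown in the question and scaling to percentages as the question asks. The variable name pct is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'store': [1, 1, 2, 3, 4, 3],
    'item':  [1, 2, 1, 2, 3, 4],
    'sales': [10, 20, 10, 20, 10, 15],
})

p_table = pd.pivot_table(df, index='store', columns='item',
                         values='sales', aggfunc='sum')

# divide every row by its own total (axis=0 aligns the sums on the store index)
pct = p_table.div(p_table.sum(axis=1), axis=0).mul(100)

print(pct.round(2))
```

Store/item combinations with no sales stay NaN; chaining .fillna(0) would turn them into 0 percent if that reads better.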

Concatenating dataframe that have different number of rows

I have a dataframe df = df[['A', 'B', 'C']] with 3 columns and 2000 rows.
Then I have another set of data with only 200 rows.
How can I add this as df['D'] so that these 200 rows appear only at the tail of the 2000 rows?
That is, the first 1800 rows of df['D'] should be NaN and the last 200 rows should hold the values.
I have been trying various ways without success. Thank you.
The data with 200 rows is in this format:
[[ 0.43628979]
[ 0.43454027]
[ 0.43552566]
[ 0.43542767]
[ 0.43331838]
...
I believe you need join, after changing the index of the second dataframe to the last index values of df1:
np.random.seed(100)
df1 = pd.DataFrame(np.random.randint(10, size=(20,3)), columns=list('ABC'))
print (df1)
A B C
0 8 8 3
1 7 7 0
2 4 2 5
3 2 2 2
4 1 0 8
5 4 0 9
6 6 2 4
7 1 5 3
8 4 4 3
9 7 1 1
10 7 7 0
11 2 9 9
12 3 2 5
13 8 1 0
14 7 6 2
15 0 8 2
16 5 1 8
17 1 5 4
18 2 8 3
19 5 0 9
df2 = pd.DataFrame(np.random.randint(10, size=(2,5)), columns=list('werty'))
print (df2)
w e r t y
0 3 6 3 4 7
1 6 3 9 0 4
df2.index = df1.index[-len(df2.index):]
df = df1.join(df2)
print (df)
A B C w e r t y
0 8 8 3 NaN NaN NaN NaN NaN
1 7 7 0 NaN NaN NaN NaN NaN
2 4 2 5 NaN NaN NaN NaN NaN
3 2 2 2 NaN NaN NaN NaN NaN
4 1 0 8 NaN NaN NaN NaN NaN
5 4 0 9 NaN NaN NaN NaN NaN
6 6 2 4 NaN NaN NaN NaN NaN
7 1 5 3 NaN NaN NaN NaN NaN
8 4 4 3 NaN NaN NaN NaN NaN
9 7 1 1 NaN NaN NaN NaN NaN
10 7 7 0 NaN NaN NaN NaN NaN
11 2 9 9 NaN NaN NaN NaN NaN
12 3 2 5 NaN NaN NaN NaN NaN
13 8 1 0 NaN NaN NaN NaN NaN
14 7 6 2 NaN NaN NaN NaN NaN
15 0 8 2 NaN NaN NaN NaN NaN
16 5 1 8 NaN NaN NaN NaN NaN
17 1 5 4 NaN NaN NaN NaN NaN
18 2 8 3 3.0 6.0 3.0 4.0 7.0
19 5 0 9 6.0 3.0 9.0 0.0 4.0
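The same index trick applied to the shape described in the question, as a minimal sketch: the zero-filled frame and the random (200, 1) array are stand-ins for the real data, which is not shown. Assigning a Series to a column aligns on the index, so rows without a matching label become NaN.

```python
import numpy as np
import pandas as pd

# stand-ins for the real data (assumed shapes: 2000 x 3 and 200 x 1)
df = pd.DataFrame(np.zeros((2000, 3)), columns=['A', 'B', 'C'])
data = np.random.default_rng(0).random((200, 1))

# label the 200 values with the last 200 index labels of df, then assign
df['D'] = pd.Series(data.ravel(), index=df.index[-len(data):])

print(df['D'].isna().sum())
print(df.tail())
```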

Using scalar values in series as variables in user defined function

I want to define a function that is applied element wise for each row in a dataframe, comparing each element to a scalar value in a separate series. I started with the function below.
def greater_than(array, value):
g = array[array >= value].count(axis=1)
return g
But it is applying the mask along axis 0 and I need it to apply it along axis 1. What can I do?
e.g.
In [3]: df = pd.DataFrame(np.arange(16).reshape(4,4))
In [4]: df
Out[4]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
In [26]: s
Out[26]: array([ 1, 1000, 1000, 1000])
In [25]: greater_than(df,s)
Out[25]:
0 0
1 1
2 1
3 1
dtype: int64
In [27]: g = df[df >= s]
In [28]: g
Out[28]:
0 1 2 3
0 NaN NaN NaN NaN
1 4.0 NaN NaN NaN
2 8.0 NaN NaN NaN
3 12.0 NaN NaN NaN
The result should look like:
In [29]: greater_than(df,s)
Out[29]:
0 3
1 0
2 0
3 0
dtype: int64
as 1,2, & 3 are all >= 1 and none of the remaining values are greater than or equal to 1000.
Your best bet may be to do some transposes (no copies are made, if that's a concern)
In [164]: df = pd.DataFrame(np.arange(16).reshape(4,4))
In [165]: s = np.array([ 1, 1000, 1000, 1000])
In [171]: df.T[(df.T>=s)].T
Out[171]:
0 1 2 3
0 NaN 1.0 2.0 3.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
In [172]: df.T[(df.T>=s)].T.count(axis=1)
Out[172]:
0 3
1 0
2 0
3 0
dtype: int64
You can also just sum the mask directly, if the count is all you're after.
In [173]: (df.T>=s).sum(axis=0)
Out[173]:
0 3
1 0
2 0
3 0
dtype: int64
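A sketch of an alternative without transposes, assuming s holds one threshold per row: wrap it in a Series that shares the index of df and let DataFrame.ge align it along axis 0. The helper name thresholds is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4))
s = np.array([1, 1000, 1000, 1000])

# one threshold per row, labelled with the row index of df
thresholds = pd.Series(s, index=df.index)

# compare each row against its own threshold, then count the True values per row
result = df.ge(thresholds, axis=0).sum(axis=1)

print(result)
```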

Boxplot with pandas and groupby

I have the following dataset sample:
0 1
0 0 0.040158
1 2 0.500642
2 0 0.005694
3 1 0.065052
4 0 0.034789
5 2 0.128495
6 1 0.088816
7 1 0.056725
8 0 -0.000193
9 2 -0.070252
10 2 0.138282
11 2 0.054638
12 2 0.039994
13 2 0.060659
14 0 0.038562
I need a box-and-whisker plot, grouped by column 0. I have the following:
plt.figure()
grouped = df.groupby(0)
grouped.boxplot(column=1)
plt.savefig('plot.png')
But I end up with three subplots. How can I place all three on one plot?
Thanks.
In pandas 0.16.0, you could simply do this:
df.boxplot(by='0')
I don't believe you need to use groupby.
df2 = df.pivot(columns=df.columns[0], index=df.index)
df2.columns = df2.columns.droplevel()
>>> df2
0 0 1 2
0 0.040158 NaN NaN
1 NaN NaN 0.500642
2 0.005694 NaN NaN
3 NaN 0.065052 NaN
4 0.034789 NaN NaN
5 NaN NaN 0.128495
6 NaN 0.088816 NaN
7 NaN 0.056725 NaN
8 -0.000193 NaN NaN
9 NaN NaN -0.070252
10 NaN NaN 0.138282
11 NaN NaN 0.054638
12 NaN NaN 0.039994
13 NaN NaN 0.060659
14 0.038562 NaN NaN
df2.boxplot()