How to count how many times a value changes in a column - pandas

What is the easiest way to count how many value changes there are in a specific DataFrame column? For example, I have the following DataFrame:
a b
0 1
1 1
2 1
3 2
4 1
5 2
6 2
7 3
8 3
9 3
In this DataFrame the values in column b have changed 4 times (at rows 4, 5, 6 and 8).
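For reference, the sample frame can be built like this (a minimal sketch; here a is kept as an ordinary column, matching the table above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(10),
                   'b': [1, 1, 1, 2, 1, 2, 2, 3, 3, 3]})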
My very simple solution is:
a = 0
for i in range(df.shape[0] - 1):
    if df['b'].iloc[i] != df['b'].iloc[i+1]:
        a += 1

You need boolean indexing, then take the index:
idx = df.index[df['b'].diff().shift().fillna(0).ne(0)]
print (idx)
Int64Index([4, 5, 6, 8], dtype='int64')
For a more general solution it is possible to index an arange:
a = np.arange(len(df))[df['b'].diff().shift().bfill().ne(0)].tolist()
print (a)
[4, 5, 6, 8]
Explanation:
First get the differences with Series.diff:
print (df['b'].diff())
0 NaN
1 0.0
2 0.0
3 1.0
4 -1.0
5 1.0
6 0.0
7 1.0
8 0.0
9 0.0
Name: b, dtype: float64
Then shift down by one row:
print (df['b'].diff().shift())
0 NaN
1 NaN
2 0.0
3 0.0
4 1.0
5 -1.0
6 1.0
7 0.0
8 1.0
9 0.0
Name: b, dtype: float64
Replace the first NaNs with 0 using fillna:
print (df['b'].diff().shift().fillna(0))
0 0.0
1 0.0
2 0.0
3 0.0
4 1.0
5 -1.0
6 1.0
7 0.0
8 1.0
9 0.0
Name: b, dtype: float64
And finally compare for not equal to 0:
print (df['b'].diff().shift().fillna(0).ne(0))
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 False
8 True
9 False
Name: b, dtype: bool

If a is a column and not the index:
idx = df['a'].loc[df['b'].diff().shift().fillna(0) != 0]
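If you only need the count of changes rather than the row labels, you can sum the boolean mask directly; a minimal sketch using the same diff idea:
count = df['b'].diff().fillna(0).ne(0).sum()
print (count)
4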

How to fill nans with multiple if-else conditions?

I have a dataset:
value score
0 0.0 8
1 0.0 7
2 NaN 4
3 1.0 11
4 2.0 22
5 NaN 12
6 0.0 4
7 NaN 15
8 0.0 5
9 2.0 24
10 1.0 12
11 1.0 15
12 0.0 5
13 2.0 26
14 NaN 28
There are some NaNs in it. I want to fill those NaNs with these conditions:
If 'score' is less than 10, then fill nan with 0.0
If 'score' is between 10 and 20, then fill nan with 1.0
If 'score' is greater than 20, then fill nan with 2.0
How do I do this in pandas?
Here is code to build the example dataframe:
import numpy as np
import pandas as pd

value = [0,0,np.nan,1,2,np.nan,0,np.nan,0,2,1,1,0,2,np.nan]
score = [8,7,4,11,22,12,4,15,5,24,12,15,5,26,28]
df = pd.DataFrame({'value': value, 'score': score})
Use cut, then fillna:
df['value'] = df['value'].fillna(pd.cut(df['score'],
                                        [-np.inf, 10, 20, np.inf],
                                        labels=[0, 1, 2]).astype(int))
df
Out[6]:
value score
0 0.0 8
1 0.0 7
2 0.0 4
3 1.0 11
4 2.0 22
5 1.0 12
6 0.0 4
7 1.0 15
8 0.0 5
9 2.0 24
10 1.0 12
11 1.0 15
12 0.0 5
13 2.0 26
14 2.0 28
You could use numpy.select with conditions for <10, 10≤score<20, and so on, but a more efficient version is to use floor division, so that values below 10 become 0, values from 10 up to 20 become 1, etc.
df['value'] = df['value'].fillna(df['score'].floordiv(10))
with numpy.select:
df['value'] = df['value'].fillna(np.select([df['score'].lt(10),
                                            df['score'].between(10, 20),
                                            df['score'].ge(20)],
                                           [0, 1, 2]))
output:
value score
0 0.0 8
1 0.0 7
2 0.0 4
3 1.0 11
4 2.0 22
5 1.0 12
6 0.0 4
7 1.0 15
8 0.0 5
9 2.0 24
10 1.0 12
11 1.0 15
12 0.0 5
13 2.0 26
14 2.0 28
use np.select or pd.cut to map the intervals to values, then fillna:
mapping = np.select((df['score'] < 10, df['score'] > 20),
                    (0, 2), 1)
df['value'] = df['value'].fillna(mapping)
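For completeness, here is a sketch of the pd.cut variant of that same mapping; note that the right-closed bins put a score of exactly 10 into the first bin and exactly 20 into the middle one:
cut_mapping = pd.cut(df['score'], [-np.inf, 10, 20, np.inf], labels=[0, 1, 2])
df['value'] = df['value'].fillna(cut_mapping.astype(float))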

mode returns Exception: Must produce aggregated value

for this dataframe
values ii
0 3.0 4
1 0.0 1
2 3.0 8
3 2.0 5
4 2.0 1
5 3.0 5
6 2.0 4
7 1.0 8
8 0.0 5
9 1.0 1
This line raises "Exception: Must produce aggregated value":
bii2=df.groupby(['ii'])['values'].agg(pd.Series.mode)
While this line works
bii3=df.groupby('ii')['values'].agg(lambda x: pd.Series.mode(x)[0])
Could you explain why that is?
The problem is that mode sometimes returns 2 or more values; check the output with GroupBy.apply:
bii2=df.groupby(['ii'])['values'].apply(pd.Series.mode)
print (bii2)
ii
1 0 0.0
1 1.0
2 2.0
4 0 2.0
1 3.0
5 0 0.0
1 2.0
2 3.0
8 0 1.0
1 3.0
Name: values, dtype: float64
And pandas agg needs a scalar output, so it raises the error. If you select only the first value, it works nicely:
bii3=df.groupby('ii')['values'].agg(lambda x: pd.Series.mode(x).iat[0])
print (bii3)
ii
1 0.0
4 2.0
5 0.0
8 1.0
Name: values, dtype: float64
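If you would rather keep the ties than pick the first value, one option (a sketch) is to aggregate each group's modes into a list, which agg treats as a single value:
bii4 = df.groupby('ii')['values'].agg(lambda x: pd.Series.mode(x).tolist())
print (bii4)
ii
1    [0.0, 1.0, 2.0]
4         [2.0, 3.0]
5    [0.0, 2.0, 3.0]
8         [1.0, 3.0]
Name: values, dtype: object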

how to get the correct rows with certain restrictions in pandas?

I want to extract the correct rows based on certain conditions.
The dataframe contains a column entry with the entry signals.
A valid entry occurs only when there is no order in the market; therefore only the first of two consecutive signals is valid.
A valid exit is 5 bars after the entry.
Here is my code and dataframe
import numpy as np
import pandas as pd
df = pd.DataFrame({'entry':[0,1,0,1,0,0,0,1,0,0,0,0,0,0]})
df['exit'] = df['entry'].shift(5)
df['state'] = np.select([df['entry'] == 1, df['exit'] == 1], [1, 0], default=np.nan)
df['state'].ffill(inplace=True)
df['state'].fillna(value=0, inplace=True)
df['change'] = df['state'].diff()
print(df)
entrysig = df[df['change'].eq(1)]
exitsig = df[df['change'].eq(-1)]
tradelist = pd.DataFrame({'entry': entrysig.index, 'exit': exitsig.index})
tradelist['wantedexit'] = [6, 12]
print(tradelist)
The output is:
entry exit state change
0 0 NaN 0.0 NaN
1 1 NaN 1.0 1.0
2 0 NaN 1.0 0.0
3 1 NaN 1.0 0.0
4 0 NaN 1.0 0.0
5 0 0.0 1.0 0.0
6 0 1.0 0.0 -1.0
7 1 0.0 1.0 1.0
8 0 1.0 0.0 -1.0
9 0 0.0 0.0 0.0
10 0 0.0 0.0 0.0
11 0 0.0 0.0 0.0
12 0 1.0 0.0 0.0
13 0 0.0 0.0 0.0
entry exit wantedexit
0 1 6 6
1 7 8 12
In this example, the first trade, entered at bar 1 and exited at bar 6, is correct: it enters at bar 1 and exits 5 bars later, at bar 6.
The entry at bar 3 is ignored because there is already an order in the market, entered at bar 1.
The second trade, entered at bar 7 and exited at bar 8, is not correct, because the trade only lasts 1 bar while my condition is to exit after 5 bars.
The exit at bar 8 is only there because of the invalid signal at bar 3.
The 'wantedexit' column should be the correct exit bar index.
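One way to get those pairs (a sketch, not from the thread; it assumes a new entry on the exit bar itself is also blocked) is to walk the signals once and skip entries while a trade is open:
entries, exits = [], []
open_until = -1                      # bar at which the current trade exits
for i, sig in df['entry'].items():
    if sig == 1 and i > open_until:  # ignore signals while an order is in the market
        entries.append(i)
        open_until = i + 5           # exit exactly 5 bars after entry
        exits.append(open_until)

tradelist = pd.DataFrame({'entry': entries, 'exit': exits})
print(tradelist)
   entry  exit
0      1     6
1      7    12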

pandas diff within successive groups

import pandas as pd

d = pd.DataFrame({'a': [7, 6, 3, 4, 8], 'b': ['c', 'c', 'd', 'd', 'c']})
d.groupby('b')['a'].diff()
Gives me
0 NaN
1 -1.0
2 NaN
3 1.0
4 2.0
What I'd need
0 NaN
1 -1.0
2 NaN
3 1.0
4 NaN
That is the difference between only successive values within a group, so when a group appears after another group, its previous values are ignored.
In my example the last c value starts a new c group.
You would need to group by consecutive segments:
In [1055]: d.groupby((d.b != d.b.shift()).cumsum())['a'].diff()
Out[1055]:
0 NaN
1 -1.0
2 NaN
3 1.0
4 NaN
Name: a, dtype: float64
Details
In [1056]: (d.b != d.b.shift()).cumsum()
Out[1056]:
0 1
1 1
2 2
3 2
4 3
Name: b, dtype: int32
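The same shift/cumsum trick can be wrapped up for reuse; a small sketch:
def consecutive_groups(s):
    # a new group id starts whenever the value differs from the previous row
    return s.ne(s.shift()).cumsum()

d.groupby(consecutive_groups(d['b']))['a'].diff()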

Using scalar values in series as variables in user defined function

I want to define a function that is applied element-wise to each row of a dataframe, comparing each element to a scalar value in a separate series. I started with the function below.
def greater_than(array, value):
    g = array[array >= value].count(axis=1)
    return g
But it is applying the mask along axis 0 and I need it to apply it along axis 1. What can I do?
e.g.
In [3]: df = pd.DataFrame(np.arange(16).reshape(4,4))
In [4]: df
Out[4]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
In [26]: s
Out[26]: array([ 1, 1000, 1000, 1000])
In [25]: greater_than(df,s)
Out[25]:
0 0
1 1
2 1
3 1
dtype: int64
In [27]: g = df[df >= s]
In [28]: g
Out[28]:
0 1 2 3
0 NaN NaN NaN NaN
1 4.0 NaN NaN NaN
2 8.0 NaN NaN NaN
3 12.0 NaN NaN NaN
The result should look like:
In [29]: greater_than(df,s)
Out[29]:
0 3
1 0
2 0
3 0
dtype: int64
since 1, 2 and 3 are all >= 1 and none of the remaining values is greater than or equal to 1000.
Your best bet may be to do some transposes (no copies are made, if that's a concern):
In [164]: df = pd.DataFrame(np.arange(16).reshape(4,4))
In [165]: s = np.array([ 1, 1000, 1000, 1000])
In [171]: df.T[(df.T>=s)].T
Out[171]:
0 1 2 3
0 NaN 1.0 2.0 3.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
In [172]: df.T[(df.T>=s)].T.count(axis=1)
Out[172]:
0 3
1 0
2 0
3 0
dtype: int64
You can also just sum the mask directly, if the count is all you're after.
In [173]: (df.T>=s).sum(axis=0)
Out[173]:
0 3
1 0
2 0
3 0
dtype: int64
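An equivalent without the transposes (a sketch) is to tell the comparison to align s along the index with axis=0, then count the True values per row:
In [174]: df.ge(s, axis=0).sum(axis=1)
Out[174]:
0    3
1    0
2    0
3    0
dtype: int64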