pandas diff within successive groups

d = pd.DataFrame({'a':[7,6,3,4,8], 'b':['c','c','d','d','c']})
d.groupby('b')['a'].diff()
Gives me
0 NaN
1 -1.0
2 NaN
3 1.0
4 2.0
What I'd need
0 NaN
1 -1.0
2 NaN
3 1.0
4 NaN
That is, I only want the difference between successive values within a group, so when a group reappears after another group, its previous values are ignored.
In my example, the last c value starts a new c group.

You would need to groupby on consecutive segments
In [1055]: d.groupby((d.b != d.b.shift()).cumsum())['a'].diff()
Out[1055]:
0 NaN
1 -1.0
2 NaN
3 1.0
4 NaN
Name: a, dtype: float64
Details
In [1056]: (d.b != d.b.shift()).cumsum()
Out[1056]:
0 1
1 1
2 2
3 2
4 3
Name: b, dtype: int32
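If this comes up repeatedly, the shift-compare-cumsum trick can be wrapped in a small helper. A minimal sketch; consecutive_diff is a hypothetical name, not a pandas method:
import pandas as pd

def consecutive_diff(df, group_col, value_col):
    # A new segment label starts whenever the grouping value changes,
    # so a value that reappears later gets a fresh label.
    segments = (df[group_col] != df[group_col].shift()).cumsum()
    # diff() restarts (with NaN) at each segment boundary.
    return df.groupby(segments)[value_col].diff()

d = pd.DataFrame({'a': [7, 6, 3, 4, 8], 'b': ['c', 'c', 'd', 'd', 'c']})
print(consecutive_diff(d, 'b', 'a'))  # NaN, -1.0, NaN, 1.0, NaN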


Conditional aggregation after rolling in pandas

I am trying to calculate a rolling mean of a specific column based on a condition in another column.
The condition is to create three different rolling means for column A, as follows:
The rolling mean of A when column B is less than 2
The rolling mean of A when column B is equal to 2
The rolling mean of A when column B is greater than 2
Consider the following df with a window size of 2
A B
0 1 2
1 2 4
2 3 4
3 4 6
4 5 1
5 6 2
The output should be the following:
rolling less rolling equal rolling greater
0 NaN NaN NaN
1 NaN 1 2
2 NaN NaN 2.5
3 NaN NaN 3.5
4 5 NaN 4
5 5 6 NaN
The main difficulty I encountered is that rolling works column-wise while apply works row-wise, and computing each conditional rolling mean by hand feels too hard-coded.
Any ideas?
Thanks a lot.
You can create your 3 columns before rolling, then compute it:
out = df.join(df.assign(rolling_less=df.mask(df['B'] >= 2)['A'],
                        rolling_equal=df.mask(df['B'] != 2)['A'],
                        rolling_greater=df.mask(df['B'] <= 2)['A'])
                .filter(like='rolling')
                .rolling(2, min_periods=1).mean())
print(out)
# Output
A B rolling_less rolling_equal rolling_greater
0 1 2 NaN 1.0 NaN
1 2 4 NaN 1.0 2.0
2 3 4 NaN NaN 2.5
3 4 6 NaN NaN 3.5
4 5 1 5.0 NaN 4.0
5 6 2 5.0 6.0 NaN
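This works because rolling(...).mean() simply ignores the NaNs that mask introduced inside each window, and min_periods=1 lets a window with a single valid value still produce a result instead of NaN.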
def function1(ss: pd.Series):
    # Slice the original frame up to the current row label,
    # then keep the last 2 rows as the window
    df11 = df1.loc[:ss.name].tail(2)
    return pd.Series([
        df11.loc[lambda dd: dd.B < 2, 'A'].mean(),
        df11.loc[lambda dd: dd.B == 2, 'A'].mean(),
        df11.loc[lambda dd: dd.B > 2, 'A'].mean(),
    ], index=['rolling less', 'rolling equal', 'rolling greater'], name=ss.name)

# assumes the input DataFrame is bound to df1
pd.concat([df1.A.shift(i) for i in range(2)], axis=1)\
  .apply(function1, axis=1)
A B rolling less rolling equal rolling greater
0 1 2 NaN 1.0 NaN
1 2 4 NaN 1.0 2.0
2 3 4 NaN NaN 2.5
3 4 6 NaN NaN 3.5
4 5 1 5.0 NaN 4.0
5 6 2 5.0 6.0 NaN
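A note on this second approach: df1.loc[:ss.name] re-slices the frame for every row, so it is quadratic in the number of rows, and the concat of shifted columns is only a vehicle to drive apply row by row. The first, vectorized answer is the better fit for large frames.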

How to replace NaN values using values from other rows with a common column value

Using column B as the reference, how can I replace the NaN values?
>>> a
A B
1 1
NaN 3
1 1
NaN 1
NaN 2
5 3
1 1
2 2
I want the result to look like this.
>>> result
A B
1 1
5 3
1 1
1 1
2 2
5 3
1 1
2 2
I tried merging on column B but couldn't figure it out:
b=a.groupby('B').reset_index()
dfM = pd.merge(a,b,on='B', how ='left')
We need a map from values in column B to the values in A.
mapping = a.dropna().drop_duplicates().set_index("B")["A"]
It looks like this
B
1 1.0
3 5.0
2 2.0
Name: A, dtype: float64
Filling null values becomes irrelevant at this point. We can just map B to get column A:
a["B"].map(mapping)
This gives you
0 1.0
1 5.0
2 1.0
3 1.0
4 2.0
5 5.0
6 1.0
7 2.0
Name: B, dtype: float64
Cast to int and use it to overwrite column A in your original dataframe if you need to.
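A minimal sketch of that last step, assuming the frame is still named a as above (the fillna variant is the conservative choice, since it touches only the missing values):
# Overwrite A entirely from the B -> A mapping ...
a["A"] = a["B"].map(mapping).astype(int)
# ... or keep the existing values and fill only the NaNs
a["A"] = a["A"].fillna(a["B"].map(mapping)).astype(int)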

In pandas replace consecutive 0s with NaN

I want to clean some data by replacing only CONSECUTIVE 0s in a data frame with NaN.
Given:
import pandas as pd
import numpy as np
d = [[1, np.nan, 3, 4], [2, 0, 0, np.nan], [3, np.nan, 0, 0], [4, np.nan, 0, 0]]
df = pd.DataFrame(d, columns=['a', 'b', 'c', 'd'])
df
a b c d
0 1 NaN 3 4.0
1 2 0.0 0 NaN
2 3 NaN 0 0.0
3 4 NaN 0 0.0
The desired result should be:
a b c d
0 1 NaN 3 4.0
1 2 0.0 NaN NaN
2 3 NaN NaN NaN
3 4 NaN NaN NaN
where column c & d are affected but column b is NOT affected as it only has 1 zero (and not consecutive 0s).
I have experimented with this answer:
Replacing more than n consecutive values in Pandas DataFrame column
which is along the right lines, but that solution keeps the first 0 in a given column, which is not desired in my case.
Let us do shift with mask: a cell is masked when it equals 0 and matches either the cell above or the cell below it.
df = df.mask((df.shift().eq(df) | df.eq(df.shift(-1))) & (df == 0))
Out[469]:
a b c d
0 1 NaN 3.0 4.0
1 2 0.0 NaN NaN
2 3 NaN NaN NaN
3 4 NaN NaN NaN
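If the one-liner is hard to read, the same condition can be unpacked into named masks (equivalent to the expression above):
is_zero = df == 0                    # cells equal to 0
same_as_above = df.eq(df.shift())    # cell matches the row above
same_as_below = df.eq(df.shift(-1))  # cell matches the row below
# a zero is masked only when it belongs to a vertical run of 2+ zeros
df = df.mask(is_zero & (same_as_above | same_as_below))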

Pandas assign value in one column based on top 10 values in another column

I have a table:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
I would like to make a new column called 'flag' for the top 2 values in column D.
I've tried:
for i in df.D.nlargest(2):
    df['flag'] = 1
But that gets me:
A B C D flag
0 NaN 2.0 NaN 0 1
1 3.0 4.0 NaN 1 1
2 NaN NaN NaN 5 1
3 NaN 3.0 NaN 4 1
What I want is:
A B C D flag
0 NaN 2.0 NaN 0 0
1 3.0 4.0 NaN 1 0
2 NaN NaN NaN 5 1
3 NaN 3.0 NaN 4 1
IIUC:
df['flag'] = 0
df.loc[df.D.nlargest(2).index, 'flag'] = 1
Or:
df['flag'] = df.index.isin(df.D.nlargest(2).index).astype(int)
Output:
A B C D flag
0 NaN 2.0 NaN 0 0
1 3.0 4.0 NaN 1 0
2 NaN NaN NaN 5 1
3 NaN 3.0 NaN 4 1
IIUC
df['flag'] = df.D.sort_values().tail(2).eq(df.D).astype(int)
df
A B C D flag
0 NaN 2.0 NaN 0 0
1 3.0 4.0 NaN 1 0
2 NaN NaN NaN 5 1
3 NaN 3.0 NaN 4 1
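One caveat, in case D can contain ties at the cutoff: nlargest(2) keeps only the first 2 tied rows by default, so pass keep='all' if every tied row should be flagged:
df['flag'] = df.index.isin(df.D.nlargest(2, keep='all').index).astype(int)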

Using scalar values in series as variables in user defined function

I want to define a function that is applied element-wise to each row of a dataframe, comparing each element to a scalar value for that row held in a separate series. I started with the function below.
def greater_than(array, value):
    g = array[array >= value].count(axis=1)
    return g
But it is applying the mask along axis 0 and I need it to apply it along axis 1. What can I do?
e.g.
In [3]: df = pd.DataFrame(np.arange(16).reshape(4,4))
In [4]: df
Out[4]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
In [26]: s
Out[26]: array([ 1, 1000, 1000, 1000])
In [25]: greater_than(df,s)
Out[25]:
0 0
1 1
2 1
3 1
dtype: int64
In [27]: g = df[df >= s]
In [28]: g
Out[28]:
0 1 2 3
0 NaN NaN NaN NaN
1 4.0 NaN NaN NaN
2 8.0 NaN NaN NaN
3 12.0 NaN NaN NaN
The result should look like:
In [29]: greater_than(df,s)
Out[29]:
0 3
1 0
2 0
3 0
dtype: int64
as 1, 2, and 3 are all >= 1, and none of the remaining values are greater than or equal to 1000.
Your best bet may be to do some transposes (no copies are made, if that's a concern)
In [164]: df = pd.DataFrame(np.arange(16).reshape(4,4))
In [165]: s = np.array([ 1, 1000, 1000, 1000])
In [171]: df.T[(df.T>=s)].T
Out[171]:
0 1 2 3
0 NaN 1.0 2.0 3.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
In [172]: df.T[(df.T>=s)].T.count(axis=1)
Out[172]:
0 3
1 0
2 0
3 0
dtype: int64
You can also just sum the mask directly, if the count is all you're after.
In [173]: (df.T>=s).sum(axis=0)
Out[173]:
0 3
1 0
2 0
3 0
dtype: int64
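If the transposes feel awkward, the flexible comparison methods accept an axis argument, so you can broadcast s down the index directly; a sketch using the same df and s as above:
# compare each row's values against that row's entry in s,
# then count the True cells per row
df.ge(s, axis=0).sum(axis=1)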