How to trim until the last row that meets a condition within groups in Pandas

Problem Description:
Suppose I have the following dataframe
df = pd.DataFrame({"date": [1, 2, 3, 3, 4, 1, 2, 3, 3, 4, 1, 1, 1, 4, 4, 4, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
                   "variable": ["A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B",
                                "C", "C", "C", "C", "D", "D", "D", "D", "D", "D"],
                   "no": [1, 2.2, 3.5, 1.5, 1.5, 1, 2.2, 3.5, 1.5, 1.5, 1.2, 1.3, 1.1, 2, 3, 1,
                          2.2, 3.5, 1.5, 1.5, 1.2, 1.3, 1.1, 2, 3, 9],
                   "value": [0.469112, -0.282863, -1.509059, -1.135632, 1.212112, 0.469112, -0.282863, -1.509059,
                             -1.135632, 1.212112, -0.173215, 0.119209, -1.044236, -0.861849, -0.234, 0.469112,
                             -0.282863, -1.509059, -1.135632, 1.212112, -0.173215, 0.119209, -1.044236, -0.861849,
                             0.332, 0.87]})
where the df (shown here already sorted by variable and no) would be
print(df)
date variable no value
0 1 A 1.0 0.469112
1 1 A 1.0 0.469112
2 3 A 1.5 -1.135632
3 4 A 1.5 1.212112
4 3 A 1.5 -1.135632
5 4 A 1.5 1.212112
6 2 A 2.2 -0.282863
7 2 A 2.2 -0.282863
8 3 A 3.5 -1.509059
9 3 A 3.5 -1.509059
10 4 B 1.0 0.469112
11 1 B 1.1 -1.044236
12 1 B 1.2 -0.173215
13 1 B 1.3 0.119209
14 4 B 2.0 -0.861849
15 4 B 3.0 -0.234000
16 1 C 1.5 -1.135632
17 2 C 1.5 1.212112
18 1 C 2.2 -0.282863
19 1 C 3.5 -1.509059
20 3 D 1.1 -1.044236
21 2 D 1.2 -0.173215
22 3 D 1.3 0.119209
23 3 D 2.0 -0.861849
24 4 D 3.0 0.332000
25 4 D 9.0 0.870000
And then I want to:
Sort based on columns variable and no, and
trim each group (grouping by the single column variable) so that it ends at the last row whose value in column value is greater than 0; in other words, drop each group's remaining rows after the last row that meets the condition.
I have tried groupby-apply:
df.groupby('variable', as_index=False).apply(
    lambda x: x.iloc[: x.where(x['value'] > 0).last_valid_index() + 1, ])
but the result is incorrect:
date variable no value
0 0 1 A 1.0 0.469112
1 1 A 1.0 0.469112
2 3 A 1.5 -1.135632
3 4 A 1.5 1.212112
4 3 A 1.5 -1.135632
5 4 A 1.5 1.212112
1 10 4 B 1.0 0.469112
11 1 B 1.1 -1.044236
12 1 B 1.2 -0.173215
13 1 B 1.3 0.119209
14 4 B 2.0 -0.861849
15 4 B 3.0 -0.234000
2 16 1 C 1.5 -1.135632
17 2 C 1.5 1.212112
18 1 C 2.2 -0.282863
19 1 C 3.5 -1.509059
3 20 3 D 1.1 -1.044236
21 2 D 1.2 -0.173215
22 3 D 1.3 0.119209
23 3 D 2.0 -0.861849
24 4 D 3.0 0.332000
25 4 D 9.0 0.870000
As you can see, the last rows of groups B and C do not have values greater than 0.
Anyone who could provide a solution and explain why my attempt does not work would be highly appreciated.
Also, since the real dataframe is much larger than this example, I assume we had better not reverse the dataframe.

Your attempt fails because last_valid_index() returns an index label, while iloc slices by position. Group A happens to occupy labels 0-9, so label and position coincide, but group B occupies labels 10-15: its last positive row has label 13, and x.iloc[:14] then keeps all six rows of the group. Working with positions inside each group fixes this:
import numpy as np

df = df.sort_values(['variable', 'no'])
(df.groupby('variable')
   .apply(
       # positional index of the last row with value > 0, plus 1 as the slice end
       lambda x: x.iloc[:np.where(x.value.gt(0), range(len(x)), 0).max() + 1]
   ))
Output
date variable no value
variable
A 0 1 A 1.0 0.469112
5 1 A 1.0 0.469112
3 3 A 1.5 -1.135632
4 4 A 1.5 1.212112
8 3 A 1.5 -1.135632
9 4 A 1.5 1.212112
B 15 4 B 1.0 0.469112
12 1 B 1.1 -1.044236
10 1 B 1.2 -0.173215
11 1 B 1.3 0.119209
C 18 1 C 1.5 -1.135632
19 2 C 1.5 1.212112
D 22 3 D 1.1 -1.044236
20 2 D 1.2 -0.173215
21 3 D 1.3 0.119209
23 3 D 2.0 -0.861849
24 4 D 3.0 0.332000
25 4 D 9.0 0.870000
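If calling apply on every group is too slow for a large frame, here is a sketch of a transform-based variant that builds one boolean mask and filters once (keep and result are just illustrative names). Note that it only reverses the per-group boolean Series, not the dataframe itself:
import pandas as pd

df = df.sort_values(['variable', 'no'])
# True for every row up to and including the last row with value > 0 in its group
keep = (df['value'].gt(0)
          .groupby(df['variable'])
          .transform(lambda s: s[::-1].cummax()[::-1]))
result = df[keep]
The reversed cummax marks a row True whenever some row at or after it in the group has value > 0, which is exactly "everything up to the last qualifying row".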

Related

How to fill nans with multiple if-else conditions?

I have a dataset:
value score
0 0.0 8
1 0.0 7
2 NaN 4
3 1.0 11
4 2.0 22
5 NaN 12
6 0.0 4
7 NaN 15
8 0.0 5
9 2.0 24
10 1.0 12
11 1.0 15
12 0.0 5
13 2.0 26
14 NaN 28
There are some NaNs in it. I want to fill those NaNs with these conditions:
If 'score' is less than 10, then fill nan with 0.0
If 'score' is between 10 and 20, then fill nan with 1.0
If 'score' is greater than 20, then fill nan with 2.0
How do I do this in pandas?
Here is an example dataframe:
value = [0,0,np.nan,1,2,np.nan,0,np.nan,0,2,1,1,0,2,np.nan]
score = [8,7,4,11,22,12,4,15,5,24,12,15,5,26,28]
pd.DataFrame({'value': value, 'score':score})
Do it with cut, then fillna:
df.value.fillna(pd.cut(df.score, [-np.inf, 10, 20, np.inf], labels=[0, 1, 2]).astype(int), inplace=True)
df
Out[6]:
value score
0 0.0 8
1 0.0 7
2 0.0 4
3 1.0 11
4 2.0 22
5 1.0 12
6 0.0 4
7 1.0 15
8 0.0 5
9 2.0 24
10 1.0 12
11 1.0 15
12 0.0 5
13 2.0 26
14 2.0 28
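With copy-on-write enabled (opt-in in pandas 2.x and planned as the default), an inplace fillna on the column accessor may no longer write back into df; a sketch of the same idea in assignment form avoids that:
import numpy as np
import pandas as pd

df['value'] = df['value'].fillna(
    pd.cut(df['score'], [-np.inf, 10, 20, np.inf], labels=[0, 1, 2]).astype(int))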
You could use numpy.select with conditions on <10, 10≤score<20, etc., but a more efficient approach can be a floor division by 10, which maps scores below 10 to 0, scores from 10 to 19 to 1, and so on:
df['value'] = df['value'].fillna(df['score'].floordiv(10))
With numpy.select (wrapping the resulting array in a Series so that fillna can align it; fillna does not accept a bare ndarray):
df['value'] = df['value'].fillna(
    pd.Series(np.select([df['score'].lt(10),
                         df['score'].between(10, 20),
                         df['score'].ge(20)],
                        [0, 1, 2]),
              index=df.index))
output:
value score
0 0.0 8
1 0.0 7
2 0.0 4
3 1.0 11
4 2.0 22
5 1.0 12
6 0.0 4
7 1.0 15
8 0.0 5
9 2.0 24
10 1.0 12
11 1.0 15
12 0.0 5
13 2.0 26
14 2.0 28
Use np.select or pd.cut to map the intervals to values, then fillna (again wrapping the ndarray in a Series so it aligns by index):
mapping = np.select((df['score'] < 10, df['score'] > 20), (0, 2), 1)
df['value'] = df['value'].fillna(pd.Series(mapping, index=df.index))

How to replace values in a dataframe with values in another dataframe

I have 2 dataframes
df_1:
Week Day Coeff_1 ... Coeff_n
1 1 12 23
1 2 11 19
1 3 23 68
1 4 57 81
1 5 35 16
1 6 0 0
1 7 0 0
...
50 1 12 23
50 2 11 19
50 3 23 68
50 4 57 81
50 5 35 16
50 6 0 0
50 7 0 0
df_2:
Week Day Coeff_1 ... Coeff_n
1 1 0 0
1 2 0 0
1 3 0 0
1 4 0 0
1 5 0 0
1 6 56 24
1 7 20 10
...
50 1 0 0
50 2 0 0
50 3 0 0
50 4 0 0
50 5 0 0
50 6 10 84
50 7 29 10
In the first dataframe, df_1, I have coefficients for Monday to Friday. In the second dataframe, df_2, I have coefficients for the weekend. My goal is to merge both dataframes so that the obsolete 0 values disappear.
What is the best approach to do that?
I found that using df.replace seems to be a good approach.
Assuming that your dataframes follow the same structure, you can take advantage of pandas' automatic alignment on indexes: replace the 0's in df1 with np.nan, then use fillna with df2:
df1.replace({0:np.nan},inplace=True)
df1.fillna(df2)
Week Day Coeff_1 Coeff_n
0 1.0 1.0 12.0 23.0
1 1.0 2.0 11.0 19.0
2 1.0 3.0 23.0 68.0
3 1.0 4.0 57.0 81.0
4 1.0 5.0 35.0 16.0
5 1.0 6.0 56.0 24.0
6 1.0 7.0 20.0 10.0
7 50.0 1.0 12.0 23.0
8 50.0 2.0 11.0 19.0
9 50.0 3.0 23.0 68.0
10 50.0 4.0 57.0 81.0
11 50.0 5.0 35.0 16.0
12 50.0 6.0 10.0 84.0
13 50.0 7.0 29.0 10.0
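An equivalent one-liner, assuming the two frames share the same index and columns, is combine_first, which fills the NaN holes of the left frame from the right one (result is just an illustrative name):
import numpy as np

result = df1.replace(0, np.nan).combine_first(df2)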
Can't you just append the rows of df_1 where Day is 1-5 to the rows of df_2 where Day is 6-7?
df_3 = df_1[df_1.Day.isin(range(1,6))].append(df_2[df_2.Day.isin(range(6,8))])
To get a normal sorting, you can sort your values by week and day:
df_3.sort_values(['Week','Day'])
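Note that DataFrame.append was removed in pandas 2.0; a minimal sketch of the same idea with pd.concat:
import pandas as pd

df_3 = pd.concat([df_1[df_1.Day.isin(range(1, 6))],
                  df_2[df_2.Day.isin(range(6, 8))]])
df_3 = df_3.sort_values(['Week', 'Day'])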

resample data within each group in pandas

I have a dataframe with different ids and possibly overlapping times, with a time step of 0.4 seconds. I would like to resample the average speed for each id with a time step of 0.8 seconds.
time id speed
0 0.0 1 0
1 0.4 1 3
2 0.8 1 6
3 1.2 1 9
4 0.8 2 12
5 1.2 2 15
6 1.6 2 18
An example can be created by the following code
x = np.hstack((np.array([1] * 10), np.array([3] * 15)))
a = np.arange(10)*0.4
b = np.arange(15)*0.4 + 2
t = np.hstack((a, b))
df = pd.DataFrame({"time": t, "id": x})
df["speed"] = pd.DataFrame(np.arange(25) * 3)
The time column is converted to datetime type by
df["re_time"] = pd.to_datetime(df["time"], unit='s')
Try with groupby:
block_size = int(0.8//0.4)
blocks = df.groupby('id').cumcount() // block_size
df.groupby(['id',blocks]).agg({'time':'first', 'speed':'mean'})
Output:
time speed
id
1 0 0.0 1.5
1 0.8 7.5
2 1.6 13.5
3 2.4 19.5
4 3.2 25.5
3 0 2.0 31.5
1 2.8 37.5
2 3.6 43.5
3 4.4 49.5
4 5.2 55.5
5 6.0 61.5
6 6.8 67.5
7 7.6 72.0
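If you want true time-based bins rather than pairs of rows, a sketch with pd.Grouper on the re_time column built above would look like this; note that the bin edges are anchored to the clock (0.0, 0.8, 1.6, ... seconds), so the groups can differ from the cumcount approach whenever an id does not start exactly on a bin edge:
import pandas as pd

out = (df.groupby(['id', pd.Grouper(key='re_time', freq='800ms')])['speed']
         .mean()
         .reset_index())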

Is it possible to do pandas groupby transform rolling mean?

Is it possible for pandas to do something like:
df.groupby("A").transform(pd.rolling_mean,10)
You can do this without the transform or apply:
df = pd.DataFrame({'grp':['A']*5+['B']*5,'data':[1,2,3,4,5,2,4,6,8,10]})
df.groupby('grp')['data'].rolling(2, min_periods=1).mean()
Output:
grp
A 0 1.0
1 1.5
2 2.5
3 3.5
4 4.5
B 5 2.0
6 3.0
7 5.0
8 7.0
9 9.0
Name: data, dtype: float64
Update per comment:
df = pd.DataFrame({'grp': ['A']*5 + ['B']*5, 'data': [1, 2, 3, 4, 5, 2, 4, 6, 8, 10]},
                  index=[*'ABCDEFGHIJ'])
df['avg_2'] = df.groupby('grp')['data'].rolling(2, min_periods=1).mean()\
                .reset_index(level=0, drop=True)
Output:
grp data avg_2
A A 1 1.0
B A 2 1.5
C A 3 2.5
D A 4 3.5
E A 5 4.5
F B 2 2.0
G B 4 3.0
H B 6 5.0
I B 8 7.0
J B 10 9.0
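To answer the literal question: transform works too in current pandas and handles the index alignment for you; a sketch producing the same avg_2 column:
df['avg_2'] = (df.groupby('grp')['data']
                 .transform(lambda s: s.rolling(2, min_periods=1).mean()))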

Rolling grouped cumulative sum

I'm looking to create a rolling grouped cumulative sum. I can get the result via iteration, but wanted to see if there was a more intelligent way.
Here's what the source data looks like:
Per C V
1 c 3
1 a 4
1 c 1
2 a 6
2 b 5
3 j 7
4 x 6
4 x 5
4 a 9
5 a 2
6 c 3
6 k 6
Here is the desired result:
Per C V
1 c 4
1 a 4
2 c 4
2 a 10
2 b 5
3 c 4
3 a 10
3 b 5
3 j 7
4 c 4
4 a 19
4 b 5
4 j 7
4 x 11
5 c 4
5 a 21
5 b 5
5 j 7
5 x 11
6 c 7
6 a 21
6 b 5
6 j 7
6 x 11
6 k 6
This is a very interesting problem. Try the code below to see if it works for you.
(
    pd.concat([df.loc[df.Per <= i][['C', 'V']].assign(Per=i) for i in df.Per.unique()])
      .groupby(by=['Per', 'C'])
      .sum()
      .reset_index()
)
Out[197]:
Per C V
0 1 a 4
1 1 c 4
2 2 a 10
3 2 b 5
4 2 c 4
5 3 a 10
6 3 b 5
7 3 c 4
8 3 j 7
9 4 a 19
10 4 b 5
11 4 c 4
12 4 j 7
13 4 x 11
14 5 a 21
15 5 b 5
16 5 c 4
17 5 j 7
18 5 x 11
19 6 a 21
20 6 b 5
21 6 c 7
22 6 j 7
23 6 k 6
24 6 x 11
If you set the index to be 'Per' and 'C', you can first aggregate over those index levels. Then I reindex the resulting series by the product of the index levels in order to get all possibilities, filling the new indices with zero.
After this, I use groupby, cumsum, and remove zeros.
s = df.set_index(['Per', 'C']).V.sum(level=[0, 1])  # on pandas >= 2.0 use df.groupby(['Per', 'C'])['V'].sum()
s.reindex(
    pd.MultiIndex.from_product(s.index.levels, names=s.index.names),
    fill_value=0
).groupby('C').cumsum().loc[lambda x: x > 0].reset_index()
Per C V
0 1 a 4
1 1 c 4
2 2 a 10
3 2 b 5
4 2 c 4
5 3 a 10
6 3 b 5
7 3 c 4
8 3 j 7
9 4 a 19
10 4 b 5
11 4 c 4
12 4 j 7
13 4 x 11
14 5 a 21
15 5 b 5
16 5 c 4
17 5 j 7
18 5 x 11
19 6 a 21
20 6 b 5
21 6 c 7
22 6 j 7
23 6 k 6
24 6 x 11
You could use pivot_table/cumsum:
(df.pivot_table(index='Per', columns='C', values='V', aggfunc='sum')
   .fillna(0)
   .cumsum(axis=0)
   .replace(0, np.nan)
   .stack().reset_index())
yields
Per C 0
0 1 a 4.0
1 1 c 4.0
2 2 a 10.0
3 2 b 5.0
4 2 c 4.0
5 3 a 10.0
6 3 b 5.0
7 3 c 4.0
8 3 j 7.0
9 4 a 19.0
10 4 b 5.0
11 4 c 4.0
12 4 j 7.0
13 4 x 11.0
14 5 a 21.0
15 5 b 5.0
16 5 c 4.0
17 5 j 7.0
18 5 x 11.0
19 6 a 21.0
20 6 b 5.0
21 6 c 7.0
22 6 j 7.0
23 6 k 6.0
24 6 x 11.0
On the plus side, I think the pivot_table/cumsum approach helps convey the meaning of the calculation pretty well. Given the pivot_table, the calculation is essentially a cumulative sum down each column:
In [131]: df.pivot_table(index='Per', columns='C', values='V', aggfunc='sum')
Out[131]:
C a b c j k x
Per
1 4.0 NaN 4.0 NaN NaN NaN
2 6.0 5.0 NaN NaN NaN NaN
3 NaN NaN NaN 7.0 NaN NaN
4 9.0 NaN NaN NaN NaN 11.0
5 2.0 NaN NaN NaN NaN NaN
6 NaN NaN 3.0 NaN 6.0 NaN
On the negative side, the need to fuss with 0's and NaNs is not ideal. We need 0's for the cumsum, but we need NaNs to make the unwanted rows disappear when the DataFrame is stacked.
The pivot_table/cumsum approach also offers a considerable speed advantage over using_concat, but piRSquared's solution is the fastest. On a 1000-row df:
In [169]: %timeit using_reindex2(df)
100 loops, best of 3: 6.86 ms per loop
In [152]: %timeit using_reindex(df)
100 loops, best of 3: 8.36 ms per loop
In [80]: %timeit using_pivot(df)
100 loops, best of 3: 8.58 ms per loop
In [79]: %timeit using_concat(df)
10 loops, best of 3: 84 ms per loop
Here is the setup I used for the benchmark:
import numpy as np
import pandas as pd

def using_pivot(df):
    return (df.pivot_table(index='P', columns='C', values='V', aggfunc='sum')
              .fillna(0)
              .cumsum(axis=0)
              .replace(0, np.nan)
              .stack().reset_index())

def using_reindex(df):
    """
    https://stackoverflow.com/a/49097572/190597 (piRSquared)
    """
    s = df.set_index(['P', 'C']).V.sum(level=[0, 1])
    return s.reindex(
        pd.MultiIndex.from_product(s.index.levels, names=s.index.names),
        fill_value=0
    ).groupby('C').cumsum().loc[lambda x: x > 0].reset_index()

def using_reindex2(df):
    """
    https://stackoverflow.com/a/49097572/190597 (piRSquared)
    with first line changed
    """
    s = df.groupby(['P', 'C'])['V'].sum()
    return s.reindex(
        pd.MultiIndex.from_product(s.index.levels, names=s.index.names),
        fill_value=0
    ).groupby('C').cumsum().loc[lambda x: x > 0].reset_index()

def using_concat(df):
    """
    https://stackoverflow.com/a/49095139/190597 (Allen)
    """
    return (pd.concat([df.loc[df.P <= i][['C', 'V']].assign(P=i)
                       for i in df.P.unique()])
              .groupby(by=['P', 'C'])
              .sum()
              .reset_index())

def make(nrows):
    df = pd.DataFrame(np.random.randint(50, size=(nrows, 3)), columns=list('PCV'))
    return df

df = make(1000)