Rolling grouped cumulative sum - pandas

I'm looking to create a rolling grouped cumulative sum. I can get the result via iteration, but wanted to see if there was a more intelligent way.
Here's what the source data looks like:
Per C V
1 c 3
1 a 4
1 c 1
2 a 6
2 b 5
3 j 7
4 x 6
4 x 5
4 a 9
5 a 2
6 c 3
6 k 6
Here is the desired result:
Per C V
1 c 4
1 a 4
2 c 4
2 a 10
2 b 5
3 c 4
3 a 10
3 b 5
3 j 7
4 c 4
4 a 19
4 b 5
4 j 7
4 x 11
5 c 4
5 a 21
5 b 5
5 j 7
5 x 11
6 c 7
6 a 21
6 b 5
6 j 7
6 x 11
6 k 6

This is a very interesting problem. Try below to see if it works for you.
(
pd.concat([df.loc[df.Per<=i][['C','V']].assign(Per=i) for i in df.Per.unique()])
.groupby(by=['Per','C'])
.sum()
.reset_index()
)
Out[197]:
Per C V
0 1 a 4
1 1 c 4
2 2 a 10
3 2 b 5
4 2 c 4
5 3 a 10
6 3 b 5
7 3 c 4
8 3 j 7
9 4 a 19
10 4 b 5
11 4 c 4
12 4 j 7
13 4 x 11
14 5 a 21
15 5 b 5
16 5 c 4
17 5 j 7
18 5 x 11
19 6 a 21
20 6 b 5
21 6 c 7
22 6 j 7
23 6 k 6
24 6 x 11

If you set the index to be 'Per' and 'C', you can first sum over those index levels. Then I decided to reindex the resulting series by the product of the index levels in order to get all (Per, C) combinations while filling in new entries with zero.
After this, I use groupby, cumsum, and remove zeros.
s = df.set_index(['Per', 'C']).V.groupby(level=[0, 1]).sum()  # .sum(level=[0, 1]) only works on pandas < 2.0
s.reindex(
pd.MultiIndex.from_product(s.index.levels, names=s.index.names),
fill_value=0
).groupby('C').cumsum().loc[lambda x: x > 0].reset_index()
Per C V
0 1 a 4
1 1 c 4
2 2 a 10
3 2 b 5
4 2 c 4
5 3 a 10
6 3 b 5
7 3 c 4
8 3 j 7
9 4 a 19
10 4 b 5
11 4 c 4
12 4 j 7
13 4 x 11
14 5 a 21
15 5 b 5
16 5 c 4
17 5 j 7
18 5 x 11
19 6 a 21
20 6 b 5
21 6 c 7
22 6 j 7
23 6 k 6
24 6 x 11

You could use pivot_table/cumsum:
(df.pivot_table(index='Per', columns='C', values='V', aggfunc='sum')
.fillna(0)
.cumsum(axis=0)
.replace(0, np.nan)
.stack().reset_index())
yields
Per C 0
0 1 a 4.0
1 1 c 4.0
2 2 a 10.0
3 2 b 5.0
4 2 c 4.0
5 3 a 10.0
6 3 b 5.0
7 3 c 4.0
8 3 j 7.0
9 4 a 19.0
10 4 b 5.0
11 4 c 4.0
12 4 j 7.0
13 4 x 11.0
14 5 a 21.0
15 5 b 5.0
16 5 c 4.0
17 5 j 7.0
18 5 x 11.0
19 6 a 21.0
20 6 b 5.0
21 6 c 7.0
22 6 j 7.0
23 6 k 6.0
24 6 x 11.0
On the plus side, I think the pivot_table/cumsum approach helps convey the meaning of the calculation pretty well. Given the pivot_table, the calculation is essentially a cumulative sum down each column:
In [131]: df.pivot_table(index='Per', columns='C', values='V', aggfunc='sum')
Out[131]:
C a b c j k x
Per
1 4.0 NaN 4.0 NaN NaN NaN
2 6.0 5.0 NaN NaN NaN NaN
3 NaN NaN NaN 7.0 NaN NaN
4 9.0 NaN NaN NaN NaN 11.0
5 2.0 NaN NaN NaN NaN NaN
6 NaN NaN 3.0 NaN 6.0 NaN
On the negative side, the need to fuss with 0's and NaNs is not ideal. We need 0's for the cumsum, but we need NaNs to make the unwanted rows disappear when the DataFrame is stacked.
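One possible way around the 0/NaN round trip, offered as a sketch rather than as part of the answer above: record which (Per, C) cells have already appeared and use that mask, instead of the value 0, to decide which rows to keep. The names wide, seen, and result are illustrative, and the sample frame is the one from the question.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Per": [1, 1, 1, 2, 2, 3, 4, 4, 4, 5, 6, 6],
    "C": ["c", "a", "c", "a", "b", "j", "x", "x", "a", "a", "c", "k"],
    "V": [3, 4, 1, 6, 5, 7, 6, 5, 9, 2, 3, 6],
})

# One row per Per, one column per C; cells with no data are NaN.
wide = df.pivot_table(index="Per", columns="C", values="V", aggfunc="sum")

# True from the first period in which a group has any data onward.
seen = wide.notna().cumsum() > 0

result = (
    wide.fillna(0)      # zeros so that the cumulative sum carries forward
        .cumsum()
        .where(seen)    # blank out periods before a group first appears
        .stack()
        .dropna()       # explicit, since stack() may keep NaN on newer pandas
        .astype(int)    # valid here because every V is an integer
        .rename("V")
        .reset_index()
)
print(result)
```

Because rows are kept or dropped by the mask, a group whose running total is genuinely 0 would survive here, which is not the case when filtering on the value.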
The pivot_table/cumsum approach also offers a considerable speed advantage over using_concat, but piRSquared's solution is the fastest. On a 1000-row df:
In [169]: %timeit using_reindex2(df)
100 loops, best of 3: 6.86 ms per loop
In [152]: %timeit using_reindex(df)
100 loops, best of 3: 8.36 ms per loop
In [80]: %timeit using_pivot(df)
100 loops, best of 3: 8.58 ms per loop
In [79]: %timeit using_concat(df)
10 loops, best of 3: 84 ms per loop
Here is the setup I used for the benchmark:
import numpy as np
import pandas as pd
def using_pivot(df):
    return (df.pivot_table(index='P', columns='C', values='V', aggfunc='sum')
            .fillna(0)
            .cumsum(axis=0)
            .replace(0, np.nan)
            .stack().reset_index())
def using_reindex(df):
    """
    https://stackoverflow.com/a/49097572/190597 (piRSquared)
    """
    # Series.sum(level=...) requires pandas < 2.0
    s = df.set_index(['P', 'C']).V.sum(level=[0, 1])
    return s.reindex(
        pd.MultiIndex.from_product(s.index.levels, names=s.index.names),
        fill_value=0
    ).groupby('C').cumsum().loc[lambda x: x > 0].reset_index()
def using_reindex2(df):
    """
    https://stackoverflow.com/a/49097572/190597 (piRSquared)
    with first line changed
    """
    s = df.groupby(['P', 'C'])['V'].sum()
    return s.reindex(
        pd.MultiIndex.from_product(s.index.levels, names=s.index.names),
        fill_value=0
    ).groupby('C').cumsum().loc[lambda x: x > 0].reset_index()
def using_concat(df):
    """
    https://stackoverflow.com/a/49095139/190597 (Allen)
    """
    return (pd.concat([df.loc[df.P<=i][['C','V']].assign(P=i)
                       for i in df.P.unique()])
            .groupby(by=['P','C'])
            .sum()
            .reset_index())
def make(nrows):
    df = pd.DataFrame(np.random.randint(50, size=(nrows,3)), columns=list('PCV'))
    return df
df = make(1000)

Related

how to use pandas concatenate string within rolling window for each group?

I have a data set like below:
cluster order label
0 1 1 a
1 1 2 b
2 1 3 c
3 1 4 c
4 1 5 b
5 2 1 b
6 2 2 b
7 2 3 c
8 2 4 a
9 2 5 a
10 2 6 b
11 2 7 c
12 2 8 c
I want to add a column that concatenates the previous three values of the label column in a rolling window within each cluster. It seems pandas rolling can only do calculations on numeric data. Is there a way to concatenate strings?
cluster order label roll3
0 1 1 a NaN
1 1 2 b NaN
2 1 3 c NaN
3 1 4 c abc
4 1 5 b bcc
5 2 1 b NaN
6 2 2 b NaN
7 2 3 c NaN
8 2 4 a bbc
9 2 5 a bca
10 2 6 b caa
11 2 7 c aab
12 2 8 c abc
Use groupby.apply to shift and concat the labels:
df['roll3'] = (df.groupby('cluster')['label']
.apply(lambda x: x.shift(3) + x.shift(2) + x.shift(1)))
# cluster order label roll3
# 0 1 1 a NaN
# 1 1 2 b NaN
# 2 1 3 c NaN
# 3 1 4 c abc
# 4 1 5 b bcc
# 5 2 1 b NaN
# 6 2 2 b NaN
# 7 2 3 c NaN
# 8 2 4 a bbc
# 9 2 5 a bca
# 10 2 6 b caa
# 11 2 7 c aab
# 12 2 8 c abc
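Depending on the pandas version, groupby.apply may return a result indexed by the group key as well as the original index, in which case the assignment above would not align with df. A minimal sketch that avoids apply by calling shift on the grouped column directly (g is just a local name introduced here; the frame is rebuilt from the question's data):

```python
import pandas as pd

df = pd.DataFrame({
    "cluster": [1] * 5 + [2] * 8,
    "order": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7, 8],
    "label": list("abccbbbcaabcc"),
})

# Grouped shifts keep the original row index, so the pieces align with df.
g = df.groupby("cluster")["label"]

# Oldest label first; rows with fewer than three predecessors stay missing.
df["roll3"] = g.shift(3) + g.shift(2) + g.shift(1)
print(df)
```

For a wider window the same idea could be written as a loop or reduce over shift(k), at the cost of one grouped shift per window position.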

How to update the value in a column based on another value in the same column where both rows have the same value in another column?

The dataframe df is given:
ID I J K
0 10 1 a 1
1 10 2 b nan
2 10 3 c nan
3 11 1 f 0
4 11 2 b nan
5 11 3 d nan
6 12 1 b 1
7 12 2 d nan
8 12 3 c nan
For each unique ID: if J=='c' in the row where I==3, then K=1 in the row where I==1; otherwise K=0. The other values in K do not matter. In other words, the values of K in rows 0, 3, and 6 are determined by the values of J in rows 2, 5, and 8, respectively.
Try:
IDs = df.loc[df.I.eq(3) & df.J.eq("c"), "ID"]
df["K"] = np.where(df["ID"].isin(IDs) & df.I.eq(1), 1, 0)
df["K"] = np.where(df.I.eq(1), df.K, np.nan) # <-- if you want other values NaNs
print(df)
Prints:
ID I J K
0 10 1 a 1.0
1 10 2 b NaN
2 10 3 c NaN
3 11 1 f 0.0
4 11 2 b NaN
5 11 3 d NaN
6 12 1 b 1.0
7 12 2 d NaN
8 12 3 c NaN
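An alternative sketch, assuming each ID has exactly one row with I == 3: build a per-ID lookup Series (called flag here, a name that is not in the original) and map it back onto the frame. If an ID had several I == 3 rows, the lookup index would not be unique and this would need a groupby step first.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [10, 10, 10, 11, 11, 11, 12, 12, 12],
    "I": [1, 2, 3] * 3,
    "J": list("abcfbdbdc"),
})

# 1 if the I == 3 row of that ID has J == 'c', else 0; indexed by ID.
flag = df.loc[df["I"].eq(3)].set_index("ID")["J"].eq("c").astype(int)

# Look up the flag for every row, then keep it only where I == 1.
df["K"] = df["ID"].map(flag).where(df["I"].eq(1))
print(df)
```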

Difference between previous row and next row in pandas gives NaN for first value

I am new to using Pandas and I have a dataframe df as given below
A B
0 4 5
1 5 8
2 6 11
3 7 13
4 8 15
5 9 30
6 10 477
7 11 3643
8 12 33469
9 13 141409
10 14 335338
11 15 365115
I want to get the difference between each row and the previous row for the B column.
I used df.set_index('A').diff() but it gives NaN for the first row. How do I get 5 there?
A B
4 NaN
5 3.0
6 3.0
7 2.0
8 2.0
9 15.0
10 447.0
11 3166.0
12 29826.0
13 107940.0
14 193929.0
15 29777.0
Let us do
df.B.diff().fillna(df.B)
0 5.0
1 3.0
2 3.0
3 2.0
4 2.0
5 15.0
6 447.0
7 3166.0
8 29826.0
9 107940.0
10 193929.0
11 29777.0
Name: B, dtype: float64
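Because diff() introduces a NaN, the result above is float. If an integer result is preferred, one option (a sketch, not the only way) is to subtract a shifted copy that uses 0 as its fill value, so the first row is B minus 0:

```python
import pandas as pd

df = pd.DataFrame({
    "A": range(4, 16),
    "B": [5, 8, 11, 13, 15, 30, 477, 3643, 33469, 141409, 335338, 365115],
})

# shift(fill_value=0) puts 0 before the first value instead of NaN,
# so no missing value appears and the integer dtype is preserved.
d = df["B"] - df["B"].shift(fill_value=0)
print(d)
```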

generate lines with all unique values of given column for each group

df = pd.DataFrame({'timePoint': [1,1,1,1,2,2,2,2,3,3,3,3],
'item': [1,2,3,4,3,4,5,6,1,3,7,2],
'value': [2,4,7,6,5,9,3,2,4,3,1,5]})
>>> df
item timePoint value
0 1 1 2
1 2 1 4
2 3 1 7
3 4 1 6
4 3 2 5
5 4 2 9
6 5 2 3
7 6 2 2
8 1 3 4
9 3 3 3
10 7 3 1
11 2 3 5
In this df, not every item appears at every timePoint. I want to have all unique items at every timePoint, and these newly inserted items should either have:
(i) a NaN value if they have not appeared at a previous timePoint, or
(ii) if they have, they get their most recent value.
The desired output should look like the following (lines with hashtag are those inserted).
>>> dfx
item timePoint value
0 1 1 2.0
3 1 2 2.0 #
8 1 3 4.0
1 2 1 4.0
4 2 2 4.0 #
11 2 3 5.0
2 3 1 7.0
4 3 2 5.0
9 3 3 3.0
3 4 1 6.0
5 4 2 9.0
6 4 3 9.0 #
0 5 1 NaN #
6 5 2 3.0
7 5 3 3.0 #
1 6 1 NaN #
7 6 2 2.0
8 6 3 2.0 #
2 7 1 NaN #
5 7 2 NaN #
10 7 3 1.0
For example, item 1 gets a 2.0 at timePoint 2 because that's what it had at timePoint 1, whereas item 6 gets a NaN at timePoint 1 because there is no preceding value.
Now, I know that if I manage to insert all lines of every unique item missing in each timePoint group, i.e. reach this point:
>>> dfx
item timePoint value
0 1 1 2.0
1 2 1 4.0
2 3 1 7.0
3 4 1 6.0
4 3 2 5.0
5 4 2 9.0
6 5 2 3.0
7 6 2 2.0
8 1 3 4.0
9 3 3 3.0
10 7 3 1.0
11 2 3 5.0
0 5 1 NaN
1 6 1 NaN
2 7 1 NaN
3 1 2 NaN
4 2 2 NaN
5 7 2 NaN
6 4 3 NaN
7 5 3 NaN
8 6 3 NaN
Then I can do:
dfx.sort_values(by = ['item', 'timePoint'],
inplace = True,
ascending = [True, True])
dfx['value'] = dfx.groupby('item')['value'].ffill()
which will return the desired output.
But how do I add as lines all df.item.unique() items that are missing to each timePoint group?
Also, if you have a more efficient solution from scratch to suggest, then by all means please be my guest.
Using pd.MultiIndex.from_product, levels, reindex
d = df.set_index(['item', 'timePoint'])
d.reindex(
pd.MultiIndex.from_product(d.index.levels, names=d.index.names)
).groupby(level='item').ffill().reset_index()
item timePoint value
0 1 1 2.0
1 1 2 2.0
2 1 3 4.0
3 2 1 4.0
4 2 2 4.0
5 2 3 5.0
6 3 1 7.0
7 3 2 5.0
8 3 3 3.0
9 4 1 6.0
10 4 2 9.0
11 4 3 9.0
12 5 1 NaN
13 5 2 3.0
14 5 3 3.0
15 6 1 NaN
16 6 2 2.0
17 6 3 2.0
18 7 1 NaN
19 7 2 NaN
20 7 3 1.0
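To see what the reindex step contributes on its own, here is a small sketch on a three-row toy frame (the data and the names full, expanded, and missing are made up for illustration). The product of the index levels lists every (item, timePoint) pair, and reindexing against it inserts the missing pairs with NaN. Note that reindex expects the (item, timePoint) pairs in the input to be unique.

```python
import pandas as pd

toy = pd.DataFrame({"item": [1, 2, 1], "timePoint": [1, 1, 2], "value": [2, 4, 3]})
d = toy.set_index(["item", "timePoint"])

# Every combination of the observed items and timePoints.
full = pd.MultiIndex.from_product(d.index.levels, names=d.index.names)

# Rows for pairs that were absent are created with NaN values.
expanded = d.reindex(full)
missing = expanded[expanded["value"].isna()].index.tolist()

print(expanded)
print(missing)
```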
I think unstack followed by stack will produce the full format; then we use groupby with ffill to fill the NaN values forward.
s=df.set_index(['item','timePoint']).value.unstack().stack(dropna=False)
s.groupby(level=0).ffill().reset_index()
Out[508]:
item timePoint 0
0 1 1 2.0
1 1 2 2.0
2 1 3 4.0
3 2 1 4.0
4 2 2 4.0
5 2 3 5.0
6 3 1 7.0
7 3 2 5.0
8 3 3 3.0
9 4 1 6.0
10 4 2 9.0
11 4 3 9.0
12 5 1 NaN
13 5 2 3.0
14 5 3 3.0
15 6 1 NaN
16 6 2 2.0
17 6 3 2.0
18 7 1 NaN
19 7 2 NaN
20 7 3 1.0

How to map missing values of a df's column according to another column's values (of the same df) using a dictionary? Python

I managed to solve using if and for loops but I'm looking for a less computationally expensive way to do this. i.e. using apply or map or any other technique
d = {1:10, 2:20, 3:30}
df
a b
1 35
1 nan
1 nan
2 nan
2 47
2 nan
3 56
3 nan
I want to fill missing values of column b according to dict d, i.e. output should be
a b
1 35
1 10
1 10
2 20
2 47
2 20
3 56
3 30
You can use fillna or combine_first with the mapped column a:
print (df['a'].map(d))
0 10
1 10
2 10
3 20
4 20
5 20
6 30
7 30
Name: a, dtype: int64
df['b'] = df['b'].fillna(df['a'].map(d))
print (df)
a b
0 1 35.0
1 1 10.0
2 1 10.0
3 2 20.0
4 2 47.0
5 2 20.0
6 3 56.0
7 3 30.0
df['b'] = df['b'].combine_first(df['a'].map(d))
print (df)
a b
0 1 35.0
1 1 10.0
2 1 10.0
3 2 20.0
4 2 47.0
5 2 20.0
6 3 56.0
7 3 30.0
And if all values are ints, add astype:
df['b'] = df['b'].fillna(df['a'].map(d)).astype(int)
print (df)
a b
0 1 35
1 1 10
2 1 10
3 2 20
4 2 47
5 2 20
6 3 56
7 3 30
If all values in column a are keys of the dict, then it is possible to use replace:
df['b'] = df['b'].fillna(df['a'].replace(d))
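The condition on the keys matters because map and replace treat unmatched values differently. A small sketch with a made-up Series whose last value (3) is deliberately missing from the dict:

```python
import pandas as pd

d = {1: 10, 2: 20}
a = pd.Series([1, 2, 3])

# map: values without a key become NaN.
mapped = a.map(d)

# replace: values without a key are left as they were.
replaced = a.replace(d)

print(mapped)
print(replaced)
```

So with map, a row whose a value has no key simply stays NaN after fillna, whereas with replace the raw a value would be written into b.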