How to obtain the max value over the previous n periods for every row, based on another column in a dataframe? - numpy

Let's say I have a df:
idx c1
A 1
B 7
C 8
D 6
E 5
F 6
G 9
H 8
I 0
J 10
What's the fastest way to obtain the highest value over the previous n periods for every row based on c1, and then create a new column for it? E.g. with a 3-period window it would look like this:
idx c1 new_col
A 1 0
B 7 0
C 8 0
D 6 8   (prev. 3 periods: 1, 7, 8 → 8 is the highest)
E 5 8   (prev. 3 periods: 7, 8, 6 → 8 is the highest)
F 6 8   (prev. 3 periods: 8, 6, 5 → 8 is the highest)
G 9 6   (prev. 3 periods: 6, 5, 6 → 6 is the highest)
H 8 9   (prev. 3 periods: 5, 6, 9 → 9 is the highest)
I 0 9   (prev. 3 periods: 6, 9, 8 → 9 is the highest)
J 10 9  (prev. 3 periods: 9, 8, 0 → 9 is the highest)
My current code is now:
vals = []  # avoid shadowing the built-in name `list`
for row in range(len(df)):
    if row < 3:
        vals.append(0)
    else:
        vals.append(max(df['c1'][row-3:row]))
df['new_col'] = vals
This method is very slow because I have many rows and this has to loop through the whole thing. Is there a faster way to do it? Thanks.

This is just rolling and shift:
df['new_col'] = df['c1'].rolling(3).max().shift().fillna(0)
Output:
idx c1 new_col
0 A 1 0.0
1 B 7 0.0
2 C 8 0.0
3 D 6 8.0
4 E 5 8.0
5 F 6 8.0
6 G 9 6.0
7 H 8 9.0
8 I 0 9.0
9 J 10 9.0
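For completeness, a minimal self-contained sketch (assuming only pandas, and using the question's data) that applies the same rolling-and-shift idea for a general window length n:

import pandas as pd

df = pd.DataFrame({'idx': list('ABCDEFGHIJ'),
                   'c1': [1, 7, 8, 6, 5, 6, 9, 8, 0, 10]})

n = 3  # window length
# rolling max over n rows, shifted down one row so each row only sees previous periods
df['new_col'] = df['c1'].rolling(n).max().shift().fillna(0)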

Related

Using groupby() and cut() in pandas

I have a dataframe and for each group I want to label the values. If a value is less than its group mean, the label is 1; if it is greater than the group mean, the label is 2.
input data frame is
groups num1
0 a 2
1 a 5
2 a NaN
3 b 10
4 b 4
5 b 0
6 b 7
7 c 2
8 c 4
9 c 1
Here the mean values for groups a, b, c are 3.5, 5.25 and 2.33 respectively, and the output data frame is:
groups out
0 a 1
1 a 2
2 a NaN
3 b 2
4 b 1
5 b 1
6 b 2
7 c 1
8 c 2
9 c 1
I want to use pandas.cut, and maybe pandas.groupby and pandas.apply as well.
Also, how can I skip null values here?
Thanks in advance
cut is not really pertinent here. Use groupby.transform('mean') and numpy.where:
df['out'] = np.where(df['num1'].lt(df.groupby('groups')['num1']
                                     .transform('mean')),
                     1, 2)
Output (as new column "out" for clarity):
groups num1 out
0 a 2 1
1 a 5 2
2 a NaN 2
3 b 10 2
4 b 4 1
5 b 0 1
6 b 7 2
7 c 2 1
8 c 4 2
9 c 1 1
I really want cut
OK, but it's not really nice or performant:
(df.groupby('groups')['num1']
   .transform(lambda g: pd.cut(g, [-np.inf, g.mean(), np.inf], labels=[1, 2]))
)
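Regarding skipping null values: with np.where, NaN rows fall into the else branch and get labelled 2. If they should instead stay NaN in the output (as in the question's expected result), one possible sketch uses numpy.select, leaving unmatched rows at a NaN default (column names as above):

import numpy as np
import pandas as pd

df = pd.DataFrame({'groups': list('aaabbbbccc'),
                   'num1': [2, 5, np.nan, 10, 4, 0, 7, 2, 4, 1]})

group_mean = df.groupby('groups')['num1'].transform('mean')
df['out'] = np.select(
    [df['num1'].lt(group_mean), df['num1'].ge(group_mean)],
    [1, 2],
    default=np.nan)  # NaN inputs match neither condition and remain NaN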

Stack multiple columns into single column while maintaining other columns in Pandas?

Given a pandas DataFrame with multiple columns as below
cl_a cl_b cl_c cl_d cl_e
0 1 a 5 6 20
1 2 b 4 7 21
2 3 c 3 8 22
3 4 d 2 9 23
4 5 e 1 10 24
I would like to stack the columns cl_c, cl_d, cl_e into a single column named ax. Please note that the columns cl_a and cl_b should be kept.
cl_a cl_b ax from_col
1,a,5,cl_c
2,b,4,cl_c
3,c,3,cl_c
4,d,2,cl_c
5,e,1,cl_c
1,a,6,cl_d
2,b,7,cl_d
3,c,8,cl_d
4,d,9,cl_d
5,e,10,cl_d
1,a,20,cl_e
2,b,21,cl_e
3,c,22,cl_e
4,d,23,cl_e
5,e,24,cl_e
So far, the following code does the job:
import pandas as pd

df = pd.DataFrame({'cl_a': [1, 2, 3, 4, 5], 'cl_b': ['a', 'b', 'c', 'd', 'e'],
                   'cl_c': [5, 4, 3, 2, 1], 'cl_d': [6, 7, 8, 9, 10],
                   'cl_e': [20, 21, 22, 23, 24]})
df_new = pd.DataFrame()
for col_name in ['cl_c', 'cl_d', 'cl_e']:
    df_new = df_new.append(df[['cl_a', 'cl_b', col_name]].rename(columns={col_name: "ax"}))
However, I am curious whether there is a pandas built-in approach that can do the trick.
Edit:
Following Quong's answer, I realise I also need to include another column (i.e. from_col) besides ax. The from_col column indicates which original column each ax value came from.
Yes, it's called melt:
df.melt(['cl_a','cl_b'], value_name='ax').drop(columns='variable')
Output:
cl_a cl_b ax
0 1 a 5
1 2 b 4
2 3 c 3
3 4 d 2
4 5 e 1
5 1 a 6
6 2 b 7
7 3 c 8
8 4 d 9
9 5 e 10
10 1 a 20
11 2 b 21
12 3 c 22
13 4 d 23
14 5 e 24
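Regarding the edit: if the original column name should also be kept as from_col, a small variation of the same melt call (assuming from_col is the desired name) keeps the variable column instead of dropping it:

(df.melt(['cl_a', 'cl_b'], var_name='from_col', value_name='ax')
   [['cl_a', 'cl_b', 'ax', 'from_col']])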
Or equivalently set_index().stack():
(df.set_index(['cl_a', 'cl_b']).stack()
   .reset_index(level=-1, drop=True)
   .reset_index(name='ax')
)
with a slightly different output:
cl_a cl_b ax
0 1 a 5
1 1 a 6
2 1 a 20
3 2 b 4
4 2 b 7
5 2 b 21
6 3 c 3
7 3 c 8
8 3 c 22
9 4 d 2
10 4 d 9
11 4 d 23
12 5 e 1
13 5 e 10
14 5 e 24

How to multiply dataframe columns with dataframe column in pandas?

I want to multiply the columns of one dataframe by a column of another dataframe.
I have two dataframes as shown here:
A dataframe          B dataframe
a  b  c  d           e
3  4  4  4           2
3  3  3  3           3
3  3  3  3           4
and I want to multiply A by B. The multiplication result should be like this:
a b c d
6 8 8 8
9 9 9 9
12 12 12 12
I tried just * multiplication but got a wrong result.
Thank you in advance!
Use B.values or B.to_numpy(), which returns a numpy array, and then you can multiply it with the DataFrame.
Ex.:
>>> A
a b c d
0 3 4 4 4
1 3 3 3 3
2 3 3 3 3
>>> B
c
0 2
1 3
2 4
>>> A * B.values
a b c d
0 6 8 8 8
1 9 9 9 9
2 12 12 12 12
Just another variation on @Dishin's excellent answer:
You can use the pandas mul method to multiply A by B, by taking B as a Series and multiplying on the index:
A.mul(B.iloc[:, 0], axis='index')
a b c d
0 6 8 8 8
1 9 9 9 9
2 12 12 12 12
Use DataFrame.mul with a Series by selecting the e column:
df = A.mul(B['e'], axis=0)
print (df)
a b c d
0 6 8 8 8
1 9 9 9 9
2 12 12 12 12
I think you are looking for the mul function, as seen in this thread; here is the code.
df = pd.DataFrame([[3, 4, 4, 4],[3, 3, 3, 3],[3, 3, 3, 3]])
val = [2,3,4]
df.mul(val, axis = 0)
Here are the results:
0 1 2 3
0 6 8 8 8
1 9 9 9 9
2 12 12 12 12
Ignore the indices.
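As background on why plain * gives a wrong result here: DataFrame multiplication aligns on both column labels and the index, so A * B returns NaN for every column that does not appear in both frames. A minimal sketch illustrating this (assuming the frames from the question):

import pandas as pd

A = pd.DataFrame({'a': [3, 3, 3], 'b': [4, 3, 3], 'c': [4, 3, 3], 'd': [4, 3, 3]})
B = pd.DataFrame({'e': [2, 3, 4]})

print(A * B)                   # all NaN: columns a..d and e never align
print(A.mul(B['e'], axis=0))   # aligns on the row index instead, giving the desired result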

Rolling grouped cumulative sum

I'm looking to create a rolling grouped cumulative sum. I can get the result via iteration, but wanted to see if there was a more intelligent way.
Here's what the source data looks like:
Per C V
1 c 3
1 a 4
1 c 1
2 a 6
2 b 5
3 j 7
4 x 6
4 x 5
4 a 9
5 a 2
6 c 3
6 k 6
Here is the desired result:
Per C V
1 c 4
1 a 4
2 c 4
2 a 10
2 b 5
3 c 4
3 a 10
3 b 5
3 j 7
4 c 4
4 a 19
4 b 5
4 j 7
4 x 11
5 c 4
5 a 21
5 b 5
5 j 7
5 x 11
6 c 7
6 a 21
6 b 5
6 j 7
6 x 11
6 k 6
This is a very interesting problem. Try the code below to see if it works for you.
(
    pd.concat([df.loc[df.Per <= i][['C', 'V']].assign(Per=i) for i in df.Per.unique()])
      .groupby(by=['Per', 'C'])
      .sum()
      .reset_index()
)
Out[197]:
Per C V
0 1 a 4
1 1 c 4
2 2 a 10
3 2 b 5
4 2 c 4
5 3 a 10
6 3 b 5
7 3 c 4
8 3 j 7
9 4 a 19
10 4 b 5
11 4 c 4
12 4 j 7
13 4 x 11
14 5 a 21
15 5 b 5
16 5 c 4
17 5 j 7
18 5 x 11
19 6 a 21
20 6 b 5
21 6 c 7
22 6 j 7
23 6 k 6
24 6 x 11
If you set the index to be 'Per' and 'C', you can first accumulate over those index levels. Then I decided to reindex the resulting series by the product of the index levels in order to get all possibilities, filling new indices with zero.
After this, I use groupby, cumsum, and remove zeros.
s = df.set_index(['Per', 'C']).V.sum(level=[0, 1])

s.reindex(
    pd.MultiIndex.from_product(s.index.levels, names=s.index.names),
    fill_value=0
).groupby('C').cumsum().loc[lambda x: x > 0].reset_index()
Per C V
0 1 a 4
1 1 c 4
2 2 a 10
3 2 b 5
4 2 c 4
5 3 a 10
6 3 b 5
7 3 c 4
8 3 j 7
9 4 a 19
10 4 b 5
11 4 c 4
12 4 j 7
13 4 x 11
14 5 a 21
15 5 b 5
16 5 c 4
17 5 j 7
18 5 x 11
19 6 a 21
20 6 b 5
21 6 c 7
22 6 j 7
23 6 k 6
24 6 x 11
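Note: in newer pandas versions Series.sum(level=...) has been deprecated and removed, so the first line would need the groupby equivalent (the same change used in using_reindex2 in the benchmark further down):

s = df.groupby(['Per', 'C'])['V'].sum()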
You could use pivot_table/cumsum:
(df.pivot_table(index='Per', columns='C', values='V', aggfunc='sum')
   .fillna(0)
   .cumsum(axis=0)
   .replace(0, np.nan)
   .stack().reset_index())
yields
Per C 0
0 1 a 4.0
1 1 c 4.0
2 2 a 10.0
3 2 b 5.0
4 2 c 4.0
5 3 a 10.0
6 3 b 5.0
7 3 c 4.0
8 3 j 7.0
9 4 a 19.0
10 4 b 5.0
11 4 c 4.0
12 4 j 7.0
13 4 x 11.0
14 5 a 21.0
15 5 b 5.0
16 5 c 4.0
17 5 j 7.0
18 5 x 11.0
19 6 a 21.0
20 6 b 5.0
21 6 c 7.0
22 6 j 7.0
23 6 k 6.0
24 6 x 11.0
On the plus side, I think the pivot_table/cumsum approach helps convey the meaning of the calculation pretty well. Given the pivot_table, the calculation is essentially a cumulative sum down each column:
In [131]: df.pivot_table(index='Per', columns='C', values='V', aggfunc='sum')
Out[131]:
C a b c j k x
Per
1 4.0 NaN 4.0 NaN NaN NaN
2 6.0 5.0 NaN NaN NaN NaN
3 NaN NaN NaN 7.0 NaN NaN
4 9.0 NaN NaN NaN NaN 11.0
5 2.0 NaN NaN NaN NaN NaN
6 NaN NaN 3.0 NaN 6.0 NaN
On the negative side, the need to fuss with 0's and NaNs is not ideal. We need 0's for the cumsum, but we need NaNs to make unwanted rows disappear when the DataFrame is stacked.
The pivot_table/cumsum approach also offers a considerable speed advantage over using_concat, but piRSquared's solution is the fastest. On a 1000-row df:
In [169]: %timeit using_reindex2(df)
100 loops, best of 3: 6.86 ms per loop
In [152]: %timeit using_reindex(df)
100 loops, best of 3: 8.36 ms per loop
In [80]: %timeit using_pivot(df)
100 loops, best of 3: 8.58 ms per loop
In [79]: %timeit using_concat(df)
10 loops, best of 3: 84 ms per loop
Here is the setup I used for the benchmark:
import numpy as np
import pandas as pd

def using_pivot(df):
    return (df.pivot_table(index='P', columns='C', values='V', aggfunc='sum')
              .fillna(0)
              .cumsum(axis=0)
              .replace(0, np.nan)
              .stack().reset_index())

def using_reindex(df):
    """
    https://stackoverflow.com/a/49097572/190597 (piRSquared)
    """
    s = df.set_index(['P', 'C']).V.sum(level=[0, 1])
    return s.reindex(
        pd.MultiIndex.from_product(s.index.levels, names=s.index.names),
        fill_value=0
    ).groupby('C').cumsum().loc[lambda x: x > 0].reset_index()

def using_reindex2(df):
    """
    https://stackoverflow.com/a/49097572/190597 (piRSquared)
    with first line changed
    """
    s = df.groupby(['P', 'C'])['V'].sum()
    return s.reindex(
        pd.MultiIndex.from_product(s.index.levels, names=s.index.names),
        fill_value=0
    ).groupby('C').cumsum().loc[lambda x: x > 0].reset_index()

def using_concat(df):
    """
    https://stackoverflow.com/a/49095139/190597 (Allen)
    """
    return (pd.concat([df.loc[df.P <= i][['C', 'V']].assign(P=i)
                       for i in df.P.unique()])
              .groupby(by=['P', 'C'])
              .sum()
              .reset_index())

def make(nrows):
    df = pd.DataFrame(np.random.randint(50, size=(nrows, 3)), columns=list('PCV'))
    return df

df = make(1000)

How to calculate a "consecutive mean" in R without using a loop, or in a more efficient way?

I have a set of data for which I need to calculate a "consecutive mean" (I don't know if that is the correct name, but I can't find anything better). Here is an example:
ID Var2 Var3
1 A 1
2 A 3
3 A 5
4 A 7
5 A 9
6 A 11
7 B 2
8 B 4
9 B 6
10 B 8
11 B 10
Here I need to calculate the mean of 3 consecutive Var3 values within the same subset (i.e. there will be 4 means calculated for A: mean(1,3,5), mean(3,5,7), mean(5,7,9), mean(7,9,11), and 3 means calculated for B: mean(2,4,6), mean(4,6,8), mean(6,8,10)). And the result should be:
ID Var2 Var3 Mean
1 A 1 N/A
2 A 3 N/A
3 A 5 3
4 A 7 5
5 A 9 7
6 A 11 9
7 B 2 N/A
8 B 4 N/A
9 B 6 4
10 B 8 6
11 B 10 8
Currently I am using a "loop-inside-a-loop" approach: I subset the dataset by Var2, and then calculate the means in another loop starting from the third row.
It gives what I need, but it is very slow. Is there any faster way to do this?
Thanks!
It's generally referred to as a "rolling mean" or "running mean". The plyr package allows you to apply a function over segments of your data, and the zoo package has methods for rolling calculations.
> lines <- "ID,Var2,Var3
+ 1,A,1
+ 2,A,3
+ 3,A,5
+ 4,A,7
+ 5,A,9
+ 6,A,11
+ 7,B,2
+ 8,B,4
+ 9,B,6
+ 10,B,8
+ 11,B,10"
>
> x <- read.csv(con <- textConnection(lines))
> close(con)
>
> ddply(x,"Var2",function(y) data.frame(y,
+ mean=rollmean(y$Var3,3,na.pad=TRUE,align="right")))
ID Var2 Var3 mean
1 1 A 1 NA
2 2 A 3 NA
3 3 A 5 3
4 4 A 7 5
5 5 A 9 7
6 6 A 11 9
7 7 B 2 NA
8 8 B 4 NA
9 9 B 6 4
10 10 B 8 6
11 11 B 10 8
Alternatively, using tapply with zoo::rollmean:
x$mean <- unlist(tapply(x$Var3, x$Var2, zoo::rollmean, k=3, na.pad=TRUE, align="right", simplify=FALSE))