How to calculate a "consecutive mean" in R without using a loop, or in a more efficient way?

I have a set of data for which I need to calculate a "consecutive mean" (I don't know if that is the correct name, but I can't find anything better). Here is an example:
ID Var2 Var3
1 A 1
2 A 3
3 A 5
4 A 7
5 A 9
6 A 11
7 B 2
8 B 4
9 B 6
10 B 8
11 B 10
Here I need to calculate the mean of every 3 consecutive Var3 values within the same subset (i.e. there will be 4 means calculated for A: mean(1,3,5), mean(3,5,7), mean(5,7,9), mean(7,9,11), and 3 means calculated for B: mean(2,4,6), mean(4,6,8), mean(6,8,10)). The result should be:
ID Var2 Var3 Mean
1 A 1 N/A
2 A 3 N/A
3 A 5 3
4 A 7 5
5 A 9 7
6 A 11 9
7 B 2 N/A
8 B 4 N/A
9 B 6 4
10 B 8 6
11 B 10 8
Currently I am using a "loop-inside-a-loop" approach: I subset the dataset by Var2, and then, in an inner loop, calculate the mean starting from the third row of each subset.
It does what I need, but it is very slow. Is there a faster way to solve this problem?
Thanks!

It's generally referred to as a "rolling mean" or "running mean". The plyr package allows you to calculate a function over segments of your data and the zoo package has methods for rolling calculations.
> lines <- "ID,Var2,Var3
+ 1,A,1
+ 2,A,3
+ 3,A,5
+ 4,A,7
+ 5,A,9
+ 6,A,11
+ 7,B,2
+ 8,B,4
+ 9,B,6
+ 10,B,8
+ 11,B,10"
>
> x <- read.csv(con <- textConnection(lines))
> close(con)
>
> library(plyr)  # provides ddply
> library(zoo)   # provides rollmean
> ddply(x,"Var2",function(y) data.frame(y,
+ mean=rollmean(y$Var3,3,na.pad=TRUE,align="right")))
ID Var2 Var3 mean
1 1 A 1 NA
2 2 A 3 NA
3 3 A 5 3
4 4 A 7 5
5 5 A 9 7
6 6 A 11 9
7 7 B 2 NA
8 8 B 4 NA
9 9 B 6 4
10 10 B 8 6
11 11 B 10 8

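Alternatively, using tapply from base R (the rolling mean itself still comes from zoo). Note that this assumes the rows of x are already ordered by Var2, because unlist concatenates the results group by group:
x$mean <- unlist(tapply(x$Var3, x$Var2, zoo::rollmean, k=3, na.pad=TRUE, align="right", simplify=FALSE))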

Related

Pandas Groupby Problems with Calculating Column-Wise Quantiles with "quantile"

I need to compute quantiles for a large DataFrame across columns, i.e. row by row (one row per "month" in my case). Apparently, quantile applied directly to a DataFrame accepts the keyword "axis", but if you try to apply quantile through a groupby, it is rejected with an error:
TypeError: quantile() got an unexpected keyword argument 'axis'
Here is a case where quantile works, with data like this:
Num Num Num Quantile 0.5
5 6 4 5
4 1 2 2
3 9 7 7
7 2 8 7
5 5 4 5
But if I add more columns and use a groupby statement to find the same quantile(0.5, axis=1), I get the error shown above. Please help, and thank you. My actual data looks like this:
site month Num Num Num Quantile 0.5
0 A 8 5 6 4 5
1 A 9 4 1 2 2
2 A 10 3 9 7 7
3 A 11 7 2 8 7
4 A 12 5 5 4 5
5 B 8 3 7 5 5
6 B 9 6 9 0 6
7 B 10 4 1 3 3
8 B 11 8 3 0 3
9 B 12 5 6 8 6
The confusion arises from the fact that pd.DataFrame.quantile and DataFrameGroupBy.quantile are not the same functions. The first one has an axis parameter, the second one does not. Hence the error.
When you think about it, it is perfectly logical that the second function does not have this option. Suppose we do:
groups = df.groupby('site')
# iterating over a groupby yields (key, sub-DataFrame) tuples
for group in groups:
    print(group[1])
site month Num Num.1 Num.2
0 A 8 5 6 4
1 A 9 4 1 2
2 A 10 3 9 7
3 A 11 7 2 8
4 A 12 5 5 4
site month Num Num.1 Num.2
5 B 8 3 7 5
6 B 9 6 9 0
7 B 10 4 1 3
8 B 11 8 3 0
9 B 12 5 6 8
Now ask yourself which axis could produce a quantile that is meaningfully related to A | B. The answer surely is column-wise: I could get a quantile of Num for A, or of Num.1. E.g.:
print(groups.quantile())
month Num Num.1 Num.2
site
A 10.0 5.0 5.0 4.0
B 10.0 5.0 6.0 3.0
It wouldn't make sense to say, let's get the quantile row-wise for A at row 0 (and pretend that this has anything to do with A as a grouped value as distinct from B). Indeed, you don't need a groupby for that at all.
Sidenote: you will have noticed that your columns Num, Num, Num have turned into Num, Num.1, Num.2 in my examples. This conversion takes place automatically when you read from the clipboard (pd.read_clipboard). In general, having multiple columns with duplicate names is bad practice and can cause problems with many operations, so I strongly advise you to rename them.
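As a rough illustration of the point that no groupby is needed for a row-wise quantile, the following sketch rebuilds the example data by hand (with the renamed columns Num, Num.1, Num.2) and calls DataFrame.quantile with axis=1 on the numeric columns only. The name value_cols is just an illustrative choice.

```python
import pandas as pd

# Example data from the question, with the duplicate column names made unique
df = pd.DataFrame({
    "site": list("AAAAABBBBB"),
    "month": [8, 9, 10, 11, 12] * 2,
    "Num": [5, 4, 3, 7, 5, 3, 6, 4, 8, 5],
    "Num.1": [6, 1, 9, 2, 5, 7, 9, 1, 3, 6],
    "Num.2": [4, 2, 7, 8, 4, 5, 0, 3, 0, 8],
})

# Restrict to the value columns so that site and month do not enter the quantile
value_cols = ["Num", "Num.1", "Num.2"]

# axis=1 computes one quantile per row, across the selected columns
df["Quantile 0.5"] = df[value_cols].quantile(0.5, axis=1)
print(df)
```

The new column should reproduce the "Quantile 0.5" values listed in the question, and site and month stay available for any later grouping.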

How can I create a column of numbers that ascends after a certain amount of rows?

I have a column of scores in descending order. I want to create a difficulty-level column on a scale of 1-10 that goes up every 37 rows for difficulty 1-7 and then every 36 rows for 8-10. I have created a small example below where the difficulty goes up in 3-row intervals and the final difficulties '4' and '5' cover 2 rows each:
In:
score
0 11
1 10
2 9
3 8
4 8
5 6
6 5
7 4
8 4
9 3
10 2
11 1
12 1
Out:
score difficulty
0 11 1
1 10 1
2 9 1
3 8 2
4 8 2
5 6 2
6 5 3
7 4 3
8 4 3
9 3 4
10 2 4
11 1 5
12 1 5
If I understand your problem correctly, you could do something like:
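import pandas as pd
from random import randint
# 7 levels of 37 rows each, followed by 3 levels of 36 rows each
count = (37*7) + (36*3)
# integer division maps each row position to its difficulty level
difficulty = [i // 37 + 1 for i in range(37*7)] + [i // 36 + 8 for i in range(36*3)]
# random placeholder scores; substitute your own sorted score column
df = pd.DataFrame({'score': [randint(0, 10) for i in range(count)]})
df['difficulty'] = difficulty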
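A possible alternative, sketched here with NumPy rather than list comprehensions, is to repeat each level by its row count with np.repeat. The scores are random placeholders sorted in descending order (an assumption standing in for the real data), and the names levels and rows_per_level are illustrative.

```python
import numpy as np
import pandas as pd

levels = np.arange(1, 11)             # difficulty levels 1..10
rows_per_level = [37] * 7 + [36] * 3  # 37 rows for levels 1-7, 36 rows for 8-10

# np.repeat repeats each level by the matching count
difficulty = np.repeat(levels, rows_per_level)

# Placeholder scores in descending order (stand-in for the real score column)
rng = np.random.default_rng(0)
scores = np.sort(rng.integers(0, 11, size=difficulty.size))[::-1]

df = pd.DataFrame({"score": scores, "difficulty": difficulty})
print(df["difficulty"].value_counts().sort_index())
```

Keeping the counts in a single list makes it easy to change the block sizes without touching the rest of the code.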

Pandas: Create a new column that alternate between values in two other columns [duplicate]

How can I transform a DataFrame with columns S (start), E (end), V (value)
S E V
1 2 3
2 5 11
5 11 5
into:
T V
1 3
2 3
2 11
5 11
5 5
11 5
?
This is so that we can plot the data in such a way that the value V (y-axis) is the same throughout the interval.
Edit:
Some are suggesting this is the same as a "how do I use melt()?" question. However the order of the result is important.
One option is set_index/stack, which keeps the S and E values of each original row next to each other:
df = df.set_index('V').stack().reset_index(-1, drop=True).reset_index(name='T')
OUTPUT:
V T
0 3 1
1 3 2
2 11 2
3 11 5
4 5 5
5 5 11
Try with melt (note that this lists all S rows before the E rows, so the order differs from the one requested):
df.melt('V')
Out[39]:
V variable value
0 3 S 1
1 11 S 2
2 5 S 5
3 3 E 2
4 11 E 5
5 5 E 11
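Since the order of the result matters, one possible way to get the interleaved T, V layout from melt is to keep the original row labels and sort on them with a stable sort. This is only a sketch: it assumes pandas 1.1 or later (for the ignore_index parameter of melt), and the name step_df is purely illustrative.

```python
import pandas as pd

df = pd.DataFrame({"S": [1, 2, 5], "E": [2, 5, 11], "V": [3, 11, 5]})

step_df = (
    # ignore_index=False keeps the original row label on every melted row
    df.melt(id_vars="V", value_vars=["S", "E"], value_name="T", ignore_index=False)
      # mergesort is stable, so within each original row S stays ahead of E
      .sort_index(kind="mergesort")
      .reset_index(drop=True)[["T", "V"]]
)
print(step_df)
```

Plotting step_df["T"] against step_df["V"] then gives a line that stays flat over each interval.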

Stack multiple columns into single column while maintaining other columns in Pandas?

Given a pandas DataFrame with multiple columns as below
cl_a cl_b cl_c cl_d cl_e
0 1 a 5 6 20
1 2 b 4 7 21
2 3 c 3 8 22
3 4 d 2 9 23
4 5 e 1 10 24
I would like to stack the columns cl_c, cl_d and cl_e into a single column named ax, while keeping the columns cl_a and cl_b:
cl_a cl_b ax from_col
1,a,5,cl_c
2,b,4,cl_c
3,c,3,cl_c
4,d,2,cl_c
5,e,1,cl_c
1,a,6,cl_d
2,b,7,cl_d
3,c,8,cl_d
4,d,9,cl_d
5,e,10,cl_d
1,a,20,cl_e
2,b,21,cl_e
3,c,22,cl_e
4,d,23,cl_e
5,e,24,cl_e
So far, the following code does the job
import pandas as pd
df = pd.DataFrame({'cl_a': [1,2,3,4,5], 'cl_b': ['a','b','c','d','e'],
                   'cl_c': [5,4,3,2,1], 'cl_d': [6,7,8,9,10],
                   'cl_e': [20,21,22,23,24]})
# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
pieces = []
for col_name in ['cl_c','cl_d','cl_e']:
    pieces.append(df[['cl_a', 'cl_b', col_name]].rename(columns={col_name: "ax"}))
df_new = pd.concat(pieces)
However, I am curious whether there is a built-in pandas approach that can do the trick.
Edit:
After Quong's answer, I realised that I also need another column (from_col) besides ax. The from_col column indicates the name of the column that each ax value originally came from.
Yes, it's called melt:
df.melt(['cl_a','cl_b'], value_name='ax').drop(columns='variable')
Output:
cl_a cl_b ax
0 1 a 5
1 2 b 4
2 3 c 3
3 4 d 2
4 5 e 1
5 1 a 6
6 2 b 7
7 3 c 8
8 4 d 9
9 5 e 10
10 1 a 20
11 2 b 21
12 3 c 22
13 4 d 23
14 5 e 24
Or equivalently set_index().stack():
(df.set_index(['cl_a','cl_b']).stack()
.reset_index(level=-1, drop=True)
.reset_index(name='ax')
)
with a slightly different output:
cl_a cl_b ax
0 1 a 5
1 1 a 6
2 1 a 20
3 2 b 4
4 2 b 7
5 2 b 21
6 3 c 3
7 3 c 8
8 3 c 22
9 4 d 2
10 4 d 9
11 4 d 23
12 5 e 1
13 5 e 10
14 5 e 24
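To also keep the originating column name, as requested in the edit to the question, melt's var_name parameter can name the column that the melt answer above drops. A minimal sketch, assuming the same df as in the question; long_df is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "cl_a": [1, 2, 3, 4, 5],
    "cl_b": ["a", "b", "c", "d", "e"],
    "cl_c": [5, 4, 3, 2, 1],
    "cl_d": [6, 7, 8, 9, 10],
    "cl_e": [20, 21, 22, 23, 24],
})

long_df = df.melt(
    id_vars=["cl_a", "cl_b"],  # columns to keep as they are
    var_name="from_col",       # holds the original column name
    value_name="ax",           # holds the stacked values
)[["cl_a", "cl_b", "ax", "from_col"]]  # reorder to match the requested layout
print(long_df)
```

The rows come out grouped by source column (all cl_c rows, then cl_d, then cl_e), which matches the layout shown in the question.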

How to multiply dataframe columns with dataframe column in pandas?

I want to multiply the columns of one dataframe by a column of another dataframe.
I have two dataframes as shown here:
A dataframe    B dataframe
a b c d        e
3 4 4 4        2
3 3 3 3        3
3 3 3 3        4
and I want to multiply A by B.
Multiplication result should be like this:
a b c d
6 8 8 8
9 9 9 9
12 12 12 12
I tried just * multiplication but got a wrong result.
Thank you in advance!
Use B.values or B.to_numpy(), which return a NumPy array that you can then multiply with the DataFrame.
Example:
>>> A
a b c d
0 3 4 4 4
1 3 3 3 3
2 3 3 3 3
>>> B
c
0 2
1 3
2 4
>>> A * B.values
a b c d
0 6 8 8 8
1 9 9 9 9
2 12 12 12 12
Just another variation on @Dishin's excellent answer:
You can use the pandas mul method to multiply A by B, by selecting B's only column as a Series and aligning on the index:
A.mul(B.iloc[:,0],axis='index')
a b c d
0 6 8 8 8
1 9 9 9 9
2 12 12 12 12
Use DataFrame.mul with Series by selecting e column:
df = A.mul(B['e'], axis=0)
print (df)
a b c d
0 6 8 8 8
1 9 9 9 9
2 12 12 12 12
I think you are looking for the mul function. Here is the code:
import pandas as pd
df = pd.DataFrame([[3, 4, 4, 4],[3, 3, 3, 3],[3, 3, 3, 3]])
val = [2,3,4]
df.mul(val, axis = 0)
Here are the results:
0 1 2 3
0 6 8 8 8
1 9 9 9 9
2 12 12 12 12
Ignore the indices (the default integer row and column labels).
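As a side note on why plain * gives a wrong result: multiplying two DataFrames aligns them on column labels as well as on the index, so columns that exist in only one of the frames come out as NaN. The sketch below assumes that B's single column is named e, as in the question, and compares the naive product with the row-wise approaches from the answers. The names naive, by_rows and by_array are illustrative.

```python
import pandas as pd

A = pd.DataFrame({"a": [3, 3, 3], "b": [4, 3, 3], "c": [4, 3, 3], "d": [4, 3, 3]})
B = pd.DataFrame({"e": [2, 3, 4]})

# DataFrame * DataFrame aligns on column labels; a..d and e never match,
# so every cell of the result is NaN
naive = A * B

# Multiplying by a Series along axis=0 aligns on the row index instead
by_rows = A.mul(B["e"], axis=0)

# A NumPy array carries no labels, so the (3, 1) array is broadcast across columns
by_array = A * B.to_numpy()

print(naive)
print(by_rows)
```

Here by_rows and by_array hold the same values, while naive has an extra column e and contains only NaN.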