I'm trying to compute an expanding mean. I can get it to work by iterating and "grouping" manually, filtering on the specific values, but that takes far too long. This feels like an easy application of groupby, but when I try it, the expanding mean is computed over the entire dataset rather than within each group.
For a quick example, I want to take this (in this particular case, grouped by 'player' and 'year') and compute an expanding mean:
player pos year wk pa ra
a qb 2001 1 10 0
a qb 2001 2 5 0
a qb 2001 3 10 0
a qb 2002 1 12 0
a qb 2002 2 13 0
b rb 2001 1 0 20
b rb 2001 2 0 17
b rb 2001 3 0 12
b rb 2002 1 0 14
b rb 2002 2 0 15
to get:
player pos year wk pa ra avg_pa avg_ra
a qb 2001 1 10 0 10 0
a qb 2001 2 5 0 7.5 0
a qb 2001 3 10 0 8.3 0
a qb 2002 1 12 0 12 0
a qb 2002 2 13 0 12.5 0
b rb 2001 1 0 20 0 20
b rb 2001 2 0 17 0 18.5
b rb 2001 3 0 12 0 16.3
b rb 2002 1 0 14 0 14
b rb 2002 2 0 15 0 14.5
Not sure where I'm going wrong:
# Group by player and season - also put weeks in correct ascending order
grouped = calc_averages.groupby(['player','pos','seas']).apply(pd.DataFrame.sort_values, 'wk')
grouped['avg_pa'] = grouped['pa'].expanding().mean()
But this will give an expanding mean for the entire set, not for each player, season.
Try:
df.sort_values('wk').groupby(['player','pos','year'])[['pa','ra']].expanding().mean()\
  .reset_index()
Output:
player pos year level_3 pa ra
0 a qb 2001 0 10.000000 0.000000
1 a qb 2001 1 7.500000 0.000000
2 a qb 2001 2 8.333333 0.000000
3 a qb 2002 3 12.000000 0.000000
4 a qb 2002 4 12.500000 0.000000
5 b rb 2001 5 0.000000 20.000000
6 b rb 2001 6 0.000000 18.500000
7 b rb 2001 7 0.000000 16.333333
8 b rb 2002 8 0.000000 14.000000
9 b rb 2002 9 0.000000 14.500000
Here is my data:
ID Mnth Amt Flg
B 1 10 0
B 2 12 0
B 3 14 0
B 4 41 0
B 5 134 0
B 6 14 0
B 7 134 0
B 8 134 0
B 9 12 0
B 10 41 0
B 11 4 0
B 12 14 0
B 12 14 0
A 1 34 0
A 2 22 0
A 3 56 0
A 4 129 0
A 5 40 0
A 6 20 0
A 7 58 0
A 8 123 0
If I give 3 as input, my output should be:
ID Mnth Amt Flg Level_Flag
B 1 10 0 0
B 2 12 0 1
B 3 14 0 1
B 4 41 0 1
B 5 134 0 2
B 6 14 0 2
B 7 134 0 2
B 8 134 0 3
B 9 12 0 3
B 10 41 0 3
B 11 4 0 4
B 12 14 0 4
B 12 14 0 4
A 1 34 0 0
A 2 22 0 0
A 3 56 0 1
A 4 129 0 1
A 5 40 0 1
A 6 20 0 2
A 7 58 0 2
A 8 123 0 2
So basically I want to divide the data into subgroups of 3 rows each, counted from the bottom up, and label those subgroups as shown in the Level_Flag column. I have other IDs like A, C, and so on, so I want to do this for each ID group. Thanks in advance.
Edit: I want the same thing to be done after grouping by ID.
First we decide the number of unique flags nums by dividing the length of the df by n and rounding up. Then we repeat each of those numbers n times. Finally we reverse the array, chop it off at the length of the df, and reverse it once more.
import numpy as np

def create_flags(d, n):
    # number of distinct flags, rounded up so leftover rows form the top group
    nums = np.ceil(len(d) / n)
    # repeat each flag n times, then trim from the front so the remainder lands on top
    level_flag = np.repeat(np.arange(nums), n)[::-1][:len(d)][::-1]
    return level_flag

df['Level_Flag'] = df.groupby('ID')['ID'].transform(lambda x: create_flags(x, 3))
ID Mnth Amt Flg Level_Flag
0 B 1 10 0 0.0
1 B 2 12 0 1.0
2 B 3 14 0 1.0
3 B 4 41 0 1.0
4 B 5 134 0 2.0
5 B 6 14 0 2.0
6 B 7 134 0 2.0
7 B 8 134 0 3.0
8 B 9 12 0 3.0
9 B 10 41 0 3.0
10 B 11 4 0 4.0
11 B 12 14 0 4.0
12 B 12 14 0 4.0
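A loop-free way to get the same flags (my sketch, not the answerer's code) combines a reversed per-group cumcount with each group's size; it also keeps Level_Flag as an integer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID':   ['B'] * 13 + ['A'] * 8,
    'Mnth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12, 1, 2, 3, 4, 5, 6, 7, 8],
    'Amt':  [10, 12, 14, 41, 134, 14, 134, 134, 12, 41, 4, 14, 14,
             34, 22, 56, 129, 40, 20, 58, 123],
    'Flg':  0,
})

n = 3
rev = df.groupby('ID').cumcount(ascending=False)    # row position counted from the bottom of each ID
size = df.groupby('ID')['ID'].transform('size')     # number of rows per ID
df['Level_Flag'] = np.ceil(size / n).astype(int) - 1 - rev // n
```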
To remove the groups with fewer than 3 rows, use GroupBy.transform:
m = df.groupby(['ID', 'Level_Flag'])['Level_Flag'].transform('count').ge(3)
df = df[m]
ID Mnth Amt Flg Level_Flag
1 B 2 12 0 1.0
2 B 3 14 0 1.0
3 B 4 41 0 1.0
4 B 5 134 0 2.0
5 B 6 14 0 2.0
6 B 7 134 0 2.0
7 B 8 134 0 3.0
8 B 9 12 0 3.0
9 B 10 41 0 3.0
10 B 11 4 0 4.0
11 B 12 14 0 4.0
12 B 12 14 0 4.0
I have a data frame that looks like the one below, except that the last column is not there.
That is, I do not have the formula column, and my goal here is to calculate it.
How is it calculated?
The formula for the last column is: for each PatientNumber,
the number of Yes answers divided by the total number of questions the patient answered.
For example, for patient number 1 there is 1 Yes and 2 No, so the value is 1/3.
For patient 2, in year 2006, month 10, there is no Yes; all three answers are No, so the value is 0.
PatientNumber QT Answer Answerdate year month dayofyear count formula
1 1 transferring No 2017-03-03 2017 3 62 2.0 (1/3)
2 1 preparing food No 2017-03-03 2017 3 62 2.0 (1/3)
3 1 medications Yes 2017-03-03 2017 3 62 1.0 (1/3)
4 2 transferring No 2006-10-05 2006 10 275 3.0 0
5 2 preparing food No 2006-10-05 2006 10 275 3.0 0
6 2 medications No 2006-10-05 2006 10 275 3.0 0
7 2 transferring Yes 2007-4-15 2007 4 105 2.0 2/3
8 2 preparing food Yes 2007-4-15 2007 4 105 2.0 2/3
9 2 medications No 2007-4-15 2007 4 105 1.0 2/3
10 2 transferring Yes 2007-12-15 2007 12 345 1.0 1/3
11 2 preparing food No 2007-12-15 2007 12 345 2.0 1/3
12 2 medications No 2007-12-15 2007 12 345 2.0 1/3
13 2 transferring Yes 2008-10-10 2008 10 280 1.0 (1/3)
14 2 preparing food No 2008-10-10 2008 10 280 2.0 (1/3)
15 2 medications No 2008-10-10 2008 10 280 2.0 (1/3)
16 3 medications No 2008-10-10 2008 12 280 …… ………..
Update 1
What if the formula changes a little:
If the patient visited the hospital only once in a year, the same formula is multiplied by 2. For example, in 2017 there is just one month for that patient, meaning the patient came only once during the year; in this case the formula above multiplied by 2 works.
(Why? Because my window should be every 6 months, so if the patient has not come every 6 months I assume the same record repeats.)
But if there are several records during one year for one patient, the formula should be multiplied by 2 divided by the number of records in that year.
For example, in 2007 the patient came to the hospital twice, once in month 4 and once in month 12, so in this case the formula should be multiplied by 2/2.
Try this:
def func(x):
    x['yes'] = len(x[x['Answer'] == 'Yes'])
    x['all'] = len(x)
    return x

df = df.groupby(['PatientNumber', 'Answerdate']).apply(func)
df['formula_applied'] = df['yes'] / df['all']
df['formula'] = df['yes'].astype(str) + '/' + df['all'].astype(str)
print(df)
Output:
PatientNumber QT Answer Answerdate year month dayofyear \
0 1 transferring No 2017-03-03 2017 3 62
1 1 preparing food No 2017-03-03 2017 3 62
2 1 medications Yes 2017-03-03 2017 3 62
3 2 transferring No 2006-10-05 2006 10 275
4 2 preparing food No 2006-10-05 2006 10 275
5 2 medications No 2006-10-05 2006 10 275
6 2 transferring Yes 2007-4-15 2007 4 105
7 2 preparing food Yes 2007-4-15 2007 4 105
8 2 medications No 2007-4-15 2007 4 105
9 2 transferring Yes 2007-12-15 2007 12 345
10 2 preparing food No 2007-12-15 2007 12 345
11 2 medications No 2007-12-15 2007 12 345
12 2 transferring Yes 2008-10-10 2008 10 280
13 2 preparing food No 2008-10-10 2008 10 280
14 2 medications No 2008-10-10 2008 10 280
count yes all formula_applied formula
0 2.0 1 3 0.333333 1/3
1 2.0 1 3 0.333333 1/3
2 1.0 1 3 0.333333 1/3
3 3.0 0 3 0.000000 0/3
4 3.0 0 3 0.000000 0/3
5 3.0 0 3 0.000000 0/3
6 2.0 2 3 0.666667 2/3
7 2.0 2 3 0.666667 2/3
8 1.0 2 3 0.666667 2/3
9 1.0 1 3 0.333333 1/3
10 2.0 1 3 0.333333 1/3
11 2.0 1 3 0.333333 1/3
12 1.0 1 3 0.333333 1/3
13 2.0 1 3 0.333333 1/3
14 2.0 1 3 0.333333 1/3
Explanation:
The user-defined func computes the number of Yes answers and the total record count for each group; from there you can derive whatever you need. The formula column is your desired result, and I added formula_applied in case you want it as an evaluated number.
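The same yes/all counts can also be computed without apply, using GroupBy.transform (a sketch with a cut-down sample; usually faster on large frames):

```python
import pandas as pd

df = pd.DataFrame({
    'PatientNumber': [1, 1, 1, 2, 2, 2],
    'QT': ['transferring', 'preparing food', 'medications'] * 2,
    'Answer': ['No', 'No', 'Yes', 'No', 'No', 'No'],
    'Answerdate': ['2017-03-03'] * 3 + ['2006-10-05'] * 3,
})

g = df.groupby(['PatientNumber', 'Answerdate'])['Answer']
df['yes'] = g.transform(lambda s: s.eq('Yes').sum())   # Yes count per visit
df['all'] = g.transform('size')                        # answers per visit
df['formula_applied'] = df['yes'] / df['all']
```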
I got the following df:
Name Year [Columns which rows should not be moved] V2 C2 KeyC
A 2001 ... 4 7 NA
A 2002 ... 2 0.5 1
A 2003 ... 4 0.2 0
A 2005 ... 3 0.3 NA
B 2004 ... 0 0.4 NA
B 2006 ... 1 7 NA
B 2007 ... 2 0.6 1
C 2002 .... 4 4 NA
What I want to do now is move ONLY the observations in columns V2 and C2 down by one row, but only where the next row is one year ahead of the current row's year.
In this example: the value from row 1 moves into row 2, overwriting row 2's values. Row 4 keeps its V2 and C2 values because there is no 2004 row for A. For B: row 7 takes the values from row 6, and row 7's original values disappear; nothing carries over into row 8, since a new letter starts in column Name. Do this for every letter.
Name Year [Columns which rows should not be moved] V2 C2 KeyC
A 2001 ... 4 7 NA
A 2002 ... 4 7 1
A 2003 ... 2 0.5 0
A 2005 ... 3 0.3 NA
B 2004 ... 0 0.4 NA
B 2006 ... 1 7 NA
B 2007 ... 1 7 1
C 2002 .... 4 4 NA
Is there a way of doing this? :)
Thank you :)
We can build a helper key to signal which rows to shift:
library(data.table)
dt <- data.table(dt)
dt[, KEY := c(0L, diff(year)), by = name]
dt[dt$KEY == 1, c('V2', 'C2')] <- data.table(apply(dt[, c('V2', 'C2')], 2, shift)[dt$KEY == 1, ])
dt
name year x V2 C2 KeyC KEY
1: A 2001 ... 4 7.0 NA 0
2: A 2002 ... 4 7.0 1 1
3: A 2003 ... 2 0.5 0 1
4: A 2005 ... 3 0.3 NA 2
5: B 2004 ... 0 0.4 NA 0
6: B 2006 ... 1 7.0 NA 2
7: B 2007 ... 1 7.0 1 1
8: C 2002 .... 4 4.0 NA 0
You can use the shift function from the data.table package:
dt <- read.table(text = "name year x V2 C2 KeyC
A 2001 ... 4 7 NA
A 2002 ... 2 0.5 1
A 2003 ... 4 0.2 0
A 2005 ... 3 0.3 NA
B 2004 ... 0 0.4 NA
B 2006 ... 1 7 NA
B 2007 ... 2 0.6 1
C 2002 .... 4 4 NA",
header = T)
library(data.table)
dt <- data.table(dt)
dt[, `:=` (previous.year = shift(year),
previous.V2 = shift(V2),
previous.C2 = shift(C2))]
dt[, has.previous.year := year - 1 == previous.year]
dt[has.previous.year == TRUE,
`:=` (V2 = previous.V2,
C2 = previous.C2)]
dt <- dt[, .(name, year, x, V2, C2, KeyC)]
dt
Assuming that the KeyC column accurately codes all the cases you want copied:
# make helper columns that are offset by 1 row
df$V2_help <- c(NA, df$V2[-nrow(df)])
df$C2_help <- c(NA, df$C2[-nrow(df)])
# use ifelse to replace data where KeyC is not NA
df$V2 <- ifelse(!is.na(df$KeyC), df$V2_help, df$V2)
df$C2 <- ifelse(!is.na(df$KeyC), df$C2_help, df$C2)
# remove helper columns
df <- df[, setdiff(colnames(df), c("V2_help", "C2_help"))]
Name Year V2 C2 KeyC
1 A 2001 4 7.0 NA
2 A 2002 4 7.0 1
3 A 2003 2 0.5 0
4 A 2005 3 0.3 NA
5 B 2004 0 0.4 NA
6 B 2006 1 7.0 NA
7 B 2007 1 7.0 1
8 C 2002 4 4.0 NA
Using the tidyverse:
library(tidyverse)
dt %>%
  group_by(name) %>%
  mutate_at(vars(C2, V2), funs(ifelse(c(0, diff(year)) == 1, lag(.), .)))
# A tibble: 8 x 6
# Groups: name [3]
name year x V2 C2 KeyC
<fct> <int> <fct> <int> <dbl> <int>
1 A 2001 ... 4 7.00 NA
2 A 2002 ... 4 7.00 1
3 A 2003 ... 2 0.500 0
4 A 2005 ... 3 0.300 NA
5 B 2004 ... 0 0.400 NA
6 B 2006 ... 1 7.00 NA
7 B 2007 ... 1 7.00 1
8 C 2002 .... 4 4.00 NA
You can also use
library(data.table)
setDT(dt)[, c("C2", "V2") := lapply(.SD, function(x) ifelse(c(0, diff(year)) == 1, shift(x), x)),
          by = name, .SDcols = c("C2", "V2")]
dt
name year x V2 C2 KeyC
1: A 2001 ... 4 7.0 NA
2: A 2002 ... 4 7.0 1
3: A 2003 ... 2 0.5 0
4: A 2005 ... 3 0.3 NA
5: B 2004 ... 0 0.4 NA
6: B 2006 ... 1 7.0 NA
7: B 2007 ... 1 7.0 1
8: C 2002 .... 4 4.0 NA
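For comparison only (a hypothetical pandas translation, not part of the original R answers), the same conditional shift can be sketched with a per-group year diff; the mask is never true on a group's first row, so a plain whole-frame shift is safe:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'Year': [2001, 2002, 2003, 2005, 2004, 2006, 2007, 2002],
    'V2':   [4.0, 2.0, 4.0, 3.0, 0.0, 1.0, 2.0, 4.0],   # floats so shifted values slot in cleanly
    'C2':   [7.0, 0.5, 0.2, 0.3, 0.4, 7.0, 0.6, 4.0],
})

# rows whose Year is exactly one more than the previous row within the same Name
mask = df.groupby('Name')['Year'].diff().eq(1)
df.loc[mask, ['V2', 'C2']] = df[['V2', 'C2']].shift()[mask]
```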
I have a pandas dataframe that looks like this:
Year A B C D
1999 1 3 5 7
2000 11 13 17 19
2001 23 29 31 37
And I want it to look like this:
Year Type Value
1999 A 1
1999 B 3
1999 C 5
1999 D 7
2000 A 11
2000 B 13
Etc. Is there a way to do this and if so, how?
You can recreate your df:
pd.DataFrame({'Year': df.Year.repeat(df.shape[1] - 1),
              'Type': list(df)[1:] * len(df),
              'Value': np.concatenate(df.iloc[:, 1:].values)})
Out[95]:
Type Value Year
0 A 1 1999
0 B 3 1999
0 C 5 1999
0 D 7 1999
1 A 11 2000
1 B 13 2000
1 C 17 2000
1 D 19 2000
2 A 23 2001
2 B 29 2001
2 C 31 2001
2 D 37 2001
First set_index, then stack, rename_axis, and finally reset_index:
df = df.set_index('Year').stack().rename_axis(('Year','Type')).reset_index(name='Value')
print (df)
Year Type Value
0 1999 A 1
1 1999 B 3
2 1999 C 5
3 1999 D 7
4 2000 A 11
5 2000 B 13
6 2000 C 17
7 2000 D 19
8 2001 A 23
9 2001 B 29
10 2001 C 31
11 2001 D 37
Or use melt, but order of values is different:
df = df.melt('Year', var_name='Type', value_name='Value')
print (df)
Year Type Value
0 1999 A 1
1 2000 A 11
2 2001 A 23
3 1999 B 3
4 2000 B 13
5 2001 B 29
6 1999 C 5
7 2000 C 17
8 2001 C 31
9 1999 D 7
10 2000 D 19
11 2001 D 37
... so sorting is necessary:
df = (df.melt('Year', var_name='Type', value_name='Value')
.sort_values(['Year','Type'])
.reset_index(drop=True))
print (df)
Year Type Value
0 1999 A 1
1 1999 B 3
2 1999 C 5
3 1999 D 7
4 2000 A 11
5 2000 B 13
6 2000 C 17
7 2000 D 19
8 2001 A 23
9 2001 B 29
10 2001 C 31
11 2001 D 37
Numpy solution:
a = np.repeat(df['Year'], len(df.columns.difference(['Year'])))
b = np.tile(df.columns.difference(['Year']), len(df.index))
c = df.drop(columns='Year').values.ravel()
df = pd.DataFrame(np.column_stack([a,b,c]), columns=['Year','Type','Value'])
print (df)
Year Type Value
0 1999 A 1
1 1999 B 3
2 1999 C 5
3 1999 D 7
4 2000 A 11
5 2000 B 13
6 2000 C 17
7 2000 D 19
8 2001 A 23
9 2001 B 29
10 2001 C 31
11 2001 D 37
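One caveat on the NumPy route (my note, not part of the answer): np.column_stack promotes everything to a single common dtype, so mixing the numeric Year/Value data with the string Type labels turns the numbers into strings, and the rebuilt columns may need pd.to_numeric afterward:

```python
import numpy as np

# a two-column stack of ints and strings is promoted to a unicode dtype
stacked = np.column_stack([np.array([1999, 2000]), np.array(['A', 'B'])])
print(stacked[0, 0])   # the year comes back as the string '1999'
```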
I need some help rearranging a dataframe. This is what the data looks like:
Year item 1 item 2 item 3
2001 22 54 33
2002 77 54 33
2003 22 NaN 33
2004 22 54 NaN
The layout I want is:
Items Year Value
item 1 2001 22
item 1 2002 77
...
And so on...
Use melt if it is not necessary to remove NaNs:
df = df.melt('Year', var_name='Items', value_name='Value')
print (df)
Year Items Value
0 2001 item 1 22.0
1 2002 item 1 77.0
2 2003 item 1 22.0
3 2004 item 1 22.0
4 2001 item 2 54.0
5 2002 item 2 54.0
6 2003 item 2 NaN
7 2004 item 2 54.0
8 2001 item 3 33.0
9 2002 item 3 33.0
10 2003 item 3 33.0
11 2004 item 3 NaN
To remove NaNs, add dropna:
df = df.melt('Year', var_name='Items', value_name='Value').dropna(subset=['Value'])
print (df)
Year Items Value
0 2001 item 1 22.0
1 2002 item 1 77.0
2 2003 item 1 22.0
3 2004 item 1 22.0
4 2001 item 2 54.0
5 2002 item 2 54.0
7 2004 item 2 54.0
8 2001 item 3 33.0
9 2002 item 3 33.0
10 2003 item 3 33.0
For a slightly different ordering that also removes NaNs, it is possible to use set_index + stack + rename_axis + reset_index:
df = df.set_index('Year').stack().rename_axis(['Year','Items']).reset_index(name='Value')
print (df)
Year Items Value
0 2001 item 1 22.0
1 2001 item 2 54.0
2 2001 item 3 33.0
3 2002 item 1 77.0
4 2002 item 2 54.0
5 2002 item 3 33.0
6 2003 item 1 22.0
7 2003 item 3 33.0
8 2004 item 1 22.0
9 2004 item 2 54.0
Using comprehensions and pd.DataFrame.itertuples:
pd.DataFrame(
[[y, i, v]
for y, *vals in df.itertuples(index=False)
for i, v in zip(df.columns[1:], vals)
if pd.notnull(v)],
columns=['Year', 'Item', 'Value']
)
Year Item Value
0 2001 item 1 22.0
1 2001 item 2 54.0
2 2001 item 3 33.0
3 2002 item 1 77.0
4 2002 item 2 54.0
5 2002 item 3 33.0
6 2003 item 1 22.0
7 2003 item 3 33.0
8 2004 item 1 22.0
9 2004 item 2 54.0