How to take the second-to-last value per group - pandas

For each group, I would like to keep the second-to-last value, as indicated below.
ID number
1 50
1 49
1 48
1 45
2 47
2 40
2 31
3 60
3 51
Example output
1 48
2 40
3 60

One-liner: reverse the frame, number the rows within each group with cumcount, reverse back, and keep the rows whose counter equals 1 (the second-to-last row of each group):
df[df[::-1].groupby('ID').cumcount()[::-1] == 1]
Output:
ID number
2 1 48
5 2 40
7 3 60

Use GroupBy.nth with -2:
df.groupby('ID')['number'].nth(-2)
[out]
ID
1 48
2 40
3 60
Name: number, dtype: int64
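
A minimal, self-contained sketch of both approaches on the sample data above (both silently drop any group with fewer than two rows; depending on the pandas version, nth may return the result indexed by the original row labels rather than by ID):
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'ID':     [1, 1, 1, 1, 2, 2, 2, 3, 3],
    'number': [50, 49, 48, 45, 47, 40, 31, 60, 51],
})

# Approach 1: positional selection of the second-to-last value per group.
print(df.groupby('ID')['number'].nth(-2))

# Approach 2: reversed cumcount keeps the whole row (all columns).
print(df[df[::-1].groupby('ID').cumcount()[::-1] == 1])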

Related

R: How to make a violin/box plot of the last (or any) data points in a time series?

I have the following data frame, A, and would like to make a violin/box plot of the last data points (or any other selected) for all IDs in a time series, i.e. for time=90 the values for ID = 1...10 should be plotted.
A = data.frame(ID = rep(seq(1,5), each = 10),
               time = rep(seq(0,90, by = 10), 5),
               value = rnorm(50))
ID time value
1 1 0 0.056152116
2 1 10 0.560673698
3 1 20 -0.240922725
4 1 30 -1.054686869
5 1 40 -0.734477812
6 1 50 1.123602646
7 1 60 -2.242830898
8 1 70 -0.818526167
9 1 80 1.476234401
10 1 90 -0.332324134
11 2 0 -1.486034438
12 2 10 0.222252053
13 2 20 -0.675720560
14 2 30 -3.144918043
15 2 40 3.058383376
16 2 50 0.978174555
17 2 60 -0.280927730
18 2 70 -0.188338714
19 2 80 -1.115583389
20 2 90 0.362044729
...
41 5 0 0.687402844
42 5 10 -1.127714642
43 5 20 0.117758547
44 5 30 0.507666153
45 5 40 0.205580300
46 5 50 -1.033018214
47 5 60 -1.906279605
48 5 70 0.117539035
49 5 80 -0.968888556
50 5 90 0.122049005
Try this:
set.seed(42)
A = data.frame(ID = rep(seq(1,5), each = 10),
               time = rep(seq(0,90, by = 10), 5),
               value = rnorm(50))
library(ggplot2)
library(dplyr)
filter(A, time == 90) %>%
  ggplot(aes(y = value)) +
  geom_boxplot()
Created on 2020-06-09 by the reprex package (v0.3.0)
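If a violin plot is wanted instead, geom_violin() can be substituted for geom_boxplot(); note that, unlike geom_boxplot(), it expects an x aesthetic, e.g. aes(x = factor(1), y = value) for a single violin of the time == 90 values.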

Sum of group but keep the same value for each row in pandas

How can I solve the same problem as in this link, Sum of group but keep the same value for each row in r, using pandas?
I can generate a separate df with the sum for each group and then merge it back into the original, but I am looking for a more direct way.
You can use groupby with transform, as below, to get your output:
df['sumx'] = df.groupby(['ID', 'Group'], sort=False)['x'].transform('sum')
df['sumy'] = df.groupby(['ID', 'Group'], sort=False)['y'].transform('sum')
df
output
ID Group x y sumx sumy
1 1 1 1 12 3 25
2 1 1 2 13 3 25
3 1 2 3 14 3 14
4 3 1 4 15 15 48
5 3 1 5 16 15 48
6 3 1 6 17 15 48
7 3 2 7 18 15 37
8 3 2 8 19 15 37
9 4 1 9 20 30 63
10 4 1 10 21 30 63
11 4 1 11 22 30 63
12 4 2 12 23 12 23
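
A minimal, self-contained sketch reproducing the table above (the input frame is reconstructed from the output shown):
import pandas as pd

# Reconstructed from the output table above.
df = pd.DataFrame({
    'ID':    [1, 1, 1, 3, 3, 3, 3, 3, 4, 4, 4, 4],
    'Group': [1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 2],
    'x':     list(range(1, 13)),
    'y':     list(range(12, 24)),
})

# transform broadcasts each group's sum back onto every row of that group.
df['sumx'] = df.groupby(['ID', 'Group'], sort=False)['x'].transform('sum')
df['sumy'] = df.groupby(['ID', 'Group'], sort=False)['y'].transform('sum')
print(df)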

Keep the second entry in a dataframe

I am showing you below an example dataset and the output desired.
ID number
1 50
1 49
1 48
2 47
2 40
2 31
3 60
3 51
3 42
Example output
1 49
2 40
3 51
I want to keep the second entry for every group in my dataset. I have already grouped the rows by ID; now, for each ID, I want to keep the second entry and drop the remaining rows.
Use GroupBy.nth with 1 for the second rows, because Python counts from 0:
df1 = df.groupby('ID', as_index=False).nth(1)
print (df1)
ID number
1 1 49
4 2 40
7 3 51
Another solution with GroupBy.cumcount for a per-group counter, filtering by boolean indexing:
df1 = df[df.groupby('ID').cumcount() == 1]
Details:
print (df.groupby('ID').cumcount())
0 0
1 1
2 2
3 0
4 1
5 2
6 0
7 1
8 2
dtype: int64
EDIT: Solution for the second-largest value: sort first, then take the second row (values have to be unique per group):
df = (df.sort_values(['ID','number'], ascending=[True, False])
        .groupby('ID', as_index=False)
        .nth(1))
print (df)
ID number
1 1 49
4 2 40
7 3 51
If you want the second-largest value when duplicates exist, add DataFrame.drop_duplicates:
print (df)
ID number
0 1 50 <-first max
1 1 50 <-first max
2 1 48 <-second max
3 2 47
4 2 40
5 2 31
6 3 60
7 3 51
8 3 42
df3 = (df.drop_duplicates(['ID','number'])
         .sort_values(['ID','number'], ascending=[True, False])
         .groupby('ID', as_index=False)
         .nth(1))
print (df3)
ID number
2 1 48
4 2 40
7 3 51
If that is the case, we can use duplicated + drop_duplicates:
df = df[df.duplicated('ID')].drop_duplicates('ID')
ID number
1 1 49
4 2 40
7 3 51
A flexible solution with cumcount:
df[df.groupby('ID').cumcount()==1].copy()
ID number
1 1 49
4 2 40
7 3 51
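
A minimal, self-contained sketch putting the main variants side by side on the question's sample data:
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'ID':     [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'number': [50, 49, 48, 47, 40, 31, 60, 51, 42],
})

# Second row per group by position.
print(df.groupby('ID', as_index=False).nth(1))

# The same selection via a per-group counter.
print(df[df.groupby('ID').cumcount() == 1])

# Second-largest value per group (assumes values are unique within a group).
print(df.sort_values(['ID', 'number'], ascending=[True, False])
        .groupby('ID', as_index=False)
        .nth(1))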

Finding the maximum value for each column and the corresponding value of a common column

I am trying to get the maximum value from each column in a dataframe with their time that they occur.
import pandas as pd

l = [[1,6,2,6,7],[2,66,2,6,8],[3,44,2,44,8],[4,5,35,6,8],[5,3,9,6,95]]
dft = pd.DataFrame(l, columns=['Time','25','50','75','100'])
max_t = pd.DataFrame()
max_t['Max_f'] = dft.loc[:, ['25','50','75','100']].max(axis=0)
max_t
I managed to get the maximum value in each column, however, I could not figure out how to get the time.
IIUC:
In [48]: dft
Out[48]:
Time 25 50 75 100
0 1 6 2 6 7
1 2 66 2 6 8
2 3 44 2 44 8
3 4 5 35 6 8
4 5 3 9 6 95
In [49]: dft.set_index('Time').agg(['max','idxmax']).T
Out[49]:
max idxmax
25 66 2
50 35 4
75 44 3
100 95 5
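
The same result can be assembled step by step; a minimal sketch on the question's data:
import pandas as pd

l = [[1, 6, 2, 6, 7], [2, 66, 2, 6, 8], [3, 44, 2, 44, 8],
     [4, 5, 35, 6, 8], [5, 3, 9, 6, 95]]
dft = pd.DataFrame(l, columns=['Time', '25', '50', '75', '100'])

s = dft.set_index('Time')       # index by Time so idxmax returns times
max_t = pd.DataFrame({
    'max':    s.max(),          # column-wise maximum
    'idxmax': s.idxmax(),       # Time at which that maximum occurs
})
print(max_t)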

Select for running total rank based on column values

I have a problem assigning ranks in the scenario below. The running total is calculated from the cnt field.
My SQL query should return rank values like the output below. Each page should hold only 40 rows, so ranks should be assigned in blocks of roughly 40 records: whenever the running total crosses a multiple of 40, the rank should increase.
It would be a great help to get a SQL query that returns these values. So far I have:
select f1, f2, sum(f2) over(order by f1) running_total
from [dbo].[Sheet1$]
Desired output:
ID cnt Running Total Rank
1 4 4 1
2 5 9 1
3 4 13 1
4 4 17 1
5 4 21 1
6 5 26 1
7 4 30 1
8 4 34 1
9 4 38 1
10 4 42 2
11 4 46 2
12 4 50 2
13 4 54 2
14 4 58 2
15 4 62 2
16 4 66 2
17 4 70 2
18 4 74 2
19 4 78 2
20 4 82 3
21 4 86 3
22 4 90 3
Take the ceiling of the running total divided by 40; this reproduces the ranks in the sample output (a plain floor of running_total / 40 is off by one):
select f1, f2,
       sum(f2) over(order by f1) as running_total,
       ceiling(sum(f2) over(order by f1) / 40.0) as [rank]
from [dbo].[Sheet1$]
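One boundary note on the design: with ceiling, a running total of exactly 40 still falls in rank 1; if a total of exactly 40 should instead open rank 2, 1 + floor(sum(f2) over(order by f1) / 40) gives that behavior.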