Finding the maximum value for each column and the corresponding value of a common column - dataframe

I am trying to get the maximum value from each column in a dataframe, along with the time at which each maximum occurs.
l = [[1,6,2,6,7],[2,66,2,6,8],[3,44,2,44,8],[4,5,35,6,8],[5,3,9,6,95]]
dft = pd.DataFrame(l, columns=['Time','25','50','75','100'])
max_t = pd.DataFrame()
max_t['Max_f'] = dft.loc[:, ['25','50','75','100']].max(axis=0)
max_t
I managed to get the maximum value in each column; however, I could not figure out how to get the corresponding time.

IIUC:
In [48]: dft
Out[48]:
   Time  25  50  75  100
0     1   6   2   6    7
1     2  66   2   6    8
2     3  44   2  44    8
3     4   5  35   6    8
4     5   3   9   6   95

In [49]: dft.set_index('Time').agg(['max','idxmax']).T
Out[49]:
     max  idxmax
25    66       2
50    35       4
75    44       3
100   95       5
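If the one-liner is hard to follow, the same result can be assembled step by step. A sketch using the same sample data; with Time as the index, idxmax returns the Time label rather than the positional row number:

```python
import pandas as pd

l = [[1, 6, 2, 6, 7], [2, 66, 2, 6, 8], [3, 44, 2, 44, 8],
     [4, 5, 35, 6, 8], [5, 3, 9, 6, 95]]
dft = pd.DataFrame(l, columns=['Time', '25', '50', '75', '100'])

# Make Time the index so idxmax reports the Time value, not the row position
indexed = dft.set_index('Time')

# Column-wise max and the Time at which it occurs, combined into one frame
max_t = pd.DataFrame({'max': indexed.max(), 'idxmax': indexed.idxmax()})
print(max_t)
```

This is equivalent to the agg(['max','idxmax']).T form above, just split into named intermediate steps.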

Related

pandas dataframe and how to find an element using row and column

Is there a way to find an element in a pandas dataframe by using the row and column values? For example, if we have a list, L = [0,3,2,3,2,4,30,7], we can use L[2] and get the value 2 in return.
Use .iloc
df = pd.DataFrame({'L':[0,3,2,3,2,4,30,7], 'M':[10,23,22,73,72,14,130,17]})
    L    M
0   0   10
1   3   23
2   2   22
3   3   73
4   2   72
5   4   14
6  30  130
7   7   17
df.iloc[2]['L']    # 2 (chained lookup: row 2, then column 'L')
df.iat[2, 0]       # 2 (scalar access by position)
df.iloc[2:3, 0:1]  # 1x1 DataFrame containing 2

df.iloc[6]['M']    # 130
df.iat[6, 1]       # 130
df.iloc[6:7, 1:2]  # 1x1 DataFrame containing 130
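As a side note, .at and .iat are the dedicated scalar accessors; in this example the default RangeIndex means label-based and position-based access happen to coincide. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'L': [0, 3, 2, 3, 2, 4, 30, 7],
                   'M': [10, 23, 22, 73, 72, 14, 130, 17]})

# Position-based scalar access
v_pos = df.iat[2, 0]
# Label-based scalar access; identical here because the default
# RangeIndex labels equal the positions
v_lab = df.at[2, 'L']
print(v_pos, v_lab)
```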

Select all rows that contain the first x unique values in another column in Pandas [Python]

selected = df['col2'].unique().iloc[1:5]
apples = df[df['col2'].isin([selected])]
print(df)
Here is my pseudocode for what I'm trying to accomplish. I just want to get the first five unique values in column 2, and then subset the whole dataframe based on those values. I get this error on the first line:
AttributeError: 'numpy.ndarray' object has no attribute 'iloc'
The only issue is your array slicing: unique() returns a NumPy array, which has no .iloc, so slice it with plain [:5] instead.
df = pd.DataFrame({"col2":np.random.randint(1,50,100)})
df[df["col2"].isin(df['col2'].unique()[:5])]
Output:
    col2
0      3
1     13
2      1
3     27
4      4
9      1
20    13
27     1
31     4
35     4
42    13
43    27
48     3
59     4
60     4
67     4
90     3
95     4
96     4
98    13
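Because the answer above draws random data without a seed, its output will not reproduce exactly. A small deterministic sketch of the same pattern (values made up here for illustration); note that unique() preserves order of first appearance, so [:5] really is the first five distinct values:

```python
import pandas as pd

df = pd.DataFrame({'col2': [3, 13, 1, 27, 4, 4, 1, 13, 99, 50]})

# unique() keeps order of first appearance: [3, 13, 1, 27, 4, 99, 50]
first_five = df['col2'].unique()[:5]

# Keep every row whose col2 is among those first five distinct values
subset = df[df['col2'].isin(first_five)]
print(subset)
```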

R: How to make a violin/box plot of the last (or any) data points in a time series?

I have the following data frame, A, and would like to make a violin/box plot of the last data points (or any other selected) for all IDs in a time series, i.e. for time=90 the values for ID = 1...10 should be plotted.
A = data.frame(ID = rep(seq(1,5), each = 10),
               time = rep(seq(0,90, by = 10), 5),
               value = rnorm(50))
   ID time        value
1   1    0  0.056152116
2   1   10  0.560673698
3   1   20 -0.240922725
4   1   30 -1.054686869
5   1   40 -0.734477812
6   1   50  1.123602646
7   1   60 -2.242830898
8   1   70 -0.818526167
9   1   80  1.476234401
10  1   90 -0.332324134
11  2    0 -1.486034438
12  2   10  0.222252053
13  2   20 -0.675720560
14  2   30 -3.144918043
15  2   40  3.058383376
16  2   50  0.978174555
17  2   60 -0.280927730
18  2   70 -0.188338714
19  2   80 -1.115583389
20  2   90  0.362044729
...
41  5    0  0.687402844
42  5   10 -1.127714642
43  5   20  0.117758547
44  5   30  0.507666153
45  5   40  0.205580300
46  5   50 -1.033018214
47  5   60 -1.906279605
48  5   70  0.117539035
49  5   80 -0.968888556
50  5   90  0.122049005
Try this:
set.seed(42)
A = data.frame(ID = rep(seq(1,5), each = 10),
               time = rep(seq(0,90, by = 10), 5),
               value = rnorm(50))
library(ggplot2)
library(dplyr)
filter(A, time == 90) %>%
  ggplot(aes(y = value)) +
  geom_boxplot()
Created on 2020-06-09 by the reprex package (v0.3.0)

Keep the second entry in a dataframe

I am showing you below an example dataset and the output desired.
ID  number
 1      50
 1      49
 1      48
 2      47
 2      40
 2      31
 3      60
 3      51
 3      42
Example output:
 1      49
 2      40
 3      51
I want to keep the second entry for every group in my dataset. I have already grouped the rows by ID, but now I want to keep only the second entry for each ID and drop the rest.
Use GroupBy.nth with 1 to get the second row of each group, because Python counts from 0:
df1 = df.groupby('ID', as_index=False).nth(1)
print (df1)
   ID  number
1   1      49
4   2      40
7   3      51
Another solution with GroupBy.cumcount for counter and filtering by boolean indexing:
df1 = df[df.groupby('ID').cumcount() == 1]
Details:
print (df.groupby('ID').cumcount())
0    0
1    1
2    2
3    0
4    1
5    2
6    0
7    1
8    2
dtype: int64
EDIT: Solution for the second maximal value - first sort, then take the second row (values have to be unique per group):
df = (df.sort_values(['ID','number'], ascending=[True, False])
        .groupby('ID', as_index=False)
        .nth(1))
print (df)
   ID  number
1   1      49
4   2      40
7   3      51
If you want the second maximal value when duplicates exist, add DataFrame.drop_duplicates:
print (df)
   ID  number
0   1      50   <- first max
1   1      50   <- first max
2   1      48   <- second max
3   2      47
4   2      40
5   2      31
6   3      60
7   3      51
8   3      42
df3 = (df.drop_duplicates(['ID','number'])
         .sort_values(['ID','number'], ascending=[True, False])
         .groupby('ID', as_index=False)
         .nth(1))
print (df3)
   ID  number
2   1      48
4   2      40
7   3      51
If that is the case, we can use duplicated + drop_duplicates:
df = df[df.duplicated('ID')].drop_duplicates('ID')
   ID  number
1   1      49
4   2      40
7   3      51
A flexible solution with cumcount:
df[df.groupby('ID').cumcount() == 1].copy()
   ID  number
1   1      49
4   2      40
7   3      51
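The answers above leave out the construction of the sample frame; a self-contained sketch (data reconstructed from the question) wiring up both the nth and the cumcount approaches so they can be compared:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'number': [50, 49, 48, 47, 40, 31, 60, 51, 42]})

# Second row of each group (counting from 0)
second = df.groupby('ID').nth(1)

# Equivalent: number rows within each group and keep counter == 1
second_alt = df[df.groupby('ID').cumcount() == 1]
print(second_alt)
```

Both select one row per ID: the 49, 40, and 51 entries.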

Winsorize within groups of dataframe

I have a dataframe like this:
df = pd.DataFrame([[1, 2],
                   [1, 4],
                   [1, 5],
                   [2, 65],
                   [2, 34],
                   [2, 23],
                   [2, 45]], columns=['label', 'score'])
Is there an efficient way to create a column score_winsor that winsorises the score column within the groups at the 1% level?
I tried this with no success:
df['score_winsor'] = df.groupby('label')['score'].transform(lambda x: max(x.quantile(.01), min(x, x.quantile(.99))))
You could use scipy's implementation of winsorize:
from scipy.stats.mstats import winsorize
df["score_winsor"] = df.groupby('label')['score'].transform(lambda x: winsorize(x, limits=[0.01, 0.01]))
(Note: with groups this small, 1% limits round down to clipping zero observations, so the scores come back unchanged.)
Output
>>> df
   label  score  score_winsor
0      1      2             2
1      1      4             4
2      1      5             5
3      2     65            65
4      2     34            34
5      2     23            23
6      2     45            45
This works:
import numpy as np
df['score_winsor'] = df.groupby('label')['score'].transform(
    lambda x: np.maximum(x.quantile(.01), np.minimum(x, x.quantile(.99))))
Output
print(df.to_string())
   label  score  score_winsor
0      1      2          2.04
1      1      4          4.00
2      1      5          4.98
3      2     65         64.40
4      2     34         34.00
5      2     23         23.33
6      2     45         45.00
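An equivalent and arguably cleaner form of the quantile-clipping approach uses Series.clip, which applies both bounds in one call. A sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 4], [1, 5], [2, 65],
                   [2, 34], [2, 23], [2, 45]], columns=['label', 'score'])

# Clip each group's scores to that group's own 1st/99th percentiles
df['score_winsor'] = df.groupby('label')['score'].transform(
    lambda x: x.clip(lower=x.quantile(.01), upper=x.quantile(.99)))
print(df)
```

This produces the same values as the np.maximum/np.minimum version above.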