pandas DataFrame: how to find an element using row and column

Is there a way to find an element in a pandas DataFrame by using its row and column positions? For example, if we have a list, L = [0,3,2,3,2,4,30,7], we can use L[2] and get the value 2 in return.

Use .iloc
df = pd.DataFrame({'L':[0,3,2,3,2,4,30,7], 'M':[10,23,22,73,72,14,130,17]})
    L    M
0   0   10
1   3   23
2   2   22
3   3   73
4   2   72
5   4   14
6  30  130
7   7   17
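Row 2, column 'L' (the slice form returns a 1x1 DataFrame rather than a scalar):
df.iloc[2]['L']
df.iloc[2:3, 0:1]
df.iat[2, 0]
2
Row 6, column 'M':
df.iloc[6]['M']
df.iloc[6:7, 1:2]
df.iat[6, 1]
130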
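
For comparison, here is a minimal sketch of positional versus label-based lookups on the same data. The index labels 'a' to 'h' are an assumption added for illustration; the original frame uses the default integer index.

```python
import pandas as pd

df = pd.DataFrame(
    {'L': [0, 3, 2, 3, 2, 4, 30, 7], 'M': [10, 23, 22, 73, 72, 14, 130, 17]},
    index=list('abcdefgh'),  # hypothetical labels, for illustration only
)

# Position-based scalar access: third row, first column
by_position = df.iat[2, 0]
# Label-based scalar access: row labelled 'c', column 'L'
by_label = df.at['c', 'L']
# iloc with two scalar positions also returns a scalar
by_iloc = df.iloc[2, 0]
# Slices keep the 2-D shape and return a 1x1 DataFrame
as_frame = df.iloc[2:3, 0:1]

print(by_position, by_label, by_iloc)
print(as_frame.shape)
```

Use .iat/.iloc when you know the position and .at/.loc when you know the labels; the two only coincide when the index is the default 0..n-1 range.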

Related

Select all rows that contain the first x unique values in another column in Pandas [Python]

selected =df['col2'].unique().iloc[1:5]
apples = df[df['col2'].isin([selected])]
print(df)
Here is my pseudocode for what I'm trying to accomplish. I just want to get the first five unique values in column 2, and then subset the whole dataframe based on those values. I get this error on the first line:
AttributeError: 'numpy.ndarray' object has no attribute 'iloc'
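The only issue is your array slicing: Series.unique() returns a NumPy array, which has no .iloc, so slice it with plain brackets. Also pass the array to isin directly rather than wrapping it in a list.
import numpy as np
import pandas as pd
df = pd.DataFrame({"col2": np.random.randint(1, 50, 100)})
df[df["col2"].isin(df["col2"].unique()[:5])]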
output
    col2
0      3
1     13
2      1
3     27
4      4
9      1
20    13
27     1
31     4
35     4
42    13
43    27
48     3
59     4
60     4
67     4
90     3
95     4
96     4
98    13
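
Since the answer uses random data, its output differs on every run. Below is a deterministic sketch of the same idea; the column values are made up for illustration and do not come from the question.

```python
import pandas as pd

df = pd.DataFrame({'col2': [3, 13, 1, 3, 27, 4, 8, 1, 9, 4, 13, 8]})

# unique() returns a NumPy array in order of first appearance,
# so plain slicing yields the first five distinct values
selected = df['col2'].unique()[:5]

# isin() accepts the array directly; no extra list wrapper is needed
subset = df[df['col2'].isin(selected)]
print(subset)
```

Rows whose value is 8 or 9 are dropped because those values first appear after the fifth distinct value.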

Keep the second entry in a dataframe

I am showing you below an example dataset and the output desired.
ID number
1 50
1 49
1 48
2 47
2 40
2 31
3 60
3 51
3 42
Example output
1 49
2 40
3 51
I want to keep the second entry for every group in my dataset. I have already grouped the rows by ID, but now I want to keep only the second entry for each ID and drop all the other rows with that ID.
Use GroupBy.nth with 1 for second rows, because python counts from 0:
df1 = df.groupby('ID', as_index=False).nth(1)
print (df1)
ID number
1 1 49
4 2 40
7 3 51
Another solution uses GroupBy.cumcount as a per-group counter and filters with boolean indexing:
df1 = df[df.groupby('ID').cumcount() == 1]
Details:
print (df.groupby('ID').cumcount())
0 0
1 1
2 2
3 0
4 1
5 2
6 0
7 1
8 2
dtype: int64
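
Putting the cumcount approach together with the example data from the question gives the following self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'number': [50, 49, 48, 47, 40, 31, 60, 51, 42],
})

# cumcount() numbers the rows inside each group starting at 0,
# so a count of 1 marks the second entry of every ID
second = df[df.groupby('ID').cumcount() == 1]
print(second)
```

A group with fewer than two rows simply contributes no row to the result.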
EDIT: To get the second-largest value per group, sort first and then take the second row. This assumes the values are unique within each group:
df = (df.sort_values(['ID','number'], ascending=[True, False])
.groupby('ID', as_index=False)
.nth(1))
print (df)
ID number
1 1 49
4 2 40
7 3 51
If you want the second-largest value and a group may contain duplicates, add DataFrame.drop_duplicates first:
print (df)
ID number
0 1 50 <-first max
1 1 50 <-first max
2 1 48 <-second max
3 2 47
4 2 40
5 2 31
6 3 60
7 3 51
8 3 42
df3 = (df.drop_duplicates(['ID','number'])
.sort_values(['ID','number'], ascending=[True, False])
.groupby('ID', as_index=False)
.nth(1))
print (df3)
ID number
2 1 48
4 2 40
7 3 51
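
As an alternative that is not part of the answer above, the second-largest distinct value can also be picked with a dense rank per group. This is only a sketch on the same duplicate-containing data:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'number': [50, 50, 48, 47, 40, 31, 60, 51, 42],
})

# Dense ranking gives tied values the same rank without gaps,
# so rank 2 is the second-largest distinct value in each group
rank = df.groupby('ID')['number'].rank(method='dense', ascending=False)

# drop_duplicates guards against ties at the second-largest value
second_max = df[rank == 2].drop_duplicates(['ID', 'number'])
print(second_max)
```

This variant does not need the frame to be sorted beforehand.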
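If the goal is simply the second row of each ID, we can use duplicated + drop_duplicates: duplicated marks every row after the first one per ID, and drop_duplicates then keeps the first of those.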
df=df[df.duplicated('ID')].drop_duplicates('ID')
ID number
1 1 49
4 2 40
7 3 51
A more flexible solution uses cumcount, since the position can be changed to select any entry:
df[df.groupby('ID').cumcount()==1].copy()
ID number
1 1 49
4 2 40
7 3 51

Winsorize within groups of dataframe

I have a dataframe like this:
df = pd.DataFrame([[1,2],
[1,4],
[1,5],
[2,65],
[2,34],
[2,23],
[2,45]], columns = ['label', 'score'])
Is there an efficient way to create a column score_winsor that winsorises the score column within the groups at the 1% level?
I tried this with no success:
df['score_winsor'] = df.groupby('label')['score'].transform(lambda x: max(x.quantile(.01), min(x, x.quantile(.99))))
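You could use SciPy's implementation of winsorize. Note that with groups this small, a 1% limit on each side covers less than one observation, so no value is replaced:
from scipy.stats.mstats import winsorize
df["score_winsor"] = df.groupby('label')['score'].transform(lambda x: winsorize(x, limits=[0.01, 0.01]))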
Output
>>> df
label score score_winsor
0 1 2 2
1 1 4 4
2 1 5 5
3 2 65 65
4 2 34 34
5 2 23 23
6 2 45 45
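This works, because np.maximum and np.minimum operate element-wise on the Series, unlike the built-in max and min: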
df['score_winsor'] = df.groupby('label')['score'].transform(lambda x: np.maximum(x.quantile(.01), np.minimum(x, x.quantile(.99))))
Output
print(df.to_string())
label score score_winsor
0 1 2 2.04
1 1 4 4.00
2 1 5 4.98
3 2 65 64.40
4 2 34 34.00
5 2 23 23.33
6 2 45 45.00
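
The same quantile capping can also be written with Series.clip, which may read more clearly than nested np.maximum/np.minimum calls. This is a sketch; the helper name clip_to_quantiles is hypothetical and not from either answer.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2], [1, 4], [1, 5], [2, 65], [2, 34], [2, 23], [2, 45]],
    columns=['label', 'score'],
)

def clip_to_quantiles(s, lower=0.01, upper=0.99):
    # Hypothetical helper: cap values at the group's own quantiles
    return s.clip(lower=s.quantile(lower), upper=s.quantile(upper))

# transform() applies the helper to each label group and
# returns a result aligned with the original rows
df['score_winsor'] = df.groupby('label')['score'].transform(clip_to_quantiles)
print(df)
```

Like the np.maximum/np.minimum version, this caps values at interpolated quantiles, so the result can contain values that were never observed, whereas SciPy's winsorize replaces extremes with the nearest retained observation.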

How to add a new column (not replace)

import pandas as pd
test=[
[14,12,1,13,15],
[11,21,1,19,32],
[48,16,1,16,12],
[22,24,1,18,41],
]
df = pd.DataFrame(test)
x = [1,2,3,4]
df['new'] = pd.DataFrame(x)
In this example, the column 'new' is added to df itself.
What I want instead is a new DataFrame (df1) that includes the column 'new' (six columns), while df stays unchanged (five columns).
How do I do that?
You can create the new DataFrame with .assign:
import pandas as pd
df= pd.DataFrame(test)
df1 = df.assign(new=x)
print(df)
0 1 2 3 4
0 14 12 1 13 15
1 11 21 1 19 32
2 48 16 1 16 12
3 22 24 1 18 41
print(df1)
0 1 2 3 4 new
0 14 12 1 13 15 1
1 11 21 1 19 32 2
2 48 16 1 16 12 3
3 22 24 1 18 41 4
.assign returns a new object, so you can modify it without affecting the original. The other alternative would be
df1 = df.copy() #New object, modifications do not affect `df`.
df1['new'] = x
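Alternatively, use DataFrame.insert on a copy, because insert modifies the frame in place. Here 'e' is the new column and np.random.randint generates one random value per row:
import numpy as np
df1 = df.copy()
df1.insert(len(df1.columns), 'e', np.random.randint(0, 5, len(df1)))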
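
A quick way to confirm that the original frame is untouched is to compare shapes after the call. Here is a minimal sketch using the data from the question:

```python
import pandas as pd

test = [
    [14, 12, 1, 13, 15],
    [11, 21, 1, 19, 32],
    [48, 16, 1, 16, 12],
    [22, 24, 1, 18, 41],
]
df = pd.DataFrame(test)
x = [1, 2, 3, 4]

# assign() returns a new DataFrame and leaves df untouched
df1 = df.assign(new=x)

print(df.shape)   # original keeps its five columns
print(df1.shape)  # new frame has six
```

By contrast, df['new'] = x and df.insert(...) both change df in place, which is why they need a copy first.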

Finding the maximum value for each column and the corresponding value of a common column

I am trying to get the maximum value of each column in a dataframe, together with the time at which it occurs.
l = [[1,6,2,6,7],[2,66,2,6,8],[3,44,2,44,8],[4,5,35,6,8],[5,3,9,6,95]]
dft = pd.DataFrame(l, columns=['Time','25','50','75','100'])
max_t = pd.DataFrame()
max_t['Max_f'] = dft.loc[:, ['25','50','75','100']].max(axis=0)
max_t
I managed to get the maximum value in each column, however, I could not figure out how to get the time.
IIUC:
In [48]: dft
Out[48]:
Time 25 50 75 100
0 1 6 2 6 7
1 2 66 2 6 8
2 3 44 2 44 8
3 4 5 35 6 8
4 5 3 9 6 95
In [49]: dft.set_index('Time').agg(['max','idxmax']).T
Out[49]:
max idxmax
25 66 2
50 35 4
75 44 3
100 95 5
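
If you would rather build the result column by column, so that each column keeps its own dtype, the same idea can be sketched with separate max and idxmax calls. The result column name 'Time' is an assumption; 'Max_f' comes from the question.

```python
import pandas as pd

l = [[1, 6, 2, 6, 7], [2, 66, 2, 6, 8], [3, 44, 2, 44, 8],
     [4, 5, 35, 6, 8], [5, 3, 9, 6, 95]]
dft = pd.DataFrame(l, columns=['Time', '25', '50', '75', '100'])

# With Time as the index, idxmax() returns the Time of each maximum
indexed = dft.set_index('Time')
max_t = pd.DataFrame({
    'Max_f': indexed.max(),    # largest value per column
    'Time': indexed.idxmax(),  # index label (Time) of the first maximum
})
print(max_t)
```

When a maximum occurs more than once in a column, idxmax reports the first occurrence.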