Selecting the higher of two data - pandas

I'm working with Python Pandas trying to sort some student testing data. On occasion, students will test twice during the same testing window, and I want to save only the highest of the two tests. Here's an example of my dataset.
Name Score
Alice 32
Alice 75
John 89
Mark 40
Mark 70
Amy 60
Any ideas on how I can save only the higher score for each student?

If your data is in the dataframe df, you can sort by the score in descending order and drop duplicate names, keeping the first:
df.sort_values(by='Score', ascending=False).drop_duplicates(subset='Name', keep='first')

You can do this with groupby. It works like this:
df.groupby('Name').agg({'Score': 'max'})
It results in:
Score
Name
Alice 75
Amy 60
John 89
Mark 70
By the way, in this particular setup you could also sort on the score and then use drop_duplicates to make the names unique. This yields the same result, but it is less extensible (e.g. if you later want to add the average score). It would look like this:
df.sort_values(['Name', 'Score']).drop_duplicates(['Name'], keep='last')
From the test data you posted:
import pandas as pd
from io import StringIO
sio= StringIO("""Name Score
Alice 32
Alice 75
John 89
Mark 40
Mark 70
Amy 60 """)
df = pd.read_csv(sio, sep=r'\s+')
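To illustrate the extensibility point made above, here is a minimal sketch that computes both the maximum and the average score per student in one groupby pass. The variable names raw and summary are chosen for illustration only and are not part of the original answer.

```python
import pandas as pd
from io import StringIO

raw = """Name Score
Alice 32
Alice 75
John 89
Mark 40
Mark 70
Amy 60
"""
df = pd.read_csv(StringIO(raw), sep=r"\s+")

# One groupby pass yields several per-student statistics at once.
summary = df.groupby("Name")["Score"].agg(["max", "mean"])
print(summary)
```

The result has one row per student and one column per statistic, so further aggregations can be added to the list without changing the rest of the code.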

There are multiple ways to do that; two of them are:
In [8]: df = pd.DataFrame({"Score" : [32, 75, 89, 40, 70, 60],
...: "Name" : ["Alice", "Alice", "John", "Mark", "Mark", "Amy"]})
...: df
Out[8]:
Score Name
0 32 Alice
1 75 Alice
2 89 John
3 40 Mark
4 70 Mark
5 60 Amy
In [13]: %time df.groupby("Name").max()
CPU times: user 2.26 ms, sys: 286 µs, total: 2.54 ms
Wall time: 2.11 ms
Out[13]:
Score
Name
Alice 75
Amy 60
John 89
Mark 70
In [14]: %time df.sort_values(["Name", "Score"]).drop_duplicates(subset="Name", keep="last")
CPU times: user 2.25 ms, sys: 0 ns, total: 2.25 ms
Wall time: 1.89 ms
Out[14]:
Score Name
1 75 Alice
5 60 Amy
2 89 John
4 70 Mark
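One caveat with the drop_duplicates variant: keep="last" only picks the highest score if the rows are ordered by score within each name. The following sketch assumes a reordered copy of the data (each student's higher score listed first, which is not how the question's data is laid out) and checks that sorting on both columns agrees with groupby.

```python
import pandas as pd

# Same scores as in the question, but the higher score now comes first per student.
df = pd.DataFrame({"Name": ["Alice", "Alice", "John", "Mark", "Mark", "Amy"],
                   "Score": [75, 32, 89, 70, 40, 60]})

by_group = df.groupby("Name")["Score"].max()

# Sorting on both columns guarantees the last row per name is the highest.
by_sort = (df.sort_values(["Name", "Score"])
             .drop_duplicates(subset="Name", keep="last")
             .set_index("Name")["Score"])

print(by_group.to_dict() == by_sort.to_dict())
```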

This question has already been answered here on StackOverflow.
You can concatenate two pandas data frames and then calculate the maximum in each row. df1 and df2 are the data frames holding the students' scores:
import pandas as pd
df1 = pd.DataFrame({'Alice': 3,
'John': 8,
'Mark': 7.5,
'Amy': 0},
index=[0])
df2 = pd.DataFrame({'Alice': 7,
'Mark': 7},
index=[0])
result = pd.concat([df1, df2], sort=True)
result = result.T
result["maxvalue"] = result.max(axis=1)

Related

how to get cell value of a pd data frame [duplicate]

Let's say we have a pandas dataframe:
name age sal
0 Alex 20 100
1 Jane 15 200
2 John 25 300
3 Lsd 23 392
4 Mari 21 380
Let's say, a few rows are now deleted and we don't know the indexes that have been deleted. For example, we delete row index 1 using df.drop([1]). And now the data frame comes down to this:
name age sal
0 Alex 20 100
2 John 25 300
3 Lsd 23 392
4 Mari 21 380
I would like to get the value from row index 3 and column "age". It should return 23. How do I do that?
df.iloc[3, df.columns.get_loc('age')] does not work because it will return 21. I guess iloc takes the consecutive row index?
Use .loc to get rows by label and .iloc to get rows by position:
>>> df.loc[3, 'age']
23
>>> df.iloc[2, df.columns.get_loc('age')]
23
More about Indexing and selecting data
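A minimal, self-contained sketch of the difference, rebuilding the data frame from the question. The variable names by_label, by_position and scalar are illustrative; .at is added here as a label-based accessor for single values.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alex", "Jane", "John", "Lsd", "Mari"],
                   "age": [20, 15, 25, 23, 21],
                   "sal": [100, 200, 300, 392, 380]})
df = df.drop([1])  # remaining index labels: 0, 2, 3, 4

by_label = df.loc[3, "age"]                          # row labelled 3
by_position = df.iloc[3, df.columns.get_loc("age")]  # fourth remaining row
scalar = df.at[3, "age"]                             # label-based scalar access

print(by_label, by_position, scalar)
```

Here by_label and scalar both refer to the row labelled 3, while by_position refers to the fourth row that is left after the drop, which is why the two approaches disagree once rows have been removed.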
dataset = ({'name':['Alex', 'Jane', 'John', 'Lsd', 'Mari'],
'age': [20, 15, 25, 23, 21],
'sal': [100, 200, 300, 392, 380]})
df = pd.DataFrame(dataset)
df.drop([1], inplace=True)
df.loc[3,['age']]
Try this one, using [row label, column name]:
value = df.loc[1, "column_name"]

Slicing pandas dataframe by closest value

I have a pandas data frame that looks like this:
age score
5 72 99.424
6 70 99.441
7 69 99.442
8 67 99.443
9 71 99.448
mean score: 99.4396
The mean is taken over the whole score column. How can I slice/get the age values whose score is within, say, +/- 0.001 of the mean score?
So in this case: 67 and 69
mean = df['score'].mean()
df[df['score'].between(mean - .001, mean + .001)]['age']
import pandas as pd
import statistics
df = pd.DataFrame({"age": [72, 70, 69, 67, 71], "score": (99.424, 99.441, 99.442, 99.443, 99.448)})
df["diff"] = abs(df["score"] - statistics.mean(list(df["score"])))
You get :
age score diff
0 72 99.424 0.0156
1 70 99.441 0.0014
2 69 99.442 0.0024
3 67 99.443 0.0034
4 71 99.448 0.0084
Then :
x = 0.002
ages = list(df.loc[df["diff"] < x]["age"])
[Out]: [70]
x will be your parameter for the difference with the mean.
EDIT: by the way, we cannot reproduce your result because we do not have your whole score column.
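If a fixed tolerance is hard to choose, an alternative is to take the k rows closest to the mean. This is a sketch using only the five rows shown in the question; k = 2 and the names distance and closest are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({"age": [72, 70, 69, 67, 71],
                   "score": [99.424, 99.441, 99.442, 99.443, 99.448]})

# Absolute distance of each score from the column mean.
distance = (df["score"] - df["score"].mean()).abs()

# Index labels of the two rows closest to the mean (k = 2 is an assumption).
closest = distance.nsmallest(2).index
ages = df.loc[closest, "age"].tolist()
print(ages)
```

With only these five rows the two closest ages are 70 and 69, which differs from the question's expected answer for the same reason given in the EDIT above.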

Pandas Decile Rank

I just used the pandas qcut function to create a decile ranking, but how do I look at the bounds of each ranking? Basically, how do I know what numbers fall in the range of the ranking of 1 or 2 or 3 etc?
I hope the following python code with 2 short examples can help you. For the second example I used the isin method.
import numpy as np
import pandas as pd
df = {'Name' : ['Mike', 'Anton', 'Simon', 'Amy',
'Claudia', 'Peter', 'David', 'Tom'],
'Score' : [42, 63, 75, 97, 61, 30, 80, 13]}
df = pd.DataFrame(df, columns = ['Name', 'Score'])
df['decile_rank'] = pd.qcut(df['Score'], 10,
labels = False)
print(df)
Output:
Name Score decile_rank
0 Mike 42 2
1 Anton 63 5
2 Simon 75 7
3 Amy 97 9
4 Claudia 61 4
5 Peter 30 1
6 David 80 8
7 Tom 13 0
rank_1 = df[df['decile_rank']==1]
print(rank_1)
Output:
Name Score decile_rank
5 Peter 30 1
rank_1_and_2 = df[df['decile_rank'].isin([1,2])]
print(rank_1_and_2)
Output:
Name Score decile_rank
0 Mike 42 2
5 Peter 30 1
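To see the actual bounds of each rank, qcut can also return the bin edges through its retbins parameter. The sketch below reuses the scores from the example above; the bounds table and its column names lower and upper are illustrative.

```python
import pandas as pd

scores = pd.Series([42, 63, 75, 97, 61, 30, 80, 13], name="Score")

# retbins=True also returns the array of bin edges (11 edges for 10 bins).
ranks, edges = pd.qcut(scores, 10, labels=False, retbins=True)

# Pair each rank with its lower and upper bound.
bounds = pd.DataFrame({"decile_rank": range(len(edges) - 1),
                       "lower": edges[:-1],
                       "upper": edges[1:]})
print(bounds)
```

By default the intervals are closed on the right, so a score equal to an upper bound falls into that rank rather than the next one.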

Python keep rows if a specific column contains a particular value or string

I am very green in python. I have not found a specific answer to my problem searching for online resources. With that said it would be great if you could give some hints.
I have an example of df as below:
import pandas as pd
df = pd.DataFrame({'names':['Alex','Joseph','Kate'],'exam1': [90, 68, 70], 'exam2': [100, 98, 88]})
print(df)
names exam1 exam2
0 Alex 90 100
1 Joseph 68 98
2 Kate 70 88
I would like to make a for loop to iterate over the rows and, if the names column is equal to Joseph or Kate, collect those rows into a new df as below:
names exam1 exam2
0 Joseph 68 98
1 Kate 70 88
I know there is a way like below but I would like to do it via for loop.
names_to_keep = ['Joseph', 'Kate']  # avoid shadowing the built-in list
new_df = df[df['names'].isin(names_to_keep)]
Thank you in Advance.
Not sure why you'd want to use loops, but this is how you'd do it:
rows = []
# iterrows yields (index label, row as a Series) pairs
for index, row in df.iterrows():
    if row['names'] == 'Kate' or row['names'] == 'Joseph':
        rows.append(row)
new_df = pd.DataFrame(rows)
print(new_df)
names exam1 exam2
1 Joseph 68 98
2 Kate 70 88
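The loop keeps the original index labels (1 and 2), whereas the desired output in the question is numbered from 0. A minimal sketch of the same loop followed by reset_index, assuming that renumbering is what is wanted; the set wanted is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({"names": ["Alex", "Joseph", "Kate"],
                   "exam1": [90, 68, 70],
                   "exam2": [100, 98, 88]})
wanted = {"Joseph", "Kate"}

# Collect the matching rows one by one, as requested in the question.
rows = [row for _, row in df.iterrows() if row["names"] in wanted]

# drop=True discards the old labels (1, 2) and renumbers from 0.
new_df = pd.DataFrame(rows).reset_index(drop=True)
print(new_df)
```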

How to order dataframe using a list in pandas

I have a pandas dataframe as follows.
import pandas as pd
data = [['Alex',10, 175],['Bob',12, 178],['Clarke',13, 179]]
df = pd.DataFrame(data,columns=['Name','Age', 'Height'])
print(df)
I also have a list as follows.
mynames = ['Emj', 'Bob', 'Jenne', 'Alex', 'Clarke']
I want to order the rows of my dataframe in the order of mynames list. In other words, my output should be as follows.
Name Age Height
0 Bob 12 178
1 Alex 10 175
2 Clarke 13 179
I was trying to do this as follows. I am wondering if there is an easier way to do this in pandas than converting the dataframe to a list.
I am happy to provide more details if needed.
You can do pd.Categorical + argsort
df = df.iloc[pd.Categorical(df.Name, mynames).argsort()]
Name Age Height
1 Bob 12 178
0 Alex 10 175
2 Clarke 13 179
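A variant of the same idea, sketched with sort_values on an ordered categorical and a final reset_index so the index matches the output requested in the question. The helper column _order is hypothetical and is dropped again before the result is returned.

```python
import pandas as pd

df = pd.DataFrame([["Alex", 10, 175], ["Bob", 12, 178], ["Clarke", 13, 179]],
                  columns=["Name", "Age", "Height"])
mynames = ["Emj", "Bob", "Jenne", "Alex", "Clarke"]

# An ordered categorical sorts by position in mynames, not alphabetically.
order = pd.Categorical(df["Name"], categories=mynames, ordered=True)
result = (df.assign(_order=order)      # _order is a hypothetical helper column
            .sort_values("_order")
            .drop(columns="_order")
            .reset_index(drop=True))
print(result)
```

Names in mynames that do not occur in the data frame (Emj and Jenne here) are simply unused categories and do not produce extra rows.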