Pandas: get the unique value with the biggest index [duplicate]

This question already has answers here:
group by pandas dataframe and select latest in each group
(6 answers)
Closed 4 years ago.
I have a df like this
Name Data
0 Mike 123
1 Mike 456
2 Mike 789
3 Fred 345
4 Fred 123
5 Ted 333
I need to get each unique Name with the value at its highest index.
Desired output:
Name Data
0 Mike 789
1 Fred 123
2 Ted 333

Step 1: Import pandas.
import pandas as pd
Step 2: Copy the OP's df values to the clipboard.
Step 3: Run the following command to create a data frame from the OP's sample.
df = pd.read_clipboard()
Step 4: Run the following code to remove duplicates, keeping the last value per Name.
df.drop_duplicates(subset='Name', keep='last')
Output will be as follows.
Name Data
2 Mike 789
4 Fred 123
5 Ted 333
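If copying from the clipboard isn't convenient, the same result can be reproduced with the sample data built inline. This is a sketch with the frame reconstructed from the question; groupby().tail(1) keeps the last row per Name, just like drop_duplicates(keep='last'):

```python
import pandas as pd

# Rebuild the OP's sample frame inline instead of using read_clipboard()
df = pd.DataFrame({
    "Name": ["Mike", "Mike", "Mike", "Fred", "Fred", "Ted"],
    "Data": [123, 456, 789, 345, 123, 333],
})

# Keep the last (highest-index) row per Name, preserving original order
result = df.groupby("Name").tail(1).reset_index(drop=True)
print(result)
```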

Related

Python Pandas: Rerange data from vertical to horizontal

I would like to transform a data frame using pandas.
Old-Dataframe:
Person-ID  Reference-ID  Name
1          1             Max
2          1             Kevin
3          1             Sara
4          4             Chessi
5          9             Fernando
into a new-dataframe in the following format.
New-Dataframe:
Person-ID  Reference-ID  Member1   Member2  Member3
1          1             Max       Kevin    Sara
4          4             Chessi
5          9             Fernando
My solution would be:
1. Write all the Reference-IDs from the old dataframe into the new dataframe.
2. Write all the Person-IDs whose Reference-ID does not occur as a Person-ID in the old dataframe (see Fernando, whose Reference-ID 9 matches no person).
3. Loop through the old dataframe and add each name to the corresponding row in the new dataframe.
Do you have any suggestions, on how to make this faster/simpler?
PS: The old-dataframe can be made like this
person_id = [1,2,3,4,5]
reference_id = [1,1,1,4,9]
name = ['Max','Kevin','Sara',"Chessi","Fernando"]
list_tuples=list(zip(person_id,reference_id,name))
old_dataframe = pd.DataFrame(list_tuples,columns=['Person_ID','Reference_id','Name'])
You can use pivot_table() like this:
df1 = pd.pivot_table(df, index=['Reference-ID'], values=['Person-ID', 'Name'], aggfunc={'Person-ID': 'min', 'Name': lambda x: list(x)})
df1.reset_index()[['Person-ID','Reference-ID']].join(pd.DataFrame(df1.Name.tolist()))
Output:
Person-ID  Reference-ID  0         1      2
1          1             Max       Kevin  Sara
4          4             Chessi    None   None
5          9             Fernando  None   None
You can reassign column names like this:
df2=df1.reset_index()[['Person-ID','Reference-ID']].join(pd.DataFrame(df1.Name.tolist()))
df2.columns=list(df2.columns[0:2])+[f"Member{x+1}" for x in df2.columns[2:]]
Output:
Person-ID  Reference-ID  Member1   Member2  Member3
1          1             Max       Kevin    Sara
4          4             Chessi    None     None
5          9             Fernando  None     None
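An alternative sketch that avoids the list-building lambda: number the members within each group with cumcount, then unstack that counter into columns. This assumes the underscored column names from the OP's PS snippet:

```python
import pandas as pd

# OP's construction (underscored column names, as in the PS)
person_id = [1, 2, 3, 4, 5]
reference_id = [1, 1, 1, 4, 9]
name = ['Max', 'Kevin', 'Sara', 'Chessi', 'Fernando']
old_dataframe = pd.DataFrame(list(zip(person_id, reference_id, name)),
                             columns=['Person_ID', 'Reference_id', 'Name'])

# Number the members within each Reference_id, keep the smallest Person_ID
# per group, then unstack the member counter into columns
g = old_dataframe.groupby('Reference_id')
wide = (old_dataframe
        .assign(member=g.cumcount() + 1,
                Person_ID=g['Person_ID'].transform('min'))
        .set_index(['Person_ID', 'Reference_id', 'member'])['Name']
        .unstack('member'))
wide.columns = [f"Member{c}" for c in wide.columns]
wide = wide.reset_index()
print(wide)
```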

pandas drop row if value is not in different dataframe

I have two dataframes and want to drop rows from dataframe 'Total' if there is not a matching ID in dataframe 'Student'
DF Total:
ID name
0 115 john
1 118 mike
2 34 mac
3 897 sarah
DF Student:
ID name
0 34 mac
1 118 mike
2 897 sarah
In this example since ID 115 is not present in the Student df that row would be dropped from df Total and the resulting table would look like this:
ID name
0 118 mike
1 34 mac
2 897 sarah
One way is to use the .isin() method:
df_total[df_total['ID'].isin(df_student['ID'])]
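For a self-contained check, here is a sketch with both sample frames rebuilt inline:

```python
import pandas as pd

df_total = pd.DataFrame({"ID": [115, 118, 34, 897],
                         "name": ["john", "mike", "mac", "sarah"]})
df_student = pd.DataFrame({"ID": [34, 118, 897],
                           "name": ["mac", "mike", "sarah"]})

# Keep only rows of Total whose ID also appears in Student
filtered = df_total[df_total["ID"].isin(df_student["ID"])].reset_index(drop=True)
print(filtered)
```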

Calculate intersecting sums from flat DataFrame for a heatmap

I'm trying to wrangle some data to show how many items a range of people have in common. The goal is to show this data in a heatmap format via Seaborn to understand these overlaps visually.
Here's some sample data:
demo_df = pd.DataFrame([
    ("Get Back", 1, 0, 2),
    ("Help", 5, 2, 0),
    ("Let It Be", 0, 2, 2)
], columns=["Song", "John", "Paul", "Ringo"])
demo_df = demo_df.set_index("Song")  # set_index returns a new frame, so assign it back
John Paul Ringo
Song
Get Back 1 0 2
Help 5 2 0
Let It Be 0 2 2
I don't need a breakdown by song, just the total of shared items. The resulting data would show a sum of how many items they share like this:
Name   John  Paul  Ringo
John   -     7     3
Paul   7     -     4
Ringo  3     4     -
So far I've tried a few options with groupby and unstack but haven't been able to work out how to cross match the names into both column and header rows.
You can take a dot product in both directions, then zero the diagonal:
out = df.T.dot(df.ne(0)) + df.T.ne(0).dot(df)
np.fill_diagonal(out.values, 0)
out
Out[176]:
John Paul Ringo
John 0 7 3
Paul 7 0 4
Ringo 3 4 0
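Putting the answer together as a runnable sketch (sample frame rebuilt inline): for each pair of people, the two dot products sum both people's counts over the songs where the other person's count is nonzero, which totals the shared items.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    ("Get Back", 1, 0, 2),
    ("Help", 5, 2, 0),
    ("Let It Be", 0, 2, 2)
], columns=["Song", "John", "Paul", "Ringo"]).set_index("Song")

# For each pair, sum both people's counts over songs where both are nonzero
out = df.T.dot(df.ne(0)) + df.T.ne(0).dot(df)
np.fill_diagonal(out.values, 0)  # a person shares nothing with themselves
print(out)
```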

compare 2 data frames based on 3 columns and update second data

Here is what my data frames look like. I need to compare them: if df1.mid == df2.mid and df1.name == df2.name and df1.pid != df2.pid, then update df2.pid with df1.pid.
df1
mid pid name
1 2 John
2 14 Peter
3 16 Emma
4 20 Adam
df2
mid pid name
1 2 John
2 16 Peter
3 16 Emma
expected result in df2 after update
mid pid name
1 2 John
2 14 Peter
3 16 Emma
A merge is what you want, but there are some finer points to take into account:
df2.merge(df1, on=['mid', 'name'], how='left', suffixes=('_2', '_1')) \
.assign(pid=lambda x: x['pid_1'].combine_first(x['pid_2'])) \
.drop(columns=['pid_1', 'pid_2'])
merge aligns df1 and df2 based on mid and name. The two pid columns are renamed pid_1 and pid_2.
assign creates a new pid column by combining the two previous pids: if pid_1 is available, use that, if not, keep the original pid_2
drop drops pid_1 and pid_2, leaving one and only one pid column
You can also try a right merge (note that DataFrame.join aligns on the other frame's index, so merge is needed to match on columns):
df3 = df1.merge(df2, on=['mid', 'name', 'pid'], how='right')
This keeps df2's rows but does not by itself replace pid; the merge/combine_first approach above handles the update.
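As a runnable end-to-end sketch (sample frames rebuilt inline), the same conditional update can also be done by aligning both frames on (mid, name) and letting DataFrame.update overwrite pid in place:

```python
import pandas as pd

df1 = pd.DataFrame({"mid": [1, 2, 3, 4], "pid": [2, 14, 16, 20],
                    "name": ["John", "Peter", "Emma", "Adam"]})
df2 = pd.DataFrame({"mid": [1, 2, 3], "pid": [2, 16, 16],
                    "name": ["John", "Peter", "Emma"]})

# Align both frames on (mid, name); update() overwrites df2's pid with df1's
df2 = df2.set_index(["mid", "name"])
df2.update(df1.set_index(["mid", "name"])[["pid"]])
df2 = df2.reset_index()[["mid", "pid", "name"]]
# Cast back to int in case the alignment promoted pid to float
df2["pid"] = df2["pid"].astype(int)
print(df2)
```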

Pandas Dataframe and duplicate names [duplicate]

This question already has answers here:
How do I Pandas group-by to get sum?
(11 answers)
Closed 4 years ago.
I have a Pandas dataframe with some numerical data about people.
I need to find people that appear more than once in the dataframe and replace all of a person's rows with a single row whose numeric values are the sums of those rows.
Example:
Names Column1 Column2
John 1 2
Bob 2 3
Pier 1 1
John 3 3
Bob 1 0
Have to become:
Names Column1 Column2
John 4 5
Bob 3 3
Pier 1 1
How can I do?
Try this:
In [975]: df.groupby('Names')[['Column1','Column2']].sum()
Out[975]:
Column1 Column2
Names
Bob 3 3
John 4 5
Pier 1 1
groupby and sum should do the job
df.groupby('Names').sum().sort_values('Column1', ascending=False)
Column1 Column2
Names
John 4 5
Bob 3 3
Pier 1 1
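Both answers above return Names as the index. A hedged end-to-end sketch (sample frame rebuilt inline) that keeps Names as a regular column and preserves the question's row order:

```python
import pandas as pd

df = pd.DataFrame({"Names": ["John", "Bob", "Pier", "John", "Bob"],
                   "Column1": [1, 2, 1, 3, 1],
                   "Column2": [2, 3, 1, 3, 0]})

# as_index=False keeps Names as a column; sort=False keeps first-seen order
summed = df.groupby("Names", as_index=False, sort=False).sum()
print(summed)
```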