pandas drop row if value is not in different dataframe - pandas

I have two dataframes and want to drop rows from dataframe 'Total' if there is not a matching ID in dataframe 'Student'
DF Total:
ID name
0 115 john
1 118 mike
2 34 mac
3 897 sarah
DF Student:
ID name
0 34 mac
1 118 mike
2 897 sarah
In this example since ID 115 is not present in the Student df that row would be dropped from df Total and the resulting table would look like this:
ID name
0 118 mike
1 34 mac
2 897 sarah

One way is to use the .isin() method:
df_total[df_total['ID'].isin(df_student['ID'])]
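Note that the filtered frame keeps the original row labels (1, 2, 3), while the expected output in the question is numbered from 0. A minimal sketch, assuming the frames are named df_total and df_student as in the answer, that rebuilds the sample data and renumbers the rows with reset_index:

```python
import pandas as pd

# Sample data copied from the question
df_total = pd.DataFrame({"ID": [115, 118, 34, 897],
                         "name": ["john", "mike", "mac", "sarah"]})
df_student = pd.DataFrame({"ID": [34, 118, 897],
                           "name": ["mac", "mike", "sarah"]})

# Keep only rows whose ID also appears in df_student,
# then renumber the rows from 0
result = df_total[df_total["ID"].isin(df_student["ID"])].reset_index(drop=True)
print(result)
```

The rows stay in the order they had in df_total; only the labels change.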

Related

Python keep rows if a specific column contains a particular value or string

I am very green in python. I have not found a specific answer to my problem searching for online resources. With that said it would be great if you could give some hints.
I have an example of df as below:
import pandas as pd
df = pd.DataFrame({'names':['Alex','Joseph','Kate'],'exam1': [90, 68, 70], 'exam2': [100, 98, 88]})
print(df)
names exam1 exam2
0 Alex 90 100
1 Joseph 68 98
2 Kate 70 88
I would like to write a for loop that iterates over the rows and keeps a row if the names column is equal to Joseph or Kate, to get a new df as below:
names exam1 exam2
0 Joseph 68 98
1 Kate 70 88
I know there is a way like below but I would like to do it via for loop.
list=['Joseph','Kate']
new_df=df[df['names'].isin(list)]
Thank you in Advance.
Not sure why you'd want to use loops, but this is how you'd do it:
rows = []
for index, row in df.iterrows():
    if row['names'] == 'Kate' or row['names'] == 'Joseph':
        rows.append(row)
new_df = pd.DataFrame(rows)
print(new_df)
names exam1 exam2
1 Joseph 68 98
2 Kate 70 88

Compare two data frames for different values in a column

I have two dataframes. Please tell me how I can compare them by operator name and, where the names match, add the count and time values from the second dataframe to the first.
In [2]: df1 In [3]: df2
Out[2]: Out[3]:
Name count time Name count time
0 Bob 123 4:12:10 0 Rick 9 0:13:00
1 Alice 99 1:01:12 1 Jone 7 0:24:21
2 Sergei 78 0:18:01 2 Bob 10 0:15:13
85 rows x 3 columns 105 rows x 3 columns
I want to get:
In [5]: df1
Out[5]:
Name count time
0 Bob 133 4:27:23
1 Alice 99 1:01:12
2 Sergei 78 0:18:01
85 rows x 3 columns
Use set_index and add them together. Finally, update back.
df1 = df1.set_index('Name')
df1.update(df1 + df2.set_index('Name'))
df1 = df1.reset_index()
Out[759]:
Name count time
0 Bob 133.0 04:27:23
1 Alice 99.0 01:01:12
2 Sergei 78.0 00:18:01
Note: I assume the time columns in both df1 and df2 already hold timedelta values. If they are strings, convert them before running the commands above:
df1.time = pd.to_timedelta(df1.time)
df2.time = pd.to_timedelta(df2.time)
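As an alternative to update, here is a minimal sketch that uses Series.map instead, which keeps count as an integer column. It is a swapped-in technique, not the answer's method, and it assumes each Name appears at most once in df2; the frames are rebuilt from the three rows shown in the question, and lookup is an illustrative name.

```python
import pandas as pd

# First three rows of each frame, copied from the question
df1 = pd.DataFrame({"Name": ["Bob", "Alice", "Sergei"],
                    "count": [123, 99, 78],
                    "time": ["4:12:10", "1:01:12", "0:18:01"]})
df2 = pd.DataFrame({"Name": ["Rick", "Jone", "Bob"],
                    "count": [9, 7, 10],
                    "time": ["0:13:00", "0:24:21", "0:15:13"]})

# Convert the H:MM:SS strings to timedeltas so they can be added
df1["time"] = pd.to_timedelta(df1["time"])
df2["time"] = pd.to_timedelta(df2["time"])

# Look up df2's values by name (assumes unique names in df2);
# names missing from df2 contribute 0 / a zero timedelta
lookup = df2.set_index("Name")
df1["count"] = df1["count"] + df1["Name"].map(lookup["count"]).fillna(0).astype(int)
df1["time"] = df1["time"] + df1["Name"].map(lookup["time"]).fillna(pd.Timedelta(0))
print(df1)
```

Rows of df2 whose name is not in df1 (Rick, Jone) are ignored, so df1 keeps its original rows and order.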

how to apply one hot encoding or get dummies on 2 columns together in pandas?

I have below dataframe which contain sample values like:-
df = pd.DataFrame([["London", "Cambridge", 20], ["Cambridge", "London", 10], ["Liverpool", "London", 30]], columns= ["city_1", "city_2", "id"])
city_1 city_2 id
London Cambridge 20
Cambridge London 10
Liverpool London 30
I need the output dataframe as below which is built while joining 2 city columns together and applying one hot encoding after that:
id London Cambridge Liverpool
20 1 1 0
10 1 1 0
30 1 0 1
Currently, I am using the code below, which encodes each column separately. Could you please advise if there is a pythonic way to get the above output?
output_df = pd.get_dummies(df, columns=['city_1', 'city_2'])
which results in
id city_1_Cambridge city_1_London and so on columns
You can pass the parameters prefix_sep='' and prefix='' to get_dummies so that the dummy columns from both city columns share the same names, then use max if you want only 1 or 0 values (indicator columns), or sum if you need to count the occurrences:
output_df = (pd.get_dummies(df, columns=['city_1', 'city_2'], prefix_sep='', prefix='')
.max(axis=1, level=0))
print (output_df)
id Cambridge Liverpool London
0 20 1 0 1
1 10 1 0 1
2 30 0 1 1
Or, if you want to process all columns except id, first move the column(s) that should not be encoded into the index with DataFrame.set_index, then use get_dummies with max, and finally call DataFrame.reset_index:
output_df = (pd.get_dummies(df.set_index('id'), prefix_sep='', prefix='')
.max(axis=1, level=0)
.reset_index())
print (output_df)
id Cambridge Liverpool London
0 20 1 0 1
1 10 1 0 1
2 30 0 1 1
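The level argument of DataFrame.max was removed in pandas 2.0, so the snippets above only run on older versions. A minimal sketch of the same idea for recent pandas, assuming it is acceptable to transpose and group the duplicated column names with groupby(level=0); dtype=int is passed because newer versions return boolean dummies by default:

```python
import pandas as pd

df = pd.DataFrame([["London", "Cambridge", 20],
                   ["Cambridge", "London", 10],
                   ["Liverpool", "London", 30]],
                  columns=["city_1", "city_2", "id"])

# Empty prefix and separator make both city columns produce
# identically named dummy columns (e.g. two "London" columns)
dummies = pd.get_dummies(df.set_index("id"), prefix="", prefix_sep="", dtype=int)

# Collapse the duplicated column names: transpose, group the
# (now row) labels, take the max, and transpose back
output_df = dummies.T.groupby(level=0).max().T.reset_index()
print(output_df)
```

Replacing max with sum in the groupby step would count how often each city occurs in a row instead of flagging its presence.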

compare 2 data frames based on 3 columns and update second data

Here is what my data frames look like. I need to compare them: if df1.mid == df2.mid and df1.name == df2.name and df1.pid != df2.pid, then update df2.pid with df1.pid.
df1
mid pid name
1 2 John
2 14 Peter
3 16 Emma
4 20 Adam
df2
mid pid name
1 2 John
2 16 Peter
3 16 Emma
expected result in df2 after update
mid pid name
1 2 John
2 14 Peter
3 16 Emma
A merge is what you want, but there are a few subtleties to take into account:
df2.merge(df1, on=['mid', 'name'], how='left', suffixes=('_2', '_1')) \
.assign(pid=lambda x: x['pid_1'].combine_first(x['pid_2'])) \
.drop(columns=['pid_1', 'pid_2'])
merge aligns df1 and df2 based on mid and name. The two pid columns are renamed pid_1 and pid_2.
assign creates a new pid column by combining the two previous pids: if pid_1 is available, use that, if not, keep the original pid_2
drop drops pid_1 and pid_2, leaving one and only one pid column
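The three steps can be tried end to end on the sample data. A minimal sketch, assuming the frames are built as shown in the question and that the result is assigned back to df2:

```python
import pandas as pd

df1 = pd.DataFrame({"mid": [1, 2, 3, 4],
                    "pid": [2, 14, 16, 20],
                    "name": ["John", "Peter", "Emma", "Adam"]})
df2 = pd.DataFrame({"mid": [1, 2, 3],
                    "pid": [2, 16, 16],
                    "name": ["John", "Peter", "Emma"]})

df2 = (
    # Align on mid and name; df2's pid becomes pid_2, df1's becomes pid_1
    df2.merge(df1, on=["mid", "name"], how="left", suffixes=("_2", "_1"))
       # Prefer df1's pid, fall back to df2's own pid when there is no match
       .assign(pid=lambda x: x["pid_1"].combine_first(x["pid_2"]))
       # Remove the two helper columns
       .drop(columns=["pid_1", "pid_2"])
)
print(df2)
```

The resulting column order is mid, name, pid, because the new pid column is appended at the end by assign.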
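You can also try a right merge, which takes pid from df1 for every (mid, name) pair in df2 (pairs missing from df1 end up with NaN):
df3 = df1.merge(df2[['mid', 'name']], on=['mid', 'name'], how='right')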

How to add values to the pandas dataframe column depending upon value of column in other dataframe

I have a pandas dataframe something like this. This dataframe contains unique user_id values and the corresponding user_name.
df
user_id user_name
1 Jack
2 Neil
3 Peter
4 Smith
5 Neev
And I have second dataframe something like this
df1
user_id item_id user_name
1 23 Null
1 24 Null
2 34 Null
3 35 Null
5 45 Null
I want to fill the user_name column above from the first dataframe, so that wherever user_id matches, the corresponding user_name is entered in that position.
So it should look like this:
df1
user_id item_id user_name
1 23 Jack
1 24 Jack
2 34 Neil
3 35 Peter
5 45 Neev
I am doing the following in Python:
b = df.user_name[df['user_id'].isin(df1['user_id'])]
df1['user_name'] = b
But it drops duplicates. I don't want that. Please help.
Use merge:
In [299]:
df1[['user_id','item_id']].merge(df,on='user_id')
Out[299]:
user_id item_id user_name
0 1 23 Jack
1 1 24 Jack
2 2 34 Neil
3 3 35 Peter
4 5 45 Neev
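If you would rather fill the existing column in place than build a new frame, Series.map is another option. A minimal sketch, assuming user_id is unique in df (as the question states) and rebuilding the sample data with None standing in for the Null values:

```python
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2, 3, 4, 5],
                   "user_name": ["Jack", "Neil", "Peter", "Smith", "Neev"]})
df1 = pd.DataFrame({"user_id": [1, 1, 2, 3, 5],
                    "item_id": [23, 24, 34, 35, 45],
                    "user_name": [None] * 5})

# Build a user_id -> user_name lookup and apply it row by row;
# repeated user_ids in df1 each get their name, nothing is dropped
df1["user_name"] = df1["user_id"].map(df.set_index("user_id")["user_name"])
print(df1)
```

Any user_id in df1 that is absent from df would get NaN, whereas the default inner merge would drop that row.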