Conditional join in pandas - sql

I am merging 2 datasets like this:
df1.merge(df2, how='left', on='ID')
I only want to select records where df2.NAME='ABC'.
What is the quickest way to do this? In SQL, it would be:
select * from df1 left join df2 on df1.id=df2.id and df2.name='ABC'

df1.merge(df2[df2.NAME=='ABC'], how='left', on='ID')
or
df = df1.merge(df2, how='left', on='ID')
df = df[df.NAME=='ABC']
depending on whether you want the df1 rows without an 'ABC' match to remain in the result with NaNs [snippet 1] or to be dropped entirely [snippet 2].
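For concreteness, a minimal runnable sketch with made-up frames (only the ID and NAME column names are taken from the question) showing the difference between the two snippets:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3]})
df2 = pd.DataFrame({'ID': [1, 2], 'NAME': ['ABC', 'XYZ']})

# Snippet 1: filter df2 first; every df1 row survives, non-matches get NaN
kept = df1.merge(df2[df2.NAME == 'ABC'], how='left', on='ID')

# Snippet 2: merge first, filter after; rows without NAME == 'ABC' are dropped
dropped = df1.merge(df2, how='left', on='ID')
dropped = dropped[dropped.NAME == 'ABC']

print(len(kept), len(dropped))  # 3 1
```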

Related

How to Union 3 dataframes by Pandas?

I need to union 3 dataframes (df1, df2, df3). How can I keep all columns from all 3 dataframes without overlap?
The 3 dataframes are for 3 different kinds of products; one dataframe has fewer columns than the other two.
step1 = pd.merge_ordered(df1, df2)
all_lob = pd.merge_ordered(step1, df3)
The result seems to eliminate some columns. How can I just stack the 3 dataframes together?
Thank you.
I'm not sure exactly what you want to do, but when it comes to putting dataframes together at the column level, you can use the pandas concat method and specify the axis accordingly:
import pandas as pd
df_1 = pd.DataFrame({'Col_1':[1,2,3], 'Col_2':[4,5,6]})
df_2 = pd.DataFrame({'Col_1':[1,2,3,4], 'Col_2':[4,5,6,7]})
df_3 = pd.DataFrame({'Col_1':[1,2,3,4,5], 'Col_2':[4,5,6,7,8]})
df = pd.concat([df_1, df_2, df_3], axis=1)
print(df)
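If the goal is instead to stack the frames row-wise, axis=0 (the default) keeps the union of all columns and fills the gaps with NaN, which sounds closer to "stack 3 dataframes all together". A small sketch with made-up frames:

```python
import pandas as pd

df_a = pd.DataFrame({'Col_1': [1, 2, 3], 'Col_2': [4, 5, 6]})
df_b = pd.DataFrame({'Col_1': [7, 8], 'Col_3': [9, 10]})  # different column set

# axis=0 stacks rows; columns missing from one frame become NaN
stacked = pd.concat([df_a, df_b], axis=0, ignore_index=True)
print(stacked.columns.tolist())  # ['Col_1', 'Col_2', 'Col_3']
print(len(stacked))              # 5
```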

Override non null values from one dataframe to another

I would like to override values in one dataframe with the non-null values from another, matching on the combination of the first row and column (both being unique).
Basically, I am trying to join df2 onto df1 only for the non-null values in df2, keeping df1's rows/columns intact.
eg:
df1 =
df2 =
output =
This should work:
output = df1.merge(df2, on='ID')
cols = [c for c in df1.columns if c != 'ID']
for col in cols:
    output[col] = output[f'{col}_x'].fillna(output[f'{col}_y'])
    output.drop(columns=[f'{col}_x', f'{col}_y'], inplace=True)
Explanation:
First, we merge the two dataframes using ID as the key. If both dataframes have columns with the same name, the merge adds the suffixes _x and _y to them.
Then we iterate over all the columns of df1, fill the NA values in col_x with the values from col_y, and put the result into a new column col.
Finally, we drop the auxiliary columns col_x and col_y.
Edit:
Even with the updated requirements, the approach is similar. In this case, however, you need to perform a left outer join and fill the NA values from the second dataframe. Here is the code:
output = df1.merge(df2, on='ID', how='left')
cols = [c for c in df1.columns if c != 'ID']
for col in cols:
    output[col] = output[f'{col}_y'].fillna(output[f'{col}_x'])
    output.drop(columns=[f'{col}_x', f'{col}_y'], inplace=True)
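As an aside, a similar "prefer df2's non-null values" result can be sketched with combine_first, which fills the caller's missing values from another frame (made-up data; only the ID key is taken from the question):

```python
import pandas as pd
import numpy as np

# hypothetical frames: a shared 'ID' key plus one shared column 'A'
df1 = pd.DataFrame({'ID': [1, 2, 3], 'A': [10, 20, 30]})
df2 = pd.DataFrame({'ID': [1, 3], 'A': [np.nan, 99]})

# combine_first prefers the caller's (df2's) non-null values, falling back
# to df1; reindexing to df1's IDs keeps df1's rows intact
out = (df2.set_index('ID')
          .combine_first(df1.set_index('ID'))
          .reindex(df1['ID'])
          .reset_index())
print(out['A'].tolist())  # [10.0, 20.0, 99.0]
```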

pandas - lookup a value in another DF without merging data from the other dataframe

I have 2 DataFrames. DF1 and DF2.
(Please note that DF2 has more entries than DF1)
What I want to do is add the nationality column to DF1 (The script should look up the name in DF1 and find the corresponding nationality in DF2).
I am currently using the below code
final_DF =df1.merge(df2[['PPS','Nationality']], on=['PPS'], how ='left')
Although the nationality column is being added to DF1, the code is duplicating entries and also adding additional data from DF2 that I do not want.
Is there a method to get the nationality from DF2 while only keeping the DF1 data?
Thanks
DF1
DF2
OUTPUT
There are 2 points you need to take care of.
First, check whether there are any duplicates in DF2; if so, drop them before merging, or each match will be repeated in the output.
Second, you can define 'how' in the merge statement, so it will look like:
final_DF = DF1.merge(DF2, on=['Name'], how='left')
Since you want to keep only the DF1 rows, 'left' should be the ideal option for you.
For more info, refer to the pandas merge documentation.
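Putting both points together, a minimal sketch with made-up data (PPS and Nationality are the question's column names; the duplicate keys and the extra Address column are invented to show the problem):

```python
import pandas as pd

DF1 = pd.DataFrame({'PPS': [1, 2, 3]})
DF2 = pd.DataFrame({'PPS': [1, 1, 2],
                    'Nationality': ['Irish', 'Irish', 'French'],
                    'Address': ['x', 'y', 'z']})

# keep only the lookup columns and drop duplicate keys before the left merge
lookup = DF2[['PPS', 'Nationality']].drop_duplicates(subset='PPS')
final_DF = DF1.merge(lookup, on='PPS', how='left')

print(len(final_DF))              # 3: one row per DF1 row, no duplication
print(final_DF.columns.tolist())  # ['PPS', 'Nationality']
```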

Merging pandas dataframe on unique values in a column

I have a df1 as:
There are a lot of duplicate values for SUBJECT_ID, as shown in the picture. I have a df2 to merge from, but I want to merge it on unique SUBJECT_ID values. For now I only know how to merge on every SUBJECT_ID with this code:
df1 = pd.merge(df1, df2[['SUBJECT_ID', 'VALUE']], on='SUBJECT_ID', how='left')
But this merges on every SUBJECT_ID, and I just need the unique ones. Please help me with this.
I think you will find your answer with the merge documentation.
It's not fully clear what you want, but here are some examples that may contain the answer you are looking for:
import pandas as pd
df1 = pd.read_csv('temp.csv')
display(df1)
SUBJECT_ID = [31, 32, 33]
something_interesting = ['cat', 'dog', 'fish']
df2 = pd.DataFrame(list(zip(SUBJECT_ID, something_interesting)),
                   columns=['SUBJECT_ID', 'something_interesting'])
display(df2)
df_keep_all = df1.merge(df2, on='SUBJECT_ID', how='outer')
display(df_keep_all)
df_keep_df1 = df1.merge(df2, on='SUBJECT_ID', how='inner')
display(df_keep_df1)
df_thinned = pd.merge(df1.drop_duplicates(), df2, on='SUBJECT_ID', how='inner')
display(df_thinned)
You can use the pandas drop_duplicates function; it removes duplicate rows based on a column or columns:
df2 = df.drop_duplicates(subset=['SUBJECT_ID'])

My pandas merge is not bringing over data from the right df. Why?

The code runs without error, but the data from the right df is not populating into the resulting dataframe.
I've tried with and without the index and neither seems to work. I looked into dtypes and they appear to match on the fields I'm using as the index. I noted that the indicator is saying left_only, making me think the merge is not actually bringing anything over. It clearly must not be, because fields that are not null in the right df are showing null in the resulting dataframe.
df = df[(df['A'].notna())]
group = df.groupby(['A', 'B', 'Period', 'D'])
df2 = group['Monthly_Need'].sum()
df2 = df2.reset_index()
df = df.set_index(['A', 'B', 'Period', 'D'])
df2 = df2.set_index(['A', 'B', 'Period', 'D'])
df = df.merge(df2, how='left', left_index=True, right_index=True, indicator=True)
df = df.reset_index()
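For what it's worth, the per-group sum being merged back here can also be attached without any merge at all using groupby().transform, which sidesteps index alignment entirely. A self-contained sketch with made-up data (the column names are taken from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'a', 'b'],
    'B': [1, 1, 2],
    'Period': ['P1', 'P1', 'P2'],
    'D': ['d', 'd', 'e'],
    'Monthly_Need': [10, 20, 30],
})

# transform broadcasts each group's sum back onto every original row,
# so no merge (and no index bookkeeping) is needed
df['Monthly_Need_sum'] = (
    df.groupby(['A', 'B', 'Period', 'D'])['Monthly_Need'].transform('sum')
)
print(df['Monthly_Need_sum'].tolist())  # [30, 30, 30]
```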