Merging pandas dataframe on unique values in a column - pandas

I have a df1 as:
There are a lot of duplicated values for SUBJECT_ID, as shown in the picture. I have a df2 to merge from, but I want to merge it on unique SUBJECT_ID values. For now I only know how to merge on the entire SUBJECT_ID column through this code:
df1 = pd.merge(df1,df2[['SUBJECT_ID', 'VALUE']], on='SUBJECT_ID', how='left' )
But this will merge on every SUBJECT_ID. I just need unique SUBJECT_ID. Please help me with this.

I think you will find your answer in the merge documentation.
It's not fully clear what you want, but here are some examples that may contain the answer you are looking for:
import pandas as pd

df1 = pd.read_csv('temp.csv')
display(df1)

SUBJECT_ID = [31, 32, 33]
something_interesting = ['cat', 'dog', 'fish']
df2 = pd.DataFrame(list(zip(SUBJECT_ID, something_interesting)),
                   columns=['SUBJECT_ID', 'something_interesting'])
display(df2)

# keep every SUBJECT_ID that appears in either frame
df_keep_all = df1.merge(df2, on='SUBJECT_ID', how='outer')
display(df_keep_all)

# keep only SUBJECT_IDs that appear in both frames
df_keep_df1 = df1.merge(df2, on='SUBJECT_ID', how='inner')
display(df_keep_df1)

# drop duplicate df1 rows first, then merge
df_thinned = pd.merge(df1.drop_duplicates(), df2, on='SUBJECT_ID', how='inner')
display(df_thinned)

You can use the pandas drop_duplicates function for this; it removes all duplicate values for a column or set of columns:
df2 = df.drop_duplicates(subset=['SUBJECT_ID'])
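For example, here is a minimal sketch combining drop_duplicates with the left merge from the question, assuming the duplicates live in df1 and that df2 has one VALUE per SUBJECT_ID (the toy values are made up):

import pandas as pd

df1 = pd.DataFrame({'SUBJECT_ID': [31, 31, 32, 33, 33]})
df2 = pd.DataFrame({'SUBJECT_ID': [31, 32, 33], 'VALUE': [0.1, 0.2, 0.3]})

# keep only one row per SUBJECT_ID before merging,
# so each unique subject appears exactly once in the result
df1_unique = df1.drop_duplicates(subset=['SUBJECT_ID'])
merged = pd.merge(df1_unique, df2[['SUBJECT_ID', 'VALUE']],
                  on='SUBJECT_ID', how='left')
print(merged)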

Related

How do I subset a dataframe based on index matches to the column name of another dataframe?

I want to keep the columns of df if its column name matches the index of df2.
My code below only returns the df.index, but I want to return the entire subset of the pandas dataframe.
import pandas as pd
df = df[df.columns.intersection(df2.index)]
From my understanding, you want to keep the data from both dataframes that matches the index of df2. Correct?
You can use merge to join the dataframes on their indexes:
df = pd.merge(df1, df2, how='inner', left_index=True, right_index=True)
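For the column-name/index intersection asked about in the question, the intersection approach itself already returns the whole column subset. A small self-contained sketch (the column and index labels are made up):

import pandas as pd

# df has columns a, b, c; df2 is indexed by the labels a and b
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
df2 = pd.DataFrame({'x': [10, 20]}, index=['a', 'b'])

# keep only the columns of df whose names appear in df2's index
subset = df[df.columns.intersection(df2.index)]
print(subset)  # all rows of df, but only columns a and b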

Override non null values from one dataframe to another

I would like to override non-null values from one dataframe onto another, using the combination of the first row and first column (both being unique) as the key.
Basically, I am trying to join df2 onto df1 only for the non-null values in df2, keeping df1's rows and columns intact.
eg:
df1 =
df2 =
output =
This should work:
output = df1.merge(df2, on='ID')
cols = [c for c in df1.columns if c != 'ID']
for col in cols:
    output[col] = output[f'{col}_x'].fillna(output[f'{col}_y'])
    output.drop(columns=[f'{col}_x', f'{col}_y'], inplace=True)
Explanation:
At first, we merge the two dataframes using ID as the key. The merge joins the two dataframes and, if there are columns with the same name, adds the suffixes _x and _y.
Then we iterate over all the columns in df1 and fill the NA values in the column col_x using the values in col_y, putting the result into a new column col.
Finally, we drop the auxiliary columns col_x and col_y.
Edit:
Still, even with the updated requirements, the approach is similar. However, in this case you need to perform a left outer join and prefer the values from the second dataframe, falling back to the first dataframe's values where they are missing. Here is the code:
output = df1.merge(df2, on='ID', how='left')
cols = [c for c in df1.columns if c != 'ID']
for col in cols:
    output[col] = output[f'{col}_y'].fillna(output[f'{col}_x'])
    output.drop(columns=[f'{col}_x', f'{col}_y'], inplace=True)
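For reference, a small end-to-end sketch of the left-join-then-fillna approach from this answer, with toy frames (the ID and column names are assumptions, since the original tables are shown as images):

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'A': [10, 20, 30], 'B': ['x', 'y', 'z']})
df2 = pd.DataFrame({'ID': [1, 3], 'A': [None, 99], 'B': ['new', None]})

# left join keeps every df1 row; overlapping columns get _x (df1) and _y (df2)
output = df1.merge(df2, on='ID', how='left')
for col in [c for c in df1.columns if c != 'ID']:
    # prefer df2's value when it is non-null, otherwise keep df1's
    output[col] = output[f'{col}_y'].fillna(output[f'{col}_x'])
    output.drop(columns=[f'{col}_x', f'{col}_y'], inplace=True)
print(output)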

pandas - lookup a value in another DF without merging data from the other dataframe

I have 2 DataFrames. DF1 and DF2.
(Please note that DF2 has more entries than DF1)
What I want to do is add the nationality column to DF1 (The script should look up the name in DF1 and find the corresponding nationality in DF2).
I am currently using the below code
final_DF =df1.merge(df2[['PPS','Nationality']], on=['PPS'], how ='left')
Although the nationality column is being added to DF1, the code is duplicating entries and also adding additional data from DF2 that I do not want.
Is there a method to get the nationality from DF2 while only keeping the DF1 data?
Thanks
DF1
DF2
OUTPUT
There are two points you need to address.
First, check whether there are any duplicates in DF2 and drop them before merging.
Second, you can define 'how' in the merge statement, so it will look like:
final_DF = DF1.merge(DF2, on=['Name'], how='left')
Since you only want to keep the DF1 rows, 'left' should be the ideal option for you.
For more info, refer to the pandas merge documentation.
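Putting the two points together, a hedged sketch (the column names follow the question, the sample values are made up): de-duplicating DF2 first means the left merge cannot multiply DF1's rows, and selecting only the lookup columns keeps the unwanted DF2 data out:

import pandas as pd

DF1 = pd.DataFrame({'Name': ['Ann', 'Bob'], 'PPS': [111, 222]})
DF2 = pd.DataFrame({'Name': ['Ann', 'Ann', 'Bob', 'Cat'],
                    'Nationality': ['IE', 'IE', 'FR', 'US'],
                    'Extra': [1, 2, 3, 4]})

# 1) de-duplicate the lookup table so each Name maps to one Nationality
lookup = DF2.drop_duplicates(subset=['Name'])[['Name', 'Nationality']]

# 2) left merge keeps only DF1's rows and adds the Nationality column
final_DF = DF1.merge(lookup, on=['Name'], how='left')
print(final_DF)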

Combine two dataframes to send an automated message [duplicate]

Is there a way to conveniently merge two data frames side by side?
Both data frames have 30 rows; they have different numbers of columns, say, df1 has 20 columns and df2 has 40 columns.
How can I easily get a new data frame of 30 rows and 60 columns?
df3 = pd.someSpecialMergeFunct(df1, df2)
Or maybe there is some special parameter in append:
df3 = pd.append(df1, df2, left_index=False, right_index=False, how='left')
PS: if possible, I hope the replicated column names could be resolved automatically.
Thanks!
You can use the concat function for this (axis=1 is to concatenate as columns):
pd.concat([df1, df2], axis=1)
See the pandas docs on merging/concatenating: http://pandas.pydata.org/pandas-docs/stable/merging.html
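For a concrete picture, a minimal sketch with two toy 3-row frames (column names are made up); add_suffix is one simple way to resolve repeated column names, since concat would otherwise keep duplicate labels:

import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'b': [7, 8, 9], 'c': [10, 11, 12]})

# side-by-side concatenation: 3 rows, 4 columns
df3 = pd.concat([df1.add_suffix('_1'), df2.add_suffix('_2')], axis=1)
print(df3)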
I came across your question while I was trying to achieve something like the following:
So once I sliced my dataframes, I first ensured that their indexes are the same. In your case, both dataframes need to be indexed from 0 to 29. Then I merged both dataframes by index.
df1.reset_index(drop=True).merge(df2.reset_index(drop=True), left_index=True, right_index=True)
If you want to combine 2 data frames with common column name, you can do the following:
df_concat = pd.merge(df1, df2, on='common_column_name', how='outer')
I found that the other answers didn't cut it for me when coming in from Google.
What I did instead was to set the new columns in place in the original df.
# list(df2) gives you the column names of df2
# you then use these as the column names for df
df[list(df2)] = df2
There is a way; you can do it via a Pipeline.
Use a pipeline to transform your numerical data, for example:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# DataFrameSelector is a custom transformer (not part of scikit-learn)
# that selects the given list of columns from a DataFrame
num_pipeline = Pipeline([
    ("select_numeric", DataFrameSelector([columns with numerical value])),
    ("imputer", SimpleImputer(strategy="median")),
])
And for categorical data:
cat_pipeline = Pipeline([
    ("select_cat", DataFrameSelector([columns with categorical data])),
    ("cat_encoder", OneHotEncoder(sparse=False)),
])
Then use a FeatureUnion to add these transformations together:
preprocess_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])
Read more here: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html
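For completeness, a runnable sketch of that pipeline idea. Note that DataFrameSelector is not part of scikit-learn; the minimal version below is an assumption about what the answer has in mind (a transformer that picks a subset of columns), and the toy column names are made up:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

class DataFrameSelector(BaseEstimator, TransformerMixin):
    # assumed helper, not part of scikit-learn: selects a list of columns
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names]

df = pd.DataFrame({'age': [25.0, None, 40.0], 'city': ['NY', 'LA', 'NY']})

num_pipeline = Pipeline([
    ("select_numeric", DataFrameSelector(['age'])),
    ("imputer", SimpleImputer(strategy="median")),
])
cat_pipeline = Pipeline([
    ("select_cat", DataFrameSelector(['city'])),
    ("cat_encoder", OneHotEncoder(sparse_output=False)),  # 'sparse=False' on older scikit-learn
])
preprocess_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])
print(preprocess_pipeline.fit_transform(df))  # imputed numeric column + one-hot city columns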
This solution also works if df1 and df2 have different indices:
df1.loc[:, df2.columns] = df2.to_numpy()

Conditional join in pandas

I am merging 2 datasets like this:
df1.merge(df2, how='left', on='ID')
I only want to select records where df2.NAME='ABC'
What is the quickest way to do this? In SQL, it would be:
select * from df1 left join df2 on df1.id=df2.id and df2.name='ABC'
df1.merge(df2[df2.NAME=='ABC'], how='left', on='ID')
or
df = df1.merge(df2, how='left', on='ID')
df = df[df.NAME=='ABC']
depending on whether you want these rows to exist in the resulting df (with NaNs) [snippet 1] or for them to be dropped entirely [snippet 2].
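A small self-contained demo of the difference (toy data, column names from the question):

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3]})
df2 = pd.DataFrame({'ID': [1, 2], 'NAME': ['ABC', 'XYZ']})

# snippet 1: filter df2 first, then left join ->
# every df1 row is kept, with NaN where no NAME == 'ABC' match exists
out1 = df1.merge(df2[df2.NAME == 'ABC'], how='left', on='ID')

# snippet 2: join first, then filter ->
# rows without NAME == 'ABC' are dropped entirely
out2 = df1.merge(df2, how='left', on='ID')
out2 = out2[out2.NAME == 'ABC']

print(out1)
print(out2)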