Combine two dataframes to send an automated message [duplicate] - pandas

Is there a way to conveniently merge two data frames side by side?
Both data frames have 30 rows; they have different numbers of columns, say, df1 has 20 columns and df2 has 40 columns.
How can I easily get a new data frame of 30 rows and 60 columns?
df3 = pd.someSpecialMergeFunct(df1, df2)
Or maybe there is some special parameter in append:
df3 = pd.append(df1, df2, left_index=False, right_index=False, how='left')
PS: if possible, I hope the duplicated column names could be resolved automatically.
Thanks!

You can use the concat function for this (axis=1 concatenates along the columns):
pd.concat([df1, df2], axis=1)
See the pandas docs on merging/concatenating: http://pandas.pydata.org/pandas-docs/stable/merging.html
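For instance, a small self-contained sketch (toy 30-row frames standing in for the real df1 and df2):
import pandas as pd

# Toy stand-ins for df1 (20 cols) and df2 (40 cols), trimmed to a few
# columns so the example stays short; both have 30 rows.
df1 = pd.DataFrame({'a': range(30), 'b': range(30)})
df2 = pd.DataFrame({'b': range(30), 'c': range(30), 'd': range(30)})

df3 = pd.concat([df1, df2], axis=1)  # 30 rows, 5 columns; 'b' appears twice

# Regarding the PS about replicated column names: concat itself keeps the
# duplicate labels, but keys= adds an outer level that disambiguates them.
df4 = pd.concat([df1, df2], axis=1, keys=['df1', 'df2'])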

I came across your question while I was trying to achieve something similar.
Once I sliced my dataframes, I first ensured that their indexes were the same. In your case both dataframes need to be indexed from 0 to 29. Then I merged both dataframes on the index:
df1.reset_index(drop=True).merge(df2.reset_index(drop=True), left_index=True, right_index=True)
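Note that merge also resolves overlapping column names automatically via its suffixes parameter (which covers the PS in the question), for example:
df3 = df1.reset_index(drop=True).merge(
    df2.reset_index(drop=True),
    left_index=True, right_index=True,
    suffixes=('_df1', '_df2'))  # optional; the defaults are '_x' and '_y'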

If you want to combine 2 data frames with a common column name, you can do the following:
df_concat = pd.merge(df1, df2, on='common_column_name', how='outer')
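A toy example, with 'id' standing in for common_column_name:
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'x': ['a', 'b']})
df2 = pd.DataFrame({'id': [2, 3], 'y': ['c', 'd']})

df_concat = pd.merge(df1, df2, on='id', how='outer')  # rows for ids 1, 2 and 3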

I found that the other answers didn't cut it for me when coming in from Google.
What I did instead was to set the new columns in place in the original df.
# list(df2) gives you the column names of df2
# you then use these as the column names for df
df[list(df2)] = df2
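A minimal sketch with toy frames (note that this assignment aligns on the index, so both frames should share the same index):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'b': [4, 5, 6], 'c': [7, 8, 9]})

df[list(df2)] = df2  # df now has columns a, b, c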

There is a way: you can do it via a Pipeline.
Use a pipeline to transform your numerical data, for example:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

num_pipeline = Pipeline([
    ("select_numeric", DataFrameSelector(num_columns)),  # num_columns: your list of numerical column names
    ("imputer", SimpleImputer(strategy="median")),
])
And for categorical data:
cat_pipeline = Pipeline([
    ("select_cat", DataFrameSelector(cat_columns)),  # cat_columns: your list of categorical column names
    ("cat_encoder", OneHotEncoder(sparse=False)),  # sparse_output=False on scikit-learn >= 1.2
])
Then use a FeatureUnion to add these transformations together:
preprocess_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])
Read more here - https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html
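Note that DataFrameSelector is not part of scikit-learn; it is a small custom transformer (popularized by the "Hands-On Machine Learning" book). A minimal sketch of it might look like:
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    # Selects the given columns from a DataFrame for use inside a Pipeline.
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names]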

This solution also works if df1 and df2 have different indices:
df1.loc[:, df2.columns] = df2.to_numpy()

Related

pandas - lookup a value in another DF without merging data from the other dataframe

I have 2 DataFrames. DF1 and DF2.
(Please note that DF2 has more entries than DF1)
What I want to do is add the nationality column to DF1 (The script should look up the name in DF1 and find the corresponding nationality in DF2).
I am currently using the code below:
final_DF =df1.merge(df2[['PPS','Nationality']], on=['PPS'], how ='left')
Although the nationality column is being added to DF1, the code is duplicating entries and also adding additional data from DF2 that I do not want.
Is there a method to get the nationality from DF2 while only keeping the DF1 data?
Thanks
(DF1, DF2 and the expected OUTPUT were shown as images in the original post.)
There are 2 points you need to check.
First, whether there are any duplicates in DF2: duplicated PPS values there will duplicate rows in the merge result, so drop them first if needed.
Second, you can define 'how' in the merge statement, so it will look like:
final_DF = DF1.merge(DF2, on=['Name'], how='left')
Since you want to keep only the DF1 rows, 'left' should be the ideal option for you.
For more info, refer to the pandas merge documentation.
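A minimal sketch of both points together, reusing the PPS column from your own code (column names assumed from the question):
final_DF = df1.merge(
    df2[['PPS', 'Nationality']].drop_duplicates(subset='PPS'),
    on='PPS', how='left')  # dedupe DF2 first so each DF1 row matches at most once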

How to concat 3 dataframes with each into sequential columns

I'm trying to understand how to concat three individual dataframes (i.e. df1, df2, df3) into a new dataframe, say df4, whereby each individual dataframe gets its own columns in left-to-right order.
I've tried using concat with axis=1 to do this, but it appears not possible to automate it in a single action.
Table1_updated = pd.DataFrame(columns=['3P','2PG-3Io','3Io'])
Table1_updated=pd.concat([get_table1_3P,get_table1_2P_max_3Io,get_table1_3Io])
Note that with the exception of get_table1_2P_max_3Io, which has two columns, all other dataframes have one column
For example (the three input frames, get_table1_3P, get_table1_2P_max_3Io and get_table1_3Io, were shown as images in the original post).
Ultimately, I would like to see the following:
I believe you first need concat and then to change the order using a list of column names:
Table1_updated=pd.concat([get_table1_3P,get_table1_2P_max_3Io,get_table1_3Io], axis=1)
Table1_updated = Table1_updated[['3P','2PG-3Io','3Io']]
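A toy sketch, with hypothetical single-column stand-ins for the three source frames:
import pandas as pd

get_table1_3P = pd.DataFrame({'3P': [1, 2]})
get_table1_2P_max_3Io = pd.DataFrame({'2PG-3Io': [3, 4]})
get_table1_3Io = pd.DataFrame({'3Io': [5, 6]})

Table1_updated = pd.concat([get_table1_3P, get_table1_2P_max_3Io, get_table1_3Io], axis=1)
Table1_updated = Table1_updated[['3P', '2PG-3Io', '3Io']]  # enforce column order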

Preferred pandas code for selecting all rows and a subset of columns

Suppose that you have a pandas DataFrame named df with columns ['a','b','c','d','e'] and you want to create a new DataFrame newdf with columns 'b' and 'd'. There are two possible ways to do this:
newdf = df[['b','d']]
or
newdf = df.loc[:,['b','d']]
The first is using the indexing operator. The second is using .loc. Is there a reason to prefer one over the other?
Thanks to @coldspeed, it seems that newdf = df.loc[:,['b','d']] is preferred, to avoid the dreaded SettingWithCopyWarning.
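A small illustration with toy data: both expressions return a copy here, but .loc states the row/column selection explicitly and avoids the chained-indexing patterns that commonly trigger the warning.
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3], 'd': [4], 'e': [5]})

newdf = df.loc[:, ['b', 'd']].copy()  # explicit copy removes any ambiguity
newdf['b'] = 0                        # clearly modifies newdf only, never df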

Remove rows from multiple dataframe that contain bad data

Say I have n dataframes, df1, df2...dfn.
Finding rows that contain "bad" values in a given dataframe is done by, e.g.:
index1 = df1[df1.isin([np.nan, np.inf, -np.inf])]
index2 = df2[df2.isin([np.nan, np.inf, -np.inf])]
Now, dropping these bad rows from the offending dataframe is done with:
df1 = df1.replace([np.inf, -np.inf], np.nan).dropna()
df2 = df2.replace([np.inf, -np.inf], np.nan).dropna()
The problem is that any function that expects the two (n) dataframes columns to be of the same length may give an error if there is bad data in one df but not the other.
How do I drop not just the bad row from the offending dataframe, but the same row from a list of dataframes?
So in the two dataframe case, if in df1 date index 2009-10-09 contains a "bad" value, that same row in df2 will be dropped.
[Possible "ugly"? solution?]
I suspect that one way to do it is to merge the two (n) dataframes on date; then applying the cleanup function to drop "bad" values is automatic, since the entire row gets dropped. But what happens if a date is missing from one dataframe and not the other [and they still happen to be the same length]?
First, do your replace:
df1 = df1.replace([np.inf, -np.inf], np.nan)
df2 = df2.replace([np.inf, -np.inf], np.nan)
Then, here we use an inner join:
newdf=pd.concat([df1,df2],axis=1,keys=[1,2], join='inner').dropna()
And split it back into two dfs; here we use combine_first with dropna on the original dfs:
df1,df2=[s[1].loc[:,s[0]].combine_first(x.dropna()) for x,s in zip([df1,df2],newdf.groupby(level=0,axis=1))]
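If you prefer something more explicit, here is an alternative sketch (assuming the dataframes share a common index, e.g. the dates): collect the union of "bad" row labels across all frames, then drop those labels from every frame.
import numpy as np
import pandas as pd

dfs = [df1, df2]  # extend with df3 ... dfn as needed
dfs = [df.replace([np.inf, -np.inf], np.nan) for df in dfs]

# Union of index labels that contain NaN in any of the dataframes.
bad = pd.Index([])
for df in dfs:
    bad = bad.union(df.index[df.isna().any(axis=1)])

# Drop those labels everywhere, keeping all frames the same length.
df1, df2 = [df.drop(bad, errors='ignore') for df in dfs]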

pandas merge multiple dataframes

For example: I have multiple dataframes. Each data frame has columns: variable_code, variable_description, year.
df1:
variable_code, variable_description
N1, Number of returns
N2, Number of Exemptions
df2:
variable_code, variable_description
N1, Number of returns
NUMDEP, # of dependent
I want to merge these two dataframes to get all variable_codes in both df1 and df2.
variable_code, variable_description
N1 Number of returns
N2 Number of Exemption
NUMDEP # of dependent
There is documentation for merge in the pandas docs.
Since the columns you want to merge on are both called "variable_code", you can use on='variable_code'.
So the whole thing would be:
df1.merge(df2, on='variable_code')
You can specify how='outer' if you want rows that appear in only one of the tables (with blanks where data is missing). Use how='inner' if you want only rows that are in both tables (no blanks).
To attain your requirement, try this:
import pandas as pd
from functools import reduce  # reduce is used below to chain the merges
#Create the first dataframe, through a dictionary - several other possibilities exist.
data1 = {'variable_code': ['N1','N2'], 'variable_description': ['Number of returns','Number of Exemptions']}
df1 = pd.DataFrame(data=data1)
#Create second dataframe
data2 = {'variable_code': ['N1','NUMDEP'], 'variable_description': ['Number of returns','# of dependent']}
df2 = pd.DataFrame(data=data2)
#Place the dataframes in a list.
dfs = [df1,df2] #additional dfs can be added here.
#You could loop over the list, merging the dfs; here reduce and a lambda are used instead.
resultant_df = reduce(lambda left,right: pd.merge(left,right,on=['variable_code','variable_description'],how='outer'), dfs)
This gives:
>>> resultant_df
variable_code variable_description
0 N1 Number of returns
1 N2 Number of Exemptions
2 NUMDEP # of dependent
There are several options available for how, each catering to different needs. outer, used here, allows for inclusion of even the rows with empty data. See the docs for a detailed explanation of the other options.
First, concatenate df1 and df2 by using:
final_df = pd.concat([df1, df2])
Then convert the columns variable_code and variable_description into a dictionary, with variable_code as keys and variable_description as values:
d = dict(zip(final_df['variable_code'], final_df['variable_description']))
Then convert d into a dataframe:
d_df = pd.DataFrame(list(d.items()), columns=['variable_code', 'variable_description'])
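Putting those three steps together as a runnable sketch, with the question's sample data:
import pandas as pd

df1 = pd.DataFrame({'variable_code': ['N1', 'N2'],
                    'variable_description': ['Number of returns', 'Number of Exemptions']})
df2 = pd.DataFrame({'variable_code': ['N1', 'NUMDEP'],
                    'variable_description': ['Number of returns', '# of dependent']})

final_df = pd.concat([df1, df2])
d = dict(zip(final_df['variable_code'], final_df['variable_description']))
d_df = pd.DataFrame(list(d.items()), columns=['variable_code', 'variable_description'])
print(d_df)  # N1, N2 and NUMDEP each appear exactly once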