How to concatenate portions of df to another df - pandas

I have a df like this: df1
And I have another df like this: df2
I want to concatenate all rows of df2 that have the same CODIGO as rows in df1. The rows from df2 must appear below those of df1, like this:
All values must be concatenated under their respective columns. I would appreciate any help.
Many thanks.

You need to use pd.concat for this.
The axis is 0 by default, so you don't need to specify it.
If the columns of the two dataframes are in a different order, pass sort=True to sort them:
result = pd.concat([df1, df2], ignore_index=True, sort=True)
You could also use DataFrame.append, but it was deprecated in pandas 1.4 and removed in pandas 2.0, so prefer pd.concat on current versions:
result = df1.append(df2, ignore_index=True, sort=True)
It looks like the indexes and the column order differ between your dataframes. You need to rearrange them, and ideally build a meaningful unique index such as CODIGO_AñO_MES_DIA so that the rows can be sorted properly.
Be aware that the sorting is alphabetical.
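A minimal sketch of the steps above, assuming the goal is to keep only the df2 rows whose CODIGO also occurs in df1 and to place them below the df1 rows with the same CODIGO. Only the CODIGO column name comes from the question; the VALOR column and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical data; note that the column order differs between the frames.
df1 = pd.DataFrame({"CODIGO": [1, 2], "VALOR": [10.0, 20.0]})
df2 = pd.DataFrame({"VALOR": [11.0, 31.0, 21.0], "CODIGO": [1, 3, 2]})

# Keep only the df2 rows whose CODIGO also appears in df1.
matching = df2[df2["CODIGO"].isin(df1["CODIGO"])]

# Stack df1 on top; concat aligns columns by name, not by position.
result = pd.concat([df1, matching], ignore_index=True)

# A stable sort groups each CODIGO while keeping its df1 row above its df2 rows.
result = result.sort_values("CODIGO", kind="stable", ignore_index=True)
print(result)
```

If the df2 rows should simply follow all of df1, the final sort_values call can be dropped.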

Related

DataFrame Groupby apply on second dataframe?

I have 2 dataframes df1, df2. Both have id as a column. I want to compute a new column, weighted_average, in df1 that is a function of the values in df2 with the same id.
First, I think I should do df1.groupby("id"). Is it possible to use GroupBy.apply(...) and have it use values from df2? In the examples I've seen, it usually just operates on df1 values.
If they have the same id positions and length, you can do something like:
df2["new column name"] = df1["column name"].apply(...)
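When the two frames do not line up row for row, one possible approach is to aggregate df2 per id first and then map the result onto df1. The sketch below assumes df2 has columns named value and weight; those names and the data are hypothetical, since the question does not show them.

```python
import pandas as pd

# Hypothetical data; only the "id" column is given in the question.
df1 = pd.DataFrame({"id": [1, 2, 3]})
df2 = pd.DataFrame({
    "id": [1, 1, 2],
    "value": [10.0, 20.0, 5.0],
    "weight": [1.0, 3.0, 2.0],
})

# Weighted average per id: sum(value * weight) / sum(weight).
weighted_sum = (df2["value"] * df2["weight"]).groupby(df2["id"]).sum()
weight_total = df2.groupby("id")["weight"].sum()
per_id = weighted_sum / weight_total

# Look up each df1 id in the per-id result; ids missing from df2 become NaN.
df1["weighted_average"] = df1["id"].map(per_id)
print(df1)
```

This avoids GroupBy.apply on df1 altogether: the grouping happens on df2, and df1 only performs a lookup.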

pandas - lookup a value in another DF without merging data from the other dataframe

I have 2 DataFrames. DF1 and DF2.
(Please note that DF2 has more entries than DF1)
What I want to do is add the nationality column to DF1 (The script should look up the name in DF1 and find the corresponding nationality in DF2).
I am currently using the below code
final_DF =df1.merge(df2[['PPS','Nationality']], on=['PPS'], how ='left')
Although the nationality column is being added to DF1, the code is duplicating entries and also adding additional data from DF2 that I do not want.
Is there a method to get the nationality from DF2 while only keeping the DF1 data?
Thanks
DF1
DF2
OUTPUT
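There are two things you need to do.
First, check whether DF2 contains duplicated keys. If it does, drop the duplicates before merging, because each duplicate produces an extra row in the result.
Second, define 'how' in the merge statement, so it will look like:
final_DF = DF1.merge(DF2, on=['Name'], how='left')
Since you want to keep only the DF1 rows, 'left' should be the ideal option for you.
For more info, refer to the pandas documentation on merge.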
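As a sketch of both points, the lookup can be written so that DF1 keeps exactly its own rows and gains only the Nationality column. The data below is invented, and the extra Age column in DF2 is a hypothetical stand-in for the "additional data" that should not be carried over; PPS is used as the key, as in the question's code.

```python
import pandas as pd

# Hypothetical data; DF2 has a repeated PPS and an unwanted extra column.
DF1 = pd.DataFrame({"Name": ["Ann", "Bob"], "PPS": ["A1", "B2"]})
DF2 = pd.DataFrame({
    "PPS": ["A1", "A1", "B2", "C3"],
    "Nationality": ["Irish", "Irish", "French", "Polish"],
    "Age": [30, 30, 41, 25],
})

# One row per PPS, reduced to a PPS -> Nationality lookup Series.
lookup = DF2.drop_duplicates(subset="PPS").set_index("PPS")["Nationality"]

# map never adds rows, so final_DF has exactly the rows of DF1.
final_DF = DF1.copy()
final_DF["Nationality"] = final_DF["PPS"].map(lookup)
print(final_DF)
```

An equivalent left merge would be DF1.merge(DF2[["PPS", "Nationality"]].drop_duplicates("PPS"), on="PPS", how="left").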

Combine two dataframes to send an automated message [duplicate]

Is there a way to conveniently merge two data frames side by side?
Both data frames have 30 rows but a different number of columns: say, df1 has 20 columns and df2 has 40 columns.
How can I easily get a new data frame of 30 rows and 60 columns?
df3 = pd.someSpecialMergeFunct(df1, df2)
or maybe there is some special parameter in append
df3 = pd.append(df1, df2, left_index=False, right_index=false, how='left')
ps: if possible, i hope the replicated column names could be resolved automatically.
thanks!
You can use the concat function for this (axis=1 is to concatenate as columns):
pd.concat([df1, df2], axis=1)
See the pandas docs on merging/concatenating: http://pandas.pydata.org/pandas-docs/stable/merging.html
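A small sketch with the shapes from the question (30 rows, 20 and 40 columns). The add_prefix calls are an optional assumption on my part: they are one way to keep column names distinct, since concat itself does not rename duplicated columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frames with the shapes described in the question.
df1 = pd.DataFrame(np.zeros((30, 20))).add_prefix("a_")
df2 = pd.DataFrame(np.ones((30, 40))).add_prefix("b_")

# axis=1 places the frames side by side, aligning rows on the index.
df3 = pd.concat([df1, df2], axis=1)
print(df3.shape)  # (30, 60)
```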
I came across your question while I was trying to achieve something like the following:
So once I had sliced my dataframes, I first ensured that their indexes were the same. In your case, both dataframes need to be indexed from 0 to 29. Then I merged both dataframes on the index.
df1.reset_index(drop=True).merge(df2.reset_index(drop=True), left_index=True, right_index=True)
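To illustrate why the index reset matters, here is a sketch with two tiny made-up frames whose indexes do not overlap, as can happen after slicing a larger frame.

```python
import pandas as pd

# Hypothetical frames: "left" could be a slice of a larger frame.
left = pd.DataFrame({"a": [1, 2, 3]}, index=[5, 6, 7])
right = pd.DataFrame({"b": [4, 5, 6]}, index=[0, 1, 2])

# Without a reset, rows are matched by label, so nothing lines up.
misaligned = pd.concat([left, right], axis=1)

# After resetting both indexes to 0..n-1, rows are matched by position.
aligned = left.reset_index(drop=True).merge(
    right.reset_index(drop=True), left_index=True, right_index=True
)
print(misaligned.shape, aligned.shape)
```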
If you want to combine 2 data frames with common column name, you can do the following:
df_concat = pd.merge(df1, df2, on='common_column_name', how='outer')
I found that the other answers didn't cut it for me when coming in from Google.
What I did instead was to set the new columns in place in the original df.
# list(df2) gives you the column names of df2
# you then use these as the column names for df
df[list(df2)] = df2
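There is a way: you can do it via a scikit-learn Pipeline.
** Use a pipeline to transform your numerical data, for example:
# DataFrameSelector is a custom column-selecting transformer, not part of scikit-learn
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
num_pipeline = Pipeline([
    ("select_numeric", DataFrameSelector([columns with numerical value])),
    ("imputer", SimpleImputer(strategy="median")),
])
** And for categorical data:
cat_pipeline = Pipeline([
    ("select_cat", DataFrameSelector([columns with categorical data])),
    # use sparse=False instead on scikit-learn versions before 1.2
    ("cat_encoder", OneHotEncoder(sparse_output=False)),
])
** Then use a FeatureUnion to add these transformations together: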
preprocess_pipeline = FeatureUnion(transformer_list=[
("num_pipeline", num_pipeline),
("cat_pipeline", cat_pipeline),
])
Read more here - https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html
This solution also works if df1 and df2 have different indices:
df1.loc[:, df2.columns] = df2.to_numpy()
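The difference between assigning the DataFrame itself and assigning its values can be sketched as follows. The data is made up, and the positional variant uses plain bracket assignment rather than the .loc form shown above; treat it as an illustration of the alignment behaviour, not as the only way to write it.

```python
import pandas as pd

# Hypothetical frames with non-overlapping index labels.
df1 = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({"b": [4.0, 5.0, 6.0], "c": [7.0, 8.0, 9.0]}, index=[10, 11, 12])

# Assigning a DataFrame aligns on index labels: no labels match, so all NaN.
by_label = df1.copy()
by_label[list(df2)] = df2

# Assigning the underlying array ignores labels and fills by position.
by_position = df1.copy()
by_position[list(df2)] = df2.to_numpy()
print(by_label)
print(by_position)
```

The positional form requires both frames to have the same number of rows.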

Preferred pandas code for selecting all rows and a subset of columns

Suppose that you have a pandas DataFrame named df with columns ['a','b','c','d','e'] and you want to create a new DataFrame newdf with columns 'b' and 'd'. There are two possible ways to do this:
newdf = df[['b','d']]
or
newdf = df.loc[:,['b','d']]
The first is using the indexing operator. The second is using .loc. Is there a reason to prefer one over the other?
Thanks to @coldspeed, it seems that newdf = df.loc[:,['b','d']] is preferred to avoid the dreaded SettingWithCopyWarning.
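For reference, a short sketch of both selections on a made-up frame. The explicit .copy() is an addition of mine: it is a common way to make clear that newdf is independent of df before modifying it, whichever selection style is used.

```python
import pandas as pd

# Hypothetical frame with the five columns from the question.
df = pd.DataFrame({c: range(3) for c in "abcde"})

# Both forms select all rows and the same two columns.
same = df[["b", "d"]].equals(df.loc[:, ["b", "d"]])

# An explicit copy makes later edits to newdf clearly separate from df.
newdf = df.loc[:, ["b", "d"]].copy()
newdf["b"] = newdf["b"] * 10
print(same, newdf["b"].tolist(), df["b"].tolist())
```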

How to access (multi)index of a Data Frame?

I have a data frame and use some of its columns to group by:
grouped = df.groupby(['col1', 'col2'])
Now I use the mean function to get a new data frame object from the groupby object created above:
df_new = grouped.mean()
Now I have two data frames (df and df2) and I would like to merge them using col1 and col2. The problem is that df2 does not have these columns. After the groupby operation, col1 and col2 are "shifted" into the index. So, to resolve this problem, I tried to create these columns:
df2['col1'] = df2['index'][0]
df2['col2'] = df2['index'][1]
But it does not work because 'index' is not recognized as a column of the data frame.
As an alternative to Andy Hayden's method, you could use as_index=False to preserve the columns as columns rather than indices:
df2 = df.groupby(['col1', 'col2'], as_index=False).mean()
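A brief sketch of the difference, using a made-up frame with a hypothetical val column. It also shows reset_index, which moves the index levels of an already grouped result back into columns.

```python
import pandas as pd

# Hypothetical data; col1 and col2 are the grouping keys from the question.
df = pd.DataFrame({
    "col1": ["x", "x", "y"],
    "col2": [1, 1, 2],
    "val": [1.0, 3.0, 5.0],
})

# Default: the group keys become a MultiIndex.
indexed = df.groupby(["col1", "col2"]).mean()

# as_index=False keeps the group keys as ordinary columns.
flat = df.groupby(["col1", "col2"], as_index=False).mean()

# reset_index converts the MultiIndex of an existing result into columns.
also_flat = indexed.reset_index()
print(flat)
```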
You can use left_index (or right_index) arguments of merge:
left_index : boolean, default False
Use the index from the left DataFrame as the join key(s).
If it is a MultiIndex, the number of keys in the other DataFrame (either the index
or a number of columns) must match the number of levels
and use right_on to determine which columns it should merge the index with.
So it'll be something like:
pd.merge(df, df_new, left_on=['col1', 'col2'], right_index=True)
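A runnable sketch of that merge on made-up data. The val column and the rename to val_mean are assumptions added here so that the aggregated column does not clash with the original one; without the rename, pandas would append _x and _y suffixes.

```python
import pandas as pd

# Hypothetical data; col1 and col2 are the grouping keys from the question.
df = pd.DataFrame({
    "col1": ["x", "x", "y"],
    "col2": [1, 1, 2],
    "val": [1.0, 3.0, 5.0],
})

# Group means, indexed by (col1, col2); renamed to avoid a name clash.
df_new = df.groupby(["col1", "col2"]).mean().rename(columns={"val": "val_mean"})

# Match the columns of df against the MultiIndex levels of df_new.
merged = pd.merge(df, df_new, left_on=["col1", "col2"], right_index=True)
print(merged)
```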