Join on column in pandas

So this works as expected:
df1 = pd.DataFrame({'date':[123,456],'price1':[23,34]}).set_index('date')
df2 = pd.DataFrame({'date':[456,789],'price2':[22,32]}).set_index('date')
df1.join(df2, how='outer')
price1 price2
date
123 23.0 NaN
456 34.0 22.0
789 NaN 32.0
But if I don't set the index, it causes an error:
df1 = pd.DataFrame({'date':[123,456],'price1':[23,34]})
df2 = pd.DataFrame({'date':[456,789],'price2':[22,32]})
df1.join(df2, on='date', how='outer')
ValueError: columns overlap but no suffix specified: Index(['date'], dtype='object')
Why is this, and am I incorrect for supposing they should give the same result?

If you just want to join the two dataframes on their indexes rather than on a certain column, you need to add suffixes so you don't create columns with the same name, e.g.:
df1.join(df2, how='outer', lsuffix='_left', rsuffix='_right')
If you want to join on the column, you should use merge:
df1.merge(df2, how='outer')
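A minimal sketch of the merge route with the question's frames. (For context: `DataFrame.join` with `on='date'` matches the caller's `date` column against the *other* frame's index, so with both frames still carrying a `date` column you get the overlap error above.)

```python
import pandas as pd

# Frames from the question, left un-indexed
df1 = pd.DataFrame({'date': [123, 456], 'price1': [23, 34]})
df2 = pd.DataFrame({'date': [456, 789], 'price2': [22, 32]})

# merge joins on the shared 'date' column directly;
# how='outer' keeps rows that appear in only one frame
out = df1.merge(df2, on='date', how='outer')
print(out)
```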

Related

Fillna() depending on another column

I want to do next:
Fill DF1 NaN with values from DF2 depending on column value in DF1.
Basically, DF1 has people with an "income_type" and some NaN in "total_income". DF2 has the "median income" for each "income_type". I want to fill the NaN in "total_income" of DF1 with the median values from DF2.
First, I would merge values from DF2 to DF1 by 'income_type'
DF3 = DF1.merge(DF2, how='left', on='income_type')
This way you have the values of median income and total income in the same dataframe.
After this, I would do a conditional assignment on the dataframe column:
DF3.loc[DF3['total_income'].isna(), 'total_income'] = DF3['median income']
That will replace the NaN values with the median values from the merge
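A runnable sketch of those two steps, with toy values standing in for the question's data:

```python
import pandas as pd

# Toy frames in the spirit of the question
DF1 = pd.DataFrame({'income_type': ['a', 'b', 'a'],
                    'total_income': [200.0, None, None]})
DF2 = pd.DataFrame({'income_type': ['a', 'b'],
                    'median income': [205.0, 305.0]})

# Step 1: bring each row's median onto DF1 via a left merge
DF3 = DF1.merge(DF2, how='left', on='income_type')
# Step 2: overwrite only the missing totals with the merged medians
DF3.loc[DF3['total_income'].isna(), 'total_income'] = DF3['median income']
print(DF3)
```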
You need to join the two dataframes and then replace the nan values with the median. Here is a similar working example. Cheers mate.
import pandas as pd
#create the example dataframes
df1 = pd.DataFrame({'income_type':['a','b','c','a','a','b','b'], 'total_income':[200, 300, 500,None,400,None,None]})
df2 = pd.DataFrame({'income_type':['a','b','c'], 'median_income':[205, 305, 505]})
# inner join the df1 with df2 on the column 'income_type'
joined = df1.merge(df2, on='income_type')
# fill the nan values the value from the column 'median_income' and save it in a new column 'total_income_not_na'
joined['total_income_not_na'] = joined['total_income'].fillna(joined['median_income'])
print(joined)

pd.concat Plan shapes are not aligned error

I'm trying to concat two dataframes that have exactly the same column names.
df1 =
   A        B     C
1762  53RC982  0.22
1763  56XY931  0.33
1767  54AB171  0.47
1771  38CD410  0.22
df2 =
   A        B     C
1810  53RC982  0.42
1811  58XY821  0.63
1812  47AB261  0.33
1820  38CD410  0.81
where A is the unique column. I basically want to stack df1 and df2 one after the other, untouched. This shouldn't be a problem since A is unique: if df1 has 100 rows and df2 has 50 rows, the combined frame should have 150 rows.
I've tried:
df_final = pd.concat([df1, df2], axis=0, sort =True)
However, this is giving me
ValueError: Plan shapes are not aligned
If I do:
df_final = pd.concat([df1, df2], axis=1)
this gives me the information side by side. I've also tried the
.append
method, with similar results.
The problem was that the dataframes didn't have the same data types. I used
df1['A'] = df1['A'].astype(str)
df2['A'] = df2['A'].astype(str)
then
df_final= pd.concat([df1, df2], axis=0, sort = True)
to concatenate them.
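A runnable sketch of that fix with made-up values (note: on this toy data a recent pandas may concatenate even without the cast, but aligning dtypes first is the safe pattern):

```python
import pandas as pd

# Toy frames mirroring the question: 'A' is int in df1 but str in df2
df1 = pd.DataFrame({'A': [1762, 1763],
                    'B': ['53RC982', '56XY931'],
                    'C': [0.22, 0.33]})
df2 = pd.DataFrame({'A': ['1810', '1811'],
                    'B': ['53RC982', '58XY821'],
                    'C': [0.42, 0.63]})

# Give 'A' the same dtype in both frames, then stack row-wise
df1['A'] = df1['A'].astype(str)
df2['A'] = df2['A'].astype(str)
df_final = pd.concat([df1, df2], axis=0, ignore_index=True)
print(df_final)
```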

Perplexing pandas index change after left merge

I have a data frame and I am interested in a particular row. When I run
questionnaire_events[questionnaire_events['event_id'].eq(6506308)]
I get the row, and its index is 7,816. I then merge questionnaire_events with another data frame
merged = questionnaire_events.merge(
ordinals,
how='left',
left_on='event_id',
right_on='id')
(It is worth noting that the ordinals data frame has no NaNs and no duplicated ids, but questionnaire_events does have some rows with NaN values for event_id.)
merged[merged['event_id'].eq(6506308)]
The resulting row has index 7,581. Why? What has happened in the merge, a left outer merge, to mean that my row has moved from 7,816 to 7,581? If there were multiple rows with the same id in the ordinals data frame then I can see how the merged data frame would have more rows than the left data frame in the merge, but that is not the case, so why has the row moved?
(N.B. Sorry I cannot give a crisp code sample. When I try to produce test data the row index change does not happen, it is only happening on my real data.)
pd.DataFrame.merge does not preserve the original dataframes' indexes.
df1 = pd.DataFrame({'key':[*'ABCDE'], 'val':[1,2,3,4,5]}, index=[100,200,300,400,500])
print('df1 dataframe:')
print(df1)
print('\n')
df2 = pd.DataFrame({'key':[*'AZCWE'], 'val':[10,20,30,40,50]}, index=[*'abcde'])
print('df2 dataframe:')
print(df2)
print('\n')
df_m = df1.merge(df2, on='key', how='left')
print('df_m dataframe:')
print(df_m)
Now, if your df1 has the default range index, it is possible to get a different index in your merged dataframe: if you subset or filter your df1, the indexes will no longer match.
Work Around:
df1 = df1.reset_index()
df_m2 = df1.merge(df2, on='key', how='left')
df_m2 = df_m2.set_index('index')
print('df_m2 work around dataframe:')
print(df_m2)
Output:
df_m2 work around dataframe:
key val_x val_y
index
100 A 1 10.0
200 B 2 NaN
300 C 3 30.0
400 D 4 NaN
500 E 5 50.0
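The whole workaround can be sketched end to end with toy data (names and values assumed):

```python
import pandas as pd

# df1 carries a meaningful custom index; df2 does not
df1 = pd.DataFrame({'key': list('ABC'), 'val': [1, 2, 3]},
                   index=[100, 200, 300])
df2 = pd.DataFrame({'key': list('AC'), 'other': [10, 30]})

# A plain merge discards df1's index and hands back a fresh RangeIndex
merged = df1.merge(df2, on='key', how='left')

# Workaround: carry the index along as a column, then restore it
restored = (df1.reset_index()
               .merge(df2, on='key', how='left')
               .set_index('index'))
print(restored)
```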

change groupby dimension to 1 from 2 in pandas

I performed a groupby:
df=pd.DataFrame({'grp':['a','a','b','b'],'value':[1,2,1,10]})
df.groupby('grp').agg({'value':['mean','median']})
and got:
how do I change this to a normal df that I can manipulate and access?
Change your code a bit: select the column for aggregation after the groupby and pass a list of functions:
df1 = df.groupby('grp')['value'].agg(['mean','median'])
print (df1)
mean median
grp
a 1.5 1.5
b 5.5 5.5
Another idea is to remove the first level of the MultiIndex, but with more aggregated columns it is possible to get duplicated column names:
df1 = df.groupby('grp').agg({'value':['mean','median']})
df1.columns = df1.columns.droplevel(0)
print (df1)
mean median
grp
a 1.5 1.5
b 5.5 5.5
To avoid duplicated column names, it is then better to use map with join:
df1 = df.groupby('grp').agg({'value':['mean','median']})
df1.columns = df1.columns.map('_'.join)
print (df1)
value_mean value_median
grp
a 1.5 1.5
b 5.5 5.5
Or, for pandas 0.25+, use named aggregation:
df2 = df.groupby("grp").agg(a=pd.NamedAgg(column='value', aggfunc='mean'),
                            b=pd.NamedAgg(column='value', aggfunc='median'))
print (df2)
a b
grp
a 1.5 1.5
b 5.5 5.5
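In pandas 0.25+ the same named aggregation also has a tuple shorthand, sketched here with the question's data:

```python
import pandas as pd

df = pd.DataFrame({'grp': ['a', 'a', 'b', 'b'], 'value': [1, 2, 1, 10]})

# Named aggregation, tuple form: new_column=(source_column, aggfunc)
df2 = df.groupby('grp').agg(value_mean=('value', 'mean'),
                            value_median=('value', 'median'))
print(df2)
```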
You can simply drop a level of the columns and reset the index of the DataFrame like this:
df=pd.DataFrame({'grp':['a','a','b','b'],'value':[1,2,1,10]})
df1 = df.groupby('grp').agg({'value':['mean','median']})
df1.columns = df1.columns.droplevel(0)
df1.reset_index()
Also, if you want a combined column, you can:
df1.columns = df1.columns.map('_'.join)
instead of:
df1.columns = df1.columns.droplevel(0)

Assigning index column to empty pandas dataframe

I am creating an empty dataframe that i then want to add data to one row at a time. I want to index on the first column, 'customer_ID'
I have this:
In[1]: df = pd.DataFrame(columns = ['customer_ID','a','b','c'],index=['customer_ID'])
In[2]: df
Out[3]:
customer_ID a b c
customer_ID NaN NaN NaN NaN
So there is already a row of NaN that I don't want.
Can I point the index to the first column without adding a row of data?
The answer, I think, as hinted at by @JD Long, is to set the index in a separate instruction:
In[1]: df = pd.DataFrame(columns = ['customer_ID','a','b','c'])
In[2]: df.set_index('customer_ID', drop=False, inplace=True)
In[3]: df
Out[3]:
Empty DataFrame
Columns: [customer_ID, a, b, c]
Index: []
I can then add rows:
In[4]: id='x123'
In[5]: df.loc[id]=[id,4,5,6]
In[6]: df
Out[7]:
customer_ID a b c
x123 x123 4.0 5.0 6.0
yes... and you can dropna at any time if you are so inclined:
df = df.set_index('customer_ID').dropna()
df
That's because your dataframe had no rows when you first created it.
df= pd.DataFrame({'customer_ID': ['2'],'a': ['1'],'b': ['A'],'c': ['1']})
df = df.set_index('customer_ID', drop=False)
df
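Putting the accepted approach together as one runnable sketch (IDs and values are illustrative; here the ID lives only in the index rather than being duplicated as a column):

```python
import pandas as pd

# Empty frame with the index set up front, then rows added one at a time
df = pd.DataFrame(columns=['customer_ID', 'a', 'b', 'c']).set_index('customer_ID')
df.loc['x123'] = [4, 5, 6]
df.loc['x124'] = [7, 8, 9]
print(df)
```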