Fillna() depending on another column - pandas

I want to do the following:
Fill the NaN values in DF1 with values from DF2, depending on a column value in DF1.
Basically, DF1 has people with an "income_type" and some NaN values in "total_income". DF2 has a "median income" for each "income_type". I want to fill the NaN values in DF1's "total_income" with the matching median values from DF2.
DF1, DF2

First, I would merge the values from DF2 into DF1 on 'income_type':
DF3 = DF1.merge(DF2, how='left', on='income_type')
This way you have the median income and the total income in the same dataframe.
After this, I would do a conditional assignment on the dataframe column, selecting only the rows where 'total_income' is NaN:
DF3.loc[DF3['total_income'].isna(), 'total_income'] = DF3['median income']
That will replace the NaN values with the median values brought in by the merge.

You need to join the two dataframes and then replace the nan values with the median. Here is a similar working example. Cheers mate.
import pandas as pd
#create the example dataframes
df1 = pd.DataFrame({'income_type':['a','b','c','a','a','b','b'], 'total_income':[200, 300, 500,None,400,None,None]})
df2 = pd.DataFrame({'income_type':['a','b','c'], 'median_income':[205, 305, 505]})
# inner join the df1 with df2 on the column 'income_type'
joined = df1.merge(df2, on='income_type')
# fill the NaN values with the values from the column 'median_income' and save the result in a new column 'total_income_not_na'
joined['total_income_not_na'] = joined['total_income'].fillna(joined['median_income'])
print(joined)
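The two answers above differ in the join type: the first uses how='left', the second relies on the default inner join. A minimal sketch of the difference, assuming made-up data in which income type 'd' has no entry in df2 (all values here are hypothetical):

```python
import pandas as pd

# Hypothetical data: income_type 'd' has no median in df2.
df1 = pd.DataFrame({
    "income_type": ["a", "b", "d", "a"],
    "total_income": [200.0, None, None, None],
})
df2 = pd.DataFrame({"income_type": ["a", "b"], "median_income": [205.0, 305.0]})

# The default inner join drops the 'd' row because it has no match in df2.
inner = df1.merge(df2, on="income_type")

# A left join keeps every row of df1; unmatched rows get NaN in 'median_income'.
left = df1.merge(df2, on="income_type", how="left")
left["total_income"] = left["total_income"].fillna(left["median_income"])

print(len(inner), len(left))
print(left)
```

With the left join, the 'd' row stays in the result and its total_income remains NaN, since there is no median to fill it with.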

Related

Override non null values from one dataframe to another

I would like to override values in one dataframe with the non-null values from another dataframe, matching on the combination of the first row and the first column (both are unique).
Basically, I am trying to join df2 onto df1 only for the non-null values in df2, keeping df1's rows and columns intact.
eg:
df1 =
df2 =
output =
This should work:
output = df1.merge(df2, on='ID')
cols = [c for c in df1.columns if c!='ID']
for col in cols:
    output[col] = output[f'{col}_x'].fillna(output[f'{col}_y'])
    output.drop(columns=[f'{col}_x', f'{col}_y'], inplace=True)
Explanation:
First, we merge the two dataframes using ID as the key. The merge joins the two dataframes, and where both have a column with the same name it adds the suffixes _x and _y.
Then we iterate over all the columns in df1, fill the NA values in the column col_x with the values from col_y, and put the result into a new column col.
Finally, we drop the auxiliary columns col_x and col_y.
Edit:
Even with the updated requirements, the approach is similar. In this case, however, you need to perform a left outer join and start from the values of the second dataframe, filling its NA values from the first. Here is the code:
output = df1.merge(df2, on='ID', how='left')
cols = [c for c in df1.columns if c!='ID']
for col in cols:
    output[col] = output[f'{col}_y'].fillna(output[f'{col}_x'])
    output.drop(columns=[f'{col}_x', f'{col}_y'], inplace=True)
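Since the example frames from the question are not shown here, the following is a small sketch of the left-join variant on made-up data. The column names A and B and all values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: df2 holds updates for some cells and NaN elsewhere.
df1 = pd.DataFrame({"ID": [1, 2, 3], "A": [10.0, 20.0, 30.0], "B": [1.0, 2.0, 3.0]})
df2 = pd.DataFrame({"ID": [1, 3], "A": [np.nan, 99.0], "B": [7.0, np.nan]})

# Left join keeps every row of df1; overlapping columns get _x (df1) and _y (df2).
output = df1.merge(df2, on="ID", how="left")

cols = [c for c in df1.columns if c != "ID"]
for col in cols:
    # Prefer the df2 value; fall back to the df1 value where df2 is NaN.
    output[col] = output[f"{col}_y"].fillna(output[f"{col}_x"])
    output = output.drop(columns=[f"{col}_x", f"{col}_y"])

print(output)
```

Only the non-null cells of df2 replace values from df1; row 2, which has no match in df2, is left unchanged.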

How do I offset a dataframe with values in another dataframe?

I have two dataframes. One holds the base values (df) and the other holds offsets (df2).
How do I create a third dataframe that is the first dataframe offset by matching values (the ID) in the second dataframe?
This post doesn't seem to do the offset: "Update only some values in a dataframe using another dataframe"
import pandas as pd
# initialize list of lists
data = [['1092', 10.02], ['18723754', 15.76], ['28635', 147.87]]
df = pd.DataFrame(data, columns = ['ID', 'Price'])
offsets = [['1092', 100.00], ['28635', 1000.00], ['88273', 10.]]
df2 = pd.DataFrame(offsets, columns = ['ID', 'Offset'])
print (df)
print (df2)
>>> print (df)
         ID   Price
0      1092   10.02
1  18723754   15.76  # no offset to affect it
2     28635  147.87
>>> print (df2)
      ID   Offset
0   1092   100.00
1  28635  1000.00
2  88273    10.00  # < no match
This is what I want to produce: the price has been offset where the ID matches.
         ID    Price
0      1092   110.02
1  18723754    15.76
2     28635  1147.87
I've also looked at Pandas Merging 101
I don't want to add columns to the dataframe, and I don't want to just replace column values with values from another dataframe.
What I want is to add (sum) column values from the other dataframe to this dataframe, where the IDs match.
The closest I come is df_add=df.reindex_like(df2) + df2 but the problem is that it sums all columns - even the ID column.
Try this:
df['Price'] = pd.merge(df, df2, on=["ID"], how="left")[['Price','Offset']].sum(axis=1)
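A short sketch of why the one-liner works, using the data from the question. The left merge leaves Offset as NaN for IDs with no match, and sum(axis=1) skips NaN by default, so unmatched prices pass through unchanged. The map-based alternative at the end is not from the answer; it is shown only for comparison. Both variants assume each ID appears at most once in df2.

```python
import pandas as pd

df = pd.DataFrame(
    [["1092", 10.02], ["18723754", 15.76], ["28635", 147.87]],
    columns=["ID", "Price"],
)
df2 = pd.DataFrame(
    [["1092", 100.00], ["28635", 1000.00], ["88273", 10.0]],
    columns=["ID", "Offset"],
)

# Left merge: every row of df is kept; Offset is NaN where df2 has no matching ID.
merged = pd.merge(df, df2, on="ID", how="left")

# Row-wise sum skips NaN, so a missing offset leaves the price as it was.
summed = merged[["Price", "Offset"]].sum(axis=1)

# Alternative (not from the answer): look up each ID's offset and treat a miss as 0.
offset = df["ID"].map(df2.set_index("ID")["Offset"]).fillna(0)
alt = df["Price"] + offset

print(summed.round(2).tolist())
print(alt.round(2).tolist())
```

The unmatched ID 88273 in df2 has no effect in either variant, because the left merge and the map both start from the rows of df.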

Join on column in pandas

So this works as expected:
df1 = pd.DataFrame({'date':[123,456],'price1':[23,34]}).set_index('date')
df2 = pd.DataFrame({'date':[456,789],'price2':[22,32]}).set_index('date')
df1.join(df2, how='outer')
      price1  price2
date
123     23.0     NaN
456     34.0    22.0
789      NaN    32.0
But if I don't set the index, it causes an error:
df1 = pd.DataFrame({'date':[123,456],'price1':[23,34]})
df2 = pd.DataFrame({'date':[456,789],'price2':[22,32]})
df1.join(df2, on='date', how='outer')
ValueError: columns overlap but no suffix specified: Index(['date'], dtype='object')
Why is this, and am I wrong to suppose they should give the same result?
If you just want to put the two dataframes side by side (joining on the index rather than on a particular column), you need to add suffixes so that no two columns end up with the same name, e.g.:
df1.join(df2, how='outer', lsuffix='_left', rsuffix='_right')
If you want to join on the column, you should use merge:
df1.merge(df2, how='outer')
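As a rough check, the sketch below compares the index-based join from the question with a column-based merge on the same data. Passing on='date' explicitly is an addition here; without it, merge falls back to the columns the two frames have in common, which in this case is also 'date'.

```python
import pandas as pd

df1 = pd.DataFrame({"date": [123, 456], "price1": [23, 34]})
df2 = pd.DataFrame({"date": [456, 789], "price2": [22, 32]})

# Index-based join, as in the working example from the question.
by_index = df1.set_index("date").join(df2.set_index("date"), how="outer").reset_index()

# Column-based merge on 'date'; no suffixes are needed because 'date' is the key.
by_column = df1.merge(df2, on="date", how="outer")

# Sort both the same way before comparing, to avoid relying on row order.
by_index = by_index.sort_values("date").reset_index(drop=True)
by_column = by_column.sort_values("date").reset_index(drop=True)

print(by_column)
```

Both routes produce one row per date with NaN where a price is missing, which is the result the question expected.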

How can I replace NaN values in DataFrame from another table?

I have a DataFrame 'df'
And the second is 'nan_gdp'
How can I fill the NaN GDP values in 'nan_gdp' using my first DataFrame 'df'?
Also, the first df doesn't contain all countries: some countries are in 'nan_gdp' but not in 'df'.
Use Series.fillna with values mapped from df via Series.map:
s = df.set_index('Country')['GDP ($M)']
waste['GDP ($M)'] = waste['GDP ($M)'].fillna(waste['Country'].map(s))
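A minimal sketch of this pattern on made-up data, using the frame names from the question (df and nan_gdp). The country names and figures are hypothetical, and the lookup assumes each country appears only once in df.

```python
import numpy as np
import pandas as pd

# Hypothetical data: country "C" is missing from df, so it cannot be filled.
df = pd.DataFrame({"Country": ["A", "B"], "GDP ($M)": [100.0, 200.0]})
nan_gdp = pd.DataFrame(
    {"Country": ["A", "B", "C"], "GDP ($M)": [np.nan, 250.0, np.nan]}
)

# Lookup Series: index is the country, values are the GDP figures.
s = df.set_index("Country")["GDP ($M)"]

# map() turns each country into its GDP from df (NaN if the country is absent);
# fillna() then uses those values only where GDP is currently missing.
nan_gdp["GDP ($M)"] = nan_gdp["GDP ($M)"].fillna(nan_gdp["Country"].map(s))

print(nan_gdp)
```

Existing values are never overwritten (B keeps 250.0), and countries that are absent from df simply stay NaN.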

Assigning index column to empty pandas dataframe

I am creating an empty dataframe that I then want to add data to, one row at a time. I want to index on the first column, 'customer_ID'.
I have this:
In[1]: df = pd.DataFrame(columns = ['customer_ID','a','b','c'],index=['customer_ID'])
In[2]: df
Out[3]:
customer_ID a b c
customer_ID NaN NaN NaN NaN
So there is already a row of NaN that I don't want.
Can I point the index to the first column without adding a row of data?
The answer, I think, as hinted at by @JD Long, is to set the index in a separate instruction:
In[1]: df = pd.DataFrame(columns = ['customer_ID','a','b','c'])
In[2]: df.set_index('customer_ID',inplace = True)
In[3]: df
Out[3]:
Empty DataFrame
Columns: [customer_ID, a, b, c]
Index: []
I can then add rows:
In[4]: id='x123'
In[5]: df.loc[id]=[id,4,5,6]
In[6]: df
Out[7]:
customer_ID a b c
x123 x123 4.0 5.0 6.0
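For comparison, here is a sketch of the same idea as it behaves in recent pandas versions, where set_index moves the column into the index by default (drop=True). The new row then only needs values for the remaining columns a, b and c. The row labels and values are made up.

```python
import pandas as pd

# Start empty, then move 'customer_ID' into the index; the column itself is dropped.
df = pd.DataFrame(columns=["customer_ID", "a", "b", "c"]).set_index("customer_ID")

# Hypothetical rows: the label becomes the index entry,
# and the list supplies the values for columns a, b, c.
df.loc["x123"] = [4, 5, 6]
df.loc["y456"] = [7, 8, 9]

print(df)
```

No placeholder row of NaN is created, because the frame has no rows until the first assignment through .loc.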
Yes, and you can call dropna at any time if you are so inclined:
df = df.set_index('customer_ID').dropna()
df
That is because your dataframe didn't have any rows when you first created it.
df = pd.DataFrame({'customer_ID': ['2'], 'a': ['1'], 'b': ['A'], 'c': ['1']})
# set_index returns a new dataframe, so assign the result back
df = df.set_index('customer_ID', drop=False)
df