Override non null values from one dataframe to another - pandas

I would like to override non-null values from one dataframe onto another, matched on the first row and column (both being unique).
Basically, I am trying to join df2 onto df1 only for the non-null values in df2, keeping df1's rows and columns intact.
eg:
df1 =
df2 =
output =

This should work:
output = df1.merge(df2, on='ID')
cols = [c for c in df1.columns if c!='ID']
for col in cols:
    output[col] = output[f'{col}_x'].fillna(output[f'{col}_y'])
    output.drop(columns=[f'{col}_x', f'{col}_y'], inplace=True)
Explanation:
First, we merge the two dataframes using ID as the key. Where the two dataframes share a column name, the merge adds the suffixes _x and _y.
Then we iterate over all the columns of df1 (except ID), fill the NA values in col_x with the values from col_y, and store the result in a new column col.
Finally, we drop the auxiliary columns col_x and col_y.
Edit:
Even with the updated requirements the approach is similar. However, in this case you need to perform a left outer join and fill the NaN values of the second dataframe's columns from the first. Here is the code:
output = df1.merge(df2, on='ID', how='left')
cols = [c for c in df1.columns if c!='ID']
for col in cols:
    output[col] = output[f'{col}_y'].fillna(output[f'{col}_x'])
    output.drop(columns=[f'{col}_x', f'{col}_y'], inplace=True)
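As an alternative to the merge-and-fillna loop, combine_first can express the same "override non-null values" idea in one line. A minimal sketch with invented data, since the original frames aren't shown:

```python
import pandas as pd

# Hypothetical stand-ins for df1/df2 (the originals aren't shown)
df1 = pd.DataFrame({'ID': [1, 2, 3], 'A': [10, 20, 30], 'B': ['x', 'y', 'z']})
df2 = pd.DataFrame({'ID': [1, 3], 'A': [None, 99], 'B': ['new', None]})

# Index both frames by ID, then let df2's non-null values take precedence;
# wherever df2 is null (or a row is missing entirely), df1's value survives
output = df2.set_index('ID').combine_first(df1.set_index('ID')).reset_index()
print(output)
```

combine_first takes the union of rows and columns, so df1 rows absent from df2 pass through untouched.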

Related

Split html table to smaller Pandas DataFrames

I'm trying to parse HTML tables from the page ukwtv.de into pandas DataFrames.
The challenge is that one HTML table actually combines two or even three tables.
From that table I want:
TV program name and SID as df1,
Kanal, Standort, etc. as df2,
Technische Details as df3.
Here is what I managed to achieve so far:
table_MN = pd.read_html('https://www.ukwtv.de/cms/deutschland-tv/schleswig-holstein-tv.html', thousands='.', decimal=',')
df1 = table_MN[1]
df1.columns = df1.columns.str.replace(" ", "_")
df1.columns = df1.columns.str.replace("\n", "_")
df1=df1.iloc[:7 , :]
for col in df1.columns:
    print(col)
    if '.' in col:
        df1.drop(col, axis=1, inplace=True)
df1.dropna(subset=["TV-_und_Radio-Programme_des_Bouquets"], axis=0, inplace=True)
df1.head(15)
df2 = table_MN[1]
df2.columns = df2.iloc[7]
df2 = df2.iloc[8: , :]
df2 = df2.reset_index(drop=True)
df2.head(20)
Issues I am struggling to solve:
Row 7 is hardcoded; how do I recognize the blank line so the data can be split into two dataframes?
The Technische Details column in df1 needs to be converted into a separate dataframe where Modulation, Guardintervall, ... are the Series names.
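One way to avoid hardcoding row 7 is to look for separator rows where every cell is NaN, which is what read_html typically produces for a blank line. A sketch on a toy frame, since the live page layout may differ:

```python
import numpy as np
import pandas as pd

# Toy stand-in for a scraped table that really contains two tables
# separated by an all-NaN row
raw = pd.DataFrame({
    'c0': ['Prog1', 'Prog2', np.nan, 'Kanal', 'K1'],
    'c1': ['1001', '1002', np.nan, 'Standort', 'S1'],
})

# Boolean mask of separator rows: every value in the row is NaN
blank = raw.isna().all(axis=1)

# Split into sub-frames at each blank row: cumsum gives each segment
# between separators its own group number
parts = [chunk.reset_index(drop=True)
         for _, chunk in raw[~blank].groupby(blank.cumsum())]

df1, df2 = parts
print(df1)
print(df2)
```

The same mask-and-cumsum trick generalizes to any number of stacked sub-tables.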

pandas - lookup a value in another DF without merging data from the other dataframe

I have 2 DataFrames. DF1 and DF2.
(Please note that DF2 has more entries than DF1)
What I want to do is add the nationality column to DF1 (The script should look up the name in DF1 and find the corresponding nationality in DF2).
I am currently using the below code
final_DF =df1.merge(df2[['PPS','Nationality']], on=['PPS'], how ='left')
Although the nationality column is being added to DF1, the code is duplicating entries and also adding additional data from DF2 that I do not want.
Is there a method to get the nationality from DF2 while only keeping the DF1 data?
Thanks
DF1
DF2
OUTPUT
There are two points you need to address.
Check whether there are any duplicated keys in DF2, and drop them before merging.
You can define 'how' in the merge statement, so it will look like
final_DF = DF1.merge(DF2, on=['Name'], how='left')
Since you want to keep only the DF1 rows, 'left' is the ideal option for you.
For more info, refer to this.
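To stop the merge from duplicating rows and pulling in unwanted columns, the lookup side can be trimmed and de-duplicated first; Series.map avoids the merge entirely. A sketch with invented column values, since the real frames aren't shown:

```python
import pandas as pd

# Hypothetical stand-ins for DF1/DF2; DF2 has a duplicated key and an
# extra column we do not want
DF1 = pd.DataFrame({'PPS': [1, 2, 3]})
DF2 = pd.DataFrame({'PPS': [1, 1, 2, 3, 4],
                    'Nationality': ['IE', 'IE', 'FR', 'DE', 'ES'],
                    'Age': [30, 30, 40, 50, 60]})

# Option 1: merge only the two needed columns, dropping duplicate keys first
lookup = DF2[['PPS', 'Nationality']].drop_duplicates(subset='PPS')
final_DF = DF1.merge(lookup, on='PPS', how='left')

# Option 2: map a key -> value Series, so no extra columns can sneak in
DF1['Nationality'] = DF1['PPS'].map(
    DF2.drop_duplicates('PPS').set_index('PPS')['Nationality'])
print(final_DF)
```

Both options keep exactly the DF1 rows and add only the Nationality column.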

Fillna() depending on another column

Here is what I want to do:
Fill the NaN values in DF1 with values from DF2, depending on a column value in DF1.
Basically, DF1 has people with an "income_type" and some NaN in "total_income". DF2 holds the "median income" for each "income_type". I want to fill the NaN in DF1's "total_income" with the median values from DF2.
DF1, DF2
First, I would merge values from DF2 to DF1 by 'income_type'
DF3 = DF1.merge(DF2, how='left', on='income_type')
This way you have the values of median income and total income in the same dataframe.
After this, I would do a conditional assignment on the dataframe's columns:
DF3.loc[DF3['total_income'].isna(), 'total_income'] = DF3['median income']
That will replace the NaN values with the median values from the merge
You need to join the two dataframes and then replace the NaN values with the median. Here is a similar working example. Cheers mate.
import pandas as pd
#create the example dataframes
df1 = pd.DataFrame({'income_type':['a','b','c','a','a','b','b'], 'total_income':[200, 300, 500,None,400,None,None]})
df2 = pd.DataFrame({'income_type':['a','b','c'], 'median_income':[205, 305, 505]})
# inner join the df1 with df2 on the column 'income_type'
joined = df1.merge(df2, on='income_type')
# fill the nan values the value from the column 'median_income' and save it in a new column 'total_income_not_na'
joined['total_income_not_na'] = joined['total_income'].fillna(joined['median_income'])
print(joined)
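The same fill can also be done without a merge, by mapping each income type to its median and passing the result to fillna (which accepts an index-aligned Series). A sketch reusing the toy frames from the example above:

```python
import pandas as pd

df1 = pd.DataFrame({'income_type': ['a', 'b', 'c', 'a', 'a', 'b', 'b'],
                    'total_income': [200, 300, 500, None, 400, None, None]})
df2 = pd.DataFrame({'income_type': ['a', 'b', 'c'],
                    'median_income': [205, 305, 505]})

# Series mapping each income_type to its median income
medians = df2.set_index('income_type')['median_income']

# Translate each row's income_type to its median, then fill only the NaNs
df1['total_income'] = df1['total_income'].fillna(df1['income_type'].map(medians))
print(df1)
```

This leaves df1's shape and columns untouched, so no auxiliary columns need to be dropped afterwards.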

How do I drop every column that only contains values from a list pandas

I want to be able to drop all columns that only have values from the list that I pass in.
list_of_drops = ['A','B']
d = {'C1': ['A','A','A','B','B','B'],
     'C2': ['A','C','A','A','A','A'],
     'C3': ['A','B','B','A','B','B'],
     'C4': ['A','A','A','A','A','A'],
     'C5': ['A','A','B','AC','A','B'],
     'C6': ['A','A','AD','A','B','A']}
df = pd.DataFrame(d, columns=['C1','C2','C3','C4','C5','C6'])
In this example, I want to drop C1, C3, and C4, since their values all come from the list.
To get rid of columns that contain only a single value (here, only 'A'):
df= df.loc[:, (df != 'A').any(axis=0)]
To get rid of columns with only values from a list:
df= df.loc[:, (~df.isin(list_of_drops)).any(axis=0)]
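Run on the example data, the second one-liner keeps exactly the columns that contain at least one value outside the list, so C1, C3 and C4 are dropped:

```python
import pandas as pd

list_of_drops = ['A', 'B']
d = {'C1': ['A','A','A','B','B','B'],
     'C2': ['A','C','A','A','A','A'],
     'C3': ['A','B','B','A','B','B'],
     'C4': ['A','A','A','A','A','A'],
     'C5': ['A','A','B','AC','A','B'],
     'C6': ['A','A','AD','A','B','A']}
df = pd.DataFrame(d)

# Keep a column if any of its values is NOT in list_of_drops
df = df.loc[:, (~df.isin(list_of_drops)).any(axis=0)]
print(df.columns.tolist())
```

Note that isin matches whole cell values, so 'AC' and 'AD' do not count as members of ['A', 'B'] and keep C5 and C6 alive.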

Pandas: Selecting rows by list

I tried following code to select columns from a dataframe. My dataframe has about 50 values. At the end, I want to create the sum of selected columns, create a new column with these sum values and then delete the selected columns.
I started with
columns_selected = ['A','B','C','D','E']
df = df[df.column.isin(columns_selected)]
but it said AttributeError: 'DataFrame' object has no attribute 'column'
Regarding the sum: As I don't want to write for the sum
df['sum_1'] = df['A']+df['B']+df['C']+df['D']+df['E']
I also thought that something like
df['sum_1'] = df[columns_selected].sum(axis=1)
would be more convenient.
You want df[columns_selected] to sub-select the df by a list of columns
you can then do df['sum_1'] = df[columns_selected].sum(axis=1)
To filter the df to just the cols of interest, pass a list of the columns: df = df[columns_selected]. Note that a common error is to pass just the strings, df = df['a','b','c'], which will raise a KeyError.
Note that you had a typo in your original attempt:
df = df.loc[:,df.columns.isin(columns_selected)]
The above would've worked: firstly, you needed columns, not column; secondly, you can use the boolean mask as a mask against the columns by passing it to loc as the column selection arg:
In [49]:
df = pd.DataFrame(np.random.randn(5,5), columns=list('abcde'))
df
Out[49]:
a b c d e
0 -0.778207 0.480142 0.537778 -1.889803 -0.851594
1 2.095032 1.121238 1.076626 -0.476918 -0.282883
2 0.974032 0.595543 -0.628023 0.491030 0.171819
3 0.983545 -0.870126 1.100803 0.139678 0.919193
4 -1.854717 -2.151808 1.124028 0.581945 -0.412732
In [50]:
cols = ['a','b','c']
df.loc[:, df.columns.isin(cols)]
Out[50]:
a b c
0 -0.778207 0.480142 0.537778
1 2.095032 1.121238 1.076626
2 0.974032 0.595543 -0.628023
3 0.983545 -0.870126 1.100803
4 -1.854717 -2.151808 1.124028
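Putting the pieces together for the original goal (sum the selected columns into a new column, then delete them), a sketch on random data like the answer's, with two extra columns so something survives the drop:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 7), columns=list('abcdefg'))
columns_selected = ['A', 'B', 'C', 'D', 'E']
# The question's names are upper-case; the toy frame here uses lower-case
columns_selected = [c.lower() for c in columns_selected]

# New column holding the row-wise sum of the selected columns
df['sum_1'] = df[columns_selected].sum(axis=1)

# Drop the selected columns, keeping everything else plus the sum
df = df.drop(columns=columns_selected)
print(df.head())
```

This avoids writing out df['A']+df['B']+... by hand and scales to any length of columns_selected.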