Get rid of NaN values and re-order columns in pandas

I have made an ad hoc example that you can run, to show a dataframe similar to the df3 I have to work with:
import pandas as pd
import numpy as np

people1 = [['Alex', 10], ['Bob', 12], ['Clarke', 13], [np.nan], [np.nan], [np.nan]]
people2 = [[np.nan], [np.nan], [np.nan], ['Mark', 20], ['Jane', 22], ['Jack', 23]]
df1 = pd.DataFrame(people1, columns=['Name', 'Age'])
df2 = pd.DataFrame(people2, columns=['Name', 'Age'])
people_list = [df1, df2]
df3 = pd.concat((people_list[0]['Name'], people_list[1]['Name']), axis=1)
df3
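For reference, df3 then displays as follows (with np.nan as the missing-value placeholder):
     Name   Name
0    Alex    NaN
1     Bob    NaN
2  Clarke    NaN
3     NaN   Mark
4     NaN   Jane
5     NaN   Jack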
How would I modify df3 to get rid of the NaN values and put the two columns next to each other? (I don't care about keeping the ids; I just want a clean dataframe with the two columns side by side.)

You can drop the NaN values first:
df3 = pd.concat([df1.dropna(), df2.dropna()])
Output:
Name Age
0 Alex 10.0
1 Bob 12.0
2 Clarke 13.0
3 Mark 20.0
4 Jane 22.0
5 Jack 23.0
Or, if you want to concat side by side:
df3 = pd.concat([df1.dropna().reset_index(drop=True), df2.dropna().reset_index(drop=True)], axis=1)
Output:
Name Age Name Age
0 Alex 10.0 Mark 20.0
1 Bob 12.0 Jane 22.0
2 Clarke 13.0 Jack 23.0
If you just want to concat the Name columns side by side:
df3 = pd.concat([df1.dropna().reset_index(drop=True)['Name'], df2.dropna().reset_index(drop=True)['Name']], axis=1)
Output:
Name Name
0 Alex Mark
1 Bob Jane
2 Clarke Jack
If you want to modify only df3, it can be done via iloc and dropna:
df3 = pd.concat([df3.iloc[:, 0].dropna().reset_index(drop=True), df3.iloc[:, 1].dropna().reset_index(drop=True)], axis=1)
Output:
Name Name
0 Alex Mark
1 Bob Jane
2 Clarke Jack
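A note on why reset_index(drop=True) shows up in all of these: pd.concat(..., axis=1) aligns rows by index label, not by position, so two columns whose surviving labels don't overlap get padded with NaN. A minimal stand-alone sketch of this behaviour (the series here are stand-ins, not the question's data):
import pandas as pd

s1 = pd.Series(['Alex', 'Bob'], index=[0, 1])
s2 = pd.Series(['Mark', 'Jane'], index=[3, 4])

# Without resetting, alignment is on the labels 0, 1, 3, 4,
# so each column is NaN wherever the other has values.
print(pd.concat([s1, s2], axis=1))

# After reset_index(drop=True) both series use labels 0 and 1,
# so the rows line up side by side.
print(pd.concat([s1.reset_index(drop=True),
                 s2.reset_index(drop=True)], axis=1))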

people1 = [['Alex', 10], ['Bob', 12], ['Clarke', 13], [np.nan], [np.nan], [np.nan]]
people2 = [[np.nan], [np.nan], [np.nan], ['Mark', 20], ['Jane', 22], ['Jack', 23]]
df1 = pd.DataFrame(people1, columns=['Name', 'Age']).dropna()
df2 = pd.DataFrame(people2, columns=['Name', 'Age']).dropna()
df1.reset_index(drop=True, inplace=True)
df2.reset_index(drop=True, inplace=True)
people_list=[df1, df2]
df3 = pd.concat((people_list[0]['Name'], people_list[1]['Name']), axis=1)
print(df3)
This will help you concatenate the two dataframes.

If I have understood correctly what you mean, this is a possible solution:
people1 = [['Alex', 10], ['Bob', 12], ['Clarke', 13], [np.nan], [np.nan], [np.nan]]
people2 = [[np.nan], [np.nan], [np.nan], ['Mark', 20], ['Jane', 22], ['Jack', 23]]
df1 = pd.DataFrame(people1, columns=['Name1', 'Age']).dropna()
df2 = pd.DataFrame(people2, columns=['Name2', 'Age']).dropna().reset_index(drop=True)
people_list=[df1, df2]
df3 = pd.concat((people_list[0]['Name1'], people_list[1]['Name2']), axis=1)
print(df3)
Name1 Name2
0 Alex Mark
1 Bob Jane
2 Clarke Jack
If you already have that dataframe (with the NaNs still in it):
count = df3.Name2.isna().sum()  # number of leading NaNs in Name2
df3.loc[:, 'Name2'] = df3.Name2.shift(-count)  # shift the values up past them
df3 = df3.dropna()
print(df3)
Name1 Name2
0 Alex Mark
1 Bob Jane
2 Clarke Jack

Related

Pandas: compare DataFrames to search for non-exact matches [duplicate]

I want to merge the rows of the two dataframes below whenever a string in the Test1 column of DF2 contains a substring from the Test1 column of DF1.
DF1 = pd.DataFrame({'Test1':list('ABC'),
                    'Test2':[1,2,3]})
print (DF1)
Test1 Test2
0 A 1
1 B 2
2 C 3
DF2 = pd.DataFrame({'Test1':['ee','bA','cCc','D'],
                    'Test2':[1,2,3,4]})
print (DF2)
Test1 Test2
0 ee 1
1 bA 2
2 cCc 3
3 D 4
For that, with str.contains I am able to identify the substrings of DF1.Test1 that appear in the strings of DF2.Test1:
INPUT:
for i in DF1.Test1:
    ok = DF2[DF2.Test1.str.contains(i)]
    print(ok)
OUTPUT:
Now, I would like the output to also include the merged rows where these substrings of DF1.Test1 match the strings in DF2.Test1:
OUTPUT:
For that, I tried with pd.merge and if, but I am not able to find the right code yet.
Do you have suggestions please?
for i in DF1.Test1:
    if DF2.Test1.str.contains(i) == 'True':
        ok = pd.merge(DF1, DF2, on= ['Test1'[i]], how='outer')
        print(ok)
Thank you for your ideas :)
I could not respond to jezrael's comment because of my reputation, but I changed his answer into a function that merges on lower-cased text.
def str_merge(part_string_df, full_string_df, merge_column):
    merge_column_lower = 'merge_column_lower'
    part_string_df[merge_column_lower] = part_string_df[merge_column].str.lower()
    full_string_df[merge_column_lower] = full_string_df[merge_column].str.lower()
    pat = '|'.join(r"{}".format(x) for x in part_string_df[merge_column_lower])
    full_string_df['Test3'] = full_string_df[merge_column_lower].str.extract('(' + pat + ')', expand=True)
    DF = pd.merge(part_string_df, full_string_df, left_on=merge_column_lower, right_on='Test3').drop([merge_column_lower + '_x', merge_column_lower + '_y', 'Test3'], axis=1)
    return DF
Used with example:
DF1 = pd.DataFrame({'Test1':list('ABC'),
                    'Test2':[1,2,3]})
DF2 = pd.DataFrame({'Test1':['ee','bA','cCc','D'],
                    'Test2':[1,2,3,4]})
print(str_merge(DF1, DF2, 'Test1'))
Test1_x Test2_x Test1_y Test2_y
0 B 2 bA 2
1 C 3 cCc 3
I believe you need to extract the values to a new column and then merge; last, remove the helper column Test3:
pat = '|'.join(r"{}".format(x) for x in DF1.Test1)
DF2['Test3'] = DF2.Test1.str.extract('(' + pat + ')', expand=False)
DF = pd.merge(DF1, DF2, left_on='Test1', right_on='Test3').drop('Test3', axis=1)
print (DF)
Test1_x Test2_x Test1_y Test2_y
0 A 1 bA 2
1 C 3 cCc 3
Detail:
print (DF2)
Test1 Test2 Test3
0 ee 1 NaN
1 bA 2 A
2 cCc 3 C
3 D 4 NaN
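One caveat worth adding (not part of the original answers): the joined pattern is used as a regular expression, so if the lookup values could ever contain regex metacharacters, escaping them first is safer:
import re

# re.escape neutralizes any regex metacharacters in the lookup values
pat = '|'.join(re.escape(x) for x in DF1.Test1)
DF2['Test3'] = DF2.Test1.str.extract('(' + pat + ')', expand=False)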

Vlookup from the same pandas dataframe

I have a hierarchical dataset that looks like this:
emp_id  emp_name  emp_manager  emp_org_lvl
1       John S    Bob A        1
2       Bob A     Paul P       2
3       Paul P    Charles Y    3
What I want to do is extend this table to have the emp_name for each manager going up the org chart. E.g.
emp_id  emp_name  emp_manager  emp_org_lvl  lvl2_name  lvl3_name
1       John S    Bob A        1            Paul P     Charles Y
In Excel, I would do a vlookup in column lvl2_name to see who Bob A's manager is, e.g. something like vlookup(C2,B:C,2,FALSE). Using pandas, the usual direction seems to be merge. The problem with this is that merge seems to require two separate dataframes, and you can't specify which column to return. Is there a better way than keeping a separate dataframe for each emp_org_lvl?
# Code to create table:
header = ['emp_id','emp_name','emp_manager','emp_org_lvl']
data = [[ 1,'John S' ,'Bob A', 1],[2, 'Bob A', 'Paul P', 2],[3, 'Paul P', 'Charles Y', 3]]
df = pd.DataFrame(data, columns=header)
You can try this:
# provide a lookup for employee to manager
manager_dict = dict(zip(df.emp_name, df.emp_manager))
# initialize the loop
levels_to_go_up = 3
employee_column_name = 'emp_manager'
# loop and keep adding columns to the dataframe
for i in range(2, levels_to_go_up + 1):
    new_col_name = f'lvl{i}_name'
    # create a new column by looking up employee_column_name's manager
    df[new_col_name] = df[employee_column_name].map(manager_dict)
    employee_column_name = new_col_name
>>>df
Out[67]:
emp_id emp_name emp_manager emp_org_lvl lvl2_name lvl3_name
0 1 John S Bob A 1 Paul P Charles Y
1 2 Bob A Paul P 2 Charles Y NaN
2 3 Paul P Charles Y 3 NaN NaN
Alternatively if you wanted to retrieve ALL managers in the tree, you could use a recursive function, and return the results as a list:
def retrieve_managers(name, manager_dict, manager_list=None):
    if not manager_list:
        manager_list = []
    manager = manager_dict.get(name)
    if manager:
        manager_list.append(manager)
        return retrieve_managers(manager, manager_dict, manager_list)
    return manager_list
df['manager_list'] = df.emp_name.apply(lambda x: retrieve_managers(x, manager_dict))
>>> df
Out[71]:
emp_id emp_name emp_manager emp_org_lvl manager_list
0 1 John S Bob A 1 [Bob A, Paul P, Charles Y]
1 2 Bob A Paul P 2 [Paul P, Charles Y]
2 3 Paul P Charles Y 3 [Charles Y]
Finally, you can in fact self-join a dataframe while subselecting columns.
df = df.merge(df[['emp_name', 'emp_manager']], left_on='emp_manager', right_on='emp_name', suffixes=("", "_joined"), how='left')
>>> df
Out[82]:
emp_id emp_name emp_manager emp_org_lvl emp_name_joined emp_manager_joined
0 1 John S Bob A 1 Bob A Paul P
1 2 Bob A Paul P 2 Paul P Charles Y
2 3 Paul P Charles Y 3 NaN NaN
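To push the self-join route all the way to the question's lvl2_name/lvl3_name layout, here is one possible sketch (the loop and the column renaming are my own assumption, not part of the answer above):
import pandas as pd

header = ['emp_id', 'emp_name', 'emp_manager', 'emp_org_lvl']
data = [[1, 'John S', 'Bob A', 1],
        [2, 'Bob A', 'Paul P', 2],
        [3, 'Paul P', 'Charles Y', 3]]
df = pd.DataFrame(data, columns=header)

lookup = df[['emp_name', 'emp_manager']]
prev = 'emp_manager'
for lvl in (2, 3):
    col = f'lvl{lvl}_name'
    # pull in the manager of whoever sits in `prev`, then rename the new column
    df = df.merge(lookup, left_on=prev, right_on='emp_name',
                  how='left', suffixes=('', '_m'))
    df = df.drop(columns='emp_name_m').rename(columns={'emp_manager_m': col})
    prev = col
print(df)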

How to append/concat 2 Pandas DataFrames with different columns

How to concat/append based on common column values?
I'm creating some dfs from some files, and I want to compile them.
The columns don't always match, but there will always be some common columns. (I only know a few columns that are guaranteed to match, but there are a lot of columns, and I'd like to retain as much info as possible.)
df1:
Name  Status
John  1
Jane  2
df2:
Extra1  Extra2  Name   Status
a       b       Bob    2
c       d       Nancy  2
Desired output, either this (the order doesn't matter):
Extra1  Extra2  Name   Status
a       b       Bob    2
c       d       Nancy  2
NULL    NULL    John   1
NULL    NULL    Jane   2
Or this (the order doesn't matter):
Name   Status
John   1
Jane   2
Bob    2
Nancy  2
I've tried these, but they don't give the result I want:
df = pd.concat([df2, df], axis=0, ignore_index=True)
df = df.set_index('Name').combine_first(df2.set_index('Name')).reset_index()
Thanks
import pandas as pd
df1 = pd.DataFrame({'Name':['John', 'Jane'],'Status':[1,2]})
df2 = pd.DataFrame({'Extra1':['a','b'],'Extra2':['c','d'],'Name':['bob', 'nancy'],'Status':[2,2]})
df = pd.concat([df1,df2], axis=0, ignore_index=True)
Gives me
    Name  Status Extra1 Extra2
0   John       1    NaN    NaN
1   Jane       2    NaN    NaN
2    bob       2      a      c
3  nancy       2      b      d
Which looks to me like your desired output.
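For your second desired output (keeping only the common columns), a minimal sketch using concat's join='inner' option, with the same df1/df2 as above:
# join='inner' keeps only the columns shared by every frame
df = pd.concat([df1, df2], join='inner', ignore_index=True)
Gives me
    Name  Status
0   John       1
1   Jane       2
2    bob       2
3  nancy       2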

Merge 2 rows in Pandas based on a given condition in a column

How can I merge two cells in a Pandas dataframe when one of the cells in the other column is empty?
import pandas as pd

lst = [['tom', 'reacher', 25], ['krish', 'pete', 30],
       ['', '', 26], ['juli', 'williams', 22]]
df = pd.DataFrame(lst, columns=['FName', 'LName', 'Age'], dtype=float)
In [4]:df
Out[4]:
FName LName Age
0 tom reacher 25.0
1 krish pete 30.0
2 26.0
3 juli williams 22.0
The output which I want is:
In [6]:df
Out[6]:
FName LName Age
0 tom reacher 25
1 krish pete 30,26
2 juli williams 22
First find the empty cells in column col1, then merge them with the other column col2 and replace:
idx = df[df[col1] == ""].index  # assuming "empty" means the empty string ""
df.loc[idx, col1] = df.loc[idx, col2] + df.loc[idx, col1]
If there are always empty strings in both columns together, it is possible to replace them with missing values (NaN) and forward fill them, so it is then possible to aggregate with join:
import numpy as np

df[['FName','LName']] = df[['FName','LName']].replace('', np.nan).ffill()
print (df[['FName','LName']])
FName LName
0 tom reacher
1 krish pete
2 krish pete
3 juli williams
df['Age'] = df['Age'].astype(int).astype(str)
df = df.groupby(['FName','LName'])['Age'].apply(','.join).reset_index()
print (df)
FName LName Age
0 juli williams 22
1 krish pete 30,26
2 tom reacher 25
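Note that groupby sorts the group keys by default, which is why the rows come out in a different order than the question's expected output; passing sort=False keeps first-appearance order (a variant of the groupby line above, run on the frame before grouping):
df = df.groupby(['FName','LName'], sort=False)['Age'].apply(','.join).reset_index()
print (df)
   FName     LName    Age
0    tom   reacher     25
1  krish      pete  30,26
2   juli  williams     22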

How do I update the value of a column based on another dataframe

How do I update a column in pandas based on a condition from another dataframe?
I have 2 dataframe df1 and df2
import pandas as pd
df1 = pd.DataFrame({'names':['andi','andrew','jhon','andreas'],
                    'salary':[1000,2000,2300,1500]})
df2 = pd.DataFrame({'names':['andi','andrew'],
                    'raise':[1500,2500]})
expected output
names salary
andi 1500
andrew 2500
jhon 2300
andreas 1500
Use Series.combine_first with DataFrame.set_index:
df = (df2.set_index('names')['raise']
         .combine_first(df1.set_index('names')['salary'])
         .reset_index())
print (df)
names raise
0 andi 1500.0
1 andreas 1500.0
2 andrew 2500.0
3 jhon 2300.0
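If you want to match the question's expected output exactly, a small follow-up sketch (same df1/df2) restores the column name and the original row order:
df = (df2.set_index('names')['raise']
         .combine_first(df1.set_index('names')['salary'])
         .reindex(df1['names'])   # restore df1's row order
         .rename('salary')        # restore the original column name
         .reset_index())
print (df)
     names  salary
0     andi  1500.0
1   andrew  2500.0
2     jhon  2300.0
3  andreas  1500.0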
Using merge & update, similar to SQL:
df3 = pd.merge(df1, df2, how = 'left', left_on ='names', right_on = 'names')
df3.loc[df3['raise'].notnull(),'salary'] = df3['raise']
df3
names salary raise
0 andi 1500.0 1500.0
1 andrew 2500.0 2500.0
2 jhon 2300.0 NaN
3 andreas 1500.0 NaN
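To finish, the helper column can then be dropped and the integer dtype restored (a small follow-up, not part of the original answer):
df3 = df3.drop(columns='raise')
df3['salary'] = df3['salary'].astype(int)
print (df3)
     names  salary
0     andi    1500
1   andrew    2500
2     jhon    2300
3  andreas    1500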