Vlookup from the same pandas dataframe

I have a hierarchical dataset that looks like this:
emp_id  emp_name  emp_manager  emp_org_lvl
1       John S    Bob A        1
2       Bob A     Paul P       2
3       Paul P    Charles Y    3
What I want to do is extend this table to have the emp_name for each manager going up the org chart. E.g.
emp_id  emp_name  emp_manager  emp_org_lvl  lvl2_name  lvl3_name
1       John S    Bob A        1            Paul P     Charles Y
In Excel, I would do a vlookup in column lvl2_name to see who Bob A's manager is, e.g. something like vlookup(c2,B:C,2,False). Using pandas, the usual direction seems to be to use merge. The problem with this is that merge seems to require two separate dataframes, and you can't specify which column to return. Is there a better way than having a separate dataframe for each emp_org_lvl?
# Code to create table:
import pandas as pd

header = ['emp_id', 'emp_name', 'emp_manager', 'emp_org_lvl']
data = [[1, 'John S', 'Bob A', 1], [2, 'Bob A', 'Paul P', 2], [3, 'Paul P', 'Charles Y', 3]]
df = pd.DataFrame(data, columns=header)

You can try this:
# provide a lookup for employee to manager
manager_dict = dict(zip(df.emp_name, df.emp_manager))
# initialize the loop
levels_to_go_up = 3
employee_column_name = 'emp_manager'
# loop and keep adding columns to the dataframe
for i in range(2, levels_to_go_up + 1):
    new_col_name = f'lvl{i}_name'
    # create a new column by looking up employee_column_name's manager
    df[new_col_name] = df[employee_column_name].map(manager_dict)
    employee_column_name = new_col_name
>>> df
Out[67]:
emp_id emp_name emp_manager emp_org_lvl lvl2_name lvl3_name
0 1 John S Bob A 1 Paul P Charles Y
1 2 Bob A Paul P 2 Charles Y NaN
2 3 Paul P Charles Y 3 NaN NaN
Alternatively, if you wanted to retrieve ALL managers up the tree, you could use a recursive function and return the results as a list:
def retrieve_managers(name, manager_dict, manager_list=None):
    if manager_list is None:
        manager_list = []
    manager = manager_dict.get(name)
    if manager:
        manager_list.append(manager)
        return retrieve_managers(manager, manager_dict, manager_list)
    return manager_list
df['manager_list'] = df.emp_name.apply(lambda x: retrieve_managers(x, manager_dict))
>>> df
Out[71]:
emp_id emp_name emp_manager emp_org_lvl manager_list
0 1 John S Bob A 1 [Bob A, Paul P, Charles Y]
1 2 Bob A Paul P 2 [Paul P, Charles Y]
2 3 Paul P Charles Y 3 [Charles Y]
Finally, you can in fact self-join a dataframe while subselecting columns.
df = df.merge(df[['emp_name', 'emp_manager']], left_on='emp_manager',
              right_on='emp_name', suffixes=('', '_joined'), how='left')
>>> df
Out[82]:
emp_id emp_name emp_manager emp_org_lvl emp_name_joined emp_manager_joined
0 1 John S Bob A 1 Bob A Paul P
1 2 Bob A Paul P 2 Paul P Charles Y
2 3 Paul P Charles Y 3 NaN NaN
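If you prefer the merge route for several levels, the self-join can also be repeated in a loop, renaming the pulled-in manager column at each step. A minimal sketch (the lvl{i}_name columns and the _drop suffix are illustrative choices, not part of the question):

current = 'emp_manager'
for lvl in range(2, 4):  # climb two levels, mirroring lvl2_name / lvl3_name
    right = df[['emp_name', 'emp_manager']].rename(columns={'emp_manager': f'lvl{lvl}_name'})
    df = (df.merge(right, left_on=current, right_on='emp_name',
                   how='left', suffixes=('', '_drop'))
            .drop(columns='emp_name_drop'))  # discard the duplicate key column
    current = f'lvl{lvl}_name'

This yields the same lvl2_name and lvl3_name columns as the map-based loop above; the dict lookup stays simpler because each merge drags in a duplicate key column that has to be dropped.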

Related

how to concatenate text from multiple rows in dataframe based on a specific structure

I want to merge multiple rows of a dataframe based on a specific text structure.
For example, I have
df = pd.DataFrame([
    (1, 'john', 'merge'),
    (1, 'smith,', 'merge'),
    (1, 'robert', 'merge'),
    (1, 'g', 'merge'),
    (1, 'owens,', 'merge'),
    (2, 'sarah will', 'OK'),
    (2, 'ali kherad', 'OK'),
    (2, 'david', 'merge'),
    (2, 'lu,', 'merge'),
], columns=['ID', 'Name', 'Merge'])
which is
ID Name Merge
1 john merge
1 smith, merge
1 robert merge
1 g merge
1 owens, merge
2 sarah will OK
2 ali kherad OK
2 david merge
2 lu, merge
The goal is to have a dataframe that merges the text in rows like this:
ID Name
0 1 john smith
1 1 robert g owens
2 2 sarah will
3 2 ali kherad
4 2 david lu
I found a way to create the column 'Merge' to know whether I need to merge or not. Then I tried this:
df = pd.DataFrame(df[df['Merge']=='merge'].groupby(['ID','Merge'], axis=0)['Name'].apply(' '.join))
res = df.apply(lambda x: x.str.split(',').explode()).reset_index().drop(['Merge'], axis=1)
First I group the names where the column 'Merge' equals 'merge'. I know this is not the best way because it only considers that condition, but my dataframe should also keep the rows where 'Merge' equals 'OK'.
Then I split by ','.
The result is
ID Name
0 1 john smith
1 1 robert g owens
2 1
3 2 david lu
4 2
The other problem is that the order is not correct in my real example when I have more than 4000 rows. How can I keep the order and merge the text when necessary?
Make a grouper for grouping: a row whose Name ends with ',' (or whose Merge is 'OK') closes a full name.
cond1 = df['Name'].str.contains(r',$') | df['Merge'].eq('OK')
g = cond1[::-1].cumsum()
g (check the reversed index):
8 1
7 1
6 2
5 3
4 4
3 4
2 4
1 5
0 5
dtype: int32
Remove the trailing ',' and group by ID and g:
out = (df['Name'].str.replace(r',$', '', regex=True)
         .groupby([df['ID'], g], sort=False).agg(' '.join)
         .droplevel(1).reset_index())
out
ID Name
0 1 john smith
1 1 robert g owens
2 2 sarah will
3 2 ali kherad
4 2 david lu
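As a sanity check (not part of the original answer), the closing mask and the group ids can be printed side by side on the sample frame:

cond1 = df['Name'].str.contains(r',$') | df['Merge'].eq('OK')  # rows that end a name
g = cond1[::-1].cumsum()  # walk bottom-up so each closing row starts a new id
print(pd.concat([df['Name'], cond1.rename('closes'), g.rename('g')], axis=1))
# every run of rows sharing a g value is one output name,
# e.g. the two rows with g == 5 join to 'john smith'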

Check if substring is in a string in a different DF, if it is then return value from another row

I want to check if a substring from DF1 is in DF2. If it is, I want to return the value of the corresponding row.
DF1
Name    ID      Region
John    AAA     A
John    AAA     B
Pat     CCC     C
Sandra  CCC     D
Paul    DD      E
Sandra  R9D     F
Mia     dfg4    G
Kim     asfdh5  H
Louise  45gh    I
DF2
Name     ID         Company
John     AAAxx1     Microsoft
John     AAAxxREG1  Microsoft
Michael  BBBER4     Microsoft
Pat      CCCERG     Dell
Pat      CCCERGG    Dell
Paul     DFHDHF     Facebook
Desired Output
Where the ID from DF1 is in the ID column of DF2, I want to create a new column in DF1 that contains the matching company:
Name    ID      Region  Company
John    AAA     A       Microsoft
John    AAA     B       Microsoft
Pat     CCC     C       Dell
Sandra  CCC     D
Paul    DD      E
Sandra  R9D     F
Mia     dfg4    G
Kim     asfdh5  H
Louise  45gh    I
I have the below code that determines if the ID from DF1 is in DF2; however, I'm not sure how I can bring in the company name.
DF1['Get company'] = np.in1d(DF1['ID'], DF2['ID'])
Try to extract the ID string from df1 within df2['ID'], then merge on this column:
key = df2['ID'].str.extract(fr"({'|'.join(df1['ID'].values)})", expand=False)
df1 = df1.merge(df2['Company'], left_on='ID', right_on=key, how='left').fillna('')
print(df1)
# Output:
Name ID Company
0 John AAA
1 Peter BAB Microsoft
2 Paul CCHF Google
3 Rosie R9D
Details: create a regex pattern from df1['ID'] to extract the partial string from df2['ID']:
# Regex pattern: try to extract the following pattern
>>> fr"({'|'.join(df1['ID'].values)})"
'(AAA|BAB|CCHF|R9D)'
# After extraction
>>> pd.concat([df2['ID'], key], axis=1)
ID ID
0 AEDSV NaN # Nothing was found
1 123BAB BAB # Found partial string BAB
2 CCHF-RB CCHF # Found partial string CCHF
3 YYYY NaN # Nothing was found
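One caveat: if any value in df1['ID'] could contain regex metacharacters (., +, parentheses, etc.), the raw alternation above may misfire. Escaping each ID with re.escape is safer; a small sketch:

import re

# escape each ID so any metacharacters are matched literally
pattern = fr"({'|'.join(map(re.escape, df1['ID'].astype(str)))})"
key = df2['ID'].str.extract(pattern, expand=False)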
Update:
To solve this, I wonder: is it possible to merge based on 2 columns, e.g. merge on Name and ID?
key = df2['ID'].str.extract(fr"({'|'.join(df1['ID'].values)})", expand=False)
df1 = pd.merge(df1, df2[['Name', 'Company']], left_on=['Name', 'ID'],
               right_on=['Name', key], how='left').drop_duplicates().fillna('')
print(df1)
# Output:
Name ID Region Company
0 John AAA A Microsoft
2 John AAA B Microsoft
4 Pat CCC C Dell
6 Sandra CCC D
7 Paul DD E
8 Sandra R9D F
9 Mia dfg4 G
10 Kim asfdh5 H
11 Louise 45gh I

How to return difference between two column values in pandas

I have 1 dataframe and want to check and then return the difference in values between two columns of the same dataframe, but only if there is a value in the 2nd column. In my example below the 2nd column is AppliancesO and the first column is AppliancesH:
Item  Name  AppliancesH            AppliancesO
1     Joe   TV                     TV
2     Mary  [TV; Fridge]           TV
3     Jack  [Microwave;TV;Fridge]  [Computer;TV;Fridge]
4     Pete  [Fridge;Oven]
and 1000 more rows as such
The output I am looking for is:
Item  Name  AppliancesH            AppliancesO           Diff
1     Joe   TV                     TV
2     Mary  [TV; Fridge]           TV                    Fridge
3     Jack  [Microwave;TV;Fridge]  [Computer;TV;Fridge]  [Microwave;Computer]
4     Pete  [Fridge;Oven]
I know how to compare the columns to determine if they are different, but I don't know how to return the difference:
df.loc[(df['AppliancesH']!=df['AppliancesO'])& ~df.AppliancesO.isna()][['Name','AppliancesH', 'AppliancesO','Diff']]
Assuming the following data
>>> dict_ = {'AppliancesH': {1: ['TV'], 2: ['TV', 'Fridge'], 3: ['Microwave', 'TV', 'Fridge'], 4: ['Fridge', 'Oven']}, 'AppliancesO': {1: ['TV'], 2: ['TV'], 3: ['Computer', 'TV', 'Fridge'], 4: []}, 'Name': {1: 'Joe', 2: 'Mary', 3: 'Jack', 4: 'Pete'}}
>>> df = pd.DataFrame(dict_)
>>> df
AppliancesH AppliancesO Name
1 [TV] [TV] Joe
2 [TV, Fridge] [TV] Mary
3 [Microwave, TV, Fridge] [Computer, TV, Fridge] Jack
4 [Fridge, Oven] [] Pete
You can use set.symmetric_difference to perform such an operation. Let's first define the callable we need:
def symdif(s: pd.Series) -> list:
    h = s.AppliancesH
    o = s.AppliancesO
    return h and o and sorted(set(h).symmetric_difference(o))
and use it via pandas.DataFrame.apply
>>> df['Diff'] = df.apply(axis=1, func=symdif)
>>> df
AppliancesH AppliancesO Name Diff
1 [TV] [TV] Joe []
2 [TV, Fridge] [TV] Mary [Fridge]
3 [Microwave, TV, Fridge] [Computer, TV, Fridge] Jack [Computer, Microwave]
4 [Fridge, Oven] [] Pete []
Here is another way:
df['Differences'] = (df.set_index('Name')
                       .applymap(set)
                       .apply(lambda x: set.symmetric_difference(*x), axis=1)
                       .map(list)
                       .to_numpy())  # assign positionally; set_index changed the row labels
This can also be done with the XOR operator (^), which computes the symmetric difference for sets:
def find_diff(row):
    if row.isna().any():
        return []
    diff = set(row['AppliancesH']) ^ set(row['AppliancesO'])
    return list(diff)
df.apply(find_diff, axis=1)
You might also need to write a function that converts those strings to a list.
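For instance, if the raw cells are strings like '[Microwave;TV;Fridge]' rather than lists, a small parser could normalize them first (a sketch; the bracket-and-semicolon format is inferred from the question's sample):

def to_list(cell):
    # NaN or empty cell -> empty list
    if not isinstance(cell, str) or not cell.strip():
        return []
    # strip optional surrounding brackets, then split on ';'
    return [part.strip() for part in cell.strip('[]').split(';') if part.strip()]

for col in ['AppliancesH', 'AppliancesO']:
    df[col] = df[col].apply(to_list)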

pandas - get count of duplicate rows (matching across multiple columns)

I have a table like below - unique IDs and names. I want to return any duplicated names (based on matching First and Last).
Id First Last
1 Dave Davis
2 Dave Smith
3 Bob Smith
4 Dave Smith
I've managed to return a count of duplicates across all columns if I don't have an ID column, i.e.
import pandas as pd
dict2 = {'First': pd.Series(["Dave", "Dave", "Bob", "Dave"]),
         'Last': pd.Series(["Davis", "Smith", "Smith", "Smith"])}
df2 = pd.DataFrame(dict2)
print(df2.groupby(df2.columns.tolist()).size().reset_index()
         .rename(columns={0: 'records'}))
Output:
First Last records
0 Bob Smith 1
1 Dave Davis 1
2 Dave Smith 2
I want to be able to return the duplicates (of first and last) when I also have an ID column, i.e.
import pandas as pd
dict1 = {'Id': pd.Series([1, 2, 3, 4]),
         'First': pd.Series(["Dave", "Dave", "Bob", "Dave"]),
         'Last': pd.Series(["Davis", "Smith", "Smith", "Smith"])}
df1 = pd.DataFrame(dict1)
print(df1.groupby(df1.columns.tolist()).size().reset_index()
         .rename(columns={0: 'records'}))
gives:
Id First Last records
0 1 Dave Davis 1
1 2 Dave Smith 1
2 3 Bob Smith 1
3 4 Dave Smith 1
I want (ideally):
First Last records Ids
0 Dave Smith 2 2, 4
First filter only the duplicated rows with DataFrame.duplicated, passing the columns to check and keep=False to return all duplicates, and select them with boolean indexing. Then aggregate with GroupBy.agg: count with GroupBy.size and join the ids after converting them to strings:
tup = [('records','size'), ('Ids',lambda x: ','.join(x.astype(str)))]
df2 = (df1[df1.duplicated(['First','Last'], keep=False)]
         .groupby(['First','Last'])['Id'].agg(tup)
         .reset_index())
print (df2)
First Last records Ids
0 Dave Smith 2 2,4
Another idea is to aggregate all values and then filter with DataFrame.query:
tup = [('records','size'), ('Ids',lambda x: ','.join(x.astype(str)))]
df2 = (df1.groupby(['First','Last'])['Id'].agg(tup)
          .reset_index()
          .query('records != 1'))
print (df2)
First Last records Ids
2 Dave Smith 2 2,4
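On pandas 0.25 or newer, the list-of-tuples renaming can also be written with named aggregation, which some find more readable; a sketch of the same first idea:

df2 = (df1[df1.duplicated(['First','Last'], keep=False)]
         .groupby(['First','Last'])
         .agg(records=('Id', 'size'),  # count of rows per name
              Ids=('Id', lambda x: ','.join(x.astype(str))))  # joined id list
         .reset_index())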

python pandas - set column value of column based on index and or ID of concatenated dataframes

I have a dataframe concatenated from at least two other dataframes:
i.e.
df1
Name | Type | ID
0 Joe A 1
1 Fred B 2
2 Mike Both 3
3 Frank Both 4
df2
Name | Type | ID
0 Bill Both 1
1 Jill Both 2
2 Mill B 3
3 Hill A 4
ConcatDf:
Name | Type | ID
0 Joe A 1
1 Fred B 2
2 Mike Both 3
3 Frank Both 4
0 Bill Both 1
1 Jill Both 2
2 Mill B 3
3 Hill A 4
Suppose after they are concatenated, I'd like to set Type for all records from df1 to C and all records from df2 to B. Is this possible?
The indices of the dataframes can be vastly different sizes.
Thanks in advance.
df3 = pd.concat([df1, df2], keys=(1, 2))
df3.loc[1, 'Type'] = 'C'
When you concat you can assign keys to the dataframes. This creates a MultiIndex whose first level separates the concatenated dataframes. You can then use .loc with a key to select a whole group. In the code above we change all the Types of df1 (which has a key of 1) to C.
Use merge with indicator=True to find out whether rows belong to df1 or df2. Next, use np.where to assign C or B.
import numpy as np

t = concatdf.merge(df1, how='left', on=concatdf.columns.tolist(), indicator=True)
concatdf['Type'] = np.where(t._merge.eq('left_only'), 'B', 'C')
Out[2185]:
Name Type ID
0 Joe C 1
1 Fred C 2
2 Mike C 3
3 Frank C 4
0 Bill B 1
1 Jill B 2
2 Mill B 3
3 Hill B 4
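If you control the concatenation step yourself, a simpler route (a sketch, not from the answers above) is to tag each frame before concatenating, so no lookup is needed afterwards:

# assign returns a copy with Type overwritten, so df1 and df2 stay untouched
concat_df = pd.concat([df1.assign(Type='C'), df2.assign(Type='B')])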