Find unique values of groupby/transform without None - pandas

The starting point is this kind of dataframe.
df = pd.DataFrame({'author': ['Jack', 'Steve', 'Greg', 'Jack', 'Steve', 'Greg', 'Greg'],
                   'country': ['USA', None, None, 'USA', 'Germany', 'France', 'France'],
                   'c': np.random.randn(7),
                   'd': np.random.randn(7)})
author country c d
0 Jack USA -2.594532 2.027425
1 Steve None -1.104079 -0.852182
2 Greg None -2.356956 -0.450821
3 Jack USA -0.910153 -0.734682
4 Steve Germany 1.025113 0.441512
5 Greg France 0.218085 1.369443
6 Greg France 0.254485 0.322768
The desired output is one column (or multiple columns) with the countries of each author.
0 [USA]
1 [Germany]
2 [France]
3 [USA]
4 [Germany]
5 [France]
6 [France]
It does not have to be a list, but my closest solution so far gives a list as output.
It could also be separate columns.
df.groupby('author')['country'].transform('unique')
0 [USA]
1 [None, Germany]
2 [None, France]
3 [USA]
4 [None, Germany]
5 [None, France]
6 [None, France]
Is there an easy way of removing the None values from this?

You can remove missing values with Series.dropna, call SeriesGroupBy.unique and create the new column with Series.map:
df['new'] = df['author'].map(df['country'].dropna().groupby(df['author']).unique())
print (df)
author country c d new
0 Jack USA 0.453358 -1.983282 [USA]
1 Steve None 0.011792 0.383322 [Germany]
2 Greg None -1.551810 0.308982 [France]
3 Jack USA 1.646301 0.040245 [USA]
4 Steve Germany -0.211451 0.841131 [Germany]
5 Greg France 1.049269 -0.813806 [France]
6 Greg France -1.244549 1.009006 [France]
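Since the question notes the output does not have to be a list, here is a minimal alternative sketch, assuming a joined string per row is acceptable: a lambda inside transform drops the missing values before collecting the uniques.
# Sketch: same grouping, but the Nones are dropped inside the transform
# and the uniques are joined into a plain string instead of a list.
df['new'] = (df.groupby('author')['country']
               .transform(lambda s: ', '.join(s.dropna().unique())))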

Related

how to concatenate text from multiple rows in dataframe based on a specific structure

I want to merge multiple rows of a dataframe that follow a specific text structure.
For example, I have
df = pd.DataFrame([
    (1, 'john', 'merge'),
    (1, 'smith,', 'merge'),
    (1, 'robert', 'merge'),
    (1, 'g', 'merge'),
    (1, 'owens,', 'merge'),
    (2, 'sarah will', 'OK'),
    (2, 'ali kherad', 'OK'),
    (2, 'david', 'merge'),
    (2, 'lu,', 'merge'),
], columns=['ID', 'Name', 'Merge'])
which is
ID Name Merge
1 john merge
1 smith, merge
1 robert merge
1 g merge
1 owens, merge
2 sarah will OK
2 ali kherad OK
2 david merge
2 lu, merge
The goal is to have a dataframe that merges the text in rows like this:
ID Name
0 1 john smith
1 1 robert g owens
2 2 sarah will
3 2 ali kherad
4 2 david lu
I found a way to create the column 'Merge' to know whether I need to merge or not. Then I tried this:
df = pd.DataFrame(df[df['Merge']=='merge'].groupby(['ID','Merge'], axis=0)['Name'].apply(' '.join))
res = df.apply(lambda x: x.str.split(',').explode()).reset_index().drop(['Merge'], axis=1)
First I group the names where the column 'Merge' equals 'merge'. I know this is not the best way, because it only considers that condition; my dataframe also has rows where the column 'Merge' equals 'OK'.
Then I split by ','.
The result is
ID Name
0 1 john smith
1 1 robert g owens
2 1
3 2 david lu
4 2
The other problem is that the order is not correct in my real example when I have more than 4000 rows. How can I keep the order and merge the text when necessary?
Make a grouper for grouping:
cond1 = df['Name'].str.contains(r',$') | df['Merge'].eq('OK')
g = cond1[::-1].cumsum()
g (check the reversed index):
8 1
7 1
6 2
5 3
4 4
3 4
2 4
1 5
0 5
dtype: int32
Remove the trailing , and group by ID and g (groupby aligns on the index, so the reversed order of g does not matter):
out = (df['Name'].str.replace(r',$', '', regex=True)
         .groupby([df['ID'], g], sort=False).agg(' '.join)
         .droplevel(1).reset_index())
out
ID Name
0 1 john smith
1 1 robert g owens
2 2 sarah will
3 2 ali kherad
4 2 david lu
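For readers puzzled by cond1[::-1].cumsum(), here is a minimal sketch on a toy marker series (the names are illustrative, not from the question): a True marks the last row of a group, so the cumulative sum over the reversed series hands every row its group id, and since groupby aligns on the index, the reversed order of g is harmless.
import pandas as pd

# True marks the LAST row of a group (e.g. a Name ending in ',')
markers = pd.Series([False, True, False, False, True])
g = markers[::-1].cumsum()
print(g.sort_index().tolist())  # [2, 2, 1, 1, 1] -> rows 0-1 form one group, rows 2-4 another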

Vlookup from the same pandas dataframe

I have a hierarchical dataset that looks like this:
emp_id  emp_name  emp_manager  emp_org_lvl
1       John S    Bob A        1
2       Bob A     Paul P       2
3       Paul P    Charles Y    3
What I want to do is extend this table to have the emp_name for each manager going up the org chart. E.g.
emp_id  emp_name  emp_manager  emp_org_lvl  lvl2_name  lvl3_name
1       John S    Bob A        1            Paul P     Charles Y
In Excel, I would do a VLOOKUP in column lvl2_name to see who Bob A's manager is, e.g. something like vlookup(c2,B:C,2,False). Using pandas, the usual direction seems to be merge. The problem with this is that merge seems to require two separate dataframes, and you can't specify which column to return. Is there a better way than having a separate dataframe for each emp_org_lvl?
# Code to create table:
header = ['emp_id','emp_name','emp_manager','emp_org_lvl']
data = [[ 1,'John S' ,'Bob A', 1],[2, 'Bob A', 'Paul P', 2],[3, 'Paul P', 'Charles Y', 3]]
df = pd.DataFrame(data, columns=header)
You can try this:
# provide a lookup for employee to manager
manager_dict = dict(zip(df.emp_name, df.emp_manager))
# initialize the loop
levels_to_go_up = 3
employee_column_name = 'emp_manager'
# loop and keep adding columns to the dataframe
for i in range(2, levels_to_go_up + 1):
    new_col_name = f'lvl{i}_name'
    # create a new column by looking up employee_column_name's manager
    df[new_col_name] = df[employee_column_name].map(manager_dict)
    employee_column_name = new_col_name
>>> df
Out[67]:
emp_id emp_name emp_manager emp_org_lvl lvl2_name lvl3_name
0 1 John S Bob A 1 Paul P Charles Y
1 2 Bob A Paul P 2 Charles Y NaN
2 3 Paul P Charles Y 3 NaN NaN
Alternatively if you wanted to retrieve ALL managers in the tree, you could use a recursive function, and return the results as a list:
def retrieve_managers(name, manager_dict, manager_list=None):
    if not manager_list:
        manager_list = []
    manager = manager_dict.get(name)
    if manager:
        manager_list.append(manager)
        return retrieve_managers(manager, manager_dict, manager_list)
    return manager_list
df['manager_list'] = df.emp_name.apply(lambda x: retrieve_managers(x, manager_dict))
>>> df
Out[71]:
emp_id emp_name emp_manager emp_org_lvl manager_list
0 1 John S Bob A 1 [Bob A, Paul P, Charles Y]
1 2 Bob A Paul P 2 [Paul P, Charles Y]
2 3 Paul P Charles Y 3 [Charles Y]
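A quick standalone check of the recursive helper, with the same manager relationships written out as a literal dict:
managers = {'John S': 'Bob A', 'Bob A': 'Paul P', 'Paul P': 'Charles Y'}
print(retrieve_managers('John S', managers))  # ['Bob A', 'Paul P', 'Charles Y']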
Finally, you can in fact self-join a dataframe while subselecting columns.
df = df.merge(df[['emp_name', 'emp_manager']], left_on='emp_manager',
              right_on='emp_name', suffixes=("", "_joined"), how='left')
>>> df
Out[82]:
emp_id emp_name emp_manager emp_org_lvl emp_name_joined emp_manager_joined
0 1 John S Bob A 1 Bob A Paul P
1 2 Bob A Paul P 2 Paul P Charles Y
2 3 Paul P Charles Y 3 NaN NaN
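Building on that self-join, here is a hedged sketch of how the merge could be repeated in a loop to produce one lvlN_name column per level, starting again from the original df (the loop bound of 3 levels and the _lN suffixes are assumptions for this sample data):
lookup = df[['emp_name', 'emp_manager']]
out = df.copy()
key = 'emp_manager'
for lvl in range(2, 4):
    # each pass pulls in the next manager up the org chart
    out = out.merge(lookup, left_on=key, right_on='emp_name',
                    how='left', suffixes=('', f'_l{lvl}'))
    out[f'lvl{lvl}_name'] = out[f'emp_manager_l{lvl}']
    key = f'emp_manager_l{lvl}'
# the helper emp_name_lN / emp_manager_lN columns can be dropped afterwards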

Pandas way to separate a DataFrame based on previous groupby() explorations without losing the non-grouped columns

I tried to translate the problem with my real data into example data for this question. Maybe I just have a simple technical problem. Or maybe my whole approach and workflow is not the best?
The objective
There are persons (column name) who have eaten different fruits on different days. And there is some more data (columns foo and bar) I do not want to lose.
I want to separate/split the original data without losing the additional data (in foo and bar).
The condition for separating is the number of unique fruits eaten on the specific days.
This is the initial data:
>>> df
name day fruit foo bar
0 Tim 1 Apple 708 20
1 Tim 1 Apple 135 743
2 Tim 2 Apple 228 562
3 Anna 1 Banana 495 924
4 Anna 1 Strawberry 236 542
5 Bob 1 Strawberry 420 894
6 Bob 2 Apple 27 192
7 Bob 2 Kiwi 671 145
The separated interim result should look like these two DataFrames:
>>> two
name day fruit foo bar
0 Anna 1 Banana 495 924
1 Anna 1 Strawberry 236 542
2 Bob 2 Apple 27 192
3 Bob 2 Kiwi 671 145
>>> non_two
name day fruit foo bar
0 Tim 1 Apple 708 20
1 Tim 1 Apple 135 743
2 Tim 2 Apple 228 562
3 Bob 1 Strawberry 420 894
Example explanation in words: Tim ate just apples on days 1 and 2. It does not matter how many apples; it just matters that it is one unique fruit.
What I have done so far
I did some groupby() magic to find out who ate exactly two unique fruits on which days, and who ate fewer or more.
import pandas as pd
import random as rd
data = {'name': ['Tim', 'Tim', 'Tim', 'Anna', 'Anna', 'Bob', 'Bob', 'Bob'],
        'day': [1, 1, 2, 1, 1, 1, 2, 2],
        'fruit': ['Apple', 'Apple', 'Apple', 'Banana', 'Strawberry',
                  'Strawberry', 'Apple', 'Kiwi'],
        'foo': rd.sample(range(1000), 8),
        'bar': rd.sample(range(1000), 8)}
# That is the primary DataFrame
df = pd.DataFrame(data)
# Explore the data
a = df[['name', 'day', 'fruit']].groupby(['name', 'day', 'fruit']).count().reset_index()
b = a.groupby(['name', 'day']).count()
# People who ate 2 fruits on specific days
two = b[(b.fruit == 2)].reset_index()
print(two)
# People who ate fewer or more than 2 fruits on specific days
non_two = b[(b.fruit != 2)].reset_index()
print(non_two)
Here is my roadblocker
With the dataframes two and non_two I have the information I want. Now I want to separate the initial dataframe based on that information. I think name and day are the columns I should use to select and separate rows in the initial dataframe.
# filter mask
mymask = (df.name == two.name) & (df.day == two.day)
df_two = df[mymask]
df_non_two = df[~mymask]
But this does not work. The first line raises ValueError: Can only compare identically-labeled Series objects.
Use DataFrameGroupBy.nunique with GroupBy.transform, so you can filter the original DataFrame:
mymask = df.groupby(['name', 'day'])['fruit'].transform('nunique').eq(2)
df_two = df[mymask]
df_non_two = df[~mymask]
print (df_two)
name day fruit foo bar
3 Anna 1 Banana 335 62
4 Anna 1 Strawberry 286 694
6 Bob 2 Apple 822 738
7 Bob 2 Kiwi 793 449
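A related sketch, assuming only df_two is needed: GroupBy.filter keeps whole groups that satisfy a predicate, though the complement would still need the transform-based mask above.
# Keeps every row of each (name, day) group with exactly 2 unique fruits.
df_two_alt = df.groupby(['name', 'day']).filter(lambda g: g['fruit'].nunique() == 2)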

Pandas dataframe long to wide grouping by column with duplicated element

Hello, I imported a dataframe which has no headers.
I created some headers using:
df=pd.read_csv(path, names=['Prim Index', 'Alt Index', 'Aka', 'Name', 'Unnamed9'])
Then, I only keep
df=df[['Prim Index', 'Name']]
My question is how to reshape df from long to wide: since 'Prim Index' is duplicated, I would like to have each unique Prim Index in one row and its names in different columns.
Thanks in advance! I appreciate any help on this!
Current df
Prim Index Alt Index Aka Name Unnamed9
1 2345 aka Marcus 0
1 7634 aka Tiffany 0
1 3242 aka Royce 0
2 8765 aka Charlotte 0
2 4343 aka Sara 0
3 9825 aka Keith 0
4 6714 aka Jennifer 0
5 7875 aka Justin 0
5 1345 aka Diana 0
6 6591 aka Liz 0
Desired df
Prim Index Name1 Name2 Name3 Name4
1 Marcus Tiffany Royce
2 Charlotte Sara
3 Keith
4 Jennifer
5 Justin Diana
6 Liz
Use GroupBy.cumcount for a counter and DataFrame.set_index for a MultiIndex, then reshape with Series.unstack and change the column names with DataFrame.add_prefix:
df1 = (df.set_index(['Prim Index', df.groupby('Prim Index').cumcount().add(1)])['Name']
         .unstack(fill_value='')
         .add_prefix('Name'))
print (df1)
Name1 Name2 Name3
Prim Index
1 Marcus Tiffany Royce
2 Charlotte Sara
3 Keith
4 Jennifer
5 Justin Diana
6 Liz
If there always have to be 4 names, add DataFrame.reindex with a range:
df1 = (df.set_index(['Prim Index', df.groupby('Prim Index').cumcount().add(1)])['Name']
         .unstack(fill_value='')
         .reindex(range(1, 5), fill_value='', axis=1)
         .add_prefix('Name'))
print (df1)
Name1 Name2 Name3 Name4
Prim Index
1 Marcus Tiffany Royce
2 Charlotte Sara
3 Keith
4 Jennifer
5 Justin Diana
6 Liz
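To see what drives the reshape above, here is the counter on its own for the sample data: cumcount numbers the rows within each 'Prim Index' group, 1-based after .add(1).
counter = df.groupby('Prim Index').cumcount().add(1)
print(counter.tolist())  # [1, 2, 3, 1, 2, 1, 1, 1, 2, 1] for the sample data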
Using pivot_table, you can get a similar solution to the one @jezreal posted above.
c = ['Prim Index','Name']
d = [[1,'Marcus'],[1,'Tiffany'],[1,'Royce'],
     [2,'Charlotte'],[2,'Sara'],
     [3,'Keith'],
     [4,'Jennifer'],
     [5,'Justin'],
     [5,'Diana'],
     [6,'Liz']]
import pandas as pd
df = pd.DataFrame(data = d,columns=c)
print (df)
df = (pd.pivot_table(df, index='Prim Index',
                     columns=df.groupby('Prim Index').cumcount().add(1),
                     values='Name', aggfunc='sum', fill_value='')
        .add_prefix('Name'))
df = df.reset_index()
print (df)
The output of this will be:
Prim Index Name1 Name2 Name3
0 1 Marcus Tiffany Royce
1 2 Charlotte Sara
2 3 Keith
3 4 Jennifer
4 5 Justin Diana
5 6 Liz

concatenate 2 dataframes, keeping rows of the second whose keys are not in the first

I have 2 dataframes:
raw_data = {
    'subject_id': ['1', '2', '3', '4', '5'],
    'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
df_a = pd.DataFrame(raw_data, columns=['subject_id', 'first_name', 'last_name'])
df_a
and
raw_data = {
    'subject_id': ['4', '5', '6', '7', '8'],
    'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
df_b = pd.DataFrame(raw_data, columns=['subject_id', 'first_name', 'last_name'])
df_b
I want output like below:
subject_id first_name last_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches
2 6 Bran Balwner
3 7 Bryce Brice
4 8 Betty Btisan
I want to concatenate all records of df_a and only those records of df_b that are not in df_a.
I am able to do this with the code below.
import pandas as pd
import numpy as np
mask=np.logical_not(df_b['subject_id'].isin(df_a['subject_id']))
pd.concat([df_a,df_b.loc[mask]])
Is there any other short method available directly in concat or merge?
Please help.
You can use combine_first with set_index():
new_df = df_a.set_index('subject_id').combine_first(df_b.set_index('subject_id'))\
             .reset_index()
subject_id first_name last_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches
5 6 Bran Balwner
6 7 Bryce Brice
7 8 Betty Btisan
drop_duplicates keeps the first of each duplicated pair by default:
pd.concat([df_a,df_b]).drop_duplicates(['subject_id'])
Out[1015]:
subject_id first_name last_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches
2 6 Bran Balwner
3 7 Bryce Brice
4 8 Betty Btisan
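For completeness, the mask from the question can also be written as a one-liner with the ~ operator instead of np.logical_not:
# Same result as the isin/logical_not approach above, just more compact.
out = pd.concat([df_a, df_b[~df_b['subject_id'].isin(df_a['subject_id'])]])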