Concatenate 2 dataframes, keeping keys of the first that are not in the other - pandas

I have 2 dataframes:
raw_data = {
    'subject_id': ['1', '2', '3', '4', '5'],
    'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
df_a = pd.DataFrame(raw_data, columns=['subject_id', 'first_name', 'last_name'])
df_a
and
raw_data = {
    'subject_id': ['4', '5', '6', '7', '8'],
    'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
df_b = pd.DataFrame(raw_data, columns=['subject_id', 'first_name', 'last_name'])
df_b
I want output like below:
subject_id first_name last_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches
2 6 Bran Balwner
3 7 Bryce Brice
4 8 Betty Btisan
I want to concatenate all records of df_a and only those records of df_b whose subject_id is not in df_a.
I am able to do this with the code below.
import pandas as pd
import numpy as np
mask = np.logical_not(df_b['subject_id'].isin(df_a['subject_id']))
pd.concat([df_a, df_b.loc[mask]])
Is there a shorter method available directly in concat or merge?
Please help.

You can use combine_first with set_index(); where the subject_id indexes overlap, df_a's values are kept:
new_df = df_a.set_index('subject_id').combine_first(df_b.set_index('subject_id'))\
.reset_index()
subject_id first_name last_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches
5 6 Bran Balwner
6 7 Bryce Brice
7 8 Betty Btisan
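If you want to stay strictly within merge and concat, a left anti-join is another option. A minimal sketch (indicator=True adds a _merge column marking where each df_b row found a match in df_a):
# keep df_b rows whose subject_id never appears in df_a, then concat
only_b = df_b.merge(df_a[['subject_id']], on='subject_id',
                    how='left', indicator=True)
only_b = only_b.loc[only_b['_merge'] == 'left_only', df_b.columns]
pd.concat([df_a, only_b], ignore_index=True)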

drop_duplicates keeps the first of each duplicated pair by default:
pd.concat([df_a, df_b]).drop_duplicates(['subject_id'])
Output:
subject_id first_name last_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches
2 6 Bran Balwner
3 7 Bryce Brice
4 8 Betty Btisan
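If the repeated index labels (2, 3, 4) bother you, drop_duplicates can reset them in the same call (the ignore_index argument is available in pandas 1.0 and later):
pd.concat([df_a, df_b]).drop_duplicates('subject_id', ignore_index=True)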

Related

How to concatenate text from multiple rows in a dataframe based on a specific structure

I want to merge multiple rows of a dataframe based on a specific text structure.
For example, I have:
df = pd.DataFrame([
    (1, 'john', 'merge'),
    (1, 'smith,', 'merge'),
    (1, 'robert', 'merge'),
    (1, 'g', 'merge'),
    (1, 'owens,', 'merge'),
    (2, 'sarah will', 'OK'),
    (2, 'ali kherad', 'OK'),
    (2, 'david', 'merge'),
    (2, 'lu,', 'merge'),
], columns=['ID', 'Name', 'Merge'])
which is
ID Name Merge
1 john merge
1 smith, merge
1 robert merge
1 g merge
1 owens, merge
2 sarah will OK
2 ali kherad OK
2 david merge
2 lu, merge
The goal is to have a dataframe that merges the text in rows like this:
ID Name
0 1 john smith
1 1 robert g owens
2 2 sarah will
3 2 ali kherad
4 2 david lu
I found a way to create the column 'Merge' to know whether I need to merge or not. Then I tried this:
df = pd.DataFrame(df[df['Merge']=='merge'].groupby(['ID','Merge'], axis=0)['Name'].apply(' '.join))
res = df.apply(lambda x: x.str.split(',').explode()).reset_index().drop(['Merge'], axis=1)
First I group the names where the column 'Merge' equals 'merge'. I know this is not the best way, because it only handles that condition; my dataframe should also keep the rows where 'Merge' equals 'OK'.
Then I split by ','.
The result is
ID Name
0 1 john smith
1 1 robert g owens
2 1
3 2 david lu
4 2
The other problem is that the order is not correct in my real example when I have more than 4000 rows. How can I keep the order and merge the text when necessary?
Make a grouper for grouping:
cond1 = df['Name'].str.contains(r',$') | df['Merge'].eq('OK')
g = cond1[::-1].cumsum()
g (check the reversed index):
8 1
7 1
6 2
5 3
4 4
3 4
2 4
1 5
0 5
dtype: int32
Remove the trailing ',' and group by ID and g:
out = (df['Name'].str.replace(r',$', '', regex=True)
       .groupby([df['ID'], g], sort=False).agg(' '.join)
       .droplevel(1).reset_index())
out
ID Name
0 1 john smith
1 1 robert g owens
2 2 sarah will
3 2 ali kherad
4 2 david lu
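Why the reversed cumsum works: walking from the bottom up, the counter ticks at every group-closing row, so all rows belonging to one name end up with the same id (groupby aligns the grouper Series on the index, so g never needs to be re-reversed). A tiny sketch of the idea:
import pandas as pd

s = pd.Series([False, True, False, False, True])  # True closes a group
print(s[::-1].cumsum())
# 4    1
# 3    1
# 2    1
# 1    2
# 0    2
Rows 2-4 share id 1 and rows 0-1 share id 2, exactly the grouping we want.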

How to make a customized count of occurrences in a grouped dataframe

Please find my input and output below.
INPUT:
Id Values Status
0 Id001 red online
1 Id002 brown running
2 Id002 white off
3 Id003 blue online
4 Id003 green valid
5 Id003 yellow running
6 Id004 rose off
7 Id004 purple off
OUTPUT:
Id Values Status Id_occ Val_occ Sta_occ
0 Id001 red online 1 1 1
1 Id002 brown|white running|off 2 2 2
2 Id003 blue|green|yellow online|valid|running 3 3 3
3 Id004 rose|purple off 2 2 1
I was able to re-calculate the columns Values and Status, but I don't know how to create the three columns of occurrences.
import pandas as pd
df = pd.DataFrame({'Id': ['Id001', 'Id002', 'Id002', 'Id003', 'Id003', 'Id003', 'Id004', 'Id004'],
                   'Values': ['red', 'brown', 'white', 'blue', 'green', 'yellow', 'rose', 'purple'],
                   'Status': ['online', 'running', 'off', 'online', 'valid', 'running', 'off', 'off']})
out = (df.groupby(['Id'])
         .agg({'Values': 'unique', 'Status': 'unique'})
         .applymap(lambda x: '|'.join([str(val) for val in list(x)]))
         .reset_index())
Do you have any suggestions on how to create the three columns of occurrences? Also, is there a better way to re-calculate the columns Values and Status?
You can use named aggregation and a custom function:
ujoin = lambda s: '|'.join(dict.fromkeys(s))
out = (df
       .assign(N_off=df['Status'].eq('off'))
       .groupby(['Id'], as_index=False)
       .agg(**{'Values': ('Values', ujoin),
               'Status': ('Status', ujoin),
               'Id_occ': ('Values', 'size'),
               'Val_occ': ('Values', 'nunique'),
               'Stat_occ': ('Status', 'nunique'),
               'N_off': ('N_off', 'sum')}))
Output:
Id Values Status Id_occ Val_occ Stat_occ N_off
0 Id001 red online 1 1 1 0
1 Id002 brown|white running|off 2 2 2 1
2 Id003 blue|green|yellow online|valid|running 3 3 3 0
3 Id004 rose|purple off 2 2 1 2
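A quick note on the ujoin helper: dict.fromkeys de-duplicates while preserving first-seen order, so the joined string keeps the original row order:
ujoin = lambda s: '|'.join(dict.fromkeys(s))
print(ujoin(['running', 'off', 'running']))  # running|off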
Use:
df.groupby('Id')['Values'].nunique()
For two columns:
df.groupby('Id')[['Values', 'Status']].nunique()
Output:
Values Status
Id
Id001 1 1
Id002 2 2
Id003 3 3
Id004 2 1
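If you want those counts next to the pipe-joined out frame you built in the question, one hedged sketch (assuming pandas 0.25+ for named aggregation) merges them back on Id; Id_occ here is simply the group size:
occ = (df.groupby('Id')
         .agg(Id_occ=('Values', 'size'),
              Val_occ=('Values', 'nunique'),
              Sta_occ=('Status', 'nunique'))
         .reset_index())
result = out.merge(occ, on='Id')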

Pandas Decile Rank

I just used the pandas qcut function to create a decile ranking, but how do I look at the bounds of each ranking? Basically, how do I know which numbers fall in the range of rank 1 or 2 or 3, etc.?
I hope the following Python code with 2 short examples can help you. For the second example, I used the isin method.
import numpy as np
import pandas as pd
df = {'Name': ['Mike', 'Anton', 'Simon', 'Amy',
               'Claudia', 'Peter', 'David', 'Tom'],
      'Score': [42, 63, 75, 97, 61, 30, 80, 13]}
df = pd.DataFrame(df, columns=['Name', 'Score'])
df['decile_rank'] = pd.qcut(df['Score'], 10, labels=False)
print(df)
Output:
Name Score decile_rank
0 Mike 42 2
1 Anton 63 5
2 Simon 75 7
3 Amy 97 9
4 Claudia 61 4
5 Peter 30 1
6 David 80 8
7 Tom 13 0
rank_1 = df[df['decile_rank']==1]
print(rank_1)
Output:
Name Score decile_rank
5 Peter 30 1
rank_1_and_2 = df[df['decile_rank'].isin([1,2])]
print(rank_1_and_2)
Output:
Name Score decile_rank
0 Mike 42 2
5 Peter 30 1
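To see the actual numeric bounds of each decile (the original question), qcut can also return the bin edges with retbins=True; alternatively, skip labels=False and keep the Interval labels:
df['decile_rank'], bins = pd.qcut(df['Score'], 10, labels=False, retbins=True)
print(bins)  # 11 edges; decile i covers bins[i] .. bins[i+1]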

Pandas dataframe long to wide grouping by column with duplicated element

Hello, I imported a dataframe which has no headers.
I created some headers using:
df=pd.read_csv(path, names=['Prim Index', 'Alt Index', 'Aka', 'Name', 'Unnamed9'])
Then, I only keep
df=df[['Prim Index', 'Name']]
My question is: how do I make df go from long to wide? As 'Prim Index' is duplicated, I would like to have each unique Prim Index in one row and its names in separate columns.
Thanks in advance! I appreciate any help on this!
Current df
Prim Index Alt Index Aka Name Unnamed9
1 2345 aka Marcus 0
1 7634 aka Tiffany 0
1 3242 aka Royce 0
2 8765 aka Charlotte 0
2 4343 aka Sara 0
3 9825 aka Keith 0
4 6714 aka Jennifer 0
5 7875 aka Justin 0
5 1345 aka Diana 0
6 6591 aka Liz 0
Desired df
Prim Index Name1 Name2 Name3 Name4
1 Marcus Tiffany Royce
2 Charlotte Sara
3 Keith
4 Jennifer
5 Justin Diana
6 Liz
Use GroupBy.cumcount for a counter with DataFrame.set_index for a MultiIndex, then reshape with Series.unstack and change the column names with DataFrame.add_prefix:
df1 = (df.set_index(['Prim Index', df.groupby('Prim Index').cumcount().add(1)])['Name']
         .unstack(fill_value='')
         .add_prefix('Name'))
print (df1)
Name1 Name2 Name3
Prim Index
1 Marcus Tiffany Royce
2 Charlotte Sara
3 Keith
4 Jennifer
5 Justin Diana
6 Liz
If there always have to be 4 names, add DataFrame.reindex with a range:
df1 = (df.set_index(['Prim Index', df.groupby('Prim Index').cumcount().add(1)])['Name']
         .unstack(fill_value='')
         .reindex(range(1, 5), fill_value='', axis=1)
         .add_prefix('Name'))
print (df1)
Name1 Name2 Name3 Name4
Prim Index
1 Marcus Tiffany Royce
2 Charlotte Sara
3 Keith
4 Jennifer
5 Justin Diana
6 Liz
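To match the desired layout exactly, with Prim Index as a regular column rather than the index, append reset_index():
df1 = df1.reset_index()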
Using pivot_table, you can get a solution similar to @jezrael's:
import pandas as pd

c = ['Prim Index', 'Name']
d = [[1, 'Marcus'], [1, 'Tiffany'], [1, 'Royce'],
     [2, 'Charlotte'], [2, 'Sara'],
     [3, 'Keith'],
     [4, 'Jennifer'],
     [5, 'Justin'],
     [5, 'Diana'],
     [6, 'Liz']]
df = pd.DataFrame(data=d, columns=c)
print(df)
df = (pd.pivot_table(df, index='Prim Index',
                     columns=df.groupby('Prim Index').cumcount().add(1),
                     values='Name', aggfunc='sum', fill_value='')
        .add_prefix('Name')
        .reset_index())
print(df)
The output of this will be:
Prim Index Name1 Name2 Name3
0 1 Marcus Tiffany Royce
1 2 Charlotte Sara
2 3 Keith
3 4 Jennifer
4 5 Justin Diana
5 6 Liz
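A design note: because the cumcount column makes every (Prim Index, counter) pair unique, the aggfunc never actually aggregates anything here, so aggfunc='first' expresses the intent more plainly than 'sum' (which, applied to strings, would concatenate duplicates). The same call with only that argument changed:
df = (pd.pivot_table(df, index='Prim Index',
                     columns=df.groupby('Prim Index').cumcount().add(1),
                     values='Name', aggfunc='first', fill_value='')
        .add_prefix('Name')
        .reset_index())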

Find unique values of groupby/transform without None

The starting point is this kind of dataframe.
df = pd.DataFrame({'author': ['Jack', 'Steve', 'Greg', 'Jack', 'Steve', 'Greg', 'Greg'],
                   'country': ['USA', None, None, 'USA', 'Germany', 'France', 'France'],
                   'c': np.random.randn(7), 'd': np.random.randn(7)})
author country c d
0 Jack USA -2.594532 2.027425
1 Steve None -1.104079 -0.852182
2 Greg None -2.356956 -0.450821
3 Jack USA -0.910153 -0.734682
4 Steve Germany 1.025113 0.441512
5 Greg France 0.218085 1.369443
6 Greg France 0.254485 0.322768
The desired output is one column or multiple columns with the countries of each author.
0 [USA]
1 [Germany]
2 [France]
3 [USA]
4 [Germany]
5 [France]
6 [France]
It doesn't have to be a list, but my closest solution so far gives a list as output.
It could also be separate columns.
df.groupby('author')['country'].transform('unique')
0 [USA]
1 [None, Germany]
2 [None, France]
3 [USA]
4 [None, Germany]
5 [None, France]
6 [None, France]
Is there an easy way to remove None from this?
You can remove missing values with Series.dropna, call SeriesGroupBy.unique and create a new column with Series.map:
df['new'] = df['author'].map(df['country'].dropna().groupby(df['author']).unique())
print (df)
author country c d new
0 Jack USA 0.453358 -1.983282 [USA]
1 Steve None 0.011792 0.383322 [Germany]
2 Greg None -1.551810 0.308982 [France]
3 Jack USA 1.646301 0.040245 [USA]
4 Steve Germany -0.211451 0.841131 [Germany]
5 Greg France 1.049269 -0.813806 [France]
6 Greg France -1.244549 1.009006 [France]
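Since the question mentions that separate columns would also be fine, a hedged sketch that expands the unique-country lists into numbered columns (authors with fewer countries should be padded with NaN):
uniq = df['country'].dropna().groupby(df['author']).unique()
wide = (pd.DataFrame(uniq.tolist(), index=uniq.index)
          .add_prefix('country_'))
df = df.join(wide, on='author')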