Pandas dataframe long to wide grouping by column with duplicated element - pandas

Hello, I imported a CSV file that has no header row.
I supplied the column names myself using
df=pd.read_csv(path, names=['Prim Index', 'Alt Index', 'Aka', 'Name', 'Unnamed9'])
Then I keep only two columns:
df=df[['Prim Index', 'Name']]
My question is: how do I reshape df from long to wide? 'Prim Index' contains duplicates, and I would like one row per unique Prim Index, with its names spread across separate columns.
Thanks in advance! I appreciate any help on this!
Current df
Prim Index Alt Index Aka Name Unnamed9
1 2345 aka Marcus 0
1 7634 aka Tiffany 0
1 3242 aka Royce 0
2 8765 aka Charlotte 0
2 4343 aka Sara 0
3 9825 aka Keith 0
4 6714 aka Jennifer 0
5 7875 aka Justin 0
5 1345 aka Diana 0
6 6591 aka Liz 0
Desired df
Prim Index Name1 Name2 Name3 Name4
1 Marcus Tiffany Royce
2 Charlotte Sara
3 Keith
4 Jennifer
5 Justin Diana
6 Liz

Use GroupBy.cumcount to build a per-group counter, pass it together with 'Prim Index' to DataFrame.set_index to create a MultiIndex, then reshape with Series.unstack and rename the columns with DataFrame.add_prefix:
df1 = (df.set_index(['Prim Index', df.groupby('Prim Index').cumcount().add(1)])['Name']
.unstack(fill_value='')
.add_prefix('Name'))
print (df1)
Name1 Name2 Name3
Prim Index
1 Marcus Tiffany Royce
2 Charlotte Sara
3 Keith
4 Jennifer
5 Justin Diana
6 Liz
If there always have to be 4 name columns, add DataFrame.reindex with a range:
df1 = (df.set_index(['Prim Index', df.groupby('Prim Index').cumcount().add(1)])['Name']
.unstack(fill_value='')
.reindex(range(1, 5), fill_value='', axis=1)
.add_prefix('Name'))
print (df1)
Name1 Name2 Name3 Name4
Prim Index
1 Marcus Tiffany Royce
2 Charlotte Sara
3 Keith
4 Jennifer
5 Justin Diana
6 Liz
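The chain above can be easier to follow when split into its intermediate steps. The following is a minimal sketch using the sample data from the question; the variable names counter, names and wide are only illustrative.

```python
import pandas as pd

# Sample data from the question (only the two columns that are kept)
df = pd.DataFrame({
    'Prim Index': [1, 1, 1, 2, 2, 3, 4, 5, 5, 6],
    'Name': ['Marcus', 'Tiffany', 'Royce', 'Charlotte', 'Sara',
             'Keith', 'Jennifer', 'Justin', 'Diana', 'Liz'],
})

# Step 1: number the rows inside each 'Prim Index' group, starting at 1
counter = df.groupby('Prim Index').cumcount().add(1)

# Step 2: build a (Prim Index, counter) MultiIndex and keep only 'Name'
names = df.set_index(['Prim Index', counter])['Name']

# Step 3: move the counter level into the columns and prefix the labels
wide = names.unstack(fill_value='').add_prefix('Name')
print(wide)
```

The counter is what makes each (Prim Index, counter) pair unique, which unstack needs in order to reshape without duplicate entries.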

Using pivot_table, you can get a result similar to @jezreal's solution:
import pandas as pd

# Sample data from the question
c = ['Prim Index', 'Name']
d = [[1, 'Marcus'], [1, 'Tiffany'], [1, 'Royce'],
     [2, 'Charlotte'], [2, 'Sara'],
     [3, 'Keith'],
     [4, 'Jennifer'],
     [5, 'Justin'],
     [5, 'Diana'],
     [6, 'Liz']]
df = pd.DataFrame(data=d, columns=c)
print(df)
# Pivot on a per-group counter (1, 2, 3, ...); each cell holds exactly one
# name, so 'first' picks it without relying on summing strings
df = (pd.pivot_table(df, index='Prim Index',
                     columns=df.groupby('Prim Index').cumcount().add(1),
                     values='Name', aggfunc='first', fill_value='')
        .add_prefix('Name'))
df = df.reset_index()
print(df)
The output will be:
Prim Index Name1 Name2 Name3
0 1 Marcus Tiffany Royce
1 2 Charlotte Sara
2 3 Keith
3 4 Jennifer
4 5 Justin Diana
5 6 Liz
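Since every (Prim Index, counter) pair is unique, no aggregation is strictly needed, so the same reshape can also be sketched with DataFrame.pivot. This is an alternative, not what the answer above uses; the helper column name n is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'Prim Index': [1, 1, 1, 2, 2, 3, 4, 5, 5, 6],
    'Name': ['Marcus', 'Tiffany', 'Royce', 'Charlotte', 'Sara',
             'Keith', 'Jennifer', 'Justin', 'Diana', 'Liz'],
})

wide = (
    # 'n' is a hypothetical helper column holding the per-group counter
    df.assign(n=df.groupby('Prim Index').cumcount().add(1))
      .pivot(index='Prim Index', columns='n', values='Name')
      .fillna('')                 # pivot leaves NaN where a group has fewer names
      .add_prefix('Name')
      .rename_axis(columns=None)  # drop the leftover 'n' label on the columns
      .reset_index()
)
print(wide)
```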

Related

How to concatenate two columns so that one comes below the other in pandas dataframe?

Col-1 Col-2
0 Erin Tanya
1 Cathy Tom
2 Ross Wes
This is my dataset
I need the result to look like this:
New_column
0 Erin
1 Cathy
2 Ross
3 Tanya
4 Tom
5 Wes
I tried using .map, append, concat and .ravel but no luck. Any help would be appreciated :)
Use DataFrame.melt:
df1 = df[['Col-1','Col-2']].melt(value_name='New_column')[['New_column']]
print (df1)
New_column
0 Erin
1 Cathy
2 Ross
3 Tanya
4 Tom
5 Wes
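Because the question mentions trying concat, here is a minimal sketch of how pd.concat could produce the same result, assuming the frame has exactly the two columns shown. The variable name stacked is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Col-1': ['Erin', 'Cathy', 'Ross'],
                   'Col-2': ['Tanya', 'Tom', 'Wes']})

# Stack the two columns end to end; ignore_index=True renumbers the rows 0..5
stacked = (pd.concat([df['Col-1'], df['Col-2']], ignore_index=True)
             .to_frame('New_column'))
print(stacked)
```

Without ignore_index=True the result would keep the original labels 0, 1, 2, 0, 1, 2.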
With numpy.ravel, you can use:
import numpy as np
import pandas as pd

# order='F' flattens column by column instead of row by row
out = pd.DataFrame({'New_column': df.to_numpy().ravel(order='F')})
# or
out = pd.DataFrame({'New_column': np.ravel(df, order='F')})
output:
New_column
0 Erin
1 Cathy
2 Ross
3 Tanya
4 Tom
5 Wes
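The order='F' argument is what makes ravel work here, which may be why a plain .ravel() gave "no luck" in the question. A small sketch comparing the two orders on the sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col-1': ['Erin', 'Cathy', 'Ross'],
                   'Col-2': ['Tanya', 'Tom', 'Wes']})

# Default order='C' walks the array row by row, interleaving the columns
row_major = df.to_numpy().ravel()

# order='F' walks the array column by column, which is the desired result
column_major = df.to_numpy().ravel(order='F')

print(row_major)
print(column_major)
```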

How to differentiate mini dataframes appended to a bigger dataframe

I am trying to build a bigger dataframe from other dataframes, but I need to identify them separately. I want to create a new column holding an index for each source dataframe.
frames = [dataTotal,dataFrame]
dataTotal = dataTotal.append(dataFrame, ignore_index=False, sort=False)
I tried pd.concat with the keys argument, but it doesn't work since the dataframes have different sizes.
What do I have to do?
Example:
I have this dataframe, and I want to append another one to it and create an index to differentiate them:
name LastName
0 Vitor Albres
1 Matheus Wilson
2 Andrew George
3 Filipe Dircksen
4 Eli Matthew
Add another dataframe
name LastName
0 Ana Lee
1 Renato Cristian
2 Joe Jonh
To create something like
data_id name LastName
0 0 Vitor Albres
1 Matheus Wilson
2 Andrew George
3 Filipe Dircksen
4 Eli Matthew
1 0 Ana Lee
1 Renato Cristian
2 Joe Jonh
Passing the keys argument doesn't work because it says I can't concat dataframes with different levels. I don't know if I am doing it wrong or something.
You can use pd.concat with the keys argument:
In [1831]: df1
Out[1831]:
name LastName
0 Vitor Albres
1 Matheus Wilson
2 Andrew George
3 Filipe Dircksen
4 Eli Matthew
In [1832]: df2
Out[1832]:
name LastName
0 Ana Lee
1 Renato Cristian
2 Joe Jonh
In [1830]: df_list = [df1, df2]
In [1833]: df = pd.concat(df_list, keys=range(len(df_list)))
Then name the MultiIndex levels using df.index.names:
In [1837]: df.index.names = ['data_id', '']
In [1838]: df
Out[1838]:
name LastName
data_id
0 0 Vitor Albres
1 Matheus Wilson
2 Andrew George
3 Filipe Dircksen
4 Eli Matthew
1 0 Ana Lee
1 Renato Cristian
2 Joe Jonh
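The question asks for a new column, so here is a minimal self-contained sketch that names the levels directly in pd.concat through its names parameter and then moves data_id out of the index. The level name 'row' and the variable names are illustrative assumptions.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['Vitor', 'Matheus', 'Andrew', 'Filipe', 'Eli'],
                    'LastName': ['Albres', 'Wilson', 'George', 'Dircksen', 'Matthew']})
df2 = pd.DataFrame({'name': ['Ana', 'Renato', 'Joe'],
                    'LastName': ['Lee', 'Cristian', 'Jonh']})

df_list = [df1, df2]

# keys labels each source frame; names labels the two resulting index levels.
# The frames may have different lengths.
combined = pd.concat(df_list, keys=range(len(df_list)), names=['data_id', 'row'])

# Turn only the 'data_id' level into an ordinary column
as_column = combined.reset_index(level='data_id')
print(as_column)
```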

Complex Pivoting in Pandas involving multiple columns

My df:
t name team Value
1-Jan-10 Roger Ajou 10
1-Jan-10 Kim KSR 20
1-Jan-10 Tim KKR 0
2-Jan-10 Tim KKR 10
2-Jan-10 Roger Ajou 20
3-Jan-10 Kim KSR 20
3-Jan-10 Tim KKR 10
3-Jan-10 Roger Ajou 0
I tried pandas pivoting, but here I need to pivot 2 columns together. The expected output is like below:
KSR Ajou KKR
Kim Roger Tim
1-Jan-10 20 10 0
2-Jan-10 20 10
3-Jan-10 20 0 10
Note: the columns are sorted based on the 'name' column. Is this doable in pandas?
Use DataFrame.set_index with Series.unstack to reshape, then sort the columns by the second level of the MultiIndex, and finally remove the index and column names with DataFrame.rename_axis:
df1 = (df.set_index(['t','team','name'])['Value']
.unstack([1,2], fill_value=0)
.sort_index(level=1, axis=1)
.rename_axis(index=None, columns=[None, None]))
print (df1)
KSR Ajou KKR
Kim Roger Tim
1-Jan-10 20 10 0
2-Jan-10 0 20 10
3-Jan-10 20 0 10
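For readers who want to run this end to end, the following sketch rebuilds the sample data from the question and refers to the levels by name rather than by position, which is equivalent here to the positional form used above.

```python
import pandas as pd

df = pd.DataFrame({
    't': ['1-Jan-10', '1-Jan-10', '1-Jan-10', '2-Jan-10',
          '2-Jan-10', '3-Jan-10', '3-Jan-10', '3-Jan-10'],
    'name': ['Roger', 'Kim', 'Tim', 'Tim', 'Roger', 'Kim', 'Tim', 'Roger'],
    'team': ['Ajou', 'KSR', 'KKR', 'KKR', 'Ajou', 'KSR', 'KKR', 'Ajou'],
    'Value': [10, 20, 0, 10, 20, 20, 10, 0],
})

wide = (
    df.set_index(['t', 'team', 'name'])['Value']
      .unstack(['team', 'name'], fill_value=0)  # team and name become column levels
      .sort_index(level='name', axis=1)         # order the columns by player name
      .rename_axis(index=None, columns=[None, None])
)
print(wide)
```

The missing (2-Jan-10, Kim) combination is filled with 0 by fill_value.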

Groupby sum of two column and create new dataframe in pandas

I have a dataframe as shown below
Player Goal Freekick
Messi 2 5
Ronaldo 1 4
Messi 1 4
Messi 0 5
Ronaldo 0 9
Ronaldo 1 8
Xavi 1 1
Xavi 0 7
From the above I would like to do a groupby sum of Goal and Freekick as shown below.
Expected Output:
Player toatal_goals total_freekicks
Messi 3 14
Ronaldo 2 21
Xavi 1 8
I tried below code:
df1 = df.groupby(['Player'])['Goal'].sum().reset_index().rename({'Goal':'toatal_goals'})
df1['total_freekicks'] = df.groupby(['Player'])['Freekick'].sum()
But the above does not work. Please help me.
First aggregate the sum by Player, then add a prefix with DataFrame.add_prefix and convert the column names to lowercase:
df = df.groupby('Player').sum().add_prefix('total_').rename(columns=str.lower)
print (df)
total_goal total_freekick
Player
Messi 3 14
Ronaldo 2 21
Xavi 1 8
You can use named aggregation (pd.NamedAgg) to create the aggregations with customized column names.
(
df.groupby(by='Player')
.agg(toatal_goals=('Goal', 'sum'),
total_freekicks=('Freekick', 'sum'))
.reset_index()
)
Player toatal_goals total_freekicks
Messi 3 14
Ronaldo 2 21
Xavi 1 8
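The attempt in the question fails for two reasons: rename without columns= targets the row labels, and the second assignment aligns a Series indexed by Player against a frame with a default integer index. A minimal sketch of one way to repair it, keeping the column names exactly as spelled in the question:

```python
import pandas as pd

df = pd.DataFrame({
    'Player': ['Messi', 'Ronaldo', 'Messi', 'Messi',
               'Ronaldo', 'Ronaldo', 'Xavi', 'Xavi'],
    'Goal': [2, 1, 1, 0, 0, 1, 1, 0],
    'Freekick': [5, 4, 4, 5, 9, 8, 1, 7],
})

result = (
    # as_index=False keeps 'Player' as a regular column
    df.groupby('Player', as_index=False)[['Goal', 'Freekick']]
      .sum()
      # columns= makes rename act on column labels instead of row labels
      .rename(columns={'Goal': 'toatal_goals', 'Freekick': 'total_freekicks'})
)
print(result)
```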

split index column based on existence of a substring

I have the following df:
stuff
james__America by Estonia : 2
luke__Spain by Italy 3
michael 4
Louis__Portugal by USA 2
If the index contains the substring "__", I would like to split the index on it and create 2 new columns next to it, obtained by splitting the second part on ' by ', to get the following output:
name1 name2 stuff
james America Estonia 2
luke Spain Italy 3
michael 0 0 4
Louis Portugal USA 2
I thought of using:
df.index.str.split('__', expand=True).split(' by ',expand=True).rename(columns={0:'name1',1:'name2'})
However it does not seem to work.
Convert the Index to a Series with Index.to_series, split it on the first separator with Series.str.split, split the resulting second column on the second separator, join the original columns, and finally overwrite the index:
df1 = df.index.to_series().str.split('__', expand=True)
df2 = df1[1].str.split(' by ',expand=True).rename(columns={0:'name1',1:'name2'}).fillna('0')
df = df2.join(df)
df.index = df1[0].rename(None)
print (df)
name1 name2 stuff
james America Estonia 2
luke Spain Italy 3
michael 0 0 4
Louis Portugal USA 2
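A self-contained version of these steps, assuming the index holds plain strings such as 'james__America by Estonia' (the stray ' :' in the question's first row is left out here as an assumption). The names parts, names and out are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'stuff': [2, 3, 4, 2]},
    index=['james__America by Estonia', 'luke__Spain by Italy',
           'michael', 'Louis__Portugal by USA'],
)

# Column 0 holds the part before '__'; column 1 holds 'X by Y' or is missing
parts = df.index.to_series().str.split('__', expand=True)

# Split the second part on ' by '; rows without '__' stay missing, then become '0'
names = (parts[1].str.split(' by ', expand=True)
                 .rename(columns={0: 'name1', 1: 'name2'})
                 .fillna('0'))

# Both frames still share the original index, so join aligns row by row
out = names.join(df)
out.index = parts[0].rename(None)
print(out)
```

The index is replaced only at the end because the join relies on the original labels.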