split index column based on existence of a substring - pandas

I have the following df:
stuff
james__America by Estonia 2
luke__Spain by Italy 3
michael 4
Louis__Portugal by USA 2
Where the index contains the substring "__", I would like to split the index and then split the right-hand part again on ' by ' into two new columns, to get the following output:
name1 name2 stuff
james America Estonia 2
luke Spain Italy 3
michael 0 0 4
Louis Portugal USA 2
I thought using :
df.index.str.split('__', expand=True).split(' by ',expand=True).rename(columns={0:'name1',1:'name2'})
However it does not seem to work.

Convert the Index to a Series with Index.to_series, split on the first separator with Series.str.split, then split the second column on ' by ', join back the original columns and finally overwrite the index:
df1 = df.index.to_series().str.split('__', expand=True)
df2 = df1[1].str.split(' by ',expand=True).rename(columns={0:'name1',1:'name2'}).fillna('0')
df = df2.join(df)
df.index = df1[0].rename(None)
print (df)
name1 name2 stuff
james America Estonia 2
luke Spain Italy 3
michael 0 0 4
Louis Portugal USA 2
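As a self-contained sketch, the steps above can be run end-to-end; the frame here is reconstructed from the sample shown (an assumption about the exact input):

```python
import pandas as pd

# Hypothetical reconstruction of the sample frame; the index holds the raw labels.
df = pd.DataFrame(
    {"stuff": [2, 3, 4, 2]},
    index=["james__America by Estonia", "luke__Spain by Italy",
           "michael", "Louis__Portugal by USA"],
)

# First split on '__'; rows without it get NaN in column 1.
df1 = df.index.to_series().str.split("__", expand=True)

# Second split on ' by '; NaN rows stay NaN and are filled with '0'.
df2 = (df1[1].str.split(" by ", expand=True)
             .rename(columns={0: "name1", 1: "name2"})
             .fillna("0"))

out = df2.join(df)          # join back the original columns on the old index
out.index = df1[0].rename(None)  # keep only the part before '__' as the index
print(out)
```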

Related

Pandas Dataframe groupby on 3 columns and make one column lowercase

I have a dataframe:
country rating owner
0 England a John Smith
1 England b John Smith
2 France a Frank Foo
3 France a Frank foo
4 France a Frank Foo
5 France b Frank Foo
I'd like to produce a count of owners after grouping by country and rating, while
ignoring case and
ignoring any spaces (leading, trailing or in between).
I am expecting:
country rating owner count
0 England a John Smith 1
1 England b John Smith 1
2 France a Frank Foo 3
3 France b Frank Foo 1
I have tried:
df.group_by(['rating','owner'])['owner'].count()
and
df.group_by(['rating','owner'].str.lower())['owner'].count()
Use title and replace to rework the string and groupby.size to aggregate:
out = (df.groupby(['country', 'rating',
                   df['owner'].str.title().str.replace(r'\s+', ' ', regex=True)])
         .size().reset_index(name='count')
      )
Output:
country rating owner count
0 England a John Smith 1
1 England b John Smith 1
2 France a Frank Foo 3
3 France b Frank Foo 1
Use Series.str.strip and Series.str.title, remove multiple spaces with Series.str.replace, and aggregate with GroupBy.size:
DataFrameGroupBy.count counts while excluding missing values, which is not necessary here.
df1 = (df.groupby(['country', 'rating',
                   df['owner'].str.strip().str.title().str.replace(r'\s+', ' ', regex=True)])
         .size()
         .reset_index(name='count'))
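A runnable sketch of the same idea, with the sample frame reconstructed (the double space in one "Frank  Foo" is an assumption, added to exercise the whitespace cleanup):

```python
import pandas as pd

# Hypothetical reconstruction of the sample frame.
df = pd.DataFrame({
    "country": ["England", "England", "France", "France", "France", "France"],
    "rating":  ["a", "b", "a", "a", "a", "b"],
    "owner":   ["John Smith", "John Smith", "Frank Foo",
                "Frank foo", "Frank  Foo", "Frank Foo"],
})

# Normalise case and inner whitespace, then count group sizes.
owner = df["owner"].str.strip().str.title().str.replace(r"\s+", " ", regex=True)
out = (df.groupby(["country", "rating", owner])
         .size()
         .reset_index(name="count"))
print(out)
```

Passing the normalised Series directly to groupby works because it shares the frame's index, so nothing in df itself has to be overwritten.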

Convert MultiIndex into columns Pandas

I would like to know how to perform the below transformation in a pandas dataframe; I have no idea how to tackle this. The idea is to take index level 0 and turn it into the level 0 of the columns, with the rest of the columns placed under the appropriate main column.
Try reshaping the dataframe using set_index, unstack and swaplevel:
df_out = df.set_index(df.groupby(level=0).cumcount()+1, append=True)\
.reset_index(level=1)\
.rename(columns={'level_1':'ident'})\
.unstack(0)\
.swaplevel(0,1, axis=1)\
.sort_index(axis=1)
df_out
Output:
A B C
city ident population city ident population city ident population
1 NY 1 57578 London 4 543534 Berlin 7 5257537
2 LA 2 8767867 Paris 5 25725 Madrid 8 53755
3 Valencia 3 8767678 Beijin 6 275275 Belfast 9 354354
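The input frame is not shown in the question; a sketch assuming a two-level index (group letter, running id), which reproduces the output above:

```python
import pandas as pd

# Hypothetical input inferred from the output: a MultiIndex of
# (group letter, running id) with city/population columns.
idx = pd.MultiIndex.from_tuples(
    [("A", 1), ("A", 2), ("A", 3),
     ("B", 4), ("B", 5), ("B", 6),
     ("C", 7), ("C", 8), ("C", 9)]
)
df = pd.DataFrame({
    "city": ["NY", "LA", "Valencia", "London", "Paris", "Beijin",
             "Berlin", "Madrid", "Belfast"],
    "population": [57578, 8767867, 8767678, 543534, 25725, 275275,
                   5257537, 53755, 354354],
}, index=idx)

# Append a per-group counter, pull the original id out as an 'ident' column,
# then move index level 0 up into the columns.
df_out = (df.set_index(df.groupby(level=0).cumcount() + 1, append=True)
            .reset_index(level=1)
            .rename(columns={"level_1": "ident"})
            .unstack(0)
            .swaplevel(0, 1, axis=1)
            .sort_index(axis=1))
print(df_out)
```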

How to differentiate mini dataframes appended to a bigger dataframe

I am trying to create a bigger dataframe from other dataframes, but I need to identify them separately. I want to create a new column with an index for every dataframe.
frames = [dataTotal,dataFrame]
dataTotal = dataTotal.append(dataFrame, ignore_index=False, sort=False)
I tried to use pd.concat with the keys attribute, but it doesn't work since the dataframes are different sizes.
What do I have to do?
Example:
I have this dataframe, and I want to append another to it, creating an index to differentiate them:
name LastName
0 Vitor Albres
1 Matheus Wilson
2 Andrew George
3 Filipe Dircksen
4 Eli Matthew
Add another dataframe:
name LastName
0 Ana Lee
1 Renato Cristian
2 Joe Jonh
To create something like
data_id name LastName
0 0 Vitor Albres
1 Matheus Wilson
2 Andrew George
3 Filipe Dircksen
4 Eli Matthew
1 0 Ana Lee
1 Renato Cristian
2 Joe Jonh
Passing the keys attribute doesn't work; it says I can't concat dataframes with different levels. I don't know if I did it wrong or something.
You can use pd.concat with the keys argument:
In [1831]: df1
Out[1831]:
name LastName
0 Vitor Albres
1 Matheus Wilson
2 Andrew George
3 Filipe Dircksen
4 Eli Matthew
In [1832]: df2
Out[1832]:
name LastName
0 Ana Lee
1 Renato Cristian
2 Joe Jonh
In [1830]: df_list = [df1, df2]
In [1833]: df = pd.concat(df_list, keys=range(len(df_list)))
Then name the MultiIndex levels using df.index.names:
In [1837]: df.index.names = ['data_id', '']
In [1838]: df
Out[1838]:
name LastName
data_id
0 0 Vitor Albres
1 Matheus Wilson
2 Andrew George
3 Filipe Dircksen
4 Eli Matthew
1 0 Ana Lee
1 Renato Cristian
2 Joe Jonh
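The same result as a runnable sketch; pd.concat also accepts a names argument, which sets the index level names in one step:

```python
import pandas as pd

# Hypothetical reconstruction of the two frames from the question.
df1 = pd.DataFrame({"name": ["Vitor", "Matheus", "Andrew", "Filipe", "Eli"],
                    "LastName": ["Albres", "Wilson", "George", "Dircksen", "Matthew"]})
df2 = pd.DataFrame({"name": ["Ana", "Renato", "Joe"],
                    "LastName": ["Lee", "Cristian", "Jonh"]})

# keys= labels each source frame; names= labels the resulting index levels,
# so the separate df.index.names assignment is not needed.
df = pd.concat([df1, df2], keys=range(2), names=["data_id", None])
print(df)
```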

Pandas dataframe long to wide grouping by column with duplicated element

Hello I imported a dataframe which has no headers.
I created some headers using
df=pd.read_csv(path, names=['Prim Index', 'Alt Index', 'Aka', 'Name', 'Unnamed9'])
Then, I only keep
df=df[['Prim Index', 'Name']]
My question is how to reshape df from long to wide: since 'Prim Index' is duplicated, I would like each unique Prim Index in one row, with its names spread across different columns.
Thanks in advance! I appreciate any help on this!
Current df
Prim Index Alt Index Aka Name Unnamed9
1 2345 aka Marcus 0
1 7634 aka Tiffany 0
1 3242 aka Royce 0
2 8765 aka Charlotte 0
2 4343 aka Sara 0
3 9825 aka Keith 0
4 6714 aka Jennifer 0
5 7875 aka Justin 0
5 1345 aka Diana 0
6 6591 aka Liz 0
Desired df
Prim Index Name1 Name2 Name3 Name4
1 Marcus Tiffany Royce
2 Charlotte Sara
3 Keith
4 Jennifer
5 Justin Diana
6 Liz
Use GroupBy.cumcount for a counter with DataFrame.set_index for a MultiIndex, then reshape by Series.unstack and change the column names by DataFrame.add_prefix:
df1 = (df.set_index(['Prim Index', df.groupby('Prim Index').cumcount().add(1)])['Name']
.unstack(fill_value='')
.add_prefix('Name'))
print (df1)
Name1 Name2 Name3
Prim Index
1 Marcus Tiffany Royce
2 Charlotte Sara
3 Keith
4 Jennifer
5 Justin Diana
6 Liz
If there always have to be 4 name columns, add DataFrame.reindex with a range:
df1 = (df.set_index(['Prim Index', df.groupby('Prim Index').cumcount().add(1)])['Name']
.unstack(fill_value='')
.reindex(range(1, 5), fill_value='', axis=1)
.add_prefix('Name'))
print (df1)
Name1 Name2 Name3 Name4
Prim Index
1 Marcus Tiffany Royce
2 Charlotte Sara
3 Keith
4 Jennifer
5 Justin Diana
6 Liz
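Put together as a runnable sketch (the sample data is reconstructed from the question):

```python
import pandas as pd

# Hypothetical reconstruction of the two columns kept from the CSV.
df = pd.DataFrame({
    "Prim Index": [1, 1, 1, 2, 2, 3, 4, 5, 5, 6],
    "Name": ["Marcus", "Tiffany", "Royce", "Charlotte", "Sara",
             "Keith", "Jennifer", "Justin", "Diana", "Liz"],
})

# The per-group counter becomes the new column position, unstack widens,
# and reindex pads out to a fixed 4 name columns.
df1 = (df.set_index(["Prim Index", df.groupby("Prim Index").cumcount().add(1)])["Name"]
         .unstack(fill_value="")
         .reindex(range(1, 5), fill_value="", axis=1)
         .add_prefix("Name"))
print(df1)
```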
Using pivot_table, you can get a solution similar to the one jezreal posted.
c = ['Prim Index','Name']
d = [[1,'Marcus'],[1,'Tiffany'],[1,'Royce'],
[2,'Charlotte'],[2,'Sara'],
[3,'Keith'],
[4,'Jennifer'],
[5,'Justin'],
[5,'Diana'],
[6,'Liz']]
import pandas as pd
df = pd.DataFrame(data = d,columns=c)
print (df)
df = (pd.pivot_table(df, index='Prim Index',
                     columns=df.groupby('Prim Index').cumcount().add(1),
                     values='Name', aggfunc='sum', fill_value='')
        .add_prefix('Name'))
df = df.reset_index()
print (df)
output of this will be:
Prim Index Name1 Name2 Name3
0 1 Marcus Tiffany Royce
1 2 Charlotte Sara
2 3 Keith
3 4 Jennifer
4 5 Justin Diana
5 6 Liz

Updating NaN values in a dataframe when there's a match to a list item in another dataframe column

beginner in Python here.
Have tried to look for a solution for this from a bunch of sites.
Might just not be connecting the dots right.
I'm trying to fill the 'NaN' values in a DataFrame based on values present in a list.
If the person's name appears in one of the lists, the 'geo' column should be updated with the correct geo name.
The lists are complete, with people in the regions, but the DataFrame is not and needs to be updated.
What I have looks roughly like this:
name geo
0 john EMEA
1 jack NaN
2 jill APAC
3 james NaN
4 judy EMEA
5 jared NaN
I would like to update the NaN values based on the below lists.
EMEA = ['john','jack','judy','jared']
APAC = ['jill','james']
First create a dictionary from the lists:
EMEA = ['john','jack','judy','jared']
APAC = ['jill','james']
d = {'EMEA' : EMEA,
'APAC': APAC}
Then invert and flatten it, mapping each name to its region:
d1 = {x: k for k, v in d.items() for x in v}
print (d1)
{'john': 'EMEA', 'jack': 'EMEA', 'judy': 'EMEA',
'jared': 'EMEA', 'jill': 'APAC', 'james': 'APAC'}
Last, replace only the missing values with the mapped values using Series.map and Series.fillna:
df['geo'] = df['geo'].fillna(df['name'].map(d1))
print (df)
name geo
0 john EMEA
1 jack EMEA
2 jill APAC
3 james APAC
4 judy EMEA
5 jared EMEA
Or map all values:
df['geo'] = df['name'].map(d1)
print (df)
name geo
0 john EMEA
1 jack EMEA
2 jill APAC
3 james APAC
4 judy EMEA
5 jared EMEA
Try this:
for x in df.index:
    if df.loc[x, "name"] in EMEA:
        df.loc[x, "geo"] = 'EMEA'
    if df.loc[x, "name"] in APAC:
        df.loc[x, "geo"] = "APAC"
Hope this helps. Good Luck!
A simple np.where should resolve this:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html
df['geo'] = np.where((df['geo'].isnull()) & (df['name'].isin(EMEA)), 'EMEA',
            np.where((df['geo'].isnull()) & (df['name'].isin(APAC)), 'APAC',
                     df['geo']))
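Putting the map/fillna approach together as a runnable sketch, with the sample frame reconstructed from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the sample frame and region lists.
df = pd.DataFrame({"name": ["john", "jack", "jill", "james", "judy", "jared"],
                   "geo": ["EMEA", np.nan, "APAC", np.nan, "EMEA", np.nan]})
EMEA = ["john", "jack", "judy", "jared"]
APAC = ["jill", "james"]

# Invert the region lists into one name -> region lookup.
d1 = {name: region
      for region, names in {"EMEA": EMEA, "APAC": APAC}.items()
      for name in names}

# fillna only touches the NaN rows, so existing geo values are preserved.
df["geo"] = df["geo"].fillna(df["name"].map(d1))
print(df)
```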