How to differentiate mini dataframes appended to a bigger dataframe - pandas

I am trying to create a bigger dataframe from other dataframes, but I need to identify them separately. I want to create a new column with an index for every dataframe.
frames = [dataTotal,dataFrame]
dataTotal = dataTotal.append(dataFrame, ignore_index=False, sort=False)
I tried using pd.concat with the keys attribute, but it doesn't work since the dataframes are different sizes.
What should I do?
Example:
I have this dataframe, and I want to append another one to it and create an index to differentiate them:
name LastName
0 Vitor Albres
1 Matheus Wilson
2 Andrew George
3 Filipe Dircksen
4 Eli Matthew
Add another dataframe:
name LastName
0 Ana Lee
1 Renato Cristian
2 Joe Jonh
To create something like
data_id name LastName
0 0 Vitor Albres
1 Matheus Wilson
2 Andrew George
3 Filipe Dircksen
4 Eli Matthew
1 0 Ana Lee
1 Renato Cristian
2 Joe Jonh
Passing the keys attribute doesn't work, because it says I can't concat dataframes with different levels. I don't know if I'm doing it wrong or something.

You can use pd.concat with the keys argument:
In [1831]: df1
Out[1831]:
name LastName
0 Vitor Albres
1 Matheus Wilson
2 Andrew George
3 Filipe Dircksen
4 Eli Matthew
In [1832]: df2
Out[1832]:
name LastName
0 Ana Lee
1 Renato Cristian
2 Joe Jonh
In [1830]: df_list = [df1, df2]
In [1833]: df = pd.concat(df_list, keys=range(len(df_list)))
Then name the MultiIndex levels using df.index.names:
In [1837]: df.index.names = ['data_id', '']
In [1838]: df
Out[1838]:
name LastName
data_id
0 0 Vitor Albres
1 Matheus Wilson
2 Andrew George
3 Filipe Dircksen
4 Eli Matthew
1 0 Ana Lee
1 Renato Cristian
2 Joe Jonh
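Putting the answer together, a minimal runnable sketch (the sample frames are rebuilt from the question's data):

```python
import pandas as pd

# Rebuild the two sample frames from the question
df1 = pd.DataFrame({'name': ['Vitor', 'Matheus', 'Andrew', 'Filipe', 'Eli'],
                    'LastName': ['Albres', 'Wilson', 'George', 'Dircksen', 'Matthew']})
df2 = pd.DataFrame({'name': ['Ana', 'Renato', 'Joe'],
                    'LastName': ['Lee', 'Cristian', 'Jonh']})

df_list = [df1, df2]
# keys adds an outer index level identifying the source frame;
# frames of different lengths are fine
df = pd.concat(df_list, keys=range(len(df_list)))
df.index.names = ['data_id', '']

# If you prefer data_id as a regular column instead of an index level:
df_flat = df.reset_index(level='data_id')
```

Because keys becomes an index level, rows from each source frame can later be selected with df.loc[0] or df.loc[1].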

Related

Pandas Dataframe groupby on 3 columns and make one column lowercase

I have a dataframe:
country rating owner
0 England a John Smith
1 England b John Smith
2 France a Frank Foo
3 France a Frank foo
4 France a Frank Foo
5 France b Frank Foo
I'd like to produce a count of owners after grouping by country and rating, while
ignoring case
ignoring any spaces (leading, trailing, or in between)
I am expecting:
country rating owner count
0 England a John Smith 1
1 England b John Smith 1
2 France a Frank Foo 3
3 France b Frank Foo 1
I have tried:
df.group_by(['rating','owner'])['owner'].count()
and
df.group_by(['rating','owner'].str.lower())['owner'].count()
Use title and replace to rework the string and groupby.size to aggregate:
out = (df.groupby(['country', 'rating',
df['owner'].str.title().str.replace(r'\s+', ' ', regex=True)])
.size().reset_index(name='count')
)
Output:
country rating owner count
0 England a John Smith 1
1 England b John Smith 1
2 France a Frank Foo 3
3 France b Frank Foo 1
Use Series.str.strip, Series.str.title and remove multiple spaces with Series.str.replace, then aggregate with GroupBy.size:
DataFrameGroupBy.count counts non-missing values, which seems unnecessary here.
df1 = (df.groupby(['country','rating',
df['owner'].str.strip().str.title().str.replace(r'\s+',' ', regex=True)])
.size()
.reset_index(name='count'))
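As a self-contained sketch with the question's data (note the groupby list mixes column names with a normalized Series, which pandas aligns on the index):

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['England', 'England', 'France', 'France', 'France', 'France'],
    'rating':  ['a', 'b', 'a', 'a', 'a', 'b'],
    'owner':   ['John Smith', 'John Smith', 'Frank Foo', 'Frank foo',
                'Frank  Foo', 'Frank Foo'],
})

# Normalize case and whitespace so spelling variants collapse into one key
owner = df['owner'].str.strip().str.title().str.replace(r'\s+', ' ', regex=True)

# size() counts rows per group, including any missing values
out = (df.groupby(['country', 'rating', owner])
         .size()
         .reset_index(name='count'))
```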

How to concatenate two columns so that one comes below the other in pandas dataframe?

Col-1 Col-2
0 Erin Tanya
1 Cathy Tom
2 Ross Wes
This is my dataset
I need the result to look like this:
New_column
0 Erin
1 Cathy
2 Ross
3 Tanya
4 Tom
5 Wes
I tried using .map, append, concat and .ravel but no luck. Any help would be appreciated :)
Use DataFrame.melt:
df1 = df[['Col-1','Col-2']].melt(value_name='New_column')[['New_column']]
print (df1)
New_column
0 Erin
1 Cathy
2 Ross
3 Tanya
4 Tom
5 Wes
With numpy.ravel, you can use:
out = pd.DataFrame({'New_column': df.to_numpy().ravel(order='F')})
# or
out = pd.DataFrame({'New_column': np.ravel(df, order='F')})
output:
New_column
0 Erin
1 Cathy
2 Ross
3 Tanya
4 Tom
5 Wes
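Both answers in one runnable sketch, using the question's sample data; ravel(order='F') reads the array column by column, which matches melt's ordering here:

```python
import pandas as pd

df = pd.DataFrame({'Col-1': ['Erin', 'Cathy', 'Ross'],
                   'Col-2': ['Tanya', 'Tom', 'Wes']})

# Option 1: melt stacks the columns and keeps only the values
df1 = df.melt(value_name='New_column')[['New_column']]

# Option 2: ravel in column-major (Fortran) order gives the same ordering
out = pd.DataFrame({'New_column': df.to_numpy().ravel(order='F')})
```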

Add a column stating whether a record occurred across datasets

I have 2 dfs that I want to concat together and then deduplicate. Before dropping the duplicates, I want to add a column stating whether a record from df_b (which will be dropped by the deduplication) occurred in both dfs; for records that appear only once across the dfs, the column should stay blank.
Desired result df_combined
df_a
title director
0 Toy Story John Lasseter
1 Goodfellas Martin Scorsese
2 Meet the Fockers Jay Roach
3 The Departed Martin Scorsese
df_b
title director
0 Toy Story John Lass
1 The Hangover Todd Phillips
2 Rocky John Avildsen
3 The Departed Martin Scorsese
df_combine = pd.concat([df_a, df_b], ignore_index=True, sort=False)
df_combined
title director occurence_both
0 Toy Story John Lasseter b
1 Goodfellas Martin Scorsese
2 Meet the Fockers Jay Roach
3 The Departed Martin Scorsese b
5 The Hangover Todd Phillips
6 Rocky John Avildsen
We can use duplicated with keep=False to mark all duplicates and np.where to convert from boolean series to 'b' and ''. Then followup with drop_duplicates to remove duplicate rows. Both operations should be subset to only the title column:
df_combine = pd.concat([df_a, df_b], ignore_index=True, sort=False)
# Mark Duplicates
df_combine['occurence_both'] = np.where(
df_combine.duplicated(subset='title', keep=False), 'b', ''
)
# Drop Duplicates
df_combine = df_combine.drop_duplicates(subset='title')
df_combine:
title director occurence_both
0 Toy Story John Lasseter b
1 Goodfellas Martin Scorsese
2 Meet the Fockers Jay Roach
3 The Departed Martin Scorsese b
5 The Hangover Todd Phillips
6 Rocky John Avildsen
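The full answer as a runnable sketch, rebuilding the sample frames from the question:

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({'title': ['Toy Story', 'Goodfellas', 'Meet the Fockers', 'The Departed'],
                     'director': ['John Lasseter', 'Martin Scorsese', 'Jay Roach', 'Martin Scorsese']})
df_b = pd.DataFrame({'title': ['Toy Story', 'The Hangover', 'Rocky', 'The Departed'],
                     'director': ['John Lass', 'Todd Phillips', 'John Avildsen', 'Martin Scorsese']})

df_combine = pd.concat([df_a, df_b], ignore_index=True, sort=False)

# keep=False marks every copy of a duplicated title, so rows from
# both frames get flagged before the duplicates are removed
df_combine['occurence_both'] = np.where(
    df_combine.duplicated(subset='title', keep=False), 'b', '')

# keep the first occurrence (df_a's row) of each title
df_combine = df_combine.drop_duplicates(subset='title')
```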

Pandas dataframe long to wide grouping by column with duplicated element

Hello, I imported a dataframe which has no headers.
I created some headers using
df=pd.read_csv(path, names=['Prim Index', 'Alt Index', 'Aka', 'Name', 'Unnamed9'])
Then, I only keep
df=df[['Prim Index', 'Name']]
My question is how do I reshape df from long to wide: since 'Prim Index' is duplicated, I would like to have each unique Prim Index in one row and its names in separate columns.
Thanks in advance! I appreciate any help on this!
Current df
Prim Index Alt Index Aka Name Unnamed9
1 2345 aka Marcus 0
1 7634 aka Tiffany 0
1 3242 aka Royce 0
2 8765 aka Charlotte 0
2 4343 aka Sara 0
3 9825 aka Keith 0
4 6714 aka Jennifer 0
5 7875 aka Justin 0
5 1345 aka Diana 0
6 6591 aka Liz 0
Desired df
Prim Index Name1 Name2 Name3 Name4
1 Marcus Tiffany Royce
2 Charlotte Sara
3 Keith
4 Jennifer
5 Justin Diana
6 Liz
Use GroupBy.cumcount for counter with DataFrame.set_index for MultiIndex, then reshape by Series.unstack and change columns names by DataFrame.add_prefix:
df1 = (df.set_index(['Prim Index', df.groupby('Prim Index').cumcount().add(1)])['Name']
.unstack(fill_value='')
.add_prefix('Name'))
print (df1)
Name1 Name2 Name3
Prim Index
1 Marcus Tiffany Royce
2 Charlotte Sara
3 Keith
4 Jennifer
5 Justin Diana
6 Liz
If there always have to be 4 names, add DataFrame.reindex with a range:
df1 = (df.set_index(['Prim Index', df.groupby('Prim Index').cumcount().add(1)])['Name']
.unstack(fill_value='')
.reindex(range(1, 5), fill_value='', axis=1)
.add_prefix('Name'))
print (df1)
Name1 Name2 Name3 Name4
Prim Index
1 Marcus Tiffany Royce
2 Charlotte Sara
3 Keith
4 Jennifer
5 Justin Diana
6 Liz
Using pivot_table, you can get a solution similar to what @jezreal did.
c = ['Prim Index','Name']
d = [[1,'Marcus'],[1,'Tiffany'],[1,'Royce'],
[2,'Charlotte'],[2,'Sara'],
[3,'Keith'],
[4,'Jennifer'],
[5,'Justin'],
[5,'Diana'],
[6,'Liz']]
import pandas as pd
df = pd.DataFrame(data = d,columns=c)
print (df)
df=(pd.pivot_table(df,index='Prim Index',
columns=df.groupby('Prim Index').cumcount().add(1),values='Name',aggfunc='sum',fill_value='')
.add_prefix('Name'))
df = df.reset_index()
print (df)
output of this will be:
Prim Index Name1 Name2 Name3
0 1 Marcus Tiffany Royce
1 2 Charlotte Sara
2 3 Keith
3 4 Jennifer
4 5 Justin Diana
5 6 Liz
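The cumcount/unstack answer as a runnable sketch, with the question's data reduced to the two kept columns:

```python
import pandas as pd

df = pd.DataFrame({'Prim Index': [1, 1, 1, 2, 2, 3, 4, 5, 5, 6],
                   'Name': ['Marcus', 'Tiffany', 'Royce', 'Charlotte', 'Sara',
                            'Keith', 'Jennifer', 'Justin', 'Diana', 'Liz']})

# Per-group counter (1, 2, 3, ...) becomes the new column number
counter = df.groupby('Prim Index').cumcount().add(1)

df1 = (df.set_index(['Prim Index', counter])['Name']
         .unstack(fill_value='')       # counter level pivots into columns
         .add_prefix('Name')
         .reset_index())
```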

split index column based on existence of a substring

I have the following df:
stuff
james__America by Estonia 2
luke__Spain by Italy 3
michael 4
Louis__Portugal by USA 2
If the substring "__" exists in the index, I would like to split the index and create 2 new columns next to it, then make a second split on ' by ', in order to get the following output:
name1 name2 stuff
james America Estonia 2
luke Spain Italy 3
michael 0 0 4
Louis Portugal USA 2
I thought using :
df.index.str.split('__', expand=True).split(' by ',expand=True).rename(columns={0:'name1',1:'name2'})
However it does not seem to work.
Convert the Index to a Series with Index.to_series, split on the first separator with Series.str.split, then split the second column again, join the original columns, and finally overwrite the index:
df1 = df.index.to_series().str.split('__', expand=True)
df2 = df1[1].str.split(' by ',expand=True).rename(columns={0:'name1',1:'name2'}).fillna('0')
df = df2.join(df)
df.index = df1[0].rename(None)
print (df)
name1 name2 stuff
james America Estonia 2
luke Spain Italy 3
michael 0 0 4
Louis Portugal USA 2
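The same steps as a self-contained sketch, rebuilding the question's frame (rows without '__' produce NaN after the first split, which fillna turns into '0'):

```python
import pandas as pd

df = pd.DataFrame({'stuff': [2, 3, 4, 2]},
                  index=['james__America by Estonia', 'luke__Spain by Italy',
                         'michael', 'Louis__Portugal by USA'])

# First split: name vs. the 'X by Y' remainder
parts = df.index.to_series().str.split('__', expand=True)

# Second split on ' by '; rows lacking '__' become NaN and are filled with '0'
names = (parts[1].str.split(' by ', expand=True)
                 .rename(columns={0: 'name1', 1: 'name2'})
                 .fillna('0'))

out = names.join(df)          # indices still match, so join aligns rows
out.index = parts[0].rename(None)
```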