(pandas) Create new column based on first element in groupby object - pandas

Say I have the following dataframe:
>>> df = pd.DataFrame({'Person': ['bob', 'jim', 'joe', 'bob', 'jim', 'joe'], 'Color':['blue', 'green', 'orange', 'yellow', 'pink', 'purple']})
>>> df
    Color Person
0    blue    bob
1   green    jim
2  orange    joe
3  yellow    bob
4    pink    jim
5  purple    joe
And I want to create a new column that represents the first color seen for each person:
    Color Person First Color
0    blue    bob        blue
1   green    jim       green
2  orange    joe      orange
3  yellow    bob        blue
4    pink    jim       green
5  purple    joe      orange
I have come to a solution but it seems really inefficient:
>>> df['First Color'] = 0
>>> groups = df.groupby(['Person'])['Color']
>>> for g in groups:
...     first_color = g[1].iloc[0]
...     df['First Color'].loc[df['Person']==g[0]] = first_color
Is there a faster way to do this all at once where it doesn't have to iterate through the groupby object?

You need transform with first:
print (df.groupby('Person')['Color'].transform('first'))
0      blue
1     green
2    orange
3      blue
4     green
5    orange
Name: Color, dtype: object
df['First_Col'] = df.groupby('Person')['Color'].transform('first')
print (df)
    Color Person First_Col
0    blue    bob      blue
1   green    jim     green
2  orange    joe    orange
3  yellow    bob      blue
4    pink    jim     green
5  purple    joe    orange

Use the transform() method:
In [177]: df['First_Col'] = df.groupby('Person')['Color'].transform('first')
In [178]: df
Out[178]:
    Color Person First_Col
0    blue    bob      blue
1   green    jim     green
2  orange    joe    orange
3  yellow    bob      blue
4    pink    jim     green
5  purple    joe    orange
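Both answers rely on the same transform('first') call, which computes each group's first value and broadcasts it back to every row of that group. A self-contained sketch, reconstructing the question's dataframe:

```python
import pandas as pd

df = pd.DataFrame({'Person': ['bob', 'jim', 'joe', 'bob', 'jim', 'joe'],
                   'Color': ['blue', 'green', 'orange', 'yellow', 'pink', 'purple']})

# transform returns a result aligned to the original index, so it can be
# assigned directly as a new column -- no iteration over groups needed
df['First Color'] = df.groupby('Person')['Color'].transform('first')
print(df)
```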


Create feature matrix from Dataframe

I would like to transform a dataframe into a feature matrix (actually, I'm not sure it is called a feature matrix).
df = pd.DataFrame({'Car': ['Audi', 'Toyota', 'Chrysler', 'Toyota', 'Chrysler', 'Chrysler'],
                   'Color': ['red', 'red', 'blue', 'silver', 'blue', 'silver']})
        Car   Color
0      Audi     red
1    Toyota     red
2  Chrysler    blue
3    Toyota  silver
4  Chrysler    blue
5  Chrysler  silver
I would like to create a matrix with colors as the index and cars as the columns, where a True or 1 marks a possible combination, as follows:
    Color  Audi  Chrysler  Toyota
0    blue     0         1       0
1     red     1         0       1
2  silver     0         1       1
I can create a matrix and then iterate over the rows and enter the values, but this takes quite long. Is there a better way to create this matrix?
Kind regards,
Stephan
pivot_table would seem to apply here:
df.pivot_table(index="Car", columns="Color", aggfunc=len)
Which gives:
Color     blue  red  silver
Car
Audi       NaN  1.0     NaN
Chrysler   2.0  NaN     1.0
Toyota     NaN  1.0     1.0
You specify the vertical component as the index column (Car), and the horizontal one as the columns component (Color), then provide a function to fill the cells (len).
Then, to nuance it a little, you could use fillna() to "paint" the empty cells with zeros. And apply a logical test to show which ones are "possible"
e.g.
df.pivot_table(index="Car", columns="Color", aggfunc=len).fillna(0)>0
Which gives:
Color      blue    red  silver
Car
Audi      False   True   False
Chrysler   True  False    True
Toyota    False   True    True
And as a final bit of polish, having learned about it from here, you could run an applymap to get your 0,1 output:
(df.pivot_table(index="Car", columns="Color", aggfunc=len).fillna(0)>0).applymap(lambda x : 1 if x==True else 0)
Giving:
Color     blue  red  silver
Car
Audi         0    1       0
Chrysler     1    0       1
Toyota       0    1       1
Finally, this process is sometimes referred to in the literature as One Hot Encoding and there are some cool implementations such as this one from sklearn in case your investigations lead you in that direction.
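As a related aside (not part of the answer above), pd.crosstab builds the same co-occurrence table in a single call; clipping the counts at 1 then gives the 0/1 output directly. A minimal sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({'Car': ['Audi', 'Toyota', 'Chrysler', 'Toyota', 'Chrysler', 'Chrysler'],
                   'Color': ['red', 'red', 'blue', 'silver', 'blue', 'silver']})

# crosstab counts each (Color, Car) combination; clip(upper=1) converts
# the raw counts into 0/1 "possible combination" flags
matrix = pd.crosstab(df['Color'], df['Car']).clip(upper=1)
print(matrix)
```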
In extension to Thomas's answer, the code below should give exactly the output you desire:
import pandas as pd
df = pd.DataFrame({'Car': ['Audi', 'Toyota', 'Chrysler', 'Toyota', 'Chrysler', 'Chrysler'],
                   'Color': ['red', 'red', 'blue', 'silver', 'blue', 'silver']})
output = (df.pivot_table(index="Car", columns="Color", aggfunc=len).fillna(0).T > 0).astype(int)
print(output)
Car     Audi  Chrysler  Toyota
Color
blue       0         1       0
red        1         0       1
silver     0         1       1

How do you groupby and aggregate using conditional statements in Pandas?

Expanding on the question here, I'm wondering how to add aggregation to the following based on conditions:
Index  Name  Item          Quantity
0      John  Apple Red           10
1      John  Apple Green          5
2      John  Orange Cali         12
3      Jane  Apple Red           10
4      Jane  Apple Green          5
5      Jane  Orange Cali         18
6      Jane  Orange Spain         2
7      John  Banana               3
8      Jane  Coconut              5
9      John  Lime                10
... and so forth
What I need to do is get this data converted into a dataframe like the following. Note: I am only interested in the total quantities of the apples and the oranges, each in a separate column, i.e. whatever other items appear in a given group are not to be included in the aggregation done on the "Quantity" column (but they should still appear as strings in the "All Items" column):
Index  Name  All Items                                          Apples Total  Oranges Total
0      John  Apple Red, Apple Green, Orange Cali, Banana, Lime            15             12
1      Jane  Apple Red, Apple Green, Orange Cali, Coconut                 15             20
How do I achieve that? Many thanks in advance!
You can use groupby and pivot_table after extracting Apple and Orange sub strings as below:
import re
s = df['Item'].str.extract("(Apple|Orange)",expand=False,flags=re.I)
# re.I used above is optional and is used for case insensitive matching
a = df.assign(Item_1=s).dropna(subset=['Item_1'])
out = (a.groupby("Name")['Item'].agg(",".join).to_frame().join(
       a.pivot_table("Quantity","Name","Item_1",aggfunc=sum).add_suffix("_Total"))
       .reset_index())
print(out)
   Name                                            Item  Apple_Total  Orange_Total
0  Jane  Apple Red,Apple Green,Orange Cali,Orange Spain           15            20
1  John               Apple Red,Apple Green,Orange Cali           15            12
EDIT:
For the edited question, you can use the same code, except that you group on the original dataframe df instead of the subset a before joining:
out = (df.groupby("Name")['Item'].agg(",".join).to_frame().join(
       a.pivot_table("Quantity","Name","Item_1",aggfunc=sum).add_suffix("_Total"))
       .reset_index())
print(out)
   Name                                               Item  Apple_Total  Orange_Total
0  Jane  Apple Red,Apple Green,Orange Cali,Orange Spain...           15            20
1  John      Apple Red,Apple Green,Orange Cali,Banana,Lime           15            12
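For reference, the edited version runs end to end once the question's sample data is reconstructed (Banana/Coconut/Lime rows included):

```python
import re
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'John', 'John', 'Jane', 'Jane', 'Jane', 'Jane', 'John', 'Jane', 'John'],
    'Item': ['Apple Red', 'Apple Green', 'Orange Cali', 'Apple Red', 'Apple Green',
             'Orange Cali', 'Orange Spain', 'Banana', 'Coconut', 'Lime'],
    'Quantity': [10, 5, 12, 10, 5, 18, 2, 3, 5, 10],
})

# Tag each row with its fruit family (NaN for anything else), then drop the rest
s = df['Item'].str.extract("(Apple|Orange)", expand=False, flags=re.I)
a = df.assign(Item_1=s).dropna(subset=['Item_1'])

# Join the full item list (from df) with per-family totals (from a)
out = (df.groupby("Name")['Item'].agg(", ".join).to_frame()
         .join(a.pivot_table("Quantity", "Name", "Item_1", aggfunc="sum").add_suffix("_Total"))
         .reset_index())
print(out)
```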
First, filter only the required rows using str.contains on the Item column:
from io import StringIO
import pandas as pd
s = StringIO("""Name;Item;Quantity
John;Apple Red;10
John;Apple Green;5
John;Orange Cali;12
Jane;Apple Red;10
Jane;Apple Green;5
Jane;Orange Cali;18
Jane;Orange Spain;2
John;Banana;3
Jane;Coconut;5
John;Lime;10
""")
df = pd.read_csv(s,sep=';')
req_items_idx = df[df.Item.str.contains('Apple|Orange')].index
df_filtered = df.loc[req_items_idx,:]
Once you have them you can further pivot the data to get the required values based on Name
pivot_df = pd.pivot_table(df_filtered,index=['Name'],columns=['Item'],aggfunc='sum')
pivot_df.columns = pivot_df.columns.droplevel()
pivot_df.columns.name = None
pivot_df = pivot_df.reset_index()
Generate the Totals for Apples and Oranges
orange_columns = pivot_df.columns[pivot_df.columns.str.contains('Orange')].tolist()
apple_columns = pivot_df.columns[pivot_df.columns.str.contains('Apple')].tolist()
pivot_df['Apples Total'] = pivot_df.loc[:,apple_columns].sum(axis=1)
pivot_df['Orange Total'] = pivot_df.loc[:,orange_columns].sum(axis=1)
A wrapper function to combine the Items together
def combine_items(inp,columns):
    res = []
    for val,col in zip(inp.values,columns):
        if not pd.isnull(val):
            res += [col]
    return ','.join(res)
req_columns = apple_columns+orange_columns
pivot_df['Items'] = pivot_df[apple_columns+orange_columns].apply(combine_items,args=([req_columns]),axis=1)
Finally you can get the required columns in a single place and print the values
total_columns = pivot_df.columns[pivot_df.columns.str.contains('Total')].tolist()
name_item_columns = pivot_df.columns[pivot_df.columns.str.contains('Name|Items')].tolist()
>>> pivot_df[name_item_columns+total_columns]
   Name                                           Items  Apples Total  Orange Total
0  Jane  Apple Green,Apple Red,Orange Cali,Orange Spain          15.0          20.0
1  John               Apple Green,Apple Red,Orange Cali          15.0          12.0
The answer is intended to outline the individual steps and the approach one can take to solve something similar to this.
Edits: fixed a bug.
To do this, before doing your groupby you can create your Total columns. These will contain the number of apples or oranges in that row, depending on whether that row's Item is an apple or an orange.
df['Apples Total'] = df.apply(lambda x: x.Quantity if ('Apple' in x.Item) else 0, axis=1)
df['Oranges Total'] = df.apply(lambda x: x.Quantity if ('Orange' in x.Item) else 0, axis=1)
When this is in place, groupby name and aggregate on each column. Sum on the total columns, and aggregate to list on the item column.
df.groupby('Name').agg({'Apples Total': 'sum',
                        'Oranges Total': 'sum',
                        'Item': lambda x: list(x)
                        })
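The two row-wise apply calls can also be vectorized with Series.where, which keeps Quantity where the condition holds and substitutes 0 elsewhere. A sketch on the sample data (the variable names here are mine):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'John', 'John', 'Jane', 'Jane', 'Jane', 'Jane', 'John', 'Jane', 'John'],
    'Item': ['Apple Red', 'Apple Green', 'Orange Cali', 'Apple Red', 'Apple Green',
             'Orange Cali', 'Orange Spain', 'Banana', 'Coconut', 'Lime'],
    'Quantity': [10, 5, 12, 10, 5, 18, 2, 3, 5, 10],
})

# Series.where keeps the quantity for matching rows and writes 0 elsewhere
df['Apples Total'] = df['Quantity'].where(df['Item'].str.contains('Apple'), 0)
df['Oranges Total'] = df['Quantity'].where(df['Item'].str.contains('Orange'), 0)

out = df.groupby('Name').agg({'Apples Total': 'sum',
                              'Oranges Total': 'sum',
                              'Item': ', '.join}).reset_index()
print(out)
```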
from io import StringIO
import numpy as np
import pandas as pd

df = pd.read_csv(StringIO("""
Index,Name,Item,Quantity
0,John,Apple Red,10
1,John,Apple Green,5
2,John,Orange Cali,12
3,Jane,Apple Red,10
4,Jane,Apple Green,5
5,Jane,Orange Cali,18
6,Jane,Orange Spain,2
7,John,Banana,3
8,Jane,Coconut,5
9,John,Lime,10
"""))
Getting list of items
grouping by name to get the list of items
items_list = pd.DataFrame(df.groupby(["Name"])["Item"].apply(list)).rename(columns={"Item": "All Items"})
items_list
                                                           All Items
Name
Jane  [Apple Red, Apple Green, Orange Cali, Orange Spain, Coconut]
John            [Apple Red, Apple Green, Orange Cali, Banana, Lime]
getting count of name item groups
renaming the temp df items column such that all the apples/oranges are treated similarly
temp2 = df.groupby(["Name", "Item"])['Quantity'].apply(sum)
temp2 = pd.DataFrame(temp2).reset_index().set_index("Name")
temp2['Item'] = temp2['Item'].str.replace(r'(?:.*)(apple|orange)(?:.*)', r'\1', case=False,regex=True)
temp2
         Item  Quantity
Name
Jane    Apple         5
Jane    Apple        10
Jane  Coconut         5
Jane   Orange        18
Jane   Orange         2
John    Apple         5
John    Apple        10
John   Banana         3
John     Lime        10
John   Orange        12
getting the required pivot table
pivot table for getting items count as separate column and retaining just apple orange count
pivot_df = pd.pivot_table(temp2, values='Quantity', columns='Item', index=["Name"], aggfunc=np.sum)
pivot_df = pivot_df[['Apple', 'Orange']]
pivot_df
Item  Apple  Orange
Name
Jane   15.0    20.0
John   15.0    12.0
merging the items list df and the pivot_df
output = items_list.merge(pivot_df, on="Name").rename(columns={'Apple': 'Apples Total', 'Orange': 'Oranges Total'})
output
                                                           All Items  Apples Total  Oranges Total
Name
Jane  [Apple Red, Apple Green, Orange Cali, Orange Spain, Coconut]          15.0           20.0
John            [Apple Red, Apple Green, Orange Cali, Banana, Lime]          15.0           12.0

How do you “pivot” using conditions, aggregation, and concatenation in Pandas?

I have a dataframe in a format such as the following:
Index  Name  Fruit         Quantity
0      John  Apple Red           10
1      John  Apple Green          5
2      John  Orange Cali         12
3      Jane  Apple Red           10
4      Jane  Apple Green          5
5      Jane  Orange Cali         18
6      Jane  Orange Spain         2
I need to turn it into a dataframe such as this:
Index  Name  All Fruits                                         Apples Total  Oranges Total
0      John  Apple Red, Apple Green, Orange Cali                          15             12
1      Jane  Apple Red, Apple Green, Orange Cali, Orange Spain            15             20
Question is how do I do this? I have looked at the groupby docs as well as a number of posts on pivot and aggregation but translating that into this use case somehow escapes me. Any help or pointers much appreciated.
Cheers!
Use GroupBy.agg with join, create a helper column F by splitting Fruit, pass it to DataFrame.pivot_table, and finally join everything together with DataFrame.join:
df1 = df.groupby('Name', sort=False)['Fruit'].agg(', '.join)
df2 = (df.assign(F = df['Fruit'].str.split().str[0])
         .pivot_table(index='Name',
                      columns='F',
                      values='Quantity',
                      aggfunc='sum')
         .add_suffix(' Total'))
df3 = df1.to_frame('All Fruits').join(df2).reset_index()
print (df3)
   Name                                         All Fruits  Apple Total  Orange Total
0  John                Apple Red, Apple Green, Orange Cali           15            12
1  Jane  Apple Red, Apple Green, Orange Cali, Orange Spain           15            20
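Putting the three steps together so they run on the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'John', 'John', 'Jane', 'Jane', 'Jane', 'Jane'],
    'Fruit': ['Apple Red', 'Apple Green', 'Orange Cali', 'Apple Red',
              'Apple Green', 'Orange Cali', 'Orange Spain'],
    'Quantity': [10, 5, 12, 10, 5, 18, 2],
})

# 1) concatenate every fruit per person, 2) pivot per-family totals
#    (F is the first word of Fruit: 'Apple' or 'Orange'), 3) join the two
df1 = df.groupby('Name', sort=False)['Fruit'].agg(', '.join)
df2 = (df.assign(F=df['Fruit'].str.split().str[0])
         .pivot_table(index='Name', columns='F', values='Quantity', aggfunc='sum')
         .add_suffix(' Total'))
df3 = df1.to_frame('All Fruits').join(df2).reset_index()
print(df3)
```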

Pandas filter maximum groupby

I have Pandas df:
family  age  fruits
-------------------
Brown    12       7
Brown    33       5
Yellow   28       3
Yellow   11       9
I want to get ages with the following conditions:
Group by family;
Having the maximum of fruits.
So result df will be:
family  age
-----------
Brown    12
Yellow   11
We can do:
(df.sort_values(['family','fruits'], ascending=[True,False])
.drop_duplicates('family')
)
Output:
   family  age  fruits
0   Brown   12       7
3  Yellow   11       9
Or with groupby().idxmax()
df.loc[df.groupby('family').fruits.idxmax(), ['family','age'] ]
Output:
   family  age
0   Brown   12
3  Yellow   11
Use head after sort_values
(df.sort_values(['family','fruits'], ascending=[True,False])
   .groupby('family').head(1))
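As a self-contained check of the idxmax variant (which also drops the fruits column, matching the desired output):

```python
import pandas as pd

df = pd.DataFrame({'family': ['Brown', 'Brown', 'Yellow', 'Yellow'],
                   'age': [12, 33, 28, 11],
                   'fruits': [7, 5, 3, 9]})

# idxmax returns, per family, the index label of the row with the most fruits;
# loc then selects just those rows and the columns we want
result = df.loc[df.groupby('family')['fruits'].idxmax(), ['family', 'age']]
print(result)
```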

stop pandas from renaming columns with same name so i can use wide to long

I have an excel file that im reading into pandas that looks similar to this
name   size  color  material  size  color   material  size  color  material
bob    m     red    coton     m     yellow  cotton    m     green  dri-fit
james  l     green  dri-fit   l     green   cotton    l     red    cotton
steve  l     green  dri-fit   l     green   cotton    l     red    cotton
I want to tally all my shirt types into something like this
l green dri-fit 2
l red coton 2
m red coton 1
I am using pandas ExcelFile to read the file into a file object, then using parse to parse the sheet into a dataframe.
import pandas as pd
file = pd.ExcelFile('myexcelfile.xlsx')
df = file.parse('sheet1')
To try to get to my desired output, I am trying to use wide_to_long. The problem is that, because some of my columns have the same names, pandas renames them when reading the file in. The second instance of size, for example, automatically becomes size.1, and the same happens with color and material. If I try to use stubnames with wide_to_long, it complains that the first instance of size ... "stubname can't be identical to a column name".
Is there any way to use wide to long prior to pandas renaming my columns?
The column numbering is problematic for pd.wide_to_long, so we need to modify the first instance of the column names, adding a .0, so they don't conflict with the stubs.
Sample Data
import pandas as pd
df = pd.read_clipboard()
print(df)
    name size  color material size.1 color.1 material.1 size.2 color.2 material.2
0    bob    m    red    coton      m  yellow     cotton      m   green    dri-fit
1  james    l  green  dri-fit      l   green     cotton      l     red     cotton
2  steve    l  green  dri-fit      l   green     cotton      l     red     cotton
Code:
stubs = ['size', 'color', 'material']
d = {x: f'{x}.0' for x in stubs}
df.columns = [d.get(k, k) for k in df.columns]
res = pd.wide_to_long(df, i='name', j='num', sep='.', stubnames=stubs)
# size color material
#name num
#bob 0 m red coton
#james 0 l green dri-fit
#steve 0 l green dri-fit
#bob 1 m yellow cotton
#james 1 l green cotton
#steve 1 l green cotton
#bob 2 m green dri-fit
#james 2 l red cotton
#steve 2 l red cotton
res.groupby([*res]).size()
#size color material
#l green cotton 2
# dri-fit 2
# red cotton 2
#m green dri-fit 1
# red coton 1
# yellow cotton 1
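Since read_clipboard isn't reproducible, here is the same rename-then-wide_to_long pipeline with the frame built explicitly (duplicate headers already suffixed .1/.2, as pandas produces them):

```python
import pandas as pd

df = pd.DataFrame(
    [['bob',   'm', 'red',   'coton',   'm', 'yellow', 'cotton', 'm', 'green', 'dri-fit'],
     ['james', 'l', 'green', 'dri-fit', 'l', 'green',  'cotton', 'l', 'red',   'cotton'],
     ['steve', 'l', 'green', 'dri-fit', 'l', 'green',  'cotton', 'l', 'red',   'cotton']],
    columns=['name', 'size', 'color', 'material',
             'size.1', 'color.1', 'material.1',
             'size.2', 'color.2', 'material.2'])

stubs = ['size', 'color', 'material']
# Suffix the first occurrence with .0 so no column name equals a bare stub
df.columns = [f'{c}.0' if c in stubs else c for c in df.columns]

res = pd.wide_to_long(df, i='name', j='num', sep='.', stubnames=stubs)
tally = res.groupby(stubs).size()
print(tally)
```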
value_counts
import numpy as np

cols = ['size', 'color', 'material']
s = pd.value_counts([*zip(*map(np.ravel, map(df.get, cols)))])
(l, red, cotton) 2
(l, green, cotton) 2
(l, green, dri-fit) 2
(m, green, dri-fit) 1
(m, yellow, cotton) 1
(m, red, coton) 1
dtype: int64
Counter
And more to my liking
from collections import Counter
s = pd.Series(Counter([*zip(*map(np.ravel, map(df.get, cols)))]))
s.rename_axis(['size', 'color', 'material']).reset_index(name='freq')
  size   color material  freq
0    m     red    coton     1
1    m  yellow   cotton     1
2    m   green  dri-fit     1
3    l   green  dri-fit     2
4    l   green   cotton     2
5    l     red   cotton     2
CODE BELOW:
import pandas as pd

df = pd.read_excel('C:/Users/me/Desktop/sovrflw_data.xlsx')
df.drop('name', axis=1, inplace=True)
arr = df.values.reshape(-1, 3)
df2 = pd.DataFrame(arr, columns=['size','color','material'])
df2['count']=1
df2.groupby(['size','color','material'],as_index=False).count()