How do you "pivot" using conditions, aggregation, and concatenation in Pandas?

I have a dataframe in a format such as the following:
Index  Name  Fruit         Quantity
0      John  Apple Red     10
1      John  Apple Green   5
2      John  Orange Cali   12
3      Jane  Apple Red     10
4      Jane  Apple Green   5
5      Jane  Orange Cali   18
6      Jane  Orange Spain  2
I need to turn it into a dataframe such as this:
Index  Name  All Fruits                                         Apples Total  Oranges Total
0      John  Apple Red, Apple Green, Orange Cali                15            12
1      Jane  Apple Red, Apple Green, Orange Cali, Orange Spain  15            20
How do I do this? I have looked at the groupby docs and at a number of posts on pivot and aggregation, but I can't work out how to translate them to this use case. Any help or pointers would be much appreciated.

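Use GroupBy.agg with join to build the combined string, create a helper column F from the first word of Fruit with str.split, pass it to DataFrame.pivot_table, and finally combine the two results with DataFrame.join:
df1 = df.groupby('Name', sort=False)['Fruit'].agg(', '.join)
df2 = (df.assign(F = df['Fruit'].str.split().str[0])
         .pivot_table(index='Name', columns='F', values='Quantity', aggfunc='sum')
         .add_suffix(' Total'))
df3 = df1.to_frame('All Fruits').join(df2).reset_index()
print (df3)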
   Name                                         All Fruits  Apple Total  Orange Total
0  John                Apple Red, Apple Green, Orange Cali           15            12
1  Jane  Apple Red, Apple Green, Orange Cali, Orange Spain           15            20
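For anyone who wants to run the approach end to end, here is a minimal sketch that rebuilds the sample frame from the question and applies the groupby, pivot_table and join steps. The variable names all_fruits, totals and result are only illustrative, and the pivot_table arguments are an assumption about how the helper column F is meant to be used.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "Name": ["John", "John", "John", "Jane", "Jane", "Jane", "Jane"],
    "Fruit": ["Apple Red", "Apple Green", "Orange Cali",
              "Apple Red", "Apple Green", "Orange Cali", "Orange Spain"],
    "Quantity": [10, 5, 12, 10, 5, 18, 2],
})

# One comma-separated string of fruits per name, keeping first-seen order
all_fruits = df.groupby("Name", sort=False)["Fruit"].agg(", ".join)

# The first word of each fruit ("Apple", "Orange") becomes the pivot column
totals = (
    df.assign(F=df["Fruit"].str.split().str[0])
      .pivot_table(index="Name", columns="F", values="Quantity", aggfunc="sum")
      .add_suffix(" Total")
)

# Join on the shared Name index, then turn Name back into a column
result = all_fruits.to_frame("All Fruits").join(totals).reset_index()
print(result)
```

Because the join is index-based, the row order follows the left-hand frame (John, then Jane), while the pivot itself is sorted by Name.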


How do you groupby and aggregate using conditional statements in Pandas?

Expanding on the question here, I'm wondering how to add aggregation to the following based on conditions:
Index  Name  Item          Quantity
0      John  Apple Red     10
1      John  Apple Green   5
2      John  Orange Cali   12
3      Jane  Apple Red     10
4      Jane  Apple Green   5
5      Jane  Orange Cali   18
6      Jane  Orange Spain  2
7      John  Banana        3
8      Jane  Coconut       5
9      John  Lime          10
... And so forth
I need to convert this data into a dataframe like the following. Note: I am only interested in the total quantity of the apples and of the oranges, each in its own column. Any other items that appear in a group must not be included in the aggregation on the "Quantity" column, but they should still appear as strings in the "All Items" column:
Index  Name  All Items                                          Apples Total  Oranges Total
0      John  Apple Red, Apple Green, Orange Cali, Banana, Lime  15            12
1      Jane  Apple Red, Apple Green, Orange Cali, Coconut       15            20
How do I achieve that? Many thanks in advance!
You can use groupby and pivot_table after extracting the Apple and Orange substrings, as below:
import re
s = df['Item'].str.extract("(Apple|Orange)",expand=False,flags=re.I)
# re.I is optional; it makes the match case-insensitive
a = df.assign(Item_1=s).dropna(subset=['Item_1'])
out = (a.groupby("Name")['Item'].agg(",".join).to_frame().join(
    a.pivot_table(index="Name",columns="Item_1",values="Quantity",aggfunc="sum")
     .add_suffix("_Total")).reset_index())
print(out)
   Name                                            Item  Apple_Total  Orange_Total
0  Jane  Apple Red,Apple Green,Orange Cali,Orange Spain           15            20
1  John               Apple Red,Apple Green,Orange Cali           15            12
For the edited question, use the same code, except call groupby on the original dataframe df instead of the subset a, and then join:
out = (df.groupby("Name")['Item'].agg(",".join).to_frame().join(
    a.pivot_table(index="Name",columns="Item_1",values="Quantity",aggfunc="sum")
     .add_suffix("_Total")).reset_index())
print(out)
   Name                                               Item  Apple_Total  Orange_Total
0  Jane  Apple Red,Apple Green,Orange Cali,Orange Spain...           15            20
1  John      Apple Red,Apple Green,Orange Cali,Banana,Lime           15            12
First, filter only the required rows using str.contains on the Item column:
from io import StringIO
import pandas as pd
s = StringIO("""Name;Item;Quantity
John;Apple Red;10
John;Apple Green;5
John;Orange Cali;12
Jane;Apple Red;10
Jane;Apple Green;5
Jane;Orange Cali;18
Jane;Orange Spain;2
""")
df = pd.read_csv(s,sep=';')
req_items_idx = df[df.Item.str.contains('Apple|Orange')].index
df_filtered = df.loc[req_items_idx,:]
Once you have them you can further pivot the data to get the required values based on Name
pivot_df = pd.pivot_table(df_filtered,index=['Name'],columns=['Item'],aggfunc='sum')
pivot_df.columns = pivot_df.columns.droplevel()
pivot_df.columns.name = None
pivot_df = pivot_df.reset_index()
Generate the Totals for Apples and Oranges
orange_columns = pivot_df.columns[pivot_df.columns.str.contains('Orange')].tolist()
apple_columns = pivot_df.columns[pivot_df.columns.str.contains('Apple')].tolist()
pivot_df['Apples Total'] = pivot_df.loc[:,apple_columns].sum(axis=1)
pivot_df['Orange Total'] = pivot_df.loc[:,orange_columns].sum(axis=1)
A helper function to combine the item names:
def combine_items(inp,columns):
    # Collect the names of the columns that have a value in this row
    res = []
    for val,col in zip(inp.values,columns):
        if not pd.isnull(val):
            res += [col]
    return ','.join(res)
req_columns = apple_columns+orange_columns
pivot_df['Items'] = pivot_df[req_columns].apply(combine_items,args=(req_columns,),axis=1)
Finally you can get the required columns in a single place and print the values
total_columns = pivot_df.columns[pivot_df.columns.str.contains('Total')].tolist()
name_item_columns = pivot_df.columns[pivot_df.columns.str.contains('Name|Items')].tolist()
>>> pivot_df[name_item_columns+total_columns]
   Name                                           Items  Apples Total  Orange Total
0  Jane  Apple Green,Apple Red,Orange Cali,Orange Spain          15.0          20.0
1  John               Apple Green,Apple Red,Orange Cali          15.0          12.0
The answer is intended to outline the individual steps and approach one can take to solve something similar to this
To do this, create your Total columns before doing the groupby. Each will contain the quantity of apples or oranges in that row, depending on whether that row's Item is an apple or an orange, and 0 otherwise.
df['Apples Total'] = df.apply(lambda x: x.Quantity if ('Apple' in x.Item) else 0, axis=1)
df['Oranges Total'] = df.apply(lambda x: x.Quantity if ('Orange' in x.Item) else 0, axis=1)
When this is in place, group by Name and aggregate each column: sum the total columns, and collect the Item column into a list.
df.groupby('Name').agg({'Apples Total': 'sum',
                        'Oranges Total': 'sum',
                        'Item': lambda x: list(x)})
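The same idea can also be written without a row-wise apply. The sketch below is a vectorized variant, not the code from the answer above: it swaps in Series.where with str.contains to build the two helper columns, and joins the items into a string instead of a list. The frame is rebuilt from the question's sample data, and the name out is only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question, including the non-apple/orange items
df = pd.DataFrame({
    "Name": ["John", "John", "John", "Jane", "Jane",
             "Jane", "Jane", "John", "Jane", "John"],
    "Item": ["Apple Red", "Apple Green", "Orange Cali", "Apple Red", "Apple Green",
             "Orange Cali", "Orange Spain", "Banana", "Coconut", "Lime"],
    "Quantity": [10, 5, 12, 10, 5, 18, 2, 3, 5, 10],
})

# Keep Quantity where the item matches, otherwise use 0
df["Apples Total"] = df["Quantity"].where(df["Item"].str.contains("Apple"), 0)
df["Oranges Total"] = df["Quantity"].where(df["Item"].str.contains("Orange"), 0)

# Join every item into one string per name and sum the two helper columns
out = (
    df.groupby("Name", sort=False)
      .agg({"Item": ", ".join, "Apples Total": "sum", "Oranges Total": "sum"})
      .rename(columns={"Item": "All Items"})
      .reset_index()
)
print(out)
```

Items such as Banana or Lime contribute 0 to both totals but still appear in the All Items string, which is the behaviour the question asks for.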
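from io import StringIO
import pandas as pd
df = pd.read_csv(StringIO("""Index,Name,Item,Quantity
0,John,Apple Red,10
1,John,Apple Green,5
2,John,Orange Cali,12
3,Jane,Apple Red,10
4,Jane,Apple Green,5
5,Jane,Orange Cali,18
6,Jane,Orange Spain,2
7,John,Banana,3
8,Jane,Coconut,5
9,John,Lime,10
"""), index_col="Index")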
Getting list of items
grouping by name to get the list of items
items_list = pd.DataFrame(df.groupby(["Name"])["Item"].apply(list)).rename(columns={"Item": "All Items"})
All Items
Jane [Apple Red, Apple Green, Orange Cali, Orange Spain, Coconut]
John [Apple Red, Apple Green, Orange Cali, Banana, Lime]
getting count of name item groups
renaming the temp df items column such that all the apples/oranges are treated similarly
temp2 = df.groupby(["Name", "Item"])['Quantity'].apply(sum)
temp2 = pd.DataFrame(temp2).reset_index().set_index("Name")
temp2['Item'] = temp2['Item'].str.replace(r'(?:.*)(apple|orange)(?:.*)', r'\1', case=False,regex=True)
Item Quantity
Jane Apple 5
Jane Apple 10
Jane Coconut 5
Jane Orange 18
Jane Orange 2
John Apple 5
John Apple 10
John Banana 3
John Lime 10
John Orange 12
getting the required pivot table
pivot table for getting items count as separate column and retaining just apple orange count
pivot_df = pd.pivot_table(temp2, values='Quantity', columns='Item', index=["Name"], aggfunc='sum')
pivot_df = pivot_df[['Apple', 'Orange']]
Item Apple Orange
Jane 15.0 20.0
John 15.0 12.0
merging the items list df and the pivot_df
output = items_list.merge(pivot_df, on="Name").rename(columns = {'Apple': 'Apples Total', 'Orange': 'Oranges Total'})
All Items Apples Total Oranges Total
Jane [Apple Red, Apple Green, Orange Cali, Orange Spain, Coconut] 15.0 20.0
John [Apple Red, Apple Green, Orange Cali, Banana, Lime] 15.0 12.0

Divide two dataframe by matching part of the strings in Index

Why is df3 not working? I get the error "merging with more than one level overlap on a multi-index is not implemented".
Raw data:
LastName FirstName Year Cat Pay
0 Johnson David 2020 Apple 100
1 Bird Demi 2020 Apple 60
2 Bird Demi 2019 Banana 100
3 Johnson David 2019 Banana 100
df1=df.groupby(['LastName', 'FirstName']) ['Pay'].agg(['min','max', 'mean', 'sum'])
df2 = df.groupby(['LastName','FirstName','Year'])['Pay'].mean()
df3["PCT"] = df1['mean']/df2
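One way to sidestep aligning two MultiIndexes with different numbers of levels is to flatten both results and merge on the shared key columns. The sketch below assumes that is acceptable here; the column names year_mean and overall_mean are hypothetical. Note also that df3 is never created in the snippet above, so it is built explicitly by the merge.

```python
import pandas as pd

# Raw data rebuilt from the question
df = pd.DataFrame({
    "LastName": ["Johnson", "Bird", "Bird", "Johnson"],
    "FirstName": ["David", "Demi", "Demi", "David"],
    "Year": [2020, 2020, 2019, 2019],
    "Cat": ["Apple", "Apple", "Banana", "Banana"],
    "Pay": [100, 60, 100, 100],
})

df1 = df.groupby(["LastName", "FirstName"])["Pay"].agg(["min", "max", "mean", "sum"])
df2 = df.groupby(["LastName", "FirstName", "Year"])["Pay"].mean()

# Turn the index levels into ordinary columns so the keys can be merged on
per_year = df2.reset_index(name="year_mean")
overall = df1["mean"].reset_index(name="overall_mean")

# Each (LastName, FirstName, Year) row picks up that person's overall mean
df3 = per_year.merge(overall, on=["LastName", "FirstName"])
df3["PCT"] = df3["overall_mean"] / df3["year_mean"]
print(df3)
```

The division keeps the same direction as the original expression, that is, the per-person mean divided by the per-person, per-year mean.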

Pandas filter maximum groupby

I have Pandas df:
family age fruits
Brown 12 7
Brown 33 5
Yellow 28 3
Yellow 11 9
I want to get the ages under the following conditions:
Group by family;
Having the maximum of fruits.
So result df will be:
family age
Brown 12
Yellow 11
We can sort and then keep the first row of each family:
(df.sort_values(['family','fruits'], ascending=[True,False])
   .drop_duplicates('family'))
family age fruits
0 Brown 12 7
3 Yellow 11 9
Or with groupby().idxmax()
df.loc[df.groupby('family').fruits.idxmax(), ['family','age'] ]
family age
0 Brown 12
3 Yellow 11
Use head after sort_values:
df.sort_values(['family','fruits'], ascending=[True,False]).groupby('family').head(1)
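A quick way to compare the approaches is to run them side by side on the sample data. This is only a sketch: drop_duplicates and groupby().head(1) are assumed here as the ways to finish the sort_values variants, and the names by_dedup, by_head and by_idxmax are illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "family": ["Brown", "Brown", "Yellow", "Yellow"],
    "age": [12, 33, 28, 11],
    "fruits": [7, 5, 3, 9],
})

# Largest fruits first within each family
ordered = df.sort_values(["family", "fruits"], ascending=[True, False])

# Variant 1: keep the first row seen for each family
by_dedup = ordered.drop_duplicates("family")[["family", "age"]]

# Variant 2: take the first row of each group
by_head = ordered.groupby("family").head(1)[["family", "age"]]

# Variant 3: look up the index label of the maximum directly, no sorting needed
by_idxmax = df.loc[df.groupby("family")["fruits"].idxmax(), ["family", "age"]]

print(by_idxmax)
```

On this data all three select the rows with index 0 and 3. When a family has tied maxima, idxmax returns only the first occurrence, so it is worth checking how ties should be handled before picking one.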

pandas search a value in a dataframe column

I have the following dataframe and I want to search for apple in the fruits column and display all the rows where apple is found.
Before :
number fruits purchase
0 apple yes
1 apple no
2 mango yes
3 apple yes
4 grapes no
After:
number fruits purchase
0 apple yes
1 apple no
3 apple yes
Use groupby and filter to keep the groups that contain 'apple':
df['number'] = df['number'].ffill()
df_out = df.groupby('number').filter(lambda x: (x['fruits'] == 'apple').any())
df_out.assign(number = df_out['number'].mask(df.number.duplicated()))\
      .fillna('')
  number  fruits purchase
0      0   apple      yes
1          mango
2         banana
3      1   apple       no
4         cheery
7      3   apple      yes
8         orange
It looks like you're using 'number' as the index, so I'm going to assume that.
Get the index labels where 'apple' is present, and slice into those:
idx = df.index[df.fruits == 'apple']
df.loc[idx]
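If number is an ordinary column and every row carries its own fruit, as in the Before/After tables of the question, a plain boolean mask may be enough. The sketch below assumes that layout; exact and partial are illustrative names.

```python
import pandas as pd

# Sample data rebuilt from the "Before" table in the question
df = pd.DataFrame({
    "number": [0, 1, 2, 3, 4],
    "fruits": ["apple", "apple", "mango", "apple", "grapes"],
    "purchase": ["yes", "no", "yes", "yes", "no"],
})

# Exact match on the whole cell
exact = df[df["fruits"] == "apple"]

# Substring, case-insensitive match; na=False treats missing values as non-matches
partial = df[df["fruits"].str.contains("apple", case=False, na=False)]

print(exact)
```

The exact comparison is the stricter of the two; str.contains would also match values such as "pineapple", which may or may not be wanted.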

(pandas) Create new column based on first element in groupby object

Say I have the following dataframe:
>>> df = pd.DataFrame({'Person': ['bob', 'jim', 'joe', 'bob', 'jim', 'joe'], 'Color':['blue', 'green', 'orange', 'yellow', 'pink', 'purple']})
>>> df
Color Person
0 blue bob
1 green jim
2 orange joe
3 yellow bob
4 pink jim
5 purple joe
And I want to create a new column that represents the first color seen for each person:
Color Person First Color
0 blue bob blue
1 green jim green
2 orange joe orange
3 yellow bob blue
4 pink jim green
5 purple joe orange
I have come to a solution but it seems really inefficient:
>>> df['First Color'] = 0
>>> groups = df.groupby(['Person'])['Color']
>>> for g in groups:
...     first_color = g[1].iloc[0]
...     df['First Color'].loc[df['Person']==g[0]] = first_color
Is there a faster way to do this all at once where it doesn't have to iterate through the groupby object?
You need transform with first:
print (df.groupby('Person')['Color'].transform('first'))
0 blue
1 green
2 orange
3 blue
4 green
5 orange
Name: Color, dtype: object
df['First_Col'] = df.groupby('Person')['Color'].transform('first')
print (df)
Color Person First_Col
0 blue bob blue
1 green jim green
2 orange joe orange
3 yellow bob blue
4 pink jim green
5 purple joe orange
Use the transform() method:
In [177]: df['First_Col'] = df.groupby('Person')['Color'].transform('first')
In [178]: df
Color Person First_Col
0 blue bob blue
1 green jim green
2 orange joe orange
3 yellow bob blue
4 pink jim green
5 purple joe orange
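It may help to see why transform fits here where a plain aggregation does not: first() returns one value per group, while transform('first') broadcasts that value back to every original row. A minimal sketch of the difference, with first_per_group, first_per_row and via_map as illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    "Person": ["bob", "jim", "joe", "bob", "jim", "joe"],
    "Color": ["blue", "green", "orange", "yellow", "pink", "purple"],
})

grouped = df.groupby("Person")["Color"]

# Aggregation: one row per group, indexed by Person
first_per_group = grouped.first()

# Transformation: one row per original row, aligned to df's index
first_per_row = grouped.transform("first")

# Because the index matches df, the result can be assigned directly
df["First_Col"] = first_per_row

# Equivalent lookup: map each Person to its group's first color
via_map = df["Person"].map(first_per_group)

print(df)
```

Either route avoids the explicit loop over the groupby object from the question.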