Expanding on the question here, I'm wondering how to add aggregation to the following based on conditions:
Index Name Item Quantity
0 John Apple Red 10
1 John Apple Green 5
2 John Orange Cali 12
3 Jane Apple Red 10
4 Jane Apple Green 5
5 Jane Orange Cali 18
6 Jane Orange Spain 2
7 John Banana 3
8 Jane Coconut 5
9 John Lime 10
... And so forth
What I need to do is convert this data into a dataframe like the following. Note: I am only interested in the total quantity of the apples and of the oranges, each in its own column. Whatever other items appear in a group are not to be included in the aggregation on the "Quantity" column (but they should still appear in the "All Items" column as strings):
Index Name All Items Apples Total Oranges Total
0 John Apple Red, Apple Green, Orange Cali, Banana, Lime 15 12
1 Jane Apple Red, Apple Green, Orange Cali, Coconut 15 20
How would I achieve that? Many thanks in advance!
You can use groupby and pivot_table after extracting the Apple and Orange substrings, as below:
import re
s = df['Item'].str.extract("(Apple|Orange)",expand=False,flags=re.I)
# re.I used above is optional and is used for case insensitive matching
a = df.assign(Item_1=s).dropna(subset=['Item_1'])
out = (a.groupby("Name")['Item'].agg(",".join).to_frame().join(
   Name                                            Item  Apple_Total  Orange_Total
0  Jane  Apple Red,Apple Green,Orange Cali,Orange Spain           15            20
1  John               Apple Red,Apple Green,Orange Cali           15            12
For the edited question, you can use the same code, except that the groupby runs on the original dataframe df instead of the subset a, and then join:
out = (df.groupby("Name")['Item'].agg(",".join).to_frame().join(
   Name                                               Item  Apple_Total  Orange_Total
0  Jane  Apple Red,Apple Green,Orange Cali,Orange Spain...           15            20
1  John  Apple Red,Apple Green,Orange Cali,Banana,Lime              15            12
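Since the join statement above is cut off, here is a self-contained sketch of how the two pieces could be combined. The `pivot_table` arguments, the `add_suffix("_Total")` call, and the `Kind` column name are assumptions chosen to reproduce the column names in the printed output; they are not necessarily the answer's exact code.

```python
import re

import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame({
    "Name": ["John", "John", "John", "Jane", "Jane", "Jane", "Jane",
             "John", "Jane", "John"],
    "Item": ["Apple Red", "Apple Green", "Orange Cali", "Apple Red",
             "Apple Green", "Orange Cali", "Orange Spain",
             "Banana", "Coconut", "Lime"],
    "Quantity": [10, 5, 12, 10, 5, 18, 2, 3, 5, 10],
})

# Base fruit per row; NaN for anything that is neither apple nor orange.
kind = df["Item"].str.extract("(Apple|Orange)", expand=False, flags=re.I)

# Totals per name and base fruit, computed only on the matching rows.
# "Kind" is a hypothetical helper column name.
totals = (
    df.assign(Kind=kind)
      .dropna(subset=["Kind"])
      .pivot_table(index="Name", columns="Kind", values="Quantity", aggfunc="sum")
      .add_suffix("_Total")
)

# All items per name, taken from the unfiltered frame.
items = df.groupby("Name")["Item"].agg(", ".join)

out = items.to_frame().join(totals).reset_index()
print(out)
```

Grouping the unfiltered frame for the item strings while pivoting only the filtered rows keeps Banana, Coconut and Lime in the item list without letting them affect the totals.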
First, filter only the required rows using str.contains on the column Item
from io import StringIO
import pandas as pd
s = StringIO("""Name;Item;Quantity
John;Apple Red;10
John;Apple Green;5
John;Orange Cali;12
Jane;Apple Red;10
Jane;Apple Green;5
Jane;Orange Cali;18
Jane;Orange Spain;2
""")
df = pd.read_csv(s,sep=';')
req_items_idx = df[df.Item.str.contains('Apple|Orange')].index
df_filtered = df.loc[req_items_idx,:]
Once you have them you can further pivot the data to get the required values based on Name
pivot_df = pd.pivot_table(df_filtered,index=['Name'],columns=['Item'],aggfunc='sum')
pivot_df.columns = pivot_df.columns.droplevel()
pivot_df.columns.name = None
pivot_df = pivot_df.reset_index()
Generate the Totals for Apples and Oranges
orange_columns = pivot_df.columns[pivot_df.columns.str.contains('Orange')].tolist()
apple_columns = pivot_df.columns[pivot_df.columns.str.contains('Apple')].tolist()
pivot_df['Apples Total'] = pivot_df.loc[:,apple_columns].sum(axis=1)
pivot_df['Orange Total'] = pivot_df.loc[:,orange_columns].sum(axis=1)
A wrapper function to combine the Items together
def combine_items(inp, columns):
    # Collect the names of the columns that hold a value for this row
    res = []
    for val, col in zip(inp.values, columns):
        if not pd.isnull(val):
            res += [col]
    return ','.join(res)
req_columns = apple_columns+orange_columns
pivot_df['Items'] = pivot_df[apple_columns+orange_columns].apply(combine_items,args=([req_columns]),axis=1)
Finally you can get the required columns in a single place and print the values
total_columns = pivot_df.columns[pivot_df.columns.str.contains('Total')].tolist()
name_item_columns = pivot_df.columns[pivot_df.columns.str.contains('Name|Items')].tolist()
>>> pivot_df[name_item_columns+total_columns]
Name Items Apples Total Orange Total
0 Jane Apple Green,Apple Red,Orange Cali,Orange Spain 15.0 20.0
1 John Apple Green,Apple Red,Orange Cali 15.0 12.0
The answer is intended to outline the individual steps and approach one can take to solve something similar to this
To do this, create your Total columns before doing the groupby. Each will contain the number of apples or oranges in that row, depending on whether that row's Item is an apple or an orange.
df['Apples Total'] = df.apply(lambda x: x.Quantity if ('Apple' in x.Item) else 0, axis=1)
df['Oranges Total'] = df.apply(lambda x: x.Quantity if ('Orange' in x.Item) else 0, axis=1)
When this is in place, groupby name and aggregate on each column. Sum on the total columns, and aggregate to list on the item column.
df.groupby('Name').agg({'Apples Total': 'sum',
                        'Oranges Total': 'sum',
                        'Item': lambda x: list(x)})
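As a self-contained sketch of the same idea, the snippet below builds the per-row totals and aggregates them in one chain. It swaps the row-wise `apply` for the vectorised `Series.where` and uses named aggregation (pandas 0.25 or later); both are substitutions of mine rather than the answer's code, and the item strings are joined instead of collected into lists to match the output requested in the question.

```python
import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame({
    "Name": ["John", "John", "John", "Jane", "Jane", "Jane", "Jane",
             "John", "Jane", "John"],
    "Item": ["Apple Red", "Apple Green", "Orange Cali", "Apple Red",
             "Apple Green", "Orange Cali", "Orange Spain",
             "Banana", "Coconut", "Lime"],
    "Quantity": [10, 5, 12, 10, 5, 18, 2, 3, 5, 10],
})

is_apple = df["Item"].str.contains("Apple")
is_orange = df["Item"].str.contains("Orange")

out = (
    df.assign(**{
        # Keep the quantity where the row matches, otherwise count it as 0.
        "Apples Total": df["Quantity"].where(is_apple, 0),
        "Oranges Total": df["Quantity"].where(is_orange, 0),
    })
    .groupby("Name")
    .agg(**{
        "All Items": ("Item", ", ".join),
        "Apples Total": ("Apples Total", "sum"),
        "Oranges Total": ("Oranges Total", "sum"),
    })
    .reset_index()
)
print(out)
```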
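from io import StringIO
import pandas as pd

# Sample data; the header row and rows 7-9 are restored from the question's table
df = pd.read_csv(StringIO("""Index,Name,Item,Quantity
0,John,Apple Red,10
1,John,Apple Green,5
2,John,Orange Cali,12
3,Jane,Apple Red,10
4,Jane,Apple Green,5
5,Jane,Orange Cali,18
6,Jane,Orange Spain,2
7,John,Banana,3
8,Jane,Coconut,5
9,John,Lime,10"""), index_col="Index")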
Getting list of items
grouping by name to get the list of items
items_list = pd.DataFrame(df.groupby(["Name"])["Item"].apply(list)).rename(columns={"Item": "All Items"})
All Items
Jane [Apple Red, Apple Green, Orange Cali, Orange Spain, Coconut]
John [Apple Red, Apple Green, Orange Cali, Banana, Lime]
getting count of name item groups
renaming the temp df items column such that all the apples/oranges are treated similarly
temp2 = df.groupby(["Name", "Item"])['Quantity'].apply(sum)
temp2 = pd.DataFrame(temp2).reset_index().set_index("Name")
temp2['Item'] = temp2['Item'].str.replace(r'(?:.*)(apple|orange)(?:.*)', r'\1', case=False,regex=True)
Item Quantity
Jane Apple 5
Jane Apple 10
Jane Coconut 5
Jane Orange 18
Jane Orange 2
John Apple 5
John Apple 10
John Banana 3
John Lime 10
John Orange 12
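To make the renaming step easier to follow, here is a minimal sketch of the same `str.replace` call applied to a small standalone Series. The Series name `items` and its contents are illustrative, taken from the item names in the question.

```python
import pandas as pd

items = pd.Series(["Apple Red", "Apple Green", "Orange Cali", "Orange Spain", "Coconut"])

# The pattern matches the whole string whenever it contains "apple" or
# "orange" (case-insensitively) and replaces it with just the captured fruit.
# Strings without a match, such as "Coconut", are left unchanged.
normalised = items.str.replace(
    r"(?:.*)(apple|orange)(?:.*)", r"\1", case=False, regex=True
)
print(normalised.tolist())
```

Because the captured group is copied back as written, the original capitalisation ("Apple", "Orange") is preserved even though the match itself ignores case.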
getting the required pivot table
pivot table for getting items count as separate column and retaining just apple orange count
pivot_df = pd.pivot_table(temp2, values='Quantity', columns='Item', index=["Name"], aggfunc='sum')
pivot_df = pivot_df[['Apple', 'Orange']]
Item Apple Orange
Jane 15.0 20.0
John 15.0 12.0
merging the items list df and the pivot_df
output = items_list.merge(pivot_df, on="Name").rename(columns={'Apple': 'Apples Total', 'Orange': 'Oranges Total'})
All Items Apples Total Oranges Total
Jane [Apple Red, Apple Green, Orange Cali, Orange Spain, Coconut] 15.0 20.0
John [Apple Red, Apple Green, Orange Cali, Banana, Lime] 15.0 12.0
I have the following df:
james__America by Estonia : 2
luke__Spain by Italy 3
michael 4
Louis__Portugal by USA 2
If the index contains the substring "__", I would like to split the index on it and create 2 new columns next to it by splitting the second part again on ' by ', in order to get the following output:
name1 name2 stuff
james America Estonia 2
luke Spain Italy 3
michael 0 0 4
Louis Portugal USA 2
I thought of using:
df.index.str.split('__', expand=True).split(' by ',expand=True).rename(columns={0:'name1',1:'name2'})
However it does not seem to work.
Convert the Index to a Series with Index.to_series, split it on the first separator with Series.str.split, split the resulting second column on the second separator, join the original columns, and finally overwrite the index:
df1 = df.index.to_series().str.split('__', expand=True)
df2 = df1[1].str.split(' by ',expand=True).rename(columns={0:'name1',1:'name2'}).fillna('0')
df = df2.join(df)
df.index = df1[0].rename(None)
print (df)
name1 name2 stuff
james America Estonia 2
luke Spain Italy 3
michael 0 0 4
Louis Portugal USA 2
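As an alternative sketch, both splits can be done in one pass with `Series.str.extract` and a regex whose second part is optional. The pattern and the `key` group name are my own assumptions for illustration, and the sample frame is rebuilt from the question's data.

```python
import pandas as pd

df = pd.DataFrame(
    {"stuff": [2, 3, 4, 2]},
    index=["james__America by Estonia", "luke__Spain by Italy",
           "michael", "Louis__Portugal by USA"],
)

# key: text before "__"; name1/name2: the two sides of " by ".
# The whole "__... by ..." part is optional, so "michael" still matches.
pattern = r"^(?P<key>.*?)(?:__(?P<name1>.*?) by (?P<name2>.*))?$"
parts = df.index.to_series().str.extract(pattern)

# Rows without "__" get "0" in both new columns, as in the expected output.
parts[["name1", "name2"]] = parts[["name1", "name2"]].fillna("0")

# Attach the original columns, then use the extracted key as the new index.
out = parts.join(df).set_index("key").rename_axis(None)
print(out)
```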
I want to do aggregations on a pandas DataFrame by word.
Basically there are 3 columns: the click count, the impression count, and the corresponding phrase. I would like to split each phrase into tokens and then sum the clicks and impressions per token, to decide which tokens are relatively good or bad.
Expected input: pandas DataFrame as below
click_count impression_count text
1 10 100 pizza
2 20 200 pizza italian
3 1 1 italian cheese
Expected output:
click_count impression_count token
1 30 300 pizza // 30 = 20 + 10, 300 = 200+100
2 21 201 italian // 21 = 20 + 1
3 1 1 cheese // cheese only appeared once in italian cheese
tokens = df.text.str.split(expand=True)
token_cols = ['token_{}'.format(i) for i in range(tokens.shape[1])]
tokens.columns = token_cols
df1 = pd.concat([df.drop('text', axis=1), tokens], axis=1)
df2 = pd.lreshape(df1, {'tokens': token_cols})
This creates a new DataFrame like piRSquared's but tokens are stacked and merged with the original:
(df['text'].str.split(expand=True).stack().reset_index(level=1, drop=True)
.to_frame('token').merge(df, left_index=True, right_index=True)
.groupby('token')[['click_count', 'impression_count']].sum())
click_count impression_count
cheese 1 1
italian 21 201
pizza 30 300
If you break this down, it merges this:
df['text'].str.split(expand=True).stack().reset_index(level=1, drop=True).to_frame('token')
1 pizza
2 pizza
2 italian
3 italian
3 cheese
with the original DataFrame on their indices. The resulting df is:
(df['text'].str.split(expand=True).stack().reset_index(level=1, drop=True)
.to_frame('token').merge(df, left_index=True, right_index=True))
token click_count impression_count text
1 pizza 10 100 pizza
2 pizza 20 200 pizza italian
2 italian 20 200 pizza italian
3 italian 1 1 italian cheese
3 cheese 1 1 italian cheese
The rest is grouping by the token column.
You could do
In [3091]: s = df.text.str.split(expand=True).stack().reset_index(drop=True, level=-1)
In [3092]: df.loc[s.index].assign(token=s).groupby('token',sort=False,as_index=False).sum()
token click_count impression_count
0 pizza 30 300
1 italian 21 201
2 cheese 1 1
In [3093]: df
click_count impression_count text
1 10 100 pizza
2 20 200 pizza italian
3 1 1 italian cheese
In [3094]: s
1 pizza
2 pizza
2 italian
3 italian
3 cheese
dtype: object
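The split-and-stack step used in the answers above can also be written with `DataFrame.explode`, which is available from pandas 0.25 onward. This is a sketch of an equivalent pipeline rather than any answerer's code; the sample frame is rebuilt from the question.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "click_count": [10, 20, 1],
        "impression_count": [100, 200, 1],
        "text": ["pizza", "pizza italian", "italian cheese"],
    },
    index=[1, 2, 3],
)

out = (
    df.assign(token=df["text"].str.split())   # one list of tokens per row
      .explode("token")                       # one row per (phrase, token) pair
      .groupby("token", sort=False)[["click_count", "impression_count"]]
      .sum()
      .reset_index()
)
print(out)
```

With `sort=False` the tokens keep their order of first appearance (pizza, italian, cheese), matching the expected output in the question.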
I have a question about how to modify values in one column based on a specific condition on another column.
For example,
from pandas import DataFrame
import pandas as pd
df = DataFrame({'name' : ['apple','pineapple','melon','orange','mango','durian'],
'amt' : [200,300,100,1,3,120]},
index = ['1','2','3','4','5','6'])
I can see,
amt name
1 200 apple
2 300 pineapple
3 100 melon
4 1 orange
5 3 mango
6 120 durian
From the above result, I want to change the amt of apple while leaving the other items as they are.
I only know this:
df.loc[df.name.str.contains('apple'), 'amt'] = df['amt']/100
This syntax changes not only 'apple' but also 'pineapple'.
I'd like to get a result where only apple's amt is revised, like this:
amt name
1 2 apple
2 300 pineapple
3 100 melon
4 1 orange
5 3 mango
6 120 durian
Can anyone help me?
Thanks for reading.
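One possible approach, offered as a sketch rather than a definitive answer: `str.contains` does substring matching, which is why 'pineapple' is caught too, so comparing the column for equality should restrict the update to rows whose name is exactly 'apple'. The cast to float is my own addition so that the divided value fits the column's dtype.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["apple", "pineapple", "melon", "orange", "mango", "durian"],
        "amt": [200, 300, 100, 1, 3, 120],
    },
    index=["1", "2", "3", "4", "5", "6"],
)

# Cast to float first so the result of the division can be stored as-is.
df["amt"] = df["amt"].astype(float)

# Exact comparison: True only for rows whose name is exactly "apple".
is_apple = df["name"] == "apple"
df.loc[is_apple, "amt"] = df.loc[is_apple, "amt"] / 100
print(df)
```

If a regex is preferred, anchoring the pattern, as in `df["name"].str.contains("^apple$")`, should select the same rows.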