Write the values of a customer group into a Series - pandas

I have a dataframe with various customer entries. These customers, each with their own customer number, belong to certain customer groups (contract, wholesaler, tender, etc.). I have to sum some of the dataframe's values into a Series with one entry per customer group (e.g., the total sales of contract customers would be a single entry in the Series).
I've tried using .isin(), but I got an AttributeError ('float' object has no attribute 'isin'). It works if I use the or operator, but then I have to manually enter all customer numbers for all customer groups. I'm sure there must be a much simpler and more efficient way of doing it. Many thanks in advance.
for i in range(len(grouped_sales)):
    if df.iloc[i, 1] == value1 or df.iloc[i, 1] == value2 or df.iloc[i, 1] == ...:
        series[1] = series[1] + df.iloc[i, 3]
    elif df.iloc[i, 1] == valueN or df.iloc[i, 1] == value(N+1) ...:
        series[2] = series[2] + df.iloc[i, 3]
    elif ...:
        ...

If you want to sum the sales for every group, you may want to look into pandas'
df.groupby().
Trying to reproduce what you want, it would look like this:
>>> df = pd.DataFrame()
>>> df['cust_numb']=[1,2,3,4,5]
>>> df['group']=['group1','group2','group3','group3','group1']
>>> df['sales']=[50,30,50,40,20]
>>> df
   cust_numb   group  sales
0          1  group1     50
1          2  group2     30
2          3  group3     50
3          4  group3     40
4          5  group1     20
>>> df.groupby('group').sum()['sales']
group
group1    70
group2    30
group3    90
Name: sales, dtype: int64
You'll have a series with groups as index and the sum of the sales as values
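As a side note, selecting the column before aggregating sums only that column instead of all of them; an equivalent spelling would be:
>>> df.groupby('group')['sales'].sum()
group
group1    70
group2    30
group3    90
Name: sales, dtype: int64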
EDIT: Based on your comment, you have the group data in a separate dictionary; the implementation would look like this:
>>> sales_data = {'CustomerName': ['cust1', 'cust2', 'cust3', 'cust4'],'CustomerCode': [1,2,3,4], 'Sales': [10,10,15,25], 'Risk':[55,55,45,79]}
>>> sdf = pd.DataFrame.from_dict(sales_data)
>>> group_data ={'group1': [1,3], 'group2': [2,4]}
You want to map your customer number to the groups so you need an inverted dictionary:
>>> dc = {v: k for k in group_data.keys() for v in group_data[k]}
>>> dc
{1: 'group1', 3: 'group1', 2: 'group2', 4: 'group2'}
Replace the customer-number column by the group mapping in a new column and reproduce what I did above:
>>> sdf['groups'] = sdf.replace({'CustomerCode': dc})['CustomerCode']
>>> sdf
  CustomerName  CustomerCode  Sales  Risk  groups
0        cust1             1     10    55  group1
1        cust2             2     10    55  group2
2        cust3             3     15    45  group1
3        cust4             4     25    79  group2
>>> sdf.groupby('groups').sum()['Sales']
groups
group1    25
group2    35
Name: Sales, dtype: int64
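A minimal alternative for building the groups column uses Series.map (same result here, assuming every CustomerCode appears in dc; unmapped codes would become NaN with map, whereas replace leaves them unchanged):
>>> sdf['groups'] = sdf['CustomerCode'].map(dc)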

Related

pandas pivot table how to rearrange columns

I have a pandas df which I am looking to build a pivot table with.
Here is a sample table
Name  Week  Category  Amount
ABC      1  Clothing      50
ABC      1  Food          10
ABC      1  Food          10
ABC      1  Auto          20
DEF      1  Food          10
DEF      1  Services      20
The pivot table I am looking to create is to sum up the amounts per Name, per week per category.
Essentially, I am looking to land up with a table as follows:
Name  Week  Clothing  Food  Auto  Services  Total
ABC      1        50    20    20         0     90
DEF      1         0    10     0        20     30
If a user has no category value in a particular week, I take it as 0, and the total is the row sum.
I tried some of the options mentioned at https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html but couldn't get it to work... any thoughts on how I can achieve this? I used
df.pivot_table(values=['Amount'], index=['Name','Week','Category'], aggfunc=[np.sum]) followed by df.unstack(), but that did not yield the desired result, as both Week and Category got unstacked.
Thanks!
df_pvt = pd.pivot_table(df, values='Amount', index=['Name', 'Week'], columns='Category', aggfunc=np.sum, margins=True, margins_name='Total', fill_value=0)
df_pvt.columns.name = None
df_pvt = df_pvt.reset_index()
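Note that margins=True adds a 'Total' row at the bottom as well as the 'Total' column; if, as in the desired output, only the per-row totals are wanted, the margin row can be dropped, e.g.:
df_pvt = df_pvt.iloc[:-1]  # drop the trailing 'Total' margin row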
Let us try crosstab
out = pd.crosstab(index=[df['Name'], df['Week']],
                  columns=df['Category'],
                  values=df['Amount'],
                  margins=True,
                  aggfunc='sum').fillna(0).iloc[:-1].reset_index()
Category Name  Week  Auto  Clothing  Food  Services  All
0         ABC     1  20.0      50.0  20.0       0.0   90
1         DEF     1   0.0       0.0  10.0      20.0   30
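If the margin column should be named Total instead of All, crosstab accepts the same margins_name argument as pivot_table (in reasonably recent pandas versions), e.g.:
out = pd.crosstab(index=[df['Name'], df['Week']], columns=df['Category'],
                  values=df['Amount'], margins=True, margins_name='Total',
                  aggfunc='sum').fillna(0).iloc[:-1].reset_index()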

How can I create a dataframe column which counts the occurrence of each value in another column?

I am trying to add a column to my dataframe which will hold a value representing the number of times a unique value has appeared in another column.
For example, I have the following dataframe:
Date|Team|Goals|
22.08.20|Team1|4|
22.08.20|Team2|3|
22.08.20|Team3|1|
22.09.20|Team1|4|
22.09.20|Team3|5|
I would like to add a counter column, which counts how often each team appears:
Date|Team|Goals|Count|
22.08.20|Team1|4|1|
22.08.20|Team2|3|1|
22.08.20|Team3|1|1|
22.09.20|Team1|4|2|
22.09.20|Team3|5|2|
My dataframe is ordered by date, so the teams should appear in the correct order.
Apologies, I'm very new to pandas and Stack Overflow, so please let me know if I can format this question differently. Thanks
TRY:
df['Count'] = df.groupby('Team').cumcount().add(1)
OUTPUT:
       Date   Team  Goals  Count
0  22.08.20  Team1      4      1
1  22.08.20  Team2      3      1
2  22.08.20  Team3      1      1
3  22.09.20  Team1      4      2
4  22.09.20  Team3      5      2
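If instead the total number of appearances per team is ever needed on each row (rather than the running count), a transform-based variant would be (the Total column name is just illustrative):
df['Total'] = df.groupby('Team')['Team'].transform('count')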
Another answer building upon @Nk03's, with replicable results:
import pandas as pd
import numpy as np
# Set numpy random seed
np.random.seed(42)
# Create dates array
dates = pd.date_range(start='2021-06-01', periods=10, freq='D')
# Create teams array
teams_names = ['Team 1', 'Team 2', 'Team 3']
teams = [teams_names[i] for i in np.random.randint(0, 3, 10)]
# Create goals array
goals = np.random.randint(1, 6, 10)
# Create DataFrame
data = pd.DataFrame({'Date': dates,
                     'Team': teams,
                     'Goals': goals})
# Cumulative count of teams
data['Count'] = data.groupby('Team').cumcount().add(1)
The output will be:
        Date    Team  Goals  Count
0 2021-06-01  Team 2      3      1
1 2021-06-02  Team 2      1      2
2 2021-06-03  Team 2      4      3
3 2021-06-04  Team 1      2      1
4 2021-06-05  Team 2      4      4
5 2021-06-06  Team 1      2      2
6 2021-06-07  Team 2      2      5
7 2021-06-08  Team 3      4      1
8 2021-06-09  Team 3      5      2
9 2021-06-10  Team 1      2      3

Combine two dataframes in Pandas to generate many to many relationship

I have two lists, say
customers = ['a', 'b', 'c']
accounts = [1, 2, 3, 4, 5, 6, 7, 8, 9]
I want to generate a Pandas dataframe so:
All customers and accounts are used
There is a many-to-many relationship between customers and accounts (one customer 'may' have multiple accounts and an account 'may' be owned by multiple customers).
I want the many-to-many relationship to be random. That is, some customers will have one account and some will have more than one. Similarly, some accounts will be owned by just one customer and others by more than one.
Something like,
Customer  Account
a         1
a         2
b         2
c         3
a         4
b         4
c         4
b         5
b         6
b         7
b         8
a         9
Since I am generating random data, in the worst-case scenario I can generate far too many accounts and discard the unused ones if that makes the code easier (essentially relaxing requirement 1 above).
I am using sample(n=20, replace=True) to generate 20 records in both dataframes and then merging them into one based on the index. Is there an out-of-the-box API or library to do this, or is my code the recommended way?
import pandas as pd

customers = ['a', 'b', 'c']
accounts = [1, 2, 3, 4, 5, 6, 7, 8, 9]

customers_df = pd.DataFrame(data=customers)
customers_df = customers_df.sample(n=20, replace=True)
customers_df['new_index'] = range(20)
customers_df.set_index('new_index', inplace=True)

accounts_df = pd.DataFrame(data=accounts)
accounts_df = accounts_df.sample(n=20, replace=True)
accounts_df['new_index'] = range(20)
accounts_df.set_index('new_index', inplace=True)

combined_df = pd.merge(customers_df, accounts_df, on='new_index')
print(combined_df)
Edit: Modified the question and added sample code I have tried.
One way to accomplish this is to collect the set of all possible relationships with a Cartesian product, then select from that list before building your dataframe:
import itertools
import random
import pandas as pd

customers = ['a', 'b', 'c']
accounts = [1, 2, 3, 4, 5, 6, 7, 8, 9]

possible_associations = list(itertools.product(customers, accounts))
df = pd.DataFrame.from_records(random.choices(possible_associations, k=20),
                               columns=['customers', 'accounts'])
df = df.sort_values(['customers', 'accounts'])
print(df)
Output
   customers  accounts
0          a         2
3          a         2
15         a         2
18         a         4
16         a         5
14         a         7
7          a         8
12         a         8
1          a         9
2          b         5
9          b         5
8          b         8
11         b         8
19         c         2
17         c         3
5          c         4
4          c         5
6          c         5
13         c         5
10         c         7
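One caveat: random.choices samples with replacement, which is why the pair (a, 2) appears three times above, and nothing guarantees that every customer and account is used. Two possible adjustments, sketched under the same setup:
# keep distinct pairs only (may yield fewer than 20 rows)
df = df.drop_duplicates()
# or draw 20 distinct pairs up front (requires k <= len(possible_associations))
pairs = random.sample(possible_associations, 20)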
To have a repeatable test result, start with np.random.seed(1) (drop it in the
target version).
Then proceed as follows:
Create a list of probabilities that an account has 1, 2, 3, ... owners, e.g.:
prob = [0.5, 0.25, 0.15, 0.09, 0.01]
Generate a Series stating how many owners shall have each account:
cnt = pd.Series(np.random.choice(range(1, len(prob) + 1), size=len(accounts), p=prob),
                name='Customer')
Its name is Customer because it will be the source for creating the
Customer column.
For my sample probabilities and generator seeding the result is:
0    1
1    2
2    1
3    1
4    1
5    1
6    1
7    1
8    1
Name: Customer, dtype: int32
(the left column is the index, the right - actual values).
Because your data sample contains only 9 accounts, the result does
not contain any of the larger owner counts. But in your target version,
with more accounts, there will be accounts with greater numbers of
owners.
Generate the result - cust_acct DataFrame, defining the assignment of customers
to accounts:
cust_acct = cnt.apply(lambda x: np.random.choice(customers, x, replace=False))\
    .explode().to_frame().join(pd.Series(accounts, name='Account')).reset_index(drop=True)
The result, for your sample data and my seeding and probabilities, is:
  Customer  Account
0        b        1
1        a        2
2        b        2
3        b        3
4        b        4
5        c        5
6        b        6
7        c        7
8        a        8
9        b        9
Of course, you can assume different probabilities in prob.
You can also choose another "top" number of owners (the number of
entries in prob).
In this case no change to the code is needed, because the range of values in
the first np.random.choice is set to accommodate the length of prob.
Note: Because your sample data contains only 3 customers, a
ValueError: Cannot take a larger sample than population when 'replace=False'
can occur under different generator seeding: it is raised whenever the number
of owners drawn for some account is > 3.
With your target data and its greater number of customers, this error
will not occur.
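Putting the steps together, a self-contained version of this approach (just the answer's own snippets rearranged, with the seed kept for reproducibility):
import numpy as np
import pandas as pd

np.random.seed(1)  # drop in the target version

customers = ['a', 'b', 'c']
accounts = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# probabilities that an account has 1, 2, 3, ... owners
prob = [0.5, 0.25, 0.15, 0.09, 0.01]

# number of owners drawn for each account
cnt = pd.Series(np.random.choice(range(1, len(prob) + 1), size=len(accounts), p=prob),
                name='Customer')

# draw that many distinct customers per account and explode into rows
cust_acct = cnt.apply(lambda x: np.random.choice(customers, x, replace=False))\
    .explode().to_frame().join(pd.Series(accounts, name='Account')).reset_index(drop=True)
print(cust_acct)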

Compare two data frames for different values in a column

I have two dataframes. Please tell me how I can compare them by name and, where a name matches, add the count and time values to the first dataframe.
In [2]: df1
Out[2]:
     Name  count     time
0     Bob    123  4:12:10
1   Alice     99  1:01:12
2  Sergei     78  0:18:01
[85 rows x 3 columns]

In [3]: df2
Out[3]:
   Name  count     time
0  Rick      9  0:13:00
1  Jone      7  0:24:21
2   Bob     10  0:15:13
[105 rows x 3 columns]
I want to get:
In [5]: df1
Out[5]:
     Name  count     time
0     Bob    133  4:27:23
1   Alice     99  1:01:12
2  Sergei     78  0:18:01
[85 rows x 3 columns]
Use set_index and add them together. Finally, update back.
df1 = df1.set_index('Name')
df1.update(df1 + df2.set_index('Name'))
df1 = df1.reset_index()
Out[759]:
     Name  count      time
0     Bob  133.0  04:27:23
1   Alice   99.0  01:01:12
2  Sergei   78.0  00:18:01
Note: I assume the time columns in both df1 and df2 are already in a proper time format. If they are strings, convert them before running the commands above, as follows:
df1.time = pd.to_timedelta(df1.time)
df2.time = pd.to_timedelta(df2.time)
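Also note, as visible in Out[759], that the addition leaves NaN for names missing from df2, so count ends up as float after update. If integer counts are wanted back, a final cast suffices:
df1['count'] = df1['count'].astype(int)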

How to apply different aggregate functions to different columns in pandas?

I have a dataframe with many columns; some of them contain prices and the rest contain volumes, as below:
year_month  0_fx_price_gy  0_fx_volume_gy  1_fx_price_yuy  1_fx_volume_yuy
1990-01                 2              10               3               30
1990-01                 2              20               2               40
1990-02                 2              30               3               50
I need to group by year_month and take the mean of the price columns and the sum of the volume columns.
Is there any quick way to do this in one statement, e.g. average if the column name contains 'price' and sum if it contains 'volume'?
df.groupby('year_month').?
Note: this is just sample data with less columns but format is similar
Output:
year_month  0_fx_price_gy  0_fx_volume_gy  1_fx_price_yuy  1_fx_volume_yuy
1990-01                 2              30             2.5               70
1990-02                 2              30               3               50
Create a dictionary mapping matched column names to aggregation functions and pass it to DataFrameGroupBy.agg; finally, add reindex in case the order of the output columns has changed:
d1 = dict.fromkeys(df.columns[df.columns.str.contains('price')], 'mean')
d2 = dict.fromkeys(df.columns[df.columns.str.contains('volume')], 'sum')
#merge dicts together
d = {**d1, **d2}
print (d)
{'0_fx_price_gy': 'mean', '1_fx_price_yuy': 'mean',
'0_fx_volume_gy': 'sum', '1_fx_volume_yuy': 'sum'}
Another solution for dictionary:
d = {}
for c in df.columns:
    if 'price' in c:
        d[c] = 'mean'
    if 'volume' in c:
        d[c] = 'sum'
The solution can be simplified if, apart from the first column (filtered out by df.columns[1:]), there are only price and volume columns:
d = {x: 'mean' if 'price' in x else 'sum' for x in df.columns[1:]}
df1 = df.groupby('year_month', as_index=False).agg(d).reindex(columns=df.columns)
print(df1)
  year_month  0_fx_price_gy  0_fx_volume_gy  1_fx_price_yuy  1_fx_volume_yuy
0    1990-01            2.0              30             2.5               70
1    1990-02            2.0              30             3.0               50
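For reference, a self-contained sketch that reproduces this output from the question's sample data, using the dict-comprehension variant:
import pandas as pd

df = pd.DataFrame({
    'year_month': ['1990-01', '1990-01', '1990-02'],
    '0_fx_price_gy': [2, 2, 2],
    '0_fx_volume_gy': [10, 20, 30],
    '1_fx_price_yuy': [3, 2, 3],
    '1_fx_volume_yuy': [30, 40, 50],
})

# mean for price columns, sum for volume columns
d = {c: 'mean' if 'price' in c else 'sum' for c in df.columns[1:]}
print(df.groupby('year_month', as_index=False).agg(d))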