Data analysis with pandas

The following df is a summary of my whole dataset, just to illustrate my problem.
The df shows the job applications of each id, and I want to know: which combination of sectors is an individual most likely to apply to?
df
id education area_job_application
1 College Construction
1 College Sales
1 College Administration
2 University Finance
2 University Sales
3 College Finance
3 College Sales
4 University Administration
4 University Sales
4 University Data analyst
5 University Administration
5 University Sales
answer
Construction Sales Administration Finance Data analyst
Construction 1 1 1 0 0
Sales 1 5 3 2 1
Administration 1 3 3 0 1
Finance 0 2 0 2 0
Data analyst 0 1 1 0 1
This answer shows that Administration and Sales are the pair of sectors most likely to receive applications from the same id (this is the answer I am looking for). But I am also interested in the other combinations; I think a heatmap would be very informative to illustrate this data.
Combinations of a sector with itself are irrelevant (maybe the diagonal of the answer matrix should be 0; the value doesn't matter, as I won't analyse it).

Use crosstab, or groupby with size and unstack, to build an id-by-sector indicator table; then multiply the transposed DataFrame by itself with DataFrame.dot, and finally add reindex for a custom order of index and columns:
# dynamically create the order from the unique values of the column
L = df['area_job_application'].unique()
# alternative: df = pd.crosstab(df.id, df.area_job_application)
df = df.groupby(['id', 'area_job_application']).size().unstack(fill_value=0)
df = df.T.dot(df).rename_axis(None).rename_axis(None, axis=1).reindex(columns=L, index=L)
print (df)
Construction Sales Administration Finance Data analyst
Construction 1 1 1 0 0
Sales 1 5 3 2 1
Administration 1 3 3 0 1
Finance 0 2 0 2 0
Data analyst 0 1 1 0 1
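Since the question also asks for a heatmap, here is a minimal sketch using seaborn (assuming seaborn and matplotlib are available; the diagonal is zeroed first, since same-sector combinations are irrelevant, as noted above):
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# zero the diagonal: same-sector pairs are not analysed
co = df.mask(np.eye(len(df), dtype=bool), 0)
sns.heatmap(co, annot=True, fmt='d', cmap='Blues')
plt.show()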

How to use Pandas to find relative share of rooms in individual floors

I have a dataframe df with the following structure
Floor Room Area
0 1 Living room 18
1 1 Kitchen 18
2 1 Bedroom 24
3 2 Bathroom 10
4 2 Bedroom 20
and I want to add a series floor_share with the relative share/ratio of each room within its floor, so that the dataframe becomes
Floor Room Area floor_share
0 1 Living room 18 0.30
1 1 Kitchen 18 0.30
2 1 Bedroom 24 0.40
3 2 Bathroom 10 0.33
4 2 Bedroom 20 0.67
If it is possible to do this with a one-liner (or any other idiomatic manner), I'll be very happy to learn how.
Current workaround
What I have done that produces the correct results is to first find the total floor areas by
floor_area_sums = df.groupby('Floor')['Area'].sum()
which gives
Floor
1 60
2 30
Name: Area, dtype: int64
I then initialize a new series to 0, and find the correct values while iterating through the dataframe rows.
df["floor_share"] = 0
for idx, row in df.iterrows():
df.loc[idx, 'floor_share'] = df.loc[idx, 'Area']/floor_area_sums[row.Floor]
IIUC use:
df["floor_share"] = df['Area'].div(df.groupby('Floor')['Area'].transform('sum'))
print (df)
Floor Room Area floor_share
0 1 Living room 18 0.300000
1 1 Kitchen 18 0.300000
2 1 Bedroom 24 0.400000
3 2 Bathroom 10 0.333333
4 2 Bedroom 20 0.666667
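If a chainable one-liner is preferred, the same computation can also be spelled with assign (a stylistic alternative, not a different method):
df = df.assign(floor_share=df['Area'] / df.groupby('Floor')['Area'].transform('sum'))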

Pandas: Evenly allocate/assign data from one Dataframe to another over multiple users

I have a Dataframe of account manager IDs and the total of their allocated accounts (df_am):
account_manager_id account_total
0 1 8
1 2 3
2 3 3
3 4 1
4 5 7
5 6 2
I have a second Dataframe of accounts to be allocated to an account manager (df_poc):
point_of_contact account_no
0 John 100
1 Bob 78
2 Sally 125
3 Greg 128
4 Bret 78
5 Corey 100
6 Chad 100
7 Mavis 8
8 Andre 632
9 Hunter 157
10 Debra 12
I need to evenly allocate the accounts to the account managers. Note that the same account_no can appear multiple times with a different point_of_contact; all of those rows need to be allocated to the same account_manager_id.
To do this I'm looking to take each unique account_no in df_poc, allocate it to the account manager who has the lowest total in account_total, then recount the totals and move on to the next account_no.
For example, account_manager_id 4 will get the first account_no, as they only have 1 account so far. As it's account_no 100 and there are 3 pocs, account_manager_id 4 will get all 3, bringing their total to 4.
This makes account_manager_id 6 the lowest on 2, so account_no 78 (2 pocs) is allocated to them, bringing their total to 4.
We now have two account managers with 3 accounts (2 and 3). I have no preference here, so I will just allocate the next account_no to the first of them, bringing account_manager_id 2 up to 4 and leaving 3 still on 3, and so on and so forth.
I truly hope you can see what I am trying to achieve. If you have a better solution, please let me know.
Desired outcome, df_am:
account_manager_id account_total
0 1 8
1 2 5
2 3 5
3 4 5
4 5 7
5 6 5
Desired outcome, df_poc:
point_of_contact account_no account_manager_id
0 John 100 4
1 Bob 78 6
2 Sally 125 2
3 Greg 128 3
4 Bret 78 6
5 Corey 100 4
6 Chad 100 4
7 Mavis 8 2
8 Andre 632 3
9 Hunter 157 4
10 Debra 12 6
As you can hopefully see, account_managers 1 and 5 never got a single account, so that the other account_managers could catch up on their totals.
I've been using a loop (iterrows) with .min to get the account manager to allocate to; however, this approach does not take repeated account_no values into consideration and would lead to accounts being split over multiple account_managers.
lowest = df_am[df_am["account_total"] == df_am["account_total"].min()] #to get lowest total
lowest = lowest.iloc[:1] #keep first account manager if multiples on same total
Thank you, any help is appreciated.
Solved this: I created a new dataframe (df_check) of unique account_no values and counted the point_of_contact entries for each account.
account_no counts
0 100 3
1 78 2
2 125 1
3 128 1
4 8 1
5 632 1
6 157 1
7 12 1
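For reference, a minimal sketch of how such a df_check can be built from df_poc (sort=False keeps the accounts in first-appearance order, matching the table above):
df_check = df_poc.groupby('account_no', sort=False).size().reset_index(name='counts')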
Then I ran iterrows on the new df (df_check):
df_check["account_manager_id"] = 0  # pre-create the assignment column
for i, row in df_check.iterrows():  # do not like using iterrows, but I see no alternative
    count = row["counts"]
    # index of the account manager with the fewest accounts;
    # idxmin returns the first match, so ties always resolve to the first manager
    lowest = df_am["account_total"].idxmin()
    # assign that account manager to the account
    df_check.at[i, "account_manager_id"] = df_am.at[lowest, "account_manager_id"]
    # add the account's point-of-contact count to the manager's new total
    df_am.at[lowest, "account_total"] += count
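To carry the assignments back to df_poc, a merge on account_no should work; a sketch, assuming df_check now holds the account_manager_id column filled in by the loop above:
df_poc = df_poc.merge(df_check[['account_no', 'account_manager_id']], on='account_no')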

How can I create a dataframe column which counts the occurrence of each value in another column?

I am trying to add a column to my dataframe which holds the number of times each unique value has appeared so far in another column.
For example, I have the following dataframe:
Date|Team|Goals|
22.08.20|Team1|4|
22.08.20|Team2|3|
22.08.20|Team3|1|
22.09.20|Team1|4|
22.09.20|Team3|5|
I would like to add a counter column, which counts how often each team appears:
Date|Team|Goals|Count|
22.08.20|Team1|4|1|
22.08.20|Team2|3|1|
22.08.20|Team3|1|1|
22.09.20|Team1|4|2|
22.09.20|Team3|5|2|
My Dataframe is ordered by date, so the teams should appear in the correct order.
Apologies, very new to pandas and Stack Overflow, so please let me know if I can format this question differently. Thanks
TRY:
df['Count'] = df.groupby('Team').cumcount().add(1)
OUTPUT:
Date Team Goals Count
0 22.08.20 Team1 4 1
1 22.08.20 Team2 3 1
2 22.08.20 Team3 1 1
3 22.09.20 Team1 4 2
4 22.09.20 Team3 5 2
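Note that cumcount gives a running count up to each row, which is what the desired output shows. If instead the total number of appearances per team is wanted on every row, transform('size') is the usual spelling (shown for contrast, not part of the question):
df['Total'] = df.groupby('Team')['Team'].transform('size')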
Another answer, building upon @Nk03's, with replicable results:
import pandas as pd
import numpy as np
# Set numpy random seed
np.random.seed(42)
# Create dates array
dates = pd.date_range(start='2021-06-01', periods=10, freq='D')
# Create teams array
teams_names = ['Team 1', 'Team 2', 'Team 3']
teams = [teams_names[i] for i in np.random.randint(0, 3, 10)]
# Create goals array
goals = np.random.randint(1, 6, 10)
# Create DataFrame
data = pd.DataFrame({'Date': dates,
                     'Team': teams,
                     'Goals': goals})
# Cumulative count of teams
data['Count'] = data.groupby('Team').cumcount().add(1)
The output will be:
Date Team Goals Count
0 2021-06-01 Team 2 3 1
1 2021-06-02 Team 2 1 2
2 2021-06-03 Team 2 4 3
3 2021-06-04 Team 1 2 1
4 2021-06-05 Team 2 4 4
5 2021-06-06 Team 1 2 2
6 2021-06-07 Team 2 2 5
7 2021-06-08 Team 3 4 1
8 2021-06-09 Team 3 5 2
9 2021-06-10 Team 1 2 3

Manipulating series in a dataframe

My dataframe has a list of comma-separated values in one column. I want to find the set of distinct entries, create a new column in the dataframe for each distinct entry, and then fill the new columns with 1 or 0 depending on whether the row contains that city name.
The idea is to use the new columns in building a logistic regression model.
As an example
Before
Name City
Jack NewYork,Chicago,Seattle
Jill Seattle, SanFrancisco
Ted Chicago,SanFrancisco
Bill NewYork,Seattle
After
Name NewYork Chicago Seattle SanFrancisco
Jack 1 1 1 0
Jill 0 0 1 1
Ted 0 1 0 1
Bill 1 0 1 0
You can do this with the Series.str.get_dummies method:
import pandas as pd
df = pd.DataFrame(
    {"Name": ["Jack", "Jill", "Ted", "Bill"],
     "City": ["NewYork,Chicago,Seattle", "Seattle,SanFrancisco", "Chicago,SanFrancisco", "NewYork,Seattle"]}
)
print(pd.concat((df, df.City.str.get_dummies(",")), axis=1))
Result:
Name City Chicago NewYork SanFrancisco Seattle
0 Jack NewYork,Chicago,Seattle 1 1 0 1
1 Jill Seattle,SanFrancisco 0 0 1 1
2 Ted Chicago,SanFrancisco 1 0 1 0
3 Bill NewYork,Seattle 0 1 0 1
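One caveat: the question's raw data has a space after the comma in Jill's row ('Seattle, SanFrancisco'), which would produce a spurious ' SanFrancisco' dummy column. Normalizing the separators first avoids that; a minimal sketch:
df['City'] = df['City'].str.replace(r',\s+', ',', regex=True)  # collapse ', ' to ','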

How to apply one-hot encoding or get_dummies on 2 columns together in pandas?

I have the below dataframe, which contains sample values like:
df = pd.DataFrame([["London", "Cambridge", 20], ["Cambridge", "London", 10], ["Liverpool", "London", 30]], columns= ["city_1", "city_2", "id"])
city_1 city_2 id
London Cambridge 20
Cambridge London 10
Liverpool London 30
I need the output dataframe below, which is built by joining the 2 city columns together and then applying one-hot encoding:
id London Cambridge Liverpool
20 1 1 0
10 1 1 0
30 1 0 1
Currently, I am using the below code, which works on one column at a time. Please could you advise if there is any pythonic way to get the above output?
output_df = pd.get_dummies(df, columns=['city_1', 'city_2'])
which results in the columns id, city_1_Cambridge, city_1_London, and so on.
You can add the parameters prefix_sep and prefix to get_dummies, and then use max if you want only 1 or 0 values (dummies/indicator columns), or sum if you need to count the 1 values:
output_df = (pd.get_dummies(df, columns=['city_1', 'city_2'], prefix_sep='', prefix='')
               .max(axis=1, level=0))
print (output_df)
id Cambridge Liverpool London
0 20 1 0 1
1 10 1 0 1
2 30 0 1 1
Or, if you want to process all columns except id, first convert the non-processed column(s) to the index with DataFrame.set_index, then use get_dummies with max, and finally add DataFrame.reset_index:
output_df = (pd.get_dummies(df.set_index('id'), prefix_sep='', prefix='')
               .max(axis=1, level=0)
               .reset_index())
print (output_df)
id Cambridge Liverpool London
0 20 1 0 1
1 10 1 0 1
2 30 0 1 1
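A compatibility note: DataFrame.max(axis=1, level=0) was removed in pandas 2.0, so the snippets above only run on older versions. One workable substitute on recent pandas (my adaptation, not part of the original answer) is to group the transposed frame by column name:
dummies = pd.get_dummies(df.set_index('id'), prefix_sep='', prefix='')
# duplicate column names are grouped together and reduced with max;
# sort=False keeps first-appearance order, astype(int) turns the boolean
# dummies of pandas >= 2.0 back into 0/1
output_df = dummies.T.groupby(level=0, sort=False).max().T.astype(int).reset_index()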