Feed the results of Pandas groupby() back into the original dataframe? [duplicate]

How can I use groupby() to arrive at a count of employee types on a given day, and feed the results back into the original dataframe?
Here's the data:
shifts = [("Cashier", "Thursday"), ("Cashier", "Thursday"),
("Cashier", "Thursday"), ("Cook", "Thursday"),
("Cashier", "Friday"), ("Cashier", "Friday"),
("Cook", "Friday"), ("Cook", "Friday"),
("Cashier", "Saturday"), ("Cook", "Saturday"),
("Cook", "Saturday")]
labels = ["JOB_TITLE", "DAY"]
df = pd.DataFrame.from_records(shifts, columns=labels)
This use of value_counts() produces the correct results:
shifts_series = df.groupby('DAY')['JOB_TITLE'].value_counts()
How to feed the values back into the original DF?
Desired results
JOB_TITLE DAY TYPE
0 Cashier Thursday 3
1 Cashier Thursday 3
2 Cashier Thursday 3
3 Cook Thursday 1
4 Cashier Friday 2
5 Cashier Friday 2
6 Cook Friday 2
7 Cook Friday 2
8 Cashier Saturday 1
9 Cook Saturday 2
10 Cook Saturday 2
transform()?
I found some answers suggesting transform(), but grouping only on 'DAY' counts every row for that day rather than each (DAY, JOB_TITLE) pair:
df['TYPE'] = df.groupby('DAY')['JOB_TITLE'].transform('count')
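For what it's worth, grouping on both columns should make transform() produce the per-(DAY, JOB_TITLE) counts the question asks for; a minimal sketch:
# count rows per (DAY, JOB_TITLE) pair and broadcast the count back to each row
df['TYPE'] = df.groupby(['DAY', 'JOB_TITLE'])['JOB_TITLE'].transform('count')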
Nasty anti-pattern
I managed to make a nasty little Pandas anti-pattern using the answer to a different question. I tried to loop over the single-worker shifts, [('Saturday', 'Cashier'), ('Thursday', 'Cook')], and label them:
import numpy as np
shift_filter1 = shifts_series[shifts_series == 1].index.tolist()
df['WORKED_SOLO'] = np.nan
for workday, title in shift_filter1:
    df['WORKED_SOLO'] = np.where((df['WORKED_SOLO'].isna()) & (df['DAY'] == workday) & (df['JOB_TITLE'] == title), True, np.nan)
But each pass of the loop wipes out the previous pass's result, despite the isna() test: np.where writes np.nan back into every row that fails the current condition, including the rows already marked True. Obviously not the Pandas way.
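As an aside, the same flag can probably be built without a loop by comparing a per-(DAY, JOB_TITLE) count to 1; a minimal sketch that yields a boolean column rather than True/NaN:
# True only for rows whose (DAY, JOB_TITLE) combination occurs exactly once
df['WORKED_SOLO'] = df.groupby(['DAY', 'JOB_TITLE'])['JOB_TITLE'].transform('count').eq(1)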

You can do the following:
import pandas as pd
shifts = [("Cashier", "Thursday"), ("Cashier", "Thursday"),
("Cashier", "Thursday"), ("Cook", "Thursday"),
("Cashier", "Friday"), ("Cashier", "Friday"),
("Cook", "Friday"), ("Cook", "Friday"),
("Cashier", "Saturday"), ("Cook", "Saturday"),
("Cook", "Saturday")]
labels = ["JOB_TITLE", "DAY"]
df = pd.DataFrame.from_records(shifts, columns=labels)
shifts_series = df.groupby('DAY')['JOB_TITLE'].value_counts()
shifts_series = shifts_series.reset_index(name='TYPE')
df = pd.merge(df, shifts_series, on=['JOB_TITLE', 'DAY'])
print(df)
which gives:
JOB_TITLE DAY TYPE
0 Cashier Thursday 3
1 Cashier Thursday 3
2 Cashier Thursday 3
3 Cook Thursday 1
4 Cashier Friday 2
5 Cashier Friday 2
6 Cook Friday 2
7 Cook Friday 2
8 Cashier Saturday 1
9 Cook Saturday 2
10 Cook Saturday 2

Related

Pandas: Drop duplicates that appear within a time interval

We have a dataframe containing 'ID' and 'DAY' columns, which show when a specific customer made a complaint. We need to drop duplicates from the 'ID' column, but only if the duplicates happened at most 30 days apart. Please see the example below:
Current Dataset:
ID DAY
0 1 22.03.2020
1 1 18.04.2020
2 2 10.05.2020
3 2 13.01.2020
4 3 30.03.2020
5 3 31.03.2020
6 3 24.02.2021
Goal:
ID DAY
0 1 22.03.2020
1 2 10.05.2020
2 2 13.01.2020
3 3 30.03.2020
4 3 24.02.2021
Any suggestions? I have tried groupby and then creating a loop to calculate the difference between each combination, but because the dataframe has millions of rows this would take forever...
You can compute the difference between successive dates per group and use it to form a mask to remove days that are less than 30 days apart:
df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)
mask = (df
        .sort_values(by=['ID', 'DAY'])
        .groupby('ID')['DAY']
        .diff().lt('30d')
        .sort_index()
        )
df[~mask]
NB: a potential drawback of this approach is that if the customer makes a new complaint within the 30 days, that complaint still restarts the 30-day threshold for the next one.
output:
ID DAY
0 1 2020-03-22
2 2 2020-05-10
3 2 2020-01-13
4 3 2020-03-30
6 3 2021-02-24
Thus another approach might be to resample the data per group every 30 days:
(df
.groupby('ID')
.resample('30d', on='DAY').first()
.dropna()
.convert_dtypes()
.reset_index(drop=True)
)
output:
ID DAY
0 1 2020-03-22
1 2 2020-01-13
2 2 2020-05-10
3 3 2020-03-30
4 3 2021-02-24
You can try grouping by the ID column and diffing the DAY column within each group:
df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)
from datetime import timedelta
m = timedelta(days=30)
out = df.groupby('ID').apply(lambda group: group[~group['DAY'].diff().abs().le(m)]).reset_index(drop=True)
print(out)
ID DAY
0 1 2020-03-22
1 2 2020-05-10
2 2 2020-01-13
3 3 2020-03-30
4 3 2021-02-24
To convert to original date format, you can use dt.strftime
out['DAY'] = out['DAY'].dt.strftime('%d.%m.%Y')
print(out)
ID DAY
0 1 22.03.2020
1 2 10.05.2020
2 2 13.01.2020
3 3 30.03.2020
4 3 24.02.2021

pandas dataframe group by create a new column

I have a dataframe with the below format:
name date
Anne 2018/07/04
Anne 2018/07/06
Bob 2015/10/01
Bob 2015/10/10
Bob 2015/11/11
Anne 2018/07/05
... ...
I would like to add a column with the number of days that have passed relative to each person's minimum date.
for each row:
relative_day = (person's date) - (minimum of person's date)
The output is:
name date relative_day
Anne 2018/07/04 0
Anne 2018/07/06 2
Bob 2015/10/01 0
Bob 2015/10/10 9
Bob 2015/11/11 41
Anne 2018/07/05 1
I tried to group by name first and then write a for loop over each name to add the column, but it raises the warning
A value is trying to be set on a copy of a slice from a DataFrame.
Here is the code I have tried so far:
df['relative_day'] = None
person_groups = df.groupby('name')
for person_name, person_dates in person_groups:
    person_dates['relative_day'] = person_dates['date'].min()
Get the name as an index, group on the name, then subtract the minimum to get your relative dates.
result = df.astype({"date": "datetime64[ns]"}).set_index("name")
result.assign(relative_day=result['date'] - result.groupby("name")['date'].transform("min"))
date relative_day
name
Anne 2018-07-04 0 days
Anne 2018-07-06 2 days
Bob 2015-10-01 0 days
Bob 2015-10-10 9 days
Bob 2015-11-11 41 days
Anne 2018-07-05 1 days
Let us try
df.date=pd.to_datetime(df.date)
df['new'] = (df.date - df.groupby('name').date.transform('min')).dt.days
df
name date new
0 Anne 2018-07-04 0
1 Anne 2018-07-06 2
2 Bob 2015-10-01 0
3 Bob 2015-10-10 9
4 Bob 2015-11-11 41
5 Anne 2018-07-05 1
@sammywemmy has a good solution. I want to show another possible way.
import pandas as pd
# read dataset
df = pd.read_csv('data.csv')
# change column data type
df['date'] = pd.to_datetime(df['date'], format='%Y/%m/%d')
# group by name
df_group = df.groupby('name')
# get minimum date value
df_group_min = df_group['date'].min()
# create minimum date column by name
df['min'] = df.apply(lambda r: df_group_min[r['name']], axis=1)
# calculate relative day
df['relative_day'] = (df['date'] - df['min']).dt.days
# remove minimum column
df.drop('min', axis=1, inplace=True)
print(df)
Output
name date relative_day
0 Anne 2018-07-04 0
1 Anne 2018-07-06 2
2 Bob 2015-10-01 0
3 Bob 2015-10-10 9
4 Bob 2015-11-11 41
5 Anne 2018-07-05 1
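As a side note, the per-row apply that builds the 'min' column can likely be replaced with a vectorized lookup of the precomputed group minima, something like:
# map each name to its minimum date instead of applying a lambda row by row
df['min'] = df['name'].map(df_group_min)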

pandas group by week and get day

Suppose I have test data like below:
import pandas as pd
data_dic = {
    "day": ['2019-01-18', '2019-01-18', '2019-01-18', '2019-01-19',
            '2019-01-19', '2019-01-25', '2019-02-19', '2019-02-24'],
    "data": [0, 1, 3, 3, 0, 1, 2, 5],
    "col2": [10, 11, 1, 1, 10, 1, 2, 5],
    "col3": [5, 6, 7, 8, 9, 1, 2, 5]
}
df = pd.DataFrame(data_dic)
df.index = pd.to_datetime(df.day)
df = df.drop(['day'], axis=1)
df.insert(0, 'day_name', df.index.day_name())
Result:
day_name data col2 col3
day
2019-01-18 Friday 0 10 5
2019-01-18 Friday 1 11 6
2019-01-18 Friday 3 1 7
2019-01-19 Saturday 3 1 8
2019-01-19 Saturday 0 10 9
2019-01-25 Friday 1 1 1
2019-02-19 Tuesday 2 2 2
2019-02-24 Sunday 5 5 5
Now I need to group this data by week and take the max value of col2. I did this with:
df = df.groupby(df.index.to_period("w")).agg({'col2':'max'})
Result:
col2
day
2019-01-14/2019-01-20 11
2019-01-21/2019-01-27 1
2019-02-18/2019-02-24 5
Question:
How do I get the date on which the max grouped value occurred?
Expected result:
col2 day
day
2019-01-14/2019-01-20 11 2019-01-18
2019-01-21/2019-01-27 1 2019-01-25
2019-02-18/2019-02-24 5 2019-02-24
Thanks for your time and effort.
Use DataFrameGroupBy.idxmax together with GroupBy.agg - select the column after the groupby and pass (name, aggregation) tuples:
df1 = df.groupby(df.index.to_period("w"))['col2'].agg([('col2','max'), ('day','idxmax')])
print (df1)
col2 day
day
2019-01-14/2019-01-20 11 2019-01-18
2019-01-21/2019-01-27 1 2019-01-25
2019-02-18/2019-02-24 5 2019-02-24
Pandas 0.25+ solution:
df.groupby(df.index.to_period("w")).agg(col2=pd.NamedAgg(column='col2', aggfunc='max'),
                                        day=pd.NamedAgg(column='col2', aggfunc='idxmax'))
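The same named aggregation can also be written with the shorter tuple form that pandas 0.25+ accepts:
df.groupby(df.index.to_period("w")).agg(col2=('col2', 'max'), day=('col2', 'idxmax'))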

Calculate the number of weekends (Saturdays and Sundays) between two dates

I have a data frame with two date columns, a start and an end date. How can I find the number of weekend days (Saturdays and Sundays) between the start and end dates using pandas or Python datetimes?
I know that pandas has DatetimeIndex.weekday, which returns values 0 to 6 for each day of the week, starting from Monday.
# create a data-frame
import pandas as pd
df = pd.DataFrame({'start_date': ['4/5/19', '4/5/19', '1/5/19', '28/4/19'],
                   'end_date': ['4/5/19', '5/5/19', '4/5/19', '5/5/19']})
# convert objects to datetime format
df['start_date'] = pd.to_datetime(df['start_date'], dayfirst=True)
df['end_date'] = pd.to_datetime(df['end_date'], dayfirst=True)
# Trying to get the date index between dates as a prelim step but fails
pd.DatetimeIndex(df['end_date'] - df['start_date']).weekday
I'm expecting the result to be this: (weekend_count includes both start and end dates)
start_date end_date weekend_count
4/5/2019 4/5/2019 1
4/5/2019 5/5/2019 2
1/5/2019 4/5/2019 1
28/4/2019 5/5/2019 3
IIUC
df['New'] = [pd.date_range(x, y).weekday.isin([5, 6]).sum() for x, y in zip(df.start_date, df.end_date)]
df
start_date end_date New
0 2019-05-04 2019-05-04 1
1 2019-05-04 2019-05-05 2
2 2019-05-01 2019-05-04 1
3 2019-04-28 2019-05-05 3
Try with np.busday_count (the total number of days inclusive, minus the business days, leaves the weekend days):
import numpy as np
df['weekend_count'] = ((df.end_date - df.start_date).dt.days + 1) - np.busday_count(
    df.start_date.dt.date, df.end_date.dt.date)
print(df)
start_date end_date weekend_count
0 2019-05-04 2019-05-04 1
1 2019-05-04 2019-05-05 2
2 2019-05-01 2019-05-04 1
3 2019-04-28 2019-05-05 3
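If you prefer to count the weekend days directly, np.busday_count also accepts a custom weekmask; a minimal sketch (busday_count excludes the end date, so one day is added to make the range inclusive):
import numpy as np
import pandas as pd
df['weekend_count'] = np.busday_count(
    df['start_date'].values.astype('datetime64[D]'),
    (df['end_date'] + pd.Timedelta(days=1)).values.astype('datetime64[D]'),
    weekmask='Sat Sun')  # count only Saturdays and Sundays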

Selecting all the previous 6 months' records from the occurrence of a particular value in a column in pandas

I want to select all the previous 6 months records for a customer whenever a particular transaction is done by the customer.
Data looks like:
Cust_ID Transaction_Date Amount Description
1 08/01/2017 12 Moved
1 03/01/2017 15 X
1 01/01/2017 8 Y
2 10/01/2018 6 Moved
2 02/01/2018 12 Z
Here, I want to look for the Description "Moved" and then select all records from the last 6 months for every Cust_ID.
Output should look like:
Cust_ID Transaction_Date Amount Description
1 08/01/2017 12 Moved
1 03/01/2017 15 X
2 10/01/2018 6 Moved
I want to do this in python. Please help.
The idea is to create a Series of the Moved datetimes shifted back by a 6-month offset, and then keep only the rows whose Transaction_Date is later than the mapped offset:
EDIT: Get all datetimes for each Moved value:
df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'])
df = df.sort_values(['Cust_ID','Transaction_Date'])
df['g'] = df['Description'].iloc[::-1].eq('Moved').cumsum()
s = (df[df['Description'].eq('Moved')]
       .set_index(['Cust_ID', 'g'])['Transaction_Date'] - pd.offsets.MonthOffset(6))
mask = df.join(s.rename('a'), on=['Cust_ID','g'])['a'] < df['Transaction_Date']
df1 = df[mask].drop('g', axis=1)
EDIT1: Get all datetimes measured from the minimal (first) Moved datetime per group; any later Moved rows in a group are removed:
print (df)
Cust_ID Transaction_Date Amount Description
0 1 10/01/2017 12 X
1 1 01/23/2017 15 Moved
2 1 03/01/2017 8 Y
3 1 08/08/2017 12 Moved
4 2 10/01/2018 6 Moved
5 2 02/01/2018 12 Z
#convert to datetimes
df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'])
#mask for filter Moved rows
mask = df['Description'].eq('Moved')
#filter and sorting this rows
df1 = df[mask].sort_values(['Cust_ID','Transaction_Date'])
print (df1)
Cust_ID Transaction_Date Amount Description
1 1 2017-01-23 15 Moved
3 1 2017-08-08 12 Moved
4 2 2018-10-01 6 Moved
#get duplicated filtered rows in df1
mask = df1.duplicated('Cust_ID')
#create Series for map
s = df1[~mask].set_index('Cust_ID')['Transaction_Date'] - pd.offsets.MonthOffset(6)
print (s)
Cust_ID
1 2016-07-23
2 2018-04-01
Name: Transaction_Date, dtype: datetime64[ns]
#create mask to filter out the other Moved rows (keep only the first per group)
m2 = ~mask.reindex(df.index, fill_value=False)
df1 = df[(df['Cust_ID'].map(s) < df['Transaction_Date']) & m2]
print (df1)
Cust_ID Transaction_Date Amount Description
0 1 2017-10-01 12 X
1 1 2017-01-23 15 Moved
2 1 2017-03-01 8 Y
4 2 2018-10-01 6 Moved
EDIT2: Keep the rows between the last Moved per group and the following 6 months:
#get last duplicated filtered rows in df1
mask = df1.duplicated('Cust_ID', keep='last')
#create Series for map
s = df1[~mask].set_index('Cust_ID')['Transaction_Date']
print (s)
Cust_ID
1 2017-08-08
2 2018-10-01
Name: Transaction_Date, dtype: datetime64[ns]
m2 = ~mask.reindex(df.index, fill_value=False)
#filter by between Moved and next 6 months
df3 = df[df['Transaction_Date'].between(df['Cust_ID'].map(s), df['Cust_ID'].map(s + pd.offsets.MonthOffset(6))) & m2]
print (df3)
Cust_ID Transaction_Date Amount Description
3 1 2017-08-08 12 Moved
0 1 2017-10-01 12 X
4 2 2018-10-01 6 Moved