pandas group by week and get day

Suppose I have test data like below:
import pandas as pd
data_dic = {
"day": ['2019-01-18', '2019-01-18', '2019-01-18', '2019-01-19',
'2019-01-19','2019-01-25', '2019-02-19', '2019-02-24'],
"data": [0, 1,3,3, 0, 1,2 ,5],
"col2": [10, 11,1,1, 10, 1,2, 5],
"col3": [5, 6,7,8, 9, 1,2, 5]
}
df = pd.DataFrame(data_dic)
df.index = pd.to_datetime(df.day)
df = df.drop(['day'], axis=1)
df.insert(0, 'day_name', df.index.day_name())  # weekday_name was removed in pandas 1.0
Result:
day_name data col2 col3
day
2019-01-18 Friday 0 10 5
2019-01-18 Friday 1 11 6
2019-01-18 Friday 3 1 7
2019-01-19 Saturday 3 1 8
2019-01-19 Saturday 0 10 9
2019-01-25 Friday 1 1 1
2019-02-19 Tuesday 2 2 2
2019-02-24 Sunday 5 5 5
Now I need to group this data by week and take the max value of col2 in each week. I did this with:
df = df.groupby(df.index.to_period("w")).agg({'col2':'max'})
Result:
col2
day
2019-01-14/2019-01-20 11
2019-01-21/2019-01-27 1
2019-02-18/2019-02-24 5
Question:
How do I get the date on which each group's max value occurred?
Expected result:
col2 day
day
2019-01-14/2019-01-20 11 2019-01-18
2019-01-21/2019-01-27 1 2019-01-25
2019-02-18/2019-02-24 5 2019-02-24
Thanks for Your time and effort.

Use DataFrameGroupBy.idxmax with a modified GroupBy.agg call: select the column after groupby and pass a list of (new name, function) tuples:
df1 = df.groupby(df.index.to_period("w"))['col2'].agg([('col2','max'), ('day','idxmax')])
print (df1)
col2 day
day
2019-01-14/2019-01-20 11 2019-01-18
2019-01-21/2019-01-27 1 2019-01-25
2019-02-18/2019-02-24 5 2019-02-24
Pandas 0.25+ solution:
df.groupby(df.index.to_period("w")).agg(col2=pd.NamedAgg(column='col2', aggfunc='max'),
day=pd.NamedAgg(column='col2', aggfunc='idxmax'))
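For reference, here is a minimal self-contained sketch of the named-aggregation variant, using only the day and col2 columns from the question. The variable name weekly is just illustrative, and the uppercase "W" alias is used because recent pandas versions prefer it over "w".

```python
import pandas as pd

# Sample data from the question, reduced to the columns that matter here.
df = pd.DataFrame({
    "day": ["2019-01-18", "2019-01-18", "2019-01-18", "2019-01-19",
            "2019-01-19", "2019-01-25", "2019-02-19", "2019-02-24"],
    "col2": [10, 11, 1, 1, 10, 1, 2, 5],
})
df["day"] = pd.to_datetime(df["day"])
df = df.set_index("day")

# Group by calendar week; idxmax returns the index label (the date)
# of the row holding the weekly maximum.
weekly = df.groupby(df.index.to_period("W")).agg(
    col2=("col2", "max"),
    day=("col2", "idxmax"),
)
print(weekly)
```

If several rows in a week tie for the maximum, idxmax reports the first one it encounters.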

Related

Pandas: Drop duplicates that appear within a time interval

We have a dataframe containing 'ID' and 'DAY' columns, which show when a specific customer made a complaint. We need to drop duplicates from the 'ID' column, but only if the duplicates occurred at most 30 days apart. Please see the example below:
Current Dataset:
ID DAY
0 1 22.03.2020
1 1 18.04.2020
2 2 10.05.2020
3 2 13.01.2020
4 3 30.03.2020
5 3 31.03.2020
6 3 24.02.2021
Goal:
ID DAY
0 1 22.03.2020
1 2 10.05.2020
2 2 13.01.2020
3 3 30.03.2020
4 3 24.02.2021
Any suggestions? I have tried groupby and then creating a loop to calculate the difference between each combination, but because the dataframe has millions of rows this would take forever...
You can compute the difference between successive dates per group and use it to form a mask to remove days that are less than 30 days apart:
df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)
mask = (df
.sort_values(by=['ID', 'DAY'])
.groupby('ID')['DAY']
.diff().lt('30d')
.sort_index()
)
df[~mask]
NB: a potential drawback of this approach is that a new complaint made within the 30 days restarts the 30-day window for the next complaint, even though that complaint is itself dropped.
output:
ID DAY
0 1 2020-03-22
2 2 2020-05-10
3 2 2020-01-13
4 3 2020-03-30
6 3 2021-02-24
Thus another approach might be to resample each group into 30-day bins:
(df
.groupby('ID')
.resample('30d', on='DAY').first()
.dropna()
.convert_dtypes()
.reset_index(drop=True)
)
output:
ID DAY
0 1 2020-03-22
1 2 2020-01-13
2 2 2020-05-10
3 3 2020-03-30
4 3 2021-02-24
You can try grouping by the ID column and taking the diff of the DAY column within each group:
df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)
from datetime import timedelta
m = timedelta(days=30)
out = df.groupby('ID').apply(lambda group: group[~group['DAY'].diff().abs().le(m)]).reset_index(drop=True)
print(out)
ID DAY
0 1 2020-03-22
1 2 2020-05-10
2 2 2020-01-13
3 3 2020-03-30
4 3 2021-02-24
To convert to original date format, you can use dt.strftime
out['DAY'] = out['DAY'].dt.strftime('%d.%m.%Y')
print(out)
ID DAY
0 1 22.03.2020
1 2 10.05.2020
2 2 13.01.2020
3 3 30.03.2020
4 3 24.02.2021
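As a rough end-to-end check, the diff-based mask from the first answer can be reproduced on the question's sample data. This is only a sketch: the names gap and kept are illustrative, and pd.Timedelta(days=30) is used in place of the '30d' string.

```python
import pandas as pd

# Sample data from the question; dates are day-first (DD.MM.YYYY).
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3, 3, 3],
    "DAY": ["22.03.2020", "18.04.2020", "10.05.2020", "13.01.2020",
            "30.03.2020", "31.03.2020", "24.02.2021"],
})
df["DAY"] = pd.to_datetime(df["DAY"], format="%d.%m.%Y")

# Time since the previous complaint of the same ID (NaT for the first one).
gap = df.sort_values(["ID", "DAY"]).groupby("ID")["DAY"].diff()

# True where a complaint follows the previous one by less than 30 days.
mask = gap.lt(pd.Timedelta(days=30)).sort_index()

kept = df[~mask]
print(kept)
```

Rows 1 and 5 are dropped because they fall 27 days and 1 day after the previous complaint of the same ID.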

get the sign change count from Dataframe

I have df like this
Date amount
0 2021-06-18 14
1 2021-06-19 -8
2 2021-06-20 -8
3 2021-06-21 17
4 2021-07-02 -8
5 2021-07-05 77
6 2021-07-06 -10
7 2021-08-02 -78
8 2021-08-06 77
9 2021-07-08 10
I want the count of sign changes in amount for each month, like this:
count = [{"June-2021": 2},{"July-2021" : 3},{"Aug-2021" : 1}]
Note: the last date of each month and the first date of the next month belong to different counts, so a sign change between them is not counted.
I want a function for this.
You can use (x.mul(x.shift()) < 0).sum() (a negative product of the current entry and the previous entry indicates a sign change) to get the count of sign changes within each month-year group, as follows:
count = (df.groupby(df['Date'].dt.strftime('%b-%Y'), sort=False)['amount']
.agg(lambda x: (x.mul(x.shift()) < 0).sum())
.to_dict()
)
Result:
print(count)
{'Jun-2021': 2, 'Jul-2021': 3, 'Aug-2021': 1}
Edit
If you want list of dict, you can use:
count = (df.groupby(df['Date'].dt.strftime('%b-%Y'), sort=False)['amount']
.agg(lambda x: (x.mul(x.shift()) < 0).sum())
.reset_index()
.apply(lambda x: {x['Date']: x['amount']}, axis=1)
.to_list()
)
Result:
print(count)
[{'Jun-2021': 2}, {'Jul-2021': 3}, {'Aug-2021': 1}]
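Since the question asks for a function, the approach above could be wrapped roughly as follows. The function name count_sign_changes is hypothetical, and the sketch assumes the Date column is already a datetime dtype and that month names are formatted in an English locale.

```python
import pandas as pd

def count_sign_changes(frame):
    """Return a list of {month-year: count} dicts of sign changes in 'amount'."""
    month = frame["Date"].dt.strftime("%b-%Y")
    counts = frame.groupby(month, sort=False)["amount"].agg(
        # A negative product of neighbouring values means the sign flipped.
        lambda x: int((x.mul(x.shift()) < 0).sum())
    )
    return [{m: int(n)} for m, n in counts.items()]

# Sample data from the question (row 9 is dated in July, as given).
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2021-06-18", "2021-06-19", "2021-06-20", "2021-06-21", "2021-07-02",
        "2021-07-05", "2021-07-06", "2021-08-02", "2021-08-06", "2021-07-08",
    ]),
    "amount": [14, -8, -8, 17, -8, 77, -10, -78, 77, 10],
})

result = count_sign_changes(df)
print(result)
```

Note that a zero amount produces a product of 0, so a transition through zero is not counted as a sign change by this test.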

Pandas dataframe diff except some rows?

df
end_date dt_eps
0 20200930 0.9625
1 20200630 0.5200
2 20200331 0.2130
3 20191231 1.2700
4 20190930 -0.1017
5 20190630 -0.1058
6 20190331 0.0021
7 20181231 0.0100
Note: each end_date value is the last day of a calendar quarter, the rows are sorted from most recent to oldest, and the column type is string.
Goal
Create a q_dt_eps column: the difference in dt_eps between each row and the preceding quarter's row, except that it equals dt_eps when the quarter is Q1. For example, q_dt_eps for 20200930 is 0.4425 (0.9625 - 0.5200), while for 20200331 it is 0.2130.
Try
df['q_dt_eps']=df['dt_eps'].diff(periods=-1)
But this does not keep the original dt_eps value when the quarter is Q1.
You can just convert the date to datetime, extract the quarter of the date, and then create your new column using np.where, keeping the original value when the quarter is equal to 1 and otherwise using the difference from the preceding quarter's value.
import numpy as np
import pandas as pd
df = pd.DataFrame({'end_date':['20200930', '20200630', '20200331',
'20191231', '20190930', '20190630', '20190331', '20181231'],
'dt_eps':[0.9625, 0.52, 0.213, 1.27, -.1017, -.1058, .0021, .01]})
df['end_date'] = pd.to_datetime(df['end_date'], format='%Y%m%d')
df['qtr'] = df['end_date'].dt.quarter
df['q_dt_eps'] = np.where(df['qtr']==1, df['dt_eps'], df['dt_eps'].diff(-1))
df
end_date dt_eps qtr q_dt_eps
0 2020-09-30 0.9625 3 0.4425
1 2020-06-30 0.5200 2 0.3070
2 2020-03-31 0.2130 1 0.2130
3 2019-12-31 1.2700 4 1.3717
4 2019-09-30 -0.1017 3 0.0041
5 2019-06-30 -0.1058 2 -0.1079
6 2019-03-31 0.0021 1 0.0021
7 2018-12-31 0.0100 4 NaN
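The same rule can also be expressed without NumPy. The following is a sketch of an alternative, not the answer's method: Series.where keeps the diff wherever the quarter is not 1 and falls back to dt_eps elsewhere. The intermediate name quarter is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "end_date": ["20200930", "20200630", "20200331", "20191231",
                 "20190930", "20190630", "20190331", "20181231"],
    "dt_eps": [0.9625, 0.52, 0.213, 1.27, -0.1017, -0.1058, 0.0021, 0.01],
})

quarter = pd.to_datetime(df["end_date"], format="%Y%m%d").dt.quarter

# Rows are sorted newest first, so diff(-1) subtracts the next (older) row.
# Where the quarter is 1, the original dt_eps is used instead.
df["q_dt_eps"] = df["dt_eps"].diff(-1).where(quarter != 1, df["dt_eps"])
print(df)
```

As in the np.where version, the oldest row has no earlier quarter to subtract, so its value is NaN unless it is a Q1 row.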

adding a new column using values in another column using pandas

I have a data set
id Category Date
1 Sick 2016-10-10 12:10:21
2 Active 2017-09-08 11:09:06
3 Weak 2018-11-12 06:10:04
Now I want to add a new column that contains only the year, using pandas.
You could do:
import pandas as pd
data = [[1, 'Sick ', '2016-10-10 12:10:21'],
[2, 'Active', '2017-09-08 11:09:06'],
[3, 'Weak ', '2018-11-12 06:10:04']]
df = pd.DataFrame(data=data, columns=['id', 'category', 'date'])
df['year'] = pd.to_datetime(df['date']).dt.year
print(df)
Output
id category date year
0 1 Sick 2016-10-10 12:10:21 2016
1 2 Active 2017-09-08 11:09:06 2017
2 3 Weak 2018-11-12 06:10:04 2018
You can just do df['year'] = pd.DatetimeIndex(df['Date']).year
Output:
id category Date year
0 1 Sick 2016-10-10 12:10:21 2016
1 2 Active 2017-09-08 11:09:06 2017
2 3 Weak 2018-11-12 06:10:04 2018

How to add a yearly amount to daily data in Pandas

I have two DataFrames in pandas. One of them has data every month, the other one has data every year. I need to do some computation where the yearly value is added to the monthly value.
Something like this:
df1, monthly:
2013-01-01 1
2013-02-01 1
...
2014-01-01 1
2014-02-01 1
...
2015-01-01 1
df2, yearly:
2013-01-01 1
2014-01-01 2
2015-01-01 3
And I want to produce something like this:
2013-01-01 (1+1) = 2
2013-02-01 (1+1) = 2
...
2014-01-01 (1+2) = 3
2014-02-01 (1+2) = 3
...
2015-01-01 (1+3) = 4
Where the value of the monthly data is added to the value of the yearly data depending on the year (first value in the parenthesis is the monthly data, second value is the yearly data).
Assuming your "month" column is called date in the Dataframe df, then you can obtain the year by using the dt member:
pd.to_datetime(df.date).dt.year
Add a column like that to your month DataFrame, and call it year. (See this for an explanation).
Now do the same to the year DataFrame.
Do a merge of the month and year DataFrames on the year column, specifying how='left'.
In the resulting DataFrame, you will have both columns. Now just add them.
Example
month_df = pd.DataFrame({
'date': ['2013-01-01', '2013-02-01', '2014-02-01'],
'amount': [1, 2, 3]})
year_df = pd.DataFrame({
'date': ['2013-01-01', '2014-02-01', '2015-01-01'],
'amount': [7, 8, 9]})
month_df['year'] = pd.to_datetime(month_df.date).dt.year
year_df['year'] = pd.to_datetime(year_df.date).dt.year
>>> pd.merge(
month_df,
year_df,
left_on='year',
right_on='year',
how='left')
amount_x date_x year amount_y date_y
0 1 2013-01-01 2013 7 2013-01-01
1 2 2013-02-01 2013 7 2013-01-01
2 3 2014-02-01 2014 8 2014-02-01
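The final "just add them" step is not shown above. A minimal sketch, reusing the example frames, might look like this; the suffixes and the total column name are illustrative choices, not part of the original answer.

```python
import pandas as pd

month_df = pd.DataFrame({
    "date": ["2013-01-01", "2013-02-01", "2014-02-01"],
    "amount": [1, 2, 3],
})
year_df = pd.DataFrame({
    "date": ["2013-01-01", "2014-02-01", "2015-01-01"],
    "amount": [7, 8, 9],
})

month_df["year"] = pd.to_datetime(month_df["date"]).dt.year
year_df["year"] = pd.to_datetime(year_df["date"]).dt.year

# Bring the yearly amount onto each monthly row, matching on the year.
merged = month_df.merge(
    year_df[["year", "amount"]],
    on="year",
    how="left",
    suffixes=("_month", "_year"),
)

# Monthly value plus the yearly value for the same year.
merged["total"] = merged["amount_month"] + merged["amount_year"]
print(merged)
```

With how='left', a month whose year is missing from the yearly frame would get NaN in amount_year, and therefore NaN in total.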