Pandas groupby time and ID and aggregate

I am trying to calculate the sum of payments made in the 2nd half of the year minus the sum of payments made in the 1st half.
This is how the data may look:
ID date payment
1 1/1/2020 10
1 1/2/2020 11
1 1/3/2020 10
1 1/4/2020 10
1 1/5/2020 11
1 1/6/2020 10
1 1/7/2020 10
1 1/8/2020 11
1 1/9/2020 10
1 1/10/2020 32
1 1/11/2020 10
1 1/12/2020 12
2 1/1/2020 10
2 1/2/2020 10
2 1/3/2020 41
2 1/4/2020 10
2 1/5/2020 53
2 1/6/2020 10
2 1/7/2020 10
2 1/8/2020 44
2 1/9/2020 10
2 1/10/2020 2
2 1/11/2020 9
2 1/12/2020 5
I convert the date column to a pandas datetime:
df.date = df.date.astype(str).str.slice(0, 10)
df.date = pd.to_datetime(df.date)
print(df.date.min(),df.date.max())
output: 2020-01-01 00:00:00 2020-12-01 00:00:00
Then I create time points and separate data frames for the 1st and 2nd half of the year:
from datetime import datetime
from dateutil.relativedelta import relativedelta

observation_date = '2020-12-31'
observation_date = datetime.strptime(observation_date, '%Y-%m-%d')
observation_date = observation_date.date()
observation_date = pd.Timestamp(observation_date)
print(observation_date)
mo6_ago = observation_date - relativedelta(months=6)
mo6_ago = pd.Timestamp(mo6_ago)
print(mo6_ago)
mo6_ago_plus1 = observation_date - relativedelta(months=6) + relativedelta(days=1)
mo6_ago_plus1 = pd.Timestamp(mo6_ago_plus1)
print(mo6_ago_plus1)
mo12_ago = observation_date - relativedelta(months=12) + relativedelta(days=1)
mo12_ago = pd.Timestamp(mo12_ago)
print(mo12_ago)
output:
2020-12-31 00:00:00
2020-06-30 00:00:00
2020-07-01 00:00:00
2020-01-01 00:00:00
mask = (df['date'] >= mo12_ago) & (df['date'] <= mo6_ago)
first_half = df.loc[mask]
first_half = first_half[['ID','date','payment']]
print(first_half.date.min(),first_half.date.max())
output: 2020-01-01 00:00:00 2020-06-01 00:00:00
mask = (df['date'] >= mo6_ago_plus1) & (df['date'] <= observation_date)
sec_half = df.loc[mask]
sec_half = sec_half[['ID','date','payment']]
print(sec_half.date.min(),sec_half.date.max())
output: 2020-07-01 00:00:00 2020-12-01 00:00:00
Then I group and sum for the two halves of the year and merge them into one df like that:
sum_first_half = first_half.groupby(['ID'])['payment'].sum().reset_index()
sum_first_half = sum_first_half.rename(columns = {'payment':'payment_first_half'})
sum_sec_half = sec_half.groupby(['ID'])['payment'].sum().reset_index()
sum_sec_half = sum_sec_half.rename(columns = {'payment':'payment_sec_half'})
df_new = pd.merge(sum_first_half, sum_sec_half, how='outer', on='ID')
Finally I subtract the two columns this way:
df_new['sec_minus_first'] = df_new['payment_sec_half'] - df_new['payment_first_half']
ID payment_first_half payment_sec_half sec_minus_first
1 62 85 23
2 134 80 -54
Is there a faster and more memory efficient way of doing this?

Using datetime:
from datetime import datetime as dt
Convert date column to datetime:
df["date"] = pd.to_datetime(df["date"])
Split on a date of your choice, group by ID, sum the payment column in each half, then subtract the halves (selecting the payment column keeps the datetime column out of the sum):
df.loc[df['date'] >= dt(2020, 7, 1)].groupby("ID")["payment"].sum() - df.loc[df['date'] < dt(2020, 7, 1)].groupby("ID")["payment"].sum()
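A single-pass alternative that avoids building two intermediate frames: derive a half-year label and aggregate once. A minimal sketch, assuming df already has ID, date (datetime64) and payment columns; the first_half/second_half labels are just illustrative names:
import numpy as np
import pandas as pd

# label each row by which half of the year it falls in
half = np.where(df['date'].dt.month <= 6, 'first_half', 'second_half')

# one groupby over (ID, half), then spread the halves into columns
sums = df.groupby(['ID', half])['payment'].sum().unstack()
sums['sec_minus_first'] = sums['second_half'] - sums['first_half']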

Related

filter data based on month start and month end

Given a dataframe with a date column in this format:
Date Group
2020-05-18 1
2020-06-22 1
2019-07-11 1
2018-03-01 1
2021-01-21 2
2021-05-05 2
2021-09-11 2
And two strings:
Start = 2020-05 (indicating month start)
End = 2021-09 (indicating month end)
I want to filter out the data so that only the dates that fall within the start and end date are available in the dataframe.
Expected output:
Date Group
2020-05-18 1
2020-06-22 1
2021-01-21 2
2021-05-05 2
2021-09-11 2
# Creating dummy data
import pandas as pd

d = {'dt':['2020-05-18',
'2020-06-22',
'2019-07-11',
'2018-03-01',
'2021-01-21',
'2021-05-05',
'2021-09-11'],
'group':[1,1,1,1,2,2,2]}
dt_df = pd.DataFrame(data=d)
dt_df
dt_df['dt'] = pd.to_datetime(dt_df['dt'])
dt_df
Initial input:
0 2020-05-18
1 2020-06-22
2 2019-07-11
3 2018-03-01
4 2021-01-21
5 2021-05-05
6 2021-09-11
Name: dt, dtype: datetime64[ns]
Start = '2020-05'
End = '2021-09'
Start = pd.to_datetime(Start)
End = pd.to_datetime(End) + pd.DateOffset(months=1)
Use loc to select only the dates between the Start and End timestamps (End is exclusive, since it now points at the first day of the month after the end month):
dt_df.loc[(dt_df['dt'] >= Start) & (dt_df['dt'] < End)]
Output:
dt group
0 2020-05-18 1
1 2020-06-22 1
4 2021-01-21 2
5 2021-05-05 2
6 2021-09-11 2
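Since Start and End denote whole months, another sketch compares month Periods directly, which sidesteps the end-of-month arithmetic entirely (assuming dt is already datetime64):
import pandas as pd

months = dt_df['dt'].dt.to_period('M')
mask = (months >= pd.Period('2020-05', freq='M')) & (months <= pd.Period('2021-09', freq='M'))
dt_df.loc[mask]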

expand year values to month in pandas

I have sales by year:
pd.DataFrame({'year':[2015,2016,2017],'value':['12','24','36']})
year value
0 2015 12
1 2016 24
2 2017 36
I want to extrapolate to months:
yyyymm value
201501 1 (ie 12/12, etc)
201502 1
...
201512 1
201601 2
...
201712 3
any suggestions?
One idea is to use a cross join with a helper DataFrame, convert the columns to strings and left-pad with 0 via Series.str.zfill:
df1 = pd.DataFrame({'m': range(1, 13), 'a' : 1})
df = df.assign(a = 1).merge(df1).drop(columns='a')
df['year'] = df['year'].astype(str) + df.pop('m').astype(str).str.zfill(2)
df = df.rename(columns={'year':'yyyymm'})
Another solution is to create a MultiIndex and use DataFrame.reindex:
mux = pd.MultiIndex.from_product([df['year'], range(1, 13)], names=['yyyymm','m'])
df = df.set_index('year').reindex(mux, level=0).reset_index()
df['yyyymm'] = df['yyyymm'].astype(str) + df.pop('m').astype(str).str.zfill(2)
print (df.head(15))
yyyymm value
0 201501 12
1 201502 12
2 201503 12
3 201504 12
4 201505 12
5 201506 12
6 201507 12
7 201508 12
8 201509 12
9 201510 12
10 201511 12
11 201512 12
12 201601 24
13 201602 24
14 201603 24
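On pandas 1.2+, the helper column is unnecessary because merge supports how='cross'; a minimal self-contained sketch of the same idea:
import pandas as pd

df = pd.DataFrame({'year': [2015, 2016, 2017], 'value': [12, 24, 36]})
months = pd.DataFrame({'m': range(1, 13)})

# every year paired with every month: 36 rows
out = df.merge(months, how='cross')
out['yyyymm'] = out.pop('year').astype(str) + out.pop('m').astype(str).str.zfill(2)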

Is there a way of group by month in Pandas starting at specific day number?

I'm trying to group some data by month in Python, but I need the month to start at the 25th of each month. Is there a way to do that in pandas?
For weeks there is a way of starting on Monday, Tuesday, ... but for months it's always the full month.
pd.Grouper(key='date', freq='M')
You could offset the dates by 24 days and groupby:
import numpy as np
import pandas as pd

np.random.seed(1)
dates = pd.date_range('2019-01-01', '2019-04-30', freq='D')
df = pd.DataFrame({'date':dates,
'val': np.random.uniform(0,1,len(dates))})
# for groupby
s = df['date'].sub(pd.DateOffset(24))
(df.groupby([s.dt.year, s.dt.month], as_index=False)
.agg({'date':'min', 'val':'sum'})
)
gives
date val
0 2019-01-01 10.120368
1 2019-01-25 14.895363
2 2019-02-25 14.544506
3 2019-03-25 17.228734
4 2019-04-25 3.334160
Another example:
np.random.seed(1)
dates = pd.date_range('2019-01-20', '2019-01-30', freq='D')
df = pd.DataFrame({'date':dates,
'val': np.random.uniform(0,1,len(dates))})
s = df['date'].sub(pd.DateOffset(24))
df['groups'] = df.groupby([s.dt.year, s.dt.month]).cumcount()
gives
date val groups
0 2019-01-20 0.417022 0
1 2019-01-21 0.720324 1
2 2019-01-22 0.000114 2
3 2019-01-23 0.302333 3
4 2019-01-24 0.146756 4
5 2019-01-25 0.092339 0
6 2019-01-26 0.186260 1
7 2019-01-27 0.345561 2
8 2019-01-28 0.396767 3
9 2019-01-29 0.538817 4
10 2019-01-30 0.419195 5
And you can see how the cumcount restarts at day 25.
I prepared the following test DataFrame:
Dat Val
0 2017-03-24 0
1 2017-03-25 0
2 2017-03-26 1
3 2017-03-27 0
4 2017-04-24 0
5 2017-04-25 0
6 2017-05-24 0
7 2017-05-25 2
8 2017-05-26 0
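(For reproducibility, a quick way to build this frame; the values are copied from the table above:)
import pandas as pd

df = pd.DataFrame({
    'Dat': pd.to_datetime(['2017-03-24', '2017-03-25', '2017-03-26', '2017-03-27',
                           '2017-04-24', '2017-04-25',
                           '2017-05-24', '2017-05-25', '2017-05-26']),
    'Val': [0, 0, 1, 0, 0, 0, 0, 2, 0],
})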
The first step is to compute a "shifted date" column:
df['Dat2'] = df.Dat + pd.DateOffset(days=-24)
The result is:
Dat Val Dat2
0 2017-03-24 0 2017-02-28
1 2017-03-25 0 2017-03-01
2 2017-03-26 1 2017-03-02
3 2017-03-27 0 2017-03-03
4 2017-04-24 0 2017-03-31
5 2017-04-25 0 2017-04-01
6 2017-05-24 0 2017-04-30
7 2017-05-25 2 2017-05-01
8 2017-05-26 0 2017-05-02
As you can see, the March dates in Dat2 start exactly at the original date 2017-03-25, and so on.
The value of 1 is in March (Dat2) and the value of 2 is in May (also Dat2).
Then, to compute e.g. a sum by month, we can run:
df.groupby(pd.Grouper(key='Dat2', freq='MS')).sum()
getting:
Val
Dat2
2017-02-01 0
2017-03-01 1
2017-04-01 0
2017-05-01 2
So we have the correct grouping:
1 is in March,
2 is in May.
The advantage over the other answer is that you have all dates on the first day of a month, of course bearing in mind that e.g. 2017-03-01 in the result means the period from 2017-03-25 to 2017-04-24 (inclusive).
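Both answers fold naturally into a small reusable helper; a sketch, where monthly_sum_from_day is a hypothetical name and start_day=25 matches the question:
import pandas as pd

def monthly_sum_from_day(df, date_col, val_col, start_day=25):
    # shift dates back so the custom month boundary lands on the 1st,
    # then group by the (shifted) calendar month
    shifted = df[date_col] - pd.DateOffset(days=start_day - 1)
    return df[val_col].groupby(shifted.dt.to_period('M')).sum()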

Handle Perpetual Maturity Bonds with Maturity date of 31-12-9999 12:00:00 AM

I have a number of records in a dataframe where the maturity date
column is 31-12-9999 12:00:00 AM as the bonds never mature. This
naturally raises the error:
Out of bounds nanosecond timestamp: 9999-12-31 00:00:00
I see the max date is:
pd.Timestamp.max
Timestamp('2262-04-11 23:47:16.854775807')
I just wanted to clarify the best approach to clean all date columns in the dataframe and fix my bug. My code, modelled on the docs:
df_Fix_Date = df_Date['maturity_date'].head(8)
display(df_Fix_Date)
display(df_Fix_Date.dtypes)
0 2020-08-15 00:00:00.000
1 2022-11-06 00:00:00.000
2 2019-03-15 00:00:00.000
3 2025-01-15 00:00:00.000
4 2035-05-29 00:00:00.000
5 2027-06-01 00:00:00.000
6 2021-04-01 00:00:00.000
7 2022-04-03 00:00:00.000
Name: maturity_date, dtype: object
def conv(x):
    return pd.Period(day = x%100, month = x//100 % 100, year = x // 10000, freq='D')
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date']) # convert to datetype
df_Fix_Date['maturity_date'] = pd.PeriodIndex(df_Fix_Date['maturity_date'].apply(conv)) # fix error
display(df_Fix_Date)
Output:
KeyError: 'maturity_date'
The problem is that you cannot convert to out-of-bounds datetimes.
One solution is to replace 9999 with 2261:
df_Fix_Date['maturity_date'] = df_Fix_Date['maturity_date'].replace('^9999','2261',regex=True)
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'])
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 2261-04-03
Another solution is to replace all dates with a year greater than 2261:
m = df_Fix_Date['maturity_date'].str[:4].astype(int) > 2261
df_Fix_Date['maturity_date'] = df_Fix_Date['maturity_date'].mask(m, '2261' + df_Fix_Date['maturity_date'].str[4:])
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'])
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 2261-04-03
Or replace the problematic dates with NaT via the parameter errors='coerce':
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'], errors='coerce')
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 NaT
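If the sentinel year needs to be preserved rather than rewritten, note that pandas Periods can represent dates far beyond Timestamp.max; a minimal sketch, assuming the column holds strings shaped like the ones above:
import pandas as pd

s = pd.Series(['2020-08-15 00:00:00.000', '9999-12-31 00:00:00.000'],
              name='maturity_date')

# Period is not limited to the nanosecond Timestamp range (which ends in 2262)
periods = pd.PeriodIndex(s.str[:10], freq='D')
print(periods)  # PeriodIndex(['2020-08-15', '9999-12-31'], dtype='period[D]')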

Transposing SQLite rows and columns with average per hour

I have a table in SQLite called param_vals_breaches that looks like the following:
id param queue date_time param_val breach_count
1 c a 2013-01-01 00:00:00 188 7
2 c b 2013-01-01 00:00:00 156 8
3 c c 2013-01-01 00:00:00 100 2
4 d a 2013-01-01 00:00:00 657 0
5 d b 2013-01-01 00:00:00 23 6
6 d c 2013-01-01 00:00:00 230 12
7 c a 2013-01-01 01:00:00 100 0
8 c b 2013-01-01 01:00:00 143 9
9 c c 2013-01-01 01:00:00 12 2
10 d a 2013-01-01 01:00:00 0 1
11 d b 2013-01-01 01:00:00 29 5
12 d c 2013-01-01 01:00:00 22 14
13 c a 2013-01-01 02:00:00 188 7
14 c b 2013-01-01 02:00:00 156 8
15 c c 2013-01-01 02:00:00 100 2
16 d a 2013-01-01 02:00:00 657 0
17 d b 2013-01-01 02:00:00 23 6
18 d c 2013-01-01 02:00:00 230 12
I want to write a query that will show me a particular queue (e.g. "a") with the average param_val and breach_count for each param on an hour by hour basis. So transposing the data to get something that looks like this:
Results for Queue A
Hour 0 Hour 0 Hour 1 Hour 1 Hour 2 Hour 2
param avg_param_val avg_breach_count avg_param_val avg_breach_count avg_param_val avg_breach_count
c xxx xxx xxx xxx xxx xxx
d xxx xxx xxx xxx xxx xxx
Is this possible? I'm not sure how to go about it. Thanks!
SQLite does not have a PIVOT function but you can use an aggregate function with a CASE expression to turn the rows into columns:
select param,
avg(case when time = '00' then param_val end) AvgHour0Val,
avg(case when time = '00' then breach_count end) AvgHour0Count,
avg(case when time = '01' then param_val end) AvgHour1Val,
avg(case when time = '01' then breach_count end) AvgHour1Count,
avg(case when time = '02' then param_val end) AvgHour2Val,
avg(case when time = '02' then breach_count end) AvgHour2Count
from
(
select param,
strftime('%H', date_time) time,
param_val,
breach_count
from param_vals_breaches
where queue = 'a'
) src
group by param;
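Since the rest of this page lives in pandas, here is a hedged sketch of the same transpose done with pivot_table after pulling the rows out of SQLite (example.db is a hypothetical file name; table and column names as above):
import sqlite3
import pandas as pd

conn = sqlite3.connect('example.db')
df = pd.read_sql_query(
    "SELECT param, date_time, param_val, breach_count "
    "FROM param_vals_breaches WHERE queue = 'a'",
    conn, parse_dates=['date_time'])

# one (param_val, breach_count) column pair per hour, averaged over that hour
df['hour'] = df['date_time'].dt.hour
out = df.pivot_table(index='param', columns='hour',
                     values=['param_val', 'breach_count'], aggfunc='mean')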