How to groupby year and unstack years into columns in pandas?

I have a pandas time series ser:
>>> ser
date x
2018-01-01 0.912
2018-01-02 0.704
...
2021-02-01 1.285
and I want to take a cumulative sum within each year and make each year into its own column, with the index reduced to the day of year (e.g. Jan 01, Jan 02, ...; the month/day formatting doesn't matter):
date 2018_x 2019_x 2020_x 2021_x 2022_x
Jan-01 0.912 ... ... ... ...
Jan-02 1.616 ... ... ... ...
...
I know how to groupby and take a cumulative sum, but then I want to do some sort of unstacking operation to get the years into columns
ser.groupby(ser.index.year).cumsum()
# what do I do next?
The standard pandas unstack() operation doesn't work here.
Can anyone please advise how to do this?

First aggregate the sum per MM-DD and year, then reshape with Series.unstack and take the cumulative sum:
df = ser.groupby([ser.index.strftime('%m-%d'), ser.index.year]).sum().unstack(fill_value=0).cumsum()
print (df)
date 2018 2021
date
01-01 0.912 0.000
01-02 1.616 0.000
02-01 1.616 1.285
Or, if there are no duplicated datetimes, create a MultiIndex directly without groupby:
ser.index = [ser.index.strftime('%m-%d'), ser.index.year]
df = ser.unstack(fill_value=0).cumsum()
print (df)
date 2018 2021
date
01-01 0.912 0.000
01-02 1.616 0.000
02-01 1.616 1.285
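The question asked for columns named like 2018_x; a minimal follow-up sketch on the df produced above, whose columns are plain years at this point:
df.columns = [f'{year}_x' for year in df.columns]
Alternatively, df.add_suffix('_x') does the same in one call (the year labels become strings like '2018_x').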

Related

Pandas Implement Equation & Groupby 2 Conditions

I have data that looks like the below, and I'm trying to calculate the CRMSE (centered root mean squared error) by plant_name and year. Maybe I need an agg function or a lambda function to do this for each groupby group (plant_name, year). The dataframe data for df3m1:
plant_name year month obsvals modelvals
0 ARIZONA I 2021 1 8.90 8.30
1 ARIZONA I 2021 2 7.98 7.41
2 CAETITE I 2021 1 9.10 7.78
3 CAETITE I 2021 2 6.05 6.02
The equation that I need to implement per (plant_name, year) group looks like this:
crmse = (((obsvals - obsvals.mean()) - (modelvals - modelvals.mean())) ** 2).mean() ** .5
Integrating a groupby and a calculation at the same time is still a bit advanced for me. Thank you. The final dataframe would look like:
plant_name year crmse
0 ARIZONA I 2021 ?
1 CAETITE I 2021 ?
I have tried things like this with groupby:
crmse = df3m1.groupby(['plant_name','year'])(((df3m1.obsvals - df3m1.obsvals.mean()) - (df3m1.modelvals - df3m1.modelvals.mean())) ** 2).mean() ** .5
but get errors like this:
TypeError: 'DataFrameGroupBy' object is not callable
Using groupby is correct; the TypeError comes from calling the resulting DataFrameGroupBy object as if it were a function. Normally we would follow the groupby with .agg, but computing crmse involves two columns at once (obsvals and modelvals), so we pass the entire dataframe through .apply and pick the columns inside.
Code:
import numpy as np
import pandas as pd

def crmse(x, y):
    # root-mean-square of the mean-centered difference between the two series
    return np.sqrt(np.mean(np.square((x - x.mean()) - (y - y.mean()))))

def f(df):
    return pd.Series(crmse(df['obsvals'], df['modelvals']), index=['crmse'])

crmse_series = (
    df3m1
    .groupby(['plant_name', 'year'])
    .apply(f)
)
crmse_series
crmse
plant_name year
ARIZONA I 2021 0.015
CAETITE I 2021 0.645
You can merge the series into the original dataframe with merge.
df = df.merge(crmse_series, on=['plant_name', 'year'])
df
plant_name year month obsvals modelvals crmse
0 ARIZONA I 2021 1 8.90 8.30 0.015
1 ARIZONA I 2021 2 7.98 7.41 0.015
2 CAETITE I 2021 1 9.10 7.78 0.645
3 CAETITE I 2021 2 6.05 6.02 0.645
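As a side note, (x - x.mean()) - (y - y.mean()) is the same as (x - y) - (x - y).mean(), so the CRMSE is just the population standard deviation (ddof=0) of the per-row difference within each group. A sketch of that shortcut, assuming the same df3m1:
diff = df3m1['obsvals'] - df3m1['modelvals']
crmse_series = (diff.groupby([df3m1['plant_name'], df3m1['year']])
                    .std(ddof=0)
                    .rename('crmse'))
This reproduces the 0.015 and 0.645 values above without defining any helper functions.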
See Also:
Apply multiple functions to multiple groupby columns

How to plot time series and group years together?

I have a dataframe that looks like the one below; the date is the index. How would I plot a time series with a separate line for each year? I have tried df.plot(figsize=(15,4)), but this gives me a single line.
Date Value
2008-01-31 22
2008-02-28 17
2008-03-31 34
2008-04-30 29
2009-01-31 33
2009-02-28 42
2009-03-31 45
2009-04-30 39
2019-01-31 17
2019-02-28 12
2019-03-31 11
2019-04-30 12
2020-01-31 24
2020-02-28 34
2020-03-31 43
2020-04-30 45
You can just do a groupby using the year:
df = pd.read_clipboard()  # assumes the sample table above has been copied to the clipboard
df = df.set_index(pd.DatetimeIndex(df['Date']))
df.groupby(df.index.year)['Value'].plot()
If you want to use each year as a separate series and compare day by day:
import matplotlib.pyplot as plt
# Create a date column from index (easier to manipulate)
df["date_column"] = pd.to_datetime(df.index)
# Create a year column
df["year"] = df["date_column"].dt.year
# Create a zero-padded month-day column
df["month_day"] = df["date_column"].dt.strftime("%m-%d")
# Plot. pivot creates one column per year, and those columns are plotted as separate series.
df.pivot(index='month_day', columns='year', values='Value').plot(kind='line', figsize=(12, 8), marker='o')
plt.title("Values per Month-Day - Year comparison", y=1.1, fontsize=14)
plt.xlabel("Month-Day", labelpad=12, fontsize=12)
plt.ylabel("Value", labelpad=12, fontsize=12);
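For reference, a self-contained sketch of the first approach that builds the frame from a subset of the sample rows instead of reading the clipboard:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Date': ['2008-01-31', '2008-02-28', '2008-03-31', '2008-04-30',
             '2009-01-31', '2009-02-28', '2009-03-31', '2009-04-30'],
    'Value': [22, 17, 34, 29, 33, 42, 45, 39],
})
df = df.set_index(pd.DatetimeIndex(df['Date']))
# one line per year, plotted against the shared date axis
df.groupby(df.index.year)['Value'].plot(legend=True)
plt.show()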

Pandas 'multi-index' issue in merging dataframes

I have a panel dataset df:
stock year date return
VOD 2017 01-01 0.05
VOD 2017 01-02 0.03
VOD 2017 01-03 0.04
... ... ... ....
BAT 2017 01-01 0.05
BAT 2017 01-02 0.07
BAT 2017 01-03 0.10
so I use this code to get the mean and skewness of the return for each stock in each year.
df2=df.groupby(['stock','year']).mean().reset_index()
df3=df.groupby(['stock','year']).skew().reset_index()
df2 and df3 look fine.
df2 is like (after I change the column name)
stock year mean_return
VOD 2017 0.09
BAT 2017 0.14
... ... ...
df3 is like (after I change the column name)
stock year return_skewness
VOD 2017 -0.34
BAT 2017 -0.04
... ... ...
The problem is that when I tried to merge df2 and df3 using
want = pd.merge(df2, df3, on=['stock','year'], how='outer')
python gave me
'The column label 'stock' is not unique.
For a multi-index, the label must be a tuple with elements corresponding to each level.'
which confuses me a lot.
I can use want = pd.merge(df2, df3, left_index=True, right_index=True, how='outer') to merge df2 and df3, but after that I have to rename the columns, as the column names are in parentheses.
Is there any convenient way to merge df2 and df3? Thanks
Better is to use agg, specifying each aggregate function in a (name, function) tuple after selecting the column to aggregate:
df3 = (df.groupby(['stock','year'])['return']
.agg([('mean_return','mean'),('return_skewness','skew')])
.reset_index())
print (df3)
stock year mean_return return_skewness
0 BAT 2017 0.073333 0.585583
1 VOD 2017 0.040000 0.000000
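The list-of-tuples renaming above still works, but since pandas 0.25 the same result is usually written with named aggregation; a sketch of the equivalent:
df3 = (df.groupby(['stock', 'year'], as_index=False)
         .agg(mean_return=('return', 'mean'),
              return_skewness=('return', 'skew')))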
Your solution can be fixed by removing reset_index, renaming each Series, and concatenating at the end; note that the column return is selected for aggregation:
s2=df.groupby(['stock','year'])['return'].mean().rename('mean_return')
s3=df.groupby(['stock','year'])['return'].skew().rename('return_skewness')
df3 = pd.concat([s2, s3], axis=1).reset_index()
print (df3)
stock year mean_return return_skewness
0 BAT 2017 0.073333 0.585583
1 VOD 2017 0.040000 0.000000
EDIT:
If you need to aggregate all numeric columns, remove the column selection after the groupby and then flatten the resulting MultiIndex columns with map and join:
print (df)
stock year date return col
0 VOD 2017 01-01 0.05 1
1 VOD 2017 01-02 0.03 8
2 VOD 2017 01-03 0.04 9
3 BAT 2017 01-01 0.05 1
4 BAT 2017 01-02 0.07 4
5 BAT 2017 01-03 0.10 3
df3 = df.groupby(['stock','year']).agg(['mean','skew'])
print (df3)
return col
mean skew mean skew
stock year
BAT 2017 0.073333 0.585583 2.666667 -0.935220
VOD 2017 0.040000 0.000000 6.000000 -1.630059
df3.columns = df3.columns.map('_'.join)
df3 = df3.reset_index()
print (df3)
stock year return_mean return_skew col_mean col_skew
0 BAT 2017 0.073333 0.585583 2.666667 -0.935220
1 VOD 2017 0.040000 0.000000 6.000000 -1.630059
Your two-step solution would then become:
df2=df.groupby(['stock','year']).mean().add_prefix('mean_')
df3=df.groupby(['stock','year']).skew().add_prefix('skew_')
df3 = pd.concat([df2, df3], axis=1).reset_index()
print (df3)
stock year mean_return mean_col skew_return skew_col
0 BAT 2017 0.073333 2.666667 0.585583 -0.935220
1 VOD 2017 0.040000 6.000000 0.000000 -1.630059
An easier way to bypass this issue is to round-trip both frames through the clipboard, which flattens the headers into plain strings:
df2.to_clipboard(index=False)
df2clip=pd.read_clipboard(sep='\t')
df3.to_clipboard(index=False)
df3clip=pd.read_clipboard(sep='\t')
Then merge the two dataframes again:
pd.merge(df2clip,df3clip,on=['stock','year'],how='outer')
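The same flattening can be done directly, without the clipboard, which also works in non-interactive scripts (a sketch, assuming df2 and df3 as above):
for d in (df2, df3):
    d.columns = ['_'.join(map(str, c)).rstrip('_') if isinstance(c, tuple) else c
                 for c in d.columns]
want = pd.merge(df2, df3, on=['stock', 'year'], how='outer')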

Backfilling a pandas dataframe missed the first month

I have a pandas df of irrigation demand data that has daily values from 1900 to 2099. I resampled the df to get the monthly average, then resampled and backfilled the monthly averages at a daily frequency, so that the average daily value for each month was input as the daily value for every day of that month.
My problem is that the first month was not backfilled and there is only a value for the last day of that month (1900-01-31).
Here is my code, any suggestions on what I am doing wrong?
I2 = pd.DataFrame(IrrigDemand, columns = ['Year', 'Month', 'Day', 'IrrigArea_1', 'IrrigArea_2','IrrigArea_3','IrrigArea_4','IrrigArea_5'],dtype=float)
# set dates as index
I2.set_index('Year')
# make a column of dates in datetime format
dates = pd.to_datetime(I2[['Year', 'Month', 'Day']])
# add the column of dates to df
I2['dates'] = pd.Series(dates, index=I2.index)
# set dates as index of df
I2.set_index('dates')
# delete the three string columns replaced with datetime values
I2.drop(['Year', 'Month', 'Day'],inplace=True,axis=1)
# calculate the average daily value for each month
I2_monthly_average = I2.reset_index().set_index('dates').resample('m').mean()
I2_daily_average = I2_monthly_average.resample('d').bfill()
The problem is that resample('m') labels each monthly bin with the month's last day, so the first day of the first month is never created; it is necessary to add it manually:
# make a column of dates in datetime format and assign to index
I2.index = pd.to_datetime(I2[['Year', 'Month', 'Day']])
# delete the three string columns replaced with datetime values
I2.drop(['Year', 'Month', 'Day'],inplace=True,axis=1)
# calculate the average daily value for each month
I2_monthly_average = I2.resample('m').mean()
first_day = I2_monthly_average.index[0].replace(day = 1)
I2_monthly_average.loc[first_day] = I2_monthly_average.iloc[0]
I2_daily_average = I2_monthly_average.resample('d').bfill()
Sample:
rng = pd.date_range('2017-04-03', periods=10, freq='20D')
I2 = pd.DataFrame({'a': range(10)}, index=rng)
print (I2)
a
2017-04-03 0
2017-04-23 1
2017-05-13 2
2017-06-02 3
2017-06-22 4
2017-07-12 5
2017-08-01 6
2017-08-21 7
2017-09-10 8
2017-09-30 9
I2_monthly_average = I2.resample('m').mean()
print (I2_monthly_average)
a
2017-04-30 0.5
2017-05-31 2.0
2017-06-30 3.5
2017-07-31 5.0
2017-08-31 6.5
2017-09-30 8.5
first_day = I2_monthly_average.index[0].replace(day = 1)
I2_monthly_average.loc[first_day] = I2_monthly_average.iloc[0]
print (I2_monthly_average)
a
2017-04-30 0.5
2017-05-31 2.0
2017-06-30 3.5
2017-07-31 5.0
2017-08-31 6.5
2017-09-30 8.5
2017-04-01 0.5 <- added first day
I2_daily_average = I2_monthly_average.resample('d').bfill()
print (I2_daily_average.head())
a
2017-04-01 0.5
2017-04-02 0.5
2017-04-03 0.5
2017-04-04 0.5
2017-04-05 0.5
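If the original frame already has a row for every day (as the question's 1900-2099 daily data does), the whole resample/backfill round-trip can be avoided with groupby + transform, which writes each month's mean straight back onto the daily index; a sketch under that assumption:
# each row receives the mean of its own calendar month
I2_daily_average = I2.groupby(pd.Grouper(freq='M')).transform('mean')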

aggregate data by quarter

I have a pivot pandas data frame (sales by region) that got created from another pandas data frame (sales by store) using the pivot_table method.
As an example:
df = pd.DataFrame(
    {'store': ['A','B','C','D','E']*7,
     'region': ['NW','NW','SW','NE','NE']*7,
     'date': ['2017-03-30']*5 + ['2017-04-05']*5 + ['2017-04-07']*5 + ['2017-04-12']*5
             + ['2017-04-13']*5 + ['2017-04-17']*5 + ['2017-04-20']*5,
     'sales': [30,1,133,9,1,30,3,135,9,11,30,1,140,15,15,25,10,137,9,3,
               29,10,137,9,11,30,19,145,20,10,30,8,141,25,25]
    })
df['date'] = pd.to_datetime(df['date'])
df_sales = df.pivot_table(index = ['region'], columns = ['date'], aggfunc = [np.sum], margins = True)
df_sales = df_sales.iloc[:, :-1]  # drop the 'All' margins column (.ix is removed in modern pandas)
My goal is to do the following to the sales data frame, df_sales.
Create a new dataframe that summarizes sales by quarter. I could use the original dataframe df or the pivoted df_sales.
Here we have only two quarters (calendar-year quarters), so the quarterly aggregated data frame would look like:
        2017Q1   2017Q2
NE      10       27
NW      31       37.5
SW      133      139.17
I take the average over all days in Q1, and likewise for Q2. For example, for the northeast region 'NE', Q1 is the average of a single day, 2017-03-30, i.e. 10, and Q2 is the average across 2017-04-05 to 2017-04-20, i.e. (20+30+12+20+30+50)/6 = 27.
Any suggestions?
ADDITIONAL NOTE: I would ideally do the quarter aggregations on the df_sales pivoted table since it's a much smaller dataframe to keep in memory. The current solution does it on the original df, but I am still seeking a way to do it in the df_sales dataframe.
UPDATE:
Setup:
df.date = pd.to_datetime(df.date)
df_sales = df.pivot_table(index='region', columns='date', values='sales', aggfunc='sum')
In [318]: df_sales
Out[318]:
date 2017-03-30 2017-04-05 2017-04-07 2017-04-12 2017-04-13 2017-04-17 2017-04-20
region
NE 10 20 30 12 20 30 50
NW 31 33 31 35 39 49 38
SW 133 135 140 137 137 145 141
Solution:
In [319]: (df_sales.groupby(pd.PeriodIndex(df_sales.columns, freq='Q'), axis=1)
...: .apply(lambda x: x.sum(axis=1)/x.shape[1])
...: )
Out[319]:
date 2017Q1 2017Q2
region
NE 10.0 27.000000
NW 31.0 37.500000
SW 133.0 139.166667
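Note that groupby(..., axis=1) is deprecated in recent pandas (2.1+); an equivalent sketch transposes, groups the quarters along the row axis, and transposes back:
out = (df_sales.T
       .groupby(pd.PeriodIndex(df_sales.columns, freq='Q'))
       .mean()
       .T)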
Solution based on the original DF:
In [253]: (df.groupby(['region', pd.PeriodIndex(df.date, freq='Q-DEC')])
...: .apply(lambda x: x['sales'].sum()/x['date'].nunique())
...: .to_frame('avg').unstack('date')
...: )
...:
Out[253]:
avg
date 2017Q1 2017Q2
region
NE 10.0 27.000000
NW 31.0 37.500000
SW 133.0 139.166667
NOTE: df is the original DF (before pivoting).
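Another sketch from the original df, for comparison: sum the sales per region and day first, then take the quarterly mean of those daily sums:
daily = df.groupby(['region', 'date'])['sales'].sum()
quarters = daily.index.get_level_values('date').to_period('Q')
out = daily.groupby(['region', quarters]).mean().unstack()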