How to create a function to convert monthly data to daily, weekly in pandas dataframe?

I have the monthly data below in a dataframe and need to convert it to daily, weekly, and biweekly frequencies.
date chair_price vol_chair
01-09-2018 23 30
01-10-2018 53 20
daily: price stays the same and vol_chair is divided by the number of days in the month
weekly: price stays the same and vol_chair is divided by the number of weeks in the month
expected output:
daily:
date chair_price vol_chair
01-09-2018 23 1
02-09-2018 23 1
03-09-2018 23 1
..
30-09-2018 23 1
01-10-2018 53 0.64
..
31-10-2018 53 0.64
weekly:
date chair_price vol_chair
02-09-2018 23 6
09-09-2018 23 6
16-09-2018 23 6
23-09-2018 23 6
30-09-2018 23 6
07-10-2018 53 5
14-10-2018 53 5
..
I am using the code below for the vol column. Is there a quick way to do both at once, i.e. keep the price the same while scaling vol by the actual number of days or weeks in each month?
df.resample('W').ffill().agg(lambda x: x/4)
df.resample('D').ffill().agg(lambda x: x/30)
and I need to use calendar.monthrange(2012, 1)[1] to find the number of days in a month
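For reference, a quick check of what those calendar helpers return (values verified for these inputs):
import calendar
calendar.monthrange(2012, 1)[1]       # 31 -> days in January 2012
len(calendar.monthcalendar(2018, 9))  # 5  -> week rows for September 2018, so 30/5 = 6 per week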
def func_count_number_of_weeks(df):
    return len(calendar.monthcalendar(df['DateRange'].year, df['DateRange'].month))

def func_convert_from_monthly(df, col, category, columns):
    if category == "Daily":
        df['number_of_days'] = df['DateRange'].dt.daysinmonth
        for column in columns:
            df[column] = df[column] / df['number_of_days']
        df.drop('number_of_days', axis=1, inplace=True)
    elif category == "Weekly":
        df['number_of_weeks'] = df.apply(func_count_number_of_weeks, axis=1)
        for column in columns:
            df[column] = df[column] / df['number_of_weeks']
        df.drop('number_of_weeks', axis=1, inplace=True)
    return df
def func_resample_from_monthly(df, col, category):
    df.set_index(col, inplace=True)
    df.index = pd.to_datetime(df.index, dayfirst=True)
    if category == "Monthly":
        df = df.resample('MS').ffill()
    elif category == "Weekly":
        df = df.resample('W').ffill()
    return df
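For completeness, a sketch of how the two helpers above might be chained, assuming 'DateRange' has already been parsed with pd.to_datetime and vol_chair is the column to scale:
# hypothetical driver: scale the volumes first (while 'DateRange' is still a column),
# then spread the monthly rows out to the target frequency
df = func_convert_from_monthly(df, 'DateRange', 'Weekly', ['vol_chair'])
df = func_resample_from_monthly(df, 'DateRange', 'Weekly')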

Use:
#convert to datetimeindex
df.index = pd.to_datetime(df.index, dayfirst=True)
#add new next month for correct resample
idx = df.index[-1] + pd.offsets.MonthBegin(1)
df = df.append(df.iloc[[-1]].rename({df.index[-1]: idx}))  # note: DataFrame.append was removed in pandas 2.0; pd.concat is the modern replacement
#resample with forward filling values, remove last helper row
#df1 = df.resample('D').ffill().iloc[:-1]
df1 = df.resample('W').ffill().iloc[:-1]
#divide by size of months
df1['vol_chair'] /= df1.resample('MS')['vol_chair'].transform('size')
print (df1)
chair_price vol_chair
date
2018-09-02 23 6.0
2018-09-09 23 6.0
2018-09-16 23 6.0
2018-09-23 23 6.0
2018-09-30 23 6.0
2018-10-07 53 5.0
2018-10-14 53 5.0
2018-10-21 53 5.0
2018-10-28 53 5.0
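For the daily case, the commented-out resample('D') line can be combined with DatetimeIndex.daysinmonth rather than a fixed divisor of 30 (a sketch following the same pattern as above):
df1 = df.resample('D').ffill().iloc[:-1]
df1['vol_chair'] /= df1.index.daysinmonth  # 30 for September, 31 for October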

Related

How to create monthly and seasonal 24-hour average tables using pandas

I have a dataframe with 2 columns, Date and LMP, with a total of 8760 rows. This is the dummy dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': pd.date_range('2023-01-01 00:00', '2023-12-31 23:00', freq='1H'), 'LMP': np.random.randint(10, 20, 8760)})
I extracted the month from the date and then created the season column for the specific dates, like this:
df['month'] = pd.DatetimeIndex(df['Date']).month
season = []
for i in df['month']:
    if i <= 2 or i == 12:
        season.append('Winter')
    elif 2 < i <= 5:
        season.append('Spring')
    elif 5 < i <= 8:
        season.append('Summer')
    else:
        season.append('Autumn')
df['Season'] = season
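As an aside, the same mapping can be done without a Python loop, e.g. via a dict lookup (a sketch; season_map is a hypothetical name):
season_map = {12: 'Winter', 1: 'Winter', 2: 'Winter',
              3: 'Spring', 4: 'Spring', 5: 'Spring',
              6: 'Summer', 7: 'Summer', 8: 'Summer',
              9: 'Autumn', 10: 'Autumn', 11: 'Autumn'}
df['Season'] = df['month'].map(season_map)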
df2 = df.groupby(['month']).mean()
df3 = df.groupby(['Season']).mean()
print(df2['LMP'])
print(df3['LMP'])
Output:
month
1 20.655113
2 20.885532
3 19.416946
4 22.025248
5 26.040606
6 19.323863
7 51.117965
8 51.434093
9 21.404680
10 14.701989
11 20.009590
12 38.706160
Season
Autumn 18.661426
Spring 22.499365
Summer 40.856845
Winter 26.944382
But I want the output to be a 24-hour average for both monthly and seasonal data.
Desired Output:
for the seasonal 24-hour average
for the monthly 24-hour average
Note: in the monthly 24-hour average, the columns are months (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) and the rows are hours (starting from 0).
Can anyone help?
try:
df['hour'] = pd.DatetimeIndex(df['Date']).hour
dft = df[['Season', 'hour', 'LMP']]
dftg = dft.groupby(['hour', 'Season'])['LMP'].mean()
dftg.reset_index().pivot(index='hour', columns='Season')
result: a 24-row table with hours 0-23 as the index and one column per season.
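The monthly table should follow the same pattern, pivoting hours against the month column created earlier (a sketch; monthly_24h is a hypothetical name):
dftm = df[['month', 'hour', 'LMP']]
monthly_24h = (dftm.groupby(['hour', 'month'])['LMP'].mean()
                   .reset_index()
                   .pivot(index='hour', columns='month', values='LMP'))
print(monthly_24h)  # 24 rows (hours 0-23) by 12 month columns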

How to make a graph plotting monthly data over many years in pandas

I have 11 years worth of hourly ozone concentration data.
There are 11 csv files containing ozone concentrations at every hour of every day.
I was able to read all of the files in and convert the index from date to datetime.
For my graph:
I calculated the maximum daily 8-hour average and then averaged those values over each month.
My new dataframe (df3) has:
a datetime index, which consists of the last day of the month for each month of the year over the 12 years.
It also has a column including the average MDA8 values.
I want to make 3 separate scatter plots for the months of April, May, and June. (x axis = year, y axis = average MDA8 for the month)
However, I am getting stuck on how to call these individual months and plot the yearly data.
Minimal sample
site,date,start_hour,value,variable,units,quality,prelim,name
3135,2010-01-01,0,13.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,1,5.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,2,11.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,3,17.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,5,16.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
Here's a link to find similar CSV data https://www.arb.ca.gov/aqmis2/aqdselect.php?tab=hourly
I've attached some code below:
import pandas as pd
import os
import glob
import matplotlib.pyplot as plt
path = "C:/Users/blah"
for f in glob.glob(os.path.join(path, "*.csv")):
    df = pd.read_csv(f, header=0, index_col='date')
    df2 = df.dropna(axis=0, how="all", subset=['start_hour', 'variable'], inplace=True)
    df = df.iloc[0:]
    df.index = pd.to_datetime(df.index)  # converting date to datetime
    df['start_hour'] = pd.to_timedelta(df['start_hour'], unit='h')
    df['datetime'] = df.index + df['start_hour']
    df.set_index('datetime', inplace=True)
    df2 = df.value.rolling('8H', min_periods=6).mean()
    df2.index -= pd.DateOffset(hours=3)
    df2 = df2.resample('D').max()
    df2.index.name = 'timestamp'
The problem occurs below:
df3 = df2.groupby(pd.Grouper(freq='M')).mean()
df4 = df3[df3.index.month.isin([4, 5, 6])]
if df4 == True:
    plt.plot(df3.index, df3.values)
print(df4)
whenever I do this, I get a message saying "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
When I try this code with df4.any() == True:, it plots all of the months except April-June and it plots all values in the same plot. I want different plots for each month.
I've also tried adding the following and removing the previous if statement:
df5 = df4.index.year.isin([2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019])
if df5.all() == True:
    plt.plot(df4.index, df4.values)
However, this gives me an image like:
Again, I want to make a separate scatterplot for each month, although this is closer to what I want. Any help would be appreciated, thanks.
EDIT
In addition, I have 2020 data, which only extends to the month of July. I don't think this is going to affect my graph, but I just wanted to mention it.
Ideally, I want it to look something like this, but a different point for each year and for the individual month of April
df.index -= pd.DateOffset(hours=3) has been removed for being potentially problematic
The first hours of each month would be in the previous month
The first hours of each day would be in the previous day
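A minimal illustration of the issue, with hypothetical timestamps:
idx = pd.date_range('2010-02-01 00:00', periods=3, freq='H')
print(idx - pd.DateOffset(hours=3))
# DatetimeIndex(['2010-01-31 21:00:00', '2010-01-31 22:00:00', '2010-01-31 23:00:00'], ...)
# the first hours of February land in January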
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from datetime import date
from pandas.tseries.offsets import MonthEnd
# set the path to the files
p = Path('/PythonProjects/stack_overflow/data/ozone/')
# list of files
files = list(p.glob('OZONE*.csv'))
# create a dataframe from the files - all years all data
df = pd.concat([pd.read_csv(file) for file in files])
# format the dataframe
df.start_hour = pd.to_timedelta(df['start_hour'], unit = 'h')
df.date = pd.to_datetime(df.date)
df['datetime'] = df.date + df.start_hour
df.drop(columns=['date', 'start_hour'], inplace=True)
df['month'] = df.datetime.dt.month
df['day'] = df.datetime.dt.day
df['year'] = df.datetime.dt.year
df = df[df.month.isin([4, 5, 6])].copy() # filter the dataframe - only April, May, June
df.set_index('datetime', inplace = True)
# calculate the 8-hour rolling mean
df['r_mean'] = df.value.rolling('8H', min_periods=6).mean()
# determine max value per day
r_mean_daily_max = df.groupby(['year', 'month', 'day'], as_index=False)['r_mean'].max()
# calculate the mean from the daily max
mda8 = r_mean_daily_max.groupby(['year', 'month'], as_index=False)['r_mean'].mean()
# add a new datetime column with the date as the end of the month
mda8['datetime'] = pd.to_datetime(mda8.year.astype(str) + mda8.month.astype(str), format='%Y%m') + MonthEnd(1)
df.info() & .head() before any processing
<class 'pandas.core.frame.DataFrame'>
Int64Index: 78204 entries, 0 to 4663
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 site 78204 non-null int64
1 date 78204 non-null object
2 start_hour 78204 non-null int64
3 value 78204 non-null float64
4 variable 78204 non-null object
5 units 78204 non-null object
6 quality 4664 non-null float64
7 prelim 4664 non-null object
8 name 78204 non-null object
dtypes: float64(2), int64(2), object(5)
memory usage: 6.0+ MB
site date start_hour value variable units quality prelim name
0 3135 2011-01-01 0 14.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street
1 3135 2011-01-01 1 11.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street
2 3135 2011-01-01 2 22.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street
3 3135 2011-01-01 3 25.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street
4 3135 2011-01-01 5 22.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street
df.info() & .head() after processing
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 20708 entries, 2011-04-01 00:00:00 to 2020-06-30 23:00:00
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 site 20708 non-null int64
1 value 20708 non-null float64
2 variable 20708 non-null object
3 units 20708 non-null object
4 quality 2086 non-null float64
5 prelim 2086 non-null object
6 name 20708 non-null object
7 month 20708 non-null int64
8 day 20708 non-null int64
9 year 20708 non-null int64
10 r_mean 20475 non-null float64
dtypes: float64(3), int64(4), object(4)
memory usage: 1.9+ MB
site value variable units quality prelim name month day year r_mean
datetime
2011-04-01 00:00:00 3135 13.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street 4 1 2011 NaN
2011-04-01 01:00:00 3135 29.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street 4 1 2011 NaN
2011-04-01 02:00:00 3135 31.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street 4 1 2011 NaN
2011-04-01 03:00:00 3135 28.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street 4 1 2011 NaN
2011-04-01 05:00:00 3135 11.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street 4 1 2011 NaN
r_mean_daily_max.info() and .head()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 910 entries, 0 to 909
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 910 non-null int64
1 month 910 non-null int64
2 day 910 non-null int64
3 r_mean 910 non-null float64
dtypes: float64(1), int64(3)
memory usage: 35.5 KB
year month day r_mean
0 2011 4 1 44.125
1 2011 4 2 43.500
2 2011 4 3 42.000
3 2011 4 4 49.625
4 2011 4 5 45.500
mda8.info() & .head()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 0 to 29
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 30 non-null int64
1 month 30 non-null int64
2 r_mean 30 non-null float64
3 datetime 30 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(2)
memory usage: 1.2 KB
year month r_mean datetime
0 2011 4 49.808135 2011-04-30
1 2011 5 55.225806 2011-05-31
2 2011 6 58.162302 2011-06-30
3 2012 4 45.865278 2012-04-30
4 2012 5 61.061828 2012-05-31
plot 1
sns.lineplot(mda8.datetime, mda8.r_mean, marker='o')
plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
plot 2
# create color mapping based on all unique values of year
years = mda8.year.unique()
colors = sns.color_palette('husl', n_colors=len(years)) # get a number of colors
cmap = dict(zip(years, colors)) # zip values to colors
for g, d in mda8.groupby('year'):
    sns.lineplot(d.datetime, d.r_mean, marker='o', hue=g, palette=cmap)
plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left", borderaxespad=0)
plot 3
sns.barplot(x='month', y='r_mean', data=mda8, hue='year')
plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left", borderaxespad=0)
plt.title('MDA8: April - June')
plt.ylabel('mda8 (ppb)')
plt.show()
plot 4
for month in mda8.month.unique():
    data = mda8[mda8.month == month]  # filter and plot the data for a specific month
    plt.figure()  # create a new figure for each month
    sns.lineplot(data.datetime, data.r_mean, marker='o')
    plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
    plt.title(f'Month: {month}')
    plt.ylabel('MDA8: PPB')
    plt.xlabel('Year')
There will be one plot per month
plot 5
for month in mda8.month.unique():
    data = mda8[mda8.month == month]
    sns.lineplot(data.datetime, data.r_mean, marker='o', label=month)
plt.legend(title='Month')
plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
plt.ylabel('MDA8: PPB')
plt.xlabel('Year')
Addressing "I want to make 3 separate scatter plots for the months of April, May, and June."
The main issue is that the data can't be plotted with a datetime axis.
The objective is to plot each day on the axis, with each figure as a different month.
Lineplot
It's kind of busy
A custom color map has been used because there aren't enough colors in the standard palette to give each year a unique color
# create color mapping based on all unique values of year
years = df.index.year.unique()
colors = sns.color_palette('husl', n_colors=len(years)) # get a number of colors
cmap = dict(zip(years, colors)) # zip values to colors
for k, v in df.groupby('month'):  # group the dataframe by month
    plt.figure(figsize=(16, 10))
    for year in v.index.year.unique():  # within the month, plot each year
        data = v[v.index.year == year]
        sns.lineplot(data.index.day, data.r_mean, err_style=None, hue=year, palette=cmap)
    plt.xlim(0, 33)
    plt.xticks(range(1, 32))
    plt.title(f'Month: {k}')
    plt.xlabel('Day of Month')
    plt.legend(bbox_to_anchor=(1.04, 0.5), loc="center left", borderaxespad=0)
    plt.show()
Here's April, the other two figures look similar to this
Barplot
for k, v in df.groupby('month'):  # group the dataframe by month
    plt.figure(figsize=(10, 20))
    sns.barplot(x=v.r_mean, y=v.day, ci=None, orient='h', hue=v.index.year)
    plt.title(f'Month: {k}')
    plt.ylabel('Day of Month')
    plt.legend(bbox_to_anchor=(1.04, 0.5), loc="center left", borderaxespad=0)
    plt.show()

Finding greatest fall and rise in a dynamic rolling window based on index

Have a df of readings as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1000, size=100), index=range(100), columns = ['reading'])
I want to find the greatest rise and the greatest fall for each row based on its index, which theoretically may be achieved using the formula...
How can this be coded?
Tried:
df.assign(gr8Rise=df.rolling(df.index).apply(lambda x: x[-1]-x[0], raw=True).max())
...and failed with ValueError: window must be an integer
UPDATE: Based on @jezrael's dataset, the expected output for gr8Rise is as follows:
Use:
np.random.seed(2019)
df = pd.DataFrame(np.random.randint(100, size=10), index=range(10), columns=['reading'])

df['gr8Rise'] = [df['reading'].rolling(x).apply(lambda x: x[0] - x[-1], raw=True).max()
                 for x in range(1, len(df) + 1)]
df.loc[0, 'gr8Rise'] = np.nan
print(df)
reading gr8Rise
0 72 NaN
1 31 41.0
2 37 64.0
3 88 59.0
4 62 73.0
5 24 76.0
6 29 72.0
7 15 57.0
8 12 60.0
9 16 56.0
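The question also asks for the greatest fall; under the same expanding-window pattern, a counterpart might simply swap max for min (gr8Fall is a hypothetical column name):
df['gr8Fall'] = [df['reading'].rolling(x).apply(lambda w: w[0] - w[-1], raw=True).min()
                 for x in range(1, len(df) + 1)]
df.loc[0, 'gr8Fall'] = np.nan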

How do I use grouped data to plot rainfall averages in specific hourly ranges

I extracted the following data from a dataframe.
https://i.imgur.com/rCLfV83.jpg
The question is, how do I plot a graph, probably a histogram type, where the horizontal axis holds the hours as bins [16:00 17:00 18:00 ... 24:00] and the bars are the average rainfall during each of those hours?
I just don't know enough pandas yet to get this off the ground, so I need some help. Sample data below, as requested.
Date Hours Precip
1996-07-30 21 1
1996-08-17 16 1
18 1
1996-08-30 16 1
17 1
19 5
22 1
1996-09-30 19 5
20 5
1996-10-06 20 1
21 1
1996-10-19 18 4
1996-10-30 19 1
1996-11-05 20 3
1996-11-16 16 1
19 1
1996-11-17 16 1
1996-11-29 16 1
1996-12-04 16 9
17 27
19 1
1996-12-12 19 1
1996-12-30 19 10
22 1
1997-01-18 20 1
It seems df is a multi-index DataFrame after a groupby.
Transform the index to a DatetimeIndex
date_hour_idx = df.reset_index()[['Date', 'Hours']] \
    .apply(lambda x: '{} {}:00'.format(x['Date'], x['Hours']), axis=1)
precip_series = df.reset_index()['Precip']
precip_series.index = pd.to_datetime(date_hour_idx)
Resample to hours using 'H'
# This will show NaN for hours without an entry
resampled_nan = precip_series.resample('H').asfreq()
# This will fill NaN with 0s
resampled_fillna = precip_series.resample('H').asfreq().fillna(0)
If you want this to be the mean per hour, change your groupby(...).sum() to groupby(...).mean()
You can resample to other intervals too -> pandas resample documentation
More about resampling the DatetimeIndex -> https://pandas.pydata.org/pandas-docs/stable/reference/resampling.html
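From there, the requested bar chart could be sketched by grouping the resampled series on the hour of day (assuming the zero-filled series from above and matplotlib):
import matplotlib.pyplot as plt
hourly_avg = resampled_fillna.groupby(resampled_fillna.index.hour).mean()
hourly_avg.plot.bar()
plt.xlabel('Hour of day')
plt.ylabel('Average precipitation')
plt.show()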
It seems to be easy when you have data. I generated artificial data with pandas for this example:
import pandas as pd
import radar
import random
'''>>> date'''
r2 = ()
for a in range(1, 51):
    t = (str(radar.random_datetime(start='1985-05-01', stop='1985-05-04')),)
    r2 = r2 + t
r3 = list(r2)
r3.sort()
#print(r3)
'''>>> variable'''
x = [random.randint(0, 16) for x in range(50)]
df = pd.DataFrame({'date': r3, 'measurement': x})
print(df)
'''order'''
col1 = df.join(df['date'].str.partition(' ')[[0, 2]]).rename({0: 'daty', 2: 'godziny'}, axis=1)
col2 = df['measurement'].rename('pomiary')
p3 = pd.concat([col1, col2], axis=1, sort=False)
p3 = p3.drop(['measurement'], axis=1)
p3 = p3.drop(['date'], axis=1)
Time to average and plot:
dx = p3.groupby(['daty']).mean()
print(dx)
import matplotlib.pyplot as plt
dx.plot.bar()
plt.show()
Plot of the mean measurements

aggregate data by quarter

I have a pivot pandas data frame (sales by region) that got created from another pandas data frame (sales by store) using the pivot_table method.
As an example:
df = pd.DataFrame(
    {'store': ['A','B','C','D','E']*7,
     'region': ['NW','NW','SW','NE','NE']*7,
     'date': ['2017-03-30']*5 + ['2017-04-05']*5 + ['2017-04-07']*5 + ['2017-04-12']*5
             + ['2017-04-13']*5 + ['2017-04-17']*5 + ['2017-04-20']*5,
     'sales': [30,1,133,9,1,30,3,135,9,11,30,1,140,15,15,25,10,137,9,3,
               29,10,137,9,11,30,19,145,20,10,30,8,141,25,25]})
df['date'] = pd.to_datetime(df['date'])
df_sales = df.pivot_table(index = ['region'], columns = ['date'], aggfunc = [np.sum], margins = True)
df_sales = df_sales.ix[:, range(0, df_sales.shape[1]-1)]  # note: .ix is deprecated; modern pandas would use .iloc
My goal is to do the following to the sales data frame, df_sales.
Create a new dataframe that summarizes sales by quarter. I could use the original dataframe df, or the sales_df.
As for quarters, here we have only two (USA fiscal calendar year), so the quarterly aggregated data frame would look like:
region  2017Q1  2017Q2
NE      10      27
NW      31      37.5
SW      133     139.17
I take the average over all days in Q1, and the same for Q2. Thus, for example, for the northeast region 'NE', Q1 is the average over only one day, 2017-03-30, i.e. 10, and Q2 is the average across 2017-04-05 to 2017-04-20, i.e.
(20+30+12+20+30+50)/6 = 27
Any suggestions?
ADDITIONAL NOTE: I would ideally do the quarter aggregations on the df_sales pivoted table since it's a much smaller dataframe to keep in memory. The current solution does it on the original df, but I am still seeking a way to do it in the df_sales dataframe.
UPDATE:
Setup:
df.date = pd.to_datetime(df.date)
df_sales = df.pivot_table(index='region', columns='date', values='sales', aggfunc='sum')
In [318]: df_sales
Out[318]:
date 2017-03-30 2017-04-05 2017-04-07 2017-04-12 2017-04-13 2017-04-17 2017-04-20
region
NE 10 20 30 12 20 30 50
NW 31 33 31 35 39 49 38
SW 133 135 140 137 137 145 141
Solution:
In [319]: (df_sales.groupby(pd.PeriodIndex(df_sales.columns, freq='Q'), axis=1)
...: .apply(lambda x: x.sum(axis=1)/x.shape[1])
...: )
Out[319]:
date 2017Q1 2017Q2
region
NE 10.0 27.000000
NW 31.0 37.500000
SW 133.0 139.166667
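Since every region has a value for every date here, the apply can likely be replaced with a plain mean over the same column groups (equivalent only when there are no NaNs):
df_sales.groupby(pd.PeriodIndex(df_sales.columns, freq='Q'), axis=1).mean()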
Solution based on the original DF:
In [253]: (df.groupby(['region', pd.PeriodIndex(df.date, freq='Q-DEC')])
...: .apply(lambda x: x['sales'].sum()/x['date'].nunique())
...: .to_frame('avg').unstack('date')
...: )
...:
Out[253]:
avg
date 2017Q1 2017Q2
region
NE 10.0 27.000000
NW 31.0 37.500000
SW 133.0 139.166667
NOTE: df - is the original DF (before "pivoting")