Resampling the decomposed series of trend, residual and seasonality of the data from monthly to daily - pandas

I am decomposing a short time series to evaluate its trend, seasonal and residual components. I want to resample these components from monthly to daily frequency, but I am getting an error related to the DatetimeIndex.
Here is my code:
# Import
import numpy as np
import pandas as pd
import tsfresh
import matplotlib.pyplot as plt
# monthly time series data set
df = pd.read_csv("data.csv")
df = pd.DataFrame(df)
# perform linear interpolation on the dataframe
df.interpolate(method='linear', inplace=True)
X = df['X']
date_index = pd.date_range(start='2002-01-01', periods=12, freq='M')
monthly_X= pd.Series(X, index=date_index)
from statsmodels.tsa.seasonal import seasonal_decompose
decomp = seasonal_decompose(df['X'], period = 1)
# Plot the decomposed time series to interpret.
plt.figure(figsize=(100, 20))
decomp.plot();
trend = decomp.trend
seasonal = decomp.seasonal
residual = decomp.resid
# Resample the DataFrame to daily frequency
daily_trend = pd.Series(trend).resample('D').interpolate()
daily_seasonal = pd.Series(seasonal).resample('D').interpolate()
daily_residual = pd.Series(residual).resample('D').interpolate()
(The error is raised here, at pd.Series(trend).resample: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex')
How can I resample these series?
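A minimal sketch of one way around the error, assuming the component values can be attached to a monthly DatetimeIndex. The values below are hypothetical stand-ins for a decomposed component, and the "MS" (month start) frequency is chosen only for illustration. The Series is built from a raw array so that pandas does not try to realign an existing RangeIndex against the dates.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly values standing in for a decomposed component
values = np.arange(12, dtype=float)

# Month-start timestamps; "MS" is an assumption made for this sketch
date_index = pd.date_range(start="2002-01-01", periods=12, freq="MS")

# Build the Series from raw values so no realignment on an old index happens
monthly = pd.Series(values, index=date_index)

# Upsample to daily frequency and fill the new days by linear interpolation
daily = monthly.resample("D").interpolate(method="linear")
```

For a component such as trend, the same idea would be pd.Series(trend.to_numpy(), index=date_index), provided the number of dates matches the number of values. Any NaN values at the start of a component are not filled by interpolate in its default forward direction.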

Related

Calculating time-series as a percentage of total

I'm looking at county-level procurement data (millions of bills) and plotting time-series with matplotlib and pandas using groupby:
dataframe_slice.groupby(pd.Grouper(freq='1M')).bill_amount.sum().plot()
where bill_amount is a column of floats that shows how much was billed. How can I change the graph to show the dataframe_slice as a percentage of total dataframe bill_amount?
I am not aware of an out-of-the-box pandas function for that (but I am hoping to be proven wrong). IMHO, you have to compute the total sum first, then divide each group's aggregated sum by it to get the percentage per group. Stand-alone code:
import pandas as pd
from matplotlib import pyplot as plt
#fake data generation
import numpy as np
np.random.seed(123)
n = 200
start = pd.to_datetime("2017-07-17")
end = pd.to_datetime("2018-04-03")
ndays = (end - start).days + 1
date_range = pd.to_timedelta(np.random.rand(n) * ndays, unit="D") + start
df = pd.DataFrame({"ind": date_range,
"bill_amount": np.random.randint(10, 30, n),
"cat": np.random.choice(["X", "Y", "Z"], n)})
df.set_index("ind", inplace=True)
#df.sort_index(inplace=True)
#this assumes that your dataframe has a datetime index
#here starts the actual calculation
total_sum = df.bill_amount.sum()
dataframe_slice = df.groupby(pd.Grouper(freq='1M')).bill_amount.sum().div(total_sum)*100
dataframe_slice.plot()
#and we beautify the plot
plt.xlabel("Month of expenditure")
plt.ylabel("Percentage of expenditure")
plt.tight_layout()
plt.show()
Sample output: [plot of monthly expenditure as a percentage of the total; image not reproduced]

How to change a seaborn histogram plot to work for hours of the day?

I have a pandas dataframe with many time intervals of varying start times and lengths. I am interested in the distribution of start times over 24 hours, so I have another column, Hour, that holds just the hour of each start time. I plotted a histogram using seaborn to look at the distribution, but the x axis starts at 0 and runs to 24. Is there a way to make it run from 8 to 8, wrapping from 23 to 0, so that it gives a better view of my data from a time perspective? Thanks in advance.
sns.distplot(df2['Hour'], bins = 24, kde = False).set(xlim=(0,23))
If you want to have a custom order of x-values on your bar plot, I'd suggest using matplotlib directly and plot your histogram simply as a bar plot with width=1 to get rid of padding between bars.
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
# prepare sample data
dates = pd.date_range(
start=datetime(2020, 1, 1),
end=datetime(2020, 1, 7),
freq="H")
random_dates = np.random.choice(dates, 1000)
df = pd.DataFrame(data={"date":random_dates})
df["hour"] = df["date"].dt.hour
# set your preferred order of hours
hour_order = [8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,0,1,2,3,4,5,6,7]
# calculate frequencies of each hour and sort them
plot_df = (
df["hour"]
.value_counts()
.rename_axis("hour", axis=0)
.reset_index(name="freq")
.set_index("hour")
.loc[hour_order]
.reset_index())
# day / night colour split
day_mask = ((8 <= plot_df["hour"]) & (plot_df["hour"] <= 20))
plot_df["color"] = np.where(day_mask, "skyblue", "midnightblue")
# actual plotting - note that you have to cast hours as strings
fig = plt.figure(figsize=(8,4))
ax = fig.add_subplot(111)
ax.bar(
x=plot_df["hour"].astype(str),
height=plot_df["freq"],
color=plot_df["color"], width=1)
ax.set_xlabel('Hour')
ax.set_ylabel('Frequency')
plt.show()
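One caveat with selecting by .loc[hour_order]: if some hour never occurs in the data, pandas raises a KeyError for the missing label. A small sketch of an alternative, assuming it is acceptable to show empty hours as zero-height bars, is to use reindex with fill_value=0. The sample hours below are hypothetical.

```python
import pandas as pd

# Hypothetical sample: start hours with most hours of the day absent
hours = pd.Series([8, 8, 9, 23, 0, 0, 0, 7], name="hour")

# Display order running from 08:00 round to 07:00
hour_order = list(range(8, 24)) + list(range(0, 8))

# reindex keeps the custom order and inserts 0 for hours with no observations
freq = hours.value_counts().reindex(hour_order, fill_value=0)
```

The resulting freq Series can then be passed to ax.bar in the same way as plot_df["freq"].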

What is the fastest way to compute quantile over grouped dataframe?

I am creating monthly diurnal plots from pandas dataframe. I need to plot mean, median or any quantile. I am able to achieve it correctly, but with large data, quantile computation is way slower than mean or median computation. Is there any faster way to achieve this?
import pandas as pd
import numpy as np
import datetime as dt
date_range = pd.date_range(start=dt.datetime(2018,1,1,00,00), end=dt.datetime(2018,12,31,23,59), freq='1min')
N = len(date_range)
df = pd.DataFrame({'Test': np.random.rand(N)}, index=date_range)
df['Time'] = df.index.time
df['Month'] = df.index.month
time_mean_median = dt.datetime(2019,1,1,0,0,0)
time_qunatiles = dt.datetime(2019,1,1,0,0,0)
for i in range(12):
    # select one month of data
    df_month = df[['Test', 'Time']].loc[df['Month'] == i + 1]
    # time the mean/median aggregation
    start_time = dt.datetime.now()
    df1_group = df_month.groupby('Time').agg([np.mean, np.median])
    time_mean_median += dt.datetime.now()-start_time
    # time the quantile aggregation
    quantiles = [0.23, 0.72]
    start_time = dt.datetime.now()
    df2_group = df_month.groupby('Time').quantile(q=quantiles).unstack()
    time_qunatiles += dt.datetime.now() - start_time
print('Mean/median computation time {}'.format(time_mean_median.time()))
print('Quantile computation time {}'.format(time_qunatiles.time()))
In this example I get a total mean/median computation time of around 0.7 seconds, compared to almost 12 seconds for the quantile computation.
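One possible direction, sketched here under a strong assumption: the data is perfectly regular (one value per minute, no gaps, whole days only). In that case the values can be reshaped into a days-by-minutes array and np.quantile can be applied along the day axis, which avoids the per-group work. The sample data and names below are hypothetical, and whether this is actually faster on real data would need to be measured.

```python
import numpy as np
import pandas as pd

# Hypothetical one-month sample at 1-minute resolution (31 complete days)
minutes_per_day = 24 * 60
idx = pd.date_range("2018-01-01", periods=31 * minutes_per_day, freq="min")
rng = np.random.default_rng(0)
s = pd.Series(rng.random(len(idx)), index=idx)

quantiles = [0.23, 0.72]

# Each row is one day, each column one minute of the day
by_day = s.to_numpy().reshape(-1, minutes_per_day)

# Quantiles across days for every minute of the day; shape (2, 1440)
fast = np.quantile(by_day, quantiles, axis=0)

# Reference result from groupby, computed only for comparison
minute_of_day = (s.index.hour * 60 + s.index.minute).to_numpy()
reference = s.groupby(minute_of_day).quantile(quantiles).unstack()
```

Both np.quantile and the pandas quantile use linear interpolation by default, so the two results should agree up to floating-point error. With gaps or irregular timestamps the reshape is not valid and a pivot would be needed first.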

Exceedance (1-cdf) plot using seaborn and pandas

Assume a dataframe df with a single column (say latency, i.e. a uni-variate sample). The exceedance function is calculated and plotted as follows:
sorted_df = df.sort_values('latency')
samples = len(sorted_df)
exceedance = [1-(x/samples) for x in range(1, samples + 1)]
ax.plot(sorted_df['latency'], exceedance, 'o')
Is there a simpler or more elegant way to calculate and plot the exceedance function of a univariate sample using seaborn (maybe distplot)? I recently learnt to use seaborn's distplot function, but I can only plot the cdf as follows:
sns.distplot(df['latency'], hist=False, kde_kws={'cumulative':True})
I'm specifically interested in seaborn because I plan to use this function along with Seaborn.FacetGrid to get an exceedance plot for several factors.
Because you asked for a more elegant way, the following saves you two lines of code and is faster.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
def plot_exceedance(data, **kwargs):
    # sort the sample, then pair each value with the fraction of samples above it
    sorted_df = data.sort_values()
    exceedance = 1.-np.arange(1.,len(sorted_df) + 1.)/len(sorted_df)
    plt.plot(sorted_df, exceedance, **kwargs)
g = sns.FacetGrid(df, row='factorA',col='factorB',hue='factorC')
g.map(plot_exceedance, 'latency')
There is no predefined API or parameter to calculate exceedance, so I had to use the code listed above. But since I was specifically interested in an exceedance plot for several factors, and I could use plt.plot along with seaborn.FacetGrid, the following piece of code worked.
def plot_exceedance(data, **kwargs):
    sorted_df = data.sort_values()
    samples = len(sorted_df)
    exceedance = [1-(x/samples) for x in range(1, samples + 1)]
    ax=plt.gca()
    ax.plot(sorted_df, exceedance, **kwargs)
g = sns.FacetGrid(df, row='factorA',col='factorB',hue='factorC')
g.map(plot_exceedance, 'latency')
where factorA, factorB and factorC are additional columns in df.
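To make the calculation itself concrete, here is a small worked sketch with NumPy only, using a hypothetical four-value latency sample. Each sorted value is paired with the fraction of samples that lie above it.

```python
import numpy as np

# Hypothetical latency sample, for illustration only
latency = np.array([3.0, 1.0, 2.0, 4.0])

sorted_latency = np.sort(latency)
n = len(sorted_latency)

# Fraction of samples greater than each sorted value (no ties in this sample)
exceedance = 1.0 - np.arange(1, n + 1) / n
```

For this sample the sorted values 1, 2, 3, 4 map to exceedance values 0.75, 0.5, 0.25 and 0.0, so the largest observation always gets an exceedance of zero.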

Python rolling Sharpe ratio with Pandas or NumPy

I am trying to generate a plot of the 6-month rolling Sharpe ratio using Python with Pandas/NumPy.
My input data is below:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
# Generate sample data
d = pd.date_range(start='1/1/2008', end='12/1/2015')
df = pd.DataFrame(d, columns=['Date'])
df['returns'] = np.random.rand(d.size, 1)
df = df.set_index('Date')
print(df.head(20))
returns
Date
2008-01-01 0.232794
2008-01-02 0.957157
2008-01-03 0.079939
2008-01-04 0.772999
2008-01-05 0.708377
2008-01-06 0.579662
2008-01-07 0.998632
2008-01-08 0.432605
2008-01-09 0.499041
2008-01-10 0.693420
2008-01-11 0.330222
2008-01-12 0.109280
2008-01-13 0.776309
2008-01-14 0.079325
2008-01-15 0.559206
2008-01-16 0.748133
2008-01-17 0.747319
2008-01-18 0.936322
2008-01-19 0.211246
2008-01-20 0.755340
What I want
The type of plot I am trying to produce is this or the first plot from here (see below).
My attempt
Here is the equation I am using:
def my_rolling_sharpe(y):
    return np.sqrt(126) * (y.mean() / y.std()) # 21 days per month X 6 months = 126
# Calculate rolling Sharpe ratio
df['rs'] = my_rolling_sharpe(df['returns'])
fig, ax = plt.subplots(figsize=(10, 3))
df['rs'].plot(style='-', lw=3, color='indianred', label='Sharpe')\
.axhline(y = 0, color = "black", lw = 3)
plt.ylabel('Sharpe ratio')
plt.legend(loc='best')
plt.title('Rolling Sharpe ratio (6-month)')
fig.tight_layout()
plt.show()
The problem is that I am getting a horizontal line since my function is giving a single value for the Sharpe ratio. This value is the same for all the Dates. In the example plots, they appear to be showing many ratios.
Question
Is it possible to plot a 6-month rolling Sharpe ratio that changes from one day to the next?
Approximately correct solution using df.rolling and a fixed window size of 180 days:
df['rs'] = df['returns'].rolling('180d').apply(my_rolling_sharpe)
This window isn't exactly 6 calendar months wide because rolling requires a fixed window size, so trying window='6MS' (6 Month Starts) throws a ValueError.
To calculate the Sharpe ratio for a window exactly 6 calendar months wide, I'll copy this super cool answer by SO user Mike:
df['rs2'] = [my_rolling_sharpe(df.loc[d - pd.offsets.DateOffset(months=6):d, 'returns'])
for d in df.index]
# Compare the two windows
df.plot(y=['rs', 'rs2'], linewidth=0.5)
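As a side note on the fixed 180-day window, the same quantity can be sketched without apply by dividing the rolling mean by the rolling standard deviation. This is only an illustration on hypothetical random returns, and it assumes the same sqrt(126) scaling as my_rolling_sharpe.

```python
import numpy as np
import pandas as pd

# Hypothetical daily returns, for illustration only
rng = np.random.default_rng(0)
dates = pd.date_range("2008-01-01", periods=400, freq="D")
returns = pd.Series(rng.random(len(dates)), index=dates)

# Same 180-day time-based window, built from rolling mean and rolling std
window = returns.rolling("180D")
rolling_sharpe = np.sqrt(126) * window.mean() / window.std()

# Reference: the apply-based version, computed only for comparison
def my_rolling_sharpe(y):
    return np.sqrt(126) * (y.mean() / y.std())

reference = returns.rolling("180D").apply(my_rolling_sharpe)
```

Both versions use the sample standard deviation (ddof=1), so they should agree up to floating-point error. The first value is NaN in both, because a single observation has no standard deviation.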
Here is an alternative solution to your question, based solely on the window functions from pandas.
The Sharpe ratio is computed on the fly in a lambda. Please note the following parameters when adapting it:
I have used a risk-free rate of 2%.
The dashed line is just a benchmark for the rolling Sharpe ratio; its value is 1.6.
The code is as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
# Generate sample data
d = pd.date_range(start='1/1/2008', end='12/1/2015')
df = pd.DataFrame(d, columns=['Date'])
df['returns'] = np.random.rand(d.size, 1)
df = df.set_index('Date')
df['rolling_SR'] = df.returns.rolling(180).apply(lambda x: (x.mean() - 0.02) / x.std(), raw = True)
df.fillna(0, inplace = True)
df[df['rolling_SR'] > 0].rolling_SR.plot(style='-', lw=3, color='orange',
label='Sharpe', figsize = (10,7))\
.axhline(y = 1.6, color = "blue", lw = 3,
linestyle = '--')
plt.ylabel('Sharpe ratio')
plt.legend(loc='best')
plt.title('Rolling Sharpe ratio (6-month)')
plt.show()
print('---------------------------------------------------------------')
print('In case you want to check the result data\n')
print(df.tail()) # I use tail because of the size of your window.
You should get something similar to this picture