Facebook NeuralProphet - adding holidays - facebook-prophet

I have one common data set for my prediction that includes data across the globe.
ds                   y    country_id
01/01/2021 09:00:00  5.0  1
01/01/2021 09:10:00  5.2  1
01/01/2021 09:20:00  5.4  1
01/01/2021 09:30:00  6.1  1
01/01/2021 09:00:00  2.0  2
01/01/2021 09:10:00  2.2  2
01/01/2021 09:20:00  2.4  2
01/01/2021 09:30:00  3.1  2
playoffs = pd.DataFrame({
    'holiday': 'playoff',
    'ds': pd.to_datetime(['2008-01-13', '2009-01-03', '2010-01-16',
                          '2010-01-24', '2010-02-07', '2011-01-08',
                          '2013-01-12', '2014-01-12', '2014-01-19',
                          '2014-02-02', '2015-01-11', '2016-01-17',
                          '2016-01-24', '2016-02-07']),
    'lower_window': 0,
    'upper_window': 1,
})
superbowls = pd.DataFrame({
    'holiday': 'superbowl',
    'ds': pd.to_datetime(['2010-02-07', '2014-02-02', '2016-02-07']),
    'lower_window': 0,
    'upper_window': 1,
})
holidays = pd.concat((playoffs, superbowls))
Now, I would like to add holidays to the model.
m = NeuralProphet(holidays=holidays)
m.add_country_holidays(country_name='US')
m.fit(df)
How can I add multiple country holidays with add_country_holidays (m.add_country_holidays)?
How do I add country-specific holidays to the holidays data?
Do I need to build a separate model per country, or is one model for the entire dataset fine, with the country holidays added as regressors? What is the recommended approach?

Here is a possible solution:
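A quick note on the first question before the full program: as far as I know, add_country_holidays takes a single country name per model (this may vary with the NeuralProphet version), so the approach below builds each country's holiday dates with the holidays package and feeds them in as user-specified events. A minimal sketch of that building block, under those assumptions:
import pandas as pd
import holidays

# Sketch: turn one country's holidays into an events dataframe for NeuralProphet.
us_days = holidays.US(years=range(2007, 2017))
holidays_US = pd.DataFrame({'event': 'holiday_US',
                            'ds': pd.to_datetime(sorted(us_days.keys()))})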
The program:
# NOTE 1: tested on google colab
# Un-comment the following (!pip) line if you need to install the libraries
# on google colab notebook:
#!pip install neuralprophet pandas numpy holidays
import pandas as pd
import numpy as np
import holidays
from neuralprophet import NeuralProphet
import datetime
# NOTE 2: Most of the code comes from:
# https://neuralprophet.com/html/events_holidays_peyton_manning.html
# Context:
# We will use the time series of the log of daily page views for the Wikipedia
# page for Peyton Manning (former American football quarterback) as an example.
# During playoffs and Super Bowls, Peyton Manning's wiki page is viewed more
# frequently. We would like to see whether country-specific holidays also have
# an influence.
# First, we load the data:
data_location = "https://raw.githubusercontent.com/ourownstory/neuralprophet-data/main/datasets/"
df = pd.read_csv(data_location + "wp_log_peyton_manning.csv")
# To simulate your case, we add a country_id column filled with random values {1, 2}.
# Let's assume US=1 and Canada=2 (numpy was already imported above).
np.random.seed(0)
df['country_id'] = np.random.randint(1, 2 + 1, df['ds'].count())
print("The dataframe we are working on:")
print(df.head())
# We would like to add holidays for US and Canada to see if holidays have an
# influence on the # of daily's views on Manning's wiki page.
# The data in df starts in 2007 and ends in 2016:
StartingYear=2007
LastYear=2016
# Holidays for US and Canada:
US_holidays = holidays.US(years=range(StartingYear, LastYear + 1))
CA_holidays = holidays.CA(years=range(StartingYear, LastYear + 1))
# Collect the days in df that are holidays in the respective country.
# (DataFrame.append is deprecated and was removed in pandas 2.0, so we gather
# the rows in plain lists and build the dataframes afterwards.)
us_rows = []
ca_rows = []
for i in df.index:
    # Convert the date string to a datetime object:
    day = datetime.datetime(*[int(x) for x in df['ds'][i].split('-')])
    # Check if the corresponding day is a holiday in the US:
    if df['country_id'][i] == 1 and day in US_holidays:
        us_rows.append({'ds': df['ds'][i], 'event': 'holiday_US'})
    # Check if the corresponding day is a holiday in Canada:
    if df['country_id'][i] == 2 and day in CA_holidays:
        ca_rows.append({'ds': df['ds'][i], 'event': 'holiday_CA'})
holidays_US = pd.DataFrame(us_rows, columns=['ds', 'event'])
holidays_CA = pd.DataFrame(ca_rows, columns=['ds', 'event'])
# Now we can drop the country_id in df:
df.drop('country_id', axis=1, inplace=True)
print("Days in df that are holidays in the US:")
print(holidays_US.head())
print()
print("Days in df that are holidays in Canada:")
print(holidays_CA.head())
# User-specified events
# Historical events:
playoffs = pd.DataFrame({
    'event': 'playoff',
    'ds': pd.to_datetime([
        '2008-01-13', '2009-01-03', '2010-01-16',
        '2010-01-24', '2010-02-07', '2011-01-08',
        '2013-01-12', '2014-01-12', '2014-01-19',
        '2014-02-02', '2015-01-11', '2016-01-17',
        '2016-01-24', '2016-02-07',
    ]),
})
superbowls = pd.DataFrame({
    'event': 'superbowl',
    'ds': pd.to_datetime([
        '2010-02-07', '2012-02-05', '2014-02-02',
        '2016-02-07',
    ]),
})
# Create the events_df:
events_df = pd.concat((playoffs, superbowls, holidays_US, holidays_CA))
# Create neural network and fit:
# NeuralProphet Object
m = NeuralProphet(loss_func="MSE")
m = m.add_events("playoff")
m = m.add_events("superbowl")
m = m.add_events("holiday_US")
m = m.add_events("holiday_CA")
# create the data df with events
history_df = m.create_df_with_events(df, events_df)
# fit the model
metrics = m.fit(history_df, freq="D")
# forecast with events known ahead
future = m.make_future_dataframe(df=history_df, events_df=events_df, periods=365, n_historic_predictions=len(df))
forecast = m.predict(df=future)
fig = m.plot(forecast)
fig_param = m.plot_parameters()
fig_comp = m.plot_components(forecast)
RESULT:
The results (see the PARAMETERS figure) seem to show that when a day is a holiday, there are fewer views in both the US and Canada. Does that make sense? Maybe... It looks plausible that people on holiday have more interesting things to do than browsing Manning's wiki page :-) I don't know.
PROGRAM'S OUTPUT:
The dataframe we are working on:
           ds       y  country_id
0  2007-12-10  9.5908           1
1  2007-12-11  8.5196           2
2  2007-12-12  8.1837           2
3  2007-12-13  8.0725           1
4  2007-12-14  7.8936           2
Days in df that are holidays in the US:
           ds       event
0  2007-12-25  holiday_US
1  2008-01-21  holiday_US
2  2008-07-04  holiday_US
3  2008-11-27  holiday_US
4  2008-12-25  holiday_US
Days in df that are holidays in Canada:
           ds       event
0  2008-01-01  holiday_CA
1  2008-02-18  holiday_CA
2  2008-08-04  holiday_CA
3  2008-09-01  holiday_CA
4  2008-10-13  holiday_CA
INFO - (NP.utils.set_auto_seasonalities) - Disabling daily seasonality. Run NeuralProphet with daily_seasonality=True to override this.
INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 32
INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 138
88% 241/273 [00:02<00:00, 121.69it/s]
INFO - (NP.utils_torch.lr_range_test) - lr-range-test results: steep: 3.36E-02, min: 1.51E+00
88% 241/273 [00:02<00:00, 123.87it/s]
INFO - (NP.utils_torch.lr_range_test) - lr-range-test results: steep: 3.36E-02, min: 1.63E+00
89% 242/273 [00:02<00:00, 121.58it/s]
INFO - (NP.utils_torch.lr_range_test) - lr-range-test results: steep: 3.62E-02, min: 2.58E+00
INFO - (NP.forecaster._init_train_loader) - lr-range-test selected learning rate: 3.44E-02
Epoch[138/138]: 100%|██████████| 138/138 [00:29<00:00, 4.74it/s, MSELoss=0.012, MAE=0.344, RMSE=0.478, RegLoss=0]
The figures (FORECASTS, PARAMETERS, and COMPONENTS) are not reproduced here.

Related

How to convert days to hours in pandas

Example:
coupon        expiration
Restaurant    1d
College       2d
Coffee House  2h
Expected output:
coupon        expiration
Restaurant    24h
College       48h
Coffee House  2h
You can use pd.to_timedelta, but the values in the expiration column must be valid timedelta strings:
import pandas as pd

df = pd.read_clipboard()  # Your df here
tds = pd.to_timedelta(df["expiration"])
# 0   1 days 00:00:00
# 1   2 days 00:00:00
# 2   0 days 02:00:00
# Name: expiration, dtype: timedelta64[ns]
# I would recommend stopping here, but you can reformat this into a string of hours:
df["expiration"] = tds.dt.total_seconds().div(3600).apply("{:g}h".format)
#          coupon expiration
# 0    Restaurant        24h
# 1       College        48h
# 2  Coffee House         2h
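A quick follow-up on why stopping at the timedelta dtype is usually preferable: real timedeltas keep comparisons and arithmetic that formatted strings lose. A small sketch reusing tds from above (the 12-hour threshold is an arbitrary example):
# Filter coupons that last longer than 12 hours (tds is still timedelta-typed):
long_coupons = df[tds > pd.Timedelta(hours=12)]
# Total duration across all coupons, in hours (24 + 48 + 2 = 74 here):
total_hours = tds.dt.total_seconds().sum() / 3600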
You can use str.replace on the expiration column with a regex pattern that selects the entries that have a day (d) suffix. You can also pass a function as the repl parameter - which is where I chose to do the conversion to hours.
Code:
import pandas as pd
df = pd.DataFrame({"coupon":['Restaurant','College','Coffee House'], "expiration":['1d','2d','2h']})
def replacement(m):
    x = int(m.group(0).split('d')[0]) * 24
    return f"{x}h"

df.expiration = df.expiration.str.replace(pat=r'^\d+d$', repl=replacement, regex=True)
print(df)
Output:
         coupon expiration
0    Restaurant        24h
1       College        48h
2  Coffee House         2h
Regex Pattern:
r'^\d+d$'
^ : start of string
\d+ : one or more digits [0-9]
d : followed by the letter d
$ : end of string
Note:
If you would rather have a one-liner using a lambda function instead:
df.expiration = df.expiration.str.replace(pat=r'^\d+d$', repl= lambda m:f"{int(m.group(0).split('d')[0]) * 24}h", regex=True)
A simple apply can also help here:
def convert(x):
    if 'd' in x:
        return f"{int(x.replace('d', '')) * 24}h"
    return x

df['expiration'] = df['expiration'].apply(convert)
df
Out[57]:
         coupon expiration
0    Restaurant        24h
1       College        48h
2  Coffee House         2h
Another possible solution, based on eval (only use eval on input you trust):
df['expiration'] = [str(eval(x)) + 'h' for x in
                    df['expiration'].str.replace('d', '*24').str.replace('h', '')]
Output:
         coupon expiration
0    Restaurant        24h
1       College        48h
2  Coffee House         2h

Averaging over specified time period

I opened multiple NetCDF files (each one corresponds to an hour, 0-30 h) with xr.open_mfdataset('*.nc'). Now I have an extra dimension (time). I am looking at one of my variables (u, v, w). I want to average u (time: 30, z: 200, y: 100, x: 100) over 24 hours instead of over the whole time period I have. How can I do that?
To select the first 24 observations along the time dim, you can use .isel, e.g.:
ds24 = ds.isel(time=range(24))
See the xarray docs on indexing and selecting data for more options and examples.
Now, you can average over the time dim:
ds24.mean(dim='time')
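Putting the two steps together, a minimal sketch (the file pattern and the variable name u are illustrative assumptions):
import xarray as xr

ds = xr.open_mfdataset('*.nc')           # hypothetical hourly files
u24 = ds['u'].isel(time=slice(0, 24))    # first 24 hours only
u_daily_mean = u24.mean(dim='time')      # average over the time dim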
Here is a simple test case:
#!/usr/bin/env ipython
# ---------------------
import numpy as np
import xarray as xr
import datetime
# ---------------------
# Let us make a test series:
N = 30
datesout = np.array([datetime.datetime(2022, 10, 23, 0) + datetime.timedelta(seconds=nn * 3600) for nn in range(N)])
serieout = np.array([10 + nn for nn in range(N)])
# Let us save it as netCDF:
dsout = xr.Dataset(data_vars=dict(someserie=(["time"], serieout)),
                   coords=dict(time=datesout),
                   attrs=dict(description="Test data."))
dsout.to_netcdf('test.nc')
# ---------------------
# Let us make a daily average:
with xr.open_dataset('test.nc') as ncin:
    # Calculate daily means:
    dfmeanout = ncin.groupby("time.day").mean()
    # Keep only the first day:
    dfmeanout = dfmeanout.isel(day=0)
    # Note: select the variable; a Dataset itself has no .values attribute.
    print('Got value: ', dfmeanout.someserie.values)
# ---------------------
# Simple check: the mean of the first 24 hourly values
expected_mean = np.sum(np.array([vv + 10 for vv in range(24)])) / 24
print('Correct mean is: ', expected_mean)

Inconsistent output for pandas groupby-resample with missing values in first time bin

I am finding an inconsistent output with pandas groupby-resample behavior.
Take this dataframe, in which category A has samples on the first and second day and category B has a sample only on the second day:
df1 = pd.DataFrame(index=pd.DatetimeIndex(
    ['2022-1-1 1:00', '2022-1-2 1:00', '2022-1-2 1:00']),
    data={'category': ['A', 'A', 'B']})
# Output:
#                     category
# 2022-01-01 01:00:00        A
# 2022-01-02 01:00:00        A
# 2022-01-02 01:00:00        B
When I groupby-resample I get a Series with multiindex on category and time:
res1 = df1.groupby('category').resample('1D').size()
# Output:
# category
# A         2022-01-01    1
#           2022-01-02    1
# B         2022-01-02    1
# dtype: int64
But if I add one more data point so that B has a sample on day 1, the return value is a dataframe with single-index in category and columns corresponding to the time bins:
df2 = pd.DataFrame(index=pd.DatetimeIndex(
    ['2022-1-1 1:00', '2022-1-2 1:00', '2022-1-2 1:00', '2022-1-1 1:00']),
    data={'category': ['A', 'A', 'B', 'B']})
res2 = df2.groupby('category').resample('1D').size()
# Output:
#           2022-01-01  2022-01-02
# category
# A                  1           1
# B                  1           1
Is this expected behavior? I reproduced this behavior in pandas 1.4.2 and was unable to find a bug report.
I submitted bug report 46826 to pandas.
The result should be a Series with a MultiIndex in both cases. There was a bug which caused df.groupby.resample.size to return a wide DataFrame for cases in which all groups had the same index. This has been fixed on the master branch. Thank you for opening the issue.
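Until the fix is in a release, a workaround that returns a consistently shaped MultiIndex Series in both cases is to group by the category and a pd.Grouper on the index in one step (a sketch; note that unlike resample, Grouper only creates bins for dates that actually occur, so gaps are not zero-filled):
import pandas as pd

res = df2.groupby(['category', pd.Grouper(freq='1D')]).size()
# category
# A         2022-01-01    1
#           2022-01-02    1
# B         2022-01-01    1
#           2022-01-02    1
# dtype: int64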

Plotting time series box and whisker plot with missing date values for origin destination pairs

I have the following data set:
df.head(7)
    Origin     Dest       Date  Quantity
0  Atlanta       LA 2021-09-09         1
1  Atlanta       LA 2021-09-11         4
2  Atlanta  Chicago 2021-09-16         1
3  Atlanta  Seattle 2021-09-27        12
4  Seattle       LA 2021-09-29         2
5  Seattle  Atlanta 2021-09-13         2
6  Seattle   Newark 2021-09-17         7
In short, this table represents the number of items (Quantity) that were sent from a given origin to a given destination on a given date. The table contains 1 month of data. This table was read with:
shipments = pd.read_csv('shipments.csv', parse_dates=['Date'])
Note that this is a sparse table: if Quantity=0 for a particular (Origin, Dest, Date) triple, that row is not included in the table. For example, since no items were sent from Atlanta to LA on 2021-09-10, there is no row for that date.
I would like to visualize this data using time series box and whisker plots. The x-axis of my graph should show the day, and Quantity should be on the y-axis. A boxplot should represent the various percentiles aggregated over all (origin-destination) pairs.
Similarly, would it be possible to create a graph which, instead of every day, only shows Monday-Sunday on the x-axis (and hence shows the results per day of the week)?
To generate the rows with missing data I used the following code:
table = pd.pivot_table(data=shipments, index='Date', columns=['Origin','Dest'], values='Quantity', fill_value=0)
idx = pd.date_range('2021-09-06','2021-10-10')
table = table.reindex(idx,fill_value=0)
You could transpose the table dataframe, and use that as input for a sns.boxplot. And you could create a similar table for the day of the week. Note that with many zeros, the boxplot might look a bit strange.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# First create some test data, somewhat similar to the given data
N = 1000
cities = ['Atlanta', 'LA', 'Chicago', 'Seattle', 'Newark']
shipments = pd.DataFrame({'Origin': np.random.choice(cities, N),
                          'Dest': np.random.choice(cities, N),
                          'Date': np.random.choice(pd.date_range('2021-09-06', '2021-10-10'), N),
                          'Quantity': (np.random.uniform(1, 4, N) ** 3).astype(int)})
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15, 5), gridspec_kw={'width_ratios': [3, 1]})

# Create boxplots for each day
table_month = pd.pivot_table(data=shipments, index='Date', columns=['Origin', 'Dest'], values='Quantity', fill_value=0)
idx = pd.date_range('2021-09-06', '2021-10-10')
table_month = table_month.reindex(idx, fill_value=0)
sns.boxplot(data=table_month.T, ax=ax1)
labels = [day.strftime('%d\n%b %Y') if i == 0 or day.day == 1 else day.strftime('%d')
          for i, day in enumerate(table_month.index)]
ax1.set_xticklabels(labels)

# Create boxplots for each day of the week
table_dow = pd.pivot_table(data=shipments, index=shipments['Date'].dt.dayofweek, columns=['Origin', 'Dest'],
                           values='Quantity', fill_value=0)
table_dow = table_dow.reindex(range(7), fill_value=0)
sns.boxplot(data=table_dow.T, ax=ax2)
labels = ['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun']
ax2.set_xticklabels(labels)
ax2.set_xlabel('')  # remove superfluous x label
fig.tight_layout()
plt.show()

Random Choice loop through groups of samples

I have a df containing the columns "Income_Groups", "Rate", and "Probability". I need to randomly select a rate for each income group. How can I write a loop and print out the result for each income bin?
The pandas data frame table looks like this:
import pandas as pd

df = {'Income_Groups': ['1', '1', '1', '2', '2', '2', '3', '3', '3'],
      'Rate': [1.23, 1.25, 1.56, 2.11, 2.32, 2.36, 3.12, 3.45, 3.55],
      'Probability': [0.25, 0.50, 0.25, 0.50, 0.25, 0.25, 0.10, 0.70, 0.20]}
df2 = pd.DataFrame(data=df)
df2
Shooting in the dark here, but you can use np.random.choice:
import numpy as np

(df2.groupby('Income_Groups')
    .apply(lambda x: np.random.choice(x['Rate'], p=x['Probability']))
)
Output (can vary due to randomness):
Income_Groups
1 1.25
2 2.36
3 3.45
dtype: float64
You can also pass size into np.random.choice:
(df2.groupby('Income_Groups')
    .apply(lambda x: np.random.choice(x['Rate'], size=3, p=x['Probability']))
)
Output:
Income_Groups
1 [1.23, 1.25, 1.25]
2 [2.36, 2.11, 2.11]
3 [3.12, 3.12, 3.45]
dtype: object
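If you prefer one row per draw instead of arrays, Series.explode can flatten that result. A small sketch continuing from the groupby above (values vary due to randomness):
samples = (df2.groupby('Income_Groups')
              .apply(lambda x: list(np.random.choice(x['Rate'], size=3, p=x['Probability']))))
print(samples.explode())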
Use GroupBy.apply because of the weights:
import numpy as np
(df2.groupby('Income_Groups')
    .apply(lambda gp: np.random.choice(a=gp.Rate, p=gp.Probability, size=1)[0]))
#Income_Groups
#1 1.23
#2 2.11
#3 3.45
#dtype: float64
Another silly way, because your weights seem to have a precision of 2 decimal places (note that repeat needs integer counts, hence the astype(int)):
s = df2.set_index(['Income_Groups', 'Probability']).Rate
(s.repeat((s.index.get_level_values('Probability') * 100).astype(int))  # Weight
  .sample(frac=1)                             # Shuffle |
  .reset_index()                              #         | -> Random select
  .drop_duplicates(subset=['Income_Groups'])  # Select  |
  .drop(columns='Probability'))
#   Income_Groups  Rate
# 0             2  2.32
# 1             1  1.25
# 3             3  3.45
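As a side note, pandas 1.1+ offers a more direct route: DataFrameGroupBy.sample accepts a weights argument, so the weighted draw needs no apply at all. A sketch:
# One weighted row per income group (requires pandas >= 1.1):
df2.groupby('Income_Groups').sample(n=1, weights=df2['Probability'])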