How to handle cross validation in Prophet with gaps in the dataset? - facebook-prophet

I have to run multiple models for different geo regions, and some regions may have gaps in their data too.
For instance, the data might look like below:
Here, I have data from 2019 for only 6 days, then 1 day in 2020, and the rest from 2021.
Now, I want to run cross-validation on this dataset. The dataset has 423 rows in total.
from prophet.diagnostics import cross_validation as cv, performance_metrics as pm
import pandas as pd

df = pd.read_csv('..')
# ProphetPos is a custom Prophet subclass (prophet_forecaster.py in the traceback below)
model = ProphetPos(holidays=holidays, **config.MODEL_PARAMS)
model.fit(df)
results = cv(model, horizon='182 days', initial='182 days', period='30 days', parallel='threads')
When I try to run the code, the cross-validation step gives an error -
File "/usr/local/lib/python3.8/site-packages/prophet/diagnostics.py", line 202, in cross_validation
return pd.concat(predicts, axis=0).reset_index(drop=True)
File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 285, in concat
op = _Concatenator(
File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 339, in __init__
objs = list(objs)
File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
yield fs.pop().result()
File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 444, in result
return self.__get_result()
File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.8/site-packages/prophet/diagnostics.py", line 250, in single_cutoff_forecast
yhat = m.predict(df[index_predicted][columns])
File "/usr/local/lib/python3.8/site-packages/inf_forecast_colo/utils/prophet_forecaster.py", line 1682, in predict
if fcst['trend'].iloc[-1] < fcst['trend'].iloc[-2]:
File "/usr/local/lib/python3.8/site-packages/pandas/core/indexing.py", line 895, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/usr/local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1501, in _getitem_axis
self._validate_integer(key, axis)
File "/usr/local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1444, in _validate_integer
raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds
I guess what is happening is that only about 5 days of data fall inside the 182-day initial window, so some test sets end up with 0 records to measure performance against. Hence it fails.
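One way to sidestep the empty folds (a sketch, assuming the ds column holds the dates and the 2021 data is contiguous; the 30-day horizon is illustrative) is to pass explicit cutoffs to cross_validation so that every fold's training history and test window land inside the densely populated span:
from prophet.diagnostics import cross_validation as cv
import pandas as pd

# hypothetical: place cutoffs only inside the contiguous 2021 span, so
# each fold has real history behind it and real records in its horizon
dense = df.loc[df['ds'] >= '2021-01-01', 'ds']
cutoffs = pd.date_range(start=dense.min() + pd.Timedelta('182 days'),
                        end=dense.max() - pd.Timedelta('30 days'),
                        freq='30D').tolist()
results = cv(model, horizon='30 days', cutoffs=cutoffs, parallel='threads')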

Related

Pandas tells me non-ambiguous time is ambiguous

I have the following test code:
import pandas as pd

dt = pd.to_datetime('2021-11-07 01:00:00-0400').tz_convert('America/New_York')
pd.DataFrame({'datetime': dt,
              'value': [3, 4, 5]})
When using pandas version 1.1.5, this runs successfully. But under pandas version 1.2.5 or 1.3.4, it fails with the following error:
Traceback (most recent call last):
File "test.py", line 5, in <module>
'value': [3, 4, 5]})
File "venv/lib/python3.7/site-packages/pandas/core/frame.py", line 614, in __init__
mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 465, in dict_to_mgr
arrays, data_names, index, columns, dtype=dtype, typ=typ, consolidate=copy
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 124, in arrays_to_mgr
arrays = _homogenize(arrays, index, dtype)
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 590, in _homogenize
val, index, dtype=dtype, copy=False, raise_cast_failure=False
File "venv/lib/python3.7/site-packages/pandas/core/construction.py", line 514, in sanitize_array
data = construct_1d_arraylike_from_scalar(data, len(index), dtype)
File "venv/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 1907, in construct_1d_arraylike_from_scalar
subarr = cls._from_sequence([value] * length, dtype=dtype)
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 336, in _from_sequence
return cls._from_sequence_not_strict(scalars, dtype=dtype, copy=copy)
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 362, in _from_sequence_not_strict
ambiguous=ambiguous,
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 2098, in sequence_to_dt64ns
data.view("i8"), tz, ambiguous=ambiguous
File "pandas/_libs/tslibs/tzconversion.pyx", line 284, in pandas._libs.tslibs.tzconversion.tz_localize_to_utc
pytz.exceptions.AmbiguousTimeError: Cannot infer dst time from 2021-11-07 01:00:00, try using the 'ambiguous' argument
I am aware that Daylight Saving Time is happening on November 7. But this data looks explicit to me, and fully localized; why is pandas forgetting its timezone information, and why is it refusing to put it in a DataFrame? Is there some kind of workaround here?
Update:
I remembered that I'd actually filed a bug about this a few months ago, but it was only of somewhat academic interest to us until this week when we're starting to see actual DST-transition dates in production: https://github.com/pandas-dev/pandas/issues/42505
It's ambiguous because this wall-clock time occurs twice on that date: once with DST and once without:
# Timestamp('2021-11-07 01:00:00-0500', tz='America/New_York')
>>> pd.to_datetime('2021-11-07 01:00:00') \
...     .tz_localize('America/New_York', ambiguous=False).dst()
datetime.timedelta(0)
# Timestamp('2021-11-07 01:00:00-0400', tz='America/New_York')
>>> pd.to_datetime('2021-11-07 01:00:00') \
...     .tz_localize('America/New_York', ambiguous=True).dst()
datetime.timedelta(seconds=3600)
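For a whole run of naive timestamps that crosses the transition in order, pandas can also resolve the ambiguity itself (a small sketch, not from the original answer):
import pandas as pd

# naive wall-clock times around the 2021-11-07 fall-back in New York;
# 01:00 appears twice, first as EDT and then as EST
idx = pd.DatetimeIndex(['2021-11-07 00:30', '2021-11-07 01:00',
                        '2021-11-07 01:00', '2021-11-07 01:30'])
# 'infer' uses the ordering of the repeated times to pick the right offset
idx.tz_localize('America/New_York', ambiguous='infer')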
Workaround
dt = pd.to_datetime('2021-11-07 01:00:00-0400')
df = pd.DataFrame({'datetime': dt,
                   'value': [3, 4, 5]})
df['datetime'] = df['datetime'].dt.tz_convert('America/New_York')
I accepted Corralien's answer, and I also wanted to show the workaround I finally decided to go with:
# Work around Pandas DST bug, see https://github.com/pandas-dev/pandas/issues/42505 and
# https://stackoverflow.com/questions/69846645/pandas-tells-me-non-ambiguous-time-is-ambiguous
max_len = max(len(x) if self.is_array(x) else 1 for x in data.values())
if max_len > 0 and self.is_scalar(data['datetime']):
data['datetime'] = [data['datetime']] * max_len
df = pd.DataFrame(data)
The is_array() and is_scalar() functions check whether x is an instance of any of set, list, tuple, np.ndarray, pd.Series, pd.Index.
It's not perfect, but hopefully the duct tape will hold until this can be fixed in Pandas.
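The same idea in standalone form (a sketch: broadcast any scalar tz-aware value to the column length yourself before constructing the frame, which avoids pandas' ambiguous-time code path):
import pandas as pd

dt = pd.to_datetime('2021-11-07 01:00:00-0400').tz_convert('America/New_York')
values = [3, 4, 5]
# padding the scalar out by hand keeps pandas from broadcasting it internally
df = pd.DataFrame({'datetime': [dt] * len(values), 'value': values})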

Why does Series.min(skipna=True) throw an error caused by an NA value?

I work with timestamps that have mixed DST offsets. Tried in pandas 1.0.0:
import numpy as np
import pandas as pd

s = pd.Series(
    [pd.Timestamp('2020-02-01 11:35:44+01'),
     np.nan,  # same result with pd.Timestamp('nat')
     pd.Timestamp('2019-04-13 12:10:20+02')])
Asking for min() or max() fails:
s.min(), s.max() # same result with s.min(skipna=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda\lib\site-packages\pandas\core\generic.py", line 11216, in stat_func
f, name, axis=axis, skipna=skipna, numeric_only=numeric_only
File "C:\Anaconda\lib\site-packages\pandas\core\series.py", line 3892, in _reduce
return op(delegate, skipna=skipna, **kwds)
File "C:\Anaconda\lib\site-packages\pandas\core\nanops.py", line 125, in f
result = alt(values, axis=axis, skipna=skipna, **kwds)
File "C:\Anaconda\lib\site-packages\pandas\core\nanops.py", line 837, in reduction
result = getattr(values, meth)(axis)
File "C:\Anaconda\lib\site-packages\numpy\core\_methods.py", line 34, in _amin
return umr_minimum(a, axis, None, out, keepdims, initial, where)
TypeError: '<=' not supported between instances of 'Timestamp' and 'float'
Workaround:
s.loc[s.notna()].min(), s.loc[s.notna()].max()
(Timestamp('2019-04-13 12:10:20+0200', tz='pytz.FixedOffset(120)'), Timestamp('2020-02-01 11:35:44+0100', tz='pytz.FixedOffset(60)'))
What I am missing here? Is it a bug?
I think the problem here is that pandas stores a Series with mixed timezones using object dtype, so max and min fail here.
s = pd.Series(
    [pd.Timestamp('2020-02-01 11:35:44+01'),
     np.nan,  # same result with pd.Timestamp('nat')
     pd.Timestamp('2019-04-13 12:10:20+02')])
print (s)
0 2020-02-01 11:35:44+01:00
1 NaN
2 2019-04-13 12:10:20+02:00
dtype: object
So if you convert to datetimes (without mixed timezones), it works well:
print (pd.to_datetime(s, utc=True))
0 2020-02-01 10:35:44+00:00
1 NaT
2 2019-04-13 10:10:20+00:00
dtype: datetime64[ns, UTC]
print (pd.to_datetime(s, utc=True).max())
2020-02-01 10:35:44+00:00
Another possible solution, if you need to keep the different timezones, is:
print (s.dropna().max())
2020-02-01 11:35:44+01:00
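To make the root cause concrete (a small sketch, not from the original answer): with a single timezone the Series keeps a datetime64 dtype, and NA values are then skipped as expected:
import pandas as pd

s2 = pd.Series([pd.Timestamp('2020-02-01 11:35:44', tz='UTC'),
                pd.NaT,
                pd.Timestamp('2019-04-13 12:10:20', tz='UTC')])
print(s2.dtype)  # datetime64[ns, UTC], not object
print(s2.min())  # 2019-04-13 12:10:20+00:00 -- NaT is skipped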

Linear Regression

My problem statement is:
The following data set shows the result of a recently conducted study on the correlation of the number of hours spent driving with the risk of developing acute back pain. Find the equation of the best-fit line for this data.
Data set is as below :
x y
10 95
9 80
2 10
15 50
10 45
16 98
11 38
16 93
Machine spec: Linux Ubuntu 18.10, 64-bit.
I am getting the following error:
python LR.py
Accuracy :
43.70948145101002
[6.01607946]
Enter the no of hours10
y :
0.095271*10.000000+5.063367
Risk Score : 6.016079463451905
Traceback (most recent call last):
File "LR.py", line 30, in <module>
plt.plot(X,y,'o')
File "/home/sumeet/anaconda3/lib/python3.6/site-packages/matplotlib/pyplot.py", line 3358, in plot
ret = ax.plot(*args, **kwargs)
File "/home/sumeet/anaconda3/lib/python3.6/site-packages/matplotlib/__init__.py", line 1855, in inner
return func(ax, *args, **kwargs)
File "/home/sumeet/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py", line 1527, in plot
for line in self._get_lines(*args, **kwargs):
File "/home/sumeet/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_base.py", line 406, in _grab_next_args
for seg in self._plot_args(this, kwargs):
File "/home/sumeet/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_base.py", line 383, in _plot_args
x, y = self._xy_from_xy(x, y)
File "/home/sumeet/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_base.py", line 242, in _xy_from_xy
"have shapes {} and {}".format(x.shape, y.shape))
ValueError: x and y must have same first dimension, but have shapes (8, 1) and (1,)
The code is as below:
import matplotlib.pyplot as plt
import pandas as pd
# Read Dataset
dataset=pd.read_csv("hours.csv")
X=dataset.iloc[:,:-1].values
y=dataset.iloc[:,1].values
# Import the Linear Regression and Create object of it
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(X,y)
Accuracy=regressor.score(X, y)*100
print("Accuracy :")
print(Accuracy)
# Predict the value using Regressor Object
y_pred=regressor.predict([[10]])
print(y_pred)
# Take user input
hours=int(input('Enter the no of hours'))
#calculate the value of y
eq=regressor.coef_*hours+regressor.intercept_
y='%f*%f+%f' %(regressor.coef_,hours,regressor.intercept_)
print("y :")
print(y)
print("Risk Score : ", eq[0])
plt.plot(X,y,'o')
plt.plot(X,regressor.predict(X));
plt.show()
At the beginning of your code, you define the y that you probably want to plot:
y=dataset.iloc[:,1].values
but further down you re-define (and thus overwrite) it as
y='%f*%f+%f' %(regressor.coef_,hours,regressor.intercept_)
which causes the error, as this last y is a string, not an array with 8 elements like X (and like your initial y).
Change it to something else, e.g. Y, on the relevant lines at the end:
Y='%f*%f+%f' %(regressor.coef_,hours,regressor.intercept_)
print("Y :")
print(Y)
so as to keep your y as initially defined, and you should be fine.
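Putting it together, a minimal sketch of the corrected tail of the script (coef_[0] pulls the scalar out of the one-element coefficient array):
# keep y as the 8 observed values; Y holds only the printable equation
eq = regressor.coef_[0] * hours + regressor.intercept_
Y = '%f*%f+%f' % (regressor.coef_[0], hours, regressor.intercept_)
print("Y :")
print(Y)
print("Risk Score : ", eq)
plt.plot(X, y, 'o')                # data points: first dimensions now match
plt.plot(X, regressor.predict(X))  # fitted line
plt.show()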

How to merge pandas series off of a column of dates

I have two series:
date DEF
0 1/31/1986 0.0140
1 2/28/1986 0.0150
2 3/31/1986 0.0160
3 4/30/1986 0.0120
4 5/30/1986 0.0120
date PE
0 1/31/1900 12.71
1 2/28/1900 12.94
2 3/31/1900 13.04
3 4/30/1900 13.21
4 5/31/1900 12.58
I need to iterate over several DataFrames of this nature and combine them all into one big DataFrame, where only values that align with the dates get added. My function so far:
def get_combined_vars(start, end):
    rows = pd.date_range(start=start, end=end, freq='BM')
    df1 = pd.DataFrame(rows, columns=['date'])
    for key in variables.keys():
        check = variables[key][0]
        if check == 1:
            df2 = pd.DataFrame(variables[key][1]())
            print(df2.head(5))
            # merge_asof returns a new frame, so the result must be assigned
            # back; otherwise df1 is returned unchanged
            df1 = pd.merge_asof(df1.assign(datekey=pd.to_datetime(df1['date'].dt.strftime('%m-%d') + '-1900')),
                                df2,
                                right_on='date',
                                left_on='datekey',
                                direction='nearest',
                                suffixes=('_x', ''))
    print(df1.head(10))
    return df1
I can't seem to find the right command to merge DataFrames based off of a column.
Desired output:
date DEF PE
0 1/31/1900 0.0140 12.71
1 2/28/1900 0.0150 12.94
2 3/31/1900 0.0160 13.04
3 4/30/1900 0.0120 13.21
4 5/31/1900 0.0120 12.58
Merge_asof issue:
runfile('H:/Market Timing/Files/market_timing.py', wdir='H:/Market Timing/Files')
date BY
0 1/31/1963 0.98
1 2/28/1963 1
2 3/29/1963 1.01
3 4/30/1963 1.01
4 5/31/1963 1.01
Traceback (most recent call last):
File "C:\Developer\Anaconda\lib\site-packages\pandas\core\tools\datetimes.py", line 303, in _convert_listlike
values, tz = tslib.datetime_to_datetime64(arg)
File "pandas\_libs\tslib.pyx", line 1884, in pandas._libs.tslib.datetime_to_datetime64
TypeError: Unrecognized value type: <class 'str'>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Developer\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 714, in runfile
execfile(filename, namespace)
File "C:\Developer\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 89, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "H:/Market Timing/Files/market_timing.py", line 88, in <module>
print(get_combined_vars('1/31/1995', '1/31/2005').head(10))
File "H:/Market Timing/Files/market_timing.py", line 43, in get_combined_vars
pd.merge_asof(df1.assign(datekey=pd.to_datetime(df1['date'].dt.strftime('%m-%d') + '-1900')),
File "C:\Developer\Anaconda\lib\site-packages\pandas\core\tools\datetimes.py", line 373, in to_datetime
values = _convert_listlike(arg._values, True, format)
File "C:\Developer\Anaconda\lib\site-packages\pandas\core\tools\datetimes.py", line 306, in _convert_listlike
raise e
File "C:\Developer\Anaconda\lib\site-packages\pandas\core\tools\datetimes.py", line 294, in _convert_listlike
require_iso8601=require_iso8601
File "pandas\_libs\tslib.pyx", line 2156, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslib.pyx", line 2379, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslib.pyx", line 2373, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslibs\parsing.pyx", line 99, in pandas._libs.tslibs.parsing.parse_datetime_string
File "C:\Developer\Anaconda\lib\site-packages\dateutil\parser.py", line 1182, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "C:\Developer\Anaconda\lib\site-packages\dateutil\parser.py", line 581, in parse
ret = default.replace(**repl)
ValueError: day is out of range for month
I believe that on the third pass of combining these two DataFrames, it runs into this error: ValueError: day is out of range for month.
Can a buffer be added for discrepancies like this in the data?
You can use pd.merge_asof, however, first you'll need to get your dates on a common year.
pd.merge_asof(df1.assign(datekey=pd.to_datetime(df1['date'].dt.strftime('%m-%d') + '-1900')),
              df2,
              right_on='date',
              left_on='datekey',
              direction='nearest',
              suffixes=('_x', ''))[['date', 'DEF', 'PE']]
Output:
date DEF PE
0 1900-01-31 0.014 12.71
1 1900-02-28 0.015 12.94
2 1900-03-31 0.016 13.04
3 1900-04-30 0.012 13.21
4 1900-05-31 0.012 12.58
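As for the ValueError: day is out of range for month on a later pass: a plausible cause (an assumption, not confirmed in the thread) is a leap day in the business-month-end range, since a date like 1996-02-29 re-based onto 1900, which is not a leap year, cannot be parsed. A hedged sketch of a guard:
# map Feb 29 to Feb 28 before re-basing onto 1900 (not a leap year)
md = df1['date'].dt.strftime('%m-%d').replace('02-29', '02-28')
df1 = df1.assign(datekey=pd.to_datetime(md + '-1900'))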
You would use pandas.merge (or DataFrame.merge as the method form) to do this:
import pandas as pd
pd.merge(df1, df2, on="date")
...But as Scott Boston mentioned in his comment, the data doesn't align so you won't get your expected results.
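To see why the data doesn't align (a toy illustration, not the asker's frames): an exact merge on date returns zero rows here because the DEF dates are in 1986 while the PE dates are in 1900:
import pandas as pd

df1 = pd.DataFrame({'date': ['1/31/1986', '2/28/1986'], 'DEF': [0.014, 0.015]})
df2 = pd.DataFrame({'date': ['1/31/1900', '2/28/1900'], 'PE': [12.71, 12.94]})
print(pd.merge(df1, df2, on='date'))  # empty: the years never match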

df.Change[-1] producing errors.

I'm trying to slice the last value of the series Change from my dataframe df.
The dataframe looks something like this
Change
0          1.000000
1          0.917727
2          1.000000
3          0.914773
4          0.933182
5          0.936136
6          0.957500
...
14466949   1.998392
14466950   2.002413
14466951   1.998392
14466952   1.974266
14466953   1.966224
When I input the following code
df.Change[0]
df.Change[100]
df.Change[100000]
I get output as expected, but when I input
df.Change[-1]
I get the following error:
Traceback (most recent call last):
File "<pyshell#188>", line 1, in <module>
df.Change[-1]
File "C:\Python27\lib\site-packages\pandas\core\series.py", line 601, in __getitem__
result = self.index.get_value(self, key)
File "C:\Python27\lib\site-packages\pandas\indexes\base.py", line 2139, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas/index.pyx", line 105, in pandas.index.IndexEngine.get_value (pandas\index.c:3338)
File "pandas/index.pyx", line 113, in pandas.index.IndexEngine.get_value (pandas\index.c:3041)
File "pandas/index.pyx", line 151, in pandas.index.IndexEngine.get_loc (pandas\index.c:3898)
KeyError: -1
Pretty much any negative number I use as an index results in an error, and I'm not exactly sure why.
Thanks.
There are several ways to do this. What's happening is that pandas has no issue with df.Change[100] because 100 is a label in its index; -1 is not. Your index just happens to coincide with ordinal positions. To index by ordinal position explicitly, use iloc.
df.Change.iloc[-1]
or
df.Change.values[-1]
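A quick illustration of the label-versus-position distinction (a toy frame, not the asker's data):
import pandas as pd

df = pd.DataFrame({'Change': [1.00, 0.92, 1.10]})
print(df.Change[2])        # works: 2 is a label in the default RangeIndex
print(df.Change.iloc[-1])  # 1.1, the last element by position
# df.Change[-1]            # KeyError: -1 is not a label in this index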