How to set up the cross_validation() function from Prophet? - pandas

I'm using Prophet (Python) to forecast and analyse time series in bulk. That means my time series share the same properties, but they are not exactly the same. They all run from 2016-01-01 to 2020-07-01.
I would like to cross-validate my results using the first 3 years of data, and my forecast goal is 15 days only.
What is the best configuration to test my fit using the first 3 years, aiming for a 15-day forecast?
My naive attempt is the one below:
df_cv = cross_validation(mts, initial="1095 days", period='31 days', horizon = '15 days')
I'm not sure what to pass for the 'period' and 'horizon' parameters.

As mentioned in Prophet's documentation:
We specify the forecast horizon (horizon), and then optionally the size of the initial training period (initial) and the spacing between cutoff dates (period).
Thus, a forecast is made for every observed point between cutoff and cutoff + horizon.
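So 'horizon' is the length of each forecast (15 days in your case), 'initial' is the minimum amount of history used for training (1095 days for three years), and 'period' only controls how far apart the cutoff dates are, i.e. how many simulated forecasts you get. 'period' and 'horizon' do not need to add up to anything: horizon='15 days' works with any 'period', and a smaller 'period' simply gives more cutoffs at a higher computational cost.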
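To see how the three parameters interact, here is a minimal sketch that lists the cutoff dates implied by the settings in the question. It does not call Prophet; it assumes that cutoffs are laid out backwards from the end of the history in steps of 'period', and that a cutoff is only kept if at least 'initial' of data precedes it. The function name sketch_cutoffs is hypothetical.

```python
import pandas as pd

def sketch_cutoffs(first, last, initial, period, horizon):
    """Walk back from the last possible cutoff in steps of `period` (assumed scheme)."""
    earliest = first + initial   # a cutoff needs at least `initial` of history before it
    cutoff = last - horizon      # the final forecast window must fit inside the data
    cutoffs = []
    while cutoff >= earliest:
        cutoffs.append(cutoff)
        cutoff -= period
    return cutoffs[::-1]         # oldest cutoff first

cutoffs = sketch_cutoffs(
    first=pd.Timestamp("2016-01-01"),
    last=pd.Timestamp("2020-07-01"),
    initial=pd.Timedelta("1095 days"),
    period=pd.Timedelta("31 days"),
    horizon=pd.Timedelta("15 days"),
)
print(len(cutoffs), cutoffs[0].date(), cutoffs[-1].date())
```

Each cutoff corresponds to one simulated 15-day forecast, so changing 'period' changes the number of forecasts evaluated, not their length.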

Related

Pandas DateTimeSlicing for specific months per year

I have read a lot about pandas and datetime slicing, but I haven't found a solution for my problem yet. I hope you can give me some good advice!
I have a data frame with a DatetimeIndex and, for example, a single column of floats. The time series spans about 60 years.
For example:
idx = pd.Series(pd.date_range("2016-11-1", freq="M", periods=48))
dft = pd.DataFrame(np.random.randn(48,1),columns=["NW"], index=idx)
I want to aggregate the column "NW" as sum() per month. I have to solve two problems.
The year begins in November and ends in October.
I have two periods per 12 months to analyse:
a) from November to End of April in the following year and
b) from May to End of October in the same year
For example: "2019-11-1":"2020-4-30" and "2020-05-01":"2020-10-31"
I think I could write a function, but I wonder if there is an easier way to solve these problems with pandas methods.
Do you have any tips?
Best regards Tommi.
Here is some additional information:
The real data are daily observations. I want to show a scatter plot for a time series with only the sum() for every month from November to April along the timeline (60 years until now), and the same for the values from May to October.
This is my solution so far. It is probably not the shortest way, but it works fine.
from datetime import date

d_precipitation_winter = {}
# for each year except the current (last) one
for year in dft.index.year.unique()[:-1]:
    # start and end dates that mark the winter months
    start_date = date(year, 11, 1)
    end_date = date(year + 1, 4, 30)
    dft_WH = dft.loc[start_date:end_date, :].sum()
    d_precipitation_winter[year] = dft_WH
df_precipitation_winter = pd.DataFrame(data=d_precipitation_winter)
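A possible loop-free alternative is to label every row with a season and a "season year" and let groupby do the aggregation. This is only a sketch under assumptions: the daily sample data is made up, and the names season, season_year and seasonal_sums are hypothetical. The season year is taken to be the year in which the November falls, so January to October count towards the previous November's year.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data standing in for the real observations (value 1.0 per day)
idx = pd.date_range("2016-11-01", "2020-10-31", freq="D")
dft = pd.DataFrame({"NW": np.ones(len(idx))}, index=idx)

month = dft.index.month

# Months 11, 12, 1..4 form the winter half; months 5..10 form the summer half
season = pd.Series(
    np.where((month >= 11) | (month <= 4), "winter", "summer"),
    index=dft.index, name="season",
)

# November and December start a new season year; January..October belong to the previous one
season_year = pd.Series(
    np.where(month >= 11, dft.index.year, dft.index.year - 1),
    index=dft.index, name="season_year",
)

# One row per season year, one column per half-year
seasonal_sums = dft.groupby([season_year, season])["NW"].sum().unstack("season")
print(seasonal_sums)
```

With one observation of 1.0 per day, each cell is simply the number of days in that half-year, which makes the grouping easy to check by hand.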

How to add months to a date and return the result in date format in Robot Framework

I am trying to add a number of months to a date.
My example: 27/01/2020 + 20 months
Test Date
    ${PAYMENTDATE}    Set Variable    27/01/2020
    ${PAYMENTDATE}    Convert Date    ${PAYMENTDATE}    date_format=%d/%m/%Y    result_format=%Y-%m-%d
    ${DATE}    Add Time To Date    ${PAYMENTDATE}    20 months    result_format=%d/%m/%Y
    log to console    ${DATE}
But it does not work. Could anyone help, please?
In your code you are providing the time value in months, which is not valid. Unfortunately, adding months to a date is not possible using the Robot Framework DateTime library. From the DateTime documentation:
time: Time that is added in one of the supported time formats.
You need to provide the time value in one of the supported formats, e.g. in days.
20 months is approximately 600 days, and the code below works without problems.
${DATE}    Add Time To Date    ${PAYMENTDATE}    600 days    result_format=%d/%m/%Y
If you need the exact number of days for 20 months, then calculate it starting from the date you are adding the 20 months to, and use that value in the code above instead of 600 days. Answers on how to calculate the exact number of days using Python are easy to find.
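As an illustration of that calculation, here is a minimal Python sketch using only the standard library. The helper add_months is hypothetical and not part of Robot Framework; it assumes that when the target month is shorter than the start day, the result should fall on the last day of that month.

```python
import calendar
from datetime import date

def add_months(start: date, months: int) -> date:
    """Return `start` shifted by `months` calendar months (day clamped to month end)."""
    total = start.year * 12 + (start.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]  # number of days in the target month
    return date(year, month, min(start.day, last_day))

payment_date = date(2020, 1, 27)
target = add_months(payment_date, 20)
days_to_add = (target - payment_date).days  # value to use in place of "600 days"
print(target.strftime("%d/%m/%Y"), days_to_add)
```

The resulting day count could then be passed to Add Time To Date in place of the 600-day approximation.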

What is the minimum sample size required for time series forecasting with quarterly frequency?

I have a quarterly time series with 2-3 years of data (8-12 data points in total, varying from case to case). I would like to understand the minimum sample size required to forecast a time series at quarterly intervals.
I have tried forecasting with 2-3 years of quarterly data, but I would like to know the recommended sample size for quarterly frequency.

Simple Averaging Algorithm is Slightly Off. Why? Active Record/PostgreSQL issue?

In my Rails app, I have two custom Rake Tasks running every 30 minutes. Task A scrapes hourly prices from the internet and saves them to a database as HourlyPrice. Task B goes into the db, takes hourly prices from each day for the last seven days, and averages them to create a new DailyAveragePrice record in a separate DB Table.
However, when running Task B, the last day's (of the seven) average price is incorrect.
After fiddling with the hourly prices of that day in an Excel spreadsheet, I see that the average price Task B is generating is the result of taking only the last three hours and averaging them.
Task B is mostly done with this single query:
averages = HourlyPrice.where('date >= ?', 7.days.ago).average(:price, :group => "DATE_TRUNC('day', date - INTERVAL '1 hour')")
I can't figure out why this is happening.
Clues
1. HourlyPrice has two attributes (datetime, price). Each HourlyPrice actually represents a price for the previous hour. So, source data lists a 24:00:00 price for each day which PostgreSQL does not want to import as is into a datetime column. Instead, it converts all 24:00:00 prices to 00:00:00 of the next day. To make up for this, I've tried to subtract an hour interval, as you can see in the query. Is this causing the problem?
2. My ActiveRecord's time zone is currently set to 'Mountain Time (US & Canada)'. That is where the price exchange is located. I have not adjusted my PostgreSQL DB's timezone, and I believe it defaults to UTC. When running Task B, I noticed that it was 9:20PM UTC, leaving three hours left in the UTC day, which might explain the averaging of only three HourlyPrices of the last of the seven days. I'll try running Task B again in the next hour to see if it will average only two hours. Update to come... Is this timezone conflict causing a problem, or is what I am doing insulated from timezones since I have my own date columns?
UPDATE - Problem identified, but how to fix?
Clue #2 is correct. It is a timezone issue. I just ran Task B again (an hour later, with 2 hours left until UTC day change), and it only averages two HourlyPrices now for the last of the seven days.
How can I fix my query above to average ONLY if there are 24 HourlyPrice records available?
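One way to express "average only complete days" is to bucket the records by the exchange's local calendar day and keep only the buckets that contain 24 rows. The sketch below shows the idea in pandas rather than ActiveRecord; the sample data, the price column and the America/Denver zone are assumptions for illustration. In SQL the equivalent filter would typically be a HAVING COUNT(*) = 24 clause on the grouped query.

```python
import pandas as pd

# Hypothetical hourly prices stored with UTC timestamps; each row prices the previous hour
stamps = pd.date_range("2021-01-04 08:00", "2021-01-07 03:00", freq="h", tz="UTC")
prices = pd.DataFrame({"price": range(len(stamps))}, index=stamps)

# Shift back one hour (the record describes the previous hour), then convert to
# the exchange's local time before deciding which calendar day a row belongs to
local = (prices.index - pd.Timedelta(hours=1)).tz_convert("America/Denver")
day = pd.Series(local.date, index=prices.index, name="day")

# Average per local day, and count how many hourly records each day has
grouped = prices.groupby(day)["price"].agg(["mean", "count"])

# Keep only days with all 24 hourly records; the partial last day drops out
complete_days = grouped[grouped["count"] == 24]
print(complete_days)
```

Days on which daylight saving time starts or ends have 23 or 25 local hours, so a strict count of 24 would need special handling there.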

Battling Datediff in SQL

I am writing a little query in SQL and am butting heads with an issue that it seems like someone must have run into before. I am trying to find the number of months between two dates. I am using an expression like ...
DATEDIFF(m,{firstdate},{seconddate})
However, I notice that this function counts the number of times the date crosses a month boundary. For example:
DATEDIFF(m,'3/31/2011','4/1/2011') will yield 1
DATEDIFF(m,'4/1/2011','4/30/2011') will yield 0
DATEDIFF(m,'3/1/2011','4/30/2011') will yield 1
Does anyone know how to find the months between two dates based on the time elapsed rather than on the number of month boundaries crossed?
If you want to find some notional number of months, why not find the difference in days, then divide by 30 (cast to FLOAT as required)? Or 30.5-ish perhaps, depending on how you want to handle the variable month length throughout the year. But perhaps that's not a factor in your particular case.
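To make the difference concrete, here is a small Python sketch (not SQL) that contrasts boundary counting with the days-divided-by-30 approximation for the three date pairs from the question. Both function names are hypothetical; boundary_months only imitates the month-boundary behaviour described above.

```python
from datetime import date

def boundary_months(first: date, second: date) -> int:
    """Count month boundaries crossed, ignoring the day of the month."""
    return (second.year - first.year) * 12 + (second.month - first.month)

def approx_months(first: date, second: date, days_per_month: float = 30.0) -> float:
    """Notional months: elapsed days divided by an assumed month length."""
    return (second - first).days / days_per_month

pairs = [
    (date(2011, 3, 31), date(2011, 4, 1)),
    (date(2011, 4, 1), date(2011, 4, 30)),
    (date(2011, 3, 1), date(2011, 4, 30)),
]
for first, second in pairs:
    print(first, second, boundary_months(first, second), round(approx_months(first, second), 2))
```

The divisor is a free choice: 30 is simple, while a value around 30.4 is closer to the average month length over a year.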
The following statements have the same startdate and the same endate. Those dates are adjacent and differ in time by .0000001 second. The difference between the startdate and endate in each statement crosses one calendar or time boundary of its datepart. Each statement returns 1. ...
SELECT DATEDIFF(month, '2005-12-31 23:59:59.9999999'
, '2006-01-01 00:00:00.0000000'); ....
(from DATEDIFF, section "datepart Boundaries"). If that behaviour does not suit you, you probably need to use days as the unit, as proposed by martin clayton.
DATEDIFF(m,{firstdate},ISNULL({seconddate},GETDATE())) - CASE
    WHEN DATEPART(d,{firstdate}) >= DATEPART(d,ISNULL({seconddate},GETDATE()))
    THEN 1
    ELSE 0
END
DATEDIFF is like this by design. When evaluating a particular time measurement (like months, or days, etc.), it considers only that measurement and higher values -- ignoring smaller ones. You'll run into this behavior with any time measurement. For example, if you used DATEDIFF to calculate days, and had one date a few seconds before midnight, and another date a few seconds after midnight, you'd get a "1" day difference, even though the two dates were only a few seconds apart.
DATEDIFF is meant to give a rough answer to questions, like this:
Question: how many years old are you?
Answer: some integer. You don't say "I'm 59 years, 4 months, 17 days, 5 hours, 35 minutes and 27 seconds old". You just say "I'm 59 years old". That's DATEDIFF's approach too.
If you want an answer that's tailored to some contextual meaning (like your son who says "I'm not 8! I'm 8 and 3-quarters!" or "I'm almost 9!"), then you should look at the next-smallest measurement and approximate with it. So if it's months you're after, then do a DATEDIFF on days or hours instead, and try to approximate months however it seems most relevant to your situation (maybe you want answers like 1-1/2 months, or 1.2 months, etc.) using CASE / IF-THEN kinds of logic.