I have the last 5 years of monthly data, and I am using it to create a forecasting model with fbprophet. The last 5 months of my data are as follows:
data1['ds'].tail()
Out[86]: 55 2019-01-08
56 2019-01-09
57 2019-01-10
58 2019-01-11
59 2019-01-12
I have created the model on this and made a future prediction dataframe.
model = Prophet(
    interval_width=0.80,
    growth='linear',
    daily_seasonality=False,
    weekly_seasonality=False,
    yearly_seasonality=True,
    seasonality_mode='additive'
)
# fit the model to the data
model.fit(data1)
future_data = model.make_future_dataframe(periods=4, freq='m', include_history=True)
After December 2019, I need the first four months of the next year. But it is adding the next 4 months within the same year, 2019.
future_data.tail()
ds
59 2019-01-12
60 2019-01-31
61 2019-02-28
62 2019-03-31
63 2019-04-30
How do I get the first 4 months of the next year in the future dataframe? Is there a specific parameter to adjust the year?
The issue is the date format: 2019-01-12 (December 2019, as per your question) is in the format "%Y-%d-%m". Prophet therefore reads it as 12 January 2019 and creates data with month-end frequency (specified by 'm') for the next 4 periods starting from January.
Just for reference this is how the future dataframe is created by Prophet:
dates = pd.date_range(
    start=last_date,
    periods=periods + 1,  # An extra in case we include start
    freq=freq)
dates = dates[dates > last_date]  # Drop start if equals last_date
dates = dates[:periods]  # Return correct number of periods
Hence, it infers the date format and extrapolates in the future dataframe.
Solution: Change the date format in training data to "%Y-%m-%d"
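A minimal sketch of that fix, assuming `ds` is stored as strings in "%Y-%d-%m" order (the sample values and `y` column here are hypothetical, standing in for the question's training data):

```python
import pandas as pd

# hypothetical tail of the training data: in "%Y-%d-%m" order these rows
# are October, November and December 2019
data1 = pd.DataFrame({'ds': ['2019-01-10', '2019-01-11', '2019-01-12'],
                      'y': [10.0, 12.0, 11.0]})

# parse with an explicit format so 2019-01-12 is read as December 2019
data1['ds'] = pd.to_datetime(data1['ds'], format='%Y-%d-%m')

# month-end extrapolation from the last training date now rolls into 2020
future = pd.date_range(start=data1['ds'].iloc[-1], periods=4, freq='M')
print(future)
```

Once `ds` parses to December 2019, Prophet's own `make_future_dataframe` extends into 2020 the same way.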
I stumbled here while searching for the appropriate frequency string for minutes.
As per the docs, the datetime needs to be in YYYY-MM-DD format:
The input to Prophet is always a dataframe with two columns: ds and y. The ds (datestamp) column should be of a format expected by Pandas, ideally YYYY-MM-DD for a date or YYYY-MM-DD HH:MM:SS for a timestamp. The y column must be numeric, and represents the measurement we wish to forecast.
2019-01-12 in YYYY-MM-DD is 2019-12-01; using this:
>>> dates = pd.date_range(start='2019-12-01',periods=4 + 1,freq='M')
>>> dates
DatetimeIndex(['2019-12-31', '2020-01-31', '2020-02-29', '2020-03-31',
'2020-04-30'],
dtype='datetime64[ns]', freq='M')
Other frequency strings are listed here (they are not given explicitly for Python in the Prophet docs):
https://pandas.pydata.org/docs/reference/api/pandas.tseries.frequencies.to_offset.html
>>> dates = pd.date_range(start='2022-03-17 11:40:00', periods=10 + 1, freq='min')
>>> dates
DatetimeIndex(['2022-03-17 11:40:00', '2022-03-17 11:41:00',
               '2022-03-17 11:42:00', '2022-03-17 11:43:00',
               ...],
              dtype='datetime64[ns]', freq='T')
Related
I have imported a time series and resampled it to monthly time steps; now I would like to select, from every year, only the March, April, and May months (months 3, 4, and 5).
Unfortunately this is not exactly reproducible data since it comes from a particular text file, but is there a way to isolate just months 3, 4, and 5 of this time series?
# loading textfile
mjo = np.loadtxt('.../omi.1x.txt')
# setting up dates
dates = pd.date_range('1979-01', periods=mjo.shape[0], freq='D')
#resampling one of the columns to monthly data
MJO_amp = pd.Series(mjo[:,6], index=dates)
MJO_amp_month = MJO_amp.resample("M").mean()[:-27] #match to precipitation time series (ends feb 2019)
MJO_amp_month_normed = (MJO_amp_month - MJO_amp_month.mean())/MJO_amp_month.std()
MJO_amp_month_normed
1979-01-31 0.032398
1979-02-28 -0.718921
1979-03-31 0.999467
1979-04-30 -0.790618
1979-05-31 1.113730
...
2018-10-31 0.198834
2018-11-30 0.221942
2018-12-31 1.804934
2019-01-31 1.359485
2019-02-28 1.076308
Freq: M, Length: 482, dtype: float64
print(MJO_amp_month_normed['2018-10'])
2018-10-31 0.198834
Freq: M, dtype: float64
I was thinking something along the lines of this:
def is_amj(month):
    return (month >= 4) & (month <= 6)

seasonal_data = MJO_amp_month_normed.sel(time=is_amj(MJO_amp_month_normed))
but I think my issue is the textfile isn't exactly in pandas format and doesn't have column titles...
You can use the month attribute of pd.DatetimeIndex with isin like this:
df[df.index.month.isin([3,4,5])]
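A small self-contained sketch of that filter (the series here is hypothetical stand-in data, not the OMI file from the question):

```python
import pandas as pd

# hypothetical monthly series standing in for MJO_amp_month_normed
s = pd.Series(range(1, 13),
              index=pd.date_range('2018-01-31', periods=12, freq='M'))

# keep only March, April and May of every year in the index
mam = s[s.index.month.isin([3, 4, 5])]
print(mam)
```

This works on a Series or DataFrame alike, as long as the index is a DatetimeIndex.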
I'm trying to generate a pandas dataframe with a fixed-interval time-series index. As input parameters I need to provide the start and end dates. The challenge is that the generated index starts either from the month start with freq='3MS' or from the month end with freq='3M'. It cannot be defined in a number of days, because the whole year needs to have exactly 4 periods and the first date needs to be the defined start date.
The expected output should be in this case:
2020-10-05
2021-01-05
2021-04-05
2021-10-05
Any ideas appreciated.
interpolated = pd.DataFrame(index=pd.date_range('2020-10-05', '2045-10-05', freq='3M'), columns=['dummy'])
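One way to anchor the quarterly step to the start date's day of month is to pass a `pd.DateOffset` as the frequency instead of a month-end alias; a minimal sketch (the short end date is just for illustration):

```python
import pandas as pd

# step by exactly 3 calendar months, keeping the day-of-month of the start date
idx = pd.date_range('2020-10-05', '2021-10-05', freq=pd.DateOffset(months=3))
print(idx)
```

This yields exactly 4 periods per year, each on the 5th, starting from the defined start date.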
I have a DF (df2) that is 8784 x 13, with a "DATE" column in yyyy-mm-dd format and a "TIME" column in hours, as shown below, and I need to calculate daily and monthly averages for the year 2016:
DATE TIME BAFFIN BAY GATUN II GATUN I KLONDIKE IIIG \
8759 2016-01-01 0000 8.112838 3.949518 3.291540 7.629178
8760 2016-01-01 0100 7.977169 4.028678 3.097562 7.477159
KLONDIKE II LAGOA II LAGOA I PENASCAL II PENASCAL I SABINA \
8759 7.095450 NaN NaN 8.250527 8.911508 3.835205
8760 7.362562 NaN NaN 7.877099 7.858908 3.766714
SIERRA QUEMADA
8759 3.405049
8760 4.386598
I have tried converting the 'DATE' column to datetime to use groupby, but I'm not sure how to do this. I tried the following, but it is not grouping my data as expected for daily or monthly averages when I check the calculation in Excel:
davg_df2 = df2.groupby(by=df2['DATE'].dt.date).mean()
davg_df2m = df2.groupby(by=df2['DATE'].dt.month).mean()
Thank you; I'm still learning Python and trying to understand working with dates and different data types!
Try this:
df2['DATE'] = pd.to_datetime(df2['DATE'], format='%Y-%m-%d')
# monthly
davg_df2m = df2.groupby(pd.Grouper(freq='M', key='DATE')).mean()
# daily
davg_df2 = df2.groupby(pd.Grouper(freq='D', key='DATE')).mean()
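A runnable sketch of the `pd.Grouper` approach on a tiny hypothetical frame (one station column shortened from the question's 13):

```python
import pandas as pd

# hypothetical slice of df2 with two rows on Jan 1 and one on Jan 2
df2 = pd.DataFrame({
    'DATE': ['2016-01-01', '2016-01-01', '2016-01-02'],
    'BAFFIN BAY': [8.0, 7.0, 4.0],
})
df2['DATE'] = pd.to_datetime(df2['DATE'], format='%Y-%m-%d')

# daily means: one row per calendar day
daily = df2.groupby(pd.Grouper(freq='D', key='DATE')).mean()
print(daily)
```

Swapping `freq='D'` for `freq='M'` gives the monthly means the same way.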
# first convert the DATE column to datetime data type:
df2['DATE'] = pd.to_datetime(df2['DATE'])
# create new columns for month and day like so:
df2['month'] = df2['DATE'].dt.month
df2['day'] = df2['DATE'].dt.day
# then group by month and day and take the mean like so:
davg_df2m = df2.groupby('month').mean()
davg_df2 = df2.groupby('day').mean()
Please excuse obvious errors - still in the learning process.
I am trying to do a simple time-series plot of my data, which has a frequency of 15 minutes. The idea is to plot monthly means, starting by resampling the data every hour and including only those hourly means that have at least 1 observation in the interval. There are subsequent conditions for the daily and monthly means.
This would be relatively simple if this error did not crop up: "None of [DatetimeIndex(['2016-01-01 05:00:00', '2016-01-01 05:15:00',\n....2016-12-31 16:15:00'],\n dtype='datetime64[ns]', length=103458, freq=None)] are in the [columns]"
This is my code:
#Original dataframe
Date value
0 1/1/2016 0:00 405.22
1 1/1/2016 0:15 418.56
Date object
value object
dtype: object
# Convert the 'Date' column to datetime and the 'value' column to numeric/float.
df.Date = pd.to_datetime(df.Date,errors='coerce')
year=df.Date.dt.year
df['Year'] = df['Date'].map(lambda x: x.year )
df.value = pd.to_numeric(df.value,errors='coerce' )
Date datetime64[ns]
value float64
Year int64
dtype: object
Date value Year
0 2016-01-01 00:00:00 405.22 2016
1 2016-01-01 00:15:00 418.56 2016
df=df.set_index(Date)
diurnal1 = df[df['Date']].resample('h').mean().count()>=2
**(line of error)**
diurnal_mean_1 = diurnal1.mean()[diurnal1.count() >= 1]
(the code follows)
Any help in solving the error will be appreciated.
I think you want df = df.set_index('Date') ('Date' is a string, so it needs quotes). Also, I would move the conversions into the constructor step if possible once you get it working.
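A minimal end-to-end sketch with that fix applied (the two sample rows are taken from the question; the full hourly-count condition is omitted):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['1/1/2016 0:00', '1/1/2016 0:15'],
                   'value': ['405.22', '418.56']})
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df['value'] = pd.to_numeric(df['value'], errors='coerce')

# index by the 'Date' column (note the quotes), then resample hourly
df = df.set_index('Date')
hourly = df['value'].resample('h').mean()
print(hourly)
```

With 'Date' as the index, `resample('h')` works directly on the frame or column, so the failing `df[df['Date']]` selection is no longer needed.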
I have a file containing dates from June 2015 + 365 days. I am using it as a lookup table because there are custom business dates (only certain holidays are observed, and there are some no-work dates for internal reasons). Using custom business date offsets was just too slow with 3.5 million records.
initial_date   day_1        day_2        day_3        ...   day_365
2015-06-01     2015-06-02   2015-06-03   2015-06-04
2015-06-02     2015-06-03   2015-06-04   2015-06-05
The idea is to 'tag' each row in the data based on the number of (custom) business dates since specific entries. Is there a better way to do this?
For example, if a new entry happens on 2016-06-28 then this is labeled as 'initial_date'. day_1 is tomorrow, day_2 is the next day, day_3 is Friday, and day_4 is next Monday.
My naive way of doing this is to create a loop which basically does this:
df.day_1_label = np.where(df.date == df.day_1, 'Day 1', '')
df.day_2_label = np.where(df.date == df.day_2, 'Day 2', '')
df.day_label = (df.day_1_label + df.day_2_label + ...).replace('', regex=True, inplace=True)
This would result in one label per row which I could then aggregate or plot. Eventually this would be used for forecasting. initial_date + the subset of customers from previous dates = total customers
Also, depending on what events happen, a subset of the customers would have a certain event occur in the future. I want to know on what business date from the initial_date this happens on, on average. So if we have 100 customers today, a certain percent will have an event next July 15th or whatever it might be.
Edit- pastebin with some sample data:
http://pastebin.com/ZuE1Q2KJ
So the output I am looking for is the day_label. This basically checks if the date == each date from day_0 - day_n. Is there a better way to fill that in? I am just trying to match the date of each row with a value in one of the day_ columns.
I think this can be done more efficiently if your data looks the way I think it does.
Say you have a calendar of dates (June 2015 + 365 days) and a data frame as for example:
import numpy as np
import pandas as pd

cal = ['2015-06-01', '2015-06-02', '2015-06-03', '2015-06-04', '2015-06-08',
       '2015-06-09']
df = pd.DataFrame({'date': ['2015-06-04', '2015-06-09', '2015-06-09'],
                   'initial_date': ['2015-06-02', '2015-06-02', '2015-06-01']})
Keeping dates in numpy.datetime types is more efficient, so I convert:
df['date'] = df['date'].astype(np.datetime64)
df['initial_date'] = df['initial_date'].astype(np.datetime64)
Now, let's turn cal into a fast lookup table:
# converting types:
cal = np.asarray(cal, dtype=np.datetime64)
# lookup series
s_cal = pd.Series(range(len(cal)), index=cal)
# a convenience lookup function
def lookup(dates):
    return s_cal[dates].reset_index(drop=True)
EDIT:
The above lookup returns a series with a RangeIndex, which may be different from the index of df (this causes problems when assigning the result to a column). So it's probably better to rewrite this lookup so that it returns a plain numpy array:
def lookup2(dates):
    return s_cal[dates].values
It's possible to set the correct index in lookup (taken from the input dates). But such a solution would be less flexible (would require that the input be a Series) or unnecessarily complicated.
The series:
s_cal
Out[220]:
2015-06-01 0
2015-06-02 1
2015-06-03 2
2015-06-04 3
2015-06-08 4
2015-06-09 5
dtype: int64
and how it works:
lookup(df.date)
Out[221]:
0 3
1 5
2 5
dtype: int64
lookup2(df.date)
Out[36]: array([3, 5, 5])
The rest is straightforward. This adds a column of integers (day differences) to the data frame:
df['day_label'] = lookup2(df.date) - lookup2(df.initial_date)
and if you want to convert it to labels:
df['day_label'] = 'day ' + df['day_label'].astype(str)
df
Out[225]:
date initial_date day_label
0 2015-06-04 2015-06-02 day 2
1 2015-06-09 2015-06-02 day 4
2 2015-06-09 2015-06-01 day 5
Hope this is what you wanted.