Future dates calculating incorrectly in FBProphet - make_future_dataframe method - facebook-prophet

I'm trying to do a weekly forecast in FBProphet for just 5 weeks ahead. The make_future_dataframe method doesn't seem to be working right....makes the correct one week intervals except for one week between jul 3 and Jul 5....every other interval is correct at 7 days or a week. Code and output below:
INPUT DATAFRAME
ds y
548 2010-01-01 3117
547 2010-01-08 2850
546 2010-01-15 2607
545 2010-01-22 2521
544 2010-01-29 2406
... ... ...
4 2020-06-05 2807
3 2020-06-12 2892
2 2020-06-19 3012
1 2020-06-26 3077
0 2020-07-03 3133
CODE
future = m.make_future_dataframe(periods=5, freq='W')
future.tail(9)
OUTPUT
ds
545 2020-06-12
546 2020-06-19
547 2020-06-26
548 2020-07-03
549 2020-07-05
550 2020-07-12
551 2020-07-19
552 2020-07-26
553 2020-08-02

All you need to do is create a dataframe with the dates you need for predict method. utilizing the make_future_dataframe method is not necessary.

Related

Merging two series with alternating dates into one grouped Pandas dataframe

Given are two series, like this:
#period1
DATE
2020-06-22 310.62
2020-06-26 300.05
2020-09-23 322.64
2020-10-30 326.54
#period2
DATE
2020-06-23 312.05
2020-09-02 357.70
2020-10-12 352.43
2021-01-25 384.39
These two series are correlated to each other, i.e. they each mark either the beginning or the end of a date period. The first series marks the end of a period1 period, the second series marks the end of period2 period. The end of a period2 period is at the same time also the start of a period1 period, and vice versa.
I've been looking for a way to aggregate these periods as date ranges, but apparently this is not easily possible with Pandas dataframes. Suggestions extremely welcome.
In the easiest case, the output layout should reflect the end dates of periods, which period type it was, and the amount of change between start and stop of the period.
Explicit output:
DATE CHG PERIOD
2020-06-22 NaN 1
2020-06-23 1.43 2
2020-06-26 12.0 1
2020-09-02 57.65 2
2020-09-23 35.06 1
2020-10-12 29.79 2
2020-10-30 25.89 1
2021-01-25 57.85 2
However, if there is any possibility of actually grouping by a date range consisting of start AND stop date, that would be much more favorable
Thank you!
p1 = pd.DataFrame(data={'Date': ['2020-06-22', '2020-06-26', '2020-09-23', '2020-10-30'], 'val':[310.62, 300.05, 322.64, 326.54]})
p2 = pd.DataFrame(data={'Date': ['2020-06-23', '2020-09-02', '2020-10-12', '2021-01-25'], 'val':[312.05, 357.7, 352.43, 384.39]})
p1['period'] = 1
p2['period'] = 2
df = p1.append(p2).sort_values('Date').reset_index(drop=True)
df['CHG'] = abs(df['val'].diff(periods=1))
df.drop('val', axis=1)
Output:
Date period CHG
0 2020-06-22 1 NaN
1 2020-06-23 2 1.43
2 2020-06-26 1 12.00
3 2020-09-02 2 57.65
4 2020-09-23 1 35.06
5 2020-10-12 2 29.79
6 2020-10-30 1 25.89
7 2021-01-25 2 57.85
EDIT: matching the format START - STOP - CHANGE - PERIOD
Starting from the above data frame:
df['Start'] = df.Date.shift(periods=1)
df.rename(columns={'Date': 'Stop'}, inplace=True)
df = df1[['Start', 'Stop', 'CHG', 'period']]
df
Output:
Start Stop CHG period
0 NaN 2020-06-22 NaN 1
1 2020-06-22 2020-06-23 1.43 2
2 2020-06-23 2020-06-26 12.00 1
3 2020-06-26 2020-09-02 57.65 2
4 2020-09-02 2020-09-23 35.06 1
5 2020-09-23 2020-10-12 29.79 2
6 2020-10-12 2020-10-30 25.89 1
7 2020-10-30 2021-01-25 57.85 2
# If needed:
df1.index = pd.to_datetime(df1.index)
df2.index = pd.to_datetime(df2.index)
df = pd.concat([df1, df2], axis=1)
df.columns = ['start','stop']
df['CNG'] = df.bfill(axis=1)['start'].diff().abs()
df['PERIOD'] = 1
df.loc[df.stop.notna(), 'PERIOD'] = 2
df = df[['CNG', 'PERIOD']]
print(df)
Output:
CNG PERIOD
Date
2020-06-22 NaN 1
2020-06-23 1.43 2
2020-06-26 12.00 1
2020-09-02 57.65 2
2020-09-23 35.06 1
2020-10-12 29.79 2
2020-10-30 25.89 1
2021-01-25 57.85 2
2021-01-29 14.32 1
2021-02-12 22.57 2
2021-03-04 15.94 1
2021-05-07 45.42 2
2021-05-12 16.71 1
2021-09-02 47.78 2
2021-10-04 24.55 1
2021-11-18 41.09 2
2021-12-01 19.23 1
2021-12-10 20.24 2
2021-12-20 15.76 1
2022-01-03 22.73 2
2022-01-27 46.47 1
2022-02-09 26.30 2
2022-02-23 35.59 1
2022-03-02 15.94 2
2022-03-08 21.64 1
2022-03-29 45.30 2
2022-04-29 49.55 1
2022-05-04 17.06 2
2022-05-12 36.72 1
2022-05-17 15.98 2
2022-05-19 18.86 1
2022-06-02 27.93 2
2022-06-17 51.53 1

How to merge records with aggregate historical data?

I have a table with individual records and another which holds historical information about the individuals in the former.
I want to extract information about the individuals from the second table. Both tables have timestamp. It is very important that the historical information happened before the record in the first table.
Date_Time name
0 2021-09-06 10:46:00 Leg It Liam
1 2021-09-06 10:46:00 Hollyhill Island
2 2021-09-06 10:46:00 Shani El Bolsa
3 2021-09-06 10:46:00 Kilbride Fifi
4 2021-09-06 10:46:00 Go
2100 2021-10-06 11:05:00 Slaneyside Babs
2101 2021-10-06 11:05:00 Hillview Joe
2102 2021-10-06 11:05:00 Fairway Flyer
2103 2021-10-06 11:05:00 Whiteys Surprise
2104 2021-10-06 11:05:00 Astons Lucy
The name is the variable by which you connect the two tables:
Date_Time name cc
13 2021-09-15 12:16:00 Hollyhill Island 6.00
14 2021-09-06 10:46:00 Hollyhill Island 4.50
15 2021-05-30 18:28:00 Hollyhill Island 3.50
16 2021-05-25 10:46:00 Hollyhill Island 2.50
17 2021-05-18 12:46:00 Hollyhill Island 2.38
18 2021-04-05 12:31:00 Hollyhill Island 3.50
19 2021-04-28 12:16:00 Hollyhill Island 3.75
I want to add aggregated data from this table to the first. Such as adding the cc mean and count.
Date_Time name
1 2021-09-06 10:46:00 Hollyhill Island
This line I would add 5 for cc count and 3.126 for the cc mean. Remember the historical records need to be before the date time of the individual records.
I am a bit confused how to do this efficiently. I know I need to groupby the historical data.
Also the individual records are usually in groups of Date_Time, if that makes it any easier.
IIUC:
try:
out=df1.merge(df2,on='name',suffixes=('','_y'))
#merging both df's on name
out=out.mask(out['Date_Time']<=out['Date_Time_y']).dropna()
#filtering results
out=out.groupby(['Date_Time','name'])['cc'].agg(['count','mean']).reset_index()
#aggregrating values
output of out:
Date_Time name count mean
0 2021-09-06 10:46:00 Hollyhill Island 5 3.126

How to extract the last value of a group in pandas dataframe

I have a huge data that needs to be grouped based on its 'IDs' and only the last value of each ID needs to be exported into a SINGLE csv/excel file.
incl = ['A', 'B', 'C']
for k, g in df[df['ID'].isin(incl)].groupby('ID'):
g.tail(1).to_csv(f'{k}.csv')
I have tried this but it makes different csv files for each ID instead of a one big file containing the last value of each group.
Sample data:
ID Date Open High Low
30 UNITY 2020-06-18 11.50 11.75 11.41
31 UNITY 2020-06-21 11.44 11.50 10.88
32 UNITY 2020-06-22 11.26 11.78 11.26
33 UNITY 2020-06-23 11.72 12.08 11.53
34 UNITY 2020-06-24 11.51 11.59 11.40
35 UNITY 2020-06-25 11.85 11.85 11.11
36 SSOM 2020-05-03 27.50 27.95 27.00
37 SSOM 2020-05-05 27.50 27.50 27.50
38 SSOM 2020-05-06 29.20 29.56 29.20
39 SSOM 2020-05-07 31.77 31.77 31.77

Pandas - Group into 24-hour blocks, but not midnight-to-midnight

I have a time Series. I'd like to group into into blocks of 24-hour blocks, from 8am to 7:59am the next day. I know how to group by date, but I've tried and failed to handle this 8-hour offset using TimeGroupers and DateOffsets.
I think you can use Grouper with parameter base:
print df
date name
0 2015-06-13 00:21:25 1
1 2015-06-14 01:00:25 2
2 2015-06-14 02:54:48 3
3 2015-06-15 14:38:15 2
4 2015-06-15 15:29:28 1
print df.groupby(pd.Grouper(key='date', freq='24h', base=8)).sum()
name
date
2015-06-12 08:00:00 1.0
2015-06-13 08:00:00 5.0
2015-06-14 08:00:00 NaN
2015-06-15 08:00:00 3.0
alternatively to #jezrael's method you can use your custom grouper function:
start_ts = '2016-01-01 07:59:59'
df = pd.DataFrame({'Date': pd.date_range(start_ts, freq='10min', periods=1000)})
def my_grouper(df, idx):
return df.ix[idx, 'Date'].date() if df.ix[idx, 'Date'].hour >= 8 else df.ix[idx, 'Date'].date() - pd.Timedelta('1day')
df.groupby(lambda x: my_grouper(df, x)).size()
Test:
In [468]: df.head()
Out[468]:
Date
0 2016-01-01 07:59:59
1 2016-01-01 08:09:59
2 2016-01-01 08:19:59
3 2016-01-01 08:29:59
4 2016-01-01 08:39:59
In [469]: df.tail()
Out[469]:
Date
995 2016-01-08 05:49:59
996 2016-01-08 05:59:59
997 2016-01-08 06:09:59
998 2016-01-08 06:19:59
999 2016-01-08 06:29:59
In [470]: df.groupby(lambda x: my_grouper(df, x)).size()
Out[470]:
2015-12-31 1
2016-01-01 144
2016-01-02 144
2016-01-03 144
2016-01-04 144
2016-01-05 144
2016-01-06 144
2016-01-07 135
dtype: int64

How to create new Pandas Dataframe with columns form DataFrame (PYTHON)

I am creating a DataFrame from a csv file, where my index (rows) is date and my column names are names of cities.
After I create the raw DataFrame, I am trying to create a DataFrame from selected columns. I have tried:
A=df['city1'] #city 1
B=df['city2']
C=pd.merge(A,B)
but it does't work. This is what A and B look like.
Date
2013-11-01 2.56
2013-12-01 1.77
2014-01-01 0.00
2014-02-01 0.38
2014-03-01 13.16
2014-04-01 10.29
2014-05-01 15.43
2014-06-01 11.48
2014-07-01 8.54
2014-08-01 11.11
2014-09-01 2.71
2014-10-01 4.16
2014-11-01 13.01
2014-12-01 9.59
Name: Seattle.Washington, dtype: float64 Date
And this is what I am looking to create:
City1 City2
Date
2013-11-01 0.00 2.94
2013-12-01 8.26 3.41
2014-01-01 1.11 14.27
2014-02-01 32.86 84.26
2014-03-01 34.12 0.00
2014-04-01 68.39 0.00
2014-05-01 27.17 9.09
2014-06-01 10.47 32.00
2014-07-01 14.19 26.83
2014-08-01 14.91 6.36
2014-09-01 3.76 8.32
2014-10-01 5.83 2.19
2014-11-01 10.79 2.64
2014-12-01 21.24 8.08
Any suggestion?
Error Message:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-222-ec50ff9f372f> in <module>()
14 S = df['City1']
15 A = df['City2']
16
---> 17 print merge(S,A)
18 #df2=pd.merge(A,A)
19 #print df2
C:\...\merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy)
36 right_on=right_on, left_index=left_index,
37 right_index=right_index, sort=sort, suffixes=suffixes,
---> 38 copy=copy)
39 return op.get_result()
40 if __debug__:
Answer: (Courtesy of #EdChum)
df[['City1', 'City2']]