I have hourly data, and when I plot the ACF and PACF I can see that the data depends strongly on the value 24 hours back. This means today's value at 7 PM depends heavily on the 7 PM values of previous days. So I'm not sure what the p, q values should be. This is a stationary dataset.
import matplotlib.pyplot as plt
import statsmodels.api as sm

fig = plt.figure(figsize=(12, 8))
ax1 = fig.add_subplot(211)  # top panel: ACF
fig = sm.graphics.tsa.plot_acf(data2['Count'], lags=80, ax=ax1)
ax2 = fig.add_subplot(212)  # bottom panel: PACF
fig = sm.graphics.tsa.plot_pacf(data2['Count'], lags=80, ax=ax2)
When you create a non-seasonal (or "normal") ARIMA model, you have a time series and fit an autoregressive and a moving average model on the lags of the time series. If you have seasonal data, you should use seasonal ARIMA, e.g. SARIMAX from statsmodels. It basically creates a "derivative" time series from only the specific lags in the previous seasons. So for example, if you try to predict the value for 7 PM, it creates a time series of all observations at 7 PM. I wrote a more specific explanation here:
How does the seasonal component of Seasonal ARIMA work?
You then create a SARIMAX model on the data. You have two sets of parameters: order=(p, d, q) is for the "normal", non-seasonal part, and the additional seasonal_order=(P, D, Q, m) is for the seasonal part. m is the length of a season, so 24 here. P, D and Q are used to create an ARIMA model on the derivative, seasonal time series.
https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html
Related
I'm currently building a model to predict daily stock prices based on daily data for thousands of stocks. I've got the daily data for all stocks, however the histories have different lengths. E.g. for some stocks I have daily data from 2000 to 2022, and for others I have data from 2010 to 2022.
Many dates are also obviously repeated for all stocks.
While I was learning autogluon, I used the following function to format timeseries data so it can work with .fit():
def forward_fill_missing(ts_dataframe: TimeSeriesDataFrame, freq="D") -> TimeSeriesDataFrame:
    original_index = ts_dataframe.index.get_level_values("timestamp")
    start = original_index[0]
    end = original_index[-1]
    filled_index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
    return ts_dataframe.droplevel("item_id").reindex(filled_index, method="ffill")

ts_dataframe = ts_dataframe.groupby("item_id").apply(forward_fill_missing)
This worked, however I was trying this for data for just one item_id and now I have thousands.
When I use this now, I get the following error: ValueError: cannot reindex from a duplicate axis
It's important to note that I have already forward filled my data with pandas, so the ts_dataframe shouldn't have any missing dates or values, but when I try to use it with .fit() I get the following error:
ValueError: Frequency not provided and cannot be inferred. This is often due to the time index of the data being irregularly sampled. Please ensure that the data set used has a uniform time index, or create the TimeSeriesPredictor setting ignore_time_index=True.
I assume that this is because I have only filled in missing data and dates, but not taken into account the varying number of days available for every stock individually.
For reference, here's how I have formatted the data with pandas:
df = pd.read_csv(
    "/content/drive/MyDrive/stock_data/training_data.csv",
    parse_dates=["Date"],
)
df["Date"] = pd.to_datetime(df["Date"], errors="coerce", dayfirst=True)
df.fillna(method='ffill', inplace=True)
df = df.drop("Unnamed: 0", axis=1)
df[:11]
How can I format the data so I can use it with .fit()?
Thanks!
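One possible direction, sketched in plain pandas before any conversion to a TimeSeriesDataFrame: fill each stock over its own date range rather than filling the whole table at once, and drop duplicate (item, date) rows first, since duplicates are a common cause of "cannot reindex from a duplicate axis". The helper name fill_daily_per_item, the column names item_id, timestamp and close, and the toy data are all assumptions for illustration, not part of the original data.

```python
import pandas as pd

def fill_daily_per_item(df, item_col="item_id", time_col="timestamp", freq="D"):
    """Forward-fill every item over its own first-to-last date range (hypothetical helper)."""
    # Duplicate (item, timestamp) rows make reindex fail, so keep only the last one.
    df = df.drop_duplicates(subset=[item_col, time_col], keep="last")
    pieces = []
    for _, group in df.groupby(item_col):
        group = group.set_index(time_col).sort_index()
        # A regular index that spans only this item's own history.
        full_index = pd.date_range(group.index.min(), group.index.max(), freq=freq, name=time_col)
        pieces.append(group.reindex(full_index, method="ffill"))
    return pd.concat(pieces).reset_index()

# Toy data: item A has a gap and a duplicated date, item B starts later.
raw = pd.DataFrame({
    "item_id": ["A", "A", "A", "B", "B"],
    "timestamp": pd.to_datetime(
        ["2022-01-01", "2022-01-04", "2022-01-04", "2022-01-02", "2022-01-03"]
    ),
    "close": [1.0, 2.0, 3.0, 5.0, 6.0],
})
filled = fill_daily_per_item(raw)
```

Each item ends up with one row per day between its own first and last date, so items with shorter histories are not padded out to the longest one.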
I have been having problems with the price column: every time I plot a graph against it, the price axis shows decimals, and I want it to show the actual values instead. All my graphs have this problem.
Example of a linear graph
This is the dataframe containing the information of the dataset
train is the name of the dataframe.
columns contains the selected columns:
columns = ['Id', 'year', 'distance_travelled(kms)', 'brand_rank', 'car_age']
for i in columns:
    plt.scatter(train[i], y, label='Actual')
    plt.xlabel(i)
    plt.ylabel('price')
    plt.show()
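If the "decimals" are matplotlib's scientific-notation tick labels (for example 1.0, 2.0 with a 1e6 multiplier printed above the axis), which is only a guess from the description, one option is to set an explicit tick formatter on the price axis. A minimal sketch with made-up numbers; the data and column names stand in for train and y:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter

# Made-up stand-ins for train['car_age'] and the price column.
car_age = [1, 3, 5, 8]
price = [3_400_000, 2_100_000, 1_200_000, 250_000]

fig, ax = plt.subplots()
ax.scatter(car_age, price, label="Actual")
ax.set_xlabel("car_age")
ax.set_ylabel("price")

# Print full numbers with thousands separators instead of a scaled axis.
ax.yaxis.set_major_formatter(StrMethodFormatter("{x:,.0f}"))

fig.canvas.draw()  # render once so the tick labels are generated
labels = [tick.get_text() for tick in ax.get_yticklabels()]
```

In the loop from the question, the same formatter line could be applied to plt.gca() before plt.show().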
I tracked all the movies I watched in 2019 and I want to represent the year on a graph using matplotlib, pyplot or seaborn. I saw a graph by a user who also tracked the movies he watched in a year:
I want a graph like this:
How do I represent each movie as an 'event' on a timeline?
For reference, here is a look at my table.
(sorry if basic)
I've made an assumption (from your comment) that your date column is type str. Here is code that will produce the graph:
Modify your pd.DataFrame object
Firstly, a function to add a column to your dataframe:
def modify_dataframe(df):
    """ Modify dataframe to include new columns """
    df['Month'] = pd.to_datetime(df['Date'], format='%Y-%m-%d').dt.month
    return df
The pd.to_datetime function converts the series df['Date'] to a datetime series; and I'm creating a new column called Month which equates to the month number.
From this column, we can generate X and Y coordinates for your plot.
def get_x_y(df):
    """ Get X and Y coordinates; return tuple """
    series = df['Month'].value_counts().sort_index()
    new_series = series.reindex(range(1,13)).fillna(0).astype(int)
    return new_series.index, new_series.values
This takes in your modified dataframe and creates a series that counts the number of occurrences of each month. If any months are missing, reindex adds them and fillna gives them a value of 0. Now you can begin to plot.
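To see what the two steps produce, here is a small self-contained check on a made-up table with four films. The dates are invented for illustration, and the logic is the same as in the two functions above, written inline:

```python
import pandas as pd

# Invented sample: two films in January, one in March, one in December.
df = pd.DataFrame({"Date": ["2019-01-05", "2019-01-20", "2019-03-02", "2019-12-31"]})

# Same steps as modify_dataframe and get_x_y.
df["Month"] = pd.to_datetime(df["Date"], format="%Y-%m-%d").dt.month
counts = df["Month"].value_counts().sort_index()
counts = counts.reindex(range(1, 13)).fillna(0).astype(int)  # months with no films become 0

X, Y = counts.index, counts.values
print(list(X))  # month numbers 1 to 12
print(list(Y))  # films watched in each month
```

Every month appears in X even when nothing was watched, which keeps the line on the plot continuous across empty months.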
Plotting the graph
I've created a plot that looks like the desired output you linked.
Firstly, call your functions:
df = modify_dataframe(df)
X, Y = get_x_y(df)
Create the canvas and axis to plot on to.
fig = plt.figure(figsize=(12,5))
ax = fig.add_subplot(1, 1, 1, title='Films watched per month - 2019')
Set the x-ticks and x-label.
ax.set_xticks(range(1,13))
ax.set_xlabel('Month')
Generate x-labels. These replace the month int values (i.e. 1, 2, 3...) on the x-axis, so set them after the ticks are in place (this needs import datetime).
xlabels = [datetime.datetime(2019, i, 1).strftime("%B") for i in range(1, 13)]
ax.set_xticklabels(xlabels, rotation=45, ha='right')
Set the y-axis, y-lim, and y-label.
ax.set_yticks(range(0, max(Y)+2))
ax.set_ylim(0, max(Y)+1)
ax.set_ylabel('Count')
To get your desired output, fill underneath the graph with a block-colour (I've chosen green here but you can change it to something else).
ax.fill_between(X, [0]*len(X), Y, facecolor='green')
ax.plot(X, Y, color="black", linewidth=3, marker="o")
Plot your graph!
plt.show() # or plt.savefig('output.png', format='png')
I have monthly data of 6 variables from 2014 until 2018 in one dataset.
I'm trying to draw 6 subplots (one for each variable) with monthly X axis (Jan, Feb....) and 5 series (one for each year) with their legend.
This is part of the data:
I created 5 series (one for each year) per variable (30 in total) and I'm getting the expected output but using MANY lines of code.
What is the best way to achieve this using fewer lines of code?
This is an example of how I created the series:
CL2014 = data_total['Charity Lottery'].where(data_total['Date'].dt.year == 2014)[0:12]
CL2015 = data_total['Charity Lottery'].where(data_total['Date'].dt.year == 2015)[12:24]
This is an example of how I'm plotting the series:
axCL.plot(xvals, CL2014)
axCL.plot(xvals, CL2015)
axCL.plot(xvals, CL2016)
axCL.plot(xvals, CL2017)
axCL.plot(xvals, CL2018)
There's no need to litter your namespace with 30 variables. Seaborn makes the job very easy but you need to normalize your dataframe first. This is what "normalized" or "unpivoted" looks like (Seaborn calls this "long form"):
Date        variable         value
2014-01-01  Charity Lottery  ...
2014-01-01  Racecourse       ...
2014-04-01  Bingo Halls      ...
2014-04-01  Casino           ...
Your screenshot is a "pivoted" or "wide form" dataframe.
import seaborn as sns

df_plot = pd.melt(df, id_vars='Date')
df_plot['Year'] = df_plot['Date'].dt.year
df_plot['Month'] = df_plot['Date'].dt.strftime('%b')
plot = sns.catplot(data=df_plot, x='Month', y='value',
                   row='Year', col='variable', kind='bar',
                   sharex=False)
plot.savefig('figure.png', dpi=300)
Result (all numbers are randomly generated):
I would try using .groupby(), it is really powerful for paring down things like this:
for _, group in data_total.groupby([year, month])[[x_variable, y_variable]]:
    plt.plot(group[x_variable], group[y_variable])
So here the groupby will separate your data_total DataFrame into year/month subsets, with the [[]] on the end to pare it down to the x_variable (assuming it is in your data_total DataFrame) and your y_variable, which can be any of the features you are interested in.
I would decompose your datetime column into separate year and month columns, then use those new columns inside that groupby as the [year, month]. You might be able to pass in the dt.year and dt.month like you had before... not sure, try it both ways!
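A compact sketch of the layout the question asks for (one subplot per variable, one line per year, months on the x-axis), grouping by year only. The data is randomly generated and only two of the six variables are included to keep it short; data_total and the column names are stand-ins for the real dataset:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in for data_total: one row per month, one column per variable.
dates = pd.date_range("2014-01-01", "2018-12-01", freq="MS")
rng = np.random.default_rng(0)
data_total = pd.DataFrame({
    "Date": dates,
    "Charity Lottery": rng.random(len(dates)),
    "Casino": rng.random(len(dates)),
})

month_names = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
variables = [col for col in data_total.columns if col != "Date"]

fig, axes = plt.subplots(len(variables), 1, figsize=(8, 3 * len(variables)),
                         sharex=True, squeeze=False)
for ax, variable in zip(axes.ravel(), variables):
    # One line per year, plotted against the month number.
    for year, group in data_total.groupby(data_total["Date"].dt.year):
        ax.plot(group["Date"].dt.month, group[variable], label=str(year))
    ax.set_title(variable)
    ax.set_xticks(range(1, 13))
    ax.set_xticklabels(month_names)
    ax.legend()
```

Adding the other four variables only means adding columns to the dataframe; the loops stay the same.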
I have multiple time series, each having a different beginning and end time. When I plot them using pandas and matplotlib I get nice graphs beginning from t0 and ending at tx for each individual series. I know that I cannot plot different length series in one plot, but I would like to at least view them with the months lining up.
For example, say I have two series: 1 begins in April and ends in September, 2 begins in February and ends in December.
How do I visualize them so that each series is plotted on a yearly graph (Jan to Dec) even though the data does not span those dates? I want to see them one above the other, lining up according to months.
I have it like this so far, with xlim=('Jan', 'Dec'), but I just get blank plots
for dfl in dfl_list[0:2]:
    dfl.plot(x='DateTime', y=['VWCmax', 'VWCmin'],
             ax=p1, fontsize=15, xlim=('Jan', 'Dec'))
p1.set_title('Time vs VWC', fontsize=15)
p1.set_ylabel('VWC (%) ' + '{}'.format(imei), fontsize=15)
p1.set_xlabel('Time Stamp', fontsize=15)
I've also tried xticks instead of xlim, but I also get blank plots.
The problem was that I thought the argument for xlim could be the strings 'Jan' and 'Dec'. This returned blank graphs because pyplot did not know how to fit a graph to string limits. The solution is that xlim has to be passed datetime arguments:
from datetime import datetime

for dfl in dfl_list[0:2]:
    dfl.plot(x='DateTime', y=['VWCmax', 'VWCmin'],
             ax=p1, fontsize=15, xlim=(datetime(2017,1,1), datetime(2017,12,31)))
p1.set_title('Time vs VWC', fontsize=15)
p1.set_ylabel('VWC (%) ' + '{}'.format(imei), fontsize=15)
p1.set_xlabel('Time Stamp', fontsize=15)
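For the "one above the other" part of the question, a minimal sketch with two made-up series on stacked axes that share the same January to December limits. The frames, their values and the use of plain matplotlib instead of DataFrame.plot are assumptions for illustration:

```python
from datetime import datetime

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def make_series(start, end):
    """Build a made-up daily frame with DateTime and VWCmax columns."""
    days = pd.date_range(start, end, freq="D")
    return pd.DataFrame({"DateTime": days, "VWCmax": np.linspace(20, 30, len(days))})

# Series 1 runs April to September, series 2 runs February to December.
dfl_list = [make_series("2017-04-01", "2017-09-30"), make_series("2017-02-01", "2017-12-31")]

fig, axes = plt.subplots(len(dfl_list), 1, figsize=(10, 6), sharex=True)
for ax, dfl in zip(axes, dfl_list):
    ax.plot(dfl["DateTime"], dfl["VWCmax"])
    # datetime limits, not strings, so every panel covers the full year.
    ax.set_xlim(datetime(2017, 1, 1), datetime(2017, 12, 31))
    ax.xaxis.set_major_locator(mdates.MonthLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%b"))
    ax.set_ylabel("VWC (%)")
```

Months where a series has no data simply stay empty, while the tick marks still line up between the panels.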