Forecasting in time series in R with dplyr and ggplot

Hope all goes well.
I have a data set; here is a small piece of it:
library(dplyr)
date=c("2022-08-01","2022-08-02","2022-08-03","2022-08-04",
"2022-08-05","2022-08-06")
sold_items=c(12,18,9,31,19,10)
df <- data.frame(date=as.Date(date),sold_items)
df %>% sample_n(5)
        date sold_items
1 2022-08-04         31
2 2022-08-03          9
3 2022-08-01         12
4 2022-08-06         10
5 2022-08-02         18
I need to forecast the number of sold items for the next two weeks (the 14 days after the last available date in the data).
I also need to show the forecast together with the current data on one graph using ggplot.
I have been looking into the forecast package to use ARIMA, but I am lost and could not convert this data to a time series object.
I wonder if someone can provide a solution to my problem with dplyr.
Thank you very much.

# First create df (fpp3 attaches tibble, dplyr, tsibble and fable)
library(fpp3)
df <-
  tibble(
    sold = c(12, 18, 9, 31, 19, 10),
    date = seq(as.Date("2022-08-01"),
               by = "day",
               length.out = length(sold))) %>%
  relocate(date)
# Then coerce to a tsibble object (requires package fpp3) and model:
df %>%
  as_tsibble(index = date) %>%
  model(ARIMA(sold)) %>%
  forecast(h = 14)

Related

Filtering and calculating mean within groups ggplot2

I'm working with a large df, trying to make some plots by filtering the data on different attributes of interest. Let's say my df looks like:
df(site=c(A,B,C,D,E), subsite=c(w,x,y,z), date=c(01/01/1985, 05/01/1985, 16/03/1995, 24/03/1995), species=c(1,2,3,4), Year=c(1985,1990,1995,2012), julian day=c(1,2,3,4), Month=c(6,7,8,11).
I would like to plot the average julian day per month for each year in which a species was present in a Subsite and Site. So far I've got this code, but the average is calculated for each month over all the years in my df rather than per year. Any help or directions would be welcome!
Plot1<- df %>%
filter(Site=="A", Year>1985, Species =="2")%>%
group_by(Month) %>%
mutate("Day" = mean(julian day)) %>%
ggplot(aes(x=Year, y=Day, color=Species)) +
geom_boxplot() +
stat_summary(fun=mean, geom="point",
shape=1, size=1, show.legend=FALSE) +
stat_summary(fun=mean, colour="red", geom="text", show.legend = FALSE,
vjust=-0.7,size=3, aes(label=round(..y.., digits=0)))
Thanks!
I think I spotted the error.
I was missing this:
group_by(Month, Year) %>%

plotly displaying pandas dataframe on a radial chart

Thank you for taking the time to read my likely silly question.
I have time series data in a pandas dataframe and would like to plot two separate financial years as two separate lines, with the month as theta and the number of queries received each month as r.
df['FY'] = np.where(df['call_DT'] < "01/04/2020", "Financial Year 1", "Financial Year 2")
df['Month'] = df['call_DT'].dt.month
df = df.sort_values('Month')
df = df.groupby('Month')
df = df['Count of queries'].sum().reset_index()
df = df.set_index('Month')
fig = px.line_polar(df,r="Count of queries",theta=df.index)
plot(fig)
I understand that I am dropping everything other than 'Count of queries' (I am not sure how), so I understand why the color would not show.
However, with r = "Count of queries" and theta = "Month", no graph is displayed. I understand I have butchered this by not properly understanding the code. Any help would be appreciated.
Edit:
A snippet of the used columns for this task. I group the data by month and sum the count of queries column. I want to differentiate the two lines in the radial chart by financial year, rather than year, so I included the 'B/A' column to differentiate between them.
    call_DT     B/A               Count of queries
2   2021-05-17  Financial Year 2  1
5   2021-05-17  Financial Year 2  1
16  2021-05-14  Financial Year 2  1
18  2021-05-14  Financial Year 2  1
26  2021-05-14  Financial Year 2  1
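One possible way to get one line per financial year is to keep the financial-year column in the groupby, so that it is still available for the color argument. The sketch below assumes the column names from the snippet above ('call_DT', 'B/A', 'Count of queries') and uses made-up sample rows. The helper names monthly_by_fy and polar_figure are hypothetical. Month numbers are mapped to names so that theta is categorical rather than numeric, and the plotly import sits inside the plotting helper so the aggregation can be checked on its own.

```python
import calendar

import pandas as pd

# Made-up sample rows using the column names from the question's snippet.
calls = pd.DataFrame({
    "call_DT": pd.to_datetime(
        ["2020-02-10", "2020-02-20", "2020-03-05",
         "2021-02-11", "2021-05-14", "2021-05-17"]
    ),
    "Count of queries": [1, 1, 1, 1, 1, 1],
})

# Financial year split on 1 April 2020. An explicit Timestamp avoids the
# day/month ambiguity of a string such as "01/04/2020".
cutoff = pd.Timestamp("2020-04-01")
calls["B/A"] = calls["call_DT"].lt(cutoff).map(
    {True: "Financial Year 1", False: "Financial Year 2"}
)
calls["Month"] = calls["call_DT"].dt.month


def monthly_by_fy(frame):
    """Sum queries per (financial year, month), keeping both as columns."""
    out = (
        frame.groupby(["B/A", "Month"], as_index=False)["Count of queries"]
        .sum()
        .sort_values(["B/A", "Month"])
        .reset_index(drop=True)
    )
    # Month names make the angular axis categorical instead of numeric.
    out["Month name"] = out["Month"].map(lambda m: calendar.month_abbr[int(m)])
    return out


def polar_figure(monthly):
    """Build the radial chart; requires plotly to be installed."""
    import plotly.express as px

    return px.line_polar(
        monthly, r="Count of queries", theta="Month name",
        color="B/A", line_close=True,
    )


monthly = monthly_by_fy(calls)
print(monthly)
```

Calling polar_figure(monthly).show() should then draw one trace per value of 'B/A'.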

How to plot only business hours and weekdays in pandas

I have hourly stock data.
I need a) to format it so that matplotlib ignores weekends and non-business hours and b) an hourly frequency.
The problem:
Currently, the graph looks crammed and I suspect it is because matplotlib is taking into account 24 hours instead of 8, and 7 days a week instead of business days.
How do I tell pandas to only take into account business hours, Monday to Friday?
How I am graphing the data:
I am looping through a list of price data dataframes, graphing each data frame:
mm = 0
for ii in df:
    Ddate = ii['Date']
    Pprice = ii['Price']
    d = Ddate.to_list()
    p = Pprice.to_list()
    dates = make_dt(d)
    prices = unstring(p)
    plt.figure()
    plt.plot(dates,prices)
    plt.title(stocks[mm])
    plt.grid(True)
    plt.xlabel('Dates')
    plt.ylabel('Prices')
    mm += 1
the graph:
To flag business days, you can check whether pd.bdate_range returns anything for a given date. It expects single dates rather than a whole column, so apply it row by row:
df['IsBDay'] = df['date'].apply(lambda x: bool(len(pd.bdate_range(x, x))))
# The line above adds a new boolean column, IsBDay, to the DF.
Now create a new DF that keeps only the rows where IsBDay is True, along with the other columns:
df = df[df['IsBDay']]
Now your DF is ready for plotting.
Hope this helps.
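The filter above only handles weekdays. Below is a minimal sketch that also drops non-business hours and plots without gaps. It assumes a DatetimeIndex, 09:00 to 16:59 as business hours (8 hourly rows per day), and made-up prices; the helper name plot_without_gaps is hypothetical. Plotting against row position rather than against the timestamps is one common way to stop matplotlib from reserving space for nights and weekends.

```python
import pandas as pd

# Made-up hourly prices for one full week, Monday 2022-08-01 to Sunday 2022-08-07.
idx = pd.date_range("2022-08-01", periods=7 * 24, freq=pd.Timedelta(hours=1))
prices = pd.Series(range(len(idx)), index=idx, name="Price", dtype="float64")

# Keep Monday to Friday (dayofweek 0-4) ...
weekdays = prices[prices.index.dayofweek < 5]
# ... and only the rows whose time falls within the assumed business hours.
business = weekdays.between_time("09:00", "16:59")


def plot_without_gaps(series, title, rows_per_day=8):
    """Plot against row position so closed hours take no space on the x-axis."""
    import matplotlib.pyplot as plt  # imported here so the filtering runs without it

    fig, ax = plt.subplots()
    ax.plot(range(len(series)), series.to_numpy())
    # One tick per trading day, labelled with that day's date.
    ax.set_xticks(range(0, len(series), rows_per_day))
    ax.set_xticklabels(series.index[::rows_per_day].strftime("%a %d %b"), rotation=45)
    ax.set_title(title)
    ax.set_xlabel("Dates")
    ax.set_ylabel("Prices")
    ax.grid(True)
    return fig


print(len(business))
```

The trade-off is that the x-axis no longer holds real dates, so the tick labels have to be set by hand as above.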

pandas PeriodIndex, select 12 months of data based on last period

I have a large table of data, indexed with periods 2017-4 through 2019-3. What's the best way to get two 12-month slices of the data?
I'm basically trying to find the correct way to select df['2018-4':'2019-3'] and df['2017-4':'2018-3'] without manually typing in the slices.
Play data:
import numpy as np
import pandas as pd
np.random.seed(0)
ind = pd.period_range(start='2017-4', end='2019-3', freq='M')
df = pd.DataFrame(np.random.randint(0, 100, (len(ind), 2)), columns=['A', 'B'], index=ind)
df.head()
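One way to avoid typing the slices is to derive them from the last period in the index, since adding or subtracting an integer from a Period shifts it by that many months. This is a sketch on the play data above, assuming the index is sorted and has monthly frequency; the names last, recent and prior are made up for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
ind = pd.period_range(start="2017-04", end="2019-03", freq="M")
df = pd.DataFrame(np.random.randint(0, 100, (len(ind), 2)),
                  columns=["A", "B"], index=ind)

# Last period present in the index: 2019-03.
last = df.index.max()

# Label-based slices on a PeriodIndex include both endpoints.
recent = df.loc[last - 11:last]        # 2018-04 through 2019-03
prior = df.loc[last - 23:last - 12]    # 2017-04 through 2018-03

print(recent.index[0], recent.index[-1])
print(prior.index[0], prior.index[-1])
```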

Create datetime from columns in a DataFrame

I have a DataFrame with these columns:
year month day gender births
I'd like to create a new column of type "Date" based on the columns year, month and day, formatted as "yyyy-mm-dd".
I'm just beginning in Python and I just can't figure out how to proceed...
Assuming you are using pandas to create your dataframe, you can try:
>>> import pandas as pd
>>> df = pd.DataFrame({'year':[2015,2016],'month':[2,3],'day':[4,5],'gender':['m','f'],'births':[0,2]})
>>> df['dates'] = pd.to_datetime(df.iloc[:,0:3])
>>> df
   year  month  day gender  births      dates
0  2015      2    4      m       0 2015-02-04
1  2016      3    5      f       2 2016-03-05
Taken from the example here and the slicing (iloc use) "Selection" section of "10 minutes to pandas" here.
You can use .assign
For example:
df2 = df.assign(ColumnDate=df.Column1.astype(str) + '-' + df.Column2.astype(str) + '-' + df.Column3.astype(str))
It is simple and it is much faster than lambda if you have tonnes of data.
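To compare the two answers, here is a small sketch on the same sample frame. It shows that string concatenation produces text without zero padding, while pd.to_datetime on the year, month and day columns produces real datetimes that can be formatted as "yyyy-mm-dd" when a string is needed. The column names date and date_text are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"year": [2015, 2016], "month": [2, 3], "day": [4, 5],
                   "gender": ["m", "f"], "births": [0, 2]})

# Plain concatenation gives strings without zero padding, e.g. "2015-2-4".
as_text = df.year.astype(str) + "-" + df.month.astype(str) + "-" + df.day.astype(str)

# Selecting the columns by name lets pandas assemble real datetimes.
df["date"] = pd.to_datetime(df[["year", "month", "day"]])

# Format back to "yyyy-mm-dd" text only when a string is really needed.
df["date_text"] = df["date"].dt.strftime("%Y-%m-%d")

print(as_text.tolist())
print(df[["date", "date_text"]])
```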