Removing multiplicative seasonality and trend from observations using Prophet? - facebook-prophet

I've fitted Prophet with logistic growth and multiplicative seasonality over time series data (daily observations spanning several years, no additional regressors; just ds, y) and have the forecast dataframe. How do I use the values from forecast to remove seasonality?
Within forecast, I understand that the weekly and yearly columns deal with seasonality on a weekly/yearly basis, and multiplicative_terms deals with changing magnitudes over time; but I don't know how to put this together to remove seasonality from my data.
I have attempted to remove seasonality in the following way, but I believe what I'm doing is wrong.
#R
df$y - forecast$trend * forecast$multiplicative_terms * forecast$weekly * forecast$yearly
For reference, when using seasonal_decompose, I've had to use the following to get rid of seasonality; this doesn't hold for Prophet because of the additional terms.
#Python
df.y - trend * decomposition.seasonal
Edit: After doing a bit of research, I'm currently running the following, which looks right, but can anyone confirm whether this is the correct way to remove seasonality + trend?
df$y -
(forecast$trend +
(forecast$trend * forecast$weekly) +
(forecast$trend * forecast$yearly) +
(forecast$trend * forecast$multiplicative_terms)
)
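For reference, a minimal Python sketch of how the columns relate, assuming the standard Prophet output semantics (with multiplicative seasonality, yhat = trend * (1 + multiplicative_terms), where multiplicative_terms is already the sum of the weekly and yearly components, so they should not be added in a second time); df and forecast are the frames from the question:
import pandas as pd  # df has columns ds, y; forecast comes from m.predict()

merged = df.merge(forecast[["ds", "trend", "multiplicative_terms"]], on="ds")

# Residual after removing trend and all multiplicative seasonality:
merged["residual"] = merged["y"] - merged["trend"] * (1 + merged["multiplicative_terms"])

# Or keep the original scale but strip only the seasonal swing:
merged["seasonally_adjusted"] = merged["y"] / (1 + merged["multiplicative_terms"])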

Related

Optimizing PowerBI query for creating a "to date" maximum

I have a time series data set that is almost monotone increasing. The values of the series will dip every now and then, but it generally increases. The dips that occur in the series are due to errors in a sensor reading: either previous values being too high or later values being too low. My goal is to apply a pre-processing transformation to get a more stable signal. Here is an image for reference:
Pre-processing vs raw
I can do this relatively easily in python, but I'm having a difficult time doing this efficiently in Power BI. Here is an example table of raw and processed data to give an example of what I'm hoping to do:
Data table
Here is the DAX code that I've tried to apply to create the ProcessedValue column:
ProcessedValue =
VAR CurrentIndex = Query1[Index]
RETURN
    CALCULATE (
        MAX ( Query1[Value] ),
        FILTER (
            ALL ( Query1 ),
            Query1[Index] < CurrentIndex
                && Query1[Index] > CurrentIndex - 50
        )
    )
The two issues that I'm running into are: 1) I run out of memory, and 2) I don't know if the code is even doing what I intend it to do (because it doesn't complete). The table has ~1M data points. I'm fairly new to Power BI and not aware of the tips and tricks needed to do fast calculations, so any help would be greatly appreciated. For reference, here is the code that I'm using in Python:
import numpy as np

def backward_max(a, axis=0):
    # max of all earlier values; the first element maps to itself
    bmax = np.array([np.max(a[: max(1, i)], axis=axis) for i in range(len(a))])
    return bmax

df['ProcessedValue'] = backward_max(df['Value'].values)
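As a sanity check on what the DAX measure is meant to reproduce, the same "max of all earlier values" can be computed without the Python-level loop (a sketch, equivalent to backward_max above):
import numpy as np

def backward_max_fast(a):
    # Shift by one so each position only sees earlier values, then take a running max;
    # the first element maps to itself, matching backward_max above.
    shifted = np.concatenate(([a[0]], a[:-1]))
    return np.maximum.accumulate(shifted)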

How to calculate slope of the line

I am trying to calculate the slope of the line for a 50-day EMA I created from the adjusted closing price on a few stocks I downloaded using the getSymbols function.
My EMA looks like this:
getSymbols("COLUM.CO")
COLUM.CO$EMA <- EMA(COLUM.CO[,6],n=50)
This gives me an extra column that contains the 50 day EMA on the adjusted closing price. Now I would like to include an additional column that contains the slope of this line. I'm sure it's a fairly easy answer, but I would really appreciate some help on this. Thank you in advance.
A good way to do this is with rolling least-squares regression. rollSFM does a fast and efficient job of computing the slope of a series. It usually makes sense to look at the slope in relation to units of price activity in time (bars), so x can simply be equally spaced points.
The only tricky part is working out an effective value of n, the length of the window over which you fit the slope.
library(quantmod)
getSymbols("AAPL")
AAPL$EMA <- EMA(Ad(AAPL), n = 50)
# Compute slope over 50 bar lookback:
AAPL <- merge(AAPL, rollSFM(Ra = AAPL[, "EMA"],
                            Rb = 1:nrow(AAPL), n = 50))
The column labeled beta contains the rolling window value of the slope (alpha contains the intercept, r.squared contains the R2 value).
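If you want to check the numbers outside R, the same rolling least-squares slope against equally spaced x can be sketched in Python (the EMA column name and the 50-bar window come from the example above; everything else is an assumption):
import numpy as np
import pandas as pd

def rolling_slope(series, window=50):
    # OLS slope of y against x = 0..window-1 over each rolling window:
    # slope = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
    x = np.arange(window, dtype=float)
    xc = x - x.mean()
    denom = (xc ** 2).sum()
    return series.rolling(window).apply(lambda y: np.dot(xc, y - y.mean()) / denom, raw=True)

# e.g. df["EMA_slope"] = rolling_slope(df["EMA"], window=50)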

Solving the multiple-choice multidimensional knapsack

I am trying to solve some (relatively easy) instances of the multiple-choice multidimensional knapsack problem (where the items are partitioned into groups, exactly one item per group can be chosen, and both the item weights and the knapsack capacity are multi-dimensional). I have two questions regarding the formulation and solution:
If two groups have different numbers of items, is it possible to pad the smaller groups with artificial items having zero profit and weight equal to the capacity, so that the problem can be expressed in matrix form? Would this affect the solution? Specifically, assume I have an optimization program where the first group (item set) has three candidate items and the second group has only two, i.e. of the following form:
maximize (over x_ij)   v_11 x_11 + v_12 x_12 + v_13 x_13 + v_21 x_21 + v_22 x_22
subject to             w^i_11 x_11 + w^i_12 x_12 + w^i_13 x_13 + w^i_21 x_21 + w^i_22 x_22 <= W^i,  i = 1, 2
                       x_11 + x_12 + x_13 = 1,  x_21 + x_22 = 1
                       x_ij in {0,1} for all i and j.
Is it OK in this scenario to add an artificial item x_23 with value v_23 = 0 and w^1_23 = W^1, w^2_23 = W^2 to have full products v_ij x_ij (i=1,2 j=1,2,3)?
Given that (1) is possible, has anyone tried to solve such instances using an open-source optimization package such as cvx? I know about CPLEX, but it is difficult to obtain for a non-academic, and I am not sure whether GLPK supports groups of variables.
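On the second question: any MILP-capable open-source toolchain can handle the group structure directly, since each group is just one extra "pick exactly one" equality constraint, so padding is only needed if you insist on a rectangular matrix. A minimal sketch using PuLP (which ships with the CBC solver; cvxpy with a MIP-capable backend would look similar); the numbers below are placeholders, not from the question:
import pulp

values = {1: [10, 7, 4], 2: [6, 9]}            # v_gj (hypothetical)
weights = {1: [[3, 5, 2], [4, 1, 6]],          # w^i_gj, one row per resource i (hypothetical)
           2: [[2, 4], [3, 2]]}
capacity = [8, 7]                              # W^1, W^2 (hypothetical)

prob = pulp.LpProblem("mmkp", pulp.LpMaximize)
x = {(g, j): pulp.LpVariable(f"x_{g}_{j}", cat="Binary")
     for g, vs in values.items() for j in range(len(vs))}

prob += pulp.lpSum(values[g][j] * x[g, j] for (g, j) in x)                  # objective
for i, cap in enumerate(capacity):                                          # capacity per resource
    prob += pulp.lpSum(weights[g][i][j] * x[g, j] for (g, j) in x) <= cap
for g, vs in values.items():                                                # exactly one item per group
    prob += pulp.lpSum(x[g, j] for j in range(len(vs))) == 1

prob.solve()
print([(g, j) for (g, j) in x if x[g, j].value() == 1], pulp.value(prob.objective))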

Efficiently finding the distance between 2 lat/longs in SQL

I'm working with billions of rows of data, and each row has an associated start latitude/longitude, and end latitude/longitude. I need to calculate the distance between each start/end point - but it is taking an extremely long time.
I really need to make what I'm doing more efficient.
Currently I use a function (below) to calculate the hypotenuse between points. Is there some way to make this more efficient?
I should say that I have already tried casting the lat/longs as spatial geographies and using SQL Server's built-in STDistance() function (not indexed), but this was even slower.
Any help would be much appreciated. I'm hoping there is some way to speed up the function, even if it degrades accuracy a little (nearest 100m is probably ok).
Thanks in advance!
DECLARE @l_distance_m FLOAT
      , @l_long_start FLOAT
      , @l_long_end FLOAT
      , @l_lat_start FLOAT
      , @l_lat_end FLOAT
      , @l_x_diff FLOAT
      , @l_y_diff FLOAT

SET @l_lat_start = @lat_start
SET @l_long_start = @long_start
SET @l_lat_end = @lat_end
SET @l_long_end = @long_end

-- NOTE 2 x PI() x (radius of earth) / 360 = 111
SET @l_y_diff = 111 * (@l_lat_end - @l_lat_start)
SET @l_x_diff = 111 * (@l_long_end - @l_long_start) * COS(RADIANS((@l_lat_end + @l_lat_start) / 2))
SET @l_distance_m = 1000 * SQRT(@l_x_diff * @l_x_diff + @l_y_diff * @l_y_diff)

RETURN @l_distance_m
I haven't done any SQL programming since around 1994; however, I'd make the following observations:
The formula you're using works as long as the distances between your coordinates don't get too big. It will have big errors when working out the distance between e.g. New York and Singapore, but for the distance between New York and Boston it should be fine to within 100m.
I don't think there's an approximation formula that would be faster, but I can see some minor implementation improvements that might speed it up:
(1) Why bother assigning @l_lat_start from @lat_start? Can't you just use @lat_start directly (and the same for @long_start, @lat_end and @long_end)?
(2) Instead of having 111 in the formulas for @l_y_diff and @l_x_diff, you could drop it there, saving a multiplication, and use 111000 instead of 1000 in the formula for @l_distance_m.
(3) Using COS(RADIANS(@l_lat_end)) or COS(RADIANS(@l_lat_start)) won't degrade the accuracy as long as the points aren't too far apart; or, if the points are all within the same city, you could just use the cosine of any one point in that city.
Apart from that, I think you'd need to look at other ideas, such as creating a table of precomputed results and updating it whenever points are added to or deleted from the source table.
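For reference, the same equirectangular approximation the T-SQL function implements, written out in NumPy just to make the math explicit (not a suggestion to move the work out of SQL):
import numpy as np

def approx_distance_m(lat_start, long_start, lat_end, long_end):
    # ~111 km per degree of latitude; longitude degrees shrink by cos(mean latitude)
    y_km = 111.0 * (lat_end - lat_start)
    x_km = 111.0 * (long_end - long_start) * np.cos(np.radians((lat_end + lat_start) / 2.0))
    return 1000.0 * np.sqrt(x_km * x_km + y_km * y_km)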

Linear regression confidence intervals in SQL

I'm using some fairly straightforward SQL code to calculate the coefficients of regression (intercept and slope) of some (x, y) data points, using least squares. This gives me a nice best-fit line through the data. However, we would like to be able to see the 95% and 5% confidence intervals for the line of best fit (the curves below).
(figure: confidence bands around the line of best fit; source: curvefit.com)
What these mean is that the true line has a 95% probability of being below the upper curve and a 95% probability of being above the lower curve. How can I calculate these curves? I have already read Wikipedia etc. and done some Googling, but I haven't found mathematical equations I can understand well enough to calculate this.
Edit: here is the essence of what I have right now.
--sample data
create table #lr (x real not null, y real not null)
insert into #lr values (0,1)
insert into #lr values (4,9)
insert into #lr values (2,5)
insert into #lr values (3,7)
declare @slope real
declare @intercept real

--calculate slope and intercept
select
    @slope = ((count(*) * sum(x*y)) - (sum(x) * sum(y))) /
             ((count(*) * sum(Power(x,2))) - Power(Sum(x),2)),
    @intercept = avg(y) - ((count(*) * sum(x*y)) - (sum(x) * sum(y))) /
                          ((count(*) * sum(Power(x,2))) - Power(Sum(x),2)) * avg(x)
from #lr
Thank you in advance.
An equation for confidence interval width as f(x) is given here under "Confidence Interval on Fitted Values"
http://www.weibull.com/DOEWeb/confidence_intervals_in_simple_linear_regression.htm
The page walks you through an example calculation too.
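In case it saves a lookup, the standard confidence band on the fitted values that the page above describes can be sketched in Python like this (using the sample points from the question; note that those four points lie exactly on y = 2x + 1, so the band collapses to zero width there, but real data gives a proper band):
import numpy as np
from scipy import stats

x = np.array([0.0, 4.0, 2.0, 3.0])
y = np.array([1.0, 9.0, 5.0, 7.0])

n = len(x)
xm = x.mean()
slope = ((x - xm) * (y - y.mean())).sum() / ((x - xm) ** 2).sum()
intercept = y.mean() - slope * xm
y_hat = intercept + slope * x

s = np.sqrt(((y - y_hat) ** 2).sum() / (n - 2))     # residual standard error, n - 2 df
sxx = ((x - xm) ** 2).sum()

# One-sided 95% bounds (the "95%/5%" curves in the question); use 0.975 for a 95% two-sided band.
t = stats.t.ppf(0.95, df=n - 2)
half_width = t * s * np.sqrt(1.0 / n + (x - xm) ** 2 / sxx)
lower, upper = y_hat - half_width, y_hat + half_width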
Try this site and scroll down to the middle. For each point of your best-fit line, you know your Z, your sample size, and your standard deviation.
http://www.stat.yale.edu/Courses/1997-98/101/confint.htm
@PowerUser: He needs to use the equations for two-variable setups, not for one-variable setups.
Matt: If I had my old Statistics textbook with me, I'd be able to tell you what you want; unfortunately, I don't have it with me, nor do I have my notes from my high school statistics course. On the other hand, from what I remember it may only have had stuff for the confidence interval of the regression line's slope...
Anyway, this page will hopefully be of some help: http://www.stat.yale.edu/Courses/1997-98/101/linregin.htm.