Augment a mable: residuals and innovations from regression with ARMA errors model are the same - broom

I think there is something odd here. For example the following code gives the same values for residuals and innovations:
fit <- us_change %>%
model(ARIMA(Consumption ~ Income)) %>%
augment()
It seems that the augment() function extracts only the innovation values and uses them for the regression residuals too. This can be seen when we extract the residuals and innovations using residuals():
bind_rows(
`Regression Errors` = as_tibble(residuals(fit, type = "regression")),
`ARIMA Errors` = as_tibble(residuals(fit, type = "innovation")),
.id = "type"
)
Then the residuals and innovations are different as they should be.

The .resid column provided by augment() contains response residuals, not regression residuals. I have updated the documentation to clarify this: https://github.com/tidyverts/fabletools/commit/c0efd7166bca06450d7b18d3d0530fdeac67cce7
A response residual (.resid) is the error from backtransformed predictions on the original response variable. An innovation residual (.innov) is the error from the model (on potentially a different, transformed response variable). As your model does not transform the data, the response residuals (.resid) and the innovation residuals (.innov) are the same.
There is currently no way to obtain the regression residuals (residuals after performing regression, before applying ARIMA process) using the augment() function. This is something that would be nice to have in the future.
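To make the distinction between the two columns concrete, here is a minimal sketch in Python (not the fabletools API). The helper name `residual_pair` and the fitted values are made up for illustration, and it assumes a plain back-transformation with no bias adjustment. It shows that the two residual types coincide when the model does not transform the response and differ when it does.

```python
import numpy as np

def residual_pair(y, fitted_transformed, transform, inverse):
    """Return (innovation, response) residuals for one set of fitted values.

    fitted_transformed holds fitted values on the scale the model works on.
    """
    y = np.asarray(y, dtype=float)
    fitted_transformed = np.asarray(fitted_transformed, dtype=float)
    innov = transform(y) - fitted_transformed   # error on the modelled scale
    resid = y - inverse(fitted_transformed)     # error on the original scale
    return innov, resid

y = np.array([100.0, 110.0, 121.0])
fitted = np.array([98.0, 112.0, 120.0])  # hypothetical fitted values

# No transformation: both residual types are identical.
innov_id, resid_id = residual_pair(y, fitted, lambda v: v, lambda v: v)

# Log transformation: the model's errors live on the log scale,
# while response residuals are measured on the original scale.
innov_log, resid_log = residual_pair(y, np.log(fitted), np.log, np.exp)
```

With the identity transform the two arrays match, which mirrors the behaviour reported in the question.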

Related

How can I find a standard method of predicting next values of a stock market using Tensorflow?

Thank you for reading. I'm not good at English.
I am wondering how to predict future time series values after training a model. I would like to get the values N steps ahead.
I also wonder whether the time series data has been properly learned and predicted.
How do I correctly get the following (next) value?
I want to get the next value using something like model.predict.
I have x_test, and x_test[-1] corresponds to time t, so the next values are t+1, t+2, ..., t+n.
In this example I want to get predictions for t+1, t+2, ..., t+n.
First
I tried using stock index data
inputs = total_data[len(total_data) - forecast - look_back:]
inputs = scaler.transform(inputs)
X_test = []
for i in range(look_back, inputs.shape[0]):
    X_test.append(inputs[i - look_back:i])
X_test = np.array(X_test)
predicted = model.predict(X_test)
but the result is as shown below.
The results from X_test[-20:] and the following 20 predictions look the same.
I'm wondering whether the training and the prediction were done correctly.
full source
The method I tried first did not work correctly.
Second
I realized something was wrong, so I tried another, official dataset.
I used the time series from the TensorFlow tutorial to practice predicting with the model.
a = y_val[-look_back:]
for i in range(n_steps):  # n_steps: number of future values to predict
    tmp = model.predict(a.reshape(-1, look_back, num_feature))  # predicted value
    a = a[1:]  # remove the oldest value
    a = np.append(a, tmp)  # append the predicted value
The predictions came out as a straight, regression-like line that is very different from the real data.
Output, a linear-regression-like line that is independent of the real data:
full source (After the 25th line is my code.)
I'm really curious what the standard method of predicting the next values of a stock market is.
Thank you for reading this long question. I would appreciate your advice.
Q : "How can I find a standard method of predicting next values of a stock market...?"
First - salutes to C64 practitioner!
Next, let me say, there is no standard method - there cannot be ( one ).
Principally - let me draw from your field of a shared experience - one can easily predict the near-future flow of laminar fluids ( a technically "working" market instrument is a model A, for which one can derive a better or worse predictive tool ).
That will never work, however, for turbulent states of fluids ( just read about the complexity of the attempts to formulate the many-dimensional high-order PDE for turbulence, which still only approximates it ) - and this is the fundamentally "working" market, after some expected fundamental factor was released ( read NFP or CPI ) or some flash news was announced ( read a Swiss release of currency-bonding of CHF to some USD parity, or the Cyprus one-time state tax on all speculative deposits ... the financial Big Bangs follow ... ).
So, please, do not expect one model, much less a simple one, to give reasonably precise predictions for both the laminar and the turbulent regimes - the real world is for sure way more complex than this :o)
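On the purely mechanical part of the question (getting t+1 ... t+n out of a one-step model), a common approach is to feed each prediction back in as the newest input, as the second attempt in the question does. Below is a minimal sketch in plain NumPy. The function `recursive_forecast` and the stand-in predictor `drift_predictor` are hypothetical names used here in place of `model.predict`; they are not part of TensorFlow.

```python
import numpy as np

def recursive_forecast(predict_one, window, n_steps):
    """Roll a one-step-ahead predictor forward n_steps times."""
    window = np.asarray(window, dtype=float).copy()
    forecasts = []
    for _ in range(n_steps):
        nxt = float(predict_one(window))      # predict t+1 from the current window
        forecasts.append(nxt)
        window = np.append(window[1:], nxt)   # drop the oldest value, append the prediction
    return np.array(forecasts)

def drift_predictor(window):
    """Stand-in for a trained model: last value plus the mean recent change."""
    return window[-1] + np.mean(np.diff(window))

history = np.array([1.0, 2.0, 3.0, 4.0])
forecast = recursive_forecast(drift_predictor, history, n_steps=3)
```

Because every step after the first is computed from earlier predictions rather than from real observations, errors accumulate, and the trajectory often settles into a smooth line. That is consistent with the straight-line output described in the question and says nothing about whether the one-step model itself was trained correctly.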

Is it relevant to use both feature normalizer_fn and batch normalization?

Is it relevant to use both a feature normalizer_fn and batch normalization, as in the following?
feature_columns_complex_standardized = [
    tf.feature_column.numeric_column("my_feature", normalizer_fn=lambda x: (x - xMean) / xStd)
]
model1 = tf.estimator.DNNClassifier(
    feature_columns=feature_columns_complex_standardized,
    hidden_units=[512, 512, 512],
    optimizer=tf.train.AdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.99, epsilon=1e-08, use_locking=False),
    weight_column=weights,
    dropout=0.5,
    activation_fn=tf.nn.softmax,
    n_classes=10,
    label_vocabulary=Action_vocab,
    model_dir='./Models9/Action/',
    loss_reduction=tf.losses.Reduction.SUM_OVER_BATCH_SIZE,
    config=tf.estimator.RunConfig().replace(save_summary_steps=10),
    batch_norm=True)
You may have mixed the two up. Feature normalization is one of the methods used to bring the features in a dataset to the same scale, whereas batch normalization addresses internal covariate shift, where each hidden unit’s input distribution changes every time the parameters of the previous layer are updated.
So you can use both at the same time.
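The two steps act at different places in the network, which a small NumPy sketch can illustrate. This is not the TensorFlow implementation. The data, the fixed statistics `x_mean` and `x_std`, and the single weight matrix `w` are made up for the example, and the learnable scale and shift that real batch normalization layers carry are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=(8, 1))  # one batch of a raw input feature

# 1) Input standardization: fixed statistics computed once from the training set.
x_mean, x_std = 50.0, 10.0        # assumed dataset-level statistics
x_scaled = (x - x_mean) / x_std

# 2) Batch normalization: statistics recomputed per batch on hidden pre-activations.
w = rng.normal(size=(1, 4))       # hypothetical hidden-layer weights
h = x_scaled @ w
eps = 1e-5
h_bn = (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)
```

The first step only touches the inputs and never changes during training; the second re-centres each hidden unit for every batch, so it keeps working even as `w` is updated.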

How to predict missing values in python using linear regression 3 year worth of data

Hey guys, I have 3 years' worth of data from 2012 to 2014, but the 2014 data has missing values (100 rows). I'm not sure how to deal with them; this is my attempt:
X = red2012Mob.values
y = red2014Mob.values
X = X.reshape(-1,1)
y = y.reshape(-1,1)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
I'm not changing any of the 2014 data where values are missing; I just feed it directly into the model.
There are two ways:
1. Drop the instances with missing data (e.g. using red2012Mob.dropna(), or if it is a time series, leave out complete blocks of missing data, e.g. start later in 2014).
2. Impute the missing data. Here, however, you won't get a one-size-fits-all answer, as it really depends on your data and your problem. Since you seem to have time series data, the simplest strategy for "small" holes is to use linear or constant interpolation. If time dependency is not so important, the mean of the column may be a good strategy. For larger holes you may find a suitable model to fill in the data. Sometimes a "naive" strategy like reusing the value from the previous seasonal period (e.g. last Monday's data for the current Monday) may work, or you can use a KNN imputer (either check out this sklearn PR or the package discussed here). For the simple strategies, there is also a module in the upcoming sklearn release.
In practice I usually combine methods. For instance, up to some point I will try the strategies of the second point, but if the data is too bad it is usually better to have less "good" data than a lot of imputed data.
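For the "small holes" case, linear interpolation can be sketched with NumPy alone. The series below is made up, and the sketch assumes evenly spaced observations so that the array position can serve as the time axis.

```python
import numpy as np

series = np.array([10.0, np.nan, np.nan, 16.0, 18.0])  # made-up data with a small hole

idx = np.arange(series.size)   # position used as the time axis
known = ~np.isnan(series)      # mask of observed values

filled = series.copy()
# Interpolate linearly between the nearest observed neighbours.
filled[~known] = np.interp(idx[~known], idx[known], series[known])
```

Here the gap between 10 and 16 is filled with 12 and 14. For holes at the very start or end of the series, `np.interp` repeats the nearest observed value, which corresponds to the constant interpolation mentioned above.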
I don't know if you have data for 2013 available with you. If it is available, my first recommendation would be to use that as well. As far as data for training goes, you should only take the data for 2014 with non-missing values and then fit your model using these values. Once you get a decent cross-validation accuracy on the model, you can take the subset of data with missing values for 2014 and use that to predict values for 2014.
For better understanding, here is a small piece of sample code to subset non nan values for a list/column:
import numpy as np
a = [1.0, np.nan, 3.0]  # example list containing a missing value
a1 = [v for v in a if not np.isnan(v)]  # keep only the non-NaN values
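The train-on-complete-rows, predict-the-missing-rows idea described above can be sketched as follows. The arrays are made up, and `np.polyfit` is used here as a stand-in for the `LinearRegression` model from the question so that the example needs only NumPy.

```python
import numpy as np

x_2012 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])         # made-up predictor values
y_2014 = np.array([3.0, 5.0, np.nan, 9.0, np.nan])   # made-up target with gaps

known = ~np.isnan(y_2014)   # rows where the 2014 value is observed

# Fit a straight line on the complete rows only.
slope, intercept = np.polyfit(x_2012[known], y_2014[known], deg=1)

# Predict the missing 2014 values from the corresponding 2012 values.
y_filled = y_2014.copy()
y_filled[~known] = slope * x_2012[~known] + intercept
```

The key point is that rows with missing targets are kept out of the fit and only appear at prediction time; passing NaN targets to the model directly, as in the question, is what this avoids.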

WinBugs error Trap -undefined real result

I am writing a WinBugs code for the Bayesian Statistics question :
Consider the following model that takes into account the fact that VIX (first variable) provides information for the variance of SP500 (second variable) and the fact that $Y_t^S$ and $Y_t^V$ may be correlated:
The model is at http://i.stack.imgur.com/qMHdq.png
for $t = 1, \ldots, 200$, where $\rho$ reflects the correlation between the increments of $Y_t^S$ and $Y_t^V$, $\alpha$ is a parameter taking values in the real line and $N_2(M,V)$ denotes a bivariate normal distribution with mean $M$ and covariance matrix $V$.
(The question is:)
Assign suitable priors to the parameters $\mu_s$, $\mu_v$, $\sigma$, $\omega$, $\rho$, $\alpha$ and write a WinBugs script to fit this model to your data. Implement it to sample from the posterior distribution of this model's parameters.
The WinBugs Code is :
model{for(i in 1:200){
y[i+1,1:2] ~ dnorm(mean[i,1:2],tau[i,1:2,1:2])
mean[i,1] <- y[i,1]+mu[1]+alpha*exp(y[i,2])
mean[i,2]<- y[i,2]+mu[2]
tau[i,1,1]<-exp(y[i,2])/prec[1]
tau[i,1,2]<-exp(y[i,2]/2)*rho/sqrt(prec[1]*prec[2])
tau[i,2,1]<-exp(y[i,2]/2)*rho/sqrt(prec[1]*prec[2])
tau[i,2,2]<-(1/(prec[2]))
}
mu[1] ~ dnorm (0, 0.0001)
mu[2] ~ dnorm (0, 0.0001)
prec[1] ~ dgamma (0.001, 0.001)
prec[2] ~ dgamma (0.001, 0.001)
alpha~dnorm(1,10000)
rho~dnorm(0,10)
}
list(y =structure(.Data= c(3.291839303,3.296274588,3.295265738,3.297438773,3.298200053,3.298412011,3.296300932,3.296426043,3.294455203,3.294481658,3.285708048,3.284464574,3.287575569,3.283348727,3.283355512,3.280935583,3.285914948,3.287111684,3.286400327,3.289303491,3.291186746,3.29116009,3.294849647,3.297015994,3.298090756,3.299369994,3.298503754,3.300578094,3.301034339,3.301056053,3.300321518,3.301761166,3.301524809,3.301186314,3.3005194,3.302700982,3.301364274,3.298512491,3.300093081,3.300475917,3.297878641,3.297570124,3.300808449,3.301370783,3.303489809,3.303282476,3.299788312,3.297272339,3.300660688,3.293581304,3.297289862,3.296182373,3.294970773,3.289178542,3.289180774,3.294003026,3.29332277,3.286703413,3.294221453,3.285154331,3.280152517,3.272941046,3.273626206,3.27009395,3.270156904,3.27571666,3.279669225,3.28808818,3.284906505,3.290217199,3.293269718,3.292617095,3.29777145,3.297169381,3.299866701,3.304931922,3.30488027,3.303649561,3.306118232,3.307754826,3.307906605,3.309259582,3.309562037,3.309257451,3.309487508,3.309591846,3.309911091,3.312135025,3.311482607,3.312336061,3.314604473,3.315846543,3.31534678,3.316563686,3.315458122,3.312482018,3.315245917,3.316877848,3.316372983,3.317095535,3.31393257,3.313829271,3.30666945,3.308634834,3.301535654,3.298772321,3.295069851,3.303820042,3.314126455,3.316106697,3.317758387,3.318516185,3.318455693,3.319890621,3.320264714,3.318136407,3.313635254,3.313487574,3.30547605,3.30159638,3.306618004,3.314318146,3.31065296,3.307123626,3.306002323,3.303470376,3.299435382,3.305226653,3.305899267,3.30794935,3.314530804,3.312139259,3.313253293,3.307399755,3.301498781,3.305620033,3.299940723,3.305534079,3.311760217,3.309951512,3.314398169,3.312911143,3.311062677,3.315674421,3.315661824,3.319830321,3.321596359,3.322289603,3.322153111,3.321691617,3.324344199,3.324212469,3.325408924,3.325076221,3.32443474,3.32314893,3.325800858,3.323825279,3.321915182,3.322434321,3.316234618,3.317944305,3.310514886,3.309681258,3.315119807,3.312473558
,3.31831173,3.31686738,3.322115879,3.319994568,3.323891208,3.323132421,3.320457869,3.314088528,3.313054794,3.314082206,3.319364268,3.315527433,3.31380186,3.315332072,3.318192769,3.317296379,3.318459865,3.320391417,3.322645108,3.320650938,3.321358125,3.323588265,3.323250037,3.318309644,3.32230201,3.321658486,3.323862366,3.324885109,3.325862386,3.324060105,3.325261087,3.323633617,3.319212277,3.323930349,3.325205636,-1.674871187,-1.837305384,-1.784901741,-1.824437164,-1.877095042,-1.853296595,-1.793076756,-1.802020721,-1.75360385,-1.750339701,-1.541660595,-1.537570704,-1.640896418,-1.545769835,-1.571902641,-1.556650006,-1.604336613,-1.6935902,-1.699715676,-1.778820579,-1.811756808,-1.762148494,-1.818778584,-1.826568672,-1.857709419,-1.859185357,-1.880873164,-1.863628277,-1.868840571,-1.857709419,-1.838025906,-1.843086364,-1.823727823,-1.815963058,-1.796505852,-1.835147398,-1.795132589,-1.739332463,-1.780168274,-1.785580061,-1.751643889,-1.700330607,-1.790343193,-1.795818949,-1.839468745,-1.833711714,-1.727193104,-1.651880385,-1.754258154,-1.611526503,-1.656547093,-1.59284645,-1.575092078,-1.5540471,-1.583117287,-1.674274013,-1.621581021,-1.528943106,-1.641471071,-1.453534332,-1.345690975,-1.216718593,-1.28451135,-1.161741385,-1.197198918,-1.315549541,-1.462376193,-1.587427911,-1.495750895,-1.563454293,-1.585808919,-1.589591272,-1.683878412,-1.639174734,-1.676066767,-1.705884658,-1.663594506,-1.654210604,-1.6972603,-1.728462971,-1.76413233,-1.79444677,-1.777474973,-1.770778032,-1.720871468,-1.751643889,-1.708364571,-1.716473539,-1.710229163,-1.73420046,-1.778820579,-1.79788129,-1.823727823,-1.83658546,-1.750339701,-1.689935542,-1.782193745,-1.808267093,-1.814558711,-1.854765047,-1.694811844,-1.654210604,-1.464249161,-1.394472583,-1.352258787,-1.379888524,-1.255280835,-1.422607479,-1.548864573,-1.565558689,-1.633460313,-1.659476569,-1.685086464,-1.677263996,-1.644350056,-1.596113873,-1.433397543,-1.499648104,-1.401421332,-1.350612172,-1.428435452,-1.538591373,-1.51144575
8,-1.415487857,-1.373953779,-1.335931446,-1.299891813,-1.357631945,-1.402730434,-1.449377291,-1.570312304,-1.556650006,-1.618216566,-1.527933706,-1.379038217,-1.453534332,-1.356803139,-1.423054399,-1.522402875,-1.47367507,-1.54680019,-1.524410013,-1.463312172,-1.527429445,-1.541148304,-1.628349281,-1.665956408,-1.602685826,-1.622143032,-1.631185029,-1.689327925,-1.67367725,-1.727193104,-1.71772782,-1.71334574,-1.749688341,-1.769444817,-1.716473539,-1.6935902,-1.705265784,-1.636312824,-1.644350056,-1.555087327,-1.545769835,-1.623831253,-1.591760035,-1.613194194,-1.610416485,-1.709607188,-1.703411805,-1.770778032,-1.745142444,-1.731645785,-1.622705408,-1.602685826,-1.643773495,-1.676665175,-1.631185029,-1.641471071,-1.667139772,-1.663005033,-1.660651132,-1.708985657,-1.766120707,-1.800638718,-1.711474452,-1.728462971,-1.782869953,-1.79925891,-1.714595509,-1.752296718,-1.755568243,-1.791708899,-1.807570829,-1.820896234,-1.76413233,-1.812456437,-1.746438846,-1.674274013,-1.792392558,-1.782193745),
.Dim=c(201,2))
)
list( mu=c(0,0), prec=c(1,1),alpha=1,rhi=0.5)
I get an error "multivariate node expected" while compiling the model. What is wrong in the code?
You cannot pass a vector of means and a matrix to dnorm, which is what you are currently doing. The compiler expects a multivariate node, but dnorm is a univariate distribution. The model that you specify is actually bivariate normal, which in WinBUGS (and in JAGS) is written with dmnorm. Note that dmnorm takes a vector of means and a precision matrix, i.e. the inverse of the variance-covariance matrix. The tau[i,,] that you have built looks like a covariance matrix (its entries are variances such as exp(y[i,2])/prec[1]), so besides changing dnorm to dmnorm at the top of your model you will most likely also need to invert it, for example with the inverse() function, before passing it in.
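The relationship between the covariance matrix written in the model and the precision matrix a multivariate normal node expects can be checked numerically. The sketch below is Python, not BUGS code. It assumes, based on the code in the question, that the covariance entries are exp(v)·σ², exp(v/2)·ρ·σ·ω and ω² with σ² = 1/prec[1] and ω² = 1/prec[2]; the function name `covariance` and the numeric values are made up.

```python
import numpy as np

def covariance(log_var, sigma, omega, rho):
    """2x2 covariance matrix mirroring the tau[i,,] entries in the question."""
    off_diag = np.exp(log_var / 2.0) * rho * sigma * omega
    return np.array([[np.exp(log_var) * sigma**2, off_diag],
                     [off_diag, omega**2]])

cov = covariance(log_var=-1.7, sigma=1.0, omega=1.0, rho=0.5)

# The precision matrix is the matrix inverse of the covariance matrix.
precision = np.linalg.inv(cov)

# A covariance matrix must be positive definite, which here requires |rho| < 1.
is_valid = bool(np.all(np.linalg.eigvalsh(cov) > 0))
```

One consequence worth keeping in mind: a prior that lets rho leave the interval (-1, 1) can produce a matrix that is not positive definite, which may be related to numerical traps during sampling.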

How to get scikit learn to find simple non-linear relationship

I have some data in a pandas dataframe (although pandas is not the point of this question). As an experiment I made column ZR as column Z divided by column R. As a first step using scikit learn I wanted to see if I could predict ZR from the other columns (which should be possible as I just made it from R and Z). My steps have been.
import numpy as np
from sklearn import preprocessing, linear_model

columns = ['R', 'T', 'V', 'X', 'Z']
for c in columns:
    results[c] = preprocessing.scale(results[c])  # standardize each feature column
results['ZR'] = preprocessing.scale(results['ZR'])
labels = results["ZR"].values
features = results[columns].values
# print(labels)
# print(features)
regr = linear_model.LinearRegression()
regr.fit(features, labels)
print(regr.coef_)
print(np.mean((regr.predict(features) - labels) ** 2))  # mean squared error on the training data
This gives
[ 0.36472515 -0.79579885 -0.16316067 0.67995378 0.59256197]
0.458552051342
The preprocessing seems wrong as it destroys the Z/R relationship I think. What's the right way to preprocess in this situation?
Is there some way to get near 100% accuracy? Linear regression is the wrong tool as the relationship is non-linear.
The five features are highly correlated in my data. Is non-negative least squares implemented in scikit learn ? ( I can see it mentioned in the mailing list but not the docs.) My aim would be to get as many coefficients set to zero as possible.
You should easily be able to get a decent fit using random forest regression, without any preprocessing, since it is a nonlinear method:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=10, max_features=2)
model.fit(features, labels)
You can play with the parameters to get better performance.
The solution is not that easy and can be strongly influenced by your data.
If your variables R and Z are bounded (for example 0<R<1, -3<Z<2) then you should be able to get a good estimate of the output variable using a neural network.
Using a neural network you should be able to estimate your output even without preprocessing the data, using all the variables as input.
(Of course, here you will have to solve a minimization problem.)
When this was written sklearn did not implement neural networks, so pybrain or fann were the suggested options; more recent sklearn releases include sklearn.neural_network.MLPRegressor.
If you want to preprocess the data in order to make the minimization problem easier, you can try to extract the right features from the predictor matrix.
I do not think there are a lot of tools for non-linear feature selection. I would try to estimate the important variables from your dataset using, in this order:
1- lasso
2- sparse PCA
3- decision tree (you can actually use them for feature selection), but I would avoid this as much as possible
If this is a toy problem I would suggest you move towards something more standard.
You can find a lot of examples on Google.
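On the preprocessing question itself, one illustration that is not taken from the answers above: if R and Z are strictly positive, taking logarithms turns the ratio into a linear relationship, since log(Z/R) = log Z - log R. The sketch below uses synthetic data and plain least squares from NumPy instead of scikit-learn; the helper `fit_mse` is a made-up name.

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.uniform(1.0, 5.0, size=200)   # synthetic, strictly positive
Z = rng.uniform(1.0, 5.0, size=200)
ZR = Z / R

def fit_mse(features, target):
    """Ordinary least squares with an intercept; returns the training MSE."""
    X = np.column_stack([features, np.ones(len(target))])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return float(np.mean((X @ coef - target) ** 2))

# Linear model on the raw columns: cannot represent a ratio.
mse_raw = fit_mse(np.column_stack([R, Z]), ZR)

# Linear model on log-transformed columns and target: exact up to rounding.
mse_log = fit_mse(np.column_stack([np.log(R), np.log(Z)]), np.log(ZR))
```

This only works when every value is positive, and standardizing the columns first (which shifts them around zero) rules it out, which is one way to read the concern in the question that scaling destroys the Z/R relationship.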