Estimating model parameters - optimization

I have data (t, TR and h) and need to estimate the parameters (a, b, c, d and e) of this model: h = (a*log(TR) + b) * (c*t^d - e) * 41.59
t <- c(120,60,50,40,30,20,10,180,120,60,50,40,30,20,10,120,60,50,40,30,20,10,120,60,50,40,30,20,10)
TR <- c(2,2,2,2,2,2,2,5,5,5,5,5,5,5,5,10,10,10,10,10,10,10,20,20,20,20,20,20,20)
h <- c(49.4,41.6,39.1,36.2,31.9,25.1,15.9,80.2,70.3,55.2,51.6,47.3,41.7,33.2,22.1,84.2,64.1,59.9,54.6,48.1,38.6,26.2,97.5,72.8,67.9,61.6,54.3,43.7,30.1)
bell <- nls(h ~ (a*log(TR) + b) * (c*t^d - e) * 41.59,
            start = list(a = 0.6, b = 0.3, c = 0.4, d = 0.3, e = 0.4))
I tried the "nls" and the "nls2" process but these didn't work because of this:
Error in nlsModel(formula, mf, start, wts) :
singular gradient matrix at initial parameter estimates
So, I found a possible solution following this procedure (I think it is a linearly constrained optimization):
Build this system of equations from the partial derivatives of the model with respect to each parameter (a1, b1, c1, d1, e1 are the current values):
U1 <- log(TR)*(c1*t^d1-e1)*41.59
U2 <- (c1*t^d1-e1)*41.59
U3 <- t^d1*(a1*log(TR)+b1)*41.59
U4 <- c1*t^d1*log(t)*((a1*log(TR))+b1)*41.59
U5 <- -(a1*log(TR)+b1)*41.59
Starting from the initial values assumed for the model parameters, solve the system of linear equations repeatedly, in order to obtain deviations that move closer to zero each time:
ΣZ = (a1-a) ΣU1,i + (b1-b) ΣU2,i + (c1-c) ΣU3,i + (d1-d) ΣU4,i + (e1-e) ΣU5,i
In each iteration, the parameters take the values obtained in the previous iteration, plus the deviations...
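If I have understood the procedure, one iteration would look roughly like this in R (my own untested sketch; the variable names are mine, and the deviations come out as the least-squares solution of the linearized system):

## Untested sketch of the iteration described above (plain Gauss-Newton)
p <- c(a = 0.6, b = 0.3, c = 0.4, d = 0.3, e = 0.4)   # initial guesses
for (iter in 1:100) {
  a <- p[1]; b <- p[2]; cc <- p[3]; d <- p[4]; e <- p[5]
  r <- h - (a*log(TR) + b) * (cc*t^d - e) * 41.59     # current residuals
  U <- cbind(log(TR) * (cc*t^d - e) * 41.59,          # U1 = dh/da
             (cc*t^d - e) * 41.59,                    # U2 = dh/db
             t^d * (a*log(TR) + b) * 41.59,           # U3 = dh/dc
             cc * t^d * log(t) * (a*log(TR) + b) * 41.59,  # U4 = dh/dd
             -(a*log(TR) + b) * 41.59)                # U5 = dh/de
  delta <- qr.solve(U, r)   # least-squares solution of U %*% delta = r
  p <- p + delta            # update the parameters by the deviations
  if (max(abs(delta)) < 1e-8) break                   # stop when deviations are near zero
}
p

(As far as I know, this is essentially what nls does internally, so the same singular-gradient problem could appear here too if U becomes rank-deficient.)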
How could I implement this iteration properly in R?
I'm sorry, I'm an R beginner and my English is not very good.
Many thanks for any help you can give me.

Related

systemfit 3SLS Testing for Overidentification Restrictions

I'm currently struggling to find a good way to perform the Hansen/Sargan tests of over-identifying restrictions for a three-stage least squares (3SLS) model on panel data in R. I spent the whole day digging through different sites and couldn't find a way to run these tests with the well-known systemfit package.
Currently, my code is simple.
violence_c_3sls <- Crime ~ ln_GDP + I(ln_GDP^2) + ln_Gini
income_c_3sls   <- ln_GDP ~ Crime + ln_Gini
gini_c_3sls     <- ln_Gini ~ ln_GDP + I(ln_GDP^2) + Crime
inst <- ~ Educ_Gvmnt_Exp + I(Educ_Gvmnt_Exp^2) + Health_Exp + Pov_Head_Count_1.9
system_c_3sls <- list(violence_c_3sls, income_c_3sls, gini_c_3sls)
fitsur_c_3sls <- systemfit(system_c_3sls, "3SLS", inst = inst,
                           data = df_new, methodResidCov = "noDfCor")
summary(fitsur_c_3sls)
However, adding more instruments to create an over-identified system does not produce any Hansen/Sargan test output, so I assume the test has to be run separately, probably on the systemfit object.
Thanks in advance.
With g equations, l exogenous variables, and k regressors, the Sargan statistic for 3SLS is
S = u' (\Sigma^{-1} \otimes P_W) u,
which is asymptotically \chi^2 with g*l - k degrees of freedom, where u is the vector of stacked residuals, \Sigma is the estimated residual covariance matrix, and P_W is the projection matrix onto the exogenous variables. See Ch. 12.4 of Davidson & MacKinnon, Econometric Theory and Methods.
Calculating the Sargan test from systemfit should look something like this:
library(stringr)  # for str_detect() and str_remove()

sargan.systemfit <- function(results3sls) {
  result <- list()
  u <- as.matrix(resid(results3sls))   # model residuals, n x n_eq
  n_eq <- length(results3sls$eq)       # number of equations
  n <- nrow(u)                         # number of observations
  n_reg <- length(coef(results3sls))   # total number of regressors
  # Matrix of instruments, in block-diagonal form with one block per equation
  w <- model.matrix(results3sls, which = 'z')
  # Aggregate into a single block (in case different instruments are used per equation)
  w_list <- lapply(1:n_eq, function(eq_i) {
    this_eq_label <- results3sls$eq[[eq_i]]$eqnLabel
    this_w <- w[str_detect(rownames(w), this_eq_label),
                str_detect(colnames(w), this_eq_label)]
    colnames(this_w) <- str_remove(colnames(this_w), paste0(this_eq_label, '_'))
    this_w
  })
  w <- do.call(cbind, w_list)
  w <- w[, !duplicated(colnames(w))]
  n_inst <- ncol(w)   # w is n x n_inst, where n_inst is the number of unique instruments/exogenous variables
  # Estimate the residual covariance (or use residCov; asymptotically equivalent)
  var_u <- crossprod(u) / n            # var_u <- results3sls$residCov
  P_w <- w %*% solve(crossprod(w)) %*% t(w)   # projection matrix onto the instruments w
  # as.numeric(u) stacks the residuals into an (n_eq * n) x 1 vector
  result$statistic <- as.numeric(t(as.numeric(u)) %*% kronecker(solve(var_u), P_w) %*% as.numeric(u))
  result$df <- n_inst * n_eq - n_reg
  result$p.value <- 1 - pchisq(result$statistic, result$df)
  result$method <- "Sargan over-identifying restrictions test"
  return(result)
}
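Applied to the example above, the call would then look like this (assuming fitsur_c_3sls has been refit with enough extra instruments to over-identify the system):

st <- sargan.systemfit(fitsur_c_3sls)
st$statistic
st$df
st$p.value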

Exclude group if not including all fixed effect levels (lme4)?

I am unsure what to do in the following situation, where some levels of the fixed effect are missing within some levels of the random effect - the design is unbalanced.
Imagine an aquarium with 5,000 individual fish. They are part of 100 different species. I want to test if there is a relationship between their weight (continuous) and whether they are fed by Alan or Susie (there only are two employees that feed fish). Species is the random effect.
My model looks like this: weight ~ employee + (1 + employee | species), i.e. a mixed model (lmer) with random intercept and slope.
But for some species, all fish are fed by the same employee (Alan or Susie). Should I leave these observations in the model, or should I exclude them? Is there some literature on this?
This should be fine. Mixed models are well suited to this kind of missingness, unless it's really extreme (e.g. there were no species, or very few, that were measured by both employees). A small made-up example is below.
The species where employee 1's measurements are missing have slightly wider confidence intervals; the species where employee 2's measurements are missing have considerably wider CIs on the employee-2 effect. (I'm not sure why these estimates aren't exactly zero; my guess is that it has to do with the particular random-effects values simulated - i.e. the random effects have zero mean overall, so these may be slightly greater than zero to make the overall estimates balance...?)
library(lme4)
library(ggplot2)

n_emp <- 2
n_spp <- 10
n_rep <- 20
dd <- expand.grid(emp = factor(seq(n_emp)),
                  spp = factor(seq(n_spp)),
                  rep = seq(n_rep))
## drop employee 1 for species 1-2 and employee 2 for species 3-4
dd2 <- subset(dd,
              !((emp == "1" & (spp %in% 1:2)) |
                (emp == "2" & (spp %in% 3:4))))
form <- weight ~ emp + (1 + emp | spp)
## BUG/edge case: form[-2] breaks?
dd2$weight <- simulate(~ emp + (1 + emp | spp),
                       seed = 101,
                       newdata = dd2,
                       newparams = list(beta = c(1, 1),
                                        theta = c(1, 1, 1),
                                        sigma = 1),
                       family = gaussian)[[1]]
m <- lmer(form, data = dd2)
rr <- as.data.frame(ranef(m))
rr$miss <- with(rr,
                ifelse(grp %in% 1:2, "miss1",
                ifelse(grp %in% 3:4, "miss2", "nomiss")))
ggplot(rr, aes(y = grp, x = condval,
               xmin = condval - 2 * condsd,
               xmax = condval + 2 * condsd,
               colour = miss)) +
  geom_pointrange() +
  facet_wrap(~ term)

How to find time-varying coefficients for a VAR model by using the Kalman Filter

I'm trying to write some code in R to reproduce the model I found in this article.
The idea is to model the signal as a VAR process, but to fit the coefficients with a Kalman filter. This would essentially give me a robust, time-varying VAR(p) model and let me analyze non-stationary data to some degree.
The state-space model used to track the coefficients is:
X(t) = F(t) X(t-1) + W(t)
Y(t) = H(t) X(t) + E(t),
where H(t) is the Kronecker product between the lagged measurements of my time series Y and a unit vector, and X(t) plays the role of the regression coefficients. F(t) is taken to be the identity matrix, which should amount to assuming that the coefficients evolve as a random walk.
In the article, the state-noise covariance matrix Q(t) of W(t) is initially set to 10^-3 and then fitted by some iteration scheme. The observation-noise covariance matrix R(t) of E(t) is substituted by the covariance of the noise term left unexplained by the model: Y(t) - H(t)Xhat(t).
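Inside the recursion below, I imagine that substitution would look something like this (my own guess at the update, not code from the article):

## my guess at the R(t) substitution, inside the loop (not from the article):
innov <- Y[, j] - H %*% Xhatminus   # the part of Y(t) the current coefficients don't explain
R <- innov %*% t(innov)             # outer product as a one-sample covariance estimate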
Based on other sources, I write the a priori covariance matrix of the estimation error (denoted Σ in the article) as Pmin and the a posteriori one as P; the propagated matrix Pplus from one recursion becomes the a priori Pmin of the next, if that makes sense.
So far I've written the following, based on the article's Appendix A 1.2:
## vec() is not in base R; here I assume it comes from a package such as matrixcalc
library(matrixcalc)
Y <- *my time series; for test purposes, two channels of 3000 points*
F <- diag(8)                        # F is (m^2*p x m^2*p), where m = 2 dimensions and p = 2 lags
H <- diag(2) %x% t(vec(Y[, 1:2]))   # Kronecker product of the vectorized lags Y(t-1) and Y(t-2)
Xhatminus <- matrix(1, 8, 1)        # an arbitrary a priori coefficient vector
Q <- diag(8) %x% (10^-7)            # a diagonal matrix with really small entries; found it used in some examples
R <- 1                              # didn't know what else to put here just yet
Pmin <- diag(8)                     # a priori error covariance estimate, just some 1s
Now the recursion should start. To test it, I just took the first 3000 points of one trial of my data.
Xhatstorage <- matrix(0, 8, 3000)
for (j in 3:3000) {
  H <- diag(2) %x% t(vec(Y[, (j-2):(j-1)]))
  K <- (Pmin %*% t(H)) %*% solve(H %*% Pmin %*% t(H) + R)   # solve() gives the inverse matrix ()^-1
  P <- Pmin - K %*% H %*% Pmin
  Xhatplus <- F %*% (Xhatminus + K %*% (Y[, j] - H %*% Xhatminus))
  Pplus <- (F %*% P %*% F) + Q
  Xhatminus <- Xhatplus
  Xhatstorage[, j] <- Xhatplus
  Pmin <- Pplus
}
I extracted the Xhatplus values into a storage matrix and used them to build this primitive VAR model:
Yhat <- array(0, 3000)
for (t in 3:3000) {
  Yhat[t] <- t(vec(Y[, (t-2)])) %*% Xhatstorage[c(1, 3), t] +
             t(vec(Y[, (t-1)])) %*% Xhatstorage[c(2, 4), t]
}
The result looks like this: [plot omitted - blue line: VAR with the Kalman-filter-estimated coefficients; black: original data].
I'm having trouble understanding how I can better evaluate my coefficients. Why is the fit so far off?
How should I choose the first a priori and a posteriori estimates to start the recursion? I'm sure that adding more lags to the VAR is not the issue; it's that I don't know how to choose the initial values for Pmin and Xhatminus. Most sources I pieced this together from start from arbitrary zero assumptions in toy models, but in this case choosing any of those matrices as zero just collapses the entire algorithm.
Lastly, is this recursion even a correct implementation of what Oya et al. describe in the article? I know I'm still missing the R evaluation based on the previously unexplained errors (V(t) in Appendix A 1.2), but in general?

Defining the distance matrix in AMPL - keep getting "i is not defined"

I'm trying to set up an AMPL model which clusters given points in a 2-dimensional space according to the model of Saglam et al. (2005). For testing purposes I want to randomly generate some data points and then calculate the Euclidean distance matrix for them (since I need it). I'm aware that I could supply only the distance matrix without the data points, but in a later step the data points will be given, and then I will need to calculate the distances between the points.
Below you'll find the code I've written so far. While loading the model I keep getting the error message "i is not defined". Since i is a subscript that should run over x1, and x1 is a parameter defined over the set D with one subscript, I cannot figure out why this code should be invalid. As far as I understand, I don't have to declare variables if I use them only as subscripts?
reset;
# parameters to define clustered
param m; # numbers of data points
param n; # numbers of clusters
# sets
set D := 1..m; #points to be clustered
set L := 1..n; #clusters
# randomly generate datapoints
param x1 {D} = Uniform(1,m);
param x2 {D} = Uniform(1,m);
param d {D,D} = sqrt((x1[i]-x1[j])^2 + (x2[i]-x2[j])^2);
# variables
var x {D, L} binary;
var D_l {L} >=0;
var D_max >= 0;
#minimization funcion
minimize max_clus_dis: D_max;
# constraints
subject to C1 {i in D, j in D, l in L}: D_l[l] >= d[i,j] * (x[i,l] + x[j,l] - 1);
subject to C2 {i in D}: sum{l in L} x[i,l] = 1;
subject to C3 {l in L}: D_max >= D_l[l];
So far I tried to change the param x1 line to
param x1 {i in D, j in D} = ...
as well as
param d {x1, x2} = ...
Alas, neither of these helped. So any help someone can offer is deeply appreciated. I searched the web but found nothing useful for my task.
I eventually found what was missing. The line in which I calculate the parameter d should be
param d {i in D, j in D} = sqrt((x1[i]-x1[j])^2 + (x2[i]-x2[j])^2);
In retrospect it's clear that the subscripts i and j have to be declared on that line; I don't know how I could have missed that.

Find ranges to create a uniform histogram

I need to find ranges in order to create a uniform histogram, i.e. for ages
split into 4 ranges:
data_set = [18,21,22,24,27,27,28,29,30,32,33,33,42,42,45,46]
Is there a function that gives me the ranges so that the histogram is uniform?
In this case:
ranges = [(18,24), (27,29), (30,33), (42,46)]
This example is easy; I'd like to know if there is an algorithm that deals with complex data sets as well.
Thanks.
You are looking for the quantiles that split your data into equal groups. This, combined with cut, should work. So, suppose you want n groups.
set.seed(1)
x <- rnorm(1000) # Generate some toy data
n <- 10
uniform <- cut(x, c(-Inf, quantile(x, prob = (1:(n-1))/n), Inf)) # Determine the groups
plot(uniform)
Edit: now corrected to yield the correct cuts at the ends.
Edit 2: I don't quite understand the downvote, but this also works in your example:
data_set = c(18,21,22,24,27,27,28,29,30,32,33,33,42,42,45,46)
n <- 4
groups <- cut(data_set, breaks = c(-Inf, quantile(data_set, prob = 1:(n-1)/n), Inf))
levels(groups)
Some minor renaming is necessary. For slightly better level names, you could also put in min(x) and max(x) instead of -Inf and Inf.
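For instance, a variant along those lines might look like this (groups2 is my name; note that include.lowest is needed so the minimum value itself is kept):

## variant with min/max endpoints for nicer level names
groups2 <- cut(data_set,
               breaks = c(min(data_set),
                          quantile(data_set, prob = 1:(n-1)/n),
                          max(data_set)),
               include.lowest = TRUE)
levels(groups2)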