Missing data in longitudinal data and fitted by linear mixed model - missing-data

I have a question about handling the following missing-data scenario with a linear mixed-effects model.
Suppose I have a closed longitudinal cohort followed for six years. There are 1500 individuals at the initial wave.
The available observations at each wave are as follows:
Wave 1: 1500
Wave 2: 1400
Wave 3: 1000
Wave 4: 800
Wave 5: 500
Wave 6: 67
There are two reasons for the missing observations. First, people dropped out. Second, data collection is ongoing, and not all individuals have been interviewed yet (this is more likely in the later waves).
I know a linear mixed-effects model fitted by maximum likelihood can handle missing data if it is MAR or MCAR. My question is: if I assume the missingness in my data set is at random, should I drop a specific wave with a substantial amount of missingness (here, wave 6) to avoid biased estimates?
The model I would like to run is the following:
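library(lme4)  # provides lmer()
m_Kunkle_exe <- lmer(cs_exec_fn ~ PRS_Kunkle*AgeAtVisit*APOE_score +
    PRS_Kunkle*I(AgeAtVisit^2)*APOE_score +
    gender + EdYears_Coded_Max20 + VisNo + famhist + X1 + X2 + X3 + X4 + X5 +
    (1 | family/DBID),  # random intercepts for family and for DBID nested in family
    data = WRAP_all, REML = FALSE)  # REML = FALSE requests maximum likelihood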
Many thanks


systemfit 3SLS Testing for Overidentification Restrictions

I'm currently struggling to find a good way to perform the Hansen/Sargan test of overidentifying restrictions for a Three-Stage Least Squares (3SLS) model on panel data in R. I spent the whole day searching and couldn't find a way to run the test in R with the well-known systemfit package.
Currently, my code is simple.
violence_c_3sls <- Crime ~ ln_GDP +I(ln_GDP^2) + ln_Gini
income_c_3sls <-ln_GDP ~ Crime + ln_Gini
gini_c_3sls <- ln_Gini ~ ln_GDP + I(ln_GDP^2) + Crime
inst <- ~ Educ_Gvmnt_Exp + I(Educ_Gvmnt_Exp^2)+ Health_Exp + Pov_Head_Count_1.9
system_c_3sls <- list(violence_c_3sls, income_c_3sls, gini_c_3sls)
fitsur_c_3sls <-systemfit(system_c_3sls, "3SLS",inst=inst, data=df_new, methodResidCov = "noDfCor" )
summary(fitsur_c_3sls)
However, adding more instruments to create an over-identified system does not produce any Hansen/Sargan test output, so I assume the test has to be computed separately from the summary, probably from the systemfit class object.
Thanks in advance.
With g equations, l exogenous variables, and k regressors, the Sargan test for 3SLS is
$$u^\top \left( \Sigma^{-1} \otimes P_W \right) u \;\sim\; \chi^2(gl - k) \quad \text{(asymptotically, under the null)}$$
where u is the stacked residuals, \Sigma is the estimated residual covariance, and P_W is the projection matrix on the exogenous variables. See Ch 12.4 from Davidson & MacKinnon ETM.
Calculating the Sargan test from systemfit should look something like this:
library(stringr) # provides str_detect() and str_remove() used below
sargan.systemfit=function(results3sls){
result <- list()
u=as.matrix(resid(results3sls)) #model residuals, n x n_eq
n_eq=length(results3sls$eq) # number of equations
n=nrow(u) #number of observations
n_reg=length(coef(results3sls)) # total number of regressors
w=model.matrix(results3sls,which='z') #Matrix of instruments, in block diagonal form with one block per equation
#Need to aggregate into a single block (in case different instruments used per equation)
w_list=lapply(X = 1:n_eq,FUN = function(eq_i){
this_eq_label=results3sls$eq[[eq_i]]$eqnLabel
this_w=w[str_detect(rownames(w),this_eq_label),str_detect(colnames(w),this_eq_label)]
colnames(this_w)=str_remove(colnames(this_w),paste0(this_eq_label,'_'))
return(this_w)
})
w=do.call(cbind,w_list)
w=w[,!duplicated(colnames(w))]
n_inst=ncol(w) #w is n x n_inst, where n_inst is the number of unique instruments/exogenous variables
#estimate residual variance (or use residCov, should be asymptotically equivalent)
var_u=crossprod(u)/n #var_u=results3sls$residCov
P_w=w%*%solve(crossprod(w))%*%t(w) #Projection matrix on instruments w
#as.numeric(u) vectorizes the residuals into a n_eq*n x 1 vector.
result$statistic <- as.numeric(t(as.numeric(u))%*%kronecker(solve(var_u),P_w)%*%as.numeric(u))
result$df <- n_inst*n_eq-n_reg
result$p.value <- 1 - pchisq(result$statistic, result$df)
result$method = paste("Sargan over-identifying restrictions test")
return(result)
}

Taking the difference of 2 nodes in a decision problem while keeping the model as an MILP

To explain the question, it's best to start with this picture (not reproduced here).
I am modeling an optimization decision problem and a feature that I'm trying to implement is heat transfer between the process stages (a = 1, 2) taking into account which equipment type is chosen (j = 1, 2, 3) by the binary decision variable y.
The temperatures for the equipment are fixed values and my goal is to find (in the case of the picture) dT = 120 - 70 = 50 while keeping the temperature difference as a parameter (I want to keep the problem linear and need to multiply the temperature difference with a variable later on).
Things I have tried:
dT = T[a,j] - T[a-1,j]
(this obviously gives T = 80 for T[a-1,j] which is incorrect)
T[a-1] = sum(T[a-1,j] * y[a-1,j] for j in (1,2,3))
This will make the problem non-linear when I multiply with another variable.
I am using pyomo and the linear "glpk" solver. Thank you for reading my post and if someone could help me with this it is greatly appreciated!
If you only have 2 stages and 3 pieces of equipment at each stage, you could reformulate: let a binary decision variable Y[i] represent each of the 9 possible connections, and let delta_T[i] be a parameter holding the temperature difference of the same connection. These 9 differences are easy to precompute and store in a model parameter.
If you want to keep it double-indexed, and assuming that exactly 1 piece of equipment is selected at each stage, you could take the sum-product of the selection variable and the temperatures at each stage and subtract them.
dT[a] = sum(T[a, j]*y[a, j] for j in J) - sum(T[a-1, j]*y[a-1, j] for j in J)
for a ∈ {2, 3, ..., N}
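Both options can be sketched in plain Python before committing to a model. The temperatures below are hypothetical placeholders (only the 70 and 120 come from the question), and the selection `y` is fixed by hand purely for illustration; in a real model it would be a binary decision variable.

```python
from itertools import product

J = (1, 2, 3)  # equipment types

# Hypothetical fixed temperatures T[a, j] for stage a and equipment j.
T = {
    (1, 1): 70.0, (1, 2): 80.0, (1, 3): 90.0,
    (2, 1): 120.0, (2, 2): 130.0, (2, 3): 140.0,
}

# Option 1: one precomputed parameter per connection
# (equipment i at stage 1 -> equipment j at stage 2), 9 entries in total.
delta_T = {(i, j): T[2, j] - T[1, i] for i, j in product(J, J)}

# Option 2: sum-product of selection and temperature at each stage.
# y is fixed here for illustration: equipment 1 is chosen at both stages.
y = {(a, j): 0 for a in (1, 2) for j in J}
y[1, 1] = 1
y[2, 1] = 1
dT = sum(T[2, j] * y[2, j] for j in J) - sum(T[1, j] * y[1, j] for j in J)

print(delta_T[1, 1], dT)  # both describe the same connection
```

With option 1, `delta_T` stays a pure parameter, so multiplying it by a variable later keeps the model linear; with option 2, `dT` is an expression in `y`, so multiplying it by another variable is what introduces the product of variables.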

Bayesian estimation of log-normal using JAGS

I am trying to find 95% credible intervals for 50 sample means. Sample sizes range from 2 to 600, and the values in each sample are bounded between 1 and 5.
ex:
sample 1 = (1,3.5,2.8,5,4.6)
sample 2 = (1,5)
sample 3 = (4.1,1.1,5,3.5,2,2.4,...)
Samples of size 10 or more follow a lognormal distribution, so I used JAGS for Bayesian estimation of the log-normal parameters, adapted from John K. Kruschke, with the model specification below:
modelstring = "
model {
for( i in 1 : N ) {
y[i] ~ dlnorm( muOfLogY , 1/sigmaOfLogY^2 )
}
sigmaOfLogY ~ dunif( 0.001*sdOfLogY , 1000*sdOfLogY )
muOfLogY ~ dunif( 0.001*meanOfLogY , 1000*meanOfLogY )
muOfY <- exp(muOfLogY+sigmaOfLogY^2/2)
modeOfY <- exp(muOfLogY-sigmaOfLogY^2)
sigmaOfY <- sqrt(exp(2*muOfLogY+sigmaOfLogY^2)*(exp(sigmaOfLogY^2)-1))
}
"
The model works fine with sample sizes of 10 or more. However, with sample sizes from 3 to 9 I get extreme values for the upper limit (e.g., 3000), which exceed the maximum possible value of the mean (5).
With a sample size of 2, I get the error below:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
NA/NaN/Inf in 'y'
I am new to JAGS and can't figure out how to solve these issues. I think that for samples of size < 10 the distribution is no longer lognormal!
Any ideas?
Thank you
First a semantic note. You are not using JAGS to find sample means. You are using JAGS to find the means of the populations from which the samples arose. If you wanted to find the sample (log)means, you could just take the mean of the (logarithms of the) sample values.
Now, if the values in each sample are bounded between 1 and 5 (due to some external constraint), then the sample is NEVER drawn from a log-normal distribution, which inherently puts probability mass over values greater than five.
Let's imagine, for the sake of argument, that the samples do arise from lognormal sampling (and therefore aren't inherently bounded between 1 and 5). Then JAGS is simply telling you that there is not enough information contained in the sample to get a good estimate of the mean of the population from which it is drawn. I wouldn't worry about understanding the error when the sample size is two, because there is literally no way to get good inference about the population mean from two observations. This is true even if you know that the population is indeed log-normally distributed. And since your populations are not actually log-normally distributed (they are bounded between 1 and 5), the entire inferential procedure is invalid anyway.
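To make the point about the bounds concrete, here is a rough sketch using only the Python standard library. It is not the Bayesian analysis from the question: it simply plugs the sample mean and standard deviation of the logs of "sample 1" into a log-normal and asks how much probability that distribution places outside [1, 5].

```python
from math import log
from statistics import NormalDist, mean, stdev

sample = [1, 3.5, 2.8, 5, 4.6]  # "sample 1" from the question

# The sample log-mean needs no JAGS: it is just the mean of the logs.
logs = [log(v) for v in sample]
mu_log = mean(logs)
sd_log = stdev(logs)

# Under a log-normal model, log(Y) is normal with these (plug-in) parameters.
log_y = NormalDist(mu_log, sd_log)
p_above_5 = 1 - log_y.cdf(log(5))  # probability mass above the upper bound
p_below_1 = log_y.cdf(log(1))      # probability mass below the lower bound

print(f"log-mean = {mu_log:.3f}, log-sd = {sd_log:.3f}")
print(f"P(Y > 5) = {p_above_5:.3f}, P(Y < 1) = {p_below_1:.3f}")
```

A noticeable share of the fitted distribution falls above 5, which bounded data can never produce.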

What is the definition of a truncated polynomial?

In NTRUEncrypt, I have seen truncated polynomials, but I cannot understand how calculations with truncated polynomials work.
So, could anyone tell me how we calculate with truncated polynomials?
The polynomials are truncated in the sense that they only have coefficients up to a certain degree.
Here is how you truncate the product of two truncated polynomials (the sum is trivial):
Assume you have two truncated polynomials, i.e. two polynomials of degree no greater than n-1
a = a[0] + a[1]X + ... + a[n-1]X^(n-1)
b = b[0] + b[1]X + ... + b[n-1]X^(n-1)
Then their "truncated" product is defined as the polynomial
a * b = c[0] + c[1]X + ... +c[n-1]X^(n-1)
where the c[k] coefficients are computed as follows:
1. Reverse b[0]..b[n-1] to get b[n-1]..b[0].
2. Rotate the result of step 1 above k+1 times to the right and get b[k]..b[0]b[n-1]..b[k+1]
3. Denote with b_k[0]..b_k[n-1] the array calculated in 2.
Now define
c[k] = a[0]b_k[0] + a[1]b_k[1] + ... + a[n-1]b_k[n-1].
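This operation can also be carried out by multiplying the polynomials a and b in the usual way and then reducing the result modulo X^n - 1, i.e. replacing X^n by 1 so that the coefficient of X^(n+k) is added to the coefficient of X^k. Note that this is not the same as simply discarding the terms of degree n and above. The algorithm above produces the wrapped-around result directly, without first forming the full product of degree 2n-2.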
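The reverse-and-rotate steps above can be sketched in Python as follows. The function name `truncated_product` is made up for this illustration, coefficients are plain integers (NTRU additionally reduces them modulo a small integer, which is omitted here), and the result is cross-checked against the index formula c[k] = sum of a[i] * b[(k - i) mod n].

```python
def truncated_product(a, b):
    """Product of two coefficient lists of equal length n, following the steps above."""
    n = len(a)
    assert len(b) == n
    reversed_b = b[::-1]  # step 1: b[n-1]..b[0]
    c = []
    for k in range(n):
        shift = (k + 1) % n  # rotating n times is the same as not rotating
        # step 2: rotate right by k + 1 positions -> b[k]..b[0] b[n-1]..b[k+1]
        b_k = reversed_b[-shift:] + reversed_b[:-shift] if shift else reversed_b
        # c[k] = a[0]*b_k[0] + ... + a[n-1]*b_k[n-1]
        c.append(sum(a[i] * b_k[i] for i in range(n)))
    return c


a = [1, 2, 3]  # 1 + 2X + 3X^2
b = [4, 5, 6]  # 4 + 5X + 6X^2
c = truncated_product(a, b)

# Cross-check with the direct index formula.
n = len(a)
direct = [sum(a[i] * b[(k - i) % n] for i in range(n)) for k in range(n)]
print(c, direct)
```

For this input the ordinary product is 4 + 13X + 28X^2 + 27X^3 + 18X^4, and the steps give [31, 31, 28]: the X^3 and X^4 coefficients have been folded back onto X^0 and X^1.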

Is there an iterative way to calculate radii along a scanline?

I am processing a series of points which all have the same Y value, but different X values. I go through the points by incrementing X by one. For example, I might have Y = 50 and X is the integers from -30 to 30. Part of my algorithm involves finding the distance to the origin from each point and then doing further processing.
After profiling, I've found that the sqrt call in the distance calculation is taking a significant amount of my time. Is there an iterative way to calculate the distance?
In other words:
I want to efficiently calculate: r[n] = sqrt(x[n]*x[n] + y*y). I can save information from the previous iteration. Each iteration changes by incrementing x, so x[n] = x[n-1] + 1. I cannot use sqrt or trig functions, except at the beginning of each scanline, because they are too slow.
I can use approximations as long as they are good enough (less than 0.1% error) and the errors introduced are smooth (I can't bin to a pre-calculated table of approximations).
Additional information:
x and y are always integers between -150 and 150
I'm going to try a couple ideas out tomorrow and mark the best answer based on which is fastest.
Results
I did some timings
Distance formula: 16 ms / iteration
Pete's interpolating solution: 8 ms / iteration
wrang-wrang pre-calculation solution: 8ms / iteration
I was hoping the test would decide between the two, because I like both answers. I'm going to go with Pete's because it uses less memory.
Just to get a feel for it, for your range y = 50, x = 0 gives r = 50 and y = 50, x = +/- 30 gives r ~= 58.3. You want an approximation good to +/- 0.1%, or about +/- 0.05 absolute. That's a lot lower accuracy than most library sqrt implementations provide.
Two approximate approaches - you calculate r based on interpolating from the previous value, or use a few terms of a suitable series.
Interpolating from previous r
$$r = (x^2 + y^2)^{1/2}$$
$$\frac{dr}{dx} = \tfrac{1}{2} \cdot 2x \cdot (x^2 + y^2)^{-1/2} = \frac{x}{r}$$
double r = 50; // exact radius at x = 0 for y = 50
for ( int x = 0; x <= 30; ++x ) {
    double r_true = Math.sqrt ( 50*50 + x*x );
    // arguments follow the order of the labels: exact value first, then the running approximation
    System.out.printf ( "x: %d r_true: %f r_approx: %f error: %f%%\n", x, r_true, r, 100 * Math.abs ( r_true - r ) / r );
    r = r + ( x + 0.5 ) / r; // step to x + 1 using the slope x/r taken at the midpoint
}
Gives:
x: 0 r_true: 50.000000 r_approx: 50.000000 error: 0.000000%
x: 1 r_true: 50.009999 r_approx: 50.010000 error: 0.000002%
....
x: 29 r_true: 57.801384 r_approx: 57.825065 error: 0.040953%
x: 30 r_true: 58.309519 r_approx: 58.335225 error: 0.044065%
which seems to meet the requirement of 0.1% error, so I didn't bother coding the next one, as it would require quite a bit more calculation steps.
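As a side note on the update r = r + (x + 0.5) / r: the half step can be motivated by a short worked derivation (my own reading of the code, writing r_n for the radius at x = n), which shows the update is the exact difference with the average of two consecutive radii replaced by the current one.

```latex
\begin{align*}
r_{n+1}^2 - r_n^2 &= \bigl((x+1)^2 + y^2\bigr) - \bigl(x^2 + y^2\bigr) = 2x + 1 \\
r_{n+1} - r_n &= \frac{2x + 1}{r_{n+1} + r_n}
  \approx \frac{2x + 1}{2\,r_n} = \frac{x + \tfrac{1}{2}}{r_n}
\end{align*}
```

Because r_{n+1} is slightly larger than r_n, each step overshoots a little, which is why the approximation drifts above the true radius as x grows.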
Truncated Series
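The Taylor series for sqrt(1 + x) for x near zero is
$$\sqrt{1 + x} = 1 + \tfrac{1}{2}x - \tfrac{1}{8}x^2 + \tfrac{1}{16}x^3 - \dots$$
where the x^n term has a magnitude of the order of (1/2)^{n+1} x^n.
Using $r = y\sqrt{1 + (x/y)^2}$, the series variable is (x/y)^2, which is at most (30/50)^2 = 0.36 in your example. You are then looking for a term $t = (1/2)^{n+1} \cdot 0.36^n$ with magnitude less than 0.001, i.e. log(0.002) > n log(0.18), or n > 3.6, so taking terms up to x^4 in the series should be OK.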
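A quick numerical check of the truncated series, as a sketch for the example scanline only (y = 50, x from -30 to 30). The coefficients are the standard binomial-series coefficients of sqrt(1 + u) up to u^4, and the name `r_series` is made up for this illustration. The series only converges for |x| < |y|.

```python
from math import sqrt

# Binomial-series coefficients of sqrt(1 + u) for u^0 .. u^4.
COEFFS = (1.0, 1 / 2, -1 / 8, 1 / 16, -5 / 128)


def r_series(x, y):
    """Approximate sqrt(x*x + y*y) as |y| * sqrt(1 + u) with u = (x/y)^2, |x| < |y|."""
    u = (x / y) ** 2
    return abs(y) * sum(c * u**n for n, c in enumerate(COEFFS))


y = 50
max_rel_error = max(
    abs(r_series(x, y) - sqrt(x * x + y * y)) / sqrt(x * x + y * y)
    for x in range(-30, 31)
)
print(f"max relative error: {100 * max_rel_error:.4f}%")
```

Note that this costs several multiplications per point, so it is not obviously cheaper than the incremental update.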
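Y=10000
Y2=Y*Y
for x=0..Y do            // x <= y below, so x*s never exceeds Y
    D[x]=sqrt(Y2+x*x)
norm(x,y)=               // assumes x, y >= 0; take absolute values first
    if (y==0) x
    else if (x>y) norm(y,x)
    else {
        s=Y/y            // scale so that s*(x,y) lies on the line y=Y
        D[round(x*s)]/s
    }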
If your coordinates are smooth, then the idea can be extended with linear interpolation. For more precision, increase Y.
The idea is that s*(x,y) is on the line y=Y, which you've precomputed distances for. Get the distance, then divide it by s.
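A runnable Python sketch of the same table-lookup idea, under the assumptions stated in the question (integer coordinates between -150 and 150). Taking absolute values and the explicit zero check are additions for this illustration.

```python
from math import hypot, sqrt

Y = 10000
# D[x] is the distance from the origin to the point (x, Y), for x = 0..Y.
D = [sqrt(Y * Y + x * x) for x in range(Y + 1)]


def norm(x, y):
    """Approximate sqrt(x*x + y*y) by scaling (x, y) onto the line y = Y."""
    x, y = abs(x), abs(y)
    if x > y:
        x, y = y, x  # make y the larger coordinate
    if y == 0:
        return 0.0
    s = Y / y  # s * (x, y) lies on the precomputed line
    return D[round(x * s)] / s


# Worst relative error over the coordinate range from the question.
worst = max(
    abs(norm(x, y) - hypot(x, y)) / hypot(x, y)
    for x in range(-150, 151)
    for y in range(-150, 151)
    if (x, y) != (0, 0)
)
print(f"worst relative error: {100 * worst:.5f}%")
```

The table costs Y + 1 doubles of memory, which is the trade-off against the incremental approach mentioned in the timings.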
I assume you really do need the distance and not its square.
You may also be able to find a general sqrt implementation that sacrifices some accuracy for speed, but I have a hard time imagining that beating what the FPU can do.
By linear interpolation, I mean to change D[round(x)] to:
f=floor(x)
a=x-f
D[f]*(1-a)+D[f+1]*a
This doesn't really answer your question, but may help...
The first questions I would ask would be:
"do I need the sqrt at all?".
"If not, how can I reduce the number of sqrts?"
then yours: "Can I replace the remaining sqrts with a clever calculation?"
So I'd start with:
Do you need the exact radius, or would radius-squared be acceptable? There are fast approximations to sqrt, but probably not accurate enough for your spec.
Can you process the image using mirrored quadrants or eighths? By processing all pixels at the same radius value in a batch, you can reduce the number of calculations by 8x.
Can you precalculate the radius values? You only need a table that is a quarter (or possibly an eighth) of the size of the image you are processing, and the table would only need to be precalculated once and then re-used for many runs of the algorithm.
So clever maths may not be the fastest solution.
Well, you can always try to optimize your sqrt; the fastest one I've seen is the old Carmack Quake 3 inverse sqrt:
http://betterexplained.com/articles/understanding-quakes-fast-inverse-square-root/
That said, since sqrt is non-linear, you're not going to be able to do simple linear interpolation along your line to get your result. The best idea is to use a table lookup since that will give you blazing fast access to the data. And, since you appear to be iterating by whole integers, a table lookup should be exceedingly accurate.
Well, you can mirror around x=0 to start with (you need only compute n>=0, and then dupe those results to the corresponding n<0). After that, I'd take a look at using the derivative of sqrt(a^2+b^2) (or the corresponding sin) to take advantage of the constant dx.
If that's not accurate enough, may I point out that this is a pretty good job for SIMD, which will provide you with a reciprocal square root op on both SSE and VMX (and shader model 2).
This is sort of related to a HAKMEM item:
ITEM 149 (Minsky): CIRCLE ALGORITHM
Here is an elegant way to draw almost circles on a point-plotting display:
NEW X = OLD X - epsilon * OLD Y
NEW Y = OLD Y + epsilon * NEW(!) X
This makes a very round ellipse centered at the origin with its size determined by the initial point. epsilon determines the angular velocity of the circulating point, and slightly affects the eccentricity. If epsilon is a power of 2, then we don't even need multiplication, let alone square roots, sines, and cosines! The "circle" will be perfectly stable because the points soon become periodic.
The circle algorithm was invented by mistake when I tried to save one register in a display hack! Ben Gurley had an amazing display hack using only about six or seven instructions, and it was a great wonder. But it was basically line-oriented. It occurred to me that it would be exciting to have curves, and I was trying to get a curve display hack with minimal instructions.
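For reference, the update rule quoted above can be tried out with a few lines of Python. This is only a sketch of the HAKMEM recurrence with an arbitrarily chosen starting point and epsilon; it walks around a near-circle and does not by itself produce the radii along a scanline.

```python
def minsky_circle(x, y, epsilon, steps):
    """Iterate the HAKMEM item 149 recurrence and return the visited points."""
    points = []
    for _ in range(steps):
        x = x - epsilon * y
        y = y + epsilon * x  # deliberately uses the NEW x
        points.append((x, y))
    return points


# Arbitrary illustration values: start at (100, 0), epsilon = 1/16 (a power of 2).
points = minsky_circle(100.0, 0.0, 1 / 16, 200)
radii = [(px * px + py * py) ** 0.5 for px, py in points]
print(f"radius stays between {min(radii):.2f} and {max(radii):.2f}")
```

The radius wobbles slightly around 100 rather than drifting, which is the "very round ellipse" the quote describes.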