Specify logit function explicitly in WinBUGS/OpenBUGS - bayesian

I'm new to OpenBUGS and I got some problem in fitting a model with the logit() function.
Reading around I found that one possible solution for this would be explicit specify the logit function without using the WinBUGS’ own logit function:
In more complex models, we have fairly often experienced problems when
using WinBUGS’ own logit function, for instance with achieving
convergence (actually, problems may arise even with fairly simple
models.). Therefore,it is often better to specify that transformation
explicitly by logit.p[i] <- log(p[i] / (1 – p[i])), p[i] <-
exp(logit.p[i]) / (1 + exp(logit.p[i])) or p[i] <- 1 / (1 + exp(-
logit.p[i])).
(more information here http://www.mbr-pwrc.usgs.gov/software/kerybook/AppendixA_list_of_WinBUGS_tricks.pdf at point 14.).
The problem is that I don't understand how to do that, let's suppose that my original likelihood function, using the WinBUGS integrated logit function, was:
for (i in 1:n){
y[i] ~ dbern(p[i])
logit(p[i]) <- beta[1] + beta[2]*x1[i] + beta[3]*x2[i] + beta[4]*x3[i]
}
How I explicit write that?
Thank you very much.
Vincenzo

Thanks to a colleague, I found the way to explicitly specify a logit function in OpenBUGS, the working code is the following:
for (i in 1:n){
y[i] ~ dbern(logit.p[i])
logit.p[i] <- 1 / (1 + exp(-p[i]))
p[i] <- beta[1] + beta[2]*x1[i] + beta[3]*x2[i] + beta[4]*x3[i]
}

Related

Custom windowed operator in tensorflow

I want to implement next formula: Si = (x1 - w1) + ... + (xn - wn) like a Conv=x1*w1 + ... xn*wn for some area under X (input tensor) and kernel W. Important that this operation repeat on all areas X with slice W with stride and padding params like a simple convolution.
How do this?
I found similar question a while ago on stackoverflow. But it finished on custom implementation on C++ and compilation or changes CUDA source or something like that.
Is there an easier way today?
If I get you correctly then you compute Si=(x1 + ... + xn)-(w1 + ... + wn)? The sum of the weights is a single number, so you don't have a "Kernel" anymore. The first sum you can compute via tf.nn.conv2d and a filter that is initialized with tf.ones. But I don't think that's what you meant to do, so could you maybe specify your question further?

Running a logistic model in JAGS - Can you vectorize instead of looping over individual cases?

I'm fairly new to JAGS, so this may be a dumb question. I'm trying to run a model in JAGS that predicts the probability that a one-dimensional random walk process will cross boundary A before crossing boundary B. This model can be solved analytically via the following logistic model:
Pr(A,B) = 1/(1 + exp(-2 * (d/sigma) * theta))
where "d" is the mean drift rate (positive values indicate drift toward boundary A), "sigma" is the standard deviation of that drift rate and "theta" is the distance between the starting point and the boundary (assumed to be equal for both boundaries).
My dataset consists of 50 participants, who each provide 1800 observations. My model assumes that d is determined by a particular combination of observed environmental variables (which I'll just call 'x'), and a weighting coefficient that relates x to d (which I'll call 'beta'). Thus, there are three parameters: beta, sigma, and theta. I'd like to estimate a single set of parameters for each participant. My intention is to eventually run a hierarchical model, where group level parameters influence individual level parameters. However, for simplicity, here I will just consider a model in which I estimate a single set of parameters for one participant (and thus the model is not hierarchical).
My model in rjags would be as follows:
model{
for ( i in 1:Ntotal ) {
d[i] <- x[i] * beta
probA[i] <- 1/(1+exp(-2 * (d[i]/sigma) * theta ) )
y[i] ~ dbern(probA[i])
}
beta ~ dunif(-10,10)
sigma ~ dunif(0,10)
theta ~ dunif(0,10)
}
This model runs fine, but takes ages to run. I'm not sure how JAGS carries out the code, but if this code were run in R, it would be rather inefficient because it would have to loop over cases, running the model for each case individually. The time required to run the analysis would therefore increase rapidly as the sample size increases. I have a rather large sample, so this is a concern.
Is there a way to vectorise this code so that it can calculate the likelihood for all of the data points at once? For example, if I were to run this as a simple maximum likelihood model. I would vectorize the model and calculate the probability of the data given particular parameter values for all 1800 cases provided by the participant (and thus would not need the for loop). I would then take the log of these likelihoods and add them all together to give a single loglikelihood for the all observations given by the participant. This method has enormous time savings. Is there a way to do this in JAGS?
EDIT
Thanks for the responses, and for pointing out that the parameters in the model I showed might be unidentified. I should've pointed out that model was a simplified version. The full model is below:
model{
for ( i in 1:Ntotal ) {
aExpectancy[i] <- 1/(1+exp(-gamma*(aTimeRemaining[i] - aDiscrepancy[i]*aExpectedLag[i]) ) )
bExpectancy[i] <- 1/(1+exp(-gamma*(bTimeRemaining[i] - bDiscrepancy[i]*bExpectedLag[i]) ) )
aUtility[i] <- aValence[i]*aExpectancy[i]/(1 + discount * (aTimeRemaining[i]))
bUtility[i] <- bValence[i]*bExpectancy[i]/(1 + discount * (bTimeRemaining[i]))
aMotivationalValueMean[i] <- aUtility[i]*aQualityMean[i]
bMotivationalValueMean[i] <- bUtility[i]*bQualityMean[i]
aMotivationalValueVariance[i] <- (aUtility[i]*aQualitySD[i])^2 + (bUtility[i]*bQualitySD[i])^2
bMotivationalValueVariance[i] <- (aUtility[i]*aQualitySD[i])^2 + (bUtility[i]*bQualitySD[i])^2
mvDiffVariance[i] <- aMotivationalValueVariance[i] + bMotivationalValueVariance[i]
meanDrift[i] <- (aMotivationalValueMean[i] - bMotivationalValueMean[i])
probA[i] <- 1/(1+exp(-2*(meanDrift[i]/sqrt(mvDiffVariance[i])) *theta ) )
y[i] ~ dbern(probA[i])
}
In this model, the estimated parameters are theta, discount, and gamma, and these parameters can be recovered. When I run the model on the observations for a single participant (Ntotal = 1800), the model takes about 5 minutes to run, which is totally fine. However, when I run the model on the entire sample (45 participants x 1800 cases each = 78,900 observations), I've had it running for 24 hours and it's less than 50% of the way through. This seems odd, as I would expect it to just take 45 times as long, so 4 or 5 hours at most. Am I missing something?
I hope I am not misreading this situation (and I previously apologize if I am), but your question seems to come from a conceptual misunderstanding of how JAGS works (or WinBUGS or OpenBUGS for that matter).
Your program does not actually run, because what you wrote was not written in a programming language. So vectorizing will not help.
You wrote just a description of your model, because JAGS' language is a descriptive one.
Once JAGS reads your model, it assembles a transition matrix to run a MCMC whose stationary distribution is the posteriori distribution of your parameters given your (observed) data. JAGS does nothing else with your program.
All that time you have been waiting the program to run was actually waiting (and hoping) to reach relaxation time of your MCMC.
So, what is taking your program too long to run is that the resulting transition matrix must have bad relaxing properties or anything like that.
That is why vectorizing a program that is read and run only once will be of very little help.
So, your problem lies somewhere else.
I hope it helps and, if not, sorry.
All the best.
You can't vectorise in the same way that you would in R, but if you can group observations with the same probability expression (i.e. common d[i]) then you can use a Binomial rather than Bernoulli distribution which will help enormously. If each observation has a unique d[i] then you are stuck I'm afraid.
Another alternative is to look at Stan which is generally faster for large data sets like yours.
Matt
thanks for the responses. Yes, you make a good point that the parameters in the model I showed might be unidentified.
I should've pointed out that model was a simplified version. The full model is below:
model{
for ( i in 1:Ntotal ) {
aExpectancy[i] <- 1/(1+exp(-gamma*(aTimeRemaining[i] - aDiscrepancy[i]*aExpectedLag[i]) ) )
bExpectancy[i] <- 1/(1+exp(-gamma*(bTimeRemaining[i] - bDiscrepancy[i]*bExpectedLag[i]) ) )
aUtility[i] <- aValence[i]*aExpectancy[i]/(1 + discount * (aTimeRemaining[i]))
bUtility[i] <- bValence[i]*bExpectancy[i]/(1 + discount * (bTimeRemaining[i]))
aMotivationalValueMean[i] <- aUtility[i]*aQualityMean[i]
bMotivationalValueMean[i] <- bUtility[i]*bQualityMean[i]
aMotivationalValueVariance[i] <- (aUtility[i]*aQualitySD[i])^2 + (bUtility[i]*bQualitySD[i])^2
bMotivationalValueVariance[i] <- (aUtility[i]*aQualitySD[i])^2 + (bUtility[i]*bQualitySD[i])^2
mvDiffVariance[i] <- aMotivationalValueVariance[i] + bMotivationalValueVariance[i]
meanDrift[i] <- (aMotivationalValueMean[i] - bMotivationalValueMean[i])
probA[i] <- 1/(1+exp(-2*(meanDrift[i]/sqrt(mvDiffVariance[i])) *theta ) )
y[i] ~ dbern(probA[i])
}
theta ~ dunif(0,10)
discount ~ dunif(0,10)
gamma ~ dunif(0,1)
}
In this model, the estimated parameters are theta, discount, and gamma, and these parameters can be recovered.
When I run the model on the observations for a single participant (Ntotal = 1800), the model takes about 5 minutes to run, which is totally fine.
However, when I run the model on the entire sample (45 participants X 1800 cases each = 78,900 observations), I've had it running for 24 hours and it's less than 50% of the way through.
This seems odd, as I would expect it to just take 45 times as long, so 4 or 5 hours at most. Am I missing something?

Error:"Multiple definitions of node" in OpenBUGS.

So I thought the following code would work in OpenBUGS, but instead it gives me a "Multiple definitions of node Z" error.
model
{
Z <- round(X)
X ~ dnorm(0,1)T(-2,2)
}
list(Z=0)
Even if I replace Z <- round(X) with Z <- X I continue to get the same error. From this fact we can deduce that the error is resulting from the use of a logical assignment for an observable variable and in particular, the error is not due to the round() operation.
Why does BUGS not allow this? Also, what is a good work-around in this case? Here is a more general version that I want to implement, which is essentially modeling a discrete Gaussian with walls (the truncation):
model
{
for(i in 1:N){
Z[i] <- round(X[i])
X[i] ~ dnorm(mu,1)T(-2,2)
}
mu ~ dunif(-2,2)
}
Essentially, I want Z to be distributed with something like a discrete Gaussian with "walls" (the truncation) and I want to estimate mu from data on Z. I suppose I can try to make Z into a categorical variable and estimate the parameters but this seems theoretically painful. Is there some BUGS trick I can use to get my intended model?
WinBUGS and OpenBUGS don't allow observed data to be a deterministic function of an unobserved variable. As you suggest, you could use dcat() and express the probabilities in terms of the normal distribution.
Though you might prefer to switch to JAGS, which has a distribution dround() that deals with just this situation - data that are rounded to n significant digits, in your case n=0. Though this forum post suggests there's a bug in the current stable release for this case, and you might need to download the development version.

R: FAST multivariate optimization packages?

I am looking to find a local minimum of a scalar function of 4 variables, and I have range-constraints on the variables ("box constraints"). There's no closed-form for the function derivative, so methods needing an analytical derivative function are out of the question. I've tried several options and control parameters with the optim function, but all of them seem very slow. Specifically, they seem to spend a lot of time between calls to my (R-defined) objective function, so I know the bottleneck is not my objective function but the "thinking" between calls to my objective function. I looked at CRAN Task View for optimization and tried several of those options (DEOptim from RcppDE, etc) but none of them seem any good. I would have liked to try the nloptr package (an R wrapper for NLOPT library) but it seems to be unavailable for windows.
I'm wondering, are there any good, fast optimization packages that people use that I may be missing? Ideally these would be in the form of thin wrappers around good C++/Fortran libraries, so there's minimal pure-R code. (Though this shouldn't be relevant, my optimization problem arose while trying to fit a 4-parameter distribution to a set of values, by minimizing a certain goodness-of-fit measure).
In the past I've found R's optimization libraries to be quite slow, and ended up writing a thin R wrapper calling a C++ API of a commercial optimization library. So are the best libraries necessarily commercial ones?
UPDATE. Here is a simplified example of the code I'm looking at:
###########
## given a set of values x and a cdf, calculate a measure of "misfit":
## smaller value is better fit
## x is assumed sorted in non-decr order;
Misfit <- function(x, cdf) {
nevals <<- nevals + 1
thinkSecs <<- thinkSecs + ( Sys.time() - snapTime)
cat('S')
if(nevals %% 20 == 0) cat('\n')
L <- length(x)
cdf_x <- pmax(0.0001, pmin(0.9999, cdf(x)))
measure <- -L - (1/L) * sum( (2 * (1:L)-1 )* ( log( cdf_x ) + log( 1 - rev(cdf_x))))
snapTime <<- Sys.time()
cat('E')
return(measure)
}
## Given 3 parameters nu (degrees of freedom, or shape),
## sigma (dispersion), gamma (skewness),
## returns the corresponding 4-parameter student-T cdf parametrized by these params
## (we restrict the location parameter mu to be 0).
skewtGen <- function( p ) {
require(ghyp)
pars = student.t( nu = p[1], mu = 0, sigma = p[2], gamma = p[3] )
function(z) pghyp(z, pars)
}
## Fit using optim() and BFGS method
fit_BFGS <- function(x, init = c()) {
x <- sort(x)
nevals <<- 0
objFun <- function(par) Misfit(x, skewtGen(par))
snapTime <<- Sys.time() ## global time snap shot
thinkSecs <<- 0 ## secs spent "thinking" between objFun calls
tUser <- system.time(
res <- optim(init, objFun,
lower = c(2.1, 0.1, -1), upper = c(15, 2, 1),
method = 'L-BFGS-B',
control = list(trace=2, factr = 1e12, pgtol = .01 )) )[1]
cat('Total time = ', tUser,
' secs, ObjFun Time Pct = ', 100*(1 - thinkSecs/tUser), '\n')
cat('results:\n')
print(res$par)
}
fit_DE <- function(x) {
x <- sort(x)
nevals <<- 0
objFun <- function(par) Misfit(x, skewtGen(par))
snapTime <<- Sys.time() ## global time snap shot
thinkSecs <<- 0 ## secs spent "thinking" between objFun calls
require(RcppDE)
tUser <- system.time(
res <- DEoptim(objFun,
lower = c(2.1, 0.1, -1),
upper = c(15, 2, 1) )) [1]
cat('Total time = ', tUser,
' secs, ObjFun Time Pct = ', 100*(1 - thinkSecs/tUser), '\n')
cat('results:\n')
print(res$par)
}
Let's generate a random sample:
set.seed(1)
# generate 1000 standard-student-T points with nu = 4 (degrees of freedom)
x <- rt(1000,4)
First fit using the fit.tuv (for "T UniVariate") function in the ghyp package -- this uses the Max-likelihood Expectation-Maximization (E-M) method. This is wicked fast!
require(ghyp)
> system.time( print(unlist( pars <- coef( fit.tuv(x, silent = TRUE) ))[c(2,4,5,6)]))
nu mu sigma gamma
3.16658356 0.11008948 1.56794166 -0.04734128
user system elapsed
0.27 0.00 0.27
Now I am trying to fit the distribution a different way: by minimizing the "misfit" measure defined above, using the standard optim() function in base R. Note that the results will not in general be the same. My reason for doing this is to compare these two results for a whole class of situations. I pass in the above Max-Likelihood estimate as the starting point for this optimization.
> fit_BFGS( x, init = c(pars$nu, pars$sigma, pars$gamma) )
N = 3, M = 5 machine precision = 2.22045e-16
....................
....................
.........
iterations 5
function evaluations 7
segments explored during Cauchy searches 7
BFGS updates skipped 0
active bounds at final generalized Cauchy point 0
norm of the final projected gradient 0.0492174
final function value 0.368136
final value 0.368136
converged
Total time = 41.02 secs, ObjFun Time Pct = 99.77084
results:
[1] 3.2389296 1.5483393 0.1161706
I also tried to fit with the DEoptim() but it ran for too long and I had to kill it. As you can see from the output above, 99.8% of the time is attributable to the objective function! So Dirk and Mike were right in their comments below. I should have more carefully estimated the time spent in my objective function, and printing dots was not a good idea! Also I suspect the MLE(E-M) method is very fast because it uses an analytical (closed-form) for the log-likelihood function.
A maximum likelihood estimator, when it exists for your problem, will always be faster than a global optimizer, in any language.
A global optimizer, no matter the algorithm, typically combines some random jumps with local minimization routines. Different algorithms may discuss this in terms of populations (genetic algorithms), annealing, migration, etc. but they are all conceptually similar.
In practice, this means that if you have a smooth function, some other optimization algorithm will likely be fastest. The characteristics of your problem function will dictate whether that will be a quadratic, linear, conical, or some other type of optimization problem for which an exact (or near-exact) analytical solution exists, or whether you will need to apply a global optimizer that is necessarily slower.
By using ghyp, you're saying that your 4 variable function produces an output that may be fit to the generalized hyperbolic distribution, and you are using a maximum likelihood estimator to find the closest generalized hyperbolic distribution to the data you've provided. But if you are doing that, I'm afraid I don't understand how you could have a non-smooth surface requiring optimization.
In general, the optimizer you choose needs to be chosen based on your problem. There is no perfect 'optimal optimizer', in any programming language, and choice of optimization algorithm to fit your problem will likely make more of a difference than any minor inefficiencies of the implementation.

Why does this code not optimize for all three points?

Background
I am trying to fit a distribution to a 95% CI and mode.
The cost function that I am using solves three functions for 0: P(X=2.5 | mu, sigma)=0.025, P(X=7.5|mu, sigma)=0.975, and the mode of log-N(mu, sigma) = 3.3. note: mode of a lognormal is = $e^{\mu-\sigma^2)}$:
Approach
First I write a cost function, prior
prior <- function(parms) {
a <- abs(plnorm(2.5, parms[1], parms[2]) - 0.025)
b <- abs(plnorm(7.5, parms[1], parms[2]) - 0.975)
mode <- exp(parms[1] - parms[2]^2)
c <- abs(mode-3.3)
return(a + b + c)
}
And then I seek parameters that minimize the cost function
v = nlm(prior,c(log(3.3),0.14))
It is apparent that the function is maximized for the mode an LCL but not the UCL.
abs(plnorm(7.5, parms[1], parms[2]) - 0.975)
> [1] 0.02499989
Here is a plot with dotted lines at the desired 95%CI:
x <- seq(0,10,0.1)
plot(x,dlnorm(x, v$estimate[1],v$estimate[2]),type='l')
abline(v=c(2.5,7.5), lty=2) #95%CI
Question
The optimization two points closely and all of the error is in the third. However, I would like it to fit the points evenly.
How can I get the function to give equal weight to the magnitude of the a, b, and c terms? It appears that the function is only fitting a and c.
note: This question is based on one asked at [cross validated][1] except that this version is specifically about the function of R's nlm() optimization algorithm whereas the CV question is about finding the a more appropriate distribution.
The reason your optimization "does not work" is that the scale of the three parameters, a, b, and c does not match. a and b measure a difference in probabilities, and can always be set to be no larger then 0.025 by chosing a really small value for the standard deviation (parms[2]), since then plnorm(2.5, parms[1], parms[2]) will be 0 (same for 7.5). The same amount of error (0.025) would be unnoticable for c - this is the scaling mismatch.
You can rewrite your optimization function so that the errors are measured on the x scale for all three criteria by comparing the quantiles to 2.5 and 7.5:
prior2 <- function(parms) {
a <- abs(qlnorm(0.025, parms[1], parms[2]) - 2.5)
b <- abs(qlnorm(0.975, parms[1], parms[2]) - 7.5)
mode <- exp(parms[1] - parms[2]^2)
c <- abs(mode-3.3)
return(a + b + c)
}
This is similar to what Ramnath suggested, except not on the log scale. This approach does not really do well on the left tail because the distribution is skewed right: small changes in the location of the lower 2.5th percentile lead to large changes of the percentile at 2.5, while this is not the case at 7.5. Ramnath's suggestion of working on the log scale solves this problem, since the log-normal distriburion is symmetric on the log-scale.
Another way to improve your fit is to change the optimization criterion. Right now you are minimizing the average absolute error. This means that one large error is OK-ish as long as the other two error terms are really small. You can impose a bigger penalty on large errors by minimizing the mean squared error (a^2+b^2_c^2) instead. This latest version (on the log scale) produces the best-looking estimate from my point of view.
prior3 <- function(parms) {
a <- abs(parms[1] - 1.96*parms[2] - log(2.5))
b <- abs(parms[1] + 1.96*parms[2] - log(7.5))
c <- abs(parms[1] - parms[2]^2 - log(3.3))
return(a^2 + b^2 + c^2)
}
try an alternate formulation of your optimization function. the log of the 95% confidence interval for the lognormal distribution is given by mu - 2*sigma and mu + 2*sigma. so you can basically try to minimize abs(mu - 2*sigma - log(2.5)) + abs(mu + 2*sigma - log(7.5)) + abs(mu - sigma^2 - log(3.3)).
when i minimized this, i find that the confidence intervals are fit very closely, while the mode is a little off. depending on the nature of your application, you might want to weight the three terms differently