R: FAST multivariate optimization packages? - optimization

I am looking to find a local minimum of a scalar function of 4 variables, and I have range-constraints on the variables ("box constraints"). There's no closed-form for the function derivative, so methods needing an analytical derivative function are out of the question. I've tried several options and control parameters with the optim function, but all of them seem very slow. Specifically, they seem to spend a lot of time between calls to my (R-defined) objective function, so I know the bottleneck is not my objective function but the "thinking" between calls to my objective function. I looked at CRAN Task View for optimization and tried several of those options (DEOptim from RcppDE, etc) but none of them seem any good. I would have liked to try the nloptr package (an R wrapper for NLOPT library) but it seems to be unavailable for windows.
I'm wondering, are there any good, fast optimization packages that people use that I may be missing? Ideally these would be in the form of thin wrappers around good C++/Fortran libraries, so there's minimal pure-R code. (Though this shouldn't be relevant, my optimization problem arose while trying to fit a 4-parameter distribution to a set of values, by minimizing a certain goodness-of-fit measure).
In the past I've found R's optimization libraries to be quite slow, and ended up writing a thin R wrapper calling a C++ API of a commercial optimization library. So are the best libraries necessarily commercial ones?
UPDATE. Here is a simplified example of the code I'm looking at:
###########
## given a set of values x and a cdf, calculate a measure of "misfit":
## smaller value is better fit
## x is assumed sorted in non-decr order;
Misfit <- function(x, cdf) {
nevals <<- nevals + 1
thinkSecs <<- thinkSecs + ( Sys.time() - snapTime)
cat('S')
if(nevals %% 20 == 0) cat('\n')
L <- length(x)
cdf_x <- pmax(0.0001, pmin(0.9999, cdf(x)))
measure <- -L - (1/L) * sum( (2 * (1:L)-1 )* ( log( cdf_x ) + log( 1 - rev(cdf_x))))
snapTime <<- Sys.time()
cat('E')
return(measure)
}
## Given 3 parameters nu (degrees of freedom, or shape),
## sigma (dispersion), gamma (skewness),
## returns the corresponding 4-parameter student-T cdf parametrized by these params
## (we restrict the location parameter mu to be 0).
skewtGen <- function( p ) {
require(ghyp)
pars = student.t( nu = p[1], mu = 0, sigma = p[2], gamma = p[3] )
function(z) pghyp(z, pars)
}
## Fit using optim() and BFGS method
fit_BFGS <- function(x, init = c()) {
x <- sort(x)
nevals <<- 0
objFun <- function(par) Misfit(x, skewtGen(par))
snapTime <<- Sys.time() ## global time snap shot
thinkSecs <<- 0 ## secs spent "thinking" between objFun calls
tUser <- system.time(
res <- optim(init, objFun,
lower = c(2.1, 0.1, -1), upper = c(15, 2, 1),
method = 'L-BFGS-B',
control = list(trace=2, factr = 1e12, pgtol = .01 )) )[1]
cat('Total time = ', tUser,
' secs, ObjFun Time Pct = ', 100*(1 - thinkSecs/tUser), '\n')
cat('results:\n')
print(res$par)
}
fit_DE <- function(x) {
x <- sort(x)
nevals <<- 0
objFun <- function(par) Misfit(x, skewtGen(par))
snapTime <<- Sys.time() ## global time snap shot
thinkSecs <<- 0 ## secs spent "thinking" between objFun calls
require(RcppDE)
tUser <- system.time(
res <- DEoptim(objFun,
lower = c(2.1, 0.1, -1),
upper = c(15, 2, 1) )) [1]
cat('Total time = ', tUser,
' secs, ObjFun Time Pct = ', 100*(1 - thinkSecs/tUser), '\n')
cat('results:\n')
print(res$par)
}
Let's generate a random sample:
set.seed(1)
# generate 1000 standard-student-T points with nu = 4 (degrees of freedom)
x <- rt(1000,4)
First fit using the fit.tuv (for "T UniVariate") function in the ghyp package -- this uses the Max-likelihood Expectation-Maximization (E-M) method. This is wicked fast!
require(ghyp)
> system.time( print(unlist( pars <- coef( fit.tuv(x, silent = TRUE) ))[c(2,4,5,6)]))
nu mu sigma gamma
3.16658356 0.11008948 1.56794166 -0.04734128
user system elapsed
0.27 0.00 0.27
Now I am trying to fit the distribution a different way: by minimizing the "misfit" measure defined above, using the standard optim() function in base R. Note that the results will not in general be the same. My reason for doing this is to compare these two results for a whole class of situations. I pass in the above Max-Likelihood estimate as the starting point for this optimization.
> fit_BFGS( x, init = c(pars$nu, pars$sigma, pars$gamma) )
N = 3, M = 5 machine precision = 2.22045e-16
....................
....................
.........
iterations 5
function evaluations 7
segments explored during Cauchy searches 7
BFGS updates skipped 0
active bounds at final generalized Cauchy point 0
norm of the final projected gradient 0.0492174
final function value 0.368136
final value 0.368136
converged
Total time = 41.02 secs, ObjFun Time Pct = 99.77084
results:
[1] 3.2389296 1.5483393 0.1161706
I also tried to fit with the DEoptim() but it ran for too long and I had to kill it. As you can see from the output above, 99.8% of the time is attributable to the objective function! So Dirk and Mike were right in their comments below. I should have more carefully estimated the time spent in my objective function, and printing dots was not a good idea! Also I suspect the MLE(E-M) method is very fast because it uses an analytical (closed-form) for the log-likelihood function.

A maximum likelihood estimator, when it exists for your problem, will always be faster than a global optimizer, in any language.
A global optimizer, no matter the algorithm, typically combines some random jumps with local minimization routines. Different algorithms may discuss this in terms of populations (genetic algorithms), annealing, migration, etc. but they are all conceptually similar.
In practice, this means that if you have a smooth function, some other optimization algorithm will likely be fastest. The characteristics of your problem function will dictate whether that will be a quadratic, linear, conical, or some other type of optimization problem for which an exact (or near-exact) analytical solution exists, or whether you will need to apply a global optimizer that is necessarily slower.
By using ghyp, you're saying that your 4 variable function produces an output that may be fit to the generalized hyperbolic distribution, and you are using a maximum likelihood estimator to find the closest generalized hyperbolic distribution to the data you've provided. But if you are doing that, I'm afraid I don't understand how you could have a non-smooth surface requiring optimization.
In general, the optimizer you choose needs to be chosen based on your problem. There is no perfect 'optimal optimizer', in any programming language, and choice of optimization algorithm to fit your problem will likely make more of a difference than any minor inefficiencies of the implementation.

Related

Terminate Scipy solve_ivp on custom predicate

I have an ODE dy/dt = f(y,t), where y is a N dimensional vector, which I would like to solve using the scipy.integrate.solve_ivp function.
However, I would like to stop the integration if a certain predicate g(y,t) evaluates to True. The use case I have here is that I expect the value of y to converge towards some constant value y0 before the end of the integration duration t_end. I am interested in this constant value y0 and would like to save time by terminating the integration once convergence has happened.
I was hoping that I could create an array to store the values of y in the last 5 integration steps, and if they are very close, convergence is believed to have happened.
The event function of solve_ivp does not really help in my case: there is no root that I hope to find, and I am not interested in the t when convergence happens. I am surprised that this seemingly "common" use case of looking for a convergence cannot be done easily, and I can't find similar problems already on Stackoverflow.
If someone has some idea, I would love to hear it.
This is a good candidate for accessing the integrator classes that solve_ivp uses under the hood. If we take the simple function dy/dt = -y with initial condition y(0) = 100. We want to terminate the function when the solution has changed by less than 0.1 over 1 second of simulation, i.e. abs(y(t) - y(t-1)) < 0.1. For this ODE, this occurs at t=-ln(0.1 / (100(e-1)) or t~7.45. We can solve this using RK45 integrator (RK45 docs) as follows:
import numpy as np
from scipy.integrate import RK45
def fun(t, y):
return -y
y0 = [100]
t0 = 0
#max_step used to ensure that we take small enough time steps
rk45 = RK45(fun, t0, y0, t_bound=1000, max_step=0.1)
t = []
y = []
while rk45.status == "running":
t.append(rk45.t)
y.append(rk45.y[0])
if rk45.t > 1.0 and np.abs(np.interp(rk45.t-1, t, y) - rk45.y[0]) < 0.1:
break
rk45.step()
print(f"Final t: {t[-1]:.1f}")
# Because max_step=0.1, t[-11] will be 1 second behind t[-1]
print(f"Time period checked: {t[-1]-t[-11]:.1f}, delta_y: {y[-11]-y[-1]:.1f}")
yields
Final t: 7.5
Time period checked: 1.0, delta_y: 0.1

Fit a bayesian linear regression and predict unobservable values

I'd like to use Jags plus R to adjust a linear model with observable quantities, and make inference about unobservable ones. I found lots of example on the internet about how to adjust the model, but nothing on how to extrapolate its coefficients after having fitted the model in the Jags environment. So, I'll appreciate any help on this.
My data looks like the following:
ngroups <- 2
group <- 1:ngroups
nobs <- 100
dta <- data.frame(group=rep(group,each=nobs),y=rnorm(nobs*ngroups),x=runif(nobs*ngroups))
head(dta)
JAGS has powerful ways to make inference about missing data, and once you get the hang of it, it's easy! I strongly recommend that you check out Marc Kéry's excellent book which provides a wonderful introduction to BUGS language programming (JAGS is close enough to BUGS that almost everything transfers).
The easiest way to do this involves, as you say, adjusting the model. Below I provide a complete worked example of how this works. But you seem to be asking for a way to get the prediction interval without re-running the model (is your model very large and computationally expensive?). This can also be done.
How to predict--the hard way (without re-running the model)
For each iteration of the MCMC, simulate the response for the desired x-value based on that iteration's posterior draws for the covariate values. So imagine you want to predict a value for X=10. Then if iteration 1 (post burn-in) has slope=2, intercept=1, and standard deviation=0.5, draw a Y-value from
Y=rnorm(1, 1+2*10, 0.5)
And repeat for iteration 2, 3, 4, 5...
These will be your posterior draws for the response at X=10. Note: if you did not monitor the standard deviation in your JAGS model, you are out of luck and need to fit the model again.
How to predict--the easy way--with worked example
The basic idea is to insert (into your data) the x-values whose responses you want to predict, with the associated y-values NA. For example, if you want a prediction interval for X=10, you just have to include the point (10, NA) in your data, and set a trace monitor for the y-value.
I use JAGS from R with the rjags package. Below is a complete worked example that begins by simulating the data, then adds some extra x-values to the data, specifies and runs the linear model in JAGS via rjags, and summarizes the results. Y[101:105] contains draws from the posterior prediction intervals for X[101:105]. Notice that Y[1:100] just contains the y-values for X[1:100]. These are the observed data that we fed to the model, and they never change as the model updates.
library(rjags)
# Simulate data (100 observations)
my.data <- as.data.frame(matrix(data=NA, nrow=100, ncol=2))
names(my.data) <- c("X", "Y")
# the linear model will predict Y based on the covariate X
my.data$X <- runif(100) # values for the covariate
int <- 2 # specify the true intercept
slope <- 1 # specify the true slope
sigma <- .5 # specify the true residual standard deviation
my.data$Y <- rnorm(100, slope*my.data$X+int, sigma) # Simulate the data
#### Extra data for prediction of unknown Y-values from known X-values
y.predict <- as.data.frame(matrix(data=NA, nrow=5, ncol=2))
names(y.predict) <- c("X", "Y")
y.predict$X <- c(-1, 0, 1.3, 2, 7)
mydata <- rbind(my.data, y.predict)
set.seed(333)
setwd(INSERT YOUR WORKING DIRECTORY HERE)
sink("mymodel.txt")
cat("model{
# Priors
int ~ dnorm(0, .001)
slope ~ dnorm(0, .001)
tau <- 1/(sigma * sigma)
sigma ~ dunif(0,10)
# Model structure
for(i in 1:R){
Y[i] ~ dnorm(m[i],tau)
m[i] <- int + slope * X[i]
}
}", fill=TRUE)
sink()
jags.data <- list(R=dim(mydata)[1], X=mydata$X, Y=mydata$Y)
inits <- function(){list(int=rnorm(1, 0, 5), slope=rnorm(1,0,5),
sigma=runif(1,0,10))}
params <- c("Y", "int", "slope", "sigma")
nc <- 3
n.adapt <-1000
n.burn <- 1000
n.iter <- 10000
thin <- 10
my.model <- jags.model('mymodel.txt', data = jags.data, inits=inits, n.chains=nc, n.adapt=n.adapt)
update(my.model, n.burn)
my.model_samples <- coda.samples(my.model,params,n.iter=n.iter, thin=thin)
summary(my.model_samples)

Running a logistic model in JAGS - Can you vectorize instead of looping over individual cases?

I'm fairly new to JAGS, so this may be a dumb question. I'm trying to run a model in JAGS that predicts the probability that a one-dimensional random walk process will cross boundary A before crossing boundary B. This model can be solved analytically via the following logistic model:
Pr(A,B) = 1/(1 + exp(-2 * (d/sigma) * theta))
where "d" is the mean drift rate (positive values indicate drift toward boundary A), "sigma" is the standard deviation of that drift rate and "theta" is the distance between the starting point and the boundary (assumed to be equal for both boundaries).
My dataset consists of 50 participants, who each provide 1800 observations. My model assumes that d is determined by a particular combination of observed environmental variables (which I'll just call 'x'), and a weighting coefficient that relates x to d (which I'll call 'beta'). Thus, there are three parameters: beta, sigma, and theta. I'd like to estimate a single set of parameters for each participant. My intention is to eventually run a hierarchical model, where group level parameters influence individual level parameters. However, for simplicity, here I will just consider a model in which I estimate a single set of parameters for one participant (and thus the model is not hierarchical).
My model in rjags would be as follows:
model{
for ( i in 1:Ntotal ) {
d[i] <- x[i] * beta
probA[i] <- 1/(1+exp(-2 * (d[i]/sigma) * theta ) )
y[i] ~ dbern(probA[i])
}
beta ~ dunif(-10,10)
sigma ~ dunif(0,10)
theta ~ dunif(0,10)
}
This model runs fine, but takes ages to run. I'm not sure how JAGS carries out the code, but if this code were run in R, it would be rather inefficient because it would have to loop over cases, running the model for each case individually. The time required to run the analysis would therefore increase rapidly as the sample size increases. I have a rather large sample, so this is a concern.
Is there a way to vectorise this code so that it can calculate the likelihood for all of the data points at once? For example, if I were to run this as a simple maximum likelihood model. I would vectorize the model and calculate the probability of the data given particular parameter values for all 1800 cases provided by the participant (and thus would not need the for loop). I would then take the log of these likelihoods and add them all together to give a single loglikelihood for the all observations given by the participant. This method has enormous time savings. Is there a way to do this in JAGS?
EDIT
Thanks for the responses, and for pointing out that the parameters in the model I showed might be unidentified. I should've pointed out that model was a simplified version. The full model is below:
model{
for ( i in 1:Ntotal ) {
aExpectancy[i] <- 1/(1+exp(-gamma*(aTimeRemaining[i] - aDiscrepancy[i]*aExpectedLag[i]) ) )
bExpectancy[i] <- 1/(1+exp(-gamma*(bTimeRemaining[i] - bDiscrepancy[i]*bExpectedLag[i]) ) )
aUtility[i] <- aValence[i]*aExpectancy[i]/(1 + discount * (aTimeRemaining[i]))
bUtility[i] <- bValence[i]*bExpectancy[i]/(1 + discount * (bTimeRemaining[i]))
aMotivationalValueMean[i] <- aUtility[i]*aQualityMean[i]
bMotivationalValueMean[i] <- bUtility[i]*bQualityMean[i]
aMotivationalValueVariance[i] <- (aUtility[i]*aQualitySD[i])^2 + (bUtility[i]*bQualitySD[i])^2
bMotivationalValueVariance[i] <- (aUtility[i]*aQualitySD[i])^2 + (bUtility[i]*bQualitySD[i])^2
mvDiffVariance[i] <- aMotivationalValueVariance[i] + bMotivationalValueVariance[i]
meanDrift[i] <- (aMotivationalValueMean[i] - bMotivationalValueMean[i])
probA[i] <- 1/(1+exp(-2*(meanDrift[i]/sqrt(mvDiffVariance[i])) *theta ) )
y[i] ~ dbern(probA[i])
}
In this model, the estimated parameters are theta, discount, and gamma, and these parameters can be recovered. When I run the model on the observations for a single participant (Ntotal = 1800), the model takes about 5 minutes to run, which is totally fine. However, when I run the model on the entire sample (45 participants x 1800 cases each = 78,900 observations), I've had it running for 24 hours and it's less than 50% of the way through. This seems odd, as I would expect it to just take 45 times as long, so 4 or 5 hours at most. Am I missing something?
I hope I am not misreading this situation (and I previously apologize if I am), but your question seems to come from a conceptual misunderstanding of how JAGS works (or WinBUGS or OpenBUGS for that matter).
Your program does not actually run, because what you wrote was not written in a programming language. So vectorizing will not help.
You wrote just a description of your model, because JAGS' language is a descriptive one.
Once JAGS reads your model, it assembles a transition matrix to run a MCMC whose stationary distribution is the posteriori distribution of your parameters given your (observed) data. JAGS does nothing else with your program.
All that time you have been waiting the program to run was actually waiting (and hoping) to reach relaxation time of your MCMC.
So, what is taking your program too long to run is that the resulting transition matrix must have bad relaxing properties or anything like that.
That is why vectorizing a program that is read and run only once will be of very little help.
So, your problem lies somewhere else.
I hope it helps and, if not, sorry.
All the best.
You can't vectorise in the same way that you would in R, but if you can group observations with the same probability expression (i.e. common d[i]) then you can use a Binomial rather than Bernoulli distribution which will help enormously. If each observation has a unique d[i] then you are stuck I'm afraid.
Another alternative is to look at Stan which is generally faster for large data sets like yours.
Matt
thanks for the responses. Yes, you make a good point that the parameters in the model I showed might be unidentified.
I should've pointed out that model was a simplified version. The full model is below:
model{
for ( i in 1:Ntotal ) {
aExpectancy[i] <- 1/(1+exp(-gamma*(aTimeRemaining[i] - aDiscrepancy[i]*aExpectedLag[i]) ) )
bExpectancy[i] <- 1/(1+exp(-gamma*(bTimeRemaining[i] - bDiscrepancy[i]*bExpectedLag[i]) ) )
aUtility[i] <- aValence[i]*aExpectancy[i]/(1 + discount * (aTimeRemaining[i]))
bUtility[i] <- bValence[i]*bExpectancy[i]/(1 + discount * (bTimeRemaining[i]))
aMotivationalValueMean[i] <- aUtility[i]*aQualityMean[i]
bMotivationalValueMean[i] <- bUtility[i]*bQualityMean[i]
aMotivationalValueVariance[i] <- (aUtility[i]*aQualitySD[i])^2 + (bUtility[i]*bQualitySD[i])^2
bMotivationalValueVariance[i] <- (aUtility[i]*aQualitySD[i])^2 + (bUtility[i]*bQualitySD[i])^2
mvDiffVariance[i] <- aMotivationalValueVariance[i] + bMotivationalValueVariance[i]
meanDrift[i] <- (aMotivationalValueMean[i] - bMotivationalValueMean[i])
probA[i] <- 1/(1+exp(-2*(meanDrift[i]/sqrt(mvDiffVariance[i])) *theta ) )
y[i] ~ dbern(probA[i])
}
theta ~ dunif(0,10)
discount ~ dunif(0,10)
gamma ~ dunif(0,1)
}
In this model, the estimated parameters are theta, discount, and gamma, and these parameters can be recovered.
When I run the model on the observations for a single participant (Ntotal = 1800), the model takes about 5 minutes to run, which is totally fine.
However, when I run the model on the entire sample (45 participants X 1800 cases each = 78,900 observations), I've had it running for 24 hours and it's less than 50% of the way through.
This seems odd, as I would expect it to just take 45 times as long, so 4 or 5 hours at most. Am I missing something?

cumulative simpson integration with scipy

I have some code which uses scipy.integration.cumtrapz to compute the antiderivative of a sampled signal. I would like to use Simpson's rule instead of Trapezoid. However scipy.integration.simps seems not to have a cumulative counterpart... Am I missing something? Is there a simple way to get a cumulative integration with "scipy.integration.simps"?
You can always write your own:
def cumsimp(func,a,b,num):
#Integrate func from a to b using num intervals.
num*=2
a=float(a)
b=float(b)
h=(b-a)/num
output=4*func(a+h*np.arange(1,num,2))
tmp=func(a+h*np.arange(2,num-1,2))
output[1:]+=tmp
output[:-1]+=tmp
output[0]+=func(a)
output[-1]+=func(b)
return np.cumsum(output*h/3)
def integ1(x):
return x
def integ2(x):
return x**2
def integ0(x):
return np.ones(np.asarray(x).shape)*5
First look at the sum and derivative of a constant function.
print cumsimp(integ0,0,10,5)
[ 10. 20. 30. 40. 50.]
print np.diff(cumsimp(integ0,0,10,5))
[ 10. 10. 10. 10.]
Now check for a few trivial examples:
print cumsimp(integ1,0,10,5)
[ 2. 8. 18. 32. 50.]
print cumsimp(integ2,0,10,5)
[ 2.66666667 21.33333333 72. 170.66666667 333.33333333]
Writing your integrand explicitly is much easier here then reproducing the simpson's rule function of scipy in this context. Picking intervals will be difficult to do when provided a single array, do you either:
Use every other value for the edges of simpson's rule and the remaining values as centers?
Use the array as edges and interpolate values of centers?
There are also a few options for how you want the intervals summed. These complications could be why its not coded in scipy.
Your question has been answered a long time ago, but I came across the same problem recently. I wrote some functions to compute such cumulative integrals for equally spaced points; the code can be found on GitHub. The order of the interpolating polynomials ranges from 1 (trapezoidal rule) to 7. As Daniel pointed out in the previous answer, some choices have to be made on how the intervals are summed, especially at the borders; results may thus be sightly different depending on the package you use. Be also aware that the numerical integration may suffer from Runge's phenomenon (unexpected oscillations) for high orders of polynomials.
Here is an example:
import numpy as np
from scipy import integrate as sp_integrate
from gradiompy import integrate as gp_integrate
# Definition of the function (polynomial of degree 7)
x = np.linspace(-3,3,num=15)
dx = x[1]-x[0]
y = 8*x + 3*x**2 + x**3 - 2*x**5 + x**6 - 1/5*x**7
y_int = 4*x**2 + x**3 + 1/4*x**4 - 1/3*x**6 + 1/7*x**7 - 1/40*x**8
# Cumulative integral using scipy
y_int_trapz = y_int [0] + sp_integrate.cumulative_trapezoid(y,dx=dx,initial=0)
print('Integration error using scipy.integrate:')
print(' trapezoid = %9.5f' % np.linalg.norm(y_int_trapz-y_int))
# Cumulative integral using gradiompy
y_int_trapz = gp_integrate.cumulative_trapezoid(y,dx=dx,initial=y_int[0])
y_int_simps = gp_integrate.cumulative_simpson(y,dx=dx,initial=y_int[0])
print('\nIntegration error using gradiompy.integrate:')
print(' trapezoid = %9.5f' % np.linalg.norm(y_int_trapz-y_int))
print(' simpson = %9.5f' % np.linalg.norm(y_int_simps-y_int))
# Higher order cumulative integrals
for order in range(5,8,2):
y_int_composite = gp_integrate.cumulative_composite(y,dx,order=order,initial=y_int[0])
print(' order %i = %9.5f' % (order,np.linalg.norm(y_int_composite-y_int)))
# Display the values of the cumulative integral
print('\nCumulative integral (with initial offset):\n',y_int_composite)
You should get the following result:
'''
Integration error using scipy.integrate:
trapezoid = 176.10502
Integration error using gradiompy.integrate:
trapezoid = 176.10502
simpson = 2.52551
order 5 = 0.48758
order 7 = 0.00000
Cumulative integral (with initial offset):
[-6.90203571e+02 -2.29979407e+02 -5.92267425e+01 -7.66415188e+00
2.64794452e+00 2.25594840e+00 6.61937372e-01 1.14797061e-13
8.20130517e-01 3.61254267e+00 8.55804341e+00 1.48428883e+01
1.97293221e+01 1.64257877e+01 -1.13464286e+01]
'''
I would go with Daniel's solution. But you need to be careful if the function that you are integrating is itself subject to fluctuations. Simpson's requires the function to be well-behaved (meaning in this case, one that is continuous).
There are techniques for making a moderately badly behaved function look like it is better behaved than it really is (really forms of approximation of your function) but in that case you have to be sure that the function "adequately" approximates yours. In that case you might make the intervals may be non-uniform to handle the problem.
An example might be in considering the flow of a field that, over longer time scales, is approximated by a well-behaved function but which over shorter periods is subject to limited random fluctuations in its density.

Why does this code not optimize for all three points?

Background
I am trying to fit a distribution to a 95% CI and mode.
The cost function that I am using solves three functions for 0: P(X=2.5 | mu, sigma)=0.025, P(X=7.5|mu, sigma)=0.975, and the mode of log-N(mu, sigma) = 3.3. note: mode of a lognormal is = $e^{\mu-\sigma^2)}$:
Approach
First I write a cost function, prior
prior <- function(parms) {
a <- abs(plnorm(2.5, parms[1], parms[2]) - 0.025)
b <- abs(plnorm(7.5, parms[1], parms[2]) - 0.975)
mode <- exp(parms[1] - parms[2]^2)
c <- abs(mode-3.3)
return(a + b + c)
}
And then I seek parameters that minimize the cost function
v = nlm(prior,c(log(3.3),0.14))
It is apparent that the function is maximized for the mode an LCL but not the UCL.
abs(plnorm(7.5, parms[1], parms[2]) - 0.975)
> [1] 0.02499989
Here is a plot with dotted lines at the desired 95%CI:
x <- seq(0,10,0.1)
plot(x,dlnorm(x, v$estimate[1],v$estimate[2]),type='l')
abline(v=c(2.5,7.5), lty=2) #95%CI
Question
The optimization two points closely and all of the error is in the third. However, I would like it to fit the points evenly.
How can I get the function to give equal weight to the magnitude of the a, b, and c terms? It appears that the function is only fitting a and c.
note: This question is based on one asked at [cross validated][1] except that this version is specifically about the function of R's nlm() optimization algorithm whereas the CV question is about finding the a more appropriate distribution.
The reason your optimization "does not work" is that the scale of the three parameters, a, b, and c does not match. a and b measure a difference in probabilities, and can always be set to be no larger then 0.025 by chosing a really small value for the standard deviation (parms[2]), since then plnorm(2.5, parms[1], parms[2]) will be 0 (same for 7.5). The same amount of error (0.025) would be unnoticable for c - this is the scaling mismatch.
You can rewrite your optimization function so that the errors are measured on the x scale for all three criteria by comparing the quantiles to 2.5 and 7.5:
prior2 <- function(parms) {
a <- abs(qlnorm(0.025, parms[1], parms[2]) - 2.5)
b <- abs(qlnorm(0.975, parms[1], parms[2]) - 7.5)
mode <- exp(parms[1] - parms[2]^2)
c <- abs(mode-3.3)
return(a + b + c)
}
This is similar to what Ramnath suggested, except not on the log scale. This approach does not really do well on the left tail because the distribution is skewed right: small changes in the location of the lower 2.5th percentile lead to large changes of the percentile at 2.5, while this is not the case at 7.5. Ramnath's suggestion of working on the log scale solves this problem, since the log-normal distriburion is symmetric on the log-scale.
Another way to improve your fit is to change the optimization criterion. Right now you are minimizing the average absolute error. This means that one large error is OK-ish as long as the other two error terms are really small. You can impose a bigger penalty on large errors by minimizing the mean squared error (a^2+b^2_c^2) instead. This latest version (on the log scale) produces the best-looking estimate from my point of view.
prior3 <- function(parms) {
a <- abs(parms[1] - 1.96*parms[2] - log(2.5))
b <- abs(parms[1] + 1.96*parms[2] - log(7.5))
c <- abs(parms[1] - parms[2]^2 - log(3.3))
return(a^2 + b^2 + c^2)
}
try an alternate formulation of your optimization function. the log of the 95% confidence interval for the lognormal distribution is given by mu - 2*sigma and mu + 2*sigma. so you can basically try to minimize abs(mu - 2*sigma - log(2.5)) + abs(mu + 2*sigma - log(7.5)) + abs(mu - sigma^2 - log(3.3)).
when i minimized this, i find that the confidence intervals are fit very closely, while the mode is a little off. depending on the nature of your application, you might want to weight the three terms differently