I'm fairly new to Clojure and functional programming generally. Being curious about the speed of some basic operations on data structures (Clojure's defaults and some I may implement), I wrote something to automate testing operations such as adding to the structures.
My method, which runs a test for 3 data structures, consistently returns very different average run times depending on how it's called, even when its inputs remain the same.
Code, with tests and results at the bottom:
(import '(java.util Date))
(defrecord test-suite ;;holds the test results for 3 data structures
  [t-list
   t-vector
   t-set])
(defrecord test-series ;;holds the list of test results (list of test-suite) and the list of functions used in the respective tests
  [t-suites
   t-functions])
;;;Runs the test, returns the time it took
(defn time-test [func init-ds delta-list]
  (def startTime (. (new Date) (getTime)))
  (reduce func init-ds delta-list)
  (def endTime (. (new Date) (getTime)))
  (- endTime startTime))
;;;Runs the test x number of times, returning the average run time
(defn test-struct
  ([iter func init-ds delta-list] (test-struct iter func init-ds delta-list ()))
  ([iter       ;;number of times to run tests
    func       ;;function being tested (add, remove, etc.)
    init-ds    ;;initial data structure being tested
    delta-list
    addRes]    ;;test results
   (println (first addRes)) ;;print previous run time for debugging
   ;;test if done recursing
   (if (> iter 0)
     (test-struct
      (- iter 1)
      func
      init-ds
      delta-list
      (conj addRes (time-test func init-ds delta-list)))
     (/ (reduce + addRes) (count addRes)))))
;;;Tests a function on a passed-in data structure and a randomly generated list of numbers
(defn run-test
  [iter     ;;the number of times each test will be run
   func     ;;the function being tested
   init-ds] ;;the initial data structure being tested
  (def delta-list (shuffle (range 1000000))) ;;the random values being added/removed/whatever from the ds
  (println init-ds)
  (println iter)
  (test-suite.
   ;;the list test
   (test-struct iter func (nth init-ds 0) delta-list)
   ;;the vector test
   (test-struct iter func (nth init-ds 1) delta-list)
   ;;the set test
   (test-struct iter func (nth init-ds 2) delta-list)))
;;;Calls run-test a number of times storing the results as a list in a test-series data structure along with the list of functions tested.
(defn run-test-set
  ([iter func-list ds-list] (run-test-set iter (test-series. nil func-list) func-list ds-list))
  ([iter      ;;the number of times each test is run before being averaged
    series    ;;data structure that aggregates the list of test results, and ultimately is returned
    func-list ;;the list of functions to be tested
    ds-list]  ;;the list of initial data structures to be tested
   (if (> (count func-list) 0)
     (run-test-set ;;recursively run this, aggregating test-suites as we go
      iter
      (test-series. ;;create a new test-series with all the functions and suites run so far
       (conj (:t-suites series) ;;run a test suite and append it to those run so far
             (run-test iter (first func-list) (first ds-list)))
       (:t-functions series))
      (rest func-list)
      (rest ds-list))
     series))) ;;finished with the last run; return results
Tests
All times in ms
;;;;;;;;;;;;;;;;;;EVALUATING 'run-test' directly
;;;get average speeds for adding 1000000 random elements to a list, vector, and set
;;;run the test 20 times and average the results
(run-test 20 conj '(() [] #{}))
;;;;;RESULT
#test.test-suite{:t-list 254/5, :t-vector 2249/20, :t-set 28641/20}
or about 51, 112, and 1432 ms for list, vector, and set respectively
;;;;;;;;;;;;;;;;;;EVALUATING using 'run-test-set' which calls run-test
(run-test-set
20 ;;;times the test is run
'(conj) ;;;just run conj (adding to the ds for now)
'((() [] #{})) ;;;add random values to blank structures
)
;;;;RESULT
#test.test-series{
:t-suites (
#test.test-suite{
:t-list 1297/10,
:t-vector 1297/10,
:t-set 1289/10}) ;;;;;;;;;;;;Result of adding values
:t-functions (conj)}
or about 130 ms for each of list, vector, and set; this is roughly the same rate as the vector above
Does anyone know why it's returning such different results depending on how it's run?
Is this Clojure-related, or possibly an optimization Java is doing?
The right way to test performance of Clojure code is Criterium. Among other things, Criterium reports statistical information about the distribution of execution times, and ensures that the JVM HotSpot compiler is warmed up before taking the measurements. The HotSpot compiler is likely the reason you are seeing these performance discrepancies.
Don't use def inside defn: def is designed for top-level global definitions. Use let for bindings that are only used inside one function.
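For example, here is a minimal sketch of your time-test rewritten with let (also using System/currentTimeMillis in place of java.util.Date; behavior is otherwise unchanged):
;; time-test with let instead of def; returns elapsed milliseconds
(defn time-test [func init-ds delta-list]
  (let [start (System/currentTimeMillis)]
    (reduce func init-ds delta-list)
    (- (System/currentTimeMillis) start)))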
Defining records that are only used once and exist only to hold a few values is not idiomatic in Clojure; the overhead of defining the classes outweighs any benefit they give you (in difficulty understanding the code, if not in the performance of your code). Save records for when you need to specialize a protocol or improve performance in a tight loop.
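Here a plain map would give you the same named fields with no class machinery, e.g.:
{:t-list 254/5, :t-vector 2249/20, :t-set 28641/20}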
When your priority is human readability of the numbers rather than accuracy, you can use double to coerce the ratios to a more readable format for printing.
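For example:
user> (double 2249/20)
112.45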
Here is how one would idiomatically test the properties you are interested in (a transcript from a REPL session, though this could be run from a -main function as well):
user> (require '[criterium.core :as crit])
nil
user> (def input (shuffle (range 1000000)))
#'user/input
user> (crit/bench (reduce conj [] input))
WARNING: Final GC required 3.501501258094413 % of runtime
WARNING: Final GC required 2.381979156956431 % of runtime
Evaluation count : 1680 in 60 samples of 28 calls.
Execution time mean : 36.435413 ms
Execution time std-deviation : 1.607847 ms
Execution time lower quantile : 35.764207 ms ( 2.5%)
Execution time upper quantile : 37.527755 ms (97.5%)
Overhead used : 2.222121 ns
Found 4 outliers in 60 samples (6.6667 %)
low-severe 1 (1.6667 %)
low-mild 3 (5.0000 %)
Variance from outliers : 30.3257 % Variance is moderately inflated by outliers
nil
user> (crit/bench (reduce conj () input))
WARNING: Final GC required 9.024275674955403 % of runtime
Evaluation count : 3540 in 60 samples of 59 calls.
Execution time mean : 19.623083 ms
Execution time std-deviation : 3.842658 ms
Execution time lower quantile : 17.891881 ms ( 2.5%)
Execution time upper quantile : 26.569738 ms (97.5%)
Overhead used : 2.222121 ns
Found 3 outliers in 60 samples (5.0000 %)
low-severe 3 (5.0000 %)
Variance from outliers : 91.0960 % Variance is severely inflated by outliers
nil
user> (crit/bench (reduce conj #{} input))
WARNING: Final GC required 12.0207279064623 % of runtime
Evaluation count : 120 in 60 samples of 2 calls.
Execution time mean : 965.389668 ms
Execution time std-deviation : 682.645918 ms
Execution time lower quantile : 674.958427 ms ( 2.5%)
Execution time upper quantile : 2.287535 sec (97.5%)
Overhead used : 2.222121 ns
Found 9 outliers in 60 samples (15.0000 %)
low-severe 9 (15.0000 %)
Variance from outliers : 98.2830 % Variance is severely inflated by outliers
nil
user>
Related
A question about typed/racket. I'm currently working my way through the Project Euler problems to better learn Racket. Some of my solutions are really slow, especially when dealing with primes and factors. So for some problems I've tried to make a typed/racket version, and I find no improvement in speed, quite the opposite. (I try to minimize the impact of overhead by using really big numbers; calculations take around 10 seconds.)
I know from the Racket docs that the best optimizations happen when using Floats/Flonums. So... yeah, I've tried to make float versions of problems dealing with integers, as in this problem, with a Racket version using integers and a typed/racket one artificially turning the integers into floats. I have to use tricks: checking equality between two numbers actually means checking that they are "close enough", as in this function, which checks whether x can be divided by y:
(: divide? (-> Flonum Flonum Boolean))
(define (divide? x y)
  (let ([r (/ x y)])
    (< (- r (floor r)) 1e-6)))
It works (well... the solution is correct) and I have a 30%-40% speed improvement.
How acceptable is this? Do people actually do that in real life? If not, what is the best way to optimize typed/racket solutions when using integers? Or should typed/racket be abandoned altogether when dealing with integers and reserved for problems with float calculations?
In most cases the solution is to use better algorithms rather than converting to Typed Racket.
Since most problems at Project Euler concern integers, here are a few tips and tricks:
The division operator / needs to compute the greatest common divisor of the numerator and the denominator in order to cancel out common factors. This makes / a bad choice if you only want to know whether one number divides another. Use (= (remainder n m) 0) to check whether m divides n. Also: use quotient rather than / when you know the division has a zero remainder.
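For example (divides? is just an illustrative name):
;; does m divide n? Integer-only, no rational arithmetic.
(define (divides? m n)
  (zero? (remainder n m)))
(divides? 7 35)  ; => #t
(quotient 35 7)  ; => 5, exact since the remainder is 0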
Use memoization to avoid recomputation. I.e. use a vector to store already computed results. Example: https://github.com/racket/math/blob/master/math-lib/math/private/number-theory/eulerian-number.rkt
First implement a naive algorithm. Then consider how to reduce the number of cases. A rule of thumb: brute force works best if you can reduce the number of cases to 1-10 million.
To reduce the number of cases, look for parametrizations of the search space. Example: if you need to find a Pythagorean triple, loop over numbers m and n and compute a = m^2 - n^2, b = 2mn, and c = m^2 + n^2. This will be faster than looping over a, b, and c and skipping those triples where a^2 + b^2 = c^2 does not hold.
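A sketch of that parametrization (Euclid's formula; the loop bounds here are arbitrary):
;; enumerate Pythagorean triples directly instead of testing all (a, b, c)
(for* ([m (in-range 2 20)]
       [n (in-range 1 m)])
  (define a (- (* m m) (* n n)))
  (define b (* 2 m n))
  (define c (+ (* m m) (* n n)))
  (printf "~a^2 + ~a^2 = ~a^2\n" a b c))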
Look for tips and tricks in the source of math/number-theory.
Not aspiring to be a real answer, since I can't provide any general tips soegaard hasn't posted already, but since I recently did "Amicable numbers", Problem 21, I thought I might as well leave my solution here (sadly, not many Lisp solutions get posted on Euler...).
(define (divSum n)
  (define (aux i sum)
    (if (> (sqr i) n)
        (if (= (sqr (sub1 i)) n) ; final check if n is a perfect square
            (- sum (sub1 i))
            sum)
        (aux (add1 i) (if (= (modulo n i) 0)
                          (+ sum i (/ n i))
                          sum))))
  (aux 2 1))
(define (amicableSum n)
  (define (aux a sum)
    (if (>= a n)
        sum
        (let ([b (divSum a)])
          (aux (add1 a)
               (if (and (> b a) (= (divSum b) a))
                   (+ sum a b)
                   sum)))))
  (aux 2 0))
> (time (amicableSum 10000))
cpu time: 47 real time: 46 gc time: 0
When dealing with divisors, one can often stop at the square root of n, as here with divSum. And when you find an amicable pair, you may as well add both members to the sum at once, which saves an unnecessary computation of (divSum b) in my code.
I have the following code:
(defn series-sum
  "Compute a series : (+ 1 1/4 1/7 1/10 1/13 1/16 ...)"
  [n]
  (->> (iterate (partial + 3) 1)
       (map #(/ 1 %))
       (take n)
       (reduce +)
       float
       (format "%.2f")
       (str)))
It is working just fine, except that it's running time explodes when numbers get big. On my computer (series-sum 2500) is maybe a second or two, but (series-sum 25000) and I have to kill my REPL.
I tried moving (take n) as far up the pipeline as possible, but that is not enough. I feel that I don't understand something about Clojure, since I don't see why it would be slower (I would expect (series-sum 25000) to take roughly 10 times as long as (series-sum 2500)).
There is an obvious loop/recur solution to optimize it, but I like the idea of being able to print the intermediate steps and of having the pipeline (with its (take n)) read like the docstring.
How can I improve the performance of this code while maintaining debuggability?
Better yet, can I measure the time of each step to see which one is taking the time?
Yes, it is relevant to #zerkms's link. You map to rationals; you should map to floats instead:
(defn series-sum
  "Compute a series : (+ 1 1/4 1/7 1/10 1/13 1/16 ...)"
  [n]
  (->> (iterate (partial + 3) 1)
       (take n)
       (map #(/ 1.0 %))
       (reduce +)
       (format "%.2f")))
Now it works much faster (rational arithmetic must reduce every intermediate sum, computing gcd's on ever-growing numerators and denominators, so each addition gets more expensive as n grows, while float addition is constant-time):
user> (time (series-sum 2500000))
"Elapsed time: 686.233199 msecs"
"5,95"
For this type of mathematical operation, computing in a loop is faster than using lazy sequences. This is an order of magnitude faster than the other answer for me:
(defn series-sum
  [n]
  (loop [i 0
         acc 0.0]
    (if (< i n)
      (recur (inc i)
             (+ acc (/ (float 1) (inc (* 3 i)))))
      (format "%.2f" acc))))
Note: you don't need the str because format returns a string.
Edit: of course this is not the main issue with the code in the original question; the bulk of the improvement comes from eliminating rationals, as shown by the other answer. This is just a further optimization.
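To see the rational overhead in isolation, you can compare the two reductions at the REPL (a sketch; absolute timings will vary by machine):
;; rational sum: every + must reduce the fraction (gcd on growing bignums)
(time (reduce + (map #(/ 1 %) (take 25000 (iterate (partial + 3) 1)))))
;; float sum: constant-time additions
(time (reduce + (map #(/ 1.0 %) (take 25000 (iterate (partial + 3) 1)))))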
I'm fairly new to JAGS, so this may be a dumb question. I'm trying to run a model in JAGS that predicts the probability that a one-dimensional random walk process will cross boundary A before crossing boundary B. This model can be solved analytically via the following logistic model:
Pr(A,B) = 1/(1 + exp(-2 * (d/sigma) * theta))
where "d" is the mean drift rate (positive values indicate drift toward boundary A), "sigma" is the standard deviation of that drift rate and "theta" is the distance between the starting point and the boundary (assumed to be equal for both boundaries).
My dataset consists of 50 participants, who each provide 1800 observations. My model assumes that d is determined by a particular combination of observed environmental variables (which I'll just call 'x'), and a weighting coefficient that relates x to d (which I'll call 'beta'). Thus, there are three parameters: beta, sigma, and theta. I'd like to estimate a single set of parameters for each participant. My intention is to eventually run a hierarchical model, where group level parameters influence individual level parameters. However, for simplicity, here I will just consider a model in which I estimate a single set of parameters for one participant (and thus the model is not hierarchical).
My model in rjags would be as follows:
model{
  for ( i in 1:Ntotal ) {
    d[i] <- x[i] * beta
    probA[i] <- 1/(1+exp(-2 * (d[i]/sigma) * theta))
    y[i] ~ dbern(probA[i])
  }
  beta ~ dunif(-10,10)
  sigma ~ dunif(0,10)
  theta ~ dunif(0,10)
}
This model runs fine but takes ages to run. I'm not sure how JAGS executes the code, but if this code were run in R it would be rather inefficient, because it would have to loop over cases, evaluating the model for each case individually. The time required would therefore increase rapidly with sample size. I have a rather large sample, so this is a concern.
Is there a way to vectorise this code so that it can calculate the likelihood for all of the data points at once? For example, if I were to fit this as a simple maximum-likelihood model, I would vectorize it and calculate the probability of the data given particular parameter values for all 1800 cases provided by the participant (and thus would not need the for loop). I would then take the log of these likelihoods and sum them to give a single log-likelihood for all the observations from the participant. This method has enormous time savings. Is there a way to do this in JAGS?
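In R, the vectorized log-likelihood I have in mind looks something like this (a sketch; the function name and signature are illustrative):
## Bernoulli log-likelihood for all cases at once, no loop
loglik <- function(beta, sigma, theta, x, y) {
  d <- x * beta
  probA <- 1 / (1 + exp(-2 * (d / sigma) * theta))
  sum(y * log(probA) + (1 - y) * log(1 - probA))
}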
EDIT
Thanks for the responses, and for pointing out that the parameters in the model I showed might be unidentified. I should've pointed out that that model was a simplified version. The full model is below:
model{
  for ( i in 1:Ntotal ) {
    aExpectancy[i] <- 1/(1+exp(-gamma*(aTimeRemaining[i] - aDiscrepancy[i]*aExpectedLag[i])))
    bExpectancy[i] <- 1/(1+exp(-gamma*(bTimeRemaining[i] - bDiscrepancy[i]*bExpectedLag[i])))
    aUtility[i] <- aValence[i]*aExpectancy[i]/(1 + discount * (aTimeRemaining[i]))
    bUtility[i] <- bValence[i]*bExpectancy[i]/(1 + discount * (bTimeRemaining[i]))
    aMotivationalValueMean[i] <- aUtility[i]*aQualityMean[i]
    bMotivationalValueMean[i] <- bUtility[i]*bQualityMean[i]
    aMotivationalValueVariance[i] <- (aUtility[i]*aQualitySD[i])^2 + (bUtility[i]*bQualitySD[i])^2
    bMotivationalValueVariance[i] <- (aUtility[i]*aQualitySD[i])^2 + (bUtility[i]*bQualitySD[i])^2
    mvDiffVariance[i] <- aMotivationalValueVariance[i] + bMotivationalValueVariance[i]
    meanDrift[i] <- (aMotivationalValueMean[i] - bMotivationalValueMean[i])
    probA[i] <- 1/(1+exp(-2*(meanDrift[i]/sqrt(mvDiffVariance[i])) * theta))
    y[i] ~ dbern(probA[i])
  }
  theta ~ dunif(0,10)
  discount ~ dunif(0,10)
  gamma ~ dunif(0,1)
}
In this model, the estimated parameters are theta, discount, and gamma, and these parameters can be recovered. When I run the model on the observations for a single participant (Ntotal = 1800), it takes about 5 minutes, which is totally fine. However, when I run it on the entire sample (45 participants x 1800 cases each = 81,000 observations), it has been running for 24 hours and is less than 50% of the way through. This seems odd, as I would expect it to take roughly 45 times as long, so 4 or 5 hours at most. Am I missing something?
I hope I am not misreading this situation (and I apologize in advance if I am), but your question seems to come from a conceptual misunderstanding of how JAGS works (or WinBUGS or OpenBUGS, for that matter).
Your program does not actually run, because what you wrote is not written in a programming language, so vectorizing will not help.
What you wrote is just a description of your model, because JAGS' language is a declarative one.
Once JAGS reads your model, it assembles a transition matrix to run an MCMC whose stationary distribution is the posterior distribution of your parameters given your (observed) data. JAGS does nothing else with your program.
All the time you spent waiting for the program to run was actually spent waiting (and hoping) for your MCMC to reach its relaxation time.
So what is making your run take so long is that the resulting transition matrix must have poor relaxation properties or something of the sort.
That is why vectorizing a program that is read and run only once will be of very little help.
So, your problem lies somewhere else.
I hope this helps and, if not, sorry.
All the best.
You can't vectorise in the same way that you would in R, but if you can group observations with the same probability expression (i.e. a common d[i]), then you can use a Binomial rather than a Bernoulli distribution, which will help enormously. If each observation has a unique d[i], then you are stuck, I'm afraid.
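For the simplified model above, that grouping would look something like this sketch (it assumes the data are pre-aggregated so that ySum[j] counts the successes among nTrials[j] observations sharing the same x[j]; those names are illustrative):
model{
  for ( j in 1:Ngroups ) {
    d[j] <- x[j] * beta
    probA[j] <- 1/(1+exp(-2 * (d[j]/sigma) * theta))
    ySum[j] ~ dbin(probA[j], nTrials[j])  # Binomial in place of Bernoulli
  }
  beta ~ dunif(-10,10)
  sigma ~ dunif(0,10)
  theta ~ dunif(0,10)
}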
Another alternative is to look at Stan, which is generally faster for large data sets like yours.
Matt
I am looking to find a local minimum of a scalar function of 4 variables, and I have range constraints on the variables ("box constraints"). There's no closed form for the function's derivative, so methods needing an analytical derivative function are out of the question. I've tried several options and control parameters with the optim function, but all of them seem very slow. Specifically, they seem to spend a lot of time between calls to my (R-defined) objective function, so I know the bottleneck is not my objective function but the "thinking" between calls to it. I looked at the CRAN Task View for optimization and tried several of the options there (DEoptim from RcppDE, etc.), but none of them seem any good. I would have liked to try the nloptr package (an R wrapper for the NLOPT library), but it seems to be unavailable for Windows.
I'm wondering, are there any good, fast optimization packages that people use that I may be missing? Ideally these would be in the form of thin wrappers around good C++/Fortran libraries, so there's minimal pure-R code. (Though this shouldn't be relevant, my optimization problem arose while trying to fit a 4-parameter distribution to a set of values, by minimizing a certain goodness-of-fit measure).
In the past I've found R's optimization libraries to be quite slow, and ended up writing a thin R wrapper calling a C++ API of a commercial optimization library. So are the best libraries necessarily commercial ones?
UPDATE. Here is a simplified example of the code I'm looking at:
###########
## given a set of values x and a cdf, calculate a measure of "misfit":
## smaller value is better fit
## x is assumed sorted in non-decreasing order
Misfit <- function(x, cdf) {
  nevals <<- nevals + 1
  thinkSecs <<- thinkSecs + (Sys.time() - snapTime)
  cat('S')
  if (nevals %% 20 == 0) cat('\n')
  L <- length(x)
  cdf_x <- pmax(0.0001, pmin(0.9999, cdf(x)))
  ## this is the Anderson-Darling statistic
  measure <- -L - (1/L) * sum((2 * (1:L) - 1) * (log(cdf_x) + log(1 - rev(cdf_x))))
  snapTime <<- Sys.time()
  cat('E')
  return(measure)
}
## Given 3 parameters nu (degrees of freedom, or shape),
## sigma (dispersion), gamma (skewness),
## returns the corresponding 4-parameter Student-t cdf parametrized by these params
## (we restrict the location parameter mu to be 0).
skewtGen <- function(p) {
  require(ghyp)
  pars <- student.t(nu = p[1], mu = 0, sigma = p[2], gamma = p[3])
  function(z) pghyp(z, pars)
}
## Fit using optim() and the L-BFGS-B method
fit_BFGS <- function(x, init = c()) {
  x <- sort(x)
  nevals <<- 0
  objFun <- function(par) Misfit(x, skewtGen(par))
  snapTime <<- Sys.time() ## global time snapshot
  thinkSecs <<- 0         ## secs spent "thinking" between objFun calls
  tUser <- system.time(
    res <- optim(init, objFun,
                 lower = c(2.1, 0.1, -1), upper = c(15, 2, 1),
                 method = 'L-BFGS-B',
                 control = list(trace = 2, factr = 1e12, pgtol = .01)))[1]
  cat('Total time = ', tUser,
      ' secs, ObjFun Time Pct = ', 100 * (1 - thinkSecs/tUser), '\n')
  cat('results:\n')
  print(res$par)
}
fit_DE <- function(x) {
  x <- sort(x)
  nevals <<- 0
  objFun <- function(par) Misfit(x, skewtGen(par))
  snapTime <<- Sys.time() ## global time snapshot
  thinkSecs <<- 0         ## secs spent "thinking" between objFun calls
  require(RcppDE)
  tUser <- system.time(
    res <- DEoptim(objFun,
                   lower = c(2.1, 0.1, -1),
                   upper = c(15, 2, 1)))[1]
  cat('Total time = ', tUser,
      ' secs, ObjFun Time Pct = ', 100 * (1 - thinkSecs/tUser), '\n')
  cat('results:\n')
  print(res$par)
}
Let's generate a random sample:
set.seed(1)
# generate 1000 standard-student-T points with nu = 4 (degrees of freedom)
x <- rt(1000,4)
First, fit using the fit.tuv (for "T UniVariate") function in the ghyp package; this uses the maximum-likelihood expectation-maximization (E-M) method. This is wicked fast!
require(ghyp)
> system.time( print(unlist( pars <- coef( fit.tuv(x, silent = TRUE) ))[c(2,4,5,6)]))
nu mu sigma gamma
3.16658356 0.11008948 1.56794166 -0.04734128
user system elapsed
0.27 0.00 0.27
Now I am trying to fit the distribution a different way: by minimizing the "misfit" measure defined above, using the standard optim() function in base R. Note that the results will not in general be the same. My reason for doing this is to compare the two approaches across a whole class of situations. I pass the maximum-likelihood estimate above as the starting point for this optimization.
> fit_BFGS( x, init = c(pars$nu, pars$sigma, pars$gamma) )
N = 3, M = 5 machine precision = 2.22045e-16
....................
....................
.........
iterations 5
function evaluations 7
segments explored during Cauchy searches 7
BFGS updates skipped 0
active bounds at final generalized Cauchy point 0
norm of the final projected gradient 0.0492174
final function value 0.368136
final value 0.368136
converged
Total time = 41.02 secs, ObjFun Time Pct = 99.77084
results:
[1] 3.2389296 1.5483393 0.1161706
I also tried to fit with DEoptim(), but it ran for too long and I had to kill it. As you can see from the output above, 99.8% of the time is attributable to the objective function! So Dirk and Mike were right in their comments below. I should have more carefully estimated the time spent in my objective function, and printing dots was not a good idea! Also, I suspect the MLE (E-M) method is very fast because it uses an analytical (closed-form) expression for the log-likelihood function.
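For what it's worth, the per-call cost of the objective can be estimated directly, without the snap-time bookkeeping (a sketch reusing the functions above; the parameter values are arbitrary):
## average elapsed seconds per objective call, over 20 calls
nevals <- 0; thinkSecs <- 0; snapTime <- Sys.time()  # globals Misfit expects
objFun <- function(par) Misfit(x, skewtGen(par))
system.time(for (i in 1:20) objFun(c(4, 1.5, 0)))[["elapsed"]] / 20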
A maximum likelihood estimator, when it exists for your problem, will always be faster than a global optimizer, in any language.
A global optimizer, no matter the algorithm, typically combines some random jumps with local minimization routines. Different algorithms may describe this in terms of populations (genetic algorithms), annealing, migration, etc., but they are all conceptually similar.
In practice, this means that if you have a smooth function, some other optimization algorithm will likely be fastest. The characteristics of your problem function will dictate whether it is a quadratic, linear, conic, or some other type of optimization problem for which an exact (or near-exact) analytical solution exists, or whether you will need to apply a global optimizer that is necessarily slower.
By using ghyp, you're saying that your 4-variable function produces output that can be fit by the generalized hyperbolic distribution, and you are using a maximum likelihood estimator to find the generalized hyperbolic distribution closest to the data you've provided. But if you are doing that, I'm afraid I don't understand how you could have a non-smooth surface requiring global optimization.
In general, the optimizer you choose needs to be chosen based on your problem. There is no perfect 'optimal optimizer', in any programming language, and choice of optimization algorithm to fit your problem will likely make more of a difference than any minor inefficiencies of the implementation.
I'm trying to find a better approach than my current one to manage the current state vector of a continuous Markov chain. The state vector stores pairs of (state, probability), where probability is a float.
The piece of the algorithm that needs to be optimized does the following operations:
every iteration starts with a current state vector
it computes the reachable states for every state in the vector and stores all of them in a temporary list, together with the probability of going there
for every element in this new list, it calculates the new state vector by iterating over the possible transitions (mind that many transitions can lead to the same state, found from different source states)
this is actually done using hashtables that have the states as keys and the probabilities as values.
So basically, to build up the new vector, for every transition the normalized value is calculated; then the state's entry in the vector is retrieved with a get, the probability of the current transition is added, and the result is stored back.
This approach has seemed to work so far, but I'm trying to cope with systems that lead to very large state vectors (500k to 1 million states), so although hashtables give constant complexity to store and retrieve, iterations are starting to slow down a lot.
This is because, for example, we start from a vector that has 100k states, and for each one we compute the reachable states (of which there is usually more than one), so we obtain a transition list of 100k * (average reachable states). Every transition is then gone through to compute the new probability vector.
Unfortunately, I need to go through the whole reachable list to find the normalizing value without actually computing the next vector, but in any case, as I saw through profiling, this is not the bottleneck of the algorithm. The bottleneck is in computing every transition.
This is the function used to compute the transition list from the current vector (pi):
HTable.fold (fun s p l ->
if check s f2 then (0., s, p, [s, 1.0]) :: l
else if not (check s f1) then (0., s, p, [s, 1.0]) :: l
else
let ts = P.rnext s in
if List.length ts = 0 then (0., s, p, [s, 1.0]) :: l
else
let lm = List.fold_left (fun a (s,f) -> f +. a) 0. ts in
(lm, s, p, ts) :: l) pi []
And this is the function that computes the new pi by going through the transition list and normalizing all the entries (here lm is the global normalizing value found from the whole transition list, and u is that list):
let update_pi s v =
  try
    let t = HTable.find pi s in
    HTable.replace pi s (v +. t)
  with Not_found -> HTable.add pi s v
in
HTable.clear pi;
List.iter (fun (llm, s, p, ts) ->
    if llm = 0. then
      update_pi s p
    else begin
      List.iter (fun (ss, pp) ->
          update_pi ss (p *. (pp /. lm))
        ) ts;
      if llm < lm then update_pi s (p *. (1. -. (llm /. lm)))
    end
  ) u;
I should find a data structure better suited to the operations I'm doing. The problem is that with the current approach I have to do a get and a set for every single transition, and doing so many operations on hashtables kills performance, since the constant cost is quite high.
It cannot hurt to replace the linear-time test if List.length ts = 0 with the constant-time test if ts = [], although I doubt this is going to solve your performance problem.
Your algorithm sounds a little like multiplying a matrix by a vector to get a new vector. This is usually sped up by blocking, but here the hashtable representation may be costing you locality. Would it be possible, once and for all, to number all the states and then use arrays instead of hashtables? Note that with arbitrary transitions the destination states would still not be local, but it may be an improvement, if only because accessing an array is more direct than accessing a hashtable.
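A minimal sketch of that idea, assuming the states have been numbered 0 .. n-1 once beforehand and each transition is represented as (source index, destination index, probability); this representation is an assumption, not your code:
(* one step of the chain over dense arrays; no hashtable traffic *)
let step n (transitions : (int * int * float) list) (pi : float array) =
  let pi' = Array.make n 0.0 in
  List.iter
    (fun (src, dst, prob) ->
       (* accumulate directly into the destination slot *)
       pi'.(dst) <- pi'.(dst) +. pi.(src) *. prob)
    transitions;
  pi'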