While playing with Logtalk, it seems my program takes longer to execute with Logtalk objects than with plain Prolog. I ran a benchmark comparing the execution of a simple predicate in plain Prolog with the equivalent Logtalk object encapsulation below:
%%
% plain prolog predicate
plain_prolog_simple :-
    fail.

%%
% object encapsulation
:- object(logtalk_obj).

    :- public([simple/0]).

    simple :-
        fail.

:- end_object.
Here's what I get:
?- benchmark(plain_prolog_simple).
Number of repetitions: 500000
Total time calls: 0.33799099922180176 seconds
Average time per call: 6.759819984436035e-7 seconds
Number of calls per second: 1479329.3346604244
true.
?- benchmark(logtalk_obj::simple).
Number of repetitions: 500000
Total time calls: 2.950408935546875 seconds
Average time per call: 5.90081787109375e-6 seconds
Number of calls per second: 169468.0333888435
true.
We can see that the logtalk_obj::simple call is much slower than the plain_prolog_simple call.
I use SWI-Prolog as the backend. I tried setting some Logtalk flags, without success.
Edit: the benchmark code samples can be found at https://github.com/koryonik/logtalk-experiments/tree/master/benchmarks
What's wrong? Why the performance difference? How can Logtalk message-sending calls be optimized?
In a nutshell, you're benchmarking the Logtalk compilation of the ::/2 goal at the top-level INTERPRETER. That's a classic benchmarking error. Goals at the top level, be they plain Prolog goals, explicitly module-qualified predicate goals, or message-sending goals, are always interpreted, i.e. compiled on the fly.
You get performance close to plain Prolog for message-sending goals in compiled source files, which is the most common scenario. See the benchmarks example in the Logtalk distribution for a benchmarking solution that avoids the above trap.
The performance gap (between plain Prolog and Logtalk goals) depends on the chosen backend Prolog compiler. The gap is negligible with mature Prolog VMs (e.g. SICStus Prolog or ECLiPSe) when static binding is possible. However, some Prolog VMs (e.g. SWI-Prolog) lack optimizations, which can make the gap bigger, especially in tight loops.
P.S. Logtalk ships out of the box with a settings configuration geared towards development, not performance. See in particular the documentation on the optimize flag, which should be turned on to enable static binding optimizations.
UPDATE
Starting from the code in your repository, and assuming SWI-Prolog as backend compiler, try:
----- code.lgt -----
% plain prolog predicate
plain_prolog_simple :-
    fail.

% object encapsulation
:- object(logtalk_obj).

    :- public(simple/0).

    simple :-
        fail.

:- end_object.
--------------------
----- bench.lgt -----
% load the SWI-Prolog "statistics" library
:- use_module(library(statistics)).

:- object(bench).

    :- public(bench/0).

    bench :-
        write('Plain Prolog goal:'), nl,
        prolog_statistics:time({plain_prolog_simple}).
    bench :-
        write('Logtalk goal:'), nl,
        prolog_statistics:time(logtalk_obj::simple).
    bench.

:- end_object.
---------------------
Save both files and then startup Logtalk:
$ swilgt
...
?- set_logtalk_flag(optimize, on).
true.
?- {code, bench}.
% [ /Users/pmoura/Desktop/bench/code.lgt loaded ]
% (0 warnings)
% [ /Users/pmoura/Desktop/bench/bench.lgt loaded ]
% (0 warnings)
true.
?- bench::bench.
Plain Prolog goal:
% 2 inferences, 0.000 CPU in 0.000 seconds (69% CPU, 125000 Lips)
Logtalk goal:
% 2 inferences, 0.000 CPU in 0.000 seconds (70% CPU, 285714 Lips)
true.
The time/1 predicate is a meta-predicate. The Logtalk compiler uses the meta-predicate property to compile the time/1 argument. The {}/1 control construct is a Logtalk compiler bypass. It ensures that its argument is called as-is in the plain Prolog database.
A benchmarking trick that works with backends such as SWI-Prolog and YAP (and possibly others) that provide a time/1 meta-predicate is to use this predicate together with Logtalk's <</2 debugging control construct and the logtalk built-in object. Using SWI-Prolog as the backend compiler:
?- set_logtalk_flag(optimize, on).
...
?- time(true). % ensure the library providing time/1 is loaded
...
?- {code}.
...
?- time(plain_prolog_simple).
% 2 inferences, 0.000 CPU in 0.000 seconds (59% CPU, 153846 Lips)
false.
?- logtalk<<(prolog_statistics:time(logtalk_obj::simple)).
% 2 inferences, 0.000 CPU in 0.000 seconds (47% CPU, 250000 Lips)
false.
A quick explanation: the <</2 control construct compiles its goal argument before calling it. As the optimize flag is turned on and time/1 is a meta-predicate, its argument is fully compiled and static binding is used for the message sending, hence the same number of inferences as above. This trick thus allows quick benchmarking of Logtalk message-sending goals at the top level.
Using YAP is similar but simpler as time/1 is a built-in meta-predicate instead of a library meta-predicate as in SWI-Prolog.
Interpreters for object orientation can also be made quite fast. Jekejeke Prolog has a purely interpreted (::)/2 operator, and there is not much overhead as of now. This is the test code:
Jekejeke Prolog 3, Runtime Library 1.3.0
(c) 1985-2018, XLOG Technologies GmbH, Switzerland
?- [user].
plain :- fail.
:- begin_module(obj).
simple(_) :- fail.
:- end_module.
And these are some actual results. There is no drastic difference between a plain call and an (::)/2-operator-based call. Under the hood, both predicate lookups are inline-cached:
?- time((between(1,500000,_), plain, fail; true)).
% Up 76 ms, GC 0 ms, Thread Cpu 78 ms (Current 06/23/18 23:02:41)
Yes
?- time((between(1,500000,_), obj::simple, fail; true)).
% Up 142 ms, GC 11 ms, Thread Cpu 125 ms (Current 06/23/18 23:02:44)
Yes
There is still some overhead, which might be removed in the future. It stems from a miniature rewrite that we still do for each (::)/2 call. But maybe this goes away; we are working on it.
Edit 23.06.2018: We now have a built-in between/3 and have already implemented a few optimizations. The above figures show a preview of this new prototype, which is not yet released.
Related
If I use gfortran (Homebrew GCC 8.2.0) on my Mac to compile the simple program below without optimization (-O0) the call to matmul consistently executes in ~90 milliseconds. If I use any optimization (flags -O1, -O2 or -O3) the execution time increases to ~250 milliseconds. I've tried using a wide range of different sizes for inVect and matrix but in all cases the -O0 option outperforms the other three optimization flags by at least a factor of 2.5. If I use smaller matrices with just a few hundred elements but loop over many calls to matmul the performance hit is even worse, close to a factor of 10.
Is there a way I can avoid this behavior? I need to use optimization in some portions of my code but, at the same time, I also would like to perform the matrix multiplication as efficiently as possible.
I compile the file sandbox.f90 containing the code below with the command gfortran -ON sandbox.f90, where N is an optimization level 0-3 (no other compiler flags are used). The first value of outVect is printed solely to keep the gfortran optimization from being clever and skipping the call to matmul altogether.
I'm Fortran novice so I apologize in advance if I am missing something obvious here.
program main
  implicit none
  real :: inVect(20000), matrix(20000,10000), outVect(10000)
  real :: start, finish

  call random_number(inVect)
  call random_number(matrix)

  call cpu_time(start)
  outVect = matmul(inVect, matrix)
  call cpu_time(finish)

  print '("Time = ",f10.7," seconds. – First Value = ",f10.4)', finish-start, outVect(1)
end program main
First, consider that I may be wrong. I just saw this problem for the first time, and I'm as surprised as you.
Having studied the problem, I understand it as follows. The optimization levels -O0, -O3, -Ofast, and so on are tuned for the most general (most frequent) cases. However, in some cases a higher level is less efficient than a lower one, because each level implicitly enables a set of flags, and one of those flags can hurt the specific task at hand. In your case, -O3 implies, among other things, that calls to matmul() are inlined. That is generally good, but not necessarily for big arrays or for many calls to this function. Somehow, the cost of inlining matmul() outweighs the gain of having it inlined (at least this is how I see it).
To avoid this behavior, I suggest compiling with -O3 -finline-matmul-limit=0 (e.g. gfortran -O3 -finline-matmul-limit=0 sandbox.f90), which disables the inlining of matmul. With these flags, the execution time is no worse than what you get with -O0.
More generally, -finline-matmul-limit=n inlines matmul only when the involved arrays have fewer than n elements; I use n=0 for simplicity.
I hope this helps.
I'm trying to use SLSQP to optimise the angle of attack of an aerofoil to place the stagnation point in a desired location. This is purely as a test case to check that my method for calculating the partials for the stagnation position is valid.
When run with COBYLA, the optimisation converges to the correct alpha (6.04144912) after 47 iterations. When run with SLSQP, it completes one iteration, then hangs for a very long time (10, 20 minutes or more, I didn't time it exactly), and exits with an incorrect value. The output is:
Driver debug print for iter coord: rank0:ScipyOptimize_SLSQP|0
--------------------------------------------------------------
Design Vars
{'alpha': array([0.5])}
Nonlinear constraints
None
Linear constraints
None
Objectives
{'obj_cmp.obj': array([0.00023868])}
Driver debug print for iter coord: rank0:ScipyOptimize_SLSQP|1
--------------------------------------------------------------
Design Vars
{'alpha': array([0.5])}
Nonlinear constraints
None
Linear constraints
None
Objectives
{'obj_cmp.obj': array([0.00023868])}
Optimization terminated successfully. (Exit mode 0)
Current function value: 0.0002386835700364719
Iterations: 1
Function evaluations: 1
Gradient evaluations: 1
Optimization Complete
-----------------------------------
Finished optimisation
Why might SLSQP be misbehaving like this? As far as I can tell, there are no incorrect analytical derivatives when I look at check_partials().
The code is quite long, so I put it on Pastebin here:
core: https://pastebin.com/fKJpnWHp
inviscid: https://pastebin.com/7Cmac5GF
aerofoil coordinates (NACA64-012): https://pastebin.com/UZHXEsr6
You asked two questions whose answers ended up being unrelated to each other:
Why is the model so slow when you use SLSQP, but fast when you use COBYLA?
Why does SLSQP stop after one iteration?
1) Why is SLSQP so slow?
COBYLA is a gradient-free method. SLSQP uses gradients. So the solid bet was that the slowdown happened when SLSQP asked for the derivatives (which COBYLA never did).
That's where I went to look first. Computing derivatives happens in two steps: (a) compute partials for each component, and (b) solve a linear system with those partials to compute totals. The slowdown has to be in one of those two steps.
Since you can run check_partials without too much trouble, step (a) is not likely to be the culprit. So that means step (b) is probably where we need to speed things up.
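As a rough way to see which step dominates, you can time the two steps separately at the script level. This is only a sketch: it assumes p is the Problem instance from your script and that 'alpha' and 'obj' are the promoted names of your design variable and objective.
import time

p.run_model()

# step (a): exercise the component partials (output suppressed)
t0 = time.time()
p.check_partials(compact_print=True, out_stream=None)
t1 = time.time()

# step (b): total derivatives, which include the linear solve
p.compute_totals(of=['obj'], wrt=['alpha'])
t2 = time.time()

print('partials: %.2f s, totals (incl. linear solve): %.2f s' % (t1 - t0, t2 - t1))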
I ran the summary utility (openmdao summary core.py) on your model and saw this:
============== Problem Summary ============
Groups: 9
Components: 36
Max tree depth: 4
Design variables: 1 Total size: 1
Nonlinear Constraints: 0 Total size: 0
equality: 0 0
inequality: 0 0
Linear Constraints: 0 Total size: 0
equality: 0 0
inequality: 0 0
Objectives: 1 Total size: 1
Input variables: 87 Total size: 1661820
Output variables: 44 Total size: 1169614
Total connections: 87 Total transfer data size: 1661820
Then I generated an N2 diagram of your model (not reproduced here).
So we have an output vector that is 1169614 elements long, which means your linear system is a matrix that is about 1e6 x 1e6. That's pretty big, and you are using a DirectSolver to try to compute and store a factorization of it. That's the source of the slowdown. Using DirectSolvers is great for smaller models (a rule of thumb is that the output vector should be less than 10000 elements). For larger ones you need to be more careful and use more advanced linear solvers.
In your case we can see from the N2 that there is no coupling anywhere in your model (nothing in the lower triangle of the N2). Purely feed-forward models like this can use the much simpler and faster LinearRunOnce solver (which is the default if you don't set anything else). So I turned off all DirectSolvers in your model, and the derivatives became effectively instant (the resulting N2, not shown here, has no DirectSolver anywhere); a sketch of the change follows.
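Assuming the usual openmdao.api import (the subgroup name below is hypothetical), the point is simply to stop assigning DirectSolver to purely feed-forward groups, or to override those assignments explicitly:
import openmdao.api as om

p = om.Problem()

# ... build the model as before ...

# let feed-forward groups fall back to the default LinearRunOnce solver
# (equivalently, just delete the DirectSolver assignments)
p.model.linear_solver = om.LinearRunOnce()
# p.model.some_subgroup.linear_solver = om.LinearRunOnce()  # hypothetical subgroup

p.setup()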
The choice of best linear solver is extremely model dependent. One factor to consider is computational cost, another is numerical robustness. This issue is covered in some detail in Section 5.3 of the OpenMDAO paper, and I won't cover everything here. But very briefly here is a summary of the key considerations.
When just starting out with OpenMDAO, using DirectSolver is both the simplest and usually the fastest option. It is simple because it does not require consideration of your model structure, and it's fast because for small models OpenMDAO can assemble the Jacobian into a dense or sparse matrix and hand that to a direct factorization. However, for larger models (or models with very large vectors of outputs), the cost of computing the factorization is prohibitively high. In this case, you need to break the solver structure down more intentionally and use other linear solvers (sometimes in conjunction with the direct solver; see Section 5.3 of the OpenMDAO paper and this OpenMDAO doc).
You stated that you wanted to use the DirectSolver to take advantage of the sparse Jacobian storage. That was a good instinct, but the way OpenMDAO is structured this is not a problem either way. We are pretty far down in the weeds now, but since you asked I'll give a short summary explanation. As of OpenMDAO 3.7, only the DirectSolver requires an assembled Jacobian at all (and in fact, it is the linear solver itself that determines this for whatever system it is attached to). All other LinearSolvers work with a DictionaryJacobian (which stores each sub-jac keyed to the [of-var, wrt-var] pair). Each sub-jac can be stored as dense or sparse (depending on how you declared that particular partial derivative). The dictionary Jacobian is effectively a form of a sparse-matrix, though not a traditional one. The key takeaway here is that if you use the LinearRunOnce (or any other solver), then you are getting a memory efficient data storage regardless. It is only the DirectSolver that changes over to a more traditional assembly of an actual matrix object.
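For example, a sub-jac is stored in sparse (coordinate) form whenever it is declared with rows/cols. The component below is a made-up illustration, not part of your model:
import numpy as np
import openmdao.api as om

class ScaleComp(om.ExplicitComponent):
    # hypothetical example: y = 2*x, with a diagonal (hence sparse) sub-jacobian
    def setup(self):
        self.n = 5
        self.add_input('x', np.zeros(self.n))
        self.add_output('y', np.zeros(self.n))

    def setup_partials(self):
        idx = np.arange(self.n)
        # declaring rows/cols stores this sub-jac in sparse form in the DictionaryJacobian
        self.declare_partials('y', 'x', rows=idx, cols=idx, val=2.0)

    def compute(self, inputs, outputs):
        outputs['y'] = 2.0 * inputs['x']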
(Regarding memory allocation, there is a helpful figure in the OpenMDAO docs; it is not reproduced here.)
2) Why does SLSQP stop after one iteration?
Gradient-based optimizations are very sensitive to scaling. I plotted your objective function across your allowed design space (plot not reproduced here). We can see that the minimum is at about 6 degrees, but the objective values are tiny (about 1e-4).
As a general rule of thumb, getting your objective to around order of magnitude 1 is a good idea (we have a scaling report feature that helps with this). I added a reference that was about the order of magnitude of your objective:
p.model.add_objective('obj', ref=1e-4)
Then I got a good result:
Optimization terminated successfully (Exit mode 0)
Current function value: [3.02197589e-11]
Iterations: 7
Function evaluations: 9
Gradient evaluations: 7
Optimization Complete
-----------------------------------
Finished optimization
alpha = [6.04143334]
time: 2.1188600063323975 seconds
Unfortunately, scaling is just hard with gradient-based optimization. Starting by scaling your objective/constraints to order 1 is a decent rule of thumb, but it's common to need further adjustments for more complex problems.
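As a sketch of that rule of thumb (the bounds and the commented-out constraint are illustrative, not taken from your model), the driver then works with quantities that are roughly order 1 in scaled form:
# driver sees scaled values: (value - ref0) / (ref - ref0); ref0 defaults to 0
p.model.add_design_var('alpha', lower=0.0, upper=10.0, ref=10.0)   # bounds illustrative
p.model.add_objective('obj', ref=1e-4)                             # objective is ~1e-4 unscaled
# p.model.add_constraint('some_con', upper=0.0, ref=1e3)           # hypothetical constraint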
I am not able to work out what the parameter parallel_iterations stands for when sampling multiple chains during MCMC.
The documentation for mcmc.sample_chain() doesn't give much detail; it just says that
The parallel iterations are the number of iterations allowed to run in parallel. It must be a positive integer.
I am running a NUTS sampler with multiple chains while specifying parallel_iterations=8.
Does it mean that the chains are strictly run in parallel? Is the parallel execution dependent on multi-core support? If so, what is a good value (based on the number of cores) to set parallel_iterations? Should I naively set it to some higher value?
TensorFlow can unroll iterations of while loops to execute in parallel, when some parts of the data flow (i.e. the iteration condition) can be computed faster than other parts. If you don't have a special preference (e.g. reproducibility with legacy stateful samplers), leave it at the default.
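For illustration only, here is a minimal sketch of where the argument goes (the standard-normal toy target is made up, not from your model). It only caps how many iterations of the underlying while loop TensorFlow may dispatch concurrently; it does not change the statistics of the chains:
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# toy target density: standard normal
target = tfd.Normal(loc=0., scale=1.)

kernel = tfp.mcmc.NoUTurnSampler(
    target_log_prob_fn=target.log_prob,
    step_size=0.1)

# 8 independent chains over a scalar state; parallel_iterations bounds how many
# iterations of the sampling while-loop may be evaluated concurrently
samples = tfp.mcmc.sample_chain(
    num_results=1000,
    num_burnin_steps=500,
    current_state=tf.zeros([8]),
    kernel=kernel,
    trace_fn=None,
    parallel_iterations=8)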
I'm trying to find the source code for TensorFlow's low level linear-algebra and matrix arithmetic operators for execution on CPU. For example, where is the actual implementation of tf.add() for execution on a CPU? As far as I know, most linear algebra operators are actually implemented by Eigen, but I'd like to know what Eigen functions specifically are being called.
I've tried tracing back from the high-level API, but this is difficult as there are a lot of steps between placing an operator on the graph, and the actual execution of the operator by the TF runtime.
The implementation is hidden behind some meta-template programming (not unusual for Eigen).
Each operation in TensorFlow is registered at some point. Add is registered in the element-wise (cwise) op kernel files, for example:
REGISTER3(BinaryOp, GPU, "Add", functor::add, float, Eigen::half, double);
The actual implementation of operations is based on OpKernel. The Add operation is implemented in BinaryOp::Compute. The class hierarchy is BinaryOp : BinaryOpShared : OpKernel.
In the case of adding two scalars, the entire implementation is just:
functor::BinaryFunctor<Device, Functor, 1>().Right(
eigen_device, out_flat, in0.template flat<Tin>(),
in1.template scalar<Tin>(), error_ptr);
where in0, in1 are the incoming Tensor-Scalars, Device is either GPU or CPU, and Functor is the operation itself. The other lines are just for performing the broadcasting.
Scrolling down in that file and expanding the REGISTER3 macro explains how the arguments are passed from REGISTER3 to functor::BinaryFunctor<Device, Functor, ...>.
Do not expect to see explicit loops: Eigen uses expression templates to do lazy evaluation and handle aliasing. The Eigen "call" is here:
https://github.com/tensorflow/tensorflow/blob/7a0def60d45c1841a4e79a0ddf6aa9d50bf551ac/tensorflow/core/kernels/cwise_ops.h#L693-L696
I am trying to use R to estimate a multinomial logit model with a manual specification. I have found a few packages that allow you to estimate MNL models here or here.
I've found some other writings on "rolling" your own MLE function here. However, from my digging around, all of these functions and packages rely on the internal optim function.
In my benchmark tests, optim is the bottleneck. Using a simulated dataset with ~16000 observations and 7 parameters, R takes around 90 seconds on my machine. The equivalent model in Biogeme takes ~10 seconds. A colleague who writes his own code in Ox reports around 4 seconds for this same model.
Does anyone have experience with writing their own MLE function or can point me in the direction of something that is optimized beyond the default optim function (no pun intended)?
If anyone wants the R code to recreate the model, let me know and I'll gladly provide it. I haven't provided it since it isn't directly relevant to the problem of optimizing the optim function, and to preserve space.
EDIT: Thanks to everyone for your thoughts. Based on a myriad of comments below, we were able to get R into the same ballpark as Biogeme for more complicated models, and R was actually faster for several smaller/simpler models that we ran. I think the long-term solution to this problem is going to involve writing a separate maximization function that relies on a Fortran or C library, but I am certainly open to other approaches.
Have you tried the nlm() function already? I don't know if it's much faster, but it does improve speed. Also check the options: optim uses a slow algorithm as the default, and you can gain a more than 5-fold speedup by using the quasi-Newton algorithm (method="BFGS") instead. If you're not too concerned about the last digits, you can also raise the tolerance levels of nlm() to gain extra speed.
f <- function(x) sum((x-1:length(x))^2)
a <- 1:5
system.time(replicate(500,
optim(a,f)
))
user system elapsed
0.78 0.00 0.79
system.time(replicate(500,
optim(a,f,method="BFGS")
))
user system elapsed
0.11 0.00 0.11
system.time(replicate(500,
nlm(f,a)
))
user system elapsed
0.10 0.00 0.09
system.time(replicate(500,
nlm(f,a,steptol=1e-4,gradtol=1e-4)
))
user system elapsed
0.03 0.00 0.03
Did you consider the material on the CRAN Task View for Optimization?
I am the author of the R package optimParallel, which could be helpful in your case. The package provides parallel versions of the gradient-based optimization methods of optim(). The main function of the package is optimParallel(), which has the same usage and output as optim(). Using optimParallel() can significantly reduce optimization times, as illustrated in the figure at the links below (p is the number of parameters).
See https://cran.r-project.org/package=optimParallel and http://arxiv.org/abs/1804.11058 for more information.
FWIW, I've done this in C-ish, using OPTIF9. You'd be hard-pressed to go faster than that. There are plenty of ways for something to go slower, such as by running an interpreter like R.
Added: From the comments, it's clear that OPTIF9 is used as the optimizing engine. That means that most likely the bulk of the time is spent in evaluating the objective function in R. While it is possible that C functions are being used underneath for some of the operations, there still is interpreter overhead. There is a quick way to determine which lines of code and function calls in R are responsible for most of the time, and that is to pause it with the Escape key and examine the stack. If a statement costs X% of time, it is on the stack X% of the time. You may find that there are operations that are not going to C and should be. Any speedup factor you get this way will be preserved when you find a way to parallelize the R execution.