Gap tolerance control in Z3 optimization

I would like to use the z3 optimize class to get sub-optimal results, while still being able to control how far I am from the optimum result. I am using the C++ API.
As an example, CPLEX has the parameters epgap and epagap for relative and absolute tolerance, respectively. It uses the current lower or upper bound (depending on whether it is a minimization or a maximization) to assess how far (at most) the current solution is from the optimal one.
This leads to shorter run-times when an approximate solution is already good enough.
Is this possible using the optimize class, or is this something I would need to implement using a solver instance, controlling the bounds myself?

I am not absolutely certain about this, but I doubt that z3 has such parameters.
For sure, nothing like that appears to be exposed in the command-line interface:
~$ z3 -p
...
[module] opt, description: optimization parameters
dump_benchmarks (bool) (default: false)
dump_models (bool) (default: false)
elim_01 (bool) (default: true)
enable_sat (bool) (default: true)
enable_sls (bool) (default: false)
maxlex.enable (bool) (default: true)
maxres.add_upper_bound_block (bool) (default: false)
maxres.hill_climb (bool) (default: true)
maxres.max_core_size (unsigned int) (default: 3)
maxres.max_correction_set_size (unsigned int) (default: 3)
maxres.max_num_cores (unsigned int) (default: 4294967295)
maxres.maximize_assignment (bool) (default: false)
maxres.pivot_on_correction_set (bool) (default: true)
maxres.wmax (bool) (default: false)
maxsat_engine (symbol) (default: maxres)
optsmt_engine (symbol) (default: basic)
pb.compile_equality (bool) (default: false)
pp.neat (bool) (default: true)
priority (symbol) (default: lex)
rlimit (unsigned int) (default: 0)
solution_prefix (symbol) (default: )
timeout (unsigned int) (default: 4294967295)
...
Alternative #01:
An option is to implement this yourself on top of z3.
I would suggest using the binary search schema (see Optimization in SMT with LA(Q) Cost Functions), otherwise the OMT solver is going to refine only one end of the optimization search interval and this may defeat the intended purpose of your search-termination criteria.
Notice that in order for this approach to be effective, it is important that the internal T-optimizer is invoked over the Boolean assignment of each intermediate model encountered along the search. (I am not sure whether this functionality is exposed at the API level with z3).
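For concreteness, here is a minimal sketch of what such a gap-controlled binary search could look like on top of a plain z3::solver with the C++ API, assuming a single integer minimization objective obj with known initial bounds [lo, hi] (hi witnessed by at least one model) and an absolute gap abs_gap; the function name and structure are mine, not a z3 feature:
#include <z3++.h>
// Gap-tolerant binary-search minimization sketch on top of z3::solver.
// Stops as soon as the remaining search interval is no wider than abs_gap.
int minimize_with_gap(z3::context& ctx, z3::solver& s,
                      const z3::expr& obj, int lo, int hi, int abs_gap)
{
    int best = hi;                          // best objective value seen so far
    while (hi - lo > abs_gap) {
        int mid = lo + (hi - lo) / 2;
        s.push();
        s.add(obj <= ctx.int_val(mid));     // try to improve below the midpoint
        if (s.check() == z3::sat) {
            best = s.get_model().eval(obj).get_numeral_int();
            hi = best;                      // the optimum is at most best
        } else {
            lo = mid + 1;                   // no model below mid: raise the lower bound
        }
        s.pop();
    }
    return best;                            // within abs_gap of the true optimum
}
A relative tolerance in the spirit of CPLEX's epgap is the same loop with the termination test replaced by a check on (hi - lo) relative to the magnitude of the current bounds (or to the initial interval size).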
You may also want to take a look at the approach based on linear regression used in Puli - A Problem-Specific OMT Solver. If applicable, it may speed-up the optimization search and improve the estimate of the relative distance from the optimal solution.
Alternative #02:
OptiMathSAT may be exposing the functionality you are looking for, both at the API and the command-line level:
~$ optimathsat -help
Optimization search options:
-opt.abort_interval=FLOAT
If greater than zero, an objective is no longer actively optimized as
soon as the current search interval size is smaller than the given
value. Applies to all objective functions. (default: 0)
-opt.abort_tolerance=FLOAT
If greater than zero, an objective is no longer actively optimized as
soon as the ratio among the current search interval size wrt. its
initial size is smaller than the given value. Applies to all
objective functions. (default: 0)
The abort interval is a termination criterion based on the absolute size of the current optimization search interval, while the abort tolerance is a termination criterion based on the relative size of the current optimization search interval with respect to the initial search interval.
Notice that in order to use these termination criteria, the user is expected to:
provide (at least) an initial lower-bound for any minimization objective:
(minimize ... :lower ...)
provide (at least) an initial upper-bound for any maximization objective:
(maximize ... :upper ...)
Furthermore, the tool must be configured to use either Binary or Adaptive search:
-opt.strategy=STR
Sets the optimization search strategy:
- lin : linear search (default)
- bin : binary search
- ada : adaptive search
A lower bound is required to minimize an objective with bin/ada
search strategy. Dual for maximization.
In case neither of these termination criteria is satisfactory to you, you can also implement your own algorithm on top of OptiMathSAT. It is relatively easy to do, thanks to the following option, which can be set both via the API and the command-line:
-opt.no_optimization=BOOL
If true, the optimization search stops at the first (not optimal)
satisfiable solution. (default: false)
When enabled, it makes OptiMathSAT behave like a regular SMT solver, except that when it finds a complete Boolean assignment for which there exists a Model of the input formula, it ensures that the Model is optimal wrt. the objective function and the given Boolean assignment (in other words, it invokes the internal T-optimizer procedure for you).
Some Thoughts.
OMT solvers work differently from most CP solvers. They use infinite-precision arithmetic and the optimization search is guided by the SAT engine. Improving the value of the objective function becomes increasingly difficult because the OMT solver is forced to enumerate a progressively larger number of (possibly total) Boolean assignments while resolving conflicts and back-jumping along the way.
In my opinion, the size of the current search interval is not always a good indicator of the relative difficulty of making progress with the optimization search. There are far too many factors to take into consideration, e.g. the pruning power of conflict clauses involving the objective function, the encoding of the input formula, and so on. This is also one of the reasons why, as far as I have seen, most people in the OMT community simply use a fixed timeout rather than bothering to use any other termination criteria. The only situation in which I have found it to be particularly useful, is when dealing with non-linear optimization (which, however, is not yet publicly available with OptiMathSAT).

Related

Is it possible to get the native CPU size of an integer in Rust?

For fun, I'm writing a bignum library in Rust. My goal (as with most bignum libraries) is to make it as efficient as I can. I'd like it to be efficient even on unusual architectures.
It seems intuitive to me that a CPU will perform arithmetic faster on integers with the native number of bits for the architecture (i.e., u64 for 64-bit machines, u16 for 16-bit machines, etc.). As such, since I want to create a library that is efficient on all architectures, I need to take the target architecture's native integer size into account. The obvious way to do this would be to use the cfg attribute target_pointer_width. For instance, to define the smallest type which will always be able to hold more than the maximum native int size:
#[cfg(target_pointer_width = "16")]
type LargeInt = u32;
#[cfg(target_pointer_width = "32")]
type LargeInt = u64;
#[cfg(target_pointer_width = "64")]
type LargeInt = u128;
However, while looking into this, I came across this comment. It gives an example of an architecture where the native int size is different from the pointer width. Thus, my solution will not work for all architectures. Another potential solution would be to write a build script which codegens a small module which defines LargeInt based on the size of a usize (which we can acquire like so: std::mem::size_of::<usize>().) However, this has the same problem as above, since usize is based on the pointer width as well. A final obvious solution is to simply keep a map of native int sizes for each architecture. However, this solution is inelegant and doesn't scale well, so I'd like to avoid it.
So, my questions: is there a way to find the target's native int size, preferably before compilation, in order to reduce runtime overhead? Is this effort even worth it? That is, is there likely to be a significant difference between using the native int size as opposed to the pointer width?
It's generally hard (or impossible) to get compilers to emit optimal code for BigNum stuff; that's why https://gmplib.org/ has its low-level primitive functions (mpn_... docs) hand-written in assembly for various target architectures, with tuning for different micro-architectures, e.g. https://gmplib.org/repo/gmp/file/tip/mpn/x86_64/core2/mul_basecase.asm for the general case of multi-limb * multi-limb numbers, and https://gmplib.org/repo/gmp/file/tip/mpn/x86_64/coreisbr/aors_n.asm for mpn_add_n and mpn_sub_n (Add OR Sub = aors), tuned for the SandyBridge family, which doesn't have partial-flag stalls so it can loop with dec/jnz.
Understanding what kind of asm is optimal may be helpful when writing code in a higher-level language, although in practice you can't even get close to that, so it sometimes makes sense to use a different technique, like only using values up to 2^30 in 32-bit integers (like CPython does internally, getting the carry-out via a right shift; see the section about Python in this). In Rust you do have access to overflowing_add to get the carry-out, but using it is still hard.
For practical use, writing Rust bindings for GMP is probably your best bet, unless that already exists.
Using the largest chunks possible is very good; on all current CPUs, add reg64, reg64 has the same throughput and latency as add reg32, reg32 or reg8. So you get twice as much work done per unit. And carry propagation through 64 bits of result in 1 cycle of latency.
(There are alternate ways to store BigInteger data that can make SIMD useful; @Mysticial explains in Can long integer routines benefit from SSE?, e.g. 30 value bits per 32-bit int, allowing you to defer normalization until after a few addition steps. But every use of such numbers has to be aware of these issues so it's not an easy drop-in replacement.)
In Rust, you probably want to just use u64 regardless of the target, unless you really care about small-number (single-limb) performance on 32-bit targets. Let the compiler build u64 operations for you out of add / adc (add with carry).
The only thing that might need to be ISA-specific is if u128 is not available on some targets. You want to use 64 * 64 => 128-bit full multiply as your building block for multiplication; if the compiler can do that for you with u128 then that's great, especially if it inlines efficiently.
See also discussion in comments under the question.
One stumbling block for getting compilers to emit efficient BigInt addition loops (even inside the body of one unrolled loop) is writing an add that takes a carry input and produces a carry output. Note that x += 0xff..ff + carry=1 needs to produce a carry out even though 0xff..ff + 1 wraps to zero. So in C or Rust, x += y + carry has to check for carry out in both the y+carry and the x+= parts.
It's really hard (probably impossible) to convince compiler back-ends like LLVM to emit a chain of adc instructions. An add/adc pair is doable when you don't need the carry-out from the adc, or probably if the compiler is doing it for you for u128.overflowing_add.
Often compilers will turn the carry flag into a 0 / 1 in a register instead of using adc. You can hopefully avoid that for at least pairs of u64 in addition by combining the input u64 values to u128 for u128.overflowing_add. That will hopefully not cost any asm instructions because a u128 already has to be stored across two separate 64-bit registers, just like two separate u64 values.
So combining up to u128 could just be a local optimization for a function that adds arrays of u64 elements, to get the compiler to suck less.
In my library ibig what I do is:
Select architecture-specific size based on target_arch.
If I don't have a value for an architecture, select 16, 32 or 64 based on target_pointer_width.
If target_pointer_width is not one of these values, use 64.

A general-purpose warp-level std::copy-like function - what should it account for?

A C++ standard library implements std::copy (ignoring all sorts of wrappers, concept checks, etc.) with the following simple loop:
for (; __first != __last; ++__result, ++__first)
    *__result = *__first;
Now, suppose I want a general-purpose std::copy-like function for warps (not blocks; not grids) to use for collaboratively copying data from one place to another. Let's even assume for simplicity that the function takes pointers rather than an arbitrary iterator.
Of course, writing general-purpose code in CUDA is often a useless pursuit - since we might be sacrificing a lot of the benefit of using a GPU in the first place in favor of generality - so I'll allow myself some boolean/enum template parameters to possibly select between frequently-occurring cases, avoiding runtime checks. So the signature might be, say:
template <typename T, bool SomeOption, my_enum_t AnotherOption>
T* copy(
T* __restrict__ destination,
const T* __restrict__ source,
size_t length
);
but for each of these cases I'm aiming for optimal performance (or optimal expected performance given that we don't know what other warps are doing).
Which factors should I take into consideration when writing such a function? Or in other words: Which cases should I distinguish between in implementing this function?
Notes:
This should target Compute Capabilities 3.0 or better (i.e. Kepler or newer micro-architectures)
I don't want to make a Runtime API memcpy() call. At least, I don't think I do.
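For reference, the obvious naive warp-level counterpart of the std::copy loop above would be something like the following sketch (just a baseline to improve upon; it ignores every factor listed below):
// Naive warp-level baseline: each lane copies elements strided by the warp
// size, with no attention to coalescing width, alignment or slack handling.
template <typename T>
__device__ T* naive_warp_copy(T* __restrict__ destination,
                              const T* __restrict__ source,
                              size_t length)
{
    unsigned lane = threadIdx.x % warpSize;
    for (size_t i = lane; i < length; i += warpSize) {
        destination[i] = source[i];
    }
    return destination + length;
}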
Factors I believe should be taken into consideration:
Coalescing memory writes - ensuring that consecutive lanes in a warp write to consecutive memory locations (no gaps).
Type size vs Memory transaction size I - if sizeof(T) is 1 or 2, and we have each lane write a single element, the entire warp would write less than 128B, wasting some of the memory transaction. Instead, we should have each thread place 2 or 4 input elements in a register, and write that (a packing sketch for this case appears below).
Type size vs Memory transaction size II - For type sizes such that lcm(4, sizeof(T)) > 4, it's not quite clear what to do. How well does the compiler/the GPU handle writes when each lane writes more than 4 bytes? I wonder.
Slack due to the reading of multiple elements at a time - If each thread wishes to read 2 or 4 elements for each write, and write 4-byte integers - we might have 1 or 2 elements at the beginning and the end of the input which must be handled separately.
Slack due to input address mis-alignment - The input is read in 32B transactions (under reasonable assumptions); we thus have to handle the first elements, up to the next multiple of 32B, and the last elements (after the last such multiple) differently.
Slack due to output address mis-alignment - The output is written in transactions of up to 128B (or is it just 32B?); we thus have to handle the first elements, up to the next multiple of this number, and the last elements (after the last such multiple) differently.
Whether or not T is trivially-copy-constructible. But let's assume that it is.
But it could be that I'm missing some considerations, or that some of the above are redundant.
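To illustrate the "Type size vs Memory transaction size I" factor above, this is the kind of packing I have in mind, sketched for sizeof(T) == 1 under the (strong) assumptions that both pointers are 4-byte aligned and the length is a multiple of 4:
// Packing sketch for 1-byte elements: each lane moves 4 elements per store by
// reinterpreting the buffers as 32-bit words (assumes 4-byte alignment of both
// pointers and a length that is a multiple of 4).
__device__ void warp_copy_packed_bytes(unsigned char* __restrict__ destination,
                                       const unsigned char* __restrict__ source,
                                       size_t length)
{
    unsigned lane = threadIdx.x % warpSize;
    unsigned int* dst32 = reinterpret_cast<unsigned int*>(destination);
    const unsigned int* src32 = reinterpret_cast<const unsigned int*>(source);
    for (size_t i = lane; i < length / 4; i += warpSize) {
        dst32[i] = src32[i]; // 4 bytes per lane per store instead of 1
    }
}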
Factors I've been wondering about:
The block size (i.e. how many other warps are there)
The compute capability (given that it's at least 3)
Whether the source/target is in shared memory / constant memory
Choice of caching mode

NLopt with univariate optimization

Does anyone know if NLopt works with univariate optimization? I tried to run the following code:
using NLopt
function myfunc(x, grad)
    x.^2
end
opt = Opt(:LD_MMA, 1)
min_objective!(opt, myfunc)
(minf,minx,ret) = optimize(opt, [1.234])
println("got $minf at $minx (returned $ret)")
But I get the following error message:
> Error evaluating untitled
LoadError: BoundsError: attempt to access 1-element Array{Float64,1}:
1.234
at index [2]
in myfunc at untitled:8
in nlopt_callback_wrapper at /Users/davidzentlermunro/.julia/v0.4/NLopt/src/NLopt.jl:415
in optimize! at /Users/davidzentlermunro/.julia/v0.4/NLopt/src/NLopt.jl:514
in optimize at /Users/davidzentlermunro/.julia/v0.4/NLopt/src/NLopt.jl:520
in include_string at loading.jl:282
in include_string at /Users/davidzentlermunro/.julia/v0.4/CodeTools/src/eval.jl:32
in anonymous at /Users/davidzentlermunro/.julia/v0.4/Atom/src/eval.jl:84
in withpath at /Users/davidzentlermunro/.julia/v0.4/Requires/src/require.jl:37
in withpath at /Users/davidzentlermunro/.julia/v0.4/Atom/src/eval.jl:53
[inlined code] from /Users/davidzentlermunro/.julia/v0.4/Atom/src/eval.jl:83
in anonymous at task.jl:58
while loading untitled, in expression starting on line 13
If this isn't possible, does anyone know of a univariate optimizer where I can specify bounds and an initial condition?
There are a couple of things that you're missing here.
You need to specify the gradient (i.e. first derivative) of your function within the function. See the tutorial and examples on the github page for NLopt. Not all optimization algorithms require this, but the one that you are using, LD_MMA, looks like it does. See here for a listing of the various algorithms and which require a gradient.
You should specify the tolerance for the conditions you need before you "declare victory" ¹ (i.e. decide that the function is sufficiently optimized). This is the xtol_rel!(opt,1e-4) in the example below. See also ftol_rel! for another way to specify a different tolerance condition. According to the documentation, for example, xtol_rel will "stop when an optimization step (or an estimate of the optimum) changes every parameter by less than tol multiplied by the absolute value of the parameter" and ftol_rel will "stop when an optimization step (or an estimate of the optimum) changes the objective function value by less than tol multiplied by the absolute value of the function value." See here under the "Stopping Criteria" section for more information on various options.
The function that you are optimizing should have a unidimensional output. In your example, your output is a vector (albeit of length 1). (x.^2 in your output denotes a vector operation and a vector output.) If your "objective function" doesn't ultimately output a unidimensional number, then it won't be clear what your optimization objective is (e.g. what does it mean to minimize a vector? You could minimize the norm of a vector, for instance, but for a whole vector it simply isn't clear).
Below is a working example, based on your code. Note that I included the printing output from the example on the github page, which can be helpful for you in diagnosing problems.
using NLopt
count = 0 # keep track of # function evaluations
function myfunc(x::Vector, grad::Vector)
    if length(grad) > 0
        grad[1] = 2*x[1]
    end
    global count
    count::Int += 1
    println("f_$count($x)")
    x[1]^2
end
opt = Opt(:LD_MMA, 1)
xtol_rel!(opt,1e-4)
min_objective!(opt, myfunc)
(minf,minx,ret) = optimize(opt, [1.234])
println("got $minf at $minx (returned $ret)")
¹ (In the words of optimization great Yinyu Ye.)

Maximum Likelihood Estimation of a log function with several parameters

I am trying to find out the parameters for the function below:
$$
\log L(\alpha,\beta,v) = \frac{v}{\beta}\left(e^{-\beta T} - 1\right) + \frac{\alpha}{\beta} \sum_{i=1}^{n}\left(e^{-\beta(T-t_i)} - 1\right) + \sum_{i=1}^{N}\log\left(v e^{-\beta t_i} + \alpha \sum_{j=1}^{j_{\max}(t_i)} e^{-\beta(t_i - t_j)}\right).
$$
However, the conventional methods like fmin, fminsearch are not converging properly. Any suggestions on any other methods or open libraries which I can use?
I was trying CVXPY, but they don't support the division by a variable in the expression.
The problem may not be convex (I have not verified this but it could be why CVXPY refused it). We don't have the data so we cannot try things out, but I can give some general advice:
Provide exact gradients (and 2nd derivatives if needed) or use a modeling system with automatic differentiation. The first derivatives, especially, should preferably be quite precise. With finite differences you may lose half the precision.
Provide a good starting point, possibly obtained using an alternative estimation method.
Some solvers can use bounds on the variables to restrict the feasible region where functions will be evaluated. This can be used to restrict the search to interesting areas only and also to protect operations like division and log functions.

What can a compiler do with branching information?

It seems that on a modern Pentium it is no longer possible to give branching hints to the processor. Assuming that a profiling compiler such as gcc with profile-guided optimization gains information about likely branching behavior, what can it do to produce code that will execute more quickly?
The only option I know of is to move unlikely branches to the end of a function. Is there anything else?
Update.
http://download.intel.com/products/processor/manual/325462.pdf volume 2a, section 2.1.1 says
"Branch hint prefixes (2EH, 3EH) allow a program to give a hint to the processor about the most likely code path for
a branch. Use these prefixes only with conditional branch instructions (Jcc). Other use of branch hint prefixes
and/or other undefined opcodes with Intel 64 or IA-32 instructions is reserved; such use may cause unpredictable
behavior."
I don't know if these actually have any effect however.
On the other hand, section 3.4.1 of http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf says:
"Compilers generate code that improves the efficiency of branch prediction in Intel processors. The Intel C++ Compiler accomplishes this by:
keeping code and data on separate pages
using conditional move instructions to eliminate branches
generating code consistent with the static branch prediction algorithm
inlining where appropriate
unrolling if the number of iterations is predictable
With profile-guided optimization, the compiler can lay out basic blocks to eliminate branches for the most frequently executed paths of a function or at least improve their predictability. Branch prediction need not be a concern at the source level. For more information, see Intel C++ Compiler documentation."
http://cache-www.intel.com/cd/00/00/40/60/406096_406096.pdf says in "Performance Improvements with PGO":
"PGO works best for code with many frequently executed branches that are difficult to predict at compile time. An example is the code with intensive error-checking in which the error conditions are false most of the time. The infrequently executed (cold) error-handling code can be relocated so the branch is rarely predicted incorrectly. Minimizing cold code interleaved into the frequently executed (hot) code improves instruction cache behavior."
There are two possible sources for the information you want:
There's Intel 64 and IA-32 Architectures Software Developer's Manual (3 volumes). This is a huge work which has evolved for decades. It's the best reference I know on a lot of subjects, including floating-point. In this case, you want to check volume 2, the instruction set reference.
There's Intel 64 and IA-32 Architectures Optimization Reference Manual. This will tell you in somewhat brief terms what to expect from each microarchitecture.
Now, I don't know what you mean by a "modern Pentium" processor, this is 2013, right? There aren't any Pentiums anymore...
The instruction set does support telling the processor whether the branch is expected to be taken or not taken via a prefix on the conditional branch instructions (such as JC, JZ, etc.). See volume 2A of (1), section 2.1.1 (of the version I have), Instruction Prefixes. There are the 2E and 3E prefixes for not taken and taken, respectively.
As to whether these prefixes actually have any effect, if that information is available at all, it will be in the Optimization Reference Manual, in the section for the microarchitecture you want (and I'm sure it won't be the Pentium).
Apart from using those, there is an entire section in the Optimization Reference Manual on that subject: section 3.4.1 (of the version I have).
It makes no sense to reproduce that here, since you can download the manual for free.
Briefly:
Eliminate branches by using conditional instructions (CMOV, SETcc),
Consider the static prediction algorithm (3.4.1.3),
Inlining
Loop unrolling
Also, some compilers, GCC for instance, even when CMOV is not possible, often perform bitwise arithmetic to select one of two distinct computed values, thus avoiding branches. GCC does this particularly with SSE instructions when vectorizing loops.
Basically, the static conditions are:
Unconditional branches are predicted to be taken (... kind of expectable...)
Indirect branches are predicted not to be taken (because of a data dependency)
Backward conditionals are predicted to be taken (good for loops)
Forward conditionals are predicted not to be taken
You probably want to read the entire section 3.4.1.
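To make those static rules concrete, here is a small example of my own (not taken from the manual) whose layout agrees with them:
// The loop-closing backward conditional is statically predicted taken, while
// the forward conditional into the rare error path is statically predicted
// not taken, so the hot path falls straight through.
int sum_or_fail(const int* data, int n)
{
    int sum = 0;
    for (int i = 0; i < n; ++i) {   // backward conditional: predicted taken
        if (data[i] < 0)            // forward conditional: predicted not taken
            return -1;              // cold error path
        sum += data[i];
    }
    return sum;
}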
If it's clear that a loop is rarely entered, or that it normally iterates very few times, then the compiler might avoid unrolling the loop, as doing so can add a lot of harmful complexity to handle edge conditions (an odd number of iterations, etc.). Vectorisation, in particular, should be avoided in such cases.
The compiler might rearrange nested tests, so that the one that most frequently results in a short-cut can be used to avoid performing a test on something with a 50% pass rate.
Register allocation can be optimised to avoid having a rarely-used block force a register spill in the common case.
These are just some examples. I'm sure there are others I haven't thought of.
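As a small illustration of the nested-test reordering mentioned above (the function names are hypothetical):
/* If profiling shows rarely_true(x) holds only ~1% of the time while
   coin_flip(x) holds ~50% of the time, evaluating the more selective test
   first lets short-circuit evaluation skip the 50/50 test on most calls. */
if (rarely_true(x) && coin_flip(x)) {
    handle_rare_case(x);
}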
Off the top of my head, you have two options.
Option #1: Inform the compiler of the hints and let the compiler organize the code appropriately. For example, GCC supports the following ...
__builtin_expect((long)!!(x), 1L) /* GNU C to indicate that <x> will likely be TRUE */
__builtin_expect((long)!!(x), 0L) /* GNU C to indicate that <x> will likely be FALSE */
If you put them in macro form such as ...
#if <some condition to indicate support>
#define LIKELY(x) __builtin_expect((long)!!(x), 1L)
#define UNLIKELY(x) __builtin_expect((long)!!(x), 0L)
#else
#define LIKELY(x) (x)
#define UNLIKELY(x) (x)
#endif
... you can now use them as ...
if (LIKELY (x != 0)) {
    /* DO SOMETHING */
} else {
    /* DO SOMETHING ELSE */
}
This leaves the compiler free to organize the branches according to static branch prediction algorithms, and/or if the processor and compiler support it, to use instructions that indicate which branch is more likely to be taken.
Option #2: Use math to avoid branching.
if (a < b)
    y = C;
else
    y = D;
This could be re-written as ...
x = -(a < b); /* x = -1 if a < b, x = 0 if a >= b */
x &= (C - D); /* x = C - D if a < b, x = 0 if a >= b */
x += D; /* x = C if a < b, x = D if a >= b */
Hope this helps.
It can make the fall-through (i.e. the case where a branch is not taken) the most used path. That has two big effects:
only 1 branch can be taken per clock, or on some processors even per 2 clocks, so if there are any other branches (there usually are, most code that matters is in a loop), a taken branch is bad news, a non-taken branch less so.
when the branch predictor is wrong, the code that it does have to execute is more likely to be in the code cache (or µop cache, where applicable). If it wasn't, that would have been a double-whammy of restarting the pipeline and waiting for a cache miss. This is less of an issue in most loops, since both sides of the branch are likely to be in the cache, but it comes into play in big loops and other code.
It can also decide whether to do if-conversion based on better data than a heuristic guess. If-conversions may seem like "always a good idea", but they're not; they're only "often a good idea". If the branch in the branching implementation is very well-predicted, the if-converted code may well be slower.
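For comparison with the bitwise selection trick shown earlier in this thread, the if-converted form of that example is simply the branchless select below, which compilers will often lower to a CMOV when they judge it profitable, and profile data lets them judge that more accurately than a heuristic guess:
/* Branchless (if-converted) select: with optimization enabled this is
   typically compiled to a CMOV rather than a conditional branch. */
int select_value(int a, int b, int C, int D)
{
    return (a < b) ? C : D;
}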