Is there a function for condensing an expression that has many redundant terms so as to not have to recalculate them? - optimization

Consider the simple expression:
c = ((a * b) / (x * y)) * sqrt(a * b + 1 + x * y)
The computer only needs to calculate a * b and x * y once, but evaluating the expression as given would make the computer calculate them twice. It would have been better to write:
d = a * b
e = x * y
f = d / e
g = d + 1 + e
c = f * sqrt(g)
Is there a function (ideally python but any language should suffice) to convert the first code block into the second? The problem is I have a very long (pages long output from a Mathematica Solve) expression that has many unique but duplicate terms (meaning, many more than the 2 unique terms a * b and x * y that were duplicated (each appeared more than once) in the expression above) so it's not reasonable for me to do the reducing by hand. In addition, a * b could appear later when calculating another quantity. FYI I need speed in my application and so these sort of optimizations are useful.
I tried googling for this functionality but perhaps I don't know the right search terms.

Thank you, Jerome. (I tried upvoting but don't yet have enough "reputation"). Your response led me to the answer: Use sympy.cse and then copy paste its output and then use a compiler module like Numba as you suggested. One note is sympy appears to only reduce to a certain extent and then decides it's simple enough, e.g.
import sympy as sp
a,b,c,x,y = sp.symbols('a b x y')
input_expr = ((a * b) / (x * y)) * sp.sqrt(a * b + 1 + x * y)
repl, redu = sp.cse(input_expr)
print(repl, redu)
for variable, expr in repl:
print(f'{variable} = {expr}')
>>> [(x0, a*b)] [x0*sqrt(x*y + x0 + 1)/(x*y)]
>>> x0 = a*b
So, it does not assign x1 to x*y, but if instead the expression is slightly more complicated, then it does.
input_expr = ((a * b) / (x * y + 2)) * sp.sqrt(a * b + 1 + x * y)
>>> [(x0, x*y), (x1, a*b)] [x1*sqrt(x0 + x1 + 1)/(x0 + 2)]
>>> x0 = x*y
>>> x1 = a*b
But, even for my very large expression, sympy reduced it to a point that I can finish up by hand without difficulty, so I consider my problem solved.
Best practice for using common subexpression elimination with lambdify in SymPy

Yes, this optimization is called Common Sub-expression Elimination and it is done in all mainstream optimising compilers since decades (I advise you to read the associated books and research papers about it).
However, this is certainly not a good idea to use this kind of optimization in Python. Indeed, Python is not design with speed in mind and the default interpreter does not perform such optimization (in fact it almost perform no optimization). Python codes are expected to be readable and not generated. A generated code is likely slower than a native code while having nearly no benefit over a generated native code. There are just-in-time (JIT) compilers for Python like PyPy doing such optimization so it is certainly wise to try them. Alternatively, Cython (not to be confused with CPython) and Numba can strongly help to remove the overheads Python thanks to the (JIT/AOT) compilation of a Python code to a native execution. Indeed, recomputing multiple time the same expression if far from being the only one issue with the CPython interpreter: basic floating-point/arithmetic operations tends to be at least one order of magnitude slower than native code (mainly due to the dynamic management of objects including their allocation, not to mention the overhead of the interpreter loop). Thus, please use a compiler that does such optimizations very well instead of rewriting the wheel.

Related

Numerically stable maximum step in fixed direction satisfying inequality

This is a question about floating point analysis and numerical stability. Say I have two [d x 1] vectors a and x and a scalar b such that a.T # x < b (where # denotes a dot product).
I additionally have a unit [d x 1] vector d. I want to derive the maximum scalar s so that a.T # (x + s * d) < b. Without floating point errors this is trivial:
s = (b - a.T # x) / (a.T # d).
But with floating point errors though this s is not guaranteed to satisfy a.T # (x + s * d) < b.
Currently my solution is to use a stabilized division, which helps:
s = sign(a.T # x) * sign(a.T # d) * exp(log(abs(a.T # x) + eps) - log(abs(a.T # d) + eps)).
But this s still does not always satisfy the inequality. I can check how much this fails by:
diff = a.T # (x + s * d) - b
And then "push" that diff back through: (x + s * d - a.T # (diff + eps2)). Even with both the stable division and pushing the diff back sometimes the solution fails to satisfy the inequality. So these attempts at a solution are both hacky and they do not actually work. I think there is probably some way to do this that would work and be guaranteed to minimally satisfy the inequality under floating point imprecision, but I'm not sure what it is. The solution needs to be very efficient because this operation will be run trillions of times.
Edit: Here is an example in numpy of this issue coming into play, because a commenter had some trouble replicating this problem.
np.random.seed(1)
p, n = 10, 1
k = 3
x = np.random.normal(size=(p, n))
d = np.random.normal(size=(p, n))
d /= np.sum(d, axis=0)
a, b = np.hstack([np.zeros(p - k), np.ones(k)]), 1
s = (b - a.T # x) / (a.T # d)
Running this code gives a case where a.T # (s * d + x) > b failing to satisfy the constraint. Instead we have:
>>> diff = a.T # (x + s * d) - b
>>> diff
array([8.8817842e-16])
The question is about how to avoid this overflow.
The problem you are dealing with appear to be mainly rounding issues and not really numerical stability issues. Indeed, when a floating-point operation is performed, the result has to be rounded so to fit in the standard floating point representation. The IEEE-754 standard specify multiple rounding mode. The default one is typically the rounding to nearest.
This mean (b - a.T # x) / (a.T # d) and a.T # (x + s * d) can be rounded to the previous or nest floating-point value. As a result, there is slight imprecision introduced in the computation. This imprecision is typically 1 unit of least precision (ULP). 1 ULP basically mean a relative error of 1.1e−16 for double-precision numbers.
In practice, every operation can result in rounding and not the whole expression so the error is typically of few ULP. For operation like additions, the rounding tends to mitigate the error while for some others like a subtraction, the error can dramatically increase. In your case, the error seems only to be due to the accumulation of small errors in each operations.
The floating point computing units of processors can be controlled in low-level languages. Numpy also provides a way to find the next/previous floating point value. Based on this, you can round the value up or down for some parts of the expression so for s to be smaller than the target theoretical value. That being said, this is not so easy since some the computed values can be certainly be negative resulting in opposite results. One can round positive and negative values differently but the resulting code will certainly not be efficient in the end.
An alternative solution is to compute the theoretical error bound so to subtract s by this value. That being said, this error is dependent of the computed values and the actual algorithm used for the summation (eg. naive sum, pair-wise, Kahan, etc.). For example the naive algorithm and the pair-wise ones (used by Numpy) are sensitive to the standard deviation of the input values: the higher the std-dev, the bigger the resulting error. This solution only works if you exactly know the distribution of the input values or/and the bounds. Another issues is that it tends to over-estimate the error bounds and gives a just an estimation of the average error.
Another alternative method is to rewrite the expression by replacing s by s+h or s*h and try to find the value of h based on the already computed s and other parameters. This methods is a bit like a predictor-corrector. Note that h may not be precise also due to floating point errors.
With the absolute correction method we get:
h_abs = (b - a # (x + s * d)) / (a # d)
s += h_abs
With the relative correction method we get:
h_rel = (b - a # x) / (a # (s * d))
s *= h_rel
Here are the absolute difference with the two methods:
Initial method: 8.8817842e-16 (8 ULP)
Absolute method: -8.8817842e-16 (8 ULP)
Relative method: -8.8817842e-16 (8 ULP)
I am not sure any of the two methods are guaranteed to fulfil the requirements but a robust method could be to select the smallest s value of the two. At least, results are quite encouraging since the requirement are fulfilled for the two methods with the provided inputs.
A good method to generate more precise results is to use the Decimal package which provide an arbitrary precision at the expense of a much slower execution. This is particularly useful to compare practical results with more precise ones.
Finally, a last solution is to increase/decrease s one by one ULP so to find the best result. Regarding the actual algorithm used for the summation and inputs, results can change. The exact expression used to compute the difference also matter. Moreover, the result is certainly not monotonic because of the way floating-point arithmetic behave. This means one need to increase/decrease s by many ULP so to be able to perform the optimization. This solution is not very efficient (at least, unless big steps are used).

Projection of fisheye camera model by Scaramuzza

I am trying to understand the fisheye model by Scaramuzza, which is implemented in Matlab, see https://de.mathworks.com/help/vision/ug/fisheye-calibration-basics.html#mw_8aca38cc-44de-4a26-a5bc-10fb312ae3c5
The backprojection (uv to xyz) seems fairly straightforward according to the following equation:
, where rho=sqrt(u^2 +v^2)
However, how does the projection (from xyz to uv) work?! In my understanding we get a rather complex set of equations. Unfortunately, I don't find any details on that....
Okay, I believe I understand it now fully after analyzing the functions of the (windows) calibration toolbox by Scaramuzza, see https://sites.google.com/site/scarabotix/ocamcalib-toolbox/ocamcalib-toolbox-download-page
Method 1 found in file "world2cam.m"
For the projection, use the same equation above. In the projection case, the equation has three known (x,y,z) and three unknown variables (u,v and lambda). We first substitute lambda with rho by realizing that
u = x/lambda
v = y/lambda
rho=sqrt(u^2+v^2) = 1/lambda * sqrt(x^2+y^2) --> lambda = sqrt(x^2+y^2) / rho
After that, we have the unknown variables (u,v and rho)
u = x/lambda = x / sqrt(x^2+y^2) * rho
v = y/lambda = y / sqrt(x^2+y^2) * rho
z / lambda = z /sqrt(x^2+y^2) * rho = a0 + a2*rho^2 + a3*rho^3 + a4*rho^4
As you can see, the last equation now has only one unknown, namely rho. Thus, we can solve it easily using e.g. the roots function in matlab. However, the result does not always exist nor is it necessarily unique. After solving the unknown variable rho, calculating uv is very simple using the equation above.
This procedure needs to be performed for each point (x,y,z) separately and is thus rather computationally expensive for an image.
Method 2 found in file "world2cam_fast.m"
The last equation has the form rho(x,y,z). However, if we define m = z / sqrt(x^2+y^2) = tan(90°-theta), it only depends on one variable, namely rho(m).
Instead of solving this equation rho(m) for every new m, the authors "plot" the function for several values of m and fit an 8th order polynomial to these points. Using this polynomial they can calculate an approximate value for rho(m) much quicker in the following.
This becomes clear, because "world2cam_fast.m" makes use of ocam_model.pol, which is calculated in "undistort.m". "undistort.m" in turn makes use of "findinvpoly.m".

How to best use Numpy/Scipy to find optimal common coefficients for a set different linear equations?

I have n (around 5 million) sets of specific (k,m,v,z)* parameters that describe some linear relationships. I want to find the optimal positive a,b and c coefficients that minimize the addition of their absolute values as shown below:
I know beforehand the range for each a, b and c and so, I could use it to make things a bit faster. However, I do not know how to properly implement this problem to best take advantage of Numpy (or Scipy/etc).
I was thinking of iteratively making checks using different a, b and c coefficients (based on a step) and in the end keeping the combination that would provide the minimum sum. But properly implementing this in Numpy is another thing.
*
(k,m,v are either 0 or positive and are in fact k,m,v,i,j,p)
(z can be negative too)
Any tips are welcome!
Either I am missing something, or a == b == c == 0 is optimal. So, a positive solution for (a,b,c) does not exist in general. You can verify this explicitly by posing the minimization problem as a quantile regression of 0 on (k, m, v) with the quantile set to 0.5.
import numpy as np
from statsmodels.regression.quantile_regression import QuantReg
x = np.random.rand(1000, 3)
a, b, c = QuantReg(np.zeros(x.shape[0]), x).fit(0.5).params
assert np.allclose([a, b, c], 0)

does eigen have self transpose multiply optimization like H.transpose()*H

I have browsed the tutorial of eigen at
https://eigen.tuxfamily.org/dox-devel/group__TutorialMatrixArithmetic.html
it said
"Note: for BLAS users worried about performance, expressions such as c.noalias() -= 2 * a.adjoint() * b; are fully optimized and trigger a single gemm-like function call."
but how about computation like H.transpose() * H , because it's result is a symmetric matrix so it should only need half time as normal A*B, but in my test, H.transpose() * H spend same time as H.transpose() * B. does eigen have special optimization for this situation , like opencv, it has similar function.
I know symmetric optimization will break the vectorization , I just want to know if eigen have solution which could provide both symmetric optimization and vectorization
You are right, you need to tell Eigen that the result is symmetric this way:
Eigen::MatrixXd H = Eigen::MatrixXd::Random(m,n);
Eigen::MatrixXd Z = Eigen::MatrixXd::Zero(n,n);
Z.template selfadjointView<Eigen::Lower>().rankUpdate(H.transpose());
The last line computes Z += H * H^T within the lower triangular part. The upper part is left unchanged. You want a full matrix, then copy the lower part to the upper one:
Z.template triangularView<Eigen::Upper>() = Z.transpose();
This rankUpdate routine is fully vectorized and comparable to the BLAS equivalent. For small matrices, better perform the full product.
See also the respective doc.

how to use scipy.ndimage.convolve for a given stencil?

I need to do something very similar to what is detailed in this post. But the way the stencils are done are not obvious to me... well the stencil for _flux is, but the ones for temp_bz & temp_bx are not.
I think the picture would get clearer with variables, instead of numbers (something like stencil = np.array([[a, b], [c, d]]) with a=0.5, b=...
As example, if the recurrence relation is
flux2[i,j] = a*flux2[i-1,j] + b*bz[i-1,j]*dx + c*flux2[i,j-1] - d*bx[i,j-1]*dz
how the code would be changed ?
Having flux2, bz and bx variables, and assuming they are numpy arrays (if they are not, they should), you could write that ecuation in a vectorized form as follows:
flux2[1:,1:] = a * flux2[:-1,1:] + b * bz[:-1,1:] * dx + c * flux2[1:,:-1] - d * bx[1:,:-1] * dz
Note that, since you didn't mention dz, I assumed it is a constant, if it is a matrix of the same shape as flux2, replace with dz[1:, 1:] (same applies to dx).
That line above will vectorize the operation to every i,j of the matrix, and thus, remove the for loop, giving a considerable speedup.
You would have to define the boundary conditions for row and column 0, as your equation doesn't define what to do in those special cases.
So, in short, as your stencil only uses one position for each variable, and only has 4 interactions, I would say is way faster to calculate it in its analytic form, rather than convolving 3 images with almost all-0 stencils (which would be quite a lot of overkill).