How to maximize the log-likelihood for a Gaussian Process in Mathematica

I am currently trying to implement a Gaussian Process in Mathematica and am stuck at maximizing the log-likelihood. I tried using FindMaximum on my log-likelihood function, but it does not seem to work on this function.
gpdata = {{-1.5, -1.8}, {-1., -1.2}, {-0.75, -0.4}, {-0.4, 0.1},
   {-0.25, 0.5}, {0., 0.8}};
kernelfunction[i_, j_, h0_, h1_] :=
  h0*h0*Exp[-(gpdata[[i, 1]] - gpdata[[j, 1]])^2/(2*h1^2)] +
   KroneckerDelta[i, j]*0.09;
covariancematrix[h0_, h1_] =
  ParallelTable[kernelfunction[i, j, h0, h1], {i, 1, 6}, {j, 1, 6}];
loglikelihood[h0_, h1_] :=
  -0.5*gpdata[[All, 2]].LinearSolve[covariancematrix[h0, h1],
     gpdata[[All, 2]], Method -> "Cholesky"] -
   0.5*Log[Det[covariancematrix[h0, h1]]] - 3*Log[2*Pi];
FindMaximum[loglikelihood[a, b], {{a, 1}, {b, 1.1}},
 MaxIterations -> 500, Method -> "QuasiNewton"]
In the log-likelihood I would usually have the inverse of the covariance matrix applied to the gpdata[[All, 2]] vector, but because the covariance matrix is always positive semidefinite I wrote it with LinearSolve instead. Also, the evaluation does not terminate if I use
gpdata[[All, 2]].Inverse[covariancematrix[h0, h1]].gpdata[[All, 2]]
Does anyone have an idea? I am actually working on a far more complicated problem where I have 6 parameters to optimize, but I already have problems with 2.

In my experience, second-order methods fail on hyper-parameter optimization more often than gradient-based methods do. I think this is because (most?) second-order methods rely on the function being close to quadratic near the current estimate.
Using conjugate gradient, or even Powell's (derivative-free) conjugate direction method, has proved successful in my experiments. For the two-parameter case, I would suggest making a contour plot of the hyper-parameter surface for some intuition; a sketch follows.
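A minimal sketch of those suggestions, assuming the definitions from the question. numericLL and the _?NumericQ restriction are my additions (a common Mathematica idiom that stops the solver from handing symbolic arguments to LinearSolve), and "PrincipalAxis" is Mathematica's built-in derivative-free method, which needs two starting values per variable:
(* Wrap the objective so it only ever evaluates for numeric arguments. *)
numericLL[h0_?NumericQ, h1_?NumericQ] := loglikelihood[h0, h1]
(* Inspect the hyper-parameter surface for intuition. *)
ContourPlot[numericLL[a, b], {a, 0.1, 3}, {b, 0.1, 3}, Contours -> 30]
(* Derivative-free search; Method -> "ConjugateGradient" is another option. *)
FindMaximum[numericLL[a, b], {{a, 1, 1.5}, {b, 1.1, 1.6}},
 Method -> "PrincipalAxis"]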


Julia - Constrained Optimization

I am trying to minimize a function (fairly complicated, no analytical form).
First, I residualized it and ran the following:
using Optim
x0 = [.1, .3, 1/3, .75]
f(x) = prstream_res(x[1], x[2], x[3], x[4])
z = optimize(f, x0)
which gives me an unconstrained solution. However, the solution Julia finds makes no real-world sense (some of the minimizer's arguments are negative).
I need to constrain the solution to positive values only. In fact, I can give lower and upper bounds for the resulting parameters.
I tried:
f(x) = prstream_res(x[1],x[2],x[3],x[4])
lower= [0, 0, 0, 0]
upper= [1, 1, 1, 1]
x0= [.1, .3, 1/3, .75]
inner_optimizer = GradientDescent()
results = optimize(f, lower, upper, x0, Fminbox(inner_optimizer))
but it doesn't work, throwing
ERROR: MethodError: no method matching optimize(::typeof(f), ::Array{Int64,1}, ::Array{Int64,1}, ::Array{Float64,1}, ::Fminbox{GradientDescent{InitialPrevious{Float64},HagerZhang{Float64,Base.RefValue{Bool}},Nothing,Optim.var"#13#15"},Float64,Optim.var"#43#45"})
Can someone please help me constrain the solution?
Thanks in advance!
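The MethodError shows the bounds arriving as Array{Int64,1} while x0 is Array{Float64,1}; Optim's box-constrained optimize methods expect the bounds and the starting point to share an element type. A minimal sketch of that fix (assuming prstream_res is defined as above), with everything else unchanged:
using Optim

f(x) = prstream_res(x[1], x[2], x[3], x[4])

# Float64 literals so the bounds match x0's element type.
lower = [0.0, 0.0, 0.0, 0.0]
upper = [1.0, 1.0, 1.0, 1.0]
x0 = [0.1, 0.3, 1/3, 0.75]

results = optimize(f, lower, upper, x0, Fminbox(GradientDescent()))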

Why is broadcasting done by aligning axes backwards

Numpy's broadcasting rules have bitten me once again and I'm starting to feel there may be a way of thinking about this topic that I'm missing.
I'm often in situations like the following: the first axis of my arrays is reserved for something fixed, like the number of samples. The second axis could represent different independent variables of each sample for some arrays, or it could be nonexistent when it feels natural that only one quantity be attached to each sample in an array. For example, if the array is called price, I'd probably use only one axis, representing the price of each sample. On the other hand, a second axis is sometimes much more natural. For example, I could use a neural network to compute a quantity for each sample, and since neural networks can in general compute arbitrary multi-valued functions, the library I use would in general return a 2d array and make the second axis singleton if I use it to compute a single dependent variable. I've found that this approach of using 2d arrays is also more amenable to future extensions of my code.
Long story short, I need to decide in various places of my codebase whether to store an array as (1000,) or (1000,1), and changes of requirements occasionally make it necessary to switch from one format to the other.
Usually, these arrays live alongside arrays with up to 4 axes, which further increases the pressure to sometimes introduce a singleton second axis and then have the third axis represent a consistent semantic quality for all arrays that use it.
The problem now occurs when I add a (1000,) array to a (1000,1) array, expecting to get (1000,1), but get (1000,1000) because of implicit broadcasting, as the snippet below shows.
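For concreteness, a minimal reproduction of the pitfall:
import numpy as np

a = np.zeros(1000)        # shape (1000,)
b = np.zeros((1000, 1))   # shape (1000, 1)

# Broadcasting aligns trailing axes: (1000,) is treated as (1, 1000),
# which against (1000, 1) expands to an unintended (1000, 1000).
print((a + b).shape)      # (1000, 1000)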
I feel like this prevents giving semantic meaning to axes. Of course I could always use at least two axes, but that leads to the question of where to stop: to be fail-safe, continuing this logic, I'd have to use arrays of at least 6 axes to represent everything.
I'm aware this is maybe not the most well-defined technical question, but does anyone have a modus operandi that helps them avoid these kinds of bugs?
Does anyone know the motivations of the numpy developers to align axes in reverse order for broadcasting? Was computational efficiency or another technical reason behind this, or a model of thinking that I don't understand?
In MATLAB, broadcasting, a johnny-come-lately to this game, expands trailing dimensions. But there the trailing dimensions are outermost, that is, order='F'. And since everything starts as 2d, this expansion only occurs when one array is 3d (or larger).
https://blogs.mathworks.com/loren/2016/10/24/matlab-arithmetic-expands-in-r2016b/
explains this and gives a bit of history. My own history with the language is old enough that the ma_expanded = ma(ones(3,1),:) style of expansion is familiar. Octave added broadcasting before MATLAB did.
To avoid ambiguity, broadcasting expansion can only occur in one direction. Expanding in the direction of the outermost dimension seems logical.
Compare (3,) expanded to (1,3) versus (3,1) - viewed as nested lists:
In [198]: np.array([1,2,3])
Out[198]: array([1, 2, 3])
In [199]: np.array([[1,2,3]])
Out[199]: array([[1, 2, 3]])
In [200]: (np.array([[1,2,3]]).T).tolist()
Out[200]: [[1], [2], [3]]
I don't know if there are significant implementation advantages. With the striding mechanism, adding a new dimension anywhere is easy: just change the shape and strides, adding a 0 stride for the dimension that needs to be 'replicated'.
In [203]: np.broadcast_arrays(np.array([1,2,3]), np.array([[1],[2],[3]]), np.ones((3,3)))
Out[203]:
[array([[1, 2, 3],
        [1, 2, 3],
        [1, 2, 3]]),
 array([[1, 1, 1],
        [2, 2, 2],
        [3, 3, 3]]),
 array([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])]
In [204]: [x.strides for x in _]
Out[204]: [(0, 8), (8, 0), (24, 8)]
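As a small illustration of that point, the same replication can be done by hand with as_strided (np.lib.stride_tricks is real NumPy; this read-only use is safe, though writing through such a view is not):
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.array([1, 2, 3])
# Give the new leading axis a stride of 0: every 'row' re-reads the same
# three elements, replicating without copying, exactly as broadcasting does.
b = as_strided(a, shape=(3, 3), strides=(0, a.strides[0]))
print(b)
# [[1 2 3]
#  [1 2 3]
#  [1 2 3]]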

Does Numpy have an inbuilt elementwise matrix modular exponentiation implementation

Does numpy have an inbuilt implementation for modular exponentiation of matrices?
(As pointed out by user2357112, I am actually looking for element-wise modular exponentiation.)
One way modular exponentiation is done on regular numbers is with exponentiation by squaring (https://en.wikipedia.org/wiki/Exponentiation_by_squaring), with a modular reduction taken at each step. I am wondering if there is a similar inbuilt solution for matrix multiplication. I am aware I can write code to emulate this easily, but I am wondering if there is an inbuilt solution.
Modular exponentiation is not currently built into NumPy (see the GitHub issue). The easiest/laziest way to achieve it is frompyfunc:
modexp = np.frompyfunc(pow, 3, 1)
print(modexp(np.array([[1, 2], [3, 4]]), 2, 3).astype(int))
prints
[[1 1]
 [0 1]]
This is of course slower than native NumPy would be, and we get an array with dtype=object (hence the astype(int)).
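If you want to stay in native integer dtypes, the squaring loop the question alludes to is easy to write over arrays. A sketch, where elementwise_modexp is a hypothetical helper name and the exponent is assumed to be a scalar:
import numpy as np

def elementwise_modexp(base, exp, mod):
    # Exponentiation by squaring, reducing mod `mod` after every multiply
    # so intermediate values stay small and fit in int64.
    base = np.asarray(base, dtype=np.int64) % mod
    result = np.ones_like(base)
    while exp > 0:
        if exp & 1:                        # odd exponent: fold in current base
            result = (result * base) % mod
        base = (base * base) % mod
        exp >>= 1
    return result

print(elementwise_modexp(np.array([[1, 2], [3, 4]]), 2, 3))
# [[1 1]
#  [0 1]]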

Maximizing in mathematica with multiple maxima

I'm trying to compute the maxima of some function of one variable, which is calculated from a non-trivial convolution, so, no, I don't have an expression for it. [The question included a plot of the function showing several peaks.]
Using the command:
NMaximize[{f[x], 0 < x < 1}, x, AccuracyGoal -> 4, PrecisionGoal -> 4]
(I'm not that worried about super accuracy, a rough estimate of 10^-4 is already enough)
The result of this is x* = 0.55, which is not what it should be (i.e., it is picking the third peak).
Is there any way of telling Mathematica that the global maximum is the first one counting from x = 0 (I know this is always true), or of making Mathematica search with a better approach? (Note: I don't want things like a Simulated Annealing approach; each evaluation is very costly!)
Thanks very much!
Try FindMaximum with a starting point of 0 or some similarly small value.
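A minimal sketch of that suggestion, assuming f as in the question: a local search started near 0 climbs the first peak from the left, and the constraint keeps the search inside (0, 1).
FindMaximum[{f[x], 0 < x < 1}, {x, 0.01},
 AccuracyGoal -> 4, PrecisionGoal -> 4]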

Mathematica exponentiation and finding a specified coefficient

I have the following code, and it does exactly what I want it to do, except that it is ridiculously slow. I would not be so bothered, except that when I process the code "manually", i.e., break it into parts and run them individually, it's nearly instantaneous.
Here is my code:
Coefficient[Product[Sum[x^(j*Prime[i]), {j, 0, Floor[q/Prime[i]]}],
  {i, 1, PrimePi[q]}], x, q]
I think it is trying to optimize the sum, but I am not sure. Is there a way to stop that?
In addition, since all my coefficients are positive and I only want the x^q-th one, is there a way to get Mathematica to discard all exponents larger than that and skip the multiplications involving them?
I may be misunderstanding what you want but, as the coefficient will depend on q, I assume you want it evaluated for specific q. Since I suspected (like you) that the time is taken to optimise the product and sum, I rewrote it. You had something like:
With[{q = 80},
  Coefficient[Product[Sum[x^(j*Prime[i]), {j, 0, Floor[q/Prime[i]]}],
    {i, 1, PrimePi[q]}], x, q]] // Timing
(*
-> {8.36181, 10003}
*)
which I rewrote with purely structural operations as
With[{q = 80},
  Coefficient[Times @@
    Table[Plus @@ Table[x^(j*Prime[i]), {j, 0, Floor[q/Prime[i]]}],
      {i, 1, PrimePi[q]}], x, q]] // Timing
(*
-> {8.36357, 10003}
*)
(this just builds up a list of the terms and then multiplies them, so no symbolic analysis is performed).
Just building up the polynomial is instantaneous, but it has a few thousand terms, so what is probably happening is that Coefficient spends a lot of time making sure it has the right coefficient. You can actually solve this by Expanding the polynomial first. Thus:
With[{q = 80},
  Coefficient[Expand[Product[Sum[x^(j*Prime[i]), {j, 0, Floor[q/Prime[i]]}],
    {i, 1, PrimePi[q]}]], x, q]] // Timing
(*
-> {0.240862, 10003}
*)
and it also works for my method.
So to summarise: just apply Expand to the expression before you take the coefficient.
I think the reason the original code is slow is that Coefficient is made to work even with very large expressions - ones that would not fit into memory if naively expanded.
Here's the original polynomial:
poly[q_, x_] := Product[Sum[x^(j*Prime[i]), {j, 0, Floor[q/Prime[i]]}],
  {i, 1, PrimePi[q]}]
See how for not too large q, expanding the polynomial takes up a lot more memory and becomes fairly slow:
In[2]:= Through[{LeafCount, ByteCount}[poly[300, x]]] // Timing
        Through[{LeafCount, ByteCount}[Expand@poly[300, x]]] // Timing
Out[2]= { 0.01, { 1859,   55864}}
Out[3]= {25.27, {77368, 3175840}}
Now let's define the coefficient in 3 different ways and time them
coeff[q_] := Module[{x}, Coefficient[poly[q, x], x, q]]
exCoeff[q_] := Module[{x}, Coefficient[Expand@poly[q, x], x, q]]
serCoeff[q_] := Module[{x}, SeriesCoefficient[poly[q, x], {x, 0, q}]]
In[7]:= Table[ coeff[q],{q,1,30}]//Timing
Table[ exCoeff[q],{q,1,30}]//Timing
Table[serCoeff[q],{q,1,30}]//Timing
Out[7]= {0.37,{0,1,1,1,2,2,3,3,4,5,6,7,9,10,12,14,17,19,23,26,30,35,40,46,52,60,67,77,87,98}}
Out[8]= {0.12,{0,1,1,1,2,2,3,3,4,5,6,7,9,10,12,14,17,19,23,26,30,35,40,46,52,60,67,77,87,98}}
Out[9]= {0.06,{0,1,1,1,2,2,3,3,4,5,6,7,9,10,12,14,17,19,23,26,30,35,40,46,52,60,67,77,87,98}}
In[10]:= coeff[100]//Timing
exCoeff[100]//Timing
serCoeff[100]//Timing
Out[10]= {56.28,40899}
Out[11]= { 0.84,40899}
Out[12]= { 0.06,40899}
So SeriesCoefficient is definitely the way to go. Unless of course you're a bit better at combinatorics than me and you know the following prime-partition formulae (OEIS):
In[13]:= CoefficientList[Series[1/Product[1-x^Prime[i],{i,1,30}],{x,0,30}],x]
Out[13]= {1,0,1,1,1,2,2,3,3,4,5,6,7,9,10,12,14,17,19,23,26,30,35,40,46,52,60,67,77,87,98}
In[14]:= f[n_]:=Length@IntegerPartitions[n,All,Prime@Range@PrimePi@n]; Array[f,30]
Out[14]= {0,1,1,1,2,2,3,3,4,5,6,7,9,10,12,14,17,19,23,26,30,35,40,46,52,60,67,77,87,98}