Stan vs PYMC3 for Discrete Mixture Models - bayesian

I am studying zero-inflated count temporal data. I have built a stan model that deals with this zero-inflated data with an if statement in the model block. This is as they advise in the Stan Reference Guide. e.g.,
model {
for (n in 1:N) {
if (y[n] == 0)
target += log_sum_exp(bernoulli_lpmf(1 | theta), bernoulli_lpmf(0 | theta) + poisson_lpmf(y[n] | lambda));
else
target += bernoulli_lpmf(0 | theta) + poisson_lpmf(y[n] | lambda);
}
}
This if statement is clearly necessary as Stan uses NUTS as the sampler which does not deal with discrete variables (and thus we are marginalising over this discrete random variable instead of sampling from it). I have not had very much experience with pymc3 but my understanding is that it can deal with a Gibbs update step (to sample from the discrete bernoulli likelihood). Then conditioned on the zero-inflated value, it could perform a Metropolis or NUTS update for the parameters that depend on the Poisson likelihood.
My question is: Can (and if so how can) pymc3 be used in such a way to sample from the discrete zero-inflated variable with the updates to the continuous variable being performed with a NUTS update? If it can, is the performance significantly improved over the above implementation in stan (which marginalises out the discrete random variable)? Further, if pymc3 can only support a Gibbs + Metropolis update, is this change away from NUTS worth considering?

Related

How can I order the basic solutions of a min cost flow problem according to their cost?

I was wondering if, given a min cost flow problem and an integer n, there is an efficient algorithm/package or mathematical method, to obtain the set of the
n-best basic solutions of the min cost flow problem (instead of just the best).
Not so easy. There were some special LP solvers that could do that (see: Ralph E. Steuer, Multiple Criteria Optimization: Theory, Computation, and Application, Wiley, 1986), but currently available LP solvers can't.
There is a way to encode a basis using binary variables:
b[i] = 1 if variable x[i] = basic
0 nonbasic
Using this, we can use "no good cuts" or "solution pool" technology to get the k best bases. See: https://yetanothermathprogrammingconsultant.blogspot.com/2016/01/finding-all-optimal-lp-solutions.html. Note that not all solution-pools can do the k-best. (Cplex can't, Gurobi can.) The "no-good" cuts work with any mip solver.
Update: a more recent reference is Craig A. Piercy, Ralph E. Steuer,
Reducing wall-clock time for the computation of all efficient extreme points in multiple objective linear programming, European Journal of Operational Research, 2019, https://doi.org/10.1016/j.ejor.2019.02.042

SLSQP in ScipyOptimizeDriver only executes one iteration, takes a very long time, then exits

I'm trying to use SLSQP to optimise the angle of attack of an aerofoil to place the stagnation point in a desired location. This is purely as a test case to check that my method for calculating the partials for the stagnation position is valid.
When run with COBYLA, the optimisation converges to the correct alpha (6.04144912) after 47 iterations. When run with SLSQP, it completes one iteration, then hangs for a very long time (10, 20 minutes or more, I didn't time it exactly), and exits with an incorrect value. The output is:
Driver debug print for iter coord: rank0:ScipyOptimize_SLSQP|0
--------------------------------------------------------------
Design Vars
{'alpha': array([0.5])}
Nonlinear constraints
None
Linear constraints
None
Objectives
{'obj_cmp.obj': array([0.00023868])}
Driver debug print for iter coord: rank0:ScipyOptimize_SLSQP|1
--------------------------------------------------------------
Design Vars
{'alpha': array([0.5])}
Nonlinear constraints
None
Linear constraints
None
Objectives
{'obj_cmp.obj': array([0.00023868])}
Optimization terminated successfully. (Exit mode 0)
Current function value: 0.0002386835700364719
Iterations: 1
Function evaluations: 1
Gradient evaluations: 1
Optimization Complete
-----------------------------------
Finished optimisation
Why might SLSQP be misbehaving like this? As far as I can tell, there are no incorrect analytical derivatives when I look at check_partials().
The code is quite long, so I put it on Pastebin here:
core: https://pastebin.com/fKJpnWHp
inviscid: https://pastebin.com/7Cmac5GF
aerofoil coordinates (NACA64-012): https://pastebin.com/UZHXEsr6
You asked two questions whos answers ended up being unrelated to eachother:
Why is the model so slow when you use SLSQP, but fast when you use COBYLA
Why does SLSQP stop after one iteration?
1) Why is SLSQP so slow?
COBYLA is a gradient free method. SLSQP uses gradients. So the solid bet was that slow down happened when SLSQP asked for the derivatives (which COBYLA never did).
Thats where I went to look first. Computing derivatives happens in two steps: a) compute partials for each component and b) solve a linear system with those partials to compute totals. The slow down has to be in one of those two steps.
Since you can run check_partials without too much trouble, step (a) is not likely to be the culprit. So that means step (b) is probably where we need to speed things up.
I ran the summary utility (openmdao summary core.py) on your model and saw this:
============== Problem Summary ============
Groups: 9
Components: 36
Max tree depth: 4
Design variables: 1 Total size: 1
Nonlinear Constraints: 0 Total size: 0
equality: 0 0
inequality: 0 0
Linear Constraints: 0 Total size: 0
equality: 0 0
inequality: 0 0
Objectives: 1 Total size: 1
Input variables: 87 Total size: 1661820
Output variables: 44 Total size: 1169614
Total connections: 87 Total transfer data size: 1661820
Then I generated an N2 of your model and saw this:
So we have an output vector that is 1169614 elements long, which means your linear system is a matrix that is about 1e6x1e6. Thats pretty big, and you are using a DirectSolver to try and compute/store a factorization of it. Thats the source of the slow down. Using DirectSolvers is great for smaller models (rule of thumb, is that the output vector should be less than 10000 elements). For larger ones you need to be more careful and use more advanced linear solvers.
In your case we can see from the N2 that there is no coupling anywhere in your model (nothing in the lower triangle of the N2). Purely feed-forward models like this can use a much simpler and faster LinearRunOnce solver (which is the default if you don't set anything else). So I turned off all DirectSolvers in your model, and the derivatives became effectively instant. Make your N2 look like this instead:
The choice of best linear solver is extremely model dependent. One factor to consider is computational cost, another is numerical robustness. This issue is covered in some detail in Section 5.3 of the OpenMDAO paper, and I won't cover everything here. But very briefly here is a summary of the key considerations.
When just starting out with OpenMDAO, using DirectSolver is both the simplest and usually the fastest option. It is simple because it does not require consideration of your model structure, and it's fast because for small models OpenMDAO can assemble the Jacobian into a dense or sparse matrix and provide that for direct factorization. However, for larger models (or models with very large vectors of outputs), the cost of computing the factorization is prohibitively high. In this case, you need to break the solver structure down more intentionally, and use other linear solvers (sometimes in conjunction with the direct solver--- see Section 5.3 of OpenMDAO paper, and this OpenMDAO doc).
You stated that you wanted to use the DirectSolver to take advantage of the sparse Jacobian storage. That was a good instinct, but the way OpenMDAO is structured this is not a problem either way. We are pretty far down in the weeds now, but since you asked I'll give a short summary explanation. As of OpenMDAO 3.7, only the DirectSolver requires an assembled Jacobian at all (and in fact, it is the linear solver itself that determines this for whatever system it is attached to). All other LinearSolvers work with a DictionaryJacobian (which stores each sub-jac keyed to the [of-var, wrt-var] pair). Each sub-jac can be stored as dense or sparse (depending on how you declared that particular partial derivative). The dictionary Jacobian is effectively a form of a sparse-matrix, though not a traditional one. The key takeaway here is that if you use the LinearRunOnce (or any other solver), then you are getting a memory efficient data storage regardless. It is only the DirectSolver that changes over to a more traditional assembly of an actual matrix object.
Regarding the issue of memory allocation. I borrowed this image from the openmdao docs
2) Why does SLSQP stop after one iteration?
Gradient based optimizations are very sensitive to scaling. I ploted your objective function inside your allowed design space and got this:
So we can see that the minimum is at about 6 degrees, but the objective values are TINY (about 1e-4).
As a general rule of thumb, getting your objective to around order of magnitude 1 is a good idea (we have a scaling report feature that helps with this). I added a reference that was about the order of magnitude of your objective:
p.model.add_objective('obj', ref=1e-4)
Then I got a good result:
Optimization terminated successfully (Exit mode 0)
Current function value: [3.02197589e-11]
Iterations: 7
Function evaluations: 9
Gradient evaluations: 7
Optimization Complete
-----------------------------------
Finished optimization
alpha = [6.04143334]
time: 2.1188600063323975 seconds
Unfortunately, scaling is just hard with gradient based optimization. Starting by scaling your objective/constraints to order-1 is a decent rule of thumb, but its common that you need to adjust things beyond that for more complex problems.

Understanding Time2Vec embedding for implementing this as a keras layer

The paper time2vector link (the relevant theory is in section 4) shows an approach to include a time embedding for features to improve model performance. I would like to give this a try. I found a implementation as keras layer which I changed a little bit. Basically it creates two matrices for one feature:
(1) linear = w * x + b
(2) periodic = sin(w * x + b)
Currently I choose this feature manually. Concerning the paper there are a few things i don't understand. The first thing is the term k as the number of sinusoids. The authors use up to 64 sinusoids. What does this mean? I have just 1 sinusoid at the moment, right? Secondly I'm about to put every feature I have through the sinus transformation for me dataset that would make 6 (sinusoids) periodic features. The authors use only one linear term. How should I choose the feature for the linear term? Unfortunately the code from the paper is not available anymore. Has anyone worked with time embeddings or even with this particularly approach?
For my limited understanding, the linear transformation of time is a fixed element of the produced embedding and the parameter K allows you to select how many different learned time representations you want to use in your model. So, the resulting embedding has a size of K+1 elements.

Optimization Algorithm vs Regression Models

Currently, I'm dealing with forecasting problems. I have a reference that used linear function to represent the input and output data.
y = po + p1.x1 + p2.x2
Both of x1 and x2 are known input; y is output; p0, p1, and p2 are the coefficient. Then, he used all the training data and Least Square Estimation (LSE) method to find the optimal coefficient (p0, p1, p2) to build the model.
My question is if he already used the LSE algorithm, can I try to improve his method by using any optimization algorithm (PSO or GA for example) to try find better coefficient value?
You answered this yourself:
Blockquote Then, he used all the training data and Least Square Estimation (LSE) method to find the optimal coefficient (p0, p1, p2) to build the model.
Because a linear-model is quite easy to optimize, the LSE method obtained a global optimum (ignoring subtle rounding-errors and early-stopping/tolerance errors). Without changing the model, there is no gain in terms of using other coefficients, independent on the usage of meta-heuristics lika GA.
So you may modify the model, or add additional data (feature-engineering: e.g. product of two variables; kernel-methods).
One thing to try: Support-Vector machines. These are also convex and can be trained efficiently (with not too much data). They are also designed to work well with kernels. An additional advantage (compared with more complex models: e.g. non-convex): they are quite good regarding generalization which seems to be important here because you don't have much data (sounds like a very small dataset).
See also #ayhan's comment!

PyMC: How can I describe a state space model?

I used to code my MCMC using C. But I'd like to give PyMC a try.
Suppose X_n is the underlying state whose dynamics following a Markov chain and Y_n is the observed data. In particular,
Y_n has Poisson distribution with mean depending on X_n and a multidimensional unknown parameter theta
X_n | X_{n-1} has distribution depending on theta
How should I describe this model using PyMC?
Another question: I can find conjugate priors for theta but not for X_n. Is it possible to specify which posteriors are updated using conjugate priors and which using MCMC?
Here is an example of a state-space model in PyMC on the PyMC wiki. It basically involves populating a list and allowing PyMC to treat it as a container of PyMC nodes.
As for the second part of the question, you could certainly calculate some of your conjugate posteriors ahead of time and put them into the model. For example, if you observed binomial data x=4, n=10 you could insert a Beta node p = Beta('p', 5, 7) to represent that posterior (its really just a prior, as far as the model is concerned, but it is the posterior given data x). Then PyMC would draw a sample for this posterior at every iteration to be used wherever it is needed in the model.