pgmpy bayesian Network - CPD too large, but roots are independent - bayesian-networks

I have a large baysian network to build and I'm using pgmpy.
For simplicity, the network is only 2 levels deep:
layer 1: causes
layer 2: effects
There are about 100 possible causes, and each effect e, is related to about ~30 different causes.
The CPD for each effect is HUGE (2 ** 30 wide).
But! i know that each cause c is independent of all other causes.
P(f/c1,c2) = p(f/c1)*p(f/c2)
Can i use this information to make a scale-able network and avoid generating these huge CPDs?
Thx

Related

Mosek runs indefinitely on large MIQCQP

We have a large-scale MIQCQP problem. Problem size:
Decision vars: ~9K (with 3K continuous and 6K integral vars)
Objective: 1 Linear expression
Constraints (linear): 35K linear constraints (9K lower bound + 9K upper bound + remaining inequality constraints)
Constraints (Quadratic): 1 quad constraint (with Q matrix size as 3K*3K, which is PSD)
When we use Mosek (via Cvxpy), it runs indefinitely (in the branch & bound logic). Moreover, from the mosek logs: BEST_INT_OBJ and REL_GAP(%) are displayed NA throughout.
Since this problem contains proprietary data, its difficult to share it.
Are there any general tips or tricks to speed up the problem?
(Weirdly, Gurobi can solve the same problem in less then a minute)

SLSQP in ScipyOptimizeDriver only executes one iteration, takes a very long time, then exits

I'm trying to use SLSQP to optimise the angle of attack of an aerofoil to place the stagnation point in a desired location. This is purely as a test case to check that my method for calculating the partials for the stagnation position is valid.
When run with COBYLA, the optimisation converges to the correct alpha (6.04144912) after 47 iterations. When run with SLSQP, it completes one iteration, then hangs for a very long time (10, 20 minutes or more, I didn't time it exactly), and exits with an incorrect value. The output is:
Driver debug print for iter coord: rank0:ScipyOptimize_SLSQP|0
--------------------------------------------------------------
Design Vars
{'alpha': array([0.5])}
Nonlinear constraints
None
Linear constraints
None
Objectives
{'obj_cmp.obj': array([0.00023868])}
Driver debug print for iter coord: rank0:ScipyOptimize_SLSQP|1
--------------------------------------------------------------
Design Vars
{'alpha': array([0.5])}
Nonlinear constraints
None
Linear constraints
None
Objectives
{'obj_cmp.obj': array([0.00023868])}
Optimization terminated successfully. (Exit mode 0)
Current function value: 0.0002386835700364719
Iterations: 1
Function evaluations: 1
Gradient evaluations: 1
Optimization Complete
-----------------------------------
Finished optimisation
Why might SLSQP be misbehaving like this? As far as I can tell, there are no incorrect analytical derivatives when I look at check_partials().
The code is quite long, so I put it on Pastebin here:
core: https://pastebin.com/fKJpnWHp
inviscid: https://pastebin.com/7Cmac5GF
aerofoil coordinates (NACA64-012): https://pastebin.com/UZHXEsr6
You asked two questions whos answers ended up being unrelated to eachother:
Why is the model so slow when you use SLSQP, but fast when you use COBYLA
Why does SLSQP stop after one iteration?
1) Why is SLSQP so slow?
COBYLA is a gradient free method. SLSQP uses gradients. So the solid bet was that slow down happened when SLSQP asked for the derivatives (which COBYLA never did).
Thats where I went to look first. Computing derivatives happens in two steps: a) compute partials for each component and b) solve a linear system with those partials to compute totals. The slow down has to be in one of those two steps.
Since you can run check_partials without too much trouble, step (a) is not likely to be the culprit. So that means step (b) is probably where we need to speed things up.
I ran the summary utility (openmdao summary core.py) on your model and saw this:
============== Problem Summary ============
Groups: 9
Components: 36
Max tree depth: 4
Design variables: 1 Total size: 1
Nonlinear Constraints: 0 Total size: 0
equality: 0 0
inequality: 0 0
Linear Constraints: 0 Total size: 0
equality: 0 0
inequality: 0 0
Objectives: 1 Total size: 1
Input variables: 87 Total size: 1661820
Output variables: 44 Total size: 1169614
Total connections: 87 Total transfer data size: 1661820
Then I generated an N2 of your model and saw this:
So we have an output vector that is 1169614 elements long, which means your linear system is a matrix that is about 1e6x1e6. Thats pretty big, and you are using a DirectSolver to try and compute/store a factorization of it. Thats the source of the slow down. Using DirectSolvers is great for smaller models (rule of thumb, is that the output vector should be less than 10000 elements). For larger ones you need to be more careful and use more advanced linear solvers.
In your case we can see from the N2 that there is no coupling anywhere in your model (nothing in the lower triangle of the N2). Purely feed-forward models like this can use a much simpler and faster LinearRunOnce solver (which is the default if you don't set anything else). So I turned off all DirectSolvers in your model, and the derivatives became effectively instant. Make your N2 look like this instead:
The choice of best linear solver is extremely model dependent. One factor to consider is computational cost, another is numerical robustness. This issue is covered in some detail in Section 5.3 of the OpenMDAO paper, and I won't cover everything here. But very briefly here is a summary of the key considerations.
When just starting out with OpenMDAO, using DirectSolver is both the simplest and usually the fastest option. It is simple because it does not require consideration of your model structure, and it's fast because for small models OpenMDAO can assemble the Jacobian into a dense or sparse matrix and provide that for direct factorization. However, for larger models (or models with very large vectors of outputs), the cost of computing the factorization is prohibitively high. In this case, you need to break the solver structure down more intentionally, and use other linear solvers (sometimes in conjunction with the direct solver--- see Section 5.3 of OpenMDAO paper, and this OpenMDAO doc).
You stated that you wanted to use the DirectSolver to take advantage of the sparse Jacobian storage. That was a good instinct, but the way OpenMDAO is structured this is not a problem either way. We are pretty far down in the weeds now, but since you asked I'll give a short summary explanation. As of OpenMDAO 3.7, only the DirectSolver requires an assembled Jacobian at all (and in fact, it is the linear solver itself that determines this for whatever system it is attached to). All other LinearSolvers work with a DictionaryJacobian (which stores each sub-jac keyed to the [of-var, wrt-var] pair). Each sub-jac can be stored as dense or sparse (depending on how you declared that particular partial derivative). The dictionary Jacobian is effectively a form of a sparse-matrix, though not a traditional one. The key takeaway here is that if you use the LinearRunOnce (or any other solver), then you are getting a memory efficient data storage regardless. It is only the DirectSolver that changes over to a more traditional assembly of an actual matrix object.
Regarding the issue of memory allocation. I borrowed this image from the openmdao docs
2) Why does SLSQP stop after one iteration?
Gradient based optimizations are very sensitive to scaling. I ploted your objective function inside your allowed design space and got this:
So we can see that the minimum is at about 6 degrees, but the objective values are TINY (about 1e-4).
As a general rule of thumb, getting your objective to around order of magnitude 1 is a good idea (we have a scaling report feature that helps with this). I added a reference that was about the order of magnitude of your objective:
p.model.add_objective('obj', ref=1e-4)
Then I got a good result:
Optimization terminated successfully (Exit mode 0)
Current function value: [3.02197589e-11]
Iterations: 7
Function evaluations: 9
Gradient evaluations: 7
Optimization Complete
-----------------------------------
Finished optimization
alpha = [6.04143334]
time: 2.1188600063323975 seconds
Unfortunately, scaling is just hard with gradient based optimization. Starting by scaling your objective/constraints to order-1 is a decent rule of thumb, but its common that you need to adjust things beyond that for more complex problems.

For a buckling analysis, must all forces be multiplied by the resulting eigenvalue, or only the compressive load?

I am trying to do a linear buckling analysis (sol 105) with Nastran on a cylindrical shell structure. My understanding is that the compressive load that I apply to the structure must be multiplied by the resulting eigenvalue to get the buckling load. This gives me results that I expect.
However, now I apply a single perturbation load (SPL), a small transverse force acting midway along the cylinder on a single grid point. My understanding is that the magnitude of the SPL stays the way it is, (Unlike the compressive load where I multiply it with the eigenvalue to obtain buckling load.) The results I obtain are not what I expect, as the buckling load should not reduce so much as the SPL increases, according to the theory on this topic.
I am wondering if anyone knows what I am doing wrong. I feel like my mistake is probably very easy, but I haven't been able to solve it yet. Here is some more information on my implementation:
Axial compressive force spread over top grid points of cylinder.
Both SPL (the transverse point load) and axial loads are added to the static analysis subcase. Then the buckling subcase uses the static subcase for its analysis. This is how I understand it should be done.
boundary conditions:
SPC1 restraining 123 (xyz) directions at bottom grid points.
SPC1 restraining 12 (xy) directions at top grid points.
I'm not a Nastran user but I've done a lot of buckling analysis with Cast3M software.
The linear buckling analysis does not need perturbation loading, but only your main axial loading (F^0).
To recap,
Solve the linear problem for axial loading :
solve for u^0 : [K] * u^0 = F^0
get the linear stresses from the Hooke law : \sigma^0 = D * B * u^0
Solve the eigenvalue buckling problem :
[ K + \lambda Kgeo(\sigma^0)] * X = 0
Then, if you want to perform a non-linear (large displacement) post-buckling analysis, it is recommended to introduce a small perturbation which "excites" the buckling mode.
If you introduce the perturbation loading before the linear buckling analysis, maybe Nastran is adding it to F^0 and it is then logical that the result of buckling changes.
Hope this can help you.
There is a way to scale some loads and hold others constant. Create 2 Static Subcases with 2 (different) sets of loads:
Constant loads (that are not scaled, like a preload or internal pressure)
Those that will be scaled by the eigenvalue
Order does not matter
Use the Nastran STATSUB entry to define. It looks like this:
SUBCASE 100
LOAD = 1 $ Static pre-load
SUBCASE 200
LOAD = 2 $ Varying buckling load
$ -------------
SUBCASE 1000
STATSUB(PRELOAD) = 100
STATSUB(BUCKLING) = 200
METHOD = 10
The eigensolution is modified to include influence of static and varying loads.

Does deeper LSTM need more units?

I'm applying LSTM on time series forecasting with 20 lags. Suppose that we have two cases. The first one just using five lags and the second one (like my case) is using 20 lags. Is it correct that for the second case we need more units compared to the former one? If yes, how can we support this idea? I have 2000 samples for training the model, so this is the main limitation for increasing number of units here.
It is very difficult to give an exact answer as the relationship between timesteps and number of hidden units is not an exact science. For example, following factors can affect the number of units required.
Short term memory problem vs long-term memory problem
If your problem can be solved with relatively less memory (i.e. requires to remember only a few time steps) you wouldn't get much benefit from adding more neurons while increasing the number of steps.
The amount of data
If you don't have enough data for the model to learn from (which I feel like you will run into with 2000 data points - but I could be wrong), then increasing the number of timesteps won't help you much.
The type of model you use
Depending on the type of model you use (e.g. LSTM / GRU ) you might get different results (this is not always true but can happen for certain problems)
I'm sure there are other factors out there, but these are few that came to my mind.
Proving more units give better results while having more time steps (if true)
That should be relatively easy as you can try few different options,
5 lags with 10 / 20 / 50 hidden units
20 lags with 10 / 20 / 50 hidden units
And if you get better performance (e.g. lower MSE) with 20 lags problem than 5 lags problem (when you use 50 units), then you have gotten your point across. And you can reinforce your claims by showing results with different types of models (e.g. LSTMs vs GRUs).

Understanding tensorflow inter/intra parallelism threads

I would like to understand a little more about these two parameters: intra and inter op parallelism threads
session_conf = tf.ConfigProto(
intra_op_parallelism_threads=1,
inter_op_parallelism_threads=1)
I read this post which has a pretty good explanation: TensorFlow: inter- and intra-op parallelism configuration
But I am seeking confirmations and also asking new questions below. And I am running my task in keras 2.0.9, tensorflow 1.3.0:
when both are set to 1, does it mean that, on a computer with 4 cores for example, there will be only 1 thread shared by the four cores?
why using 1 thread does not seem to affect my task very much in terms of speed? My network has the following structure: dropout, conv1d, maxpooling, lstm, globalmaxpooling,dropout, dense. The post cited above says that if there are a lot of matrix multiplication and subtraction operations, using a multiple thread setting can help. I do not know much about the math underneath but I'd imagine there are quite a lot of such matrix operations in my model? However, setting both params from 0 to 1 only sees a 1 minute slowdown over a 10 minute task.
why multi-thread could be a source of non-reproducible results? See Results not reproducible with Keras and TensorFlow in Python. This is the main reason I need to use single threads as I am doing scientific experiments. And surely tensorflow has been improving over the time, why this is not addressed in the release?
Many thanks in advance
When both parameters are set to 1, there will be 1 thread running on 1 of the 4 cores. The core on which it runs might change but it will always be 1 at a time.
When running something in parallel there is always a trade-off between lost time on communication and gained time through parallelization. Depending on the used hardware and the specific task (like the size of the matrices) the speedup will change. Sometimes running something in parallel will be even slower than using one core.
For example when using floats on a cpu, (a + b) + c will not be equal to a + (b + c) because of the floating point precision. Using multiple parallel threads means that operations like a + b + c will not always be computed in the same order, leading to different results on each run. However those differences are extremely small and will not effect the overall result in most cases. Completely reproducible results are usually only needed for debugging. Enforcing complete reproducibility would slow down multi-threading a lot.
Answer to question 1 is "No".
Setting both the parameters to 1 (intra_op_parallelism_threads=1, inter_op_parallelism_threads=1) will generate N threads, where N is the count of cores. I've tested it multiple times on different versions of TensorFlow. This is true even for latest version of TensorFlow. There are multiple questions on how to reduce the number of threads to 1 but with no clear answer. Some examples are
How to stop TensorFlow from multi-threading
https://github.com/usnistgov/frvt/issues/12
Changing the number of threads in TensorFlow on Cifar10
Importing TensorFlow spawns threads
https://github.com/tensorflow/tensorflow/issues/13853