What's the difference between Stabilizer() block and enable_self_stabilization parameter? - cntk

When should I use one or another? Tutorials and examples use either Sequential([Stabilizer(), Recurrence(LSTM(hidden_dim))]) or LSTMP_component_with_self_stabilization from Examples/common/nn.py. I've tried replacing the former with Recurrence(LSTM(hidden_dim, enable_self_stabilization=True)) in the char_rnn.py example, but the results are significantly worse.

The Stabilizer layer multiplies its input with a learnable scalar. This simple trick has been shown to significantly improve convergence and stability. It has some similarity with BatchNormalization. Generally, when you can use BatchNormalization, you should try that first. Where that is not possible, which is specifically inside recurrent loops, I recommend to use Stabilizer instead.
Normally, you must inject it explicitly in your model. A special case are the recurrent step functions (e.g. LSTM), which include Stabilizers inside. Use enable_self_stabilization=True to enable that. Those built-in Stabilizers only apply to internal variables. For the main input, you must insert a Stabilizer yourself.
If you include explicit Stabilizers but set enable_self_stabilization=False (e.g. as a default_option), then those explicit Stabilizers are no-ops.
It is not my experience that Stabilizer makes things worse. It is generally a sure-fire thing to improve convergence. It does change numeric ranges, though. So if it makes convergence worse, I suggest to experiment with different hyper-parameter settings, e.g. reduce the learning rate.

Related

Does increasing the number of iterations affect log-lik, AIC etc.?

Whenever I try to solve a convergence issue in one of my glmer models with the help of a different optimizer, I repeat the entire model optimization procedure with the new optimizer. That is, I re-run all the models I've computed so far with the new optimizer and again conduct comparisons with anova (). I do this because as far as I know different optimizers may lead to differences in AICs and log-lik ratios for one and the same model, making comparisons between two models that use different optimizers problematic.
In my most recent analysis, I've increased the number of iterations with optCtrl=list(maxfun=100000) to avoid convergence errors. I'm now wondering whether this can also lead to differences in AIC/log-lik etc. for one and the same model? Is it equally problematic to compare two models that differ with regard to the inclusion of the optCtrl=list(maxfun=100000) argument?
I actually thought that increasing the number of iterations would simply lead to longer computation times (rather than different results), but I was unable to verify this online. Any hint/explanation is appreciated.
As far as I know, you should be fine. As long as the models were fit with the same number of observations you should be able to compare them using the AIC. Hopefully someone else can comment on the nuances of the computations of the AIC itself, but I just fit a bunch of models with the same formula and dataset and different number of max iterations, getting the AIC each time. It didn't change as a function of the iterations. The iterations are just the time the model fitting process can take to maximize the likelihood, which for complex models can be tricky. Once a model is fit, and has converged on an answer, the number of iterations shouldn't change anything about the model itself.
If you look at this question, the top answer explains the AIC quite well:https://stats.stackexchange.com/questions/232465/how-to-compare-models-on-the-basis-of-aic

tensorflow datasets: more efficient to vectorize with unbatching (batch -> map -> unbatch) or just map?

TensorFlow recommends batching of datasets before transformations with map in order to vectorize the transformation and reduce overhead: https://www.tensorflow.org/guide/data_performance#vectorizing_mapping
However, there are cases where you want to perform transformations on the dataset and then do something (e.g., shuffle) on the UNBATCHED dataset.
I haven't been able to find anything to indicate which is more efficient:
1) dataset.map(my_transformations)
2) dataset.batch(batch_size).map(my_transformations).unbatch()
(2) has reduced map overhead from having vectorized with batch, but has additional overhead from having to unbatch the dataset after.
I could also see it being that there is not a universal rule. Short of testing every time I try a new dataset or transformation (or hardware!), is there a good rule of thumb here? I have seen several examples online use (2) without explanation, but I have no intuition on this subject, so...
Thanks in advance!
EDIT: I have since found that in at least some cases, (2) is MUCH less efficient than (1). For example, on our image dataset, applying random flips and rotations (with .map and the built-in TF functions tf.image.random_flip_left_right, tf.image.random_flip_up_down, and tf.image.rot90) per epoch for data augmentation takes 50% longer with (2). I still have no idea when to expect this to be the case, or not, but the tutorials' suggested approach is at least sometimes wrong.
The answer is (1). https://github.com/tensorflow/tensorflow/issues/40386
TF is modifying the documentation to reflect that the overhead from unbatch will usually (always?) be higher than the savings from vectorized transformations.

scipy.sparse.linalg: what's the difference between splu and factorized?

What's the difference between using
scipy.sparse.linalg.factorized(A)
and
scipy.sparse.linalg.splu(A)
Both of them return objects with .solve(rhs) method and for both it's said in the documentation that they use LU decomposition. I'd like to know the difference in performance for both of them.
More specificly, I'm writing a python/numpy/scipy app that implements dynamic FEM model. I need to solve an equation Au = f on each timestep. A is sparse and rather large, but doesn't depend on timestep, so I'd like to invest some time beforehand to make iterations faster (there may be thousands of them). I tried using scipy.sparse.linalg.inv(A), but it threw memory exceptions when the size of matrix was large. I used scipy.linalg.spsolve on each step until recently, and now am thinking on using some sort of decomposition for better performance. So if you have other suggestions aside from LU, feel free to propose!
They should both work well for your problem, assuming that A does not change with each time step.
scipy.sparse.linalg.inv(A) will return a dense matrix that is the same size as A, so it's no wonder it's throwing memory exceptions.
scipy.linalg.solve is also a dense linear solver, which isn't what you want.
Assuming A is sparse, to solve Au=f and you only want to solve Au=f once, you could use scipy.sparse.linalg.spsolve. For example
u = spsolve(A, f)
If you want to speed things up dramatically for subsequent solves, you would instead use scipy.sparse.linalg.factorized or scipy.sparse.linalg.splu. For example
A_inv = splu(A)
for t in range(iterations):
u_t = A_inv.solve(f_t)
or
A_solve = factorized(A)
for t in range(iterations):
u_t = A_solve(f_t)
They should both be comparable in speed, and much faster than the previous options.
As #sascha said, you will need to dig into the documentation to see the differences between splu and factorize. But, you can use 'umfpack' instead of the default 'superLU' if you have it installed and set up correctly. I think umfpack will be faster in most cases. Keep in mind that if your matrix A is too large or has too many non-zeros, an LU decomposition / direct solver may take too much memory on your system. In this case, you might be stuck with using an iterative solver such as this. Unfortunately, you wont be able to reuse the solve of A at each time step, but you might be able to find a good preconditioner for A (approximation to inv(A)) to feed the solver to speed it up.

Will using multiple minimizing ops at once work as expected in Tensorflow?

For example, if I do:
loss_one = something
loss_two = somthing_else
train_one = tf.train.AdamOptimzer(0.001).minimize(loss_one)
train_two = tf.train.AdamOptimizer(0.001).minimize(loss_two)
sess.run([train_one, train_two])
Will that do what's expected? The reason I'm concerned is because I don't exactly know how gradients are accumulated. Are they stored on the optimizers themselves? Or on the variables? If it's the second, I can imagine them interfering.
Most likely not. Presumably, both loss_one and loss_two are a measure of how close the output of your model, let's say out, is to what you expected, so they would both be a function of out and maybe something else. Both optimizers compute the variable updates from the out computed with the values that the variables had before calling session.run. So if you apply one update and then the other, the second update would not be really correct, because it has not been computed using the now-updated variables. This may not be a huge issue though, since. A more complicated problem is that, depending on how exactly the optimizer is implemented, if it is something more or less like variable = variable + update then it is not deterministic whether that variable on the right-hand side of the expression has the original or first-updated value, so you could end adding only one of the updates or both, non-deterministically.
There are several better alternatives:
Use only one optimizer at a time, so you call sess.run(train_one) first and sess.run(train_two) later.
Optimize the (possibly weighted) sum of both losses (tf.train.AdamOptimzer(0.001).minimize(loss_one + loss_two)).
Call compute_gradients from the optimizer for each loss value, combine the resulting gradients however you see fit (e.g. adding or averaging the updates) and apply them with apply_gradients.
Use tf.control_dependencies to make sure that one optimization step always takes place after the other. However this means that using the second optimizer will always require using the first one (could be work around, maybe with tf.cond, but it's more of a hassle).
the optimizer is mainly in charge of calculating the gradients(backpropagation), if you give it loss twice(run it two times as you are doing), it will update the gradients twice by performing inference once.not sure why would you do that though

Constrained optimization with hessian in scipy

I want to minimize a function, subject to constraints (the variables are non-negative). I can compute the gradient and Hessian exactly. So I want something like:
result = scipy.optimize.minimize(objective, x0, jac=grad, hess=hess, bounds=bds)
I need to specify a method for the optimization (http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html). Unfortunately I can't seem to find a method that allows for both user-specified bounds and a Hessian!
This is particularly annoying because methods "TNC" and "Newton-CG" seem essentially the same, however TNC estimates Hessian internally (in C code), while Newton-CG doesn't allow for constraints.
So, how can I do a constrained optimization with user-specified Hessian? Seems like there ought to be an easy option for this in scipy -- am I missing something?
I realized a workaround for my problem, which is to transform the constrained optimization into an unconstrained optimization.
In my case, since I have the constraint x > 0, I decided to optimize over log(x) instead of x. This was easy to do for my problem since I am using automatic differentiation.
Still, this seems like a somewhat unsatisfying solution -- I still think scipy should allow some constrained second-order minimization method.
just bumped into exactly this point myself. I think the TNC applies an active set to the line search of the CG, not the direction of the line search. Conversely the Hessian chooses the direction of the line. So, er, could maybe cut the line search out of NCG and drop it into TNC. Problem is when you are at the boundary the Hessian might not take you out of it.
How about using TNC for an extremely sloppy first guess [give it a really large error bound to hit], then use NCG with a small number of iterations, check: if on boundary back to TNC, else continue with NCG. Ugh...
Yes, or use log(x). I'm going to follow your lead.