Will using multiple minimizing ops at once work as expected in Tensorflow? - tensorflow

For example, if I do:
loss_one = something
loss_two = something_else
train_one = tf.train.AdamOptimizer(0.001).minimize(loss_one)
train_two = tf.train.AdamOptimizer(0.001).minimize(loss_two)
sess.run([train_one, train_two])
Will that do what's expected? The reason I'm concerned is because I don't exactly know how gradients are accumulated. Are they stored on the optimizers themselves? Or on the variables? If it's the second, I can imagine them interfering.

Most likely not. Presumably, both loss_one and loss_two measure how close the output of your model, let's say out, is to what you expected, so they are both functions of out and maybe something else. Both optimizers compute their variable updates from the out computed with the values that the variables had before calling session.run. So if you apply one update and then the other, the second update is not really correct, because it was not computed from the now-updated variables. This may not be a huge issue, though. A more complicated problem is that, depending on how exactly the optimizer is implemented, if the update is something more or less like variable = variable + update, then it is not deterministic whether the variable on the right-hand side has the original or the first-updated value, so you could end up adding only one of the updates or both, non-deterministically.
There are several better alternatives:
Use only one optimizer at a time, so you call sess.run(train_one) first and sess.run(train_two) later.
Optimize the (possibly weighted) sum of both losses (tf.train.AdamOptimizer(0.001).minimize(loss_one + loss_two)).
Call compute_gradients from the optimizer for each loss value, combine the resulting gradients however you see fit (e.g. adding or averaging the updates) and apply them with apply_gradients.
Use tf.control_dependencies to make sure that one optimization step always takes place after the other. However, this means that using the second optimizer will always require using the first one (this could be worked around, maybe with tf.cond, but it's more of a hassle).
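To make the difference between these options concrete, here is a small framework-free sketch in plain Python (not TensorFlow code). The two quadratic losses, the single variable w and all function names are assumptions chosen only for illustration, and plain gradient descent stands in for Adam.

```python
# Toy illustration in plain Python; nothing here is a TensorFlow API.
# Two hypothetical losses share one variable w:
#   loss_one(w) = (w - 1)^2      loss_two(w) = (w + 3)^2

def grad_one(w):
    return 2.0 * (w - 1.0)

def grad_two(w):
    return 2.0 * (w + 3.0)

def combined_step(w, lr=0.1):
    """Evaluate both gradients at the same w, sum them, apply one update."""
    return w - lr * (grad_one(w) + grad_two(w))

def sequential_steps(w, lr=0.1):
    """Apply one update at a time; the second gradient sees the updated w."""
    w = w - lr * grad_one(w)
    return w - lr * grad_two(w)

def racy_step_lost_update(w, lr=0.1):
    """One possible outcome of the race: both updates read the old w,
    and the second write overwrites the first."""
    w_after_one = w - lr * grad_one(w)   # computed, then overwritten
    w_after_two = w - lr * grad_two(w)
    return w_after_two

if __name__ == "__main__":
    for step in (combined_step, sequential_steps, racy_step_lost_update):
        print(step.__name__, step(0.0))
```

The three functions give three different values for the same starting point, which is the reason running both minimize ops in a single sess.run call is hard to reason about.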

The optimizer is mainly in charge of calculating the gradients (backpropagation). If you give it the loss twice (run it two times, as you are doing), it will update the gradients twice while performing inference only once. I'm not sure why you would do that, though.

How to control reduction strategy for stateful metric in keras mirrored strategy

I use keras fit() method with custom metrics passed to model.
The metrics are stateful - i.e. are a subclass of a Metric, as described in https://keras.io/api/metrics/#as-subclasses-of-metric-stateful
When I run the code in a multi-gpu environment using a tf.distribute.MirroredStrategy() my metric code is called on every GPU separately with batch_size/no_of_gpus examples passed, which is reasonable to expect.
What happens next is that multiple scalars (one from every GPU) of the metric value need to be reduced to a single scalar, and what I get all the time is a sum reduction, while I would like to control that.
Keep in mind that the reduction parameter belongs to the Loss class in Keras, and there is no such thing in the Metric class: https://github.com/tensorflow/tensorflow/blob/acbc065f8eb2ed05c7ab5c42b5c5bd6abdd2f91f/tensorflow/python/keras/metrics.py#L87
(the only crazy thing I tried was to inherit from a Mean class that is a subclass of a Metric but that didn't change anything)
reduction is mentioned in the metrics code, but that is a reduction over multiple accumulated values within a single metric object. That is not the case in a multi-GPU setting, where every metric works on its own GPU and the values are somehow aggregated at the end.
The way I debugged it to understand this behaviour was - I was printing the shapes and the results inside update_state method of the metric. And then I looked at value of the metric in logs object in on_batch_end callback.
I tried looking at TF code, but couldn't find the place this is happening.
I would like to be able to control this behaviour - so either pick 'mean' or 'sum' for the metric, or at least know where it is being done in the code.
Edited: I guess this https://github.com/tensorflow/tensorflow/issues/39268 sheds some more light on this issue
I am facing the same problem as you (and that's why I found your question).
Seeing that it's been 15 days since you asked the question and there are no answers/comments yet, I thought I might share my temporary workaround.
Like you, I also think that a SUM reduction is performed when combining results across multiple GPUs. What I did was pass the number of GPUs (e.g. given by the num_replicas_in_sync attribute of your tf.distribute strategy object) into the __init__(...) constructor of the subclassed metric object, and use it to divide the return value in the result() method.
Potentially, you could also use tf.distribute.get_strategy() from within the metric object to make it "strategy aware", and use the information to decide how to modify the values in an ad hoc manner so that the SUM reduction will produce what you want.
I hope this helps for now, whether as a suggestion or as a confirmation that you're not alone on this.
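The arithmetic behind this workaround can be sketched in plain Python. This is not Keras code: the shards, the function names and the two-replica setup are assumptions made up for illustration, and the sketch only models the observed behaviour (per-replica values being summed).

```python
# Plain-Python model of the observed behaviour; no TensorFlow involved.
# Each "replica" reports the mean of its own shard of the batch, and the
# per-replica values are then combined by summation.

def replica_mean(shard):
    return sum(shard) / len(shard)

def sum_reduced_metric(shards):
    """What the question reports seeing: a SUM over per-replica values."""
    return sum(replica_mean(s) for s in shards)

def corrected_metric(shards, num_replicas):
    """The workaround: divide the summed value by the number of replicas."""
    return sum_reduced_metric(shards) / num_replicas

# Hypothetical batch of 4 examples split evenly over 2 GPUs.
shards = [[1.0, 3.0], [5.0, 7.0]]
flat = [x for shard in shards for x in shard]
global_mean = sum(flat) / len(flat)

if __name__ == "__main__":
    print(sum_reduced_metric(shards), corrected_metric(shards, 2), global_mean)
```

Note that dividing by the replica count only recovers the global mean when every replica sees the same number of examples; with uneven shards the corrected value is a mean of means.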
When implementing the subclass of the Keras Metric class, you have to override the merge_state() function correctly. If you do not override this function, the default implementation will be used - which is a simple sum.
See: https://www.tensorflow.org/api_docs/python/tf/keras/metrics/Metric

Does increasing the number of iterations affect log-lik, AIC etc.?

Whenever I try to solve a convergence issue in one of my glmer models with the help of a different optimizer, I repeat the entire model optimization procedure with the new optimizer. That is, I re-run all the models I've computed so far with the new optimizer and again conduct comparisons with anova(). I do this because, as far as I know, different optimizers may lead to differences in AICs and log-likelihood ratios for one and the same model, making comparisons between two models that use different optimizers problematic.
In my most recent analysis, I've increased the number of iterations with optCtrl=list(maxfun=100000) to avoid convergence errors. I'm now wondering whether this can also lead to differences in AIC/log-lik etc. for one and the same model? Is it equally problematic to compare two models that differ with regard to the inclusion of the optCtrl=list(maxfun=100000) argument?
I actually thought that increasing the number of iterations would simply lead to longer computation times (rather than different results), but I was unable to verify this online. Any hint/explanation is appreciated.
As far as I know, you should be fine. As long as the models were fit with the same number of observations you should be able to compare them using the AIC. Hopefully someone else can comment on the nuances of the computations of the AIC itself, but I just fit a bunch of models with the same formula and dataset and different number of max iterations, getting the AIC each time. It didn't change as a function of the iterations. The iterations are just the time the model fitting process can take to maximize the likelihood, which for complex models can be tricky. Once a model is fit, and has converged on an answer, the number of iterations shouldn't change anything about the model itself.
If you look at this question, the top answer explains the AIC quite well: https://stats.stackexchange.com/questions/232465/how-to-compare-models-on-the-basis-of-aic
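The point about iterations can be illustrated with a toy optimizer in plain Python. This is only a sketch under stated assumptions: the data are made up, the model is a normal mean with known unit variance, and the gradient-descent loop with its max_iter cap is a stand-in for, not a copy of, what glmer's optimizers do.

```python
import math

# Hypothetical observations; the model is Normal(mu, 1) with mu unknown.
data = [2.1, 1.9, 2.4, 1.6, 2.0]

def neg_log_lik(mu):
    n = len(data)
    return 0.5 * n * math.log(2 * math.pi) + 0.5 * sum((x - mu) ** 2 for x in data)

def fit(max_iter, lr=0.1, tol=1e-12):
    """Gradient descent on the negative log-likelihood with an iteration cap.
    Returns (estimate, iterations used, converged flag)."""
    mu = 0.0
    for i in range(max_iter):
        grad = -sum(x - mu for x in data)   # derivative of neg_log_lik w.r.t. mu
        step = lr * grad
        mu -= step
        if abs(step) < tol:                 # converged before hitting the cap
            return mu, i + 1, True
    return mu, max_iter, False

def aic(mu, k=1):
    # AIC = 2k - 2 * log-likelihood, with k estimated parameters
    return 2 * k + 2 * neg_log_lik(mu)

if __name__ == "__main__":
    for cap in (5, 1000, 100000):
        mu, used, converged = fit(cap)
        print(cap, used, converged, aic(mu))
```

In this sketch the caps of 1000 and 100000 stop at the same iteration and give the same AIC, because the cap is never reached. The cap of 5 stops early and gives a worse AIC, which is the situation a convergence warning flags.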

What's the difference between Stabilizer() block and enable_self_stabilization parameter?

When should I use one or another? Tutorials and examples use either Sequential([Stabilizer(), Recurrence(LSTM(hidden_dim))]) or LSTMP_component_with_self_stabilization from Examples/common/nn.py. I've tried replacing the former with Recurrence(LSTM(hidden_dim, enable_self_stabilization=True)) in the char_rnn.py example, but the results are significantly worse.
The Stabilizer layer multiplies its input with a learnable scalar. This simple trick has been shown to significantly improve convergence and stability. It has some similarity with BatchNormalization. Generally, when you can use BatchNormalization, you should try that first. Where that is not possible, which is specifically inside recurrent loops, I recommend using Stabilizer instead.
Normally, you must inject it explicitly in your model. A special case is the recurrent step functions (e.g. LSTM), which include Stabilizers inside. Use enable_self_stabilization=True to enable that. Those built-in Stabilizers only apply to internal variables. For the main input, you must insert a Stabilizer yourself.
If you include explicit Stabilizers but set enable_self_stabilization=False (e.g. as a default_option), then those explicit Stabilizers are no-ops.
It is not my experience that Stabilizer makes things worse. It is generally a sure-fire way to improve convergence. It does change numeric ranges, though. So if it makes convergence worse, I suggest experimenting with different hyper-parameter settings, e.g. reducing the learning rate.
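The "learnable scalar" idea described above can be pictured with a few lines of plain Python. This is a toy sketch, not CNTK code: the class name, the plain SGD update and the direct use of beta as the scale are assumptions, and CNTK's actual Stabilizer may parameterize the scalar differently.

```python
class ToyStabilizer:
    """Scales every input element by a single learnable scalar, beta."""

    def __init__(self, beta=1.0):
        self.beta = beta

    def forward(self, xs):
        return [self.beta * x for x in xs]

    def grad_beta(self, xs, upstream):
        # y_i = beta * x_i, so d(loss)/d(beta) = sum_i upstream_i * x_i
        return sum(g * x for g, x in zip(upstream, xs))

    def sgd_update(self, xs, upstream, lr=0.01):
        self.beta -= lr * self.grad_beta(xs, upstream)

if __name__ == "__main__":
    layer = ToyStabilizer(beta=2.0)
    print(layer.forward([1.0, -3.0]))
```

Because the scale changes the numeric range of everything downstream, it is plausible that a learning rate tuned without the scalar needs adjusting once it is added, which matches the advice above.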

What types of operations are permitted inside tf.cond

It seems that tf.cond(cond, fn1, fn2) executes possible dependencies for both branches, so any computation we would like to perform if and only if the condition holds has to be put into the function fn1 or fn2.
However I am confused as to what fn actually is. Every variable/op in tensorflow should be a node of the computation graph, but fn is actually a python function. This leads to many questions. For example, is this function re-evaluated every time sess.run is executed? Can this function return different computation graphs each time? Can placeholders be defined in them, and if not how to avoid supplying values to placeholders we know will not be used when, for example, there is a switch variable that chooses between different inputs?
The functions passed to tf.cond are only run when the op is defined, not during graph execution. And both of them are run, exactly once as far as I can see. The functions themselves are just a way to indicate exactly which ops should have the conditional execution behavior: note the context_t.Enter()/context_t.Exit() calls surrounding each function call.
Hopefully that clarifies things. The functions are a useful way of grouping ops during graph definition. There's no function execution magic going on in the TensorFlow graph.
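A toy model in plain Python may help picture the split between definition time and execution time. It is not TensorFlow's implementation: toy_cond, the call counter and the lambdas standing in for ops are all hypothetical names invented for this sketch.

```python
# Toy model of "functions run at definition time, ops run at execution time".
calls = {"fn1": 0, "fn2": 0}

def fn1():
    calls["fn1"] += 1            # runs while the "graph" is being built
    return lambda x: x + 1       # the returned callable stands in for the branch's ops

def fn2():
    calls["fn2"] += 1
    return lambda x: x * 10

def toy_cond(true_fn, false_fn):
    """Call both branch builders once, then return a node that picks
    one of the prebuilt branches each time it is run."""
    true_branch = true_fn()
    false_branch = false_fn()

    def node(pred, x):
        return true_branch(x) if pred else false_branch(x)

    return node

node = toy_cond(fn1, fn2)                                  # "graph definition"
results = [node(True, 1), node(False, 1), node(True, 5)]   # "graph execution"

if __name__ == "__main__":
    print(results, calls)
```

In this model each Python function is called exactly once no matter how often the node is run, and only the selected branch does any work per run.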

How to effectively use knn in Stata

I have two questions with executing discrim knn in Stata.
1) How do you properly code the command? I've tried various versions, but seem to always get an error that there are too many variables specified.
The vector with the correct result is buy.
I am trying: discrim knn buy, group(train test) k(1)
2) My understanding with KNN was that factor variables (binary) were fine for using KNN, even encouraged. However I get the error message that factor variables and time-series operators not allowed.
Lastly, though I know this isn't the best space for this question, should each vector be normalized for knn? I've heard conflicting responses.
I'm guessing that the error you're getting is
group(): too many variables specified
This is because you can only group by 1 variable with knn. knn performs discriminant analysis based on a single grouping variable, in your case, distinguishing the training from the test. I imagine your train and test variables are binary, in which case using only one of the variables is enough, as they are merely logical opposites of each other. A single variable has enough information to distinguish the two groups.