How to compute a per-class parameter from a mini-batch in TensorFlow?

I am starting to learn TensorFlow and I have a seemingly simple modeling question. Suppose I have a C-class problem and data arrives into TensorFlow in mini-batches containing B samples each. Each sample x is a D-dimensional vector that comes with its label y (non-negative integer between 0 and C-1). I want to estimate a class-specific parameter (for example the sample mean) for each class. The estimation takes place after each sample independently undergoes a TensorFlow-defined transformation pipeline. The per-class parameter/sample-mean is then utilized in the computation of other tensors.
Intuitively, I would group the samples in each mini-batch by label, sum-combine them, and add the total of each label group to the corresponding class parameter, with appropriate normalization.
How can I implement such a simple procedure (group by label, perform a per-group operation, then use the labels as indices for writing into a tensor) or an equivalent one, using TensorFlow? What TensorFlow operations do I need to learn about to achieve this? Is it advisable to do it outside TensorFlow?
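As a rough illustration of the procedure described above (plain Python, not a TensorFlow implementation): keep a running sum and count per class, add each mini-batch's per-label totals, and divide to get the mean. The names `ClassMeans`, `update` and `means` are hypothetical. Inside TensorFlow, the group-by-label-and-sum step is what `tf.math.unsorted_segment_sum(data, segment_ids, num_segments)` computes when the labels are passed as segment ids.

```python
class ClassMeans:
    """Running per-class mean of D-dimensional samples (illustrative only)."""

    def __init__(self, num_classes, dim):
        self.sums = [[0.0] * dim for _ in range(num_classes)]
        self.counts = [0] * num_classes

    def update(self, batch_x, batch_y):
        # Group by label and sum: each sample is added to its class's total.
        for x, y in zip(batch_x, batch_y):
            self.counts[y] += 1
            for d, value in enumerate(x):
                self.sums[y][d] += value

    def means(self):
        # Normalize by the per-class count; classes never seen stay at zero.
        return [
            [s / max(count, 1) for s in class_sum]
            for class_sum, count in zip(self.sums, self.counts)
        ]


stats = ClassMeans(num_classes=3, dim=2)
stats.update([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [0, 0, 2])  # first mini-batch
stats.update([[7.0, 8.0]], [2])                                # second mini-batch
class_means = stats.means()
```

Here class 0 averages two samples, class 1 receives none, and class 2 accumulates one sample from each mini-batch.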

Related

Multiple BERT binary classifications on a single graph to save on inference time

I have five classes and I want to compare four of them against one and the same class. This isn't a One vs Rest classifier, as for each output I want to score them against one base class.
The four outputs should be: base class vs classA, base class vs classB, etc.
I could do this by having multiple binary classification tasks, but that's wasting computation time if the first layers are BERT preprocessing + pretrained BERT layers, and the only differences between the four classifiers are the last few layers of BERT (finetuned ones) and the Dense layer.
So why not merge the graphs for more performance?
My inputs are four different datasets, each annotated with true/false for each class.
As I understand it, I can re-use most of the pipeline (BERT preprocessing and the first layers of BERT), as those have shared weights. I should then be able to train the last few layers of BERT and the Dense layer on top differently depending on the branch of the classifier (maybe using something like keras.switch?).
I have tried many alternative options including multi-class and multi-label classifiers, with actual and generated (eg, machine-annotated) labels in the case of multiple input labels, different activation and loss functions, but none of the results were acceptable to me (none were as good as the four separate models).
Is there a solution for merging the four different models for more performance, or am I stuck with using 4x binary classifiers?
A DNN trained for one specific task will, in the vast majority of cases, be better than a more general model that handles several tasks simultaneously. That said, in my experience a properly trained general model produces results very similar to those of the original binary ones. Anyway, here are a couple of suggestions for training strategies (assuming your training datasets for each task are completely different):
Weak supervision approach
Train your binary classifiers, and use each one to label the datasets it was not trained on (e.g. use the binary classifier trained on dataset 2 to label datasets 1, 3 and 4). Then train your joint model as a multilabel task on all the newly labeled datasets (don't forget to shuffle the samples before feeding them to the trainer ;) ). Here you will need to experiment with whether to threshold the scores into 0/1 labels or to use the binary classifiers' scores directly.
Create a custom loss function that does not penalize outputs for which no label is provided. So when you introduce a sample from (say) dataset 2, the loss is calculated only for the 2nd class.
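A minimal sketch of such a loss, in plain Python for readability rather than as a Keras loss: outputs without a label are skipped, so they contribute nothing to the loss. The function name `masked_bce` and the use of `None` to mark a missing label are assumptions made for this illustration.

```python
import math


def masked_bce(logits, labels):
    """Mean binary cross-entropy over the outputs that have a label.

    labels[i] is 0 or 1 when output i is annotated, or None when the
    sample's dataset carries no information about that output.
    """
    total, used = 0.0, 0
    for z, y in zip(logits, labels):
        if y is None:
            continue  # no penalty for outputs without a label
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid turns the logit into a probability
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        used += 1
    return total / used if used else 0.0


# A sample from dataset 2: only the second of the four outputs is labelled.
loss = masked_bce([3.0, 0.0, -2.0, 5.0], [None, 1, None, None])
```

In a real training loop the same idea is usually expressed with a 0/1 mask tensor multiplied into the per-output loss, so that it stays differentiable and batched.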
Of course you can apply both simultaneously. For example, if you know that the binary classifiers produce polarized scores (most results are near 0 or 1), you can use weak labels and automatically label your data with the scores. Then, during the second stage, weight the loss of each sample with weak-label score x by x' = 4(x-0.5)^2 (note that you get logits from the model, so you will need to apply the sigmoid function first). This way you increase the contribution of the samples the binary classifier is confident about and reduce that of the less certain ones.
As for unfreezing the last layers of BERT, the upper 3-6 layers are usually enough. Unfreezing more layers improves results very little and increases time and memory requirements.
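To make the weighting concrete, here is a small sketch of the factor 4(x-0.5)^2 evaluated at a few scores. The helper names `confidence_weight` and `weight_from_logit` are hypothetical, and using the factor as a per-sample multiplier on the loss is one reading of the suggestion above.

```python
import math


def confidence_weight(score):
    """Weight 4 * (score - 0.5)**2 for a weak-label score in [0, 1]."""
    return 4.0 * (score - 0.5) ** 2


def weight_from_logit(logit):
    """Apply the sigmoid first when the classifier returns a logit."""
    return confidence_weight(1.0 / (1.0 + math.exp(-logit)))


# Scores near 0 or 1 get a weight close to 1; an undecided 0.5 gets weight 0.
weights = [confidence_weight(s) for s in (0.0, 0.5, 0.9, 1.0)]
```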

Is it possible to integrate Levenberg-Marquardt optimizer from Tensorflow Graphics with a Tensorflow 2.0 model?

I have a Tensorflow 2.0 tf.keras.Sequential model. Now, my technical specification prescribes using the Levenberg-Marquardt optimizer to fit the model. Tensorflow 2.0 doesn't provide it as an optimizer out of the box, but it is available in the Tensorflow Graphics module.
The tfg.math.optimizer.levenberg_marquardt.minimize function accepts residuals (a residual is a Python callable returning a tensor) and variables (a list of tensors corresponding to my model weights) as parameters.
What would be the best way to convert my model into residuals and variables?
If I understand correctly how the minimize function works, I have to provide two residuals. The first residual must call my model for every learning case and aggregate all the results into a tensor. The second residual must return all labels as a single constant tensor. The problem is that the tf.keras.Sequential.predict function returns a numpy array instead of a tensor. I believe that if I convert it to a tensor, the minimizer won't be able to calculate Jacobians with respect to the variables.
The same problem is with variables. It doesn't seem like there's a way to extract all weights from a model into a list of tensors.
There's a major difference between tfg.math.optimizer.levenberg_marquardt.minimize and Keras optimizers from the implementation/API perspective.
Keras optimizers, such as tf.keras.optimizers.Adam, consume gradients as input and update tf.Variables.
In contrast, tfg.math.optimizer.levenberg_marquardt.minimize essentially unrolls the optimization loop in graph mode (using a tf.while_loop construct). It takes initial parameter values and produces updated parameter values, unlike Adam & co, which only apply one iteration and actually change the values of tf.Variables via assign_add.
Stepping back a bit to the theoretical big picture, Levenberg-Marquardt is not a general gradient descent-like solver for any nonlinear optimization problem (as Adam is). It specifically addresses nonlinear least-squares optimization, so it's not a drop-in replacement for optimizers like Adam. In gradient descent, we compute the gradient of the loss with respect to the parameters. In Levenberg-Marquardt, we compute the Jacobian of the residuals with respect to the parameters. Concretely, it repeatedly solves the linearized problem Jacobian @ delta_params = residuals for delta_params using tf.linalg.lstsq (which internally uses Cholesky decomposition on the Gram matrix computed from the Jacobian) and applies delta_params as the update.
Note that this lstsq operation has cubic complexity in the number of parameters, so in case of neural nets it can only be applied for fairly small ones.
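Written out, one damped least-squares step looks roughly as follows. The symbols are introduced here only for illustration: r(θ) is the residual vector, J its Jacobian with respect to the parameters θ, and λ ≥ 0 the damping (regularization) term. The sign convention of subtracting the update is an assumption that matches solving J δ = r.

```latex
\begin{aligned}
\delta &= \arg\min_{\delta}\ \lVert J\,\delta - r(\theta)\rVert_2^2 + \lambda\,\lVert \delta\rVert_2^2
        = \left(J^\top J + \lambda I\right)^{-1} J^\top r(\theta), \\
\theta &\leftarrow \theta - \delta .
\end{aligned}
```

With P parameters, the Gram matrix JᵀJ is P x P, and factorizing it is where the cubic cost in the number of parameters comes from.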
Also note that Levenberg-Marquardt is usually applied as a batch algorithm, not a minibatch algorithm like SGD, though there's nothing stopping you from applying the LM iteration on different minibatches in each iteration.
I think you may only be able to get one iteration out of tfg's LM algorithm, through something like
from tensorflow_graphics.math.optimizer.levenberg_marquardt import minimize as lm_minimize

for input_batch, target_batch in dataset:

    # minimize() calls each residual with the variables unpacked as positional arguments.
    def residual_fn(*trainable_params):
        # trainable_params is deliberately unused: it still equals the variables'
        # current values, since we only do one Levenberg-Marquardt iteration per call.
        return model(input_batch) - target_batch

    new_objective_value, new_params = lm_minimize(
        residual_fn, model.trainable_variables, max_iterations=1)
    for var, new_param in zip(model.trainable_variables, new_params):
        var.assign(new_param)
In contrast, I believe the following naive method will not work where we assign model parameters before computing the residuals:
from tensorflow_graphics.math.optimizer.levenberg_marquardt import minimize as lm_minimize

dataset_iterator = ...

def residual_fn(*params):
    input_batch, target_batch = next(dataset_iterator)
    # Writing the candidate parameters into the variables cuts the gradient path.
    for var, param in zip(model.trainable_variables, params):
        var.assign(param)
    return model(input_batch) - target_batch

final_objective, final_params = lm_minimize(
    residual_fn, model.trainable_variables, max_iterations=10000)
for var, final_param in zip(model.trainable_variables, final_params):
    var.assign(final_param)
The main conceptual problem is that residual_fn's output has no gradients wrt its input params, since this dependency goes through a tf.assign. But it might even fail before that due to using constructs that are disallowed in graph mode.
Overall I believe it's best to write your own LM optimizer that works on tf.Variables, since tfg.math.optimizer.levenberg_marquardt.minimize has a very different API that is not really suited for optimizing Keras model parameters since you can't directly compute model(input, parameters) - target_value without a tf.assign.

how does tensorflow calculate gradients *efficiently* from input to loss?

To calculate the derivative of an output layer of size N w.r.t an input of size M, we need a Jacobian matrix of size M x N. To calculate a complete gradient from loss to inputs using the chain rule, we would need a large number of such Jacobians stored in memory.
I assume that tensorflow does not calculate a complete Jacobian matrix for each step of the graph, but does something more efficient. How does it do it?
Thanks
TensorFlow uses Automatic Differentiation to compute gradients efficiently. Concretely, it defines a computation graph in which nodes are operations and each directed edge represents the partial derivative of a child with respect to its parent. The total derivative of an operation f with respect to x is then given by the sum over all path values from x to f, where each path value is the product of the partial derivatives of the operations on the edges.
More specifically, TensorFlow uses reverse differentiation, which involves a forward pass to compute the value of each node in the computation graph, and a backward pass to compute the partial derivative of the function f that we are differentiating with respect to every node in the graph. We need to repeat the backward pass for each dimension of function f, so the computational complexity is O(dim(f))*O(f), where dim(f) is the output dimensionality of function f.
Although this approach is memory intensive (it requires storing the values of all the nodes before running the backward pass), it is very efficient for machine learning, where we typically have a scalar function f (i.e. dim(f)=1).
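The forward and backward passes can be illustrated with a toy reverse-mode implementation in plain Python. This is a sketch of the general technique, not TensorFlow's actual code; the `Node` class and the helper functions are hypothetical names. Each node stores its value from the forward pass together with the local partial derivatives on its incoming edges, and a single backward sweep accumulates the derivative of the scalar output with respect to every node.

```python
import math


class Node:
    """A scalar value in a computation graph, with edges to its parents."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent_node, d(self)/d(parent))
        self.grad = 0.0


def add(a, b):
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))


def mul(a, b):
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))


def sin(a):
    return Node(math.sin(a.value), ((a, math.cos(a.value)),))


def backward(output):
    """Accumulate d(output)/d(node) into node.grad for every node."""
    # Order the graph so that each node is processed after all of its consumers.
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local_grad in node.parents:
            parent.grad += node.grad * local_grad  # chain rule, summed over paths


x, y = Node(2.0), Node(3.0)
f = add(mul(x, y), sin(x))  # forward pass: f = x*y + sin(x)
backward(f)                 # one backward pass yields df/dx and df/dy
```

After the call, `x.grad` holds y + cos(x) and `y.grad` holds x. No Jacobian matrix is ever materialized; only one number per node is carried backwards.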
You might find this resource useful.

Tensorflow: intercept gradients of arbitrary node in the computational graph (not necessarily loss)

I would like to intercept gradients that are backpropagated in my Tensorflow graph, which are not based on the loss (∂L/∂w), but based on some other node in the graph, for example the class scores (∂s/∂w) in a classification problem or some activation (∂a/∂w) to see how it changes when certain weights w change.
How can one implement this efficiently in Tensorflow? Intuitively, the gradients should already all be there for backprop of the loss as intermediate results, so there should be a solution without a big overhead.
I am already aware of the following suggestions, which don't exactly solve the problem:
The Tensorflow method tf.gradients(ys, xs), which computes the gradient of every y in ys w.r.t. every x in xs, but then, for every x in xs, sums over all y. Applying this function to every y in ys separately, however, induces a large computational overhead.
This stackoverflow post, which asks this question for the derivative of the loss w.r.t. some parameters, i.e. ∂L/∂w.
The part of the documentation, which proposes to call optimizer.compute_gradients() as an easy to use 'wrapper' around tf.gradients(). However, calling this function for every variable of interest introduces again a large computational overhead.
Update: Phrased differently, what I want is the Jacobian of any component of the computational graph w.r.t. any other. This topic has been touched in this recent Tensorflow issue, but is described as currently not being efficiently/conveniently implemented therein.

Seq2Seq for prediction of complex states

My problem:
I have a sequence of complex states and I want to predict the future states.
Input:
I have a sequence of states. Each sequence can be of variable length. Each state is a moment in time and is described by several attributes: [att1, att2, ...], where each attribute is a number within an interval: [[0..5], [1..3651], ...].
The Seq2Seq example (and paper) assumes that each state (word) is taken from a dictionary, so each state has around 80,000 possibilities. But how would you represent each state when it is taken from a set of vectors, where the set is every possible combination of the attributes?
Is there any method to work with more complex states in TensorFlow? Also, what is a good method to decide the boundaries of your buckets when the relation between input length and output length is unclear?
May I suggest a rephrasing and splitting of your question into two parts? The first is really a general machine learning/LSTM question that's independent of tensorflow: How to use an LSTM to predict when the sequence elements are general vectors, and the second is how to represent this in tensorflow. For the former - there's nothing really magical to do there.
But a very quick answer: You've really just skipped the embedding lookup part of seq2seq. You can feed dense tensors into a suitably modified version of it -- your state is just a dense vector representation of the state. That's the same thing that comes out of an embedding lookup.
The vector representation tutorial discusses the preprocessing that turns, e.g., words into embeddings for use in later parts of the learning pipeline.
If you look at line 139 of seq2seq.py you'll see that the embedding_rnn_decoder takes in a 1D batch of things to decide (the dimension is elements in the batch), but then uses the embedding lookup to turn it into a batch_size * cell.input_size tensor. You want to directly input a batch_size * cell.input_size tensor into the RNN, skipping the embedding step.
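The point about shapes can be illustrated with a small plain-Python sketch. All sizes and values below are hypothetical and chosen only for illustration. An embedding lookup maps a 1D batch of ids to a batch_size x input_size matrix, and a batch of attribute vectors already has that form, so it can be fed to the RNN directly.

```python
# Hypothetical sizes, chosen only for illustration.
vocab_size, input_size = 4, 3

# Standard seq2seq: integer ids are turned into vectors by a table lookup.
embedding_table = [
    [float(10 * i + j) for j in range(input_size)] for i in range(vocab_size)
]
token_ids = [2, 0]                                  # shape: [batch_size]
embedded = [embedding_table[i] for i in token_ids]  # shape: [batch_size, input_size]

# Attribute-vector states: each state already is a vector, so no lookup is needed.
states = [[3.0, 1200.0, 0.5], [5.0, 17.0, 0.1]]     # shape: [batch_size, input_size]


def shape(batch):
    """Return (rows, columns) of a rectangular list of lists."""
    return (len(batch), len(batch[0]))
```

Both `embedded` and `states` have shape (2, 3) here, which is all the RNN cell sees.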