Implementing backprop in numpy

I am trying to implement backprop in numpy by defining a function that performs some kind of operation given an input, weight matrix and bias, and returns the output together with a backward function, which can be used to update the weights.
This is my current code; however, I think there are some bugs in the derivation, as the gradients for the W1 matrix are too large. A PyTorch implementation of the same thing is available as a reference.
Any help is appreciated.
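The question's code is not reproduced here, so the following is only a minimal sketch of the pattern it describes, assuming a single ReLU layer and a mean-squared-error loss (both are illustrative choices, as are the names linear_relu, loss_fn and num_grad_W). It compares the analytic gradient for W against central finite differences, which is one way to localize a derivation bug. One common cause of oversized gradients is omitting the 1/N factor that comes from averaging the loss, though that may not be the issue in the original code.

```python
import numpy as np

def linear_relu(x, W, b):
    """Forward pass of relu(x @ W + b); returns the output and a backward closure."""
    z = x @ W + b              # pre-activation, shape (batch, out_features)
    out = np.maximum(z, 0.0)   # ReLU (assumed activation, for illustration only)

    def backward(grad_out):
        # grad_out is dLoss/dout and has the same shape as out.
        grad_z = grad_out * (z > 0)   # ReLU passes gradient only where z > 0
        grad_W = x.T @ grad_z         # shape (in_features, out_features)
        grad_b = grad_z.sum(axis=0)   # bias is broadcast over the batch
        grad_x = grad_z @ W.T         # gradient handed to the previous layer
        return grad_x, grad_W, grad_b

    return out, backward

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
b = rng.normal(size=2)
target = rng.normal(size=(4, 2))

def loss_fn(W_):
    out_, _ = linear_relu(x, W_, b)
    return 0.5 * np.mean((out_ - target) ** 2)

out, backward = linear_relu(x, W, b)
# Gradient of the mean-based loss w.r.t. out: note the division by out.size.
grad_out = (out - target) / out.size
_, grad_W, _ = backward(grad_out)

# Central finite differences as an independent check on grad_W.
eps = 1e-6
num_grad_W = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        num_grad_W[i, j] = (loss_fn(W_plus) - loss_fn(W_minus)) / (2 * eps)

max_abs_diff = np.max(np.abs(grad_W - num_grad_W))
print("max |analytic - numerical| =", max_abs_diff)
```

If the two gradients disagree by a roughly constant factor, the scaling of grad_out is the first place to look; if they disagree only in some entries, the activation mask or a transpose is a more likely suspect.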

Related

Is it possible to integrate Levenberg-Marquardt optimizer from Tensorflow Graphics with a Tensorflow 2.0 model?

I have a Tensorflow 2.0 tf.keras.Sequential model. Now, my technical specification prescribes using the Levenberg-Marquardt optimizer to fit the model. Tensorflow 2.0 doesn't provide it as an optimizer out of the box, but it is available in the Tensorflow Graphics module.
The tfg.math.optimizer.levenberg_marquardt.minimize function accepts residuals (a residual is a Python callable returning a tensor) and variables (a list of tensors corresponding to my model weights) as parameters.
What would be the best way to convert my model into residuals and variables?
If I understand correctly how the minimize function works, I have to provide two residuals. The first residual must call my model for every learning case and aggregate all the results into a tensor. The second residual must return all labels as a single constant tensor. The problem is that the tf.keras.Sequential.predict function returns a numpy array instead of a tensor. I believe that if I convert it to a tensor, the minimizer won't be able to calculate Jacobians with respect to the variables.
The same problem is with variables. It doesn't seem like there's a way to extract all weights from a model into a list of tensors.
There's a major difference between tfg.math.optimizer.levenberg_marquardt.minimize and Keras optimizers from the implementation/API perspective.
Keras optimizers, such as tf.keras.optimizers.Adam, consume gradients as input and update tf.Variables.
In contrast, tfg.math.optimizer.levenberg_marquardt.minimize essentially unrolls the optimization loop in graph mode (using a tf.while_loop construct). It takes initial parameter values and produces updated parameter values, unlike Adam & co, which only apply one iteration and actually change the values of tf.Variables via assign_add.
Stepping back a bit to the theoretical big picture, Levenberg-Marquardt is not a general gradient descent-like solver for any nonlinear optimization problem (as Adam is). It specifically addresses nonlinear least-squares optimization, so it's not a drop-in replacement for optimizers like Adam. In gradient descent, we compute the gradient of the loss with respect to the parameters. In Levenberg-Marquardt, we compute the Jacobian of the residuals with respect to the parameters. Concretely, it repeatedly solves the linearized problem Jacobian @ delta_params = residuals for delta_params using tf.linalg.lstsq (which internally uses Cholesky decomposition on the Gram matrix computed from the Jacobian) and applies delta_params as the update.
Note that this lstsq operation has cubic complexity in the number of parameters, so in case of neural nets it can only be applied for fairly small ones.
Also note that Levenberg-Marquardt is usually applied as a batch algorithm, not a minibatch algorithm like SGD, though there's nothing stopping you from applying the LM iteration on different minibatches in each iteration.
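To make the iteration described above concrete, here is a rough NumPy sketch, not tfg's implementation. It assumes a toy two-parameter model y = a * exp(b * t) with a hand-written Jacobian, and a simple accept/reject rule for the damping term; the names residuals, jacobian and levenberg_marquardt are made up for this example.

```python
import numpy as np

def residuals(params, t, y):
    a, b = params
    return a * np.exp(b * t) - y              # model(inputs) - targets

def jacobian(params, t):
    a, b = params
    e = np.exp(b * t)
    return np.stack([e, a * t * e], axis=1)   # shape (n_points, n_params)

def levenberg_marquardt(params, t, y, n_iter=50, damping=1e-3):
    params = np.asarray(params, dtype=float)
    for _ in range(n_iter):
        r = residuals(params, t, y)
        J = jacobian(params, t)
        # Damped normal equations: (J^T J + damping * I) delta = J^T r.
        # Solving this system is cubic in the number of parameters.
        gram = J.T @ J + damping * np.eye(params.size)
        delta = np.linalg.solve(gram, J.T @ r)
        candidate = params - delta
        if np.sum(residuals(candidate, t, y) ** 2) < np.sum(r ** 2):
            # Step reduced the squared error: accept it and damp less.
            params, damping = candidate, damping / 10.0
        else:
            # Step made things worse: reject it and damp more.
            damping *= 10.0
    return params

t = np.linspace(0.0, 1.0, 20)
true_params = np.array([2.0, -1.5])
y = true_params[0] * np.exp(true_params[1] * t)   # noise-free synthetic data
fitted = levenberg_marquardt([1.0, 0.0], t, y)
print("fitted parameters:", fitted)
```

The whole dataset is used in every iteration here, which matches the batch usage mentioned above; the Gram matrix is n_params by n_params, which is why the approach only scales to small models.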
I think you may only be able to get one iteration out of tfg's LM algorithm, through something like
from tensorflow_graphics.math.optimizer.levenberg_marquardt import minimize as lm_minimize
for input_batch, target_batch in dataset:
    def residual_fn(trainable_params):
        # do not use trainable params, it will still be at its initial value, since we only do one iteration of Levenberg Marquardt each time.
        return model(input_batch) - target_batch
    new_objective_value, new_params = lm_minimize(residual_fn, model.trainable_variables, max_iter=1)
    for var, new_param in zip(model.trainable_variables, new_params):
        var.assign(new_param)
In contrast, I believe the following naive method will not work where we assign model parameters before computing the residuals:
from tensorflow_graphics.math.optimizer.levenberg_marquardt import minimize as lm_minimize
dataset_iterator = ...
def residual_fn(params):
    input_batch, target_batch = next(dataset_iterator)
    for var, param in zip(model.trainable_variables, params):
        var.assign(param)
    return model(input_batch) - target_batch
final_objective, final_params = lm_minimize(residual_fn, model.trainable_variables, max_iter=10000)
for var, final_param in zip(model.trainable_variables, final_params):
    var.assign(final_param)
The main conceptual problem is that residual_fn's output has no gradients wrt its input params, since this dependency goes through a tf.assign. But it might even fail before that due to using constructs that are disallowed in graph mode.
Overall I believe it's best to write your own LM optimizer that works on tf.Variables, since tfg.math.optimizer.levenberg_marquardt.minimize has a very different API that is not really suited for optimizing Keras model parameters since you can't directly compute model(input, parameters) - target_value without a tf.assign.

How to define a loss function that needs a numpy array (not a tensor) as input when building a TensorFlow graph?

I want to add a constraint term to my loss function. The definition of this constraint requires a numpy array as input, so I cannot define it as a tensor (a graph node) in TensorFlow. How can I define this part in the graph so that it takes part in the network optimization?
Operations done on numpy arrays cannot be automatically differentiated in TensorFlow. Since you are using this computation as part of the loss computation, I assume you want to differentiate it. In this case, your best option is probably to reimplement the constraint in TensorFlow. The only other approach I can think of is to use autograd in conjunction with TF. This seems possible (something along the lines of: evaluate part of the graph with TF, get numpy arrays out, call your function under autograd, get gradients, feed them back into TF) but will likely be harder and slower.
If you are reimplementing it in TF, most numpy operations have direct one-to-one counterparts in TF. If the implementation uses a lot of control flow (which can be painful in classic TF), you can use eager execution or py_func.

LSTM calculation with Vanilla Numpy

I've compared LSTM result with Keras/Tensorflow calculation and Numpy calculation. However, the result is slightly different:
Numpy: [[ 0.16315128 -0.04277606 0.26504123 0.08014129 0.38561829]]
Keras: [[ 0.16836338 -0.04930305 0.25080156 0.08938988 0.3537751 ]]
Keras' LSTM implementation does not use tf.contrib.rnn but Keras directly manages the parameters, and tf.matmul is used to calculate. I found the corresponding implementation of Keras and tried the same calculation with Numpy, but the values are slightly different as shown above.
I have checked the formulas several times and they seem to be the same. The only difference is the use of tf.matmul versus np.dot. Maybe there are some differences in how floating-point arithmetic is handled. Even so, I think the results differ too much: the biggest difference is about 10%. I'd like to match the Numpy calculation with the TensorFlow calculation. If someone could give me a hint or point me to the right implementation, I'd really appreciate it.
The Keras implementation and the Numpy code I implemented myself:
Keras: https://github.com/keras-team/keras/blob/master/keras/layers/recurrent.py#L1921-L1948
Numpy: https://github.com/likejazz/jupyter-notebooks/blob/master/deep-learning/lstm-keras-inspect.py
The default value of recurrent_activation is 'hard_sigmoid' for Keras LSTM layer. However, the original sigmoid function is used in your NumPy implementation.
So you can either change the recurrent_activation argument to 'sigmoid',
model.add(LSTM(5, input_shape=(8, 3), recurrent_activation='sigmoid'))
or use the "hard" sigmoid function in your NumPy code.
def hard_sigmoid(x):
    return np.clip(0.2 * x + 0.5, 0, 1)
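As a quick illustration of why the choice of gate activation matters, the sketch below evaluates both functions on a small grid. The hard_sigmoid definition is the one quoted in the answer; the grid and the variable names are arbitrary choices for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hard_sigmoid(x):
    # Piecewise-linear approximation quoted in the answer above.
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)

x = np.linspace(-4.0, 4.0, 9)   # -4, -3, ..., 4
gap = np.abs(sigmoid(x) - hard_sigmoid(x))

for xi, s, h in zip(x, sigmoid(x), hard_sigmoid(x)):
    print(f"x={xi:+.1f}  sigmoid={s:.4f}  hard_sigmoid={h:.4f}")
print("largest gap on this grid:", gap.max())
```

The two functions agree at x = 0 and differ by a few hundredths elsewhere on this grid. Because the gates are evaluated at every timestep, such differences can plausibly accumulate into the discrepancy reported in the question.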

How to wrap a custom TensorFlow loss function in Keras?

This is my third attempt to get a deep learning project off the ground. I'm working with protein sequences. First I tried TFLearn, then raw TensorFlow, and now I'm trying Keras.
The previous two attempts taught me a lot, and gave me some code and concepts that I can re-use. However there has always been an obstacle, and I've asked questions that the developers can't answer (in the case of TFLearn), or I've simply gotten bogged down (TensorFlow object introspection is tedious).
I have written this TensorFlow loss function, and I know it works:
def l2_angle_distance(pred, tgt):
    with tf.name_scope("L2AngleDistance"):
        # Scaling factor
        count = tgt[...,0,0]
        scale = tf.to_float(tf.count_nonzero(tf.is_finite(count)))
        # Mask NaN in tgt
        tgt = tf.where(tf.is_nan(tgt), pred, tgt)
        # Calculate L1 losses
        losses = tf.losses.cosine_distance(pred, tgt, -1, reduction=tf.losses.Reduction.NONE)
        # Square the losses, then sum, to get L2 scalar loss.
        # Divide the loss result by the scaling factor.
        return tf.reduce_sum(losses * losses) / scale
My target values (tgt) can include NaN, because my protein sequences are passed in a 4D Tensor, despite the fact that the individual sequences differ in length. Before you ask, the data can't be resampled like an image. So I use NaN in the tgt Tensor to indicate "no prediction needed here." Before I calculate the L2 cosine loss, I replace every NaN with the matching values in the prediction (pred) so the loss for every NaN is always zero.
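The masking trick can be checked outside TensorFlow. The following NumPy sketch is a simplified stand-in for the function above, not a port of it: it assumes 2-D arrays of unit-length vectors instead of the 4-D tensors in the question, and it computes the cosine distance as 1 minus the dot product along the last axis. The names masked_cosine_l2 and unit are hypothetical.

```python
import numpy as np

def masked_cosine_l2(pred, tgt):
    # Number of positions that carry a real target (checked on one component).
    scale = np.count_nonzero(np.isfinite(tgt[..., 0]))
    # Where the target is NaN, substitute the prediction.
    filled = np.where(np.isnan(tgt), pred, tgt)
    # Cosine distance for unit vectors: 1 - dot product along the last axis.
    losses = 1.0 - np.sum(pred * filled, axis=-1)
    # Square, sum, and divide by the number of real targets.
    return np.sum(losses ** 2) / scale

def unit(v):
    # Normalize each row to unit length.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

pred = unit(np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]))
tgt = unit(np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 0.0]]))
tgt_padded = tgt.copy()
tgt_padded[2] = np.nan   # mark the third position as "no prediction needed"

loss_full = masked_cosine_l2(pred, tgt)           # third position contributes
loss_masked = masked_cosine_l2(pred, tgt_padded)  # third position is ignored
print(loss_full, loss_masked)
```

Note that substituting pred for a NaN target yields a zero loss only when pred has unit length, because the distance is 1 minus a plain dot product; with unnormalized predictions the masked positions would still contribute.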
Now, how can I re-use this function in Keras? It appears that the Keras Lambda core layer is not a good choice, because a Lambda only takes a single argument, and a loss function needs two arguments.
Alternately, can I rewrite this function in Keras? I shouldn't ever need to use the Theano or CNTK backend, so it isn't necessary for me to rewrite my function in Keras. I'll use whatever works.
I just looked at the Keras losses.py file to get some clues. I imported keras.backend and had a look around. I also found https://keras.io/backend/. I don't seem to find wrappers for ANY of the TensorFlow function calls I happen to use: to_float(), count_nonzero(), is_finite(), where(), is_nan(), cosine_distance(), or reduce_sum().
Thanks for your suggestions!
I answered my own question. I'm posting the solution for anyone who may come across this same problem.
I tried using my TF loss function directly in Keras, as was independently suggested by Matias Valdenegro. Keras raised no errors when I did so; however, the loss value immediately went to NaN.
Eventually I identified the problem. The calling convention for a Keras loss function is first y_true (which I called tgt), then y_pred (my pred). But the calling convention for a TensorFlow loss function is pred first, then tgt. So if you want to keep a Tensorflow-native version of the loss function around, this fix works:
def keras_l2_angle_distance(tgt, pred):
    return l2_angle_distance(pred, tgt)
<snip>
model.compile(loss = keras_l2_angle_distance, optimizer = "something")
Maybe Theano or CNTK uses the same parameter order as Keras, I don't know. But I'm back in business.
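The effect of the swapped argument order can be reproduced without Keras. In this NumPy sketch a plain squared error stands in for the cosine loss, and tf_style_loss and keras_style_loss are hypothetical names; the point is only that the NaN mask ends up applied to the wrong array when the arguments arrive in (y_true, y_pred) order.

```python
import numpy as np

def tf_style_loss(pred, tgt):
    # Stand-in for a loss written in (pred, tgt) order that masks NaN targets.
    filled = np.where(np.isnan(tgt), pred, tgt)
    return np.sum((pred - filled) ** 2)

def keras_style_loss(y_true, y_pred):
    # Keras calls loss(y_true, y_pred), so swap the arguments before delegating.
    return tf_style_loss(y_pred, y_true)

y_true = np.array([1.0, np.nan, 3.0])   # NaN marks "no target here"
y_pred = np.array([1.5, 2.0, 2.0])

wrong = tf_style_loss(y_true, y_pred)     # unwrapped: the mask looks at y_pred
right = keras_style_loss(y_true, y_pred)  # wrapped: the mask looks at y_true
print("unwrapped:", wrong, " wrapped:", right)
```

In the unwrapped call the NaN in y_true is never masked and propagates into the sum, which is consistent with the loss going to NaN as described above.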
You don't need to use keras.backend: since your loss is written directly in TensorFlow, you can use it directly in Keras. The backend functions are an abstraction layer so that you can code a loss/layer that will work with the multiple available backends in Keras.
You just have to put your loss in the model.compile call:
model.compile(loss = l2_angle_distance, optimizer = "something")

Custom external loss metric for Gradient Optimizer?

I have an external function which takes y and y_prediction (in matrix format), and computes a metric which depicts how good or bad the prediction actually is.
Unfortunately the metric is no simple y - ypred or confusion matrix, but still very useful and important. How can I use this number computed for the loss or as an argument for optimizer.minimize?
If I understood correctly, I think there are two ways to do this:
Either the loss you want to compute can be written as TensorFlow ops whose gradients are defined (for example, SVD sadly has no gradient defined in the TensorFlow library), in which case the optimization is direct.
Or you can always write your loss function with numpy operators and use tf.py_func() https://www.tensorflow.org/api_docs/python/tf/py_func, and then you have to specify the gradient by hand, as explained here: How to make a custom activation function with only Python in Tensorflow?
But you have to know an explicit formula for your gradient ...