Learning a Categorical Variable with TensorFlow Probability - tensorflow2.0

I would like to use TFP to write a neural network where the output are the probabilities of a categorical variable with 3 classes, and train it using the negative log-likelihood.
As I'm moving my first steps with TF and TFP, I started with a toy model where the input layer has only 1 unit receiving a null input, and the output layer has 3 units with softmax activation function. The idea is that the biases should learn (up to an additive constant) the log of the probabilities.
Here below is my code, true_p are the true parameters I use to generate the data and I would like to learn, while learned_p is what I get from the NN.
import numpy as np
import tensorflow as tf
from tensorflow import keras
from functions import nll
from tensorflow.keras.optimizers import SGD
import tensorflow.keras.layers as layers
import tensorflow_probability as tfp
tfd = tfp.distributions
# params
true_p = np.array([0.1, 0.7, 0.2])
n_train = 1000
# training data
x_train = np.array(np.zeros(n_train)).reshape((n_train,))
y_train = np.array(np.random.choice(len(true_p), size=n_train, p=true_p)).reshape((n_train,))
# model
input_layer = layers.Input(shape=(1,))
p_layer = layers.Dense(len(true_p), activation=tf.nn.softmax)(input_layer)
p_y = tfp.layers.DistributionLambda(tfd.Categorical)(p_layer)
model_p = keras.models.Model(inputs=input_layer, outputs=p_y)
model_p.compile(SGD(), loss=nll)
# training
hist_p = model_p.fit(x=x_train, y=y_train, batch_size=100, epochs=3000, verbose=0)
# check result
learned_p = np.round(model_p.layers[1].call(tf.constant([0], shape=(1, 1))).numpy(), 3)
learned_p
With this setup, I get the result:
>>> learned_p
array([[0.005, 0.989, 0.006]], dtype=float32)
I over-estimate the second category, and can't really distinguish between the first and the third one. What's worse, if I plot the probabilities at the end of each epoch, it looks like they are converging monotonically to the vector [0, 1, 0], which doesn't make sense (it seems to me the gradient should push in the opposite direction once I start to over-estimate).
I really can't figure out what's going on here, but have the feeling I'm doing something plain wrong. Any idea? Thank you for your help!
For the record, I also tried other optimizers like Adam or Adagrad and played with the hyper-params, but with no luck.
I'm using Python 3.7.9, TensorFlow 2.3.1 and TensorFlow Probability 0.11.1.

I believe the default argument to Categorical is not the vector of probabilities, but the vector of logits (values you'd take softmax of to get probabilities). This is to help maintain precision in internal Categorical computations like log_prob. I think you can simply eliminate the softmax activation function and it should work. Please update if it doesn't!
EDIT: alternatively you can replace the tfd.Categorical with
lambda p: tfd.Categorical(probs=p)
but you'll lose the aforementioned precision gains. Just wanted to clarify that passing probs is an option, just not the default.
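For concreteness, here is a minimal sketch of both variants, reusing the layer names from the question (untested here, so please check it against your setup):
# Option 1: drop the softmax so the Dense output is interpreted as logits (keeps the precision benefit)
logit_layer = layers.Dense(len(true_p))(input_layer)  # no activation
p_y = tfp.layers.DistributionLambda(tfd.Categorical)(logit_layer)
# Option 2: keep the softmax and pass its output explicitly as probabilities
p_layer = layers.Dense(len(true_p), activation=tf.nn.softmax)(input_layer)
p_y = tfp.layers.DistributionLambda(lambda p: tfd.Categorical(probs=p))(p_layer)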

Related

LSTM training error is very high and relatively unchanging

As a learning exercise, I'm trying to use an LSTM model with the Keras framework to predict the stock market based on multiple data points. The size of my input array is roughly [5000, 100]. Based on other questions on this site and articles online, the approach seems fairly standard: put the data in a numpy array, scale it, reshape it to 3 dimensions for the LSTM, split it into train and test sections, and feed it through the model. Running only the training portion of the model, I am consistently getting loss scores around 400,000,000. This is not changed by altering the batch size, the number of epochs, the number of layers, replacing the normalization with dropout layers, changing the sizes of each layer, or using different optimizers and loss functions. Any idea why the loss is so high and what I can do to fix that? Attached is the code. All advice is greatly appreciated.
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, losses, optimizers, Model, preprocessing
from keras.utils import plot_model
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
scaler = MinMaxScaler(feature_range=(0, 1))
features_df = pd.read_csv("dataset.csv")
features_np = np.array(features_df)
features_np.astype(np.float64)
scaler.fit_transform(features_np)
num_features=features_np.shape[1]
features = np.reshape(features_np, (features_np.shape[0], 1, features_np.shape[1]))
labels_np = np.array(pd.read_csv("output.csv"))
scaler.fit_transform(labels_np)
test_in = features_np[int(features_np.shape[0] * 0.75):]
test_in = np.reshape(test_in, (test_in.shape[0], 1, test_in.shape[1]))
test_out = labels_np[int(labels_np.shape[0] * 0.75):]
test_out = np.reshape(test_out, (test_out.shape[0], 1, test_out.shape[1]))
inputs = layers.Input(shape=(1, features.shape[2]))
x = layers.LSTM(5000, return_sequences=True)(inputs)
lstm1 = layers.LSTM(1000, return_sequences=True)(x)
norm1 = layers.BatchNormalization()(lstm1)
lstm2 = layers.LSTM(1000, return_sequences=True)(norm1)
lstm3 = layers.LSTM(1000, return_sequences=True)(lstm2)
norm2 = layers.BatchNormalization()(lstm3)
lstm4 = layers.LSTM(1000, return_sequences=True)(norm2)
lstm5 = layers.LSTM(1000)(lstm4)
dense1 = layers.Dense(1000, activation='relu')(lstm5)
dense2 = layers.Dense(1000, activation='sigmoid')(dense1)
outputs = layers.Dense(2)(dense2)
model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(features, labels_np, epochs=1, batch_size=4)
evaluate = model.evaluate(test_in, test_out, verbose=2)
While I have not solved the error, implementing the Sequential() model and using only two LSTM layers and a Dense layer changed the error: the training error is now very low while testing remains high. This now appears to be a (relatively) simple problem of overfitting rather than the more confusing error of high training loss. Hopefully, this helps anyone having a similar problem.
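For reference, a minimal sketch of the simplified architecture described in the answer above, reusing num_features from the question (the layer sizes here are placeholders, not the poster's exact values):
from tensorflow.keras import Sequential, layers
simple_model = Sequential([
    layers.LSTM(64, return_sequences=True, input_shape=(1, num_features)),
    layers.LSTM(32),
    layers.Dense(2),  # or Dense(1) for a single target
])
simple_model.compile(optimizer='adam', loss='mean_squared_error')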
There are two things I notice and don't understand why you use them. The first one is the dense2 layer with sigmoid activation. I don't think a sigmoid activation is beneficial when we are trying to solve a regression problem. Can you change that to relu and see what happens? The second one is that your output layer has two units. You did not specify it, but I think you are predicting two values from the same inputs. If you are trying to predict just one value, you should change it to
outputs = layers.Dense(1)(dense2)
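Putting both suggestions together, the tail of the model would look roughly like this (keeping the poster's layer sizes):
dense2 = layers.Dense(1000, activation='relu')(dense1)  # relu instead of sigmoid
outputs = layers.Dense(1)(dense2)  # a single regression target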

Custom optimizer with multiple loss function evaluations

I want to implement a custom optimization algorithm for TF models.
I have read the following sources
tf documentation on custom optimizers
tf SGD implementation
keras documentation on custom models
towardsdatascience guide on custom optimizers
However, a lot of questions remain.
It seems like it is not possible to evaluate the loss function multiple times (for different weight settings) before applying a gradient step when using the custom optimizer API. For example, in a line-search type of algorithm this is necessary.
I tried to do all steps manually.
Assume I have setup my model and my optimization problem like this
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import models
model = models.Sequential()
model.add(layers.Dense(15, input_dim=10))
model.add(layers.Dense(20))
model.add(layers.Dense(1))
x_train, y_train = get_train_data()
loss = losses.MeanSquaredError()
def val_and_grads(weights):
    model.set_weights(weights)
    with tf.GradientTape() as tape:
        val = loss(y_train, model(x_train))
    grads = tape.gradient(val, model.trainable_variables)
    return val, grads
initial_weights = model.get_weights()
optimal_weights = my_fancy_optimization_algorithm(val_and_grads, initial_weights)
However, my function val_and_grads needs a list of weights and returns a list of gradients; from my_fancy_optimization_algorithm's point of view that seems unnatural.
I could wrap val_and_grads to "stack" the returned gradients and "split" the passed weights like this
def wrapped_val_and_grad(weights):
    val, grads = val_and_grads(split_weights(weights))
    return val, stack_grads(grads)
however that seems very inefficient.
Anyway, I do not like this approach since it seems that I would lose out on a lot of the surrounding TensorFlow infrastructure (printing of current loss function values and metrics during learning, tensorboard stuff, ...).
I could also pack the above in a custom model with a tailored train_step like this
class CustomModel(keras.Model):
    def train_step(self, data):
        x_train, y_train = data
        def val_and_grads(weights):
            self.set_weights(weights)
            with tf.GradientTape() as tape:
                val = loss(y_train, self(x_train))
            grads = tape.gradient(val, self.trainable_variables)
            return val, grads
        trainable_vars = self.trainable_variables
        old_weights = self.get_weights()
        update = my_fancy_update_finding_algorithm(val_and_grads, self.get_weights())  # this can do multiple evaluations of the model
        self.set_weights(old_weights)  # restore the weights
        self.optimizer.apply_gradients(zip(update, trainable_vars))
Here I would need an accompanying custom optimizer that does nothing else than update the current weights by adding the update (new_weights = current_weights + update).
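One simple way to get that pass-through behaviour, as a rough sketch I have not verified, would be to reuse SGD with learning_rate=1.0 and feed it the negated update, since SGD applies w <- w - lr * g:
passthrough = tf.keras.optimizers.SGD(learning_rate=1.0)
passthrough.apply_gradients(zip([-u for u in update], trainable_vars))  # effectively w <- w + update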
I am still unsure if this is the best way to go.
If someone can comment on the snippets and ideas above, guide me to any other resource that I should consider or provide new approaches and other feedback I would be very glad.
Thanks all.
Franz
EDIT:
Sadly, I have not gotten any response here so far. Maybe my question is not concrete enough. So, as a first smaller question:
Given the model and val_and_grads in the first listing, how would I efficiently calculate the norm of the WHOLE gradient? What I do so far is:
import numpy as np
_, grads = val_and_grads(model.get_weights())
norm_grads = np.linalg.norm(np.concatenate([grad.numpy().flatten() for grad in grads]))
This surely cannot be the "right" way.
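A more compact alternative, assuming TF2, seems to be tf.linalg.global_norm, which computes the L2 norm over a whole list of tensors:
norm_grads = tf.linalg.global_norm(grads)  # same quantity, no manual flattening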

tf.keras loss from two images in serial

I want to use the stability training approach of the paper and apply it to a very simple CNN.
The basic architecture is shown in the figure from the paper: you compute the loss based on the output f(I) for the input image I and on the output f(I') for the perturbed image I'.
My question is how to do this in a valid way without having two instances of the DNN, as I'm training on large 3D images. In other words: how can I pass the two images through the same network one after the other and compute the loss based on both outputs?
I'm using tf2 with keras.
You can first write your DNN as a tf.keras Model.
After that, you can write another model which takes two image inputs, applies some Gaussian noise to one of them, and passes both through the DNN.
Then design a custom loss function which computes the proper loss from the two outputs.
Here is some demo code:
from tensorflow.keras.layers import Input, Dense, Flatten, GaussianNoise
from tensorflow.keras.models import Model
import tensorflow as tf
# shared DNN: this is the base model with a feature-space output, there is only one instance of the model
ip = Input(shape=(32, 32, 1))  # same shape as the original inputs
f0 = Flatten()(ip)
d0 = Dense(10)(f0)  # 10-dimensional feature embedding
dnn = Model(ip, d0)
# final model with the two versions of the image and the loss
input_1 = Input(shape=(32, 32, 1))
input_2 = Input(shape=(32, 32, 1))
g0 = GaussianNoise(0.5)(input_2)  # only input_2 passes through the Gaussian noise layer; you can design your own custom layer too
# passing the two images through the same DNN
path1 = dnn(input_1)  # no noise
path2 = dnn(g0)       # noise
model = Model([input_1, input_2], [path1, path2])
def my_loss(y_true, y_pred):
    # calculate your loss based on your two outputs path1, path2
    pass
model.compile('adam', my_loss)
model.summary()
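One possible way to fill in my_loss, sketched under the assumption that the task loss is plain MSE on the clean branch and that the 0.1 stability weight is arbitrary, is to concatenate the two branch outputs so a single loss function sees both:
from tensorflow.keras.layers import Concatenate
merged = Concatenate(axis=-1)([path1, path2])  # clean and noisy embeddings side by side
stability_model = Model([input_1, input_2], merged)
def stability_loss(y_true, y_pred):
    f_clean, f_noisy = tf.split(y_pred, num_or_size_splits=2, axis=-1)
    task_loss = tf.reduce_mean(tf.square(y_true - f_clean))   # loss on the clean branch only
    stability = tf.reduce_mean(tf.square(f_clean - f_noisy))  # penalty for drifting under noise
    return task_loss + 0.1 * stability
stability_model.compile('adam', stability_loss)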

keras batchnorm layer moving average for variance

I've been trying to understand the Keras BatchNorm layer behavior in my Keras NN model. One question I encountered was how the BN layer is calculating the moving average of the 'variance'. My understanding is Keras is using exponential-weighted-average method to calculate the moving average for both mean and variance from the training mini-batches. But regardless of this, after a really large number of epochs, this moving average should approach the mean/variance of the training data set. But in my simple example, the 'variance' moving average is always different from the training data 'variance'. Below is my code and output:
from keras.layers import Input, BatchNormalization
from keras.models import Model
from keras.optimizers import Adam, RMSprop
import numpy as np
X_input = Input(shape=(6,))
X = BatchNormalization(axis=-1)(X_input)
model = Model(inputs=X_input, outputs=X)
model.compile(optimizer=RMSprop(), loss='mean_squared_error')
np.random.seed(3)
train_data = np.random.random((5,6))
train_label = np.random.random((5,6))
model.fit(x=train_data, y=train_label, epochs=10000, batch_size=6, verbose=False)
bn_gamma, bn_beta, bn_mean, bn_var = model.layers[1].get_weights()
train_mean = np.mean(train_data, axis=0)
train_var = np.var(train_data, axis=0)
print("train_mean: {}".format(train_mean))
print("moving_mean: {}".format(bn_mean))
print("train_var: {}".format(train_var))
print("moving_var: {}".format(bn_var))
Below is the output:
train_mean: [0.42588575 0.47785879 0.32170309 0.49151921 0.355046 0.60104636]
moving_mean: [0.4258843 0.47785735 0.32170165 0.49151778 0.35504454 0.60104346]
train_var: [0.03949981 0.05228663 0.04027516 0.02522536 0.10261097 0.0838988 ]
moving_var: [0.04938692 0.06537427 0.05035637 0.03153942 0.12829503 0.10489936]
As you can see, train_mean is the same as the BN layer's moving-average mean, but train_var (the variance) is not. Can anyone please help here? Thanks.
If you look at the source code of batchnorm, you can see that the unbiased estimator of the population variance is used; here is the relevant line:
variance *= sample_size / (sample_size - (1.0 + self.epsilon))
In your case, the sample size is 5, so you should have train_var * 5./4 == moving_var, which is the case.
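A quick sanity check with the numbers above (the small residual difference comes from the epsilon term in the denominator):
import numpy as np
train_var = np.array([0.03949981, 0.05228663, 0.04027516, 0.02522536, 0.10261097, 0.0838988])
print(train_var * 5.0 / 4.0)
# ≈ [0.04937 0.06536 0.05034 0.03153 0.12826 0.10487], matching moving_var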

Keras ML library: how to do weight clipping after gradient updates? TensorFlow backend

I'm trying to use Keras to implement part of an algorithm that requires weight clipping, i.e. limiting the weight values after a gradient update. I haven't found any solutions through web searches so far.
For background, this has to do with the WGANs algorithm:
https://arxiv.org/pdf/1701.07875.pdf
If you look at algorithm 1 on page 8, you'll see the following:
I've highlighted the lines that I'm trying to implement in Keras: after computing a gradient to use to update the weights in the network, I want to make sure that all the weights are clipped between some values [-c, c] that I can set.
How could I go about doing this in Keras?
For reference I am using the TensorFlow backend. I don't mind digging into things and adding messy quick-fixes for now.
While creating the optimizer object, set the clipvalue param. It will do precisely what you want.
from keras.optimizers import RMSprop
# all parameter gradients will be clipped to
# a maximum value of 0.5 and
# a minimum value of -0.5
rmsprop = RMSprop(clipvalue=0.5)
and then use this object when compiling the model:
model.compile(loss='mse', optimizer=rmsprop)
For more reference check: here.
Also, I prefer to use clipnorm over clipvalue because with clipnorm the optimization remains more stable. For example, say you have 2 parameters and the gradients came out to be [0.1, 3]. With clipvalue the gradients become [0.1, 0.5], i.e. there is a chance that the direction of steepest descent gets changed drastically. clipnorm doesn't have a similar problem, since all the gradients are scaled appropriately: the direction is preserved while the magnitude of the gradient is still constrained.
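For example, the equivalent setup with clipnorm (the 1.0 threshold is just an illustrative value):
rmsprop = RMSprop(clipnorm=1.0)  # clip gradient norms instead of individual values
model.compile(loss='mse', optimizer=rmsprop)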
Edit: the question asks about weight clipping, not gradient clipping:
Clipping the weights themselves is not part of the Keras code, but the maxnorm constraint on weights is. Check here.
Having said that, it can easily be implemented. Here is a very small example:
from keras.constraints import Constraint
from keras import backend as K
class WeightClip(Constraint):
    '''Clips the weights incident to each hidden unit to be inside a range'''
    def __init__(self, c=2):
        self.c = c
    def __call__(self, p):
        return K.clip(p, -self.c, self.c)
    def get_config(self):
        return {'name': self.__class__.__name__,
                'c': self.c}
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
model.add(Dense(30, input_dim=100, W_constraint=WeightClip(2)))
model.add(Dense(1))
model.compile(loss='mse', optimizer='rmsprop')
X = np.random.random((1000, 100))
Y = np.random.random((1000, 1))
model.fit(X, Y)
I have tested that the above code runs, but not the validity of the constraint. You can do so by getting the model weights after training using model.get_weights() or model.layers[idx].get_weights() and checking whether they abide by the constraint.
Note: the constraint is not added to all the model weights, but just to the weights of the specific layer it is used in. Also, W_constraint adds the constraint to the W (kernel) param and b_constraint to the b (bias) param.
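As a hedged example of such a check, assuming the WeightClip(2) setup above so the first layer's kernel should end up inside [-2, 2]:
import numpy as np
w, b = model.layers[0].get_weights()  # kernel and bias of the constrained Dense layer
print(np.abs(w).max() <= 2.0)         # expected: True; the bias is unconstrained here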