Loss reduction in canned TF estimators - tensorflow

I use a TensorFlow canned estimator (LinearClassifier) to predict game actions from situations, favouring those that lead to the best scores. The scores are included in train_data, used as weights, and passed to the estimator via the weight column.
I know the weight values are multiplied by the loss (MSE in this case), but I want to know whether loss minimization is handled for me or whether I have to define the optimizer as:
optimizer=tf.train.AdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.99,
                                 epsilon=1e-08, use_locking=False).minimize(loss)

My current model is:

model = tf.estimator.LinearClassifier(
    feature_columns=feature_columns,
    optimizer=tf.train.AdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.99,
                                     epsilon=1e-08, use_locking=False),
    weight_column=weights,
    # dropout=0.1,
    # activation_fn=tf.nn.softmax,
    n_classes=10,
    label_vocabulary=Action_vocab,
    model_dir='./Models/ActionPlayerModel20/',
    loss_reduction=tf.losses.Reduction.SUM_OVER_BATCH_SIZE,
    config=tf.estimator.RunConfig().replace(save_summary_steps=10))

Not at all sure what you mean by:
I know the weight values are multiplied by the loss
but the classifier line is correct as you have it. You should pass the Optimizer object into the classifier and not the .minimize() operation. The estimator will generate and handle the minimize operation internally.
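For reference, a minimal sketch of how training then looks (TF 1.x; train_input_fn is a placeholder for your own input function); no explicit .minimize() call is needed anywhere:

model = tf.estimator.LinearClassifier(
    feature_columns=feature_columns,
    optimizer=tf.train.AdamOptimizer(learning_rate=0.001),
    weight_column=weights,
    n_classes=10,
    loss_reduction=tf.losses.Reduction.SUM_OVER_BATCH_SIZE)

# train() builds the graph, wires the optimizer to the weighted loss and runs the updates.
model.train(input_fn=train_input_fn, steps=10000)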

Related

How to build a Neural Network in Keras using a custom loss function with datapoint-specific weight?

I want to train a neural network for a classification task in Keras using a TensorFlow backend with a custom loss function. In my loss, I want to give different weights to different training examples: I have some datapoints I consider important and some I consider less important. I want my loss function to take this into account and punish errors on important examples more than errors on less important ones.
I have already built my model:
input = tf.keras.Input(shape=(16,))
hidden_layer_1 = tf.keras.layers.Dense(5, kernel_initializer='glorot_uniform', activation='relu')(input)
output = tf.keras.layers.Dense(1, kernel_initializer='normal', activation='softmax')(hidden_layer_1)
model = tf.keras.Model(input, output)
model.compile(loss=custom_loss(input), optimizer='adam', run_eagerly=True,
              metrics=[tf.keras.metrics.Accuracy(), 'acc'])
and the current state of my loss function is:
def custom_loss(input):
    def loss(y_true, y_pred):
        return ...
    return loss
I'm struggling with implementing the loss function in the way I explained above, mainly because I don't exactly know what input, y_pred and y_true are (KerasTensors, I know - but what is their content? And is it for one training example only or for the whole batch?). I'd appreciate help with:
printing out the values of input, y_true and y_pred
converting the input value to a numpy ndarray ([1,3,7] for example) so I can use the array to look up my weight for this specific training data point
once I have my weight as a number (0.5 for example), how do I implement the computation of the loss function in Keras? My loss for one training example should be 0 if the classification was correct and weight if it was incorrect.
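A caveat and a sketch (both based on assumptions about your setup): inside loss, y_true and y_pred are always whole batches (here of shape (batch_size, 1)), and the input captured by the closure is the symbolic Keras tensor rather than the concrete batch values, which makes converting it to a NumPy array inside the loss unreliable. A more robust pattern is to precompute one weight per training example and let Keras apply it via sample_weight, which multiplies each example's loss by its weight:

import numpy as np

# x_train: shape (n_samples, 16); y_train: shape (n_samples, 1)
# is_important() is a hypothetical helper returning a boolean per example;
# replace it with your own lookup of which datapoints matter more.
sample_weights = np.where(is_important(x_train), 1.0, 0.5)

model.compile(loss='binary_crossentropy', optimizer='adam', run_eagerly=True,
              metrics=['accuracy'])
model.fit(x_train, y_train, sample_weight=sample_weights, epochs=10)

With run_eagerly=True you can also print y_true and y_pred (or call .numpy() on them) inside a custom loss to inspect their contents.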

Tensorflow Polynomial Linear Regression curve fit

I have created this linear regression model using TensorFlow (Keras). However, I am not getting good results and my model is trying to fit the points around a straight line. I believe fitting the points with a degree-n polynomial could give better results. I have googled how to change my model to polynomial regression using TensorFlow Keras, but could not find a good resource. Any recommendation on how to improve the prediction?
I have a large dataset. I shuffled it first and then split it into 80% training and 20% testing. The dataset is also normalized.
1) Building model:
def build_model():
    model = keras.Sequential()
    model.add(keras.layers.Dense(units=300, input_dim=32))
    model.add(keras.layers.Activation('sigmoid'))
    model.add(keras.layers.Dense(units=250))
    model.add(keras.layers.Activation('tanh'))
    model.add(keras.layers.Dense(units=200))
    model.add(keras.layers.Activation('tanh'))
    model.add(keras.layers.Dense(units=150))
    model.add(keras.layers.Activation('tanh'))
    model.add(keras.layers.Dense(units=100))
    model.add(keras.layers.Activation('tanh'))
    model.add(keras.layers.Dense(units=50))
    model.add(keras.layers.Activation('linear'))
    model.add(keras.layers.Dense(units=1))
    # sigmoid tanh softmax relu
    optimizer = tf.train.RMSPropOptimizer(0.001,
                                          decay=0.9,
                                          momentum=0.0,
                                          epsilon=1e-10,
                                          use_locking=False,
                                          centered=False,
                                          name='RMSProp')
    # optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
    model.compile(loss='mse',
                  optimizer=optimizer,
                  metrics=['mae'])
    return model

model = build_model()
model.summary()
2) Train the model:
class PrintDot(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs):
        if epoch % 100 == 0: print('')
        print('.', end='')

EPOCHS = 500

# Store training stats
history = model.fit(train_data, train_labels, epochs=EPOCHS,
                    validation_split=0.2, verbose=1,
                    callbacks=[PrintDot()])
3) Plot train loss and val loss:
4) Stop when the results do not improve anymore:
5) Evaluate the result
[loss, mae] = model.evaluate(test_data, test_labels, verbose=0)
#Testing set Mean Abs Error: 1.9020842795676374
6) Predict:
test_predictions = model.predict(test_data).flatten()
7) Prediction error:
Polynomial regression is linear regression with some additional input features that are polynomial functions of the original input features.
That is: let the original input features be (x1, x2, x3, ...).
Generate a set of polynomial features by adding transformations of the original features, for example (x1^2, x2^3, x1^3*x2, ...).
Which functions to include depends on constraints such as intuition about their correlation with the target values, computational resources, and training time.
Append these new features to the original input feature vector; the transformed input vector then has size len(x1, x2, x3, ...) + len(x1^2, x2^3, x1^3*x2, ...).
This updated set of input features (x1, x2, x3, x1^2, x2^3, x1^3*x2, ...) is then fed into the normal linear regression model. The network architecture may be tuned again to get the best trained model.
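As a minimal sketch of that feature expansion (assuming NumPy and, optionally, scikit-learn are available; the chosen terms are only illustrative):

import numpy as np

# train_data: shape (n_samples, 32), as in the model above.
# Hand-pick a few polynomial terms of the original features...
extra = np.column_stack([
    train_data[:, 0] ** 2,                     # x1^2
    train_data[:, 1] ** 3,                     # x2^3
    train_data[:, 0] ** 3 * train_data[:, 1],  # x1^3 * x2
])
train_data_poly = np.hstack([train_data, extra])

# ...or generate all terms up to a given degree with scikit-learn:
# from sklearn.preprocessing import PolynomialFeatures
# train_data_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(train_data)

# Remember to set input_dim=train_data_poly.shape[1] in the first Dense layer.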
PS: I see that your network is quite large while the number of inputs is only 32, which is not a common scale of architecture. Even for this particular model, reducing it to one or two hidden layers may help train a better model (this suggestion assumes your dataset is similar to other commonly seen regression datasets).
I've actually created polynomial layers for Tensorflow 2.0, though these may not be exactly what you are looking for. If they are, you could use those layers directly or follow the procedure used there to create a more general layer https://github.com/jloveric/piecewise-polynomial-layers

Working with multiple losses and their weights in keras

I am training a GAN model using train_on_batch with multiple losses. Can I use arbitrary loss_weights while compiling the model, or is there a specific strategy for choosing these loss weights, as mentioned here? In my problem, mean_squared_error is the loss function between the generated image and the original image, and binary_crossentropy is the classification loss function for the 0 and 1 classes.
model.compile(optimizer=optimizer, loss=['mean_squared_error', 'binary_crossentropy'], loss_weights=[100,1])
The weights are hyperparameters that you need to optimize. Notice that optimizing these hyperparameters is not simple, because lowering the weights will automatically decrease the loss (which we usually aim to minimize) but will not necessarily produce a better model. MSE can range over [0, infinity) if not normalized, or, e.g., [0, 1] if the features are normalized between [0, 1] (and a sigmoid is used). Binary cross-entropy values can range over [0, infinity), which makes the process not as simple as we may think. Without any knowledge of your specific problem, I would first try the default weights (1 each).
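To make the effect of loss_weights concrete, here is a sketch of the quantity Keras effectively minimizes for the compile call above (the 100:1 ratio comes from your example, not a recommendation):

import tensorflow as tf

mse = tf.keras.losses.MeanSquaredError()
bce = tf.keras.losses.BinaryCrossentropy()

# With loss_weights=[100, 1], the total training loss is the weighted sum below;
# only the ratio between the weights changes the trade-off between the two objectives.
def total_loss(original_image, generated_image, labels, predictions):
    return 100.0 * mse(original_image, generated_image) + 1.0 * bce(labels, predictions)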

How can I implement max norm constraints in an MLP in tensorflow?

How can I implement max norm constraints on the weights in an MLP in tensorflow? The kind that Hinton and Dean describe in their work on dark knowledge. That is, does tf.nn.dropout implement the weight constraints by default, or do we need to do it explicitly, as in
https://arxiv.org/pdf/1207.0580.pdf
"If these networks share the same weights for the hidden units that are present.
We use the standard, stochastic gradient descent procedure for training the dropout neural
networks on mini-batches of training cases, but we modify the penalty term that is normally
used to prevent the weights from growing too large. Instead of penalizing the squared length
(L2 norm) of the whole weight vector, we set an upper bound on the L2 norm of the incoming
weight vector for each individual hidden unit. If a weight-update violates this constraint, we
renormalize the weights of the hidden unit by division."
Keras appears to have it
http://keras.io/constraints/
tf.nn.dropout does not impose any norm constraint. I believe what you're looking for is to "process the gradients before applying them" using tf.clip_by_norm.
For example, instead of simply:
# Create an optimizer + implicitly call compute_gradients() and apply_gradients()
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
You could:
# Create an optimizer.
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
# Compute the gradients for a list of variables.
grads_and_vars = optimizer.compute_gradients(loss, [weights1, weights2, ...])
# grads_and_vars is a list of tuples (gradient, variable).
# Do whatever you need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(tf.clip_by_norm(gv[0], clip_norm=123.0, axes=0), gv[1])
                         for gv in grads_and_vars]
# Ask the optimizer to apply the capped gradients
train_op = optimizer.apply_gradients(capped_grads_and_vars)
I hope this helps. Final notes about tf.clip_by_norm's axes parameter:
If you're calculating tf.nn.xw_plus_b(x, weights, biases), or equivalently matmul(x, weights) + biases, when the dimensions of x and weights are (batch, in_units) and (in_units, out_units) respectively, then you probably want to set axes == [0] (because in this usage each column details all incoming weights to a specific unit).
Pay attention to the shape/dimensions of your variables above and whether/how exactly you want to clip_by_norm each of them! E.g. if some of [weights1, weights2, ...] are matrices and some aren't, and you call clip_by_norm() on the grads_and_vars with the same axes value like in the List Comprehension above, this doesn't mean the same thing for all the variables! In fact, if you're lucky, this will result in a weird error like ValueError: Invalid reduction dimension 1 for input with 1 dimensions, but otherwise it's a very sneaky bug.
You can use tf.clip_by_value:
https://www.tensorflow.org/versions/r0.10/api_docs/python/train/gradient_clipping
Gradient clipping is also used to prevent weight explosion in recurrent neural networks.
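For the max-norm constraint itself (renormalizing each hidden unit's incoming weight vector, as in the quoted paper) rather than gradient clipping, a minimal TF 1.x sketch could look like the following; the layer sizes, dropout rate and clip value are assumptions:

import tensorflow as tf

max_norm = 3.0  # upper bound on the L2 norm of each unit's incoming weights

x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32, [None, 10])

weights1 = tf.get_variable('weights1', shape=[784, 256])
weights2 = tf.get_variable('weights2', shape=[256, 10])
hidden = tf.nn.dropout(tf.nn.relu(tf.matmul(x, weights1)), keep_prob=0.5)
logits = tf.matmul(hidden, weights2)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=logits))

train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

# After each gradient step, renormalize each column of weights1 (one column per
# hidden unit, hence axes=[0]) so its L2 norm does not exceed max_norm.
with tf.control_dependencies([train_op]):
    maxnorm_op = tf.assign(weights1, tf.clip_by_norm(weights1, max_norm, axes=[0]))

# Run maxnorm_op (instead of train_op) in the training loop; the control dependency
# guarantees the gradient step happens first.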

How to define weight decay for individual layers in TensorFlow?

In CUDA ConvNet, we can write something like this (source) for each layer:
[conv32]
epsW=0.001
epsB=0.002
momW=0.9
momB=0.9
wc=0
where wc=0 refers to the L2 weight decay.
How can the same be achieved in TensorFlow?
You can add all the variables you want to apply weight decay to, to a collection named 'weights', and then calculate the L2 weight decay for the whole collection.
# Create your variables
weights = tf.get_variable('weights', collections=['weights'])

with tf.variable_scope('weights_norm') as scope:
    weights_norm = tf.reduce_sum(
        input_tensor=WEIGHT_DECAY_FACTOR * tf.stack(
            [tf.nn.l2_loss(i) for i in tf.get_collection('weights')]
        ),
        name='weights_norm'
    )

# Add the weight decay loss to another collection called losses
tf.add_to_collection('losses', weights_norm)

# Add the other loss components to the collection losses
# ...

# To calculate your total loss
tf.add_n(tf.get_collection('losses'), name='total_loss')
get_variable(
    name,
    shape=None,
    dtype=None,
    initializer=None,
    regularizer=None,
    trainable=True,
    collections=None,
    caching_device=None,
    partitioner=None,
    validate_shape=True,
    use_resource=None,
    custom_getter=None)
This is the signature of the TensorFlow function get_variable. You can simply specify a regularizer to apply weight decay.
Following is an example:
weight_decay = tf.constant(0.0005, dtype=tf.float32)  # your weight decay rate, must be a scalar tensor
W = tf.get_variable(name='weight', shape=[4, 4, 256, 512],
                    regularizer=tf.contrib.layers.l2_regularizer(weight_decay))
Both current answers are wrong in that they do not give you "weight decay as in cuda-convnet" but instead L2-regularization, which is different.
When using pure SGD (without momentum) as an optimizer, weight decay is the same thing as adding a L2-regularization term to the loss. When using any other optimizer, this is not true.
Weight decay (don't know how to TeX here, so excuse my pseudo-notation):
w[t+1] = w[t] - learning_rate * dw - weight_decay * w
L2-regularization:
loss = actual_loss + lambda * 1/2 * sum(||w||_2^2 for w in network_params)
Computing the gradient of the extra term in L2-regularization gives lambda * w, and thus inserting it into the SGD update equation
dloss_dw = dactual_loss_dw + lambda * w
w[t+1] = w[t] - learning_rate * dloss_dw
gives the same as weight decay, but mixes lambda with the learning_rate. Any other optimizer, even SGD with momentum, gives a different update rule for weight decay than for L2-regularization! See the paper Fixing weight decay in Adam for more details. (Edit: AFAIK, this 1987 Hinton paper introduced "weight decay", literally as "each time the weights are updated, their magnitude is also decremented by 0.4%" at page 10)
That being said, there doesn't seem to be support for "proper" weight decay in TensorFlow yet. There are a few issues discussing it, specifically because of above paper.
One possible way to implement it is by writing an op that does the decay step manually after every optimizer step. A different way, which is what I'm currently doing, is using an additional SGD optimizer just for the weight decay, and "attaching" it to your train_op. Both of these are just crude work-arounds, though. My current code:
# In the network definition:
with arg_scope([layers.conv2d, layers.dense],
               weights_regularizer=layers.l2_regularizer(weight_decay)):
    # define the network.

loss = # compute the actual loss of your problem.
train_op = optimizer.minimize(loss, global_step=global_step)

if args.weight_decay not in (None, 0):
    with tf.control_dependencies([train_op]):
        sgd = tf.train.GradientDescentOptimizer(learning_rate=1.0)
        train_op = sgd.minimize(tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)))
This somewhat makes use of TensorFlow's provided bookkeeping. Note that the arg_scope takes care of appending an L2-regularization term for every layer to the REGULARIZATION_LOSSES graph-key, which I then all sum up and optimize using SGD which, as shown above, corresponds to actual weight-decay.
Hope that helps, and if anyone gets a nicer code snippet for this, or TensorFlow implements it better (i.e. in the optimizers), please share.
Edit: see also this PR which just got merged into TF.
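If your TensorFlow build includes that PR, the decoupled weight-decay optimizers live in tf.contrib.opt; a rough sketch follows, with the caveat that the exact module path and availability depend on your TF 1.x version, so treat it as an assumption to verify:

# Assumes a TF 1.x release that ships the tf.contrib.opt weight-decay optimizers.
optimizer = tf.contrib.opt.AdamWOptimizer(weight_decay=1e-4, learning_rate=0.001)
train_op = optimizer.minimize(loss, global_step=global_step)

# There is also a wrapper that adds decoupled weight decay to other optimizers, e.g.:
# MomentumW = tf.contrib.opt.extend_with_decoupled_weight_decay(tf.train.MomentumOptimizer)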