Getting all four Batch Normalization parameters for inference in CNTK - cntk

The two Batch Normalization (BN) parameters that you get from the parameters() function are insufficient for inference. BN computes y = gamma * (x – E[x]) / Var[x]) + beta. How do get access to the other two?

The other 2 (mean and variance statistics) are “constants” since they are not trained using gradients using SGD. You can get to them by calling ‘constants’.

If you use the cntk.layers.BatchNormalization layer and name the layer, then the parameters can be accessed by name:
>>> y=BatchNormalization(name='bn')
>>> y.bn.scale
Parameter('scale', [], [?])
>>> y.bn.bias
Parameter('bias', [], [?])
>>> y.bn.aggregate_mean
Constant('aggregate_mean', [], [?])
>>> y.bn.aggregate_variance
Constant('aggregate_variance', [], [?])
>>> y.bn.aggregate_count
Constant('aggregate_count', [], [])
Note that these are aggregates that are formed during training for purpose of being used during inference. There is no way to access a single minibatch's mean and variance during training.

Related

Does Keras masking impact weight updates and loss calcuations?

I'm working with time series, and understand that keras.layers.Masking and keras.layers.Embedding are useful to create a mask value in the network which indicates timesteps to 'skip'. The mask value is propagated throughout the network to be used by any layers that support it.
The Keras documentation doesn't specify any further impacts of the mask value. My expectation is that the mask would be applied through all functions in model training and evaluation, but I don't see any evidence in support of this.
Does the mask value impact back-propagation?
Does the mask value impact the loss function or the metrics?
Would it be wise or foolish to use the sample_weight parameter in model.compile() to tell Keras to 'ignore' the masked timesteps in the loss function?
I've performed some experiments to answer these questions.
Here's my sample code:
import tensorflow as tf
import tensorflow.keras as keras
import numpy as np
# Fix the random seed for repeatable results
np.random.seed(5)
tf.random.set_seed(5)
x = np.array([[[3, 0], [1, 4], [3, 2], [4, 0], [4, 5]],
[[1, 2], [3, 1], [1, 3], [5, 1], [3, 5]]], dtype='float64')
# Choose some values to be masked out
mask = np.array([[False, False, True, True, True],
[ True, True, False, False, True]]) # True:keep. False:ignore
samples, timesteps, features_in = x.shape
features_out = 1
y_true = np.random.rand(samples, timesteps, features_out)
# y_true[~mask] = 1e6 # TEST MODIFICATION
# Apply the mask to x
mask_value = 0 # Set to any value
x[~mask] = [mask_value] * features_in
input_tensor = keras.Input(shape=(timesteps, features_in))
this_layer = input_tensor
this_layer = keras.layers.Masking(mask_value=mask_value)(this_layer)
this_layer = keras.layers.Dense(10)(this_layer)
this_layer = keras.layers.Dense(features_out)(this_layer)
model = keras.Model(input_tensor, this_layer)
model.compile(loss='mae', optimizer='adam')
model.fit(x=x, y=y_true, epochs=100, verbose=0)
y_pred = model.predict(x)
print("y_pred = ")
print(y_pred)
print("model weights = ")
print(model.get_weights()[1])
print(f"{'model.evaluate':>14s} = {model.evaluate(x, y_true, verbose=0):.5f}")
# See if the loss computed by model.evaluate() is equal to the masked loss
error = y_true - y_pred
masked_loss = np.abs(error[mask]).mean()
unmasked_loss = np.abs(error).mean()
print(f"{'masked loss':>14s} = {masked_loss:.5f}")
print(f"{'unmasked loss':>14s} = {unmasked_loss:.5f}")
Which outputs
y_pred =
[[[-0.28896046]
[-0.28896046]
[ 0.1546848 ]
[-1.1596009 ]
[ 1.5819632 ]]
[[ 0.59000516]
[-0.39362794]
[-0.28896046]
[-0.28896046]
[ 1.7996234 ]]]
model weights =
[-0.06686568 0.06484845 -0.06918766 0.06470951 0.06396528 0.06470013
0.06247645 -0.06492618 -0.06262784 -0.06445726]
model.evaluate = 0.60170
masked loss = 1.00283
unmasked loss = 0.90808
mask and loss calculation
Surprisingly, the 'mae' (mean absolute error) loss calculation does NOT exclude the masked timesteps from the calculation. Instead, it assumes that these timesteps have zero loss - a perfect prediction. Therefore, every masked timestep actually reduces the calculated loss!
To explain in more detail: the above sample code input x has 10 timesteps. 4 of them are removed by the mask, so 6 valid timesteps remain. The 'mean absolute error' loss calculation sums the losses for the 6 valid timesteps, then divides by 10 instead of dividing by 6. This looks like a bug to me.
output values are masked
Output values of masked timesteps do not impact the model training or evaluation (as it should be).
This can be easily tested by setting:
y_true[~mask] = 1e6
The model weights, predictions and losses remain exactly the same.
input values are masked
Input values of masked timesteps do not impact the model training or evaluation (as it should be).
Similarly, I can change mask_value from 0 to any other number, and the resulting model weights, predictions, and losses remain exactly the same.
In summary:
Q1: Effectively yes - the mask impacts the loss function, which is used through backpropagation to update the weights.
Q2: Yes, but the mask impacts the loss in an unexpected way.
Q3: Initially foolish - the mask should already be applied to the loss calculation. However, perhaps sample_weights could be valuable to correct the unexpected method of the loss calculation...
Note that I'm using Tensorflow 2.7.0.
I have been struggling through this on a related issue, namely implementing a mask to a multi-output model where some samples are missing labels for different outputs. Here, construct features, labels, sample_weights from a dataset and labels and sample_weights are dictionaries with equivalent keys. The weights are 0,1 for each sample indicating if it should contribute to the calculation for the relevant loss.
I had hoped that sample_weights would contribute to the loss as they do when I pass the metric equivalents for the losses via weight_metrics in model.compile
I've found that sample_weight does not seem to address this problem. I can tell from the training metrics that the task_loss values are different from task_metric values when sample weights are used.
I've given up on this and decided to go ahead and use masking. The masked loss values are low in your case (and in mine) because tensorflow sees the modeled output as perfection - I hope this means it does not see a gradient for this points and so parameters aren't tuned in response.

How to make use of class_weights to calculated custom loss fuction while using custom training loop (i.e. not using .fit )

I have written my custom training loop using tf.GradientTape(). My data has 2 classes. The classes are not balanced; class1 data contributes almost 80% and class2 contributes remaining 20%. Therefore in order to remove this imbalance I was trying to write custom loss function which will take into account this imbalance and apply the corresponding class weights and calculate the loss. i.e. I want to use the class_weights = [0.2, 0.8]. I am not able to find similar examples.
However all the examples I am seeing are using model.fit approach where its easier to pass the class_weights. I am not able to find out the example which uses class_weights with custom training loop using tf.GradientTape.
I did go through the suggestions of using sample_weight, however I don't have the data where in I can specify the weights for samples, therefore my preference is to use class weight.
I am using BinaryCrossentropy loss as loss function but I want to change the loss based on the class_weights. That's where I am stuck, how to tell BinaryCrossentropy to consider the class_weights.
Is my approach of using custom loss function correct or there is better way to make use of class_weights while training with custom training loop (not using model.fit)?
you can write your own loss function. in that loss function call BinaryCrossentropy and then multiply the result in the weight you want and return that
Here's an implementation that should work for n classes instead of just 2.
For your example of 80:20 split, calculate weights as below (assuming 100 samples in total).
Weight calculation (ref: Handling Class Imbalance: TensorFlow):
weight_class_0 = (1/count_for_class_0) * (total_samples / num_classes) # (80%) 0.625
weight_class_1 = (1/count_for_class_1) * (total_samples / num_classes) # (20%) 2.5
class_wts = tf.constant([weight_class_0, weight_class_1])
Loss function: Requires labels to be sparse and logits unscaled (no activations applied).
# Example logits=[[-3.2, 2.0], [1.2, 0.5], ...], (sparse)labels=[0, 1, ...]
def weighted_sparse_categorical_crossentropy(labels, logits, weights):
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels, logits)
class_weights = tf.gather(weights, labels)
return tf.reduce_mean(class_weights * loss)
You can supply this loss function to custom training loops.

ValueError from tensorflow estimator RNNClassifier with gcloud ml-engine job

I am working on the task.py file for submitting a gcloud MLEngine job. Previously I was using tensorflow.estimator.DNNClassifier successfully to submit jobs with my data (which consists solely of 8 columns of sequential numerical data for cryptocurrency prices & volume; no categorical).
I have now switched to the tensorflow contrib estimator RNNClassifier. This is my current code for the relevant portion:
def get_feature_columns():
return [
tf.feature_column.numeric_column(feature, shape=(1,))
for feature in column_names[:len(column_names)-1]
]
def build_estimator(config, learning_rate, num_units):
return tf.contrib.estimator.RNNClassifier(
sequence_feature_columns=get_feature_columns(),
num_units=num_units,
cell_type='lstm',
rnn_cell_fn=None,
optimizer=tf.train.AdamOptimizer(learning_rate=learning_rate),
config=config)
estimator = build_estimator(
config=run_config,
learning_rate=args.learning_rate,
num_units=[32, 16])
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
However, I'm getting the following ValueError:
ValueError: All feature_columns must be of type _SequenceDenseColumn. You can wrap a sequence_categorical_column with an embedding_column or indicator_column. Given (type <class 'tensorflow.python.feature_column.feature_column_v2.NumericColumn'>): NumericColumn(key='LTCUSD_close', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)
I don't understand this, as the data is not categorical.
As #Ben7 pointed out sequence_feature_columns accepts columns like sequence_numeric_column. However, according to the documentation, RNNClassifier sequence_feature_columns expects SparseTensors and sequence_numeric_column is a dense tensor. This seems to be contradictory.
Here is a workaround I used to solve this issue (I took the to_sparse_tensor function from this answer):
def to_sparse_tensor(dense):
# sequence_numeric_column default is float32
zero = tf.constant(0.0, dtype=tf.dtypes.float32)
where = tf.not_equal(dense, zero)
indices = tf.where(where)
values = tf.gather_nd(dense, indices)
return tf.SparseTensor(indices, values, tf.shape(dense, out_type=tf.dtypes.int64))
def get_feature_columns():
return [
tf.feature_column.sequence_numeric_column(feature, shape=(1,), normalizer_fn=to_sparse_tensor)
for feature in column_names[:len(column_names)-1]
]
you got this error because you use a numeric feature column whereas this kind of estimator can only accept sequence feature columns as you can see it on the init function.
So, instead of using numeric column you have to use sequence_numeric_column.

How to freeze/lock weights of one TensorFlow variable (e.g., one CNN kernel of one layer)

I have a TensorFlow CNN model that is performing well and we would like to implement this model in hardware; i.e., an FPGA. It's a relatively small network but it would be ideal if it were smaller. With that goal, I've examined the kernels and find that there are some where the weights are quite strong and there are others that aren't doing much at all (the kernel values are all close to zero). This occurs specifically in layer 2, corresponding to the tf.Variable() named, "W_conv2". W_conv2 has shape [3, 3, 32, 32]. I would like to freeze/lock the values of W_conv2[:, :, 29, 13] and set them to zero so that the rest of the network can be trained to compensate. Setting the values of this kernel to zero effectively removes/prunes the kernel from the hardware implementation thus achieving the goal stated above.
I have found similar questions with suggestions that generally revolve around one of two approaches;
Suggestion #1:
tf.Variable(some_initial_value, trainable = False)
Implementing this suggestion freezes the entire variable. I want to freeze just a slice, specifically W_conv2[:, :, 29, 13].
Suggestion #2:
Optimizer = tf.train.RMSPropOptimizer(0.001).minimize(loss, var_list)
Again, implementing this suggestion does not allow the use of slices. For instance, if I try the inverse of my stated goal (optimize only a single kernel of a single variable) as follows:
Optimizer = tf.train.RMSPropOptimizer(0.001).minimize(loss, var_list = W_conv2[:,:,0,0]))
I get the following error:
NotImplementedError: ('Trying to optimize unsupported type ', <tf.Tensor 'strided_slice_2228:0' shape=(3, 3) dtype=float32>)
Slicing tf.Variables() isn't possible in the way that I've tried it here. The only thing that I've tried which comes close to doing what I want is using .assign() but this is extremely inefficient, cumbersome, and caveman-like as I've implemented it as follows (after the model is trained):
for _ in range(10000):
# get a new batch of data
# reset the values of W_conv2[:,:,29,13]=0 each time through
for m in range(3):
for n in range(3):
assign_op = W_conv2[m,n,29,13].assign(0)
sess.run(assign_op)
# re-train the rest of the network
_, loss_val = sess.run([optimizer, loss], feed_dict = {
dict_stuff_here
})
print(loss_val)
The model was started in Keras then moved to TensorFlow since Keras didn't seem to have a mechanism to achieve the desired results. I'm starting to think that TensorFlow doesn't allow for pruning but find this hard to believe; it just needs the correct implementation.
A possible approach is to initialize these specific weights with zeros, and modify the minimization process such that gradients won't be applied to them. It can be done by replacing the call to minimize() with something like:
W_conv2_weights = np.ones((3, 3, 32, 32))
W_conv2_weights[:, :, 29, 13] = 0
W_conv2_weights_const = tf.constant(W_conv2_weights)
optimizer = tf.train.RMSPropOptimizer(0.001)
W_conv2_orig_grads = tf.gradients(loss, W_conv2)
W_conv2_grads = tf.multiply(W_conv2_weights_const, W_conv2_orig_grads)
W_conv2_train_op = optimizer.apply_gradients(zip(W_conv2_grads, W_conv2))
rest_grads = tf.gradients(loss, rest_of_vars)
rest_train_op = optimizer.apply_gradients(zip(rest_grads, rest_of_vars))
tf.group([rest_train_op, W_conv2_train_op])
I.e,
Preparing a constant Tensor for canceling the appropriate gradients
Compute gradients only for W_conv2, then multiply element-wise with the constant W_conv2_weights to zero the appropriate gradients and only then apply gradients.
Compute and apply gradients "normally" to the rest of the variables.
Group the 2 train ops to a single training op.

Using external optimizers with tensorflow and stochastic network elements

I have been using Tensorflow with the l-bfgs optimizer from openopt. It was pretty easy to setup callbacks to allow Tensorflow to compute gradients and loss evaluations for the l-bfgs, however, I am having some trouble figuring out how to introduce stochastic elements like dropout into the training procedure.
During the line search, l-bfgs performs multiple evaluations of the loss function, which need to operate on the same network as the prior gradient evaluation. However, it seems that for each evaluation of the tf.nn.dropout function, a new set of dropouts is created. I am looking for a way to fix the dropout over multiple evaluations of the loss function, and then allow it to change between the gradient steps of the l-bfgs. I'm assuming this has something to do with the control flow ops in tensorflow, but there isn't really a good tutorial on how to use these and they are a little enigmatic to me.
Thanks for your help!
Drop-out relies on uses random_uniform which is a stateful op, and I don't see a way to reset it. However, you can hack around it by substituting your own random numbers and feeding them to the same input point as random_uniform, replacing the generated values
Taking the following code:
tf.reset_default_graph()
a = tf.constant([1, 1, 1, 1, 1], dtype=tf.float32)
graph_level_seed = 1
operation_level_seed = 1
tf.set_random_seed(graph_level_seed)
b = tf.nn.dropout(a, 0.5, seed=operation_level_seed)
Visualize the graph to see where random_uniform is connected
You can see dropout takes input of random_uniform through the Add op which has a name mydropout/random_uniform/(random_uniform). Actually the /(random_uniform) suffix is there for UI reasons, and the true name is mydropout/random_uniform as you can see by printing tf.get_default_graph().as_graph_def(). That gives you shortened tensor name. Now you append :0 to get actual tensor name. (side-note: operation could produce multiple tensors which correspond to suffixes :0, :1 etc. Since having one output is the most common case, :0 is implicit in GraphDef and node input is equivalent to node:0. However :0 is not implicit when using feed_dict so you have to explicitly write node:0)
So now you can fix the seed by generating your own random numbers (of the same shape as incoming tensor), and reusing them between invocations.
tf.reset_default_graph()
a = tf.constant([1, 1, 1, 1, 1], dtype=tf.float32)
graph_level_seed = 1
operation_level_seed = 1
tf.set_random_seed(graph_level_seed)
b = tf.nn.dropout(a, 0.5, seed=operation_level_seed, name="mydropout")
random_numbers = np.random.random(a.get_shape()).astype(dtype=np.float32)
sess = tf.Session()
print sess.run(b, feed_dict={"mydropout/random_uniform:0":random_numbers})
print sess.run(b, feed_dict={"mydropout/random_uniform:0":random_numbers})
You should see the same set of numbers with 2 run calls.