Using tensorflow batch normalization stalls loss decrease

I'm using TensorFlow version r0.11.
I'm trying to use batch normalization (tf.contrib.layers.batch_norm()) in a conv net. As a head start, I followed the discussion in the following GitHub issue. It seems that the 'is_training', 'reuse' and 'updates_collections' flags are still confusing to use, partly because of the lack of good use cases. However, my problem is that the loss stops decreasing once I add a batch norm layer.
I structured the code following the CIFAR example, and I am running training in a multi-GPU fashion. I have one script for training (similar to cifar10_multigpu.py) and one for testing (similar to cifar10_eval.py).
for i in xrange(all_flags.num_gpus):  # Number of GPUs is 2
    with tf.device('/gpu:%d' % i):
        with tf.name_scope('%s_%d' % (all_flags.TOWER_NAME, i)) as scope:
            # Calculate the loss for one tower of the model. This function
            # constructs the entire model but shares the variables across all
            # towers.
            loss = _tower_loss(inputs[i], labels[i], scope)
            # Reuse variables for the next tower. This line makes it possible
            tf.get_variable_scope().reuse_variables()
            # More stuff happening like compute_gradients (inside gpus loop),
            # averaging gradients (outside gpus loop), applying them (outside
            # gpus loop)
The inference/model building happens within (a nested function in) the function _tower_loss. Below is an example of that function; in reality I use more layers and neurons.
def inference(inputs):  # (This gets called from _tower_loss())
    # conv1
    with tf.variable_scope('conv1') as scope:
        kernel = # define kernel
        conv = tf.nn.conv2d(inputs, kernel, strides=[1, 1, 1, 1], padding='SAME')
        biases = _variable_on_gpu('biases', [64], tf.constant_initializer(0.0))
        preactivation = tf.nn.bias_add(conv, biases)
        # ReLU.
        conv1 = tf.nn.relu(preactivation, name=scope.name)
    # pool1
    pool1 = tf.nn.max_pool(conv1, ksize=[1, 2, 2, 1],
                           strides=[1, 2, 2, 1], padding='SAME', name='pool1')
    # Similarly more conv+pool and then fcs and finally logits
    return logits
I want to perform batch normalization, so I passed an additional placeholder input argument to my '_tower_loss' and 'inference' functions.
def inference(inputs, is_training):
    # BN1
    with tf.variable_scope('norm0') as scope:
        # Note that I'm using the default for 'updates_collections'
        # which is None
        norm0 = tf.contrib.layers.batch_norm(inputs, is_training=is_training,
                                             scope=scope, reuse=None)
    # conv1
    with tf.variable_scope('conv1') as scope:
        kernel = # define kernel
        conv = tf.nn.conv2d(norm0, kernel, strides=[1, 1, 1, 1], padding='SAME')
    # Rest is same
I also added normalization layers to a couple of the fc layers.
In the training code the instructions go like this:
...
variable_averages = tf.train.ExponentialMovingAverage(0.9999, global_step)
variables_averages_op = variable_averages.apply(tf.trainable_variables())
train_op = tf.group(apply_gradient_op, variables_averages_op)
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS) # Line A
...
sess.run([train_op, loss, update_ops],feed_dict={is_training: True}) # Line B
...
When batch normalization is not used, Line A is absent and 'update_ops' is not run in Line B.
What I'm seeing is that without batch normalization the loss starts at around 6.5 and consistently decreases to near 0, but with batch normalization the loss stops decreasing after two or three hundred (minibatch) iterations and gets stuck at around 5.5. Speed-wise, I'd say the performance is the same. I'm not sure what the issue is. I tried different learning rates (I'm using the Adam optimizer) with no effect. I'm not sure whether running 'variables_averages_op' along with 'update_ops' is messing things up. Any help will be appreciated.
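For readers unsure what 'is_training' and the update ops actually control, here is a minimal NumPy sketch of the general batch normalization technique. It is not the tf.contrib.layers.batch_norm code: the class name BatchNormSketch, the decay value and the toy data are assumptions made for illustration only.

```python
import numpy as np

class BatchNormSketch:
    """Toy batch norm over the batch axis (illustrative, not tf.contrib)."""

    def __init__(self, num_features, decay=0.9, eps=1e-3):
        self.decay = decay
        self.eps = eps
        self.moving_mean = np.zeros(num_features)
        self.moving_var = np.ones(num_features)

    def __call__(self, x, is_training):
        if is_training:
            # Training mode: normalize with the statistics of the current batch.
            mean, var = x.mean(axis=0), x.var(axis=0)
            # These two lines play the role of the "update ops": if they never
            # run, the moving statistics stay at their initial values.
            self.moving_mean = self.decay * self.moving_mean + (1 - self.decay) * mean
            self.moving_var = self.decay * self.moving_var + (1 - self.decay) * var
        else:
            # Inference mode: normalize with the accumulated moving statistics.
            mean, var = self.moving_mean, self.moving_var
        return (x - mean) / np.sqrt(var + self.eps)

rng = np.random.default_rng(0)
bn = BatchNormSketch(num_features=3)
for _ in range(200):
    batch = rng.normal(loc=5.0, scale=2.0, size=(64, 3))
    train_out = bn(batch, is_training=True)

eval_out = bn(rng.normal(loc=5.0, scale=2.0, size=(64, 3)), is_training=False)
print(bn.moving_mean, bn.moving_var)
```

In this sketch the training output is always centered because it uses the batch's own statistics, while the evaluation output is only sensible once the moving statistics have been updated often enough.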

Related

Meta-Gradients / Multi-Batch Backpropagation in tensorflow

I am trying to implement a meta-gradient based pruning-at-initialization method by Alizadeh et al. (2022) in tensorflow. The method works roughly like this:
Take some batches from the dataset.
Mask all weights of the network with ones (e. g. tf.ones).
Perform one update of the weights, including the mask.
UNMASK all weights and perform the rest of the updates through the other batches.
Compute the meta-gradient of the loss w. r. t. the mask, i. e. backpropagate through all batches and weight-updates until the mask from the first iteration is "reached".
The authors implement this in PyTorch, which I typically do not use at work. I want to implement it in TensorFlow, but I run into the following problem: TensorFlow is not designed to propagate gradients "through" assign operations. For example:
w = tf.Variable([4.])
c = tf.Variable([2.])
with tf.GradientTape() as tape:
    tape.watch(c)
    w.assign(w * c)
    output = 2. * w
print(output)
# >> tf.Tensor([16.], shape=(1,), dtype=float32)
print(tape.gradient(output, c))
# >> None
That being said, my "pruning loop" looks roughly like this:
test_factor = tf.Variable(1., dtype=tf.float32)
with tf.GradientTape(persistent=True) as outer_tape:
    outer_tape.watch(masked_model.masks)
    outer_tape.watch(test_factor)
    ## First batch
    X_batch, y_batch = wrp.non_random_batch(X_train, y_train, 0, 256)
    with tf.GradientTape() as tape1:
        y_pred = masked_model(X_batch)
        loss = test_factor*loss_fn(y_batch, y_pred)
    gradients = tape1.gradient(loss, masked_model.proper_weights)
    ## Updating weights
    for w, g in zip(masked_model.proper_weights, gradients):
        w.assign(w - 0.05*g)
    ## Unmasking
    masked_model.unmask_forward_passes()
    ## Second batch (and more)
    X_batch, y_batch = wrp.non_random_batch(X_train, y_train, 1, 256)
    with tf.GradientTape() as tape2:
        y_pred = masked_model(X_batch)
        loss = loss_fn(y_batch, y_pred)
    gradients = tape2.gradient(loss, masked_model.proper_weights)
print(outer_tape.gradient(loss, masked_model.masks))
# >> ListWrapper([None, None, ..., None])
print(outer_tape.gradient(loss, test_factor))
# >> None
More batches would follow after the second one.
I inserted the test_factor to show that this is not a problem with my masks but with the general structure. Simply changing the line w.assign(w - 0.05*g) to w = w - 0.05*g makes the gradient available, but then the weights are not actually updated...
For the authors of the paper mentioned, this does not seem to be a problem. Is PyTorch simply more powerful in such cases, or am I missing some trick to get this to work in TensorFlow?
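One common workaround, sketched here under stated assumptions, is to keep the updated weights as ordinary tensors instead of assigning them back into variables, so that the whole update stays on the outer tape. The toy linear model, the data and the names x1, y1, x2, y2, w0 and loss_fn below are hypothetical stand-ins, not the masked_model from the question or the method of the paper.

```python
import tensorflow as tf

# Hypothetical toy problem: one linear layer and two tiny "batches".
x1, y1 = tf.constant([[1.0, 2.0]]), tf.constant([[1.0]])
x2, y2 = tf.constant([[0.5, -1.0]]), tf.constant([[0.0]])

w0 = tf.constant([[0.5], [-0.3]])      # initial weights as a plain tensor
mask = tf.Variable(tf.ones_like(w0))   # mask for which we want the meta-gradient
lr = 0.05

def loss_fn(w, x, y):
    return tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

with tf.GradientTape() as outer_tape:
    # First batch: masked forward pass, then one SGD step.
    with tf.GradientTape() as inner_tape:
        inner_tape.watch(w0)
        loss1 = loss_fn(w0 * mask, x1, y1)
    g = inner_tape.gradient(loss1, w0)  # recorded by outer_tape as well
    w1 = w0 - lr * g                    # new tensor instead of w.assign(...)

    # Second batch: unmasked forward pass with the updated weights.
    loss2 = loss_fn(w1, x2, y2)

# Backpropagates through the weight update to the mask of the first step.
meta_grad = outer_tape.gradient(loss2, mask)
print(meta_grad)
```

With a real Keras model this means the forward pass has to accept the weights as explicit arguments, which is the main cost of the approach.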

tensorflow, compute gradients with respect to weights that come from two models (encoder, decoder)

I have an encoder model and a decoder model (RNN).
I want to compute the gradients and update the weights.
I'm somewhat confused by what I've seen so far on the web.
Which block is the best practice? Is there any difference between the two options? Training seems to converge faster with Block 1, and I do not know why.
# BLOCK 1, in two operations
encoder_gradients,decoder_gradients = tape.gradient(loss,[encoder_model.trainable_variables,decoder_model.trainable_variables])
myoptimizer.apply_gradients(zip(encoder_gradients,encoder_model.trainable_variables))
myoptimizer.apply_gradients(zip(decoder_gradients,decoder_model.trainable_variables))
# BLOCK 2, in one operation
gradients = tape.gradient(loss,encoder_model.trainable_variables + decoder_model.trainable_variables)
myoptimizer.apply_gradients(zip(gradients,encoder_model.trainable_variables +
decoder_model.trainable_variables))

You can manually verify this.
First, let's simplify the model. Let the encoder and decoder both be a single dense layer. This keeps things simple, so you can print out the weights before applying the gradients, the gradients themselves, and the weights after applying the gradients.
import tensorflow as tf
import numpy as np
from copy import deepcopy

# create a simple model with one encoder and one decoder layer.
class custom_net(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.encoder = tf.keras.layers.Dense(3, activation='relu')
        self.decoder = tf.keras.layers.Dense(3, activation='relu')

    def call(self, inp):
        return self.decoder(self.encoder(inp))

net = custom_net()
# create dummy input/output (the target shape matches the (1, 3) model output)
inp = np.random.randn(1, 1)
gt = np.random.randn(1, 3)
# set persistent to true since we will be accessing the gradient 2 times
with tf.GradientTape(persistent=True) as tape:
    out = net(inp)
    loss = tf.keras.losses.mean_squared_error(gt, out)
# get the gradients as mentioned in the question
enc_grad, dec_grad = tape.gradient(loss,
                                   [net.encoder.trainable_variables,
                                    net.decoder.trainable_variables])
gradients = tape.gradient(loss,
                          net.encoder.trainable_variables + net.decoder.trainable_variables)
First, let's use a stateless optimizer like SGD which updates the weights based on the following formula and compare it to the 2 approaches mentioned in the question.
new_weights = weights - learning_rate * gradients.
# Block 1
myoptimizer = tf.keras.optimizers.SGD(learning_rate=1)
# store weights before updating the weights based on the gradients
old_enc_weights = deepcopy(net.encoder.get_weights())
old_dec_weights = deepcopy(net.decoder.get_weights())
myoptimizer.apply_gradients(zip(enc_grad, net.encoder.trainable_variables))
myoptimizer.apply_gradients(zip(dec_grad, net.decoder.trainable_variables))
# manually calculate the weights after gradient update
# since the learning rate is 1, new_weights = weights - grad
cal_enc_weights = []
for weights, grad in zip(old_enc_weights, enc_grad):
    cal_enc_weights.append(weights-grad)
cal_dec_weights = []
for weights, grad in zip(old_dec_weights, dec_grad):
    cal_dec_weights.append(weights-grad)
for weights, man_calc_weight in zip(net.encoder.get_weights(), cal_enc_weights):
    print(np.linalg.norm(weights-man_calc_weight))
for weights, man_calc_weight in zip(net.decoder.get_weights(), cal_dec_weights):
    print(np.linalg.norm(weights-man_calc_weight))
# Block 2
old_weights = deepcopy(net.encoder.trainable_variables + net.decoder.trainable_variables)
myoptimizer.apply_gradients(zip(gradients, net.encoder.trainable_variables +
                                net.decoder.trainable_variables))
cal_weights = []
for weight, grad in zip(old_weights, gradients):
    cal_weights.append(weight-grad)
for weight, man_calc_weight in zip(net.encoder.trainable_variables + net.decoder.trainable_variables, cal_weights):
    print(np.linalg.norm(weight-man_calc_weight))
You will see that both methods update the weights in exactly the same way.
I think you used a stateful optimizer like Adam or RMSProp. For such optimizers, each call to apply_gradients also updates the optimizer's internal state based on the gradient values, in addition to updating the weights. In the first case that state is updated by two separate calls, in the second case by only one.
I would stick to the second option if I were you, since you are performing just one step of optimization here.
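To illustrate why the number of calls can matter for a stateful optimizer, here is a textbook Adam step in NumPy. It is a sketch, not the Keras implementation: the function adam_step is hypothetical, and it assumes that the step counter used for bias correction is shared by all variables and advances once per apply_gradients call.

```python
import numpy as np

def adam_step(grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One textbook Adam step for a single parameter; returns (delta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction depends on the step count t
    v_hat = v / (1 - beta2 ** t)
    return lr * m_hat / (np.sqrt(v_hat) + eps), m, v

grad = 0.5
# One apply_gradients call: every variable sees step count t = 1.
step_first, _, _ = adam_step(grad, m=0.0, v=0.0, t=1)
# Second of two calls: the moments of this variable are still fresh, but
# (by assumption) the shared counter has already advanced to t = 2.
step_second, _, _ = adam_step(grad, m=0.0, v=0.0, t=2)
print(step_first, step_second)
```

Under that assumption the same gradient produces a different update size depending on which call handles the variable, which is one way the two blocks can diverge even though they agree for SGD.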

How to specify sample dependent kernels/filters in Conv2d?

I am trying to implement a convolutional autoencoder where some of the convolutional filters depend on the input content. In a simple toy example, knowing the digit label for MNIST could further help with reconstruction in an autoencoder setup.
The more general idea is that there could be some relevant auxiliary information (the class label or something else) that is useful to incorporate. While there are various ways to use this label/auxiliary information, I will do so by creating a separate convolutional filter. Let's say the model has 15 typical convolutional filters. I would like to add one more convolutional filter that corresponds to the MNIST digit and can be thought of as an embedding of the digit in the form of a 3x3 kernel. We would use the digit as an additional input to the network and then learn a distinct kernel/filter embedding for each digit.
However, I am having difficulty implementing a convolutional filter/kernel that is input dependent. I am not using the tf.keras.layers.Conv2D layer because it takes the number of filters to use, not the actual filter parameters, so it cannot be made input dependent.
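import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# load and preprocess data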
num_classes = 10
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, x_test = np.float32(x_train)/255, np.float32(x_test)/255
x_train, x_test = np.expand_dims(x_train, axis=-1), np.expand_dims(x_test, axis=-1)
y_train = keras.utils.to_categorical(y_train, num_classes=num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes=num_classes)
num_filters = 15
input_img = layers.Input(shape=(28,28,1))
conv_0 = keras.layers.Conv2D(num_filters, (3,3), strides=2, padding='same', activation='relu')(input_img)
# embed the target as a 3x3 kernel/filter -> this should map to a distinct embedding for
# each target
target = layers.Input(shape=(10,))
target_encoded = layers.Dense(9, activation='relu')(target)
target_encoded = layers.Reshape((3,3,1,1))(target_encoded)
# Using tf.nn.conv2d so that I can specify kernel
# Kernel needs to be a 4D tensor of dimensions (filter_height, filter_width, input_channels, output_channels)
# which in this case is (3,3,1,1)
# However it is currently (None,3,3,1,1) because the first dimension is batch size so this doesn't work
target_conv = tf.nn.conv2d(input_img, target_encoded, strides=[1, 1, 1, 1], padding='SAME')
I am currently using tf.nn.conv2d, which takes a kernel as input in the format (filter_height, filter_width, input_channels, output_channels). However, this doesn't work as is because data is fed in batches. Each sample in the batch has a label and therefore its own kernel, so the kernels have shape (None, 3, 3, 1, 1), which is not compatible with the expected format. This is illustrated in the code chunk above (which doesn't work). What are potential workarounds? Is there a simpler way to implement this concept of an input-dependent conv2d filter?

Making A Conv2D with SWAPPABLE kernel!
You'll need to make your own Conv2D that takes as input the image to process AND the kernel to use.
# Define our new Convolution
class DynamicConv2D(tf.keras.layers.Layer):
    def __init__(self, padding='SAME'):
        super(DynamicConv2D, self).__init__()
        self.padding = padding

    def call(self, input, kernel):
        return tf.nn.conv2d(input=input, filters=kernel,
                            strides=(1,1), padding=self.padding)
And let's test it out
dc2d = DynamicConv2D(padding='VALID')
input_tensor = np.ones([1,4,4,3],dtype=np.float32)
kernel_tensor = np.ones([2,2,3,1],dtype=np.float32)
dc2d(input_tensor, kernel_tensor)
returns
array([[[[12.], [12.], [12.]],
[[12.], [12.], [12.]],
[[12.], [12.], [12.]]]])
It looks like it works great... but there is a HUGE problem
HUGE ISSUE WITH KERAS - BATCH BY DEFAULT
Yeah, so here is the deal: tensorflow keras is really really really set on everything being set up so the first dimension is the batch. But if you look up above we have to specify the ONE KERNEL for the whole batch. We can't pass in a batch of kernel_tensorS, but just one.
THERE IS A WORK AROUND!
Let's borrow something from RNN training schemes, specifically we are going to solve this by being careful about what we send per batch. More specifically, for a batch we are going to make sure all input images use the same kernel_tensor. You'll have to figure out how you do that efficiently with your data pipeline, but here is an example to get you going.
Working Code
(We will rewrite our dynamic conv2d so that it takes a category and stores its
own kernel per category)
# Define our new Convolution
class DynamicConv2D(tf.keras.layers.Layer):
    def __init__(self, padding='SAME', input_dim=10, kernel_shape=[3,3,1,8]):
        super(DynamicConv2D, self).__init__()
        self.padding = padding
        self.input_dim = input_dim
        self.kernel_shape = kernel_shape
        self.kernel_size = kernel_shape[0]*kernel_shape[1]*kernel_shape[2]*kernel_shape[3] # = 3*3*1*8
        self.category_to_kernel = tf.keras.layers.Embedding(self.input_dim,self.kernel_size)

    def call(self, input, categories):
        just_first_category = tf.slice(categories,(0,0),(1,1))
        flat_kernel = self.category_to_kernel(just_first_category)
        kernel = tf.reshape(flat_kernel,self.kernel_shape)
        return tf.nn.conv2d(input=input, filters=kernel, strides=(1,1), padding=self.padding)
By default this class does a 3x3 convolution, reading in 1 channel from the previous layer and outputting 8 channels.
# Example output
dc2d = DynamicConv2D(padding='VALID')
image_data = np.ones([4,10,10,1],dtype=np.float32)
# prove that you can send in a different category and get different results
print( dc2d(image_data, [[3]]*4).numpy()[0,0,0,:3] )
print( dc2d(image_data, [[4]]*4).numpy()[0,0,0,:3] )
--------
[ 0.014 -0.002 0.108]
[ 0.021 0.014 -0.034]
Use it to make a tf.Keras model
# model input
image_input = tf.keras.Input(shape=(28,28,1), dtype=tf.float32)
category_input = tf.keras.Input(shape=(1,), dtype=tf.int32)
# do convolution
dynamic_conv2d = DynamicConv2D(padding='VALID')(image_input, category_input)
# make the model
model = tf.keras.Model(inputs=[image_input, category_input], outputs=dynamic_conv2d)
And we can use the model like so
# use the model
input_as_tensor = tf.constant(image_data,dtype=tf.float32)
category_as_tensor = tf.constant([[4]]*4,dtype=tf.int32)
result = model.predict(x=(input_as_tensor, category_as_tensor))
print('The output shape is',result.shape)
print('The first 3 values of the first output image are', result[0,0,0,:3])
---------
The output shape is (4, 8, 8, 8)
The first 3 values of the first output image are [-0.028 -0.009 0.015]

TensorFlow post-LSTM fully connected layer outputs return the same values as each other

I was trying to train a sequence-to-sequence LSTM model with a dataset with three labels: [1, 0] for detection of class 1, [0, 1] for detection of class 2, and [0, 0] for detection of nothing. After getting the outputs from the LSTM network, I applied a fully connected layer to each cell's output the following way:
outputs, state = tf.nn.dynamic_rnn(cell, input)
# Shape of outputs is [batch_size, n_time_steps, n_hidden]
# As matmul works only on matrices, reshape to get the
# time dimension into the batch dimension
outputs = tf.reshape(outputs, [-1, n_hidden])
# Shape is [batch_size * n_time_steps, n_hidden]
w = tf.Variable(tf.truncated_normal(shape=[n_hidden, 2], stddev=0.1))
b = tf.Variable(tf.constant(0.1, shape=[2]))
logit = tf.add(tf.matmul(outputs, w), b, name='logit')
# Reshape back to [batch_size, n_time_steps, 2]
logit = tf.reshape(logit, [batch_size, -1, 2])
On the output, I apply tf.nn.sigmoid_cross_entropy_with_logits and reduce the mean. The model seems to work just fine achieving high accuracy and recall, except for the fact that in almost all the cases it outputs either [0, 0], or [1, 1]. The two logit outputs from the fully connected layer always have very similar values (but not the same). This effectively puts a hard-cap on precision of 50%, which the model converges to (but not a fraction of a percent above).
Now, my intuition would tell me that something must be wrong with the training step and both fully connected outputs are trained on the same data, but curiously enough when I replace my own implementation with the prepackaged one from tf.contrib:
outputs, state = tf.nn.dynamic_rnn(cell, input)
logit = tf.contrib.layers.fully_connected(outputs, 2, activation_fn=None)
without changing a single other thing, the model starts training properly. Now, the obvious solution would be to just use that implementation, but why doesn't the first one work?
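As a sanity check on the first implementation, the reshape, matmul, reshape sequence can be compared against applying the same dense layer to the last axis directly. The NumPy sketch below uses made-up sizes and random data. It only rules out the reshape arithmetic as the cause; other differences between the two versions, such as how weights and biases are initialized, would have to be checked separately.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_time_steps, n_hidden = 4, 7, 5
outputs = rng.normal(size=(batch_size, n_time_steps, n_hidden))
w = rng.normal(size=(n_hidden, 2))
b = rng.normal(size=(2,))

# Path 1: fold time into the batch axis, matmul, reshape back (as in the question).
flat = outputs.reshape(-1, n_hidden)
logit_reshaped = (flat @ w + b).reshape(batch_size, -1, 2)

# Path 2: apply the same dense layer to the last axis directly.
logit_direct = np.einsum('bth,ho->bto', outputs, w) + b

print(np.allclose(logit_reshaped, logit_direct))
```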

Force symmetry for a TensorFlow conv2d kernel

I'd like to enforce symmetry in the weights within a Variable. What I really want is an approximate circular symmetry, but I could also imagine enforcing row or column symmetry.
The goal is to reduce training time by reducing the number of free variables. I know my problem calls for a symmetric array, but I might want to include both symmetric and "free" variables. I am using conv2d now, so I believe I need to keep using it.

Here is a function that creates a kernel symmetric with respect to reflection over its center row:
def SymmetricKernels(height,width,in_channels,out_channels,name=None):
    half_kernels = tf.Variable(initial_value=tf.random_normal([(height+1)//2,width,in_channels,out_channels]))
    half_kernels_reversed = tf.reverse(half_kernels[:(height//2),:,:,:],[0])
    kernels = tf.concat([half_kernels,half_kernels_reversed],axis=0,name=name)
    return kernels
Usage example:
w = SymmetricKernels(5,5,1,1)
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
w_ = sess.run(w)
w_[:,:,0,0]
# output:
# [[-1.299 -1.835 -1.188 0.093 -1.736]
# [-1.426 -2.087 0.434 0.223 -0.65 ]
# [-0.217 -0.802 -0.892 -0.229 1.383]
# [-1.426 -2.087 0.434 0.223 -0.65 ]
# [-1.299 -1.835 -1.188 0.093 -1.736]]
The idea is to use tf.Variable() to create only the upper half of each kernel (half_kernels), and then form the symmetric kernels by concatenating the upper half with its reflected version.
This idea can be extended to also create kernels with both left-right and up-down symmetry.
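A minimal NumPy sketch of that extension, assuming the free parameters are stored as the top-left quarter of a single 2-D kernel: the function name symmetric_kernel_from_quarter is hypothetical, and in TensorFlow the quarter would be a tf.Variable with tf.reverse and tf.concat doing the mirroring.

```python
import numpy as np

def symmetric_kernel_from_quarter(quarter, height, width):
    """quarter: [(height+1)//2, (width+1)//2] free parameters (top-left block)."""
    # Mirror left to right, leaving out the centre column when width is odd.
    top = np.concatenate([quarter, quarter[:, :width // 2][:, ::-1]], axis=1)
    # Mirror top to bottom, leaving out the centre row when height is odd.
    return np.concatenate([top, top[:height // 2][::-1]], axis=0)

rng = np.random.default_rng(0)
quarter = rng.normal(size=(3, 3))   # 9 free parameters instead of 25
kernel = symmetric_kernel_from_quarter(quarter, 5, 5)
print(kernel.round(2))
```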

Another thing you can try is to tie the net's hands by convolving twice, reusing the kernel but flipping it for the second convolution (untested code):
def symmetric_convolution(input_tensor, n_filters, size, name, dilations=[1,1,1,1]):
    with tf.variable_scope("", reuse=tf.AUTO_REUSE):
        kernel = tf.get_variable(shape=[*size, input_tensor.shape[-1], n_filters], name='conv_kernel_' + name, ...)
        lr_flipped_kernel = tf.reverse(kernel, axis=[1], name='conv_kernel_flipped_lr_' + name)
        conv_l = tf.nn.conv2d(input=input_tensor, filter=kernel, strides=[1, 1, 1, 1], padding='SAME', dilations=dilations)
        conv_r = tf.nn.conv2d(input=input_tensor, filter=lr_flipped_kernel, strides=[1, 1, 1, 1], padding='SAME', dilations=dilations)
        return tf.reduce_max(tf.concat([conv_l, conv_r], axis=-1), keepdims=True, axis=[-1])
You can add in biases, activations, etc. as needed. I've used something similar in the past – reduce_max will allow your kernel to take whatever shape, and effectively give you two convolutions for one; if you use reduce_sum instead, any asymmetries will average out quite quickly and your kernel will be symmetric. What works best will depend on your use case.
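The reduce_sum variant can be understood through linearity: summing the responses of a kernel and its left-right flip gives the same result as one convolution with the sum of the two kernels, and that sum is symmetric by construction. The NumPy sketch below checks this for a single channel. The helper correlate2d_same is a hypothetical, loop-based stand-in for tf.nn.conv2d with 'SAME' padding and an odd-sized kernel.

```python
import numpy as np

def correlate2d_same(image, kernel):
    """Zero-padded 'SAME' 2-D cross-correlation for odd-sized kernels."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(6, 6))
kernel = rng.normal(size=(3, 3))
flipped = kernel[:, ::-1]      # left-right flip of the same weights

summed = correlate2d_same(image, kernel) + correlate2d_same(image, flipped)
effective = kernel + flipped   # symmetric under left-right reflection
print(np.allclose(summed, correlate2d_same(image, effective)))
```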
