I've recently picked up TensorFlow and have been trying my best to adjust to the environment. It has been nothing but wonderful! However, batch normalization using tf.contrib.layers.batch_norm has been a little tricky.
Right now, here is the function I'm using:
def batch_norm(x, phase):
    return tf.contrib.layers.batch_norm(x, center=True, scale=True,
                                        is_training=phase, updates_collections=None)
Using this, I followed most documentation (also Q & A) that I've found online and it led me to the following conclusions:
1) is_training should be set to True for training and False for testing. This makes sense! When training, I had convergence (error < 1%, CIFAR-10 dataset).
However, during testing my results are terrible (error > 90%) UNLESS I add updates_collections=None as an argument to the batch norm function above. Only with that argument does testing give me the error I expected.
I am also sure to use the following for training:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):  # Ensures updating ops will perform before training
    with tf.name_scope('Cross_Entropy'):
        cross_entropy = tf.reduce_mean(  # Implement Cross_Entropy to compute the softmax activation
            tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))  # Cross Entropy: True Output Labels (y_), Softmax output (y_conv)
        tf.summary.scalar('cross_entropy', cross_entropy)  # Graphical output Cross Entropy

    with tf.name_scope('train'):
        train_step = tf.train.AdamOptimizer(1e-2).minimize(cross_entropy)  # Train Network, Tensorflow minimizes cross_entropy via ADAM Optimization

    with tf.name_scope('Train_Results'):
        with tf.name_scope('Correct_Prediction'):
            correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))  # Check if prediction is wrong with tf.equal(CNN_result, True_result)
        with tf.name_scope('Accuracy'):
            accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))  # Find the percent accuracy, take mean of correct_prediction outputs
            tf.summary.scalar('accuracy', accuracy)  # Graphical output Classification Accuracy
This should make sure that the batch normalization parameters are updating during training.
So this leads me to believe that updates_collections=None is just a nice default for my batch normalization function, one that makes sure no batch normalization parameters are adjusted during testing... Am I correct?
Lastly: Is it normal to get good results (the expected error) during the testing phase with batch normalization turned on AND off? Using the batch norm function above, I was able to train well (is_training = True) and test well (is_training = False). However, when testing with is_training = True I was still able to get great results. This just gives me a bad feeling. Could someone explain why this is happening? Or whether it should be happening at all?
Thank you for your time!
A decay rate that is too high for the moving averages (the default is 0.999) might be the reason for reasonably good training performance but poor validation/test performance. Try a slightly lower decay rate (0.99 or 0.9). Also, try zero_debias_moving_mean=True for improved stability.
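To see why the decay rate matters, here is a minimal plain-Python sketch (not TensorFlow code) of the exponential moving average that batch normalization keeps for inference. It assumes the moving mean starts at 0 and that every batch has the same mean of 5.0. The helper names are made up for illustration, and the debiasing shown is the usual 1 - decay**t correction, which is only in the spirit of zero_debias_moving_mean and not a copy of its implementation.

```python
def moving_average(batch_means, decay, init=0.0):
    """Exponential moving average: avg = decay * avg + (1 - decay) * value."""
    avg = init
    for value in batch_means:
        avg = decay * avg + (1.0 - decay) * value
    return avg


def debiased_moving_average(batch_means, decay):
    """Same average, corrected for the bias introduced by starting at 0."""
    avg = moving_average(batch_means, decay, init=0.0)
    return avg / (1.0 - decay ** len(batch_means))


batch_means = [5.0] * 100  # 100 training steps, true mean is 5.0 (assumed data)

slow = moving_average(batch_means, decay=0.999)
fast = moving_average(batch_means, decay=0.9)
debiased = debiased_moving_average(batch_means, decay=0.999)

print("decay=0.999:", round(slow, 3))
print("decay=0.9:  ", round(fast, 3))
print("debiased:   ", round(debiased, 3))
```

With decay=0.999 the estimate is still far below 5.0 after 100 steps, so inference would normalize with a badly wrong mean. A lower decay, or debiasing, closes the gap much sooner.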
You can also try different batch sizes and see if validation performance improves. A large batch size can hurt validation performance when batch normalization is used. See this.
Is your phase variable a TensorFlow boolean or a Python boolean?
I've been trying to investigate the reason (e.g. by checking weights, gradients and activations during training) why SGD with a 0.001 learning rate worked in training while Adam fails to do so. (Please see my previous post, "Why is my loss (binary cross entropy) converging on ~0.6? (Task: Natural Language Inference)".)
Note: I'm using the same model from my previous post here as well.
Using tf.keras, I trained the neural network using model.fit():
model.compile(optimizer=SGD(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(x=ds,
          epochs=80,
          validation_data=ds_val)
This resulted in the epoch loss graphed below. Within the first epoch it reached a train loss of 0.46, ultimately ending at a train_loss of 0.1241 and a val_loss of 0.2849.
I would have used tf.keras.callbacks.TensorBoard(histogram_freq=1) to train the network with both SGD(0.001) and Adam to investigate, but it throws an InvalidArgumentError on Variable:0, which I can't decipher. So I tried to write a custom training loop using GradientTape and plot the values.
Using tf.GradientTape(), I tried to reproduce the results with the exact same model and dataset. However, the epoch loss decreases incredibly slowly, reaching a train loss of 0.676 after 15 epochs (see graph below). Is there something wrong with my implementation? (Code below.)
@tf.function
def compute_grads(train_batch: Dict[str, tf.Tensor], target_batch: tf.Tensor,
                  loss_fn: Loss, model: tf.keras.Model):
    with tf.GradientTape(persistent=False) as tape:
        # forward pass
        outputs = model(train_batch)
        # calculate loss
        loss = loss_fn(y_true=target_batch, y_pred=outputs)

    # calculate gradients for each param
    grads = tape.gradient(loss, model.trainable_variables)
    return grads, loss
BATCH_SIZE = 8
EPOCHS = 15

bce = BinaryCrossentropy()
optimizer = SGD(learning_rate=0.001)

for epoch in tqdm(range(EPOCHS), desc='epoch'):
    # - accumulators
    epoch_loss = 0.0

    for (i, (train_batch, target_dict)) in tqdm(enumerate(ds_train.shuffle(1024).batch(BATCH_SIZE)), desc='step'):
        (grads, loss) = compute_grads(train_batch, target_dict['target'], bce, model)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        epoch_loss += loss

    avg_epoch_loss = epoch_loss/(i+1)
    tensorboard_scalar(writer, name='epoch_loss', data=avg_epoch_loss, step=epoch)  # custom helper function
    print("Epoch {}: epoch_loss = {}".format(epoch, avg_epoch_loss))
Thanks in advance!
Check whether you shuffle your dataset; if so, the problem may come from shuffling with the tf.data.Dataset method. Dataset.shuffle only shuffles within one buffer of elements at a time. Keras Model.fit may have yielded better results because it probably adds another round of shuffling.
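The buffer behaviour can be illustrated without TensorFlow. The sketch below is a plain-Python imitation of a buffer-based shuffle (fill a buffer, emit a random element, refill from the stream). buffer_shuffle is a hypothetical helper written for this illustration, not the tf.data implementation, and the label-sorted dataset is an assumed worst case.

```python
import random


def buffer_shuffle(items, buffer_size, seed=0):
    """Shuffle a stream using only a fixed-size buffer."""
    rng = random.Random(seed)
    buffer, output = [], []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)          # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        output.append(buffer[idx])       # emit a random buffered element
        buffer[idx] = item               # replace it with the next input
    rng.shuffle(buffer)                  # drain what is left at the end
    output.extend(buffer)
    return output


# 1000 examples sorted by label: indices 0-499 are class 0, 500-999 are class 1.
n, buffer_size = 1000, 10
shuffled = buffer_shuffle(range(n), buffer_size)

first_batch = shuffled[:100]
print("classes in first 100 examples:", {i // 500 for i in first_batch})
```

With a buffer of 10, the first 100 outputs can only come from the first 109 inputs, so a dataset sorted by label yields early batches that contain a single class. A buffer at least as large as the dataset, or shuffling the underlying arrays first, avoids this.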
Adding a shuffle with numpy.random.shuffle may improve the training performance (from this reference).
An example of applying it when generating the dataset:
numpy_data = np.hstack([index_rows.reshape(-1, 1), index_cols.reshape(-1, 1), index_data.reshape(-1, 1)])
np.random.shuffle(numpy_data)
indexes = np.array(numpy_data[:, :2], dtype=np.uint32)
labels = np.array(numpy_data[:, 2].reshape(-1, 1), dtype=np.float32)
train_ds = data.Dataset.from_tensor_slices(
    (indexes, labels)
).shuffle(100000).batch(batch_size, drop_remainder=True)
If this does not work, you may need to use Dataset .repeat(epochs_number) and .shuffle(..., reshuffle_each_iteration=True):
train_ds = data.Dataset.from_tensor_slices(
    (np.hstack([index_rows.reshape(-1, 1), index_cols.reshape(-1, 1)]), index_data)
).shuffle(100000, reshuffle_each_iteration=True
).batch(batch_size, drop_remainder=True
).repeat(epochs_number)
for ix, (examples, labels) in train_ds.enumerate():
    train_step(examples, labels)
    current_epoch = ix // (len(index_data) // batch_size)
This workaround is neither beautiful nor natural, but for the moment you can use it to reshuffle each epoch. It's a known issue and will be fixed; in the future you can use for epoch in range(epochs_number) instead of .repeat().
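The epoch arithmetic in the loop above is easy to check in isolation. A small sketch, assuming drop_remainder=True so that every epoch has exactly num_examples // batch_size steps (epoch_of_step and the sizes used are made up for this example):

```python
def epoch_of_step(step, num_examples, batch_size):
    """Map a global step index from a .repeat()-ed dataset to its epoch."""
    steps_per_epoch = num_examples // batch_size  # partial batches are dropped
    return step // steps_per_epoch


# 1000 examples with batch size 64 give 15 full batches per epoch.
epochs = [epoch_of_step(s, num_examples=1000, batch_size=64)
          for s in (0, 14, 15, 29, 30)]
print(epochs)
```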
The solution provided here may also help a lot. You might want to check it out.
If this is not the case, you may want to speed up the TF 2.0 GradientTape code. This can be the solution:
TensorFlow 2.0 introduces the concept of functions, which translate eager code into graph code.
The usage is pretty straightforward. The only change needed is that all relevant functions (like compute_loss and apply_gradients) have to be annotated with @tf.function.
In TensorFlow 2.0 you don't have to worry about the training phase (batch size, number of epochs, etc.), because everything can be specified in a single call to the fit method: model.fit(X_train, Y_train, batch_size=64, epochs=100).
But I have seen the following code style:
optimizer = tf.keras.optimizers.Adam(0.001)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

@tf.function
def train_step(inputs, labels):
    with tf.GradientTape() as tape:
        predictions = model(inputs, training=True)
        regularization_loss = tf.math.add_n(model.losses)
        pred_loss = loss_fn(labels, predictions)
        total_loss = pred_loss + regularization_loss

    gradients = tape.gradient(total_loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

for epoch in range(NUM_EPOCHS):
    for inputs, labels in train_data:
        train_step(inputs, labels)
    print("Finished epoch", epoch)
So here you can observe "more detailed" code, where you manually define your training procedure with for loops.
I have the following question: what is the best practice in TensorFlow 2.0? I haven't found any complete tutorial.
Use what is best for your needs.
Both methods are documented in Tensorflow tutorials.
If you don't need anything special, no extra losses, strange metrics or intricate gradient computation, just use a model.fit() or a model.fit_generator(). This is totally ok and makes your life easier.
A custom training loop might come in handy when you have complicated models with non-trivial loss/gradients calculation.
Up to now, two applications I tried were easier with this:
Training a GAN's generator and discriminator simultaneously without having to do the generation step twice. (It's complicated because you have a loss function that applies to different y_true values, and each case should update only a part of the model.) The other option would require a few separate models, each with its own trainable=True/False configuration, and training them in separate phases.
Training inputs (good for style transfer models). Alternatively, create a custom layer that takes dummy inputs and outputs its own trainable weights. But it gets complicated to compile several loss functions for each of the outputs of the base and style networks.
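Stripped of any framework, the shape of a hand-written training loop can be sketched as follows. This is plain Python with a one-parameter model and a hand-derived gradient, purely to show the steps (forward pass, loss, gradient, update, per-epoch bookkeeping). The data, learning rate and function name are assumptions made for the example, not anything from TensorFlow.

```python
def train(data, epochs=50, lr=0.05):
    """Fit y = w * x with per-example gradient descent on squared error."""
    w = 0.0
    history = []
    for epoch in range(epochs):
        epoch_loss = 0.0
        for x, y in data:
            pred = w * x                    # forward pass
            loss = (pred - y) ** 2          # squared-error loss
            grad = 2.0 * (pred - y) * x     # d(loss)/dw, derived by hand
            w -= lr * grad                  # apply the gradient
            epoch_loss += loss
        history.append(epoch_loss / len(data))  # average loss for this epoch
    return w, history


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # samples of y = 2x
w, history = train(data)
print("learned w:", round(w, 4))
```

Every step in the inner loop is a place where custom logic can go, which is what model.fit() hides and what a custom loop exposes.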
I implemented a network using TensorFlow, and the loss does not converge. I then inspected some values in the network and found that the BN layer does not work. Please look at the following picture:
We can see that s2 is the result of batch normalization of s1, but the values in s2 are still very large. I don't know what the problem is. Why are the values in s2 so large?
I have uploaded my code to GitHub. Anyone who is interested can test it.
As per the official tensorflow documentation here,
when training, the moving_mean and moving_variance need to be updated.
By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so
they need to be executed alongside the train_op. Also, be sure to add
any batch_normalization ops before getting the update_ops collection.
Otherwise, update_ops will be empty, and training/inference will not
work properly.
For example:
training = tf.placeholder(tf.bool, name="is_training")
# ...
x_norm = tf.layers.batch_normalization(x, training=training)
# ...
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
train_op = optimizer.minimize(loss)
train_op = tf.group([train_op, update_ops])
# or, you can also do something like this:
# with tf.control_dependencies(update_ops):
#     train_op = optimizer.minimize(loss)
So it is really important to collect the update ops as stated in the TensorFlow documentation, because at training time the moving mean and moving variance of the layer have to be updated. If you don't do this, batch normalization will not work and the network will not train as expected. It is also useful to declare a placeholder that tells the network whether it is in training or inference mode: during test (or inference) time the mean and variance are fixed, and they are estimated from the previously calculated means and variances of each training batch.
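The effect of skipping the update ops can be reproduced in a few lines of plain Python. The sketch below is not TensorFlow code: TinyBatchNorm is a made-up scalar batch norm without learned scale and offset, the moving statistics are assumed to start at mean 0 and variance 1, and the run_update_ops flag stands in for actually running the ops collected in UPDATE_OPS.

```python
import math


class TinyBatchNorm:
    """Scalar batch norm without learned scale/offset, for illustration only."""

    def __init__(self, decay=0.9, eps=1e-3):
        self.decay, self.eps = decay, eps
        self.moving_mean = 0.0   # assumed initial values
        self.moving_var = 1.0

    def __call__(self, batch, training, run_update_ops=True):
        if training:
            # Training mode normalizes with the statistics of the current batch.
            mean = sum(batch) / len(batch)
            var = sum((v - mean) ** 2 for v in batch) / len(batch)
            if run_update_ops:
                d = self.decay
                self.moving_mean = d * self.moving_mean + (1 - d) * mean
                self.moving_var = d * self.moving_var + (1 - d) * var
        else:
            # Inference mode uses the stored moving statistics instead.
            mean, var = self.moving_mean, self.moving_var
        return [(v - mean) / math.sqrt(var + self.eps) for v in batch]


batch = [100.0, 102.0, 98.0, 104.0, 96.0]

updated, forgotten = TinyBatchNorm(), TinyBatchNorm()
for _ in range(200):
    updated(batch, training=True, run_update_ops=True)
    forgotten(batch, training=True, run_update_ops=False)

good = updated(batch, training=False)
bad = forgotten(batch, training=False)
print("with update ops:   ", [round(v, 2) for v in good])
print("without update ops:", [round(v, 2) for v in bad])
```

Both layers behave identically in training mode, which is why training looks fine either way. Only at inference does the layer whose moving statistics were never updated pass the large inputs through almost unchanged.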
I wrote a neural network using Keras. It contains BatchNormalization layers.
When I trained it with model.fit, everything was fine. When training it with TensorFlow as explained here, the training is fine, but the validation step always gives very poor performance, and it quickly saturates (the accuracy goes 5%, 10%, 40%, 40%, 40%...; the loss is stagnant too).
I need to use TensorFlow because it allows more flexibility regarding the monitoring part of training.
I strongly suspect it has something to do with the BN layers and/or the way I compute the test performance (see below).
feed_dict = {x: X_valid,
             batch_size_placeholder: X_valid.shape[0],
             K.learning_phase(): 0,
             beta: self.warm_up_schedule(global_step)
             }
if self.weights is not None:
    feed_dict[weights] = self.weights
acc = accuracy.eval(feed_dict=feed_dict)
Is there anything special to do when computing the validation accuracy of a model containing Keras BatchNormalization layers?
Thank you in advance !
Actually, I found out about the training argument of the __call__ method of the BatchNormalization layer.
So what you can do when instantiating the layer is just:
x = Input((dim1, dim2))
h = Dense(dim3)(x)
h = BatchNormalization()(h, training=K.learning_phase())
And when evaluating the performance on validation set:
feed_dict = {x: X_valid,
             batch_size_placeholder: X_valid.shape[0],
             K.learning_phase(): 0,
             beta: self.warm_up_schedule(global_step)
             }
acc = accuracy.eval(feed_dict=feed_dict)
summary_ = merged.eval(feed_dict=feed_dict)
test_writer.add_summary(summary_, global_step)
I'd like to know the possible ways to implement batch normalization layers that synchronize batch statistics when training on multiple GPUs.
Caffe: Maybe there are some variants of Caffe that could do this, like link. But for the BN layer, my understanding is that it still synchronizes only the outputs of layers, not the means and variances. Maybe MPI can synchronize means and variances, but I think MPI is a little difficult to implement.
Torch: I've seen some comments here and here which show that running_mean and running_var can be synchronized, but I think the batch mean and batch variance cannot, or are difficult to synchronize.
Tensorflow: Normally it is the same as Caffe and Torch. The BN implementation refers to this. I know TensorFlow can place an operation on any device specified by tf.device(). But the computation of means and variances is in the middle of the BN layer, so if I gather the means and variances on the CPU, my code will look like this:
cpu_gather = []
label_batches = []
# Run the first block on every GPU and collect the per-GPU activations.
for i in range(num_gpu):
    with tf.device('/gpu:%d' % i):
        with tf.variable_scope('block1', reuse=i > 0):
            image_batch, label_batch = cifar_input.build_input('cifar10', train_data_path, batch_size, 'train')
            label_batches.append(label_batch)
            x = _conv('weights', image_batch, 3, 3, 16, _stride_arr(1))
            cpu_gather.append(x)

# Compute the batch statistics once, on the CPU, over all GPUs' activations.
with tf.device('/cpu:0'):
    print(cpu_gather[0].get_shape())
    x1 = tf.concat(cpu_gather, 0)
    # print(x1.get_shape())
    mean, variance = tf.nn.moments(x1, [0, 1, 2], name='moments')

# Normalize each GPU's activations with the shared statistics.
for i in range(num_gpu):
    with tf.device('/gpu:%d' % i):
        with tf.variable_scope('block2', reuse=i > 0):
            shape = cpu_gather[i].get_shape().as_list()
            assert len(shape) in [2, 4]
            n_out = shape[-1]
            beta, gamma, moving_mean, moving_var = get_bn_variables(n_out, True, True)
            x = tf.nn.batch_normalization(
                cpu_gather[i], mean, variance, beta, gamma, 0.00001)
            x = _relu(x)
That is just for one BN layer. To gather statistics on the CPU, I have to break up the code. If I have more than 100 BN layers, that will be cumbersome.
I am not an expert in those libraries, so maybe there are some misunderstandings; feel free to point out my errors.
I do not care much about training speed. I am doing image segmentation, which consumes a lot of GPU memory, and BN needs a reasonable batch size (e.g. larger than 16) for stable statistics. So using multiple GPUs is inevitable. In my opinion, TensorFlow might be the best choice, but I can't resolve the problem of breaking up the code. Solutions with other libraries are welcome too.
I'm not sure if I fully understand your question, but provided you set up your variable scope properly, the tf.GraphKeys.UPDATE_OPS collection should automatically have the update ops for batch_norm for each of your towers. If all of the update_ops are applied synchronously, they will be implicitly averaged by the parameter server; all you have to do is make sure the updates are applied before you average and apply gradients (if I understand your intentions correctly).
Because of variable scope, each set of update ops will update the same variables, so to synchronize the update ops all you need to do is gate your gradient calculation on the complete set of update ops. You should also encapsulate all of your batch norm layers in a single name_scope to avoid grabbing any extraneous ops in UPDATE_OPS. Code skeleton below:
update_ops = []
for i, device in enumerate(devices):
    with tf.variable_scope('foo', reuse=bool(i > 0)):
        with tf.name_scope('tower_%d' % i) as name_scope:
            with tf.device(device):
                # Put as many batch_norm layers as you want here
                pass
            update_ops.extend(tf.get_collection(tf.GraphKeys.UPDATE_OPS,
                                                name_scope))
# make gradient calculation ops here
with tf.device(averaging_device):
    with tf.control_dependencies(update_ops):
        # average and apply gradients.
        pass
If you wanna try this on some existing code, try just deleting the if i == 0 line here: https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10_estimator/cifar10_main.py#L115
You're going to see some slow down (we usually only use one tower to compute batch norm statistics for this reason), but it should do what you want.
A specialized Keras layer, SyncBatchNormalization, has been available since TF 2.2:
https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/SyncBatchNormalization
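What any synchronized batch norm has to compute can be sketched in plain Python: each device contributes a count, a sum and a sum of squares, and the combined totals give the mean and variance of the whole batch. This is only an illustration of the arithmetic under made-up data and hypothetical helper names, not the SyncBatchNormalization implementation.

```python
import statistics


def partial_sums(values):
    """What each device would send: count, sum and sum of squares."""
    return len(values), sum(values), sum(v * v for v in values)


def synced_moments(towers):
    """Combine per-device partial sums into global batch mean and variance."""
    count = total = total_sq = 0.0
    for values in towers:
        n, s, sq = partial_sums(values)
        count, total, total_sq = count + n, total + s, total_sq + sq
    mean = total / count
    variance = total_sq / count - mean * mean   # E[x^2] - E[x]^2
    return mean, variance


towers = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]  # activations on two devices
mean, variance = synced_moments(towers)

per_tower_vars = [statistics.pvariance(t) for t in towers]
print("synced mean/var:", mean, round(variance, 4))
print("per-tower vars: ", [round(v, 4) for v in per_tower_vars])
```

The synchronized variance matches that of the concatenated batch and is much larger than either per-device variance here, which is the discrepancy that unsynchronized batch norm ignores. Only three numbers per channel cross devices, so the activations themselves never need to be gathered in one place.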
I've figured out a way to implement sync batch norm in pure tensorflow and pure python.
The code makes it possible to train PSPNet on Cityscapes and get comparable performance.