If batch_size equals 1 in tf.layers.batch_normalization(), will it work correctly? - tensorflow

Hi everyone. I am using TensorFlow 1.4 to train a U-Net-like model. Due to hardware constraints, the batch_size can only be set to 1 during training; otherwise I get an OOM error.
Here is my question. When batch_size equals 1, will tf.layers.batch_normalization() work correctly (that is, the moving average, moving variance, gamma, and beta)? Will such a small batch_size make it unstable?
In my work, I set training=True when training and training=False when testing. When training, I use:
logits = mymodel.inference()
loss = tf.losses.mean_squared_error(labels, logits)
# Run the batch-norm moving-average updates together with each training step.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)
...
saver = tf.train.Saver(tf.global_variables())
with tf.Session() as sess:
    sess.run(tf.group(tf.global_variables_initializer(),
                      tf.local_variables_initializer()))
    sess.run(train_op)
    ...
    saver.save(sess, save_path, global_step)
When testing, I use:
logits = model.inference()
saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, checkpoint)
    sess.run(tf.local_variables_initializer())
    results = sess.run(logits)
Could anyone tell me whether I am using this incorrectly? And how much does a batch_size of 1 affect tf.layers.batch_normalization()?
Any help will be appreciated! Thanks in advance.

Yes, tf.layers.batch_normalization() works with batches of single elements. Doing batch normalization on such batches is actually named instance normalization (i.e. normalization of a single instance).
@Maxim made a great post about instance normalization if you want to know more. You can also find more theory on the web and in the literature, e.g. "Instance Normalization: The Missing Ingredient for Fast Stylization".
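As a rough illustration of that claim, here is a minimal pure-Python sketch (not TensorFlow code). It assumes channels-last inputs shaped [N][H][W][C], training-mode statistics, and no gamma/beta; the helper names channel_moments, batch_norm_train and instance_norm are made up for this example. With a single image in the batch the two normalizations coincide; with two images they generally do not.

```python
def channel_moments(images, channel):
    """Mean and (biased) variance of one channel over all images and pixels."""
    values = [px[channel] for img in images for row in img for px in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def batch_norm_train(images, channel, eps=1e-3):
    """Normalize one channel with statistics taken over the whole batch."""
    mean, var = channel_moments(images, channel)
    return [[[(px[channel] - mean) / (var + eps) ** 0.5 for px in row]
             for row in img]
            for img in images]


def instance_norm(images, channel, eps=1e-3):
    """Normalize one channel with statistics taken per image."""
    return [batch_norm_train([img], channel, eps)[0] for img in images]


# One image (N=1, H=2, W=2, C=1) and a two-image batch built from it.
single = [[[[50.0], [100.0]], [[150.0], [200.0]]]]
pair = single + [[[[0.0], [10.0]], [[20.0], [30.0]]]]

same_for_one = batch_norm_train(single, 0) == instance_norm(single, 0)
same_for_two = batch_norm_train(pair, 0) == instance_norm(pair, 0)
print(same_for_one, same_for_two)
```

The equivalence only concerns the statistics used in training mode; at test time batch normalization switches to its moving averages, which instance normalization does not have.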

Related

can't reproduce model.fit with GradientTape

I've been trying to investigate (e.g. by checking weights, gradients, and activations during training) why SGD with a 0.001 learning rate works in training while Adam fails to do so. (Please see my previous post, "Why is my loss (binary cross entropy) converging on ~0.6? (Task: Natural Language Inference)".)
Note: I'm using the same model from my previous post here as well.
Using tf.keras, I trained the neural network with model.fit():
model.compile(optimizer=SGD(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(x=ds,
          epochs=80,
          validation_data=ds_val)
This resulted in the epoch loss graphed below: within the first epoch it reached a train loss of 0.46, ultimately ending with a train_loss of 0.1241 and a val_loss of 0.2849.
I would have used tf.keras.callbacks.TensorBoard(histogram_freq=1) to train the network with both SGD(0.001) and Adam to investigate, but it throws an InvalidArgumentError on Variable:0, which I can't decipher. So I tried to write a custom training loop using GradientTape and plot the values.
Using tf.GradientTape(), I tried to reproduce the results with the exact same model and dataset. However, training is incredibly slow, reaching a train loss of 0.676 after 15 epochs (see graph below). Is there something wrong with my implementation? (Code below.)
@tf.function
def compute_grads(train_batch: Dict[str, tf.Tensor], target_batch: tf.Tensor,
                  loss_fn: Loss, model: tf.keras.Model):
    with tf.GradientTape(persistent=False) as tape:
        # forward pass
        outputs = model(train_batch)
        # calculate loss
        loss = loss_fn(y_true=target_batch, y_pred=outputs)
    # calculate gradients for each param
    grads = tape.gradient(loss, model.trainable_variables)
    return grads, loss
BATCH_SIZE = 8
EPOCHS = 15
bce = BinaryCrossentropy()
optimizer = SGD(learning_rate=0.001)
for epoch in tqdm(range(EPOCHS), desc='epoch'):
    # - accumulators
    epoch_loss = 0.0
    for (i, (train_batch, target_dict)) in tqdm(enumerate(ds_train.shuffle(1024).batch(BATCH_SIZE)), desc='step'):
        (grads, loss) = compute_grads(train_batch, target_dict['target'], bce, model)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        epoch_loss += loss
    avg_epoch_loss = epoch_loss/(i+1)
    tensorboard_scalar(writer, name='epoch_loss', data=avg_epoch_loss, step=epoch) # custom helper function
    print("Epoch {}: epoch_loss = {}".format(epoch, avg_epoch_loss))
Thanks in advance!
Check whether you shuffle your dataset; the problem may come from shuffling with the tf.data.Dataset method, which only shuffles within one buffer of elements at a time. keras.Model.fit may have yielded better results because it probably adds another round of shuffling.
Adding a shuffle with numpy.random.shuffle may improve training performance (from this reference).
An example of applying it when generating the dataset:
numpy_data = np.hstack([index_rows.reshape(-1, 1), index_cols.reshape(-1, 1), index_data.reshape(-1, 1)])
np.random.shuffle(numpy_data)
indexes = np.array(numpy_data[:, :2], dtype=np.uint32)
labels = np.array(numpy_data[:, 2].reshape(-1, 1), dtype=np.float32)
train_ds = data.Dataset.from_tensor_slices(
    (indexes, labels)
).shuffle(100000).batch(batch_size, drop_remainder=True)
If this does not work, you may need to use Dataset .repeat(epochs_number) and .shuffle(..., reshuffle_each_iteration=True):
train_ds = data.Dataset.from_tensor_slices(
    (np.hstack([index_rows.reshape(-1, 1), index_cols.reshape(-1, 1)]), index_data)
).shuffle(100000, reshuffle_each_iteration=True
).batch(batch_size, drop_remainder=True
).repeat(epochs_number)
for ix, (examples, labels) in train_ds.enumerate():
    train_step(examples, labels)
    current_epoch = ix // (len(index_data) // batch_size)
This workaround is neither beautiful nor natural, but for the moment you can use it to reshuffle each epoch. It is a known issue and will be fixed; in the future you can use for epoch in range(epochs_number) instead of .repeat().
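For intuition about the buffer-limited shuffling mentioned above, here is a simplified pure-Python model of a shuffle buffer. It is a sketch of the idea, not the actual tf.data implementation, and buffered_shuffle is a made-up name. In this model no element can be emitted more than buffer_size - 1 positions earlier than where it started, so a buffer much smaller than the dataset mostly mixes neighbours, while a buffer at least as large as the dataset gives a full shuffle.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in an order produced by a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        slot = rng.randrange(buffer_size)
        yield buffer[slot]
        buffer[slot] = item
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


# A small buffer over an ordered "dataset" of 1000 elements.
small = list(buffered_shuffle(range(1000), buffer_size=10, seed=0))
# How far ahead of its original position any element was emitted.
max_early = max(value - position for position, value in enumerate(small))

# A buffer as large as the dataset shuffles everything.
full = list(buffered_shuffle(range(1000), buffer_size=1000, seed=0))
print(max_early)
```

If the data on disk is ordered (for example, sorted by label), a buffer that is small relative to the dataset leaves most of that ordering in place, which is one reason an extra shuffle of the raw arrays can help.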
The solution provided here may also help a lot. You might want to check it out.
If this is not the case, you may want to speed up the TF 2.0 GradientTape loop. This may be the solution:
TensorFlow 2.0 introduces the concept of functions, which translate eager code into graph code.
The usage is pretty straightforward. The only change needed is that all relevant functions (like compute_loss and apply_gradients) have to be annotated with @tf.function.

A Tensorflow training agnostic to Eager and Graph modes

I spend some of my time coding novel (I wish) RNN cells in Tensorflow.
To prototype, I use eager mode (easier to debug).
In order to train, I migrate the code to a graph (runs faster).
I am looking for a wrapper code/example that can run forward pass and training in a way that will be agnostic to the mode I run it - eager or graph, as much as possible. I have in mind a set of functions/classes, to which the particular neural network/optimizer/data can be inserted, and that these set of functions/classes could run in both modes with minimal changes between the two. In addition, it is of course good that it would be compatible with many types of NN/optimizers/data instances.
I am quite sure that many had this idea.
I wonder if something like this is feasible given the current eager/graph integration in TF.
Yes. I have been wondering the same. In the Tensorflow documentation you can see:
The same code written for eager execution will also build a graph during graph execution. Do this by simply running the same code in a new Python session where eager execution is not enabled.
But this is hard to achieve, mostly because working with graphs means dealing with placeholders, which cannot be used in eager mode. I tried to get rid of placeholders using object-oriented layers and the Dataset API. This is the closest I could get to totally compatible code:
m = 128  # num_examples
n = 5  # num features
epochs = 2
batch_size = 32
steps_per_epoch = m // batch_size
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random_uniform([m, n], dtype=tf.float32),
     tf.random_uniform([m, 1], dtype=tf.float32)))
dataset = dataset.repeat(epochs)
dataset = dataset.batch(batch_size)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_dim=n),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1)
])
def train_eagerly(model, dataset):
    optimizer = tf.train.AdamOptimizer()
    iterator = dataset.make_one_shot_iterator()
    print('Training eagerly...')
    for epoch in range(epochs):
        print('Epoch', epoch)
        progbar = tf.keras.utils.Progbar(target=steps_per_epoch, stateful_metrics=['loss'])
        for step in range(steps_per_epoch):
            with tf.GradientTape() as tape:
                features, labels = iterator.get_next()
                predictions = model(features, training=True)
                loss_value = tf.losses.mean_squared_error(labels, predictions)
            grads = tape.gradient(loss_value, model.variables)
            optimizer.apply_gradients(zip(grads, model.variables))
            progbar.add(1, values=[('loss', loss_value.numpy())])
def train_graph(model, dataset):
    optimizer = tf.train.AdamOptimizer()
    iterator = dataset.make_initializable_iterator()
    print('Training graph...')
    with tf.Session() as sess:
        sess.run(iterator.initializer)
        sess.run(tf.global_variables_initializer())
        for epoch in range(epochs):
            print('Epoch', epoch)
            progbar = tf.keras.utils.Progbar(target=steps_per_epoch, stateful_metrics=['loss'])
            for step in range(steps_per_epoch):
                with tf.GradientTape() as tape:
                    features, labels = sess.run(iterator.get_next())
                    predictions = model(features, training=True)
                    loss_value = tf.losses.mean_squared_error(labels, predictions)
                grads = tape.gradient(loss_value, model.variables)
                optimizer.apply_gradients(zip(grads, model.variables))
                progbar.add(1, values=[('loss', loss_value.eval())])
As you can see, the main difference is that I use a one_shot_iterator during Eager training (of course, during graph training, I have to run operations within a session).
I tried to do the same using optimizer.minimize instead of applying the gradients myself, but I could not come up with code that worked in both eager and graph modes.
Also, I'm sure this becomes much harder with less simple models, like the one you are working with.

Using two tensorflow models where one is for inference and another is for training

I'm new to tensorflow and I'm trying to combine two models in one graph because I need one model's inference result to modify the loss function of the other model. I wrote the code and it runs without errors but I'm not sure whether I wrote it correctly so I'm writing this thread.
In the code, I loaded two graphs like this
with tf.variable_scope("modelA"):
    new_saver = tf.train.import_meta_graph('modelA-1000.meta')
    new_saver.restore(sess, tf.train.latest_checkpoint('./'))
with tf.variable_scope("modelB"):
    new_saver = tf.train.import_meta_graph('modelB-1000.meta')
    new_saver.restore(sess, tf.train.latest_checkpoint('./'))
and I used modelA's result to modify modelB's loss function as follows
output_A = tf.get_default_graph().get_tensor_by_name("modelA_output:0")
output_B = tf.get_default_graph().get_tensor_by_name("modelB_output:0")
loss = tf.reduce_mean(-tf.reduce_sum(output_A * tf.log(output_B ), reduction_indices=[1]))
Then for training, I included only modelB variables to train since I want to make Model A for inference only.
model_vars = tf.trainable_variables()
var_B = [var for var in model_vars if 'modelB' in var.name]
gradient = tf.gradients(loss,var_B)
trainer = tf.train.GradientDescentOptimizer(0.1)
train_step = trainer.apply_gradients(zip(gradient ,var_B))
... declare session and prepare batch for training ...
for i in range(10000):
    loss_ = train_step.run(loss, feed_dict={x: batch[0]})
I ran it and the code runs but the loss does not decrease. What did I do wrong? Thanks for reading!
I am not sure how this code runs. train_step is an Operation, and the Operation.run() method takes a feed_dict and an optional session, so I don't know how train_step.run(loss, feed_dict={x: batch[0]}) can run. Generally, you would do something like this:
with tf.Session() as sess:
    _, _loss = sess.run([train_step, loss], feed_dict=...)
As a side note, if you have the code that produced modelA and modelB in the first place, it is better (i.e. less brittle) to rerun that code to recreate the graph. Once the graph is created, you can restore the variable values from your checkpoint using Saver. This avoids brittle extractions like
output_A = tf.get_default_graph().get_tensor_by_name("modelA_output:0")

What does batch normalization do if the batch size is one?

I am currently reading the paper by Ioffe and Szegedy about batch normalization, and I'm wondering what happens if the batch size is set to one. The computation of the mini-batch mean (which is basically the value of the activation itself) and variance (which should be zero plus the constant epsilon) would lead to a normalized dimension of zero.
Yet this small example in TensorFlow shows that something different is happening:
test_img = np.array([[[[50],[100]],
                      [[150],[200]]]], np.float32)
gt_img = np.array([[[[60],[130]],
                    [[180],[225]]]], np.float32)
test_img_op = tf.convert_to_tensor(test_img, tf.float32)
norm_op = tf.layers.batch_normalization(test_img_op)
loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels = gt_img,
                                                                 logits = norm_op))
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    optimizer_obj = tf.train.AdamOptimizer(0.01).minimize(loss_op)
with tf.Session() as sess:
    sess.run(tf.group(tf.global_variables_initializer(),
                      tf.local_variables_initializer()))
    print(test_img)
    while True:
        new_img, op, lossy, trainable = sess.run([norm_op, optimizer_obj, loss_op, tf.trainable_variables()])
        print(trainable)
        print(new_img)
So what is TensorFlow doing differently (moving average?)?
Thank you!
Because of beta, the learnable shift parameter, which is enabled by default, the normalized output will not necessarily be zero.
Moving averages of the input mean and variance are computed during training and can be used at test time (if you set the training argument accordingly).
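To make these two points concrete, here is a small pure-Python sketch for a single scalar feature. It assumes the documented defaults of tf.layers.batch_normalization (momentum=0.99, epsilon=0.001, moving mean initialized to 0 and moving variance to 1). TinyBatchNorm is a made-up class, and the moving-average update is applied inline here instead of through UPDATE_OPS. Note also that training defaults to False in tf.layers.batch_normalization, in which case the moving statistics are used rather than the batch statistics.

```python
class TinyBatchNorm:
    """Scalar-feature batch norm; a simplified stand-in, not the TF layer."""

    def __init__(self, momentum=0.99, eps=1e-3):
        self.gamma = 1.0        # learnable scale
        self.beta = 0.0         # learnable shift
        self.moving_mean = 0.0  # statistics used when training=False
        self.moving_var = 1.0
        self.momentum = momentum
        self.eps = eps

    def __call__(self, batch, training):
        if training:
            # Use the statistics of the current batch and update the averages.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1.0 - m) * mean
            self.moving_var = m * self.moving_var + (1.0 - m) * var
        else:
            # Use the accumulated moving statistics instead.
            mean, var = self.moving_mean, self.moving_var
        scale = self.gamma / (var + self.eps) ** 0.5
        return [scale * (x - mean) + self.beta for x in batch]


bn = TinyBatchNorm()
bn.beta = 0.5  # pretend the optimizer has already moved beta away from 0

# Batch of one in training mode: the value normalizes to 0, so only beta is left.
train_out = bn([100.0], training=True)

# Inference mode: the barely-updated moving statistics are used instead.
eval_out = bn([100.0], training=False)
print(train_out, eval_out, bn.moving_mean, bn.moving_var)
```

With momentum=0.99 the moving statistics change slowly, so after only a few steps the inference-mode output can still be far from the training-mode output.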

Summary for a specific branch

I have a tensorflow graph that has a complicated loss function for the training, but a simpler one for evaluation (they share ancestors). Essentially this
train_op = ... (needs more things in feed_dict etc.)
acc = .... (just needs one value for placeholder)
To better understand what's going on, I added summaries. But when calling
merged = tf.summary.merge_all()
and then
(summ, acc) = session.run([merged, acc_eval], feed_dict={..})
TensorFlow complains that values for placeholders are missing.
As far as I understand your question, to write a summary for a specific TensorFlow op, you should run that summary op specifically.
For example:
# define accuracy ops
correct_prediction = tf.equal(tf.argmax(Y, axis=1), tf.argmax(Y_labels, axis=1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, dtype=tf.float32))
# summary_accuracy produces the Summary protocol buffer; run it
# instead of merge_all() if you want to summarize specific ops
summary_accuracy = tf.summary.scalar('testing_accuracy', accuracy)
# define writer file
sess.run(tf.global_variables_initializer())
test_writer = tf.summary.FileWriter('log/test', sess.graph)
(summ, acc) = sess.run([summary_accuracy, accuracy], feed_dict={..})
test_writer.add_summary(summ)
Also, you can use tf.summary.merge(), which is documented here.
Hope this helps!