I know how to use TensorBoard to get the graph, but I'm curious whether I can get the testing loss value during training by leaving out "train_op". I'm also wondering whether it's OK to evaluate values on another dataset, without training on it, simply by omitting "train_op".
loss = tf.reduce_mean(tf.square(y_ - y), name='square_mean')
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
# this is for training, and I put train_op inside.
for _ in xrange(step):
_, loss_value = sess.run([train_op, loss], feed_dict={x: batch_x, y_:np.transpose([batch_y])})
# just feed the data and get one loss value in some epochs
loss_test = sess.run(loss, feed_dict={x: testing_batch, y_: np.transpose([label_t_batch])})
Leaving out the train_op is perfectly fine, as it's a node (more or less) like all others. Not fetching it simply means it won't be run, so the gradient descent step will not happen.
As for your second question, as long as the format of the other datasets you're talking about matches the input format of your graph, there should not be any issue.
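For example, a minimal sketch of this pattern, assuming the same loss, placeholders and batches as in your snippet (num_steps and eval_every are just hypothetical placeholders here):
import numpy as np

for step in xrange(num_steps):
    # Fetching train_op together with loss performs one gradient descent step.
    _, train_loss = sess.run([train_op, loss],
                             feed_dict={x: batch_x, y_: np.transpose([batch_y])})
    if step % eval_every == 0:
        # Fetching only loss runs a forward pass on the test batch;
        # since train_op is not fetched, no weights are updated.
        test_loss = sess.run(loss,
                             feed_dict={x: testing_batch,
                                        y_: np.transpose([label_t_batch])})
        print("step %d: train loss %.4f, test loss %.4f" % (step, train_loss, test_loss))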
I've been trying to investigate the reason (e.g. by checking weights, gradients and activations during training) why SGD with a 0.001 learning rate works in training while Adam fails to do so. (Please see my previous post, "Why is my loss (binary cross entropy) converging on ~0.6? (Task: Natural Language Inference)".)
Note: I'm using the same model from my previous post here as well.
Using tf.keras, I trained the neural network with model.fit():
model.compile(optimizer=SGD(learning_rate=0.001),
loss='binary_crossentropy',
metrics=['accuracy'])
model.fit(x=ds,
epochs=80,
validation_data=ds_val)
This resulted in the epoch loss graphed below: within the 1st epoch it reached a train loss of 0.46, and it ultimately ended with a train_loss of 0.1241 and a val_loss of 0.2849.
I would've used tf.keras.callbacks.TensorBoard(histogram_freq=1) to train the network with both SGD(0.001) and Adam to investigate, but it throws an InvalidArgumentError on Variable:0 that I can't decipher. So I tried to write a custom training loop using GradientTape and to plot the values myself.
Using tf.GradientTape(), I tried to reproduce the results with the exact same model and dataset. However, the epoch loss decreases incredibly slowly, only reaching a train loss of 0.676 after 15 epochs (see graph below). Is there something wrong with my implementation? (Code below.)
@tf.function
def compute_grads(train_batch: Dict[str,tf.Tensor], target_batch: tf.Tensor,
loss_fn: Loss, model: tf.keras.Model):
with tf.GradientTape(persistent=False) as tape:
# forward pass
outputs = model(train_batch)
# calculate loss
loss = loss_fn(y_true=target_batch, y_pred=outputs)
# calculate gradients for each param
grads = tape.gradient(loss, model.trainable_variables)
return grads, loss
BATCH_SIZE = 8
EPOCHS = 15
bce = BinaryCrossentropy()
optimizer = SGD(learning_rate=0.001)
for epoch in tqdm(range(EPOCHS), desc='epoch'):
# - accumulators
epoch_loss = 0.0
for (i, (train_batch, target_dict)) in tqdm(enumerate(ds_train.shuffle(1024).batch(BATCH_SIZE)), desc='step'):
(grads, loss) = compute_grads(train_batch, target_dict['target'], bce, model)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
epoch_loss += loss
avg_epoch_loss = epoch_loss/(i+1)
tensorboard_scalar(writer, name='epoch_loss', data=avg_epoch_loss, step=epoch) # custom helper function
print("Epoch {}: epoch_loss = {}".format(epoch, avg_epoch_loss))
Thanks in advance!
Check how you shuffle your dataset: the problem may come from shuffling with the tf.data.Dataset method, which only shuffles through the dataset one buffer at a time. Keras Model.fit probably yielded better results because it adds another level of shuffling.
Adding a shuffle with numpy.random.shuffle may improve the training performance, as suggested in this reference.
An example of applying it when generating the dataset:
numpy_data = np.hstack([index_rows.reshape(-1, 1), index_cols.reshape(-1, 1), index_data.reshape(-1, 1)])
np.random.shuffle(numpy_data)
indexes = np.array(numpy_data[:, :2], dtype=np.uint32)
labels = np.array(numpy_data[:, 2].reshape(-1, 1), dtype=np.float32)
train_ds = data.Dataset.from_tensor_slices(
(indexes, labels)
).shuffle(100000).batch(batch_size, drop_remainder=True)
If this does not work, you may need to use Dataset.repeat(epochs_number) and .shuffle(..., reshuffle_each_iteration=True):
train_ds = data.Dataset.from_tensor_slices(
(np.hstack([index_rows.reshape(-1, 1), index_cols.reshape(-1, 1)]), index_data)
).shuffle(100000, reshuffle_each_iteration=True
).batch(batch_size, drop_remainder=True
).repeat(epochs_number)
for ix, (examples, labels) in train_ds.enumerate():
train_step(examples, labels)
current_epoch = ix // (len(index_data) // batch_size)
This workaround is neither beautiful nor natural, but for the moment you can use it to shuffle each epoch. It's a known issue that will be fixed; in the future you will be able to use for epoch in range(epochs_number) instead of .repeat().
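For reference, a minimal sketch of that future-style per-epoch loop, assuming the same indexes, labels, batch_size and train_step as above:
train_ds = data.Dataset.from_tensor_slices(
    (indexes, labels)
).shuffle(100000, reshuffle_each_iteration=True
).batch(batch_size, drop_remainder=True)

for epoch in range(epochs_number):
    # Each fresh iteration over train_ds reshuffles the data because of
    # reshuffle_each_iteration=True, so no .repeat() is needed.
    for examples, batch_labels in train_ds:
        train_step(examples, batch_labels)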
The solution provided here may also help a lot. You might want to check it out.
If this is not the case, you may want to speed up the TF 2.0 GradientTape itself. This can be the solution:
TensorFlow 2.0 introduces the concept of functions, which translate eager code into graph code.
The usage is pretty straightforward. The only change needed is that all relevant functions (like compute_loss and apply_gradients) have to be annotated with @tf.function.
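For example, a rough sketch of the question's update step wrapped this way, assuming the same model, bce and optimizer objects (this merges the gradient computation and application into one compiled function rather than the exact split used above):
@tf.function
def train_step(train_batch, target_batch):
    # Traced once and then executed as a graph instead of eagerly.
    with tf.GradientTape() as tape:
        outputs = model(train_batch)
        loss = bce(y_true=target_batch, y_pred=outputs)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss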
I'm new to TensorFlow and I'm trying to combine two models in one graph because I need one model's inference result to modify the loss function of the other model. I wrote the code and it runs without errors, but I'm not sure whether I wrote it correctly, so I'm posting it here.
In the code, I loaded the two graphs like this:
with tf.variable_scope("modelA"):
new_saver = tf.train.import_meta_graph('modelA-1000.meta')
new_saver.restore(sess, tf.train.latest_checkpoint('./'))
with tf.variable_scope("modelB"):
new_saver = tf.train.import_meta_graph('modelB-1000.meta')
new_saver.restore(sess, tf.train.latest_checkpoint('./'))
and I used modelA's result to modify modelB's loss function as follows:
output_A = tf.get_default_graph().get_tensor_by_name("modelA_output:0")
output_B = tf.get_default_graph().get_tensor_by_name("modelB_output:0")
loss = tf.reduce_mean(-tf.reduce_sum(output_A * tf.log(output_B ), reduction_indices=[1]))
Then, for training, I included only modelB's variables, since I want to use model A for inference only.
model_vars = tf.trainable_variables()
var_B = [var for var in model_vars if 'modelB' in var.name]
gradient = tf.gradients(loss,var_B)
trainer = tf.train.GradientDescentOptimizer(0.1)
train_step = trainer.apply_gradients(zip(gradient ,var_B))
... declare session and prepare batch for training ...
for i in range(10000):
loss_ = train_step.run(loss, feed_dict={x: batch[0]})
I ran it and the code executes, but the loss does not decrease. What did I do wrong? Thanks for reading!
I am not sure how this code runs: train_step is an Operation, and the Operation.run() method takes a feed_dict and an optional session, so I don't see how train_step.run(loss, feed_dict={x: batch[0]}) can run at all. Generally, you would do something like this:
with tf.Session() as sess:
_, _loss = sess.run([train_step, loss], feed_dict=...)
As a side note, if you still have the code that produced modelA and modelB in the first place, it is better (i.e. less brittle) to rerun that code to recreate the graph. Once the graph is created, you can restore the variable values from your checkpoints using a Saver. This avoids brittle extractions like
output_A = tf.get_default_graph().get_tensor_by_name("modelA_output:0")
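A rough sketch of that approach, assuming hypothetical build_model_A / build_model_B functions that recreate the original graphs in code:
with tf.variable_scope("modelA"):
    output_A = build_model_A(x)   # rebuild model A's graph
with tf.variable_scope("modelB"):
    output_B = build_model_B(x)   # rebuild model B's graph

# Separate savers so each checkpoint restores only its own variables.
# (If the checkpoints were saved without the scope prefix, pass a
# name-mapping dict as var_list instead of a plain list.)
saver_A = tf.train.Saver([v for v in tf.global_variables() if v.name.startswith("modelA")])
saver_B = tf.train.Saver([v for v in tf.global_variables() if v.name.startswith("modelB")])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver_A.restore(sess, 'modelA-1000')
    saver_B.restore(sess, 'modelB-1000')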
Hi everyone. I am using TensorFlow 1.4 to train a U-net-like model for my purposes. Due to the constraints of my hardware, the batch_size can only be set to 1 during training, otherwise there is an OOM error.
Here comes my question: in this case, when batch_size equals 1, will tf.layers.batch_normalization() work correctly (that is, the moving average, moving variance, gamma and beta)? Will a small batch_size make it unstable?
In my work, I set training=True when training and training=False when testing. When training, I use:
logits = mymodel.inference()
loss = tf.losses.mean_squared_error(labels, logits)
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
train_op = optimizer.minimize(loss)
...
saver = tf.train.Saver(tf.global_variables())
with tf.Session() as sess:
sess.run(tf.group(tf.global_variables_initializer(),
tf.local_variables_initializer()))
sess.run(train_op)
...
saver.save(sess, save_path, global_step)
when testing, I use:
logits = model.inference()
saver = tf.train.Saver()
with tf.Session() as sess:
saver.restore(sess, checkpoint)
sess.run(tf.local_variables_initializer())
results = sess.run(logits)
Could anyone tell me whether I am using this wrong? And how much does batch_size = 1 influence tf.layers.batch_normalization()?
Any help will be appreciated! Thanks in advance.
Yes, tf.layers.batch_normalization() works with batches of single elements. Doing batch normalization on such batches is actually called instance normalization (i.e. normalization of a single instance).
@Maxim made a great post about instance normalization if you want to know more. You can also find more theory on the web and in the literature, e.g. "Instance Normalization: The Missing Ingredient for Fast Stylization".
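As a rough numerical sketch of what that means (plain NumPy, just to illustrate the statistics involved; the shapes are made up):
import numpy as np

x = np.random.randn(1, 64, 64, 8)            # a "batch" of a single feature map
# With batch_size = 1, the batch mean/variance used in training come from
# this single instance, computed per channel:
mean = x.mean(axis=(0, 1, 2))                 # shape (8,)
var = x.var(axis=(0, 1, 2))
x_norm = (x - mean) / np.sqrt(var + 1e-3)     # what batch norm effectively does here
# The moving averages (used at test time with training=False) are updated
# from these per-instance statistics, so they can be noisier than with
# larger batches.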
I have a TensorFlow graph that has a complicated loss function for training, but a simpler one for evaluation (they share ancestors). Essentially this:
train_op = ... (needs more things in feed_dict etc.)
acc = .... (just needs one value for the placeholder)
To better understand what's going on, I added summaries. But when I call
merged = tf.summary.merge_all()
and then
(summ, acc) = session.run([merged, acc_eval], feed_dict={..})
TensorFlow complains that values for placeholders are missing.
As far as I understand your question, to write a summary for a specific TensorFlow operation, you should run its summary op specifically.
For example:
# define accuracy ops
correct_prediction = tf.equal(tf.argmax(Y, axis=1), tf.argmax(Y_labels, axis=1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, dtype=tf.float32))
# summary_accuracy is the Summary protocol buffer you need to run,
# instead of merge_all(), if you want to summarize specific ops
summary_accuracy = tf.summary.scalar('testing_accuracy', accuracy)
# define writer file
sess.run(tf.global_variables_initializer())
test_writer = tf.summary.FileWriter('log/test', sess.graph)
(summ, acc) = sess.run([summary_accuracy, accuracy], feed_dict={..})
test_writer.add_summary(summ)
Also, you can use tf.summary.merge(), which is documented here.
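For example, a small sketch of merging only a chosen subset of summaries (summary_loss, loss and step here are hypothetical names; summary_accuracy, accuracy and test_writer are from the snippet above):
summary_loss = tf.summary.scalar('testing_loss', loss)
# Merge just these two summaries instead of everything in the graph:
merged_test = tf.summary.merge([summary_accuracy, summary_loss])

summ, acc = sess.run([merged_test, accuracy], feed_dict={...})  # same feed as above
test_writer.add_summary(summ, global_step=step)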
Hope this helps!
In TensorFlow, it seems that we have to propagate the input to the top layer once to compute the current training error, and propagate it another time to do the parameter update.
At the bottom of the TensorFlow MNIST example, the line:
train_accuracy = accuracy.eval(feed_dict={
x:batch[0], y_: batch[1], keep_prob: 1.0})
was used to compute accuracy, which required a feedforward process.
Next, the line
train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})
was used to update the weight parameters. I think this process also requires a feedforward pass first to get the network output, so that the mean squared error can be calculated for the back-propagation algorithm.
Do we really need to go through the whole network twice to get the current training error and update the parameters?
No need to go through it twice. Check out the other examples, such as mnist/convolutional.py:
_, l, lr, predictions = s.run(
[optimizer, loss, learning_rate, train_prediction],
feed_dict=feed_dict)
You fetch both nodes in a single run call, so the training step and the train prediction happen together. This is the standard way of training.
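Applied to the MNIST snippet from the question, that would look roughly like this (same x, y_, keep_prob, train_step and accuracy as there):
_, train_accuracy = sess.run(
    [train_step, accuracy],
    feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})
# Note: accuracy is then measured with dropout active (keep_prob 0.5),
# which differs slightly from evaluating separately with keep_prob 1.0.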
In general, I'd suggest checking out the examples in models/ first. The "red pill" and "blue pill" examples are meant to be a very gentle introduction to TensorFlow, but the examples in models/ are a bit more realistic. They're not production code, but they're closer to what you'd want to do.