Not able to generate Log files - tensorflow

I am trying to run a TensorFlow model. The code works and the folder gets created, but no log file is generated inside it. Any help would be deeply appreciated.
Here is the relevant code:
LOGGING_PATH = 'board/'

folder_name = f'model1 at {strftime("%H-%M")}'
directory = os.path.join(LOGGING_PATH, folder_name)

try:
    os.makedirs(directory)
except OSError as exception:
    print(exception.strerror)
else:
    print('Successfully Created Directory')

summary_log = tf.summary.create_file_writer(directory)
for epoch in range(nr_epochs):
    for i in range(iterations):
        batch_x, batch_y = next_batch(batch_size=size_of_batch, data=true_train, labels=true_label)
        feed_dictionary = {X: batch_x, Y: batch_y}
        sess.run(train_step, feed_dict=feed_dictionary)

    batch_accuracy = sess.run(fetches=[accuracy], feed_dict=feed_dictionary)

    with summary_log.as_default():
        tf.summary.scalar('loss', loss_, step=epoch)
        tf.summary.scalar('accuracy', accuracy, step=epoch)

    template = 'Epoch{}, Loss: {}, Accuracy: {}'
    print(f'Epoch {epoch}\t| Training Accuracy={batch_accuracy}')

print('Done Training')
Below are the loss function and accuracy ops:
loss_ = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=Y, logits=layer3_out))
correct_prediction = tf.equal(tf.argmax(layer3_out, axis=1), tf.argmax(Y, axis=1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
Is there something wrong with the syntax, or am I missing something? A folder named model1 at <time> is created every time I run the code, but no log file appears inside it.
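For comparison, here is a minimal sketch of the plain TF2 eager summary-writing pattern that does produce event files. It assumes the scalar values are already plain Python numbers (in the code above they are graph tensors fed through a tf.compat.v1 session, which the TF2 tf.summary API will not record on its own); the directory name and loop values below are placeholders for this sketch, and the explicit flush() makes sure events reach disk.

import tensorflow as tf

# Hypothetical run directory and epoch count for illustration only.
log_dir = 'board/demo_run'
nr_epochs = 3

writer = tf.summary.create_file_writer(log_dir)

for epoch in range(nr_epochs):
    # In the real code these would be concrete values obtained via sess.run(...);
    # here they are dummy Python floats so the sketch runs on its own.
    loss_value = 1.0 / (epoch + 1)
    accuracy_value = 0.5 + 0.1 * epoch

    with writer.as_default():
        tf.summary.scalar('loss', loss_value, step=epoch)
        tf.summary.scalar('accuracy', accuracy_value, step=epoch)

    writer.flush()  # make sure the event file is actually written to disk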

Related

How can I use Tensorflow.Checkpoint to recover a previously trained net

I'm trying to understand how to recover a saved/checkpointed net using tensorflow.train.Checkpoint.restore.
I'm using code that's strongly based on Google's Colab tutorial for creating a pix2pix GAN. Below, I've excerpted the key portion, which just attempts to instantiate a new net, then to fill it with weights from a previous net that was saved and checkpointed.
I'm assigning a unique(ish) id number to a particular instantiation of a net by summing all of its weights. I compare these id numbers both when the net is created and after I've attempted to recover the checkpointed net.
def main(opt):
    # Initialize pix2pix GAN using arguments input from command line
    p2p = Pix2Pix(vars(opt))
    print(opt)

    # print sum of initial weights for net
    print("Init Model Weights:",
          sum([x.numpy().sum() for x in p2p.generator.weights]))

    # Create or read from model checkpoints
    checkpoint = tf.train.Checkpoint(generator_optimizer=p2p.generator_optimizer,
                                     discriminator_optimizer=p2p.discriminator_optimizer,
                                     generator=p2p.generator,
                                     discriminator=p2p.discriminator)

    # print sum of weights from checkpoint, to ensure it has access
    # to relevant regions of p2p
    print("Checkpoint Weights:",
          sum([x.numpy().sum() for x in checkpoint.generator.weights]))

    # Recover checkpointed net
    checkpoint.restore(tf.train.latest_checkpoint(opt.weights)).expect_partial()

    # print sum of weights for p2p & checkpoint after attempting to restore saved net
    print("Restore Model Weights:",
          sum([x.numpy().sum() for x in p2p.generator.weights]))
    print("Restored Checkpoint Weights:",
          sum([x.numpy().sum() for x in checkpoint.generator.weights]))
    print("Done.")


if __name__ == '__main__':
    opt = parse_opt()
    main(opt)
The output I got when I ran this code was as follows:
Namespace(channels='1', data='data', img_size=256, output='output', weights='weights/ckpt-40.data-00000-of-00001')
## These are the input arguments, the images have only 1 channel (they're gray scale)
## The directory with data is ./data, the images are 256x256
## The output directory is ./output
## The checkpointed net is stored in ./weights/ckpt-40.data-00000-of-00001
## Sums of nets' weights
Init Model Weights: 11047.206374436617
Checkpoint Weights: 11047.206374436617
Restore Model Weights: 11047.206374436617
Restored Checkpoint Weights: 11047.206374436617
Done.
There is no change in the sum of the net's weights before and after recovering the checkpointed version, although p2p and checkpoint do seem to have access to the same locations in memory.
Why am I not recovering the saved net?
The problem arose because the path passed to tf.train.latest_checkpoint (and hence to tf.train.Checkpoint.restore) needs to be the directory in which the checkpointed net is stored, not a specific file (or what I took to be the specific file, ./weights/ckpt-40.data-00000-of-00001).
When it is not given a valid directory, it silently proceeds to the next line of code without updating the net or raising an error. The fix was to give it the directory containing the relevant checkpoint files, rather than just the file I believed to be relevant.
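As a sketch of the fix (hypothetical paths, reusing the checkpoint object from above): pass the directory to tf.train.latest_checkpoint, which resolves it to the actual checkpoint prefix, and check the result before restoring.

# tf.train.latest_checkpoint expects the directory that holds the 'checkpoint'
# index file and returns a prefix such as './weights/ckpt-40', or None if
# nothing is found there.
ckpt_dir = './weights'
latest = tf.train.latest_checkpoint(ckpt_dir)
if latest is None:
    raise FileNotFoundError('No checkpoint found in ' + ckpt_dir)
checkpoint.restore(latest).expect_partial()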
My alternative approach is to use a ModelCheckpoint callback during training and then restore from the saved checkpoint; you can also name the layers that the checkpoint should track.
Example:
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: DataSet
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
DATA = adding_array_DATA(DATA, action, reward, gamescores, step)
dataset = tf.data.Dataset.from_tensor_slices((tf.constant(DATA, dtype=tf.float32),tf.constant(np.reshape(0, (1, 1, 1, 1)))))
batched_features = dataset
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Model Initialize
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
model = tf.keras.models.Sequential([
    tf.keras.layers.InputLayer(input_shape=(1200, 1)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True, return_state=False)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
])
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(64))
model.add(tf.keras.layers.Dense(2))
model.summary()
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Callback
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
cp_callback = tf.keras.callbacks.ModelCheckpoint(checkpoint_dir, monitor='val_loss',
                                                 verbose=0, save_best_only=True, mode='min')
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Optimizer
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
optimizer = tf.keras.optimizers.Nadam(
    learning_rate=0.0001, beta_1=0.9, beta_2=0.999, epsilon=1e-07,
    name='Nadam'
)  # 0.00001
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Loss Fn
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
# 1
lossfn = tf.keras.losses.MeanSquaredLogarithmicError(reduction=tf.keras.losses.Reduction.AUTO, name='mean_squared_logarithmic_error')
# 2
# lossfn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Model Summary
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
model.compile(optimizer=optimizer, loss=lossfn, metrics=['accuracy'])
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Training
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
history = model.fit(batched_features, epochs=1, validation_data=batched_features,
                    callbacks=[cp_callback])  # epochs=500 # , callbacks=[cp_callback, tb_callback]
checkpoint = tf.train.Checkpoint(model)
checkpoint.restore(checkpoint_dir)
input('...')
Output:
2022-03-08 10:33:06.965274: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8100
1/1 [==============================] - ETA: 0s - loss: 0.0154 - accuracy: 0.0000e+002022-03-08 10:33:16.175845: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
1/1 [==============================] - 31s 31s/step - loss: 0.0154 - accuracy: 0.0000e+00 - val_loss: 0.0074 - val_accuracy: 0.0000e+00
...

Why Trained and Loaded Model is giving different evaluate result? 82% vs 5%

I implemented my seq-Keras model and it was successfully trained.
...
model.fit(...)
...
>>Result: Successfully completed: TrainAcc=99%, ValAcc=88%
Next:
NOW I run this code:
model.save('Model88.h5')
model.evaluate(X_test, y_test)
Result >> accuracy: 0.8216
Next:
But when I load the saved model (Model88) and evaluate it:
model = keras.models.load_model('Model88.h5')
model.evaluate(X_test, y_test)
Result >> accuracy: 0.0214 !!!
The test data is the same, the saved model and the loaded model are the same!
Why does this happen? accuracy: 82% -> 5% !!!!
I had the same problem before, and I found out that the test accuracy was being calculated incorrectly. Changing the test batch size to 1 fixed it for me.
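A minimal sketch of that workaround, assuming X_test and y_test are the same arrays used above and that the model was compiled with a single accuracy metric:

from tensorflow import keras

# Reload the saved model and evaluate one sample per batch.
model = keras.models.load_model('Model88.h5')
loss, accuracy = model.evaluate(X_test, y_test, batch_size=1)
print('accuracy:', accuracy)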

how to log validation loss and accuracy using tfslim

Is there any way that I can log the validation loss and accuracy to TensorBoard when using tf-slim? When I was using Keras, the following code could do this for me:
model.fit_generator(generator=train_gen(), validation_data=valid_gen(),...)
Then the model will evaluate the validation loss and accuracy after each epoch, which is very convenient. But how can I achieve this using tf-slim? The following steps use primitive TensorFlow, which is not what I want:
with tf.Session() as sess:
    for step in range(100000):
        sess.run(train_op, feed_dict={X: X_train, y: y_train})
        if n % batch_size * batches_per_epoch == 0:
            print(sess.run(train_op, feed_dict={X: X_train, y: y_train}))
Right now, the steps to train a model using tf-slim are:
tf.contrib.slim.learning.train(
    train_op=train_op,
    logdir="logs",
    number_of_steps=10000,
    log_every_n_steps=10,
    save_summaries_secs=1
)
So how to evaluate validation loss and accuracy after each epoch with the above slim training procedure?
Thanks in advance!
The matter is still being discussed on the TF Slim repo (issue #5987).
The framework allows you to easily create an evaluation script to run after / in parallel of your training (solution 1 below), but some people are pushing to be able to implement the "classic cycle of batch training + validation" (solution 2).
1. Use slim.evaluation in another script
TF Slim has evaluation methods, e.g. slim.evaluation.evaluation_loop(), that you can use in another script (which can run in parallel with your training) to periodically load the latest checkpoint of your model and perform evaluation. The TF Slim page contains a good example of how such a script may look: example.
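A rough sketch of such an evaluation script, under the assumption that checkpoint_dir, eval_log_dir, num_eval_batches and the metric update ops in names_to_updates are defined elsewhere; argument names follow the old tf.contrib.slim API:

import tensorflow as tf

slim = tf.contrib.slim

# Every 60 seconds, load the newest checkpoint from checkpoint_dir, run the
# evaluation ops over num_eval_batches batches, and write summaries for
# TensorBoard into eval_log_dir.
slim.evaluation.evaluation_loop(
    master='',
    checkpoint_dir=checkpoint_dir,
    logdir=eval_log_dir,
    num_evals=num_eval_batches,
    eval_op=list(names_to_updates.values()),
    eval_interval_secs=60)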
2. Provide a custom train_step_fn to slim.learning.train()
A patchy solution the initiator of the discussion came up with makes use of a custom training step function you can provide to slim.learning.train():
"""
Snippet from code by Kevin Malakoff #kmalakoff
https://github.com/tensorflow/tensorflow/issues/5987#issue-192626454
"""
# ...
accuracy_validation = slim.metrics.accuracy(
tf.argmax(predictions_validation, 1),
tf.argmax(labels_validation, 1)) # ... or whatever metrics needed
def train_step_fn(session, *args, **kwargs):
total_loss, should_stop = train_step(session, *args, **kwargs)
if train_step_fn.step % FLAGS.validation_check == 0:
accuracy = session.run(train_step_fn.accuracy_validation)
print('Step %s - Loss: %.2f Accuracy: %.2f%%' % (str(train_step_fn.step).rjust(6, '0'), total_loss, accuracy * 100))
# ...
train_step_fn.step += 1
return [total_loss, should_stop]
train_step_fn.step = 0
train_step_fn.accuracy_validation = accuracy_validation
slim.learning.train(
train_op,
FLAGS.logs_dir,
train_step_fn=train_step_fn,
graph=graph,
number_of_steps=FLAGS.max_steps
)

How to run a session without updating parameters in tensorflow?

I know how to use TensorBoard to get the graph; I'm just curious whether I can get the test loss during training by leaving out the train_op from the sess.run call. Also, I'm wondering if it's OK to get values for any other dataset without training on it, simply by dropping the train_op.
loss = tf.reduce_mean(tf.square(y_ - y), name='square_mean')
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

# this is for training, and I put train_op inside.
for _ in xrange(step):
    _, loss_value = sess.run([train_op, loss],
                             feed_dict={x: batch_x, y_: np.transpose([batch_y])})

# just feed the data and get one loss value in some epochs
loss_test = sess.run(loss, feed_dict={x: testing_batch, y_: np.transpose([label_t_batch])})
Leaving out the train_op is perfectly fine, since it is (more or less) a node like any other. If you don't fetch it in sess.run, it simply won't be executed, which means the gradient-descent update will not happen.
As for your second question: as long as the data format of the other datasets you're talking about fits the input format of your graph, there should not be any issue.
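A small sketch of the difference, reusing the placeholders x, y_ and the loss/train_op defined above, with hypothetical arrays X_val and y_val standing in for a held-out set:

# Training step: fetching train_op applies the gradient-descent update.
_, train_loss = sess.run([train_op, loss],
                         feed_dict={x: batch_x, y_: np.transpose([batch_y])})

# Evaluation only: fetching just `loss` computes it without touching the
# variables, so this is safe to run on validation or test data.
val_loss = sess.run(loss, feed_dict={x: X_val, y_: np.transpose([y_val])})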

How to get epoch num info from tf.train.string_input_producer

If I'm reading files using string_input_producer, like
filename_queue = tf.train.string_input_producer(
    files,
    num_epochs=num_epochs,
    shuffle=shuffle)
how can I get the epoch number during training? (I want to display this info while training.)
I tried the following:
Running
tf.get_default_graph().get_tensor_by_name('input_train/input_producer/limit_epochs/epochs:0')
always returns the same value as the epoch limit, while running
tf.get_default_graph().get_tensor_by_name('input_train/input_producer/limit_epochs/CountUpTo:0')
increases by 1 each time it is evaluated.
Neither gives the correct epoch number during training.
Another thing: if I retrain from an existing model, can I get the number of epochs that have already been trained?
I think the right approach here is to define a global_step variable that you pass to your optimizer (or you can increment it manually).
The TensorFlow Mechanics 101 tutorial provides an example:
global_step = tf.Variable(0, name='global_step', trainable=False)
train_op = optimizer.minimize(loss, global_step=global_step)
Now global_step will be incremented each time the train_op runs. Since you know the size of your dataset and your batch size, you will know what epoch you're currently at.
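For example, a quick sketch with hypothetical numbers (dataset_size and batch_size are whatever your input pipeline actually uses):

# Hypothetical sizes for illustration.
dataset_size = 50000
batch_size = 128
steps_per_epoch = dataset_size // batch_size

step_value = sess.run(global_step)           # or global_step.eval() inside a session
current_epoch = step_value // steps_per_epoch
print('step %d -> epoch %d' % (step_value, current_epoch))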
When you save your model with a tf.train.Saver(), the global_step variable will also be saved. When you restore your model, you can just call global_step.eval() to get back the step value where you left off.
I hope this helps!