I am reading files with string_input_producer, like this:
filename_queue = tf.train.string_input_producer(
    files,
    num_epochs=num_epochs,
    shuffle=shuffle)
How can I get the current epoch number during training? I want to display it as training progresses.
I tried the following. Running
tf.get_default_graph().get_tensor_by_name('input_train/input_producer/limit_epochs/epochs:0')
always returns the same value as the epoch limit. Running
tf.get_default_graph().get_tensor_by_name('input_train/input_producer/limit_epochs/CountUpTo:0')
increments the value by 1 on every call.
Neither gives the correct epoch number during training.
One more thing: if I resume training from an existing model, can I recover the number of epochs it has already been trained for?
I think the right approach here is to define a global_step variable that you pass to your optimizer (or you can increment it manually).
The TensorFlow Mechanics 101 tutorial provides an example:
global_step = tf.Variable(0, name='global_step', trainable=False)
train_op = optimizer.minimize(loss, global_step=global_step)
Now global_step will be incremented each time the train_op runs. Since you know the size of your dataset and your batch size, you will know what epoch you're currently at.
When you save your model with a tf.train.Saver(), the global_step variable will also be saved. When you restore your model, you can just call global_step.eval() to get back the step value where you left off.
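As a rough sketch of that arithmetic in plain Python: the helper name epoch_from_step and the numbers below are made up for illustration. It assumes every step consumes exactly batch_size examples and that batches may span epoch boundaries, as they can with a queue-based input pipeline.

```python
def epoch_from_step(global_step, batch_size, num_examples):
    """Return the zero-based epoch index implied by a step count.

    Assumes each step consumes exactly `batch_size` examples and that
    batches are not aligned to epoch boundaries.
    """
    if batch_size <= 0 or num_examples <= 0:
        raise ValueError("batch_size and num_examples must be positive")
    examples_seen = global_step * batch_size
    return examples_seen // num_examples


# Hypothetical numbers: 50,000 examples, batches of 64, 2,500 steps done.
current_epoch = epoch_from_step(global_step=2500, batch_size=64, num_examples=50000)
print("currently in epoch", current_epoch)
```

After restoring a checkpoint, the restored step value can be fed to the same helper to find out how many full passes over the data were already completed.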
I hope this helps!
I'm facing a problem with restoring training from the last checkpoint that I saved. I'm following exactly this code except that I'm changing the dataset and increasing the number of epochs to 100: Machine Translation French-English notebook
What do I need to add so that training continues where it left off? It won't finish in one day, and every time it restarts from epoch 1.
I've found a similar question but the answer didn't solve the problem: Resume training from a certain checkpoint.
I know this is late but I wanted to share the code of a possible solution to this.
Saving a checkpoint and restoring the model from it is pretty easy according to the TensorFlow documentation. The saving can be done with the Keras ModelCheckpoint callback every epoch (or, with the additional save_freq argument, every given number of batches):
model.compile(..., metrics=['accuracy'])
EPOCHS = 10
checkpoint_filepath = '/path/to/checkpoint'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True  # a checkpoint is written only when val_accuracy improves
)
# Training and validation data must also be passed to fit(); 'val_accuracy' requires validation data.
model.fit(epochs=EPOCHS, callbacks=[model_checkpoint_callback])
Then, before starting a new training run or making predictions, the weights of the saved checkpoint can be loaded like this:
model.load_weights(checkpoint_filepath)
That's it.
I am trying to train my CNN on Google Colab Pro. The code runs fine, but Keras does not know the number of steps per epoch, so the first epoch never ends.
Mounted at /content/drive
2.2.0-rc3
Found 10018 images belonging to 2 classes.
Found 1336 images belonging to 2 classes.
WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of batches seen.
Epoch 1/300
8/Unknown - 364s 45s/step - loss: 54.9278 - accuracy: 0.5410
I am using ImageDataGenerator() to load the images. How can I fix this?
An iterator does not store anything; it generates the data dynamically, so its length is unknown until you iterate through it. When you are using a dataset or dataset iterator, you must therefore provide steps_per_epoch. You could explicitly pass len(datafiles) into the .fit function. So you need to provide steps_per_epoch as shown below.
model.fit_generator(
    train_data_gen,
    steps_per_epoch=total_train // batch_size,
    epochs=epochs,
    validation_data=val_data_gen,
    validation_steps=total_val // batch_size
)
The documentation describes steps_per_epoch as follows:
steps_per_epoch: Integer or None. Total number of steps (batches of
samples) before declaring one epoch finished and starting the next
epoch. When training with input tensors such as TensorFlow data
tensors, the default None is equal to the number of samples in your
dataset divided by the batch size, or 1 if that cannot be determined.
If x is a tf.data dataset, and 'steps_per_epoch' is None, the epoch
will run until the input dataset is exhausted. This argument is not
supported with array inputs.
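A small sketch of the step arithmetic, using the image counts from the log above (10018 training and 1336 validation images). The batch size of 32 and the helper name steps_for are assumptions made for illustration. Floor division, as in total_train // batch_size, skips the final partial batch; rounding up includes it.

```python
import math


def steps_for(num_samples, batch_size, include_partial_batch=False):
    """Number of batches needed to cover num_samples once."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if include_partial_batch:
        return math.ceil(num_samples / batch_size)
    return num_samples // batch_size


batch_size = 32  # assumed value; the original post does not state it
train_steps = steps_for(10018, batch_size)
val_steps = steps_for(1336, batch_size)
print(train_steps, val_steps)
```

Either choice gives Keras a finite epoch length; which one fits depends on whether the last, smaller batch should be counted.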
I notice you are doing binary classification. One more thing to remember when you use ImageDataGenerator is to set class_mode, as shown below. Otherwise you will get an error (in keras) or accuracy stuck at 50% (in tf.keras).
train_data_gen = train_image_generator.flow_from_directory(batch_size=batch_size,
                                                           directory=train_dir,
                                                           shuffle=True,
                                                           target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                           class_mode='binary')
I am training a Tensorflow model, in which I include a checkpoint to save the best model (based on val_loss).
checkpoint = ModelCheckpoint(filepath, monitor='val_rmse', verbose=2,
                             save_best_only=True, save_weights_only=False,
                             mode='min', save_freq='epoch')  # the argument is save_freq, not save_frequency
After training, I visualize the training process epoch by epoch using the stats stored in the history object:
plotter.plot({'Basic': history}, metric = 'loss')
Question: how can I plot the training process only up to the epoch at which the best model was saved, rather than for every epoch? E.g., if I initially set epochs=5,000 but the best model is at epoch 2,000, I want the chart to stop at epoch 2,000.
Thanks
From documentation: The filepath can contain named formatting options, which will be filled with the values of epoch and keys in logs (passed in on_epoch_end).
For example, if filepath is weights.{epoch:02d}-{val_loss:.2f}.hdf5, then the model checkpoints will be saved with the epoch number and the validation loss in the filename, e.g. weights.150-0.88.hdf5.
You can then inspect the file name and plot until the desired epoch number.
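A minimal sketch of that idea in plain Python. The helper names, filenames and metric values are made up for illustration. It assumes the filename template shown above, that save_best_only=True was used (so the highest epoch number on disk is the best one), and that the epoch number in the filename counts from 1. The dict stands in for the per-epoch metric lists that Keras keeps in history.history.

```python
import re

# Assumed template: weights.{epoch:02d}-{val_loss:.2f}.hdf5
_CHECKPOINT_NAME = re.compile(r"weights\.(\d+)-(\d+\.\d+)\.hdf5$")


def last_saved_epoch(filenames):
    """Return the highest epoch number found in the checkpoint filenames."""
    epochs = [int(m.group(1)) for m in map(_CHECKPOINT_NAME.search, filenames) if m]
    if not epochs:
        raise ValueError("no filename matched the checkpoint template")
    return max(epochs)


def truncate_history(history, epoch):
    """Keep only the first `epoch` entries of every metric list."""
    return {name: values[:epoch] for name, values in history.items()}


# Hypothetical data: five epochs trained, best val_loss reached at epoch 4.
files = ["weights.01-1.20.hdf5", "weights.02-0.95.hdf5", "weights.04-0.88.hdf5"]
history = {
    "loss": [1.10, 0.90, 0.80, 0.70, 0.65],
    "val_loss": [1.20, 0.95, 0.97, 0.88, 0.91],
}
best = last_saved_epoch(files)
trimmed = truncate_history(history, best)
print(best, trimmed["val_loss"])
```

The trimmed lists can then be plotted with any plotting tool in place of the full history.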
I'm trying to train multiple models in parallel on a single graphics card. To do that, I need to resume training from saved weights, which is not a problem. The model.fit() method even has an initial_epoch parameter that lets me tell the model which epoch the loaded model is on. However, when I pass a TensorBoard callback to fit() to monitor training, all data in TensorBoard is plotted at x=0.
Is there a way to overcome this and adjust the epoch shown in TensorBoard?
By the way, I'm running Keras 2.0.6 and TensorFlow 1.3.0.
self.callbacks = [TensorBoardCallback(log_dir='./../logs/'+self.model_name, histogram_freq=0, write_graph=True, write_images=False, start_epoch=self.step_num)]
self.model.fit(x=self.data['X_train'], y=self.data['y_train'], batch_size=self.input_params[-1]['batch_size'], epochs=1, validation_data=(self.data['X_test'], self.data['y_test']), verbose=verbose, callbacks=self.callbacks, shuffle=self.hyperparameters['shuffle_data'], initial_epoch=self.step_num)
self.model.save_weights('./weights/%s.hdf5'%(self.model_name))
self.model.load_weights('./weights/%s.hdf5'%(self.model_name))
self.model.fit(x=self.data['X_train'], y=self.data['y_train'], batch_size=self.input_params[-1]['batch_size'], epochs=1, validation_data=(self.data['X_test'], self.data['y_test']), verbose=verbose, callbacks=self.callbacks, shuffle=self.hyperparameters['shuffle_data'], initial_epoch=self.step_num)
self.model.save_weights('./weights/%s.hdf5'%(self.model_name))
The resulting graph in TensorBoard looks like this, which is not what I was hoping for:
Update:
When passing epochs=10 to the first model.fit() the 10 epoch results are displayed in TensorBoard (see picture).
However, when reloading the model and running it (with the same callback attached), the on_epoch_end method of the callback never gets called.
It turns out that the epochs argument of model.fit() is not the number of additional epochs to train; it is the epoch index at which training stops, so it has to be counted on top of the initial_epoch specified. If initial_epoch=self.step_num, then I need epochs=self.step_num+10 to train for 10 more epochs.
Say we have just started fitting our model, and the first run uses 30 epochs
(please ignore the other parameters and just look at epochs and initial_epoch):
model.fit(train_dataloader, validation_data=test_dataloader, epochs=30, steps_per_epoch=len(train_dataloader), callbacks=callback_list)
Now say that after 30 epochs we want to continue from the 31st epoch (you can see this in TensorBoard) after changing the learning rate of our Adam optimizer (or any optimizer).
What we can do is:
model.optimizer.learning_rate = 0.0005
model.fit(train_dataloader, validation_data=test_dataloader, initial_epoch=30, epochs=55, steps_per_epoch=len(train_dataloader), callbacks=callback_list)
So here initial_epoch is the epoch where the previous training run stopped, and
epochs = initial_epoch + the number of epochs we want to run in this second fit.
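The bookkeeping can be wrapped in a tiny helper. This is only a sketch; the function name resume_fit_args is hypothetical, and the numbers mirror the example above (30 epochs done, 25 more wanted).

```python
def resume_fit_args(epochs_completed, extra_epochs):
    """Build the initial_epoch/epochs pair for a resumed fit() call.

    `epochs` is the epoch index at which training stops, so it is the
    sum of what was already trained and what should still be trained.
    """
    if epochs_completed < 0 or extra_epochs <= 0:
        raise ValueError("epochs_completed must be >= 0 and extra_epochs > 0")
    return {
        "initial_epoch": epochs_completed,
        "epochs": epochs_completed + extra_epochs,
    }


fit_args = resume_fit_args(epochs_completed=30, extra_epochs=25)
print(fit_args)
```

The resulting dict could be unpacked into the call, e.g. model.fit(train_dataloader, **fit_args), alongside the other arguments.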
This is tutorial code from the TensorFlow website.
Could anyone explain what global_step means?
The TensorFlow website says that the global step is used to count training steps, but I don't quite get what exactly that means.
Also, what does the number 0 mean when setting up global_step?
import tensorflow as tf  # TensorFlow 1.x API

def training(loss, learning_rate):
    tf.summary.scalar('loss', loss)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    # Why 0 as the first parameter of the global_step tf.Variable?
    global_step = tf.Variable(0, name='global_step', trainable=False)
    train_op = optimizer.minimize(loss, global_step=global_step)
    return train_op
According to the TensorFlow docs, global_step is incremented by one after the variables have been updated. Does that mean that after one update global_step becomes 1?
global_step refers to the number of batches seen by the graph. Every time a batch is provided, the weights are updated in the direction that minimizes the loss, and global_step keeps track of the number of batches seen so far. When it is passed in the minimize() argument list, the variable is increased by one each time the resulting training op runs. Have a look at optimizer.minimize().
You can get the global_step value using tf.train.global_step().
Also handy are the utility methods tf.train.get_global_step or tf.train.get_or_create_global_step.
0 is the initial value of the global step in this context.
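To make the counting concrete, here is a toy stand-in written in plain Python. It is not TensorFlow code; the ToyOptimizer class and its minimize_step method are invented for illustration. The counter starts at 0 and is incremented once per weight update, so it reads 1 after the first update.

```python
class ToyOptimizer:
    """Plain-Python stand-in for a gradient descent optimizer (not a TensorFlow class)."""

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate

    def minimize_step(self, weight, gradient, state):
        """Apply one update, then increment the step counter."""
        new_weight = weight - self.learning_rate * gradient
        state["global_step"] += 1
        return new_weight


state = {"global_step": 0}  # 0 means no update has happened yet
optimizer = ToyOptimizer(learning_rate=0.1)
w = 1.0
steps_seen = []
for _ in range(3):
    gradient = 2 * w  # gradient of the toy loss w**2
    w = optimizer.minimize_step(w, gradient, state)
    steps_seen.append(state["global_step"])
print(steps_seen, w)
```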
The global_step variable holds the total number of steps during training across the tasks (each step index will occur only on a single task).
A timeline created by global_step helps us understand where we are in the grand scheme, from each of the tasks separately. For instance, the loss and accuracy could be plotted against global_step in TensorBoard.
Here is a sample.
Code:
train_op = tf.train.GradientDescentOptimizer(learning_rate=LEARNING_RATE).minimize(loss_tensor, global_step=tf.train.create_global_step())
with tf.Session() as sess:
    ...
    tf.logging.log_every_n(tf.logging.INFO, "np.mean(loss_evl)= %f at step %d", 100, np.mean(loss_evl), sess.run(tf.train.get_global_step()))
Corresponding output:
INFO:tensorflow:np.mean(loss_evl)= 1.396970 at step 1
INFO:tensorflow:np.mean(loss_evl)= 1.221397 at step 101
INFO:tensorflow:np.mean(loss_evl)= 1.061688 at step 201
There are networks, e.g. GANs, that may need two (or more) different step counters. Training a GAN with the WGAN specification requires more steps on the discriminator (or critic) D than on the generator G. In that case, it is useful to declare separate global_step variables.
Example (G_loss and D_loss are the losses of the generator and the discriminator):
G_global_step = tf.Variable(0, name='G_global_step', trainable=False)
D_global_step = tf.Variable(0, name='D_global_step', trainable=False)
minimizer = tf.train.RMSPropOptimizer(learning_rate=0.00005)
G_solver = minimizer.minimize(G_loss, var_list=params, global_step=G_global_step)
D_solver = minimizer.minimize(D_loss, var_list=params, global_step=D_global_step)
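The effect of keeping two counters can be sketched without TensorFlow. In the plain-Python loop below the increments stand in for running D_solver and G_solver. The function name run_schedule is hypothetical, and the default of 5 critic updates per generator update is an assumption taken from the commonly used WGAN setting.

```python
def run_schedule(generator_updates, n_critic=5):
    """Simulate the two step counters of an alternating training schedule.

    For every generator update, the critic is updated n_critic times.
    Returns (G_global_step, D_global_step) after the schedule finishes.
    """
    g_step = 0
    d_step = 0
    for _ in range(generator_updates):
        for _ in range(n_critic):
            d_step += 1  # stands in for one run of D_solver
        g_step += 1      # stands in for one run of G_solver
    return g_step, d_step


g_steps, d_steps = run_schedule(generator_updates=100)
print(g_steps, d_steps)
```

With a single shared counter the two totals would be merged into one number, which is why separate variables are convenient here.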