I am trying to train an AlexNet model using tf.estimator in TensorFlow. Training runs smoothly, and the logs are displayed as expected:
INFO:tensorflow:loss = 2.61362, step = 294
INFO:tensorflow:Saving checkpoints for 325 into /home/olu/Dev/data_base
sign_base/output/Checkpoints_N_Model/trained_alexnet_model/model.ckpt.
INFO:tensorflow:Loss for final step: 2.94104.
Below is how the training function is called:
traffic_sign_classifier.train(input_fn=train_input_fn, hooks=[logging_hook])
How can I get the loss value as a plain Python float from the tf.estimator object?
You can download the loss values from TensorBoard once training has finished.
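If you export a scalar from TensorBoard as CSV, the values can be read back as ordinary Python floats. Here is a minimal sketch, assuming the export has the usual "Wall time,Step,Value" header. The function name read_scalar_csv, the file name run_loss.csv, and the sample rows are placeholders for illustration, not something from the question:

```python
import csv

def read_scalar_csv(lines):
    """Parse a TensorBoard scalar CSV export into (step, value) pairs.

    Assumes the header row is "Wall time,Step,Value".
    """
    reader = csv.DictReader(lines)
    return [(int(float(row["Step"])), float(row["Value"])) for row in reader]

# Inline sample standing in for an exported file; with a real export use:
#   with open("run_loss.csv", newline="") as f:   # hypothetical file name
#       losses = read_scalar_csv(f)
sample = [
    "Wall time,Step,Value",
    "1520000000.0,294,2.61362",
    "1520000060.0,325,2.94104",
]
losses = read_scalar_csv(sample)
print(losses)
```

Each entry in losses is a (step, loss) tuple whose second element is a regular float.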
The .evaluate() method on the estimator returns a dictionary of metrics. You specify these in your model function through the eval_metric_ops argument of the EstimatorSpec (https://www.tensorflow.org/api_docs/python/tf/estimator/EstimatorSpec). I found the answer in a thread on GitHub.
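A minimal sketch of that approach, assuming traffic_sign_classifier is the estimator from the question and eval_input_fn is a hypothetical evaluation input function you would define yourself. The dictionary returned by evaluate() also carries "loss" and "global_step" entries, and its values convert with float(). The stand-in dictionary below only mimics that shape so the snippet runs without TensorFlow:

```python
def metrics_as_floats(metrics):
    """Convert every value in an evaluate()-style dict to a plain Python number."""
    return {
        name: int(value) if name == "global_step" else float(value)
        for name, value in metrics.items()
    }

# With a real estimator (eval_input_fn is an assumed name, not from the question):
#   metrics = traffic_sign_classifier.evaluate(input_fn=eval_input_fn)
# Stand-in result so the example is self-contained:
metrics = {"loss": 2.94104, "global_step": 325}

clean = metrics_as_floats(metrics)
loss_value = clean["loss"]
print(type(loss_value), loss_value)
```

Note that this reports the loss on the evaluation data, which is not necessarily the same number as the training loss printed in the logs.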
I am getting this result after training on TensorFlow object detection API 2.x on a custom dataset.
TF2: SSD MobileNet v2 320x320
Can you please suggest a script modification in model_main_tf2.py to get evaluation results exactly like those from model_main.py?
I'm training my own music_vae models and I've noticed that the evaluation loss and accuracy are always 0 when viewed in TensorBoard. This is strange because I follow a similar process to train the RNNs, and there the evaluation loss and accuracy look good.
Here is what I'm seeing in tensorboard:
[Images: TensorBoard legend; evaluation loss; accuracy]
Finally, here is what I'm seeing inside the eval folder. As you can see there is some data there:
[Image: eval folder contents]
Any help on this issue would be appreciated! Thanks
Hi all,
I started training DeepLab v3+ following this guide. However, after step 1480, I got this error:
Error reported to Coordinator: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_2
The detailed train log is here
Could someone suggest how to solve this issue? THX!
Based on the log, it seems that you are training with batch_size = 1 and fine_tune_batch_norm = True (the default value). Since you are fine-tuning batch norm during training, it is better to set the batch size as large as possible (see the comments in train.py and Q5 in the FAQ). If only limited GPU memory is available, you could fine-tune from the provided pre-trained checkpoint, use a smaller learning rate, and set fine_tune_batch_norm = False (see model_zoo.md for details). Also make sure the tf_initial_checkpoint flag points to the correct path of the desired pre-trained checkpoint.
I am new to TensorFlow. I am doing binary classification with my own dataset, but I do not know how to compute the accuracy. Can anyone please help me with this?
My classifier has 5 convolutional layers followed by 2 fully connected layers. The final FC layer has an output dimension of 2 for which I have used:
prob = tf.nn.softmax(classification_features, name="output")
Just calculate the percentage of correct predictions:
prediction = tf.math.argmax(prob, axis=1)
equality = tf.math.equal(prediction, correct_answer)
accuracy = tf.math.reduce_mean(tf.cast(equality, tf.float32))
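To make the three steps concrete, here is the same computation on plain Python lists. This is only an illustration with made-up probabilities and labels, not the TensorFlow code itself. In the TensorFlow version, correct_answer needs to hold integer class indices of the same dtype as prediction (int64 by default for argmax):

```python
# Toy softmax outputs for four examples (made-up numbers, two classes each).
prob = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
correct_answer = [0, 1, 1, 1]

# argmax along axis 1: index of the largest probability in each row.
prediction = [row.index(max(row)) for row in prob]

# Element-wise equality, then the mean of the 0/1 results.
equality = [p == c for p, c in zip(prediction, correct_answer)]
accuracy = sum(equality) / len(equality)

print(prediction, accuracy)
```

Three of the four predictions match the labels, so the accuracy here is 0.75.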
UPDATE 2020-11-23: Keras in TensorFlow
Now you can just request it through the metrics parameter of model.compile.
This post is from 3.6 years ago, when TensorFlow was still at version 1. Now that tensorflow.org suggests using the Keras calls, you can ask for accuracy like so:
model.compile(loss='mse', optimizer='sgd', metrics=['accuracy'])
model.fit(x, y)
BOOM! You've got accuracy in your report when you run "model.fit".
If you are using an older version of TensorFlow or just writing it from scratch, @Androbin explains it well.
I'm new to TensorFlow and am trying to run it in distributed mode. I have found the official document at https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/how_tos/distributed/index.md, but the loss function is missing from its example.
Can anyone help me complete it so that I can run the code?
It lacks not only the loss function but also the model to train, and therefore the loss to minimize.
This file is just a template that you have to complete in order to train your model in distributed mode.
So, when you find this comment in the template file:
# Build model...
it means that you have to define a model to train (e.g. a convolutional neural network or a simple perceptron).
Something like the MNIST model that you can find in the tutorial: https://www.tensorflow.org/versions/r0.9/tutorials/mnist/beginners/index.html
Your model ends with a loss function to minimize.
Following the MNIST example, the loss is:
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
loss = cross_entropy
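As a worked illustration of what that expression computes, here is the same formula on plain Python lists: the per-example sum of -y_ * log(y) over classes, followed by the mean over the batch. The labels and probabilities are made-up numbers, and this is a sketch of the arithmetic only, not a replacement for the TensorFlow ops:

```python
import math

# Toy batch: one-hot labels y_ and predicted probabilities y (made-up numbers).
y_ = [[1.0, 0.0], [0.0, 1.0]]
y = [[0.8, 0.2], [0.4, 0.6]]

# Per-example cross-entropy: -sum over classes of y_ * log(y).
per_example = [
    -sum(t * math.log(p) for t, p in zip(t_row, p_row))
    for t_row, p_row in zip(y_, y)
]

# Mean over the batch, matching tf.reduce_mean in the snippet above.
loss = sum(per_example) / len(per_example)
print(per_example, loss)
```

Because the labels are one-hot, each per-example term reduces to the negative log of the probability assigned to the true class.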
Once you have defined the model to train and the loss to minimize, the template's missing pieces are filled in and you can start training your model in distributed mode.