Tensorflow hyperparameter tuning - metrics for each trial not outputted

I want to tune a hyperparameter in a slightly modified DNNClassifier. I was able to run the tuning job and it succeeded, but the output does not show the final metrics for each trial. This is what the final output looks like:
{
  "completedTrialCount": "2",
  "trials": [
    {
      "trialId": "1",
      "hyperparameters": {
        "myparam": "0.003"
      }
    },
    {
      "trialId": "2",
      "hyperparameters": {
        "myparam": "0.07"
      }
    }
  ],
  "consumedMLUnits": 1.48,
  "isHyperparameterTuningJob": true
}
How do I get the final metric for each trial, so that I can decide which value is the best?
My code looks like this.
My DNNClassifier:
classifier = DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=hu,
    optimizer=tf.train.AdamOptimizer(learning_rate=lr),
    activation_fn=tf.nn.leaky_relu,
    dropout=dr,
    n_classes=2,
    config=self.get_run_config(),
    model_dir=self.model_dir,
    weight_column=weight_column
)
tf.contrib.estimator.add_metrics(classifier, compute_metrics)
def compute_metrics(labels, predictions):
    return {'my-roc-auc': tf.metrics.auc(labels, predictions)}
The hyperparameters spec is as follows.
trainingInput:
  hyperparameters:
    hyperparameterMetricTag: my-roc-auc
    maxTrials: 2
    enableTrialEarlyStopping: True
    params:
    - parameterName: myparam
      type: DISCRETE
      discreteValues:
      - 0.0001
      - 0.0005
      - 0.001
      - 0.003
      - 0.005
      - 0.007
      - 0.01
      - 0.03
      - 0.05
      - 0.07
      - 0.1
I mostly followed the instructions here.

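Fixed it. The issue was this line, which discards the return value:
tf.contrib.estimator.add_metrics(classifier, compute_metrics)
add_metrics does not modify the estimator in place; it returns a new Estimator that carries the extra metrics. It should have been:
classifier = tf.contrib.estimator.add_metrics(classifier, compute_metrics)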
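Once the metric is reported, each entry in "trials" should carry an objective value next to its hyperparameters, and picking the best value becomes a small parsing job. The sketch below assumes the value arrives under "finalMetric" / "objectiveValue"; those field names and the metric values in the sample are assumptions for illustration (they do not appear in the output above), and best_trial is a hypothetical helper.

```python
import json

def best_trial(job_output, maximize=True):
    """Return the trial with the best objective value, or None if no trial reported one."""
    # Trials without a reported metric (as in the output in the question) are skipped.
    scored = [t for t in job_output.get("trials", []) if "finalMetric" in t]
    if not scored:
        return None
    key = lambda t: float(t["finalMetric"]["objectiveValue"])
    return max(scored, key=key) if maximize else min(scored, key=key)

# Hypothetical job output: the "finalMetric" fields and their values are assumed.
sample = json.loads("""
{
  "completedTrialCount": "2",
  "trials": [
    {"trialId": "1", "hyperparameters": {"myparam": "0.003"},
     "finalMetric": {"trainingStep": "1000", "objectiveValue": 0.91}},
    {"trialId": "2", "hyperparameters": {"myparam": "0.07"},
     "finalMetric": {"trainingStep": "1000", "objectiveValue": 0.84}}
  ]
}
""")

best = best_trial(sample)
print(best["trialId"], best["hyperparameters"]["myparam"])
```

Since my-roc-auc is a higher-is-better metric, the default maximize=True fits here; for a loss-like metric you would pass maximize=False.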

Related

Images get rotated during training

I am trying to train an ssd_mobilenet_v2_keras for object detection on a dataset of roughly 6000 images. The problem is that images are rotated randomly during training (or at least, this is what it looks like in TensorBoard). This is the configuration I am using in the pipeline.config file:
train_config {
  batch_size: 32
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_rgb_to_gray {
      probability: 0.25
    }
  }
  data_augmentation_options {
    random_jpeg_quality {
      random_coef: 0.8
      min_jpeg_quality: 50
      max_jpeg_quality: 100
    }
  }
  sync_replicas: true
  optimizer {
    adam_optimizer: {
      epsilon: 1e-7
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: 1e-3
          total_steps: 50000
          warmup_learning_rate: 2.5e-4
          warmup_steps: 5000
        }
      }
    }
    use_moving_average: false
  }
  fine_tune_checkpoint: "pre-trained-models/ssd_mobilenet_v2_320x320_coco17_tpu-8/checkpoint/ckpt-0"
  num_steps: 50000
  startup_delay_steps: 0.0
  replicas_to_aggregate: 8
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
  fine_tune_checkpoint_type: "detection"
  fine_tune_checkpoint_version: V2
}
I have also tried removing the random horizontal flip (I knew it would probably not solve anything, I just gave it a try), but nothing changes: I still see some training images rotated in TensorBoard, and when I run the evaluation the images are sometimes rotated as well. Of course the XML with the bounding box coordinates is not "rotated", so the ground truth shown in TensorBoard appears completely wrong: the object is in one position and the ground-truth box is in a completely different one (the right position if the image weren't rotated).

Problems with the TF2 learning rate scheduler

I am using the TF2 research object detection API with an EfficientDet D3 model for my training. The optimizer is defined in my pipeline.config file like this:
optimizer {
  adam_optimizer {
    learning_rate {
      cosine_decay_learning_rate {
        learning_rate_base: 0.08
        total_steps: 300000
        warmup_learning_rate: 0.001
        warmup_steps: 250
      }
    }
  }
  use_moving_average: false
}
So I would assume the learning rate ramps up to 0.08 by step 250 and then slowly decays until the end of training at step 30,000. Is that assumption correct?
However, the learning rate chart in Tensorboard looks like this:
So the learning rate sticks at 0.08 once step 250 has been reached.
I let it run for hours, and the learning rate does not go down at all.
What am I missing here?
PS: The whole pipeline.config file can be found here.
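For reference, a linear warmup followed by a cosine decay can be written down in a few lines. This is a minimal sketch that assumes the schedule follows the standard formula suggested by the config field names; lr_at_step is a hypothetical helper, not the Object Detection API's own code. With total_steps set to 300000 as in the config (rather than the 30,000 mentioned in the text), it suggests the rate would still be within about 1% of 0.08 ten thousand steps in, which could look flat on a short chart.

```python
import math

def lr_at_step(step, base=0.08, total_steps=300000,
               warmup_lr=0.001, warmup_steps=250):
    """Learning rate with linear warmup followed by cosine decay (assumed formula)."""
    if step < warmup_steps:
        # Linear ramp from warmup_lr to base over the warmup steps.
        return warmup_lr + (base - warmup_lr) * step / warmup_steps
    # Cosine decay from base down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base * (1.0 + math.cos(math.pi * min(progress, 1.0)))

# Print the rate at a few points of the schedule defined in the config.
for step in (0, 250, 10000, 150000, 300000):
    print(step, round(lr_at_step(step), 5))
```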

Training loss value is increasing after some training time, but the model detects objects pretty well

I have encountered a strange problem while training a CNN to detect objects from my own dataset. I am using transfer learning, and at the beginning of training the loss value decreases (as expected). But after some time it gets higher and higher, and I have no idea why this happens.
At the same time, when I look at the Images tab in Tensorboard to check how well the CNN predicts objects, I can see that it does so very well; it doesn't look as if it is getting worse over time. Also, the Precision and Recall charts look good. Only the Loss charts (especially classification_loss) show an increasing trend over time.
Here are some specific details:
I have 10 different classes of logos (such as DHL, BMW, FedEx, etc.)
Around 600 images per class
I use tensorflow-gpu on Ubuntu 18.04
I tried multiple pre-trained models, the latest being faster_rcnn_resnet101_coco with this config pipeline:
model {
  faster_rcnn {
    num_classes: 10
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_resnet101'
      first_stage_features_stride: 16
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 14
    maxpool_kernel_size: 2
    maxpool_stride: 2
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 300
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}
train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0003
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "/home/franciszek/Pobrane/models-master/research/object_detection/logo_detection/models2/faster_rcnn_resnet101_coco/model.ckpt"
  from_detection_checkpoint: true
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}
train_input_reader: {
  tf_record_input_reader {
    input_path: "/home/franciszek/Pobrane/models-master/research/object_detection/logo_detection/data2/train.record"
  }
  label_map_path: "/home/franciszek/Pobrane/models-master/research/object_detection/logo_detection/data2/label_map.pbtxt"
}
eval_config: {
  num_examples: 8000
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}
eval_input_reader: {
  tf_record_input_reader {
    input_path: "/home/franciszek/Pobrane/models-master/research/object_detection/logo_detection/data2/test.record"
  }
  label_map_path: "/home/franciszek/Pobrane/models-master/research/object_detection/logo_detection/data2/label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
Here you can see results that I get after training for nearly 23 hours and reaching over 120k steps:
Loss and Total Loss
Precision
So, my question is, why is the loss value increasing over time? It should be getting smaller or stay more or less constant, but you can clearly see the increasing trend in the above charts.
I think everything is properly configured and my dataset is pretty decent (also .tfrecord files were correctly "built").
To check whether it is my fault, I tried to use somebody else's dataset and configuration files. So I used the raccoon dataset author's files (he provided all of the necessary files in his repo). I just downloaded them and started training with no modifications, to check whether I would get results similar to his.
Surprisingly, after 82k steps, I got entirely different charts than the ones shown in the linked article (that were captured after 22k steps). Here you can see the comparison of our results:
My losses vs his TotalLoss
My precision vs his mAP
Clearly, something worked differently on my PC. I suspect it may be the same reason why I get increasing loss on my own dataset, that's why I mentioned it.

The TotalLoss is the weighted sum of the four other losses (the RPN classification and regression losses, and the BoxClassifier classification and regression losses), and all of them are evaluation losses. In Tensorboard you can check or uncheck runs to see the results for training only or for evaluation only. (For example, the following picture has both the train summary and the evaluation summary.)
If the evaluation loss is increasing, this may suggest that the model is overfitting; in addition, the precision metrics dropped a little.
To get a better fine-tuning result, you could try adjusting the weights of the four losses. For example, you could increase the weight of BoxClassifierLoss/classification_loss so that the model focuses more on this term. In your config file, second_stage_classification_loss_weight and first_stage_objectness_loss_weight are both 1 while the other two are both 2, so the model currently focuses a little more on the other two.
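To make the weighting concrete, here is a small sketch of a weighted total using the four weights from the config in the question. The per-component loss values are made-up numbers for illustration, total_loss is a hypothetical helper, and regularization terms are ignored.

```python
# Loss weights taken from the faster_rcnn config in the question.
weights = {
    "rpn_localization": 2.0,     # first_stage_localization_loss_weight
    "rpn_objectness": 1.0,       # first_stage_objectness_loss_weight
    "box_localization": 2.0,     # second_stage_localization_loss_weight
    "box_classification": 1.0,   # second_stage_classification_loss_weight
}

# Made-up unweighted loss values, for illustration only.
losses = {
    "rpn_localization": 0.02,
    "rpn_objectness": 0.01,
    "box_localization": 0.05,
    "box_classification": 0.10,
}

def total_loss(losses, weights):
    """Weighted sum of the four component losses (regularization ignored)."""
    return sum(weights[name] * value for name, value in losses.items())

print(total_loss(losses, weights))

# Doubling the classification weight gives that term more influence on the total.
heavier = dict(weights, box_classification=2.0)
print(total_loss(losses, heavier))
```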
An extra question is why loss_1 and loss_2 are the same. This can be explained by looking at the TensorFlow graph.
Here loss_2 is the summary for total_loss (note that this total_loss is not the same as TotalLoss), and the red-circled node is a tf.identity node. This node outputs the same tensor as its input, so loss_1 is the same as loss_2.

Object detection boxes are lost at second evaluation step

I'm a beginner with Tensorflow 1.4.0 and I'm trying to perform my first training + evaluation process on an object detection model. I'm seeing something weird in the output of the evaluation steps.
Here are the steps I took. First, it's worth saying that my goal is to detect two different kinds of shapes in very particular scientific images. They are under a kind of "copyright", so I can only show a simplified version of them (made by hand). Just keep in mind that the original ones are way more detailed.
A raw example of input image, see it as a repeated pattern (there is always a grid in the background) with some particular shapes in random positions.
As you can see I want to train the model to detect 2 classes: "round" shapes (class A) and "irregular" shapes (class B).
I used labelImg to generate labels for both classes in XML format. In total, I labeled 168 images (960x720 RGB, PNG), ending up with 800 boxes (a single image might have multiple A/B shapes in it).
I've also prepared a smaller dataset for evaluation, composed of 10 new images and 150 labels. These images are bigger than the ones in the training dataset (they are not "resized"; the viewport is simply larger, so there can be more events in each input): 1920x1440 RGB, PNG images.
Then I converted the XMLs for both the datasets into two .tfrecord files (there are some scripts around GitHub for this).
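One way to sanity-check the labels before building the .tfrecord files is to parse each XML and verify every box against the image size. The sketch below assumes the Pascal VOC layout that labelImg writes (size/width, size/height, object/name, object/bndbox); read_boxes and the sample annotation are hypothetical.

```python
import xml.etree.ElementTree as ET

def read_boxes(xml_text):
    """Parse a Pascal VOC annotation and return (label, normalized box) pairs."""
    root = ET.fromstring(xml_text)
    width = int(root.find("size/width").text)
    height = int(root.find("size/height").text)
    boxes = []
    for obj in root.findall("object"):
        label = obj.find("name").text
        bb = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (int(bb.find(tag).text)
                                  for tag in ("xmin", "ymin", "xmax", "ymax"))
        # A box must have positive area and lie inside the image.
        if not (0 <= xmin < xmax <= width and 0 <= ymin < ymax <= height):
            raise ValueError(f"bad box for {label}: {(xmin, ymin, xmax, ymax)}")
        # Coordinates normalized to [0, 1], in (ymin, xmin, ymax, xmax) order.
        boxes.append((label, (ymin / height, xmin / width,
                              ymax / height, xmax / width)))
    return boxes

# Hypothetical annotation for a 960x720 training image.
sample = """
<annotation>
  <size><width>960</width><height>720</height><depth>3</depth></size>
  <object>
    <name>shape_a</name>
    <bndbox><xmin>96</xmin><ymin>72</ymin><xmax>480</xmax><ymax>360</ymax></bndbox>
  </object>
</annotation>
"""
print(read_boxes(sample))
```

Running this over both the training and evaluation XMLs would catch boxes that fall outside the image or have swapped corners, which is one kind of data preparation mistake that produces no error during training.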
Then I prepared all the other input files for Tensorflow:
Label map file:
item {
id: 1
name: 'shape_a'
display_name: 'Shape A'
}
item {
id: 2
name: 'shape_b'
display_name: 'Shape B'
}
Config file (adapted from https://github.com/tensorflow/models/tree/master/research/object_detection/samples/configs). As you can see I've chosen the faster_rcnn_inception_v2 and I tried to train it from scratch (because of the nature of those images, that are way different from the ones used in the pretrained models). Most of the parameters are kept as they are in the repository.
model {
  faster_rcnn {
    num_classes: 2
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 720
        max_dimension: 960
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_v2'
      first_stage_features_stride: 16
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.5
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 14
    maxpool_kernel_size: 2
    maxpool_stride: 2
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.5
        max_detections_per_class: 100
        max_total_detections: 300
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}
train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0002
          schedule {
            step: 0
            learning_rate: .0002
          }
          schedule {
            step: 900000
            learning_rate: .00002
          }
          schedule {
            step: 1200000
            learning_rate: .000002
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  from_detection_checkpoint: false
  # fine_tune_checkpoint: "./run/train/modelXXXXXX.ckpt"
  num_steps: 200000
  data_augmentation_options {
    random_horizontal_flip {}
  }
  data_augmentation_options {
    random_vertical_flip {}
  }
  data_augmentation_options {
    random_adjust_brightness { max_delta: 0.15 }
  }
}
train_input_reader: {
  tf_record_input_reader {
    input_path: "./train.tfrecord"
  }
  label_map_path: "./label_map.pbtxt"
}
eval_config: {
  num_examples: 10
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
  eval_interval_secs: 300
}
eval_input_reader: {
  tf_record_input_reader {
    input_path: "./eval.tfrecord"
  }
  label_map_path: "./label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
Finally, I run Tensorflow by calling the https://github.com/tensorflow/models/blob/master/research/object_detection/train.py script. Running on a notebook Nvidia Quadro GPU, performance is around 0.600 sec/step. There are no errors in the console, but the first thing I see is that the loss seems to converge to 0.4 and stay there after relatively few (?) steps:
At around step 500, I also started the evaluation script (https://github.com/tensorflow/models/blob/master/research/object_detection/eval.py) on the CPU. It runs every 5 minutes (eval_interval_secs: 300) and I can see the output in Tensorboard.
Here is the problem. The first evaluation is relative to the checkpoint at step #0, so the output images show a bunch of randomly placed boxes, which should be normal. One notable fact is that only boxes for the first class (A) are present.
Then, from the second evaluation (around step #1000) onwards, none of the output images has any detections! No A/B class boxes are drawn, and nothing shows up until I decide to stop everything (step #10000).
I was expecting to keep seeing detections, even if wrong ones.
I have many questions and I've probably made clear mistakes in my flow (my knowledge is still very limited):
Is what I'm seeing in the loss and evaluation outputs really strange behavior?
What techniques can I use to check whether I made conceptual mistakes in data preparation?
Can I debug what's happening under the hood during training?
How about the Tensorflow config file? Is there something wrong there?
A note: I've also tried the same thing using other models like ssd_*, but the behavior is the same.

Zero accuracy training a neural network using caffe

I am training a network which has a constant accuracy of 0, so I know the network is not learning. I tried different batch sizes and learning rates, but it didn't help. What could possibly be going wrong, given the network prototxt and solver shown below? Thanks!
layer {
  name: "data"
  type: "HDF5Data"
  top: "X"
  top: "y"
  hdf5_data_param{
    source:"/A/B/trainlist.txt"
    batch_size: 1
  }
  include{phase: TRAIN}
}
layer {
  name: "data"
  type: "HDF5Data"
  top: "X"
  top: "y"
  hdf5_data_param{
    source:"/A/B/testlist.txt"
    batch_size: 1
  }
  include{phase: TEST}
}
Here is the solver.prototxt:
net: "/A/B/train.prototxt"
test_iter: 10
test_interval: 1000
base_lr: 0.01
lr_policy: "step"
gamma: 0.1
stepsize: 1000
display: 10
max_iter: 4000
momentum: 0.9
weight_decay: 0.0005
snapshot: 1000
snapshot_prefix: "/A/B/model_"
solver_mode: GPU
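As a side note on the solver settings, the "step" policy multiplies the learning rate by gamma every stepsize iterations. The sketch below evaluates that rule for the values in the solver above; step_lr is a hypothetical helper, and it only illustrates how quickly the rate shrinks over the 4000 iterations, without claiming that this is the cause of the zero accuracy.

```python
def step_lr(iteration, base_lr=0.01, gamma=0.1, stepsize=1000):
    """Learning rate under a "step" policy: base_lr * gamma ** floor(iteration / stepsize)."""
    return base_lr * gamma ** (iteration // stepsize)

# Print the rate just before and after each drop, up to max_iter.
for it in (0, 999, 1000, 2000, 3999):
    print(it, step_lr(it))
```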