How can I use Convolutional LSTM in Keras to realize position estimation? - tensorflow

I used LSTM in Keras with Tensorflow.
I would like to realize position estimation.
I want to input the movie (1 scene is 15 frame) and estimate the position of moving the square in the movie.
Input is 15 frame. Output is 2 variable (x, y).
In the following code the estimation accuracy is too bad. What should I do?
And, I don't understand AveragePooling3D/Reshape (Without this it will not perform.).
# We create a layer which take as input movies of shape
# (n_frames, width, height, channels) and returns a movie
# of identical shape.
seq = Sequential()
seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
input_shape=(None, 80, 80, 1),
padding='same', return_sequences=True))
seq.add(BatchNormalization())
seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
padding='same', return_sequences=True))
seq.add(BatchNormalization())
seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
padding='same', return_sequences=True))
seq.add(BatchNormalization())
seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
padding='same', return_sequences=True))
seq.add(BatchNormalization())
#seq.add(Flatten())
seq.add(AveragePooling3D((1, 80, 80)))
seq.add(Reshape((-1, 40)))
seq.add(Dense(2))
#seq.add(Conv3D(filters=1, kernel_size=(3, 3, 3),
# activation='sigmoid',
# padding='same', data_format='channels_last'))
seq.compile(loss='mean_squared_error', optimizer='adam')
def generate_movies(n_samples=1200, n_frames=15):
row = 80
col = 80
noisy_movies = np.zeros((n_samples, n_frames, row, col, 1), dtype=np.float)
shifted_movies = np.zeros((n_samples, n_frames, row, col, 1),
dtype=np.float)
square_x_y = np.zeros((n_samples, n_frames, 2), dtype=np.float)
for i in range(n_samples):
for j in range(1):
# Initial position
xstart = np.random.randint(20, 60)
ystart = np.random.randint(20, 60)
# Direction of motion
directionx = np.random.randint(0, 3) - 1
directiony = np.random.randint(0, 3) - 1
# Size of the square
w = np.random.randint(2, 4)
for t in range(n_frames):
x_shift = xstart + directionx * t
y_shift = ystart + directiony * t
noisy_movies[i, t, x_shift - w: x_shift + w,
y_shift - w: y_shift + w, 0] += 1
# Make it more robust by adding noise.
# The idea is that if during inference,
# the value of the pixel is not exactly one,
# we need to train the network to be robust and still
# consider it as a pixel belonging to a square.
if np.random.randint(0, 2):
noise_f = (-1)**np.random.randint(0, 2)
noisy_movies[i, t,
x_shift - w - 1: x_shift + w + 1,
y_shift - w - 1: y_shift + w + 1,
0] += noise_f * 0.1
# Shift the ground truth by 1
x_shift = xstart + directionx * (t + 1)
y_shift = ystart + directiony * (t + 1)
shifted_movies[i, t, x_shift - w: x_shift + w,
y_shift - w: y_shift + w, 0] += 1
square_x_y[i, t, 0] = x_shift/row
square_x_y[i, t, 1] = y_shift/col
# Cut to a 40x40 window
#noisy_movies = noisy_movies[::, ::, 20:60, 20:60, ::]
#shifted_movies = shifted_movies[::, ::, 20:60, 20:60, ::]
#noisy_movies[noisy_movies >= 1] = 1
#shifted_movies[shifted_movies >= 1] = 1
return noisy_movies, shifted_movies, square_x_y
# Train the network
noisy_movies, shifted_movies, sq_x_y = generate_movies(n_samples = 1200)
seq.fit(noisy_movies[:1000], sq_x_y[:1000], batch_size=10,
epochs=1, validation_split=0.05)

See your seq.summary() to understand the shapes.
Your last ConvLSTM2D layer is outputting something with shape
(movies, length, side1, side2, channels)
(None, None, 80, 80, 40)
For the Dense layer to work, you will need to keep the "movies" and the "length" dimension, and collapse the others into one.
Notice that the dimension 40 (the channels) is important for it represents different "concepts", while the 80s (the sides) are purely 2D positional.
Since you don't want to touch the "length" dimension, you need an AveragePooling2D (not 3D). But since you're going to detect a very distinct positional feature and its location, I suggest you don't collapse the spatial dimensions at all. It should be better to just reshape and add a Dense that will take into account these positions.
So instead of the AveragePooling I suggest you use:
seq.add(Reshape((-1, 80*80*40)) #new shape is = (movies, length, 80*80*40)
Then you add a Dense layer that still has a notion of the positions:
seq.add(Dense(2)) #output shape is (movies, length, 2)
This is not guaranteed to make your model perform well, but I believe "better".
Other things you can do are adding more dense layers and smoothly go from the 80*80*40 features to 2.
This is more complex, but you might study the functional API model and learn about "U-nets" and try to make something like a u-net there.

Related

How to effectively infer artifact removal cnn models?

So earlier, I trained the SNET model on images present here for the purpose of artifact removal. The hyperparameters with which I have trained the model are mentioned below:
So SNET has eight convolution-based heads that would output a certain size of patch in the image. The input would be the same patch that has jpeg artifacts artificially added into them. The predicted and the ground truth patch would be compared with one another, and then MSE loss between them is backpropagated.
The hyperparameters, that I used:
learning_rate = 0.0001, min_learning_rate = 0.000001 for exponential scheduler
optimizer = Adam
loss metric = MSE(mean squared error)
patch_size = 48 x 48
Training batch-size = 16
Evaluation metrics:
PSNR: This is used for quality measurement between the original and reconstructed image.
SSIM: A common metric is to quantify the difference in the values of each of the corresponding pixels between the sample and the reference images.
So while training the images, I was evaluating the model's performance at every step. So there are eight 48 x 48 patches of the same image, since there 8 heads in the model. The model was trained for 100 iterations. Outputs of random patches from 998 to 999 iterations are given below:
The first image is the artifact-added image, the second is the predicted one, third is the ground truth image.
After training the model, I had to test them on bigger images with a larger context(that has an object). So instead of resizing the images to 48 x 48, I divided them into patches of 48 x 48, therefore only tested on images that have width and height values that are multiples of 48. But the problem is that, though the image has high psnr and ssim values, there is a fine gap between patch to patch as shown below:
Is there a way to efficiently tackle this issue? Please suggest, open to any kind of feedback.
Below is the code for model that I used:
import tensorflow as tf
from PIL import Image
import cv2
import tensorflow as tf
import numpy as np
import os
def MSE(input,target):
#return tf.reduce_sum(tf.reduce_mean(tf.abs(input - target),axis=0))
return tf.reduce_mean(tf.abs(input - target))
initializer = tf.initializers.VarianceScaling()
def EncoderBlock(x, activation = tf.keras.layers.LeakyReLU(alpha=0.2), nf = 256):
x = tf.keras.layers.Conv2D(nf, 5, strides=1, padding='same', kernel_initializer=initializer, use_bias=True)(x)
x = activation(x)
x = tf.keras.layers.Conv2D(nf, 5, strides=1, padding='same', kernel_initializer=initializer, use_bias=True)(x)
x = activation(x)
return x
def DecoderBlock(x, activation = tf.keras.layers.LeakyReLU(alpha=0.2), nf = 256):
x = tf.keras.layers.Conv2D(nf, 5, strides=1, padding='same', kernel_initializer=initializer, use_bias=True)(x)
x = activation(x)
x = tf.keras.layers.Conv2D(3, 5, strides=1, padding='same', kernel_initializer=initializer, use_bias=True)(x)
x = activation(x)
return x
def ConvolutionalUnit(x, structure_type = 'classic', activation = tf.keras.layers.LeakyReLU(alpha=0.2), nf = 256):
residual = x
if structure_type == "classic":
x = tf.keras.layers.Conv2D(nf, 5, strides=1, padding='same', kernel_initializer=initializer, use_bias=True)(x)
x = activation(x)
x = tf.keras.layers.Add()([x, residual])
elif structure_type == "advanced":
x = tf.keras.layers.Conv2D(nf, 5, strides=1, padding='same', kernel_initializer=initializer, use_bias=True)(x)
x = activation(x)
x = tf.keras.layers.Conv2D(nf, 5, strides=1, padding='same', kernel_initializer=initializer, use_bias=True)(x)
x = tf.keras.layers.Lambda(lambda x: x * 0.1)(x)
x = tf.keras.layers.Add()([x, residual])
return x
def S_Net(channels = 3, num_metrics=8 , structure_type='advanced', nf = 256):
inputs = tf.keras.layers.Input(shape=[None, None, channels])
encoder = EncoderBlock(inputs, nf = nf)
convolution_units = []
decoders = []
for i in range(num_metrics):
convolution_units.append(ConvolutionalUnit( convolution_units[-1] if len(convolution_units)>0 else ConvolutionalUnit(encoder, nf=nf), structure_type = structure_type, nf=nf))
decoders.append(DecoderBlock(convolution_units[-1],nf=nf))
return tf.keras.Model(inputs=[inputs], outputs=decoders)

How to add an additional Conv2DTranspose layer to get 56x56 mask

def build_fpn_mask_graph(rois, feature_maps, image_meta,
pool_size, num_classes, train_bn=True):
"""Builds the computation graph of the mask head of Feature Pyramid Network.
rois: [batch, num_rois, (y1, x1, y2, x2)] Proposal boxes in normalized
coordinates.
feature_maps: List of feature maps from different layers of the pyramid,
[P2, P3, P4, P5]. Each has a different resolution.
image_meta: [batch, (meta data)] Image details. See compose_image_meta()
pool_size: The width of the square feature map generated from ROI Pooling.
num_classes: number of classes, which determines the depth of the results
train_bn: Boolean. Train or freeze Batch Norm layers
Returns: Masks [batch, num_rois, MASK_POOL_SIZE, MASK_POOL_SIZE, NUM_CLASSES]
"""
# ROI Pooling
# Shape: [batch, num_rois, MASK_POOL_SIZE, MASK_POOL_SIZE, channels]
x = PyramidROIAlign([pool_size, pool_size],
name="roi_align_mask")([rois, image_meta] + feature_maps)
# Conv layers
x = KL.TimeDistributed(KL.Conv2D(256, (3, 3), padding="same"),
name="mrcnn_mask_conv1")(x)
x = KL.TimeDistributed(BatchNorm(),
name='mrcnn_mask_bn1')(x, training=train_bn)
x = KL.Activation('relu')(x)
x = KL.TimeDistributed(KL.Conv2D(256, (3, 3), padding="same"),
name="mrcnn_mask_conv2")(x)
x = KL.TimeDistributed(BatchNorm(),
name='mrcnn_mask_bn2')(x, training=train_bn)
x = KL.Activation('relu')(x)
x = KL.TimeDistributed(KL.Conv2D(256, (3, 3), padding="same"),
name="mrcnn_mask_conv3")(x)
x = KL.TimeDistributed(BatchNorm(),
name='mrcnn_mask_bn3')(x, training=train_bn)
x = KL.Activation('relu')(x)
x = KL.TimeDistributed(KL.Conv2D(256, (3, 3), padding="same"),
name="mrcnn_mask_conv4")(x)
x = KL.TimeDistributed(BatchNorm(),
name='mrcnn_mask_bn4')(x, training=train_bn)
x = KL.Activation('relu')(x)
x = KL.TimeDistributed(KL.Conv2DTranspose(256, (2, 2), strides=2, activation="relu"),
name="mrcnn_mask_deconv")(x)
x = KL.TimeDistributed(KL.Conv2D(num_classes, (1, 1), strides=1, activation="sigmoid"),
name="mrcnn_mask")(x)
return x
Above code, I have got from matterport mask rcnn. I need to add an additional Conv2DTranspose layer to get a 56x56 mask. How can I do that? I don't know anything about TensorFlow. I think above code is designed for 28x28 mask. But I need to get 56x56 resolution mask.
You should add another
x = KL.TimeDistributed(KL.Conv2DTranspose(256, (2, 2), strides=2, activation="relu"),
name="mrcnn_mask_deconv")(x)
before the final layer with the sigmoid activation.
It does not matter if you know TensorFlow or PyTorch. Think of the TransposedConvolution() as the opposite of Convolution().
When applying on an image of 4x4 a convolution with kernel (2,2) with strides = 2, we obtain a final output feature map of dimension 2x2. Exactly the opposite when it comes to TransposedConvolution; a transposed convolution with a kernel of (2,2) and a stride of 2 will double the size of the output image. hence transforming an input image of dimension 2x2 to an output one of 4x4.

Low Triplet loss accuracy CIFAR10

I started learning about triplet networks and decided to implement using convolutional neural networks, but I decided to use the CIFAR-10 dataset for image classification, but I get very low accuracy.
After training the accuracy is around 0.32. .
def pairwise_distance(feature, squared=False):
"""Computes the pairwise distance matrix with numerical stability.
output[i, j] = || feature[i, :] - feature[j, :] ||_2
Args:
feature: 2-D Tensor of size [number of data, feature dimension].
squared: Boolean, whether or not to square the pairwise distances.
Returns:
pairwise_distances: 2-D Tensor of size [number of data, number of data].
"""
pairwise_distances_squared = math_ops.add(
math_ops.reduce_sum(math_ops.square(feature), axis=[1], keepdims=True),
math_ops.reduce_sum(
math_ops.square(array_ops.transpose(feature)),
axis=[0],
keepdims=True)) - 2.0 * math_ops.matmul(feature,
array_ops.transpose(feature))
# Deal with numerical inaccuracies. Set small negatives to zero.
pairwise_distances_squared = math_ops.maximum(pairwise_distances_squared, 0.0)
# Get the mask where the zero distances are at.
error_mask = math_ops.less_equal(pairwise_distances_squared, 0.0)
# Optionally take the sqrt.
if squared:
pairwise_distances = pairwise_distances_squared
else:
pairwise_distances = math_ops.sqrt(
pairwise_distances_squared + math_ops.to_float(error_mask) * 1e-16)
# Undo conditionally adding 1e-16.
pairwise_distances = math_ops.multiply(
pairwise_distances, math_ops.to_float(math_ops.logical_not(error_mask)))
num_data = array_ops.shape(feature)[0]
# Explicitly set diagonals to zero.
mask_offdiagonals = array_ops.ones_like(pairwise_distances) - array_ops.diag(
array_ops.ones([num_data]))
pairwise_distances = math_ops.multiply(pairwise_distances, mask_offdiagonals)
return pairwise_distances
def masked_maximum(data, mask, dim=1):
"""Computes the axis wise maximum over chosen elements.
Args:
data: 2-D float `Tensor` of size [n, m].
mask: 2-D Boolean `Tensor` of size [n, m].
dim: The dimension over which to compute the maximum.
Returns:
masked_maximums: N-D `Tensor`.
The maximized dimension is of size 1 after the operation.
"""
axis_minimums = math_ops.reduce_min(data, dim, keepdims=True)
masked_maximums = math_ops.reduce_max(
math_ops.multiply(data - axis_minimums, mask), dim,
keepdims=True) + axis_minimums
return masked_maximums
def masked_minimum(data, mask, dim=1):
"""Computes the axis wise minimum over chosen elements.
Args:
data: 2-D float `Tensor` of size [n, m].
mask: 2-D Boolean `Tensor` of size [n, m].
dim: The dimension over which to compute the minimum.
Returns:
masked_minimums: N-D `Tensor`.
The minimized dimension is of size 1 after the operation.
"""
axis_maximums = math_ops.reduce_max(data, dim, keepdims=True)
masked_minimums = math_ops.reduce_min(
math_ops.multiply(data - axis_maximums, mask), dim,
keepdims=True) + axis_maximums
return masked_minimums
def triplet_loss_adapted_from_tf(y_true, y_pred):
del y_true
margin = 1.
labels = y_pred[:, :1]
labels = tf.cast(labels, dtype='int32')
embeddings = y_pred[:, 1:]
### Code from Tensorflow function [tf.contrib.losses.metric_learning.triplet_semihard_loss] starts here:
# Reshape [batch_size] label tensor to a [batch_size, 1] label tensor.
# lshape=array_ops.shape(labels)
# assert lshape.shape == 1
# labels = array_ops.reshape(labels, [lshape[0], 1])
# Build pairwise squared distance matrix.
pdist_matrix = pairwise_distance(embeddings, squared=True)
# Build pairwise binary adjacency matrix.
adjacency = math_ops.equal(labels, array_ops.transpose(labels))
# Invert so we can select negatives only.
adjacency_not = math_ops.logical_not(adjacency)
# global batch_size
batch_size = array_ops.size(labels) # was 'array_ops.size(labels)'
# Compute the mask.
pdist_matrix_tile = array_ops.tile(pdist_matrix, [batch_size, 1])
mask = math_ops.logical_and(
array_ops.tile(adjacency_not, [batch_size, 1]),
math_ops.greater(
pdist_matrix_tile, array_ops.reshape(
array_ops.transpose(pdist_matrix), [-1, 1])))
mask_final = array_ops.reshape(
math_ops.greater(
math_ops.reduce_sum(
math_ops.cast(mask, dtype=dtypes.float32), 1, keepdims=True),
0.0), [batch_size, batch_size])
mask_final = array_ops.transpose(mask_final)
adjacency_not = math_ops.cast(adjacency_not, dtype=dtypes.float32)
mask = math_ops.cast(mask, dtype=dtypes.float32)
# negatives_outside: smallest D_an where D_an > D_ap.
negatives_outside = array_ops.reshape(
masked_minimum(pdist_matrix_tile, mask), [batch_size, batch_size])
negatives_outside = array_ops.transpose(negatives_outside)
# negatives_inside: largest D_an.
negatives_inside = array_ops.tile(
masked_maximum(pdist_matrix, adjacency_not), [1, batch_size])
semi_hard_negatives = array_ops.where(
mask_final, negatives_outside, negatives_inside)
loss_mat = math_ops.add(margin, pdist_matrix - semi_hard_negatives)
mask_positives = math_ops.cast(
adjacency, dtype=dtypes.float32) - array_ops.diag(
array_ops.ones([batch_size]))
# In lifted-struct, the authors multiply 0.5 for upper triangular
# in semihard, they take all positive pairs except the diagonal.
num_positives = math_ops.reduce_sum(mask_positives)
semi_hard_triplet_loss_distance = math_ops.truediv(
math_ops.reduce_sum(
math_ops.maximum(
math_ops.multiply(loss_mat, mask_positives), 0.0)),
num_positives,
name='triplet_semihard_loss')
### Code from Tensorflow function semi-hard triplet loss ENDS here.
return semi_hard_triplet_loss_distance
def create_base_network(image_input_shape, embedding_size):
weight_decay = 1e-4
model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same', kernel_regularizer=regularizers.l2(weight_decay),
input_shape=image_input_shape))
model.add(Activation('elu'))
model.add(BatchNormalization())
model.add(Conv2D(32, (3, 3), padding='same', kernel_regularizer=regularizers.l2(weight_decay)))
model.add(Activation('elu'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))
model.add(Conv2D(64, (3, 3), padding='same', kernel_regularizer=regularizers.l2(weight_decay)))
model.add(Activation('elu'))
model.add(BatchNormalization())
model.add(Conv2D(64, (3, 3), padding='same', kernel_regularizer=regularizers.l2(weight_decay)))
model.add(Activation('elu'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.3))
model.add(Conv2D(128, (3, 3), padding='same', kernel_regularizer=regularizers.l2(weight_decay)))
model.add(Activation('elu'))
model.add(BatchNormalization())
model.add(Conv2D(128, (3, 3), padding='same', kernel_regularizer=regularizers.l2(weight_decay)))
model.add(Activation('elu'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.4))
model.add(Flatten())
model.add(Dense(embedding_size, activation='softmax'))
return model
if __name__ == "__main__":
# in case this scriot is called from another file, let's make sure it doesn't start training the network...
batch_size = 128
epochs = 100
train_flag = True # either True or False
embedding_size = 64
no_of_components = 2 # for visualization -> PCA.fit_transform()
step = 10
# The data, split between train and test sets
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255.
x_test /= 255.
input_image_shape = (32, 32, 3)
x_val = x_test#[:2000, :, :]
y_val = y_test#[:2000]
# Network training...
if train_flag == True:
base_network = create_base_network(input_image_shape, embedding_size)
input_images = Input(shape=input_image_shape, name='input_image') # input layer for images
input_labels = Input(shape=(1,), name='input_label') # input layer for labels
embeddings = base_network([input_images]) # output of network -> embeddings
labels_plus_embeddings = concatenate([input_labels, embeddings]) # concatenating the labels + embeddings
# Defining a model with inputs (images, labels) and outputs (labels_plus_embeddings)
model = Model(inputs=[input_images, input_labels],
outputs=labels_plus_embeddings)
#model.summary()
#plot_model(model, to_file='model.png', show_shapes=True, show_layer_names=True)
# train session
opt = Adam(lr=0.001) # choose optimiser. RMS is good too!
model.compile(loss=triplet_loss_adapted_from_tf,
optimizer=opt)
filepath = "semiH_trip_MNIST_v13_ep{epoch:02d}_BS%d.hdf5" % batch_size
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=False, period=25)
callbacks_list = [checkpoint]
# Uses 'dummy' embeddings + dummy gt labels. Will be removed as soon as loaded, to free memory
dummy_gt_train = np.zeros((len(x_train), embedding_size + 1))
dummy_gt_val = np.zeros((len(x_val), embedding_size + 1))
x_train = np.reshape(x_train, (len(x_train), x_train.shape[1], x_train.shape[1], 3))
x_val = np.reshape(x_val, (len(x_val), x_train.shape[1], x_train.shape[1], 3))
H = model.fit(
x=[x_train, y_train],
y=dummy_gt_train,
batch_size=batch_size,
epochs=epochs,
validation_data=([x_val, y_val], dummy_gt_val),
callbacks=callbacks_list)
else:
#####
model = load_model('semiH_trip_MNIST_v13_ep25_BS256.hdf5',
custom_objects={'triplet_loss_adapted_from_tf': triplet_loss_adapted_from_tf})
# Test the network
# creating an empty network
testing_embeddings = create_base_network(input_image_shape,
embedding_size=embedding_size)
x_embeddings_before_train = testing_embeddings.predict(np.reshape(x_test, (len(x_test), 32, 32, 3)))
# Grabbing the weights from the trained network
for layer_target, layer_source in zip(testing_embeddings.layers, model.layers[2].layers):
weights = layer_source.get_weights()
layer_target.set_weights(weights)
del weights
# Visualizing the effect of embeddings -> using PCA!
x_embeddings = testing_embeddings.predict(x_train)
y_embeddings = testing_embeddings.predict(x_val)
svc = SVC()
svc.fit(x_embeddings, y_train)
valid_prediction = svc.predict(y_embeddings)
print(valid_prediction.shape)
print("validation accuracy : ", accuracy_score(y_val, valid_prediction))
i would really appriciate if you guys could check if i am not doing things right. Hope to hear from anyone soon
Try a simpler network like this (from here):
def create_base_network(image_input_shape, embedding_size):
input_image = Input(shape=image_input_shape)# input_5:InputLayer
x = Flatten()(input_image)
x = Dense(128, activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(128, activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(embedding_size)(x) # dense_15: Dense
base_network = Model(inputs=input_image, outputs=x)
plot_model(base_network, to_file='base_netwoN.png',
show_shapes=True, show_layer_names=True)
return base_network

Tensorflow YOLO Object Detection Loss Exploding

I am trying to implement and train YOLO on my own, based on this implementation https://github.com/allanzelener/YAD2K/. The problems I am having is the width/height values in my prediction tensor are exploding and I never see an IOU above 0 between my prediction objects and ground truth objects. This all goes wrong within the first few minibatches of the first epoch. The loss and most of my prediction width/heights are nan.
My image size is 416x416, I'm using 5 anchors, and have 5 classes. I'm dividing the image into a 13x13 grid for a prediction tensor of [batch_size, 13, 13, 5, 10]. The ground truths for each batch are [batch_size, 13, 13, 5, 5], without one hot for the class probabilities.
Below is my loss function (based on https://github.com/allanzelener/YAD2K/blob/master/yad2k/models/keras_yolo.py#L152), which passes the image to my model and then calls predict_transform which reshapes the tensor and transforms the coordinates.
def loss_custom(true_box_grid, x):
# training=training is needed only if there are layers with different
# behavior during training versus inference (e.g. Dropout).
y_ = model(x, training=training)
# (batch, rows, cols, anchors, vals)
center_coords, wh_coords, obj_scores, class_probs = DetectNet.predict_transform(y_)
detector_mask = create_mask(true_box_grid)
total_loss = 0
pred_wh_half = wh_coords / 2.
# bottom left corner
pred_mins = center_coords - pred_wh_half
# top right corner
pred_maxes = center_coords + pred_wh_half
true_xy = true_box_grid[..., 0:2]
true_wh = true_box_grid[..., 2:4]
true_wh_half = true_wh / 2.
true_mins = true_xy - true_wh_half
true_maxes = true_xy + true_wh_half
# max bottom left corner
intersect_mins = tf.math.maximum(pred_mins, true_mins)
# min top right corner
intersect_maxes = tf.math.minimum(pred_maxes, true_maxes)
intersect_wh = tf.math.maximum(intersect_maxes - intersect_mins, 0.)
# product of difference between x max and x min, y max and y min
intersect_areas = intersect_wh[..., 0] * intersect_wh[..., 1]
pred_areas = wh_coords[..., 0] * wh_coords[..., 1]
true_areas = true_wh[..., 0] * true_wh[..., 1]
union_areas = pred_areas + true_areas - intersect_areas
iou_scores = intersect_areas / union_areas
# Best IOUs for each location.
iou_scores = tf.expand_dims(iou_scores, 4)
best_ious = tf.keras.backend.max(iou_scores, axis=4) # Best IOU scores.
best_ious = tf.expand_dims(best_ious, 4)
# A detector has found an object if IOU > thresh for some true box.
object_detections = tf.keras.backend.cast(best_ious > 0.6, dtype=tf.float32)
no_obj_weights = params.noobj_loss_weight * (1 - object_detections) * (1 - detector_mask[...,:1])
no_obj_loss = no_obj_weights * tf.math.square(obj_scores)
# could use weight here on obj loss
obj_conf_loss = params.obj_loss_weight * detector_mask[...,:1] * tf.math.square(1 - obj_scores)
conf_loss = no_obj_loss + obj_conf_loss
matching_classes = tf.cast(true_box_grid[...,4], tf.int32)
matching_classes = tf.one_hot(matching_classes, params.num_classes)
class_loss = detector_mask[..., :1] * tf.math.square(matching_classes - class_probs)
# keras_yolo does a sigmoid on center_coords here but they should already be between 0 and 1 from predict_transform
pred_boxes = tf.concat([center_coords, wh_coords], axis=-1)
matching_boxes = true_box_grid[..., :4]
coord_loss = params.coord_loss_weight * detector_mask[..., :1] * tf.math.square(matching_boxes - pred_boxes)
confidence_loss_sum = tf.keras.backend.sum(conf_loss)
classification_loss_sum = tf.keras.backend.sum(class_loss)
coordinates_loss_sum = tf.keras.backend.sum(coord_loss)
# not sure why .5 is here, maybe to make sure numbers don't get too large
total_loss = 0.5 * (confidence_loss_sum + classification_loss_sum + coordinates_loss_sum)
return total_loss
Below is predict_transform (based on https://github.com/allanzelener/YAD2K/blob/master/yad2k/models/keras_yolo.py#L66) which reshapes the prediction tensor into a grid in order to compare with the ground truth objects. For the center coordinates, object scores, and class probabilities it does a sigmoid or softmax.
For the width height coordinates it performs the exponential operation on them (to make them positive) and multiplies them by the anchors. This seems to be where they start exploding.
def predict_transform(predictions):
predictions = tf.reshape(predictions, [-1, params.grid_height, params.grid_width, params.num_anchors, params.pred_vec_len])
conv_dims = predictions.shape[1:3]
conv_height_index = tf.keras.backend.arange(0, stop=conv_dims[0])
conv_width_index = tf.keras.backend.arange(0, stop=conv_dims[1])
conv_height_index = tf.tile(conv_height_index, [conv_dims[1]]) # (169,) tensor with 0-12 repeating
conv_width_index = tf.tile(tf.expand_dims(conv_width_index, 0), [conv_dims[0], 1]) # (13, 13) tensor with x offset in each row
conv_width_index = tf.keras.backend.flatten(tf.transpose(conv_width_index)) # (169,) tensor with 13 0's followed by 13 1's, etc (y offsets)
conv_index = tf.transpose(tf.stack([conv_height_index, conv_width_index])) # (169, 2)
conv_index = tf.reshape(conv_index, [1, conv_dims[0], conv_dims[1], 1, 2]) # y offset, x offset
conv_dims = tf.cast(tf.reshape(conv_dims, [1, 1, 1, 1, 2]), tf.float32) # grid_height x grid_width, max dims of anchors
# makes the center coordinate between 0 and 1, each grid cell is normalized to 1 x 1
center_coords = tf.math.sigmoid(predictions[...,:2])
conv_index = tf.cast(conv_index, tf.float32)
center_coords = (center_coords + conv_index) / conv_dims
# makes the objectness score a probability between 0 and 1
obj_scores = tf.math.sigmoid(predictions[...,4:5])
anchors = DetectNet.get_anchors()
anchors = tf.reshape(anchors, [1, 1, 1, params.num_anchors, 2])
# exp to make width and height positive then multiply by anchor dims to resize box to anchor
# should fit close to anchor, normalizing by conv_dims should make it between 0 and approx 1
wh_coords = (tf.math.exp(predictions[...,2:4])*anchors) / conv_dims
# apply sigmoid to class scores to make them probabilities
class_probs = tf.keras.activations.softmax(predictions[..., 5 : 5 + params.num_classes])
# (batch, rows, cols, anchors, vals)
return center_coords, wh_coords, obj_scores, class_probs
I have another doubt in creating the ground truth data, based on https://github.com/allanzelener/YAD2K/blob/master/yad2k/models/keras_yolo.py#L352. In the below box[0] and box[1] are the center coordinates, i and j are the grid cell coordinates (between 0 and 13), and box[2] and box[3] are the width and height.
They have all been normalized to be within the grid coordinates (0 to 13). Its placing the object in the ground truth grid with its corresponding best anchor. box[0] - j and box[1] - i ensure the center coordinates are between 0 and 1.
However I don't understand np.log(box[2] / anchors[best_anchor][0]), as the anchors are also on the grid coordinate scale, and the quotient may be less than 1, which will produce a negative number after the log. I often see negative widths and heights in my ground truth data as I am training, and don't know what to make of that.
if best_iou > 0:
adjusted_box = np.array(
[
box[0] - j, # center should be between 0 and 1, like prediction will be
box[1] - i,
np.log(box[2] / anchors[best_anchor][0]), # quotient might be less than one, not sure why log is used
np.log(box[3] / anchors[best_anchor][1]),
box[4] # class label
],
dtype=np.float32
)
true_box_grid[i, j, best_anchor] = adjusted_box
Also here is my model, which is very watered down because of my lack of computational resources.
def create_model():
model = models.Sequential()
model.add(Conv2D(6, 3, padding='same', data_format='channels_last', kernel_regularizer=l2(5e-4)))
model.add(BatchNormalization())
model.add(LeakyReLU(alpha=0.1))
model.add(MaxPool2D())
model.add(Conv2D(8, 3, padding='same', data_format='channels_last', kernel_regularizer=l2(5e-4)))
model.add(BatchNormalization())
model.add(LeakyReLU(alpha=0.1))
model.add(MaxPool2D())
model.add(Conv2D(12, 3, padding='same', data_format='channels_last', kernel_regularizer=l2(5e-4)))
model.add(BatchNormalization())
model.add(LeakyReLU(alpha=0.1))
model.add(Conv2D(8, 1, padding='same', data_format='channels_last', kernel_regularizer=l2(5e-4)))
model.add(BatchNormalization())
model.add(LeakyReLU(alpha=0.1))
model.add(Conv2D(12, 3, padding='same', data_format='channels_last', kernel_regularizer=l2(5e-4)))
model.add(BatchNormalization())
model.add(LeakyReLU(alpha=0.1))
model.add(MaxPool2D())
model.add(Flatten())
model.add(Dense(params.grid_height * params.grid_width * params.pred_vec_len * params.num_anchors, activation='relu'))
return model
I'm wondering what I can do to prevent the predicted widths and heights, and thus the loss from exploding. The exponential is there to ensure they are positive, which makes sense. I could also do a sigmoid on them, but I don't want to restrict them to be between 0 and 1. In the YOLO paper, they mention that they pretrain their network so that the layer weights are already initialized when the YOLO training begins. Is this a problem of initializing the network properly?

How to get CNN kernel values in Tensorflow

I am using the code below to create CNN layers.
conv1 = tf.layers.conv2d(inputs = input, filters = 20, kernel_size = [3,3],
padding = "same", activation = tf.nn.relu)
and I want to get the values of all kernels after training. It does not work it I simply do
kernels = conv1.kernel
So how should I retrieve the value of these kernels? I am also not sure what variables and method does conv2d has since tensorflow don't really tell it in conv2d class.
You can find all the variables in list returned by tf.global_variables() and easily lookup for variable you need.
If you wish to get these variables by name, declare a layer as:
conv_layer_1 = tf.layers.conv2d(activation=tf.nn.relu,
filters=10,
inputs=input_placeholder,
kernel_size=(3, 3),
name="conv1", # NOTE THE NAME
padding="same",
strides=(1, 1))
Recover the graph as:
gr = tf.get_default_graph()
Recover the kernel values as:
conv1_kernel_val = gr.get_tensor_by_name('conv1/kernel:0').eval()
Recover the bias values as:
conv1_bias_val = gr.get_tensor_by_name('conv1/bias:0').eval()
You mean you want to get the value of the weights for the conv1 layer.
You haven't actually defined the weights with conv2d, you need to do that. When I create a convolutional layer I use a function that performs all the necessary steps, here's a copy/paste of the function I use to create a each of my convolutional layers:
def _conv_layer(self, name, in_channels, filters, kernel, input_tensor, strides, dtype=tf.float32):
with tf.variable_scope(name):
w = tf.get_variable("w", shape=[kernel, kernel, in_channels, filters],
initializer=tf.contrib.layers.xavier_initializer_conv2d(), dtype=dtype)
b = tf.get_variable("b", shape=[filters], initializer=tf.constant_initializer(0.0), dtype=dtype)
c = tf.nn.conv2d(input_tensor, w, strides, padding='SAME', name=name + "c")
a = tf.nn.relu(c + b, name=name + "_a")
print name + "_a", a.get_shape().as_list(), name + "_w", w.get_shape().as_list(), \
"params", np.prod(w.get_shape().as_list()[1:]) + filters
return a, w.get_shape().as_list()
This is what I use to define 5 convolutional layers, this example is straight out of my code, so note that it's 5 convolutional layers stacked without using max pooling or anything, strides of 2 and 5x5 kernels.
conv1_a, _ = self._conv_layer("conv1", 3, 24, 5, self.imgs4d, [1, 2, 2, 1]) # 24.8 MiB/feature -> 540 x 960
conv2_a, _ = self._conv_layer("conv2", 24, 80, 5, conv1_a, [1, 2, 2, 1]) # 6.2 MiB -> 270 x 480
conv3_a, _ = self._conv_layer("conv3", 80, 256, 5, conv2_a, [1, 2, 2, 1]) # 1.5 MiB -> 135 x 240
conv4_a, _ = self._conv_layer("conv4", 256, 750, 5, conv3_a, [1, 2, 2, 1]) # 0.4 MiB -> 68 x 120
conv5_a, _ = self._conv_layer("conv5", 750, 2048, 5, conv4_a, [1, 2, 2, 1]) # 0.1 MiB -> 34 x 60
There's also a good tutorial on the tensorflow website on how to set up a convolutional network:
https://www.tensorflow.org/tutorials/deep_cnn
The direct answer to your question is that the weights for the convolutional layer are defined there as w, that's the tensor you're asking about if I understand you correctly.