TensorFlow YOLO Object Detection Loss Exploding

I am trying to implement and train YOLO on my own, based on this implementation https://github.com/allanzelener/YAD2K/. The problem I am having is that the width/height values in my prediction tensor are exploding, and I never see an IOU above 0 between my predicted objects and the ground-truth objects. This all goes wrong within the first few minibatches of the first epoch, after which the loss and most of my predicted widths/heights are nan.
My image size is 416x416, I'm using 5 anchors, and I have 5 classes. I divide the image into a 13x13 grid, giving a prediction tensor of shape [batch_size, 13, 13, 5, 10]. The ground truths for each batch have shape [batch_size, 13, 13, 5, 5], without one-hot encoding for the class probabilities.
Below is my loss function (based on https://github.com/allanzelener/YAD2K/blob/master/yad2k/models/keras_yolo.py#L152), which passes the image through my model and then calls predict_transform, which reshapes the tensor and transforms the coordinates.
def loss_custom(true_box_grid, x, training=True):
    # training=training is needed only if there are layers with different
    # behavior during training versus inference (e.g. Dropout).
    y_ = model(x, training=training)
    # (batch, rows, cols, anchors, vals)
    center_coords, wh_coords, obj_scores, class_probs = DetectNet.predict_transform(y_)
    detector_mask = create_mask(true_box_grid)
    total_loss = 0

    pred_wh_half = wh_coords / 2.
    # bottom left corner
    pred_mins = center_coords - pred_wh_half
    # top right corner
    pred_maxes = center_coords + pred_wh_half

    true_xy = true_box_grid[..., 0:2]
    true_wh = true_box_grid[..., 2:4]
    true_wh_half = true_wh / 2.
    true_mins = true_xy - true_wh_half
    true_maxes = true_xy + true_wh_half

    # max bottom left corner
    intersect_mins = tf.math.maximum(pred_mins, true_mins)
    # min top right corner
    intersect_maxes = tf.math.minimum(pred_maxes, true_maxes)
    intersect_wh = tf.math.maximum(intersect_maxes - intersect_mins, 0.)
    # product of difference between x max and x min, y max and y min
    intersect_areas = intersect_wh[..., 0] * intersect_wh[..., 1]

    pred_areas = wh_coords[..., 0] * wh_coords[..., 1]
    true_areas = true_wh[..., 0] * true_wh[..., 1]
    union_areas = pred_areas + true_areas - intersect_areas
    iou_scores = intersect_areas / union_areas

    # Best IOUs for each location.
    iou_scores = tf.expand_dims(iou_scores, 4)
    best_ious = tf.keras.backend.max(iou_scores, axis=4)  # Best IOU scores.
    best_ious = tf.expand_dims(best_ious, 4)

    # A detector has found an object if IOU > thresh for some true box.
    object_detections = tf.keras.backend.cast(best_ious > 0.6, dtype=tf.float32)

    no_obj_weights = params.noobj_loss_weight * (1 - object_detections) * (1 - detector_mask[..., :1])
    no_obj_loss = no_obj_weights * tf.math.square(obj_scores)
    # could use weight here on obj loss
    obj_conf_loss = params.obj_loss_weight * detector_mask[..., :1] * tf.math.square(1 - obj_scores)
    conf_loss = no_obj_loss + obj_conf_loss

    matching_classes = tf.cast(true_box_grid[..., 4], tf.int32)
    matching_classes = tf.one_hot(matching_classes, params.num_classes)
    class_loss = detector_mask[..., :1] * tf.math.square(matching_classes - class_probs)

    # keras_yolo does a sigmoid on center_coords here, but they should already be
    # between 0 and 1 from predict_transform
    pred_boxes = tf.concat([center_coords, wh_coords], axis=-1)
    matching_boxes = true_box_grid[..., :4]
    coord_loss = params.coord_loss_weight * detector_mask[..., :1] * tf.math.square(matching_boxes - pred_boxes)

    confidence_loss_sum = tf.keras.backend.sum(conf_loss)
    classification_loss_sum = tf.keras.backend.sum(class_loss)
    coordinates_loss_sum = tf.keras.backend.sum(coord_loss)

    # not sure why .5 is here, maybe to make sure numbers don't get too large
    total_loss = 0.5 * (confidence_loss_sum + classification_loss_sum + coordinates_loss_sum)
    return total_loss
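For context, this is roughly how a loss like this could be driven in a TF2 custom training loop; the optimizer settings and dataset name here are illustrative assumptions, not part of my original code:
import tensorflow as tf

# Hypothetical optimizer; the learning rate is an assumption.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

@tf.function
def train_step(images, true_box_grid):
    # Forward pass and loss under a gradient tape so gradients can be taken.
    with tf.GradientTape() as tape:
        loss = loss_custom(true_box_grid, images, training=True)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# Hypothetical tf.data pipeline yielding (images, true_box_grid) batches:
# for images, true_box_grid in train_dataset:
#     loss = train_step(images, true_box_grid)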
Below is predict_transform (based on https://github.com/allanzelener/YAD2K/blob/master/yad2k/models/keras_yolo.py#L66), which reshapes the prediction tensor into a grid so it can be compared with the ground-truth objects. It applies a sigmoid to the center coordinates and object scores, and a softmax to the class probabilities.
For the width/height coordinates it applies the exponential (to make them positive) and multiplies them by the anchors. This seems to be where they start exploding.
def predict_transform(predictions):
predictions = tf.reshape(predictions, [-1, params.grid_height, params.grid_width, params.num_anchors, params.pred_vec_len])
conv_dims = predictions.shape[1:3]
conv_height_index = tf.keras.backend.arange(0, stop=conv_dims[0])
conv_width_index = tf.keras.backend.arange(0, stop=conv_dims[1])
conv_height_index = tf.tile(conv_height_index, [conv_dims[1]]) # (169,) tensor with 0-12 repeating
conv_width_index = tf.tile(tf.expand_dims(conv_width_index, 0), [conv_dims[0], 1]) # (13, 13) tensor with x offset in each row
conv_width_index = tf.keras.backend.flatten(tf.transpose(conv_width_index)) # (169,) tensor with 13 0's followed by 13 1's, etc (y offsets)
conv_index = tf.transpose(tf.stack([conv_height_index, conv_width_index])) # (169, 2)
conv_index = tf.reshape(conv_index, [1, conv_dims[0], conv_dims[1], 1, 2]) # y offset, x offset
conv_dims = tf.cast(tf.reshape(conv_dims, [1, 1, 1, 1, 2]), tf.float32) # grid_height x grid_width, max dims of anchors
# makes the center coordinate between 0 and 1, each grid cell is normalized to 1 x 1
center_coords = tf.math.sigmoid(predictions[...,:2])
conv_index = tf.cast(conv_index, tf.float32)
center_coords = (center_coords + conv_index) / conv_dims
# makes the objectness score a probability between 0 and 1
obj_scores = tf.math.sigmoid(predictions[...,4:5])
anchors = DetectNet.get_anchors()
anchors = tf.reshape(anchors, [1, 1, 1, params.num_anchors, 2])
# exp to make width and height positive then multiply by anchor dims to resize box to anchor
# should fit close to anchor, normalizing by conv_dims should make it between 0 and approx 1
wh_coords = (tf.math.exp(predictions[...,2:4])*anchors) / conv_dims
# apply sigmoid to class scores to make them probabilities
class_probs = tf.keras.activations.softmax(predictions[..., 5 : 5 + params.num_classes])
# (batch, rows, cols, anchors, vals)
return center_coords, wh_coords, obj_scores, class_probs
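To see how quickly the exponential can run away, here is a tiny worked illustration (the raw network outputs and anchor value below are made up, not from my data): even moderately large raw values become huge decoded widths, and squaring them in the coordinate loss makes things worse.
import numpy as np

# Hypothetical raw width outputs t_w from an untrained network.
raw_tw = np.array([1.0, 5.0, 10.0, 20.0])
anchor_w = 3.0   # illustrative anchor width, in grid units
grid_w = 13.0

decoded_w = np.exp(raw_tw) * anchor_w / grid_w
print(decoded_w)
# -> approx [6.3e-01  3.4e+01  5.1e+03  1.1e+08]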
I have another doubt about creating the ground-truth data, based on https://github.com/allanzelener/YAD2K/blob/master/yad2k/models/keras_yolo.py#L352. In the code below, box[0] and box[1] are the center coordinates, i and j are the grid cell indices (between 0 and 13), and box[2] and box[3] are the width and height.
They have all been normalized to the grid coordinate scale (0 to 13). The code places the object in the ground-truth grid under its best-matching anchor. box[0] - j and box[1] - i ensure the center coordinates are between 0 and 1.
However, I don't understand np.log(box[2] / anchors[best_anchor][0]): the anchors are also on the grid coordinate scale, so the quotient may be less than 1, which produces a negative number after the log. I often see negative widths and heights in my ground-truth data while training, and I don't know what to make of that.
if best_iou > 0:
    adjusted_box = np.array(
        [
            box[0] - j,  # center should be between 0 and 1, like prediction will be
            box[1] - i,
            np.log(box[2] / anchors[best_anchor][0]),  # quotient might be less than one, not sure why log is used
            np.log(box[3] / anchors[best_anchor][1]),
            box[4]  # class label
        ],
        dtype=np.float32
    )
    true_box_grid[i, j, best_anchor] = adjusted_box
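As a small worked check (the numbers are made up): if the log here is meant to mirror the exp in predict_transform, then a target of log(w / anchor_w) decodes back to w via exp(t_w) * anchor_w, and the target is negative exactly when the box is smaller than its anchor.
import numpy as np

# Hypothetical ground-truth width and anchor width, both in grid units.
w_true = 1.5
anchor_w = 3.0

t_w = np.log(w_true / anchor_w)     # encoded target, negative because w_true < anchor_w
w_decoded = np.exp(t_w) * anchor_w  # decoding with the matching anchor recovers the width
print(t_w, w_decoded)               # -0.693..., 1.5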
Also here is my model, which is very watered down because of my lack of computational resources.
def create_model():
    model = models.Sequential()
    model.add(Conv2D(6, 3, padding='same', data_format='channels_last', kernel_regularizer=l2(5e-4)))
    model.add(BatchNormalization())
    model.add(LeakyReLU(alpha=0.1))
    model.add(MaxPool2D())
    model.add(Conv2D(8, 3, padding='same', data_format='channels_last', kernel_regularizer=l2(5e-4)))
    model.add(BatchNormalization())
    model.add(LeakyReLU(alpha=0.1))
    model.add(MaxPool2D())
    model.add(Conv2D(12, 3, padding='same', data_format='channels_last', kernel_regularizer=l2(5e-4)))
    model.add(BatchNormalization())
    model.add(LeakyReLU(alpha=0.1))
    model.add(Conv2D(8, 1, padding='same', data_format='channels_last', kernel_regularizer=l2(5e-4)))
    model.add(BatchNormalization())
    model.add(LeakyReLU(alpha=0.1))
    model.add(Conv2D(12, 3, padding='same', data_format='channels_last', kernel_regularizer=l2(5e-4)))
    model.add(BatchNormalization())
    model.add(LeakyReLU(alpha=0.1))
    model.add(MaxPool2D())
    model.add(Flatten())
    model.add(Dense(params.grid_height * params.grid_width * params.pred_vec_len * params.num_anchors, activation='relu'))
    return model
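As a sanity check, a quick sketch of confirming that the flat Dense output can be reshaped into the [batch, 13, 13, 5, 10] grid described above (the dummy input and the hard-coded shape are assumptions matching the stated params):
import tensorflow as tf

model = create_model()
dummy = tf.zeros([2, 416, 416, 3])            # dummy batch of two 416x416 RGB images
flat = model(dummy, training=False)           # expected shape (2, 13 * 13 * 5 * 10) = (2, 8450)
grid = tf.reshape(flat, [-1, 13, 13, 5, 10])  # same reshape that predict_transform performs
print(flat.shape, grid.shape)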
I'm wondering what I can do to prevent the predicted widths and heights, and thus the loss, from exploding. The exponential is there to ensure they are positive, which makes sense. I could also apply a sigmoid to them, but I don't want to restrict them to be between 0 and 1. In the YOLO paper the authors mention that they pretrain their network so the layer weights are already initialized when the YOLO training begins. Is this a problem of initializing the network properly?

Related

How to effectively infer artifact removal cnn models?

So earlier, I trained the SNET model on the images present here for the purpose of artifact removal. The hyperparameters with which I trained the model are mentioned below:
SNET has eight convolution-based heads, each of which outputs a patch of a certain size from the image. The input is the same patch with JPEG artifacts artificially added to it. Each predicted patch is compared with the ground-truth patch, and the MSE loss between them is backpropagated.
The hyperparameters that I used:
learning_rate = 0.0001, min_learning_rate = 0.000001 for exponential scheduler
optimizer = Adam
loss metric = MSE(mean squared error)
patch_size = 48 x 48
Training batch-size = 16
Evaluation metrics:
PSNR: This is used for quality measurement between the original and reconstructed image.
SSIM: a common metric that quantifies the difference between the values of corresponding pixels in the sample and reference images.
While training, I was evaluating the model's performance at every step. There are eight 48 x 48 patches of the same image, since there are 8 heads in the model. The model was trained for 100 iterations. Outputs of random patches from iterations 998 to 999 are given below:
The first image is the artifact-added image, the second is the predicted one, and the third is the ground-truth image.
After training the model, I had to test it on bigger images with more context (that contain an object). So instead of resizing the images to 48 x 48, I divided them into 48 x 48 patches, and therefore only tested on images whose width and height are multiples of 48. The problem is that, although the images have high PSNR and SSIM values, there is a visible seam between adjacent patches, as shown below:
Is there a way to tackle this issue efficiently? Please suggest; I'm open to any kind of feedback.
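For clarity, here is a minimal sketch of the non-overlapping 48 x 48 tiling described above; the helper names are hypothetical, and it assumes an RGB numpy image whose sides are multiples of 48:
import numpy as np

PATCH = 48

def split_into_patches(img):
    """Split an (H, W, 3) image into non-overlapping 48x48 patches, row by row."""
    h, w, _ = img.shape
    patches = [img[y:y + PATCH, x:x + PATCH]
               for y in range(0, h, PATCH)
               for x in range(0, w, PATCH)]
    return np.stack(patches), (h // PATCH, w // PATCH)

def stitch_patches(patches, grid):
    """Reassemble patches produced by split_into_patches into one image."""
    rows, cols = grid
    row_images = [np.concatenate(patches[r * cols:(r + 1) * cols], axis=1)
                  for r in range(rows)]
    return np.concatenate(row_images, axis=0)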
Below is the code for the model that I used:
import os

import cv2
import numpy as np
import tensorflow as tf
from PIL import Image

def MSE(input, target):
    # return tf.reduce_sum(tf.reduce_mean(tf.abs(input - target), axis=0))
    # Note: despite the name, this computes the mean absolute error (L1), not MSE.
    return tf.reduce_mean(tf.abs(input - target))

initializer = tf.initializers.VarianceScaling()

def EncoderBlock(x, activation=tf.keras.layers.LeakyReLU(alpha=0.2), nf=256):
    x = tf.keras.layers.Conv2D(nf, 5, strides=1, padding='same', kernel_initializer=initializer, use_bias=True)(x)
    x = activation(x)
    x = tf.keras.layers.Conv2D(nf, 5, strides=1, padding='same', kernel_initializer=initializer, use_bias=True)(x)
    x = activation(x)
    return x

def DecoderBlock(x, activation=tf.keras.layers.LeakyReLU(alpha=0.2), nf=256):
    x = tf.keras.layers.Conv2D(nf, 5, strides=1, padding='same', kernel_initializer=initializer, use_bias=True)(x)
    x = activation(x)
    x = tf.keras.layers.Conv2D(3, 5, strides=1, padding='same', kernel_initializer=initializer, use_bias=True)(x)
    x = activation(x)
    return x

def ConvolutionalUnit(x, structure_type='classic', activation=tf.keras.layers.LeakyReLU(alpha=0.2), nf=256):
    residual = x
    if structure_type == "classic":
        x = tf.keras.layers.Conv2D(nf, 5, strides=1, padding='same', kernel_initializer=initializer, use_bias=True)(x)
        x = activation(x)
        x = tf.keras.layers.Add()([x, residual])
    elif structure_type == "advanced":
        x = tf.keras.layers.Conv2D(nf, 5, strides=1, padding='same', kernel_initializer=initializer, use_bias=True)(x)
        x = activation(x)
        x = tf.keras.layers.Conv2D(nf, 5, strides=1, padding='same', kernel_initializer=initializer, use_bias=True)(x)
        x = tf.keras.layers.Lambda(lambda x: x * 0.1)(x)
        x = tf.keras.layers.Add()([x, residual])
    return x

def S_Net(channels=3, num_metrics=8, structure_type='advanced', nf=256):
    inputs = tf.keras.layers.Input(shape=[None, None, channels])
    encoder = EncoderBlock(inputs, nf=nf)

    convolution_units = []
    decoders = []
    for i in range(num_metrics):
        convolution_units.append(
            ConvolutionalUnit(convolution_units[-1] if len(convolution_units) > 0 else ConvolutionalUnit(encoder, nf=nf),
                              structure_type=structure_type, nf=nf))
        decoders.append(DecoderBlock(convolution_units[-1], nf=nf))

    return tf.keras.Model(inputs=[inputs], outputs=decoders)
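For reference, a small sketch of building the model and inspecting its eight outputs; nf is reduced here only to keep the example light, and that value is not from the original training:
model = S_Net(channels=3, num_metrics=8, structure_type='advanced', nf=16)

# One dummy 48x48 RGB patch; the model returns a list of 8 reconstructions.
dummy = tf.zeros([1, 48, 48, 3])
outputs = model(dummy)
print(len(outputs), outputs[0].shape)  # 8, (1, 48, 48, 3)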

Weighted Pixel Wise Categorical Cross Entropy for Semantic Segmentation

I have recently started learning about semantic segmentation. I am trying to train a UNet for this task. My input is 128x128x3 RGB images. My masks are made up of 4 classes (0, 1, 2, 3) and are one-hot encoded with dimension 128x128x4.
def weighted_cce(y_true, y_pred):
    weights = []
    t_inf = tf.convert_to_tensor(1e9, dtype='float32')
    t_zero = tf.convert_to_tensor(0, dtype='int64')
    for i in range(0, 4):
        l = tf.argmax(y_true, axis=-1) == i
        n = tf.cast(tf.math.count_nonzero(l), 'float32') + K.epsilon()
        weights.append(n)
    weights = [batch_size / j for j in weights]

    y_pred /= K.sum(y_pred, axis=-1, keepdims=True)
    # clip to prevent NaN's and Inf's
    y_pred = K.clip(y_pred, K.epsilon(), 1 - K.epsilon())
    # calc
    loss = y_true * K.log(y_pred) * weights
    loss = -K.sum(loss, -1)
    return loss
This is the loss function I am using, but it classifies every pixel as class 2. What am I doing wrong?
You should compute weights based on your entire data (unless your batch size is reasonably big, so that you get somewhat stable weights).
If some class is underrepresented, then with a small batch size it will get near-infinite weights.
If your target data is a numpy array:
shp = y_train.shape
totalPixels = shp[0] * shp[1] * shp[2]
weights = np.sum(y_train, axis=(0, 1, 2)) #final shape (4,)
weights = totalPixels/weights
If your data is in a Sequence generator:
totalPixels = 0
counts = np.zeros((4,))
for i in range(len(generator)):
    x, y = generator[i]
    shp = y.shape
    totalPixels += shp[0] * shp[1] * shp[2]
    counts = counts + np.sum(y, axis=(0, 1, 2))
weights = totalPixels / counts
If your data is in a yield generator (you must know how many batches you have in an epoch):
for i in range(batches_per_epoch):
    x, y = next(generator)
    # the rest is equal to the Sequence example above
Attempt 1
I don't know if newer versions of Keras are able to handle this, but you can try the simplest approach first: simply call fit or fit_generator with the class_weight argument:
model.fit(...., class_weight = {0: weights[0], 1: weights[1], 2: weights[2], 3: weights[3]})
Attempt 2
Make a healthier loss function:
weights = weights.reshape((1, 1, 1, 4))
kWeights = K.constant(weights)

def weighted_cce(y_true, y_pred):
    yWeights = kWeights * y_pred        # shape (batch, 128, 128, 4)
    yWeights = K.sum(yWeights, axis=-1) # shape (batch, 128, 128)

    loss = K.categorical_crossentropy(y_true, y_pred)  # shape (batch, 128, 128)
    wLoss = yWeights * loss

    return K.sum(wLoss, axis=(1, 2))
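A quick shape check for this second attempt, assuming weights, kWeights, and weighted_cce are defined exactly as above; the dummy tensors are made up:
import tensorflow as tf

# Dummy one-hot targets and softmax-normalized predictions for a batch of 2 images.
y_true = tf.one_hot(tf.random.uniform((2, 128, 128), 0, 4, dtype=tf.int32), 4)
y_pred = tf.nn.softmax(tf.random.normal((2, 128, 128, 4)), axis=-1)

print(weighted_cce(y_true, y_pred).shape)  # (2,) -- one weighted loss value per image
# model.compile(optimizer='adam', loss=weighted_cce)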

How do I store an intermediate convolutional layer's result in tensorflow for later processing?

The image below describes the output before the application of a max-pooling layer of a single intermediate filter layer of a CNN.
I want to store the coordinates of the pixel with intensity 4 (at the bottom right of the matrix on the LHS of the arrow) as it appears in the matrix on the LHS of the arrow. That is, the pixel at coordinate (4, 4) (1-based indexing) in the left matrix is the one that gets stored in the bottom-right cell of the matrix on the RHS of the arrow. What I want to do is store this coordinate value (4, 4), along with the coordinates of the other pixels {(2, 2) for the pixel with intensity 6, (2, 4) for the pixel with intensity 8, and (3, 1) for the pixel with intensity 3}, as a list for later processing. How do I do this in TensorFlow?
Max pooling done with a filter of size 2 x 2 and stride of 2
You can use tf.nn.max_pool_with_argmax (link).
Note:
The indices in argmax are flattened, so that a maximum value at position [b, y, x, c] becomes flattened index ((b * height + y) * width + x) * channels + c.
We need to do some processing to make it fit your coordinates.
An example:
import tensorflow as tf
import numpy as np

def max_pool_with_argmax(net, filter_h, filter_w, stride):
    output, mask = tf.nn.max_pool_with_argmax(net, ksize=[1, filter_h, filter_w, 1],
                                              strides=[1, stride, stride, 1], padding='SAME')
    # If your ksize looks like [1, stride, stride, 1]
    loc_x = mask // net.shape[2]
    loc_y = mask % net.shape[2]
    loc = tf.concat([loc_x + 1, loc_y + 1], axis=-1)  # count from 0, so add 1

    # If your ksize is all changing, use the following
    # c = tf.mod(mask, net.shape[3])
    # remain = tf.cast(tf.divide(tf.subtract(mask, c), net.shape[3]), tf.int64)
    # x = tf.mod(remain, net.shape[2])
    # remain = tf.cast(tf.divide(tf.subtract(remain, x), net.shape[2]), tf.int64)
    # y = tf.mod(remain, net.shape[1])
    # remain = tf.cast(tf.divide(tf.subtract(remain, y), net.shape[1]), tf.int64)
    # b = tf.mod(remain, net.shape[0])
    # loc = tf.concat([y + 1, x + 1], axis=-1)
    return output, loc

input = tf.Variable(np.random.rand(1, 6, 4, 1), dtype=np.float32)
output, mask = max_pool_with_argmax(input, 2, 2, 2)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    input_value, output_value, mask_value = sess.run([input, output, mask])
    print(input_value[0, :, :, 0])
    print(output_value[0, :, :, 0])
    print(mask_value[0, :, :, :])
#print
[[0.20101677 0.09207255 0.32177696 0.34424785]
[0.4116488 0.5965447 0.20575707 0.63288754]
[0.3145412 0.16090539 0.59698933 0.709239 ]
[0.00252096 0.18027237 0.11163216 0.40613824]
[0.4027637 0.1995668 0.7462126 0.68812144]
[0.8993007 0.55828506 0.5263306 0.09376772]]
[[0.5965447 0.63288754]
[0.3145412 0.709239 ]
[0.8993007 0.7462126 ]]
[[[2 2]
[2 4]]
[[3 1]
[3 4]]
[[6 1]
[5 3]]]
You can see (2, 2) for the pixel with intensity 0.5965447, (2, 4) for the pixel with intensity 0.63288754, and so on.
Let's say you have the following max-pooling layer:
pool_layer = tf.nn.max_pool(conv_output,
                            ksize=[1, 2, 2, 1],
                            strides=[1, 2, 2, 1],
                            padding='VALID')
you can use:
max_pos = tf.gradients([pool_layer], [conv_output])[0]
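This works because the gradient of max pooling is 1 at the winning input positions and 0 elsewhere, so max_pos is a 0/1 mask with the same shape as conv_output. One way the coordinates could then be pulled out, sketched in the same TF1 style and assuming the graph above:
# 0/1 mask over conv_output, same shape as conv_output
max_pos = tf.gradients([pool_layer], [conv_output])[0]

# (num_max, 4) tensor of [batch, row, col, channel] indices of the maxima
max_coords = tf.where(tf.not_equal(max_pos, 0))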

How can I use Convolutional LSTM in Keras to realize position estimation?

I used LSTM in Keras with TensorFlow.
I would like to do position estimation.
I want to input a movie (one scene is 15 frames) and estimate the position of a square moving in the movie.
The input is 15 frames. The output is 2 variables (x, y).
In the following code the estimation accuracy is very bad. What should I do?
Also, I don't understand the AveragePooling3D/Reshape (without them it will not run).
# We create a layer which takes as input movies of shape
# (n_frames, width, height, channels) and returns a movie
# of identical shape.

seq = Sequential()
seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
                   input_shape=(None, 80, 80, 1),
                   padding='same', return_sequences=True))
seq.add(BatchNormalization())

seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
                   padding='same', return_sequences=True))
seq.add(BatchNormalization())

seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
                   padding='same', return_sequences=True))
seq.add(BatchNormalization())

seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
                   padding='same', return_sequences=True))
seq.add(BatchNormalization())

#seq.add(Flatten())
seq.add(AveragePooling3D((1, 80, 80)))
seq.add(Reshape((-1, 40)))
seq.add(Dense(2))

#seq.add(Conv3D(filters=1, kernel_size=(3, 3, 3),
#               activation='sigmoid',
#               padding='same', data_format='channels_last'))

seq.compile(loss='mean_squared_error', optimizer='adam')
def generate_movies(n_samples=1200, n_frames=15):
    row = 80
    col = 80
    noisy_movies = np.zeros((n_samples, n_frames, row, col, 1), dtype=np.float)
    shifted_movies = np.zeros((n_samples, n_frames, row, col, 1), dtype=np.float)
    square_x_y = np.zeros((n_samples, n_frames, 2), dtype=np.float)

    for i in range(n_samples):
        for j in range(1):
            # Initial position
            xstart = np.random.randint(20, 60)
            ystart = np.random.randint(20, 60)
            # Direction of motion
            directionx = np.random.randint(0, 3) - 1
            directiony = np.random.randint(0, 3) - 1

            # Size of the square
            w = np.random.randint(2, 4)

            for t in range(n_frames):
                x_shift = xstart + directionx * t
                y_shift = ystart + directiony * t
                noisy_movies[i, t, x_shift - w: x_shift + w,
                             y_shift - w: y_shift + w, 0] += 1

                # Make it more robust by adding noise.
                # The idea is that if during inference,
                # the value of the pixel is not exactly one,
                # we need to train the network to be robust and still
                # consider it as a pixel belonging to a square.
                if np.random.randint(0, 2):
                    noise_f = (-1)**np.random.randint(0, 2)
                    noisy_movies[i, t,
                                 x_shift - w - 1: x_shift + w + 1,
                                 y_shift - w - 1: y_shift + w + 1,
                                 0] += noise_f * 0.1

                # Shift the ground truth by 1
                x_shift = xstart + directionx * (t + 1)
                y_shift = ystart + directiony * (t + 1)
                shifted_movies[i, t, x_shift - w: x_shift + w,
                               y_shift - w: y_shift + w, 0] += 1

                square_x_y[i, t, 0] = x_shift / row
                square_x_y[i, t, 1] = y_shift / col

    # Cut to a 40x40 window
    #noisy_movies = noisy_movies[::, ::, 20:60, 20:60, ::]
    #shifted_movies = shifted_movies[::, ::, 20:60, 20:60, ::]
    #noisy_movies[noisy_movies >= 1] = 1
    #shifted_movies[shifted_movies >= 1] = 1
    return noisy_movies, shifted_movies, square_x_y

# Train the network
noisy_movies, shifted_movies, sq_x_y = generate_movies(n_samples=1200)

seq.fit(noisy_movies[:1000], sq_x_y[:1000], batch_size=10,
        epochs=1, validation_split=0.05)
See your seq.summary() to understand the shapes.
Your last ConvLSTM2D layer is outputting something with shape
(movies, length, side1, side2, channels)
(None, None, 80, 80, 40)
For the Dense layer to work, you will need to keep the "movies" and the "length" dimension, and collapse the others into one.
Notice that the dimension 40 (the channels) is important for it represents different "concepts", while the 80s (the sides) are purely 2D positional.
Since you don't want to touch the "length" dimension, you need an AveragePooling2D (not 3D). But since you're going to detect a very distinct positional feature and its location, I suggest you don't collapse the spatial dimensions at all. It should be better to just reshape and add a Dense that will take into account these positions.
So instead of the AveragePooling I suggest you use:
seq.add(Reshape((-1, 80*80*40)))  # new shape = (movies, length, 80*80*40)
Then you add a Dense layer that still has a notion of the positions:
seq.add(Dense(2)) #output shape is (movies, length, 2)
This is not guaranteed to make your model perform well, but I believe "better".
Other things you can do are add more Dense layers so you go smoothly from the 80*80*40 features down to 2.
This is more complex, but you might study the functional API model and learn about "U-nets" and try to make something like a u-net there.
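Putting the suggestion together, the tail of the model might look like the sketch below; the hidden layer size of 256 is only an illustration and is not part of the answer above:
# Replace the AveragePooling3D / Reshape((-1, 40)) pair with:
seq.add(Reshape((-1, 80 * 80 * 40)))    # (movies, length, 80*80*40)
seq.add(Dense(256, activation='relu'))  # intermediate layer, size is illustrative
seq.add(Dense(2))                       # (movies, length, 2) -> (x, y) per frame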

Problems with reshape in GAN's discriminator (Tensorflow)

I was trying to implement various GANs in Tensorflow (after doing it successfully in PyTorch), and I am having some problems while coding the discriminator part.
The code of the discriminator (very similar to the MNIST CNN tutorial) is:
def discriminator(x):
    """Compute discriminator score for a batch of input images.

    Inputs:
    - x: TensorFlow Tensor of flattened input images, shape [batch_size, 784]

    Returns:
    TensorFlow Tensor with shape [batch_size, 1], containing the score
    for an image being real for each input image.
    """
    with tf.variable_scope("discriminator"):
        x = tf.reshape(x, [tf.shape(x)[0], 28, 28, 1])
        h_1 = leaky_relu(tf.layers.conv2d(x, 32, 5))
        m_1 = tf.layers.max_pooling2d(h_1, 2, 2)
        h_2 = leaky_relu(tf.layers.conv2d(m_1, 64, 5))
        m_2 = tf.layers.max_pooling2d(h_2, 2, 2)
        m_2 = tf.contrib.layers.flatten(m_2)
        h_3 = leaky_relu(tf.layers.dense(m_2, 4*4*64))
        logits = tf.layers.dense(h_3, 1)
        return logits
while the code for the generator (architecture of InfoGAN paper) is:
def generator(z):
    """Generate images from a random noise vector.

    Inputs:
    - z: TensorFlow Tensor of random noise with shape [batch_size, noise_dim]

    Returns:
    TensorFlow Tensor of generated images, with shape [batch_size, 784].
    """
    with tf.variable_scope("generator"):
        batch_size = tf.shape(z)[0]
        fc = tf.nn.relu(tf.layers.dense(z, 1024))
        bn_1 = tf.layers.batch_normalization(fc)
        fc_2 = tf.nn.relu(tf.layers.dense(bn_1, 7*7*128))
        bn_2 = tf.layers.batch_normalization(fc_2)
        bn_2 = tf.reshape(bn_2, [batch_size, 7, 7, 128])
        c_1 = tf.nn.relu(tf.contrib.layers.convolution2d_transpose(bn_2, 64, 4, 2, padding='valid'))
        bn_3 = tf.layers.batch_normalization(c_1)
        c_2 = tf.tanh(tf.contrib.layers.convolution2d_transpose(bn_3, 1, 4, 2, padding='valid'))
So far, so good. The number of parameters is correct (I checked). However, I am having some problems with the next block of code:
tf.reset_default_graph()

# number of images for each batch
batch_size = 128
# our noise dimension
noise_dim = 96

# placeholder for images from the training dataset
x = tf.placeholder(tf.float32, [None, 784])

# random noise fed into our generator
z = sample_noise(batch_size, noise_dim)
# generated images
G_sample = generator(z)

with tf.variable_scope("") as scope:
    # scale images to be -1 to 1
    logits_real = discriminator(preprocess_img(x))
    # Re-use discriminator weights on new inputs
    scope.reuse_variables()
    logits_fake = discriminator(G_sample)

# Get the list of variables for the discriminator and generator
D_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'discriminator')
G_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'generator')

# get our solver
D_solver, G_solver = get_solvers()

# get our loss
D_loss, G_loss = gan_loss(logits_real, logits_fake)

# setup training steps
D_train_step = D_solver.minimize(D_loss, var_list=D_vars)
G_train_step = G_solver.minimize(G_loss, var_list=G_vars)
D_extra_step = tf.get_collection(tf.GraphKeys.UPDATE_OPS, 'discriminator')
G_extra_step = tf.get_collection(tf.GraphKeys.UPDATE_OPS, 'generator')
The problem occurs where I do the reshape in the discriminator, and the error says:
ValueError: None values not supported.
Sure, the value of batch_size is None (by the way, I get the same error even when I change it to a fixed number), but the shape function (as far as I understand) should get the dynamic shape, not the static one. I think I am a bit lost here.
For what is worth, I am giving here the link to the entire notebook I am working: https://github.com/TheRevanchist/GANs/blob/master/GANs-TensorFlow.ipynb if someone wants to look at it.
NB: The code here is part of the Stanford CS231n assignment. I have no affiliation with Stanford though, so it isn't homework cheating (proof: the course is finished months ago).
The generator seems to be the problem: its output size should match what the discriminator expects. Another issue is that batch norm should be applied before the activation unit. I have modified the code:
with tf.variable_scope("generator"):
fc = tf.layers.dense(z, 4*4*128)
bn_1 = leaky_relu(tf.layers.batch_normalization(fc))
bn_1 = tf.reshape(bn_1, [-1, 4, 4, 128])
c_1 = tf.layers.conv2d_transpose(bn_1, 64, 5, strides=2, padding='same')
bn_2 = leaky_relu(tf.layers.batch_normalization(c_1))
c_2 = tf.layers.conv2d_transpose(bn_2, 32, 5, strides=2, padding='same')
bn_3 = leaky_relu(tf.layers.batch_normalization(c_2))
c_3 = tf.layers.conv2d_transpose(bn_3, 1, 5, strides=2, padding='same')
c_3 = tf.layers.batch_normalization(c_3)
c_3 = tf.image.resize_images(c_3, (28, 28))
c_3 = tf.contrib.layers.flatten(c_3)
c_3 = tf.tanh(c_3)
return c_3
Your code gives the output below when run with the above changes.
Instead of passing None to reshape you must pass -1.
So this:
x = tf.reshape(x, [tf.shape(x)[0], 28, 28, 1])
becomes
x = tf.reshape(x, [-1, 28, 28, 1])
and this:
bn_2 = tf.reshape(bn_2, [batch_size, 7, 7, 128])
becomes:
bn_2 = tf.reshape(bn_2, [-1, 7, 7, 128])
It will infer the batch size from the rest of the shape you provided.
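For illustration, a tiny standalone check of that behaviour (the shapes here are made up):
import tensorflow as tf

x = tf.zeros([128, 784])            # pretend batch of flattened MNIST images
y = tf.reshape(x, [-1, 28, 28, 1])  # -1 lets TensorFlow infer the batch dimension
print(y.shape)                      # (128, 28, 28, 1)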