Kernel crashing, trying to run a single convolution layer over 1 image - tensorflow

I am using the MNIST data and loading a single image for a CNN; I wanted to see what the image looks like after a single layer. I have gone through the documentation to check whether there were any errors in my inputs or in how I am consolidating the data. Is the code wrong in any way, or is it just my computer?
filtersw = tf.Variable(tf.random_normal(shape=[5, 5, 1, 1], mean=0.5, stddev=0.01))
filtersb = tf.Variable(tf.zeros(1))
conv = tf.nn.conv2d(x, filtersw, strides=[1, 1, 1, 1], padding='VALID') + filtersb
sess = tf.Session()
sess.run(tf.global_variables_initializer())
afterimage = sess.run(conv, feed_dict={x : X_train[0:1]})

You need to call tf.layers.conv2d instead of tf.nn.conv2d. The two functions share a name but perform different operations: tf.layers.conv2d creates a convolution layer in a CNN, while tf.nn.conv2d performs convolution with a known filter or set of filters.
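As a minimal sketch of what that change could look like (assuming x is the [None, 28, 28, 1] input placeholder and X_train is the MNIST training array from the question):

# tf.layers.conv2d creates its own weight and bias variables for the layer
conv = tf.layers.conv2d(x, filters=1, kernel_size=5, padding='valid',
                        kernel_initializer=tf.random_normal_initializer(mean=0.5, stddev=0.01))

sess = tf.Session()
sess.run(tf.global_variables_initializer())
afterimage = sess.run(conv, feed_dict={x: X_train[0:1]})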

Related

How to specify sample dependent kernels/filters in Conv2d?

I am trying to implement a convolutional autoencoder where some of the convolutional filters are input content dependent. For example, in a simple toy example, knowing the digit label for MNIST could further help with reconstruction in an autoencoder setup.
The more general idea is that there could be some relevant auxiliary information (whether it is the class label or something else) that is useful to incorporate. While there are various ways to use this label/auxiliary information, I will do so by creating a separate convolutional filter. Say the model has 15 typical convolutional filters; I would like to add an additional convolutional filter that corresponds to the MNIST digit and can be thought of as an embedding of the digit in the form of a 3x3 kernel. We would use the digit as an additional input to the network and then learn a distinct kernel/filter embedding for each digit.
However, I am having difficulty implementing a convolutional filter/kernel that is input dependent. I am not using the tf.keras.layers.Conv2D layer because it takes the number of filters to be used, but not the actual filter parameters, so it cannot be made input dependent.
# load and preprocess data
num_classes = 10
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, x_test = np.float32(x_train)/255, np.float32(x_test)/255
x_train, x_test = np.expand_dims(x_train, axis=-1), np.expand_dims(x_test, axis=-1)
y_train = keras.utils.to_categorical(y_train, num_classes=num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes=num_classes)
num_filters = 15
input_img = layers.Input(shape=(28,28,1))
conv_0 = keras.layers.Conv2D(num_filters, (3,3), strides=2, padding='same', activation='relu')(input_img)
# embed the target as a 3x3 kernel/filter -> this should map to a distinct embedding for
# each target
target = layers.Input(shape=(10,))
target_encoded = layers.Dense(9, activation='relu')(target)
target_encoded = layers.Reshape((3,3,1,1))(target_encoded)
# Using tf.nn.conv2d so that I can specify kernel
# Kernel needs to be a 4D tensor of dimensions (filter_height, filter_width, input_channels, output_channels)
# which in this case is (3,3,1,1)
# However it is currently (None,3,3,1,1) because the first dimension is batch size so this doesn't work
target_conv = tf.nn.conv2d(input_img, target_encoded, strides=[1, 1, 1, 1], padding='SAME')
I am currently using tf.nn.conv2d, which takes a kernel in the format (filter_height, filter_width, input_channels, output_channels). However, this doesn't work as is because data is fed in batches: each sample in the batch has its own label and therefore its own kernel, so the kernels have shape (None, 3, 3, 1, 1), which is not compatible with the expected format. This is illustrated in the code chunk above (which doesn't work). What are potential workarounds? Is there a simpler way to implement this concept of an input-dependent conv2d filter?
Making A Conv2D with SWAPPABLE kernel!
You'll need to make your own Conv2D that takes as input the image to process AND the kernel to use.
# Define our new Convolution
class DynamicConv2D(tf.keras.layers.Layer):
    def __init__(self, padding='SAME'):
        super(DynamicConv2D, self).__init__()
        self.padding = padding

    def call(self, input, kernel):
        return tf.nn.conv2d(input=input, filters=kernel,
                            strides=(1, 1), padding=self.padding)
And let's test it out
dc2d = DynamicConv2D(padding='VALID')
input_tensor = np.ones([1,4,4,3],dtype=np.float32)
kernel_tensor = np.ones([2,2,3,1],dtype=np.float32)
dc2d(input_tensor, kernel_tensor)
returns
array([[[[12.], [12.], [12.]],
        [[12.], [12.], [12.]],
        [[12.], [12.], [12.]]]])
It looks like it works great... but there is a HUGE problem
HUGE ISSUE WITH KERAS - BATCH BY DEFAULT
Yeah, so here is the deal: TensorFlow Keras is really, really set on everything being arranged so that the first dimension is the batch. But if you look up above, we have to specify ONE kernel for the whole batch. We can't pass in a batch of kernel_tensors, only a single one.
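A tiny illustration of the mismatch (a sketch assuming TF 2.x eager execution; the values are dummies, only the shapes matter):

import tensorflow as tf

imgs = tf.ones([4, 28, 28, 1])       # a batch of 4 images
kernels = tf.ones([4, 3, 3, 1, 1])   # one kernel per image: rank 5

# tf.nn.conv2d expects a single rank-4 filter tensor, so passing the
# per-sample kernels raises an error:
# tf.nn.conv2d(imgs, kernels, strides=1, padding='SAME')   # fails

# A single shared rank-4 kernel works fine:
out = tf.nn.conv2d(imgs, kernels[0], strides=1, padding='SAME')
print(out.shape)   # (4, 28, 28, 1)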
THERE IS A WORK AROUND!
Let's borrow something from RNN training schemes: specifically, we are going to solve this by being careful about what we send per batch. More specifically, for a given batch we are going to make sure all input images use the same kernel_tensor. You'll have to figure out how to do that efficiently with your data pipeline, but here is an example to get you going.
Working Code
(We will rewrite our dynamic conv2d so that it takes a category and stores its
own kernel per category)
# Define our new Convolution
class DynamicConv2D(tf.keras.layers.Layer):
    def __init__(self, padding='SAME', input_dim=10, kernel_shape=[3, 3, 1, 8]):
        super(DynamicConv2D, self).__init__()
        self.padding = padding
        self.input_dim = input_dim
        self.kernel_shape = kernel_shape
        self.kernel_size = kernel_shape[0] * kernel_shape[1] * kernel_shape[2] * kernel_shape[3]  # = 3*3*1*8
        self.category_to_kernel = tf.keras.layers.Embedding(self.input_dim, self.kernel_size)

    def call(self, input, categories):
        just_first_category = tf.slice(categories, (0, 0), (1, 1))
        flat_kernel = self.category_to_kernel(just_first_category)
        kernel = tf.reshape(flat_kernel, self.kernel_shape)
        return tf.nn.conv2d(input=input, filters=kernel, strides=(1, 1), padding=self.padding)
This class by default does a 3x3 convolution, reading in 1 channel from the previous layer and outputting 8.
# Example output
dc2d = DynamicConv2D(padding='VALID')
image_data = np.ones([4,10,10,1],dtype=np.float32)
# prove that you can send in a different category and get different results
print( dc2d(image_data, [[3]]*4).numpy()[0,0,0,:3] )
print( dc2d(image_data, [[4]]*4).numpy()[0,0,0,:3] )
--------
[ 0.014 -0.002 0.108]
[ 0.021 0.014 -0.034]
Use it to make a tf.Keras model
# model input
image_input = tf.keras.Input(shape=(28,28,1), dtype=tf.float32)
category_input = tf.keras.Input(shape=(1,), dtype=tf.int32)
# do convolution
dynamic_conv2d = DynamicConv2D(padding='VALID')(image_input, category_input)
# make the model
model = tf.keras.Model(inputs=[image_input, category_input], outputs=dynamic_conv2d)
And we can use the model like so
# use the model
input_as_tensor = tf.constant(image_data,dtype=tf.float32)
category_as_tensor = tf.constant([[4]]*4,dtype=tf.int32)
result = model.predict(x=(input_as_tensor, category_as_tensor))
print('The output shape is',result.shape)
print('The first 3 values of the first output image are', result[0,0,0,:3])
---------
The output shape is (4, 8, 8, 8)
The first 3 values of the first output image are [-0.028 -0.009 0.015]
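For the data-pipeline side mentioned above (keeping every image in a batch paired with the same category), here is one hedged sketch; per_digit_batches is purely illustrative and assumes integer digit labels, i.e. y_train before the to_categorical step in the question:

import numpy as np

def per_digit_batches(images, labels, batch_size=32):
    # Yield (image_batch, category_batch) pairs in which every example
    # carries the same digit, so DynamicConv2D's "use the first category
    # for the whole batch" trick stays valid.
    for digit in range(10):
        idx = np.where(labels == digit)[0]
        np.random.shuffle(idx)
        for start in range(0, len(idx) - batch_size + 1, batch_size):
            chosen = idx[start:start + batch_size]
            categories = np.full((batch_size, 1), digit, dtype=np.int32)
            yield images[chosen], categories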

TensorFlow post-LSTM fully connected layer outputs return the same values as each other

I was trying to train a sequence-to-sequence LSTM model with a dataset with three labels: [1, 0] for detection of class 1, [0, 1] for detection of class 2, and [0, 0] for detection of nothing. After getting the outputs from the LSTM network, I applied a fully connected layer to each cell's output the following way:
outputs, state = tf.nn.dynamic_rnn(cell, input)
# Shape of outputs is [batch_size, n_time_steps, n_hidden]
# As matmul works only on matrices, reshape to get the
# time dimension into the batch dimension
outputs = tf.reshape(outputs, [-1, n_hidden])
# Shape is [batch_size * n_time_steps, n_hidden]
w = tf.Variable(tf.truncated_normal(shape=[n_hidden, 2], stddev=0.1))
b = tf.Variable(tf.constant(0.1, shape=[2]))
logit = tf.add(tf.matmul(outputs, w), b, name='logit')
# Reshape back to [batch_size, n_time_steps, 2]
logit = tf.reshape(logit, [batch_size, -1, 2])
On the output, I apply tf.nn.sigmoid_cross_entropy_with_logits and reduce the mean. The model seems to work just fine achieving high accuracy and recall, except for the fact that in almost all the cases it outputs either [0, 0], or [1, 1]. The two logit outputs from the fully connected layer always have very similar values (but not the same). This effectively puts a hard-cap on precision of 50%, which the model converges to (but not a fraction of a percent above).
Now, my intuition would tell me that something must be wrong with the training step and both fully connected outputs are trained on the same data, but curiously enough when I replace my own implementation with the prepackaged one from tf.contrib:
outputs, state = tf.nn.dynamic_rnn(cell, input)
logit = tf.contrib.layers.fully_connected(outputs, 2, activation_fn=None)
without changing a single other thing, the model starts training properly. Now, the obvious solution would be to just use that implementation, but why doesn't the first one work?

Effect of max_pool in Convolutional Neural Network [tensorflow]

I'm following the Udacity Deep Learning video by Vincent Vanhoucke and trying to understand the (practical, intuitive, or obvious) effect of max pooling.
Let's say my current model (without pooling) uses convolutions with stride 2 to reduce the dimensionality.
def model(data):
    conv = tf.nn.conv2d(data, layer1_weights, [1, 2, 2, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer1_biases)
    conv = tf.nn.conv2d(hidden, layer2_weights, [1, 2, 2, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer2_biases)
    shape = hidden.get_shape().as_list()
    reshape = tf.reshape(hidden, [shape[0], shape[1] * shape[2] * shape[3]])
    hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
    return tf.matmul(hidden, layer4_weights) + layer4_biases
Now I introduced pooling: Replace the strides by a max pooling operation (nn.max_pool()) of stride 2 and kernel size 2.
def model(data):
    conv1 = tf.nn.conv2d(data, layer1_weights, [1, 1, 1, 1], padding='SAME')
    bias1 = tf.nn.relu(conv1 + layer1_biases)
    pool1 = tf.nn.max_pool(bias1, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
    conv2 = tf.nn.conv2d(pool1, layer2_weights, [1, 1, 1, 1], padding='SAME')
    bias2 = tf.nn.relu(conv2 + layer2_biases)
    pool2 = tf.nn.max_pool(bias2, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
    shape = pool2.get_shape().as_list()
    reshape = tf.reshape(pool2, [shape[0], shape[1] * shape[2] * shape[3]])
    hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
    return tf.matmul(hidden, layer4_weights) + layer4_biases
What would be a compelling reason to use the latter model instead of the no-pool model, besides the improved accuracy? Would love to have some insights from people who have already used CNNs many times!
Both approaches (strides and pooling) reduce the dimensionality of the input (for strides/pooling size > 1). This by itself is a good thing because it reduces computation time and the number of parameters, and helps prevent overfitting.
They achieve it in different ways:
you can think of strides as downsampling the result of the 1-strided convolution by taking only every s-th result.
max-pooling downsamples the result by taking the maximum over a small window. If some important feature has been found, max-pooling preserves it regardless of its position within the window.
You also mentioned "besides the improved accuracy". But almost everything people do in machine learning is to improve the accuracy (or some other loss function). So if tomorrow someone shows that sum-square-root pooling achieves the best result on many benchmarks, a lot of people will start to use it.
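To make the two reduction strategies concrete, here is a small shape check (a sketch assuming TF 2.x eager execution; the weights are random dummies):

import tensorflow as tf

x = tf.random.normal([1, 28, 28, 1])
w = tf.random.normal([3, 3, 1, 8])

# (a) stride-2 convolution: convolve and downsample in one step
a = tf.nn.conv2d(x, w, strides=[1, 2, 2, 1], padding='SAME')

# (b) stride-1 convolution followed by 2x2 max-pooling with stride 2
b = tf.nn.max_pool2d(tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME'),
                     ksize=2, strides=2, padding='SAME')

print(a.shape, b.shape)   # both (1, 14, 14, 8)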
In a classification task improving the accuracy is the goal.
However, pooling allows you to:
Reduce the input dimensionality
Force the network to learn particular features, depending on the type of pooling you apply.
Reducing the input dimensionality is something you want because it forces the network to project its learned representations into a different, lower-dimensional space. This is good computationally speaking because you have to allocate less memory and thus you can use bigger batches. But it is also desirable because high-dimensional spaces usually have a lot of redundancy and are spaces in which all objects appear sparse and dissimilar (see the curse of dimensionality).
The function you decide to use for the pooling operation, moreover, can force the network to give more importance to some features.
Max-pooling, for instance, is widely used because it allows the network to be robust to small variations of the input image.
What happens in practice is that only the features with the highest activations pass through the max-pooling gate.
If the input image is shifted by a small amount, the max-pooling op produces the same output even though the input is shifted (the maximum shift it can absorb is thus equal to the kernel size).
CNNs without pooling are also capable of learning this kind of feature, but at a bigger cost in terms of parameters and computing time (see Striving for Simplicity: The All Convolutional Net)
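A quick sketch of that shift robustness (assuming TF 2.x eager execution):

import numpy as np
import tensorflow as tf

# A single bright "feature" at (2, 2), and a copy shifted right by one pixel.
img = np.zeros((1, 6, 6, 1), dtype=np.float32)
img[0, 2, 2, 0] = 1.0
shifted = np.roll(img, shift=1, axis=2)

pool = lambda t: tf.nn.max_pool2d(t, ksize=2, strides=2, padding='SAME')
print(np.array_equal(pool(img).numpy(), pool(shifted).numpy()))   # True:
# the one-pixel shift stays inside the same 2x2 pooling window.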

Oversampling images during inference

It is a common practice in convolutional neural networks to oversample a given image during inference,
i.e., to create a batch from different transformations of the same image (most commonly different crops and mirroring), pass the entire batch through the network, and average (or apply some other reducing function) over the results to get a single prediction (caffe example).
How can this approach be implemented in tensorflow?
You can take a look at the TF cnn tutorial. In particular, the function distorted_inputs does the image preprocessing step.
In short, there are a couple of TF functions in the tf.image package that help with distorting the images. You can use either them or regular numpy functions to create an extra dimension in the input, over which you can average the results:
Before:
input_place = tf.placeholder(tf.float32, [None, 256, 256, 3])
prediction = some_model(input_place) # size: [None]
sess.run(prediction, feed_dict={input_place: batch_of_images})
After:
input_place = tf.placeholder(tf.float32, [None, NUM_OF_DISTORTIONS, 256, 256, 3])
prediction = some_model(input_place) # make sure it is of size [None, NUM_DISTORTIONS]
new_prediction = tf.reduce_mean(prediction, axis=1)
new_batch = np.zeros((batch_size, NUM_OF_DISTORTIONS, 256, 256, 3))
for i in xrange(len(batch_of_images)):
    for f in xrange(len(distortion_functions)):
        new_batch[i, f, :, :, :] = distortion_functions[f](batch_of_images[i])
sess.run(new_prediction, feed_dict={input_place: new_batch})
Take a look at TF's image-related functions. You could apply those transformations at test time to some input image, and stack all of them together to make a batch.
I imagine you could also do this using OpenCV or some other image processing tool. I don't see a need to do it in the computation graph. You could create the batches beforehand, and pass it through in feed_dict.
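A minimal sketch of that out-of-graph approach (distortion_functions and predict_fn are hypothetical placeholders, with plain numpy standing in for OpenCV):

import numpy as np

# Hypothetical distortions: identity and a horizontal mirror.
distortion_functions = [lambda im: im, lambda im: im[:, ::-1, :]]

def oversampled_prediction(predict_fn, image):
    # Stack the transformed copies into one batch, run the model once,
    # and average the per-copy predictions into a single prediction.
    batch = np.stack([f(image) for f in distortion_functions], axis=0)
    preds = predict_fn(batch)            # shape: [num_distortions, ...]
    return preds.mean(axis=0)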

Using tensorflow batch normalization stalls loss decrease

I'm using tensorflow version r0.11
I'm trying to use batch normalization (tf.contrib.layers.batch_norm()) in a conv net. As a head start, I followed the discussion in the following github issue. It seems that the 'is_training', 'reuse' and 'updates_collections' flags are still confusing (in usage), partly because of the lack of good use cases. However, my problem is that the loss does not decrease if I add a batch norm layer.
I framed the code following the structure of the CIFAR example, and I am running it in a multi-GPU fashion (for training). I have one script for training (similar to cifar10_multigpu.py) and one for testing (similar to cifar10_eval.py).
for i in xrange(all_flags.num_gpus): # Number of GPUs is 2
    with tf.device('/gpu:%d' % i):
        with tf.name_scope('%s_%d' % (all_flags.TOWER_NAME, i)) as scope:
            # Calculate the loss for one tower of the model. This function
            # constructs the entire model but shares the variables across all
            # towers.
            loss = _tower_loss(inputs[i], labels[i], scope)
            # Reuse variables for the next tower. This line makes it possible
            tf.get_variable_scope().reuse_variables()
            # More stuff happening like compute_gradients (inside gpus loop),
            # averaging gradients (outside gpus loop), applying them (outside
            # gpus loop)
The inference/model building happens within (a nested function in) the function _tower_loss. (below is an example of the function, in reality I use more layers and neurons).
def inference(inputs): # (This gets called from _tower_loss())
    # conv1
    with tf.variable_scope('conv1') as scope:
        kernel = # define kernel
        conv = tf.nn.conv2d(inputs, kernel, strides=[1, 1, 1, 1], padding='SAME')
        biases = _variable_on_gpu('biases', [64], tf.constant_initializer(0.0))
        preactivation = tf.nn.bias_add(conv, biases)
        # ReLU.
        conv1 = tf.nn.relu(preactivation, name=scope.name)
    # pool1
    pool1 = tf.nn.max_pool(conv1, ksize=[1, 2, 2, 1],
                           strides=[1, 2, 2, 1], padding='SAME', name='pool1')
    # Similarly more conv+pool and then fcs and finally logits
    return logits
I want to perform batch normalization, so I passed an additional placeholder input argument to my '_tower_loss' and 'inference' functions.
def inference(inputs, is_training):
    # BN1
    with tf.variable_scope('norm0') as scope:
        # Note that I'm using the default for 'updates_collections',
        # which is None
        norm0 = tf.contrib.layers.batch_norm(inputs, is_training=is_training,
                                             scope=scope, reuse=None)
    # conv1
    with tf.variable_scope('conv1') as scope:
        kernel = # define kernel
        conv = tf.nn.conv2d(norm0, kernel, strides=[1, 1, 1, 1], padding='SAME')
        # Rest is same
I also added normalization layers in a couple of the fc layers.
In the training code the instructions go like this
...
variable_averages = tf.train.ExponentialMovingAverage(0.9999, global_step)
variables_averages_op = variable_averages.apply(tf.trainable_variables())
train_op = tf.group(apply_gradient_op, variables_averages_op)
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS) # Line A
...
sess.run([train_op, loss, update_ops],feed_dict={is_training: True}) # Line B
...
When batch normalization is not used, Line A is not there and 'update_ops' is not run in the session in Line B.
What I'm seeing is that when batch normalization is not used, the loss starts at around 6.5 and consistently decreases to near 0, but when I'm using batch normalization the loss does not decrease after two or three hundred (minibatch) iterations and gets stuck at around 5.5 or so. Speed-wise, I'd say the performance is the same. I'm not sure what the issue is. I tried different learning rates (I'm using the Adam optimizer) with no effect. I'm not sure whether 'variables_averages_op' together with 'update_ops' is messing things up. Any help will be appreciated.