TensorFlow VGG16 (converted from Caffe) gets low evaluation accuracy

I didn't convert the weights myself; instead I used vgg16_weights.npz from www.cs.toronto.edu/~frossard/post/vgg16/. There, it is mentioned:
We convert the Caffe weights publicly available in the author’s GitHub profile (gist.github.com/ksimonyan/211839e770f7b538e2d8#file-readme-md) using a specialized tool (github.com/ethereon/caffe-tensorflow).
But that page has no validation code, so I wrote my own, referring to the TensorFlow MNIST and Inception code.
How I create TFRecords of Imagenet
I use build_imagenet_data.py from Inception. I changed
label_index = 0  # originally label_index = 1
because Inception uses label_index 0 as a background class (so in total there are 1001 classes), while the Caffe format doesn't, since the number of outputs is 1000. I prefer the TFRecord format because I will later process the weights and retrain.
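Put concretely, the two label conventions differ like this (a sketch for illustration, not a line from build_imagenet_data.py; synset_index is a hypothetical name running 0..999):
inception_label = synset_index + 1   # 1..1000; class 0 is reserved for background (1001 classes total)
caffe_label = synset_index           # 0..999; matches the 1000 outputs of the Caffe VGG-16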
How I load the weights
The inference function taken from MNIST's mnist.py was modified so that each Variable is initialized from vgg16_weights.npz.
How I load the weights:
weights = np.load('/the_path/vgg16_weights.npz')
How I put the variable in conv1_1:
with tf.name_scope('conv1_1') as scope:
    kernel = tf.Variable(tf.constant(weights['conv1_1_W']), name='weights')
    conv = tf.nn.conv2d(images, kernel, [1, 1, 1, 1], padding='SAME')
    biases = tf.Variable(tf.constant(weights['conv1_1_b']), name='biases')
    out = tf.nn.bias_add(conv, biases)
    conv1_1 = tf.nn.relu(out, name=scope)
sess.run(conv1_1)
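The same pattern extends to the fully connected layers; a minimal sketch, assuming the archive stores them under keys like 'fc8_W' and 'fc8_b' and that fc7 is the previous layer's output:
with tf.name_scope('fc8') as scope:
    fc8_W = tf.Variable(tf.constant(weights['fc8_W']), name='weights')   # assumed key name
    fc8_b = tf.Variable(tf.constant(weights['fc8_b']), name='biases')    # assumed key name
    fc8 = tf.nn.bias_add(tf.matmul(fc7, fc8_W), fc8_b)                   # logits over 1000 classes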
How I read the TFRecords
I took Inception's image_processing.py, dataset.py, and ImagenetData.py with no change. Then I ran Inception's inception_eval.py evaluate function, changing the inference code and removing the restore of the moving-average variables from the checkpoint (as I already restore the weights manually at variable initialization). However, the accuracy is not the same as VGG-16 in Caffe: top-5 accuracy is around 9%.
Closing
What is the problem with this method? There are several parts of the code that I still don't understand:
How does the TFReader move to the next batch of images after processing one batch? The output of Inception's image_processing.py is only of batch size. To be complete, this is the output according to the documentation:
images: Images. 4D tensor of size [batch_size, FLAGS.image_size, image_size, 3].
labels: 1-D integer Tensor of [FLAGS.batch_size].
Do I need to softmax the logits before tf.in_top_k? (I don't think it matters, as the ordering of the values stays the same.)
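For reference, a quick sketch of what I mean; softmax is monotonic within each row, so the top-k set does not change:
top5_from_logits = tf.nn.in_top_k(logits, labels, k=5)
top5_from_probs = tf.nn.in_top_k(tf.nn.softmax(logits), labels, k=5)   # same boolean result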
Thank you for the help. Sorry if the links are messy; because of my reputation I can only post two links per post.
UPDATE
I tried converting the Caffe weights myself, reversing the input-channel order of conv1_1 (because Caffe receives BGR images, so the weights are for BGR rather than the RGB that TensorFlow feeds in), and got the same accuracy as with the weights from the website: around 9% top-5.
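The flip itself amounts to something like this (a sketch; conv1_1_W is assumed to be in TensorFlow layout [height, width, in_channels, out_channels]):
kernel_bgr = weights['conv1_1_W']        # trained on BGR inputs in Caffe
kernel_rgb = kernel_bgr[:, :, ::-1, :]   # reverse the input-channel axis so RGB images match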
I found out that there is no mean-image subtraction in TensorFlow Inception's image_processing.py. I added mean subtraction (in the eval_image function) with tf.reduce_mean and got 11% accuracy.
Then I tried to change the eval_image function to
# source: https://github.com/ethereon/caffe-tensorflow/blob/master/examples/imagenet/dataset.py
img_shape = tf.to_float(tf.shape(image)[:2])
min_length = tf.minimum(img_shape[0], img_shape[1])
new_shape = tf.to_int32((256 / min_length) * img_shape)  # isotropic case
# new_shape = tf.pack([256, 256])  # non-isotropic case
image = tf.image.resize_images(image, [new_shape[0], new_shape[1]])
offset = tf.to_int32((new_shape - 224) / 2)
image = tf.slice(image, begin=tf.pack([offset[0], offset[1], 0]), size=tf.pack([224, 224, -1]))
mean_subs_image = tf.reduce_mean(image, axis=[0, 1], keep_dims=True)  # per-image mean over height and width
return image - mean_subs_image
and I got 13%. It increased, but it still lacks a lot. It seems this is one of the problems, but I am not sure what the others are.

In general, porting whole model weights across libraries is hard. You pointed out some differences from Caffe, but there could be others. It might be easier to retrain the model in TensorFlow.
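One concrete difference worth checking (an assumption about the pipeline, not something confirmed above): the original VGG preprocessing subtracts a fixed per-channel ImageNet mean, not a per-image tf.reduce_mean. A minimal sketch:
# Fixed VGG ImageNet channel means, RGB order; image assumed to be HxWx3 float32 in [0, 255].
VGG_MEAN_RGB = tf.constant([123.68, 116.779, 103.939], dtype=tf.float32)
image = image - tf.reshape(VGG_MEAN_RGB, [1, 1, 3])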

Related

Bounding Box regression using Keras transfer learning gives 0% accuracy. The output layer with Sigmoid activation only outputs 0 or 1

I am trying to create an object localization model to detect license plates in images of cars. I used the VGG16 model and excluded the top layer so I could add my own dense layers, with the final layer having 4 nodes and sigmoid activation to get (xmin, ymin, xmax, ymax).
I used the functions provided by Keras to read an image and resize it to (224, 224, 3), and also used the preprocess_input() function to process the input. I also tried to process the image manually, resizing with padding to maintain proportions and normalizing the input by dividing by 255.
Nothing seems to work when I train: I get 0% train and test accuracy. Below is my code for this model.
def get_custom(output_size, optimizer, loss):
    vgg = VGG16(weights="imagenet", include_top=False, input_tensor=Input(shape=IMG_DIMS))
    vgg.trainable = False
    flatten = vgg.output
    flatten = Flatten()(flatten)
    bboxHead = Dense(128, activation="relu")(flatten)
    bboxHead = Dense(32, activation="relu")(bboxHead)
    bboxHead = Dense(output_size, activation="sigmoid")(bboxHead)
    model = Model(inputs=vgg.input, outputs=bboxHead)
    model.compile(loss=loss, optimizer=optimizer, metrics=['accuracy'])
    return model
X and y were of shapes (616, 224, 224, 3) and (616, 4) respectively. I divided the coordinates by the length of the respective sides so each value in y is in range (0,1).
I'll link my Python notebook from GitHub below so you can see the full code. I am using Google Colab to train the model.
https://github.com/gauthamramesh3110/image_processing_scripts/blob/main/License_Plate_Detection.ipynb
Thanks in advance. I am really in need of help here.
If you're doing an object localization task then you shouldn't use 'accuracy' as your metric, because the docs of compile() say:
When you pass the strings 'accuracy' or 'acc', we convert this to one of tf.keras.metrics.BinaryAccuracy, tf.keras.metrics.CategoricalAccuracy, tf.keras.metrics.SparseCategoricalAccuracy based on the loss function used and the model output shape.
You should use tf.keras.metrics.MeanAbsoluteError, IoU (Intersection over Union) or mAP (Mean Average Precision) instead.
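A minimal sketch of wiring that up; the corner ordering (xmin, ymin, xmax, ymax) and the normalization to [0, 1] follow the question, while the iou_metric helper itself is illustrative rather than a built-in Keras metric:
import tensorflow as tf

def iou_metric(y_true, y_pred):
    # Boxes are (xmin, ymin, xmax, ymax), normalized to [0, 1] as in the question.
    xmin = tf.maximum(y_true[:, 0], y_pred[:, 0])
    ymin = tf.maximum(y_true[:, 1], y_pred[:, 1])
    xmax = tf.minimum(y_true[:, 2], y_pred[:, 2])
    ymax = tf.minimum(y_true[:, 3], y_pred[:, 3])
    intersection = tf.maximum(xmax - xmin, 0.0) * tf.maximum(ymax - ymin, 0.0)
    area_true = (y_true[:, 2] - y_true[:, 0]) * (y_true[:, 3] - y_true[:, 1])
    area_pred = (y_pred[:, 2] - y_pred[:, 0]) * (y_pred[:, 3] - y_pred[:, 1])
    union = area_true + area_pred - intersection
    return tf.reduce_mean(intersection / (union + 1e-7))

# model and optimizer as in get_custom() from the question; 'mse' is one reasonable regression loss.
model.compile(loss="mse", optimizer=optimizer,
              metrics=[tf.keras.metrics.MeanAbsoluteError(), iou_metric])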

Delayed echo of sin - cannot reproduce Tensorflow result in Keras

I am experimenting with LSTMs in Keras with little to no luck. At some point I decided to scale back to the most basic problems in order to finally achieve some positive result.
However, even with the simplest problems I find that Keras is unable to converge, while an implementation of the same problem in TensorFlow gives a stable result.
I am unwilling to just switch to TensorFlow without understanding why Keras keeps diverging on every problem I attempt.
My problem is a many-to-many sequence prediction of delayed sin echo, example below:
Blue line is a network input sequence, red dotted line is an expected output.
The experiment was inspired by this repo, and a workable TensorFlow solution was also created from it.
The relevant excerpts from my code are below, and the full version of my minimal reproducible example is available here.
Keras model:
model = Sequential()
model.add(LSTM(n_hidden,
               input_shape=(n_steps, n_input),
               return_sequences=True))
model.add(TimeDistributed(Dense(n_input, activation='linear')))
model.compile(loss=custom_loss,
              optimizer=keras.optimizers.Adam(lr=learning_rate),
              metrics=[])
Tensorflow model:
x = tf.placeholder(tf.float32, [None, n_steps, n_input])
y = tf.placeholder(tf.float32, [None, n_steps])
weights = {
    'out': tf.Variable(tf.random_normal([n_hidden, n_steps], seed=SEED))
}
biases = {
    'out': tf.Variable(tf.random_normal([n_steps], seed=SEED))
}
lstm = rnn.LSTMCell(n_hidden, forget_bias=1.0)
outputs, states = tf.nn.dynamic_rnn(lstm, inputs=x,
                                    dtype=tf.float32,
                                    time_major=False)
h = tf.transpose(outputs, [1, 0, 2])
pred = tf.nn.bias_add(tf.matmul(h[-1], weights['out']), biases['out'])
individual_losses = tf.reduce_sum(tf.squared_difference(pred, y),
                                  reduction_indices=1)
loss = tf.reduce_mean(individual_losses)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate) \
                    .minimize(loss)
I claim that the other parts of the code (data generation, training) are completely identical. But learning progress with Keras stalls early and yields unsatisfactory predictions. Graphs of log-loss for both libraries and example predictions are attached below:
Logloss for Tensorflow-trained model:
Logloss for Keras-trained model:
It's not easy to read from the graphs, but TensorFlow reaches target_loss=0.15 and stops early after about 10k batches, while Keras uses up all 13k batches reaching a loss of only about 1.5. In a separate experiment where Keras ran for 100k batches it went no further, stalling around 1.0.
Figures below contain: black line - model input signal, green dotted line - ground truth output, red line - acquired model output.
Predictions of Tensorflow-trained model:
Predictions of Keras-trained model:
Thank you for suggestions and insights, dear colleagues!
OK, I have managed to solve this. The Keras implementation now converges steadily to a sensible solution too:
The models were in fact not identical. You may inspect the TensorFlow model version from the question with extra care and verify for yourself that the actual Keras equivalent is the one listed below, not the one stated in the question:
model = Sequential()
model.add(LSTM(n_hidden,
               input_shape=(n_steps, n_input),
               return_sequences=False))
model.add(Dense(n_steps, input_shape=(n_hidden,), activation='linear'))
model.compile(loss=custom_loss,
              optimizer=keras.optimizers.Adam(lr=learning_rate),
              metrics=[])
I will elaborate. The workable solution here uses the last output of size n_hidden spat out by the LSTM as an intermediate activation, which is then fed to the Dense layer.
So, in a way, the actual prediction here is made by a regular perceptron.
One extra takeaway note: the source of the mistake in the original Keras solution is already evident from the inference examples attached to the question. We see there that the earlier timestamps fail utterly, while the later timestamps are near perfect. These earlier timestamps correspond to the states of the LSTM when it was just initialized on a new window and had no context.
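To make the equivalence explicit, here is the shape flow in both versions (a sketch using the question's variable names, with the batch size written as B):
# TensorFlow version from the question:
#   outputs                      : [B, n_steps, n_hidden]
#   h = transpose(outputs)       : [n_steps, B, n_hidden]
#   h[-1]                        : [B, n_hidden]   (only the last timestep survives)
#   pred = h[-1] @ W_out + b_out : [B, n_steps]
#
# Corrected Keras version:
#   LSTM(return_sequences=False) : [B, n_hidden]
#   Dense(n_steps)               : [B, n_steps]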

The CNN model does not learn when adding one/two more convolutional layers

I am trying to implement the model in the picture in TensorFlow, except with 1000 output neurons instead of 6, so that I can use pretrained weights.
I implemented the full model but with only one (14,14,128) layer, just for testing. Now that the program has matured I implemented the two extra layers (or one). This makes the model not learn anything: the loss is constant (up to some small noise) and the accuracy on the training images is constant at random guess. Before adding the layers I could get very quickly (5-10 min) to an accuracy of 70-80 percent on a 1000-image subset of the training dataset. As said, this is not the case with the additional layers.
Here is the code for the additional layer, where s1 and s2 are the strides of the convolutions:
w2 = weight_variable([3,3,64,128])
b2 = bias_variable([128])
h2 = tf.nn.relu(conv2d_s2(h1_pool,w2)+b2)
h2_pool = max_pool_2x2(h2)
#Starts additional layer
w3 = weight_variable([3,3,128,128])
b3 = bias_variable([128])
h3 = tf.nn.relu(conv2d_s1(h2_pool,w3)+b3)
#Ends additional layer
w5 = weight_variable([3,3,128,256])
b5 = bias_variable([256])
h5 = tf.nn.relu(conv2d_s1(h3,w5)+b5)
h5_pool = max_pool_2x2(h5)
This extra layer makes the model worthless. I have tried different hyperparameters (learning rate, batch size, epochs) without success. Where does the problem lie?
Another question: does anyone know a small (and/or better) network of this size that I can implement and test? My goal is to detect grasp positions on different objects (images of objects).
If it helps, I use one GTX 980 with a very, very good Xeon.
The repository can be found in https://github.com/tnikolla/grasp-detection.
UPDATE
The problem was divergence in the loss, solved by lowering the learning rate.
In orange are the accuracy and loss (TensorBoard and terminal) when the program showed no learning at all. I was fooled by the loss shown in the terminal. As pointed out by @hars, after checking the logs of accuracy and loss I discovered that the loss in TensorBoard diverges in the first steps. By changing the learning rate from 0.01 to 0.001 the divergence disappeared, and as you can see in cyan the model was learning (it overfit a small subset of ImageNet in 1 minute).
You have a ReLU at the end of the model, which might be clipping all gradients, followed by a softmax with logits in the training part. Therefore, the model might get stuck in a poor local minimum.
Try removing the tf.nn.relu in the last line of your inference function and see if it trains well.
Here is your part of the code:
Last few lines of the model:
# fc1 layer
W_fc3 = weight_variable([512, 1000])
b_fc3 = bias_variable([1000])
output = tf.nn.relu(tf.matmul(h_fc2, W_fc3) + b_fc3)
#print("output: {}".format(output.get_shape()))
return output
Training part of the code:
logits = inference_redmon.inference(images)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    logits=logits, labels=labels_one))
tf.summary.scalar('loss', loss)
correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(labels_one, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
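A minimal sketch of the suggested fix (only the output line changes; the variable names follow the code above):
# fc3 layer: return raw logits so softmax_cross_entropy_with_logits receives
# unbounded values and the gradients are not zeroed out by a final ReLU.
W_fc3 = weight_variable([512, 1000])
b_fc3 = bias_variable([1000])
output = tf.matmul(h_fc2, W_fc3) + b_fc3   # no tf.nn.relu here
return output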

Oversampling images during inference

It is a common practice in convolutional neural networks to oversample a given image during inference, i.e. to create a batch from different transformations of the same image (most commonly different crops and mirroring), pass the entire batch through the network, and average (or apply another kind of reducing function to) the results to get a single prediction (Caffe example).
How can this approach be implemented in TensorFlow?
You can take a look at the TF CNN tutorial. In particular, the function distorted_inputs does the image preprocessing step.
In short, there are a couple of TF functions in the tf.image package that help with distorting the images. You can use either them or regular numpy functions to create an extra dimension for the distortions, over which you can average the results:
Before:
input_place = tf.placeholder(tf.float32, [None, 256, 256, 3])
prediction = some_model(input_place) # size: [None]
sess.run(prediction, feed_dict={input_place: batch_of_images})
After:
input_place = tf.placeholder(tf.float32, [None, NUM_OF_DISTORTIONS, 256, 256, 3])
prediction = some_model(input_place)  # make sure it is of size [None, NUM_DISTORTIONS]
new_prediction = tf.reduce_mean(prediction, axis=1)
new_batch = np.zeros((batch_size, NUM_OF_DISTORTIONS, 256, 256, 3))
for i in xrange(len(batch_of_images)):
    for f in xrange(len(distortion_functions)):
        new_batch[i, f, :, :, :] = distortion_functions[f](batch_of_images[i])
sess.run(new_prediction, feed_dict={input_place: new_batch})
Take a look at TF's image-related functions. You could apply those transformations at test time to some input image and stack them all together to make a batch.
I imagine you could also do this using OpenCV or some other image-processing tool. I don't see a need to do it in the computation graph: you could create the batches beforehand and pass them in through feed_dict.
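A minimal sketch of that idea with in-graph ops; the 10-view crop-and-mirror scheme and the resize back to 256x256 are assumptions borrowed from classic Caffe-style oversampling, and some_model is the placeholder model from the answer above:
import tensorflow as tf

def oversampled_views(image, crop=224, out_size=256):
    # image: [256, 256, 3] float32 -> [10, out_size, out_size, 3]: four corner crops,
    # a center crop, and the horizontal mirror of each, resized back to the model's input size.
    m = 256 - crop
    offsets = [(0, 0), (0, m), (m, 0), (m, m), (m // 2, m // 2)]
    crops = [tf.image.crop_to_bounding_box(image, y, x, crop, crop) for y, x in offsets]
    crops += [tf.image.flip_left_right(c) for c in crops]
    views = tf.stack(crops)                                      # [10, crop, crop, 3]
    return tf.image.resize_images(views, [out_size, out_size])

views = oversampled_views(single_image)        # single_image assumed to be [256, 256, 3] float32
predictions = some_model(views)                # predictions for the 10 views
prediction = tf.reduce_mean(predictions, axis=0)   # average over the 10 views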

Style Transfer in TensorFlow

I am having trouble understanding the way that the content and style filters are trained (e.g. in this paper) in style-transfer algorithms using TensorFlow.
I have examined a few implementations of the algorithm in the linked paper, but I can't quite grok their treatment of this step. To that end, I thought it would be helpful to implement a naive version without using the pre-trained model. My understanding of the steps involved is:
Train a CNN on a single image (in the paper they use the pre-trained VGG network)
Using the trained network, feed in a white noise image. Define a new loss function that is minimized by updating the input image (this is how the image is 'painted') e.g. 'content' is derived by minimizing the distance between the conv layers in the trained model, and those resulting from the input (white noise) image
Thus, the implementation should be something like:
import tensorflow as tf
x_in = tf.placeholder(tf.float32, shape=[None, num_pixels], name='x')
y_ = tf.placeholder(tf.float32, shape=[None, num_pixels], name='y')
...
diff = y_ - y_out
loss = tf.reduce_sum(tf.abs(diff))  # minimizing 'pixel difference'
train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)
# training model
for i in range(NUM_TRAINING_STEPS):
    _, loss_val = sess.run([train_step, loss],
                           feed_dict={x_in: input_image, y_: input_image})
After training the model, I can generate a white-noise image, but how can I use the trained model to update my input image? My suspicion is that I need to create a second network, where x_in is of type tf.Variable, and load the weights and biases from the trained model, but the details of this elude me.
Yes, you could store your input image in a tf.Variable, load the weights from the trained model, and run an optimization loop with the style-transfer loss function with respect to the input variable.
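A minimal sketch of that idea; vgg_features (returning the activations of one chosen conv layer from the frozen, restored network) and content_target (those activations precomputed for the content image) are hypothetical placeholders, not names from the question:
import tensorflow as tf

# The generated image itself is the trainable variable; the network weights stay fixed.
generated = tf.Variable(tf.random_normal([1, 224, 224, 3]), name='generated_image')

content_loss = tf.reduce_sum(tf.square(vgg_features(generated) - content_target))

# Optimize only the image, not the (frozen) network weights.
train_step = tf.train.AdamOptimizer(2.0).minimize(content_loss, var_list=[generated])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # (if the VGG weights come from a checkpoint, restore them after this init)
    for i in range(NUM_TRAINING_STEPS):
        _, loss_val = sess.run([train_step, content_loss])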
You can also just use a style-transfer-as-a-service site, such as http://somatic.io, to train styles.