CNN Keras Object Localization - Bad predictions - tensorflow

I'm a beginner in machine learning and I currently am trying to predict the position of an object within an image that is part of a dataset I created.
This dataset contains about 300 images in total and contains 2 classes (Ace and Two).
I created a CNN that predicts whether it's an Ace or a two with about 88% accuracy.
Since this dataset was doing a great job, I decided to try and predict the position of the card (instead of the class). I read up some articles and from what I understood, all I had to do was to take the same CNN that I used to predict the class and to change the last layer for a Dense layer of 4 nodes.
That's what I did, but apparently this isn't working.
Here is my model:
model = Sequential()
model.add(Conv2D(64,(3,3),input_shape = (150,150,1)))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(32,(3,3)))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=2))
model.add(Dense(64))
model.add(Activation("relu"))
model.add(Flatten())
model.add(Dense(4))
model.compile(loss="mean_squared_error",optimizer='adam',metrics=[])
model.fit(X,y,batch_size=1,validation_split=0,
epochs=30,verbose=1,callbacks=[TENSOR_BOARD])
What I feed to my model:
X: a grayscale Image of 150x150 pixels. Each pixels are rescaled between [0-1]
y: Smallest X coordinate, Highest Y coordinate, Width and Height of the object (each of those values are between [0-1].
And here's an example of predictions it gives me:
[array([ 28.66145 , 41.278576, -9.568813, -13.520659], dtype=float32)]
but what I really wanted was:
[0.32, 0.38666666666666666, 0.4, 0.43333333333333335]
I knew something was wrong here so I decided to train and test my CNN on a single image (so it should overfit and predict the right bounding box for this single image if it worked). Even after overfitting on this single image, the predicted values were ridiculously high.
So my question is:
What am I doing wrong ?
EDIT 1
After trying #Matias's solution which was to add a sigmoid activation function to the last layer, all of the output's values are now between [0,1].
But, even with this, the model still produces bad outputs.
For example, after training it 10 epochs on the same image, it predicted this:
[array([0.0000000e+00, 0.0000000e+00, 8.4378130e-18, 4.2288357e-07],dtype=float32)]
but what I expected was:
[0.2866666666666667, 0.31333333333333335, 0.44666666666666666, 0.5]
EDIT 2
Okay, so, after experimenting for quite a while, I've come to a conclusion that the problem was either my model (the way it is built)
or the lack of training data.
But even if it was caused by a lack of training data, I should have been able to overfit it on 1 image in order to get the right predictions for this one, right?
I created another post which asks about my last question since the original one has been answered and I don't want to completely re-edit the post since it would make the first answers kind of pointless.

Since your targets (the Y values) are normalized to the [0, 1] range, the output of the model should match this range. For this you should use a sigmoid activation at the output layer, so the output is constrained to the [0, 1] range:
model.add(Dense(4, activation='sigmoid'))

Related

Hand Landmark Coordinate Neural Network Not Converging

I'm currently trying to train a custom model with tensorflow to detect 17 landmarks/keypoints on each of 2 hands shown in an image (fingertips, first knuckles, bottom knuckles, wrist, and palm), for 34 points (and therefore 68 total values to predict for x & y). However, I cannot get the model to converge, with the output instead being an array of points that are pretty much the same for every prediction.
I started off with a dataset that has images like this:
each annotated to have the red dots correlate to each keypoint. To expand the dataset to try to get a more robust model, I took photos of the hands with various backgrounds, angles, positions, poses, lighting conditions, reflectivity, etc, as exemplified by these further images:
I have about 3000 images created now, with the landmarks stored inside a csv as such:
I have a train-test split of .67 train .33 test, with the images randomly selected to each. I load the images with all 3 color channels, and scale the both the color values & keypoint coordinates between 0 & 1.
I've tried a couple different approaches, each involving a CNN. The first keeps the images as they are, and uses a neural network model built as such:
model = Sequential()
model.add(Conv2D(filters = 64, kernel_size = (3,3), padding = 'same', activation = 'relu', input_shape = (225,400,3)))
model.add(Conv2D(filters = 64, kernel_size = (3,3), padding = 'same', activation = 'relu'))
model.add(MaxPooling2D(pool_size = (2,2), strides = 2))
filters_convs = [(128, 2), (256, 3), (512, 3), (512,3)]
for n_filters, n_convs in filters_convs:
for _ in np.arange(n_convs):
model.add(Conv2D(filters = n_filters, kernel_size = (3,3), padding = 'same', activation = 'relu'))
model.add(MaxPooling2D(pool_size = (2,2), strides = 2))
model.add(Flatten())
model.add(Dense(128, activation="relu"))
model.add(Dense(96, activation="relu"))
model.add(Dense(72, activation="relu"))
model.add(Dense(68, activation="sigmoid"))
opt = Adam(learning_rate=.0001)
model.compile(loss="mse", optimizer=opt, metrics=['mae'])
print(model.summary())
I've modified the various hyperparameters, yet nothing seems to make any noticeable difference.
The other thing I've tried is resizing the images to fit within a 224x224x3 array to use with a VGG-16 network, as such:
vgg = VGG16(weights="imagenet", include_top=False,
input_tensor=Input(shape=(224, 224, 3)))
vgg.trainable = False
flatten = vgg.output
flatten = Flatten()(flatten)
points = Dense(256, activation="relu")(flatten)
points = Dense(128, activation="relu")(points)
points = Dense(96, activation="relu")(points)
points = Dense(68, activation="sigmoid")(points)
model = Model(inputs=vgg.input, outputs=points)
opt = Adam(learning_rate=.0001)
model.compile(loss="mse", optimizer=opt, metrics=['mae'])
print(model.summary())
This model has similar results to the first. No matter what I seem to do, I seem to get the same results, in that my mse loss minimizes around .009, with an mae around .07, no matter how many epochs I run:
Furthermore, when I run predictions based off the model it seems that the predicted output is basically the same for every image, with only slight variation between each. It seems the model predicts an array of coordinates that looks somewhat like what a splayed hand might, in the general areas hands might be most likely to be found. A catch-all solution to minimize deviation as opposed to a custom solution for each image. These images illustrate this, with the green being predicted points, and the red being the actual points for the left hand:
So, I was wondering what might be causing this, be it the model, the data, or both, because nothing I've tried with either modifying the model or augmenting the data seems to have done any good. I've even tried reducing the complexity to predict for one hand only, to predict a bounding box for each hand, and to predict a single keypoint, but no matter what I try, the results are pretty inaccurate.
Thus, any suggestions for what I could do to help the model converge to create more accurate & custom predictions for each image of hands it sees would be very greatly appreciated.
Thanks,
Sam
Usually, neural networks will have a very hard time to predict exact coordinates of landmarks. A better approach is probably a fully convolutional network. This would work as follows:
You omit the dense layers at the end and thus end up with an output of (m, n, n_filters) with m and n being the dimensions of your downsampled feature maps (since you use maxpooling at some earlier stage in the network they will be lower resolution than your input image).
You set n_filters for the last (output-)layer to the number of different landmarks you want to detect plus one more to indicate no landmark.
You remove some of the max pooling such that your final output has a fairly high resolution (so the earlier referenced m and n are bigger). Now your output has shape mxnx(n_landmarks+1) and each of the nxm (n_landmark+1)-dimensional vectors indicate which landmark is present as the position in the image that corresponds to the position in the mxn grid. So the activation for your last output convolutional layer needs to be a softmax to represent probabilities.
Now you can train your network to predict the landmarks locally without having to use dense layers.
This is a very simple architecture and for optimal results a more sophisticated architecture might be needed, but I think this should give you a first idea of a better approach than using the dense layers for the prediction.
And for the explanation why your network does predict the same values every time: This is probably, because your network is just not able to learn what you want it to learn because it is not suited to do so. If this is the case, the network will just learn to predict a value, that is fairly good for most of the images (so basically the "average" position of each landmark for all of your images).

What is the behavior of data augmentation layer?

I am trying to understand the tensor flow data augmentation tutorial
In the following defined model
model = tf.keras.Sequential([
resize_and_rescale,
data_augmentation,
layers.Conv2D(16, 3, padding='same', activation='relu'),
layers.MaxPooling2D(),
# Rest of your model
])
My understanding is no matter how many image rotate/zoom/transform defined in data_augmentation. This data_augmentation layer output just 1 image from 1 input image, am I correct?
I saw another post Does ImageDataGenerator add more images to my dataset?. Someone answers each epoch ImageDataGenerator will create different images, is that the same behavior here?
Otherwise, it is just the same transformed image trained epoch after epoch, which makes no sense.
Yes! Data augmentation layer would just transform images and return same shape as input (batch_size, *image_dims). But, due to randomisation in data augmentation layer, you are likely to get a different output each time that layer is called. For instance, in linked tutorial, random rotate or zoom is applied with a 20% chance, in addition the zoom factor and rotation angle are randomly selected(within specified limits) each time that layer is called.

MLP output of first layer is zero after one epoch

I've been running into an issue lately trying to train a simple MLP.
I'm basically trying to get a network to map the XYZ position and RPY orientation of the end-effector of a robot arm (6-dimensional input) to the angle of every joint of the robot arm to reach that position (6-dimensional output), so this is a regression problem.
I've generated a dataset using the angles to compute the current position, and generated datasets with 5k, 500k and 500M sets of values.
My issue is the MLP I'm using doesn't learn anything at all. Using Tensorboard (I'm using Keras), I've realized that the output of my very first layer is always zero (see image 1), no matter what I try.
Basically, my input is a shape (6,) vector and the output is also a shape (6,) vector.
Here is what I've tried so far, without success:
I've tried MLPs with 2 layers of size 12, 24; 2 layers of size 48, 48; 4 layers of size 12, 24, 24, 48.
Adam, SGD, RMSprop optimizers
Learning rates ranging from 0.15 to 0.001, with and without decay
Both Mean Squared Error (MSE) and Mean Absolute Error (MAE) as the loss function
Normalizing the input data, and not normalizing it (the first 3 values are between -3 and +3, the last 3 are between -pi and pi)
Batch sizes of 1, 10, 32
Tested the MLP of all 3 datasets of 5k values, 500k values and 5M values.
Tested with number of epoches ranging from 10 to 1000
Tested multiple initializers for the bias and kernel.
Tested both the Sequential model and the Keras functional API (to make sure the issue wasn't how I called the model)
All 3 of sigmoid, relu and tanh activation functions for the hidden layers (the last layer is a linear activation because its a regression)
Additionally, I've tried the very same MLP architecture on the basic Boston housing price regression dataset by Keras, and the net was definitely learning something, which leads me to believe that there may be some kind of issue with my data. However, I'm at a complete loss as to what it may be as the system in its current state does not learn anything at all, the loss function just stalls starting on the 1st epoch.
Any help or lead would be appreciated, and I will gladly provide code or data if needed!
Thank you
EDIT:
Here's a link to 5k samples of the data I'm using. Columns B-G are the output (angles used to generate the position/orientation) and columns H-M are the input (XYZ position and RPY orientation). https://drive.google.com/file/d/18tQJBQg95ISpxF9T3v156JAWRBJYzeiG/view
Also, here's a snippet of the code I'm using:
df = pd.read_csv('kinova_jaco_data_5k.csv', names = ['state0',
'state1',
'state2',
'state3',
'state4',
'state5',
'pose0',
'pose1',
'pose2',
'pose3',
'pose4',
'pose5'])
states = np.asarray(
[df.state0.to_numpy(), df.state1.to_numpy(), df.state2.to_numpy(), df.state3.to_numpy(), df.state4.to_numpy(),
df.state5.to_numpy()]).transpose()
poses = np.asarray(
[df.pose0.to_numpy(), df.pose1.to_numpy(), df.pose2.to_numpy(), df.pose3.to_numpy(), df.pose4.to_numpy(),
df.pose5.to_numpy()]).transpose()
x_train_temp, x_test, y_train_temp, y_test = train_test_split(poses, states, test_size=0.2)
x_train, x_val, y_train, y_val = train_test_split(x_train_temp, y_train_temp, test_size=0.2)
mean = x_train.mean(axis=0)
x_train -= mean
std = x_train.std(axis=0)
x_train /= std
x_test -= mean
x_test /= std
x_val -= mean
x_val /= std
n_epochs = 100
n_hidden_layers=2
n_units=[48, 48]
inputs = Input(shape=(6,), dtype= 'float32', name = 'input')
x = Dense(units=n_units[0], activation=relu, name='dense1')(inputs)
for i in range(1, n_hidden_layers):
x = Dense(units=n_units[i], activation=activation, name='dense'+str(i+1))(x)
out = Dense(units=6, activation='linear', name='output_layer')(x)
model = Model(inputs=inputs, outputs=out)
optimizer = SGD(lr=0.1, momentum=0.4)
model.compile(optimizer=optimizer, loss='mse', metrics=['mse', 'mae'])
history = model.fit(x_train,
y_train,
epochs=n_epochs,
verbose=1,
validation_data=(x_test, y_test),
batch_size=32)
Edit 2
I've tested the architecture with a random dataset where the input was a (6,) vector where input[i] is a random number and the output was a (6,) vector with output[i] = input[i]² and the network didn't learn anything. I've also tested a random dataset where the input was a random number and the output was a linear function of the input, and the loss converged to 0 pretty quickly. In short, it seems the simple architecture is unable to map a non-linear function.
the output of my very first layer is always zero.
This typically means that the network does not "see" any pattern in the input at all, which causes it to always predict the mean of the target over the entire training set, regardless of input. Your output is in the range of -𝜋 to 𝜋 probably with an expected value of 0, so it checks out.
My guess is that the model is too small to represent the data efficiently. I would suggest that you increase the number of parameters in the model by a factor of 10 or 100 and see if it starts seeing something. Limiting the number of parameters has a regularizing effect on the network, and strong regularization usually leads the the aforementioned derping to the mean.
I'm by no means a robotics expert, but I guess that there are a lot of situations where a small nudge in the output parameters causes a large change of the input. Let's say I'm trying to scratch my back with my left hand - the farther my hand goes to the left, the harder the task becomes, so at some point I might want to switch hands, which is a discontinuous configuration change. A bad analogy, sure, but I hope it demonstrates my hunch that there are certain places in the configuration space where small target changes cause large configuration changes.
Such large changes will cause a very large, very noisy gradient around those points. I'm not sure how well the network will work around these noisy gradients, but I would suggest as an experiment that you try to limit the training dataset to a set of outputs that are connected smoothly to one another in the configuration space of the arm, if that makes sense. Going further, you should remove any points from the dataset that are close to such configuration boundaries. To make up for that at inference time, you might instead want to sample several close-by points and choose the most common prediction as the final result. Hopefully some of those points will land in a smooth configuration area.
Also, adding batch normalization before each dense layer will help smooth the gradient and provide for more reliable training.
As for the rest of your hyperparameters:
A batch size of 32 is good, a very small batch size will make the gradient too noisy
The loss function is not critical, both MSE and MAE should work
The activation functions aren't critical, ReLU is a good default choice.
The default initializers a good enough.
Normalizing is important for Dense layers, so keep it
Train for as many epochs as you need as long as both the training and validation loss are dropping. If the validation loss hasn't dropped for 5-10 epochs you might as well stop early.
Adam is a good default choice. Start with a small learning rate and increase the learning rate at the beginning of training only if the training loss is dropping consistently over several epochs.
Further reading: 37 Reasons why your Neural Network is not working
I ended up replacing the first dense layer with a Conv1D layer and the network now seems to be learning decently. It's overfitting to my data, but that's territory I'm okay with.
I'm closing the thread for now, I'll spend some time playing with the architecture.

What dimension is the LSTM model considers the data sequence?

I know that an LSTM layer expects a 3 dimension input (samples, timesteps, features). But which of it dimension the data is considered as a sequence.
Reading some sites I understood that is the timestep, so I tried to create a simple problem to test.
In this problem, the LSTM model needs to sum the values in timesteps dimension. Then, assuming that the model will consider the previous values of the timestep, it should return as an output the sum of the values.
I tried to fit with 4 samples and the result was not good. Does my reasoning make sense?
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LSTM
X = np.array([
[5.,0.,-4.,3.,2.],
[2.,-12.,1.,0.,0.],
[0.,0.,13.,0.,-13.],
[87.,-40.,2.,1.,0.]
])
X = X.reshape(4, 5, 1)
y = np.array([[6.],[-9.],[0.],[50.]])
model = Sequential()
model.add(LSTM(5, input_shape=(5, 1)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X, y, epochs=1000, batch_size=4, verbose=0)
print(model.predict(np.array([[[0.],[0.],[0.],[0.],[0.]]])))
print(model.predict(np.array([[[10.],[-10.],[10.],[-10.],[0.]]])))
print(model.predict(np.array([[[10.],[20.],[30.],[40.],[50.]]])))
output:
[[-2.2417212]]
[[7.384143]]
[[0.17088854]]
First of all, yes you're right that timestep is the dimension take as data sequence.
Next, I think there is some confusion about what you mean by this line
"assuming that the model will consider the previous values of the
timestep"
In any case, LSTM doesn't take previous values of time step, but rather, it takes the output activation function of the last time step.
Also, the reason that your output is wrong is because you're using a very small dataset to train the model. Recall that, no matter what algorithm you use in machine learning, it'll need many data points. In your case, 4 data points are not enough to train the model. I used slightly more number of parameters and here's the sample results.
However, remember that there is a small problem here. I initialised the training data between 0 and 50. So if you make predictions on any number outside of this range, this won't be accurate anymore. Farther the number from this range, lesser the accuracy. This is because, it has become more of a function mapping problem than addition. By function mapping, I mean that your model will learn to map all values that are in training set(provided it's trained on enough number of epochs) to outputs. You can learn more about it here.

CNN Image Recognition with Regression Output on Tensorflow

I want to predict the estimated wait time based on images using a CNN. So I would imagine that this would use a CNN to output a regression type output using a loss function of RMSE which is what I am using right now, but it is not working properly.
Can someone point out examples that use CNN image recognition to output a scalar/regression output (instead of a class output) similar to wait time so that I can use their techniques to get this to work because I haven't been able to find a suitable example.
All of the CNN examples that I found are for the MSINT data and distinguishing between cats and dogs which output a class output, not a number/scalar output of wait time.
Can someone give me an example using tensorflow of a CNN giving a scalar or regression output based on image recognition.
Thanks so much! I am honestly super stuck and am getting no progress and it has been over two weeks working on this same problem.
Check out the Udacity self-driving-car models which take an input image from a dash cam and predict a steering angle (i.e. continuous scalar) to stay on the road...usually using a regression output after one or more fully connected layers on top of the CNN layers.
https://github.com/udacity/self-driving-car/tree/master/steering-models/community-models
Here is a typical model:
https://github.com/udacity/self-driving-car/tree/master/steering-models/community-models/autumn
...it uses tf.atan() or you can use tf.tanh() or just linear to get your final output y.
Use MSE for your loss function.
Here is another example in keras...
model = models.Sequential()
model.add(convolutional.Convolution2D(16, 3, 3, input_shape=(32, 128, 3), activation='relu'))
model.add(pooling.MaxPooling2D(pool_size=(2, 2)))
model.add(convolutional.Convolution2D(32, 3, 3, activation='relu'))
model.add(pooling.MaxPooling2D(pool_size=(2, 2)))
model.add(convolutional.Convolution2D(64, 3, 3, activation='relu'))
model.add(pooling.MaxPooling2D(pool_size=(2, 2)))
model.add(core.Flatten())
model.add(core.Dense(500, activation='relu'))
model.add(core.Dropout(.5))
model.add(core.Dense(100, activation='relu'))
model.add(core.Dropout(.25))
model.add(core.Dense(20, activation='relu'))
model.add(core.Dense(1))
model.compile(optimizer=optimizers.Adam(lr=1e-04), loss='mean_squared_error')
They key difference from the MNIST examples is that instead of funneling down to a N-dim vector of logits into softmax w/ cross entropy loss, for your regression output you take it down to a 1-dim vector w/ MSE loss. (you can also have a mix of multiple classification and regression outputs in the final layer...like in YOLO object detection)
The key is to have NO activation function in your last Fully Connected (output) layer. Note that you must have at least 1 FC layer beforehand.