Is binary crossentropy the only loss function that can be paired with sigmoid when building a model for multi-label image classification? - tensorflow

I am using Keras to build a CNN to classify images from the Fashion-MNIST dataset.
I have taken the 28x28 images and created new images by placing each one into a corner of a 56x56 image, i.e. a new image might be a shoe in the top left corner with the rest of the image being white, etc.
Instead of just classifying the object in the image, I want it to also classify the position of the object in the image, with 14 classes in total - 10 for the type of object in the image and 4 for the position.
The labels are one-hot encoded, so for instance an image that has a bag in the bottom left corner would have the label [0 0 0 0 0 0 0 0 1 0 0 0 1 0]: the first 1 marks the object as a bag and the second 1 indicates it is in the lower left corner.
Everything I have read states that for multi-label classification, the final activation function should be sigmoid and the loss binary crossentropy. I understand the reasoning for this, but is that the only viable combination?
I have tried many CNN architectures and hyperparameter searches, and the best validation accuracy I can achieve using binary crossentropy as the loss is around 0.50.
However, if I change the loss to categorical crossentropy, I am able to achieve around 0.85 validation accuracy and good predictions on unseen data. That is still not exactly what I want, as the 10 object classes should be independent of the 4 position classes, and ideally I want a probability of belonging to each class independently (not summing to 1).
Considering the type of task, would I be better building a model that has multiple outputs and multiple losses?
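The multiple-outputs idea raised above can be sketched without any framework. This is only an illustration under assumptions, not an answer confirmed by the thread: the 14-wide output is split into a 10-way object head and a 4-way position head, each with its own softmax and categorical cross-entropy, and the two losses are simply added. The function names and the example logits are made up for the sketch.

```python
import numpy as np

def softmax(logits):
    # Subtract the row max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def two_head_loss(logits, labels, n_object=10):
    """Sum of two categorical cross-entropies over a 14-wide output.

    logits, labels: arrays of shape (batch, 14). The first n_object columns
    are the object classes, the remaining columns are the positions.
    """
    eps = 1e-12  # avoids log(0)
    obj_p = softmax(logits[:, :n_object])
    pos_p = softmax(logits[:, n_object:])
    obj_loss = -np.sum(labels[:, :n_object] * np.log(obj_p + eps), axis=-1)
    pos_loss = -np.sum(labels[:, n_object:] * np.log(pos_p + eps), axis=-1)
    return obj_loss + pos_loss, obj_p, pos_p

# Label from the question: bag (index 8) in the lower left corner (index 12).
label = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0]], dtype=float)

# Hypothetical logits that favour the correct object and position.
logits = np.zeros((1, 14))
logits[0, 8] = 5.0
logits[0, 12] = 5.0

loss, obj_p, pos_p = two_head_loss(logits, label)
```

With this split, the object probabilities sum to 1 among themselves and the position probabilities sum to 1 among themselves, so the two groups do not compete with each other.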

Related

how to detect not in trained-for-category images when using resnet50

I have trained resnet50 on four categories of images. It works fantastic when I feed it an image in any one of the four categories -- I have essentially 100% accuracy on images in these categories.
However, when I feed my trained Resnet50 model an image of a similar object that is not in one of the original four categories, the prediction comes back as one of the four existing classes. By this I mean that, in the array returned with the likelihood of each category, the likelihood of one of the categories is in many cases basically 1. For example, when I query the model about an image that is not in one of the four categories, the prediction array will look like
[1.3492944e-07 9.9999988e-01 8.3132584e-14 1.4716975e-24]
Here is the prediction array for an image that the model was trained on:
[1.8217645e-27 1.0000000e+00 3.6731971e-32 0.0000000e+00]
These scores are different, but not much different. Many of the images that are not in one of the trained-for categories have a 1.00000000 for one of the labels.
I had been planning on dealing with the oddball images by looking at the prediction array to see if the max(category labels prediction) was below some threshold. But most of my max(category labels predictions) are all above .99999 and so I can't differentiate between images in the training set and images not part of the training set.
I plan to train my model for N buckets. When I am running the system I will occasionally have images that are not in one of the N buckets and I need to know that. I don't care what they are, I just want to know when an image is not in one of the N buckets.
Resnet50 does a great job of forcing everything into one of the categories, even when it is not.
My images are super well defined! I wonder if I am somehow overtraining or overlooking some other obvious error.
Here is an example of an image that was correctly categorized:
[Image: in training set and correctly categorized]
Here is an image that is not part of the training set that was then categorized into one of the categories:
[Image: not in training set and incorrectly categorized]
In summary: I am trying to sort images and I need to know when one of the images is not part of the training categories so I can reject that image. Restated, I want to sort images into buckets: known, trained for buckets, and one unknown bucket.
Is there any way to do this?
Should I use a different classifier than Resnet50?
My images are grayscale, bicubic interpolated during resize (large to smaller), 150x150. I have about 1,600 training images and 200 validation images per category. My accuracy and val_accuracy are .9997 after 3 epochs.
[Figure: training and validation accuracy]
[Figure: training and validation loss]
Your model only knows about 4 classes. It, or any other model such as MobileNet, will always look at an image and assign probabilities to each of the 4 classes. You could put in a picture of a water buffalo and it would still try to classify it. Usually, but not always, if the out-of-class image is very different from your training images, the class with the highest probability will have a value well below 1.0. In your case, however, the out-of-class image is NOT all that different from the images in your dataset, hence the fairly high false probability.
All I can think of is this: if your out-of-class images are generically similar to each other, you could gather some "typical" out-of-class images, add them to the data you have as a 5th class, and train the model on these 5 classes. I made a model that classified 50 different dog breeds. It was extremely accurate. I put in a picture of Donald Trump and he was predicted as being a chihuahua!
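A small sketch of the two ideas in this thread, using the two prediction arrays quoted in the question. The threshold value, the function names, and the 5-class probability vector are assumptions made for illustration; the first part shows why a max-probability cutoff does not separate the two quoted predictions, and the second shows how a 5th "unknown" class would be read out.

```python
import numpy as np

def is_rejected(probs, threshold):
    # Reject an image when the top class probability falls below the cutoff.
    return float(np.max(probs)) < threshold

# Prediction arrays copied from the question.
out_of_class = np.array([1.3492944e-07, 9.9999988e-01, 8.3132584e-14, 1.4716975e-24])
in_class = np.array([1.8217645e-27, 1.0000000e+00, 3.6731971e-32, 0.0000000e+00])

threshold = 0.99999  # hypothetical cutoff
rejected_out = is_rejected(out_of_class, threshold)  # not rejected: max is above the cutoff
rejected_in = is_rejected(in_class, threshold)       # not rejected either

def assign_bucket(probs, unknown_index=4):
    # With a 5th "unknown" class trained in, the argmax decides the bucket.
    best = int(np.argmax(probs))
    return "unknown" if best == unknown_index else best

# Made-up output of a hypothetical 5-class model.
five_class_probs = np.array([0.01, 0.02, 0.01, 0.01, 0.95])
bucket = assign_bucket(five_class_probs)
```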

How to modify the tensorflow loss function to suit multi labels on the same image

TensorFlow is fairly new to me, and the way I calculated the loss on the MNIST dataset was with the softmax_cross_entropy_with_logits function.
This function worked on that dataset because the label input is a single label per image.
What I'm trying to do is train a CNN on the MS COCO dataset, which has multiple labels on the same image, with 80 classes in total.
Is there a function that makes that possible?
My label input is currently a somewhat modified one-hot representation, meaning that for each image I have a list of 80 elements with 0 for categories not in the image and 1 for categories present in the image.
I.e. an image with a human and a dog would have the list [0,1,0,0,1], assuming I have 5 classes with dogs and humans at indices 1 and 4.
For a multi-label classification problem, you can use the sigmoid cross-entropy function available in TensorFlow (tf.nn.sigmoid_cross_entropy_with_logits). It takes the 0/1 encoded label vector along with the output of the final logits layer as its inputs.
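To make the per-class behaviour concrete, here is a NumPy sketch of the numerically stable formula given in the TensorFlow documentation for tf.nn.sigmoid_cross_entropy_with_logits, max(x, 0) - x * z + log(1 + exp(-|x|)), applied to the 5-class label from the question. The helper name sigmoid_xent_numpy and the logit values are made up for illustration; in a real model you would call the TensorFlow function itself.

```python
import numpy as np

def sigmoid_xent_numpy(labels, logits):
    # Stable form of -(z * log(sigmoid(x)) + (1 - z) * log(1 - sigmoid(x))).
    return np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))

# Label from the question: dog and human present at indices 1 and 4.
labels = np.array([0, 1, 0, 0, 1], dtype=float)

# Hypothetical raw outputs (logits) of the final layer for one image.
logits = np.array([-3.0, 2.5, -1.0, -4.0, 3.0])

per_class = sigmoid_xent_numpy(labels, logits)  # one loss value per class
loss = per_class.mean()                         # scalar loss for the image

# Independent per-class probabilities; they are not forced to sum to 1.
probs = 1.0 / (1.0 + np.exp(-logits))
```

Each class gets its own binary decision, which is why several entries of the label vector can be 1 at the same time.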

Transpose tensorboard embedding projections

My model is trying to predict scores for 163 items using a variety of inputs. It uses Keras on the TensorFlow backend.
Following the approach in Keras - Save image embedding of the mnist data set to capture layer weights, I am capturing embedding data for the final layer, which is Dense(163). Since the final dense layer gets 128 inputs, its weight matrix is 128x163. In TensorBoard Projector, I can see it visualizes 128 points very well.
However, when I try to map it to my real-world items using metadata, I have 163 item names, but TensorBoard Projector visualizes the 128x163 weight matrix along dimension 0, i.e. as 128 points. Is there any way to make it visualize points along dimension 1 (163 points) in TensorBoard Projector?
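One possible approach, not confirmed by this thread, is to transpose the weight matrix before handing it to the projector so that each of the 163 items becomes a row. The sketch below uses a random stand-in for the 128x163 kernel and hypothetical item names, and builds tab-separated text of the kind the projector can load as vectors and metadata.

```python
import numpy as np

# Random stand-in for the Dense(163) kernel, shape (128 inputs, 163 items).
rng = np.random.default_rng(0)
weights = rng.normal(size=(128, 163))

# Transpose so that each row is one item described by 128 values.
item_embeddings = weights.T  # shape (163, 128)

# Hypothetical item names; replace with the real metadata.
item_names = [f"item_{i}" for i in range(163)]

def to_tsv(rows):
    # One embedding per line, values separated by tabs.
    return "\n".join("\t".join(str(v) for v in row) for row in rows)

vectors_tsv = to_tsv(item_embeddings)  # 163 lines of 128 values
metadata_tsv = "\n".join(item_names)   # 163 lines, one name per line
```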

Detection Text from natural images

I wrote code in TensorFlow that uses a convolutional neural network to detect text in images. I used a TFRecords file to read the Street View Text dataset, then resized the images to 128 for both height and width.
I used 9 conv layers with zero padding and three max_pool layers with a window size of (2×2) and a stride of 2. Since I use just three pooling layers, the output of the last layer has shape (16×16). The last conv layer has 256 filters.
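The (16×16) figure can be checked with a few lines of arithmetic. This sketch assumes the zero-padded conv layers keep the spatial size unchanged and that each pooling layer uses no padding; the function name is made up for illustration.

```python
def spatial_size_after_pooling(size, num_pools, window=2, stride=2):
    # Each unpadded pooling layer maps size -> (size - window) // stride + 1.
    for _ in range(num_pools):
        size = (size - window) // stride + 1
    return size

# 128 -> 64 -> 32 -> 16 after three 2x2, stride-2 pooling layers.
final_size = spatial_size_after_pooling(128, num_pools=3)
```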
I also used two fully connected regression layers (tf.nn.sigmoid) and tf.losses.mean_squared_error as the loss function.
My question is:
Is this architecture enough for the detection process? I know there is something called NMS for detection. Also, what is the label in this case?
In general (this is not a rule, just my experience), you should start with a smaller net of 2 or 3 conv layers and see what happens. If you get good results, focus on the winning topology and adapt the hyperparameters (learning rate, batch size, and so on). If you don't get good results at all, go deeper, meaning add conv layers, and evaluate again. 12 conv layers is really huge, so your problem's complexity should be huge too! Otherwise you will reach a good accuracy but waste a lot of computing power and time for nothing. And by the way, use a pyramid form, meaning start wide and finish narrow.

Tensorflow Loss for Non-Independent Classes

I am using a TensorFlow network for classification between classes that are similar to their neighboring classes, i.e. not independent. For example, let's say we want to predict among 10 classes, but the predictions are not merely "correct" or "incorrect." Instead, if the correct class is 7 and the network predicts 6, the loss should be less than if the network predicted 5, because 6 is closer to the correct answer than 5. My understanding is that cross entropy with one-hot vectors provides an "all or nothing" loss rather than a "continuous" loss that reflects the magnitude of the error. If that is correct, how does one implement such a continuous loss in TensorFlow?
-- Update June 13 2016 ----
An example application might be color recognition. If the network predicts "green" but the true color is yellow-green, then the loss should be less than if the network predicted blue because green is a better prediction than blue.
You can choose to implement a continuous function (e.g. hue from HSV) as a single output, and construct your own loss calculation that reflects what you want to optimize. In that case you'd just have a single output value that ranged between 0.0 and 1.0, and the loss would be evaluated based on the distance from the labeled value.
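A minimal sketch of that single-output idea, using the 10-class example from the question. The mapping of class indices onto the 0.0 to 1.0 range and the choice of squared distance are assumptions made for illustration; a circular quantity such as hue might call for a wrap-around distance instead, which this sketch ignores.

```python
import numpy as np

def class_to_target(class_index, num_classes=10):
    # Map an ordered class index onto a single value in [0.0, 1.0].
    return class_index / (num_classes - 1)

def distance_loss(predicted, target):
    # Squared distance between the predicted value and the labeled value.
    return (np.asarray(predicted, dtype=float) - np.asarray(target, dtype=float)) ** 2

true_target = class_to_target(7)

# Predicting the value for class 6 costs less than predicting class 5.
loss_near = distance_loss(class_to_target(6), true_target)
loss_far = distance_loss(class_to_target(5), true_target)
```

Because the loss grows with the distance from the labeled value, a near miss is penalized less than a far one, which is the "continuous" behaviour the question asks for.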