How to modify the TensorFlow loss function to suit multiple labels on the same image

TensorFlow is fairly new to me, and the way I had the loss calculated on the MNIST dataset was with the softmax_cross_entropy_with_logits function.
That function worked on that dataset because each image carries a single label.
What I'm trying to do is train a CNN on the MS COCO dataset, which has multiple labels on the same image, with 80 classes in total.
Is there a function that makes that possible?
My label input is currently a modified one-hot (multi-hot) representation: for each image I have a list of 80 elements, with 0 for categories not in the image and 1 for categories present in it.
E.g. an image with a human and a dog would have the list [0,1,0,0,1], assuming I have 5 classes with dogs and humans at indices 1 and 4.
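For illustration, such a multi-hot label can be built with NumPy (the class count and indices follow the dog/human example above):

    import numpy as np

    num_classes = 5
    present = [1, 4]  # example indices: dog at 1, human at 4

    # Multi-hot label: 1 for every category present in the image, 0 otherwise.
    label = np.zeros(num_classes, dtype=np.float32)
    label[present] = 1.0
    print(label)  # [0. 1. 0. 0. 1.]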

For a multi-label classification problem, you can use the sigmoid cross-entropy function available in TensorFlow (tf.nn.sigmoid_cross_entropy_with_logits). It takes the multi-hot encoded labels along with the final logits layer as its inputs.
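A minimal TF 1.x-style sketch of that loss, with placeholders standing in for the real label and logits tensors of the 80-class problem above:

    import tensorflow as tf

    # Stand-ins for the multi-hot labels and the network's final logits layer.
    labels = tf.placeholder(tf.float32, shape=[None, 80])
    logits = tf.placeholder(tf.float32, shape=[None, 80])

    # Independent per-class sigmoid cross-entropy, averaged over classes and
    # the batch. Unlike softmax, each class is scored on its own, so any
    # number of classes can be "on" at once.
    loss = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))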

Related

Is binary crossentropy the only loss function that can be paired with sigmoid when building a model for multi-label image classification?

I am using Keras to build a CNN to classify images from the Fashion-MNIST dataset.
I have taken the 28x28 images and created new images by placing each one into one corner of a 56x56 image; e.g., a new image might be a shoe in the top-left corner with the rest of the image white.
Instead of just classifying the object in the image, I want the model to also classify the object's position, giving 14 classes in total: 10 for the type of object and 4 for the position.
The labels are one-hot encoded per group, so for instance an image with a bag in the bottom-left corner has the label [0 0 0 0 0 0 0 0 1 0 0 0 1 0], the first 1 marking the object as a bag and the second 1 indicating the lower-left corner.
Everything I have read states that for multi-label classification, the final activation function should be sigmoid and the loss binary crossentropy. I understand the reasoning for this, but is that the only viable combination?
I have tried many CNN architectures and hyperparameter searches, and the best validation accuracy I can achieve using binary crossentropy as the loss is around 0.50.
If I change the loss to categorical crossentropy, however, I am able to achieve around 0.85 validation accuracy and good predictions on unseen data. That's not exactly what I want, though: the 10 object classes should be independent of the 4 position classes, and ideally I want a probability for each class independently (not summed to 1).
Considering the type of task, would I be better off building a model that has multiple outputs and multiple losses?
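For reference, a minimal sketch of that multi-output idea in Keras, with one softmax head per label group; the trunk layers and sizes here are illustrative assumptions, not a tuned architecture:

    from tensorflow.keras import layers, models

    # Shared convolutional trunk (sizes are placeholders).
    inputs = layers.Input(shape=(56, 56, 1))
    x = layers.Conv2D(32, 3, activation='relu')(inputs)
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation='relu')(x)

    # One independent softmax head per label group.
    obj = layers.Dense(10, activation='softmax', name='object')(x)
    pos = layers.Dense(4, activation='softmax', name='position')(x)

    model = models.Model(inputs, [obj, pos])
    model.compile(optimizer='adam',
                  loss={'object': 'categorical_crossentropy',
                        'position': 'categorical_crossentropy'},
                  metrics=['accuracy'])

Training this way means splitting each 14-element label into a 10-element and a 4-element one-hot vector.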

Transpose TensorBoard embedding projections

My model predicts scores for 163 items using a variety of inputs. It uses Keras on the TensorFlow backend.
Following the approach in Keras - Save image embedding of the mnist data set to capture layer weights, I am capturing embedding data for the final layer, which is Dense(163). Since the final dense layer receives 128 inputs, the weight matrix is 128x163. In the TensorBoard Projector, I can see that it visualizes the 128 points well.
However, when I try to map these to my real-world items using metadata, I have 163 item names, but the TensorBoard Projector visualizes the 128x163 weight matrix along dimension 0, i.e. as 128 points. Is there any way to make it visualize points along dimension 1 (163 points) in the TensorBoard Projector?
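One possible workaround (a sketch, assuming a TF 1.x Keras model; the stand-in model below mimics the 128-to-163 final layer): save the transposed kernel as its own variable and point the projector's tensor_name at it, together with the 163-row metadata file.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    # Stand-in for the trained model's final Dense(163) layer on 128 inputs.
    model = models.Sequential([layers.Dense(163, input_shape=(128,))])

    # Kernel of the final layer, shape (128, 163).
    kernel = model.layers[-1].get_weights()[0]

    # Transpose so each of the 163 items becomes one 128-dimensional point.
    item_embedding = tf.Variable(kernel.T, name='item_embedding')

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        tf.train.Saver([item_embedding]).save(sess, 'item_embedding.ckpt')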

What is the structure of the data and labels in tensorflow.examples.tutorials.mnist input_data

I'm trying to learn how to feed data into conv nets properly in TensorFlow, and the majority of example code uses from tensorflow.examples.tutorials.mnist import input_data.
It's simple when you can use this to access MNIST data, but not helpful when trying to establish the equivalent way to structure and introduce non-MNIST data to similar models.
What is the structure of the data being imported through the MNIST examples, so that I can use the example CNN walkthrough code and manipulate my data to mirror the structure of the MNIST data?
The format of the MNIST data obtained from that example code depends on exactly how you initialize the DataSet class. Calling DataSet.next_batch(batch_size) returns two NumPy arrays, representing batch_size images and labels respectively. They have the following formats.
If the DataSet was initialized with reshape=True (the default), the images array is a batch_size by 784 matrix, in which each row contains the pixels of one MNIST image. The default type is tf.float32, and the values are pixel intensities between 0.0 and 1.0.
If the DataSet was initialized with reshape=False, the images array is a batch_size by 28 by 28 by 1 four-dimensional array. The 28s correspond to the height and width of each image in pixels; the 1 corresponds to the number of channels, the images being grayscale and therefore single-channel.
If the DataSet was initialized with one_hot=False (the default), the labels array is a vector of length batch_size, in which each value is the label (an integer from 0 to 9) representing the digit in the respective image.
If the DataSet was initialized with one_hot=True, the labels array is a batch_size by 10 matrix, in which each row is all zeros, except for a 1 in the column that corresponds to the label of the respective image.
Note that if you are interested in convolutional networks, initializing the DataSet with reshape=False is probably what you want, since that will retain spatial information about the images that will be used by the convolutional operators.
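Putting those options together, a short usage sketch (the data directory name is a placeholder):

    from tensorflow.examples.tutorials.mnist import input_data

    # reshape=False keeps the (28, 28, 1) spatial layout for conv nets;
    # one_hot=True yields (batch_size, 10) label matrices.
    mnist = input_data.read_data_sets('MNIST_data', one_hot=True, reshape=False)

    images, labels = mnist.train.next_batch(100)
    print(images.shape)  # (100, 28, 28, 1)
    print(labels.shape)  # (100, 10)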

Changing the Inception-v4 architecture to do multi-label classification in TensorFlow

I am working on an image tagging and annotation problem, where an image may contain multiple objects. I want to train Inception-v4 for multi-label classification. My training data will be an image and a vector whose length equals the number of classes, with a 1 at each index whose class is present in the image. For example, if I have four classes (person, car, tree, buildings) and an image contains a person and a car, then my vector will be (1, 1, 0, 0).
What changes do I need to make to train Inception-v4 for this tagging and annotation problem?
Do I only need to change the input format and swap the loss function from softmax_cross_entropy_with_logits to sigmoid_cross_entropy_with_logits in the Inception-v4 architecture?
https://github.com/tensorflow/models/blob/master/slim/nets/inception_v4.py
Thank you in advance.
If you'd like to retrain a model to output different labels, check out the image_retraining example: https://github.com/tensorflow/tensorflow/blob/r1.1/tensorflow/examples/image_retraining/retrain.py
In that example, we retrain the standard inception v3 model to recognize flowers instead of the standard ImageNet categories.
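On the loss question itself, the swap described above would look roughly like this in TF 1.x; the placeholders stand in for the input pipeline's multi-hot labels and the logits returned by inception_v4:

    import tensorflow as tf

    num_classes = 4  # person, car, tree, buildings in the example above

    # Stand-ins for the real label and logits tensors.
    labels = tf.placeholder(tf.float32, [None, num_classes])
    logits = tf.placeholder(tf.float32, [None, num_classes])

    # Multi-label loss: independent sigmoid cross-entropy per class, in place
    # of the usual softmax cross-entropy.
    loss = tf.losses.sigmoid_cross_entropy(multi_class_labels=labels,
                                           logits=logits)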

How to train a multi-class multi-output CNN with TensorFlow

I want to train a convolutional neural network with TensorFlow to do multi-output multi-class classification.
For example: we take the MNIST sample set and always combine two random images into a single one, and then want to classify the resulting image. The result of the classification should be the two digits shown in the image.
So the output of the network could have the shape [-1, 2, 10] where the first dimension is the batch, the second represents the output (is it the first or the second digit) and the third is the "usual" classification of the shown digit.
I have tried googling this for a while now, but wasn't able to find anything useful. Also, I don't know if multi-output multi-class classification is the correct name for this task. If not, what is the correct name? Do you have any links/tutorials/documentation/papers explaining what I'd need to do to build the loss function/training operations?
What I tried was to split up the output of the network into the individual outputs with tf.split and then use softmax_cross_entropy_with_logits on each output separately. I then averaged the result over all outputs, but it doesn't seem to work. Is this even a reasonable approach?
For nomenclature of classification problems, you can have a look at this link:
http://scikit-learn.org/stable/modules/multiclass.html
So your problem is called "multilabel classification". In normal TensorFlow multiclass classification (classic MNIST) you have 10 output units and use softmax at the end to compute the loss, i.e. tf.nn.softmax_cross_entropy_with_logits.
Ex: if your image has a "2", the groundtruth will be [0,0,1,0,0,0,0,0,0,0].
But here, your network output will have 20 units and you will use sigmoid, i.e. tf.nn.sigmoid_cross_entropy_with_logits.
Ex: if your image has a "2" and a "4", the groundtruth will be [0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0]: the first ten bits encode the first digit's class and the last ten the second digit's class.
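A sketch of that setup, building the 20-bit groundtruth from the two digit labels (placeholders stand in for the real tensors):

    import tensorflow as tf

    # Integer class labels for the two digits in the combined image.
    digit_1 = tf.placeholder(tf.int32, [None])
    digit_2 = tf.placeholder(tf.int32, [None])

    # Stand-in for the network's 20-unit output layer.
    logits = tf.placeholder(tf.float32, [None, 20])

    # Concatenate the two one-hot blocks, as in the "2" & "4" example.
    labels = tf.concat([tf.one_hot(digit_1, 10),
                        tf.one_hot(digit_2, 10)], axis=1)

    loss = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))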
First, you have to provide two labels for each image composed of two source images. Then change your objective/loss function so it maximizes the outputs for the two given labels, and train your model. I don't think you need to split the outputs.