Architecture of CNN to accept 2 sets of inputs - tensorflow

There are multiple examples how to build Tensorflow model to recognise cats and dogs from images. Now suppose I have audio associated with each picture and train separate network to recognise cats and dogs by sound.
I want to feed predictions of both networks into another layer to combine results and increase final prediction success rate.
How should my model look like?

Create two neural networks, that given a pair image-audio, you input each value to its corresponding net.
After the convolution steps or whatever you want to use, proceed as you would do with a normal CNN, in the last step before passing data to a FNN, when you flatten the data, do the same with the output of the audio NN.
So, as an example, if the output of the images one (flattened) has shape 2048 and the audio 4096 just append these two and make first layer of the FNN to have the sum of these shapes = 6144.

Related

how to detect not in trained-for-category images when using resnet50

I have trained resnet50 on four categories of images. It works fantastic when I feed it an image in any one of the four categories -- I have essentially 100% accuracy on images in these categories.
However, when I feed my trained Resnet50 model an image of a similar object, but not in one of the original four categories, the prediction comes back as one of the four existing classes. By this I mean, in the array that is returned with the likelihood of each category, in many cases the likelihood of one of the categories is basically 1. For example, when I query the model about image that is not in one of the four categories, the prediction array will look like
[1.3492944e-07 9.9999988e-01 8.3132584e-14 1.4716975e-24]
Here is the prediction array for an image that the model was trained on:
[1.8217645e-27 1.0000000e+00 3.6731971e-32 0.0000000e+00]
These scores are different, but not much different. Many of the images that are not in one of the trained-for categories have a 1.00000000 for one of the labels.
I had been planning on dealing with the oddball images by looking at the prediction array to see if the max(category labels prediction) was below some threshold. But most of my max(category labels predictions) are all above .99999 and so I can't differentiate between images in the training set and images not part of the training set.
I plan to train my model for N buckets. When I am running the system I will occasionally have images that are not in one of the N buckets and I need to know that. I don't care what they are, I just want to know when an image is not in one of the N buckets.
Resnet50 does a great job of forcing everything into one of the categories, even when it is not.
My images are super well defined! I wonder if I am somehow overtraining or overlooking some other obvious error.
Here is an example of an image that was correctly categorized:
in training set and correctly categorized
Here is an image that is not part of the training set that was then categorized into one of the categories:
not in training set and incorrectly categorized
In summary: I am trying to sort images and I need to know when one of the images is not part of the training categories so I can reject that image. Restated, I want to sort images into buckets: known, trained for buckets, and one unknown bucket.
Is there any way to do this?
Should I use a different classifier than Resnet50?
My images are grayscale, bicubic interpolated during resize (large to smaller), 150x150. I have about 1,600 training images and 200 validation images per category. My accuracy and val_accuracy are .9997 after 3 epochs.
Training and validation accuracy
Training and validation loss
Your model only knows about 4 classes. It or any other model say MobileNet will always look at an image and assign probabilities to each of the 4 classes. You could put in a picture of a water buffalo and it will still try to classify it. Usually but not always if the out of class image you put in is very different from your training images the class with the highest probability will have a probability value well below 1.0. However in your case the out of class image is NOT all that different from the images in your dataset hence a fairly high false probability prediction.
All I can think off is if your out of class images will be generically similar to each other you could create a 5th class and train your model with the data you have plus gather some "typical" out of class images. Then train the model on these 5 classes. I made a model that classified 50 different dog breeds. It was extremely accurate. I put in a picture of Donald Trump and he was predicted as being a chihuahua!

tensorflow output coordinates

i have the following task: i'm supposed to find the coordinates of an targetpoint. The features that are given, are the distances from anchors to that targetpoint. See img 1 distances from anchors to target
I planned to create a simple neural network first just with input and output layer. The cost-function i try to minimize is: correct_coordinate - mean of square(summed_up_distances*weights).
But now i'm kind of stuck in how to model the neural network, so that i'm outputting coordinates [x,y], as the current model would just output a single value. See img 2 current model
Right now I would than just train 2 neural networks. One that outputs the x-value, and one that outputs the y-value.
I'm just not sure if that is the best practice with tensorflow.
So I would like to know, how would you model the NN with tensorflow?
You can build the network with 2 nodes in the output layer. there is no need to train 2 neural networks for the same task.

Convolutional Neural Network Training

I have a question regarding convolutional neural network (CNN) training.
I have managed to train a network using tensorflow that takes an input image (1600 pixels) and output one of three classes that matches it.
Testing the network with variations of the trained classes is giving good results. However; when I give it a different -fourth- image (does not contain any of the trained 3 image), it always returns a random match to one of the classes.
My question is, how can I train a network to classify that the image does not belong to either of the three trained images? A similar example, if i trained a network against the mnist database and then a gave it the character "A" or "B". Is there a way to discriminate that the input does not belong to either of the classes?
Thank you
Your model will always make predictions like your labels, so for example if you train your model with MNIST data, when you will make predictions, prediction will always be 0-9 just like MNIST labels.
What you can do is train a different model first with 2 classes in which you will predict if an image belongs to data set A or B. E.x. for MNIST data you label all data as 1 and add data from other sources that are different (not 0-9) and label them as 0. Then train a model to find if image belongs to MNIST or not.
Convolutional Neural Network (CNN) predicts the result from the defined classes after training. CNN always return from one of the classes regardless of accuracy. I have faced similar problem, what you can do is to check for accuracy value. If the accuracy is below some threshold value then it's belong to none category. Hope this helps.
You probably have three output nodes, and choose the maximum value (one-hot encoding). That's a bit unfortunate as it's a low number of outputs. Non-recognized inputs tend to cause pretty random outputs.
Now, with 3 outputs, roughly speaking you can get 7 outcomes. You might get a single high value (3 possibilities) but non-recognized input can also cause 2 high outputs (also 3 possibilities) or approximately equal output (also 3 possibilities). So there's a decent chance (~ 3/7) of random inputs producing a pattern on the output nodes which you'd only expect for a recognized input.
Now, if you had 15 classes and thus 15 output nodes, you'd be looking at roughly 32767 possible outcomes for unrecognized inputs, only 15 of which correspond to expected one-hot outcomes.
Underlying this is a lack of training data. If your training set has examples outside the 3 classes, you can just dump this in a 4th "other" category and train with that. This by itself isn't a reliable indication, as usually the theoretical "other" set is huge, but you now have 2 complementary ways of detecting other inputs: either by the "other" output node or by one of the 11 ambiguous outputs.
Another solution would be to check what outcome your CNN usually gives when given something else. I believe the last layer must be softmax and your CNN should return probabilities of the three given classes. If none of these probabilities is close to 1 this might be a sign that this is something else assuming your CNN is well trained (it must be fined for overconfidence when predicting wrong labels).

Neural Network with my own dataset

I have downloaded many face images from web. In order to learn Tensorflow I want to feed those images to a simple fully-connected neural network with a single hidden layer. I have found an example code in here.
Since I am a beginner, I don't know how to train, evaluate, and test the network with the downloaded images. The code owner used a '.mat' file and a .pkl file. I don't understand how he organized training and test set.
In order to run the code with my images;
Do I need to divide my images into training, test, and validation folders and turn each folder into a mat file? How am I going to provide labels for the training?
Besides, I don't understand why he used a '.pkl' file?
All in all, I would like to change this code so that I can find test, training , and validation set classification performance with my image dataset.
It might be an easy question, but it is important for me as it is a starting step. Thanks for your understanding.
First, you don't have to use .mat files nor pickles. Tensorflow expects numpy array.
For instance, let's say you have 70000 images of size 28x28 (=784 dimensions) belonging to 10 classes. Let's also assume that you'd like to train a simple feedforward neural network to classify the images.
The first step would be to split the images between train and test (and validation, but let's put this aside for the sake of simplicity). For the sake of the example, let's imagine that you chose randomly 60000 images for your training set and 10000 for your test set.
The second step would be to ensure that your data has the right format. Here, you'd like your training set to consist in one numpy array of shape (60000, 784) for the images and another one of shape (60000, 10) for the labels (if you use one-hot encoding to represent your classes). As for your test set, you should have an array of shape (10000, 784) for the images and one of shape (10000, 10) for the labels.
Once you have these big numpy arrays, you should define placeholders that will allow you to feed data to you network during training and evaluation.
images = tf.placeholder(tf.float32, shape=[None, 784])
labels = tf.placeholder(tf.int64, shape=[None, 10])
The None here means that you can feed a batch of any size, i.e. as many images as you want, as long as you numpy array is of shape (anything, 784).
The third step consists in defining your model as well as the loss function and the optimizer.
The fourth step consists in training your network by feeding it with random batches of data using the placeholders created above. As your network is training, you can periodically print its performance like the training loss/accuracy as well as the test loss/accuracy.
You can find a complete and very simple example here.

How to learn multi-class multi-output CNN with TensorFlow

I want to train a convolutional neural network with TensorFlow to do multi-output multi-class classification.
For example: If we take the MNIST sample set and always combine two random images two a single one and then want to classify the resulting image. The result of the classification should be the two digits shown in the image.
So the output of the network could have the shape [-1, 2, 10] where the first dimension is the batch, the second represents the output (is it the first or the second digit) and the third is the "usual" classification of the shown digit.
I tried googling for this for a while now, but wasn't able find something useful. Also, I don't know if multi-output multi-class classification is the correct naming for this task. If not, what is the correct naming? Do you have any links/tutorials/documentations/papers explaining what I'd need to do to build the loss function/training operations?
What I tried was to split up the output of the network into the single outputs with tf.split and then use softmax_cross_entropy_with_logits on every single output. The result I averaged over all outputs but it doesn't seem to work. Is this even a reasonable way?
For nomenclature of classification problems, you can have a look at this link:
http://scikit-learn.org/stable/modules/multiclass.html
So your problem is called "Multilabel Classification". In normal TensorFlow multiclass classification (classic MNIST) you will have 10 output units and you will use softmax at the end for computing losses i.e. "tf.nn.softmax_cross_entropy_with_logits".
Ex: If your image has "2", then groundtruth will be [0,0,1,0,0,0,0,0,0,0]
But here, your network output will have 20 units and you will use sigmoid i.e. "tf.nn.sigmoid_cross_entropy_with_logits"
Ex: If your image has "2" & "4", then groundtruth will be [0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0], i.e. first ten bits to represent first digit class and second to represent second digit class.
First you have to provide two labels to an image comprised of two different images. Then change your objective loss function so it maximizes the outputs of the two given labels and train your model. I don't think you need to split the outputs.