Online triplet generation - am I doing it right? - tensorflow

I'm trying to train a convolutional neural network with triplet loss (more about triplet loss here) in order to generate face embeddings (128 values that accurately describe a face).
In order to select only semi-hard triplets (distance(anchor, positive) < distance(anchor, negative)), I first feed all values of a mini-batch through the network and calculate the distances:
distance1, distance2 = sess.run([d_pos, d_neg], feed_dict={x_anchor:input1, x_positive:input2, x_negative:input3})
Then I select the indices of the inputs with distances that respect the formula above:
valids_batch = compute_valids(distance1, distance2, batch_size)
The function compute_valids:
def compute_valids(distance1, distance2, batch_size):
    valids = []
    for q in range(len(distance1)):
        if distance1[q] < distance2[q]:
            valids.append(q)
    return valids
Then I learn only from the training examples with indices returned by this filter function:
input1_valid = [input1[q] for q in valids_batch]
input2_valid = [input2[q] for q in valids_batch]
input3_valid = [input3[q] for q in valids_batch]
_, loss_value, summary = sess.run([optimizer, cost, summary_op], feed_dict={x_anchor:input1_valid, x_positive:input2_valid, x_negative:input3_valid})
Where the model and optimizer are defined as:
model1 = siamese_convnet(x_anchor)
model2 = siamese_convnet(x_positive)
model3 = siamese_convnet(x_negative)
d_pos = tf.reduce_sum(tf.square(model1 - model2), 1)
d_neg = tf.reduce_sum(tf.square(model1 - model3), 1)
cost = triplet_loss(d_pos, d_neg)
optimizer = tf.train.AdamOptimizer(learning_rate = 1e-4).minimize( cost )
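For reference, a common hinge-style definition of the triplet loss looks like this (a sketch only; the question's actual triplet_loss is not shown, and the margin value is an assumption):
def triplet_loss(d_pos, d_neg, margin=0.2):
    # Penalize triplets where the positive is not closer than the negative by at least `margin`
    return tf.reduce_mean(tf.maximum(0.0, margin + d_pos - d_neg))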
But something is wrong because accuracy is very low (50%).
What am I doing wrong?

There are a lot of reasons why your network is performing poorly. From what I understand, your triplet generation method is fine. Here are some tips that may help improve your performance.
The model
In deep metric learning, people usually use models pre-trained on the ImageNet classification task, as these models are quite expressive and can generate good representations for images. You can fine-tune your model on top of these pre-trained models, e.g., VGG16, GoogLeNet, ResNet.
How to fine-tune
Even if you have a good pre-trained model, it is often difficult to directly optimize the triplet loss with these models on your own dataset. Since these pre-trained models were trained on ImageNet, if your dataset is vastly different from ImageNet, you can first fine-tune the model on a classification task on your dataset. Once your model performs reasonably well on the classification task on your custom dataset, you can use the classification model as the base network (maybe with a small tweak) for the triplet network. This will often lead to much better performance.
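A rough Keras sketch of this two-stage idea (the backbone choice, input size, and the num_identities, face_images and identity_labels names are assumptions for illustration):
import tensorflow as tf

num_identities = 1000  # hypothetical number of classes (identities) in your dataset

base = tf.keras.applications.ResNet50(include_top=False, weights='imagenet',
                                      pooling='avg', input_shape=(224, 224, 3))
# Stage 1: fine-tune the backbone with a classification head on your own dataset
clf_head = tf.keras.layers.Dense(num_identities, activation='softmax')(base.output)
clf_model = tf.keras.Model(base.input, clf_head)
clf_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# clf_model.fit(face_images, identity_labels, epochs=...)

# Stage 2: drop the head and reuse the backbone (plus an embedding layer)
# as the shared branch of the triplet network
embedding = tf.keras.layers.Dense(128)(base.output)
embedding_model = tf.keras.Model(base.input, embedding)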
Hyperparameters
Hyperparameters such as the learning rate, momentum, weight decay, etc. are also extremely important for good performance (the learning rate is probably the most important factor). Since you are fine-tuning and not training the network from scratch, you should use a small learning rate, for example lr=0.001 or lr=0.0001. For momentum, 0.9 is a good choice. For weight decay, people usually use 0.0005 or 0.00005.
If you add some fully connected layers, then the learning rate for these layers may be higher than for the other layers (0.01 for example).
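As a hedged illustration of these values in Keras-style code (weight decay expressed as L2 regularization on a layer, which is the usual Keras idiom; the layer size is arbitrary):
opt = tf.keras.optimizers.SGD(learning_rate=1e-4, momentum=0.9)
dense = tf.keras.layers.Dense(128, kernel_regularizer=tf.keras.regularizers.l2(5e-4))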
Which layers to fine-tune
As your network has several layers, you need to decide which layers to fine-tune. Researchers have found that the lower layers of a network produce generic features such as lines or edges. Typically, people freeze the lower layers and only update the weights of the upper layers, which tend to produce task-oriented features. You should try optimizing starting from different lower layers and see which setting performs best.
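For example, a minimal Keras-style sketch of freezing the lower layers of a pre-trained backbone and training only the upper ones (the backbone, input size, and the number of frozen layers are assumptions):
import tensorflow as tf

base = tf.keras.applications.VGG16(include_top=False, weights='imagenet',
                                   pooling='avg', input_shape=(224, 224, 3))
# Freeze everything except the last few layers
for layer in base.layers[:-4]:
    layer.trainable = False

embedding = tf.keras.layers.Dense(128)(base.output)
model = tf.keras.Model(base.input, embedding)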
References
Fast R-CNN (Section 4.5, which layers to fine-tune)
Deep Image Retrieval (Section 5.2, influence of fine-tuning the representation)

distance(anchor, positive) < distance(anchor, negative)
This selects triplets in which the anchor is closer to the positive than to the negative, which is the opposite of a hard triplet. You need examples where d(a,p) > d(a,n) for hard triplets. For semi-hard triplets, you need examples that satisfy d(a,p) < d(a,n) < d(a,p) + margin.
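Concretely, a semi-hard filter over a mini-batch could look like this (a sketch only, assuming NumPy arrays of the batch distances and the same margin that is used in the loss):
import numpy as np

def select_semi_hard(d_pos, d_neg, margin=0.2):
    # Keep indices where d(a,p) < d(a,n) < d(a,p) + margin
    d_pos = np.asarray(d_pos)
    d_neg = np.asarray(d_neg)
    mask = (d_pos < d_neg) & (d_neg < d_pos + margin)
    return np.flatnonzero(mask).tolist()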
Here is the explanation : https://stackoverflow.com/a/49314187/7693521
I hope I am correct about this, if not please correct me.

Related

Tensorflow loss converging but model fails to predict even on train data

Using an ANN with TensorFlow to train on a simple known equation, Y = sin(X) or Y = cos(X). My loss function is converging properly.
Loss function convergence graph (figure). If the loss function converges, it means the model has fitted my training dataset well.
However, when I predict on the training set itself, the model fails to predict even the training data, which is strange.
Here it can be seen (figure) that after the 200th value the model's prediction shows no fit at all.
If the loss has converged, the model should fit the training dataset well, but that is not happening here. What is wrong in my code?
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

X = np.linspace(0, 10 * np.pi, 1000)
Y = np.sin(X)

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(500, input_shape=(1,), activation='relu'))
model.add(tf.keras.layers.Dense(1))

opt = tf.keras.optimizers.Adam(0.01)
model.compile(optimizer=opt, loss='mse')

r = model.fit(X.reshape(-1, 1), Y, epochs=100)
plt.plot(r.history['loss'])

Yhat = model.predict(X.reshape(-1, 1)).flatten()
plt.plot(Y)
plt.plot(Yhat)
It is the nature of your data.
It reminded me of the old result showing that a single-layer perceptron can't even compute XOR.
Anyway, the reason here is that your model is shallow, and shallow networks are much less efficient than deep networks. To put this in perspective, a model like the one below
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(20, input_shape=(1,), activation='relu'))
model.add(tf.keras.layers.Dense(20, activation='relu'))
model.add(tf.keras.layers.Dense(1))
will likely perform better, even though it has only about a third of the parameters of the original model. That is because the deeper you go, the more complex the representations the model can create. The core thing to remember is:
a deep learning model does not build non-linear decision boundaries directly, since each and every unit is fundamentally designed to create some linear decision boundary. What it does is stack those linear decision boundaries to build a representation of the data that is linearly separable.
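For completeness, the deeper model above could be trained on the same data from the question like this (a sketch that simply reuses X, Y and plt as defined there):
model.compile(optimizer=tf.keras.optimizers.Adam(0.01), loss='mse')
r = model.fit(X.reshape(-1, 1), Y, epochs=100)

Yhat = model.predict(X.reshape(-1, 1)).flatten()
plt.plot(Y)
plt.plot(Yhat)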
Also, the most important thing is to know your data. In this case, using probabilistic models will give almost perfect results; you can easily implement those using TensorFlow Probability.

Is my training data set too complex for my neural network?

I am new to machine learning and Stack Overflow, and I am trying to interpret two graphs from my regression model.
Training error and validation error from my machine learning model (figure).
My case is similar to this question: Very large loss values when training multiple regression model in Keras, but my MSE and RMSE are very high.
Is my model underfitting? If yes, what can I do to solve this problem?
Here is my neural network I used for solving a regression problem
def build_model():
    model = keras.Sequential([
        layers.Dense(128, activation=tf.nn.relu, input_shape=[len(train_dataset.keys())]),
        layers.Dense(64, activation=tf.nn.relu),
        layers.Dense(1)
    ])
    optimizer = tf.keras.optimizers.RMSprop(0.001)
    model.compile(loss='mean_squared_error',
                  optimizer=optimizer,
                  metrics=['mean_absolute_error', 'mean_squared_error'])
    return model
and my data set
I have 500 samples, 10 features and 1 target
Quite the opposite: it looks like your model is over-fitting. When you have low error on your training set, it means that your model has learned the data well and can infer results on it accurately. If your validation error is high, however, it means that the information learned from your training data is not being successfully applied to new data. This is because your model has 'fit' your training data too closely, and has only learned how to predict well when the input comes from that data.
To solve this, we can introduce common techniques to reduce over-fitting. A very common one is to use Dropout layers. This will randomly drop some of the nodes so that the model cannot correlate with them too heavily, thereby reducing dependency on those nodes and 'learning' more from the other nodes too. I've included an example that you can test below; try playing with the value and other techniques to see what works best. And as a side note: are you sure that you need that many nodes within your dense layers? That seems like quite a lot for your dataset, and it may be contributing to the over-fitting as well.
def build_model():
    model = keras.Sequential([
        layers.Dense(128, activation=tf.nn.relu, input_shape=[len(train_dataset.keys())]),
        layers.Dropout(0.2),
        layers.Dense(64, activation=tf.nn.relu),
        layers.Dense(1)
    ])
    optimizer = tf.keras.optimizers.RMSprop(0.001)
    model.compile(loss='mean_squared_error',
                  optimizer=optimizer,
                  metrics=['mean_absolute_error', 'mean_squared_error'])
    return model
Well, I think your model is overfitting.
There are several things that can help you:
1 - Reduce the network's capacity, which you can do by removing layers or reducing the number of units in the hidden layers.
2 - Dropout layers, which will randomly remove certain features by setting them to zero.
3 - Regularization.
To give a brief explanation of these:
- Reduce the network's capacity:
Some models have a large number of trainable parameters. The higher this number, the more easily the model can memorize the target for each training sample. Obviously, this is not ideal for generalizing to new data. By lowering the capacity of the network, it will learn the patterns that matter, i.e. those that minimize the loss. But remember, reducing the network's capacity too much will lead to underfitting.
- Regularization:
This page can help you a lot:
https://towardsdatascience.com/handling-overfitting-in-deep-learning-models-c760ee047c6e
(A small L2-regularization example is also sketched after this list.)
- Dropout layer:
You can use a layer like this:
model.add(layers.Dropout(0.5))
This is a dropout layer with a 50% chance of setting inputs to zero.
For more details you can see this page:
https://machinelearningmastery.com/how-to-reduce-overfitting-with-dropout-regularization-in-keras/
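As a concrete illustration of the regularization point above, here is a minimal Keras sketch of L2 weight regularization (the layer sizes are assumptions; the input shape matches the 10 features from the question):
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(1e-4),
                 input_shape=(10,)),
    layers.Dense(1)
])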
As mentioned in the existing answer by @omoshiroiii, your model does in fact seem to be overfitting; that's why your RMSE and MSE are so high.
Your model has learned the detail and noise in the training data to the extent that it now negatively impacts the performance of the model on new data.
One solution is therefore to randomly remove some of the nodes (dropout) so that the model cannot rely on them too heavily.

Approximating multidimensional functions with neural networks

Is it possible to fit or approximate multidimensional functions with neural networks?
Let's say I want to model the function f(x,y) = sin(x) + y from some given measurement data (f(x,y) is considered the ground truth and is not known). Also, if possible, some code examples written in TensorFlow or Keras would be great.
As said by @AndreHolzner, theoretically you can approximate any continuous function with a neural network as well as you want, on any compact subset of R^n, even with only one hidden layer.
However, in practice, the neural net may have to be very large for some functions, and can sometimes be untrainable (the optimal weights may be hard to find without getting stuck in a local minimum). So here are a few practical suggestions (unfortunately vague, because the details depend too much on your data and are hard to predict without multiple tries):
Keep the network not too big (it's hard to define how big exactly, unfortunately), otherwise you'll just overfit. You'll probably need a LOT of training samples.
A big number of reasonably-sized layers is usually better than a reasonable number of big layers.
If you have some priors about the function, use them: for instance, if you believe there is some kind of periodicity in f (like in your example, but it could be more complicated), you could apply the sin() function to some of the outputs of the first layer (not all, that would give you a truly periodic output). If you suspect a polynomial of degree n, just augment your input x with x², ..., x^n and use a linear regression on that input (see the sketch after this list); it will be much easier than learning the weights.
The universal approximation theorem is true on any compact subset of R^n, not on the entire multidimensional space. In particular, you'll never be able to predict the value for an input that's way bigger than any of the training samples (say you trained on numbers from 0 to 100; don't test on 200, it will fail).
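Here is a small sketch of the polynomial-prior suggestion above, using hypothetical 1-D inputs x and targets y and a maximum degree n (all names and the toy function are assumptions):
import numpy as np
import tensorflow as tf

n = 4
x = np.random.uniform(-3, 3, size=(1000, 1)).astype('float32')
y = (x ** 3 - 2 * x).astype('float32')  # toy ground truth for illustration

# Augment the input with x^2, ..., x^n and fit a plain linear regression on it
features = np.concatenate([x ** k for k in range(1, n + 1)], axis=1)
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(n,))])
model.compile(optimizer='adam', loss='mse')
model.fit(features, y, epochs=200, verbose=0)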
For an example of regression you can look here, for instance. To regress a more complicated function you'd need a more complicated mapping from x to pred, for instance like this:
n_layers = 3
# n_dimensions is the dimensionality of the input (e.g. 2 for f(x, y)); None is the batch dimension
x = tf.placeholder(shape=[None, n_dimensions], dtype=tf.float32)
last_layer = x
# Add n_layers dense hidden layers
for i in range(n_layers):
last_layer = tf.layers.dense(inputs=last_layer, units=128, activation=tf.nn.relu)
# Get the output prediction
pred = tf.layers.dense(inputs=last_layer, units=1, activation=None)
# Get the cost, training op, etc, just like in the linear regression example

Best DNNClassifier Configuration for MNIST study in TensorFlow

I am working on the MNIST dataset in TensorFlow with a deep neural network classifier. I am using the following structure for the network.
MNIST_DATASET = input_data.read_data_sets(mnist_data_path)

train_data = np.array(MNIST_DATASET.train.images, 'int64')
train_target = np.array(MNIST_DATASET.train.labels, 'int64')
test_data = np.array(MNIST_DATASET.test.images, 'int64')
test_target = np.array(MNIST_DATASET.test.labels, 'int64')

classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=[tf.contrib.layers.real_valued_column("", dimension=784)],
    n_classes=10,  # 0 to 9 - 10 classes
    hidden_units=[2500, 1000, 1500, 2000, 500],
    model_dir="model"
)
classifier.fit(train_data, train_target, steps=1000)
However, I get only 40% accuracy when I run the following line.
accuracy_score = 100*classifier.evaluate(test_data, test_target)['accuracy']
How can I tune the network? Am I doing something wrong? Similar studies have reached 99% accuracy in academia.
Thank you.
I found an optimal configuration on GitHub.
Firstly, this is not the best possible configuration: academic studies have already reached 99.79% accuracy on the test set.
classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns,
    n_classes=10,
    hidden_units=[128, 32],
    optimizer=tf.train.ProximalAdagradOptimizer(learning_rate=learning_rate),
    activation_fn=tf.nn.relu
)
Also, the following parameters are passed to the classifier.
epoch = 15000
learning_rate = 0.1
batch_size = 40
In this way, the model achieves 97.83% accuracy on the test set and 99.77% accuracy on the training set.
Speaking from experience, it would be a good idea to have no more than 2 hidden layers in a fully connected network for the MNIST dataset, i.e. hidden_units=[500, 500]. That should get you over 90% accuracy.
What is the problem? The extreme number of model parameters. For example, the second hidden layer alone requires 2500*1000+1000 parameters. The rule of thumb is to keep the number of trainable parameters somewhat comparable to the number of training examples, or at least that is so in classical machine learning. Otherwise, regularize the model rigorously.
What steps can be taken here?
Use a simpler model: decrease the number of hidden units and the number of layers (see the sketch after this list).
Use a model with a smaller number of parameters. Convolutional layers, for instance, generally use a much smaller number of parameters for the same number of units; for instance, 1000 convolutional units with 3x3 kernels would need only 1000*(3*3+1) parameters.
Apply regularization: batch normalization, noise injection into your input, dropout, and weight decay are good examples to start from.
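As a sketch of the first point, the simpler two-hidden-layer configuration could look like this with the same tf.contrib.learn API as in the question (the number of training steps is an assumption):
classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=[tf.contrib.layers.real_valued_column("", dimension=784)],
    n_classes=10,
    hidden_units=[500, 500],
    model_dir="model"
)
classifier.fit(train_data, train_target, steps=10000)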

DeepLearning Anomaly Detection for images

I am still relatively new to the world of deep learning. I wanted to create a deep learning model (preferably using TensorFlow/Keras) for image anomaly detection. By anomaly detection I mean, essentially, something like a OneClassSVM.
I have already tried sklearn's OneClassSVM using HOG features from the images. I was wondering if there is some example of how I can do this in deep learning. I looked around but couldn't find a single code example that handles this case.
One way of doing this in Keras is with the KerasRegressor wrapper module (it wraps scikit-learn's regressor interface). Useful information can also be found in the source code of that module. Basically, you first have to define your network model, for example:
from keras.layers import Input, Dense
from keras.models import Model

def simple_model():
    # Input layer
    data_in = Input(shape=(13,))
    # First layer, fully connected, ReLU activation
    layer_1 = Dense(13, activation='relu', kernel_initializer='normal')(data_in)
    # Second layer, etc.
    layer_2 = Dense(6, activation='relu', kernel_initializer='normal')(layer_1)
    # Output, single node without activation
    data_out = Dense(1, kernel_initializer='normal')(layer_2)
    # Build and compile the model
    model = Model(inputs=data_in, outputs=data_out)
    # You may choose any loss or optimizer function; be careful which you choose
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model
Then, pass it to the KerasRegressor builder and fit with your data:
from keras.wrappers.scikit_learn import KerasRegressor
# choose your epochs and batch size
regressor = KerasRegressor(build_fn=simple_model, nb_epoch=100, batch_size=64)
#fit with your data
regressor.fit(data, labels, epochs=100)
For which you can now do predictions or obtain its score:
p = regressor.predict(data_test) #obtain predicted value
score = regressor.score(data_test, labels_test) #obtain test score
In your case, as you need to detect anomalous images among the ones that are OK, one approach you can take is to train your regressor by passing anomalous images labeled 1 and images that are OK labeled 0.
This will make your model return a value closer to 1 when the input is an anomalous image, enabling you to threshold the desired results. You can think of this output as its R^2 coefficient to the "Anomalous Model" you trained as 1 (perfect match).
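A small sketch of that thresholding idea (the threshold value is an assumption and should be tuned, e.g. on a validation set):
import numpy as np

scores = regressor.predict(data_test)           # values near 1 mean "more anomalous"
threshold = 0.5
is_anomaly = np.asarray(scores).ravel() > threshold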
Also, as you mentioned, autoencoders are another way to do anomaly detection. For this I suggest you take a look at the Keras blog post Building Autoencoders in Keras, where they explain their implementation with the Keras library in detail.
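For completeness, here is a minimal dense autoencoder sketch for anomaly detection (the input size, layer sizes, and the x_normal / x_test names are assumptions): train it only on "normal" images, then flag inputs with a high reconstruction error as anomalies.
import numpy as np
from keras.layers import Input, Dense
from keras.models import Model

inp = Input(shape=(784,))                       # flattened image; the size is an assumption
encoded = Dense(32, activation='relu')(inp)
decoded = Dense(784, activation='sigmoid')(encoded)

autoencoder = Model(inp, decoded)
autoencoder.compile(optimizer='adam', loss='mse')
# autoencoder.fit(x_normal, x_normal, epochs=50, batch_size=128)

# reconstruction_error = np.mean((autoencoder.predict(x_test) - x_test) ** 2, axis=1)
# anomalies = reconstruction_error > threshold  # threshold chosen on held-out normal data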
It is worth noting that single-class classification is another way of saying regression.
Classification tries to find a probability distribution over the N possible classes, and you usually pick the most probable class as the output (that is why most classification networks use a sigmoid activation on their output labels, as it has range [0, 1]). Its output is discrete/categorical.
Similarly, regression tries to find the best model that represents your data, by minimizing the error or some other metric (like the well-known R^2 metric, or coefficient of determination). Its output is a real number/continuous (which is why most regression networks don't use activations on their outputs). I hope this helps; good luck with your coding.