NaN values for loss function (MSE) in TensorFlow

I would like to use a feedforward neural network to output a continuous real value, using TensorFlow. My input values are, of course, continuous real values too.
I want my net to have two hidden layers and to use MSE as the cost function, so I've defined it like this:
def mse(logits, outputs):
    mse = tf.reduce_mean(tf.pow(tf.sub(logits, outputs), 2.0))
    return mse
def training(loss, learning_rate):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    train_op = optimizer.minimize(loss)
    return train_op
def inference_two_hidden_layers(images, hidden1_units, hidden2_units):
    with tf.name_scope('hidden1'):
        weights = tf.Variable(
            tf.truncated_normal([WINDOW_SIZE, hidden1_units], stddev=1.0 / math.sqrt(float(WINDOW_SIZE))), name='weights')
        biases = tf.Variable(tf.zeros([hidden1_units]), name='biases')
        hidden1 = tf.nn.relu(tf.matmul(images, weights) + biases)
    with tf.name_scope('hidden2'):
        weights = tf.Variable(
            tf.truncated_normal([hidden1_units, hidden2_units], stddev=1.0 / math.sqrt(float(hidden1_units))), name='weights')
        biases = tf.Variable(tf.zeros([hidden2_units]), name='biases')
        hidden2 = tf.nn.relu(tf.matmul(hidden1, weights) + biases)
    with tf.name_scope('identity'):
        weights = tf.Variable(
            tf.truncated_normal([hidden2_units, 1], stddev=1.0 / math.sqrt(float(hidden2_units))), name='weights')
        biases = tf.Variable(tf.zeros([1]), name='biases')
        logits = tf.matmul(hidden2, weights) + biases
    return logits
I'm doing batch training, and at every step I evaluate the train_op and loss operators.
_, loss_value = sess.run([train_op, loss], feed_dict=feed_dict)
The problem is that I'm getting some NaN values as the result of evaluating the loss function. This does NOT happen if I use a neural network with just one hidden layer, like the following:
def inference_one_hidden_layer(inputs, hidden1_units):
    with tf.name_scope('hidden1'):
        weights = tf.Variable(
            tf.truncated_normal([WINDOW_SIZE, hidden1_units], stddev=1.0 / math.sqrt(float(WINDOW_SIZE))), name='weights')
        biases = tf.Variable(tf.zeros([hidden1_units]), name='biases')
        hidden1 = tf.nn.relu(tf.matmul(inputs, weights) + biases)
    with tf.name_scope('identity'):
        weights = tf.Variable(
            tf.truncated_normal([hidden1_units, NUM_CLASSES], stddev=1.0 / math.sqrt(float(hidden1_units))), name='weights')
        biases = tf.Variable(tf.zeros([NUM_CLASSES]), name='biases')
        logits = tf.matmul(hidden1, weights) + biases
    return logits
Why do I get NaN loss values when using a net with two hidden layers?

Mind your learning rate. If you expand your network, you have more parameters to learn, and a learning rate that worked for the smaller network may be too high for the larger one, so you may need to decrease it.
With too high a learning rate, your weights will explode, and your output values (and therefore the loss) will explode with them.
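To illustrate the point about the learning rate, here is a minimal sketch in plain Python. It is not the asker's network: train_linear is a hypothetical one-weight model fitted by gradient descent on squared error, and the input value, step count and both learning rates are assumptions chosen for the demonstration. With the small rate the loss shrinks; with the large one every update overshoots, the weight grows until it overflows, and the loss ends up as NaN.

```python
import math

def train_linear(learning_rate, steps=300):
    """Fit y = w * x to a single example by gradient descent on squared error."""
    x, y = 10.0, 5.0  # hypothetical, deliberately un-normalized input and target
    w = 0.0
    loss = (w * x - y) * (w * x - y)
    for _ in range(steps):
        error = w * x - y
        grad = 2.0 * error * x        # d/dw of (w*x - y)^2
        w -= learning_rate * grad     # plain gradient descent update
        error = w * x - y
        loss = error * error          # multiplication overflows to inf instead of raising
    return loss

small_lr_loss = train_linear(0.001)   # error shrinks by a constant factor each step
large_lr_loss = train_linear(0.1)     # error grows each step until it overflows

print("loss with lr=0.001:", small_lr_loss)
print("loss with lr=0.1:  ", large_lr_loss)
print("diverged to NaN:", math.isnan(large_lr_loss))
```

In a real network the same effect is usually handled by lowering the learning rate or normalizing the inputs; which of the two applies here cannot be told from the question alone.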

Related

In tensorflow 1, when the loss function is defined with operations on Tensors, is the model really trained?

First, I'm sorry, but it's not possible to reproduce this problem in a few lines, as the model involved is a very complex network.
But here is an idea of the code:
def return_iterator(data, nb_epochs, batch_size):
    dataset = tf.data.Dataset.from_tensor_slices(data)
    dataset = dataset.repeat(nb_epochs).batch(batch_size)
    iterator = dataset.make_one_shot_iterator()
    yy = iterator.get_next()
    return tf.cast(yy, tf.float32)
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    y_pred = complex_model.autoencode(train)
    y_pred = tf.convert_to_tensor(y_pred, dtype=tf.float32)
    nb_epochs = 10
    batch_size = 64
    y_real = return_iterator(train, nb_epochs, batch_size)
    y_pred = return_iterator(y_pred, nb_epochs, batch_size)
    res_equal = 1. - tf.reduce_mean(tf.abs(y_pred - y_real), [1,2,3])
    loss = 1 - tf.reduce_sum(res_equal, axis=0)
    opt = tf.train.AdamOptimizer().minimize(loss)
    tf.global_variables_initializer().run()
    for epoch in range(0, nb_epochs):
        _, d_loss = sess.run([opt, loss])
To define the loss, I must use operations like tf.reduce_mean and tf.reduce_sum, and these operations only accept Tensors as input.
My question is: with this code, will the complex_model autoencoder be trained during training? (Even though here it is only used to output the predictions from which the loss is computed.)
Thank you
p.s: I am using TF1.15 (and I cannot use another version)

How to convert a custom loss function with logits, built in tensorflow to keras?

I have a loss function built in TensorFlow; it needs logits and labels as input:
def median_weight_class_loss(labels, logits):
    epsilon = tf.constant(value=1e-10)
    logits = logits + epsilon
    softmax = tf.nn.softmax(logits)
    # this is just the number of samples in each class in my dataset divided by the sum of samples 10015.
    weight_sample = np.array([1113,6705,514,327,1099,115,142])/10015
    weight_sample = 0.05132302/weight_sample
    xent = -tf.reduce_sum(tf.multiply(labels * tf.log(softmax + epsilon), weight_sample), axis=1)
    return xent
The problem is that Keras loss functions have a different signature:
custom_loss(y_true, y_pred)
They take y_true and y_pred as inputs.
I found a way to get logits in Keras, by using a linear activation instead of softmax in the last layer of my model:
model.add(Activation('linear'))
But I need my model to have a softmax activation in the last layer. What do you think the solution is?
Thank you.
Strictly speaking, this loss does not need logits; you can input softmax probabilities directly by modifying the loss like this:
def median_weight_class_loss(y_true, y_pred):
    epsilon = tf.constant(value=1e-10)
    weight_sample = np.array([1113,6705,514,327,1099,115,142])/10015
    weight_sample = 0.05132302/weight_sample
    xent = -tf.reduce_sum(tf.multiply(y_true * tf.log(y_pred + epsilon), weight_sample), axis=1)
    return xent
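As a rough sanity check of why the probabilities are enough, here is a sketch in plain Python (no TensorFlow or Keras) for a single sample. The class counts and the 0.05132302 constant come from the question; the helper names, the example logits and the one-hot label are made up for illustration. It computes the weighted cross-entropy once from softmax probabilities, as a Keras loss would receive them, and once starting from logits, and the two values agree.

```python
import math

# Class counts and scaling constant taken from the question.
COUNTS = [1113, 6705, 514, 327, 1099, 115, 142]
TOTAL = 10015
WEIGHTS = [0.05132302 / (c / TOTAL) for c in COUNTS]
EPSILON = 1e-10

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_xent_from_probs(y_true, y_pred):
    """Weighted cross-entropy for one sample, given softmax probabilities."""
    return -sum(w * t * math.log(p + EPSILON)
                for w, t, p in zip(WEIGHTS, y_true, y_pred))

def weighted_xent_from_logits(y_true, logits):
    """Same loss, but applying softmax inside the loss."""
    return weighted_xent_from_probs(y_true, softmax(logits))

# Hypothetical sample: true class is index 2, arbitrary logits.
y_true = [0, 0, 1, 0, 0, 0, 0]
logits = [0.3, 2.1, -0.4, 0.0, 1.2, -1.5, 0.7]

loss_from_probs = weighted_xent_from_probs(y_true, softmax(logits))
loss_from_logits = weighted_xent_from_logits(y_true, logits)
print(loss_from_probs, loss_from_logits)
```

So the model can keep its softmax output layer, and the loss simply skips the softmax step.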

Why Spark ML perceptron classifier has high F1-score while the same model on TensorFlow performs very badly?

Our team is working on an NLP problem. We have a dataset with some labeled sentences and we must classify them into two classes, 0 or 1.
We preprocess the data and use word embeddings so that we have 300 features for each sentence, then we use a simple neural network to train the model.
Since the data are very skewed, we measure the model score with the F1-score, computing it both on the train set (80%) and the test set (20%).
Spark
We used the multilayer perceptron classifier featured in PySpark's MLlib:
layers = [300, 600, 2]
trainer = MultilayerPerceptronClassifier(featuresCol='features', labelCol='target',
                                         predictionCol='prediction', maxIter=10, layers=layers,
                                         blockSize=128)
model = trainer.fit(train_df)
result = model.transform(test_df)
predictionAndLabels = result.select("prediction", "target").withColumnRenamed("target", "label")
evaluator = MulticlassClassificationEvaluator(metricName="f1")
f1_score = evaluator.evaluate(predictionAndLabels)
This way we get F1-scores ranging between 0.91 and 0.93.
TensorFlow
We then chose to switch (mainly for learning purposes) to TensorFlow, so we implemented a neural network using the same architecture and formulas as MLlib's:
# Network Parameters
n_input = 300
n_hidden_1 = 600
n_classes = 2
# TensorFlow graph input
features = tf.placeholder(tf.float32, shape=(None, n_input), name='inputs')
labels = tf.placeholder(tf.float32, shape=(None, n_classes), name='labels')
# Initializes weights and biases
initialize_weights_and_biases()
# Layers definition
layer_1 = tf.add(tf.matmul(features, weights['h1']), biases['b1'])
layer_1 = tf.nn.sigmoid(layer_1)
out_layer = tf.matmul(layer_1, weights['out']) + biases['out']
out_layer = tf.nn.softmax(out_layer)
# Optimizer definition
learning_rate_ph = tf.placeholder(tf.float32, shape=(), name='learning_rate')
loss_function = tf.losses.log_loss(labels=labels, predictions=out_layer)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate_ph).minimize(loss_function)
# Start TensorFlow session
init = tf.global_variables_initializer()
tf_session = tf.InteractiveSession()
tf_session.run(init)
# Train Neural Network
learning_rate = 0.01
iterations = 100
batch_size = 256
total_batch = int(len(y_train) / batch_size)
for epoch in range(iterations):
    avg_cost = 0.0
    for block in range(total_batch):
        batch_x = x_train[block * batch_size:min(block * batch_size + batch_size, len(x_train)), :]
        batch_y = y_train[block * batch_size:min(block * batch_size + batch_size, len(y_train)), :]
        _, c = tf_session.run([optimizer, loss_function], feed_dict={learning_rate_ph: learning_rate,
                                                                     features: batch_x,
                                                                     labels: batch_y})
        avg_cost += c
    avg_cost /= total_batch
    print("Iteration " + str(epoch + 1) + " Logistic-loss=" + str(avg_cost))
# Make predictions
predictions_train = tf_session.run(out_layer, feed_dict={features: x_train, labels: y_train})
predictions_test = tf_session.run(out_layer, feed_dict={features: x_test, labels: y_test})
# Compute F1-score
f1_score = f1_score_tf(y_test, predictions_test)
Support functions:
def initialize_weights_and_biases():
    global weights, biases
    epsilon_1 = sqrt(6) / sqrt(n_input + n_hidden_1)
    epsilon_2 = sqrt(6) / sqrt(n_classes + n_hidden_1)
    weights = {
        'h1': tf.Variable(tf.random_uniform([n_input, n_hidden_1],
                                            minval=0 - epsilon_1, maxval=epsilon_1, dtype=tf.float32)),
        'out': tf.Variable(tf.random_uniform([n_hidden_1, n_classes],
                                             minval=0 - epsilon_2, maxval=epsilon_2, dtype=tf.float32))
    }
    biases = {
        'b1': tf.Variable(tf.constant(1, shape=[n_hidden_1], dtype=tf.float32)),
        'out': tf.Variable(tf.constant(1, shape=[n_classes], dtype=tf.float32))
    }
def f1_score_tf(actual, predicted):
    actual = np.argmax(actual, 1)
    predicted = np.argmax(predicted, 1)
    tp = tf.count_nonzero(predicted * actual)
    fp = tf.count_nonzero(predicted * (actual - 1))
    fn = tf.count_nonzero((predicted - 1) * actual)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return tf.Tensor.eval(f1)
This way we get F1-scores ranging between 0.24 and 0.25.
Question
The only differences that I can see between the two neural networks are:
Optimizer: L-BFGS in Spark, Gradient Descent in TensorFlow
Weights and biases initialization: Spark makes its own initialization while we initialize them manually in TensorFlow
I don't think that these two differences can cause such a big gap in performance between the models, but still Spark seems to get very high scores in very few iterations.
I can't tell whether TensorFlow is performing very badly or Spark's scores are not truthful. In either case, I think we are missing something important.
Initializing the weights as uniform and the biases as 1 is certainly not a good idea, and it may very well be the cause of this discrepancy.
Use normal or truncated_normal instead, with the default zero mean and a small variance for the weights:
weights = {
    'h1': tf.Variable(tf.truncated_normal([n_input, n_hidden_1],
                                          stddev=0.01, dtype=tf.float32)),
    'out': tf.Variable(tf.truncated_normal([n_hidden_1, n_classes],
                                           stddev=0.01, dtype=tf.float32))
}
and zero for the biases:
biases = {
    'b1': tf.Variable(tf.constant(0, shape=[n_hidden_1], dtype=tf.float32)),
    'out': tf.Variable(tf.constant(0, shape=[n_classes], dtype=tf.float32))
}
That said, I am not sure about the correctness of using the MulticlassClassificationEvaluator for a binary classification problem, and I would suggest doing some further manual checks to confirm that the function indeed returns what you think it returns.
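One such manual check could look like the sketch below, written in plain Python with made-up labels. It rests on an assumption worth verifying against the Spark documentation: to my knowledge, metricName="f1" in MulticlassClassificationEvaluator reports an F1 averaged over all classes and weighted by class frequency, whereas f1_score_tf in the question scores only class 1. On skewed data these two numbers can be far apart for the very same predictions. The helper names f1_for_class and weighted_f1 are hypothetical.

```python
def f1_for_class(actual, predicted, cls):
    """F1 of a single class, treating that class as the positive one."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == cls and p == cls)
    fp = sum(1 for a, p in pairs if a != cls and p == cls)
    fn = sum(1 for a, p in pairs if a == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(actual, predicted):
    """Per-class F1 averaged with weights equal to each class's share of the labels."""
    n = len(actual)
    return sum(f1_for_class(actual, predicted, c) * actual.count(c) / n
               for c in sorted(set(actual)))

# Made-up skewed data: 90 samples of class 0, 10 samples of class 1.
actual = [0] * 90 + [1] * 10
# 85 negatives predicted correctly, 5 false positives,
# then 2 positives found and 8 missed.
predicted = [0] * 85 + [1] * 5 + [1] * 2 + [0] * 8

positive_only = f1_for_class(actual, predicted, 1)
weighted = weighted_f1(actual, predicted)
print("F1 of class 1 only:", round(positive_only, 4))
print("Weighted F1:       ", round(weighted, 4))
```

If the two frameworks are compared, the same definition of F1 should be computed on both sets of predictions before drawing conclusions about the models themselves.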

RNN for Speech Emotion Recognition

I want to classify speech data into four different emotions (angry, sad, happy, neutral).
The problem is that when I run my RNN code, all speech data is classified into one class.
(For example, all speech data is classified as "angry" all the time.)
I don't know the reason for this problem or what I have to change for training.
Here's my TensorFlow RNN main function for training and calculating accuracy:
def RNN(x, weights, biases, lstm_size):
    lstm_cell = []
    for i in range(lstm_size):
        lstm_cell.append(rnn.BasicLSTMCell(hidden_dim, forget_bias=1.0, state_is_tuple=True, activation=tf.nn.sigmoid))
    stacked_lstm = tf.contrib.rnn.MultiRNNCell(lstm_cell, state_is_tuple=True)
    outputs, states = tf.nn.dynamic_rnn(stacked_lstm, x, dtype=tf.float32)
    foutput = tf.contrib.layers.fully_connected(outputs[:,-1], output_dim, activation_fn=None)
    return foutput
logits = RNN(X, weights, biases, lstm_size)
prediction = tf.nn.sigmoid(logits)
cost = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=Y))
learning_rate = tf.train.exponential_decay(learning_rate=initial_learning_rate, global_step=training_steps, decay_steps=training_steps/10, decay_rate=0.96, staircase=True)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(cost)
pred = tf.argmax(prediction, axis=1)
label = tf.argmax(Y, axis=1)
correct_pred = tf.equal(pred, label)
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
The input to the RNN is speech features (pitch and MFCC) and the output is a one-hot code (for example, angry=[1,0,0,0]).
Also, I wonder whether it is correct to calculate classification accuracy like this.

Can I generate the input given the output in a pretrained TensorFlow model?

Let's assume I have trained a model for the MNIST task, given the following code:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)
import tensorflow as tf
# Parameters
learning_rate = 0.001
training_epochs = 15
batch_size = 100
display_step = 1
# Network Parameters
n_hidden_1 = 256 # 1st layer number of features
n_hidden_2 = 256 # 2nd layer number of features
n_input = 784 # MNIST data input (img shape: 28*28)
n_classes = 10 # MNIST total classes (0-9 digits)
# tf Graph input
x = tf.placeholder("float", [None, n_input])
y = tf.placeholder("float", [None, n_classes])
weights = {
    'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),
    'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_hidden_2, n_classes]))
}
biases = {
    'b1': tf.Variable(tf.random_normal([n_hidden_1])),
    'b2': tf.Variable(tf.random_normal([n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_classes]))
}
# Create model
def multilayer_perceptron(x, weights, biases):
    # Hidden layer with RELU activation
    layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
    layer_1 = tf.nn.relu(layer_1)
    # Hidden layer with RELU activation
    layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
    layer_2 = tf.nn.relu(layer_2)
    # Output layer with linear activation
    out_layer = tf.matmul(layer_2, weights['out']) + biases['out']
    return out_layer
# Construct model
pred = multilayer_perceptron(x, weights, biases)
# Test model
correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
# Calculate accuracy
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
# Initializing the variables
init = tf.global_variables_initializer()
# Launch the graph
with tf.Session() as sess:
    sess.run(init)
    # Training cycle
    for epoch in range(training_epochs):
        avg_cost = 0.
        avg_acc = 0.
        total_batch = int(mnist.train.num_examples/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            # Run optimization op (backprop) and cost op (to get loss value)
            _, c = sess.run([optimizer, cost], feed_dict={x: batch_x, y: batch_y})
            batch_acc = accuracy.eval({x: batch_x, y: batch_y})
            # Compute average loss
            avg_cost += c / total_batch
            avg_acc += batch_acc / total_batch
        # Display logs per epoch step
        if epoch % display_step == 0:
            test_acc = accuracy.eval({x: mnist.test.images, y: mnist.test.labels})
            print(
                "Epoch:",
                '%04d' % (epoch+1),
                "cost=",
                "{:.9f}".format(avg_cost),
                "average_train_accuracy=",
                "{:.6f}".format(avg_acc),
                "test_accuracy=",
                "{:.6f}".format(test_acc)
            )
    print("Optimization Finished!")
So this model predicts the digit shown in an image, given the image.
Once I have trained it, could I make the input a 'variable' instead of a 'placeholder' and try to reverse engineer the input given an output?
For example, I would like to feed the output '8' and produce a representative image of the number eight.
I thought of:
Freezing the model
Adding a variable matrix 'M' of the same size as the input between the input and the weights
Feeding an identity matrix as input to the input placeholder
Running the optimizer to learn the 'M' matrix
Is there a better way?
If your goal is to reverse the model, in the sense that the input should be a digit and the output an image displaying that digit (in, say, handwritten form), that is not quite possible with machine learning models.
Because machine learning models attempt to generalize from the input (so that similar inputs produce similar outputs, even for inputs the model was never trained on), they tend to be quite lossy. Additionally, the reduction from hundreds, thousands or more input variables to a single output variable necessarily loses information in the process.
More specifically, although a multilayer perceptron (as in your example) is a fully connected neural network, some weights are expected to be zero, thus completely dropping the information in certain input variables. Moreover, the same output of a neuron can be produced by multiple distinct input values to its function, due to the many degrees of freedom.
It is theoretically possible to replace those degrees of freedom and the lost information with specifically crafted or random data, but that does not guarantee a successful output.
On a side note, I'm a bit puzzled by this question. If you are able to build that model yourself, you could also create a similar model that does the opposite: train a model to accept an input digit (and perhaps some random seed) and output an image.
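For what it's worth, the idea from the question (keep the trained weights fixed and run the optimizer on the input instead) can be sketched on a toy problem in plain Python. Everything below is an assumption for illustration: W and B stand in for frozen, already-trained weights of a hypothetical 3-class linear softmax model over 4 "pixels", and the gradient is written out by hand rather than taken from TensorFlow. As the answer above points out, the input found this way is only one of many that produce the desired output, and for a real MNIST model it is not guaranteed to look like a digit.

```python
import math

# Hypothetical frozen weights: 3 classes, 4 input "pixels". Never updated below.
W = [[ 2.0, -1.0, -1.0, 0.0],
     [-1.0,  2.0, -1.0, 0.0],
     [-1.0, -1.0,  2.0, 1.0]]
B = [0.0, 0.0, 0.0]

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def predict(x):
    """Class probabilities of the frozen model for input x."""
    z = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W, B)]
    return softmax(z)

def optimize_input(target, steps=200, lr=0.5):
    """Gradient descent on the input only, minimizing cross-entropy for `target`."""
    x = [0.5] * 4  # start from a uniform gray input
    for _ in range(steps):
        p = predict(x)
        # d(loss)/d(logits) = p - one_hot(target); chain rule through z = W x + b
        dz = [pk - (1.0 if k == target else 0.0) for k, pk in enumerate(p)]
        grad = [sum(W[k][j] * dz[k] for k in range(len(W))) for j in range(len(x))]
        # Update x, then clip to the valid pixel range [0, 1].
        x = [min(1.0, max(0.0, xj - lr * gj)) for xj, gj in zip(x, grad)]
    return x

x_star = optimize_input(target=2)
probs = predict(x_star)
print("optimized input:", x_star)
print("class probabilities:", [round(p, 3) for p in probs])
```

In TensorFlow the equivalent would be to make the input a tf.Variable, build the loss against the desired label, and pass only that variable to the optimizer through var_list, so that the model weights stay untouched.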