A recent paper (here) introduced a secondary loss function that they called center loss. It is based on the distance between the embeddings in a batch and the running average embedding for each of the respective classes. There has been some discussion in the TF Google groups (here) regarding how such embedding centers can be computed and updated. I've put together some code to generate class-average embeddings in my answer below.
Is this the best way to do this?
The previously posted method is too simple for cases like center loss where the expected value of the embeddings change over time as the model becomes more refined. This is because the previous center-finding routine averages all instances since start and therefore tracks changes in expected value very slowly. Instead, a moving window average is preferred. An exponential moving-window variant is as follows:
def get_embed_centers(embed_batch, label_batch):
''' Exponential moving window average. Increase decay for longer windows [0.0 1.0]
'''
decay = 0.95
with tf.variable_scope('embed', reuse=True):
embed_ctrs = tf.get_variable("ctrs")
label_batch = tf.reshape(label_batch, [-1])
old_embed_ctrs_batch = tf.gather(embed_ctrs, label_batch)
dif = (1 - decay) * (old_embed_ctrs_batch - embed_batch)
embed_ctrs = tf.scatter_sub(embed_ctrs, label_batch, dif)
embed_ctrs_batch = tf.gather(embed_ctrs, label_batch)
return embed_ctrs_batch
with tf.Session() as sess:
with tf.variable_scope('embed'):
embed_ctrs = tf.get_variable("ctrs", [nclass, ndims], dtype=tf.float32,
initializer=tf.constant_initializer(0), trainable=False)
label_batch_ph = tf.placeholder(tf.int32)
embed_batch_ph = tf.placeholder(tf.float32)
embed_ctrs_batch = get_embed_centers(embed_batch_ph, label_batch_ph)
sess.run(tf.initialize_all_variables())
tf.get_default_graph().finalize()
The get_new_centers() routine below takes in labelled embeddings and updates shared variables center/sums and center/cts. These variables are then used to calculate and return the embedding centers using the updated values.
The loop just exercises get_new_centers() and shows that it converges to the expected average embeddings for all classes over time.
Note that the alpha term used in the original paper isn't included here but should be straightforward to add if needed.
ndims = 2
nclass = 4
nbatch = 100
with tf.variable_scope('center'):
center_sums = tf.get_variable("sums", [nclass, ndims], dtype=tf.float32,
initializer=tf.constant_initializer(0), trainable=False)
center_cts = tf.get_variable("cts", [nclass], dtype=tf.float32,
initializer=tf.constant_initializer(0), trainable=False)
def get_new_centers(embeddings, indices):
'''
Update embedding for selected class indices and return the new average embeddings.
Only the newly-updated average embeddings are returned corresponding to
the indices (including duplicates).
'''
with tf.variable_scope('center', reuse=True):
center_sums = tf.get_variable("sums")
center_cts = tf.get_variable("cts")
# update embedding sums, cts
if embeddings is not None:
ones = tf.ones_like(indices, tf.float32)
center_sums = tf.scatter_add(center_sums, indices, embeddings, name='sa1')
center_cts = tf.scatter_add(center_cts, indices, ones, name='sa2')
# return updated centers
num = tf.gather(center_sums, indices)
denom = tf.reshape(tf.gather(center_cts, indices), [-1, 1])
return tf.div(num, denom)
with tf.Session() as sess:
labels_ph = tf.placeholder(tf.int32)
embeddings_ph = tf.placeholder(tf.float32)
unq_labels, ul_idxs = tf.unique(labels_ph)
indices = tf.gather(unq_labels, ul_idxs)
new_centers_with_update = get_new_centers(embeddings_ph, indices)
new_centers = get_new_centers(None, indices)
sess.run(tf.initialize_all_variables())
tf.get_default_graph().finalize()
for i in range(100001):
embeddings = 100*np.random.randn(nbatch, ndims)
labels = np.random.randint(0, nclass, nbatch)
feed_dict = {embeddings_ph:embeddings, labels_ph:labels}
rval = sess.run([new_centers_with_update], feed_dict)
if i % 1000 == 0:
feed_dict = {labels_ph:range(nclass)}
rval = sess.run(new_centers, feed_dict)
print('\nFor step ', i)
for iclass in range(nclass):
print('Class %d, center: %s' % (iclass, str(rval[iclass])))
A typical result at step 0 is:
For step 0
Class 0, center: [-1.7618252 -0.30574229]
Class 1, center: [ -4.50493908 10.12403965]
Class 2, center: [ 3.6156714 -9.94263649]
Class 3, center: [-4.20281982 -8.28845882]
and the output at step 10,000 demonstrates convergence:
For step 10000
Class 0, center: [ 0.00313433 -0.00757505]
Class 1, center: [-0.03476512 0.04682625]
Class 2, center: [-0.03865958 0.06585111]
Class 3, center: [-0.02502561 -0.03370816]
Related
I have a fully connected neural network with the following number of neurons in each layer [4, 20, 20, 20, ..., 1]. I am using TensorFlow and the 4 real-valued inputs correspond to a particular point in space and time, i.e. (x, y, z, t), and the 1 real-valued output corresponds to the temperature at that point. The loss function is just the mean square error between my predicted temperature and the actual temperature at that point in (x, y, z, t). I have a set of training data points with the following structure for their inputs:
(x,y,z,t):
(0.11,0.12,1.00,0.41)
(0.34,0.43,1.00,0.92)
(0.01,0.25,1.00,0.65)
...
(0.71,0.32,1.00,0.49)
(0.31,0.22,1.00,0.01)
(0.21,0.13,1.00,0.71)
Namely, what you will notice is that the training data all have the same redundant value in z, but x, y, and t are generally not redundant. Yet what I find is my neural network cannot train on this data due to the redundancy. In particular, every time I start training the neural network, it appears to fail and the loss function becomes nan. But, if I change the structure of the neural network such that the number of neurons in each layer is [3, 20, 20, 20, ..., 1], i.e. now data points only correspond to an input of (x, y, t), everything works perfectly and training is all right. But is there any way to overcome this problem? (Note: it occurs whether any of the variables are identical, e.g. either x, y, or t could be redundant and cause this error.) I have also attempted different activation functions (e.g. ReLU) and varying the number of layers and neurons in the network, but these changes do not resolve the problem.
My question: is there any way to still train the neural network while keeping the redundant z as an input? It just so happens the particular training data set I am considering at the moment has all z redundant, but in general, I will have data coming from different z in the future. Therefore, a way to ensure the neural network can robustly handle inputs at the present moment is sought.
A minimal working example is encoded below. When running this example, the loss output is nan, but if you simply uncomment the x_z in line 12 to ensure there is now variation in x_z, then there is no longer any problem. But this is not a solution since the goal is to use the original x_z with all constant values.
import numpy as np
import tensorflow as tf
end_it = 10000 #number of iterations
frac_train = 1.0 #randomly sampled fraction of data to create training set
frac_sample_train = 0.1 #randomly sampled fraction of data from training set to train in batches
layers = [4, 20, 20, 20, 20, 20, 20, 20, 20, 1]
len_data = 10000
x_x = np.array([np.linspace(0.,1.,len_data)])
x_y = np.array([np.linspace(0.,1.,len_data)])
x_z = np.array([np.ones(len_data)*1.0])
#x_z = np.array([np.linspace(0.,1.,len_data)])
x_t = np.array([np.linspace(0.,1.,len_data)])
y_true = np.array([np.linspace(-1.,1.,len_data)])
N_train = int(frac_train*len_data)
idx = np.random.choice(len_data, N_train, replace=False)
x_train = x_x.T[idx,:]
y_train = x_y.T[idx,:]
z_train = x_z.T[idx,:]
t_train = x_t.T[idx,:]
v1_train = y_true.T[idx,:]
sample_batch_size = int(frac_sample_train*N_train)
np.random.seed(1234)
tf.set_random_seed(1234)
import logging
logging.getLogger('tensorflow').setLevel(logging.ERROR)
tf.logging.set_verbosity(tf.logging.ERROR)
class NeuralNet:
def __init__(self, x, y, z, t, v1, layers):
X = np.concatenate([x, y, z, t], 1)
self.lb = X.min(0)
self.ub = X.max(0)
self.X = X
self.x = X[:,0:1]
self.y = X[:,1:2]
self.z = X[:,2:3]
self.t = X[:,3:4]
self.v1 = v1
self.layers = layers
self.weights, self.biases = self.initialize_NN(layers)
self.sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=False,
log_device_placement=False))
self.x_tf = tf.placeholder(tf.float32, shape=[None, self.x.shape[1]])
self.y_tf = tf.placeholder(tf.float32, shape=[None, self.y.shape[1]])
self.z_tf = tf.placeholder(tf.float32, shape=[None, self.z.shape[1]])
self.t_tf = tf.placeholder(tf.float32, shape=[None, self.t.shape[1]])
self.v1_tf = tf.placeholder(tf.float32, shape=[None, self.v1.shape[1]])
self.v1_pred = self.net(self.x_tf, self.y_tf, self.z_tf, self.t_tf)
self.loss = tf.reduce_mean(tf.square(self.v1_tf - self.v1_pred))
self.optimizer = tf.contrib.opt.ScipyOptimizerInterface(self.loss,
method = 'L-BFGS-B',
options = {'maxiter': 50,
'maxfun': 50000,
'maxcor': 50,
'maxls': 50,
'ftol' : 1.0 * np.finfo(float).eps})
init = tf.global_variables_initializer()
self.sess.run(init)
def initialize_NN(self, layers):
weights = []
biases = []
num_layers = len(layers)
for l in range(0,num_layers-1):
W = self.xavier_init(size=[layers[l], layers[l+1]])
b = tf.Variable(tf.zeros([1,layers[l+1]], dtype=tf.float32), dtype=tf.float32)
weights.append(W)
biases.append(b)
return weights, biases
def xavier_init(self, size):
in_dim = size[0]
out_dim = size[1]
xavier_stddev = np.sqrt(2/(in_dim + out_dim))
return tf.Variable(tf.truncated_normal([in_dim, out_dim], stddev=xavier_stddev), dtype=tf.float32)
def neural_net(self, X, weights, biases):
num_layers = len(weights) + 1
H = 2.0*(X - self.lb)/(self.ub - self.lb) - 1.0
for l in range(0,num_layers-2):
W = weights[l]
b = biases[l]
H = tf.tanh(tf.add(tf.matmul(H, W), b))
W = weights[-1]
b = biases[-1]
Y = tf.add(tf.matmul(H, W), b)
return Y
def net(self, x, y, z, t):
v1_out = self.neural_net(tf.concat([x,y,z,t], 1), self.weights, self.biases)
v1 = v1_out[:,0:1]
return v1
def callback(self, loss):
global Nfeval
print(str(Nfeval)+' - Loss in loop: %.3e' % (loss))
Nfeval += 1
def fetch_minibatch(self, x_in, y_in, z_in, t_in, den_in, N_train_sample):
idx_batch = np.random.choice(len(x_in), N_train_sample, replace=False)
x_batch = x_in[idx_batch,:]
y_batch = y_in[idx_batch,:]
z_batch = z_in[idx_batch,:]
t_batch = t_in[idx_batch,:]
v1_batch = den_in[idx_batch,:]
return x_batch, y_batch, z_batch, t_batch, v1_batch
def train(self, end_it):
it = 0
while it < end_it:
x_res_batch, y_res_batch, z_res_batch, t_res_batch, v1_res_batch = self.fetch_minibatch(self.x, self.y, self.z, self.t, self.v1, sample_batch_size) # Fetch residual mini-batch
tf_dict = {self.x_tf: x_res_batch, self.y_tf: y_res_batch, self.z_tf: z_res_batch, self.t_tf: t_res_batch,
self.v1_tf: v1_res_batch}
self.optimizer.minimize(self.sess,
feed_dict = tf_dict,
fetches = [self.loss],
loss_callback = self.callback)
def predict(self, x_star, y_star, z_star, t_star):
tf_dict = {self.x_tf: x_star, self.y_tf: y_star, self.z_tf: z_star, self.t_tf: t_star}
v1_star = self.sess.run(self.v1_pred, tf_dict)
return v1_star
model = NeuralNet(x_train, y_train, z_train, t_train, v1_train, layers)
Nfeval = 1
model.train(end_it)
I think your problem is in this line:
H = 2.0*(X - self.lb)/(self.ub - self.lb) - 1.0
In the third column fo X, corresponding to the z variable, both self.lb and self.ub are the same value, and equal to the value in the example, in this case 1, so it is acutally computing:
2.0*(1.0 - 1.0)/(1.0 - 1.0) - 1.0 = 2.0*0.0/0.0 - 1.0
Which is nan. You can work around the issue in a few different ways, a simple option is to simply do:
# Avoids dividing by zero
X_d = tf.math.maximum(self.ub - self.lb, 1e-6)
H = 2.0*(X - self.lb)/X_d - 1.0
This is an interesting situation. A quick check on an online tool for regression shows that even simple regression suffers from the problem of unable to fit data points when one of the inputs is constant through the dataset. Taking a look at the algebraic solution for a two-variable linear regression problem shows the solution involving division by standard deviation which, being zero in a constant set, is a problem.
As far as solving through backprop is concerned (as is the case in your neural network), I strongly suspect that the derivative of the loss with respect to the input (these expressions) is the culprit, and that the algorithm is not able to update the weights W using W := W - α.dZ, and ends up remaining constant.
I have an issue while using AUC from tensorflow library. I train my model (convolutional neural network) per batch ( i do not use a validation set) and after each epoch I use an independent test set to obtain my evaluations. The problem lies within AUC evaluation.
In each batch I calculate AUC/Accuracy/Loss/Precision/Recall/F1_score for the training set and then I aggregate the mean of these scores. When I try to do the same for the test set I again calculate the same scores. I notice that all scores except AUC have different values. I think it is not correct test's loss function to increase and AUC to increase as well. And the problem is that test's AUC is almost identical to training's AUC (even though their accuracy, loss error are completely different).
with tf.name_scope("output"):
W = tf.Variable(tf.truncated_normal([num_filters_total, num_classes], stddev=0.1), name="W")
b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
scores = tf.nn.xw_plus_b(h_drop, W, b, name="scores")
predictions = tf.argmax(scores, 1, name="predictions")
l2_loss += tf.nn.l2_loss(W, name="l2_loss")
l2_loss += tf.nn.l2_loss(b, name="l2_loss")
tf.summary.histogram("l2", l2_loss)
tf.summary.histogram("weigths", W)
tf.summary.histogram("biases", b)
with tf.name_scope("auc_score"):
# labelOut = tf.argmax(y_place_holder, 1)
probability = tf.nn.softmax(scores)
# auc_scoreTemp = streaming_auc(y_place_holder, probability, curve="PR")
auc_scoreTemp = tf.metrics.auc(y_place_holder, probability, curve="PR")
auc_score = tf.reduce_mean(tf.cast(auc_scoreTemp, tf.float32), name="auc_score")
tf.summary.scalar("auc_score", auc_score)
with tf.name_scope("accuracy"):
labelOut = tf.argmax(y_place_holder, 1)
correct_prediction = tf.equal(predictions, tf.argmax(y_place_holder, 1), name="correct_prediction")
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name="accuracy")
tf.summary.scalar("accuracy", accuracy)
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
for batch in batches:
x_batch, y_batch = list(zip(*batch))
_, accuracy_train, auc_training, loss_train, prec_batch, recall_batch, f1_batch \
= sess.run([train_step, accuracy, auc_score, cross_entropy, precision_mini_batch,
recall_mini_batch, f1_score_min_batch], feed_dict={x_place_holder: x_batch,
y_place_holder: y_batch,
emb_place_holder: vocab_inv_emb_dset,
dropout_keep_prob: dropout_rate})
...
for test_batch in test_batches:
auc_test = None
x_test_batch, y_test_batch = list(zip(*test_batch))
accuracy_test, loss_test, auc_test = sess.run([accuracy, cross_entropy, auc_score],
feed_dict={x_place_holder: x_test_batch,
y_place_holder: y_test_batch,
emb_place_holder: vocab_inv_emb_dset_val,
dropout_keep_prob: 1.0})
I also tried to use streaming_auc which returns always 1.
EDIT
In the end of every epoch I reset the local variables by running:
sess.run(tf.local_variables_initializer())
But the first batch outputs really bad results. After the first batch I get normal results from test set which are not close to the training results. I don't know if this is the correct way to do it but results seem more realistic this way.
All of the tf.metrics return a value and an updating op (see here). So as described here you want to use the updating op to accumulate values and then evaluate auc_score to retrieve the accumulated value, something like this:
...
auc_score, auc_op = tf.metrics.auc(y_place_holder, probability, curve="PR")
...
for batch in batches:
sess.run([train_step, accuracy, auc_op, cross_entropy,...)
...
py_auc = sess.run(auc)
EDIT -- toy example showing tf.metrics.auc and tf.contrib.metrics.streaming_auc
import tensorflow as tf
from tensorflow.contrib import metrics
batch_sz = 100
noise_mag = 0.5
nloop = 10
tf.set_random_seed(0)
batch_x = tf.random_uniform([batch_sz, 1], 0, 2, dtype=tf.int32)
noise = noise_mag * tf.random_normal([batch_sz, 1])
batch_y = tf.sigmoid(tf.to_float(batch_x) + noise)
auc_val, auc_accum = tf.metrics.auc(batch_x, batch_y)
#note: contrib.metrics.streaming_auc reverses labels, predictions
auc_val2, auc_accum2 = metrics.streaming_auc(batch_y, batch_x)
with tf.Session() as sess:
sess.run(tf.local_variables_initializer())
for i in range(nloop):
_ = sess.run([auc_accum, auc_accum2])
auc, auc2 = sess.run([auc_val, auc_val2])
print('Accumulated AUC = ', sess.run(auc_val)) #0.9238014
print('Accumulated AUC2 = ', sess.run(auc_val)) #0.9238014
I am new to TensorFlow RNN prediction.
I am trying to use RNN with BasicLSTMCell to predict sequence, such as
1,2,3,4,5 ->6
3,4,5,6,7 ->8
35,36,37,38,39 ->40
My code doesn't report error, but outputs for every batch seem to be the same, and the cost seem to not reduce while training.
When I divided all training data by 100
0.01,0.02,0.03,0.04,0.05 ->0.06
0.03,0.04,0.05,0.06,0.07 ->0.08
0.35,0.36,0.37,0.38,0.39 ->0.40
The result is pretty good, the correlation between prediction and real values is very high (0.9998).
I suspect the problem is because integer and float? but I cannot explain the reason. Anyone can help? Many thanks!!
Here is the code
library(tensorflow)
start=sample(1:1000, 100000, T)
start1= start+1
start2=start1+1
start3= start2+1
start4=start3+1
start5= start4+1
start6=start5+1
label=start6+1
data=data.frame(start, start1, start2, start3, start4, start5, start6, label)
data=as.matrix(data)
n = nrow(data)
trainIndex = sample(1:n, size = round(0.7*n), replace=FALSE)
train = data[trainIndex ,]
test = data[-trainIndex ,]
train_data= train[,1:7]
train_label= train[,8]
means=apply(train_data, 2, mean)
sds= apply(train_data, 2, sd)
train_data=(train_data-means)/sds
test_data=test[,1:7]
test_data=(test_data-means)/sds
test_label=test[,8]
batch_size = 50L
n_inputs = 1L # MNIST data input (img shape: 28*28)
n_steps = 7L # time steps
n_hidden_units = 10L # neurons in hidden layer
n_outputs = 1L # MNIST classes (0-9 digits)
x = tf$placeholder(tf$float32, shape(NULL, n_steps, n_inputs))
y = tf$placeholder(tf$float32, shape(NULL, 1L))
weights_in= tf$Variable(tf$random_normal(shape(n_inputs, n_hidden_units)))
weights_out= tf$Variable(tf$random_normal(shape(n_hidden_units, 1L)))
biases_in=tf$Variable(tf$constant(0.1, shape= shape(n_hidden_units )))
biases_out = tf$Variable(tf$constant(0.1, shape=shape(1L)))
RNN=function(X, weights_in, weights_out, biases_in, biases_out)
{
X = tf$reshape(X, shape=shape(-1, n_inputs))
X_in = tf$sigmoid (tf$matmul(X, weights_in) + biases_in)
X_in = tf$reshape(X_in, shape=shape(-1, n_steps, n_hidden_units)
lstm_cell = tf$contrib$rnn$BasicLSTMCell(n_hidden_units, forget_bias=1.0, state_is_tuple=T)
init_state = lstm_cell$zero_state(batch_size, dtype=tf$float32)
outputs_final_state = tf$nn$dynamic_rnn(lstm_cell, X_in, initial_state=init_state, time_major=F)
outputs= tf$unstack(tf$transpose(outputs_final_state[[1]], shape(1,0,2)))
results = tf$matmul(outputs[[length(outputs)]], weights_out) + biases_out
return(results)
}
pred = RNN(x, weights_in, weights_out, biases_in, biases_out)
cost = tf$losses$mean_squared_error(pred, y)
train_op = tf$contrib$layers$optimize_loss(loss=cost, global_step=tf$contrib$framework$get_global_step(), learning_rate=0.05, optimizer="SGD")
init <- tf$global_variables_initializer()
sess <- tf$Session()
sess.run(init)
step = 0
while (step < 1000)
{
train_data2= train_data[(step*batch_size+1) : (step*batch_size+batch_size) , ]
train_label2=train_label[(step*batch_size+1):(step*batch_size+batch_size)]
batch_xs <- sess$run(tf$reshape(train_data2, shape(batch_size, n_steps, n_inputs))) # Reshape
batch_ys= matrix(train_label2, ncol=1)
sess$run(train_op, feed_dict = dict(x = batch_xs, y= batch_ys))
mycost <- sess$run(cost, feed_dict = dict(x = batch_xs, y= batch_ys))
print (mycost)
test_data2= test_data[(0*batch_size+1) : (0*batch_size+batch_size) , ]
test_label2=test_label[(0*batch_size+1):(0*batch_size+batch_size)]
batch_xs <- sess$run(tf$reshape(test_data2, shape(batch_size, n_steps, n_inputs))) # Reshape
batch_ys= matrix(test_label2, ncol=1)
step=step+1
}
First, it's quite useful to always normalize your network inputs (there are different approaches, divide by a maximum value, subtract mean and divide by std and many more). This will help your optimizer a lot.
Second, and actually most important in your case, after the RNN output you are applying sigmoid function. If you check the plot of the sigmoid function, you will see that it actually scales all inputs to the range (0,1). So basically no matter how big your inputs are your output will always be at most 1. Thus you should not use any activation functions at the output layer in regression problems.
Hope it helps.
I have stopped training at some point and saved checkpoint, meta files etc.
Now when I want to resume training, I want to start with last running learning rate of the optimizer. Can you provide a example of doing so ?
For those coming here (like me) wondering whether the last learning rate is automatically restored: tf.train.exponential_decay doesn't add any Variables to the graph, it only adds the operations necessary to derive the correct current learning rate value given a certain global_step value. This way, you only need to checkpoint the global_step value (which is done by default normally) and, assuming you keep the same initial learning rate, decay steps and decay factor, you'll automatically pick up training where you left it, with the correct learning rate value.
Inspecting the checkpoint won't show any learning_rate variable (or similar), simply because there is no need for any.
This example code learns to add two numbers:
import tensorflow as tf
import numpy as np
import os
save_ckpt_dir = './add_ckpt'
ckpt_filename = 'add.ckpt'
save_ckpt_path = os.path.join(save_ckpt_dir, ckpt_filename)
if not os.path.isdir(save_ckpt_dir):
os.mkdir(save_ckpt_dir)
if [fname.startswith("add.ckpt") for fname in os.listdir(save_ckpt_dir)]: # prefer to load pre-trained net
load_ckpt_path = save_ckpt_path
else:
load_ckpt_path = None # train from scratch
def add_layer(inputs, in_size, out_size, activation_fn=None):
Weights = tf.Variable(tf.ones([in_size, out_size]), name='Weights')
biases = tf.Variable(tf.zeros([1, out_size]), name='biases')
Wx_plus_b = tf.add(tf.matmul(inputs, Weights), biases)
if activation_fn is None:
layer_output = Wx_plus_b
else:
layer_output = activation_fn(Wx_plus_b)
return layer_output
def produce_batch(batch_size=256):
"""Loads a single batch of data.
Args:
batch_size: The number of excersises in the batch.
Returns:
x : column vector of numbers
y : another column of numbers
xy_sum : the sum of the columns
"""
x = np.random.random(size=[batch_size, 1]) * 10
y = np.random.random(size=[batch_size, 1]) * 10
xy_sum = x + y
return x, y, xy_sum
with tf.name_scope("inputs"):
xs = tf.placeholder(tf.float32, [None, 1])
ys = tf.placeholder(tf.float32, [None, 1])
with tf.name_scope("correct_labels"):
xysums = tf.placeholder(tf.float32, [None, 1])
with tf.name_scope("step_and_learning_rate"):
global_step = tf.Variable(0, trainable=False)
lr = tf.train.exponential_decay(0.15, global_step, 10, 0.96) # start lr=0.15, decay every 10 steps with a base of 0.96
with tf.name_scope("graph_body"):
prediction = add_layer(tf.concat([xs, ys], 1), 2, 1, activation_fn=None)
with tf.name_scope("loss_and_train"):
# the error between prediction and real data
loss = tf.reduce_mean(tf.reduce_sum(tf.square(xysums-prediction), reduction_indices=[1]))
# Passing global_step to minimize() will increment it at each step.
train_step = tf.train.AdamOptimizer(lr).minimize(loss, global_step=global_step)
with tf.name_scope("init_load_save"):
init = tf.global_variables_initializer()
saver = tf.train.Saver()
with tf.Session() as sess:
sess.run(init)
if load_ckpt_path:
saver.restore(sess, load_ckpt_path)
for i in range(1000):
x, y, xy_sum = produce_batch(256)
_, global_step_np, loss_np, lr_np = sess.run([train_step, global_step, loss, lr], feed_dict={xs: x, ys: y, xysums: xy_sum})
if global_step_np % 100 == 0:
print("global step: {}, loss: {}, learning rate: {}".format(global_step_np, loss_np, lr_np))
saver.save(sess, save_ckpt_path)
if you run it a few times, you will see the learning rate decrease. It also saves the global step. The trick is here:
with tf.name_scope("step_and_learning_rate"):
global_step = tf.Variable(0, trainable=False)
lr = tf.train.exponential_decay(0.15, global_step, 10, 0.96) # start lr=0.15, decay every 10 steps with a base of 0.96
...
train_step = tf.train.AdamOptimizer(lr).minimize(loss, global_step=global_step)
By default, saver.save will save all savable objects (including learning rate and global step). However, if tf.train.Saver is provided with var_list, saver.save will only save the vars included in var_list:
saver = tf.train.Saver(var_list = ..list of vars to save..)
sources:
https://www.tensorflow.org/api_docs/python/tf/train/exponential_decay
https://stats.stackexchange.com/questions/200063/tensorflow-adam-optimizer-with-exponential-decay
https://www.tensorflow.org/api_docs/python/tf/train/Saver (see "saveable objects")
Dataset Description
The dataset contains a set of question pairs and a label which tells if the questions are same. e.g.
"How do I read and find my YouTube comments?" , "How can I see all my
Youtube comments?" , "1"
The goal of the model is to identify if the given question pair is same or different.
Approach
I have created a Siamese network to identify if two questions are same. Following is the model:
graph = tf.Graph()
with graph.as_default():
embedding_placeholder = tf.placeholder(tf.float32, shape=embedding_matrix.shape, name='embedding_placeholder')
with tf.variable_scope('siamese_network') as scope:
labels = tf.placeholder(tf.int32, [batch_size, None], name='labels')
keep_prob = tf.placeholder(tf.float32, name='question1_keep_prob')
with tf.name_scope('question1') as question1_scope:
question1_inputs = tf.placeholder(tf.int32, [batch_size, seq_len], name='question1_inputs')
question1_embedding = tf.get_variable(name='embedding', initializer=embedding_placeholder, trainable=False)
question1_embed = tf.nn.embedding_lookup(question1_embedding, question1_inputs)
question1_lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
question1_drop = tf.contrib.rnn.DropoutWrapper(question1_lstm, output_keep_prob=keep_prob)
question1_multi_lstm = tf.contrib.rnn.MultiRNNCell([question1_drop] * lstm_layers)
q1_initial_state = question1_multi_lstm.zero_state(batch_size, tf.float32)
question1_outputs, question1_final_state = tf.nn.dynamic_rnn(question1_multi_lstm, question1_embed, initial_state=q1_initial_state)
scope.reuse_variables()
with tf.name_scope('question2') as question2_scope:
question2_inputs = tf.placeholder(tf.int32, [batch_size, seq_len], name='question2_inputs')
question2_embedding = question1_embedding
question2_embed = tf.nn.embedding_lookup(question2_embedding, question2_inputs)
question2_lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
question2_drop = tf.contrib.rnn.DropoutWrapper(question2_lstm, output_keep_prob=keep_prob)
question2_multi_lstm = tf.contrib.rnn.MultiRNNCell([question2_drop] * lstm_layers)
q2_initial_state = question2_multi_lstm.zero_state(batch_size, tf.float32)
question2_outputs, question2_final_state = tf.nn.dynamic_rnn(question2_multi_lstm, question2_embed, initial_state=q2_initial_state)
Calculate the cosine distance using the RNN outputs:
with graph.as_default():
diff = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(question1_outputs[:, -1, :], question2_outputs[:, -1, :])), reduction_indices=1))
margin = tf.constant(1.)
labels = tf.to_float(labels)
match_loss = tf.expand_dims(tf.square(diff, 'match_term'), 0)
mismatch_loss = tf.expand_dims(tf.maximum(0., tf.subtract(margin, tf.square(diff)), 'mismatch_term'), 0)
loss = tf.add(tf.matmul(labels, match_loss), tf.matmul((1 - labels), mismatch_loss), 'loss_add')
distance = tf.reduce_mean(loss)
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(distance)
Following is the code to train the model:
with graph.as_default():
saver = tf.train.Saver()
with tf.Session(graph=graph) as sess:
sess.run(tf.global_variables_initializer(), feed_dict={embedding_placeholder: embedding_matrix})
iteration = 1
for e in range(epochs):
summary_writer = tf.summary.FileWriter('/Users/mithun/projects/kaggle/quora_question_pairs/logs', sess.graph)
summary_writer.add_graph(sess.graph)
for ii, (x1, x2, y) in enumerate(get_batches(question1_train, question2_train, label_train, batch_size), 1):
feed = {question1_inputs: x1,
question2_inputs: x2,
labels: y[:, None],
keep_prob: 0.9
}
loss1 = sess.run([distance], feed_dict=feed)
if iteration%5==0:
print("Epoch: {}/{}".format(e, epochs),
"Iteration: {}".format(iteration),
"Train loss: {:.3f}".format(loss1))
if iteration%50==0:
val_acc = []
for x1, x2, y in get_batches(question1_val, question2_val, label_val, batch_size):
feed = {question1_inputs: x1,
question2_inputs: x2,
labels: y[:, None],
keep_prob: 1
}
batch_acc = sess.run([accuracy], feed_dict=feed)
val_acc.append(batch_acc)
print("Val acc: {:.3f}".format(np.mean(val_acc)))
iteration +=1
saver.save(sess, "checkpoints/quora_pairs.ckpt")
I have trained the above model with about 10,000 labeled data. But, the accuracy is stagnant at around 0.630 and strangely the validation accuracy is same across all the iterations.
lstm_size = 64
lstm_layers = 1
batch_size = 128
learning_rate = 0.001
Is there anything wrong with the way I have created the model?
This is a common problem with imbalanced datasets like the recently released Quora dataset which you are using. Since the Quora dataset is imbalanced (~63% negative and ~37% positive examples) you need proper initialization of weights. Without weight initialization your solution will be stuck in a local minima and it will train to predict only the negative class. Hence the 63% accuracy, because that is the percentage of 'not similar' questions in your validation data. If you check the results obtained on your validation set you will notice that it predicts all zeros. A truncated normal distribution proposed in He et al., http://arxiv.org/abs/1502.01852 is a good alternate for initializing the weights.