Decreased training speed on a multi-GPU machine with TensorFlow's dynamic RNN

I have two machines available for training models built with TensorFlow: a local desktop machine with one GPU (called "local" in the following) and a remote cluster with 4 GPUs (called "cluster" in the following). Even though the cluster has 4 GPUs, I only use one GPU at a time (e.g. via CUDA_VISIBLE_DEVICES=2 python script.py). My problem is that training the exact same model on the cluster is considerably slower than on my local machine, even though the cluster has the more powerful GPU. I realize that this question may be very localized and hard to diagnose remotely, but I am at a loss as to what causes this behavior. In the following I give as many details as I can about the configuration of both machines and the model I'm building.
Model
The model is a simple toy RNN taken from this GitHub project. The model definition is as follows:
import tensorflow as tf

# Parameters
learning_rate = 0.01
training_steps = 600
batch_size = 128
display_step = 200

# Network Parameters
seq_max_len = 20  # Sequence max length
n_hidden = 64     # hidden layer num of features
n_classes = 2     # linear sequence or not

# tf Graph input
x = tf.placeholder("float", [None, seq_max_len, 1])
y = tf.placeholder("float", [None, n_classes])
# A placeholder for indicating each sequence length
seqlen = tf.placeholder(tf.int32, [None])

# Define weights
weights = {
    'out': tf.Variable(tf.random_normal([n_hidden, n_classes]))
}
biases = {
    'out': tf.Variable(tf.random_normal([n_classes]))
}

def dynamicRNN(x, seqlen, weights, biases):
    # Prepare data shape to match `rnn` function requirements
    # Current data input shape: (batch_size, n_steps, n_input)
    # Required shape: 'n_steps' tensors list of shape (batch_size, n_input)
    with tf.device('gpu:0'):
        # Unstack to get a list of 'n_steps' tensors of shape (batch_size, n_input)
        x = tf.unstack(x, seq_max_len, 1)

        # Define an LSTM cell with TensorFlow
        lstm_cell = tf.contrib.rnn.BasicLSTMCell(n_hidden)

        # Get LSTM cell output; providing 'sequence_length' performs dynamic
        # calculation.
        outputs, states = tf.contrib.rnn.static_rnn(lstm_cell, x, dtype=tf.float32,
                                                    sequence_length=seqlen)

        # When performing dynamic calculation, we must retrieve the last
        # dynamically computed output, i.e., if a sequence length is 10, we need
        # to retrieve the 10th output.
        # However, TensorFlow doesn't support advanced indexing yet, so we build
        # a custom op that, for each sample in the batch, gets its length and
        # the corresponding relevant output.

        # 'outputs' is a list of outputs at every timestep; pack them into a Tensor
        # and swap the dimensions back to [batch_size, n_steps, n_hidden]
        outputs = tf.stack(outputs)
        outputs = tf.transpose(outputs, [1, 0, 2])

        # Hack to build the indexing and retrieve the right output.
        batch_size = tf.shape(outputs)[0]
        # Start indices for each sample
        index = tf.range(0, batch_size) * seq_max_len + (seqlen - 1)
        # Indexing
        outputs = tf.gather(tf.reshape(outputs, [-1, n_hidden]), index)

        # Linear activation, using outputs computed above
        return tf.matmul(outputs, weights['out']) + biases['out']

pred = dynamicRNN(x, seqlen, weights, biases)

# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

# Evaluate model
correct_pred = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
The complete (runnable) Python script can be found here: https://pastebin.com/LnKmCiSy
Configuration on local
TensorFlow: v1.3 (pre-compiled version installed)
CUDA: v8.0.61
cuDNN: v6.0.21
GPU: GeForce GTX TITAN X
NVIDIA Driver: 375.82
OS: Ubuntu 16.04, 64-bit
Configuration on cluster
Exactly the same as for local, except:
GPU: GeForce GTX TITAN X Pascal
NVIDIA Driver: 375.66
Performance Measures
Executing the above provided toy script I get the following outputs on local:
Step 128, Minibatch Loss= 0.725320, Training Accuracy= 0.43750, Time: 0.3180224895477295
Step 25600, Minibatch Loss= 0.683126, Training Accuracy= 0.50962, Time: 0.013816356658935547
Step 51200, Minibatch Loss= 0.680907, Training Accuracy= 0.50000, Time: 0.013682842254638672
Step 76800, Minibatch Loss= 0.677346, Training Accuracy= 0.57692, Time: 0.014072895050048828
And the following on the cluster:
Step 128, Minibatch Loss= 1.536499, Training Accuracy= 0.47656, Time: 0.8308820724487305
Step 25600, Minibatch Loss= 0.693901, Training Accuracy= 0.49038, Time: 0.06193065643310547
Step 51200, Minibatch Loss= 0.689709, Training Accuracy= 0.53846, Time: 0.05762457847595215
Step 76800, Minibatch Loss= 0.685955, Training Accuracy= 0.54808, Time: 0.06454324722290039
As you can see, execution times on the cluster are about 4x higher. I tried to profile what happens on the GPU using the timeline feature (a sketch of how such a trace can be collected follows after the images). I find its output difficult to interpret, but what strikes me most is that there are huge idle gaps on the cluster. For this, see the following images, which show a trace of the timeline feature for one call to sess.run (note that the scale of the time axis is not exactly the same in both images, but the difference should still be visible).
Timeline on cluster:
Timeline on local:
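For reference, here is a minimal sketch of how such a trace can be collected with the TF 1.x timeline API (the names batch_x, batch_y, and batch_seqlen are assumed to come from the toy script's data generator):

from tensorflow.python.client import timeline

# Request full tracing for a single sess.run call.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

sess.run([optimizer, cost],
         feed_dict={x: batch_x, y: batch_y, seqlen: batch_seqlen},
         options=run_options,
         run_metadata=run_metadata)

# Convert the collected step stats into a Chrome trace viewable at chrome://tracing.
tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())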
Did any of you observe the same behavior? What are possible reasons that could cause this behavior, and/or how can I further narrow down the issue?

Related

Why is tf.GradientTape.jacobian giving None?

I'm using the IRIS dataset, and am following this official tutorial: Custom training: walkthrough
In the training loop, I am trying to gather the model outputs and weights at every epoch where epoch % 50 == 0 into the lists m_outputs_mod50 and gather_weights, respectively:
# Keep results for plotting
train_loss_results = []
train_accuracy_results = []
m_outputs_mod50 = []
gather_weights = []

num_epochs = 201

for epoch in range(num_epochs):
    epoch_loss_avg = tf.keras.metrics.Mean()
    epoch_accuracy = tf.keras.metrics.SparseCategoricalAccuracy()
    # gather_kernel(model)

    # Training loop - using batches of 32
    for x, y in train_dataset:
        # Optimize the model
        loss_value, grads = grad(model, x, y)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        # Track progress
        epoch_loss_avg.update_state(loss_value)  # Add current batch loss
        # Compare predicted label to actual label
        # training=True is needed only if there are layers with different
        # behavior during training versus inference (e.g. Dropout).
        epoch_accuracy.update_state(y, model(x, training=True))

    # End epoch
    train_loss_results.append(epoch_loss_avg.result())
    train_accuracy_results.append(epoch_accuracy.result())
    # pred_hist.append(model.predict(x))

    if epoch % 50 == 0:
        m_outputs_mod50.append(model(x))
        gather_weights.append(model.weights)
        print("Epoch {:03d}: Loss: {:.3f}, Accuracy: {:.3%}".format(epoch,
                                                                    epoch_loss_avg.result(),
                                                                    epoch_accuracy.result()))
Running the above and trying to get the jacobian at epoch 0 (using m_outputs_mod50[0] and gather_weights[0]) with
with tf.GradientTape() as tape:
    print(tape.jacobian(target=m_outputs_mod50[0], sources=gather_weights[0]))
I get a list of None as the output.
Why?
You need to understand how the GradientTape operates. For that, you can follow the guide: Introduction to gradients and automatic differentiation. Here is an excerpt:
TensorFlow provides the tf.GradientTape API for automatic
differentiation; that is, computing the gradient of a computation with
respect to some inputs, usually tf.Variables. TensorFlow "records"
relevant operations executed inside the context of a tf.GradientTape
onto a "tape". TensorFlow then uses that tape to compute the gradients
of a "recorded" computation using reverse mode differentiation.
To compute a gradient (or a jacobian), the tape needs to record the operations that are executed in its context. Then, outside its context, once the forward pass has been executed, it's possible to use the tape to compute the gradient/jacobian.
You could use something like this:
if epoch % 50 == 0:
    with tf.GradientTape() as tape:
        out = model(x)
    jacobian = tape.jacobian(out, model.weights)
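As a self-contained illustration of the same pattern (a hypothetical two-layer toy model on random data, not the IRIS setup from the question):

import tensorflow as tf

# Hypothetical model and random input, just to show the pattern.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(3),
])
x = tf.random.normal((5, 4))

with tf.GradientTape() as tape:
    out = model(x)  # the forward pass is recorded on the tape

# Outside the context, the tape yields one Jacobian per weight tensor.
jacobians = tape.jacobian(out, model.trainable_variables)
print([j.shape for j in jacobians])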

Why doesn't custom training loop average loss over batch_size?

The code snippet below is the custom training loop from the official TensorFlow tutorial: https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch. Another tutorial also does not average the loss over batch_size, as shown here: https://www.tensorflow.org/tutorials/customization/custom_training_walkthrough.
Why is the loss_value not averaged over batch_size at the line loss_value = loss_fn(y_batch_train, logits)? Is this a bug? According to another question here, Loss function works with reduce_mean but not reduce_sum, reduce_mean is indeed needed to average the loss over batch_size.
The loss_fn is defined in the tutorial as below. It does not appear to average over batch_size.
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
From the documentation, keras.losses.SparseCategoricalCrossentropy seems to sum the loss over the batch without averaging, i.e., essentially reduce_sum instead of reduce_mean:
Type of tf.keras.losses.Reduction to apply to loss. Default value is AUTO. AUTO indicates that the reduction option will be determined by the usage context. For almost all cases this defaults to SUM_OVER_BATCH_SIZE.
The code is shown below.
epochs = 2
for epoch in range(epochs):
    print("\nStart of epoch %d" % (epoch,))

    # Iterate over the batches of the dataset.
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):

        # Open a GradientTape to record the operations run
        # during the forward pass, which enables auto-differentiation.
        with tf.GradientTape() as tape:

            # Run the forward pass of the layer.
            # The operations that the layer applies
            # to its inputs are going to be recorded
            # on the GradientTape.
            logits = model(x_batch_train, training=True)  # Logits for this minibatch

            # Compute the loss value for this minibatch.
            loss_value = loss_fn(y_batch_train, logits)

        # Use the gradient tape to automatically retrieve
        # the gradients of the trainable variables with respect to the loss.
        grads = tape.gradient(loss_value, model.trainable_weights)

        # Run one step of gradient descent by updating
        # the value of the variables to minimize the loss.
        optimizer.apply_gradients(zip(grads, model.trainable_weights))

        # Log every 200 batches.
        if step % 200 == 0:
            print(
                "Training loss (for one batch) at step %d: %.4f"
                % (step, float(loss_value))
            )
            print("Seen so far: %s samples" % ((step + 1) * 64))
I've figured it out: loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True) does average the loss over batch_size by default, since the AUTO reduction resolves to SUM_OVER_BATCH_SIZE (i.e., the mean).
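A quick self-contained check (assuming TF 2.x) confirms that the default reduction matches the mean of the per-example losses rather than their sum:

import tensorflow as tf

y_true = tf.constant([0, 1, 2])
logits = tf.random.normal((3, 4))

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
per_example = tf.keras.losses.sparse_categorical_crossentropy(
    y_true, logits, from_logits=True)

# The batched loss equals the mean of the per-example losses.
print(loss_fn(y_true, logits).numpy())
print(tf.reduce_mean(per_example).numpy())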

Code worked fine one week ago, but keeps raising an error since yesterday: fine-tuning a BERT model via PyTorch on Colab

I am new to BERT. Two weeks ago I successfully fine-tuned a BERT model on an NLP classification task, though the outcome was not brilliant. Yesterday, however, when I tried to run the same code and data, an AttributeError kept appearing, saying: 'str' object has no attribute 'dim'. Everything runs on Colab via PyTorch and the Transformers library.
What should I do to fix it?
Here is one thing I tried when installing transformers, but it turned out not to work:
instead of
!pip install transformers,
I tried a previous transformers version:
!pip install --target lib --upgrade transformers==3.5.0
Any feedback will be greatly appreciated!
Please see the code and the error message below:
Code:
train definition
import numpy as np
import torch

# function to train the model
def train():
    model.train()
    total_loss, total_accuracy = 0, 0

    # empty list to save model predictions
    total_preds = []

    # iterate over batches
    for step, batch in enumerate(train_dataloader):

        # progress update after every 200 batches.
        if step % 200 == 0 and not step == 0:
            print('  Batch {:>5,} of {:>5,}.'.format(step, len(train_dataloader)))

        # push the batch to gpu
        batch = [r.to(device) for r in batch]
        sent_id, mask, labels = batch

        # clear previously calculated gradients
        model.zero_grad()

        # get model predictions for the current batch
        preds = model(sent_id, mask)

        # compute the loss between actual and predicted values
        loss = cross_entropy(preds, labels)

        # add on to the total loss
        total_loss = total_loss + loss.item()

        # backward pass to calculate the gradients
        loss.backward()

        # clip the gradients to 1.0; helps prevent the exploding-gradient problem
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # update parameters
        optimizer.step()

        # update learning rate schedule
        # scheduler.step()

        # model predictions are stored on GPU, so push them to CPU
        preds = preds.detach().cpu().numpy()

        # append the model predictions
        total_preds.append(preds)

    # compute the training loss of the epoch
    avg_loss = total_loss / len(train_dataloader)

    # predictions are in the form of (no. of batches, size of batch, no. of classes);
    # reshape the predictions into the form (number of samples, no. of classes)
    total_preds = np.concatenate(total_preds, axis=0)

    # return the loss and predictions
    return avg_loss, total_preds
training process
# set initial loss to infinite
best_valid_loss = float('inf')

# empty lists to store training and validation loss of each epoch
train_losses = []
valid_losses = []

# for each epoch
for epoch in range(epochs):

    print('\n Epoch {:} / {:}'.format(epoch + 1, epochs))

    # train model
    train_loss, _ = train()

    # evaluate model
    valid_loss, _ = evaluate()

    # save the best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt')

    # append training and validation loss
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)

    print(f'\nTraining Loss: {train_loss:.3f}')
    print(f'Validation Loss: {valid_loss:.3f}')
Error message:
Epoch 1 / 10
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-41-c5138ddf6b25> in <module>()
12
13 #train model
---> 14 train_loss, _ = train()
15
16 #evaluate model
5 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in linear(input, weight, bias)
1686 if any([type(t) is not Tensor for t in tens_ops]) and has_torch_function(tens_ops):
1687 return handle_torch_function(linear, tens_ops, input, weight, bias=bias)
-> 1688 if input.dim() == 2 and bias is not None:
1689 # fused op is marginally faster
1690 ret = torch.addmm(bias, input, weight.t())
AttributeError: 'str' object has no attribute 'dim'
As far as I remember, there used to be an older transformers version preinstalled in Colab, something like 2.11.0. Try:
!pip install transformers~=2.11.0
Change the version number until it works.
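As a rough sketch of that trial-and-error loop in a Colab cell (the exact version that works for you may differ):

# Pin an older release; restart the runtime afterwards so the new version is loaded.
!pip install transformers==2.11.0

import transformers
print(transformers.__version__)  # verify which version is actually in use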

TensorFlow LSTM for sentiment analysis not learning (UPDATED)

UPDATED:
I'm building a neural network for my final project and I need some help with it.
I'm trying to build an RNN to do sentiment analysis over Spanish text. I have about 200,000 labeled tweets, and I vectorized them using word2vec with a Spanish embedding.
Dataset & Vectorization:
I erased duplicates and split the dataset into training and testing sets.
Padding, unknown, and end-of-sentence tokens are applied when vectorizing.
I mapped the #mentions to known names in the word2vec model. Example: #iamthebest => "John"
My model:
My data tensor has shape = (batch_size, 20, 300).
I have 3 classes: neutral, positive and negative, so my target tensor has shape = (batch_size, 3)
I use BasicLSTMCell cells and dynamic_rnn to build the net.
I use the Adam optimizer and softmax cross-entropy for the loss calculation.
I use a dropout wrapper to reduce overfitting.
Last run:
I have tried different configurations and none of them seems to work.
Last setup: 2 layers, batch size 512, 15 epochs, and a learning rate of 0.001.
Weak points for me:
I'm worried about the final layer and the handling of the final state in dynamic_rnn.
Code:
import datetime
import math

import numpy as np
import tensorflow as tf

# set variables
num_epochs = 15
tweet_size = 20
hidden_size = 200
vec_size = 300
batch_size = 512
number_of_layers = 1
number_of_classes = 3
learning_rate = 0.001

TRAIN_DIR = "/checkpoints"

tf.reset_default_graph()

# Create a session
session = tf.Session()

# Inputs placeholders
tweets = tf.placeholder(tf.float32, [None, tweet_size, vec_size], "tweets")
labels = tf.placeholder(tf.float32, [None, number_of_classes], "labels")

# Placeholder for dropout
keep_prob = tf.placeholder(tf.float32)

# make the lstm cells, and wrap them in MultiRNNCell for multiple layers
def lstm_cell():
    cell = tf.contrib.rnn.BasicLSTMCell(hidden_size)
    return tf.contrib.rnn.DropoutWrapper(cell=cell, output_keep_prob=keep_prob)

multi_lstm_cells = tf.contrib.rnn.MultiRNNCell([lstm_cell() for _ in range(number_of_layers)],
                                               state_is_tuple=True)

# Creates a recurrent neural network
outputs, final_state = tf.nn.dynamic_rnn(multi_lstm_cells, tweets, dtype=tf.float32)

with tf.name_scope("final_layer"):
    # weight and bias to shape the final layer
    W = tf.get_variable("weight_matrix", [hidden_size, number_of_classes], tf.float32,
                        tf.random_normal_initializer(stddev=1.0 / math.sqrt(hidden_size)))
    b = tf.get_variable("bias", [number_of_classes], initializer=tf.constant_initializer(1.0))

    sentiments = tf.matmul(final_state[-1][-1], W) + b

prob = tf.nn.softmax(sentiments)
tf.summary.histogram('softmax', prob)

with tf.name_scope("loss"):
    # define cross entropy loss function
    losses = tf.nn.softmax_cross_entropy_with_logits(logits=sentiments, labels=labels)
    loss = tf.reduce_mean(losses)
    tf.summary.scalar("loss", loss)

with tf.name_scope("accuracy"):
    # round our actual probabilities to compute error
    accuracy = tf.to_float(tf.equal(tf.argmax(prob, 1), tf.argmax(labels, 1)))
    accuracy = tf.reduce_mean(tf.cast(accuracy, dtype=tf.float32))
    tf.summary.scalar("accuracy", accuracy)

# define our optimizer to minimize the loss
with tf.name_scope("train"):
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)

# tensorboard summaries
merged_summary = tf.summary.merge_all()
logdir = "tensorboard/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + "/"
writer = tf.summary.FileWriter(logdir, session.graph)

# initialize any variables
tf.global_variables_initializer().run(session=session)

# Create a saver for writing training checkpoints.
saver = tf.train.Saver()

# load our data and separate it into tweets and labels
train_tweets = np.load('data_es/train_vec_tweets.npy')
train_labels = np.load('data_es/train_vec_labels.npy')

test_tweets = np.load('data_es/test_vec_tweets.npy')
test_labels = np.load('data_es/test_vec_labels.npy')
**HERE I HAVE THE LOOP FOR TRAINING AND TESTING, I KNOW IT'S FINE**
I have already solved my problem. After reading some papers and more trial and error, I figured out what my mistakes were.
1) Dataset: I had a large dataset, but I didn't format it properly.
I checked the distribution of tweet labels (neutral, positive, and negative), realized there was a disparity in their distribution, and normalized it.
I cleaned it up further by erasing URLs, hashtags, and unnecessary punctuation.
I shuffled the data prior to vectorization.
2) Initialization:
I initialized the MultiRNNCell with zeros and changed my custom final layer to tf.contrib.layers.fully_connected. I also added proper initialization of the bias and weight matrix. (After fixing this, I started to see better loss and accuracy plots in TensorBoard.)
3) Dropout:
I read this paper, Recurrent Dropout without Memory Loss, and I changed my dropouts accordingly; I started seeing improvements in the loss and accuracy.
4) Decaying the learning rate:
I added an exponential learning-rate decay after 10,000 steps to control over-fitting (a sketch follows below).
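For item 4, here is a minimal sketch of what such a schedule can look like in TF 1.x; the 10,000-step interval and the 0.001 starting rate come from the description above, while the decay rate of 0.96 and the variable names are assumptions for illustration:

global_step = tf.Variable(0, trainable=False, name="global_step")

# Decay the starting rate of 0.001 every 10,000 steps (staircase schedule).
decayed_lr = tf.train.exponential_decay(
    learning_rate=0.001,
    global_step=global_step,
    decay_steps=10000,
    decay_rate=0.96,   # assumed value, not given in the answer
    staircase=True)

# Pass global_step so it is incremented on every training step.
optimizer = tf.train.AdamOptimizer(decayed_lr).minimize(loss, global_step=global_step)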
Final results:
After applying all of these changes, I achieved a test accuracy of 84%, which is acceptable because my data set still sucks.
My final network config was:
num_epochs = 20
tweet_size = 20
hidden_size = 400
vec_size = 300
batch_size = 512
number_of_layers= 2
number_of_classes= 3
start_learning_rate = 0.001

Loss function works with reduce_mean but not reduce_sum

I'm new to TensorFlow and have been looking at the examples here. I wanted to rewrite the multilayer perceptron classification model as a regression model. However, I encountered some strange behaviour when modifying the loss function. It works fine with tf.reduce_mean, but if I try using tf.reduce_sum it gives NaNs in the output. This seems very strange, as the functions are very similar; the only difference is that the mean divides the sum by the number of elements. So how could NaNs be introduced by this change?
import numpy as np
import tensorflow as tf

# Parameters
learning_rate = 0.001

# Network Parameters
n_hidden_1 = 32  # 1st layer number of features
n_hidden_2 = 32  # 2nd layer number of features
n_input = 2      # number of inputs
n_output = 1     # number of outputs

# Make artificial data
SAMPLES = 1000
X = np.random.rand(SAMPLES, n_input)
T = np.c_[X[:, 0]**2 + np.sin(X[:, 1])]

# tf Graph input
x = tf.placeholder("float", [None, n_input])
y = tf.placeholder("float", [None, n_output])

# Create model
def multilayer_perceptron(x, weights, biases):
    # Hidden layer with tanh activation
    layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
    layer_1 = tf.nn.tanh(layer_1)
    # Hidden layer with tanh activation
    layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
    layer_2 = tf.nn.tanh(layer_2)
    # Output layer with linear activation
    out_layer = tf.matmul(layer_2, weights['out']) + biases['out']
    return out_layer

# Store layers weight & bias
weights = {
    'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),
    'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_hidden_2, n_output]))
}
biases = {
    'b1': tf.Variable(tf.random_normal([n_hidden_1])),
    'b2': tf.Variable(tf.random_normal([n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_output]))
}

pred = multilayer_perceptron(x, weights, biases)

# Define loss and optimizer
#se = tf.reduce_sum(tf.square(pred - y))  # Why does this give nans?
mse = tf.reduce_mean(tf.square(pred - y))  # When this doesn't?
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(mse)

# Initializing the variables
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)

training_epochs = 10
display_step = 1

# Training cycle
for epoch in range(training_epochs):
    avg_cost = 0.
    # Loop over all batches
    for i in range(100):
        # Run optimization op (backprop) and cost op (to get loss value)
        _, msev = sess.run([optimizer, mse], feed_dict={x: X, y: T})
    # Display logs per epoch step
    if epoch % display_step == 0:
        print("Epoch:", '%04d' % (epoch + 1), "mse=", "{:.9f}".format(msev))
The problematic line defining se is commented out; to reproduce the problem, use se in place of mse in the optimizer.
With mse the output looks like this:
Epoch: 0001 mse= 0.051669389
Epoch: 0002 mse= 0.031438075
Epoch: 0003 mse= 0.026629323
...
and with se it ends up like this:
Epoch: 0001 se= nan
Epoch: 0002 se= nan
Epoch: 0003 se= nan
...
Summing the loss across the batch makes it 1000 times larger (from skimming the code, your training batch size appears to be 1000), so your gradients and parameter updates are also 1000 times larger. The larger updates apparently lead to NaNs.
Generally, learning rates are expressed per example, so the loss used to compute the gradients should be per example as well. If the loss is summed over the batch, the learning rate needs to be divided by the batch size to get comparable training results.
If you use reduce_sum instead of reduce_mean, the gradient is much larger. Therefore, you should scale down the learning rate correspondingly to make sure the training process can proceed properly (a sketch follows below).
In most of the literature, the loss is expressed as the mean of the per-example losses over the batch. If the loss is calculated with reduce_mean(), the learning rate should be regarded as per batch, and it can accordingly be larger.
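Concretely, for the script in the question (which feeds all SAMPLES = 1000 points per step), a rough sketch of that adjustment would be:

se = tf.reduce_sum(tf.square(pred - y))
# Shrink the step size by the number of examples per step so the summed
# loss produces updates of the same magnitude as the mean loss.
optimizer = tf.train.GradientDescentOptimizer(
    learning_rate=learning_rate / SAMPLES).minimize(se)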
It seems that in tensorflow.keras.losses you can still choose between mean and sum. For example, tf.keras.losses.Huber defaults to the mean, but you are allowed to set it to a sum, as shown below.
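For example, with the Keras losses API (assuming TF 2.x), the reduction can be set explicitly:

import tensorflow as tf

y_true = tf.constant([[1.0], [2.0], [3.0]])
y_pred = tf.constant([[1.5], [2.5], [2.0]])

# Default reduction averages over the batch (SUM_OVER_BATCH_SIZE).
huber_mean = tf.keras.losses.Huber()
# Explicitly request a summed loss instead.
huber_sum = tf.keras.losses.Huber(reduction=tf.keras.losses.Reduction.SUM)

print(huber_mean(y_true, y_pred).numpy())  # mean over the 3 examples
print(huber_sum(y_true, y_pred).numpy())   # 3x the mean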