I am new to BERT. Two weeks ago I successfully fine-tuned a BERT model on an NLP classification task, though the results were not brilliant. Yesterday, however, when I ran the same code on the same data, I kept getting an AttributeError: 'str' object has no attribute 'dim'. Everything runs on Colab and uses PyTorch Transformers.
What should I do to fix it?
Here is one thing I tried when installing transformers, but it did not work.
Instead of
!pip install transformers
I tried to install an earlier transformers version:
!pip install --target lib --upgrade transformers==3.5.0
Any feedback will be greatly appreciated!
Please see the code and the error message as below:
train definition
# function to train the model
def train():
    total_loss, total_accuracy = 0, 0
    # empty list to save model predictions
    total_preds = []
    # iterate over batches
    for step, batch in enumerate(train_dataloader):
        # progress update after every 200 batches
        if step % 200 == 0 and not step == 0:
            print(' Batch {:>5,} of {:>5,}.'.format(step, len(train_dataloader)))
        # push the batch to gpu (`device` is assumed to be defined earlier)
        batch = [r.to(device) for r in batch]
        sent_id, mask, labels = batch
        # clear previously calculated gradients
        model.zero_grad()
        # get model predictions for the current batch
        preds = model(sent_id, mask)
        # compute the loss between actual and predicted values
        loss = cross_entropy(preds, labels)
        # add on to the total loss
        total_loss = total_loss + loss.item()
        # backward pass to calculate the gradients
        loss.backward()
        # clip the gradients to 1.0. It helps in preventing the exploding gradient problem
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # update parameters (`optimizer` is assumed to be defined earlier)
        optimizer.step()
        # update learning rate schedule
        # scheduler.step()
        # model predictions are stored on GPU. So, push it to CPU
        preds = preds.detach().cpu().numpy()
        # append the model predictions
        total_preds.append(preds)
    # compute the training loss of the epoch
    avg_loss = total_loss / len(train_dataloader)
    # predictions are in the form of (no. of batches, size of batch, no. of classes).
    # reshape the predictions in form of (number of samples, no. of classes)
    total_preds = np.concatenate(total_preds, axis=0)
    # returns the loss and predictions
    return avg_loss, total_preds
training process
# set initial loss to infinite
best_valid_loss = float('inf')
# empty lists to store training and validation loss of each epoch
# for each epoch
for epoch in range(epochs):
    print('\n Epoch {:} / {:}'.format(epoch + 1, epochs))
    # train model
    train_loss, _ = train()
    # evaluate model
    valid_loss, _ = evaluate()
    # save the best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), '')
    # append training and validation loss
    print(f'\nTraining Loss: {train_loss:.3f}')
    print(f'Validation Loss: {valid_loss:.3f}')
Error message:
Epoch 1 / 10
AttributeError Traceback (most recent call last)
<ipython-input-41-c5138ddf6b25> in <module>()
13 #train model
---> 14 train_loss, _ = train()
16 #evaluate model
5 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/ in linear(input, weight, bias)
1686 if any([type(t) is not Tensor for t in tens_ops]) and has_torch_function(tens_ops):
1687 return handle_torch_function(linear, tens_ops, input, weight, bias=bias)
-> 1688 if input.dim() == 2 and bias is not None:
1689 # fused op is marginally faster
1690 ret = torch.addmm(bias, input, weight.t())
AttributeError: 'str' object has no attribute 'dim'

As far as I remember, Colab used to ship an older transformers version, something like 2.11.0. Try:
!pip install transformers~=2.11.0
Change the version number until it works.
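One possible cause worth checking, offered as an assumption because the model class is not shown in the question: newer transformers releases (4.x, as far as I know) return a dict-like output object by default instead of a plain tuple. Tuple-unpacking a dict-like object yields its keys, which are strings. If the model's forward pass does something like `_, cls_hs = self.bert(...)`, then `cls_hs` would be a string, and the next linear layer would fail with this kind of error. The sketch below reproduces only the mechanism, with a plain OrderedDict standing in for the model output; `fake_bert_output` and its contents are hypothetical.

```python
from collections import OrderedDict

# Hypothetical stand-in for a dict-like model output; the real object
# would hold tensors, plain strings are used here as placeholders.
fake_bert_output = OrderedDict(
    last_hidden_state="<tensor: hidden states>",
    pooler_output="<tensor: pooled CLS vector>",
)

# Tuple-style unpacking iterates over the KEYS of a dict-like object,
# so both names end up bound to strings rather than to the values.
_, cls_hs = fake_bert_output
print(type(cls_hs).__name__, repr(cls_hs))

# Reading the value by key gives the intended object instead.
pooled = fake_bert_output["pooler_output"]
print(pooled)
```

If this is what happens in your model, passing return_dict=False when calling the BERT module, or reading the pooled output by name, may restore the old behaviour. Treat both as suggestions to verify against the transformers version that is actually installed.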


Why is tf.GradientTape.jacobian giving None?

I'm using the Iris dataset and following this official tutorial: Custom training: walkthrough
In the training loop, I am trying to gather the model outputs and weights every 50 epochs (epoch % 50 == 0) in the lists m_outputs_mod50 and gather_weights, respectively:
# Keep results for plotting
train_loss_results = []
train_accuracy_results = []
m_outputs_mod50 = []
gather_weights = []
num_epochs = 201
for epoch in range(num_epochs):
    epoch_loss_avg = tf.keras.metrics.Mean()
    epoch_accuracy = tf.keras.metrics.SparseCategoricalAccuracy()
    # gather_kernel(model)
    # Training loop - using batches of 32
    for x, y in train_dataset:
        # Optimize the model
        loss_value, grads = grad(model, x, y)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        # Track progress
        epoch_loss_avg.update_state(loss_value)  # Add current batch loss
        # Compare predicted label to actual label
        # training=True is needed only if there are layers with different
        # behavior during training versus inference (e.g. Dropout).
        epoch_accuracy.update_state(y, model(x, training=True))
    # End epoch
    # pred_hist.append(model.predict(x))
    if epoch % 50 == 0:
        print("Epoch {:03d}: Loss: {:.3f}, Accuracy: {:.3%}".format(epoch,
                                                                    epoch_loss_avg.result(),
                                                                    epoch_accuracy.result()))
Running the above and then trying to get the Jacobian, even at epoch 0 (using m_outputs_mod50[0] and gather_weights[0]), with
with tf.GradientTape() as tape:
    print(tape.jacobian(target=m_outputs_mod50[0], sources=gather_weights[0]))
I get a list of None as the output.
You need to understand how the GradientTape operates. For that, you can follow the guide: Introduction to gradients and automatic differentiation. Here is an excerpt:
TensorFlow provides the tf.GradientTape API for automatic
differentiation; that is, computing the gradient of a computation with
respect to some inputs, usually tf.Variables. TensorFlow "records"
relevant operations executed inside the context of a tf.GradientTape
onto a "tape". TensorFlow then uses that tape to compute the gradients
of a "recorded" computation using reverse mode differentiation.
To compute a gradient (or a Jacobian), the tape needs to record the operations that are executed in its context. Then, outside its context, once the forward pass has been executed, it's possible to use the tape to compute the gradient/Jacobian.
You could use something like this:
if epoch % 50 == 0:
    with tf.GradientTape() as tape:
        out = model(x)
    jacobian = tape.jacobian(out, model.weights)
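To make the "record inside the context" point concrete without depending on TensorFlow, here is a deliberately tiny toy tape in plain Python. It is only an illustration of the idea, not how tf.GradientTape is implemented; `ToyTape` and `Value` are hypothetical names. An operation is written to the tape only if it runs while the tape is active, so a result computed beforehand has no recorded path back to its inputs and the gradient comes out as None, which mirrors the behaviour in the question.

```python
class ToyTape:
    """Toy stand-in for a gradient tape; NOT TensorFlow's implementation."""

    active = None  # the tape currently recording, if any

    def __enter__(self):
        self.records = []  # (output, input, local derivative) per recorded op
        ToyTape.active = self
        return self

    def __exit__(self, *exc):
        ToyTape.active = None  # stop recording when the context closes
        return False

    def gradient(self, target, source):
        # Walk the recorded ops backwards and apply the chain rule.
        grads = {id(target): 1.0}
        for out, inp, local in reversed(self.records):
            if id(out) in grads:
                grads[id(inp)] = grads.get(id(inp), 0.0) + grads[id(out)] * local
        # No recorded path from source to target gives None.
        return grads.get(id(source))


class Value:
    """Scalar whose operations are recorded only while a tape is active."""

    def __init__(self, v):
        self.v = v

    def scale(self, k):
        out = Value(self.v * k)
        if ToyTape.active is not None:
            ToyTape.active.records.append((out, self, k))
        return out


w = Value(3.0)

# Forward pass inside the context: the op is recorded.
with ToyTape() as tape:
    y_inside = w.scale(2.0)
grad_inside = tape.gradient(y_inside, w)

# Forward pass done earlier, outside any tape: nothing is recorded.
y_outside = w.scale(2.0)
with ToyTape() as empty_tape:
    pass
grad_outside = empty_tape.gradient(y_outside, w)

print(grad_inside, grad_outside)
```

The second case corresponds to passing stored outputs such as m_outputs_mod50[0] to a freshly opened tape: the forward pass that produced them was never seen by that tape.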

How to view train_on_batch tensorboard log files generated by Google Colab?

I know how to view tensorboard plots on my local machine whilst my neural networks train using code in a local Jupyter Notebook, using the following code. What do I need to do differently when I use Google Colab to train the neural network instead? I can't see any tutorials/examples online when using train_on_batch.
After defining my model (convnet)...
# create tensorboard graph data for the model
tb = tf.keras.callbacks.TensorBoard(log_dir='Logs/Exp_15',
num_epochs = 3
batches_processed_counter = 0
for epoch in range(num_epochs):
    for batch in range(int(train_img.samples/batch_size)):
        batches_processed_counter = batches_processed_counter + 1
        # get next batch of images & labels
        X_imgs, X_labels = next(train_img)
        # train model, get cross entropy & accuracy for batch
        train_CE, train_acc = convnet.train_on_batch(X_imgs, X_labels)
        # validation images - just predict
        X_imgs_val, X_labels_val = next(val_img)
        val_CE, val_acc = convnet.test_on_batch(X_imgs_val, X_labels_val)
        # create tensorboard graph info for the cross entropy loss and training accuracies
        # for every batch in every epoch (so if 5 epochs and 10 batches there should be 50 accuracies)
        tb.on_epoch_end(batches_processed_counter, {'train_loss': train_CE, 'train_acc': train_acc})
        # create tensorboard graph info for the cross entropy loss and VALIDATION accuracies
        # for every batch in every epoch (so if 5 epochs and 10 batches there should be 50 accuracies)
        tb.on_epoch_end(batches_processed_counter, {'val_loss': val_CE, 'val_acc': val_acc})
        print('epoch', epoch, 'batch', batch, 'train_CE:', train_CE, 'train_acc:', train_acc)
        print('epoch', epoch, 'batch', batch, 'val_CE:', val_CE, 'val_acc:', val_acc)
I can see that the log file has been generated successfully within the Google Colab runtime. How do I view it in TensorBoard? I've seen solutions that describe downloading the log file to a local machine and viewing it in TensorBoard locally, but this doesn't display anything. Is there something missing in my code that would allow this to work in TensorBoard locally? And/or is there an alternative way to view the log data in TensorBoard within Google Colab?
In case it's important for the details of the solution, I'm on a Mac. Also, the tutorials I've seen online show how to use TensorBoard with Google Colab when using fit, but I can't see how to modify my code, which uses train_on_batch rather than fit.
Thanks to Dr Ryan Cunningham from Manchester Metropolitan University for the solution to this problem, which was the following:
%load_ext tensorboard
%tensorboard --logdir './Logs'
...which allows me to view the Tensorboard plots in the Google Colab document itself, and see the plots update while the NN is training.
So, the full set of code, to view the Tensorboard plots while the network is training is (after defining the neural network, which I've called convnet):
# compile the neural net after defining the loss, optimisation and
# performance metric
convnet.compile(loss='categorical_crossentropy', # cross entropy is suited to
# multi-class classification
# create tensorboard graph data for the model
tb = tf.keras.callbacks.TensorBoard(log_dir='Logs/Exp_15',
%load_ext tensorboard
%tensorboard --logdir './Logs'
# iterate through the training set for x epochs,
# each time iterating through the batches,
# for each batch, train, calculate loss & optimise weights.
# (mini-batch approach)
num_epochs = 1
batches_processed_counter = 0
for epoch in range(num_epochs):
    for batch in range(int(train_img.samples/batch_size)):
        batches_processed_counter = batches_processed_counter + 1
        # get next batch of images & labels
        X_imgs, X_labels = next(train_img)
        # train model, get cross entropy & accuracy for batch
        train_CE, train_acc = convnet.train_on_batch(X_imgs, X_labels)
        # validation images - just predict
        X_imgs_val, X_labels_val = next(val_img)
        val_CE, val_acc = convnet.test_on_batch(X_imgs_val, X_labels_val)
        # create tensorboard graph info for the cross entropy loss and training accuracies
        # for every batch in every epoch (so if 5 epochs and 10 batches there should be 50 accuracies)
        tb.on_epoch_end(batches_processed_counter, {'train_loss': train_CE, 'train_acc': train_acc})
        # create tensorboard graph info for the cross entropy loss and VALIDATION accuracies
        # for every batch in every epoch (so if 5 epochs and 10 batches there should be 50 accuracies)
        tb.on_epoch_end(batches_processed_counter, {'val_loss': val_CE, 'val_acc': val_acc})
        print('epoch', epoch, 'batch', batch, 'train_CE:', train_CE, 'train_acc:', train_acc)
        print('epoch', epoch, 'batch', batch, 'val_CE:', val_CE, 'val_acc:', val_acc)
Note: it can take a few seconds after the cell has finished running before the cell output refreshes and shows the Tensorboard plots.
get_ipython().system_raw('tensorboard --logdir /content/trainingdata/objectdetection/ckpt_output/trainingImatges/ --host --port 6006 &')
get_ipython().system_raw('./ngrok http 6006 &')
! curl -s http://localhost:4040/api/tunnels | python3 -c \
"import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"
This gives you a TensorBoard from the log files created. It creates a tunnel for TensorBoard on Colab and makes it accessible through a public URL provided by ngrok. When you run the final command, the public URL is printed. It works with TF 1.13, and I guess you can use the same approach for TF2 as well.

can't reproduce with GradientTape

I've been trying to investigate (e.g. by checking weights, gradients and activations during training) why SGD with a 0.001 learning rate worked in training while Adam fails to do so. (Please see my previous post, "Why is my loss (binary cross entropy) converging on ~0.6? (Task: Natural Language Inference)".)
Note: I'm using the same model from my previous post here as well.
Using tf.keras, I trained the neural network using
The resulting epoch loss is graphed below: within the 1st epoch it reached a train loss of 0.46, and it ultimately ended with a train_loss of 0.1241 and a val_loss of 0.2849.
I would have used tf.keras.callbacks.TensorBoard(histogram_freq=1) to train the network with both SGD(0.001) and Adam to investigate, but it throws an InvalidArgumentError on Variable:0, which I can't decipher. So I tried to write a custom training loop using GradientTape and plot the values.
Using tf.GradientTape(), I tried to reproduce the results with the exact same model and dataset. However, the epoch loss decreases incredibly slowly, reaching a train loss of 0.676 after 15 epochs (see graph below). Is there something wrong with my implementation? (Code below.)
def compute_grads(train_batch: Dict[str,tf.Tensor], target_batch: tf.Tensor,
                  loss_fn: Loss, model: tf.keras.Model):
    with tf.GradientTape(persistent=False) as tape:
        # forward pass
        outputs = model(train_batch)
        # calculate loss
        loss = loss_fn(y_true=target_batch, y_pred=outputs)
    # calculate gradients for each param
    grads = tape.gradient(loss, model.trainable_variables)
    return grads, loss
bce = BinaryCrossentropy()
optimizer = SGD(learning_rate=0.001)
for epoch in tqdm(range(EPOCHS), desc='epoch'):
    # - accumulators
    epoch_loss = 0.0
    for (i, (train_batch, target_dict)) in tqdm(enumerate(ds_train.shuffle(1024).batch(BATCH_SIZE)), desc='step'):
        (grads, loss) = compute_grads(train_batch, target_dict['target'], bce, model)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        epoch_loss += loss
    avg_epoch_loss = epoch_loss/(i+1)
    tensorboard_scalar(writer, name='epoch_loss', data=avg_epoch_loss, step=epoch)  # custom helper function
    print("Epoch {}: epoch_loss = {}".format(epoch, avg_epoch_loss))
Thanks in advance!
Check whether you shuffle your dataset. If you do, the problem may come from shuffling with the tf.data.Dataset method, which only shuffles the dataset one buffer at a time. Training through tf.keras probably yielded better results because it adds another round of shuffling.
Adding a shuffle with numpy.random.shuffle may improve the training performance (from this reference).
An example of applying it when generating the dataset:
numpy_data = np.hstack([index_rows.reshape(-1, 1), index_cols.reshape(-1, 1), index_data.reshape(-1, 1)])
# shuffle all rows in place before building the dataset
np.random.shuffle(numpy_data)
indexes = np.array(numpy_data[:, :2], dtype=np.uint32)
labels = np.array(numpy_data[:, 2].reshape(-1, 1), dtype=np.float32)
train_ds = data.Dataset.from_tensor_slices(
    (indexes, labels)
).shuffle(100000).batch(batch_size, drop_remainder=True)
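To see why a shuffle buffer smaller than the dataset gives only a local shuffle, the following plain-Python sketch imitates buffer-based shuffling: keep a fixed-size buffer, emit a random element from it, and refill the freed slot with the next element of the stream. This is a simplified model written for illustration, and `buffered_shuffle` is a hypothetical helper, not a TensorFlow function.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Shuffle by sampling from a fixed-size buffer that is refilled in order."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    source = iter(items)
    done = object()  # sentinel marking the end of the stream

    # Fill the buffer with the first `buffer_size` elements.
    buffer = []
    for item in source:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break

    output = []
    while buffer:
        # Emit a random buffered element, then refill its slot from the stream.
        idx = rng.randrange(len(buffer))
        output.append(buffer[idx])
        nxt = next(source, done)
        if nxt is done:
            buffer.pop(idx)
        else:
            buffer[idx] = nxt
    return output


data = list(range(20))
small = buffered_shuffle(data, buffer_size=4)
full = buffered_shuffle(data, buffer_size=len(data))

# With a small buffer, an element can only move a few positions forward.
print("small buffer:", small)
print("full buffer: ", full)
```

With a buffer of 4, the element emitted at position i can never come from an input index larger than i + 3, so data that is ordered on disk (for example by label) stays largely ordered. A buffer at least as large as the dataset removes that constraint.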
If this does not work, you may need to use Dataset .repeat(epochs_number) and .shuffle(..., reshuffle_each_iteration=True):
train_ds = data.Dataset.from_tensor_slices(
    (np.hstack([index_rows.reshape(-1, 1), index_cols.reshape(-1, 1)]), index_data)
).shuffle(100000, reshuffle_each_iteration=True
).batch(batch_size, drop_remainder=True
).repeat(epochs_number)
for ix, (examples, labels) in train_ds.enumerate():
    train_step(examples, labels)
    current_epoch = ix // (len(index_data) // batch_size)
This workaround is neither beautiful nor natural, but for the moment you can use it to reshuffle each epoch. It's a known issue and will be fixed; in the future you can use for epoch in range(epochs_number) instead of .repeat().
The solution provided here may also help a lot. You might want to check it out.
If this is not the case, you may want to speed up the TF 2.0 GradientTape loop. This may be the solution:
TensorFlow 2.0 introduces the concept of functions, which translate eager code into graph code.
The usage is pretty straightforward. The only change needed is that all relevant functions (like compute_loss and apply_gradients) have to be annotated with @tf.function.

how to log validation loss and accuracy using tfslim

Is there any way to log the validation loss and accuracy to TensorBoard when using tf-slim? When I was using Keras, the following code did this for me:
model.fit_generator(generator=train_gen(), validation_data=valid_gen(),...)
Then the model will evaluate the validation loss and accuracy after each epoch, which is very convenient. But how to achieve this using tf-slim? The following steps are using primitive tensorflow, which is not what I want:
with tf.Session() as sess:
    for step in range(100000):
        sess.run(..., feed_dict={X: X_train, y: y_train})
        if step % batch_size * batches_per_epoch == 0:
            print(sess.run(..., feed_dict={X: X_train, y: y_train}))
Right now, the steps to train a model using tf-slim are:
log_every_n_steps = 10,
So how can I evaluate the validation loss and accuracy after each epoch with the above slim training procedure?
Thanks in advance!
The matter is still being discussed on TF Slim repo (issue #5987).
The framework allows you to easily create an evaluation script to run after, or in parallel with, your training (solution 1 below), but some people are pushing to be able to implement the "classic cycle of batch training + validation" (solution 2).
1. Use slim.evaluation in another script
TF Slim has evaluation methods, e.g. slim.evaluation.evaluation_loop(), that you can use in another script (which can be run in parallel with your training) to periodically load the latest checkpoint of your model and perform evaluation. The TF Slim page contains a good example of how such a script may look: example.
2. Provide a custom train_step_fn to slim.learning.train()
A patchy solution that the initiator of the discussion came up with makes use of a custom training step function you can provide to slim.learning.train():
Snippet from code by Kevin Malakoff @kmalakoff
# ...
accuracy_validation = slim.metrics.accuracy(
    tf.argmax(predictions_validation, 1),
    tf.argmax(labels_validation, 1))  # ... or whatever metrics needed
def train_step_fn(session, *args, **kwargs):
    total_loss, should_stop = train_step(session, *args, **kwargs)
    if train_step_fn.step % FLAGS.validation_check == 0:
        accuracy = session.run(train_step_fn.accuracy_validation)
        print('Step %s - Loss: %.2f Accuracy: %.2f%%' % (str(train_step_fn.step).rjust(6, '0'), total_loss, accuracy * 100))
    # ...
    train_step_fn.step += 1
    return [total_loss, should_stop]
train_step_fn.step = 0
train_step_fn.accuracy_validation = accuracy_validation
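The pattern in the snippet above (wrap the default training step, keep a step counter as a function attribute, and run validation every N steps) can be tried out without TensorFlow. In the sketch below everything apart from the shape of train_step_fn is a hypothetical stand-in: `train_step` fakes slim's default step, `run_validation` fakes the session.run call on the validation metric, and `VALIDATION_CHECK` plays the role of FLAGS.validation_check.

```python
def train_step(session, step):
    """Hypothetical stand-in for slim's default training step."""
    total_loss = 1.0 / (step + 1)  # fake, steadily decreasing loss
    should_stop = False
    return total_loss, should_stop


def run_validation(session):
    """Hypothetical stand-in for evaluating the validation accuracy op."""
    return 0.5


VALIDATION_CHECK = 3  # assumed interval between validation runs
validation_log = []   # (step, accuracy) pairs collected during training


def train_step_fn(session, *args, **kwargs):
    total_loss, should_stop = train_step(session, train_step_fn.step)
    # Every VALIDATION_CHECK steps, also evaluate on validation data.
    if train_step_fn.step % VALIDATION_CHECK == 0:
        accuracy = run_validation(session)
        validation_log.append((train_step_fn.step, accuracy))
    train_step_fn.step += 1
    return [total_loss, should_stop]


# Function attributes keep state between calls without a global counter.
train_step_fn.step = 0

# Minimal driver standing in for the loop that the training framework runs.
for _ in range(7):
    train_step_fn(session=None)

print(validation_log)
```

Storing the counter on the function itself keeps the signature compatible with whatever the training loop passes in, which is presumably why the original snippet uses the same trick.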

Decreased training speed on multi-GPU machine with dynamic RNN of Tensorflow

I have two machines available on which I can train models built with Tensorflow: a local desktop machine with one GPU (called "local" in the following) and a remote cluster with 4 GPUs (called "cluster" in the following). Even though the cluster has 4 GPUs, I'm only using one GPU at a time (e.g. through CUDA_VISIBLE_DEVICES=2 python). My problem is that training the exact same model on the cluster is considerably slower than on my local machine, even though the cluster has more powerful GPUs. I realize that this question might be very localized and that it is difficult to find out why this is the case; however, I am at a loss as to what causes this behavior. In the following I try to give as many details as possible about the configuration of both machines and the model I'm building.
The model is a simple toy RNN taken from this github project. The model definition is as follows:
# Parameters
learning_rate = 0.01
training_steps = 600
batch_size = 128
display_step = 200
# Network Parameters
seq_max_len = 20  # Sequence max length
n_hidden = 64  # hidden layer num of features
n_classes = 2  # linear sequence or not
# tf Graph input
x = tf.placeholder("float", [None, seq_max_len, 1])
y = tf.placeholder("float", [None, n_classes])
# A placeholder for indicating each sequence length
seqlen = tf.placeholder(tf.int32, [None])
# Define weights
weights = {
    'out': tf.Variable(tf.random_normal([n_hidden, n_classes]))
}
biases = {
    'out': tf.Variable(tf.random_normal([n_classes]))
}
def dynamicRNN(x, seqlen, weights, biases):
    # Prepare data shape to match `rnn` function requirements
    # Current data input shape: (batch_size, n_steps, n_input)
    # Required shape: 'n_steps' tensors list of shape (batch_size, n_input)
    with tf.device('gpu:0'):
        # Unstack to get a list of 'n_steps' tensors of shape (batch_size, n_input)
        x = tf.unstack(x, seq_max_len, 1)
        # Define a lstm cell with tensorflow
        lstm_cell = tf.contrib.rnn.BasicLSTMCell(n_hidden)
        # Get lstm cell output, providing 'sequence_length' will perform dynamic
        # calculation.
        outputs, states = tf.contrib.rnn.static_rnn(lstm_cell, x, dtype=tf.float32,
                                                    sequence_length=seqlen)
        # When performing dynamic calculation, we must retrieve the last
        # dynamically computed output, i.e., if a sequence length is 10, we need
        # to retrieve the 10th output.
        # However TensorFlow doesn't support advanced indexing yet, so we build
        # a custom op that for each sample in batch size, get its length and
        # get the corresponding relevant output.
        # 'outputs' is a list of output at every timestep, we pack them in a Tensor
        # and change back dimension to [batch_size, n_step, n_input]
        outputs = tf.stack(outputs)
        outputs = tf.transpose(outputs, [1, 0, 2])
        # Hack to build the indexing and retrieve the right output.
        batch_size = tf.shape(outputs)[0]
        # Start indices for each sample
        index = tf.range(0, batch_size)*seq_max_len+(seqlen-1)
        # Indexing
        outputs = tf.gather(tf.reshape(outputs, [-1, n_hidden]), index)
        # Linear activation, using outputs computed above
        return tf.matmul(outputs, weights['out'])+biases['out']
pred = dynamicRNN(x, seqlen, weights, biases)
# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)
# Evaluate model
correct_pred = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
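The indexing "hack" inside dynamicRNN is easier to follow on a tiny example. The sketch below redoes the same arithmetic with plain Python lists instead of tensors: flatten the [batch, step, hidden] outputs into one row per (sample, step), then pick row b * seq_max_len + (seqlen[b] - 1) for each sample b. The numbers are made up purely for illustration.

```python
seq_max_len = 3   # padded length of every sequence
n_hidden = 2      # size of each output vector
seqlen = [1, 3]   # true lengths of the two sequences in the batch

# outputs[b][t] is the output vector of sample b at timestep t.
outputs = [
    [[0.1, 0.2], [0.0, 0.0], [0.0, 0.0]],   # sample 0, only step 0 is real
    [[1.1, 1.2], [2.1, 2.2], [3.1, 3.2]],   # sample 1, all three steps are real
]

# Counterpart of tf.reshape(outputs, [-1, n_hidden]): one row per (sample, step).
flat = [vec for sample in outputs for vec in sample]

# Counterpart of tf.range(0, batch_size) * seq_max_len + (seqlen - 1).
index = [b * seq_max_len + (seqlen[b] - 1) for b in range(len(outputs))]

# Counterpart of tf.gather(flat, index): last valid output of each sample.
last_outputs = [flat[i] for i in index]
print(index, last_outputs)
```

Each sample occupies seq_max_len consecutive rows in the flattened list, so b * seq_max_len is the row where sample b starts and seqlen[b] - 1 is the offset of its last real timestep.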
The complete (runnable) python script can be found here:
Configuration on local
Tensorflow: v1.3 (pre-compiled version installed)
CUDA: v8.0.61
cuDNN: v6.0.21
NVIDIA Driver: 375.82
OS: Ubuntu 16.04, 64-bit
Configuration on cluster
Exactly the same as for local, except:
GPU: GeForce GTX TITAN X Pascal
NVIDIA Driver: 375.66
Performance Measures
Executing the above provided toy script I get the following outputs on local:
Step 128, Minibatch Loss= 0.725320, Training Accuracy= 0.43750, Time: 0.3180224895477295
Step 25600, Minibatch Loss= 0.683126, Training Accuracy= 0.50962, Time: 0.013816356658935547
Step 51200, Minibatch Loss= 0.680907, Training Accuracy= 0.50000, Time: 0.013682842254638672
Step 76800, Minibatch Loss= 0.677346, Training Accuracy= 0.57692, Time: 0.014072895050048828
And the following on the cluster:
Step 128, Minibatch Loss= 1.536499, Training Accuracy= 0.47656, Time: 0.8308820724487305
Step 25600, Minibatch Loss= 0.693901, Training Accuracy= 0.49038, Time: 0.06193065643310547
Step 51200, Minibatch Loss= 0.689709, Training Accuracy= 0.53846, Time: 0.05762457847595215
Step 76800, Minibatch Loss= 0.685955, Training Accuracy= 0.54808, Time: 0.06454324722290039
As you can see, execution times on the cluster are about 4x higher. I tried to profile what is happening on the GPU through the use of the timeline feature. I find it difficult to interpret the output of this feature, but what I found most striking is that there are huge idle gaps on the cluster. For this, see the following images that show a trace of the timeline feature for one call to (note that the scale of the time axis is not exactly the same in both images, but the difference should still be visible).
Timeline on cluster:
Timeline on local:
Did any of you observe the same behavior? What are possible reasons for this behavior, and/or how can I further narrow down the issue?