Updating a BERT model through Huggingface transformers - tensorflow

I am attempting to update the pre-trained BERT model using an in house corpus. I have looked at the Huggingface transformer docs and I am a little stuck as you will see below.My goal is to compute simple similarities between sentences using the cosine distance but I need to update the pre-trained model for my specific use case.
If you look at the code below, which is precisely from the Huggingface docs. I am attempting to "retrain" or update the model and I assumed that special_token_1 and special_token_2 represent "new sentences" from my "in house" data or corpus. Is this correct? In summary, I like the already pre-trained BERT model but I would like to update it or retrain it using another in house dataset. Any leads will be appreciated.
import tensorflow as tf
import tensorflow_datasets
from transformers import *
model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
SPECIAL_TOKEN_1="dogs are very cute"
SPECIAL_TOKEN_2="dogs are cute but i like cats better and my
brother thinks they are more cute"
tokenizer.add_tokens([SPECIAL_TOKEN_1, SPECIAL_TOKEN_2])
model.resize_token_embeddings(len(tokenizer))
#Train our model
model.train()
model.eval()

BERT is pre-trained on 2 tasks: masked language modeling (MLM) and next sentence prediction (NSP). The most important of those two is MLM (it turns out that the next sentence prediction task is not really that helpful for the model's language understanding capabilities - RoBERTa for example is only pre-trained on MLM).
If you want to further train the model on your own dataset, you can do so by using BERTForMaskedLM in the Transformers repository. This is BERT with a language modeling head on top, which allows you to perform masked language modeling (i.e. predicting masked tokens) on your own dataset. Here's how to use it:
from transformers import BertTokenizer, BertForMaskedLM
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict=True)
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]
outputs = model(**inputs, labels=labels)
loss = outputs.loss
logits = outputs.logits
You can update the weights of BertForMaskedLM using loss.backward(), which is the main way of training PyTorch models. If you don't want to do this yourself, the Transformers library also provides a Python script which allows you perform MLM really quickly on your own dataset. See here (section "RoBERTa/BERT/DistilBERT and masked language modeling"). You just need to provide a training and test file.
You don't need to add any special tokens. Examples of special tokens are [CLS] and [SEP], which are used for sequence classification and question answering tasks (among others). These are added by the tokenizer automatically. How do I know this? Because BertTokenizer inherits from PretrainedTokenizer, and if you take a look at the documentation of its __call__ method here, you can see that the add_special_tokens parameter defaults to True.

Related

How to obtain the ResNet component of the Tensorflow implementation of SimCLR v2?

I am currently trying to create embeddings of images by passing them through pre-trained Neural Networks and getting the values obtained at the last layer just before the fully-connected ones. I did not have much problem doing it with Pytorch implementations of other Neural Networks. However, I am stuck with the Tensorflow implementation of SimCLR v2 and do not know how to proceed.
The official repo of SimCLR v2 is this one: https://github.com/google-research/simclr
And the paper is here: https://arxiv.org/abs/2006.10029v2
If I understood correctly the paper and the code, this architecture is composed of a backbone ResNet as well as a projection head. In my case, I am not interested in the projection head and just want to obtain the results of the output of the ResNet model.
Looking at the code in the colabs, I have managed to import pre-trained SimCLR models:
model_path = 'gs://simclr-checkpoints-tf2/simclrv2/pretrained/r50_1x_sk0/saved_model'
saved_model = tf.saved_model.load(model_path)
However, I do not know what to do to get the outputs of the ResNet. In all the colabs, they only get the outputs of the projection head which I am uninterested in.
for x in ds.take(1):
image = x['image']
labels = x['label']
logits = saved_model(image, trainable=False)['logits_sup']
pred = tf.argmax(logits, -1)
Moreover, the way the model is imported makes it difficult to get the variables and layers. For instance if I try obtain a summary of the model, I have this error:
'_UserObject' object has no attribute 'summary'
I also do not want to convert the weights of Tensorflow into Pytorch and import them into a pytorch ResNet.
What then would be the best way to isolate the ResNet from the overall SimCLR v2 architecture in order to get the outputs of the final layer ?

TensorFlow Graph to Keras Model?

Is it possible to define a graph in native TensorFlow and then convert this graph to a Keras model?
My intention is simply combining (for me) the best of the two worlds.
I really like the Keras model API for prototyping and new experiments, i.e. using the awesome multi_gpu_model(model, gpus=4) for training with multiple GPUs, saving/loading weights or whole models with oneliners, all the convenience functions like .fit(), .predict(), and others.
However, I prefer to define my model in native TensorFlow. Context managers in TF are awesome and, in my opinion, it is much easier to implement stuff like GANs with them:
with tf.variable_scope("Generator"):
# define some layers
with tf.variable_scope("Discriminator"):
# define some layers
# model losses
G_train_op = ...AdamOptimizer(...)
.minimize(gloss,
var_list=tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
scope="Generator")
D_train_op = ...AdamOptimizer(...)
.minimize(dloss,
var_list=tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
scope="Discriminator")
Another bonus is structuring the graph this way. In TensorBoard debugging complicated native Keras models are hell since they are not structured at all. With heavy use of variable scopes in native TF you can "disentangle" the graph and look at a very structured version of a complicated model for debugging.
By utilizing this I can directly setup custom loss function and do not have to freeze anything in every training iteration since TF will only update the weights in the correct scope, which is (at least in my opinion) far easier than the Keras solution to loop over all the existing layers and set .trainable = False.
TL;DR:
Long story short: I like the direct access to everything in TF, but most of the time a simple Keras model is sufficient for training, inference, ... later on. The model API is much easier and more convenient in Keras.
Hence, I would prefer to set up a graph in native TF and convert it to Keras for training, evaluation, and so on. Is there any way to do this?
I don't think it is possible to create a generic automated converter for any TF graph, that will come up with a meaningful set of layers, with proper namings etc. Just because graphs are more flexible than a sequence of Keras layers.
However, you can wrap your model with the Lambda layer. Build your model inside a function, wrap it with Lambda and you have it in Keras:
def model_fn(x):
layer_1 = tf.layers.dense(x, 100)
layer_2 = tf.layers.dense(layer_1, 100)
out_layer = tf.layers.dense(layer_2, num_classes)
return out_layer
model.add(Lambda(model_fn))
That is what sometimes happens when you use multi_gpu_model: You come up with three layers: Input, model, and Output.
Keras Apologetics
However, integration between TensorFlow and Keras can be much more tighter and meaningful. See this tutorial for use cases.
For instance, variable scopes can be used pretty much like in TensorFlow:
x = tf.placeholder(tf.float32, shape=(None, 20, 64))
with tf.name_scope('block1'):
y = LSTM(32, name='mylstm')(x)
The same for manual device placement:
with tf.device('/gpu:0'):
x = tf.placeholder(tf.float32, shape=(None, 20, 64))
y = LSTM(32)(x) # all ops / variables in the LSTM layer will live on GPU:0
Custom losses are discussed here: Keras: clean implementation for multiple outputs and custom loss functions?
This is how my model defined in Keras looks in Tensorboard:
So, Keras is indeed only a simplified frontend to TensorFlow so you can mix them quite flexibly. I would recommend you to inspect source code of Keras model zoo for clever solutions and patterns that allows you to build complex models using clean API of Keras.
You can insert TensorFlow code directly into your Keras model or training pipeline! Since mid-2017, Keras has fully adopted and integrated into TensorFlow. This article goes into more detail.
This means that your TensorFlow model is already a Keras model and vice versa. You can develop in Keras and switch to TensorFlow whenever you need to. TensorFlow code will work with Keras APIs, including Keras APIs for training, inference and saving your model.

Tensorflow embeddings

I know what embeddings are and how they are trained. Precisely, while referring to the tensorflow's documentation, I came across two different articles. I wish to know what exactly is the difference between them.
link 1: Tensorflow | Vector Representations of words
In the first tutorial, they have explicitly trained embeddings on a specific dataset. There is a distinct session run to train those embeddings. I can then later on save the learnt embeddings as a numpy object and use the
tf.nn.embedding_lookup() function while training an LSTM network.
link 2: Tensorflow | Embeddings
In this second article however, I couldn't understand what is happening.
word_embeddings = tf.get_variable(“word_embeddings”,
[vocabulary_size, embedding_size])
embedded_word_ids = tf.gather(word_embeddings, word_ids)
This is given under the training embeddings sections. My doubt is: does the gather function train the embeddings automatically? I am not sure since this op ran very fast on my pc.
Generally: What is the right way to convert words into vectors (link1 or link2) in tensorflow for training a seq2seq model? Also, how to train the embeddings for a seq2seq dataset, since the data is in the form of separate sequences for my task unlike (a continuous sequence of words refer: link 1 dataset)
Alright! anyway, I have found the answer to this question and I am posting it so that others might benefit from it.
The first link is more of a tutorial that steps you through the process of exactly how the embeddings are learnt.
In practical cases, such as training seq2seq models or Any other encoder-decoder models, we use the second approach where the embedding matrix gets tuned appropriately while the model gets trained.

DeepLearning Anomaly Detection for images

I am still relatively new to the world of Deep Learning. I wanted to create a Deep Learning model (preferably using Tensorflow/Keras) for image anomaly detection. By anomaly detection I mean, essentially a OneClassSVM.
I have already tried sklearn's OneClassSVM using HOG features from the image. I was wondering if there is some example of how I can do this in deep learning. I looked up but couldn't find one single code piece that handles this case.
The way of doing this in Keras is with the KerasRegressor wrapper module (they wrap sci-kit learn's regressor interface). Useful information can also be found in the source code of that module. Basically you first have to define your Network Model, for example:
def simple_model():
#Input layer
data_in = Input(shape=(13,))
#First layer, fully connected, ReLU activation
layer_1 = Dense(13,activation='relu',kernel_initializer='normal')(data_in)
#second layer...etc
layer_2 = Dense(6,activation='relu',kernel_initializer='normal')(layer_1)
#Output, single node without activation
data_out = Dense(1, kernel_initializer='normal')(layer_2)
#Save and Compile model
model = Model(inputs=data_in, outputs=data_out)
#you may choose any loss or optimizer function, be careful which you chose
model.compile(loss='mean_squared_error', optimizer='adam')
return model
Then, pass it to the KerasRegressor builder and fit with your data:
from keras.wrappers.scikit_learn import KerasRegressor
#chose your epochs and batches
regressor = KerasRegressor(build_fn=simple_model, nb_epoch=100, batch_size=64)
#fit with your data
regressor.fit(data, labels, epochs=100)
For which you can now do predictions or obtain its score:
p = regressor.predict(data_test) #obtain predicted value
score = regressor.score(data_test, labels_test) #obtain test score
In your case, as you need to detect anomalous images from the ones that are ok, one approach you can take is to train your regressor by passing anomalous images labeled 1 and images that are ok labeled 0.
This will make your model to return a value closer to 1 when the input is an anomalous image, enabling you to threshold the desired results. You can think of this output as its R^2 coefficient to the "Anomalous Model" you trained as 1 (perfect match).
Also, as you mentioned, Autoencoders are another way to do anomaly detection. For this I suggest you take a look at the Keras Blog post Building Autoencoders in Keras, where they explain in detail about the implementation of them with the Keras library.
It is worth noticing that Single-class classification is another way of saying Regression.
Classification tries to find a probability distribution among the N possible classes, and you usually pick the most probable class as the output (that is why most Classification Networks use Sigmoid activation on their output labels, as it has range [0, 1]). Its output is discrete/categorical.
Similarly, Regression tries to find the best model that represents your data, by minimizing the error or some other metric (like the well-known R^2 metric, or Coefficient of Determination). Its output is a real number/continuous (and the reason why most Regression Networks don't use activations on their outputs). I hope this helps, good luck with your coding.

Building a conversational model using TensorFlow

I'd like to build a conversational modal that can predict a sentence using the previous sentences using TensorFlow LSTMs . The example provided in TensorFlow tutorial can be used to predict the next word in a sentence .
https://www.tensorflow.org/versions/v0.6.0/tutorials/recurrent/index.html
lstm = rnn_cell.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory.
state = tf.zeros([batch_size, lstm.state_size])
loss = 0.0
for current_batch_of_words in words_in_dataset:
# The value of state is updated after processing each batch of words.
output, state = lstm(current_batch_of_words, state)
# The LSTM output can be used to make next word predictions
logits = tf.matmul(output, softmax_w) + softmax_b
probabilities = tf.nn.softmax(logits)
loss += loss_function(probabilities, target_words)
Can I use the same technique to predict the next sentence ? Is there any working example on how to do this ?
You want to use the Sequence-to-sequence model. Instead of having it learn to translate sentences from a source language to a target language you have it learn responses to previous utterances in the conversation.
You can adapt the example seq2seq model in tensorflow by using the analogy that the source language 'English' is your set of previous sentences and target language 'French' are your response sentences.
In theory you could use the basic LSTM you were looking at by concatenating your training examples with a special symbol like this:
hello there ! __RESPONSE hi , how can i help ?
Then during testing you run it forward with a sequence up to and including the __RESPONSE symbol and the LSTM can carry it the rest of the way.
However, the seq2seq model above should be much more accurate and powerful because it had a separate encoder / decoder and includes an attention mechanism.
A sentence is composed words, so you can indeed predict the next sentence by predicting words sequentially. There are models, such as the one described in this paper, that build embeddings for entire paragraphs, which can be useful for your purpose. Of course there is Neural Conversational Model work that probably directly fits your need. TensorFlow doesn't ship with working examples of these models, but the recurrent models that come with TensorFlow should give you a good starting point for implementing them.