BERT encodings from TensorFlow hub - tensorflow

I am using the following code to generate embeddings for my text classification.
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")
def get_sentence_embeding(sentences):
    preprocessed_text = bert_preprocess(sentences)
    return bert_encoder(preprocessed_text)['pooled_output']
e = get_sentence_embeding(["happy", "sad"])
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity([e[0]],[e[1]])
The above gives array([[0.99355495]], dtype=float32).
It says the similarity score between "happy" and "sad" is 99%.
Why is the score 99%? Can I use these embeddings for my text classification?

BERT wasn't optimized to put words with contrary meanings far away from each other in the embedding space. Instead, the two words are close together since both are adjectives.
This tutorial actually demonstrates how to fine-tune BERT for sentiment analysis.
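To illustrate why two opposite words can still score close to 1, here is a minimal sketch of cosine similarity on toy vectors. The numbers are made up for illustration and are not real BERT outputs; the assumption is only that the two embeddings share a large common component and differ in a small one.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors: eight shared dimensions plus one
# dimension in which the two words point in opposite directions.
shared = [1.0] * 8
happy = shared + [0.2]
sad = shared + [-0.2]

similarity = cosine(happy, sad)              # dominated by the shared part
opposite_part = cosine([0.2], [-0.2])        # the differing part alone
print(similarity, opposite_part)
```

The differing dimension alone is perfectly opposite, yet the overall score stays near 1 because the shared dimensions dominate the dot product. Fine-tuning on a task is what pushes task-relevant differences into larger parts of the vector.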

Related

How to convert the vectorized text into embedding matrix using Longformer transformer

Converting text into vectors, and those vectors into a matrix, is a common way to produce features for machine learning models such as LightGBM.
import pandas as pd  # needed for pd.read_csv below
import transformers
from transformers import LongformerTokenizer, LongformerForSequenceClassification, Trainer, TrainingArguments, LongformerConfig, LongformerTokenizerFast
import tensorflow as tf
#tokenizer=LongformerTokenizer.from_pretrained("hf-internal-testing/tiny-random-longformer")
#model=TFLongformerForSequenceClassification.from_pretrained("hf-internal-testing/tiny-random-longformer")
from torch.utils.data import Dataset, DataLoader
config=LongformerConfig()
test=pd.read_csv('../input/feedback-prize-effectiveness/test.csv')
train=pd.read_csv('../input/feedback-prize-effectiveness/train.csv')
# load model and tokenizer and define length of the text sequence
model = LongformerForSequenceClassification.from_pretrained('allenai/longformer-base-4096',gradient_checkpointing=False,attention_window = 512)
tokenizer = LongformerTokenizerFast.from_pretrained('allenai/longformer-base-4096', max_length = 1024)
#inputs = tokenizer("Hello, my dog is cute killer and bad")
#print(inputs.input_ids)
k=[]
for i in train['discourse_text']:
    inputs = tokenizer(i)
    m = inputs.input_ids
    k.append(m)
train['long_tokens']=k
The code above uses the Longformer tokenizer to encode the sentences in the dataset and stores the resulting token IDs in a new "long_tokens" column.
The "long_tokens" column should serve as a feature for the machine learning model (LightGBM).
My question is: how can I transform this column so that it can be fed to the model?
The datatype of "long_tokens" is tensor.
Please answer the question
Thanks & Regards
Satwik Sunnam
You can use Sentence transformers to encode sentences into vectors and then use them as features.
https://www.sbert.net/
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('allenai/longformer-base-4096')
#Our sentences we like to encode
sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of string.',
             'The quick brown fox jumps over the lazy dog.']
#Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)
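Token IDs are variable-length lists of vocabulary indices, whereas a model like LightGBM expects one fixed-length numeric row per text. Sentence encoders commonly bridge that gap by pooling the per-token vectors, for example by averaging over the non-padding tokens. The sketch below shows that idea with toy numbers; mean_pool is a hypothetical helper written for illustration, not a function of the sentence-transformers library.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens (mask == 1), ignoring padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            total = [t + x for t, x in zip(total, vec)]
            count += 1
    return [t / count for t in total]

# Toy example: three token vectors of size 2, the last one is padding.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)
```

Every text then yields a vector of the same length, so the vectors can be stacked into a 2-D feature matrix with one row per text.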

IndexError: list index out of range, NLP BERT Tensorflow

I trained a BERT model and saved it as an HDF5 file, but when I try to predict, it shows this error:
IndexError: list index out of range
Here is the code:
import os.path
import numpy as np
import tensorflow as tf
import ktrain
from ktrain import text
"""## Part 1: Data Preprocessing
### Loading the IMDB dataset
"""
dataset = tf.keras.utils.get_file(fname="aclImdb_v1.tar.gz",
                                  origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
                                  extract=True)
IMDB_DATADIR = os.path.join(os.path.dirname(dataset), 'aclImdb')
print(os.path.dirname(dataset))
print(IMDB_DATADIR)
"""### Creating the training and test sets"""
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_folder(
    datadir=IMDB_DATADIR,
    classes=['pos', 'neg'],
    maxlen=500,
    train_test_names=['train', 'test'],
    preprocess_mode='bert')
"""## Part 2: Building the BERT model"""
model = text.text_classifier(name='bert',
                             train_data=(x_train, y_train),
                             preproc=preproc)
"""## Part 3: Training the BERT model"""
learner = ktrain.get_learner(model=model,
                             train_data=(x_train, y_train),
                             val_data=(x_test, y_test),
                             batch_size=6)
learner.fit_onecycle(lr=2e-5,
                     epochs=1)
tf.keras.models.save_model(model, 'NLP_model.hdf5')
from keras_bert import get_custom_objects
model = tf.keras.models.load_model('NLP_model.hdf5', custom_objects=get_custom_objects())
model.predict(
    'This movie is not the scariest of all time, but it is a great example of a campy '
    'eighties horror flick -- low budget, no stars, lots of inventive death scenes, and enough nudity to '
    'keep the teenagers in their seats. The premise is interesting and fun and the three evil kids play '
    'their parts well. A nice starting point for "Just Say" Julie Brown exposing her talents early in her '
    'career. This film wont be seen by many, but for fans of 80s horror its a must.ense love would be more '
    'believable.n.'
)
I'm trying to predict one of the sentences from the test set.
I would appreciate any help, thank you.
As shown in the ktrain tutorials and example notebooks like this one, you need to use the Predictor instance to make predictions on raw text inputs:
# create a Predictor instance
predictor = ktrain.get_predictor(learner.model, preproc)
# make prediction
output = predictor.predict('I loved this movie!')
print(output)
# save Predictor to disk
predictor.save('/tmp/mypredictor')
# reload Predictor from disk
reloaded_predictor = ktrain.load_predictor('/tmp/mypredictor')
# make another prediction
output = reloaded_predictor.predict('I loved this movie!')
print(output)
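The reason a predictor object helps is that it keeps the preprocessing step and the model together, so raw text is always converted to numeric features before it reaches the network. The following is a minimal sketch of that design with toy components; TinyPredictor, toy_preprocess and toy_model are hypothetical names for illustration and do not reflect how ktrain implements its Predictor.

```python
class TinyPredictor:
    """Hypothetical stand-in for a predictor: bundles preprocessing
    with the model so callers can pass raw text."""

    def __init__(self, model, preprocess, class_names):
        self.model = model
        self.preprocess = preprocess
        self.class_names = class_names

    def predict(self, text):
        features = self.preprocess(text)   # raw text -> numeric features
        scores = self.model(features)      # numeric features -> one score per class
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.class_names[best]

# Toy components, only to make the sketch runnable.
def toy_preprocess(text):
    words = text.lower().split()
    return [words.count("loved"), words.count("hated")]

def toy_model(features):
    return features  # pretend the word counts are class scores

predictor = TinyPredictor(toy_model, toy_preprocess, ["pos", "neg"])
label = predictor.predict("I loved this movie!")
print(label)
```

Calling the bare model on a string skips the preprocessing step, which is why the model and the preprocessor need to be saved and reloaded as a pair.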

how to convert saved model from sklearn into tensorflow/lite

I want to implement a classifier using the sklearn library. Is there a way to save the model, or to convert the file into a saved TensorFlow model, so that I can convert it to TensorFlow Lite later?
If you replicate the architecture in TensorFlow, which will be pretty easy given that scikit-learn models are usually rather simple, you can explicitly assign the parameters from the learned scikit-learn models to TensorFlow layers.
Here is an example with logistic regression turned into a single dense layer:
import tensorflow as tf
import numpy as np
from sklearn.linear_model import LogisticRegression
# some random data to train and test on
x = np.random.normal(size=(60, 21))
y = np.random.uniform(size=(60,)) > 0.5
# fit the sklearn model on the data
sklearn_model = LogisticRegression().fit(x, y)
# create a TF model with the same architecture
tf_model = tf.keras.models.Sequential()
tf_model.add(tf.keras.Input(shape=(21,)))
tf_model.add(tf.keras.layers.Dense(1))
# assign the parameters from sklearn to the TF model
tf_model.layers[0].weights[0].assign(sklearn_model.coef_.transpose())
tf_model.layers[0].bias.assign(sklearn_model.intercept_)
# verify the models do the same prediction
assert np.all((tf_model(x) > 0)[:, 0].numpy() == sklearn_model.predict(x))
It is not always easy to replicate a scikit-learn model in TensorFlow. For instance, scikit-learn has many on-the-fly imputation utilities that are tricky to implement in TensorFlow.
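One detail of the logistic regression example is worth spelling out: the dense layer has no sigmoid activation, so it outputs logits, and the check compares them against 0 rather than 0.5. The sketch below, in plain Python with made-up weights standing in for sklearn_model.coef_ and sklearn_model.intercept_, shows that both decision rules agree.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(weights, bias, x):
    """Linear part of logistic regression: w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

# Hypothetical parameters, not taken from a fitted model.
weights = [0.8, -1.5, 0.3]
bias = -0.2
samples = [[1.0, 0.2, 0.5], [0.1, 1.0, -0.3], [0.0, 0.0, 1.0]]

# Thresholding the logit at 0 ...
by_logit = [logit(weights, bias, x) > 0 for x in samples]
# ... gives the same decision as thresholding the probability at 0.5.
by_prob = [sigmoid(logit(weights, bias, x)) > 0.5 for x in samples]
print(by_logit, by_prob)
```

If probabilities are needed from the converted model, a sigmoid activation can be added to the dense layer, in which case the comparison threshold becomes 0.5.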

Updating a BERT model through Huggingface transformers

I am attempting to update the pre-trained BERT model using an in-house corpus. I have looked at the Huggingface transformer docs and I am a little stuck, as you will see below. My goal is to compute simple similarities between sentences using the cosine distance, but I need to update the pre-trained model for my specific use case.
The code below is taken directly from the Huggingface docs. I am attempting to "retrain" or update the model, and I assumed that special_token_1 and special_token_2 represent "new sentences" from my in-house data or corpus. Is this correct? In summary, I like the pre-trained BERT model, but I would like to update or retrain it using another in-house dataset. Any leads will be appreciated.
import tensorflow as tf
import tensorflow_datasets
from transformers import *
model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
SPECIAL_TOKEN_1="dogs are very cute"
SPECIAL_TOKEN_2 = ("dogs are cute but i like cats better and my "
                   "brother thinks they are more cute")
tokenizer.add_tokens([SPECIAL_TOKEN_1, SPECIAL_TOKEN_2])
model.resize_token_embeddings(len(tokenizer))
#Train our model
model.train()
model.eval()
BERT is pre-trained on 2 tasks: masked language modeling (MLM) and next sentence prediction (NSP). The most important of those two is MLM (it turns out that the next sentence prediction task is not really that helpful for the model's language understanding capabilities - RoBERTa for example is only pre-trained on MLM).
If you want to further train the model on your own dataset, you can do so by using BertForMaskedLM in the Transformers repository. This is BERT with a language modeling head on top, which allows you to perform masked language modeling (i.e. predicting masked tokens) on your own dataset. Here's how to use it:
from transformers import BertTokenizer, BertForMaskedLM
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict=True)
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]
outputs = model(**inputs, labels=labels)
loss = outputs.loss
logits = outputs.logits
You can update the weights of BertForMaskedLM using loss.backward(), which is the main way of training PyTorch models. If you don't want to do this yourself, the Transformers library also provides a Python script which allows you to perform MLM really quickly on your own dataset. See here (section "RoBERTa/BERT/DistilBERT and masked language modeling"). You just need to provide a training and test file.
You don't need to add any special tokens. Examples of special tokens are [CLS] and [SEP], which are used for sequence classification and question answering tasks (among others). These are added by the tokenizer automatically. How do I know this? Because BertTokenizer inherits from PreTrainedTokenizer, and if you take a look at the documentation of its __call__ method here, you can see that the add_special_tokens parameter defaults to True.
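To make the masked language modeling objective more concrete, here is a simplified sketch of how training pairs can be built from token IDs: roughly 15% of positions are replaced by the mask ID, and the labels keep the original ID only at those positions. The helper mask_tokens and the example IDs are hypothetical; the value -100 is the label that the Transformers loss ignores. BERT's original recipe also sometimes keeps or randomizes the selected tokens, which this sketch leaves out.

```python
import random

MASK_ID = 103   # assumed ID of [MASK] in the bert-base-uncased vocabulary
IGNORE = -100   # label value skipped by the loss

def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for masked language modeling."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)   # the model sees [MASK] here
            labels.append(tok)       # and must predict the original token
        else:
            inputs.append(tok)       # token is left unchanged
            labels.append(IGNORE)    # position does not contribute to the loss
    return inputs, labels

original = list(range(1000, 1100))   # made-up token IDs
inputs, labels = mask_tokens(original)
print(sum(1 for i in inputs if i == MASK_ID), "tokens masked")
```

In practice the Transformers library ships a data collator that performs this masking on the fly, so the sketch is only meant to show what the model is trained to predict.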

Too Much Memory Issue with Semantic Image Segmentation NN (DeepLabV3+)

Let me first explain my task: I have nearly 3000 images of two different ropes. They contain rope 1, rope 2 and the background. My labels/masks are images in which, for example, the pixel value 0 represents the background, 1 represents the first rope and 2 represents the second rope. You can see both the input picture and the ground truth/labels in pictures 1 and 2 below. Notice that my ground truth/label has only 3 values: 0, 1 and 2.
My input picture is grayscale, but for DeepLab I converted it to an RGB picture, because DeepLab was trained on RGB pictures. My converted picture still doesn't contain any color.
The idea of this task is that the neural network should learn the structure of the ropes, so that it can label ropes correctly even if there are knots. Therefore the color information is not important, because my ropes have different colors, so it is easy to use KMeans for creating the ground truth/labels.
For this task I chose a semantic segmentation network called DeepLab V3+ in Keras with TensorFlow as the backend. I want to train the NN with my nearly 3000 images. The total size of all the images is under 100MB and they are 300x200 pixels.
Maybe DeepLab is not the best choice for my task, because my pictures don't contain color information and they are very small (300x200), but I haven't found a better semantic segmentation NN for my task so far.
From the Keras website I know how to load the data with flow_from_directory and how to use the fit_generator method. I don't know if my code is logically correct...
Here are the links:
https://keras.io/preprocessing/image/
https://keras.io/models/model/
https://github.com/bonlime/keras-deeplab-v3-plus
My first question is:
With my implementation my graphics card used nearly all of its memory (11GB). I don't know why. Is it possible that the weights of DeepLab are that big? My batch size is the default of 32 and all my nearly 3000 images together are under 100MB. I already used the config.gpu_options.allow_growth = True code, see my code below.
A general question:
Does somebody know a good semantic segmentation NN for my task? I don't need an NN that was trained with color images. But I also don't need an NN that was trained with binary ground truth pictures...
I tested my raw color image (picture 3) with DeepLab, but the resulting label I got was not good...
Here is my code so far:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3"
import numpy as np
from model import Deeplabv3
import tensorflow as tf
import time
import tensorboard
import keras
from keras.preprocessing.image import img_to_array
from keras.applications import imagenet_utils
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import TensorBoard
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
from keras import backend as K
K.set_session(session)
NAME = "DeepLab-{}".format(int(time.time()))
deeplab_model = Deeplabv3(input_shape=(300,200,3), classes=3)
tensorboard = TensorBoard(log_dir="logpath/{}".format(NAME))
deeplab_model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=['accuracy'])
# we create two instances with the same arguments
data_gen_args = dict(featurewise_center=True,
featurewise_std_normalization=True,
rotation_range=90,
width_shift_range=0.1,
height_shift_range=0.1,
zoom_range=0.2)
image_datagen = ImageDataGenerator(**data_gen_args)
mask_datagen = ImageDataGenerator(**data_gen_args)
# Provide the same seed and keyword arguments to the fit and flow methods
seed = 1
#image_datagen.fit(images, augment=True, seed=seed)
#mask_datagen.fit(masks, augment=True, seed=seed)
image_generator = image_datagen.flow_from_directory(
'/path/Input/',
target_size=(300,200),
class_mode=None,
seed=seed)
mask_generator = mask_datagen.flow_from_directory(
'/path/Label/',
target_size=(300,200),
class_mode=None,
seed=seed)
# combine generators into one which yields image and masks
train_generator = zip(image_generator, mask_generator)
print("compiled")
#deeplab_model.fit(X, y, batch_size=32, epochs=10, validation_split=0.3, callbacks=[tensorboard])
deeplab_model.fit_generator(train_generator, steps_per_epoch= np.uint32(2935 / 32), epochs=10, callbacks=[tensorboard])
print("finish fit")
deeplab_model.save_weights('deeplab_1.h5')
deeplab_model.save('deeplab-1')
session.close()
Here is my code to test DeepLab (from Github):
from matplotlib import pyplot as plt
import cv2 # used for resize. if you dont have it, use anything else
import numpy as np
from model import Deeplabv3
import tensorflow as tf
from PIL import Image, ImageEnhance
deeplab_model = Deeplabv3(input_shape=(512,512,3), classes=3)
#deeplab_model = Deeplabv3()
img = Image.open("Path/Input/0/0001.png")
imResize = img.resize((512,512), Image.ANTIALIAS)
imResize = np.array(imResize)
img2 = cv2.cvtColor(imResize, cv2.COLOR_GRAY2RGB)
w, h, _ = img2.shape
ratio = 512. / np.max([w,h])
resized = cv2.resize(img2,(int(ratio*h),int(ratio*w)))
resized = resized / 127.5 - 1.
pad_x = int(512 - resized.shape[0])
resized2 = np.pad(resized,((0,pad_x),(0,0),(0,0)),mode='constant')
res = deeplab_model.predict(np.expand_dims(resized2,0))
labels = np.argmax(res.squeeze(),-1)
plt.imshow(labels[:-pad_x])
plt.show()
First question: DeepLabV3+ is a very large model (I assume you are using the Xception backbone?!) and needing 11 GB of GPU memory is totally normal for a batch size of 32 with 200x300 pixels :) (Training DeepLabV3+, I needed approx. 11 GB using a batch size of 5 with 500x500 pixels). One note on the second sentence of your question: the required GPU resources are influenced by many factors (model, optimizer, batch size, image crop, preprocessing etc.) but the actual size of your dataset shouldn't influence it. So it doesn't matter if your dataset is 300MB or 300GB large.
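A rough back-of-the-envelope calculation shows why batch size and image size matter while dataset size does not. The sketch below estimates the memory of a single float32 tensor; the 64-channel feature map at half resolution is a hypothetical layer chosen for illustration, not a measured DeepLabV3+ layer, and a real training run keeps many such tensors alive for backpropagation.

```python
def activation_bytes(batch, height, width, channels, bytes_per_value=4):
    """Memory of one dense float32 tensor of shape (batch, height, width, channels)."""
    return batch * height * width * channels * bytes_per_value

# The input batch itself: 32 RGB images of 300x200 pixels.
input_batch = activation_bytes(32, 300, 200, 3)

# A hypothetical 64-channel feature map at half resolution.
feature_map = activation_bytes(32, 150, 100, 64)

print(input_batch / 1e6, "MB for the input batch")
print(feature_map / 1e6, "MB for one feature map")
```

The number of images on disk never appears in the formula: only the tensors of the current batch live on the GPU. Halving the batch size halves every one of these activation tensors, which is usually the quickest way to reduce memory use.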
General question: You are using a small dataset. Choosing DeepLabV3+ & Xception might not be a good fit, since the model might be too large. This might lead to overfitting. If you haven't obtained satisfying results yet, you might try a smaller network. If you want to stick to the DeepLab framework, you could switch the backbone from the Xception network to MobileNetV2 (in the official TensorFlow version it is already implemented). Alternatively, you could try using a standalone network like the Inception network with an FCN head...
In each case it would be essential to use a pre-trained encoder with a well-trained feature representation. If you don't find a good initialization of your desired model based on grayscale input images, just use a model pre-trained on RGB images and extend the pre-training with a grayscale dataset (basically you can convert any big RGB dataset to grayscale) and fine-tune the weights on the grayscale input before using your data.
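Converting an RGB dataset to grayscale while keeping the three-channel input shape can be sketched as follows. This is a minimal pure-Python illustration on a tiny made-up image, using the common BT.601 luma weights; the helper names are hypothetical, and in practice an image library would do this on whole arrays.

```python
def to_gray_rgb(pixel):
    """Replace an (r, g, b) pixel by its luma, repeated in all three channels."""
    r, g, b = pixel
    luma = 0.299 * r + 0.587 * g + 0.114 * b
    return (luma, luma, luma)

def convert_image(image):
    """Apply the conversion to every pixel of a nested-list image."""
    return [[to_gray_rgb(p) for p in row] for row in image]

# Made-up 2x2 image: red, green, blue and white pixels.
image = [[(255, 0, 0), (0, 255, 0)],
         [(0, 0, 255), (255, 255, 255)]]

gray_image = convert_image(image)
print(gray_image)
```

The result still has three channels, so it fits a network built for RGB input, but all channels carry the same value, which matches the situation of grayscale pictures converted to RGB.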
I hope this helps! Cheers, Frank
IBM's Large Model Support (LMS) library enables training of large deep neural networks that would normally exhaust GPU memory while training. LMS manages this over-subscription of GPU memory by temporarily swapping tensors to host memory when they are not needed.
Description - https://developer.ibm.com/components/ibm-power/articles/deeplabv3-image-segmentation-with-pytorch-lms/
Pytorch - https://github.com/IBM/pytorch-large-model-support
TensorFlow - https://github.com/IBM/tensorflow-large-model-support