Develop Question Answer System using BERT - tensorflow

I'm currently developing a question-answering system (in Indonesian) using BERT for my thesis.
The dataset and the questions given are in Indonesian.
The problem is, I'm still not clear on the step-by-step process for developing the question-answering system with BERT.
From what I concluded after reading a number of research journals and papers, the process might look like this:
1. Prepare the main dataset
2. Load the pre-trained data
3. Train on the main dataset using the pre-trained data (so that it produces a "fine-tuned" model)
4. Cluster the fine-tuned model
5. Testing (giving questions to the system)
6. Evaluation
What I want to ask is:
Are those steps correct? Or are there any missing steps?
Also, if the default pre-trained data that BERT provides is in English while my main dataset is in Indonesian, how can I create my own Indonesian pre-trained data?
Is it really necessary to perform data/model clustering with BERT?
I appreciate any helpful answers.
Thank you very much in advance.

I would take a look at Huggingface's Question & Answer examples. That would at least be a good place to start.
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
text = r"""
🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""
questions = [
"How many pretrained models are available in Transformers?",
"What does Transformers provide?",
"Transformers provides interoperability between which frameworks?",
]
for question in questions:
    inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
    answer_start_scores, answer_end_scores = model(**inputs)
    answer_start = torch.argmax(answer_start_scores)  # Get the most likely beginning of the answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1  # Get the most likely end of the answer with the argmax of the score
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
    print(f"Question: {question}")
    print(f"Answer: {answer}\n")

Related

Transfer learning with TensorFlow Hub: using a single test image?

I have successfully followed this official tutorial on image classification with transfer learning: https://www.tensorflow.org/tutorials/images/transfer_learning_with_hub
My experimental model is now saved and is supposed to recognize when it sees a "good" painting. However, I want to test this with an image that the model has not seen before. So far I have only used notebooks where the dataset is already divided into train and test folders; however, this is not the case here.
I assume I need something like
img = tf.keras.preprocessing.image.load_img("/content/mytestimage.jpeg", target_size=(224,224))
among other things; however, for a beginner it would be useful to see an example of this kind of test prediction. So far I have searched without results - if anyone has any advice, I'm super happy to hear it!
Here's how to do it for MobileNet transfer learning with Keras, but most of the code should be the same for other models. A full transfer learning tutorial can be found here. I found it very useful.
import numpy as np
from PIL import Image
from tensorflow.keras.models import load_model

model = load_model('path/to/model.h5')
img = Image.open('path/to/mytestimage.jpeg')  # the unseen test image
img = img.resize((224, 224))  # resize to the model's expected input size
array = np.asarray(img, dtype=np.float32)
arrayexp = np.expand_dims(array, axis=0)
arrayexp = (arrayexp/127)-1  # This is a normalization factor specifically for MobileNet, but I think it's used for many other networks
result = model.predict(arrayexp)
print(np.argmax(result))  # Prints the class with the highest confidence
print(result[0][np.argmax(result)])  # Prints the confidence for that class
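If you prefer the tf.keras preprocessing utilities that the question already mentions, an equivalent sketch (assuming a 224x224 input size and the same [-1, 1] scaling; the file paths are placeholders) would be:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import load_model

model = load_model('path/to/model.h5')
img = tf.keras.preprocessing.image.load_img('/content/mytestimage.jpeg', target_size=(224, 224))
array = tf.keras.preprocessing.image.img_to_array(img)  # shape (224, 224, 3)
batch = np.expand_dims(array, axis=0)                   # shape (1, 224, 224, 3)
batch = (batch / 127.5) - 1                             # scale pixel values to [-1, 1]
result = model.predict(batch)
print(np.argmax(result))                                # predicted class index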

Updating a BERT model through Huggingface transformers

I am attempting to update the pre-trained BERT model using an in-house corpus. I have looked at the Huggingface transformers docs and I am a little stuck, as you will see below. My goal is to compute simple similarities between sentences using the cosine distance, but I need to update the pre-trained model for my specific use case.
The code below is taken directly from the Huggingface docs. I am attempting to "retrain" or update the model, and I assumed that SPECIAL_TOKEN_1 and SPECIAL_TOKEN_2 represent "new sentences" from my in-house data or corpus. Is this correct? In summary, I like the already pre-trained BERT model, but I would like to update it or retrain it using another in-house dataset. Any leads will be appreciated.
import tensorflow as tf
import tensorflow_datasets
from transformers import *
model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
SPECIAL_TOKEN_1="dogs are very cute"
SPECIAL_TOKEN_2="dogs are cute but i like cats better and my brother thinks they are more cute"
tokenizer.add_tokens([SPECIAL_TOKEN_1, SPECIAL_TOKEN_2])
model.resize_token_embeddings(len(tokenizer))
#Train our model
model.train()
model.eval()
BERT is pre-trained on 2 tasks: masked language modeling (MLM) and next sentence prediction (NSP). The most important of those two is MLM (it turns out that the next sentence prediction task is not really that helpful for the model's language understanding capabilities - RoBERTa for example is only pre-trained on MLM).
If you want to further train the model on your own dataset, you can do so by using BertForMaskedLM from the Transformers repository. This is BERT with a language modeling head on top, which allows you to perform masked language modeling (i.e. predicting masked tokens) on your own dataset. Here's how to use it:
from transformers import BertTokenizer, BertForMaskedLM
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict=True)
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]
outputs = model(**inputs, labels=labels)
loss = outputs.loss
logits = outputs.logits
You can update the weights of BertForMaskedLM using loss.backward(), which is the main way of training PyTorch models. If you don't want to do this yourself, the Transformers library also provides a Python script which allows you to perform MLM really quickly on your own dataset. See here (section "RoBERTa/BERT/DistilBERT and masked language modeling"). You just need to provide a training and test file.
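To make the loss.backward() part concrete, here is a minimal single-step training sketch that reuses the model, inputs and labels from the snippet above (the optimizer choice and learning rate are illustrative assumptions, not something the docs prescribe):
import torch

# Assumption: AdamW with lr=5e-5 is a common fine-tuning choice, not a prescribed one
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
outputs = model(**inputs, labels=labels)  # forward pass with MLM labels
loss = outputs.loss
loss.backward()                           # compute gradients
optimizer.step()                          # update the weights of BertForMaskedLM
optimizer.zero_grad()
model.eval()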
You don't need to add any special tokens. Examples of special tokens are [CLS] and [SEP], which are used for sequence classification and question answering tasks (among others). These are added by the tokenizer automatically. How do I know this? Because BertTokenizer inherits from PreTrainedTokenizer, and if you take a look at the documentation of its __call__ method here, you can see that the add_special_tokens parameter defaults to True.
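For example, you can check that the tokenizer inserts [CLS] and [SEP] on its own (output as I would expect it for bert-base-uncased):
enc = tokenizer("The capital of France is [MASK].")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'the', 'capital', 'of', 'france', 'is', '[MASK]', '.', '[SEP]']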

How can I use tqdm to visualize the progress of training steps using the tf.data.Dataset API?

I want to use tqdm to visualize my CNN network training steps.
How can I implement tqdm with the tf.data.Dataset API?
Can you show me some sample code? Thanks!
NOTE: although this question was obviously flagged for bad practice, I still think it is a valid question (I had it myself), so I am posting a solution.
One possibility is to obtain the cardinality of your dataset beforehand and pass it to tqdm as the total, as #mr.melon states:
import numpy as np
cardinality = np.sum([1 for i in dataset.batch(batch_size)])
where dataset is of class tf.data.Dataset and you haven't done the full preparation pipeline yet (I refer to interleaving, shuffling, batching, prefetching and the like).
Then you can do
from tqdm import tqdm
for input, label in tqdm(dataset, total=cardinality):
    ...
It's pretty easy:
Derive the number of samples in your dataset,
then convert that number into some iterable Python structure and wrap it in tqdm:
for _ in tqdm(iterable=xxx, total=num_samples):
    batch_data = sess.run(ele_derived_from_tf_dataset)
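For completeness, here is a self-contained TF 2.x sketch (the toy data and batch size are placeholders I made up) where the dataset is iterated directly and its cardinality is known, so tqdm can show a proper progress bar:
import numpy as np
import tensorflow as tf
from tqdm import tqdm

# Placeholder data, only to make the example self-contained
features = np.random.rand(1000, 32).astype("float32")
labels = np.random.randint(0, 2, size=(1000,))
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(32)

# Number of batches, known here because the dataset has a fixed length
num_batches = int(tf.data.experimental.cardinality(dataset))

for x_batch, y_batch in tqdm(dataset, total=num_batches):
    pass  # training step goes here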

Tensorflow Hub Image Modules: Clarity on Preprocessing and Output values

Many thanks for the support!
I currently use TF-Slim, and TF Hub seems like a very useful addition for transfer learning. However, the following things are not clear from the documentation:
1. Is preprocessing done implicitly? Is this based on the trainable=True/False parameter in the module's constructor?
module = hub.Module("https://tfhub.dev/google/imagenet/inception_v3/feature_vector/1", trainable=True)
When I use TF-Slim, I use the preprocess method:
inception_preprocessing.preprocess_image(image, img_height, img_width, is_training)
2. How do I get access to AuxLogits for an Inception model? They seem to be missing:
import tensorflow_hub as hub
import tensorflow as tf
img = tf.random_uniform([10,299,299,3])
module = hub.Module("https://tfhub.dev/google/imagenet/inception_v3/feature_vector/1", trainable=True)
outputs = module(dict(images=img), signature="image_feature_vector", as_dict=True)
The output is
dict_keys(['InceptionV3/Mixed_6b', 'InceptionV3/MaxPool_5a_3x3', 'InceptionV3/Mixed_6c', 'InceptionV3/Mixed_6d', 'InceptionV3/Mixed_6e', 'InceptionV3/Mixed_7a', 'InceptionV3/Mixed_7b', 'InceptionV3/Conv2d_2a_3x3', 'InceptionV3/Mixed_7c', 'InceptionV3/Conv2d_4a_3x3', 'InceptionV3/Conv2d_1a_3x3', 'InceptionV3/global_pool', 'InceptionV3/MaxPool_3a_3x3', 'InceptionV3/Conv2d_2b_3x3', 'InceptionV3/Conv2d_3b_1x1', 'default', 'InceptionV3/Mixed_5b', 'InceptionV3/Mixed_5c', 'InceptionV3/Mixed_5d', 'InceptionV3/Mixed_6a'])
These are excellent questions; let me try to give good answers also for readers less familiar with TF-Slim.
1. Preprocessing is not done by the module, because it is a lot about your data, and not so much about the CNN architecture within the module. The module only handles transforming input values from the canonical [0,1] range into whatever the pre-trained CNN within the module expects.
Lengthy rationale: Preprocessing of images for CNN training usually consists of decoding the input JPEG (or whatever), selecting a (reasonably large) random crop from it, random photometric and geometric transformations (distort colors, flip left/right, etc.), and resizing to the common image size for a batch of training inputs. The TensorFlow Hub modules that implement https://tensorflow.org/hub/common_signatures/images leave all of that to your code around the module.
The primary reason is that the suitable random transformations depend a lot on your training task, but not on the architecture or trained state weights of the module. For example, color distortions will help if you classify cars vs dogs, but probably not for ripe vs unripe bananas, and so on.
Also, a batch of images that have been decoded but not yet cropped/resized are hard to represent as a single tensor (unless you make it a 1-D tensor of encoded strings, but that brings other problems, such as breaking backprop into module inputs for advanced uses).
Bottom line: The Python code using the module needs to do image preprocessing (except scaling values), for example, as in https://github.com/tensorflow/hub/blob/master/examples/image_retraining/retrain.py
The slim preprocessing methods conflate the dataset-specific random transformations (tuned for Imagenet!) with the re-scaling to the architecture's value range (which the Hub module does for you). That means they are not directly applicable here.
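To make point 1 concrete, here is a minimal TF1-style sketch of caller-side preprocessing for a single training image (the 299x299 size matches Inception V3; the specific augmentation and the file path are illustrative assumptions, not a prescribed recipe):
import tensorflow as tf
import tensorflow_hub as hub

# Decode one JPEG and bring the values into the canonical [0, 1] range;
# the module itself rescales from [0, 1] to whatever the CNN expects.
image = tf.image.decode_jpeg(tf.read_file("path/to/train_image.jpg"), channels=3)
image = tf.image.convert_image_dtype(image, tf.float32)

# Task-specific random transformations stay in your code, not in the module.
image = tf.image.random_flip_left_right(image)

# Resize to the module's expected input size and add a batch dimension.
image = tf.image.resize_images(image, [299, 299])
batch = tf.expand_dims(image, 0)

module = hub.Module("https://tfhub.dev/google/imagenet/inception_v3/feature_vector/1",
                    trainable=True, tags={"train"})
features = module(batch)  # image feature vectors, shape [1, 2048]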
2. Indeed, auxiliary heads are missing from the initial set of modules published under tfhub.dev/google/..., but I expect them to work fine for re-training anyway.
More details: Not all architectures have auxiliary heads, and even the original Inception paper says their effect was "relatively minor" [Szegedy&al. 2015; §5]. Using an image feature vector module for a custom classification task would burden the module consumer code with checking for aux features and, if found, putting aux logits and a loss term on top.
This complication did not seem to pull its weight, but more experiments might refute that assessment. (Please share in a GitHub issue if you know of any.)
For now, the only way to put an aux head onto https://tfhub.dev/google/imagenet/inception_v3/feature_vector/1 is to copy&paste some lines from https://github.com/tensorflow/models/blob/master/research/slim/nets/inception_v3.py (search "Auxiliary head logits") and apply that to the "Inception_V3/Mixed_6e" output that you saw.
3. You didn't ask, but: For training, the module's documentation recommends passing hub.Module(..., tags={"train"}), or else batch norm operates in inference mode (and dropout, if the module has any).
Hope this explains how and why things are.
Arno (from the TensorFlow Hub developers)

Building a conversational model using TensorFlow

I'd like to build a conversational model that can predict a sentence given the previous sentences, using TensorFlow LSTMs. The example provided in the TensorFlow tutorial can be used to predict the next word in a sentence:
https://www.tensorflow.org/versions/v0.6.0/tutorials/recurrent/index.html
lstm = rnn_cell.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory.
state = tf.zeros([batch_size, lstm.state_size])
loss = 0.0
for current_batch_of_words in words_in_dataset:
    # The value of state is updated after processing each batch of words.
    output, state = lstm(current_batch_of_words, state)

    # The LSTM output can be used to make next word predictions
    logits = tf.matmul(output, softmax_w) + softmax_b
    probabilities = tf.nn.softmax(logits)
    loss += loss_function(probabilities, target_words)
Can I use the same technique to predict the next sentence? Is there any working example of how to do this?
You want to use the sequence-to-sequence model. Instead of having it learn to translate sentences from a source language to a target language, you have it learn responses to previous utterances in the conversation.
You can adapt the example seq2seq model in TensorFlow by using the analogy that the source language 'English' is your set of previous sentences and the target language 'French' is your set of response sentences.
In theory you could use the basic LSTM you were looking at by concatenating your training examples with a special symbol like this:
hello there ! __RESPONSE hi , how can i help ?
Then during testing you run it forward with a sequence up to and including the __RESPONSE symbol and the LSTM can carry it the rest of the way.
However, the seq2seq model above should be much more accurate and powerful because it has a separate encoder/decoder and includes an attention mechanism.
A sentence is composed of words, so you can indeed predict the next sentence by predicting words sequentially. There are models, such as the one described in this paper, that build embeddings for entire paragraphs, which can be useful for your purpose. Of course, there is the Neural Conversational Model work that probably directly fits your need. TensorFlow doesn't ship with working examples of these models, but the recurrent models that come with TensorFlow should give you a good starting point for implementing them.