BERT for Text Summarization - tensorflow

I'm trying to build a text summarization model using a seq2seq architecture in Keras. I've followed this tutorial https://keras.io/examples/lstm_seq2seq/ and implemented it with an Embedding layer, which works fine. But now I want to use BERT. Can pretrained BERT embeddings be used in such a task? I usually see BERT used for text classification, but not with an encoder-decoder architecture.
I access the BERT model from TF Hub and use a custom Layer class implemented from this tutorial https://github.com/strongio/keras-bert/blob/master/keras-bert.ipynb. I also tokenize the input with the BERT tokenizer. Below is my model:
from tensorflow.keras.layers import Input, LSTM, Dense, BatchNormalization

# BertLayer is the custom Keras layer wrapping the TF Hub BERT module,
# taken from the keras-bert tutorial linked above.

# Encoder: BERT embeddings -> LSTM; keep the final hidden/cell states
enc_in_id = Input(shape=(None,), name="Encoder-Input-Ids")
enc_in_mask = Input(shape=(None,), name="Encoder-Input-Masks")
enc_in_segment = Input(shape=(None,), name="Encoder-Input-Segment-Ids")
bert_encoder_inputs = [enc_in_id, enc_in_mask, enc_in_segment]
encoder_embeddings = BertLayer(name='Encoder-Bert-Layer')(bert_encoder_inputs)
encoder_embeddings = BatchNormalization(name='Encoder-Batch-Normalization')(encoder_embeddings)
encoder_lstm = LSTM(latent_size, return_state=True, name='Encoder-LSTM')
encoder_out, e_state_h, e_state_c = encoder_lstm(encoder_embeddings)
encoder_states = [e_state_h, e_state_c]

# Decoder: BERT embeddings -> LSTM initialised with the encoder states
dec_in_id = Input(shape=(None,), name="Decoder-Input-Ids")
dec_in_mask = Input(shape=(None,), name="Decoder-Input-Masks")
dec_in_segment = Input(shape=(None,), name="Decoder-Input-Segment-Ids")
bert_decoder_inputs = [dec_in_id, dec_in_mask, dec_in_segment]
decoder_embeddings_layer = BertLayer(name='Decoder-Bert-Layer')
decoder_embeddings = decoder_embeddings_layer(bert_decoder_inputs)
decoder_batchnorm_layer = BatchNormalization(name='Decoder-Batch-Normalization-1')
decoder_batchnorm = decoder_batchnorm_layer(decoder_embeddings)
decoder_lstm = LSTM(latent_size, return_state=True, return_sequences=True, name='Decoder-LSTM')
decoder_out, _, _ = decoder_lstm(decoder_batchnorm, initial_state=encoder_states)
dense_batchnorm_layer = BatchNormalization(name='Decoder-Batch-Normalization-2')
decoder_out_batchnorm = dense_batchnorm_layer(decoder_out)

# Project each decoder time step onto the WordPiece vocabulary
decoder_dense_id = Dense(vocabulary_size, activation='softmax', name='Dense-Id')
dec_outputs_id = decoder_dense_id(decoder_out_batchnorm)
The model builds, and after a couple of epochs the accuracy rises to 1 and the loss drops below 0.5, but the predictions are awful. I'm working on a dev set of 5 samples (max 30 WordPiece tokens each) and predicting on the same data, yet I only get the first one or two tokens right; after that the model just repeats the last seen token or the [PAD] token.

There are different methods for summarizing a text, i.e. extractive and abstractive.
Extractive summarization means identifying important sections of the text and generating them verbatim, producing a subset of the sentences from the original text; abstractive summarization reproduces the important material in a new way after interpreting and examining the text, using advanced natural language techniques to generate a new, shorter text that conveys the most critical information from the original.
For a transformer-based approach you just need an additional attention layer, which you can add to an encoder-decoder model, or you can use pre-trained transformers (and maybe fine-tune them) like BERT, GPT, T5, etc.
You can have a look at: https://huggingface.co/transformers/
For Abstractive Summarization T5 works pretty well. Here's a nice and simple example: https://github.com/faiztariq/FzLabs/blob/master/abstractive-text-summarization-t5.ipynb
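As a rough illustration, here is a minimal sketch of abstractive summarization with a pretrained T5 model via transformers (the model name, prompt prefix, and generation settings below are illustrative choices, not taken from the notebook above):

# Minimal sketch: abstractive summarization with a pretrained T5 model (TensorFlow).
# "t5-small" and the generation settings are illustrative, not tuned values.
from transformers import TFT5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = TFT5ForConditionalGeneration.from_pretrained("t5-small")

text = "Your long input document goes here ..."

# T5 uses a task prefix; "summarize: " selects the summarization task.
inputs = tokenizer("summarize: " + text, return_tensors="tf",
                   max_length=512, truncation=True)
summary_ids = model.generate(inputs["input_ids"], max_length=60,
                             num_beams=4, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))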
For Extractive Summarization you may take a look at: https://pypi.org/project/bert-extractive-summarizer/
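And a minimal sketch for the extractive route with the bert-extractive-summarizer package linked above (the ratio value is just an illustrative choice):

# Minimal sketch: extractive summarization with bert-extractive-summarizer.
# pip install bert-extractive-summarizer
from summarizer import Summarizer

body = "Your long input document goes here ..."
model = Summarizer()
summary = model(body, ratio=0.2)  # keep roughly 20% of the sentences
print(summary)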
There's a paper (Attention Is All You Need) that explains transformers pretty well; you may also take a look at it: https://arxiv.org/abs/1706.03762

I think this work might prove helpful. There are many other text summarization models that you can try out there, and they also come with their own blogs discussing in detail how they were made.
Hope this is helpful

Related

CNN + LSTM model for images performs poorly on validation data set

My training and loss curves (not shown here) look like classic overfitting, and yes, similar graphs have received comments like "Classic overfitting"; I get it.
My model looks like below,
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense, LSTM

input_shape_0 = keras.Input(shape=(3, 100, 100, 1), name="img3")
model = tf.keras.layers.TimeDistributed(Conv2D(8, 3, activation="relu"))(input_shape_0)
model = tf.keras.layers.TimeDistributed(Dropout(0.3))(model)
model = tf.keras.layers.TimeDistributed(MaxPooling2D(2))(model)
model = tf.keras.layers.TimeDistributed(Conv2D(16, 3, activation="relu"))(model)
model = tf.keras.layers.TimeDistributed(MaxPooling2D(2))(model)
model = tf.keras.layers.TimeDistributed(Conv2D(32, 3, activation="relu"))(model)
model = tf.keras.layers.TimeDistributed(MaxPooling2D(2))(model)
model = tf.keras.layers.TimeDistributed(Dropout(0.3))(model)
model = tf.keras.layers.TimeDistributed(Flatten())(model)
model = tf.keras.layers.TimeDistributed(Dropout(0.4))(model)
model = LSTM(16, kernel_regularizer=tf.keras.regularizers.l2(0.007))(model)
# model = Dense(100, activation="relu")(model)
# model = Dense(200, activation="relu", kernel_regularizer=tf.keras.regularizers.l2(0.001))(model)
model = Dense(60, activation="relu")(model)
# model = Flatten()(model)
model = Dropout(0.15)(model)
out = Dense(30, activation='softmax')(model)
model = keras.Model(inputs=input_shape_0, outputs=out, name="mergedModel")

# Metric that reports the current learning rate alongside accuracy
def get_lr_metric(optimizer):
    def lr(y_true, y_pred):
        return optimizer.lr
    return lr

opt = tf.keras.optimizers.RMSprop()
lr_metric = get_lr_metric(opt)
# merged.compile(loss='sparse_categorical_crossentropy',
#                optimizer='adam', metrics=['accuracy'])
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=opt, metrics=['accuracy', lr_metric])
model.summary()
In the above model building code, please consider the commented lines as some of the approaches I have tried so far.
I have followed the suggestions given as answers and comments to this kind of question and none seems to be working for me. Maybe I am missing something really important?
Things that I have tried:
Dropouts at different places and different amounts.
Played with inclusion and expulsion of dense layers and their number of units.
Tried different numbers of units on the LSTM layer (started from as low as 1; at 16 I get the best performance).
Came across weight regularization techniques and tried to implement them as shown in the code above, placing them at different layers (I would like to know the proper way to decide where to apply it, instead of the simple trial and error I have been doing, which seems wrong).
Implemented learning rate scheduler using which I reduce the learning rate as the epochs progress after a certain number of epochs.
Tried two LSTM layers with the first one having return_sequences = true.
After all these, I still cannot overcome the overfitting problem.
My data set is properly shuffled and divided in a train/val ratio of 80/20.
Data augmentation is one more commonly suggested thing that I have yet to try, but first I want to see whether I am making some mistake that I can correct, so that I can avoid diving into data augmentation for now. My data set has the following sizes:
Training images: 6780
Validation images: 1484
The numbers shown are samples, and each sample has 3 images. So basically, I input 3 images at once as one sample to my time-distributed CNN, which is then followed by the other layers shown in the model description. That means my training images total 6780 * 3 and my validation images total 1484 * 3. Each image is 100 * 100 with a single channel.
I am using RMSprop as the optimizer, which performed better than Adam in my testing.
UPDATE
I tried some different architectures, with regularization and dropout at different places, and I am now able to achieve a val_acc of 59%. Below is the new model.
# kernel_regularizer=tf.keras.regularizers.l2(0.004)
# kernel_constraint=max_norm(3)
model = tf.keras.layers.TimeDistributed(Conv2D(32, 3, activation="relu"))(input_shape_0)
model = tf.keras.layers.TimeDistributed(Dropout(0.3))(model)
model = tf.keras.layers.TimeDistributed(MaxPooling2D(2))(model)
model = tf.keras.layers.TimeDistributed(Conv2D(64, 3, activation="relu"))(model)
model = tf.keras.layers.TimeDistributed(MaxPooling2D(2))(model)
model = tf.keras.layers.TimeDistributed(Conv2D(128, 3, activation="relu"))(model)
model = tf.keras.layers.TimeDistributed(MaxPooling2D(2))(model)
model = tf.keras.layers.TimeDistributed(Dropout(0.3))(model)
model = tf.keras.layers.TimeDistributed(GlobalAveragePooling2D())(model)
model = LSTM(128, return_sequences=True, kernel_regularizer=tf.keras.regularizers.l2(0.040))(model)
model = Dropout(0.60)(model)
model = LSTM(128, return_sequences=False)(model)
model = Dropout(0.50)(model)
out = Dense(30, activation='softmax')(model)
Try to perform Data Augmentation as a preprocessing step. Lack of data samples can lead to such curves. You can also try using k-fold Cross Validation.
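As a rough sketch (the ranges below are arbitrary illustrative values, and it assumes augmentation is applied to the individual 100x100 grayscale images before they are grouped into 3-image samples), Keras data augmentation could look like this:

# Minimal sketch: per-image augmentation with Keras' ImageDataGenerator.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=10,        # small random rotations
    width_shift_range=0.05,   # small horizontal shifts
    height_shift_range=0.05,  # small vertical shifts
    zoom_range=0.1            # mild zoom in/out
)

# x_imgs: array of shape (num_images, 100, 100, 1); y: matching labels
# for x_batch, y_batch in datagen.flow(x_imgs, y, batch_size=32):
#     ...  # train on the augmented batch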
There are many ways to prevent overfitting, according to the papers below:
Dropout layers (randomly disabling neurons). https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
Input noise (e.g. random Gaussian noise on the images). https://arxiv.org/pdf/2010.07532.pdf
Random data augmentations (e.g. rotating, shifting, scaling, etc.). https://arxiv.org/pdf/1906.11052.pdf
Adjusting the number of layers & units. https://clgiles.ist.psu.edu/papers/UMD-CS-TR-3617.what.size.neural.net.to.use.pdf
Regularization functions (e.g. L1, L2, etc.). https://www.researchgate.net/publication/329150256_A_Comparison_of_Regularization_Techniques_in_Deep_Neural_Networks
Early stopping: if you notice that for N successive epochs your model's training loss keeps decreasing but the model performs poorly on the validation data set, it is a good sign to stop the training (see the callback sketch below).
Shuffling the training data or K-fold cross validation are also common ways of dealing with overfitting.
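A minimal sketch of the early-stopping callback built into Keras (patience=5 is an arbitrary illustrative value):

import tensorflow as tf

# Stop training once val_loss has not improved for `patience` epochs.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=5,
    restore_best_weights=True  # roll back to the best epoch's weights
)

# model.fit(x_train, y_train, validation_data=(x_val, y_val), callbacks=[early_stop])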
I found this great repository, which contains examples of how to implement data augmentations:
https://github.com/kochlisGit/random-data-augmentations
Also, this repository here seems to have examples of CNNs that implement most of the above methods:
https://github.com/kochlisGit/Tensorflow-State-of-the-Art-Neural-Networks
The goal should be to get the model to predict correctly irrespective of the order in which the 3 images in the sample are arranged.
If the order of the images in each sample is not important for the training, I think your model does the inverse: the TimeDistributed layers followed by the LSTM take the order of the three images into account. As a solution, you can first augment the data by adding copies of each sample with its images reordered. Secondly, try to treat the three images as a single three-channel image and remove the TimeDistributed layers (I'm not sure that the three-channel version is more efficient, but you can give it a try); a sketch of that second option is below.
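A minimal sketch of that second alternative, assuming the three 100x100x1 grayscale images of a sample are stacked into a single 100x100x3 tensor (layer sizes here are illustrative, not tuned):

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D, Dropout, Dense

# If the data is shaped (N, 3, 100, 100, 1), it can be rearranged with e.g.
# x = x.reshape(N, 3, 100, 100).transpose(0, 2, 3, 1)  -> (N, 100, 100, 3)
inputs = keras.Input(shape=(100, 100, 3), name="img3_as_channels")
x = Conv2D(32, 3, activation="relu")(inputs)
x = MaxPooling2D(2)(x)
x = Conv2D(64, 3, activation="relu")(x)
x = MaxPooling2D(2)(x)
x = GlobalAveragePooling2D()(x)
x = Dropout(0.5)(x)
out = Dense(30, activation="softmax")(x)

channel_model = keras.Model(inputs, out, name="channels_model")
channel_model.compile(loss="sparse_categorical_crossentropy",
                      optimizer="rmsprop", metrics=["accuracy"])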

How to use bert layer for Multiple instance learning using TimeDistributed Layer?

I want to perform multiple instance learning using BERT. A bag of instances contains 40 sentences. Each sentence should output a label, and the final label should be the average of all the labels.
I have tried using the BERT layer from tensorflow_hub, but I have no idea how to use it with TimeDistributed.
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",trainable=True)
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
Any suggestions/workarounds will be appreciated
Disclaimer: I'm not an expert, so there are probably some issues to figure out, but this should give you a hint.
In addition to the BERT encoder, you should use the matching preprocessing model for your text:
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow import keras
from tensorflow.keras import Model
from tensorflow.keras.layers import TimeDistributed

bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4", trainable=True)
# please keep in mind that the encoder is updated (trainable=True)
sentence_input = keras.Input(shape=(), dtype=tf.string, name='sentences')
preprocessed = bert_preprocess(sentence_input)
encoded = bert_encoder(preprocessed)
outputs = encoded['pooled_output']  # a single pooled vector per sentence
bert_model = Model(inputs=sentence_input, outputs=outputs)

# Apply the per-sentence model across the 40 sentences of a bag
x = TimeDistributed(bert_model)(your_inputs)
and after the TimeDistributed layer you should add an average pooling layer (please remember that the dimension of x depends on the BERT model size; it'd be (40, 768) in your case, I guess).
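A hedged sketch of that last step: per-sentence sigmoid labels averaged over the 40 sentences of a bag (your_inputs is assumed to be a (40,)-shaped string input; the layer names are illustrative):

from tensorflow.keras.layers import Dense, GlobalAveragePooling1D

# your_inputs = keras.Input(shape=(40,), dtype=tf.string, name='bag')  # assumed
# x = TimeDistributed(bert_model)(your_inputs)  -> shape (batch, 40, 768)
per_sentence = TimeDistributed(Dense(1, activation='sigmoid'),
                               name='per-sentence-label')(x)           # (batch, 40, 1)
bag_label = GlobalAveragePooling1D(name='bag-average')(per_sentence)   # (batch, 1)

mil_model = Model(inputs=your_inputs, outputs=bag_label)
mil_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])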

Binary CNN spam classifier using ResNet50 producing train/val metrics > 0.99, but test metric falls under 0.6. Faulty code or approach?

I am working on a binary spam classifier project where I have to classify whether an image is spam or not. To explain things in detail, let me first show you what is not spam:
Any image in the world which is not a picture of a question on paper is spam, be it a bike, a black screen, or anything else.
I used my custom-made dummy model as well as ResNet50, with and without fine-tuning. I have lots of images of questions, but I used 47000 question images and the COCO 2017 test data set as spam. The model did well, producing great results, but when I tested it with ImageNet images it performed very poorly. I want to ask whether I am doing something wrong here, and if not, what I should do to make my model generalise.
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Model

train_data_gen = ImageDataGenerator(preprocessing_function=preprocess_input, validation_split=0.20)
train_set = train_data_gen.flow_from_directory(TRAIN, batch_size=BATCH_SIZE, target_size=(224, 224),
                                               subset='training', class_mode='categorical')
val_set = train_data_gen.flow_from_directory(TRAIN, batch_size=128, target_size=(224, 224),
                                             subset='validation', class_mode='categorical')
res_net = ResNet50(include_top=False, weights='imagenet', input_shape=(224, 224, 3), pooling='avg')
fc1 = Dense(1024, activation='relu')(res_net.output)
d1 = Dropout(0.79)(fc1)
out_ = Dense(1, activation='sigmoid', name='output_layer')(d1)
model = Model(inputs=res_net.input, outputs=out_)
for layer in res_net.layers:
    layer.trainable = False
model.compile(loss='binary_crossentropy', optimizer=adam,
              metrics=['accuracy'])
history = model.fit(train_set, epochs=3, validation_data=val_set,
                    steps_per_epoch=len(train_set)//BATCH_SIZE, callbacks=callbacks)
I have tried softmax with categorical_crossentropy (2 neurons) and sigmoid with binary_crossentropy (1 neuron). The model performs well on both training and validation data, but why is it not generalising? Is there something wrong with the way I train my model, or is the code faulty? Do I need more data, or should I train the ResNet from scratch?
I think I need more data, but how much data should I be aiming for so that the model tends to LEARN WHAT A QUESTION IMAGE LOOKS LIKE, given that spam could be anything?

Word Embedding for Convolutional Neural Network

I am trying to apply word2vec to a convolutional neural network. I am new to TensorFlow. Here is my code for the pre-trained embedding layer.
import tensorflow as tf

# Non-trainable embedding matrix, initialised from the pre-trained word2vec vectors
W = tf.Variable(tf.constant(0.0, shape=[vocabulary_size, embedding_size]),
                trainable=False, name="W")
embedding_placeholder = tf.placeholder(tf.float32, [vocabulary_size, embedding_size])
embedding_init = W.assign(embedding_placeholder)
sess = tf.Session()
sess.run(embedding_init, feed_dict={embedding_placeholder: final_embeddings})
I think I should use embedding_lookup, but I am not sure how to use it. I would really appreciate it if someone could give some advice.
Thanks
Tensorflow has an example using word2vec-cnn for text classification: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/skflow/text_classification_cnn.py
You are on the right track. As embedding_lookup works under the assumption that words are represented as integer ids you need to transform your inputs vectors to comply with that. Furthermore, you need to make sure that your transformed words are correctly indexed into the embedding matrix. What I did was I used the information about the index-to-word-mapping generated from the embedding model (I used gensim for training my embeddings) to create a word-to-index lookup table that I subsequently used to transform my input vectors.
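To make that concrete, here is a minimal TF1-style sketch in the spirit of the question's code (the placeholder name and sequence_length are illustrative assumptions): it looks up the pre-trained vectors for a batch of integer word ids and adds the channel dimension a convolution expects:

# input_ids holds the integer word ids produced by the word-to-index lookup table.
input_ids = tf.placeholder(tf.int32, [None, sequence_length], name="input_ids")

# Shape: [batch_size, sequence_length, embedding_size]
embedded_words = tf.nn.embedding_lookup(W, input_ids)

# conv2d expects a channels dimension, so add one:
# [batch_size, sequence_length, embedding_size, 1]
embedded_words_expanded = tf.expand_dims(embedded_words, -1)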
I am doing something similar. I stumbled upon this blog, which implements the paper "Convolutional Neural Networks for Sentence Classification", and it is good: http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/

Building a conversational model using TensorFlow

I'd like to build a conversational model that can predict a sentence from the previous sentences using TensorFlow LSTMs. The example provided in the TensorFlow tutorial can be used to predict the next word in a sentence.
https://www.tensorflow.org/versions/v0.6.0/tutorials/recurrent/index.html
lstm = rnn_cell.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory.
state = tf.zeros([batch_size, lstm.state_size])
loss = 0.0
for current_batch_of_words in words_in_dataset:
    # The value of state is updated after processing each batch of words.
    output, state = lstm(current_batch_of_words, state)
    # The LSTM output can be used to make next word predictions
    logits = tf.matmul(output, softmax_w) + softmax_b
    probabilities = tf.nn.softmax(logits)
    loss += loss_function(probabilities, target_words)
Can I use the same technique to predict the next sentence? Is there any working example of how to do this?
You want to use the Sequence-to-sequence model. Instead of having it learn to translate sentences from a source language to a target language you have it learn responses to previous utterances in the conversation.
You can adapt the example seq2seq model in tensorflow by using the analogy that the source language 'English' is your set of previous sentences and target language 'French' are your response sentences.
In theory you could use the basic LSTM you were looking at by concatenating your training examples with a special symbol like this:
hello there ! __RESPONSE hi , how can i help ?
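For instance, a tiny sketch of how such training lines could be assembled (the pairs and the __RESPONSE marker usage are just illustrative):

# Hypothetical sketch: join each (utterance, response) pair into one training line.
pairs = [
    ("hello there !", "hi , how can i help ?"),
    ("thanks a lot", "you are welcome"),
]
training_lines = [p + " __RESPONSE " + r for p, r in pairs]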
Then, during testing, you run it forward with a sequence up to and including the __RESPONSE symbol, and the LSTM can carry it the rest of the way.
However, the seq2seq model above should be much more accurate and powerful because it has a separate encoder/decoder and includes an attention mechanism.
A sentence is composed of words, so you can indeed predict the next sentence by predicting words sequentially. There are models, such as the one described in this paper, that build embeddings for entire paragraphs, which can be useful for your purpose. Of course, there is the Neural Conversational Model work that probably directly fits your need. TensorFlow doesn't ship with working examples of these models, but the recurrent models that come with TensorFlow should give you a good starting point for implementing them.