How to set max length in preprocessing before using BERT function? - tensorflow

import tensorflow_hub as hub
import tensorflow_text  # registers the ops needed by the preprocessing model

BERT_MODEL = "https://tfhub.dev/google/experts/bert/wiki_books/2"
PREPROCESS_MODEL = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"

preprocess = hub.load(PREPROCESS_MODEL)
bert = hub.load(BERT_MODEL)

inputs = preprocess(sentences)   # tokenizes and pads/truncates to the preprocessor's default length
outputs = bert(inputs)           # dict with "pooled_output" and "sequence_output"
I'm trying to get BERT embeddings for text-to-image generation, but I could not find how to change the maximum sequence length in these functions. Can you please explain how to do it?

BERT can only take input sequences of up to 512 tokens. This is quite a significant limitation, since many common document types are much longer than 512 words. It inherits its architecture from the Transformer, which uses self-attention, feed-forward layers, residual connections, and layer normalization as its foundational components.
The problems with BERT and large input documents are caused by a few aspects of BERT's architecture:
BERT's positional embeddings are learned for at most 512 positions, and its designers saw a considerable drop in performance when using documents longer than 512 tokens. As a result, this limit was set to protect against low-quality output.
Self-attention has a space complexity of O(n²) in the sequence length. Because of this quadratic complexity, the models require a lot of resources to fine-tune: the more input you provide, the more resources you need. For most users, the quadratic cost makes long inputs too expensive.
On how to deal with long sequences please refer to this. Thank you!
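That said, if what you actually need is just to control the sequence length used by the preprocessing (any value up to 512), the TF Hub preprocessing model can also be used in a two-step form, where bert_pack_inputs accepts a seq_length argument. A rough sketch following the documented interface of bert_en_uncased_preprocess/3 (adapt the length and model handles to your setup; requires tensorflow_text to be installed):

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # needed for the preprocessing ops

PREPROCESS_MODEL = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
BERT_MODEL = "https://tfhub.dev/google/experts/bert/wiki_books/2"
seq_length = 256  # anything up to 512; calling the preprocessor directly defaults to 128

preprocessor = hub.load(PREPROCESS_MODEL)

text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
tokenized = hub.KerasLayer(preprocessor.tokenize)(text_input)          # strings -> token ids
encoder_inputs = hub.KerasLayer(
    preprocessor.bert_pack_inputs,
    arguments=dict(seq_length=seq_length))([tokenized])                # pack to the chosen length

encoder = hub.KerasLayer(BERT_MODEL, trainable=False)
outputs = encoder(encoder_inputs)
embedding_model = tf.keras.Model(text_input, outputs["pooled_output"])

sentences = tf.constant(["A photo of a cat sitting on a red sofa."])
print(embedding_model(sentences).shape)  # (1, 768)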

Related

Is it possible to train a NN in Keras with features that won't be available for prediction?

I'm fairly new to this topic as a whole and struggle to wrap my head around even the basics of neural networks in general. I'm not looking for a project plan; I appreciate that you probably have better things to do.
Nonetheless, any idea or push in the right direction is appreciated.
Imagine a grey-box model of some kind (a thermal network, an electrical network, and so on) where it is desirable to predict the returns based on very few features, with an underlying smart model that is trained on a much bigger dataset.
My question is whether it is possible to train a model on a full set of features and then define some as mandatory and the rest as merely good-to-have for the predictions.
Any tips are appreciated.
Cheers
Yes, you can train your model like that, but you must feed all the features during prediction. For example, say you have 30 mandatory features and 10 optional features, 40 in total. You must feed all 40 features to get a prediction from your model, because the input data shape must always be the same. You may ask: if some features are optional, why am I forced to provide them all now? Well, I will discuss two options.
Option 1: set the input shape to None. If you set the variable dimension of the input shape to None, your model will accept inputs of any length, but then you will have to handle a few things yourself. You can't freely use MaxPooling layers: if you really need MaxPooling, you have to calculate the input and output shapes of all the layers using only the mandatory feature shape (the minimum input shape). If you calculate them using (mandatory + optional) features, you will end up with an error, because the input to a pooling layer can become too small to be reduced any further. Take care of that and you're good to go.
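To make Option 1 a bit more concrete, here is a minimal sketch. It assumes (and this is a big assumption) that your features are homogeneous enough to be treated as a variable-length 1-D sequence; all names and sizes are made up:

import numpy as np
import tensorflow as tf

# The length axis is left as None, and only shape-agnostic layers are used.
inputs = tf.keras.Input(shape=(None, 1))                  # (batch, variable_length, 1)
x = tf.keras.layers.Conv1D(32, 3, padding="same", activation="relu")(inputs)
x = tf.keras.layers.GlobalAveragePooling1D()(x)           # collapses the variable axis
outputs = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")

# The same model accepts 40 features at training time and only 30 at prediction time:
model.fit(np.random.rand(8, 40, 1), np.random.rand(8, 1), epochs=1, verbose=0)
print(model.predict(np.random.rand(8, 30, 1), verbose=0).shape)  # (8, 1)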
Option 2: fill in the missing features. I will give you an example: I was using OpenPose output to classify some movements. The OpenPose output is 18 body keypoints, i.e. 36 features counting the x and y coordinates, extracted from live camera frames. But we can't assume that all the body parts of a person will always be inside the frame: when someone's legs are outside the frame, we can't get their leg keypoints, yet we still need to classify. There were a lot of options: we could replace the missing keypoints with 0, or find the median/mean of similar poses and use that value as the keypoints. We found the best choice by analyzing all the data. If you're going with this option, I suggest you analyze your data first, then decide how you're going to handle the missing fields.
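For Option 2, the key point is that missing optional features get filled in so the input shape never changes. A purely illustrative sketch (the shapes and the imputation strategy are assumptions; let your data analysis decide):

import numpy as np

X_train = np.random.rand(1000, 40)            # stand-in for your 40-feature training matrix
col_means = X_train.mean(axis=0)              # statistics computed once, on training data only

def fill_missing(x, strategy="mean"):
    """Replace missing (NaN) optional features so the model always sees 40 values."""
    x = x.copy()
    mask = np.isnan(x)
    if strategy == "zero":
        x[mask] = 0.0
    else:                                     # impute with the training-set column mean
        x[mask] = np.take(col_means, np.where(mask)[1])
    return x

sample = np.random.rand(1, 40)
sample[0, 30:] = np.nan                       # the 10 optional features are unknown here
model_input = fill_missing(sample)            # shape stays (1, 40), as the model expects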

CNN-LSTM structure: post vs pre padding?

In a structure like: CNN -> LSTM -> Dense
The input is of variable length (e.g. speech recognition with CTC) and needs to be padded.
Will the choice between pre and post padding affect the performance?
I read Effects of padding on LSTMs and CNNs
Is it true that pre vs post will not affect the performance as long as the input layer is CNN?
As the paper shows, applying post-padding to the LSTM input gives clearly worse performance.
Since LSTMs learn from a sequence of data and try to relate each step to the past, appending white noise after the real data, i.e. post-padding the LSTM input, dilutes the final states and prevents them from building this relation.
If pre-padding is applied to the CNN, I believe the performance of the LSTM will remain unaffected: the CNN is largely insensitive to the padding position, and the LSTM still receives the real data at the end of the sequence.
From my understanding, post-padding the CNN input effectively post-pads what the LSTM receives as well, and may therefore result in worse performance.
Generally speaking, you can get solid information by reading papers. In forums you will read about other people's opinions.
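Just to make the terminology concrete, this is what the two padding schemes look like with Keras' pad_sequences (independent of the CTC setup in the question):

import tensorflow as tf

sequences = [[1, 2, 3], [4, 5], [6]]

# Pre-padding: zeros are prepended, so the real data sits at the end of the sequence.
pre = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=5, padding="pre")
# Post-padding: zeros are appended after the real data.
post = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=5, padding="post")

print(pre)   # [[0 0 1 2 3], [0 0 0 4 5], [0 0 0 0 6]]
print(post)  # [[1 2 3 0 0], [4 5 0 0 0], [6 0 0 0 0]]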

Predict all probable trajectories in a grid structure using Keras

I'm trying to predict sequences of 2D coordinates. But I don't want only the single most probable future path; I want all the probable paths, so I can visualize them in a grid map.
For this I have training data consisting of 40000 sequences. Each sequence consists of 10 2D coordinate pairs as input and 6 2D coordinate pairs as labels.
All the coordinates are in a fixed value range.
What would be my first step to predict all the probable paths? To get all probable paths, I have to apply a softmax at the end, where each cell in the grid is one class, right? But how do I process the data to reflect this grid-like structure? Any ideas?
A softmax activation won't do the trick I'm afraid; if you have an infinite number of combinations, or even a finite number of combinations that do not already appear in your data, there is no way to turn this into a multi-class classification problem (or if you do, you'll have loss of generality).
The only way forward I can think of is a recurrent model employing variational encoding. To begin with, you have a lot of annotated data, which is good news; a recurrent network fed with a sequence X (10,2,) will definitely be able to predict a sequence Y (6,2,). But since you want not just one but rather all probable sequences, this won't suffice. Your implicit assumption here is that there is some probability space hidden behind your sequences, which affects how they play out over time; so to model the sequences properly, you need to model that latent probability space. A Variational Auto-Encoder (VAE) does just that; it learns the latent space, so that during inference the output prediction depends on sampling over that latent space. Multiple predictions over the same input can then result in different outputs, meaning that you can finally sample your predictions to empirically approximate the distribution of potential outputs.
Unfortunately, VAEs can't really be explained within a single paragraph over stackoverflow, and even if they could I wouldn't be the most qualified person to attempt it. Try searching the web for LSTM-VAE and arm yourself with patience; you'll probably need to do some studying but it's definitely worth it. It might also be a good idea to look into Pyro or Edward, which are probabilistic network libraries for python, better suited to the task at hand than Keras.
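For orientation only, here is a rough sketch of what a conditional LSTM-VAE for this exact data shape could look like in Keras. Layer sizes are arbitrary, and depending on your Keras version you may prefer a custom train_step for the KL term instead of add_loss:

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM = 16

class Sampling(layers.Layer):
    """Reparameterisation trick; also registers the KL divergence as a model loss."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        kl = -0.5 * tf.reduce_mean(
            tf.reduce_sum(1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1))
        self.add_loss(kl)
        eps = tf.random.normal(tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps

# Shared layers, so the training model and the sampling model use the same weights.
past_encoder = layers.LSTM(64)                 # encodes the observed 10-step history
future_encoder = layers.LSTM(64)               # encodes the 6-step future (training only)
to_mean, to_log_var = layers.Dense(LATENT_DIM), layers.Dense(LATENT_DIM)
decoder_lstm = layers.LSTM(64, return_sequences=True)
decoder_head = layers.TimeDistributed(layers.Dense(2))

def decode(condition, latent):
    seq = layers.RepeatVector(6)(layers.Concatenate()([condition, latent]))
    return decoder_head(decoder_lstm(seq))

# Training model (conditional VAE): the latent is inferred from past AND future.
past = layers.Input(shape=(10, 2))
future = layers.Input(shape=(6, 2))
h_past, h_future = past_encoder(past), future_encoder(future)
h = layers.Concatenate()([h_past, h_future])
z = Sampling()([to_mean(h), to_log_var(h)])
cvae = tf.keras.Model([past, future], decode(h_past, z))
cvae.compile(optimizer="adam", loss="mse")     # reconstruction loss + KL added by Sampling
# cvae.fit([X_past, Y_future], Y_future, epochs=..., batch_size=...)

# Inference model: sample z from the prior N(0, I) to generate many plausible futures.
z_in = layers.Input(shape=(LATENT_DIM,))
sampler = tf.keras.Model([past, z_in], decode(past_encoder(past), z_in))
# futures = [sampler.predict([x_past, np.random.normal(size=(1, LATENT_DIM))]) for _ in range(100)]

The sampled futures can then be rasterised onto your grid to estimate, per cell, how often a predicted trajectory passes through it.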

How can I evaluate FaceNet embeddings for face verification on LFW?

I am trying to create a script that can evaluate a model on the LFW dataset. The process is: I read a pair of images (using the LFW annotation list), track and crop the face, align it, and pass it through a pre-trained FaceNet model (a TensorFlow .pb) to extract the features. The feature vector size is (1, 128) and the input image is (160, 160).
To evaluate on the verification task, I am using a Siamese architecture. That is, I pass a pair of images (same or different person) through two identical models ([2 x FaceNet], which is equivalent to passing a batch of two images through a single network) and calculate the Euclidean distance of the embeddings. Finally, I train a linear SVM classifier on the pair labels to output 0 when the embedding distance is small and 1 otherwise. This way I am trying to learn a threshold to be used at test time.
Using this architecture I get a score of 60% at most. On the other hand, using the same architecture on other models (e.g. VGG-Face), where the features are 4096-dimensional [fc7:0] activations (not embeddings), I get 90%. I definitely cannot replicate the scores reported online (99.x%), but with the embeddings the score is very low. Is there something wrong with the pipeline in general? How can I evaluate the embeddings for verification?
Never mind, the approach is correct; the FaceNet model that is available online is poorly trained, and that is the reason for the poor score. Since this model was trained on a different dataset than the one described in the paper (obviously), the verification score will be lower than expected. However, if you set a constant threshold to the desired value, you can probably increase the true positives, at the cost of the F1 score.
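As a side note, instead of (or on top of) the SVM, the usual LFW-style evaluation just sweeps a distance threshold over the pair distances. A small sketch, where emb1, emb2 and labels are hypothetical arrays holding your pair embeddings and ground truth:

import numpy as np

def best_threshold(emb1, emb2, labels, thresholds=np.arange(0.0, 4.0, 0.01)):
    """emb1, emb2: (N, 128) embeddings of each pair; labels: (N,) with 1 = same person."""
    dist = np.linalg.norm(emb1 - emb2, axis=1)                 # Euclidean distance per pair
    accs = [np.mean((dist < t).astype(int) == labels) for t in thresholds]
    i = int(np.argmax(accs))
    return thresholds[i], accs[i]

# thr, acc = best_threshold(emb1, emb2, labels)
# The official protocol picks the threshold on 9 folds and reports accuracy on the 10th.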
You can also use a similarity search engine: approximate kNN search libraries such as Faiss or Nmslib, cloud-ready open-source similarity search tools such as Milvus, or a production-ready managed service such as Pinecone.io.
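For example, with Faiss an exact L2 index over face embeddings takes only a few lines (the arrays here are random placeholders for your enrolled and query embeddings):

import numpy as np
import faiss

d = 128                                                  # FaceNet embedding dimension
gallery = np.random.rand(10000, d).astype("float32")     # enrolled face embeddings
queries = np.random.rand(5, d).astype("float32")         # faces to verify/identify

index = faiss.IndexFlatL2(d)       # exact L2 search; switch to an IVF/HNSW index at scale
index.add(gallery)
distances, ids = index.search(queries, 5)                # 5 nearest gallery entries per query
print(ids.shape, distances.shape)                        # (5, 5) (5, 5)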

Tensorflow: Increasing number of duplicate predictions while training

I have a multilayer perceptron with 5 hidden layers of 256 neurons each. When I start training, I get a different prediction probability for each training sample up to about epoch 50, but then the number of duplicate predictions increases; by epoch 300 I already have 30% duplicate predictions, which does not make sense since the input data is different for all training samples. Any idea what causes this behavior?
Clarifications:
with "duplicate predictions", I mean items with the exactly same predicted probability to belong to class A (it's a binary classification problem)
I have 4000 training samples with 200 features each and all samples are different, it does not make sense that the number of duplicate predictions increases to 30% while training. So I wonder what can cause this behavior.
One point: you say you are doing binary prediction, and even with your clarification it's hard to understand what you mean by "duplicate predictions". I am guessing that your binary classifier has two outputs, one for class A and one for class B, and that you are getting roughly the same value for both on a given sample. If that's the case, the first thing to do is to use a single output. A binary classification problem is better modeled with one output that ranges between 0 and 1 (a sigmoid on the output neuron). That way there is no ambiguity: the network has to choose one class or the other, and when it's confused you'll get ~0.5, which will be clear.
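A minimal sketch of that single-sigmoid-output setup (layer sizes here are placeholders, not a recommendation):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(200,)),                     # 200 features per sample
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid")    # one output in [0, 1]: P(class A)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# A prediction near 0.5 now unambiguously means "the model is unsure".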
Second, it is very common for a network to start learning well and then perform more poorly after overtraining, especially with small datasets such as yours. In fact, even with the little knowledge I have of your dataset, I would put a small bet on you getting better performance out of an algorithm like XGBoost than out of a neural network (I assume you're using a neural net and not literally a perceptron).
But regarding the performance degrading over time: when this happens, you want to look into something called "early stopping". At some point the network starts memorizing the input, and that may be part of what's happening here. Essentially, you train only until the performance on your held-out validation data starts to worsen.
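In Keras, early stopping is just a callback; a sketch with placeholder values to tune:

import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",            # watch the held-out loss
    patience=20,                   # epochs to wait for an improvement before stopping
    restore_best_weights=True)     # roll back to the best epoch

# model.fit(X_train, y_train, validation_split=0.2, epochs=1000, callbacks=[early_stop])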
To address this you can apply various forms of regularization (L2 regularization, dropout, and batch normalization all come to mind). You can also reduce the size of your network: 5 layers of 256 neurons sounds too big for the problem. Try trimming this down and I bet your results will improve. There is a sweet spot for architecture size in neural networks: when your network is too large it can, and often will, overfit; when it's too small it won't be expressive enough for the data. Andrew Ng's Coursera class has some helpful practical advice on dealing with this.
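For instance, a trimmed-down, regularised variant of the network might look like this (all sizes and rates are guesses to be tuned, not recommendations):

import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(200,)),
    layers.Dense(64, activation="relu", kernel_regularizer=regularizers.l2(1e-4)),
    layers.BatchNormalization(),
    layers.Dropout(0.3),                                  # randomly drop 30% of the units
    layers.Dense(32, activation="relu", kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid")
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])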