How is the "Supervisor" constructed in the TimeGAN? - tensorflow

I am exploring the TimeGAN architecture (Yoon et al.) and have studied the code from the original authors, https://github.com/jsyoon0823/TimeGAN/blob/master/timegan.py, as well as a revised implementation written in TensorFlow 2, https://github.com/ydataai/ydata-synthetic/blob/e4cace4fb5c7ac47604364fbf300c7b072faab8a/src/ydata_synthetic/synthesizers/timeseries/timegan/model.py#L339. The paper itself seems to lack many details of the TimeGAN architecture, specifically those relating to the supervisor model. Line 132 of the first reference shows that the intended input for the supervisor is 'H', the latent representation. In lines 170 and 171, the supervisor is given two separate inputs: 'E_hat', which produces 'H_hat', and 'H', which produces 'H_hat_supervise'. The supervised loss is defined as the MSE between these two ('H' and 'H_hat_supervise'). This part doesn't make sense to me. If 'H' is the input to the supervisor and its output is compared directly back to 'H', wouldn't this just push the supervisor towards the identity function? Also, the trainable variables for this loss are defined as the generator weights plus the supervisor weights. Why are the generator weights included when the generator is not used in producing either 'H' or 'H_hat_supervise'?
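To make the structure I am describing concrete, here is a rough TF2/Keras paraphrase (this is not the authors' code; the supervisor architecture, shapes, and placeholder tensors are my own):
import tensorflow as tf
from tensorflow.keras import layers, models

hidden_dim = 24                                   # placeholder latent width
supervisor = models.Sequential([
    layers.GRU(hidden_dim, return_sequences=True),
    layers.Dense(hidden_dim, activation='sigmoid'),
])

# Placeholders standing in for the embedder output on real data (H) and the
# generator output in latent space (E_hat); both are (batch, time, hidden_dim).
H = tf.random.uniform((8, 10, hidden_dim))
E_hat = tf.random.uniform((8, 10, hidden_dim))

H_hat = supervisor(E_hat)            # fed onwards to the recovery/discriminator path
H_hat_supervise = supervisor(H)      # supervisor applied to the real latent codes

# The supervised loss as I understand it: MSE between H and H_hat_supervise,
# optimized with respect to both the generator's and the supervisor's weights.
G_loss_S = tf.reduce_mean(tf.keras.losses.mse(H, H_hat_supervise))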

Related

Is there a linter for model(inputs) of PyTorch like model.predict(inputs) of TensorFlow?

My goal is to do object detection. However, YOLOv7 and the tutorial I am following (a hack to create bounding boxes from the feature maps) use PyTorch.
The problem is: model(inputs) does not have typings.
The code (lines 148-150):
out = model(inputs)
probs, class_preds = torch.max(out[0], dim=-1)
feature_maps = out[1].to("cpu")
This forced me to debug the helper.py file to understand what out[0] and out[1] are. Currently, I assume that out[0] is the softmax probability and out[1] is the feature maps.
I think the answer is no; in general it is non-trivial to automatically infer the semantic meaning of the outputs of a neural network, since that meaning is a product of the semantic meaning of the inputs and the model structure itself. You could reference the YOLO model architecture provided in model.py (though, as an aside, you should provide the relevant code in your question itself rather than linking to external code) and investigate the structure of the outputs, then reference the structure of the labeled inputs (as the model, by definition, is learning to replicate the structure of the labels).
That being said, in your case the output is quite clearly the per-class probabilities and class indices, as shown in line 149:
probs, class_preds = torch.max(out[0], dim=-1)
since, per the PyTorch documentation, torch.max with a dim argument returns (maximum values, indices of the maxima).
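For illustration, a small standalone example of that behaviour (the tensor values here are made up):
import torch

# torch.max along a dimension returns a named tuple (values, indices)
logits = torch.tensor([[0.1, 0.7, 0.2],
                       [0.3, 0.3, 0.4]])
probs, class_preds = torch.max(logits, dim=-1)
print(probs)        # tensor([0.7000, 0.4000])  -> maximum value per row
print(class_preds)  # tensor([1, 2])            -> index of that maximum, i.e. the predicted class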

Multiple BERT binary classifications on a single graph to save on inference time

I have five classes and I want to compare four of them against one and the same class. This isn't a One vs Rest classifier, as for each output I want to score them against one base class.
The four outputs should be: base class vs classA, base class vs classB, etc.
I could do this by having multiple binary classification tasks, but that's wasting computation time if the first layers are BERT preprocessing + pretrained BERT layers, and the only differences between the four classifiers are the last few layers of BERT (finetuned ones) and the Dense layer.
So why not merge the graphs for more performance?
My inputs are four different datasets, each annotated with true/false for each class.
As I understand it, I can re-use most of the pipeline (BERT preprocessing and the first layers of BERT), as those have shared weights. I should then be able to train the last few layers of BERT and the Dense layer on top differently depending on the branch of the classifier (maybe using something like keras.switch?).
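Something like this Keras functional-API sketch is what I have in mind (the embedding/pooling stack is just a stand-in for the BERT preprocessing and pretrained layers; the input shape, sizes, and head names are placeholders):
import tensorflow as tf
from tensorflow.keras import layers

# Stand-in for the shared BERT preprocessing + pretrained encoder layers.
token_ids = tf.keras.Input(shape=(128,), dtype=tf.int32, name="token_ids")
shared = layers.Embedding(30522, 256)(token_ids)
shared = layers.GlobalAveragePooling1D()(shared)

# Four task-specific branches: base class vs. A, vs. B, vs. C, vs. D.
heads = []
for c in ["A", "B", "C", "D"]:
    branch = layers.Dense(64, activation="relu", name=f"branch_{c}")(shared)
    heads.append(layers.Dense(1, activation="sigmoid", name=f"base_vs_{c}")(branch))

model = tf.keras.Model(token_ids, heads)
model.compile(optimizer="adam", loss="binary_crossentropy")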
I have tried many alternative options, including multi-class and multi-label classifiers, with actual and generated (e.g., machine-annotated) labels in the case of multiple input labels, and with different activation and loss functions, but none of the results were acceptable to me (none were as good as the four separate models).
Is there a solution for merging the four different models for more performance, or am I stuck with using 4x binary classifiers?
When you train a DNN for a specific task it will (in the vast majority of cases) be better than a more general model that can handle several tasks simultaneously. That said, in my experience a properly trained general model produces results very similar to the original binary ones. Anyway, here are a couple of suggestions for training strategies (assuming your training datasets for each task are completely different):
Weak supervision approach
Train your binary classifiers and label your datasets with them (i.e., use the binary classifier trained on dataset 2 to label datasets 1, 3 and 4). Then train your joint model as a multi-label task using all the newly labeled datasets (don't forget to shuffle the samples before feeding them to the trainer ;)). Here you will need to experiment with whether to apply a threshold and set labels to 0/1, or to use the raw scores of the binary classifiers.
Create a custom loss function that does not penalize a class for which no information is provided. So when you introduce a sample from (say) dataset 2, the loss is calculated only for the 2nd class (a minimal sketch of such a loss follows below).
Of course, you can apply both simultaneously. For example, if you know that a binary classifier produces polarized scores (most results near 0 or 1), you can use weak labels and automatically label your data with its scores. Then, during the second stage, weight the loss of each sample by x' = 4(x - 0.5)^2, where x is the binary classifier's score (note that you get logits from the model, so you will need to apply a sigmoid first). This way you increase the contribution of the samples the binary classifier is confident about and reduce that of the less certain ones.
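A minimal sketch of the masked loss mentioned above, in TensorFlow (purely illustrative: the convention that -1 marks "no label available for this class" and the function name are my own, not from any library):
import tensorflow as tf

def masked_binary_crossentropy(y_true, y_pred_logits):
    # y_true:        (batch, num_classes) with entries 1, 0, or -1 ("no information for this class")
    # y_pred_logits: (batch, num_classes) raw logits from the joint model
    y_true = tf.cast(y_true, tf.float32)
    mask = tf.cast(tf.not_equal(y_true, -1.0), tf.float32)      # 1 where a label exists
    labels = tf.clip_by_value(y_true, 0.0, 1.0)                  # the -1 entries are masked out anyway
    per_class = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=y_pred_logits)
    per_class = per_class * mask                                 # no penalty for unknown classes
    return tf.reduce_sum(per_class) / tf.maximum(tf.reduce_sum(mask), 1.0)
The score weighting from the previous paragraph can be folded in by multiplying per_class by the 4(x - 0.5)^2 weights before the final reduction.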
As for unfreezing the last layers of BERT, unfreezing the upper 3-6 layers is usually enough. Unfreezing more layers improves results very little while increasing the time and memory requirements.
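As an illustration of unfreezing only the top layers, a sketch with the Hugging Face transformers API in PyTorch (the choice of four unfrozen layers is just an example; the same idea applies to a TF/Keras BERT):
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")

# Freeze the embeddings and all but the top 4 of the 12 encoder layers.
for param in bert.embeddings.parameters():
    param.requires_grad = False
for layer in bert.encoder.layer[:-4]:
    for param in layer.parameters():
        param.requires_grad = False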

ML/DL Prediction on whole input rather than row by row

I have tabular data from a sensor measuring various features. When the sensor is "off" it will report zero as values. I am training some machine learning models (kNN, XGBoost, and a NN) for classification. Here's the issue I am facing: I can train and predict on a row-by-row basis; however, it would be better to classify a range as a whole rather than row by row. Another issue is that the range can vary in size. For a very basic example, please see this diagram illustrating the range.
I have a basic Keras model:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

num_classes = 4                       # the data has 4 classes (see below)

model = Sequential()
# input_shape=(20,) matches the 20 features mentioned below; without it the
# model is not yet built and model.summary() cannot be printed.
model.add(Dense(100, activation='relu', input_shape=(20,)))
model.add(Dense(100, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
model.summary()
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
And the training data is shaped with 20 features and 4 classes. How would I:
1.) Format my training data
2.) Shape input data to classify as a "whole" rather than row by row
3.) While this has been about using Keras, can the same input shaping/training be applied to XGBoost or a kNN?
I assume that the blue line in that graph represents your targets. Here is a fundamental issue I see with something like predicting the range as a whole instead of sample by sample.
Assuming that there is some reasonable logic that could collapse the range of samples into one (taking the mean of each feature, concatenation, or whatever...), you would obviously first need to identify the range itself. This range-identification step is, however, dependent on knowledge of the target (at least it seems that way based on the presented graph).
If the preprocessing step is dependent on the knowledge of the target, you would need to know the target for the test set as well before you could preprocess the data and make the predictions. In other words, you would need to know the outcome before you could make the prediction which would then be rather pointless.
You have stated that you are trying to perform classification but your target seems to be continuous. I don't know what your classes are or what patterns they are associated with but you would need to bin the target before you could start solving this as a classification problem. You would most likely lose a lot of information by doing this.
Therefore, I would start by solving it as a regression problem, trying to predict that continuous target for each sample. Once you have that, you can apply some pattern-matching logic to identify the class for a given sample/range (for example, you could slice the sequence of targets/predictions from the previous step, associate each slice with the desired class, and use this data as a new dataset for some classification algorithm).
As for the variable-length inputs: some deep learning architectures, such as RNNs or adaptive pooling, allow you to work with inputs of variable length. You may try this once you know how to predict the continuous target as mentioned before. Non-deep-learning algorithms usually expect all samples to have the same shape, so there is no general/automatic way of reusing the same input between them and the deep learning algorithms that accept variable-length input.
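To make the variable-length option concrete, a minimal Keras sketch (the 20 features and 4 classes come from the question; padding ranges with zero rows, masking them, and the GRU size are assumptions):
import tensorflow as tf
from tensorflow.keras import layers

# Each training sample is an entire range of sensor rows: (time, 20 features),
# padded with all-zero rows to a common length within a batch.
inputs = tf.keras.Input(shape=(None, 20))
x = layers.Masking(mask_value=0.0)(inputs)   # skip the all-zero padding rows
x = layers.GRU(64)(x)                        # one fixed-size vector per range, whatever its length
outputs = layers.Dense(4, activation='softmax')(x)

model = tf.keras.Model(inputs, outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
For kNN or XGBoost, the usual workaround is instead to collapse each range into a fixed-length summary (per-feature means, minima/maxima, and so on), as mentioned above.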

Why does the BERT transformer use the [CLS] token for classification instead of an average over all tokens?

I am doing experiments on the BERT architecture and found that most fine-tuning tasks take the final hidden layer as the text representation and later pass it to other models for further downstream tasks.
BERT's last layer looks like this (image omitted): we take the [CLS] token of each sentence.
I went through many discussions on this (a Hugging Face issue, a Data Science forum question, a GitHub issue). Most data scientists give this explanation:
BERT is bidirectional, the [CLS] is encoded including all representative information of all tokens through the multi-layer encoding procedure. The representation of [CLS] is individual in different sentences.
My question is: why did the author ignore the other information (each token's vector) rather than taking the average, max-pooling, or some other method to make use of all the information, instead of using the [CLS] token for classification?
How does this [CLS] token help compared to the average of all token vectors?
The use of the [CLS] token to represent the entire sentence comes from the original BERT paper, section 3:
The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.
Your intuition is correct that averaging the vectors of all the tokens may produce superior results. In fact, that is exactly what is mentioned in the Huggingface documentation for BertModel:
Returns
pooler_output (torch.FloatTensor: of shape (batch_size, hidden_size)):
Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pre-training.
This output is usually not a good summary of the semantic content of the input, you’re often better with averaging or pooling the sequence of hidden-states for the whole input sequence.
Update: Huggingface removed that statement ("This output is usually not a good summary of the semantic content ...") in v3.1.0. You'll have to ask them why.
BERT is designed primarily for transfer learning, i.e., finetuning on task-specific datasets. If you average the states, every state is averaged with the same weight, including stop words and other tokens that are not relevant for the task. The [CLS] vector gets computed using self-attention (like everything in BERT), so it can only collect the relevant information from the rest of the hidden states. So, in some sense, the [CLS] vector is also an average over the token vectors, only computed more cleverly, specifically for the tasks that you fine-tune on.
Also, my experience is that when I keep the weights fixed and do not fine-tune BERT, using the token average yields better results.
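For completeness, a small example of both options (the [CLS] vector and a mask-aware token average) with the Hugging Face transformers API; the model name and sentences are arbitrary:
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["an example sentence", "another, slightly longer example sentence"],
                  padding=True, return_tensors="pt")
with torch.no_grad():
    out = model(**batch)

# Option 1: the [CLS] vector (first token of the last hidden layer)
cls_vec = out.last_hidden_state[:, 0, :]                       # (batch, hidden)

# Option 2: average over the real tokens only, using the attention mask
mask = batch["attention_mask"].unsqueeze(-1).float()           # (batch, seq, 1)
mean_vec = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)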

Seq2Seq for prediction of complex states

My problem:
I have a sequence of complex states and I want to predict the future states.
Input:
I have a sequence of states. Each sequence can be of variable length. Each state is a moment in time and is described by several attributes: [att1, att2, ...], where each attribute is a number within an interval ([0..5], [1..3651], ...).
The Seq2Seq example (and paper) is based on each state (word) being taken from a dictionary, so each state has around 80,000 possibilities. But how would you represent each state when it is taken from a set of vectors, and the set is just every possible combination of the attributes?
Is there any method to work with more complex states in TensorFlow? Also, what is a good method to decide the boundaries of your buckets when the relation between input length and output length is unclear?
May I suggest rephrasing and splitting your question into two parts? The first is really a general machine learning/LSTM question that's independent of TensorFlow: how to use an LSTM to predict when the sequence elements are general vectors. The second is how to represent this in TensorFlow. For the former, there's nothing really magical to do.
But a very quick answer: you've really just skipped the embedding lookup part of seq2seq. You can feed dense tensors into a suitably modified version of it -- your state is just a dense vector representation of the state, which is the same thing that comes out of an embedding lookup.
The vector representation tutorial discusses the preprocessing that turns, e.g., words into embeddings for use in later parts of the learning pipeline.
If you look at line 139 of seq2seq.py you'll see that the embedding_rnn_decoder takes in a 1D batch of things to decode (the dimension is the number of elements in the batch), but then uses the embedding lookup to turn it into a batch_size * cell.input_size tensor. You want to directly input a batch_size * cell.input_size tensor into the RNN, skipping the embedding step.
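In a modern Keras setting, the same idea of feeding dense state vectors directly (no integer ids, no embedding lookup) looks roughly like this; the attribute count and layer sizes are placeholders:
import tensorflow as tf
from tensorflow.keras import layers

num_attrs = 8                                   # hypothetical number of attributes per state

# Encoder: consume the observed sequence of dense state vectors directly.
encoder_in = tf.keras.Input(shape=(None, num_attrs))
_, enc_state = layers.GRU(128, return_state=True)(encoder_in)

# Decoder: predict the next state vector at each step (teacher forcing at train time).
decoder_in = tf.keras.Input(shape=(None, num_attrs))
dec_seq = layers.GRU(128, return_sequences=True)(decoder_in, initial_state=enc_state)
next_states = layers.TimeDistributed(layers.Dense(num_attrs))(dec_seq)

model = tf.keras.Model([encoder_in, decoder_in], next_states)
model.compile(optimizer='adam', loss='mse')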