Batch structure for training a ranking model with contrastive loss? - tensorflow

How do I choose my batches when training a deep ranking model with, e.g., a contrastive loss, where each query has 1 positive document and 2 negative samples?
So, this is about a ranking loss, which applies to e.g. the Quora question pair data or any other question/answer pairs that I want to rank using a deep learning ranking model or just a Siamese network.
The data would look like this: https://github.com/NTMC-Community/MatchZoo/blob/master/matchzoo/datasets/toy/train.csv
Now, I assume that how the batch is built is crucial, right? For every question, all corresponding positive and negative answers need to be contained in the same batch, right?

Different strategies can be used to build the batches and the triplets or pairs. Usually, the batches are built randomly, and then the hardest negative, or one of the hardest negatives, in the batch is picked.
So yes, positive and negative examples need to be contained inside a batch, and picking the negatives is crucial. But usually the effort goes into picking the proper negatives inside the batch, rather than into building the batches in a specific way.
This blog post explaining how ranking losses work may be useful: https://gombru.github.io/2019/04/03/ranking_loss/
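To make the "pick the hardest negative in the batch" idea concrete, here is a minimal NumPy sketch. It is not taken from the blog post: it assumes each row of the batch is a (query, positive document) pair, treats every other document in the batch as a candidate negative, and uses a triplet margin loss on cosine similarity. The function name and the margin value are made up for illustration.

```python
import numpy as np

def hardest_negative_triplet_loss(q, d, margin=0.2):
    """q: (B, D) query embeddings; d: (B, D) embeddings of each query's positive doc."""
    # L2-normalise so the dot product is a cosine similarity.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    sim = q @ d.T                          # (B, B): query i vs. document j
    pos = np.diag(sim)                     # similarity of each query to its own positive
    masked = sim.copy()
    np.fill_diagonal(masked, -np.inf)      # the positive must not be picked as a negative
    hardest = masked.argmax(axis=1)        # most similar "wrong" document per query
    neg = masked[np.arange(len(q)), hardest]
    loss = np.maximum(0.0, margin + neg - pos)
    return loss.mean(), hardest

# Randomly built toy batch: 4 queries and slightly perturbed positives.
rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
docs = queries + 0.1 * rng.normal(size=(4, 8))
loss, hardest = hardest_negative_triplet_loss(queries, docs)
```

In a real model the embeddings would come from the network and the loss would be written with the framework's differentiable ops; the selection logic stays the same.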

Related

How to implement feature importance on nominal categorical features in tree based classifiers?

I am using SKLearn XGBoost model for my binary classification problem. My data contains nominal categorical features (such as race) for which one hot encoding should be used to feed them to the tree based models.
On the other hand, the feature_importances_ attribute of XGBoost gives the importance of each column in the trained model. So if I do the encoding and then get the feature importance of the columns, the result will include names like race_2 and its importance.
What should I do to solve this problem and get a whole score for each nominal feature? Can I take the average of one hot encoded columns importance scores that belong to one feature? (like race_1, race_2 and race_3)
First of all, if your goal is to select the most useful features for later training, I would advise you to use regularization in your model. In the case of xgboost, you can tune the parameter gamma so the model would actually be more dependent on "more useful" features (i.e. tune the minimum loss reduction required for the model to add a partition leaf). Here is a good article on implementing regularization into xgboost models.
On the other hand, if you insist on doing feature importance, I would say grouping the encoded variables and simply adding them is not a good decision. This would result in feature-importance results that do not consider the relationship between these dummy variables.
My suggestion would be to take a look at the permutation tools for this. The basic idea is that you take your original dataset, shuffle the values in the column whose importance you want to calculate, evaluate the trained model on the shuffled data and record the score. Repeat this over the different columns; the drop in model performance caused by each one is a sign of its importance.
It is actually easier done than said: sklearn has this feature built in. Check out the example provided in here.
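As a rough sketch of how this can give one score per nominal feature: if the one-hot encoding lives inside a pipeline, sklearn's permutation_importance shuffles the raw race column, so the dummy columns are never scored separately. The toy data and column names are invented, and GradientBoostingClassifier is only a stand-in for the XGBoost classifier so the example depends on sklearn alone.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Invented toy data: one nominal feature, one useful numeric feature, one noise feature.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "race": rng.choice(["a", "b", "c"], size=n),
    "age": rng.normal(40, 10, size=n),
    "noise": rng.normal(size=n),
})
y = ((X["race"] == "a") | (X["age"] > 50)).astype(int)

# One-hot encode inside the pipeline so permutation acts on the raw column.
encode = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["race"])],
    remainder="passthrough",
)
model = make_pipeline(encode, GradientBoostingClassifier(random_state=0))
model.fit(X, y)

# Shuffle each original column several times and measure the drop in score.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
importances = dict(zip(X.columns, result.importances_mean))
```

Here importances has one entry each for race, age and noise. In practice you would compute it on a held-out set rather than on the training data.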

How to structure multi-output bayesian optimization

I am trying to use bayesian optimization for a multi-output problem, but am not 100% sure the best way to set it up.
I have a small number of inputs (5) and outputs (3-4) in my problem. For each output, I have a target value I would like to achieve. Ultimately I would like to minimize the MSE between the target vector (of 3-4 outputs) and the true outputs.
The simplest way to do this in my mind is to create a single model which models the MSE as a function of the problem inputs. Here, all historical data is first compressed into the single MSE, then this is used to train the GP.
However, I would instead like to create individual models (or a combined, multi-output model) that directly model the outputs of interest, instead of the ultimate cost function (MSE). Primarily, this is because I have noticed more accurate predictive results (of the combined MSE) when first modeling the individual outputs and then computing an MSE, instead of directly modeling the MSE.
My problem arises when creating the acquisition function when I have multiple outputs. Ideally, I'd like to use expected improvement (EI) as my acquisition function. However, I'm not sure how to either 1) combine the multiple output distributions into a single distribution (representing the probability of the combined MSE), which can then be used to determine overall EI or 2) how to combine multiple EI values into a single metric (i.e. combine the E.I. for each output into a unified E.I.).
When reading about multi-output BO, the most common approach seems to be to identify a frontier of solutions, however this is not 100% applicable, as ultimately I can convert the output vector into a single MSE (and the frontier becomes a point).
Is the best approach simply to model the combined MSE directly? Or is there a way that I can model the individual outputs, then combine these modeled outputs into a reasonable acquisition function?
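One way option 1) could be sketched, purely as an illustration and not as an established answer to this question: draw samples from the per-output posteriors, turn each sample into an MSE against the target, and average the improvement over the best MSE observed so far. This assumes the outputs are modeled by independent GPs with Gaussian posteriors; the function name and all numbers below are made up.

```python
import numpy as np

def mc_expected_improvement_mse(mean, std, target, best_mse, n_samples=2000, seed=0):
    """Monte Carlo EI on the MSE.

    mean, std: (n_candidates, n_outputs) posterior mean and standard deviation
               at each candidate input (assumed to come from independent GPs).
    target:    (n_outputs,) desired output vector.
    best_mse:  lowest MSE observed so far.
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_samples,) + mean.shape)
    samples = mean + std * eps                        # (S, C, K) sampled outputs
    mse = ((samples - target) ** 2).mean(axis=-1)     # (S, C) sampled MSE values
    improvement = np.maximum(best_mse - mse, 0.0)     # we are minimising the MSE
    return improvement.mean(axis=0)                   # (C,) estimated EI per candidate

target = np.array([1.0, 2.0, 3.0])
mean = np.array([[1.0, 2.0, 3.0],    # candidate predicted to hit the target
                 [3.0, 0.0, 5.0]])   # candidate predicted to miss it
std = np.full_like(mean, 0.1)
ei = mc_expected_improvement_mse(mean, std, target, best_mse=0.5)
```

The next point to evaluate would then be the candidate with the largest ei. Correlated outputs would need joint samples from a multi-output model instead of the independent draws used here.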

Understanding the Input Parameters in RNN

I'm having a hard time understanding the different jargon used with RNNs. The terms are the following:
batch_size, time_steps, inputs and instances.
Let me go through my understanding of each input parameter; please correct me where I'm wrong.
Suppose I've got a sequence of numbers and I want to predict the next number. The numbers are the following:
[1,2,3,4,5,....,100]
time_steps: This parameter means how far the RNN will look into the past before it predicts the future. For simplicity, I want to predict 1 number ahead, and I want to do so after seeing 10 numbers from the past. So, in this case, time_steps will be 10.
inputs: These are the values at each time step. At the first time step (t) the inputs are
t0: [1]
t1: [2]
.
.
.
t10: [10]
batch_size: This helps in efficient computation of RNN model. Suppose my batch_size is 2. In that case, at time_step 2, the RNN input will be
t0: [1]
t0: [11]
Then what is the use of instances? E.g. in this post, instances have been used, and there are multiple cases where instances are used. Does it mean each loop over a batch? E.g. if there are 5 batches, each of size 2, will there be 5 instances?
Please help me correct my understanding.
Thanks!
batch_size
Batch size, in general, is the size of the mini-batches constructed from the dataset. Since deep learning requires a lot of computation, it is better to operate on mini-batches, because processing several examples at once makes good use of the GPU.
time_steps
Since an RNN takes sequential inputs, the index of each element in the input sequence can be referred to as a time step of that sequence. For example, if [1,2,3,4,5,....,100] is a sequence, the index of each element in the sequence is a time step.
inputs
The term inputs has a broader meaning, so I am not sure if my definition is correct. According to my understanding, inputs to an RNN refers to individual inputs provided to RNN at each time step. For example, in [1,2,3,4,5,....,100], each element is an input to the RNN at a particular time step.
But in an abstract way, if someone asks, what is the input of your deep neural model? You can say, it is English sentences or images or audio clips or videos etc. In short, the meaning of the term inputs depends on the context.
instances
An instance, in general, refers to a single training/dev/test example in the dataset. For example, the sequence [1,2,3,4,5,....,100] can be a training instance in your dataset.
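A small NumPy sketch may help tie the four terms together. It uses the question's setup (the numbers 1..100, 10 steps of history, batch size 2) and assumes the common (batch_size, time_steps, features) layout; here each instance is one 10-step window plus the number that follows it, which is only one possible way to cut the sequence into examples.

```python
import numpy as np

sequence = np.arange(1, 101)     # [1, 2, ..., 100]
time_steps = 10                  # how many past values the RNN sees
batch_size = 2                   # how many instances are processed together

# Cut the sequence into overlapping windows of 10 inputs + 1 target.
windows = np.stack([
    sequence[i:i + time_steps + 1]
    for i in range(len(sequence) - time_steps)
])

# Inputs: one value (feature) per time step -> (instances, time_steps, features).
X = windows[:, :time_steps, np.newaxis].astype(np.float32)
# Target: the number that follows each window.
y = windows[:, time_steps].astype(np.float32)

# A mini-batch is just batch_size instances stacked along the first axis.
first_batch = X[:batch_size]     # shape (2, 10, 1)
```

With this split there are 90 instances; the first one has inputs 1..10 and target 11, and the first batch holds the windows starting at 1 and at 2.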
Hope this helps!
Alright pal, you did well learning those concepts. I had a hard time learning them correctly. Everything you know seems to be in order. As for "instances": they're basically a set of data. There's no fixed usage of the term "instances" in the deep learning community. Some people use it to refer to a different set of data or to batches of data. I rarely see it in papers.

Time series classification using LSTM - How to approach?

I am working on an experiment with LSTM for time series classification and I have been going through several HOWTOs, but still, I am struggling with some very basic questions:
Is the main idea for training the LSTM to take the same sample from every time series?
E.g. if I have time series A (with samples a1,a2,a3,a4), B(b1,b2,b3,b4) and C(c1,c2,c3,c4), then I will feed the LSTM with batches of (a1,b1,c1), then (a2,b2,c2) etc.? Meaning that all time series need to be of the same size/number of samples?
If so, could anyone more experienced be so kind as to describe to me very simply how to approach the whole process of training the LSTM and creating the classifier?
My intention is to use TensorFlow, but I am still new to this.
If your goal is classification, then your data should be a time series and a label. During training, you feed each series into the LSTM, look only at the last output, and backprop as necessary.
Judging from your question, you are probably confused about batching -- you can train on multiple items at once. However, each item in the batch gets its own hidden state, and only the parameters of the layers are updated.
The time series in a single batch should be of the same length. You should terminate each sequence with an END token and pad items that are too short with a special PAD token; the LSTM should learn that PADs after an END are useless.
There is no need for different batches to have the same number of items, nor to have items of the same length.
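Here is a rough NumPy sketch of the padding step. Instead of explicit END and PAD tokens it keeps the true lengths and a boolean mask, which is a swapped-in but common alternative that lets you read off the last real step of each series directly. The pad value, function name and toy series are made up.

```python
import numpy as np

PAD = 0.0  # hypothetical padding value

def pad_batch(sequences, pad_value=PAD):
    """Pad variable-length series to the longest one in this batch."""
    lengths = np.array([len(s) for s in sequences])
    max_len = lengths.max()
    batch = np.full((len(sequences), max_len), pad_value, dtype=np.float32)
    for i, s in enumerate(sequences):
        batch[i, :len(s)] = s
    # mask[i, t] is True where batch[i, t] is real data, False where it is padding.
    mask = np.arange(max_len)[None, :] < lengths[:, None]
    return batch, mask, lengths

# Three series of different lengths, one class label per series.
series = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6], [0.7, 0.8, 0.9]]
labels = np.array([0, 1, 0])

batch, mask, lengths = pad_batch(series)

# The classifier should read the LSTM output at the last real step of each
# series; shown here on the raw inputs for simplicity.
last_step = lengths - 1
last_values = batch[np.arange(len(series)), last_step]
```

Because padding is done per batch, a different batch can be padded to a different length, in line with the last point above.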

When training a deep learning model, does the order of the elements in my input dataset matter?

To be more specific, I'm dealing with an NLP problem and training an LSTM for word prediction given an initial sequence of words. My dataset is 200k reddit comments.
Does it matter if I randomly feed the examples one at a time (allowing repeated inputs) or if I feed them in a sequence (not allowing repetitions)?
Since your data is actually a set of comments, there is no need to process them in a sequence. In fact, typically it is better to process data in random order to make sure that network does not learn something which is order dependent. Repeats do not matter at all, as long as you sample uniformly.
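The two feeding schemes from the question can be sketched like this (the helper name and the sizes are made up): a fresh shuffle each epoch visits every comment exactly once in random order, while uniform sampling with replacement allows repeats.

```python
import numpy as np

def epoch_batches(n_examples, batch_size, rng):
    """Yield index batches covering every example once, in random order."""
    order = rng.permutation(n_examples)
    for start in range(0, n_examples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)

# Without replacement: reshuffle at the start of every epoch.
batches = list(epoch_batches(10, 4, rng))
seen = np.concatenate(batches)

# With replacement: draw indices uniformly, repeats allowed.
with_replacement = rng.integers(0, 10, size=4)
```

Either set of indices would then be used to pick comments from the dataset for the next training step.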