How to implement feature importance on nominal categorical features in tree based classifiers? - xgboost

I am using the scikit-learn interface of XGBoost for a binary classification problem. My data contains nominal categorical features (such as race), which should be one-hot encoded before being fed to tree-based models.
On the other hand, the feature_importances_ attribute of XGBoost gives the importance of each column in the trained model. So if I encode first and then read the feature importances, the result will include names like race_2 and its importance.
What should I do to get a single score for each nominal feature? Can I take the average of the importance scores of the one-hot encoded columns that belong to one feature (like race_1, race_2 and race_3)?

First of all, if your goal is to select the most useful features for later training, I would advise you to use regularization in your model. In the case of xgboost, you can tune the parameter gamma so that the model depends more on the "more useful" features (gamma is the minimum loss reduction required to make a further partition on a leaf node). Here is a good article on implementing regularization into xgboost models.
On the other hand, if you insist on doing feature importance, I would say grouping the encoded variables and simply adding them is not a good decision. This would result in feature-importance results that do not consider the relationship between these dummy variables.
My suggestion would be to take a look at permutation importance for this. The basic idea is that you take your dataset, shuffle the values of the column whose importance you want to measure, score the already trained model on the shuffled data and record how much the score drops. Repeat this for each column; the effect of each one on model performance is a sign of its importance.
It is actually easier done than said: sklearn has this built in (sklearn.inspection.permutation_importance). Check out the example provided here.
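As a minimal sketch of the permutation idea applied to a whole nominal feature: the one-hot columns that belong to one feature are shuffled together, with the same row permutation, which is equivalent to shuffling the original categorical column before encoding. The function name, the `groups` mapping and the toy `predict` below are assumptions made for illustration; in practice `predict` would wrap the fitted XGBoost model's `predict`. This version re-scores an already fitted model instead of retraining it.

```python
import random

def grouped_permutation_importance(predict, X, y, groups, n_repeats=10, seed=0):
    """Mean drop in accuracy when all columns of one feature are shuffled together.

    predict: callable taking a list of rows and returning predicted labels
             (stand-in for a fitted model's predict method).
    groups:  maps a feature name to the indices of its (one-hot) columns.
    """
    rng = random.Random(seed)

    def accuracy(rows):
        preds = predict(rows)
        return sum(p == t for p, t in zip(preds, y)) / len(y)

    baseline = accuracy(X)
    importances = {}
    for name, cols in groups.items():
        drops = []
        for _ in range(n_repeats):
            # One row permutation shared by every column of the group.
            order = list(range(len(X)))
            rng.shuffle(order)
            shuffled = [list(row) for row in X]
            for i, src in enumerate(order):
                for c in cols:
                    shuffled[i][c] = X[src][c]
            drops.append(baseline - accuracy(shuffled))
        importances[name] = sum(drops) / n_repeats
    return importances


# Toy data: columns 0-2 are race_1..race_3 (one-hot), column 3 is age.
X = [[int(i % 3 == k) for k in range(3)] + [20 + i] for i in range(60)]
y = [row[0] for row in X]  # label depends only on race_1

def toy_predict(rows):
    # Hypothetical "fitted model" that only looks at race_1.
    return [row[0] for row in rows]

importances = grouped_permutation_importance(
    toy_predict, X, y, groups={"race": [0, 1, 2], "age": [3]}
)
print(importances)
```

If the encoder sits inside an sklearn Pipeline, calling permutation_importance on the pipeline with the raw (un-encoded) columns should likewise give one score per original feature.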

Related

How to structure multi-output bayesian optimization

I am trying to use Bayesian optimization for a multi-output problem, but am not 100% sure of the best way to set it up.
I have a small number of inputs (5) and outputs (3-4) in my problem. For each output, I have a target value I would like to achieve. Ultimately I would like to minimize the MSE between the target vector (of 3-4 outputs) and the true outputs.
The simplest way to do this in my mind is to create a single model which models the MSE as a function of the problem inputs. Here, all historical data is first compressed into a single MSE value per observation, and this is then used to train the GP.
However, I would instead like to create individual models (or a combined, multi-output model) that directly model the outputs of interest, instead of the ultimate cost function (MSE). Primarily, this is because I have noticed more accurate predictions of the combined MSE when first modeling the individual outputs and then computing the MSE, rather than modeling the MSE directly.
My problem arises when creating the acquisition function when I have multiple outputs. Ideally, I'd like to use expected improvement (EI) as my acquisition function. However, I'm not sure how to either 1) combine the multiple output distributions into a single distribution (representing the probability of the combined MSE), which can then be used to determine overall EI or 2) how to combine multiple EI values into a single metric (i.e. combine the E.I. for each output into a unified E.I.).
When reading about multi-output BO, the most common approach seems to be to identify a frontier of solutions, however this is not 100% applicable, as ultimately I can convert the output vector into a single MSE (and the frontier becomes a point).
Is the best approach simply to model the combined MSE directly? Or is there a way that I can model the individual outputs, then combine these modeled outputs into a reasonable acquisition function?
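One way to approach option 1) can be sketched with Monte Carlo sampling, under the assumption that each output has an independent Gaussian posterior (mean and standard deviation) at a candidate point; a real multi-output GP may have correlated outputs, which this ignores. Output samples are drawn, converted to an MSE against the targets, and the improvement over the best MSE observed so far is averaged. All names below are hypothetical.

```python
import random

def expected_improvement_mse(means, stds, targets, best_mse, n_samples=5000, seed=0):
    """Monte Carlo estimate of E[max(best_mse - MSE, 0)] at one candidate point.

    means, stds: per-output posterior mean and standard deviation
                 (assumed independent Gaussians).
    targets:     desired value for each output.
    best_mse:    lowest MSE among the points evaluated so far.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        mse = sum(
            (rng.gauss(mu, sd) - t) ** 2 for mu, sd, t in zip(means, stds, targets)
        ) / len(targets)
        total += max(best_mse - mse, 0.0)
    return total / n_samples


targets = [1.0, 2.0, 3.0]
ei_near = expected_improvement_mse([1.0, 2.0, 3.0], [0.01] * 3, targets, best_mse=0.5)
ei_far = expected_improvement_mse([5.0, 5.0, 5.0], [0.01] * 3, targets, best_mse=0.5)
print(ei_near, ei_far)
```

Under the same independence assumption, the expected MSE itself has a closed form, the average over outputs of (mean - target)^2 + std^2, which can serve as a cheap sanity check on the sampled values.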

How to inject future sequential input for multi-step time series forecasts in LSTM networks

I am trying to do multi-step (i.e., sequence-to-sequence) forecasts for product sales using both (multivariate) sequential and non-sequential inputs.
Specifically, I am using sales numbers as well as some other sequential inputs (e.g., price, is day before holiday, etc...) of the past n days to predict the sales for future m days. Additionally, I have some non-sequential features characterizing the product itself.
Definitions:
n_seq_features <- number of sequential features (in the multivariate time-series) including sales
n_non_seq_features <- number of non-sequential features characterizing a product
I got as far as building a hybrid-model, where first the sequential input is passed through some LSTM layers. The output of the final LSTM layer is then concatenated with the non-sequential features and fed into some dense layers.
What I can't quite get my head around, though, is how to input future sequential input (everything except sales numbers for the following m days) in a way that efficiently utilizes the sequential information (i.e., causality, etc...). For m=1, I can simply input the sequential data for this one day together with the non-sequential input after the LSTM layers; however, as soon as m becomes greater than 1, this appears to be a waste of causal information.
The only ways I could think of were:
to incorporate the sequential information for future m days as features in the LSTM input blowing up the input shape from (..., n, n_seq_features) to (..., n, n_seq_features + m*(n_seq_features-1))
add a separate LSTM branch handling the future data, the output of which is then 'somehow' fed into the dense layers at the last stage of the model
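To make the first option concrete, here is a minimal sketch of how the augmented input could be built for a single sample, using plain Python lists so the shape arithmetic is easy to check. The function name and the dummy values are assumptions for illustration only.

```python
def augment_with_future(past, future):
    """Append the flattened future covariates to every past time step.

    past:   n rows, each with n_seq_features values (sales included).
    future: m rows, each with n_seq_features - 1 values (no sales).
    Returns n rows with n_seq_features + m * (n_seq_features - 1) values.
    """
    flat_future = [value for day in future for value in day]
    return [list(step) + flat_future for step in past]


n, m, n_seq_features = 7, 3, 4
past = [[float(t)] * n_seq_features for t in range(n)]                 # dummy history
future = [[float(100 + d)] * (n_seq_features - 1) for d in range(m)]   # dummy known covariates

augmented = augment_with_future(past, future)
print(len(augmented), len(augmented[0]))
```

Note that every time step carries the same copy of the future block, so the order of the future days is visible to the network only as feature positions, not as sequence steps.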
I only started using LSTM networks a while ago so I unfortunately have only limited intuition on how they are best utilized (especially in hybrid approaches). For this reason, I would like to ask:
Is the general approach of injecting sequential and non-sequential input at different stages of the same model (i.e., trained concurrently) useful or would one rather split it into separate models which can be trained independently for more fine-grained control?
How is future sequential input injected into an LSTM network to preserve causal information? Can this be achieved with a high-level frontend like Keras or does it require a very deep dive into the TensorFlow backend?
Are LSTM networks not the way to go for this specific problem in the first place?
Cheers and thanks in advance for any advice, resources or thoughts on the matter. :)
In case someone is having a similar issue with future sequential (or temporal) data, University of Oxford and Google Cloud AI have come up with a new architecture to handle all three types of input (past temporal, future temporal as well as static). It is called Temporal Fusion Transformer and, at least from reading the paper, looks like a neat fit. However, I have yet to implement and test it. There is also a PyTorch Tutorial available.

Batch structure for training a ranking model with contrastive loss?

How do I choose my batch if I train a deep ranking model with, e.g., a contrastive loss, where each query has 1 positive document and 2 negative samples?
So, it is about a ranking loss, which applies to, e.g., the Quora question pair data or any other question/answer pairs which I want to rank using a deep learning ranking model or just a Siamese network.
The data would look like this: https://github.com/NTMC-Community/MatchZoo/blob/master/matchzoo/datasets/toy/train.csv
Now, I assume that it is crucial how the batch is built, right? Since for every question all corresponding positive and negative answers need to be contained inside a batch, right?
Different strategies can be used to build the batches and the triplets or pairs. Usually, the batches are built randomly, and then the hardest negative, or one of the hardest negatives, in the batch is picked.
So yes, positive and negative examples need to be contained inside a batch, and picking the negatives is crucial. But usually the effort goes into picking the proper negatives inside the batch, rather than into building the batches in a specific way.
This blog post explaining how ranking losses work may be useful: https://gombru.github.io/2019/04/03/ranking_loss/
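The in-batch hardest-negative strategy described above can be sketched as follows, using a triplet margin loss on plain Python lists. This assumes that `queries[i]` and `docs[i]` form the positive pair and that every other document in the randomly built batch may serve as a negative; the function names and the margin value are illustrative, not taken from any particular library.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def batch_hard_triplet_loss(queries, docs, margin=1.0):
    """Triplet margin loss using the hardest in-batch negative for each query.

    queries[i] and docs[i] are embeddings of a matching (positive) pair;
    all docs[j] with j != i are treated as candidate negatives.
    """
    losses = []
    for i, q in enumerate(queries):
        pos_dist = euclidean(q, docs[i])
        # Hardest negative = the non-matching document closest to the query.
        neg_dist = min(euclidean(q, d) for j, d in enumerate(docs) if j != i)
        losses.append(max(pos_dist - neg_dist + margin, 0.0))
    return sum(losses) / len(losses)


queries = [[0.0, 0.0], [1.0, 0.0]]
docs = [[0.0, 2.0], [1.0, 2.0]]
loss = batch_hard_triplet_loss(queries, docs)
print(loss)
```

With explicit negatives per query, as in the linked toy data, the same loss could be computed over those negatives instead of (or in addition to) the other documents in the batch.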

Predict all probable trajectories in a grid structure using Keras

I'm trying to predict sequences of 2D coordinates. But I don't want only the most probable future path but all the most probable paths to visualize it in a grid map.
For this I have training data consisting of 40000 sequences. Each sequence consists of 10 2D coordinate pairs as input and 6 2D coordinate pairs as labels.
All the coordinates are in a fixed value range.
What would be my first step to predict all the probable paths? To get all probable paths I have to apply a softmax in the end, where each cell in the grid is one class right? But how to process the data to reflect this grid like structure? Any ideas?
A softmax activation won't do the trick I'm afraid; if you have an infinite number of combinations, or even a finite number of combinations that do not already appear in your data, there is no way to turn this into a multi-class classification problem (or if you do, you'll have loss of generality).
The only way forward I can think of is a recurrent model employing variational encoding. To begin with, you have a lot of annotated data, which is good news; a recurrent network fed with a sequence X (10,2,) will definitely be able to predict a sequence Y (6,2,). But since you want not just one but rather all probable sequences, this won't suffice. Your implicit assumption here is that there is some probability space hidden behind your sequences, which affects how they play out over time; so to model the sequences properly, you need to model that latent probability space. A Variational Auto-Encoder (VAE) does just that; it learns the latent space, so that during inference the output prediction depends on sampling over that latent space. Multiple predictions over the same input can then result in different outputs, meaning that you can finally sample your predictions to empirically approximate the distribution of potential outputs.
Unfortunately, VAEs can't really be explained within a single paragraph over stackoverflow, and even if they could I wouldn't be the most qualified person to attempt it. Try searching the web for LSTM-VAE and arm yourself with patience; you'll probably need to do some studying but it's definitely worth it. It might also be a good idea to look into Pyro or Edward, which are probabilistic network libraries for python, better suited to the task at hand than Keras.
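The last step described above, sampling the model repeatedly and approximating the distribution of outputs empirically, can be sketched as a grid accumulation. The sampler here is a hypothetical stand-in (a random walk), not a trained LSTM-VAE; in practice it would be one stochastic forward pass of the model for a fixed input sequence. The coordinate range and grid size are also assumptions.

```python
import random

def to_cell(x, y, lo, hi, n_cells):
    """Map a coordinate in the fixed range [lo, hi] to a (row, col) grid cell."""
    scale = n_cells / (hi - lo)
    col = min(max(int((x - lo) * scale), 0), n_cells - 1)
    row = min(max(int((y - lo) * scale), 0), n_cells - 1)
    return row, col

def occupancy_grid(sample_path, n_samples, lo, hi, n_cells):
    """Fraction of sampled paths that pass through each grid cell."""
    counts = [[0] * n_cells for _ in range(n_cells)]
    for _ in range(n_samples):
        # A set, so that a path visiting a cell twice is counted once.
        visited = {to_cell(x, y, lo, hi, n_cells) for x, y in sample_path()}
        for row, col in visited:
            counts[row][col] += 1
    return [[c / n_samples for c in row] for row in counts]


rng = random.Random(0)

def fake_sampler():
    # Hypothetical stand-in for one sampled prediction: a 6-step random
    # walk starting at the centre of a [0, 10] x [0, 10] area.
    x, y, path = 5.0, 5.0, []
    for _ in range(6):
        x += rng.uniform(-1.0, 1.0)
        y += rng.uniform(-1.0, 1.0)
        path.append((x, y))
    return path

grid = occupancy_grid(fake_sampler, n_samples=500, lo=0.0, hi=10.0, n_cells=10)
```

The resulting grid can be drawn as a heat map, where each cell shows how often the sampled futures pass through it.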

Reducing false positive in CNN (Conv1D) text classification model

I created a char-based CNN model for text classification on keras + tensorflow - mainly using Conv1D, mainly based on:
http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
The model is performing very well, with 80%+ accuracy on the test data set. However, I'm having problems with false positives. One of the reasons could be that the final layer is a Dense layer with a softmax activation function.
To give an idea of how the model is performing, I train the model with data set with 31 classes with 1021 samples, the performance is ~85% on 25% test data set
However, if you include negative inputs (ones that belong to none of the classes), the performance is pretty bad (I didn't run a separate test set with such inputs, since it's pretty obvious just from testing by hand): every input gets a corresponding prediction. For example, a sentence like acasklncasdjsandjas can result in the class ask_promotion.
Are there any best practices on how to deal with false positives in this case?
My idea is to:
Implement a noise class where samples are just a set of totally random text. However this doesn't seem to help since the noise doesn't contain any pattern thus it would be difficult to train the model
Replace softmax with something that doesn't require the output probabilities to sum to 1, so small values can stay small regardless of other values. I did some research on this but there's not much information on changing the activation function for this specific case
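The second idea can be illustrated with a small sketch: softmax always produces a distribution that sums to 1, so even uniformly low logits yield a confident-looking winner, whereas independent sigmoid outputs can all stay low and be rejected by a threshold. The label names other than ask_promotion, the logits and the 0.5 threshold are made up for illustration; in Keras this would roughly correspond to a final Dense layer with a sigmoid activation trained with a binary cross-entropy loss.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_or_reject(logits, labels, threshold=0.5):
    """Return the best label, or None if no class score reaches the threshold."""
    scores = [sigmoid(z) for z in logits]
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best] if scores[best] >= threshold else None


labels = ["ask_promotion", "ask_price", "greeting"]  # hypothetical class names
noise_logits = [-4.0, -6.0, -5.0]                    # every class looks unlikely

print(softmax(noise_logits))                    # still sums to 1
print(predict_or_reject(noise_logits, labels))  # rejected
print(predict_or_reject([3.0, -6.0, -5.0], labels))
```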
That sounds like the issue of imbalanced data, where two classes have completely different supports (the number of instances in each class). This issue is particularly crucial in the task of hierarchical classification, in which some classes with a deep hierarchy tend to have many more instances than the others.
Anyway, let's simplify the issue to binary classification, and call the class with much more support Class-A and the one with less support Class-B. Generally speaking, there are two popular ways to circumvent this issue.
Under-sampling: You keep Class-B as is. Then you sample as many instances from Class-A as there are in Class-B. Combine these instances and train your classifier with them.
Over-sampling: You keep Class-A as is. Then you sample instances from Class-B, with replacement, until you have as many as in Class-A. Combine and train as in the under-sampling case.
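Both strategies can be sketched with the standard library alone. The function names and the toy data are assumptions for illustration; dedicated libraries offer more refined variants.

```python
import random

def undersample(majority, minority, seed=0):
    """Keep the minority class as is; draw an equal number from the majority."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + list(minority)

def oversample(majority, minority, seed=0):
    """Keep the majority class as is; draw from the minority with replacement."""
    rng = random.Random(seed)
    return list(majority) + rng.choices(minority, k=len(majority))


class_a = [("a", i) for i in range(100)]  # much more support
class_b = [("b", i) for i in range(10)]   # less support

under = undersample(class_a, class_b)
over = oversample(class_a, class_b)
print(len(under), len(over))
```

Resampling should be applied to the training split only, so that the test set keeps the original class distribution.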
For more information, please refer to this KDNuggets page.
https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html
Hope this helps. :P