eli5 and SHAP for interpreting XGBoost results

I have used the XGBoost algorithm and tried both eli5 and SHAP to interpret the results of the regression, but I got some contradictory results (screenshots below).
I do not entirely understand the difference between eli5 and SHAP, and I would like to find out which interpretation to rely on more. I would appreciate suggestions and insights into that.

It's probably impossible to say that one is more reliable than the other; it depends on what you're looking for. You could ask over at stats.SE or datascience.SE for more detail about how eli5 and shap produce their values. It appears that eli5.show_weights just delegates to xgboost's internal feature importances, based on gain (by default), weight, or cover.
All that said, these aren't contradictory. Both the shap plot and the eli5 weights suggest that chassis_1 is the more important variable: it has larger (in absolute value) shap values as well as a higher importance score.
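For reference, here is a minimal sketch of how the two views can be produced side by side for an XGBoost regressor. The data and feature names are made up (only chassis_1 comes from the question), and I'm assuming eli5's XGBoost integration accepts the importance_type argument as documented.

# Made-up data; compare eli5's gain-based importances with SHAP values.
import numpy as np
import pandas as pd
import xgboost
import eli5
import shap

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["chassis_1", "feat_a", "feat_b"])
y = 3 * X["chassis_1"] + 0.5 * X["feat_a"] + rng.normal(scale=0.1, size=500)

model = xgboost.XGBRegressor(n_estimators=100).fit(X, y)

# eli5 delegates to XGBoost's built-in importances ("gain" by default;
# "weight" and "cover" are also available).
eli5.show_weights(model, importance_type="gain")

# SHAP attributes each individual prediction to the features; averaging
# the absolute values gives a global ranking on the scale of the target.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)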

Related

APIs for making inferences in GPflow

I have built some Gaussian process models in GPflow and trained them successfully, but I cannot find APIs that help me make inferences straightforwardly in GPflow, such as separating the contributions of different kernels in a GPR model.
I know that I can do it manually, by calculating the covariance matrices, inverting and multiplying, but such work becomes quite tedious as the model gets more complex, e.g. a multi-output SVGP model. Any suggestions?
Thanks in advance!
If you want to, e.g., decompose an additive kernel, I think the easiest way for vanilla GPR would be to just swap the kernel for the component you're interested in, while keeping the learned hyperparameters.
I'm not totally sure about it, but I think it could also work for SVGP, since the approximation itself is just a standard GP using the same kernel but conditioned on the inducing points.
However, I'm not sure whether the decomposition of the variational approximation can be assumed to be close to the decomposition of the true posterior.
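A minimal sketch of that idea, assuming the GPflow 2 API and made-up data, could look like this (copying the learned noise variance into the second model is an extra assumption on my part):

# Train a GPR with an additive kernel, then reuse one learned component.
import gpflow
import numpy as np

X = np.random.rand(50, 1)
Y = np.sin(12 * X) + 0.5 * X + 0.1 * np.random.randn(50, 1)

k_periodic = gpflow.kernels.Periodic(gpflow.kernels.SquaredExponential())
k_linear = gpflow.kernels.Linear()
model = gpflow.models.GPR((X, Y), kernel=k_periodic + k_linear)
gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)

# The component kernels keep their trained hyperparameters, so a second GPR
# using only one of them gives that component's contribution, as suggested above.
sub_model = gpflow.models.GPR((X, Y), kernel=k_periodic)
sub_model.likelihood.variance.assign(model.likelihood.variance)

X_new = np.linspace(0, 1, 100)[:, None]
mean, var = sub_model.predict_f(X_new)

Whether this matches the exact additive decomposition of the posterior is a separate question, as noted above.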

Stratified KFold

If I am correct, stratified k-fold is used so that the ratio of the dependent variable in the splits is similar to that in the original data.
What I want to understand is why it is necessary or important to retain that ratio.
Is it necessary for fraud detection problems, where the data is highly imbalanced?
If yes, why?
Taken from https://stats.stackexchange.com/questions/49540/understanding-stratified-cross-validation
Cross-validation article in Encyclopedia of Database Systems says:
Stratification is the process of rearranging the data as to ensure each fold is a
good representative of the whole. For example in a binary classification problem
where each class comprises 50% of the data, it is best to arrange the data such
that in every fold, each class comprises around half the instances.
About the importance of the stratification, Kohavi (A study of cross-validation
and bootstrap for accuracy estimation and model selection) concludes that:
stratification is generally a better scheme, both in terms of bias and variance,
when compared to regular cross-validation.
All metrics are calculated against the true labels. If there is a bias in the system, say it predicts more of one label, a fold containing more of that label would give artificially inflated results.
A methodology to take care of that is to ensure the distribution of true labels is very similar in each fold. The aggregated results are then more indicative of overall system performance.
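A quick scikit-learn sketch (the ~5% fraud rate and all data here are made up) shows what stratification buys you on an imbalanced target:

# Compare the positive-class rate per test fold for plain vs. stratified K-fold.
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.05).astype(int)   # ~5% positive ("fraud") class
X = rng.normal(size=(1000, 3))

for name, splitter in [("KFold", KFold(n_splits=5, shuffle=True, random_state=0)),
                       ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0))]:
    rates = [y[test].mean() for _, test in splitter.split(X, y)]
    print(name, [round(r, 3) for r in rates])

StratifiedKFold keeps the positive rate of every test fold close to the overall 5%, while plain KFold can drift noticeably on rare classes.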

Why would I choose a loss-function differing from my metrics?

When I look through tutorials on the internet or at models posted here on SO, I often see that the loss function differs from the metrics used to evaluate the model. This might look like:
model.compile(loss='mse', optimizer='adadelta', metrics=['mae', 'mape'])
Anyhow, following this example, why wouldn't I optimize 'mae' or 'mape' as the loss instead of 'mse', when I don't even care about 'mse' among my metrics (hypothetically speaking, if this were my model)?
In many cases the metric you are interested in might not be differentiable, so you cannot use it as a loss. This is the case for accuracy, for example, where cross-entropy loss is used instead because it is differentiable.
For metrics that are already differentiable, you just want to get additional information from the learning process, as each metric measures something different. For example, MSE has a scale that is the square of the scale of the data/predictions, so to stay on the same scale you have to use RMSE or MAE. MAPE gives you the relative (not absolute) error, so all of these metrics measure something different that might be of interest.
In the case of accuracy, this metric is used because it is easily interpretable by a human, while cross-entropy loss is less intuitive to interpret.
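As a rough illustration, assuming tf.keras and toy made-up models, the pattern usually looks like this:

# Optimize a differentiable surrogate loss while monitoring the metrics you report.
import tensorflow as tf

# Classification: accuracy is not differentiable, so cross-entropy is
# minimized and accuracy is only tracked as a metric.
classifier = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
classifier.compile(loss="sparse_categorical_crossentropy",
                   optimizer="adam",
                   metrics=["accuracy"])

# Regression: train on MSE but also watch MAE (same units as the data)
# and RMSE (same units, derived from the optimized loss).
regressor = tf.keras.Sequential([tf.keras.layers.Dense(1)])
regressor.compile(loss="mse",
                  optimizer="adam",
                  metrics=["mae", tf.keras.metrics.RootMeanSquaredError()])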
That is a very good question.
Based on your modeling, you should choose a convenient loss function to minimize in order to achieve your goals.
But to evaluate your model, you will report the quality of its generalization using some metrics.
For many reasons, the evaluation part might differ from the optimization criterion.
To give an example, in generative adversarial networks, many papers suggest that minimizing an mse loss leads to blurrier images, whereas mae helps produce sharper output. You might want to track both of them during evaluation to see how things actually change.
Another possible case is when you have a customized loss but still want to report the evaluation based on accuracy.
I can also think of cases where you set the loss function so that training converges faster or better, while you measure the quality of the model with some other metrics.
Hope this helps.
I just asked myself that question when I came across a GAN implementation that uses mae as the loss. I already knew that some metrics are not differentiable and thought that mae is an example, albeit only at x = 0. So is there simply an exception, like just assuming a slope of 0 there? That would make sense to me.
I also wanted to add that I learned to use mse instead of mae because a small error gets even smaller when squared, while bigger errors grow in relative magnitude. So bigger errors are penalized more with mse.
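A tiny worked example of that point:

# Squaring shrinks small residuals and inflates large ones,
# so MSE penalizes outliers much harder than MAE.
errors = [0.5, 1.0, 3.0]
print([e ** 2 for e in errors])   # [0.25, 1.0, 9.0]  (MSE terms)
print([abs(e) for e in errors])   # [0.5, 1.0, 3.0]   (MAE terms)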

Standard parameter representation in neural networks

Many times I have seen, in neural-network forward propagation, example vectors multiplied from the left (vector-matrix) and sometimes from the right (matrix-vector). Notation, some TensorFlow tutorials, and the datasets I have found seem to prefer the former over the latter, contrary to the way linear algebra tends to be taught (the matrix-vector way).
Moreover, these represent opposite ways of laying out the parameters: enumerating problem variables along dimension 0, or enumerating neurons along dimension 0.
This confuses me and makes me wonder whether there is really a standard here or whether it has only been coincidence. If there is, I would like to know whether the standard follows some deeper reasons. I would feel much better having this question answered.
(By the way, I know that you will normally use example matrices instead of vectors [or more complex things in conv nets, etc.] because of the use of minibatches, but the point still holds.)
Not sure if this answer is what you are looking for, but in the context of TensorFlow, the standard is to use a dense layer (https://www.tensorflow.org/api_docs/python/tf/layers/dense), which is a higher-level abstraction that wraps up the affine transformation logic you are referring to.
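To make the two conventions concrete, here is a small NumPy sketch (shapes made up) showing that the row form x @ W used with batch-first data and the textbook column form W.T @ x give the same numbers, just laid out differently:

# Row-vector (batch-first) layout vs. column-vector (textbook) layout.
import numpy as np

n_features, n_units, batch = 4, 3, 2
W = np.random.randn(n_features, n_units)   # dim 0 enumerates input features
b = np.zeros(n_units)

X_batch = np.random.randn(batch, n_features)     # one example per row
out_row_form = X_batch @ W + b                   # shape (batch, n_units)

x_col = X_batch[0].reshape(-1, 1)                # one example as a column vector
out_col_form = W.T @ x_col + b.reshape(-1, 1)    # shape (n_units, 1)

assert np.allclose(out_row_form[0], out_col_form.ravel())  # same result, different layout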

Predicting new values in logistic regression

I am building a logistic regression model in tensorflow to approximate a function.
When I randomly select training and testing data from the complete dataset, I get a good result like so (blue are training points; red are testing points, the black line is the predicted curve):
But when I select spatially separate testing data, I get a terrible predicted curve, like so:
I understand why this is happening. But shouldn't a machine learning model learn these patterns and predict new values?
Similar thing happens with a periodic function too:
Am I missing something trivial here?
P.S. I did google this query for quite some time but was not able to get a good answer.
Thanks in advance.
What you are trying to do here is not related to logistic regression. Logistic regression is a classifier and you are doing regression.
No, machine learning systems aren't smart enough to learn to extrapolate functions like you have here. When you fit the model you are telling it to find an explanation for the training data. It doesn't care what the model does outside the range of training data. If you want it to be able to extrapolate then you need to give it extra information. You could set it up to assume that the input belonged to a sine wave or a quadratic polynomial and have it find the best fitting one. However, with no assumptions about the form of the function you won't be able to extrapolate.
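As a rough sketch of that "extra information" idea (the data and starting values here are made up for illustration), assuming a sinusoidal form and fitting its parameters lets you extrapolate where a generic model cannot:

# Assume the data come from a sine wave and fit that parametric form.
import numpy as np
from scipy.optimize import curve_fit

def sine_model(x, amplitude, frequency, phase, offset):
    return amplitude * np.sin(frequency * x + phase) + offset

x_train = np.linspace(0, 10, 200)
y_train = 2.0 * np.sin(1.5 * x_train + 0.3) + 0.1 * np.random.randn(200)

# A reasonable initial guess helps the fit converge.
params, _ = curve_fit(sine_model, x_train, y_train, p0=[1.0, 1.0, 0.0, 0.0])

# A generic regressor trained on [0, 10] has no basis for x > 10, but the
# parametric fit extrapolates because the functional form was assumed.
x_far = np.linspace(10, 20, 100)
y_far = sine_model(x_far, *params)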