SHAP DeepExplainer: shap_values containing "nan" values

SHAP DeepExplainer: shap_values containing "nan" values - tensorflow

I have an issue with my shap values, here is my model:
Model: "model_4"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_5 (InputLayer) [(None, 158)] 0
__________________________________________________________________________________________________
model_1 (Model) (None, 158) 57310 input_5[0][0]
__________________________________________________________________________________________________
subtract_4 (Subtract) (None, 158) 0 input_5[0][0]
model_1[5][0]
__________________________________________________________________________________________________
multiply_4 (Multiply) (None, 158) 0 subtract_4[0][0]
subtract_4[0][0]
__________________________________________________________________________________________________
lambda_4 (Lambda) (None,) 0 multiply_4[0][0]
__________________________________________________________________________________________________
reshape_3 (Reshape) (None, 1) 0 lambda_4[0][0]
==================================================================================================
Total params: 57,310
Trainable params: 57,310
Non-trainable params: 0
__________________________________________________________________________________________________
And I call :
scores = new_model.predict(X_test_scaled)
scores = scores.reshape(scores.shape[0],1)
toexplain = np.append(X_test_scaled, scores, axis = 1)
toexplain = pd.DataFrame(toexplain)
toexplain.sort_values(by = [158], ascending=False, inplace=True)
toexplain = toexplain.iloc[0:16]
toexplain.drop(columns = [158], axis = 1, inplace = True)
explainer=shap.DeepExplainer(new_model, df_sampled_X_train_scaled)
shap_values = explainer.shap_values(toexplain, check_additivity=False)
But my shap values look like this (for the first instance):
shap_values[0]
array([ nan, nan, nan, 0.08352888, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, 0.03286453,
nan, nan, 0.2984612 , nan, nan,
nan, 0.01110088, -0.85235232, nan, nan,
nan, nan, nan, nan, -0.27935541,
nan, nan, nan, nan, nan,
nan, nan, -0.18422949, 0.01466912, nan,
nan, nan, -0.1688329 , 0.07462809, 0.03071906,
nan, -0.00554245, nan, nan, nan,
nan, 0.04587848, nan, nan, nan,
nan, 0.05448143, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, 0.00933742, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, 0.00919492, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan])
I'm fairly certain I'm not supposed to have nan values among my shap_values, but I can't seem to find the original issue.
Moreover, the predicted values given by the shap.force_plot is different than my model's predictions, which is why I checked my shap_values in the first place.
Would anyone know how could I fix that ?

Okay, via reading shap's source code, I realised it didn't take into account that the data were pandas' dataframes, eventhought the documentation says otherwise.
It worked using numpy.arrays

Related

Loss to measure if a sample belong to a distribution

I am implementing a Bayesian Neural Network to generate data imitating a distribution (which is a bernoulli). The architecture of the neural network is the following.
model=tf.keras.Sequential([
tfkl.Input(shape=(),name='dummy_input'),
tfpl.DistributionLambda(lambda t :
latentNormal,
convert_to_tensor_fn=lambda x : x.sample(batchSize)
),
tfp.layers.DenseReparameterization(units=inputDim,activation=tf.nn.relu),
tfpl.DistributionLambda(lambda t:
tfd.Bernoulli(probs=t)
)
])
I want to do a custom run of the fit. To do so, i create a batch of the data i want to imitate and i define. Here is one step of the gradient method
negloglik = lambda data: -model(0).log_prob(data)
optimizer = tf.keras.optimizers.Adam()
kls = []
pbar = tqdm(range(100))
epoch=0
# print(epoch)
# model.fit(mimic[:1453*32], mimic[:1453*32], epochs=1, batch_size=batchSize, verbose=0)
idx = np.random.choice(np.arange(mimic.shape[0]), size=1453*batchSize, replace=False)
shuffled_ds = mimic.numpy()[idx]
nBatch=0
# print(nBatch)
batch = shuffled_ds[nBatch*batchSize:(1+nBatch)*batchSize]
with tf.GradientTape() as tape:
tape.watch(model.trainable_variables)
loss = negloglik(batch)
loss = tf.reduce_mean(loss)
grads = tape.gradient(loss, model.trainable_variables)
But loss=negloglik(batch) and grads zare full of nan. (below it is the loss before the reduce mean)
<tf.Tensor: shape=(10, 5), dtype=float32, numpy=
array([[ nan, nan, nan, 2.9453604e+00,
-0.0000000e+00],
[ 4.0824111e-03, -0.0000000e+00, -0.0000000e+00, -0.0000000e+00,
nan],
[ nan, nan, nan, nan,
nan],
[ nan, nan, nan, nan,
nan],
[ nan, -0.0000000e+00, nan, 4.2825124e-01,
nan],
[-0.0000000e+00, -0.0000000e+00, -0.0000000e+00, nan,
nan],
[-0.0000000e+00, nan, -0.0000000e+00, nan,
-0.0000000e+00],
[ nan, -0.0000000e+00, nan, 1.4346083e+00,
nan],
[ nan, -0.0000000e+00, nan, -0.0000000e+00,
nan],
[-0.0000000e+00, -0.0000000e+00, -0.0000000e+00, 4.4231796e+00,
nan]], dtype=float32)>
Do you have any ideas why i have a lot of nan ? And do you know what kind of loss i can use to measure if a sample belong to a certain distribution (to replace the negloglike here) ?

Merge arrays on non-nan values

In numpy, how would you merge the following two arrays on the non-nan values, resulting in third array?
Array 1 (shape: r, c):
array([[nan, 1., 1., nan, nan, nan],
[nan, nan, nan, 1., 1., nan],
[nan, nan, 1., 1., nan, nan],
...,
[nan, nan, nan, 1., 1., nan],
[nan, nan, nan, 1., 1., nan],
[nan, nan, 1., 1., nan, nan]])
Array 2: (shape r, 2)
array([[ 0.76620125, 59.14934823],
[ 2.52819832, 43.63809538],
[ 1.9656387 , 25.62212163],
...,
[ 2.55076928, 43.04276273],
[ 2.62058763, 22.14260189],
[ 1.8050997 , 51.72144285]])
Resulting array: (shape r, c)
array([[nan, 0.76620125, 59.14934823, nan, nan, nan],
[nan, nan, nan, 2.52819832, 43.63809538, nan],
[nan, nan, 1.9656387, 25.62212163, nan, nan],
...,
[nan, nan, nan, 2.55076928, 43.04276273, nan],
[nan, nan, nan, 2.62058763, 22.14260189, nan],
[nan, nan, 1.8050997 , 51.72144285, nan, nan]])

this should do, if a is your first array and b the second one:
a[~np.isnan(a)] = b.ravel()

Apply shading, formatting and borders to pivoted dataframe

I have the following data that has been pivoted:
pip install Jinja2
import pandas as pd
import numpy as np
from numpy import rec, nan
a=rec.array([('FY20', 361.410592 , nan, 21.97, nan, 'Total', 'Fast'),
('FY21', 359.26952604, -1., 22.99, 5., 'Total', 'Fast'),
('FY22', 362.4560529 , 1., 22.77, -1., 'Total', 'Fast'),
('FY23', 371.53543252, 2., 21.92, -4., 'Total', 'Fast'),
('FY24', 374.48894494, 1., 21.88, -0., 'Total', 'Fast'),
('FY25', 377.09481613, 1., 21.85, -0., 'Total', 'Fast'),
('FY20', 67.043756 , nan, 21. , nan, 'Homes', 'Fast'),
('FY21', 110.12145222, 63., 20.95, -0., 'Homes', 'Fast'),
('FY22', 117.46526727, 7., 20.73, -1., 'Homes', 'Fast'),
('FY23', 125.83482531, 7., 18.99, -8., 'Homes', 'Fast'),
('FY24', 126.16748411, 1., 18.95, -0., 'Homes', 'Fast'),
('FY25', 127.786528 , 1., 18.96, 0., 'Homes', 'Fast'),
('FY20', 294.366836 , nan, 22.19, nan, 'Businesses', 'Fast'),
('FY21', 249.14807381, -15., 23.89, 8., 'Businesses', 'Fast'),
('FY22', 245.99078563, -2., 23.74, -1., 'Businesses', 'Fast'),
('FY23', 245.70060721, 0., 23.42, -1., 'Businesses', 'Fast'),
('FY24', 247.32146083, 1., 23.37, -0., 'Businesses', 'Fast'),
('FY25', 250.30828813, 1., 23.33, -0., 'Businesses', 'Fast'),
('FY20', 184.631684 , nan, 15.47, nan, 'Total', 'Medium'),
('FY21', 274.25718084, 49., 15.53, 0., 'Total', 'Medium'),
('FY22', 333.23835913, 21., 15.33, -1., 'Total', 'Medium'),
('FY23', 357.33167549, 7., 15.52, 1., 'Total', 'Medium'),
('FY24', 367.84796426, 3., 15.53, 0., 'Total', 'Medium'),
('FY25', 370.1664439 , 1., 15.53, 0., 'Total', 'Medium'),
('FY20', 46.522416 , nan, 17.89, nan, 'Homes', 'Medium'),
('FY21', 97.63428522, 112., 18.72, 5., 'Homes', 'Medium'),
('FY22', 141.25547499, 46., 17.86, -5., 'Homes', 'Medium'),
('FY23', 157.06766598, 11., 18.33, 3., 'Homes', 'Medium'),
('FY24', 163.02337094, 4., 18.29, -0., 'Homes', 'Medium'),
('FY25', 165.98360465, 1., 18.28, -0., 'Homes', 'Medium'),
('FY20', 138.109268 , nan, 14.66, nan, 'Businesses', 'Medium'),
('FY21', 177.62289562, 28., 13.77, -6., 'Businesses', 'Medium'),
('FY22', 191.98288414, 8., 13.46, -2., 'Businesses', 'Medium'),
('FY23', 200.26400951, 4., 13.31, -1., 'Businesses', 'Medium'),
('FY24', 203.82459332, 2., 13.31, 0., 'Businesses', 'Medium'),
('FY25', 205.18283926, 1., 13.31, 0., 'Businesses', 'Medium')],
dtype=[('FY', 'O'), ('ADV', '<f8'), ('YoY_ADV', '<f8'), ('Yield', '<f8'), ('YoY_Yld', '<f8'), ('Cut', 'O'), ('Product', 'O')])
df = pd.DataFrame(a)
df1=pd.melt(df, id_vars=['FY','Product','Cut'], var_name="Metric", value_name="Value")
df2 = pd.pivot(df1, index=['Metric', 'Product','Cut'],columns=['FY'],values=['Value'])
df2
And looks like this:
I want to apply table styles so I can copy/paste a polished table into PowerPoint but need the following:
Shade columns FY23, FY24, FY25 in orange
Apply formatting: Metric=ADV rounded to zero decimals, Metric=Yield to 2 decimals, and each of YoY_ADV plus YoY_Yld to 1 decimal place
Negative numbers red, otherwise black
Apply frame around table.
Here is my code but I am getting error 'Cannot index with multidimensional key':
# 1. If numbers are negative, make red otherwise black
#####################################################
def color_negative_red(x):
if x < 0:
return 'color: red'
else:
return 'color: black'
# 2. Slide major metrics so formatting can be applied
######################################################
adv_slice=df2.loc[('ADV', slice(None)), :]
yld_slice=df2.loc[('Yield', slice(None)), :]
yoy_adv_slice=df2.loc[('YoY_ADV', slice(None)), :]
yoy_yld_slice=df2.loc[('YoY_Yld', slice(None)), :]
#3. Apply table style
#####################
df2.style.applymap(color_negative_red).set_properties(**{'background-color': 'orange'}, subset=['FY23','FY24','FY25']).format('{:.0f}', subset=adv_slice, na_rep='-').format('{:.2f}', subset=yld_slice, na_rep='-').format('{:.1f}', subset=(yoy_adv_slice,yoy_yld_slice), na_rep='-').set_table_styles([{'selector': '',
'props' : [('border','1px solid black')]},
{'selector': 'th',
'props' : [('border','1px solid black')]},
{'selector': 'td',
'props' : [('border','1px solid black')]}]).set_properties(**{'text-align': 'center'})
What is needed to make the code work?

With some mods:
[![# 1. If numbers are negative, make red otherwise black
#####################################################
def color_negative_red(x):
if x < 0:
return 'color: red'
else:
return 'color: black'
#2. Apply table style
#####################
df2.style.applymap(color_negative_red).set_properties(**{'background-color': 'orange'}, subset=\['FY23','FY24','FY25'\]).\
format('{:.0f}', subset=('ADV',), na_rep='-').\
format('{:.2f}', subset=('Yield',), na_rep='-').\
format('{:.1f}', subset=('YoY_ADV',), na_rep='-').\
format('{:.1f}', subset=('YoY_Yld',), na_rep='-').\
set_table_styles(\[{'selector': '',
'props' : \[('border','0.1px solid black')\]},
{'selector': 'th',
'props' : \[('border','0.5px solid black')\]},
{'selector': 'td',
'props' : \[('border','0.5px solid black')\]}\]).set_properties(**{'text-align': 'center'})][1]][1]

XGBoostClassifier for multiclass and RandomizedSearchCV give nan scores and same probabilities for all classes

X is a df with some categorical and continuous variables, y is a multi-class variable with 5 classes. No variables are in 'object' data type. No nan in the df.
Code is as follows.
params = { 'max_depth': np.arange(3,20,1),
'learning_rate': np.arange(0.01, 0.5, 0.01),
'subsample': np.arange(0.5, 1.0, 0.1),
'colsample_bytree': np.arange(0.4, 1.0, 0.1),
'colsample_bylevel': np.arange(0.4, 1.0, 0.1),
'gamma': np.arange(0,10,0.05),
'reg_alpha': np.arange(0,80,1),
'reg_lambda': np.arange(0,1,0.1),
scoring = {'f1': make_scorer(f1_score, needs_proba=True, multi_class="ovr")}
#I plan to eventually add more score metrics
xgbc = xgb.XGBClassifier(seed = 20, eval_metric='mlogloss')
clf = RandomizedSearchCV(estimator=xgbc, param_distributions=params, scoring=scoring,
n_iter=10, verbose=False, n_jobs=1, refit = 'f1',return_train_score=False,
random_state = 30, cv=10)
clf.fit(X, y)
print(clf.cv_results_)
print(clf.predict_proba(X))
#Need estimated probabilities for each class for each row of X
print("Best parameters:", clf.best_params_) ```
The output is:
[0.2 0.2 0.2 0.2 0.2]
[0.2 0.2 0.2 0.2 0.2]
[0.2 0.2 0.2 0.2 0.2]
[0.2 0.2 0.2 0.2 0.2]]
Best parameters: {'subsample': 0.7, 'reg_lambda': 0.30000000000000004, 'reg_alpha': 58, 'n_estimators': 1000, 'max_depth': 13, 'learning_rate': 0.17, 'gamma': 7.1000000000000005, 'colsample_bytree': 0.4, 'colsample_bylevel': 0.6}
'split8_test_f1': array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]),
'split9_test_f1': array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]),
'mean_test_f1': array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]),

Genfromtxt does not return column names

I want to convert a csv into a numpy array. The first row of the csv file contains the names / titles of the columns. But when I use genfromtxt with the names parameter set to true I still receive only a normal numpy array with a lot of NaN values. What did I forget?
numpy.genfromtxt("test.csv", names=True, delimiter=",")
array([[ NaN, 64., 11., ..., NaN, NaN, NaN],
[ NaN, 64., 11., ..., NaN, NaN, NaN],
[ NaN, 64., 11., ..., NaN, NaN, NaN],
...,
[ NaN, 64., 11., ..., NaN, NaN, NaN],
[ NaN, 64., 11., ..., NaN, NaN, NaN],
[ NaN, 64., 5., ..., NaN, NaN, NaN]])

You have to set the dtype to None:
numpy.genfromtxt("test.csv", names=True, delimiter=",", dtype=None)

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

SHAP DeepExplainer: shap_values containing "nan" values - tensorflow

Okay, via reading shap's source code, I realised it didn't take into account that the data were pandas' dataframes, eventhought the documentation says otherwise. It worked using numpy.arrays

Related

Loss to measure if a sample belong to a distribution

Merge arrays on non-nan values

Apply shading, formatting and borders to pivoted dataframe

XGBoostClassifier for multiclass and RandomizedSearchCV give nan scores and same probabilities for all classes

Genfromtxt does not return column names

Categories

Resources