Loss to measure whether a sample belongs to a distribution - TensorFlow

I am implementing a Bayesian neural network to generate data imitating a distribution (a Bernoulli). The architecture of the network is the following:
import tensorflow as tf
import tensorflow_probability as tfp

tfkl = tf.keras.layers
tfpl = tfp.layers
tfd = tfp.distributions

# latentNormal, batchSize and inputDim are defined elsewhere in my code
model = tf.keras.Sequential([
    tfkl.Input(shape=(), name='dummy_input'),
    tfpl.DistributionLambda(lambda t: latentNormal,
                            convert_to_tensor_fn=lambda x: x.sample(batchSize)),
    tfp.layers.DenseReparameterization(units=inputDim, activation=tf.nn.relu),
    tfpl.DistributionLambda(lambda t: tfd.Bernoulli(probs=t))
])
I want to run a custom training loop instead of fit. To do so, I create a batch of the data I want to imitate. Here is one step of the gradient method:
import numpy as np
from tqdm import tqdm

negloglik = lambda data: -model(0).log_prob(data)
optimizer = tf.keras.optimizers.Adam()
kls = []
pbar = tqdm(range(100))
epoch = 0
# draw a shuffled subset of the data to imitate, then take one batch
idx = np.random.choice(np.arange(mimic.shape[0]), size=1453 * batchSize, replace=False)
shuffled_ds = mimic.numpy()[idx]
nBatch = 0
batch = shuffled_ds[nBatch * batchSize:(1 + nBatch) * batchSize]
with tf.GradientTape() as tape:
    tape.watch(model.trainable_variables)
    loss = negloglik(batch)          # negative log-likelihood of the batch
    loss = tf.reduce_mean(loss)
grads = tape.gradient(loss, model.trainable_variables)
But loss = negloglik(batch) and grads are full of nan. (Below is the loss before the reduce_mean.)
<tf.Tensor: shape=(10, 5), dtype=float32, numpy=
array([[           nan,            nan,            nan,  2.9453604e+00, -0.0000000e+00],
       [ 4.0824111e-03, -0.0000000e+00, -0.0000000e+00, -0.0000000e+00,            nan],
       [           nan,            nan,            nan,            nan,            nan],
       [           nan,            nan,            nan,            nan,            nan],
       [           nan, -0.0000000e+00,            nan,  4.2825124e-01,            nan],
       [-0.0000000e+00, -0.0000000e+00, -0.0000000e+00,            nan,            nan],
       [-0.0000000e+00,            nan, -0.0000000e+00,            nan, -0.0000000e+00],
       [           nan, -0.0000000e+00,            nan,  1.4346083e+00,            nan],
       [           nan, -0.0000000e+00,            nan, -0.0000000e+00,            nan],
       [-0.0000000e+00, -0.0000000e+00, -0.0000000e+00,  4.4231796e+00,            nan]], dtype=float32)>
Do you have any idea why I get so many nans? And do you know what kind of loss I could use to measure whether a sample belongs to a certain distribution (to replace the negative log-likelihood here)?
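One likely cause, though this is an assumption since it depends on what the dense layer actually outputs: relu is unbounded above, and tfd.Bernoulli(probs=t).log_prob returns nan whenever t falls outside [0, 1] (the -0.0 entries are consistent with probs hitting the boundary exactly). A minimal sketch of the usual fix, parameterizing the Bernoulli by logits so any real-valued activation is valid:

# Sketch under the assumption that the nans come from probs outside [0, 1]:
# drop the relu and let the dense layer emit raw logits.
model = tf.keras.Sequential([
    tfkl.Input(shape=(), name='dummy_input'),
    tfpl.DistributionLambda(lambda t: latentNormal,
                            convert_to_tensor_fn=lambda x: x.sample(batchSize)),
    tfp.layers.DenseReparameterization(units=inputDim),  # no activation: raw logits
    tfpl.DistributionLambda(lambda t: tfd.Bernoulli(logits=t))
])

As for the loss itself, the negative log-likelihood is the standard way to measure how well samples fit a distribution; the nans come from the parameterization rather than from the choice of loss.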

Related

Merge arrays on non-nan values

In numpy, how would you merge the following two arrays on the non-nan values, resulting in third array?
Array 1 (shape: r, c):
array([[nan, 1., 1., nan, nan, nan],
[nan, nan, nan, 1., 1., nan],
[nan, nan, 1., 1., nan, nan],
...,
[nan, nan, nan, 1., 1., nan],
[nan, nan, nan, 1., 1., nan],
[nan, nan, 1., 1., nan, nan]])
Array 2: (shape r, 2)
array([[ 0.76620125, 59.14934823],
[ 2.52819832, 43.63809538],
[ 1.9656387 , 25.62212163],
...,
[ 2.55076928, 43.04276273],
[ 2.62058763, 22.14260189],
[ 1.8050997 , 51.72144285]])
Resulting array: (shape r, c)
array([[nan, 0.76620125, 59.14934823, nan, nan, nan],
[nan, nan, nan, 2.52819832, 43.63809538, nan],
[nan, nan, 1.9656387, 25.62212163, nan, nan],
...,
[nan, nan, nan, 2.55076928, 43.04276273, nan],
[nan, nan, nan, 2.62058763, 22.14260189, nan],
[nan, nan, 1.8050997 , 51.72144285, nan, nan]])
This should do it, if a is your first array and b the second:
a[~np.isnan(a)] = b.ravel()
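As a quick sanity check, a self-contained sketch of the same idea on a small made-up example:

import numpy as np

a = np.array([[np.nan, 1., 1., np.nan],
              [np.nan, np.nan, 1., 1.]])
b = np.array([[0.766, 59.149],
              [2.528, 43.638]])

# Boolean-mask assignment fills the non-nan slots in row-major order;
# it relies on each row of `a` containing exactly b.shape[1] non-nan entries.
a[~np.isnan(a)] = b.ravel()
print(a)
# [[   nan  0.766 59.149    nan]
#  [   nan    nan  2.528 43.638]]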

XGBoostClassifier for multiclass and RandomizedSearchCV give nan scores and same probabilities for all classes

X is a DataFrame with some categorical and continuous variables; y is a multi-class variable with 5 classes. No variables have the 'object' dtype, and there are no nans in the DataFrame.
Code is as follows.
params = {'max_depth': np.arange(3, 20, 1),
          'learning_rate': np.arange(0.01, 0.5, 0.01),
          'subsample': np.arange(0.5, 1.0, 0.1),
          'colsample_bytree': np.arange(0.4, 1.0, 0.1),
          'colsample_bylevel': np.arange(0.4, 1.0, 0.1),
          'gamma': np.arange(0, 10, 0.05),
          'reg_alpha': np.arange(0, 80, 1),
          'reg_lambda': np.arange(0, 1, 0.1)}
scoring = {'f1': make_scorer(f1_score, needs_proba=True, multi_class="ovr")}
#I plan to eventually add more score metrics
xgbc = xgb.XGBClassifier(seed=20, eval_metric='mlogloss')
clf = RandomizedSearchCV(estimator=xgbc, param_distributions=params, scoring=scoring,
                         n_iter=10, verbose=False, n_jobs=1, refit='f1',
                         return_train_score=False, random_state=30, cv=10)
clf.fit(X, y)
print(clf.cv_results_)
print(clf.predict_proba(X))
# Need estimated probabilities for each class for each row of X
print("Best parameters:", clf.best_params_)
The output (truncated) is:
[0.2 0.2 0.2 0.2 0.2]
[0.2 0.2 0.2 0.2 0.2]
[0.2 0.2 0.2 0.2 0.2]
[0.2 0.2 0.2 0.2 0.2]]
Best parameters: {'subsample': 0.7, 'reg_lambda': 0.30000000000000004, 'reg_alpha': 58, 'n_estimators': 1000, 'max_depth': 13, 'learning_rate': 0.17, 'gamma': 7.1000000000000005, 'colsample_bytree': 0.4, 'colsample_bylevel': 0.6}
'split8_test_f1': array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]),
'split9_test_f1': array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]),
'mean_test_f1': array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]),
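A likely explanation for the all-nan scores (an assumption, though consistent with the output above): f1_score expects hard class labels and has no multi_class parameter, so a scorer built with needs_proba=True raises on every fold, and RandomizedSearchCV falls back to its default error_score=np.nan; refit then picks essentially arbitrary parameters. A minimal sketch of a scorer f1_score actually supports:

from sklearn.metrics import make_scorer, f1_score

# f1_score works on predicted labels; macro-averaging covers the 5 classes
scoring = {'f1': make_scorer(f1_score, average='macro')}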

Selecting values from tensor based on an index tensor

I have two matrices. Matrix A contains values and matrix B contains indices. The shapes of A and B are (batch, values) and (batch, indices), respectively.
My goal is to select values from matrix A based on indices of matrix B along the batch dimension.
For example:
# Matrix A
<tf.Tensor: shape=(2, 5), dtype=float32, numpy=
array([[0., 1., 2., 3., 4.],
       [5., 6., 7., 8., 9.]], dtype=float32)>
# Matrix B
<tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[0, 1],
       [1, 2]], dtype=int32)>
# Expected result
<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[0., 1.],
       [6., 7.]], dtype=float32)>
How can I achieve this in TensorFlow?
Many thanks in advance!
You can achieve this with the tf.gather function.
mat_a = tf.constant([[0., 1., 2., 3., 4.],
                     [5., 6., 7., 8., 9.]])
mat_b = tf.constant([[0, 1], [1, 2]])

out = tf.gather(mat_a, mat_b, batch_dims=1)
out.numpy()
array([[0., 1.],
       [6., 7.]], dtype=float32)
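For cases where batch_dims doesn't cover the indexing pattern, an equivalent formulation (just a sketch, not needed here) builds explicit (row, column) pairs for tf.gather_nd:

# Pair each index in mat_b with its row number, then gather the (row, col) pairs.
batch_idx = tf.tile(tf.range(tf.shape(mat_b)[0])[:, None], [1, tf.shape(mat_b)[1]])
out2 = tf.gather_nd(mat_a, tf.stack([batch_idx, mat_b], axis=-1))  # same result as above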

SHAP DeepExplainer: shap_values containing "nan" values

I have an issue with my shap values. Here is my model:
Model: "model_4"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_5 (InputLayer) [(None, 158)] 0
__________________________________________________________________________________________________
model_1 (Model) (None, 158) 57310 input_5[0][0]
__________________________________________________________________________________________________
subtract_4 (Subtract) (None, 158) 0 input_5[0][0]
model_1[5][0]
__________________________________________________________________________________________________
multiply_4 (Multiply) (None, 158) 0 subtract_4[0][0]
subtract_4[0][0]
__________________________________________________________________________________________________
lambda_4 (Lambda) (None,) 0 multiply_4[0][0]
__________________________________________________________________________________________________
reshape_3 (Reshape) (None, 1) 0 lambda_4[0][0]
==================================================================================================
Total params: 57,310
Trainable params: 57,310
Non-trainable params: 0
__________________________________________________________________________________________________
And I call:
scores = new_model.predict(X_test_scaled)
scores = scores.reshape(scores.shape[0], 1)
# append the scores, keep the 16 highest-scoring rows, then drop the score column
toexplain = np.append(X_test_scaled, scores, axis=1)
toexplain = pd.DataFrame(toexplain)
toexplain.sort_values(by=[158], ascending=False, inplace=True)
toexplain = toexplain.iloc[0:16]
toexplain.drop(columns=[158], axis=1, inplace=True)
explainer = shap.DeepExplainer(new_model, df_sampled_X_train_scaled)
shap_values = explainer.shap_values(toexplain, check_additivity=False)
But my shap values look like this (for the first instance):
shap_values[0]
array([ nan, nan, nan, 0.08352888, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, 0.03286453,
nan, nan, 0.2984612 , nan, nan,
nan, 0.01110088, -0.85235232, nan, nan,
nan, nan, nan, nan, -0.27935541,
nan, nan, nan, nan, nan,
nan, nan, -0.18422949, 0.01466912, nan,
nan, nan, -0.1688329 , 0.07462809, 0.03071906,
nan, -0.00554245, nan, nan, nan,
nan, 0.04587848, nan, nan, nan,
nan, 0.05448143, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, 0.00933742, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, 0.00919492, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan])
I'm fairly certain I'm not supposed to have nan values among my shap_values, but I can't seem to find the root cause.
Moreover, the predicted values given by shap.force_plot differ from my model's predictions, which is why I checked my shap_values in the first place.
Would anyone know how I could fix that?
Okay, after reading shap's source code, I realised it doesn't take into account that the data are pandas DataFrames, even though the documentation says otherwise.
It worked using numpy arrays.
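A minimal sketch of that fix, assuming df_sampled_X_train_scaled and toexplain are the DataFrames from the question:

# Convert the DataFrames to plain numpy arrays before handing them to shap
explainer = shap.DeepExplainer(new_model, df_sampled_X_train_scaled.to_numpy())
shap_values = explainer.shap_values(toexplain.to_numpy(), check_additivity=False)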

Genfromtxt does not return column names

I want to convert a CSV file into a numpy array. The first row of the CSV file contains the names/titles of the columns. But when I use genfromtxt with the names parameter set to True, I still receive a plain numpy array with a lot of NaN values. What did I forget?
numpy.genfromtxt("test.csv", names=True, delimiter=",")
array([[ NaN, 64., 11., ..., NaN, NaN, NaN],
[ NaN, 64., 11., ..., NaN, NaN, NaN],
[ NaN, 64., 11., ..., NaN, NaN, NaN],
...,
[ NaN, 64., 11., ..., NaN, NaN, NaN],
[ NaN, 64., 11., ..., NaN, NaN, NaN],
[ NaN, 64., 5., ..., NaN, NaN, NaN]])
You have to set the dtype to None:
numpy.genfromtxt("test.csv", names=True, delimiter=",", dtype=None)
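With dtype=None, genfromtxt infers a dtype per column and returns a structured array, so columns can then be accessed by the names read from the header; a short sketch (the column name 'age' is made up):

data = numpy.genfromtxt("test.csv", names=True, delimiter=",", dtype=None)
print(data.dtype.names)  # column names taken from the header row
print(data["age"])       # one column, by its (hypothetical) name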