XGBoostClassifier for multiclass and RandomizedSearchCV give nan scores and same probabilities for all classes - xgboost

X is a DataFrame with some categorical and continuous variables; y is a multi-class target with 5 classes. No column has the 'object' dtype, and the DataFrame contains no NaN values.
Code is as follows.
import numpy as np
import xgboost as xgb
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

params = {'max_depth': np.arange(3, 20, 1),
          'learning_rate': np.arange(0.01, 0.5, 0.01),
          'subsample': np.arange(0.5, 1.0, 0.1),
          'colsample_bytree': np.arange(0.4, 1.0, 0.1),
          'colsample_bylevel': np.arange(0.4, 1.0, 0.1),
          'gamma': np.arange(0, 10, 0.05),
          'reg_alpha': np.arange(0, 80, 1),
          'reg_lambda': np.arange(0, 1, 0.1)}
scoring = {'f1': make_scorer(f1_score, needs_proba=True, multi_class="ovr")}
# I plan to eventually add more score metrics
xgbc = xgb.XGBClassifier(seed=20, eval_metric='mlogloss')
clf = RandomizedSearchCV(estimator=xgbc, param_distributions=params, scoring=scoring,
                         n_iter=10, verbose=False, n_jobs=1, refit='f1', return_train_score=False,
                         random_state=30, cv=10)
clf.fit(X, y)
print(clf.cv_results_)
print(clf.predict_proba(X))
# Need estimated probabilities for each class for each row of X
print("Best parameters:", clf.best_params_)
The output is:
[0.2 0.2 0.2 0.2 0.2]
[0.2 0.2 0.2 0.2 0.2]
[0.2 0.2 0.2 0.2 0.2]
[0.2 0.2 0.2 0.2 0.2]]
Best parameters: {'subsample': 0.7, 'reg_lambda': 0.30000000000000004, 'reg_alpha': 58, 'n_estimators': 1000, 'max_depth': 13, 'learning_rate': 0.17, 'gamma': 7.1000000000000005, 'colsample_bytree': 0.4, 'colsample_bylevel': 0.6}
'split8_test_f1': array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]),
'split9_test_f1': array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]),
'mean_test_f1': array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]),
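No answer is recorded for this question, so the following is only a sketch of two things worth checking, not a confirmed diagnosis. First, f1_score works on predicted labels and has no multi_class parameter, so a scorer built with needs_proba=True and multi_class="ovr" will most likely fail inside each fold, and RandomizedSearchCV reports a failed fit or score as nan by default. Second, identical probabilities of 0.2 are exactly what a softmax returns when all 5 classes receive the same raw score, which could happen if the strong regularisation selected here (reg_alpha=58, gamma of about 7.1) keeps the trees from splitting. The NumPy-only code below illustrates both points; the helper names softmax and macro_f1 and all the data are made up for the illustration.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores, computed from hard labels."""
    f1s = []
    for k in range(n_classes):
        tp = np.sum((y_pred == k) & (y_true == k))
        fp = np.sum((y_pred == k) & (y_true != k))
        fn = np.sum((y_pred != k) & (y_true == k))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(f1s))

# Hypothetical case: every class gets the same raw score (no tree ever splits).
flat = softmax(np.zeros((4, 5)))
print(flat[0])  # every entry equals 1/5

# F1 needs labels, so probabilities are reduced with argmax before scoring.
proba = softmax(np.array([[2.0, 0.1, 0.0],
                          [0.0, 3.0, 0.2],
                          [0.1, 0.0, 1.5],
                          [1.0, 0.9, 0.0]]))
y_true = np.array([0, 1, 2, 1])
y_pred = proba.argmax(axis=1)
score = macro_f1(y_true, y_pred, n_classes=3)
print(score)
```

With scikit-learn itself, the corresponding scorer would presumably be make_scorer(f1_score, average="macro"), which scores the labels returned by predict; passing error_score="raise" to RandomizedSearchCV should surface the underlying exception instead of nan.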

Loss to measure if a sample belongs to a distribution

I am implementing a Bayesian neural network to generate data imitating a distribution (a Bernoulli distribution). The architecture of the network is the following:
model = tf.keras.Sequential([
    tfkl.Input(shape=(), name='dummy_input'),
    tfpl.DistributionLambda(
        lambda t: latentNormal,
        convert_to_tensor_fn=lambda x: x.sample(batchSize)
    ),
    tfp.layers.DenseReparameterization(units=inputDim, activation=tf.nn.relu),
    tfpl.DistributionLambda(lambda t: tfd.Bernoulli(probs=t))
])
I want to run the fit with a custom training loop. To do so, I create a batch of the data I want to imitate and define the negative log-likelihood as the loss. Here is one step of the gradient method:
negloglik = lambda data: -model(0).log_prob(data)
optimizer = tf.keras.optimizers.Adam()
kls = []
pbar = tqdm(range(100))
epoch=0
# print(epoch)
# model.fit(mimic[:1453*32], mimic[:1453*32], epochs=1, batch_size=batchSize, verbose=0)
idx = np.random.choice(np.arange(mimic.shape[0]), size=1453*batchSize, replace=False)
shuffled_ds = mimic.numpy()[idx]
nBatch=0
# print(nBatch)
batch = shuffled_ds[nBatch*batchSize:(1+nBatch)*batchSize]
with tf.GradientTape() as tape:
    tape.watch(model.trainable_variables)
    loss = negloglik(batch)
    loss = tf.reduce_mean(loss)
grads = tape.gradient(loss, model.trainable_variables)
But loss = negloglik(batch) and grads are full of nan. (Below is the loss before the reduce_mean.)
<tf.Tensor: shape=(10, 5), dtype=float32, numpy=
array([[ nan, nan, nan, 2.9453604e+00,
-0.0000000e+00],
[ 4.0824111e-03, -0.0000000e+00, -0.0000000e+00, -0.0000000e+00,
nan],
[ nan, nan, nan, nan,
nan],
[ nan, nan, nan, nan,
nan],
[ nan, -0.0000000e+00, nan, 4.2825124e-01,
nan],
[-0.0000000e+00, -0.0000000e+00, -0.0000000e+00, nan,
nan],
[-0.0000000e+00, nan, -0.0000000e+00, nan,
-0.0000000e+00],
[ nan, -0.0000000e+00, nan, 1.4346083e+00,
nan],
[ nan, -0.0000000e+00, nan, -0.0000000e+00,
nan],
[-0.0000000e+00, -0.0000000e+00, -0.0000000e+00, 4.4231796e+00,
nan]], dtype=float32)>
Do you have any idea why I get so many nan values? And do you know what kind of loss I can use to measure whether a sample belongs to a certain distribution (to replace the negloglik here)?
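No answer is recorded for this question. One likely cause, offered as a hypothesis rather than a verified diagnosis: the dense layer uses a ReLU activation, so its outputs lie in [0, inf) and can exceed 1, yet they are passed to tfd.Bernoulli as probs. For p > 1 the term log(1 - p) is undefined, which produces nan. The NumPy sketch below reproduces this and contrasts it with a logit parameterisation, where the log-probability x*z - softplus(z) is finite for any real z. The function names and sample values are illustrative only.

```python
import numpy as np

def bernoulli_log_prob_from_probs(x, p):
    """log p(x) with p used directly as the success probability."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return x * np.log(p) + (1.0 - x) * np.log1p(-p)

def bernoulli_log_prob_from_logits(x, z):
    """log p(x) with p = sigmoid(z); softplus(z) = logaddexp(0, z)."""
    return x * z - np.logaddexp(0.0, z)

x = np.array([1.0, 0.0, 1.0])            # observed binary data
relu_out = np.array([0.3, 1.7, 2.5])     # ReLU outputs; two of them exceed 1

lp_probs = bernoulli_log_prob_from_probs(x, relu_out)
lp_logits = bernoulli_log_prob_from_logits(x, relu_out)

print(lp_probs)    # nan wherever the "probability" is larger than 1
print(lp_logits)   # finite everywhere
```

In the model above, the equivalent change would presumably be to drop the ReLU activation on the dense layer and build the output with tfd.Bernoulli(logits=t). The negative log-likelihood itself is a reasonable loss for fitting a distribution to samples once the parameters are valid.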

Merge arrays on non-nan values

In numpy, how would you merge the following two arrays on the non-nan values, resulting in a third array?
Array 1 (shape: r, c):
array([[nan, 1., 1., nan, nan, nan],
[nan, nan, nan, 1., 1., nan],
[nan, nan, 1., 1., nan, nan],
...,
[nan, nan, nan, 1., 1., nan],
[nan, nan, nan, 1., 1., nan],
[nan, nan, 1., 1., nan, nan]])
Array 2: (shape r, 2)
array([[ 0.76620125, 59.14934823],
[ 2.52819832, 43.63809538],
[ 1.9656387 , 25.62212163],
...,
[ 2.55076928, 43.04276273],
[ 2.62058763, 22.14260189],
[ 1.8050997 , 51.72144285]])
Resulting array: (shape r, c)
array([[nan, 0.76620125, 59.14934823, nan, nan, nan],
[nan, nan, nan, 2.52819832, 43.63809538, nan],
[nan, nan, 1.9656387, 25.62212163, nan, nan],
...,
[nan, nan, nan, 2.55076928, 43.04276273, nan],
[nan, nan, nan, 2.62058763, 22.14260189, nan],
[nan, nan, 1.8050997 , 51.72144285, nan, nan]])
This should do it, if a is your first array and b the second one:
a[~np.isnan(a)] = b.ravel()
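A small worked example of that one-liner, using made-up 3x4 data rather than the arrays from the question. It works on a copy so the original array is preserved, and it checks the assumption the trick relies on: boolean-mask assignment fills positions in row-major order, so each row of a must hold exactly as many non-nan entries as b has columns.

```python
import numpy as np

a = np.array([[np.nan, 1.0, 1.0, np.nan],
              [np.nan, np.nan, 1.0, 1.0],
              [1.0, 1.0, np.nan, np.nan]])
b = np.array([[0.5, 10.0],
              [1.5, 20.0],
              [2.5, 30.0]])

merged = a.copy()            # leave the original array untouched
mask = ~np.isnan(merged)     # True where a holds a value

# Every row must provide exactly b.shape[1] slots, otherwise values
# from one row of b would spill into the next row of the result.
if not (mask.sum(axis=1) == b.shape[1]).all():
    raise ValueError("each row of a needs exactly b.shape[1] non-nan entries")

merged[mask] = b.ravel()     # fill the slots in row-major order
print(merged)
```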

SHAP DeepExplainer: shap_values containing "nan" values

I have an issue with my shap values, here is my model:
Model: "model_4"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_5 (InputLayer)            [(None, 158)]        0
__________________________________________________________________________________________________
model_1 (Model)                 (None, 158)          57310       input_5[0][0]
__________________________________________________________________________________________________
subtract_4 (Subtract)           (None, 158)          0           input_5[0][0]
                                                                 model_1[5][0]
__________________________________________________________________________________________________
multiply_4 (Multiply)           (None, 158)          0           subtract_4[0][0]
                                                                 subtract_4[0][0]
__________________________________________________________________________________________________
lambda_4 (Lambda)               (None,)              0           multiply_4[0][0]
__________________________________________________________________________________________________
reshape_3 (Reshape)             (None, 1)            0           lambda_4[0][0]
==================================================================================================
Total params: 57,310
Trainable params: 57,310
Non-trainable params: 0
__________________________________________________________________________________________________
And I call:
scores = new_model.predict(X_test_scaled)
scores = scores.reshape(scores.shape[0],1)
toexplain = np.append(X_test_scaled, scores, axis = 1)
toexplain = pd.DataFrame(toexplain)
toexplain.sort_values(by = [158], ascending=False, inplace=True)
toexplain = toexplain.iloc[0:16]
toexplain.drop(columns = [158], axis = 1, inplace = True)
explainer=shap.DeepExplainer(new_model, df_sampled_X_train_scaled)
shap_values = explainer.shap_values(toexplain, check_additivity=False)
But my shap values look like this (for the first instance):
shap_values[0]
array([ nan, nan, nan, 0.08352888, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, 0.03286453,
nan, nan, 0.2984612 , nan, nan,
nan, 0.01110088, -0.85235232, nan, nan,
nan, nan, nan, nan, -0.27935541,
nan, nan, nan, nan, nan,
nan, nan, -0.18422949, 0.01466912, nan,
nan, nan, -0.1688329 , 0.07462809, 0.03071906,
nan, -0.00554245, nan, nan, nan,
nan, 0.04587848, nan, nan, nan,
nan, 0.05448143, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, 0.00933742, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, 0.00919492, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan])
I'm fairly certain I'm not supposed to have nan values among my shap_values, but I can't seem to find the original issue.
Moreover, the predicted values given by shap.force_plot are different from my model's predictions, which is why I checked my shap_values in the first place.
Would anyone know how I could fix that?
Okay, by reading shap's source code, I realised it didn't handle the data being pandas DataFrames, even though the documentation says otherwise.
It worked using NumPy arrays.
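Following that finding, the selection of the 16 highest-scoring rows can be done without going through a DataFrame at all. The sketch below uses random stand-in data in place of X_test_scaled and the model's predictions; only the shapes match the question.

```python
import numpy as np

rng = np.random.default_rng(0)
X_test_scaled = rng.normal(size=(100, 158))   # stand-in for the real test matrix
scores = rng.random(100)                      # stand-in for new_model.predict(...)

# Indices of the 16 largest scores, highest first.
top_idx = np.argsort(scores.ravel())[::-1][:16]
toexplain = X_test_scaled[top_idx]            # plain ndarray, shape (16, 158)

print(type(toexplain).__name__, toexplain.shape)
```

The resulting toexplain array, together with a background set that is also an ndarray, can then be passed to the explainer. If the data already live in a DataFrame, calling its to_numpy() method gives an ndarray as well.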

How to change dtypes of numpy array for tensorflow

I am creating a neural network in tensorflow and I have created the placeholders like this:
input_tensor = tf.placeholder(tf.float32, shape = (None,n_input), name = "input_tensor")
output_tensor = tf.placeholder(tf.float32, shape = (None,n_classes), name = "output_tensor")
During the training process, I was getting the following error:
Traceback (most recent call last):
File "try.py", line 150, in <module>
sess.run(optimizer, feed_dict={X: x_train[i: i + 1], Y: y_train[i: i + 1]})
TypeError: unhashable type: 'numpy.ndarray'
I believe this happens because the datatypes of my x_train and y_train differ from the datatypes of the placeholders.
My x_train looks somewhat like this:
array([[array([[ 1., 0., 0.],
[ 0., 1., 0.]])],
[array([[ 0., 1., 0.],
[ 1., 0., 0.]])],
[array([[ 0., 0., 1.],
[ 0., 1., 0.]])]], dtype=object)
It was initially a dataframe like this:
0 [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
1 [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
2 [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
I did x_train = train_x.values to get the numpy array
And y_train looks like this:
array([[ 1., 0., 0.],
[ 0., 1., 0.],
[ 0., 0., 1.]])
x_train has dtype object and y_train has dtype float64.
What I want to know is how I can change the datatypes of my training data so that they work with the tensorflow placeholders. Please also tell me if I am missing something.
It is a little hard to guess what shape you want your data to have, but I am guessing you are looking for one of the following two combinations. I will also try to simulate your data in a pandas DataFrame.
df = pd.DataFrame([[[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]],
[[[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]],
[[[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]]], columns = ['Mydata'])
print(df)
x = df.Mydata.values
print(x.shape)
print(x)
print(x.dtype)
Output:
Mydata
0 [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
1 [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
2 [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
(3,)
[list([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
list([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
list([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])]
object
Combination 1
y = [item for sub_list in x for item in sub_list]
y = np.array(y, dtype = np.float32)
print(y.dtype, y.shape)
print(y)
Output:
float32 (6, 3)
[[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 1. 0.]
[ 1. 0. 0.]
[ 0. 0. 1.]
[ 0. 1. 0.]]
Combination 2
y = [sub_list for sub_list in x]
y = np.array(y, dtype = np.float32)
print(y.dtype, y.shape)
print(y)
Output:
float32 (3, 2, 3)
[[[ 1. 0. 0.]
[ 0. 1. 0.]]
[[ 0. 1. 0.]
[ 1. 0. 0.]]
[[ 0. 0. 1.]
[ 0. 1. 0.]]]
Your x_train is a nested object containing arrays, so you have to unpack it and reshape it. Here's a general purpose hack:
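def unpack(a, aggregate=None):
    # Use None instead of a mutable default so repeated calls do not share state.
    if aggregate is None:
        aggregate = []
    for x in a:
        # np.floating also covers NumPy scalars such as np.float32.
        if isinstance(x, (float, np.floating)):
            aggregate.append(x)
        else:
            unpack(x, aggregate=aggregate)
    return np.array(aggregate)
# x_train is already a NumPy array here (it came from train_x.values).
x_train = unpack(x_train).reshape(x_train.shape[0], -1)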
Once you've got a dense array (y_train is already dense), you can use a function like the following:
def cast(placeholder, array):
    dtype = placeholder.dtype.as_numpy_dtype
    return array.astype(dtype)
x_train, y_train = cast(X,x_train), cast(Y,y_train)
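As an alternative to the recursive unpacking, and assuming every cell holds an array of the same shape, np.stack can build the dense array in one step. The sketch below rebuilds an object array shaped like the x_train in the question; the variable names dense and flat are illustrative.

```python
import numpy as np

# Rebuild an object array like the one in the question: shape (3, 1),
# where each cell holds a (2, 3) float array.
cells = [np.array([[1., 0., 0.], [0., 1., 0.]]),
         np.array([[0., 1., 0.], [1., 0., 0.]]),
         np.array([[0., 0., 1.], [0., 1., 0.]])]
x_train = np.empty((3, 1), dtype=object)
for i, cell in enumerate(cells):
    x_train[i, 0] = cell

# Stack the cells into one dense array, then cast to the placeholder dtype.
dense = np.stack(list(x_train.ravel())).astype(np.float32)   # shape (3, 2, 3)
flat = dense.reshape(dense.shape[0], -1)                      # shape (3, 6)

print(dense.dtype, dense.shape, flat.shape)
```

Whether dense or flat is the right input depends on the placeholder: a shape of (None, n_input) expects the two-dimensional flat form.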

Genfromtxt does not return column names

I want to convert a csv into a numpy array. The first row of the csv file contains the names / titles of the columns. But when I use genfromtxt with the names parameter set to True, I still get only a normal numpy array with a lot of NaN values. What did I forget?
numpy.genfromtxt("test.csv", names=True, delimiter=",")
array([[ NaN, 64., 11., ..., NaN, NaN, NaN],
[ NaN, 64., 11., ..., NaN, NaN, NaN],
[ NaN, 64., 11., ..., NaN, NaN, NaN],
...,
[ NaN, 64., 11., ..., NaN, NaN, NaN],
[ NaN, 64., 11., ..., NaN, NaN, NaN],
[ NaN, 64., 5., ..., NaN, NaN, NaN]])
You have to set the dtype to None:
numpy.genfromtxt("test.csv", names=True, delimiter=",", dtype=None)
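A self-contained illustration of the difference, using an in-memory CSV with made-up columns instead of test.csv. With the default float dtype, text columns cannot be parsed and come back as nan; with dtype=None each column's type is inferred separately and the header names become field names. The encoding argument is an addition not present in the answer, included here so that text fields are returned as str.

```python
import io
import numpy as np

csv_text = "label,age,n_items\nfoo,64,11\nbar,64,5\n"

# Default dtype (float): the text column cannot be parsed and becomes nan.
as_float = np.genfromtxt(io.StringIO(csv_text), names=True, delimiter=",")

# dtype=None: the type of each column is inferred separately.
typed = np.genfromtxt(io.StringIO(csv_text), names=True, delimiter=",",
                      dtype=None, encoding="utf-8")

print(typed.dtype.names)   # field names taken from the header row
print(typed["age"])        # columns are accessed by name
print(as_float["label"])   # nan for the unparseable text column
```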