Genfromtxt does not return column names - numpy

I want to convert a CSV file into a numpy array. The first row of the file contains the column names. But when I use genfromtxt with the names parameter set to True, I still get only a plain numpy array with a lot of NaN values. What did I forget?
numpy.genfromtxt("test.csv", names=True, delimiter=",")
array([[ NaN, 64., 11., ..., NaN, NaN, NaN],
[ NaN, 64., 11., ..., NaN, NaN, NaN],
[ NaN, 64., 11., ..., NaN, NaN, NaN],
...,
[ NaN, 64., 11., ..., NaN, NaN, NaN],
[ NaN, 64., 11., ..., NaN, NaN, NaN],
[ NaN, 64., 5., ..., NaN, NaN, NaN]])

You have to set dtype to None. With the default float dtype, genfromtxt returns a plain 2-D float array and any column it cannot parse as a number becomes NaN; with dtype=None it infers a type per column and returns a structured array whose fields are named after the header:
numpy.genfromtxt("test.csv", names=True, delimiter=",", dtype=None)

Loss to measure if a sample belongs to a distribution

I am implementing a Bayesian Neural Network to generate data imitating a distribution (which is a Bernoulli). The architecture of the neural network is the following:
model = tf.keras.Sequential([
    tfkl.Input(shape=(), name='dummy_input'),
    tfpl.DistributionLambda(lambda t: latentNormal,
                            convert_to_tensor_fn=lambda x: x.sample(batchSize)),
    tfp.layers.DenseReparameterization(units=inputDim, activation=tf.nn.relu),
    tfpl.DistributionLambda(lambda t: tfd.Bernoulli(probs=t))
])
I want to run a custom training loop instead of calling fit. To do so, I create a batch of the data I want to imitate and define the negative log-likelihood as the loss. Here is one step of the gradient method:
negloglik = lambda data: -model(0).log_prob(data)
optimizer = tf.keras.optimizers.Adam()

kls = []
pbar = tqdm(range(100))
epoch = 0
# print(epoch)
# model.fit(mimic[:1453*32], mimic[:1453*32], epochs=1, batch_size=batchSize, verbose=0)
idx = np.random.choice(np.arange(mimic.shape[0]), size=1453*batchSize, replace=False)
shuffled_ds = mimic.numpy()[idx]
nBatch = 0
# print(nBatch)
batch = shuffled_ds[nBatch*batchSize:(1+nBatch)*batchSize]

with tf.GradientTape() as tape:
    tape.watch(model.trainable_variables)
    loss = negloglik(batch)
    loss = tf.reduce_mean(loss)
grads = tape.gradient(loss, model.trainable_variables)
But loss = negloglik(batch) and grads are full of nan (below is the loss before the reduce_mean):
<tf.Tensor: shape=(10, 5), dtype=float32, numpy=
array([[           nan,            nan,            nan,  2.9453604e+00, -0.0000000e+00],
       [ 4.0824111e-03, -0.0000000e+00, -0.0000000e+00, -0.0000000e+00,            nan],
       [           nan,            nan,            nan,            nan,            nan],
       [           nan,            nan,            nan,            nan,            nan],
       [           nan, -0.0000000e+00,            nan,  4.2825124e-01,            nan],
       [-0.0000000e+00, -0.0000000e+00, -0.0000000e+00,            nan,            nan],
       [-0.0000000e+00,            nan, -0.0000000e+00,            nan, -0.0000000e+00],
       [           nan, -0.0000000e+00,            nan,  1.4346083e+00,            nan],
       [           nan, -0.0000000e+00,            nan, -0.0000000e+00,            nan],
       [-0.0000000e+00, -0.0000000e+00, -0.0000000e+00,  4.4231796e+00,            nan]], dtype=float32)>
Do you have any idea why I get so many nan values? And do you know what kind of loss I could use to measure whether a sample belongs to a certain distribution (to replace the negative log-likelihood here)?
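One plausible source of the NaNs, given the architecture above: the last layer builds tfd.Bernoulli(probs=t) from a relu output, and relu is unbounded above, while the Bernoulli log-likelihood involves log(1 - p), which is undefined for p > 1. A minimal range check on the probabilities (a sketch, assuming the model defined above):
# inspect the probabilities the network actually feeds into the Bernoulli
dist = model(0)                      # the model's output is a distribution
probs = dist.mean()                  # for a Bernoulli, mean() equals probs
print(tf.reduce_min(probs).numpy(), tf.reduce_max(probs).numpy())
# any value above 1 would make log(1 - p) undefined; building the last layer as
# tfd.Bernoulli(logits=t) (and dropping the relu) removes that constraint, since
# logits may be any real number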

Merge arrays on non-nan values

In numpy, how would you merge the following two arrays on the non-nan values, resulting in the third array?
Array 1 (shape: r, c):
array([[nan, 1., 1., nan, nan, nan],
[nan, nan, nan, 1., 1., nan],
[nan, nan, 1., 1., nan, nan],
...,
[nan, nan, nan, 1., 1., nan],
[nan, nan, nan, 1., 1., nan],
[nan, nan, 1., 1., nan, nan]])
Array 2 (shape: r, 2):
array([[ 0.76620125, 59.14934823],
[ 2.52819832, 43.63809538],
[ 1.9656387 , 25.62212163],
...,
[ 2.55076928, 43.04276273],
[ 2.62058763, 22.14260189],
[ 1.8050997 , 51.72144285]])
Resulting array (shape: r, c):
array([[nan, 0.76620125, 59.14934823, nan, nan, nan],
[nan, nan, nan, 2.52819832, 43.63809538, nan],
[nan, nan, 1.9656387, 25.62212163, nan, nan],
...,
[nan, nan, nan, 2.55076928, 43.04276273, nan],
[nan, nan, nan, 2.62058763, 22.14260189, nan],
[nan, nan, 1.8050997 , 51.72144285, nan, nan]])
This should do it, if a is your first array and b the second one:
a[~np.isnan(a)] = b.ravel()
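A small self-contained illustration of that one-liner, with toy 2x4 and 2x2 arrays standing in for the ones above:
import numpy as np

nan = np.nan
a = np.array([[nan, 1., 1., nan],
              [nan, nan, 1., 1.]])
b = np.array([[0.76620125, 59.14934823],
              [2.52819832, 43.63809538]])

# boolean indexing walks a in row-major order, so the k-th non-nan slot
# receives the k-th element of b.ravel(); note that a is modified in place
a[~np.isnan(a)] = b.ravel()
print(a)   # b's values dropped into the non-nan slots, row by row
This relies on each row of a containing exactly as many non-nan entries as b has columns; otherwise the flattened values drift out of alignment.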

Apply shading, formatting and borders to pivoted dataframe

I have the following data that has been pivoted:
pip install Jinja2
import pandas as pd
import numpy as np
from numpy import rec, nan
a=rec.array([('FY20', 361.410592 , nan, 21.97, nan, 'Total', 'Fast'),
('FY21', 359.26952604, -1., 22.99, 5., 'Total', 'Fast'),
('FY22', 362.4560529 , 1., 22.77, -1., 'Total', 'Fast'),
('FY23', 371.53543252, 2., 21.92, -4., 'Total', 'Fast'),
('FY24', 374.48894494, 1., 21.88, -0., 'Total', 'Fast'),
('FY25', 377.09481613, 1., 21.85, -0., 'Total', 'Fast'),
('FY20', 67.043756 , nan, 21. , nan, 'Homes', 'Fast'),
('FY21', 110.12145222, 63., 20.95, -0., 'Homes', 'Fast'),
('FY22', 117.46526727, 7., 20.73, -1., 'Homes', 'Fast'),
('FY23', 125.83482531, 7., 18.99, -8., 'Homes', 'Fast'),
('FY24', 126.16748411, 1., 18.95, -0., 'Homes', 'Fast'),
('FY25', 127.786528 , 1., 18.96, 0., 'Homes', 'Fast'),
('FY20', 294.366836 , nan, 22.19, nan, 'Businesses', 'Fast'),
('FY21', 249.14807381, -15., 23.89, 8., 'Businesses', 'Fast'),
('FY22', 245.99078563, -2., 23.74, -1., 'Businesses', 'Fast'),
('FY23', 245.70060721, 0., 23.42, -1., 'Businesses', 'Fast'),
('FY24', 247.32146083, 1., 23.37, -0., 'Businesses', 'Fast'),
('FY25', 250.30828813, 1., 23.33, -0., 'Businesses', 'Fast'),
('FY20', 184.631684 , nan, 15.47, nan, 'Total', 'Medium'),
('FY21', 274.25718084, 49., 15.53, 0., 'Total', 'Medium'),
('FY22', 333.23835913, 21., 15.33, -1., 'Total', 'Medium'),
('FY23', 357.33167549, 7., 15.52, 1., 'Total', 'Medium'),
('FY24', 367.84796426, 3., 15.53, 0., 'Total', 'Medium'),
('FY25', 370.1664439 , 1., 15.53, 0., 'Total', 'Medium'),
('FY20', 46.522416 , nan, 17.89, nan, 'Homes', 'Medium'),
('FY21', 97.63428522, 112., 18.72, 5., 'Homes', 'Medium'),
('FY22', 141.25547499, 46., 17.86, -5., 'Homes', 'Medium'),
('FY23', 157.06766598, 11., 18.33, 3., 'Homes', 'Medium'),
('FY24', 163.02337094, 4., 18.29, -0., 'Homes', 'Medium'),
('FY25', 165.98360465, 1., 18.28, -0., 'Homes', 'Medium'),
('FY20', 138.109268 , nan, 14.66, nan, 'Businesses', 'Medium'),
('FY21', 177.62289562, 28., 13.77, -6., 'Businesses', 'Medium'),
('FY22', 191.98288414, 8., 13.46, -2., 'Businesses', 'Medium'),
('FY23', 200.26400951, 4., 13.31, -1., 'Businesses', 'Medium'),
('FY24', 203.82459332, 2., 13.31, 0., 'Businesses', 'Medium'),
('FY25', 205.18283926, 1., 13.31, 0., 'Businesses', 'Medium')],
dtype=[('FY', 'O'), ('ADV', '<f8'), ('YoY_ADV', '<f8'), ('Yield', '<f8'), ('YoY_Yld', '<f8'), ('Cut', 'O'), ('Product', 'O')])
df = pd.DataFrame(a)
df1=pd.melt(df, id_vars=['FY','Product','Cut'], var_name="Metric", value_name="Value")
df2 = pd.pivot(df1, index=['Metric', 'Product','Cut'],columns=['FY'],values=['Value'])
df2
And it looks like this:
I want to apply table styles so I can copy/paste a polished table into PowerPoint, but I need the following:
Shade columns FY23, FY24, FY25 in orange
Apply formatting: Metric=ADV rounded to zero decimals, Metric=Yield to 2 decimals, and YoY_ADV and YoY_Yld each to 1 decimal place
Negative numbers red, otherwise black
Apply a frame around the table.
Here is my code, but I am getting the error 'Cannot index with multidimensional key':
# 1. If numbers are negative, make red otherwise black
#####################################################
def color_negative_red(x):
    if x < 0:
        return 'color: red'
    else:
        return 'color: black'

# 2. Slice major metrics so formatting can be applied
######################################################
adv_slice = df2.loc[('ADV', slice(None)), :]
yld_slice = df2.loc[('Yield', slice(None)), :]
yoy_adv_slice = df2.loc[('YoY_ADV', slice(None)), :]
yoy_yld_slice = df2.loc[('YoY_Yld', slice(None)), :]

# 3. Apply table style
#####################
df2.style.applymap(color_negative_red).\
    set_properties(**{'background-color': 'orange'}, subset=['FY23', 'FY24', 'FY25']).\
    format('{:.0f}', subset=adv_slice, na_rep='-').\
    format('{:.2f}', subset=yld_slice, na_rep='-').\
    format('{:.1f}', subset=(yoy_adv_slice, yoy_yld_slice), na_rep='-').\
    set_table_styles([{'selector': '',   'props': [('border', '1px solid black')]},
                      {'selector': 'th', 'props': [('border', '1px solid black')]},
                      {'selector': 'td', 'props': [('border', '1px solid black')]}]).\
    set_properties(**{'text-align': 'center'})
What is needed to make the code work?
With some mods:
# 1. If numbers are negative, make red otherwise black
#####################################################
def color_negative_red(x):
    if x < 0:
        return 'color: red'
    else:
        return 'color: black'

# 2. Apply table style
#####################
df2.style.applymap(color_negative_red).\
    set_properties(**{'background-color': 'orange'}, subset=['FY23', 'FY24', 'FY25']).\
    format('{:.0f}', subset=('ADV',), na_rep='-').\
    format('{:.2f}', subset=('Yield',), na_rep='-').\
    format('{:.1f}', subset=('YoY_ADV',), na_rep='-').\
    format('{:.1f}', subset=('YoY_Yld',), na_rep='-').\
    set_table_styles([{'selector': '',   'props': [('border', '0.1px solid black')]},
                      {'selector': 'th', 'props': [('border', '0.5px solid black')]},
                      {'selector': 'td', 'props': [('border', '0.5px solid black')]}]).\
    set_properties(**{'text-align': 'center'})
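A small aside, in case a recent pandas is in use: Styler.applymap was renamed to Styler.map in pandas 2.1, and a finished Styler can be rendered to HTML or written to Excel, which is often the easiest way to carry the formatting toward PowerPoint (the file name below is arbitrary):
styler = df2.style.map(color_negative_red)   # .map replaces .applymap on pandas >= 2.1
html = styler.to_html()                      # HTML string including the generated CSS
styler.to_excel("styled_table.xlsx")         # or write to Excel (needs openpyxl)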

XGBoostClassifier for multiclass and RandomizedSearchCV give nan scores and same probabilities for all classes

X is a DataFrame with some categorical and continuous variables; y is a multi-class target with 5 classes. No variables have the 'object' dtype, and there are no NaNs in the DataFrame.
The code is as follows:
params = {'max_depth': np.arange(3, 20, 1),
          'learning_rate': np.arange(0.01, 0.5, 0.01),
          'subsample': np.arange(0.5, 1.0, 0.1),
          'colsample_bytree': np.arange(0.4, 1.0, 0.1),
          'colsample_bylevel': np.arange(0.4, 1.0, 0.1),
          'gamma': np.arange(0, 10, 0.05),
          'reg_alpha': np.arange(0, 80, 1),
          'reg_lambda': np.arange(0, 1, 0.1)}

scoring = {'f1': make_scorer(f1_score, needs_proba=True, multi_class="ovr")}
# I plan to eventually add more score metrics

xgbc = xgb.XGBClassifier(seed=20, eval_metric='mlogloss')
clf = RandomizedSearchCV(estimator=xgbc, param_distributions=params, scoring=scoring,
                         n_iter=10, verbose=False, n_jobs=1, refit='f1', return_train_score=False,
                         random_state=30, cv=10)
clf.fit(X, y)
print(clf.cv_results_)
print(clf.predict_proba(X))
# Need estimated probabilities for each class for each row of X
print("Best parameters:", clf.best_params_)
The output (truncated) is:
[0.2 0.2 0.2 0.2 0.2]
[0.2 0.2 0.2 0.2 0.2]
[0.2 0.2 0.2 0.2 0.2]
[0.2 0.2 0.2 0.2 0.2]]
Best parameters: {'subsample': 0.7, 'reg_lambda': 0.30000000000000004, 'reg_alpha': 58, 'n_estimators': 1000, 'max_depth': 13, 'learning_rate': 0.17, 'gamma': 7.1000000000000005, 'colsample_bytree': 0.4, 'colsample_bylevel': 0.6}
'split8_test_f1': array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]),
'split9_test_f1': array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]),
'mean_test_f1': array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]),
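A note on why every split_test_f1 can come out as nan (a sketch of one common cause, not a confirmed diagnosis for this exact setup): f1_score is computed from hard class predictions and does not accept a multi_class argument, so a scorer built with needs_proba=True and multi_class="ovr" fails in every CV fold and the search records nan in place of a score. A scorer that would run, assuming macro averaging is acceptable:
from sklearn.metrics import f1_score, roc_auc_score, make_scorer

# f1 on hard predictions; the averaging strategy (macro here) is an assumption
scoring = {'f1': make_scorer(f1_score, average='macro')}

# if a probability-based metric is really the goal, roc_auc_score does accept
# probabilities and a multi_class argument:
# scoring = {'auc': make_scorer(roc_auc_score, needs_proba=True, multi_class='ovr')}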

SHAP DeepExplainer: shap_values containing "nan" values

I have an issue with my SHAP values. Here is my model:
Model: "model_4"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_5 (InputLayer) [(None, 158)] 0
__________________________________________________________________________________________________
model_1 (Model) (None, 158) 57310 input_5[0][0]
__________________________________________________________________________________________________
subtract_4 (Subtract) (None, 158) 0 input_5[0][0]
model_1[5][0]
__________________________________________________________________________________________________
multiply_4 (Multiply) (None, 158) 0 subtract_4[0][0]
subtract_4[0][0]
__________________________________________________________________________________________________
lambda_4 (Lambda) (None,) 0 multiply_4[0][0]
__________________________________________________________________________________________________
reshape_3 (Reshape) (None, 1) 0 lambda_4[0][0]
==================================================================================================
Total params: 57,310
Trainable params: 57,310
Non-trainable params: 0
__________________________________________________________________________________________________
And I call:
scores = new_model.predict(X_test_scaled)
scores = scores.reshape(scores.shape[0],1)
toexplain = np.append(X_test_scaled, scores, axis = 1)
toexplain = pd.DataFrame(toexplain)
toexplain.sort_values(by = [158], ascending=False, inplace=True)
toexplain = toexplain.iloc[0:16]
toexplain.drop(columns = [158], axis = 1, inplace = True)
explainer=shap.DeepExplainer(new_model, df_sampled_X_train_scaled)
shap_values = explainer.shap_values(toexplain, check_additivity=False)
But my shap values look like this (for the first instance):
shap_values[0]
array([ nan, nan, nan, 0.08352888, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, 0.03286453,
nan, nan, 0.2984612 , nan, nan,
nan, 0.01110088, -0.85235232, nan, nan,
nan, nan, nan, nan, -0.27935541,
nan, nan, nan, nan, nan,
nan, nan, -0.18422949, 0.01466912, nan,
nan, nan, -0.1688329 , 0.07462809, 0.03071906,
nan, -0.00554245, nan, nan, nan,
nan, 0.04587848, nan, nan, nan,
nan, 0.05448143, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, 0.00933742, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, 0.00919492, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan])
I'm fairly certain I'm not supposed to have nan values among my shap_values, but I can't seem to find the original issue.
Moreover, the predicted values given by shap.force_plot are different from my model's predictions, which is why I checked my shap_values in the first place.
Would anyone know how I could fix that?
Okay, after reading shap's source code, I realised it doesn't take into account that the data are pandas DataFrames, even though the documentation says otherwise.
It worked using numpy arrays.
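A minimal sketch of that fix with the same variable names as above, converting both the background data and the rows to explain to plain numpy arrays before they reach DeepExplainer:
import numpy as np

background = np.asarray(df_sampled_X_train_scaled)    # background data as a numpy array
explainer = shap.DeepExplainer(new_model, background)

# toexplain was built as a DataFrame above, so hand over its underlying array
shap_values = explainer.shap_values(toexplain.to_numpy(), check_additivity=False)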