Merge arrays on non-nan values - numpy

In numpy, how would you merge the following two arrays on their non-nan values, resulting in the third array below?
Array 1 (shape: r, c):
array([[nan,  1.,  1., nan, nan, nan],
       [nan, nan, nan,  1.,  1., nan],
       [nan, nan,  1.,  1., nan, nan],
       ...,
       [nan, nan, nan,  1.,  1., nan],
       [nan, nan, nan,  1.,  1., nan],
       [nan, nan,  1.,  1., nan, nan]])
Array 2 (shape: r, 2):
array([[ 0.76620125, 59.14934823],
       [ 2.52819832, 43.63809538],
       [ 1.9656387 , 25.62212163],
       ...,
       [ 2.55076928, 43.04276273],
       [ 2.62058763, 22.14260189],
       [ 1.8050997 , 51.72144285]])
Resulting array (shape: r, c):
array([[nan, 0.76620125, 59.14934823, nan, nan, nan],
       [nan, nan, nan, 2.52819832, 43.63809538, nan],
       [nan, nan, 1.9656387 , 25.62212163, nan, nan],
       ...,
       [nan, nan, nan, 2.55076928, 43.04276273, nan],
       [nan, nan, nan, 2.62058763, 22.14260189, nan],
       [nan, nan, 1.8050997 , 51.72144285, nan, nan]])

This should do it, if a is your first array and b is the second one:
a[~np.isnan(a)] = b.ravel()
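Why this works: the boolean mask ~np.isnan(a) selects the non-nan slots of a in row-major order, and b.ravel() flattens b in the same order, so the assignment lines up as long as each row of a holds exactly as many non-nan entries as b has columns. A minimal runnable sketch, using small made-up arrays in place of the originals:
import numpy as np

nan = np.nan
a = np.array([[nan, 1., 1., nan],
              [nan, nan, 1., 1.]])
b = np.array([[0.7, 59.1],
              [2.5, 43.6]])

# Non-nan positions are visited in row-major (C) order,
# matching the row-major flattening done by b.ravel().
a[~np.isnan(a)] = b.ravel()
print(a)
# [[ nan  0.7 59.1  nan]
#  [ nan  nan  2.5 43.6]]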

Related

Loss to measure if a sample belong to a distribution

I am implementing a Bayesian neural network to generate data imitating a distribution (which is a Bernoulli). The architecture of the network is the following:
model = tf.keras.Sequential([
    tfkl.Input(shape=(), name='dummy_input'),
    tfpl.DistributionLambda(lambda t: latentNormal,
                            convert_to_tensor_fn=lambda x: x.sample(batchSize)),
    tfp.layers.DenseReparameterization(units=inputDim, activation=tf.nn.relu),
    tfpl.DistributionLambda(lambda t: tfd.Bernoulli(probs=t))
])
I want to run the fit manually. To do so, I create a batch of the data I want to imitate and define the loss myself. Here is one step of the gradient method:
negloglik = lambda data: -model(0).log_prob(data)
optimizer = tf.keras.optimizers.Adam()
kls = []
pbar = tqdm(range(100))
epoch = 0
# print(epoch)
# model.fit(mimic[:1453*32], mimic[:1453*32], epochs=1, batch_size=batchSize, verbose=0)
idx = np.random.choice(np.arange(mimic.shape[0]), size=1453*batchSize, replace=False)
shuffled_ds = mimic.numpy()[idx]
nBatch = 0
# print(nBatch)
batch = shuffled_ds[nBatch*batchSize:(1+nBatch)*batchSize]
with tf.GradientTape() as tape:
    tape.watch(model.trainable_variables)
    loss = negloglik(batch)
    loss = tf.reduce_mean(loss)
grads = tape.gradient(loss, model.trainable_variables)
But loss = negloglik(batch) and grads are full of nan (below is the loss before the reduce_mean):
<tf.Tensor: shape=(10, 5), dtype=float32, numpy=
array([[           nan,            nan,            nan,  2.9453604e+00, -0.0000000e+00],
       [ 4.0824111e-03, -0.0000000e+00, -0.0000000e+00, -0.0000000e+00,            nan],
       [           nan,            nan,            nan,            nan,            nan],
       [           nan,            nan,            nan,            nan,            nan],
       [           nan, -0.0000000e+00,            nan,  4.2825124e-01,            nan],
       [-0.0000000e+00, -0.0000000e+00, -0.0000000e+00,            nan,            nan],
       [-0.0000000e+00,            nan, -0.0000000e+00,            nan, -0.0000000e+00],
       [           nan, -0.0000000e+00,            nan,  1.4346083e+00,            nan],
       [           nan, -0.0000000e+00,            nan, -0.0000000e+00,            nan],
       [-0.0000000e+00, -0.0000000e+00, -0.0000000e+00,  4.4231796e+00,            nan]], dtype=float32)>
Do you have any idea why I get so many nans? And do you know what kind of loss I can use to measure whether a sample belongs to a certain distribution (to replace the negative log-likelihood here)?
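One plausible source of the nans (an assumption, since the full setup isn't shown): the ReLU-activated DenseReparameterization layer can output values greater than 1, and tfd.Bernoulli(probs=t) produces nan log-probabilities whenever probs leaves [0, 1], because log(1 - t) is undefined for t > 1. Parameterizing by logits avoids the constraint entirely. A minimal sketch of the difference:
import tensorflow_probability as tfp

tfd = tfp.distributions

# probs outside [0, 1] make log(1 - p) undefined -> nan log-probability
print(tfd.Bernoulli(probs=2.5).log_prob(0.))   # nan

# logits are unconstrained, so log_prob stays finite for any real value
print(tfd.Bernoulli(logits=2.5).log_prob(0.))  # finite
If that is the cause, dropping the ReLU and building the final layer as tfpl.DistributionLambda(lambda t: tfd.Bernoulli(logits=t)) would keep every log_prob finite.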

Apply shading, formatting and borders to pivoted dataframe

I have the following data that has been pivoted:
# pandas Styler needs Jinja2: pip install Jinja2
import pandas as pd
import numpy as np
from numpy import rec, nan
a = rec.array([('FY20', 361.410592  , nan, 21.97, nan, 'Total', 'Fast'),
               ('FY21', 359.26952604, -1., 22.99,  5., 'Total', 'Fast'),
               ('FY22', 362.4560529 ,  1., 22.77, -1., 'Total', 'Fast'),
               ('FY23', 371.53543252,  2., 21.92, -4., 'Total', 'Fast'),
               ('FY24', 374.48894494,  1., 21.88, -0., 'Total', 'Fast'),
               ('FY25', 377.09481613,  1., 21.85, -0., 'Total', 'Fast'),
               ('FY20',  67.043756  , nan, 21.  , nan, 'Homes', 'Fast'),
               ('FY21', 110.12145222, 63., 20.95, -0., 'Homes', 'Fast'),
               ('FY22', 117.46526727,  7., 20.73, -1., 'Homes', 'Fast'),
               ('FY23', 125.83482531,  7., 18.99, -8., 'Homes', 'Fast'),
               ('FY24', 126.16748411,  1., 18.95, -0., 'Homes', 'Fast'),
               ('FY25', 127.786528  ,  1., 18.96,  0., 'Homes', 'Fast'),
               ('FY20', 294.366836  , nan, 22.19, nan, 'Businesses', 'Fast'),
               ('FY21', 249.14807381, -15., 23.89,  8., 'Businesses', 'Fast'),
               ('FY22', 245.99078563, -2., 23.74, -1., 'Businesses', 'Fast'),
               ('FY23', 245.70060721,  0., 23.42, -1., 'Businesses', 'Fast'),
               ('FY24', 247.32146083,  1., 23.37, -0., 'Businesses', 'Fast'),
               ('FY25', 250.30828813,  1., 23.33, -0., 'Businesses', 'Fast'),
               ('FY20', 184.631684  , nan, 15.47, nan, 'Total', 'Medium'),
               ('FY21', 274.25718084, 49., 15.53,  0., 'Total', 'Medium'),
               ('FY22', 333.23835913, 21., 15.33, -1., 'Total', 'Medium'),
               ('FY23', 357.33167549,  7., 15.52,  1., 'Total', 'Medium'),
               ('FY24', 367.84796426,  3., 15.53,  0., 'Total', 'Medium'),
               ('FY25', 370.1664439 ,  1., 15.53,  0., 'Total', 'Medium'),
               ('FY20',  46.522416  , nan, 17.89, nan, 'Homes', 'Medium'),
               ('FY21',  97.63428522, 112., 18.72,  5., 'Homes', 'Medium'),
               ('FY22', 141.25547499, 46., 17.86, -5., 'Homes', 'Medium'),
               ('FY23', 157.06766598, 11., 18.33,  3., 'Homes', 'Medium'),
               ('FY24', 163.02337094,  4., 18.29, -0., 'Homes', 'Medium'),
               ('FY25', 165.98360465,  1., 18.28, -0., 'Homes', 'Medium'),
               ('FY20', 138.109268  , nan, 14.66, nan, 'Businesses', 'Medium'),
               ('FY21', 177.62289562, 28., 13.77, -6., 'Businesses', 'Medium'),
               ('FY22', 191.98288414,  8., 13.46, -2., 'Businesses', 'Medium'),
               ('FY23', 200.26400951,  4., 13.31, -1., 'Businesses', 'Medium'),
               ('FY24', 203.82459332,  2., 13.31,  0., 'Businesses', 'Medium'),
               ('FY25', 205.18283926,  1., 13.31,  0., 'Businesses', 'Medium')],
              dtype=[('FY', 'O'), ('ADV', '<f8'), ('YoY_ADV', '<f8'), ('Yield', '<f8'), ('YoY_Yld', '<f8'), ('Cut', 'O'), ('Product', 'O')])
df = pd.DataFrame(a)
df1 = pd.melt(df, id_vars=['FY', 'Product', 'Cut'], var_name='Metric', value_name='Value')
df2 = pd.pivot(df1, index=['Metric', 'Product', 'Cut'], columns=['FY'], values=['Value'])
df2
And it looks like this (screenshot of the rendered table omitted):
I want to apply table styles so I can copy/paste a polished table into PowerPoint, and I need the following:
Shade columns FY23, FY24 and FY25 in orange
Apply number formatting: Metric=ADV rounded to 0 decimals, Metric=Yield to 2 decimals, and YoY_ADV and YoY_Yld each to 1 decimal place
Colour negative numbers red, otherwise black
Apply a frame around the table.
Here is my code, but I am getting the error 'Cannot index with multidimensional key':
# 1. If numbers are negative, make red otherwise black
#####################################################
def color_negative_red(x):
    if x < 0:
        return 'color: red'
    else:
        return 'color: black'

# 2. Slice major metrics so formatting can be applied
######################################################
adv_slice = df2.loc[('ADV', slice(None)), :]
yld_slice = df2.loc[('Yield', slice(None)), :]
yoy_adv_slice = df2.loc[('YoY_ADV', slice(None)), :]
yoy_yld_slice = df2.loc[('YoY_Yld', slice(None)), :]
# 3. Apply table style
#####################
df2.style.applymap(color_negative_red)\
    .set_properties(**{'background-color': 'orange'}, subset=['FY23', 'FY24', 'FY25'])\
    .format('{:.0f}', subset=adv_slice, na_rep='-')\
    .format('{:.2f}', subset=yld_slice, na_rep='-')\
    .format('{:.1f}', subset=(yoy_adv_slice, yoy_yld_slice), na_rep='-')\
    .set_table_styles([{'selector': '',   'props': [('border', '1px solid black')]},
                       {'selector': 'th', 'props': [('border', '1px solid black')]},
                       {'selector': 'td', 'props': [('border', '1px solid black')]}])\
    .set_properties(**{'text-align': 'center'})
What is needed to make the code work?
With some modifications:
# 1. If numbers are negative, make red otherwise black
#####################################################
def color_negative_red(x):
    if x < 0:
        return 'color: red'
    else:
        return 'color: black'

# 2. Apply table style
#####################
df2.style.applymap(color_negative_red)\
    .set_properties(**{'background-color': 'orange'}, subset=['FY23', 'FY24', 'FY25'])\
    .format('{:.0f}', subset=('ADV',), na_rep='-')\
    .format('{:.2f}', subset=('Yield',), na_rep='-')\
    .format('{:.1f}', subset=('YoY_ADV',), na_rep='-')\
    .format('{:.1f}', subset=('YoY_Yld',), na_rep='-')\
    .set_table_styles([{'selector': '',   'props': [('border', '0.1px solid black')]},
                       {'selector': 'th', 'props': [('border', '0.5px solid black')]},
                       {'selector': 'td', 'props': [('border', '0.5px solid black')]}])\
    .set_properties(**{'text-align': 'center'})
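The key change, for anyone hitting the same error: Styler's subset argument expects an indexing key (anything you could pass to df2.loc), not a pre-sliced DataFrame, which is what raised 'Cannot index with multidimensional key'. A tuple like ('ADV',) selects the rows whose first index level is 'ADV'. The same selection can be spelled out with pd.IndexSlice; a sketch, assuming the df2 built above:
# rows where Metric == 'Yield' across every Product/Cut, all columns
df2.style.format('{:.2f}',
                 subset=pd.IndexSlice[('Yield', slice(None), slice(None)), :],
                 na_rep='-')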

NumPy, SciPy - how to calculate the z score for subsets of an array?

Using the array a below as an example, I am looking for a scalable way to calculate the z score of the last 2 columns a[:, 3:] separately for each value in the third column a[:, 2].
In [52]: import numpy as np; from scipy import stats
In [53]: a = np.array([[0., 0., 0., 1., 2.], [0., 0., 1., 3., 4.],
    ...:               [1., 0., 0., 5., 6.], [1., 0., 1., 7., 8.],
    ...:               [2., 0., 0., 9., 6.], [2., 0., 1., 8., 9.],
    ...:               [3., np.NaN, np.NaN, np.NaN, np.NaN]])
In [54]: a
Out[54]:
array([[  0.,   0.,   0.,   1.,   2.],
       [  0.,   0.,   1.,   3.,   4.],
       [  1.,   0.,   0.,   5.,   6.],
       [  1.,   0.,   1.,   7.,   8.],
       [  2.,   0.,   0.,   9.,   6.],
       [  2.,   0.,   1.,   8.,   9.],
       [  3.,  nan,  nan,  nan,  nan]])
For the case where the third column is 0 (a[:, 2] == 0), I can calculate it with
In [48]: np.fromfunction(lambda i, j: stats.zscore(a[a[:, 2] == 0][:, 3:]), (1, 1))
Out[48]:
array([[-1.22474487, -1.41421356],
       [ 0.        ,  0.70710678],
       [ 1.22474487,  0.70710678]])
and for the case where the third column is 1 (a[:, 2] == 1), I can calculate it with
In [49]: np.fromfunction(lambda i, j: stats.zscore(a[a[:, 2] == 1][:, 3:]), (1, 1))
Out[49]:
array([[-1.38873015, -1.38873015],
       [ 0.46291005,  0.46291005],
       [ 0.9258201 ,  0.9258201 ]])
How can I augment my original array with these results, regardless of the number of rows and of distinct values in the third column, to create something like the following?
Out[62]:
array([[ 0.        ,  0.        ,  0.        ,  1.        ,  2.        , -1.22474487, -1.41421356],
       [ 0.        ,  0.        ,  1.        ,  3.        ,  4.        , -1.38873015, -1.38873015],
       [ 1.        ,  0.        ,  0.        ,  5.        ,  6.        ,  0.        ,  0.70710678],
       [ 1.        ,  0.        ,  1.        ,  7.        ,  8.        ,  0.46291005,  0.46291005],
       [ 2.        ,  0.        ,  0.        ,  9.        ,  6.        ,  1.22474487,  0.70710678],
       [ 2.        ,  0.        ,  1.        ,  8.        ,  9.        ,  0.9258201 ,  0.9258201 ],
       [ 3.        ,         nan,         nan,         nan,         nan,         nan,         nan]])
You need to create an array z with the same number of rows as a, fill it with the per-group z scores, and use np.column_stack to combine them:
z1 = np.fromfunction(lambda i, j: stats.zscore(a[a[:, 2] == 0][:, 3:]), (1, 1))
z2 = np.fromfunction(lambda i, j: stats.zscore(a[a[:, 2] == 1][:, 3:]), (1, 1))
z = np.zeros((a.shape[0], z1.shape[1])) * np.nan   # nan-filled placeholder
# rows alternate between the two groups in this example, hence the stride of 2
z[::2][:z1.shape[0]] = z1
z[1::2][:z2.shape[0]] = z2
arr1 = np.column_stack((a, z))
arr1
array([[ 0.        ,  0.        ,  0.        ,  1.        ,  2.        , -1.22474487, -1.41421356],
       [ 0.        ,  0.        ,  1.        ,  3.        ,  4.        , -1.38873015, -1.38873015],
       [ 1.        ,  0.        ,  0.        ,  5.        ,  6.        ,  0.        ,  0.70710678],
       [ 1.        ,  0.        ,  1.        ,  7.        ,  8.        ,  0.46291005,  0.46291005],
       [ 2.        ,  0.        ,  0.        ,  9.        ,  6.        ,  1.22474487,  0.70710678],
       [ 2.        ,  0.        ,  1.        ,  8.        ,  9.        ,  0.9258201 ,  0.9258201 ],
       [ 3.        ,         nan,         nan,         nan,         nan,         nan,         nan]])
For n unique values in a[:, 2]:
# unique group labels, with nan dropped
N = np.unique(a[:, 2])
N = N[~np.isnan(N)]
zTemp = [np.fromfunction(lambda i, j: stats.zscore(a[a[:, 2] == k][:, 3:]), (1, 1)) for k in N]
z = np.zeros((a.shape[0], zTemp[0].shape[1])) * np.nan
for i in range(len(zTemp)):
    # stride by the number of groups, since the labels cycle row by row here
    z[i::len(N)][:zTemp[i].shape[0]] = zTemp[i]
arr1 = np.column_stack((a, z))
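Note that the strided assignment only lines up because the rows in this example cycle through the group labels in order. A sketch of a mask-based variant (reusing a, N, and stats from above) that makes no assumption about row order:
z = np.full((a.shape[0], a[:, 3:].shape[1]), np.nan)
for k in N:
    mask = a[:, 2] == k
    # write each group's z scores back to the rows they came from
    z[mask] = stats.zscore(a[mask][:, 3:])
arr1 = np.column_stack((a, z))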

SHAP DeepExplainer: shap_values containing "nan" values

I have an issue with my shap values, here is my model:
Model: "model_4"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_5 (InputLayer) [(None, 158)] 0
__________________________________________________________________________________________________
model_1 (Model) (None, 158) 57310 input_5[0][0]
__________________________________________________________________________________________________
subtract_4 (Subtract) (None, 158) 0 input_5[0][0]
model_1[5][0]
__________________________________________________________________________________________________
multiply_4 (Multiply) (None, 158) 0 subtract_4[0][0]
subtract_4[0][0]
__________________________________________________________________________________________________
lambda_4 (Lambda) (None,) 0 multiply_4[0][0]
__________________________________________________________________________________________________
reshape_3 (Reshape) (None, 1) 0 lambda_4[0][0]
==================================================================================================
Total params: 57,310
Trainable params: 57,310
Non-trainable params: 0
__________________________________________________________________________________________________
And I call:
scores = new_model.predict(X_test_scaled)
scores = scores.reshape(scores.shape[0], 1)
toexplain = np.append(X_test_scaled, scores, axis=1)
toexplain = pd.DataFrame(toexplain)
toexplain.sort_values(by=[158], ascending=False, inplace=True)
toexplain = toexplain.iloc[0:16]
toexplain.drop(columns=[158], axis=1, inplace=True)
explainer = shap.DeepExplainer(new_model, df_sampled_X_train_scaled)
shap_values = explainer.shap_values(toexplain, check_additivity=False)
But my shap values look like this (for the first instance):
shap_values[0]
array([        nan,         nan,         nan,  0.08352888,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,  0.03286453,
               nan,         nan,  0.2984612 ,         nan,         nan,
               nan,  0.01110088, -0.85235232,         nan,         nan,
               nan,         nan,         nan,         nan, -0.27935541,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan, -0.18422949,  0.01466912,         nan,
               nan,         nan, -0.1688329 ,  0.07462809,  0.03071906,
               nan, -0.00554245,         nan,         nan,         nan,
               nan,  0.04587848,         nan,         nan,         nan,
               nan,  0.05448143,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,  0.00933742,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,  0.00919492,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan])
I'm fairly certain I'm not supposed to have nan values among my shap_values, but I can't seem to find the original issue.
Moreover, the predicted values given by shap.force_plot are different from my model's predictions, which is why I checked my shap_values in the first place.
Would anyone know how I could fix that?
Okay, after reading shap's source code, I realised it didn't take into account that the data were pandas DataFrames, even though the documentation says otherwise.
It worked using numpy arrays.
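Concretely, that amounts to converting both inputs before calling the explainer; a sketch reusing the variable names from the question (np.asarray and DataFrame.to_numpy are just the standard conversions):
# pass plain numpy arrays instead of DataFrames
explainer = shap.DeepExplainer(new_model, np.asarray(df_sampled_X_train_scaled))
shap_values = explainer.shap_values(toexplain.to_numpy(), check_additivity=False)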

Genfromtxt does not return column names

I want to convert a csv file into a numpy array. The first row of the csv file contains the names / titles of the columns. But when I use genfromtxt with the names parameter set to True, I still receive only a plain numpy array with a lot of NaN values. What did I forget?
numpy.genfromtxt("test.csv", names=True, delimiter=",")
array([[ NaN,  64.,  11., ...,  NaN,  NaN,  NaN],
       [ NaN,  64.,  11., ...,  NaN,  NaN,  NaN],
       [ NaN,  64.,  11., ...,  NaN,  NaN,  NaN],
       ...,
       [ NaN,  64.,  11., ...,  NaN,  NaN,  NaN],
       [ NaN,  64.,  11., ...,  NaN,  NaN,  NaN],
       [ NaN,  64.,   5., ...,  NaN,  NaN,  NaN]])
You have to set the dtype to None:
numpy.genfromtxt("test.csv", names=True, delimiter=",", dtype=None)