Implementing backpropagation gradient descent using scipy.optimize.minimize - numpy

I am trying to train an autoencoder NN (3 layers: 2 visible, 1 hidden) using numpy and scipy on the MNIST digit images dataset. The implementation is based on the notation given here. Below is my code:
def autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, data):
    """
    The input theta is a 1-dimensional array because scipy.optimize.minimize expects
    the parameters being optimized to be a 1d array.
    First convert theta from a 1d array to the (W1, W2, b1, b2)
    matrix/vector format, so that this follows the notation convention of the
    lecture notes and tutorial.
    You must compute the:
    cost : scalar representing the overall cost J(theta)
    grad : array representing the corresponding gradient of each element of theta
    """
    training_size = data.shape[1]
    # unroll theta to get (W1,W2,b1,b2) #
    W1 = theta[0:hidden_size*visible_size]
    W1 = W1.reshape(hidden_size,visible_size)
    W2 = theta[hidden_size*visible_size:2*hidden_size*visible_size]
    W2 = W2.reshape(visible_size,hidden_size)
    b1 = theta[2*hidden_size*visible_size:2*hidden_size*visible_size + hidden_size]
    b2 = theta[2*hidden_size*visible_size + hidden_size: 2*hidden_size*visible_size + hidden_size + visible_size]
    #feedforward pass
    a_l1 = data
    z_l2 = W1.dot(a_l1) + numpy.tile(b1,(training_size,1)).T
    a_l2 = sigmoid(z_l2)
    z_l3 = W2.dot(a_l2) + numpy.tile(b2,(training_size,1)).T
    a_l3 = sigmoid(z_l3)
    #backprop
    delta_l3 = numpy.multiply(-(data-a_l3),numpy.multiply(a_l3,1-a_l3))
    delta_l2 = numpy.multiply(W2.T.dot(delta_l3),
                              numpy.multiply(a_l2, 1 - a_l2))
    b2_derivative = numpy.sum(delta_l3,axis=1)/training_size
    b1_derivative = numpy.sum(delta_l2,axis=1)/training_size
    W2_derivative = numpy.dot(delta_l3,a_l2.T)/training_size + lambda_*W2
    #print(W2_derivative.shape)
    W1_derivative = numpy.dot(delta_l2,a_l1.T)/training_size + lambda_*W1
    W1_derivative = W1_derivative.reshape(hidden_size*visible_size)
    W2_derivative = W2_derivative.reshape(visible_size*hidden_size)
    b1_derivative = b1_derivative.reshape(hidden_size)
    b2_derivative = b2_derivative.reshape(visible_size)
    grad = numpy.concatenate((W1_derivative,W2_derivative,b1_derivative,b2_derivative))
    cost = 0.5*numpy.sum((data-a_l3)**2)/training_size + 0.5*lambda_*(numpy.sum(W1**2) + numpy.sum(W2**2))
    return cost,grad
I have also implemented a function to estimate the numerical gradient and verify the correctness of my implementation (below).
def compute_gradient_numerical_estimate(J, theta, epsilon=0.0001):
    """
    :param J: a loss (cost) function that computes the real-valued loss given parameters and data
    :param theta: array of parameters
    :param epsilon: amount to vary each parameter in order to estimate
    the gradient by numerical difference
    :return: array of numerical gradient estimate
    """
    gradient = numpy.zeros(theta.shape)
    eps_vector = numpy.zeros(theta.shape)
    for i in range(0,theta.size):
        eps_vector[i] = epsilon
        cost1,grad1 = J(theta+eps_vector)
        cost2,grad2 = J(theta-eps_vector)
        gradient[i] = (cost1 - cost2)/(2*epsilon)
        eps_vector[i] = 0
    return gradient
The norm of the difference between the numerical estimate and the gradient computed by the function is around 6.87165125021e-09, which seems acceptable. My main problem is getting the "L-BFGS-B" optimizer to work through the scipy.optimize.minimize function, as below:
# theta is the 1-D array of(W1,W2,b1,b2)
J = lambda x: utils.autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, patches_train)
options_ = {'maxiter': 4000, 'disp': False}
result = scipy.optimize.minimize(J, theta, method='L-BFGS-B', jac=True, options=options_)
I get the below output from this:
scipy.optimize.minimize() details:
fun: 90.802022224079778
hess_inv: <16474x16474 LbfgsInvHessProduct with dtype=float64>
jac: array([ -6.83667742e-06, -2.74886002e-06, -3.23531941e-06, ...,
1.22425735e-01, 1.23425062e-01, 1.28091250e-01])
message: b'ABNORMAL_TERMINATION_IN_LNSRCH'
nfev: 21
nit: 0
status: 2
success: False
x: array([-0.06836677, -0.0274886 , -0.03235319, ..., 0. ,
0. , 0. ])
Now, this post seems to indicate that the error could mean the gradient implementation is wrong, but my numerical gradient estimate seems to confirm that it is correct. I have tried varying the initial weights by using a uniform distribution as specified here, but the problem persists. Is there anything wrong with my backprop implementation?

Turns out the issue was a (very silly) bug in this line:
J = lambda x: utils.autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, patches_train)
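The lambda never uses its own parameter x in its body; it always passes the outer theta instead. So the parameter array proposed by the optimizer was never passed on when J was invoked, and every call returned the cost and gradient at the initial theta.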
This fixed it:
J = lambda x: utils.autoencoder_cost_and_grad(x, visible_size, hidden_size, lambda_, patches_train)
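To make the fix concrete, here is a minimal sketch of the same calling pattern on a toy problem. The names cost_and_grad and target are made up for illustration and are not part of the code above; the only point is that the wrapper forwards its own argument x to a function returning (cost, gradient), which is what jac=True expects.

```python
import numpy as np
from scipy.optimize import minimize

def cost_and_grad(theta, target):
    """Toy objective: return (cost, gradient) of 0.5 * ||theta - target||^2."""
    diff = theta - target
    return 0.5 * np.dot(diff, diff), diff

target = np.array([1.0, -2.0, 3.0])   # hypothetical optimum
theta0 = np.zeros(3)                  # initial guess

# The wrapper must forward its own argument x, not the outer theta0.
J = lambda x: cost_and_grad(x, target)

# jac=True tells minimize that J returns the pair (cost, gradient).
result = minimize(J, theta0, method='L-BFGS-B', jac=True)
print(result.success, result.x)
```

An alternative that avoids the lambda altogether is to pass the extra arguments through args, for example minimize(cost_and_grad, theta0, args=(target,), method='L-BFGS-B', jac=True).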

Related

How to implement custom Keras ordinal loss function with tensor evaluation without disturbing TF>2.0 Model Graph?

I am trying to implement a custom loss function in Tensorflow 2.4 using the Keras backend.
The loss function is a ranking loss; I found the following paper with a somewhat log-likelihood loss: Chen et al. Single-Image Depth Perception in the Wild.
Similarly, I wanted to sample some (in this case 50) points from an image to compare the relative order between ground-truth and predicted depth maps using the NYU-Depth dataset. Being a fan of Numpy, I started working with that but came to the following exception:
ValueError: No gradients provided for any variable: [...]
I have learned that this is caused by the arguments not being filled when calling the loss function but instead, a C function is compiled which is then used later. So while I know the dimensions of my tensors (4, 480, 640, 1), I cannot work with the data as wanted and have to use the keras.backend functions on top so that in the end (if I understood correctly), there is supposed to be a path between the input tensors from the TF graph and the output tensor, which has to provide a gradient.
So my question now is: Is this a feasible loss function within keras?
I have already tried a few ideas and different approaches with different variations of my original code, which was something like:
def ranking_loss_function(y_true, y_pred):
    # Chen et al. loss
    y_true_np = K.eval(y_true)
    y_pred_np = K.eval(y_pred)
    if y_true_np.shape[0] != None:
        num_sample_points = 50
        total_samples = num_sample_points ** 2
        err_list = [0 for x in range(y_true_np.shape[0])]
        for i in range(y_true_np.shape[0]):
            sample_points = create_random_samples(y_true, y_pred, num_sample_points)
            for x1, y1 in sample_points:
                for x2, y2 in sample_points:
                    if y_true[i][x1][y1] > y_true[i][x2][y2]:
                        #image_relation_true = 1
                        err_list[i] += np.log(1 + np.exp(-1 * y_pred[i][x1][y1] + y_pred[i][x2][y2]))
                    elif y_true[i][x1][y1] < y_true[i][x2][y2]:
                        #image_relation_true = -1
                        err_list[i] += np.log(1 + np.exp(y_pred[i][x1][y1] - y_pred[i][x2][y2]))
                    else:
                        #image_relation_true = 0
                        err_list[i] += np.square(y_pred[i][x1][y1] - y_pred[i][x2][y2])
        err_list = np.divide(err_list, total_samples)
        return K.constant(err_list)
As you can probably tell, the main idea was to first create the sample points and then based on the existing relation between them in y_true/y_pred continue with the corresponding computation from the cited paper.
Can anyone help me and provide some more helpful information or tips on how to correctly implement this loss using keras.backend functions? Trying to include the ordinal relation information really confused me compared to standard regression losses.
EDIT: Just in case this causes confusion: create_random_samples() just creates 50 random sample points (x, y) coordinate pairs based on the shape[1] and shape[2] of y_true (image width and height)
EDIT(2): After finding this variation on GitHub, I have tried out a variation using only TF functions to retrieve data from the tensors and compute the output. The adjusted and probably more correct version still throws the same exception though:
def ranking_loss_function(y_true, y_pred):
    #In the Wild ranking loss
    y_true_np = K.eval(y_true)
    y_pred_np = K.eval(y_pred)
    if y_true_np.shape[0] != None:
        num_sample_points = 50
        total_samples = num_sample_points ** 2
        bs = y_true_np.shape[0]
        w = y_true_np.shape[1]
        h = y_true_np.shape[2]
        total_samples = total_samples * bs
        num_pairs = tf.constant([total_samples], dtype=tf.float32)
        output = tf.Variable(0.0)
        for i in range(bs):
            sample_points = create_random_samples(y_true, y_pred, num_sample_points)
            for x1, y1 in sample_points:
                for x2, y2 in sample_points:
                    y_true_sq = tf.squeeze(y_true)
                    y_pred_sq = tf.squeeze(y_pred)
                    d1_t = tf.slice(y_true_sq, [i, x1, y1], [1, 1, 1])
                    d2_t = tf.slice(y_true_sq, [i, x2, y2], [1, 1, 1])
                    d1_p = tf.slice(y_pred_sq, [i, x1, y1], [1, 1, 1])
                    d2_p = tf.slice(y_pred_sq, [i, x2, y2], [1, 1, 1])
                    d1_t_sq = tf.squeeze(d1_t)
                    d2_t_sq = tf.squeeze(d2_t)
                    d1_p_sq = tf.squeeze(d1_p)
                    d2_p_sq = tf.squeeze(d2_p)
                    if d1_t_sq > d2_t_sq:
                        # --> Image relation = 1
                        output.assign_add(tf.math.log(1 + tf.math.exp(-1 * d1_p_sq + d2_p_sq)))
                    elif d1_t_sq < d2_t_sq:
                        # --> Image relation = -1
                        output.assign_add(tf.math.log(1 + tf.math.exp(d1_p_sq - d2_p_sq)))
                    else:
                        output.assign_add(tf.math.square(d1_p_sq - d2_p_sq))
        return output/num_pairs
EDIT(3): This is the code for create_random_samples():
(FYI: Because it was weird to get the shape from y_true in this case, I first proceeded to hard-code it here as I know it for the dataset which I am currently using.)
def create_random_samples(y_true, y_pred, num_points=50):
    y_true_shape = (4, 480, 640, 1)
    y_pred_shape = (4, 480, 640, 1)
    if y_true_shape[0] != None:
        num_samples = num_points
        population = [(x, y) for x in range(y_true_shape[1]) for y in range(y_true_shape[2])]
        sample_points = random.sample(population, num_samples)
        return sample_points
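The three cases in the loss above can be written without Python branching on tensor values, which is one way to keep a path from y_pred to the output. The following is only a NumPy sketch of that arithmetic for a single image, not a verified Keras loss: the function name pairwise_ranking_loss, the (row, col) point layout and the toy arrays are assumptions made for illustration. In TensorFlow the same structure would presumably use tf.gather_nd for the sampling and tf.where for the case selection.

```python
import numpy as np

def pairwise_ranking_loss(depth_true, depth_pred, points):
    """Branch-free version of the three-case ranking loss for one image.

    depth_true, depth_pred: (H, W) arrays; points: (N, 2) integer (row, col) samples.
    """
    rows, cols = points[:, 0], points[:, 1]
    t = depth_true[rows, cols]
    p = depth_pred[rows, cols]
    # All N*N ordered pairs at once via broadcasting.
    t_diff = t[:, None] - t[None, :]
    p_diff = p[:, None] - p[None, :]
    first_deeper = np.log1p(np.exp(-p_diff))   # case true[1] > true[2]
    second_deeper = np.log1p(np.exp(p_diff))   # case true[1] < true[2]
    same_depth = np.square(p_diff)             # case true[1] == true[2]
    per_pair = np.where(t_diff > 0, first_deeper,
                        np.where(t_diff < 0, second_deeper, same_depth))
    return per_pair.mean()                     # divide by N*N pairs

depth_true = np.arange(12, dtype=float).reshape(3, 4)   # toy ground truth
points = np.array([[0, 0], [1, 2], [2, 3]])             # toy sample points
good = pairwise_ranking_loss(depth_true, depth_true * 10.0, points)   # same ordering
bad = pairwise_ranking_loss(depth_true, -depth_true * 10.0, points)   # reversed ordering
print(good, bad)
```

A prediction that preserves the ground-truth ordering gives a smaller value than one that reverses it, which is the behaviour the loss is meant to reward.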

how to calculate entropy on float numbers over a tensor in python keras

I have been struggling with this and could not get it to work. I hope someone can help me.
I want to calculate the entropy of each row of the tensor. Because my data are floats rather than integers, I think I need to use bin_histogram.
For example, a sample of my data is tensor = [[0.2, -0.1, 1], [2.09, -1.4, 0.9]]
For information, my model is a seq2seq model written in Keras with the TensorFlow backend.
This is my code so far; I need to correct rev_entropy:
class entropy_measure(Layer):
    def __init__(self, beta,batch, **kwargs):
        self.beta = beta
        self.batch = batch
        self.uses_learning_phase = True
        self.supports_masking = True
        super(entropy_measure, self).__init__(**kwargs)
    def call(self, x):
        return K.in_train_phase(self.rev_entropy(x, self.beta,self.batch), x)
    def get_config(self):
        config = {'beta': self.beta}
        base_config = super(entropy_measure, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
    def rev_entropy(self, x, beta,batch):
        for i in x:
            i = pd.Series(i)
            p_data = i.value_counts() # counts occurrence of each value
            entropy = entropy(p_data) # get entropy from counts
            rev = 1/(1+entropy)
            return rev
        new_f_w_t = x * (rev.reshape(rev.shape[0], 1))*beta
        return new_f_w_t
Any input is much appreciated:)
It looks like you have a series of questions that come together on this issue. I'll settle it here.
You calculate entropy in the following form of scipy.stats.entropy according to your code:
scipy.stats.entropy(pk, qk=None, base=None)
Calculate the entropy of a distribution for given probability values.
If only probabilities pk are given, the entropy is calculated as S = -sum(pk * log(pk), axis=0).
Tensorflow does not provide a direct API to calculate entropy on each row of the tensor. What we need to do is to implement the above formula.
import tensorflow as tf
import pandas as pd
from scipy.stats import entropy
a = [1.1,2.2,3.3,4.4,2.2,3.3]
res = entropy(pd.value_counts(a))
_, _, count = tf.unique_with_counts(tf.constant(a))
# [1 2 2 1]
prob = count / tf.reduce_sum(count)
# [0.16666667 0.33333333 0.33333333 0.16666667]
tf_res = -tf.reduce_sum(prob * tf.log(prob))
with tf.Session() as sess:
    print('scipy version: \n',res)
    print('tensorflow version: \n',sess.run(tf_res))
scipy version:
1.329661348854758
tensorflow version:
1.3296613488547582
Then we define a per-row function and replace the for loop with tf.map_fn in your custom layer, based on the code above.
def rev_entropy(self, x, beta,batch):
    def row_entropy(row):
        _, _, count = tf.unique_with_counts(row)
        prob = count / tf.reduce_sum(count)
        return -tf.reduce_sum(prob * tf.log(prob))
    value_ranges = [-10.0, 100.0]
    nbins = 50
    new_f_w_t = tf.histogram_fixed_width_bins(x, value_ranges, nbins)
    rev = tf.map_fn(row_entropy, new_f_w_t,dtype=tf.float32)
    new_f_w_t = x * 1/(1+rev)*beta
    return new_f_w_t
Note that this layer will not produce a gradient that can propagate backwards, since the entropy is calculated from statistical counts (binning and counting are not differentiable operations). You may need to rethink your hidden layer structure.
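The point about the missing gradient can be seen in plain NumPy. The sketch below bins each row, computes the entropy of the bin counts, and shows that a small change in the inputs leaves the result exactly unchanged, so the derivative with respect to x is zero wherever it exists. The function name binned_row_entropy, the value range and the number of bins are assumptions chosen for this illustration and differ from the settings in the answer above.

```python
import numpy as np

def binned_row_entropy(x, value_range=(-2.5, 2.5), nbins=5):
    """Entropy of the histogram bin counts of each row (natural log)."""
    entropies = []
    for row in np.atleast_2d(x):
        counts, _ = np.histogram(row, bins=nbins, range=value_range)
        probs = counts[counts > 0] / counts.sum()   # drop empty bins
        entropies.append(-np.sum(probs * np.log(probs)))
    return np.array(entropies)

x = np.array([[0.2, -0.1, 1.0], [2.09, -1.4, 0.9]])   # sample data from the question
base = binned_row_entropy(x)
nudged = binned_row_entropy(x + 1e-6)   # tiny perturbation, no value changes bin
print(base, nudged)
```

Because the output only changes when a value crosses a bin edge, the function is piecewise constant in x, which is why no useful gradient flows through it.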

TensorFlow loss function zeroes out after first epoch

I am trying to implement a discriminative loss function for instance segmentation of images based on this paper: https://arxiv.org/pdf/1708.02551.pdf (This link is just for the readers' reference; I don't expect anyone to read it to help me out!)
My problem: Once I move from a simple loss function to a more complicated one (like you see in the attached code snippet), the loss function zeroes out after the first epoch. I checked the weights, and almost all of them seem to hover closely around -300. They are not exactly identical, but very close to each other (differing only in the decimal places).
Relevant code that implements the discriminative loss function:
def regDLF(y_true, y_pred):
    global alpha
    global beta
    global gamma
    global delta_v
    global delta_d
    global image_height
    global image_width
    global nDim
    y_true = tf.reshape(y_true, [image_height*image_width])
    X = tf.reshape(y_pred, [image_height*image_width, nDim])
    uniqueLabels, uniqueInd = tf.unique(y_true)
    numUnique = tf.size(uniqueLabels)
    Sigma = tf.unsorted_segment_sum(X, uniqueInd, numUnique)
    ones_Sigma = tf.ones((tf.shape(X)[0], 1))
    ones_Sigma = tf.unsorted_segment_sum(ones_Sigma,uniqueInd, numUnique)
    mu = tf.divide(Sigma, ones_Sigma)
    Lreg = tf.reduce_mean(tf.norm(mu, axis = 1))
    T = tf.norm(tf.subtract(tf.gather(mu, uniqueInd), X), axis = 1)
    T = tf.divide(T, Lreg)
    T = tf.subtract(T, delta_v)
    T = tf.clip_by_value(T, 0, T)
    T = tf.square(T)
    ones_Sigma = tf.ones_like(uniqueInd, dtype = tf.float32)
    ones_Sigma = tf.unsorted_segment_sum(ones_Sigma,uniqueInd, numUnique)
    clusterSigma = tf.unsorted_segment_sum(T, uniqueInd, numUnique)
    clusterSigma = tf.divide(clusterSigma, ones_Sigma)
    Lvar = tf.reduce_mean(clusterSigma, axis = 0)
    mu_interleaved_rep = tf.tile(mu, [numUnique, 1])
    mu_band_rep = tf.tile(mu, [1, numUnique])
    mu_band_rep = tf.reshape(mu_band_rep, (numUnique*numUnique, nDim))
    mu_diff = tf.subtract(mu_band_rep, mu_interleaved_rep)
    mu_diff = tf.norm(mu_diff, axis = 1)
    mu_diff = tf.divide(mu_diff, Lreg)
    mu_diff = tf.subtract(2*delta_d, mu_diff)
    mu_diff = tf.clip_by_value(mu_diff, 0, mu_diff)
    mu_diff = tf.square(mu_diff)
    numUniqueF = tf.cast(numUnique, tf.float32)
    Ldist = tf.reduce_mean(mu_diff)
    L = alpha * Lvar + beta * Ldist + gamma * Lreg
    return L
Question: I know it's hard to understand what the code does without reading the paper, but I have a couple of questions:
Is there something glaringly wrong with the loss function defined above?
Does anyone have a general idea as to why the loss function could zero out after the first epoch?
Thank you very much for your time and help!
I think your problem comes from tf.norm, which is not numerically safe: when the vector it is applied to is all zeros, its gradient becomes nan.
It would be better to replace tf.norm by this custom function:
def tf_norm(inputs, axis=1, epsilon=1e-7, name='safe_norm'):
    squared_norm = tf.reduce_sum(tf.square(inputs), axis=axis, keep_dims=True)
    safe_norm = tf.sqrt(squared_norm+epsilon)
    return tf.identity(safe_norm, name=name)
In your Ldist calculation you use tf.tile and tf.reshape to find the distance between different cluster means in the following manner (suppose we have three clusters):
mu_1 - mu_1
mu_2 - mu_1
mu_3 - mu_1
mu_1 - mu_2
mu_2 - mu_2
mu_3 - mu_2
mu_1 - mu_3
mu_2 - mu_3
mu_3 - mu_3
The problem is that your distance vector contains zero vectors (the mu_i - mu_i rows) and you apply a norm to them afterwards. tf.norm becomes numerically unstable there, since its gradient divides by the length of the vector. The resulting gradient is non-finite (nan or inf). See this GitHub issue.
The solution would be to remove those zero vectors, along the lines of this Stack Overflow question.
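As a rough illustration of removing the zero vectors, the NumPy sketch below builds all pairwise differences between cluster means and masks out the diagonal before taking the norm. The array mu holds made-up values, and the variable names are hypothetical; in TensorFlow, tf.boolean_mask would be the natural counterpart of the boolean indexing used here.

```python
import numpy as np

# Three hypothetical cluster means in a 2-D embedding space.
mu = np.array([[0.0, 0.0],
               [3.0, 4.0],
               [6.0, 8.0]])
num_clusters = len(mu)

# diffs[i, j] = mu[i] - mu[j], shape (K, K, D); the diagonal holds zero vectors.
diffs = mu[:, None, :] - mu[None, :, :]

# Keep only pairs with i != j so that no zero vector reaches the norm.
off_diagonal = ~np.eye(num_clusters, dtype=bool)
pair_diffs = diffs[off_diagonal]                  # shape (K*(K-1), D)
pair_dists = np.linalg.norm(pair_diffs, axis=1)
print(pair_dists)
```

With the diagonal removed, every remaining distance is strictly positive as long as the cluster means are distinct, so the norm is only ever differentiated away from zero.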

Can I implement a gradient descent for arbitrary convex loss function?

I have a loss function I would like to try and minimize:
def lossfunction(X,b,lambs):
    B = b.reshape(X.shape)
    penalty = np.linalg.norm(B, axis = 1)**(0.5)
    return np.linalg.norm(np.dot(X,B)-X) + lambs*penalty.sum()
Gradient descent, or similar methods, might be useful. I can't calculate the gradient of this function analytically, so I am wondering how I can numerically calculate the gradient for this loss function in order to implement a descent method.
NumPy has a gradient function, but it requires me to pass a scalar field sampled at predetermined points.
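You could try scipy.optimize.minimize. It needs an initial guess as its second argument; the call below uses your starting b for that and fixes X and lambs, so that b is the variable being optimized.
For your case a sample call would be:
import scipy.optimize
# b is the initial guess; the lambda fixes X and lambs
result = scipy.optimize.minimize(lambda b_flat: lossfunction(X, b_flat, lambs), b, method='Nelder-Mead')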
You could estimate the derivative numerically by a central difference:
def derivative(fun, X, b, lambs, h):
    return (fun(X + 0.5*h,b,lambs) - fun(X - 0.5*h,b,lambs))/h
And use it like this:
# assign values to X, b, lambs
# set the value of h
h = 0.001
print(derivative(lossfunction, X, b, lambs, h))
The code above is valid for dimX = 1; some modifications are needed to account for a multidimensional vector X:
def gradient(fun, X, b, lambs, h):
    res = []
    for i in range (0,len(X)):
        t1 = list(X)
        t1[i] = t1[i] + 0.5*h
        t2 = list(X)
        t2[i] = t2[i] - 0.5*h
        res = res + [(fun(t1,b,lambs) - fun(t2,b,lambs))/h]
    return res
Forgive the naivety of the code, I barely know how to write some Python :-)
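If the quantity being optimized is the flat parameter vector b rather than X (an assumption based on how lossfunction reshapes b), the same central-difference idea can be written over a NumPy array as follows. The helper name numerical_gradient and the quadratic test function are made up for this sketch; the quadratic is used only because its exact gradient is known.

```python
import numpy as np

def numerical_gradient(fun, b, h=1e-6):
    """Central-difference gradient of a scalar function fun at the 1-D point b."""
    b = np.asarray(b, dtype=float)
    grad = np.zeros_like(b)
    step = np.zeros_like(b)
    for i in range(b.size):
        step[i] = h                       # perturb one coordinate at a time
        grad[i] = (fun(b + step) - fun(b - step)) / (2 * h)
        step[i] = 0.0
    return grad

def quadratic(b):
    """Toy function with known gradient 2 * (b - 3)."""
    return np.sum((b - 3.0) ** 2)

b0 = np.array([0.0, 1.0, 5.0])
approx = numerical_gradient(quadratic, b0)
exact = 2 * (b0 - 3.0)
print(approx, exact)
```

For the loss in the question, one would pass something like lambda b: lossfunction(X, b, lambs) as fun. Each gradient evaluation costs two function calls per parameter, so this is only practical for small problems, and the square-root penalty is not differentiable where a row norm is zero.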

Avoiding optimization pitfalls when modeling an ordinal predicted variable in PyMC3

I am trying to model an ordinal predicted variable using PyMC3 based on the approach in chapter 23 of Doing Bayesian Data Analysis. I would like to determine a good starting value using find_MAP, but am receiving an optimization error.
The model:
import pymc3 as pm
import numpy as np
import theano
import theano.tensor as tt
# Some helper functions
def cdf(x, location=0, scale=1):
    epsilon = np.array(1e-32, dtype=theano.config.floatX)
    location = tt.cast(location, theano.config.floatX)
    scale = tt.cast(scale, theano.config.floatX)
    div = tt.sqrt(2 * scale ** 2 + epsilon)
    div = tt.cast(div, theano.config.floatX)
    erf_arg = (x - location) / div
    return .5 * (1 + tt.erf(erf_arg + epsilon))
def percent_to_thresh(idx, vect):
    return 5 * tt.sum(vect[:idx + 1]) + 1.5
def full_thresh(thresh):
    idxs = tt.arange(thresh.shape[0] - 1)
    thresh_mod, updates = theano.scan(fn=percent_to_thresh,
                                      sequences=[idxs],
                                      non_sequences=[thresh])
    return tt.concatenate([[-1 * np.inf, 1.5], thresh_mod, [6.5, np.inf]])
def compute_ps(thresh, location, scale):
    f_thresh = full_thresh(thresh)
    return cdf(f_thresh[1:], location, scale) - cdf(f_thresh[:-1], location, scale)
# Generate data
real_ps = [0.05, 0.05, 0.1, 0.1, 0.2, 0.3, 0.2]
data = np.random.choice(7, size=1000, p=real_ps)
# Run model
with pm.Model() as model:
    mu = pm.Normal('mu', mu=4, sd=3)
    sigma = pm.Uniform('sigma', lower=0.1, upper=70)
    thresh = pm.Dirichlet('thresh', a=np.ones(5))
    cat_p = compute_ps(thresh, mu, sigma)
    results = pm.Categorical('results', p=cat_p, observed=data)
with model:
    start = pm.find_MAP()
    trace = pm.sample(2000, start=start)
When running this, I receive the following error:
Applied interval-transform to sigma and added transformed sigma_interval_ to model.
Applied stickbreaking-transform to thresh and added transformed thresh_stickbreaking_ to model.
Traceback (most recent call last):
File "cm_net_log.v1-for_so.py", line 53, in <module>
start = pm.find_MAP()
File "/usr/local/lib/python3.5/site-packages/pymc3/tuning/starting.py", line 133, in find_MAP
specific_errors)
ValueError: Optimization error: max, logp or dlogp at max have non-finite values. Some values may be outside of distribution support. max: {'thresh_stickbreaking_': array([-1.04298465, -0.48661088, -0.84326554, -0.44833646]), 'sigma_interval_': array(-2.220446049250313e-16), 'mu': array(7.68422528308479)} logp: array(-3506.530143064723) dlogp: array([ 1.61013190e-06, nan, -6.73994118e-06,
-6.93873894e-06, 6.03358122e-06, 3.18954680e-06])Check that 1) you don't have hierarchical parameters, these will lead to points with infinite density. 2) your distribution logp's are properly specified. Specific issues:
My questions:
How can I determine why dlogp is nan at certain points?
Is there a different way that I can express this model to avoid dlogp being nan?
Also worth noting:
This model runs fine if I don't find_MAP and use a Metropolis sampler. However, I'd like to have the flexibility of using other samplers as this model becomes more complex.
I have a suspicion that the issue is due to the relationship between the thresholds and the normal distribution, but I don't know how to disentangle them for the optimization.
Regarding question 2: I expressed the model for the ordinal predicted variable (single group) differently; I used the Theano @as_op decorator for a function that calculates probabilities for the outcomes. That also explains why I cannot use find_MAP() or gradient-based samplers: Theano cannot calculate a gradient for the custom function. (http://pymc-devs.github.io/pymc3/notebooks/getting_started.html#Arbitrary-deterministics)
from scipy.stats import norm
from theano.compile.ops import as_op
# Number of outcomes
nYlevels = df.Y.cat.categories.size
thresh = [k + .5 for k in range(1, nYlevels)]
thresh_obs = np.ma.asarray(thresh)
thresh_obs[1:-1] = np.ma.masked
@as_op(itypes=[tt.dvector, tt.dscalar, tt.dscalar], otypes=[tt.dvector])
def outcome_probabilities(theta, mu, sigma):
    out = np.empty(nYlevels)
    n = norm(loc=mu, scale=sigma)
    out[0] = n.cdf(theta[0])
    out[1] = np.max([0, n.cdf(theta[1]) - n.cdf(theta[0])])
    out[2] = np.max([0, n.cdf(theta[2]) - n.cdf(theta[1])])
    out[3] = np.max([0, n.cdf(theta[3]) - n.cdf(theta[2])])
    out[4] = np.max([0, n.cdf(theta[4]) - n.cdf(theta[3])])
    out[5] = np.max([0, n.cdf(theta[5]) - n.cdf(theta[4])])
    out[6] = 1 - n.cdf(theta[5])
    return out
with pm.Model() as ordinal_model_single:
    theta = pm.Normal('theta', mu=thresh, tau=np.repeat(.5**2, len(thresh)),
                      shape=len(thresh), observed=thresh_obs, testval=thresh[1:-1])
    mu = pm.Normal('mu', mu=nYlevels/2.0, tau=1.0/(nYlevels**2))
    sigma = pm.Uniform('sigma', nYlevels/1000.0, nYlevels*10.0)
    pr = outcome_probabilities(theta, mu, sigma)
    y = pm.Categorical('y', pr, observed=df.Y.cat.codes.as_matrix())
http://nbviewer.jupyter.org/github/JWarmenhoven/DBDA-python/blob/master/Notebooks/Chapter%2023.ipynb
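Both versions of the model rest on the same computation: the probability of each ordinal outcome is the normal CDF mass between two consecutive thresholds. A small SciPy sketch of just that step, outside of PyMC3, may make the structure easier to check. The function name ordinal_probabilities and the values of mu and sigma are assumptions for illustration; the thresholds 1.5 to 6.5 match the seven-level case above.

```python
import numpy as np
from scipy.stats import norm

def ordinal_probabilities(thresholds, mu, sigma):
    """Probability of each ordinal outcome from a thresholded normal."""
    # Pad with -inf and +inf so the first and last outcomes get the tails.
    edges = np.concatenate(([-np.inf], thresholds, [np.inf]))
    cdf_at_edges = norm.cdf(edges, loc=mu, scale=sigma)
    return np.diff(cdf_at_edges)          # mass between consecutive edges

thresholds = np.array([1.5, 2.5, 3.5, 4.5, 5.5, 6.5])   # 7 outcome levels
probs = ordinal_probabilities(thresholds, mu=4.0, sigma=1.5)
print(probs, probs.sum())
```

With sorted thresholds the differences are non-negative and sum to one, so the np.max([0, ...]) guards in outcome_probabilities only matter when sampled thresholds come out of order.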