Custom loss for coordinate/landmark prediction - tensorflow

I am currently trying to get a landmark predictor running and thought about the loss function.
Currently the last (dense) layer has 32 values with the 16 coordinates encoded as x1,y1,x2,y2,...
Up until now I was just fiddling with Mean Squared Error or Mean Absolute Error losses but thought the distance between the ground truth and the predicted coordinate would be far more expressive of the correctness of the values.
My current implementation looks like:
def dst_objective(y_true, y_pred):
    vats = dict()
    for i in range(0, 16):
        true_px = y_true[:, i * 2:i * 2 + 1]
        pred_px = y_pred[:, i * 2:i * 2 + 1]
        true_py = y_true[:, i * 2 + 1:i * 2 + 2]
        pred_py = y_pred[:, i * 2 + 1:i * 2 + 2]
        vats[i] = K.sqrt(K.square(true_px - pred_px) + K.square(true_py - pred_py))
    out = K.concatenate([
        vats[0], vats[1], vats[2], vats[3], vats[4], vats[5], vats[6], vats[7],
        vats[8], vats[9], vats[10], vats[11], vats[12], vats[13], vats[14],
        vats[15]
    ], axis=1)
    return K.mean(out, axis=0)
It does seem to work when I evaluate it but it does look "hacky" to me. Any suggestions how I could improve on this?

The same calculation expressed as tensor operations in Keras, without separating the X and Y coordinates, because that's basically unnecessary:
# get all the squared difference in coordinates
sq_distances = K.square( y_true - y_pred )
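# then take the sum of each (x, y) pair; the Keras backend has no AveragePooling1D
# function (that is a layer), so reshape to (batch, 16, 2) and sum over the last axis
sum_pool = K.sum( K.reshape( sq_distances, ( -1, 16, 2 ) ), axis = -1 )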
# take the square root to get the distance
dists = K.sqrt( sum_pool )
# take the mean of the distances
mean_dist = K.mean( dists )
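To sanity-check the idea outside of Keras, here is a minimal NumPy sketch, assuming the 32 outputs are laid out as x1,y1,x2,y2,... as described in the question. The function names are made up for illustration. It compares the reshape-based version against an explicit per-landmark loop.

```python
import numpy as np

def mean_landmark_distance(y_true, y_pred, n_points=16):
    """Mean Euclidean distance over all landmarks and samples (vectorized)."""
    # (batch, 2 * n_points) -> (batch, n_points, 2), so the last axis holds (x, y)
    diff = (y_true - y_pred).reshape(-1, n_points, 2)
    dists = np.sqrt((diff ** 2).sum(axis=-1))   # shape (batch, n_points)
    return dists.mean()

def mean_landmark_distance_loop(y_true, y_pred, n_points=16):
    """Same quantity computed landmark by landmark, as in the question."""
    total = 0.0
    for i in range(n_points):
        dx = y_true[:, 2 * i] - y_pred[:, 2 * i]
        dy = y_true[:, 2 * i + 1] - y_pred[:, 2 * i + 1]
        total += np.sqrt(dx ** 2 + dy ** 2).mean()
    return total / n_points

rng = np.random.default_rng(0)
y_true = rng.random((4, 32))
y_pred = rng.random((4, 32))
print(np.isclose(mean_landmark_distance(y_true, y_pred),
                 mean_landmark_distance_loop(y_true, y_pred)))
```

Note that this returns a single scalar, whereas the function in the question averages over the batch axis only and returns one value per landmark. Either can be used as a loss; which one you want depends on how you intend to weight the landmarks.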

vectorized way to multiply and add specific axes in numpy array (convolutional layer backprop)

I was wondering how it would be possible to vectorize the following quadruple for-loops (this is to do with backprop in a convolutional layer).
W = np.ones((2, 2, 3, 8)) # just a toy example
dW = np.zeros(W.shape)
dZ = np.ones((10, 4, 4, 8))*2
# get the shapes: m = samples/images; H_dim = Height of image; W_dim = Width of image; 8 = Channels/filters
(m, H_dim, W_dim, C) = dZ.shape
dA_prev = np.zeros((10, 4, 4, 3))
# add symmetric padding of 2 around height and width borders with 0-values; shape is now: (10, 8, 8, 3)
dA_prev = np.pad(dA_prev,((0,0),(2,2),(2,2),(0,0)), mode='constant', constant_values = (0,0))
# loop over images
for i in range(m):
    # loop over height
    for h in range(H_dim):
        # loop over width
        for w in range(W_dim):
            # loop over channels/filters
            for c in range(C):
                vert_start = 1 * h # 1 = stride, just as an example
                vert_end = vert_start + 2 # 2 = vertical filter size, just as an example
                horiz_start = 1 * w # 1 = stride
                horiz_end = horiz_start + 2 # 2 = horizontal filter size, just as an example
                dW[:,:,:,c] += dA_prev[i, vert_start:vert_end,horiz_start:horiz_end,:] * dZ[i, h, w, c]
                dA_prev[i, vert_start:vert_end, horiz_start:horiz_end, :] += W[:,:,:,c] * dZ[i, h, w, c] # dZ[i, h, w, c] is a scalar
doing backprop on the bias is easy enough (db = np.sum(dZ, axis=(0,1,2), keepdims=True)), and the weights can be dealt with using stride tricks: reshape dZ and take the dot product with the reshaped input windows (or use tensordot on the axes, or einsum).
from numpy.lib.stride_tricks import as_strided

def _striding(array, stride_size, filter_shapes, Wout=None, Hout=None):
    strides = (array.strides[0], array.strides[1] * stride_size, array.strides[2] * stride_size, array.strides[1], array.strides[2], array.strides[3])
    strided = as_strided(array, shape=(array.shape[0], Hout, Wout, filter_shapes[0], filter_shapes[1], array.shape[3]), strides=strides, writeable=False)
    return strided

Hout = (A_prev.shape[1] - 2) // 1 + 1
Wout = (A_prev.shape[2] - 2) // 1 + 1
x_flat = _striding(array=A_prev, stride_size=2, filter_shapes=(2,2),
                   Wout=Wout, Hout=Hout).reshape(-1, 2 * 2 * A_prev.shape[3])
dout_descendant_flat = dout_descendant.reshape(-1, n_C)
dW = x_flat.T @ dout_descendant_flat # shape (fh * fw * n_prev_C, C)
dW = dW.reshape(fh, fw, n_prev_C, C)
this gives the same dW as the slow version. but doing something similar to get the derivative wrt the input, which should also match the slow version, doesn't. here's what i've done:
dZ_pad = np.pad(dZ,((0,0),(2,2),(2,2),(0,0)), mode='constant', constant_values = (0,0)) # padding to get the same shape as A_prev
dZ_pad_reshaped = _striding(array=dZ_pad, stride_size=1, filter_shapes=(2,2),
Wout=4, Hout=4) # the Hout and Wout dims are from the unpadded dims of A_prev
Wrot180 = np.rot90(W, 2, axes=(0,1)) # the filter height and width are in the first two axes, which we want to rotate
dA_prev = np.tensordot(dZ_pad_reshaped, Wrot180, axes=([3,4,5],[0,1,3]))
the shapes of dA_prev are right, but for some reason the results aren't identical as the slow version
OK, turns out the error was to do with several things:
dZ needed to be dilated relative to the stride in the forward propagation
the window function used for windowing dZ (done after dilation of dZ) needed to be called with stride 1 (no matter the stride choice in the forward propagation) with the output heights and widths of the padded input (not the original, unpadded input -- this was the main mistake that took me days to debug)
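As a compact illustration of those two points, here is a minimal NumPy sketch that checks the dilate, pad, window and rotate recipe against plain loops. It is not the class code from this answer: it assumes no forward padding, uses made-up function names, and relies on `sliding_window_view` (NumPy 1.20 or newer) instead of `as_strided`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def dinput_naive(dZ, W, in_shape, stride):
    """Reference: scatter W * dZ back into the input gradient with loops."""
    fh, fw = W.shape[:2]
    dX = np.zeros(in_shape)
    for h in range(dZ.shape[1]):
        for w in range(dZ.shape[2]):
            hs, ws = h * stride, w * stride
            # (N, Cout) x (fh, fw, Cin, Cout) -> (N, fh, fw, Cin)
            dX[:, hs:hs + fh, ws:ws + fw, :] += np.einsum(
                'nk,abck->nabc', dZ[:, h, w, :], W)
    return dX

def dinput_dilated(dZ, W, in_shape, stride):
    """Vectorized: dilate dZ, pad by filter size - 1, window with stride 1."""
    N, H, Wd, _ = in_shape
    fh, fw = W.shape[:2]
    # dilation: put dZ on a grid with step `stride`, zeros in between; sizing
    # the buffer as H - fh + 1 also covers rows/cols the forward pass skipped
    dil = np.zeros((N, H - fh + 1, Wd - fw + 1, dZ.shape[3]))
    dil[:, ::stride, ::stride, :] = dZ
    padded = np.pad(dil, ((0, 0), (fh - 1, fh - 1), (fw - 1, fw - 1), (0, 0)))
    # stride-1 windows, shape (N, H, Wd, Cout, fh, fw)
    windows = sliding_window_view(padded, (fh, fw), axis=(1, 2))
    W_rot = W[::-1, ::-1, :, :]  # filter rotated by 180 degrees
    return np.einsum('nhwkab,abck->nhwc', windows, W_rot)

rng = np.random.default_rng(0)
in_shape, stride = (2, 8, 8, 3), 2
W = rng.normal(size=(3, 3, 3, 4))
Hout = (in_shape[1] - W.shape[0]) // stride + 1
dZ = rng.normal(size=(2, Hout, Hout, 4))
dX_naive = dinput_naive(dZ, W, in_shape, stride)
dX_fast = dinput_dilated(dZ, W, in_shape, stride)
print(np.allclose(dX_naive, dX_fast))
```

The number of stride-1 windows equals the input height and width, which is the point made in the second item above: the window count comes from the input being differentiated, not from the shape of dZ.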
the relevant code is below with comments explaining shapes and operations as well as some further sources for reading. i've also included the forward propagation.
i should note that after days of debugging, writing various functions, reading etc. the variable names changed after a while, so for the ease of reading, here are the names of the variables as defined in my question and then their equivalent in the code below:
A_prev is x
dZ is dout_descendant
Hout is the height of dout_descendant
Wout is the width of dout_descendant
(as one would expect all references to self are to the class these functions are part of)
def _pad(self, array, pad_size, pad_val):
    '''
    only symmetric padding is implemented
    '''
    return np.pad(array, ((0, 0), (pad_size, pad_size), (pad_size, pad_size), (0, 0)), 'constant', constant_values=(pad_val, pad_val))
def _dilate(self, array, stride_size, pad_size, symmetric_filter_shape, output_image_size):
    # on dilation for backprop with stride>1,
    # see: https://medium.com/@mayank.utexas/backpropagation-for-convolution-with-strides-8137e4fc2710
    # see also: https://leimao.github.io/blog/Transposed-Convolution-As-Convolution/
    pad_bottom_and_right = (output_image_size + 2 * pad_size - symmetric_filter_shape) % stride_size
    for m in range(stride_size - 1):
        array = np.insert(array, range(1, array.shape[1], m + 1), 0, axis=1)
        array = np.insert(array, range(1, array.shape[2], m + 1), 0, axis=2)
    for _ in range(pad_bottom_and_right):
        array = np.insert(array, array.shape[1], 0, axis=1)
        array = np.insert(array, array.shape[2], 0, axis=2)
    return array
def _windows(self, array, stride_size, filter_shapes, out_height, out_width):
    '''
    inputs:
        array to create windows of
        stride_size: int
        filter_shapes: tuple(int): tuple of filter height and width
        out_height and out_width: int, respectively: output sizes for the windows
    returns:
        windows of array with shape (excl. dilation):
        array.shape[0], out_height, out_width, filter_shapes[0], filter_shapes[1], array.shape[3]
    '''
    strides = (array.strides[0], array.strides[1] * stride_size, array.strides[2] * stride_size, array.strides[1], array.strides[2], array.strides[3])
    return np.lib.stride_tricks.as_strided(array, shape=(array.shape[0], out_height, out_width, filter_shapes[0], filter_shapes[1], array.shape[3]), strides=strides, writeable=False)
def forward(self, x):
    '''
    expects inputs to be of shape: [batchsize, height, width, channel in]
    after init, filter_shapes are: [fh, fw, channel in, channel out]
    '''
    self.input_shape = x.shape
    x_pad = self._pad(x, self.pad_size, self.pad_val)
    self.input_pad_shape = x_pad.shape
    # get the shapes
    batch_size, h, w, Cin = self.input_shape
    # calculate output sizes; only symmetric padding is possible
    self.Hout = (h + 2*self.pad_size - self.fh) // self.stride + 1
    self.Wout = (w + 2*self.pad_size - self.fw) // self.stride + 1
    x_windows = self._windows(array=x_pad, stride_size=self.stride, filter_shapes=(self.fh, self.fw),
                              out_width=self.Wout, out_height=self.Hout) # 6D array with shape (batch_size, Hout, Wout, fh, fw, Cin)
    self.out = np.tensordot(x_windows, self.w, axes=([3,4,5], [0,1,2])) + self.b
    self.inputs = x_windows
    ## alternative 1: einsum approach, slower than other alternatives
    # self.out = np.einsum('noufvc,fvck->nouk', x_windows, self.w) + self.b
    ## alternative 2: column approach with simple dot product
    # z = x_windows.reshape(-1, self.fh * self.fw * Cin) @ self.w.reshape(self.fh*self.fw*Cin, Cout) + self.b # 2D matrix with shape (batch_size * Hout * Wout, Cout)
    # self.out = z.reshape(batch_size, Hout, Wout, Cout)
def backward(self, dout_descendant):
    '''
    dout_descendant has shape (batch_size, Hout, Wout, Cout)
    '''
    # get shapes
    batch_size, h, w, Cin = self.input_shape
    # we want to sum everything but the filters for b
    self.db = np.sum(dout_descendant, axis=(0,1,2), keepdims=True) # shape (1,1,1, Cout)
    # for dW we'll use the column approach with ordinary dot product for variety ;) tensordot does the same without all the reshaping
    dout_descendant_flat = dout_descendant.reshape(-1, self.Cout) # new shape (batch_size * Hout * Wout, Cout)
    x_flat = self.inputs.reshape(-1, self.fh * self.fw * Cin) # shape (batch_size * Hout * Wout, fh * fw * Cin)
    dw = x_flat.T @ dout_descendant_flat # shape (fh * fw * Cin, Cout)
    self.dw = dw.reshape(self.fh, self.fw, Cin, self.Cout)
    del dout_descendant_flat # free memory
    # for dinputs: we'll get padded and dilated windows of dout_descendant and perform the tensordot with 180 rotated W
    # for details, see https://medium.com/@mayank.utexas/backpropagation-for-convolution-with-strides-8137e4fc2710 ; also: https://pavisj.medium.com/convolutions-and-backpropagations-46026a8f5d2c ; also: https://youtu.be/Lakz2MoHy6o?t=835
    Wrot180 = np.rot90(self.w, 2, axes=(0,1)) # or also self.w[::-1, ::-1, :, :]
    # backprop for forward with stride > 1 is done on windowed dout that's padded and dilated with stride 1
    dout_descendant = self._dilate(dout_descendant, stride_size=self.stride, pad_size=self.pad_size, symmetric_filter_shape=self.fh, output_image_size=h)
    dout_descendant = self._pad(dout_descendant, pad_size=self.fw-1, pad_val=self.pad_val) # pad dout_descendant to dim: fh-1 (or fw-1); only symmetrical filters are supported
    dout_descendant = self._windows(array=dout_descendant, stride_size=1, filter_shapes=(self.fh, self.fw),
                                    out_height=h + 2 * self.pad_size, out_width=w + 2 * self.pad_size) # shape: (batch_size, h_padded, w_padded, fh, fw, Cout)
    self.dout = np.tensordot(dout_descendant, Wrot180, axes=([3,4,5],[0,1,3]))
    # strip the forward padding again; with pad_size == 0 the slice [0:-0] would be empty, so skip it
    if self.pad_size > 0:
        self.dout = self.dout[:,self.pad_size:-self.pad_size, self.pad_size:-self.pad_size, :]
    ## einsum alternative, but slower:
    # self.dout = np.einsum('nhwfvk,fvck->nhwc', dout_descendant, Wrot180)
i've left this answer here, because all the other sources on stackoverflow or github i could find that used numpy stride tricks were implemented for convolutions of stride 1 (which doesn't require dilation of dZ) or they used very complex fancy indexing operations that were extremely hard to follow (e.g. https://sgugger.github.io/convolution-in-depth.html#convolution-in-depth or https://github.com/parasdahal/deepnet/blob/51a9e61c351138b7dc637f4b748a0e6ca2e15595/deepnet/im2col.py)

How to implement custom Keras ordinal loss function with tensor evaluation without disturbing TF>2.0 Model Graph?

I am trying to implement a custom loss function in Tensorflow 2.4 using the Keras backend.
The loss function is a ranking loss; I found the following paper with a somewhat log-likelihood loss: Chen et al. Single-Image Depth Perception in the Wild.
Similarly, I wanted to sample some (in this case 50) points from an image to compare the relative order between ground-truth and predicted depth maps using the NYU-Depth dataset. Being a fan of Numpy, I started working with that but came to the following exception:
ValueError: No gradients provided for any variable: [...]
I have learned that this is caused by the arguments not being filled when calling the loss function but instead, a C function is compiled which is then used later. So while I know the dimensions of my tensors (4, 480, 640, 1), I cannot work with the data as wanted and have to use the keras.backend functions on top so that in the end (if I understood correctly), there is supposed to be a path between the input tensors from the TF graph and the output tensor, which has to provide a gradient.
So my question now is: Is this a feasible loss function within keras?
I have already tried a few ideas and different approaches with different variations of my original code, which was something like:
def ranking_loss_function(y_true, y_pred):
    # Chen et al. loss
    y_true_np = K.eval(y_true)
    y_pred_np = K.eval(y_pred)
    if y_true_np.shape[0] != None:
        num_sample_points = 50
        total_samples = num_sample_points ** 2
        err_list = [0 for x in range(y_true_np.shape[0])]
        for i in range(y_true_np.shape[0]):
            sample_points = create_random_samples(y_true, y_pred, num_sample_points)
            for x1, y1 in sample_points:
                for x2, y2 in sample_points:
                    if y_true[i][x1][y1] > y_true[i][x2][y2]:
                        #image_relation_true = 1
                        err_list[i] += np.log(1 + np.exp(-1 * y_pred[i][x1][y1] + y_pred[i][x2][y2]))
                    elif y_true[i][x1][y1] < y_true[i][x2][y2]:
                        #image_relation_true = -1
                        err_list[i] += np.log(1 + np.exp(y_pred[i][x1][y1] - y_pred[i][x2][y2]))
                    else:
                        #image_relation_true = 0
                        err_list[i] += np.square(y_pred[i][x1][y1] - y_pred[i][x2][y2])
        err_list = np.divide(err_list, total_samples)
        return K.constant(err_list)
As you can probably tell, the main idea was to first create the sample points and then based on the existing relation between them in y_true/y_pred continue with the corresponding computation from the cited paper.
Can anyone help me and provide some more helpful information or tips on how to correctly implement this loss using keras.backend functions? Trying to include the ordinal relation information really confused me compared to standard regression losses.
EDIT: Just in case this causes confusion: create_random_samples() just creates 50 random sample points (x, y) coordinate pairs based on the shape[1] and shape[2] of y_true (image width and height)
EDIT(2): After finding this variation on GitHub, I have tried out a variation using only TF functions to retrieve data from the tensors and compute the output. The adjusted and probably more correct version still throws the same exception though:
def ranking_loss_function(y_true, y_pred):
    #In the Wild ranking loss
    y_true_np = K.eval(y_true)
    y_pred_np = K.eval(y_pred)
    if y_true_np.shape[0] != None:
        num_sample_points = 50
        total_samples = num_sample_points ** 2
        bs = y_true_np.shape[0]
        w = y_true_np.shape[1]
        h = y_true_np.shape[2]
        total_samples = total_samples * bs
        num_pairs = tf.constant([total_samples], dtype=tf.float32)
        output = tf.Variable(0.0)
        for i in range(bs):
            sample_points = create_random_samples(y_true, y_pred, num_sample_points)
            for x1, y1 in sample_points:
                for x2, y2 in sample_points:
                    y_true_sq = tf.squeeze(y_true)
                    y_pred_sq = tf.squeeze(y_pred)
                    d1_t = tf.slice(y_true_sq, [i, x1, y1], [1, 1, 1])
                    d2_t = tf.slice(y_true_sq, [i, x2, y2], [1, 1, 1])
                    d1_p = tf.slice(y_pred_sq, [i, x1, y1], [1, 1, 1])
                    d2_p = tf.slice(y_pred_sq, [i, x2, y2], [1, 1, 1])
                    d1_t_sq = tf.squeeze(d1_t)
                    d2_t_sq = tf.squeeze(d2_t)
                    d1_p_sq = tf.squeeze(d1_p)
                    d2_p_sq = tf.squeeze(d2_p)
                    if d1_t_sq > d2_t_sq:
                        # --> Image relation = 1
                        output.assign_add(tf.math.log(1 + tf.math.exp(-1 * d1_p_sq + d2_p_sq)))
                    elif d1_t_sq < d2_t_sq:
                        # --> Image relation = -1
                        output.assign_add(tf.math.log(1 + tf.math.exp(d1_p_sq - d2_p_sq)))
                    else:
                        output.assign_add(tf.math.square(d1_p_sq - d2_p_sq))
        return output/num_pairs
EDIT(3): This is the code for create_random_samples():
(FYI: Because it was weird to get the shape from y_true in this case, I first proceeded to hard-code it here as I know it for the dataset which I am currently using.)
def create_random_samples(y_true, y_pred, num_points=50):
    y_true_shape = (4, 480, 640, 1)
    y_pred_shape = (4, 480, 640, 1)
    if y_true_shape[0] != None:
        num_samples = num_points
        population = [(x, y) for x in range(y_true_shape[1]) for y in range(y_true_shape[2])]
        sample_points = random.sample(population, num_samples)
        return sample_points
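For reference, the three-case pairwise computation described above can be written without Python-level branching. The following is a minimal NumPy sketch for a single image, not a Keras loss: the function name and argument layout are made up for illustration, and it assumes the sample points are given as integer (row, col) pairs.

```python
import numpy as np

def ranking_loss(depth_true, depth_pred, points):
    """Pairwise ranking loss for one image.

    depth_true, depth_pred: 2-D arrays of shape (H, W)
    points: integer array of shape (P, 2) holding (row, col) sample points
    """
    rows, cols = points[:, 0], points[:, 1]
    zt = depth_true[rows, cols]                  # (P,) ground-truth depths
    zp = depth_pred[rows, cols]                  # (P,) predicted depths
    rel = np.sign(zt[:, None] - zt[None, :])     # (P, P): +1, -1 or 0
    diff = zp[:, None] - zp[None, :]             # (P, P) predicted differences
    # log(1 + exp(-rel * diff)) for ordered pairs; logaddexp(0, x) is a
    # numerically stable way to compute log(1 + exp(x))
    ordered = np.logaddexp(0.0, -rel * diff)
    equal = diff ** 2                            # squared difference for ties
    return np.where(rel != 0, ordered, equal).mean()

depth_true = np.array([[1.0, 2.0]])
depth_pred = np.array([[0.0, 0.0]])
points = np.array([[0, 0], [0, 1]])
print(ranking_loss(depth_true, depth_pred, points))
```

Because everything is expressed as element-wise array operations, the same structure should carry over to tensor operations (for example tf.gather_nd to pick the points, tf.where for the case split, and tf.math.softplus for log(1 + exp(x))), which would keep a differentiable path from y_pred to the loss. That translation is an assumption and is not tested here.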

How to display the convolution filters used on a CNN with Tensorflow?

I would like to produce figures similar to this one:
To do that, I load my model with Tensorflow and then, using this code, I am able to select the variable holding the filters of one layer:
# search for the name of the specific layer with the filters I want to display
for v in tf.trainable_variables():
print(v.name)
# store the filters into a variable
var = [v for v in tf.trainable_variables() if v.name == "model/center/kernel:0"][0]
By calling var.eval() I am able to store var in a numpy array.
This numpy array has the shape (3, 3, 512, 512), which corresponds to the kernel size (3x3), the number of input channels (512) and the number of filters (512).
My problem is the following: how can I extract one filter from this (3, 3, 512, 512) array to display it? If I understand how to do that, I will figure out how to display all 512 filters.
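On the narrow question of slicing, TensorFlow stores 2-D convolution kernels as (kernel height, kernel width, input channels, output channels), so one filter is everything along the last axis index. Below is a minimal NumPy sketch that uses a random array as a stand-in for var.eval(). The helper names are made up, and averaging over input channels is just one possible way to get a displayable 3x3 image.

```python
import numpy as np

# stand-in for var.eval(): (kernel_h, kernel_w, in_channels, out_channels)
kernel = np.random.default_rng(0).normal(size=(3, 3, 512, 512))

def filter_slice(kernel, out_idx, in_idx):
    """3x3 weights connecting one input channel to one output filter."""
    return kernel[:, :, in_idx, out_idx]

def filter_image(kernel, out_idx):
    """One filter averaged over its input channels, rescaled to [0, 1]."""
    f = kernel[:, :, :, out_idx].mean(axis=2)   # shape (3, 3)
    f = f - f.min()
    return f / (f.max() + 1e-8)

one_filter = kernel[:, :, :, 0]                 # full filter 0, shape (3, 3, 512)
print(one_filter.shape, filter_slice(kernel, 0, 0).shape, filter_image(kernel, 0).shape)
```

Each 2-D result can be passed to plt.imshow(..., cmap='gray'). For a deep layer with 512 input channels the raw 3x3 slices are usually hard to interpret, which is why activation-maximization approaches are often used instead.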
Since you are using Tensorflow, you might be using tf.keras.Sequential to build the CNN model. In that case, model.summary() lists the names of all the layers along with their output shapes.
Once you have the Layer Name, you can Visualize the Convolutional Filters of that Layer of CNN as shown in the code below:
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras import backend as K
# `model` is assumed to be the already loaded Keras model.
# K.gradients only works in graph mode (TF 1.x, or TF 2.x with eager execution disabled).

#-------------------------------------------------
#Utility function for displaying filters as images
#-------------------------------------------------
def deprocess_image(x):
    x -= x.mean()
    x /= (x.std() + 1e-5)
    x *= 0.1
    x += 0.5
    x = np.clip(x, 0, 1)
    x *= 255
    x = np.clip(x, 0, 255).astype('uint8')
    return x

#---------------------------------------------------------------------------------------------------
#Utility function for generating patterns for given layer starting from empty input image and then
#applying Stochastic Gradient Ascent for maximizing the response of particular filter in given layer
#---------------------------------------------------------------------------------------------------
def generate_pattern(layer_name, filter_index, size=150):
    layer_output = model.get_layer(layer_name).output
    loss = K.mean(layer_output[:, :, :, filter_index])
    grads = K.gradients(loss, model.input)[0]
    grads /= (K.sqrt(K.mean(K.square(grads))) + 1e-5)
    iterate = K.function([model.input], [loss, grads])
    input_img_data = np.random.random((1, size, size, 3)) * 20 + 128.
    step = 1.
    for i in range(80):
        loss_value, grads_value = iterate([input_img_data])
        input_img_data += grads_value * step
    img = input_img_data[0]
    return deprocess_image(img)

#------------------------------------------------------------------------------------------
#Generating convolution layer filters for intermediate layers using above utility functions
#------------------------------------------------------------------------------------------
layer_name = 'conv2d_4'
size = 299
margin = 5
results = np.zeros((8 * size + 7 * margin, 8 * size + 7 * margin, 3))
for i in range(8):
    for j in range(8):
        filter_img = generate_pattern(layer_name, i + (j * 8), size=size)
        horizontal_start = i * size + i * margin
        horizontal_end = horizontal_start + size
        vertical_start = j * size + j * margin
        vertical_end = vertical_start + size
        results[horizontal_start: horizontal_end, vertical_start: vertical_end, :] = filter_img
plt.figure(figsize=(20, 20))
# savefig expects a file name, not an array; show the grid with imshow instead
plt.imshow(results.astype('uint8'))
plt.show()
The above code visualizes only the first 64 filters of a layer. You can change it accordingly.
For more information, you can refer to this article.

How do I discover the values for variables of an equation with keras/tensorflow?

I have an equation that describes a curve in two dimensions. This equation has 5 variables. How do I discover their values with keras/tensorflow for a set of data? Is it possible? Does anyone know a tutorial on something similar?
I generated some data to train the network, in the following format:
sample => [150, 66, 2]: 150 sets of 66 x 2 values, something like "time" x "acceleration"
targets => [150, 5]: 150 sets of 5 variable values.
Obs: I know the range of the variables. I also know that 150 sets of data are too few samples, but once the code works I need to train a new network with experimental data, which is limited too. Visually, the curve is simple: it has a descending linear part at the beginning, and at the end it drops "like an exponential".
My code is as follows:
def build_model():
    model = models.Sequential()
    model.add(layers.Dense(512, activation='relu', input_shape=(66*2,)))
    model.add(layers.Dense(5, activation='softmax'))
    model.compile(optimizer='rmsprop',
                  loss='categorical_crossentropy',
                  metrics=['mae'])
    return model

def smooth_curve(points, factor=0.9):
    [...]
    return smoothed_points

#load the generated data
train_data = np.load('samples00.npy')
test_data = np.load('samples00.npy')
train_targets = np.load('labels00.npy')
test_targets = np.load('labels00.npy')
#normalizing the data
mean = train_data.mean()
train_data -= mean
std = train_data.std()
train_data /= std
test_data -= mean
test_data /= std
#k-fold validation:
k = 3
num_val_samples = len(train_data)//k
num_epochs = 100
all_mae_histories = []
for i in range(k):
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]],
        axis=0)
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]],
        axis=0)
    model = build_model()
    #reshape the data to get the format (100, 66*2)
    partial_train_data = partial_train_data.reshape(100, 66 * 2)
    val_data = val_data.reshape(50, 66 * 2)
    history = model.fit(partial_train_data,
                        partial_train_targets,
                        validation_data = (val_data, val_targets),
                        epochs = num_epochs,
                        batch_size = 1,
                        verbose = 1)
    mae_history = history.history['val_mean_absolute_error']
    all_mae_histories.append(mae_history)
average_mae_history = [
    np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)]
smooth_mae_history = smooth_curve(average_mae_history[10:])
plt.plot(range(1, len(smooth_mae_history) + 1), smooth_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()
Obviously I need to get the best accuracy possible, but as it is I am getting a "mean absolute error (MAE)" of about 96%, which is unacceptable.
I see some basic bugs in this methodology. The final layer of your network is a softmax layer. That means it outputs 5 values that sum to 1 and behave as a probability distribution. What you actually want to predict are real numbers, i.e. floating point values (under some fixed precision arithmetic).
If you know the range, then using a sigmoid and rescaling the final layer to match that range (just multiply by the max value) would probably help. By default, a sigmoid ensures you get 5 numbers between 0 and 1.
The other thing to do is to remove the cross-entropy loss and use a regression loss such as mean squared error (or its root, RMSE), so that you predict your numbers well. You could also use 1D convolutions instead of fully connected layers.
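The sigmoid-and-rescale idea can be sketched in NumPy as follows. The ranges in `low` and `high` are made-up placeholders, not values from the question, and the sketch generalizes "multiply by the max value" slightly by also allowing a nonzero lower bound.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def scale_to_range(raw, low, high):
    """Map unbounded network outputs to [low, high] per variable."""
    return low + (high - low) * sigmoid(raw)

# hypothetical known ranges for the 5 variables
low = np.array([0.0, -1.0, 10.0, 0.5, 100.0])
high = np.array([1.0, 1.0, 50.0, 2.5, 900.0])

raw = np.random.default_rng(0).normal(size=(3, 5)) * 4.0   # fake pre-activation outputs
pred = scale_to_range(raw, low, high)

target = scale_to_range(np.zeros((3, 5)), low, high)       # midpoints of each range
mse = np.mean((pred - target) ** 2)                        # regression loss instead of cross-entropy
print(pred.min(axis=0) >= low, pred.max(axis=0) <= high, mse)
```

An equivalent and often simpler option is to normalize the targets to [0, 1] before training, keep a plain sigmoid output, and undo the scaling on the predictions afterwards.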
There has been some work here: https://julialang.org/blog/2017/10/gsoc-NeuralNetDiffEq which tries to solve DEs and might be relevant to your work.

How does Tensorflow Batch Normalization work?

I'm using tensorflow batch normalization in my deep neural network successfully. I'm doing it the following way:
if apply_bn:
    with tf.variable_scope('bn'):
        beta = tf.Variable(tf.constant(0.0, shape=[out_size]), name='beta', trainable=True)
        gamma = tf.Variable(tf.constant(1.0, shape=[out_size]), name='gamma', trainable=True)
        batch_mean, batch_var = tf.nn.moments(z, [0], name='moments')
        ema = tf.train.ExponentialMovingAverage(decay=0.5)

        def mean_var_with_update():
            ema_apply_op = ema.apply([batch_mean, batch_var])
            with tf.control_dependencies([ema_apply_op]):
                return tf.identity(batch_mean), tf.identity(batch_var)

        mean, var = tf.cond(self.phase_train,
                            mean_var_with_update,
                            lambda: (ema.average(batch_mean), ema.average(batch_var)))
        self.z_prebn.append(z)
        z = tf.nn.batch_normalization(z, mean, var, beta, gamma, 1e-3)
        self.z.append(z)
        self.bn.append((mean, var, beta, gamma))
And it works fine both for training and testing phases.
However, I run into problems when I try to use the computed neural network parameters in another project of mine, where I need to compute all the matrix multiplications and so on by myself. The problem is that I can't reproduce the behavior of the tf.nn.batch_normalization function:
feed_dict = {
    self.tf_x: np.array([range(self.x_cnt)]) / 100,
    self.keep_prob: 1,
    self.phase_train: False
}
for i in range(len(self.z)):
    # print 0 layer's 1 value of arrays
    print(self.sess.run([
        self.z_prebn[i][0][1], # before bn
        self.bn[i][0][1], # mean
        self.bn[i][1][1], # var
        self.bn[i][2][1], # offset
        self.bn[i][3][1], # scale
        self.z[i][0][1], # after bn
    ], feed_dict=feed_dict))
# prints
# [-0.077417567, -0.089603029, 0.000436493, -0.016652612, 1.0055743, 0.30664611]
According to the formula on the page https://www.tensorflow.org/versions/r1.2/api_docs/python/tf/nn/batch_normalization:
bn = scale * (x - mean) / (sqrt(var) + 1e-3) + offset
But as we can see,
1.0055743 * (-0.077417567 - -0.089603029)/(0.000436493^0.5 + 1e-3) + -0.016652612
= 0.543057
Which differs from the value 0.30664611, computed by Tensorflow itself.
So what am I doing wrong here, and why can't I just calculate the batch normalized value myself?
Thanks in advance!
The formula used is slightly different from:
bn = scale * (x - mean) / (sqrt(var) + 1e-3) + offset
It should be:
bn = scale * (x - mean) / (sqrt(var + 1e-3)) + offset
The variance_epsilon term is added to the variance before taking the square root, not to sigma, which is the square root of the variance.
After the correction, the formula yields the correct value:
1.0055743 * (-0.077417567 - -0.089603029)/((0.000436493 + 1e-3)**0.5) + -0.016652612
# 0.30664642276945747
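The two formulas can be compared side by side with the numbers printed in the question. This is a small NumPy check, with variable names chosen here for readability:

```python
import numpy as np

# values printed by sess.run in the question
x, mean, var = -0.077417567, -0.089603029, 0.000436493
offset, scale = -0.016652612, 1.0055743
tf_output = 0.30664611
eps = 1e-3

# epsilon added to the standard deviation (the formula tried in the question)
wrong = scale * (x - mean) / (np.sqrt(var) + eps) + offset
# epsilon added to the variance before the square root
correct = scale * (x - mean) / np.sqrt(var + eps) + offset

print(wrong, correct, np.isclose(correct, tf_output, atol=1e-5))
```

The small remaining difference from the TensorFlow output is consistent with TensorFlow computing in float32 and the inputs here being rounded printed values.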