Evaluating the critic score in WGANs - tensorflow

I've been trying to train a WGAN for the last couple of days with gradient penalty involved. I took the gradient penalty code off of a github tensorflow implementation by ChengBinJin.
With a normal DCGAN you'd be able to tell what the accuracy of the discriminator is at any point, because it's trying to learn logits you can pass through a sigmoid function. So if I fed in real images, the accuracy would be close to 100%, very straightforward.
However, with WGANs the discriminator is now a critic and it outputs a score instead, which isn't really translatable into accuracy as far as I can tell. Right now I'm at 3000 iterations and the mean score for real images is at -59,000. So how would one go about trying to gauge accuracy from this score?

Not at all. The Wasserstein critic is mean independent, as it is written as f(x) - f(y). So a function g(x) = f(x) + b has the same Wasserstein distance, i.e. g(x) - g(y) = f(x) + b - f(y) - b = f(x) - f(y).
So the mean on its own gives you no information. What does give you information is the difference between the mean scores of the real and the fake images, i.e. the Wasserstein estimate. The smaller, the better.
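For example, a minimal sketch of the quantity worth logging during training (critic_real and critic_fake are just placeholder names for your critic's outputs on a batch of real and generated images):
import tensorflow as tf

# critic_real and critic_fake are assumed to be the critic's outputs for a batch of
# real and of generated images respectively (placeholder names).
wasserstein_estimate = tf.reduce_mean(critic_real) - tf.reduce_mean(critic_fake)

# The absolute means are meaningless on their own; this difference is the quantity
# to monitor, and it should shrink towards 0 as the generator improves.
tf.summary.scalar("wasserstein_estimate", wasserstein_estimate)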

Binary classification of pairs with opposite labels

I have a data-set without labels, but I do have a way to get pairs of examples with opposite labels, that is given a pair x,z I know that their true labels are either 0,1 or 1,0.
So, I am building a model that accepts pairs of samples as input, and learns to classify them with opposite labels. Assuming I have an arbitrary model for predicting a single sample, y_hat = f(x), I am building a model with Keras that accepts pairs of samples (x,z) and outputs pairs of predictions, f(x), f(z). I then use a custom loss function that drives the model towards the correct direction: Given that a regular binary classifier is trained using the Binary Cross Entropy (BCE) to make the predicted and desired output "close", I use the negative BCE. Also, since BCE is not symmetric, I symmetrize it. So, the loss function I give the model.compile method is:
from tensorflow import keras

bce = keras.losses.BinaryCrossentropy()

def neg_sym_bce(y1, y2):
    return -0.5 * (bce(y1, y2) + bce(y2, y1))
My problem is, this model fails to learn to classify even a single pair of my data (I get f(x)~=f(z)~=0.5), and if I try to train it with synthetic "easy" data, it takes hundreds of epochs to converge (also on a single pair).
This made me suspect that it has to do with a "vanishing gradient" problem. Indeed, when I plot (see below) the loss for a single pair, which is a function of 2 variables (the 2 outputs), it is evident that there is a wide plateau around the (0.5, 0.5) point. It is also evident that the global minima are, as expected, around the points (0,1) and (1,0).
So, is there a way to deal with the vanishing gradient here? I read about the problem but the references I found deal with vanishing gradient in the network, not in the loss itself.
Or, is there another loss that can drive the model to predict opposite labels?
Think: if your labels are always either 0,1 or 1,0, just use categorical_crossentropy for the loss.

Backpropagation with python/numpy - calculating derivative of weight and bias matrices in neural network

I'm developing a neural network model in python, using various resources to put together all the parts. Everything is working, but I have questions about some of the math. The model has variable number of hidden layers, uses relu activation for all hidden layers except for the last one, which uses sigmoid.
The cost function is:
def calc_cost(AL, Y):
    m = Y.shape[1]
    # binary cross-entropy cost, averaged over the m examples
    cost = (-1/m) * np.sum((Y * np.log(AL)) + ((1 - Y) * np.log(1 - AL)))
    return cost
where AL is probability prediction after last sigmoid activation is applied.
In part of my implementation of backpropagation, I use the following
def linear_backward_step(dZ, A_prev, W, b):
    m = A_prev.shape[1]
    dW = (1/m) * np.dot(dZ, A_prev.T)
    db = (1/m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)
    return dA_prev, dW, db
where, given dZ (the derivative of the cost with respect to the linear step of forward propagation at a given layer), the derivative of the layer's weight matrix W, bias vector b, and the derivative of the previous layer's activation dA_prev are each calculated.
The forward part that is complement to this step is this equation: Z = np.dot(W, A_prev) + b
My question is: in calculating dW and db, why is it necessary to multiply by 1/m? I've tried differentiating this using calculus rules but I'm unsure how this term fits in.
Any help is appreciated!
Your gradient calculation seems wrong. You do not multiply it by 1/m. Also, your calculation of m seems wrong as well. It should be
# note it's not A_prev.shape[1]
m = A_prev.shape[0]
Also, in your calc_cost function the definition should be
# should not be Y.shape[1]
m = Y.shape[0]
You can refer to the following example for more information.
Neural Network Case Study
This actually depends on your loss function and on whether you update your weights after each sample or batch-wise. Take a look at the following old-fashioned general-purpose cost function:
MSE = (1/n) * sum_i (y^_i - y_i)^2
(Source: MSE Cost Function for Training Neural Network)
Here, y^_i is the output of your net and y_i is your target value.
If you differentiate this with respect to y^_i you'll never get rid of the 1/n or the sum, because the derivative of a sum is the sum of the derivatives, and since 1/n is a factor of that sum you can't get rid of it either. Now, think about what standard gradient descent is actually doing: it updates your weights after calculating the average over all n samples. Stochastic gradient descent updates after each sample, so you don't have to average. Batch updates calculate the average over each batch, which I guess in your case is the 1/m, where m is the batch size.
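To make the 1/m concrete, here is a small numpy check (shapes are arbitrary) showing that the batched dW from the question is exactly the average of the per-example gradients:
import numpy as np

m = 4                                # batch size (arbitrary)
dZ = np.random.randn(3, m)           # gradient w.r.t. Z, one column per example
A_prev = np.random.randn(5, m)       # previous layer's activations, one column per example

# batched gradient, as in the question
dW_batched = (1 / m) * np.dot(dZ, A_prev.T)

# average of the per-example gradients (outer products)
dW_per_example = [np.outer(dZ[:, i], A_prev[:, i]) for i in range(m)]
dW_averaged = np.mean(dW_per_example, axis=0)

print(np.allclose(dW_batched, dW_averaged))   # True: the 1/m is just the batch average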

Is the L1 regularization in Keras/Tensorflow *really* L1-regularization?

I am employing L1 regularization on my neural network parameters in Keras with keras.regularizers.l1(0.01) to obtain a sparse model. I am finding that, while many of my coefficients are close to zero, few of them are actually zero.
Upon looking at the source code for the regularization, it suggests that Keras simply adds the L1 norm of the parameters to the loss function.
This would be incorrect because the parameters would almost certainly never go to zero (within floating point error) as intended with L1 regularization. The L1 norm is not differentiable when a parameter is zero, so subgradient methods need to be used where the parameters are set to zero if close enough to zero in the optimization routine. See the soft threshold operator max(0, ..) here.
Does Tensorflow/Keras do this, or is this impractical to do with stochastic gradient descent?
EDIT: Also here is a superb blog post explaining the soft thresholding operator for L1 regularization.
So, despite @Joshua's answer, there are three other things worth mentioning:
There is no problem connected with the gradient at 0: keras automatically sets it to 1, similarly to the relu case.
Remember that values smaller than 1e-6 are effectively equal to 0, as this is float32 precision.
The problem that most of the values are not set to 0 can arise for computational reasons, due to the nature of a gradient-descent based algorithm (and a high l1 value): oscillations can occur because of the gradient discontinuity. To understand this, imagine that for a given weight w = 0.005 your learning rate is 0.01 and the gradient of the main loss is 0 w.r.t. w. Then your weight would be updated in the following manner:
w = 0.005 - 1 * 0.01 = -0.005 (because the gradient is equal to 1, as w > 0),
and after the second update:
w = -0.005 + 1 * 0.01 = 0.005 (because the gradient is equal to -1, as w < 0).
As you can see, the absolute value of w hasn't decreased even though you applied l1 regularization, and this is due to the nature of the gradient-based algorithm. Of course, this is a simplified situation, but you can run into such oscillating behaviour really often when using an l1 norm regularizer.
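A tiny simulation of that effect, with only the l1 subgradient acting on w (the main-loss gradient is assumed to be 0, numbers as in the example above):
# w flips between -0.005 and 0.005 forever and never reaches exactly 0
w = 0.005
learning_rate = 0.01

for step in range(6):
    subgradient = 1.0 if w > 0 else -1.0   # subgradient of |w| away from 0
    w = w - learning_rate * subgradient
    print(step, w)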
Keras correctly implements L1 regularization. In the context of neural networks, L1 regularization simply adds the L1 norm of the parameters to the loss function (see CS231).
While L1 regularization does encourage sparsity, it does not guarantee that the output will be sparse. The parameter updates from stochastic gradient descent are inherently noisy. Thus, the probability that any given parameter is exactly 0 is vanishingly small.
However, many of the parameters of an L1-regularized network are often close to 0. A rudimentary approach would be to threshold small values to 0. There has been research exploring more advanced methods of generating sparse neural networks. In this paper, the authors simultaneously prune and train a neural network to achieve 90-95% sparsity on a number of well-known network architectures.
TL;DR:
The formulation in deep learning frameworks is correct, but currently we don't have a powerful solver/optimizer to solve it EXACTLY with SGD or its variants. If you use a proximal optimizer, however, you can obtain a sparse solution.
Your observation is right.
Almost all deep learning frameworks (including TF) implement L1 regularization by adding the absolute values of the parameters to the loss function. This is the Lagrangian form of L1 regularization and IS CORRECT.
However, the SOLVER/OPTIMIZER is to blame. Even for the well-studied LASSO problem, where the solution should be sparse and the soft-threshold operator DOES give us the sparse solution, the subgradient descent solver CANNOT get the EXACT SPARSE solution. This answer from Quora gives some insight into the convergence properties of subgradient descent, which says:
Subgradient descent has very poor convergence properties for non-smooth functions, such as the Lasso objective, since it ignores problem structure completely (it doesn't distinguish between the least squares fit and the regularization term) by just looking at subgradients of the entire objective. Intuitively, taking small steps in the direction of the (sub)gradient usually won't lead to coordinates equal to zero exactly.
If you use proximal operators, you can get a sparse solution. For example, you can have a look at the paper "Data-driven sparse structure selection for deep neural networks" (this one comes with MXNET code and is easy to reproduce!) or "Stochastic Proximal Gradient Descent with Acceleration Techniques" (this one gives more theoretical insight). I'm not quite sure whether the built-in proximal optimizer in TF (e.g. tf.train.ProximalAdagradOptimizer) can lead to sparse solutions, but you may give it a try.
Another simple workaround is to zero out small weights (i.e. absolute value < 1e-4) after training, or after each gradient descent step, to force sparsity. This is just a handy heuristic and not theoretically rigorous.
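A minimal sketch of that heuristic for an already-trained keras model (the 1e-4 threshold is just the example value from above):
import numpy as np

def zero_out_small_weights(model, threshold=1e-4):
    # set every weight whose absolute value is below `threshold` to exactly 0
    for layer in model.layers:
        pruned = [np.where(np.abs(w) < threshold, 0.0, w) for w in layer.get_weights()]
        layer.set_weights(pruned)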
Keras implements L1 regularization properly, but this is not a LASSO. For the LASSO one would need a soft-thresholding function, as correctly pointed out in the original post. It would be very useful to have a function similar to keras.layers.ThresholdedReLU(theta=1.0), but with f(x) = x for x > theta or x < -theta, and f(x) = 0 otherwise. For the LASSO, theta would be equal to the learning rate times the regularization factor of the L1 function.
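For reference, here is a plain-numpy sketch of the standard soft-thresholding (proximal) operator mentioned in the question; keras does not apply anything like this automatically:
import numpy as np

def soft_threshold(w, theta):
    # shrink every weight towards 0 and set it to exactly 0 whenever |w| <= theta
    return np.sign(w) * np.maximum(np.abs(w) - theta, 0.0)

# For the LASSO, theta = learning_rate * l1_factor, applied to the weights after each
# gradient step on the un-regularized loss.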

Adam optimizer goes haywire after 200k batches, training loss grows

I've been seeing a very strange behavior when training a network, where after a couple of 100k iterations (8 to 10 hours) of learning fine, everything breaks and the training loss grows:
The training data itself is randomized and spread across many .tfrecord files containing 1000 examples each, then shuffled again in the input stage and batched to 200 examples.
The background
I am designing a network that performs four different regression tasks at the same time, e.g. determining the likelihood of an object appearing in the image and simultaneously determining its orientation. The network starts with a couple of convolutional layers, some with residual connections, and then branches into four fully-connected segments.
Since the first regression results in a probability, I'm using cross entropy for its loss, whereas the others use the classical L2 distance. However, due to their nature, the probability loss is on the order of 0..1, while the orientation losses can be much larger, say 0..10. I have already normalized both input and output values and use clipping
normalized = tf.clip_by_average_norm(inferred.sin_cos, clip_norm=2.)
in cases where things can get really bad.
I've been (successfully) using the Adam optimizer to optimize on the tensor containing all distinct losses (rather than reduce_suming them), like so:
reg_loss = tf.reduce_sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
loss = tf.pack([loss_probability, sin_cos_mse, magnitude_mse, pos_mse, reg_loss])
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate,
                                   epsilon=self.params.adam_epsilon)
op_minimize = optimizer.minimize(loss, global_step=global_step)
In order to display the results in TensorBoard, I then actually do
loss_sum = tf.reduce_sum(loss)
for a scalar summary.
Adam is set to learning rate 1e-4 and epsilon 1e-4 (I see the same behavior with the default value for epsilon, and it breaks even faster when I keep the learning rate at 1e-3). Regularization also has no influence on this one; it does this sort-of consistently at some point.
I should also add that stopping the training and restarting from the last checkpoint - implying that the training input files are shuffled again as well - results in the same behavior. The training always seems to behave similarly at that point.
Yes. This is a known problem of Adam.
The equations for Adam are
t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
where m is an exponential moving average of the mean gradient and v is an exponential moving average of the squares of the gradients. The problem is that when you have been training for a long time, and are close to the optimal, then v can become very small. If then all of a sudden the gradients starts increasing again it will be divided by a very small number and explode.
By default beta1=0.9 and beta2=0.999, so m changes much more quickly than v. This means m can become big again while v is still small and cannot catch up.
To remedy this problem you can increase epsilon, which is 1e-8 by default, thus avoiding the division by almost 0.
Depending on your network, a value of epsilon in 0.1, 0.01, or 0.001 might be good.
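A quick back-of-the-envelope illustration of why a larger epsilon helps (m, v and the learning rate are made-up values, with v assumed to have decayed to almost nothing near the optimum):
import math

m = 0.01     # moving average of the gradient (made-up value)
v = 1e-10    # moving average of the squared gradient, tiny near the optimum (made-up value)
lr = 1e-4

for epsilon in (1e-8, 1e-4, 1e-1):
    step = lr * m / (math.sqrt(v) + epsilon)
    print(epsilon, step)
# With epsilon = 1e-8 the denominator is ~1e-5 and the step blows up to ~0.1;
# with epsilon = 1e-1 the step stays around 1e-5.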
Yes, this could be some sort of super complicated unstable numbers/equations case, but most certainly your learning rate is simply too high, as your loss quickly decreases until 25K iterations and then oscillates a lot at the same level. Try multiplying it by 0.1 and see what happens. You should be able to reach an even lower loss value.
Keep exploring! :)

Does Stochastic Gradient Descent even work with TensorFlow?

I designed a MLP, fully connected, with 2 hidden and one output layer.
I get a nice learning curve if I use batch or mini-batch gradient descent, but only a straight line (violet in the plot) while performing stochastic gradient descent.
What did I get wrong?
In my understanding, I do stochastic gradient descent with Tensorflow, if I provide just one train/learn example each train step, like:
X = tf.placeholder("float", [None, amountInput],name="Input")
Y = tf.placeholder("float", [None, amountOutput],name="TeachingInput")
...
m, i = sess.run([merged, train_op], feed_dict={X:[input],Y:[label]})
Whereby input is a 10-component vector and label is a 20-component vector.
For testing I run 1000 iterations; each iteration contains one of 50 prepared train/learn examples.
I expected an overfitted nn. But as you see, it doesn't learn :(
Because the nn will run in an online-learning environment, mini-batch or batch gradient descent isn't an option.
Thanks for any hints.
The batch size influences the effective learning rate.
If you think about the update formula for a single parameter, you'll see that it is updated by averaging the various values computed for that parameter over every element in the input batch.
This means that if you're working with a batch of size n, your "real" learning rate per single parameter is about learning_rate/n.
Thus, if the model you've trained with batches of size n trained without issues, this is because the learning rate was fine for that batch size.
If you use pure stochastic gradient descent, you have to lower the learning rate (usually by a factor of some power of 10).
So, for example, if your learning rate was 1e-4 with a batch size of 128, try a learning rate of 1e-4 / 128.0 and see if the network learns (it should).
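A minimal sketch of that adjustment in the same TF1-style API as the question (loss is assumed to be defined elsewhere; the numbers are the example above):
import tensorflow as tf

batch_size = 128
base_learning_rate = 1e-4                              # rate that worked for batches of 128

sgd_learning_rate = base_learning_rate / batch_size    # rate to try for single-example updates
train_op = tf.train.GradientDescentOptimizer(sgd_learning_rate).minimize(loss)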