area = pi*r^2: what loss/optimizer to use, and why does the code below not predict correctly? - tensorflow

Given the radius r and the area of a circle, I want the NN to predict correct values. However, the code below does not predict them. What change do I need to make to the loss/optimizer function? It would be great if you could provide some reasoning for the choice of loss/optimizer.
from tensorflow import keras
import numpy as np

# A single Dense unit is a purely linear model: f(r) = W*r + B
model = keras.Sequential([keras.layers.Dense(units=1, input_shape=[1])])
model.compile(optimizer='sgd', loss='mean_squared_error')

radiusTrainValues = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], dtype=float)
areaTrainValues = np.array([3.14159, 12.56637, 28.27433, 50.26548, 78.53982, 113.09734], dtype=float)
model.fit(radiusTrainValues, areaTrainValues, epochs=5000)

radiusTestValues = np.array([7.0, 8.0, 9.0, 10.0], dtype=float)
areaTestValues = np.array([153.93804, 201.06193, 254.469, 314.15927], dtype=float)
print("Input        :", radiusTestValues)
print("CorrectValues:", areaTestValues)
print("TF Predicted :", model.predict(radiusTestValues))

The problem here, I believe, does not come from your loss function. The loss is just an indicator of prediction performance; the optimizer adjusts each model parameter in the direction of the gradient of the loss with respect to that parameter (dLoss/dW). What is happening is that you are trying to approximate the function f(r) = pi * r^2 with a single neuron, which can only represent f(r) = W * r + B. In other words, you are approximating a parabola (r^2) with a straight line (W * r + B), so the loss decreases up to a point and then gets stuck, because that is the best a linear model can do. If you sketch it yourself, you will see a large gap between the line and the parabola.
What you can do is increase the number of layers and neurons (with a nonlinear activation in between, since stacked linear layers are still linear); you will then see a huge improvement.
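For example, a minimal sketch of that suggestion (the layer sizes, activation, optimizer, and epoch count below are my own choices, not from the original answer):

from tensorflow import keras
import numpy as np

# Small nonlinear network so the model can approximate r^2 on the training range
model = keras.Sequential([
    keras.layers.Dense(units=64, activation='relu', input_shape=[1]),
    keras.layers.Dense(units=64, activation='relu'),
    keras.layers.Dense(units=1),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01),
              loss='mean_squared_error')

r_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], dtype=float).reshape(-1, 1)
a_train = np.pi * r_train ** 2
model.fit(r_train, a_train, epochs=2000, verbose=0)

# Predictions inside the training range should now be close to pi*r^2
print(model.predict(np.array([[2.5], [4.5]])))

Note that this fits the training range well, but like any generic network it will still extrapolate poorly far outside the radii it was trained on; if exact extrapolation matters, feeding r^2 as the input feature lets even a single linear neuron recover W close to pi.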

Related

Implement Kullback-Leibler as loss function for arbitrary distributions

I had already implemented a gradient-descent optimization using the TensorFlow Probability built-in KL divergence as the loss function. It worked well in principle, but then I found out that the list of registered distribution pairs for which the KL divergence is available is quite limited. I tried to minimize the KL divergence between a Gaussian mixture model (as the true distribution) and a normal distribution (optimizing its mean and standard deviation so that the KL divergence becomes minimal), which was not possible.
So I tried to implement my own approach, which did not work:
import numpy as np
from scipy.stats import norm
import tensorflow as tf
The idea I had was to create the densities of the needed distributions via scipy.stats (let's say normal distributions) and convert the density arrays to tensors:
x = np.arange(-10,10,0.001)
mu_train = tf.Variable(2.0)
p_pdf = norm.pdf(x, 0, 1)
q_pdf = norm.pdf(x, mu_train,1)
p = tf.convert_to_tensor(p_pdf)
q = tf.convert_to_tensor(q_pdf, dtype=tf.float64)
Now I defined the KL divergence as a function that only depends on q:
def kl_loss(q):
    return tf.reduce_sum(
        tf.where(p == 0, tf.zeros(p.shape, tf.float64), p * tf.math.log(p / q))
    )
Then I calculated the gradient of kl_loss with respect to mu_train, but the output I get from this is None.
with tf.GradientTape() as tape:
    tape.watch(mu_train)
    loss = kl_loss(q)
d_loss_d_mu = tape.gradient(loss, mu_train)
print(d_loss_d_mu)
Now that I have thought about it, getting None as output makes sense to me: kl_loss(q) only depends on the values q(x) generated by the density q, not on mu_train directly, since mu_train is just a parameter of the normal distribution, while the input to kl_loss is an array/tensor of precomputed density values.
Does anyone know a workaround for this, or a completely different way to use the KL divergence as a loss function with arbitrary distributions, such that I can compute gradients with respect to the parameters and run a gradient-descent minimizer?
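One possible workaround, as a sketch (not the poster's code): compute q with TensorFlow ops as an explicit function of mu_train inside the GradientTape. Since scipy.stats evaluates the density in NumPy, that computation is invisible to TensorFlow; re-expressing the pdf in TF ops restores the gradient path.

import numpy as np
import tensorflow as tf

x = tf.constant(np.arange(-10, 10, 0.001), dtype=tf.float64)
mu_train = tf.Variable(2.0, dtype=tf.float64)

def normal_pdf(x, mu, sigma):
    # N(mu, sigma) density written with TF ops so it stays differentiable
    return tf.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

p = normal_pdf(x, tf.constant(0.0, tf.float64), 1.0)  # fixed target density

with tf.GradientTape() as tape:
    q = normal_pdf(x, mu_train, 1.0)  # q now depends on mu_train inside the tape
    loss = tf.reduce_sum(
        tf.where(p == 0, tf.zeros_like(p), p * tf.math.log(p / q))
    )

print(tape.gradient(loss, mu_train))  # a real gradient instead of None

The same pattern works for other densities, as long as every step from the trainable parameters to the loss is expressed in TensorFlow (or tfp.distributions) operations; the loss can then be wrapped in a training loop with optimizer.apply_gradients.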

How to correctly use DenseFlipout layers with TensorFlow Probability

I am a novice with both TensorFlow and TensorFlow Probability.
I am using this network for a regression task.
def normal_sp(params):
    return tfd.Normal(loc=params[:, 0:1],
                      scale=1e-3 + tf.math.softplus(0.05 * params[:, 1:2]))

kernel_divergence_fn = lambda q, p, _: tfp.distributions.kl_divergence(q, p) / (x.shape[0] * 1.0)
bias_divergence_fn = lambda q, p, _: tfp.distributions.kl_divergence(q, p) / (x.shape[0] * 1.0)
inputs = Input(shape=(1,), name="input layer")

hidden = tfp.layers.DenseFlipout(
    50,
    bias_posterior_fn=tfp.layers.util.default_mean_field_normal_fn(),
    bias_prior_fn=tfp.layers.default_multivariate_normal_fn,
    kernel_divergence_fn=kernel_divergence_fn,
    bias_divergence_fn=bias_divergence_fn,
    activation="relu", name="DenseFlipout_layer_1")(inputs)
hidden = tfp.layers.DenseFlipout(
    100,
    bias_posterior_fn=tfp.layers.util.default_mean_field_normal_fn(),
    bias_prior_fn=tfp.layers.default_multivariate_normal_fn,
    kernel_divergence_fn=kernel_divergence_fn,
    bias_divergence_fn=bias_divergence_fn,
    activation="relu", name="DenseFlipout_layer_2")(hidden)
hidden = tfp.layers.DenseFlipout(
    100,
    bias_posterior_fn=tfp.layers.util.default_mean_field_normal_fn(),
    bias_prior_fn=tfp.layers.default_multivariate_normal_fn,
    kernel_divergence_fn=kernel_divergence_fn,
    bias_divergence_fn=bias_divergence_fn,
    activation="relu", name="DenseFlipout_layer_3")(hidden)
params = tfp.layers.DenseFlipout(
    2,
    bias_posterior_fn=tfp.layers.util.default_mean_field_normal_fn(),
    bias_prior_fn=tfp.layers.default_multivariate_normal_fn,
    kernel_divergence_fn=kernel_divergence_fn,
    bias_divergence_fn=bias_divergence_fn,
    name="DenseFlipout_layer_4")(hidden)

dist = tfp.layers.DistributionLambda(normal_sp)(params)

model_vi = Model(inputs=inputs, outputs=dist)
# NLL: negative log-likelihood loss, e.g. lambda y, dist: -dist.log_prob(y)
model_vi.compile(Adam(learning_rate=0.002), loss=NLL)
model_params = Model(inputs=inputs, outputs=params)
My question is related to the loss function.
In the example posted here, the authors add the KL divergence to the loss function:
https://www.tensorflow.org/probability/api_docs/python/tfp/layers/DenseFlipout
kl = sum(model.losses)
loss = neg_log_likelihood + kl
but in the example here https://colab.research.google.com/github/tensorchiefs/dl_book/blob/master/chapter_08/nb_ch08_03.ipynb
the loss function is simply the NLL. My question is: do I have to add the KL divergence manually, or does TensorFlow calculate it automatically? And in the first case, how do I do it, since model.losses doesn't seem to work? Thanks to anyone who can help.
If you're using Keras to train, the per-layer losses (the KLs) are included in the overall loss (I am 90% sure this is right -- you could check by overriding the divergence fns (kernel_divergence_fn / bias_divergence_fn) to return some absurd value and see if your overall loss becomes absurd).
In the example from the docs (which are, ahem, a bit ancient), Keras is not doing the training; instead an optimizer is applied to a manually written loss, and so one has to grab all the per-layer losses and add them in.
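A minimal version of that sanity check, as a sketch (assuming a TF/TFP combination in which DenseFlipout builds normally; the toy data and layer size are made up): each DenseFlipout layer registers its KL term in model.losses, and Keras adds those losses to the compiled loss during fit, so in a manual training loop you would sum them yourself.

import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

# Make the KL term absurdly large on purpose.
absurd_kl = lambda q, p, _: tf.constant(1e9)

inputs = tf.keras.Input(shape=(1,))
outputs = tfp.layers.DenseFlipout(1, kernel_divergence_fn=absurd_kl)(inputs)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")

x = np.random.randn(32, 1).astype("float32")
y = np.random.randn(32, 1).astype("float32")
# If the loss printed here is on the order of 1e9, the per-layer KL terms
# (model.losses) are being added to the compiled loss automatically.
model.fit(x, y, epochs=1, verbose=1)

# In a manual training loop (as in the docs example) you add them yourself:
# total_loss = neg_log_likelihood + tf.reduce_sum(model.losses)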

Why can't I classify my data perfectly on this simple problem using a NN?

I have a set of observations made of 10 features, each of these features being a real number in the interval (0,2). Say I wanted to train a simple neural network to classify whether the average of those features is above or below 1.0.
Unless I'm missing something, a two-layer network with one neuron in each layer should be enough. The activation functions would be linear (i.e. no activation function) on the first layer and a sigmoid on the output layer. An example of a NN with this architecture that would work is one that computes the average in the first layer (i.e. all weights = 0.1 and bias = 0) and assesses whether that average is above or below 1.0 in the second layer (i.e. weight = 1.0 and bias = -1.0).
When I implement this using TensorFlow (see code below), I obviously get a very high accuracy quite quickly, but never reach 100% accuracy. I would like some help to understand conceptually why this is the case. I don't see why the backpropagation algorithm does not reach a set of optimal weights (maybe this is related to the loss function I'm using, which has local minima?). Also, I would like to know whether 100% accuracy is achievable if I use different activations and/or a different loss function.
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
X = [np.random.random(10)*2.0 for _ in range(10000)]
X = np.array(X)
y = X.mean(axis=1) >= 1.0
y = y.astype('int')
train_ratio = 0.8
train_len = int(X.shape[0]*train_ratio)
X_train, X_test = X[:train_len,:], X[train_len:,:]
y_train, y_test = y[:train_len], y[train_len:]
def create_classifier(lr=0.001):
    classifier = tf.keras.Sequential()
    classifier.add(tf.keras.layers.Dense(units=1))
    classifier.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    metrics = [tf.keras.metrics.BinaryAccuracy()]
    classifier.compile(optimizer=optimizer,
                       loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
                       metrics=metrics)
    return classifier
classifier = create_classifier(lr = 0.1)
history = classifier.fit(X_train, y_train, batch_size=1000, validation_split=0.1, epochs=2000)
Ignoring the fact that a neural network is an odd approach for this problem, and answering your specific question: it looks like your learning rate might be too high, which could explain the fluctuations around the optimal point.
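As a sketch of that suggestion (reusing the question's create_classifier and data; the learning rate and epoch count are my own guesses, not from the original answer), simply dropping the learning rate should push the accuracy much closer to 100%:

# Same model and data as above, just a smaller learning rate so Adam can
# settle near the exact solution (layer 1 weights ~0.1, bias ~0; layer 2
# weight/bias a scaled version of (1.0, -1.0)).
classifier = create_classifier(lr=0.001)
history = classifier.fit(X_train, y_train, batch_size=1000,
                         validation_split=0.1, epochs=2000, verbose=0)
print(classifier.evaluate(X_test, y_test, verbose=0))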

Weak optimizers in Pytorch

Consider a simple line fitting a * x + b = x, where a, b are the optimized parameters and x is the observed vector given by
import torch
X = torch.randn(1000,1,1)
One can immediately see that the exact solution is a=1, b=0 for any x and it can be found as easily as:
import numpy as np
np.polyfit(X.numpy().flatten(), X.numpy().flatten(), 1)
I am trying now to find this solution by means of gradient descent in PyTorch, where the mean square error is used as an optimization criterion.
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
from torch.optim import Adam, SGD, Adagrad, ASGD
X = torch.randn(1000,1,1) # Sample data

class SimpleNet(nn.Module): # Trivial neural network containing two weights
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.f1 = nn.Linear(1, 1)

    def forward(self, x):
        x = self.f1(x)
        return x

# Testing default settings of 3 basic optimizers
K = 500

net = SimpleNet()
optimizer = Adam(params=net.parameters())
Adam_losses = []
optimizer.zero_grad() # zero the gradient buffers
for k in range(K):
    for b in range(1): # single batch
        loss = torch.mean((net.forward(X[b,:,:]) - X[b,:,:])**2)
        loss.backward()
        optimizer.step()
        Adam_losses.append(float(loss.detach()))

net = SimpleNet()
optimizer = SGD(params=net.parameters(), lr=0.0001)
SGD_losses = []
optimizer.zero_grad() # zero the gradient buffers
for k in range(K):
    for b in range(1): # single batch
        loss = torch.mean((net.forward(X[b,:,:]) - X[b,:,:])**2)
        loss.backward()
        optimizer.step()
        SGD_losses.append(float(loss.detach()))

net = SimpleNet()
optimizer = Adagrad(params=net.parameters())
Adagrad_losses = []
optimizer.zero_grad() # zero the gradient buffers
for k in range(K):
    for b in range(1): # single batch
        loss = torch.mean((net.forward(X[b,:,:]) - X[b,:,:])**2)
        loss.backward()
        optimizer.step()
        Adagrad_losses.append(float(loss.detach()))
The training progress in terms of loss evolution can then be plotted for each optimizer.
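A plotting sketch (my own, not the original post's code), using the loss lists collected above and the matplotlib import from the top of the snippet, standing in for the original figure:

# Loss evolution of the three optimizers, on a log scale
plt.plot(Adam_losses, label="Adam (defaults)")
plt.plot(SGD_losses, label="SGD (lr=0.0001)")
plt.plot(Adagrad_losses, label="Adagrad (defaults)")
plt.yscale("log")
plt.xlabel("iteration")
plt.ylabel("MSE loss")
plt.legend()
plt.show()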
What surprises me is the very slow convergence of the algorithms in their default settings. I thus have 2 questions:
1) Is it possible to achieve an arbitrarily small error (loss) purely by means of some PyTorch optimizer? Since the loss function is convex, it should definitely be possible; however, I am not able to figure out how to achieve this using PyTorch. Note that the above 3 optimizers cannot do that - see the loss progress on a log scale for 20000 iterations.
2) I am wondering how these optimizers can work well in complex examples when they do not work well even in this extremely simple one. Or (and that is the second question) is there something wrong in my application of them above that I missed?
The place where you call zero_grad is wrong. Because it sits outside the loop, the gradient from each iteration is added to the accumulated gradients of all previous iterations before being applied. This makes the loss oscillate as it gets close to the minimum: the accumulated gradient keeps throwing the parameters off the solution again.
The code below will easily perform the task:
import torch
from torch.optim import Adam

# SimpleNet as defined in the question above
X = torch.randn(1000, 1, 1)
net = SimpleNet()
optimizer = Adam(params=net.parameters())
EPOCHS = 20000  # upper bound; the loop breaks once the loss is small enough

for epoch in range(EPOCHS):
    optimizer.zero_grad()  # zero the gradient buffers every iteration
    loss = torch.mean((net.forward(X) - X) ** 2)
    if loss < 1e-8:
        print(epoch, loss)
        break
    loss.backward()
    optimizer.step()
1) Is it possible to achieve an arbitrarily small error (loss) purely by means of some PyTorch optimizer?
Yes - the precision above is reached in around ~1500 epochs, and you can go lower, down to machine (here float) precision.
2) I am wondering how the optimizers can work well in complex examples, when they do not work well even in this extremely simple example.
Currently, we don't have anything better (at least widespread) for network optimization than first-order methods. Those are used because it is much faster to compute a gradient than the Hessians required by higher-order methods. Also, complex, non-convex functions may have a lot of local minima that more or less fulfill the task we throw at them; there is no need for a global minimum per se (although one may be reached under some conditions, see this paper).

Compute gradient of the outputs wrt the weights

Starting from a TensorFlow model, I would like to be able to retrieve the gradient of the outputs with respect to the weights. Backpropagation aims to compute the gradient of the loss wrt the weights; in order to do that, somewhere in the code the gradient of the outputs wrt the weights has to be computed.
But I am wondering how to get this Jacobian at the API level. Any ideas?
I know that we have access to the tape, but I am not sure what to do with it. Actually, I do not need the whole Jacobian; I just need to be able to compute the matrix-vector product J^T v, where J^T is the transpose of the Jacobian and v is a given vector.
Thank you,
Regards.
If you only need the vector-Jacobian product, computing only that will be much more efficient than computing the full Jacobian: building the Jacobian of a function with an N-dimensional output takes on the order of N backward passes, versus a single backward pass for a vector-Jacobian product.
So how do you compute a vector-Jacobian product in TensorFlow? The trick is to use the output_gradients keyword arg in the gradient function. You set the value of output_gradients to the vector in the vector-Jacobian product. Let's look at an example.
import tensorflow as tf

with tf.GradientTape() as g:
    x = tf.constant([1.0, 2.0])
    g.watch(x)
    y = x * x  # y is a length-2 vector

vec = tf.constant([2.0, 3.0])  # the vector in the vector-Jacobian product
grad = g.gradient(y, x, output_gradients=vec)
print(grad)  # prints the vector-Jacobian product, [4., 12.]
Note: if you try to compute the gradient of a vector-valued (rather than scalar) function in TensorFlow without setting output_gradients, it computes a vector-Jacobian product where the vector is set to all ones. For example,
import tensorflow as tf

with tf.GradientTape() as g:
    x = tf.constant([1.0, 2.0])
    g.watch(x)
    y = x * x  # y is a length-2 vector

grad = g.gradient(y, x)
print(grad)  # prints the vector-Jacobian product with a vector of ones, [2.0, 4.0]