Tensorflow step size incredibly small to prevent errors? - tensorflow

I'm trying to solve a simple linear regression problem using gradient descent with TensorFlow, but unless I set my step size really, really small, the weight and bias balloon and overflow almost immediately. Here's my code:
import numpy as np
import tensorflow as tf
# Read the data
COLUMNS = ["url", "title_length", "article_length", "keywords", "shares"]
data = np.genfromtxt("OnlineNewsPopularitySample3.csv", delimiter=',', names=COLUMNS)
# We're looking for shares based on article_length
article_length = tf.placeholder("float")
shares = tf.placeholder("float")
# Set up the variables we're going to use
initial_m = 1.0
initial_b = 1.0
w = tf.Variable([initial_m, initial_b], name="w")
predicted_shares = tf.multiply(w[0], article_length) + w[1]
error = tf.square(predicted_shares - shares)
# This is as big as I can make it; any larger, and I have problems.
step_size = .000000025
optimizer = tf.train.GradientDescentOptimizer(step_size).minimize(error)
model = tf.global_variables_initializer()
with tf.Session() as session:
    # First initialize all the variables
    session.run(model)
    # Now we're going to run the optimizer
    for i in range(100000):
        session.run(optimizer, feed_dict={article_length: data['article_length'], shares: data['shares']})
        if i % 100 == 0:
            print(session.run(w))
    # Once it's done, we need to get the value of w so we can display it.
    w_value = session.run(w)
    print("Predicted model: {a:.3f}x + {b:.3f}".format(a=w_value[0], b=w_value[1]))
So basically, when I run this, the outputs become "NaN" almost immediately. Any ideas?
Thanks in advance!

A very low learning rate means a very small update to the weights. In your case even a relatively small learning rate blows up the weights, because the weight updates (dE/dW) are very large, and the update is a function of the output error. If the labels are large values, the squared error will be huge at the start, since the initial predictions are quite small. Try scaling the outputs (and ideally the inputs as well) to avoid this problem.
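For example (a minimal sketch using the question's column names; the scaling itself is the suggestion, not something from the original post), standardizing both the feature and the target keeps the gradients small enough that an ordinary learning rate such as 0.01 works:

import numpy as np

# 'data', 'article_length', 'shares' and 'optimizer' refer to the question's code above.
al = data['article_length']
sh = data['shares']
al_scaled = (al - al.mean()) / al.std()
sh_scaled = (sh - sh.mean()) / sh.std()

step_size = 0.01  # a normal learning rate now works on the scaled data
# session.run(optimizer, feed_dict={article_length: al_scaled, shares: sh_scaled})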

Related

tf.keras.layers.BatchNormalization with trainable=False appears to not update its internal moving mean and variance

I am trying to find out how exactly the BatchNormalization layer behaves in TensorFlow. I came up with the following piece of code which, to the best of my knowledge, should be a perfectly valid Keras model; however, the mean and variance of BatchNormalization do not appear to be updated.
From the docs (https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization):
in the case of the BatchNormalization layer, setting trainable = False on the layer means that the layer will be subsequently run in inference mode (meaning that it will use the moving mean and the moving variance to normalize the current batch, rather than using the mean and variance of the current batch).
I expect the model to return a different value with each subsequent predict call.
What I see, however, are the exact same values returned 10 times.
Can anyone explain to me why the BatchNormalization layer does not update its internal values?
import tensorflow as tf
import numpy as np

if __name__ == '__main__':
    np.random.seed(1)
    x = np.random.randn(3, 5) * 5 + 0.3
    bn = tf.keras.layers.BatchNormalization(trainable=False, epsilon=1e-9)
    z = input = tf.keras.layers.Input([5])
    z = bn(z)
    model = tf.keras.Model(inputs=input, outputs=z)
    for i in range(10):
        print(x)
        print(model.predict(x))
        print()
I use TensorFlow 2.1.0
Okay, I found the mistake in my assumptions. The moving averages are updated during training, not during inference as I thought. This makes perfect sense: updating the moving averages during inference would likely result in an unstable production model. For example, a long sequence of highly pathological input samples (i.e. samples whose generating distribution differs drastically from the one the network was trained on) could bias the network and degrade its performance on valid input samples.
The trainable parameter is useful when you're fine-tuning a pretrained model and want to freeze some of its layers even during training, because when you call model.predict(x) (or even model(x) or model(x, training=False)), the layer automatically uses the moving averages instead of batch statistics.
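For instance, here is a hedged sketch of that fine-tuning use case (the MobileNetV2 backbone and the 10-class head are illustrative choices of mine, not something from the original question):

import tensorflow as tf

# Freeze a pretrained backbone (including its BatchNormalization layers)
# and train only a new classification head.
base = tf.keras.applications.MobileNetV2(include_top=False, pooling='avg')
base.trainable = False  # frozen BatchNorm layers also run in inference mode

inputs = tf.keras.Input([224, 224, 3])
features = base(inputs, training=False)  # keep BN on its moving statistics
outputs = tf.keras.layers.Dense(10, activation='softmax')(features)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')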
The code below demonstrates the moving-average behavior clearly:
import tensorflow as tf
import numpy as np

if __name__ == '__main__':
    np.random.seed(1)
    x = np.random.randn(10, 5) * 5 + 0.3

    z = input = tf.keras.layers.Input([5])
    z = tf.keras.layers.BatchNormalization(trainable=True, epsilon=1e-9, momentum=0.99)(z)
    model = tf.keras.Model(inputs=input, outputs=z)

    # a dummy loss function
    model.compile(loss=lambda x, y: (x - y) ** 2)
    # a dummy fit just to update the batchnorm moving averages
    model.fit(x, x, batch_size=3, epochs=10)

    # first predict uses the moving averages from training
    pred = model(x).numpy()
    print(pred.mean(axis=0))
    print(pred.var(axis=0))
    print()

    # outputs the same thing as the previous predict
    pred = model(x).numpy()
    print(pred.mean(axis=0))
    print(pred.var(axis=0))
    print()

    # here calling the model with training=True results in an update of the moving averages;
    # furthermore, it uses the batch mean and variance as in training,
    # so the result is very different
    pred = model(x, training=True).numpy()
    print(pred.mean(axis=0))
    print(pred.var(axis=0))
    print()

    # here we see again that the moving averages are used, but they differ slightly from
    # the previous call, as expected
    pred = model(x).numpy()
    print(pred.mean(axis=0))
    print(pred.var(axis=0))
    print()
In the end, I found that the documentation (https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization) mentions this:
When performing inference using a model containing batch normalization, it is generally (though not always) desirable to use accumulated statistics rather than mini-batch statistics. This is accomplished by passing training=False when calling the model, or using model.predict.
Hopefully this will help someone with a similar misunderstanding in the future.

Weak optimizers in Pytorch

Consider a simple line fitting a * x + b = x, where a, b are the optimized parameters and x is the observed vector given by
import torch
X = torch.randn(1000,1,1)
One can immediately see that the exact solution is a=1, b=0 for any x and it can be found as easily as:
import numpy as np
np.polyfit(X.numpy().flatten(), X.numpy().flatten(), 1)
I am trying now to find this solution by means of gradient descent in PyTorch, where the mean square error is used as an optimization criterion.
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
from torch.optim import Adam, SGD, Adagrad, ASGD
X = torch.randn(1000,1,1) # Sample data
class SimpleNet(nn.Module):  # Trivial neural network containing two weights
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.f1 = nn.Linear(1, 1)

    def forward(self, x):
        x = self.f1(x)
        return x
# Testing default setting of 3 basic optimizers
K = 500

net = SimpleNet()
optimizer = Adam(params=net.parameters())
Adam_losses = []
optimizer.zero_grad()  # zero the gradient buffers
for k in range(K):
    for b in range(1):  # single batch
        loss = torch.mean((net.forward(X[b, :, :]) - X[b, :, :]) ** 2)
        loss.backward()
        optimizer.step()
        Adam_losses.append(float(loss.detach()))

net = SimpleNet()
optimizer = SGD(params=net.parameters(), lr=0.0001)
SGD_losses = []
optimizer.zero_grad()  # zero the gradient buffers
for k in range(K):
    for b in range(1):  # single batch
        loss = torch.mean((net.forward(X[b, :, :]) - X[b, :, :]) ** 2)
        loss.backward()
        optimizer.step()
        SGD_losses.append(float(loss.detach()))

net = SimpleNet()
optimizer = Adagrad(params=net.parameters())
Adagrad_losses = []
optimizer.zero_grad()  # zero the gradient buffers
for k in range(K):
    for b in range(1):  # single batch
        loss = torch.mean((net.forward(X[b, :, :]) - X[b, :, :]) ** 2)
        loss.backward()
        optimizer.step()
        Adagrad_losses.append(float(loss.detach()))
The training progress in terms of loss evolution can be shown as
What is surprising to me is the very slow convergence of the algorithms in their default settings. I thus have two questions:
1) Is it possible to achieve an arbitrarily small error (loss) purely by means of some PyTorch optimizer? Since the loss function is convex, it should definitely be possible; however, I am not able to figure out how to achieve this using PyTorch. Note that the above three optimizers cannot do that - see the loss progress in log scale for 20000 iterations:
2) I am wondering how the optimizers can work well in complex examples when they do not work well even in this extremely simple example. Or (and that is the second question) is there something wrong in my application of them above that I missed?
The place where you call zero_grad is wrong: because it sits outside the loop, the gradient computed in each iteration is added to the gradients from all previous iterations before the step is taken. This makes the loss oscillate as it gets closer to the solution, because the accumulated old gradients keep throwing it off again.
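Here is a tiny sketch of my own (separate from the fix below) showing the accumulation: without zero_grad(), every backward() call adds to .grad:

import torch

w = torch.ones(1, requires_grad=True)
for step in range(3):
    loss = (2 * w).sum()
    loss.backward()
    print(w.grad)  # tensor([2.]), tensor([4.]), tensor([6.]) -- the gradient keeps piling up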
The code below will easily perform the task:
import torch
from torch.optim import Adam

X = torch.randn(1000, 1, 1)
EPOCHS = 20000  # SimpleNet is the class defined in the question
net = SimpleNet()
optimizer = Adam(params=net.parameters())
for epoch in range(EPOCHS):
    optimizer.zero_grad()  # zero the gradient buffers at every step
    loss = torch.mean((net.forward(X) - X) ** 2)
    if loss < 1e-8:
        print(epoch, loss)
        break
    loss.backward()
    optimizer.step()
1) Is it possible to achieve an arbitrarily small error (loss) purely by means of some PyTorch optimizer?
Yes, the precision above is reached in around ~1500 epochs, and you can go lower, down to machine (here float) precision.
2) I am wondering how the optimizers can work well in complex examples when they do not work well even in this extremely simple example.
Currently, we don't have anything better (at least nothing widespread) for network optimization than first-order methods. Those are used because it is much faster to calculate gradients than the Hessians required by higher-order methods. Moreover, complex, non-convex functions may have a lot of minima which more or less fulfill the task we threw at them; there is no need for the global minimum per se (although it may be reached under some conditions, see this paper).
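As an aside (a sketch of my own, not part of the original answer): PyTorch does ship a quasi-Newton optimizer, torch.optim.LBFGS, and on this particular convex toy problem it reaches a near machine-precision loss in a handful of steps:

import torch
import torch.nn as nn
from torch.optim import LBFGS

X = torch.randn(1000, 1, 1)
net = nn.Linear(1, 1)            # the same two parameters (a, b) as SimpleNet
optimizer = LBFGS(net.parameters())

def closure():
    optimizer.zero_grad()
    loss = torch.mean((net(X) - X) ** 2)
    loss.backward()
    return loss

for epoch in range(20):
    loss = optimizer.step(closure)
print(float(loss))  # close to float machine precision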

Basic TPU Cross Shard Optimizer Not Working

In general, there are some good examples that use TF optimizers for solving general (non-deep-learning) problems. Given:
https://databricks.com/tensorflow/training-and-convergence
https://colab.research.google.com/notebooks/tpu.ipynb#scrollTo=a_rjVo-RAoYd
We want to be able to combine the two above and make use of TPU-based optimization to solve high-dimensional problems.
To that end, I've got some simple Colab code that merges the two examples above:
import os
import pprint
import numpy as np
import tensorflow as tf
from tensorflow.contrib.tpu.python.tpu import tpu_function

if 'COLAB_TPU_ADDR' not in os.environ:
    print('ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!')
else:
    tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']
    print('TPU address is', tpu_address)
    with tf.Session(tpu_address) as session:
        devices = session.list_devices()
        print('TPU devices:')
        pprint.pprint(devices)

# Add this somewhere at the top
tpu_function.get_tpu_context().set_number_of_shards(8)

# x and y are placeholders for our training data
x = tf.placeholder("float")
y = tf.placeholder("float")
# w is the variable storing our values. It is initialised with starting "guesses"
# w[0] is the "a" in our equation, w[1] is the "b"
w = tf.Variable([1.0, 2.0, 3.0, 4.0], name="w")
# Our model of y = a*x + b
y_model = tf.multiply(x, w[0]) + w[1] + w[2] + 3
# Our error is defined as the square of the differences
error = tf.square(y - y_model)
# The Adam optimizer does the heavy lifting
train_op = tf.train.AdamOptimizer(0.01)
optimizer = tf.contrib.tpu.CrossShardOptimizer(train_op).minimize(error)  # TPU change 1

# Normal TensorFlow - initialize values, create a session and run the model
model = tf.global_variables_initializer()
with tf.Session(tpu_address) as session:
    session.run(tf.contrib.tpu.initialize_system())
    print('init')
    session.run(model)
    for i in range(10000):
        print(i)
        x_value = np.random.rand()
        y_value = x_value * 2 + 6 + 5 + 3
        session.run(optimizer, feed_dict={x: x_value, y: y_value})
    w_value = session.run(w)
    print("Predicted model: {a:.3f}x + {b:.3f}+{c:.3f}x + {d:.3f}".format(a=w_value[0], b=w_value[1], c=w_value[2], d=w_value[3]))
    session.run(tf.contrib.tpu.shutdown_system())
When I run it (in Colab) as it is, it just runs the first loop iteration, printing:
init
0
and then does nothing, and Colab just keeps spinning.
If I do not use
optimizer = tf.contrib.tpu.CrossShardOptimizer(train_op).minimize(error)
and the other TPU features, then it works fine and estimates the w variable.
The questions are:
Why doesn't this work and how can we get the cross shard replicator to optimise this simple function?
How should I shape the variable w to make use of parallel batches/shards on the TPU?
How can we make this even more efficient through use of an equivalent Dataset prefetch operation or using infeed queues?
The goal is to make use of lower-level TPU APIs (without TPUEstimator, for example) to help solve custom problems by leveraging the power of TPUs using only tensors, queues, and shards.
It doesn't work because you are overriding the number of shards without actually splitting the calculations into shards. When I run your code, I get the following error:
InternalError: From /job:tpu_worker/replica:0/task:0:
RET_CHECK failure (platforms/xla/service/jellyfish/lowering/all_reduce_emitter.cc:832) replica_id < target.ReplicaCount() Unexpected replica id in all-reduce, replica_id is 1, target has 1 replicas.
Error encountered while compiling %all-reduce.7 = f32[4]{0:T(256)} all-reduce(f32[4]{0:T(256)} %arg0.1), replica_groups={{0,1,2,3,4,5,6,7}}, to_apply=%sum.3, metadata={op_type="CrossReplicaSum" op_name="CrossReplicaSum_21"}, backend_config="{barrier_type:3}".
It is trying to perform the computations on eight shards and combine the results, but it only has one shard to work with. Take a look at tf.contrib.tpu.shard. It creates a shard context using the given number of shards and distributes a computation over those shards. So, instead of setting the number of shards manually, you can define your variables as usual and then wrap any computations with them in a function to be sharded:
# REMOVE THIS
# tpu_function.get_tpu_context().set_number_of_shards(8)

# x and y are placeholders for our training data
x_placeholder = tf.placeholder("float")
y_placeholder = tf.placeholder("float")
# w is the variable storing our values. It is initialised with starting "guesses"
# w[0] is the "a" in our equation, w[1] is the "b"
w = tf.Variable([1.0, 2.0, 3.0, 4.0], name="w")

# Wrap all of our tensorflow operations in a function we can shard
def calculations(x, y):
    # Our model of y = a*x + b
    y_model = tf.multiply(x, w[0]) + w[1] + w[2] + 3
    # Our error is defined as the square of the differences
    # Average across the entire batch
    error = tf.reduce_mean(tf.square(y - y_model))
    # The Adam optimizer does the heavy lifting
    train_op = tf.train.AdamOptimizer(0.01)
    return tf.contrib.tpu.CrossShardOptimizer(train_op).minimize(error)

# Shard the function so that its calculation is distributed
optimizer = tf.contrib.tpu.shard(calculations, inputs=[x_placeholder, y_placeholder], num_shards=8)
You don't need to shape w to make use of shards, because sharding occurs across the batch dimension and you only have one set of weights for all inputs. You'll want to add a batch dimension to your inputs so that each batch can be distributed across the cores. shard assumes the first dimension is the batch dimension, but includes an argument to change it if your data is shaped differently. According to the TPU troubleshooting page, the ideal batch size is 1024 so that there are 128 samples per TPU core. If that is too big for your model, you can go smaller as long as it is a multiple of 128. Check out the above link and the performance guide for more tips on increasing performance.
for i in range(1000):
    print(i)
    x_value = np.random.rand(1024)  # Generate a batch of 1024 values
    y_value = x_value * 2 + 6 + 5 + 3
    session.run(optimizer, feed_dict={x_placeholder: x_value, y_placeholder: y_value})
Everything else should remain the same. I was able to train the model for all 10000 iterations. Keep in mind that for this simple model it will probably be slower than using CPU/GPU, but you should expect performance improvements for more complex problems with larger datasets.
I'm not familiar enough with Datasets or infeed queues to comment on this, but shard includes an argument for infeed queues so it likely has support for them. You might have to play around with it to see how it gets data to the computation function.

Back-propagation exhibiting quadratic memory consumption

I'm running into a weird problem with TensorFlow. I've set up a very simple classification problem, four input variables, one binary output variable, one layer of weights and bias, output goes through a sigmoid to 0 or 1.
The problem is, memory consumption is quadratic in the number of records of training data! With only 5,000 records, it's already 900 megabytes; at 10,000, it runs into a few gigabytes. Since I want to end up using at least a few tens of thousands of records, this is a problem.
It is happening specifically in the back propagation step; when I just try to evaluate the cost function, memory consumption is linear in the number of records, as expected.
Code follows. What am I doing wrong?
import numpy as np
import os
import psutil
import tensorflow as tf
process = psutil.Process(os.getpid())
sess = tf.InteractiveSession()
# Parameters
learning_rate = 0.01
random_seed = 1
tf.set_random_seed(random_seed)
# Data
data = np.loadtxt('train.csv', delimiter=',', dtype=np.float32)
train_X = data[:, :-1]
train_Y = data[:, -1]
rows = np.shape(train_X)[0]
cols = np.shape(train_X)[1]
# Inputs and outputs
X = tf.placeholder(np.float32, shape=(rows, cols))
Y = tf.placeholder(np.float32, shape=(rows,))
# Weights
W = tf.Variable(tf.random_normal((cols, 1)))
b = tf.Variable(tf.random_normal(()))
# Model
p = tf.nn.sigmoid(tf.matmul(X, W) + b)
cost = tf.reduce_sum((p-Y)**2/rows)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
tf.global_variables_initializer().run()
# Just one optimizer step is enough to demonstrate the problem
optimizer.run({X: train_X, Y: train_Y})
# Memory consumption is quadratic in number of rows
print('{0:,} bytes'.format(process.memory_info().peak_wset))
It turns out to be, again, a problem of shape. Using matmul the way I did generates output of shape (n, 1). Subtracting Y, which has shape (n,), broadcasts the result to an (n, n) matrix, which silently generates the quadratic blowup.
The solution is squeeze. Specifically, tf.squeeze(tf.matmul(X, W)).
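To make the blow-up concrete (a small illustration of my own, not part of either answer): subtracting a shape-(n,) tensor from a shape-(n, 1) tensor broadcasts to an (n, n) matrix, so both the cost graph and its gradients grow quadratically with n:

import numpy as np

n = 5
p = np.zeros((n, 1))              # column vector, like tf.matmul(X, W)
y = np.zeros(n)                   # labels of shape (n,)
print((p - y).shape)              # (5, 5) -- broadcasting builds an n x n matrix
print((np.squeeze(p) - y).shape)  # (5,) -- squeezing restores elementwise subtraction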
It makes sense that memory consumption blows up like that, since backprop requires extra memory to keep track of the gradients of each operation (though I can't figure out how it ends up being quadratic).
Solution: mini-batches
This is usually the go-to method when it comes to training models. Split your training data into small mini-batches, each containing a fixed number of samples (rarely more than 200), and feed them to the optimizer one mini-batch at a time. So if your batch_size=64, the train_X and train_Y fed to the optimizer will have shapes (64, 4) and (64,) respectively.
I would try something like this:
batch_size = 64
# Note: for batching to work, the placeholders need a flexible batch dimension,
# e.g. X = tf.placeholder(np.float32, shape=(None, cols)) and
#      Y = tf.placeholder(np.float32, shape=(None,))
n_batches = (rows + batch_size - 1) // batch_size
for i in range(n_batches):
    batch_X = train_X[i*batch_size : (i + 1)*batch_size]
    batch_Y = train_Y[i*batch_size : (i + 1)*batch_size]
    optimizer.run({X: batch_X, Y: batch_Y})

Computing exact moving average over multiple batches in tensorflow

During training, I would like to write the average loss over the last N mini-batches to SummaryWriter as a way of smoothing the very noisy batch loss. It's easy to compute this in python and print it, but I would like to add this to a summary so that I can see it in tensorboard. Here's an overly simplified example of what I'm doing now.
losses = []
for i in range(10000):
    _, loss = session.run([train_op, loss_op])
    losses.append(loss)
    if i % 100 == 0:
        # How to produce a scalar_summary here?
        print sum(losses)/len(losses)
        losses = []
I'm aware that I could use ExponentialMovingAverage with a decay of 1.0, but I would still need some way to reset this every N batches. Really, if all I care about is visualizing loss in tensorboard, the reset probably isn't necessary, but I'm still curious how one would go about aggregating across batches for other reasons (e.g. computing total accuracy over a test dataset that is too big to run in a single batch).
You can manually construct the Summary object, like this:
from tensorflow.core.framework import summary_pb2
def make_summary(name, val):
    return summary_pb2.Summary(value=[summary_pb2.Summary.Value(tag=name,
                                                                simple_value=val)])

summary_writer.add_summary(make_summary('myvalue', myvalue), step)
Passing data from python to a graph function like tf.scalar_summary can be done using a placeholder and feed_dict.
average_pl = tf.placeholder(tf.float32)
average_summary = tf.summary.scalar("average_loss", average_pl)
writer = tf.summary.FileWriter("/tmp/mnist_logs", sess.graph_def)

losses = []
for i in range(10000):
    _, loss = session.run([train_op, loss_op])
    losses.append(loss)
    if i % 100 == 0:
        # Feed the running average into the summary op and write it out
        feed = {average_pl: sum(losses)/len(losses)}
        summary_str = sess.run(average_summary, feed_dict=feed)
        writer.add_summary(summary_str, i)
        losses = []
I haven't tried it, and this was hastily adapted from the visualizing-data how-to, but I expect something like this would work.
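A further option (a hedged sketch of my own, not from either answer, and it assumes a TF 1.x release that ships tf.metrics): the streaming metric tf.metrics.mean keeps a running total and count in local variables, and re-running tf.local_variables_initializer() resets the window; the resulting value can then be written out with the placeholder-and-summary pattern above.

import tensorflow as tf

loss_pl = tf.placeholder(tf.float32)         # stand-in for the per-batch loss value
mean_loss, update_mean = tf.metrics.mean(loss_pl)
reset_op = tf.local_variables_initializer()  # streaming metrics live in local variables

with tf.Session() as sess:
    sess.run(reset_op)
    for i, batch_loss in enumerate([4.0, 2.0, 3.0, 1.0]):
        sess.run(update_mean, feed_dict={loss_pl: batch_loss})
        if (i + 1) % 2 == 0:
            print(sess.run(mean_loss))  # mean over the current window
            sess.run(reset_op)          # start a new window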