How does TensorFlow know which variables to change for optimization?

Code taken from:-
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
# Python optimisation variables
learning_rate = 0.5
epochs = 10
batch_size = 100
# declare the training data placeholders
# input x - for 28 x 28 pixels = 784
x = tf.placeholder(tf.float32, [None, 784])
# now declare the output data placeholder - 10 digits
y = tf.placeholder(tf.float32, [None, 10])
# now declare the weights connecting the input to the hidden layer
W1 = tf.Variable(tf.random_normal([784, 300], stddev=0.03), name='W1')
b1 = tf.Variable(tf.random_normal([300]), name='b1')
# and the weights connecting the hidden layer to the output layer
W2 = tf.Variable(tf.random_normal([300, 10], stddev=0.03), name='W2')
b2 = tf.Variable(tf.random_normal([10]), name='b2')
# calculate the output of the hidden layer
hidden_out = tf.add(tf.matmul(x, W1), b1)
hidden_out = tf.nn.relu(hidden_out)
# now calculate the output layer - in this case, let's use a softmax activated
# output layer
y_ = tf.nn.softmax(tf.add(tf.matmul(hidden_out, W2), b2))
y_clipped = tf.clip_by_value(y_, 1e-10, 0.9999999)
cross_entropy = -tf.reduce_mean(tf.reduce_sum(y * tf.log(y_clipped)
+ (1 - y) * tf.log(1 - y_clipped), axis=1))
# add an optimiser
optimiser = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cross_entropy)
# finally setup the initialisation operator
init_op = tf.global_variables_initializer()
# define an accuracy assessment operation
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
# start the session
with tf.Session() as sess:
    # initialise the variables
    sess.run(init_op)
    total_batch = int(len(mnist.train.labels) / batch_size)
    for epoch in range(epochs):
        avg_cost = 0
        for i in range(total_batch):
            batch_x, batch_y = mnist.train.next_batch(batch_size=batch_size)
            # run one optimisation step and fetch the batch loss
            _, c = sess.run([optimiser, cross_entropy],
                            feed_dict={x: batch_x, y: batch_y})
            avg_cost += c / total_batch
        print("Epoch:", (epoch + 1), "cost =", "{:.3f}".format(avg_cost))
    print(sess.run(accuracy, feed_dict={x: mnist.test.images, y: mnist.test.labels}))
I wanted to ask: how does TensorFlow recognize the parameters it needs to optimize? In the code above we need to optimize W1, W2, b1 and b2, but we never specified that anywhere. We did ask GradientDescentOptimizer to minimize cross_entropy, but we never told it that it would have to change the values of W1, W2, b1 and b2 to do so. So how did it know which parameters cross_entropy depends on?

The answer by Cory Nezin is only partially correct, and could lead to wrong assumptions!
You actually do specify which parameters are optimized (=trainable), namely by doing this:
# now declare the weights connecting the input to the hidden layer
W1 = tf.Variable(tf.random_normal([784, 300], stddev=0.03), name='W1')
b1 = tf.Variable(tf.random_normal([300]), name='b1')
# and the weights connecting the hidden layer to the output layer
W2 = tf.Variable(tf.random_normal([300, 10], stddev=0.03), name='W2')
b2 = tf.Variable(tf.random_normal([10]), name='b2')
In short, TensorFlow only updates tf.Variables that are marked trainable, which is the default. If you created one with tf.Variable(..., trainable=False), it would never be updated, regardless of what the network "depends on". The variable would still be part of the graph and values would still propagate through it, but the optimizer would never change it.
Cory's answer is correct in that TensorFlow automatically works out which values to update each variable with, but you are the one who first declares what can be updated!

TensorFlow works on the premise of something called a computational graph. Essentially, whenever you say something like:
hidden_out = tf.add(tf.matmul(x, W1), b1)
TensorFlow says: OK, that output clearly depends on W1, so I'll connect an edge from hidden_out to W1. The same process happens for y_, y_clipped, and cross_entropy, so in the end you have a graph that connects cross_entropy to W1. Pick your favorite graph traversal algorithm, and TensorFlow can find the connection between cross_entropy and W1.
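To make the two answers above concrete, here is a minimal toy sketch in plain Python. It is not TensorFlow's actual implementation: the Node class and the trainable_dependencies helper are hypothetical names invented for this illustration. It walks backwards from cross_entropy over the graph edges and keeps only the nodes flagged as trainable.

```python
class Node:
    """Toy stand-in for a graph node (hypothetical, not a TensorFlow class)."""
    def __init__(self, name, inputs=(), trainable=False):
        self.name = name
        self.inputs = list(inputs)
        self.trainable = trainable


def trainable_dependencies(output):
    """Depth-first walk from `output` back to its inputs, collecting trainable nodes."""
    seen, stack, found = set(), [output], []
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        if node.trainable:
            found.append(node.name)
        stack.extend(node.inputs)
    return sorted(found)


# Mirror the structure of the question's network.
x, y = Node("x"), Node("y")                      # placeholders: not trainable
W1, b1 = Node("W1", trainable=True), Node("b1", trainable=True)
W2, b2 = Node("W2", trainable=True), Node("b2", trainable=True)
frozen = Node("frozen", trainable=False)         # like tf.Variable(..., trainable=False)
unused = Node("unused", trainable=True)          # trainable, but not connected to the loss

hidden_out = Node("hidden_out", [x, W1, b1, frozen])
y_ = Node("y_", [hidden_out, W2, b2])
y_clipped = Node("y_clipped", [y_])
cross_entropy = Node("cross_entropy", [y, y_clipped])

deps = trainable_dependencies(cross_entropy)
print(deps)
```

The frozen node is reachable from the loss but is skipped because it is not trainable, and the unused node is trainable but never reached, so neither would be updated in this toy model.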


2 Layer Neural Network Does not Converge

I am a newbie to TensorFlow and I am trying to understand the basics of deep learning. I started by writing a two-layer neural network from scratch, which achieved 89% accuracy on the MNIST dataset, and now I am trying to implement the same network in TensorFlow and compare their performance.
I am not sure if I am missing something basic in the code, but the following implementation seems unable to update the weights and therefore does not output anything meaningful.
num_hidden = 100
# x -> (batch_size, 784)
x = tf.placeholder(tf.float32, [None, 784])
W1 = tf.Variable(tf.zeros((784, num_hidden)))
b1 = tf.Variable(tf.zeros((1, num_hidden)))
W2 = tf.Variable(tf.zeros((num_hidden, 10)))
b2 = tf.Variable(tf.zeros((1, 10)))
# z -> (batch_size, num_hidden)
z = tf.nn.relu(tf.matmul(x, W1) + b1)
# y -> (batch_size, 10)
y = tf.nn.softmax(tf.matmul(z, W2) + b2)
# y_ -> (batch_size, 10)
y_ = tf.placeholder(tf.float32, [None, 10])
# y_ * tf.log(y) -> (batch_size, 10)
cross_entropy = -tf.reduce_sum(y_ * tf.log(y+1e-10))
train_step = tf.train.GradientDescentOptimizer(0.05).minimize(cross_entropy)
sess = tf.InteractiveSession()
# tf.argmax(y, axis=1) returns the maximum index in each row
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
# variables must be initialised before the first training step
sess.run(tf.global_variables_initializer())
for epoch in range(1000):
    # batch_xs -> (100, 784)
    # batch_ys -> (100, 10), one-hot encoded
    batch_xs, batch_ys = mnist.train.next_batch(100)
    train_data = {x: batch_xs, y_: batch_ys}
    sess.run(train_step, feed_dict=train_data)
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))
W1_e, b1_e, W2_e, b2_e = W1.eval(), b1.eval(), W2.eval(), b2.eval()
What I Have Done
I checked the official docs and many other implementations, but I am totally confused, since they use different versions and the API varies greatly.
So could someone help me? Thank you in advance.
There are two problems with what you have done so far. First, you have initialised all of the weights to zero, which prevents the network from learning. Second, the learning rate was too high. The code below got me 0.9665 accuracy. For why you should not set all the weights to zero, you can see here.
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
num_hidden = 100
# x -> (batch_size, 784)
x = tf.placeholder(tf.float32, [None, 784])
label_place = tf.placeholder(tf.float32, [None, 10])
# # Get accuracy at chance \approx 0.1
# W1 = tf.Variable(tf.zeros((784, num_hidden)))
# b1 = tf.Variable(tf.zeros((1, num_hidden)))
# W2 = tf.Variable(tf.zeros((num_hidden, 10)))
# b2 = tf.Variable(tf.zeros((1, 10)))
# Will work, you will need to train a bit more than 1000 steps
# though
W1 = tf.Variable(tf.random_normal((784, num_hidden), 0., 0.1))
b1 = tf.Variable(tf.zeros((1, num_hidden)))
W2 = tf.Variable(tf.random_normal((num_hidden, 10), 0, 0.1))
b2 = tf.Variable(tf.zeros((1, 10)))
# network, we only go as far as the linear output after the hidden layer
# so we can feed it into the tf.nn.softmax_cross_entropy_with_logits below
# this is more numerically stable
z = tf.nn.relu(tf.matmul(x, W1) + b1)
logits = tf.matmul(z, W2) + b2
# define our loss etc as before. however note that the learning rate is lower as
# with a higher learning rate it wasnt really working
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=label_place, logits=logits)
train_step = tf.train.GradientDescentOptimizer(.001).minimize(cross_entropy)
# continue as before
sess = tf.InteractiveSession()
correct_prediction = tf.equal(tf.argmax(tf.nn.softmax(logits), 1), tf.argmax(label_place, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
# variables must be initialised before the first training step
sess.run(tf.global_variables_initializer())
for epoch in range(5000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    train_data = {x: batch_xs, label_place: batch_ys}
    sess.run(train_step, feed_dict=train_data)
print(sess.run(accuracy, feed_dict={x: mnist.test.images, label_place: mnist.test.labels}))
W1_e, b1_e, W2_e, b2_e = W1.eval(), b1.eval(), W2.eval(), b2.eval()
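The comment in the code above says that computing the loss from the logits is more numerically stable. The following plain-Python sketch illustrates why, using the log-sum-exp trick. It is an illustration under that assumption only, not TensorFlow's actual implementation, and both function names are made up for this example.

```python
import math


def naive_cross_entropy(logits, labels):
    """Softmax first, then log: exp() overflows for large logits."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(t * math.log(p) for t, p in zip(labels, probs))


def stable_cross_entropy(logits, labels):
    """Work with log-softmax directly: log p_i = z_i - logsumexp(z)."""
    m = max(logits)  # shifting by the maximum keeps every exp() argument <= 0
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(t * (z - log_sum_exp) for t, z in zip(labels, logits))


# For moderate logits both versions agree.
small_logits, small_labels = [2.0, 1.0, 0.1], [1.0, 0.0, 0.0]
naive_small = naive_cross_entropy(small_logits, small_labels)
stable_small = stable_cross_entropy(small_logits, small_labels)

# For extreme logits the naive version fails, the stable one does not.
big_logits, big_labels = [1000.0, 0.0], [0.0, 1.0]
try:
    naive_cross_entropy(big_logits, big_labels)
    naive_failed = False
except OverflowError:
    naive_failed = True
stable_big = stable_cross_entropy(big_logits, big_labels)
print(naive_failed, stable_big)
```

The same reasoning explains why adding a small epsilon inside tf.log, as in the question, is only a partial workaround: it hides log(0) but does nothing about the exponentials.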

How to switch from GradientDescent Optimizer to Adam in Tensorflow

My code is running perfectly with Gradient Descent, but I want to compare the effectiveness of my algorithm using Adam Optimizer, so I tried to modify the following code:
# Import MNIST data
#import input_data
#mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
#fashion_mnist = input_data.read_data_sets('data/fashion')
import tensorflow as tf
# Set parameters
learning_rate = 0.01 #1e-4
training_iteration = 30
batch_size = 100
display_step = 2
# TF graph input
x = tf.placeholder("float", [None, 784]) # mnist data image of shape 28*28=784
y = tf.placeholder("float", [None, 10]) # 0-9 digits recognition => 10 classes
#regularizer = tf.reduce_sum(tf.square(y))
# Create a model
# Set model weights
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
with tf.name_scope("Wx_b") as scope:
    # Construct a linear model
    model = tf.nn.softmax(tf.matmul(x, W) + b) # Softmax
# Add summary ops to collect data
w_h = tf.summary.histogram("weights", W)
b_h = tf.summary.histogram("biases", b)
# More name scopes will clean up graph representation
with tf.name_scope("cost_function") as scope:
    # Minimize error using cross entropy
    # Cross entropy
    cost_function = -tf.reduce_sum(y*tf.log(model))
    # Create a summary to monitor the cost function
    tf.summary.scalar("cost_function", cost_function)
with tf.name_scope("train") as scope:
    # Gradient descent
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost_function)
# Initializing the variables
#init = tf.initialize_all_variables()
init = tf.global_variables_initializer()
# Merge all summaries into a single operator
merged_summary_op = tf.summary.merge_all()
# Launch the graph
with tf.Session() as sess:
    sess.run(init)
    summary_writer = tf.summary.FileWriter('/home/raed/Tensorflow/tensorflow_demo', graph_def=sess.graph_def)
    # Training cycle
    for iteration in range(training_iteration):
        avg_cost = 0.
        total_batch = int(mnist.train.num_examples/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)
            # Fit training using batch data
            sess.run(optimizer, feed_dict={x: batch_xs, y: batch_ys})
            # Compute the average loss
            avg_cost += sess.run(cost_function, feed_dict={x: batch_xs, y: batch_ys})/total_batch
            # Write logs for each iteration
            summary_str = sess.run(merged_summary_op, feed_dict={x: batch_xs, y: batch_ys})
            summary_writer.add_summary(summary_str, iteration*total_batch + i)
        # Display logs per iteration step
        if iteration % display_step == 0:
            print("Iteration:", "%04d" % (iteration + 1), "cost=", "{:.9f}".format(avg_cost))
    print("Tuning completed!")
    # Test the model
    predictions = tf.equal(tf.argmax(model, 1), tf.argmax(y, 1))
    # Calculate accuracy
    accuracy = tf.reduce_mean(tf.cast(predictions, "float"))
    print("Accuracy:", accuracy.eval({x: mnist.test.images, y: mnist.test.labels}))
To use the Adam optimizer, I tried to change the following line:
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost_function)
and replace it with AdamOptimizer:
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost_function)
When I ran the code, it completed a few iterations and then stopped with the following error.
InvalidArgumentError (see above for traceback): Nan in summary histogram for: weights
[[Node: weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](weights/tag, Variable/read)]]
Could you please help me understand the problem? Thanks in advance.
The problem is that the weights are initialized to zero, W = tf.Variable(tf.zeros([784, 10])), which is why you get NaN as weights.
You need to initialize them with some initializer, e.g. a normal distribution, as follows:
W = tf.Variable(tf.random_normal([784, 10], stddev=0.35))

Error in simple Network

What is wrong with this TensorFlow code? I can't seem to spot the mistake. It doesn't converge and stalls at 2.30.
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
x = tf.placeholder(tf.float32, [None, 784])
W1 = tf.Variable(tf.zeros([784, 100]))
b1 = tf.Variable(tf.zeros([100]))
W2 = tf.Variable(tf.zeros([100, 20]))
b2 = tf.Variable(tf.zeros([20]))
W3 = tf.Variable(tf.zeros([20, 10]))
b3 = tf.Variable(tf.zeros([10]))
y1 = tf.nn.relu(tf.add(tf.matmul(x, W1), b1))
y2 = tf.nn.relu(tf.add(tf.matmul(y1, W2), b2))
y3 = tf.nn.softmax(tf.add(tf.matmul(y2, W3), b3))
sess = tf.InteractiveSession()
y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y3), reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
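init = tf.global_variables_initializer()
sess.run(init)
for _ in range(10000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    # print the current loss, then take one training step
    print(sess.run(cross_entropy, feed_dict={x: batch_xs, y_: batch_ys}))
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})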
Thank you!
I can see a couple of things that should be addressed:
The learning rate of 0.5 is quite large for stochastic gradient descent. If a network isn't training, you can always try different values, typically in the range [1e-5, 1e-2].
Networks initialized with zeros (tf.zeros) fail to learn for two reasons:
Without any difference between parameter values, the gradient is shared evenly across them all, meaning that they all learn the same value.
Because the gradients are multiplied by the weights during back-propagation, the result is always zero, meaning no change in the weight values.
I would also recommend using the built-in tf.losses.softmax_cross_entropy instead of writing the loss yourself. It's generally a good idea, as it reduces the chance of making a mistake along the way. :)
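The zero-gradient argument above can be checked by hand on a tiny network. The sketch below is plain Python with manual back-propagation, not TensorFlow; the layer sizes, input values, and helper names are made up for illustration, and it assumes the usual convention that the ReLU gradient at exactly 0 is 0.

```python
import math


def softmax(v):
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def affine(x, W, b):
    """Compute x @ W + b, with W stored as a list of rows [n_in][n_out]."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]


def gradients(x, target, W1, b1, W2, b2):
    """Gradients of softmax cross-entropy for a one-hidden-layer ReLU network."""
    pre = affine(x, W1, b1)
    h = [max(0.0, a) for a in pre]
    p = softmax(affine(h, W2, b2))
    d_logits = [pj - tj for pj, tj in zip(p, target)]       # also the gradient for b2
    d_W2 = [[hi * dj for dj in d_logits] for hi in h]
    d_h = [sum(W2[i][j] * d_logits[j] for j in range(len(d_logits))) for i in range(len(h))]
    d_pre = [d_h[i] if pre[i] > 0 else 0.0 for i in range(len(h))]  # also the gradient for b1
    d_W1 = [[xi * dj for dj in d_pre] for xi in x]
    return d_W1, d_pre, d_W2, d_logits


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


x = [0.5, -1.0, 2.0]          # one made-up input example
target = [1.0, 0.0]           # one-hot label
d_W1, d_b1, d_W2, d_b2 = gradients(x, target, zeros(3, 2), [0.0, 0.0], zeros(2, 2), [0.0, 0.0])
print(d_W1, d_b1, d_W2, d_b2)
```

With all-zero parameters only the output bias receives a non-zero gradient here, so the weight matrices never move away from zero.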

How to restore the learning rate in TF from a previously saved checkpoint?

I stopped training at some point and saved the checkpoint, meta files, etc.
Now, when I resume training, I want to start with the optimizer's last learning rate. Can you provide an example of doing so?
For those coming here (like me) wondering whether the last learning rate is automatically restored: tf.train.exponential_decay doesn't add any Variables to the graph, it only adds the operations necessary to derive the correct current learning rate value given a certain global_step value. This way, you only need to checkpoint the global_step value (which is done by default normally) and, assuming you keep the same initial learning rate, decay steps and decay factor, you'll automatically pick up training where you left it, with the correct learning rate value.
Inspecting the checkpoint won't show any learning_rate variable (or similar), simply because there is no need for any.
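As a rough illustration of why no learning-rate variable is needed, the non-staircase schedule documented for tf.train.exponential_decay can be written as a pure function of the step. The sketch below is plain Python rather than TensorFlow, and the hyperparameter values and the step count of 500 are hypothetical.

```python
def exponential_decay(initial_lr, global_step, decay_steps, decay_rate, staircase=False):
    """Learning rate after `global_step` steps: initial_lr * decay_rate ** (step / decay_steps)."""
    exponent = global_step / decay_steps
    if staircase:
        # integer division makes the rate drop in discrete jumps
        exponent = global_step // decay_steps
    return initial_lr * decay_rate ** exponent


# Hypothetical run: training stopped at step 500, and only global_step was checkpointed.
lr_before_stop = exponential_decay(0.15, 500, 10, 0.96)

restored_global_step = 500  # value read back from the checkpoint
lr_after_restore = exponential_decay(0.15, restored_global_step, 10, 0.96)

print(lr_before_stop == lr_after_restore)
```

Because the rate depends only on the step and the fixed hyperparameters, restoring global_step is enough, provided the hyperparameters in the resumed script are unchanged.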
This example code learns to add two numbers:
import tensorflow as tf
import numpy as np
import os
save_ckpt_dir = './add_ckpt'
ckpt_filename = 'add.ckpt'
save_ckpt_path = os.path.join(save_ckpt_dir, ckpt_filename)
if not os.path.isdir(save_ckpt_dir):
    os.mkdir(save_ckpt_dir)
# any() is needed here: a non-empty list of booleans is always truthy
if any(fname.startswith("add.ckpt") for fname in os.listdir(save_ckpt_dir)):  # prefer to load pre-trained net
    load_ckpt_path = save_ckpt_path
else:
    load_ckpt_path = None  # train from scratch
def add_layer(inputs, in_size, out_size, activation_fn=None):
    Weights = tf.Variable(tf.ones([in_size, out_size]), name='Weights')
    biases = tf.Variable(tf.zeros([1, out_size]), name='biases')
    Wx_plus_b = tf.add(tf.matmul(inputs, Weights), biases)
    if activation_fn is None:
        layer_output = Wx_plus_b
    else:
        layer_output = activation_fn(Wx_plus_b)
    return layer_output
def produce_batch(batch_size=256):
    """Loads a single batch of data.

    Args:
      batch_size: The number of exercises in the batch.

    Returns:
      x : column vector of numbers
      y : another column of numbers
      xy_sum : the sum of the columns
    """
    x = np.random.random(size=[batch_size, 1]) * 10
    y = np.random.random(size=[batch_size, 1]) * 10
    xy_sum = x + y
    return x, y, xy_sum
with tf.name_scope("inputs"):
    xs = tf.placeholder(tf.float32, [None, 1])
    ys = tf.placeholder(tf.float32, [None, 1])
with tf.name_scope("correct_labels"):
    xysums = tf.placeholder(tf.float32, [None, 1])
with tf.name_scope("step_and_learning_rate"):
    global_step = tf.Variable(0, trainable=False)
    lr = tf.train.exponential_decay(0.15, global_step, 10, 0.96)  # start lr=0.15, decay every 10 steps with a base of 0.96
with tf.name_scope("graph_body"):
    prediction = add_layer(tf.concat([xs, ys], 1), 2, 1, activation_fn=None)
with tf.name_scope("loss_and_train"):
    # the error between prediction and real data
    loss = tf.reduce_mean(tf.reduce_sum(tf.square(xysums-prediction), reduction_indices=[1]))
    # Passing global_step to minimize() will increment it at each step.
    train_step = tf.train.AdamOptimizer(lr).minimize(loss, global_step=global_step)
with tf.name_scope("init_load_save"):
    init = tf.global_variables_initializer()
    saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(init)
    if load_ckpt_path:
        # overwrite the freshly initialised values with the checkpointed ones
        saver.restore(sess, load_ckpt_path)
    for i in range(1000):
        x, y, xy_sum = produce_batch(256)
        _, global_step_np, loss_np, lr_np = sess.run([train_step, global_step, loss, lr], feed_dict={xs: x, ys: y, xysums: xy_sum})
        if global_step_np % 100 == 0:
            print("global step: {}, loss: {}, learning rate: {}".format(global_step_np, loss_np, lr_np))
    saver.save(sess, save_ckpt_path)
If you run it a few times, you will see the learning rate decrease. It also saves the global step. The trick is here:
with tf.name_scope("step_and_learning_rate"):
    global_step = tf.Variable(0, trainable=False)
    lr = tf.train.exponential_decay(0.15, global_step, 10, 0.96)  # start lr=0.15, decay every 10 steps with a base of 0.96
train_step = tf.train.AdamOptimizer(lr).minimize(loss, global_step=global_step)
By default, tf.train.Saver saves all saveable objects, including the global step from which the learning rate is derived. However, if tf.train.Saver is given a var_list, it saves only the variables included in var_list:
saver = tf.train.Saver(var_list = ..list of vars to save..)
sources: (see "saveable objects")

restore a model trained with variable input length in tensorflow results in InvalidArgumentError

I am rather new to TensorFlow and am currently experimenting with models of varying complexity. I have a problem with the save and restore functionality of the package. As far as I understood the tutorials, I should be able to restore a trained graph and run it with new input at some later point. However, I get the following error when I try to do just that:
InvalidArgumentError (see above for traceback): Shape [-1,10] has negative dimensions
[[Node: Placeholder = Placeholder[dtype=DT_FLOAT, shape=[?,10], _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
My understanding of the message is that the restored graph does not like one dimension to be left arbitrary, which in turn is necessary for practical cases where I don't know beforehand how large my input will be. A code snippet as a minimal example, producing the error above, can be found below. I know how to restore each tensor individually but this gets impractical pretty quickly when the models grow in complexity. I am thankful for any help I get and apologize in case my question is stupid.
import numpy as np
import tensorflow as tf
def generate_random_input():
    alist = []
    for _ in range(10):
        alist.append(np.random.uniform(-1, 1, 100))
    return np.array(alist).T
def generate_random_target():
    return np.random.uniform(-1, 1, 100)
x = tf.placeholder('float', [None, 10])
y = tf.placeholder('float')
# the model
w1 = tf.get_variable('w1', [10, 1], dtype=tf.float32, initializer=tf.contrib.layers.xavier_initializer(seed=1))
b1 = tf.get_variable('b1', [1], dtype=tf.float32, initializer=tf.contrib.layers.xavier_initializer(seed=1))
result = tf.add(tf.matmul(x, w1), b1, name='result')
loss = tf.reduce_mean(tf.losses.mean_squared_error(predictions=result, labels=y))
optimizer = tf.train.AdamOptimizer(0.03).minimize(loss)
saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run([optimizer, loss], feed_dict={x: generate_random_input(), y: generate_random_target()})
    saver.save(sess, 'file_name')
# now load the model in another session:
sess2 = tf.Session()
saver = tf.train.import_meta_graph('file_name.meta')
saver.restore(sess2, tf.train.latest_checkpoint('./'))
graph = tf.get_default_graph()
pred = graph.get_operation_by_name('result')
test_result = sess2.run(pred, feed_dict={x: generate_random_input()})
In the last line, you don't feed the label placeholder with data. So the [-1] dimension of that placeholder stays -1 instead of becoming the batch size. That's the cause.
I'm having the exact same problem as you. I'm importing and testing a bunch of different CNNs with different layer sizes and testing on various datasets. You can stick your model creation in a function like so and recreate it in your other code:
def create_model():
    x = tf.placeholder('float', [None, 10])
    y = tf.placeholder('float')
    w1 = tf.get_variable('w1', [10, 1], dtype=tf.float32, initializer=tf.contrib.layers.xavier_initializer(seed=1))
    b1 = tf.get_variable('b1', [1], dtype=tf.float32, initializer=tf.contrib.layers.xavier_initializer(seed=1))
    result = tf.add(tf.matmul(x, w1), b1, name='result')
    return x, y, result
x, y, result = create_model()
loss = tf.reduce_mean(tf.losses.mean_squared_error(predictions=result, labels=y))
optimizer = tf.train.AdamOptimizer(0.03).minimize(loss)
saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run([optimizer, loss], feed_dict={x: generate_random_input(), y: generate_random_target()})
    saver.save(sess, 'file_name')
# now load the model in another session:
sess2 = tf.Session()
# This stuff is optional if everything is the same scope
x, y, result = create_model()
saver = tf.train.Saver()
# loss = ... if you want loss
# Now just restore the weights and run
saver.restore(sess2, 'file_name')
test_result = sess2.run(result, feed_dict={x: generate_random_input()})
This is a bit tedious if I want to import many complex architectures with different dimensions. For our situation, I don't know if there's any other way to restore an entire model than to recreate that architecture first in your second session.