How does Tensorflow's reduce_sum work in a loop?

I cannot understand how reduce_sum works when the optimizer is run in a loop.
I have 30 samples in my train_x and train_y lists. I run my optimizer in a loop, feeding one sample from each list per iteration. My cost function computes the sum of the squared differences between predicted and actual values over all samples using TensorFlow's reduce_sum method. According to the graph, the optimizer depends on the cost function, so the cost will be computed for every x and y. I need to know whether reduce_sum will wait for all 30 samples or take one sample (x, y) at a time. Here n_samples is 30. I also need to know whether the weights and bias will be updated once per epoch or once for each x and y.
import numpy as np
import tensorflow as tf
# train_x, train_y, n_samples, learning_rate and epochs are defined elsewhere.
X = tf.placeholder(tf.float32)
Y = tf.placeholder(tf.float32)
W = tf.Variable(np.random.randn(), name='weights')
B = tf.Variable(np.random.randn(), name='bias')
pred = X * W + B
cost = tf.reduce_sum((pred - Y) ** 2) / (2 * n_samples)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
init = tf.global_variables_initializer()
with tf.Session() as sesh:
    sesh.run(init)
    for epoch in range(epochs):
        # One optimizer step per (x, y) pair.
        for x, y in zip(train_x, train_y):
            sesh.run(optimizer, feed_dict={X: x, Y: y})
        if not epoch % 20:
            # Cost over the full training set, for logging only.
            c = sesh.run(cost, feed_dict={X: train_x, Y: train_y})
            w = sesh.run(W)
            b = sesh.run(B)
            print(f'epoch: {epoch:04d} c={c:.4f} w={w:.4f} b={b:.4f}')

I need to know whether the reduce_sum will wait for all the 30 samples or take one sample (x, y) at a time.
tf.reduce_sum is an operation, and as such it has no implicit mutable state. Its result is fully determined by the model parameters (W and B) and the placeholder values explicitly provided in the feed_dict argument of the sesh.run(..., feed_dict={...}) call. In your loop each call feeds a single (x, y) pair, so the sum runs over that one sample only.
If you would like to aggregate the value of a metric across all batches check out tf.metrics:
y_pred = tf.placeholder(tf.float32)
y_true = tf.placeholder(tf.float32)
mse, update_op = tf.metrics.mean_squared_error(y_true, y_pred)
init = tf.local_variables_initializer()  # MSE state is local!
sess = tf.Session()
sess.run(init)
# Update the metric and compute the value after the update.
sess.run(update_op, feed_dict={y_pred: [0.0], y_true: [42.0]})  # => 1764.0
# Get current value.
sess.run(mse)  # => 1764.0
I also need to know whether the weights and bias will be updated for each epoch or for each x and y.
Each sesh.run(optimizer, ...) call computes the gradients of the cost with respect to the trainable variables and applies them to the variable values, so W and B are updated once per (x, y) pair, not once per epoch. See GradientDescentOptimizer.minimize.
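To make the per-call behaviour concrete, here is a minimal pure-Python sketch of the same loop, with TensorFlow replaced by hand-written gradients. The data, the learning rate, and the helper name sgd_step are assumptions made up for illustration. The only point is that each call sees one (x, y) pair and updates the parameters immediately.

```python
# Hypothetical stand-in for sesh.run(optimizer, feed_dict={X: x, Y: y}):
# one gradient-descent step on cost = (x*w + b - y)**2 / (2 * n_samples),
# computed from the single sample that was fed.
def sgd_step(w, b, x, y, n_samples, learning_rate):
    err = x * w + b - y
    grad_w = err * x / n_samples  # d(cost)/dw for this one sample
    grad_b = err / n_samples      # d(cost)/db for this one sample
    return w - learning_rate * grad_w, b - learning_rate * grad_b

# Made-up data: 30 samples on the line y = 2x + 1.
train_x = [i / 10 for i in range(30)]
train_y = [2 * x + 1 for x in train_x]
n_samples = len(train_x)

w, b = 0.0, 0.0
updates = 0
for epoch in range(5):
    for x, y in zip(train_x, train_y):
        w, b = sgd_step(w, b, x, y, n_samples, learning_rate=0.5)
        updates += 1  # the parameters change after every single sample

print(updates)  # number of updates = epochs * samples
```

If one update per epoch is wanted instead, the whole train_x and train_y lists would be fed in a single call, and the sum would then run over all 30 samples.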


Tensorflow: When should I use or not use `feed_dict`?

I am confused about why we use feed_dict. According to my friend, you commonly use feed_dict together with placeholders, and this is probably bad for production.
I have seen code like this, in which feed_dict is not involved:
for j in range(n_batches):
    X_batch, Y_batch = mnist.train.next_batch(batch_size)
    _, loss_batch = sess.run([optimizer, loss], {X: X_batch, Y: Y_batch})
I have also seen code like this, in which feed_dict is involved:
for i in range(100):
    for x, y in data:
        # Session execute optimizer and fetch values of loss
        _, l = sess.run([optimizer, loss], feed_dict={X: x, Y: y})
        total_loss += l
I understand that with feed_dict you feed in data, using X as the key as in a dictionary. But here I don't see any difference between the two snippets. So what exactly is the difference, and why do we need feed_dict?
In a tensorflow model you can define a placeholder such as x = tf.placeholder(tf.float32), then you will use x in your model.
For example, I define a simple set of operations as:
x = tf.placeholder(tf.float32)
y = x * 42
Now when I ask tensorflow to compute y, it's clear that y depends on x.
with tf.Session() as sess:
    sess.run(y)
This will produce an error because I did not give it a value for x. In this case, because x is a placeholder, if it gets used in a computation you must pass it in via feed_dict. If you don't it's an error.
Let's fix that:
with tf.Session() as sess:
    sess.run(y, feed_dict={x: 2})
The result this time will be 84. Great. Now let's look at a trivial case where feed_dict is not needed:
x = tf.constant(2)
y = x * 42
Now there are no placeholders (x is a constant) and so nothing needs to be fed to the model. This works now:
with tf.Session() as sess:
    sess.run(y)
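As for the two snippets in the question: Session.run is declared as run(fetches, feed_dict=None, options=None, run_metadata=None), so a dictionary passed as the second positional argument is the feed_dict. The sketch below uses a hypothetical stand-in function, fake_run, that mirrors only those first two parameters, to show that both call styles bind the same argument. It is not TensorFlow code.

```python
# fake_run is a made-up stand-in that mirrors only the first two parameters
# of tf.Session.run; it returns what it received so the calls can be compared.
def fake_run(fetches, feed_dict=None):
    return fetches, feed_dict

positional = fake_run(["optimizer", "loss"], {"X": 1.0, "Y": 2.0})
keyword = fake_run(["optimizer", "loss"], feed_dict={"X": 1.0, "Y": 2.0})

print(positional == keyword)  # True: both styles pass the same feed_dict
```

In other words, both snippets in the question use feed_dict; one just does not spell out the keyword.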

Which is the correct shape for linear regression in my model?

I am designing a regression network to predict the weight of a person, from 10 to 100 kg. My dataset has 50 training samples:
Vector 1: 1024x1 corresponding to 40kg
Vector 2: 1024x1 corresponding to 20kg
Vector 3: 1024x1 corresponding to 40kg
Vector 50: 1024x1 corresponding to 30kg
Hence, my dataset size is 1024x50, and the label size is 1x50. If I design a simple linear regression, like y=xW+b, the sizes of W and b will be
W is 1024x1
b is 1x50
Am I right?
This is my TensorFlow code, but it gives wrong predictions:
# Training Data
train_X = ...# shape of 1024 x 50
train_Y = ...# shape of 1x50
n_samples = 50
learning_rate = 0.0001
training_epochs = 1000
display_step = 50
# tf Graph Input
X = tf.placeholder("float")
Y = tf.placeholder("float")
# Set model weights
W = tf.Variable(tf.truncated_normal([1024, 1], mean=0.0, stddev=1.0, dtype=tf.float32))
b = tf.Variable(tf.zeros(1, dtype = tf.float32))
# Construct a linear model
pred = tf.add(tf.multiply(X, W), b)
# Mean squared error
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
init = tf.global_variables_initializer()
# Start training
with tf.Session() as sess:
    # Run the initializer
    sess.run(init)
    # Fit all training data
    for epoch in range(training_epochs):
        for (x, y) in zip(train_X, train_Y):
            sess.run(optimizer, feed_dict={X: x, Y: y})
        # Display logs per epoch step
        if (epoch + 1) % display_step == 0:
            c = sess.run(cost, feed_dict={X: train_X, Y: train_Y})
            print("Epoch:", '%04d' % (epoch + 1), "cost=", "{:.9f}".format(c),
                  "W=", sess.run(W), "b=", sess.run(b))
    print("Optimization Finished!")
W is 1024x1
b is 1x50
Am I right?
No. The shape of W is correct, but b should be a scalar (a 1x1 matrix). With a 1x50 b you would have one trainable bias per data point, which makes no sense. In your code, however, it is correctly set to size 1.
What is wrong is the handling of the matrix multiplication; your model should be:
pred = tf.matmul(X, W) + b # you will have to transpose your train_X
tf.multiply is pointwise multiplication, not matrix multiplication.
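The resulting shapes can be checked on a small pure-Python example. The sizes (3 samples, 4 features), the values, and the helper name matmul_column are assumptions chosen for readability; the real model would use 50 samples and 1024 features.

```python
# Matrix product of an (n x k) list of rows with a (k x 1) column.
def matmul_column(X, W):
    return [[sum(x_ij * w_j[0] for x_ij, w_j in zip(row, W))] for row in X]

X = [[1.0, 2.0, 3.0, 4.0],        # one row per sample: shape 3 x 4
     [0.0, 1.0, 0.0, 1.0],
     [2.0, 2.0, 2.0, 2.0]]
W = [[0.5], [0.5], [0.5], [0.5]]  # shape 4 x 1
b = 1.0                           # a single bias shared by all samples

pred = [[p[0] + b] for p in matmul_column(X, W)]
print(len(pred), len(pred[0]))  # 3 1 -> one prediction per sample
print(pred)                     # [[6.0], [2.0], [5.0]]
```

With samples stored as rows, X is (samples x features), W is (features x 1), and the scalar bias is added to every row of the (samples x 1) result.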

How to restore the learning rate in TF from a previously saved checkpoint?

I stopped training at some point and saved the checkpoint, meta files, etc.
Now I want to resume training, starting from the optimizer's last learning rate. Can you provide an example of doing so?
For those coming here (like me) wondering whether the last learning rate is automatically restored: tf.train.exponential_decay doesn't add any Variables to the graph, it only adds the operations necessary to derive the correct current learning rate value given a certain global_step value. This way, you only need to checkpoint the global_step value (which is done by default normally) and, assuming you keep the same initial learning rate, decay steps and decay factor, you'll automatically pick up training where you left it, with the correct learning rate value.
Inspecting the checkpoint won't show any learning_rate variable (or similar), simply because there is no need for any.
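With the default staircase=False, tf.train.exponential_decay computes learning_rate * decay_rate ** (global_step / decay_steps). A plain-Python sketch of that formula shows why restoring global_step is enough to recover the learning rate. The function name decayed_lr is hypothetical, and 0.15, 10 and 0.96 are only example constants.

```python
def decayed_lr(initial_lr, global_step, decay_steps, decay_rate):
    # Same formula as tf.train.exponential_decay with staircase=False.
    return initial_lr * decay_rate ** (global_step / decay_steps)

lr_fresh = decayed_lr(0.15, 0, 10, 0.96)       # a brand-new run
lr_resumed = decayed_lr(0.15, 1000, 10, 0.96)  # global_step restored as 1000

print(lr_fresh)    # 0.15
print(lr_resumed)  # much smaller: 0.15 * 0.96 ** 100
```

The learning rate is a pure function of global_step and the three constants, so nothing else has to be stored in the checkpoint.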
This example code learns to add two numbers:
import tensorflow as tf
import numpy as np
import os
save_ckpt_dir = './add_ckpt'
ckpt_filename = 'add.ckpt'
save_ckpt_path = os.path.join(save_ckpt_dir, ckpt_filename)
if not os.path.isdir(save_ckpt_dir):
    os.mkdir(save_ckpt_dir)
if any(fname.startswith("add.ckpt") for fname in os.listdir(save_ckpt_dir)):  # prefer to load pre-trained net
    load_ckpt_path = save_ckpt_path
else:
    load_ckpt_path = None  # train from scratch
def add_layer(inputs, in_size, out_size, activation_fn=None):
    Weights = tf.Variable(tf.ones([in_size, out_size]), name='Weights')
    biases = tf.Variable(tf.zeros([1, out_size]), name='biases')
    Wx_plus_b = tf.add(tf.matmul(inputs, Weights), biases)
    if activation_fn is None:
        layer_output = Wx_plus_b
    else:
        layer_output = activation_fn(Wx_plus_b)
    return layer_output
def produce_batch(batch_size=256):
    """Generates a single batch of data.
    Args:
        batch_size: The number of exercises in the batch.
    Returns:
        x : column vector of numbers
        y : another column of numbers
        xy_sum : the sum of the columns
    """
    x = np.random.random(size=[batch_size, 1]) * 10
    y = np.random.random(size=[batch_size, 1]) * 10
    xy_sum = x + y
    return x, y, xy_sum
with tf.name_scope("inputs"):
    xs = tf.placeholder(tf.float32, [None, 1])
    ys = tf.placeholder(tf.float32, [None, 1])
with tf.name_scope("correct_labels"):
    xysums = tf.placeholder(tf.float32, [None, 1])
with tf.name_scope("step_and_learning_rate"):
    global_step = tf.Variable(0, trainable=False)
    lr = tf.train.exponential_decay(0.15, global_step, 10, 0.96)  # start lr=0.15, decay every 10 steps with a base of 0.96
with tf.name_scope("graph_body"):
    prediction = add_layer(tf.concat([xs, ys], 1), 2, 1, activation_fn=None)
with tf.name_scope("loss_and_train"):
    # the error between prediction and real data
    loss = tf.reduce_mean(tf.reduce_sum(tf.square(xysums-prediction), reduction_indices=[1]))
    # Passing global_step to minimize() will increment it at each step.
    train_step = tf.train.AdamOptimizer(lr).minimize(loss, global_step=global_step)
with tf.name_scope("init_load_save"):
    init = tf.global_variables_initializer()
    saver = tf.train.Saver()
with tf.Session() as sess:
    if load_ckpt_path:
        saver.restore(sess, load_ckpt_path)  # also restores global_step
    else:
        sess.run(init)
    for i in range(1000):
        x, y, xy_sum = produce_batch(256)
        _, global_step_np, loss_np, lr_np = sess.run([train_step, global_step, loss, lr], feed_dict={xs: x, ys: y, xysums: xy_sum})
        if global_step_np % 100 == 0:
            print("global step: {}, loss: {}, learning rate: {}".format(global_step_np, loss_np, lr_np))
    saver.save(sess, save_ckpt_path)
If you run it a few times, you will see the learning rate keep decreasing from one run to the next, because the global step is saved and restored. The trick is here:
with tf.name_scope("step_and_learning_rate"):
    global_step = tf.Variable(0, trainable=False)
    lr = tf.train.exponential_decay(0.15, global_step, 10, 0.96)  # start lr=0.15, decay every 10 steps with a base of 0.96
train_step = tf.train.AdamOptimizer(lr).minimize(loss, global_step=global_step)
By default, tf.train.Saver saves all saveable objects, which here includes global_step. The learning rate itself is not stored as a variable; it is recomputed from global_step. However, if tf.train.Saver is given a var_list, it saves only the variables included in var_list:
saver = tf.train.Saver(var_list = ..list of vars to save..)
sources: (see "saveable objects")

Tensorflow loss minimization is increasing loss

I implemented the linear regression model shown on Tensorflow's main page:
import numpy as np
import tensorflow as tf
# Model parameters
W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
# Model input and output
x = tf.placeholder(tf.float32)
linear_model = W * x + b
y = tf.placeholder(tf.float32)
# loss
loss = tf.reduce_sum(tf.square(linear_model - y)) # sum of the squares
# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)
# training data
x_train = [1,2,3,4]
y_train = [0,-1,-2,-3]
# training loop
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)  # reset values to wrong
for i in range(1000):
    sess.run(train, {x: x_train, y: y_train})
# evaluate training accuracy
curr_W, curr_b, curr_loss = sess.run([W, b, loss], {x: x_train, y: y_train})
print("W: %s b: %s loss: %s"%(curr_W, curr_b, curr_loss))
However, when I change the training data to x_train=[2,4,6,8] and y_train=[3,4,5,6], the loss starts to increase over time until it reaches 'nan'.
As suggested by Steven, you should probably use reduce_mean(), which seems to fix the problem of the increasing loss function. Note that I also increased the number of training steps since reduce_mean() appears to need a bit longer to converge. Be careful with increasing the learning rate, since this may reproduce the problem. Instead, if training time is not a critical factor, you might want to decrease the learning rate and increase the number of training iterations further.
With the reduce_sum() function it worked well for me after decreasing the learning rate from 0.01 to 0.001. Again, thanks to Steven for the suggestion.
import numpy as np
import tensorflow as tf
# Model parameters
W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
# Model input and output
x = tf.placeholder(tf.float32)
linear_model = W * x + b
y = tf.placeholder(tf.float32)
# loss
loss = tf.reduce_mean(tf.square(linear_model - y)) # mean of the squares
# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)
# training data
x_train = [2,4,6,8]
y_train = [0,3,4,5]
# training loop
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)  # reset values to wrong
for i in range(5000):
    sess.run(train, {x: x_train, y: y_train})
# evaluate training accuracy
curr_W, curr_b, curr_loss = sess.run([W, b, loss], {x: x_train, y: y_train})
print("W: %s b: %s loss: %s"%(curr_W, curr_b, curr_loss))
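Why the fix above works can be checked without TensorFlow. Summing over 4 samples makes the gradient 4 times larger than averaging, so the same learning rate takes steps 4 times as large. The sketch below is a hand-written gradient descent, not the library's implementation. The function name train and its use_mean flag are assumptions for illustration, and it uses the training data from the question.

```python
def train(xs, ys, learning_rate, steps, use_mean):
    """Gradient descent on squared error for y = W*x + b; returns (W, b, loss)."""
    W, b = 0.3, -0.3
    scale = len(xs) if use_mean else 1  # the mean divides loss and gradients by n
    for _ in range(steps):
        errs = [W * x + b - y for x, y in zip(xs, ys)]
        grad_W = sum(2 * e * x for e, x in zip(errs, xs)) / scale
        grad_b = sum(2 * e for e in errs) / scale
        W -= learning_rate * grad_W
        b -= learning_rate * grad_b
    loss = sum((W * x + b - y) ** 2 for x, y in zip(xs, ys)) / scale
    return W, b, loss

xs, ys = [2, 4, 6, 8], [3, 4, 5, 6]
start_loss = train(xs, ys, 0.01, 0, use_mean=False)[2]
sum_loss = train(xs, ys, 0.01, 20, use_mean=False)[2]
W_mean, b_mean, mean_loss = train(xs, ys, 0.01, 5000, use_mean=True)

print(start_loss, sum_loss)       # the summed loss grows instead of shrinking
print(W_mean, b_mean, mean_loss)  # approaches W = 0.5, b = 2
```

Larger x values make the gradient with respect to W larger as well, which is why the original data [1,2,3,4] still converged with the summed loss at a learning rate of 0.01 while [2,4,6,8] did not.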

Tensorflow model always produces mean

I am having trouble with fitting a very simple model in tensorflow. If I have a column of input data which is constant, my output always converges to produce the same value for all rows, which is the mean of my output data, y_, even when there is another column in x_ which has enough information to reproduce y_ exactly. Here is a small example.
import tensorflow as tf
def weight_variable(shape):
    """Initialize the weights with random weights"""
    initial = tf.truncated_normal(shape, stddev=0.1, dtype=tf.float64)
    return tf.Variable(initial)
#Initialize my data
x = tf.constant([[1.0,1.0],[1.0,2.0],[1.0,3.0]], dtype=tf.float64)
y_ = tf.constant([1.0,2.0,3.0], dtype=tf.float64)
w = weight_variable((2,1))
y = tf.matmul(x,w)
error = tf.reduce_mean(tf.square(y_ - y))
train_step = tf.train.AdamOptimizer(1e-5).minimize(error)
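with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Train the model and output every 1000 iterations
    for i in range(1000000):
        _, err = sess.run([train_step, error])
        if i % 1000 == 0:
            print("\nerr:", err)
            print("x: ", sess.run(x))
            print("w: ", sess.run(w))
            print("y_: ", sess.run(y_))
            print("y: ", sess.run(y))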
This example always converges to w=[2,0], and y = [2,2,2]. This is a smooth function with a minimum at w=[0,1] and y = [1,2,3], where the error function is zero. Why does it not converge to this? I have also tried using gradient descent and I have tried varying the training rate.
Your target, y_ = tf.constant([1.0,2.0,3.0], dtype=tf.float64), has shape (3,), which broadcasts like (1, 3). The output of tf.matmul(x, w) has shape (3, 1). Thus y_ - y has shape (3, 3) according to NumPy broadcasting rules, so you are not optimizing the function you thought you were. Change your y_ to the following and give it a shot:
y_ = tf.constant([[1.0],[2.0],[3.0]], dtype=tf.float64)
This should converge pretty quickly to your expected answer, even with a large learning rate.
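The effect of the accidental (3, 3) difference can be reproduced in plain Python. The sketch below emulates the broadcasting by hand; the helper names broadcast_mse and elementwise_mse are made up for illustration. Under the broadcast loss every prediction is compared with every target, so a constant prediction equal to the mean of the targets scores better than the exact fit.

```python
# Emulate broadcasting a length-3 target against a (3, 1) prediction column:
# every target is compared with every prediction.
def broadcast_mse(targets, pred_column):
    diffs = [[t - p[0] for t in targets] for p in pred_column]  # 3 x 3 grid
    flat = [d * d for row in diffs for d in row]
    return sum(flat) / len(flat)

# Intended loss: compare each prediction with its own target only.
def elementwise_mse(target_column, pred_column):
    sq = [(t[0] - p[0]) ** 2 for t, p in zip(target_column, pred_column)]
    return sum(sq) / len(sq)

targets = [1.0, 2.0, 3.0]
exact = [[1.0], [2.0], [3.0]]     # the fit the question expected
constant = [[2.0], [2.0], [2.0]]  # the mean of the targets

print(broadcast_mse(targets, exact))     # 12/9, about 1.333
print(broadcast_mse(targets, constant))  # 6/9, about 0.667: lower than the exact fit
print(elementwise_mse(exact, exact))     # 0.0
```

This matches the behaviour described in the question: the optimizer was correctly minimizing the broadcast loss, whose minimum is the constant prediction y = [2, 2, 2].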