Out Of Memory when running multi-gpu cnn with TensorFlow

Out Of Memory when running multi-gpu cnn with TensorFlow - tensorflow

I'm trying to run a simple cnn on cifar10, combining code from 2 examples:
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/6_MultiGPU/multigpu_cnn.py
https://github.com/exelban/tensorflow-cifar-10
I'm getting OOM errors.
I first tried the code with the complete cnn , without multi-gpu support, and it is working ok. Next I used the multi-gpu code, ran ok too.
Combining them is not working.
with tf.device('/cpu:0'):
tower_grads = []
reuse_vars = False
# tf Graph input
X = tf.placeholder(tf.float32, shape=[None, _IMAGE_SIZE * _IMAGE_SIZE * _IMAGE_CHANNELS], name='Input')
Y = tf.placeholder(tf.float32, shape=[None, _NUM_CLASSES], name='Output')
phase = tf.placeholder(tf.bool, name='phase')
# learning_rate = tf.placeholder(tf.float32, shape=[], name='learning_rate')
keep_prob = tf.placeholder(tf.float32)
global_step = tf.get_variable(name='global_step', trainable=False, initializer=0)
# Loop over all GPUs and construct their own computation graph
for i in range(_NUM_GPUS):
with tf.device('/gpu:{}'.format(i)):
# learning_rate = tf.placeholder(tf.float32, shape=[], name='learning_rate')
# keep_prob = tf.placeholder(tf.float32)
# Split data between GPUs
_x = X[i * _BATCH_SIZE: (i+1) * _BATCH_SIZE]
_y = Y[i * _BATCH_SIZE: (i+1) * _BATCH_SIZE]
print("x shape:",_x.shape)
print("y shape:",_y.shape)
# Because Dropout have different behavior at training and prediction time, we
# need to create 2 distinct computation graphs that share the same weights.
_x = tf.reshape(_x, [-1, _IMAGE_SIZE, _IMAGE_SIZE, _IMAGE_CHANNELS], name='images')
# Create a graph for training
logits_train, y_pred_cls = feed_net(_x, _NUM_CLASSES, keep_prob, reuse=reuse_vars, is_training=True)
# Create another graph for testing that reuse the same weights
logits_test, y_pred_cls = feed_net(_x, _NUM_CLASSES, keep_prob, reuse=True, is_training=False)
# Define loss and optimizer (with train logits, for dropout to take effect)
loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits_train, labels=_y))
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
grads = optimizer.compute_gradients(loss_op)
# Only first GPU compute accuracy
if i == 0:
# Evaluate model (with test logits, for dropout to be disabled)
correct_pred = tf.equal(tf.argmax(logits_test, 1), tf.argmax(_y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
reuse_vars = True
tower_grads.append(grads)
tower_grads = average_gradients(tower_grads)
train_op = optimizer.apply_gradients(tower_grads)
The error is happening when running with more than 1 gpu (got 4), after about 90 iterations (less than one epoch)
ResourceExhaustedError: Ran out of GPU memory when allocating 0 bytes for
[[Node: softmax_cross_entropy_with_logits_sg_3 = SoftmaxCrossEntropyWithLogits[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:3"](softmax_cross_entropy_with_logits_sg_3/Reshape, softmax_cross_entropy_with_logits_sg_3/Reshape_1)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: main_params/map/while/Less_1/_206 = _Send[T=DT_BOOL, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1905_main_params/map/while/Less_1", _device="/job:localhost/replica:0/task:0/device:GPU:0"](main_params/map/while/Less_1)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
UPDATE:
The problem was with how the data was divided across the GPUs.
I used tf.split(X, _NUM_GPUS) for the data and the labels, then I could assign each GPU with it's right data chunk.

Here's the solution:
The problem was with how the data was divided across the GPUs.
I used tf.split(X, _NUM_GPUS) for the data and the labels, then I could assign each GPU with it's right data chunk. Also , only one GPU is running accuracy so it needed to get the full sized data.

Related

how to restore the learning rate in TF from previously saved checkpoint ?

I have stopped training at some point and saved checkpoint, meta files etc.
Now when I want to resume training, I want to start with last running learning rate of the optimizer. Can you provide a example of doing so ?

For those coming here (like me) wondering whether the last learning rate is automatically restored: tf.train.exponential_decay doesn't add any Variables to the graph, it only adds the operations necessary to derive the correct current learning rate value given a certain global_step value. This way, you only need to checkpoint the global_step value (which is done by default normally) and, assuming you keep the same initial learning rate, decay steps and decay factor, you'll automatically pick up training where you left it, with the correct learning rate value.
Inspecting the checkpoint won't show any learning_rate variable (or similar), simply because there is no need for any.

This example code learns to add two numbers:
import tensorflow as tf
import numpy as np
import os
save_ckpt_dir = './add_ckpt'
ckpt_filename = 'add.ckpt'
save_ckpt_path = os.path.join(save_ckpt_dir, ckpt_filename)
if not os.path.isdir(save_ckpt_dir):
os.mkdir(save_ckpt_dir)
if [fname.startswith("add.ckpt") for fname in os.listdir(save_ckpt_dir)]: # prefer to load pre-trained net
load_ckpt_path = save_ckpt_path
else:
load_ckpt_path = None # train from scratch
def add_layer(inputs, in_size, out_size, activation_fn=None):
Weights = tf.Variable(tf.ones([in_size, out_size]), name='Weights')
biases = tf.Variable(tf.zeros([1, out_size]), name='biases')
Wx_plus_b = tf.add(tf.matmul(inputs, Weights), biases)
if activation_fn is None:
layer_output = Wx_plus_b
else:
layer_output = activation_fn(Wx_plus_b)
return layer_output
def produce_batch(batch_size=256):
"""Loads a single batch of data.
Args:
batch_size: The number of excersises in the batch.
Returns:
x : column vector of numbers
y : another column of numbers
xy_sum : the sum of the columns
"""
x = np.random.random(size=[batch_size, 1]) * 10
y = np.random.random(size=[batch_size, 1]) * 10
xy_sum = x + y
return x, y, xy_sum
with tf.name_scope("inputs"):
xs = tf.placeholder(tf.float32, [None, 1])
ys = tf.placeholder(tf.float32, [None, 1])
with tf.name_scope("correct_labels"):
xysums = tf.placeholder(tf.float32, [None, 1])
with tf.name_scope("step_and_learning_rate"):
global_step = tf.Variable(0, trainable=False)
lr = tf.train.exponential_decay(0.15, global_step, 10, 0.96) # start lr=0.15, decay every 10 steps with a base of 0.96
with tf.name_scope("graph_body"):
prediction = add_layer(tf.concat([xs, ys], 1), 2, 1, activation_fn=None)
with tf.name_scope("loss_and_train"):
# the error between prediction and real data
loss = tf.reduce_mean(tf.reduce_sum(tf.square(xysums-prediction), reduction_indices=[1]))
# Passing global_step to minimize() will increment it at each step.
train_step = tf.train.AdamOptimizer(lr).minimize(loss, global_step=global_step)
with tf.name_scope("init_load_save"):
init = tf.global_variables_initializer()
saver = tf.train.Saver()
with tf.Session() as sess:
sess.run(init)
if load_ckpt_path:
saver.restore(sess, load_ckpt_path)
for i in range(1000):
x, y, xy_sum = produce_batch(256)
_, global_step_np, loss_np, lr_np = sess.run([train_step, global_step, loss, lr], feed_dict={xs: x, ys: y, xysums: xy_sum})
if global_step_np % 100 == 0:
print("global step: {}, loss: {}, learning rate: {}".format(global_step_np, loss_np, lr_np))
saver.save(sess, save_ckpt_path)
if you run it a few times, you will see the learning rate decrease. It also saves the global step. The trick is here:
with tf.name_scope("step_and_learning_rate"):
global_step = tf.Variable(0, trainable=False)
lr = tf.train.exponential_decay(0.15, global_step, 10, 0.96) # start lr=0.15, decay every 10 steps with a base of 0.96
...
train_step = tf.train.AdamOptimizer(lr).minimize(loss, global_step=global_step)
By default, saver.save will save all savable objects (including learning rate and global step). However, if tf.train.Saver is provided with var_list, saver.save will only save the vars included in var_list:
saver = tf.train.Saver(var_list = ..list of vars to save..)
sources:
https://www.tensorflow.org/api_docs/python/tf/train/exponential_decay
https://stats.stackexchange.com/questions/200063/tensorflow-adam-optimizer-with-exponential-decay
https://www.tensorflow.org/api_docs/python/tf/train/Saver (see "saveable objects")

How restore training from Inception-3 checkpoint with different trainable variables

I have the pretty common use case of freezing the bottom layers of Inception and training only the first two layers, after which I lower the learning rate and fine tune the entire Inception model.
Here is my code for running the first part
train_dir='/home/ubuntu/pynb/TF play/log-inceptionv3flowers'
with tf.Graph().as_default():
tf.logging.set_verbosity(tf.logging.INFO)
dataset = get_dataset()
images, _, labels = load_batch(dataset, batch_size=32)
# Create the model, use the default arg scope to configure the batch norm parameters.
with slim.arg_scope(inception.inception_v3_arg_scope()):
logits, _ = inception.inception_v3(images, num_classes=5, is_training=True)
# Specify the loss function:
one_hot_labels = slim.one_hot_encoding(labels, 5)
tf.losses.softmax_cross_entropy(one_hot_labels, logits)
total_loss = tf.losses.get_total_loss()
# Create some summaries to visualize the training process:
tf.summary.scalar('losses/Total Loss', total_loss)
# Specify the optimizer and create the train op:
optimizer = tf.train.RMSPropOptimizer(0.001, 0.9,
momentum=0.9, epsilon=1.0)
train_op = slim.learning.create_train_op(total_loss, optimizer, variables_to_train=get_variables_to_train())
# Run the training:
final_loss = slim.learning.train(
train_op,
logdir=train_dir,
init_fn=get_init_fn(),
number_of_steps=4500,
save_summaries_secs=30,
save_interval_secs=30,
session_config=tf.ConfigProto(gpu_options=gpu_options))
print('Finished training. Last batch loss %f' % final_loss)
which runs properly, then my code for running the second part
train_dir='/home/ubuntu/pynb/TF play/log-inceptionv3flowers'
with tf.Graph().as_default():
tf.logging.set_verbosity(tf.logging.INFO)
dataset = get_dataset()
images, _, labels = load_batch(dataset, batch_size=32)
# Create the model, use the default arg scope to configure the batch norm parameters.
with slim.arg_scope(inception.inception_v3_arg_scope()):
logits, _ = inception.inception_v3(images, num_classes=5, is_training=True)
# Specify the loss function:
one_hot_labels = slim.one_hot_encoding(labels, 5)
tf.losses.softmax_cross_entropy(one_hot_labels, logits)
total_loss = tf.losses.get_total_loss()
# Create some summaries to visualize the training process:
tf.summary.scalar('losses/Total Loss', total_loss)
# Specify the optimizer and create the train op:
optimizer = tf.train.RMSPropOptimizer(0.0001, 0.9,
momentum=0.9, epsilon=1.0)
train_op = slim.learning.create_train_op(total_loss, optimizer)
# Run the training:
final_loss = slim.learning.train(
train_op,
logdir=train_dir,
init_fn=get_init_fn(),
number_of_steps=10000,
save_summaries_secs=30,
save_interval_secs=30,
session_config=tf.ConfigProto(gpu_options=gpu_options))
print('Finished training. Last batch loss %f' % final_loss)
Notice that in the second part, I do not pass anything into create_train_op's variables_to_train parameter. This error is then shown
NotFoundError (see above for traceback): Key InceptionV3/Conv2d_4a_3x3/BatchNorm/beta/RMSProp not found in checkpoint
[[Node: save_1/RestoreV2_49 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save_1/Const_0, save_1/RestoreV2_49/tensor_names, save_1/RestoreV2_49/shape_and_slices)]]
[[Node: save_1/Assign_774/_1550 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_2911_save_1/Assign_774", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
I suspect that it's looking for the RMSProp variables for the InceptionV3/Conv2d_4a_3x3 layer, which is non-existent, because I didn't train that layer in the previous checkpoint. I'm not sure how to achieve what I want, as I can see no examples in the documentation about how to do this.

TF Slim has support for reading from a checkpoint whose variable names do not match, described here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/slim/python/slim/learning.py#L146
You can specify how the variable names in a checkpoint map to the variables in your model.
I hope that helps!

Can i generate the input given the output in a pretrained Tensorflow model?

Let's assume i have trained a model for the MNist task, given the following code:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)
import tensorflow as tf
# Parameters
learning_rate = 0.001
training_epochs = 15
batch_size = 100
display_step = 1
# Network Parameters
n_hidden_1 = 256 # 1st layer number of features
n_hidden_2 = 256 # 2nd layer number of features
n_input = 784 # MNIST data input (img shape: 28*28)
n_classes = 10 # MNIST total classes (0-9 digits)
# tf Graph input
x = tf.placeholder("float", [None, n_input])
y = tf.placeholder("float", [None, n_classes])
weights = {
'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),
'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
'out': tf.Variable(tf.random_normal([n_hidden_2, n_classes]))
}
biases = {
'b1': tf.Variable(tf.random_normal([n_hidden_1])),
'b2': tf.Variable(tf.random_normal([n_hidden_2])),
'out': tf.Variable(tf.random_normal([n_classes]))
}
# Create model
def multilayer_perceptron(x, weights, biases):
# Hidden layer with RELU activation
layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
layer_1 = tf.nn.relu(layer_1)
# Hidden layer with RELU activation
layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
layer_2 = tf.nn.relu(layer_2)
# Output layer with linear activation
out_layer = tf.matmul(layer_2, weights['out']) + biases['out']
return out_layer
# Construct model
pred = multilayer_perceptron(x, weights, biases)
# Test model
correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
# Calculate accuracy
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
# Initializing the variables
init = tf.global_variables_initializer()
# Launch the graph
with tf.Session() as sess:
sess.run(init)
# Training cycle
for epoch in range(training_epochs):
avg_cost = 0.
avg_acc = 0.
total_batch = int(mnist.train.num_examples/batch_size)
# Loop over all batches
for i in range(total_batch):
batch_x, batch_y = mnist.train.next_batch(batch_size)
# Run optimization op (backprop) and cost op (to get loss value)
_, c = sess.run([optimizer, cost], feed_dict={x: batch_x, y: batch_y})
batch_acc = accuracy.eval({x: batch_x, y: batch_y})
# Compute average loss
avg_cost += c / total_batch
avg_acc += batch_acc / total_batch
# Display logs per epoch step
if epoch % display_step == 0:
test_acc = accuracy.eval({x: mnist.test.images, y: mnist.test.labels})
print(
"Epoch:",
'%04d' % (epoch+1),
"cost=",
"{:.9f}".format(avg_cost),
"average_train_accuracy=",
"{:.6f}".format(avg_acc),
"test_accuracy=",
"{:.6f}".format(test_acc)
)
print("Optimization Finished!")
So this model predicts the number shown in an image given the image.
Once i have trained it, could i make the input a 'variable' instead of 'placeholder' and try to reverse engineer the input given an output ?
For example i would like to feed the output '8' and produce a representative image of number eight.
I thought of:
Freezing the model
Add a variable matrix 'M' of the same size as the input between the input and the weights
Feed an Identical matrix as input to the input placeholder
Run the optimizer to learn the 'M' matrix.
Is there a better way ?

If your goal is to reverse the model in the sense that the input should be a digit and the output an image displaying that digit (in say, handwritten form), it is not quite possible to do with machine learning models.
Because machine learning models attempt to create generalizations from the input (so that similar input will provide similar output, although the model was never trained on it) they tend to be quite lossy. Additionally, the reduction from hundreds, thousands and more input variables into a single output variable obviously has to lose some information in the process.
More specifically, although a Multilayer Perceptron (as you're using in your example) is a fully connected Neural Network, some weights are expected to be zero, thus completely dropping the information in certain input variables. Moverover, the same output of a neuron can be retrieved by multiple distinctive input values to it's function, due to the many degrees of freedom.
It is theoretically possible to replace those degrees of freedom and lost information with specifically crafted or random data, but that does not guarantee a successful output.
On a side note, I'm a bit puzzled by this question. If you are able to generate that model yourself, you could also create a similar model that does the opposite. You could train a model to accept an input digit (and perhaps some random seed) and output an image.

ValueError: No gradients provided for any variable

I'm facing a trouble with tensorFlow. Executing the following code
import tensorflow as tf
import input_data
learning_rate = 0.01
training_epochs = 25
batch_size = 100
display_step = 1
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
# tensorflow graph input
X = tf.placeholder('float', [None, 784]) # mnist data image of shape 28 * 28 = 784
Y = tf.placeholder('float', [None, 10]) # 0-9 digits recognition = > 10 classes
# set model weights
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
# Our hypothesis
activation = tf.add(tf.matmul(X, W),b) # Softmax
# Cost function: cross entropy
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=activation, logits=Y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost) # Gradient Descen
I get the following error:
ValueError: No gradients provided for any variable, check your graph
for ops that do not support gradients, between variables
['Tensor("Variable/read:0", shape=(784, 10), dtype=float32)',
'Tensor("Variable_1/read:0", shape=(10,), dtype=float32)'] and loss
Tensor("Mean:0", shape=(), dtype=float32).

This problem is caused by the following line: tf.nn.softmax_cross_entropy_with_logits(labels=activation, logits=Y)
Based on documentation you should have
labels: Each row labels[i] must be a valid probability distribution.
logits: Unscaled log probabilities.
So logits suppose to be your hypothesis and thus equal to activation and valid probability distribution is Y. So just change it with tf.nn.softmax_cross_entropy_with_logits(labels=Y, logits=activation)

I ended up here because I had passed my input X data to my model, but not my expected outputs. I had:
model.fit(X, epochs=30) # whoops!
I should have had:
model.fit(X, y, epochs=30) # fixed!

In my case I've forgotten to add the compile layer to the model
model.compile(optimizer='adam', loss = 'categorical_crossentropy', metrics = ['accuracy'] )

Restoring saved TensorFlow model to evaluate on test set

I have seen a few posts on restoring TF models and the Google doc page on exporting graphs but I think I am missing something.
I use the code in this Gist to save the model along with this utils file to which defines the model
Now I would like to restore it and run in a previously unseen test data as follows:
def evaluate(X_data, y_data):
num_examples = len(X_data)
total_accuracy = 0
total_loss = 0
sess = tf.get_default_session()
acc_steps = len(X_data) // BATCH_SIZE
for i in range(acc_steps):
batch_x, batch_y = next_batch(X_val, Y_val, BATCH_SIZE)
loss, accuracy = sess.run([loss_value, acc], feed_dict={
images_placeholder: batch_x,
labels_placeholder: batch_y,
keep_prob: 0.5
})
total_accuracy += (accuracy * len(batch_x))
total_loss += (loss * len(batch_x))
return (total_accuracy / num_examples, total_loss / num_examples)
## re-execute the code that defines the model
# Image Tensor
images_placeholder = tf.placeholder(tf.float32, shape=[None, 32, 32, 3], name='x')
gray = tf.image.rgb_to_grayscale(images_placeholder, name='gray')
gray /= 255.
# Label Tensor
labels_placeholder = tf.placeholder(tf.float32, shape=(None, 43), name='y')
# dropout Tensor
keep_prob = tf.placeholder(tf.float32, name='drop')
# construct model
logits = inference(gray, keep_prob)
# calculate loss
loss_value = loss(logits, labels_placeholder)
# training
train_op = training(loss_value, 0.001)
# accuracy
acc = accuracy(logits, labels_placeholder)
with tf.Session() as sess:
loader = tf.train.import_meta_graph('gtsd.meta')
loader.restore(sess, tf.train.latest_checkpoint('./'))
sess.run(tf.initialize_all_variables())
test_accuracy = evaluate(X_test, y_test)
print("Test Accuracy = {:.3f}".format(test_accuracy[0]))
I'm getting a test accuracy of only 3%. However If I don't close the Notebook and run the test code immediately after training the model, I get a 95% accuracy.
This leads me to believe I'm not loading the model correctly?

The problem arises from these two lines:
loader.restore(sess, tf.train.latest_checkpoint('./'))
sess.run(tf.initialize_all_variables())
The first line loads the saved model from a checkpoint. The second line re-initializes all of the variables in the model (such as the weight matrices, convolutional filters, and bias vectors), usually to random numbers, and overwrites the loaded values.
The solution is simple: delete the second line (sess.run(tf.initialize_all_variables())) and evaluation will proceed with the trained values loaded from the checkpoint.
PS. There is a small chance that this change will give you an error about "uninitialized variables". In that case, you should execute sess.run(tf.initialize_all_variables()) to initialize any variables not saved in the checkpoint before executing loader.restore(sess, tf.train.latest_checkpoint('./')).

I had a similar problem and for me this worked:
with tf.Session() as sess:
saver=tf.train.Saver(tf.all_variables())
saver=tf.train.import_meta_graph('model.meta')
saver.restore(sess,"model")
test_accuracy = evaluate(X_test, y_test)

The answer found here is what ended up working as follows:
save_path = saver.save(sess, '/home/ubuntu/gtsd-12-23-16.chkpt')
print("Model saved in file: %s" % save_path)
## later re-run code that creates the model
# Image Tensor
images_placeholder = tf.placeholder(tf.float32, shape=[None, 32, 32, 3], name='x')
gray = tf.image.rgb_to_grayscale(images_placeholder, name='gray')
gray /= 255.
# Label Tensor
labels_placeholder = tf.placeholder(tf.float32, shape=(None, 43), name='y')
# dropout Tensor
keep_prob = tf.placeholder(tf.float32, name='drop')
# construct model
logits = inference(gray, keep_prob)
# calculate loss
loss_value = loss(logits, labels_placeholder)
# training
train_op = training(loss_value, 0.001)
# accuracy
acc = accuracy(logits, labels_placeholder)
saver = tf.train.Saver()
with tf.Session() as sess:
saver.restore(sess, '/home/ubuntu/gtsd-12-23-16.chkpt')
print("Model restored.")
test_accuracy = evaluate(X_test, y_test)
print("Test Accuracy = {:.3f}".format(test_accuracy[0]*100))

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Out Of Memory when running multi-gpu cnn with TensorFlow - tensorflow

Here's the solution: The problem was with how the data was divided across the GPUs. I used tf.split(X, _NUM_GPUS) for the data and the labels, then I could assign each GPU with it's right data chunk. Also , only one GPU is running accuracy so it needed to get the full sized data.

Related

how to restore the learning rate in TF from previously saved checkpoint ?

How restore training from Inception-3 checkpoint with different trainable variables

Can i generate the input given the output in a pretrained Tensorflow model?

ValueError: No gradients provided for any variable

Restoring saved TensorFlow model to evaluate on test set

Categories

Resources