How to use a model trained on GPU using CudnnLSTM on CPU? - tensorflow

TensorFlow version 1.6.0 on Ubuntu 16.04.
The network uses CudnnLSTM: https://www.tensorflow.org/api_docs/python/tf/contrib/cudnn_rnn/CudnnLSTM
Model export and prediction work on GPU, but exporting and running inference on CPU gives the error below.
File "/home/deepak/.local/lib/python2.7/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 501, in _create_saveable
name="%s_saveable" % self.trainable_variables[0].name.split(":")[0])
File "/home/deepak/.local/lib/python2.7/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 262, in __init__
weights, biases = self._OpaqueParamsToCanonical()
File "/home/deepak/.local/lib/python2.7/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 315, in _OpaqueParamsToCanonical
direction=self._direction)
File "/home/deepak/.local/lib/python2.7/site-packages/tensorflow/contrib/cudnn_rnn/ops/gen_cudnn_rnn_ops.py", line 769, in cudnn_rnn_params_to_canonical
name=name)
File "/home/deepak/.local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/deepak/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
op_def=op_def)
File "/home/deepak/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): No OpKernel was registered to support Op 'CudnnRNNParamsToCanonical' with these attrs. Registered devices: [CPU], Registered kernels:
<no registered kernels>
[[Node: CudnnRNNParamsToCanonical = CudnnRNNParamsToCanonical[T=DT_FLOAT, direction="bidirectional", dropout=0, input_mode="linear_input", num_params=16, rnn_mode="lstm", seed=0, seed2=0, _device="/device:GPU:0"](CudnnRNNParamsToCanonical/num_layers, CudnnRNNParamsToCanonical/num_units, CudnnRNNParamsToCanonical/input_size, cudnn_lstm/opaque_kernel/read)]]
And the export code is as below:
with tf.Graph().as_default() as graph:
    inputs, outputs = create_graph()
    # Create a saver using variables from the above newly created graph
    saver = tf.train.Saver(tf.global_variables())
    with tf.Session() as sess:
        # Restore the model from last checkpoints
        ckpt = tf.train.get_checkpoint_state(FLAGS.checkpoint_dir)
        saver.restore(sess, ckpt.model_checkpoint_path)
        # (re-)create export directory
        export_path = os.path.join(
            tf.compat.as_bytes(FLAGS.export_dir),
            tf.compat.as_bytes(str(FLAGS.export_version)))
        if os.path.exists(export_path):
            shutil.rmtree(export_path)
        # create model builder
        builder = tf.saved_model.builder.SavedModelBuilder(export_path)
        input_node = graph.get_tensor_by_name('input_node:0')
        input_lengths = graph.get_tensor_by_name('input_lengths:0')
        outputs = graph.get_tensor_by_name('output_node:0')
        # create tensors info
        predict_tensor_inputs_info = tf.saved_model.utils.build_tensor_info(input_node)
        predict_tensor_inputs_length_info = tf.saved_model.utils.build_tensor_info(input_lengths)
        predict_tensor_scores_info = tf.saved_model.utils.build_tensor_info(outputs)
        # build prediction signature
        prediction_signature = (
            tf.saved_model.signature_def_utils.build_signature_def(
                inputs={'input': predict_tensor_inputs_info, 'input_len': predict_tensor_inputs_length_info},
                outputs={'output': predict_tensor_scores_info},
                method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME
            )
        )
        # save the model
        builder.add_meta_graph_and_variables(
            sess, [tf.saved_model.tag_constants.SERVING],
            signature_def_map={
                'infer': prediction_signature
            })
        builder.save()

Related

Keras load_weights() not loading checkpoints

I have been following the TensorFlow RNN tutorial:
https://www.tensorflow.org/tutorials/text/text_generation
model.load_weights() is not working and throws this error:
Traceback (most recent call last):
File "C:/Users/swati.srivastava/PycharmProjects/TensorFlow Practice/main.py", line 1232, in <module>
model.load_weights(tf.train.load_checkpoint("./training_checkpoints/ckpt_" + str(checkpoint_num)))
File "C:\Users\swati.srivastava\PycharmProjects\TensorFlow Practice\venv\lib\site-packages\tensorflow\python\keras\engine\training.py", line 2260, in load_weights
filepath, save_format = _detect_save_format(filepath)
File "C:\Users\swati.srivastava\PycharmProjects\TensorFlow Practice\venv\lib\site-packages\tensorflow\python\keras\engine\training.py", line 2868, in _detect_save_format
if saving_utils.is_hdf5_filepath(filepath):
File "C:\Users\swati.srivastava\PycharmProjects\TensorFlow Practice\venv\lib\site-packages\tensorflow\python\keras\saving\saving_utils.py", line 327, in is_hdf5_filepath
return (filepath.endswith('.h5') or filepath.endswith('.keras') or
AttributeError: 'tensorflow.python.util._pywrap_checkpoint_reader.C' object has no attribute 'endswith'
Process finished with exit code 1
My code is
BATCH_SIZE = 64
VOCAB_SIZE = len(vocab)
EMBEDDING_DIM = 256
RNN_UNITS = 1024

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim, batch_input_shape=[batch_size, None]),
        tf.keras.layers.LSTM(rnn_units,
                             return_sequences=True,
                             stateful=True,
                             recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True
)
model = build_model(VOCAB_SIZE, EMBEDDING_DIM, RNN_UNITS, batch_size=1)
checkpoint_num = 2
model.load_weights(tf.train.load_checkpoint("./training_checkpoints/ckpt_" + str(checkpoint_num)))
model.build(tf.TensorShape([1, None]))
My project directory looks like this (screenshot not preserved), which means that the training checkpoints are created and exist. None of the checkpoint files are empty.
The only solution I could find was at https://github.com/tensorflow/tensorflow/issues/38745, where it says to do save_weights_only=True, which I have already done.
I think it is some sort of version conflict, but am not sure.
Edit: Added the checkpoint_callback snippet. training_checkpoints directory is created as can be seen in the project directory image
The issue is that load_weights expects a file path string (a TensorFlow checkpoint prefix or an HDF5 file), but your code passes it the reader object returned by tf.train.load_checkpoint. That is why the traceback ends with "object has no attribute 'endswith'".
I am assuming that you are using the ModelCheckpoint API for checkpoint creation, like below:
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1)
Then you just pass the checkpoint path itself, which in your case is the prefix of a checkpoint inside './training_checkpoints/' (for example './training_checkpoints/ckpt_2'):
model.load_weights(checkpoint_path)
You can look at the documentation for more details.
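As a minimal sketch of the kind of argument load_weights needs (assuming the "ckpt_{epoch}" naming from the question's checkpoint_prefix; the helper name checkpoint_path_for is hypothetical), the point is to build a plain string rather than open the checkpoint yourself:

```python
import os


def checkpoint_path_for(checkpoint_dir, epoch):
    """Build the checkpoint prefix string for a given epoch.

    Hypothetical helper: it mirrors the "ckpt_{epoch}" pattern used for
    ModelCheckpoint's filepath in the question.
    """
    return os.path.join(checkpoint_dir, "ckpt_{epoch}".format(epoch=epoch))


path = checkpoint_path_for("./training_checkpoints", 2)
print(path)

# With a built model, the string itself is what gets passed:
#   model.load_weights(path)
# tf.train.latest_checkpoint(checkpoint_dir) also returns such a string,
# whereas tf.train.load_checkpoint(...) returns a reader object.
```

If you only ever want the most recent checkpoint, tf.train.latest_checkpoint saves you from tracking the epoch number by hand.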

L2-normalization with Keras Backend?

I'd like to normalize the inputs going into my neural network. I'm defining my model in this way:
df = pd.read_csv(r'C:\Users\Davide Mori\PycharmProjects\pythonProject\Dataset.csv')
print(df)
target_column = ['W_mag', 'W_phase']
predictors = list(set(list(df.columns)) - set(target_column))
X = df[predictors].values
Y = df[target_column].values

def get_model(n_inputs, n_outputs):
    model = Sequential()
    model.add(Dense(1000, input_dim=n_inputs, activation='relu'))
    #model.add(Lambda(lambda x: K.l2_normalize(x, axis=1)))
    model.add(Dense(1000, activation='linear', activity_regularizer=regularizers.l1(0.0001)))
    model.add(Activation('relu'))
    model.add(Dense(n_outputs, activation='linear'))
    model.compile(optimizer="adam", loss="mean_squared_error", metrics=["mean_squared_error"])
    model.summary()
    return model

n_inputs, n_outputs = X.shape[1], Y.shape[1]
model = get_model(n_inputs, n_outputs)
# fit the model on all data
model.fit(X, Y, epochs=100, batch_size=1)
How do I apply the Lambda layer to my inputs? Isn't the position of the commented-out line wrong? If I put the Lambda layer there, I'm normalizing what has already been "transformed" by the first hidden layer, right? How can I solve this problem?
This is the error I get when putting the Lambda layer before everything else:
2020-10-12 15:08:46.036872: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Program Files\JetBrains\PyCharm 2020.2.2\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "C:\Program Files\JetBrains\PyCharm 2020.2.2\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Users/Davide Mori/PycharmProjects/pythonProject/prova_rete_sfs.py", line 60, in <module>
model = get_model(n_inputs, n_outputs)
File "C:/Users/Davide Mori/PycharmProjects/pythonProject/prova_rete_sfs.py", line 52, in get_model
model.summary()
File "C:\Users\Davide Mori\Anaconda3\envs\pythonProject\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 1302, in summary
raise ValueError('This model has not yet been built. '
ValueError: This model has not yet been built. Build the model first by calling `build()` or calling `fit()` with some data, or specify an `input_shape` argument in the first layer(s) for automatic build.
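The error message itself hints at one way out: give the first layer an input_shape, for example by making Lambda(lambda x: K.l2_normalize(x, axis=1), input_shape=(n_inputs,)) the first layer of the Sequential model (a suggestion, not tested against the poster's data). For reference, here is a plain-Python sketch of what a row-wise L2 normalization computes. The function name l2_normalize_rows and the epsilon value are illustrative assumptions, not the Keras implementation:

```python
import math


def l2_normalize_rows(rows, epsilon=1e-12):
    """Scale each row (one sample) to unit Euclidean length."""
    normalized = []
    for row in rows:
        squared_sum = sum(value * value for value in row)
        # Guard against division by zero for all-zero rows.
        norm = math.sqrt(max(squared_sum, epsilon))
        normalized.append([value / norm for value in row])
    return normalized


samples = [[3.0, 4.0], [1.0, 0.0]]
result = l2_normalize_rows(samples)
print(result)
```

Each sample is divided by its own norm, which is why the layer has to see the raw inputs: placed after the first Dense layer it would rescale the hidden activations instead.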

Memory error when initializing Xception using Keras

I am having difficulty implementing the pre-trained Xception model for binary classification over a new set of classes. The model is successfully returned from the following function:
# adapted from:
# https://github.com/fchollet/keras/issues/4465
from keras.applications.xception import Xception
from keras.layers import Input, Flatten, Dense
from keras.models import Model

def get_xception(in_shape, trn_conv):
    # Get back the convolutional part of Xception trained on ImageNet
    model = Xception(weights='imagenet', include_top=False)
    # Here the input images have been resized to 299x299x3, so this is the
    # same as Xception's native input
    input = Input(in_shape, name='image_input')
    # Use the generated model
    output = model(input)
    # Only train the top fully connected layers (keep pre-trained feature extractors)
    for layer in model.layers:
        layer.trainable = False
    # Add the fully-connected layers
    x = Flatten(name='flatten')(output)
    x = Dense(2048, activation='relu', name='fc1')(x)
    x = Dense(2048, activation='relu', name='fc2')(x)
    x = Dense(2, activation='softmax', name='predictions')(x)
    # Create your own model
    my_model = Model(input=input, output=x)
    my_model.compile(loss='binary_crossentropy', optimizer='SGD')
    return my_model
This returns fine, however when I run this code:
model=get_xception(shp,trn_feat)
in_data=HDF5Matrix(str_trn,'/inputs')
labels=HDF5Matrix(str_trn,'/labels')
model.fit(in_data,labels,shuffle="batch")
I get the following error:
File "/home/tsmith/.virtualenvs/keras/local/lib/python2.7/site-packages/keras/engine/training.py", line 1576, in fit
self._make_train_function()
File "/home/tsmith/.virtualenvs/keras/local/lib/python2.7/site-packages/keras/engine/training.py", line 960, in _make_train_function
loss=self.total_loss)
File "/home/tsmith/.virtualenvs/keras/local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
return func(*args, **kwargs)
File "/home/tsmith/.virtualenvs/keras/local/lib/python2.7/site-packages/keras/optimizers.py", line 169, in get_updates
v = self.momentum * m - lr * g # velocity
File "/home/tsmith/.virtualenvs/keras/local/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 705, in _run_op
return getattr(ops.Tensor, operator)(a._AsTensor(), *args)
File "/home/tsmith/.virtualenvs/keras/local/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 865, in binary_op_wrapper
return func(x, y, name=name)
File "/home/tsmith/.virtualenvs/keras/local/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 1088, in _mul_dispatch
return gen_math_ops._mul(x, y, name=name)
File "/home/tsmith/.virtualenvs/keras/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1449, in _mul
result = _op_def_lib.apply_op("Mul", x=x, y=y, name=name)
File "/home/tsmith/.virtualenvs/keras/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/home/tsmith/.virtualenvs/keras/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/tsmith/.virtualenvs/keras/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[204800,2048]
[[Node: training/SGD/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](SGD/momentum/read, training/SGD/Variable/read)]]
I have been tracing the function calls for hours now and still can't figure out what is happening. The system should be far above and beyond the requirements. System specs:
Ubuntu Version: 14.04.5 LTS
Tensorflow Version: 1.3.0
Keras Version: 2.0.7
28x dual core Intel Xeon processor (1.2 GHz)
4x NVIDIA GeForce 1080 (8 GB memory each)
Any clues as to what is going wrong here?
Per Yu-Yang, the simplest solution was to reduce the batch size; everything ran fine after that!
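The shape in the error, [204800, 2048], matches the kernel of the fc1 layer: assuming 299x299 inputs, the Xception convolutional base outputs a 10x10x2048 feature map, which Flatten turns into 204800 values feeding Dense(2048). A rough sketch of the memory this one float32 tensor needs (the number of copies held during training is an assumption for illustration, not something measured):

```python
BYTES_PER_FLOAT32 = 4

flattened_features = 10 * 10 * 2048  # Flatten output per image
fc1_units = 2048                     # Dense(2048) in the question

kernel_elements = flattened_features * fc1_units
kernel_bytes = kernel_elements * BYTES_PER_FLOAT32
kernel_gib = kernel_bytes / float(2 ** 30)

print("fc1 kernel shape: [%d, %d]" % (flattened_features, fc1_units))
print("fc1 kernel size: %.2f GiB" % kernel_gib)

# Training keeps more than one tensor of this shape (weights, gradient,
# SGD update buffer), so a rough lower bound is a small multiple of it.
copies = 3  # assumed count, for illustration only
print("approx. for %d copies: %.2f GiB" % (copies, copies * kernel_gib))
```

On an 8 GB card this leaves limited room for activations, which presumably is why a smaller batch size was enough to make training fit.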

Tensorflow batch training OutOfRangeError

Saving variables
Variables saved in 0.88 seconds
Saving metagraph
Metagraph saved in 35.81 seconds
Saving variables
Variables saved in 0.95 seconds
Saving metagraph
Metagraph saved in 33.20 seconds
Traceback (most recent call last):
Caused by op u'batch', defined at:
File "ava_train.py", line 155, in <module>
image_batch, label_batch = tf.train.batch([image, label], batch_size=batch_size, allow_smaller_final_batch=True)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 872, in batch
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 665, in _batch
dequeued = queue.dequeue_up_to(batch_size, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 510, in dequeue_up_to
self._queue_ref, n=n, component_types=self._dtypes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 1402, in _queue_dequeue_up_to_v2
timeout_ms=timeout_ms, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2395, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1264, in __init__
self._traceback = _extract_stack()
OutOfRangeError (see above for traceback): FIFOQueue '_1_batch/fifo_queue' is closed and has insufficient elements (requested 100, current size 0)
[[Node: batch = QueueDequeueUpToV2[component_types=[DT_FLOAT, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](batch/fifo_queue, batch/n)]]
My code is here:
with tf.Graph().as_default():
    global_step = tf.Variable(0, trainable=False)
    # process same as cifar10.distorted_inputs
    log_dir = '../log'
    model_dir = '../model'
    max_num_epoch = 80
    if not os.path.exists(log_dir):
        os.makedirs(log_dir)
    if not os.path.exists(model_dir):
        os.makedirs(model_dir)
    num_train_example = len(os.listdir('../images/'))
    # Reads paths of images together with their labels
    image_list, label_list = read_labeled_image_list('../raw.txt')
    images = ops.convert_to_tensor(image_list, dtype=dtypes.string)
    labels = ops.convert_to_tensor(label_list, dtype=dtypes.int32)
    # Makes an input queue
    # input_queue = tf.train.slice_input_producer([images, labels], num_epochs=max_num_epoch, shuffle=True)
    input_queue = tf.train.slice_input_producer([images, labels], shuffle=True)
    image, label = read_images_from_disk(input_queue)
    image_size = 240
    keep_probability = 0.8
    weight_decay = 5e-5
    image = preprocess(image, image_size, image_size, None)
    batch_size = 100
    epoch_size = 1000
    embedding_size = 128
    # Optional Image and Label Batching
    image_batch, label_batch = tf.train.batch([image, label], batch_size=batch_size, allow_smaller_final_batch=True)
This is the output of training an image classification model on 200,000 (20w) images. I set allow_smaller_final_batch=True in tf.train.batch. After some epochs, the OutOfRangeError occurred.
I don't know the reason. Thanks for the help.
Since you get an OutOfRangeError, it could be that you are training for more epochs than max_num_epoch, which results in the slice_input_producer raising this exception.
One possible workaround would be to remove num_epochs=max_num_epoch from your slice_input_producer, since this allows it to keep producing after the maximum number of epochs has been reached.
I battled with this particular error for days before I found the cause. In my case, the error appeared because a file in the dataset was corrupted. Try running this code on a different set of training and test data.
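To see when an epoch-limited input queue would run dry, here is a back-of-the-envelope sketch. It applies to the variant with num_epochs set (commented out in the question), assumes "20w" means 200,000 images, and takes batch_size = 100 and max_num_epoch = 80 from the question; the function name batches_available is illustrative:

```python
import math


def batches_available(num_examples, num_epochs, batch_size):
    """Number of batches a queue limited to num_epochs can deliver.

    With allow_smaller_final_batch=True the last batch may be partial,
    hence the ceiling.
    """
    total_examples = num_examples * num_epochs
    return int(math.ceil(total_examples / float(batch_size)))


limit = batches_available(num_examples=200000, num_epochs=80, batch_size=100)
print(limit)

# A training loop that asks for more batches than this would hit a closed,
# empty queue and raise OutOfRangeError.
```

If the loop's step count is derived from the same three numbers, it stops before the queue closes instead of relying on the exception.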

tensorflow : restore from checkpoint for continue training

In this case, I want to continue training my model from a checkpoint. I use the CIFAR-10 example and made a small change to cifar-10_train.py, shown below. The two are almost the same, except that I want to restore from a checkpoint.
I replaced cifar-10 with md.
"""
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from datetime import datetime
import os.path
import time
import numpy
import tensorflow.python.platform
from tensorflow.python.platform import gfile
import numpy as np
from six.moves import xrange # pylint: disable=redefined-builtin
import tensorflow as tf
import md
"""
"""
FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_string('train_dir', '/root/test/INT/tbc',
                           """Directory where to write event logs """
                           """and checkpoint.""")
tf.app.flags.DEFINE_integer('max_steps', 60000,  # 55000 steps per epoch
                            """Number of batches to run.""")
tf.app.flags.DEFINE_boolean('log_device_placement', False,
                            """Whether to log device placement.""")
tf.app.flags.DEFINE_string('pretrained_model_checkpoint_path', '/root/test/INT/',
                           """If specified, restore this pretrained model """
                           """before beginning any training.""")


def error_rate(predictions, labels):
    """Return the error rate based on dense predictions and 1-hot labels."""
    return 100.0 - (
        100.0 *
        numpy.sum(numpy.argmax(predictions, 0) == numpy.argmax(labels, 0)) /
        predictions.shape[0])
def train():
    """Train MD65500 for a number of steps."""
    with tf.Graph().as_default():
        # global_step = tf.Variable(0, trainable=False)
        global_step = tf.get_variable(
            'global_step', [],
            initializer=tf.constant_initializer(0), trainable=False)
        # Get images and labels for CIFAR-10.
        images, labels = md.distorted_inputs()
        # Build a Graph that computes the logits predictions from the
        # inference model.
        logits = md.inference(images)
        # Calculate loss.
        loss = md.loss(logits, labels)
        # Build a Graph that trains the model with one batch of examples and
        # updates the model parameters.
        train_op = md.train(loss, global_step)
        # Predictions for the minibatch. there is no validation set or test set.
        # train_prediction = tf.nn.softmax(logits)
        train_prediction = logits
        # Create a saver.
        saver = tf.train.Saver(tf.all_variables())
        # Build the summary operation based on the TF collection of Summaries.
        summary_op = tf.merge_all_summaries()
        # Build an initialization operation to run below.
        init = tf.initialize_all_variables()
        # Start running operations on the Graph.
        # sess = tf.Session(config=tf.ConfigProto(
        #     log_device_placement=FLAGS.log_device_placement))
        # sess.run(init)
        sess = tf.Session(config=tf.ConfigProto(
            allow_soft_placement=True,
            log_device_placement=FLAGS.log_device_placement))
        # sess.run(init)
        if FLAGS.pretrained_model_checkpoint_path:
            assert tf.gfile.Exists(FLAGS.pretrained_model_checkpoint_path)
            # variables_to_restore = tf.get_collection(
            #     slim.variables.VARIABLES_TO_RESTORE)
            variable_averages = tf.train.ExponentialMovingAverage(
                md.MOVING_AVERAGE_DECAY)
            variables_to_restore = {}
            for v in tf.all_variables():
                if v in tf.trainable_variables():
                    restore_name = variable_averages.average_name(v)
                else:
                    restore_name = v.op.name
                variables_to_restore[restore_name] = v
            ckpt = tf.train.get_checkpoint_state(FLAGS.pretrained_model_checkpoint_path)
            if ckpt and ckpt.model_checkpoint_path:
                # global_step = ckpt.model_checkpoint_path.split('/')[-1].split('-')[-1]
                restorer = tf.train.Saver(variables_to_restore)
                restorer.restore(sess, ckpt.model_checkpoint_path)
                print('%s: Pre-trained model restored from %s' %
                      (datetime.now(), ckpt.model_checkpoint_path))
                # print("variables_to_restore")
                # print(variables_to_restore)
        else:
            sess.run(init)
        # Start the queue runners.
        tf.train.start_queue_runners(sess=sess)
        summary_writer = tf.train.SummaryWriter(FLAGS.train_dir,
                                                graph_def=sess.graph)  #####graph_def=sess.graph_def)
        # tf.add_to_collection('train_op', train_op)
        for step in xrange(FLAGS.max_steps):
            start_time = time.time()
            _, loss_value, predictions = sess.run([train_op, loss, train_prediction])
            duration = time.time() - start_time
            assert not np.isnan(loss_value), 'Model diverged with loss = NaN'
            if step % 100 == 0:
                num_examples_per_step = FLAGS.batch_size
                examples_per_sec = num_examples_per_step / duration
                sec_per_batch = float(duration)
                format_str = ('%s: step %d, loss = %.2f (%.1f examples/sec; %.3f '
                              'sec/batch)')
                print (format_str % (datetime.now(), step, loss_value,
                                     examples_per_sec, sec_per_batch))
                # print('Minibatch error: %.5f%%' % error_rate(predictions, labels))
            if step % 100 == 0:
                summary_str = sess.run(summary_op)
                summary_writer.add_summary(summary_str, step)
            # Save the model checkpoint periodically.
            if step % 1000 == 0 or (step + 1) == FLAGS.max_steps:
                checkpoint_path = os.path.join(FLAGS.train_dir, 'model.ckpt')
                saver.save(sess, checkpoint_path, global_step=step)
def main(argv=None):  # pylint: disable=unused-argument
    # md.maybe_download()
    # if gfile.Exists(FLAGS.train_dir):
    #     gfile.DeleteRecursively(FLAGS.train_dir)
    # gfile.MakeDirs(FLAGS.train_dir)
    train()


if __name__ == '__main__':
    tf.app.run()
When I run the code, I get errors like this:
[root@bogon md try]# pythonnew mdtbc_3.py
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
Filling queue with 4000 CIFAR images before starting to train. This will take a few minutes.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:900] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GeForce GTX 980 Ti
major: 5 minor: 2 memoryClockRate (GHz) 1.228
pciBusID 0000:01:00.0
Total memory: 6.00GiB
Free memory: 5.78GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980 Ti, pci bus id: 0000:01:00.0)
2016-08-30 17:12:48.883303: Pre-trained model restored from /root/test/INT/model.ckpt-59999
WARNING:tensorflow:When passing a `Graph` object, please use the `graph` named argument instead of `graph_def`.
Traceback (most recent call last):
File "mdtbc_3.py", line 195, in <module>
tf.app.run()
File "/usr/local/pythonnew/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "mdtbc_3.py", line 191, in main
train()
File "mdtbc_3.py", line 160, in train
_, loss_value, predictions = sess.run([train_op, loss, train_prediction])
File "/usr/local/pythonnew/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 340, in run
run_metadata_ptr)
File "/usr/local/pythonnew/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 564, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/pythonnew/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 637, in _do_run
target_list, options, run_metadata)
File "/usr/local/pythonnew/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 659, in _do_call
e.code)
tensorflow.python.framework.errors.FailedPreconditionError: Attempting to use uninitialized value conv2/weights
[[Node: conv2/weights/read = Identity[T=DT_FLOAT, _class=["loc:@conv2/weights"], _device="/job:localhost/replica:0/task:0/cpu:0"](conv2/weights)]]
Caused by op u'conv2/weights/read', defined at:
File "mdtbc_3.py", line 195, in <module>
tf.app.run()
File "/usr/local/pythonnew/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "mdtbc_3.py", line 191, in main
train()
File "mdtbc_3.py", line 77, in train
logits = md.inference(images)
File "/root/test/md try/md.py", line 272, in inference
stddev=0.1, wd=0.0)
File "/root/test/md try/md.py", line 114, in _variable_with_weight_decay
tf.truncated_normal_initializer(stddev=stddev))
File "/root/test/md try/md.py", line 93, in _variable_on_cpu
var = tf.get_variable(name, shape, initializer=initializer)
File "/usr/local/pythonnew/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 339, in get_variable
collections=collections)
File "/usr/local/pythonnew/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 262, in get_variable
collections=collections, caching_device=caching_device)
File "/usr/local/pythonnew/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 158, in get_variable
dtype=variable_dtype)
File "/usr/local/pythonnew/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 209, in __init__
dtype=dtype)
File "/usr/local/pythonnew/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 318, in _init_from_args
self._snapshot = array_ops.identity(self._variable, name="read")
File "/usr/local/pythonnew/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 609, in identity
return _op_def_lib.apply_op("Identity", input=input, name=name)
File "/usr/local/pythonnew/lib/python2.7/site-packages/tensorflow/python/ops/op_def_library.py", line 655, in apply_op
op_def=op_def)
File "/usr/local/pythonnew/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2154, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/pythonnew/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1154, in __init__
self._traceback = _extract_stack()
When I uncomment "sess.run(init)" (line 107), it runs perfectly, but with a freshly initialized model, a new one from scratch. I want to restore the variables from the checkpoint and continue my training.
Without having the rest of your code handy, I'd say that the following part is problematic:
for v in tf.all_variables():
    if v in tf.trainable_variables():
        restore_name = variable_averages.average_name(v)
    else:
        restore_name = v.op.name
    variables_to_restore[restore_name] = v
Here you specify the variables you want to restore, but you exclude some names (i.e. the v.op.name for the ones in the trainable vars). That changes the name under which the variable that throws the error is looked up (again, without the rest of the code, I cannot really say), so that one (or more) variables are not restored properly. Two approaches (which are not very sophisticated) will help you here:
1. If you do not store all variables, do an initialization first, and then restore the variables you have actually stored. This makes sure that tensors you do not really care about get initialized nonetheless.
2. TF is very efficient when it comes to storing nets. If in doubt, store all variables ...
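The first approach can be illustrated with a toy, framework-free sketch. Plain dicts stand in for the graph variables and the checkpoint, and all names (initialize_then_restore, the variable names, the values) are made up for illustration; in TensorFlow the two steps would correspond to running the init op and then calling the restoring Saver:

```python
def initialize_then_restore(variable_names, checkpoint, initial_value=0.0):
    """Give every variable a value, then overwrite those found in checkpoint."""
    # Step 1: initialize everything, so nothing is left without a value.
    values = {name: initial_value for name in variable_names}
    # Step 2: restore only what was actually stored.
    restored = []
    for name in variable_names:
        if name in checkpoint:
            values[name] = checkpoint[name]
            restored.append(name)
    return values, restored


graph_variables = ["conv1/weights", "conv2/weights", "global_step"]
stored = {"conv1/weights": 0.42, "global_step": 59999}

values, restored = initialize_then_restore(graph_variables, stored)
print(values)
print("restored:", restored)
```

The order matters: initializing after the restore would overwrite the restored values, which matches the "new model from scratch" behaviour described in the question.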