Tensorflow batch training OutOfRangeError - tensorflow

Saving variables
Variables saved in 0.88 seconds
Saving metagraph
Metagraph saved in 35.81 seconds
Saving variables
Variables saved in 0.95 seconds
Saving metagraph
Metagraph saved in 33.20 seconds
Traceback (most recent call last):
Caused by op u'batch', defined at:
File "ava_train.py", line 155, in <module>
image_batch, label_batch = tf.train.batch([image, label], batch_size=batch_size, allow_smaller_final_batch=True)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 872, in batch
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 665, in _batch
dequeued = queue.dequeue_up_to(batch_size, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 510, in dequeue_up_to
self._queue_ref, n=n, component_types=self._dtypes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 1402, in _queue_dequeue_up_to_v2
timeout_ms=timeout_ms, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2395, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1264, in __init__
self._traceback = _extract_stack()
OutOfRangeError (see above for traceback): FIFOQueue '_1_batch/fifo_queue' is closed and has insufficient elements (requested 100, current size 0)
[[Node: batch = QueueDequeueUpToV2[component_types=[DT_FLOAT, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](batch/fifo_queue, batch/n)]]
my code is here
with tf.Graph().as_default():
global_step = tf.Variable(0, trainable=False)
# process same as cifar10.distorted_inputs
log_dir = '../log'
model_dir = '../model'
max_num_epoch = 80
if not os.path.exists(log_dir):
os.makedirs(log_dir)
if not os.path.exists(model_dir):
os.makedirs(model_dir)
num_train_example = len(os.listdir('../images/'))
# Reads pfathes of images together with their labels
image_list, label_list = read_labeled_image_list('../raw.txt')
images = ops.convert_to_tensor(image_list, dtype=dtypes.string)
labels = ops.convert_to_tensor(label_list, dtype=dtypes.int32)
# Makes an input queue
# input_queue = tf.train.slice_input_producer([images, labels], num_epochs=max_num_epoch, shuffle=True)
input_queue = tf.train.slice_input_producer([images, labels], shuffle=True)
image, label = read_images_from_disk(input_queue)
image_size = 240
keep_probability = 0.8
weight_decay = 5e-5
image = preprocess(image, image_size, image_size, None)
batch_size = 100
epoch_size = 1000
embedding_size = 128
# Optional Image and Label Batching
image_batch, label_batch = tf.train.batch([image, label], batch_size=batch_size, allow_smaller_final_batch=True)
This is the output of training an image classification model based on 20w images. I set allow_smaller_final_batch=True in batch. After some epochs the OutOfRangeError occured.
I don't know the reason and thanks for the help.

Since you get a OutOfRangeError it could be that you are training for more epochs than max_num_epochs, which will result in the slice_input_producer throwing this exception.
One possible workaround would be to remove the num_epochs=max_num_epochs from your slice_input_producer since this will allow it to produce even after the maximum number of epochs has been reached.

I have battled with this particular error for days. I finally found the cause. You are getting this error because your file is corrupted somewhere. Try running this code on another train and test data

Related

Incompatible shapes while using triplet loss and pre-trained resnet

I am trying to use pre-trained resnet and fine-tune it using triplet loss. The following code I came up with is a combination of tutorials I found on the topic:
import pathlib
import tensorflow as tf
import tensorflow_addons as tfa
with tf.device('/cpu:0'):
INPUT_SHAPE = (32, 32, 3)
BATCH_SIZE = 16
data_dir = pathlib.Path('/home/user/dataset/')
base_model = tf.keras.applications.ResNet50V2(
weights='imagenet',
pooling='avg',
include_top=False,
input_shape=INPUT_SHAPE,
)
# following two lines are added after edit, originally it was model = base_model
head_model = tf.keras.layers.Lambda(lambda x: tf.math.l2_normalize(x, axis=1))(base_model.output)
model = tf.keras.Model(inputs=base_model.input, outputs=head_model)
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
rotation_range=10,
zoom_range=0.1,
)
generator = datagen.flow_from_directory(
data_dir,
target_size=INPUT_SHAPE[:2],
batch_size=BATCH_SIZE,
seed=42,
)
model.compile(
optimizer=tf.keras.optimizers.Adam(0.001),
loss=tfa.losses.TripletSemiHardLoss(),
)
model.fit(
generator,
epochs=5,
)
Unfortunately after running the code I get the following error:
Found 4857 images belonging to 83 classes.
Epoch 1/5
Traceback (most recent call last):
File "ReID/external_process.py", line 35, in <module>
model.fit(
File "/home/user/videolytics/venv_python/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 108, in _method_wrapper
return method(self, *args, **kwargs)
File "/home/user/videolytics/venv_python/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1098, in fit
tmp_logs = train_function(iterator)
File "/home/user/videolytics/venv_python/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
result = self._call(*args, **kwds)
File "/home/user/videolytics/venv_python/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 840, in _call
return self._stateless_fn(*args, **kwds)
File "/home/user/videolytics/venv_python/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2829, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/home/user/videolytics/venv_python/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1843, in _filtered_call
return self._call_flat(
File "/home/user/videolytics/venv_python/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1923, in _call_flat
return self._build_call_outputs(self._inference_function.call(
File "/home/user/videolytics/venv_python/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 545, in call
outputs = execute.execute(
File "/home/user/videolytics/venv_python/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 1328 values, but the requested shape has 16
[[{{node TripletSemiHardLoss/PartitionedCall/Reshape}}]] [Op:__inference_train_function_13749]
Function call stack:
train_function
2020-10-23 22:07:09.094736: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
[[{{node PyFunc}}]]
The dataset directory has 83 subdirectories, one per class and each of this subdirectories contains images of given class. The dimension 1328 in the error output is the batch size (16) times number of classes (83), and the dimension 16 is the batch size (both dimensions change accordingly if I change the BATCH_SIZE.
To be honest I do not really understand the error, so any solution or even any kind of indight where is the problem is deeply appreciated.
The problem is that the TripletSemiHardLoss expects
labels y_true to be provided as 1-D integer Tensor with shape [batch_size] of multi-class integer labels
but the flow_from_directory by default generate categorical labels; using class_mode="sparse" should fix the problem.

TensorFlow tfrecord edit (read and then write to file) OutOfRange error

In src/datasets/h36m_edit.py:
with tf.Session() as sess:
reader = tf.TFRecordReader()
coder = ImageCoder()
fqueue = tf.train.string_input_producer(files, num_epochs=1, shuffle=False, name="input")
_, example_serialized = reader.read(fqueue)
sess.run(tf.local_variables_initializer())
sess.run(tf.global_variables_initializer())
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord)
fidx = 0
total_imgs = 0
image, image_size, label, center, fname, pose, shape, gt3d, has_smpl3d = parse_example_proto(example_serialized)
while not coord.should_stop():
fidx += 1
tf_filename = out_path% fidx
print('Starting tfrecord file %s \n' % tf_filename)
with tf.python_io.TFRecordWriter(tf_filename) as writer:
for i in tqdm(range(train_shards)): # min(train_shards, image_bs.shape[0])
image_v, image_size_v, label_v, center_v, fname_v, pose_v, shape_v, gt3d_v, has_smpl3d_v = sess.run(
[image, image_size, label, center, fname, pose, shape, gt3d, has_smpl3d])
image_s = coder.encode_jpeg(image_v)
example = convert_to_example_wmosh(image_s, fname_v, image_size_v[0], image_size_v[1],
label_v, center_v, gt3d_v, pose_v, shape_v)
writer.write(example.SerializeToString())
total_imgs += 1
coord.request_stop()
coord.join(threads)
Sometimes the inner loop stops before it reaches the maximum iter limit (train_shards) 500.
100%|██████████| 500/500 [00:02<00:00, 225.07it/s]
Starting tfrecord file /home/cdeng/tf_datasets/tf_records_human36m_wjoints/train_modified/train_0011.tfrecord
96%|█████████▌| 478/500 [00:02<00:00, 225.58it/s]Starting tfrecord file /home/cdeng/tf_datasets/tf_records_human36m_wjoints/train_modified/train_0012.tfrecord
100%|██████████| 500/500 [00:02<00:00, 230.37it/s]
And when it writes to the number 625 tfrecord file, there is OutOfRange error (it supposes to finish with more than 3000 tfrecord files, cause human36m train has 1559985 images and each tfrecord contains 500 images). I guess it's because the image queue is not handled correctly, maybe the producer is too slow?
/home/cdeng/tf_datasets/tf_records_human36m_wjoints/train_modified/train_0625.tfrecord
36%|███▌ | 180/500 [00:00<00:01, 221.50it/s]2019-01-13 22:47:40.946736: W tensorflow/core/framework/op_kernel.cc:1192] Out of range: FIFOQueue '_0_input' is closed and has insufficient elements (requested 1, current size 0)
[[Node: ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input)]]
2019-01-13 22:47:40.946816: W tensorflow/core/framework/op_kernel.cc:1192] Out of range: FIFOQueue '_0_input' is closed and has insufficient elements (requested 1, current size 0)
[[Node: ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input)]]
Traceback (most recent call last):
File "/home/cdeng/star_repos/hmr/src/datasets/h36m_edit.py", line 233, in <module>
[image, image_size, label, center, fname, pose, shape, gt3d, has_smpl3d])
File "/home/cdeng/.virtualenvs/hmr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/home/cdeng/.virtualenvs/hmr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/home/cdeng/.virtualenvs/hmr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/home/cdeng/.virtualenvs/hmr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: FIFOQueue '_0_input' is closed and has insufficient elements (requested 1, current size 0)
[[Node: ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input)]]
[[Node: ParseSingleExample/ParseExample/ParseExample/_21 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_52_ParseSingleExample/ParseExample/ParseExample", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Caused by op u'ReaderReadV2', defined at:
File "/home/cdeng/star_repos/hmr/src/datasets/h36m_edit.py", line 204, in <module>
_, example_serialized = reader.read(fqueue)
File "/home/cdeng/.virtualenvs/hmr/local/lib/python2.7/site-packages/tensorflow/python/ops/io_ops.py", line 194, in read
return gen_io_ops._reader_read_v2(self._reader_ref, queue_ref, name=name)
File "/home/cdeng/.virtualenvs/hmr/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 423, in _reader_read_v2
queue_handle=queue_handle, name=name)
File "/home/cdeng/.virtualenvs/hmr/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/home/cdeng/.virtualenvs/hmr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/cdeng/.virtualenvs/hmr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
OutOfRangeError (see above for traceback): FIFOQueue '_0_input' is closed and has insufficient elements (requested 1, current size 0)
[[Node: ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input)]]
[[Node: ParseSingleExample/ParseExample/ParseExample/_21 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_52_ParseSingleExample/ParseExample/ParseExample", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Process finished with exit code 1
Problem solved.
Two comments:
tqdm could be wrong when the loop is running too fast.
when the queue is empty at the end, OutOfRange error will be thrown out, a good practice is to add exception handling as suggested in QueueRunner

Resource exhausted: OOM when allocating tensor with shape[845246,300]

I am working with a sequence to sequence language model, and after changing the code to pass custom word embedding weights to the Embeddings layer, I am receiving a OOM error when I try to train on the gpu.
Here is the relevant code:
def create_model(word_map, X_train, Y_train, vocab_size, max_length):
# define model
model = Sequential()
# get custom embedding weights as matrix
embedding_matrix = get_weights_matrix_from_word_map(word_map)
model.add(Embedding(len(word_map)+1, 300, weights=[embedding_matrix], input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, Y_train, epochs=100, verbose=2)
return model
And here is the full error log from the server:
File "/home2/slp24/thesis/UpdatedLanguageModel_7_31.py", line 335, in create_model_2
model.fit(X_train, Y_train, batch_size=32, epochs=1, verbose=2) ## prev X, y
File "/opt/python-3.4.1/lib/python3.4/site-packages/keras/models.py", line 963, in fit
validation_steps=validation_steps)
File "/opt/python-3.4.1/lib/python3.4/site-packages/keras/engine/training.py", line 1682, in fit
self._make_train_function()
File "/opt/python-3.4.1/lib/python3.4/site-packages/keras/engine/training.py", line 990, in _make_train_function
loss=self.total_loss)
File "/opt/python-3.4.1/lib/python3.4/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/opt/python-3.4.1/lib/python3.4/site-packages/keras/optimizers.py", line 466, in get_updates
m_t = (self.beta_1 * m) + (1. - self.beta_1) * g
File "/opt/python-3.4.1/lib/python3.4/site-packages/tensorflow/python/ops/math_ops.py", line 898, in binary_op_wrapper
y = ops.convert_to_tensor(y, dtype=x.dtype.base_dtype, name="y")
File "/opt/python-3.4.1/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 932, in convert_to_tensor
as_ref=False)
File "/opt/python-3.4.1/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 1022, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/opt/python-3.4.1/lib/python3.4/site-packages/tensorflow/python/ops/gradients_impl.py", line 100, in _IndexedSlicesToTensor
value.values, value.indices, value.dense_shape[0], name=name)
File "/opt/python-3.4.1/lib/python3.4/site-packages/tensorflow/python/ops/gen_math_ops.py", line 5186, in unsorted_segment_sum
num_segments=num_segments, name=name)
File "/opt/python-3.4.1/lib/python3.4/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/opt/python-3.4.1/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
op_def=op_def)
File "/opt/python-3.4.1/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 1625, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[845246,300] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: training/Adam/mul_2/y = UnsortedSegmentSum[T=DT_FLOAT, Tindices=DT_INT32, Tnumsegments=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/Adam/gradients/embedding_1/Gather_grad/Reshape, training/Adam/gradients/embedding_1/Gather_grad/Reshape_1/_101, training/Adam/mul_2/strided_slice)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Edit:
So far I have tried
Adding batching, started with batch_size=32
I am currently working to decrease the number of output classes from 845,286. I think something went wrong when I calculated the custom embedding matrix, specifically when I was "connecting" the vocabulary token index's assigned during preprocessing and the y_categorical values assigned by Keras that the model uses...
Any help or guidance is greatly appreciated! I have searched many similar issued but have not been able to apply those fixes to my code thus far. Thank you
You're exceeding the memory size of your GPU.
You can:
Train/Predict with smaller batches
Or, if even a batch_size=1 is too much, you need a model with less parameters.
Hint, the length in that tensor (845246) is really really big. Is that the correct length?
I had the same problem with Google Colab GPU
The batch size was 64 and this error has appeared and after I reduced the batch size to 32 it worked properly

OutOfRangeError when creating batch from tfrecord file

I'm writing a script that saves certain features of my data to tfrecord. The features are numpy arrays (float32). When I read the tfrecord file I get the following error:
OutOfRangeError (see above for traceback): RandomShuffleQueue '_1_shuffle_batch/random_shuffle_queue' is closed and has insufficient elements (requested 20, current size 0)
[[Node: shuffle_batch = QueueDequeueManyV2[component_types=[DT_UINT8, DT_UINT8], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](shuffle_batch/random_shuffle_queue, shuffle_batch/n)]]
I searched a lot, and apparently this error can be caused by different things. So far, I was not able to fix it. I recreated the problem with the following minimal code:
saving the toy data:
def _bytes_feature(value):
return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
writer = tf.python_io.TFRecordWriter('stuff.tfrecords')
for i in range(100):
seq = np.random.uniform(size=(500,300)).astype(np.float32)
lbl = np.random.uniform(size=(90,1)).astype(np.float32)
feature = {'train/lbl': _bytes_feature(tf.compat.as_bytes(lbl.tostring())),
'train/seq': _bytes_feature(tf.compat.as_bytes(seq.tostring()))}
example = tf.train.Example(features=tf.train.Features(feature=feature))
writer.write(example.SerializeToString())
writer.close()
sys.stdout.flush()
Reading the data:
def read_and_decode_single_example(filename):
filename_queue = tf.train.string_input_producer([filename], num_epochs=1)
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)
f = {'train/lbl': tf.FixedLenFeature([], tf.string),
'train/seq': tf.FixedLenFeature([], tf.string)}
features = tf.parse_single_example(serialized_example, features=f)
seq = tf.decode_raw(features['train/seq'], tf.float32)
lbl = tf.decode_raw(features['train/lbl'], tf.float32)
seq = tf.reshape(seq, [ 500,300 ])
lbl = tf.reshape(lbl, [ 90,1 ])
sbatch, lbatch = tf.train.shuffle_batch([seq, lbl],
batch_size= batch_size,
capacity=3*batch_size,
min_after_dequeue=batch_size)
return sbatch, lbatch
sbatch, lbatch = read_and_decode_single_example("stuff.tfrecords" )
with tf.Session() as sess:
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
s,l = sess.run([sbatch, lbatch])
coord.request_stop()
coord.join(threads)
I'm using Tensorflow-GPU v. 1.4.0.
Here is some error code, that may be informative:
Caused by op 'shuffle_batch', defined at:
File "teststuff.py", line 59, in <module>
sbatch, lbatch = read_and_decode_single_example("stuff.tfrecords" )
File "teststuff.py", line 54, in read_and_decode_single_example
min_after_dequeue=batch_size)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/input.py", line 1225, in shuffle_batch
name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/input.py", line 796, in _shuffle_batch
dequeued = queue.dequeue_many(batch_size, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 464, in dequeue_many
self._queue_ref, n=n, component_types=self._dtypes, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 2418, in _queue_dequeue_many_v2
component_types=component_types, timeout_ms=timeout_ms, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

Prettytensor: Attempting to use uninitialized value

I'm following these tutorials:
https://www.youtube.com/watch?v=wuo4JdG3SvU&list=PL9Hr9sNUjfsmEu1ZniY0XpHSzl5uihcXZ
and prettytensor is introduced in tutorial 4.
Following the tutorial, i wrote this code to run a small neural network:
import tensorflow as tf
# Use PrettyTensor to simplify Neural Network construction.
import prettytensor as pt
from tensorflow.examples.tutorials.mnist import input_data
data = input_data.read_data_sets('../data/MNIST/', one_hot=True)
# We know that MNIST images are 28 pixels in each dimension.
img_size = 28
# Images are stored in one-dimensional arrays of this length.
img_size_flat = img_size * img_size
# Tuple with height and width of images used to reshape arrays.
img_shape = (img_size, img_size)
# Number of colour channels for the images: 1 channel for gray-scale.
num_channels = 1
# Number of classes, one class for each of 10 digits.
num_classes = 10
# the placeholders
x = tf.placeholder(tf.float32, shape=[None, img_size_flat], name='x')
x_image = tf.reshape(x, [-1, img_size, img_size, num_channels])
y_true = tf.placeholder(tf.float32, shape=[None, 10], name='y_true')
# use prettyTensor to build the model
# this will give us the predictions and the loss functions
x_pretty = pt.wrap(x_image)
with pt.defaults_scope(activation_fn=tf.nn.relu):
y_pred, loss = x_pretty.\
conv2d(kernel=5, depth=16, name='layer_conv1').\
max_pool(kernel=2, stride=2).\
conv2d(kernel=5, depth=36, name='layer_conv2').\
max_pool(kernel=2, stride=2).\
flatten().\
fully_connected(size=128, name='layer_fc1').\
softmax_classifier(class_count=10, labels=y_true)
# the model optimizer
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(loss)
# the model testing
correct_prediction = tf.equal(tf.argmax(y_pred,1), tf.argmax(y_true,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
# start the session
session = tf.InteractiveSession()
# Start the training
tf.global_variables_initializer().run(session = session)
train_batch_size = 64
for i in range(1000):
print("training batch ",i)
x_batch, y_true_batch = data.train.next_batch(train_batch_size)
session.run(optimizer, feed_dict={x:x_batch, y_true:y_true_batch})
When i tried to run it, I got the following error:
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value layer_conv1/bias
[[Node: layer_conv1/bias/read = Identity[T=DT_FLOAT, _class=["loc:#layer_conv1/bias"], _device="/job:localhost/replica:0/task:0/cpu:0"](layer_conv1/bias)]]
Caused by op u'layer_conv1/bias/read', defined at:
File "/home/gal/Documents/Workspace/EclipseWorkspace/Melanoma Classification!/tutorial4/tutorial4Test.py", line 31, in <module>
the full error trace:
Traceback (most recent call last):
File "/home/gal/Documents/Workspace/EclipseWorkspace/Melanoma Classification!/tutorial4/tutorial4Test.py", line 55, in <module>
session.run(optimizer, feed_dict={x:x_batch, y_true:y_true_batch})
File "/home/gal/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/home/gal/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/home/gal/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/home/gal/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value layer_conv1/bias
[[Node: layer_conv1/bias/read = Identity[T=DT_FLOAT, _class=["loc:#layer_conv1/bias"], _device="/job:localhost/replica:0/task:0/cpu:0"](layer_conv1/bias)]]
Caused by op u'layer_conv1/bias/read', defined at:
File "/home/gal/Documents/Workspace/EclipseWorkspace/Melanoma Classification!/tutorial4/tutorial4Test.py", line 31, in <module>
conv2d(kernel=5, depth=16, name='layer_conv1').\
File "/home/gal/anaconda2/lib/python2.7/site-packages/prettytensor/pretty_tensor_class.py", line 1981, in method
result = func(non_seq_layer, *args, **kwargs)
File "/home/gal/anaconda2/lib/python2.7/site-packages/prettytensor/pretty_tensor_image_methods.py", line 163, in __call__
y += self.variable('bias', [size[-1]], bias_init, dt=dtype)
File "/home/gal/anaconda2/lib/python2.7/site-packages/prettytensor/pretty_tensor_class.py", line 1695, in variable
collections=variable_collections)
File "/home/gal/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1024, in get_variable
custom_getter=custom_getter)
File "/home/gal/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 850, in get_variable
custom_getter=custom_getter)
File "/home/gal/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 346, in get_variable
validate_shape=validate_shape)
File "/home/gal/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 331, in _true_getter
caching_device=caching_device, validate_shape=validate_shape)
File "/home/gal/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 677, in _get_single_variable
expected_shape=shape)
File "/home/gal/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 224, in __init__
expected_shape=expected_shape)
File "/home/gal/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 370, in _init_from_args
self._snapshot = array_ops.identity(self._variable, name="read")
File "/home/gal/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1424, in identity
result = _op_def_lib.apply_op("Identity", input=input, name=name)
File "/home/gal/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/home/gal/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/gal/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
self._traceback = _extract_stack()
FailedPreconditionError (see above for traceback): Attempting to use uninitialized value layer_conv1/bias
[[Node: layer_conv1/bias/read = Identity[T=DT_FLOAT, _class=["loc:#layer_conv1/bias"], _device="/job:localhost/replica:0/task:0/cpu:0"](layer_conv1/bias)]]
So my question is, How can i solve this error?
This problem is caused by a bug in the 0.12rc0 release candidate of TensorFlow, and the fact that Pretty Tensor uses a deprecated TensorFlow API (for which I've opened an issue).
Until this bug is fixed, the best workaround I can think of is a hack. Add the following line at the top of your program, after import tensorflow as tf:
tf.GraphKeys.VARIABLES = tf.GraphKeys.GLOBAL_VARIABLES