TensorFlow: ValueError: GraphDef cannot be larger than 2GB

This is the error I got:
Traceback (most recent call last):
  File "fully_connected_feed.py", line 387, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/home/-/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "fully_connected_feed.py", line 289, in main
    run_training()
  File "fully_connected_feed.py", line 256, in run_training
    saver.save(sess, checkpoint_file, global_step=step)
  File "/home/-/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1386, in save
    self.export_meta_graph(meta_graph_filename)
  File "/home/-/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1414, in export_meta_graph
    graph_def=ops.get_default_graph().as_graph_def(add_shapes=True),
  File "/home/-/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2257, in as_graph_def
    result, _ = self._as_graph_def(from_version, add_shapes)
  File "/home/-/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2220, in _as_graph_def
    raise ValueError("GraphDef cannot be larger than 2GB.")
ValueError: GraphDef cannot be larger than 2GB.
I believe it is caused by this code:
weights = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="hidden1")[0]
weights = tf.scatter_nd_update(weights, indices, updates)
weights = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="hidden2")[0]
weights = tf.scatter_nd_update(weights, indices, updates)
I am not sure why my model is getting so big (15k steps and 240MB). Any thoughts? Thanks!

It's hard to say what is happening without seeing the code, but in general TensorFlow model sizes do not increase with the number of steps; they should be fixed.
If the model size is increasing with the number of steps, it suggests that the computation graph is being added to on every step. For example, something like:
import tensorflow as tf
with tf.Session() as sess:
    for i in range(1000):
        # tf.add builds new graph nodes each time it is called
        sess.run(tf.add(1, 2))
        # or perhaps sess.run(tf.scatter_nd_update(...)) in your case
will create 3000 nodes in the graph (one for add, one for '1', and one for '2' on every iteration). Instead, define your computational graph once and run it repeatedly, with something like:
import tensorflow as tf
# build the op once, outside the loop
x = tf.add(1, 2)
# or perhaps x = tf.scatter_nd_update(...) in your case
with tf.Session() as sess:
    for i in range(1000):
        sess.run(x)
This will have a fixed graph of 3 nodes for all 1000 (or more) iterations. Hope that helps.

Related

Can't save YOLOv4 model because of array shape mismatch

I am able to run transfer learning on YOLOv4 and my custom dataset with the following command (which runs successfully and can identify test images I present to the model):
!./darknet detector train /content/darknet/build/darknet/x64/data/obj.data /content/darknet/build/darknet/x64/cfg/yolov4_train.cfg /content/darknet/build/darknet/x64/yolov4.conv.137 -dont_show
I am using the save_model.py tool from this GitHub repository:
!git clone https://github.com/hunglc007/tensorflow-yolov4-tflite
When I enter the following command to save the model it fails:
!python3 save_model.py --weights /content/darknet/build/darknet/x64/backup/yolov4_train_final.weights --output ./checkpoints/yolov4-224 --input_size 224
The failure is a mismatch between the weights saved in training and the expected array shape in the core/utility module utils.py (line 63):
Traceback (most recent call last):
  File "save_model.py", line 58, in <module>
    app.run(main)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "save_model.py", line 54, in main
    save_tf()
  File "save_model.py", line 49, in save_tf
    utils.load_weights(model, FLAGS.weights, FLAGS.model, FLAGS.tiny)
  File "/content/tensorflow-yolov4-tflite/core/utils.py", line 65, in load_weights
    conv_weights = conv_weights.reshape(conv_shape).transpose([2, 3, 1, 0])
ValueError: cannot reshape array of size 4554552 into shape (1024,512,3,3)
I added a debug print, and it looks like it gets all the way to the last layer before choking. In other words, the previous layers all get through this line of code in utils.py with a match between the saved weights and the array shape. I think this is somehow related to the fact that I'm using image sizes of 224,224,3 instead of 416,416,3, but I did specify that in the input_size. For completeness, here are the last couple of debug prints before the Traceback above:
layer (out_dim, in_dim, height, width) 107 512 1024 1 1
layer (out_dim, in_dim, height, width) 108 1024 512 3 3
If anyone has any ideas, that would be great!
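A quick arithmetic check on the numbers in the error, as a sketch only. The first part uses only figures from the post. The second part rests on assumptions that are not stated in the post: that the converter builds its detection heads for 80 classes, that each head is a 1x1 convolution with 3 * (classes + 5) filters plus biases, and that the two heads read before layer 108 see 256 and 512 input channels. The names head_filters and predicted_shortfall are hypothetical helpers.

```python
# Figures taken from the traceback and the debug print above.
expected = 1024 * 512 * 3 * 3   # elements needed for shape (1024, 512, 3, 3)
available = 4554552             # elements actually read from the weights file
shortfall = expected - available

# ASSUMPTIONS (not from the post): 80-class heads in the converter, and
# 1x1 head convolutions that see 256, 512 and 1024 input channels.
ASSUMED_CLASSES = 80
HEAD_INPUTS_BEFORE_108 = (256, 512)
LAST_HEAD_INPUT = 1024


def head_filters(num_classes):
    """Filters in one detection head: 3 anchors * (classes + 5)."""
    return 3 * (num_classes + 5)


def predicted_shortfall(trained_classes):
    """Elements missing at layer 108 if the file was trained with fewer classes."""
    extra_filters = head_filters(ASSUMED_CLASSES) - head_filters(trained_classes)
    # Each surplus filter consumes its weights plus one bias too many.
    over_read = extra_filters * sum(c + 1 for c in HEAD_INPUTS_BEFORE_108)
    # The final head is still unread in the file and partly covers the gap.
    unread_tail = head_filters(trained_classes) * (LAST_HEAD_INPUT + 1)
    return over_read - unread_tail


matches = [n for n in range(1, ASSUMED_CLASSES + 1)
           if predicted_shortfall(n) == shortfall]
print(shortfall, matches)  # 164040 [1]
```

If those assumptions hold, the shortfall is exactly what a single-class weights file would produce when read as an 80-class model, which would point at the class list the converter is configured with rather than at --input_size. Treat this as a hypothesis to check, not a diagnosis.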

Camera Digit Prediction stopped working after moving to python 3.7, anyone know why?

I moved my code from an interpreter based on Python 3.9 and tensorflow to one based on Python 3.7 and tensorflow-directml (so I could use my AMD GPU). The training part worked fine when I copied over the code, but when running the model I suddenly get an error complaining about the shape of the input array to my neural network. The error does not occur with the first interpreter but does with the second one, even though the code is identical.
(The shape of the digit array is the same in both versions: (1, 28, 28), a binary image.)
def cam_predict_digits(cam):
    dig = np.zeros((1, 28, 28))
    dig[0, :, :] = np.array(cam)
    digit = np.array(dig)
    print("predict input shape: " + str(digit.shape))
    # Make prediction
    prediction = model.predict(digit)
    print(prediction)
    print(f'Detected is probably: {np.argmax(prediction)}')
Traceback (most recent call last):
  File "C:/Z_Uni/Individual_Project/Python_Projects/NeuralNet_GPU/Conv_NN_GPU_Model.py", line 123, in <module>
    cam_predict_digits(Processed_Frame)
  File "C:/Z_Uni/Individual_Project/Python_Projects/NeuralNet_GPU/Conv_NN_GPU_Model.py", line 74, in cam_predict_digits
    prediction = model.predict(digit)
  File "C:\Z_Uni\Individual_Project\Python_Projects\NeuralNet_GPU\source\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 908, in predict
    use_multiprocessing=use_multiprocessing)
  File "C:\Z_Uni\Individual_Project\Python_Projects\NeuralNet_GPU\source\lib\site-packages\tensorflow_core\python\keras\engine\training_arrays.py", line 716, in predict
    x, check_steps=True, steps_name='steps', steps=steps)
  File "C:\Z_Uni\Individual_Project\Python_Projects\NeuralNet_GPU\source\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 2471, in _standardize_user_data
    exception_prefix='input')
  File "C:\Z_Uni\Individual_Project\Python_Projects\NeuralNet_GPU\source\lib\site-packages\tensorflow_core\python\keras\engine\training_utils.py", line 563, in standardize_input_data
    'with shape ' + str(data_shape))
ValueError: Error when checking input: expected conv2d_input to have 4 dimensions, but got array with shape (1, 28, 28)
Process finished with exit code 1
Could anyone explain why this is happening and what I can do to fix it? Thanks
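The message says the first layer expects four dimensions, which for a Conv2D input normally means (batch, height, width, channels). Below is a minimal sketch of adding the missing channel axis with NumPy. It assumes, which the post does not state, that the model was built for 28x28 single-channel images; frame is a hypothetical stand-in for the processed camera frame.

```python
import numpy as np

# Hypothetical stand-in for the processed 28x28 binary camera frame.
frame = np.zeros((28, 28))

# Same construction as in cam_predict_digits: shape (1, 28, 28).
digit = np.zeros((1, 28, 28))
digit[0, :, :] = frame

# Append a trailing channel axis so the array has 4 dimensions.
digit_4d = np.expand_dims(digit, axis=-1)
print(digit_4d.shape)  # (1, 28, 28, 1)
```

Under that assumption, digit_4d is what would be passed to model.predict. One plausible but unverified explanation for the difference between the two interpreters is that the older Keras code path seen in the traceback checks the input rank more strictly than the newer one.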

Naming TensorFlow/Keras checkpoints

I am following the "Text generation with an RNN" tutorial on the TensorFlow website. I have trained the model for 10 epochs, and would like to train it some more. I have already written the code that allows the model to resume training. (This resumes training starting from the most recent checkpoint -- in this case, checkpoint 10.) It trains just fine. However, the newly saved checkpoints overwrite the previous ones. This is because when I rerun the code, the epoch number starts at 1 again. Therefore, when I have finished epochs 11 - 20, I still have only 10 checkpoints (1 - 10), and they have overwritten the previous 10 checkpoints. I would like to name the new checkpoints 11 - 20, but have failed to do so. Here is the pertinent segment of the code:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch+10}")
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
EPOCHS = 10
The only difference from the original code from the TensorFlow website is that I have modified the original line
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
to
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch+10}")
However, it does not work. Here is the error:
KeyError: 'epoch+10'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "project/RNN_text_generator_finetune.py", line 102, in <module>
    history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])
  File "/opt/miniconda3/envs/newest11142020/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 108, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/opt/miniconda3/envs/newest11142020/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1137, in fit
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/opt/miniconda3/envs/newest11142020/lib/python3.8/site-packages/tensorflow/python/keras/callbacks.py", line 412, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/opt/miniconda3/envs/newest11142020/lib/python3.8/site-packages/tensorflow/python/keras/callbacks.py", line 1249, in on_epoch_end
    self._save_model(epoch=epoch, logs=logs)
  File "/opt/miniconda3/envs/newest11142020/lib/python3.8/site-packages/tensorflow/python/keras/callbacks.py", line 1282, in _save_model
    filepath = self._get_file_path(epoch, logs)
  File "/opt/miniconda3/envs/newest11142020/lib/python3.8/site-packages/tensorflow/python/keras/callbacks.py", line 1332, in _get_file_path
    raise KeyError('Failed to format this callback filepath: "{}". '
KeyError: 'Failed to format this callback filepath: "./training_checkpoints/ckpt_{epoch+10}". Reason: \'epoch+10\''
Is there any way to rename the checkpoints in the code?
You can set initial_epoch as follows when resuming training:
model.fit(...,
          initial_epoch=epoch,
          ...)
Here, initial_epoch is an integer: the epoch at which to start training (useful for resuming a previous training run). Say you trained a model through epoch 10 and then stopped. When resuming the training, set initial_epoch to 10. Note that epochs is then interpreted as the index of the final epoch rather than the number of epochs to run, so pass epochs=20 to train epochs 11 through 20. Source, and an insightful discussion.
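To see why the "{epoch+10}" template fails and how initial_epoch changes the file names, here is a small pure-Python sketch. It assumes the callback fills the template with str.format and a 1-based epoch number, which is what the KeyError in the traceback suggests; checkpoint_names is a hypothetical helper, not a Keras function.

```python
def checkpoint_names(template, initial_epoch, epochs):
    """Hypothetical helper: names a run from initial_epoch up to epochs would produce."""
    # Epochs are counted from 0 internally and formatted as 1-based numbers.
    return [template.format(epoch=epoch + 1)
            for epoch in range(initial_epoch, epochs)]


# str.format treats 'epoch+10' as a field name, not as an expression.
try:
    "ckpt_{epoch+10}".format(epoch=1)
    error = None
except KeyError as exc:
    error = exc.args[0]
print(error)  # epoch+10

# Resuming after 10 finished epochs: epochs is the final epoch, not a count.
resumed = checkpoint_names("ckpt_{epoch}", initial_epoch=10, epochs=20)
print(resumed[0], resumed[-1])  # ckpt_11 ckpt_20
```

With initial_epoch=10 and epochs=10 the range is empty, so under the same assumption no epochs would run and no checkpoints would be written.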

'MemoryError' when padding sequences using tensorflow

I am trying to train my model on an AWS 'g2.2xlarge' instance but get a 'MemoryError' when padding my sequences.
content_array = keras.preprocessing.sequence.pad_sequences(content_array, maxlen=max_sequence_length,
                                                           padding='post')
Getting this error:
Traceback (most recent call last):
  File "trainer.py", line 185, in <module>
    train()
  File "trainer.py", line 52, in train
    padding='post')
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/preprocessing/sequence.py", line 94, in pad_sequences
    x = (np.ones((num_samples, maxlen) + sample_shape) * value).astype(dtype)
MemoryError
Any idea why? I haven't even started training the model.
I was calculating the maximum sequence length incorrectly, which led to a huge number. After correcting it, I no longer have any issues.
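The failing line in the traceback allocates a dense array of num_samples by maxlen elements, so an inflated maximum length translates directly into memory. Here is a rough sketch of that arithmetic, with made-up sizes (the post does not give the real ones) and assuming 8 bytes per element for the initial np.ones buffer, which defaults to float64. The helper names are hypothetical.

```python
def padded_bytes(num_samples, maxlen, bytes_per_element=8):
    """Approximate size in bytes of a dense (num_samples, maxlen) array."""
    return num_samples * maxlen * bytes_per_element


def gib(num_bytes):
    """Convert bytes to GiB."""
    return num_bytes / float(1024 ** 3)


# Hypothetical numbers for illustration only.
num_samples = 100000
sane_maxlen = 500        # e.g. a realistic cap on sequence length
buggy_maxlen = 2000000   # e.g. a length computed over the wrong thing

print(round(gib(padded_bytes(num_samples, sane_maxlen)), 2))   # 0.37
print(round(gib(padded_bytes(num_samples, buggy_maxlen)), 2))  # 1490.12
```

Printing max_sequence_length and the number of sequences before calling pad_sequences is a cheap way to catch this kind of mistake.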

slim.dataset_data_provider.DatasetDataProvider with num_epochs=1 throws error

I am using the relatively new tf.slim Dataset, DatasetDataProvider pattern. The following code shows the key fragments:
with tf.Graph().as_default():
    # get the dataset split
    dataset = util.get_split(train_or_eval,
                             args.tfrecord_folder,
                             0,
                             args.eval_set_size,
                             crop_size,
                             file_pattern=file_pattern)
    features, labels = util.load_batch(dataset,
                                       batch_size=args.eval_batch_size,
                                       num_readers=10,
                                       num_epochs=1,
                                       is_training=True)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.local_variables_initializer())
        # start the queue runner
        with slim.queues.QueueRunners(sess):
            ...run some ops...
Here's the definition of load_batch:
def load_batch(dataset, batch_size=64, is_training=False,
               num_epochs=None, common_queue_capacity=256,
               common_queue_min=32, num_readers=None):
    shuffle = True
    # create the data provider
    data_provider = slim.dataset_data_provider.DatasetDataProvider(
        dataset,
        num_readers=num_readers,
        shuffle=shuffle,
        num_epochs=num_epochs,
        common_queue_capacity=common_queue_capacity,
        common_queue_min=common_queue_min,
        seed=5)
    # get the tensors from the data provider
    image_raw, label = data_provider.get(['image_raw', 'label'])
    # batch up some training data
    images, labels = tf.train.batch([image_raw, label],
                                    batch_size=batch_size,
                                    num_threads=5,
                                    allow_smaller_final_batch=True,
                                    capacity=2 * batch_size)
    return images, labels
This works fine when num_epochs=None (which according to the comments in the source means that a file of tfrecords can be read an infinite number of times), but fails when num_epochs=1. Here's the error message:
Out of range: FIFOQueue '_9_batch/fifo_queue' is closed and has insufficient elements (requested 32, current size 0)
Obviously, I need to be able to run an eval step without repeating the examples to get good accuracy and confusion matrix numbers. Any thoughts would be appreciated...
Per the request in the comments, I am adding the stack trace. I am running this job in Google Cloud ML, so it's easiest to show it this way. The logs have a series of paired messages as follows:
Out of range: FIFOQueue '_6_batch/fifo_queue' is closed and has
insufficient elements (requested 32, current size 0)[[Node: batch =
QueueDequeueUpToV2[component_types=[DT_UINT8, DT_INT64, DT_STRING,
DT_STRING], timeout_ms=-1,
_device="/job:localhost/replica:0/task:0/cpu:0"](batch/fifo_queue, batch/n)]]
[[Node: batch =
QueueDequeueUpToV2[component_types=[DT_UINT8, DT_INT64, DT_STRING,
DT_STRING], timeout_ms=-1,
_device="/job:localhost/replica:0/task:0/cpu:0"](batch/fifo_queue, batch/n)]]
Final Stack Trace is
"The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  [...]
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 509, in <module>
    main()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 505, in main
    run()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 113, in run
    run_eval(args)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 285, in run_eval
    is_training=True)
  File "/root/.local/lib/python2.7/site-packages/trainer/util.py", line 210, in load_batch
    capacity=3 * batch_size)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 872, in batch
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 665, in _batch
    dequeued = queue.dequeue_up_to(batch_size, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 499, in dequeue_up_to
    self._queue_ref, n=n, component_types=self._dtypes, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 1402, in _queue_dequeue_up_to_v2
    timeout_ms=timeout_ms, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
    self._traceback = _extract_stack()
OutOfRangeError (see above for traceback): FIFOQueue
'_6_batch/fifo_queue' is closed and has insufficient elements
(requested 32, current size 0) [[Node: batch =
QueueDequeueUpToV2[component_types=[DT_UINT8, DT_INT64, DT_STRING,
DT_STRING], timeout_ms=-1,
_device="/job:localhost/replica:0/task:0/cpu:0"](batch/fifo_queue, batch/n)]]
To find out more about why your job exited please check the logs:
https://console.cloud.google.com/logs/viewer?...
After extensive study and reading on GitHub, I found that many people reported that eliminating this issue was a matter of making sure that the initializers for local and global variables are run at the top of the session, using something like the following:
tf.group(tf.local_variables_initializer(), tf.global_variables_initializer())
However, that did not fix the issue for many (including me), and I suspect that for those for whom it did work, there were other problems leading to an empty FIFO queue.
After much reading, it appears that this is a defect for which there is no obvious fix. Several workarounds have been proposed. I was running a full cycle of train, eval, and predict. Here is the approach that worked for me:
1) On training, I set num_epochs=None. This cycles through the data an infinite number of times and if the documentation is correct, each example is presented only once per epoch. I did spot checking to confirm this, but my dataset was too large to guarantee the docs are correct. That said, my model did not overfit. Train, test, and validation were all reasonably close in terms of accuracy.
2) On eval, I was building a 15-model ensemble and I wanted to compare the proposal selection to ground truth before submitting unlabeled data for validation. I kept an extra hold-out set from a k-fold cross-validation run and needed to be sure that each example in the hold-out set was predicted once and only once. So to make that work, I: a) set num_epochs=1, b) eliminated all calculations from the eval graph except the prediction, c) reduced the size of the eval set to ~3000 examples, d) set shuffle_batch=False, e) set the batch size so that the queue would have a few extra examples.
With these conditions, the queue runners did not run out of examples before my graph completed, and I got my test set.
3) On predict, I used the same technique again as for eval except that I chose a batch size and number of train steps that was exactly equal to the number of predict records. Since there was no gradient back prop, the predicts were fast enough to finish before the queue runner could kill my job.
Problem solved. Jury rigged. But, it worked. Desperation is the mother of ingenuity or something like that!
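The batch-size bookkeeping in steps 2 and 3 comes down to simple arithmetic: with num_epochs=1 the queue holds each example once, so only a limited number of dequeues can succeed. Here is a sketch of that count, under the assumption that a final partial batch is delivered only when allow_smaller_final_batch=True. steps_available is a hypothetical helper, and 3000 is the approximate eval-set size mentioned above.

```python
def steps_available(num_examples, batch_size, allow_smaller_final_batch=True):
    """Hypothetical helper: how many batches one pass over the data can yield."""
    full, leftover = divmod(num_examples, batch_size)
    if leftover and allow_smaller_final_batch:
        # The remaining examples come out as one last, smaller batch.
        return full + 1
    return full


print(steps_available(3000, 32))         # 94
print(steps_available(3000, 32, False))  # 93
print(steps_available(3000, 100))        # 30
```

Running more steps than this asks the closed queue for elements it no longer has, which is the situation the OutOfRangeError above describes.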