Naming TensorFlow/Keras checkpoints - tensorflow

I am following the "Text generation with an RNN" tutorial on TensorFlow (link). I have trained the model for 10 epochs, and would like to train it some more. I have already written the code that allows the model to resume training. (This resumes training starting from the most recent checkpoint -- in this case, checkpoint 10). It trains just fine. However, the saved checkpoints are overwriting the previous checkpoints. This is because when I rerun the code, the epoch number starts at 1 again. Therefore, when I have finished epochs 11 - 20, I still have only 10 checkpoints (1 - 10), but they have overwritten the previous 10 checkpoints. I would like to rename the new checkpoints to checkpoints 11 - 20, but have failed to do so. Here is the pertinent segment of the code:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch+10}")
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
EPOCHS = 10
The only difference from the original code from the TensorFlow website is that I have modified the original line
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
to
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch+10}")
However, it does not work. Here is the error:
KeyError: 'epoch+10'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "project/RNN_text_generator_finetune.py", line 102, in <module>
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])
File "/opt/miniconda3/envs/newest11142020/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 108, in _method_wrapper
return method(self, *args, **kwargs)
File "/opt/miniconda3/envs/newest11142020/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1137, in fit
callbacks.on_epoch_end(epoch, epoch_logs)
File "/opt/miniconda3/envs/newest11142020/lib/python3.8/site-packages/tensorflow/python/keras/callbacks.py", line 412, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "/opt/miniconda3/envs/newest11142020/lib/python3.8/site-packages/tensorflow/python/keras/callbacks.py", line 1249, in on_epoch_end
self._save_model(epoch=epoch, logs=logs)
File "/opt/miniconda3/envs/newest11142020/lib/python3.8/site-packages/tensorflow/python/keras/callbacks.py", line 1282, in _save_model
filepath = self._get_file_path(epoch, logs)
File "/opt/miniconda3/envs/newest11142020/lib/python3.8/site-packages/tensorflow/python/keras/callbacks.py", line 1332, in _get_file_path
raise KeyError('Failed to format this callback filepath: "{}". '
KeyError: 'Failed to format this callback filepath: "./training_checkpoints/ckpt_{epoch+10}". Reason: \'epoch+10\''
Is there any way to rename the checkpoints in the code?

You can set initial_epoch as follows when resuming training:
model.fit(...,
          initial_epoch=epoch,
          ...)
Here, initial_epoch is an integer: the epoch at which to start training. It is useful for resuming a previous training run. Say you have trained a model for 10 epochs and stopped training; when resuming, set initial_epoch to 10. Src, and an insightful discussion.
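Applied to the checkpoint naming in the question, you can keep the standard "ckpt_{epoch}" pattern and let initial_epoch supply the real epoch numbers, so the resumed run writes ckpt_11 through ckpt_20 instead of overwriting ckpt_1 through ckpt_10. A minimal sketch, assuming model and dataset are already built as in the tutorial:
import os
import tensorflow as tf

checkpoint_dir = './training_checkpoints'
# keep the plain {epoch} placeholder; Keras fills in the actual epoch number
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

# resume after the first 10 epochs: `epochs` is the final epoch index,
# so this trains epochs 11-20 and saves ckpt_11 ... ckpt_20
history = model.fit(dataset,
                    epochs=20,
                    initial_epoch=10,
                    callbacks=[checkpoint_callback])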

Related

How can I run mobiledet model successfully with the pretrained model in TF1 model zoo from TensorFlow object detection api?

I want to test the mobiledet model provided in the TF1 model zoo from the TensorFlow object detection api: tf1 object detection model zoo.
The pretrained files contain both the pb file and the ckpt files (Screenshot of ckpt files).
So, I have tried two methods to load the pretrained model for inference.
First, I tried to load tflite_graph.pb directly. I encountered the following problem; I tried changing the tf version, but that did not solve it.
The code is like this:
MODEL_DIR = '/tf_ckpts/ssdlite_mobiledet_cpu_320x320_coco_2020_05_19/'
MODEL_CHECK_FILE = os.path.join(MODEL_DIR, 'tflite_graph.pb')

graph = tf.Graph()
with graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.Open(MODEL_CHECK_FILE, 'rb') as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name='')
Traceback (most recent call last):
File "/home/zhaoxin/workspace/models-1.12.0/research/inference_demo.py", line 41, in <module>
tf.import_graph_def(graph_def, name='')
File "/home/zhaoxin/tools/miniconda3/envs/tf115/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/zhaoxin/tools/miniconda3/envs/tf115/lib/python3.6/site-packages/tensorflow_core/python/framework/importer.py", line 405, in import_graph_def
producer_op_list=producer_op_list)
File "/home/zhaoxin/tools/miniconda3/envs/tf115/lib/python3.6/site-packages/tensorflow_core/python/framework/importer.py", line 505, in _import_graph_def_internal
raise ValueError(str(e))
ValueError: NodeDef mentions attr 'exponential_avg_factor' not in Op<name=FusedBatchNormV3; signature=x:T, scale:U, offset:U, mean:U, variance:U -> y:T, batch_mean:U, batch_variance:U, reserve_space_1:U, reserve_space_2:U, reserve_space_3:U; attr=T:type,allowed=[DT_HALF, DT_BFLOAT16, DT_FLOAT]; attr=U:type,allowed=[DT_FLOAT]; attr=epsilon:float,default=0.0001; attr=data_format:string,default="NHWC",allowed=["NHWC", "NCHW"]; attr=is_training:bool,default=true>; NodeDef: {{node FeatureExtractor/MobileDetCPU/Conv/BatchNorm/FusedBatchNormV3}}. (Check whether your GraphDef-interpreting binary is up to date with your GraphDef-generating binary.).
Then, I tried to load the ckpt files to run the model.
mobiledet = 'tf_ckpts/ssdlite_mobiledet_cpu_320x320_coco_2020_05_19/'
meta_path = mobiledet + 'model.ckpt-400000.meta'
ckpt_path = mobiledet + 'model.ckpt-400000'

with tf.Session() as sess:
    saver = tf.train.import_meta_graph(meta_path)
    saver.restore(sess, ckpt_path)
    graph = tf.get_default_graph()
The error is like this:
Traceback (most recent call last):
File "/home/zhaoxin/workspace/models-1.12.0/research/tf_load.py", line 15, in <module>
saver=tf.train.import_meta_graph(meta_path)
File "/home/zhaoxin/tools/miniconda3/envs/tf115/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 1453, in import_meta_graph
**kwargs)[0]
File "/home/zhaoxin/tools/miniconda3/envs/tf115/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 1477, in _import_meta_graph_with_return_elements
**kwargs))
File "/home/zhaoxin/tools/miniconda3/envs/tf115/lib/python3.6/site-packages/tensorflow_core/python/framework/meta_graph.py", line 809, in import_scoped_meta_graph_with_return_elements
return_elements=return_elements)
File "/home/zhaoxin/tools/miniconda3/envs/tf115/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/zhaoxin/tools/miniconda3/envs/tf115/lib/python3.6/site-packages/tensorflow_core/python/framework/importer.py", line 405, in import_graph_def
producer_op_list=producer_op_list)
File "/home/zhaoxin/tools/miniconda3/envs/tf115/lib/python3.6/site-packages/tensorflow_core/python/framework/importer.py", line 501, in _import_graph_def_internal
graph._c_graph, serialized, options) # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.NotFoundError: Op type not registered 'LegacyParallelInterleaveDatasetV2' in binary running on localhost.localdomain. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
It seems that the loading errors of the two methods above are caused by a tf version mismatch, but I have tried many tf versions and failed to solve it. Has anyone successfully run the mobiledet model from the TF1 object detection model zoo?
OS: linux
TF version: tf 1.15
@Shane Zhao - are you planning on training with a custom dataset, or are you using the pretrained graph as is? To the best of my knowledge, the version of TensorFlow should only matter during training. In any case, please refer to this demo from Google in Colab - https://colab.research.google.com/github/luxonis/depthai-ml-training/blob/master/colab-notebooks/Easy_Object_Detection_Demo_Training.ipynb#scrollTo=JDddx2rPfex9
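As a further note, the NotFoundError from the second attempt explicitly says that contrib ops are registered lazily and must be accessed before the graph is imported. Below is a minimal sketch of that hint, assuming TF 1.15; which contrib module actually registers the missing 'LegacyParallelInterleaveDatasetV2' op is an assumption here, not a confirmed fix:
import tensorflow as tf

# accessing a tf.contrib module registers its ops (per the error text);
# tf.contrib.data is only a guess at where the missing dataset op lives
_ = tf.contrib.data

mobiledet = 'tf_ckpts/ssdlite_mobiledet_cpu_320x320_coco_2020_05_19/'
meta_path = mobiledet + 'model.ckpt-400000.meta'
ckpt_path = mobiledet + 'model.ckpt-400000'

with tf.Session() as sess:
    saver = tf.train.import_meta_graph(meta_path)
    saver.restore(sess, ckpt_path)
    graph = tf.get_default_graph()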

Invalid Argument Error Tensorflow Object Detection Training

I am training a TensorFlow object detection model following the TensorFlow API. I have trained many models in the past using the exact same steps. This model, however, keeps giving me the error message below. The error message references
InvalidArgumentError: image_size must contain 3 elements[4]
I searched the error and found
InvalidArgumentError: image_size must contain 3 elements[4] #3349
which shows the error and gives the solution of checking that all images are RGB. I used the code provided in that thread to check all images and found about 15 images that were not RGB. I removed those images and the corresponding xml files, recompiled the csv files and the tfrecord files, and restarted the training. I received the error message again. I then tried to start the training over without resuming from the last checkpoint and still received the error. The error does not happen on a regular basis; sometimes the model will go for several thousand steps before a failure. I have also tried removing the random crop parameter from the pipeline.config file, which had no effect.
Any help is appreciated.
Error Message:
INFO:tensorflow:global_step/sec: 2.03361
INFO:tensorflow:global step 4039: loss = 6.2836 (0.512 sec/step)
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, image_size must contain 3 elements[4]
[[Node: cond_2/RandomCropImage/sample_distorted_bounding_box/SampleDistortedBoundingBoxV2 = SampleDistortedBoundingBoxV2[T=DT_INT32, area_range=[0.1, 1], aspect_ratio_range=[0.5, 2],max_attempts=100, seed=0, seed2=0, use_image_if_no_bounding_boxes=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](cond_2/RandomCropImage/Shape, cond_2/RandomCropImage/ExpandDims, cond_2/RandomCropImage/PruneNonOverlappingBoxes/Const)]]
INFO:tensorflow:Recording summary at step 4039.
INFO:tensorflow:global step 4040: loss = 4.6984 (0.880 sec/step)
INFO:tensorflow:Finished training! Saving model to disk.
Traceback (most recent call last):
File "/floyd/object_detection/legacy/train.py", line 184, in <module>
tf.app.run()
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, inrun
_sys.exit(main(argv))
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 250, in new_func
return func(*args, **kwargs)
File "/floyd/object_detection/legacy/train.py", line 180, in main
graph_hook_fn=graph_rewriter_fn)
File "/floyd/object_detection/legacy/trainer.py", line 415, in train
saver=saver)
File "/usr/local/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 785, in train
ignore_live_threads=ignore_live_threads)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 833, in stop
ignore_live_threads=ignore_live_threads)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line389, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/training/queue_runner_impl.py", line 252, in _run
enqueue_callable()
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1244,in _single_operation_run
self._call_tf_sessionrun(None, {}, [], target_list, None)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1409,in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: image_size must contain 3 elements[4]
[[Node: cond_2/RandomCropImage/sample_distorted_bounding_box/SampleDistortedBoundingBoxV2 = SampleDistortedBoundingBoxV2[T=DT_INT32, area_range=[0.1, 1], aspect_ratio_range=[0.5, 2],max_attempts=100, seed=0, seed2=0, use_image_if_no_bounding_boxes=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](cond_2/RandomCropImage/Shape, cond_2/RandomCropImage/ExpandDims, cond_2/RandomCropImage/PruneNonOverlappingBoxes/Const)]]
Thanks in advance.
So it was the RGB image problem. I had checked the images, removed the non-RGB ones, and recreated the records, but the model was still pointing to the old records because the paths were very similar and I did not notice.
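For anyone hitting the same error, below is a minimal sketch of the kind of RGB check described above (not the exact script from the linked GitHub thread; it assumes Pillow is installed and that images_dir is the folder the new records are actually built from):
import os
from PIL import Image

images_dir = 'images/'  # hypothetical path to the dataset images
bad_images = []
for name in os.listdir(images_dir):
    if not name.lower().endswith(('.jpg', '.jpeg', '.png')):
        continue
    with Image.open(os.path.join(images_dir, name)) as img:
        if img.mode != 'RGB':  # e.g. grayscale ('L'), palette ('P'), or RGBA
            bad_images.append(name)

print('Non-RGB images to remove (along with their xml files):', bad_images)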

slim.dataset_data_provider.DatasetDataProvider with num_epochs=1 throws error

I am using the relatively new tf.slim Dataset, DatasetDataProvider pattern. The following code shows the key fragments:
with tf.Graph().as_default():
    # get the dataset split
    dataset = util.get_split(train_or_eval,
                             args.tfrecord_folder,
                             0,
                             args.eval_set_size,
                             crop_size,
                             file_pattern=file_pattern)
    features, labels = util.load_batch(dataset,
                                       batch_size=args.eval_batch_size,
                                       num_readers=10,
                                       num_epochs=1,
                                       is_training=True)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.local_variables_initializer())
        # start the queue runner
        with slim.queues.QueueRunners(sess):
            ...run some ops...
Here's the definition of load_batch:
def load_batch(dataset, batch_size=64, is_training=False,
               num_epochs=None, common_queue_capacity=256,
               common_queue_min=32, num_readers=None):
    shuffle = True
    # create the data provider
    data_provider = slim.dataset_data_provider.DatasetDataProvider(
        dataset,
        num_readers=num_readers,
        shuffle=shuffle,
        num_epochs=num_epochs,
        common_queue_capacity=common_queue_capacity,
        common_queue_min=common_queue_min,
        seed=5)
    # get the tensors from the data provider
    images, labels = data_provider.get(['image_raw', 'label'])
    # batch up some training data
    images, labels = tf.train.batch([images, labels],
                                    batch_size=batch_size,
                                    num_threads=5,
                                    allow_smaller_final_batch=True,
                                    capacity=2 * batch_size)
    return images, labels
This works fine when num_epochs=None (which according to the comments in the source means that a file of tfrecords can be read an infinite number of times), but fails when num_epochs=1. Here's the error message:
Out of range: FIFOQueue '_9_batch/fifo_queue' is closed and has insufficient elements (requested 32, current size 0)
Obviously, I need to be able to run an eval step without repeating the examples to get good accuracy and confusion matrix numbers. Any thoughts would be appreciated...
Per the request in the comments I am adding the stack trace. I am running this job in Google Cloud ML so its easiest to show it this way. The logs have a series of paired messages as follows:
Out of range: FIFOQueue '_6_batch/fifo_queue' is closed and has insufficient elements (requested 32, current size 0)
[[Node: batch = QueueDequeueUpToV2[component_types=[DT_UINT8, DT_INT64, DT_STRING, DT_STRING], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](batch/fifo_queue, batch/n)]]
[[Node: batch = QueueDequeueUpToV2[component_types=[DT_UINT8, DT_INT64, DT_STRING, DT_STRING], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](batch/fifo_queue, batch/n)]]
The final stack trace is:
"The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last): [...]
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 509, in <module>
main()
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 505, in main
run()
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 113, in run
run_eval(args)
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 285, in run_eval
is_training=True)
File "/root/.local/lib/python2.7/site-packages/trainer/util.py", line 210, in load_batch
capacity=3 * batch_size)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 872, in batch
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 665, in _batch
dequeued = queue.dequeue_up_to(batch_size, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 499, in dequeue_up_to
self._queue_ref, n=n, component_types=self._dtypes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 1402, in _queue_dequeue_up_to_v2
timeout_ms=timeout_ms, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
self._traceback = _extract_stack()
OutOfRangeError (see above for traceback): FIFOQueue '_6_batch/fifo_queue' is closed and has insufficient elements (requested 32, current size 0)
[[Node: batch = QueueDequeueUpToV2[component_types=[DT_UINT8, DT_INT64, DT_STRING, DT_STRING], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](batch/fifo_queue, batch/n)]]
To find out more about why your job exited please check the logs:
https://console.cloud.google.com/logs/viewer?...
After extensive study and reading on GitHub, many reported that eliminating this issue was a matter of making sure that the initializers for local and global variables are run at the top of the session, like this one, using the following:
tf.group(tf.local_variables_initializer(), tf.global_variables_initializer())
However, that did not fix the issue for many (including me), and I suspect for those that it did work, there were other problems leading to an empty FIFO queue.
After much reading, it appears that this is a defect for which there is no obvious fix. Several workarounds have been proposed. I was running a full cycle of train, eval, and predict. Here is the approach that worked for me:
1) On training, I set num_epochs=None. This cycles through the data an infinite number of times and, if the documentation is correct, each example is presented only once per epoch. I did spot checking to confirm this, but my dataset was too large to guarantee the docs are correct. That said, my model did not overfit. Train, test, and validation were all reasonably close in terms of accuracy.
2) On eval, I was building a 15-model ensemble and I wanted to compare the proposal selection to ground truth before submitting unlabeled data for validation. I kept an extra hold-out set from a k-fold cross-validation run and needed to be sure that each example in the hold-out set was predicted once and only once. So to make that work, I: a) set num_epochs=1, b) eliminated all calculations from the eval graph except the prediction, c) reduced the size of the eval set to ~3000 examples, d) set shuffle_batch=False, e) set the batch size so that the queue would have a few extra examples (see the sketch after this answer).
With these conditions, the queue runners did not run out of examples before my graph completed, and I got my test set predictions.
3) On predict, I used the same technique as for eval, except that I chose a batch size and number of steps exactly equal to the number of predict records. Since there was no gradient backprop, the predictions were fast enough to finish before the queue runner could kill my job.
Problem solved. Jury-rigged, but it worked. Desperation is the mother of ingenuity, or something like that!
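Here is a minimal sketch of the eval-time settings from step 2, calling the slim provider directly (same symbols as in the question; dataset comes from util.get_split, and eval_batch_size is a placeholder name):
import tensorflow as tf
slim = tf.contrib.slim

data_provider = slim.dataset_data_provider.DatasetDataProvider(
    dataset,
    num_readers=1,
    shuffle=False,   # d) no shuffling at eval time
    num_epochs=1)    # a) each example is dequeued exactly once
image, label = data_provider.get(['image_raw', 'label'])
images, labels = tf.train.batch([image, label],
                                batch_size=eval_batch_size,  # e) sized with some slack
                                num_threads=1,
                                allow_smaller_final_batch=True,
                                capacity=2 * eval_batch_size)

with tf.Session() as sess:
    # run both initializers up front; the epoch counter behind num_epochs
    # is a local variable, so the local initializer matters here
    sess.run(tf.group(tf.global_variables_initializer(),
                      tf.local_variables_initializer()))
    with slim.queues.QueueRunners(sess):
        pass  # b) run only the prediction ops here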

Receiving Negative Input Dimensions with TensorFlow MonitoredTrainingSession

I'm attempting to switch from a tf.Session() to a tf.train.MonitoredTrainingSession (all on one machine, no fancy distributed computing), but I'm getting an error that I don't fully understand.
W tensorflow/core/framework/op_kernel.cc:1148] Invalid argument: Shape [16,-1,4] has negative dimensions
E tensorflow/core/common_runtime/executor.cc:644] Executor failed to create kernel. Invalid argument: Shape [16,-1,4] has negative dimensions
[[Node: define_inputs/Placeholder = Placeholder[dtype=DT_FLOAT, shape=[16,?,4], _device="/job:local/replica:0/task:0/cpu:0"]()]]
Further down, I receive a little more information about the error:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1139, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1121, in _run_fn
status, run_metadata)
File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/contextlib.py", line 89, in __exit__
next(self.gen)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
I'm using tf.contrib.seq2seq and my input and output sequences have variable lengths e.g. x_placeholder = tf.placeholder(tf.float32, [batch_size, None, 4]).
I suspect that the queues that I'm using to read data and bucket data by sequence length are somehow failing or getting interrupted by the MonitoredTrainingSession, as I don't have this problem with a vanilla Session.
Here's the code that sets up the MonitoredTrainingSession
# create a global step
global_step = tf.contrib.framework.get_or_create_global_step()

# define graph
model = import_model(global_step)

# create a one process cluster with an in-process server
server = tf.train.Server.create_local_server()

# define hooks for writing summaries and model variables to disk
hooks = construct_training_hooks(model.summary_op)

with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=True,
                                       hooks=hooks) as monitored_sess:
    # create coordinator to handle threading
    coord = tf.train.Coordinator()
    # start threads to enqueue input minibatches for training
    threads = tf.train.start_queue_runners(sess=monitored_sess, coord=coord)
    # train
    while not monitored_sess.should_stop():
        train_op(monitored_sess, model, x_train, y_train, y_lengths_train)
    # when done, ask the threads to stop
    coord.request_stop()
    # wait for threads to finish
    coord.join(threads)
Here is how I'm creating my training hooks:
def construct_training_hooks(summary_op):
    hooks = [tf.train.StopAtStepHook(last_step=tf.flags.FLAGS.training_steps),
             tf.train.CheckpointSaverHook(checkpoint_dir=tf.flags.FLAGS.log_dir,
                                          saver=tf.train.Saver(),
                                          save_steps=10),
             tf.train.SummarySaverHook(output_dir=tf.flags.FLAGS.log_dir,
                                       summary_op=summary_op,
                                       save_steps=10)]
    return hooks

tensorflow: ValueError: GraphDef cannot be larger than 2GB

This is the error I got:
Traceback (most recent call last):
File "fully_connected_feed.py", line 387, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/home/-/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "fully_connected_feed.py", line 289, in main
run_training()
File "fully_connected_feed.py", line 256, in run_training
saver.save(sess, checkpoint_file, global_step=step)
File "/home/-/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1386, in save
self.export_meta_graph(meta_graph_filename)
File "/home/-/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1414, in export_meta_graph
graph_def=ops.get_default_graph().as_graph_def(add_shapes=True),
File "/home/-/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2257, in as_graph_def
result, _ = self._as_graph_def(from_version, add_shapes)
File "/home/-/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2220, in _as_graph_def
raise ValueError("GraphDef cannot be larger than 2GB.")
ValueError: GraphDef cannot be larger than 2GB.
I believe it is the result of this code:
weights = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="hidden1")[0]
weights = tf.scatter_nd_update(weights,indices, updates)
weights = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="hidden2")[0]
weights = tf.scatter_nd_update(weights,indices, updates)
I am not sure why my model is getting so big (15k steps and 240 MB). Any thoughts? Thanks!
It's hard to say what is happening without seeing the code, but in general TensorFlow model sizes will not increase with number of steps - they should be fixed.
If the model size is increasing with number of steps, it suggests that the computation graph is being added to on every step. For example, something like:
import tensorflow as tf

with tf.Session() as sess:
    for i in xrange(1000):
        sess.run(tf.add(1, 2))
        # or perhaps sess.run(tf.scatter_nd_update(...)) in your case
will create 3000 nodes in the graph (one for the add, one for '1', and one for '2' on every iteration). Instead, you want to define your computational graph once and run it repeatedly with something like:
import tensorflow as tf

x = tf.add(1, 2)
# or perhaps x = tf.scatter_nd_update(...) in your case

with tf.Session() as sess:
    for i in xrange(1000):
        sess.run(x)
This will have a fixed graph of 3 nodes for all 1000 (and any more) iterations. Hope that helps.
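Applied to the snippet from the question, that would look roughly like the sketch below (the variable shapes, indices, and updates are made-up stand-ins, not the asker's real values): build the scatter updates once, then only sess.run() them inside the loop.
import tensorflow as tf

# self-contained stand-ins for the "hidden1"/"hidden2" weights
with tf.variable_scope("hidden1"):
    w1 = tf.get_variable("weights", shape=[4, 4], initializer=tf.zeros_initializer())
with tf.variable_scope("hidden2"):
    w2 = tf.get_variable("weights", shape=[4, 4], initializer=tf.zeros_initializer())

indices = tf.placeholder(tf.int32, shape=[None, 2])
updates = tf.placeholder(tf.float32, shape=[None])

# the update ops are created exactly once, so the graph stays a fixed size
update_w1 = tf.scatter_nd_update(w1, indices, updates)
update_w2 = tf.scatter_nd_update(w2, indices, updates)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1000):
        sess.run([update_w1, update_w2],
                 feed_dict={indices: [[0, 0], [1, 1]], updates: [0.5, -0.5]})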