Tensorflow can't save model - tensorflow

I encountered this weird problem... I use this code to construct the TensorFlow saver:
tf.train.Saver(tf.all_variables(), max_to_keep=FLAGS.keep)
which is supposed to be very standard. However, when I point the saving directory to my custom directory (under my username) instead of "/tmp", all of a sudden the saved model consists of files like
translate.ckpt-329.data-00000-of-00001
translate.ckpt-329.index
translate.ckpt-329.meta
I can't find the file "translate.ckpt-329".
The generated checkpoint file is pointing to:
model_checkpoint_path: "/Users/.../train_dir/translate.ckpt-329"
all_model_checkpoint_paths: "/Users/.../train_dir/translate.ckpt-329"
while this file does not exist, which creates problems when restoring my model.
Can someone shed some light on this? What could possibly be the problem?
Thanks for the first answer! I guess my bigger problem is the restore method. The original code restores a session this way:
ckpt = tf.train.get_checkpoint_state(FLAGS.train_dir)
if ckpt and tf.gfile.Exists(ckpt.model_checkpoint_path):
    logging.info("Reading model parameters from %s" % ckpt.model_checkpoint_path)
    model.saver.restore(session, ckpt.model_checkpoint_path)
else:
    logging.info("Created model with fresh parameters.")
    session.run(tf.global_variables_initializer())
which fails with V2 saving :( because tf.gfile.Exists(ckpt.model_checkpoint_path) looks for a file with that exact name, which is never written.

TL;DR: In the new checkpoint format, the "filename" that you pass to the saver is actually used as the prefix of several filenames, and no file with that exact name is written. You can use the old checkpoint format by constructing your tf.train.Saver with the optional argument write_version=tf.train.SaverDef.V1.
From the names of the saved files, it appears that you are using the "V2" checkpoint format, which became the default in TensorFlow 0.12. This format stores the checkpoint data in multiple files: one or more data files (e.g. translate.ckpt-329.data-00000-of-00001 in your case) and an index file (translate.ckpt-329.index) that tells TensorFlow where each saved variable is located in the data files. The tf.train.Saver uses the "filename" that you pass as the prefix for these files' names, but doesn't produce a file with that exact name.
Although there is no file with the exact name you gave, you can use the value returned from saver.save() as the argument to a subsequent saver.restore(), and the other checkpoint locating mechanisms should continue to work as before.
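To make that concrete, here is a hedged sketch of both routes, reusing FLAGS, model and session from the question's code; tf.train.checkpoint_exists is assumed to be available (it is exported in the TF 1.x releases) as a prefix-aware replacement for the tf.gfile.Exists check:
# Option 1: keep writing the old single-file format.
saver = tf.train.Saver(tf.all_variables(), max_to_keep=FLAGS.keep,
                       write_version=tf.train.SaverDef.V1)

# Option 2: keep the V2 format and test for the checkpoint prefix instead of an exact file.
ckpt = tf.train.get_checkpoint_state(FLAGS.train_dir)
if ckpt and tf.train.checkpoint_exists(ckpt.model_checkpoint_path):
    model.saver.restore(session, ckpt.model_checkpoint_path)
else:
    session.run(tf.global_variables_initializer())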

Related

Estimator's model_fn includes params argument, but params are not passed to Estimator

I'm trying to run Object Detection API locally.
I believe I have everything set up as described in the TensorFlow Object Detection API documents; however, when I try to run model_main.py, this warning shows and the model doesn't train. (I can't really tell whether the model is training or not, because the process isn't terminated, but no further logs appear.)
WARNING:tensorflow:Estimator's model_fn (.model_fn at 0x0000024BDBB3D158>) includes
params argument, but params are not passed to Estimator.
The command I'm running is:
python tensorflow-models/research/object_detection/model_main.py \
--model_dir=training \
--pipeline_config_path=ssd_mobilenet_v1_coco.config \
--checkpoint_dir=ssd_mobilenet_v1_coco_2017_11_17/model.ckpt \
--num_train_steps=2000 \
--num_eval_steps=200 \
--alsologtostderr
What could be causing this warning?
Why would the code seem stuck?
Please help!
I met the same problem, and I found that this warning has nothing to do with the model not training; the model trains fine even while this warning is shown.
My mistake was that I misunderstood this line in the running_locally.md document:
"${MODEL_DIR} points to the directory in which training checkpoints and events will be written to"
I changed MODEL_DIR to {project directory}/models/model, where the structure of the directory is:
+data
  -label_map file
  -train TFRecord file
  -eval TFRecord file
+models
  +model
    -pipeline config file
    +train
    +eval
And it worked. Hope this helps.
Edit: while this may work, note that in this case model_dir does not contain any saved checkpoint files. If you stop the training after some checkpoints have been saved and then restart it, training will still be skipped. The doc specifies the recommended directory structure, but it is not necessary to use exactly that structure; all paths to the TFRecords and the pretrained checkpoint can be configured in the config file.
The actual reason is that when model_dir contains checkpoint files which have already reached NUM_TRAIN_STEPS, the script assumes training is finished and exits. Remove the checkpoint files and restart training, and it will work.
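A small sketch (not from the original answers) of how to confirm this before restarting; "training" here is the --model_dir from the command above:
import tensorflow as tf

# If this prints e.g. "training/model.ckpt-2000" and NUM_TRAIN_STEPS is 2000,
# the script will treat training as finished and exit immediately.
# It prints None when model_dir holds no checkpoints yet.
print(tf.train.latest_checkpoint("training"))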
In my case, I had the same error because the folder containing my .ckpt files also held the checkpoint of the pre-trained model.
After removing that checkpoint file, which came inside the .tar.gz archive, training worked.
I also received this error, and it was because I had previously trained a model on a different dataset/model/config file and the old ckpt files still existed in the directory I was working with. Moving the old ckpt training data to a different directory fixed the issue.
Your command looks fine.
One thing to notice is that the new model_main.py does not print training logs (training step, learning rate, loss and so on). It only prints the evaluation results after one or more epochs, which can take a long time.
So "the process isn't terminated, but no further logs appear" is normal. You can confirm it is running by using "nvidia-smi" to check the GPU, or by checking TensorBoard.
I also encountered this warning message. I checked nvidia-smi and it seemed training hadn't started. I also tried re-organizing the output directory, and it didn't work out. After checking Configuring the Object Detection Training Pipeline (TensorFlow official docs), I found it was a configuration problem. I solved it by adding load_all_detection_checkpoint_vars: true to the train_config in my pipeline config file.

How to rewrite a tensorflow's checkpoint files?

I want to change the value of a tensor in one ckpt file using tensors from many other ckpt files, and then use the modified ckpt file to restart TF training jobs.
I hope you can give me some advice!
Thanks!
There are standalone utilities for reading checkpoint files (search for CheckpointReader or NewCheckpointReader) but not modifying them. The easiest approach is probably to load the checkpoint into your model, assign a new value to the variable you want to change, and save this new checkpoint.
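A minimal sketch of that approach, assuming a hypothetical variable named "some_variable" and placeholder checkpoint paths (neither comes from the question):
import tensorflow as tf

# Read the replacement value from another checkpoint (no graph needed).
reader = tf.train.NewCheckpointReader("/path/to/other.ckpt-100")
new_value = reader.get_tensor("some_variable")

# Build (or import) the model graph so the variable exists, then overwrite it.
var = tf.get_variable("some_variable", shape=new_value.shape)
saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, "/path/to/model.ckpt-329")    # load the checkpoint to modify
    sess.run(tf.assign(var, new_value))               # assign the value from the other checkpoint
    saver.save(sess, "/path/to/model_modified.ckpt")  # write out the modified checkpoint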

TensorFlow: Are saved variables from tf.saved_model and tf.train.Saver not compatible?

I saved a TensorFlow model using tf.saved_model and now I'm trying to load only the variables from that model using a tf.train.Saver, but I get one of the following two errors depending on the path I give it:
DataLossError: Unable to open table file saved_model/variables:
Failed precondition: saved_model/variables: perhaps your file is in a
different file format and you need to use a different restore operator?
or
InvalidArgumentError: Unsuccessful TensorSliceReader constructor:
Failed to get matching files on saved_model/variables/variables:
Not found: saved_model/variables
[[Node: save/RestoreV2_34 = RestoreV2[dtypes=[DT_FLOAT],
_device="/job:localhost/replica:0/task:0/cpu:0"]
(_arg_save/Const_1_0_0,
save/RestoreV2_34/tensor_names, save/RestoreV2_34/shape_and_slices)]]
tf.saved_model, when saving a model, creates a saved_model.pb protocol buffer and a folder named variables that contains two files:
variables.data-00000-of-00001
variables.index
tf.train.Saver.save() creates the following files:
some_name.data-00000-of-00001
some_name.index
some_name.meta
checkpoint
I have always assumed that the two output files ending in .data-00000-of-00001 and .index are compatible between both savers.
Is that not the case?
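For reference, a minimal sketch of the two save paths being compared (the toy variable, tag and directory names are placeholders):
import tensorflow as tf

v = tf.Variable(tf.zeros([3]), name="v")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # tf.train.Saver: writes some_name.data-00000-of-00001, some_name.index,
    # some_name.meta and a "checkpoint" file next to them.
    tf.train.Saver().save(sess, "./some_name")

    # tf.saved_model: writes saved_model/saved_model.pb plus
    # saved_model/variables/variables.data-00000-of-00001 and variables.index.
    builder = tf.saved_model.builder.SavedModelBuilder("./saved_model")
    builder.add_meta_graph_and_variables(sess, [tf.saved_model.tag_constants.SERVING])
    builder.save()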

how to properly train TensorFlow on one machine and evaluate on another?

I'm training a TensorFlow (1.2) model on one machine and attempting to evaluate it on another. Everything works fine when I stay local to one machine.
I am not using placeholders and feed-dicts to get data to the model, but rather TF file queues and batch generators. I suspect this would be much easier with placeholders, but I am trying to make the TF batch generator machinery work.
In my evaluation code I have lines like:
saver = tf.train.Saver()
ckpt = tf.train.get_checkpoint_state(os.path.dirname(ckpt_dir))
if ckpt and ckpt.model_checkpoint_path:
    saver.restore(sess, ckpt.model_checkpoint_path)
This produces errors like:
2017-08-16 12:29:06.387435: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Unsuccessful TensorSliceReader constructor: Failed to get matching files on /data/perdue/minerva/tensorflow/models/11/20170816/checkpoints-20: Not found: /data/perdue/minerva/tensorflow/models/11/20170816
The referenced directory (/data/...) exists on my training machine but not the evaluation machine. I have tried things like
saver = tf.train.import_meta_graph(
    '/local-path/checkpoints-XXX.meta',
    clear_devices=True
)
saver.restore(
    sess, '/local-path/checkpoints-XXX',
)
but this produces a different error:
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value train_file_queue/limit_epochs/epochs
or, if I explicitly call the initializer functions immediately after the restore,
AttributeError: 'Tensor' object has no attribute 'initializer'
Here, train_file_queue/limit_epochs/epochs is an element of the training graph that I would like the evaluation function to ignore (I have another, new element test_file_queue that is pointing at a different file queue with the evaluation data files in it).
I think that in the second case, when I'm calling the initializers right after the restore, there is something in the local variables that doesn't work quite like a "normal" Tensor, but I'm not sure exactly what the issue is.
If I just use a generic Saver and restore, TF does the right thing on the original machine - it just restores the model parameters and then uses my new file queue for evaluation. But I can't be restricted to that machine; I need to be able to evaluate the model on other machines.
I've also tried freezing a protobuf and a few other options and there are always difficulties associated with the fact that I need to use file queues as the most upstream inputs.
What is the proper way to train using TensorFlow's file queues and batch generators and then deploy the model on a different machine / in a different environment? I suspect if I were using feed-dict's to get data to the graph this would be fairly simple, but it isn't as clear when using the built in file queues and batch generators.
Thanks for any comments or suggestions!
At least part of this dilemma was addressed in TF 1.2 or 1.3. There is a new flag for the Saver() constructor:
saver = tf.train.Saver(save_relative_paths=True)
that makes it so that when you save the checkpoint directory, move it to another machine and use it to restore() a model, everything works without errors about nonexistent data paths (the paths from the old machine where training was performed).
It isn't clear my use of the API is really idiomatic in this case, but at least the code works such that I can export trained models from one machine to another.
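A minimal sketch of the relocatable save-then-restore flow under that flag (the toy variable and directory names are placeholders; in practice the two halves run on different machines):
import tensorflow as tf

v = tf.Variable(0.0, name="v")
saver = tf.train.Saver(save_relative_paths=True)

# Training machine: the "checkpoint" state file now records paths relative to train_dir.
tf.gfile.MakeDirs("train_dir")
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, "train_dir/checkpoints", global_step=20)

# Evaluation machine: after copying train_dir over, get_checkpoint_state joins the
# relative entry with the directory you pass in, so no stale absolute paths are used.
with tf.Session() as sess:
    ckpt = tf.train.get_checkpoint_state("train_dir")
    if ckpt and ckpt.model_checkpoint_path:
        saver.restore(sess, ckpt.model_checkpoint_path)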

Failing to read the new TensorFlow checkpoint format?

I pip installed tensorflow 0.12. I am able to resume training by loading old checkpoints, which end with .ckpt. However, tensorflow 0.12 dumps new checkpoints in a different format, consisting of *.index, *.data-00000-of-00001 and *.meta files. Since then, I am not able to restore from the new checkpoints.
What is the proper way of loading the new format? Besides, how can I read the *.index file?
Mostly duplicate of How to restore a model by filename in Tensorflow r12?
Troubleshooting:
- Use the common prefix of the checkpoint files: stop before the first dot after "ckpt" (e.g. restore "model.ckpt-1000", not "model.ckpt-1000.index").
- Check the model path, either absolute:
saver.restore(sess, "/full/path/to/model.ckpt")
or relative:
saver.restore(sess, "./model.ckpt")
Regarding reading the .index file: as the name suggests, it is the first file to be opened by the restore function. No .index file, no restore (you could still restore without a .meta file).
The .index file needs the .data-xxxxx-of-xxxxx shards, so it would be kind of pointless to read only the .index file without restoring any tensor data. What are you trying to achieve?
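If the question behind "how can I read the *.index file" is really "which variables does this checkpoint contain", a checkpoint reader can list them from the prefix alone; a small sketch (the path is a placeholder):
import tensorflow as tf

# Point the reader at the common prefix (no .index / .data suffix).
reader = tf.train.NewCheckpointReader("/path/to/model.ckpt-1000")
for name, shape in reader.get_variable_to_shape_map().items():
    print(name, shape)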