I'm trying to use pretrained models with checkpoints, but I can't figure out which file should be put into the config as the checkpoint. There are these files:
model.ckpt.meta
model.ckpt.index
model.ckpt.data-00000-of-00001
I tried all of them, but I see errors. In various articles I saw just "model.ckpt", but there is no such file.
I tried to work with ssd_mobilenet from here: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md
model.ckpt.meta : stores the graph structure
model.ckpt.index : stores the index of the variables
model.ckpt.data-00000-of-00001 : stores the values of the variables
All three files together form the checkpoint. When you restore the model using tf.train.Saver.restore, the save_path parameter should be the common prefix, i.e. ".../model.ckpt" without any suffix. Your error probably occurred because of an invalid save_path. Check that all three files (.meta, .index, .data-*) exist under that prefix (relative or absolute path).
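(A minimal sketch of the restore call, assuming the three files sit in a hypothetical ./ssd_mobilenet directory; the variable names here are illustrative only.)

import tensorflow as tf

# The checkpoint "file" is really the common prefix of the three files above:
# ./ssd_mobilenet contains model.ckpt.meta, model.ckpt.index and model.ckpt.data-00000-of-00001
ckpt_prefix = "./ssd_mobilenet/model.ckpt"

# Rebuild the graph from the .meta file, then restore the variable values using the prefix
saver = tf.train.import_meta_graph(ckpt_prefix + ".meta")
with tf.Session() as sess:
    saver.restore(sess, ckpt_prefix)

As far as I recall, the object_detection pipeline config's fine_tune_checkpoint field takes the same prefix (ending in "model.ckpt").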
Related
When I try to load a checkpoint after training the ENet model for prediction, tf.train.latest_checkpoint() returns None, even though I am passing the correct checkpoint path.
Here is my code:
import os
import tensorflow as tf

image_dir = './dataset/test/'
images_list = sorted([os.path.join(image_dir, file) for file in
                      os.listdir(image_dir) if file.endswith('.png')])

checkpoint_dir = "./checkpoint_mk"
listi = os.listdir(checkpoint_dir)
print(listi)

checkpoint = tf.train.latest_checkpoint("./log/original/check")
print(checkpoint, '---------------------------------------'
                  '++++++++++++++++++++++++++++++++++++++++++++++++++++')
It returns None.
I am passing the absolute path of the checkpoint, since the files are stored in a different directory.
Here is my checkpoint folder:
EDIT ---------------
model_checkpoint_path: "model.ckpt-400"
all_model_checkpoint_paths: "model.ckpt-0"
all_model_checkpoint_paths: "model.ckpt-400"
The tf.train.latest_checkpoint path argument needs to be relative to your current directory (the one from which the Python script is executed). If the layout is more complex (e.g. the data set is stored in a different folder or on an external HDD), you can simply use the absolute path to the folder. That is why tf.train.latest_checkpoint("/home/nikhil_m/TensorFlow-ENet/log/original") works in this case.
Try tf.train.latest_checkpoint(os.path.dirname('your_checkpoint_path'))
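(A minimal sketch of what latest_checkpoint expects; the directory below is the one from the question.) It takes the directory containing the small index file literally named "checkpoint" that tf.train.Saver writes, and returns the newest prefix, or None if that file is missing:

import tensorflow as tf

# Directory that contains the "checkpoint" file written by tf.train.Saver
checkpoint_dir = "/home/nikhil_m/TensorFlow-ENet/log/original"

latest = tf.train.latest_checkpoint(checkpoint_dir)
print(latest)  # e.g. ".../model.ckpt-400", or None if no "checkpoint" file is found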
I have trained and saved a model using Amazon SageMaker, which saves the model as model.tar.gz. When untarred, it contains a file model_algo-1, which is a serialized Apache MXNet object. To load the model into memory I need to deserialize it. I tried doing so as follows:
import mxnet as mx
print(mx.ndarray.load('model_algo-1'))
Reference taken from https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html
However, doing this yields me the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.4/site-packages/mxnet/ndarray/utils.py", line 175, in load
    ctypes.byref(names)))
  File "/usr/local/lib/python3.4/site-packages/mxnet/base.py", line 146, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [19:06:25] src/ndarray/ndarray.cc:1112: Check failed: header == kMXAPINDArrayListMagic Invalid NDArray file format

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.4/site-packages/mxnet/libmxnet.so(+0x192112) [0x7fe432bfa112]
[bt] (1) /usr/local/lib/python3.4/site-packages/mxnet/libmxnet.so(+0x192738) [0x7fe432bfa738]
[bt] (2) /usr/local/lib/python3.4/site-packages/mxnet/libmxnet.so(+0x24a5c44) [0x7fe434f0dc44]
[bt] (3) /usr/local/lib/python3.4/site-packages/mxnet/libmxnet.so(MXNDArrayLoad+0x248) [0x7fe434d19ad8]
[bt] (4) /usr/lib64/libffi.so.6(ffi_call_unix64+0x4c) [0x7fe48c5bbcec]
[bt] (5) /usr/lib64/libffi.so.6(ffi_call+0x1f5) [0x7fe48c5bb615]
[bt] (6) /usr/lib64/python3.4/lib-dynload/_ctypes.cpython-34m.so(_ctypes_callproc+0x2fb) [0x7fe48c7ce18b]
[bt] (7) /usr/lib64/python3.4/lib-dynload/_ctypes.cpython-34m.so(+0xa4cf) [0x7fe48c7c84cf]
[bt] (8) /usr/lib64/libpython3.4m.so.1.0(PyObject_Call+0x8c) [0x7fe4942fcb5c]
[bt] (9) /usr/lib64/libpython3.4m.so.1.0(PyEval_EvalFrameEx+0x36c5) [0x7fe4943ac915]
Could someone suggest how this can be resolved?
If your model is serialized into the archive properly, then there should be at least 2 files:
model_name.json - contains the architecture of your model
model_name.params - contains the parameters of your model
So, to load the model back, you need to:
Restore the model itself by loading the json file.
Restore the model parameters (you don't use mxnet nd.array for that, but load them into the full model).
Here is a code example of how to do it:
import mxnet as mx
from mxnet import gluon

# sym_json - content of the .json file
net = gluon.nn.SymbolBlock(
    outputs=mx.sym.load_json(sym_json),
    inputs=mx.sym.var('data'))

# params_filename - full path to the parameters file
net.load_params(params_filename)
If you also want to check the serialization of your model, take a look at this example. It shows how to serialize a trained model manually before uploading it to SageMaker.
More details on serializing and deserializing a model manually can be found here.
I trained a stock Linear Learner algorithm through AWS SageMaker. It produces a model object called model.tar.gz in the output folder. As Vasanti noted, an article mentions that these objects are MXNet objects.
I knew I had to unpack the tar; what I didn't realize was how many times. I started with this code:
import subprocess

cmdline = ['tar', '-xzvf', 'model.tar.gz']
subprocess.call(cmdline)
which yields the file called 'model_algo-1', which led me to this page. However, it is still a packed file, so run:
cmdline = ['tar', '-xzvf', 'model_algo-1']
subprocess.call(cmdline)
This yields:
additional-params.json
manifest.json
mx-mod-0000.params
mx-mod-symbol.json
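(Aside: a minimal sketch of the same two-step extraction using the standard-library tarfile module instead of shelling out to tar; the file names are taken from the listing above.)

import tarfile

# First unpack the SageMaker archive, then the nested archive it contains
with tarfile.open('model.tar.gz', 'r:*') as tar:
    tar.extractall()
with tarfile.open('model_algo-1', 'r:*') as tar:
    tar.extractall()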
From there, you can utilize Sergei's post:
# load the json file
import json
sym_json = json.load(open('mx-mod-symbol.json'))
sym_json_string = json.dumps(sym_json)

# open model
import mxnet as mx
from mxnet import gluon
net = gluon.nn.SymbolBlock(
    outputs=mx.sym.load_json(sym_json_string),
    inputs=mx.sym.var('data'))

# params file
net.load_parameters('mx-mod-0000.params', allow_missing=True)
Now, if only I knew what to do with this MXNet/Gluon object to get what I really want, which is a feature-importance rank order and the weights, for some model explainability.
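(A hedged sketch, not specific to Linear Learner internals: once the SymbolBlock is loaded, collect_params() lists its learned arrays, which is one way to get at the raw weights.)

# List every parameter the loaded block holds, with its shape,
# and pull the raw values out as NumPy arrays.
for name, param in net.collect_params().items():
    print(name, param.shape)
    values = param.data().asnumpy()  # e.g. the weights/bias of a linear model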
I am new to FloydHub. I am trying to run the code from this GitHub repository and the corresponding tutorial.
For the training, I successfully used this command:
floyd run --gpu --env tensorflow-1.2 --data janinanu/dataset/data/2:tut_train 'python udc_train.py'
I adjusted this line in the training file to make it work on FloydHub:
tf.flags.DEFINE_string("input_dir", "/tut_train", "Directory containing input data files 'train.tfrecords' and 'validation.tfrecords'")
As said, this worked without problems for the training.
Now for the testing, I cannot really find any details on how to specify the model directory in which the output of the training is stored. I mounted the output from training with model_dir as the mount point. I assumed that the correct command should look something like this:
floyd run --cpu --env tensorflow-1.2 --data janinanu/datasets/data/2:tut_test --data janinanu/projects/retrieval-based-dialogue-system-on-ubuntu-corpus/18/output:model_dir 'python udc_test.py --model_dir=?'
I have no idea what to put in the --model_dir=?
Correspondingly, I assumed that I have to adjust some lines in the test file:
tf.flags.DEFINE_string("test_file", "/tut_test/test.tfrecords", "Path of
test data in TFRecords format")
tf.flags.DEFINE_string("model_dir", "/model_dir", "Directory to load model
checkpoints from")
...as well as in the train file (not sure about that though...):
tf.flags.DEFINE_string("input_dir", "/tut_train", "Directory containing
input data files 'train.tfrecords' and 'validation.tfrecords'")
tf.flags.DEFINE_string("model_dir", "/model_dir", "Directory to store
model checkpoints (defaults to /model_dir)")
When I use e.g. --model_dir=/model_dir and the code with the above adjustments, I get this error:
2017-12-22 12:17:49,048 INFO - return func(*args, **kwargs)
2017-12-22 12:17:49,048 INFO - File "/usr/local/lib/python3.5/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 543, in evaluate
2017-12-22 12:17:49,048 INFO - log_progress=log_progress)
2017-12-22 12:17:49,049 INFO - File "/usr/local/lib/python3.5/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 816, in _evaluate_model
2017-12-22 12:17:49,049 INFO - % self._model_dir)
2017-12-22 12:17:49,049 INFO - tensorflow.contrib.learn.python.learn.estimators._sklearn.NotFittedError: Couldn't find trained model at /model_dir
Which doesn't come as a surprise.
Can anyone give me some clarification on how to feed the training output into the test run?
I will also post this question in the Floydhub Forum.
Thanks!!
You can mount the output of any job just like you mount a dataset. In your example:
--data janinanu/projects/retrieval-based-dialogue-system-on-ubuntu-corpus/18/output:model_dir
should mount the entire output directory from run 18 to /model_dir of the new job.
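(A minimal sketch of what the full test command would then look like; the value passed to --model_dir is simply the mount point above, which is my guess at what replaces the "?".)

floyd run --cpu --env tensorflow-1.2 --data janinanu/datasets/data/2:tut_test --data janinanu/projects/retrieval-based-dialogue-system-on-ubuntu-corpus/18/output:model_dir 'python udc_test.py --model_dir=/model_dir'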
You can confirm this by viewing the job page (select the "data" tab to see what datasets are mounted at which paths).
In your case, can you confirm that the test is looking for the correct model filename?
I will also respond to this in the FloydHub forum.
I am trying out ways to deploy a TensorFlow model on Android/iOS devices. So I did:
1) use tf.saved_model.builder.SavedModelBuilder to get the model in a .pb file
2) use tf.saved_model.loader.load() to verify that I can restore the model
However, when I want to do further inspection of the model using import_pb_to_tensorboard.py, following the suggestions at
1) https://medium.com/@daj/how-to-inspect-a-pre-trained-tensorflow-model-5fd2ee79ced0
2) https://hackernoon.com/running-a-tensorflow-model-on-ios-and-android-ce89446c8143
I got this error:
File "/Users/rjtang/_hack/env.tensorflow_src/lib/python3.4/site-packages/google/protobuf/internal/python_message.py", line 1083, in MergeFromString
if self._InternalParse(serialized, 0, length) != length:
.....
File "/Users/rjtang/_hack/env.tensorflow_src/lib/python3.4/site-packages/google/protobuf/internal/decoder.py", line 612, in DecodeRepeatedField
if value.add()._InternalParse(buffer, pos, new_pos) != new_pos:
....
File "/Users/rjtang/_hack/env.tensorflow_src/lib/python3.4/site-packages/google/protobuf/internal/decoder.py", line 746, in DecodeMap
raise _DecodeError('Unexpected end-group tag.')
The code and the generated .pb files are here:
https://github.com/rjt10/hear_it/blob/master/urban_sound/saved_model.pb
https://github.com/rjt10/hear_it/blob/master/urban_sound/savedmodel_save.py
https://github.com/rjt10/hear_it/blob/master/urban_sound/savedmodel_load.py
The version of tensorflow that I use is built from source "HEAD detached at v1.4.1"
Well, I understand what's happening now. TensorFlow has at least 3 ways to save and load a model. The graph will be serialized as one of the following 3 protobuf objects:
GraphDef
MetaGraphDef
SavedModel
You just need to deserialize it properly, such as https://github.com/rjt10/hear_it/blob/master/urban_sound/model_check.py
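(A minimal sketch of the distinction, using the TF 1.x API; the export_dir and frozen_pb_path values below are hypothetical placeholders.)

import tensorflow as tf

export_dir = "./saved_model_dir"      # hypothetical SavedModel export directory
frozen_pb_path = "./frozen_graph.pb"  # hypothetical plain GraphDef file

# A SavedModel (what SavedModelBuilder writes) is loaded from its export directory
with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(sess, [tf.saved_model.tag_constants.SERVING], export_dir)

# A plain GraphDef .pb is instead parsed directly from the file bytes
graph_def = tf.GraphDef()
with tf.gfile.GFile(frozen_pb_path, "rb") as f:
    graph_def.ParseFromString(f.read())
tf.import_graph_def(graph_def, name="")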
For Android, TensorFlowInferenceInterface() expects a GraphDef, https://github.com/tensorflow/tensorflow/blob/e2be6d4c4fc9f1b7f6040b51b23190c14202e797/tensorflow/contrib/android/java/org/tensorflow/contrib/android/TensorFlowInferenceInterface.java#L541
That explains why: the .pb written by SavedModelBuilder is a SavedModel, so tools that try to parse it as a GraphDef fail.
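(If the Android path is the goal, a hedged sketch of one common conversion: freeze the SavedModel's variables into a GraphDef. The output node name and file paths below are hypothetical.)

import tensorflow as tf

with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(sess, [tf.saved_model.tag_constants.SERVING], "./saved_model_dir")
    # Replace variables with constants so a single GraphDef holds both graph and weights
    frozen = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, ["output_node"])

with tf.gfile.GFile("frozen_for_android.pb", "wb") as f:
    f.write(frozen.SerializeToString())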
In a distributed TensorFlow environment, the master worker fails to save the checkpoint.
saver.save returns OK (it does not raise an exception and it returns the stored checkpoint file path), but the returned checkpoint file does not exist.
This does not match the description in the TensorFlow API docs.
Why? How do I fix it?
=============
The related code is below:
def def_ps(self):
    self.saver = tf.train.Saver(max_to_keep=100, keep_checkpoint_every_n_hours=3)

def save(self, idx):
    ret = self.saver.save(self.sess, self.save_model_path, global_step=None, write_meta_graph=False)
    if not os.path.exists(ret):
        msg = "save model for %u path %s not exists." % (idx, ret)
        lg.error(msg)
        raise Exception(msg)
=============
The log is below:
2016-06-02 21:33:52,323 root ERROR save model for 2 path model_path/rl_model_2 not exists.
2016-06-02 21:33:52,323 root ERROR has error:save model for 2 path model_path/rl_model_2 not exists.
Traceback (most recent call last):
File "d_rl_main_model_dist_0.py", line 755, in run_worker
model_a.save(next_model_idx)
File "d_rl_main_model_dist_0.py", line 360, in save
Trainer.save(self,save_idx)
File "d_rl_main_model_dist_0.py", line 289, in save
raise Exception(msg);
Exception: save model for 2 path model_path/rl_model_2 not exists.
===========
This does not match the TensorFlow API, which defines Saver.save as follows:
https://www.tensorflow.org/versions/master/api_docs/python/state_ops.html#Saver
tf.train.Saver.save(sess, save_path, global_step=None, latest_filename=None, meta_graph_suffix='meta', write_meta_graph=True)
Returns:
A string: path at which the variables were saved. If the saver is sharded, this string ends with: '-?????-of-nnnnn' where 'nnnnn' is the number of shards created.
Raises:
TypeError: If sess is not a Session.
ValueError: If latest_filename contains path components.
The tf.train.Saver.save() method is a little... surprising when you run in distributed mode. The actual file is written by the process that holds the tf.Variable op, which is typically a process in "/job:ps" if you've used the example code to set things up. This means that you need to look in save_path on each of the remote machines that have variables to find the checkpoint files.
Why is this the case? The Saver API implicitly assumes that all processes have the same view of a shared file system, like an NFS mount, because that is the typical setup we use at Google. We've added support for Google Cloud Storage in the latest nightly versions of TensorFlow, and are investigating HDFS support as well.
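(A minimal sketch of the workaround this implies, assuming every process can see a common mount; the /shared/... path is hypothetical and sess is the session from the training setup above.)

import tensorflow as tf

# Hypothetical checkpoint prefix on storage that workers and parameter servers all see
# (an NFS mount, or a gs:// bucket on the nightly builds mentioned above)
shared_ckpt_prefix = "/shared/nfs/model_path/rl_model"

saver = tf.train.Saver(max_to_keep=100, keep_checkpoint_every_n_hours=3, sharded=True)
# sharded=True lets each parameter-server process write its own shard; the returned
# path can then end with "-?????-of-nnnnn" as the API documentation describes.
ckpt_path = saver.save(sess, shared_ckpt_prefix, write_meta_graph=False)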