TensorFlow distributed master worker save fails silently; the checkpoint file isn't created but no exception is raised

In a distributed TensorFlow environment, saving a checkpoint from the master worker fails.
saver.save returns normally (it does not raise an exception, and it returns the path of the stored checkpoint file), but the returned checkpoint file does not exist.
This does not match the description in the TensorFlow API documentation.
Why does this happen, and how can I fix it?
The related code is below:
def def_ps(self):
    self.saver = tf.train.Saver(max_to_keep=100, keep_checkpoint_every_n_hours=3)

def save(self, idx):
    ret = self.saver.save(self.sess, self.save_model_path, global_step=None, write_meta_graph=False)
    if not os.path.exists(ret):
        msg = "save model for %u path %s not exists." % (idx, ret)
        raise Exception(msg);
The log is below:
2016-06-02 21:33:52,323 root ERROR save model for 2 path model_path/rl_model_2 not exists.
2016-06-02 21:33:52,323 root ERROR has error:save model for 2 path model_path/rl_model_2 not exists.
Traceback (most recent call last):
File "d_rl_main_model_dist_0.py", line 755, in run_worker
File "d_rl_main_model_dist_0.py", line 360, in save
File "d_rl_main_model_dist_0.py", line 289, in save
raise Exception(msg);
Exception: save model for 2 path model_path/rl_model_2 not exists.
This does not match the TensorFlow API documentation, which defines Saver.save as follows:
tf.train.Saver.save(sess, save_path, global_step=None, latest_filename=None, meta_graph_suffix='meta', write_meta_graph=True)
Returns:
A string: path at which the variables were saved. If the saver is sharded, this string ends with: '-?????-of-nnnnn' where 'nnnnn' is the number of shards created.
Raises:
TypeError: If sess is not a Session.
ValueError: If latest_filename contains path components.

The tf.train.Saver.save() method is a little... surprising when you run in distributed mode. The actual file is written by the process that holds the tf.Variable op, which is typically a process in "/job:ps" if you've used the example code to set things up. This means that you need to look in save_path on each of the remote machines that have variables to find the checkpoint files.
Why is this the case? The Saver API implicitly assumes that all processes have the same view of a shared file system, like an NFS mount, because that is the typical setup we use at Google. We've added support for Google Cloud Storage in the latest nightly versions of TensorFlow, and are investigating HDFS support as well.
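Since the file may have been written on a different machine, an os.path.exists check on the worker can fail even though the save succeeded. As a rough sketch (plain Python, no TensorFlow calls), the check in the question could be replaced by something that reports what is visible locally instead of raising. The helper names here are hypothetical, and the sketch assumes checkpoint files share the save_path prefix:

```python
import glob
import os


def locally_visible_checkpoint_files(save_path):
    """Return files on THIS machine whose names start with the checkpoint prefix."""
    return sorted(glob.glob(glob.escape(save_path) + "*"))


def describe_save_result(save_path):
    """Build a log message that does not treat a missing local file as a failed save."""
    files = locally_visible_checkpoint_files(save_path)
    if files:
        names = ", ".join(os.path.basename(f) for f in files)
        return "checkpoint visible locally: " + names
    return ("no files for prefix %r on this machine; in distributed mode they may "
            "have been written on the machine that holds the variables" % save_path)
```

If every process mounts the same shared directory, the first branch should be the one you see on the worker as well.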


SentencePiece in Google Colab

I want to use sentencepiece (https://github.com/google/sentencepiece) in a Google Colab project where I am training an OpenNMT model. I'm a little confused about how to set up the sentencepiece binaries in Google Colab. Do I need to build with cmake?
When I install it using pip install sentencepiece and include sentencepiece in the "transforms" in my config, I get the error below.
I run this command (taken from the OpenNMT translation tutorial):
!onmt_build_vocab -config en-sp.yaml -n_sample -1
I get:
Traceback (most recent call last):
File "/usr/local/bin/onmt_build_vocab", line 8, in <module>
File "/usr/local/lib/python3.7/dist-packages/onmt/bin/build_vocab.py", line 63, in main
File "/usr/local/lib/python3.7/dist-packages/onmt/bin/build_vocab.py", line 32, in build_vocab_main
transforms = make_transforms(opts, transforms_cls, fields)
File "/usr/local/lib/python3.7/dist-packages/onmt/transforms/transform.py", line 176, in make_transforms
File "/usr/local/lib/python3.7/dist-packages/onmt/transforms/tokenize.py", line 110, in warm_up
File "/usr/local/lib/python3.7/dist-packages/sentencepiece/__init__.py", line 367, in Load
return self.LoadFromFile(model_file)
File "/usr/local/lib/python3.7/dist-packages/sentencepiece/__init__.py", line 171, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string
Below is how my config is written. I'm not sure where the "not a string" error is coming from.
## Where the samples will be written
save_data: en-sp/run/example
## Where the vocab(s) will be written
src_vocab: en-sp/run/example.vocab.src
tgt_vocab: en-sp/run/example.vocab.tgt
## Where the model will be saved
save_model: drive/MyDrive/Europarl/model/model
# Prevent overwriting existing files in the folder
overwrite: False
# Corpus opts:
path_src: train_europarl-v7.es-en.es
path_tgt: train_europarl-v7.es-en.en
transforms: [sentencepiece, filtertoolong]
weight: 1
path_src: dev_europarl-v7.es-en.es
path_tgt: dev_europarl-v7.es-en.en
transforms: [sentencepiece]
skip_empty_level: silent
world_size: 1
gpu_ranks: [0]
EDIT: I searched for the issue some more and found a Google Colab project that builds sentencepiece using cmake: https://colab.research.google.com/github/mymusise/gpt2-quickly/blob/main/examples/gpt2_quickly.ipynb#scrollTo=dDAup5dxDXZW. However, even after building with cmake, I'm still getting this error.
To fix this issue, I had to filter and tokenize my dataset and then train with sentencepiece. I used the scripts from this helpful source: https://github.com/ymoslem/MT-Preparation to do everything and now my model is training!
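One plausible reading of the traceback is that the model path handed to Load was not a string, which would happen if no SentencePiece model had been trained and configured yet. A minimal pre-flight check along those lines is sketched below. It assumes the options have already been read into a plain dict. The key names src_subword_model and tgt_subword_model are my assumption about the OpenNMT-py option names and should be checked against the documentation of your version, and missing_subword_models is a hypothetical helper, not part of OpenNMT:

```python
import os

# Assumed OpenNMT-py option names; verify them for the version you use.
SUBWORD_MODEL_KEYS = ("src_subword_model", "tgt_subword_model")


def missing_subword_models(opts):
    """Return the subword-model options that are unset or do not point to a file."""
    problems = []
    for key in SUBWORD_MODEL_KEYS:
        path = opts.get(key)
        # A value of None (option absent) is exactly the "not a string" case.
        if not isinstance(path, str) or not os.path.isfile(path):
            problems.append(key)
    return problems
```

Running this on the options before calling onmt_build_vocab would flag the problem before the transform tries to load a model.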

Keras: Error when downloading Fashion_MNIST Data

I am trying to download the Fashion MNIST data, but it produces an error. Originally it was downloading and working properly, but I had to terminate it because I had to turn off my computer. When I opened the notebook again, it gave me an error. I'm not sure what the problem is. Is it because I had already downloaded part of the data, and keras doesn't recognize that? I am using Jupyter Notebook in a conda environment.
You are missing the tf. prefix in the line
fashion_mnist = keras.datasets.fashion_mnist
The code below works perfectly for me. Importing the fashion_mnist dataset is described in the TensorFlow documentation.
Change your code to:
import tensorflow as tf
fashion_mnist = tf.keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
Or, more concisely, call load_data directly. This avoids creating the extra variable fashion_mnist:
import tensorflow as tf
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.fashion_mnist.load_data()
I am using tensorflow 1.9.0, keras 2.2.2 and python 3.6.6 on Windows 10 x64.
On my PC I can't download anything larger than 2.7 MB from the terminal, due to WinError 8.
So I manually downloaded all the packs from storage.google (since some packs are 25 MB).
Then I pasted all the packs into \datasets\fashion-mnist.
The next time you run your code, it should be fixed.
Note: If you have VS Code, just Ctrl+click the link and you can download it easily.
I had an error regarding the cURL connection, and by looking into the error message I was able to track the file where the URL was declared. In my case it was:
At line 44 I have commented out the line:
# base = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/'
And declared a different base URL, which I had found looking into the documentation of the original dataset:
base = 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/'
The download started immediately and gave no errors. Hope this helps.
This happens because the download of the MNIST dataset is incomplete.
You will have to manually delete the downloaded folder, which usually resides in ~/.keras/datasets or in a path you specified relative to it (in your case, MNIST_data).
Go to C:\Users\Username\.keras\datasets
and delete the dataset that has the error or that you want to re-download.
You should be good to go!
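The manual deletion described above can also be scripted. This is only a sketch using the standard library: remove_cached_dataset is a hypothetical helper, and the default cache location is assumed to be ~/.keras/datasets, as mentioned in the answer.

```python
import os
import shutil


def remove_cached_dataset(name, cache_dir=None):
    """Delete a cached dataset file or folder so it gets downloaded again.

    Returns True if something was removed, False if nothing was found.
    """
    if cache_dir is None:
        # Assumed default Keras cache location.
        cache_dir = os.path.join(os.path.expanduser("~"), ".keras", "datasets")
    target = os.path.join(cache_dir, name)
    if os.path.isdir(target):
        shutil.rmtree(target)
        return True
    if os.path.isfile(target):
        os.remove(target)
        return True
    return False
```

For example, remove_cached_dataset("fashion-mnist") would clear the folder and remove_cached_dataset("mnist.npz") the single file; the next call to load_data() should then start a fresh download.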
You can also add a print statement to show the paths from which the dataset is being loaded.
Example: print(paths) in the file fashion_mnist.py
with gzip.open(paths[3], 'rb') as imgpath:
    print(paths)  # debug print in fashion_mnist.py
    x_test = np.frombuffer(
        imgpath.read(), np.uint8, offset=16).reshape(len(y_test), 28, 28)
Then remove the files at that path, and fresh data will be downloaded on the next run.
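Instead of deleting every file, it may be enough to find the ones that were cut off mid-download. The sketch below assumes the cached files are gzip archives (as the gzip.open call above suggests) and simply tries to decompress each one to the end; corrupted_gzip_files is a hypothetical helper name.

```python
import gzip
import zlib


def corrupted_gzip_files(paths):
    """Return the paths that are missing or cannot be fully decompressed as gzip."""
    bad = []
    for path in paths:
        try:
            with gzip.open(path, "rb") as f:
                # Read in chunks until the end; a truncated file raises EOFError.
                while f.read(1 << 20):
                    pass
        except (EOFError, OSError, zlib.error):
            bad.append(path)
    return bad
```

Passing it the same paths list that the debug print shows would give the files worth deleting.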
Change the base address to 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/' as described previously. It works for me.
I was getting an error while downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
Traceback (most recent call last):
File "C:\Users\AsadA\AppData\Local\Programs\Python\Python38\lib\site-packages\numpy\lib\npyio.py", line 448, in load
return pickle.load(fid, **pickle_kwargs)
EOFError: Ran out of input
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\AsadA\AppData\Local\Programs\Python\Python38\lib\site-packages\numpy\lib\npyio.py", line 450, in load
raise IOError(
OSError: Failed to interpret file 'C:\\Users\\AsadA\\.keras\\datasets\\mnist.npz' as a pickle
Go to the folder C:\Users\AsadA\AppData\Local\Programs\Python\Python38\Lib\site-packages\tensorflow\python\keras\datasets (in my case) and follow the instructions:

UnicodeDecodeError from tf.train.import_meta_graph

I serialized a Tensorflow model with the following code ...
save_path = self.saver.save(self.session, os.path.join(self.logdir, "model.ckpt"), global_step)
logging.info("Model saved in file: %s" % save_path)
... and I'm now trying to restore it from scratch in a separate file using the following code:
saver = tf.train.import_meta_graph(PROJ_DIR + '/logs/default/model.ckpt-54.meta')
session = tf.Session()
saver.restore(session, PROJ_DIR + '/logs/default/model.ckpt-54')
print('Model restored')
When tf.train.import_meta_graph is called, the following exception is thrown:
[libprotobuf ERROR google/protobuf/io/coded_stream.cc:207] A protocol message was rejected because it was too big (more than 67108864 bytes). To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
Traceback (most recent call last):
File "/home/reid/projects/research/ccg/taggerflow_modified/test/tf_restore.py", line 4, in <module>
saver = tf.train.import_meta_graph(PROJ_DIR + '/logs/default/model.ckpt-54.meta')
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1711, in import_meta_graph
read_meta_graph_file(meta_graph_or_file), clear_devices)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1563, in read_meta_graph_file
text_format.Merge(file_content.decode("utf-8"), meta_graph_def)
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa7 in position 1: invalid start byte
For reference, here's the first few lines of <PROJ_DIR>/logs/default/model.ckpt-54.meta:
^M2^K^S^A^B^D^F^E^C ^R^G
I think that Tensorflow is using a different encoding when serializing vs when deserializing. How do we specify the encoding that Tensorflow uses when serializing/deserializing? Or is the solution something different?
I was facing the same issue. Have you ensured that, apart from the .meta, .data-00000-of-00001 and .index files, the file named 'checkpoint' is also in the directory from which you're loading the model?
My issue was resolved once I made sure of this. Hope this helps!
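A quick way to verify this before calling import_meta_graph is to list which of those files are absent. The following is a standard-library sketch, not a TensorFlow API; missing_checkpoint_parts is a hypothetical helper, and the set of expected files is taken from the list above.

```python
import glob
import os


def missing_checkpoint_parts(directory, prefix):
    """List which expected checkpoint files are absent for `prefix` in `directory`."""
    base = os.path.join(directory, prefix)
    missing = []
    if not os.path.isfile(os.path.join(directory, "checkpoint")):
        missing.append("checkpoint")
    for suffix in (".meta", ".index"):
        if not os.path.isfile(base + suffix):
            missing.append(prefix + suffix)
    # Data shards are named like <prefix>.data-00000-of-00001.
    if not glob.glob(glob.escape(base) + ".data-*"):
        missing.append(prefix + ".data-*")
    return missing
```

For the question above, the call would be something like missing_checkpoint_parts(PROJ_DIR + '/logs/default', 'model.ckpt-54'); an empty list means all four kinds of file are present.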

Error while deserializing the Apache MXNet object

I have trained and saved a model using Amazon SageMaker, which saves the model as model.tar.gz. When untarred, it contains a file model_algo-1, which is a serialized Apache MXNet object. To load the model into memory I need to deserialize it. I tried doing so as follows:
import mxnet as mx
Reference taken from https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html
However, doing this yields the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.4/site-packages/mxnet/ndarray/utils.py", line 175, in load
File "/usr/local/lib/python3.4/site-packages/mxnet/base.py", line 146, in
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [19:06:25] src/ndarray/ndarray.cc:1112: Check failed: header == kMXAPINDArrayListMagic Invalid NDArray file format
Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.4/site-packages/mxnet/libmxnet.so(+0x192112)
[bt] (1) /usr/local/lib/python3.4/site-packages/mxnet/libmxnet.so(+0x192738)
[bt] (2) /usr/local/lib/python3.4/site-packages/mxnet/libmxnet.so(+0x24a5c44) [0x7fe434f0dc44]
[bt] (3) /usr/local/lib/python3.4/site-packages/mxnet/libmxnet.so(MXNDArrayLoad+0x248) [0x7fe434d19ad8]
[bt] (4) /usr/lib64/libffi.so.6(ffi_call_unix64+0x4c) [0x7fe48c5bbcec]
[bt] (5) /usr/lib64/libffi.so.6(ffi_call+0x1f5) [0x7fe48c5bb615]
[bt] (6) /usr/lib64/python3.4/lib-dynload/_ctypes.cpython-34m.so(_ctypes_callproc+0x2fb) [0x7fe48c7ce18b]
[bt] (7) /usr/lib64/python3.4/lib-dynload/_ctypes.cpython-34m.so(+0xa4cf)
[bt] (8) /usr/lib64/libpython3.4m.so.1.0(PyObject_Call+0x8c)
[bt] (9) /usr/lib64/libpython3.4m.so.1.0(PyEval_EvalFrameEx+0x36c5)
Could someone suggest how this can be resolved?
If your model is serialized into the archive properly, there should be at least 2 files:
model_name.json - contains the architecture of your model
model_name.params - contains the parameters of your model
So, to load the model back, you need to:
Restore the model itself by loading the json file.
Restore the model parameters (you don't use mxnet nd.array for that, but the full model).
Here is a code example showing how to do it:
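# sym_json - content of .json file
net = gluon.nn.SymbolBlock(
    outputs=mx.sym.load_json(sym_json),
    inputs=mx.sym.var('data'))  # assumption: the input variable is named 'data'
# params_filename - full path to parameters file
net.load_parameters(params_filename)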
If you also want to check how your model is serialized, take a look at this example. It shows how to serialize a trained model manually before uploading it to SageMaker.
More details on serializing and deserializing a model manually can be found here.
I trained a stock Linear Learner algorithm through AWS SageMaker. It produces a model object called model.tar.gz in the output folder. As Vasanti noted, an article mentions that these objects are MXNet objects.
I knew I had to unpack the tar, what I didn't realize was how many times. I started with this code:
import subprocess
cmdline = ['tar','-xzvf','model.tar.gz']
subprocess.call(cmdline)
which yields a file called 'model_algo-1', which led me to this page. However, it's still a packed file. So run:
cmdline = ['tar','-xzvf','model_algo-1']
subprocess.call(cmdline)
This yields:
From there, you can utilize Sergei's post:
# load the json file
import json
sym_json = json.load(open('mx-mod-symbol.json'))
sym_json_string = json.dumps(sym_json)
# open model
import mxnet as mx
from mxnet import gluon
net = gluon.nn.SymbolBlock(
    outputs=mx.sym.load_json(sym_json_string),
    inputs=mx.sym.var('data'))  # assumption: the input variable is named 'data'
# params file
net.load_parameters('mx-mod-0000.params', allow_missing=True)
Now, if only I knew what to do with this mxnet / gluon object to get what I really want, which is a feature importance rank order and weights for some model explainability.
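The two-step unpacking described in this answer can also be done without shelling out to tar. The sketch below assumes both layers are tar archives, as the tar commands above suggest; if the inner file turns out to be a different format it raises instead of guessing. unpack_nested is a hypothetical helper name.

```python
import os
import tarfile


def unpack_nested(archive_path, inner_name, out_dir):
    """Extract archive_path into out_dir, then extract the inner archive it contained.

    Returns the sorted list of entries in out_dir afterwards.
    """
    with tarfile.open(archive_path) as outer:
        outer.extractall(out_dir)
    inner_path = os.path.join(out_dir, inner_name)
    if not tarfile.is_tarfile(inner_path):
        raise ValueError("%s is not a tar archive" % inner_path)
    with tarfile.open(inner_path) as inner:
        inner.extractall(out_dir)
    return sorted(os.listdir(out_dir))
```

A call such as unpack_nested('model.tar.gz', 'model_algo-1', 'model') would leave the inner files next to model_algo-1 in the model folder. Only use this on archives you trust, since extractall writes whatever paths the archive contains.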

TF record corrupted after several successful training epochs

I was training a neural network and had successfully run over all the training data for several epochs.
However, a "corrupted record" error for the tfrecord file suddenly appeared, as follows:
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/tf_record.py", line 77, in tf_record_iterator
File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 106241330
I checked the data file again and it was indeed corrupted at that position. But the data was intact before I ran the training code, and I only read the data, using the following code:
batch_data = []
record_iterator = tf.python_io.tf_record_iterator(path=file, options=options)
for string_record in record_iterator:
    example = tf.train.Example()
    example.ParseFromString(string_record)  # fill the Example from the raw record
    data = generate_data_from_record(example)  # record parsing code
    batch_data.append(data)
    if len(batch_data) == batch_size:
        yield batch_data
        batch_data = []
I am wondering why the data file was corrupted and how I can preserve the integrity of the data file.
You should make a clean copy of your tfrecord files. Whenever your working copy gets corrupted, replace it from the clean copy. The DataLossError seems to result from reading the same record several times, and it also depends on the disk.
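The clean-copy approach can be automated with a checksum comparison before each run. This is a rough standard-library sketch; file_sha256 and restore_if_changed are hypothetical helpers, and it assumes the clean copy itself is kept somewhere it will not be modified.

```python
import hashlib
import os
import shutil


def file_sha256(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_if_changed(clean_copy, working_copy):
    """Overwrite working_copy from clean_copy if it is missing or differs.

    Returns True if the working copy was replaced.
    """
    if os.path.exists(working_copy) and file_sha256(clean_copy) == file_sha256(working_copy):
        return False
    shutil.copyfile(clean_copy, working_copy)
    return True
```

Calling restore_if_changed once per tfrecord file before training starts would catch a working copy that changed since the last run.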
If someone is facing this problem, the above answer by @nwoye-cid worked for me, plus the link below to install everything properly.
Also, try restarting your kernel from scratch first; only if that doesn't work, go for the other solutions.