Error while deserializing the Apache MXNet object - mxnet

I trained and saved a model using Amazon SageMaker, which stores it as model.tar.gz. When untarred, the archive contains a file named model_algo-1, which is a serialized Apache MXNet object. To load the model into memory I need to deserialize it. I tried doing so as follows:
import mxnet as mx
print(mx.ndarray.load('model_algo-1'))
Reference taken from https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html
However, this raises the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.4/site-packages/mxnet/ndarray/utils.py", line 175, in load
    ctypes.byref(names)))
  File "/usr/local/lib/python3.4/site-packages/mxnet/base.py", line 146, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [19:06:25] src/ndarray/ndarray.cc:1112: Check failed: header == kMXAPINDArrayListMagic Invalid NDArray file format
Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.4/site-packages/mxnet/libmxnet.so(+0x192112) [0x7fe432bfa112]
[bt] (1) /usr/local/lib/python3.4/site-packages/mxnet/libmxnet.so(+0x192738) [0x7fe432bfa738]
[bt] (2) /usr/local/lib/python3.4/site-packages/mxnet/libmxnet.so(+0x24a5c44) [0x7fe434f0dc44]
[bt] (3) /usr/local/lib/python3.4/site-packages/mxnet/libmxnet.so(MXNDArrayLoad+0x248) [0x7fe434d19ad8]
[bt] (4) /usr/lib64/libffi.so.6(ffi_call_unix64+0x4c) [0x7fe48c5bbcec]
[bt] (5) /usr/lib64/libffi.so.6(ffi_call+0x1f5) [0x7fe48c5bb615]
[bt] (6) /usr/lib64/python3.4/lib-dynload/_ctypes.cpython-34m.so(_ctypes_callproc+0x2fb) [0x7fe48c7ce18b]
[bt] (7) /usr/lib64/python3.4/lib-dynload/_ctypes.cpython-34m.so(+0xa4cf) [0x7fe48c7c84cf]
[bt] (8) /usr/lib64/libpython3.4m.so.1.0(PyObject_Call+0x8c) [0x7fe4942fcb5c]
[bt] (9) /usr/lib64/libpython3.4m.so.1.0(PyEval_EvalFrameEx+0x36c5) [0x7fe4943ac915]
Could someone suggest how this can be resolved?

If your model was serialized into the archive properly, there should be at least two files:
model_name.json - contains the architecture of your model
model_name.params - contains the parameters of your model
So, to load the model back, you need to:
Restore the model itself by loading the .json file.
Restore the model parameters (load them into the full model, not with the MXNet ndarray loader on its own).
Here is a code example of how to do it:
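import mxnet as mx
from mxnet import gluon

# sym_json - content of the .json file, as a string
net = gluon.nn.SymbolBlock(
    outputs=mx.sym.load_json(sym_json),
    inputs=mx.sym.var('data'))
# params_filename - full path to the parameters file
# (older MXNet releases name this method load_params)
net.load_parameters(params_filename)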
If you also want to check how your model is serialized, take a look at this example. It shows how to serialize a trained model manually before uploading it to SageMaker.
More details on serializing and deserializing a model manually can be found here.

I trained a stock Linear Learner algorithm through AWS SageMaker. It produces a model object called model.tar.gz in the output folder. As Vasanti noted, an article mentions that these objects are MXNet objects.
I knew I had to unpack the tar; what I didn't realize was how many times. I started with this code:
import subprocess
cmdline = ['tar','-xzvf','model.tar.gz']
subprocess.call(cmdline)
This yields a file called 'model_algo-1', which is what led me to this page. However, it is still a packed file, so run:
cmdline = ['tar','-xzvf','model_algo-1']
subprocess.call(cmdline)
This yields:
additional-params.json
manifest.json
mx-mod-0000.params
mx-mod-symbol.json
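The two unpacking steps above can also be done without shelling out to tar. The following is only a minimal sketch: the helper names unpack_nested and _extract are made up for illustration, and it assumes the inner file is either a tar archive (optionally gzipped) or a zip archive.

```python
import tarfile
import zipfile
from pathlib import Path


def _extract(path, dest):
    """Extract a tar (optionally gzipped) or zip archive into dest."""
    if tarfile.is_tarfile(str(path)):
        with tarfile.open(str(path)) as tar:
            tar.extractall(str(dest))
    elif zipfile.is_zipfile(str(path)):
        with zipfile.ZipFile(str(path)) as zf:
            zf.extractall(str(dest))
    else:
        raise ValueError("%s is not a tar or zip archive" % path)


def unpack_nested(archive_path, dest_dir):
    """Unpack archive_path, then unpack any extracted file that is itself an archive."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    _extract(archive_path, dest)
    # Snapshot the listing first, because extraction adds files to dest.
    for member in sorted(dest.iterdir()):
        if member.is_file() and (
            tarfile.is_tarfile(str(member)) or zipfile.is_zipfile(str(member))
        ):
            _extract(member, dest)
    return sorted(p.name for p in dest.iterdir())
```

Calling unpack_nested('model.tar.gz', 'unpacked') should leave model_algo-1 and the files it contains side by side in the unpacked directory. Only run it on archives you trust, since extractall writes whatever paths the archive contains.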
From there, you can utilize Sergei's post:
# load the json file as a string
with open('mx-mod-symbol.json') as f:
    sym_json_string = f.read()
# open model
import mxnet as mx
from mxnet import gluon
net = gluon.nn.SymbolBlock(
outputs=mx.sym.load_json(sym_json_string),
inputs=mx.sym.var('data'))
# params file
net.load_parameters('mx-mod-0000.params', allow_missing=True)
Now, if only I knew what to do with this MXNet / Gluon object to get what I really want: a feature importance rank order and weights for some model explainability.

Related

How to recover pickled Keras histories?

Does anyone know how I can recover a list of Keras history objects that I saved to drive by pickling with the following code:
import pickle
with open("H:/hists", "wb") as fp:  # Pickling
    pickle.dump(hists, fp)
Currently I'm trying:
with open("H:/hists", "rb") as fp:  # Unpickling
    hists = pickle.load(fp)
but getting this error:
FileNotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ram://b3edea45-d0d4-442d-ab4f-b0c43a87d19e/variables/variables
You may be trying to load on a different device from the computational device. Consider setting the `experimental_io_device` option in `tf.saved_model.LoadOptions` to the io_device such as '/job:localhost'.
I believe this is because the Python kernel in which I saved the history objects has been terminated and a new one started.
I now think that the best way to save histories is to convert them to DataFrames or numpy arrays and save those, but that's not possible now since the histories are no longer in memory. It took about 5 hours to produce the histories, so I'm hoping it's possible to recover them.
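For future runs, the idea of saving only the numbers could look roughly like the sketch below. It does not recover the file that was already pickled. The function names are hypothetical, and it assumes each history object exposes a .history dict that maps a metric name to a list of per-epoch values, as Keras History objects do.

```python
import json


def histories_to_json(histories, path):
    """Write the per-epoch metrics of each history object to a JSON file."""
    records = []
    for h in histories:
        # Assumption: h.history maps metric name -> list of per-epoch values.
        # float() turns NumPy scalars into plain Python floats for JSON.
        records.append(
            {name: [float(v) for v in values] for name, values in h.history.items()}
        )
    with open(path, "w") as fp:
        json.dump(records, fp)
    return records


def histories_from_json(path):
    """Read back the list of metric dicts written by histories_to_json."""
    with open(path) as fp:
        return json.load(fp)
```

Because the file contains only lists of floats, it can be read back in a fresh kernel without the model or TensorFlow being importable.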

how to import a tensorflow webmodel in python

I have a TensorFlow "graph-model" consisting of a model.json and several .bin files. In JavaScript I am able to read those files using
const weights = browser.runtime.getURL("web_model/model.json");
tf.loadGraphModel(weights)
However, I would like to use this model in Python in order to process the results better.
When I try to load the model in Python with
new_model = keras.models.load_model('./web_model/model.json')
I get the following error:
File "h5py/h5f.pyx", line 106, in h5py.h5f.open
OSError: Unable to open file (file signature not found)
I don't understand: since the JavaScript code is able to run the model, I think Python should be able to do the same. What am I doing wrong?
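The traceback above comes from h5py, which suggests the loader tried to open model.json as an HDF5 file. A quick way to see what kind of file is actually on disk is sketched below. The function name describe_model_file is made up for illustration, and the "format" key is an assumption about what the TensorFlow.js converter writes into model.json.

```python
import json

# The 8-byte signature that normally starts an HDF5 file.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def describe_model_file(path):
    """Return 'hdf5', 'json:<format>', 'json', or 'unknown' for the file at path."""
    with open(path, "rb") as fp:
        head = fp.read(len(HDF5_SIGNATURE))
        if head == HDF5_SIGNATURE:
            return "hdf5"
        fp.seek(0)
        raw = fp.read()
    try:
        doc = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, ValueError):
        return "unknown"
    if isinstance(doc, dict):
        # Assumption: a TensorFlow.js model.json carries a top-level "format" key.
        return "json:" + str(doc.get("format", "no format field"))
    return "json"
```

If this reports a JSON file rather than HDF5, the "file signature not found" error is consistent with the file simply not being in the format the HDF5 reader expects.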

Keras model.get_config() returns list instead of dictionary

I am using tensorflow-gpu==1.10.0 and keras from tensorflow as tf.keras.
I am trying to use source code written by someone else to implement it on my network.
I saved my network using save_model and loaded it using load_model. When I call model.get_config(), I expect a dictionary, but I'm getting a list. The Keras documentation also says that get_config returns a dictionary (https://keras.io/models/about-keras-models/).
I checked whether the way the model is saved (save_model or model.save) makes a difference, but both give me this error:
TypeError: list indices must be integers or slices, not str
My code block:
model_config = self.keras_model.get_config()
for layer in model_config['layers']:
    name = layer['name']
    if name in update_layers:
        layer['config']['filters'] = update_layers[name]['filters']
My pip freeze:
absl-py==0.6.1
astor==0.7.1
bitstring==3.1.5
coverage==4.5.1
cycler==0.10.0
decorator==4.3.0
Django==2.1.3
easydict==1.7
enum34==1.1.6
futures==3.1.1
gast==0.2.0
geopy==1.11.0
grpcio==1.16.1
h5py==2.7.1
image==1.5.15
ImageHash==3.7
imageio==2.5.0
imgaug==0.2.5
Keras==2.1.3
kiwisolver==1.1.0
lxml==4.1.1
Markdown==3.0.1
matplotlib==2.1.0
networkx==2.2
nose==1.3.7
numpy==1.14.1
olefile==0.46
opencv-python==3.3.0.10
pandas==0.20.3
Pillow==4.2.1
prometheus-client==0.4.2
protobuf==3.6.1
pyparsing==2.3.0
pyquaternion==0.9.2
python-dateutil==2.7.5
pytz==2018.7
PyWavelets==1.0.1
PyYAML==3.12
Rtree==0.8.3
scikit-image==0.13.1
scikit-learn==0.19.1
scipy==0.19.1
Shapely==1.6.4.post1
six==1.11.0
sk-video==1.1.8
sklearn-porter==0.6.2
tensorboard==1.10.0
tensorflow-gpu==1.10.0
termcolor==1.1.0
tqdm==4.19.4
utm==0.4.2
vtk==8.1.0
Werkzeug==0.14.1
xlrd==1.1.0
xmltodict==0.11.0
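One possible reading of the TypeError above is that the model is a Sequential model on a Keras version whose get_config() returns the list of layer configs directly instead of a dict with a 'layers' key. The sketch below handles both shapes under that assumption. The helper names are hypothetical, and it also assumes the layer name sits either at the top level of each entry or inside its 'config' dict.

```python
def iter_layer_configs(model_config):
    """Yield (name, layer_dict) pairs for a dict-shaped or list-shaped config."""
    if isinstance(model_config, dict):
        layers = model_config["layers"]
    else:
        # Assumption: a list config is already the list of layer entries.
        layers = model_config
    for layer in layers:
        # Assumption: the name is a top-level key or stored in layer["config"].
        name = layer.get("name") or layer.get("config", {}).get("name")
        yield name, layer


def update_filters(model_config, update_layers):
    """Copy the 'filters' value from update_layers into the matching layer configs."""
    for name, layer in iter_layer_configs(model_config):
        if name in update_layers:
            layer["config"]["filters"] = update_layers[name]["filters"]
    return model_config
```

The config is modified in place and also returned, so the result can be passed on to whatever rebuilds the model from a config.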

_DecodeError('Unexpected end-group tag.') when running import_pb_to_tensorboard.py

I am trying out ways to deploy tensorflow model on android/iOS devices. So I did:
1) use tf.saved_model.builder.SavedModelBuilder to get model in .pb file
2) use tf.saved_model.loader.load() to verify that I can restore the model
However, when I want to do further inspection of the model using import_pb_to_tensorboard.py following suggestions at
1) https://medium.com/#daj/how-to-inspect-a-pre-trained-tensorflow-model-5fd2ee79ced0
2) https://hackernoon.com/running-a-tensorflow-model-on-ios-and-android-ce89446c8143
I got this error:
File "/Users/rjtang/_hack/env.tensorflow_src/lib/python3.4/site-packages/google/protobuf/internal/python_message.py", line 1083, in MergeFromString
if self._InternalParse(serialized, 0, length) != length:
.....
File "/Users/rjtang/_hack/env.tensorflow_src/lib/python3.4/site-packages/google/protobuf/internal/decoder.py", line 612, in DecodeRepeatedField
if value.add()._InternalParse(buffer, pos, new_pos) != new_pos:
....
File "/Users/rjtang/_hack/env.tensorflow_src/lib/python3.4/site-packages/google/protobuf/internal/decoder.py", line 746, in DecodeMap
raise _DecodeError('Unexpected end-group tag.')
The code and the generated .pb files are here:
https://github.com/rjt10/hear_it/blob/master/urban_sound/saved_model.pb
https://github.com/rjt10/hear_it/blob/master/urban_sound/savedmodel_save.py
https://github.com/rjt10/hear_it/blob/master/urban_sound/savedmodel_load.py
The version of tensorflow that I use is built from source "HEAD detached at v1.4.1"

Well, I understand what's happening now. Tensorflow has at least 3 ways to save and load a model. The graph will be serialized as one of the following 3 protobuf objects:
GraphDef
MetaGraphDef
SavedModel
You just need to deserialize it as the matching type, for example as in https://github.com/rjt10/hear_it/blob/master/urban_sound/model_check.py
For Android, TensorFlowInferenceInterface() expects a GraphDef, https://github.com/tensorflow/tensorflow/blob/e2be6d4c4fc9f1b7f6040b51b23190c14202e797/tensorflow/contrib/android/java/org/tensorflow/contrib/android/TensorFlowInferenceInterface.java#L541
That explains why.

TensorFlow distributed master worker save fails silently; the checkpoint file isn't created but no exception is raised

In a distributed TensorFlow environment, the master worker fails to save a checkpoint.
saver.save returns normally (it raises no exception and returns the checkpoint file path), but the returned checkpoint file does not exist.
This does not match the description in the TensorFlow API documentation.
Why? How can I fix it?
=============
The related code is below:
def def_ps(self):
    self.saver = tf.train.Saver(max_to_keep=100, keep_checkpoint_every_n_hours=3)
def save(self, idx):
    ret = self.saver.save(self.sess, self.save_model_path, global_step=None, write_meta_graph=False)
    if not os.path.exists(ret):
        msg = "save model for %u path %s not exists." % (idx, ret)
        lg.error(msg)
        raise Exception(msg)
=============
the log is below:
2016-06-02 21:33:52,323 root ERROR save model for 2 path model_path/rl_model_2 not exists.
2016-06-02 21:33:52,323 root ERROR has error:save model for 2 path model_path/rl_model_2 not exists.
Traceback (most recent call last):
File "d_rl_main_model_dist_0.py", line 755, in run_worker
model_a.save(next_model_idx)
File "d_rl_main_model_dist_0.py", line 360, in save
Trainer.save(self,save_idx)
File "d_rl_main_model_dist_0.py", line 289, in save
raise Exception(msg);
Exception: save model for 2 path model_path/rl_model_2 not exists.
===========
This does not match the TensorFlow API, which defines Saver.save as follows:
https://www.tensorflow.org/versions/master/api_docs/python/state_ops.html#Saver
tf.train.Saver.save(sess, save_path, global_step=None, latest_filename=None, meta_graph_suffix='meta', write_meta_graph=True)
Returns:
A string: path at which the variables were saved. If the saver is sharded, this string ends with: '-?????-of-nnnnn' where 'nnnnn' is the number of shards created.
Raises:
TypeError: If sess is not a Session.
ValueError: If latest_filename contains path components.

The tf.train.Saver.save() method is a little... surprising when you run in distributed mode. The actual file is written by the process that holds the tf.Variable op, which is typically a process in "/job:ps" if you've used the example code to set things up. This means that you need to look in save_path on each of the remote machines that have variables to find the checkpoint files.
Why is this the case? The Saver API implicitly assumes that all processes have the same view of a shared file system, like an NFS mount, because that is the typical setup we use at Google. We've added support for Google Cloud Storage in the latest nightly versions of TensorFlow, and are investigating HDFS support as well.
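Given that behaviour, a check like os.path.exists on the worker can fail even though the save succeeded elsewhere. A small sketch of a check that can be run on each machine is shown below. The function names are hypothetical, and it assumes checkpoint files are named with the returned save path as a prefix.

```python
import glob
import socket


def local_checkpoint_files(save_path):
    """Return the files on this machine whose names start with save_path."""
    # glob.escape keeps any wildcard characters in the path literal.
    return sorted(glob.glob(glob.escape(save_path) + "*"))


def describe_checkpoint(save_path):
    """Report whether checkpoint files for save_path exist on this host."""
    files = local_checkpoint_files(save_path)
    host = socket.gethostname()
    if files:
        return "found %d file(s) for %s on %s" % (len(files), save_path, host)
    return (
        "no files for %s on %s; check the same path on the machines "
        "that hold the variables" % (save_path, host)
    )
```

Running describe_checkpoint with the path returned by saver.save on the parameter server hosts, rather than on the worker, is one way to confirm where the files actually landed.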