TFMA run_model_analysis not parsing TFRecord files properly - tensorflow

I am trying to use the run_model_analysis function of the TFMA library to evaluate my model.
The data has been written to a TFRecord file following the tf.train.Example format.
The model called for the evaluation expects an input shape of (None, 1, 5).
Somehow the TFRecord bytestring saved by SerializeToString doesn't get parsed and is passed to the model as-is.
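The examples are written roughly like this (a simplified sketch; the feature names match the serialized record shown in the error below, while the values and the output path are placeholders):

import tensorflow as tf

# Stand-in rows; the real values come from my feature pipeline.
rows = [{
    "AvgInvoiceQuantity": 0.11, "AvgTaxRate": 0.17, "InvoiceTaxAmount": -0.004,
    "AvgInvoiceUnitPrice": 0.48, "DistinctCurrencyCodes": 0, "IsAnomaly": 0,
}]

def make_example(row):
    # One tf.train.Example per record: four float features plus two int64 features.
    return tf.train.Example(features=tf.train.Features(feature={
        "AvgInvoiceQuantity":    tf.train.Feature(float_list=tf.train.FloatList(value=[row["AvgInvoiceQuantity"]])),
        "AvgTaxRate":            tf.train.Feature(float_list=tf.train.FloatList(value=[row["AvgTaxRate"]])),
        "InvoiceTaxAmount":      tf.train.Feature(float_list=tf.train.FloatList(value=[row["InvoiceTaxAmount"]])),
        "AvgInvoiceUnitPrice":   tf.train.Feature(float_list=tf.train.FloatList(value=[row["AvgInvoiceUnitPrice"]])),
        "DistinctCurrencyCodes": tf.train.Feature(int64_list=tf.train.Int64List(value=[row["DistinctCurrencyCodes"]])),
        "IsAnomaly":             tf.train.Feature(int64_list=tf.train.Int64List(value=[row["IsAnomaly"]])),
    }))

with tf.io.TFRecordWriter("eval_data.tfrecord") as writer:  # placeholder path
    for row in rows:
        writer.write(make_example(row).SerializeToString())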
Running it gives this error:
WARNING:absl:Tensorflow version (2.8.1) found. Note that TFMA support for TF 2.0 is currently in beta
WARNING:root:Make sure that locally built Python SDK docker image has Python 3.8 interpreter.
WARNING:apache_beam.io.tfrecordio:Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.
WARNING:absl:Large batch_size 1 failed with error Fail to call signature func with signature_name: serving_default.
the inputs are:
[b'\n\xab\x01\n\x1e\n\x15DistinctCurrencyCodes\x12\x05\x1a\x03\n\x01\x00\n\x1e\n\x12AvgInvoiceQuantity\x12\x08\x12\x06\n\x04\xb0\xc2\xd8=\n\x16\n\nAvgTaxRate\x12\x08\x12\x06\n\x04<\xbf1>\n\x1c\n\x10InvoiceTaxAmount\x12\x08\x12\x06\n\x04\x88\x10\x8d\xbb\n\x1f\n\x13AvgInvoiceUnitPrice\x12\x08\x12\x06\n\x04\xc0\x95\xf4>\n\x12\n\tIsAnomaly\x12\x05\x1a\x03\n\x01\x00'].
The input_specs are:
{'input_1': TensorSpec(shape=(None, 1, 5), dtype=tf.float32, name='input_1')}.. Attempting to run batch through serially. Note that this will significantly affect the performance.
Fail to call signature func with signature_name: serving_default.
the inputs are:
[b'\n\xab\x01\n\x1e\n\x15DistinctCurrencyCodes\x12\x05\x1a\x03\n\x01\x00\n\x1e\n\x12AvgInvoiceQuantity\x12\x08\x12\x06\n\x04\xb0\xc2\xd8=\n\x16\n\nAvgTaxRate\x12\x08\x12\x06\n\x04<\xbf1>\n\x1c\n\x10InvoiceTaxAmount\x12\x08\x12\x06\n\x04\x88\x10\x8d\xbb\n\x1f\n\x13AvgInvoiceUnitPrice\x12\x08\x12\x06\n\x04\xc0\x95\xf4>\n\x12\n\tIsAnomaly\x12\x05\x1a\x03\n\x01\x00'].
The input_specs are:
{'input_1': TensorSpec(shape=(None, 1, 5), dtype=tf.float32, name='input_1')}. [while running 'ExtractEvaluateAndWriteResults/ExtractAndEvaluate/ExtractPredictions/Predict']
Stack trace:
>During handling of the above exception, another exception occurred:
>The above exception was the direct cause of the following exception:
>During handling of the above exception, another exception occurred:
>During handling of the above exception, another exception occurred:
>The above exception was the direct cause of the following exception:
>During handling of the above exception, another exception occurred:
> File "C:\Users\t-ankbiswas\OneDrive - Microsoft\Desktop\EC.VL.CommerceTools\DataScience\AnomalyDetection\AnomalyDetector\AnomalyDetector\Evaluator\TFEvaluator.py", line 163, in evaluateModel
> evalResult = tfma.run_model_analysis(
> File "C:\Users\t-ankbiswas\OneDrive - Microsoft\Desktop\EC.VL.CommerceTools\DataScience\AnomalyDetection\AnomalyDetector\AnomalyDetector\Helpers\ExecutionManager.py", line 115, in CheckEvaluatorStages
> result = TFEvaluator(config).evaluateModel(config, dataProducer, dataPreprocessor,
> File "C:\Users\t-ankbiswas\OneDrive - Microsoft\Desktop\EC.VL.CommerceTools\DataScience\AnomalyDetection\AnomalyDetector\AnomalyDetector\Helpers\ExecutionManager.py", line 56, in Execute
> CheckEvaluatorStages(config)
> File "C:\Users\t-ankbiswas\OneDrive - Microsoft\Desktop\EC.VL.CommerceTools\DataScience\AnomalyDetection\AnomalyDetector\AnomalyDetector\Main.py", line 53, in <module> (Current frame)
> ExecutionManager.Execute()
Any idea what causes this error and how it can be fixed?
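For what it's worth, my current reading of the error is that TFMA hands the raw serialized tf.train.Example bytes to serving_default, so the signature apparently needs to accept a batch of serialized strings and parse them itself rather than expect the already-parsed (None, 1, 5) tensor. Below is a rough sketch of what I think such an export would look like; the Keras model, export path, and feature ordering are placeholders, so please correct me if this is off the mark:

import tensorflow as tf

# Stand-in for my real model; the real one takes input_1 with shape (None, 1, 5).
keras_model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(1, 5), name="input_1"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

feature_spec = {
    "DistinctCurrencyCodes": tf.io.FixedLenFeature([], tf.int64),
    "AvgInvoiceQuantity":    tf.io.FixedLenFeature([], tf.float32),
    "AvgTaxRate":            tf.io.FixedLenFeature([], tf.float32),
    "InvoiceTaxAmount":      tf.io.FixedLenFeature([], tf.float32),
    "AvgInvoiceUnitPrice":   tf.io.FixedLenFeature([], tf.float32),
}

@tf.function(input_signature=[tf.TensorSpec([None], tf.string, name="examples")])
def serve_tf_examples(serialized):
    # Parse the serialized tf.train.Example batch into a (batch, 1, 5) float tensor.
    parsed = tf.io.parse_example(serialized, feature_spec)
    columns = [tf.cast(parsed[name], tf.float32) for name in sorted(feature_spec)]  # order is arbitrary here
    stacked = tf.stack(columns, axis=-1)
    return keras_model(tf.expand_dims(stacked, axis=1))

tf.saved_model.save(keras_model, "exported_model",  # placeholder export path
                    signatures={"serving_default": serve_tf_examples})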

Related

np.matmul() inside tf.py_func() throws SIGBUS error

I am seeing a fatal error from matrix multiplication inside a py_func call.
In the py_func call I am multiplying a tensor with a set of 3D coordinates by a rotation matrix. This reproduces the error:
x = np.matmul(np.ones([640*480, 3]), np.eye(3))
When running outside a TF session this works with no problem, but inside the session when called via py_func I get
Process finished with exit code 138 (interrupted by signal 10: SIGBUS)
Trying different tensor sizes I see that for a shape (29000,3) the line works, and for (29200,3) it fails.
I am using TensorFlow-1.12.0.
What could cause this issue and how can I resolve it?
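For reference, a stripped-down sketch of how the call looks in my code (simplified; the real function multiplies a batch of 3D coordinates by a rotation matrix, here replaced by the identity):

import numpy as np
import tensorflow as tf  # 1.12, graph mode

def rotate(points):
    # points: (N, 3) float64 array; identity rotation keeps the repro minimal
    return np.matmul(points, np.eye(3))

points_in = tf.ones([640 * 480, 3], dtype=tf.float64)
rotated = tf.py_func(rotate, [points_in], tf.float64)

with tf.Session() as sess:
    sess.run(rotated)  # dies with SIGBUS at this size, works for smaller shapes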

UnicodeDecodeError from tf.train.import_meta_graph

I serialized a Tensorflow model with the following code ...
save_path = self.saver.save(self.session, os.path.join(self.logdir, "model.ckpt"), global_step)
logging.info("Model saved in file: %s" % save_path)
... and I'm now trying to restore it from scratch in a separate file using the following code:
saver = tf.train.import_meta_graph(PROJ_DIR + '/logs/default/model.ckpt-54.meta')
session = tf.Session()
saver.restore(session, PROJ_DIR + '/logs/default/model.ckpt-54')
print('Model restored')
When tf.train.import_meta_graph is called, the following exception is thrown:
[libprotobuf ERROR google/protobuf/io/coded_stream.cc:207] A protocol message was rejected because it was too big (more than 67108864 bytes). To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
Traceback (most recent call last):
File "/home/reid/projects/research/ccg/taggerflow_modified/test/tf_restore.py", line 4, in <module>
saver = tf.train.import_meta_graph(PROJ_DIR + '/logs/default/model.ckpt-54.meta')
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1711, in import_meta_graph
read_meta_graph_file(meta_graph_or_file), clear_devices)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1563, in read_meta_graph_file
text_format.Merge(file_content.decode("utf-8"), meta_graph_def)
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa7 in position 1: invalid start byte
For reference, here's the first few lines of <PROJ_DIR>/logs/default/model.ckpt-54.meta:
<A7>:^R<A4>:
9
^CAdd^R^F
^Ax"^AT^R^F
^Ay"^AT^Z^F
^Az"^AT"^Z
^AT^R^Dtype:^O
^M2^K^S^A^B^D^F^E^C ^R^G
I think that Tensorflow is using a different encoding when serializing vs when deserializing. How do we specify the encoding that Tensorflow uses when serializing/deserializing? Or is the solution something different?
I was facing the same issue. Have you ensured that, apart from the .meta, .data-00000-of-00001, and .index files, the file named 'checkpoint' is also present in the directory from which you're loading the model?
My issue got resolved after I made sure of this. Hope this helps!
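For example, a quick sanity check along these lines (file names taken from the question above; PROJ_DIR as defined there):

import os

ckpt_dir = os.path.join(PROJ_DIR, "logs", "default")  # PROJ_DIR as in the question
expected = [
    "checkpoint",                        # the one that is easy to miss
    "model.ckpt-54.meta",
    "model.ckpt-54.index",
    "model.ckpt-54.data-00000-of-00001",
]
present = set(os.listdir(ckpt_dir))
for name in expected:
    print(name, "OK" if name in present else "MISSING")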

some error/warning messages for running Tensorflow implementation

When running a TensorFlow implementation, I get the following error/warning messages, which do not include the line of Python code that causes the issue. At the same time, the result is still generated. I am not sure what these messages indicate.
Exception ignored in: <bound method Session.__del__ of <tensorflow.python.client.session.Session object at 0x2b48ec89f748>>
Traceback (most recent call last):
File "/data/tfw/lib/python3.4/site- packages/tensorflow/python/client/session.py", line 140, in __del__
File "/data/tfw/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 137, in close
UnboundLocalError: local variable 'status' referenced before assignment
Today I also encountered this exception while running a multilayer perceptron model on Windows 10 64-bit with Python 3.5 and TensorFlow 0.12.
I have seen this answer for this exception here:
it is induced by the garbage-collection order: if Python collects the session first, the program exits successfully; if Python collects the SWIG memory (tf_session) first, the program exits with a failure.
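A workaround that is often suggested for this garbage-collection ordering issue (not from the linked answer, just a common pattern) is to close the session deterministically before interpreter shutdown, for example with a context manager:

import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    x = tf.constant(1.0)

# Closing the session explicitly while the interpreter is still fully alive
# avoids relying on Session.__del__ running during interpreter teardown.
with tf.Session(graph=graph) as sess:
    print(sess.run(x))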

Warnings when executing tensorflow sample code `mnist_with_summaries.py`

I'm trying to execute the examples given by TensorFlow, more specifically the MNIST example. When I'm executing
tensorflow/examples/tutorials/mnist/mnist_with_summaries.py, line 163, which is:
summary, _ = sess.run([merged, train_step],
                      feed_dict=feed_dict(True),
                      options=run_options,
                      run_metadata=run_metadata)
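For context, run_options and run_metadata used in that call are set up earlier in the same file, roughly like this (reconstructed from the tutorial, so treat it as approximate):

import tensorflow as tf

# Full tracing engages the GPU tracer (gpu_tracer.cc), which is where the warning is emitted.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()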
The warning below comes out:
W tensorflow/core/common_runtime/gpu/gpu_tracer.cc:513] Unhandled API Callback for 2 41
W tensorflow/core/common_runtime/gpu/gpu_tracer.cc:513] Unhandled API Callback for 2 41
W tensorflow/core/common_runtime/gpu/gpu_tracer.cc:513] Unhandled API Callback for 2 41
W tensorflow/core/common_runtime/gpu/gpu_tracer.cc:513] Unhandled API Callback for 2 41
Any ideas on why this warning happens?
The full code is available here; I changed nothing.
Thanks
This is a warning message and should be harmless. A fix is also in progress: https://github.com/tensorflow/tensorflow/issues/2959

TensorFlow distributed master worker save fails silently; the checkpoint file isn't created but no exception is raised

In a distributed TensorFlow environment, the master worker fails to save the checkpoint.
saver.save returns successfully (it does not raise an exception and returns the path of the stored checkpoint file), but the returned checkpoint file does not exist.
This is not the same as the description in the TensorFlow API.
Why? How can I fix it?
=============
The related code is below:
def def_ps(self):
    self.saver = tf.train.Saver(max_to_keep=100, keep_checkpoint_every_n_hours=3)

def save(self, idx):
    ret = self.saver.save(self.sess, self.save_model_path, global_step=None, write_meta_graph=False)
    if not os.path.exists(ret):
        msg = "save model for %u path %s not exists." % (idx, ret)
        lg.error(msg)
        raise Exception(msg)
=============
The log is below:
2016-06-02 21:33:52,323 root ERROR save model for 2 path model_path/rl_model_2 not exists.
2016-06-02 21:33:52,323 root ERROR has error:save model for 2 path model_path/rl_model_2 not exists.
Traceback (most recent call last):
File "d_rl_main_model_dist_0.py", line 755, in run_worker
model_a.save(next_model_idx)
File "d_rl_main_model_dist_0.py", line 360, in save
Trainer.save(self,save_idx)
File "d_rl_main_model_dist_0.py", line 289, in save
raise Exception(msg);
Exception: save model for 2 path model_path/rl_model_2 not exists.
===========
This does not match the TensorFlow API, which defines Saver.save as follows:
https://www.tensorflow.org/versions/master/api_docs/python/state_ops.html#Saver
tf.train.Saver.save(sess, save_path, global_step=None, latest_filename=None, meta_graph_suffix='meta', write_meta_graph=True)
Returns:
A string: path at which the variables were saved. If the saver is sharded, this string ends with: '-?????-of-nnnnn' where 'nnnnn' is the number of shards created.
Raises:
TypeError: If sess is not a Session.
ValueError: If latest_filename contains path components.
The tf.train.Saver.save() method is a little... surprising when you run in distributed mode. The actual file is written by the process that holds the tf.Variable op, which is typically a process in "/job:ps" if you've used the example code to set things up. This means that you need to look in save_path on each of the remote machines that have variables to find the checkpoint files.
Why is this the case? The Saver API implicitly assumes that all processes have the same view of a shared file system, like an NFS mount, because that is the typical setup we use at Google. We've added support for Google Cloud Storage in the latest nightly versions of TensorFlow, and are investigating HDFS support as well.
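To make that concrete: with the usual example setup, variables are pinned to the parameter servers via tf.train.replica_device_setter, so the checkpoint bytes end up on those machines rather than on the worker that calls save(). A rough sketch of that setup (addresses are placeholders):

import tensorflow as tf  # 1.x-style distributed setup

cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],       # placeholder addresses
    "worker": ["worker0.example.com:2222"],
})

# Variables created under this device function live on /job:ps, so the
# checkpoint files written by Saver.save() appear on the ps machine's
# filesystem unless save_path points at storage shared by all jobs.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    weights = tf.Variable(tf.zeros([10, 10]), name="weights")

saver = tf.train.Saver()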