ExpirationError(code=StatusCode.DEADLINE_EXCEEDED, details="Deadline Exceeded") - tensorflow

I am following the tutorial for deploying the Inception model using TensorFlow Serving. I am using Ubuntu 16.04 and Bazel 13.0. The server is running and I am able to ping it. But when I upload a picture, it shows the following error:
jennings#Jennings:~/serving$ bazel-bin/tensorflow_serving/example/inception_client --server=localhost:9000 --image=./Xiang_Xiang_panda.jpg
Traceback (most recent call last):
File "/home/jennings/serving/bazel-bin/tensorflow_serving/example/inception_client.runfiles/tf_serving/tensorflow_serving/example/inception_client.py", line 56, in <module>
tf.app.run()
File "/home/jennings/serving/bazel-bin/tensorflow_serving/example/inception_client.runfiles/org_tensorflow/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/jennings/serving/bazel-bin/tensorflow_serving/example/inception_client.runfiles/tf_serving/tensorflow_serving/example/inception_client.py", line 51, in main
result = stub.Predict(request, 10.0) # 10 secs timeout
File "/home/jennings/.local/lib/python2.7/site-packages/grpc/beta/_client_adaptations.py", line 309, in __call__
self._request_serializer, self._response_deserializer)
File "/home/jennings/.local/lib/python2.7/site-packages/grpc/beta/_client_adaptations.py", line 195, in _blocking_unary_unary
raise _abortion_error(rpc_error_call)
grpc.framework.interfaces.face.face.ExpirationError: ExpirationError(code=StatusCode.DEADLINE_EXCEEDED, details="Deadline Exceeded")

This happens when the TensorFlow Serving client cannot communicate with the server, for example because of a network error. If you are using Docker to host your TensorFlow model server, you need to publish the server's port from the container, as shown below (substitute the port your server listens on; the question uses 9000):
docker run --name=tensorflow_container -p 9020:9020 -it $USER/tensorflow-serving-devel
Let me know if this works. Have a good one.
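To tell a connectivity problem apart from a slow model, it can help to check whether the gRPC port accepts TCP connections at all before calling Predict. The following is a minimal sketch using only the Python 3 standard library; port_is_open is a hypothetical helper name, and localhost:9000 is simply the address used in the question.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False


if __name__ == "__main__":
    # localhost:9000 is the address passed to inception_client in the question
    print(port_is_open("localhost", 9000))
```

If this prints False while the model server is running inside Docker, the port is probably not published to the host. If it prints True, the deadline is more likely being hit for another reason, such as a slow first request.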

Related

Ambari cluster restart error: Timeline Service V2.0 Reader not restarting

Attempting to restart an Ambari-managed cluster and getting errors related to the Timeline Service V2.0 Reader service starting:
Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/YARN/package/scripts/timelinereader.py", line 108, in <module>
ApplicationTimelineReader().execute()
File "/usr/lib/ambari-agent/lib/resource_management/libraries/script/script.py", line 353, in execute
method(env)
File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/YARN/package/scripts/timelinereader.py", line 51, in start
hbase(action='start')
File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/YARN/package/scripts/hbase_service.py", line 80, in hbase
createTables()
File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/YARN/package/scripts/hbase_service.py", line 147, in createTables
logoutput=True)
File "/usr/lib/ambari-agent/lib/resource_management/core/base.py", line 166, in __init__
self.env.run()
File "/usr/lib/ambari-agent/lib/resource_management/core/environment.py", line 160, in run
self.run_action(resource, action)
File "/usr/lib/ambari-agent/lib/resource_management/core/environment.py", line 124, in run_action
provider_action()
File "/usr/lib/ambari-agent/lib/resource_management/core/providers/system.py", line 263, in action_run
returns=self.resource.returns)
File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 72, in inner
result = function(command, **kwargs)
File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 102, in checked_call
tries=tries, try_sleep=try_sleep, timeout_kill_strategy=timeout_kill_strategy, returns=returns)
File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 150, in _call_wrapper
result = _call(command, **kwargs_copy)
File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 308, in _call
raise ExecuteTimeoutException(err_msg)
resource_management.core.exceptions.ExecuteTimeoutException: Execution of 'ambari-sudo.sh su yarn-ats -l -s /bin/bash -c 'export PATH='"'"'/usr/sbin:/sbin:/usr/lib/ambari-server/*:/usr/local/texlive/2016/bin/x86_64-linux:/usr/local/texlive/2016/bin/x86_64-linux:/usr/local/texlive/2016/bin/x86_64-linux:/usr/lib64/qt-3.3/bin:/usr/local/texlive/2016/bin/x86_64-linux:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/maven/bin:/root/bin:/opt/maven/bin:/opt/maven/bin:/var/lib/ambari-agent'"'"' ; sleep 10;export HBASE_CLASSPATH_PREFIX=/usr/hdp/3.0.0.0-1634/hadoop-yarn/timelineservice/*; /usr/hdp/3.0.0.0-1634/hbase/bin/hbase --config /usr/hdp/3.0.0.0-1634/hadoop/conf/embedded-yarn-ats-hbase org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator -Dhbase.client.retries.number=35 -create -s'' was killed due timeout after 300 seconds
I have not changed any configs or installed anything new before the restart attempt; I simply stopped the cluster services and attempted to restart them. I am not sure what this error message means. Any debugging tips or fixes?
Found the solution on another community post: navigate to the host where the Timeline Reader is installed and install the HBase Client on that host.
Here is how I installed the HBase Client via the Ambari UI:
In the Ambari UI, go to Hosts, then click the host you want to install the HBase Client component on.
In the list of components, you will have the option to add more.
From there I installed the HBase Client.
Then I stopped and restarted the cluster via the Ambari UI. (I got a notification of stale configs, though I am not sure whether that was my problem all along or whether installing the HBase Client raised the stale-configs alert.)

Cannot run dask-mpi with Python 3.7 -- timeout when connecting client to dask-mpi scheduler

I'm attempting to run the Dask-MPI "Getting Started" (http://mpi.dask.org/en/latest/) example in a fresh Anaconda environment.
I set up an environment using
conda create -n dask-mpi -c conda-forge python=3.7 dask-mpi
conda activate dask-mpi
Inside the environment, I run
mpirun -np 4 dask-mpi --scheduler-file ./scheduler.json
Then, from a python interpreter on the same machine (and in the same folder), I run
from dask.distributed import Client
client = Client(scheduler_file='/path/to/scheduler.json')
This results in the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/client.py", line 712, in __init__
self.start(timeout=timeout)
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/client.py", line 858, in start
sync(self.loop, self._start, **kwargs)
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/utils.py", line 331, in sync
six.reraise(*error[0])
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/six.py", line 693, in reraise
raise value
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/utils.py", line 316, in f
result[0] = yield future
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/client.py", line 954, in _start
yield self._ensure_connected(timeout=timeout)
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/client.py", line 1015, in _ensure_connected
timedelta(seconds=timeout), self._update_scheduler_info()
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
tornado.util.TimeoutError: Timeout
The terminal that I ran dask-mpi from does not have any output which would indicate that something is trying to connect. I have verified that the port in question, 8786, is open. I've also verified via debugger that the client is getting the correct address from the scheduler file.
I've tried this in quite a few different environments and on a few different machines, including a fresh Ubuntu 18.04 docker container. I'm completely at a loss for what steps I might be missing.
It turns out this was due to an error in newer versions of dask.distributed (1.25.3) which broke the behavior of dask-mpi. This seems to be fixed as of dask-mpi 1.0.3 (https://github.com/dask/dask-mpi/releases/tag/1.0.3).
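To see whether an environment already has the fixed release, the installed dask-mpi version can be compared against 1.0.3. This is a rough sketch, not an official check: parse_version, has_fix and installed_version are hypothetical helpers, the comparison ignores pre-release suffixes, and importlib.metadata is only available on Python 3.8 or later (on 3.7 the lookup returns None).

```python
import re


def parse_version(text):
    """Turn '1.0.3' into (1, 0, 3); stops at the first non-numeric component."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def has_fix(installed, fixed="1.0.3"):
    """True if the installed version is at least the release named in the answer."""
    return parse_version(installed) >= parse_version(fixed)


def installed_version(package):
    """Return the installed version string of package, or None if unavailable."""
    try:
        from importlib import metadata  # Python 3.8+
    except ImportError:
        return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("dask-mpi")
    if found is None:
        print("dask-mpi version could not be determined")
    else:
        print(found, "has fix:", has_fix(found))
```

If the check reports an older version, upgrading dask-mpi in the conda environment is the step the answer above points to.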

Tensorflow Serving client does not work

I am trying to build a program to predict from an already trained model, based on this tutorial:
https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/flowers
For this, I am using Docker and TensorFlow Serving, following this tutorial:
https://towardsdatascience.com/how-to-deploy-machine-learning-models-with-tensorflow-part-2-containerize-it-db0ad7ca35a7
When I run the client, which is based on this example:
https://github.com/tensorflow/serving/blob/master/tensorflow_serving/example/inception_client.py
I got this error:
userml#userml:~/usermodel$ python tensorflow_serving_client.py --server=172.17.0.2:9000 --image=./image5578.jpg
Traceback (most recent call last):
File "tensorflow_serving_client.py", line 87, in <module>
tf.app.run()
File "/home/userml/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tensorflow_serving_client.py", line 46, in main
result = stub.Predict(request, 60.0) # 60 secs timeout
File "/home/userml/.local/lib/python2.7/site-packages/grpc/beta/_client_adaptations.py", line 324, in __call__
self._request_serializer, self._response_deserializer)
File "/home/userml/.local/lib/python2.7/site-packages/grpc/beta/_client_adaptations.py", line 210, in _blocking_unary_unary
raise _abortion_error(rpc_error_call)
grpc.framework.interfaces.face.face.AbortionError: AbortionError(code=StatusCode.INVALID_ARGUMENT, details="input size does not match signature")
How can I solve this?
I tried to find the input size to set in 'shape[1]', but that does not seem to work. Does my model have something that is not compatible with the client?
Thanks a lot!

Error loading library gpuarray with Theano

I am trying to run this script to test Theano's use of my GPU and get the following error:
ERROR (theano.gpuarray): Could not initialize pygpu, support disabled
Traceback (most recent call last):
File "/home/me/anaconda3/envs/py35/lib/python3.5/site-packages/theano/gpuarray/__init__.py", line 164, in <module>
use(config.device)
File "/home/me/anaconda3/envs/py35/lib/python3.5/site-packages/theano/gpuarray/__init__.py", line 151, in use
init_dev(device)
File "/home/me/anaconda3/envs/py35/lib/python3.5/site-packages/theano/gpuarray/__init__.py", line 60, in init_dev
sched=config.gpuarray.sched)
File "pygpu/gpuarray.pyx", line 614, in pygpu.gpuarray.init (pygpu/gpuarray.c:9419)
File "pygpu/gpuarray.pyx", line 566, in pygpu.gpuarray.pygpu_init (pygpu/gpuarray.c:9110)
File "pygpu/gpuarray.pyx", line 1021, in pygpu.gpuarray.GpuContext.__cinit__ (pygpu/gpuarray.c:13472)
pygpu.gpuarray.GpuArrayException: Error loading library: -1
I need to use the nvidia-381 driver since my GPU is a 1080 Ti, which is not compatible with nvidia-375. I'm not sure if that matters, but installing nvcc overwrites 381, and reinstalling 381 after setting up nvcc causes errors, so I can't use nvcc.
I can import pygpu without errors, but if I run pygpu.test() I get the following error, and I don't know how to specify the DEVICE variable without nvcc.
======================================================================
ERROR: Failure: RuntimeError (No test device specified. Specify one using the DEVICE or GPUARRAY_TEST_DEVICE environment variables.)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/me/anaconda3/envs/py35/lib/python3.5/site-packages/nose/failure.py", line 39, in runTest
raise self.exc_val.with_traceback(self.tb)
File "/home/me/anaconda3/envs/py35/lib/python3.5/site-packages/nose/loader.py", line 418, in loadTestsFromName
addr.filename, addr.module)
File "/home/me/anaconda3/envs/py35/lib/python3.5/site-packages/nose/importer.py", line 47, in importFromPath
return self.importFromDir(dir_path, fqname)
File "/home/me/anaconda3/envs/py35/lib/python3.5/site-packages/nose/importer.py", line 94, in importFromDir
mod = load_module(part_fqname, fh, filename, desc)
File "/home/me/anaconda3/envs/py35/lib/python3.5/imp.py", line 234, in load_module
return load_source(name, filename, file)
File "/home/me/anaconda3/envs/py35/lib/python3.5/imp.py", line 172, in load_source
module = _load(spec)
File "<frozen importlib._bootstrap>", line 693, in _load
File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 665, in exec_module
File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
File "/home/me/.local/lib/python3.5/site-packages/pygpu-0.6.2-py3.5-linux-x86_64.egg/pygpu/tests/test_tools.py", line 5, in <module>
from .support import (guard_devsup, rand, check_flags, check_meta, check_all,
File "/home/me/.local/lib/python3.5/site-packages/pygpu-0.6.2-py3.5-linux-x86_64.egg/pygpu/tests/support.py", line 32, in <module>
context = gpuarray.init(get_env_dev())
File "/home/me/.local/lib/python3.5/site-packages/pygpu-0.6.2-py3.5-linux-x86_64.egg/pygpu/tests/support.py", line 29, in get_env_dev
raise RuntimeError("No test device specified. Specify one using the DEVICE or GPUARRAY_TEST_DEVICE environment variables.")
RuntimeError: No test device specified. Specify one using the DEVICE or GPUARRAY_TEST_DEVICE environment variables.
----------------------------------------------------------------------
Ran 7 tests in 0.003s
FAILED (errors=7)
<nose.result.TextTestResult run=7 errors=7 failures=0>
Warning: it's entirely possible that this is all wrong and that the actual reason for your problem is in fact, as you suspect, your GPU driver.
I had the same issue with gpuarray on Windows 10.
In the end I solved it by:
completely uninstalling Python
installing CUDA 8.0 (with cuDNN 5.1)
installing Anaconda
installing Theano through Anaconda:
conda install theano pygpu
As you are using Linux: this error message basically means "It didn't work, don't ask me why" and is mostly shown when something in your setup is wrong (e.g. different compilers used for compiling Python and Theano, or an incompatible CUDA version).
I would recommend updating to CUDA 8.0 and reinstalling your Python environment through Anaconda (just in case).
On a side note: I tested your example script from the documentation, and at least that is working.
Note for Windows users: never install Anaconda in a location with spaces in the path. Everything looks fine until Theano starts having trouble finding and compiling things.
Note regarding pygpu.test():
Normally you just set the environment variable:
Windows: set DEVICE=cuda
Linux: export DEVICE=cuda
But the test has the habit of saying you didn't specify a device if the library couldn't be loaded.
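The same variable can also be set from inside Python before calling pygpu.test(). A minimal sketch, assuming the value "cuda" from the note above suits your setup; set_test_device is a hypothetical helper, not part of pygpu.

```python
import os


def set_test_device(device="cuda"):
    """Set DEVICE for the current process, as `export DEVICE=cuda` would in a shell."""
    os.environ["DEVICE"] = device
    return os.environ["DEVICE"]


# Usage (requires pygpu to be installed):
#     set_test_device("cuda")
#     import pygpu
#     pygpu.test()
```

As noted above, if the underlying library still cannot be loaded, the test run may keep reporting a missing device even with the variable set.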

Ambari shows zeppelin server not started but the server is actually up and running

I am using HDP 2.4.2 and had previously installed the Zeppelin server. It was working fine, but today when I restarted the cluster (the AWS nodes were restarted), Ambari shows that the Zeppelin server is not running and fails to start it with the following error:
Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/stacks/HDP/2.4/services/ZEPPELIN/package/scripts/master.py", line 235, in <module>
Master().execute()
File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 219, in execute
method(env)
File "/var/lib/ambari-agent/cache/stacks/HDP/2.4/services/ZEPPELIN/package/scripts/master.py", line 169, in start
+ params.zeppelin_log_file, user=params.zeppelin_user)
File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 154, in __init__
self.env.run()
File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 158, in run
self.run_action(resource, action)
File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 121, in run_action
provider_action()
File "/usr/lib/python2.6/site-packages/resource_management/core/providers/system.py", line 238, in action_run
tries=self.resource.tries, try_sleep=self.resource.try_sleep)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 70, in inner
result = function(command, **kwargs)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 92, in checked_call
tries=tries, try_sleep=try_sleep)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 140, in _call_wrapper
result = _call(command, **kwargs_copy)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 291, in _call
raise Fail(err_msg)
resource_management.core.exceptions.Fail: Execution of '/usr/hdp/current/zeppelin-server/lib/bin/zeppelin-daemon.sh start >> /var/log/zeppelin/zeppelin-setup.log' returned 1. /usr/hdp/current/zeppelin-server/lib/bin/zeppelin-daemon.sh: line 187: /var/run/zeppelin-notebook/zeppelin-zeppelin-ip-10-0-0-11.eu-west-1.compute.internal.pid: Permission denied
cat: /var/run/zeppelin-notebook/zeppelin-zeppelin-ip-10-0-0-11.eu-west-1.compute.internal.pid: No such file or directory
In the zeppelin logs:
ERROR [2016-06-06 03:20:36,714] ({main} VFSNotebookRepo.java[list]:140) - Can't read note file:///usr/hdp/current/zeppelin-server/lib/notebook/screenshots java.io.IOException: file:///usr/hdp/current/zeppelin-server/lib/notebook/screenshots/note.json not found
ERROR [2016-06-06 03:34:12,795] ({main} Notebook.java[loadNoteFromRepo]:330) - Failed to load 2BHU1G67J java.io.IOException: file:///usr/hdp/current/zeppelin-server/lib/notebook/2BHU1G67J is not a directory
But for some reason the Zeppelin port is listening and, despite these errors, the Zeppelin server is running fine and executing all the queries. Please advise on how to correct the issue in Ambari and start the service without errors from Ambari.
The problem is with the PID file for the Zeppelin service. It is either owned by the wrong user or has the wrong permissions. Manually stop the Zeppelin service, then delete the PID file located at /var/run/zeppelin-notebook/zeppelin-zeppelin-ip-10-0-0-11.eu-west-1.compute.internal.pid. Double-check the owner and permissions of the /var/run/zeppelin-notebook folder as well. You should then be able to restart the service in the Ambari UI.
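To double-check the owner and permissions mentioned above, something like the following sketch could be run on the Zeppelin host. It uses only the standard library on a Unix system; describe is a hypothetical helper, and the paths are the ones quoted in the error message.

```python
import os
import pwd
import stat


def describe(path):
    """Return (owner name, permission string such as '-rw-r--r--') for path."""
    info = os.stat(path)
    try:
        owner = pwd.getpwuid(info.st_uid).pw_name
    except KeyError:
        # No passwd entry for this uid; fall back to the numeric id
        owner = str(info.st_uid)
    return owner, stat.filemode(info.st_mode)


if __name__ == "__main__":
    pid_dir = "/var/run/zeppelin-notebook"
    pid_file = os.path.join(
        pid_dir, "zeppelin-zeppelin-ip-10-0-0-11.eu-west-1.compute.internal.pid"
    )
    for path in (pid_dir, pid_file):
        try:
            print(path, describe(path))
        except OSError as err:
            # Missing file or insufficient rights to stat it
            print(path, "could not be inspected:", err)
```

If the reported owner is not the user Ambari starts Zeppelin as, that would be consistent with the "Permission denied" line in the error.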