tf-serving abnormal exit without error message
System information
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Red Hat EL6
TensorFlow Serving installed from (source or binary): source using bazel 0.18.0
TensorFlow Serving version: 1.12.0
Describe the problem
I compiled TF Serving with Bazel on RHEL 6.9 and start it with:
./model_servers/tensorflow_model_server --model_config_file=./data/models.conf --rest_api_port=8502
models.conf:
model_config_list: {
  config: {
    name: "model_1",
    base_path: "/search/work/tf_serving_bin/tensorflow_serving/data/model_data/model_1",
    model_platform: "tensorflow",
    model_version_policy: {
      latest: {
        num_versions: 1
      }
    }
  }
}
The client is written in C++ and uses libcurl to call the TF Serving REST API, but TF Serving often exits abnormally, without any error message, within a few minutes.
The problem occurs frequently when my client service sends requests to TF Serving on localhost. When the client service sends requests to TF Serving on other machines, the problem does not occur. QPS is below 100.
I checked memory, CPU idle time, and so on, and found no problems, so it is very strange.
With export TF_CPP_MIN_VLOG_LEVEL=1 set, there are no error or critical messages either.
Source code / logs
2019-01-09 09:28:35.118183: I tensorflow_serving/model_servers/server_core.cc:461] Adding/updating models.
2019-01-09 09:28:35.118259: I tensorflow_serving/model_servers/server_core.cc:558] (Re-)adding model: app_ks_nfm_1
2019-01-09 09:28:35.227383: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: app_ks_nfm_1 version: 201901072359}
2019-01-09 09:28:35.227424: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: app_ks_nfm_1 version: 201901072359}
2019-01-09 09:28:35.227443: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: app_ks_nfm_1 version: 201901072359}
2019-01-09 09:28:35.227492: I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:363] Attempting to load native SavedModelBundle in bundle-shim from: /search/work/bazel-bin-serving/tensorflow_serving/data/model_data/app_ks_nfm_1/201901072359
2019-01-09 09:28:35.227530: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /search/work/bazel-bin-serving/tensorflow_serving/data/model_data/app_ks_nfm_1/201901072359
2019-01-09 09:28:35.256712: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2019-01-09 09:28:35.267728: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-01-09 09:28:35.313087: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:162] Restoring SavedModel bundle.
2019-01-09 09:28:38.797633: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:138] Running MainOp with key legacy_init_op on SavedModel bundle.
2019-01-09 09:28:38.803984: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:259] SavedModel load for tags { serve }; Status: success. Took 3570131 microseconds.
2019-01-09 09:28:38.804027: I tensorflow_serving/servables/tensorflow/saved_model_warmup.cc:83] No warmup data file found at /search/work/bazel-bin-serving/tensorflow_serving/data/model_data/app_ks_nfm_1/201901072359/assets.extra/tf_serving_warmup_requests
2019-01-09 09:28:38.804148: I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: app_ks_nfm_1 version: 201901072359}
2019-01-09 09:28:38.831860: I tensorflow_serving/model_servers/server.cc:286] Running gRPC ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
2019-01-09 09:28:38.865243: I tensorflow_serving/model_servers/server.cc:302] Exporting HTTP/REST API at:localhost:8502 ...
[evhttp_server.cc : 237] RAW: Entering the event loop ...
This is not an abnormal exit. It indicates that the server is ready to receive inference requests.
For clarification, see the explanation below:
docker run --runtime=nvidia -p 8501:8501 \
  --mount type=bind,\
source=/tmp/tfserving/serving/tensorflow_serving/servables/tensorflow/testdata/saved_model_half_plus_two_gpu,\
target=/models/half_plus_two \
  -e MODEL_NAME=half_plus_two -t tensorflow/serving:latest-gpu &
This will run the docker container with the nvidia-docker runtime, launch the TensorFlow Serving Model Server, bind the REST API port 8501, and map our desired model from our host to where models are expected in the container. We also pass the name of the model as an environment variable, which will be important when we query the model.
TIP: Before querying the model, be sure to wait till you see a message like the following, indicating that the server is ready to receive requests:
2018-07-27 00:07:20.773693: I tensorflow_serving/model_servers/main.cc:333]
Exporting HTTP/REST API at:localhost:8501 ...
After that message appears, press Enter and you can query the model with the following command:
curl -d '{"instances": [1.0, 2.0, 5.0]}' \
-X POST http://localhost:8501/v1/models/half_plus_two:predict
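The same request can be issued from Python. The sketch below uses only the standard library and assumes the half_plus_two server from the docker command above is listening on localhost:8501. The helper names build_predict_request and predict are made up for illustration.

```python
import json
import urllib.request


def build_predict_request(model_name, instances, host="localhost", port=8501):
    """Build (but do not send) a POST request for the REST predict endpoint."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(model_name, instances, **kwargs):
    """Send the request and return the decoded JSON response."""
    req = build_predict_request(model_name, instances, **kwargs)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read())
```

With the server running, predict("half_plus_two", [1.0, 2.0, 5.0]) mirrors the curl call.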
For more information, refer to the link below:
https://www.tensorflow.org/tfx/serving/docker#gpu_serving_example
The reason: the short-lived connections produced a large number of TCP sockets in the TIME_WAIT state, which used up the available Linux file handles.
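One way to confirm this kind of diagnosis on Linux is to count sockets per TCP state. The sketch below parses the text of /proc/net/tcp, where the fourth column holds a hex state code (06 is TIME_WAIT). The function name count_tcp_states is an assumption for illustration, and only a few state codes are named.

```python
from collections import Counter

# Hex state codes used in /proc/net/tcp; only a few are named here.
TCP_STATES = {
    "01": "ESTABLISHED",
    "06": "TIME_WAIT",
    "08": "CLOSE_WAIT",
    "0A": "LISTEN",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the contents of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        state = fields[3].upper()  # 4th column ("st") is the state code
        counts[TCP_STATES.get(state, state)] += 1
    return counts
```

On the affected machine, something like count_tcp_states(open("/proc/net/tcp").read()) would show whether TIME_WAIT dominates. If it does, reusing one HTTP connection for many requests (keep-alive) instead of opening a new connection per request is the usual way to reduce it.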
Related
I had trained one GAN model and saved the generator by the following function:
tf.keras.models.save_model(
generator,
filepath=os.path.join(MODEL_PATH, 'model_saver'),
overwrite=True,
include_optimizer=False,
save_format=None,
options=None
)
It predicts successfully when I load the model with tf.keras.models.load_model in Python. But when I serve the model with TensorFlow Model Server, the model returns NaN values.
I serve the model as follows:
zhaocc:~/products/tensorflow_server$ sudo docker run -t --rm -p 8502:8501 -v "/tmp/pix2pix/sketch_photo/model_saver:/models/photo2sketch" -e MODEL_NAME=photo2sketch tensorflow/serving &
[3] 30089
zhaocc:~/products/tensorflow_server$ 2020-06-17 12:57:31.745339: I tensorflow_serving/model_servers/server.cc:86] Building single TensorFlow model file config: model_name: photo2sketch model_base_path: /models/photo2sketch
2020-06-17 12:57:31.745448: I tensorflow_serving/model_servers/server_core.cc:464] Adding/updating models.
2020-06-17 12:57:31.745459: I tensorflow_serving/model_servers/server_core.cc:575] (Re-)adding model: photo2sketch
2020-06-17 12:57:31.846162: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: photo2sketch version: 1}
2020-06-17 12:57:31.846213: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: photo2sketch version: 1}
2020-06-17 12:57:31.846233: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: photo2sketch version: 1}
2020-06-17 12:57:31.846282: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /models/photo2sketch/1
2020-06-17 12:57:31.874158: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2020-06-17 12:57:31.874182: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:295] Reading SavedModel debug info (if present) from: /models/photo2sketch/1
2020-06-17 12:57:31.874315: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-17 12:57:31.952982: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:234] Restoring SavedModel bundle.
2020-06-17 12:57:32.172641: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:183] Running initialization op on SavedModel bundle at path: /models/photo2sketch/1
2020-06-17 12:57:32.248514: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:364] SavedModel load for tags { serve }; Status: success: OK. Took 402236 microseconds.
2020-06-17 12:57:32.256576: I tensorflow_serving/servables/tensorflow/saved_model_warmup.cc:105] No warmup data file found at /models/photo2sketch/1/assets.extra/tf_serving_warmup_requests
2020-06-17 12:57:32.265064: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: photo2sketch version: 1}
2020-06-17 12:57:32.267113: I tensorflow_serving/model_servers/server.cc:355] Running gRPC ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
2020-06-17 12:57:32.269289: I tensorflow_serving/model_servers/server.cc:375] Exporting HTTP/REST API at:localhost:8501 ...
[evhttp_server.cc : 238] NET_LOG: Entering the event loop ...
When I send a prediction request over REST, it returns NaN with the correct shape:
[[[[nan nan nan]
[nan nan nan]
[nan nan nan]
...
[nan nan nan]
[nan nan nan]
[nan nan nan]]
Does anybody know why? How can I debug it? Thanks very much!
I had the very same problem with my Pix2Pix generator. The problem was with the training parameter. As explained in the question "What does `training=True` mean when calling a TensorFlow Keras model?", this parameter affects the results of the network. One possible solution is to remove all dropout layers (and other affected parts) before saving the network. That did not work for me (I probably missed something). So instead, as a temporary workaround, I added two signatures to the model:
# Signature that runs the generator in training mode.
@tf.function(input_signature=[tf.TensorSpec([None, 256, 256, 3], dtype=tf.float32)])
def model_predict1(input_batch):
    return {'outputs': generator(input_batch, training=True)}

# Signature that runs the generator in inference mode.
@tf.function(input_signature=[tf.TensorSpec([None, 256, 256, 3], dtype=tf.float32)])
def model_predict2(input_batch):
    return {'outputs': generator(input_batch, training=False)}
...
generator.save(base_path + "kerassave", signatures={'predict1': model_predict1, 'predict2': model_predict2})
predict2 still always returned nans. predict1 worked, however.
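To compare the two signatures through TensorFlow Serving, the REST predict body accepts a signature_name field. The sketch below only builds the two request bodies for the predict1 and predict2 signatures from the answer above. The all-zeros batch and the helper name predict_payload are assumptions for illustration, not real input data.

```python
import json


def predict_payload(instances, signature_name):
    """Serialize a REST predict body that targets one named signature."""
    return json.dumps({"signature_name": signature_name, "instances": instances})


# Dummy batch of shape 1x256x256x3 filled with zeros, matching the TensorSpec.
dummy_batch = [[[[0.0, 0.0, 0.0] for _ in range(256)] for _ in range(256)]]

# One request body per exported signature.
payloads = {
    name: predict_payload(dummy_batch, name) for name in ("predict1", "predict2")
}
```

Posting each body to the model's :predict endpoint and comparing the responses shows which signature produces the NaN values.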
I managed to export a Keras segmentation model into a tensorflow/serving:1.10.0-gpu-based container. However, at startup I notice a warning in the docker logs, just before the event loop starts: [warn] getaddrinfo: address family for nodename not supported. I'm not sure what this means, but so far I haven't been able to get a response from the server. Instead the client receives status = StatusCode.UNAVAILABLE, details="OS Error", "grpc_status":14.
Is this somehow related to that warning? Am I experiencing some kind of networking problem between the gRPC client and the tfserving container due to this unsupported address family?
For completeness, I post the docker logs below. Note that I cleared timestamps and unimportant lines out of the log for readability:
[]: I tensorflow_serving/model_servers/main.cc:157] Building single TensorFlow model file config: model_name: mrcnn model_base_path: /models/mrcnn
[]: I tensorflow_serving/model_servers/server_core.cc:462] Adding/updating models.
[]: I tensorflow_serving/model_servers/server_core.cc:517] (Re-)adding model: mrcnn
[]: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: mrcnn version: 1}
[]: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: mrcnn version: 1}
[]: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: mrcnn version: 1}
[]: I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:360] Attempting to load native SavedModelBundle in bundle-shim from: /models/mrcnn/1
[]: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /models/mrcnn/1
[]: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
<skip>
[]: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10277 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:68:00.0, compute capability: 6.1)
[]: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:113] Restoring SavedModel bundle.
[]: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:148] Running LegacyInitOp on SavedModel bundle.
[]: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:233] SavedModel load for tags { serve }; Status: success. Took 1240882 microseconds.
<skip>
[]: I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: mrcnn version: 1}
[]: I tensorflow_serving/model_servers/main.cc:327] Running ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
[evhttp_server.cc : 235] RAW: Entering the event loop ...
[]: I tensorflow_serving/model_servers/main.cc:337] Exporting HTTP/REST API at:localhost:8501 ..
Short answer is no, that warning is benign. My hunch is that your client isn't able to talk to the server, possibly because of how you have bound the docker ports or your client's code or how you're invoking it.
When you launch your container, do not forget to publish the port with the -p option.
docker run -d -p <port out>:<port in> <IMAGE>
Otherwise, you can get the ip address with this command:
docker-machine ip
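Before debugging the gRPC client itself, it can help to check that the published port accepts TCP connections at all. This is a minimal sketch using the standard socket module. The function name port_is_open is made up for illustration, and a successful connection only shows that the port is reachable, not that the model server answers correctly.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, port_is_open("localhost", 8500) checks the gRPC port from the logs above. Replace the host with the address printed by docker-machine ip if the container does not run on localhost.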
I'm new to Tensorflow.
I am using a 64 bit version of Windows 10 and I would like to install Tensorflow for the CPU.
I don't remember the exact steps that I followed to install it, however when I checked for the installation using:
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
I have the following output:
2017-10-18 09:56:21.656601: W C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-18 09:56:21.656984: W C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
b'Hello, TensorFlow!'
I am running python in Sublime Text 3 using the package SublimeREPL.
I searched for these warnings and found that they mean TensorFlow was built without these instructions, which could improve CPU performance. I also found code to hide the warnings, but I actually want to use these instructions.
The code that I found that enables this is:
bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-msse4.2 --copt=-msse4.1 --copt=-msse3 --copt=-mfma -k //tensorflow/tools/pip_package:build_pip_package
but I got this output:
ERROR: Skipping '//tensorflow/tools/pip_package:build_pip_package': no such package 'tensorflow/tools/pip_package': BUILD file not found on package path.
WARNING: Target pattern parsing failed. Continuing anyway.
INFO: Found 0 targets...
ERROR: command succeeded, but there were errors parsing the target pattern.
INFO: Elapsed time: 8,147s, Critical Path: 0,02s
How can I solve this problem?
Lastly, I don't understand what pip, wheel and bazel are, so I need step-by-step instructions.
Thank you a lot!
If you want to download the TensorFlow source and compile and install it yourself, use this link. If you want to download binaries, use this link.
After running the distributed TensorFlow example described at the following link:
https://github.com/tobegit3hub/deep_recommend_system/tree/master/distributed
I got the following files in the ./checkpoint folder:
checkpoint
graph.pbtxt
model.ckpt-269.data-00000-of-00001
model.ckpt-269.index
model.ckpt-269.meta
I wanted to serve the above model with TensorFlow Serving, but I get the error below when doing so:
./tensorflow_model_server --port="9000" --model_base_path=./model/
2017-07-14 15:32:32.791636: I tensorflow_serving/model_servers/main.cc:151] Building single TensorFlow model file config: model_name: default model_base_path: ./model/ model_version_policy: 0
2017-07-14 15:32:32.792156: I tensorflow_serving/model_servers/server_core.cc:375] Adding/updating models.
2017-07-14 15:32:32.792188: I tensorflow_serving/model_servers/server_core.cc:421] (Re-)adding model: default
2017-07-14 15:32:32.893072: I tensorflow_serving/core/basic_manager.cc:698] Successfully reserved resources to load servable {name: default version: 1}
2017-07-14 15:32:32.893143: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: default version: 1}
2017-07-14 15:32:32.893165: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: default version: 1}
2017-07-14 15:32:32.893252: E tensorflow_serving/util/retrier.cc:38] Loading servable: {name: default version: 1} failed: Not found: Session bundle or SavedModel bundle not found at specified export location
Any suggestions on this?
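TensorFlow Serving looks for numeric version subdirectories under model_base_path, and a SavedModel directory contains a saved_model.pb (or saved_model.pbtxt) file. Training checkpoint files such as model.ckpt-269.index are not a SavedModel by themselves. The sketch below checks a base path for that layout. The function name find_servable_versions is made up for illustration, and the check covers only the SavedModel format, not the older session bundle format mentioned in the error.

```python
import os


def find_servable_versions(model_base_path):
    """Map each numeric version directory to whether it holds a SavedModel file."""
    versions = {}
    for entry in sorted(os.listdir(model_base_path)):
        version_dir = os.path.join(model_base_path, entry)
        if entry.isdigit() and os.path.isdir(version_dir):
            # A SavedModel export stores its graph in one of these two files.
            has_saved_model = any(
                os.path.isfile(os.path.join(version_dir, name))
                for name in ("saved_model.pb", "saved_model.pbtxt")
            )
            versions[int(entry)] = has_saved_model
    return versions
```

A version that maps to False has a version directory but no SavedModel file in it, which would be consistent with the "not found at specified export location" error above.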
I used SavedModel to export the TensorFlow model files for Inception_resnet_v2 and use TensorFlow Serving to load them. I directly replaced the official MNIST saved_model.pb with my own Inception_resnet_v2 saved_model.pb file, but I got an error.
deep@ubuntu:~/serving$ bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9000 --model_name=mnist --model_base_path=/home/deep/serving/tmp/mnist_model
2017-06-18 10:39:41.963490: I tensorflow_serving/model_servers/main.cc:146] Building single TensorFlow model file config: model_name: mnist model_base_path: home/deep/serving/tmp/mnist_model model_version_policy: 0
2017-06-18 10:39:41.963752: I tensorflow_serving/model_servers/server_core.cc:375] Adding/updating models.
2017-06-18 10:39:41.963762: I tensorflow_serving/model_servers/server_core.cc:421] (Re-)adding model: mnist
2017-06-18 10:39:42.065556: I tensorflow_serving/core/basic_manager.cc:698] Successfully reserved resources to load servable {name: mnist version: 1}
2017-06-18 10:39:42.065610: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: mnist version: 1}
2017-06-18 10:39:42.065648: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: mnist version: 1}
2017-06-18 10:39:42.065896: I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:360] Attempting to load native SavedModelBundle in bundle-shim from: /home/deep/serving/tmp/mnist_model/1
2017-06-18 10:39:42.066130: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:226] Loading SavedModel from: /home/deep/serving/tmp/mnist_model/1
2017-06-18 10:39:42.080775: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:274] Loading SavedModel: fail. Took 14816 microseconds.
2017-06-18 10:39:42.080822: E tensorflow_serving/util/retrier.cc:38] Loading servable: {name: mnist version: 1} failed: Not found: Could not find meta graph def matching supplied tags.
What should I do? Thanks!
I chatted to the Serving engineers, and here are some of their thoughts on this:
Looks like they need to specify a tag either in the saved model, or on
the command line. (log line of note: failed: Not found: Could not find
meta graph def matching supplied tags. )
It looks like the SavedModel loader is unable to find a graph
corresponding to the tags they have supplied. Here is some
documentation:
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/python/saved_model#tags
Ah, to add: They could use the SavedModel CLI to inspect the model and
see what tag-sets are available. Here is the documentation for that:
https://www.tensorflow.org/versions/master/programmers_guide/saved_model_cli.
They can run
saved_model_cli show --dir <SavedModelDir>
to check what tag-sets are in the SavedModel if they have pip
installed tensorflow.