Can an "address family for nodename not supported" warning prevent proper serving? - tensorflow-serving

I managed to export a Keras segmentation model into a tensorflow/serving:1.10.0-gpu-based container. However, at startup I notice a warning in the docker logs, just before the event loop starts: [warn] getaddrinfo: address family for nodename not supported. I'm not sure what this means, but so far I haven't been able to get a response from the server. Instead, the client receives status = StatusCode.UNAVAILABLE, details="OS Error", "grpc_status":14.
Is this somehow related to that warning? Am I experiencing some kind of networking problem between the gRPC client and the tfserving container due to this unsupported address family?
For completeness, I post the docker logs below. Note that I cleared timestamps and unimportant lines out of the log for readability:
[]: I tensorflow_serving/model_servers/main.cc:157] Building single TensorFlow model file config: model_name: mrcnn model_base_path: /models/mrcnn
[]: I tensorflow_serving/model_servers/server_core.cc:462] Adding/updating models.
[]: I tensorflow_serving/model_servers/server_core.cc:517] (Re-)adding model: mrcnn
[]: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: mrcnn version: 1}
[]: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: mrcnn version: 1}
[]: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: mrcnn version: 1}
[]: I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:360] Attempting to load native SavedModelBundle in bundle-shim from: /models/mrcnn/1
[]: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /models/mrcnn/1
[]: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
<skip>
[]: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10277 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:68:00.0, compute capability: 6.1)
[]: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:113] Restoring SavedModel bundle.
[]: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:148] Running LegacyInitOp on SavedModel bundle.
[]: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:233] SavedModel load for tags { serve }; Status: success. Took 1240882 microseconds.
<skip>
[]: I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: mrcnn version: 1}
[]: I tensorflow_serving/model_servers/main.cc:327] Running ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
[evhttp_server.cc : 235] RAW: Entering the event loop ...
[]: I tensorflow_serving/model_servers/main.cc:337] Exporting HTTP/REST API at:localhost:8501 ..

The short answer is no; that warning is benign. My hunch is that your client can't reach the server, possibly because of how you bound the docker ports, a problem in your client code, or how you're invoking it.

When you launch your container, do not forget to publish the port with the "-p" option:
docker run -d -p <port out>:<port in> <IMAGE>
Otherwise, if you are using Docker Machine, you can get the IP address to connect to with this command:
docker-machine ip
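To rule out a plain networking problem before blaming the warning, a quick TCP reachability check from the client machine can help. This is only a minimal sketch using the Python standard library; the helper name port_is_reachable and the host/port values are illustrative assumptions, not part of TensorFlow Serving.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Assumed values: the gRPC port published with "-p 8500:8500" on localhost.
    print(port_is_reachable("localhost", 8500))
```

If this returns False, the port is probably not published or the host is wrong. If it returns True, the problem more likely lies in the client code or in the request itself.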

Related

Crash tool fails to load the vmcore file with error : Segmentation fault

I'm new to kernel debugging and I'm trying to analyze a vmcore file using the crash tool on RHEL 7. I'm trying to open the dump with the command below.
crash /lib/debug/lib/modules/3.10.0-1160.80.1.el7.x86_64/vmlinux vmcore
but the tool fails to load the vmcore with the error below:
WARNING: kernel relocated [460MB]: patching 87472 gdb minimal_symbol values
crash: page excluded: kernel virtual address: ffffffffffffffff type: "possible"
WARNING: cannot read cpu_possible_map
crash: page excluded: kernel virtual address: ffffffffffffffff type: "present"
WARNING: cannot read cpu_present_map
crash: page excluded: kernel virtual address: ffffffffffffffff type: "online"
WARNING: cannot read cpu_online_map
crash: page excluded: kernel virtual address: ffffffffffffffff type: "active"
WARNING: cannot read cpu_active_map
WARNING: kernel version inconsistency between vmlinux and dumpfile
crash: page excluded: kernel virtual address: ffffffffffffffff type: "cpu_present_map"
crash: page excluded: kernel virtual address: ffffffffffffffff type: "cpu_present_map"
crash: cannot determine thread return address
WARNING: cannot determine pgdat list for this kernel/architecture
please wait... (gathering kmem slab cache data)
crash: invalid kernel virtual address: 1c type: "kmem_cache objsize/object_size"
Segmentation fault (core dumped)
I'm not sure whether this is a problem with the vmcore file or with the crash tool. Can someone please help resolve this issue?
Thanks in advance

tensorflow serving returning NaN when predict

I trained a GAN model and saved the generator with the following function:
tf.keras.models.save_model(
    generator,
    filepath=os.path.join(MODEL_PATH, 'model_saver'),
    overwrite=True,
    include_optimizer=False,
    save_format=None,
    options=None
)
It predicts successfully when I load the model with tf.keras.models.load_model in Python. But when I serve the model with TensorFlow model server, it returns NaN values.
I serve the model as follows:
zhaocc:~/products/tensorflow_server$ sudo docker run -t --rm -p 8502:8501 -v "/tmp/pix2pix/sketch_photo/model_saver:/models/photo2sketch" -e MODEL_NAME=photo2sketch tensorflow/serving &
[3] 30089
zhaocc:~/products/tensorflow_server$ 2020-06-17 12:57:31.745339: I tensorflow_serving/model_servers/server.cc:86] Building single TensorFlow model file config: model_name: photo2sketch model_base_path: /models/photo2sketch
2020-06-17 12:57:31.745448: I tensorflow_serving/model_servers/server_core.cc:464] Adding/updating models.
2020-06-17 12:57:31.745459: I tensorflow_serving/model_servers/server_core.cc:575] (Re-)adding model: photo2sketch
2020-06-17 12:57:31.846162: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: photo2sketch version: 1}
2020-06-17 12:57:31.846213: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: photo2sketch version: 1}
2020-06-17 12:57:31.846233: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: photo2sketch version: 1}
2020-06-17 12:57:31.846282: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /models/photo2sketch/1
2020-06-17 12:57:31.874158: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2020-06-17 12:57:31.874182: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:295] Reading SavedModel debug info (if present) from: /models/photo2sketch/1
2020-06-17 12:57:31.874315: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-17 12:57:31.952982: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:234] Restoring SavedModel bundle.
2020-06-17 12:57:32.172641: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:183] Running initialization op on SavedModel bundle at path: /models/photo2sketch/1
2020-06-17 12:57:32.248514: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:364] SavedModel load for tags { serve }; Status: success: OK. Took 402236 microseconds.
2020-06-17 12:57:32.256576: I tensorflow_serving/servables/tensorflow/saved_model_warmup.cc:105] No warmup data file found at /models/photo2sketch/1/assets.extra/tf_serving_warmup_requests
2020-06-17 12:57:32.265064: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: photo2sketch version: 1}
2020-06-17 12:57:32.267113: I tensorflow_serving/model_servers/server.cc:355] Running gRPC ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
2020-06-17 12:57:32.269289: I tensorflow_serving/model_servers/server.cc:375] Exporting HTTP/REST API at:localhost:8501 ...
[evhttp_server.cc : 238] NET_LOG: Entering the event loop ...
When I send a predict request over REST, it returns NaN values with the correct shape:
[[[[nan nan nan]
[nan nan nan]
[nan nan nan]
...
[nan nan nan]
[nan nan nan]
[nan nan nan]]
Does anybody know why? How can I debug it? Thanks very much!
I had the very same problem with my Pix2Pix generator. The problem was with the training parameter. As explained in "What does `training=True` mean when calling a TensorFlow Keras model?", this parameter affects the results of the network. One possible solution is to remove all dropouts (and other affected parts) prior to saving the network. This solution did not work for me (I probably missed something). So instead, as a temporary workaround, I added 2 signatures to the model:
# Signature that runs the generator in training mode.
@tf.function(input_signature=[tf.TensorSpec([None, 256, 256, 3], dtype=tf.float32)])
def model_predict1(input_batch):
    return {'outputs': generator(input_batch, training=True)}

# Signature that runs the generator in inference mode.
@tf.function(input_signature=[tf.TensorSpec([None, 256, 256, 3], dtype=tf.float32)])
def model_predict2(input_batch):
    return {'outputs': generator(input_batch, training=False)}
...
generator.save(base_path + "kerassave", signatures={'predict1': model_predict1, 'predict2': model_predict2})
predict2 still always returned nans. predict1 worked, however.
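To compare the two signatures through the REST API, the request body can carry a signature_name field. The following is a minimal sketch using only the Python standard library. The host, port, and model name are assumptions borrowed from the docker command in the question, count_nans is a hypothetical helper, and it assumes the server encodes NaN as a bare NaN literal in its JSON response.

```python
import json
import math
import urllib.request


def build_predict_request(model_name, signature_name, instances,
                          host="localhost", port=8502):
    """Build a POST request for TensorFlow Serving's REST predict endpoint."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"signature_name": signature_name,
                       "instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})


def count_nans(value):
    """Count NaN floats in an arbitrarily nested list of numbers."""
    if isinstance(value, list):
        return sum(count_nans(item) for item in value)
    return 1 if isinstance(value, float) and math.isnan(value) else 0


def nans_in_prediction(request):
    """Send the request and count NaNs in the "predictions" field."""
    with urllib.request.urlopen(request, timeout=30) as response:
        # Python's json module accepts the bare NaN literal by default.
        payload = json.loads(response.read().decode("utf-8"))
    return count_nans(payload["predictions"])
```

Calling nans_in_prediction once with signature_name "predict1" and once with "predict2" on the same input should show whether only the training=False path produces NaNs.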

tf-serving abnormal exit without error message

System information
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): RedHat EL6
TensorFlow Serving installed from (source or binary): source using bazel 0.18.0
TensorFlow Serving version: 1.12.0
Describe the problem
I compiled tf-serving using bazel on RHEL 6.9, and start it using:
./model_servers/tensorflow_model_server --model_config_file=./data/models.conf --rest_api_port=8502
models.conf:
model_config_list: {
  config: {
    name: "model_1",
    base_path: "/search/work/tf_serving_bin/tensorflow_serving/data/model_data/model_1",
    model_platform: "tensorflow",
    model_version_policy: {
      latest: {
        num_versions: 1
      }
    }
  }
}
The client is written in C++ and uses libcurl to call the tf-serving REST API, but tf-serving often exits abnormally without any error message after a few minutes.
When my client service sends requests to tf-serving on localhost, the problem occurs frequently. But when the client service sends requests to tf-serving on other machines, the problem does not occur (qps < 100).
I checked memory, CPU idle, etc., and found no problems, so it is very strange.
I also set export TF_CPP_MIN_VLOG_LEVEL=1, but there is still no error/critical message.
Source code / logs
2019-01-09 09:28:35.118183: I tensorflow_serving/model_servers/server_core.cc:461] Adding/updating models.
2019-01-09 09:28:35.118259: I tensorflow_serving/model_servers/server_core.cc:558] (Re-)adding model: app_ks_nfm_1
2019-01-09 09:28:35.227383: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: app_ks_nfm_1 version: 201901072359}
2019-01-09 09:28:35.227424: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: app_ks_nfm_1 version: 201901072359}
2019-01-09 09:28:35.227443: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: app_ks_nfm_1 version: 201901072359}
2019-01-09 09:28:35.227492: I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:363] Attempting to load native SavedModelBundle in bundle-shim from: /search/work/bazel-bin-serving/tensorflow_serving/data/model_data/app_ks_nfm_1/201901072359
2019-01-09 09:28:35.227530: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /search/work/bazel-bin-serving/tensorflow_serving/data/model_data/app_ks_nfm_1/201901072359
2019-01-09 09:28:35.256712: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2019-01-09 09:28:35.267728: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-01-09 09:28:35.313087: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:162] Restoring SavedModel bundle.
2019-01-09 09:28:38.797633: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:138] Running MainOp with key legacy_init_op on SavedModel bundle.
2019-01-09 09:28:38.803984: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:259] SavedModel load for tags { serve }; Status: success. Took 3570131 microseconds.
2019-01-09 09:28:38.804027: I tensorflow_serving/servables/tensorflow/saved_model_warmup.cc:83] No warmup data file found at /search/work/bazel-bin-serving/tensorflow_serving/data/model_data/app_ks_nfm_1/201901072359/assets.extra/tf_serving_warmup_requests
2019-01-09 09:28:38.804148: I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: app_ks_nfm_1 version: 201901072359}
2019-01-09 09:28:38.831860: I tensorflow_serving/model_servers/server.cc:286] Running gRPC ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
2019-01-09 09:28:38.865243: I tensorflow_serving/model_servers/server.cc:302] Exporting HTTP/REST API at:localhost:8502 ...
[evhttp_server.cc : 237] RAW: Entering the event loop ...
It is not an abnormal exit. It is an indication that the server is ready to receive inference requests.
For clarification, please see the explanation below:
docker run --runtime=nvidia -p 8501:8501 \
--mount type=bind,\
source=/tmp/tfserving/serving/tensorflow_serving/servables/tensorflow/testdata/saved_model_half_plus_two_gpu,\
target=/models/half_plus_two \
-e MODEL_NAME=half_plus_two -t tensorflow/serving:latest-gpu &
This will run the docker container with the nvidia-docker runtime, launch the TensorFlow Serving Model Server, bind the REST API port 8501, and map our desired model from our host to where models are expected in the container. We also pass the name of the model as an environment variable, which will be important when we query the model.
TIP: Before querying the model, be sure to wait till you see a message like the following, indicating that the server is ready to receive requests:
2018-07-27 00:07:20.773693: I tensorflow_serving/model_servers/main.cc:333]
Exporting HTTP/REST API at:localhost:8501 ...
After that message appears, just press Enter and you can query the model using the command below:
curl -d '{"instances": [1.0, 2.0, 5.0]}' \
-X POST http://localhost:8501/v1/models/half_plus_two:predict
For more information, refer to the link below:
https://www.tensorflow.org/tfx/serving/docker#gpu_serving_example
The reason: the short-lived connections produce a large number of TCP sockets in the 'TIME_WAIT' state, which occupy the available Linux file handles.
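If short-lived connections are indeed the cause, one mitigation is to reuse a single HTTP connection for many requests instead of opening a new one each time. The original client uses C++ and libcurl, where reusing one easy handle across requests generally has the same effect. The sketch below only illustrates the idea in Python with the standard library; the function name, host, port, and path are assumptions.

```python
import http.client
import json


def post_json_many(host, port, path, payloads, timeout=10.0):
    """POST several JSON payloads over one persistent HTTP/1.1 connection."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    results = []
    try:
        for payload in payloads:
            conn.request("POST", path, body=json.dumps(payload),
                         headers={"Content-Type": "application/json"})
            response = conn.getresponse()
            # Read the whole body so the connection can be reused.
            results.append((response.status, response.read()))
    finally:
        conn.close()
    return results


if __name__ == "__main__":
    # Assumed endpoint: the REST port and model name from the question.
    requests = [{"instances": [[1.0, 2.0]]}] * 3
    for status, body in post_json_many(
            "localhost", 8502, "/v1/models/model_1:predict", requests):
        print(status, body[:80])
```

Because all requests share one socket, the client leaves at most one connection in TIME_WAIT when it finishes, rather than one per request.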

Distributed Deep recommended System- Tensorflow: failed: Session bundle or SavedModel bundle not found at specified export location

After running the distributed TensorFlow example described at the following link:
https://github.com/tobegit3hub/deep_recommend_system/tree/master/distributed
I got the following files in the ./checkpoint folder:
checkpoint
graph.pbtxt
model.ckpt-269.data-00000-of-00001
model.ckpt-269.index
model.ckpt-269.meta
I wanted to serve the above model with TensorFlow Serving, but I get the error below when doing so:
./tensorflow_model_server --port="9000" --model_base_path=./model/
2017-07-14 15:32:32.791636: I tensorflow_serving/model_servers/main.cc:151] Building single TensorFlow model file config: model_name: default model_base_path: ./model/ model_version_policy: 0
2017-07-14 15:32:32.792156: I tensorflow_serving/model_servers/server_core.cc:375] Adding/updating models.
2017-07-14 15:32:32.792188: I tensorflow_serving/model_servers/server_core.cc:421] (Re-)adding model: default
2017-07-14 15:32:32.893072: I tensorflow_serving/core/basic_manager.cc:698] Successfully reserved resources to load servable {name: default version: 1}
2017-07-14 15:32:32.893143: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: default version: 1}
2017-07-14 15:32:32.893165: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: default version: 1}
2017-07-14 15:32:32.893252: E tensorflow_serving/util/retrier.cc:38] Loading servable: {name: default version: 1} failed: Not found: Session bundle or SavedModel bundle not found at specified export location
Any suggestions on this?

Can not load pb file in tensorflow serving

I have used SavedModel (Inception_resnet_v2) to export the TensorFlow model files and use TensorFlow Serving to load the files. I directly replaced the official mnist saved_model.pb with my own Inception_resnet_v2 saved_model.pb file, but I got an error.
deep@ubuntu:~/serving$ bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9000 --model_name=mnist --model_base_path=/home/deep/serving/tmp/mnist_model
2017-06-18 10:39:41.963490: I tensorflow_serving/model_servers/main.cc:146] Building single TensorFlow model file config: model_name: mnist model_base_path: home/deep/serving/tmp/mnist_model model_version_policy: 0
2017-06-18 10:39:41.963752: I tensorflow_serving/model_servers/server_core.cc:375] Adding/updating models.
2017-06-18 10:39:41.963762: I tensorflow_serving/model_servers/server_core.cc:421] (Re-)adding model: mnist
2017-06-18 10:39:42.065556: I tensorflow_serving/core/basic_manager.cc:698] Successfully reserved resources to load servable {name: mnist version: 1}
2017-06-18 10:39:42.065610: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: mnist version: 1}
2017-06-18 10:39:42.065648: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: mnist version: 1}
2017-06-18 10:39:42.065896: I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:360] Attempting to load native SavedModelBundle in bundle-shim from: /home/deep/serving/tmp/mnist_model/1
2017-06-18 10:39:42.066130: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:226] Loading SavedModel from: /home/deep/serving/tmp/mnist_model/1
2017-06-18 10:39:42.080775: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:274] Loading SavedModel: fail. Took 14816 microseconds.
2017-06-18 10:39:42.080822: E tensorflow_serving/util/retrier.cc:38] Loading servable: {name: mnist version: 1} failed: Not found: Could not find meta graph def matching supplied tags.
What should I do? Thanks!
I chatted to the Serving engineers, and here are some of their thoughts on this:
Looks like they need to specify a tag either in the saved model, or on
the command line. (log line of note: failed: Not found: Could not find
meta graph def matching supplied tags. )
It looks like the SavedModel loader is unable to find a graph
corresponding to the tags they have supplied. Here is some
documentation:
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/python/saved_model#tags
Ah, to add: They could use the SavedModel CLI to inspect the model and
see what tag-sets are available. Here is the documentation for that:
https://www.tensorflow.org/versions/master/programmers_guide/saved_model_cli.
They can run
saved_model_cli show --dir <SavedModelDir>
to check what tag-sets are in the SavedModel if they have pip
installed tensorflow.