Error with TPUClusterResolver for Cloud TPU v3 Pod with TensorFlow 2.1 - tensorflow

I'm trying to use my (pre-emptible) Cloud TPU v3-256 on my Google Cloud Compute Engine VM with TensorFlow 2.1, but it doesn't seem to be working: the TPUClusterResolver throws a "Could not lookup TPU metadata" error.
Using individual (non-preemptible) TPUs works fine as long as I use the grpc:// address rather than the TPU name. However, neither individual TPUs nor my TPU Pod work when using the TPU name; both throw the error below.
Can someone help me fix this issue?
Code:
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='my-tpu-name', zone='europe-west4-a', project='my-project') # The zone, project and TPU Name are correct
Output:
ValueError: Could not lookup TPU metadata from name 'my-tpu-name'. Please double check the tpu argument in the TPUClusterResolver constructor.
Exception: Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=True from the Google Compute Engine metadata service. Response: {'metadata-flavor': 'Google', 'date': 'Thu, 28 May 2020 17:42:35 GMT', 'content-type': 'text/html; charset=UTF-8', 'server': 'Metadata Server for VM', 'content-length': '1629', 'x-xss-protection': '0', 'x-frame-options': 'SAMEORIGIN', 'status': '404'}

I suspect a mismatch between the Compute Engine VM and the TPU in one of the following: TensorFlow version, zone, or project.
If both the TPU and the GCE VM are created with the same TensorFlow version (2.1 or 2.2), in the same project, and in the same zone, you can provide just the TPU name to TPUClusterResolver and it should work fine:
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='my-tpu-name')
You can omit the TPU name entirely if you set the TPU_NAME environment variable (export TPU_NAME=my-tpu-name) on your VM.
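For completeness, once the resolver succeeds, the usual follow-up steps look like this (a minimal sketch, assuming the TF 2.1/2.2 API, where TPUStrategy still lives under tf.distribute.experimental):
import tensorflow as tf

# Works when the VM and TPU share the same project, zone, and TF version;
# omit the tpu argument if TPU_NAME is exported on the VM.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='my-tpu-name')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)
print('Number of replicas:', strategy.num_replicas_in_sync)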

Related

Tensorflow Distributed Multi-Machine Training: Failing to connect to nodes other than self

If I only use nodes 2-4 in TF_CONFIG, the program just hangs, whereas if I set one node as the primary (node01), the script throws the error logs below. I believe node01 is delegated as chief due to its placement, so this may be why it fails further down the line, but the issue still remains.
Code:
import os
import json
import tensorflow as tf
import mnist_setup
per_worker_batch_size = 64
tf_config = json.loads(os.environ['TF_CONFIG'])
num_workers = len(tf_config['cluster']['worker'])
strategy = tf.distribute.MultiWorkerMirroredStrategy()
global_batch_size = per_worker_batch_size * num_workers
multi_worker_dataset = mnist_setup.mnist_dataset(global_batch_size)
with strategy.scope():
    # Model building/compiling need to be within `strategy.scope()`.
    multi_worker_model = mnist_setup.build_and_compile_cnn_model()

multi_worker_model.fit(multi_worker_dataset, epochs=3, steps_per_epoch=70)
TF_CONFIG:
{'cluster': {'worker': ['node01:34425', 'node02:36257']},
'task': {'type': 'worker', 'index': 1}}
Output:
/home/ubuntu/miniforge3/lib/python3.10/site-packages/tensorflow_io/python/ops/__init__.py:98: UserWarning: unable to load libtensorflow_io_plugins.so: unable to open file: libtensorflow_io_plugins.so, from paths: ['/home/ubuntu/miniforge3/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so']
caused by: ["[Errno 2] The file to load file system plugin from does not exist.: '/home/ubuntu/miniforge3/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so'"]
warnings.warn(f"unable to load libtensorflow_io_plugins.so: {e}")
/home/ubuntu/miniforge3/lib/python3.10/site-packages/tensorflow_io/python/ops/__init__.py:104: UserWarning: file system plugins are not loaded: unable to open file: libtensorflow_io.so, from paths: ['/home/ubuntu/miniforge3/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so']
caused by: ['/home/ubuntu/miniforge3/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: cannot open shared object file: No such file or directory']
warnings.warn(f"file system plugins are not loaded: {e}")
2023-02-04 06:16:07.014328: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:447] Started server with target: grpc://node02:36257
2023-02-04 06:16:07.039956: I tensorflow/core/distributed_runtime/coordination/coordination_service_agent.cc:277] Coordination agent has successfully connected.
2023-02-04 06:16:08.150050: W tensorflow/tsl/framework/cpu_allocator_impl.cc:82] Allocation of 188160000 exceeds 10% of free system memory.
2023-02-04 06:16:12.054935: E tensorflow/core/distributed_runtime/coordination/coordination_service_agent.cc:678] Coordination agent is in ERROR: UNAVAILABLE: failed to connect to all addresses
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
I have tried the other suggestions, such as "unset https_proxy" and similar, but that only helped with changing localhost -> node01 in TF_CONFIG when node01 was the only node. I also changed the iptables rules to allow all input/output traffic from the IP of each machine on the local network, still with no success. The code is an excerpt from TensorFlow's tutorial notebook on distributed training, with the only modifications being removing the tf-nightly install and changing localhost to the hostnames of my machines.
I am running TensorFlow 2.11.0.
Link here: https://github.com/tensorflow/docs/blob/master/site/en/tutorials/distribute/multi_worker_with_keras.ipynb
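Not part of the original post, but a sketch of what is usually checked in this situation: the 'cluster' section of TF_CONFIG must be identical on every machine (only task.index differs), and each worker's gRPC port must be reachable from the others (every worker starts its gRPC server as soon as the strategy is created). A quick way to verify both, assuming the two hostnames/ports from the question:
import json
import os
import socket

cluster = {'worker': ['node01:34425', 'node02:36257']}

# Same cluster spec on every node; only the index changes (0 on node01, 1 on node02).
os.environ['TF_CONFIG'] = json.dumps({'cluster': cluster, 'task': {'type': 'worker', 'index': 0}})

# Run this after the other workers have started their scripts: if a port is not
# reachable, the problem is networking/firewalls rather than TensorFlow itself.
for addr in cluster['worker']:
    host, port = addr.rsplit(':', 1)
    with socket.socket() as s:
        s.settimeout(3)
        try:
            s.connect((host, int(port)))
            print(f'{addr}: reachable')
        except OSError as exc:
            print(f'{addr}: NOT reachable ({exc})')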

CUDA Error when training YOLOv4-tiny on Colab: no kernel image is available for execution on the device

I was following this tutorial to train a YOLOv4-tiny model to detect custom objects: https://www.youtube.com/watch?v=NTnZgLsk_DA
However, when I attempt to train the model, I get this error message:
CUDA status Error: file: ./src/blas_kernels.cu : () : line: 841 : build time: Jan 7 2022 - 12:01:41
CUDA Error: no kernel image is available for execution on the device
CUDA Error: no kernel image is available for execution on the device: File exists
I was running the code on Colab, not locally. The GPU used for training is a Tesla K80.
A common answer is to set the ARCH values to compute_37, code_37, but I've already set them that way and keep getting the same error. So what should I do to get this code running?
Link to my Colab notebook: https://colab.research.google.com/drive/16EQ6I67OOs1I7rF6PHgBHp1eVHMXXvyO#scrollTo=QyMBDkaL-Aep
Any help would be appreciated!
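Not an answer to the darknet build itself, but one way to double-check which compute capability the Colab GPU actually reports (a small sketch, assuming a TF 2.4+ runtime where get_device_details is available). A Tesla K80 should report (3, 7), which is what the compute_37/sm_37 ARCH setting targets:
import tensorflow as tf

# Print the compute capability TensorFlow sees for each GPU; a K80 should show (3, 7).
for gpu in tf.config.list_physical_devices('GPU'):
    details = tf.config.experimental.get_device_details(gpu)
    print(gpu.name, details.get('device_name'), details.get('compute_capability'))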

Tensorflow: device CUDA:0 not supported by XLA service while setting up XLA_GPU_JIT device number 0

I got this error when using Keras with the TensorFlow backend:
tensorflow.python.framework.errors_impl.InvalidArgumentError: device CUDA:0 not supported by XLA service
while setting up XLA_GPU_JIT device number 0
Relevant code:
import tensorflow as tf
from keras import backend as K
# Enable XLA JIT and GPU memory growth, then register the session with Keras
tfconfig = tf.ConfigProto()
tfconfig.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
tfconfig.gpu_options.allow_growth = True
K.tensorflow_backend.set_session(tf.Session(config=tfconfig))
tensorflow version: 1.14.0
Chairman Guo's code:
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
solved my problem of jupyter notebook kernel crashing at:
tf.keras.models.load_model(path/to/my/model)
The fatal message was:
2020-01-26 11:31:58.727326: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_UNKNOWN: unknown error
My TF's version is: 2.2.0-dev20200123. There are 2 GPUs on this system.
This could be due to your TF-default (i.e. first) GPU running out of memory. If you have multiple GPUs, divert your Python program to run on another GPU. In TF (assuming TF 2.0-rc1), set the following:
import os
import tensorflow as tf

# Specify which GPU(s) to use
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # Or 2, 3, etc. other than 0
# On CPU/GPU placement
config = tf.compat.v1.ConfigProto(allow_soft_placement=True, log_device_placement=True)
config.gpu_options.allow_growth = True
tf.compat.v1.Session(config=config)
# Note that ConfigProto moved under tf.compat.v1 in TF 2.0
If, however, your environment has only one GPU, then perhaps you have no choice but to ask your buddy to stop their program, then treat them to a cup of coffee.
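On TF 2.x the same per-GPU selection and memory-growth setup can also be done without the compat.v1 session; a minimal sketch, assuming TF 2.1+ and at least two GPUs:
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if len(gpus) > 1:
    # Use only the second GPU and let its memory grow on demand
    # instead of reserving everything up front.
    tf.config.set_visible_devices(gpus[1], 'GPU')
    tf.config.experimental.set_memory_growth(gpus[1], True)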

How to use other tokenizers (NLTK, Jieba, etc.) in TensorFlow Serving

Recently, I have been using an Estimator to train and deploy a TensorFlow model, but when I deploy the model with TensorFlow Serving (it was exported using an Estimator serving_fn that includes tf.py_func), I get the error below.
I found an issue on GitHub saying that TensorFlow Serving cannot support tf.py_func.
Can anyone help?
I want to implement a tokenization function using another tokenizer (NLTK, Jieba).
The error:
Invalid argument: No OpKernel was registered to support Op 'PyFunc' used by {{node map/while/PyFunc}}with these attrs: [Tout=[DT_STRING], token="pyfunc_4", _output_shapes=[<unknown>], Tin=[DT_STRING]]
Registered devices: [CPU]
Registered kernels:
<no registered kernels>
Have you tried using the TensorFlow native tokenizers? E.g., see https://www.tensorflow.org/beta/tutorials/tensorflow_text/intro#tokenization
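A minimal sketch of what that looks like with the tensorflow_text package (assuming it is installed alongside TensorFlow); these tokenizers are regular TF ops, so unlike a tf.py_func wrapping NLTK/Jieba they survive SavedModel export and can be served:
import tensorflow_text as text

# WhitespaceTokenizer is a graph-compatible op, so it can be exported with the model.
tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['everything not saved will be lost'])
print(tokens.to_list())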

Can't use tensorflow.keras.layers.CuDNNLSTM or keras.layers.CuDNNLSTM in my Colab hosted runtime

When I tried to use either tensorflow.keras.layers.CuDNNLSTM or keras.layers.CuDNNLSTM, I got the following error:
InvalidArgumentError: No OpKernel was registered to support Op 'CudnnRNN' used by {{node cu_dnnlstm/CudnnRNN}}with these attrs: [dropout=0, seed=0, T=DT_FLOAT, input_mode="linear_input", direction="unidirectional", rnn_mode="lstm", is_training=true, seed2=0]
Registered devices: [CPU, XLA_CPU]
I am using the hosted runtime, and I presumed it supports a GPU as well, but the error message above shows there is no GPU registered. I'm not sure what the problem is, but any clue would be appreciated.
You need to explicitly request a GPU enabled runtime.
From the Runtime menu, select "Change runtime type", then select GPU under "Hardware accelerator".
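After switching, a quick sanity check in a notebook cell (assuming TensorFlow is importable in the runtime) should list at least one GPU:
import tensorflow as tf

# Should print a non-empty list once the GPU runtime is active.
print(tf.config.list_physical_devices('GPU'))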