Model gets stuck when using MirroredStrategy() - tensorflow

I'm trying to use multiple GPUs (2) in Keras, using MirroredStrategy() from TensorFlow. However, it leads to the following error:
Epoch 1/5
WARNING:tensorflow:From /home/user/conda36/lib/python3.6/site-packages/tensorflow/python/data/ops/multi_device_iterator_ops.py:601: get_next_as_optional (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Iterator.get_next_as_optional()` instead.
2020-09-06 18:50:54.766930: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-09-06 18:50:55.069925: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-09-06 18:50:56.049400: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at depthwise_conv_op.cc:386 : Invalid argument: Computed output size would be negative: -1 [input_size: 3, effective_filter_size: 5, stride: 1]
At this point the model freezes: the process is still running, but it does nothing.
If I run it without MirroredStrategy(), it works fine, but of course it then only uses 1 GPU.
I use MirroredStrategy() like this:
def get_model():
    .
    .
    .
    decoded = Conv2D(1, (3, 3), activation='linear', padding='same')(d)
    autoencoder = Model(input_img, decoded)
    autoencoder.summary()
    autoencoder.compile(optimizer='Adagrad', loss='mean_squared_error')
    return autoencoder

if __name__ == '__main__':
    # Model
    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
        autoencoder = get_model()
What could be causing this error?

Related

How to use Tensorflow Hub models on cloud TPU?

I'm trying to use a model from TensorFlow Hub on Kaggle, like so:
m = tf.keras.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/4",
                   output_shape=[1280],
                   trainable=False),  # Can be True, see below.
    tf.keras.layers.Dense(num_classes, activation='softmax')
])
m.build([None, 224, 224, 3])  # Batch input shape.
It works well on GPU, but as soon as I switch to TPU with TFRecords I get the following error:
InvalidArgumentError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on /tmp/tfhub_modules/87fb99f72aec02d017e12c0a3d86c5c182ec22ca/variables/variables: Unimplemented: File system scheme '[local]' not implemented (file: '/tmp/tfhub_modules/87fb99f72aec02d017e12c0a3d86c5c182ec22ca/variables/variables')
However, the setup and the TFRecord dataset are correct: everything works if I swap the pretrained hub model for the Keras application of the same model (i.e., for the example above, using the MobileNet Keras application).
I tried caching but was unsuccessful; is there something I have to be aware of when following this guide:
https://www.tensorflow.org/hub/caching
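(For reference, the approach in that guide boils down to pointing the TF-Hub module cache at a Cloud Storage bucket the TPU workers can read, instead of local /tmp; a minimal sketch, with a hypothetical bucket name:)

import os
# Cache downloaded TF-Hub modules in a GCS bucket accessible to the TPU
# (bucket name is hypothetical). Set this before the hub.KerasLayer call.
os.environ["TFHUB_CACHE_DIR"] = "gs://my-bucket/tfhub-modules-cache"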
Thanks in advance!
The failure happens because the TPU is trying to load the TFHub model from /tmp/, which it doesn't have access to. You should be able to get this to work with:
with strategy.scope():
    load_locally = tf.saved_model.LoadOptions(experimental_io_device='/job:localhost')
    m = tf.keras.Sequential([
        hub.KerasLayer(
            "https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/4",
            output_shape=[1280],
            load_options=load_locally,
            trainable=False),  # Can be True, see below.
        tf.keras.layers.Dense(num_classes, activation='softmax')
    ])
Source: EfficientNetB7 on 100+ flowers.
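(For context, strategy in the answer above is the TPU strategy created earlier in that notebook. A minimal sketch of that setup, assuming the standard TPU initialization flow; on older TF 2.x releases the class is tf.distribute.experimental.TPUStrategy:)

import tensorflow as tf

# Discover and initialize the TPU cluster, then build the strategy.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)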

Python 3.8.8 Jupyter notebook kernel dies when I call model.fit() while trying to use my GPU

My TensorFlow installation recognizes my GPU.
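(A minimal check along these lines confirms that; it should list at least one physical GPU device:)

import tensorflow as tf
# Prints the GPUs TensorFlow can see, e.g.
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
print(tf.config.list_physical_devices('GPU'))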
However, when I call model.fit() on my data, it prints Epoch 1/2 and then the kernel dies immediately.
If I run this in a separate virtual environment with no GPU, it works fine.
As a quick test, I simplified the model architecture and reduced the number of training points to only ten, and it still fails.
A simple example:

import numpy as np
from tensorflow import keras
from tensorflow.keras.layers import Dense

# X_train and y_train are loaded elsewhere (the original used numpy's loadtxt).
model = keras.Sequential()
model.add(Dense(4, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
opt = keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
info = model.fit(X_train, y_train, epochs=2, batch_size=2, shuffle=True, verbose=1)
Versions:
Python 3.8.8
Num GPUs Available: 1
TensorFlow 2.5.0-dev20210227 (tf-nightly)
Keras 2.4.3
CUDA v11.2
I am going to answer my own question rather than deleting this, because maybe someone else will make the same simple mistake I did.
The main mistake I made was installing the wrong CUDA version. You can check which versions are compatible at this link:
https://www.tensorflow.org/install/source#gpu
TLDR: Just follow this video:
https://www.youtube.com/watch?v=hHWkvEcDBO0
This also highlighted the importance of a virtual environment where you control the package versions to prevent incompatibilities.
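(As a quick way to check which CUDA and cuDNN versions your installed TensorFlow binary expects, a minimal sketch; tf.sysconfig.get_build_info() is available in recent TF 2.x releases, and the keys are absent on CPU-only builds:)

import tensorflow as tf
# Reports the CUDA/cuDNN versions the binary was built against; compare them
# with the table at https://www.tensorflow.org/install/source#gpu
build = tf.sysconfig.get_build_info()
print(build.get('cuda_version'), build.get('cudnn_version'))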
I had the same problem. I moved the code into a plain Python file and found the root cause. In my case, the fix was copying the cuDNN DLL files into C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6\bin. Check the following link as well:
Could not load dynamic library 'cudnn64_8.dll'; dlerror: cudnn64_8.dll not found

Why Keras Lambda-Layer cause problem Mask_RCNN?

I'm using the Mask_RCNN package from this repo: https://github.com/matterport/Mask_RCNN.
I tried to train on my own dataset using this package, but it gives me an error right at the start:
2020-11-30 12:13:16.577252: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-11-30 12:13:16.587017: E tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-11-30 12:13:16.587075: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (7612ade969e5): /proc/driver/nvidia/version does not exist
2020-11-30 12:13:16.587479: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-11-30 12:13:16.593569: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2300000000 Hz
2020-11-30 12:13:16.593811: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1b2aa00 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-11-30 12:13:16.593846: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
Traceback (most recent call last):
  File "machines.py", line 345, in <module>
    model_dir=args.logs)
  File "/content/Mask_RCNN/mrcnn/model.py", line 1837, in __init__
    self.keras_model = self.build(mode=mode, config=config)
  File "/content/Mask_RCNN/mrcnn/model.py", line 1934, in build
    anchors = KL.Lambda(lambda x: tf.Variable(anchors), name="anchors")(input_image)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 926, in __call__
    input_list)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 1117, in _functional_construction_call
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/core.py", line 904, in call
    self._check_variables(created_variables, tape.watched_variables())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/core.py", line 931, in _check_variables
    raise ValueError(error_str)
ValueError:
The following Variables were created within a Lambda layer (anchors)
but are not tracked by said layer:
  <tf.Variable 'anchors/Variable:0' shape=(1, 261888, 4) dtype=float32>
The layer cannot safely ensure proper Variable reuse across multiple
calls, and consquently this behavior is disallowed for safety. Lambda
layers are not well suited to stateful computation; instead, writing a
subclassed Layer is the recommend way to define layers with
Variables.
I looked up the part of the code responsible for the problem (located in /mrcnn/model.py, line 1935 in the repo):
anchors = KL.Lambda(lambda x: tf.Variable(anchors), name="anchors")(input_image)
If anyone has an idea how to solve this, or has already solved it, please share the solution.
Go to mrcnn/model.py and add:

class AnchorsLayer(KL.Layer):
    def __init__(self, anchors, name="anchors", **kwargs):
        super(AnchorsLayer, self).__init__(name=name, **kwargs)
        self.anchors = tf.Variable(anchors)

    def call(self, dummy):
        return self.anchors

    def get_config(self):
        config = super(AnchorsLayer, self).get_config()
        return config

Then find the line:
anchors = KL.Lambda(lambda x: tf.Variable(anchors), name="anchors")(input_image)
and replace it with:
anchors = AnchorsLayer(anchors, name="anchors")(input_image)
Works like a charm in TF 2.4!
ROOT CAUSE:
The behavior of Keras's Lambda layer changed from TensorFlow 1.X to TensorFlow 2.X.
In Keras on TensorFlow 1.X, all tf.Variable and tf.get_variable calls were automatically tracked into layer.weights via a variable-creator context, so they received gradients and were trainable automatically. That approach caused problems with the graph compilation that converts Python code into an execution graph in TensorFlow 2.X, so it was removed, and the Lambda layer now checks for variable creation and raises the error you see. In short, a Lambda layer in TensorFlow 2.X has to be stateless. If you want to create a variable, the correct way in TensorFlow 2.X is to subclass the Layer class and add the trainable weight as a class member.
SOLUTIONS:
There are 2 choices:
1. Switch to TensorFlow 1.X; this error will not be raised there.
2. Replace the Lambda layer with a subclass of the Keras Layer class:

class AnchorsLayer(tensorflow.keras.layers.Layer):
    def __init__(self, anchors):
        super(AnchorsLayer, self).__init__()
        self.anchors_v = tf.Variable(anchors)

    def call(self, dummy):  # a Keras layer must be called with an input tensor
        return self.anchors_v

# Then replace the Lambda call with this:
anchors = AnchorsLayer(anchors)(input_image)

ValueError: None is only supported in the 1st dimension. Tensor 'input_tensor' has invalid shape '[1, None, None, 3]'

I trained a custom MobileNetV2 SSD model for object detection. I saved the .pb file and now I want to convert it to a .tflite file in order to use it with a Coral Edge TPU.
I use TensorFlow 2.2 on Windows 10, on CPU.
The code I'm using:
import numpy as np
import tensorflow as tf

saved_model_dir = r"C:/Tensorflow/Backup_Training/my_MobileNetV2_fpnlite_320x320/saved_model"
num_calibration_steps = 100

def representative_dataset_gen():
    for _ in range(num_calibration_steps):
        # Get sample input data as a numpy array
        yield [np.random.uniform(0.0, 1.0, size=(1, 416, 416, 3)).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.experimental_new_converter = True
converter.representative_dataset = representative_dataset_gen
converter.target_spec.supported_ops = [
    # tf.lite.OpsSet.TFLITE_BUILTINS_INT8,
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS
]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_quant_model = converter.convert()

with open('model.tflite', 'wb') as f:
    f.write(tflite_quant_model)
I tried several solutions proposed in other threads, and I also tried tf-nightly, TF 2.3 and TF 1.14, but none of them worked (there was always another error message I couldn't handle). Since I trained with TF 2.2, I thought it might be a good idea to stay on TF 2.2.
Since I'm new to TensorFlow, I have several questions: what exactly is the input tensor, and where do I define it? Is there a way to see or extract this input tensor?
Does anybody know how to fix this issue?
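(For reference, one way to see the input tensor the converter is complaining about is to load the SavedModel and print its serving signature; a minimal sketch, assuming the default 'serving_default' signature key:)

import tensorflow as tf

saved_model_dir = r"C:/Tensorflow/Backup_Training/my_MobileNetV2_fpnlite_320x320/saved_model"
# The structured input signature shows each input's name, dtype and shape
# (which, per the error below, is [1, None, None, 3] here).
loaded = tf.saved_model.load(saved_model_dir)
infer = loaded.signatures['serving_default']
print(infer.structured_input_signature)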
The whole error message:
(tf22) C:\Tensorflow\Backup_Training>python full_int_quant.py
2020-10-22 14:51:20.460948: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2020-10-22 14:51:20.466366: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2020-10-22 14:51:29.231404: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'nvcuda.dll'; dlerror: nvcuda.dll not found
2020-10-22 14:51:29.239003: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
2020-10-22 14:51:29.250497: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: ip3536
2020-10-22 14:51:29.258432: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: ip3536
2020-10-22 14:51:29.269261: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-10-22 14:51:29.291457: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2ae2ac3ffc0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-22 14:51:29.298043: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-10-22 14:52:03.785341: I tensorflow/core/grappler/devices.cc:55] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2020-10-22 14:52:03.790251: I tensorflow/core/grappler/clusters/single_machine.cc:356] Starting new session
2020-10-22 14:52:04.559832: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:797] Optimization results for grappler item: graph_to_optimize
2020-10-22 14:52:04.564529: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799] function_optimizer: Graph size after: 3672 nodes (3263), 5969 edges (5553), time = 136.265ms.
2020-10-22 14:52:04.570187: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799] function_optimizer: function_optimizer did nothing. time = 2.637ms.
2020-10-22 14:52:10.742013: I tensorflow/core/grappler/devices.cc:55] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2020-10-22 14:52:10.746868: I tensorflow/core/grappler/clusters/single_machine.cc:356] Starting new session
2020-10-22 14:52:12.358897: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:797] Optimization results for grappler item: graph_to_optimize
2020-10-22 14:52:12.363657: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799] constant_folding: Graph size after: 1714 nodes (-1958), 2661 edges (-3308), time = 900.347ms.
2020-10-22 14:52:12.369137: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799] constant_folding: Graph size after: 1714 nodes (0), 2661 edges (0), time = 60.628ms.
Traceback (most recent call last):
  File "full_int_quant.py", line 40, in <module>
    tflite_model = converter.convert()
  File "C:\Users\schulzyk\Anaconda3\envs\tf22\lib\site-packages\tensorflow\lite\python\lite.py", line 480, in convert
    raise ValueError(
ValueError: None is only supported in the 1st dimension. Tensor 'input_tensor' has invalid shape '[1, None, None, 3]'.
Whatever I change in the code, I always get the same error message. I don't know whether this is a sign that something went wrong during training, but nothing looked suspicious there.
I'd be happy about any kind of feedback!
Ah, the object detection API with TensorFlow 2.0 for Coral is still a work in progress. We are hitting many roadblocks and may not see this feature soon. I suggest using the TF 1.x API for now. Here is a good tutorial :)
https://github.com/Namburger/edgetpu-ssdlite-mobiledet-retrain
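(A workaround that is often suggested for this specific ValueError, sketched here without having been tested on this model: pin the dynamic height/width of the input on a concrete function before conversion, so the converter no longer sees None outside the batch dimension. The 'serving_default' key and the 416x416 shape are assumptions carried over from the question's calibration data.)

import tensorflow as tf

saved_model_dir = r"C:/Tensorflow/Backup_Training/my_MobileNetV2_fpnlite_320x320/saved_model"
model = tf.saved_model.load(saved_model_dir)
concrete_func = model.signatures['serving_default']
# Fix the input shape: [1, None, None, 3] -> [1, 416, 416, 3]
concrete_func.inputs[0].set_shape([1, 416, 416, 3])
converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete_func])
tflite_model = converter.convert()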

TensorFlow 2.0 MultiWorkerMirroredStrategy example hangs

I followed the example from the official TensorFlow website:
https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras
Here is my spec:
WSL
Ubuntu 16.04.6 LTS
TensorFlow 2.0
No GPU available
I have a file called 'tfexample.py' which looks like this:
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow_datasets as tfds
import tensorflow as tf
import json, os

tfds.disable_progress_bar()

os.environ["TF_CONFIG"] = json.dumps(
    {
        "cluster": {"worker": ["localhost:12345", "localhost:23456"]},
        "task": {"type": "worker", "index": 0},
    }
)

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

BUFFER_SIZE = 10000
BATCH_SIZE = 64

def make_datasets_unbatched():
    # Scaling MNIST data from (0, 255] to (0., 1.]
    def scale(image, label):
        image = tf.cast(image, tf.float32)
        image /= 255
        return image, label

    datasets, info = tfds.load(name="mnist", with_info=True, as_supervised=True)
    return datasets["train"].map(scale).cache().shuffle(BUFFER_SIZE)

train_datasets = make_datasets_unbatched().batch(BATCH_SIZE)

def build_and_compile_cnn_model():
    model = tf.keras.Sequential(
        [
            tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(10, activation="softmax"),
        ]
    )
    model.compile(
        loss=tf.keras.losses.sparse_categorical_crossentropy,
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
        metrics=["accuracy"],
    )
    return model

# single_worker_model = build_and_compile_cnn_model()
# single_worker_model.fit(x=train_datasets, epochs=3, steps_per_epoch=5)

NUM_WORKERS = 2
# Here the batch size scales up by number of workers since
# `tf.data.Dataset.batch` expects the global batch size. Previously we used 64,
# and now this becomes 128.
GLOBAL_BATCH_SIZE = 64 * NUM_WORKERS

with strategy.scope():
    # Creation of dataset, and model building/compiling need to be within
    # `strategy.scope()`.
    train_datasets = make_datasets_unbatched().batch(GLOBAL_BATCH_SIZE)
    multi_worker_model = build_and_compile_cnn_model()

# Keras' `model.fit()` trains the model with specified number of epochs and
# number of steps per epoch. Note that the numbers here are for demonstration
# purposes only and may not sufficiently produce a model with good quality.
multi_worker_model.fit(x=train_datasets, epochs=3, steps_per_epoch=5)
When I run this file with
python tfexample.py
the terminal just hangs, as shown below:
2020-02-04 17:50:23.483411: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-02-04 17:50:23.485194: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-02-04 17:50:23.485747: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/home/danny/.local/lib/python2.7/site-packages/requests/__init__.py:83: RequestsDependencyWarning: Old version of cryptography ([1, 2, 3]) may cause slowdown.
warnings.warn(warning, RequestsDependencyWarning)
2020-02-04 17:50:29.013263: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-02-04 17:50:29.014152: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303)
2020-02-04 17:50:29.014781: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (WINDOWS-6DFFM0Q): /proc/driver/nvidia/version does not exist
2020-02-04 17:50:29.015780: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-02-04 17:50:29.025575: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2701000000 Hz
2020-02-04 17:50:29.027050: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x66b11a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-04 17:50:29.027669: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
E0204 17:50:29.038614800 24084 socket_utils_common_posix.cc:198] check for SO_REUSEPORT: {"created":"#1580856629.038575000","description":"Protocol not available","errno":92,"file":"external/grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":175,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
E0204 17:50:29.039313500 24084 socket_utils_common_posix.cc:299] setsockopt(TCP_USER_TIMEOUT) Protocol not available
2020-02-04 17:50:29.051180: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> localhost:12345, 1 -> localhost:23456}
2020-02-04 17:50:29.053392: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:12345
Any help will be appreciated!
Are you running tfexample.py in two separate sessions, with the correct TF_CONFIG for each? I haven't tried running two instances on the same machine.
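(For reference, the tutorial's setup expects one process per entry in the cluster spec, each with its own TF_CONFIG; worker 0 blocks at startup until worker 1 joins. A minimal sketch of what a second copy of the script would change, based on the TF_CONFIG in the question:)

import json, os

os.environ["TF_CONFIG"] = json.dumps(
    {
        # Same cluster spec as worker 0, but this process takes the role of worker 1.
        "cluster": {"worker": ["localhost:12345", "localhost:23456"]},
        "task": {"type": "worker", "index": 1},
    }
)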
This problem is because MultiWorkerMirroredStrategy() needs one running worker process per worker in the cluster spec; with only worker 0 started, it waits indefinitely for the others. If you want to run the script on your local machine, you can run each worker in a different Docker container.