I am currently trying to train a model for my bachelor's.
The estimated training time is very long, however, so I have considered using TPUs. But every time I try to train with a TPU strategy, following this Google notebook, I keep getting the following error:
(0) INTERNAL: {{function_node __inference_train_function_7167}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1651692210.674048314","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3124,"referenced_errors":[{"created":"#1651692210.674047476","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[strided_slice_69/_310]]
Error as shown in Colab
You can check my TPU boilerplate code here:
import tensorflow as tf

try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
    print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
except ValueError:
    raise BaseException('ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')

tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
tpu_strategy = tf.distribute.experimental.TPUStrategy(tpu)
My dataset is stored in my Google Drive as images, and I am training with tf.keras.Model.fit.
I'm trying to use a model from TensorFlow Hub on Kaggle, like so:
import tensorflow as tf
import tensorflow_hub as hub

m = tf.keras.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/4",
                   output_shape=[1280],
                   trainable=False),  # Can be True, see below.
    tf.keras.layers.Dense(num_classes, activation='softmax')
])
m.build([None, 224, 224, 3])  # Batch input shape.
It works well on GPU, but as soon as I switch to TPU with TFRecords I get the following error:
InvalidArgumentError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on /tmp/tfhub_modules/87fb99f72aec02d017e12c0a3d86c5c182ec22ca/variables/variables: Unimplemented: File system scheme '[local]' not implemented (file: '/tmp/tfhub_modules/87fb99f72aec02d017e12c0a3d86c5c182ec22ca/variables/variables')
However, the setup and the TFRecord dataset are both correct: everything works if I swap the pretrained TF Hub model for the Keras application of the same model (e.g., in the example above, the MobileNet Keras application).
I tried caching, but I have been unsuccessful. Is there something I have to beware of when following this guide:
https://www.tensorflow.org/hub/caching
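For reference, here is a minimal sketch of what I tried from that guide (the bucket name is a placeholder):

import os

# Point the TF Hub module cache at a GCS bucket instead of local /tmp
# (placeholder bucket name); must be set before hub.KerasLayer is called.
os.environ["TFHUB_CACHE_DIR"] = "gs://my-bucket/tfhub-modules-cache"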
Thanks in advance!
The failure happens because the TPU is trying to load the TF Hub model from /tmp/, which it doesn't have access to. You should be able to get this to work by passing load_options to the layer:
import tensorflow as tf
import tensorflow_hub as hub

with strategy.scope():  # your TPUStrategy instance
    # Read the SavedModel on the local (notebook) host rather than on the TPU workers.
    load_locally = tf.saved_model.LoadOptions(experimental_io_device='/job:localhost')
    m = tf.keras.Sequential([
        hub.KerasLayer(
            "https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/4",
            output_shape=[1280],
            load_options=load_locally,
            trainable=False),  # Can be True, see below.
        tf.keras.layers.Dense(num_classes, activation='softmax')
    ])
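For completeness, the rest of the setup is unchanged (the compile arguments below are placeholders), still inside strategy.scope():

    m.build([None, 224, 224, 3])  # Batch input shape.
    m.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])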
Source: EfficientNetB7 on 100+ flowers.
I'm a newbie to ML. While trying to do digit recognition on a TPU, I ran into problems that really have me stuck.
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (InputLayer, Dropout, Conv2D, LeakyReLU,
                                     BatchNormalization, MaxPooling2D, Flatten, Dense)

with strategy.scope():
    Model = Sequential([
        InputLayer((28, 28, 1)),
        Dropout(0.1),
        Conv2D(128, 3, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        MaxPooling2D(2, 2),
        Conv2D(64, 3, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        MaxPooling2D(2, 2),
        Flatten(),
        Dense(128, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        Dense(10, activation='softmax')
    ])

with strategy.scope():
    Model.compile(optimizer='adam',
                  loss='categorical_crossentropy', metrics=['accuracy'])
CancelledError: 4 root error(s) found.
(0) Cancelled: Operation was cancelled
[[node IteratorGetNextAsOptional_1 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
(1) Cancelled: Iterator was cancelled
[[node IteratorGetNextAsOptional_6 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
(2) Cancelled: Operation was cancelled
[[node IteratorGetNextAsOptional_3 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
(3) Cancelled: Iterator was cancelled
[[node IteratorGetNextAsOptional_5 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
0 successful operations.
5 derived errors ignored. [Op:__inference_train_function_23675]
Function call stack:
train_function -> train_function -> train_function -> train_function
Then I ran it again and got the following errors:
UnavailableError: 9 root error(s) found.
(0) Unavailable: failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[cond_11/switch_pred/_107/_78]]
(1) Unavailable: failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[TPUReplicate/_compile/_7290104207349758044/_4/_178]]
(2) Unavailable: failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[tpu_compile_succeeded_assert/_13543899577889784813/_5/_281]]
(3) Unavailable: failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[strided_slice_37 ... [truncated] [Op:__inference_train_function_6939]
Function call stack:
train_function -> train_function -> train_function -> train_function
There must be a strategy.scope() missing somewhere:
I have tried many times and succeeded in several other notebooks, but those all use tf.data.Dataset.
Still, I can't figure out where this simple digit-recognition example goes wrong. I have searched again and again and have been stuck here for two days; it's really frustrating.
The full code is at
https://www.kaggle.com/dacianpeng/digit-hello-world?scriptVersionId=72464286
Version 6 is the TPU version; it differs from Version 5 only by the code above. Please help me!
It looks like you are storing your training data locally, which is causing the issue: TPUs can only access data in GCS.
TPUs read training data exclusively from GCS (Google Cloud Storage); see the details here.
You can also check this Stack Overflow post: Colab TPU Error when calling model.fit(): UnimplementedError.
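A minimal sketch of what that looks like (the bucket path and filename pattern are placeholders):

import tensorflow as tf

# TPU workers can read gs:// paths directly, unlike local files.
GCS_PATH = 'gs://your-bucket/digit-tfrecords'
filenames = tf.io.gfile.glob(GCS_PATH + '/*.tfrec')
ds = tf.data.TFRecordDataset(filenames)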
I fixed the problem by changing the inputs to tf.data.Dataset (without GCS).
Calling fit() with a local tf.data.Dataset works fine, but it fails with Unavailable: failed to connect to all addresses as soon as ImageDataGenerator() is used.
# Fixed by changing to tf.data.Dataset
ds1 = tf.data.Dataset.from_tensor_slices((DS1, L1)).batch(128).prefetch(-1)
ds2 = tf.data.Dataset.from_tensor_slices((DS2, L2)).batch(128).prefetch(-1)
...
...
History = Model.fit(ds1, epochs=Epochs, validation_data=ds2,
                    callbacks=[ReduceLR, Stop], verbose=1)
# One epoch's time is not stable (sometimes faster, sometimes slower),
# but most of the time it is about the same as on GPU
It fails once ImageDataGenerator() is used:
# Fails again with ImageDataGenerator() used
ds1 = tf.data.Dataset.from_generator(
    lambda: ImageModifier.flow(DS1, L1),
    output_signature=(
        tf.TensorSpec(shape=(28, 28, 1), dtype=tf.float32),
        tf.TensorSpec(shape=(10,), dtype=tf.float32))
).batch(128).prefetch(-1)

History = Model.fit(ds1, epochs=Epochs, verbose=1)
---------------------------------------------------------------------------
UnavailableError Traceback (most recent call last)
<ipython-input-107-149f17c4776c> in <module>
1 Epochs = 15
----> 2 History = Model.fit(ds1, epochs=Epochs, verbose=1)
/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
1100 tmp_logs = self.train_function(iterator)
1101 if data_handler.should_sync:
-> 1102 context.async_wait()
1103 logs = tmp_logs # No error, now safe to assign to logs.
1104 end_step = step + data_handler.step_increment
/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py in async_wait()
2328 an error state.
2329 """
-> 2330 context().sync_executors()
2331
2332
/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py in sync_executors(self)
643 """
644 if self._context_handle:
--> 645 pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
646 else:
647 raise ValueError("Context is not initialized.")
UnavailableError: 4 root error(s) found.
(0) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[Pad_2/paddings/_130]]
(1) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[strided_slice_36/_238]]
(2) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[IteratorGetNextAsOptional_3/_35]]
(3) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
0 successful operations.
5 derived errors ignored.
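One way around this (a sketch, not a drop-in fix: augment below is a hypothetical stand-in for whatever ImageModifier does) is to express the augmentation as tf.image ops inside the tf.data pipeline, so no local Python generator has to be reached from the TPU host:

import tensorflow as tf

def augment(image, label):
    # Example op only; replace with ops matching ImageModifier's settings.
    image = tf.image.random_brightness(image, max_delta=0.1)
    return image, label

ds1 = (tf.data.Dataset.from_tensor_slices((DS1, L1))
       .map(augment, num_parallel_calls=tf.data.experimental.AUTOTUNE)
       .batch(128)
       .prefetch(-1))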
I'm trying to train a Keras model and save the model weights at every epoch and batch.
I define a checkpoint as follows:
from tensorflow.keras.callbacks import ModelCheckpoint

checkpoint_path = 'model_checkpoints_5000/checkpoints_{epoch:02d}_{batch:04d}'
checkpoint = ModelCheckpoint(filepath=checkpoint_path, save_freq=5000)
and train the model:
model.fit(x=x_train, y=y_train, epochs=3, validation_data=(x_test, y_test),
          batch_size=10, callbacks=[checkpoint])
But right after the first iteration, this error occurs:
KeyError: 'Failed to format this callback filepath: "model_checkpoints_5000/checkpoints_{epoch:02d}_{batch:04d}". Reason: \'batch\'
How can I have Python add the batch number to the file name?
And how can I find the list of other parameters that are available for use in the filepath template?
My setup: Windows 10, jupyter notebook in chrome, Python 3.5.4, Tensorflow 2.3.0, Keras is imported from Tensorflow.
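For what it's worth, ModelCheckpoint formats the filepath with {epoch} and the metric keys from logs, and (as the error shows) batch is not among them, so one workaround is a custom callback. A minimal sketch (the class name and freq value are illustrative):

import tensorflow as tf

class BatchCheckpoint(tf.keras.callbacks.Callback):
    """Save weights every `freq` batches, putting the epoch and batch
    numbers into the filename ourselves."""
    def __init__(self, path_template, freq):
        super().__init__()
        self.path_template = path_template  # e.g. '.../ckpt_{epoch:02d}_{batch:04d}'
        self.freq = freq
        self.epoch = 0

    def on_epoch_begin(self, epoch, logs=None):
        self.epoch = epoch + 1  # Keras reports epochs 0-based

    def on_train_batch_end(self, batch, logs=None):
        if (batch + 1) % self.freq == 0:
            self.model.save_weights(
                self.path_template.format(epoch=self.epoch, batch=batch + 1))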
I am trying to run an LSTM in an autoencoder configuration in Google Colab with GPU support, but I get the following warning:
WARNING:tensorflow:Layer lstm will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
The generic GPU kernel is significantly slower than the cuDNN kernel, so I searched for a solution and found the cuDNN kernel requirements for the LSTM layer here (https://keras.io/api/layers/recurrent_layers/lstm/). Specifically, the requirements are (see the sketch after this list):
activation == tanh
recurrent_activation == sigmoid
recurrent_dropout == 0
unroll is False
use_bias is True
Inputs, if use masking, are strictly right-padded.
Eager execution is enabled in the outermost context.
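For reference, a minimal sketch of an LSTM layer whose arguments satisfy all of the criteria above (every value shown is the TF 2.x default except the explicit unit count):

from tensorflow.keras.layers import LSTM

lstm = LSTM(
    200,
    activation='tanh',               # cuDNN requires tanh
    recurrent_activation='sigmoid',  # cuDNN requires sigmoid
    recurrent_dropout=0,             # must be exactly 0
    unroll=False,
    use_bias=True,
)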
I thought I had found my problem in the activation function, which was 'relu'; the other cuDNN requirements are the TF defaults. Here is the updated code:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

model = Sequential()
model.add(LSTM(200, activation='tanh', input_shape=(n_timesteps, n_features)))
model.add(RepeatVector(n_outputs))
model.add(LSTM(200, activation='tanh', return_sequences=True))
model.add(TimeDistributed(Dense(100, activation='relu')))
model.add(TimeDistributed(Dense(1)))
model.compile(loss='mse', optimizer='adam')
model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)
But upon execution I get the same warning that the cuDNN requirements are not met. What's wrong?
TF: 2.3.0