Getting KeyError : 'callable_inputs' when trying to save a TF model in S3 bucket - tensorflow

I'm using sagemaker 2.5.1 and tensorflow 2.3.0.
The weird part is that the same code worked before; the only change I can think of is the new release of the two libraries.

This appears to be a bug with SageMaker.
I'm assuming you are using a TensorFlow estimator to train the model. Something like this:
estimator = TensorFlow(
    entry_point='script.py',
    role=role,
    train_instance_count=1,
    train_instance_type='ml.p3.2xlarge',
    framework_version='2.3.0',
    py_version='py37',
    script_mode=True,
    hyperparameters={
        'epochs': 100,
        'batch-size': 256,
        'learning-rate': 0.001
    }
)
If that's the case, either TensorFlow 2.2 or TensorFlow 2.3 is causing this error when debugger callbacks are enabled. To fix the issue, you can set debugger_hook_config to False:
estimator = TensorFlow(
    entry_point='script.py',
    role=role,
    train_instance_count=1,
    train_instance_type='ml.p3.2xlarge',
    framework_version='2.3.0',
    py_version='py37',
    script_mode=True,
    debugger_hook_config=False,
    hyperparameters={
        'epochs': 100,
        'batch-size': 256,
        'learning-rate': 0.001
    }
)

The problem actually comes from smdebug version 0.9.1.
Downgrading to 0.8.1 solves the issue.
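To check which smdebug release an environment actually has before deciding to downgrade, something like the following may help. It is only a sketch: the helper names are made up for illustration, the only version facts it encodes are the two from the answer above (0.9.1 fails, 0.8.1 works), and it assumes Python 3.8+ for importlib.metadata (on 3.7 the importlib_metadata backport would be needed).

```python
from importlib import metadata
from typing import Optional

# Taken from the answer above: 0.9.1 triggers the KeyError, 0.8.1 does not.
# Nothing is assumed about any other release.
KNOWN_BAD = {"0.9.1"}
KNOWN_GOOD = "0.8.1"


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def downgrade_hint(version: Optional[str]) -> Optional[str]:
    """Return a pip command if `version` is the known-bad release, else None."""
    if version in KNOWN_BAD:
        return f"pip install smdebug=={KNOWN_GOOD}"
    return None


if __name__ == "__main__":
    print(downgrade_hint(installed_version("smdebug")))
```

Running this inside the training environment prints the pip command only when the known-bad release is installed; otherwise it prints None.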

Related

CUDA error: device-side assert triggered in Colab

I am training an EfficientDet v2 model on a dataset in COCO JSON format on Colab. The model config is here:
gtf.Train_Dataset(root_dir, coco_dir, img_dir, set_dir, batch_size=8, image_size=512, use_gpu=True,num_workers=2)
gtf.Model();
gtf.Set_Hyperparams(lr=0.0001, val_interval=1, es_min_delta=0.0, es_patience=0)
%%time
gtf.Train(num_epochs=10, model_output_dir="trained/");
While training, I get the error in the title.
I tried adding this code and restarting the runtime, but I still get the same error.
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
Can anyone help me solve this?
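One thing worth checking: CUDA_LAUNCH_BLOCKING does not fix a device-side assert. It only makes kernel launches synchronous so the traceback points at the operation that failed, and as far as I know it has to be set before the GPU library initializes CUDA. A minimal sketch for the first notebook cell after a runtime restart follows. The helper name is hypothetical, and looking at sys.modules is only a rough heuristic that assumes the training library is built on torch or tensorflow.

```python
import os
import sys

# Put this in the first cell after restarting the runtime, before importing
# the training library, so the variable is in place before any GPU work starts.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def already_imported(names=("torch", "tensorflow")):
    """Hypothetical helper: list the named frameworks that were imported earlier."""
    return [name for name in names if name in sys.modules]


early_imports = already_imported()
if early_imports:
    # The variable may have been set too late to take effect in this process.
    print("Already imported before setting the variable:", early_imports)
```

If the list is non-empty, restarting the runtime and running this cell first is the safer order.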

Extremely slow when saving model on Colab TPU

Saving a model is extremely slow in the Colab TPU environment.
I first ran into this when using a checkpoint callback, which caused training to get stuck at the end of the first epoch.
Then I removed the callback and saved the model with model.save_weights() instead, but nothing changed. Using the Colab terminal, I found that the saving speed is only about 100k per 5 minutes.
TensorFlow version: 2.3
My model-fitting code is here:
with tpu_strategy.scope():  # creating the model in the TPUStrategy scope means we will train the model on the TPU
    model = create_model()
checkpoint = keras.callbacks.ModelCheckpoint('baseline_{epoch:03d}.h5',
                                             save_weights_only=True, save_freq="epoch")
hist = model.fit(get_train_ds().repeat(),
                 steps_per_epoch=100,
                 epochs=5,
                 verbose=1,
                 callbacks=[checkpoint])
model.save_weights("epoch-test.h5", overwrite=True)
I found that the issue happened because I had explicitly switched to graph mode by writing
from tensorflow.python.framework.ops import disable_eager_execution
disable_eager_execution()
before
with tpu_strategy.scope():
    model.fit(...)
Though I still don't understand the cause, removing disable_eager_execution() solved the issue.

Table not initialized issue using @tf.function while loading TF hub model

I am trying to load a TF Hub model and predict the output using the @tf.function decorator. It throws the error tensorflow.python.framework.errors_impl.FailedPreconditionError: Table not initialized.
TF version - 2.1.0
TF hub Version - 0.8.0
Note: It works without the @tf.function decorator.
import tensorflow as tf
import tensorflow_hub as hub
image_tensor = tf.constant(2.0, shape=[1, 298, 298, 3])
@tf.function
def run_function(method, args):
    return method(args)
detector = hub.KerasLayer("https://tfhub.dev/google/openimages_v4/ssd/mobilenet_v2/1",
                          signature_outputs_as_dict=True)
detector_output = run_function(detector, image_tensor)
class_names = detector_output["detection_class_entities"]
print(class_names)
Does anyone know why it does not work with @tf.function?
You are loading a TensorFlow 1 hub model with hub.KerasLayer, which is meant for TF2 models.
On TensorFlow Hub, there is a toggle button to view hub models for a specific TensorFlow version.
To make it work with hub.KerasLayer, change the URL to either of the following TF2 MobileNet versions:
https://tfhub.dev/google/tf2-preview/mobilenet_v2/classification/4
https://tfhub.dev/google/imagenet/mobilenet_v2_050_96/classification/4
Or, if you have to use the exact URL from your example, use hub.Module (the TF1-style API) instead of hub.KerasLayer.

How to use the efficientnet-lite provided by tfhub for the second training on tf2.1

The version I use is tensorflow-gpu version 2.1.0, installed from pip.
import tensorflow as tf
import tensorflow_hub as hub
tf.keras.backend.set_learning_phase(True)
module_url = "https://tfhub.dev/tensorflow/efficientnet/lite0/classification/2"
module2 = tf.keras.Sequential([
    hub.KerasLayer(module_url, trainable=False, input_shape=(224, 224, 3))])
output1 = module2(tf.ones(shape=(1,224,224,3)))
print(module2.summary())
When I set trainable=True, the code raises an error.
So, is it not possible to retrain the model on TF 2.1?
The EfficientNet-Lite models on TFHub are based on TensorFlow 1, and thus are subject to many restrictions on TF2 including fine-tuning as you've discovered. The EfficientNet models were updated to TF2 but we're still waiting for their lite counterparts.
https://www.tensorflow.org/hub/model_compatibility
https://github.com/tensorflow/hub/issues/751
UPDATE: Beginning October 5, 2021, the EfficientNet-Lite models on TFHub are available for TensorFlow 2.

What is "cudnn_status_internal_error" meaning Tensorflow pipeline?

When we start an object detection project using the TensorFlow Model API, we get the error "cudnn_status_internal_error".
I'm using the TensorFlow object detection pipeline to train the model.
How can I solve it?
In my case, the cause was insufficient GPU resources.
To solve this issue in the pipeline, add the following code to "model_main.py":
session_config = tf.ConfigProto()
#session_config.gpu_options.allow_growth = True
session_config.gpu_options.per_process_gpu_memory_fraction = 0.8
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, session_config=session_config)
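If you would rather try the allow_growth option that is commented out above without editing model_main.py, TensorFlow also reads the TF_FORCE_GPU_ALLOW_GROWTH environment variable. The sketch below only sets that variable. The helper name is hypothetical, it assumes your TensorFlow version honors the variable and that it is set before TensorFlow is imported, and the source does not confirm that it resolves the error in every setup.

```python
import os


def request_gpu_memory_growth(env=os.environ) -> None:
    """Hypothetical helper: ask TensorFlow to grow GPU memory on demand."""
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"


# Call this first; import TensorFlow (or launch model_main.py from this
# process) only afterwards, so the variable is visible when the GPU is set up.
request_gpu_memory_growth()
```

This is an alternative to allow_growth, not to the per_process_gpu_memory_fraction setting, which still has to be configured in code as shown above.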