Amazon SageMaker ModelError when serving model - amazon-s3

I have uploaded a transformer RoBERTa model to an S3 bucket. I am now trying to run inference against the model using PyTorch with the SageMaker Python SDK. I specified the model artifact s3://snet101/sent.tar.gz, which is a compressed archive of the model (pytorch_model.bin) and all its dependencies. Here is the code:
model = PyTorchModel(model_data=model_artifact,
                     name=name_from_base('roberta-model'),
                     role=role,
                     entry_point='torchserve-predictor2.py',
                     source_dir='source_dir',
                     framework_version='1.4.0',
                     py_version='py3',
                     predictor_cls=SentimentAnalysis)
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
test_data = {"text": "How many cows are in the farm ?"}
prediction = predictor.predict(test_data)
I get the following error on the predict method of the predictor object:
ModelError Traceback (most recent call last)
<ipython-input-6-bc621eb2e056> in <module>
----> 1 prediction = predictor.predict(test_data)
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/predictor.py in predict(self, data, initial_args, target_model, target_variant)
123
124 request_args = self._create_request_args(data, initial_args, target_model, target_variant)
--> 125 response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
126 return self._handle_response(response)
127
~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
355 "%s() only accepts keyword arguments." % py_operation_name)
356 # The "self" in this scope is referring to the BaseClient.
--> 357 return self._make_api_call(operation_name, kwargs)
358
359 _api_call.__name__ = str(py_operation_name)
~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
674 error_code = parsed_response.get("Error", {}).get("Code")
675 error_class = self.exceptions.from_code(error_code)
--> 676 raise error_class(parsed_response, operation_name)
677 else:
678 return parsed_response
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from model with message "Your invocation timed out while waiting for a response from container model. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/roberta-model-2020-12-16-09-42-37-479 in account 165258297056 for more information.
I checked the server error log:
java.lang.IllegalArgumentException: reasonPhrase contains one of the following prohibited characters: \r\n: Can't load config for '/.sagemaker/mms/models/model'. Make sure that:
'/.sagemaker/mms/models/model' is a correct model identifier listed on 'https://huggingface.co/models'
or '/.sagemaker/mms/models/model' is the correct path to a directory containing a config.json file
How can I fix this?

I have the same problem. It seems the endpoint is trying to load the pretrained model from the path '/.sagemaker/mms/models/model' and fails.
Maybe this path is not correct, or perhaps the container has no access to the S3 bucket and so cannot store the model at the given path.
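Another thing worth checking, based on the "directory containing a config.json file" part of the log: the container extracts sent.tar.gz into /.sagemaker/mms/models/model, and the error indicates the model loader expects a config.json at the top level of that directory. A possible fix (a sketch under that assumption, not confirmed in this thread) is to repack the archive so the files sit at its root rather than inside a nested folder:
import tarfile

# Repack the model so every file lands at the archive root (no nested folder).
# The file names below are the usual RoBERTa artifacts and are assumptions;
# adjust them to whatever your model directory actually contains.
with tarfile.open("sent.tar.gz", "w:gz") as tar:
    for fname in ["pytorch_model.bin", "config.json", "vocab.json", "merges.txt"]:
        tar.add(fname, arcname=fname)
After re-uploading the archive to s3://snet101/sent.tar.gz, redeploy the endpoint and run the prediction again.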

Related

Cannot load checkpoints

I trained a model (from a TensorFlow tutorial) in Jupyter, saved it, and then successfully loaded it back (the kernel was restarted). Here's the code:
# Directory where the checkpoints will be saved
checkpoint_dir = '/home/charlie-chin/william_model/training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
model.save('/home/charlie-chin/william_model')
model = keras.models.load_model('/home/charlie-chin/william_model', custom_objects={'loss':loss})
checkpoint_num = 10
model.load_weights(tf.train.Checkpoint("/home/charlie-chin/william_model/training_checkpoints/ckpt_" + str(checkpoint_num)))
All went well except the last two lines, which gave me this error:
ValueError: `Checkpoint` was expecting root to be a trackable object (an object derived from `Trackable`), got /home/charlie-chin/william_model/training_checkpoints/ckpt_1. If you believe this object should be trackable (i.e. it is part of the TensorFlow Python API and manages state), please open an issue.
I checked the path - it is correct. Here's full output of the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [39], in <cell line: 4>()
1 checkpoint_num = 10
2 # model.load_weights(tf.train.load_checkpoint("./william_model/training_checkpoints/ckpt_"))
3 # model.load_weights(tf.train.Checkpoint("/home/charlie-chin/william_model/training_checkpoints/ckpt_" + str(checkpoint_num)+".data-00000-of-00001"))
----> 4 model.load_weights(tf.train.Checkpoint("/home/charlie-chin/william_model/training_checkpoints/ckpt_" + str(checkpoint_num)))
File ~/.local/lib/python3.8/site-packages/tensorflow/python/training/tracking/util.py:2107, in Checkpoint.__init__(self, root, **kwargs)
2105 if root:
2106 trackable_root = root() if isinstance(root, weakref.ref) else root
-> 2107 _assert_trackable(trackable_root, "root")
2108 attached_dependencies = []
2110 # All keyword arguments (including root itself) are set as children
2111 # of root.
File ~/.local/lib/python3.8/site-packages/tensorflow/python/training/tracking/util.py:1546, in _assert_trackable(obj, name)
1543 def _assert_trackable(obj, name):
1544 if not isinstance(
1545 obj, (base.Trackable, def_function.Function)):
-> 1546 raise ValueError(
1547 f"`Checkpoint` was expecting {name} to be a trackable object (an "
1548 f"object derived from `Trackable`), got {obj}. If you believe this "
1549 "object should be trackable (i.e. it is part of the "
1550 "TensorFlow Python API and manages state), please open an issue.")
ValueError: `Checkpoint` was expecting root to be a trackable object (an object derived from `Trackable`), got /home/charlie-chin/william_model/training_checkpoints/ckpt_10. If you believe this object should be trackable (i.e. it is part of the TensorFlow Python API and manages state), please open an issue.
You should be able to load the checkpoints according to the TensorFlow documentation like this:
checkpoint_num = 10
model.load_weights("/home/charlie-chin/william_model/training_checkpoints/ckpt_" + str(checkpoint_num))
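If you just want the most recent checkpoint rather than a specific epoch, tf.train.latest_checkpoint can look up the prefix for you (a small sketch, assuming the checkpoints were written by the ModelCheckpoint callback shown above):
import tensorflow as tf

# Returns the prefix of the newest checkpoint in the directory, or None if
# no checkpoint state file is found there.
latest = tf.train.latest_checkpoint('/home/charlie-chin/william_model/training_checkpoints')
if latest is not None:
    model.load_weights(latest)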

Instance Normalization Error while converting model from tensorflow to Coreml (4.0)

I am trying to convert my model from TensorFlow to Core ML, but I get the error below. Is it not possible to convert an instance normalization layer to Core ML? Is there any workaround?
ValueError Traceback (most recent call last)
in ()
6
7 model = ct.convert(
----> 8 tf_keras_model )
6 frames
/usr/local/lib/python3.6/dist-packages/coremltools/converters/mil/mil/block.py in remove_ops(self, existing_ops)
700 + "used by ops {}"
701 )
--> 702 raise ValueError(msg.format(op.name, i, v.name, child_op_names))
703 # Check that the output Var isn't block's output
704 if v in self._outputs:
ValueError: Cannot delete op 'Generator/StatefulPartitionedCall/StatefulPartitionedCall/encoder_down_resblock_0/instance_norm_0/Shape' with active output at id 0: 'Generator/StatefulPartitionedCall/StatefulPartitionedCall/encoder_down_resblock_0/instance_norm_0/Shape' used by ops ['Generator/StatefulPartitionedCall/StatefulPartitionedCall/encoder_down_resblock_0/instance_norm_0/strided_slice']
I used keras-contrib instead and it works fine. Please see the issue and its solution below; it is still open for tensorflow_addons.
https://github.com/apple/coremltools/issues/1007

"KeyError: in converted code" when packing data with numeric features in Tensorflow Tutorial

I'm using TF 2.0 on Google Colab. I copied most of the code from the TensorFlow "Load CSV Data" tutorial and changed some config variables for my training and eval/test CSV files. When I ran it, I got this error (only the last frame is shown; the full output is here):
In
NUMERIC_FEATURES = ['microtime', 'dist']
packed_train_data = raw_train_data.map(
    PackNumericFeatures(NUMERIC_FEATURES))

packed_test_data = raw_test_data.map(
    PackNumericFeatures(NUMERIC_FEATURES))
Out
/tensorflow-2.0.0/python3.6/tensorflow_core/python/autograph/impl/api.py in wrapper(*args, **kwargs)
235 except Exception as e: # pylint:disable=broad-except
236 if hasattr(e, 'ag_error_metadata'):
--> 237 raise e.ag_error_metadata.to_exception(e)
238 else:
239 raise
KeyError: in converted code:
<ipython-input-19-85ea56f80c91>:6 __call__ *
numeric_features = [features.pop(name) for name in self.names]
/tensorflow-2.0.0/python3.6/tensorflow_core/python/autograph/impl/api.py:396 converted_call
return py_builtins.overload_of(f)(*args)
KeyError: 'dist'
The "in converted code" marker is used when AutoGraph wraps errors that likely occur in user code. In this case, the following line is relevant:
<ipython-input-19-85ea56f80c91>:6 __call__ *
numeric_features = [features.pop(name) for name in self.names]
The error message is missing some critical information and we should fix that, but it suggests that the call to features.pop(name) raises the KeyError, so that key is likely missing from features.
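A quick way to confirm which keys the dataset actually provides is to print one batch (a debugging sketch, assuming raw_train_data was built with make_csv_dataset as in the tutorial):
# Print the feature keys of a single batch; 'microtime' and 'dist' should
# appear here. If they don't, check the column names / header row of the
# training CSV.
for features, label in raw_train_data.take(1):
    print(list(features.keys()))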

Accessing S3 from Dask.bag

As the title suggests, I'm trying to use a dask.bag to read a single file from S3 on an EC2 instance:
from distributed import Executor, progress
from dask import delayed
import dask
import dask.bag as db
data = db.read_text('s3://pycuda-euler-data/Ba10k.sim1.fq')
I get a very long error:
---------------------------------------------------------------------------
ClientError Traceback (most recent call last)
/home/ubuntu/anaconda3/lib/python3.5/site-packages/s3fs/core.py in info(self, path, refresh)
322 bucket, key = split_path(path)
--> 323 out = self.s3.head_object(Bucket=bucket, Key=key)
324 out = {'ETag': out['ETag'], 'Key': '/'.join([bucket, key]),
/home/ubuntu/anaconda3/lib/python3.5/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
277 # The "self" in this scope is referring to the BaseClient.
--> 278 return self._make_api_call(operation_name, kwargs)
279
/home/ubuntu/anaconda3/lib/python3.5/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
571 if http.status_code >= 300:
--> 572 raise ClientError(parsed_response, operation_name)
573 else:
ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
During handling of the above exception, another exception occurred:
FileNotFoundError Traceback (most recent call last)
<ipython-input-43-0ad435c69ecc> in <module>()
4 #data = db.read_text('/Users/zen/Code/git/sra_data.fastq')
5 #data = db.read_text('/Users/zen/Code/git/pycuda-euler/data/Ba10k.sim1.fq')
----> 6 data = db.read_text('s3://pycuda-euler-data/Ba10k.sim1.fq', blocksize=900000)
/home/ubuntu/anaconda3/lib/python3.5/site-packages/dask/bag/text.py in read_text(urlpath, blocksize, compression, encoding, errors, linedelimiter, collection, storage_options)
89 _, blocks = read_bytes(urlpath, delimiter=linedelimiter.encode(),
90 blocksize=blocksize, sample=False, compression=compression,
---> 91 **(storage_options or {}))
92 if isinstance(blocks[0], (tuple, list)):
93 blocks = list(concat(blocks))
/home/ubuntu/anaconda3/lib/python3.5/site-packages/dask/bytes/core.py in read_bytes(urlpath, delimiter, not_zero, blocksize, sample, compression, **kwargs)
210 return read_bytes(storage_options.pop('path'), delimiter=delimiter,
211 not_zero=not_zero, blocksize=blocksize, sample=sample,
--> 212 compression=compression, **storage_options)
213
214
/home/ubuntu/anaconda3/lib/python3.5/site-packages/dask/bytes/s3.py in read_bytes(path, s3, delimiter, not_zero, blocksize, sample, compression, **kwargs)
91 offsets = [0]
92 else:
---> 93 size = getsize(s3_path, compression, s3)
94 offsets = list(range(0, size, blocksize))
95 if not_zero:
/home/ubuntu/anaconda3/lib/python3.5/site-packages/dask/bytes/s3.py in getsize(path, compression, s3)
185 def getsize(path, compression, s3):
186 if compression is None:
--> 187 return s3.info(path)['Size']
188 else:
189 with s3.open(path, 'rb') as f:
/home/ubuntu/anaconda3/lib/python3.5/site-packages/s3fs/core.py in info(self, path, refresh)
327 return out
328 except (ClientError, ParamValidationError):
--> 329 raise FileNotFoundError(path)
330
331 def _walk(self, path, refresh=False):
FileNotFoundError: pycuda-euler-data/Ba10k.sim1.fq
As far as I can tell, this is exactly what the docs say to do, and unfortunately many of the examples I see online use the older from_s3() method, which no longer exists.
However, I am able to access the file using s3fs:
sample, partitions = s3.read_bytes('pycuda-euler-data/Ba10k.sim1.fq', s3=s3files, delimiter=b'\n')
sample
b'#gi|30260195|ref|NC_003997.3|_5093_5330_1:0:0_1:0:0_0/1\nGATAACTCGATTTAAACCAGATCCAGAAAATTTTCA\n+\n222222222222222222222222222222222222\n#gi|30260195|ref|NC_003997.3|_7142_7326_1:1:0_0:0:0_1/1\nCTATTCCGCCGCATCAACTTGGTGAAGTAATGGATG\n+\n222222222222222222222222222222222222\n#gi|30260195|ref|NC_003997.3|_5524_5757_3:0:0_2:0:0_2/1\nGTAATTTAACTGGTGAGGACGTGCGTGATGGTTTAT\n+\n222222222222222222222222222222222222\n#gi|30260195|ref|NC_003997.3|_2706_2926_1:0:0_3:0:0_3/1\nAGTAAAACAGATATTTTTGTAAATAGAAAAGAATTT\n+\n222222222222222222222222222222222222\n#gi|30260195|ref|NC_003997.3|_500_735_3:1:0_0:0:0_4/1\nATACTCTGTGGTAAATGATTAGAATCATCTTGTGCT\n+\n222222222222222222222222222222222222\n#gi|30260195|ref|NC_003997.3|_2449_2653_3:0:0_1:0:0_5/1\nCTTGAATTGCTACAGATAGTCATAGGTTAGCCCTTC\n+\n222222222222222222222222222222222222\n#gi|30260195|ref|NC_003997.3|_3252_3460_0:0:0_0:0:0_6/1\nCGATGTAATTGATACAGGTGGCGCTGTAAAATGGTT\n+\n222222222222222222222222222222222222\n#gi|30260195|ref|NC_003997.3|_1860_2095_0:0:0_1:0:0_7/1\nATAAAAGATTCAATCGAAATATCAGCATCGTTTCCT\n+\n222222222222222222222222222222222222\n#gi|30260195|ref|NC_003997.3|_870_1092_1:0:0_0:0:0_8/1\nTTGGAAAAACCCATTTAATGCATGCAATTGGCCTTT\n ... etc.
What am I doing wrong?
EDIT:
Upon suggestion, I went back and checked permissions. On the bucket I added a Grantee Everyone List, and on the file a Grantee Everyone Open/Download. I still get the same error.
Dask uses the library s3fs to manage data on S3. The s3fs project uses Amazon's boto3. You can provide credentials in two ways:
Use a .boto file
You can put a .boto file in your home directory
Use the storage_options= keyword
You can add a storage_options= keyword to your db.read_text call to include credential information by hand. This option is a dictionary whose values are passed to the s3fs.S3FileSystem constructor.
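For example (a sketch with placeholder credentials), passing a key and secret straight through to s3fs looks like this:
import dask.bag as db

data = db.read_text(
    's3://pycuda-euler-data/Ba10k.sim1.fq',
    storage_options={'key': 'YOUR_ACCESS_KEY_ID',          # placeholder
                     'secret': 'YOUR_SECRET_ACCESS_KEY'})   # placeholder
If the EC2 instance has an IAM role attached, you can usually omit the credentials entirely and s3fs will pick them up from the instance metadata service.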

TensorFlow distributed master worker save fails silently; the checkpoint file isn't created but no exception is raised

In a distributed TensorFlow environment, the master worker fails to save a checkpoint.
saver.save returns OK (it does not raise an exception and it returns the stored checkpoint file path), but the returned checkpoint file does not exist.
This does not match the description of the TensorFlow API.
Why? How can I fix it?
=============
The related code is below:
def def_ps(self):
    self.saver = tf.train.Saver(max_to_keep=100, keep_checkpoint_every_n_hours=3)

def save(self, idx):
    ret = self.saver.save(self.sess, self.save_model_path, global_step=None, write_meta_graph=False)
    if not os.path.exists(ret):
        msg = "save model for %u path %s not exists." % (idx, ret)
        lg.error(msg)
        raise Exception(msg)
=============
The log is below:
2016-06-02 21:33:52,323 root ERROR save model for 2 path model_path/rl_model_2 not exists.
2016-06-02 21:33:52,323 root ERROR has error:save model for 2 path model_path/rl_model_2 not exists.
Traceback (most recent call last):
File "d_rl_main_model_dist_0.py", line 755, in run_worker
model_a.save(next_model_idx)
File "d_rl_main_model_dist_0.py", line 360, in save
Trainer.save(self,save_idx)
File "d_rl_main_model_dist_0.py", line 289, in save
raise Exception(msg);
Exception: save model for 2 path model_path/rl_model_2 not exists.
===========
This does not match the TensorFlow API, which defines Saver.save as follows:
https://www.tensorflow.org/versions/master/api_docs/python/state_ops.html#Saver
tf.train.Saver.save(sess, save_path, global_step=None, latest_filename=None, meta_graph_suffix='meta', write_meta_graph=True)
Returns:
A string: path at which the variables were saved. If the saver is sharded, this string ends with: '-?????-of-nnnnn' where 'nnnnn' is the number of shards created.
Raises:
TypeError: If sess is not a Session.
ValueError: If latest_filename contains path components.
The tf.train.Saver.save() method is a little... surprising when you run in distributed mode. The actual file is written by the process that holds the tf.Variable op, which is typically a process in "/job:ps" if you've used the example code to set things up. This means that you need to look in save_path on each of the remote machines that have variables to find the checkpoint files.
Why is this the case? The Saver API implicitly assumes that all processes have the same view of a shared file system, like an NFS mount, because that is the typical setup we use at Google. We've added support for Google Cloud Storage in the latest nightly versions of TensorFlow, and are investigating HDFS support as well.