What is called within a sagemaker custom (training) container? - tensorflow

Sometime this spring the behaviour of the SageMaker Docker image changed, and I cannot figure out how I need to construct it now.
Directory structure
/src/some/package
/project1
    /some_entrypoint.py
    /some_notebook.ipynb
/project2
    /another_entrypoint.py
    /another_notebook.ipynb
setup.py
Dockerfile
Note that I wanted to bump the TensorFlow version, so I changed the FROM line to the latest version. This was the breaking change.
# Core
FROM 763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-training:2.3.0-cpu-py37-ubuntu18.04
COPY . /opt/ml/code/all/
RUN pip install /opt/ml/code/all/
WORKDIR "/opt/ml/code"
Python code
This code should start the entrypoints; for example, here is the code from some_notebook.ipynb. I tried all possible combinations of working directory + source_dir (None, '.', or '..'), entry_point (with or without a leading /), dependencies ('src')...
If setup.py is present, it tries to call my project as a module (python -m some_entrypoint).
If not, it often cannot find my entrypoint, which I don't understand, because the TensorFlow estimator is supposed to add it to the container, isn't it?
estimator = TensorFlow(
    entry_point='some_entrypoint.py',
    image_name='ECR.dkr.ecr.eu-west-1.amazonaws.com/overall-project/sagemaker-training:latest',
    source_dir='.',
    # dependencies=['../src/'],
    script_mode=True,
    train_instance_type='ml.m5.4xlarge',
    train_instance_count=1,
    train_max_run=60*60,   # seconds * minutes
    train_max_wait=60*60,  # seconds * minutes. Must be >= train_max_run
    hyperparameters=hyperparameters,
    metric_definitions=metrics,
    role=role,
    framework_version='2.0.0',
    py_version='py3',
)
estimator.fit(
    {'training': f"s3://some-data/"},
    # wait=False,
)
Ideally, I would like to understand the logic inside: what is called, given which settings?

When the training container runs, your entry_point script will be executed.
Since your notebook file and entry_point script are under the same directory, your source_dir should just be "."
Does your entry_point script import any modules that are not installed in the TensorFlow training container by default? Also, could you share the stack trace of the error?
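For illustration, here is a minimal sketch of the estimator with that layout applied, i.e. the notebook and some_entrypoint.py sitting in the same folder. The image URI and the remaining arguments are simply the placeholders from the question, not values I am asserting:

from sagemaker.tensorflow import TensorFlow

# Sketch only: entry_point is resolved relative to source_dir, and
# source_dir='.' means "the directory the notebook is running in".
estimator = TensorFlow(
    entry_point='some_entrypoint.py',
    source_dir='.',
    image_name='ECR.dkr.ecr.eu-west-1.amazonaws.com/overall-project/sagemaker-training:latest',
    script_mode=True,
    role=role,
    train_instance_type='ml.m5.4xlarge',
    train_instance_count=1,
    framework_version='2.0.0',
    py_version='py3',
)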

Related

Installing matplotlib from wheel for aws lambda

I want to use matplotlib to create image files in an AWS Lambda function. Since my system does not run the same version of Linux that Lambda uses, I was advised to download a wheel (.whl file) from the matplotlib website. I downloaded matplotlib-3.3.3-cp36-cp36m-manylinux1_x86_64.whl https://files.pythonhosted.org/packages/d2/43/2bd63467490036697e7be71444fafc7b236923d614d4521979a200c6b559/matplotlib-3.3.3-cp36-cp36m-manylinux1_x86_64.whl
to the folder where I am creating my deployment package.
First of all, there are many wheel files and it is not clear if this is the correct one to use. Second, it doesn't work.
Then I unpacked the wheel with the command
wheel unpack *.whl
This produced a directory with the matplotlib package in it.
My main function is
import matplotlib.pyplot

def lambda_handler(event, context):
    return
I gave the commands
zip -r my-deployment-package.zip .
zip -g my-deployment-package.zip lambda_function.py
and uploaded the zip to an s3 bucket. Then I used the "aws lambda update-function-code..." command to update the lambda function. When I deploy and test, I get the error: "Unable to import module 'lambda_function': No module named 'matplotlib'"
Thinking that the matplotlib module is not there (perhaps I need to do something with the whl file or put it in a different place), I activated my virtual environment and started the Python interpreter. Trying to import matplotlib, I get "module not found". Trying to use pip install on the matplotlib directory, it complains that there is no setup.py:
$ pip install --no-index --find-links=. matplotlib-3.3.3/
ERROR: Directory 'matplotlib-3.3.3/' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.
How can I get the module to be recognized?
With the help of the Serverless framework and Docker, you can build a layer in a minute. Check out my answer in this post.
If you are in a hurry, feel free to grab this Lambda layer that I packaged for you. It's compatible with Python 3.8. Here is the gdrive download link.
No need to build your own layer; loads of those are available at https://github.com/keithrozario/Klayers - you just have to include them in your Lambda config.
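For illustration, attaching one of those prebuilt layers can also be done programmatically. Below is a minimal sketch with boto3, where the function name is hypothetical and the layer ARN is a placeholder (look up the current matplotlib ARN for your region and Python version in the Klayers repository):

import boto3

lambda_client = boto3.client("lambda")

# Placeholder ARN: replace with the matplotlib layer ARN for your region
# and Python version, as listed in the Klayers repository.
LAYER_ARN = "arn:aws:lambda:eu-west-1:123456789012:layer:Klayers-python38-matplotlib:1"

# Attach the layer to an existing function (hypothetical name).
lambda_client.update_function_configuration(
    FunctionName="my-matplotlib-function",
    Layers=[LAYER_ARN],
)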

Heroku large size problem for deploying rasa

I have an issue when I'm trying to deploy my Docker image of Rasa to Heroku.
Here is the screenshot:
[terminal screenshot]
How can I avoid this large size?
I used requirements.txt to install Rasa; here is the screenshot:
[requirements.txt screenshot]
Can you help me please?
One option is to increase the Dyno size (this is not free).
Alternatively you can build a small Docker image, for example not using the Tensorflow or Spacy model (which are pretty large).
You typically need those if you want, for example, to use their NER models (extracting names, locations, etc.).
This is an example of how to build a Rasa instance which can fit in the free tier:
# from rasa base image
FROM rasa/rasa:1.8.0
# copy Rasa config and the Rasa generated model
COPY . /app
# script to run rasa core
COPY startup.sh /app/scripts/startup.sh

Submit a Keras training job to Google cloud

I am trying to follow this tutorial:
https://medium.com/@natu.neeraj/training-a-keras-model-on-google-cloud-ml-cb831341c196
to upload and train a Keras model on Google Cloud Platform, but I can't get it to work.
Right now I have downloaded the package from GitHub, and I have created a cloud environment with AI-Platform and a bucket for storage.
I am uploading the files (with the suggested folder structure) to my Cloud Storage bucket (basically to the root of my storage), and then trying the following command in the cloud terminal:
gcloud ai-platform jobs submit training JOB1 \
    --module-name=trainer.cnn_with_keras \
    --package-path=./trainer \
    --job-dir=gs://mykerasstorage \
    --region=europe-north1 \
    --config=gs://mykerasstorage/trainer/cloudml-gpu.yaml
But I get errors. First, the cloudml-gpu.yaml file can't be found ("no such folder or file"), and when I just remove that flag, I get errors saying the __init__.py file is missing, but it isn't, even though it is empty (which it was when I downloaded it from the tutorial's GitHub). I am guessing I haven't uploaded it the right way.
Any suggestions of how I should do this? There is really no info on this in the tutorial itself.
I have read in another guide that it is possible to let gcloud package and upload the job directly, but I am not sure how to do this or where to write the commands: in my local terminal with the gcloud command, or in the Cloud Shell in the browser? And how do I define the path where my Python files are located?
I should mention that I am working on a Mac, and am pretty new to using Keras and Python.
I was able to follow the tutorial you mentioned successfully, with some modifications along the way.
I will mention all the steps, although you are already about halfway there, as you mentioned.
First of all create a Cloud Storage Bucket for the job:
gsutil mb -l europe-north1 gs://keras-cloud-tutorial
To answer your question about where you should write these commands: it depends on where you want to store the files that you will download from GitHub. In the tutorial you posted, the writer is using his own computer to run the commands, which is why he initializes the gcloud command with gcloud init. However, you can submit the job from the Cloud Shell too, if you download the needed files there.
The only files we need from the repository are the trainer folder and the setup.py file. So, if we put them in a folder named keras-cloud-tutorial we will have this file structure:
keras-cloud-tutorial/
├── setup.py
└── trainer
    ├── __init__.py
    ├── cloudml-gpu.yaml
    └── cnn_with_keras.py
Now, a possible reason for the ImportError: No module named eager error is that you might have changed the runtimeVersion inside the cloudml-gpu.yaml file. As we can read here, eager was introduced in Tensorflow 1.5. If you have specified an earlier version, this error is expected. So the structure of cloudml-gpu.yaml should be like this:
trainingInput:
  scaleTier: CUSTOM
  # standard_gpu provides 1 GPU. Change to complex_model_m_gpu for 4 GPUs
  masterType: standard_gpu
  runtimeVersion: "1.5"
Note: "standard_gpu" is a legacy machine type.
Also, the setup.py file should look like this:
from setuptools import setup, find_packages

setup(name='trainer',
      version='0.1',
      packages=find_packages(),
      description='Example on how to run keras on gcloud ml-engine',
      author='Username',
      author_email='user@gmail.com',
      install_requires=[
          'keras==2.1.5',
          'h5py'
      ],
      zip_safe=False)
Attention: As you can see, I have specified that I want version 2.1.5 of keras. This is because if I don't, the latest version is used, which has compatibility issues with TensorFlow versions earlier than 2.0.
If everything is set, you can submit the job by running the following command inside the folder keras-cloud-tutorial:
gcloud ai-platform jobs submit training test_job \
    --module-name=trainer.cnn_with_keras \
    --package-path=./trainer \
    --job-dir=gs://keras-cloud-tutorial \
    --region=europe-west1 \
    --config=trainer/cloudml-gpu.yaml
Note: I used gcloud ai-platform instead of gcloud ml-engine command although both will work. At some point in the future though, gcloud ml-engine will be deprecated.
Attention: Be careful when choosing the region in which the job will be submitted. Some regions do not support GPUs and will throw an error if chosen. For example, if in my command I set the region parameter to europe-north1 instead of europe-west1, I will receive the following error:
ERROR: (gcloud.ai-platform.jobs.submit.training) RESOURCE_EXHAUSTED:
Quota failure for project . The request for 1 K80
accelerators exceeds the allowed maximum of 0 K80, 0 P100, 0 P4, 0 T4,
0 TPU_V2, 0 TPU_V3, 0 V100. To read more about Cloud ML Engine quota,
see https://cloud.google.com/ml-engine/quotas.
- '@type': type.googleapis.com/google.rpc.QuotaFailure
  violations:
  - description: The request for 1 K80 accelerators exceeds the allowed maximum of
      0 K80, 0 P100, 0 P4, 0 T4, 0 TPU_V2, 0 TPU_V3, 0 V100.
    subject:
You can read more about the features of each region here and here.
EDIT:
After the completion of the training job, there should be 3 folders in the bucket that you specified: logs/, model/ and packages/. The model is saved in the model/ folder as an .h5 file. Keep in mind that if you set a specific folder for the destination, you should include the '/' at the end. For example, you should set gs://my-bucket/output/ instead of gs://my-bucket/output (without the trailing slash). If you do the latter, you will end up with folders output, outputlogs and outputmodel. Inside output there should be packages. The job page link should direct to the output folder, so make sure to check the rest of the bucket too!
In addition, on the AI-Platform job page you should be able to see information regarding CPU, GPU and network utilization.
Also, I would like to clarify something as I saw that you posted some related questions as an answer:
Your local environment, whether it is your personal Mac or the Cloud Shell, has nothing to do with the actual training job. You don't need to install any specific package or framework locally. You just need to have the Google Cloud SDK installed (in Cloud Shell it is of course already installed) to run the appropriate gcloud and gsutil commands. You can read more on how exactly training jobs on the AI-Platform work here.
I hope that you will find my answer helpful.
I got it to work halfway now by not uploading the files but just running the gcloud commands from my local terminal... however, there was an error while it was running, ending in "job failed".
It seems it was trying to import something from the TensorFlow backend called "from tensorflow.python.eager import context", but there was an ImportError: No module named eager.
I have tried "pip install tf-nightly", which was suggested elsewhere, but it says I don't have permission, or I lose the connection to Cloud Shell (exactly when I try to run the command).
I have also tried making a virtual environment locally to match that on gcloud (with Conda), and have made an environment with Conda with Python=3.5, Tensorflow=1.14.0 and Keras=2.2.5, which should be supported for gcloud.
The python program works fine in this environment locally, but I still get the (ImportError: No module named eager) when trying to run the job on gcloud.
I am passing the flag --python-version 3.5 when submitting the job, but when I run the command "python -V" in the Google Cloud Shell, it says Python 2.7. Could this be the issue? I have not found a way to update the Python version from the Cloud Shell prompt, but it says Google Cloud should support Python 3.5. If this is indeed the issue, any suggestions on how to upgrade the Python version on Google Cloud?
It is also possible to manually create a new job in the Google Cloud web interface. Doing this, I get a different error message: ERROR: Could not find a version that satisfies the requirement cnn_with_keras.py (from versions: none) and No matching distribution found for cnn_with_keras.py. Here cnn_with_keras.py is my Python code from the tutorial, which runs fine locally.
Really don't know what to do next. Any suggestions or tips would be very helpful!
The issue with the GPU is solved now; it was something as simple as my Google Cloud account having GPU settings disabled, and the account needed to be upgraded.

toco_from_protos: command not found

I'm using the following link to convert my TensorFlow model to a TF Lite model:
https://www.tensorflow.org/lite/convert/python_api. There I'm following the instructions for 'Exporting a GraphDef from file'.
But I'm getting the following error:
"TOCO failed. See console for info.\n%s\n%s\n" % (stdout, stderr))
tensorflow.lite.python.convert.ConverterError: TOCO failed. See console for info.
/bin/sh: toco_from_protos: command not found
I've installed the latest TensorFlow, v1.13.1.
The problem
TensorFlow calls a specific binary to convert the .pb file (stored by protobuf) into a tflite model. The binary is 'toco_from_protos', and the error message says that the shell interpreter ('/bin/sh' in this case) is not able to find that binary.
You need to include the path to the 'toco_from_protos' file in the PATH environment variable.
How to do this
First, check if the file exists. You can use the command 'locate' for example:
$ locate toco_from_proto
/home/user/anaconda3/envs/tensorflow/bin/toco_from_protos
/home/user/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/lite/toco/python/toco_from_protos.py
/home/user/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/lite/toco/python/__pycache__/toco_from_protos.cpython-36.pyc
In my case, I am using Anaconda to manage the environments. Thus, the binary is in the binary path ('bin' folder) of the environment container ('tensorflow' in this case).
To ensure the correct execution of the binary, include the path to 'toco_from_protos' file inside PATH environment variable. If you are using a Linux based system, you can do something like:
$ export PATH=$PATH:/home/user/anaconda3/envs/tensorflow/bin
If you are using an IDE program (e.g. Pycharm), you can call the IDE run script using the same console you used to export the PATH variable. For example:
$ export PATH=$PATH:/home/user/anaconda3/envs/tensorflow/bin
$ /opt/pycharm-community-2018.1.4/bin/pycharm.sh
The new PATH value remains only in that console window, so if you want to make the change persistent, include the export statement in your '~/.bashrc' file.
I had the same issue and solved it by using an official Docker image; the host machine has a fresh Ubuntu 18.04.
docker run --runtime=nvidia -v /path/to/my/project:/mapped/docker/path -it tensorflow/tensorflow:latest-gpu bash
Then run the conversion script inside docker:
import tensorflow as tf

model = load_model()  # keras model
output_names = [node.op.name for node in model.outputs]
input_names = [node.op.name for node in model.inputs]
with tf.keras.backend.get_session() as sess:
    sess.run(tf.global_variables_initializer())
    frozen_def = tf.graph_util.convert_variables_to_constants(sess, sess.graph_def, output_names)
    converter = tf.lite.TFLiteConverter.from_session(sess, model.inputs, model.outputs)
    tflite_model = converter.convert()
    open("converted_model.tflite", "wb").write(tflite_model)
At the time of writing tensorflow/tensorflow:latest-gpu is version 1.13.1
I too got the same error log in TensorFlow 1.14. For me the issue was not in the converter; it was related to the path not getting resolved.
Running this before the Python script worked for me:
export PATH=$PATH:~/.local/bin

How do you install modules within sagemaker training jobs?

I don't think I'm asking this question right, but I have a Jupyter notebook that launches a TensorFlow training job with a Python training script I wrote.
That training script requires certain modules. It seems my SageMaker training job is failing because some of the modules don't exist.
How can I ensure that my training job script has all the modules it needs?
Edit
An example of one of these modules is keras.
The odd thing is, I can import keras in the Jupyter notebook, but when that import statement is in my training script I get the "No module named keras" error.
If you want to install multiple packages, one way is to upgrade to SageMaker Python SDK v2. With this, you can create a requirements.txt in the same directory as your notebook, and run the training. SageMaker will automatically take care of the installation.
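For illustration, here is a minimal sketch of what that looks like with the v2 SDK, assuming a hypothetical src/ directory that contains both train.py and requirements.txt; the instance type, framework version, role and S3 path are placeholders, not values from the question:

from sagemaker.tensorflow import TensorFlow

# src/ is assumed to contain train.py and requirements.txt;
# SageMaker installs the requirements before running train.py.
estimator = TensorFlow(
    entry_point="train.py",
    source_dir="src",
    role=role,                     # an existing SageMaker execution role
    instance_type="ml.m5.xlarge",
    instance_count=1,
    framework_version="2.3.0",
    py_version="py37",
)
estimator.fit({"training": "s3://my-bucket/training-data/"})  # hypothetical S3 path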
If you want to stay on v1 SDK, you can add the following snippet to your entry_point script.
import subprocess
import sys
def install(package):
    subprocess.check_call([sys.executable, "-q", "-m", "pip", "install", package])

install('keras')
The module script runs within a docker container which obviously does not have the dependency installed. Jupyter notebook on the other hand has keras pre-installed.
An easy way to do this is to have a requirements.txt file with all the requirements and then pass it on when creating your model.
env = {
    'SAGEMAKER_REQUIREMENTS': 'requirements.txt',  # path relative to `source_dir` below.
}

sagemaker_model = TensorFlowModel(model_data='s3://mybucket/modelTarFile',
                                  role=role,
                                  entry_point='entry.py',
                                  code_location='s3://mybucket/runtime-code/',
                                  source_dir='src',
                                  env=env,
                                  name='model_name',
                                  sagemaker_session=sagemaker_session,
                                  )
You can upload your requirements.txt file to an S3 bucket that is accessible by SageMaker, download the file to the working directory of the container using boto3, and install the libraries from requirements.txt in the entry file.
import os
import boto3

s3 = boto3.client('s3')
s3.download_file('BUCKET_NAME', 'OBJECT_NAME', '/opt/ml/code/requirements.txt')
os.system('pip install -r /opt/ml/code/requirements.txt')
The other way you can do it is by building your own container using the "bring your own algorithm" option provided by AWS.
Ref-links:
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb
The EstimatorBase class (and TensorFlow class) accept the parameter dependencies which you can use as follows to pass your requirements.txt:
estimator = TensorFlow(
    dependencies=['requirements.txt'],  # copies this file
)
e.g.
estimator = TensorFlow(
    entry_point='src/train.py',
    dependencies=['requirements.txt'],  # copies this file
)
or
estimator = TensorFlow(
    source_dir='src',        # this copies the entire src folder
    entry_point='train.py',  # when using source_dir, entry_point has to be directly under that dir
    dependencies=['requirements.txt'],  # copies this file
)
This copies the requirements.txt file into your sourcedir.tar.gz along with the training code.
This may only work on newer image versions. I read that in older versions you may need to put the requirements.txt file in the same folder as your training code.
If this doesn't work, you can use pip download to download your dependencies defined in requirements.txt locally, then use the dependencies parameter to specify the folder to which you downloaded your dependencies.
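Below is a hedged sketch of that fallback, assuming the packages were first fetched locally with something like pip download -r requirements.txt -d deps/; the deps/ folder name and the remaining estimator arguments are placeholders:

from sagemaker.tensorflow import TensorFlow

# deps/ is assumed to hold the packages produced locally by
# `pip download -r requirements.txt -d deps/`.
estimator = TensorFlow(
    entry_point="train.py",
    source_dir="src",
    dependencies=["deps"],  # the whole folder is copied alongside the training code
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    framework_version="2.3.0",
    py_version="py37",
)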
Another option is in your entry_point .py file you can add
import os

if __name__ == "__main__":
    os.system('pip install mymodule')
    import mymodule
    # rest of code goes here
This worked for me for simple modules such as pyparsing, but I think with keras you better just use a Tensorflow container that has keras preinstalled, as mentioned above.
The environment on your notebook instance is separate from the environment of your training job on SageMaker, unless it is local mode.
If you're using a custom docker image, then most likely your docker image doesn't have Keras installed.
If you are using the SageMaker predefined TensorFlow container, which is most likely invoked through the following code:
https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/estimator.py#L170
TensorFlow(entry_point='training_code.py',
           blah,
           blah
           )
Then you will need to install your dependencies within that container. There are currently two modes for training for TensorFlow on SageMaker, "framework" and "script" mode.
If training through "framework" mode, which is only available with 1.12 and below, then you will be limited to using a keras_model_fn defined here:
https://github.com/aws/sagemaker-python-sdk/tree/v1.12.0/src/sagemaker/tensorflow#preparing-the-tensorflow-training-script
Installing your dependencies would be done by passing in a requirements.txt.
On "script mode", which is introduced with TensorFlow 1.11 and above:
https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/tensorflow#training-with-tensorflow
Requirements.txt is not supported for "script" mode and instead it is recommended to install your dependencies within your user script, which would be your Python file that contains all of your Keras code.
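For illustration, a minimal sketch of that recommendation, installing a missing dependency at the top of the script-mode user script before importing it; the package name "keras" is just an example taken from the question:

import subprocess
import sys

# Install the missing dependency before importing it; "keras" is an example.
subprocess.check_call([sys.executable, "-m", "pip", "install", "keras"])

import keras  # imported after the install on purpose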
Please let me know if there is anything I can clarify.
For examples:
https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/tensorflow_script_mode_quickstart
https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/tensorflow_iris_dnn_classifier_using_estimators