How to access a file inside sagemaker entrypoint script - amazon-s3

I want to know how to access a file or folder in a private S3 bucket from inside the script.py entry point of a SageMaker job.
I uploaded the file to S3 using the following code:
boto3_client = boto3.Session(
    region_name='us-east-1',
    aws_access_key_id='xxxxxxxxxxx',
    aws_secret_access_key='xxxxxxxxxxx'
)
sess = sagemaker.Session(boto3_client)
role = sagemaker.session.get_execution_role(sagemaker_session=sess)
inputs = sess.upload_data(path="df.csv", bucket=sess.default_bucket(), key_prefix=prefix)
This is the estimator code:
import sagemaker
from sagemaker.pytorch import PyTorch

pytorch_estimator = PyTorch(
    entry_point='script.py',
    instance_type='ml.g4dn.xlarge',
    source_dir='./',
    role=role,
    sagemaker_session=sess,
)
Now, inside the script.py file, I want to access the df.csv file from S3.
This is my code inside script.py:
import argparse
import os

import boto3
import pandas as pd
from sagemaker.s3 import S3Downloader
from sagemaker.session import Session

parser = argparse.ArgumentParser()
parser.add_argument("--data-dir", type=str, default=os.environ["SM_CHANNEL_TRAINING"])
args, _ = parser.parse_known_args()

# create session
sess = Session(boto3.Session(region_name='us-east-1'))
S3Downloader.download(s3_uri=args.data_dir,
                      local_path='./',
                      sagemaker_session=sess)
df = pd.read_csv('df.csv')
But this gives the following error:
ValueError: Expecting 's3' scheme, got: in /opt/ml/input/data/training., exit code: 1
I think one way is to pass the secret key and access key, but I am already passing sagemaker_session. How can I use that session inside the script.py file and get my file read?

I think this approach is conceptually wrong.
Files used within SageMaker jobs (training or otherwise) should be passed in during machine initialization. Imagine you have to create a job with 10 machines: do you want each machine to read the file separately, or have it read once and replicated to all machines?
For a training job, inputs should be passed to fit() (in the case of direct code like yours) or as a TrainingInput in the case of a pipeline.
You can follow this official AWS example: "Train an MNIST model with PyTorch"
However, the important part is simply passing a dictionary of input channels to the fit:
pytorch_estimator.fit({'training': s3_input_train})
You can name the channel (in this case 'training') any way you want. The S3 path should be the one containing your df.csv.
Within your script.py, you can then read df.csv directly via the corresponding environment variable (or, better, expose it via argparse). Generic code with this default will suffice:
parser.add_argument("--train", type=str, default=os.environ["SM_CHANNEL_TRAINING"])
The nomenclature is "SM_CHANNEL_" + your channel name, upper-cased.
So if you had passed "train": s3_path instead, the variable would have been called SM_CHANNEL_TRAIN.
You can then read your file directly from the local path stored in that environment variable.
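A minimal sketch of the script.py side, assuming the estimator was launched with pytorch_estimator.fit({'training': s3_input_train}) so that SM_CHANNEL_TRAINING is set (the fallback path below is SageMaker's default mount point for that channel):

```python
import argparse
import os

def resolve_train_dir(argv=None):
    # SageMaker copies the 'training' channel to /opt/ml/input/data/training
    # on the instance and exports that local path in SM_CHANNEL_TRAINING.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--data-dir",
        type=str,
        default=os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"),
    )
    args, _ = parser.parse_known_args(argv)
    return args.data_dir

if __name__ == "__main__":
    data_dir = resolve_train_dir()
    # The file is already on local disk; no S3 client or credentials needed:
    # df = pd.read_csv(os.path.join(data_dir, "df.csv"))
```

Because the channel data is copied onto the instance before the script starts, no S3Downloader call (and no explicit session) is needed inside script.py.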

Related

Passing Parameters to Sagemaker Notebook Instance

I want to pass the name of a file from a lambda function to a SageMaker notebook instance.
I am using a SageMaker notebook to perform a preprocessing job when a file lands in an S3 bucket. As such, I wrote a lambda function, triggered by an S3 event, that starts the notebook. The lambda code is like:
import boto3
import logging

def lambda_handler(event, context):
    print("notebook starting .....")
    client = boto3.client('sagemaker')
    client.start_notebook_instance(NotebookInstanceName='preprocess-dataset')
    print("notebook started .....")
    return 0
I want to pass the name of the file to this notebook so that the notebook can read the content of the file. How can I do that?
You can create tags using the boto3 client's add_tags(..) method to pass values to your notebook instance from your lambda.
Then, in your startup script, you can read that tag and set an environment variable containing its value. You can adapt the script from the AWS SageMaker on-start script example to set your environment variable.
That environment variable, now holding the value you need (in your case the name of the file), can be used inside your notebook or Python file.
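A rough sketch of the Lambda side of this approach. The notebook name comes from the question; the tag key input_file and the S3 event parsing are assumptions for illustration:

```python
NOTEBOOK_NAME = "preprocess-dataset"

def build_tags(file_key):
    # Key/value pairs in the shape the SageMaker add_tags() call expects;
    # the tag key 'input_file' is an arbitrary choice.
    return [{"Key": "input_file", "Value": file_key}]

def lambda_handler(event, context):
    import boto3  # available by default in the Lambda runtime

    # Name of the S3 object that triggered the event.
    file_key = event["Records"][0]["s3"]["object"]["key"]

    client = boto3.client("sagemaker")
    notebook_arn = client.describe_notebook_instance(
        NotebookInstanceName=NOTEBOOK_NAME
    )["NotebookInstanceArn"]
    # Tag the instance before starting it, so the on-start script sees the tag.
    client.add_tags(ResourceArn=notebook_arn, Tags=build_tags(file_key))
    client.start_notebook_instance(NotebookInstanceName=NOTEBOOK_NAME)
    return 0
```

The on-start lifecycle script can then call list_tags on the notebook's ARN, pick out input_file, and export it as an environment variable, as in the linked AWS example.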

boto3 load custom models

For example:
session = boto3.Session()
client = session.client('custom-service')
I know that I can create a JSON file with API definitions under ~/.aws/models and botocore will load it from there. The problem is that I need this to work in an AWS Lambda function, where that seems impossible.
I'm looking for a way to tell boto3 where the custom JSON API definitions are so it can load them from a path I define.
Thanks
I have only a partial answer. There's a bit of documentation about botocore's loader module, which is what reads the model files. In a discussion about loading models from ZIP archives, a monkey patch was offered which extracts the ZIP to a temporary filesystem location and then extends the loader search path to that location. It doesn't seem like you can load model data directly from memory based on the API, but Lambda does give you some scratch space in /tmp.
Here's the important bits:
import boto3
session = boto3.Session()
session._loader.search_paths.extend(["/tmp/boto"])
client = session.client("custom-service")
The directory structure of /tmp/boto needs to follow the resource loader documentation. The main model file needs to be at /tmp/boto/custom-service/yyyy-mm-dd/service-2.json.
The issue also mentions that alternative loaders can be swapped in using Session.register_component so if you wanted to write a scrappy loader which returned a model straight from memory you could try that too. I don't have any info about how to go about doing that.
Just adding more details:
import boto3
import zipfile
import os

s3_client = boto3.client('s3')
s3_client.download_file('your-bucket', 'model.zip', '/tmp/model.zip')
os.chdir('/tmp')
with zipfile.ZipFile('model.zip', 'r') as archive:
    archive.extractall()

session = boto3.Session()
session._loader.search_paths.extend(["/tmp/boto"])
client = session.client("custom-service")
model.zip is just a compressed file that contains:
Archive: model.zip
Length Date Time Name
--------- ---------- ----- ----
0 11-04-2020 16:44 boto/
0 11-04-2020 16:44 boto/custom-service/
0 11-04-2020 16:44 boto/custom-service/2018-04-23/
21440 11-04-2020 16:44 boto/custom-service/2018-04-23/service-2.json
Just remember to give the Lambda role the proper permissions to access S3 and your custom service.
boto3 also allows setting the AWS_DATA_PATH environment variable, which can point to a directory path of your choice (see the boto3 docs).
Everything zipped into a Lambda layer is extracted under /opt/.
Let's assume all your custom models live under a models/ folder. When that layer is mounted into the Lambda environment, the folder will live under /opt/models/.
Simply specify AWS_DATA_PATH=/opt/models/ in the Lambda configuration and boto3 will pick up the models in that directory.
This is better than fetching models from S3 during runtime, unpacking, and then modifying session parameters.
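For reference, a sketch of the directory layout that a path on AWS_DATA_PATH (or on the loader search path from the /tmp approach above) is expected to contain. 'custom-service' and the 2018-04-23 API version are placeholders, and a real service-2.json would hold the full API model rather than this stub:

```python
import json
import os

def write_model_stub(root, service="custom-service", version="2018-04-23"):
    # botocore's loader looks for <root>/<service>/<version>/service-2.json.
    model_dir = os.path.join(root, service, version)
    os.makedirs(model_dir, exist_ok=True)
    model_path = os.path.join(model_dir, "service-2.json")
    with open(model_path, "w") as f:
        json.dump({"metadata": {"serviceId": service}}, f)  # stub only
    return model_path
```

With AWS_DATA_PATH pointing at root (set before the session is created), session.client('custom-service') would resolve the model from that directory.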

AWS Lambda - dynamically import python module from S3 at runtime

I have some tens of Python modules, each of which has one common method (e.g. run(params)) but with a different implementation. I also have an AWS Lambda that needs to call that method on one of those modules; which module is chosen depends on the input of that lambda.
It seems that I can achieve that by using Layers in Lambda.
However, if I use a single layer for all those modules, I can see problems with versioning it. If I need to update one module, I'll have to re-deploy the layer, which could bring unexpected changes to the other modules.
If I use one layer per module, there will be too many layers to manage.
I thought of putting each module into an individual zip file and putting those zip files in an S3 location. My lambda will then dynamically read the required zip file from S3 and execute it.
Is that approach viable?
=====================
My current solution is to have something like this:
import io
import zipfile
from types import ModuleType

import boto3

def read_python_script_from_zip(bucket: str, key: str, script_name: str) -> ModuleType:
    s3 = boto3.resource('s3')
    raw = s3.Object(bucket, key).get()['Body'].read()
    zf = zipfile.ZipFile(io.BytesIO(raw), "r")
    scripts = list(filter(lambda f: f.endswith(f"/{script_name}.py"), zf.namelist()))
    if len(scripts) == 0:
        raise ModuleNotFoundError(f"{script_name} not found.")
    if len(scripts) > 1:
        raise ModuleNotFoundError(f"{script_name} is ambiguous.")
    source = zf.read(scripts[0])
    mod = ModuleType(script_name, '')
    exec(source, mod.__dict__)
    return mod

read_python_script_from_zip(source_bucket, source_key, module_name).run(params)
This looks complicated to me, though; I would expect an easier way.
You could try packaging each module as a separate distribution package, which would let you version them separately. However, creating a Python distribution package is not as simple as you might hope, especially if you want to publish it to a private repository hosted on S3.
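As an alternative sketch: Python's import machinery can import directly from a ZIP archive placed on sys.path (zipimport), so after downloading each module's zip to /tmp, the exec-based snippet in the question could shrink to something like this (assuming the modules sit at the top level of the archive):

```python
import importlib
import sys

def import_from_zip(zip_path, module_name):
    # Putting the archive itself on sys.path lets the normal import
    # machinery locate top-level modules stored inside it.
    if zip_path not in sys.path:
        sys.path.insert(0, zip_path)
    return importlib.import_module(module_name)
```

Then import_from_zip('/tmp/mod.zip', 'my_module').run(params) behaves like the original helper, with module caching and error reporting handled by the import system rather than by hand.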

Amazon SageMaker notebook rl_deepracer_coach_robomaker - Write log CSV on S3 after simulation

I created my first notebook instance on Amazon SageMaker.
Next I opened the Jupyter notebook and used the SageMaker example rl_deepracer_coach_robomaker.ipynb from the Reinforcement Learning section. The question is addressed principally to those who are familiar with this notebook.
There you can launch a training process and a RoboMaker simulation application to start the learning process for an autonomous car.
When a simulation job is launched, one can access the log file, which is visualised by default in the CloudWatch console. Some of the information that appears in the log file can be modified in the script deepracer_env.py in the /src/robomaker/environments subdirectory.
I would like to "bypass" the CloudWatch console and save the log information (episode, total reward, number of steps, coordinates of the car, steering, throttle, etc.) in a dataframe or CSV file to be written somewhere on S3 at the end of the simulation.
Something similar has been done in the main notebook rl_deepracer_coach_robomaker.ipynb to plot the metrics for a training job, namely the training reward per episode. There one can see that
csv_file_name = "worker_0.simple_rl_graph.main_level.main_level.agent_0.csv"
is read from S3, but I simply cannot find where this CSV is generated, so I can't mimic the process.
You can create a csv file in the /opt/ml/output/intermediate/ folder, and the file will be saved in the following directory:
s3://<s3_bucket>/<s3_prefix>/output/intermediate/<csv_file_name>
However, it is not clear to me where exactly you will create such a file. The DeepRacer notebook uses two machines: one for training (the SageMaker instance) and one for simulations (the RoboMaker instance). The above method only works on the SageMaker instance, but much of what you would like to log (such as the total reward in an episode) actually lives on the RoboMaker instance. For RoboMaker instances, the intermediate-folder feature doesn't exist, and you'll have to save the file to S3 yourself using the boto library. Here is an example of doing that: https://qiita.com/hengsokvisal/items/329924dd9e3f65dd48e7
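On the SageMaker side, a minimal sketch of appending per-episode metrics under the intermediate folder. The column names are examples, and out_dir is parameterised only so the snippet is easy to try locally:

```python
import csv
import os

def log_episode(row, out_dir="/opt/ml/output/intermediate"):
    # Files under /opt/ml/output/intermediate/ are synced to
    # s3://<bucket>/<prefix>/output/intermediate/ while the job runs.
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, "episode_metrics.csv")
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=sorted(row))
        if new_file:
            writer.writeheader()  # write the header once, on first append
        writer.writerow(row)
    return path
```

Calling log_episode({"episode": 1, "reward": 10.5}) at the end of each episode then accumulates a CSV that lands in S3 without any explicit boto calls.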
There is a way to download the CloudWatch logs to a file. This way you can just print or save the logs and parse them. Assuming you are executing from a notebook cell:
STREAM_NAME= <your stream name as given by RoboMaker CloudWatch logs>
task = !aws logs create-export-task --task-name "copy_deepracer_logs" --log-group-name "/aws/robomaker/SimulationJobs" --log-stream-name-prefix $STREAM_NAME --destination "<s3_bucket>" --destination-prefix "<s3_prefix>" --from <unix timestamp in milliseconds> --to <unix timestamp in milliseconds>
task_id = json.loads(''.join(task))['taskId']
The export is an asynchronous call, so give it a few minutes to complete. If printing task_id succeeds, the export task was created; the logs will appear under the destination prefix in S3 once it finishes.
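The same export can be done with boto3 instead of shelling out to the CLI. A sketch, where the log group comes from the answer above and the bucket, prefix, and stream prefix are placeholders; since create_export_task is asynchronous, the helper polls describe_export_tasks until the task finishes:

```python
import time
from datetime import datetime, timezone

def to_millis(dt):
    # CloudWatch export timestamps are Unix epoch milliseconds (UTC).
    return int(dt.replace(tzinfo=timezone.utc).timestamp() * 1000)

def export_robomaker_logs(stream_prefix, bucket, prefix, start, end):
    import boto3  # available in SageMaker notebooks

    logs = boto3.client("logs")
    task_id = logs.create_export_task(
        taskName="copy_deepracer_logs",
        logGroupName="/aws/robomaker/SimulationJobs",
        logStreamNamePrefix=stream_prefix,
        destination=bucket,
        destinationPrefix=prefix,
        fromTime=to_millis(start),
        to=to_millis(end),
    )["taskId"]
    # Poll until the asynchronous export reaches a terminal state.
    while True:
        task = logs.describe_export_tasks(taskId=task_id)["exportTasks"][0]
        if task["status"]["code"] in ("COMPLETED", "CANCELLED", "FAILED"):
            return task_id, task["status"]["code"]
        time.sleep(10)
```

Note that CloudWatch allows only one active export task per account at a time, so sequential calls are the safe pattern.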

Tensorflow: checkpoints simple load

I have a checkpoint file:
checkpoint-20001 checkpoint-20001.meta
How do I extract the variables from it, without having to load the previous model, start a session, etc.?
I want to do something like
cp = load(checkpoint-20001)
cp.var_a
It's not documented, but you can inspect the contents of a checkpoint from Python using the class tf.train.NewCheckpointReader.
Here's a test case that uses it, so you can see how the class works.
https://github.com/tensorflow/tensorflow/blob/861644c0bcae5d56f7b3f439696eefa6df8580ec/tensorflow/python/training/saver_test.py#L1203
Since it isn't a documented class, its API may change in the future.