Passing Parameters to Sagemaker Notebook Instance - amazon-s3

I want to pass the name of a file from a Lambda function to a SageMaker notebook instance.
I am using a SageMaker notebook to perform a preprocessing job when a file lands in the S3 bucket. As such, I wrote a Lambda function, triggered by an S3 event, that starts the notebook. The Lambda code looks like this:
import boto3
import logging

def lambda_handler(event, context):
    print("notebook starting .....")
    client = boto3.client('sagemaker')
    client.start_notebook_instance(NotebookInstanceName='preprocess-dataset')
    print("notebook started .....")
    return 0
I want to pass the name of the file to this notebook so the notebook can read the content of the file. How can I do that?

You can create tags using the boto3 client add_tags(..) method to pass values to your notebook instance from your Lambda.
Then, in your on-start lifecycle script, you can read that tag and set an environment variable containing its value. You can adapt the AWS SageMaker on-start script example to set your environment variable.
You can now use that environment variable inside your notebook or python file which contains the value you need (in your case the name of the file).
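A minimal sketch of that approach on the Lambda side, assuming the notebook instance name from the question and a hypothetical tag key input_file:

import boto3

def lambda_handler(event, context):
    # Name of the file that triggered the S3 event
    key = event['Records'][0]['s3']['object']['key']

    sm = boto3.client('sagemaker')
    notebook_arn = sm.describe_notebook_instance(
        NotebookInstanceName='preprocess-dataset'
    )['NotebookInstanceArn']

    # Pass the file name to the notebook instance as a tag
    sm.add_tags(
        ResourceArn=notebook_arn,
        Tags=[{'Key': 'input_file', 'Value': key}]
    )

    sm.start_notebook_instance(NotebookInstanceName='preprocess-dataset')
    return 0

The on-start lifecycle script can then call list_tags with the notebook instance ARN, look up input_file, and export it as an environment variable for the notebook to read.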

Related

How to access a file inside sagemaker entrypoint script

I want to know how to access a file or folder in a private S3 bucket inside the script.py entry point of SageMaker.
I uploaded the file to S3 using the following code:
boto3_client = boto3.Session(
    region_name='us-east-1',
    aws_access_key_id='xxxxxxxxxxx',
    aws_secret_access_key='xxxxxxxxxxx'
)
sess = sagemaker.Session(boto3_client)
role = sagemaker.session.get_execution_role(sagemaker_session=sess)
inputs = sess.upload_data(path="df.csv", bucket=sess.default_bucket(), key_prefix=prefix)
This is the code for the estimator:
import sagemaker
from sagemaker.pytorch import PyTorch
pytorch_estimator = PyTorch(
    entry_point='script.py',
    instance_type='ml.g4dn.xlarge',
    source_dir='./',
    role=role,
    sagemaker_session=sess,
)
Now, inside the script.py file, I want to access the df.csv file from S3. This is my code inside script.py:
parser = argparse.ArgumentParser()
parser.add_argument("--data-dir", type=str, default=os.environ["SM_CHANNEL_TRAINING"])
args, _ = parser.parse_known_args()

# create session
sess = Session(boto3.Session(region_name='us-east-1'))

S3Downloader.download(s3_uri=args.data_dir,
                      local_path='./',
                      sagemaker_session=sess)

df = pd.read_csv('df.csv')
But this gives the error:
ValueError: Expecting 's3' scheme, got: in /opt/ml/input/data/training., exit code: 1
I think one way is to pass the secret key and access key, but I am already passing sagemaker_session. How can I use that session inside the script.py file and get my file read?
I think this approach is conceptually wrong.
Files needed within SageMaker jobs (whether training or otherwise) should be passed in during machine initialization. Imagine you have to create a job with 10 machines: do you want to read the file from S3 10 times, or have it read once and replicated directly to each machine?
In the case of a training job, the files should be passed to fit (in the case of direct code like yours) or as a TrainingInput in the case of a pipeline.
You can follow this official AWS example: "Train an MNIST model with PyTorch"
However, the important part is simply passing a dictionary of input channels to the fit:
pytorch_estimator.fit({'training': s3_input_train})
You can name the channel (in this case 'training') any way you want. The S3 path should be the one that contains your df.csv.
Within your script.py, you can then read the path to df.csv directly from the environment variables (or at least make it overridable through argparse). Generic code with this default will suffice:
parser.add_argument("--train", type=str, default=os.environ["SM_CHANNEL_TRAINING"])
It follows the nomenclature "SM_CHANNEL_" + your_channel_name.
So if you had put "train": s3_path, the variable would have been called SM_CHANNEL_TRAIN.
Then you can read your file directly by pointing to the path corresponding to that environment variable.
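Putting it together, a minimal sketch (continuing from the estimator, session, and prefix defined in the question; the exact S3 prefix is an assumption about your setup):

# Launcher side: pass the S3 prefix that contains df.csv as the 'training' channel
pytorch_estimator.fit({'training': 's3://' + sess.default_bucket() + '/' + prefix})

# Inside script.py: SM_CHANNEL_TRAINING points to /opt/ml/input/data/training,
# where SageMaker has already copied the channel contents
import argparse
import os
import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("--train", type=str, default=os.environ["SM_CHANNEL_TRAINING"])
args, _ = parser.parse_known_args()

df = pd.read_csv(os.path.join(args.train, "df.csv"))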

Activating data pipeline when new files arrived on S3 using SNS

How can I activate a data pipeline when new files arrive on S3? For EMR scheduling, the trigger should use SNS when new files arrive on S3.
You can execute the data pipeline without using SNS when files arrive in the S3 location:
Create an S3 event which invokes a Lambda function.
Create the Lambda function (make sure the role you give it has S3, Lambda, and Data Pipeline permissions).
Paste the code below into the Lambda function to execute the data pipeline (fill in your data pipeline_id):
import boto3

def lambda_handler(event, context):
    try:
        client = boto3.client('datapipeline', region_name='ap-southeast-2')
        s3_client = boto3.client('s3')
        data_pipeline_id = "df-09312983K28XXXXXXXX"
        response_pipeline = client.describe_pipelines(pipelineIds=[data_pipeline_id])
        activate = client.activate_pipeline(pipelineId=data_pipeline_id, parameterValues=[])
    except Exception as e:
        raise Exception("Pipeline is not found or not active")
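If your pipeline defines a parameter, you could also forward the key of the file that triggered the Lambda inside the same handler; 'myInputS3Path' is a hypothetical parameter id:

bucket = event['Records'][0]['s3']['bucket']['name']
key = event['Records'][0]['s3']['object']['key']
client.activate_pipeline(
    pipelineId=data_pipeline_id,
    parameterValues=[{'id': 'myInputS3Path', 'stringValue': 's3://' + bucket + '/' + key}]
)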

boto3 load custom models

For example:
session = boto3.Session()
client = session.client('custom-service')
I know that I can create a JSON file with API definitions under ~/.aws/models and botocore will load it from there. The problem is that I need to get this done in an AWS Lambda function, where that looks impossible to do.
I am looking for a way to tell boto3 where the custom JSON API definitions are so it can load them from the defined path.
Thanks
I have only a partial answer. There's a bit of documentation about botocore's loader module, which is what reads the model files. In a discussion about loading models from ZIP archives, a monkey patch was offered which extracts the ZIP to a temporary filesystem location and then extends the loader search path to include that location. It doesn't seem like you can load model data directly from memory based on the API, but Lambda does give you some scratch space in /tmp.
Here are the important bits:
import boto3
session = boto3.Session()
session._loader.search_paths.extend(["/tmp/boto"])
client = session.client("custom-service")
The directory structure of /tmp/boto needs to follow the resource loader documentation. The main model file needs to be at /tmp/boto/custom-service/yyyy-mm-dd/service-2.json.
The issue also mentions that alternative loaders can be swapped in using Session.register_component so if you wanted to write a scrappy loader which returned a model straight from memory you could try that too. I don't have any info about how to go about doing that.
Just adding more details:
import boto3
import zipfile
import os

s3_client = boto3.client('s3')
s3_client.download_file('your-bucket', 'model.zip', '/tmp/model.zip')

os.chdir('/tmp')
with zipfile.ZipFile('model.zip', 'r') as archive:
    archive.extractall()

session = boto3.Session()
session._loader.search_paths.extend(["/tmp/boto"])
client = session.client("custom-service")
model.zip is just a compressed file that contains:
Archive: model.zip
Length Date Time Name
--------- ---------- ----- ----
0 11-04-2020 16:44 boto/
0 11-04-2020 16:44 boto/custom-service/
0 11-04-2020 16:44 boto/custom-service/2018-04-23/
21440 11-04-2020 16:44 boto/custom-service/2018-04-23/service-2.json
Just remember to have the proper lambda role to access S3 and your custom-service.
boto3 also allows setting the AWS_DATA_PATH environment variable which can point to a directory path of your choice.
[boto3 Docs]
Everything zipped into a Lambda layer is extracted under /opt/.
Let's assume all your custom models live under a models/ folder in that layer. When the layer is mounted into the Lambda environment, the folder will live under /opt/models/.
Simply specify AWS_DATA_PATH=/opt/models/ in the Lambda configuration and boto3 will pick up the models in that directory.
This is better than fetching models from S3 during runtime, unpacking, and then modifying session parameters.
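A rough sketch of that setup, assuming the models are packaged in a layer under a models/ folder:

# Layer contents (extracted to /opt/ at runtime):
#   models/custom-service/2018-04-23/service-2.json
#
# Lambda configuration:
#   AWS_DATA_PATH=/opt/models/
import boto3

client = boto3.Session().client('custom-service')  # model resolved via AWS_DATA_PATH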

boto3 s3 copy_object with ContentEncoding argument

I'm trying to copy an S3 object with the boto3 call below:
import boto3
client = boto3.client('s3')
client.copy_object(Bucket=bucket_name, ContentEncoding='gzip', CopySource=copy_source, Key=new_key)
The copy of the object succeeded, but the ContentEncoding metadata was not added to the object.
When I use the console to add the Content-Encoding metadata, there is no problem.
But using the Python boto3 copy command, it does not get set.
Here's the documentation link for client.copy_object():
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.copy_object
These are the application versions:
python=2.7.16
boto3=1.0.28
botocore=1.13.50
Thank you in advance.
Try adding MetadataDirective='REPLACE' to your copy_object call
client.copy_object(Bucket=bucket_name, ContentEncoding='gzip', CopySource=copy_source, Key=new_key, MetadataDirective='REPLACE')
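For context: when MetadataDirective is left at its default of COPY, S3 copies the metadata from the source object and ignores the new headers you pass; with REPLACE it uses only what you supply, so re-specify anything you want to keep. A fuller sketch with hypothetical bucket and key names:

import boto3

client = boto3.client('s3')
copy_source = {'Bucket': 'my-source-bucket', 'Key': 'data/file.json.gz'}

client.copy_object(
    Bucket='my-destination-bucket',
    Key='data/file.json.gz',
    CopySource=copy_source,
    ContentEncoding='gzip',
    ContentType='application/json',   # REPLACE discards the original metadata,
    MetadataDirective='REPLACE'       # so re-specify anything you still need
)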

How can I pass variables to USERDATA in EC2 creation via Lambda using Boto3

I am trying to create an EC2 instance via Lambda using Boto3. Creation of the EC2 instance works OK, but passing variables into the USERDATA script does not.
I've tried the following ways of referencing variables in USERDATA:
os.environ['VARIABLE'], VARIABLE, $VARIABLE, ${VARIABLE}
import os
import boto3

EC2 = boto3.client('ec2', region_name=os.environ['REGION'])
AMI = os.environ['AMI']
INSTANCE_TYPE = os.environ['INSTANCE_TYPE']

def lambda_to_ec2(event, context):
    init_script = """#!/bin/bash
yum update -y
echo VARIABLE
shutdown -h +5"""

    print 'Running script:'
    print init_script

    instance = EC2.run_instances(
        ImageId=AMI,
        InstanceType=INSTANCE_TYPE,
        MinCount=1,
        MaxCount=1,
        InstanceInitiatedShutdownBehavior='terminate',
        UserData=init_script
    )

    instance_id = instance['Instances'][0]['InstanceId']
    return instance_id
I'd like to be able to use environment variables in init_script, which is passed to UserData.
Found out this is really a Python issue. When using triple quotes, you can access variables the way described in "In Python, can you have variables within triple quotes? If so, how?"
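For example, one way to interpolate a value into the triple-quoted UserData string (MY_VARIABLE is a hypothetical Lambda environment variable):

import os

# Substitute the Lambda environment variable into the UserData script
init_script = """#!/bin/bash
yum update -y
echo {value}
shutdown -h +5""".format(value=os.environ['MY_VARIABLE'])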