AWS SageMaker Notebook's Default S3 Bucket - Can't Access Uploaded Files within Notebook

In SageMaker Studio, I created directories and uploaded files to my SageMaker's default S3 bucket using the GUI, and was exploring how to work with those uploaded files using a SageMaker Studio Notebook.
Within the SageMaker Studio Notebook, I ran
import sagemaker
import boto3
sess = sagemaker.Session()
bucket = sess.default_bucket()  # sagemaker-abcdef
prefix = "folderJustBelowRoot"
conn = boto3.client('s3')
conn.list_objects(Bucket=bucket, Prefix=prefix)
# this returns a response dictionary whose metadata includes 'HTTPStatusCode': 200 and 'server': 'AmazonS3', which means the request-response was successful
What I don't understand is why the 'Contents' key and its value are missing from the conn.list_objects dictionary response.
And when I go to my SageMaker default bucket in the S3 console, why are my uploaded files not appearing?
===============================================================
I was expecting:
the response from conn.list_objects(Bucket=bucket, Prefix=prefix) to contain the 'Contents' key (within my SageMaker Studio Notebook)
the S3 console to show the files I uploaded to 'my SageMaker's default bucket'

Question 2: Why are my uploaded files not appearing when I go to my SageMaker default bucket in the S3 console?
It seems that when you upload files from your local desktop/laptop onto AWS SageMaker Studio using the GUI, your files are stored on the Elastic Block Store (EBS) volume of your SageMaker Studio instance, not in S3.
To access the following items within your SageMaker Studio instance, reference them by their path relative to your notebook's working directory (see the sketch after this list):
Folder Path - "subFolderLayer1/subFolderLayer2/subFolderLayer3" => to access 'subFolderLayer3'
File Path - "subFolderLayer1/subFolderLayer2/subFolderLayer3/fileName.extension" => to access 'fileName.extension' within your subFolderLayers
=========
To access the files in the default S3 bucket for your AWS SageMaker session, first identify it:
import sagemaker
sess = sagemaker.Session()
bucket = sess.default_bucket()  # sagemaker-abcdef
Then go to that bucket in the S3 console and upload your files and folders. When you have done that, move on to the response for question 1.
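If you prefer to upload programmatically instead of through the console, here is a minimal sketch with boto3, reusing the bucket variable from above (the local file name and object key are placeholders):
import boto3
s3 = boto3.client('s3')
# Upload a local file into the default bucket under the chosen prefix
s3.upload_file('localFile.csv', bucket, 'folderJustBelowRoot/localFile.csv')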
=================================================================
Question 1: Why are the 'Contents' key and its value missing from the conn.list_objects dictionary response?
prefix = "folderJustBelowYourBucket"
conn = boto3.client('s3')
conn.list_objects(Bucket=bucket, Prefix=prefix)
The conn.list_objects dictionary response now contains a 'Contents' key, whose value is a list of metadata - one metadata dictionary for each object under that prefix ('folderJustBelowYourBucket').
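A minimal sketch of iterating over those 'Contents' entries, assuming bucket and prefix are set as above:
response = conn.list_objects(Bucket=bucket, Prefix=prefix)
for obj in response.get('Contents', []):  # empty list if nothing matches the prefix
    print(obj['Key'], obj['Size'], obj['LastModified'])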

You can upload and download files between Amazon SageMaker and Amazon S3 using the SageMaker Python SDK. The SageMaker S3 utilities provide S3Uploader and S3Downloader classes to easily work with S3 from within SageMaker Studio notebooks.
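A minimal sketch using those utilities, assuming bucket and prefix are set as in the snippets above and the local file name is a placeholder:
from sagemaker.s3 import S3Uploader, S3Downloader
# Upload a local file to s3://<bucket>/<prefix>/ and get back its S3 URI
s3_uri = S3Uploader.upload('localFile.csv', f's3://{bucket}/{prefix}')
# Download it again into the current directory on the Studio volume
S3Downloader.download(s3_uri, '.')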
A comment about the 'file system' in your question 2: the files are stored on the SageMaker Studio user profile's Amazon Elastic File System (Amazon EFS) volume, not EBS (SageMaker classic notebook instances use EBS volumes). Refer to this blog for a more detailed overview of the SageMaker architecture.

Related

How can AWS Lambda pick the latest version of the script from S3

I have an S3 bucket that we use as a code repository to store our Lambda code, which is then read by the Lambda function.
The S3 bucket is versioned, so every time we upload the script again (after altering the code) a new version of the zip file is created for the existing object.
Now I want Lambda to automatically pick up the latest version of the zip file, instead of me altering it manually in the CloudFormation template and running it, OR attaching it manually to the Lambda every time.
I was able to resolve the issue, so I just wanted to post the solution for reference.
I followed the steps below:
Make sure that the name of the Lambda function and the name of the zip file (deployment package) are exactly the same.
Create a Lambda function that is triggered when you upload any new code to your S3 bucket.
Process the event information and use the S3 API to fetch the latest version of the file from S3.
Use the boto3 API to reconfigure your final Lambda function.
import boto3
import json

client = boto3.client("s3")
lambda_client = boto3.client("lambda")

def lambda_handler(event, context):
    # The S3 event tells us which bucket/key was just uploaded
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    file = event["Records"][0]["s3"]["object"]["key"]
    # get_object without a VersionId returns the latest version of the object
    get_version = client.get_object(
        Bucket=bucket,
        Key=file
    )
    versionId = get_version["VersionId"]  # getting the latest version of the code
    # Point the target Lambda (named after the zip file) at that exact version
    update_lambda = lambda_client.update_function_code(
        FunctionName=file.split("/")[-1].split(".")[0],
        S3Bucket=bucket,
        S3Key=file,
        S3ObjectVersion=versionId
    )
If you want to deploy a new version of a Lambda function's code automatically as it is uploaded to the S3 bucket, you can use S3 Event Notifications to, for example, notify an SNS topic and subscribe another Lambda function that performs the deployment (for instance via CloudFormation or an AWS SDK call such as update_function_code).
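A minimal sketch of wiring up that notification with boto3, assuming the bucket and SNS topic already exist and the topic's policy allows S3 to publish to it (the bucket name and topic ARN are placeholders):
import boto3
s3 = boto3.client('s3')
s3.put_bucket_notification_configuration(
    Bucket='my-code-bucket',  # hypothetical bucket name
    NotificationConfiguration={
        'TopicConfigurations': [
            {
                'TopicArn': 'arn:aws:sns:us-east-1:123456789012:code-uploads',  # hypothetical topic
                'Events': ['s3:ObjectCreated:*'],
            }
        ]
    },
)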

How to dynamically change the "S3 object key" in AWS CodePipeline when source is S3 bucket

I am trying to use an S3 bucket as the source for CodePipeline. We want to save a source code version like "1.0.1" or "1.0.2" in the S3 bucket each time we trigger the Jenkins pipeline, so the source saved in the S3 bucket changes dynamically. But since the "S3 object key" is not dynamic, we can't build the artifact based on version numbers generated dynamically by Jenkins. Is there a way to make the "S3 object key" dynamic and take its value from the Jenkins pipeline when CodePipeline is triggered?
Not possible natively, but you can do it by writing your own Lambda function. Lambda is required because CodePipeline has the restriction that you have to specify a fixed object key name while setting up the pipeline.
So, let's say you have 2 pipelines, CircleCI (CCI) and CodePipeline (CP). CCI generates some files and pushes them to your S3 bucket (S3-A). Now, you want CP to pick up the latest zip file as its source. But since the latest zip file will have a different name each time (1.0.1 or 1.0.2), you can't point CP at it directly.
So, on that S3 bucket (S3-A), you can have an S3 event notification trigger enabled with your custom Lambda function. Whenever a new object gets uploaded to that S3 bucket (S3-A), your Lambda function will be triggered; it will fetch the latest uploaded object from that S3 bucket (S3-A), zip/unzip that object, and push it to another S3 bucket (S3-B) under a fixed name like file.zip, which you configure as the source for your CP. As soon as there is a new object named file.zip in your S3 bucket (S3-B), your CP will be triggered automatically.
PS: You will have to write your own Lambda function so that it performs all of the above operations, like zipping/unzipping the newly uploaded object in S3-A, uploading it to S3-B, etc.
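A minimal sketch of such a Lambda function, assuming the uploaded artifact is already a zip and only needs to be copied to the fixed key (the bucket names and the file.zip key are placeholders; any re-zipping logic is omitted):
import boto3
s3 = boto3.client('s3')
DEST_BUCKET = 's3-b-codepipeline-source'  # hypothetical S3-B bucket
DEST_KEY = 'file.zip'                     # fixed key that CodePipeline is configured with
def lambda_handler(event, context):
    # Triggered by the S3 event notification on S3-A
    src_bucket = event['Records'][0]['s3']['bucket']['name']
    src_key = event['Records'][0]['s3']['object']['key']
    # Copy the newly uploaded object to the fixed key that CodePipeline watches
    s3.copy_object(
        Bucket=DEST_BUCKET,
        Key=DEST_KEY,
        CopySource={'Bucket': src_bucket, 'Key': src_key},
    )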

how to access text file from s3 bucket into sagemaker for training a model?

I am trying to train a chatbot model using TensorFlow and a seq-to-seq architecture in SageMaker. I have completed the coding in Spyder, but when
I try to access the Cornell Movie Corpus dataset from the S3 bucket in SageMaker, it says 'no such file or directory', even after granting access to the S3 bucket.
If you're in a notebook: aws s3 cp s3://path_to_the_file /home/ec2-user/SageMaker will copy data from S3 to your SageMaker directory in the notebook (if you have the IAM permissions to do so).
If you're in the Docker container of a SageMaker training job: you need to pass the S3 path to the SDK training call, estimator.fit({'mydata': 's3://path_to_the_file'}), and inside the container your TensorFlow code must read from this path: /opt/ml/input/data/mydata
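A minimal sketch of that training-job pattern, assuming a script-mode TensorFlow estimator (the entry point train.py, role ARN, instance type, and framework/Python versions are placeholders):
from sagemaker.tensorflow import TensorFlow
estimator = TensorFlow(
    entry_point='train.py',  # hypothetical training script
    role='arn:aws:iam::123456789012:role/SageMakerRole',  # placeholder execution role
    instance_count=1,
    instance_type='ml.m5.xlarge',
    framework_version='2.12',
    py_version='py310',
)
# The channel name 'mydata' determines the path inside the container
estimator.fit({'mydata': 's3://path_to_the_file'})
Inside the container, the channel contents land under /opt/ml/input/data/mydata, which is also exposed to the training script as the SM_CHANNEL_MYDATA environment variable.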

How would cv2 access videos in AWS s3 bucket

I can use the SageMaker notebook now, but there is a significant problem. When I wanted to use cv2.VideoCapture to read a video in the S3 bucket, it said the path doesn't exist. One answer on Stack Overflow said cv2 only supports local files, which means we have to download videos from the S3 bucket to the notebook, but I don't want to do this. I wonder how you read the videos? Thanks.
I found one solution is to use CloudFront, but would this be charged and is it fast?
You are using Python in SageMaker, so you could use:
import boto3
s3_client = boto3.client('s3')
s3_client.download_file('deepfake2020', 'dfdc_train_part_1/foo.mp4', 'foo.mp4')
This will download the file from Amazon S3 to the local disk, in a file called foo.mp4.
See: download_file() in boto3
This requires that the SageMaker instance has been granted permissions to access the Amazon S3 bucket.
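A minimal sketch of reading the downloaded file with OpenCV afterwards (the file name foo.mp4 comes from the example above):
import cv2
cap = cv2.VideoCapture('foo.mp4')  # local file downloaded from S3
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    # process the frame here, e.g. inspect frame.shape
cap.release()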
This solution also works.
To use AWS SageMaker:
1) Go to the Support Center and ask to raise your notebook instance limit. They normally reply within 1 day.
2) When creating a notebook, change the local disk size to 1 TB (double the data size).
3) Open JupyterLab and type cd SageMaker in the terminal.
4) Use CurlWget to get the download link of the dataset.
5) After downloading, unzip the data:
unzip dfdc_train_all.zip
unzip '*.zip'
There you go.

How to read trained data file in S3

I'm trying to build a face recognition service using AWS Lambda.
I want to deploy a .zip file that includes the trained data file.
But AWS Lambda won't deploy it because of its size.
So I changed the approach: upload the trained data file to S3 and use it from there.
But I don't know how to do that.
Could you tell me how to read the trained data file from S3 in an AWS Lambda function?
Once you have the data in S3, you can copy the file from S3 into Lambda. Lambda provides 512 MB of storage in the /tmp folder that is writable at run time.
import boto3
s3 = boto3.resource('s3')
# Download the object 'hello.txt' from bucket 'mybucket' into Lambda's writable /tmp directory
s3.meta.client.download_file('mybucket', 'hello.txt', '/tmp/hello.txt')
https://docs.aws.amazon.com/lambda/latest/dg/limits.html
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.download_file
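A minimal sketch of a handler that pulls the trained data into /tmp once and reuses it across warm invocations (the bucket name, object key, and load_model helper are placeholders):
import os
import boto3
s3 = boto3.resource('s3')
MODEL_BUCKET = 'my-model-bucket'          # hypothetical bucket holding the trained data
MODEL_KEY = 'face_recognition/model.dat'  # hypothetical object key
LOCAL_PATH = '/tmp/model.dat'             # Lambda's writable scratch space
def lambda_handler(event, context):
    # Download only on a cold start; /tmp persists across warm invocations of the same container
    if not os.path.exists(LOCAL_PATH):
        s3.meta.client.download_file(MODEL_BUCKET, MODEL_KEY, LOCAL_PATH)
    # model = load_model(LOCAL_PATH)  # hypothetical: load the trained data with your ML library
    return {'statusCode': 200}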