How to link an S3 bucket to a SageMaker notebook - amazon-s3

I am trying to link my S3 bucket to a notebook instance, but I have not been able to.
Here is how much I know:
from sagemaker import get_execution_role
role = get_execution_role
bucket = 'atwinebankloadrisk'
datalocation = 'atwinebankloadrisk'
data_location = 's3://{}/'.format(bucket)
output_location = 's3://{}/'.format(bucket)
to call the data from the bucket:
df_test = pd.read_csv(data_location/'application_test.csv')
df_train = pd.read_csv('./application_train.csv')
df_bureau = pd.read_csv('./bureau_balance.csv')
However, I keep getting errors and am unable to proceed.
I haven't found answers that help much.
PS: I am new to AWS.

You can load S3 data into an AWS SageMaker notebook by using the sample code below. Make sure the Amazon SageMaker execution role has a policy attached that grants it access to S3.
[1] https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html
import boto3
import botocore
import pandas as pd
from sagemaker import get_execution_role
role = get_execution_role()
bucket = 'Your_bucket_name'
data_key = 'your_data_file.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
pd.read_csv(data_location)
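If reads then fail with an AccessDenied error, the execution role likely lacks an S3 policy. A minimal sketch of attaching the AWS-managed read-only policy, assuming you run it with credentials that have IAM permissions and substitute your real role name:
import boto3
iam = boto3.client('iam')
iam.attach_role_policy(
    RoleName='AmazonSageMaker-ExecutionRole-XXXXXXXX',          # placeholder role name
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess'  # AWS-managed policy
)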

You're trying to use pandas to read files from S3. Out of the box, pandas can read files from your local disk, but not directly from S3 (unless an S3-aware filesystem layer such as s3fs is installed, as covered in other answers below).
Instead, download the files from S3 to your local disk, then use pandas to read them.
import boto3
import botocore
BUCKET_NAME = 'my-bucket' # replace with your bucket name
KEY = 'my_image_in_s3.jpg' # replace with your object key
s3 = boto3.resource('s3')
try:
    # download as a local file
    s3.Bucket(BUCKET_NAME).download_file(KEY, 'my_local_image.jpg')
    # OR read directly into memory as bytes:
    # bytes = s3.Object(BUCKET_NAME, KEY).get()['Body'].read()
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "404":
        print("The object does not exist.")
    else:
        raise
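Applied to the CSV files from the original question, the same download-then-read pattern looks roughly like this (the key below is an assumption; adjust it to the real object path inside the bucket):
import boto3
import pandas as pd
BUCKET_NAME = 'atwinebankloadrisk'   # bucket from the question
KEY = 'application_test.csv'         # assumed key; adjust to the real object path
boto3.client('s3').download_file(BUCKET_NAME, KEY, 'application_test.csv')
df_test = pd.read_csv('application_test.csv')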

You can use s3fs (https://s3fs.readthedocs.io/en/latest/) to read S3 files directly with pandas. The code below shows the basic pattern:
import os
import pandas as pd
from s3fs.core import S3FileSystem
os.environ['AWS_CONFIG_FILE'] = 'aws_config.ini'
s3 = S3FileSystem(anon=False)
key = 'path/to/your-csv.csv'
bucket = 'your-bucket-name'
df = pd.read_csv(s3.open('{}/{}'.format(bucket, key), mode='rb'))
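s3fs can also write back to S3 through the same filesystem object; a minimal sketch, assuming the df from above, write access to the bucket, and a placeholder output key:
# write the dataframe back to S3 as CSV (output key is a placeholder)
with s3.open('{}/{}'.format(bucket, 'output/your-csv-copy.csv'), 'wb') as f:
    f.write(df.to_csv(index=False).encode('utf-8'))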

In pandas 1.0.5, if you've already provided access to the notebook instance, reading a csv from S3 is as easy as this (https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#reading-remote-files):
df = pd.read_csv('s3://<bucket-name>/<filepath>.csv')
During the notebook setup process I attached the AmazonSageMakerFullAccess policy to the notebook instance's execution role, granting it access to the S3 bucket. You can also do this via the IAM Management Console.
If you need credentials, there are three ways to provide them (https://s3fs.readthedocs.io/en/latest/#credentials); a sketch of the first option follows the list:
aws_access_key_id, aws_secret_access_key, and aws_session_token environment variables
configuration files such as ~/.aws/credentials
for nodes on EC2, the IAM metadata provider
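A minimal sketch of the environment-variable option (placeholder values; on a SageMaker notebook instance the attached IAM role usually makes this unnecessary):
import os
import pandas as pd
# placeholder credentials; s3fs/boto3 read these standard variable names
os.environ['AWS_ACCESS_KEY_ID'] = '<your-access-key-id>'
os.environ['AWS_SECRET_ACCESS_KEY'] = '<your-secret-access-key>'
df = pd.read_csv('s3://<bucket-name>/<filepath>.csv')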

import boto3
# Files are referred to as objects in S3.
# A file name is referred to as a key name in S3.
def write_to_s3(filename, bucket_name, key):
    with open(filename, 'rb') as f:  # read in binary mode
        return boto3.Session().resource('s3').Bucket(bucket_name).Object(key).upload_fileobj(f)
# Simply call the write_to_s3 function with the required arguments
write_to_s3('file_name.csv',
            bucket_name,
            'file_name.csv')
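The download direction is symmetric; a minimal sketch using the same resource API and the same placeholder names as above:
def read_from_s3(filename, bucket_name, key):
    with open(filename, 'wb') as f:  # write in binary mode
        return boto3.Session().resource('s3').Bucket(bucket_name).Object(key).download_fileobj(f)
read_from_s3('file_name.csv',
             bucket_name,
             'file_name.csv')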

Related


I'm trying to write a dataframe as CSV directly to an S3 bucket.
I've tried the StringIO method, but the problem is that I run into a "KeyTooLongError".
import boto3
client = boto3.client('s3')
client.create_bucket(Bucket = 'poolpo-rent-a-car-bucket')
# checking if the bucket was created
response = client.list_buckets()
response['Buckets']
bucket_name = 'poolpo-rent-a-car-bucket'
car_costs.to_csv(f"s3://{bucket_name}/{car_costs}.csv")
This is the StringIO one
from io import StringIO
bucket_name = 'poolpo-rent-a-car-bucket'
csv_buffer = StringIO()
branch_locations.to_csv(csv_buffer)
s3_resource = boto3.resource('s3')
s3_resource.Object(bucket_name, f'{branch_locations}.csv').put(Body=csv_buffer.getvalue())
And the error
ClientError: An error occurred (KeyTooLongError) when calling the PutObject operation: Your key is too long
These are medium-sized dataframes, around 5000 rows and 3-5 columns.
For an unrelated reason, I had to reinstall Anaconda, and the problems went away.
I ended up using a much simpler approach.
import boto3
client = boto3.client('s3')
client.create_bucket(Bucket = 'poolpo-rent-a-car-bucket')
response = client.list_buckets()
response['Buckets']
car_costs.to_csv(f"s3://{bucket_name}/car_costs.csv")
One other thing I noticed in S3: when I used the f-string to interpolate the dataframe itself, I was basically using the dataframe's contents as the key name, which is why I was getting the KeyTooLongError.
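For reference, the StringIO version from the question also works once the key is a plain string instead of the dataframe; a sketch with the same names:
from io import StringIO
import boto3
bucket_name = 'poolpo-rent-a-car-bucket'
csv_buffer = StringIO()
branch_locations.to_csv(csv_buffer)
# use a plain string as the object key, not the dataframe itself
boto3.resource('s3').Object(bucket_name, 'branch_locations.csv').put(Body=csv_buffer.getvalue())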

AWS S3 and Sagemaker: No such file or directory

I have created an S3 bucket 'testshivaproject' and uploaded an image to it. When I try to access it in a SageMaker notebook, it throws the error 'No such file or directory'.
# import libraries
import boto3, re, sys, math, json, os, sagemaker, urllib.request
from sagemaker import get_execution_role
import numpy as np
# Define IAM role
role = get_execution_role()
my_region = boto3.session.Session().region_name # set the region of the instance
print("success :"+my_region)
Output: success :us-east-2
role
Output: 'arn:aws:iam::847047967498:role/service-role/AmazonSageMaker-ExecutionRole-20190825T121483'
bucket = 'testprojectshiva2'
data_key = 'ext_image6.jpg'
data_location = 's3://{}/{}'.format(bucket, data_key)
print(data_location)
Output: s3://testprojectshiva2/ext_image6.jpg
test = load_img(data_location)
Output: No such file or directory
There are similar questions (Load S3 Data into AWS SageMaker Notebook), but I did not find a solution there.
Thanks for using Amazon SageMaker!
I sort of guessed from your description, but are you trying to use the Keras load_img function to load images directly from your S3 bucket?
Unfortunately, the load_img function is designed to only load files from disk, so passing an s3:// URL to that function will always raise a FileNotFoundError.
It's common to first download images from S3 before using them, so you can use boto3 or the AWS CLI to download the file before calling load_img.
Alternatively, since the load_img function simply creates a PIL Image object, you can create the PIL object directly from the data in S3 using boto3, and not use the load_img function at all.
In other words, you could do something like this:
from io import BytesIO
from PIL import Image
import boto3
s3 = boto3.client('s3')
test = Image.open(BytesIO(
    s3.get_object(Bucket=bucket, Key=data_key)['Body'].read()
))
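If you prefer the download-first route mentioned above, here is a minimal sketch, reusing bucket and data_key from the question (the load_img import path may vary by Keras/TensorFlow version):
import boto3
from keras.preprocessing.image import load_img  # import path may differ by Keras/TF version
s3 = boto3.client('s3')
# download the object to local disk first, then load it from there
s3.download_file(bucket, data_key, 'ext_image6_local.jpg')
test = load_img('ext_image6_local.jpg')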
Hope this helps you out in your project!
You may use the following code to pull a CSV file into SageMaker.
import pandas as pd
bucket='your-s3-bucket'
data_key = 'your.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
df = pd.read_csv(data_location)
Alternative formatting for the data_location variable:
data_location = f's3://{bucket}/{data_key}'

How to load data from your S3 bucket to Sagemaker jupyter notebook to train the model?

I have CSV files in an S3 bucket, and I want to use them to train a model in SageMaker.
I'm using this code, but it gives an error (file not found):
import boto3
import pandas as pd
region = boto3.Session().region_name
train_data_location = 's3://taggingu-{}/train.csv'.format(region)
df=pd.read_csv(train_data_location, header = None)
print(df.head())
What can be the solution to this ?
Not sure, but could this Stack Overflow question answer it? Load S3 Data into AWS SageMaker Notebook
To quote @Chhoser:
import boto3
import pandas as pd
from sagemaker import get_execution_role
role = get_execution_role()
bucket='my-bucket'
data_key = 'train.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
pd.read_csv(data_location)
You can use the AWS SDK for pandas (awswrangler), a library that extends pandas to work smoothly with AWS data stores.
import awswrangler as wr
df = wr.s3.read_csv("s3://bucket/file.csv")
Most notebook kernels have it; if it is missing, it can be installed via pip install awswrangler.
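awswrangler also covers the write direction that comes up elsewhere in this thread; a minimal sketch (the path is a placeholder):
import awswrangler as wr
# write a dataframe back to S3 as CSV
wr.s3.to_csv(df=df, path="s3://bucket/file.csv", index=False)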

AWS Sagemaker: AttributeError: module 'pandas' has no attribute 'core'

Let me prefix this by saying I'm very new to tensorflow and even newer to AWS Sagemaker.
I have some tensorflow/keras code that I wrote and tested in a local dockerized Jupyter notebook, and it runs fine. In it, I import a CSV file as my input.
I used SageMaker to spin up a Jupyter notebook instance with conda_tensorflow_p36. I modified the pandas.read_csv() code to point to my input file, now hosted in an S3 bucket.
So I changed this line of code from
import pandas as pd
data = pd.read_csv("/input.csv", encoding="latin1")
to this
import pandas as pd
data = pd.read_csv("https://s3.amazonaws.com/my-sagemaker-bucket/input.csv", encoding="latin1")
and I get this error
AttributeError: module 'pandas' has no attribute 'core'
I'm not sure if it's a permissions issue. I read that as long as I name my bucket with the string "sagemaker" in it, the notebook should have access to it.
You can pull your data from S3, for example:
import boto3
import io
import pandas as pd
# Set below parameters
bucket = '<bucket name>'
key = 'data/training/iris.csv'
endpointName = 'decision-trees'
# Pull our data from S3
s3 = boto3.client('s3')
f = s3.get_object(Bucket=bucket, Key=key)
# Make a dataframe
shape = pd.read_csv(io.BytesIO(f['Body'].read()), header=None)

AWS S3 bucket write error

I created an AWS S3 bucket and tried the sample k-means example in a Jupyter notebook.
Being the account owner, I have read/write permissions, but I am unable to write and get the following error:
ClientError: An error occurred (AccessDenied) when calling the PutObject operation: Access Denied
Here's the k-means sample code:
from sagemaker import get_execution_role
role = get_execution_role()
bucket='testingshk'
import pickle, gzip, numpy, urllib.request, json
urllib.request.urlretrieve("http://deeplearning.net/data/mnist/mnist.pkl.gz", "mnist.pkl.gz")
with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')
from sagemaker import KMeans
data_location = 's3://{}/kmeans_highlevel_example/data'.format(bucket)
output_location = 's3://{}/kmeans_example/output'.format(bucket)
print('training data will be uploaded to: {}'.format(data_location))
print('training artifacts will be uploaded to: {}'.format(output_location))
kmeans = KMeans(role=role,
                train_instance_count=2,
                train_instance_type='ml.c4.8xlarge',
                output_path=output_location,
                k=10,
                data_location=data_location)
kmeans.fit(kmeans.record_set(train_set[0]))
Even if you have full access to the bucket as the account owner, the notebook itself needs credentials to put objects into a private bucket: either an access key and secret, or an execution role with an S3 policy attached (as discussed in the earlier answers). Alternatively, if you make the bucket publicly writable, you can push objects to it without credentials, although that is not recommended.
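A minimal sketch of passing explicit credentials to boto3 (placeholder values; with a properly configured SageMaker execution role, as linked in the first answer, this should not be necessary):
import boto3
s3 = boto3.client(
    's3',
    aws_access_key_id='<your-access-key-id>',          # placeholder
    aws_secret_access_key='<your-secret-access-key>',  # placeholder
)
# verify write access with a small test object
s3.put_object(Bucket='testingshk', Key='kmeans_example/test.txt', Body=b'hello')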