Load PyTorch model from S3 bucket

I want to load a PyTorch model (model.pt) from an S3 bucket. I wrote the following code:
import io
import torch
from smart_open import open as smart_open
load_path = "s3://serial-no-images/yolo-models/model4/model.pt"
with smart_open(load_path) as f:
    buffer = io.BytesIO(f.read())
    model.load_state_dict(torch.load(buffer))
This results in the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte
One solution would be to download the model locally, but I want to avoid this and load the model directly from S3. Unfortunately, I couldn't find a good solution for that online. Can someone help me out here?

According to the documentation, the following works:
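import io
import torch
from smart_open import open as smart_open
load_path = "s3://serial-no-images/yolo-models/model4/model.pt"
# Open in binary mode ('rb'); text mode would try to decode the model bytes.
with smart_open(load_path, 'rb') as f:
    buffer = io.BytesIO(f.read())
    # model is assumed to be an already constructed module matching the checkpoint
    model.load_state_dict(torch.load(buffer))
I had tried this before, but missed that 'rb' has to be passed as the mode argument. Without it, the file is opened in text mode and the binary model data is decoded as UTF-8, which causes the UnicodeDecodeError.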
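To see why the mode matters, here is a minimal local sketch. It uses the built-in open instead of smart_open (an assumption, so that it runs without S3 access), and the payload bytes and temporary file are made up for illustration. Serialized binary data usually contains bytes such as 0x80 that are not valid UTF-8, so text mode fails while 'rb' returns the raw bytes unchanged.

```python
import io
import os
import tempfile

# Hypothetical stand-in for a serialized model: starts with byte 0x80,
# which is not a valid first byte of a UTF-8 sequence.
payload = b"\x80\x02binary-model-bytes"

# Write the payload to a temporary file that plays the role of model.pt.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # Text mode: the bytes are decoded as UTF-8, which fails on 0x80.
    text_mode_error = None
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
    except UnicodeDecodeError as exc:
        text_mode_error = exc

    # Binary mode: the raw bytes are returned and can be wrapped in BytesIO.
    with open(path, "rb") as f:
        buffer = io.BytesIO(f.read())
finally:
    os.remove(path)

print("text mode failed with:", type(text_mode_error).__name__)
print("binary mode read", len(buffer.getvalue()), "bytes")
```

The resulting buffer is the kind of file-like object that torch.load accepts in the answer above.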

Related

Error while converting csv to parquet file using pandas

I would like to upload a CSV as a Parquet file to an S3 bucket. Below is the code snippet.
from io import BytesIO
import pandas as pd
df = pd.read_csv('right_csv.csv')
csv_buffer = BytesIO()
df.to_parquet(csv_buffer, compression='gzip', engine='fastparquet')
csv_buffer.seek(0)
The code above raises an error: TypeError: expected str, bytes or os.PathLike object, not _io.BytesIO
How can I make it work?
As per the documentation, io.BytesIO cannot be used when fastparquet is the engine; the auto or pyarrow engine has to be used instead. Quoting from the documentation:
The engine fastparquet does not accept file-like objects.
The code below works without any issues.
import io
f = io.BytesIO()
df.to_parquet(f, compression='gzip', engine='pyarrow')
f.seek(0)
As mentioned in the other answer, this is not supported. One workaround is to write the Parquet data to a NamedTemporaryFile and then copy its content into a BytesIO buffer:
import io
import tempfile
with tempfile.NamedTemporaryFile() as tmp:
    df.to_parquet(tmp.name, compression='gzip', engine='fastparquet')
    with open(tmp.name, 'rb') as fh:
        buf = io.BytesIO(fh.read())
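The temporary-file workaround can be wrapped in a small reusable helper. This is only a sketch: to_buffer_via_tempfile and write_demo are hypothetical names, and write_demo stands in for any writer that only accepts a path. It uses a temporary directory rather than NamedTemporaryFile so the file can be reopened on platforms that lock open temporary files.

```python
import io
import os
import tempfile


def to_buffer_via_tempfile(write_fn, filename="output.bin"):
    """Call write_fn(path) on a temporary path and return the bytes as BytesIO."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = os.path.join(tmp_dir, filename)
        write_fn(tmp_path)  # the writer only needs to understand file paths
        with open(tmp_path, "rb") as fh:
            return io.BytesIO(fh.read())


def write_demo(path):
    """Hypothetical path-only writer used to exercise the helper."""
    with open(path, "wb") as fh:
        fh.write(b"demo-bytes")


buf = to_buffer_via_tempfile(write_demo)
print(len(buf.getvalue()), "bytes buffered")
```

With the DataFrame from the question, the call would presumably be to_buffer_via_tempfile(lambda p: df.to_parquet(p, compression='gzip', engine='fastparquet')).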

Writing preprocessed output CSV to S3 from Scikit Learn image on Sagemaker

My problem: writing out a CSV file to S3 from inside a Sagemaker SKLearn image. I know how to write CSVs to S3 from a notebook - that is working fine. It's within the docker image that I'm unable to get it to work.
This is a preprocessing.py script called as an entry_point parameter to the SKLearn estimator. The purpose is to pre-process the data prior to running an inference. It's the first step in my inference pipeline.
Everything is working as expected in my preprocessing script, except outputting the file at the end.
Attempt #1 - this produces a CSV file that has strange binary-looking data at the beginning and end of the file (before the first cell and after the last cell of the CSV). It's almost a valid CSV but not quite. See the image at the end.
from io import BytesIO
import boto3
import joblib
def write_joblib(file, path):
    # Split "s3://bucket/key/parts" into the bucket name and the object key
    s3_bucket, s3_key = path.split('/')[2], path.split('/')[3:]
    s3_key = '/'.join(s3_key)
    with BytesIO() as f:
        joblib.dump(file, f)
        f.seek(0)
        boto3.client("s3").upload_fileobj(Bucket=s3_bucket, Key=s3_key, Fileobj=f)
predictors_csv = predictors_df.to_csv(index = False)
write_joblib(predictors_csv, predictors_s3_uri)
Attempt #2 - I used StringIO rather than BytesIO. However, this produced a zero-byte file on S3.
Attempt #3 - I tried boto3.client('s3').put_object(...) but got ClientError: An error occurred (AccessDenied) when calling the PutObject operation: Access Denied
I believe I am almost there with Attempt #1 above. I assume it's an encoding issue. If I can fix the structure of the CSV file to remove the non-text characters at the start, it will work. A screenshot of the CSV in Notepad++ is below.
[Screenshot: non-character text at the start of the CSV file]
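I solved this myself. joblib.dump pickles the object it is given, so Attempt #1 uploaded a pickled string rather than plain CSV text, which explains the binary data around the CSV content. Writing the string directly works within an SKLearn estimator container. I assume it will work inside any built-in fit/transform container for writing a CSV to S3 directly.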
The use case is writing out the results of pre-processing for model featurization, vectorization, dimensionality reduction etc. This would occur prior to model inference as the first step in an Inference Pipeline.
import boto3
def write_text_file_to_s3(file_string, path):
    # Split "s3://bucket/key/parts" into the bucket name and the object key
    s3_bucket, s3_key = path.split('/')[2], path.split('/')[3:]
    s3_key = '/'.join(s3_key)
    s3 = boto3.resource('s3')
    s3.Object(s3_bucket, s3_key).put(Body=file_string)
predictors_csv = predictors_df.to_csv(index = False, encoding='utf-8-sig')
write_text_file_to_s3(predictors_csv, predictors_s3_uri)
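The difference between the two attempts can be reproduced without S3. The sketch below uses the standard pickle module as a stand-in for joblib.dump (an assumption; joblib writes a pickle-based format) and urlparse in place of the manual split. The bucket name, key, and CSV text are made up for illustration.

```python
import pickle
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: " + uri)
    return parsed.netloc, parsed.path.lstrip("/")


csv_text = "a,b\n1,2\n"

# Plain encoding yields exactly the CSV text as bytes.
plain_bytes = csv_text.encode("utf-8")

# Pickling wraps the same text in extra binary header and trailer bytes.
pickled_bytes = pickle.dumps(csv_text)

bucket, key = split_s3_uri("s3://example-bucket/some/prefix/predictors.csv")
print(bucket, key)
print(len(plain_bytes), "plain bytes vs", len(pickled_bytes), "pickled bytes")
```

Uploading plain_bytes (or the string itself, as in the answer) gives a file that opens as a normal CSV, whereas the pickled variant has the stray bytes before and after the content.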

How to implement SciBERT with pytorch; error while loading

I am trying to use the SciBERT pre-trained model, namely scibert-scivocab-uncased, in the following way:
!pip install pytorch-pretrained-bert
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
import logging
import matplotlib.pyplot as plt
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
segments_ids = [1] * len(tokenized_text)
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
model = BertModel.from_pretrained('/Users/.../Downloads/scibert_scivocab_uncased-3.tar.gz')
And I get the following error:
EOFError: Compressed file ended before the end-of-stream marker was reached
I downloaded the file from the website (https://github.com/allenai/scibert)
I converted it from "tar" to gzip
Nothing worked.
Any hint on how to approach this?
Thank you!
In the new version of pytorch-pretrained-BERT, i.e. transformers, you can do the following to load a pretrained model after you un-tar the archive:
from transformers import AutoModelForTokenClassification, AutoTokenizer
model = AutoModelForTokenClassification.from_pretrained("/your/local/path/to/scibert_scivocab_uncased")
You need to extract the package and rename the JSON file to config.json.
Then pass the path of the folder where you extracted the package. It should work.
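The extract-and-rename step could be scripted roughly as follows. This is a sketch under assumptions: extract_and_rename_config is a hypothetical helper, and the demo builds a tiny dummy archive whose file name bert_config.json is only a placeholder, since the actual names inside the downloaded archive are not given here.

```python
import os
import tarfile
import tempfile


def extract_and_rename_config(archive_path, dest_dir):
    """Extract a .tar.gz archive and rename its JSON file to config.json.

    Returns the folder that contains config.json.
    """
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Only extract archives from a source you trust.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest_dir, filter="data")
        else:
            tar.extractall(dest_dir)
    for root, _dirs, files in os.walk(dest_dir):
        json_files = [name for name in files if name.endswith(".json")]
        if "config.json" in json_files:
            return root
        if len(json_files) == 1:
            os.rename(os.path.join(root, json_files[0]),
                      os.path.join(root, "config.json"))
            return root
    raise FileNotFoundError("no JSON config file found in " + dest_dir)


# Demo with a dummy archive so the sketch runs without the real download.
with tempfile.TemporaryDirectory() as work_dir:
    src_dir = os.path.join(work_dir, "model_src")
    os.makedirs(src_dir)
    with open(os.path.join(src_dir, "bert_config.json"), "w") as fh:
        fh.write("{}")
    archive_path = os.path.join(work_dir, "model.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src_dir, arcname="model")

    model_dir = extract_and_rename_config(archive_path, os.path.join(work_dir, "out"))
    renamed_ok = os.path.isfile(os.path.join(model_dir, "config.json"))
    model_dir_name = os.path.basename(model_dir)

print("config.json present:", renamed_ok)
```

The returned folder is what would then be passed to from_pretrained.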

AWS S3 and Sagemaker: No such file or directory

I have created an S3 bucket 'testshivaproject' and uploaded an image to it. When I try to access it in a SageMaker notebook, it throws the error 'No such file or directory'.
# import libraries
import boto3, re, sys, math, json, os, sagemaker, urllib.request
from sagemaker import get_execution_role
import numpy as np
# Define IAM role
role = get_execution_role()
my_region = boto3.session.Session().region_name # set the region of the instance
print("success :"+my_region)
Output: success :us-east-2
role
Output: 'arn:aws:iam::847047967498:role/service-role/AmazonSageMaker-ExecutionRole-20190825T121483'
bucket = 'testprojectshiva2'
data_key = 'ext_image6.jpg'
data_location = 's3://{}/{}'.format(bucket, data_key)
print(data_location)
Output: s3://testprojectshiva2/ext_image6.jpg
test = load_img(data_location)
Output: No such file or directory
Similar questions have been raised (Load S3 Data into AWS SageMaker Notebook), but I did not find any solution there.
Thanks for using Amazon SageMaker!
I sort of guessed from your description, but are you trying to use the Keras load_img function to load images directly from your S3 bucket?
Unfortunately, the load_img function is designed to only load files from disk, so passing an s3:// URL to that function will always raise a FileNotFoundError.
It's common to first download images from S3 before using them, so you can use boto3 or the AWS CLI to download the file before calling load_img.
Alternatively, since the load_img function simply creates a PIL Image object, you can create the PIL object directly from the data in S3 using boto3, and not use the load_img function at all.
In other words, you could do something like this:
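from io import BytesIO
import boto3
from PIL import Image
s3 = boto3.client('s3')
# bucket and data_key are the variables defined in the question
test = Image.open(BytesIO(
    s3.get_object(Bucket=bucket, Key=data_key)['Body'].read()
))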
Hope this helps you out in your project!
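The download-first option mentioned above might look roughly like this. It is a sketch, not a definitive implementation: download_s3_object is a hypothetical helper, and the client is passed in as a parameter so the function can be tried with any object that offers a download_file method; in a notebook you would pass boto3.client('s3').

```python
import os
import tempfile


def download_s3_object(s3_client, bucket, key, dest_dir):
    """Download s3://bucket/key into dest_dir and return the local file path."""
    local_path = os.path.join(dest_dir, os.path.basename(key))
    # download_file(Bucket, Key, Filename) writes the object to the local path.
    s3_client.download_file(bucket, key, local_path)
    return local_path
```

With the variables from the question, load_img(download_s3_object(boto3.client('s3'), bucket, data_key, tempfile.mkdtemp())) would then read the image from local disk.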
You may use the following code to pull a CSV file into SageMaker. Note that reading an s3:// path with pandas requires the optional s3fs package to be installed.
import pandas as pd
bucket='your-s3-bucket'
data_key = 'your.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
df = pd.read_csv(data_location)
Alternative formatting for the data_location variable:
data_location = f's3://{bucket}/{data_key}'

AWS Sagemaker: AttributeError: module 'pandas' has no attribute 'core'

Let me preface this by saying I'm very new to TensorFlow and even newer to AWS SageMaker.
I have some TensorFlow/Keras code that I wrote and tested on a local dockerized Jupyter notebook and it runs fine. In it, I import a CSV file as my input.
I use SageMaker to spin up a Jupyter notebook instance with conda_tensorflow_p36. I modified the pandas.read_csv() code to point to my input file, now hosted on an S3 bucket.
So I changed this line of code from
import pandas as pd
data = pd.read_csv("/input.csv", encoding="latin1")
to this
import pandas as pd
data = pd.read_csv("https://s3.amazonaws.com/my-sagemaker-bucket/input.csv", encoding="latin1")
and I get this error
AttributeError: module 'pandas' has no attribute 'core'
I'm not sure if it's a permissions issue. I read that as long as I name my bucket with the string "sagemaker" it should have access to it.
Instead of reading from the URL, you can pull the data from S3 with boto3, for example:
import boto3
import io
import pandas as pd
# Set below parameters
bucket = '<bucket name>'
key = 'data/training/iris.csv'
endpointName = 'decision-trees'
# Pull our data from S3
s3 = boto3.client('s3')
f = s3.get_object(Bucket=bucket, Key=key)
# Make a dataframe
shape = pd.read_csv(io.BytesIO(f['Body'].read()), header=None)
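The same pattern can be factored into a small function. This is a sketch with a hypothetical name, s3_object_to_buffer; the client is a parameter so the function can be exercised without AWS credentials, and in practice you would pass boto3.client('s3').

```python
import io


def s3_object_to_buffer(s3_client, bucket, key):
    """Fetch an S3 object and return its content as an in-memory binary buffer."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    # response["Body"] is a stream; read it fully and wrap the bytes.
    return io.BytesIO(response["Body"].read())
```

For the file from the question, something like pd.read_csv(s3_object_to_buffer(boto3.client('s3'), bucket, key), encoding="latin1") should then keep the original encoding argument.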