How to make a prediction with SageMaker on a pandas DataFrame

I am using SageMaker to train and deploy my machine learning model. As for prediction, it will be executed by a Lambda function as a scheduled job (every hour). The process is as follows:
pull new data from S3 since last prediction
preprocess, aggregate and create prediction data set
call sagemaker endpoint and make prediction
either save result to s3 or insert to database table
Based on my findings, the input will typically either come from the Lambda payload:
data = json.loads(json.dumps(event))
payload = data['data']
print(payload)
response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                                   ContentType='text/csv',
                                   Body=payload)
or be read from an S3 file:
my_bucket = 'pred_data'  # substitute your S3 bucket name
obj = client.get_object(Bucket=my_bucket, Key='foo.csv')
body = obj['Body'].read().decode('utf-8')
lines = body.splitlines()
reader = csv.reader(lines)
file = io.StringIO(body)
response = runtime.invoke_endpoint(EndpointName=ENDPOINT,
                                   ContentType='*/*',
                                   Body=file.getvalue())
output = response['Body'].read().decode('utf-8')
Since I will be pulling raw data from S3 and preprocessing it, a pandas DataFrame will be generated. Is it possible to feed this directly as the input to invoke_endpoint? I could upload the aggregated dataset to another S3 bucket, but does it have to go through the decoding, csv.reader, StringIO and all that just like the example I found, or is there an easier way to do it? Is the decode step really necessary to get the output?

You can send whatever payload you want when you call InvokeEndpoint and in whatever format. You can control the contract on either side (assuming your model supports it). If you are using a model that you didn't create, look to see if it supports pre/post processing which would allow you to define the contract yourself.
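For example, if the endpoint accepts CSV input (as many of the built-in algorithms do), a minimal sketch of sending a pandas DataFrame straight from the Lambda, without an intermediate file, could look like this (the endpoint name and the no-header/no-index convention are assumptions about your model's contract):
import boto3
import pandas as pd

runtime = boto3.client('sagemaker-runtime')

def predict(df: pd.DataFrame, endpoint_name: str) -> str:
    # Serialize the DataFrame to CSV text in memory; many built-in algorithms
    # expect text/csv input with no header row and no index column.
    payload = df.to_csv(header=False, index=False)
    response = runtime.invoke_endpoint(EndpointName=endpoint_name,
                                       ContentType='text/csv',
                                       Body=payload)
    # response['Body'] is a streaming object; read() returns bytes and
    # decode('utf-8') turns them into a string.
    return response['Body'].read().decode('utf-8')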
In addition to this, one thing we often see customers do is perform the processing within the model instead of before calling SageMaker's InvokeEndpoint. A common use case is to accept the S3 path of the object you need predictions on when you call InvokeEndpoint. The model is then responsible for downloading the S3 item, transforming it, and running the inference on that data.
The response can work the same way: the model can upload its output to S3 and just send the S3 key back as the response. This might not be what you are looking to do, but it's just an additional example of the flexibility you have when using SageMaker.
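As a rough illustration of that pattern, here is a minimal sketch assuming a SageMaker SKLearn-style inference script (the handler names follow the SageMaker inference toolkit convention; the JSON payload shape, bucket, and result key are hypothetical):
import io
import json
import boto3
import pandas as pd

s3 = boto3.client('s3')

def input_fn(request_body, request_content_type):
    # The caller sends only an S3 pointer, e.g. {"bucket": "...", "key": "..."}
    ref = json.loads(request_body)
    obj = s3.get_object(Bucket=ref['bucket'], Key=ref['key'])
    return pd.read_csv(io.BytesIO(obj['Body'].read()))

def predict_fn(input_data, model):
    # Run inference on the data downloaded from S3
    return model.predict(input_data)

def output_fn(prediction, accept):
    # Write the result back to S3 and return only its key
    body = pd.DataFrame(prediction).to_csv(index=False)
    key = 'predictions/output.csv'  # hypothetical result key
    s3.put_object(Bucket='my-results-bucket', Key=key, Body=body)
    return json.dumps({'result_key': key})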

Related

Reading an HDF5 file from S3 into SageMaker: is the whole file transferred?

I'm reading a file from my S3 bucket in a notebook in SageMaker Studio (same account) using the following code:
import s3fs
import h5py

dataset_path_in_h5 = "/Mode1/SingleFault/SimulationCompleted/IDV2/Mode1_IDVInfo_2_100/Run1/processdata"
s3 = s3fs.S3FileSystem()
h5_file = h5py.File(s3.open(s3url, 'rb'), 'r')
data = h5_file.get(dataset_path_in_h5)
But I don't know what actually happens behind the scenes: is the whole h5 file being transferred? That seems unlikely, as the code executes quite quickly while the whole file is 20 GB. Or is only the dataset at dataset_path_in_h5 transferred?
I suppose that if the whole file were transferred on each call, it could cost me a lot.
When you open the file, a file object is created. It has a tiny memory footprint. The dataset values aren't read into memory until you access them.
You are returning data as a NumPy array. That loads the entire dataset into memory. (NOTE: the .get() method you are using is deprecated. Current syntax is provided in the example.)
As an alternative to returning an array, you can create a dataset object (which also has a small memory footprint). When you do, the data is read into memory as you need it. Dataset objects behave like NumPy arrays. (Whether to use a dataset object or a NumPy array depends on downstream usage; frequently you don't need an array, but sometimes one is required.) Also, if chunked I/O was enabled when the dataset was created, datasets are read in chunks.
The differences are shown below. Note that I used Python's file context manager to open the file; it avoids problems if the file isn't closed properly (you forget, or the program exits prematurely).
dataset_path_in_h5 = "/Mode1/SingleFault/SimulationCompleted/IDV2/Mode1_IDVInfo_2_100/Run1/processdata"
s3 = s3fs.S3FileSystem()
with h5py.File(s3.open(s3url, 'rb'), 'r') as h5_file:
    # your way to get a numpy array -- .get() is deprecated:
    data = h5_file.get(dataset_path_in_h5)
    # this is the preferred syntax to return an array:
    data_arr = h5_file[dataset_path_in_h5][()]
    # this returns a h5py dataset object:
    data_ds = h5_file[dataset_path_in_h5]  # deleted [()]
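To make the "read as you need it" behavior concrete, here is a small sketch (the slices, and the assumption of a 2-D dataset, are hypothetical) showing that indexing a dataset object only transfers the selected region:
with h5py.File(s3.open(s3url, 'rb'), 'r') as h5_file:
    data_ds = h5_file[dataset_path_in_h5]   # no values read yet, only metadata
    print(data_ds.shape, data_ds.dtype)     # cheap: attributes only
    first_rows = data_ds[:1000]             # reads only the first 1000 rows
    one_column = data_ds[:, 0]              # reads only the first column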

AWS lambda 10x slower running pandas preprocessing compared to endpoint

I have been experimenting with reducing prediction request latency. On the one hand, I have a SageMaker inference pipeline, which has a single endpoint with a preprocessing container and a model container. The preprocessing container runs a script extracting some date features and numerical features using pandas.
I've also tested creating a Lambda deployment package with pandas, following this post. Here I would do the feature extraction inside the Lambda, and then call the model endpoint using the response from the Lambda.
I noticed a big difference in response time, and when I looked closer I found that the pandas operations take 10x longer in the Lambda.
Here's an example of a feature extraction function that takes 5x longer, but some take over 10x longer (one function goes from 30 ms to 380 ms).
import numpy as np
import pandas as pd

def extract_date_features(df):
    print('Getting date features.')
    df['date'] = pd.to_datetime(df.date)
    df['weekday'] = df.date.dt.weekday
    df['year'] = df.date.dt.year
    df['month'] = df.date.dt.month
    df['day'] = df.date.dt.day
    df['dayofyear'] = df.date.dt.dayofyear
    #df['week'] = df.date.dt.isocalendar().week.apply(int)
    df['dayofweek'] = df.date.dt.dayofweek
    df['is_weekend'] = np.where(df.date.dt.dayofweek.isin([5, 6]), 1, 0)
    df['quarter'] = df.date.dt.quarter
What would be the reason for this? It is my understanding that the compute provided for a Lambda is all handled by AWS, so there's no way to select a "faster" Lambda, and I'm stuck with this speed if I'm using pandas in Lambda.

Writing preprocessed output CSV to S3 from Scikit Learn image on Sagemaker

My problem: writing out a CSV file to S3 from inside a SageMaker SKLearn image. I know how to write CSVs to S3 from a notebook, and that is working fine. It's within the Docker image that I'm unable to get it to work.
This is a preprocessing.py script called as an entry_point parameter to the SKLearn estimator. The purpose is to pre-process the data prior to running an inference. It's the first step in my inference pipeline.
Everything is working as expected in my preprocessing script, except outputting the file at the end.
Attempt #1 - this produces a CSV file that has strange binary-looking data at the beginning and end of the file (before the first cell and after the last cell of the CSV). It's almost a valid CSV but not quite. See the image at the end.
from io import BytesIO
import boto3
import joblib

def write_joblib(file, path):
    s3_bucket, s3_key = path.split('/')[2], path.split('/')[3:]
    s3_key = '/'.join(s3_key)
    with BytesIO() as f:
        joblib.dump(file, f)
        f.seek(0)
        boto3.client("s3").upload_fileobj(Bucket=s3_bucket, Key=s3_key, Fileobj=f)

predictors_csv = predictors_df.to_csv(index=False)
write_joblib(predictors_csv, predictors_s3_uri)
Attempt #2 - I used StringIO rather than BytesIO. However, this produced a zero-byte file on S3.
Attempt #3 - I tried boto3.client('s3').put_object(...) but got ClientError: An error occurred (AccessDenied) when calling the PutObject operation: Access Denied
I believe I am almost there with Attempt #1 above. I assume it's an encoding issue. If I can fix the structure of the CSV file to remove the non-text characters at the start, it will be working. A screenshot of the CSV in Notepad++ is below.
Notice the non-text characters at the start of the CSV file below.
I solved this myself. This works within an SKLearn estimator container; I assume it will work inside any built-in fit/transform container for writing a CSV to S3 directly. (The binary-looking bytes in Attempt #1 come from joblib.dump, which writes a pickled binary serialization of the string rather than the raw CSV text.)
The use case is writing out the results of pre-processing for model featurization, vectorization, dimensionality reduction etc. This would occur prior to model inference as the first step in an Inference Pipeline.
import boto3

def write_text_file_to_s3(file_string, path):
    s3_bucket, s3_key = path.split('/')[2], path.split('/')[3:]
    s3_key = '/'.join(s3_key)
    s3 = boto3.resource('s3')
    s3.Object(s3_bucket, s3_key).put(Body=file_string)

predictors_csv = predictors_df.to_csv(index=False, encoding='utf-8-sig')
write_text_file_to_s3(predictors_csv, predictors_s3_uri)
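As a side note, the upload_fileobj route from Attempts #1 and #2 can also be made to work by encoding the CSV string to bytes instead of pickling it with joblib. A rough sketch (the function name is hypothetical; the variables mirror the question):
from io import BytesIO
import boto3

def write_csv_string_to_s3(csv_string, path):
    s3_bucket, s3_key = path.split('/')[2], '/'.join(path.split('/')[3:])
    # upload_fileobj needs a binary file object, so encode the CSV text to bytes
    with BytesIO(csv_string.encode('utf-8')) as f:
        boto3.client('s3').upload_fileobj(f, s3_bucket, s3_key)

write_csv_string_to_s3(predictors_csv, predictors_s3_uri)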

What do you use to access CSV data on S3 and other object storage providers as a PyTorch Dataset?

My dataset is stored as a collection of CSV files in an Amazon Web Services (AWS) Simple Storage Service (S3) bucket. I'd like to train a PyTorch model based on this data but the built-in Dataset classes do not provide native support for object storage services like S3 or Google Cloud Storage (GCS), Azure Blob storage, and such. I checked the PyTorch documentation here https://pytorch.org/docs/stable/data.html# about the available Dataset classes and it comes up short when it comes to public cloud object storage support.
It looks like I have to create my own custom Dataset according to the following instructions: https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class but the effort seems overwhelming: I need to figure out how to download data from the object storage to the local node, parse the CSV files to read them into PyTorch tensors, and then deal with the possibility of running out of disk space since my dataset is hundreds of GBs.
Since PyTorch models are trained using gradient descent and I only need to store a small batch of data (less than 1 GB) in memory at once, is there a custom dataset implementation that can help?
Check out ObjectStorageDataset, which has support for object storage services like S3 and GCS: osds.readthedocs.io/en/latest/gcs.html
You can run
pip install osds
to install it and then point it at your S3 bucket to instantiate the PyTorch Dataset and DataLoader using something like
from osds.utils import ObjectStorageDataset
from torch.utils.data import DataLoader

ds = ObjectStorageDataset(f"gcs://gs://cloud-training-demos/taxifare/large/taxi-train*.csv",
                          storage_options={'anon': False},
                          batch_size=32768,
                          worker=4,
                          eager_load_batches=False)
dl = DataLoader(ds, batch_size=None)
where you use your S3 location path instead of gcs://gs://cloud-training-demos/taxifare/large/taxi-train*.csv. So your glob for S3 would be something like s3://<bucket name>/<object path>/*.csv depending on the bucket and the bucket directory where you store your CSV objects for the dataset.
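For example, a minimal sketch of the S3 variant (the bucket name and prefix are hypothetical; the keyword arguments simply mirror the GCS example above):
from osds.utils import ObjectStorageDataset
from torch.utils.data import DataLoader

ds = ObjectStorageDataset("s3://my-csv-bucket/dataset/*.csv",   # hypothetical bucket/prefix
                          storage_options={'anon': False},
                          batch_size=32768,
                          worker=4,
                          eager_load_batches=False)
dl = DataLoader(ds, batch_size=None)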

Usage of spark.catalog.refreshTable(tablename) in S3

I want to write a CSV file after transforming my Spark data with a function. The Spark DataFrame obtained after the transformation looks good, but when I want to write it to a CSV file, I get an error:
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
But I really don't understand how to use the spark.catalog.refreshTable(tablename) function. I tried to use it between the transformation and the file writing, but it said:
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
So I don't know how to deal with it...
import numpy as np
import pandas as pd
import tensorflow as tf
from pyspark.sql.functions import col, lit, pandas_udf
from pyspark.sql.types import ArrayType, DoubleType
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
from tensorflow.keras.models import load_model

# Create the function to resize the images and extract the features with the MobileNetV2 model
def red_dim(width, height, nChannels, data):
    # Transform image data to a TensorFlow-compatible format
    images = []
    for i in range(height.shape[0]):
        x = np.ndarray(
            shape=(height[i], width[i], nChannels[i]),
            dtype=np.uint8,
            buffer=data[i],
            strides=(width[i] * nChannels[i], nChannels[i], 1))
        images.append(preprocess_input(x))
    # Resize images to the input size expected by the model
    images = np.array(tf.image.resize(images, [IMAGE_SIZE, IMAGE_SIZE]))
    # Load the model
    model = load_model('models')
    # Predict features for images
    preds = model.predict(images).reshape(len(width), 3 * 3 * 1280)
    # Return a pandas Series with the list of features for all images
    return pd.Series(list(preds))

# Turn the function into a pandas UDF
# This lets Spark apply it to the data in chunks
red_dim_udf = pandas_udf(red_dim, returnType=ArrayType(DoubleType()))

# 4 actions:
# apply the UDF defined just before
# cast the array of features to a string so it can be written in a CSV
# select only the data that will be written in the CSV
# write the data -> where the error occurs
results = (df.withColumn("dim_red", red_dim_udf(col("image.width"), col("image.height"),
                                                col("image.nChannels"), col("image.data")))
             .withColumn("dim_red_string", lit(col("dim_red").cast("string")))
             .select("image.origin", "dim_red_string")
             .repartition(5))
results.write.csv(S3dir + '/results' + today)
It's a well-known issue: the underlying source data is getting updated while Spark is processing it.
I would suggest you checkpoint, i.e. move/copy the data to another directory, before applying your transformations.
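As for the refreshTable call itself: spark.catalog.refreshTable expects a table name as a string, not a DataFrame, which is most likely why passing the DataFrame raised the '_get_object_id' AttributeError. A minimal sketch of how it could be used, assuming hypothetical view and path names (this clarifies the call but may not fix the underlying file-change issue):
# Register the DataFrame under a name so it exists in the catalog
results.createOrReplaceTempView("results_view")
# Invalidate cached metadata/data for that table before writing
spark.catalog.refreshTable("results_view")
# For DataFrames built directly from files, refreshing by path is another option
spark.catalog.refreshByPath("s3://my-bucket/images/")  # hypothetical source path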
I think I can close my question, as I found the answer.
If you have this type of error, it can also be because you have a space in the S3 folder names used to build your DataFrame; Spark doesn't handle the space character in the folder name, so it thinks the folder doesn't exist anymore...
But thanks #Constantine for your help!