Writing preprocessed output CSV to S3 from Scikit Learn image on Sagemaker - amazon-s3

My problem: writing out a CSV file to S3 from inside a Sagemaker SKLearn image. I know how to write CSVs to S3 from a notebook - that is working fine. It's within the docker image that I'm unable to get it to work.
This is a preprocessing.py script called as an entry_point parameter to the SKLearn estimator. The purpose is to pre-process the data prior to running an inference. It's the first step in my inference pipeline.
Everything is working as expected in my preprocessing script, except outputting the file at the end.
Attempt #1 - this produces a CSV file that has strange binary-looking data at the beginning and end of the file (before the first cell and after the last cell of the CSV). It's almost a valid CSV but not quite. See the image at the end.
def write_joblib(file, path):
    s3_bucket, s3_key = path.split('/')[2], path.split('/')[3:]
    s3_key = '/'.join(s3_key)
    with BytesIO() as f:
        joblib.dump(file, f)
        f.seek(0)
        boto3.client("s3").upload_fileobj(Bucket=s3_bucket, Key=s3_key, Fileobj=f)

predictors_csv = predictors_df.to_csv(index = False)
write_joblib(predictors_csv, predictors_s3_uri)
Attempt #2 - I used StringIO rather than BytesIO. However, this produced a zero-byte file on S3.
Attempt #3 - I tried boto3.client('s3').put_object(...) but got ClientError: An error occurred (AccessDenied) when calling the PutObject operation: Access Denied
I believe I am almost there with Attempt #1 above. I assume it's an encoding issue. If I can remove the non-text characters at the start and end of the CSV file, it will work. A screenshot of the CSV in Notepad++ is below.
Notice the non-character text at the start of the CSV file below

I solved this myself. This works within an SKLearn estimator container. I assume it will work inside any built-in fit/transform container for writing a CSV to S3 directly.
The use case is writing out the results of pre-processing for model featurization, vectorization, dimensionality reduction etc. This would occur prior to model inference as the first step in an Inference Pipeline.
def write_text_file_to_s3(file_string, path):
    s3_bucket, s3_key = path.split('/')[2], path.split('/')[3:]
    s3_key = '/'.join(s3_key)
    s3 = boto3.resource('s3')
    s3.Object(s3_bucket, s3_key).put(Body=file_string)

predictors_csv = predictors_df.to_csv(index = False, encoding='utf-8-sig')
write_text_file_to_s3(predictors_csv, predictors_s3_uri)
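The binary-looking bytes in Attempt #1 are consistent with how joblib.dump works: it pickles whatever object it is given, so a CSV string ends up wrapped in pickle framing instead of being written as plain text. Below is a minimal sketch of an alternative that keeps the upload_fileobj call but encodes the text directly. The helper names split_s3_uri, csv_text_to_fileobj and upload_csv_text are hypothetical, the standard-library pickle module stands in for joblib only to illustrate the framing, and the S3 client (for example boto3.client("s3")) is passed in by the caller.

```python
import io
import pickle


def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/key.csv' into ('bucket', 'some/key.csv')."""
    if not s3_uri.startswith("s3://"):
        raise ValueError("expected an s3:// URI")
    bucket, _, key = s3_uri[len("s3://"):].partition("/")
    return bucket, key


def csv_text_to_fileobj(csv_text):
    """Encode CSV text as UTF-8 bytes in a seekable in-memory file."""
    return io.BytesIO(csv_text.encode("utf-8"))


def upload_csv_text(s3_client, csv_text, s3_uri):
    """Upload CSV text as plain UTF-8 bytes using an existing S3 client."""
    bucket, key = split_s3_uri(s3_uri)
    s3_client.upload_fileobj(
        Fileobj=csv_text_to_fileobj(csv_text), Bucket=bucket, Key=key
    )


csv_text = "a,b\n1,2\n"
# Pickling the string adds framing bytes around the text,
# so the pickled form differs from the plain encoded CSV.
print(pickle.dumps(csv_text) == csv_text.encode("utf-8"))  # False
```

With a DataFrame, the call would look like upload_csv_text(boto3.client("s3"), predictors_df.to_csv(index=False), predictors_s3_uri). The container's role still needs write permission on the bucket.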

Related

reading hdf5 file from s3 to sagemaker, is the whole file transferred?

I'm reading a file from my S3 bucket in a notebook in sagemaker studio (same account) using the following code:
dataset_path_in_h5="/Mode1/SingleFault/SimulationCompleted/IDV2/Mode1_IDVInfo_2_100/Run1/processdata"
s3 = s3fs.S3FileSystem()
h5_file = h5py.File(s3.open(s3url,'rb'), 'r')
data = h5_file.get(dataset_path_in_h5)
But I don't know what actually happens behind the scenes. Is the whole h5 file transferred? That seems unlikely, as the code executes quite fast while the whole file is 20GB. Or is only the dataset at dataset_path_in_h5 transferred?
I suppose that if the whole file is transferred at each call it could cost me a lot.
When you open the file, a file object is created. It has a tiny memory footprint. The dataset values aren't read into memory until you access them.
h5_file.get(dataset_path_in_h5) returns an h5py dataset object, not a NumPy array, so that call on its own does not load the dataset values. (NOTE: .get() works like dict.get() and returns None if the path does not exist. Indexing the file with h5_file[path] is the more common syntax and raises a KeyError instead.) The values are loaded when you index the dataset, for example with [()], which reads the entire dataset into memory as a NumPy array.
As an alternative to returning an array, you can keep working with the dataset object (which also has a small memory footprint). When you do, the data is read into memory as you need it. Dataset objects behave like NumPy arrays. (Use of a dataset object vs NumPy array depends on downstream usage. Frequently you don't need an array, but sometimes they are required.) Also, if chunked I/O was enabled when the dataset was created, datasets are read in chunks.
Differences shown below. Note, I used Python's file context manager to open the file. It avoids problems if the file isn't closed properly (you forget or the program exits prematurely).
import h5py
import s3fs

# s3url is the S3 URL of the .h5 file, defined elsewhere
dataset_path_in_h5="/Mode1/SingleFault/SimulationCompleted/IDV2/Mode1_IDVInfo_2_100/Run1/processdata"
s3 = s3fs.S3FileSystem()
with h5py.File(s3.open(s3url,'rb'), 'r') as h5_file:
    # your way -- .get() returns a h5py dataset object (or None if the path is missing):
    data = h5_file.get(dataset_path_in_h5)
    # this is the preferred syntax to return an array:
    data_arr = h5_file[dataset_path_in_h5][()]
    # this returns a h5py dataset object:
    data_ds = h5_file[dataset_path_in_h5] # deleted [()]
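One way to check how much data a reader actually requests is to wrap the file object in a small counter before handing it to the library. The sketch below is only an illustration: CountingReader is a hypothetical helper, and an in-memory buffer stands in for the remote file. It counts bytes requested through read() and readinto(). With s3fs the number of bytes transferred over the network may be somewhat larger, because the file object can read ahead in blocks.

```python
import io


class CountingReader:
    """Hypothetical wrapper that counts bytes read from a binary file object."""

    def __init__(self, raw):
        self._raw = raw
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = self._raw.read(size)
        self.bytes_read += len(chunk)
        return chunk

    def readinto(self, buffer):
        n = self._raw.readinto(buffer)
        self.bytes_read += n or 0
        return n

    def __getattr__(self, name):
        # Delegate seek(), tell(), close(), etc. to the wrapped object.
        return getattr(self._raw, name)


# Stand-in for a large remote file: 1 MiB of zero bytes.
source = CountingReader(io.BytesIO(bytes(1024 * 1024)))
source.seek(4096)
block = source.read(512)
print(source.bytes_read)  # 512
```

In the real case, the wrapped object would be s3.open(s3url, 'rb') and the wrapper would be passed to h5py.File in its place, assuming h5py only needs the file-like methods that the wrapper forwards.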

Can I read a trained linear model from s3, without reconstructing a local copy?

In order to run a Dask pipeline on a Coiled cluster that uses a previously trained linear model in each task, I believe I need to read the model directly from S3. Reading the model within a task worked on a local cluster, but unfortunately it did not work when I ran the code on a Coiled cluster.
First I saved the model to S3 using:
def save_model_s3(
    model: linear_model,
    filename: str,
    path: str,  #'name/product=models/model/'
):
    s3 = boto3.resource("s3")
    location = path
    model_filename = f"{filename}.pkl"  # .pkl or .sav
    OutputFile = location + model_filename
    # WRITE
    with tempfile.TemporaryFile() as fp:
        joblib.dump(model, fp)
        fp.seek(0)
        # use bucket_name and OutputFile - s3 location path in string format.
        s3.Bucket("bucket-name").put_object(Key=OutputFile, Body=fp.read())
    return None
Within the pipeline-tasks I then try to read the .sav/.pkl file. The only thing that worked for me so far is to read the file using the following code, however this doesn't seem to be working within a task on a coiled cluster.
def read_joblib(path):
    '''
    Function to load a joblib file from an s3 bucket.
    Arguments:
    * path: an s3 bucket or local directory path where the file is stored
    Outputs:
    * file: Joblib file loaded
    '''
    # Path is an s3 bucket
    s3_bucket, s3_key = path.split('/')[2], path.split('/')[3:]
    s3_key = '/'.join(s3_key)
    with BytesIO() as f:
        boto3.client("s3").download_fileobj(Bucket=s3_bucket, Key=s3_key, Fileobj=f)
        f.seek(0)
        file = joblib.load(f)
    return file
Is there a smart way to do this in a different fashion, so that there is no need to reconstruct a local copy?
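Note that read_joblib above already keeps everything in memory: BytesIO is an in-memory buffer, so nothing is written to local disk. A slightly shorter sketch of the same idea, serializing straight to and from bytes, is shown below. It rests on several assumptions: pickle stands in for joblib, s3_client is a boto3 S3 client created by the caller, and the helper names save_model_bytes and load_model_bytes are hypothetical. It does not address why the original code fails on the cluster, which the thread leaves open.

```python
import pickle


def save_model_bytes(s3_client, model, bucket, key):
    """Serialize the model in memory and upload the bytes as one object."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=pickle.dumps(model))


def load_model_bytes(s3_client, bucket, key):
    """Download the object body into memory and deserialize it."""
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    return pickle.loads(body)
```

Creating the client inside the task (rather than capturing one from the driver process) keeps the function self-contained when it is shipped to a worker.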

Usage of spark.catalog.refreshTable(tablename) in S3

I want to write a CSV file after transforming my Spark data with a function. The Spark dataframe obtained after the transformation looks fine, but when I try to write it to a CSV file, I get an error:
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
But I really don't understand how to use the spark.catalog.refreshTable(tablename) function. I tried to use it between the transformation and the file writing, but it raised:
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
So I don't know how to deal with it...
#Create the function to resize the images and extract the features with mobilenetV2 model
def red_dim(width, height, nChannels, data):
    #Transform image data to tensorflow compatible format
    images = []
    for i in range(height.shape[0]):
        x = np.ndarray(
            shape=(height[i], width[i], nChannels[i]),
            dtype=np.uint8,
            buffer=data[i],
            strides=(width[i] * nChannels[i], nChannels[i], 1))
        images.append(preprocess_input(x))
    #Resize images with the chosen size of the model
    images = np.array(tf.image.resize(images, [IMAGE_SIZE, IMAGE_SIZE]))
    #Load the model
    model = load_model('models')
    #Predict features for images
    preds = model.predict(images).reshape(len(width), 3 * 3 * 1280)
    #Return a pandas series with list of features for all images
    return pd.Series(list(preds))

#Transform the function to a pandas udf function
#This allows the function to be applied in multiple chunks
red_dim_udf = pandas_udf(red_dim, returnType=ArrayType(DoubleType()))

#4 actions :
# apply the udf function defined just before
# cast the array of features to a string so it can be written in a csv
# select only the data that will be written in the csv
# write the data -> where the error occurs
results=df.withColumn("dim_red", red_dim_udf(col("image.width"), col("image.height"), \
                                             col("image.nChannels"), \
                                             col("image.data"))) \
    .withColumn("dim_red_string", lit(col("dim_red").cast("string"))) \
    .select("image.origin", 'dim_red_string') \
    .repartition(5).write.csv(S3dir + '/results' + today)
This is a well-known issue: the underlying source data is being updated while Spark is processing it.
I would suggest checkpointing, i.e. moving or copying the data to another directory before applying your transformations.
I think I can close my question, as I found the answer.
This type of error can also occur when the S3 folders used to build your DataFrame contain a space: Spark doesn't handle the space character in the folder name, so it thinks the folder no longer exists.
But thanks @Constantine for your help!
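On the AttributeError from the question: spark.catalog.refreshTable expects the name of a table or view as a string, so passing a DataFrame object is the likely cause of the '_get_object_id' message. A minimal sketch of calling it by name is below. It assumes spark is an existing SparkSession and df a Spark DataFrame. The function name refresh_by_view_name and the view name are hypothetical, and whether a refresh resolves the original error depends on its actual cause.

```python
def refresh_by_view_name(spark, df, view_name):
    """Register df under view_name, then refresh it by that name.

    spark is an existing SparkSession and df a Spark DataFrame; view_name
    is a hypothetical name chosen for illustration.
    """
    df.createOrReplaceTempView(view_name)
    # refreshTable takes the table or view name as a string, not a DataFrame.
    spark.catalog.refreshTable(view_name)
    return spark.table(view_name)
```

Usage would look like df = refresh_by_view_name(spark, df, "images_view"), placed before the transformation and write steps.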

How to make prediction with sagemaker on pandas dataframe

I am using Sagemaker to train and deploy my machine learning model. Prediction will be executed by a Lambda function as a scheduled job (every hour). The process is as follows:
pull new data from S3 since last prediction
preprocess, aggregate and create prediction data set
call sagemaker endpoint and make prediction
either save result to s3 or insert to database table
Based on what I have found, the input typically comes either from the Lambda payload
data = json.loads(json.dumps(event))
payload = data['data']
print(payload)
response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
ContentType='text/csv',
Body=payload)
or read from s3 file:
my_bucket = 'pred_data' #substitute this for your s3 bucket name.
obj = client.get_object(Bucket=my_bucket, Key='foo.csv')
lines = obj['Body'].read().decode('utf-8').splitlines()
reader = csv.reader(lines)
file = io.StringIO('\n'.join(lines))
response = runtime.invoke_endpoint(EndpointName=ENDPOINT,
                                   ContentType='*/*',
                                   Body=file.getvalue())
output = response['Body'].read().decode('utf-8')
Since I will be pulling raw data from s3 and preprocess, a pandas dataframe will be generated. Is it possible to feed this directly as the input of invoke_endpoint? I could upload the aggregated dataset to another S3 bucket, but does it have to go through the decoding, csv.reader, StringIO and all that just like the example I found or is there an easy way to do it? Is the decode step really necessary to get the output?
You can send whatever payload you want when you call InvokeEndpoint and in whatever format. You can control the contract on either side (assuming your model supports it). If you are using a model that you didn't create, look to see if it supports pre/post processing which would allow you to define the contract yourself.
In addition to this, one thing we often see customers do is to do processing within the model instead of before calling SageMaker's InvokeEndpoint. A common use case is to accept the S3 path of the object you need to do predictions on when you call InvokeEndpoint. Then the model would be responsible for downloading the S3 item and transforming it and then running the inference on that data.
Depending on the InvokeEndpoint response, it can do the same and the model can upload it to S3 and just send the S3 key back as a response. This might not be what you are looking to do but it's just an additional example of the flexibility you have when using SageMaker.
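As a concrete illustration of sending a DataFrame directly, the sketch below serializes it to CSV text in memory and passes that string as the request body, with no intermediate S3 file, csv.reader or StringIO. It assumes the deployed model accepts headerless text/csv input. The function name predict_from_dataframe is hypothetical, and runtime is a boto3 "sagemaker-runtime" client created by the caller.

```python
def predict_from_dataframe(runtime, endpoint_name, df):
    """Send a pandas DataFrame to a SageMaker endpoint as CSV text."""
    # Assumption: the model expects CSV rows without header or index column.
    payload = df.to_csv(header=False, index=False)
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=payload,
    )
    # The response body is a stream of bytes, so it is read and decoded
    # to obtain the prediction as text.
    return response["Body"].read().decode("utf-8")
```

The decode step is needed only because the response body arrives as bytes. How the decoded text is parsed depends on what the model returns.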

Use tf.TextLineReader to read to a np.array in TensorFlow

I need to read a file in my train module into a np.array (I want to use the array as label_keys in a DNNClassifier).
I tried tf.read_file and tf.TextLineReader() but I can't get them to just output the rows to a np.array.
Is it possible?
(Why not just read a file with open? I'm training in GCS and want to get the file from storage :)
To access a file from GCS using TensorFlow, you can use the Python tf.gfile.GFile API, which acts like a regular Python file object, but allows you to use TensorFlow's filesystem connectors:
with tf.gfile.GFile("gs://...") as f:
    file_contents = f.read()
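To get from the file contents to the rows, one option is to split the text into lines. The sketch below takes the file opener as a parameter so that tf.gfile.GFile can be passed in for GCS paths. The helper name read_label_keys is hypothetical, and an in-memory io.StringIO stands in for the remote file in the demonstration.

```python
import io


def read_label_keys(open_file, path):
    """Read a text file and return its non-empty lines as a list of strings.

    open_file is any callable that returns a readable text file object,
    for example tf.gfile.GFile for gs:// paths.
    """
    with open_file(path) as f:
        file_contents = f.read()
    return [line.strip() for line in file_contents.splitlines() if line.strip()]


# Demonstration with an in-memory stand-in for the remote file.
labels = read_label_keys(lambda path: io.StringIO("cat\ndog\n\n"), "gs://...")
print(labels)  # ['cat', 'dog']
```

If an array is required, the returned list can be wrapped with np.array(labels).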