how can I preprocess input data before making predictions in sagemaker? - aws-java-sdk

I am calling a SageMaker endpoint using the Java SageMaker SDK. The data that I am sending needs a little cleaning before the model can use it for prediction. How can I do that in SageMaker?
I have a pre-processing function in the Jupyter notebook instance which cleans the training data before passing it to train the model. Now I want to know whether I can use that function while calling the endpoint, or whether that function is already being used.
I can show my code if anyone wants.
EDIT 1
Basically, in the pre-processing I am doing label encoding. Here is my preprocessing function:
from sklearn import preprocessing

def preprocess_data(data):
    print("entering preprocess fn")
    # convert document id & type to labels
    le1 = preprocessing.LabelEncoder()
    le1.fit(data["documentId"])
    data["documentId"] = le1.transform(data["documentId"])
    le2 = preprocessing.LabelEncoder()
    le2.fit(data["documentType"])
    data["documentType"] = le2.transform(data["documentType"])
    print("exiting preprocess fn")
    return data, le1, le2
Here 'data' is a pandas DataFrame.
Now I want to use these le1 and le2 at the time of calling the endpoint. I want to do this preprocessing in SageMaker itself, not in my Java code.

There is now a new feature in SageMaker, called inference pipelines. This lets you build a linear sequence of two to five containers that pre/post-process requests. The whole pipeline is then deployed on a single endpoint.
https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html
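For this use case, the first container could run the same scikit-learn label encoding and the second container could host the trained model. A rough sketch with the SageMaker Python SDK (the model objects, names and instance type below are placeholders, not taken from the question):

from sagemaker.pipeline import PipelineModel

# 'sklearn_preprocess_model' and 'trained_model' are assumed to be Model objects
# you have already created; both names are hypothetical.
pipeline_model = PipelineModel(
    name='preprocess-then-predict',
    role=role,
    models=[sklearn_preprocess_model,   # container 1: applies the label encoding
            trained_model])             # container 2: runs the actual prediction

# A single endpoint now runs both containers in sequence for every InvokeEndpoint
# call, so the Java client can keep sending raw, un-encoded records.
pipeline_model.deploy(initial_instance_count=1,
                      instance_type='ml.m4.xlarge',
                      endpoint_name='preprocess-then-predict')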

You need to write a script and supply it while creating your model. That script would have an input_fn where you can do your preprocessing.
Please refer to the AWS docs for more details.
https://docs.aws.amazon.com/sagemaker/latest/dg/mxnet-training-inference-code-template.html
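As a rough illustration (not the exact container contract, which varies by framework and container version), an input_fn could re-apply the label encoders, assuming they were pickled alongside the model artifacts as le1.pkl and le2.pkl (hypothetical file names):

import json
import pickle

import pandas as pd

def input_fn(request_body, request_content_type):
    """Deserialize the request and apply the same label encoding used during training."""
    if request_content_type != 'application/json':
        raise ValueError('Unsupported content type: {}'.format(request_content_type))
    data = pd.DataFrame(json.loads(request_body))
    # The encoders are assumed to be part of the model artifact, which SageMaker
    # extracts to /opt/ml/model inside the serving container.
    with open('/opt/ml/model/le1.pkl', 'rb') as f:
        le1 = pickle.load(f)
    with open('/opt/ml/model/le2.pkl', 'rb') as f:
        le2 = pickle.load(f)
    data['documentId'] = le1.transform(data['documentId'])
    data['documentType'] = le2.transform(data['documentType'])
    return data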

One option is to put your pre-processing code in an AWS Lambda function and have that Lambda call the InvokeEndpoint API of SageMaker once the pre-processing is done. AWS Lambda supports Python, so it should be easy to reuse the code from your Jupyter notebook within that Lambda function. You can also use that Lambda to call external services such as DynamoDB for lookups and data enrichment.
You can find more information in the SageMaker documentations: https://docs.aws.amazon.com/sagemaker/latest/dg/getting-started-client-app.html
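A minimal sketch of such a handler, assuming the request arrives as JSON through an API Gateway proxy event; the endpoint name, payload format and the preprocessing step are placeholders to adapt:

import json

import boto3

runtime = boto3.client('sagemaker-runtime')

def lambda_handler(event, context):
    record = json.loads(event['body'])   # assumes an API Gateway proxy integration
    # ... apply the same preprocessing as in training here (e.g. label encoding) ...
    payload = ','.join(str(v) for v in record.values())   # CSV payload; adjust to your model
    response = runtime.invoke_endpoint(EndpointName='my-endpoint',   # placeholder name
                                       ContentType='text/csv',
                                       Body=payload)
    return {'statusCode': 200,
            'body': response['Body'].read().decode('utf-8')}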

The SageMaker MXNet container is open source.
You can add pandas to the Docker container here: https://github.com/aws/sagemaker-mxnet-containers/blob/master/docker/1.1.0/Dockerfile.gpu#L4
The repo also has instructions on how to build the container: https://github.com/aws/sagemaker-mxnet-containers#building-your-image

Related

Can't use Keras CSVLogger callbacks in SageMaker script mode. It fails to write the log file on S3 (error: No such file or directory)

I have a script where I want to write the callbacks to a separate CSV file in a SageMaker custom-script Docker container. But when I try to run it in local mode, it fails with the "No such file or directory" error mentioned above. I have a hyper-parameter tuning job (HPO) to run, and this keeps giving me errors. I need to get this local mode run working correctly before doing the HPO.
In the notebook I use the following code.
from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(entry_point='lstm_model.py',
                          role=role,
                          code_location=custom_code_upload_location,
                          output_path=model_artifact_location + '/',
                          train_instance_count=1,
                          train_instance_type='local',
                          framework_version='1.12',
                          py_version='py3',
                          script_mode=True,
                          hyperparameters={'epochs': 1},
                          base_job_name='hpo-lstm-local-test')

tf_estimator.fit({'training': training_input_path, 'validation': validation_input_path})
In my lstm_model.py script the following code is used.
lgdir = os.path.join(model_dir, 'callbacks_log.csv')
csv_logger = CSVLogger(lgdir, append=True)

regressor.fit(x_train, y_train, batch_size=batch_size,
              validation_data=(x_val, y_val),
              epochs=epochs,
              verbose=2,
              callbacks=[csv_logger])
I tried creating the file beforehand, as shown below, using the TensorFlow backend, but it doesn't create a file. (K: TensorFlow backend, tf: TensorFlow)
filename = tf.Variable(lgdir, tf.string)
content = tf.Variable("", tf.string)
sess = K.get_session()
tf.io.write_file(filename, content)
I can't use other packages like pandas to create the file, as the TensorFlow Docker container SageMaker uses for custom scripts doesn't provide them; only a limited set of packages is available.
Is there a way I can write the CSV file to the S3 bucket location before the fit method tries to write the callback? Or is that even the solution to the problem? I am not sure.
If you can suggest other ways to get the callbacks, I would accept that answer as well, but it should be worth the effort.
This Docker image is really narrowing the scope.
Well, for starters, you can always make your own Docker image using the TensorFlow image as a base. I work in TensorFlow 2.0, so this will be slightly different for you, but here is an example of my image pattern:
# Downloads the TensorFlow library used to run the Python script
# (use the equivalent tag for your TF version)
FROM tensorflow/tensorflow:2.0.0a0

# Contains the common functionality necessary to create a container compatible with Amazon SageMaker
RUN pip install sagemaker-containers -q

# Wandb allows us to customize and centralize logging while maintaining open-source agility
# (here you would install pandas instead)
RUN pip install wandb -q

# Copies the training code inside the container to the design pattern created by the Tensorflow estimator
# (here you could also copy over a callbacks csv)
COPY mnist-2.py /opt/ml/code/mnist-2.py
COPY callbacks.py /opt/ml/code/callbacks.py
COPY wandb_setup.sh /opt/ml/code/wandb_setup.sh

# Set the login script as the entry point
# (here you would instead launch lstm_model.py)
ENV SAGEMAKER_PROGRAM wandb_setup.sh
I believe you are looking for a pattern similar to this, but I prefer to log all of my model data using Weights and Biases. They're a little out of date on their SageMaker integration, but I'm actually in the midst of writing an updated tutorial for them. It should be finished this month and will include logging and comparing runs from hyperparameter tuning jobs.
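Once you have built and pushed such an image to ECR, you can point a generic SageMaker estimator at it instead of the built-in TensorFlow container. A rough sketch, using the v1 Python SDK parameter names seen in the question; the image URI and instance type are placeholders:

from sagemaker.estimator import Estimator

custom_estimator = Estimator(
    image_name='<account>.dkr.ecr.<region>.amazonaws.com/my-tf-pandas:latest',  # placeholder ECR URI
    role=role,
    train_instance_count=1,
    train_instance_type='ml.p2.xlarge',
    hyperparameters={'epochs': 1})

custom_estimator.fit({'training': training_input_path})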

Using Estimator.evaluate() on a trained SageMaker TensorFlow model

After I've trained and deployed the model with AWS SageMaker, I want to evaluate it on several csv files:
- category-1-eval.csv (~700000 records)
- category-2-eval.csv (~500000 records)
- category-3-eval.csv (~800000 records)
...
The right way to do this is to use the Estimator.evaluate() method, as it is fast.
The problem is that I cannot find a way to restore the SageMaker model into a TensorFlow Estimator. Is that possible?
I've tried to restore a model like this:
tf.estimator.DNNClassifier(
    feature_columns=...,
    hidden_units=[...],
    model_dir="s3://<bucket_name>/checkpoints",
)
In the AWS SageMaker documentation a different approach is described - testing the actual endpoint from the notebook - but it takes too much time and requires a lot of API calls to the endpoint.
If you used the built-in TensorFlow container, your model has been saved in TensorFlow Serving format, e.g.:
$ tar tfz model.tar.gz
model/
model/1/
model/1/saved_model.pb
model/1/variables/
model/1/variables/variables.index
model/1/variables/variables.data-00000-of-00001
You can easily load it with TensorFlow Serving on your local machine and send it samples to predict. More info at https://www.tensorflow.org/tfx/guide/serving
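As a quicker local check than running a full TensorFlow Serving instance, you could also load the extracted SavedModel directly in Python. A sketch assuming TensorFlow 1.x; the input key 'inputs' and the sample values are placeholders, so inspect your signature first (e.g. with saved_model_cli show --dir model/1 --all):

import tensorflow as tf

# Load the SavedModel that was extracted from model.tar.gz.
predictor = tf.contrib.predictor.from_saved_model('model/1')

# Feed a batch of samples; the dict key and shapes depend on your model's signature.
predictions = predictor({'inputs': [[0.1, 0.2, 0.3]]})
print(predictions)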

Tensorflow Keras GCP ML engine model serving

I'm working on an image classifier with TensorFlow Estimator + Keras, retraining the last layer of a pretrained inception_v3 application on GCP ML Engine.
The Keras model is exported with tf.keras.estimator.model_to_estimator. The input function receives the path of the image stored on GCP Cloud Storage, opens the image with tf.image.decode_jpeg, and returns a dataset in the format dict(zip(['inception_v3_input'], [image])), label.
I'm trying to define the tf.estimator.export.ServingInputReceiver, but I'm having some trouble defining it.
The model serves predictions correctly with the predict method, using the input function without the labels.
My idea was to reuse the input function to decode the image, passing only the path of the image on Cloud Storage to the prediction for the Google endpoint as well, but I can't figure out how to do it.
Thanks for your help.
If I'm understanding correctly, your question is how to get the file from Cloud Storage, considering that you want to decode the image this way:
image_decoded = tf.image.decode_jpeg(image_string)
So, in this case, you can use:
image_string = file_io.FileIO(filename, mode='rb').read()
By importing file_io first:
from tensorflow.python.lib.io import file_io
According to the comments on this question about reading input data from GCS, this should give the same results as the read_file function, since "there was a bunch of work done to abstract file io and file systems, so there all the io functionality works consistently". So you can also try read_file.
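Putting the two snippets together, a rough sketch (assuming TensorFlow 1.x and that the path is a gs:// URI; the bucket and file name are placeholders):

import tensorflow as tf
from tensorflow.python.lib.io import file_io

filename = 'gs://my-bucket/images/example.jpg'   # placeholder path

# Read the raw JPEG bytes from Cloud Storage (binary mode) and decode them.
image_string = file_io.FileIO(filename, mode='rb').read()
image_decoded = tf.image.decode_jpeg(image_string)

# Alternatively, keep the read inside the graph with tf.read_file:
image_decoded_in_graph = tf.image.decode_jpeg(tf.read_file(filename))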

Hyperparameter tuning for TensorFlow with hyper-engine

I found the hyper-engine Python tool at
https://github.com/maxim5/hyper-engine.
The examples only use MNIST:
https://github.com/maxim5/hyper-engine/tree/master/hyperengine/examples.
How can I feed my own data, as in this example below:
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/5_DataManagement/build_an_image_dataset.ipynb
HyperEngine supports custom data providers; the closest example is this one: it generates word pairs from text, not images, but the API is more or less clear. Basically, you only need to implement the next_batch method:
def next_batch(self, batch_size):
    pass
So if you want to train your network on a set of images on disk, you simply need to write an iterator over the files and yield numpy arrays when next_batch is called.
But there is a but. Currently, HyperEngine accepts only numpy arrays from next_batch. The example you refer to works with the TF queue API, and its read_images function produces tensors, so you can't simply copy that code. Hopefully, there will be better support for the various TensorFlow APIs in the future, including estimators, the dataset API, queues, etc.
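A rough illustration of the idea (not the exact HyperEngine interface; the class name, directory layout and the use of Pillow for image loading are assumptions):

import os

import numpy as np
from PIL import Image

class ImageFolderProvider:
    """Walks a flat directory of images and serves numpy batches via next_batch."""

    def __init__(self, image_dir, image_size=(64, 64)):
        self.paths = sorted(os.path.join(image_dir, f) for f in os.listdir(image_dir))
        self.image_size = image_size
        self._cursor = 0

    def next_batch(self, batch_size):
        if self._cursor + batch_size > len(self.paths):
            self._cursor = 0                      # simple wrap-around, no shuffling
        batch_paths = self.paths[self._cursor:self._cursor + batch_size]
        self._cursor += batch_size
        images = [np.asarray(Image.open(p).resize(self.image_size), dtype=np.float32) / 255.0
                  for p in batch_paths]
        return np.stack(images)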

TensorFlow input pipeline for deployment on CloudML

I'm relatively new to TensorFlow and I'm having trouble modifying some of the examples to use batch/stream processing with input functions. More specifically, what is the 'best' way to modify this script to make it suitable for training and serving deployment on Google Cloud ML?
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/text_classification.py
Something akin to this example:
https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census/estimator/trainer
I can package it up and train it in the cloud, but I can't figure out how to apply even the simple vocab_processor transformations to an input tensor. I know how to do it with pandas, but there I can't apply the transformation to batches (using the chunk_size parameter). I would be very happy if I could reuse my pandas preprocessing pipelines in TensorFlow.
I think you have 3 options:
1) You cannot reuse pandas preprocessing pipelines in TF. However, you could start TF from the output of your pandas preprocessing. So you could build a vocab and convert the text words to integers with pandas, and save the new preprocessed dataset to disk. Then read the integer data (which encodes your text) in TF to do training.
2) You could build a vocab outside of TF in pandas. Then inside TF, after reading the words, you can make a table to map the text to integers. But if you are going to build a vocab outside of TF, you might as well do the transformation at the same time outside of TF, which is option 1.
3) Use tensorflow_transform. You can call tft.string_to_int() on the text column to automatically build the vocab and convert to integers. The output of tensorflow_transform is preprocessed data in tf.example format. Then training can start from the tf.example files. This is again option 1 but with tf.example files. If you want to run prediction on raw text data, this option allows you to make an exported graph that has the same text preprocessing built in, so you don't have to manage the preprocessing step at prediction time. However, this option is the most complicated as it introduces two additional ideas: tf.example files and beam pipelines.
For examples of tensorflow_transform see https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/criteo_tft
and
https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/reddit_tft
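For a sense of what option 3 looks like, here is a minimal preprocessing_fn sketch, assuming a tensorflow_transform version of that era where tft.string_to_int is available; the column names are placeholders, and 'text' is assumed to already be split into word tokens:

import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Builds a vocabulary over the text column and maps each word to an integer id."""
    return {
        'text_ids': tft.string_to_int(inputs['text']),   # vocab is built from the data itself
        'label': inputs['label'],
    }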