Feature Importance for XGBoost in SageMaker

I have built an XGBoost model using Amazon SageMaker, but I was unable to find anything that would help me interpret the model and validate whether it has learned the right dependencies.
Generally, we can get feature importance for XGBoost through the get_fscore() function in the Python API (https://xgboost.readthedocs.io/en/latest/python/python_api.html), but I see nothing of that sort in the SageMaker API (https://sagemaker.readthedocs.io/en/stable/estimators.html).
I know I can build my own model and then deploy it using SageMaker, but I am curious whether anyone has faced this problem and how they overcame it.
Thanks.

As of 2019-06-17, the SageMaker XGBoost model is stored on S3 as an archive named model.tar.gz. This archive consists of a single pickled model file named xgboost-model.
To load the model directly from S3 without downloading it first, you can use the following code:
import s3fs
import pickle
import tarfile
import xgboost

model_path = 's3://<bucket>/<path_to_model_dir>/xgboost-2019-06-16-09-56-39-854/output/model.tar.gz'

fs = s3fs.S3FileSystem()
with fs.open(model_path, 'rb') as f:
    with tarfile.open(fileobj=f, mode='r') as tar_f:
        with tar_f.extractfile('xgboost-model') as extracted_f:
            xgbooster = pickle.load(extracted_f)

xgbooster.get_fscore()

SageMaker XGBoost currently does not provide an interface to retrieve feature importance from the model. You can write some code to get the feature importance from the XGBoost model yourself: download the booster artifact from the model in S3 and then use the following snippet.
import pickle as pkl
import xgboost

# model_file is the path to the unpacked xgboost-model artifact from model.tar.gz
booster = pkl.load(open(model_file, 'rb'))
booster.get_score()
booster.get_fscore()
Refer to the XGBoost documentation for methods to get feature importance from the Booster object, such as get_score() or get_fscore().
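For example, a minimal sketch of ranking features by importance, assuming booster was loaded as in the snippet above (feature names default to f0, f1, ... unless the model was trained with named features):

importance = booster.get_score(importance_type='gain')  # also 'weight' or 'cover'
for feature, score in sorted(importance.items(), key=lambda kv: kv[1], reverse=True):
    print(feature, score)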

Although you can write a custom script as rajesh and Lukas suggested and use XGBoost as a framework to run the script (see How to Use Amazon SageMaker XGBoost for how to use "script mode"), SageMaker has recently launched SageMaker Debugger, which allows you to retrieve feature importance from XGBoost in real time.
This notebook demonstrates how to use SageMaker Debugger to retrieve feature importance.
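As a rough sketch of what reading the Debugger output can look like, assuming the training job was configured to save the built-in feature_importance collection for XGBoost (the S3 path is a placeholder and the exact tensor names depend on your setup):

from smdebug.trials import create_trial

# Point the trial at the Debugger output location of the training job.
trial = create_trial('s3://<bucket>/<debugger-output-path>')

# Tensor names look like feature_importance/weight/f0, feature_importance/gain/f1, ...
for name in trial.tensor_names(regex='feature_importance'):
    print(name, trial.tensor(name).value(trial.steps()[-1]))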

Related

ExampleGen on production

I was wondering how ExampleGen is used in production. I understand that its outputs can be fed into the TFDV components of TFX to validate the schema, skew, and so on.
But I get lost because ExampleGen generates a train & eval split, and I don't see why you would split the data into train & eval in production.
As far as I know, TFX is more suited to deploying models into production; if I'm going to build a non-production model, maybe plain TensorFlow would work.
So my questions are:
Is TFX used for the modeling/dev part, i.e. before deploying your model?
Is it suitable to develop a model in Tensorflow and then migrate it to TFX for the production part?
Thanks!
The ExampleGen TFX Pipeline component ingests data into TFX pipelines.
In simple words, ExampleGen fetches data from external sources such as CSV, TFRecord, Avro, Parquet and BigQuery and generates tf.Example and tf.SequenceExample records, which can be read by other TFX components. For more information, please refer to The ExampleGen TFX Pipeline Component.
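As a minimal sketch of ingesting CSV data with ExampleGen (the path is a placeholder, and the exact import and argument name can vary between TFX versions):

from tfx.components import CsvExampleGen

# Ingest CSV files from a directory; the component emits train/eval splits of tf.Example records.
example_gen = CsvExampleGen(input_base='/path/to/csv_data')

# Downstream components (StatisticsGen, SchemaGen, Trainer, ...) consume example_gen.outputs['examples'].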
Is TFX used for the modeling/dev part, i.e. before deploying your model?
Yes, TFX can be used for modeling, training, serving inference, and managing deployments to online, native mobile, and JavaScript targets. Once your model is trained with TFX, you can deploy it using TF Serving and other deployment targets.
Is it suitable to develop a model in Tensorflow and then migrate it to TFX for the production part?
Once you have developed and trained a model using a TFX pipeline, you can deploy it with the TF Serving system. You can also serve plain TensorFlow models with TF Serving. Please refer to Serving a TensorFlow Model.
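Once a model is running under TF Serving, inference is a plain HTTP call. A minimal sketch, assuming a model named my_model is served on the default REST port 8501 (the model name, host, and input values are placeholders):

import json
import requests

# TF Serving REST predict endpoint: /v1/models/<model_name>:predict
payload = json.dumps({'instances': [[1.0, 2.0, 5.0]]})
response = requests.post('http://localhost:8501/v1/models/my_model:predict', data=payload)
print(response.json())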

How to use tensorflow library with sagemaker preprocessing

I want to use TensorFlow for preprocessing in SageMaker pipelines.
But I haven't been able to find a way to do it.
Right now, I'm using this library for preprocessing:
from sagemaker.sklearn.processing import SKLearnProcessor

framework_version = "0.23-1"
sklearn_processor = SKLearnProcessor(
    framework_version=framework_version,
    instance_type=processing_instance_type,
    instance_count=processing_instance_count,
    base_job_name="abcd",
    role=role,
)
Now I need to use TensorFlow in preprocessing, but the Python script can't import TensorFlow.
Any help would be much appreciated. Thanks.
There's no TensorFlow Processor module; the list of available processing modules is in the following doc and can be seen at the bottom of the page.
You can use the Processor or ScriptProcessor class and pass in your TF preprocessing as a script, with a requirements.txt for the necessary modules.
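A minimal sketch of that approach (the container image URI, script name, and S3 paths are placeholders; any image with TensorFlow installed that you can pull from ECR would work):

from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor

script_processor = ScriptProcessor(
    image_uri='<ecr-image-with-tensorflow-installed>',
    command=['python3'],
    instance_type=processing_instance_type,
    instance_count=processing_instance_count,
    base_job_name='tf-preprocess',
    role=role,
)

script_processor.run(
    code='preprocess.py',  # your TensorFlow preprocessing script
    inputs=[ProcessingInput(source='s3://<bucket>/raw', destination='/opt/ml/processing/input')],
    outputs=[ProcessingOutput(source='/opt/ml/processing/output', destination='s3://<bucket>/processed')],
)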
I work for AWS and my opinions are my own.

How can I convert the model I trained with Tensorflow (python) for use with TensorflowJS without involving IBM cloud (from the step I'm at now)?

What I'm trying to do
I'm trying to learn TensorFlow object recognition and as usual with new things, I scoured the web for tutorials. I don't want to involve any third party cloud service or web development framework, I want to learn to do it with just native JavaScript, Python, and the TensorFlow library.
What I have so far
So far, I've followed a TensorFlow object detection tutorial (accompanied by a 5+ hour video) to the point where I've trained a model in Tensorflow (python) and want to convert it to run in a browser via TensorflowJS. I've also tried other tutorials and haven't seemed to find one that explains how to do this without a third party cloud / tool and React.
I know in order to use this model with tensorflow.js my goal is to get files like:
group1-shard1of2.bin
group1-shard2of2.bin
labels.json
model.json
I've gotten to the point where I created my TFRecord files and started training:
py Tensorflow\models\research\object_detection\model_main_tf2.py --model_dir=Tensorflow\workspace\models\my_ssd_mobnet --pipeline_config_path=Tensorflow\workspace\models\my_ssd_mobnet\pipeline.config --num_train_steps=100
It seems after training the model, I'm left with:
files named checkpoint, ckpt-1.data-00000-of-00001, ckpt-1.index, pipeline.config
the pre-trained model (which I believe isn't the file that changes during training, right?) ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8
I'm sure it's not hard to get from this step to the files I need, but I honestly browsed a lot of documentation, tutorials, and Google and didn't see an example of doing it without some third-party cloud service. Maybe it's in the documentation and I'm missing something obvious.
The project directory structure looks like this:
Where I've looked for an answer
For some reason, frustratingly, every single tutorial I've found (including the one linked above) for using a pre-trained Tensorflow model for object detection via TensorFlowJS has required the use of IBM Cloud and ReactJS. Maybe they're all copying from some tutorial they found and now all the tutorials include this, I don't know. What I do know is I'm building an Electron.js desktop app and object detection shouldn't require network connectivity assuming the compute is happening on the user's device. To clarify: I'm creating an app where the user trains the model, so it's not just a matter of one time conversion. I want to be able to train with Python Tensorflow and convert the model to run on JavaScript Tensorflow without any cloud API.
So I stopped looking for tutorials and tried looking directly at the documentation at https://github.com/tensorflow/tfjs.
When you get to the section about importing pre-trained models, it says:
Importing pre-trained models
We support porting pre-trained models from:
TensorFlow SavedModel
Keras
So I followed that link to Tensorflow SavedModel, which brings us to a project called tfjs-converter. That repo says:
This repository has been archived in favor of tensorflow/tfjs. This repo will remain around for some time to keep history but all future PRs should be sent to tensorflow/tfjs inside the tfjs-core folder. All history and contributions have been preserved in the monorepo.
Which sounds a bit like a circular reference to me, considering it's directing me to the page that just told me to go here. So at this point you're wondering well is this whole library deprecated, will it work or what? I look around in this repo anyway, into: https://github.com/tensorflow/tfjs-converter/tree/master/tfjs-converter
It says:
A 2-step process to import your model:
A Python pip package to convert a TensorFlow SavedModel or TensorFlow Hub module to a web-friendly format. If you already have a converted model, or are using an already hosted model (e.g. MobileNet), skip this step.
JavaScript API, for loading and running inference.
And basically says to create a venv and do:
pip install tensorflowjs
tensorflowjs_converter \
    --input_format=tf_saved_model \
    --output_format=tfjs_graph_model \
    --signature_name=serving_default \
    --saved_model_tags=serve \
    /mobilenet/saved_model \
    /mobilenet/web_model
But wait, are the checkpoint files I have a "TensorFlow SavedModel"? This isn't clear, and the documentation doesn't explain it. So I google it, find the documentation, and it says:
You can save and load a model in the SavedModel format using the following APIs:
Low-level tf.saved_model API. This document describes how to use this API in detail. Save: tf.saved_model.save(model, path_to_dir)
The linked API reference expands on the syntax somewhat:
tf.saved_model.save(
    obj, export_dir, signatures=None, options=None
)
with an example:
class Adder(tf.Module):

    @tf.function(input_signature=[tf.TensorSpec(shape=[], dtype=tf.float32)])
    def add(self, x):
        return x + x

model = Adder()
tf.saved_model.save(model, '/tmp/adder')
But so far, this isn't familiar at all. I don't understand how to take the results of my training process so far (the checkpoints) to load it into a variable model so I can pass it to this function.
This passage seems important:
Variables must be tracked by assigning them to an attribute of a tracked object or to an attribute of obj directly. TensorFlow objects (e.g. layers from tf.keras.layers, optimizers from tf.train) track their variables automatically. This is the same tracking scheme that tf.train.Checkpoint uses, and an exported Checkpoint object may be restored as a training checkpoint by pointing tf.train.Checkpoint.restore to the SavedModel's "variables/" subdirectory.
And it might be the answer, but I'm not really clear on what it means as far as being "restored", or where I go from there, if that's even the right step to take. All of this is very confusing to someone learning TF which is why I looked for a tutorial that does it, but again, I can't seem to find one without third party cloud services / React.
Please help me connect the dots.
You can convert your model to TensorFlowJS format without any cloud services. I have laid out the steps below.
I'm sure it's not hard to get from this step to the files I need.
The checkpoints you see are in tf.train.Checkpoint format (see the relevant source code that creates these checkpoints in the object detection model code). This is different from the SavedModel and Keras formats.
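For intuition, a checkpoint only stores variable values and must be restored onto an already-built model object rather than loaded standalone. A rough sketch, where detection_model is assumed to have been built from your pipeline.config beforehand and the checkpoint path is a placeholder:

import tensorflow as tf

# Restore the training checkpoint weights onto the model object.
ckpt = tf.train.Checkpoint(model=detection_model)
ckpt.restore('Tensorflow/workspace/models/my_ssd_mobnet/ckpt-1').expect_partial()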
We will go through these steps:
Checkpoint (current) --> SavedModel --> TensorFlowJS
Converting from tf.train.Checkpoint to SavedModel
Please see the script models/research/object_detection/export_inference_graph.py to convert the Checkpoint files to SavedModel.
The code below is taken from the docs of that script. Please adjust the paths to your specific project. --input_type should remain as image_tensor.
python export_inference_graph.py \
    --input_type image_tensor \
    --pipeline_config_path path/to/ssd_inception_v2.config \
    --trained_checkpoint_prefix path/to/model.ckpt \
    --output_directory path/to/exported_model_directory
In the output directory, you should see a saved_model directory. We will use this in the next step.
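Optionally, you can sanity-check the exported SavedModel in Python before converting it (the path is a placeholder):

import tensorflow as tf

loaded = tf.saved_model.load('path/to/exported_model_directory/saved_model')
print(list(loaded.signatures.keys()))  # should include 'serving_default'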
Converting SavedModel to TensorFlowJS
Follow the instructions at https://github.com/tensorflow/tfjs/tree/master/tfjs-converter, specifically paying attention to the "TensorFlow SavedModel example". The example conversion code is copied below. Please modify the input and output paths for your project. The --signature_name and --saved_model_tags might have to be changed, but hopefully not.
tensorflowjs_converter \
    --input_format=tf_saved_model \
    --output_format=tfjs_graph_model \
    --signature_name=serving_default \
    --saved_model_tags=serve \
    /mobilenet/saved_model \
    /mobilenet/web_model
Using the TensorFlowJS model
I know in order to use this model with tensorflow.js my goal is to get files like:
group1-shard1of2.bin
group1-shard2of2.bin
labels.json
model.json
The steps above should create these files for you, though I don't think labels.json will be created. I am not sure what that file should contain. TensorFlowJS will use model.json to construct the inference graph, and it will load the weights from the .bin files.
Because we converted a TensorFlow SavedModel to a TensorFlowJS model, we will need to load the JS model with tf.loadGraphModel(). See the tfjs converter page for more information.
Note that for TensorFlowJS, there is a difference between a TensorFlow SavedModel and a Keras SavedModel. Here, we are dealing with a TensorFlow SavedModel.
The JavaScript code to run the model is probably out of scope for this answer, but I would recommend reading this TensorFlowJS tutorial. I have included a representative JavaScript portion below.
import * as tf from '@tensorflow/tfjs';
import {loadGraphModel} from '@tensorflow/tfjs-converter';

const MODEL_URL = 'model_directory/model.json';
const model = await loadGraphModel(MODEL_URL);

const cat = document.getElementById('cat');
model.execute(tf.browser.fromPixels(cat));
Extra notes
... Which sounds a bit like a circular reference to me,
The TensorFlowJS ecosystem has been consolidated in the tensorflow/tfjs GitHub repository. The tfjs-converter documentation lives there now. You can create a pull request to https://github.com/tensorflow/tfjs to fix the SavedModel link to point to the tensorflow/tfjs repository.

Does Tensorflow server serve/support non-tensorflow based libraries like scikit-learn?

We are creating a platform to put AI use cases in production. TFX is the first choice, but what if we want to use non-TensorFlow libraries like scikit-learn and include a Python script to create models? Will the output of such a model be served by TensorFlow Serving? How can I make sure I can run both TensorFlow models and non-TensorFlow libraries and models in one system design? Please suggest.
Mentioned below is the procedure to deploy and serve a scikit-learn model on Google Cloud Platform.
The first step is to save/export the scikit-learn model using the code below:
from sklearn.externals import joblib

# Note: in newer scikit-learn versions, use `import joblib` directly.
joblib.dump(clf, 'model.joblib')
Next step is to upload the model.joblib file to Google Cloud Storage.
After that, we need to create our model and version, specifying that we are loading a scikit-learn model, and select the runtime version of Cloud ML Engine, as well as the version of Python that we used to export this model.
Next, we need to present the data to Cloud ML Engine as a simple array, encoded as a JSON file, as shown below. We can use the json library as well.
print(list(X_test.iloc[10:11].values))
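For example, a small sketch of writing such instances to a file that can be passed as $INPUT_FILE below (the file name is a placeholder; --json-instances expects one JSON instance per line):

import json

with open('input.json', 'w') as f:
    for row in X_test.iloc[10:11].values.tolist():
        f.write(json.dumps(row) + '\n')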
Next, we need to run the command below to perform the inference:
gcloud ml-engine predict --model $MODEL_NAME --version $VERSION_NAME --json-instances $INPUT_FILE
For more information, please refer to this link.

Can I use AWS Sagemaker without S3

If I am not using the notebook on AWS but instead just the Sagemaker CLI and want to train a model, can I specify a local path to read from and write to?
If you use local mode with the SageMaker Python SDK, you can train using local data:
from sagemaker.mxnet import MXNet

mxnet_estimator = MXNet('train.py',
                        train_instance_type='local',
                        train_instance_count=1)

mxnet_estimator.fit('file:///tmp/my_training_data')
However, this only works if you are training a model locally, not on SageMaker. If you want to train on SageMaker, then yes, you do need to use S3.
For more about local mode: https://github.com/aws/sagemaker-python-sdk#local-mode
As far as I know, you cannot do that. SageMaker's framework and estimator API makes it easy for SageMaker to feed data to the model at every iteration or epoch. Feeding from local storage would drastically slow down the process.
That begs the question: why not use S3? It's cheap and fast.