Migrating legacy ML pipeline to TFX - tfx

We are investigating transitioning our ML pipelines from a set of manual steps into a TFX pipeline.
I do, however, have some questions for which I would like some additional insight.
We typically perform the following steps (for an image classification task):
1. Load image data and meta-data
2. Filter out ‘bad’ data based on meta-data
3. Determine image-based statistics (classic image processing in Python):
   - image-level characteristics
   - image-region characteristics (the region is determined by a fine-tuned EfficientDet model)
4. Filter out ‘bad’ data based on image statistics
5. Generate TFRecords from this image and meta-data
6. Oversample certain TFRecords for class balancing (using tf.data)
7. Train an image classifier
8. …
Now, I’m trying to map this onto the typical example TFX pipeline.
This however raises a number of questions:
I see two options:
Option A: ExampleGen uses a CSV file containing pointers to the images and meta-data to be loaded (step 1 above). However:
- If this CSV file contains a path to an image file, can ExampleGen then load the image data and add it to its output?
- Is the output of ExampleGen a streaming output, or a dump of all example data?
Option B: ExampleGen takes TFRecords as input (the output of step 5 above).
-> This implies that we would still need to implement steps 1-5 outside of TFX, which would decrease the value of TFX for us…
Could you please advise on the best way forward?
Can StatisticsGen also generate statistics on a per-example basis (for example, some image (region) characteristics based on classic image processing)? Or should this be implemented in ExampleGen? Or…?
Can the calculated statistics be cached using the metadata store? If yes, is there an example of this available?
Calculating image-based characteristics using classic image processing is slow. If new data becomes available and triggers the TFX input component to run, ideally the already-calculated statistics would be loaded from the cache.
Is it correct that ExampleValidator may reject some examples (e.g. missing data, outliers, …)?
How can class balancing at the network input side (not via the loss function) be achieved in this setup (normally we do this by oversampling our TFRecords using tf.data)?
If this is done at the ExampleGen level, then the ExampleValidator may still reject some examples potentially unbalancing the data again.
This may not seem like a big issue for large-data ML tasks, but it becomes crucial for small-data ML tasks (as is typically the case in a healthcare setting).
So I would expect a TFX component for this before the Transform component, but this block should then have access to all data, not in a streaming way (see my earlier question on ExampleGen output)…
Thank you for your insights.

I'll try to address most of your questions based on my experience with TFX.
I have a Dataflow job that I run to pre-process my images, labels, features, etc. and turn all of that into TFRecords. That lives outside of TFX and is run only when there are data refreshes.
You can do the same; here is a very simple code snippet that I use to resize all my images and create simple features.
import tensorflow as tf

# _int64_feature/_bytes_feature are the standard helpers from the TF documentation;
# image_string, image_resize_size, element and labels_to_int come from the
# surrounding Dataflow job.
def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

try:
    image = tf.io.decode_jpeg(image_string)
    image = tf.image.resize(image, [image_resize_size, image_resize_size])
    # resize() returns float32 in [0, 255]; scale to [0, 1] before converting back to uint8
    image = tf.image.convert_image_dtype(image / 255.0, dtype=tf.uint8)
    image_shape = image.shape
    image = tf.io.encode_jpeg(image, quality=100)
    feature = {
        'height': _int64_feature(image_shape[0]),
        'width': _int64_feature(image_shape[1]),
        'depth': _int64_feature(image_shape[2]),
        'label': _int64_feature(labels_to_int(element[2].decode())),
        'image_raw': _bytes_feature(image.numpy()),
    }
    tf_example = tf.train.Example(features=tf.train.Features(feature=feature))
except Exception:  # most commonly a decode error from a corrupt image
    print('image could not be decoded')
    return None
Once I have the data in TFRecord format, I use the ImportExampleGen component to load the data into my TFX pipeline. This is followed by StatisticsGen, which computes statistics on the features.
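For reference, that wiring looks roughly like this (a sketch in the TFX 1.x API; the bucket path is a placeholder, not my real setup):

from tfx import v1 as tfx

# ImportExampleGen ingests TFRecords of tf.train.Example that already exist,
# e.g. the ones produced by the Dataflow job above.
example_gen = tfx.components.ImportExampleGen(input_base='gs://my-bucket/tfrecords')

# StatisticsGen computes dataset statistics over the ingested examples,
# and SchemaGen can infer a schema from those statistics.
statistics_gen = tfx.components.StatisticsGen(examples=example_gen.outputs['examples'])
schema_gen = tfx.components.SchemaGen(statistics=statistics_gen.outputs['statistics'])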
When running all of this in the cloud, it uses Dataflow under the covers in batch mode.
The metadata store only stores pipeline metadata, but your artifacts are cached in a GCS bucket and the metadata store knows about them. So when you re-run your pipeline with caching set to True, your ImportExampleGen, StatisticsGen, SchemaGen and Transform will not be rerun if the data hasn't changed. This has huge benefits in time and cost.
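Caching itself is just a flag on the pipeline definition; a rough sketch (paths are placeholders, components as in the snippet above):

from tfx import v1 as tfx

pipeline = tfx.dsl.Pipeline(
    pipeline_name='image-pipeline',
    pipeline_root='gs://my-bucket/pipeline-root',
    components=[example_gen, statistics_gen, schema_gen],
    enable_cache=True,  # reuse cached artifacts when inputs and code are unchanged
    metadata_connection_config=tfx.orchestration.metadata.sqlite_metadata_connection_config(
        'metadata.db'))

tfx.orchestration.LocalDagRunner().run(pipeline)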
ExampleValidator outputs an artifact that tells you what data anomalies are present in your data. I created a custom component that takes the ExampleValidator artifact and, if my data doesn't meet certain criteria, kills the pipeline by throwing an error in that component. I wish there were a component that could simply stop the pipeline, but I haven't found one, so my workaround is to throw an error, which stops the pipeline from progressing further.
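My gate component is along these lines (a sketch, not my exact code; the per-split SchemaDiff.pb layout below is what recent TFX versions write and may differ in yours):

import glob
import os

from tensorflow_metadata.proto.v0 import anomalies_pb2
from tfx import v1 as tfx


@tfx.dsl.components.component
def AnomaliesGate(
    anomalies: tfx.dsl.components.InputArtifact[tfx.types.standard_artifacts.ExampleAnomalies]):
    # ExampleValidator writes one serialized Anomalies proto per split under its artifact URI.
    for path in glob.glob(os.path.join(anomalies.uri, '*', 'SchemaDiff.pb')):
        proto = anomalies_pb2.Anomalies()
        with open(path, 'rb') as f:
            proto.ParseFromString(f.read())
        if proto.anomaly_info:
            # Raising marks this component as failed, which stops the pipeline run.
            raise RuntimeError('Data anomalies detected: %s' % list(proto.anomaly_info))

It is then wired into the pipeline as AnomaliesGate(anomalies=example_validator.outputs['anomalies']).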
Usually when I create a TFX pipeline, it is done to automate the machine learning process. At that point we have already done class balancing, feature selection, etc., since that falls more under the pre-processing stage.
I guess you could technically create a custom component that takes a StatisticsGen artifact, parses it, does some class balancing and creates a new dataset with balanced classes. But honestly, I think it is better to do it at the preprocessing stage.
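If you do want to keep the balancing inside the input pipeline, one option (a sketch assuming TF 2.7+ and a parse_fn of yours that yields (image, label) pairs) is to oversample with tf.data by splitting the dataset per class and resampling uniformly:

import tensorflow as tf

def balanced_dataset(tfrecord_files, parse_fn, num_classes):
    base = tf.data.TFRecordDataset(tfrecord_files).map(parse_fn)
    # One infinitely-repeating stream per class, sampled with equal weight;
    # the result is infinite, so use steps_per_epoch when fitting.
    per_class = [
        base.filter(lambda image, label, c=c: tf.equal(label, c)).repeat()
        for c in range(num_classes)
    ]
    return tf.data.Dataset.sample_from_datasets(
        per_class, weights=[1.0 / num_classes] * num_classes)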

Related

How to retain Entity Identifier with Batch Prediction of XGBoost Model in Vertex AI

I am wondering how we can match back predictions to the entity after executing a batch prediction using an XGBoost model via Custom Training on Prebuilt Images.
When kicking off a BatchPredictionJob it expects the input to be of the form
input_1,input_2,input_3
0.1,0.2,0.3
0.4,0.5,0.6
...
for csv or
[0.1,0.2,0.3]
[0.4,0.5,0.6]
...
for jsonl with the output predictions:
{"instance":[0.1,0.2,0.3], "prediction":0.0345}
...
The output predictions then just contain these instances of input values without any indication of how to map the predictions back to the original entity. As the job is distributed, I do not believe I can rely on the file ordering; does anyone have a method to do so?
Doing batch prediction on a model runs the job using distributed processing, which means the data is distributed among an arbitrary cluster of virtual machines and is processed in an unpredictable order.
In AI Platform, an instance key can be defined to match returned batch predictions with their input instances, but in Vertex AI this feature has not been documented.
As the concept of using instance keys with the prebuilt XGBoost container image on custom-trained models is not mentioned in the Vertex AI docs, this has been raised in the issue tracker. We cannot provide an ETA at this moment, but you can follow the progress in the issue tracker and ‘STAR’ the issue to receive automatic updates and give it traction by referring to this link.
In Vertex AI the batch prediction outputs are not ordered; a feature request has been raised for this, and you can track updates on it from this link.
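As a stop-gap, since each output line echoes its input instance, one workaround is to join the predictions back to the entities on the feature values themselves. A rough sketch (it assumes the feature vectors uniquely identify entities and that float values round-trip identically through the CSV/JSONL serialization, which you should verify on your data):

import json

import pandas as pd

def join_predictions(entities_csv, prediction_files):
    # entities_csv is assumed to hold an 'entity_id' column plus the model inputs
    # in the same order they were sent to the BatchPredictionJob.
    entities = pd.read_csv(entities_csv)
    feature_cols = [c for c in entities.columns if c != 'entity_id']
    key_to_id = {
        tuple(row): eid
        for eid, row in zip(entities['entity_id'],
                            entities[feature_cols].itertuples(index=False, name=None))
    }
    rows = []
    for path in prediction_files:  # the JSONL result shards written by the job
        with open(path) as f:
            for line in f:
                record = json.loads(line)
                rows.append({'entity_id': key_to_id.get(tuple(record['instance'])),
                             'prediction': record['prediction']})
    return pd.DataFrame(rows)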

load MLFlow model and plot feature importance with feature names

If I save an XGBoost model in MLflow with mlflow.xgboost.log_model(model, "model"), load it with model = mlflow.xgboost.load_model("models:/model_uri") and plot the feature importance with xgboost.plot_importance(model), the problem is that the features are not shown with their names (see plot). If I plot the features without saving to MLflow, the original feature names are shown. Do I have to store the model in another way?
Usually this can happen if you use a pipeline, for example with FeatureUnion from sklearn.
You can try to get the feature indices from the model or the last step of the pipeline and use them to retrieve the feature names from the dataset.
If you are using a pipeline, you can try to get the features from the step before the problem appears, or edit that step; also be aware that if you are using feature selection, different situations can occur.
You can use autologging to auto-save the plot, but the same problem happens if it is a pipeline.
You could also save the model as an artifact; if you decide to do that, my suggestion is to use the dill package.
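If the names were simply dropped at logging time, one workaround (a sketch; the model URI and the feature-name list are placeholders for your own values) is to re-attach them to the booster before plotting:

import mlflow.xgboost
import xgboost
from matplotlib import pyplot as plt

model = mlflow.xgboost.load_model("models:/my_model/1")  # placeholder URI
# log_model accepts both Booster and sklearn-API models; normalize to a Booster.
booster = model.get_booster() if hasattr(model, "get_booster") else model
# Replace with your real training columns, e.g. list(X_train.columns).
booster.feature_names = ["feature_1", "feature_2", "feature_3"]
xgboost.plot_importance(booster)
plt.show()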

Huge size of TF records file to store on Google Cloud

I am trying to modify a tensorflow project so that it becomes compatible with TPU.
For this, I started with the code explained on this site.
Here COCO dataset is downloaded and first its features are extracted using InceptionV3 model.
I wanted to modify this code so that it supports TPU.
For this, I added the mandatory code for TPU as per this link.
Within the TPU strategy scope, I created the InceptionV3 model using the Keras library and loaded it with ImageNet weights as per the existing code.
Now, since the TPU needs data to be stored on Google Cloud Storage, I created a TFRecords file using tf.train.Example with the help of this link.
I tried to create this file in several ways so that it would contain the data that the TPU can find through TFRecordDataset.
At first I directly added the image data and image path to the file and uploaded it to a GCP bucket, but while reading the data back I realized that this image data is not useful, as it does not contain the shape/size information that will be needed, and I had not resized the images to the required dimensions before storage. This file was 2.5 GB, which was okay.
Then I thought let's only keep the image path in the cloud, so I created another TFRecords file with only the image paths. But that did not seem optimal either, as the TPU would have to open each image individually, resize it to 299x299 and then feed it to the model; it would be better to have the image data available through the .map() function of the TFRecordDataset. So I tried again, this time by using this link, storing the R, G and B channels along with the image path inside the TFRecords file.
However, now I see that the size of the TFRecords file is abnormally large, around 40-45 GB, and I ultimately stopped the execution as my memory was filling up on the Google Colab TPU.
The original size of the COCO dataset is not that large, about 13 GB, and the dataset is being created from only the first 30,000 records, so 40 GB looks like a weird number.
May I know what the problem is with this way of feature storage? Is there a better way to store image data in a TFRecords file and then extract it through TFRecordDataset?
I think the COCO dataset processed as TFRecords should be around 24-25 GB on GCS. Note that TFRecords aren't meant to act as a form of compression; they represent data as protobufs so they can be loaded optimally into TensorFlow programs.
You might have more success if you refer to: https://cloud.google.com/tpu/docs/coco-setup (corresponding script can be found here) for converting COCO (or a subset) into TFRecords.
Furthermore, we have implemented detection models for COCO using TF2/Keras optimized for GPU/TPU here which you might find useful for optimal input pipelines. An example tutorial can be found here. Thanks!
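For what it's worth, the usual way to keep TFRecords close to the source size is to store resized, JPEG-encoded bytes rather than decoded R/G/B arrays, and decode inside .map() at training time. A minimal sketch (the 299x299 size, file names and feature keys are placeholders, not the exact COCO conversion script):

import tensorflow as tf

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def image_to_example(image_path):
    image = tf.io.decode_jpeg(tf.io.read_file(image_path), channels=3)
    image = tf.cast(tf.image.resize(image, [299, 299]), tf.uint8)
    encoded = tf.io.encode_jpeg(image).numpy()  # compressed bytes, not raw pixels
    return tf.train.Example(features=tf.train.Features(feature={
        'image': _bytes_feature(encoded),
        'path': _bytes_feature(image_path.encode()),
    }))

image_paths = ['img_0001.jpg', 'img_0002.jpg']  # placeholder: your local image files
with tf.io.TFRecordWriter('train.tfrecord') as writer:
    for image_path in image_paths:
        writer.write(image_to_example(image_path).SerializeToString())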

Efficient management of large amounts of data with SageMaker for training a keras model

I'm working on a deep learning project with about 700GB of table-like time series data in thousands of .csv files (each about 15MB).
All the data is on S3, and it needs some preprocessing before being fed into the model. The question is how best to automate the process of loading, preprocessing and training. Is a custom Keras generator with some built-in preprocessing the best solution?
Preprocessing implies that this is something you might want to decouple from the model execution and run separately, possibly on a schedule or in response to new data flowing in.
If so, you'll probably want to do the preprocessing outside of SageMaker. You could orchestrate it using Glue, or you could write a custom job and run it through AWS Batch or alternatively on an EMR cluster.
That way, your Keras notebook can load the already preprocessed data, train and test through SageMaker.
With a little care, you should be able to perform at least some of the heavy lifting incrementally in the preprocessing step, saving both time and cost downstream in the Deep Learning pipeline.
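To give a concrete flavor of the training side: once the heavy preprocessing has run elsewhere, the notebook can stream the preprocessed CSV shards with tf.data rather than a hand-rolled generator. A sketch (column names, bucket and batch size are placeholders; reading s3:// paths directly assumes TensorFlow's S3 filesystem support, e.g. via tensorflow-io):

import tensorflow as tf

FEATURE_COLUMNS = ['f1', 'f2', 'f3']  # placeholders for your real columns
LABEL_COLUMN = 'label'

def make_dataset(file_pattern='s3://my-bucket/preprocessed/*.csv', batch_size=256):
    # Streams and batches the preprocessed shards; no full download needed.
    return tf.data.experimental.make_csv_dataset(
        file_pattern,
        batch_size=batch_size,
        label_name=LABEL_COLUMN,
        select_columns=FEATURE_COLUMNS + [LABEL_COLUMN],
        num_epochs=1,
        shuffle=True)

# Pass make_dataset() to model.fit(...) on your compiled Keras model.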

Can I create one graph per model in TensorFlow?

The benefit would be that I can store and load individual models using tf.train.export_meta_graph() but I'm not sure if this usage is what TensorFlow was designed for. Does it have any negative impacts on parallelism/performance, functionality, etc to use multiple graphs in parallel, as long as I don't want to share data between them?
It's not a good idea, because passing data between models would require fetching the result from one session and feeding the Python object back into the other session. Locally, that means unnecessary copy operations, and it's worse in the distributed setting.
There are now export_scoped_meta_graph() and import_scoped_meta_graph() in tf.contrib.framework.meta_graph to save and load parts of a graph, and using a single global graph is recommended.
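For illustration, here is a TF 1.x-style sketch of the single-global-graph approach, using the scope arguments on the tf.train equivalents of those functions (the tiny model below is made up):

import tensorflow as tf  # TensorFlow 1.x

# Build each model under its own variable scope in the one global graph.
with tf.variable_scope('model_a'):
    x = tf.placeholder(tf.float32, [None, 4], name='x')
    w = tf.get_variable('w', [4, 1])
    y = tf.matmul(x, w, name='y')

# Export only the model_a subgraph to a MetaGraphDef.
tf.train.export_meta_graph(filename='model_a.meta', export_scope='model_a')

# Later, import it back (here under a new scope) without touching other models.
tf.reset_default_graph()
tf.train.import_meta_graph('model_a.meta', import_scope='model_a_restored')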