Load MLflow model and plot feature importance with feature names - xgboost

If I save an XGBoost model in MLflow with mlflow.xgboost.log_model(model, "model") and load it with model = mlflow.xgboost.load_model("models:/model_uri"), plotting the feature importance with xgboost.plot_importance(model) shows the features without their names (see plot). If I plot the importance without saving the model in MLflow, the original feature names are shown. Do I have to store the model in another way?

This usually happens when the model is part of a pipeline, for example one that uses FeatureUnion from sklearn.
You can try to get the feature indices from the model (or from the last step of the pipeline) and use them to retrieve the feature names from the dataset.
If you are using a pipeline, inspect the step before the one where the problem appears, or edit that step; also be aware that feature selection steps can introduce similar mismatches.
You can use autologging to save the plot automatically, but the same problem occurs if the model is a pipeline.
You could also save the model as an artifact; if you decide to do that, my suggestion is to use the Dill package.
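For the non-pipeline case in the question, one workaround is to reattach the feature names to the loaded booster before plotting. This is only a minimal sketch: the model URI is a placeholder and X_train is assumed to be the pandas DataFrame the model was trained on.

import mlflow.xgboost
import xgboost as xgb
import matplotlib.pyplot as plt

model = mlflow.xgboost.load_model("models:/my_model/1")  # placeholder URI

# Depending on how the model was logged, load_model returns a Booster or an
# sklearn-style wrapper; get the underlying Booster in either case.
booster = model.get_booster() if hasattr(model, "get_booster") else model

# Reattach the original column names (X_train is assumed to be the training DataFrame).
booster.feature_names = list(X_train.columns)

xgb.plot_importance(booster)
plt.show()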

Related

Is it possible to add custom entity labels to Spacy 3.0 config file?

I'm working on a custom NER model with spacy-transformers and roBERTa. I'm really only using the CLI for this and am trying to alter my Spacy config.cfg file to account for custom entity labels in the pipeline.
I'm new to Spacy, but I've gathered that people usually use ner.add_label to accomplish this. I wonder if I might be able to change something in [initialize.components.ner.labels] of the config, but haven't come across a good way to do that.
I can't seem to find any options to alter the config file in a similar fashion - does anyone know if this is possible, or what might be the most succinct way to achieve those custom labels?
Edited for clarity: my issue may be different from my config theory. Right now I am getting output, but the labels are numeric rather than text, such as:
('Oct',383) ('2019',383) ('February',383)
Thank you in advance for your help!
If you are working with the config-based training, generally you should not have to specify the labels anywhere - spaCy will look at the training data and get the list of labels from there.
There are a few cases where this won't work.
You have labels that aren't in your training data. These can't be learned, so I would generally consider this an error, but sometimes you have to work with the data you've been given.
Your training data is very large. In this case, reading over all the training data to get a complete list of labels can be an issue. You can use the init labels command to generate the label data ahead of time, so that the input data doesn't have to be scanned every time you start training.
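To illustrate why this works: the labels end up in the model because they appear as entity annotations in the training corpus. A minimal sketch of writing a .spacy training file with a custom label (the text and the "MONTH" label are made up for illustration):

import spacy
from spacy.tokens import DocBin, Span

nlp = spacy.blank("en")

# Annotate a toy example with a custom entity label; config-based training
# will pick the label up from the corpus, no config change needed.
doc = nlp("Oct 2019")
doc.ents = [Span(doc, 0, 1, label="MONTH")]  # hypothetical custom label

DocBin(docs=[doc]).to_disk("./train.spacy")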

How to know what augmentation has been carried out by tf.image.stateless_random_flip_up_down or similar APIs

In TF 2.x there is a whole set of image augmentation APIs, take tf.image.stateless_random_flip_up_down for example. Most of these perform the said operation at random. What I would like to find out is whether there is a way to interrogate what exactly has been performed for a specific image in a specific batch. This info is critical if the prediction targets involve localization such as points, bounding boxes, etc.: when an affine transform (like a translation) is performed on the image, the same operation should be used to "augment" the targets (y) in a consistent manner.
I think none of the image transform APIs in TF 2.x return this piece of info. I would like to see if there is an easier way than creating custom versions of my own. I have done this for the older Keras data augmentation API in the past by subclassing, and would prefer not to repeat the tedium if possible.
I guess your main goal is to write a data augmentation pipeline that changes both the image and its labels.
In that case I would recommend using the Albumentations library instead. It is a very popular open-source library that can easily be integrated with the TensorFlow and PyTorch frameworks.
Here is its documentation: https://albumentations.ai/
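A minimal sketch of the idea, assuming bounding-box targets in pascal_voc format (the image, box and label below are dummies):

import albumentations as A
import numpy as np

# The same randomly drawn transform is applied to the image and its boxes,
# so you never need to know which flip/shift was actually chosen.
transform = A.Compose(
    [A.HorizontalFlip(p=0.5), A.ShiftScaleRotate(p=0.5)],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["class_labels"]),
)

image = np.zeros((256, 256, 3), dtype=np.uint8)  # dummy image
bboxes = [[30, 40, 120, 160]]                    # dummy box (x_min, y_min, x_max, y_max)
out = transform(image=image, bboxes=bboxes, class_labels=["object"])
augmented_image, augmented_bboxes = out["image"], out["bboxes"]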
Let me know if it helps!

Migrating legacy ML pipeline to TFX

We are investigating transitioning our ML pipelines from a set of manual steps into a TFX pipeline.
I do however have some questions for which I would like to have some additional insights.
We typically perform the following steps (for an image classification task):
1. Load image data and meta-data
2. Filter out 'bad' data based on meta-data
3. Determine image-based statistics (classic image processing in Python): image-level characteristics and image-region characteristics (the region is determined by a fine-tuned EfficientDet model)
4. Filter out 'bad' data based on image statistics
5. Generate TFRecords from this image and meta-data
6. Oversample certain TFRecords for class balancing (using tf.data)
7. Train an image classifier
…
Now, I’m trying to map this onto the typical example TFX pipeline.
This however raises a number of questions:
I see two options:
Option A: ExampleGen uses a CSV file containing pointers to the images and the meta-data to be loaded (step 1 above). However:
If this CSV file contains a path to an image file, can ExampleGen then load the image data and add it to its output?
Is the output of ExampleGen a streaming output, or a dump of all example data?
Option B: ExampleGen takes TFRecords as input (the output of step 5 above).
-> This implies that we would still need to implement steps 1-5 outside of TFX, which would reduce the value of TFX for us.
Could you please advise what would be the best way forward?
Can StatisticsGen also generate statistics on a per-example basis (for example some image (region) characteristics based on classic image processing)? Or should this be implemented in ExampleGen? Or…?
Can the calculated statistics be cached using the metadata store? If yes, is there an example of this available?
Calculating image-based characteristics using classic image processing is slow. If new data becomes available and triggers the TFX input component to execute, ideally the already-calculated statistics should be loaded from the cache.
Is it correct that ExampleValidator may reject some examples (e.g. missing data, outliers, …)?
How can class balancing at the network input side (not via the loss function) be achieved in this setup (normally we do this by oversampling our TFRecords using tf.data)?
If this is done at the ExampleGen level, then the ExampleValidator may still reject some examples, potentially unbalancing the data again.
This may not seem like a big issue for large-data ML tasks, but it becomes crucial for small-data ML tasks (as is typically the case in a healthcare setting).
So I would expect a TFX component for this before the Transform component, but this block should then have access to all data, not in a streaming way (see my earlier question on ExampleGen output)…
Thank you for your insights.
I'll try to address most of the questions based on my experience with TFX.
I have a Dataflow job that I run to pre-process my images, labels, features, etc. and turn all of that into TFRecords. That lives outside of TFX and is run only when there are data refreshes.
You can do the same; here is a very simple code snippet that I use to resize all my images and create simple features:
import tensorflow as tf

# The _int64_feature/_bytes_feature helpers are the standard ones from the TFRecord tutorial.
def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

# Illustrative wrapper: image_string, element and image_resize_size come from
# the surrounding Dataflow loop over the raw data.
def make_example(image_string, element, image_resize_size):
    try:
        # Decode, resize and re-encode one image, then wrap it in a tf.train.Example.
        image = tf.io.decode_jpeg(image_string)
        image = tf.image.resize(image, [image_resize_size, image_resize_size])
        image = tf.image.convert_image_dtype(image / 255.0, dtype=tf.uint8)
        image_shape = image.shape
        image = tf.io.encode_jpeg(image, quality=100)
        feature = {
            'height': _int64_feature(image_shape[0]),
            'width': _int64_feature(image_shape[1]),
            'depth': _int64_feature(image_shape[2]),
            # labels_to_int is my own mapping from label strings to integer ids.
            'label': _int64_feature(labels_to_int(element[2].decode())),
            'image_raw': _bytes_feature(image.numpy()),
        }
        return tf.train.Example(features=tf.train.Features(feature=feature))
    except Exception:
        print('image could not be decoded')
        return None
Once I have the data in tfrecord format, I use the ImportExampleGen component to load the data into my tfx pipeline. This is followed by StatisticsGen which will compute statistics on the features.
When running all of this in the cloud, it uses Dataflow under the covers in batch mode.
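For reference, the ingestion part of such a pipeline looks roughly like this in recent TFX versions (the TFRecord location is a placeholder; the pipeline/orchestrator wiring is omitted):

from tfx.components import ImportExampleGen, StatisticsGen, SchemaGen

# Point ImportExampleGen at the directory containing the pre-built TFRecords.
example_gen = ImportExampleGen(input_base="gs://my-bucket/tfrecords")  # placeholder path

# Compute feature statistics and infer a schema from the ingested examples.
statistics_gen = StatisticsGen(examples=example_gen.outputs["examples"])
schema_gen = SchemaGen(statistics=statistics_gen.outputs["statistics"])

components = [example_gen, statistics_gen, schema_gen]
# These components are then handed to a pipeline runner (LocalDagRunner, Kubeflow, ...).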
The metadata store only caches pipeline metadata; the data itself is cached in a GCS bucket, and the metadata store knows about it. So when you re-run your pipeline with caching set to True, ImportExampleGen, StatisticsGen, SchemaGen and Transform will not be re-run if the data hasn't changed. This has huge benefits in time and cost.
ExampleValidator outputs an artifact to let you know what anomalies are in your data. I created a custom component that takes the ExampleValidator artifact as input, and if my data doesn't meet certain criteria I kill the pipeline by throwing an error in this component. I wish there were a component that could just stop the pipeline, but I haven't found one, so my workaround is to throw an error, which stops the pipeline from progressing further.
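The gating trick can be sketched with TFX's Python-function component API. This is an assumption-heavy sketch: the component name is made up and the file layout inside the ExampleAnomalies artifact varies between TFX versions, so the file handling will need adapting.

import os
import tensorflow as tf
from google.protobuf import text_format
from tensorflow_metadata.proto.v0 import anomalies_pb2
from tfx.dsl.component.experimental.annotations import InputArtifact
from tfx.dsl.component.experimental.decorators import component
from tfx.types.standard_artifacts import ExampleAnomalies

@component
def AnomaliesGate(anomalies: InputArtifact[ExampleAnomalies]) -> None:
    # ExampleValidator writes one anomalies proto per split; depending on the
    # TFX version this is a binary .pb or a text .pbtxt file.
    for path in tf.io.gfile.glob(os.path.join(anomalies.uri, "*", "*")):
        proto = anomalies_pb2.Anomalies()
        raw = tf.io.gfile.GFile(path, "rb").read()
        try:
            text_format.Parse(raw.decode("utf-8"), proto)
        except Exception:
            proto.ParseFromString(raw)
        if proto.anomaly_info:
            # Failing the component stops the pipeline from progressing further.
            raise RuntimeError(f"Data anomalies found in {path}: {dict(proto.anomaly_info)}")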
Usually when I create a TFX pipeline, it is done to automate the machine learning process. At that point we have already done class balancing, feature selection, etc., since that falls more under the pre-processing stage.
I guess you could technically create a custom component that takes a StatisticsGen artifact, parses it and tries to do some class balancing, creating a new dataset with balanced classes. But honestly, I think it is better to do it at the pre-processing stage.
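If rebalancing inside the input pipeline is still wanted, the oversampling idea the question mentions can be sketched with tf.data (file names are placeholders; sample_from_datasets lives under tf.data.experimental on older TF 2.x versions):

import tensorflow as tf

# One TFRecord dataset per class (placeholder file names), repeated so the
# smaller class can be drawn from indefinitely.
majority = tf.data.TFRecordDataset("class_a.tfrecord").repeat()
minority = tf.data.TFRecordDataset("class_b.tfrecord").repeat()

# Draw from both classes with equal probability, i.e. oversample the minority class.
balanced = tf.data.Dataset.sample_from_datasets([majority, minority], weights=[0.5, 0.5])
balanced = balanced.take(10_000)  # cap the otherwise infinite stream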

Tensorflow Object Detection API - showing loss for training and validation on one graph

I am playing with the TensorFlow Object Detection API and training the Faster R-CNN network on my own dataset. I am checking the progress of training in TensorBoard. All the metrics are there, but is there a way to have both loss plots, for training and validation data, on one graph? Or do I have to dive into the TOD API code and modify it? I would like to avoid the latter, because with every update of the API I would have to keep in mind that some of the code has been changed locally.
The underlying data for the plots is saved under different tag names (loss vs loss_1). I believe TensorBoard does not natively support displaying different tags in one plot. There might be third-party extensions to do this.
If different models used the same tag, the graphs would be combined by default (see: Plot multiple graphs in one plot using Tensorboard).
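If a combined plot is needed anyway, one workaround is to read both scalar series from the event files and plot them yourself. A sketch, assuming the log directories and the tag names ("loss" / "loss_1") match your setup:

import matplotlib.pyplot as plt
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

def load_scalars(logdir, tag):
    # Read one scalar tag from the TensorBoard event files in logdir.
    acc = EventAccumulator(logdir)
    acc.Reload()
    events = acc.Scalars(tag)
    return [e.step for e in events], [e.value for e in events]

# Placeholder directories and tags; check the actual names in your TensorBoard run.
train_steps, train_loss = load_scalars("train/", "loss")
eval_steps, eval_loss = load_scalars("eval/", "loss_1")

plt.plot(train_steps, train_loss, label="training")
plt.plot(eval_steps, eval_loss, label="validation")
plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.show()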

Object detection using CNTK

I am very new to CNTK.
I want to train a model on a set of images (to detect objects like alcohol glasses/bottles) using CNTK with ResNet/Fast R-CNN.
I am trying to follow the documentation from GitHub below; however, it does not appear to be a straightforward procedure. https://github.com/Microsoft/CNTK/wiki/Object-Detection-using-Fast-R-CNN
I cannot find proper documentation on generating ROIs for images with different sizes and shapes, or on how to create object labels based on the trained models. Can someone point me to proper documentation or a training link that I can use to work on the CNTK model? Please see the attached image, in which I was able to load a sample image with the default ROIs in the script. How do I properly set the size and label the object in the image? Thanks in advance!
sample image loaded for training
Not sure what you mean by proper documentation. This is an implementation of the paper (https://arxiv.org/pdf/1504.08083.pdf). It looks like you are trying to generate ROIs. Have a look through the helper functions documented at the site to find what you might need:
To run the toy example, make sure that in PARAMETERS.py the datasetName is set to "grocery".
Run A1_GenerateInputROIs.py to generate the input ROIs for training and testing.
Run A2_RunCntk_py3.py to train a Fast R-CNN model using the CNTK Python API and compute test results.
The algorithm works on several candidate regions and then generates two outputs: one for the classes of the objects and another for the bounding boxes of the objects belonging to those classes. Please refer to the code for the details of the implementation.
Can someone point me to proper documentation or a training link that I can use to work on the CNTK model?
You can take a look at my repository on GitHub.
It will guide you through all the steps required to train your own model for object detection and classification with CNTK.
But in short the proper steps should look something like this:
Setup environment
Prepare data
Tag images (ground truth)
Download pretrained model and create mappings for your custom dataset
Run training
Evaluate the model on test set