How to get model summary from Spark ML Logistic Regression model? - apache-spark-ml

I am following an example from - https://spark.apache.org/docs/2.3.0/ml-classification-regression.html#multinomial-logistic-regression
When I try to get the model summary, I am facing an error. Here is my code with the error -
// START
import org.apache.spark.ml.classification.LogisticRegression
// Load training data
val training = spark.read.format("libsvm").load("file:///Users/my_username/Desktop/sample_multiclass_classification_data.txt")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
// Fit the model
val lrModel = lr.fit(training)
// Print the coefficients and intercept for multinomial logistic regression
println(s"Coefficients: \n${lrModel.coefficientMatrix}")
println(s"Intercepts: \n${lrModel.interceptVector}")
val trainingSummary = lrModel.summary
org.apache.spark.SparkException: No training summary available for this LogisticRegressionModel
at org.apache.spark.ml.classification.LogisticRegressionModel$$anonfun$summary$1.apply(LogisticRegression.scala:1002)
at org.apache.spark.ml.classification.LogisticRegressionModel$$anonfun$summary$1.apply(LogisticRegression.scala:1002)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.ml.classification.LogisticRegressionModel.summary(LogisticRegression.scala:1001)
... 48 elided
I want to print the metrics from the model after this step.
I have obtained the data from - https://github.com/apache/spark/blob/master/data/mllib/sample_multiclass_classification_data.txt

My bad: I was using Spark version 2.2.0, while the documentation is for 2.3.0.
It works on 2.3.0.
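For reference, once on 2.3.0 the same trainingSummary exposes the multiclass metrics shown in the linked example; a minimal continuation of the code above:
// Objective per iteration
println("objectiveHistory:")
trainingSummary.objectiveHistory.foreach(println)
// Per-label and weighted metrics for the multiclass model
println("False positive rate by label:")
trainingSummary.falsePositiveRateByLabel.zipWithIndex.foreach { case (rate, label) =>
  println(s"label $label: $rate")
}
println(s"Accuracy: ${trainingSummary.accuracy}")
println(s"Weighted precision: ${trainingSummary.weightedPrecision}")
println(s"Weighted recall: ${trainingSummary.weightedRecall}")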

Related

How to save checkpoints of quantise wrapped models in tensorflow model optimization?

Hi, I am using TensorFlow and TensorFlow Model Optimization.
This is an overview of the process:
import tensorflow_model_optimization as tfmot
quantize_model = tfmot.quantization.keras.quantize_model
model = define_model()
qat_model = quantize_model(model)
qat_model.fit(...)
qat_model.save_weights("qat_weights.h5")
... Finish for Now ...
On another run
model = define_model()
qat_model = quantize_model(model)
qat_model.load_weights("qat_weights.h5")
But when I get to qat_model.fit(...), it starts training again from 0%.
So there must be a problem with either saving or loading the weights.
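One thing worth trying (not a definitive fix, just a sketch based on the TF Model Optimization pattern for serializing QAT models): save and reload the full model inside quantize_scope, so the quantize wrappers and the optimizer state survive the round trip. The file name below is hypothetical:
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# First run: save the full model (architecture + weights + optimizer state),
# not just the weights
qat_model.save("qat_model.h5")

# Another run: deserialize the quantize wrappers inside quantize_scope
with tfmot.quantization.keras.quantize_scope():
    qat_model = tf.keras.models.load_model("qat_model.h5")

# qat_model.fit(...) should now resume rather than start from scratch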

TF Yamnet Transfer Learning and Quantization

TLDR:
Short term: Trying to quantize a specific portion of a TF model (recreated from a TFLite model). Skip to pictures below.
Long term: Transfer Learn on Yamnet and compile for Edge TPU.
Source code to follow along is here
I've been trying to transfer learn on Yamnet and compile for a Coral Edge TPU for a few weeks now.
Started here, but quickly realized that model wouldn't quantize and compile for the Edge TPU because of the dynamic input and out of the box TFLite quantization doesn't work well with the preprocessing of audio before Yamnet's MobileNet.
After tinkering and learning for a few weeks, I found a Yamnet model compiled for the Edge TPU (sadly without source code) and figured my best shot would be to recreate it in TF, then quantize, then convert to TFLite, then compile for the Edge TPU. I'll also have to figure out how to set the weights; I'm not sure whether I have to (or can) do that pre- or post-quantization. Anyway, I've effectively recreated the model, but I'm having a hard time quantizing it without a bunch of wacky behavior.
The model currently looks like this: [image of the current model graph]
I want it to look like this: [image of the target model graph]
For quantizing, I tried:
TFLite Model Optimization, which puts tfl.quantize ops all over the place and fails to compile for the Edge TPU.
Quantization Aware Training, which throws some annoying errors that I've been trying to work through.
If you know a better way to achieve the long-term goal than what I proposed, please (please please please) share! Otherwise, help on the specific quant ops would be great! Also, reach out if anything needs clarifying.
I've run into the same issues trying to convert the TensorFlow Yamnet model to full integer in order to compile it for the Coral Edge TPU, and I think I've found a workaround.
I've been trying to stick to the tutorials posted in the tflite-model-maker section and to find a solution within this API because, from experience, I've found it to be a very powerful tool.
If your goal is to build a model which is fully compiled for the Edge TPU (meaning all layers, including the input and output ones, converted to int8 type), I'm afraid this solution won't work for you. But since you posted that you're trying to obtain a custom model with the same structure as:
Yamnet model compiled for the Edge TPU
then I think this workaround would help you.
When you train your custom model following the basic tutorial, it is possible to export it both in .tflite format
model.export(models_path, tflite_filename='my_birds_model.tflite')
and as a full TensorFlow SavedModel:
model.export(models_path, export_format=[mm.ExportFormat.SAVED_MODEL, mm.ExportFormat.LABEL])
Then it is possible to convert the full tensorflow saved model to tflite format by using the following script:
import tensorflow as tf
import numpy as np
import glob
from scipy.io import wavfile

dataset_path = '/path/to/DATASET/testing/*/*.wav'
representative_data = []
saved_model_path = './saved_model'
samples = glob.glob(dataset_path)
input_size = 15600  # Yamnet model's input size

def representative_data_gen():
    for input_value in samples:
        sample_rate, audio_data = wavfile.read(input_value)
        audio_data = np.array(audio_data)
        # Normalization into the [-1, +1] range
        splitted_audio_data = tf.signal.frame(audio_data, input_size, input_size, pad_end=True, pad_value=0) / tf.int16.max
        yield [np.float32(splitted_audio_data[0])]

tf.compat.v1.enable_eager_execution()
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_path)
converter.experimental_new_converter = True  # needed if you're using tensorflow<=2.2
converter.optimizations = [tf.lite.Optimize.DEFAULT]
#converter.inference_input_type = tf.int8  # or tf.uint8
#converter.inference_output_type = tf.int8  # or tf.uint8
converter.representative_dataset = representative_data_gen
tflite_model = converter.convert()
open(saved_model_path + "/converted_model.tflite", "wb").write(tflite_model)
As you can see, the lines that tell the converter to change the input/output type are commented out. This is because the Yamnet model expects as input normalized audio samples in the [-1,+1] range, and the numerical representation must be float32. In fact, the compiled Yamnet model you posted uses the same dtype (float32) for its input and output layers.
That being said, you will end up with a tflite model converted from the full TensorFlow model produced by tflite-model-maker. The conversion will end with the following line:
fully_quantize: 0, inference_type: 6, input_inference_type: 0, output_inference_type: 0
and inference_type: 6 tells you the inference operations are suitable for being compiled for the Coral Edge TPU.
The last step is to compile the model. If you compile the model with the standard edgetpu_compiler command line:
edgetpu_compiler -s converted_model.tflite
the final model would have only 4 operations which run on the EdgeTPU:
Number of operations that will run on Edge TPU: 4
Number of operations that will run on CPU: 53
You have to add the optional flag -a, which enables multiple subgraphs (it is still experimental, though):
edgetpu_compiler -sa converted_model.tflite
After this you will have:
Number of operations that will run on Edge TPU: 44
Number of operations that will run on CPU: 13
And most of the model operations will be mapped to the Edge TPU, namely:
Operator            Count  Status
MUL                 1      Mapped to Edge TPU
DEQUANTIZE          4      Operation is working on an unsupported data type
SOFTMAX             1      Mapped to Edge TPU
GATHER              2      Operation not supported
COMPLEX_ABS         1      Operation is working on an unsupported data type
FULLY_CONNECTED     3      Mapped to Edge TPU
LOG                 1      Operation is working on an unsupported data type
CONV_2D             14     Mapped to Edge TPU
RFFT2D              1      Operation is working on an unsupported data type
LOGISTIC            1      Mapped to Edge TPU
QUANTIZE            3      Operation is otherwise supported, but not mapped due to some unspecified limitation
DEPTHWISE_CONV_2D   13     Mapped to Edge TPU
MEAN                1      Mapped to Edge TPU
STRIDED_SLICE       2      Mapped to Edge TPU
PAD                 2      Mapped to Edge TPU
RESHAPE             1      Operation is working on an unsupported data type
RESHAPE             6      Mapped to Edge TPU

Databricks MLFlow AutoML XGBoost can't predict_proba()

I used AutoML in Databricks Notebooks for a binary classification problem and the winning model flavor was XGBoost (big surprise).
The outputted model is of this variety:
mlflow.pyfunc.loaded_model:
artifact_path: model
flavor: mlflow.sklearn
run_id: 123456789
Any idea why when I use model.predict_proba(X), I get this response?
AttributeError: 'PyFuncModel' object has no attribute 'predict_proba'
I know it is possible to get the probabilities because ROC/AUC is a metric used for tuning the model. Any help would be amazing!
I had the same issue with a CatBoost model.
The way I solved it was by downloading the artifacts to a local directory:
import os
from catboost import CatBoostClassifier
from mlflow.tracking import MlflowClient

client = MlflowClient()
local_dir = "/dbfs/FileStore/user/models"
local_path = client.download_artifacts('run_id', "model", local_dir)

model_path = '/dbfs/FileStore/user/models/model/model.cb'
model = CatBoostClassifier()
model = model.load_model(model_path)
model.predict_proba(test_set)
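Alternatively, for the XGBoost run in the question, since the logged flavor is mlflow.sklearn you can skip the artifact download entirely: loading with the sklearn flavor instead of pyfunc returns the underlying estimator, which still exposes predict_proba. A minimal sketch, assuming the run_id 123456789 from the question:
import mlflow.sklearn

# Load the underlying scikit-learn/XGBoost estimator instead of the
# generic pyfunc wrapper, which only exposes predict()
model = mlflow.sklearn.load_model("runs:/123456789/model")
probabilities = model.predict_proba(X)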

Tensorflow/AI Cloud Platform: HyperTune trials failed to report the hyperparameter tuning metric

I'm using the tf.estimator API with TensorFlow 2.1 on Google AI Platform to build a DNN Regressor. To use AI Platform Training hyperparameter tuning, I followed Google's docs.
I used the following configuration parameters:
config.yaml:
trainingInput:
  scaleTier: BASIC
  hyperparameters:
    goal: MINIMIZE
    maxTrials: 2
    maxParallelTrials: 2
    hyperparameterMetricTag: rmse
    enableTrialEarlyStopping: True
    params:
    - parameterName: batch_size
      type: DISCRETE
      discreteValues:
      - 100
      - 200
      - 300
    - parameterName: lr
      type: DOUBLE
      minValue: 0.0001
      maxValue: 0.1
      scaleType: UNIT_LOG_SCALE
And to add the metric to my summary, I used the following code for my DNNRegressor:
def rmse(labels, predictions):
    pred_values = predictions['predictions']
    rmse = tf.keras.metrics.RootMeanSquaredError(name='root_mean_squared_error')
    rmse.update_state(labels, pred_values)
    return {'rmse': rmse}

def train_and_evaluate(hparams):
    ...
    estimator = tf.estimator.DNNRegressor(
        model_dir=output_dir,
        feature_columns=get_cols(),
        hidden_units=[max(2, int(FIRST_LAYER_SIZE * SCALE_FACTOR ** i))
                      for i in range(NUM_LAYERS)],
        optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
        config=run_config)
    estimator = tf.estimator.add_metrics(estimator, rmse)
According to Google's documentation, the add_metrics function creates a new estimator with the specified metric, which is then used as the hyperparameter metric. However, the AI Platform Training service doesn't recognise this metric:
[screenshot: Job details on AI Platform]
On running the code locally, the rmse metric does get outputted in the logs.
So, how do I make the metric available to the Training job on AI Platform using Estimators?
Additionally, there is an option of reporting the metrics through the cloudml-hypertune Python package. But it requires the value of the metric as one of the input arguments. How do I extract the metric from tf.estimator.train_and_evaluate function (since that's the function I use to train/evaluate my estimator) to input into the report_hyperparameter_tuning_metric function?
hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag='rmse',
    metric_value=??,
    global_step=1000
)
ETA: The logs show no error. The job reports that it completed successfully, even though the tuning itself fails.
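For the cloudml-hypertune route specifically, here is a sketch under the assumption that the job runs non-distributed (in that case tf.estimator.train_and_evaluate returns the final evaluation metrics dict); estimator, train_spec and eval_spec are from your existing training code:
import hypertune
import tensorflow as tf

# In local/non-distributed runs, train_and_evaluate returns
# (eval_metrics, export_results) from the final evaluation
eval_result, _ = tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag='rmse',
    metric_value=eval_result['rmse'],
    global_step=eval_result['global_step'])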

Evaluation metrics on Spark ML multiclass classification problem

I am looking for a multiclass classification example using Spark-Scala but I am unable to find one yet. Specifically, I want to train a classification model and see all the associated metrics on training and test data.
Does Spark ML (DataFrame based API) support confusion matrix on multi-class problems?
I am looking for Spark v2.2 and above examples. An end-to-end example would be really useful. I can't find a confusion matrix evaluation here -
https://spark.apache.org/docs/2.3.0/ml-classification-regression.html
This should be it:
import org.apache.spark.mllib.evaluation.MulticlassMetrics

val metrics = new MulticlassMetrics(predictionAndLabels)
println(metrics.confusionMatrix)
The classification metrics are documented here:
https://spark.apache.org/docs/2.3.0/mllib-evaluation-metrics.html
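For the DataFrame-based API specifically: it has no confusion-matrix evaluator in 2.x (hence the drop to mllib's MulticlassMetrics above), but aggregate metrics are available through MulticlassClassificationEvaluator. A sketch, assuming model is your fitted classifier and test your test DataFrame:
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val predictions = model.transform(test)
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy") // also "f1", "weightedPrecision", "weightedRecall"
println(evaluator.evaluate(predictions))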
Assuming that model is your trained model and test is the test set,
this is the code snippet for calculating the confusion matrix in Python:
import pandas as pd
from pyspark.mllib.evaluation import MulticlassMetrics
predictionAndLabels = model.transform(test).select('label', 'prediction')
metrics = MulticlassMetrics(predictionAndLabels.rdd.map(lambda x: tuple(map(float, x))))
confusion_matrix = metrics.confusionMatrix().toArray()
labels = [int(l) for l in metrics.call('labels')]
confusion_matrix = pd.DataFrame(confusion_matrix, index=labels, columns=labels)
Note that metrics.labels is not implemented in PySpark for some reason, so we call the Scala backend directly.