Error writing XGBoost Classifier to pmml with sklearn2pmml - xgboost

I want to save my XGBoost model as PMML using sklearn2pmml. I'm using Python 3.7.3 with scikit-learn 0.20.3 and sklearn2pmml 0.53.0. My data is mainly binary, with just 3 columns of continuous data. I'm running my notebook in Databricks and converting my Spark dataframe to a pandas dataframe. Code snippet below:
import xgboost as xgb
from sklearn_pandas import DataFrameMapper
from sklearn.compose import ColumnTransformer
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.decoration import ContinuousDomain
from sklearn.preprocessing import StandardScaler
X = pdf[continuous_features + numericCols]
y = pdf["Label"]
mapper = DataFrameMapper(
    [([cont_column], [ContinuousDomain(), StandardScaler()]) for cont_column in continuous_features] +
    [([c for c in numericCols], None)]  # no transformation
)
clf = xgb.XGBClassifier(objective='multi:softprob', eval_metric='auc', num_class=2,
                        n_jobs=6, max_delta_step=1, min_child_weight=14, gamma=1.5, subsample=0.8,
                        colsample_bytree=0.5, max_depth=10, learning_rate=0.1)
pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("estimator", clf)
])
pipeline.fit(X,y.values.reshape(-1,))
sklearn2pmml(pipeline, "xgb_V1.pmml", with_repr = True)
The pipeline fits to the data, generates a score and prediction with pipeline.score(X,y) and pipeline.predict(X), but when I try to write it to pmml, I get the following error:
Standard output is empty
Standard error:
Feb 21, 2020 1:53:30 PM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Feb 21, 2020 1:53:30 PM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 47 ms.
Feb 21, 2020 1:53:30 PM org.jpmml.sklearn.Main run
INFO: Converting..
Feb 21, 2020 1:53:30 PM sklearn2pmml.pipeline.PMMLPipeline initTargetFields
WARNING: Attribute 'sklearn2pmml.pipeline.PMMLPipeline.target_fields' is not set. Assuming y as the name of the target field
Feb 21, 2020 1:53:30 PM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: Attribute 'xgboost.sklearn.XGBClassifier._le' has an unsupported value (Python class xgboost.compat.XGBoostLabelEncoder)
at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:45)
at org.jpmml.sklearn.PyClassDict.get(PyClassDict.java:82)
at sklearn.LabelEncoderClassifier.getLabelEncoder(LabelEncoderClassifier.java:40)
at sklearn.LabelEncoderClassifier.getClasses(LabelEncoderClassifier.java:34)
at sklearn.ClassifierUtil.getClasses(ClassifierUtil.java:32)
at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:133)
at org.jpmml.sklearn.Main.run(Main.java:145)
at org.jpmml.sklearn.Main.main(Main.java:94)
Caused by: java.lang.ClassCastException: Cannot cast net.razorvine.pickle.objects.ClassDict to sklearn.preprocessing.LabelEncoder
at java.lang.Class.cast(Class.java:3369)
at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)
... 7 more
Exception in thread "main" java.lang.IllegalArgumentException: Attribute 'xgboost.sklearn.XGBClassifier._le' has an unsupported value (Python class xgboost.compat.XGBoostLabelEncoder)
at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:45)
at org.jpmml.sklearn.PyClassDict.get(PyClassDict.java:82)
at sklearn.LabelEncoderClassifier.getLabelEncoder(LabelEncoderClassifier.java:40)
at sklearn.LabelEncoderClassifier.getClasses(LabelEncoderClassifier.java:34)
at sklearn.ClassifierUtil.getClasses(ClassifierUtil.java:32)
at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:133)
at org.jpmml.sklearn.Main.run(Main.java:145)
at org.jpmml.sklearn.Main.main(Main.java:94)
Caused by: java.lang.ClassCastException: Cannot cast net.razorvine.pickle.objects.ClassDict to sklearn.preprocessing.LabelEncoder
at java.lang.Class.cast(Class.java:3369)
at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)
I thought it might be a version incompatibility issue between Sklearn and sklearn2pmml as per this post https://github.com/jpmml/sklearn2pmml/issues/197, but I think the versions I have installed should be ok. Any ideas on what's going on with this? Thanks in advance

It is probably an XGBoost package version issue. The SkLearn2PMML package expects the label encoder (the XGBClassifier._le attribute) to be a "normal" Scikit-Learn label encoder class (sklearn.preprocessing.(label|_label).LabelEncoder), but in your case it is something different (xgboost.compat.XGBoostLabelEncoder).
In which XGBoost package version was xgboost.compat.XGBoostLabelEncoder introduced? It is either very old or very new.
In any case, please open a feature request with the JPMML-SkLearn project to have this issue sorted out.
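To help answer the version question, here is a minimal sketch that reports which versions are installed. The helper name installed_version is hypothetical, and the distribution names are assumed to be the ones published on PyPI. Note that importlib.metadata is only in the standard library from Python 3.8 onward; on the asker's Python 3.7, printing xgb.__version__ is the simpler equivalent.

```python
from importlib import metadata  # standard library since Python 3.8


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Assumed PyPI distribution names; adjust to your environment.
for name in ("xgboost", "scikit-learn", "sklearn2pmml"):
    print(name, installed_version(name))
```

Including this output in the feature request would let the maintainers see which XGBoost release produced the unexpected label encoder class.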

debugger throws error, running script doesn't - pandas boxplot

The title is pretty self-explanatory.
Here is a minimal reproducible example (just make an .xlsx file with a column id, a column nb_inf, and another called Grantham, and fill it with some integer data).
import matplotlib.pyplot as plt
import pandas as pd
def loader() -> pd.DataFrame:
    df = pd.read_excel("your_file.xlsx", "Feuil1")
    df = df.set_index('id')
    return df


if __name__ == "__main__":
    df: pd.DataFrame = loader()
    for column in df.columns:
        if "Grantham" in column:
            print(column)
            df.boxplot(column=column, by='nb_inf', figsize=(5, 6))
            plt.savefig(f"boxplots/{column}.png")
            plt.close()
Running it through the Run command works perfectly well. But running it with the debugger raises the error TypeError: 'NoneType' object is not callable.
I'm using Python 3.10.2 and PyCharm 2022.1.3 (Community Edition)
More details about my PyCharm build:
Build #PC-221.5921.27, built on June 21, 2022
Runtime version: 11.0.15+10-b2043.56 amd64
VM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.
Windows 11 10.0
GC: G1 Young Generation, G1 Old Generation
Memory: 2030M
Cores: 16
Non-Bundled Plugins:
com.chesterccw.excelreader (2022.1.3)
Works in VSCode, both in debug and standard execution mode.
Have you considered switching to a better IDE than PyCharm such as VSCode?
On a more serious note, check your default debugger settings in PyCharm; there is likely an option enabled there that you don't want.

ModuleNotFoundError: No module named 'numpy.random.bit_generator' while importing sklearn

I installed OpenCV, TensorFlow, and other tools on my MacBook Air M1 by following this tutorial:
After installing, OpenCV and TensorFlow work fine, but when I try to import sklearn the error in the title occurs. Here is the error:
File ~/miniforge3/envs/ml/lib/python3.8/site-packages/scipy/stats/distributions.py:11, in <module>
8 from ._distn_infrastructure import (rv_discrete, rv_continuous, rv_frozen)
10 from . import _continuous_distns
---> 11 from . import _discrete_distns
13 from ._continuous_distns import *
14 from ._discrete_distns import *
File ~/miniforge3/envs/ml/lib/python3.8/site-packages/scipy/stats/_discrete_distns.py:21, in <module>
17 from ._distn_infrastructure import (
18 rv_discrete, _ncx2_pdf, _ncx2_cdf, get_distribution_names,
19 _check_shape)
20 import scipy.stats._boost as _boost
---> 21 from ._biasedurn import (_PyFishersNCHypergeometric,
22 _PyWalleniusNCHypergeometric,
23 _PyStochasticLib3)
25 class binom_gen(rv_discrete):
26 r"""A binomial discrete random variable.
27
28 %(before_notes)s
(...)
51
52 """
File _biasedurn.pyx:1, in init scipy.stats._biasedurn()
ModuleNotFoundError: No module named 'numpy.random.bit_generator'
The tutorial I was following specified exact versions of numpy and Python. I looked around for help, and the suggestions were to update numpy. I'm not sure whether I should do that, because it may break other libraries like OpenCV.
Versions:
python 3.8.6
numpy 1.18.5
scikit-learn 1.1.1
scipy 1.8.1
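Before changing any versions, one way to confirm that the installed NumPy really lacks the module named in the traceback is a quick check like the sketch below. The helper name module_available is hypothetical; it only reports whether the module can be located and does not say which NumPy release would provide it.

```python
import importlib.util


def module_available(dotted_name):
    """Return True if the module can be located in the current environment."""
    try:
        # find_spec imports the parent packages, but not the target module itself.
        return importlib.util.find_spec(dotted_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (for example numpy itself) is missing.
        return False


print(module_available("numpy.random.bit_generator"))
```

If this prints False inside the ml environment, the installed SciPy build expects a module that the installed NumPy does not ship, which points to a NumPy/SciPy version mismatch rather than a problem in scikit-learn itself.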

How to load a model using Tensorflow Hub and make a prediction?

This should be a simple task: download a model saved in tensorflow_hub format, load it using tensorflow_hub, and use it.
This is the model I am trying to use (simCLR stored in Google Cloud): https://console.cloud.google.com/storage/browser/simclr-checkpoints/simclrv2/pretrained/r50_1x_sk0;tab=objects?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false
I downloaded the /hub folder as they say, using
gsutil -m cp -r \
"gs://simclr-checkpoints/simclrv2/pretrained/r50_1x_sk0/hub" \
.
The /hub folder contains the files:
/saved_model.pb
/tfhub_module.pb
/variables/variables.index
/variables/variables.data-00000-of-00001
So far so good.
Now in python3, tensorflow2, tensorflow_hub 0.12 I run the following code:
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
path_to_hub = '/home/my_name/my_path/simclr/hub'
# Attempt 1
m = tf.keras.models.Sequential([hub.KerasLayer(path_to_hub, input_shape=(224,224,3))])
# Attempt 2
m = tf.keras.models.Sequential(hub.KerasLayer(path_to_hub))
m.build(input_shape=[None,224,224,3])
# Attempt 3
m = hub.KerasLayer(hub.load(path_to_hub))
# Toy Data Test
X = np.random.random((1,244,244,3)).astype(np.float32)
y = m.predict(X)
None of these 3 options to load the hub model work, with the following errors:
Attempt 1 :
ValueError: Error when checking input: expected keras_layer_2_input to have shape (224, 224, 3) but got array with shape (244, 244, 3)
Attempt 2:
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node sequential_3/keras_layer_3/StatefulPartitionedCall/base_model/conv2d/Conv2D}}]] [Op:__inference_keras_scratch_graph_46402]
Function call stack:
keras_scratch_graph
Attempt 3:
ValueError: Expected a string, got <tensorflow.python.training.tracking.tracking.AutoTrackable object at 0x7fa71c7a2dd0>
These 3 attempts all use code taken from tensorflow_hub tutorials and repeated in other answers on Stack Overflow, but none of them works, and I don't know how to proceed from those error messages.
Appreciate any help, thanks.
Update 1:
The same issues happen if I try with this ResNet50 hub:
https://storage.cloud.google.com/simclr-gcs/checkpoints/ResNet50_1x.zip
As @Frightera pointed out, there was an error with the input shapes: the toy data was 244x244 while the model expects 224x224. The error in "Attempt 2" was solved by allowing memory growth on the selected GPU. "Attempt 3" still does not work, but at least there are two methods for loading and using a model saved in /hub format:
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
gpus = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
tf.config.experimental.set_memory_growth(gpus[0], True)
hubmod = 'https://tfhub.dev/google/imagenet/mobilenet_v2_035_96/feature_vector/5'
# Alternative 1 - Works!
m = tf.keras.models.Sequential([hub.KerasLayer(hubmod, input_shape=(96,96,3))])
print(m.summary())
# Alternative 2 - Works!
m = tf.keras.models.Sequential(hub.KerasLayer(hubmod))
m.build(input_shape=[None, 96,96,3])
print(m.summary())
# Alternative 3 - Doesn't work
#m = hub.KerasLayer(hub.load(hubmod))
#m.build(input_shape=[None, 96,96,3])
#print(m.summary())
# Test
X = np.random.random((1,96,96,3)).astype(np.float32)
y = m.predict(X)
print(y.shape)

Different behaviour of dataclass default_factory to generate list

I'm quite new to Python, so please excuse me if this question contains some newbie misunderstandings, but I've failed to find the answer by googling:
On my personal laptop running Python 3.9.7 on Windows 11 this code is working without errors.
from dataclasses import dataclass, field
@dataclass
class SomeDataClass:
    somelist: list[str] = field(default_factory=lambda:['foo', 'bar'])


if __name__ == '__main__':
    instance = SomeDataClass()
    print(instance)
But when at work running Python 3.8.5 on Windows 10 I get the following error:
File "c:\...\test_dataclass.py", line 13, in SomeDataClass
somelist: list[str] = field(default_factory=lambda:['foo', 'bar'])
TypeError: 'type' object is not subscriptable
I'd like to understand why this behaves differently and what I could do to make it work.
I would expect dataclasses to behave similarly on both computers.
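You have already intuited the reason: subscripting built-in types such as list[str] is a new feature in version 3.9. You can see it in the What's New article for 3.9.
On 3.8 the expression list[str] still fails when it is evaluated, but you can postpone the evaluation of annotations so that it is stored as a string and never evaluated when the class is defined. You can enable that by putting this import at the top of your module: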
from __future__ import annotations
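If adding that import is not an option, a sketch of an alternative that should also run on Python 3.8 is to annotate with typing.List instead of the built-in list. This is a swapped-in technique rather than what the answer above proposes, and the class is simply the one from the question.

```python
from dataclasses import dataclass, field
from typing import List  # typing.List is subscriptable on Python 3.8 and earlier


@dataclass
class SomeDataClass:
    # default_factory builds a fresh list for every instance.
    somelist: List[str] = field(default_factory=lambda: ['foo', 'bar'])


instance = SomeDataClass()
print(instance)
```

Either way, the behaviour of default_factory itself is the same on both machines; only the annotation syntax differs between 3.8 and 3.9.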

Error crs when reading a shapefile with geopandas in jupyter

I'm trying to read a shapefile with geopandas:
Quito_full =gpd.read_file('./shapefiles/administraciones_zonales.shp')
Simple as that, but I keep getting the following error:
CRSError: Invalid projection: epsg:32717: (Internal Proj Error: proj_create: SQLite error on SELECT name, coordinate_system_auth_name, coordinate_system_code, geodetic_crs_auth_name, geodetic_crs_code, conversion_auth_name, conversion_code, area_of_use_auth_name, area_of_use_code, text_definition, deprecated FROM projected_crs WHERE auth_name = ? AND code = ?: no such column: area_of_use_auth_name)
These are my versions:
print(sys.version)
3.8.12 (default, Oct 12 2021, 03:01:40) [MSC v.1916 64 bit (AMD64)]
import pyproj
print(pyproj.__version__)
2.6.1.post1
import geopandas
print(geopandas.__version__)
0.9.0
Is there any way to fix this error? I have tried everything!
I downloaded the shapefile from here: https://hub.arcgis.com/datasets/esrimarketing::administraciones-zonales/explore?location=-0.167571%2C-78.559309%2C10.38
It is in epsg:4326, and I then projected it to epsg:32717 with no issues.
Your pyproj and geopandas versions are a bit outdated, and your error points primarily to an issue with pyproj.
from pathlib import Path
import geopandas as gpd
gdf = gpd.read_file(list(Path.home().joinpath("Downloads").glob("**/Administraciones_Zonales.shp"))[0])
gdf.to_crs("epsg:32717").explore()
Versions:
import pyproj, sys
print(gpd.__version__, pyproj.__version__, sys.version)
0.10.2 3.3.0 3.9.10 (main, Jan 15 2022, 11:48:00)
[Clang 13.0.0 (clang-1300.0.29.3)]