Custom scikit-learn pickling doesn't work inside a grid search - serialization

I have written a scikit-learn estimator. It has a parameter and a model_ attribute that is set by fit.
class MyEstimator(BaseEstimator, TransformerMixin):
def __init__(self, param="default"):
self.param = param
self.model_ = None
def fit(self, x, y):
# Sets the value of self.model_
I want to be able to pickle MyEstimator, but the model_ object I create cannot be serialized with pickle because it is a keras model. Following the example of the blog post "Pickling Keras Models" I added the following pickling handler methods to my class.
class MyEstimator(BaseEstimator, TransformerMixin):
def __getstate__(self):
state = super().__getstate__().copy()
with tempfile.NamedTemporaryFile(suffix=".hdf5", delete=True) as fd:
keras.models.save_model(self.model_, fd.name, overwrite=True)
state["model_"] = fd.read()
return state
def __setstate__(self, state):
super().__setstate__(state)
with tempfile.NamedTemporaryFile(suffix=".hdf5", delete=True) as fd:
fd.write(state["model_"])
fd.flush()
self.__dict__["model_"] = keras.models.load_model(fd.name)
This replaces the unpickleable model_ member with a representation generated by keras' serializer that can be pickled. Using this customization I can call fit, serialize and deserialize, and get back my original model. Everything works.
e = MyEstimator()
e.fit(x, y)
with open("myfile.pk", mode="wb") as f:
pickle.dump(e, f)
with open("myfile.pk", mode="rb") as f:
pickle.load(f) # Returns a copy of e
However, serialization does not work when I try to put MyEstimator in a pipeline and pickle the result of a GridSearchCV.
s = GridSearchCV(Pipeline([
# ...
("estimator", MyEstimator())
# ...
]))
s.fit(x, y)
with open("myfile.pk", mode="wb") as f:
pickle.dump(s, f)
During the pickle.dump call I expect to see MyEstimator.__getstate__ get called with a fitted self.model_ object. (This is what happens when I serialize the model by itself, outside the grid search.) Instead self.model_ is None, so I am unable to serialize the best_estimator_ generated by my grid search.
It looks like grid search serialization is instantiating a new MyEstimator object instead of using the one that was in the pipeline. This seems wrong to me. I've looked through the scikit-learn code, but can't see where this is happening.
Is this a bug in scikit-learn, or am I doing something wrong?
(Note: keras does have a wrapper layer that can convert some keras models into scikit-learn estimators, but I can't use that here for other reasons and I'm not sure it wouldn't just have the same problem.)

The search object contains a mixed of MyEstimator objects, some of which have not had fit called on them. The fix is to check if model_ is None before trying to serialize it with the keras tools.
class MyEstimator(BaseEstimator, TransformerMixin):
def __getstate__(self):
state = super().__getstate__().copy()
if self.model_ is not None:
with tempfile.NamedTemporaryFile(suffix=".hdf5", delete=True) as fd:
keras.models.save_model(self.model_, fd.name, overwrite=True)
state["model_"] = fd.read()
return state
def __setstate__(self, state):
super().__setstate__(state)
if self.model_ is not None:
with tempfile.NamedTemporaryFile(suffix=".hdf5", delete=True) as fd:
fd.write(state["model_"])
fd.flush()
self.__dict__["model_"] = keras.models.load_model(fd.name)
I don't know why there would be any unfitted models in the search object after the grid search had completed, but there are.

Related

How to avoid memory leakage in an autoregressive model within tensorflow

Recently, I am training a LSTM with attention mechanism for regressionin tensorflow 2.9 and I met an problem during training with model.fit():
At the beginning, the training time is okay, like 7s/step. However, it was increasing during the process and after several steps, like 1000, the value might be 50s/step. Here below is a part of the code for my model:
class AttentionModel(tf.keras.Model):
def __init__(self, encoder_output_dim, dec_units, dense_dim, batch):
super().__init__()
self.dense_dim = dense_dim
self.batch = batch
encoder = Encoder(encoder_output_dim)
decoder = Decoder(dec_units,dense_dim)
self.encoder = encoder
self.decoder = decoder
def call(self, inputs):
# Creat a tensor to record the result
tempt = list()
encoder_output, encoder_state = self.encoder(inputs)
new_features = np.zeros((self.batch, 1, 1))
dec_initial_state = encoder_state
for i in range(6):
dec_inputs = DecoderInput(new_features=new_features, enc_output=encoder_output)
dec_result, dec_state = self.decoder(dec_inputs, dec_initial_state)
tempt.append(dec_result.logits)
new_features = dec_result.logits
dec_initial_state = dec_state
result=tf.concat(tempt,1)
return result
In the official documents for tf.function, I notice: "Don't rely on Python side effects like object mutation or list appends".
Since I use a dynamic python list with append() to record the intermediate variables, I guess each time during training, a new tf.graph was added. Is the reason my training is getting slower and slower?
Additionally, what should I use instead of python list to avoid this? I have tried with a numpy.zeros matrix but it will lead to another problem:
tempt = np.zeros(shape=(1,6))
...
for i in range(6):
dec_inputs = DecoderInput(new_features=new_features, enc_output=encoder_output)
dec_result, dec_state = self.decoder(dec_inputs, dec_initial_state)
tempt[i]=(dec_result.logits)
...
Cannot convert a symbolic tf.Tensor (decoder/dense_3/BiasAdd:0) to a numpy array. This error may indicate that you're trying to pass a Tensor to a NumPy call, which is not supported.

Error message when trying to use huggingface pretrained Tokenizer (roberta-base)

I am pretty new at this, so there might be something I am missing completely, but here is my problem: I am trying to create a Tokenizer class that uses the pretrained tokenizer models from Huggingface. I would then like to use this class in a larger transformer model to tokenize my input data. Here is the class code
class Roberta(MyTokenizer):
from transformers import AutoTokenizer
from transformers import RobertaTokenizer
class Roberta(MyTokenizer):
def build(self, *args, **kwargs):
self.max_length = self.phd.max_length
self.untokenized_data = self.questions + self.answers
def tokenize_and_filter(self):
# Initialize the tokenizer with a pretrained model
Tokenizer = AutoTokenizer.from_pretrained('roberta')
tokenized_inputs, tokenized_outputs = [], []
inputs = Tokenizer(self.questions, padding=True)
outputs = Tokenizer(self.answers, padding=True)
tokenized_inputs = inputs['input_ids']
tokenized_outputs = outputs['input_ids']
return tokenized_inputs, tokenized_outputs
When I call the function tokenize_and_filter in my Transformer model as below
questions = self.get_tokenizer().tokenize_and_filter
answers = self.get_tokenizer().tokenize_and_filter
print(questions)
and I try to print the tokenized data, I get this message:
<bound method Roberta.tokenize_and_filter of <MyTokenizer.Roberta.Roberta object at
0x000002779A9E4D30>>
It appears that the function returns a method instead of a list or a tensor - I've tried passing the parameter 'return_tensors='tf'', I have tried using the tokenizer.encode() method, I have tried both with AutoTokenizer and with RobertaTokenizer, I have tried the batch_encode_plus() method, nothing seems to work.
Please help!
it seems this was a really stupid error on my part, I forgot to put parentheses when calling the function
questions = self.get_tokenizer().tokenize_and_filter
answers = self.get_tokenizer().tokenize_and_filter
should actually be
questions = self.get_tokenizer().tokenize_and_filter()
answers = self.get_tokenizer().tokenize_and_filter()
and it works this way :)

Weights of pre-trained BERT model not initialized

I am using the Language Interpretability Toolkit (LIT) to load and analyze a BERT model that I pre-trained on an NER task.
However, when I'm starting the LIT script with the path to my pre-trained model passed to it, it fails to initialize the weights and tells me:
modeling_utils.py:648] loading weights file bert_remote/examples/token-classification/Data/Models/results_21_03_04_cleaned_annotations/04.03._8_16_5e-5_cleaned_annotations/04-03-2021 (15.22.23)/pytorch_model.bin
modeling_utils.py:739] Weights of BertForTokenClassification not initialized from pretrained model: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
modeling_utils.py:745] Weights from pretrained model not used in BertForTokenClassification: ['bert.embeddings.position_ids']
It then simply uses the bert-base-german-cased version of BERT, which of course doesn't have my custom labels and thus fails to predict anything. I think it might have to do with PyTorch, but I can't find the error.
If relevant, here is how I load my dataset into CoNLL 2003 format (modification of the dataloader scripts found here):
def __init__(self):
# Read ConLL Test Files
self._examples = []
data_path = "lit_remote/lit_nlp/examples/datasets/NER_Data"
with open(os.path.join(data_path, "test.txt"), "r", encoding="utf-8") as f:
lines = f.readlines()
for line in lines[:2000]:
if line != "\n":
token, label = line.split(" ")
self._examples.append({
'token': token,
'label': label,
})
else:
self._examples.append({
'token': "\n",
'label': "O"
})
def spec(self):
return {
'token': lit_types.Tokens(),
'label': lit_types.SequenceTags(align="token"),
}
And this is how I initialize the model and start the LIT server (modification of the simple_pytorch_demo.py script found here):
def __init__(self, model_name_or_path):
self.tokenizer = transformers.AutoTokenizer.from_pretrained(
model_name_or_path)
model_config = transformers.AutoConfig.from_pretrained(
model_name_or_path,
num_labels=15, # FIXME CHANGE
output_hidden_states=True,
output_attentions=True,
)
# This is a just a regular PyTorch model.
self.model = _from_pretrained(
transformers.AutoModelForTokenClassification,
model_name_or_path,
config=model_config)
self.model.eval()
## Some omitted snippets here
def input_spec(self) -> lit_types.Spec:
return {
"token": lit_types.Tokens(),
"label": lit_types.SequenceTags(align="token")
}
def output_spec(self) -> lit_types.Spec:
return {
"tokens": lit_types.Tokens(),
"probas": lit_types.MulticlassPreds(parent="label", vocab=self.LABELS),
"cls_emb": lit_types.Embeddings()
This actually seems to be expected behaviour. In the documentation of the GPT models the HuggingFace team writes:
This will issue a warning about some of the pretrained weights not being used and some weights being randomly initialized. That’s because we are throwing away the pretraining head of the BERT model to replace it with a classification head which is randomly initialized.
So it seems to not be a problem for the fine-tuning. In my use case described above it worked despite the warning as well.

Tensorflow: List of available **kwargs

In Tensorflow API documents, I have difficulty to find all available keyword arguments.
For example, "GlobalMaxPooling1D" layer in Tensorflow API:
tf.keras.layers.GlobalMaxPool1D(
data_format='channels_last', **kwargs
)
I am curious to know where to find the information on all available keyword arguments (**kwargs).
Or is there any way to list all available keyword arguments from a built-in function or a built-in command?
In the case of tf.keras.layers.Layer subclasses, like GlobalMaxPooling1D, valid keyword arguments are
allowed_kwargs = {
'input_dim',
'input_shape',
'batch_input_shape',
'batch_size',
'weights',
'activity_regularizer',
'autocast',
}
In the __init__ of Layer subclasses, you will see a call like super().__init__(**kwargs), which passes the keyword arguments you enter to the initializer of the base Layer class.
For example:
class GlobalPooling1D(Layer):
"""Abstract class for different global pooling 1D layers."""
def __init__(self, data_format='channels_last', **kwargs):
super(GlobalPooling1D, self).__init__(**kwargs)
self.input_spec = InputSpec(ndim=3)
self.data_format = conv_utils.normalize_data_format(data_format)

Is there any solution for failing to load model in tensorflow2.3?

I try to use tf.keras.models.load_model to load saved model in tensorflow 2.3.
However, I got the same error in
https://github.com/tensorflow/tensorflow/issues/41535
It seems an important function. But this issue is still not solved. Does anyone know if there is any alternative method to implement the same result?
I found an alternative method to load custom model in tensorflow 2.3. You need to do some following changes. I will explain by some code snapshots
for __init__() of custom model. Before,
def __init__(self, mask_ratio=0.1, hyperparam=0.1, **kwargs):
layers = []
layer_configs = {}
if 'layers' in kwargs.keys():
layer_configs = kwargs['layers']
for config in layer_configs:
layer = tf.keras.layers.deserialize(config)
layers.append(layer)
super(custom_model, self).__init__(layers) # custom_model is your custom model class
self.mask_ratio = mask_ratio
self.hyperparam = hyperparam
...
After,
def __init__(self, mask_ratio=0.1, hyperparam=0.1, **kwargs):
super(custom_model, self).__init__() # custom_model is your custom model class
self.mask_ratio = mask_ratio
self.hyperparam = hyperparam
...
define two functions in your custom model class
def get_config(self):
config = {
'mask_ratio': self.mask_ratio,
'hyperparam': self.hyperparam
}
base_config = super(custom_model, self).get_config()
return dict(list(config.items()) + list(base_config.items()))
#classmethod
def from_config(cls, config):
#config = cls().get_config()
return cls(**config)
After finishing training, save model using 'h5' format
model.save(file_path, save_format='h5')
Finally, load model as following codes,
model = tf.keras.models.load_model(model_path, compile=False, custom_objects={'custom_model': custom_model})