Error message when trying to use huggingface pretrained Tokenizer (roberta-base) - tokenize

I am pretty new at this, so there might be something I am missing completely, but here is my problem: I am trying to create a Tokenizer class that uses the pretrained tokenizer models from Huggingface. I would then like to use this class in a larger transformer model to tokenize my input data. Here is the class code
from transformers import AutoTokenizer
from transformers import RobertaTokenizer

class Roberta(MyTokenizer):

    def build(self, *args, **kwargs):
        self.max_length = self.phd.max_length
        self.untokenized_data = self.questions + self.answers

    def tokenize_and_filter(self):
        # Initialize the tokenizer with a pretrained model
        Tokenizer = AutoTokenizer.from_pretrained('roberta-base')
        tokenized_inputs, tokenized_outputs = [], []
        inputs = Tokenizer(self.questions, padding=True)
        outputs = Tokenizer(self.answers, padding=True)
        tokenized_inputs = inputs['input_ids']
        tokenized_outputs = outputs['input_ids']
        return tokenized_inputs, tokenized_outputs
When I call the function tokenize_and_filter in my Transformer model as below
questions = self.get_tokenizer().tokenize_and_filter
answers = self.get_tokenizer().tokenize_and_filter
print(questions)
and I try to print the tokenized data, I get this message:
<bound method Roberta.tokenize_and_filter of <MyTokenizer.Roberta.Roberta object at 0x000002779A9E4D30>>
It appears that the function returns a method instead of a list or a tensor. I've tried passing the parameter return_tensors='tf', I have tried using the tokenizer.encode() method, I have tried both AutoTokenizer and RobertaTokenizer, and I have tried the batch_encode_plus() method; nothing seems to work.
Please help!

It seems this was a really stupid error on my part: I forgot to put the parentheses when calling the function.
questions = self.get_tokenizer().tokenize_and_filter
answers = self.get_tokenizer().tokenize_and_filter
should actually be
questions = self.get_tokenizer().tokenize_and_filter()
answers = self.get_tokenizer().tokenize_and_filter()
and it works this way :)
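For what it's worth, the message in the question is just Python's default representation of a bound method. A minimal sketch of the difference, reusing the same hypothetical get_tokenizer() call from the model:

tok = self.get_tokenizer()

print(tok.tokenize_and_filter)    # without (): the bound method object itself
print(tok.tokenize_and_filter())  # with (): the (tokenized_inputs, tokenized_outputs) tuple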

Related

Why spacy morphologizer doesn't work when we use a custom tokenizer?

I don't understand why, when I'm doing this
import spacy
from copy import deepcopy

nlp = spacy.load("fr_core_news_lg")

class MyTokenizer:
    def __init__(self, tokenizer):
        self.tokenizer = deepcopy(tokenizer)

    def __call__(self, text):
        return self.tokenizer(text)

nlp.tokenizer = MyTokenizer(nlp.tokenizer)
doc = nlp("Un texte en français.")
Tokens don't have any morph assigned
print([tok.morph for tok in doc])
> ['','','','','']
Is this behavior expected? If yes, why? (spaCy v3.0.7)
The pipeline expects nlp.vocab and nlp.tokenizer.vocab to refer to the exact same Vocab object, which isn't the case after running deepcopy.
I admit that I'm not entirely sure off the top of my head why you end up with empty analyses instead of more specific errors, but I think the MorphAnalysis objects, which are stored centrally in the vocab in vocab.morphology, end up out-of-sync between the two vocabs.
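For reference, a minimal sketch of one way to avoid the mismatch, assuming the only goal is to wrap the existing tokenizer: keep a reference to it instead of deep-copying it, so the tokenizer's vocab and nlp.vocab stay the same Vocab object.

import spacy

nlp = spacy.load("fr_core_news_lg")

class MyTokenizer:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer  # keep a reference instead of deepcopy

    def __call__(self, text):
        return self.tokenizer(text)

nlp.tokenizer = MyTokenizer(nlp.tokenizer)
doc = nlp("Un texte en français.")
print([tok.morph for tok in doc])  # the morph analyses should now be populated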

AttributeError: 'Tensor' object has no attribute 'numpy' eager execution is enabled using version 2.4.1

I've been trying to convert a generator I built to a tf.data.Dataset.
I've come far and now I have something simple like this
import tensorflow as tf

def parse_image(filename):
    file = tf.io.read_file(filename)  # this will work only with filename as tensor
    image = tf.image.decode_image(file)
    return image

def transform_img(img):
    # transforms_train is defined elsewhere (e.g. an albumentations pipeline)
    img = parse_image(img).numpy()
    img = transforms_train(image=img)["image"]
    return img
transform_img works as expected when I call it on a filename itself, like:
plt.imshow(transform_img(array_of_filenames[0]))
but when I map it on a dataset
dataset = tf.data.Dataset.from_tensor_slices(array_of_filenames)
dataset = dataset.map(transform_img)
I get the error in the title.
I am doing something silly again aren't I?
Thanks for helping!
It is not possible to use numpy inside the map function of a TensorFlow dataset, because map traces the function as a graph. Instead, you need to wrap the function in tf.py_function or tf.numpy_function. So it should look like the following:
dataset = dataset.map(lambda item: tf.py_function(transform_img, [item], [tf.float32]))
The first argument of py_function is the preprocessing function you want, the second argument is the list of arguments to pass to that function, and the final argument is the dtype (or list of dtypes) of the function's return values. (The same applies to tf.numpy_function.)
I don't remember reading this in the documentation, but rather in a tutorial; you can find it here.
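For context, a slightly fuller sketch under the same assumptions (transform_img returns a float32 image array; the wrapper name tf_transform_img and the HWC shape are illustrative). tf.py_function discards static shape information, so the shape is restored by hand afterwards:

import tensorflow as tf

def tf_transform_img(filename):
    [img] = tf.py_function(transform_img, [filename], [tf.float32])
    img.set_shape([None, None, 3])  # assumed HWC image; adjust to your data
    return img

dataset = tf.data.Dataset.from_tensor_slices(array_of_filenames)
dataset = dataset.map(tf_transform_img)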

Is there any solution for failing to load model in tensorflow2.3?

I am trying to use tf.keras.models.load_model to load a saved model in TensorFlow 2.3.
However, I get the same error as in
https://github.com/tensorflow/tensorflow/issues/41535
It seems like an important function, but this issue is still not solved. Does anyone know of an alternative method to achieve the same result?
I found an alternative method to load a custom model in TensorFlow 2.3. You need to make the following changes, which I will explain with some code snippets.
For __init__() of the custom model. Before:
def __init__(self, mask_ratio=0.1, hyperparam=0.1, **kwargs):
    layers = []
    layer_configs = {}
    if 'layers' in kwargs.keys():
        layer_configs = kwargs['layers']
    for config in layer_configs:
        layer = tf.keras.layers.deserialize(config)
        layers.append(layer)
    super(custom_model, self).__init__(layers)  # custom_model is your custom model class
    self.mask_ratio = mask_ratio
    self.hyperparam = hyperparam
    ...
After,
def __init__(self, mask_ratio=0.1, hyperparam=0.1, **kwargs):
    super(custom_model, self).__init__()  # custom_model is your custom model class
    self.mask_ratio = mask_ratio
    self.hyperparam = hyperparam
    ...
Define two functions in your custom model class:
def get_config(self):
    config = {
        'mask_ratio': self.mask_ratio,
        'hyperparam': self.hyperparam
    }
    base_config = super(custom_model, self).get_config()
    return dict(list(config.items()) + list(base_config.items()))

@classmethod
def from_config(cls, config):
    # config = cls().get_config()
    return cls(**config)
After finishing training, save the model using the 'h5' format:
model.save(file_path, save_format='h5')
Finally, load the model as follows:
model = tf.keras.models.load_model(model_path, compile=False, custom_objects={'custom_model': custom_model})
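For illustration, a hedged end-to-end sketch of the pattern above; custom_model is assumed to build its layers in __init__, and x_train / y_train are placeholder names for your data:

model = custom_model(mask_ratio=0.2, hyperparam=0.05)
model.compile(optimizer='adam', loss='mse')
model.fit(x_train, y_train, epochs=1)
model.save('custom_model.h5', save_format='h5')

restored = tf.keras.models.load_model(
    'custom_model.h5',
    compile=False,
    custom_objects={'custom_model': custom_model})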

Printing value of tensorflow variable inside object

I am trying to print the value of a TensorFlow variable that is defined inside an object. To better illustrate my issue, I am currently trying to run the monodepth library. It has 2 main files, dataloader and main. Basically, the dataloader iterates over a text file of image file paths:
class MonodepthDataloader(object):
    """monodepth dataloader"""

    def __init__(self, data_path, filenames_file, params, dataset, mode):
        self.data_path = data_path
        self.params = params
        self.dataset = dataset
        self.mode = mode
        self.left_image_batch = None
        self.right_image_batch = None

        input_queue = tf.train.string_input_producer([filenames_file], shuffle=False)
        line_reader = tf.TextLineReader()
        _, line = line_reader.read(input_queue)
        split_line = tf.string_split([line]).values
        ...
        left_image_path = tf.string_join([self.data_path, split_line[0]])
        left_image_o = self.read_image(left_image_path)
I am trying to print out left_image_path to verify that it is being generated correctly. However, this is inside an object being called by monodepth_main. That is, monodepth_main calls the dataloader with the following lines:
dataloader = MonodepthDataloader(args.data_path, args.filenames_file, params, args.dataset, args.mode)
left = dataloader.left_image_batch
As a result, I can't just use sess.run(x). I have also tried using tf.Print(line, [line]), but nothing shows up.
How do I print out the value of a TensorFlow variable inside an object? Specifically left_image_path?
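For what it's worth, a hedged sketch of how such a graph tensor is usually inspected in TF1-style code: it assumes left_image_path is stored on the object (e.g. as self.left_image_path, which the snippet above does not do), and that the string_input_producer queue is started before evaluating anything:

dataloader = MonodepthDataloader(args.data_path, args.filenames_file,
                                 params, args.dataset, args.mode)
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    print(sess.run(dataloader.left_image_path))  # assumes the attribute exists
    coord.request_stop()
    coord.join(threads)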

Custom scikit-learn pickling doesn't work inside a grid search

I have written a scikit-learn estimator. It has a parameter and a model_ attribute that is set by fit.
from sklearn.base import BaseEstimator, TransformerMixin

class MyEstimator(BaseEstimator, TransformerMixin):
    def __init__(self, param="default"):
        self.param = param
        self.model_ = None

    def fit(self, x, y):
        # Sets the value of self.model_
        ...
I want to be able to pickle MyEstimator, but the model_ object I create cannot be serialized with pickle because it is a keras model. Following the example of the blog post "Pickling Keras Models" I added the following pickling handler methods to my class.
class MyEstimator(BaseEstimator, TransformerMixin):
    def __getstate__(self):
        state = super().__getstate__().copy()
        with tempfile.NamedTemporaryFile(suffix=".hdf5", delete=True) as fd:
            keras.models.save_model(self.model_, fd.name, overwrite=True)
            state["model_"] = fd.read()
        return state

    def __setstate__(self, state):
        super().__setstate__(state)
        with tempfile.NamedTemporaryFile(suffix=".hdf5", delete=True) as fd:
            fd.write(state["model_"])
            fd.flush()
            self.__dict__["model_"] = keras.models.load_model(fd.name)
This replaces the unpickleable model_ member with a representation generated by keras' serializer that can be pickled. Using this customization I can call fit, serialize and deserialize, and get back my original model. Everything works.
e = MyEstimator()
e.fit(x, y)

with open("myfile.pk", mode="wb") as f:
    pickle.dump(e, f)

with open("myfile.pk", mode="rb") as f:
    pickle.load(f)  # Returns a copy of e
However, serialization does not work when I try to put MyEstimator in a pipeline and pickle the result of a GridSearchCV.
s = GridSearchCV(Pipeline([
    # ...
    ("estimator", MyEstimator())
    # ...
]))
s.fit(x, y)

with open("myfile.pk", mode="wb") as f:
    pickle.dump(s, f)
During the pickle.dump call I expect to see MyEstimator.__getstate__ get called with a fitted self.model_ object. (This is what happens when I serialize the model by itself, outside the grid search.) Instead self.model_ is None, so I am unable to serialize the best_estimator_ generated by my grid search.
It looks like grid search serialization is instantiating a new MyEstimator object instead of using the one that was in the pipeline. This seems wrong to me. I've looked through the scikit-learn code, but can't see where this is happening.
Is this a bug in scikit-learn, or am I doing something wrong?
(Note: keras does have a wrapper layer that can convert some keras models into scikit-learn estimators, but I can't use that here for other reasons and I'm not sure it wouldn't just have the same problem.)
The search object contains a mix of MyEstimator objects, some of which have not had fit called on them. The fix is to check if model_ is None before trying to serialize it with the keras tools.
class MyEstimator(BaseEstimator, TransformerMixin):
    def __getstate__(self):
        state = super().__getstate__().copy()
        if self.model_ is not None:
            with tempfile.NamedTemporaryFile(suffix=".hdf5", delete=True) as fd:
                keras.models.save_model(self.model_, fd.name, overwrite=True)
                state["model_"] = fd.read()
        return state

    def __setstate__(self, state):
        super().__setstate__(state)
        if self.model_ is not None:
            with tempfile.NamedTemporaryFile(suffix=".hdf5", delete=True) as fd:
                fd.write(state["model_"])
                fd.flush()
                self.__dict__["model_"] = keras.models.load_model(fd.name)
I don't know why there would be any unfitted models in the search object after the grid search had completed, but there are.
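A hedged illustration of where those unfitted instances likely come from: GridSearchCV keeps the original, unfitted pipeline it was constructed with (s.estimator) alongside the refitted clone (s.best_estimator_), and pickling s serializes both, so __getstate__ also runs on instances whose model_ is still None:

unfitted = s.estimator.named_steps["estimator"]       # model_ is None
fitted = s.best_estimator_.named_steps["estimator"]   # model_ is a fitted keras model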