Why doesn't the spaCy morphologizer work when we use a custom tokenizer?

I don't understand why, when I do this:
import spacy
from copy import deepcopy
nlp = spacy.load("fr_core_news_lg")
class MyTokenizer:
    def __init__(self, tokenizer):
        self.tokenizer = deepcopy(tokenizer)
    def __call__(self, text):
        return self.tokenizer(text)
nlp.tokenizer = MyTokenizer(nlp.tokenizer)
doc = nlp("Un texte en français.")
the tokens don't have any morph assigned:
print([tok.morph for tok in doc])
> ['','','','','']
Is this behavior expected? If so, why? (spaCy v3.0.7)

The pipeline expects nlp.vocab and nlp.tokenizer.vocab to refer to the exact same Vocab object, which isn't the case after running deepcopy.
I admit that I'm not entirely sure off the top of my head why you end up with empty analyses instead of more specific errors, but I think the MorphAnalysis objects, which are stored centrally in the vocab in vocab.morphology, end up out-of-sync between the two vocabs.
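A minimal sketch of the identity problem described above, using toy stand-in classes rather than spaCy itself (ToyVocab, ToyTokenizer and Wrapper are hypothetical names, not spaCy APIs). It only illustrates that deepcopy gives the wrapped tokenizer its own vocab, so anything recorded in the pipeline's vocab is invisible to the copy, whereas wrapping the original tokenizer without copying keeps one shared object.

```python
from copy import deepcopy


class ToyVocab:
    """Hypothetical stand-in for a shared vocab: a central store of analyses."""

    def __init__(self):
        self.morphology = {}


class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer that holds a reference to a vocab."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        return text.split()


class Wrapper:
    """Custom tokenizer wrapper; `copy` toggles the deepcopy from the question."""

    def __init__(self, tokenizer, copy):
        self.tokenizer = deepcopy(tokenizer) if copy else tokenizer

    def __call__(self, text):
        return self.tokenizer(text)


pipeline_vocab = ToyVocab()
original = ToyTokenizer(pipeline_vocab)

copied = Wrapper(original, copy=True)
shared = Wrapper(original, copy=False)

# A later pipeline component records an analysis in the pipeline's vocab.
pipeline_vocab.morphology["Un"] = "Definite=Ind|Gender=Masc"

print(copied.tokenizer.vocab is pipeline_vocab)   # False: two separate vocabs
print(shared.tokenizer.vocab is pipeline_vocab)   # True: one shared vocab
print("Un" in copied.tokenizer.vocab.morphology)  # False: the copy never sees it
```

In the question's code, the corresponding change would presumably be to store the tokenizer directly (self.tokenizer = tokenizer) instead of deep-copying it, so that nlp.vocab and the tokenizer's vocab stay the same object.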

Related

How can I make a DataFrame from a ColumnTransformer composed by LOO Encoder, OHE and Ordinal Encoder?

Since get_feature_names() was deprecated in scikit-learn's native categorical encoders (it was replaced by get_feature_names_out()), how can I build a DataFrame whose transformed columns have proper names when the ColumnTransformer contains some encoders that respond to get_feature_names_out() and others that only respond to get_feature_names()? Here is the situation:
features_pipe = make_column_transformer(
    (OneHotEncoder(handle_unknown = 'ignore', sparse=False), ['Gender', 'Race']),
    (OrdinalEncoder(), ['Age', 'Overall Work Exp.', 'Fieldwork Exp.', 'Level of Education']),
    (ce.LeaveOneOutEncoder(), ['State (US)'])
).fit(X_train, y_train)
X_train_encoded = features_pipe.transform(X_train)
X_test_encoded = features_pipe.transform(X_test)
X_train_encoded_df = pd.DataFrame(X_train_encoded, columns= features_pipe.get_features_names_out())
X_train_encoded_df.head()
I got this error: AttributeError: 'ColumnTransformer' object has no attribute 'get_features_names_out'
That's because LeaveOneOutEncoder does not support get_feature_names_out(). It supports get_feature_names().
How could I overcome this issue and print my DataFrame correctly?
I had this same issue in the past.
If you don't mind using a subclass of ColumnTransformer, you can create one that falls back to get_feature_names() when get_feature_names_out() is not available.
In this case, you should declare the class:
from sklearn.compose import ColumnTransformer
from sklearn.compose._column_transformer import _is_empty_column_selection
class MyColumnTransformer(ColumnTransformer):
    def __init__(self, transformers, **kwargs):
        super().__init__(transformers=transformers, **kwargs)
    def _get_feature_name_out_for_transformer(
        self, name, trans, column, feature_names_in
    ):
        column_indices = self._transformer_to_input_indices[name]
        names = feature_names_in[column_indices]
        if trans == "drop" or _is_empty_column_selection(column):
            return
        elif trans == "passthrough":
            return names
        # Fall back to the older API when the newer one is missing
        if not hasattr(trans, "get_feature_names_out"):
            return trans.get_feature_names()
        return trans.get_feature_names_out(names)
Although the use of ColumnTransformer is not as simple as using make_column_transformer, it's much more customizable.
So, in this case, you also have to pass a name to each transformer using the following schema:
(name, transformer, columns)
features_pipe = MyColumnTransformer(transformers=[
    ('OHE', OneHotEncoder(handle_unknown = 'ignore', sparse=False), ['Gender', 'Race']),
    ('OE', OrdinalEncoder(), ['Age', 'Overall Work Exp.', 'Fieldwork Exp.', 'Level of Education']),
    ('LOOE', ce.LeaveOneOutEncoder(), ['State (US)'])
])
features_pipe.fit(X_train, y_train)
Finally, continue with the rest of your code as before.
If you don't want the transformer names prepended to the feature names, just pass verbose_feature_names_out=False when initializing MyColumnTransformer.
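The fallback idea in the subclass above can be shown in isolation. This is a minimal sketch with toy encoders (NewStyleEncoder, OldStyleEncoder and feature_names_for are hypothetical names, not scikit-learn or category_encoders classes); it only demonstrates the "prefer get_feature_names_out(), else get_feature_names()" dispatch and a name prefix similar in spirit to verbose_feature_names_out=True.

```python
class NewStyleEncoder:
    """Hypothetical encoder exposing only get_feature_names_out()."""

    def get_feature_names_out(self, input_features=None):
        return [f"{name}_encoded" for name in input_features]


class OldStyleEncoder:
    """Hypothetical encoder exposing only the older get_feature_names()."""

    def __init__(self, columns):
        self.columns = columns

    def get_feature_names(self):
        return list(self.columns)


def feature_names_for(transformer, input_names):
    """Prefer get_feature_names_out(); fall back to get_feature_names()."""
    if hasattr(transformer, "get_feature_names_out"):
        return list(transformer.get_feature_names_out(input_names))
    return list(transformer.get_feature_names())


# (name, transformer, columns) triples, mirroring the schema used above.
steps = [
    ("OHE", NewStyleEncoder(), ["Gender", "Race"]),
    ("LOOE", OldStyleEncoder(["State (US)"]), ["State (US)"]),
]

all_names = []
for name, transformer, columns in steps:
    for feature in feature_names_for(transformer, columns):
        # Prefix each feature with its step name.
        all_names.append(f"{name}__{feature}")

print(all_names)
# ['OHE__Gender_encoded', 'OHE__Race_encoded', 'LOOE__State (US)']
```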

Error message when trying to use huggingface pretrained Tokenizer (roberta-base)

I am pretty new at this, so there might be something I am missing completely, but here is my problem: I am trying to create a Tokenizer class that uses the pretrained tokenizer models from Huggingface. I would then like to use this class in a larger transformer model to tokenize my input data. Here is the class code
from transformers import AutoTokenizer
from transformers import RobertaTokenizer
class Roberta(MyTokenizer):
    def build(self, *args, **kwargs):
        self.max_length = self.phd.max_length
        self.untokenized_data = self.questions + self.answers
    def tokenize_and_filter(self):
        # Initialize the tokenizer with a pretrained model
        Tokenizer = AutoTokenizer.from_pretrained('roberta')
        tokenized_inputs, tokenized_outputs = [], []
        inputs = Tokenizer(self.questions, padding=True)
        outputs = Tokenizer(self.answers, padding=True)
        tokenized_inputs = inputs['input_ids']
        tokenized_outputs = outputs['input_ids']
        return tokenized_inputs, tokenized_outputs
When I call the function tokenize_and_filter in my Transformer model as below
questions = self.get_tokenizer().tokenize_and_filter
answers = self.get_tokenizer().tokenize_and_filter
print(questions)
and I try to print the tokenized data, I get this message:
<bound method Roberta.tokenize_and_filter of <MyTokenizer.Roberta.Roberta object at 0x000002779A9E4D30>>
It appears that the call returns a method instead of a list or a tensor. I've tried passing the parameter return_tensors='tf', using the tokenizer.encode() method, using both AutoTokenizer and RobertaTokenizer, and using the batch_encode_plus() method; nothing seems to work.
Please help!
It seems this was a really stupid error on my part: I forgot to put parentheses when calling the function.
questions = self.get_tokenizer().tokenize_and_filter
answers = self.get_tokenizer().tokenize_and_filter
should actually be
questions = self.get_tokenizer().tokenize_and_filter()
answers = self.get_tokenizer().tokenize_and_filter()
and it works this way :)
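A small sketch of the difference between referencing a method and calling it, using a toy class (ToyTokenizer is a hypothetical stand-in for the Roberta class above, and whitespace splitting stands in for the pretrained tokenizer):

```python
class ToyTokenizer:
    """Hypothetical stand-in for the Roberta class in the question."""

    def __init__(self, questions, answers):
        self.questions = questions
        self.answers = answers

    def tokenize_and_filter(self):
        # Whitespace splitting stands in for the real pretrained tokenizer.
        inputs = [q.split() for q in self.questions]
        outputs = [a.split() for a in self.answers]
        return inputs, outputs


tok = ToyTokenizer(["how are you"], ["fine thanks"])

not_called = tok.tokenize_and_filter   # the bound method object itself
called = tok.tokenize_and_filter()     # the method's return value

print(callable(not_called))  # True: still a method waiting to be called
print(called)                # ([['how', 'are', 'you']], [['fine', 'thanks']])
```

Since tokenize_and_filter as written in the question returns both lists as a tuple, it may be enough to call it once and unpack the result, for example questions, answers = tok.tokenize_and_filter().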

Two models of the same architecture with same weights giving different results

Problem
After copying weights from a pretrained model, I do not get the same output.
Description
The tf2cv repository provides pretrained TF2 models for various backbones. Unfortunately, the codebase is of limited use to me because it builds models by subclassing tf.keras.Model, which makes it very hard to extract intermediate outputs and gradients at will. I therefore set about rewriting the backbone code using the functional API. After rewriting the ResNet architecture code, I copied their weights into my model and saved it in SavedModel format. To test whether this was done correctly, I fed the same input to my model instance and to theirs, and the results were different.
My approaches to debugging the problem
I checked the number of trainable and non-trainable parameters and they are the same between my model instance and theirs.
I checked whether all trainable weights have been copied, and they have.
My present line of thinking
I think the weights may not have been copied to the correct layers. For example, Layer X and Layer Y might have weights of the same shape, but during copying the weights of Layer Y might have gone into Layer X and vice versa. This is only possible if I have not mapped the layer names between the two models properly.
However, I have checked exhaustively and have not found any error so far.
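The hypothesis above (same-shaped weights swapped between layers) can be partly checked mechanically. This is a minimal sketch, assuming each model's variables are summarized as a name-to-shape dict and the remapping is a dict from our names to theirs; all names and shapes below are made up for illustration. It verifies that the mapping is one-to-one and that mapped shapes agree, and it lists shapes shared by several variables, which are the only places where a silent swap could hide.

```python
from collections import defaultdict


def check_mapping(our_shapes, their_shapes, remap):
    """Return a list of problems found in a name remapping between two models."""
    problems = []
    targets = list(remap.values())
    if len(set(targets)) != len(targets):
        problems.append("mapping is not one-to-one")
    for ours, theirs in remap.items():
        if theirs not in their_shapes:
            problems.append(f"{ours}: no source variable named {theirs}")
        elif our_shapes[ours] != their_shapes[theirs]:
            problems.append(f"{ours}: shape mismatch with {theirs}")
    return problems


def ambiguous_shapes(shapes):
    """Group variable names by shape; groups of 2+ could be swapped unnoticed."""
    groups = defaultdict(list)
    for name, shape in shapes.items():
        groups[shape].append(name)
    return {shape: names for shape, names in groups.items() if len(names) > 1}


# Made-up variable names and shapes, for illustration only.
ours = {
    "stage1/bn/gamma:0": (64,),
    "stage1/bn/beta:0": (64,),
    "logits/kernel:0": (512, 1000),
}
theirs = {
    "f/stage1/bn/gamma:0": (64,),
    "f/stage1/bn/beta:0": (64,),
    "output1/kernel:0": (512, 1000),
}
remap = {
    "stage1/bn/gamma:0": "f/stage1/bn/gamma:0",
    "stage1/bn/beta:0": "f/stage1/bn/beta:0",
    "logits/kernel:0": "output1/kernel:0",
}

print(check_mapping(ours, theirs, remap))  # []
print(ambiguous_shapes(ours))
# {(64,): ['stage1/bn/gamma:0', 'stage1/bn/beta:0']}
```

A shape check alone cannot tell the variables inside an ambiguous group apart; for those, comparing the copied values or the intermediate outputs layer by layer is probably needed to confirm the pairing.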
The Code
My code is attached below. Their (tf2cv) code for resnet can be found here
Please note that resnet_orig in the following snippet is the same as here
My converted code can be found here
from vision.image import resnet as myresnet
from glob import glob
from loguru import logger
import tensorflow as tf
import resnet_orig
import re
import os
import numpy as np
from time import time
from copy import deepcopy
tf.random.set_seed(time())
models = [
    'resnet10',
    'resnet12',
    'resnet14',
    'resnetbc14b',
    'resnet16',
    'resnet18_wd4',
    'resnet18_wd2',
    'resnet18_w3d4',
    'resnet18',
    'resnet26',
    'resnetbc26b',
    'resnet34',
    'resnetbc38b',
    'resnet50',
    'resnet50b',
    'resnet101',
    'resnet101b',
    'resnet152',
    'resnet152b',
    'resnet200',
    'resnet200b',
]
def zipdir(path, ziph):
    # ziph is zipfile handle
    for root, dirs, files in os.walk(path):
        for file in files:
            ziph.write(os.path.join(root, file),
                       os.path.relpath(os.path.join(root, file),
                                       os.path.join(path, '..')))
def find_model_file(model_type):
    model_files = glob('*.h5')
    for m in model_files:
        if '{}-'.format(model_type) in m:
            return m
    return None
def remap_our_model_variables(our_variables, model_name):
    remapped = list()
    reg = re.compile(r'(stage\d+)')
    for var in our_variables:
        newvar = var.replace(model_name, 'features/features')
        stage_search = re.search(reg, newvar)
        if stage_search is not None:
            stage_search = stage_search[0]
            newvar = newvar.replace(stage_search, '{}/{}'.format(stage_search,
                                                                 stage_search))
        newvar = newvar.replace('conv_preact', 'conv/conv')
        newvar = newvar.replace('conv_bn','bn')
        newvar = newvar.replace('logits','output1')
        remapped.append(newvar)
    remap_dict = dict([(x,y) for x,y in zip(our_variables, remapped)])
    return remap_dict
def get_correct_variable(variable_name, trainable_variable_names):
    for i, var in enumerate(trainable_variable_names):
        if variable_name == var:
            return i
    logger.info('Uffff.....')
    return None
layer_regexp_compiled = re.compile(r'(.*)\/.*')
model_files = glob('*.h5')
a = np.ones(shape=(1,224,224,3), dtype=np.float32)
inp = tf.constant(a, dtype=tf.float32)
for model_type in models:
    logger.info('Model is {}.'.format(model_type))
    model = eval('myresnet.{}(input_height=224,input_width=224,'
                 'num_classes=1000,data_format="channels_last")'.format(
                     model_type))
    model2 = eval('resnet_orig.{}(data_format="channels_last")'.format(
        model_type))
    model2.build(input_shape=(None,224, 224,3))
    model_name=find_model_file(model_type)
    logger.info('Model file is {}.'.format(model_name))
    original_weights = deepcopy(model2.weights)
    if model_name is not None:
        e = model2.load_weights(model_name, by_name=True, skip_mismatch=False)
        print(e)
        loaded_weights = deepcopy(model2.weights)
    else:
        logger.info('Pretrained model is not available for {}.'.format(
            model_type))
        continue
    diff = [np.mean(x.numpy()-y.numpy()) for x,y in zip(original_weights,
                                                        loaded_weights)]
    our_model_weights = model.weights
    their_model_weights = model2.weights
    assert (len(our_model_weights) == len(their_model_weights))
    our_variable_names = [x.name for x in model.weights]
    their_variable_names = [x.name for x in model2.weights]
    remap_dict = remap_our_model_variables(our_variable_names, model_type)
    new_weights = list()
    for i in range(len(our_model_weights)):
        our_name = model.weights[i].name
        remapped_name = remap_dict[our_name]
        source_index = get_correct_variable(remapped_name, their_variable_names)
        new_weights.append(
            model2.weights[source_index].value())
        logger.debug('Copying from {} ({}) to {} ({}).'.format(
            model2.weights[
                source_index].name,
            model2.weights[source_index].value().shape,
            model.weights[
                i].name,
            model.weights[i].value().shape))
    logger.info(len(new_weights))
    logger.info('Setting new weights')
    model.set_weights(new_weights)
    logger.info('Finished setting new weights.')
    their_output = model2(inp)
    our_output = model(inp)
    logger.info(np.max(their_output.numpy() - our_output.numpy()))
    logger.info(diff) # This must be 0.0
    break

My question is about "module 'textacy' has no attribute 'Doc'"

I get the error: module 'textacy' has no attribute 'Doc'
I am trying to extract verb phrases with spaCy, but the library has no such function. How can I extract verb phrases or adjective phrases using spaCy? I want to do full shallow parsing.
def extract_named_nouns(row_series):
    """Combine nouns and non-numerical entities.
    Keyword arguments:
    row_series -- a Pandas Series object
    """
    ents = set()
    idxs = set()
    # remove duplicates and merge two lists together
    for noun_tuple in row_series['nouns']:
        for named_ents_tuple in row_series['named_ents']:
            if noun_tuple[1] == named_ents_tuple[1]:
                idxs.add(noun_tuple[1])
                ents.add(named_ents_tuple)
        if noun_tuple[1] not in idxs:
            ents.add(noun_tuple)
    return sorted(list(ents), key=lambda x: x[1])
def add_named_nouns(df):
    """Create new column in data frame with nouns and named ents.
    Keyword arguments:
    df -- a dataframe object
    """
    df['named_nouns'] = df.apply(extract_named_nouns, axis=1)
from __future__ import unicode_literals
import spacy,en_core_web_sm
import textacy
from textacy import io
#using spacy for nlp
nlp = en_core_web_sm.load()
sentence = 'The author is writing a new book.'
pattern = r'<VERB>?<ADV>*<VERB>+'
doc = textacy.Doc.load(sentence, metadata=metadata, lang='en_core_web_sm')
# doc = textacy.corpus.Corpus(sentence, lang='en_core_web_sm')
lists = textacy.extract.pos_regex_matches(doc, pattern)
for list in lists:
    print(list.text)
module 'textacy' has no attribute 'Doc'
Try following the examples here: https://chartbeat-labs.github.io/textacy/getting_started/quickstart.html#make-a-doc
It should be as simple as:
doc = textacy.make_spacy_doc("The author is writing a new book.", lang='en_core_web_sm')
You might look into just using spacy (without textacy) with its built-in Matcher instead (https://spacy.io/usage/rule-based-matching).
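To illustrate what a tag pattern like the one in the question is meant to match, here is a pure-Python sketch that uses neither textacy nor spaCy. The function name pos_regex_matches is borrowed from the question, but this is a simplified reimplementation of the idea, not textacy's code, and the POS tags are hand-assigned assumptions where a real pipeline would supply them from a tagger.

```python
import re

# Hand-assigned (token, coarse POS tag) pairs; these tags are an assumption.
tagged = [
    ("The", "DET"), ("author", "NOUN"), ("is", "AUX"), ("quickly", "ADV"),
    ("writing", "VERB"), ("a", "DET"), ("new", "ADJ"), ("book", "NOUN"),
    (".", "PUNCT"),
]


def pos_regex_matches(tagged, pattern):
    """Yield token lists whose tag sequence matches a <TAG>-style regex."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    # Character offset at which each token's tag starts inside tag_string.
    starts = []
    offset = 0
    for _, tag in tagged:
        starts.append(offset)
        offset += len(tag) + 2  # +2 for the angle brackets
    # Wrap each <TAG> in a non-capturing group so quantifiers apply to whole tags.
    regex = re.sub(r"(<[A-Z]+>)", r"(?:\1)", pattern)
    for m in re.finditer(regex, tag_string):
        first = starts.index(m.start())
        last = max(i for i, s in enumerate(starts) if s < m.end())
        yield [tok for tok, _ in tagged[first:last + 1]]


# A variant of the question's pattern, adapted to the tags assumed above.
print(list(pos_regex_matches(tagged, r"<AUX>?<ADV>*<VERB>+")))
# [['is', 'quickly', 'writing']]
```

With spaCy's built-in Matcher, the same idea is expressed as a list of token-attribute dictionaries instead of a regex string; see the rule-based matching page linked above for the exact syntax.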
spacy_lang = textacy.load_spacy_lang("en_core_web_sm")
docx_textacy = spacy_lang(sentence)

Custom scikit-learn pickling doesn't work inside a grid search

I have written a scikit-learn estimator. It has a parameter and a model_ attribute that is set by fit.
class MyEstimator(BaseEstimator, TransformerMixin):
    def __init__(self, param="default"):
        self.param = param
        self.model_ = None
    def fit(self, x, y):
        # Sets the value of self.model_
        ...
I want to be able to pickle MyEstimator, but the model_ object I create cannot be serialized with pickle because it is a keras model. Following the example of the blog post "Pickling Keras Models" I added the following pickling handler methods to my class.
class MyEstimator(BaseEstimator, TransformerMixin):
    def __getstate__(self):
        state = super().__getstate__().copy()
        with tempfile.NamedTemporaryFile(suffix=".hdf5", delete=True) as fd:
            keras.models.save_model(self.model_, fd.name, overwrite=True)
            state["model_"] = fd.read()
        return state
    def __setstate__(self, state):
        super().__setstate__(state)
        with tempfile.NamedTemporaryFile(suffix=".hdf5", delete=True) as fd:
            fd.write(state["model_"])
            fd.flush()
            self.__dict__["model_"] = keras.models.load_model(fd.name)
This replaces the unpickleable model_ member with a representation generated by keras' serializer that can be pickled. Using this customization I can call fit, serialize and deserialize, and get back my original model. Everything works.
e = MyEstimator()
e.fit(x, y)
with open("myfile.pk", mode="wb") as f:
    pickle.dump(e, f)
with open("myfile.pk", mode="rb") as f:
    pickle.load(f) # Returns a copy of e
However, serialization does not work when I try to put MyEstimator in a pipeline and pickle the result of a GridSearchCV.
s = GridSearchCV(Pipeline([
    # ...
    ("estimator", MyEstimator())
    # ...
]))
s.fit(x, y)
with open("myfile.pk", mode="wb") as f:
    pickle.dump(s, f)
During the pickle.dump call I expect to see MyEstimator.__getstate__ get called with a fitted self.model_ object. (This is what happens when I serialize the model by itself, outside the grid search.) Instead self.model_ is None, so I am unable to serialize the best_estimator_ generated by my grid search.
It looks like grid search serialization is instantiating a new MyEstimator object instead of using the one that was in the pipeline. This seems wrong to me. I've looked through the scikit-learn code, but can't see where this is happening.
Is this a bug in scikit-learn, or am I doing something wrong?
(Note: keras does have a wrapper layer that can convert some keras models into scikit-learn estimators, but I can't use that here for other reasons and I'm not sure it wouldn't just have the same problem.)
The search object contains a mix of MyEstimator objects, some of which have not had fit called on them. The fix is to check whether model_ is None before trying to serialize it with the keras tools.
class MyEstimator(BaseEstimator, TransformerMixin):
    def __getstate__(self):
        state = super().__getstate__().copy()
        if self.model_ is not None:
            with tempfile.NamedTemporaryFile(suffix=".hdf5", delete=True) as fd:
                keras.models.save_model(self.model_, fd.name, overwrite=True)
                state["model_"] = fd.read()
        return state
    def __setstate__(self, state):
        super().__setstate__(state)
        if self.model_ is not None:
            with tempfile.NamedTemporaryFile(suffix=".hdf5", delete=True) as fd:
                fd.write(state["model_"])
                fd.flush()
                self.__dict__["model_"] = keras.models.load_model(fd.name)
I don't know why there would be any unfitted models in the search object after the grid search had completed, but there are.
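One plausible source of the unfitted instance: GridSearchCV keeps the estimator it was constructed with in its estimator attribute and fits clones of it, so the pickled search object holds both that unfitted template and the refit best_estimator_. The sketch below reproduces the situation with toy classes only (FakeModel, Estimator and Search are hypothetical stand-ins, not keras or scikit-learn classes) and shows the None guard letting both copies survive a pickle round trip.

```python
import json
import pickle


class FakeModel:
    """Stand-in for a Keras model: refuses default pickling."""

    def __init__(self, weights):
        self.weights = weights

    def __reduce__(self):
        raise TypeError("FakeModel cannot be pickled directly")

    def to_bytes(self):
        return json.dumps(self.weights).encode("utf-8")

    @classmethod
    def from_bytes(cls, blob):
        return cls(json.loads(blob.decode("utf-8")))


class Estimator:
    """Stand-in for MyEstimator with guarded pickling hooks."""

    def __init__(self, param="default"):
        self.param = param
        self.model_ = None

    def fit(self):
        self.model_ = FakeModel([1.0, 2.0])
        return self

    def __getstate__(self):
        state = self.__dict__.copy()
        if state["model_"] is not None:  # unfitted instances skip conversion
            state["model_"] = state["model_"].to_bytes()
        return state

    def __setstate__(self, state):
        state = dict(state)
        if state["model_"] is not None:
            state["model_"] = FakeModel.from_bytes(state["model_"])
        self.__dict__.update(state)


class Search:
    """Stand-in for a search object: keeps the template and a fitted copy."""

    def __init__(self, estimator):
        self.estimator = estimator  # never fitted

    def fit(self):
        self.best_estimator_ = Estimator(self.estimator.param).fit()
        return self


search = Search(Estimator()).fit()
restored = pickle.loads(pickle.dumps(search))

print(restored.estimator.model_)                # None
print(restored.best_estimator_.model_.weights)  # [1.0, 2.0]
```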