How not to get "datum" as the lemma for "data" when using Spacy?

I've run into the quite common word "data", which gets assigned the lemma "datum" from the lookup exceptions table spaCy uses. I understand that the lemma is technically correct, but in today's English the basic form of "data" is just "data".
I am using the lemmas to extract keywords from text, and if I have a text about data, I can't possibly tag it with "datum".
I was wondering whether there is another way to arrive at plain "data" than constructing another "my_exceptions" dictionary for overriding the lemma in post-processing.
Thanks for any suggestions.

You could use Lemminflect which works as an add-in pipeline component for SpaCy. It should give you better results.
To use it with spaCy, just import lemminflect and call the new ._.lemma() function on the Token, i.e. token._.lemma(). Here's an example:
import lemminflect
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('I got the data')
for token in doc:
    print('%-6s %-6s %s' % (token.text, token.lemma_, token._.lemma()))
I -PRON- I
got get get
the the the
data datum data
Lemminflect has a prioritized list of lemmas, based on occurrence in corpus data. You can see all lemmas with...
print(lemminflect.getAllLemmas('data'))
{'NOUN': ('data', 'datum')}
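Since the tuple for each part of speech is ordered by priority, taking its first entry gives the preferred lemma. A minimal sketch in plain Python, using a hard-coded dictionary shaped like the output above rather than calling lemminflect (the helper name `preferred_lemma` is made up for illustration):

```python
# Dictionary shaped like the getAllLemmas('data') output shown above.
all_lemmas = {'NOUN': ('data', 'datum')}


def preferred_lemma(lemmas_by_pos, upos, fallback):
    """Return the first (highest-priority) lemma for `upos`, else `fallback`."""
    candidates = lemmas_by_pos.get(upos, ())
    return candidates[0] if candidates else fallback


print(preferred_lemma(all_lemmas, 'NOUN', 'data'))  # first NOUN entry
print(preferred_lemma(all_lemmas, 'VERB', 'data'))  # no VERB entry: fallback
```

Passing the word itself as the fallback keeps unknown words unchanged.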

It's relatively easy to customize the lemmatizer once you know where to look. The original lemmatizer tables are from the package spacy-lookups-data and are loaded in the model under nlp.vocab.lookups. You can use a local install of spacy-lookups-data to customize the tables for new/blank models, but if you just want to make a few modifications to the entries for an existing model, you can modify the lemmatizer tables on the fly.
Depending on whether your pipeline includes a tagger, the lemmatizer may be referring to rules+exceptions (with a tagger) or to a simple lookup table (without a tagger), both of which include an exception that lemmatizes data to datum by default. If you remove this exception, you should get data as the lemma for data.
For a pipeline that includes a tagger (rule-based lemmatizer)
# tested with spaCy v2.2.4
import spacy
nlp = spacy.load("en_core_web_sm")
# remove exception from rule-based exceptions
lemma_exc = nlp.vocab.lookups.get_table("lemma_exc")
del lemma_exc[nlp.vocab.strings["noun"]]["data"]
assert nlp.vocab.morphology.lemmatizer("data", "NOUN") == ["data"]
# "data" with the POS "NOUN" has the lemma "data"
doc = nlp("data")
doc[0].pos_ = "NOUN" # just to make sure the POS is correct
assert doc[0].lemma_ == "data"
For a pipeline without a tagger (simple lookup lemmatizer)
import spacy
nlp = spacy.blank("en")
# remove exception from lookups
lemma_lookup = nlp.vocab.lookups.get_table("lemma_lookup")
del lemma_lookup[nlp.vocab.strings["data"]]
assert nlp.vocab.morphology.lemmatizer("data", "") == ["data"]
doc = nlp("data")
assert doc[0].lemma_ == "data"
For both: save model for future use with these modifications included in the lemmatizer tables
nlp.to_disk("/path/to/model")
Also be aware that the lemmatizer uses a cache, so make any changes before running your model on any texts or you may run into problems where it returns lemmas from the cache rather than the updated tables.
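The cache caveat can be illustrated with a toy memoized lookup. This is only a sketch of the general effect, using `functools.lru_cache` and a made-up one-entry table; it does not model how spaCy's cache is actually implemented.

```python
from functools import lru_cache

# Toy stand-in for a lemma exception table; not spaCy's real data structure.
table = {"data": "datum"}


@lru_cache(maxsize=None)
def cached_lemma(word):
    """Look up a lemma once and remember the answer."""
    return table.get(word, word)


first = cached_lemma("data")   # fills the cache from the unmodified table
del table["data"]              # table edited *after* the first lookup
stale = cached_lemma("data")   # still answered from the cache

cached_lemma.cache_clear()     # start over with an empty cache
fresh = cached_lemma("data")   # now reflects the edited table

print(first, stale, fresh)
```

Here `stale` still holds the old lemma even though the table entry is gone, which is why the tables should be edited before any text is processed.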


How to load customized NER model from disk with SpaCy?

I have customized the NER pipeline with the following procedure:
doc = nlp("I am going to Vallila. I am going to Sörnäinen.")
for ent in doc.ents:
    print(ent.text, ent.label_)
LABEL = 'DISTRICT'
TRAIN_DATA = [
    (
        'We need to deliver it to Vallila', {
            'entities': [(25, 32, 'DISTRICT')]
        }),
    (
        'We need to deliver it to somewhere', {
            'entities': []
        }),
]
ner = nlp.get_pipe("ner")
ner.add_label(LABEL)
nlp.disable_pipes("tagger")
nlp.disable_pipes("parser")
nlp.disable_pipes("attribute_ruler")
nlp.disable_pipes("lemmatizer")
nlp.disable_pipes("tok2vec")
optimizer = nlp.get_pipe("ner").create_optimizer()
import random
from spacy.training import Example
for i in range(25):
    random.shuffle(TRAIN_DATA)
    for text, annotation in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotation)
        nlp.update([example], sgd=optimizer)
I tried to save that customized NER to disk and load it again with the following code:
ner.to_disk('/home/feru/ner')
import spacy
from spacy.pipeline import EntityRecognizer
nlp = spacy.load("en_core_web_lg", disable=['ner'])
ner = EntityRecognizer(nlp.vocab)
ner.from_disk('/home/feru/ner')
nlp.add_pipe(ner)
However, I got the following error:
---> 10 ner = EntityRecognizer(nlp.vocab)
11 ner.from_disk('/home/feru/ner')
12 nlp.add_pipe(ner)
~/.local/lib/python3.8/site-packages/spacy/pipeline/ner.pyx in spacy.pipeline.ner.EntityRecognizer.__init__()
TypeError: __init__() takes at least 2 positional arguments (1 given)
This method of saving and loading a custom component from disk seems to be from some early spaCy version. What is the second argument that EntityRecognizer needs?
The general process you are following of serializing a single component and reloading it is not the recommended way to do this in spaCy. You can do it - it has to be done internally, of course - but you generally want to save and load pipelines using high-level wrappers. In this case this means that you would save like this:
nlp.to_disk("my_model") # NOT ner.to_disk
And then load it with spacy.load("my_model").
You can find more detail about this in the saving and loading docs. Since it seems you're just getting started with spaCy, you might want to go through the course too. It covers the new config-based training in v3, which is much easier than using your own custom training loop like in your code sample.
If you want to mix and match components from different pipelines, you still will generally want to save entire pipelines, and you can then combine components from them using the "sourcing" feature.

SpaCy use Lemmatizer as stand-alone component

I want to use SpaCy's lemmatizer as a standalone component (because I have pre-tokenized text, and I don't want to re-concatenate it and run the full pipeline because SpaCy will most likely tokenize differently in some cases).
I found the lemmatizer in the package, but I somehow need to load the dictionaries with the rules to initialize this Lemmatizer.
These files must be somewhere in the English or German model, right? I couldn't find them there.
from spacy.lemmatizer import Lemmatizer
# where do the LEMMA_INDEX, etc. files come from?
lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)
I found a similar question here: Spacy lemmatizer issue/consistency
but this one did not entirely answer how to get these dictionary files from the model. The spacy.lang.* parameter seems to no longer exist in newer versions.
Here's an extracted bit of code I had, that used the SpaCy lemmatizer by itself. I'm not somewhere I can run it so it might have a small bug or two if I made an editing mistake.
Note that in general, you need to know the upos for the word in order to lemmatize correctly. This code will return all the possible lemmas but I would advise modifying it to pass in the correct upos for your word.
class SpacyLemmatizer(object):
    def __init__(self, smodel):
        import spacy
        self.lemmatizer = spacy.load(smodel).vocab.morphology.lemmatizer
    # get the lemmas for every upos
    def getLemmas(self, entry):
        possible_lemmas = set()
        for upos in ('NOUN', 'VERB', 'ADJ', 'ADV'):
            lemmas = self.lemmatizer(entry, upos, morphology=None)
            lemma = lemmas[0]  # See morphology.pyx::lemmatize
            possible_lemmas.add(lemma)
        return possible_lemmas
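For orientation, the three arguments in the question's `Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)` call are, roughly, a set of known base forms, a table of irregular forms, and a list of suffix rewrite rules. The toy below, with made-up miniature tables for nouns only, sketches how such tables are commonly combined. It is a simplification for illustration, not spaCy's implementation, and all names in it are hypothetical.

```python
# Made-up miniature tables for nouns only (illustrative, not spaCy's data).
INDEX = {"box", "cat", "datum"}                 # known base forms
EXC = {"data": ["datum"], "mice": ["mouse"]}    # irregular forms
RULES = [("es", ""), ("s", "")]                 # (suffix, replacement)


def toy_lemmatize(word, index=INDEX, exc=EXC, rules=RULES):
    """Try the exceptions first, then suffix rules checked against the index."""
    word = word.lower()
    if word in exc:
        return list(exc[word])
    for old, new in rules:
        if word.endswith(old):
            candidate = word[: len(word) - len(old)] + new
            if candidate in index:
                return [candidate]
    return [word]  # nothing matched: the word is its own lemma


print(toy_lemmatize("boxes"))
print(toy_lemmatize("data"))
# With the exception removed, no rule applies and the word is kept as-is.
print(toy_lemmatize("data", exc={"mice": ["mouse"]}))
```

As the note above says, picking the right table in practice depends on knowing the part of speech of the word.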

TFX StatisticsGen for image data

Hi, I've been trying to get a TFX pipeline going, just as an exercise really. I'm using ImportExampleGen to load TFRecords from disk. Each Example in the TFRecord contains a jpg in the form of a byte string, plus height, width, depth, and steering and throttle labels.
I'm trying to use StatisticsGen, but I'm receiving this warning:
WARNING:root:Feature "image_raw" has bytes value "None" which cannot be decoded as a UTF-8 string.
It also crashes my Colab notebook. As far as I can tell, none of the byte-string images in the TFRecord are corrupt.
I cannot find concrete examples of StatisticsGen handling image data. According to the docs, TensorFlow Data Validation can deal with image data:
In addition to computing a default set of data statistics, TFDV can also compute statistics for semantic domains (e.g., images, text). To enable computation of semantic domain statistics, pass a tfdv.StatsOptions object with enable_semantic_domain_stats set to True to tfdv.generate_statistics_from_tfrecord.
But I'm not sure how this fits in with StatisticsGen.
Here is the code that instantiates an ImportExampleGen then the StatisticsGen
from tfx.utils.dsl_utils import tfrecord_input
from tfx.components.example_gen.import_example_gen.component import ImportExampleGen
from tfx.proto import example_gen_pb2
examples = tfrecord_input(_tf_record_dir)
# https://www.tensorflow.org/tfx/guide/examplegen#custom_inputoutput_split
# has a good explanation of splitting the data the 'output_config' param
# Input train split is _tf_record_dir/*'
# Output 2 splits: train:eval=8:2.
train_ratio = 8
eval_ratio = 10-train_ratio
output = example_gen_pb2.Output(
    split_config=example_gen_pb2.SplitConfig(splits=[
        example_gen_pb2.SplitConfig.Split(name='train',
                                          hash_buckets=train_ratio),
        example_gen_pb2.SplitConfig.Split(name='eval',
                                          hash_buckets=eval_ratio)
    ]))
example_gen = ImportExampleGen(input=examples,
                               output_config=output)
context.run(example_gen)
from tfx.components import StatisticsGen
statistics_gen = StatisticsGen(
    examples=example_gen.outputs['examples'])
context.run(statistics_gen)
Thanks in advance.
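As the comments in the code above say, the two `hash_buckets` values ask for an 8:2 train/eval split. The general idea of bucket-based splitting can be sketched in plain Python. The use of MD5, the record keys, and the function name are assumptions for illustration, not a description of what ExampleGen does internally.

```python
import hashlib


def assign_split(key, train_buckets=8, eval_buckets=2):
    """Map a record key to 'train' or 'eval' by hashing it into buckets."""
    total = train_buckets + eval_buckets
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % total
    return "train" if bucket < train_buckets else "eval"


# Hypothetical record keys; the same key always lands in the same split.
keys = ["frame_%05d" % i for i in range(1000)]
counts = {"train": 0, "eval": 0}
for key in keys:
    counts[assign_split(key)] += 1
print(counts)
```

Because the assignment depends only on the key, rerunning the split over the same records reproduces the same partition, with proportions that approach 8:2 as the number of records grows.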
From git issue response
Thanks Evan Rosen
Hi Folks,
The warnings you are seeing indicate that StatisticsGen is trying to treat your raw image features like a categorical string feature. The image bytes are being decoded just fine. The issue is that when the stats (including top K examples) are being written, the output proto is expecting a UTF-8 valid string, but instead gets the raw image bytes. Nothing is wrong with your setups from what I can tell, but this is just an unintended side-effect of a well-intentioned warning in the event that you have a categorical string feature which can't be serialized. We'll look into finding a better default that handles image data more elegantly.
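The mismatch described above is easy to reproduce outside TFX: JPEG data begins with the bytes 0xFF 0xD8, and 0xFF never occurs in valid UTF-8, so raw image bytes cannot pass as a UTF-8 string. A small plain-Python sketch (the helper name is made up for illustration):

```python
# First bytes of a typical JPEG file: the start-of-image marker 0xFF 0xD8
# followed by another marker. The byte 0xFF is never valid in UTF-8.
jpeg_prefix = b"\xff\xd8\xff\xe0"


def is_valid_utf8(raw):
    """Return True if `raw` decodes as UTF-8, False otherwise."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(b"left"))       # an ordinary categorical string value
print(is_valid_utf8(jpeg_prefix))   # raw image bytes
```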
In the meantime, to tell StatisticsGen that this feature is really an opaque blob, you can pass in a user-modified schema as described in the StatsGen docs. To generate this schema, you can run StatisticsGen and SchemaGen once (on a sample of data) and then modify the inferred schema to annotate the image features. Here is a modified version of the colab from #tall-josh (linked as "Open In Colab" in the original issue).
The additional steps are a bit verbose, but having a curated schema is often a good practice for other reasons. Here is the cell that I added to the notebook:
import os
from google.protobuf import text_format
from tensorflow.python.lib.io import file_io
from tensorflow_metadata.proto.v0 import schema_pb2
# Load autogenerated schema (using stats from small batch)
schema = tfx.utils.io_utils.SchemaReader().read(
    tfx.utils.io_utils.get_only_uri_in_dir(
        tfx.types.artifact_utils.get_single_uri(schema_gen.outputs['schema'].get())))
# Modify schema to indicate which string features are images.
# Ideally you would persist a golden version of this schema somewhere rather
# than regenerating it on every run.
for feature in schema.feature:
    # Use the name of your own image feature here (the question's is 'image_raw').
    if feature.name == 'image/raw':
        feature.image_domain.SetInParent()
# Write modified schema to local file
user_schema_dir = '/tmp/user-schema/'
tfx.utils.io_utils.write_pbtxt_file(
    os.path.join(user_schema_dir, 'schema.pbtxt'), schema)
# Create ImporterNode to make modified schema available to other components
user_schema_importer = tfx.components.ImporterNode(
    instance_name='import_user_schema',
    source_uri=user_schema_dir,
    artifact_type=tfx.types.standard_artifacts.Schema)
# Run the user schema ImporterNode
context.run(user_schema_importer)
Hopefully you find this workaround useful. In the meantime, we'll take a look at a better default experience for image-valued features.
Grokked this and found the solution to be dramatically simpler than I thought...
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext
import logging
...
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)
...
context = InteractiveContext(pipeline_name='my_pipe')
...
c = StatisticsGen(...)
...
context.run(c)
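This works because the warning goes through Python's `logging` module on the root logger (hence the `WARNING:root:` prefix in the question), so raising the root level to `CRITICAL` filters it out. Note that this only hides the message; it does not change how the feature is treated. A self-contained sketch of the effect, using stand-in messages rather than TFX itself:

```python
import io
import logging

# Capture root-logger output in memory so the effect is visible.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
root = logging.getLogger()
root.addHandler(handler)

root.setLevel(logging.WARNING)
logging.warning("stand-in warning, shown")

root.setLevel(logging.CRITICAL)
logging.warning("stand-in warning, suppressed")

# Restore the previous state so the demo does not affect other code.
root.setLevel(logging.WARNING)
root.removeHandler(handler)

captured = stream.getvalue()
print(captured)
```

Only the first message reaches the handler; the second is dropped before any handler sees it.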

When creating a Doc using the standard constructor the model is not loaded [E029]

I'm trying to use spaCy and instantiate the Doc object using the constructor:
words = ["hello", "world", "!"]
spaces = [True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)
but when I do that, if I try to use the dependency parser:
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)
I get the error:
ValueError: [E029] noun_chunks requires the dependency parse, which requires a statistical model to be installed and loaded. For more info, see the documentation:
If I use nlp("Hello world!") instead, that does not happen.
The reason I do this is that I use entity extraction from a third-party application, and I want to supply my own tokenisation and entities to spaCy.
Something like this:
## Convert tokens
words, spaces = convert_to_spacy2(tokens_)
## Creating a new document with the text
doc = Doc(nlp.vocab, words=words, spaces=spaces)
## Loading entities in the spaCy document
entities = []
for s in myEntities:
    entities.append(Span(doc=doc, start=s['tokenStart'], end=s['tokenEnd'], label=s['type']))
doc.ents = entities
What should I do? Should I load the pipeline into the document myself and exclude the tokeniser, for example?
Thank you in advance
nlp() returns a Doc where the tokenizer and all the pipeline components in nlp.pipeline have been applied to the document.
If you create a Doc by hand, the tokenizer and the pipeline components are not loaded or applied at any point.
After creating a Doc by hand, you can still apply individual pipeline components from a loaded model:
nlp = spacy.load('en_core_web_sm')
nlp.tagger(doc)
nlp.parser(doc)
Then you can add your own entities to the document. (Note that if your tokenizer is very different from the default tokenizer used when training a model, the performance may not be as good.)
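When building the `Doc` by hand, `spaces[i]` records whether token `i` is followed by a space. If the third-party tokens come with character offsets, the two lists can be derived as in the sketch below. The helper name `words_and_spaces` and the `(start, end)` offset format are assumptions; the question's `convert_to_spacy2` is not shown, so this is not a reconstruction of it.

```python
def words_and_spaces(text, offsets):
    """Build the `words` and `spaces` lists from (start, end) char offsets."""
    words, spaces = [], []
    for start, end in offsets:
        words.append(text[start:end])
        # A token is followed by a space if the next character is one.
        spaces.append(text[end:end + 1] == " ")
    return words, spaces


text = "hello world!"
offsets = [(0, 5), (6, 11), (11, 12)]
words, spaces = words_and_spaces(text, offsets)
print(words, spaces)
```

The result matches the `words` and `spaces` lists from the question and can be passed to `Doc(nlp.vocab, words=words, spaces=spaces)`.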

Using spacy visualizer with custom data

I want to visualize a sentence using Spacy's named entity visualizer. I have a sentence with some user defined labels over the tokens, and I want to visualize them using the NER rendering API.
I don't want to train and produce a predictive model, I have all needed labels from an external source, just need the visualization without messing too much with front-end libraries.
Any ideas how?
Thank you
You can manually modify the list of entities (doc.ents) and add new spans using token offsets. Be aware that entities can't overlap at all.
import spacy
from spacy.tokens import Span
nlp = spacy.load('en', disable=['ner'])
doc = nlp("I see an XYZ.")
doc.ents = list(doc.ents) + [Span(doc, 3, 4, "NEWENTITYTYPE")]
print(doc.ents[0], doc.ents[0].label_)
Output:
XYZ NEWENTITYTYPE
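Because `doc.ents` rejects overlapping spans, externally supplied labels may need to be filtered first. spaCy provides `spacy.util.filter_spans` for `Span` objects; the plain-Python sketch below shows the same general idea (prefer longer spans, drop anything that overlaps a span already kept) on `(start, end, label)` token-offset tuples. The tuple format, the sample labels, and the function name are assumptions for illustration.

```python
def drop_overlaps(spans):
    """Keep the longest spans first and discard any span that overlaps one."""
    # Sort by length (longest first), then by start position.
    ordered = sorted(spans, key=lambda s: (s[0] - s[1], s[0]))
    kept, used = [], set()
    for start, end, label in ordered:
        tokens = set(range(start, end))
        if tokens & used:
            continue  # shares at least one token with a kept span
        kept.append((start, end, label))
        used |= tokens
    return sorted(kept)


spans = [(3, 4, "NEWENTITYTYPE"), (2, 4, "PHRASE"), (0, 1, "PRON")]
print(drop_overlaps(spans))
```

Each surviving tuple can then be turned into `Span(doc, start, end, label)` and assigned to `doc.ents` as shown above.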