For a Named Entity Recognition task in Dutch with spaCy, I added entities using EntityRuler. When I add the ruler to the pipeline in my notebook:
nlp = spacy.load("nl_core_news_md")
ruler = nlp.add_pipe("entity_ruler", before="ner")
patterns = complete_dicts # This is a list of dictionaries, e.g. [{"label": "PERSON", "pattern": "Staf Aerts"}, {"label": "PERSON", "pattern": "Meyrem Almaci"}]
ruler.add_patterns(patterns)
the NER-pipeline works very well. However, when I save it to my disk and then load this model again using
nlp.from_disk("path/to_model")
the model misses entities that are added through the EntityRuler.
I found nothing in the documentation that explains why this would happen. I would be grateful to anyone who can explain it. Thanks!
To load a saved model, use spacy.load:
nlp = spacy.load("/path/to/model")
More details about how spacy.load works (including nlp.from_disk): https://spacy.io/usage/processing-pipelines#pipelines
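A likely explanation, offered with some caution: nlp.from_disk restores state in place into the components that the nlp object already has, so an nlp object without an entity_ruler component would not pick up the saved patterns, whereas spacy.load rebuilds the whole pipeline from the saved config. Below is a minimal sketch of the round trip, assuming spaCy v3. The blank Dutch pipeline, the temporary directory and the test sentence are illustrative stand-ins (not from the question) so that the sketch runs without downloading nl_core_news_md.

```python
import tempfile

import spacy

# Assumption: a blank Dutch pipeline stands in for nl_core_news_md.
nlp = spacy.blank("nl")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Staf Aerts"},
    {"label": "PERSON", "pattern": "Meyrem Almaci"},
])

# Save the whole pipeline, then rebuild it from disk with spacy.load.
with tempfile.TemporaryDirectory() as model_dir:
    nlp.to_disk(model_dir)
    reloaded = spacy.load(model_dir)

doc = reloaded("Staf Aerts en Meyrem Almaci waren aanwezig.")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

With a real model the same pattern applies: add the ruler before "ner", call nlp.to_disk, and load the result with spacy.load.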
I have customized the NER pipeline with the following procedure:
doc = nlp("I am going to Vallila. I am going to Sörnäinen.")
for ent in doc.ents:
    print(ent.text, ent.label_)
LABEL = 'DISTRICT'
TRAIN_DATA = [
    (
        'We need to deliver it to Vallila', {
            'entities': [(25, 32, 'DISTRICT')]
        }),
    (
        'We need to deliver it to somewhere', {
            'entities': []
        }),
]
ner = nlp.get_pipe("ner")
ner.add_label(LABEL)
nlp.disable_pipes("tagger")
nlp.disable_pipes("parser")
nlp.disable_pipes("attribute_ruler")
nlp.disable_pipes("lemmatizer")
nlp.disable_pipes("tok2vec")
optimizer = nlp.get_pipe("ner").create_optimizer()
import random
from spacy.training import Example
for i in range(25):
    random.shuffle(TRAIN_DATA)
    for text, annotation in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotation)
        nlp.update([example], sgd=optimizer)
I tried to save that customized NER component to disk and load it again with the following code:
ner.to_disk('/home/feru/ner')
import spacy
from spacy.pipeline import EntityRecognizer
nlp = spacy.load("en_core_web_lg", disable=['ner'])
ner = EntityRecognizer(nlp.vocab)
ner.from_disk('/home/feru/ner')
nlp.add_pipe(ner)
However, I got the following error:
---> 10 ner = EntityRecognizer(nlp.vocab)
11 ner.from_disk('/home/feru/ner')
12 nlp.add_pipe(ner)
~/.local/lib/python3.8/site-packages/spacy/pipeline/ner.pyx in spacy.pipeline.ner.EntityRecognizer.__init__()
TypeError: __init__() takes at least 2 positional arguments (1 given)
This method of saving and loading a custom component from disk seems to come from an early spaCy version. What is the second argument that EntityRecognizer needs?
The general process you are following of serializing a single component and reloading it is not the recommended way to do this in spaCy. You can do it - it has to be done internally, of course - but you generally want to save and load pipelines using high-level wrappers. In this case this means that you would save like this:
nlp.to_disk("my_model") # NOT ner.to_disk
And then load it with spacy.load("my_model").
You can find more detail about this in the saving and loading docs. Since it seems you're just getting started with spaCy, you might want to go through the course too. It covers the new config-based training in v3, which is much easier than using your own custom training loop like in your code sample.
If you want to mix and match components from different pipelines, you still will generally want to save entire pipelines, and you can then combine components from them using the "sourcing" feature.
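As a rough illustration of sourcing, assuming spaCy v3: nlp.add_pipe accepts a source argument that takes a component from another loaded pipeline. The sketch below uses two blank English pipelines and a rule-based component so that it runs without a trained model. The DISTRICT pattern and the shared vocab are assumptions made for the example. A trained "ner" component from a saved pipeline would be sourced with the same call.

```python
import spacy

# Source pipeline that owns a configured component. A rule-based component
# is used here only so the sketch needs no trained model.
source_nlp = spacy.blank("en")
ruler = source_nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "DISTRICT", "pattern": "Vallila"}])

# Target pipeline that reuses the component instead of deserializing it by
# hand. Sharing the vocab is an assumption made to keep both pipelines consistent.
target_nlp = spacy.blank("en", vocab=source_nlp.vocab)
target_nlp.add_pipe("entity_ruler", source=source_nlp)

doc = target_nlp("I am going to Vallila.")
districts = [(ent.text, ent.label_) for ent in doc.ents]
print(target_nlp.pipe_names, districts)
```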
I've found that spaCy's similarity does a decent job of comparing my documents using "en_core_web_lg" out-of-the-box.
I'd like to tighten up relationships in some areas and thought adding custom NER labels to the model would help, but my results before and after show no improvements, even though I've been able to create a test set of custom entities.
Now I'm wondering, was my theory completely wrong, or could I simply be missing something in my pipeline?
If I was wrong, what's the best approach to improve results? Seems like some sort of custom labeling should help.
Here's an example of what I've tested so far:
import spacy
from spacy.pipeline import EntityRuler
from spacy.tokens import Doc
from spacy.gold import GoldParse
nlp = spacy.load("en_core_web_lg")
docA = nlp("Add fractions with like denominators.")
docB = nlp("What does one-third plus one-third equal?")
sim_before = docA.similarity(docB)
print(sim_before)
0.5949629181460099
^^ Not too shabby, but I'd like to see results closer to 0.85 in this example.
So, I use EntityRuler and add some patterns to try and tighten up the relationships:
ruler = EntityRuler(nlp)
patterns = [
{"label": "ADDITION", "pattern": "Add"},
{"label": "ADDITION", "pattern": "plus"},
{"label": "FRACTION", "pattern": "one-third"},
{"label": "FRACTION", "pattern": "fractions"},
{"label": "FRACTION", "pattern": "denominators"},
]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler, before='ner')
print(nlp.pipe_names)
['tagger', 'parser', 'entity_ruler', 'ner']
Adding GoldParse seems to be important, so I added the following and updated NER:
doc1 = Doc(nlp.vocab, [u'What', u'does', u'one-third', u'plus', u'one-third', u'equal'])
gold1 = GoldParse(doc1, [u'O', u'O', u'U-FRACTION', u'U-ADDITION', u'U-FRACTION', u'O'])
doc2 = Doc(nlp.vocab, [u'Add', u'fractions', u'with', u'like', u'denominators'])
gold2 = GoldParse(doc2, [u'U-ADDITION', u'U-FRACTION', u'O', u'O', u'U-FRACTION'])
ner = nlp.get_pipe("ner")
losses = {}
optimizer = nlp.begin_training()
ner.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
{'ner': 0.0}
You can see my custom entities are working, but the test results show zero improvement:
test1 = nlp("Add fractions with like denominators.")
test2 = nlp("What does one-third plus one-third equal?")
print([(ent.text, ent.label_) for ent in test1.ents])
print([(ent.text, ent.label_) for ent in test2.ents])
sim = test1.similarity(test2)
print(sim)
[('Add', 'ADDITION'), ('fractions', 'FRACTION'), ('denominators', 'FRACTION')]
[('one-third', 'FRACTION'), ('plus', 'ADDITION'), ('one-third', 'FRACTION')]
0.5949629181460099
Any tips would be greatly appreciated!
Doc.similarity only uses the word vectors, not any other annotation. From the Doc API:
The default estimate is cosine similarity using an average of word vectors.
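To make that concrete, here is a small sketch of the computation the quote describes: average the word vectors of each document, then take the cosine of the two averages. The helper names and the two-dimensional toy vectors are made up for illustration and this is not spaCy's actual implementation. Entity labels never enter the calculation, which is why adding them leaves the score unchanged.

```python
import math


def average(vectors):
    """Element-wise mean of a list of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "word vectors", one per token (hypothetical values).
doc_a_vectors = [[1.0, 0.0], [0.0, 1.0]]
doc_b_vectors = [[1.0, 1.0], [3.0, 1.0]]

similarity = cosine(average(doc_a_vectors), average(doc_b_vectors))
print(similarity)
```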
I found my solution was nestled in this tutorial: Text Classification in Python Using spaCy, which generates a BoW matrix for spaCy's text data by using scikit-learn's CountVectorizer.
I avoided sentiment analysis tutorials because they cover binary classification and I need support for multiple categories. The trick was to set multi_class='auto' on the LogisticRegression linear model and to use average='micro' for the precision and recall scores, so that all my text data, such as entities, were leveraged:
classifier = LogisticRegression(solver='lbfgs', multi_class='auto')
and...
print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted,average='micro'))
print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted,average='micro'))
Hope this helps save someone some time!
I'm trying to use spaCy and instantiate the Doc object using the constructor:
words = ["hello", "world", "!"]
spaces = [True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)
but when I do that, if I try to use the dependency parser:
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)
I get the error:
ValueError: [E029] noun_chunks requires the dependency parse, which requires a statistical model to be installed and loaded. For more info, see the documentation:
If I use nlp("Hello world!") instead, that does not happen.
The reason I do this is that I use the entity extraction from a third-party application, and I want to supply my own tokenisation and entities to spaCy.
Something like this:
## Convert tokens
words, spaces = convert_to_spacy2(tokens_)
## Creating a new document with the text
doc = Doc(nlp.vocab, words=words, spaces=spaces)
## Loading entities in the spaCY document
entities = []
for s in myEntities:
    entities.append(Span(doc=doc, start=s['tokenStart'], end=s['tokenEnd'], label=s['type']))
doc.ents = entities
What should I do? Should I load the pipeline myself and apply it to the document, excluding the tokeniser for example?
Thank you in advance
nlp() returns a Doc where the tokenizer and all the pipeline components in nlp.pipeline have been applied to the document.
If you create a Doc by hand, the tokenizer and the pipeline components are not loaded or applied at any point.
After creating a Doc by hand, you can still apply individual pipeline components from a loaded model:
nlp = spacy.load('en_core_web_sm')
nlp.tagger(doc)
nlp.parser(doc)
Then you can add your own entities to the document. (Note that if your tokenizer is very different from the default tokenizer used when training a model, the performance may not be as good.)
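The nlp.tagger and nlp.parser shortcuts above belong to spaCy v2. As far as I know they are not available in v3, where components are reached through nlp.get_pipe or nlp.pipeline. A minimal sketch of the same idea for v3, assuming a blank English pipeline with a single rule-based component (the GREETING pattern is made up) so that it runs without a trained model. With en_core_web_sm loaded, the same loop would apply the tagger and parser to the hand-made Doc.

```python
import spacy
from spacy.tokens import Doc

# Assumption: a blank pipeline with one rule-based component stands in
# for a trained model such as en_core_web_sm.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "GREETING", "pattern": "hello"}])

# Doc built by hand from external tokenisation; no component has run yet.
words = ["hello", "world", "!"]
spaces = [True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Apply each pipeline component in order, skipping the tokenizer.
for name, component in nlp.pipeline:
    doc = component(doc)

ents = [(ent.text, ent.label_) for ent in doc.ents]
print(doc.text, ents)
```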
I want to visualize a sentence using spaCy's named entity visualizer. I have a sentence with some user-defined labels over the tokens, and I want to visualize them using the NER rendering API.
I don't want to train and produce a predictive model. I have all the labels I need from an external source and just need the visualization, without messing too much with front-end libraries.
Any ideas how?
Thank you
You can manually modify the list of entities (doc.ents) and add new spans using token offsets. Be aware that entities can't overlap at all.
import spacy
from spacy.tokens import Span
nlp = spacy.load('en', disable=['ner'])
doc = nlp("I see an XYZ.")
doc.ents = list(doc.ents) + [Span(doc, 3, 4, "NEWENTITYTYPE")]
print(doc.ents[0], doc.ents[0].label_)
Output:
XYZ NEWENTITYTYPE
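For the visualization itself, displacy can also render entities that are supplied directly as data, with no model and no Doc at all. A short sketch, assuming displacy's manual mode (manual=True), which takes character offsets rather than token offsets. The variable names are illustrative.

```python
from spacy import displacy

text = "I see an XYZ."
start = text.index("XYZ")  # character offset where the entity begins

# Entities given as plain data: character offsets plus a label.
example = {
    "text": text,
    "ents": [{"start": start, "end": start + len("XYZ"), "label": "NEWENTITYTYPE"}],
    "title": None,
}

# jupyter=False forces the HTML markup to be returned as a string.
html = displacy.render(example, style="ent", manual=True, jupyter=False)
print(html)
```

The returned markup can be written to an .html file, or the dictionary can be passed to displacy.serve to view it in a browser.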
First of all, I already know how to manually add float or image summaries. I can construct a tf.Summary protobuf manually. But what about text summaries? I looked at the definition of the summary protobuf here, but I didn't find a "string" value option there.
TensorBoard's text plugin offers a pb method that lets you create text summaries outside of a TensorFlow environment.
https://github.com/tensorflow/tensorboard/blob/master/tensorboard/plugins/text/summary.py#L74
Example usage:
import tensorboard as tb
text_summary_proto = tb.summary.pb('fooTag', 'text data')
John Hoffman's answer is great, though the tb.summary.pb API does not seem to be available as of TF 1.x. You can use the following API instead:
tb.summary.text_pb("key", "content of the text data")
Just FYI, tb.summary has many similar methods for other types of summary as well:
'audio', 'audio_pb',
'custom_scalar', 'custom_scalar_pb',
'histogram', 'histogram_pb',
'image', 'image_pb',
'pr_curve', 'pr_curve_pb',
'pr_curve_raw_data_op',
'pr_curve_raw_data_pb',
'pr_curve_streaming_op',
'scalar', 'scalar_pb',
'text', 'text_pb'