spaCy EN model issue - spacy

I need to know the difference between spaCy's "en" and "en_core_web_sm" models. I am trying to do NER with spaCy (for organization names).
Please find below the script I am using:
import spacy

nlp = spacy.load("en_core_web_sm")
text = "But Google is starting from behind. The company made a late push \
into hardware, and Apple’s Siri, available on iPhones, and Amazon’s \
Alexa software, which runs on its Echo and Dot devices, have clear \
leads in consumer adoption."
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
The above gives me no output.
But when I use the "en" model:
import spacy

nlp = spacy.load("en")
text = "But Google is starting from behind. The company made a late push \
into hardware, and Apple’s Siri, available on iPhones, and Amazon’s \
Alexa software, which runs on its Echo and Dot devices, have clear \
leads in consumer adoption."
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

it gives me the desired output:
Google 4 10 ORG
Apple’s Siri 92 104 ORG
iPhones 119 126 ORG
Amazon 132 138 ORG
Echo and Dot 182 194 ORG
What is going wrong here? Please help.
Can I use the en_core_web_sm model to get the same output as the en model? If so, please advise how to do it. A Python 3 script that takes a pandas DataFrame as input would be appreciated. Thanks.

Each model is a machine learning model trained on a specific corpus (a text 'dataset'). This means each model can tag entities differently, especially because some models were trained on less data than others.
spaCy currently offers four models for English, as presented at https://spacy.io/models/en/
According to https://github.com/explosion/spacy-models, a model can be downloaded in several distinct ways:
# download best-matching version of specific model for your spaCy installation
python -m spacy download en_core_web_sm
# out-of-the-box: download best-matching default model
python -m spacy download en
Probably, when you downloaded the 'en' model, the best matching default model was not 'en_core_web_sm'.
Also, keep in mind that these models are updated every once in a while, which may have caused you to have two different versions of the same model.
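To check which installed package and version each name actually resolves to (in spaCy v2, where the "en" shortcut link still works), you can inspect the loaded pipeline's metadata. A quick diagnostic sketch:
import spacy

# Compare what the shortcut and the explicit package actually load;
# nlp.meta holds the package name and version of the loaded model.
for name in ("en", "en_core_web_sm"):
    nlp = spacy.load(name)
    print(name, "->", nlp.meta["name"], nlp.meta["version"])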

Using spacy.load('en_core_web_sm') instead of spacy.load('en') should help.
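Since a Python 3 script with a pandas DataFrame as input was requested, here is a minimal sketch; the column name "text" and the example rows are illustrative:
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

df = pd.DataFrame({"text": [
    "But Google is starting from behind.",
    "Apple’s Siri is available on iPhones.",
]})

def extract_orgs(text):
    # collect every span the NER component labels as ORG
    return [ent.text for ent in nlp(text).ents if ent.label_ == "ORG"]

df["orgs"] = df["text"].apply(extract_orgs)
print(df)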

On my system the results are the same in both cases.
Code:
import spacy

nlp = spacy.load("en_core_web_sm")
text = """But Google is starting from behind. The company made a late push
into hardware, and Apple’s Siri, available on iPhones, and Amazon’s
Alexa software, which runs on its Echo and Dot devices, have clear
leads in consumer adoption."""
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

import spacy

nlp = spacy.load("en")
text = """But Google is starting from behind. The company made a late push
into hardware, and Apple’s Siri, available on iPhones, and Amazon’s
Alexa software, which runs on its Echo and Dot devices, have clear
leads in consumer adoption."""
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Related

Spacy entity linking: wiki dataset not connected

This relates to the spaCy entity linking library: https://github.com/egerber/spaCy-entity-linker
When I use the following code:
# python -m spacy_entity_linker "download_knowledge_base"
import spacy

nlp = spacy.load("en_core_web_md")
nlp.add_pipe("entity_linker", last=True)
doc = nlp("I watched the Pirates of the Caribbean last silvester")
all_linked_entities = doc._.linkedEntities
for sent in doc.sents:
    sent._.linkedEntities.pretty_print()
I get: ValueError: [E139] Knowledge base for component 'entity_linker' is empty. Use the methods kb.add_entity and kb.add_alias to add entries.
I might need to add the downloaded knowledge base somewhere but it is nowhere stated.
The original code states that add_pipe should be called with a different name (the entity linker is registered under a different string):
nlp.add_pipe("entityLinker", last=True)
But then I get the error:
ValueError: [E002] Can't find factory for 'entityLinker' for language English (en). This usually happens when spaCy calls nlp.create_pipe with a custom component name that's not registered on the current language class
Where are things going wrong?
I installed the correct libraries:
# pip install spacy-entity-linker
# python -m spacy_entity_linker "download_knowledge_base"
I also have spaCy, the language model, and Python 3.8.
The spaCy entity linker library assumes that you have spaCy itself configured. As described here, you need to install a language core:
python -m spacy download en_core_web_md
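Putting both downloads together, a sketch of the intended usage, with the pipe name spelled "entityLinker" as in the library's README:
# pip install spacy-entity-linker
# python -m spacy download en_core_web_md
# python -m spacy_entity_linker "download_knowledge_base"
import spacy

nlp = spacy.load("en_core_web_md")
nlp.add_pipe("entityLinker", last=True)  # the factory registered by spacy-entity-linker

doc = nlp("I watched the Pirates of the Caribbean last silvester")
for sent in doc.sents:
    sent._.linkedEntities.pretty_print()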

How to load customized NER model from disk with SpaCy?

I have customized a NER pipeline with the following procedure:
doc = nlp("I am going to Vallila. I am going to Sörnäinen.")
for ent in doc.ents:
    print(ent.text, ent.label_)

LABEL = 'DISTRICT'
TRAIN_DATA = [
    (
        'We need to deliver it to Vallila', {
            'entities': [(25, 32, 'DISTRICT')]
        }),
    (
        'We need to deliver it to somewhere', {
            'entities': []
        }),
]

ner = nlp.get_pipe("ner")
ner.add_label(LABEL)

nlp.disable_pipes("tagger")
nlp.disable_pipes("parser")
nlp.disable_pipes("attribute_ruler")
nlp.disable_pipes("lemmatizer")
nlp.disable_pipes("tok2vec")
optimizer = nlp.get_pipe("ner").create_optimizer()

import random
from spacy.training import Example

for i in range(25):
    random.shuffle(TRAIN_DATA)
    for text, annotation in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotation)
        nlp.update([example], sgd=optimizer)
I tried to save that customized NER to disk and load it again with the following code:
ner.to_disk('/home/feru/ner')
import spacy
from spacy.pipeline import EntityRecognizer
nlp = spacy.load("en_core_web_lg", disable=['ner'])
ner = EntityRecognizer(nlp.vocab)
ner.from_disk('/home/feru/ner')
nlp.add_pipe(ner)
However, I got the following error:
---> 10 ner = EntityRecognizer(nlp.vocab)
11 ner.from_disk('/home/feru/ner')
12 nlp.add_pipe(ner)
~/.local/lib/python3.8/site-packages/spacy/pipeline/ner.pyx in
spacy.pipeline.ner.EntityRecognizer.__init__()
TypeError: __init__() takes at least 2 positional arguments (1 given)
This method of saving and loading a custom component from disk seems to be from some early spaCy version. What's the second argument EntityRecognizer needs?
The general process you are following, serializing a single component and reloading it, is not the recommended way to do this in spaCy. You can do it (it has to be done internally, of course), but you generally want to save and load pipelines using the high-level wrappers. In this case that means you would save like this:
nlp.to_disk("my_model") # NOT ner.to_disk
And then load it with spacy.load("my_model").
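A minimal end-to-end sketch of that save/load cycle (the path is illustrative):
import spacy

# ... after running your training loop on nlp ...
nlp.to_disk("/home/feru/my_model")      # serializes the whole pipeline

# later, in a fresh session:
nlp2 = spacy.load("/home/feru/my_model")
doc = nlp2("We need to deliver it to Vallila")
print([(ent.text, ent.label_) for ent in doc.ents])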
You can find more detail about this in the saving and loading docs. Since it seems you're just getting started with spaCy, you might want to go through the course too. It covers the new config-based training in v3, which is much easier than using your own custom training loop like in your code sample.
If you want to mix and match components from different pipelines, you still will generally want to save entire pipelines, and you can then combine components from them using the "sourcing" feature.
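For example, a sketch of sourcing the trained NER component into another pipeline (paths and package names as used earlier in this thread):
import spacy

source_nlp = spacy.load("my_model")                  # pipeline with the trained NER
nlp = spacy.load("en_core_web_lg", exclude=["ner"])
# copy the component, including its trained weights, from the source pipeline
nlp.add_pipe("ner", source=source_nlp)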

Using LaBSE deployed to Google Cloud AI Platform

I deployed the LaBSE model to AI Platform over the past few days.
The issue I encounter is that the response to a request exceeds the 2 MB limit.
Several ideas I had to improve the situation:
1) make AI Platform return minified (not beautifully formatted) JSON, without spaces and newlines everywhere
2) make AI Platform return the results in a binary format
3) since the answer is composed of ~13 outputs: change it to only one output
Do you know any way of doing 1) or 2)?
I spent a lot of effort on 3). I'm sure this one is possible, for example by editing the network before uploading it. Here is what I tried so far for that:
VERSION = 'v1'
MODEL = 'labse_2_b'
MODEL_DIR = BUCKET + '/' + MODEL

# Download the model
! wget 'https://tfhub.dev/google/LaBSE/2?tf-hub-format=compressed' \
    --output-document='{MODEL}.tar.gz'
! mkdir {MODEL}
! tar -xzvf '{MODEL}.tar.gz' -C {MODEL}

# Attempts to load the model, edit it, and save it:
model.save(export_path, save_format='tf')
# ValueError: Model <keras.engine.sequential.Sequential object at 0x7f87e833c650>
# cannot be saved because the input shapes have not been set.
# Usually, input shapes are automatically determined from calling
# `.fit()` or `.predict()`.
# To manually set the shapes, call `model.build(input_shape)`.
model.build(input_shape=(None,))  # cannot find a proper shape

# Create an AI Platform model version:
! gsutil -m cp -r '{MODEL}' {MODEL_DIR}  # upload model to Google Cloud Storage
! gcloud ai-platform versions create $VERSION \
    --model {MODEL} \
    --origin {MODEL_DIR} \
    --runtime-version=2.1 \
    --framework='tensorflow' \
    --python-version=3.7 \
    --region="{REGION}"
Could someone please help with that?
Thanks a lot in advance.
EDIT:
For those wondering about this limitation, as raised in the comments below, here are some complementary pieces of information:
A short sentence such as
"I wish you a pleasant flight and a good meal aboard this plane."
which is just 16 wordpieces long:
[101, 146, 34450, 15100, 170, 147508, 48088, 14999, 170, 17072, 66369, 351617, 15272, 69746, 119, 102]
cannot be processed:
"Response size too large. Received at least 3220082 bytes; max is 2000000." Details: "Response size too large. Received at least 3220082 bytes; max is 2000000."
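A sketch of one way idea 3 might be approached: wrap the TF Hub layer in a functional Keras model that exposes only the pooled sentence embedding, then save that. This is an untested sketch; the "default" output key and the exact input signature are assumptions based on LaBSE/2's SavedModel interface:
import tensorflow as tf
import tensorflow_hub as hub

# Untested sketch: rebuild LaBSE as a functional Keras model that returns
# only the pooled sentence embedding, so the SavedModel has a single output.
# The "default" output key and this input signature are assumptions.
encoder = hub.KerasLayer("https://tfhub.dev/google/LaBSE/2")

input_word_ids = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name="input_word_ids")
input_mask = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name="input_mask")
input_type_ids = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name="input_type_ids")

outputs = encoder(dict(input_word_ids=input_word_ids,
                       input_mask=input_mask,
                       input_type_ids=input_type_ids))
pooled = outputs["default"]  # keep only the sentence embedding

single_output_model = tf.keras.Model(
    inputs=[input_word_ids, input_mask, input_type_ids],
    outputs=pooled)

# Because the inputs are declared explicitly, the shapes are set and
# save() should no longer raise the "input shapes have not been set" error.
single_output_model.save("labse_single_output", save_format="tf")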

Using spacy visualizer with custom data

I want to visualize a sentence using spaCy's named entity visualizer. I have a sentence with some user-defined labels over the tokens, and I want to visualize them using the NER rendering API.
I don't want to train and produce a predictive model; I already have all the needed labels from an external source, and just need the visualization without messing too much with front-end libraries.
Any ideas how?
Thank you
You can manually modify the list of entities (doc.ents) and add new spans using token offsets. Be aware that entities can't overlap at all.
import spacy
from spacy.tokens import Span
nlp = spacy.load('en', disable=['ner'])
doc = nlp("I see an XYZ.")
doc.ents = list(doc.ents) + [Span(doc, 3, 4, "NEWENTITYTYPE")]
print(doc.ents[0], doc.ents[0].label_)
Output:
XYZ NEWENTITYTYPE
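To actually render those custom entities, pass the doc to displacy; a continuation of the snippet above:
from spacy import displacy

# In a Jupyter notebook, displacy.render(...) displays inline; from a
# script, displacy.serve(...) starts a small local server with the view.
displacy.serve(doc, style="ent")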

How to get a 'dobj' in spacy

In the following tweet, spaCy's dependency tagger states that healthcare market (NN) is a dobj of disrupt (VB). As these two terms are connected, I would like to extract them as one phrase. Is there any way to navigate the parse tree so I can extract the dobj of a word? If I do the following I get 'market' but not 'healthcare market':
from spacy.en import English
from spacy.symbols import nsubj, VERB, dobj

nlp = English()
doc = nlp('Juniper Research: AI start-ups set to disrupt healthcare market, with $800 million to be spent on CAD Systems by 2022')
for possible_subject in doc:
    if possible_subject.dep == dobj:
        print(possible_subject.text)
You can do this as below using noun chunks:
for np in doc.noun_chunks:
    if np.root.dep == dobj:
        print(np.root.text)  # the head token: 'market'
        print(np.text)       # the full chunk: 'healthcare market'
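If the noun chunk is not enough and you want everything syntactically attached under the object token, you can walk its subtree; a sketch using the current spacy.load API rather than the old spacy.en import:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Juniper Research: AI start-ups set to disrupt healthcare market, "
          "with $800 million to be spent on CAD Systems by 2022")

for token in doc:
    if token.dep_ == "dobj":
        # token.subtree yields the token and all of its syntactic descendants
        print(" ".join(t.text for t in token.subtree))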