How to get a 'dobj' in spaCy

In the following tweet, spaCy's dependency parser marks 'market' (NN) as the dobj of 'disrupt' (VB). As these terms are connected, I would like to extract them as one phrase. Is there any way to navigate the parse tree so I can extract the dobj of a word? If I do the following I get 'market' but not 'healthcare market':
from spacy.en import English
from spacy.symbols import nsubj, VERB, dobj

nlp = English()
doc = nlp('Juniper Research: AI start-ups set to disrupt healthcare market, with $800 million to be spent on CAD Systems by 2022')
for possible_subject in doc:
    if possible_subject.dep == dobj:
        print(possible_subject.text)

You can do this with noun chunks, as below:
for np in doc.noun_chunks:
    if np.root.dep == dobj:
        print(np.root.text)  # market
        print(np.text)       # healthcare market
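If the phrase you want isn't captured by a noun chunk, one alternative (a sketch, not the only way) is to walk the object token's subtree and join its tokens:
for token in doc:
    if token.dep == dobj:
        # token.subtree yields the token plus all its syntactic descendants, in document order
        phrase = ''.join(t.text_with_ws for t in token.subtree).strip()
        print(phrase)  # e.g. 'healthcare market' (exact span depends on the parse)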


Is it possible to improve spaCy's similarity results with custom named entities?

I've found that spaCy's similarity does a decent job of comparing my documents using "en_core_web_lg" out-of-the-box.
I'd like to tighten up relationships in some areas and thought adding custom NER labels to the model would help, but my results before and after show no improvement, even though I've been able to create a test set of custom entities.
Now I'm wondering: was my theory completely wrong, or could I simply be missing something in my pipeline?
If I was wrong, what's the best approach to improve results? It seems like some sort of custom labeling should help.
Here's an example of what I've tested so far:
import spacy
from spacy.pipeline import EntityRuler
from spacy.tokens import Doc
from spacy.gold import GoldParse
nlp = spacy.load("en_core_web_lg")
docA = nlp("Add fractions with like denominators.")
docB = nlp("What does one-third plus one-third equal?")
sim_before = docA.similarity(docB)
print(sim_before)
0.5949629181460099
^^ Not too shabby, but I'd like to see results closer to 0.85 in this example.
So, I use EntityRuler and add some patterns to try and tighten up the relationships:
ruler = EntityRuler(nlp)
patterns = [
    {"label": "ADDITION", "pattern": "Add"},
    {"label": "ADDITION", "pattern": "plus"},
    {"label": "FRACTION", "pattern": "one-third"},
    {"label": "FRACTION", "pattern": "fractions"},
    {"label": "FRACTION", "pattern": "denominators"},
]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler, before='ner')
print(nlp.pipe_names)
['tagger', 'parser', 'entity_ruler', 'ner']
Adding GoldParse seems to be important, so I added the following and updated NER:
doc1 = Doc(nlp.vocab, words=[u'What', u'does', u'one-third', u'plus', u'one-third', u'equal'])
# BILUO tags use the letter 'O' (outside), not zero, and should be passed as entities=
gold1 = GoldParse(doc1, entities=[u'O', u'O', u'U-FRACTION', u'U-ADDITION', u'U-FRACTION', u'O'])
doc2 = Doc(nlp.vocab, words=[u'Add', u'fractions', u'with', u'like', u'denominators'])
gold2 = GoldParse(doc2, entities=[u'U-ADDITION', u'U-FRACTION', u'O', u'O', u'U-FRACTION'])
ner = nlp.get_pipe("ner")
losses = {}
optimizer = nlp.begin_training()
ner.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
{'ner': 0.0}
You can see my custom entities are working, but the test results show zero improvement:
test1 = nlp("Add fractions with like denominators.")
test2 = nlp("What does one-third plus one-third equal?")
print([(ent.text, ent.label_) for ent in test1.ents])
print([(ent.text, ent.label_) for ent in test2.ents])
sim = test1.similarity(test2)
print(sim)
[('Add', 'ADDITION'), ('fractions', 'FRACTION'), ('denominators', 'FRACTION')]
[('one-third', 'FRACTION'), ('plus', 'ADDITION'), ('one-third', 'FRACTION')]
0.5949629181460099
Any tips would be greatly appreciated!
Doc.similarity only uses the word vectors, not any other annotation. From the Doc API:
The default estimate is cosine similarity using an average of word vectors.
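You can check this yourself: the document vector is the average of the token vectors, and the similarity score is their cosine, so changing entity annotations never moves the number. A minimal sketch (assumes a model with vectors, e.g. en_core_web_lg):
import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")
doc1 = nlp("Add fractions with like denominators.")
doc2 = nlp("What does one-third plus one-third equal?")

# Doc.vector defaults to the average of the token vectors...
print(np.allclose(doc1.vector, np.mean([t.vector for t in doc1], axis=0)))  # True

# ...and Doc.similarity is the cosine between those averaged vectors
cos = np.dot(doc1.vector, doc2.vector) / (np.linalg.norm(doc1.vector) * np.linalg.norm(doc2.vector))
print(cos, doc1.similarity(doc2))  # effectively the same number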
I found my solution was nestled in this tutorial: Text Classification in Python Using spaCy, which generates a BoW matrix for spaCy's text data using scikit-learn's CountVectorizer.
I avoided sentiment analysis tutorials because they do binary classification, and I need support for multiple categories. The trick was to set multi_class='auto' on the LogisticRegression model, and to use average='micro' for the precision and recall scores, so all my text data, including entities, was leveraged:
classifier = LogisticRegression(solver='lbfgs', multi_class='auto')
and...
print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted,average='micro'))
print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted,average='micro'))
Hope this helps save someone some time!

How to visualize the SpaCy word embedding as scatter plot?

Each word in SpaCy is represented by a vector of length 300. How can I plot these words on a scatter plot to get a visual perspective on how close any 2 words are?
There's a new package called whatlies that does exactly this: https://rasahq.github.io/whatlies/
See a short spacy example: https://spacy.io/universe/project/whatlies
When working with small-to-medium-sized texts, Scattertext is a tool that can be used to discover words with distinguishing features. It also lets you create interactive scatter plots with non-overlapping term labels.
Install via https://pypi.org/project/scattertext/ (pip install scattertext):
import spacy
import scattertext as st

nlp = spacy.load('en')

# Example corpus bundled with scattertext: 2012 US convention speeches
convention_df = st.SampleCorpora.ConventionData2012.get_data()
corpus = st.CorpusFromPandas(convention_df,
                             category_col='party',
                             text_col='text',
                             nlp=nlp).build()
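If you'd rather not pull in another package, a minimal sketch with scikit-learn's PCA and matplotlib also answers the original question; the model name and word list here are just examples (you need a model that ships word vectors, such as en_core_web_md):
import spacy
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

nlp = spacy.load("en_core_web_md")  # assumption: any model with word vectors
words = ["cat", "dog", "horse", "car", "truck", "bicycle"]  # hypothetical word list
vectors = [nlp.vocab[w].vector for w in words]  # 300-d vectors

# Project the 300-d vectors down to 2-d for plotting
coords = PCA(n_components=2).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y))
plt.show()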

Spacy EN Model issue

Need to know the difference between spaCy's en and en_core_web_sm models.
I am trying to do NER with spaCy (for organization names).
Please find below the script I am using:
import spacy

nlp = spacy.load("en_core_web_sm")
text = """But Google is starting from behind. The company made a late push
into hardware, and Apple’s Siri, available on iPhones, and Amazon’s
Alexa software, which runs on its Echo and Dot devices, have clear
leads in consumer adoption."""
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
The above gives me no output.
But when I use the "en" model:
import spacy

nlp = spacy.load("en")
text = """But Google is starting from behind. The company made a late push
into hardware, and Apple’s Siri, available on iPhones, and Amazon’s
Alexa software, which runs on its Echo and Dot devices, have clear
leads in consumer adoption."""
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
it gives the desired output:
Google 4 10 ORG
Apple’s Siri 92 104 ORG
iPhones 119 126 ORG
Amazon 132 138 ORG
Echo and Dot 182 194 ORG
What is going wrong here? Please help.
Can I use the en_core_web_sm model to get the same output as the en model? If so, please advise how; a Python 3 script that takes a pandas DataFrame as input would be appreciated. Thanks.
Each model is a machine learning model trained on a specific corpus (a text 'dataset'). This means each model can tag entities differently, especially because some models were trained on less data than others.
Currently spaCy offers four models for English, as listed at https://spacy.io/models/en/
According to https://github.com/explosion/spacy-models, a model can be downloaded in several distinct ways:
# download best-matching version of specific model for your spaCy installation
python -m spacy download en_core_web_sm
# out-of-the-box: download best-matching default model
python -m spacy download en
Probably, when you downloaded the 'en' model, the best matching default model was not 'en_core_web_sm'.
Also, keep in mind that these models are updated every once in a while, which may have caused you to have two different versions of the same model.
Loading spacy.load('en_core_web_sm') instead of spacy.load('en') should help.
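To see which package the 'en' shortcut actually resolves to on your machine, and which model versions are installed, spaCy v2's CLI can help:
# list installed models, their versions, and shortcut links
python -m spacy validate
# optionally re-point the 'en' shortcut at en_core_web_sm
python -m spacy link en_core_web_sm en --force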
On my system the results are the same in both cases.
Code:
import spacy

nlp = spacy.load("en_core_web_sm")
text = """But Google is starting from behind. The company made a late push
into hardware, and Apple’s Siri, available on iPhones, and Amazon’s
Alexa software, which runs on its Echo and Dot devices, have clear
leads in consumer adoption."""
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

import spacy

nlp = spacy.load("en")
text = """But Google is starting from behind. The company made a late push
into hardware, and Apple’s Siri, available on iPhones, and Amazon’s
Alexa software, which runs on its Echo and Dot devices, have clear
leads in consumer adoption."""
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
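Since the question also asks for a version that takes a pandas DataFrame as input, here is a minimal sketch; the column name 'text' and the example rows are assumptions:
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

df = pd.DataFrame({"text": [
    "But Google is starting from behind.",
    "Amazon’s Alexa software runs on its Echo and Dot devices.",
]})

# nlp.pipe streams the texts through the pipeline, faster than calling nlp() per row
df["orgs"] = [
    [ent.text for ent in doc.ents if ent.label_ == "ORG"]
    for doc in nlp.pipe(df["text"])
]
print(df)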

Using spacy visualizer with custom data

I want to visualize a sentence using spaCy's named entity visualizer. I have a sentence with some user-defined labels over the tokens, and I want to visualize them using the NER rendering API.
I don't want to train and produce a predictive model; I have all the labels I need from an external source, and just want the visualization without messing too much with front-end libraries.
Any ideas how?
Thank you
You can manually modify the list of entities (doc.ents) and add new spans using token offsets. Be aware that entities can't overlap at all.
import spacy
from spacy.tokens import Span

nlp = spacy.load('en', disable=['ner'])  # disable the statistical NER; we set the entities ourselves
doc = nlp("I see an XYZ.")
# Span(doc, start, end, label=...) uses token offsets; end is exclusive
doc.ents = list(doc.ents) + [Span(doc, 3, 4, label="NEWENTITYTYPE")]
print(doc.ents[0], doc.ents[0].label_)
Output:
XYZ NEWENTITYTYPE
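To actually get the visualization, hand the doc to displaCy's entity renderer. Alternatively, displaCy's manual mode accepts a plain dict with character offsets, so you can skip building a Doc entirely; the text and offsets below are just illustrative:
from spacy import displacy

# Render the doc we just labelled (use displacy.serve(...) outside notebooks)
displacy.render(doc, style="ent")

# Manual mode: supply your own text and character-offset entities directly
ex = {
    "text": "I see an XYZ.",
    "ents": [{"start": 9, "end": 12, "label": "NEWENTITYTYPE"}],
}
displacy.render(ex, style="ent", manual=True)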

Merge two CountVectorizers and compute Cosine Similarity

I'm trying to implement a technique described in an information retrieval paper, where documents are decomposed into vectors and their cosine similarity is then computed, much like what is explained here: http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/
In the example, we have:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
documents = (
    "The sky is blue",
    "The sun is bright",
    "The sun in the sky is bright",
    "We can see the shining sun, the bright sun",
)
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
However, from time to time I'll get a new document. Is there a way to calculate the cosine similarity of this new document without recreating the documents tuple and the tfidf_matrix?
Yes, you can:
new_docs = [
    "This is new doc 1",
    "This is new doc 2",
]
# use transform(), which reuses the fitted vocabulary; vectorizers have no predict()
new_tfidf_matrix = tfidf_vectorizer.transform(new_docs)
cosine_similarity(new_tfidf_matrix, tfidf_matrix)
If you think the new docs may contain vocabulary that was not in the training documents, you should consider refitting the vectorizer with tfidf_vectorizer.fit(all_docs) and re-transforming the existing documents as well, because transform() silently drops words it has never seen.
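A small self-contained sketch of the whole flow, including the out-of-vocabulary caveat; the toy documents are made up:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["The sky is blue", "The sun is bright"]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# A new doc can be scored against the existing matrix without refitting...
new_vec = vectorizer.transform(["The bright sun in the sky"])
print(cosine_similarity(new_vec, tfidf_matrix))

# ...but words unseen at fit time ('moon' here) are silently dropped,
# which is why refitting matters when the vocabulary drifts
oov = vectorizer.transform(["moon moon moon"])
print(oov.nnz)  # 0 -- no known terms, so the vector is all zeros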