Can I do any analysis on spacy display using NER? - pandas

When accessing this display in spacy NER, can you add the found entities - in this case any tweets with GPE or LOC - to a new dataframe or do any further analysis on this topic? I thought once I got them into a list I could use geopy to visualive it possibly, any thoughts?
colors = {'LOC': 'linear-gradient(90deg, ~aa9cde, #dc9ce7)', 'GPE' : 'radial-gradient(white, blue)'}
options = {'ents' : ['LOC', 'GPE'],'colors':colors}
spacy.displacy.render(doc, style='ent',jupyter=True, options=options, )

The entities are accessible on the doc object. If you want to get all the ents in the doc object into a list, simply use, doc.ents. For example:
import spacy
content = "Narendra Modi is the Prime Minister of India"
nlp = spacy.load('en_core_web_md')
doc = nlp(content)
print(doc.ents)
should output:
(Narendra Modi, India)
Say, you want to the text (or mention) of the entity and the label of the entity (say, PERSON, GPE, LOC, NORP, etc.) then you can get them as follows:
print([(ent, ent.label_) for ent in doc.ents])
should output:
[(Narendra Modi, 'PERSON'), (India, 'GPE')]
You should be able to use them in other places as you see fit.

Related

Spacy NER: Extract all Persons before a specific word

I know I can use spacy entity named recognition to extract persons in a text. But I only want to extract the person or personS who are before the word "asked".
Should I use Matcher together with NER? I am new to Spacy so apologies if the question is simple
Desired Output:
Louis Ng
Current Output:
Louis Ng
Lam Pin Min
import spacy
nlp = spacy.load("en_core_web_trf")
doc = nlp (
"Mr Louis Ng asked what kind of additional support can we give to sectors and businesses where the human interaction cannot be mechanised. Mr Lam Pin Min replied that businesses could hire extra workers in such cases."
)
for ent in doc.ents:
# Print the entity text and label
print(ent.text, ent.label_)
You can use a Matcher to find PERSON entities that precede a specific word:
pattern = [{"ENT_TYPE": "PERSON"}, {"ORTH": "asked"}]
Because each dict corresponds to a single token, this pattern would only match the last word of the entity ("Ng"). You could let the first dict match more than one token with {"ENT_TYPE": "PERSON", "OP": "+"}, but this runs the risk of matching two person entities in a row in an example like "Before Ms X spoke to Ms Y Ms Z asked ...".
To be able to match a single entity more easily with a Matcher, you can add the component merge_entities to the end of your pipeline (https://spacy.io/api/pipeline-functions#merge_entities), which merges each entity into a single token. Then this pattern would match "Louis Ng" as one token.

Remove whitespace from spacy doc.ents

I am trying to run a spacy model for NER. I have Doc object and doc.ents shows below output
(august 3, 2021, book building offer, bse, nse)
All the tags have space.Due to this i am receiving below error.
ValueError: [E024] Could not find an optimal move to supervise the parser. Usually, this means that the model can't be updated in a way that's valid and satisfies the correct annotations specified in the GoldParse. For example, are all labels added to the model? If you're training a named entity recognizer, also make sure that none of your annotated entity spans have leading or trailing whitespace or punctuation. You can also use the `debug data` command to validate your JSON-formatted training data
Can anyone suggest how to remove this whitespace?
Found out a problem in rule based labeling, the entity spans had leading or trailing whitespace.
To solve above problem, you can use below function:
def trim_entity_spans(data: list) -> list:
"""Removes leading and trailing white spaces from entity spans.
Args:
data (list): The data to be cleaned in spaCy JSON format.
Returns:
list: The cleaned data.
"""
invalid_span_tokens = re.compile(r'\s')
cleaned_data = []
for text, annotations in data:
entities = annotations['entities']
valid_entities = []
for start, end, label in entities:
valid_start = start
valid_end = end
while valid_start < len(text) and invalid_span_tokens.match(
text[valid_start]):
valid_start += 1
while valid_end > 1 and invalid_span_tokens.match(
text[valid_end - 1]):
valid_end -= 1
valid_entities.append([valid_start, valid_end, label])
cleaned_data.append([text, {'entities': valid_entities}])
return cleaned_data
If you have documents labeled in new .spacy format then you can use below helper function:
new_docs = []
for doc in skweak.utils.docbin_reader("./ipo_v3.spacy",spacy_model_name='en_core_web_trf'):
new_docs.append(doc)
Once you have all the docs, you can use them to convert into JSON format:
doc_examples = []
for i,j in enumerate(new_docs):
spans_ex = [(ent.start_char,ent.end_char,ent.label_) for ent in new_docs[i].ents]
doc_examples.append([new_docs[i].text,{"entities":spans_ex}])
Then you can simply use :
doc_examples = trim_entity_spans(doc_examples)

SpaCy: Set entity information for a token which is included in more than one span

I am trying to use SpaCy for entity context recognition in the world of ontologies. I'm a novice at using SpaCy and just playing around for starters.
I am using the ENVO Ontology as my 'patterns' list for creating a dictionary for entity recognition. In simple terms the data is an ID (CURIE) and the name of the entity it corresponds to along with its category.
Screenshot of my sample data:
The following is the workflow of my initial code:
Creating patterns and terms
# Set terms and patterns
terms = {}
patterns = []
for curie, name, category in envoTerms.to_records(index=False):
if name is not None:
terms[name.lower()] = {'id': curie, 'category': category}
patterns.append(nlp(name))
Setup a custom pipeline
#Language.component('envo_extractor')
def envo_extractor(doc):
matches = matcher(doc)
spans = [Span(doc, start, end, label = 'ENVO') for matchId, start, end in matches]
doc.ents = spans
for i, span in enumerate(spans):
span._.set("has_envo_ids", True)
for token in span:
token._.set("is_envo_term", True)
token._.set("envo_id", terms[span.text.lower()]["id"])
token._.set("category", terms[span.text.lower()]["category"])
return doc
# Setter function for doc level
def has_envo_ids(self, tokens):
return any([t._.get("is_envo_term") for t in tokens])
##EDIT: #################################################################
def resolve_substrings(matcher, doc, i, matches):
# Get the current match and create tuple of entity label, start and end.
# Append entity to the doc's entity. (Don't overwrite doc.ents!)
match_id, start, end = matches[i]
entity = Span(doc, start, end, label="ENVO")
doc.ents += (entity,)
print(entity.text)
#########################################################################
Implement the custom pipeline
nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
#### EDIT: Added 'on_match' rule ################################
matcher.add("ENVO", None, *patterns, on_match=resolve_substrings)
nlp.add_pipe('envo_extractor', after='ner')
and the pipeline looks like this
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7fac00c03bd0>),
('tagger', <spacy.pipeline.tagger.Tagger at 0x7fac0303fcc0>),
('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7fac02fe7460>),
('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7fac02f234c0>),
('envo_extractor', <function __main__.envo_extractor(doc)>),
('attribute_ruler',
<spacy.pipeline.attributeruler.AttributeRuler at 0x7fac0304a940>),
('lemmatizer',
<spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7fac03068c40>)]
Set extensions
# Set extensions to tokens, spans and docs
Token.set_extension('is_envo_term', default=False, force=True)
Token.set_extension("envo_id", default=False, force=True)
Token.set_extension("category", default=False, force=True)
Doc.set_extension("has_envo_ids", getter=has_envo_ids, force=True)
Doc.set_extension("envo_ids", default=[], force=True)
Span.set_extension("has_envo_ids", getter=has_envo_ids, force=True)
Now when I run the text 'tissue culture', it throws me an error:
nlp('tissue culture')
ValueError: [E1010] Unable to set entity information for token 0 which is included in more than one span in entities, blocked, missing or outside.
I know why the error occurred. It is because there are 2 entries for the 'tissue culture' phrase in the ENVO database as shown below:
Ideally I'd expect the appropriate CURIE to be tagged depending on the phrase that was present in the text. How do I address this error?
My SpaCy Info:
============================== Info about spaCy ==============================
spaCy version 3.0.5
Location *irrelevant*
Platform macOS-10.15.7-x86_64-i386-64bit
Python version 3.9.2
Pipelines en_core_web_sm (3.0.0)
It might be a little late nowadays but, complementing Sofie VL's answer a little bit, and to anyone who might be still interested in it, what I (another spaCy newbie, lol) have done to get rid of overlapping spans, goes as follows:
import spacy
from spacy.util import filter_spans
# [Code to obtain 'entity']...
# 'entity' should be a list, i.e.:
# entity = ["Carolina", "North Carolina"]
pat_orig = len(entity)
filtered = filter_spans(ents) # THIS DOES THE TRICK
pat_filt =len(filtered)
doc.ents = filtered
print("\nCONVERSION REPORT:")
print("Original number of patterns:", pat_orig)
print("Number of patterns after overlapping removal:", pat_filt)
Important to mention that I am using the most recent version of spaCy at this date, v3.1.1. Additionally, it will work only if you actually do not mind about overlapping spans being removed, but if you do, then you might want to give this thread a look. More info regarding 'filter_spans' here.
Best regards.
Since spacy v3, you can use doc.spans to store entities that may be overlapping. This functionality is not supported by doc.ents.
So you have two options:
Implement an on_match callback that will filter out the results of the matcher before you use the result to set doc.ents. From a quick glance at your code (and the later edits), I don't think resolve_substrings is actually resolving conflicts? Ideally, the on_match function should check whether there are conflicts with existing ents, and decide which of them to keep.
Use doc.spans instead of doc.ents if that works for your use-case.

Make spacy nlp.pipe process tuples of text and additional information to add as document features?

Apparently for doc in nlp.pipe(sequence) is much faster than running for el in sequence: doc = nlp(el) ..
The problem I have is that my sequence is really a sequence of tuples, which contain the text for spacy to convert into a document, but also additional information which I would like to get into the spacy document as document attributes (which I would register for Doc).
I am not sure how I can modify a spacy pipeline so that the first stage really picks one item from the tuple to run the tokeniser on and get the document, and then have some other function use the remaining items from the tuple to add the features to the existing document.
It sounds like you might be looking for the as_tuples argument of nlp.pipe? If you set as_tuples=True, you can pass in a stream of (text, context) tuples and spaCy will yield (doc, context) tuples (instead of just Doc objects). You can then use the context and add it to custom attributes etc.
Here's an example:
data = [
("Some text to process", {"meta": "foo"}),
("And more text...", {"meta": "bar"})
]
for doc, context in nlp.pipe(data, as_tuples=True):
# Let's assume you have a "meta" extension registered on the Doc
doc._.meta = context["meta"]
A bit late, but in case someone comes looking for this in 2022:
There is no official/documented way to access the context (the second tuple) for the Doc object from within a pipeline. However, the context does get written to an internal doc._context attribute, so we can use this internal attribute to access the context from within our pipelines.
For example:
import spacy
from spacy.language import Language
nlp = spacy.load("en_core_web_sm")
data = [
("stackoverflow is great", {"attr1": "foo", "attr2": "bar"}),
("The sun is shining today", {"location": "Hawaii"})
]
# Set up your custom pipeline. You can access the doc's context from
# within your pipeline, such as {"attr1": "foo", "attr2": "bar"}
#Language.component("my_pipeline")
def my_pipeline(doc):
print(doc._context)
return doc
# Add the pipeline
nlp.add_pipe("my_pipeline")
# Process the data and do something with the doc and/or context
for doc, context in nlp.pipe(data, as_tuples=True):
print(doc)
print(context)
If you are interested in the source code, see the nlp.pipe method and the internal nlp._ensure_doc_with_context methods here: https://github.com/explosion/spaCy/blob/6b83fee58db27cee70ef8d893cbbf7470db4e242/spacy/language.py#L1535

Generate set of Nouns and verbs from n different descriptions, list out descriptions that match a noun and verb

Im new to NLP, i have with columns app name and its description. Data looks like this
app1, description1 (some information of app1, how it works)
app2, description2
.
.
app(n), description(n)
From these descriptions i need to generate a limited set of nouns and verbs. In the final application, when we pair a noun and verb from this list, output should be of list of apps that satisfy that noun+verb.
I dont have any idea where to start, can you please guide me where to start. Thank you.
The task of finding the morpho-syntactic category of words in a sentence is called part-of-speech (or PoS) tagging.
In your case, you probably need also to tokenize your text first.
To do so, you can use nltk, spacy, or the Stanford NLP tagger (among other tools).
Note that depending on the model you use, there can be several labels for nouns (singular nouns, plural nouns, proper nouns) and verbs (depending on the tense and person).
Example with NLTK:
import nltk
description = "This description describes apps with words."
tokenized_description = nltk.word_tokenize(description)
tagged_description = nltk.pos_tag(tokenized_description)
#tagged_description:
# [('This', 'DT'), ('description', 'NN'), ('describes', 'VBZ'), ('apps', 'RP'), ('with', 'IN'), ('words', 'NNS'), ('.', '.')]
# map the tags to a smaller set of tags
universal_tags_description = [(word, nltk.map_tag("wsj", "universal", tag)) for word, tag in tagged_description]
# universal_tags_description:
# [('This', 'DET'), ('description', 'NOUN'), ('describes', 'VERB'), ('apps', 'PRT'), ('with', 'ADP'), ('words', 'NOUN'), ('.', '.')]
filtered = [(word, tag) for word, tag in universal_tags_description if tag in {'NOUN', 'VERB'}]
# filtered:
# [('description', 'NOUN'), ('describes', 'VERB'), ('words', 'NOUN')]