Make spacy nlp.pipe process tuples of text and additional information to add as document features? - spacy

Apparently for doc in nlp.pipe(sequence) is much faster than running for el in sequence: doc = nlp(el) ..
The problem I have is that my sequence is really a sequence of tuples, which contain the text for spacy to convert into a document, but also additional information which I would like to get into the spacy document as document attributes (which I would register for Doc).
I am not sure how I can modify a spacy pipeline so that the first stage really picks one item from the tuple to run the tokeniser on and get the document, and then have some other function use the remaining items from the tuple to add the features to the existing document.

It sounds like you might be looking for the as_tuples argument of nlp.pipe? If you set as_tuples=True, you can pass in a stream of (text, context) tuples and spaCy will yield (doc, context) tuples (instead of just Doc objects). You can then use the context and add it to custom attributes etc.
Here's an example:
data = [
("Some text to process", {"meta": "foo"}),
("And more text...", {"meta": "bar"})
]
for doc, context in nlp.pipe(data, as_tuples=True):
# Let's assume you have a "meta" extension registered on the Doc
doc._.meta = context["meta"]

A bit late, but in case someone comes looking for this in 2022:
There is no official/documented way to access the context (the second tuple) for the Doc object from within a pipeline. However, the context does get written to an internal doc._context attribute, so we can use this internal attribute to access the context from within our pipelines.
For example:
import spacy
from spacy.language import Language
nlp = spacy.load("en_core_web_sm")
data = [
("stackoverflow is great", {"attr1": "foo", "attr2": "bar"}),
("The sun is shining today", {"location": "Hawaii"})
]
# Set up your custom pipeline. You can access the doc's context from
# within your pipeline, such as {"attr1": "foo", "attr2": "bar"}
#Language.component("my_pipeline")
def my_pipeline(doc):
print(doc._context)
return doc
# Add the pipeline
nlp.add_pipe("my_pipeline")
# Process the data and do something with the doc and/or context
for doc, context in nlp.pipe(data, as_tuples=True):
print(doc)
print(context)
If you are interested in the source code, see the nlp.pipe method and the internal nlp._ensure_doc_with_context methods here: https://github.com/explosion/spaCy/blob/6b83fee58db27cee70ef8d893cbbf7470db4e242/spacy/language.py#L1535

Related

Can I do any analysis on spacy display using NER?

When accessing this display in spacy NER, can you add the found entities - in this case any tweets with GPE or LOC - to a new dataframe or do any further analysis on this topic? I thought once I got them into a list I could use geopy to visualive it possibly, any thoughts?
colors = {'LOC': 'linear-gradient(90deg, ~aa9cde, #dc9ce7)', 'GPE' : 'radial-gradient(white, blue)'}
options = {'ents' : ['LOC', 'GPE'],'colors':colors}
spacy.displacy.render(doc, style='ent',jupyter=True, options=options, )
The entities are accessible on the doc object. If you want to get all the ents in the doc object into a list, simply use, doc.ents. For example:
import spacy
content = "Narendra Modi is the Prime Minister of India"
nlp = spacy.load('en_core_web_md')
doc = nlp(content)
print(doc.ents)
should output:
(Narendra Modi, India)
Say, you want to the text (or mention) of the entity and the label of the entity (say, PERSON, GPE, LOC, NORP, etc.) then you can get them as follows:
print([(ent, ent.label_) for ent in doc.ents])
should output:
[(Narendra Modi, 'PERSON'), (India, 'GPE')]
You should be able to use them in other places as you see fit.

SpaCy: Set entity information for a token which is included in more than one span

I am trying to use SpaCy for entity context recognition in the world of ontologies. I'm a novice at using SpaCy and just playing around for starters.
I am using the ENVO Ontology as my 'patterns' list for creating a dictionary for entity recognition. In simple terms the data is an ID (CURIE) and the name of the entity it corresponds to along with its category.
Screenshot of my sample data:
The following is the workflow of my initial code:
Creating patterns and terms
# Set terms and patterns
terms = {}
patterns = []
for curie, name, category in envoTerms.to_records(index=False):
if name is not None:
terms[name.lower()] = {'id': curie, 'category': category}
patterns.append(nlp(name))
Setup a custom pipeline
#Language.component('envo_extractor')
def envo_extractor(doc):
matches = matcher(doc)
spans = [Span(doc, start, end, label = 'ENVO') for matchId, start, end in matches]
doc.ents = spans
for i, span in enumerate(spans):
span._.set("has_envo_ids", True)
for token in span:
token._.set("is_envo_term", True)
token._.set("envo_id", terms[span.text.lower()]["id"])
token._.set("category", terms[span.text.lower()]["category"])
return doc
# Setter function for doc level
def has_envo_ids(self, tokens):
return any([t._.get("is_envo_term") for t in tokens])
##EDIT: #################################################################
def resolve_substrings(matcher, doc, i, matches):
# Get the current match and create tuple of entity label, start and end.
# Append entity to the doc's entity. (Don't overwrite doc.ents!)
match_id, start, end = matches[i]
entity = Span(doc, start, end, label="ENVO")
doc.ents += (entity,)
print(entity.text)
#########################################################################
Implement the custom pipeline
nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
#### EDIT: Added 'on_match' rule ################################
matcher.add("ENVO", None, *patterns, on_match=resolve_substrings)
nlp.add_pipe('envo_extractor', after='ner')
and the pipeline looks like this
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7fac00c03bd0>),
('tagger', <spacy.pipeline.tagger.Tagger at 0x7fac0303fcc0>),
('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7fac02fe7460>),
('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7fac02f234c0>),
('envo_extractor', <function __main__.envo_extractor(doc)>),
('attribute_ruler',
<spacy.pipeline.attributeruler.AttributeRuler at 0x7fac0304a940>),
('lemmatizer',
<spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7fac03068c40>)]
Set extensions
# Set extensions to tokens, spans and docs
Token.set_extension('is_envo_term', default=False, force=True)
Token.set_extension("envo_id", default=False, force=True)
Token.set_extension("category", default=False, force=True)
Doc.set_extension("has_envo_ids", getter=has_envo_ids, force=True)
Doc.set_extension("envo_ids", default=[], force=True)
Span.set_extension("has_envo_ids", getter=has_envo_ids, force=True)
Now when I run the text 'tissue culture', it throws me an error:
nlp('tissue culture')
ValueError: [E1010] Unable to set entity information for token 0 which is included in more than one span in entities, blocked, missing or outside.
I know why the error occurred. It is because there are 2 entries for the 'tissue culture' phrase in the ENVO database as shown below:
Ideally I'd expect the appropriate CURIE to be tagged depending on the phrase that was present in the text. How do I address this error?
My SpaCy Info:
============================== Info about spaCy ==============================
spaCy version 3.0.5
Location *irrelevant*
Platform macOS-10.15.7-x86_64-i386-64bit
Python version 3.9.2
Pipelines en_core_web_sm (3.0.0)
It might be a little late nowadays but, complementing Sofie VL's answer a little bit, and to anyone who might be still interested in it, what I (another spaCy newbie, lol) have done to get rid of overlapping spans, goes as follows:
import spacy
from spacy.util import filter_spans
# [Code to obtain 'entity']...
# 'entity' should be a list, i.e.:
# entity = ["Carolina", "North Carolina"]
pat_orig = len(entity)
filtered = filter_spans(ents) # THIS DOES THE TRICK
pat_filt =len(filtered)
doc.ents = filtered
print("\nCONVERSION REPORT:")
print("Original number of patterns:", pat_orig)
print("Number of patterns after overlapping removal:", pat_filt)
Important to mention that I am using the most recent version of spaCy at this date, v3.1.1. Additionally, it will work only if you actually do not mind about overlapping spans being removed, but if you do, then you might want to give this thread a look. More info regarding 'filter_spans' here.
Best regards.
Since spacy v3, you can use doc.spans to store entities that may be overlapping. This functionality is not supported by doc.ents.
So you have two options:
Implement an on_match callback that will filter out the results of the matcher before you use the result to set doc.ents. From a quick glance at your code (and the later edits), I don't think resolve_substrings is actually resolving conflicts? Ideally, the on_match function should check whether there are conflicts with existing ents, and decide which of them to keep.
Use doc.spans instead of doc.ents if that works for your use-case.

How can I access value in sequence type?

There are the following attributes in client_output
weights_delta = attr.ib()
client_weight = attr.ib()
model_output = attr.ib()
client_loss = attr.ib()
After that, I made the client_output in the form of a sequence through
a = tff.federated_collect(client_output) and round_model_delta = tff.federated_map(selecting_fn,a)in here . and I declared
`
#tff.tf_computation() # append
def selecting_fn(a):
#TODO
return round_model_delta
in here. In the process of averaging on the server, I want to average the weights_delta by selecting some of the clients with a small loss value. So I try to access it via a.weights_delta but it doesn't work.
The tff.federated_collect returns a tff.SequenceType placed at tff.SERVER which you can manipulate the same way as for example client dataset is usually handled in a method decorated by tff.tf_computation.
Note that you have to use the tff.federated_collect operator in the scope of a tff.federated_computation. What you probably want to do[*] is pass it into a tff.tf_computation, using the tff.federated_map operator. Once inside the tff.tf_computation, you can think of it as a tf.data.Dataset object and everything in the tf.data module is available.
[*] I am guessing. More detailed explanation of what you would like to achieve would be helpful.

spacy tokenizer: is there a way to use regex as a key in custom exceptions for update_exc

It is possible to add custom exceptions to spacy tokenizer. And these exceptions work fine.
However, as far as I know, it's possible to use only strings as keys to match for those exceptions. It's done this way:
import spacy
from spacy.tokens import Doc, Span, Token
from spacy.util import update_exc
from spacy.lang.tokenizer_exceptions import BASE_EXCEPTIONS
from spacy.symbols import ORTH, NORM, LEMMA, POS, TAG
CUSTOM_EXCEPTIONS = {
# prevent '3g' to be splitted into ['3', 'g']
"3g": [{ORTH: "3g", LEMMA: "3g"}],
}
spacy.lang.tokenizer_exceptions.BASE_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, CUSTOM_EXCEPTIONS)
Is there a way to add an regexp-keyed exception, to, say, match phone number?
Something like this(highlighted in bold):
CUSTOM_EXCEPTIONS = {
# prevent '3g' to be splitted into ['3', 'g']
"3g": [{ORTH: "3g", LEMMA: "3g"}],
r'([\(]?\+[\(]?\d{2}[\)]?[ ]?\d{2} \d{2} \d{2} \d{2})': [{LEMMA: match_result} for match_result in match_results]
}
The only clue I found is:
https://github.com/explosion/spaCy/issues/840
In that revision of tokenizer_exceptions.py there was some way to use regexps as keys for tokenizer exceptions(however, I haven't found any examples to do so)
But in current revisions, at least initial analysis hasn't shown any ways to do s
So is there a way to solve this task?
(input: regex as a key for exception, output - phone numbers with spaces inside)
No, there's no way to have regular expressions as tokenizer exceptions. The tokenizer only looks for exceptions as exact string matches, mainly for reasons of speed. The other difficulty for this kind of example is that tokenizer exceptions currently can't contain spaces. (Support for spaces is planned for a future version of spacy, but not regexes, which would still be too slow.)
I think the best way to do this would be to add a custom pipeline component at the beginning of the pipeline that retokenizes the document with the retokenizer: https://spacy.io/api/doc#retokenize. You can provide any required attributes like lemmas while retokenizing.

Is there a fast way to get the tokens for each sentence in spaCy?

To split my sentence into tokens I'm doing the following whichi is slow
import spacy nlp = spacy.load("en_core_web_lg")
text = "This is a test. This is another test"
sentence_tokens = []
doc = nlp(text)
for sent in doc.sents:
words = nlp(sent.text)
all = []
for w in words:
all.append(w)
sentence_tokens.append(all)
I kind of want to do this the way nltk handles it where you split the text into sentences using sent_tokenize() and then for each sentence run word_tokenize()
The main problem with your approach is that you're processing everything twice. A sentence in doc.sents is a Span object, i.e. a sequence of Tokens. So there's no need to call nlp on the sentence text again – spaCy already does all of this for you under the hood and the Doc you get back already includes all information you need.
So if you need a list of strings, one for each token, you can do:
sentence_tokens = []
for sent in doc.sents:
sentence_tokens.append([token.text for token in sent])
Or even shorter:
sentence_tokens = [[token.text for token in sent] for sent in doc.sents]
If you're processing a lot of texts, you probably also want to use nlp.pipe to make it more efficient. This will process the texts in batches and yield Doc objects. You can read more about it here.
texts = ["Some text", "Lots and lots of texts"]
for doc in nlp.pipe(texts):
sentence_tokens = [[token.text for token in sent] for sent in doc.sents]
# do something with the tokens
To just do the rule-based tokenization, which is very fast, run:
nlp = spacy.load('en_core_web_sm') # no need for large model
doc = nlp.make_doc(text)
print([token.text for token in doc])
There won't be sentence boundaries, though. For that you still currently need the parser. If you want tokens and sentence boundaries:
nlp = spacy.load("en_core_web_sm", disable=["tagger", "ner"]) # just the parser
doc = nlp(text)
print([token.text for token in doc])
print([sent.text for sent in doc.sents])
If you have a lot of texts, run nlp.tokenizer.pipe(texts) (similar to make_doc()) or nlp.pipe(texts).
(Once you've run doc = nlp(text), you don't need to run it again on the sentences within the loop. All the annotation should be there and you'll just be duplicating annotation. That would be particularly slow.)