spaCy: custom infix regex rule to split on `:` for patterns like mailto:johndoe@gmail.com is not applied consistently - tokenize

With the default tokenizer, spaCy treats mailto:johndoe@gmail.com as one single token.
I tried the following:
nlp = spacy.load('en_core_web_lg')
infixes = nlp.Defaults.infixes + (r'(?<=mailto):(?=\w+)', )
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer
However, the above custom rule doesn't seem to do what I would like in a consistent manner. For example, if I apply the tokenizer to mailto:johndoe@gmail.c, it does what I want:
nlp("mailto:johndoe@gmail.c")
# [mailto, :, johndoe@gmail.c]
However, if I apply the tokenizer to mailto:johndoe@gmail.com, it does not work as intended:
nlp("mailto:johndoe@gmail.com")
# [mailto:johndoe@gmail.com]
I wonder if there is a way to fix this inconsistency?

There's a tokenizer exception pattern for URLs, which matches things like mailto:johndoe@gmail.com as one token. It knows that top-level domains have at least two letters, so it matches gmail.co and gmail.com but not gmail.c.
You can override it by setting:
nlp.tokenizer.token_match = None
Then you should get:
[t.text for t in nlp("mailto:johndoe@gmail.com")]
# ['mailto', ':', 'johndoe@gmail.com']
[t.text for t in nlp("mailto:johndoe@gmail.c")]
# ['mailto', ':', 'johndoe@gmail.c']
If you want the URL tokenization to behave as it does by default except for mailto:, you could modify URL_PATTERN from lang/tokenizer_exceptions.py (also see how TOKEN_MATCH is defined right below it) and use that rather than None.
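For illustration, here is a minimal sketch of that approach (untested; the negative lookahead is my own assumption, and depending on the spaCy version the URL pattern may be wired up via url_match rather than token_match):
import re
import spacy
from spacy.lang.tokenizer_exceptions import URL_PATTERN
from spacy.util import compile_infix_regex

nlp = spacy.load('en_core_web_lg')

# custom infix rule from the question: split "mailto" from ":"
infixes = nlp.Defaults.infixes + (r'(?<=mailto):(?=\w+)', )
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

# keep the default URL behaviour, but stop matching mailto: URIs as a single token
no_mailto_url = r"(?!mailto:)(?:" + URL_PATTERN + r")"
nlp.tokenizer.token_match = re.compile("(?u)" + no_mailto_url).match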

SpaCy: Set entity information for a token which is included in more than one span

I am trying to use SpaCy for entity context recognition in the world of ontologies. I'm a novice at using SpaCy and just playing around for starters.
I am using the ENVO Ontology as my 'patterns' list for creating a dictionary for entity recognition. In simple terms the data is an ID (CURIE) and the name of the entity it corresponds to along with its category.
Screenshot of my sample data:
The following is the workflow of my initial code:
Creating patterns and terms
# Set terms and patterns
terms = {}
patterns = []
for curie, name, category in envoTerms.to_records(index=False):
    if name is not None:
        terms[name.lower()] = {'id': curie, 'category': category}
        patterns.append(nlp(name))
Set up a custom pipeline
@Language.component('envo_extractor')
def envo_extractor(doc):
    matches = matcher(doc)
    spans = [Span(doc, start, end, label='ENVO') for matchId, start, end in matches]
    doc.ents = spans
    for i, span in enumerate(spans):
        span._.set("has_envo_ids", True)
        for token in span:
            token._.set("is_envo_term", True)
            token._.set("envo_id", terms[span.text.lower()]["id"])
            token._.set("category", terms[span.text.lower()]["category"])
    return doc
# Getter function for doc level
def has_envo_ids(self, tokens):
    return any([t._.get("is_envo_term") for t in tokens])
##EDIT: #################################################################
def resolve_substrings(matcher, doc, i, matches):
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entities. (Don't overwrite doc.ents!)
    match_id, start, end = matches[i]
    entity = Span(doc, start, end, label="ENVO")
    doc.ents += (entity,)
    print(entity.text)
#########################################################################
Implement the custom pipeline
nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
#### EDIT: Added 'on_match' rule ################################
matcher.add("ENVO", None, *patterns, on_match=resolve_substrings)
nlp.add_pipe('envo_extractor', after='ner')
and the pipeline looks like this
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7fac00c03bd0>),
('tagger', <spacy.pipeline.tagger.Tagger at 0x7fac0303fcc0>),
('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7fac02fe7460>),
('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7fac02f234c0>),
('envo_extractor', <function __main__.envo_extractor(doc)>),
('attribute_ruler',
<spacy.pipeline.attributeruler.AttributeRuler at 0x7fac0304a940>),
('lemmatizer',
<spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7fac03068c40>)]
Set extensions
# Set extensions to tokens, spans and docs
Token.set_extension('is_envo_term', default=False, force=True)
Token.set_extension("envo_id", default=False, force=True)
Token.set_extension("category", default=False, force=True)
Doc.set_extension("has_envo_ids", getter=has_envo_ids, force=True)
Doc.set_extension("envo_ids", default=[], force=True)
Span.set_extension("has_envo_ids", getter=has_envo_ids, force=True)
Now when I run the text 'tissue culture', it throws me an error:
nlp('tissue culture')
ValueError: [E1010] Unable to set entity information for token 0 which is included in more than one span in entities, blocked, missing or outside.
I know why the error occurred: there are two entries for the 'tissue culture' phrase in the ENVO database (screenshot of the duplicate entries omitted).
Ideally I'd expect the appropriate CURIE to be tagged depending on the phrase that was present in the text. How do I address this error?
My SpaCy Info:
============================== Info about spaCy ==============================
spaCy version 3.0.5
Location *irrelevant*
Platform macOS-10.15.7-x86_64-i386-64bit
Python version 3.9.2
Pipelines en_core_web_sm (3.0.0)
It might be a little late by now, but to complement Sofie VL's answer a little bit, and for anyone who might still be interested, here is what I (another spaCy newbie, lol) have done to get rid of overlapping spans:
import spacy
from spacy.util import filter_spans
# [Code to obtain 'ents']...
# 'ents' should be a list of Span objects, some of which may overlap, e.g.
# spans whose texts are "Carolina" and "North Carolina"
pat_orig = len(ents)
filtered = filter_spans(ents)  # THIS DOES THE TRICK
pat_filt = len(filtered)
doc.ents = filtered
print("\nCONVERSION REPORT:")
print("Original number of patterns:", pat_orig)
print("Number of patterns after overlapping removal:", pat_filt)
It is important to mention that I am using the most recent version of spaCy as of this date, v3.1.1. Additionally, this will only work if you do not mind overlapping spans being removed; if you do mind, you might want to give this thread a look. More info regarding filter_spans is in the spaCy API documentation.
Best regards.
Since spacy v3, you can use doc.spans to store entities that may be overlapping. This functionality is not supported by doc.ents.
So you have two options:
Implement an on_match callback that will filter out the results of the matcher before you use the result to set doc.ents. From a quick glance at your code (and the later edits), I don't think resolve_substrings is actually resolving conflicts? Ideally, the on_match function should check whether there are conflicts with existing ents, and decide which of them to keep.
Use doc.spans instead of doc.ents if that works for your use-case.
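For illustration, here is a minimal sketch of the second option, rewriting the component from the question (the "envo" span-group key and the filter_spans call for doc.ents are my assumptions, not part of the answer, and it reuses the matcher defined earlier):
from spacy.language import Language
from spacy.tokens import Span
from spacy.util import filter_spans

@Language.component('envo_extractor')
def envo_extractor(doc):
    matches = matcher(doc)
    spans = [Span(doc, start, end, label="ENVO") for match_id, start, end in matches]
    # doc.spans accepts overlapping spans, so E1010 cannot be raised here
    doc.spans["envo"] = spans
    # if doc.ents is still wanted, keep only a non-overlapping subset of the spans
    doc.ents = filter_spans(spans)
    return doc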

spacy tokenizer: is there a way to use regex as a key in custom exceptions for update_exc

It is possible to add custom exceptions to the spaCy tokenizer, and these exceptions work fine.
However, as far as I know, only plain strings can be used as keys for those exceptions. It's done this way:
import spacy
from spacy.tokens import Doc, Span, Token
from spacy.util import update_exc
from spacy.lang.tokenizer_exceptions import BASE_EXCEPTIONS
from spacy.symbols import ORTH, NORM, LEMMA, POS, TAG
CUSTOM_EXCEPTIONS = {
    # prevent '3g' from being split into ['3', 'g']
    "3g": [{ORTH: "3g", LEMMA: "3g"}],
}
spacy.lang.tokenizer_exceptions.BASE_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, CUSTOM_EXCEPTIONS)
Is there a way to add a regexp-keyed exception to, say, match phone numbers?
Something like this:
CUSTOM_EXCEPTIONS = {
    # prevent '3g' from being split into ['3', 'g']
    "3g": [{ORTH: "3g", LEMMA: "3g"}],
    r'([\(]?\+[\(]?\d{2}[\)]?[ ]?\d{2} \d{2} \d{2} \d{2})': [{LEMMA: match_result} for match_result in match_results],
}
The only clue I found is:
https://github.com/explosion/spaCy/issues/840
In that revision of tokenizer_exceptions.py there was some way to use regexps as keys for tokenizer exceptions (however, I haven't found any examples of how to do so).
But in current revisions, at least on initial analysis, I haven't found any way to do it.
So is there a way to solve this task?
(Input: a regex as the exception key; output: phone numbers with spaces inside kept as single tokens.)
No, there's no way to have regular expressions as tokenizer exceptions. The tokenizer only looks for exceptions as exact string matches, mainly for reasons of speed. The other difficulty for this kind of example is that tokenizer exceptions currently can't contain spaces. (Support for spaces is planned for a future version of spacy, but not regexes, which would still be too slow.)
I think the best way to do this would be to add a custom pipeline component at the beginning of the pipeline that retokenizes the document with the retokenizer: https://spacy.io/api/doc#retokenize. You can provide any required attributes like lemmas while retokenizing.
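For illustration, a minimal sketch of such a component (the phone-number regex and the component name are made up for this example, and spaCy v3-style registration is assumed):
import re
import spacy
from spacy.language import Language

PHONE_RE = re.compile(r"\(?\+\d{2}\)?(?: \d{2}){4}")

@Language.component("merge_phone_numbers")
def merge_phone_numbers(doc):
    with doc.retokenize() as retokenizer:
        for match in PHONE_RE.finditer(doc.text):
            span = doc.char_span(match.start(), match.end())
            if span is not None:  # skip matches that don't line up with token boundaries
                retokenizer.merge(span)
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("merge_phone_numbers", first=True)
print([t.text for t in nlp("Call +33 12 34 56 78 tomorrow")])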

Many inputs to one output, access wildcards in input files

Apologies if this is a straightforward question, I couldn't find anything in the docs.
Currently my workflow looks something like the following: I'm taking a number of input files created as part of this workflow and summarizing them.
Is there a way to avoid the manual regex step to parse the wildcards out of the filenames?
I thought about an "expand" of cross_ids and config["chromosomes"], but I'm unsure how to guarantee a consistent order.
rule report:
    output:
        table="output/mendel_errors.txt"
    input:
        files=expand("output/{chrom}/{cross}.in", chrom=config["chromosomes"], cross=cross_ids)
    params:
        req="h_vmem=4G",
    run:
        df = pd.DataFrame(index=range(len(input.files)), columns=["stat", "chrom", "cross"])
        for i, fn in enumerate(input.files):
            # open fn / make calculations etc // stat =
            # manual regex of filename to get chrom, cross // chrom, cross =
            df.loc[i] = stat, chrom, cross
This seems a bit awkward when this information must be in the environment somewhere.
(via Johannes Köster on the google group)
To answer your question:
Expand uses itertools.product from the standard library. Hence, you could write:
from itertools import product
product(config["chromosomes"], cross_ids)
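For illustration, a sketch of how the run block of the rule above could use this, assuming (as the answer implies) that expand enumerates the files in the same product order:
# at the top of the Snakefile
from itertools import product

# inside rule report
    run:
        combos = list(product(config["chromosomes"], cross_ids))
        df = pd.DataFrame(index=range(len(combos)), columns=["stat", "chrom", "cross"])
        for i, (fn, (chrom, cross)) in enumerate(zip(input.files, combos)):
            # open fn / make calculations etc // stat =
            df.loc[i] = stat, chrom, cross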

How to add a url suffix before performing a callback in scrapy

I have a crawler that works just fine for collecting the URLs I am interested in. However, before retrieving the content of these URLs (i.e. the ones that satisfy rule no. 3), I would like to update them, i.e. add a suffix - say '/fullspecs' - on the right-hand side. That means that, in fact, I would like to retrieve and further process - through the callback function - only the updated ones. How can I do that?
rules = (
    Rule(LinkExtractor(allow=('something1'))),
    Rule(LinkExtractor(allow=('something2'))),
    Rule(LinkExtractor(allow=('something3'), deny=('something4', 'something5')), callback='parse_archive'),
)
You can set the LinkExtractor's process_value parameter to lambda x: x + '/fullspecs', or to a function if you want to do something more complex.
You'd end up with:
Rule(LinkExtractor(allow=('something3'), deny=('something4', 'something5'),
                   process_value=lambda x: x + '/fullspecs'),
     callback='parse_archive')
See more at: http://doc.scrapy.org/en/latest/topics/link-extractors.html#basesgmllinkextractor
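If the transformation needs more logic than a lambda, a plain named function works the same way; here is a hypothetical sketch (the guard against adding the suffix twice is just an example):
def add_fullspecs(value):
    # leave links alone if they already point at the full specs page
    if value.endswith('/fullspecs'):
        return value
    return value + '/fullspecs'

Rule(LinkExtractor(allow=('something3',), deny=('something4', 'something5'),
                   process_value=add_fullspecs),
     callback='parse_archive')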

Matplotlib: how to set dashes in a dictionary?

Can anybody tell me how to use a custom dash sequence in a dictionary? I cannot get it running, and the only thing I cannot work with (not being a programmer) is the documentation =-(
def lineCycler():  # must be invoked again for every plot to get the same results in every plot
    #hasy="#7b9aae"
    _styles = [{'color':'#b21a6a', 'ls':'-'},
               {'color':'#65a4cb', 'ls':'[5,2,10,5]'},  # this should be some custom dash sequence
               {'color':'#22b27c', 'ls':'-.'},
               {'color':'k', 'ls':'--'}
              ]
    _linecycler = cycle(_styles)
    return _linecycler
Use the dashes keyword for that (and you need a list instead of a string):
def lineCycler():
    _styles = [{'color':'#b21a6a', 'ls':'-'},
               {'color':'#65a4cb', 'dashes':[5,2,10,5]},
               {'color':'#22b27c', 'ls':'-.'},
               {'color':'k', 'ls':'--'}
              ]
    _linecycler = cycle(_styles)
    return _linecycler
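A short usage sketch, assuming the style dictionaries are meant to be unpacked into plot() as keyword arguments:
import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle

x = np.linspace(0, 10, 200)
styles = lineCycler()
for shift in range(4):
    # the second style yields dashes=[5, 2, 10, 5]: 5pt on, 2pt off, 10pt on, 5pt off
    plt.plot(x, np.sin(x + shift), **next(styles))
plt.show()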