Is there a fast way to get the tokens for each sentence in spaCy?

To split my sentences into tokens I'm doing the following, which is slow:
import spacy

nlp = spacy.load("en_core_web_lg")
text = "This is a test. This is another test"
sentence_tokens = []
doc = nlp(text)
for sent in doc.sents:
    words = nlp(sent.text)
    all = []
    for w in words:
        all.append(w)
    sentence_tokens.append(all)
I'd like to do this the way NLTK handles it: split the text into sentences using sent_tokenize() and then run word_tokenize() on each sentence.

The main problem with your approach is that you're processing everything twice. A sentence in doc.sents is a Span object, i.e. a sequence of Tokens. So there's no need to call nlp on the sentence text again – spaCy already does all of this for you under the hood and the Doc you get back already includes all information you need.
So if you need a list of strings, one for each token, you can do:
sentence_tokens = []
for sent in doc.sents:
    sentence_tokens.append([token.text for token in sent])
Or even shorter:
sentence_tokens = [[token.text for token in sent] for sent in doc.sents]
If you're processing a lot of texts, you probably also want to use nlp.pipe to make it more efficient. This will process the texts in batches and yield Doc objects. You can read more about it here.
texts = ["Some text", "Lots and lots of texts"]
for doc in nlp.pipe(texts):
    sentence_tokens = [[token.text for token in sent] for sent in doc.sents]
    # do something with the tokens

To just do the rule-based tokenization, which is very fast, run:
nlp = spacy.load('en_core_web_sm') # no need for large model
doc = nlp.make_doc(text)
print([token.text for token in doc])
There won't be sentence boundaries, though. For that you still currently need the parser. If you want tokens and sentence boundaries:
nlp = spacy.load("en_core_web_sm", disable=["tagger", "ner"]) # just the parser
doc = nlp(text)
print([token.text for token in doc])
print([sent.text for sent in doc.sents])
If you have a lot of texts, run nlp.tokenizer.pipe(texts) (similar to make_doc()) or nlp.pipe(texts).
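For example, a minimal sketch of tokenizer-only batch processing (the texts list below is just a placeholder):
# Tokenize many texts in one go without running any pipeline components
texts = ["First document here.", "Second document here."]
for doc in nlp.tokenizer.pipe(texts):
    print([token.text for token in doc])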
(Once you've run doc = nlp(text), you don't need to run it again on the sentences within the loop. All the annotation should be there and you'll just be duplicating annotation. That would be particularly slow.)

Related

Remove whitespace from spacy doc.ents

I am trying to run a spaCy model for NER. I have a Doc object and doc.ents shows the output below:
(august 3, 2021, book building offer, bse, nse)
All the tags have whitespace. Because of this I am receiving the error below:
ValueError: [E024] Could not find an optimal move to supervise the parser. Usually, this means that the model can't be updated in a way that's valid and satisfies the correct annotations specified in the GoldParse. For example, are all labels added to the model? If you're training a named entity recognizer, also make sure that none of your annotated entity spans have leading or trailing whitespace or punctuation. You can also use the `debug data` command to validate your JSON-formatted training data
Can anyone suggest how to remove this whitespace?
I found the problem in my rule-based labeling: the entity spans had leading or trailing whitespace.
To solve this, you can use the function below:
import re

def trim_entity_spans(data: list) -> list:
    """Removes leading and trailing white spaces from entity spans.

    Args:
        data (list): The data to be cleaned in spaCy JSON format.

    Returns:
        list: The cleaned data.
    """
    invalid_span_tokens = re.compile(r'\s')

    cleaned_data = []
    for text, annotations in data:
        entities = annotations['entities']
        valid_entities = []
        for start, end, label in entities:
            valid_start = start
            valid_end = end
            while valid_start < len(text) and invalid_span_tokens.match(
                    text[valid_start]):
                valid_start += 1
            while valid_end > 1 and invalid_span_tokens.match(
                    text[valid_end - 1]):
                valid_end -= 1
            valid_entities.append([valid_start, valid_end, label])
        cleaned_data.append([text, {'entities': valid_entities}])
    return cleaned_data
If you have documents labeled in the new .spacy format, you can read them back in with the helper below:
import skweak.utils

new_docs = []
for doc in skweak.utils.docbin_reader("./ipo_v3.spacy", spacy_model_name='en_core_web_trf'):
    new_docs.append(doc)
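Alternatively (a sketch, if you'd rather not depend on skweak), spaCy's own DocBin can read the same file; the vocab can come from a blank or loaded pipeline:
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # or the pipeline the docs were annotated with
doc_bin = DocBin().from_disk("./ipo_v3.spacy")
new_docs = list(doc_bin.get_docs(nlp.vocab))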
Once you have all the docs, you can convert them into the JSON format:
doc_examples = []
for i, j in enumerate(new_docs):
    spans_ex = [(ent.start_char, ent.end_char, ent.label_) for ent in new_docs[i].ents]
    doc_examples.append([new_docs[i].text, {"entities": spans_ex}])
Then you can simply use:
doc_examples = trim_entity_spans(doc_examples)

Can I do any analysis on spacy display using NER?

When accessing this displaCy visualization of spaCy NER output, can you add the found entities - in this case any tweets with GPE or LOC - to a new dataframe, or do any further analysis on them? I thought that once I got them into a list I could possibly use geopy to visualize them. Any thoughts?
colors = {'LOC': 'linear-gradient(90deg, #aa9cde, #dc9ce7)', 'GPE': 'radial-gradient(white, blue)'}
options = {'ents': ['LOC', 'GPE'], 'colors': colors}
spacy.displacy.render(doc, style='ent', jupyter=True, options=options)
The entities are accessible on the doc object. If you want to get all the ents in the doc object into a list, simply use doc.ents. For example:
import spacy
content = "Narendra Modi is the Prime Minister of India"
nlp = spacy.load('en_core_web_md')
doc = nlp(content)
print(doc.ents)
should output:
(Narendra Modi, India)
Say you want the text (or mention) of the entity and the label of the entity (PERSON, GPE, LOC, NORP, etc.); you can get them as follows:
print([(ent, ent.label_) for ent in doc.ents])
should output:
[(Narendra Modi, 'PERSON'), (India, 'GPE')]
You should be able to use them in other places as you see fit.
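To address the dataframe part of the question, here is a minimal sketch (assuming pandas is installed; docs is a hypothetical iterable of processed tweets, e.g. from nlp.pipe) that collects only the GPE and LOC entities:
import pandas as pd

rows = []
for doc in docs:  # docs: hypothetical iterable of Doc objects, e.g. nlp.pipe(tweets)
    for ent in doc.ents:
        if ent.label_ in ("GPE", "LOC"):
            rows.append({"text": doc.text, "entity": ent.text, "label": ent.label_})

entity_df = pd.DataFrame(rows)
print(entity_df.head())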

SpaCy: Set entity information for a token which is included in more than one span

I am trying to use SpaCy for entity context recognition in the world of ontologies. I'm a novice at using SpaCy and just playing around for starters.
I am using the ENVO Ontology as my 'patterns' list for creating a dictionary for entity recognition. In simple terms the data is an ID (CURIE) and the name of the entity it corresponds to along with its category.
Screenshot of my sample data:
The following is the workflow of my initial code:
Creating patterns and terms
# Set terms and patterns
terms = {}
patterns = []
for curie, name, category in envoTerms.to_records(index=False):
    if name is not None:
        terms[name.lower()] = {'id': curie, 'category': category}
        patterns.append(nlp(name))
Set up a custom pipeline
@Language.component('envo_extractor')
def envo_extractor(doc):
    matches = matcher(doc)
    spans = [Span(doc, start, end, label='ENVO') for matchId, start, end in matches]
    doc.ents = spans
    for i, span in enumerate(spans):
        span._.set("has_envo_ids", True)
        for token in span:
            token._.set("is_envo_term", True)
            token._.set("envo_id", terms[span.text.lower()]["id"])
            token._.set("category", terms[span.text.lower()]["category"])
    return doc

# Getter function for doc level
def has_envo_ids(self, tokens):
    return any([t._.get("is_envo_term") for t in tokens])
## EDIT: ################################################################
def resolve_substrings(matcher, doc, i, matches):
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    match_id, start, end = matches[i]
    entity = Span(doc, start, end, label="ENVO")
    doc.ents += (entity,)
    print(entity.text)
#########################################################################
Implement the custom pipeline
nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
#### EDIT: Added 'on_match' rule ################################
matcher.add("ENVO", None, *patterns, on_match=resolve_substrings)
nlp.add_pipe('envo_extractor', after='ner')
and the pipeline looks like this
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7fac00c03bd0>),
('tagger', <spacy.pipeline.tagger.Tagger at 0x7fac0303fcc0>),
('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7fac02fe7460>),
('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7fac02f234c0>),
('envo_extractor', <function __main__.envo_extractor(doc)>),
('attribute_ruler',
<spacy.pipeline.attributeruler.AttributeRuler at 0x7fac0304a940>),
('lemmatizer',
<spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7fac03068c40>)]
Set extensions
# Set extensions to tokens, spans and docs
Token.set_extension('is_envo_term', default=False, force=True)
Token.set_extension("envo_id", default=False, force=True)
Token.set_extension("category", default=False, force=True)
Doc.set_extension("has_envo_ids", getter=has_envo_ids, force=True)
Doc.set_extension("envo_ids", default=[], force=True)
Span.set_extension("has_envo_ids", getter=has_envo_ids, force=True)
Now when I run the text 'tissue culture', it throws me an error:
nlp('tissue culture')
ValueError: [E1010] Unable to set entity information for token 0 which is included in more than one span in entities, blocked, missing or outside.
I know why the error occurred. It is because there are two entries for the 'tissue culture' phrase in the ENVO database.
Ideally I'd expect the appropriate CURIE to be tagged depending on the phrase that was present in the text. How do I address this error?
My SpaCy Info:
============================== Info about spaCy ==============================
spaCy version 3.0.5
Location *irrelevant*
Platform macOS-10.15.7-x86_64-i386-64bit
Python version 3.9.2
Pipelines en_core_web_sm (3.0.0)
It might be a little late by now but, complementing Sofie VL's answer a little bit, and for anyone who might still be interested, what I (another spaCy newbie, lol) did to get rid of overlapping spans goes as follows:
import spacy
from spacy.util import filter_spans

# [Code to obtain 'ents']...
# 'ents' should be a list of Span objects, e.g. overlapping spans
# covering "Carolina" and "North Carolina"
pat_orig = len(ents)
filtered = filter_spans(ents)  # THIS DOES THE TRICK
pat_filt = len(filtered)
doc.ents = filtered

print("\nCONVERSION REPORT:")
print("Original number of patterns:", pat_orig)
print("Number of patterns after overlapping removal:", pat_filt)
It's important to mention that I am using the most recent version of spaCy at this date, v3.1.1. Additionally, it will only work if you actually don't mind overlapping spans being removed; if you do, then you might want to give this thread a look. More info regarding filter_spans here.
Best regards.
Since spacy v3, you can use doc.spans to store entities that may be overlapping. This functionality is not supported by doc.ents.
So you have two options:
Implement an on_match callback that will filter out the results of the matcher before you use the result to set doc.ents. From a quick glance at your code (and the later edits), I don't think resolve_substrings is actually resolving conflicts? Ideally, the on_match function should check whether there are conflicts with existing ents, and decide which of them to keep.
Use doc.spans instead of doc.ents if that works for your use-case.
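A rough sketch of the second option (reusing the matcher from the question; component registration with @Language.component is omitted for brevity):
from spacy.tokens import Span
from spacy.util import filter_spans

def envo_extractor(doc):
    matches = matcher(doc)
    spans = [Span(doc, start, end, label="ENVO") for match_id, start, end in matches]
    # doc.spans allows overlapping spans, unlike doc.ents
    doc.spans["envo"] = spans
    # Keep a non-overlapping subset for doc.ents (longest spans preferred)
    doc.ents = filter_spans(spans)
    return doc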

Make spacy nlp.pipe process tuples of text and additional information to add as document features?

Apparently for doc in nlp.pipe(sequence) is much faster than running doc = nlp(el) for each element of the sequence.
The problem I have is that my sequence is really a sequence of tuples, which contain the text for spacy to convert into a document, but also additional information which I would like to get into the spacy document as document attributes (which I would register for Doc).
I am not sure how I can modify a spacy pipeline so that the first stage really picks one item from the tuple to run the tokeniser on and get the document, and then have some other function use the remaining items from the tuple to add the features to the existing document.
It sounds like you might be looking for the as_tuples argument of nlp.pipe? If you set as_tuples=True, you can pass in a stream of (text, context) tuples and spaCy will yield (doc, context) tuples (instead of just Doc objects). You can then use the context and add it to custom attributes etc.
Here's an example:
data = [
    ("Some text to process", {"meta": "foo"}),
    ("And more text...", {"meta": "bar"})
]

for doc, context in nlp.pipe(data, as_tuples=True):
    # Let's assume you have a "meta" extension registered on the Doc
    doc._.meta = context["meta"]
A bit late, but in case someone comes looking for this in 2022:
There is no official/documented way to access the context (the second element of the tuple) for the Doc object from within a pipeline component. However, the context does get written to an internal doc._context attribute, so we can use this internal attribute to access the context from within our pipelines.
For example:
import spacy
from spacy.language import Language

nlp = spacy.load("en_core_web_sm")

data = [
    ("stackoverflow is great", {"attr1": "foo", "attr2": "bar"}),
    ("The sun is shining today", {"location": "Hawaii"})
]

# Set up your custom pipeline. You can access the doc's context from
# within your pipeline, such as {"attr1": "foo", "attr2": "bar"}
@Language.component("my_pipeline")
def my_pipeline(doc):
    print(doc._context)
    return doc

# Add the pipeline
nlp.add_pipe("my_pipeline")

# Process the data and do something with the doc and/or context
for doc, context in nlp.pipe(data, as_tuples=True):
    print(doc)
    print(context)
If you are interested in the source code, see the nlp.pipe method and the internal nlp._ensure_doc_with_context method here: https://github.com/explosion/spaCy/blob/6b83fee58db27cee70ef8d893cbbf7470db4e242/spacy/language.py#L1535

How to extract only English words from a big text corpus using nltk?

I want to remove all non-dictionary English words from a text corpus. I have removed stopwords, tokenized, and count-vectorized the data. I need to extract only the English words and attach them back to the dataframe.
data['Clean_addr'] = data['Adj_Addr'].apply(lambda x: ' '.join([item.lower() for item in x.split()]))
data['Clean_addr']=data['Clean_addr'].apply(lambda x:"".join([item.lower() for item in x if not item.isdigit()]))
data['Clean_addr']=data['Clean_addr'].apply(lambda x:"".join([item.lower() for item in x if item not in string.punctuation]))
data['Clean_addr'] = data['Clean_addr'].apply(lambda x: ' '.join([item.lower() for item in x.split() if item not in (new_stop_words)]))
cv = CountVectorizer( max_features = 200,analyzer='word')
cv_addr = cv.fit_transform(data.pop('Clean_addr'))
Sample Dump of the File I am using
https://www.dropbox.com/s/allhfdxni0kfyn6/Test.csv?dl=0
After you first tokenize your text corpus, you could instead stem the word tokens:
import nltk
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer(language="english")
SnowballStemmer is the algorithm that performs the stemming; stemming is just the process of breaking a word down into its root. Passing the argument 'english' selects the porter2 stemming algorithm; more precisely, 'english' maps to stem.snowball.EnglishStemmer (the porter2 stemmer is considered better than the original Porter stemmer).
stems = [stemmer.stem(t) for t in tokenized]
Above, I define a list comprehension, which executes as follows: it loops over our tokenized input list tokenized (which could also be any other iterable), calls the .stem method on each tokenized word using the SnowballStemmer instance stemmer, and collects the results, so it is a list that should contain only stemmed English word tokens.
Caveat: the list comprehension could conceivably include identical inflected words from other languages that English descends from, because porter2 would mistakenly treat them as English words.
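Putting those pieces together, a minimal end-to-end sketch (the sample sentence is made up; NLTK's 'punkt' tokenizer data must be downloaded once):
import nltk
from nltk.stem.snowball import SnowballStemmer

nltk.download('punkt')  # tokenizer data, only needed once
stemmer = SnowballStemmer(language="english")

text = "The cats were running quickly across the gardens"
tokenized = nltk.word_tokenize(text)
stems = [stemmer.stem(t) for t in tokenized]
print(stems)  # stems such as 'cat', 'run', 'quick', 'garden'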
Down To The Essence
I had a VERY similar need. Your question appeared in my search. Felt I needed to look further, and I found THIS. I did a bit of modification for my specific needs (only English words from TONS of technical data sheets = no numbers or test standards or values or units, etc.). After much pain with other approaches, the below worked. I hope it can be a good launching point for you and others.
import nltk
from nltk.corpus import stopwords

words = set(nltk.corpus.words.words())
stop_words = stopwords.words('english')

file_name = 'Full path to your file'
with open(file_name, 'r') as f:
    text = f.read()
    text = text.replace('\n', ' ')

new_text = " ".join(w for w in nltk.wordpunct_tokenize(text)
                    if w.lower() in words
                    and w.lower() not in stop_words
                    and len(w.lower()) > 1)
print(new_text)
I used the pyenchant library to do this.
import enchant
import nltk
from tqdm import tqdm

d = enchant.Dict("en_US")

def get_eng_words(data):
    eng = []
    for sample in tqdm(data):
        sentence = ''
        word_tokens = nltk.word_tokenize(sample)
        for word in word_tokens:
            if d.check(word):
                if sentence == '':
                    sentence = sentence + word
                else:
                    sentence = sentence + " " + word
        print(sentence)
        eng.append(sentence)
    return eng
To save it just do this!
import pandas as pd

sentences = get_eng_words(df['column'])
df['column'] = pd.DataFrame(sentences)
Hope it helps anyone!