can we display multiple records in displacy in spacy - spacy

I have been using space for quite some-time and I really liked the displacy
Is there a way where we can serve multiple texts from my data-set in the web-page as-in a small arrow to redirect to next record and mark the entities.
The code I'm using is as follows.
def validate(VAL_DATA):
nlp = spacy.load(args.model + '/nn')
for text, _ in VAL_DATA:
doc = nlp(text)
displacy.serve(doc, style='ent')
for ent in doc.ents:
print("entity: " + ent.label_ +"\t" + "text: " + ent.text)
VAL_DATA is my validation set and it has multiple records in it.
Thanks in advance.

Not sure if I got your question right, but if you want to mark entities found for mulitple docs you can just do the following:
def validate(VAL_DATA):
nlp = spacy.load(args.model + '/nn')
docs = list(nlp.pipe(VAL_DATA))
entities = [doc.ents for doc in docs]
displacy.serve(docs, style="ent")
At least for me it worked just fine.
Might work fo you too?

Related

Can I do any analysis on spacy display using NER?

When accessing this display in spacy NER, can you add the found entities - in this case any tweets with GPE or LOC - to a new dataframe or do any further analysis on this topic? I thought once I got them into a list I could use geopy to visualive it possibly, any thoughts?
colors = {'LOC': 'linear-gradient(90deg, ~aa9cde, #dc9ce7)', 'GPE' : 'radial-gradient(white, blue)'}
options = {'ents' : ['LOC', 'GPE'],'colors':colors}
spacy.displacy.render(doc, style='ent',jupyter=True, options=options, )
The entities are accessible on the doc object. If you want to get all the ents in the doc object into a list, simply use, doc.ents. For example:
import spacy
content = "Narendra Modi is the Prime Minister of India"
nlp = spacy.load('en_core_web_md')
doc = nlp(content)
print(doc.ents)
should output:
(Narendra Modi, India)
Say, you want to the text (or mention) of the entity and the label of the entity (say, PERSON, GPE, LOC, NORP, etc.) then you can get them as follows:
print([(ent, ent.label_) for ent in doc.ents])
should output:
[(Narendra Modi, 'PERSON'), (India, 'GPE')]
You should be able to use them in other places as you see fit.

SpaCy: Set entity information for a token which is included in more than one span

I am trying to use SpaCy for entity context recognition in the world of ontologies. I'm a novice at using SpaCy and just playing around for starters.
I am using the ENVO Ontology as my 'patterns' list for creating a dictionary for entity recognition. In simple terms the data is an ID (CURIE) and the name of the entity it corresponds to along with its category.
Screenshot of my sample data:
The following is the workflow of my initial code:
Creating patterns and terms
# Set terms and patterns
terms = {}
patterns = []
for curie, name, category in envoTerms.to_records(index=False):
if name is not None:
terms[name.lower()] = {'id': curie, 'category': category}
patterns.append(nlp(name))
Setup a custom pipeline
#Language.component('envo_extractor')
def envo_extractor(doc):
matches = matcher(doc)
spans = [Span(doc, start, end, label = 'ENVO') for matchId, start, end in matches]
doc.ents = spans
for i, span in enumerate(spans):
span._.set("has_envo_ids", True)
for token in span:
token._.set("is_envo_term", True)
token._.set("envo_id", terms[span.text.lower()]["id"])
token._.set("category", terms[span.text.lower()]["category"])
return doc
# Setter function for doc level
def has_envo_ids(self, tokens):
return any([t._.get("is_envo_term") for t in tokens])
##EDIT: #################################################################
def resolve_substrings(matcher, doc, i, matches):
# Get the current match and create tuple of entity label, start and end.
# Append entity to the doc's entity. (Don't overwrite doc.ents!)
match_id, start, end = matches[i]
entity = Span(doc, start, end, label="ENVO")
doc.ents += (entity,)
print(entity.text)
#########################################################################
Implement the custom pipeline
nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
#### EDIT: Added 'on_match' rule ################################
matcher.add("ENVO", None, *patterns, on_match=resolve_substrings)
nlp.add_pipe('envo_extractor', after='ner')
and the pipeline looks like this
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7fac00c03bd0>),
('tagger', <spacy.pipeline.tagger.Tagger at 0x7fac0303fcc0>),
('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7fac02fe7460>),
('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7fac02f234c0>),
('envo_extractor', <function __main__.envo_extractor(doc)>),
('attribute_ruler',
<spacy.pipeline.attributeruler.AttributeRuler at 0x7fac0304a940>),
('lemmatizer',
<spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7fac03068c40>)]
Set extensions
# Set extensions to tokens, spans and docs
Token.set_extension('is_envo_term', default=False, force=True)
Token.set_extension("envo_id", default=False, force=True)
Token.set_extension("category", default=False, force=True)
Doc.set_extension("has_envo_ids", getter=has_envo_ids, force=True)
Doc.set_extension("envo_ids", default=[], force=True)
Span.set_extension("has_envo_ids", getter=has_envo_ids, force=True)
Now when I run the text 'tissue culture', it throws me an error:
nlp('tissue culture')
ValueError: [E1010] Unable to set entity information for token 0 which is included in more than one span in entities, blocked, missing or outside.
I know why the error occurred. It is because there are 2 entries for the 'tissue culture' phrase in the ENVO database as shown below:
Ideally I'd expect the appropriate CURIE to be tagged depending on the phrase that was present in the text. How do I address this error?
My SpaCy Info:
============================== Info about spaCy ==============================
spaCy version 3.0.5
Location *irrelevant*
Platform macOS-10.15.7-x86_64-i386-64bit
Python version 3.9.2
Pipelines en_core_web_sm (3.0.0)
It might be a little late nowadays but, complementing Sofie VL's answer a little bit, and to anyone who might be still interested in it, what I (another spaCy newbie, lol) have done to get rid of overlapping spans, goes as follows:
import spacy
from spacy.util import filter_spans
# [Code to obtain 'entity']...
# 'entity' should be a list, i.e.:
# entity = ["Carolina", "North Carolina"]
pat_orig = len(entity)
filtered = filter_spans(ents) # THIS DOES THE TRICK
pat_filt =len(filtered)
doc.ents = filtered
print("\nCONVERSION REPORT:")
print("Original number of patterns:", pat_orig)
print("Number of patterns after overlapping removal:", pat_filt)
Important to mention that I am using the most recent version of spaCy at this date, v3.1.1. Additionally, it will work only if you actually do not mind about overlapping spans being removed, but if you do, then you might want to give this thread a look. More info regarding 'filter_spans' here.
Best regards.
Since spacy v3, you can use doc.spans to store entities that may be overlapping. This functionality is not supported by doc.ents.
So you have two options:
Implement an on_match callback that will filter out the results of the matcher before you use the result to set doc.ents. From a quick glance at your code (and the later edits), I don't think resolve_substrings is actually resolving conflicts? Ideally, the on_match function should check whether there are conflicts with existing ents, and decide which of them to keep.
Use doc.spans instead of doc.ents if that works for your use-case.

Is there a fast way to get the tokens for each sentence in spaCy?

To split my sentence into tokens I'm doing the following whichi is slow
import spacy nlp = spacy.load("en_core_web_lg")
text = "This is a test. This is another test"
sentence_tokens = []
doc = nlp(text)
for sent in doc.sents:
words = nlp(sent.text)
all = []
for w in words:
all.append(w)
sentence_tokens.append(all)
I kind of want to do this the way nltk handles it where you split the text into sentences using sent_tokenize() and then for each sentence run word_tokenize()
The main problem with your approach is that you're processing everything twice. A sentence in doc.sents is a Span object, i.e. a sequence of Tokens. So there's no need to call nlp on the sentence text again – spaCy already does all of this for you under the hood and the Doc you get back already includes all information you need.
So if you need a list of strings, one for each token, you can do:
sentence_tokens = []
for sent in doc.sents:
sentence_tokens.append([token.text for token in sent])
Or even shorter:
sentence_tokens = [[token.text for token in sent] for sent in doc.sents]
If you're processing a lot of texts, you probably also want to use nlp.pipe to make it more efficient. This will process the texts in batches and yield Doc objects. You can read more about it here.
texts = ["Some text", "Lots and lots of texts"]
for doc in nlp.pipe(texts):
sentence_tokens = [[token.text for token in sent] for sent in doc.sents]
# do something with the tokens
To just do the rule-based tokenization, which is very fast, run:
nlp = spacy.load('en_core_web_sm') # no need for large model
doc = nlp.make_doc(text)
print([token.text for token in doc])
There won't be sentence boundaries, though. For that you still currently need the parser. If you want tokens and sentence boundaries:
nlp = spacy.load("en_core_web_sm", disable=["tagger", "ner"]) # just the parser
doc = nlp(text)
print([token.text for token in doc])
print([sent.text for sent in doc.sents])
If you have a lot of texts, run nlp.tokenizer.pipe(texts) (similar to make_doc()) or nlp.pipe(texts).
(Once you've run doc = nlp(text), you don't need to run it again on the sentences within the loop. All the annotation should be there and you'll just be duplicating annotation. That would be particularly slow.)

How to extract only English words from a from big text corpus using nltk?

I am want remove all non dictionary english words from text corpus. I have removed stopwords, tokenized and countvectorized the data. I need extract only the English words and attach them back to the dataframe .
data['Clean_addr'] = data['Adj_Addr'].apply(lambda x: ' '.join([item.lower() for item in x.split()]))
data['Clean_addr']=data['Clean_addr'].apply(lambda x:"".join([item.lower() for item in x if not item.isdigit()]))
data['Clean_addr']=data['Clean_addr'].apply(lambda x:"".join([item.lower() for item in x if item not in string.punctuation]))
data['Clean_addr'] = data['Clean_addr'].apply(lambda x: ' '.join([item.lower() for item in x.split() if item not in (new_stop_words)]))
cv = CountVectorizer( max_features = 200,analyzer='word')
cv_addr = cv.fit_transform(data.pop('Clean_addr'))
Sample Dump of the File I am using
https://www.dropbox.com/s/allhfdxni0kfyn6/Test.csv?dl=0
after you first tokenize your text corpus, you could instead stem the word tokens
import nltk
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer(language="english")
SnowballStemmer
is algorithm executing stemming
stemming is just the process of breaking a word down into its root.
is passed the argument 'english'   ↦   porter2 stemming algorithm
more precisely, this 'english' argument   ↦   stem.snowball.EnglishStemmer
(porter2 stemmer considered to be better than the original porter stemmer)
 
stems = [stemmer.stem(t) for t in tokenized]
Above, I define a list comprehension, which executes as follows:
list comprehension loops over our tokenized input list tokenized
(tokenized can also be any other other iterable input instance)
list comprehension's action is to perform a .stem method on each tokenized word using the SnowballStemmer instance stemmer
list comprehension then collects only the set of English stems
i.e., it is a list that should collect only stemmed English word tokens
 
Caveat:   list comprehension could conceivably include certain identical inflected words in other languages which English decendends from because porter2 would mistakenly think them English words
Down To The Essence
I had a VERY similar need. Your question appeared in my search. Felt I needed to look further, and I found THIS. I did a bit of modification for my specific needs (only English words from TONS of technical data sheets = no numbers or test standards or values or units, etc.). After much pain with other approaches, the below worked. I hope it can be a good launching point for you and others.
import nltk
from nltk.corpus import stopwords
words = set(nltk.corpus.words.words())
stop_words = stopwords.words('english')
file_name = 'Full path to your file'
with open(file_name, 'r') as f:
text = f.read()
text = text.replace('\n', ' ')
new_text = " ".join(w for w in nltk.wordpunct_tokenize(text)
if w.lower() in words
and w.lower() not in stop_words
and len(w.lower()) > 1)
print(new_text)
I used the pyenchant library to do this.
import enchant
d = enchant.Dict("en_US")
def get_eng_words(data):
eng =[]
for sample in tqdm(data):
sentence=''
word_tokens = nltk.word_tokenize(sample)
for word in word_tokens:
if(d.check(word)):
if(sentence ==''):
sentence = sentence + word
else:
sentence = sentence +" "+ word
print(sentence)
eng.append(sentence)
return eng
To save it just do this!
sentences=get_eng_words(df['column'])
df['column']=pd.DataFrame(sentences)
Hope it helps anyone!

Title and description aren't indexed with collective.dexteritytextindexer

I have lots of Dexterity content types, some of them are just containers and are left with just the Title and Description (from plone.app.dexterity.behaviors.metadata.IBasic behavior).
I can find them by searching the text inside their title or description.
But for some complex content types I'm using collective.dexteritytextindexer to index some more fields and it works fine, I can find the text on the fields I marked to be indexed.
However the Title and Description are no longer available for searching. I tried something like:
class IMyContent(form.Schema):
"""My content type description
"""
dexteritytextindexer.searchable('title')
dexteritytextindexer.searchable('description')
dexteritytextindexer.searchable('long_desc')
form.widget(long_desc = WysiwygFieldWidget)
long_desc = schema.Text (
title = _(u"Rich description"),
description = _(u"Complete description"),
required = False,
)
...
But I can't see the content of title and description on the SearchableText column in the portal_catalog, and thus the results don't show them.
Any idea what I'm missing?
Cheers,
Got pretty much the same issue. Following the documentation on http://pypi.python.org/pypi/collective.dexteritytextindexer I used
from collective import dexteritytextindexer
from plone.autoform.interfaces import IFormFieldProvider
from plone.directives import form
from zope import schema
from zope.interface import alsoProvides
class IMyBehavior(form.Schema):
dexteritytextindexer.searchable('specialfield')
specialfield = schema.TextField(title=u'Special field')
alsoProvides(IMyBehavior, IFormFieldProvider)
to get my own fields indexed. However, the code
from plone.app.dexterity.interfaces import IBasic
from collective.dexteritytextindexer.utils import searchable
searchable(IBasic, 'title')
searchable(IBasic, 'description')
Didn't work. The import of IBasic fails. Seems this can easily be solved by importing
from plone.app.dexterity.behaviors.metadata import IBasic
The problem is probably that the field is coming from the IBasic or IDublineCore behaviour and not from your schema. I don't know enough about collective.dexteritytextindexer to know how to work around this, though.
Another option may be to just use plone.indexer and create your own SearchableText indexer that returns "%s %s %s" % (context.title, context.description, context.long_desc,). See the Dexterity docs for details.
As a reference this is the code I ended up writing:
#indexer(IMyDexterityType)
def searchableIndexer(context):
transforms = getToolByName(context, 'portal_transforms')
long_desc = context.long_desc // long_desc is a rich text field
if long_desc is not None:
long_desc = transforms.convert('html_to_text', long_desc).getData()
contacts = context.contacts // contacts is also a rich text field
if contacts is not None:
contacts = transforms.convert('html_to_text', contacts).getData()
return "%s %s %s %s" % (context.title, context.description, long_desc, contacts,)
grok.global_adapter(searchableIndexer, name="SearchableText")